{"title": "A Reinforcement Learning Variant for Control Scheduling", "book": "Advances in Neural Information Processing Systems", "page_first": 479, "page_last": 485, "abstract": null, "full_text": "A Reinforcement Learning Variant for Control Scheduling \n\nAloke Guha \nHoneywell Sensor and System Development Center \n3660 Technology Drive \nMinneapolis, MN 55417 \n\nAbstract \n\nWe present an algorithm based on reinforcement and state recurrence \nlearning techniques to solve control scheduling problems. In particular, we \nhave devised a simple learning scheme called \"handicapped learning\", in \nwhich the weights of the associative search element are reinforced, either \npositively or negatively, such that the system is forced to move towards the \ndesired setpoint along the shortest possible trajectory. To improve the learning \nrate, a variable reinforcement scheme is employed: negative reinforcement \nvalues are varied depending on whether the failure occurs in the handicapped or \nthe normal mode of operation. Furthermore, to realize a simulated annealing \nscheme for accelerated learning, the negative reinforcement value is increased \nif the system visits the same failed state successively. In the examples \nstudied, these learning schemes have demonstrated high learning rates, and \ntherefore may prove useful for in-situ learning. \n\n1 INTRODUCTION \n\nReinforcement learning techniques have been applied successfully to simple control \nproblems, such as the pole-cart problem [Barto 83, Michie 68, Rosen 88], where the \ngoal was to maintain the pole in a quasistable region, but not at specific setpoints. \nHowever, a large class of continuous control problems requires maintaining the \nsystem at a desired operating point, or setpoint, at a given time. We refer to this \nproblem as the basic setpoint control problem [Guha 90], and have shown that \nreinforcement learning, not surprisingly, works quite well for such control tasks. 
\nA more general version of the same problem requires steering the system from some \ninitial or starting state to a desired state or setpoint at specific times without \nknowledge of the dynamics of the system. We therefore wish to examine how \ncontrol scheduling tasks, where the system must be steered through a sequence of \nsetpoints at specific times, can be learned. Solving such a control problem without \nexplicit modeling of the system or plant can prove beneficial in many adaptive \ncontrol tasks. \n\nTo address the control scheduling problem, we have derived a learning algorithm \ncalled handicapped learning. Handicapped learning uses a nonlinear encoding of the \nstate of the system, a new associative reinforcement learning algorithm, and a novel \nreinforcement scheme to explore the control space to meet the scheduling \nconstraints. The goal of handicapped learning is to learn the control law necessary to \nsteer the system from one setpoint to another. We provide a description of the state \nencoding and associative learning in Section 2, the reinforcement scheme in Section \n3, the experimental results in Section 4, and the conclusions in Section 5. \n\n2 REINFORCEMENT LEARNING STRATEGY: HANDICAPPED LEARNING \n\nOur earlier work on regulatory control using reinforcement learning [Guha 90] used a \nsimple linear coded state representation of the system. However, when considering \nmultiple setpoints in a schedule, a high-resolution linear coding results in a \ncombinatorial explosion of states. To avoid this curse of dimensionality, we have \nadopted a simple nonlinear encoding of the state space. We describe this first. \n\n2.1 STATE ENCODING \n\nTo define the states in which reinforcement must be provided to the controller, we \nset tolerance limits around the desired setpoint, say $X_d$. If the tolerance of operation \ndefined by the level of control sophistication required in the problem is T, 
then the \ncontroller is defined to fail if $|X(t) - X_d| > T$, as described in our earlier work [Guha 90]. \n\nThe controller must learn to maintain the system within this tolerance window. If the \nrange, R, of possible values of the setpoint or control variable X(t) is significantly \ngreater than the tolerance window, then the number of states required to define the \nsetpoint will be large. We therefore use a nonlinear coding of the control variable. \nThus, if the level of discrimination within the tolerance window is 2T/n, then the \nnumber of states required to represent the control variable is (n + 2), where the two \nadded states represent the conditions $(X(t) - X_d) > T$ and $(X(t) - X_d) < -T$. With this \nrepresentation scheme, any continuous range of setpoints can be represented with \nvery high resolution but without the explosion in state space. \n\nThe above state encoding is used in our associative reinforcement learning \nalgorithm, handicapped learning, which we describe next. \n\n2.2 HANDICAPPED LEARNING ALGORITHM \n\nOur reinforcement learning strategy is derived from the Associative Search \nElement/Adaptive Heuristic Critic (ASE/AHC) algorithm [Barto 83, Anderson 86]. \nWe have considered a binary control output, y(t): \n\n$$y(t) = f\\Big(\\sum_i w_i(t)\\, x_i(t) + \\mathrm{noise}(t)\\Big) \\qquad (1)$$ \n\nwhere f is the thresholding step function, and $x_i(t)$, $0 \\le i \\le N$, is the current decoded \nstate, that is, $x_i(t) = 1$ when the system is in the ith state and 0 otherwise. As in \nASE, the added term noise(t) facilitates stochastic learning. Note that the learning \nalgorithm can be easily extended to continuous-valued outputs; the nature of the \ncontinuity is then determined by the thresholding function. \n\nWe incorporate two learning heuristics: state recurrence [Rosen 88] and a newly \nintroduced heuristic called \"handicapped learning\". The controller is in the \nhandicapped learning mode if a flag, H, 
is set high. H is defined as follows: \n\n$$H = \\begin{cases} 0, & \\text{if } |X(t) - X_d| < T \\\\ 1, & \\text{otherwise} \\end{cases} \\qquad (2)$$ \n\nThe handicap mode provides a mechanism to modify the reinforcement scheme. In \nthis mode the controller is allowed to explore the search space of action sequences, \nto steer to a new setpoint, without \"punishment\" (negative reinforcement). The mode \nis invoked when the system is at a valid setpoint $X_1(t_1)$ at time $t_1$, but must be \nsteered to the new setpoint $X_2$ outside the tolerance window, that is, $|X_1 - X_2| > T$, \nat time $t_2$. Since both setpoints are valid operating points, these setpoints as well as \nall points within the possible optimal trajectories from $X_1$ to $X_2$ cannot be deemed to \nbe failure states. Further, by following a special reinforcement scheme during the \nhandicapped mode, one can enable learning and help the controller find the \noptimal trajectory to steer the system from one setpoint to another. \n\nThe weight updating rule used during setpoint schedule learning is given by equation \n(3): \n\n$$w_i(t+1) = w_i(t) + \\alpha_1 r_1(t)\\, e_i(t) + \\alpha_2 r_2(t)\\, e_{2i}(t) + \\alpha_3 r_3(t)\\, e_{3i}(t) \\qquad (3)$$ \n\nwhere the term $\\alpha_1 r_1(t)\\, e_i(t)$ is the basic associative learning component, $r_1(t)$ the \nheuristic reinforcement, and $e_i(t)$ the eligibility trace of the state $x_i(t)$ [Barto 83]. \n\nThe third term in equation (3) is the state recurrence component for reinforcing short \ncycles [Rosen 88]. Here $\\alpha_2$ is a constant gain, $r_2(t)$ is a positive constant reward, \nand $e_{2i}$, the state recurrence eligibility, is defined as follows: \n\n$$e_{2i}(t) = \\begin{cases} \\dfrac{\\beta_2\\, x_i(t)\\, y(t_{i,\\mathrm{last}})}{\\beta_2 + t - t_{i,\\mathrm{last}}}, & \\text{if } (t - t_{i,\\mathrm{last}}) > 1 \\text{ and } H = 0 \\\\ 0, & \\text{otherwise} \\end{cases} \\qquad (4)$$ \n\nwhere $\\beta_2$ is a positive constant, and $t_{i,\\mathrm{last}}$ is the last time the system visited the ith \nstate. The eligibility function in equation (4) reinforces shorter cycles more than \nlonger cycles, and improves control when the system is within a tolerance window. 
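\n\nAs a hedged worked illustration of equation (4), take the positive constant in that equation (written $\\beta_2$ here) to be 10, and set $x_i(t) = y(t_{i,\\mathrm{last}}) = 1$ with H = 0. These numbers are assumptions chosen only for arithmetic convenience and do not come from the experiments reported in this paper. A state revisited after 2 steps and a state revisited after 20 steps then receive \n\n$$e_{2i} = \\frac{10}{10 + 2} \\approx 0.83 \\qquad \\text{and} \\qquad e_{2i} = \\frac{10}{10 + 20} \\approx 0.33,$$ \n\nso under these assumed values the shorter cycle is credited about 2.5 times as strongly as the longer one. 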
\n\nThe fourth term in equation (3) is the handicapped learning component. Here $\\alpha_3$ is a \nconstant gain, $r_3(t)$ is a positive constant reward, and $e_{3i}$, the handicapped learning \neligibility, is defined as follows: \n\n$$e_{3i}(t) = \\begin{cases} -\\dfrac{\\beta_3\\, x_i(t)\\, y(t_{i,\\mathrm{last}})}{\\beta_3 + t - t_{i,\\mathrm{last}}}, & \\text{if } H = 1 \\\\ 0, & \\text{otherwise} \\end{cases} \\qquad (5)$$ \n\nwhere $\\beta_3$ is a positive constant. While state recurrence promotes short cycles around \na desired operating point, handicapped learning forces the controller to move away \nfrom the current operating point X(t). The system enters the handicapped mode \nwhenever it is outside the tolerance window around the desired setpoint. If the initial \noperating point $X_i$ (= X(0)) is outside the tolerance window of the desired setpoint \n$X_d$, that is, $|X_i - X_d| > T$, the basic AHC network will always register a failure. This failure \nsituation is avoided by invoking the handicapped learning described above. By \nsetting absolute upper and lower limits on operating point values, the controller based \non handicapped learning can learn the correct sequence of actions necessary to steer \nthe system to the desired operating point $X_d$. \n\nThe weight update equations for the critic in the AHC are unchanged from the \noriginal AHC and we do not list them here. \n\n3 REINFORCEMENT SCHEMES \n\nUnlike previous experiments by other researchers, we have constructed the \nreinforcement values used during learning to be multivalued, and not binary. The \nreinforcement to the critic is negative; both positive and negative reinforcements are \nused. There are two forms of failure that can occur during setpoint control. First, the \ncontroller can reach the absolute upper or lower limits. Second, there may be a \ntimeout failure in the handicapped mode. By design, when the controller is in \nhandicapped mode, it is allowed to remain there for only a time $T_L$, 
determined by the \naverage control step $\\Delta y$ and the error between the current operating point and the \ndesired setpoint: \n\n$$T_L = k\\, \\Delta y\\, (X_0 - X_d) \\qquad (6)$$ \n\nwhere $X_0$ is the initial setpoint, and k is some constant. The negative reinforcement \nprovided to the controller is higher if the absolute limits of the operating point are \nreached. \n\nWe have implemented a more interesting reinforcement scheme that is somewhat \nsimilar to simulated annealing. If the system fails in the same state on two \nsuccessive trials, the negative reinforcement is increased. The primary \nreinforcement function can be defined as follows: \n\n$$r_j(k+1) = \\begin{cases} r_i(k) - r_0, & \\text{if } i = j \\\\ r_1, & \\text{if } i \\neq j \\end{cases} \\qquad (7)$$ \n\nwhere $r_i(k)$ is the negative reinforcement provided if the system failed in state i \nduring trial k, and $r_0$ and $r_1$ are constants. \n\n4 EXPERIMENTS AND RESULTS \n\nTwo different setpoint control experiments have been conducted. The first was the \nbasic setpoint control of a continuous stirred tank reactor in which the temperature \nmust be held at a desired setpoint. That experiment successfully demonstrated the \nuse of reinforcement learning for setpoint control of a highly nonlinear and unstable \nprocess [Guha 90]. The second, more recent experiment evaluated the \nhandicapped learning strategy for an environmental controller, where the controller \nmust learn to control the heating system to maintain the ambient temperature \nspecified by a time-temperature schedule. Thus, as the external temperature varies, \nthe network must adapt the heating (ON) and (OFF) control sequence so as to \nreach and maintain the desired temperature as quickly as possible. The \nstate information describing the system is composed of the time interval of the schedule, \nthe current heating state (ON/OFF), and the error, that is, the difference between the desired \nand the current ambient or interior temperature. 
The heating and cooling rates are \nvariable: the heating rate decreases while the cooling rate increases exponentially as \nthe exterior temperature falls below the ambient or controlled temperature. \n\n[Figure 1: Rate of Learning with and without Handicapped Learning. Horizontal axis: Trial Number (0 to 40); vertical axis: 0 to 100 (axis label not recoverable). Curves: Handicapped Learning; No Handicapped Learning.] \n\n[Figure 2: Time-Temperature Plot of Controlled Environment at Forty-third Trial. Horizontal axis: Time (minutes); vertical axis: Temp (54 to 70). Curve: Ambient Temperature (controlled).] \n\nThe experiments on the environmental controller consisted of embedding a daily \nsetpoint schedule that contains six setpoints at six specific times. Trials were \nconducted to train the controller. Each trial starts at the beginning of the schedule \n(time = 0). The setpoints typically varied in the range of 55 to 75 degrees. The \ndesired tolerance window was 1 degree. The lower and upper limits of the controlled \ntemperature were set arbitrarily at 50 and 80 degrees, respectively. Control actions \nwere taken every 5 minutes. Learning was monitored by examining how much of the \nschedule was learnt correctly as the number of trials increased. 
\n\n[Figure 3: Time-Temperature Plot of Controlled Environment for a Test Run. Horizontal axis: Time (minutes), 200 to 1400; vertical axis: Temp. Curves: Setpoint Schedule Temperature; Ambient (controlled) Temperature; Exterior Temperature.] \n\nFigure 1 shows how the learning progresses with the number of trials. Current results \nshow that the learning of the complete schedule (of the six time-temperature pairs), \nrequiring 288 control steps, can be accomplished in only 43 trials. (Given binary \noutput, the controller could have in the worst case executed $10^{86}$ ($\\approx 2^{288}$) trials to learn \nthe complete schedule.) \n\nMore details on the learning ability using the reinforcement learning strategy are available \nfrom the time-temperature plots of the trial and test runs in Figures 2 and 3. As the \nlearning progresses to the forty-third trial, the controller learns to continuously heat up or \ncool down to the desired temperature (Figure 2). To further test the learning \ngeneralizations on the schedule, the trained network was tested on a different environment \nwhere the exterior temperature profile (and therefore the heating and cooling rates) was \ndifferent from the one used for training. Figure 3 shows the schedule that is maintained. \nBecause the controller encounters different cooling rates in the test run, some learning \nstill occurs, as is evident from Figure 3. However, all six setpoints were reached in the \nproper sequence. In essence, this test shows that the controller has generalized on the \nheating and cooling control law, independent of the setpoints and the heating and cooling \nrates. \n\n5 CONCLUSIONS \n\nWe have developed a new learning strategy based on reinforcement learning that can be \nused to learn setpoint schedules for continuous processes. The experimental results have \ndemonstrated good learning performance. 
However, a number of interesting extensions to \nthis work are possible. For instance, the handicapped mode exploration of control can be \nbetter controlled for faster learning, if more information on the desired or possible \ntrajectory is known. Another area of investigation is state encoding. \nIn our approach, the nonlinear encoding of the system state was assumed uniform at \ndifferent regions of the control space. In applications where the system is highly \nnonlinear, different nonlinear codings could be used adaptively to improve the state \nrepresentation. Finally, other formulations of reinforcement learning algorithms, besides \nASE/AHC, should also be explored. One such possibility is Watkins' Q-learning \n[Watkins 89]. \n\nReferences \n\n[Guha 90] A. Guha and A. Mathur, Setpoint Control Based on Reinforcement Learning, \nProceedings of IJCNN 90, Washington D.C., January 1990. \n[Barto 83] A.G. Barto, R.S. Sutton, and C.W. Anderson, Neuronlike Adaptive Elements \nThat Can Solve Difficult Learning Control Problems, IEEE Transactions on Systems, \nMan, and Cybernetics, Vol. SMC-13, No. 5, September/October 1983. \n[Michie 68] D. Michie and R. Chambers, Machine Intelligence, E. Dale and D. Michie \n(eds.), Oliver and Boyd, Edinburgh, 1968, p. 137. \n[Rosen 88] B. E. Rosen, J. M. Goodwin, and J. J. Vidal, Learning by State Recurrence \nDetection, IEEE Conference on Neural Information Processing Systems - Natural and \nSynthetic, AIP Press, 1988. \n[Watkins 89] C.J.C.H. Watkins, Learning from Delayed Rewards, Ph.D. Dissertation, \nKing's College, May 1989. \n", "award": [], "sourceid": 337, "authors": [{"given_name": "Aloke", "family_name": "Guha", "institution": null}]}