{"title": "Gradient Descent for General Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 968, "page_last": 974, "abstract": null, "full_text": "Gradient Descent for General Reinforcement Learning \n\nLeemon Baird \nleemon@cs.cmu.edu \nwww.cs.cmu.edu/~leemon \nComputer Science Department \n5000 Forbes Avenue \nCarnegie Mellon University \nPittsburgh, PA 15213-3891 \n\nAndrew Moore \nawm@cs.cmu.edu \nwww.cs.cmu.edu/~awm \nComputer Science Department \n5000 Forbes Avenue \nCarnegie Mellon University \nPittsburgh, PA 15213-3891 \n\nAbstract \n\nA simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs. These include Q-learning, SARSA, and advantage learning. In addition to these value-based algorithms it also generates pure policy-search reinforcement-learning algorithms, which learn optimal policies without learning a value function. In addition, it allows policy-search and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search (VAPS) algorithm. And these algorithms converge for POMDPs without requiring a proper belief state. Simulation results are given, and several areas for future research are discussed. \n\n1 CONVERGENCE OF GREEDY EXPLORATION \n\nMany reinforcement-learning algorithms are known that use a parameterized function approximator to represent a value function, and adjust the weights incrementally during learning. Examples include Q-learning, SARSA, and advantage learning. There are simple MDPs where the original form of these algorithms fails to converge, as summarized in Table 1. For the cases with a ✓, the algorithms are guaranteed to converge under reasonable assumptions such as decaying learning rates. \n\nTable 1. Current convergence results for incremental, value-based RL algorithms. The table has rows for Markov chain, MDP, and POMDP, and three columns of training distributions, the rightmost being the usually-greedy distribution. ✓ = convergence guaranteed; X = a counterexample is known that either diverges or oscillates between the best and worst possible policies. Residual algorithms changed every X in the first two columns to a ✓. The new algorithms in this paper change every X to a ✓. \n\nFor the cases with an X, there are known counterexamples where the algorithm will either diverge or oscillate between the best and worst possible policies, which have very different values. This can happen even with infinite training time and slowly-decreasing learning rates (Baird, 95; Gordon, 96). Each X in the first two columns can be changed to a ✓ and made to converge by using a modified form of the algorithm, the residual form (Baird, 95). But this is only possible when learning with a fixed training distribution, and that is rarely practical. For most large problems, it is useful to explore with a policy that is usually-greedy with respect to the current value function, and that changes as the value function changes.
 \nIn that case (the rightmost column of the chart), the current convergence guarantees are not very good. One way to guarantee convergence in all three columns is to modify the algorithm so that it is performing stochastic gradient descent on some average error function, where the average is weighted by state-visitation frequencies for the current usually-greedy policy. Then the weighting changes as the policy changes. It might appear that this gradient is difficult to compute. Consider Q-learning exploring with a Boltzmann distribution that is usually greedy with respect to the learned Q function. It seems difficult to calculate gradients, since changing a single weight will change many Q values, changing a single Q value will change many action-choice probabilities in that state, and changing a single action-choice probability may affect the frequency with which every state in the MDP is visited. Although this might seem difficult, it is not. Surprisingly, unbiased estimates of the gradients of visitation distributions with respect to the weights can be calculated quickly, and the resulting algorithms can put a ✓ in every case in Table 1. \n\n2 DERIVATION OF THE VAPS EQUATION \n\nConsider a sequence of transitions observed while following a particular stochastic policy on an MDP. Let s_t = {x_0,u_0,R_0, x_1,u_1,R_1, ..., x_{t-1},u_{t-1},R_{t-1}, x_t,u_t,R_t} be the sequence of states, actions, and reinforcements up to time t, where performing action u_t in state x_t yields reinforcement R_t and a transition to state x_{t+1}. The stochastic policy may be a function of a vector of weights w. Assume the MDP has a single start state named x_0. If the MDP has terminal states, and x_t is a terminal state, then x_{t+1}=x_0. Let S_t be the set of all possible sequences from time 0 to t.
Let e(s_t) be a given error function that calculates an error on each time step, such as the squared Bellman residual at time t, or some other error occurring at time t. If e is a function of the weights, then it must be a smooth function of the weights. Consider a period of time starting at time 0 and ending with probability P(end|s_t) after the sequence s_t occurs. The probabilities must be such that the expected squared period length is finite. Let B be the expected total error during that period, where the expectation is weighted according to the state-visitation frequencies generated by the given policy: \n\nB = Σ_{T=0}^{∞} Σ_{s_T} P(period ends at time T after trajectory s_T) Σ_{t=0}^{T} e(s_t)   (1) \n\n  = Σ_{t=0}^{∞} Σ_{s_t ∈ S_t} e(s_t) P(s_t)   (2) \n\nwhere: \n\nP(s_t) = P(u_t|s_t) P(R_t|s_t) ∏_{i=0}^{t-1} P(u_i|s_i) P(R_i|s_i) P(x_{i+1}|s_i) [1 − P(end|s_i)]   (3) \n\nNote that on the first line, for a particular s_t, the error e(s_t) will be added in to B once for every sequence that starts with s_t. Each of these terms will be weighted by the probability of a complete trajectory that starts with s_t. The sum of the probabilities of all trajectories that start with s_t is simply the probability of s_t being observed, since the period is assumed to end eventually with probability one. So the second line equals the first. The third line is the probability of the sequence, of which only the P(u_i|x_i) factor might be a function of w. If so, this probability must be a smooth function of the weights and nonzero everywhere. The partial derivative of B with respect to a particular element w of the weight vector w is: \n\n∂B/∂w = Σ_{t=0}^{∞} Σ_{s_t ∈ S_t} [ P(s_t) ∂e(s_t)/∂w + e(s_t) ∂P(s_t)/∂w ]   (4) \n\n       = Σ_{t=0}^{∞} Σ_{s_t ∈ S_t} P(s_t) [ ∂e(s_t)/∂w + e(s_t) Σ_{i=1}^{t} ∂/∂w ln P(u_{i-1}|s_{i-1}) ]   (5) \n\nSpace here is limited, and it may not be clear from the short sketch of this derivation, but summing (5) over an entire period does give an unbiased estimate of the gradient of B, the expected total error during a period.
 An incremental algorithm to perform stochastic gradient descent on B is the weight update given on the left side of Table 2, where the summation over previous time steps is replaced with a trace T_t for each weight. This algorithm is more general than previously-published algorithms of this form, in that e can be a function of all previous states, actions, and reinforcements, rather than just the current reinforcement. This is what allows VAPS to do both value and policy search. \n\nEvery algorithm proposed in this paper is a special case of the VAPS equation on the left side of Table 2. Note that no model is needed for this algorithm. The only probability needed in the algorithm is the policy, not the transition probability from the MDP. This is stochastic gradient descent on B, and the update rule is only correct if the observed transitions are sampled from trajectories found by following the current, stochastic policy. \n\nTable 2. The general VAPS algorithm (left), and several instantiations of it (right). This single algorithm includes both value-based and policy-search approaches and their combination, and gives guaranteed convergence in every case. \n\nΔw_t = −α [ ∂e(s_t)/∂w + e(s_t) T_t ] \nΔT_t = ∂/∂w ln P(u_{t-1}|s_{t-1}) \n\ne_SARSA(s_t) = ½ E²[ R_{t-1} + γ Q(x_t,u_t) − Q(x_{t-1},u_{t-1}) ] \ne_Q-learning(s_t) = ½ E²[ R_{t-1} + γ max_u Q(x_t,u) − Q(x_{t-1},u_{t-1}) ] \ne_advantage(s_t) = ½ E²[ (1/k)(R_{t-1} + γ max_u A(x_t,u)) + (1 − 1/k) max_u A(x_{t-1},u) − A(x_{t-1},u_{t-1}) ] \ne_value-iteration(s_t) = ½ [ max_u E[ R_{t-1} + γ V(x_t) ] − V(x_{t-1}) ]² \ne_SARSA-policy(s_t) = (1 − β) e_SARSA(s_t) + β [ b − γ^t R_t ] \n\nBoth e and P should be smooth functions of w, and for any given w vector, e should be bounded.
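As a concrete, purely illustrative sketch of the update on the left side of Table 2: the following assumes a tabular Q function whose entries are the weights w, a Boltzmann exploration policy, and the squared SARSA residual for e, reusing the unprimed sample in the gradient. All function and variable names here are ours, not the paper's.

```python
import numpy as np

def softmax(q, temp=1.0):
    """Boltzmann action distribution over one row of Q values."""
    z = np.exp((q - q.max()) / temp)
    return z / z.sum()

def vaps_episode(Q, episode, alpha=0.1, gamma=0.9, temp=1.0):
    """One pass of the VAPS update (Table 2, left) over a recorded episode.

    Q: (n_states, n_actions) array of weights.
    episode: list of (state, action, reinforcement) triples.
    T accumulates d/dw ln P(u_{t-1}|s_{t-1}); Delta w = -alpha*(de/dw + e*T).
    """
    T = np.zeros_like(Q)       # one eligibility-style trace per weight
    dw = np.zeros_like(Q)
    for t in range(1, len(episode)):
        x0, u0, R0 = episode[t - 1]
        x1, u1, _ = episode[t]
        # Delta T_t = d/dw ln P(u_{t-1}|s_{t-1}) for a Boltzmann policy
        p = softmax(Q[x0], temp)
        g = np.zeros_like(Q)
        g[x0] = -p / temp
        g[x0, u0] += 1.0 / temp
        T += g
        # e(s_t) = 0.5 * (SARSA residual)^2, reusing the unprimed sample
        delta = R0 + gamma * Q[x1, u1] - Q[x0, u0]
        de = np.zeros_like(Q)
        de[x0, u0] -= delta
        de[x1, u1] += gamma * delta
        dw += -alpha * (de + 0.5 * delta ** 2 * T)
    return Q + dw
```

Resetting T to zero at the start of each episode corresponds to the per-trial error of Section 3.1.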
The algorithm is simple, but actually generates a large class of different algorithms depending on the choice of e and when the trace is reset to zero. For a single sequence, sampled by following the current policy, the sum of Δw along the sequence will give an unbiased estimate of the true gradient, with finite variance. Therefore, during learning, if weight updates are made at the end of each trial, and if the weights stay within a bounded region, and the learning rate approaches zero, then B will converge with probability one. Adding a weight-decay term (a constant times the 2-norm of the weight vector) onto B will prevent weight divergence for small initial learning rates. There is no guarantee that a global minimum will be found when using general function approximators, but at least it will converge. This is true for backprop as well. \n\n3 INSTANTIATING THE VAPS ALGORITHM \n\nMany reinforcement-learning algorithms are value-based; they try to learn a value function that satisfies the Bellman equation. Examples are Q-learning, which learns a value function, actor-critic algorithms, which learn a value function and the policy which is greedy with respect to it, and TD(1), which learns a value function based on future rewards. Other algorithms are pure policy-search algorithms; they directly learn a policy that returns high rewards. These include REINFORCE (Williams, 1988), backprop through time, learning automata, and genetic algorithms. The algorithms proposed here combine the two approaches: they perform Value And Policy Search (VAPS). The general VAPS equation is instantiated by choosing an expression for e. This can be a Bellman residual (yielding value-based algorithms), the reinforcement (yielding policy-search algorithms), or a linear combination of the two (yielding Value And Policy Search). The single VAPS update rule on the left side of Table 2 generates a variety of different types of algorithms, some of which are described in the following sections. \n\n3.1 REDUCING MEAN SQUARED RESIDUAL PER TRIAL \n\nIf the MDP has terminal states, and a trial is the time from the start until a terminal state is reached, then it is possible to minimize the expected total error per trial by resetting the trace to zero at the start of each trial. Then, a convergent form of SARSA, Q-learning, incremental value iteration, or advantage learning can be generated by choosing e to be the squared Bellman residual, as shown on the right side of Table 2. In each case, the expected value is taken over all possible (x_t,u_t,R_t) triplets, given s_{t-1}. The policy must be a smooth, nonzero function of the weights. So it could not be an ε-greedy policy that chooses the greedy action with probability (1−ε) and chooses uniformly otherwise. That would cause a discontinuity in the gradient when two Q values in a state were equal. But the policy could be something that approaches ε-greedy as a positive temperature c approaches zero: \n\nP(u|x) = ε/n + (1 − ε) (1 + e^{Q(x,u)/c}) / Σ_{u'} (1 + e^{Q(x,u')/c})   (6) \n\nwhere n is the number of possible actions in each state. For each instance in Table 2 other than value iteration, the gradient of e can be estimated using two independent, unbiased estimates of the expected value. For example: \n\n∂/∂w e_SARSA(s_t) ≈ (R_{t-1} + γ Q(x_t,u_t) − Q(x_{t-1},u_{t-1})) (γφ ∂/∂w Q(x'_t,u'_t) − ∂/∂w Q(x_{t-1},u_{t-1}))   (7) \n\nWhen φ=1, this is an estimate of the true gradient. When φ<1, this is a residual algorithm, as described in (Baird, 95), and it retains guaranteed convergence, but may learn more quickly than pure gradient descent for some values of φ. Note that the gradient of Q(x,u) at time t uses primed variables.
That means a new state and action at time t were generated independently from the state and action at time t−1. Of course, if the MDP is deterministic, then the primed variables are the same as the unprimed. If the MDP is nondeterministic but the model is known, then the model must be evaluated one additional time to get the other state. If the model is not known, then there are three choices. First, a model could be learned from past data, and then evaluated to give this independent sample. Second, the issue could be ignored, simply reusing the unprimed variables in place of the primed variables. This may affect the quality of the learned function (depending on how random the MDP is), but doesn't stop convergence, and may be an acceptable approximation in practice. Third, all past transitions could be recorded, and the primed variables could be found by searching for all the times (x_{t-1},u_{t-1}) has been seen before, randomly choosing one of those transitions, and using its successor state and action as the primed variables. This is equivalent to learning the certainty-equivalence model and sampling from it, and so is a special case of the first choice. For extremely large state-action spaces with many starting states, this is likely to give the same result in practice as simply reusing the unprimed variables as the primed variables. Note that when the weights do not affect the policy at all, these algorithms reduce to standard residual algorithms (Baird, 95). \n\nIt is also possible to reduce the mean squared residual per step, rather than per trial. This is done by making period lengths independent of the policy, so minimizing error per period will also minimize the error per step. For example, a period might be defined to be the first 100 steps, after which the traces are reset, and the state is returned to the start state.
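As a hedged sketch of the double-sample estimate in equation (7), assuming a tabular Q (a dict keyed by (state, action) pairs) and a known generative model; `sample_next` and `policy` are illustrative names of ours, not the paper's:

```python
from collections import defaultdict

def sarsa_residual_grad(Q, x0, u0, R, sample_next, policy, gamma=0.9, phi=1.0):
    """Unbiased estimate of d/dw e_SARSA using two independent successor draws.

    The first draw forms the Bellman residual; the second, independent
    draw supplies the primed variables (x'_t, u'_t) for the gradient term.
    Returns a sparse gradient over the tabular weights.
    """
    x1 = sample_next(x0, u0)          # unprimed successor
    u1 = policy(x1)
    x1p = sample_next(x0, u0)         # independent primed successor
    u1p = policy(x1p)
    delta = R + gamma * Q[(x1, u1)] - Q[(x0, u0)]
    grad = defaultdict(float)
    grad[(x1p, u1p)] += gamma * phi * delta   # gamma*phi * dQ(x',u')/dw
    grad[(x0, u0)] -= delta                   # -dQ(x_{t-1},u_{t-1})/dw
    return dict(grad)
```

With phi < 1 this is the residual form; reusing the unprimed draw as the primed one recovers the second of the three choices above.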
Note that if every state-action pair has a positive chance of being seen in the first 100 steps, then this will not just be solving a finite-horizon problem. It will actually be solving the discounted, infinite-horizon problem, by reducing the Bellman residual in every state. But the weighting of the residuals will be determined only by what happens during the first 100 steps. Many different problems can be solved by the VAPS algorithm by instantiating the definition of \"period\" in different ways. \n\n3.2 POLICY-SEARCH AND VALUE-BASED LEARNING \n\nIt is also possible to add a term that tries to maximize reinforcement directly. For example, e could be defined to be e_SARSA-policy rather than e_SARSA from Table 2, and the trace reset to zero after each terminal state is reached. The constant b does not affect the expected gradient, but does affect the noise distribution, as discussed in (Williams, 88). When β=0, the algorithm will try to learn a Q function that satisfies the Bellman equation, just as before. When β=1, it directly learns a policy that will maximize the expected total discounted reinforcement. The resulting \"Q function\" may not even be close to containing true Q values or to satisfying the Bellman equation; it will just give a good policy. When β is in between, this algorithm tries to both satisfy the Bellman equation and give good greedy policies. A similar modification can be made to any of the algorithms in Table 2. In the special case where β=1, this algorithm reduces to the REINFORCE algorithm (Williams, 1988). REINFORCE has been rederived for the special case of Gaussian action distributions (Tresp & Hofman, 1995), and extensions of it appear in (Marbach, 1998). This case of pure policy search is particularly interesting, because for β=1, there is no need for any kind of model or for generating two independent successors. Other algorithms have been proposed for finding policies directly, such as those given in (Gullapalli, 92) and the various algorithms from learning automata theory summarized in (Narendra & Thathachar, 89). The VAPS algorithms proposed here appear to be the first to unify these two approaches to reinforcement learning, finding a value function that both approximates a Bellman-equation solution and directly optimizes the greedy policy. \n\nFigure 1. A POMDP and the number of trials needed to learn it vs. β. A combination of policy-search and value-based RL outperforms either alone. \n\nFigure 1 shows simulation results for the combined algorithm. A run is said to have learned when the greedy policy is optimal for 1000 consecutive trials. The graph shows the average of 100 runs, with different initial random weights between ±10⁻⁶. The learning rate was optimized separately for each β value. R=1 when leaving state A, R=2 when leaving state B or entering the end state, and R=0 otherwise. γ=0.9. The algorithm used was the modified Q-learning from Table 2, with exploration as in equation (6), and φ=c=1, b=0, ε=0.1. States A and B share the same parameters, so ordinary SARSA or greedy Q-learning could never converge, as shown in (Gordon, 96). When β=0 (pure value-based), the new algorithm converges, but of course it cannot learn the optimal policy in the start state, since those two Q values learn to be equal. When β=1 (pure policy-search), learning converges to optimality, but slowly, since there is no value function caching the results in the long sequence of states near the end. By combining the two approaches, the new algorithm learns much more quickly than either alone.
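A minimal sketch of the blended error e_SARSA-policy from Table 2, assuming `delta` is the sampled SARSA residual for the current step; the function name and argument names are ours:

```python
def blended_error(delta, R_t, t, beta, gamma=0.9, b=0.0):
    """e_SARSA-policy: (1-beta)*e_SARSA + beta*(b - gamma^t * R_t).

    beta = 0 gives pure value-based learning (Bellman residual only);
    beta = 1 gives pure policy search, as in REINFORCE.
    """
    e_value = 0.5 * delta ** 2          # squared SARSA residual
    e_policy = b - (gamma ** t) * R_t   # minimizing this maximizes reward
    return (1.0 - beta) * e_value + beta * e_policy
```

Plugging this e into the update on the left of Table 2 gives the combined value-and-policy-search algorithm used in the simulation above.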
 \n\nIt is interesting that the VAPS algorithms described in the last three sections can be applied directly to a Partially Observable Markov Decision Process (POMDP), where the true state is hidden, and all that is available on each time step is an ambiguous \"observation\", which is a function of the true state. Normally, an algorithm such as SARSA only has guaranteed convergence when applied to an MDP. The VAPS algorithms will converge in such cases. \n\n4 CONCLUSION \n\nA new algorithm has been presented. Special cases of it give new algorithms similar to Q-learning, SARSA, and advantage learning, but with guaranteed convergence for a wider range of problems than was previously possible, including POMDPs. For the first time, these can be guaranteed to converge, even when the exploration policy changes during learning. Other special cases allow new approaches to reinforcement learning, where there is a tradeoff between satisfying the Bellman equation and improving the greedy policy. For one MDP, simulation showed that this combined algorithm learned more quickly than either approach alone. This theory, unifying for the first time both value-based and policy-search reinforcement learning, is of theoretical interest, and was also of practical value for the simulations performed. Future research with this unified framework may be able to empirically or analytically address the old question of when it is better to learn value functions and when it is better to learn the policy directly. It may also shed light on the new question of when it is best to do both at once. \n\nAcknowledgments \n\nThis research was sponsored in part by the U.S. Air Force. \n\nReferences \n\nBaird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Armand Prieditis & Stuart Russell, eds.,
Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufmann Publishers, San Francisco, CA. \n\nGordon, G. (1996). \"Stable fitted reinforcement learning\". In G. Tesauro, M. Mozer, and M. Hasselmo (eds.), Advances in Neural Information Processing Systems 8, pp. 1052-1058. MIT Press, Cambridge, MA. \n\nGullapalli, V. (1992). Reinforcement Learning and Its Application to Control. Dissertation and COINS Technical Report 92-10, University of Massachusetts, Amherst, MA. \n\nKaelbling, L. P., Littman, M. L. & Cassandra, A. \"Planning and Acting in Partially Observable Stochastic Domains\". Artificial Intelligence, to appear. Available now at http://www.cs.brown.edu/people/lpk. \n\nMarbach, P. (1998). Simulation-Based Optimization of Markov Decision Processes. Thesis LIDS-TH 2429, Massachusetts Institute of Technology. \n\nMcCallum, A. (1995). Reinforcement learning with selective perception and hidden state. Dissertation, Department of Computer Science, University of Rochester, Rochester, NY. \n\nNarendra, K. & Thathachar, M. A. L. (1989). Learning Automata: An Introduction. Prentice Hall, Englewood Cliffs, NJ. \n\nTresp, V., & Hofman, R. (1995). \"Missing and noisy data in nonlinear time-series prediction\". In Proceedings of Neural Networks for Signal Processing 5, F. Girosi, J. Makhoul, E. Manolakos and E. Wilson, eds., IEEE Signal Processing Society, New York, New York, 1995, pp. 1-10. \n\nWilliams, R. J. (1988). Toward a theory of reinforcement-learning connectionist systems. Technical report NU-CCS-88-3, Northeastern University, Boston, MA. \n", "award": [], "sourceid": 1576, "authors": [{"given_name": "Leemon", "family_name": "Baird", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}