{"title": "Generalization in Reinforcement Learning: Safely Approximating the Value Function", "book": "Advances in Neural Information Processing Systems", "page_first": 369, "page_last": 376, "abstract": null, "full_text": "Generalization in Reinforcement Learning: \nSafely Approximating the Value Function \n\nJustin A. Boyan and Andrew W. Moore \n\nComputer Science Department \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\njab@cs.cmu.edu, awm@cs.cmu.edu \n\nAbstract \n\nA straightforward approach to the curse of dimensionality in re(cid:173)\ninforcement learning and dynamic programming is to replace the \nlookup table with a generalizing function approximator such as a neu(cid:173)\nral net. Although this has been successful in the domain of backgam(cid:173)\nmon, there is no guarantee of convergence. In this paper, we show \nthat the combination of dynamic programming and function approx(cid:173)\nimation is not robust, and in even very benign cases, may produce \nan entirely wrong policy. We then introduce Grow-Support, a new \nalgorithm which is safe from divergence yet can still reap the benefits \nof successful generalization . \n\nINTRODUCTION \n\n1 \nReinforcement learning-the problem of getting an agent to learn to act from sparse, \ndelayed rewards-has been advanced by techniques based on dynamic programming \n(DP). These algorithms compute a value function which gives, for each state, the min(cid:173)\nimum possible long-term cost commencing in that state. For the high-dimensional \nand continuous state spaces characteristic of real-world control tasks, a discrete repre(cid:173)\nsentation of the value function is intractable; some form of generalization is required. \n\nA natural way to incorporate generalization into DP is to use a function approximator, \nrather than a lookup table, to represent the value function. 
This approach, which dates back to uses of Legendre polynomials in DP [Bellman et al., 1963], has recently worked well on several dynamic control problems [Mahadevan and Connell, 1990, Lin, 1993] and succeeded spectacularly on the game of backgammon [Tesauro, 1992, Boyan, 1992]. On the other hand, many sensible implementations have been less successful [Bradtke, 1993, Schraudolph et al., 1994]. Indeed, given the well-established success on backgammon, the absence of similarly impressive results for other games is perhaps an indication that using function approximation in reinforcement learning does not always work well.

In this paper, we demonstrate that the straightforward substitution of function approximators for lookup tables in DP is not robust and, even in very benign cases, may diverge, resulting in an entirely wrong control policy. We then present Grow-Support, a new algorithm designed to converge robustly. Grow-Support grows a collection of states over which function approximation is stable. One-step backups based on Bellman error are not used; instead, values are assigned by performing "rollouts": explicit simulations with a greedy policy. We discuss potential computational advantages of this method and demonstrate its success on some example problems for which the conventional DP algorithm fails.

2 DISCRETE AND SMOOTH VALUE ITERATION

Many popular reinforcement learning algorithms, including Q-learning and TD(0), are based on the dynamic programming algorithm known as value iteration [Watkins, 1989, Sutton, 1988, Barto et al., 1989], which for clarity we will call discrete value iteration.
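For concreteness, discrete value iteration can be sketched in a few lines of Python. The 5-state chain below is an illustrative stand-in of our own (not a domain from the paper), with unit step costs and an absorbing goal at the right end:

```python
# Discrete (tabular) value iteration for a deterministic shortest-path
# problem: J*(x) = minimum total cost of reaching the absorbing goal.
# The 5-state chain here is purely illustrative.
N_STATES = 5          # states 0..4; state 4 is the absorbing goal
ACTIONS = (-1, +1)    # step left or right along the chain
STEP_COST = 1.0

def next_state(x, a):
    # deterministic transitions, clipped at the chain's ends
    return min(max(x + a, 0), N_STATES - 1)

def value_iteration(tol=1e-9):
    J = [0.0] * N_STATES                       # lookup-table value function
    while True:
        delta = 0.0
        for x in range(N_STATES - 1):          # the goal state keeps J = 0
            best = min(STEP_COST + J[next_state(x, a)] for a in ACTIONS)
            delta = max(delta, abs(best - J[x]))
            J[x] = best
        if delta < tol:
            return J

J = value_iteration()   # J == [4.0, 3.0, 2.0, 1.0, 0.0]: steps-to-goal times cost
```

The greedy policy recovered from J (pick the action minimizing step cost plus J of the successor) is then optimal, which is exactly the property that breaks down once the lookup table is replaced by a fitter.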
Discrete value iteration takes as input a complete model of the world as a Markov Decision Task, and computes the optimal value function J*:

    J*(x) = the minimum possible sum of future costs starting from x

To assure that J* is well-defined, we assume here that costs are nonnegative and that some absorbing goal state, with all future costs 0, is reachable from every state. For simplicity we also assume that state transitions are deterministic. Note that J* and the world model together specify a "greedy" policy which is optimal for the domain:

    optimal action from state x = argmin_{a ∈ A} ( COST(x, a) + J*(NEXT-STATE(x, a)) )

We now consider extending discrete value iteration to the continuous case: we replace the lookup table over all states with a function approximator trained over a sample of states. The smooth value iteration algorithm is given in the appendix. Convergence is no longer guaranteed; we instead recognize four possible classes of behavior:

good convergence: The function approximator accurately represents the intermediate value functions at each iteration (that is, after m iterations, the value function correctly represents the cost of the cheapest m-step path), and successfully converges to the optimal J* value function.

lucky convergence: The function approximator does not accurately represent the intermediate value functions at each iteration; nevertheless, the algorithm manages to converge to a value function whose greedy policy is optimal.

bad convergence: The algorithm converges, i.e. the target J-values for the N training points stop changing, but the resulting value function and policy are poor.

divergence: Worst of all: small fitter errors may become magnified from one iteration to the next, resulting in a value function which never stops changing.
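The smooth value iteration loop of the appendix replaces the lookup table with a fitter that is retrained every iteration. A minimal sketch follows, assuming an illustrative 1-D domain (state x in [0, 1], goal region x >= 0.95, steps of 0.05 costing 0.5) and ordinary least-squares as the fitter FITJ; the domain, sample size, and fitter are our own choices for the sketch, not the paper's experimental setup:

```python
import numpy as np

STEP, COST, N_ITERS = 0.05, 0.5, 100
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=64)        # fixed training sample of states

def next_state(x, a):                      # actions a in {-1, +1}
    return float(np.clip(x + a * STEP, 0.0, 1.0))

def fit_linear(x, j):
    """FITJ: least-squares line through the (state, target-value) pairs."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    w, _, _, _ = np.linalg.lstsq(A, j, rcond=None)
    return lambda q: w[0] * q + w[1]

J = np.zeros_like(xs)                      # J^(0)[i] := 0
for _ in range(N_ITERS):
    fitj = fit_linear(xs, J)               # train the fitter on current targets
    # one-step DP backup through the *fitted* values, not through a table
    J = np.array([0.0 if x >= 0.95 else
                  min(COST + fitj(next_state(x, a)) for a in (-1, +1))
                  for x in xs])
```

Nothing in this loop bounds the fitter's error, which is precisely why any of the four behaviors above can occur; swapping quadratic features or a neural net in for `fit_linear` changes which one you get.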
The hope is that the intermediate value functions will be smooth and we will achieve "good convergence." Unfortunately, our experiments have generated all four of these behaviors, and the divergent behavior occurs frequently, even for quite simple problems.

2.1 DIVERGENCE IN SMOOTH VALUE ITERATION

We have run simulations in a variety of domains, including a continuous gridworld, a car-on-the-hill problem with nonlinear dynamics, and tic-tac-toe versus a stochastic opponent, and using a variety of function approximators, including polynomial regression, backpropagation, and local weighted regression. In our experiments, none of these function approximators was immune from divergence.

The first set of results is from the 2-D continuous gridworld, described in Figure 1. By quantizing the state space into a 100 x 100 grid, we can compute J* with discrete value iteration, as shown in Figure 2. The optimal value function is exactly linear:

    J*(x, y) = 20 - 10x - 10y

Since J* is linear, one would hope smooth value iteration could converge to it with a function approximator as simple as linear or quadratic regression. However, the intermediate value functions of Figure 2 are not smooth and cannot be fit accurately by a low-order polynomial. Using linear regression on a sample of 256 randomly-chosen states, smooth value iteration took over 500 iterations before "luckily" converging to optimal. Quadratic regression, though it always produces a smaller fit error than linear regression, did not converge (Figure 3). The quadratic function, in trying to both be flat in the middle of state space and bend down toward 0 at the goal corner, must compensate by underestimating the values at the corner opposite the goal.
These underestimates then enlarge on each iteration, as the one-step DP lookaheads erroneously indicate that points can lower their expected cost-to-go by stepping farther away from the goal. The resulting policy is anti-optimal.

Figure 1: In the continuous gridworld domain, the state is a point (x, y) ∈ [0,1]^2. There are four actions corresponding to short steps (length 0.05, cost 0.5) in each compass direction, and the goal region is the upper right-hand corner. J*(x, y) is linear.

Figure 2: Computation of J* by discrete value iteration (iterations 12, 25, and 40).

Figure 3: Divergence of smooth value iteration with quadratic regression, at iterations 17, 43, and 127 (note z-axis).

Figure 4: The 2-D continuous gridworld with puddles, its optimal value function J*(x, y), and a diverging approximation of the value function by Local Weighted Regression at iteration 144 (note z-axis).

Figure 5: The car-on-the-hill domain, with state (pos, vel). When the velocity is below a threshold, the car must reverse up the left hill to gain enough speed to reach the goal, so J* is discontinuous.

Figure 6: Divergence of smooth value iteration with a neural net for car-on-the-hill (iterations 11, 101, and 201). The neural net, a 2-layer MLP with 80 hidden units, was trained for 2000 epochs per iteration.

It may seem as though the divergence of smooth value iteration shown above can be attributed to the global nature of polynomial regression.
In fact, when the domain is made slightly less trivial, the same types of instabilities appear with even a highly local memory-based function approximator such as local weighted regression (LWR) [Cleveland and Devlin, 1988]. Figure 4 shows the continuous gridworld augmented to include two oval "puddles" through which it is costly to step. Although LWR can fit the corresponding J* function nearly perfectly, smooth value iteration with LWR nonetheless reliably diverges. On another two-dimensional domain, the car-on-the-hill (Figure 5), smooth value iteration with LWR did converge, but a neural net trained by backpropagation did not (see Figure 6). Table 1 summarizes our results.

Table 1: Summary of convergence results: Smooth value iteration

Domain            | Linear | Quadratic | LWR     | Backprop
2-D gridworld     | lucky  | diverge   | good    | lucky
2-D puddle world  | -      | -         | diverge | diverge
Car-on-the-hill   | -      | -         | good    | diverge

In light of such experiments, we conclude that the straightforward combination of DP and function approximation is not robust. A general-purpose learning method will require either using a function approximator constrained to be robust during DP [Yee, 1992], or an algorithm which explicitly prevents divergence even in the face of imperfect function approximation, such as the Grow-Support algorithm we present in Section 3.

2.2 RELATED WORK

Theoretically, it is not surprising that inserting a smoothing process into a recursive DP procedure can lead to trouble. In [Thrun and Schwartz, 1993] one case is analyzed with the assumption that errors due to function approximation bias are independently distributed. Another area of theoretical analysis concerns inadequately approximated J* functions.
In [Singh and Yee, 1994] and [Williams, 1993] bounds are derived for the maximum reduction in optimality that can be produced by a given error in function approximation. If a basis function approximator is used, then the reduction can be large [Sabes, 1993]. These results assume generalization from a dataset containing true optimal values; the true reinforcement learning scenario is even harder because each iteration of DP requires its own function approximation.

3 THE GROW-SUPPORT ALGORITHM

The Grow-Support algorithm is designed to construct the optimal value function with a generalizing function approximator while being robust and stable. It recognizes that function approximators cannot always be relied upon to fit the intermediate value functions produced by DP. Instead, it assumes only that the function approximator can represent the final J* function accurately. The specific principles of Grow-Support are these:

1. We maintain a "support" set of states whose final J* values have been computed, starting with goal states, and growing this set out from the goal. The fitter is trained only on these values, which we assume it is capable of fitting.

2. Instead of propagating values by one-step DP backups, we use simulations with the current greedy policy, called "rollouts". They explicitly verify the achievability of a state's cost-to-go estimate before adding that state to the support. In a rollout, the J values are derived from costs of actual paths to the goal, not from the values of the previous iteration's function approximation. This prevents divergence.

3. We take maximum advantage of generalization. Each iteration, we add to the support set any sample state which can, by executing a single action, reach a state that passes the rollout test.
In a discrete environment, this would cause the support set to expand in one-step concentric "shells" back from the goal. But in our continuous case, the function approximator may be able to extrapolate correctly well beyond the support region, and when this happens, we can add many points to the support set at once. This leads to the very desirable behavior that the support set grows in big jumps in regions where the value function is smooth.

Figure 7: Grow-Support with quadratic regression on the gridworld, at iterations 1, 2, and 3 (|Support| = 4, 12, and 256). (Compare Figure 3.)

Figure 8: Grow-Support with LWR on the two-puddle gridworld, at iterations 1, 2, and 5 (|Support| = 3, 213, and 253). (Compare Figure 4.)

Figure 9: Grow-Support with backprop on car-on-the-hill, at iterations 3, 8, and 14 (|Support| = 79, 134, and 206). (Compare Figure 6.)

The algorithm, again restricted to the deterministic case for simplicity, is outlined in the appendix. In Figures 7-9, we illustrate its convergence on the same combinations of domain and function approximator which caused smooth value iteration to diverge. In Figure 8, all but three points are added to the support within only five iterations, and the resulting greedy policy is optimal. In Figure 9, after 14 iterations, the algorithm terminates. Although 50 states near the discontinuity were not added to the support set, the resulting policy is optimal within the support set. Grow-Support converged to a near-optimal policy for all the problems and fitters in Table 1.

The Grow-Support algorithm is more robust than value iteration.
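The rollout test that replaces one-step backups can be sketched as follows; the 5-state chain, unit step costs, and epsilon tolerance are illustrative assumptions of ours, standing in for the RolloutCost subroutine of the appendix:

```python
# Verify a state's cost-to-go estimate by actually running the greedy
# policy to the goal, instead of trusting a one-step backup through the
# fitter.  The chain domain below is illustrative only.
GOAL, N_STATES, STEP_COST = 4, 5, 1.0
ACTIONS = (-1, +1)

def next_state(x, a):
    # deterministic transitions, clipped at the chain's ends
    return min(max(x + a, 0), N_STATES - 1)

def rollout_cost(x, J, eps=1e-6):
    """Follow the greedy policy under J from x; return the realized path
    cost if the goal is reached within budget J(x) + eps, else infinity."""
    budget, cost = J(x) + eps, 0.0
    while x != GOAL:
        if cost > budget:                  # estimate not achievable: reject
            return float("inf")
        a = min(ACTIONS, key=lambda a: STEP_COST + J(next_state(x, a)))
        cost += STEP_COST
        x = next_state(x, a)
    return cost

accurate = lambda x: float(GOAL - x)       # true cost-to-go on this chain
rollout_cost(0, accurate)                  # -> 4.0: estimate verified
rollout_cost(0, lambda x: 0.0)             # -> inf: optimistic estimate rejected
```

A state then enters the support only when some single action reaches a state passing this test, i.e. when the minimum over actions a of COST(x, a) + rollout_cost(NEXT-STATE(x, a), FITJ) is finite, so no value in the training set ever rests on an unverified fit.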
Empirically, it was also seen to be no more computationally expensive (and often much cheaper) despite the overhead of performing rollouts. Reasons for this are (1) the rollout test is not expensive; (2) once a state has been added to the support, its value is fixed and it needs no more computation; and most importantly, (3) the aggressive exploitation of generalization enables the algorithm to converge in very few iterations. However, with a nondeterministic problem, where multiple rollouts are required to assess the accuracy of a prediction, Grow-Support would become more expensive.

It is easy to prove that Grow-Support will always terminate after a finite number of iterations. If the function approximator is inadequate for representing the J* function, Grow-Support may terminate before adding all sample states to the support set. When this happens, we then know exactly which of the sample states are having trouble and which have been learned. This suggests potential schemes for adaptively adding sample states to the support in problematic regions. Investigation of these ideas is in progress.

In conclusion, we have demonstrated that dynamic programming methods may diverge when their tables are replaced by generalizing function approximators. Our Grow-Support algorithm uses rollouts, rather than one-step backups, to assign training values and to keep inaccurate states out of the training set. We believe these principles will contribute substantially to producing practical, robust reinforcement learning.

Acknowledgements

We thank Scott Fahlman, Geoff Gordon, Mary Lee, Michael Littman and Marc Ringuette for their suggestions, and the NDSEG fellowship and NSF Grant IRI-9214873 for their support.

APPENDIX: ALGORITHMS

Smooth Value Iteration(X, G, A, NEXT-STATE, COST, FITJ):

Given: - a finite collection of states X = {x_1, x_2, ...,
x_N} sampled from the continuous state space X ⊂ R^n, and goal region G ⊂ X
       - a finite set of allowable actions A
       - a deterministic transition function NEXT-STATE: X × A → X
       - the 1-step cost function COST: X × A → R
       - a smoothing function approximator FITJ

    iter := 0
    J^(0)[i] := 0   for i = 1 ... N
    repeat
        Train FITJ^(iter) to approximate the training set:
            { x_1 ↦ J^(iter)[1], ..., x_N ↦ J^(iter)[N] }
        iter := iter + 1
        for i := 1 ... N do
            J^(iter)[i] := 0   if x_i ∈ G
            J^(iter)[i] := min_{a ∈ A} ( COST(x_i, a) + FITJ^(iter-1)(NEXT-STATE(x_i, a)) )   otherwise
    until the J array stops changing

subroutine RolloutCost(x, J):
    Starting from state x, follow the greedy policy defined by value function J until
    either reaching the goal, or exceeding a total path cost of J(x) + ε. Then return:
    → the actual total cost of the path, if the goal is reached from x with cost ≤ J(x) + ε
    → ∞, if the goal is not reached within cost J(x) + ε.

Grow-Support(X, G, A, NEXT-STATE, COST, FITJ):
    Given: exactly the same inputs as Smooth Value Iteration.
    SUPPORT := { (x_i ↦ 0) | x_i ∈ G }
    repeat
        Train FITJ to approximate the training set SUPPORT
        for each x_i ∉ SUPPORT do
            c := min_{a ∈ A} [ COST(x_i, a) + RolloutCost(NEXT-STATE(x_i, a), FITJ) ]
            if c < ∞ then
                add (x_i ↦ c) to the training set SUPPORT
    until SUPPORT stops growing or includes all sample points.

References

[Barto et al., 1989] A. Barto, R. Sutton, and C. Watkins. Learning and sequential decision making. Technical Report COINS 89-95, Univ. of Massachusetts, 1989.

[Bellman et al., 1963] R. Bellman, R. Kalaba, and B. Kotkin. Polynomial approximation - a new computational technique in dynamic programming: Allocation processes. Mathematics of Computation, 17, 1963.

[Boyan, 1992] J. A. Boyan.
Modular neural networks for learning context-dependent game strategies. Master's thesis, Cambridge University, 1992.

[Bradtke, 1993] S. J. Bradtke. Reinforcement learning applied to linear quadratic regulation. In S. J. Hanson, J. Cowan, and C. L. Giles, editors, NIPS-5. Morgan Kaufmann, 1993.

[Cleveland and Devlin, 1988] W. S. Cleveland and S. J. Devlin. Locally weighted regression: An approach to regression analysis by local fitting. JASA, 83(403):596-610, September 1988.

[Lin, 1993] L.-J. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, 1993.

[Mahadevan and Connell, 1990] S. Mahadevan and J. Connell. Automatic programming of behavior-based robots using reinforcement learning. Technical report, IBM T. J. Watson Research Center, NY 10598, 1990.

[Sabes, 1993] P. Sabes. Approximating Q-values with basis function representations. In Proceedings of the Fourth Connectionist Models Summer School, 1993.

[Schraudolph et al., 1994] N. Schraudolph, P. Dayan, and T. Sejnowski. Using TD(λ) to learn an evaluation function for the game of Go. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, NIPS-6. Morgan Kaufmann, 1994.

[Singh and Yee, 1994] S. P. Singh and R. Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 1994. Technical Note (to appear).

[Sutton, 1988] R. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3, 1988.

[Tesauro, 1992] G. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8(3/4), May 1992.

[Thrun and Schwartz, 1993] S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School, 1993.

[Watkins, 1989] C. Watkins. Learning from Delayed Rewards.
PhD thesis, Cambridge University, 1989.

[Williams, 1993] R. Williams. Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-13, Northeastern University, 1993.

[Yee, 1992] R. Yee. Abstraction in control learning. Technical Report COINS 92-16, Univ. of Massachusetts, 1992.
", "award": [], "sourceid": 1018, "authors": [{"given_name": "Justin", "family_name": "Boyan", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}