{"title": "Barycentric Interpolators for Continuous Space and Time Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1024, "page_last": 1030, "abstract": null, "full_text": "Barycentric Interpolators for Continuous Space & Time Reinforcement Learning \n\nRemi Munos & Andrew Moore \n\nRobotics Institute, Carnegie Mellon University \n\nPittsburgh, PA 15213, USA. \n\nE-mail: {munos, awm}@cs.cmu.edu \n\nAbstract \n\nIn order to find the optimal control of continuous state-space and time reinforcement learning (RL) problems, we approximate the value function (VF) with a particular class of functions called the barycentric interpolators. We establish sufficient conditions under which a RL algorithm converges to the optimal VF, even when we use approximate models of the state dynamics and the reinforcement functions. \n\n1 INTRODUCTION \n\nIn order to approximate the value function (VF) of a continuous state-space and time reinforcement learning (RL) problem, we define a particular class of functions called the barycentric interpolators, which use an interpolation process based on finite sets of points. This class of functions, including continuous or discontinuous piecewise linear and multi-linear functions, provides us with a general method for designing RL algorithms that converge to the optimal value function. Indeed, these functions permit us to discretize the HJB equation of the continuous control problem by a consistent (and thus convergent) approximation scheme, which is solved by using some model of the state dynamics and the reinforcement functions. \n\nSection 2 defines the barycentric interpolators. Section 3 describes the optimal control problem in the deterministic continuous case. Section 4 states the convergence result for RL algorithms by giving sufficient conditions on the applied model. 
Section 5 gives some computational issues for this method, and Section 6 describes the approximation scheme used here and proves the convergence result. \n\nBarycentric Interpolators for Continuous Reinforcement Learning 1025 \n\n2 DEFINITION OF BARYCENTRIC INTERPOLATORS \n\nLet Σ^δ = {ξ_i}_i be a set of points distributed at some resolution δ (see (4) below) on the state space of dimension d. \n\nFor any state x inside some simplex (ξ_1, ..., ξ_n), we say that x is the barycenter of the {ξ_i}_{i=1..n} inside this simplex, with positive coefficients p(x|ξ_i) of sum 1 called the barycentric coordinates, if x = Σ_{i=1..n} p(x|ξ_i)·ξ_i. \n\nLet V^δ(ξ_i) be the value of the function at the points ξ_i. V^δ is a barycentric interpolator if for any state x which is the barycenter of the points {ξ_i}_{i=1..n} for some simplex (ξ_1, ..., ξ_n), with the barycentric coordinates p(x|ξ_i), we have: \n\nV^δ(x) = Σ_{i=1..n} p(x|ξ_i)·V^δ(ξ_i)   (1) \n\nMoreover, we assume that the simplex (ξ_1, ..., ξ_n) is of diameter O(δ). Let us describe some simple barycentric interpolators: \n\n• Piecewise linear functions defined by some triangulation of the state space (thus defining continuous functions), see figure 1.a, or defined at any x by a linear combination of the values at any (d+1) points (ξ_1, ..., ξ_{d+1}) ∋ x (such functions may be discontinuous at some boundaries), see figure 1.b. \n\n• Piecewise multi-linear functions defined by a multi-linear combination of the 2^d values at the vertices of d-dimensional rectangles, see figure 1.c. In this case as well, we can build continuous interpolations or allow discontinuities at the boundaries of the rectangles. \n\nAn important point is that the convergence result stated in Section 4 does not require the continuity of the function. This permits us to build variable resolution triangulations (see figure 1.b) or grids (figure 1.c) easily. \n\nFigure 1: Some examples of barycentric approximators. 
These are piecewise continuous (a) or discontinuous (b) linear, or multi-linear (c), interpolators. \n\nRemark 1 In the general case, for a given x, the choice of a simplex (ξ_1, ..., ξ_n) ∋ x is not unique (see the two sets of grey and black points in figures 1.b and 1.c), and once the simplex (ξ_1, ..., ξ_n) ∋ x is defined, if n > d+1 (for example in figure 1.c), then the choice of the barycentric coordinates p(x|ξ_i) is also not unique. \n\nRemark 2 Depending on the interpolation method we use, the time needed for computing the values will vary. Following [Dav96], the continuous multi-linear interpolation must process 2^d values, whereas the linear continuous interpolation inside a simplex processes (d+1) values in O(d log d) time. \n\nIn comparison to [Gor95], the functions used here are averagers that satisfy the barycentric interpolation property (1). This additional geometric constraint permits us to prove the consistency (see (15) below) of the approximation scheme, and thus the convergence to the optimal value in the continuous time case. \n\n3 THE OPTIMAL CONTROL PROBLEM \n\nLet us describe the optimal control problem in the deterministic and discounted case for continuous state-space and time variables, and define the value function that we intend to approximate. We consider a dynamical system whose state dynamics depends on the current state x(t) ∈ O (the state space, with O an open subset of R^d) and control u(t) ∈ U (a compact subset) by the differential equation: \n\ndx/dt = f(x(t), u(t))   (2) \n\nFrom equation (2), the choice of an initial state x and a control function u(t) leads to a unique trajectory x(t) (see figure 2). Let τ be the exit time from O (with the convention that if x(t) always stays in O, then τ = ∞). 
Then, we define the functional J as the discounted cumulative reinforcement: \n\nJ(x; u(·)) = ∫_0^τ γ^t·r(x(t), u(t)) dt + γ^τ·R(x(τ)) \n\nwhere r(x, u) is the running reinforcement and R(x) the boundary reinforcement. γ is the discount factor (0 ≤ γ < 1). We assume that f, r and R are bounded and Lipschitzian, and that the boundary ∂O is C^2. \n\nRL uses the method of Dynamic Programming (DP), which introduces the value function (VF): the maximal value of J as a function of the initial state x: \n\nV(x) = sup_{u(·)} J(x; u(·)). \n\nFrom the DP principle, we deduce that V satisfies a first-order differential equation, called the Hamilton-Jacobi-Bellman (HJB) equation (see [FS93] for a survey): \n\nTheorem 1 If V is differentiable at x ∈ O, let DV(x) be the gradient of V at x; then the following HJB equation holds at x: \n\nH(V, DV, x) := V(x)·ln γ + sup_{u∈U} [DV(x)·f(x, u) + r(x, u)] = 0   (3) \n\nThe challenge of RL is to get a good approximation of the VF, because from V we can deduce the optimal control: for state x, the control u*(x) that realizes the supremum in the HJB equation provides an optimal (feedback) control law. \n\nThe following hypothesis is a sufficient condition for V to be continuous within O (see [Bar94]) and is required for proving the convergence result of the next section. \n\nHyp 1: For x ∈ ∂O, let n(x) be the outward normal of O at x; we assume that: \n- If ∃u ∈ U s.t. f(x, u)·n(x) ≤ 0, then ∃v ∈ U s.t. f(x, v)·n(x) < 0. \n- If ∃u ∈ U s.t. f(x, u)·n(x) ≥ 0, then ∃v ∈ U s.t. f(x, v)·n(x) > 0. \n\nThis means that at the states (if there exist any) where some trajectory is tangent to the boundary, there exists, for some control, a trajectory strictly coming inside and one strictly leaving the state space. 
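The continuous problem of Section 3 can be simulated directly. Below is a minimal sketch that Euler-discretizes the dynamics (2) and accumulates the discounted return J; the one-dimensional state space O = (0, 1), the dynamics f(x, u) = u, the reinforcements r(x, u) = -u² and R(x) = x, and all constants are assumptions made up for this illustration, not taken from the paper.

```python
# Illustrative only: a hypothetical 1-D problem (f, r, R and all constants
# are assumptions for this sketch, not from the paper).
GAMMA = 0.9          # discount factor gamma
DT = 1e-3            # Euler integration step

def f(x, u):         # state dynamics dx/dt = f(x, u)
    return u

def r(x, u):         # running reinforcement
    return -u * u

def R(x):            # boundary reinforcement
    return x

def rollout_return(x0, policy, t_max=50.0):
    """Euler estimate of J(x; u(.)) = int_0^tau gamma^t r dt + gamma^tau R(x(tau))."""
    x, t, J = x0, 0.0, 0.0
    while t < t_max:
        if not (0.0 < x < 1.0):           # exit time tau from O = (0, 1) reached
            return J + GAMMA ** t * R(x)  # discounted boundary reinforcement
        u = policy(x)
        J += GAMMA ** t * r(x, u) * DT    # accumulate discounted running reinforcement
        x += f(x, u) * DT                 # Euler step of the dynamics (2)
        t += DT
    return J                              # trajectory stayed in O (tau = infinity)

J = rollout_return(0.5, lambda x: 1.0)    # constant control u = 1
```

The value function V(x) is then the supremum of such returns over control functions, which motivates the discretized dynamic-programming treatment of the following sections.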
Figure 2: The state space and the set of points Σ^δ (the black dots belong to the interior and the white ones to the boundary). The value at some point ξ is updated, at step n, by the discounted value at point η_n ∈ (ξ_1, ξ_2, ξ_3). The main requirement for convergence is that the points η_n approximate η in the sense p(η_n|ξ_i) = p(η|ξ_i) + o(δ) (i.e. the η_n belong to the grey area). \n\n4 THE CONVERGENCE RESULT \n\nLet us introduce the set of points Σ^δ = {ξ_i}_i, composed of the interior (Σ^δ ∩ O) and the boundary (∂Σ^δ = Σ^δ \\ O), such that its convex hull covers the state space O, and performing a discretization at some resolution δ: \n\n∀x ∈ O, inf_{ξ_i ∈ Σ^δ∩O} ||x − ξ_i|| ≤ δ   and   ∀x ∈ ∂O, inf_{ξ_j ∈ ∂Σ^δ} ||x − ξ_j|| ≤ δ   (4) \n\nMoreover, we approximate the control space U by some finite control spaces U^δ ⊂ U such that for δ ≤ δ', U^{δ'} ⊂ U^δ and lim_{δ→0} U^δ = U. \n\nWe would like to update the value of any: \n- interior point ξ ∈ Σ^δ ∩ O with the discounted values at state η_n(ξ, u) (figure 2): \n\nV^δ_{n+1}(ξ) ← sup_{u∈U^δ} [γ^{τ_n(ξ,u)}·V^δ_n(η_n(ξ, u)) + τ_n(ξ, u)·r_n(ξ, u)]   (5) \n\nfor some state η_n(ξ, u), some time delay τ_n(ξ, u) and some reinforcement r_n(ξ, u); \n- boundary point ξ ∈ ∂Σ^δ with some terminal reinforcement R_n(ξ): \n\nV^δ_{n+1}(ξ) ← R_n(ξ)   (6) \n\nThe following theorem states that the values V^δ_n computed by a RL algorithm using the model (because of some a priori partial uncertainty of the state dynamics and the reinforcement functions) η_n(ξ, u), τ_n(ξ, u), r_n(ξ, u) and R_n(ξ) converge to the optimal value function as the number of iterations n → ∞ and the resolution δ → 0. 
Let us define the state η(ξ, u) (see figure 2): \n\nη(ξ, u) = ξ + τ(ξ, u)·f(ξ, u)   (7) \n\nfor some time delay τ(ξ, u) (with k_1·δ ≤ τ(ξ, u) ≤ k_2·δ for some constants k_1 > 0 and k_2 > 0), and let p(η|ξ_i) (resp. p(η_n|ξ_i)) be the barycentric coordinates of η inside a simplex containing it (resp. of η_n inside the same simplex). We will write η, η_n, τ, r, ..., instead of η(ξ, u), η_n(ξ, u), τ(ξ, u), r(ξ, u), ... when no confusion is possible. \n\nTheorem 2 Assume that the hypotheses of the previous sections hold, and that for any resolution δ, we use barycentric interpolators V^δ defined on state spaces Σ^δ (satisfying (4)) such that all points of Σ^δ ∩ O are regularly updated with rule (5) and all points of ∂Σ^δ are updated with rule (6) at least once. Suppose that η_n, τ_n, r_n and R_n approximate η, τ, r and R in the sense: \n\n∀ξ_i, p(η_n|ξ_i) = p(η|ξ_i) + o(δ)   (8) \nτ_n = τ + o(δ^2)   (9) \nr_n = r + o(δ)   (10) \nR_n = R + o(δ)   (11) \n\nThen we have lim_{n→∞, δ→0} V^δ_n = V uniformly on any compact Ω ⊂ O (i.e. ∀ε > 0, ∀Ω compact ⊂ O, ∃Δ, ∃N such that ∀δ ≤ Δ, ∀n ≥ N, sup_{Σ^δ∩Ω} |V^δ_n − V| ≤ ε). \n\nRemark 3 For a given value of δ, the rule (5) is not a DP updating rule for some Markov Decision Problem (MDP), since the values η_n, τ_n, r_n depend on n. This point is important in the RL framework, since it allows on-line improvement of the model of the state dynamics and the reinforcement functions. \n\nRemark 4 This result extends the previous results of convergence obtained by Finite-Element or Finite-Difference methods (see [Mun97]). \n\nThis theoretical result can be applied by starting from a rough Σ^δ (high δ) and by combining with the iteration process (n → ∞) some learning process of the model (η_n → η) and an increasing process of the number of points (δ → 0). 
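As a concrete, deliberately simplified illustration of rules (5) and (6), the sketch below runs them on a hypothetical one-dimensional problem with an exact model (η_n = η, τ_n = τ, r_n = r, R_n = R) on a regular grid, where the simplices are the intervals [ξ_i, ξ_{i+1}] and the barycentric coordinates reduce to linear-interpolation weights. The dynamics, reinforcements, grid and constants are all assumptions made up for this example, not taken from the paper.

```python
# Toy 1-D instance of update rules (5) and (6); f, r, R and all constants
# are illustrative assumptions.
GAMMA = 0.9
DELTA = 0.01                          # grid resolution delta
XI = [i * DELTA for i in range(101)]  # points xi_i covering [0, 1]
U = [-1.0, 1.0]                       # finite control set U^delta

def f(x, u): return u                 # state dynamics
def r(x, u): return -0.5              # running reinforcement
def R(x):    return 1.0 if x >= 1.0 else 0.0  # boundary reinforcement

def interpolate(V, x):
    """Barycentric (here: linear) interpolation V(x) = sum_i p(x|xi_i) V(xi_i)."""
    x = min(max(x, 0.0), 1.0)
    i = min(int(x / DELTA), len(XI) - 2)
    p = (x - XI[i]) / DELTA           # barycentric coordinate of x in [xi_i, xi_{i+1}]
    return (1.0 - p) * V[i] + p * V[i + 1]

def sweep(V, tau=DELTA):
    """Apply rule (5) at every interior point and rule (6) at the boundary."""
    W = V[:]
    for i, xi in enumerate(XI):
        if i == 0 or i == len(XI) - 1:
            W[i] = R(xi)                                        # rule (6)
        else:
            W[i] = max(GAMMA ** tau                             # rule (5)
                       * interpolate(V, xi + tau * f(xi, u))
                       + tau * r(xi, u) for u in U)
    return W

V = [0.0] * len(XI)
for _ in range(2000):                 # regular updates of all points
    V = sweep(V)
```

Here the optimal behaviour is to head for the right boundary (reward 1) while paying the running cost, and the computed values increase monotonically toward it.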
5 COMPUTATIONAL ISSUES \n\nFrom (8) we deduce that the method will also converge if we use an approximate barycentric interpolator, defined at any state x ∈ (ξ_1, ..., ξ_n) by the value of the barycentric interpolator at some state x' ∈ (ξ_1, ..., ξ_n) such that p(x'|ξ_i) = p(x|ξ_i) + o(δ) (see figure 3). \n\nFigure 3: The linear function and the approximation error around it (the grey area). The value of the approximate linear function plotted here at some state x is equal to the value of the linear one at x'. Any such approximate barycentric interpolator can be used in (5). \n\nThe fact that we need not be completely accurate can be used to our advantage. First, the computation of barycentric coordinates can use very fast approximate matrix methods. Second, the model we use to integrate the dynamics need not be perfect. We can make an o(δ^2) error, which is useful if we are learning a model from data: we need simply arrange not to gather more data than is necessary for the current δ. For example, if we use nearest neighbor for our dynamics learning, we need to ensure enough data so that every observation is o(δ^2) from its nearest neighbor. If we use local regression, then a mere o(δ) density is all that is required [Omo87, AMS97]. \n\n6 PROOF OF THE CONVERGENCE RESULT \n\n6.1 Description of the approximation scheme \n\nWe use a convergent scheme derived from Kushner (see [Kus90]) in order to approximate the continuous control problem by a finite MDP. The HJB equation is discretized, at some resolution δ, into the following DP equation: for ξ ∈ Σ^δ ∩ O, \n\nV^δ(ξ) = F^δ[V^δ(·)](ξ) := sup_{u∈U^δ} {γ^τ·Σ_i p(η|ξ_i)·V^δ(ξ_i) + τ·r}   (12) \n\nand for ξ ∈ ∂Σ^δ, V^δ(ξ) = R(ξ). This is a fixed-point equation, and we can prove that, thanks to the discount factor γ, it satisfies the \"strong\" contraction property: \n\nsup_{Σ^δ} |V^δ_{n+1} − V^δ| ≤ λ·sup_{Σ^δ} |V^δ_n − V^δ|   for some λ < 1   (13) \n\nfrom which we deduce that there exists exactly one solution V^δ to the DP equation, which can be computed by some value iteration process: for any initial V^δ_0, we iterate V^δ_{n+1} ← F^δ[V^δ_n]. Thus for any resolution δ, the values V^δ_n → V^δ as n → ∞. \n\nMoreover, as V^δ is a barycentric interpolator and from the definition (7) of η, \n\nF^δ[V^δ(·)](ξ) = sup_{u∈U^δ} {γ^τ·V^δ(ξ + τ·f(ξ, u)) + τ·r}   (14) \n\nfrom which we deduce that the scheme F^δ is consistent: in a formal sense, \n\nlim_{δ→0} (F^δ[W](x) − W(x))/τ ∼ H(W, DW, x)   (15) \n\nand we obtain, from the general convergence theorem of [BS91] (and a result of strong unicity obtained from Hyp 1), the convergence of the scheme: V^δ → V as δ → 0. \n\n6.2 Use of the \"weak contraction\" result of convergence \n\nSince in the RL approach used here we only have approximations η_n, τ_n, ... of the true values η, τ, ..., the strong contraction property (13) does not hold any more. However, in previous work ([Mun98]), we have proven convergence under some weakened conditions, recalled here. If the values V^δ_n updated by some algorithm satisfy the \"weak\" contraction property with respect to a solution V^δ of a convergent approximation scheme (such as the previous one (12)): \n\nsup_{Σ^δ∩O} |V^δ_{n+1} − V^δ| ≤ (1 − k·δ)·sup_{Σ^δ} |V^δ_n − V^δ| + o(δ)   (16) \nsup_{∂Σ^δ} |V^δ_{n+1} − V^δ| ≤ o(δ)   (17) \n\nfor some positive constant k (with the notation f(δ) ≤ o(δ) iff ∃g(δ) = o(δ) with f(δ) ≤ g(δ)), then we have lim_{n→∞, δ→0} V^δ_n = V uniformly on any compact Ω ⊂ O (i.e. ∀ε > 0, ∀Ω compact ⊂ O, ∃Δ and N such that ∀δ ≤ Δ, ∀n ≥ N, sup_{Σ^δ∩Ω} |V^δ_n − V| ≤ ε). \n\n6.3 Proof of theorem 2 \n\nWe are going to use the approximations (8), (9), (10) and (11) to deduce that the weak contraction property holds, and then use the result of the previous section to prove theorem 2. 
The proof of (17) is immediate since, from (6) and (11), we have, ∀ξ ∈ ∂Σ^δ: \n\n|V^δ_{n+1}(ξ) − V^δ(ξ)| = |R_n(ξ) − R(ξ)| = o(δ) \n\nNow we need to prove (16). Let us estimate the error E_n(ξ) = V^δ(ξ) − V^δ_n(ξ) between the value V^δ of the DP equation (12) and the values V^δ_n computed by rule (5), after one iteration: \n\nE_{n+1}(ξ) = sup_{u∈U^δ} {Σ_i [γ^τ·p(η|ξ_i)·V^δ(ξ_i) − γ^{τ_n}·p(η_n|ξ_i)·V^δ_n(ξ_i)] + τ·r − τ_n·r_n} \n= sup_{u∈U^δ} {γ^τ·Σ_i [p(η|ξ_i) − p(η_n|ξ_i)]·V^δ(ξ_i) + [γ^τ − γ^{τ_n}]·Σ_i p(η_n|ξ_i)·V^δ(ξ_i) + γ^{τ_n}·Σ_i p(η_n|ξ_i)·[V^δ(ξ_i) − V^δ_n(ξ_i)] + τ_n·[r − r_n] + [τ − τ_n]·r} \n\nBy using (9) (from which we deduce γ^τ = γ^{τ_n} + o(δ^2)) and (10), we deduce: \n\n|E_{n+1}(ξ)| ≤ sup_{u∈U^δ} {γ^τ·|Σ_i [p(η|ξ_i) − p(η_n|ξ_i)]·V^δ(ξ_i)| + γ^{τ_n}·Σ_i p(η_n|ξ_i)·|V^δ(ξ_i) − V^δ_n(ξ_i)|} + o(δ^2)   (18) \n\nFrom the basic properties of the coefficients p(η|ξ_i) and p(η_n|ξ_i) (each set sums to one), we have: \n\nΣ_i [p(η|ξ_i) − p(η_n|ξ_i)]·V^δ(ξ_i) = Σ_i [p(η|ξ_i) − p(η_n|ξ_i)]·[V^δ(ξ_i) − V^δ(ξ)]   (19) \n\nMoreover, |V^δ(ξ_i) − V^δ(ξ)| ≤ |V^δ(ξ_i) − V(ξ_i)| + |V(ξ_i) − V(ξ)| + |V(ξ) − V^δ(ξ)|. From the convergence of the scheme, we have sup_{Σ^δ∩Ω} |V^δ − V| → 0 as δ → 0 for any compact Ω ⊂ O, and from the continuity of V and the fact that the support of the simplex (ξ_i) ∋ η is of diameter O(δ), we have sup_{Σ^δ∩Ω} |V(ξ_i) − V(ξ)| → 0 as δ → 0, and deduce that sup_{Σ^δ∩Ω} |V^δ(ξ_i) − V^δ(ξ)| → 0 as δ → 0. Thus, from (19) and (8), we obtain: \n\n|Σ_i [p(η|ξ_i) − p(η_n|ξ_i)]·V^δ(ξ_i)| = o(δ)   (20) \n\nThe \"weak\" contraction property (16) holds: from the property of the exponential function, γ^{τ_n} ≤ 1 − (τ_n/2)·ln(1/γ) for small values of τ_n; from (9) and the fact that τ ≥ k_1·δ, we deduce that γ^{τ_n} ≤ 1 − (k_1·δ/2)·ln(1/γ) + o(δ^2), and from (18) and (20) we deduce that: \n\n|V^δ_{n+1}(ξ) − V^δ(ξ)| ≤ (1 − k·δ)·sup_{Σ^δ} |V^δ_n(ξ) − V^δ(ξ)| + o(δ) \n\nwith k = (k_1/2)·ln(1/γ), and the property (16) holds. Thus the \"weak contraction\" result of convergence (described in Section 6.2) applies, and convergence occurs. 
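To see why the weak contraction property suffices, note that iterating the sup-error bound s_{n+1} ≤ (1 − k·δ)·s_n + e(δ) gives limsup_n s_n ≤ e(δ)/(k·δ), which tends to 0 as δ → 0 precisely when e(δ) = o(δ). This can be checked numerically; in the sketch below, the residual is taken (for illustration only) to be e(δ) = c·δ², and the constants k, c and s_0 are arbitrary placeholders, not values from the paper.

```python
# Numerical illustration (k, c, s0 are arbitrary placeholders): iterating
# s_{n+1} = (1 - k*delta)*s_n + c*delta**2, an instance of the weak
# contraction (16) whose residual c*delta**2 is o(delta).
def asymptotic_error(delta, k=1.0, c=1.0, s0=1.0, n=100_000):
    s = s0
    for _ in range(n):
        s = (1.0 - k * delta) * s + c * delta ** 2
    return s

# The limiting error is the fixed point c*delta/k, which vanishes as delta -> 0.
errs = [asymptotic_error(d) for d in (0.1, 0.01, 0.001)]
```

Running it shows the asymptotic error shrinking proportionally to δ, mirroring the joint limit n → ∞, δ → 0 of Theorem 2.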
\n\n\"I \n\n~ \n\nFUTURE WORK \n\nThis work proves the convergence to the optimal value as the resolution tends to the \nlimit, but does not provide us with the rate of convergence. Our future work will \nfocus on defining upper bounds of the approximation error, especially for variable \nresolution discretizations, and we will also consider the stochastic case. \n\nACKNOWLEDGMENTS \n\nThis research was sponsored by DASSAULT-AVIATION and CMU. \n\nReferences \n[AMS97] c. G. Atkeson, A. W. Moore, and S. A. Schaal. Locally Weighted Learning. AI \n\nReview, 11:11- 73, April 1997. \n\n[Bar94] Guy Barles. Solutions de viscosite des equations de Hamilton-Ja cobi, volume 17 \n\n[BS91] \n\nof Mathematiques et Applications. Springer-Verlag, 1994. \nGuy Barles and P.E. Souganidis. Convergence of approximation schemes for \nfully nonlinear second order equations. Asymptotic Analysis, 4:271- 283, 1991. \n\n[Dav96] Scott Davies. Multidimensional triangulation and interpolation for reinforcement \n\nlearning. Advances in Neural Information Processing Systems, 8, 1996. \n\n[FS93] Wendell H. Fleming and H. Mete Soner. Controlled Markov Processes and Vis(cid:173)\n\ncosity Solutions. Applications of Mathematics. Springer-Verlag, 1993. \n\n[Gor95] G. Gordon. Stable function approximation in dynamic programming. Interna(cid:173)\n\ntional Conference on Machine Learning, 1995. \n\n[Kus90] Harold J. Kushner. Numerical methods for stochastic control problems in con(cid:173)\n\ntinuous time. SIAM J. Control and Optimization, 28:999- 1048, 1990. \n\n[Mun97] Remi Munos. A convergent reinforcement learning algorithm in the continuous \nInternational Joint Conference on \n\ncase based on a finite difference method. \nA rtificial Intelligence, 1997. \n\n[Mun98] Remi Munos. A general convergence theorem for reinforcement learning in the \n\ncontinuous case. European Conference on Machine Learning, 1998. \n\n[Omo87] S. M. Omohundro. Efficient Algorithms with Neural Network Behaviour. 
Journal of Complex Systems, 1(2):273-347, 1987. ", "award": [], "sourceid": 1565, "authors": [{"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}