{"title": "Reinforcement Learning for Continuous Stochastic Control Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1029, "page_last": 1035, "abstract": null, "full_text": "Reinforcement Learning for Continuous \n\nStochastic Control Problems \n\nRemi Munos \n\nCEMAGREF, LISC, Pare de Tourvoie, \nBP 121, 92185 Antony Cedex, FRANCE. \n\nRerni.Munos@cemagref.fr \n\nPaul Bourgine \n\nEcole Polyteclmique, CREA, \n\n91128 Palaiseau Cedex, FRANCE. \n\nBourgine@poly.polytechnique.fr \n\nAbstract \n\nThis paper is concerned with the problem of Reinforcement Learn(cid:173)\ning (RL) for continuous state space and time stocha.stic control \nproblems. We state the Harnilton-Jacobi-Bellman equation satis(cid:173)\nfied by the value function and use a Finite-Difference method for \ndesigning a convergent approximation scheme. Then we propose a \nRL algorithm based on this scheme and prove its convergence to \nthe optimal solution. \n\n1 \n\nIntroduction to RL in the continuous, stochastic case \n\nThe objective of RL is to find -thanks to a reinforcement signal- an optimal strategy \nfor solving a dynamical control problem. Here we sudy the continuous time, con(cid:173)\ntinuous state-space stochastic case, which covers a wide variety of control problems \nincluding target, viability, optimization problems (see [FS93], [KP95])}or which a \nformalism is the following. The evolution of the current state x(t) E 0 (the state(cid:173)\nspace, with 0 open subset of IRd ), depends on the control u(t) E U (compact subset) \nby a stochastic differential equation, called the state dynamics: \n\ndx = f(x(t), u(t))dt + a(x(t), u(t))dw \n\n(1) \nwhere f is the local drift and a .dw (with w a brownian motion of dimension rand \n(j a d x r-matrix) the stochastic part (which appears for several reasons such as lake \nof precision, noisy influence, random fluctuations) of the diffusion process. 
For initial state x and control u(t), (1) leads to an infinity of possible trajectories x(t). For some trajectory x(t) (see figure 1), let \tau be its exit time from O (with the convention that if x(t) always stays in O, then \tau = \infty). Then we define the functional J of initial state x and control u(.) as the expectation over all trajectories of the discounted cumulative reinforcement:

J(x; u(.)) = E_{x,u(.)} \{ \int_0^\tau \gamma^t r(x(t), u(t)) dt + \gamma^\tau R(x(\tau)) \}

where r(x, u) is the running reinforcement and R(x) the boundary reinforcement. \gamma is the discount factor (0 \le \gamma < 1). In the following, we assume that f, \sigma are of class C^2, r and R are Lipschitzian (with constants L_r and L_R) and the boundary \partial O is C^2.

Figure 1: The state space, the discretized \Sigma^\delta (the square dots) and its frontier \partial\Sigma^\delta (the round ones). A trajectory x_k(t) goes through the neighbourhood of state \xi.

RL uses the method of Dynamic Programming (DP), which generates an optimal (feed-back) control u^*(x) by estimating the value function (VF), defined as the maximal value of the functional J as a function of the initial state x:

V(x) = \sup_{u(.)} J(x; u(.))    (2)

In the RL approach, the state dynamics is unknown to the system; the only available information for learning the optimal control is the reinforcement obtained at the current state. Here we propose a model-based algorithm, i.e. one that learns on-line a model of the dynamics and approximates the value function by successive iterations.

Section 2 states the Hamilton-Jacobi-Bellman equation and uses a Finite-Difference (FD) method derived from Kushner [Kus90] for generating a convergent approximation scheme.
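The functional J can be estimated by Monte-Carlo: discretize \int_0^\tau \gamma^t r dt along simulated trajectories and add \gamma^\tau R(x(\tau)) on exit. This is a sketch under assumed, illustrative dynamics and reinforcements; the sanity check uses r = 1 with no exit, for which J = -1/\ln\gamma in closed form.

```python
import numpy as np

def estimate_J(x0, policy, f, sigma, r, R, in_O, gamma,
               dt=0.01, t_max=20.0, n_traj=10, rng=None):
    """Monte-Carlo estimate of J(x; u(.)) = E[ int_0^tau gamma^t r dt + gamma^tau R ]."""
    rng = np.random.default_rng(0) if rng is None else rng
    total = 0.0
    for _ in range(n_traj):
        x, t, acc = np.asarray(x0, dtype=float), 0.0, 0.0
        while t < t_max:
            if not in_O(x):                    # exit from O: boundary reinforcement
                acc += gamma ** t * R(x)
                break
            u = policy(x)
            acc += gamma ** t * r(x, u) * dt   # running reinforcement
            s = sigma(x, u)
            dw = rng.normal(0.0, np.sqrt(dt), size=s.shape[1])
            x = x + f(x, u) * dt + s @ dw      # Euler-Maruyama step
            t += dt
        total += acc
    return total / n_traj

# Sanity check: r == 1 and no exit give J = int_0^inf gamma^t dt = -1/ln(gamma).
J = estimate_J(x0=[0.0], policy=lambda x: 0.0,
               f=lambda x, u: np.zeros(1), sigma=lambda x, u: 0.1 * np.eye(1),
               r=lambda x, u: 1.0, R=lambda x: 0.0,
               in_O=lambda x: True, gamma=0.5)
```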
In section 3, we propose a RL algorithm based on this scheme and prove its convergence to the VF in appendix A.

2 A Finite Difference scheme

Here we state a second-order nonlinear differential equation (obtained from the DP principle, see [FS93]) satisfied by the value function, called the Hamilton-Jacobi-Bellman equation.

Let the d x d matrix a = \sigma.\sigma' (with ' the transpose of the matrix). We consider the uniformly parabolic case, i.e. we assume that there exists c > 0 such that \forall x \in O, \forall u \in U, \forall y \in \mathbb{R}^d, \sum_{i,j=1}^d a_{ij}(x, u).y_i.y_j \ge c.\|y\|^2. Then V is C^2 (see [Kry80]). Let V_x be the gradient of V and V_{x_i x_j} its second-order partial derivatives.

Theorem 1 (Hamilton-Jacobi-Bellman) The following HJB equation holds:

V(x) \ln\gamma + \sup_{u \in U} [r(x, u) + V_x(x).f(x, u) + \frac{1}{2} \sum_{i,j=1}^d a_{ij}(x, u).V_{x_i x_j}(x)] = 0 for x \in O

Besides, V satisfies the following boundary condition: V(x) = R(x) for x \in \partial O.

Remark 1 The challenge of learning the VF is motivated by the fact that from V we can deduce the following optimal feed-back control policy:

u^*(x) \in \arg\sup_{u \in U} [r(x, u) + V_x(x).f(x, u) + \frac{1}{2} \sum_{i,j=1}^d a_{ij}(x, u).V_{x_i x_j}(x)]

In the following, we assume that O is bounded. Let e_1, ..., e_d be a basis for \mathbb{R}^d. Let the positive and negative parts of a function \phi be: \phi^+ = \max(\phi, 0) and \phi^- = \max(-\phi, 0). For any discretization step \delta, let us consider the lattices: \delta\mathbb{Z}^d = \{\delta.\sum_{i=1}^d j_i e_i\} where j_1, ..., j_d are any integers, and \Sigma^\delta = \delta\mathbb{Z}^d \cap O. Let \partial\Sigma^\delta, the frontier of \Sigma^\delta, denote the set of points \{\xi \in \delta\mathbb{Z}^d \setminus O such that at least one adjacent point \xi \pm \delta e_i \in \Sigma^\delta\} (see figure 1).

Let U^\delta \subset U be a finite control set that approximates U in the sense: \delta \le \delta' \Rightarrow U^{\delta'} \subset U^\delta and \bigcup_\delta U^\delta = U.
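The lattice \Sigma^\delta and its frontier \partial\Sigma^\delta can be built directly from these definitions. The unit square O = (0,1)^2 and the step \delta = 0.25 below are illustrative choices (powers of two keep the floating-point arithmetic exact):

```python
import itertools

def build_lattice(delta, in_O, j_range, d=2):
    """Sigma^delta = delta.Z^d intersected with O; the frontier collects the
    lattice points outside O having an adjacent point xi +/- delta.e_i in Sigma^delta."""
    points = [tuple(delta * ji for ji in j)
              for j in itertools.product(j_range, repeat=d)]
    sigma = {x for x in points if in_O(x)}
    frontier = set()
    for x in points:
        if x in sigma:
            continue
        for i in range(d):
            for s in (1, -1):
                nb = tuple(c + (s * delta if k == i else 0.0)
                           for k, c in enumerate(x))
                if nb in sigma:
                    frontier.add(x)
    return sigma, frontier

in_O = lambda x: 0.0 < x[0] < 1.0 and 0.0 < x[1] < 1.0
sigma, frontier = build_lattice(0.25, in_O, range(-1, 6))
```

With \delta = 0.25 the interior holds the 3 x 3 points {0.25, 0.5, 0.75}^2 and the frontier the 12 edge points of the square; the four corners, having no adjacent interior point, are excluded, exactly as the definition requires.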
Besides, we assume that: Vi = l..d, \n\n(3) \n\nBy replacing the gradient Vx(~) by the forward and backward first-order finite(cid:173)\ndifference quotients: ~;, V(~) = l [V(~ \u00b1 8ei) - V(~)l and VXiXj (~) by the second(cid:173)\norder finite-difference quotients: \n\n~XiXi V(~) \n~;iXj V(~) = 2P[V(~ + 8ei \u00b1 8ej) + V(~ - 8ei =F 8ej) \n\n-b [V(~ + 8ei) + V(,' - 8ei) - 2V(O] \n\n-V(~ + 8ei) - V(~ - 8ei) - V(~ + 8ej) - V(~ - 8ej) + 2V(~)] \n\nin the HJB equation, we obtain the following : for ~ E :\u00a36, \nV6(~)In,+SUPUEUh {r(~,u) + L:~=1 [f:(~,u)'~~iV6(~) - fi-(~,U)'~;iV6(~) \n. V(C) + \" . . (at; (~,'U) ~ + . V(C) _ a~ (~,'U) ~ - . . V(C))] } = 0 \n\n+ aii (~.u) ~ . \n\n2 \n\nX,X,'\" \n\nwJ'l=~ \n\n2 \n\nx,x.1'\" \n\n2 \n\nx,xJ\n\n'\" \n\nKnowing that (~t In,) is an approximation of ( ,l:l.t -1) as ~t tends to 0, we deduce: \nSUPuEUh [,\"'(~'U)L(EEbP(~,U,()V6\u00ab()+T(~,u)r(~,u)] (4) \n(5) \n\nwith T(~, u) \n\nV6(~) \n\nwhich appears as a DP equation for some finite Markovian Decision Process (see \n[Ber87]) whose state space is ~6 and probabilities of transition: \n\np(~,u,~ \u00b1 8ei) \np(~, u, ~ + 8ei \u00b1 8ej) \np(~,u,~ - 8ei \u00b1 8ej) \np(~,u,() \n\n\"'~~r) [28Ift(~, u)1 + aii(~' u) - Lj=l=i laij(~, u)l] , \n\"'~~r)a~(~,u)fori=f:j, \n\"'~~r)a~(~,u) for i =f: j, \no otherwise. \n\n(6) \n\nThanks to a contraction property due to the discount factor\" there exists a unique \nsolution (the fixed-point) V to equation (4) for ~ E :\u00a36 with the boundary condition \nV6(~) = R(~) for ~ E 8:\u00a36. The following theorem (see [Kus90] or [FS93]) insures \nthat V 6 is a convergent approximation scheme. \n\n\f1032 \n\nR. Munos and P. Bourgine \n\nTheorem 2 (Convergence of the FD scheme) V D converges to V as 8 1 0 : \n\nlim /)10 VD(~) = Vex) un~formly on 0 \n\n~-x \n\nRemark 2 Condition (3) insures that the p(~, u, () are positive. If this condition \ndoes not hold, several possibilities to overcome this are described in [Kus90j. 
3 The reinforcement learning algorithm

Here we assume that f is bounded from below. As the state dynamics (f and \sigma) is unknown to the system, we approximate it by building a model \hat{f} and \hat{a} from samples of trajectories x_k(t): we consider series of successive states x_k = x_k(t_k) and y_k = x_k(t_k + \tau_k) such that:
- \forall t \in [t_k, t_k + \tau_k], x(t) \in N(\xi), a neighbourhood of \xi whose diameter is inferior to k_N.\delta for some positive constant k_N,
- the control u is constant for t \in [t_k, t_k + \tau_k],
- \tau_k satisfies k_1.\delta^2 \le \tau_k \le k_2.\delta^2 for some positive k_1 and k_2.

Then incrementally update the model:

\hat{f}_n(\xi, u) = \frac{1}{n} \sum_{k=1}^n \frac{y_k - x_k}{\tau_k}    (7)

\hat{a}_n(\xi, u) = \frac{1}{n} \sum_{k=1}^n \frac{(y_k - x_k - \tau_k.\hat{f}_n(\xi, u)).(y_k - x_k - \tau_k.\hat{f}_n(\xi, u))'}{\tau_k}    (8)

and compute the approximated time \hat{\tau}(\xi, u) and the approximated probabilities of transition \hat{p}(\xi, u, \zeta) by replacing f and a by \hat{f} and \hat{a} in (5) and (6).

We obtain the following updating rule for the V^\delta-value of state \xi:

V_{n+1}^\delta(\xi) = \sup_{u \in U^\delta} [\gamma^{\hat{\tau}(\xi,u)} \sum_\zeta \hat{p}(\xi, u, \zeta) V_n^\delta(\zeta) + \hat{\tau}(\xi, u).r(\xi, u)]    (9)

which can be used as an off-line (synchronous, Gauss-Seidel, asynchronous) or on-line (for example by updating V_n^\delta(\xi) as soon as a trajectory exits from the neighbourhood of \xi) DP algorithm (see [BBS95]).

Besides, when a trajectory hits the boundary \partial O at some exit point x_k(\tau), then update the closest state \xi \in \partial\Sigma^\delta with:

V_{n+1}^\delta(\xi) = R(x_k(\tau))    (10)

Theorem 3 (Convergence of the algorithm) Suppose that the model as well as the V^\delta-value of every state \xi \in \Sigma^\delta and control u \in U^\delta are regularly updated (respectively with (8) and (9)) and that every state \xi \in \partial\Sigma^\delta is updated with (10) at least once. Then \forall\epsilon > 0, \exists\Delta such that \forall\delta \le \Delta, \exists N, \forall n \ge N,

\sup_{\xi \in \Sigma^\delta} |V_n^\delta(\xi) - V(\xi)| \le \epsilon with probability 1

4 Conclusion

This paper presents a model-based RL algorithm for continuous stochastic control problems.
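As a side illustration of section 3, the batch form of the model-estimation rules (7) and (8) can be checked on synthetic one-dimensional transitions; the true drift f = 2, diffusion \sigma = 0.5 (so a = 0.25) and step \tau_k = 0.01 below are made-up test values, not from the paper.

```python
import numpy as np

def estimate_model(xs, ys, taus):
    """Rules (7)-(8) in 1-D: f_n is the mean displacement rate; a_n is the
    empirical second moment of the residuals, scaled by 1/tau_k."""
    xs, ys, taus = map(np.asarray, (xs, ys, taus))
    f_n = np.mean((ys - xs) / taus)            # rule (7)
    resid = ys - xs - taus * f_n
    a_n = np.mean(resid**2 / taus)             # rule (8), scalar case
    return f_n, a_n

# Synthetic Euler transitions from dx = f dt + sigma dw with f=2, sigma=0.5.
rng = np.random.default_rng(0)
true_f, true_sigma, tau, n = 2.0, 0.5, 0.01, 20000
xs = np.zeros(n)
ys = xs + true_f * tau + true_sigma * np.sqrt(tau) * rng.normal(size=n)
f_n, a_n = estimate_model(xs, ys, np.full(n, tau))
```

Since y_k - x_k - \tau_k f = \sigma\sqrt{\tau_k} \epsilon_k for an Euler step, the scaled residuals have second moment \sigma^2 = a, which is what (8) recovers as n grows.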
A model of the dynamics is approximated by the mean and the covariance of the successive states. Then a RL updating rule based on a convergent FD scheme is deduced and, under the hypothesis of an adequate exploration, the convergence to the optimal solution is proved as the discretization step \delta tends to 0 and the number of iterations tends to infinity. This result is to be compared to the model-free RL algorithm for the deterministic case in [Mun97]. An interesting possible future work would be to consider model-free algorithms in the stochastic case, for which a Q-learning rule (see [Wat89]) could be relevant.

A Appendix: proof of the convergence

Let M_f, M_\sigma, M_{f_x} and M_{\sigma_x} be the upper bounds of f, \sigma, f_x and \sigma_x, and m_f the lower bound of f. Let E^\delta = \sup_{\xi \in \Sigma^\delta} |V^\delta(\xi) - V(\xi)| and E_n^\delta = \sup_{\xi \in \Sigma^\delta} |V_n^\delta(\xi) - V^\delta(\xi)|.

A.1 Estimation error of the model \hat{f}_n and \hat{a}_n and of the probabilities \hat{p}_n

Suppose that the trajectory x_k(t) occurred for some occurrence w_k(t) of the Brownian motion: x_k(t) = x_k + \int_{t_k}^t f(x_k(s), u) ds + \int_{t_k}^t \sigma(x_k(s), u) dw_k. Then we consider a trajectory z_k(t) starting from \xi at t_k and following the same Brownian motion: z_k(t) = \xi + \int_{t_k}^t f(z_k(s), u) ds + \int_{t_k}^t \sigma(z_k(s), u) dw_k.

Let z_k = z_k(t_k + \tau_k). Then (y_k - x_k) - (z_k - \xi) = \int_{t_k}^{t_k+\tau_k} [f(x_k(t), u) - f(z_k(t), u)] dt + \int_{t_k}^{t_k+\tau_k} [\sigma(x_k(t), u) - \sigma(z_k(t), u)] dw_k. Thus, from the C^1 property of f and \sigma,

\|(y_k - x_k) - (z_k - \xi)\| \le (M_{f_x} + M_{\sigma_x}).k_N.\tau_k.\delta    (11)

The diffusion process has the following property (see for example the Ito-Taylor majoration in [KP95]): E_x[z_k] = \xi + \tau_k.f(\xi, u) + o(\tau_k) which, from (7), is equivalent to: E_x[\frac{z_k - \xi}{\tau_k}] = f(\xi, u) + o(\delta). Thus, from the law of large numbers and (11):

\limsup_{n \to \infty} \|\hat{f}_n(\xi, u) - f(\xi, u)\| = \limsup_{n \to \infty} \|\frac{1}{n} \sum_{k=1}^n [\frac{y_k - x_k}{\tau_k} - \frac{z_k - \xi}{\tau_k}]\| + o(\delta) \le (M_{f_x} + M_{\sigma_x}).k_N.\delta + o(\delta) = O(\delta) w.p.
1    (12)

Besides, the diffusion process has the following property (again see [KP95]): E_x[(z_k - \xi)(z_k - \xi)'] = a(\xi, u).\tau_k + f(\xi, u).f(\xi, u)'.\tau_k^2 + o(\tau_k^2) which, from (7), is equivalent to: E_x[\frac{(z_k - \xi - \tau_k f(\xi, u)).(z_k - \xi - \tau_k f(\xi, u))'}{\tau_k}] = a(\xi, u) + O(\delta^2). Let r_k = z_k - \xi - \tau_k.f(\xi, u) and \hat{r}_k = y_k - x_k - \tau_k.\hat{f}_n(\xi, u), which satisfy (from (11) and (12)):

\|\hat{r}_k - r_k\| = (M_{f_x} + M_{\sigma_x}).\tau_k.k_N.\delta + \tau_k.o(\delta)    (13)

From the definition of \hat{a}_n(\xi, u), we have: \hat{a}_n(\xi, u) - a(\xi, u) = \frac{1}{n} \sum_{k=1}^n \frac{\hat{r}_k.\hat{r}_k'}{\tau_k} - E_x[\frac{r_k.r_k'}{\tau_k}] + O(\delta^2) and, from the law of large numbers, (12) and (13), we have:

\limsup_{n \to \infty} \|\hat{a}_n(\xi, u) - a(\xi, u)\| = \limsup_{n \to \infty} \|\frac{1}{n} \sum_{k=1}^n \frac{\hat{r}_k.\hat{r}_k' - r_k.r_k'}{\tau_k}\| + O(\delta^2) \le \limsup_{n \to \infty} \|\hat{r}_k - r_k\|.\frac{1}{\tau_k}.(\|r_k\| + \|\hat{r}_k\|) + O(\delta^2) = O(\delta^2)

Thus:

\|\hat{f}_n(\xi, u) - f(\xi, u)\| \le k_f.\delta w.p. 1
\|\hat{a}_n(\xi, u) - a(\xi, u)\| \le k_a.\delta^2 w.p. 1    (14)

Besides, from (5) and (14), we have:

|\tau(\xi, u) - \hat{\tau}_n(\xi, u)| \le \frac{d.(k_f.\delta^2 + d.k_a.\delta^2)}{(d.m_f.\delta)^2}.\delta^2 \le k_\tau.\delta^2    (15)

and, from a property of the exponential function,

|\gamma^{\tau(\xi,u)} - \gamma^{\hat{\tau}_n(\xi,u)}| \le k_\tau.\ln\frac{1}{\gamma}.\delta^2    (16)

We can deduce from (14) that:

\limsup_{n \to \infty} |p(\xi, u, \zeta) - \hat{p}_n(\xi, u, \zeta)| \le k_p.\delta w.p. 1    (17)

A.2 Estimation of |V_{n+1}^\delta(\xi) - V^\delta(\xi)|

After having updated V_n^\delta(\xi) with rule (9), let A denote the difference |V_{n+1}^\delta(\xi) - V^\delta(\xi)|. From (4), (9) and (8),

A \le | \gamma^{\tau(\xi,u)} \sum_\zeta [p(\xi, u, \zeta) - \hat{p}(\xi, u, \zeta)] V^\delta(\zeta) + (\gamma^{\tau(\xi,u)} - \gamma^{\hat{\tau}(\xi,u)}) \sum_\zeta \hat{p}(\xi, u, \zeta) V^\delta(\zeta) + \gamma^{\hat{\tau}(\xi,u)} \sum_\zeta \hat{p}(\xi, u, \zeta) [V^\delta(\zeta) - V_n^\delta(\zeta)] + \sum_\zeta \hat{p}(\xi, u, \zeta).\tau(\xi, u).[r(\xi, u) - \hat{r}(\xi, u)] + \sum_\zeta \hat{p}(\xi, u, \zeta).[\tau(\xi, u) - \hat{\tau}(\xi, u)].r(\xi, u) | for all u \in U^\delta

As V is differentiable we have: V(\zeta) = V(\xi) + V_x.(\zeta - \xi) + o(\|\zeta - \xi\|).
Let us define a linear function \tilde{V} such that: \tilde{V}(x) = V(\xi) + V_x.(x - \xi). Then we have: [p(\xi, u, \zeta) - \hat{p}(\xi, u, \zeta)].V^\delta(\zeta) = [p(\xi, u, \zeta) - \hat{p}(\xi, u, \zeta)].[V^\delta(\zeta) - \tilde{V}(\zeta)] + [p(\xi, u, \zeta) - \hat{p}(\xi, u, \zeta)].\tilde{V}(\zeta), thus: \sum_\zeta [p(\xi, u, \zeta) - \hat{p}(\xi, u, \zeta)] V^\delta(\zeta) = k_p.E^\delta.\delta + \sum_\zeta [p(\xi, u, \zeta) - \hat{p}(\xi, u, \zeta)].[\tilde{V}(\zeta) + o(\delta)] = [\tilde{V}(\eta) - \tilde{V}(\hat{\eta})] + k_p.E^\delta.\delta + o(\delta) = [\tilde{V}(\eta) - \tilde{V}(\hat{\eta})] + o(\delta) with \eta = \sum_\zeta p(\xi, u, \zeta).(\zeta - \xi) and \hat{\eta} = \sum_\zeta \hat{p}(\xi, u, \zeta).(\zeta - \xi).

Besides, from the convergence of the scheme (theorem 2), we have E^\delta.\delta = o(\delta). From the linearity of \tilde{V}, |\tilde{V}(\eta) - \tilde{V}(\hat{\eta})| \le \|\eta - \hat{\eta}\|.M_{V_x} \le 2.k_p.\delta^2. Thus |\sum_\zeta [p(\xi, u, \zeta) - \hat{p}(\xi, u, \zeta)] V^\delta(\zeta)| = o(\delta) and, from (15), (16) and the Lipschitz property of r,

A = \gamma^{\tau(\xi,u)}.|\sum_\zeta \hat{p}(\xi, u, \zeta) [V^\delta(\zeta) - V_n^\delta(\zeta)]| + o(\delta).

As \gamma^{\hat{\tau}(\xi,u)} \le 1 - \frac{\hat{\tau}(\xi,u)}{2} \ln\frac{1}{\gamma} \le 1 - \frac{\tau(\xi,u) - k_\tau.\delta^2}{2} \ln\frac{1}{\gamma} \le 1 - (\frac{\delta}{2d(M_f + d.M_a)} - \frac{k_\tau}{2}.\delta^2) \ln\frac{1}{\gamma}, we have:

A = (1 - k.\delta).E_n^\delta + o(\delta)    (18)

with k = \frac{\ln\frac{1}{\gamma}}{2d(M_f + d.M_a)}.

A.3 A sufficient condition for \sup_{\xi \in \Sigma^\delta} |V_n^\delta(\xi) - V^\delta(\xi)| \le \epsilon_2

Let us suppose that, for all \xi \in \Sigma^\delta, the following conditions hold for some \alpha > 0:

E_n^\delta > \epsilon_2 \Rightarrow |V_{n+1}^\delta(\xi) - V^\delta(\xi)| \le E_n^\delta - \alpha    (19)
E_n^\delta \le \epsilon_2 \Rightarrow |V_{n+1}^\delta(\xi) - V^\delta(\xi)| \le \epsilon_2    (20)

From the hypothesis that all states \xi \in \Sigma^\delta are regularly updated, there exists an integer m such that at stage n + m all the \xi \in \Sigma^\delta have been updated at least once since stage n. Besides, since all \xi \in \partial\Sigma^\delta are updated at least once with rule (10), \forall\xi \in \partial\Sigma^\delta, |V_n^\delta(\xi) - V^\delta(\xi)| = |R(x_k(\tau)) - R(\xi)| \le 2.L_R.\delta \le \epsilon_2 for any \delta \le \Delta_3 = \frac{\epsilon_2}{2.L_R}. Thus, from (19) and (20), we have:

E_n^\delta > \epsilon_2 \Rightarrow E_{n+m}^\delta \le E_n^\delta - \alpha
E_n^\delta \le \epsilon_2 \Rightarrow E_{n+m}^\delta \le \epsilon_2

Thus there exists N such that: \forall n \ge N, E_n^\delta \le \epsilon_2.

A.4 Convergence of the algorithm

Let us prove theorem 3. For any \epsilon > 0, let us consider \epsilon_1 > 0 and \epsilon_2 > 0 such that \epsilon_1 + \epsilon_2 = \epsilon.
Assume E_n^\delta > \epsilon_2; then from (18), A = E_n^\delta - k.\delta.\epsilon_2 + o(\delta) \le E_n^\delta - k.\delta.\frac{\epsilon_2}{2} for \delta \le \Delta_1. Thus (19) holds for \alpha = k.\delta.\frac{\epsilon_2}{2}. Suppose now that E_n^\delta \le \epsilon_2. From (18), A \le (1 - k.\delta).\epsilon_2 + o(\delta) \le \epsilon_2 for \delta \le \Delta_2, and condition (20) is true.

Thus for \delta \le \min\{\Delta_1, \Delta_2, \Delta_3\}, the sufficient conditions (19) and (20) are satisfied. So there exists N such that, for all n \ge N, E_n^\delta \le \epsilon_2. Besides, from the convergence of the scheme (theorem 2), there exists \Delta_0 such that \forall\delta \le \Delta_0, \sup_{\xi \in \Sigma^\delta} |V^\delta(\xi) - V(\xi)| \le \epsilon_1.

Thus for \delta \le \min\{\Delta_0, \Delta_1, \Delta_2, \Delta_3\}, \exists N, \forall n \ge N,

\sup_{\xi \in \Sigma^\delta} |V_n^\delta(\xi) - V(\xi)| \le \sup_{\xi \in \Sigma^\delta} |V_n^\delta(\xi) - V^\delta(\xi)| + \sup_{\xi \in \Sigma^\delta} |V^\delta(\xi) - V(\xi)| \le \epsilon_1 + \epsilon_2 = \epsilon.

References

[BBS95] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, (72):81-138, 1995.

[Ber87] Dimitri P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice Hall, 1987.

[FS93] Wendell H. Fleming and H. Mete Soner. Controlled Markov Processes and Viscosity Solutions. Applications of Mathematics. Springer-Verlag, 1993.

[KP95] Peter E. Kloeden and Eckhard Platen. Numerical Solutions of Stochastic Differential Equations. Springer-Verlag, 1995.

[Kry80] N.V. Krylov. Controlled Diffusion Processes. Springer-Verlag, New York, 1980.

[Kus90] Harold J. Kushner. Numerical methods for stochastic control problems in continuous time. SIAM J. Control and Optimization, 28:999-1048, 1990.

[Mun97] Remi Munos. A convergent reinforcement learning algorithm in the continuous case based on a finite difference method. International Joint Conference on Artificial Intelligence, 1997.

[Wat89] Christopher J.C.H. Watkins. Learning from delayed reward. PhD thesis, Cambridge University, 1989.
\n\n[Wat89j \n\n\f", "award": [], "sourceid": 1404, "authors": [{"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}, {"given_name": "Paul", "family_name": "Bourgine", "institution": null}]}