{"title": "Multi-Grid Methods for Reinforcement Learning in Controlled Diffusion Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1033, "page_last": 1039, "abstract": "", "full_text": "Multi-Grid Methods for Reinforcement Learning in Controlled Diffusion Processes\n\nStephan Pareigis\n\nstp@numerik.uni-kiel.de\n\nLehrstuhl Praktische Mathematik\nChristian-Albrechts-Universität Kiel\n\nKiel, Germany\n\nAbstract\n\nReinforcement learning methods for discrete and semi-Markov decision problems such as Real-Time Dynamic Programming can be generalized for Controlled Diffusion Processes. The optimal control problem reduces to a boundary value problem for a fully nonlinear second-order elliptic differential equation of Hamilton-Jacobi-Bellman (HJB-) type. Numerical analysis provides multi-grid methods for this kind of equation. In the case of Learning Control, however, the systems of equations on the various grid-levels are obtained using observed information (transitions and local cost). To ensure consistency, special attention needs to be directed toward the type of time and space discretization during the observation. An algorithm for multi-grid observation is proposed. The multi-grid algorithm is demonstrated on a simple queuing problem.\n\n1 Introduction\n\nControlled Diffusion Processes (CDP) are the analogue of Markov Decision Problems in continuous state space and continuous time. A CDP can always be discretized in state space and time and thus reduced to a Markov Decision Problem. Algorithms like Q-learning and RTDP as described in [1] can then be applied to produce controls or optimal value functions for a fixed discretization. 
\n\nProblems arise when the discretization needs to be refined, or when multi-grid information needs to be extracted to accelerate the algorithm. The relation of time to state space discretization parameters is crucial in both cases. Therefore a mathematical model of the discretized process is introduced, which reflects the properties of the converged empirical process. In this model, transition probabilities of the discrete process can be expressed in terms of the transition probabilities of the continuous process. Recent results in numerical methods for stochastic control problems in continuous time can be applied to give assumptions that guarantee a local consistency condition which is needed for convergence. The same assumptions allow application of multi-grid methods.\n\nIn section 2 Controlled Diffusion Processes are introduced. A model for the discretized process is suggested in section 3 and the main theorem is stated. Section 4 presents an algorithm for multi-grid observation according to the results in the preceding section. Section 5 shows an application of multi-grid techniques for observed processes.\n\n2 Controlled Diffusion Processes\n\nConsider a Controlled Diffusion Process (CDP) $\\xi(t)$ in some bounded domain $\\Omega \\subset \\mathbb{R}^n$ fulfilling the diffusion equation\n\n$$d\\xi(t) = b(\\xi(t), u(t))\\,dt + \\sigma(\\xi(t))\\,dw. \\qquad (1)$$\n\nThe control $u(t)$ takes values in some finite set $U$. The immediate reinforcement (cost) for state $\\xi(t)$ and control $u(t)$ is\n\n$$r(t) = r(\\xi(t), u(t)). \\qquad (2)$$\n\nThe control objective is to find a feedback control law\n\n$$u(t) = u(\\xi(t)) \\qquad (3)$$\n\nthat minimizes the total discounted cost\n\n$$J(x, u) = \\mathbb{E}_x^u \\int_0^\\infty e^{-\\beta t} r(\\xi(t), u(t))\\,dt, \\qquad (4)$$\n\nwhere $\\mathbb{E}_x^u$ is the expectation starting in $x \\in \\Omega$ and applying the control law $u(\\cdot)$. $\\beta > 0$ is the discount. 
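As an illustration of (1) and (4), the following minimal sketch simulates a one-dimensional controlled diffusion with the Euler-Maruyama scheme and accumulates the discounted cost along one trajectory. It is not part of the original method: the function name simulate_discounted_cost, the step size dt, the finite horizon that truncates the integral in (4), and the omission of any boundary treatment for the bounded domain are assumptions made for illustration only.

```python
import math
import random


def simulate_discounted_cost(x0, drift, sigma, cost, policy, beta,
                             dt=0.01, horizon=10.0, seed=0):
    # Hypothetical helper: one Euler-Maruyama path of
    # d xi = b(xi, u) dt + sigma(xi) dw in one dimension,
    # with the discounted cost integrated by a left Riemann sum.
    rng = random.Random(seed)
    x = x0
    total = 0.0
    steps = int(round(horizon / dt))
    for n in range(steps):
        u = policy(x)  # feedback control law u = u(xi)
        total += math.exp(-beta * n * dt) * cost(x, u) * dt
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
        x = x + drift(x, u) * dt + sigma(x) * dw
    return total
```

Averaging the returned value over many seeds would give a Monte Carlo estimate of J(x, u) for the chosen feedback law, up to the time-discretization and truncation errors.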
\nThe transition probabilities of the CDP are given for any initial state $x \\in \\Omega$ and subset $A \\subset \\Omega$ by the stochastic kernels\n\n$$P_t^u(x, A) := \\mathrm{prob}\\{\\xi(t) \\in A \\mid \\xi(0) = x, u\\}. \\qquad (5)$$\n\nIt is known that the kernels have the properties\n\n$$\\int (y - x)\\,P_t^u(x, dy) = t \\cdot b(x, u) + o(t), \\qquad (6)$$\n\n$$\\int (y - x)(y - x)^T P_t^u(x, dy) = t \\cdot \\sigma(x)\\sigma(x)^T + o(t). \\qquad (7)$$\n\nFor the optimal control it is sufficient to calculate the optimal value function $V : \\Omega \\to \\mathbb{R}$,\n\n$$V(x) := \\inf_{u(\\cdot)} J(x, u). \\qquad (8)$$\n\nUnder appropriate smoothness assumptions $V$ is a solution of the Hamilton-Jacobi-Bellman (HJB-) equation\n\n$$\\min_{a \\in U} \\{ \\mathcal{L}^a V(x) - \\beta V(x) + r(x, a) \\} = 0, \\qquad x \\in \\Omega. \\qquad (9)$$\n\nLet $a(x) = \\sigma(x)\\sigma(x)^T$ be the diffusion matrix; then $\\mathcal{L}^a$, $a \\in U$, is defined as the elliptic differential operator\n\n$$\\mathcal{L}^a := \\sum_{i,j=1}^n a_{ij}(x)\\,\\partial_{x_i}\\partial_{x_j} + \\sum_{i=1}^n b_i(x, a)\\,\\partial_{x_i}. \\qquad (10)$$\n\n3 A Model for Observed CDPs\n\nLet $\\Omega_{h_i}$ be the centers of cells of a cell-centered grid on $\\Omega$ with cell sizes $h_0$, $h_1 = h_0/2$, $h_2 = h_1/2$, .... For any $x \\in \\Omega_{h_i}$ we shall denote by $A(x)$ the cell of $x$. Let $\\Delta t > 0$ be a parameter for the time discretization.\n\nFigure 1: The picture depicts three cell-centered grid levels and the trajectory of a diffusion process. The approximating value function is represented locally constant on each cell. The triangles on the path denote the position of the diffusion at sample times $0, \\Delta t, 2\\Delta t, 3\\Delta t, \\ldots$. Transitions between respective cells are then counted in matrices $Q_i^a$, for each control $a$ and grid $i$.\n\nBy counting the transitions between cells and calculating the empirical probabilities as defined in (20) we obtain empirical processes on every grid. By the law of large numbers the empirical processes will converge towards observed CDPs as subsequently defined.\n\nDefinition 1 An observed process $\\xi_{h_i,\\Delta t_i}(t)$ is a Controlled Markov Chain (i.e. discrete state-space and discrete time) on $\\Omega_{h_i}$ and interpolation time $\\Delta t_i$ with the transition probabilities\n\n$$\\mathrm{prob}\\{\\xi(\\Delta t_i) \\in A(y) \\mid \\xi(0) \\in A(x), u\\} = \\frac{1}{h_i^n} \\int_{A(x)} P_{\\Delta t_i}^u(z, A(y))\\,dz, \\qquad (11)$$\n\nwhere $x, y \\in \\Omega_{h_i}$ and $\\xi(t)$ is a solution of (1). Also define the observed reinforcement $\\rho$ as\n\n(12)\n\nOn every grid $\\Omega_{h_i}$ the respective process $\\xi_{h_i,\\Delta t_i}$ has its own value function $V_{h_i,\\Delta t_i}$. By Theorem 10.4.1 in Kushner, Dupuis ([5], 1992) it holds that\n\n$$V_{h_i,\\Delta t_i}(x) \\to V(x) \\quad \\text{for all } x \\in \\Omega, \\qquad (13)$$\n\nif the following local consistency conditions hold.\n\nDefinition 2 Let $\\Delta\\xi_{h,\\Delta t} = \\xi_{h,\\Delta t}(\\Delta t) - \\xi_{h,\\Delta t}(0)$. $\\xi_{h,\\Delta t}$ is called locally consistent to a solution $\\xi(\\cdot)$ of (1), iff\n\n$$\\mathbb{E}_x^a \\Delta\\xi_{h,\\Delta t} = b(x, a)\\Delta t + o(\\Delta t), \\qquad (14)$$\n\n$$\\mathbb{E}_x^a [\\Delta\\xi_{h,\\Delta t} - \\mathbb{E}_x^a \\Delta\\xi_{h,\\Delta t}][\\Delta\\xi_{h,\\Delta t} - \\mathbb{E}_x^a \\Delta\\xi_{h,\\Delta t}]^T = a(x)\\Delta t + o(\\Delta t), \\qquad (15)$$\n\n$$\\sup_n |\\Delta\\xi_{h,\\Delta t}(n\\Delta t)| \\to 0 \\quad \\text{as } h \\to 0. \\qquad (16)$$\n\nTo verify these conditions for the observed CDP, the expectation and variance can be calculated. For the expectation we get\n\n$$\\mathbb{E}_x^a \\Delta\\xi_{h_i,\\Delta t_i} = \\sum_{y \\in \\Omega_{h_i}} P_{h_i,\\Delta t_i}(x, y)(y - x) = \\frac{1}{h_i^n} \\sum_{y \\in \\Omega_{h_i}} \\int_{A(x)} (y - x)\\,P_{\\Delta t_i}^a(z, A(y))\\,dz. \\qquad (17)$$\n\nRecalling properties (6) and (7) and doing a similar calculation for the variance we obtain the following theorem. 
\n\nTheorem 3 For observed CDPs $\\xi_{h_i,\\Delta t_i}$ let $h_i$ and $\\Delta t_i$ be such that\n\n(18)\n\nFurthermore, $\\xi_{h_i,\\Delta t_i}$ shall be truncated at some radius $R$, such that $R \\to 0$ for $h_i \\to 0$ and expectation and variance of the truncated process differ only in the order $o(\\Delta t)$ from expectation and variance of $\\xi_{h_i,\\Delta t_i}$. Then the observed processes $\\xi_{h_i,\\Delta t_i}$ truncated at $R$ are locally consistent to the diffusion process $\\xi(\\cdot)$ and therefore the value functions $V_{h_i,\\Delta t_i}$ converge to the value function $V$.\n\n4 Identification by Multi-Grid Observation\n\nThe condition in Theorem 3 provides information on how to choose parameters in the algorithm with empirical data. Choose discretization values $h_0$, $\\Delta t_0$ for the coarsest grid $\\Omega_0$. $\\Delta t_0$ should typically be of order $\\|b\\|_{\\sup}/h_0$. Then choose for the finer grids\n\n$$\\begin{array}{l|cccccc} \\text{grid} & 0 & 1 & 2 & 3 & 4 & 5 \\\\ \\hline \\text{space} & h_0 & h_0/2 & h_0/4 & h_0/8 & h_0/16 & h_0/32 \\\\ \\text{time} & \\Delta t_0 & \\Delta t_0/2 & \\Delta t_0/2 & \\Delta t_0/4 & \\Delta t_0/4 & \\Delta t_0/8 \\end{array} \\qquad (19)$$\n\nThe sequences verify the assumption (18). We may now formulate the algorithm for Multi-Grid Observation of the CDP $\\xi(\\cdot)$. Note that only observation is being carried out. The actual calculation of the value function may be done separately as described in the next section. The choice of the control is assumed to be done by a separate controller. Let $\\Omega_k$ be the finest grid, i.e. $\\Delta t_k$ and $h_k$ the finest discretizations. Let $U_l = U^{\\Delta t_l/\\Delta t_k} = U \\times \\ldots \\times U$, $\\Delta t_l/\\Delta t_k$ times. $Q_l^{a_l}$ is a $|\\Omega_l| \\times |\\Omega_l|$-matrix ($a_l \\in U_l$), containing the number of transitions between cells in $\\Omega_l$. $R_l^{a_l}$ is a $|\\Omega_l|$-vector containing the empirical cost for every cell in $\\Omega_l$. The immediate cost is given by the system as $r_l = \\int_0^{\\Delta t_l} e^{-\\beta t} r(\\xi(t), a_l)\\,dt$. $T$ denotes current time.\n0. 
Initialize $\\Omega_l$, $Q_l^{a_l}$, $R_l^{a_l}$ for all $a_l \\in U_l$, $l = 0, \\ldots, k$\n1. repeat {\n2.   choose $a = a(T) \\in U$ and apply $a$ constantly on $[T; T + \\Delta t_k)$\n3.   $T := T + \\Delta t_k$\n4.   for $l = 0$ to $k$ do {\n5.     determine cell $x_l \\in \\Omega_l$ with $\\xi(T - \\Delta t_l) \\in A(x_l)$\n6.     determine cell $y_l \\in \\Omega_l$ with $\\xi(T) \\in A(y_l)$\n7.     if $\\|x_k - y_k\\| \\geq R$ (truncation radius) then goto 2. else\n8.     $a_l := (a(T - \\Delta t_l), a(T + \\Delta t_k - \\Delta t_l), \\ldots, a(T - \\Delta t_k))$\n9.     receive immediate cost $r_l$\n10.    $Q_l^{a_l}(x_l, y_l) := Q_l^{a_l}(x_l, y_l) + 1$\n11.    $R_l^{a_l}(x_l) := \\bigl(r_l + R_l^{a_l}(x_l) \\cdot \\sum_{z \\in \\Omega_l} Q_l^{a_l}(x_l, z)\\bigr) / \\bigl(1 + \\sum_{z \\in \\Omega_l} Q_l^{a_l}(x_l, z)\\bigr)$\n  } (for-do)\n} (repeat)\n\nBefore applying a multi-grid algorithm for the calculation of the value function on the basis of the observations, one should make sure that every box has at least some data for every control. Especially in the early stages of learning only the two coarsest grids $\\Omega_0$, $\\Omega_1$ could be used for computation of the optimal value function and finer grids may be added (possibly locally) as learning evolves.\n\n5 Application of Multi-Grid Techniques\n\nThe identification algorithm produces matrices $Q_l^{a_l}$ containing the number of transitions between boxes in $\\Omega_l$. We will calculate from the matrices $Q$ the transition matrices $P$ by the formula\n\n$$P_l^{a_l}(x, y) = Q_l^{a_l}(x, y) \\Big/ \\Bigl( \\sum_{z \\in \\Omega_l} Q_l^{a_l}(x, z) \\Bigr), \\qquad x, y \\in \\Omega_l,\\ a_l \\in U_l,\\ l = 0, \\ldots, k. \\qquad (20)$$\n\nNow we define matrices $A$ and right hand sides $f$ as\n\n$$A_l^{a_l} := (\\beta_l P_l^{a_l} - I) / \\Delta t_l, \\qquad f_l^{a_l} := R_l^{a_l} / \\Delta t_l, \\qquad (21)$$\n\nwhere $\\beta_l = e^{-\\beta \\Delta t_l}$. The discrete Bellman equation takes the following form\n\n(22)\n\nThe problem is now in a form to which the multi-grid method due to Hoppe, Bloß ([2], 1989) can be applied. For prolongation and restriction we choose bilinear interpolation and full weighted restriction for cell-centered grids. 
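Formulas (20) and (21) can be sketched in a few lines for a single grid level and a single control. This is a hedged illustration, not the author's implementation: the names transition_matrix and bellman_operator, the dense list-of-lists storage, and the choice to leave rows without observations at zero are assumptions.

```python
import math


def transition_matrix(Q):
    # Row-normalize the transition counts Q as in (20).
    # Assumption: a cell with no observed transitions keeps a zero row.
    P = []
    for row in Q:
        total = sum(row)
        P.append([q / total if total > 0 else 0.0 for q in row])
    return P


def bellman_operator(P, R, beta, dt):
    # Build the matrix (beta_l * P - I) / dt and the right-hand side R / dt
    # as in (21), with beta_l = exp(-beta * dt).
    beta_l = math.exp(-beta * dt)
    n = len(P)
    A = [[(beta_l * P[i][j] - (1.0 if i == j else 0.0)) / dt
          for j in range(n)] for i in range(n)]
    f = [r / dt for r in R]
    return A, f
```

In practice the matrices are sparse, so a sparse storage format would be the natural choice; dense lists are used here only to keep the sketch self-contained.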
We point out that for any cell $x \\in \\Omega_l$ only those neighboring cells shall be used for prolongation and restriction for which the minimum in (22) is attained for the same control as the minimizing control in $x$ (see [2], 1989 and [3], 1996 for details). On every grid $\\Omega_l$ the defect in equation (22) is calculated and used for a correction on grid $\\Omega_{l-1}$. As a smoother, nonlinear Gauss-Seidel iteration applied to (22) is used.\n\nOur approach differs from the algorithm in Hoppe, Bloß ([2], 1989) in the special form of the matrices $A_l^{a_l}$ in equation (22). The stars are generally larger than nine-point; in fact the stars grow with decreasing $h$ although the matrices remain sparse. Also, when working with empirical information the relationship between the matrices $A_l^{a_l}$ on the various grids is based on observation of a process, which implies that coarse grid corrections do not always correct the equation of the finest grid (especially in the early stages of learning). However, using the observed transition matrices $A_l^{a_l}$ on the coarse grids saves the computing time which would otherwise be needed to calculate these matrices by the Galerkin product (see Hackbusch [4], 1985).\n\n6 Simulation with precomputed transitions\n\nConsider a homogeneous server problem with two servers holding data $(x_1, x_2) \\in [0,1] \\times [0,1]$. Two independent data streams arrive, one at each server. A controller has to decide to which server to route. The modeling equation for the stream shall be\n\n$$dx = b(x, u)\\,dt + \\sigma(x)\\,dw, \\qquad u \\in \\{1, 2\\} \\qquad (23)$$\n\nwith\n\nb(x,l) = (!1)  b(x,2) = (~1)  CT=  (~  ~)\n\n(24)\n\nThe boundaries at $x_1 = 0$ and $x_2 = 0$ are reflecting. The exceeding data on either server $x_1, x_2 > 1$ is rejected from the system and penalized with $g(x_1, 1) = g(1, x_2) = 10$, $g = 0$ otherwise. The objective of the control policy shall be to minimize\n\n$$\\mathbb{E} \\int_0^\\infty e^{-\\beta t} (x_1(t) + x_2(t) + g(x_1, x_2))\\,dt. \\qquad (25)$$\n\nThe plots of the value function show that in case of high load (i.e. $x_1, x_2$ close to 1) a maximum of cost is assumed. Therefore it is cheaper to overload a server and pay penalty than to stay close to the diagonal as is optimal in the low load case.\n\nFor simulation we used precomputed (i.e. converged heuristic) transition probabilities to test the multi-grid performance. The discount $\\beta$ was set to 0.7. The multi-grid algorithm reduces the error in each iteration by a factor 0.21, using 5 grid levels and a V-cycle and two smoothing iterations on the coarsest grid. For comparison, the iteration on the finest grid converges with a reduction factor 0.63.\n\n7 Discussion\n\nWe have given a condition for sampling controlled diffusion processes such that the value functions will converge while the discretization tends to zero. Rigorous numerical methods can now be applied to reinforcement learning algorithms in continuous-time, continuous-state problems, as is demonstrated with a multi-grid algorithm for the HJB-equation. Ongoing work is directed towards adaptive grid refinement algorithms and application to systems that include hysteresis.\n\nFigure 2: Contour plots of the predicted reward in a homogeneous server problem with nonlinear costs are shown on different grid levels. On the coarsest 4 x 4 grid a sampling rate of one second is used with 9-point-star transition matrices. At the finest grid (64 x 64) a sampling rate of $\\frac{1}{4}$ second is used with observation on 81-point-stars.
 Inside the egg-shaped area the value function assumes its maximum.\n\nReferences\n\n[1] A. Barto, S. Bradtke, S. Singh. Learning to Act using Real-Time Dynamic Programming, AI Journal on Computational Theories of Interaction and Agency, 1993.\n\n[2] M. Bloß and R. Hoppe. Numerical Computation of the Value Function of Optimally Controlled Stochastic Switching Processes by Multi-Grid Techniques, Numer Funct Anal And Optim 10(3+4), 275-304, 1989.\n\n[3] S. Pareigis. Lernen der Lösung der Bellman-Gleichung durch Beobachtung von kontinuierlichen Prozessen, PhD Thesis, 1996.\n\n[4] W. Hackbusch. Multi-Grid Methods and Applications, Springer-Verlag, 1985.\n\n[5] H. Kushner and P. Dupuis. Numerical Methods for Stochastic Control Problems in Continuous Time, Springer-Verlag, 1992.", "award": [], "sourceid": 1273, "authors": [{"given_name": "Stephan", "family_name": "Pareigis", "institution": null}]}