{"title": "Improving Policies without Measuring Merits", "book": "Advances in Neural Information Processing Systems", "page_first": 1059, "page_last": 1065, "abstract": null, "full_text": "Improving Policies without Measuring Merits\n\nPeter Dayan\u00b9\nCBCL\nE25-201, MIT\nCambridge, MA 02139\ndayan@ai.mit.edu\n\nSatinder P Singh\nHarlequin, Inc\n1 Cambridge Center\nCambridge, MA 02142\nsingh@harlequin.com\n\nAbstract\n\nPerforming policy iteration in dynamic programming should only require knowledge of relative rather than absolute measures of the utility of actions (Werbos, 1991) - what Baird (1993) calls the advantages of actions at states. Nevertheless, most existing methods in dynamic programming (including Baird's) compute some form of absolute utility function. For smooth problems, advantages satisfy two differential consistency conditions (including the requirement that they be free of curl), and we show that enforcing these can lead to appropriate policy improvement solely in terms of advantages.\n\n1 Introduction\n\nIn deciding how to change a policy at a state, an agent only needs to know the differences (called advantages) between the total return based on taking each action a for one step and then following the policy forever after, and the total return based on always following the policy (the conventional value of the state under the policy). The advantages are like differentials - they do not depend on the local levels of the total return. Indeed, Werbos (1991) defined Dual Heuristic Programming (DHP), using these facts, learning the derivatives of these total returns with respect to the state. 
For instance, in a conventional undiscounted maze problem with a penalty for each move, the advantages for the actions might typically be -1, 0 or 1, whereas the values vary between 0 and the maximum distance to the goal. Advantages should therefore be easier to represent than absolute value functions in a generalising system such as a neural network and, possibly, easier to learn. Although the advantages are differential, existing methods for learning them, notably Baird (1993), require the agent simultaneously to learn the total return from each state. The underlying trouble is that advantages do not appear to satisfy any form of a Bellman equation. Whereas it is clear that the value of a state should be closely related to the value of its neighbours, it is not obvious that the advantage of action a at a state should be equally closely related to its advantages nearby.\n\n\u00b9 We are grateful to Larry Saul, Tommi Jaakkola and Mike Jordan for comments, and Andy Barto for pointing out the connection to Werbos' DHP. This work was supported by NSERC, MIT, and grants to Professor Michael I Jordan from ATR Human Information Processing Research and Siemens Corporation.\n\nIn this paper, we show that under some circumstances it is possible to use a solely advantage-based scheme for policy iteration using the spatial derivatives of the value function rather than the value function itself. Advantages satisfy a particular consistency condition, and, given a model of the dynamics and reward structure of the environment, an agent can use this condition to directly acquire the spatial derivatives of the value function. 
It turns out that the condition alone may not impose enough constraints to specify these derivatives (this is a consequence of the problem described above) - however the value function is like a potential function for these derivatives, and this allows extra constraints to be imposed.\n\n2 Continuous DP, Advantages and Curl\n\nConsider the problem of controlling a deterministic system to minimise $V^*(x_0) = \\min_{u(t)} \\int_0^\\infty r(y(t), u(t))\\,dt$, where $y(t) \\in \\mathbb{R}^n$ is the state at time $t$, $u(t) \\in \\mathbb{R}^m$ is the control, $y(0) = x_0$, and $\\dot{y}(t) = f(y(t), u(t))$. This is a simplified form of a classic variational problem since $r$ and $f$ do not depend on time $t$ explicitly, but only through $y(t)$, and there are no stopping time or terminal conditions on $y(t)$ (see Peterson, 1993; Atkeson, 1994, for recent methods for solving such problems). This means that the optimal $u(t)$ can be written as a function of $y(t)$ and that $V^*(x_0)$ is a function of $x_0$ and not $t$. We do not treat the cases in which the infinite integrals do not converge comfortably and we will also assume adequate continuity and differentiability.\n\nThe solution by advantages: This problem can be solved by writing down the Hamilton-Jacobi-Bellman (HJB) equation (see Dreyfus, 1965) which $V^*(x)$ satisfies:\n\n$$0 = \\min_u \\left[ r(x, u) + f(x, u) \\cdot \\nabla_x V^*(x) \\right] \\qquad (1)$$\n\nThis is the continuous space/time analogue of the conventional Bellman equation (Bellman, 1957) for discrete, non-discounted, deterministic decision problems, which says that for the optimal value function $V^*$, $0 = \\min_a \\left[ r(x, a) + V^*(f(x, a)) - V^*(x) \\right]$, where starting the process at state $x$ and using action $a$ incurs a cost $r(x, a)$ and leaves the process in state $f(x, a)$. 
This, and its obvious stochastic extension to Markov decision processes, lie at the heart of temporal difference methods for reinforcement learning (Sutton, 1988; Barto, Sutton & Watkins, 1989; Watkins, 1989). Equation 1 describes what the optimal value function must satisfy. Discrete dynamic programming also comes with a method called value iteration which starts with any function $V_0(x)$, improves it sequentially, and converges to the optimum.\n\nThe alternative method, policy iteration (Howard, 1960), operates in the space of policies, ie functions $w(x)$. Starting with $w(x)$, the method requires evaluating everywhere the value function $V^w(x) = \\int_0^\\infty r(y(t), w(y(t)))\\,dt$, where $y(0) = x$, and $\\dot{y}(t) = f(y(t), w(y(t)))$. It turns out that $V^w$ satisfies a close relative of equation 1:\n\n$$0 = r(x, w(x)) + f(x, w(x)) \\cdot \\nabla_x V^w(x) \\qquad (2)$$\n\nIn policy iteration, $w(x)$ is improved by choosing the maximising action:\n\n$$w'(x) = \\arg\\max_u \\left[ r(x, u) + f(x, u) \\cdot \\nabla_x V^w(x) \\right] \\qquad (3)$$\n\nas the new action. For discrete Markov decision problems, the equivalent of this process of policy improvement is guaranteed to improve upon $w$.\n\nIn the discrete case and for an analogue of value iteration, Baird (1993) defined the optimal advantage function $A^*(x, a) = \\left[ Q^*(x, a) - \\max_b Q^*(x, b) \\right] / \\delta t$, where $\\delta t$ is effectively a characteristic time for the process which was taken to be 1 above, and the optimal Q function (Watkins, 1989) is $Q^*(x, a) = r(x, a) + V^*(f(x, a))$, where $V^*(y) = \\max_b Q^*(y, b)$. It turns out (Baird, 1993) that in the discrete case, one can cast the whole of policy iteration in terms of advantages. In the continuous case, we define advantages directly as\n\n$$A^w(x, u) = r(x, u) + f(x, u) \\cdot \\nabla_x V^w(x) \\qquad (4)$$\n\nThis equation indicates how the spatial derivatives of $V^w$ determine the advantages. 
Note that the consistency condition in equation 2 can be written as $A^w(x, w(x)) = 0$. Policy iteration can proceed using\n\n$$w'(x) = \\arg\\max_u A^w(x, u). \\qquad (5)$$\n\nDoing without $V^w$: We can now state more precisely the intent of this paper: a) the consistency condition in equation 2 provides constraints on the spatial derivatives $\\nabla_x V^w(x)$, at least given a model of $r$ and $f$; b) equation 4 indicates how these spatial derivatives can be used to determine the advantages, again using a model; and c) equation 5 shows that the advantages tout court can be used to improve the policy. Therefore, one apparently should have no need to know $V^w(x)$ but just its spatial derivatives in order to do policy iteration.\n\nDidactic Example - LQR: To make the discussion more concrete, consider the case of a one-dimensional linear quadratic regulator (LQR). The task is to minimise $V^*(x_0) = \\int_0^\\infty \\left[ \\alpha x(t)^2 + \\beta u(t)^2 \\right] dt$ by choosing $u(t)$, where $\\alpha, \\beta > 0$, $\\dot{x}(t) = -[a x(t) + u(t)]$ and $x(0) = x_0$. It is well known (eg Athans & Falb, 1966) that the solution to this problem is that $V^*(x) = k^* x^2 / 2$ where $k^* = (\\alpha + \\beta (u^*)^2) / (a + u^*)$ and $u(t) = u^* x(t)$ with gain $u^* = -a + \\sqrt{a^2 + \\alpha / \\beta}$. Knowing the form of the problem, we consider policies $w$ that make $u(t) = w x(t)$ and require $h(x, k) \\equiv \\nabla_x V^w(x) = k x$, where the correct value of $k$ is $(\\alpha + \\beta w^2) / (a + w)$. The consistency condition in equation 2 evaluated at state $x$ implies that $0 = (\\alpha + \\beta w^2) x^2 - h(x, k)(a + w) x$. Doing online gradient descent in the square inconsistency at samples $x_n$ gives $k_{n+1} = k_n - \\epsilon \\, \\partial \\left[ (\\alpha + \\beta w^2) x_n^2 - k_n x_n (a + w) x_n \\right]^2 / \\partial k_n$, which will reduce the square inconsistency for small enough $\\epsilon$ unless $x_n = 0$. As required, the square inconsistency can only be zero for all values of $x$ if $k = (\\alpha + \\beta w^2) / (a + w)$. 
The advantage of performing action $v$ (note this is not $v x$) at state $x$ is, from equation 4, $A^w(x, v) = \\alpha x^2 + \\beta v^2 - (a x + v)(\\alpha + \\beta w^2) x / (a + w)$, which, minimising over $v$ (equation 5), gives $u(x) = w' x$ where $w' = (\\alpha + \\beta w^2) / (2 \\beta (a + w))$, which is the Newton-Raphson iteration to solve the quadratic equation that determines the optimal policy. In this case, without ever explicitly forming $V^w(x)$, we have been able to learn an optimal policy. This was based, at least conceptually, on samples $x_n$ from the interaction of the agent with the world.\n\nThe curl condition: The astute reader will have noticed a problem. The consistency condition in equation 2 constrains the spatial derivatives $\\nabla_x V^w$ in only one direction at every point - along the route $f(x, w(x))$ taken according to the policy there. However, in evaluating actions by evaluating their advantages, we need to know $\\nabla_x V^w$ in all the directions accessible through $f(x, u)$ at state $x$. The quadratic regulation task was only solved because we employed a function approximator (which was linear in this case, $h(x, k) = k x$). For the case of LQR, the restriction that $h$ be linear allowed information about $f(x', w(x')) \\cdot \\nabla_{x'} V^w(x')$ at distant states $x'$ and for the policy actions $w(x')$ there to determine $f(x, u) \\cdot \\nabla_x V^w(x)$ at state $x$ but for non-policy actions $u$. If we had tried to represent $h(x, k)$ using a more flexible approximator such as radial basis functions, it might not have worked. In general, if we do not know the form of $\\nabla_x V^w(x)$, we cannot rely on the function approximator to generalise correctly. 
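As a concrete check on the one-dimensional regulator discussed above, the following Python sketch estimates k by online gradient descent on the squared inconsistency of equation 2 and then applies the improvement step w' = k / (2 beta). It is only an illustration: the function names, step size, sampling range and starting gain are assumptions, not details taken from the paper.

```python
import math
import random


def fit_k(w, a, alpha, beta, steps=2000, lr=0.05, seed=0):
    # Online gradient descent on the squared inconsistency of equation 2,
    # with the linear form h(x, k) = k * x for the spatial derivative of V^w.
    # steps, lr and the sampling range [-1, 1] are illustrative assumptions.
    rng = random.Random(seed)
    k = 0.0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        # Inconsistency: r(x, w x) + f(x, w x) * h(x, k)
        err = (alpha + beta * w * w) * x * x - k * (a + w) * x * x
        grad = -2.0 * err * (a + w) * x * x  # d(err^2) / dk
        k -= lr * grad
    return k


def improve(k, beta):
    # Minimising A^w(x, v) = alpha x^2 + beta v^2 - (a x + v) k x over v
    # gives v = k x / (2 beta), so the new gain is k / (2 beta).
    return k / (2.0 * beta)


def policy_iteration(a=1.0, alpha=1.0, beta=1.0, w=0.5, iters=6):
    # Alternate evaluation (fit k) and improvement, never forming V^w itself.
    for _ in range(iters):
        k = fit_k(w, a, alpha, beta)
        w = improve(k, beta)
    return w


w_final = policy_iteration()
w_opt = -1.0 + math.sqrt(1.0 + 1.0 / 1.0)  # -a + sqrt(a^2 + alpha / beta)
```

With a = alpha = beta = 1 the iterates should approach the optimal gain -a + sqrt(a^2 + alpha / beta). Only samples along the policy direction are used, so it is the linear form of h that lets them constrain the advantages of non-policy actions.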
\nThere is one piece of information that we have yet to use - the function $h(x, k) \\equiv \\nabla_x V^w(x)$ (with parameters $k$, and in general non-linear) is the gradient of something - it represents a conservative vector field. Therefore its curl should vanish ($\\nabla_x \\times h(x, k) = 0$). Two ways to try to satisfy this are to represent $h$ as a suitably weighted combination of functions that satisfy this condition or to use its square as an additional error during the process of setting the parameters $k$. Even in the case of the LQR, but in more than one dimension, it turns out to be essential to use the curl condition. For the multi-dimensional case we know that $V^w(x) = x^T K^w x / 2$ for some symmetric matrix $K^w$, but enforcing zero curl is the only way to enforce this symmetry.\n\nThe curl condition says that knowing how some component of $\\nabla_x V^w(x)$ changes in some direction (eg $\\partial [\\nabla_x V^w(x)]_2 / \\partial x_1$) does provide information about how some other component changes in a different direction (eg $\\partial [\\nabla_x V^w(x)]_1 / \\partial x_2$). This information is only useful up to constants of integration, and smoothness conditions will be necessary to apply it.\n\n3 Simulations\n\nWe tested the method of approximating $h^w(x) = \\nabla_x V^w(x)$ as a linearly weighted combination of local conservative vector fields $h^w(x) = \\sum_i c_i \\nabla_x \\phi(x, z_i)$, where $c_i$ are the approximation weights that are set by enforcing equation 2, and $\\phi(x, z_i) = e^{-\\alpha |x - z_i|^2}$ are standard radial basis functions (Broomhead & Lowe, 1988; Poggio & Girosi, 1990). We enforced this condition at a discrete set $\\{x_k\\}$ of 100 points scattered in the state space, using as a policy explicit vectors $u_k$ at those locations, and employed 49 similarly scattered centres $z_i$. 
Issues of learning to approximate conservative and non-conservative vector fields using such sums have been discussed by Mussa-Ivaldi (1992). One advantage of using this representation is that $\\psi(x) = \\sum_i c_i \\phi(x, z_i)$ can be seen as the system's effective policy evaluation function $V^w(x)$, at least modulo an arbitrary constant (we call this an un-normalised value function).\n\nWe chose two 2-dimensional problems to prove that the system works. They share the same dynamics $\\dot{x}(t) = -x(t) + u(t)$, but have different cost functions:\n\n$$r_{LQR}(x(t), u(t)) = 5 |x(t)|^2 + |u(t)|^2$$\n$$r_{SP}(x(t), u(t)) = |x(t)|^2 + \\sqrt{1 + |u(t)|^2}$$\n\n$r_{LQR}$ makes for a standard linear quadratic regulation problem, which has a quadratic optimal value function and a linear optimal controller as before (although now we are using limited range basis functions instead of using the more appropriate linear form). $r_{SP}$ has a mixture of a quadratic term in $x(t)$, which encourages the state to move towards the origin, and a more nearly linear cost term in $u(t)$, which would tend to encourage a constant speed. All the sample points $x_k$ and radial basis function centres $z_i$ were selected within the $[-1, 1]^2$ square. We started from a randomly chosen policy with both components of $u_k$ being samples from the uniform distribution $U(-.25, .25)$. This was chosen so that the overall dynamics of the system, including the $-x(t)$ component, should lead the agent towards the origin.\n\nFigure 1a shows the initial values of $u_k$ in the regulator case, where the circles are at the leading edges of the local policies which point in the directions shown with relative magnitudes given by the length of the lines, and (for scale) the central object is the square $[-0.1, 0.1]^2$. 
The 'policy' lines are centred at the 100 $x_k$ points. Using the basis function representation, equation 2 is an over-determined linear system, and so the standard Moore-Penrose pseudo-inverse was used to find an approximate solution. The un-normalised approximate value function corresponding to this policy is shown in figure 1b. Its bowl-like character is a feature of the optimal value function. For the LQR case, it is straightforward to perform the optimisation in equation 5 analytically, using the values for $h^w(x_k)$ determined by the $c_i$. Figure 1c,d show the policy and its associated un-normalised value function after 4 iterations. By this point, the policy and value functions are essentially optimal - the policy shows the agent moves inwards from all $x_k$ and the magnitudes are linearly related to the distances from the centre. Figure 1e,f show the same at the end point for $r_{SP}$. One major difference is that we performed the optimisation in equation 5 over a discrete set of values for $u_k$ rather than analytically. The tendency for the agent to maintain a constant speed is apparent except right near the origin. The bowl is not centred exactly at (0,0) - which is an approximation error.\n\n4 Discussion\n\nThis paper has addressed the question of whether it is possible to perform policy iteration using just differential quantities like advantages. We showed that using a conventional consistency condition and a curl constraint on the spatial derivatives of the value function it is possible to learn enough about the value function for a policy to improve upon that policy. Generalisation can be key to the whole scheme. We showed this working on an LQR problem and a more challenging non-LQR case. 
We only treated 'smooth' problems - addressing discontinuities in the value function, which imply undifferentiability, is clearly key. Care must be taken in interpreting this result. The most challenging problem is the error metric for the approximation. The consistency condition may either under-specify or over-specify the parameters. In the former case, just as for standard approximation theory, one needs prior information to regularise the gradient surface. For many problems there may be spatial discontinuities in the policy evaluation, and therefore this is particularly difficult. If the parameters are over-specified (and, for good generalisation, one would generally be working in this regime), we need to evaluate inconsistencies. Inconsistencies cost exactly to the degree that the optimisation in equation 5 is compromised - but this is impossible to quantify. Note that this problem is not confined to the current scheme of learning the derivatives of the value function - it also impacts algorithms based on learning the value function itself. It is also unreasonable to specify the actions $u_k$ only at the points $x_k$. In general, one would either need a parameterised function for $u(x)$ whose parameters would be updated in the light of performing the optimisations in equation 5 (or some sort of interpolation scheme), or alternatively one could generate $u$ on the fly using the learned values of $h(x)$.\n\nFigure 1: a-d) Policies and un-normalised value functions for the $r_{LQR}$ problem and e-f) for the $r_{SP}$ problem.\n
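The policy evaluation and improvement steps of Section 3 can be sketched for the r_LQR problem as follows. This is a minimal sketch under stated assumptions: the basis width, random seed, rcond cutoff and all function names are hypothetical (the paper does not report them), and NumPy's pinv stands in for the Moore-Penrose pseudo-inverse step mentioned in the text.

```python
import numpy as np

A_WIDTH = 4.0  # assumed radial basis width; not reported in the paper


def phi(x, z):
    # Radial basis functions exp(-a |x - z_i|^2); x is (n, 2), z is (m, 2) -> (n, m)
    d = x[:, None, :] - z[None, :, :]
    return np.exp(-A_WIDTH * np.sum(d * d, axis=2))


def grad_phi(x, z):
    # Gradient of each basis function with respect to x -> (n, m, 2).
    # Each term is a gradient, so any weighted sum is curl-free by construction.
    d = x[:, None, :] - z[None, :, :]
    return -2.0 * A_WIDTH * d * phi(x, z)[:, :, None]


def evaluate_policy(xk, uk, z):
    # Equation 2 at each sample: 0 = r(x_k, u_k) + f(x_k, u_k) . h(x_k),
    # with h(x) = sum_i c_i grad phi(x, z_i), f = -x + u and the LQR cost.
    r = 5.0 * np.sum(xk * xk, axis=1) + np.sum(uk * uk, axis=1)
    f = -xk + uk
    M = np.einsum('kd,kid->ki', f, grad_phi(xk, z))
    # Over-determined linear system M c = -r, solved by pseudo-inverse.
    c = np.linalg.pinv(M, rcond=1e-6) @ (-r)
    return c, M, r


def improve_policy(xk, z, c):
    # For the LQR cost, minimising |u|^2 + u . h(x) over u gives u = -h(x) / 2.
    h = np.einsum('i,kid->kd', c, grad_phi(xk, z))
    return -0.5 * h


rng = np.random.default_rng(0)
xk = rng.uniform(-1.0, 1.0, size=(100, 2))    # 100 sample states
z = rng.uniform(-1.0, 1.0, size=(49, 2))      # 49 basis centres
uk = rng.uniform(-0.25, 0.25, size=(100, 2))  # random initial policy
c, M, r = evaluate_policy(xk, uk, z)
uk_new = improve_policy(xk, z, c)
```

The same coefficients c give the un-normalised value function as the weighted sum of the basis functions themselves. Because the system is over-determined, the residual M c + r is generally non-zero; this is the inconsistency whose cost the discussion above describes as hard to quantify.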
\nIf there is a discount factor, ie $V^*(x_0) = \\min_{u(t)} \\int_0^\\infty e^{-\\lambda t} r(y(t), u(t))\\,dt$, then $0 = r(x, w(x)) - \\lambda V^w(x) + f(x, w(x)) \\cdot \\nabla_x V^w(x)$ is the equivalent consistency condition to equation 2 (see also Baird, 1993) and so it is no longer possible to learn $\\nabla_x V^w(x)$ without ever considering $V^w(x)$ itself. One can still optimise parameterised forms for $V^w$ as in section 3, except that the once arbitrary constant is no longer free.\n\nThe discrete analogue to the differential consistency condition in equation 2 amounts to the tautology that given current policy $\\pi$, $\\forall x$, $A^\\pi(x, \\pi(x)) = 0$. As in the continuous case, this only provides information about $V^\\pi(f(x, \\pi(x))) - V^\\pi(x)$ and not $V^\\pi(f(x, a)) - V^\\pi(x)$ for other actions $a$, which are needed for policy improvement. There is an equivalent to the curl condition: if there is a cycle in the undirected transition graph, then the weighted sum of the advantages for the actions along the cycle is equal to the equivalently weighted sum of payoffs along the cycle, where the weights are +1 if the action respects the cycle and -1 otherwise. This gives a consistency condition that $A^\\pi$ has to satisfy - and, just as in the constants of integration for the differential case, it requires grounding: $A^\\pi(z, a) = 0$ for some $z$ in the cycle. It is certainly not true that all discrete problems will have sufficient cycles to specify $A^\\pi$ completely - in an extreme case, the undirected version of the directed transition graphs might contain no cycles at all. In the continuous case, if the updates are sufficiently smooth, this is not possible.\n\n
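The discrete cycle condition can be checked numerically on a small deterministic example. The four-state problem, its payoffs and the policy below are invented purely for illustration, with the characteristic time delta t taken to be 1 so that an advantage is the payoff plus the change in value.

```python
# Hypothetical deterministic problem: state 0 is an absorbing, payoff-free goal.
# step[(state, action)] = (next state, payoff); 'l' and 'r' are made-up action names.
step = {
    (1, 'l'): (0, -1.0), (1, 'r'): (2, -1.0),
    (2, 'l'): (1, -1.0), (2, 'r'): (3, -2.0),
    (3, 'l'): (2, -1.0), (3, 'r'): (0, -3.0),
}
policy = {1: 'l', 2: 'l', 3: 'r'}


def evaluate(policy, step, sweeps=50):
    # Undiscounted policy evaluation by repeated sweeps; the goal keeps value 0.
    v = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
    for _ in range(sweeps):
        for x, a in policy.items():
            nxt, r = step[(x, a)]
            v[x] = r + v[nxt]
    return v


def advantage(v, x, a):
    # A(x, a) = r(x, a) + V(f(x, a)) - V(x), with delta t = 1
    nxt, r = step[(x, a)]
    return r + v[nxt] - v[x]


v = evaluate(policy, step)

# Undirected cycle 1 - 2 - 3 - 0 - 1, traversed in that order.
# The first three actions run with the traversal (weight +1); action 'l' at
# state 1 goes 1 -> 0, against the 0 -> 1 leg of the traversal (weight -1).
cycle = [(1, 'r', +1), (2, 'r', +1), (3, 'r', +1), (1, 'l', -1)]
adv_sum = sum(s * advantage(v, x, a) for x, a, s in cycle)
pay_sum = sum(s * step[(x, a)][1] for x, a, s in cycle)
```

The two sums agree because the value differences telescope around the cycle, and the policy actions themselves have zero advantage, which supplies the grounding mentioned above.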
For stochastic problems, the consistency condition equivalent to equation 2 will involve an integral, which, if doable, would permit the application of our method.\n\nWerbos's (1991) DHP and Mitchell and Thrun's (1993) explanation-based Q-learning also study differential forms of the Bellman equation based on differentiating the discrete Bellman equation (or its Q-function equivalent) with respect to the state. This is certainly fine as an additional constraint that $V^*$ or $Q^*$ must satisfy (as used by Mitchell and Thrun and Werbos' Globalized version of DHP), but by itself, it does not enforce the curl condition, and is insufficient for the whole of policy improvement.\n\nReferences\n\nAthans, M & Falb, PL (1966). Optimal Control. New York, NY: McGraw-Hill.\nAtkeson, CG (1994). Using Local Trajectory Optimizers To Speed Up Global Optimization in Dynamic Programming. In NIPS 6.\nBaird, LC, IIIrd (1993). Advantage Updating. Technical report, Wright Laboratory, Wright-Patterson Air Force Base.\nBarto, AG, Bradtke, SJ & Singh, SP (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72, 81-138.\nBarto, AG, Sutton, RS & Watkins, CJCH (1990). Learning and sequential decision making. In M Gabriel & J Moore, editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks. Cambridge, MA: MIT Press, Bradford Books.\nBellman, RE (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.\nBroomhead, DS & Lowe, D (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321-55.\nDreyfus, SE (1965). Dynamic Programming and the Calculus of Variations. New York, NY: Academic Press.
\nHoward, RA (1960). Dynamic Programming and Markov Processes. New York, NY: Technology Press & Wiley.\nMitchell, TM & Thrun, SB (1993). Explanation-based neural network learning for robot control. In NIPS 5.\nMussa-Ivaldi, FA (1992). From basis functions to basis fields: Vector field approximation from sparse data. Biological Cybernetics, 67, 479-489.\nPeterson, JK (1993). On-Line estimation of optimal value functions. In NIPS 5.\nPoggio, T & Girosi, F (1990). A theory of networks for learning. Science, 247, 978-982.\nSutton, RS (1988). Learning to predict by the methods of temporal difference. Machine Learning, 3, 9-44.\nWatkins, CJCH (1989). Learning from Delayed Rewards. PhD Thesis. University of Cambridge, England.\nWerbos, P (1991). A menu of designs for reinforcement learning over time. In WT Miller IIIrd, RS Sutton & P Werbos, editors, Neural Networks for Control. Cambridge, MA: MIT Press, 67-96.", "award": [], "sourceid": 1143, "authors": [{"given_name": "Peter", "family_name": "Dayan", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}]}