{"title": "Transition Point Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 639, "page_last": 646, "abstract": null, "full_text": "Transition Point Dynamic Programming\n\nKenneth M. Buckland*\nDept. of Electrical Engineering\nUniversity of British Columbia\nVancouver, B.C., Canada V6T 1Z4\nbuckland@pmc-sierra.bc.ca\n\nPeter D. Lawrence\nDept. of Electrical Engineering\nUniversity of British Columbia\nVancouver, B.C., Canada V6T 1Z4\npeterl@ee.ubc.ca\n\nAbstract\n\nTransition point dynamic programming (TPDP) is a memory-based, reinforcement learning, direct dynamic programming approach to adaptive optimal control that can reduce the learning time and memory usage required for the control of continuous stochastic dynamic systems. TPDP does so by determining an ideal set of transition points (TPs) which specify only the control action changes necessary for optimal control. TPDP converges to an ideal TP set by using a variation of Q-learning to assess the merits of adding, swapping and removing TPs from states throughout the state space. When applied to a race track problem, TPDP learned the optimal control policy much sooner than conventional Q-learning, and was able to do so using less memory.\n\n1 INTRODUCTION\n\nDynamic programming (DP) approaches can be utilized to determine optimal control policies for continuous stochastic dynamic systems when the state spaces of those systems have been quantized with a resolution suitable for control (Barto et al., 1991). DP controllers, in their simplest form, are memory-based controllers that operate by repeatedly updating cost values associated with every state in the discretized state space (Barto et al., 1991). 
In a state space of any size the required quantization can lead to an excessive memory requirement, and a related increase in learning time (Moore, 1991). This is the \"curse of dimensionality\".\n\n*Now at: PMC-Sierra Inc., 8501 Commerce Court, Burnaby, B.C., Canada V5A 4N3.\n\nQ-learning (Watkins, 1989, Watkins et al., 1992) is a direct form of DP that avoids explicit system modeling, thereby reducing the memory required for DP control. Further reductions are possible if Q-learning is modified so that its DP cost values (Q-values) are associated only with states where control action changes need to be specified. Transition point dynamic programming (TPDP), the control approach described in this paper, is designed to take advantage of this DP memory reduction possibility by determining the states where control action changes must be specified for optimal control, and what those optimal changes are.\n\n2 GENERAL DESCRIPTION OF TPDP\n\n2.1 TAKING ADVANTAGE OF INERTIA\n\nTPDP is suited to the control of continuous stochastic dynamic systems that have inertia. In such systems \"uniform regions\" are likely to exist in the state space where all of the (discretized) states have the same optimal control action (or the same set of optimal actions\u00b9). Considering one such uniform region, if the optimal action for that region is specified at the \"boundary states\" of the region and then maintained throughout the region until it is left and another uniform region is entered (where another set of boundary states specify the next action), none of the \"dormant states\" in the middle of the region need to specify any actions themselves. Thus dormant states do not have to be represented in memory. 
This is the basic premise of TPDP.\n\nThe association of optimal actions with boundary states is done by \"transition points\" (TPs) at those states. Boundary states include all of the states that can be reached from outside a uniform region when that region is entered as a result of stochastic state transitions. The boundary states of any one uniform region form a hyper-surface of variable thickness which may or may not be closed. The TPs at boundary states must be represented in memory, but if they are small in number compared to the dormant states the memory savings can be significant.\n\n2.2 ILLUSTRATING THE TPDP CONCEPT\n\nFigure 1 illustrates the TPDP concept when movement control of a \"car\" on a one dimensional track is desired. The car, with some initial positive velocity to the right, must pass Position A and return to the left. The TPs in Figure 1 (represented by boxes) are located at boundary states. The shaded regions indicate all of the states that the system can possibly move through given the actions specified at the boundary states and the stochastic response of the car. Shaded states without TPs are therefore dormant states. Uniform regions consist of adjacent boundary states where the same action is specified, as well as the shaded region through which that action is maintained before another boundary is encountered. Boundary states that do not seem to be on the main state transition routes (the one identified in Figure 1 for example) ensure that any stochastic deviations from those routes are realigned. Unshaded states are \"external states\" the system does not reach.\n\n\u00b9The simplifying assumption that there is only one optimal action in each uniform region will be made throughout this paper. TPDP operates the same regardless.
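As a rough illustration of the hold-until-the-next-TP behavior described in Sections 2.1 and 2.2, the following Python sketch runs a toy one dimensional car under a TP lookup table. The car model in step, the integer state grid, the choice of position 4 as Position A and the single hand-placed TP are all hypothetical, and the model is deterministic for readability; the systems considered in the paper are stochastic and their TPs are learned.

```python
def step(state, action):
    # Hypothetical deterministic car model: the action is an acceleration,
    # velocity is clipped to [-2, 2], position advances by the new velocity.
    position, velocity = state
    velocity = max(-2, min(2, velocity + action))
    return (position + velocity, velocity)


def rollout(transition_points, state, action, n_steps):
    # Use the TP action when the current state holds a TP; otherwise keep
    # the action set at the most recent TP state (the state is dormant).
    visited = [state]
    for _ in range(n_steps):
        action = transition_points.get(state, action)
        state = step(state, action)
        visited.append(state)
    return visited


# A single TP at a boundary state just past Position A (assumed to be 4):
# from there on the car accelerates to the left.
transition_points = {(5, 1): -1}
route = rollout(transition_points, state=(0, 1), action=0, n_steps=9)
print(route)
```

Only one of the visited states stores an action in this toy run; every other state on the route is dormant and needs no memory.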
\n\n[Figure 1 plot: position on the horizontal axis (Position A marked), velocity on the vertical axis; each box is a transition point (TP); a uniform region and a boundary state are labeled.]\n\nFigure 1: Application of TPDP to a One Dimension Movement Control Task\n\n2.3 MINIMAL TP OPTIMAL CONTROL\n\nThe main benefit of the TPDP approach is that, where uniform regions exist, they can be represented by a relatively small number of DP elements (TPs), depending on the shape of the boundaries and the size of the uniform regions they encompass. This reduction in memory usage results in an accompanying reduction in the learning time required to learn optimal control policies (Chapman et al., 1991).\n\nTPDP operates by learning optimal points of transition in the control action specification, where those points can be accurately located in highly resolved state spaces. To do this TPDP must determine which states are boundary states that should have TPs, and what actions those TPs should specify. In other words, TPDP must find the right TPs for the right states. When it has done so, \"minimal TP optimal control\" has been achieved. That is, optimal control with a minimal set of TPs.\n\n3 ACHIEVING MINIMAL TP OPTIMAL CONTROL\n\n3.1 MODIFYING A SET OF TPs\n\nGiven an arbitrary initial set of TPs, TPDP must modify that set so that it is transformed into a minimal TP optimal control set. Modifications can include the \"addition\" and \"removal\" of TPs throughout the state space, and the \"swapping\" of one TP for another (each specifying a different action) at the same state. These modifications are performed one at a time in arbitrary order, and can continue indefinitely. 
TPDP operates so that each TP modification results in an incremental movement towards minimal TP optimal control (Buckland, 1994).\n\n3.2 Q-LEARNING\n\nTPDP makes use of Q-learning (Watkins, 1989, Watkins et al., 1992) to modify the TP set. Normally Q-learning is used to determine the optimal control policy $\\mu$ for a stochastic dynamic system subjected to immediate costs c(i, u) when action u is applied in each state i (Barto et al., 1991). Q-learning makes use of \"Q-values\" Q(i, u), which indicate the expected total infinite-horizon discounted cost if action u is applied in state i, and actions defined by the existing policy $\\mu$ are applied in all future states. Q-values are learned by using the following updating equation:\n\n$$Q_{t+1}(s_t, u_t) = (1 - \\alpha_t) Q_t(s_t, u_t) + \\alpha_t \\left[ c(s_t, u_t) + \\gamma V_t(s_{t+1}) \\right] \\qquad (1)$$\n\nwhere $\\alpha_t$ is the update rate, $\\gamma$ is the discount factor, and $s_t$ and $u_t$ are respectively the state at time step t and the action taken at that time step (all other Q-values remain the same at time step t). The evaluation function value $V_t(i)$ is set to the lowest Q-value of all the actions U(i) possible in each state i:\n\n$$V_t(i) = \\min_{u \\in U(i)} Q_t(i, u) \\qquad (2)$$\n\nIf Equations 1 and 2 are employed during exploratory movement of the system, it has been proven that convergence to optimal Q-values $Q^*(i, u)$ and optimal evaluation function values $V_{\\mu^*}(i)$ will result (given that the proper constraints are followed, Watkins, 1989, Watkins et al., 1992, Jaakkola et al., 1994). From these values the optimal action in each state can be determined (the action that fulfills Equation 2).\n\n3.3 ASSESSING TPs WITH Q-LEARNING\n\nTPDP uses Q-learning to determine how an existing set of TPs should be modified to achieve minimal TP optimal control. 
Q-values can be associated with TPs, and the Q-values of two TPs at the same \"TP state\", each specifying different actions, can be compared to determine which should be maintained at that state, that is, which has the lower Q-value. This is how TPs are swapped (Buckland, 1994).\n\nStates which do not have TPs, \"non-TP states\", have no Q-values from which evaluation function values $V_t(i)$ can be determined (using Equation 2). As a result, to learn TP Q-values, Equation 1 must be modified to facilitate Q-value updating when the system makes d state transitions from one TP state through a number of non-TP states to another TP state:\n\n$$Q_{t+d}(s_t, u_t) = (1 - \\alpha_t) Q_t(s_t, u_t) + \\alpha_t \\left[ \\left( \\sum_{n=0}^{d-1} \\gamma^n c(s_{t+n}, u_t) \\right) + \\gamma^d V_t(s_{t+d}) \\right] \\qquad (3)$$\n\nWhen d = 1, Equation 3 takes the form of Equation 1. When d > 1, the intervening non-TP states are effectively ignored and treated as inherent parts of the stochastic dynamic behavior of the system (Buckland, 1994).\n\nIf Equation 3 is used to determine the costs incurred when no action is specified at a state (when the action specified at some previous state is maintained), an \"R-value\" R(i) is the result. R-values can be used to expediently add and remove TPs from each state. If the Q-value of a TP is less than the R-value of the state it is associated with, then it is worthwhile having that TP at that state; otherwise it is not (Buckland, 1994).\n\n3.4 CONVERGENCE TO MINIMAL TP OPTIMAL CONTROL\n\nIt has been proven that a random sequence of TP additions, swaps and removals attempted at states throughout the state space will result in convergence to minimal TP optimal control (Buckland, 1994). 
This proof depends mainly on all TP modifications \"locking-in\" any potential cost reductions which are discovered as the result of learning exploration.\n\nThe problem with this proof of convergence, and the theoretical form of TPDP described up to this point, is that each modification to the existing set of TPs (each addition, swap and removal) requires the determination of Q-values and R-values which are negligibly close to being exact. This means that a complete session of Q-learning must occur for every TP modification.\u00b2 The result is excessive learning times, a problem circumvented by the practical form of TPDP described next.\n\n4 PRACTICAL TPDP\n\n4.1 CONCURRENT TP ASSESSMENT\n\nTo solve the problem of the protracted learning time required by the theoretical form of TPDP, many TP modifications can be assessed concurrently. That is, Q-learning can be employed not just to determine the Q-values and R-values for a single TP modification, but instead to learn these values for a number of concurrent modifications. Further, the modification attempts, and the learning of the values required for them, need not be initiated simultaneously. The determination of each value can be made part of the Q-learning process whenever new modifications are randomly attempted. This approach is called \"Practical TPDP\". Practical TPDP consists of a continually running Q-learning process (based on Equations 2 and 3), where the Q-values and R-values of a constantly changing set of TPs are learned.\n\n4.2 USING WEIGHTS FOR CONCURRENT TP ASSESSMENT\n\nThe main difficulty that arises when TPs are assessed concurrently is that of determining when an assessment is complete. 
That is, when the Q-values and R-values associated with each TP have been learned well enough for a TP modification to be made based on them. The technique employed to address this problem is to associate a \"weight\" w(i, u) with each TP that indicates the general merit of that TP. The basic idea of weights is to facilitate the random addition of trial TPs to a TP \"assessment group\" with a low initial weight $w_{initial}$. The Q-values and R-values of the TPs in the assessment group are learned in an ongoing Q-learning process, and the weights of the TPs are adjusted heuristically using those values. Of those TPs at any state i whose weights w(i, u) have been increased above $w_{thr}$ ($w_{initial} < w_{thr} < w_{max}$), the one with the lowest Q-value Q(i, u) is swapped into the \"policy TP\" role for that state.\n\n\u00b2The TPDP proof allows for more than one TP swap to be assessed simultaneously, but this does little to reduce the overall problem being described (Buckland, 1994).\n\n[Figure 2 plot: learning curves for Conventional Q-learning and Practical TPDP; horizontal axis: Epoch Number, 0 to 2500; vertical axis ticks: 0, 50, 100.]\n\nFigure 2: Performance of Practical TPDP on a Race Track Problem\n\nThe heuristic weight adjustment rules are:\n\n1. New, trial TPs are given an initial weight of $w_{initial}$ ($0 < w_{initial} < w_{thr}$).\n2. Each time the Q-value of a TP is updated, the weight w(i, u) of that TP is incremented if Q(i, u) < R(i) and decremented otherwise.\n3. Each TP weight w(i, u) is limited to a maximum value of $w_{max}$. This prevents any one weight from becoming so large that it cannot readily be reduced again.\n4. If a TP weight w(i, u) is decremented to 0 the TP is removed.
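The four weight rules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the algorithm from Buckland (1994): the constants W_INITIAL, W_THR and W_MAX, the unit-sized increments and decrements, the dictionary layout and the function names are all hypothetical, and the Q-values and R-values are held fixed here instead of being learned with Equations 2 and 3.

```python
# Hypothetical constants satisfying 0 < W_INITIAL < W_THR < W_MAX.
W_INITIAL, W_THR, W_MAX = 1, 3, 6


def add_trial_tp(tps, state, action, q_value):
    # Rule 1: a new trial TP joins the assessment group with a low weight.
    tps[(state, action)] = {'q': q_value, 'w': W_INITIAL}


def adjust_weight(tps, state, action, r_value):
    # Called each time the Q-value of the TP has been updated.
    tp = tps[(state, action)]
    if tp['q'] < r_value:
        tp['w'] = min(tp['w'] + 1, W_MAX)  # Rule 2 (increment) and Rule 3 (cap)
    else:
        tp['w'] -= 1                       # Rule 2 (decrement)
        if tp['w'] <= 0:
            del tps[(state, action)]       # Rule 4: weight reached 0, remove TP


def policy_tp(tps, state):
    # Among TPs at this state whose weight exceeds W_THR, return the action
    # of the one with the lowest Q-value, or None if no TP qualifies.
    candidates = [(tp['q'], a) for (s, a), tp in tps.items()
                  if s == state and tp['w'] > W_THR]
    return min(candidates)[1] if candidates else None


tps = {}
add_trial_tp(tps, state=7, action='brake', q_value=4.0)
add_trial_tp(tps, state=7, action='coast', q_value=9.0)
for _ in range(3):
    adjust_weight(tps, 7, 'brake', r_value=6.0)  # Q < R: weight rises 1 -> 4
adjust_weight(tps, 7, 'coast', r_value=6.0)      # Q >= R: weight 1 -> 0, removed
print(policy_tp(tps, 7))
```

In this toy run the TP whose Q-value stays below the R-value of its state climbs past the threshold and becomes the policy TP, while the other trial TP is discarded after a single unfavorable comparison.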
\n\nAn algorithm for Practical TPDP implementation is described in Buckland (1994).\n\n4.3 PERFORMANCE OF PRACTICAL TPDP\n\nPractical TPDP was applied to a continuous version of a control task described by Barto et al. (1991): that of controlling the acceleration of a car down a race track (specifically the track shown in Figures 3 and 4) when that car randomly experiences control action non-responsiveness. As shown in Figure 2 (each epoch in this Figure consisted of 20 training trials and 500 testing trials), Practical TPDP learned the optimal control policy much sooner than conventional Q-learning, and it was able to do so when limited to only 15% of the possible number of TPs (Buckland, 1994). The possible number of TPs is the full set of Q-values required by conventional Q-learning (one for each possible state and action combination).\n\n[Figure 3 image: race track with Starting Positions and Finishing Positions labeled.]\n\nFigure 3: Typical Race Track Routes After 300 Epochs\n\n[Figure 4 image: race track with Starting Positions and Finishing Positions labeled.]\n\nFigure 4: Typical Race Track Routes After 1300 Epochs\n\nThe main advantage of Practical TPDP is that it facilitates rapid learning of preliminary control policies. Figure 3 shows typical routes followed by the car early in the learning process. With the addition of relatively few TPs, the policy of accelerating wildly down the track, smashing into the wall and continuing on to the finishing positions was learned. Further learning centered around this preliminary policy led to the optimal policy of sweeping around the left turn. 
Figure 4 shows typical routes followed by the car during this shift in the learned policy, a shift indicated by a slight drop in the learning curve shown in Figure 2 (around 1300 epochs). After this shift, learning progressed rapidly until roughly optimal policies were consistently followed.\n\nA problem which occurs in Practical TPDP is that of the addition of superfluous TPs after the optimal policy has basically been learned. The reasons this occurs are described in Buckland (1994), as well as a number of solutions to the problem.\n\n5 CONCLUSION\n\nThe practical form of TPDP performs very well when compared to conventional Q-learning. When applied to a race track problem it was able to learn optimal policies more quickly while using less memory. Like Q-learning, TPDP has all the advantages and disadvantages that result from it being a direct control approach that develops no explicit system model (Watkins, 1989, Buckland, 1994).\n\nIn order to take advantage of the sparse memory usage that occurs in TPDP, TPs are best represented by ACAMs (associative content addressable memories, Atkeson, 1989). A localized neural network design which operates as an ACAM and which facilitates Practical TPDP control is described in Buckland et al. (1993) and Buckland (1994).\n\nThe main idea of TPDP is to \"try this for a while and see what happens\". This is a potentially powerful approach, and the use of TPs associated with abstracted control actions could be found to have substantial utility in hierarchical control systems.\n\nAcknowledgements\n\nThanks to John Ip for his help on this work. This work was supported by an NSERC Postgraduate Scholarship, and NSERC Operating Grant A4922.
\n\nReferences\n\nAtkeson, C. G. (1989), \"Learning arm kinematics and dynamics\", Annual Review of Neuroscience, vol. 12, 1989, pp. 157-183.\n\nBarto, A. G., S. J. Bradtke and S. P. Singh (1991), \"Real-time learning and control using asynchronous dynamic programming\", COINS Technical Report 91-57, University of Massachusetts, Aug. 1991.\n\nBuckland, K. M. and P. D. Lawrence (1993), \"A connectionist approach to direct dynamic programming control\", Proc. of the IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing, Victoria, 1993, vol. 1, pp. 284-287.\n\nBuckland, K. M. (1994), Optimal Control of Dynamic Systems Through the Reinforcement Learning of Transition Points, Ph.D. Thesis, Dept. of Electrical Engineering, University of British Columbia, 1994.\n\nChapman, D. and L. P. Kaelbling (1991), \"Input generalization in delayed reinforcement-learning: an algorithm and performance comparisons\", Proc. of the 12th Int. Joint Conf. on Artificial Intelligence, Sydney, Aug. 1991, pp. 726-731.\n\nJaakkola, T., M. I. Jordan and S. P. Singh (1994), \"Stochastic convergence of iterative DP algorithms\", Advances in Neural Information Processing Systems 6, eds.: J. D. Cowan, G. Tesauro and J. Alspector, San Francisco, CA: Morgan Kaufmann Publishers, 1994.\n\nMoore, A. W. (1991), \"Variable resolution dynamic programming: efficiently learning action maps in multivariate real-valued state-spaces\", Machine Learning: Proc. of the 8th Int. Workshop, San Mateo, CA: Morgan Kaufmann Publishers, 1991.\n\nWatkins, C. J. C. H. (1989), Learning from Delayed Rewards, Ph.D. Thesis, Cambridge University, Cambridge, England, 1989.\n\nWatkins, C. J. C. H. and P. 
Dayan (1992), \"Q-learning\", Machine Learning, vol. 8, 1992, pp. 279-292.\n", "award": [], "sourceid": 848, "authors": [{"given_name": "Kenneth", "family_name": "Buckland", "institution": null}, {"given_name": "Peter", "family_name": "Lawrence", "institution": null}]}