{"title": "Advantage Updating Applied to a Differential Game", "book": "Advances in Neural Information Processing Systems", "page_first": 353, "page_last": 360, "abstract": "", "full_text": "Advantage Updating Applied to \n\na Differential Game \n\nMance  E.  Harmon \n\nWright Laboratory \n\nWL/AAAT Bldg. 635  2185 Avionics Circle \n\nWright-Patterson Air Force Base, OH  45433-7301 \n\nharmonme@aa.wpafb.mil \n\nLeemon  C.  Baird  III\u00b7 \n\nWright Laboratory \n\nbaird@cs.usafa.af.mil \n\nA.  Harry  Klopr \nWright Laboratory \n\nklopfah@aa.wpafb.mil \n\nCategory:  Control,  Navigation, and  Planning \n\nKeywords: Reinforcement Learning, Advantage Updating, \n\nDynamic Programming, Differential Games \n\nAbstract \n\nAn application of reinforcement learning to a linear-quadratic, differential \ngame  is  presented.  The reinforcement learning  system  uses  a  recently \ndeveloped  algorithm,  the residual gradient form  of advantage updating. \nThe  game  is a  Markov  Decision  Process  (MDP)  with continuous  time, \nstates, and actions, linear dynamics, and a quadratic cost function.  The \ngame consists of two players, a missile and a plane;  the missile pursues \nthe plane and  the plane evades the  missile.  The reinforcement learning \nalgorithm for optimal control is modified for differential games in order to \nfind the minimax point, rather than  the maximum.  Simulation results are \ncompared  to  the  optimal  solution,  demonstrating  that  the  simulated \nreinforcement learning  system  converges  to  the optimal  answer.  The \nperformance of both the residual gradient and non-residual gradient forms \nof advantage updating and Q-learning are compared.  The results show that \nadvantage  updating  converges faster  than  Q-learning in  all  simulations. \nThe results also show advantage updating converges regardless of the time \nstep duration; Q-learning is  unable to converge as the  time step duration \n~rows small. \n\nU.S .A.F.  Academy,  2354  Fairchild  Dr.  Suite 6K4l,  USAFA,  CO  80840-6234 \n\n\f354 \n\nMance E.  Hannon,  Leemon C.  Baird ll/, A.  Harry Klopf \n\n1  ADVANTAGE  UPDATING \n\nThe advantage updating algorithm (Baird, 1993) is a reinforcement learning algorithm in \nwhich  two types  of information  are stored.  For each  state x, the  value V(x)  is  stored, \nrepresenting an estimate of the total discounted return expected when starting in  state x \nand performing optimal actions.  For each state x and action u, the advantage, A(x,u), is \nstored, representing  an  estimate of the  degree  to  which  the expected  total  discounted \nreinforcement  is  increased  by  performing  action  u  rather  than  the  action  currently \nconsidered best.  The optimal value function V* (x) represents the true value of each state. \nThe optimal advantage function A * (x,u)  will be zero if u is the optimal action (because u \nconfers no advantage relative to itself) and A * (x,u) will be negative for any suboptimal u \n(because a suboptimal action has a negative advantage relative to  the best action).  The \noptimal advantage function A * can be defined in terms of the optimal value function v*: \n\nA*(x,u) = ~[RN(X,U)- V*(x)+ rNV*(x')] \n\nbat \n\n(1) \n\nThe definition  of an  advantage  includes a  l/flt term  to ensure that,  for  small  time  step \nduration flt, the advantages will not all go to zero. 
Both the value function and the advantage function are needed during learning, but after convergence to optimality, the policy can be extracted from the advantage function alone. The optimal policy for state x is any u that maximizes A*(x,u). The notation

    A_{\max}(x) = \max_u A(x,u)        (2)

defines A_max(x). If A_max converges to zero in every state, the advantage function is said to be normalized. Advantage updating has been shown to learn faster than Q-learning (Watkins, 1989), especially for continuous-time problems (Baird, 1993).

If advantage updating (Baird, 1993) is used to control a deterministic system, there are two equations that are the equivalent of the Bellman equation in value iteration (Bertsekas, 1987). These are a pair of simultaneous equations (Baird, 1993):

    A(x,u) - \max_{u'} A(x,u') = \left( R + \gamma^{\Delta t} V(x') - V(x) \right) \frac{1}{\Delta t}        (3)

    \max_u A(x,u) = 0        (4)

where a time step is of duration Δt, and performing action u in state x results in a reinforcement of R and a transition to state x_{t+Δt}. The optimal advantage and value functions will satisfy these equations. For a given A and V function, the Bellman residual errors E, as used in Williams and Baird (1993) and defined here as equations (5) and (6), are the degrees to which the two equations are not satisfied:

    E_1(x_t,u_t) = \left( R(x_t,u_t) + \gamma^{\Delta t} V(x_{t+\Delta t}) - V(x_t) \right) \frac{1}{\Delta t} - A(x_t,u_t) + \max_{u'} A(x_t,u')        (5)

    E_2(x_t,u_t) = -\max_u A(x_t,u)        (6)

2 RESIDUAL GRADIENT ALGORITHMS

Dynamic programming algorithms can be guaranteed to converge to optimality when used with look-up tables, yet be completely unstable when combined with function-approximation systems (Baird & Harmon, in preparation). It is possible to derive an algorithm that has guaranteed convergence for a quadratic function-approximation system (Bradtke, 1993), but that algorithm is specific to quadratic systems. One solution to this problem is to derive a learning algorithm that performs gradient descent on the mean squared Bellman residuals given in (5) and (6). This is called the residual gradient form of an algorithm. There are two Bellman residuals, (5) and (6), so the residual gradient algorithm must perform gradient descent on the sum of the two squared Bellman residuals. It has been found to be useful to combine reinforcement learning algorithms with function approximation systems (Tesauro, 1990 & 1992).
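Before the weight update is derived in equation (7) below, it may help to see the two residuals evaluated concretely. The following sketch is illustrative only; A_hat and V_hat stand for the current advantage and value estimates, and actions is an assumed finite set of candidate actions used to approximate the maximization over u'.

    # Sketch of the Bellman residuals of equations (5) and (6).
    # `A_hat(x, u)` and `V_hat(x)` are hypothetical current estimates of the advantage
    # and value functions; `actions` approximates the maximization over u'.
    def bellman_residuals(x, u, r, x_next, dt, gamma, A_hat, V_hat, actions):
        a_max = max(A_hat(x, u2) for u2 in actions)
        e1 = (r + gamma**dt * V_hat(x_next) - V_hat(x)) / dt - A_hat(x, u) + a_max
        e2 = -a_max
        return e1, e2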
If function approximation systems are used for the advantage and value functions, and if the function approximation systems are parameterized by a set of adjustable weights, and if the system being controlled is deterministic, then, for incremental learning, a given weight W in the function-approximation system could be changed according to equation (7) on each time step:

    \Delta W = -\alpha \, \frac{\partial}{\partial W} \frac{E_1^2(x_t,u_t) + E_2^2(x_t,u_t)}{2}
             = -\alpha E_1(x_t,u_t) \frac{\partial E_1(x_t,u_t)}{\partial W} - \alpha E_2(x_t,u_t) \frac{\partial E_2(x_t,u_t)}{\partial W}
             = -\alpha \left( \frac{1}{\Delta t}\left( R + \gamma^{\Delta t} V(x_{t+\Delta t}) - V(x_t) \right) - A(x_t,u_t) + \max_u A(x_t,u) \right) \left( \frac{1}{\Delta t}\left( \gamma^{\Delta t} \frac{\partial V(x_{t+\Delta t})}{\partial W} - \frac{\partial V(x_t)}{\partial W} \right) - \frac{\partial A(x_t,u_t)}{\partial W} + \frac{\partial \max_u A(x_t,u)}{\partial W} \right) - \alpha \max_u A(x_t,u) \frac{\partial \max_u A(x_t,u)}{\partial W}        (7)

As a simple, gradient-descent algorithm, equation (7) is guaranteed to converge to the correct answer for a deterministic system, in the same sense that backpropagation (Rumelhart, Hinton, & Williams, 1986) is guaranteed to converge. However, if the system is nondeterministic, then it is necessary to independently generate two different possible "next states" x_{t+Δt} for a given action u_t performed in a given state x_t. One x_{t+Δt} must be used to evaluate V(x_{t+Δt}), and the other must be used to evaluate ∂V(x_{t+Δt})/∂W. This ensures that the weight change is an unbiased estimator of the true Bellman-residual gradient, but requires a system such as in Dyna (Sutton, 1990) to generate the second x_{t+Δt}. The differential game in this paper was deterministic, so this was not needed here.

3 THE SIMULATION

3.1 GAME DEFINITION

We employed a linear-quadratic, differential game (Isaacs, 1965) for comparing Q-learning to advantage updating, and for comparing the algorithms in their residual gradient forms. The game has two players, a missile and a plane, as in games described by Rajan, Prasad, and Rao (1980) and Millington (1991). The state x is a vector (x_m, x_p) composed of the state of the missile and the state of the plane, each of which is composed of the position and velocity of the player in two-dimensional space. The action u is a vector (u_m, u_p) composed of the action performed by the missile and the action performed by the plane, each of which is the acceleration of the player in two-dimensional space. The dynamics of the system are linear; the next state x_{t+1} is a linear function of the current state x_t and action u_t. The reinforcement function R is a quadratic function of the accelerations and the distance between the players:

    R(x,u) = \left[ \text{distance}^2 + (\text{missile acceleration})^2 - 2(\text{plane acceleration})^2 \right] \Delta t        (8)

    R(x,u) = \left[ (x_m - x_p)^2 + u_m^2 - 2 u_p^2 \right] \Delta t        (9)

In equation (9), squaring a vector is equivalent to taking the dot product of the vector with itself. The missile seeks to minimize the reinforcement, and the plane seeks to maximize reinforcement. The plane receives twice as much punishment for acceleration as does the missile, thus allowing the missile to accelerate twice as easily as the plane.

The value function V is a quadratic function of the state. In equation (10), D_m and D_p are weight matrices that change during learning:

    V(x) = x_m^T D_m x_m + x_p^T D_p x_p        (10)

The advantage function A is a quadratic function of the state x and action u. The actions are accelerations of the missile and plane in two dimensions.
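The quadratic form used for A is given in equation (11) below. As a concrete illustration of the game itself, the following sketch advances the two players by one time step and computes the reinforcement of equation (9); the double-integrator update is an assumed instance of the linear dynamics, and the point at which the distance is evaluated is also an assumption, neither being taken from the paper.

    import numpy as np

    # Sketch of one step of the pursuit-evasion game.  Each player's state is a
    # (position, velocity) pair in the plane and each action is an acceleration.
    def game_step(xm_pos, xm_vel, xp_pos, xp_vel, um, up, dt):
        # Equation (9), evaluated at the current state (evaluation point is an assumption):
        r = (np.dot(xm_pos - xp_pos, xm_pos - xp_pos)
             + np.dot(um, um) - 2.0 * np.dot(up, up)) * dt
        # Assumed double-integrator instance of the linear dynamics:
        xm_pos, xm_vel = xm_pos + xm_vel * dt, xm_vel + um * dt   # missile
        xp_pos, xp_vel = xp_pos + xp_vel * dt, xp_vel + up * dt   # plane
        return (xm_pos, xm_vel, xp_pos, xp_vel), r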
    A(x,u) = x_m^T A_m x_m + x_m^T B_m C_m u_m + u_m^T C_m u_m + x_p^T A_p x_p + x_p^T B_p C_p u_p + u_p^T C_p u_p        (11)

The matrices A, B, and C are the adjustable weights that change during learning. Equation (11) is the sum of two general quadratic functions. This would still be true if the second and fifth terms were xBu instead of xBCu. The latter form was used to simplify the calculation of the policy. Using the xBu form, the gradient is zero when u = -C^{-1}Bx/2. Using the xBCu form, the gradient of A(x,u) with respect to u is zero when u = -Bx/2, which avoids the need to invert a matrix while calculating the policy.

3.2 THE BELLMAN RESIDUAL AND UPDATE EQUATIONS

Equations (5) and (6) define the Bellman residuals when maximizing the total discounted reinforcement for an optimal control problem; equations (12) and (13) modify the algorithm to solve differential games rather than optimal control problems:

    E_1(x_t,u_t) = \left( R(x_t,u_t) + \gamma^{\Delta t} V(x_{t+\Delta t}) - V(x_t) \right) \frac{1}{\Delta t} - A(x_t,u_t) + \operatorname{minimax}_u A(x_t,u)        (12)

    E_2(x_t,u_t) = -\operatorname{minimax}_u A(x_t,u)        (13)

The resulting weight update equation is:

    \Delta W = -\alpha \left( \left( R + \gamma^{\Delta t} V(x_{t+\Delta t}) - V(x_t) \right) \frac{1}{\Delta t} - A(x_t,u_t) + \operatorname{minimax}_u A(x_t,u) \right) \left( \left( \gamma^{\Delta t} \frac{\partial V(x_{t+\Delta t})}{\partial W} - \frac{\partial V(x_t)}{\partial W} \right) \frac{1}{\Delta t} - \frac{\partial A(x_t,u_t)}{\partial W} + \frac{\partial \operatorname{minimax}_u A(x_t,u)}{\partial W} \right) - \alpha \operatorname{minimax}_u A(x_t,u) \frac{\partial \operatorname{minimax}_u A(x_t,u)}{\partial W}        (14)

For Q-learning, the residual-gradient form of the weight update equation is:

    \Delta W = -\alpha \left( R + \gamma^{\Delta t} \operatorname{minimax}_u Q(x_{t+\Delta t},u) - Q(x_t,u_t) \right) \left( \gamma^{\Delta t} \frac{\partial}{\partial W} \operatorname{minimax}_u Q(x_{t+\Delta t},u) - \frac{\partial}{\partial W} Q(x_t,u_t) \right)        (15)

4 RESULTS

4.1 RESIDUAL GRADIENT ADVANTAGE UPDATING RESULTS

The optimal weight matrices A*, B*, C*, and D* were calculated numerically with Mathematica for comparison. The residual gradient form of advantage updating learned the correct policy weights, B, to three significant digits after extensive training. Very interesting behavior was exhibited by the plane under certain initial conditions. The plane learned that in some cases it is better to turn toward the missile in the short term to increase the distance between the two in the long term, a tactic sometimes used by pilots. Figure 1 gives an example.

[Figure 1: trajectories of the missile and plane, and distance (log scale) vs. time from 0 to 0.12.]

Figure 1: Simulation of a missile (dotted line) pursuing a plane (solid line), each having learned optimal behavior. The graph of distance vs. time shows the effects of the plane's maneuver in turning toward the missile.

4.2 COMPARATIVE RESULTS

The error in the policy of a learning system was defined to be the sum of the squared errors in the B matrix weights. The optimal policy weights in this problem are the same for both advantage updating and Q-learning, so this metric can be used to compare results for both algorithms. Four different learning algorithms were compared: advantage updating, Q-learning, residual gradient advantage updating, and residual gradient Q-learning.
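This policy-error metric is simple to state explicitly. In the sketch below, B is a learned policy-weight matrix and B_star is the corresponding optimal matrix computed numerically (as in Section 4.1); both are assumed to be given.

    import numpy as np

    # Sketch of the policy-error metric of Section 4.2: the sum of squared
    # differences between learned policy weights B and optimal weights B*.
    def policy_error(B, B_star):
        return float(np.sum((np.asarray(B) - np.asarray(B_star)) ** 2))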
Advantage updating in the non-residual-gradient form was unstable to the point that no meaningful results could be obtained, so simulation results cannot be given for it.

4.2.1 Experiment Set 1

The learning rates for both forms of Q-learning were optimized to one significant digit for each simulation. A single learning rate was used for residual-gradient advantage updating in all four simulations. It is possible that advantage updating would have performed better with different learning rates. For each algorithm, the error was calculated after learning for 40,000 iterations. The process was repeated 10 times using different random number seeds, and the results were averaged. This experiment was performed for four different time step durations: 0.05, 0.005, 0.0005, and 0.00005. The non-residual-gradient form of Q-learning appeared to work better when the weights were initialized to small numbers. Therefore, the initial weights were chosen randomly between 0 and 1 for the residual-gradient forms of the algorithms, and between 0 and 10^-8 for the non-residual-gradient form of Q-learning. For small time steps, non-residual-gradient Q-learning performed so poorly that the error was lower for a learning rate of zero (no learning) than it was for a learning rate of 10^-8. Table 1 gives the learning rates used for each simulation, and Figure 2 shows the resulting error after learning.

[Figure 2: final error (0 to 8) vs. time step duration (0.05, 0.005, 0.0005, 0.00005) for Q, RQ, and RAU.]

Figure 2: Error vs. time step size comparison for Q-learning (Q), residual-gradient Q-learning (RQ), and residual-gradient advantage updating (RAU), using rates optimal to one significant figure for both forms of Q-learning, and not optimized for advantage updating. The final error is the sum of squared errors in the B matrix weights after 40,000 time steps of learning. The final error for advantage updating was lower than both forms of Q-learning in every case. The errors increased for Q-learning as the time step size decreased.

    Time step duration, Δt    5x10^-2    5x10^-3    5x10^-4    5x10^-5
    Q                         0.02       0.06       0.2        0.4
    RQ                        0.08       0.09       0          0
    RAU                       0.005      0.005      0.005      0.005

Table 1: Learning rates used for each simulation. Learning rates are optimal to one significant figure for both forms of Q-learning, but are not necessarily optimal for advantage updating.
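The protocol behind Figure 2 and Table 1 can be summarized schematically. In the sketch below, train_policy_error is a hypothetical stand-in for a single 40,000-iteration training run of a named algorithm at a given time step duration and random seed, returning the final sum of squared B-weight errors; it is not code from the original experiments.

    import statistics

    # Schematic of the Experiment Set 1 protocol (Section 4.2.1); a sketch only.
    # `train_policy_error(algorithm, dt, seed)` is a hypothetical stand-in for one
    # 40,000-iteration run returning the final squared error in the B policy weights.
    def experiment_set_1(train_policy_error,
                         algorithms=("Q", "RQ", "RAU"),
                         dts=(0.05, 0.005, 0.0005, 0.00005),
                         seeds=range(10)):
        results = {}
        for name in algorithms:
            for dt in dts:
                errors = [train_policy_error(name, dt, seed) for seed in seeds]
                results[(name, dt)] = statistics.mean(errors)   # averaged over 10 seeds
        return results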
4.2.2 Experiment Set 2

Figure 3 shows a comparison of the three algorithms' ability to converge to the correct policy. The figure shows the total squared error in each algorithm's policy weights as a function of learning time. This simulation ran for a much longer period than the simulations in Table 1 and Figure 2. The learning rates used for this simulation were identical to the rates that were found to be optimal for the shorter run. The weights for the non-residual-gradient form of Q-learning grew without bound in all of the long experiments, even after the learning rate was reduced by an order of magnitude. Residual gradient advantage updating was able to learn the correct policy, while Q-learning was unable to learn a policy that was better than the initial, random weights.

[Figure 3: Learning ability comparison. Error (log scale, 0.001 to 10) vs. time steps in millions (0 to 5) for RAU and RQ.]

5 CONCLUSION

The experimental data shows residual-gradient advantage updating to be superior to the three other algorithms in all cases. As the time step grows small, Q-learning is unable to learn the correct policy. Future research will include the use of more general networks and implementation of the wire fitting algorithm proposed by Baird and Klopf (1993) to calculate the policy from a continuous choice of actions in more general networks.

Acknowledgments

This research was supported under Task 2312R1 by the Life and Environmental Sciences Directorate of the United States Air Force Office of Scientific Research.

References

Baird, L. C. (1993). Advantage updating (Wright Laboratory Technical Report WL-TR-93-1146). Wright-Patterson Air Force Base, OH. Available from the Defense Technical Information Center, Cameron Station, Alexandria, VA 22304-6145.

Baird, L. C., & Harmon, M. E. (in preparation). Residual gradient algorithms (Wright Laboratory Technical Report). Wright-Patterson Air Force Base, OH.

Baird, L. C., & Klopf, A. H. (1993). Reinforcement learning with high-dimensional, continuous actions (Wright Laboratory Technical Report WL-TR-93-1147). Wright-Patterson Air Force Base, OH. Available from the Defense Technical Information Center, Cameron Station, Alexandria, VA 22304-6145.

Bertsekas, D. P. (1987). Dynamic programming: Deterministic and stochastic models. Englewood Cliffs, NJ: Prentice-Hall.

Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. Proceedings of the 5th Annual Conference on Neural Information Processing Systems.

Isaacs, R. (1965). Differential games. New York: John Wiley and Sons.

Millington, P. J. (1991). Associative reinforcement learning for optimal control. Unpublished master's thesis, Massachusetts Institute of Technology, Cambridge, MA.

Rajan, N., Prasad, U. R., & Rao, N. J. (1980). Pursuit-evasion of two aircraft in a horizontal plane. Journal of Guidance and Control, 3(3), 261-267.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning.

Tesauro, G. (1990). Neurogammon: A neural-network backgammon program. Proceedings of the International Joint Conference on Neural Networks, 3, 33-40. San Diego, CA.

Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8(3/4), 279-292.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. Doctoral thesis, Cambridge University, Cambridge, England.
\n\n\f", "award": [], "sourceid": 912, "authors": [{"given_name": "Mance", "family_name": "Harmon", "institution": null}, {"given_name": "Leemon", "family_name": "Baird", "institution": null}, {"given_name": "A.", "family_name": "Klopf", "institution": null}]}