{"title": "Analysis of Temporal-Difference Learning with Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1075, "page_last": 1081, "abstract": "", "full_text": "Analysis of Temporal-Difference Learning with Function Approximation\n\nJohn N. Tsitsiklis and Benjamin Van Roy\nLaboratory for Information and Decision Systems\nMassachusetts Institute of Technology\nCambridge, MA 02139\ne-mail: jnt@mit.edu, bvr@mit.edu\n\nAbstract\n\nWe present new results about the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of a Markov chain using linear function approximators. The algorithm we analyze performs on-line updating of a parameter vector during a single endless trajectory of an aperiodic irreducible finite-state Markov chain. Results include convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. In addition to establishing new and stronger results than those previously available, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. Furthermore, we discuss the implications of two counter-examples with regard to the significance of on-line updating and linearly parameterized function approximators.\n\n1 INTRODUCTION\n\nThe problem of predicting the expected long-term future cost (or reward) of a stochastic dynamic system manifests itself in both time-series prediction and control. An example in time-series prediction is that of estimating the net present value of a corporation, as a discounted sum of its future cash flows, based on the current state of its operations. 
In control, the ability to predict long-term future cost as a function of state enables the ranking of alternative states in order to guide decision-making. Indeed, such predictions constitute the cost-to-go function that is central to dynamic programming and optimal control (Bertsekas, 1995).\n\nTemporal-difference learning, originally proposed by Sutton (1988), is a method for approximating long-term future cost as a function of current state. The algorithm is recursive, efficient, and simple to implement. Linear combinations of fixed basis functions are used to approximate the mapping from state to future cost. The weights of the linear combination are updated upon each observation of a state transition and the associated cost. The objective is to improve approximations of long-term future cost as more and more state transitions are observed. The trajectory of states and costs can be generated either by a physical system or a simulated model. In either case, we view the system as a Markov chain. Adopting terminology from dynamic programming, we will refer to the function mapping states of the Markov chain to expected long-term cost as the cost-to-go function.\n\nIn this paper, we introduce a new line of analysis for temporal-difference learning. In addition to providing new intuition about the dynamics of the algorithm, this approach leads to a stronger convergence result than previously available, as well as an interpretation of the limit of convergence and bounds on the resulting approximation error, neither of which has been available in the past. Aside from the statement of results, we maintain the discussion at an informal level, and make no attempt to present a complete or rigorous proof. 
The formal and more general analysis based on our line of reasoning can be found in (Tsitsiklis and Van Roy, 1996), which also discusses the relationship between our results and other work involving temporal-difference learning.\n\nThe convergence results assume the use of both on-line updating and linearly parameterized function approximators. To clarify the relevance of these requirements, we discuss the implications of two counter-examples that are presented in (Tsitsiklis and Van Roy, 1996). These counter-examples demonstrate that temporal-difference learning can diverge in the presence of either nonlinearly parameterized function approximators or arbitrary (instead of on-line) sampling distributions.\n\n2 DEFINITION OF TD($\\lambda$)\n\nIn this section, we define precisely the nature of temporal-difference learning, as applied to approximation of the cost-to-go function for an infinite-horizon discounted Markov chain. While the method as well as our subsequent results are applicable to Markov chains with fairly general state spaces, including continuous and unbounded spaces, we restrict our attention in this paper to the case where the state space is finite. Discounted Markov chains with more general state spaces are addressed in (Tsitsiklis and Van Roy, 1996). Application of this line of analysis to the context of undiscounted absorbing Markov chains can be found in (Bertsekas and Tsitsiklis, 1996) and has also been carried out by Gurvits (personal communication).\n\nWe consider an aperiodic irreducible Markov chain with a state space $S = \\{1, \\ldots, n\\}$, a transition probability matrix $P$ whose $(i,j)$th entry is denoted by $p_{ij}$, transition costs $g(i,j)$ associated with each transition from a state $i$ to a state $j$, and a discount factor $\\alpha \\in (0,1)$. 
The sequence of states visited by the Markov chain is denoted by $\\{i_t \\mid t = 0, 1, \\ldots\\}$. The cost-to-go function $J^* : S \\mapsto \\Re$ associated with this Markov chain is defined by\n\n$$J^*(i) \\triangleq E\\left[\\sum_{t=0}^{\\infty} \\alpha^t g(i_t, i_{t+1}) \\,\\Big|\\, i_0 = i\\right].$$\n\nSince the number of dimensions is finite, it is convenient to view $J^*$ as a vector instead of a function.\n\nWe consider approximations of $J^*$ using a function of the form\n\n$$J(i, r) = (\\Phi r)(i).$$\n\nHere, $r = (r(1), \\ldots, r(K))$ is a parameter vector and $\\Phi$ is an $n \\times K$ matrix. We denote the $i$th row of $\\Phi$ as a (column) vector $\\phi(i)$.\n\nSuppose that we observe a sequence of states $i_t$ generated according to the transition probability matrix $P$ and that at time $t$ the parameter vector $r$ has been set to some value $r_t$. We define the temporal difference $d_t$ corresponding to the transition from $i_t$ to $i_{t+1}$ by\n\n$$d_t = g(i_t, i_{t+1}) + \\alpha J(i_{t+1}, r_t) - J(i_t, r_t).$$\n\nWe define a sequence of eligibility vectors $z_t$ (of dimension $K$) by\n\n$$z_t = \\sum_{k=0}^{t} (\\alpha\\lambda)^{t-k} \\phi(i_k).$$\n\nThe TD($\\lambda$) updates are then given by\n\n$$r_{t+1} = r_t + \\gamma_t d_t z_t,$$\n\nwhere $r_0$ is initialized to some arbitrary vector, $\\gamma_t$ is a sequence of scalar step sizes, and $\\lambda$ is a parameter in $[0,1]$. Since temporal-difference learning is actually a continuum of algorithms, parameterized by $\\lambda$, it is often referred to as TD($\\lambda$). Note that the eligibility vectors can be updated recursively according to $z_{t+1} = \\alpha\\lambda z_t + \\phi(i_{t+1})$, initialized with $z_{-1} = 0$.\n\n3 ANALYSIS OF TD($\\lambda$)\n\nTemporal-difference learning originated in the field of reinforcement learning. 
A view commonly adopted in the original setting is that the algorithm involves \"looking back in time and correcting previous predictions.\" In this context, the eligibility vector keeps track of how the parameter vector should be adjusted in order to appropriately modify prior predictions when a temporal difference is observed. Here, we take a different view which involves examining the \"steady-state\" behavior of the algorithm and arguing that this characterizes the long-term evolution of the parameter vector. In the remainder of this section, we introduce this view of TD($\\lambda$) and provide an overview of the analysis that it leads to. Our goal in this section is to convey some intuition about how the algorithm works, and in this spirit, we maintain the discussion at an informal level, omitting technical assumptions and other details required to formally prove the statements we make. These technicalities are addressed in (Tsitsiklis and Van Roy, 1996), where formal proofs are presented.\n\nWe begin by introducing some notation that will make our discussion here more concise. Let $\\pi(1), \\ldots, \\pi(n)$ denote the steady-state probabilities for the process $i_t$. We assume that $\\pi(i) > 0$ for all $i \\in S$. We define an $n \\times n$ diagonal matrix $D$ with diagonal entries $\\pi(1), \\ldots, \\pi(n)$. We define a weighted norm $\\|\\cdot\\|_D$ by\n\n$$\\|J\\|_D = \\sqrt{\\sum_{i \\in S} \\pi(i) J^2(i)}.$$\n\nWe define a \"projection matrix\" $\\Pi$ by\n\n$$\\Pi J = \\arg\\min_{\\bar{J} \\in \\{\\Phi r \\mid r \\in \\Re^K\\}} \\|J - \\bar{J}\\|_D.$$\n\nIt is easy to show that $\\Pi = \\Phi(\\Phi' D \\Phi)^{-1}\\Phi' D$.\n\nWe define an operator $T^{(\\lambda)} : \\Re^n \\mapsto \\Re^n$, indexed by a parameter $\\lambda \\in [0,1)$, by\n\n$$(T^{(\\lambda)} J)(i) = (1 - \\lambda) \\sum_{m=0}^{\\infty} \\lambda^m E\\left[\\sum_{t=0}^{m} \\alpha^t g(i_t, i_{t+1}) + \\alpha^{m+1} J(i_{m+1}) \\,\\Big|\\, i_0 = i\\right].$$\n\nFor $\\lambda = 1$ we define $(T^{(1)} J)(i) = J^*(i)$, so that $\\lim_{\\lambda \\uparrow 1} (T^{(\\lambda)} J)(i) = (T^{(1)} J)(i)$. 
To interpret this operator in a meaningful manner, note that, for each $m$, the term\n\n$$E\\left[\\sum_{t=0}^{m} \\alpha^t g(i_t, i_{t+1}) + \\alpha^{m+1} J(i_{m+1}) \\,\\Big|\\, i_0 = i\\right]$$\n\nis the expected cost to be incurred over $m$ transitions plus an approximation to the remaining cost to be incurred, based on $J$. This sum is sometimes called the \"$m$-stage truncated cost-to-go.\" Intuitively, if $J$ is an approximation to the cost-to-go function, the $m$-stage truncated cost-to-go can be viewed as an improved approximation. Since $T^{(\\lambda)} J$ is a weighted average over the $m$-stage truncated cost-to-go values, $T^{(\\lambda)} J$ can also be viewed as an improved approximation to $J^*$. A property of $T^{(\\lambda)}$ that is instrumental in our proof of convergence is that $T^{(\\lambda)}$ is a contraction with respect to the norm $\\|\\cdot\\|_D$. It follows from this fact that the composition $\\Pi T^{(\\lambda)}$ is also a contraction with respect to the same norm, and has a fixed point of the form $\\Phi r^*$ for some parameter vector $r^*$.\n\nTo clarify the fundamental structure of TD($\\lambda$), we construct a process $X_t = (i_t, i_{t+1}, z_t)$. It is easy to see that $X_t$ is a Markov process. In particular, $z_{t+1}$ and $i_{t+1}$ are deterministic functions of $X_t$ and the distribution of $i_{t+2}$ only depends on $i_{t+1}$. Note that at each time $t$, the random vector $X_t$, together with the current parameter vector $r_t$, provides all necessary information for computing $r_{t+1}$. By defining a function $s$ with $s(r, X) = (g(i,j) + \\alpha J(j, r) - J(i, r)) z$, where $X = (i, j, z)$, we can rewrite the TD($\\lambda$) algorithm as\n\n$$r_{t+1} = r_t + \\gamma_t s(r_t, X_t).$$\n\nFor any $r$, $s(r, X_t)$ has a \"steady-state\" expectation, which we denote by $E_0[s(r, X_t)]$. Intuitively, once $X_t$ reaches steady-state, the TD($\\lambda$) algorithm, in an \"average\" sense, behaves like the following deterministic algorithm:\n\n$$\\bar{r}_{\\tau+1} = \\bar{r}_\\tau + \\gamma_\\tau E_0[s(\\bar{r}_\\tau, X_t)].$$ 
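The TD(λ) recursion of Section 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name td_lambda, its argument names, and the two-state toy trajectory are hypothetical, states are indexed from 0, and the step sizes are passed in as a plain list.

```python
import numpy as np


def td_lambda(phi, states, costs, alpha, lam, step_sizes, r0=None):
    # phi: (n, K) feature matrix whose i-th row is phi(i).
    # states: visited states i_0, ..., i_T (0-based indices).
    # costs: costs[t] = g(i_t, i_{t+1}) for t = 0, ..., T-1.
    # step_sizes: gamma_0, ..., gamma_{T-1}.
    phi = np.asarray(phi, dtype=float)
    r = np.zeros(phi.shape[1]) if r0 is None else np.array(r0, dtype=float)
    z = np.zeros(phi.shape[1])  # z_{-1} = 0
    for t in range(len(costs)):
        i, j = states[t], states[t + 1]
        # Recursive eligibility update: z_t = alpha * lam * z_{t-1} + phi(i_t).
        z = alpha * lam * z + phi[i]
        # Temporal difference: d_t = g + alpha * J(i_{t+1}, r_t) - J(i_t, r_t).
        d = costs[t] + alpha * (phi[j] @ r) - phi[i] @ r
        # Parameter update: r_{t+1} = r_t + gamma_t * d_t * z_t.
        r = r + step_sizes[t] * d * z
    return r


# Hypothetical toy trajectory on a two-state chain with look-up table features.
r_demo = td_lambda(
    phi=np.eye(2),
    states=[0, 1, 0],
    costs=[1.0, 2.0],
    alpha=0.5,
    lam=0.5,
    step_sizes=[0.1, 0.1],
)
print(r_demo)
```

With lam = 0 the eligibility vector reduces to phi(i_t), so the same function also covers TD(0).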
\n\nUnder some technical assumptions, a theorem from (Benveniste et al., 1990) can be used to formally deduce convergence of TD($\\lambda$) from that of the deterministic counterpart. Our study therefore centers on an analysis of this deterministic algorithm.\n\nIt turns out that\n\n$$E_0[s(r, X_t)] = \\Phi' D \\left(T^{(\\lambda)}(\\Phi r) - \\Phi r\\right).$$\n\nUsing the contraction property of $T^{(\\lambda)}$,\n\n$$\\begin{aligned} (r - r^*)' E_0[s(r, X_t)] &= (\\Phi r - \\Phi r^*)' D \\left(\\Pi T^{(\\lambda)}(\\Phi r) - \\Phi r^* + (\\Phi r^* - \\Phi r)\\right) \\\\ &\\le \\|\\Phi r - \\Phi r^*\\|_D \\cdot \\|\\Pi T^{(\\lambda)}(\\Phi r) - \\Phi r^*\\|_D - \\|\\Phi r^* - \\Phi r\\|_D^2 \\\\ &\\le (\\alpha - 1) \\|\\Phi r - \\Phi r^*\\|_D^2. \\end{aligned}$$\n\nSince $\\alpha < 1$, this inequality shows that the steady-state expectation $E_0[s(r, X_t)]$ generally moves the parameter vector towards $r^*$, the fixed point of $\\Pi T^{(\\lambda)}$, where \"closeness\" is measured in terms of the norm $\\|\\cdot\\|_D$. This provides the main line of reasoning behind the proof of convergence provided in (Tsitsiklis and Van Roy, 1996). Some illuminating interpretations of this deterministic algorithm, which are useful in developing an intuitive understanding of temporal-difference learning, are also discussed in (Tsitsiklis and Van Roy, 1996).\n\n4 CONVERGENCE RESULT\n\nWe now present our main result concerning temporal-difference learning. A formal proof is provided in (Tsitsiklis and Van Roy, 1996).\n\nTheorem 1 Let the following conditions hold:\n(a) The Markov chain $i_t$ has a unique invariant distribution $\\pi$ that satisfies $\\pi' P = \\pi'$, with $\\pi(i) > 0$ for all $i$.\n(b) The matrix $\\Phi$ has full column rank; that is, the \"basis functions\" $\\{\\phi_k \\mid k = 1, \\ldots, K\\}$ are linearly independent.\n(c) The step sizes $\\gamma_t$ are positive, nonincreasing, and predetermined. 
Furthermore, they satisfy $\\sum_{t=0}^{\\infty} \\gamma_t = \\infty$ and $\\sum_{t=0}^{\\infty} \\gamma_t^2 < \\infty$.\nWe then have:\n(a) For any $\\lambda \\in [0,1]$, the TD($\\lambda$) algorithm, as defined in Section 2, converges with probability 1.\n(b) The limit of convergence $r^*$ is the unique solution of the equation\n\n$$\\Pi T^{(\\lambda)}(\\Phi r^*) = \\Phi r^*.$$\n\n(c) Furthermore, $r^*$ satisfies\n\n$$\\|\\Phi r^* - J^*\\|_D \\le \\frac{1 - \\lambda\\alpha}{1 - \\alpha} \\|\\Pi J^* - J^*\\|_D.$$\n\nPart (b) of the theorem leads to an interesting interpretation of the limit of convergence. In particular, if we apply the TD($\\lambda$) operator to the final approximation $\\Phi r^*$, and then project the resulting function back into the span of the basis functions, we get the same function $\\Phi r^*$. Furthermore, since the composition $\\Pi T^{(\\lambda)}$ is a contraction, repeated application of this composition to any function would generate a sequence of functions converging to $\\Phi r^*$.\n\nPart (c) of the theorem establishes that a certain desirable property is satisfied by the limit of convergence. In particular, if there exists a vector $r$ such that $\\Phi r = J^*$, then this vector will be the limit of convergence of TD($\\lambda$), for any $\\lambda \\in [0,1]$. On the other hand, if no such parameter vector exists, the distance between the limit of convergence $\\Phi r^*$ and $J^*$ is bounded by a multiple of the distance between the projection $\\Pi J^*$ and $J^*$. This latter distance is amplified by a factor of $(1 - \\lambda\\alpha)/(1 - \\alpha)$, which becomes larger as $\\lambda$ becomes smaller.\n\n5 COUNTER-EXAMPLES\n\nSutton (1995) has suggested that on-line updating and the use of linear function approximators are both important factors that make temporal-difference learning converge properly. These requirements also appear as assumptions in the convergence result of the previous section. 
To formalize the fact that these assumptions are relevant, two counter-examples were presented in (Tsitsiklis and Van Roy, 1996).\n\nThe first counter-example involves the use of a variant of TD(0) that does not sample states based on trajectories. Instead, the states $i_t$ are sampled independently from a distribution $q(\\cdot)$ over $S$, and successor states $j_t$ are generated by sampling according to $\\Pr[j_t = j \\mid i_t] = p_{i_t j}$. Each iteration of the algorithm takes on the form\n\n$$r_{t+1} = r_t + \\gamma_t \\phi(i_t) \\left(g(i_t, j_t) + \\alpha \\phi'(j_t) r_t - \\phi'(i_t) r_t\\right).$$\n\nWe refer to this algorithm as $q$-sampled TD(0). Note that this algorithm is closely related to the original TD($\\lambda$) algorithm as defined in Section 2. In particular, if $i_t$ is generated by the Markov chain and $j_t = i_{t+1}$, we are back to the original algorithm. It is easy to show, using a subset of the arguments required to prove Theorem 1, that this algorithm converges when $q(i) = \\pi(i)$ for all $i$, and the assumptions of Theorem 1 are satisfied. However, results can be very different when $q(\\cdot)$ is arbitrary. In particular, the counter-example presented in (Tsitsiklis and Van Roy, 1996) shows that for any sampling distribution $q(\\cdot)$ that is different from $\\pi(\\cdot)$ there exists a Markov chain with steady-state probabilities $\\pi(\\cdot)$ and a linearly parameterized function approximator for which $q$-sampled TD(0) diverges. A counter-example with similar implications has also been presented by Baird (1995).\n\nA generalization of temporal-difference learning is commonly used in conjunction with nonlinear function approximators. This generalization involves replacing each vector $\\phi(i_t)$ that is used to construct the eligibility vector with the vector of derivatives of $J(i_t, \\cdot)$, evaluated at the current parameter vector $r_t$. 
A second counter-example in (Tsitsiklis and Van Roy, 1996) shows that there exists a Markov chain and a nonlinearly parameterized function approximator such that both the parameter vector and the approximated cost-to-go function diverge when such a variant of TD(0) is applied. This nonlinear function approximator is \"regular\" in the sense that it is infinitely differentiable with respect to the parameter vector. However, it is still somewhat contrived, and the question of whether such a counter-example exists in the context of more standard function approximators such as neural networks remains open.\n\n6 CONCLUSION\n\nTheorem 1 establishes convergence with probability 1, characterizes the limit of convergence, and provides error bounds, for temporal-difference learning. It is interesting to note that the margins allowed by the error bounds shrink as $\\lambda$ increases. Although this is only a bound, it strongly suggests that higher values of $\\lambda$ are likely to produce more accurate approximations. This is consistent with the examples that have been constructed by Bertsekas (1994).\n\nThe sensitivity of the error bound to $\\lambda$ raises the question of whether or not it ever makes sense to set $\\lambda$ to values less than 1. Many reports of experimental results, dating back to Sutton (1988), suggest that setting $\\lambda$ to values less than one can often lead to significant gains in the rate of convergence. A full understanding of how $\\lambda$ influences the rate of convergence is yet to be found, though some insight in the case of look-up table representations is provided by Dayan and Singh (1996). This is an interesting direction for future research. 
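As a rough numerical check of parts (b) and (c) of Theorem 1, the sketch below computes the fixed point of the projected operator on a small chain and compares the resulting error with the bound for several values of λ. Everything specific here is an assumption made for illustration: the function names td_fixed_point and d_norm, the vector g_bar of expected one-step costs, the two-state chain, and the single basis function. The code relies on the closed form T J = A J + b with A = (1 - λ) α P (I - α λ P)^(-1) and b = (I - α λ P)^(-1) g_bar, which follows from summing the geometric series in the definition of the operator.

```python
import numpy as np


def d_norm(x, pi):
    # Weighted norm ||x||_D with D = diag(pi).
    return float(np.sqrt(np.sum(pi * x ** 2)))


def td_fixed_point(P, g_bar, phi, alpha, lam):
    # Returns (r_star, J_star, pi) for a finite chain.
    # P: (n, n) transition matrix; g_bar[i] = sum_j P[i, j] * g(i, j).
    n = P.shape[0]
    eye = np.eye(n)
    # Steady-state probabilities: pi' P = pi' together with sum(pi) = 1.
    pi = np.linalg.lstsq(
        np.vstack([P.T - eye, np.ones(n)]),
        np.append(np.zeros(n), 1.0),
        rcond=None,
    )[0]
    D = np.diag(pi)
    # Exact cost-to-go: J* = (I - alpha P)^(-1) g_bar.
    J_star = np.linalg.solve(eye - alpha * P, g_bar)
    # Affine form of the operator: T J = A J + b.
    M = np.linalg.inv(eye - alpha * lam * P)
    A = (1.0 - lam) * alpha * (P @ M)
    b = M @ g_bar
    # Phi r* = Pi (A Phi r* + b) is equivalent to
    # Phi' D (I - A) Phi r* = Phi' D b.
    r_star = np.linalg.solve(phi.T @ D @ (eye - A) @ phi, phi.T @ D @ b)
    return r_star, J_star, pi


# Hypothetical two-state chain with one basis function.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
g_bar = np.array([1.0, 0.0])
phi = np.array([[1.0], [2.0]])
alpha = 0.9

results = []
for lam in (0.0, 0.5, 1.0):
    r_star, J_star, pi = td_fixed_point(P, g_bar, phi, alpha, lam)
    D = np.diag(pi)
    # Projection of J* onto the span of the basis function.
    proj = phi @ np.linalg.solve(phi.T @ D @ phi, phi.T @ D @ J_star)
    err = d_norm(phi @ r_star - J_star, pi)
    bound = (1 - lam * alpha) / (1 - alpha) * d_norm(proj - J_star, pi)
    results.append((lam, err, bound))
    print(lam, err, bound)
```

In this toy setting the computed error never exceeds the bound, and the two coincide at λ = 1, where the fixed point is the projection of J* itself.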
\n\nAcknowledgments\n\nWe thank Rich Sutton for originally making us aware of the relevance of on-line state sampling, and also for pointing out a simplification in the expression for the error bound of Theorem 1. This research was supported by the NSF under grant DMI-9625489 and the ARO under grant DAAL-03-92-G-0115.\n\nReferences\n\nBaird, L. C. (1995) \"Residual Algorithms: Reinforcement Learning with Function Approximation,\" in Prieditis & Russell, eds., Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufmann Publishers, San Francisco, CA.\nBertsekas, D. P. (1994) \"A Counter-Example to Temporal-Difference Learning,\" Neural Computation, vol. 7, pp. 270-279.\nBertsekas, D. P. (1995) Dynamic Programming and Optimal Control, Athena Scientific, Belmont, MA.\nBertsekas, D. P. & Tsitsiklis, J. N. (1996) Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.\nBenveniste, A., Metivier, M., & Priouret, P. (1990) Adaptive Algorithms and Stochastic Approximations, Springer-Verlag, Berlin.\nDayan, P. D. & Singh, S. P. (1996) \"Mean Squared Error Curves in Temporal Difference Learning,\" preprint.\nGurvits, L. (1996) personal communication.\nSutton, R. S. (1988) \"Learning to Predict by the Method of Temporal Differences,\" Machine Learning, vol. 3, pp. 9-44.\nSutton, R. S. (1995) \"On the Virtues of Linear Learning and Trajectory Distributions,\" Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference 1995, Boyan, Moore, and Sutton, Eds., p. 85. Technical Report CMU-CS-95-206, Carnegie Mellon University, Pittsburgh, PA 15213.\nTsitsiklis, J. N. & Van Roy, B. 
(1996) \"An Analysis of Temporal-Difference Learning with Function Approximation,\" to appear in the IEEE Transactions on Automatic Control.", "award": [], "sourceid": 1269, "authors": [{"given_name": "John", "family_name": "Tsitsiklis", "institution": null}, {"given_name": "Benjamin", "family_name": "Van Roy", "institution": null}]}