{"title": "Reinforcement Learning with Function Approximation Converges to a Region", "book": "Advances in Neural Information Processing Systems", "page_first": 1040, "page_last": 1046, "abstract": null, "full_text": "Reinforcement Learning with Function Approximation Converges to a Region\n\nGeoffrey J. Gordon\nggordon@cs.cmu.edu\n\nAbstract\n\nMany algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.\n\n1 Introduction\n\nAlthough there are convergent online algorithms (such as TD(λ) [1]) for learning the parameters of a linear approximation to the value function of a Markov process, no way is known to extend these convergence proofs to the task of online approximation of either the state-value (V*) or the action-value (Q*) function of a general Markov decision process. In fact, there are known counterexamples to many proposed algorithms. For example, fitted value iteration can diverge even for Markov processes [2]; Q-learning with linear function approximators can diverge, even when the states are updated according to a fixed update policy [3]; and SARSA(0) can oscillate between multiple policies with different value functions [4].\n\nGiven the similarities between SARSA(0) and Q-learning, and between V(0) and value iteration, one might suppose that their convergence properties would be identical. That is not the case: while Q-learning can diverge for some exploration strategies, this paper proves that the iterates for trajectory-based SARSA(0) converge with probability 1 to a fixed region. Similarly, while value iteration can diverge for some exploration strategies, this paper proves that the iterates for trajectory-based V(0) converge with probability 1 to a fixed region.¹\n\n¹In a \"trajectory-based\" algorithm, the exploration policy may not change within a single episode of learning. The policy may change between episodes, and the value function may change within a single episode. (Episodes end when the agent enters a terminal state. This paper considers only episodic tasks, but since any discounted task can be transformed into an equivalent episodic task, the algorithms apply to non-episodic tasks as well.)\n\nThe question of the convergence behavior of SARSA(λ) is one of the four open theoretical questions of reinforcement learning that Sutton [5] identifies as \"particularly important, pressing, or opportune.\" This paper covers SARSA(0), and together with an earlier paper [4] describes its convergence behavior: it is stable in the sense that there exist bounded regions which with probability 1 it eventually enters and never leaves, but for some Markov decision processes it may not converge to a single point. The proofs extend easily to SARSA(λ) for λ > 0.
\n\nUnfortunately the bound given here is not of much use as a practical guarantee: it is loose enough that it provides little reason to believe that SARSA(0) and V(0) produce useful approximations to the state- and action-value functions. However, it is important for several reasons. First, it is the best result available for these two algorithms. Second, such a bound is often the first step towards proving stronger results. Finally, in practice it often happens that after some initial exploration period, only a few different policies are ever greedy; if this is the case, the strategy of this paper could be used to prove much tighter bounds.\n\nResults similar to the ones presented here were developed independently in [6].\n\n2 The algorithms\n\nThe SARSA(0) algorithm was first suggested in [7]. The V(0) algorithm was popularized by its use in the TD-Gammon backgammon playing program [8].²\n\n²The proof given here does not cover the TD-Gammon program, since TD-Gammon uses a nonlinear function approximator to represent its value function. Interestingly, though, the proof extends easily to cover games such as backgammon in addition to MDPs. It also extends to cover SARSA(λ) and V(λ) for λ > 0.\n\nFix a Markov decision process M, with a finite set S of states, a finite set A of actions, a terminal state T, an initial distribution S_0 over S, a one-step reward function r : S × A → R, and a transition function δ : S × A → S ∪ {T}. (M may also have a discount factor γ specifying how to trade future rewards against present ones. Here we fix γ = 1, but our results carry through to γ < 1.) Both the transition and reward functions may be stochastic, so long as successive samples are independent (the Markov property) and the reward has bounded expectation and variance. We assume that all states in S are reachable with positive probability.\n\nWe define a policy π to be a function mapping states to probability distributions over actions. Given a policy we can sample a trajectory (a sequence of states, actions, and one-step rewards) by the following rule: begin by selecting a state s_0 according to S_0. Now choose an action a_0 according to π(s_0). Now choose a one-step reward r_0 according to r(s_0, a_0). Finally choose a new state s_1 according to δ(s_0, a_0). If s_1 = T, stop; otherwise repeat. We assume that all policies are proper, that is, that the agent reaches T with probability 1 no matter what policy it follows. (This assumption is satisfied trivially if γ < 1.)
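\n\nFor concreteness, the sampling rule can be sketched in a few lines of Python; the callbacks S0, policy, reward, and next_state are hypothetical stand-ins for the S_0, π, r, and δ just defined, and TERMINAL stands for T.\n\ndef sample_trajectory(S0, policy, reward, next_state, TERMINAL='T'):\n    # Draw s_0 from the initial distribution, then alternate action,\n    # reward, and successor draws until the terminal state is reached.\n    # Because all policies are proper, the loop halts with probability 1.\n    s = S0()\n    trajectory = []\n    while s != TERMINAL:\n        a = policy(s)             # a_t drawn from pi(s_t)\n        r = reward(s, a)          # r_t drawn from r(s_t, a_t)\n        trajectory.append((s, a, r))\n        s = next_state(s, a)      # s_{t+1} drawn from delta(s_t, a_t)\n    return trajectory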
\n\nThe reward for a trajectory is the sum of all of its one-step rewards. Our goal is to find an optimal policy, that is, a policy which on average generates trajectories with the highest possible reward. Define Q*(s, a) to be the best total expected reward that we can achieve by starting in state s, performing action a, and acting optimally afterwards. Define V*(s) = max_a Q*(s, a). Knowledge of either Q* or the combination of V*, δ, and r is enough to determine an optimal policy.\n\nThe SARSA(0) algorithm maintains an approximation to Q*. We will write Q(s, a) for s ∈ S and a ∈ A to refer to this approximation. We will assume that Q is a full-rank linear function of some parameters w. For convenience of notation, we will write Q(T, a) = 0 for all a ∈ A, and tack an arbitrary action onto the end of all trajectories (which would otherwise end with the terminal state). After seeing a trajectory fragment s, a, r, s', a', the SARSA(0) algorithm updates\n\nQ(s, a) ← r + Q(s', a')\n\nThe notation Q(s, a) ← V means that the parameters w which represent Q(s, a) should be adjusted by gradient descent to reduce the error (Q(s, a) − V)²; that is, for some preselected learning rate α ≥ 0,\n\nw_new = w_old + α(V − Q(s, a)) ∂Q(s, a)/∂w\n\nFor convenience, we assume that α remains constant within a single trajectory. We also make the standard assumption that the sequence of learning rates is fixed before the start of learning and satisfies Σ_t α_t = ∞ and Σ_t α_t² < ∞.\n\nWe will consider only the trajectory-based version of SARSA(0). This version changes policies only between trajectories. At the beginning of each trajectory, it selects the ε-greedy policy for its current Q function. From state s, the ε-greedy policy chooses the action argmax_a Q(s, a) with probability 1 − ε, and otherwise selects uniformly at random among all actions. This rule ensures that, no matter the sequence of learned Q functions, each state-action pair will be visited infinitely often. (The use of ε-greedy policies is not essential. We just need to be able to find a region that contains all of the approximate value functions for every policy considered, and a bound on the convergence rate of TD(0).)\n\nWe can compare the SARSA(0) update rule to the one for Q-learning:\n\nQ(s, a) ← r + max_b Q(s', b)\n\nOften a' in the SARSA(0) update rule will be the same as the maximizing b in the Q-learning update rule; the difference only appears when the agent takes an exploring action, i.e., one which is not greedy for the current Q function.\n\nThe V(0) algorithm maintains an approximation to V* which we will write V(s) for all s ∈ S. Again, we will assume V is a full-rank linear function of parameters w, and V(T) is held fixed at 0. After seeing a trajectory fragment s, a, r, s', V(0) sets\n\nV(s) ← r + V(s')\n\nThis update ignores a. Often a is chosen according to a greedy or ε-greedy policy for a recent V. However, for our analysis we only need to assume that we consider finitely many policies and that the policy remains fixed during each trajectory.\n\nWe leave open the question of whether updates to w happen immediately after each transition or only at the end of each trajectory. As pointed out in [9], this difference will not affect convergence: the updates within a single trajectory are O(α), so they cause a change in Q(s, a) or V(s) of O(α), which means subsequent updates are affected by at most O(α²). Since α is decaying to zero, the O(α²) terms can be neglected. (If we were to change policies during the trajectory, this argument would no longer hold, since small changes in Q or V can cause large changes in the policy.)
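\n\nPutting the pieces of this section together, here is a minimal Python sketch of one episode of trajectory-based SARSA(0) with a linear Q; it is an illustration under our assumptions, not an implementation from the paper. The feature map phi(s, a) and the environment callbacks are hypothetical; note that the ε-greedy policy is computed once from the weights as they stand at the start of the episode, while w itself changes within the episode.\n\nimport numpy as np\n\ndef sarsa0_episode(w, phi, actions, alpha, eps, S0, reward, next_state, TERMINAL='T'):\n    w_frozen = w.copy()               # the policy is fixed for the whole episode\n\n    def Q(weights, s, a):             # linear approximation; Q(T, a) = 0\n        return 0.0 if s == TERMINAL else float(np.dot(weights, phi(s, a)))\n\n    def policy(s):                    # eps-greedy for the episode-start Q\n        if np.random.rand() < eps:\n            return actions[np.random.randint(len(actions))]\n        return max(actions, key=lambda a: Q(w_frozen, s, a))\n\n    s = S0()\n    a = policy(s)\n    while s != TERMINAL:\n        r = reward(s, a)\n        s2 = next_state(s, a)\n        a2 = policy(s2) if s2 != TERMINAL else actions[0]   # arbitrary tacked-on action\n        # gradient step on (Q(s, a) - (r + Q(s', a')))^2; for a linear Q the\n        # gradient of Q(s, a) with respect to w is just phi(s, a)\n        w = w + alpha * (r + Q(w, s2, a2) - Q(w, s, a)) * phi(s, a)\n        s, a = s2, a2\n    return w\n\nReplacing r + Q(w, s2, a2) by r + max over b of Q(w, s2, b) would give the Q-learning update, and dropping the action (with state features phi(s)) gives the V(0) update.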
\n\n3 The result\n\nOur result is that the weights w in either SARSA(0) or V(0) converge with probability 1 to a fixed region. The proof of the result is based on the following intuition: while SARSA(0) and V(0) might consider many different policies over time, on any given trajectory they always follow the TD(0) update rule for some policy. The TD(0) update is, under general conditions, a 2-norm contraction, and so would converge to its fixed point if it were applied repeatedly; what causes SARSA(0) and V(0) not to converge to a point is just that they consider different policies (and so take steps towards different fixed points) during different trajectories. Crucially, under general conditions, all of these fixed points are within some bounded region. So, we can view the SARSA(0) and V(0) update rules as contraction mappings plus a bounded amount of \"slop.\" With this observation, standard convergence theorems show that the weight vectors generated by SARSA(0) and V(0) cannot diverge.\n\nTheorem 1  For any Markov decision process M satisfying our assumptions, there is a bounded region R such that the SARSA(0) algorithm, when acting on M, produces a series of weight vectors which with probability 1 converges to R. Similarly, there is another bounded region R' such that the V(0) algorithm acting on M produces a series of weight vectors converging with probability 1 to R'.\n\nPROOF: Lemma 2, below, shows that both the SARSA(0) and V(0) updates can be written in the form\n\nw_{t+1} = w_t − α_t (A_t w_t − r_t + ε_t)\n\nwhere A_t is positive definite, α_t is the current learning rate, E(ε_t) = 0, Var(ε_t) ≤ K(1 + ||w_t||²), and A_t and r_t depend only on the currently greedy policy. (A_t and r_t represent, in a manner described in the lemma, the transition probabilities and one-step costs which result from following the current policy. Of course, w_t, A_t, and r_t will be different depending on whether we are following SARSA(0) or V(0).)\n\nSince A_t is positive definite, the SARSA(0) and V(0) updates are 2-norm contractions for small enough α_t. So, if we kept the policy fixed rather than changing it at the beginning of each trajectory, standard results such as Lemma 1 below would guarantee convergence. The intuition is that we can define a nonnegative potential function J(w) and show that, on average, the updates tend to decrease J(w) as long as α_t is small enough and J(w) starts out large enough compared to α_t.\n\nTo apply Lemma 1 under the assumption that we keep the policy constant rather than changing it every trajectory, write A_t = A and r_t = r for all t, and write w_π = A⁻¹r. Let ρ be the smallest eigenvalue of the symmetric part (A + Aᵀ)/2 of A (which must be real and positive since A is positive definite). Write s_t = Aw_t − r + ε_t for the update direction at step t. Then if we take J(w) = ||w − w_π||²,\n\nE(∇J(w_t)ᵀ s_t | w_t) = 2(w_t − w_π)ᵀ(Aw_t − r + E(ε_t))\n= 2(w_t − w_π)ᵀ(Aw_t − Aw_π)\n≥ 2ρ||w_t − w_π||²\n= 2ρJ(w_t)\n\nso that −s_t is a descent direction in the sense required by the lemma. It is easy to check the lemma's variance condition. So, Lemma 1 shows that J(w_t) converges with probability 1 to 0, which means w_t must converge with probability 1 to w_π.
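\n\nAs a numerical sanity check on this step (with a hypothetical, randomly generated symmetric positive definite A rather than one derived from an MDP), the following Python sketch iterates the noiseless update w ← w − α(Aw − r) and confirms that J(w) = ||w − w_π||² is driven to zero.\n\nimport numpy as np\n\nrng = np.random.default_rng(0)\nB = rng.standard_normal((5, 5))\nA = B @ B.T + 5.0 * np.eye(5)         # symmetric positive definite by construction\nr = rng.standard_normal(5)\nw_star = np.linalg.solve(A, r)        # the fixed point w_pi = A^{-1} r\n\nw = rng.standard_normal(5)\nalpha = 0.9 / np.linalg.eigvalsh(A).max()   # a small enough learning rate\nfor t in range(500):\n    w = w - alpha * (A @ w - r)       # expected update, noise term omitted\nprint(np.linalg.norm(w - w_star))     # essentially zero\n\nEach step multiplies the error w − w_π by I − αA, whose spectral radius is below 1 for this choice of α; this is the 2-norm contraction referred to above.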
\n\nIf we pick an arbitrary vector u and define H(w) = max(0, ||w − u|| − C)² for a sufficiently large constant C, then the same argument reaches the weaker conclusion that w_t must converge with probability 1 to a sphere of radius C centered at u. To see why, note that −s_t is also a descent direction for H(w): inside the sphere, H = 0 and ∇H = 0, so the descent condition is satisfied trivially. Outside the sphere,\n\n∇H(w) = 2(w − u)(||w − u|| − C)/||w − u|| = d(w)(w − u)\n\n∇H(w_t)ᵀ E(s_t | w_t) = d(w_t)(w_t − u)ᵀ E(s_t | w_t)\n= d(w_t)(w_t − w_π + w_π − u)ᵀ A(w_t − w_π)\n≥ d(w_t)(ρ||w_t − w_π||² − ||w_π − u|| ||A|| ||w_t − w_π||)\n\nThe positive term will be larger than the negative one if ||w_t − w_π|| is large enough. So, if we choose C large enough, the descent condition will be satisfied. The variance condition is again easy to check. Lemma 3 shows that ∇H is Lipschitz. So, Lemma 1 shows that H(w_t) converges with probability 1 to 0, which means that w_t must converge with probability 1 to the sphere of radius C centered at u.\n\nBut now we are done: since there are finitely many policies that SARSA(0) or V(0) can consider, we can pick any u and then choose a C large enough that the above argument holds for all policies simultaneously. With this choice of C the update for any policy decreases H(w_t) on average as long as α_t is small enough, so the update for SARSA(0) or V(0) does too, and Lemma 1 applies. □
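\n\nThe mechanism in this proof can be seen in a small synthetic experiment (a Python sketch with hypothetical random positive definite matrices standing in for the finitely many pairs (A_π, r_π)): cycling among several contractions with different fixed points need not converge to a point, but the iterates enter, and stay in, a bounded region containing all of the fixed points.\n\nimport numpy as np\n\nrng = np.random.default_rng(1)\n\ndef random_pd(n):                     # random symmetric positive definite matrix\n    B = rng.standard_normal((n, n))\n    return B @ B.T + 3.0 * np.eye(n)\n\npolicies = [(random_pd(4), rng.standard_normal(4)) for _ in range(3)]\nfixed_points = [np.linalg.solve(A, r) for A, r in policies]\n\nw = 100.0 * rng.standard_normal(4)    # start far outside the region\nfor t in range(3000):\n    A, r = policies[t % 3]            # a different greedy policy each trajectory\n    w = w - 0.05 * (A @ w - r)        # contraction toward that policy's fixed point\n# w need not settle on any single fixed point, but it remains within a\n# bounded distance of all of them:\nprint(max(np.linalg.norm(w - z) for z in fixed_points))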
\n\nThe following lemma is Corollary 1 of [10]. In the statement of the lemma, a Lipschitz continuous function F is one for which there exists a constant L so that ||F(u) − F(w)|| ≤ L||u − w|| for all u and w. The Lipschitz condition is essentially a uniform bound on the derivative of F.\n\nLemma 1  Let J be a differentiable function, bounded below by J*, and let ∇J be Lipschitz continuous. Suppose the sequence w_t satisfies\n\nw_{t+1} = w_t − α_t s_t\n\nfor random vectors s_t which, given w_t, are independent of the earlier history w_0, w_1, ..., w_{t−1}. Suppose −s_t is a descent direction for J in the sense that E(s_t | w_t)ᵀ ∇J(w_t) ≥ δ(ε) > 0 whenever J(w_t) > J* + ε. Suppose also that\n\nE(||s_t||² | w_t) ≤ K_1 J(w_t) + K_2 E(s_t | w_t)ᵀ ∇J(w_t) + K_3\n\nand finally that the constants α_t satisfy\n\nα_t > 0,  Σ_t α_t = ∞,  Σ_t α_t² < ∞\n\nThen J(w_t) → J* with probability 1.\n\nMost of the work in proving the next lemma is already present in [1]. The transformation from an MDP under a fixed policy to a Markov chain is standard.\n\nLemma 2  The update made by SARSA(0) or V(0) during a single trajectory can be written in the form\n\nw_new = w_old − α(A_π w_old − r_π + ε)\n\nwhere the constant matrix A_π and constant vector r_π depend on the currently greedy policy π, α is the current learning rate, and E(ε) = 0. Furthermore, A_π is positive definite, and there is a constant K such that Var(ε) ≤ K(1 + ||w||²).\n\nPROOF: Consider the following Markov process M_π: M_π has one state for each state-action pair in M. If M has a transition which goes from state s under action a with reward r to state s' with probability p, then M_π has a transition from state (s, a) with reward r to state (s', a') for every a'; the probability of this transition is pπ(a'|s'). We will represent the value function for M_π in the same way that we represented the Q function for M; in other words, the representation for V((s, a)) is the same as the representation for Q(s, a). With these definitions, it is easy to see that TD(0) acting on M_π produces exactly the same sequence of parameter changes as SARSA(0) acting on M under the fixed policy π. (And since π(a|s) > 0, every state of M_π will be visited infinitely often.)\n\nWrite T_π for the transition probability matrix of the above Markov process. That is, the entry of T_π in row (s, a) and column (s', a') will be equal to the probability of taking a step to (s', a') given that we start in (s, a). By definition, T_π is substochastic. That is, it has nonnegative entries, and its row sums are less than or equal to 1. Write s for the vector whose (s, a)th element is S_0(s)π(a|s), that is, the probability that we start in state s and take action a. Write d_π = (I − T_πᵀ)⁻¹s, where I is the identity matrix. As demonstrated in, e.g., [11], d_π is the vector of expected visitation frequencies under π; that is, the element of d_π corresponding to state s and action a is the expected number of times that the agent will visit state s and select action a during a single trajectory following policy π. Write D_π for the diagonal matrix with d_π on its diagonal. Write r for the vector of expected rewards; that is, the component of r corresponding to state s and action a is E(r(s, a)). Finally write X for the Jacobian matrix ∂Q/∂w.\n\nWith this notation, Sutton [1] showed that the expected TD(0) update is\n\nE(w_new | w_old) = w_old − αXᵀD_π(I − T_π)Xw_old + αXᵀD_π r\n\n(Actually, he only considered the case where all rewards are zero except on transitions from nonterminal to terminal states, but his argument works equally well for the more general case where nonzero rewards are allowed everywhere.) So, we can take A_π = XᵀD_π(I − T_π)X and r_π = XᵀD_π r to make E(ε) = 0.\n\nFurthermore, Sutton showed that, as long as the agent reaches the terminal state with probability 1 (in other words, as long as π is proper) and as long as every state is visited with positive probability (which is true since all states are reachable and π has a nonzero probability of choosing every action), the matrix D_π(I − T_π) is strictly positive definite. Therefore, so is A_π.\n\nFinally, as can be seen from Sutton's equations on p. 25, there are two sources of variance in the update direction: variation in the number of times each transition is visited, and variation in the one-step rewards. The visitation frequencies and the one-step rewards both have bounded variance, and are independent of one another. They enter into the overall update in two ways: there is one set of terms which is bilinear in the one-step rewards and the visitation frequencies, and there is another set of terms which is bilinear in the visitation frequencies and the weights w. The former set of terms has constant variance. Because the policy is fixed, w is independent of the visitation frequencies, and so the latter set of terms has variance proportional to ||w||². So, there is a constant K such that the total variance in ε can be bounded by K(1 + ||w||²).\n\nA similar but simpler argument applies to V(0). In this case we define M_π to have the same states as M, and to have the transition matrix T_π whose element s, s' is the probability of landing in s' in M on step t + 1, given that we start in s at step t and follow π. Write s for the vector of starting probabilities, that is, s_x = S_0(x). Now define X = ∂V/∂w and d_π = (I − T_πᵀ)⁻¹s. Since we have assumed that all policies are proper and that every policy considered has a positive probability of reaching any state, the update matrix A_π = XᵀD_π(I − T_π)X is strictly positive definite. □
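\n\nFor the V(0) case the objects in this proof are easy to compute directly. The following Python sketch builds T_π, d_π, and A_π for a hypothetical two-state chain (states 0 and 1 plus an implicit terminal state) under a fixed proper policy, and verifies that A_π is positive definite.\n\nimport numpy as np\n\n# From each state the agent moves to state 1 with probability 0.5 and\n# terminates with probability 0.5, so T_pi is strictly substochastic.\nT_pi = np.array([[0.0, 0.5],\n                 [0.0, 0.5]])\ns0   = np.array([1.0, 0.0])           # the agent always starts in state 0\nX    = np.array([[1.0, 0.0],          # one feature row per state (full rank)\n                 [1.0, 1.0]])\n\nd_pi = np.linalg.solve(np.eye(2) - T_pi.T, s0)   # expected visitation counts\nD_pi = np.diag(d_pi)\nA_pi = X.T @ D_pi @ (np.eye(2) - T_pi) @ X\n\n# A_pi need not be symmetric; positive definiteness here means that the\n# eigenvalues of its symmetric part are strictly positive:\nprint(np.linalg.eigvalsh(0.5 * (A_pi + A_pi.T)))   # both strictly positive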
\n\nLemma 3  The gradient of the function H(w) = max(0, ||w|| − 1)² is Lipschitz continuous.\n\nPROOF: Inside the unit sphere, H and all of its derivatives are uniformly zero. Outside, we have\n\n∇H = d(w)w\n\nwhere d(w) = 2(||w|| − 1)/||w||, and\n\n∇²H = d(w)I + w∇d(w)ᵀ\n= d(w)I + 2wwᵀ/||w||³\n= d(w)I + (wwᵀ/||w||²)(2 − d(w))\n\nThe norm of the first term is d(w), the norm of the second is 2 − d(w), and since one of the terms is a multiple of I the norms add. So, the norm of ∇²H is 0 inside the unit sphere and 2 outside. At the boundary of the unit sphere, ∇H is continuous, and its directional derivatives from every direction are bounded by the argument above. So, ∇H is Lipschitz continuous. □
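\n\nA quick numerical check of the lemma (a Python sketch; the constant 2 is the Hessian bound just derived):\n\nimport numpy as np\n\ndef grad_H(w):                        # gradient of H(w) = max(0, ||w|| - 1)^2\n    n = np.linalg.norm(w)\n    return np.zeros_like(w) if n <= 1.0 else (2.0 * (n - 1.0) / n) * w\n\nrng = np.random.default_rng(2)\nworst = 0.0\nfor _ in range(100000):\n    u = 3.0 * rng.standard_normal(4)  # pairs inside, outside, and straddling\n    v = 3.0 * rng.standard_normal(4)  # the unit sphere\n    worst = max(worst, np.linalg.norm(grad_H(u) - grad_H(v)) / np.linalg.norm(u - v))\nprint(worst)                          # stays below the Lipschitz constant 2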
\n\nAcknowledgements\n\nThanks to Andrew Moore and to the anonymous reviewers for helpful comments. This work was supported in part by DARPA contract number F30602-97-1-0215, and in part by NSF KDI award number DMS-9873442. The opinions and conclusions are the author's and do not reflect those of the US government or its agencies.\n\nReferences\n\n[1] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.\n\n[2] Geoffrey J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995.\n\n[3] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning: proceedings of the twelfth international conference, San Francisco, CA, 1995. Morgan Kaufmann.\n\n[4] Geoffrey J. Gordon. Chattering in SARSA(λ). Internal report, 1996. CMU Learning Lab. Available from www.cs.cmu.edu/~ggordon.\n\n[5] R. S. Sutton. Open theoretical questions in reinforcement learning. In P. Fischer and H. U. Simon, editors, Computational Learning Theory (Proceedings of EuroCOLT'99), pages 11-17, 1999.\n\n[6] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000.\n\n[7] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. Technical Report 166, Cambridge University Engineering Department, 1994.\n\n[8] G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:215-219, 1994.\n\n[9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185-1201, 1994.\n\n[10] B. T. Polyak and Ya. Z. Tsypkin. Pseudogradient adaptation and training algorithms. Automation and Remote Control, 34(3):377-397, 1973. Translated from Avtomatika i Telemekhanika.\n\n[11] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Van Nostrand-Reinhold, New York, 1960.\n", "award": [], "sourceid": 1911, "authors": [{"given_name": "Geoffrey", "family_name": "Gordon", "institution": null}]}