{"title": "Convergent Combinations of Reinforcement Learning with Linear Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1611, "page_last": 1618, "abstract": null, "full_text": "Convergent Combinations of Reinforcement Learning with Linear Function Approximation\n\nRalf Schoknecht\nILKD\nUniversity of Karlsruhe, Germany\nralf.schoknecht@ilkd.uni-karlsruhe.de\n\nArtur Merke\nLehrstuhl Informatik 1\nUniversity of Dortmund, Germany\narturo merke@udo.edu\n\nAbstract\n\nConvergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows one to analyse whether a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data. 
\n\n1 Introduction\n\nThe strongest convergence guarantees for reinforcement learning (RL) algorithms are available for the tabular case, where temporal difference algorithms for both policy evaluation and the general control problem converge with probability one independently of the concrete sampling strategy as long as all states are sampled infinitely often and the learning rate is decreased appropriately [2]. In large, possibly continuous, state spaces a tabular representation and adaptation of the value function is not feasible with respect to time and memory considerations. Therefore, linear feature-based function approximation is often used. However, it has been shown that synchronous TD(0), i.e. dynamic programming, diverges for general linear function approximation [1]. Convergence with probability one for TD($\\lambda$) with general linear function approximation has been proved in [12]. The authors establish the crucial condition of sampling states according to the steady-state distribution of the Markov chain in order to ensure convergence. This requirement is reasonable for the pure prediction task but may be disadvantageous for policy improvement as shown in [6] because it may lead to bad action choices in rarely visited parts of the state space. When transition data is taken from arbitrary sources a certain sampling distribution cannot be assured, which may prevent convergence.\n\nAn alternative to such iterative TD approaches are least-squares TD (LSTD) methods [4, 3, 6, 8]. They eliminate the learning rate parameter and carry out a matrix inversion in order to compute the fixed point of the iteration directly. In [4] a least-squares approach for TD(0) is presented which is generalised to TD($\\lambda$) in [3]. 
Both approaches still sample the states according to the steady-state distribution. In [6, 8] arbitrary sampling distributions are used such that the transition data could be taken from any source. This may yield solutions that are not achievable by the corresponding iterative approach because this iteration diverges. All the LSTD approaches have the problem that the matrix to be inverted may be singular. This case can occur if the basis functions are not linearly independent or if the Markov chain is not recurrent. In order to apply the LSTD approach the problem would have to be preprocessed by sorting out the linearly dependent basis functions and the transient states of the Markov chain. In practice one would like to save this additional work.\n\nThus, the least-squares TD algorithm can fail due to matrix singularity and the iterative TD(0) algorithm can fail if the sampling distribution is different from the steady-state distribution. Hence, there are problems for which neither an iterative nor a least-squares TD solution exists. The actual reason for the failure of the iterative TD(0) approach lies in an incompatible combination of the RL algorithm and the function approximator. Thus, the idea is that either a change in the RL algorithm or a change in the approximator may yield a convergent iteration. Here, a change in the TD(0) algorithm is not meant to completely alter the character of the algorithm. We require that only modifications of the TD(0) algorithm be considered that are consistent according to the definition in the next section.\n\nIn this paper we propose a unified framework for the analysis of a whole class of synchronous iterative RL algorithms combined with arbitrary linear function approximation. 
For the sparse iteration matrices that occur in RL such an iterative approach is superior to a method that uses matrix inversion as the LSTD approach does [5]. Our main theorem states sufficient conditions under which combinations of RL algorithms and linear function approximation converge. We hope that these conditions and the convergence analysis, which is based on the eigenvalues of the iteration matrix, bring new insight into the interplay of RL and function approximation. For an arbitrary linear function approximator and for arbitrary fixed transition data the theorem allows one to predict the existence of a constant learning rate such that the synchronous residual gradient algorithm [1] converges. Moreover, in combination with interpolating grid-based function approximators we are able to specify a formula for a constant learning rate such that the synchronous residual gradient algorithm converges independently of the transition data. This is very useful because otherwise the learning rate would have to be decreased, which slows down convergence.\n\n2 A Framework for Synchronous Iterative RL Algorithms\n\nFor a Markov decision process (MDP) with $N$ states $S = \\{s_1, \\ldots, s_N\\}$, action space $A$, state transition probabilities $p : (S, S, A) \\to [0, 1]$ and stochastic reward function $r : (S, A) \\to \\mathbb{R}$, policy evaluation is concerned with solving the Bellman equation\n\n$$V^\\pi = \\gamma P^\\pi V^\\pi + R^\\pi \\qquad (1)$$\n\nfor a fixed policy $\\pi : S \\to A$. $V_i^\\pi$ denotes the value of state $s_i$, $P_{i,j}^\\pi = p(s_i, s_j, \\pi(s_i))$, $R_i^\\pi = E\\{r(s_i, \\pi(s_i))\\}$ and $\\gamma$ is the discount factor. As the policy $\\pi$ is fixed we will omit it in the following to make notation easier.\n\nIf the state space $S$ gets too large the exact solution of equation (1) becomes very costly with respect to both memory and computation time. 
Therefore, linear feature-based function approximation is often applied. The value function $V$ is represented as a linear combination of basis functions $\\{\\Phi_1, \\ldots, \\Phi_F\\}$ which can be written as $V = \\Phi w$, where $w \\in \\mathbb{R}^F$ is the parameter vector describing the linear combination and $\\Phi = (\\Phi_1 | \\ldots | \\Phi_F) \\in \\mathbb{R}^{N \\times F}$ is the matrix with the basis functions as columns. The rows of $\\Phi$ are the feature vectors $\\varphi(s_i) \\in \\mathbb{R}^F$ for the states $s_i$.\n\nA popular algorithm for updating the parameter vector $w$ after a single transition $x_i \\to z_i$ with reward $r_i$ is the TD(0) algorithm [11]\n\n$$w^{n+1} = w^n + \\alpha \\varphi(x_i) [r_i + \\gamma \\varphi(z_i)^T w^n - \\varphi(x_i)^T w^n] = (I_F + \\alpha A_i) w^n + \\alpha b_i, \\qquad (2)$$\n\nwhere $\\alpha$ is the learning rate, $A_i = \\varphi(x_i) [\\gamma \\varphi(z_i) - \\varphi(x_i)]^T$, $b_i = \\varphi(x_i) r_i$ and $I_F$ is the identity matrix in $\\mathbb{R}^{F \\times F}$. In the following we investigate the synchronous update for a fixed set of $m$ transitions $T = \\{(x_i, z_i, r_i) \\mid i = 1, \\ldots, m\\}$. The start states $x_i$ are sampled with respect to the probability distribution $\\rho$, the next states $z_i$ are sampled according to $p(x_i, \\cdot)$ and the rewards $r_i$ are sampled from $r(x_i)$. The synchronous update for the transition set $T$ can then be written in matrix notation as\n\n$$w^{n+1} = (I_F + \\alpha A_{TD}) w^n + \\alpha b_{TD} \\qquad (3)$$\n\nwith $A_{TD} = A_1 + \\ldots + A_m$ and $b_{TD} = b_1 + \\ldots + b_m$. Let $X \\in \\mathbb{R}^{m \\times N}$ with $X_{i,j} = 1$ if $x_i = s_j$ and $0$ otherwise. Then, $\\Phi^X = X \\Phi \\in \\mathbb{R}^{m \\times F}$ is the matrix with feature vector $\\varphi(x_i)$ as its $i$-th row. Define $Z$ and $\\Phi^Z$ accordingly for the states $z_i$. With the vector of obtained rewards $r = (r_1, \\ldots, r_m)^T$ we have $A_{TD} = (\\Phi^X)^T (\\gamma \\Phi^Z - \\Phi^X)$ and $b_{TD} = (\\Phi^X)^T r$.\n\nThe synchronous TD(0) algorithm is an instance of a much broader class of RL algorithms. The residual gradient algorithm [1], for example, minimises the Bellman error by gradient descent. 
In the following, let $\\Theta = \\gamma \\Phi^Z - \\Phi^X$. The matrix $\\frac{1}{m} D = \\frac{1}{m} X^T X \\in \\mathbb{R}^{N \\times N}$ is diagonal and denotes the relative frequency of state $s_i$ as start state in the transition data $T$. Let $\\hat{D}$ be the diagonal matrix with the inverse entries of $D$. For $D_{i,i} = 0$ set $\\hat{D}_{i,i} = 0$. The matrix of the relative frequencies for the state transitions from $s_i$ to $s_j$ is given by $\\hat{P} = \\hat{D} X^T Z$ and the vector of the average reward in the different states $s_i$ is given by $\\hat{R} = \\hat{D} X^T r$. It can be shown that the weighted Bellman error for the synchronous update\n\n$$E_B(w) = \\frac{1}{2} [(\\gamma \\hat{P} - I_N) \\Phi w + \\hat{R}]^T \\frac{1}{m} D [(\\gamma \\hat{P} - I_N) \\Phi w + \\hat{R}]$$\n\nwith the estimated entities $\\hat{P}$, $\\hat{R}$ and $D$ instead of the corresponding unknown expected values is equivalent to the expression $E_B(w) = \\frac{1}{2m} [\\Theta w + r]^T X \\hat{D} X^T [\\Theta w + r]$. Thus, for the residual gradient algorithm the update rule (3) becomes $w^{n+1} = (I_F + \\alpha A_{RG}) w^n + \\alpha b_{RG}$ with $A_{RG} = -\\Theta^T X \\hat{D} X^T \\Theta$ and $b_{RG} = -\\Theta^T X \\hat{D} X^T r$. The synchronous TD(0) and the residual gradient algorithm can be analysed in a unified framework with $A = \\Psi^T \\Theta$ and $b = \\Psi^T r$. By setting $\\Psi_{TD} = \\Phi^X$ and $\\Psi_{RG} = -X \\hat{D} X^T \\Theta$, for example, one obtains the TD(0) algorithm and the residual gradient algorithm respectively. Moreover, varying $\\Psi$ yields a whole class of algorithms. We denote such algorithms as consistent RL algorithms if two conditions are fulfilled. First, for a tabular representation the algorithm converges to an optimal solution $w^*$ with Bellman error zero. Second, if the algorithm converges with a linear function approximator it achieves the same Bellman error independently of the initial value $w^0$. This class of RL algorithms includes the Kaczmarz rule [9], which is similar to the NTD(0) rule [4], or the uniform update rule described in [7]. 
In general, these algorithms yield different solutions when function approximation is used. For the TD(0) and the residual gradient algorithm this is shown in [10]. However, a general assessment of the solution quality of the different algorithms is still missing.\n\n3 Convergence Results\n\nThe convergence properties of RL algorithms for synchronous updates in the general framework presented in the last section are described in the following main theorem of our paper. It generalises the case of repeated single-transition updates [7] to repeated multi-transition updates. For the following let $[M]$ be the span of the columns of a matrix $M$ and $[M]^\\perp$ the orthogonal complement of $[M]$.\n\nTheorem 1 Let $w^{n+1} = (I_F + \\alpha A) w^n + \\alpha b$ be the synchronous update rule for the transition data $T$. Let $A \\in \\mathbb{R}^{F \\times F}$ be representable as $A = C^T D$ with some $C, D \\in \\mathbb{R}^{k \\times F}$ and $b \\in \\mathbb{R}^F$ be representable as $b = C^T v$ with some $v \\in \\mathbb{R}^k$. Let $K = D C^T \\in \\mathbb{R}^{k \\times k}$ and $p(x) = (-1)^k (x - \\lambda_1)^{\\beta_1} \\cdots (x - \\lambda_l)^{\\beta_l}$ be the characteristic polynomial of $K$ over $\\mathbb{C}$ with $|\\lambda_1| > \\ldots > |\\lambda_l|$. Also, let $E^K_{\\lambda_i}$ be the eigenspace corresponding to eigenvalue $\\lambda_i$ and $H = \\max_i \\{ |\\lambda_i|^2 / |\\mathrm{Re}(\\lambda_i)| \\}$. If the following assumptions hold\n\n(a) $\\forall i$: $(\\mathrm{Re}(\\lambda_i) < 0) \\vee \\lambda_i = 0$\n(b) $\\dim(E^K_{\\lambda_i}) = \\beta_i$ for $\\lambda_i = 0$\n(c) $[C^T] \\cap [D^T]^\\perp = \\{0\\}$\n\nthen the limit $w^* = \\lim_{n \\to \\infty} w^n$ exists for all learning rates $0 < \\alpha < \\alpha_L$, where the limit learning rate $\\alpha_L$ satisfies $\\alpha_L = \\frac{2}{H}$. The limit $w^*$ may depend on the initial value $w^0$. Note, if the $\\lambda_i$ leading to the maximum of $H$ is real then $H = |\\lambda_i|$.\n\nA proof of this theorem can be found in the appendix. General convergence conditions of iterations have been examined in numerical mathematics. 
A standard result states that if the absolute value of the largest eigenvalue of the iteration matrix $I_F + \\alpha A$, i.e. the spectral radius, is smaller than one, then the iteration converges to the unique fixed point $w^* = -A^{-1} b$ [5] (Theorem 2.1.1). In our case, however, the matrix $A$ may not be invertible. This happens, for example, if the features $\\Phi_i$ in the feature matrix $\\Phi$ are linearly dependent. If $A$ is not invertible it has eigenvalue zero and, thus, $I_F + \\alpha A$ has eigenvalue one. Conditions (b) and (c) in the above theorem are needed in order to compensate for the singularity of $A$ and to assure convergence. If the iteration converges for singular $A$ the fixed point depends on the initial value $w^0$ and is no longer unique. Therefore, for consistent RL algorithms we require that the Bellman error of all fixed points be the same. Thus, the quality of the obtained solution to the policy evaluation problem is independent of the initial value. However, the suitability of different $w^*$ for a policy improvement step can vary, but this question is not addressed here.\n\nAn important implication of Theorem 1 concerns the choice of the learning rate. If sampling were involved in the update rule the learning rate would have to be decreased in the standard manner ($\\sum_t \\alpha_t = \\infty$, $\\sum_t \\alpha_t^2 < \\infty$) in order to fulfil the condition for stochastic approximation algorithms. However, for a fixed set of updates and certain synchronous RL algorithms with linear feature-based function approximation Theorem 1 predicts the existence of a constant learning rate. In general the computation of this learning rate would require knowledge of the eigenvalues of $K$ which may not be directly available. 
As the following proposition shows, for certain combinations of RL algorithms and linear function approximation a universal constant learning rate exists such that the iteration in Theorem 1 converges. The proof can be found in the appendix.\n\nProposition 1 For an appropriate constant choice of the learning rate $\\alpha$ the residual gradient algorithm will converge independently of the linear function approximation scheme when applied to the problem of repeated synchronous multi-transition updates. The residual gradient algorithm is a consistent RL algorithm. If the residual gradient algorithm is combined with grid-based linear interpolation over an arbitrary triangulation of the state space and the transition set contains $m$ transitions then the iteration converges for all $\\alpha < \\frac{2}{m(1 + \\gamma^2)}$.\n\nA choice of the learning rate $\\alpha < \\frac{2}{H}$ according to Theorem 1 yields a convergent iteration. However, this might not be the best choice with respect to asymptotic convergence rate. The asymptotic convergence rate is better for matrices with lower spectral radius [5], which yields a criterion for the choice of an optimal learning rate $\\alpha^*$. If $K$ has only real eigenvalues then we can deduce a particularly simple formula for $\\alpha^*$. Assume that all nonzero eigenvalues of $K$ satisfy $\\lambda_i \\in [\\lambda_{\\max}, \\lambda_{\\min}]$, where $\\lambda_{\\min}$ is the largest eigenvalue smaller than zero and $\\lambda_{\\max}$ is the eigenvalue with largest absolute value. It can be shown that the asymptotic convergence rate is determined by the eigenvalues of $I_m + \\alpha K$ that are unequal to one. The eigenvalues $\\lambda_i$ of $K$ are related to the eigenvalues $\\tilde{\\lambda}_i$ of $I_m + \\alpha K$ by $\\tilde{\\lambda}_i = 1 + \\alpha \\lambda_i$. Hence, the interval $[\\lambda_{\\max}, \\lambda_{\\min}]$ is mapped to $[\\tilde{\\lambda}_{\\max}, \\tilde{\\lambda}_{\\min}] = [1 + \\alpha \\lambda_{\\max}, 1 + \\alpha \\lambda_{\\min}]$. 
In order to obtain a low spectral radius of $I_m + \\alpha K$ this interval should lie symmetrically around zero, which is equivalent to $\\tilde{\\lambda}_{\\min} = -\\tilde{\\lambda}_{\\max}$. This yields $\\alpha^* = \\frac{2}{|\\lambda_{\\min}| + |\\lambda_{\\max}|} < \\frac{2}{H}$ with $H = |\\lambda_{\\max}|$. Thus, $\\alpha^*$ leads to convergence according to Theorem 1. Note also that a larger learning rate does not necessarily lead to a faster asymptotic convergence of the iteration.\n\n4 Counterexample of Baird - Revisited\n\nIn this section we analyse the counterexample given by Baird in [1], and show how Theorem 1 and Proposition 1 can be applied to obtain explicit bounds for the learning rate $\\alpha$ and the discount factor $\\gamma$ for which the residual gradient and TD(0) algorithms converge. The matrices $\\Phi$, $X$ and $Z$ are given by\n\n$$\\Phi = \\begin{pmatrix} 1&2&0&0&0&0&0&0 \\\\ 1&0&2&0&0&0&0&0 \\\\ 1&0&0&2&0&0&0&0 \\\\ 1&0&0&0&2&0&0&0 \\\\ 1&0&0&0&0&2&0&0 \\\\ 1&0&0&0&0&0&2&0 \\\\ 2&0&0&0&0&0&0&1 \\end{pmatrix}, \\quad X = \\begin{pmatrix} 1&0&0&0&0&0&0 \\\\ 0&1&0&0&0&0&0 \\\\ 0&0&1&0&0&0&0 \\\\ 0&0&0&1&0&0&0 \\\\ 0&0&0&0&1&0&0 \\\\ 0&0&0&0&0&1&0 \\\\ 0&0&0&0&0&0&1 \\end{pmatrix}, \\quad Z = \\begin{pmatrix} 0&0&0&0&0&0&1 \\\\ 0&0&0&0&0&0&1 \\\\ 0&0&0&0&0&0&1 \\\\ 0&0&0&0&0&0&1 \\\\ 0&0&0&0&0&0&1 \\\\ 0&0&0&0&0&0&1 \\\\ 0&0&0&0&0&0&1 \\end{pmatrix},$$\n\nwhich corresponds to the synchronous update of every state transition. In the residual gradient case we have $K_{RG} = -(\\gamma Z - X) \\Phi ((\\gamma Z - X) \\Phi)^T$, which has just negative eigenvalues $\\sigma_{RG} = \\{-4, \\frac{1}{2}[-15 + 34\\gamma - 35\\gamma^2 \\pm \\sqrt{2102\\gamma^2 - 812\\gamma - 2380\\gamma^3 + 121 + 1225\\gamma^4}]\\}$. Using Theorem 1 and Proposition 1 we can find a constant learning rate $\\alpha$ such that the iteration converges for every $\\gamma \\in [0, 1)$. For example, for $\\gamma = 0.9$ the eigenvalues of $K_{RG}$ are $\\sigma_{RG} = \\{-0.0204, -4, -12.7296\\}$ and Theorem 1 yields $\\alpha < 0.1571$, which is also almost equal to the optimal learning rate $\\alpha^* \\approx 0.1569$.\n\nIn the TD(0) case we have to analyse the matrix $K_{TD} = (\\gamma Z - X) \\Phi (X \\Phi)^T$, which has the eigenvalues $\\sigma_{TD} = \\{-4, \\frac{1}{2}[-15 + 17\\gamma \\pm \\sqrt{289\\gamma^2 - 406\\gamma + 121}]\\}$. There are eigenvalues of $K_{TD}$ with positive real part for $\\gamma \\geq 0.89$. 
In such cases we have divergence for every $\\alpha > 0$ as described in [1] for $\\gamma = 0.9$. However, contradicting the argument in [1], the TD(0) algorithm converges for all $\\gamma \\leq 0.88$ if the learning rate is chosen appropriately. For example, for $\\gamma = 0.4$ all eigenvalues are negative ($\\sigma_{TD} = \\{-3.0, -4, -5.2\\}$), so conditions (a) and (b) of Theorem 1 are trivially fulfilled. Condition (c) can also be shown by simple computation, and therefore using Theorem 1 we obtain convergence for $\\alpha < 0.384$ and optimal asymptotic convergence for $\\alpha^* \\approx 0.244$, which is much smaller.\n\n5 Conclusions\n\nFor the problem of repeated synchronous updates based on a fixed set of transitions we have proved sufficient conditions of convergence for arbitrary combinations of reinforcement learning algorithms and linear function approximation. Our main theorem yields a rule for determining a problem-dependent learning rate such that the algorithm converges. For a combination of the residual gradient algorithm with grid-based linear interpolation we have deduced a constant learning rate such that the algorithm converges independently of the concrete transition data. Moreover, we have derived a general formula for an optimal learning rate with respect to asymptotic convergence. Finally we have applied our main theorem to fully analyse the example Baird gives for the divergence of TD(0) [1].\n\nAppendix\n\nLemma 1 Let $D$ be a real $m \\times F$ matrix and $C^T$ a real $F \\times m$ matrix, where $m > F$. Then $K = D C^T$ has the same eigenvalues as $A = C^T D$ and additionally the eigenvalue zero with multiplicity $(m - F)$. Let $H^K_\\lambda$ be the generalised eigenspace of $K$ corresponding to the eigenvalue $\\lambda$ and $H^A_\\lambda$ the generalised eigenspace of $A$ corresponding to the eigenvalue $\\lambda$. Then, $C^T H^K_\\lambda \\subseteq H^A_\\lambda$ and $D H^A_\\lambda \\subseteq H^K_\\lambda$. 
For $\\lambda \\neq 0$ it even holds that $C^T H^K_\\lambda = H^A_\\lambda$ and $D H^A_\\lambda = H^K_\\lambda$.\n\nProof: The generalised eigenspace $H^K_\\lambda$ has index $s^K_\\lambda$ if $s^K_\\lambda$ is the smallest number for which $\\ker(K - \\lambda I_m)^{s^K_\\lambda} = \\ker(K - \\lambda I_m)^{s^K_\\lambda + 1}$ holds, where $I_m$ denotes the identity in $\\mathbb{R}^{m \\times m}$. Let $x \\in H^K_\\lambda$, i.e. $(K - \\lambda I_m)^{s^K_\\lambda} x = 0$. With $C^T K^i = A^i C^T$ we have\n\n$$C^T (K - \\lambda I_m)^{s^K_\\lambda} x = C^T \\Big( \\sum_{i=0}^{s^K_\\lambda} \\binom{s^K_\\lambda}{i} K^i (-\\lambda)^{s^K_\\lambda - i} \\Big) x = (A - \\lambda I_F)^{s^K_\\lambda} C^T x. \\qquad (4)$$\n\nThus, $C^T x \\in H^A_\\lambda$. And with the same argument we obtain $D x \\in H^K_\\lambda$ from $x \\in H^A_\\lambda$. Therefore, $C^T H^K_\\lambda \\subseteq H^A_\\lambda$ and $D H^A_\\lambda \\subseteq H^K_\\lambda$. Let $\\lambda \\neq 0$ and $B^K_\\lambda$ a basis in $H^K_\\lambda$. As the Jordan block of $K$ corresponding to $H^K_\\lambda$ is invertible the vectors $C^T B^K_\\lambda$ are linearly independent and therefore form a basis of the span $[C^T B^K_\\lambda]$. With the above consideration we have $[C^T B^K_\\lambda] \\subseteq H^A_\\lambda$. If this is a proper subset, $C^T B^K_\\lambda$ can be completed to form a basis $B^A_\\lambda$ of $H^A_\\lambda$ with $|B^K_\\lambda| < |B^A_\\lambda|$. Then we have that $D B^A_\\lambda$ is linearly independent and $[D B^A_\\lambda] \\subseteq H^K_\\lambda$. Moreover, we have $\\dim(H^K_\\lambda) = |B^K_\\lambda| < |B^A_\\lambda| = \\dim([D B^A_\\lambda]) \\leq \\dim(H^K_\\lambda)$, which is a contradiction. Therefore, $C^T H^K_\\lambda = [C^T B^K_\\lambda] = H^A_\\lambda$. Similarly, we obtain $D H^A_\\lambda = H^K_\\lambda$. Thus, the multiplicities of the eigenvalues $\\lambda \\neq 0$ of $A$ and $K$ are the same. The multiplicity of the eigenvalue zero of matrix $K$ is by $(m - F)$ larger than that of matrix $A$. $\\Box$\n\nProof of Theorem 1: Due to assumption (a) and Lemma 1 every eigenvalue of $A$ is either zero or has a real part less than zero. If the real part of every eigenvalue of $A$ is less than zero, $A$ is invertible. For invertible matrices Theorem 2.1.1 from [5] states that the iteration converges if and only if the spectral radius $\\rho(I_F + \\alpha A)$, i.e. the largest absolute value of its eigenvalues, is less than 1. For every eigenvalue $\\lambda_i$ of $A$ obviously $1 + \\alpha \\lambda_i$ is an eigenvalue of $I_F + \\alpha A$. 
With $H = \\max_i \\{ |\\lambda_i|^2 / |\\mathrm{Re}(\\lambda_i)| \\}$ we obtain for $\\alpha > 0$\n\n$$\\rho(I_F + \\alpha A) < 1 \\iff \\forall i: |1 + \\alpha \\lambda_i| < 1 \\iff \\alpha < \\frac{2}{H}. \\qquad (5)$$\n\nThis completes the proof if all eigenvalues of $A$ have a negative real part.\n\nIn the following let $A$ have the eigenvalue $\\lambda_1 = 0$. The vector space $\\mathbb{R}^F$ can be represented as the direct sum of the generalised eigenspaces $\\mathbb{R}^F = H^A_0 \\oplus H^A_{\\lambda_2} \\oplus \\ldots \\oplus H^A_{\\lambda_l}$. In the following we write $\\tilde{H}^A_0 = H^A_{\\lambda_2} \\oplus \\ldots \\oplus H^A_{\\lambda_l}$ because this is a complementary space of $H^A_0$. As the generalised eigenspaces of $A$ are invariant under $A$, i.e. $\\forall x \\in H^A_{\\lambda_i}: Ax \\in H^A_{\\lambda_i}$, the iteration $w^{n+1} = (I_F + \\alpha A) w^n + \\alpha b$ can be decomposed in two parts, one in the generalised eigenspace $H^A_0$ and the other in the complementary space $\\tilde{H}^A_0$. Let $w^n = \\bar{w}^n + \\tilde{w}^n$ and $b = \\bar{b} + \\tilde{b}$, where $\\bar{w}^n, \\bar{b} \\in H^A_0$ and $\\tilde{w}^n, \\tilde{b} \\in \\tilde{H}^A_0$. Then we have\n\n$$w^{n+1} = w^n + \\alpha (A w^n + b) = \\bar{w}^n + \\alpha (A \\bar{w}^n + \\bar{b}) + \\tilde{w}^n + \\alpha (A \\tilde{w}^n + \\tilde{b}). \\qquad (6)$$\n\nThus, the convergence analysis can be carried out separately for the two iterations. The matrix $A$ in iteration $\\tilde{w}^{n+1} = \\tilde{w}^n + \\alpha (A \\tilde{w}^n + \\tilde{b})$ is not invertible. However, the iteration takes place in the subspace $\\tilde{H}^A_0$. In this subspace the mapping associated with $A$ is invertible. Therefore, $A$ can be replaced by an invertible matrix $\\tilde{A}$ that does not alter the iteration in $\\tilde{H}^A_0$. The matrix $\\tilde{A}$ can be constructed such that the spectral radius $\\rho(I_F + \\alpha \\tilde{A})$ is determined by the nonzero eigenvalues of $A$ alone. Therefore, according to the considerations above the iteration converges for $0 < \\alpha < \\frac{2}{H}$.\n\nIn the following we show that the iteration in $H^A_0$ is the identity and therefore trivially converges. According to assumption (b) $H^K_0 = E^K_0$. All $v \\in \\mathbb{R}^m$ can be represented as $v = \\bar{v} + \\tilde{v}$ with $\\bar{v} \\in E^K_0$ and $\\tilde{v} \\in \\tilde{H}^K_0 = H^K_{\\lambda_2} \\oplus \\ldots \\oplus H^K_{\\lambda_l}$. According to Lemma 1 $C^T \\tilde{H}^K_0 = \\tilde{H}^A_0$ and $C^T H^K_0 \\subseteq H^A_0$ hold. Therefore, for $\\bar{b} + \\tilde{b} = b = C^T v$ we have $\\bar{b} = C^T \\bar{v}$ and $\\tilde{b} = C^T \\tilde{v}$. Let $E^K_0 \\neq \\{0\\}$. 
Then, for all $\\bar{v} \\in E^K_0$\n\n$$0 = K \\bar{v} = D C^T \\bar{v} \\implies C^T \\bar{v} \\in [C^T] \\cap [D^T]^\\perp \\overset{(c)}{\\implies} C^T \\bar{v} = 0.$$\n\nFor $E^K_0 = \\{0\\}$ we also obtain $C^T \\bar{v} = 0$ because $\\bar{v} = 0$. Therefore, we have $C^T E^K_0 = \\{0\\}$ and, as a consequence, $\\bar{b} = C^T \\bar{v} = 0$. What remains to be shown is that $A w = 0$ for all $w \\in H^A_0$. According to Lemma 1 we know that $D w \\in H^K_0$. Assumption (b) says that $H^K_0 = E^K_0$ and from the above considerations we know that $C^T E^K_0 = \\{0\\}$. Therefore, $A w = C^T (D w) = 0$. Thus, the iteration in $H^A_0$ is the identity. As both parts of the iteration converge the overall iteration also converges, which completes that part of the proof.\n\nThe limit $\\tilde{w}^*$ of $\\tilde{w}^{n+1} = \\tilde{w}^n + \\alpha (A \\tilde{w}^n + \\tilde{b})$ is unique and we have $\\tilde{w}^* = -\\tilde{A}^{-1} \\tilde{b}$. The limit of $\\bar{w}^{n+1} = \\bar{w}^n + \\alpha (A \\bar{w}^n + \\bar{b})$ is not unique, but depends on the initial value $\\bar{w}^0$. It holds that $\\bar{w}^* = \\bar{w}^0$. Therefore, the limit $w^* = \\bar{w}^* + \\tilde{w}^*$ depends on the initial value $w^0$.\n\nProof of Proposition 1: For the residual gradient algorithm we have $A_{RG} = -\\Theta^T X \\hat{D} X^T \\Theta$ and $b_{RG} = -\\Theta^T X \\hat{D} X^T r$. In order to apply Theorem 1 this is decomposed in $A_{RG} = C^T D$ and $b_{RG} = C^T v$ with $C = -D = \\sqrt{\\hat{D}} X^T \\Theta$ and $v = -\\sqrt{\\hat{D}} X^T r$. As the diagonal entries of $\\hat{D}$ are nonnegative we can write $\\sqrt{\\hat{D}}$ for the diagonal matrix whose entries are the square roots of those of $\\hat{D}$. Thus $[C^T] = [D^T]$, which yields condition (c) of Theorem 1. Moreover, the matrix $K = D C^T = -C C^T$ is symmetric and therefore diagonalisable. Hence, condition (b) is fulfilled and all eigenvalues are real. Let now $\\lambda \\neq 0$ be an eigenvalue of $K$ and let $x$ be a corresponding eigenvector. Then $0 > -(C^T x)^T (C^T x) = x^T K x = \\lambda x^T x$, which yields $\\lambda < 0$. Thus, all requirements are fulfilled and for an appropriate choice of $\\alpha$ the residual gradient algorithm converges independently of the concrete form of the function approximation scheme. 
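As a numerical illustration of this convergence statement, the following is a minimal sketch (not the authors' code) that runs the synchronous residual gradient iteration on the matrices of Section 4. It assumes X is the identity, so that A_RG = -Theta^T Theta, and, as an extra assumption not stated in the text, that all rewards are zero, so that b_RG = 0. The names PHI, THETA, bellman_error and residual_gradient, the start vector and the step counts are illustrative choices. For gamma = 0.9 the bound from Theorem 1 is alpha < 0.1571, so a rate of 0.15 should drive the Bellman residual towards zero while a rate of 0.17 should not.

```python
# Minimal sketch (not the authors' code): synchronous residual gradient
# iteration w <- w + alpha * (A_RG w + b_RG) on the matrices of Section 4.
# Assumptions: X is the identity (so A_RG = -Theta^T Theta), all rewards are
# zero (so b_RG = 0), gamma = 0.9, and the start vector below is arbitrary.

GAMMA = 0.9

# Feature matrix Phi (7 states, 8 features) as given in Section 4.
PHI = [[1.0 if j == 0 else (2.0 if j == i + 1 else 0.0) for j in range(8)]
       for i in range(6)]
PHI.append([2.0] + [0.0] * 6 + [1.0])

# Every transition leads to state 7, so each row of Z Phi is the last row of Phi.
THETA = [[GAMMA * PHI[6][j] - PHI[i][j] for j in range(8)] for i in range(7)]


def bellman_error(w):
    # Squared norm of Theta w, the Bellman residual for zero rewards.
    return sum(sum(t * x for t, x in zip(row, w)) ** 2 for row in THETA)


def residual_gradient(w, alpha, steps):
    # Repeats w <- w - alpha * Theta^T (Theta w) for the given number of steps.
    for _ in range(steps):
        resid = [sum(t * x for t, x in zip(row, w)) for row in THETA]
        grad = [sum(THETA[i][j] * resid[i] for i in range(7)) for j in range(8)]
        w = [x - alpha * g for x, g in zip(w, grad)]
    return w


w0 = [1.0] * 7 + [10.0]
w_small = residual_gradient(w0, 0.15, 5000)   # alpha below the bound 0.1571
w_large = residual_gradient(w0, 0.17, 200)    # alpha above the bound
print(bellman_error(w0), bellman_error(w_small), bellman_error(w_large))
```

With the smaller rate the slowest mode contracts by a factor of roughly 0.997 per step, which is why several thousand steps are used in this sketch; the larger rate lets the mode belonging to the eigenvalue -12.7296 grow in magnitude at every step.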
\n\nThe consistency of the residual gradient algorithm can be shown formally but due to space limitations we only give the following informal proof. The algorithm minimises the Bellman error, which is a quadratic objective function. Hence, there are no local optima and if the global optimum is not unique, the values of all global optima are identical. Due to its gradient descent property the residual gradient algorithm converges to such a global optimum independently of the initial value. In case of a tabular representation a global minimum has Bellman error zero and corresponds to an optimal solution. Thus, the residual gradient algorithm is consistent.\n\nA detailed description of how grid-based linear interpolation works in combination with RL can be found in [7]. Important for us is that in a $d$-dimensional grid each feature vector $\\varphi(x)$ satisfies $0 \\leq \\varphi_i(x) \\leq 1$ and $\\sum_{i=1}^F \\varphi_i(x) = 1$. With $\\langle \\cdot, \\cdot \\rangle$ denoting the standard scalar product and $\\| \\cdot \\|_2$ denoting the corresponding Euclidean norm, we have $|K_{i,j}| = |\\langle (C^T)_i, (C^T)_j \\rangle| \\leq \\max_l \\|(C^T)_l\\|_2^2$ with $\\|(C^T)_l\\|_2^2 = \\sum_{j=1}^F C_{l,j}^2$. According to the definition $C_{l,j} = (\\sqrt{\\hat{D}})_{l,l} \\sum_{k=1}^m X_{k,l} (\\gamma \\varphi_j(z_k) - \\varphi_j(x_k))$ holds. Moreover, from $D = X^T X$ it follows that $D_{l,l} = \\sum_{k=1}^m X_{k,l}^2 = \\sum_{k=1}^m X_{k,l}$ because $X_{k,l}$ is either zero or one. And besides that we have $\\hat{D}_{l,l} D_{l,l} = 1$. Altogether we obtain\n\n$$|K_{i,j}| \\leq \\gamma^2 \\Big( \\hat{D}_{l,l} \\sum_{k=1}^m X_{k,l} \\sum_{f=1}^F \\varphi_f(z_k) \\Big)^2 + \\Big( \\hat{D}_{l,l} \\sum_{k=1}^m X_{k,l} \\sum_{f=1}^F \\varphi_f(x_k) \\Big)^2 \\leq \\gamma^2 + 1.$$\n\nIt is well known that the spectral radius $\\rho$ of the matrix $K$ satisfies $\\rho(K) \\leq \\|K\\|$ for every norm $\\| \\cdot \\|$. Then, for the maximum norm of $K$ we obtain $\\|K\\|_\\infty = \\max_{1 \\leq i \\leq m} \\sum_{j=1}^m |K_{i,j}| \\leq m(1 + \\gamma^2)$. With $H = m(1 + \\gamma^2)$ this yields $\\rho(K) \\leq \\|K\\|_\\infty \\leq H$. Thus we have a bound for the absolute value of the largest eigenvalue of $K$. 
According to Theorem 1 the iteration converges for $\\alpha < \\frac{2}{H}$. $\\Box$\n\nReferences\n\n[1] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proc. of the Twelfth International Conference on Machine Learning, 1995.\n[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.\n[3] J. A. Boyan. Least-squares temporal difference learning. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 49-56, 1999.\n[4] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.\n[5] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, 1997.\n[6] D. Koller and R. Parr. Policy iteration for factored MDPs. In Proc. of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 326-334, 2000.\n[7] A. Merke and R. Schoknecht. A necessary condition of convergence for reinforcement learning with function approximation. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 411-418, Sydney, Australia, 2002.\n[8] M. G. Lagoudakis and R. Parr. Model-free least-squares policy iteration. In Advances in Neural Information Processing Systems, volume 14, 2002.\n[9] S. Pareigis. Adaptive choice of grid and time in reinforcement learning. Advances in Neural Information Processing Systems, 1998.\n[10] R. Schoknecht. Optimality of reinforcement learning algorithms with linear function approximation. In Advances in Neural Information Processing Systems, volume 15, 2003.\n[11] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.\n[12] J. N. Tsitsiklis and B. Van Roy. 
An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997.", "award": [], "sourceid": 2208, "authors": [{"given_name": "Ralf", "family_name": "Schoknecht", "institution": null}, {"given_name": "Artur", "family_name": "Merke", "institution": null}]}