{"title": "iLSTD: Eligibility Traces and Convergence Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 441, "page_last": 448, "abstract": null, "full_text": "iLSTD: Eligibility Traces and Convergence Analysis\n\nAlborz Geramifard\n\nMichael Bowling Martin Zinkevich Richard S. Sutton Department of Computing Science University of Alberta Edmonton, Alberta {alborz,bowling,maz,sutton}@cs.ualberta.ca\n\nAbstract\nWe present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. iLSTD is an incremental method for achieving results similar to LSTD, the dataefficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O(n2 ), where n is the number of parameters in the linear function approximator, while iLSTD is O(n). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough (n = 10, 000) to show substantial computational advantages over LSTD.\n\n1 Introduction\nA key part of many reinforcement learning algorithms is a policy evaluation process, in which the value function of a policy is estimated online from data. In this paper, we consider the problem of policy evaluation where the value function estimate is a linear function of state features and is updated after each time step. Temporal difference (TD) learning is a common approach to this problem [Sutton, 1988]. The TD algorithm updates its value-function estimate based on the observed TD error on each time step. The TD update takes only O(n) computation per time step, where n is the number of features. However, because conventional TD methods do not make any later use of the time step's data, they may require a great deal of data to compute an accurate estimate. More recently, LSTD [Bradtke and Barto, 1996] and its extension LSTD() [Boyan, 2002] were introduced as alternatives. Rather than making updates on each step to improve the estimate, these methods maintain compact summaries of all observed state transitions and rewards and solve for the value function which has zero expected TD error over the observed data. However, although LSTD and LSTD() make more efficient use of the data, they require O(n2 ) computation per time step, which is often impractical for the large feature sets needed in many applications. Hence, practitioners are often faced with the dilemma of having to chose between excessive computational expense and excessive data expense. Recently, Geramifard and colleagues [2006] introduced an incremental least-squares TD algorithm, iLSTD, as a compromise between the computational burden of LSTD and the relative data inefficiency of TD. The algorithm focuses on the common situation of large feature sets where only a small number of features are non-zero on any given time step. iLSTD's per-time-step computational complexity in this case is only O(n). In empirical results on a simple problem, iLSTD exhibited a rate of learning similar to that of LSTD.\n\n\f\nIn this paper, we substantially extend the iLSTD algorithm, generalizing it in two key ways. First, we include the use of eligibility traces, defining iLSTD() consistent with the family of TD() and LSTD() algorithms. We show that, under the iLSTD assumptions, the per-time-step computational complexity of this algorithm remains linear in the number of features. Second, we generalize the feature selection mechanism. We prove that for a general class of selection mechanisms, iLSTD() converges to the same solution as TD() and LSTD(), for all 0    1.\n\n2 Background\nReinforcement learning is an approach to finding optimal policies in sequential decision making problems with an unknown environment [e.g., see Sutton and Barto, 1998]. We focus on the class of environments known as Markov decision processes (MDPs). An MDP is a tuple, a a (S , A, Pss , Ras ,  ), where S is a set of states, A is a set of actions, Pss is the probability of s a a reaching state s fter taking action a in state s, and Rss is the reward received when that transition occurs, and   [0, 1] is a discount rate parameter. A trajectory of experience is a sequence s0 , a0 , r1 , s1 , a1 , r2 , s2 , . . ., where the agent in s1 takes action a1 and receives reward r2 while transitioning to s2 before taking a2 , etc. Given a policy, one often wants to estimate the policy's state-value function, or expected sum of discounted future rewards:  . s t  t-1 V (s) = E  rt 0 = s, \n=1\n\nIn particular, we are interested in approximating V  using a linear function approximator. Let  : S  n, be some features of the state space. Linear value functions are of the form V (s) = (s)T  , where   n are the parameters of the value function. In this work we will exclusively consider sparse feature representations: for all states s the number of non-zero features in (s) is no more than k n. Sparse feature representations are quite common as a generic approach to handling non-linearity [e.g., Stone et al., 2005].1 2.1 Temporal Difference Learning\n\nTD() is the traditional approach to policy evaluation [see Sutton and Barto, 1998]. It is based on  the computation of a -return, Rt (V ), at each time step:  . k ik  k-1 k i-1 Rt (V ) = (1 - )  V (st+k ) +  rt+i\n=1 =1\n\nNote that the -return is a weighted sum of k -step returns, each of which looks ahead k steps summing the discounted rewards as well as the estimated value of the resulting state. The -return forms the basis of the update to the value function parameters:  t+1 =  t + t (st ) (Rt (Vt ) - Vt (st )) , where t is the learning rate. This \"forward view\" requires a complete trajectory to compute the -return and update the parameters. The \"backward view\" is a more efficient implementation that depends only on one-step returns and an eligibility trace vector:  t+1 =  t + t ut (t ) ut ( ) = zt (rt+1 +  V (st+1 ) - V (st )) zt =  zt+1 + (st ), where zt is the eligibility trace and ut ( ) is the TD update. Notice that TD() requires only a constant number of vector operations and so is O(n) per time step. In the special case where  = 0 and the feature representation is sparse, this complexity can be reduced to O(k ). In addition, TD() is guaranteed to converge [Tsitsiklis and Van Roy, 1997].\n1 Throughout this paper we will use non-bolded symbols to refer to scalars (e.g.,  and t ), bold-faced lower-case symbols to refer to vectors (e.g.,  and bt ), and bold-faced upper-case symbols for matrices (e.g., At ).\n\n\f\n2.2\n\nLeast-Squares TD\n\nLeast-squares TD (LSTD) was first introduced by Bradtke and Barto [1996] and later extended with -returns by Boyan [2002]. LSTD() can be viewed as immediately solving for the value function parameters which would result in the sum of TD updates over the observed trajectory being zero. Let t ( ) be the sum of the TD updates through time t. If we let t = (st ) then, = t ( ) = = it\n=1 it =1\n\nui ( ) = zi r\ni+1\n\nit\n=1\n\nzi\n\nr\n\ni+1\n\n+  V (si+1 ) - V (si )\n\n+  T+1  - T  i i it\n=1\n\nit\n=1\n\nzi ri+1 - b\nt\n\nzi (i -  i+1 )T  = bt - At  . A\nt\n\n(1)\n\nSince we want to choose parameters such that the sum of TD updates is zero, we set Equation 1 to zero and solve for the new parameter vector,  t+1 = A-1 bt . t The online version of LSTD() incorporates each observed reward and state transition into the b vector and the A matrix and then solves for a new  . Notice that, once b and A are updated, the experience tuple can be forgotten without losing any information. Because A only changes by a small amount on each time step, A-1 can also be maintained incrementally. The computation requirement is O(n2 ) per time step. Like TD(), LSTD() is guaranteed to converge [Boyan, 2002]. 2.3 iLSTD\n\niLSTD was recently introduced to provide a balance between LSTD's data efficiency and TD's time efficiency for  = 0 when the feature representation is sparse [Geramifard et al., 2006]. The basic idea is to maintain the same A matrix and b vector as LSTD, but to only incrementally solve for  . The update to  requires some care as the sum TD update itself would require O(n2 ). iLSTD instead updates only single dimensions of  , each of which requires O(n). By updating m parameters of  , which is a parameter that can be varied to trade off data and computational efficiency, iLSTD requires O(mn + k 2 ) per time step, which is linear in n. The result is that iLSTD can scale to much larger feature spaces than LSTD, while still retaining much of its data efficiency. Although the original formulation of iLSTD had no proof of convergence, it was shown in synthetic domains to perform nearly as well as LSTD with dramatically less computation. In the remainder of the paper, we describe a generalization, iLSTD(), of the original algorithm to handle  > 0. By also generalizing the mechanism used to select the feature parameters to update, we additionally prove sufficient conditions for convergence.\n\n3 The New Algorithm with Eligibility Traces\nThe iLSTD() algorithm is shown in Algorithm 1. The new algorithm is a generalization of the original iLSTD algorithm in two key ways. First, it uses eligibility traces (z) to handle  > 0. Line 5 updates z, and lines 59 incrementally compute the same At , bt , and t as described in Equation 1. Second, the dimension selection mechanism has been relaxed. Any feature selection mechanism can be employed in line 11 to select a dimension of the sum TD update vector ().2 Line 12 will then take a step in that dimension, and line 13 updates the  vector accordingly. The original iLSTD algorithm can be recovered by simply setting  to zero and selecting features according to the dimension of  with maximal magnitude. We now examine iLSTD()'s computational complexity.\nThe choice of this mechanism will determine the convergence properties of the algorithm, as discussed in the next section.\n2\n\n\f\nAlgorithm 1: iLSTD() Complexity 0 s  s0 , z  0, A  0,   0, t  0 1 Initialize  arbitrarily 2 repeat 4 3 Take action according to  and observe r, s tt+1 5 z   z + (s) O(n) 6  b  zr O(n) 7 A  z((s) -  (s ))T O(k n) 8 A  A + A O(k n) 9    + b - (A) O(k n) 10 for i from 1 to m do 11 j  choose an index of  using some feature selection mechanism 12 j  j + j O(1) 13    - j Aej O(n) 14 end for 1 15 ss 6 end repeat\n\nTheorem 1 Assume that the feature selection mechanism takes O(n) computation. If there are n features and, for any given state s, (s) has at most k non-zero elements, then the iLSTD() algorithm requires O((m + k )n) computation per time step. Proof Outside of the inner loop, lines 79 are the most computationally expensive steps of iLSTD(). Since we assumed that each feature vector has at most k non-zero elements, and the T z vector can have up to n non-zero elements, the z ((s) -  (s )) matrix (line 7) has at most 2k n non-zero elements. This leads to O(nk ) complexity for the outside of the loop. Inside, the complexity remains unchanged from iLSTD with the most expensive lines being 11 and 13. Because  and A do not have any specific structure, the inside loop time3 is O(n). Thus, the final bound for the algorithm's per-time-step computational complexity is O((m + k )n). 4\n\nConvergence\nWe now consider the convergence properties of iLSTD(). Our analysis follows that of Bertsekas and Tsitsiklis [1996] very closely to establish that iLSTD() converges to the same solution that TD() does. However, whereas in their analysis they considered Ct and dt that had expectations that converged quickly, we consider Ct and dt that may converge more slowly, but in value instead of expectation. In order to establish our result, we consider the theoretical model where for all t, yt  Rn ,dt  Rn , Rt , Ct  Rnn , t  R, and: yt+1 = yt + t (Rt )(Ct yt + dt ). (2)\n\nOn every round, Ct and dt are selected first, followed by Rt . Define Ft to be the state of the algorithm on round t before Rt is selected. Ct and dt are sequences of random variables. In order to prove convergence of yt , we assume that there is a C , d , v ,  > 0, and M such that: C is negative definite, Ct converges to C with probability 1, dt converges to d with probability 1, E[Rt |Ft ] = I , and Rt  M , T A5. limT  t=1 t = , and A6. t < v t- . A1. A2. A3. A4.\nNote that Aei selects the ith column of A and so does not require the usual quadratic time for multiplying a vector by a square matrix.\n3\n\n\f\nTheorem 2 Given the above assumptions, yt converges to -(C )-1 d with probability 1. The proof of this theorem is included in the additional material and will be made available as a companion technical report. Now we can map iLSTD() on to this mathematical model: 1. 2. 3. 4. 5. yt =  t , t = t/n, Ct = -At /t, dt = bt /t, and Rt is a matrix, where there is an n on the diagonal in position (kt , kt ) (where kt is uniform random over the set {1,. . . , n} and i.i.d.) and zeroes everywhere else.\n\nThe final assumption defines the simplest possible feature selection mechanism sufficient for convergence, viz., uniform random selection of features. Theorem 3 If the Markov decision process is finite, iLSTD() with a uniform random feature selection mechanism converges to the same result as TD(). Although this result is for uniform random selection, note that Theorem 2 outlines a broad range of possible mechanisms sufficient for convergence. However, the greedy selection of the original iLSTD algorithm does not meet these conditions, and so has no guarantee of convergence. As we will see in the next section, though, greedy selection performs quite well despite this lack of asymptotic guarantee. In summary, finding a good feature selection mechanism remains an open research question. As a final aside, one can go beyond iLSTD() and consider the case where Rt = I, i.e., we take a step in all directions at once on every round. This does not correspond to any feature selection mechanism and in fact requires O(n2 ) computation. However, we can examine this algorithm's rate of convergence. In particular we find it converges linearly fast to LSTD(). Theorem 4 If Ct is negative definite, for some  dependent upon Ct , if Rty I, then there exis< = ts + (Ct )-1 dt an y  (0, 1) such .that for all yt , if yt+1 = yt +  (Ct yt + dt ), then  t+1  t + (Ct )-1 dt This may explain why iLSTD()'s performance, despite only updating a single dimension, approaches LSTD() so quickly in the experimental results in the next section.\n\n5 Empirical Results\nWe now examine the empirical performance of iLSTD(). We first consider the simple problem introduced by Boyan [2002] and on which the original iLSTD was evaluated. We then explore the larger mountain car problem with a tile coding function approximator. In both problems, we compare TD(), LSTD(), and two variants of iLSTD(). We evaluate both the random feature selection mechanism (\"iLSTD-random\"), which is guaranteed to converge,4 as well as the original iLSTD feature selection rule (\"iLSTD-greedy\"), which is not. In both cases, the number of dimensions picked per iteration is m = 1. The step size () used for both iLSTD() and TD() was of the same form as in Boyan's experiments, with a slightly faster decay rate in order to make it consistent with the proof 's assumption. N0 + 1 t = 0 N0 + Episode#1.1 For the TD() and iLSTD() algorithms, the best 0 and N0 have been selected through experimental search of the sets of 0  {0.01, 0.1, 1} and N0  {100, 100, 106 } for each domain and  value, which is also consistent with Boyan's original experiments.\n4 When selecting features randomly we exclude dimensions with zero sum TD update. To be consistent with the assumptions of Theorem 2, we compensate by multiplying the learning rate t by the fraction of features that are non-zero at time t.\n\n\f\n-3\n\n-3\n\n-3\n\n-3\n\n-3\n\nGoal\n13\n-3 -3\n\n5\n\n-3\n\n4\n\n-3\n\n3\n\n-3\n\n2\n\n-2\n\n1\n\n0\n\n0\n\n1 0 0 0\n\n0 0 1 0\n\n0 0 .75 .25\n\n0 0 .5 .5\n\n0 0 .25 .75\n\n0 0 0 1\n\n0 0 0 0\n\n(a)\n\n(b)\n\nFigure 1: The two experimental domains: (a) Boyan's chain example and (b) mountain car.\n10 RMS error of V(s) over all states\n0\n\n10\n\n-1\n\n10\n\n-2\n\nTD iLSTD-Random iLSTD-Greedy LSTD 0 0.5 0.7 ! 0.8 0.9 1\n\nFigure 2: Performance of various algorithms in Boyan's chain problem with 6 different lambda values. Each line represents the averaged error over last 100 episodes after 100, 200, and 1000 episodes respectively. Results are also averaged over 30 trials.\n\n5.1\n\nBoyan Chain Problem\n\nThe first domain we consider is the Boyan chain problem. Figure 1(a) shows the Markov chain together with the feature vectors corresponding to each state. This is an episodic task where the discount factor  is one. The chain starts in state 13 and finishes in state 0. For all states s > 2, there exists an equal probability of ending up in (s - 1) and (s - 2). The reward is -3 for all transitions except from state 2 to 1 and state 1 to 0, where the rewards are -2 and 0, respectively. Figure 2 shows the comparative results. The horizontal axis corresponds to different  values, while the vertical axis illustrates the RMS error in a log scale averaged over all states uniformly. Note that in this domain, the optimum solution is in the space spanned by the feature vectors:   = (-24, -16, -8, 0)T . Each line shows the averaged error over last 100 episodes after 100, 200, and 1000 episodes over the same set of observed trajectories based on 30 trials. As expected, LSTD() requires the least amount of data, obtaining a low average error after only 100 episodes. With only 200 episodes, though, the iLSTD() methods are performing as well as LSTD(), and dramatically outperforming TD(). Finally, notice that iLSTD-Greedy() despite its lack of asymptotic guarantee, is actually performing slightly better than iLSTD-Random() for all cases of . Although  did not play a significant role for LSTD() which matches the observation of Boyan [Boyan, 1999],  > 0 does show an improvement in performance for the iLSTD() methods. Table 1 shows the total averaged per-step CPU time for each method. For all methods sparse matrix optimizations were utilized and LSTD used the efficient incremental inverse implementation. Although TD() is the fastest method, the overall difference between the timings in this domain is very small, which is due to the small number of features and a small ratio n . In the next domain, we k illustrate the effect of a larger and more interesting feature space where this ratio is larger.\n\n\f\nAlgorithm TD() iLSTD() LSTD()\n\nCPU time/step (msec) Boyan's chain Mountain car 0.3057.0e-4 5.353.5e-3 0.3707.0e-4 9.802.8e-1 0.3677.0e-4 253.42\n\nTable 1: The averaged CPU time per step of the algorithms used in Boyan's chain and mountain car problems.\n10 5 10 10 10\n85 7 8\n\n7 4 10 5 108 10 6 10 4 107 10 3 10 5 3 10 3 106 10 4 10 5 10 2 10 10 0 2 3 2 10 1 104 10 0 2 10 0 103 10 01 - 1 2 1 3 2 4 5\n\n6\n\n!-Greedy RandomLS i Boltzmann TD-Greedy Greedy TD !-Greedy RandomnnLSTD-Random Boltzma i TD Greedy !-Greedy iLSTD-Random Boltzmann TD LS\n\nE Measure Error Measure rror MeasureError Loss Loss Function Function\n\nError MeasureError Measure ss Loss Function E Error Measure rror Measure Lo Function\n\n8 10 6 10 4 10\n\n7\n\nRandom Greedy ! anr om R-Gdeedy Boltzm LS Greedyainn TD-Greedy\n\n!=0 !=0\n\n10 5 10 10 10\n5 7\n\n8\n\n1084 10 10 107 10 1064 10 10 83 10 10 105 10 7 10 104 10 62 10 10 0 1 1033 10 5 10 0 0 102 10 4 10\n-1 1 3 2 3 4 5\n\n6\n\nRandom Greedy !-Greedy Boltzmann TD-Greedy iLS\n\n!=0 ! = 0.9\n\nRandom Greedy TD !-Greedy i BoltzmannLSTD-Random Random TD Greedy LSTD !-Greedy iLSTD-Random Boltzmann 200 200\n\nLSTD\n200 200 400 400 600 Episode 600 Episode 800 800 1000 1000\n\nLST0 40D\n400\n\n600 Episode 600 Episode\n\n800 800\n\n1000 1000\n\n10 10 Figure 31:0 Performance of various methods in mo10ntain car problem with two different lambda u 10 10 values. LSTD was run only every 100 episodes. Results are averaged over 30 trials. 10\n0 10 -2 101 10 -1\n\nSmall Small Small\n\n-2 - 10 0 1 10\n\nMedium Large Boyan Chain Medium Large Boyan Chain Medium Large Boyan Chain\n\n10 1Hard 0 Easy Mountain Car\n-1 10 1\n\n-2 0 2\n\nSmall\n\nMedium Large Boyan Chain\n\nEasy Hard Mountain Car\n\n10\n\n5.2\n\nMountain Car 10\n-2 10-1 -2\n\n1Hard 0 Easy Mountain Car 10 0 10 ard Easy H Mountain Car\n-1 -2\n\nSmall\n\nMedium Large Boyan Chain\n\nEasy Hard Mountain Car\n\nOur second tesmalled is Medeum ountarige car doasyain [eHgr.d see Sutton and Barto, 1998]. Illustrated in t-b th i m n m 10 1. , 0 S La E a Boyan Chain Figure 1(b), the episodic task for the car isMountainaCar the goal state. Possible actions are accelerate to re ch forward, accelerate backward, and coast. The obse10 atioSmisl a paMredoufmcontinaugeus valuass: positHorn rv n al ii ia d L ro Ee y Mountain Car and velocity. The initial value of the state was -1 for position andBoyanor velocity. Further details 0 f Chain about the mountain car problem are available online [RL Library, 2006]. As we are focusing on policy evaluation, the policy was fixed for the car to always accelerate in the direction of its current velocity, although the environment is stochastic and the chosen action is replaced with a random one with 10% probability. Tile Coding [e.g., see Sutton, 1996] was selected as our linear function approximator. We used ten tilings (k = 10) over the combination of the two parameter sets and hashed the tilings into 10,000 features (n = 10, 000). The rest of the settings were identical to those in the Boyan chain domain.\n-2\n\nFigure 3 shows the results of the different methods on this problem with two different  values. The horizontal axis shows the number of episodes, while the vertical axis represents our loss function in log scale. The loss we used was ||b - A  ||2 , where A and b were computed for each  from 200,000 episodes of interaction with the environment. With  = 0, both iLSTD() methods performed considerably better than TD() in terms of data efficiency. The iLSTD() methods even reached a level competitive with LSTD() after 600 episodes. For  = 0.9, it proved to be difficult to find stable learning rate parameters for iLSTD-Greedy(). While some iterations performed competitively with LSTD(), others performed extremely poorly with little show of convergence. Hence, we did not include the performance line in the figure. This fact may suggest that the greedy feature selection mechanism does not converge, or it may simply be more sensitive to the learning rate. Finally, notice that the plotted loss depends upon , and so the two graphs cannot be directly compared.\n, In this environment the n is relatively large ( 101000 = 1000), which translates into a dramatic imk 0 provement of iLSTD() over LSTD as can be see in Table 1. Again sparse matrix optimizations were utilized and LSTD() used the efficient incremental ivnerse implementation. The computational demands of LSTD() can easily prohibit its application in domains with a large feature space. When the feature representation is sparse, though, iLSTD() can still achieve results competitive with LSTD() using computation more on par with the time efficient TD().\n\n\f\n6 Conclusion\nIn this paper, we extended the previous iLSTD algorithm by incorporating eligibility traces without increasing the asymptotic per time-step complexity. This extension resulted in improvements in performance in both the Boyan chain and mountain car domains. We also relaxed the dimension selection mechanism of the algorithm and presented sufficient conditions on the mechanism under which iLSTD() is guaranteed to converge. Our empirical results showed that while LSTD() can be impractical in on-line learning tasks with a large number of features, iLSTD() still scales well while having similar performance to LSTD. This work opens up a number of interesting directions for future study. Our results have focused on two very simple feature selection mechanisms: random and greedy. Although the greedy mechanism does not meet our sufficient conditions for convergence, it actually performed slightly better on the examined domains than the theoretically guaranteed random selection. It would be interesting to perform a thorough exploration of possible mechanisms to find a mechanism with both good empirical performance while satisfying our sufficient conditions for convergence. In addition, it would be interesting to apply iLSTD() in even more challenging environments where the large number of features has completely prevented the least-squares approach, such as in simulated soccer keepaway [Stone et al., 2005].\n\nReferences\n[Bertsekas and Tsitsiklis, 1996] Dmitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996. [Boyan, 1999] Justin A. Boyan. Least-squares temporal difference learning. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 4956. Morgan Kaufmann, San Francisco, CA, 1999. [Boyan, 2002] Justin A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49:233246, 2002. [Bradtke and Barto, 1996] S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:3357, 1996. [Geramifard et al., 2006] Alborz Geramifard, Michael Bowling, and Richard S. Sutton. Incremental least-squares temporal difference learning. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI), pages 356361. AAAI Press, 2006. [RL Library, 2006] RL Library. The University of Alberta reinforcement learning library. http: //rlai.cs.ualberta.ca/RLR/environment.html, 2006. [Stone et al., 2005] Peter Stone, Richard S. Sutton, and Gregory Kuhlmann. Reinforcement learning for robocup soccer keepaway. International Society for Adaptive Behavior, 13(3):165188, 2005. [Sutton and Barto, 1998] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. [Sutton, 1988] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:944, 1988. [Sutton, 1996] Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pages 10381044. The MIT Press, 1996. [Tsitsiklis and Van Roy, 1997] John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporaldifference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674690, 1997.\n\n\f\n", "award": [], "sourceid": 3092, "authors": [{"given_name": "Alborz", "family_name": "Geramifard", "institution": null}, {"given_name": "Michael", "family_name": "Bowling", "institution": null}, {"given_name": "Martin", "family_name": "Zinkevich", "institution": null}, {"given_name": "Richard", "family_name": "Sutton", "institution": null}]}