Temporal Difference Updating without a Learning Rate

Advances in Neural Information Processing Systems, pages 705-712.

Marcus Hutter
RSISE@ANU and SML@NICTA
Canberra, ACT, 0200, Australia
marcus@hutter1.net    www.hutter1.net

Shane Legg
IDSIA, Galleria 2, Manno-Lugano CH-6928, Switzerland
shane@vetta.org    www.vetta.org/shane

Abstract

We derive an equation for temporal difference learning from statistical principles. Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so-called TD(λ), but it lacks the parameter α that specifies the learning rate. In place of this free parameter there is now an equation for the learning rate that is specific to each state transition. We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. To do this we combine our update equation with both Watkins' Q(λ) and Sarsa(λ) and find that it again offers superior performance without a learning rate parameter.

1 Introduction

In the field of reinforcement learning, perhaps the most popular way to estimate the future discounted reward of states is the method of temporal difference learning. It is unclear who introduced this first, but the first explicit version of temporal difference as a learning rule appears to be due to Witten [9]. The idea is as follows. The expected future discounted reward of a state s is

    V_s := E{ r_k + γ r_{k+1} + γ² r_{k+2} + ··· | s_k = s },

where the rewards r_k, r_{k+1}, ... are geometrically discounted into the future by γ < 1. From this definition it follows that

    V_s = E{ r_k + γ V_{s_{k+1}} | s_k = s }.                                    (1)

Our task, at time t, is to compute an estimate V^t_s of V_s for each state s. The only information we have to base this estimate on is the current history of state transitions, s_1, s_2, ..., s_t, and the current history of observed rewards, r_1, r_2, ..., r_t. Equation (1) suggests that at time t+1 the value of r_t + γ V_{s_{t+1}} provides us with information on what V^t_s should be: if it is higher than V^t_{s_t} then perhaps this estimate should be increased, and vice versa. This intuition gives us the following estimation heuristic for state s_t,

    V^{t+1}_{s_t} := V^t_{s_t} + α ( r_t + γ V^t_{s_{t+1}} − V^t_{s_t} ),

where α is a parameter that controls the rate of learning. This type of temporal difference learning is known as TD(0).

One shortcoming of this method is that at each time step only the value of the last state s_t is updated. States before the last state are also affected by changes in the last state's value, and thus these could be updated too. This is what happens with so-called temporal difference learning with eligibility traces, where a history, or trace, is kept of which states have been recently visited. Under this method, when we update the value of a state we also go back through the trace, updating the earlier states as well. Formally, for any state s its eligibility trace is computed by

    E^t_s := γλ E^{t−1}_s        if s ≠ s_t,
    E^t_s := γλ E^{t−1}_s + 1    if s = s_t,

where λ is used to control the rate at which the eligibility trace is discounted.
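The two rules above are easy to state in code. The following is a minimal Python sketch (the dictionary-based tables and the particular parameter values are our own illustration, not from the paper):

```python
def td0_step(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """TD(0): move only the current state's value estimate."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def trace_step(E, s_t, lam=0.9, gamma=0.99):
    """Eligibility traces: decay every trace, then credit the visited state."""
    for s in E:
        E[s] *= gamma * lam
    E[s_t] += 1.0

# toy usage on a two-state chain
V = {0: 0.0, 1: 0.0}
E = {0: 0.0, 1: 0.0}
td0_step(V, s=0, r=1.0, s_next=1)   # V[0] -> 0.1
trace_step(E, s_t=0)                # E[0] -> 1.0
```
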
The temporal difference update is then, for all states s,

    V^{t+1}_s := V^t_s + α E^t_s ( r_t + γ V^t_{s_{t+1}} − V^t_{s_t} ).           (2)

This more powerful version of temporal difference learning is known as TD(λ) [7].

The main idea of this paper is to derive a temporal difference rule from statistical principles and compare it to the standard heuristic described above. Superficially, our work has some similarities to LSTD(λ) ([2] and references therein). However, LSTD is concerned with finding a least-squares linear function approximation, it has not yet been developed for general γ and λ, and its update time is quadratic in the number of features/states. Our algorithm, on the other hand, coincides exactly with TD/Q/Sarsa(λ) for finite state spaces, but with a novel learning rate derived from statistical principles. We therefore focus our comparison on TD/Q/Sarsa. For a recent survey of methods to set the learning rate see [1].

In Section 2 we derive a least squares estimate for the value function. By expressing the estimate as an incremental update rule we obtain a new form of TD(λ), which we call HL(λ). In Section 3 we compare HL(λ) to TD(λ) on a simple Markov chain. We then test it on a random Markov chain in Section 4 and a non-stationary environment in Section 5. In Section 6 we derive two new methods for policy learning based on HL(λ), and compare them to Sarsa(λ) and Watkins' Q(λ) on a simple reinforcement learning problem. Section 7 ends the paper with a summary and some thoughts on future research directions.

2 Derivation

The empirical future discounted reward of a state s_k is the sum of actual rewards following from state s_k in time steps k, k+1, ..., where the rewards are discounted as they go into the future. Formally, the empirical value of state s_k at time k, for k = 1, ..., t, is

    v_k := Σ_{u=k}^{∞} γ^{u−k} r_u,                                               (3)

where the future rewards r_u are geometrically discounted by γ < 1. In practice the exact value of v_k is always unknown to us, as it depends not only on rewards that have already been observed, but also on unknown future rewards. Note that if s_m = s_n for m ≠ n, that is, we have visited the same state twice at different times m and n, this does not imply that v_n = v_m, as the observed rewards following the state visit may be different each time.

Our goal is that for each state s the estimate V^t_s should be as close as possible to the true expected future discounted reward V_s. Thus, for each state s we would like V^t_s to be close to v_k for all k such that s = s_k. Furthermore, in non-stationary environments we would like to discount old evidence by some parameter λ ∈ (0, 1]. Formally, we want to minimise the loss function

    L := (1/2) Σ_{k=1}^{t} λ^{t−k} ( v_k − V^t_{s_k} )².                          (4)

For stationary environments we may simply set λ = 1 a priori.

As we wish to minimise this loss, we take the partial derivative with respect to the value estimate of each state and set it to zero,

    ∂L/∂V^t_s = − Σ_{k=1}^{t} λ^{t−k} ( v_k − V^t_{s_k} ) δ_{s_k s}
              = V^t_s Σ_{k=1}^{t} λ^{t−k} δ_{s_k s} − Σ_{k=1}^{t} λ^{t−k} δ_{s_k s} v_k = 0,

where we could change V^t_{s_k} into V^t_s due to the presence of the Kronecker delta δ_{s_k s}, defined by δ_{xy} := 1 if x = y, and 0 otherwise. By defining a discounted state visit counter N^t_s := Σ_{k=1}^{t} λ^{t−k} δ_{s_k s} we get

    V^t_s N^t_s = Σ_{k=1}^{t} λ^{t−k} δ_{s_k s} v_k.                              (5)

Since v_k depends on future rewards r_k, Equation (5) cannot be used in its current form.
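As a quick sanity check, the closed-form solution can be verified numerically. The sketch below uses made-up visits and empirical values (none of these numbers come from the paper) and confirms that perturbing the estimate away from the weighted average of Equation (5) increases the loss of Equation (4):

```python
lam = 0.8
visits = [0, 1, 0, 0, 1]           # hypothetical state sequence s_1..s_t
v = [1.0, -0.5, 0.3, 0.7, 0.2]     # hypothetical empirical values v_k
t = len(visits)

def loss(V):
    # Equation (4): discounted squared error of the estimates V[state]
    return 0.5 * sum(lam ** (t - (k + 1)) * (v[k] - V[visits[k]]) ** 2
                     for k in range(t))

# Equation (5) solved for state 0: V_0 = (sum_k lam^{t-k} [s_k = 0] v_k) / N_0
w = [lam ** (t - (k + 1)) for k in range(t)]
N0 = sum(w[k] for k in range(t) if visits[k] == 0)
V0 = sum(w[k] * v[k] for k in range(t) if visits[k] == 0) / N0

# the closed form is a minimum of the loss in the V_0 direction
assert loss([V0, 0.0]) < loss([V0 + 0.01, 0.0])
assert loss([V0, 0.0]) < loss([V0 - 0.01, 0.0])
```
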
Next we note that v_k has a self-consistency property with respect to the rewards. Specifically, the tail of the future discounted reward sum for each state depends on the empirical value at time t in the following way:

    v_k = Σ_{u=k}^{t−1} γ^{u−k} r_u + γ^{t−k} v_t.

Substituting this into Equation (5) and exchanging the order of the double sum,

    V^t_s N^t_s = Σ_{k=1}^{t} λ^{t−k} δ_{s_k s} Σ_{u=k}^{t−1} γ^{u−k} r_u + Σ_{k=1}^{t} λ^{t−k} δ_{s_k s} γ^{t−k} v_t
                = Σ_{u=1}^{t−1} λ^{t−u} Σ_{k=1}^{u} (λγ)^{u−k} δ_{s_k s} r_u + Σ_{k=1}^{t} (λγ)^{t−k} δ_{s_k s} v_t
                = R^t_s + E^t_s v_t,

where E^t_s := Σ_{k=1}^{t} (λγ)^{t−k} δ_{s_k s} is the eligibility trace of state s, and R^t_s := Σ_{u=1}^{t−1} λ^{t−u} E^u_s r_u is the discounted reward with eligibility.

E^t_s and R^t_s depend only on quantities known at time t. The only unknown quantity is v_t, which we have to replace with our current estimate of this value at time t, namely V^t_{s_t}. In other words, we bootstrap our estimates. This gives us

    V^t_s N^t_s = R^t_s + E^t_s V^t_{s_t}.                                        (6)

For state s = s_t, this simplifies to V^t_{s_t} = R^t_{s_t} / ( N^t_{s_t} − E^t_{s_t} ). Substituting this back into Equation (6) we obtain

    V^t_s N^t_s = R^t_s + E^t_s R^t_{s_t} / ( N^t_{s_t} − E^t_{s_t} ).            (7)

This gives us an explicit expression for our V estimates. However, from an algorithmic perspective an incremental update rule is more convenient. To derive this we make use of the relations

    N^{t+1}_s = λ N^t_s + δ_{s_{t+1} s},    E^{t+1}_s = λγ E^t_s + δ_{s_{t+1} s},    R^{t+1}_s = λ R^t_s + λ E^t_s r_t,

with N^0_s = E^0_s = R^0_s = 0. Inserting these into Equation (7) with t replaced by t+1, and cancelling the common factor λ in the last fraction,

    V^{t+1}_s N^{t+1}_s = R^{t+1}_s + E^{t+1}_s R^{t+1}_{s_{t+1}} / ( N^{t+1}_{s_{t+1}} − E^{t+1}_{s_{t+1}} )
                        = λ R^t_s + λ E^t_s r_t + E^{t+1}_s ( R^t_{s_{t+1}} + E^t_{s_{t+1}} r_t ) / ( N^t_{s_{t+1}} − γ E^t_{s_{t+1}} ).

By solving Equation (6) for R^t_s and substituting back in,

    V^{t+1}_s N^{t+1}_s = λ ( V^t_s N^t_s − E^t_s V^t_{s_t} ) + λ E^t_s r_t
                          + E^{t+1}_s ( N^t_{s_{t+1}} V^t_{s_{t+1}} − E^t_{s_{t+1}} V^t_{s_t} + E^t_{s_{t+1}} r_t ) / ( N^t_{s_{t+1}} − γ E^t_{s_{t+1}} ).

Dividing through by N^{t+1}_s (= λ N^t_s + δ_{s_{t+1} s}),

    V^{t+1}_s = V^t_s + ( −δ_{s_{t+1} s} V^t_s − λ E^t_s V^t_{s_t} + λ E^t_s r_t ) / ( λ N^t_s + δ_{s_{t+1} s} )
                + ( λγ E^t_s + δ_{s_{t+1} s} ) ( N^t_{s_{t+1}} V^t_{s_{t+1}} − E^t_{s_{t+1}} V^t_{s_t} + E^t_{s_{t+1}} r_t )
                  / ( ( N^t_{s_{t+1}} − γ E^t_{s_{t+1}} ) ( λ N^t_s + δ_{s_{t+1} s} ) ).

Putting the first fraction over the same denominator as the second, expanding the numerator, and cancelling equal terms (keeping in mind that in every term with a Kronecker δ_{xy} factor we may assume x = y, as the term is zero otherwise), we can factor out E^t_s to obtain

    V^{t+1}_s = V^t_s + E^t_s ( λ r_t N^t_{s_{t+1}} − λ V^t_{s_t} N^t_{s_{t+1}} + λγ N^t_{s_{t+1}} V^t_{s_{t+1}} + γ V^t_s δ_{s_{t+1} s} − δ_{s_{t+1} s} V^t_{s_t} + δ_{s_{t+1} s} r_t )
                  / ( ( N^t_{s_{t+1}} − γ E^t_{s_{t+1}} ) ( λ N^t_s + δ_{s_{t+1} s} ) ).

Finally, we factor λ N^t_{s_{t+1}} + δ_{s_{t+1} s} out of the numerator (using γ V^t_s δ_{s_{t+1} s} = γ V^t_{s_{t+1}} δ_{s_{t+1} s}), and note that whether or not s = s_{t+1}, the ratio ( λ N^t_{s_{t+1}} + δ_{s_{t+1} s} ) / ( λ N^t_s + δ_{s_{t+1} s} ) equals N^t_{s_{t+1}} / N^t_s. This yields our update rule

    V^{t+1}_s = V^t_s + E^t_s β_t(s, s_{t+1}) ( r_t + γ V^t_{s_{t+1}} − V^t_{s_t} ),   (8)

where the learning rate is given by

    β_t(s, s_{t+1}) := 1 / ( N^t_{s_{t+1}} − γ E^t_{s_{t+1}} ) · N^t_{s_{t+1}} / N^t_s.   (9)

Examining Equation (8), we find the usual update equation for temporal difference learning with eligibility traces (see Equation (2)); however, the learning rate α has now been replaced by β_t(s, s_{t+1}). This learning rate was derived from statistical principles by minimising the squared loss between the estimated and true state value. In the derivation we exploited the fact that the latter must be self-consistent, and then bootstrapped to get Equation (6). This gives us an equation for the learning rate for each state transition at time t, as opposed to standard temporal difference learning, where the learning rate α is either a fixed free parameter for all transitions or is decreased over time by some monotonically decreasing function. In either case, the learning rate is not automatic and must be experimentally tuned for good performance.
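In code, one HL(λ) step over a finite state table can be sketched as follows. This is our own Python rendering of Equations (8) and (9), following the operation order of the HLS(λ) algorithm given later (the trace and counter of the current state are bumped before β is computed, and everything is decayed afterwards); the names and initial values are ours:

```python
def hl_step(V, N, E, s, r, s_next, lam=1.0, gamma=0.99):
    """One HL(lambda) update of the value table V.

    N holds discounted visit counters, E holds eligibility traces.
    Initialise V and E to 0 and N to 1 for every state."""
    delta = r + gamma * V[s_next] - V[s]
    E[s] += 1.0
    N[s] += 1.0
    denom = N[s_next] - gamma * E[s_next]
    for x in V:
        beta = N[s_next] / (denom * N[x])   # Equation (9)
        V[x] += beta * E[x] * delta         # Equation (8)
    for x in V:                             # decay for the next step
        E[x] *= gamma * lam
        N[x] *= lam

V = {0: 0.0, 1: 0.0}
N = {0: 1.0, 1: 1.0}
E = {0: 0.0, 1: 0.0}
hl_step(V, N, E, s=0, r=1.0, s_next=1)
```

Note that there is no α anywhere: the step size is computed from the visit counters and traces alone.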
The above derivation appears to solve this problem in principle.

The first term in β_t seems to provide some type of normalisation to the learning rate, though the intuition behind this is not clear to us. The meaning of the second term, however, can be understood as follows: N^t_s measures how often we have visited state s in the recent past. Therefore, if N^t_s ≪ N^t_{s_{t+1}}, then state s has a value estimate based on relatively few samples, while state s_{t+1} has a value estimate based on relatively many samples. In such a situation, the second term in β_t boosts the learning rate so that V^{t+1}_s moves more aggressively towards the presumably more accurate r_t + γ V^t_{s_{t+1}}. In the opposite situation, when s_{t+1} is a less visited state, the reverse occurs and the learning rate is reduced in order to maintain the existing value of V_s.

3 A simple Markov process

For our first test we consider a simple Markov process with 51 states. In each step the state number is either incremented or decremented by one with equal probability, unless the system is in state 0 or 50, in which case it always transitions to state 25 in the following step. When the state transitions from 0 to 25 a reward of 1.0 is generated, and for a transition from 50 to 25 a reward of −1.0 is generated. All other transitions have a reward of 0. We set the discount value γ = 0.99 and then computed the true discounted value of each state by running a brute force Monte Carlo simulation.

We ran our algorithm 10 times on the above Markov chain and computed the root mean squared error in the value estimate across the states at each time step, averaged over the runs.
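The chain just described is easy to reproduce. A minimal Python sketch of the environment (the brute force Monte Carlo evaluation and the exact run lengths used in the paper are omitted):

```python
import random

def chain_step(s, rng):
    """One transition of the 51-state chain; returns (reward, next_state)."""
    if s == 0:
        return 1.0, 25    # bottom end: reward +1, reset to the middle
    if s == 50:
        return -1.0, 25   # top end: reward -1, reset to the middle
    return 0.0, s + rng.choice((-1, 1))

rng = random.Random(0)
s, rewards = 25, []
for _ in range(1000):
    r, s = chain_step(s, rng)
    rewards.append(r)
```
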
The optimal value of λ for HL(λ) was 1.0, which was to be expected given that the environment is stationary and thus discounting old experience is not helpful.

[Figure 1: 51 state Markov process averaged over 10 runs; RMSE vs. time for HL(1.0), TD(0.9) with α = 0.1, and TD(0.9) with α = 0.2. The parameter a in the legend is the learning rate α.]

[Figure 2: 51 state Markov process averaged over 300 runs; RMSE vs. time for HL(1.0), TD(0.9) with α = 8.0/√t, and TD(0.9) with α = 2.0/∛t.]

For TD(λ) we tried various different learning rates and values of λ. We could find no settings where TD(λ) was competitive with HL(λ). If the learning rate α was set too high, the system would briefly learn as fast as HL(λ) before becoming stuck. With a lower learning rate the final performance was improved, but the initial performance was then much worse than HL(λ). The results of these tests appear in Figure 1.

Similar tests were performed with larger and smaller Markov chains, and with different values of γ. HL(λ) was consistently superior to TD(λ) across these tests. One wonders whether this may be due to the fact that the implicit learning rate that HL(λ) uses is not fixed. To test this we explored the performance of a number of different learning rate functions on the 51 state Markov chain described above. We found that functions of the form κ/t always performed poorly; however, good performance was possible by setting κ correctly for functions of the form κ/√t and κ/∛t. As the results were much closer, we averaged over 300 runs. These results appear in Figure 2.

With a variable learning rate TD(λ) performs much better, but we were still unable to find an equation that reduced the learning rate in such a way that TD(λ) would outperform HL(λ). This is evidence that HL(λ) adapts the learning rate effectively without the need for manual tuning.

4 Random Markov process

To test on a Markov process with a more complex transition structure, we created a random 50 state Markov process. We did this by creating a 50 by 50 transition matrix where each element was set to 0 with probability 0.9, and to a uniformly random number in the interval [0, 1] otherwise. We then scaled each row to sum to 1, and interpreted the ith row as the probability distribution over which state follows state i. To compute the reward associated with each transition we created a random matrix in the same way, but without normalising. We set γ = 0.9 and then ran a brute force Monte Carlo simulation to compute the true discounted value of each state.

The λ parameter for HL(λ) was simply set to 1.0, as the environment is stationary. For TD we experimented with a range of parameter settings and learning rate decrease functions. We found that a fixed learning rate of α = 0.2 and a decreasing rate of 1.5/∛t performed reasonably well, but never as well as HL(λ). The results, averaged over 10 runs, are shown in Figure 3. Although the structure of this Markov process is quite different to that used in the previous experiment, the results are again similar: HL(λ) performs as well as or better than TD(λ) from the beginning to the end of the run.
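The construction of the random chain described in this section can be sketched as follows (our own Python; the seed and the guard against an all-zero row are illustrative assumptions, not details from the paper):

```python
import random

def random_markov(n=50, p_zero=0.9, seed=0):
    """Random n-state transition matrix: each entry is 0 with probability
    p_zero, uniform in [0, 1] otherwise; rows rescaled to sum to 1."""
    rng = random.Random(seed)
    P = [[0.0 if rng.random() < p_zero else rng.random() for _ in range(n)]
         for _ in range(n)]
    for row in P:
        if sum(row) == 0.0:          # guard: keep every row a distribution
            row[rng.randrange(n)] = 1.0
        total = sum(row)
        for j in range(n):
            row[j] /= total
    return P

P = random_markov()
```
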
Furthermore, stability in the error towards the end of the run is better with HL(λ), and no manual learning rate tuning was required for these performance gains.

[Figure 3: Random 50 state Markov process; RMSE vs. time for HL(1.0), TD(0.9) with α = 0.2, and TD(0.9) with α = 1.5/∛t. The parameter a in the legend is the learning rate α.]

[Figure 4: 21 state non-stationary Markov process; RMSE vs. time for HL(0.9995), TD(0.8) with α = 0.05, and TD(0.9) with α = 0.05.]

5 Non-stationary Markov process

The λ parameter in HL(λ), introduced in Equation (4), reduces the importance of old observations when computing the state value estimates. When the environment is stationary this is not useful, and so we can set λ = 1.0; in a non-stationary environment, however, we need to reduce this value so that the state values adapt properly to changes in the environment. The more rapidly the environment is changing, the lower we need to make λ in order to forget old observations more rapidly.

To test HL(λ) in such a setting, we used the Markov chain from Section 3, but reduced its size to 21 states to speed up convergence. We used this Markov chain for the first 5,000 time steps. At that point, we changed the reward for transitioning from the last state to the middle state from −1.0 to 0.5. At time 10,000 we switched back to the original Markov chain, and so on, alternating between the two models of the environment every 5,000 steps. At each switch, we also changed the target state values that the algorithm was trying to estimate to match the current configuration of the environment. For this experiment we set γ = 0.9.

As expected, the optimal value of λ for HL(λ) fell from 1 down to about 0.9995.
This is about what we would expect given that each phase is 5,000 steps long. For TD(λ) the optimal value of λ was around 0.8 and the optimal learning rate was around 0.05. As we would expect, for both algorithms, pushing λ above its optimal value caused poor performance in the periods following each switch in the environment (these bad parameter settings are not shown in the results). On the other hand, setting λ too low produced initially fast adaptation to each environment switch, but poor performance after that until the next environment change. To get accurate statistics we averaged over 200 runs. The results of these tests appear in Figure 4.

For some reason HL(0.9995) learns faster than TD(0.8) in the first half of the first cycle, but only equally fast at the start of each following cycle. We are not sure why this happens. We could improve the initial speed at which HL(λ) learnt in the last three cycles by reducing λ, but that comes at a performance cost in terms of the lowest mean squared error attained at the end of each cycle. In any case, in this non-stationary situation HL(λ) again performed well.

6 Windy Gridworld

Reinforcement learning algorithms such as Watkins' Q(λ) [8] and Sarsa(λ) [5, 4] are based on temporal difference updates. This suggests that new reinforcement learning algorithms based on HL(λ) should be possible.

For our first experiment we took the standard Sarsa(λ) algorithm and modified it in the obvious way to use an HL temporal difference update. In the presentation of this algorithm we have changed notation slightly to make it more consistent with that typical in reinforcement learning.
Specifically, we have dropped the t superscript, as it is implicit in the algorithm specification, and have defined Q(s, a) := V_{(s,a)}, E(s, a) := E_{(s,a)} and N(s, a) := N_{(s,a)}. Our new reinforcement learning algorithm, which we call HLS(λ), is given in Algorithm 1. Essentially the only changes to the standard Sarsa(λ) algorithm have been to add code to compute the visit counter N(s, a), add a loop to compute the β values, and replace α with β in the temporal difference update.

Algorithm 1 HLS(λ)
  Initialise Q(s, a) = 0, N(s, a) = 1 and E(s, a) = 0 for all s, a
  Initialise s and a
  repeat
    Take action a, observe r, s′
    Choose a′ by using ε-greedy selection on Q(s′, ·)
    Δ ← r + γ Q(s′, a′) − Q(s, a)
    E(s, a) ← E(s, a) + 1
    N(s, a) ← N(s, a) + 1
    for all s, a do
      β((s, a), (s′, a′)) ← 1 / ( N(s′, a′) − γ E(s′, a′) ) · N(s′, a′) / N(s, a)
      Q(s, a) ← Q(s, a) + β((s, a), (s′, a′)) E(s, a) Δ
    end for
    for all s, a do
      E(s, a) ← γλ E(s, a)
      N(s, a) ← λ N(s, a)
    end for
    s ← s′; a ← a′
  until end of run

To test HLS(λ) against standard Sarsa(λ) we used the Windy Gridworld environment described on page 146 of [6]. This world is a grid of 7 by 10 squares that the agent can move through by going either up, down, left or right. If the agent attempts to move off the grid it simply stays where it is. The agent starts in the 4th row of the 1st column and receives a reward of 1 when it finds its way to the 4th row of the 8th column. To make things more difficult, there is a "wind" blowing the agent up 1 row in columns 4, 5, 6, and 9, and a strong wind of 2 rows in columns 7 and 8. This is illustrated in Figure 5.
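The environment itself is small enough to sketch directly. The following is our own Python rendering of the task as just described, including the continuing-task jump from the goal back to the start used in this paper's experiments; the zero-based row/column indexing is our choice:

```python
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # upward push per column (0-indexed)
START, GOAL = (3, 0), (3, 7)            # 4th row; 1st and 8th columns

def wg_step(pos, action):
    """One move in the 7x10 Windy Gridworld; returns (reward, next_pos)."""
    row, col = pos
    dr, dc = {'up': (-1, 0), 'down': (1, 0),
              'left': (0, -1), 'right': (0, 1)}[action]
    row = min(max(row + dr - WIND[col], 0), 6)   # wind from departure column
    col = min(max(col + dc, 0), 9)
    if (row, col) == GOAL:
        return 1.0, START                        # reward 1, jump back to start
    return 0.0, (row, col)
```
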
Unlike the original version, we have set up this problem as a continuing discounted task with an automatic transition from the goal state back to the start state.

We set γ = 0.99 and in each run computed the empirical future discounted reward at each point in time. As this value oscillated, we also smoothed it with a moving average of window length 50. Each run lasted for 50,000 time steps, as this allowed us to see at what level each learning algorithm topped out. These results appear in Figure 6 and were averaged over 500 runs to get accurate statistics.

Despite putting considerable effort into tuning the parameters of Sarsa(λ), we were unable to achieve a final future discounted reward above 5.0. The settings shown on the graph represent the best final value we could achieve. In comparison, HLS(λ) easily beat this result at the end of the run, while being slightly slower than Sarsa(λ) at the start. By setting λ = 0.99 we were able to match the performance of Sarsa(λ) at the start of the run, but the performance at the end of the run was then only slightly better than Sarsa(λ). This combination of superior performance and fewer parameters to tune suggests that the benefits of HL(λ) carry over into the reinforcement learning setting.

Another popular reinforcement learning algorithm is Watkins' Q(λ). As with Sarsa(λ) above, we simply inserted the HL(λ) temporal difference update into the usual Q(λ) algorithm in the obvious way. We call this new algorithm HLQ(λ) (not shown). The test environment was exactly the same as the one we used with Sarsa(λ) above.

The results this time were more competitive (and are also not shown). Nevertheless, despite spending a considerable amount of time fine tuning the parameters of Q(λ), we were unable to beat HLQ(λ).
As the performance advantage was relatively modest, the main benefit of HLQ(λ) was that it achieved this level of performance without having to tune a learning rate.

[Figure 5: The Windy Gridworld. S marks the start state and G the goal state, at which the agent jumps back to S with a reward of 1.]

[Figure 6: Sarsa(λ) vs. HLS(λ) in the Windy Gridworld; future discounted reward vs. time for HLS(0.995) with ε = 0.003 and Sarsa(0.5) with α = 0.4, ε = 0.005. Performance averaged over 500 runs. On the graph, e represents the exploration parameter ε, and a the learning rate α.]

7 Conclusions

We have derived a new equation for setting the learning rate in temporal difference learning with eligibility traces. The equation replaces the free learning rate parameter α, which is normally tuned experimentally by hand. In every setting tested, be it stationary Markov chains, non-stationary Markov chains or reinforcement learning, our new method produced superior results.

To further our theoretical understanding, the next step would be to try to prove that the method converges to correct estimates. This can be done for TD(λ) under certain assumptions on how the learning rate decreases over time; hopefully something similar can be proven for our new method. In terms of experimental results, it would be interesting to try different types of reinforcement learning problems, and to identify more clearly where the ability to set the learning rate differently for different state transition pairs helps performance. It would also be good to generalise the result to episodic tasks.
Finally, just as we have successfully merged HL(λ) with Sarsa(λ) and Watkins' Q(λ), we would also like to see whether the same can be done with Peng's Q(λ) [3], and perhaps other reinforcement learning algorithms.

Acknowledgements

This research was funded by the Swiss NSF grant 200020-107616.

References

[1] A. P. George and W. B. Powell. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 65(1):167-198, 2006.
[2] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.
[3] J. Peng and R. J. Williams. Incremental multi-step Q-learning. Machine Learning, 22:283-290, 1996.
[4] G. A. Rummery. Problem solving with reinforcement learning. PhD thesis, Cambridge University, 1995.
[5] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Engineering Department, Cambridge University, 1994.
[6] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[7] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
[8] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989.
[9] I. H. Witten. An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34:286-295, 1977.