{"title": "Multi-Step Dyna Planning for Policy Evaluation and Control", "book": "Advances in Neural Information Processing Systems", "page_first": 2187, "page_last": 2195, "abstract": "We extend Dyna planning architecture for policy evaluation and control in two significant aspects. First, we introduce a multi-step Dyna planning that projects the simulated state/feature many steps into the future. Our multi-step Dyna is based on a multi-step model, which we call the {\\em $\\lambda$-model}. The $\\lambda$-model interpolates between the one-step model and an infinite-step model, and can be learned efficiently online. Second, we use for Dyna control a dynamic multi-step model that is able to predict the results of a sequence of greedy actions and track the optimal policy in the long run. Experimental results show that Dyna using the multi-step model evaluates a policy faster than using single-step models; Dyna control algorithms using the dynamic tracking model are much faster than model-free algorithms; further, multi-step Dyna control algorithms enable the policy and value function to converge much faster to their optima than single-step Dyna algorithms.", "full_text": "Multi-step Linear Dyna-style Planning\n\nHengshuai Yao\n\nShalabh Bhatnagar\n\nDepartment of Computing Science\n\nDepartment of Computer Science\n\nUniversity of Alberta\n\nEdmonton, AB, Canada T6G2E8\n\nand Automation\n\nIndian Institute of Science\nBangalore, India 560012\n\nDongcui Diao\n\nSchool of Economics and Management\n\nSouth China Normal University\n\nGuangzhou, China 518055\n\nAbstract\n\nIn this paper we introduce a multi-step linear Dyna-style planning algorithm. The\nkey element of the multi-step linear Dyna is a multi-step linear model that en-\nables multi-step projection of a sampled feature and multi-step planning based on\nthe simulated multi-step transition experience. We propose two multi-step linear\nmodels. 
The first iterates the one-step linear model, but is generally computationally complex. The second interpolates between the one-step model and the infinite-step model (which turns out to be the LSTD solution), and can be learned efficiently online. Policy evaluation on Boyan Chain shows that multi-step linear Dyna learns a policy faster than single-step linear Dyna, and generally learns faster as the number of projection steps increases. Results on Mountain-car show that multi-step linear Dyna leads to much better online performance than single-step linear Dyna and model-free algorithms; however, the performance of multi-step linear Dyna does not always improve as the number of projection steps increases. Our results also suggest that previous attempts at extending LSTD to online control were unsuccessful because LSTD looks infinitely many steps into the future, and thus suffers from model errors in non-stationary (control) environments.

1 Introduction

Linear Dyna-style planning extends Dyna to linear function approximation (Sutton, Szepesvári, Geramifard & Bowling, 2008), and can be used in large-scale applications. However, existing Dyna and linear Dyna-style planning algorithms are all single-step, because they only simulate sampled features one step ahead. This is often insufficient, because a single-step simulation does not exploit all possible future outcomes. We extend the linear Dyna architecture by using a multi-step linear model of the world, yielding what we call multi-step linear Dyna-style planning. Multi-step linear Dyna-style planning is more advantageous than existing linear Dyna, because a multi-step model of the world can project a feature multiple steps into the future and thus provide more steps of simulated results from that feature.

For policy evaluation we introduce two multi-step linear models.
The first is generated by iterating the one-step linear model, but is computationally complex when the number of features is large. The second, which we call the λ-model, interpolates between the one-step linear model and an infinite-step linear model of the world, and is computationally efficient to compute online. Our multi-step linear Dyna-style planning for policy evaluation, Dyna(k), uses the multi-step linear models to generate a k-step-ahead prediction of the sampled feature, and applies generalized TD (temporal difference, e.g., see (Sutton & Barto, 1998)) learning to the imaginary multi-step transition experience. When k is equal to 1, we recover the existing linear Dyna-style algorithm; when k goes to infinity, we actually use the LSTD (Bradtke & Barto, 1996; Boyan, 1999) solution for planning.

For the problem of control, related work includes least-squares policy iteration (LSPI) (Lagoudakis & Parr, 2001; Lagoudakis & Parr, 2003; Li, Littman & Mansley, 2009), and linear Dyna-style planning for control. LSPI is an offline algorithm that learns a greedy policy out of a data set of experience, through a number of iterations, each of which sweeps the data set and alternates between LSTD and policy improvement. Sutton et al. (2008) explored the use of linear function approximation with Dyna for control, which does planning using a set of linear action models built from state to state. In this paper, we first build a one-step model from state-action pair to state-action pair by tracking the greedy policy. Using this tracking model for planning is in fact another way of doing single-step linear Dyna-style planning. In a similar manner to policy evaluation, we also have two multi-step models for control. We build the iterated multi-step model by iterating the one-step tracking model.
Also, we build a λ-model for control by interpolating the one-step tracking model and the infinite-step model (also built through tracking). As the infinite-step model coincides with the LSTD solution, we in effect propose an online LSTD control algorithm.

Policy evaluation on Boyan Chain shows that multi-step linear Dyna learns a policy faster than single-step linear Dyna. Results on the Mountain-car experiment show that multi-step linear Dyna can find the optimal policy faster than single-step linear Dyna; however, the performance of multi-step linear Dyna does not always improve as the number of projection steps increases. In fact, LSTD control and the infinite-step linear Dyna for control are both unstable, and some intermediate value of k makes the k-step linear Dyna for control perform the best.

2 Background

Given a Markov decision process (MDP) with a state space S = {1, 2, . . . , N}, the problem of policy evaluation is to predict the long-term reward of a policy π for every state s ∈ S:

V^π(s) = Σ_{t=0}^{∞} γ^t r_t,  s_0 = s,  0 < γ < 1,

where r_t is the reward received by the agent at time t. Given n (n ≤ N) feature functions ϕ_j : S → R, j = 1, . . . , n, the feature of state i is φ(i) = [ϕ_1(i), ϕ_2(i), . . . , ϕ_n(i)]^T. Now V^π can be approximated using V̂^π = Φθ, where θ is the weight vector, and Φ is the feature matrix whose entries are Φ_{i,j} = ϕ_j(i), i = 1, . . . , N; j = 1, . . . , n. At time t, linear TD(0) updates the weights as

θ_{t+1} = θ_t + α_t δ_t φ_t,  δ_t = r_t + γ θ_t^T φ_{t+1} − θ_t^T φ_t,

where α_t is a positive step-size and φ_t corresponds to φ(s_t).

Most earlier work on Dyna uses a lookup-table representation of states (Sutton, 1990; Sutton & Barto, 1998).
Modern Dyna gains its advantage from the use of linear function approximation, which is called linear Dyna-style planning (Sutton et al., 2008). We denote the state transition probability matrix of policy π by P^π, whose (i, j)th component is P^π_{i,j} = Pr{s_{t+1} = j | s_t = i}; and denote the expected reward vector of policy π by R^π, whose ith component is the expected reward of leaving state i in one step. Linear Dyna tries to estimate a compressed model of policy π:

(F^π)^T = (Φ^T D^π Φ)^{−1} · Φ^T D^π P^π Φ;  f^π = (Φ^T D^π Φ)^{−1} · Φ^T D^π R^π,

where D^π is the N × N matrix whose diagonal entries correspond to the steady-state distribution of states under policy π. F^π and f^π constitute the world model of linear Dyna for policy evaluation, and are estimated online through gradient descent:

F^π_{t+1} = F^π_t + β_t(φ_{t+1} − F^π_t φ_t)φ_t^T;  f^π_{t+1} = f^π_t + β_t(r_t − φ_t^T f^π_t)φ_t,  (1)

respectively, where the features and reward are all from real-world experience and β_t is the modeling step-size.

Dyna repeats several planning steps, in each of which it samples a feature, projects it using the world model, and plans using linear TD(0) based on the imaginary experience. For policy evaluation, the fixed point of linear Dyna is the same as that of linear TD(0) under some assumptions (Tsitsiklis & Van Roy, 1997; Sutton et al., 2008), and satisfies

A^π θ* + b^π = 0 :  A^π = Φ^T D^π(γP^π − I)Φ;  b^π = Φ^T D^π R^π,

where I is the N × N identity matrix.

3 The Multi-step Linear Model

In the lookup-table representation, (P^π)^T and R^π constitute the one-step world model. The k-step transition model of the world is obtained by iterating (P^π)^T, k times with discount (Sutton, 1995):

P(k) = (γ(P^π)^T)^k,  ∀k = 1, 2, . . .

At the same time we accumulate the rewards generated in the process of this iteration:

R(k) = Σ_{j=0}^{k−1} (γP^π)^j R^π,  ∀k = 1, 2, . . . ,

where R(k) is called the k-step reward model. P(k) and R(k) predict a feature k steps into the future. In particular, P(k)φ is the feature of the expected state after k steps from φ, and (R(k))^T φ is the expected accumulated reward over k steps from φ. Notice that

V^π = R(k) + (P(k))^T V^π,  ∀k = 1, 2, . . . ,  (2)

which is a generalization of the Bellman equation, V^π = R^π + γP^π V^π.

3.1 The Iterated Multi-step Linear Model

Under linear function approximation, F^π and f^π constitute the one-step linear model. Similar to the lookup-table representation, we can iterate F^π, k times, and accumulate the approximated rewards along the way:

F(k) = (γF^π)^k;  f(k) = Σ_{j=0}^{k−1} (γ(F^π)^T)^j f^π.

We call (F(k), f(k)) the iterated multi-step linear model. By this definition, we extend (2) to the k-step linear Bellman equation:

V̂^π = Φθ* = Φf(k) + Φ(F(k))^T θ*,  ∀k = 1, 2, . . . ,  (3)

where θ* is the linear TD(0) solution.

3.2 The λ-model

The quantities F(k) and f(k) require powers of F^π. One can first estimate F^π and f^π, and then estimate F(k) and f(k) using powers of the estimated F^π. However, real-life tasks require a lot of features.
Generally (F^π)^k requires O((k − 1)n^3) computation, which is too expensive when the number of features (n) is large.

Rather than using F(k) and f(k), we would like to explore some other multi-step model that is cheap to compute but is still meaningful in some sense. First let us see how F(k) and f(k) are used if they can be computed. Given an imaginary feature φ̃_τ, we look k steps ahead to see our future feature by applying F(k):

φ̃_τ(k) = F(k) φ̃_τ.

As k grows, F(k) diminishes and thus φ̃_τ(k) converges to 0 (this is because γF^π has a spectral radius smaller than one, cf. Lemma 9.2.2 of (Bertsekas, Borkar & Nedich, 2004)). This means that the more steps we look into the future from a given feature, the more ambiguous is our resulting feature. It suggests that we can use a decayed one-step linear model to approximate the effect of looking multiple steps into the future:

L(k) = (λγ)^{k−1} γF^π,

parameterized by a factor λ ∈ (0, 1]. To guarantee that the optimality (3) still holds, we define

l(k) = (I − (L(k))^T)(I − γ(F^π)^T)^{−1} f^π.

We call (L(k), l(k)) the λ-model. When k = 1, we have L(1) = F(1) = γF^π and l(1) = f(1) = f^π, recovering the one-step model used by existing linear Dyna. Notice that L(k) diminishes as k grows, which is consistent with the fact that F(k) also diminishes as k grows. Finally, the infinite-step model reduces to a single vector, l(∞) = f(∞) = θ*. Intermediate values of k interpolate between the single-step model and the infinite-step model.

For intermediate k, computation of L(k) has the same complexity as the estimation of F^π. Interestingly, every l(k) can be obtained by shifting from l(∞) by an amount that shrinks l(∞) itself:

l(k) = (I − (L(k))^T)(I − γ(F^π)^T)^{−1} f^π = l(∞) − (L(k))^T l(∞).  (4)

The case of k = 1 is interesting. The linear Dyna algorithm (Sutton et al., 2008) takes advantage of the fact that l(1) = f^π and estimates it through gradient descent. On the other hand, in our Dyna algorithm, we use (4) and estimate all l(k) from the estimate of l(∞), which is generally no longer a gradient-descent estimate.

4 Multi-step Linear Dyna-style Planning for Policy Evaluation

The architecture of multi-step linear Dyna-style planning, Dyna(k), is shown in Algorithm 1. Generally any valid multi-step model can be used in the architecture. For example, in the algorithm we can take M(k) = F(k) and m(k) = f(k), giving us a linear Dyna architecture using the iterated multi-step linear model, which we call Dyna(k)-iterate.

In the following we present the family of Dyna(k) planning algorithms that use the λ-model. We first develop a planning algorithm for the infinite-step model, and based on it we then present Dyna(k) planning using the λ-model for any finite k.

4.1 Dyna(∞): Planning using the Infinite-step Model

The infinite-step model is preferable in computation because F(∞) diminishes and the model reduces to f(∞).
It turns out that f(∞) can be further simplified to allow an efficient online estimation:

f(∞) = (I − γ(F^π)^T)^{−1} f^π = (Φ^T D^π Φ − γΦ^T D^π P^π Φ)^{−1} · Φ^T D^π Φ f^π = −(A^π)^{−1} b^π.

We can accumulate A^π and b^π online like LSTD (Bradtke & Barto, 1996; Boyan, 1999; Xu et al., 2002) and solve for f(∞) by matrix inversion methods or recursive least-squares methods.

As with traditional Dyna, we first sample a feature φ̃ from some distribution µ. We then apply the infinite-step model to get the expected future feature and the expected future rewards:

φ̃(∞) = F(∞) φ̃;  r̃(∞) = (f(∞))^T φ̃.

Next, a generalized linear TD(0) update is applied to this simulated experience:

θ̃ := θ̃ + α(r̃(∞) + θ̃^T φ̃(∞) − θ̃^T φ̃) φ̃.

Because φ̃(∞) = 0, this simplifies into

θ̃ := θ̃ + α(r̃(∞) − θ̃^T φ̃) φ̃.

We call this algorithm Dyna(∞), which actually uses the LSTD solution for planning. (Similarly, f(k) can be obtained by shifting from f(∞) by an amount that shrinks f(∞) itself.)

Algorithm 1 Dyna(k) algorithm for evaluating policy π (using any valid k-step model).
  Initialize θ_0 and some model
  Select an initial state
  for each time step do
    Take an action a according to π, observing r_t and φ_{t+1}
    θ_{t+1} = θ_t + α_t(r_t + γφ_{t+1}^T θ_t − φ_t^T θ_t)φ_t    /* linear TD(0) */
    Update M(k) and m(k)
    Set θ̃_0 = θ_{t+1}
    repeat τ = 1 to p    /* Planning */
      Sample φ̃_τ ∼ µ(·)
      φ̃_τ(k) = M(k) φ̃_τ
      r̃_τ(k) = (m(k))^T φ̃_τ
      θ̃_{τ+1} := θ̃_τ + α_τ(r̃_τ(k) + θ̃_τ^T φ̃_τ(k) − θ̃_τ^T φ̃_τ) φ̃_τ    /* Generalized k-step linear TD(0) learning */
    Set θ_{t+1} = θ̃_{τ+1}
  end for

4.2 Planning using the λ-model

The k-step λ-model is efficient to estimate, and can be directly derived from the single-step and infinite-step models:

L(k) = (λγ)^{k−1} γF^π_{t+1};  l(k) = f(∞) − (L(k))^T f(∞),

respectively, where the infinite-step model is estimated by f(∞) = −(A^π_{t+1})^{−1} b^π_{t+1}. Given an imaginary feature φ̃, we look k steps ahead to see the future features and rewards:

φ̃(k) = L(k) φ̃;  r̃(k) = (l(k))^T φ̃.

Thus we obtain an imaginary k-step transition experience φ̃ → (φ̃(k), r̃(k)), on which we apply a k-step version of linear TD(0):

θ̃_{τ+1} = θ̃_τ + α(r̃(k) + θ̃_τ^T φ̃(k) − θ̃_τ^T φ̃) φ̃.

We call this algorithm the Dyna(k)-lambda planning algorithm. When k = 1, we obtain another single-step Dyna, Dyna(1). Notice that Dyna(1) uses f(∞) while the linear Dyna uses f^π. When k → ∞, we obtain the Dyna(∞) algorithm.

5 Planning for Control

Planning for control is more difficult than planning for policy evaluation because in control the policy changes from time step to time step. Linear Dyna uses a separate model for each action, and these action models are from state to state (Sutton et al., 2008). Our model for control is different in that it is from state-action pair to state-action pair.
However, rather than building a model for all state-action pairs, we build only one state-action model that tracks the sequence of greedy actions. Using this greedy-tracking model is another way of doing linear Dyna-style planning. In the following we first build the single-step greedy-tracking model and the infinite-step greedy-tracking model, and based on these tracking models we build the iterated model and the λ-model.

Our extension of linear Dyna to control contains a TD control step (we use Q-learning), and we call it the linear Dyna-Q architecture. In the Q-learning step, the next feature is already implicitly selected. Recall that Q-learning selects the largest next Q-value as the target for TD learning, which is max_{a′} Q̂_{t+1}(s_{t+1}, a′) = max_{a′} φ(s_{t+1}, a′)^T θ_t. Equivalently, the greedy next state-action feature

⃗φ_{t+1} = arg max_{φ′ = φ(s_{t+1}, ·)} φ′^T θ_t

is selected by Q-learning.
We build a single-step projection matrix between state-action pairs, F, by moving its projection of the current feature towards the greedy next state-action feature (tracking):

F_{t+1} = F_t + β_t(⃗φ_{t+1} − F_t φ_t)φ_t^T.  (5)

Algorithm 2 Dyna-Q(k)-lambda: k-step linear Dyna-Q algorithm for control (using the λ-model).
  Initialize F_0, A_0, b_0 and θ_0
  Select an initial state
  for each time step do
    Take action a at s_t (using ε-greedy), observing r_t and s_{t+1}
    Choose a′ that leads to the largest Q̂(s_{t+1}, a′)
    Set φ = φ(s_t, a), ⃗φ = φ(s_{t+1}, a′)
    θ_{t+1} = θ_t + α_t(r_t + γ⃗φ^T θ_t − φ^T θ_t)φ    /* Q-learning */
    A_{t+1} = A_t + φ(γ⃗φ − φ)^T;  b_{t+1} = b_t + φ r_t
    f(∞) = −(A_{t+1})^{−1} b_{t+1}    /* Using matrix inversion or recursive least-squares */
    F_{t+1} = F_t + α_t(⃗φ − F_t φ)φ^T
    L(k) = (λγ)^{k−1} γF_{t+1}
    l(k) = f(∞) − (L(k))^T f(∞)
    Set θ̃_0 = θ_{t+1}
    repeat τ = 1 to p    /* Planning */
      Sample φ̃_τ ∼ µ
      φ̃_τ(k) = L(k) φ̃_τ
      r̃_τ(k) = (l(k))^T φ̃_τ
      θ̃_{τ+1} := θ̃_τ + α_τ(r̃_τ(k) + θ̃_τ^T φ̃_τ(k) − θ̃_τ^T φ̃_τ) φ̃_τ
    Set θ_{t+1} = θ̃_{τ+1}
  end for

Estimation of the single-step reward model, f, is the same as in policy evaluation. In a similar manner, in the infinite-step model, matrix A is updated using the greedy next feature, while vector b is updated in the same way as in LSTD. Given A and b, we can solve for f(∞). Once the one-step model and the infinite-step model are available, we interpolate them and compute the λ-model in a similar manner to policy evaluation.
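One time step of the greedy-tracking updates, i.e., the projection update (5) together with the A and b accumulators used for the infinite-step model, can be sketched as follows; the feature dimension, step-sizes, and transition values are made-up for illustration:

```python
import numpy as np

n_actions, dim = 3, 6          # state-action features of size dim (hypothetical)
rng = np.random.default_rng(1)
gamma, beta = 0.9, 0.1

theta = rng.normal(size=dim)
F = np.zeros((dim, dim))       # single-step projection between state-action pairs
A = -np.eye(dim)               # perturbed initialization, as in the experiments
b = np.zeros(dim)

def greedy_feature(next_features, theta):
    """Pick the greedy next state-action feature: argmax over phi' of phi'^T theta."""
    return next_features[np.argmax(next_features @ theta)]

# A hypothetical transition: current state-action feature phi, reward r, and
# the candidate next state-action features (one per action).
phi = rng.normal(size=dim)
r = 1.0
next_features = rng.normal(size=(n_actions, dim))
phi_greedy = greedy_feature(next_features, theta)

# Track the greedy policy: move F phi towards the greedy next feature, eq. (5).
F += beta * np.outer(phi_greedy - F @ phi, phi)
# Accumulate the infinite-step model statistics, LSTD-style.
A += np.outer(phi, gamma * phi_greedy - phi)
b += phi * r
f_inf = -np.linalg.solve(A, b)  # f(inf), by matrix inversion
```

In Algorithm 2 these updates run once per real time step, before the planning loop interpolates F and f(∞) into the λ-model.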
The complete multi-step Dyna-Q control algorithm using the λ-model is shown in Algorithm 2. We note that f(∞) can be directly used for control, giving an online LSTD control algorithm.

We can also extend the iterated multi-step model and Dyna(k)-iterate to control. Given the single-step greedy-tracking model, we can iterate it and get the iterated multi-step linear model in the same way as in policy evaluation. The linear Dyna for control using the iterated greedy-tracking model (which we call Dyna-Q(k)-iterate) is straightforward and thus not shown.

6 Experimental Results

6.1 Boyan Chain Example

The problem we consider is exactly the same as that considered by Boyan (1999). The root mean square error (RMSE) of the value function is used as the evaluation criterion. Previously it was shown that linear Dyna can learn a policy faster than model-free TD methods in the beginning episodes (Sutton et al., 2008). However, after some episodes, their implementation of linear Dyna became poorer than TD. A possible reason for this is that the step-sizes for learning, modeling and planning were all set to the same value. Also, their step-size diminishes according to 1/(traj#)^{1.1}, which does not satisfy the standard step-size rule required for stochastic approximation. In our linear Dyna algorithms, we used different step-sizes for learning, modeling and planning.

(1) Learning step-size. We used the same step-size rule for TD as Boyan (1999), where α = 0.1(1 + 100)/(traj# + 100) was found to be the best in the class of step-sizes examined, and we also used it for TD in the learning sub-procedure of all linear Dyna algorithms. (2) Modeling step-size. For Dyna(k)-lambda, we used β_T = 0.5(1 + 10)/(10 + T) for the estimation of F^π, where T is the number of state visits across episodes. For linear Dyna, the estimation of F^π and f^π also used the same β_T. (3) Planning step-size.
In our experiments all linear Dyna algorithms simply used α_τ = 0.1.

Figure 1: Results on Boyan Chain. Left: comparison of RMSE of Dyna(k)-iterate with LSTD. Right: comparison of RMSE of Dyna(k)-lambda with TD and LSTD.

The weights of the various learning algorithms, f^π for the linear Dyna, and b^π for Dyna(k), were all initialized to zero. No eligibility trace was used for any algorithm. In the planning step, all Dyna algorithms sampled a unit basis vector whose nonzero component was in a uniformly random location. In the following we report the results of planning only once per time step. All RMSEs were averaged over 30 (identical) sets of trajectories.

Figure 1 (left) shows the performance of Dyna(k)-iterate and LSTD, and Figure 1 (right) shows the performance of Dyna(k)-lambda, LSTD and TD. All linear Dyna algorithms were found to be significantly and consistently faster than TD. Furthermore, multi-step linear Dyna algorithms were much faster than single-step linear Dyna algorithms. Matrix A of LSTD and Dyna(k)-lambda needs perturbation in initialization, which has a great impact on the performance of the two algorithms. For LSTD, we tried initializing A^π_0 to −10I, −I, and −0.1I, and show the effects in Figure 1 (left), in which A^π_0 = −0.1I was the best for LSTD. Similar to LSTD, Dyna(k)-lambda is also sensitive to A^π_0. Linear Dyna and Dyna(k)-iterate do not use A^π and thus do not have to tune A^π_0. F^π was initialized to 0 for Dyna(k) (k < ∞) and linear Dyna. In Figure 1 (right) LSTD and Dyna(k)-lambda were compared under the same setting (Dyna(k)-lambda also used A_0 = −0.1I). Dyna(k)-lambda used λ = 0.9.

6.2 Mountain-car

We used the same Mountain-car environment and tile coding as in the linear Dyna paper (Sutton et al., 2008). The state feature has a dimension of 10,000. The state-action feature is shifted from the state feature, and has a dimension of 30,000 because the car has three actions. Because the features and matrices are very large, we were not able to compute the iterated model, and hence we only present the results of Dyna-Q(k)-lambda here.

Experimental setting. (1) Step-sizes. The Q-learning step-size was 0.1, in both the standalone algorithm and the sub-procedure of Dyna-Q(k)-lambda. The planning step-size was also 0.1. The matrix F is much denser than A and leads to very slow online performance. To tackle this problem, we avoided computing F explicitly, and used a least-squares computation of the projection, given in the supplementary material. In this implementation, there is no modeling step-size. (2) Initialization. The parameters θ and b were both initialized to 0. A was initialized to −I. (3) Other settings. The λ value for Dyna-Q(k)-lambda was 0.9. We recorded the state-action pairs online and replayed the feature of a past state-action pair in planning. We also compared against linear Dyna-style planning for control (with state features) (Sutton et al., 2008), which has three sets of action models for this problem. In linear Dyna-style planning for control we replayed a state feature of a past time step, and projected it using the model of the action that was selected at that time step. No eligibility trace or exploration was used.
Results reported below were all averaged over 30 independent runs, each of which contains 20 episodes.

Figure 2: Results on Mountain-car: comparison of online return of Dyna-Q(k)-lambda, Q-learning and linear Dyna for control.

Results are shown in Figure 2. Linear Dyna-style planning algorithms were found to be significantly faster than Q-learning. Multi-step planning algorithms can be faster still than single-step planning algorithms. The results also show that planning too many steps into the future is harmful, e.g., Dyna-Q(20)-lambda and Dyna-Q(∞) gave poorer performance than Dyna-Q(5)-lambda and Dyna-Q(10)-lambda. This shows that some intermediate values of k trade off the model accuracy against the depth of looking ahead, and performed best. In fact, Dyna-Q(∞) and the LSTD control algorithm were both unstable, and typically failed once or twice in 30 runs. The intuition is that in control the policy changes from time step to time step and the model is highly non-stationary. By solving the model and looking infinitely many steps into the future, LSTD and Dyna-Q(∞) magnify the errors in the model.

7 Conclusion and Future Work

We have taken important steps towards extending linear Dyna-style planning to multi-step planning. Multi-step linear Dyna-style planning uses multi-step linear models to project a simulated feature multiple steps into the future. For control, we proposed a different way of doing linear Dyna-style planning that builds a model from state-action pair to state-action pair and tracks the greedy action selection.
Experimental results show that multi-step linear Dyna-style planning leads to better performance than existing single-step linear Dyna-style planning on the Boyan chain and Mountain-car problems. Our experimental results also show that linear Dyna-style planning achieves better performance by using different step-sizes for learning, modeling, and planning than by using a uniform step-size for the three sub-procedures. While this was not clear from previous work, our results demonstrate the advantages of linear Dyna over TD/Q-learning for both policy evaluation and control.

Our work also sheds light on why previous attempts at developing an independent online LSTD control algorithm were not successful (e.g., forgetting strategies (Sutton et al., 2008)). LSTD and Dyna-Q(∞) can become unstable because they magnify the model errors by looking infinitely many steps into the future. The current experiments do not include comparisons with any other LSTD control algorithm because we did not find an independent LSTD control algorithm in the literature. LSPI is usually offline, and its extension to online control has to deal with online exploration (Li et al., 2009). Some researchers have combined LSTD as the critic within the Actor-Critic framework (Xu et al., 2002; Peters & Schaal, 2008); however, LSTD there is still not an independent control algorithm.

Acknowledgements

The authors received much feedback from Dr. Rich Sutton and Dr. Csaba Szepesvári. We gratefully acknowledge their help in improving the paper in many aspects. We also thank Alborz Geramifard for sending us Matlab code for tile coding. This research was supported by iCORE, NSERC and the Alberta Ingenuity Fund.

References

Bertsekas, D. P., Borkar, V., & Nedich, A. (2004). Improved temporal difference methods with linear function approximation. Learning and Approximate Dynamic Programming (pp. 231-255). IEEE Press.

Boyan, J. A. (1999).
Least-squares temporal difference learning. ICML-16.

Bradtke, S., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33-57.

Li, L., Littman, M. L., & Mansley, C. R. (2009). Online exploration in least-squares policy iteration. AAMAS-8.

Peters, J., & Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71, 1180-1190.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. ICML-7.

Sutton, R. S. (1995). TD models: modeling the world at a mixture of time scales. ICML-12.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.

Sutton, R. S., Szepesvári, C., Geramifard, A., & Bowling, M. (2008). Dyna-style planning with linear function approximation and prioritized sweeping. UAI-24.

Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 674-690.

Xu, X., He, H., & Hu, D. (2002). Efficient reinforcement learning using recursive least-squares methods. Journal of Artificial Intelligence Research, 16, 259-292.
", "award": [], "sourceid": 254, "authors": [{"given_name": "Hengshuai", "family_name": "Yao", "institution": null}, {"given_name": "Shalabh", "family_name": "Bhatnagar", "institution": null}, {"given_name": "Dongcui", "family_name": "Diao", "institution": null}, {"given_name": "Richard", "family_name": "Sutton", "institution": null}, {"given_name": "Csaba", "family_name": "Szepesv\u00e1ri", "institution": null}]}