{"title": "Stable Dual Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1569, "page_last": 1576, "abstract": null, "full_text": "Stable Dual Dynamic Programming\n\nTao Wang* Daniel Lizotte Michael Bowling Dale Schuurmans\n\n{trysi,dlizotte,bowling,dale}@cs.ualberta.ca\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nAbstract\n\nRecently, we introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation.\n\n1 Introduction\n\nValue function representations are dominant in algorithms for dynamic programming (DP) and reinforcement learning (RL). However, linear programming (LP) methods clearly demonstrate that the value function is not a necessary concept for solving sequential decision making problems. In LP methods, value functions correspond only to the primal formulation of the problem; in the dual they are replaced by the notion of state (or state-action) visit distributions [1, 2, 3]. Despite the well known LP duality, dual representations have not been widely explored in DP and RL. Recently, we showed that it is entirely possible to solve DP and RL problems in the dual representation [4]. Unfortunately, [4] neither analyzed the convergence properties of the proposed ideas nor implemented them. In this paper, we investigate the convergence properties of these newly proposed dual solution techniques, and show how they can be scaled up by incorporating function approximation. The proof techniques we use to analyze convergence are simple, but lead to useful conclusions.
In particular, we find that the standard convergence results for value based approaches also apply to the dual case, even in the presence of function approximation and off-policy updating. The dual approach appears to hold an advantage over the standard primal view of DP/RL in one major sense: since the fundamental objects being represented are normalized probability distributions (i.e., they belong to a bounded simplex), dual updates cannot diverge. In particular, we find that dual updates converge (i.e., avoid oscillation) in the very circumstance where primal updates can and often do diverge: gradient-based off-policy updates with linear function approximation [5, 6].\n\n2 Preliminaries\n\nWe consider the problem of computing an optimal behavior strategy in a Markov decision process (MDP), defined by a set of actions A, a set of states S, an |S||A| × |S| transition matrix P, a reward vector r, and a discount factor γ, where we assume the goal is to maximize the infinite horizon discounted reward r_0 + γ r_1 + γ^2 r_2 + ... = Σ_{t=0}^∞ γ^t r_t. It is known that an optimal behavior strategy can always be expressed by a stationary policy, whose entries π(sa) specify the probability of taking action a in state s. Below, we represent a policy π by an equivalent |S| × |S||A| matrix Π, where Π(s,s′a) = π(sa) if s′ = s, and 0 otherwise. One can quickly verify that the matrix product ΠP gives the state-to-state transition probabilities induced by the policy π in the environment P, and that PΠ gives the state-action to state-action transition probabilities induced by policy π in P.
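These matrix conventions are easy to check numerically. Below is a minimal numpy sketch (the sizes, the seed, and the flattening of a state-action pair (s,a) to index s·|A| + a are our own illustrative choices, not from the paper) that builds Π from a random policy and confirms that both ΠP and PΠ are row-stochastic:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 3, 2                       # |S|, |A|; pair (s,a) is flattened to index s*nA + a

# Row-stochastic transition matrix P: (|S||A|) x |S|.
P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)

# Stationary policy pi(s,a) and its |S| x (|S||A|) matrix form Pi,
# with Pi[s, (s,a)] = pi(s,a) and zeros off the block diagonal.
pi = rng.random((nS, nA))
pi /= pi.sum(axis=1, keepdims=True)
Pi = np.zeros((nS, nS * nA))
for s in range(nS):
    Pi[s, s * nA:(s + 1) * nA] = pi[s]

PiP = Pi @ P   # |S| x |S|: state-to-state transitions under pi
PPi = P @ Pi   # (|S||A|) x (|S||A|): state-action-to-state-action transitions under pi
```

Both products come out nonnegative with unit row sums, i.e., they are themselves transition matrices, as the text asserts.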
The problem is to compute an optimal policy given either (a) a complete specification of the environmental variables P and r (the “planning problem”), or (b) limited access to the environment through observed states and rewards and the ability to select actions to cause further state transitions (the “learning problem”). The first problem is normally tackled by LP or DP methods, and the second by RL methods. In this paper, we restrict our attention to scenario (a).\n\n* Current affiliation: Computer Sciences Laboratory, Australian National University, tao.wang@anu.edu.au.\n\n3 Dual Representations\n\nTraditionally, DP methods for solving the MDP planning problem are expressed in terms of the primal value function. However, [4] demonstrated that all the classical algorithms have natural duals expressed in terms of state and state-action probability distributions.\n\nIn the primal representation, the policy state-action value function can be specified by an |S||A| × 1 vector q = Σ_{i=0}^∞ γ^i (PΠ)^i r, which satisfies q = r + γPΠq. To develop a dual form of state-action policy evaluation, one considers the linear system d^T = (1 − γ)ν^T + γ d^T PΠ, where ν is the initial distribution over state-action pairs. Not only is d a proper probability distribution over state-action pairs, it also allows one to easily compute the expected discounted return of the policy π. However, recovering the state-action distribution d is inadequate for policy improvement. Therefore, one considers the |S||A| × |S||A| matrix H satisfying H = (1 − γ)I + γPΠH.
The matrix H that satisfies this linear relation is similar to d^T, in that each row is a probability distribution, and the entries H(sa,s′a′) correspond to the probability of discounted state-action visits to (s′a′) for a policy π starting in state-action pair (sa). Unlike d^T, however, H drops the dependence on ν, giving (1 − γ)q = Hr. That is, given H we can easily recover the state-action values of π.\n\nFor policy improvement, in the primal representation one can derive an improved policy π′ via the update a*(s) = argmax_a q(sa) and π′(sa) = 1 if a = a*(s), otherwise 0. The dual form of the policy update, expressed in terms of the state-action matrix H for π, is a*(s) = argmax_a H(sa,:)r. In fact, since (1 − γ)q = Hr, the two policy updates, given in the primal and dual respectively, must lead to the same resulting policy π′. Further details are given in [4].\n\n4 DP algorithms and convergence\n\nWe first investigate whether dynamic programming operators with the dual representations exhibit the same (or better) convergence properties as their primal counterparts. These questions will be answered in the affirmative. In the tabular case, dynamic programming algorithms can be expressed by operators that are successively applied to current approximations (vectors in the primal case, matrices in the dual) to bring them closer to a target solution; namely, the fixed point of a desired Bellman equation. Consider two standard operators, the on-policy update and the max-policy update.\n\nFor a given policy Π, the on-policy operator O is defined as\n\nOq = r + γPΠq  and  OH = (1 − γ)I + γPΠH,\n\nfor the primal and dual cases respectively.
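The evaluation and policy-update equivalences of Section 3 can be verified directly. A minimal numpy sketch (the small random MDP and seed are our own choices) solves the two linear Bellman systems and checks that H is a nonnegative row-stochastic matrix, that (1 − γ)q = Hr, and that the primal and dual greedy updates select the same actions:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.9
P = rng.random((nS * nA, nS)); P /= P.sum(axis=1, keepdims=True)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
Pi = np.zeros((nS, nS * nA))
for s in range(nS):
    Pi[s, s * nA:(s + 1) * nA] = pi[s]
r = rng.standard_normal(nS * nA)

A = np.eye(nS * nA) - gamma * P @ Pi
q = np.linalg.solve(A, r)             # solves q = r + gamma P Pi q
H = (1 - gamma) * np.linalg.inv(A)    # solves H = (1 - gamma) I + gamma P Pi H

# Primal and dual greedy policy updates.
a_primal = q.reshape(nS, nA).argmax(axis=1)
a_dual = (H @ r).reshape(nS, nA).argmax(axis=1)
```

Since H = (1 − γ)(I − γPΠ)^{-1} is a Neumann series of nonnegative terms, its rows are probability distributions, and Hr reproduces (1 − γ)q exactly, so both updates pick the same greedy actions.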
The goal of the on-policy update is to bring current representations closer to satisfying the policy-specific Bellman equations,\n\nq = r + γPΠq  and  H = (1 − γ)I + γPΠH.\n\nThe max-policy operator M is different in that it is neither linear nor defined by any reference policy; instead, it applies a greedy max update to the current approximations,\n\nMq = r + γPΠ*[q]  and  MH = (1 − γ)I + γPΠ*_r[H],\n\nwhere Π*[q](s) = max_a q(sa) and Π*_r[H](s,:) = H(sa′(s),:) such that a′(s) = argmax_a [Hr](sa). The goal of this greedy update is to bring the representations closer to satisfying the optimal-policy Bellman equations q = r + γPΠ*[q] and H = (1 − γ)I + γPΠ*_r[H].\n\n4.1 On-policy convergence\n\nFor the on-policy operator O, convergence to the Bellman fixed point is easily proved in the primal case by establishing a contraction property of O with respect to a specific norm on q vectors. In particular, one defines a weighted 2-norm with weights given by the stationary distribution determined by the policy Π and transition model P. Let z ≥ 0 be a vector such that z^T PΠ = z^T; that is, z is the stationary state-action visit distribution for PΠ. Then the norm is defined as ||q||_z^2 = q^T Z q = Σ_{(sa)} z(sa) q(sa)^2, where Z = diag(z). It can be shown that ||PΠq||_z ≤ ||q||_z and ||Oq1 − Oq2||_z ≤ γ||q1 − q2||_z (see [7]). Crucially, for this norm, a state-action transition is not an expansion [7]. By the contraction map fixed point theorem [2], there exists a unique fixed point of O in the space of vectors q.
Therefore, repeated applications of the on-policy operator converge to a vector q_Π such that q_Π = Oq_Π; that is, q_Π satisfies the policy based Bellman equation. Analogously, for the dual representation H, one can establish convergence of the on-policy operator by first defining a weighted pseudo-norm over matrices and then verifying that O is a contraction with respect to this norm. Define\n\n||H||_{z,r}^2 = ||Hr||_z^2 = Σ_{(sa)} z(sa) (Σ_{(s′a′)} H(sa,s′a′) r(s′a′))^2  (1)\n\nIt is easily verified that this definition satisfies the properties of a pseudo-norm, and in particular satisfies the triangle inequality. This weighted 2-norm is defined with respect to the stationary distribution z, but also the reward vector r. Thus, the magnitude of a row normalized matrix is determined by the magnitude of the weighted reward expectations it induces.\n\nInterestingly, this definition allows us to establish the same non-expansion and contraction results as in the primal case. We have ||PΠH||_{z,r} ≤ ||H||_{z,r} by arguments similar to the primal case. Moreover, the on-policy operator is a contraction with respect to ||·||_{z,r}.\n\nLemma 1 ||OH1 − OH2||_{z,r} ≤ γ||H1 − H2||_{z,r}\n\nProof: ||OH1 − OH2||_{z,r} = γ||PΠ(H1 − H2)||_{z,r} ≤ γ||H1 − H2||_{z,r}, since ||PΠH||_{z,r} ≤ ||H||_{z,r}.\n\nThus, once again by the contraction map fixed point theorem, there exists a fixed point of O among row normalized matrices H, and repeated applications of O will converge to a matrix H_Π such that OH_Π = H_Π; that is, H_Π satisfies the policy based Bellman equation for dual representations. This argument shows that on-policy dynamic programming converges in the dual representation, without making direct reference to the primal case.
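Both contraction claims can be checked empirically. The sketch below (again on a small random MDP of our own choosing) estimates z by power iteration, verifies the γ-contraction of the primal on-policy operator in ||·||_z, and iterates the dual on-policy update to its fixed point, measuring the error in the pseudo-norm ||·||_{z,r}:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 3, 2, 0.9
n = nS * nA
P = rng.random((n, nS)); P /= P.sum(axis=1, keepdims=True)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
Pi = np.zeros((nS, n))
for s in range(nS):
    Pi[s, s * nA:(s + 1) * nA] = pi[s]
r = rng.standard_normal(n)
PPi = P @ Pi

# Stationary state-action distribution: z^T P Pi = z^T, via power iteration.
z = np.ones(n) / n
for _ in range(2000):
    z = z @ PPi

norm_z = lambda x: np.sqrt(np.sum(z * x * x))
O_q = lambda q: r + gamma * PPi @ q
q1, q2 = rng.standard_normal(n), rng.standard_normal(n)

# Dual on-policy iteration from a random row-stochastic start.
H = rng.random((n, n)); H /= H.sum(axis=1, keepdims=True)
for _ in range(300):
    H = (1 - gamma) * np.eye(n) + gamma * PPi @ H
H_fix = (1 - gamma) * np.linalg.inv(np.eye(n) - gamma * PPi)
err = norm_z((H - H_fix) @ r)        # pseudo-norm ||H - H_fix||_{z,r}
```

The iterates stay row-stochastic throughout, and the pseudo-norm error shrinks geometrically at rate γ, as Lemma 1 predicts.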
We will use these results below.\n\n4.2 Max-policy convergence\n\nThe strategy for establishing convergence for the nonlinear max operator is similar to the on-policy case, but involves working with a different norm. Instead of considering a 2-norm weighted by the visit probabilities induced by a fixed policy, one simply uses the max-norm in this case: ||q||_∞ = max_{(sa)} |q(sa)|. The contraction property of the M operator with respect to this norm can then be easily established in the primal case: ||Mq1 − Mq2||_∞ ≤ γ||q1 − q2||_∞ (see [2]). As in the on-policy case, contraction suffices to establish the existence of a unique fixed point of M among vectors q, and that repeated application of M converges to this fixed point q* such that Mq* = q*. To establish convergence of the off-policy update in the dual representation, first define the max-norm for state-action visit distributions as\n\n||H||_∞ = max_{(sa)} |Σ_{(s′a′)} H(sa,s′a′) r(s′a′)|  (2)\n\nThen one can simply reduce the dual to the primal case by appealing to the relationship (1 − γ)Mq = (MH)r to prove convergence of MH.\n\nLemma 2 If (1 − γ)q = Hr, then (1 − γ)Mq = (MH)r.\n\nProof: (1 − γ)Mq = (1 − γ)r + γPΠ*[(1 − γ)q] = (1 − γ)r + γPΠ*[Hr] = (1 − γ)r + γPΠ*_r[H]r = (MH)r, where the second equality holds since we assumed (1 − γ)q(sa) = [Hr](sa) for all (sa).\n\nThus, given convergence of Mq to a fixed point Mq* = q*, the same must also hold for MH. However, one subtlety here is that the dual fixed point is not unique. This is not a contradiction, because the norm on dual representations ||·||_{z,r} is in fact just a pseudo-norm, not a proper norm. That is, the relationship between H and q is many to one, and several matrices can correspond to the same q.
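Both points, the reduction (1 − γ)Mq = (MH)r of Lemma 2 and the many-to-one nature of the map H → Hr, can be illustrated directly. A sketch with a small random MDP of our own choosing; the construction of the second matrix H2 (shifting every row of H1 by a vector orthogonal to both r and 1) is our own illustrative device:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 3, 2, 0.9
n = nS * nA
P = rng.random((n, nS)); P /= P.sum(axis=1, keepdims=True)
r = rng.standard_normal(n)

def M_q(q):   # Mq = r + gamma P Pi*[q], with Pi*[q](s) = max_a q(sa)
    return r + gamma * P @ q.reshape(nS, nA).max(axis=1)

def M_H(H):   # MH = (1 - gamma) I + gamma P Pi*_r[H]
    greedy = (H @ r).reshape(nS, nA).argmax(axis=1)
    PiStarH = np.array([H[s * nA + greedy[s]] for s in range(nS)])  # rows H(s a'(s), :)
    return (1 - gamma) * np.eye(n) + gamma * P @ PiStarH

# Any row-stochastic H1 defines a q via (1 - gamma) q = H1 r.
H1 = rng.random((n, n)); H1 /= H1.sum(axis=1, keepdims=True)
q = H1 @ r / (1 - gamma)

# A different matrix with the same H r: shift every row by x with x.r = 0 and x.1 = 0,
# scaled small enough to preserve nonnegativity.
B = np.stack([r, np.ones(n)], axis=1)
x = rng.standard_normal(n)
x -= B @ np.linalg.lstsq(B, x, rcond=None)[0]   # project out span{r, 1}
x /= np.linalg.norm(x)
H2 = H1 + 0.5 * H1.min() * np.outer(np.ones(n), x)
```

H1 and H2 are distinct row-stochastic matrices inducing the same q, and applying the two max operators confirms the lemma numerically.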
These matrices form a convex subspace (in fact, a simplex), since if H1r = (1 − γ)q and H2r = (1 − γ)q then (αH1 + (1 − α)H2)r = (1 − γ)q for any α, where furthermore α must be restricted to 0 ≤ α ≤ 1 to maintain nonnegativity. The simplex of fixed points {H* : MH* = H*} is given by the matrices H* that satisfy H*r = (1 − γ)q*.\n\n5 DP with function approximation\n\nPrimal and dual updates exhibit strong equivalence in the tabular case, as they should. However, when we begin to consider approximation, differences emerge. We next consider the convergence properties of the dynamic programming operators in the context of linear basis approximation. We focus on the on-policy case here because, famously, the max operator does not always have a fixed point when combined with approximation in the primal case [8], and consequently suffers the risk of divergence [5, 6].\n\nNote that the max operator cannot diverge in the dual case, even with basis approximation, by boundedness alone, although the question of whether max updates always converge in this case remains open. Here we establish that a bound on approximation error, similar to the one known for the primal case, can be proved for the dual approach with respect to the on-policy operator.\n\nIn the primal case, linear approximation proceeds by fixing a small set of basis functions, forming an |S||A| × k matrix Φ, where k is the number of bases. The approximation of q can be expressed as a linear combination of bases, ^q = Φw, where w is a k × 1 vector of adjustable weights.
This is equivalent to maintaining the constraint that ^q ∈ colspan(Φ). In the dual, a linear approximation to H can be expressed as vec(^H) = Ψw, where the vec operator creates a column vector from a matrix by stacking the column vectors of the matrix below one another, w is a k × 1 vector of adjustable weights as in the primal case, and Ψ is a (|S||A|)^2 × k matrix of basis functions. To ensure that ^H remains a nonnegative, row normalized approximation to H, we simply add the constraint that ^H ∈ simplex(Ψ) ≡ {^H : vec(^H) = Ψw, Ψ ≥ 0, (1^T ⊗ I)Ψ = 11^T, w ≥ 0, w^T 1 = 1}, where ⊗ is the Kronecker product.\n\nIn this section, we first introduce operators (projection and gradient step operators) that ensure the approximations stay representable in the given basis. Then we consider their composition with the on-policy and off-policy updates, and analyze their convergence properties. For the composition of the on-policy update and projection operators, we establish a bound on approximation error in the dual case similar to that in the primal case.\n\n5.1 Projection Operator\n\nRecall that in the primal, the action value function q is approximated by a linear combination of bases in Φ. Unfortunately, there is no reason to expect Oq or Mq to stay in the column span of Φ, so a best approximation is required. The subtlety resolved by Tsitsiklis and Van Roy [7] is to identify a particular form of best approximation, weighted least squares, that ensures convergence is still achieved when combined with the on-policy operator O. Unfortunately, the fixed point of this combined update operator is not guaranteed to be the best representable approximation of O's fixed point, q_Π.
Nevertheless, a bound can be proved on how close this altered fixed point is to the best representable approximation.\n\nWe summarize a few details that will be useful below. First, the best least squares approximation is computed with respect to the distribution z. The map from a general q vector onto its best approximation in colspan(Φ) is defined by another operator P, which projects q into the column span of Φ: Pq = argmin_{^q ∈ colspan(Φ)} ||q − ^q||_z^2 = Φ(Φ^T ZΦ)^{-1} Φ^T Z q, where ^q is an approximation of the value function q. The important property of this weighted projection is that it is a non-expansion operator in ||·||_z, i.e., ||Pq||_z ≤ ||q||_z, which can be easily obtained from the generalized Pythagorean theorem. Approximate dynamic programming then proceeds by composing the two operators, the on-policy update O with the subspace projection P, to compute the best representable approximation of the one step update. This combined operator is guaranteed to converge, since composing a non-expansion with a contraction is still a contraction; moreover, its fixed point q+ satisfies ||q+ − q_Π||_z ≤ (1/(1 − γ)) ||q_Π − Pq_Π||_z [7].\n\nLinear function approximation in the dual case is a bit more complicated, because matrices are being represented, not vectors, and moreover the matrices need to satisfy row normalization and nonnegativity constraints. Nevertheless, a very similar approach to the primal case can be successfully applied. Recall that in the dual, the state-action visit distribution H is approximated by a linear combination of bases in Ψ. As in the primal case, there is no reason to expect that an update like OH will keep the matrix in the simplex. Therefore, a projection operator must be constructed that determines the best representable approximation to OH.
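Before turning to the dual construction, the primal projection and its error bound can be verified numerically. A sketch (the random MDP and random bases are our own choices): we form the weighted projection matrix, solve for the fixed point of the composed update, and check the non-expansion property and the 1/(1 − γ) bound:

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma, k = 5, 2, 0.9, 3
n = nS * nA
P = rng.random((n, nS)); P /= P.sum(axis=1, keepdims=True)
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
Pi = np.zeros((nS, n))
for s in range(nS):
    Pi[s, s * nA:(s + 1) * nA] = pi[s]
r = rng.standard_normal(n)
PPi = P @ Pi

# Stationary distribution z and the z-weighted norm.
z = np.ones(n) / n
for _ in range(5000):
    z = z @ PPi
Z = np.diag(z)
norm_z = lambda x: np.sqrt(x @ Z @ x)

Phi = rng.standard_normal((n, k))
Proj = Phi @ np.linalg.solve(Phi.T @ Z @ Phi, Phi.T @ Z)   # Phi (Phi^T Z Phi)^{-1} Phi^T Z

q_pi = np.linalg.solve(np.eye(n) - gamma * PPi, r)         # on-policy fixed point
# Fixed point of the composed update: q+ = Proj(r + gamma PPi q+), a linear solve.
q_plus = np.linalg.solve(np.eye(n) - gamma * Proj @ PPi, Proj @ r)
```

With the exact stationary z, the projection never expands the z-norm, q+ really is a fixed point of the composed operator, and its distance to q_Π respects the stated bound.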
One needs to be careful to define this projection with respect to the right norm to ensure convergence. Here, the pseudo-norm ||·||_{z,r} defined in Equation 1 suits this purpose. Define the weighted projection operator P over matrices\n\nPH = argmin_{^H ∈ simplex(Ψ)} ||H − ^H||_{z,r}^2  (3)\n\nThe projection can be obtained by solving the above quadratic program. A key result is that this projection operator is a non-expansion with respect to the pseudo-norm ||·||_{z,r}.\n\nTheorem 1 ||PH||_{z,r} ≤ ||H||_{z,r}\n\nProof: The easiest way to prove the theorem is to observe that the projection operator P is really a composition of three orthogonal projections: first, onto the linear subspace span(Ψ); then onto the subspace of row normalized matrices span(Ψ) ∩ {H : H1 = 1}; and finally onto the space of nonnegative matrices span(Ψ) ∩ {H : H1 = 1} ∩ {H : H ≥ 0}. Note that the last projection into the nonnegative halfspace is equivalent to a projection into a linear subspace for some hyperplane tangent to the simplex. Each one of these projections is a non-expansion in ||·||_{z,r} in the same way: a generalized Pythagorean theorem holds. Consider just one of these linear projections P1:\n\n||H||_{z,r}^2 = ||P1H + H − P1H||_{z,r}^2 = ||P1Hr + Hr − P1Hr||_z^2 = ||P1Hr||_z^2 + ||Hr − P1Hr||_z^2 = ||P1H||_{z,r}^2 + ||H − P1H||_{z,r}^2\n\nSince the overall projection is just a composition of non-expansions, it must be a non-expansion.\n\nAs in the primal, approximate dynamic programming can be implemented by composing the on-policy update O with the projection operator P. Since O is a contraction and P a non-expansion, PO must also be a contraction, and it then follows that it has a fixed point. Note that, as in the tabular case, this fixed point is only unique up to Hr-equivalence, since the pseudo-norm ||·||_{z,r} does not distinguish H1 and H2 such that H1r = H2r.
Here too, the fixed point is actually a simplex of equivalent solutions. For simplicity, we denote the simplex of fixed points for PO by some representative H+ = POH+. Finally, we can recover an approximation bound analogous to the primal bound, which bounds the approximation error between H+ and the best representable approximation to the on-policy fixed point H_Π = OH_Π.\n\nTheorem 2 ||H+ − H_Π||_{z,r} ≤ (1/(1 − γ)) ||PH_Π − H_Π||_{z,r}\n\nProof: First note that ||H+ − H_Π||_{z,r} = ||H+ − PH_Π + PH_Π − H_Π||_{z,r} ≤ ||H+ − PH_Π||_{z,r} + ||PH_Π − H_Π||_{z,r} by the triangle inequality. Then, since H+ = POH+ and P is a non-expansion operator, we have ||H+ − PH_Π||_{z,r} = ||POH+ − PH_Π||_{z,r} ≤ ||OH+ − H_Π||_{z,r}. Finally, using H_Π = OH_Π and Lemma 1, we obtain ||OH+ − H_Π||_{z,r} = ||OH+ − OH_Π||_{z,r} ≤ γ||H+ − H_Π||_{z,r}. Thus (1 − γ)||H+ − H_Π||_{z,r} ≤ ||PH_Π − H_Π||_{z,r}.\n\nTo compare the primal and dual results, note that despite the similarity of the bounds, the projection operators do not preserve the tight relationship between primal and dual updates. That is, even if (1 − γ)q = Hr and (1 − γ)(Oq) = (OH)r, it is not true in general that (1 − γ)(POq) = (POH)r. The most obvious difference comes from the fact that in the dual, the space of H matrices has bounded diameter, whereas in the primal, the space of q vectors has unbounded diameter in the natural norms.
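The bounded-diameter point is visible directly in the parameterization vec(^H) = Ψw: for any admissible w, the represented matrix is row-stochastic with entries in [0, 1]. A small numpy check (sizes and random bases are our own choices; vec stacks columns, so (1^T ⊗ I) vec(H) recovers the row sums H1):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 6, 4                          # n = |S||A| state-action pairs, k basis distributions

# Psi: n^2 x k; each column is vec(B_j) for a row-stochastic basis matrix B_j.
Psi = np.zeros((n * n, k))
for j in range(k):
    B = rng.random((n, n)); B /= B.sum(axis=1, keepdims=True)
    Psi[:, j] = B.flatten(order='F')           # vec() stacks columns

# The constraint (1^T kron I) Psi = 1 1^T says every basis matrix has unit row sums.
C = np.kron(np.ones((1, n)), np.eye(n)) @ Psi

# Any w in the simplex yields a row-stochastic ^H, so iterates can never blow up.
w = rng.random(k); w /= w.sum()
H_hat = (Psi @ w).reshape(n, n, order='F')
```

Every representable ^H lies in a bounded simplex regardless of w, which is exactly why divergence is impossible for the dual iterates.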
The dual updates thus automatically cannot diverge under compositions like PO and PM; in the primal case, by contrast, the update PM is known to lack fixed points in some circumstances [8].\n\n5.2 Gradient Operator\n\nIn large scale problems one does not normally have the luxury of computing full dynamic programming updates that evaluate complete expectations over the entire domain, since this requires knowing the stationary visit distribution z for PΠ (essentially requiring one to know the model of the MDP). Moreover, full least squares projections are usually not practical to compute. A key intermediate step toward practical DP and RL algorithms is to formulate gradient step operators that only approximate full projections. Conveniently, the gradient update and projection operators are independent of the on-policy and off-policy updates and can be applied in either case. However, as we will see below, the gradient step operator causes significant instability in the off-policy update, to the degree that divergence is a common phenomenon (much more so than with full projections). Composing approximation with an off-policy update (max operator) in the primal case can be very dangerous. All other operator combinations are better behaved in practice, and even those that are not known to converge usually behave reasonably. Unfortunately, composing the gradient step with an off-policy update is a commonly attempted algorithm in reinforcement learning (Q-learning with function approximation), despite being the most unstable combination.\n\nIn the dual representation, one can derive a gradient update operator in a similar way to the primal, except that it is important to maintain the constraints on the parameters w, since the basis functions are probability distributions.
We start by considering the projection objective\n\nJ_H = (1/2)||H − ^H||_{z,r}^2  subject to  vec(^H) = Ψw, w ≥ 0, w^T 1 = 1.\n\nThe unconstrained gradient of this objective with respect to w is\n\n∇_w J_H = Ψ^T (r^T ⊗ I)^T Z (r^T ⊗ I)(Ψw − h) = Γ^T Z (r^T ⊗ I)(^h − h),\n\nwhere Γ = (r^T ⊗ I)Ψ, h = vec(H), and ^h = vec(^H). However, this gradient step cannot be followed directly, because we need to maintain the constraints. The constraint w^T 1 = 1 can be maintained by first projecting the gradient onto it, obtaining δw = (I − (1/k)11^T)∇_w J_H. Thus, the weight vector can be updated by\n\nw_{t+1} = w_t − α δw = w_t − α(I − (1/k)11^T)Γ^T Z (r^T ⊗ I)(^h − h),\n\nwhere α is a step-size parameter. The gradient operator can then be defined by\n\nG_^h h = ^h − αΨ δw = ^h − αΨ(I − (1/k)11^T)Γ^T Z (r^T ⊗ I)(^h − h).\n\n(Note that to further respect the box constraints 0 ≤ h ≤ 1, the stepsize might need to be reduced, and additional equality constraints might have to be imposed on some of the components of h that are at the boundary values.)\n\nAs in the primal, since the target vector H (i.e., h) is determined by the underlying dynamic programming update, this gives the composed updates\n\nGO^h = ^h − αΨ(I − (1/k)11^T)Γ^T Z (r^T ⊗ I)(^h − O^h)  and  GM^h = ^h − αΨ(I − (1/k)11^T)Γ^T Z (r^T ⊗ I)(^h − M^h),\n\nrespectively for the on-policy and off-policy cases (ignoring the additional equality constraints).\n\nThus far, the dual approach appears to hold an advantage over the standard primal approach, since convergence holds in every circumstance where the primal updates converge, and yet the dual updates are guaranteed never to diverge
because the fundamental objects being represented are normalized probability distributions (i.e., they belong to a bounded simplex). We now investigate the convergence properties of the various updates empirically.\n\n6 Experimental Results\n\nTo investigate the effectiveness of the dual representations, we conducted experiments on various domains, including randomly synthesized MDPs, Baird's star problem [5], and the mountain car problem. The randomly synthesized MDP domains allow us to test the general properties of the algorithms. The star problem is perhaps the most-cited example of a problem where Q-learning with linear function approximation diverges [5], and the mountain car domain has been prone to divergence with some primal representations [9], although successful results were reported when bases are selected by sparse tile coding [10].\n\nFor each problem domain, twelve algorithms were run over 100 repeats with a horizon of 1000 steps. The algorithms were: tabular on-policy (O), projection on-policy (PO), gradient on-policy (GO), tabular off-policy (M), projection off-policy (PM), and gradient off-policy (GM), for both the primal and the dual. The discount factor was set to γ = 0.9. For on-policy algorithms, we measure the difference between the values generated by the algorithms and those generated by the analytically determined fixed point. For off-policy algorithms, we measure the difference between the values generated by the resulting policy and the values of the optimal policy. The step size for the gradient updates was 0.1 for primal representations and 100 for dual representations. The initial values of
Since the goal is to\ninvestigate the convergence of the algorithms without carefully crafting features, we also choose\nrandom basis functions according to a standard normal distribution for the primal representations,\nand random basis distributions according to a uniform distribution for the dual representations.\n\nRandomly Synthesized MDPs. For the synthesized MDPs, we generated the transition and re-\nward functions of the MDPs randomly\u2014the transition function is uniformly distributed between 0\nand 1 and the reward function is drawn from a standard normal. Here we only reported the results\nof random MDPs with 100 states, 5 actions, and 10 bases, observed consistent convergence of the\ndual representations on a variety of MDPs, with different numbers of states, actions, and bases. In\nFigure 1(right), the curve for the gradient off-policy update (GM) in the primal case (dotted line\nwith the circle marker) blows up (diverges), while all the other algorithms in Figure 1 converge.\nInterestingly, the approximate error of the dual algorithm POH (4:60(cid:2)10(cid:0)3) is much smaller than\nthe approximate error of the corresponding primal algorithm POq (4:23(cid:2)10(cid:0)2), even though their\ntheoretical bounds are the same (see Figure 1(left)).\n\nOn\u2212Policy Update on Random MDPs\n\nOff\u2212Policy Update on Random MDPs\n\ni\n\nt\nn\no\nP\n \ne\nc\nn\ne\nr\ne\nf\ne\nR\nm\no\nr\nf\n \ne\nc\nn\ne\nr\ne\nf\nf\ni\n\n \n\nD\n\n1010\n\n105\n\n100\n\n10\u22125\n\n10\u221210\n\nOq\n\nPOq\n\nG Oq\n\nOH\nPOH\nG OH\n\ni\n\nt\nn\no\nP\n \ne\nc\nn\ne\nr\ne\nf\ne\nR\nm\no\nr\nf\n \ne\nc\nn\ne\nr\ne\nf\nf\ni\n\n \n\nD\n\n100\n\n200\n\n300\n\n400\n600\nNumber of Steps\n\n500\n\n700\n\n800\n\n900\n\n1000\n\n1010\n\n105\n\n100\n\n10\u22125\n\n10\u221210\n\nMq\n\nPMq\n\nG Mq\n\nMH\nPMH\nG MH\n\n100\n\n200\n\n300\n\n400\n600\nNumber of Steps\n\n500\n\n700\n\n800\n\n900\n\n1000\n\nFigure 1: Updates of state-action value q and visit distribution H on randomly 
synthesized MDPs\n\nThe Star Problem. The star problem has 7 states and 2 actions. The reward function is zero for each transition. In these experiments, we used the same fixed policy and linear value function approximation as in [5]. In the dual, the number of bases is also set to 14, and the initial values of the state-action visit distribution matrix H are uniformly distributed random numbers between 0 and 1 with row normalization. The gradient off-policy update in the primal case diverges (see the dotted line with the circle marker in Figure 2(right)). However, all the updates with the dual representation algorithms converge.\n\n[figure: two log-scale plots, “On-Policy Update on Star Problem” (curves Oq, POq, GOq, OH, POH, GOH) and “Off-Policy Update on Star Problem” (curves Mq, PMq, GMq, MH, PMH, GMH), each plotting Difference from Reference Point against Number of Steps]\n\nFigure 2: Updates of state-action value q and visit distribution H on the star problem\n\nThe Mountain Car Problem. The mountain car domain has continuous state and action spaces, which we discretized with a simple grid, resulting in an MDP with 222 states and 3 actions. The number of bases was chosen to be 5 for both the primal and dual algorithms. For the same reason as before, we chose the bases for the algorithms randomly. In the primal representations with linear function approximation, we randomly generated basis functions according to the standard normal distribution.
In the dual representations, we randomly picked the basis distributions according to the uniform distribution. In Figure 3(right), we again observed divergence of the gradient off-policy update on state-action values in the primal, and convergence of all the dual algorithms (see Figure 3). Again, the approximation error of the projected on-policy update POH in the dual (1.90×10^1) is considerably smaller than that of POq (3.26×10^2) in the primal.\n\n[figure: two log-scale plots, “On-Policy Update on Mountain Car” (curves Oq, POq, GOq, OH, POH, GOH) and “Off-Policy Update on Mountain Car” (curves Mq, PMq, GMq, MH, PMH, GMH), each plotting Difference from Reference Point against Number of Steps]\n\nFigure 3: Updates of state-action value q and visit distribution H on the mountain car problem\n\n7 Conclusion\n\nDual representations maintain an explicit representation of visit distributions as opposed to value functions [4]. We extended the dual dynamic programming algorithms with linear function approximation, and studied the convergence properties of the dual algorithms for planning in MDPs. We demonstrated that dual algorithms, since they are based on estimating normalized probability distributions rather than unbounded value functions, avoid divergence even in the presence of approximation and off-policy updates. Moreover, dual algorithms remain stable in situations where standard value function estimation diverges.\n\nReferences\n\n[1] M. Puterman. Markov Decision Processes: Discrete Dynamic Programming. Wiley, 1994.\n[2] D. Bertsekas.
Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 1995.\n[3] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.\n[4] T. Wang, M. Bowling, and D. Schuurmans. Dual representations for dynamic programming and reinforcement learning. In Proceedings of the IEEE International Symposium on ADPRL, pages 44-51, 2007.\n[5] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, pages 30-37, 1995.\n[6] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.\n[7] J. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674-690, 1997.\n[8] D. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. J. Optimization Theory and Applic., 105(3):589-608, 2000.\n[9] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS 7, pages 369-376, 1995.\n[10] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, pages 1038-1044, 1996.\n", "award": [], "sourceid": 580, "authors": [{"given_name": "Tao", "family_name": "Wang", "institution": null}, {"given_name": "Michael", "family_name": "Bowling", "institution": null}, {"given_name": "Dale", "family_name": "Schuurmans", "institution": null}, {"given_name": "Daniel", "family_name": "Lizotte", "institution": null}]}