{"title": "Regularized Off-Policy TD-Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 836, "page_last": 844, "abstract": "We present a novel $l_1$ regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm.", "full_text": "Regularized Off-Policy TD-Learning\n\nBo Liu, Sridhar Mahadevan\nComputer Science Department\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\n{boliu, mahadeva}@cs.umass.edu\n\nJi Liu\n\nComputer Science Department\n\nUniversity of Wisconsin\n\nMadison, WI 53706\n\nji-liu@cs.wisc.edu\n\nAbstract\n\nWe present a novel l1 regularized off-policy convergent TD-learning method\n(termed RO-TD), which is able to learn sparse representations of value functions\nwith low computational complexity. The algorithmic framework underlying RO-\nTD integrates two key ideas: off-policy convergent gradient TD methods, such\nas TDC, and a convex-concave saddle-point formulation of non-smooth convex\noptimization, which enables \ufb01rst-order solvers and feature selection using online\nconvex regularization. A detailed theoretical and experimental analysis of RO-TD\nis presented. 
A variety of experiments are presented to illustrate the off-policy\nconvergence, sparse feature selection capability and low computational cost of the\nRO-TD algorithm.\n\n1 Introduction\n\nTemporal-difference (TD) learning is a widely used method in reinforcement learning (RL). Although\nTD converges when samples are drawn \u201con-policy\u201d by sampling from the Markov chain\nunderlying a policy in a Markov decision process (MDP), it can be shown to be divergent when\nsamples are drawn \u201coff-policy\u201d. Off-policy methods are more widely applicable since they are able to\nlearn while executing an exploratory policy, learn from demonstrations, and learn multiple tasks in\nparallel [2]. Sutton et al. [20] introduced convergent off-policy temporal difference learning algorithms,\nsuch as TDC, whose computation time scales linearly with the number of samples and the\nnumber of features. Recently, a linear off-policy actor-critic algorithm based on the same framework\nwas proposed in [2].\nRegularizing reinforcement learning algorithms leads to more robust methods that can scale up to\nlarge problems with many potentially irrelevant features. LARS-TD [7] introduced a popular approach\nof combining l1 regularization using Least Angle Regression (LARS) with the least-squares\nTD (LSTD) framework. Another approach was introduced in [5] (LCP-TD) based on the Linear\nComplementarity Problem (LCP) formulation, an optimization approach between linear programming\nand quadratic programming. LCP-TD uses \u201cwarm-starts\u201d, which helps significantly reduce\nthe burden of l1 regularization. A theoretical analysis of l1 regularization was given in [4], including\nerror bound analysis with finite samples in the on-policy setting. 
Another approach integrating the\nDantzig Selector with LSTD was proposed in [3], overcoming some of the drawbacks of LARS-TD.\nAn approximate linear programming approach for finding l1 regularized solutions of the Bellman\nequation was presented in [17]. All of these approaches are second-order methods, requiring complexity\napproximately cubic in the number of (active) features. Another approach to feature selection\nis to greedily add new features, proposed recently in [15]. Regularized first-order reinforcement\nlearning approaches have recently been investigated in the on-policy setting as well, wherein convergence\nof l1 regularized temporal difference learning is discussed in [16] and mirror descent [6] is\nused in [11].\n\nIn this paper, the off-policy TD learning problem is formulated from the stochastic optimization\nperspective. A novel objective function is proposed based on the linear equation formulation of\nthe TDC algorithm. The optimization problem underlying off-policy TD methods, such as TDC,\nis reformulated as a convex-concave saddle-point stochastic approximation problem, which is both\nconvex and incrementally solvable. A detailed theoretical and experimental study of the RO-TD\nalgorithm is presented.\nHere is a brief roadmap to the rest of the paper. Section 2 reviews the basics of MDPs, RL and recent\nwork on off-policy convergent TD methods, such as the TDC algorithm. Section 3 introduces the\nproximal gradient method and the convex-concave saddle-point formulation of non-smooth convex\noptimization. Section 4 presents the new RO-TD algorithm. Convergence analysis of RO-TD is\npresented in Section 5. 
Finally, in Section 6, experimental results are presented to demonstrate the\neffectiveness of RO-TD.\n\n2 Reinforcement Learning and the TDC Algorithm\n\nA Markov Decision Process (MDP) is defined by the tuple $(S, A, P^a_{ss'}, R, \\gamma)$, comprised of a set of\nstates $S$, a set of (possibly state-dependent) actions $A$ ($A_s$), a dynamical system model comprised of\nthe transition kernel $P^a_{ss'}$ specifying the probability of transition to state $s'$ from state $s$ under action\n$a$, a reward model $R$, and a discount factor $0 \\le \\gamma < 1$. A policy $\\pi : S \\to A$ is a deterministic\nmapping from states to actions. Associated with each policy $\\pi$ is a value function $V^\\pi$, which is the\nfixed point of the Bellman equation:\n\n$$V^\\pi(s) = T^\\pi V^\\pi(s) = R^\\pi(s) + \\gamma P^\\pi V^\\pi(s)$$\n\nwhere $R^\\pi$ is the expected immediate reward function (treated here as a column vector), $P^\\pi$ is\nthe state transition function under fixed policy $\\pi$, and $T^\\pi$ is known as the Bellman operator. In what\nfollows, we often drop the dependence of $V^\\pi$, $T^\\pi$, $R^\\pi$ on $\\pi$, for notational simplicity. In linear value\nfunction approximation, a value function is assumed to lie in the linear span of a basis function\nmatrix $\\Phi$ of dimension $|S| \\times d$, where $d$ is the number of linearly independent features. Hence,\n$V \\approx \\hat{V} = \\Phi\\theta$. The vector space of all value functions is a normed inner product space, where the\n\u201clength\u201d of any value function $f$ is measured as $\\|f\\|^2_\\Xi = \\sum_s \\xi(s) f^2(s) = f^T \\Xi f$ weighted by $\\Xi$,\nwhere $\\Xi$ is defined in Figure 1. For the t-th sample, $\\phi_t$, $\\phi'_t$, $\\theta_t$ and $\\delta_t$ are defined in Figure 1. TD\nlearning uses the following update rule $\\theta_{t+1} = \\theta_t + \\alpha_t \\delta_t \\phi_t$, where $\\alpha_t$ is the stepsize. However,\nTD is only guaranteed to converge in the on-policy setting, although in many off-policy situations,\nit still has satisfactory performance [21]. TD with gradient correction (TDC) [20] aims to minimize\nthe mean-square projected Bellman error (MSPBE) in order to guarantee off-policy convergence.\nMSPBE is defined as\n\n$$\\mathrm{MSPBE}(\\theta) = \\|\\Phi\\theta - \\Pi T(\\Phi\\theta)\\|^2_\\Xi = (\\Phi^T \\Xi (T\\Phi\\theta - \\Phi\\theta))^T (\\Phi^T \\Xi \\Phi)^{-1} \\Phi^T \\Xi (T\\Phi\\theta - \\Phi\\theta) \\quad (1)$$\n\nTo avoid computing the inverse matrix $(\\Phi^T \\Xi \\Phi)^{-1}$ and to avoid the double sampling problem [19]\nin (1), an auxiliary variable $w$ is defined\n\n$$w = (\\Phi^T \\Xi \\Phi)^{-1} \\Phi^T \\Xi (T\\Phi\\theta - \\Phi\\theta) \\quad (2)$$\n\nThe two time-scale gradient descent learning method TDC [20] is defined below\n\n$$\\theta_{t+1} = \\theta_t + \\alpha_t \\delta_t \\phi_t - \\alpha_t \\gamma \\phi'_t (\\phi_t^T w_t), \\quad w_{t+1} = w_t + \\beta_t (\\delta_t - \\phi_t^T w_t) \\phi_t \\quad (3)$$\n\nwhere $-\\alpha_t \\gamma \\phi'_t (\\phi_t^T w_t)$ is the term for correction of gradient descent direction, and $\\beta_t = \\eta\\alpha_t$, $\\eta > 1$.\n\n3 Proximal Gradient and Saddle-Point First-Order Algorithms\n\nWe now introduce some background material from convex optimization. The proximal mapping\nassociated with a convex function $h$ is defined as:$^1$\n\n$$\\mathrm{prox}_h(x) = \\arg\\min_u \\left( h(u) + \\frac{1}{2} \\|u - x\\|^2 \\right) \\quad (4)$$\n\n$^1$The proximal mapping can be shown to be the resolvent of the subdifferential of the function $h$.\n\n\u2022 $\\Xi$ is a diagonal matrix whose entries $\\xi(s)$ are given by a positive probability distribution\nover states. $\\Pi = \\Phi(\\Phi^T \\Xi \\Phi)^{-1} \\Phi^T \\Xi$ is the weighted least-squares projection operator.\n\u2022 A square root of $A$ is a matrix $B$ satisfying $B^2 = A$ and $B$ is denoted as $A^{1/2}$. Note that\n$A^{1/2}$ may not be unique.\n\u2022 $[\\cdot,\\cdot]$ is a row vector, and $[\\cdot;\\cdot]$ is a column vector.\n\u2022 For the t-th sample, $\\phi_t$ (the t-th row of $\\Phi$), $\\phi'_t$ (the t-th row of $\\Phi'$) are the feature vectors\ncorresponding to $s_t$, $s'_t$, respectively. $\\theta_t$ is the coefficient vector for t-th sample in first-order\nTD learning methods, and $\\delta_t = (r_t + \\gamma {\\phi'_t}^T \\theta_t) - \\phi_t^T \\theta_t$ is the temporal difference\nerror. Also, $x_t = [w_t; \\theta_t]$, $\\alpha_t$ is a stepsize, $\\beta_t = \\eta\\alpha_t$, $\\eta > 0$.\n\u2022 $m, n$ are conjugate numbers if $\\frac{1}{m} + \\frac{1}{n} = 1$, $m \\ge 1$, $n \\ge 1$. $\\|x\\|_m = (\\sum_j |x_j|^m)^{1/m}$ is\nthe m-norm of vector $x$.\n\u2022 $\\rho$ is the l1 regularization parameter, $\\lambda$ is the eligibility trace factor, $N$ is the sample size, $d$\nis the number of basis functions, $p$ is the number of active basis functions.\n\nFigure 1: Notation used in this paper.\n\nIn the case of $h(x) = \\rho\\|x\\|_1$ ($\\rho > 0$), which is particularly important for sparse feature selection,\nthe proximal operator turns out to be the soft-thresholding operator $S_\\rho(\\cdot)$, which is an entry-wise\nshrinkage operator:\n\n$$\\mathrm{prox}_h(x)_i = S_\\rho(x_i) = \\max(x_i - \\rho, 0) - \\max(-x_i - \\rho, 0) \\quad (5)$$\n\nwhere $i$ is the index, and $\\rho$ is a threshold. With this background, we now introduce the proximal\ngradient method. 
If the optimization problem is\n\n$$x^* = \\arg\\min_{x \\in X} (f(x) + h(x)) \\quad (6)$$\n\nwherein $f(x)$ is a convex and differentiable loss function and the regularization term $h(x)$ is convex,\npossibly non-differentiable and computing $\\mathrm{prox}_h$ is not expensive, then computation of (6) can be\ncarried out using the proximal gradient method:\n\n$$x_{t+1} = \\mathrm{prox}_{\\alpha_t h}(x_t - \\alpha_t \\nabla f(x_t)) \\quad (7)$$\n\nwhere $\\alpha_t > 0$ is a (decaying) stepsize, a constant or it can be determined by line search.\n\n3.1 Convex-concave Saddle-Point First Order Algorithms\n\nThe key novel contribution of our paper is a convex-concave saddle-point formulation for regularized\noff-policy TD learning. A convex-concave saddle-point problem is formulated as follows. Let\n$x \\in X$, $y \\in Y$, where $X, Y$ are both nonempty bounded closed convex sets, and $f(x) : X \\to \\mathbb{R}$\nbe a convex function. If there exists a function $\\varphi(\\cdot,\\cdot)$ such that $f(x)$ can be represented as\n$f(x) := \\sup_{y \\in Y} \\varphi(x, y)$, then the pair $(\\varphi, Y)$ is referred to as the saddle-point representation of $f$.\nThe optimization problem of minimizing $f$ over $X$ is converted into an equivalent convex-concave\nsaddle-point problem $SadVal = \\inf_{x \\in X} \\sup_{y \\in Y} \\varphi(x, y)$ of $\\varphi$ on $X \\times Y$. If $f$ is non-smooth yet convex\nand well structured, which is not suitable for many existing optimization approaches requiring\nsmoothness, its saddle-point representation $\\varphi$ is often smooth and convex. Convex-concave\nsaddle-point problems are, therefore, usually better suited for first-order methods [6]. A comprehensive\noverview on extending convex minimization to convex-concave saddle-point problems with\nunified variational inequalities is presented in [1]. As an example, consider $f(x) = \\|Ax - b\\|_m$\nwhich admits a bilinear minimax representation\n\n$$f(x) := \\|Ax - b\\|_m = \\max_{\\|y\\|_n \\le 1} y^T (Ax - b) \\quad (8)$$\n\nwhere $m, n$ are conjugate numbers. Using the approach in [13], Equation (8) can be solved as\n\n$$x_{t+1} = x_t - \\alpha_t A^T y_t, \\quad y_{t+1} = \\Pi_n(y_t + \\alpha_t (A x_t - b)) \\quad (9)$$\n\nwhere $\\Pi_n$ is the projection operator of $y$ onto the unit $l_n$-ball $\\|y\\|_n \\le 1$, which is defined as\n\n$$\\Pi_n(y) = \\min(1, 1/\\|y\\|_n)\\, y, \\; n = 2, 3, \\cdots, \\quad \\Pi_\\infty(y_i) = \\min(1, 1/|y_i|)\\, y_i \\quad (10)$$\n\nand $\\Pi_\\infty$ is an entrywise operator.\n\n4 Regularized Off-policy Convergent TD-Learning\n\nWe now describe a novel algorithm, regularized off-policy convergent TD-learning (RO-TD), which\ncombines off-policy convergence and scalability to large feature spaces. The objective function\nis proposed based on the linear equation formulation of the TDC algorithm. Then the objective\nfunction is represented via its dual minimax problem. The RO-TD algorithm is proposed based on\nthe primal-dual subgradient saddle-point algorithm, and inspired by related methods in [12],[13].\n\n4.1 Objective Function of Off-policy TD Learning\n\nIn this subsection, we describe the objective function of the regularized off-policy RL problem. We\nfirst formulate the two updates of $\\theta_t$, $w_t$ into a single iteration by rearranging the two equations\nin (3) as $x_{t+1} = x_t - \\alpha_t (A_t x_t - b_t)$, where $x_t = [w_t; \\theta_t]$,\n\n$$A_t = \\begin{bmatrix} \\eta\\phi_t\\phi_t^T & \\eta\\phi_t(\\phi_t - \\gamma\\phi'_t)^T \\\\ \\gamma\\phi'_t\\phi_t^T & \\phi_t(\\phi_t - \\gamma\\phi'_t)^T \\end{bmatrix}, \\quad b_t = \\begin{bmatrix} \\eta r_t \\phi_t \\\\ r_t \\phi_t \\end{bmatrix} \\quad (11)$$\n\nFollowing [20], the TDC algorithm solution follows from the linear equation $Ax = b$, where\n\n$$A = E[A_t], \\quad b = E[b_t], \\quad x = [w; \\theta] \\quad (12)$$\n\nThere are some issues regarding the objective function, which arise from the online convex optimization\nand reinforcement learning perspectives, respectively. The first concern is that the objective\nfunction should be convex and stochastically solvable. 
Note that $A$, $A_t$ are neither PSD nor symmetric,\nand it is not straightforward to formulate a convex objective function based on them. The second\nconcern is that since we do not have knowledge of $A$, the objective function should be separable so\nthat it is stochastically solvable based on $A_t$, $b_t$. The other concern regards the sampling condition\nin temporal difference learning: double-sampling. As pointed out in [19], double-sampling is a\nnecessary condition to obtain an unbiased estimator if the objective function is the Bellman residual\nor its derivatives (such as projected Bellman residual), wherein the product of Bellman error or\nprojected Bellman error metrics are involved. To overcome this sampling condition constraint, the\nproduct of TD errors should be avoided in the computation of gradients. Consequently, based on the\nlinear equation formulation in (12) and the requirement on the objective function discussed above,\nwe propose the regularized loss function as\n\n$$L(x) = \\|Ax - b\\|_m + h(x) \\quad (13)$$\n\nHere we also enumerate some intuitive objective functions and give a brief analysis on the reasons\nwhy they are not suitable for regularized off-policy first-order TD learning. One intuitive idea is\nto add a sparsity penalty on MSPBE, i.e., $L(\\theta) = \\mathrm{MSPBE}(\\theta) + \\rho\\|\\theta\\|_1$. Because of the l1 penalty\nterm, the solution to $\\nabla L = 0$ does not have an analytical form and is thus difficult to compute.\nThe second intuition is to use the online least squares formulation of the linear equation $Ax = b$.\nHowever, since $A$ is not symmetric and positive semi-definite (PSD), $A^{1/2}$ does not exist and thus\n$Ax = b$ cannot be reformulated as $\\min_{x \\in X} \\|A^{1/2} x - A^{-1/2} b\\|_2^2$. Another possible idea is to attempt\nto find an objective function whose gradient is exactly $A_t x_t - b_t$ and thus the regularized gradient\nis $\\mathrm{prox}_{\\alpha_t h(x_t)}(A_t x_t - b_t)$. However, since $A_t$ is not symmetric, this gradient does not explicitly\ncorrespond to any kind of optimization problem, not to mention a convex one$^2$.\n\n$^2$Note that the $A$ matrix in GTD2\u2019s linear equation representation is symmetric, yet is not PSD, so it cannot\nbe formulated as a convex problem.\n\n4.2 RO-TD Algorithm Design\n\nIn this section, the problem of (13) is formulated as a convex-concave saddle-point problem, and the\nRO-TD algorithm is proposed. Analogous to (8), the regularized loss function can be formulated as\n\n$$\\|Ax - b\\|_m + h(x) = \\max_{\\|y\\|_n \\le 1} y^T (Ax - b) + h(x) \\quad (14)$$\n\nSimilar to (9), Equation (14) can be solved via an iteration procedure as follows, where $x_t = [w_t; \\theta_t]$.\n\n$$x_{t+\\frac{1}{2}} = x_t - \\alpha_t A_t^T y_t, \\quad y_{t+\\frac{1}{2}} = y_t + \\alpha_t (A_t x_t - b_t)$$\n$$x_{t+1} = \\mathrm{prox}_{\\alpha_t h}(x_{t+\\frac{1}{2}}), \\quad y_{t+1} = \\Pi_n(y_{t+\\frac{1}{2}}) \\quad (15)$$\n\nThe averaging step, which plays a crucial role in stochastic optimization convergence, generates the\napproximate saddle-points [6, 12]\n\n$$\\bar{x}_t = \\left( \\sum_{i=0}^t \\alpha_i \\right)^{-1} \\sum_{i=0}^t \\alpha_i x_i, \\quad \\bar{y}_t = \\left( \\sum_{i=0}^t \\alpha_i \\right)^{-1} \\sum_{i=0}^t \\alpha_i y_i \\quad (16)$$\n\nDue to the computation of $A_t$ in (15) at each iteration, the computation cost appears to be $O(N d^2)$,\nwhere $N, d$ are defined in Figure 1. However, the computation cost is actually $O(N d)$ with a linear\nalgebraic trick by computing not $A_t$ but $y_t^T A_t$, $A_t x_t - b_t$. Denoting $y_t = [y_{1,t}; y_{2,t}]$, where $y_{1,t}; y_{2,t}$\nare column vectors of equal length, we have\n\n$$y_t^T A_t = \\left[ \\eta\\phi_t^T (y_{1,t}^T \\phi_t) + \\gamma\\phi_t^T (y_{2,t}^T \\phi'_t), \\;\\; (\\phi_t - \\gamma\\phi'_t)^T (\\eta y_{1,t}^T + y_{2,t}^T) \\phi_t \\right] \\quad (17)$$\n\n$A_t x_t - b_t$ can be computed according to Equation (3) as follows:\n\n$$A_t x_t - b_t = \\left[ -\\eta(\\delta_t - \\phi_t^T w_t)\\phi_t; \\;\\; \\gamma(\\phi_t^T w_t)\\phi'_t - \\delta_t\\phi_t \\right] \\quad (18)$$\n\nBoth (17) and (18) are of linear computation complexity. Now we are ready to present the RO-TD\nalgorithm:\n\nAlgorithm 1 RO-TD\nLet $\\pi$ be some fixed policy of an MDP $M$, and let the sample set $S = \\{s_i, r_i, s'_i\\}_{i=1}^N$. Let $\\Phi$ be\nsome fixed basis.\n1: repeat\n2: Compute $\\phi_t$, $\\phi'_t$ and TD error $\\delta_t = (r_t + \\gamma {\\phi'_t}^T \\theta_t) - \\phi_t^T \\theta_t$\n3: Compute $y_t^T A_t$, $A_t x_t - b_t$ in Equation (17) and (18).\n4: Compute $x_{t+1}$, $y_{t+1}$ as in Equation (15)\n5: Set $t \\leftarrow t + 1$;\n6: until $t = N$;\n7: Compute $\\bar{x}_N$, $\\bar{y}_N$ as in Equation (16) with $t = N$\n\nThere are some design details of the algorithm to be elaborated. First, the regularization term $h(x)$\ncan be any kind of convex regularization, such as ridge regression or sparsity penalty $\\rho\\|x\\|_1$. In case\nof $h(x) = \\rho\\|x\\|_1$, $\\mathrm{prox}_{\\alpha_t h}(\\cdot) = S_{\\alpha_t \\rho}(\\cdot)$. In real applications the sparsification requirement on $\\theta$\nand auxiliary variable $w$ may be different, i.e., $h(x) = \\rho_1\\|\\theta\\|_1 + \\rho_2\\|w\\|_1$, $\\rho_1 \\ne \\rho_2$. In that case, one can simply\nreplace the uniform soft thresholding $S_{\\alpha_t \\rho}$ by two separate soft thresholding operations $S_{\\alpha_t \\rho_1}$, $S_{\\alpha_t \\rho_2}$\nand thus the third equation in (15) is replaced by the following,\n\n$$x_{t+\\frac{1}{2}} = \\left[ w_{t+\\frac{1}{2}}; \\theta_{t+\\frac{1}{2}} \\right], \\quad \\theta_{t+1} = S_{\\alpha_t \\rho_1}(\\theta_{t+\\frac{1}{2}}), \\quad w_{t+1} = S_{\\alpha_t \\rho_2}(w_{t+\\frac{1}{2}}) \\quad (19)$$\n\nAnother concern is the choice of conjugate numbers $(m, n)$. For ease of computing $\\Pi_n$, we use\n$(2, 2)$ (l2 fit), $(+\\infty, 1)$ (uniform fit) or $(1, +\\infty)$. $m = n = 2$ is used in the experiments below.\n\n4.3 RO-GQ(\u03bb) Design\n\nGQ(\u03bb) [10] is a generalization of the TDC algorithm with eligibility traces and off-policy learning\nof temporally abstract predictions, where the gradient update changes from Equation (3) to\n\n$$\\theta_{t+1} = \\theta_t + \\alpha_t [\\delta_t e_t - \\gamma(1 - \\lambda)(w_t^T e_t) \\bar{\\phi}_{t+1}], \\quad w_{t+1} = w_t + \\beta_t (\\delta_t e_t - (w_t^T \\phi_t) \\phi_t) \\quad (20)$$\n\nThe central element is to extend the MSPBE function to the case where it incorporates eligibility\ntraces. The objective function and corresponding linear equation component $A_t$, $b_t$ can be written\nas follows:\n\n$$L(\\theta) = \\|\\Phi\\theta - \\Pi T^{\\pi\\lambda} \\Phi\\theta\\|^2_\\Xi \\quad (21)$$\n\n$$A_t = \\begin{bmatrix} \\eta\\phi_t\\phi_t^T & \\eta e_t(\\phi_t - \\gamma\\bar{\\phi}_{t+1})^T \\\\ \\gamma(1 - \\lambda)\\bar{\\phi}_{t+1} e_t^T & e_t(\\phi_t - \\gamma\\bar{\\phi}_{t+1})^T \\end{bmatrix}, \\quad b_t = \\begin{bmatrix} \\eta r_t e_t \\\\ r_t e_t \\end{bmatrix} \\quad (22)$$\n\nSimilar to Equation (17) and (18), the computation of $y_t^T A_t$, $A_t x_t - b_t$ is\n\n$$y_t^T A_t = \\left[ \\eta\\phi_t^T (y_{1,t}^T \\phi_t) + \\gamma(1 - \\lambda) e_t^T (y_{2,t}^T \\bar{\\phi}_{t+1}), \\;\\; (\\phi_t - \\gamma\\bar{\\phi}_{t+1})^T (\\eta y_{1,t}^T + y_{2,t}^T) e_t \\right]$$\n$$A_t x_t - b_t = \\left[ -\\eta(\\delta_t e_t - \\phi_t^T w_t \\phi_t); \\;\\; \\gamma(1 - \\lambda)(e_t^T w_t) \\bar{\\phi}_{t+1} - \\delta_t e_t \\right] \\quad (23)$$\n\nwhere eligibility traces $e_t$, and $\\bar{\\phi}_t$, $T^{\\pi\\lambda}$ are defined in [10]. Algorithm 2, RO-GQ(\u03bb), extends the\nRO-TD algorithm to include eligibility traces.\n\nAlgorithm 2 RO-GQ(\u03bb)\nLet $\\pi$ and $\\Phi$ be as defined in Algorithm 1. 
Starting from $s_0$.\n1: repeat\n2: Compute $\\phi_t$, $\\bar{\\phi}_{t+1}$ and TD error $\\delta_t = (r_t + \\gamma \\bar{\\phi}_{t+1}^T \\theta_t) - \\phi_t^T \\theta_t$\n3: Compute $y_t^T A_t$, $A_t x_t - b_t$ in Equation (23).\n4: Compute $x_{t+1}$, $y_{t+1}$ as in Equation (15)\n5: Choose action $a_t$, and get $s_{t+1}$\n6: Set $t \\leftarrow t + 1$;\n7: until $s_t$ is an absorbing state;\n8: Compute $\\bar{x}_t$, $\\bar{y}_t$ as in Equation (16)\n\n4.4 Extension\n\nIt is also worth noting that there exists another formulation of the loss function different from Equation\n(13) with the following convex-concave formulation as in [14, 6],\n\n$$\\min_x \\frac{1}{2} \\|Ax - b\\|_2^2 + \\rho\\|x\\|_1 = \\max_{\\|A^T y\\|_\\infty \\le 1} \\left( b^T y - \\frac{\\rho}{2} y^T y \\right) = \\min_x \\max_{\\|u\\|_\\infty \\le 1, y} \\left( x^T u + y^T (Ax - b) - \\frac{\\rho}{2} y^T y \\right) \\quad (24)$$\n\nwhich can be solved iteratively without the proximal gradient step as follows, which serves as a\ncounterpart of Equation (15),\n\n$$x_{t+1} = x_t - \\alpha_t \\rho (u_t + A_t^T y_t), \\quad y_{t+1} = y_t + \\frac{\\alpha_t}{\\rho} (A_t x_t - b_t - \\rho y_t)$$\n$$u_{t+\\frac{1}{2}} = u_t + \\frac{\\alpha_t}{\\rho} x_t, \\quad u_{t+1} = \\Pi_\\infty(u_{t+\\frac{1}{2}}) \\quad (25)$$\n\n5 Convergence Analysis of RO-TD\n\nAssumption 1 (MDP)[20]: The underlying Markov Reward Process (MRP) $M = (S, P, R, \\gamma)$ is finite\nand mixing, with stationary distribution $\\pi$. Assume that $\\exists$ a scalar $R_{max}$ such that $Var[r_t|s_t] \\le R_{max}$\nholds w.p.1.\nAssumption 2 (Basis Function)[20]: $\\Phi$ is a full column rank matrix, namely, $\\Phi$ comprises a linear\nindependent set of basis functions w.r.t all sample states in sample set $S$. Also, assume the features\n$(\\phi_t, \\phi'_t)$ have uniformly bounded second moments. Finally, if $(s_t, a_t, s'_t)$ is an i.i.d sequence,\n$\\forall t, \\|\\phi_t\\|_\\infty < +\\infty, \\|\\phi'_t\\|_\\infty < +\\infty$.\nAssumption 3 (Subgradient Boundedness)[12]: Assume for the bilinear convex-concave loss\nfunction defined in (14), the sets $X, Y$ are closed compact sets. Then the subgradient $y_t^T A_t$ and\n$A_t x_t - b_t$ in RO-TD algorithm are uniformly bounded, i.e., there exists a constant $L$ such that\n\n$$\\|A_t x_t - b_t\\| \\le L, \\quad \\|y_t^T A_t\\| \\le L.$$\n\nProposition 1: The approximate saddle-point $\\bar{x}_t$ of RO-TD converges w.p.1 to the global minimizer\nof the following,\n\n$$x^* = \\arg\\min_{x \\in X} \\|Ax - b\\|_m + \\rho\\|x\\|_1 \\quad (26)$$\n\nProof Sketch: See the supplementary material for details.\n\n6 Empirical Results\n\nWe now demonstrate the effectiveness of the RO-TD algorithm against other algorithms across a\nnumber of benchmark domains. LARS-TD [7], which is a popular second-order sparse reinforcement\nlearning algorithm, is used as the baseline algorithm for feature selection and TDC is used as\nthe off-policy convergent RL baseline algorithm, respectively.\n\nFigure 2: Illustrative examples of the convergence of RO-TD using the Star and Random-walk\nMDPs.\n\n6.1 MSPBE Minimization and Off-Policy Convergence\n\nThis experiment aims to show the minimization of MSPBE and off-policy convergence of the RO-TD\nalgorithm. The 7 state star MDP is a well known counterexample where TD diverges monotonically\nand TDC converges. It consists of 7 states and the reward w.r.t any transition is zero. Because\nof this, the star MDP is unsuitable for LSTD-based algorithms, including LARS-TD since $\\Phi^T R = 0$\nalways holds. The random-walk problem is a standard Markov chain with 5 states and two absorbing\nstates at the two ends. Three sets of different bases $\\Phi$ are used in [20], which are tabular features,\ninverted features and dependent features respectively. An identical experiment setting to [20] is used\nfor these two domains. The regularization term $h(x)$ is set to 0 to make a fair comparison with TD\nand TDC. $\\alpha = 0.01$, $\\eta = 10$ for TD, TDC and RO-TD. 
The comparison of TD, TDC and RO-TD\nis shown in the left subfigure of Figure 2, where TDC and RO-TD have almost identical MSPBE\nover iterations. The middle subfigure shows the value of $y_t^T (A x_t - b)$ and $\\|A x_t - b\\|_2$, wherein\n$\\|A x_t - b\\|_2$ is always greater than the value of $y_t^T (A x_t - b)$. Note that for this problem, the Slater\ncondition is satisfied so there is no duality gap between the two curves. As the result shows, TDC\nand RO-TD perform equally well, which illustrates the off-policy convergence of the RO-TD algorithm.\nThe result of random-walk chain is averaged over 50 runs. The rightmost subfigure of Figure\n2 shows that RO-TD is able to reduce MSPBE over successive iterations w.r.t three different basis\nfunctions.\n\n6.2 Feature Selection\n\nIn this section, we use the mountain car example with a variety of bases to show the feature selection\ncapability of RO-TD. The Mountain car MDP is an optimal control problem with a continuous two-dimensional\nstate space. The steep discontinuity in the value function makes learning difficult for\nbases with global support. To make a fair comparison, we use the same basis function setting as in\n[7], where two dimensional grids of 2, 4, 8, 16, 32 RBFs are used so that there are 1365 basis\nfunctions in total. For LARS-TD, 500 samples are used. For RO-TD and TDC, 3000 samples are used by\nexecuting 15 episodes with 200 steps for each episode, stepsize $\\alpha_t = 0.001$, and $\\rho_1 = 0.01$, $\\rho_2 = 0.2$.\nWe use the result of LARS-TD and l2 LSTD reported in [7]. As the result shows in Table 1,\nRO-TD is able to perform feature selection successfully, whereas TDC and TD failed. 
It is worth\nnoting that comparing the performance of RO-TD and LARS-TD is not the focus of this paper since\nLARS-TD is not convergent off-policy and RO-TD\u2019s performance can be further optimized using\nthe mirror-descent approach with the Mirror-Prox algorithm [6] which incorporates mirror descent\nwith an extragradient [9], as discussed below.\n\nAlgorithm | LARS-TD | RO-TD | l2 LSTD | TDC | TD\nSuccess(20/20) | 100% | 100% | 0% | 0% | 0%\nSteps | 142.25 \u00b1 9.74 | 147.40 \u00b1 13.31 | - | - | -\n\nTable 1: Comparison of LARS-TD, RO-TD, l2 LSTD, TDC and TD\n\n[Figure 2 panels: Comparison of MSPBE (MSPBE vs. sweeps for TD, TDC and RO-TD); $\\|Ax - b\\|_2$ and $y^T(Ax - b)$ vs. sweeps; MSPBE Minimization (MSPBE vs. sweeps for the Inverted, Tabular and Dependent features)]\n\nExperiment\\Method | RO-GQ(\u03bb) | GQ(\u03bb) | LARS-TD\nExperiment 1 | 6.9 \u00b1 4.82 | 11.3 \u00b1 9.58 | -\nExperiment 2 | 14.7 \u00b1 10.70 | 27.2 \u00b1 6.52 | -\n\nTable 2: Comparison of RO-GQ(\u03bb), GQ(\u03bb), and LARS-TD on Triple-Link Inverted Pendulum Task\nshowing minimum number of learning episodes.\n\n6.3 High-dimensional Under-actuated Systems\n\nThe triple-link inverted pendulum [18] is a highly nonlinear under-actuated system with 8-dimensional\nstate space and discrete action space. The state space consists of the angles and angular\nvelocity of each arm as well as the position and velocity of the car. The discrete action space is\n{0, 5 Newton, \u22125 Newton}. The goal is to learn a policy that can balance the arms for $N_x$ steps\nwithin some minimum number of learning episodes. The allowed maximum number of episodes\nis 300. The pendulum initiates from zero equilibrium state and the first action is randomly chosen\nto push the pendulum away from initial state. We test the performance of RO-GQ(\u03bb), GQ(\u03bb) and\nLARS-TD. 
Two experiments are conducted with $N_x$ = 10,000 and 100,000, respectively. Fourier\nbasis [8] with order 2 is used, resulting in 6561 basis functions. Table 2 shows the results of this\nexperiment, where RO-GQ(\u03bb) performs better than other approaches, especially in Experiment 2,\nwhich is a harder task. LARS-TD failed in this domain, which is mainly due not to LARS-TD itself\nbut to the quality of samples collected via random walk.\nTo sum up, RO-GQ(\u03bb) tends to outperform GQ(\u03bb) in all aspects, and is able to outperform LARS-TD\nbased policy iteration in high dimensional domains, as well as in selected smaller MDPs where\nLARS-TD diverges (e.g., the star MDP). It is worth noting that the computation cost of LARS-TD\nis $O(N d p^3)$, whereas that for RO-TD is $O(N d)$. If $p$ is linear or sublinear w.r.t $d$, RO-TD has a\nsignificant advantage over LARS-TD. However, compared with LARS-TD, RO-TD requires fine\ntuning the parameters of $\\alpha_t$, $\\rho_1$, $\\rho_2$ and is usually not as sample efficient as LARS-TD. We also find\nthat tuning the sparsity parameter $\\rho_2$ generates an interpolation between GQ(\u03bb) and TD learning,\nwhere a large $\\rho_2$ helps eliminate the correction term of TDC update and make the update direction\nmore similar to the TD update.\n\n7 Conclusions\n\nThis paper presents a novel unified framework for designing regularized off-policy convergent RL\nalgorithms combining a convex-concave saddle-point problem formulation for RL with stochastic\nfirst-order methods. A detailed experimental analysis reveals that the proposed RO-TD algorithm\nis both off-policy convergent and robust to noisy features. There are many interesting future\ndirections for this research. One direction for future work is to extend the subgradient saddle-point\nsolver to a more generalized mirror descent framework. 
Mirror descent is a generalization of\nsubgradient descent with non-Euclidean distance [1], and has many advantages over gradient descent\nin high-dimensional spaces. In [6], two algorithms to solve the bilinear saddle-point formulation are\nproposed based on mirror descent and the extragradient [9], such as the Mirror-Prox algorithm. [6]\nalso points out that the Mirror-Prox algorithm may be further optimized via randomization. To scale\nto larger MDPs, it is possible to design SMDP-based mirror-descent methods as well.\n\nAcknowledgments\n\nThis material is based upon work supported by the Air Force Office of Scientific Research (AFOSR)\nunder grant FA9550-10-1-0383, and the National Science Foundation under Grant Nos. NSF CCF-1025120,\nIIS-0534999, IIS-0803288, and IIS-1216467. Any opinions, findings, and conclusions or\nrecommendations expressed in this material are those of the authors and do not necessarily reflect\nthe views of the AFOSR or the NSF. We thank M. F. Duarte for helpful discussions.\n\nReferences\n[1] A. Ben-Tal and A. Nemirovski. Non-Euclidean restricted memory level method for large-scale convex optimization. Mathematical Programming, 102(3):407\u2013456, 2005.\n\n[2] T. Degris, M. White, and R. S. Sutton. Linear off-policy actor-critic. In International Conference on Machine Learning, 2012.\n\n[3] M. Geist, B. Scherrer, A. Lazaric, and M. Ghavamzadeh. A Dantzig Selector Approach to Temporal Difference Learning. In International Conference on Machine Learning, 2012.\n\n[4] M. Ghavamzadeh, A. Lazaric, R. Munos, and M. Hoffman. Finite-Sample Analysis of Lasso-TD. In Proceedings of the 28th International Conference on Machine Learning, 2011.\n\n[5] J. Johns, C. Painter-Wakefield, and R. Parr. Linear complementarity for regularized policy evaluation and improvement. In Proceedings of the International Conference on Neural Information Processing Systems, 2010.\n\n[6] A. Juditsky and A. Nemirovski. 
Optimization for Machine Learning, chapter First-Order Methods\nfor Nonsmooth Convex Large-Scale Optimization. MIT Press, 2011.\n\n[7] J. Zico Kolter and A. Y. Ng. Regularization and feature selection in least-squares temporal\ndifference learning. In Proceedings of 27th International Conference on Machine Learning,\n2009.\n\n[8] G. Konidaris, S. Osentoski, and P. S. Thomas. Value function approximation in reinforcement\nlearning using the Fourier basis. In Proceedings of the Twenty-Fifth Conference on Artificial\nIntelligence, 2011.\n\n[9] G. M. Korpelevich. The extragradient method for finding saddle points and other problems.\n1976.\n\n[10] H. R. Maei and R. S. Sutton. GQ(\u03bb): A general gradient algorithm for temporal-difference\nprediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial\nGeneral Intelligence, pages 91\u201396, 2010.\n\n[11] S. Mahadevan and B. Liu. Sparse Q-learning with Mirror Descent. In Proceedings of the\nConference on Uncertainty in AI, 2012.\n\n[12] A. Nedi\u0107 and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of\nOptimization Theory and Applications, 142(1):205\u2013228, 2009.\n\n[13] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach\nto stochastic programming. SIAM Journal on Optimization, 19:1574\u20131609, 2009.\n\n[14] Y. Nesterov. Gradient methods for minimizing composite objective function.\nwww.optimization-online.org, 2007.\n\n[15] C. Painter-Wakefield and R. Parr. Greedy algorithms for sparse reinforcement learning. In\nInternational Conference on Machine Learning, 2012.\n\n[16] C. Painter-Wakefield and R. Parr. L1 regularized linear temporal difference learning. Technical\nreport, Duke CS Technical Report TR-2012-01, 2012.\n\n[17] M. Petrik, G. Taylor, R. Parr, and S. Zilberstein. 
Feature selection using regularization in approximate\nlinear programs for Markov decision processes. In Proceedings of the International\nConference on Machine Learning (ICML), 2010.\n\n[18] J. Si and Y. Wang. Online learning control by association and reinforcement. IEEE\nTransactions on Neural Networks, 12:264\u2013276, 2001.\n\n[19] R. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.\n\n[20] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesv\u00e1ri, and E. Wiewiora.\nFast gradient-descent methods for temporal-difference learning with linear function\napproximation. In International Conference on Machine Learning, pages 993\u20131000, 2009.\n\n[21] J. Zico Kolter. The Fixed Points of Off-Policy TD. In Advances in Neural Information\nProcessing Systems 24, pages 2169\u20132177, 2011.", "award": [], "sourceid": 394, "authors": [{"given_name": "Bo", "family_name": "Liu", "institution": null}, {"given_name": "Sridhar", "family_name": "Mahadevan", "institution": null}, {"given_name": "Ji", "family_name": "Liu", "institution": null}]}