{"title": "Convergent Fitted Value Iteration with Linear Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 2537, "page_last": 2545, "abstract": "Fitted value iteration (FVI) with ordinary least squares regression is known to diverge. We present a new method, \"Expansion-Constrained Ordinary Least Squares\" (ECOLS), that produces a linear approximation but also guarantees convergence when used with FVI. To ensure convergence, we constrain the least squares regression operator to be a non-expansion in the infinity-norm. We show that the space of function approximators that satisfy this constraint is more rich than the space of \"averagers,\" we prove a minimax property of the ECOLS residual error, and we give an efficient algorithm for computing the coefficients of ECOLS based on constraint generation. We illustrate the algorithmic convergence of FVI with ECOLS in a suite of experiments, and discuss its properties.", "full_text": "Convergent Fitted Value Iteration\n\nwith Linear Function Approximation\n\nDaniel J. Lizotte\n\nDavid R. Cheriton School of Computer Science\n\nUniversity of Waterloo\n\nWaterloo, ON N2L 3G1 Canada\ndlizotte@uwaterloo.ca\n\nAbstract\n\nFitted value iteration (FVI) with ordinary least squares regression is known to\ndiverge. We present a new method, \u201cExpansion-Constrained Ordinary Least\nSquares\u201d (ECOLS), that produces a linear approximation but also guarantees con-\nvergence when used with FVI. To ensure convergence, we constrain the least\nsquares regression operator to be a non-expansion in the \u221e-norm. We show that\nthe space of function approximators that satisfy this constraint is more rich than\nthe space of \u201caveragers,\u201d we prove a minimax property of the ECOLS residual\nerror, and we give an ef\ufb01cient algorithm for computing the coef\ufb01cients of ECOLS\nbased on constraint generation. 
We illustrate the algorithmic convergence of FVI\nwith ECOLS in a suite of experiments, and discuss its properties.\n\n1\n\nIntroduction\n\nFitted value iteration (FVI), both in the model-based [4] and model-free [5, 15, 16, 17] settings, has\nbecome a method of choice for various applied batch reinforcement learning problems. However, it\nis known that depending on the function approximation scheme used, \ufb01tted value iteration can and\ndoes diverge in some settings. This is particularly problematic\u2014and easy to illustrate\u2014when using\nlinear regression as the function approximator. The problem of divergence in FVI has been clearly\nillustrated in several settings [2, 4, 8, 22]. Gordon [8] proved that the class of averagers\u2013a very\nsmooth class of function approximators\u2013can safely be used with FVI. Further interest in batch RL\nmethods then led to work that uses non-parametric function approximators with FVI to avoid diver-\ngence [5, 15, 16, 17]. This has left a gap in the \u201cmiddle ground\u201d of function approximator choices\nthat guarantee convergence\u2013we would like to have a function approximator that is more \ufb02exible than\nthe averagers but more easily interpreted than the non-parametric approximators. In many scienti\ufb01c\napplications, linear regression is a natural choice because of its simplicity and interpretability when\nused with a small set of scienti\ufb01cally meaningful state features. For example, in a medical setting,\none may want to base a value function on patient features that are hypothesized to impact a long-term\nclinical outcome [19]. This enables scientists to interpret the parameters of an optimal learned value\nfunction as evidence for or against the importance of these features. Thus for this work, we restrict\nour attention to linear function approximation, and ensure algorithmic convergence to a \ufb01xed point\nregardless of the generative model of the data. 
This is in contrast to previous work that explores how properties of the underlying MDP and properties of the function approximation space jointly influence convergence of the algorithm [1, 14, 6].\nOur aim is to develop a variant of linear regression that, when used in a fitted value iteration algorithm, guarantees convergence of the algorithm to a fixed point. The contributions of this paper are three-fold: 1) We develop and describe the \u201cExpansion-Constrained Ordinary Least Squares\u201d (ECOLS) approximator. Our approach is to constrain the regression operator to be a non-expansion in the \u221e-norm. We show that the space of function approximators that satisfy this property is richer than the space of averagers [8], and we prove a minimax property on the residual error of the approximator. 2) We give an efficient algorithm for computing the coefficients of ECOLS based on quadratic programming with constraint generation. 3) We verify the algorithmic convergence of fitted value iteration with ECOLS in a suite of experiments and discuss its performance. Finally, we discuss future directions of research and comment on the general problem of learning an interpretable value function and policy from fitted value iteration.\n\n2 Background\nConsider a finite MDP with states S = {1, ..., n}, actions A = {1, ..., |A|}, state transition matrices P^(a) \u2208 R^{n\u00d7n} for each action, a deterministic^1 reward vector r \u2208 R^n, and a discount factor \u03b3 < 1. Let M_{i,:} (M_{:,i}) denote the ith row (column) of a matrix M. The \u201cBellman optimality\u201d operator or \u201cDynamic Programming\u201d operator T is given by\n\n(T v)_i = r_i + max_a \u03b3 P^(a)_{i,:} v    (1)\n\nThe fixed point of T is the optimal value function v\u2217 which satisfies the Bellman equation, T v\u2217 = v\u2217 [3]. From v\u2217 we can recover a policy \u03c0\u2217_i = argmax_a \u03b3 P^(a)_{i,:} v\u2217 that has v\u2217 as its value function.\nAn analogous operator K can be defined for the state-action value function Q \u2208 R^{n\u00d7|A|}.\n\n(KQ)_{i,j} = r_i + \u03b3 P^(j)_{i,:} [max_a Q_{:,a}]    (2)\n\nThe fixed point of K is the optimal state-action value Q\u2217 which satisfies KQ\u2217 = Q\u2217. The value iteration algorithm proceeds by starting with an initial v or Q, and applying T or K repeatedly until convergence, which is guaranteed because both T and K are contraction mappings in the infinity norm [8], as we discuss further below. The above operators assume knowledge of the transition model P^(a) and rewards r. However, K in particular is easily adapted to the case of a batch of n tuples of the form (s_i, a_i, r_i, s\u2032_i) obtained by interaction with the system [5, 15, 16, 17]. In this case, Q is only evaluated at states in our data set, and in MDPs with continuous state, the number of tuples n is analogous from a computational point of view to the size of our state space.\nFitted value iteration [5, 15, 16, 17] (FVI) interleaves either T or K above with a function approximation operator M. For example in the model-based case, the composed operator (M \u25e6 T) is applied repeatedly to an initial guess v_0. FVI has become increasingly popular especially in the field of \u201cbatch-mode Reinforcement Learning\u201d [13, 7] where a policy is learned from a fixed batch of data that was collected by a prior agent. This has particular significance in scientific and medical applications, where ethics concerns prevent the use of current RL methods to interact directly with a trial subject. In these settings, data gathered from controlled trials can still be used to learn good policies [11, 19].\n
Convergence of FVI depends on properties of M, particularly on whether M is a non-expansion in the \u221e-norm, as we discuss below. The main advantage of fitted value iteration is that the cost of computing (M \u25e6 T) can be much lower than that of computing T over all n states in cases where the approximator M only requires computation of elements (T v)_i for a small subset of the state space. If M generalizes well, this enables learning in large finite or continuous state spaces. Another advantage is that M can be chosen to represent the value function in a meaningful way, i.e. in a way that meaningfully relates state variables to expected performance. For example, if M were linear regression and a particular state feature had a positive coefficient in the learned value function, we know that larger values of that state feature are preferable. Linear models are of importance because of their ease of interpretation, but unfortunately, ordinary least squares (OLS) function approximation can cause the successive iterations of FVI to fail to converge. We now examine properties of the approximation operator M that control the algorithmic convergence of FVI.\n\n^1 A noisy reward signal does not alter the analyses that follow, nor does dependence of the reward on action.\n\n3 Non-Expansions and Operator Norms\nWe say M is a linear operator if M y + M y\u2032 = M (y + y\u2032) \u2200y, y\u2032 \u2208 R^p and M 0 = 0. Any linear operator can be represented by a p \u00d7 p matrix of real numbers.\nBy definition, an operator M is a \u03b3-contraction in the q-norm if\n\n\u2203\u03b3 \u2264 1 s.t. ||M y \u2212 M y\u2032||_q \u2264 \u03b3||y \u2212 y\u2032||_q \u2200y, y\u2032 \u2208 R^p    (3)\n\nIf the condition holds only for \u03b3 = 1 then M is called a non-expansion in the q-norm. It is well-known [3, 5, 21] that the operators T and K are \u03b3-contractions in the \u221e-norm.\nThe operator norm of M induced by the q-norm can be defined in several ways, including\n\n||M||_op(q) = sup_{y \u2208 R^p, y \u2260 0} ||M y||_q / ||y||_q.    (4)\n\nLemma 1. A linear operator M is a \u03b3-contraction in the q-norm if and only if ||M||_op(q) \u2264 \u03b3.\n\nProof. If M is linear and is a \u03b3-contraction, we have\n\n||M (y \u2212 y\u2032)||_q \u2264 \u03b3||y \u2212 y\u2032||_q \u2200y, y\u2032 \u2208 R^p.    (5)\n\nBy choosing y\u2032 = 0, it follows that M satisfies\n\n||M z||_q \u2264 \u03b3||z||_q \u2200z \u2208 R^p.    (6)\n\nUsing the definition of || \u00b7 ||_op(q), we have that the following conditions are equivalent:\n\n||M z||_q \u2264 \u03b3||z||_q \u2200z \u2208 R^p    (7)\n||M z||_q / ||z||_q \u2264 \u03b3 \u2200z \u2208 R^p, z \u2260 0    (8)\nsup_{z \u2208 R^p, z \u2260 0} ||M z||_q / ||z||_q \u2264 \u03b3    (9)\n||M||_op(q) \u2264 \u03b3.    (10)\n\nConversely, any M that satisfies (10) satisfies (5) because we can always write y \u2212 y\u2032 = z.\n\nLemma 1 implies that a linear operator M is a non-expansion in the \u221e-norm only if\n\n||M||_op(\u221e) \u2264 1    (11)\n\nwhich is equivalent [18] to:\n\nmax_i \u2211_j |m_ij| \u2264 1    (12)\n\nCorollary 1. The set of all linear operators that satisfy (12) is exactly the set of linear operators that are non-expansions in the \u221e-norm.\nOne subset of operators on R^p that are guaranteed to be non-expansions in the \u221e-norm are the averagers^2, as defined by Gordon [8].\nCorollary 2. The set of all linear operators that satisfy (12) is larger than the set of averagers.\n\nProof. For M to be an averager, it must satisfy\n\nm_ij \u2265 0 \u2200i, j    (13)\nmax_i \u2211_j m_ij \u2264 1.    (14)\n\nThese constraints are stricter than (12), because they impose an additional non-negativity constraint on the elements of M.\n\nWe have shown that restricting M to be a non-expansion is equivalent to imposing the constraint ||M||_op(\u221e) \u2264 1. It is well-known [8] that if such an M is used as a function approximator in fitted value iteration, the algorithm is guaranteed to converge from any starting point because the composition M \u25e6 T is a \u03b3-contraction in the \u221e-norm.\n\n^2 The original definition of an averager was an operator of the form y \u21a6 Ay + b for a constant vector b. For this work we assume b = 0.\n\n4 Expansion-Constrained Ordinary Least Squares\n\nWe now describe our Expansion-Constrained Ordinary Least Squares function approximation method, and show how we enforce that it is a non-expansion in the \u221e-norm.\nSuppose X is an n \u00d7 p design matrix with n > p and rank(X) = p, and suppose y is a vector of regression targets. The usual OLS estimate \u02c6\u03b2 for the model y \u2248 X\u03b2 is given by\n\n\u02c6\u03b2 = argmin_\u03b2 ||X\u03b2 \u2212 y||_2    (15)\n   = (X^T X)^{\u22121} X^T y.    (16)\n\nThe predictions made by the model at the points in X, i.e., the estimates of y, are given by\n\n\u02c6y = X \u02c6\u03b2 = X (X^T X)^{\u22121} X^T y = H y    (17)\n\nwhere H is the \u201chat\u201d matrix because it \u201cputs the hat\u201d on y. The ith element of \u02c6y is a linear combination of the elements of y, with weights given by the ith row of H. These weights sum to one, and may be positive or negative.\n
Note that H is a projection of y onto the column space of X, and has 1 as an eigenvalue with multiplicity rank(X), and 0 as an eigenvalue with multiplicity (n \u2212 rank(X)). It is known [18] that for a linear operator M, ||M||_op(2) is given by the largest singular value of M. It follows that ||H||_op(2) \u2264 1 and, by Lemma 1, H is a non-expansion in the 2-norm. However, depending on the data X, we may not have ||H||_op(\u221e) \u2264 1, in which case H will not be a non-expansion in the \u221e-norm. The \u221e-norm expansion property of H is problematic when using linear function approximation for fitted value iteration, as we described earlier.\nIf one wants to use linear regression safely within a value-iteration algorithm, it is natural to consider constraining the least-squares problem so that the resulting hat matrix is an \u221e-norm non-expansion. Consider the following optimization problem:\n\n\u00afW = argmin_W ||X W X^T y \u2212 y||_2\ns.t. ||X W X^T||_op(\u221e) \u2264 1, W \u2208 R^{p\u00d7p}, W = W^T.    (18)\n\nThe symmetric matrix W is of size p \u00d7 p, so we have a quadratic objective with a convex norm constraint on X W X^T, resulting in a hat matrix \u00afH = X \u00afW X^T. If the problem were unconstrained, we would have \u00afW = (X^T X)^{\u22121}, \u00afH = H and \u00af\u03b2 = \u00afW X^T y = \u02c6\u03b2, the original OLS parameter estimate.\nThe matrix \u00afH is a non-expansion by construction. However, unlike the OLS hat matrix H = X (X^T X)^{\u22121} X^T, the matrix \u00afH depends on the targets y. That is, given a different set of regression targets, we would compute a different \u00afH. We should therefore more properly write this non-linear operator as \u00afH_y.\n
Because of the non-linearity, the operator \u00afH_y resulting from the minimization in (18) can in fact be an expansion in the \u221e-norm despite the constraints.\nWe now show how we might remove the dependence on y from (18) so that the resulting operator is a linear non-expansion in the op(\u221e)-norm. Consider the following optimization problem:\n\n\u02c7W = argmin_W max_z ||X W X^T z \u2212 z||_2\ns.t. ||X W X^T||_op(\u221e) \u2264 1, ||z||_2 = c, W \u2208 R^{p\u00d7p}, W = W^T, z \u2208 R^n    (19)\n\nIntuitively, the resulting \u02c7W is a linear operator of the form X \u02c7W X^T that minimizes the squared error between its approximation \u02c7z and the worst-case (bounded) targets z.^3 The resulting \u02c7W does not depend on the regression targets y, so the corresponding \u02c7H is a linear operator. The constraint ||X W X^T||_op(\u221e) \u2264 1 is effectively a regularizer on the coefficients of the hat matrix which will tend to shrink the fitted values X \u02c7W X^T y toward zero.\nMinimization (19) gives us a linear operator, but, as we now show, \u02c7W is not unique; there are in fact uncountably many \u02c7W that minimize (19).\n\n^3 The c is a mathematical convenience; if ||z||_2 were unbounded then the max would be unbounded and the problem ill-posed.\n\nTheorem 1. Suppose W\u2032 is feasible for (19) and is positive semi-definite. Then W\u2032 satisfies\n\nmax_{z, ||z||_2 = c} ||X W\u2032 X^T z \u2212 z||_2 = min_W max_{z, ||z||_2 = c} ||X W X^T z \u2212 z||_2    (20)\n\nfor all c.\n\nProof. We begin by re-formulating (19), which contains a non-concave maximization, as a convex minimization problem with convex constraints.\n\nLemma 2. Let X, W, c, and H be defined as above. Then\n\nmax_{z, ||z||_2 = c} ||X W X^T z \u2212 z||_2 = c ||X W X^T \u2212 I||_op(2).\n\nProof.\n
max_{z \u2208 R^n, ||z||_2 = c} ||X W X^T z \u2212 I z||_2 = max_{z \u2208 R^n, ||z||_2 \u2264 1} ||(X W X^T \u2212 I) c z||_2 = c max_{z \u2208 R^n, ||z||_2 \u2260 0} ||(X W X^T \u2212 I) z||_2 / ||z||_2 = c ||X W X^T \u2212 I||_op(2).\n\nUsing Lemma 2, we can rewrite (19) as\n\n\u02c7W = argmin_W ||X W X^T \u2212 I||_op(2)\ns.t. ||X W X^T||_op(\u221e) \u2264 1, W \u2208 R^{p\u00d7p}, W = W^T    (21)\n\nwhich is independent of z and independent of the positive constant c. This objective is convex in W, as are the constraints. We now prove a lower bound on (21) and prove that W\u2032 meets the lower bound.\nLemma 3. For all n \u00d7 p design matrices X s.t. n > p and all symmetric W, ||X W X^T \u2212 I||_op(2) \u2265 1.\n\nProof. Recall that ||X W X^T \u2212 I||_op(2) is given by the largest singular value of X W X^T \u2212 I. By symmetry of W, write X W X^T = U D U^T where D is a diagonal matrix whose diagonal entries d_ii are the eigenvalues of X W X^T and U is an orthonormal matrix. We therefore have\n\nX W X^T \u2212 I = U D U^T \u2212 I = U D U^T \u2212 U I U^T = U (D \u2212 I) U^T    (22)\n\nTherefore ||X W X^T \u2212 I||_op(2) = max_i |d_ii \u2212 1|, which is the largest singular value of X W X^T \u2212 I. Furthermore we know that rank(X W X^T) \u2264 p and that therefore at least n \u2212 p of the d_ii are zero. Therefore max_i |d_ii \u2212 1| \u2265 1, implying ||X W X^T \u2212 I||_op(2) \u2265 1.\nLemma 4. For any symmetric positive semi-definite matrix W\u2032 that satisfies the constraints in (19) and any n \u00d7 p design matrix X s.t. n > p, we have ||X W\u2032 X^T \u2212 I||_op(2) = 1.\n\nProof. Let H\u2032 = X W\u2032 X^T and write H\u2032 \u2212 I = U\u2032 (D\u2032 \u2212 I) U\u2032^T where U\u2032 is orthogonal and D\u2032 is a diagonal matrix whose diagonal entries d\u2032_ii are the eigenvalues of H\u2032. We know H\u2032 is positive semi-definite because W\u2032 is assumed to be positive semi-definite; therefore d\u2032_ii \u2265 0. From the constraints in (19), we have ||H\u2032||_op(\u221e) \u2264 1, and by symmetry of H\u2032 we have ||H\u2032||_op(\u221e) = ||H\u2032||_op(1). It is known [18] that for any M, ||M||_op(2) \u2264 \u221a(||M||_op(\u221e) ||M||_op(1)), which gives ||H\u2032||_op(2) \u2264 1 and therefore |d\u2032_ii| \u2264 1 for all i \u2208 1..n. Combining these results gives 0 \u2264 d\u2032_ii \u2264 1 \u2200i. Recall that ||X W\u2032 X^T \u2212 I||_op(2) = max_i |d\u2032_ii \u2212 1|, the largest singular value of H\u2032 \u2212 I. Because rank(X W\u2032 X^T) \u2264 p, we know that there exists an i such that d\u2032_ii = 0, and because we have shown that 0 \u2264 d\u2032_ii \u2264 1, it follows that max_i |d\u2032_ii \u2212 1| = 1, and therefore ||X W\u2032 X^T \u2212 I||_op(2) = 1.\n\nLemma 4 shows that the objective value at any feasible, symmetric positive semi-definite W\u2032 matches the lower bound proved in Lemma 3, and that therefore any such W\u2032 satisfies the theorem statement.\n\nTheorem 1 shows that the optimum of (19) is not unique. We therefore solve the following optimization problem, which has a unique solution, shows good empirical performance, and yet still provides the minimax property guaranteed by Theorem 1 when the optimal matrix is positive semi-definite.^4\n\n\u02dcW = argmin_W max_z ||X W X^T z \u2212 H z||_2\ns.t. ||X W X^T||_op(\u221e) \u2264 1, ||z||_2 = c, W \u2208 R^{p\u00d7p}, W = W^T, z \u2208 R^n    (23)\n\nIntuitively, this objective searches for a \u02dcW such that linear approximation using X \u02dcW X^T is as close as possible to the OLS approximation, for the worst case regression targets, according to the 2-norm.\n\n5 Computational Formulation\n\nBy an argument identical to that of Lemma 2, we can re-formulate (23) as a convex optimization problem with convex constraints, giving\n\n\u02dcW = argmin_W ||X W X^T \u2212 H||_op(2)\ns.t. ||X W X^T||_op(\u221e) \u2264 1, W \u2208 R^{p\u00d7p}, W = W^T.    (24)\n\nThough convex, objective (24) has no simple closed form, and we found that standard solvers have difficulty for larger problems [9]. However, ||X W X^T \u2212 H||_op(2) is upper bounded by the Frobenius norm ||X W X^T \u2212 H||_F, where ||M||_F = (\u2211_{i,j} m_ij^2)^{1/2}. Therefore, we minimize ||X W X^T \u2212 H||_F, or equivalently the quadratic objective ||X W X^T \u2212 H||_F^2, subject to the same convex constraints, which is easier to solve than (24). Note that Theorem 1 applies to the solution of this modified objective when the resulting \u02dcW is positive semidefinite.\nExpanding ||X W X^T \u2212 H||_F^2 gives ||X W X^T \u2212 H||_F^2 = Tr[X W X^T X W X^T \u2212 2 X W X^T + H], where the last term does not depend on W. Let M(:) be the length p \u00b7 n vector consisting of the stacked columns of the matrix M. After some algebraic manipulations, we can re-write the objective as W(:)^T \u039e W(:) \u2212 2 \u03b6^T W(:), where \u039e = \u2211_{i=1}^n \u2211_{j=1}^n \u03be^(ij) \u03be^(ij)T and \u03be^(ij) = (X_{i,:}^T X_{j,:})(:), and \u03b6 = (X^T X)(:). This objective can then be fed into any standard QP solver. The constraint ||X W X^T||_op(\u221e) \u2264 1 can be expressed as the set of constraints \u2211_{j=1}^n |X_{i,:} W X_{j,:}^T| \u2264 1, i = 1..n, or as a set of n 2^n linear constraints \u2211_{j=1}^n k_j X_{i,:} W X_{j,:}^T \u2264 1, i = 1..n, k \u2208 {+1, \u22121}^n. Each of these linear constraints involves a vector k with entries {+1, \u22121} multiplied by a row of X W X^T. If the entries in k match the signs of the row of X W X^T, then their inner product is equal to the sum of the absolute values of the row, which must be constrained. If they do not match, the result is smaller. By constraining all n 2^n patterns of signs, we constrain the sum of the absolute values of the entries in the row. Explicitly enforcing all of these constraints is intractable, so we employ a constraint-generation approach [20]. We solve a sequence of quadratic programs, adding the most violated linear constraint after each step. The most violated constraint is given by a row i\u2217 = argmax_{i \u2208 1..n} \u2211_{j=1}^n |X_{i,:} W X_{j,:}^T| and a vector k\u2217 = sign(X_{i\u2217,:} W X^T). The resulting constraint on W(:) can be written as k\u2217 L W(:) \u2264 1 where L_{j,:} = \u03be^(i\u2217j)T, j = 1..n. This formulation allows us to use a general QP solver to compute \u02dcW.\nNote that batch fitted value iteration performs many regressions where the targets y change from iteration to iteration, but the design matrix X is fixed. Therefore we only need to solve the ECOLS optimization problem once for any given application of FVI, meaning the additional computational cost of ECOLS over OLS is not a major drawback.\n\n6 Experimental results\n\nIn order to illustrate the behavior of ECOLS in different settings, we present four different empirical evaluations: one regression problem and three RL problems. In each of the RL settings, ECOLS with FVI converges, and the learned value function defines a good greedy policy.\n\n^4 One could in principle include a semi-definite constraint in the problem formulation, at an increased computational cost. (The problem is not a standard semi-definite program because the objective is not linear in the elements of W.) We have not imposed this constraint in our experiments and we have always found that the resulting \u02dcW is positive semi-definite.\n
We conjecture that \u02dcW is always positive semi-definite.\n\nFunction   \u03b2\u2217      \u02c6\u03b2      \u02dc\u03b2F     \u02dc\u03b2op(2)  \u02dc\u03b2avg\n1          1       0.95    0.16    0.77     -2.21\nx          -3      -2.92   -1.80   -2.02    -0.97\nx^2        -3      -3.00   -1.71   -1.88    -1.09\nx^3        1       1.00    0.58    0.64     0.37\nrms        6.69    6.68    13.60   13.44    16.52\n\nFigure 1: Example of OLS, ECOLS with ||X W X^T \u2212 H||_F , ECOLS with ||X W X^T \u2212 H||_op(2)\n\nRegression The first is a simple regression setting, where we examine the behavior of ECOLS compared to OLS. To give a simple, pictorial rendition of the difference between OLS, ECOLS using the Frobenius norm, ECOLS using the op(2)-norm, and an averager, we generated a dataset of n = 25 tuples (x, y) as follows: x \u223c U(\u22122, 4), y = 1 \u2212 3x \u2212 3x^2 + x^3 + \u03b5, \u03b5 \u223c N(0, 4). The design matrix X had rows X_{i,:} = [1, x_i, x_i^2, x_i^3]. The ECOLS regression optimizing the Frobenius norm using CPLEX [12] took 0.36 seconds, whereas optimizing the op(2)-norm using the cvx package [10] took 8.97 seconds on a 2 GHz Intel Core 2 Duo.\nFigure 1 shows the regression curves produced by OLS and the two versions of ECOLS, along with the learned coefficients and root mean squared error of the predictions on the data. Neither of the ECOLS curves fits the data as well as OLS, as one would expect. Generally, their curves are smoother than the OLS fit, and predictions are on the whole shrunk toward zero. We also ran ECOLS with an additional positivity constraint on X \u02dcW X^T, effectively forcing the result to be an averager as described in Sect. 3. The result is smoother than either of the ECOLS regressors, with a higher RMS prediction error. Note the small difference between ECOLS using the Frobenius norm (dark black line) and using the op(2)-norm (dashed line).\n
This is encouraging, as we have found that in larger\ndatasets optimizing the op(2)-norm is much slower and less reliable.\n\nTwo-state example Our second example is a classic on-policy \ufb01tted value iteration problem that is\nknown to diverge using OLS. It is perhaps the simplest example of FVI diverging, due to Tsitsiklis\nand Van Roy [22]. This is a deterministic on-policy example, or equivalently for our purposes, a\nproblem with |A| = 1. There are three states {1, 2, 3} with features X = (1, 2, 0)T, one action with\nP1,2 = 1, P2,2 = 1 \u2212 \u03b5, P2,3 = \u03b5, P3,3 = 1 and Pi,j = 0 elsewhere. The reward is R = [0, 0, 0]T\nand the value function is v\u2217 = [0, 0, 0]T. For \u03b3 > 5/(6 \u2212 4\u03b5), FVI with OLS diverges for any\nstarting point other than v\u2217. FVI with ECOLS always converges to v\u2217. If we change the reward\nto R = [1, 1, 0]T and set \u03b3 = 0.95, \u03b5 = 0.1, we have v\u2217 = [7.55, 6.90, 0]. FVI with OLS of\ncourse still diverges, whereas FVI with ECOLS converges to \u02dcv = [4.41, 8.82, 0]. In this case, the\napproximation space is poor, and no linear method based on the features in X can hope to perform\nwell. Nonetheless, ECOLS converges to a \u02c6v of at least the appropriate magnitude.\n\nGrid world Our third example is an off-policy value iteration problem which is known to diverge\nwith OLS, due to Boyan and Moore [4]. In this example, there are effectively 441 discrete states, laid\nout in a 21 \u00d7 21 grid, and assigned an (x, y) feature in [0, 1]2 according to their position in the grid.\nThere are four actions which deterministically move the agent up, down, left, or right by a distance\nof 0.05 in the feature space, and the reward is -0.5 everywhere except the corner state (1, 1), where\nit is 0. 
The discount \u03b3 is set to 1.0 so the optimal value function is v\u2217(x, y) = \u221220 + 10x + 10y.\nBoyan and Moore define \u201clucky\u201d convergence of FVI as the case where the policy induced by the learned value function is optimal, even if the learned value function itself does not accurately represent v\u2217. They found that with OLS and a design matrix X_{i,:} = [1, x_i, y_i], they achieve lucky convergence. We replicated their result using FVI on 255 randomly sampled states plus the goal state, and found that OLS converged^5 to \u02c6\u03b2 = [\u2212515.89, 9.99, 9.99] after 10455 iterations. This value function induces a policy that attempts to increase x and y, which is optimal. ECOLS on the other hand converged to \u02dc\u03b2 = [\u22121.09, 0.030, 0.07] after 31 iterations, which also induces an optimal policy. In terms of learning correct value function coefficients, the OLS estimate gets 2 of the 3 almost exactly correct. In terms of estimating the value of states, OLS achieves an RMSE over all states of 10413.73, whereas ECOLS achieves an RMSE of 208.41.\n\n[Figure 1 plot: \u201cExpansion-Constrained Ordinary Least Squares Comparisons\u201d; axes x and y; legend: OLS, ECOLS with Fro. norm, ECOLS with op(2)-norm, ECOLS Avg. with Fro. norm]\n\nIn the same work, Boyan and Moore apply OLS with quadratic features X_{i,:} = [1, x, y, x^2, y^2, xy], and find that FVI diverges. We found that ECOLS converges, with coefficients [\u22120.80, \u22122.67, \u22122.78, 2.73, 2.91, 0.06]. This is not \u201clucky\u201d, as the induced policy is only optimal for states in the upper-right half of the state space.\n\nLeft-or-right world Our fourth and last example is an off-policy value iteration problem with stochastic dynamics where OLS causes non-divergent but non-convergent behavior.\n
To investigate properties of their tree-based Fitted Q-Iteration (FQI) methods, Ernst, Geurts, and Wehenkel define the \u201cleft-or-right\u201d problem [5], an MDP with S = [0, 10], and stochastic dynamics given by s_{t+1} = s_t + a + \u03b5, where \u03b5 \u223c N(0, 1). Rewards are 0 for s \u2208 [0, 10], 100 for s > 10, and 50 for s < 0. All states outside [0, 10] are terminal. The discount factor \u03b3 is 0.75. In their formulation they use A = {\u22122, 2}, which gives an optimal policy that is approximately \u03c0\u2217(s) = {2 if s > 2.5, -2 otherwise}. We examine a simpler scenario by choosing A = {\u22124, 4}, so that \u03c0\u2217(s) = 4, i.e., it is optimal to always go right. Based on prior data [5], the optimal Q functions for this type of problem appear to be smooth and non-linear, possibly with inflection points. Thus we use polynomial features^6 X_{i,:} = [1, x, x^2, x^3] where x = s/5 \u2212 1. As is common in FQI, we fit separate regressions to learn Q(\u00b7, 4) and Q(\u00b7, \u22124) at each iteration. We used 300 episodes worth of data generated by the uniform random policy for learning.\nIn this setting, OLS does not diverge, but neither does it converge: the parameter vector of each Q function moves chaotically within some bounded region of R^4. The optimal policy induced by the Q-functions is determined solely by zeroes of Q(\u00b7, 4) \u2212 Q(\u00b7, \u22124), and in our experiments this function had at most one zero. Over 500 iterations of FQI with OLS, the cutpoint ranged from -7.77 to 14.04, resulting in policies ranging from \u201calways go right\u201d to \u201calways go left.\u201d FQI with ECOLS converged to a near-optimal policy \u02dc\u03c0(s) = {4 if s > 1.81, -4 otherwise}. We determined by Monte Carlo rollouts that, averaged over a uniform initial state, the value of \u02dc\u03c0 is 59.59, whereas the value of the optimal policy \u03c0\u2217 is 60.70.\n
While the performance of the learned policy is very good, the estimate of the average value using the learned Qs, 28.75, is lower due to the shrinkage induced by ECOLS in the predicted state-action values.\n\n7 Concluding Remarks\n\nDivergence of FVI with OLS has been a long-standing problem in the RL literature. In this paper, we introduced ECOLS, which provides guaranteed convergence of FVI. We proved theoretical properties that show that in the minimax sense, ECOLS is optimal among possible linear approximations that guarantee such convergence. Our test problems confirm the convergence properties of ECOLS and also illustrate some of its properties. In particular, the empirical results illustrate the regularization effect of the op(\u221e)-norm constraint that tends to \u201cshrink\u201d predicted values toward zero. This is a further contribution of our paper: Our theoretical and empirical results indicate that this shrinkage is a necessary cost of guaranteeing convergence of FVI using linear models with a fixed set of features. This has important implications for the deployment of FVI with ECOLS. In some applications where accurate estimates of policy performance are required, this shrinkage may be problematic; addressing this problem is an interesting avenue for future research. In other applications where the goal is to identify a good, intuitively represented value function and policy, ECOLS is a useful new tool.\n\nAcknowledgements We acknowledge support from Natural Sciences and Engineering Research Council of Canada (NSERC) and the National Institutes of Health (NIH) grants R01 MH080015 and P50 DA10075.\n\n^5 Convergence criterion was ||\u03b2_{iter+1} \u2212 \u03b2_{iter}|| \u2264 10^{\u22125}. All starts were from \u03b2 = 0.\n^6 The re-scaling of s is for numerical stability.\n\nReferences\n[1] A. Antos, R. Munos, and Cs. Szepesv\u00b4ari.\n
Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems 20, pages 9\u201316. MIT Press, 2008.\n[2] L. Baird. Residual Algorithms: Reinforcement Learning with Function Approximation. In A. Prieditis and S. Russell, editors, Proceedings of the 25th International Conference on Machine Learning, pages 30\u201337. Morgan Kaufmann, 1995.\n[3] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2007.\n[4] J. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, pages 369\u2013376, 1995.\n[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6:503\u2013556, 2005.\n[6] A. M. Farahmand, M. Ghavamzadeh, Cs. Szepesv\u00e1ri, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In American Control Conference, pages 725\u2013730, 2009.\n[7] R. Fonteneau. Contributions to Batch Mode Reinforcement Learning. PhD thesis, University of Liege, 2011.\n[8] G. J. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, 1999.\n[9] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, Apr. 2011.\n[10] M. C. Grant. Disciplined convex programming and the cvx modeling framework. Information Systems Journal, 2006.\n[11] A. Guez, R. D. Vincent, M. Avoli, and J. Pineau. Adaptive treatment of epilepsy via batch-mode reinforcement learning. In D. Fox and C. P. Gomes, editors, Innovative Applications of Artificial Intelligence, pages 1671\u20131678, 2008.\n[12] IBM. IBM ILOG CPLEX Optimization Studio V12.2, 2011.\n[13] S. Kalyanakrishnan and P. Stone. 
Batch reinforcement learning in a complex domain. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 07), 2007.\n[14] R. Munos and Cs. Szepesv\u00e1ri. Finite time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815\u2013857, 2008.\n[15] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2):161\u2013178, 2002.\n[16] M. Riedmiller. Neural fitted Q iteration \u2013 first experiences with a data efficient neural reinforcement learning method. In ECML 2005, pages 317\u2013328. Springer, 2005.\n[17] J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65(3):487\u2013516, 1997.\n[18] G. A. F. Seber. A Matrix Handbook for Statisticians. Wiley, 2007.\n[19] S. M. Shortreed, E. Laber, D. J. Lizotte, T. S. Stroup, J. Pineau, and S. A. Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 2010.\n[20] S. Siddiqi, B. Boots, and G. Gordon. A Constraint Generation Approach to Learning Stable Linear Dynamical Systems. In Advances in Neural Information Processing Systems 20, pages 1329\u20131336. MIT Press, 2008.\n[21] Cs. Szepesv\u00e1ri. Algorithms for Reinforcement Learning. Morgan and Claypool, 2010.\n[22] J. N. Tsitsiklis and B. van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674\u2013690, 1997.", "award": [], "sourceid": 1376, "authors": [{"given_name": "Daniel", "family_name": "Lizotte", "institution": null}]}