{"title": "LSTD with Random Projections", "book": "Advances in Neural Information Processing Systems", "page_first": 721, "page_last": 729, "abstract": "We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a high-dimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm.", "full_text": "LSTD with Random Projections\n\nMohammad Ghavamzadeh, Alessandro Lazaric, Odalric-Ambrym Maillard, R\u00b4emi Munos\n\nINRIA Lille - Nord Europe, Team SequeL, France\n\nAbstract\n\nWe consider the problem of reinforcement learning in high-dimensional spaces\nwhen the number of features is bigger than the number of samples. In particular,\nwe study the least-squares temporal difference (LSTD) learning algorithm when\na space of low dimension is generated with a random projection from a high-\ndimensional space. We provide a thorough theoretical analysis of the LSTD with\nrandom projections and derive performance bounds for the resulting algorithm.\nWe also show how the error of LSTD with random projections is propagated\nthrough the iterations of a policy iteration algorithm and provide a performance\nbound for the resulting least-squares policy iteration (LSPI) algorithm.\n\nIntroduction\n\n1\nLeast-squares temporal difference (LSTD) learning [3, 2] is a widely used reinforcement learning\n(RL) algorithm for learning the value function V \u03c0 of a given policy \u03c0. 
LSTD has been successfully applied to a number of problems, especially after the development of the least-squares policy iteration (LSPI) algorithm [9], which extends LSTD to control problems by using it in the policy evaluation step of policy iteration. More precisely, LSTD computes the fixed point of the operator ΠT^π, where T^π is the Bellman operator of policy π and Π is the projection operator onto a linear function space. The choice of the linear function space has a major impact on the accuracy of the value function estimated by LSTD, and thus, on the quality of the policy learned by LSPI. The problem of finding the right space, or in other words the problems of feature selection and discovery, is an important challenge in many areas of machine learning including RL, or more specifically, linear value function approximation in RL.

To address this issue in RL, many researchers have focused on feature extraction and learning. Mahadevan [13] proposed a constructive method for generating features based on the eigenfunctions of the Laplace-Beltrami operator of the graph built from observed system trajectories. Menache et al. [16] presented a method that starts with a set of features and then tunes both the features and the weights using either gradient descent or the cross-entropy method. Keller et al. [7] proposed an algorithm in which the state space is repeatedly projected onto a lower-dimensional space based on the Bellman error, and states are then aggregated in this space to define new features. Finally, Parr et al. [17] presented a method that iteratively adds features to a linear approximation architecture such that each new feature is derived from the Bellman error of the existing set of features.

A more recent approach to feature selection and discovery in value function approximation in RL is to solve RL in high-dimensional feature spaces.
The basic idea here is to use a large number of features and then exploit the regularities in the problem to solve it efficiently in this high-dimensional space. Theoretically speaking, increasing the size of the function space can reduce the approximation error (the distance between the target function and the space) at the cost of a growth in the estimation error. In practice, in the typical high-dimensional learning scenario, when the number of features is larger than the number of samples, this often leads to overfitting and poor prediction performance. To overcome this problem, several approaches have been proposed, including regularization. Both ℓ1- and ℓ2-regularization have been studied in value function approximation in RL. Farahmand et al. presented several ℓ2-regularized RL algorithms by adding ℓ2-regularization to LSTD and modified Bellman residual minimization [4] as well as to fitted value iteration [5], and proved finite-sample performance bounds for their algorithms. There has also been algorithmic work on adding ℓ1-penalties to the TD [12], LSTD [8], and linear programming [18] algorithms.

In this paper, we follow a different approach based on random projections [21]. In particular, we study the performance of LSTD with random projections (LSTD-RP). Given a high-dimensional linear space F, LSTD-RP learns the value function of a given policy from a small (relative to the dimension of F) number of samples in a space G of lower dimension obtained by linear random projection of the features of F. We prove that solving the problem in the low-dimensional random space instead of the original high-dimensional space reduces the estimation error at the price of a "controlled" increase in the approximation error of the original space F. We present the LSTD-RP algorithm and discuss its computational complexity in Section 3.
In Section 4, we provide the finite-sample analysis of the algorithm. Finally, in Section 5, we show how the error of LSTD-RP is propagated through the iterations of LSPI.

2 Preliminaries

For a measurable space with domain X, we let S(X) and B(X; L) denote the set of probability measures over X and the space of measurable functions with domain X bounded in absolute value by 0 < L < ∞, respectively. For a measure µ ∈ S(X) and a measurable function f : X → R, we define the ℓ2(µ)-norm of f as ||f||²_µ = ∫ f(x)² µ(dx), the supremum norm of f as ||f||_∞ = sup_{x∈X} |f(x)|, and for a set of n states X_1, ..., X_n ∈ X, the empirical norm of f as ||f||²_n = (1/n) Σ_{t=1}^n f(X_t)². Moreover, for a vector u ∈ R^n, we write its ℓ2-norm as ||u||²_2 = Σ_{i=1}^n u_i².

We consider the standard RL framework [20], in which a learning agent interacts with a stochastic environment and this interaction is modeled as a discrete-time discounted Markov decision process (MDP). A discounted MDP is a tuple M = ⟨X, A, r, P, γ⟩, where the state space X is a bounded closed subset of a Euclidean space, A is a finite (|A| < ∞) action space, the reward function r : X × A → R is uniformly bounded by R_max, the transition kernel P is such that for all x ∈ X and a ∈ A, P(·|x, a) is a distribution over X, and γ ∈ (0, 1) is a discount factor. A deterministic policy π : X → A is a mapping from states to actions. Under a policy π, the MDP M is reduced to a Markov chain M^π = ⟨X, R^π, P^π, γ⟩ with reward R^π(x) = r(x, π(x)), transition kernel P^π(·|x) = P(·|x, π(x)), and stationary distribution ρ^π (if it admits one). The value function of a policy π, V^π, is the unique fixed point of the Bellman operator T^π : B(X; V_max = R_max/(1−γ)) → B(X; V_max) defined by (T^π V)(x) = R^π(x) + γ ∫_X P^π(dy|x) V(y). We also define the optimal value function V* as the unique fixed point of the optimal Bellman operator T* : B(X; V_max) → B(X; V_max) defined by (T* V)(x) = max_{a∈A} [ r(x, a) + γ ∫_X P(dy|x, a) V(y) ]. Finally, we denote by T the truncation operator at threshold V_max, i.e., if |f(x)| > V_max then T(f)(x) = sgn(f(x)) V_max.

To approximate a value function V ∈ B(X; V_max), we first define a linear function space F spanned by the basis functions φ_j ∈ B(X; L), j = 1, ..., D, i.e., F = {f_α | f_α(·) = φ(·)^⊤α, α ∈ R^D}, where φ(·) = (φ_1(·), ..., φ_D(·))^⊤ is the feature vector. We define the orthogonal projection of V onto the space F w.r.t. the norm µ as Π_F V = arg min_{f∈F} ||V − f||_µ. From F we can generate a d-dimensional (d < D) random space G = {g_β | g_β(·) = ψ(·)^⊤β, β ∈ R^d}, where the feature vector ψ(·) = (ψ_1(·), ..., ψ_d(·))^⊤ is defined as ψ(·) = Aφ(·), with A ∈ R^{d×D} a random matrix whose elements are drawn i.i.d. from a suitable distribution, e.g., Gaussian N(0, 1/d). Similar to the space F, we define the orthogonal projection of V onto the space G w.r.t. the norm µ as Π_G V = arg min_{g∈G} ||V − g||_µ.
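The construction of the random space G above is straightforward to implement: generating the low-dimensional features amounts to one matrix-vector product per state. The following minimal sketch (ours, not from the paper) uses a hypothetical radial-basis feature map standing in for φ, and illustrates the Johnson-Lindenstrauss-style preservation of inner products on which the analysis of Section 4 relies:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 10_000, 100                     # high and low dimensions (d < D)

# Hypothetical high-dimensional feature map phi : X -> R^D
# (randomly centered radial basis functions, for illustration only).
centers = rng.uniform(-1.0, 1.0, size=D)
def phi(x):
    return np.exp(-((x - centers) ** 2) / 0.1)

# Random projection A with i.i.d. N(0, 1/d) entries defines psi(x) = A phi(x).
A = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, D))
def psi(x):
    return A @ phi(x)

# Inner products in G approximate those in F up to O(1/sqrt(d)) distortion.
u, w = phi(0.3), phi(-0.5)
pu, pw = psi(0.3), psi(-0.5)
print(u @ w, pu @ pw)
```

The distortion of inner products and norms scales like 1/√d relative to the product of the ℓ2-norms of the projected vectors, which is exactly the quantity controlled by m(f_α) in the analysis below.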
Finally, for any function f_α ∈ F, we define m(f_α) = ||α||₂ sup_{x∈X} ||φ(x)||₂.

3 LSTD with Random Projections

The objective of LSTD with random projections (LSTD-RP) is to learn the value function of a given policy from a small (relative to the dimension of the original space) number of samples in a low-dimensional linear space defined by a random projection of the high-dimensional space. We show that solving the problem in the low-dimensional space instead of the original high-dimensional space reduces the estimation error at the price of a "controlled" increase in the approximation error. In this section, we introduce the notation and the resulting algorithm, and discuss its computational complexity. In Section 4, we provide the finite-sample analysis of the algorithm.

We use the linear spaces F and G with dimensions D and d (d < D) as defined in Section 2. Since in the following the policy is fixed, we drop the dependency of R^π, P^π, V^π, and T^π on π and simply use R, P, V, and T. Let {X_t}_{t=1}^n be a sample path (or trajectory) of size n generated by the Markov chain M^π, and let v ∈ R^n and r ∈ R^n, defined as v_t = V(X_t) and r_t = R(X_t), be the value and reward vectors of this trajectory. Also, let Ψ = [ψ(X_1)^⊤; ...; ψ(X_n)^⊤] be the feature matrix defined at these n states and G_n = {Ψβ | β ∈ R^d} ⊂ R^n be the corresponding vector space. We denote by Π̂_G : R^n → G_n the orthogonal projection onto G_n, defined by Π̂_G y = arg min_{z∈G_n} ||y − z||_n, where ||y||²_n = (1/n) Σ_{t=1}^n y_t². Similarly, we can define the orthogonal projection onto F_n = {Φα | α ∈ R^D} as Π̂_F y = arg min_{z∈F_n} ||y − z||_n, where Φ = [φ(X_1)^⊤; ...; φ(X_n)^⊤] is the feature matrix defined at {X_t}_{t=1}^n. Note that for any y ∈ R^n, the orthogonal projections Π̂_G y and Π̂_F y exist and are unique.

We consider the pathwise-LSTD algorithm introduced in [11]. Pathwise-LSTD takes a single trajectory {X_t}_{t=1}^n of size n generated by the Markov chain as input and returns the fixed point of the empirical operator Π̂_G T̂, where T̂ is the pathwise Bellman operator defined as T̂y = r + γP̂y. The operator P̂ : R^n → R^n is defined as (P̂y)_t = y_{t+1} for 1 ≤ t < n and (P̂y)_n = 0. As shown in [11], T̂ is a γ-contraction in ℓ2-norm, which together with the non-expansive property of Π̂_G guarantees the existence and uniqueness of the pathwise-LSTD fixed point v̂ ∈ R^n, v̂ = Π̂_G T̂v̂. Note that the uniqueness of v̂ does not imply the uniqueness of the parameter β̂ such that v̂ = Ψβ̂.

LSTD-RP(D, d, {X_t}_{t=1}^n, {R(X_t)}_{t=1}^n, φ, γ)

Compute                                                                           Cost
• the reward vector r_{n×1}; r_t = R(X_t)                                         O(n)
• the high-dimensional feature matrix Φ_{n×D} = [φ(X_1)^⊤; ...; φ(X_n)^⊤]         O(nD)
• the projection matrix A_{d×D} whose elements are i.i.d. samples from N(0, 1/d)  O(dD)
• the low-dim feature matrix Ψ_{n×d} = [ψ(X_1)^⊤; ...; ψ(X_n)^⊤]; ψ(·) = Aφ(·)    O(ndD)
• the matrix P̂Ψ = Ψ′_{n×d} = [ψ(X_2)^⊤; ...; ψ(X_n)^⊤; 0^⊤]                      O(nd)
• Ã_{d×d} = Ψ^⊤(Ψ − γΨ′) , b̃_{d×1} = Ψ^⊤r                                        O(nd + nd²) + O(nd)
return either β̂ = Ã⁻¹b̃ or β̂ = Ã⁺b̃ (Ã⁺ is the Moore-Penrose pseudo-inverse of Ã)  O(d² + d³)

Figure 1: The pseudo-code of the LSTD with random projections (LSTD-RP) algorithm.

Figure 1 contains the pseudo-code and the computational cost of the LSTD-RP algorithm. The total computational cost of LSTD-RP is O(d³ + ndD), while the computational cost of LSTD in the high-dimensional space F is O(D³ + nD²). As we will see, the analysis of Section 4 suggests that the value of d should be set to O(√n). In this case, the numerical complexity of LSTD-RP is O(n^{3/2} D), which is better than O(D³), the cost of LSTD in F when n < D (the case considered in this paper). Note that the cost of making a prediction is D in LSTD in F and dD in LSTD-RP.

4 Finite-Sample Analysis of LSTD with Random Projections

In this section, we report the main theoretical results of the paper. In particular, we derive a performance bound for LSTD-RP in the Markov design setting, i.e., when the LSTD-RP solution is compared to the true value function only at the states belonging to the trajectory used by the algorithm (see Section 4 in [11] for a more detailed discussion). We then derive a condition on the number of samples to guarantee the uniqueness of the LSTD-RP solution. Finally, from the Markov design bound we obtain generalization bounds when the Markov chain has a stationary distribution.

4.1 Markov Design Bound

Theorem 1. Let F and G be linear spaces with dimensions D and d (d < D) as defined in Section 2. Let {X_t}_{t=1}^n be a sample path generated by the Markov chain M^π, and v, v̂ ∈ R^n be the vectors whose components are the value function and the LSTD-RP solution at {X_t}_{t=1}^n. Then for any δ > 0, whenever d ≥ 15 log(8n/δ), with probability 1 − δ (the randomness is w.r.t. both the random sample path and the random projection), v̂ satisfies

\[
\|v-\hat v\|_n \le \frac{1}{\sqrt{1-\gamma^2}} \left[ \|v-\hat\Pi_F v\|_n + \sqrt{\frac{8\log(8n/\delta)}{d}}\; m(\hat\Pi_F v) \right] + \frac{\gamma V_{\max} L}{1-\gamma} \sqrt{\frac{d}{\nu_n}} \left( \sqrt{\frac{8\log(4d/\delta)}{n}} + \frac{1}{n} \right), \tag{1}
\]

where the random variable ν_n is the smallest strictly positive eigenvalue of the sample-based Gram matrix (1/n)Ψ^⊤Ψ. Note that m(Π̂_F v) = m(f_α), with f_α any function in F such that f_α(X_t) = (Π̂_F v)_t for 1 ≤ t ≤ n.

Before stating the proof of Theorem 1, we need to prove the following lemma.

Lemma 1. Let F and G be linear spaces with dimensions D and d (d < D) as defined in Section 2. Let {X_i}_{i=1}^n be n states and f_α ∈ F. Then for any δ > 0, whenever d ≥ 15 log(4n/δ), with probability 1 − δ (the randomness is w.r.t. the random projection), we have

\[
\inf_{g\in\mathcal{G}} \|f_\alpha - g\|_n^2 \le \frac{8\log(4n/\delta)}{d}\; m(f_\alpha)^2. \tag{2}
\]

Proof. The proof relies on the application of a variant of the Johnson-Lindenstrauss (JL) lemma, which states that inner products are approximately preserved by the application of the random matrix A (see e.g., Proposition 1 in [14]). For any δ > 0, we set ε² = (8/d) log(4n/δ). Thus for d ≥ 15 log(4n/δ), we have ε ≤ 3/4, and as a result ε²/4 − ε³/6 ≥ ε²/8 and d ≥ log(4n/δ)/(ε²/4 − ε³/6). Thus, from Proposition 1 in [14], for all 1 ≤ i ≤ n, we have |φ(X_i)·α − Aφ(X_i)·Aα| ≤ ε ||α||₂ ||φ(X_i)||₂ ≤ ε m(f_α) with high probability.
From this result, we deduce that with probability 1 − δ,

\[
\inf_{g\in\mathcal{G}} \|f_\alpha - g\|_n^2 \le \|f_\alpha - g_{A\alpha}\|_n^2 = \frac{1}{n}\sum_{i=1}^n |\phi(X_i)\cdot\alpha - A\phi(X_i)\cdot A\alpha|^2 \le \frac{8\log(4n/\delta)}{d}\; m(f_\alpha)^2.
\]

Proof of Theorem 1. For any fixed space G, the performance of the LSTD-RP solution can be bounded according to Theorem 1 in [10] as

\[
\|v-\hat v\|_n \le \frac{1}{\sqrt{1-\gamma^2}}\, \|v-\hat\Pi_G v\|_n + \frac{\gamma V_{\max} L}{1-\gamma} \sqrt{\frac{d}{\nu_n}} \left( \sqrt{\frac{8\log(2d/\delta')}{n}} + \frac{1}{n} \right) \tag{3}
\]

with probability 1 − δ′ (w.r.t. the random sample path). From the triangle inequality, we have

\[
\|v-\hat\Pi_G v\|_n \le \|v-\hat\Pi_F v\|_n + \|\hat\Pi_F v - \hat\Pi_G v\|_n = \|v-\hat\Pi_F v\|_n + \|\hat\Pi_F v - \hat\Pi_G(\hat\Pi_F v)\|_n. \tag{4}
\]

The equality in Eq. 4 comes from the fact that for any vector g ∈ G_n, we can write ||v − g||²_n = ||v − Π̂_F v||²_n + ||Π̂_F v − g||²_n. Since ||v − Π̂_F v||_n is independent of g, we have arg inf_{g∈G_n} ||v − g||²_n = arg inf_{g∈G_n} ||Π̂_F v − g||²_n, and thus Π̂_G v = Π̂_G(Π̂_F v). From Lemma 1, if d ≥ 15 log(4n/δ′′), with probability 1 − δ′′ (w.r.t. the choice of A), we have

\[
\|\hat\Pi_F v - \hat\Pi_G(\hat\Pi_F v)\|_n \le \sqrt{\frac{8\log(4n/\delta'')}{d}}\; m(\hat\Pi_F v). \tag{5}
\]

We conclude from a union bound argument that Eqs. 3 and 5 hold simultaneously with probability at least 1 − δ′ − δ′′. The claim follows by combining Eqs. 3-5 and setting δ′ = δ′′ = δ/2.

Remark 1. Using Theorem 1, we can compare the performance of LSTD-RP with the performance of LSTD directly applied in the high-dimensional space F. Let v̄ be the LSTD solution in F; then up to constants, logarithmic, and dominated factors, with high probability, v̄ satisfies

\[
\|v-\bar v\|_n \le \frac{1}{\sqrt{1-\gamma^2}}\, \|v-\hat\Pi_F v\|_n + \frac{1}{1-\gamma}\, O\!\left(\sqrt{D/n}\right). \tag{6}
\]

By comparing Eqs. 1 and 6, we notice that 1) the estimation error of v̂ is of order O(√(d/n)), and thus is smaller than the estimation error of v̄, which is of order O(√(D/n)), and 2) the approximation error of v̂ is the approximation error of v̄, ||v − Π̂_F v||_n, plus an additional term that depends on m(Π̂_F v) and decreases with d, the dimensionality of G, at the rate O(√(1/d)). Hence, LSTD-RP may have a better performance than solving LSTD in F whenever this additional term is smaller than the gain achieved in the estimation error. Note that m(Π̂_F v) highly depends on the value function V that is being approximated and on the features of the space F. It is important to carefully tune the value of d, as both the estimation error and the additional approximation error in Eq. 1 depend on d. For instance, while a small value of d significantly reduces the estimation error (and the need for samples), it may amplify the additional approximation error term, and thus reduce the advantage of LSTD-RP over LSTD. We may get an idea of how to select the value of d by optimizing the bound of Eq. 1, which (up to logarithmic factors) gives

\[
d = \frac{m(\hat\Pi_F v)}{\gamma V_{\max} L}\, \sqrt{\frac{n\,\nu_n\,(1-\gamma)}{1+\gamma}}. \tag{7}
\]

Therefore, when n samples are available, the optimal value for d is of order O(√n). Using the value of d in Eq. 7, we can rewrite the bound of Eq. 1 as (up to the dominated term 1/n)

\[
\|v-\hat v\|_n \le \frac{1}{\sqrt{1-\gamma^2}}\, \|v-\hat\Pi_F v\|_n + \frac{2\sqrt{8\log(8n/\delta)}}{1-\gamma}\, \sqrt{\gamma V_{\max} L\; m(\hat\Pi_F v)}\, \left(\frac{1-\gamma}{n\,\nu_n\,(1+\gamma)}\right)^{1/4}. \tag{8}
\]

Using Eqs. 6 and 8, it would be easier to compare the performance of LSTD-RP and LSTD in the space F, and to observe the role of the term m(Π̂_F v). For further discussion on m(Π̂_F v), refer to [14] and, for the case of D = ∞, to Section 4.3 of this paper.

Remark 2. As discussed in the introduction, when the dimensionality D of F is much bigger than the number of samples n, learning algorithms are likely to overfit the data. In this case, it is reasonable to assume that the target vector v itself belongs to the vector space F_n. We state this condition using the following assumption:

Assumption 1. (Overfitting) For any set of n points {X_i}_{i=1}^n, there exists a function f ∈ F such that f(X_i) = V(X_i), 1 ≤ i ≤ n.

Assumption 1 is equivalent to requiring that the rank of the empirical Gram matrix (1/n)Φ^⊤Φ be equal to n. Note that Assumption 1 is likely to hold whenever D ≫ n, because in this case we can expect the features to be independent enough on {X_i}_{i=1}^n for the rank of (1/n)Φ^⊤Φ to be n (e.g., if the features are linearly independent on the samples, it is sufficient to have D ≥ n). Under Assumption 1, we can remove the empirical approximation error term in Theorem 1 and deduce the following result.

Corollary 1. Under Assumption 1 and the conditions of Theorem 1, with probability 1 − δ (w.r.t. the random sample path and the random space), v̂ satisfies

\[
\|v-\hat v\|_n \le \frac{1}{\sqrt{1-\gamma^2}} \sqrt{\frac{8\log(8n/\delta)}{d}}\; m(\hat\Pi_F v) + \frac{\gamma V_{\max} L}{1-\gamma} \sqrt{\frac{d}{\nu_n}} \left( \sqrt{\frac{8\log(4d/\delta)}{n}} + \frac{1}{n} \right).
\]

4.2 Uniqueness of the LSTD-RP Solution

While the results in the previous section hold for any Markov chain, in this section we assume that the Markov chain M^π admits a stationary distribution ρ and is exponentially fast β-mixing with parameters β̄, b, κ, i.e., its β-mixing coefficients satisfy β_i ≤ β̄ exp(−b i^κ) (see e.g., Sections 8.2 and 8.3 in [10] for a more detailed definition of β-mixing processes). As shown in [11, 10], if ρ exists, it is possible to derive a condition for the existence and uniqueness of the LSTD solution depending on the number of samples and the smallest eigenvalue of the Gram matrix defined according to the stationary distribution ρ, i.e., G ∈ R^{D×D}, G_ij = ∫ φ_i(x) φ_j(x) ρ(dx). We now discuss the existence and uniqueness of the LSTD-RP solution. Note that as D increases, the smallest eigenvalue of G is likely to become smaller and smaller. In fact, the more features in F, the higher the chance for some of them to be correlated under ρ, thus leading to an ill-conditioned matrix G. On the other hand, since d < D, the probability that d independent random combinations of the φ_i lead to highly correlated features ψ_j is relatively small.
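This effect is easy to observe numerically. The sketch below (an illustration under assumed distributions, not part of the paper's analysis) builds highly correlated high-dimensional features, so that their Gram matrix is ill-conditioned, and compares its smallest eigenvalue with that of the Gram matrix of d random Gaussian combinations of the same features:

```python
import numpy as np

rng = np.random.default_rng(1)
n_mc, D, d = 5000, 200, 10    # Monte Carlo states, feature dimensions (d < D)

# Correlated high-dimensional features: many near-duplicate columns, so the
# Gram matrix G is close to singular (assumed setup, for illustration only).
base = rng.normal(size=(n_mc, 20))
Phi = base[:, rng.integers(0, 20, size=D)] + 1e-3 * rng.normal(size=(n_mc, D))

G = Phi.T @ Phi / n_mc                     # Gram matrix of F (Monte Carlo estimate)
A = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, D))
H = A @ G @ A.T                            # Gram matrix of the projected space G

omega = np.linalg.eigvalsh(G)[0]           # smallest eigenvalue of G
chi = np.linalg.eigvalsh(H)[0]             # smallest eigenvalue of H
print(omega, chi)                          # chi is typically much larger than omega
```

This is exactly the behavior quantified by Lemma 2 below, which lower-bounds the smallest eigenvalue of the projected Gram matrix in terms of that of the original one.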
In the following, we prove that the smallest eigenvalue of the Gram matrix H ∈ R^{d×d}, H_ij = ∫ ψ_i(x) ψ_j(x) ρ(dx), of the random space G is indeed bigger than the smallest eigenvalue of G with high probability.

Lemma 2. Let δ > 0 and F and G be linear spaces with dimensions D and d (d < D) as defined in Section 2, with D > d + 2√(2d log(2/δ)) + 2 log(2/δ). Let the elements of the projection matrix A be Gaussian random variables drawn from N(0, 1/d). Let the Markov chain M^π admit a stationary distribution ρ. Let G and H be the Gram matrices according to ρ for the spaces F and G, and ω and χ be their smallest eigenvalues. Then, with probability 1 − δ (w.r.t. the random space),

\[
\chi \ge \frac{D}{d}\, \omega \left( 1 - \sqrt{\frac{d}{D}} - \sqrt{\frac{2\log(2/\delta)}{D}} \right)^2. \tag{9}
\]

Proof. Let β ∈ R^d be the eigenvector associated with the smallest eigenvalue χ of H. From the definition of the features ψ of G (H = AGA^⊤) and linear algebra, we obtain

\[
\chi \|\beta\|_2^2 = \beta^\top \chi \beta = \beta^\top H \beta = \beta^\top A G A^\top \beta \ge \omega \|A^\top \beta\|_2^2 = \omega\, \beta^\top A A^\top \beta \ge \omega\, \xi\, \|\beta\|_2^2, \tag{10}
\]

where ξ is the smallest eigenvalue of the random matrix AA^⊤, or in other words, √ξ is the smallest singular value of the D × d random matrix A^⊤, i.e., s_min(A^⊤) = √ξ. We now define B = √d A. Note that if the elements of A are drawn from the Gaussian distribution N(0, 1/d), the elements of B are standard Gaussian random variables, and thus ξ can be written as ξ = s_min(B^⊤)²/d. There has been extensive work on extreme singular values of random matrices (see e.g., [19]). For a D × d random matrix with independent standard normal entries, such as B^⊤, we have with probability 1 − δ (see [19] for more details)

\[
s_{\min}(B^\top) \ge \sqrt{D} - \sqrt{d} - \sqrt{2\log(2/\delta)}. \tag{11}
\]

From Eq. 11 and the relation between ξ and s_min(B^⊤), we obtain

\[
\xi \ge \frac{D}{d} \left( 1 - \sqrt{\frac{d}{D}} - \sqrt{\frac{2\log(2/\delta)}{D}} \right)^2 \tag{12}
\]

with probability 1 − δ. The claim follows by replacing the bound for ξ from Eq. 12 in Eq. 10.

The result of Lemma 2 is for Gaussian random matrices. However, it would be possible to extend this result using non-asymptotic bounds for the extreme singular values of more general random matrices [19]. Note that in Eq. 9, D/d is always greater than 1 and the term in parentheses approaches 1 for large values of D. Thus, we can conclude that with high probability the smallest eigenvalue χ of the Gram matrix H of the randomly generated low-dimensional space G is bigger than the smallest eigenvalue ω of the Gram matrix G of the high-dimensional space F.

Lemma 3. Let δ > 0 and F and G be linear spaces with dimensions D and d (d < D) as defined in Section 2, with D > d + 2√(2d log(2/δ)) + 2 log(2/δ). Let the elements of the projection matrix A be Gaussian random variables drawn from N(0, 1/d). Let the Markov chain M^π admit a stationary distribution ρ. Let G be the Gram matrix according to ρ for the space F and ω be its smallest eigenvalue. Let {X_t}_{t=1}^n be a trajectory of length n generated by a stationary β-mixing process with stationary distribution ρ. If the number of samples n satisfies

\[
n > \frac{288\, L^2\, d\; \Lambda(n, d, \delta/2)}{\omega D} \max\left\{ \frac{\Lambda(n, d, \delta/2)}{b},\, 1 \right\}^{1/\kappa} \left( 1 - \sqrt{\frac{d}{D}} - \sqrt{\frac{2\log(2/\delta)}{D}} \right)^{-2}, \tag{13}
\]

where Λ(n, d, δ) = 2(d+1) log n + log(e/δ) + log⁺(max{18(6e)^{2(d+1)}, β̄}), then with probability 1 − δ, the features ψ_1, ..., ψ_d are linearly independent on the states {X_t}_{t=1}^n, i.e., ||g_β||_n = 0 implies β = 0, and the smallest eigenvalue ν_n of the sample-based Gram matrix (1/n)Ψ^⊤Ψ satisfies

\[
\nu_n \ge \nu = \frac{\sqrt{\omega}}{2} \sqrt{\frac{D}{d}} \left( 1 - \sqrt{\frac{d}{D}} - \sqrt{\frac{2\log(2/\delta)}{D}} \right) - 6L \sqrt{\frac{2\Lambda(n, d, \delta/2)}{n} \max\left\{ \frac{\Lambda(n, d, \delta/2)}{b},\, 1 \right\}^{1/\kappa}} > 0. \tag{14}
\]

Proof. The proof follows similar steps as Lemma 4 in [10]. A sketch of the proof is available in [6].

By comparing Eq. 13 with Eq. 13 in [10], we can see that the number of samples needed for the empirical Gram matrix (1/n)Ψ^⊤Ψ in G to be invertible with high probability is less than that for its counterpart (1/n)Φ^⊤Φ in the high-dimensional space F.

4.3 Generalization Bound

In this section, we show how Theorem 1 can be generalized to the entire state space X when the Markov chain M^π has a stationary distribution ρ. We consider the case in which the samples {X_t}_{t=1}^n are obtained by following a single trajectory in the stationary regime of M^π, i.e., when X_1 is drawn from ρ.
As discussed in Remark 2 of Section 4.1, it is reasonable to assume that the high-\ndimensional space F contains functions that are able to perfectly \ufb01t the value function V in any \ufb01nite\nnumber n (n < D) of states {Xt}n\n\nt=1, thus we state the following theorem under Assumption 1.\n\n6\n\n\fTheorem 2. Let \u03b4 > 0 and F and G be linear spaces with dimensions D and d (d < D) as de\ufb01ned\nin Section 2 with d \u2265 15 log(8n/\u03b4). Let {Xt}n\nt=1 be a path generated by a stationary \u03b2-mixing\nprocess with stationary distribution \u03c1. Let \u02c6V be the LSTD-RP solution in the random space G. Then\nunder Assumption 1, with probability 1 \u2212 \u03b4 (w.r.t. the random sample path and the random space),\n||V \u2212 T ( \u02c6V )||\u03c1 \u2264\n\nd\n\u03bd\nwhere \u03bd is a lower bound on the eigenvalues of the Gram matrix 1\n\n2(cid:112)1 \u2212 \u03b32\n\nm(\u03a0F V ) +\n\n8 log(24n/\u03b4)\n\n+ \u0001 , (15)\n\n(cid:114)\n\nn\n\n(cid:17)\n\n(cid:114)\n\n2\u03b3VmaxL\n1 \u2212 \u03b3\n\n(cid:16)(cid:114)\nn \u03a8(cid:62)\u03a8 de\ufb01ned by Eq. 14 and\n(cid:27)1/\u03ba\n(cid:26) \u039b(n, d, \u03b4/3)\n\n8 log(12d/\u03b4)\n\n1\nn\n\n+\n\nd\n\n(cid:115)\n\n\u0001 = 24Vmax\n\n2\u039b(n, d, \u03b4/3)\n\nn\n\nmax\n\nb\n\n, 1\n\n.\n\nwith \u039b(n, d, \u03b4) de\ufb01ned as in Lemma 3. Note that T in Eq. 15 is the truncation operator de\ufb01ned in\nSection 2.\n\nProof. The proof is a consequence of applying concentration of measures inequalities for \u03b2-mixing\nprocesses and linear spaces (see Corollary 18 in [10]) on the term ||V \u2212 T ( \u02c6V )||n, using the fact that\n||V \u2212 T ( \u02c6V )||n \u2264 ||V \u2212 \u02c6V ||n, and using the bound of Corollary 1. The bound of Corollary 1 and\nthe lower bound on \u03bd, each one holding with probability 1 \u2212 \u03b4(cid:48), thus, the statement of the theorem\n(Eq. 15) holds with probability 1 \u2212 \u03b4 by setting \u03b4 = 3\u03b4(cid:48).\nRemark 1. 
An interesting property of the bound in Theorem 2 is that the approximation error of V in space F, ||V − Π_F V||_ρ, does not appear, and the error of the LSTD solution in the randomly projected space depends only on the dimensionality d of G and the number of samples n. However, this property is valid only when Assumption 1 holds, i.e., at most for n ≤ D. An interesting case here is when the dimension of F is infinite (D = ∞), so that the bound is valid for any number of samples n. In [15], two approximation spaces F of infinite dimension were constructed based on a multi-resolution set of features that are rescaled and translated versions of a given mother function. In the case that the mother function is a wavelet, the resulting features, called scrambled wavelets, are linear combinations of wavelets at all scales weighted by Gaussian coefficients. As a result, the corresponding approximation space is a Sobolev space H^s(X) with smoothness of order s > p/2, where p is the dimension of the state space X. In this case, for a function f_α ∈ H^s(X), it is proved that the ℓ2-norm of the parameter α is equal to the norm of the function in H^s(X), i.e., ||α||₂ = ||f_α||_{H^s(X)}. We do not describe those results further and refer the interested reader to [15]. What is important about the results of [15] is that they show that it is possible to consider infinite-dimensional function spaces for which sup_x ||φ(x)||₂ is finite and ||α||₂ is expressed in terms of the norm of f_α in F. In such cases, m(Π_F V) is finite and the bound of Theorem 2, which does not contain any approximation error of V in F, holds for any n. Nonetheless, further investigation is needed to better understand the role of ||f_α||_{H^s(X)} in the final bound.
Remark 2.
As discussed in the introduction, regularization methods have been studied for solving high-dimensional RL problems. Therefore, it is interesting to compare our results for LSTD-RP with those reported in [4] for ℓ2-regularized LSTD. Under Assumption 1, when D = ∞, by selecting the features as described in the previous remark and optimizing the value of d as in Eq. 7, we obtain

||V − T(V̂)||_ρ ≤ O(√(||f_α||_{H^s(X)}) n^{−1/4}) .    (16)

Although the setting considered in [4] is different from ours (e.g., the samples are i.i.d.), a qualitative comparison of Eq. 16 with the bound in Theorem 2 of [4] shows a striking similarity in the performance of the two algorithms. In fact, they both contain the Sobolev norm of the target function and have a similar dependency on the number of samples, with a convergence rate of O(n^{−1/4}) (when the smoothness of the Sobolev space in [4] is chosen to be half the dimensionality of X). This similarity calls for further investigation of the difference between ℓ2-regularized methods and random projections in terms of prediction performance and computational complexity.
5 LSPI with Random Projections
In this section, we move from policy evaluation to policy iteration and provide a performance bound for LSPI with random projections (LSPI-RP), i.e., a policy iteration algorithm that uses LSTD-RP at each iteration. LSPI-RP starts with an arbitrary initial value function V₋₁ ∈ B(X; Vmax) and its corresponding greedy policy π0. At the first iteration, it approximates V^{π0} using LSTD-RP and returns a function V̂0, whose truncated version Ṽ0 = T(V̂0) is used to build the policy for the second iteration. More precisely, π1 is a greedy policy w.r.t. Ṽ0.
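The policy-evaluation step performed at each iteration can be sketched in code. This is a schematic implementation under simplifying assumptions, not the paper's reference code: the sample transitions under the policy being evaluated are assumed to be given as high-dimensional feature matrices of current and next states, and the function name and interface are illustrative.

```python
import numpy as np

def lstd_rp(Phi, Phi_next, rewards, d, gamma, v_max, rng):
    """One LSTD-RP policy-evaluation step: project the high-dimensional
    features into a random d-dimensional space and solve the LSTD fixed
    point there, then truncate the value estimate at v_max."""
    n, D = Phi.shape
    # Projection matrix with i.i.d. N(0, 1/d) entries, drawn afresh.
    A = rng.standard_normal((d, D)) / np.sqrt(d)
    Psi = Phi @ A.T            # projected features at states X_t      (n x d)
    Psi_next = Phi_next @ A.T  # projected features at states X_{t+1}  (n x d)
    # LSTD solution in G: theta = (Psi^T (Psi - gamma Psi'))^{-1} Psi^T r.
    M = Psi.T @ (Psi - gamma * Psi_next)
    theta = np.linalg.solve(M, Psi.T @ rewards)
    # Truncated value estimate at the sampled states.
    return np.clip(Psi @ theta, -v_max, v_max)
```

In LSPI-RP this routine would be called once per iteration with samples collected by following the current policy, and the greedy policy w.r.t. the returned (truncated) estimate defines the next iteration.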
So, at each iteration k, a function V̂k−1 is computed as an approximation to V^{πk−1}, then truncated to Ṽk−1 and used to build the policy πk.¹ Note that in general, the measure σ ∈ S(X) used to evaluate the final performance of the LSPI-RP algorithm might be different from the distribution used to generate samples at each iteration. Moreover, the LSTD-RP performance bounds require the samples to be collected by following the policy under evaluation. Thus, we need Assumptions 1-3 in [10] in order to 1) define a lower-bounding distribution μ with constant C < ∞, 2) guarantee that with high probability a unique LSTD-RP solution exists at each iteration, and 3) define the slowest β-mixing process among all the mixing processes M^{πk} with 0 ≤ k < K.
Theorem 3. Let δ > 0 and F and G be linear spaces with dimensions D and d (d < D) as defined in Section 2 with d ≥ 15 log(8Kn/δ). At each iteration k, we generate a path of size n from the stationary β-mixing process with stationary distribution ρk−1 = ρ^{πk−1}. Let n satisfy the condition in Eq. 13 for the slowest β-mixing process. Let V₋₁ be an arbitrary initial value function, V̂0, . . . , V̂K−1 (Ṽ0, . . . , ṼK−1) be the sequence of value functions (truncated value functions) generated by LSPI-RP, and πK be the greedy policy w.r.t. ṼK−1. Then, under Assumption 1 and Assumptions 1-3 in [10], with probability 1 − δ (w.r.t.
the random samples and the random spaces), we have

||V* − V^{πK}||_σ ≤ (4γ/(1 − γ)²) { (1 + γ) √(C C_{σ,μ}) [ (2Vmax/√(1 − γ²)) √(C/ω_μ) sup_{x∈X} ||φ(x)||₂ √(8 log(24Kn/δ)/d) + (2γVmaxL/(1 − γ)) (√d/ν_μ) (√(8 log(12Kd/δ)/n) + 1/n) + E ] + γ^{(K−1)/2} Rmax } ,    (17)

where C_{σ,μ} is the concentrability term from Definition 2 in [1], ω_μ is the smallest eigenvalue of the Gram matrix of space F w.r.t. μ, ν_μ is ν from Eq. 14 in which ω is replaced by ω_μ, and E is ε from Theorem 2 written for the slowest β-mixing process.

Proof. The proof follows similar lines as the proof of Thm. 8 in [10] and is available in [6].

Remark. The most critical issue about Theorem 3 is the validity of Assumptions 1-3 in [10]. It is important to note that Assumption 1 is needed to bound the performance of LSPI independently of the use of random projections (see [10]). On the other hand, Assumption 2 is explicitly related to random projections and allows us to bound the term m(Π_F V). In order for this assumption to hold, the features {φj}_{j=1}^D of the high-dimensional space F should be carefully chosen so as to be linearly independent w.r.t. μ.
6 Conclusions
Learning in high-dimensional linear spaces is particularly appealing in RL because it allows a very accurate approximation of value functions. Nonetheless, the larger the space, the greater the need for samples and the risk of overfitting. In this paper, we introduced an algorithm, called LSTD-RP, in which LSTD is run in a low-dimensional space obtained by a random projection of the original high-dimensional space.
We theoretically analyzed the performance of LSTD-RP and showed that it solves the problem of overfitting (i.e., the estimation error depends on the low dimension d) at the cost of a slight worsening in the approximation accuracy compared to the high-dimensional space. We also analyzed the performance of LSPI-RP, a policy iteration algorithm that uses LSTD-RP for policy evaluation. The analysis reported in the paper opens a number of interesting research directions, such as: 1) comparison of LSTD-RP to ℓ2- and ℓ1-regularized approaches, and 2) a thorough analysis of the case when D = ∞ and the role of ||f_α||_{H^s(X)} in the bound.
Acknowledgments This work was supported by French National Research Agency through the projects EXPLO-RA n◦ ANR-08-COSI-004 and LAMPADA n◦ ANR-09-EMER-007, by Ministry of Higher Education and Research, Nord-Pas de Calais Regional Council and FEDER through the "contrat de projets état region 2007–2013", and by PASCAL2 European Network of Excellence.

¹ Note that the MDP model is needed to generate a greedy policy πk. In order to avoid the need for the model, we can simply move to LSTD-Q with random projections. Although the analysis of LSTD-RP can be extended to action-value functions and LSTD-RP-Q, for simplicity we use value functions in the following.

References
[1] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, 71:89–129, 2008.
[2] J. Boyan. Least-squares temporal difference learning. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 49–56, 1999.
[3] S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
[4] A. M. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S.
Mannor. Regularized policy iteration. In Proceedings of Advances in Neural Information Processing Systems 21, pages 441–448. MIT Press, 2008.
[5] A. M. Farahmand, M. Ghavamzadeh, Cs. Szepesvári, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In Proceedings of the American Control Conference, pages 725–730, 2009.
[6] M. Ghavamzadeh, A. Lazaric, O. Maillard, and R. Munos. LSPI with random projections. Technical Report inria-00530762, INRIA, 2010.
[7] P. Keller, S. Mannor, and D. Precup. Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the Twenty-Third International Conference on Machine Learning, pages 449–456, 2006.
[8] Z. Kolter and A. Ng. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, pages 521–528, 2009.
[9] M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[10] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of least-squares policy iteration. Technical Report inria-00528596, INRIA, 2010.
[11] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pages 615–622, 2010.
[12] M. Loth, M. Davy, and P. Preux. Sparse temporal difference learning using lasso. In IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 352–359, 2007.
[13] S. Mahadevan. Representation policy iteration. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 372–379, 2005.
[14] O. Maillard and R. Munos. Compressed least-squares regression.
In Proceedings of Advances in Neural Information Processing Systems 22, pages 1213–1221, 2009.
[15] O. Maillard and R. Munos. Brownian motions and scrambled wavelets for least-squares regression. Technical Report inria-00483014, INRIA, 2010.
[16] I. Menache, S. Mannor, and N. Shimkin. Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134:215–238, 2005.
[17] R. Parr, C. Painter-Wakefield, L. Li, and M. Littman. Analyzing feature generation for value-function approximation. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, pages 737–744, 2007.
[18] M. Petrik, G. Taylor, R. Parr, and S. Zilberstein. Feature selection using regularization in approximate linear programs for Markov decision processes. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pages 871–878, 2010.
[19] M. Rudelson and R. Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians, 2010.
[20] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[21] S. Vempala. The Random Projection Method. American Mathematical Society, 2004.