{"title": "A Kernel Loss for Solving the Bellman Equation", "book": "Advances in Neural Information Processing Systems", "page_first": 15456, "page_last": 15467, "abstract": "Value function learning plays a central role in many state-of-the-art reinforcement \nlearning algorithms. Many popular algorithms like Q-learning do not optimize\nany objective function, but are fixed-point iterations of some variants of Bellman\noperator that are not necessarily a contraction. As a result, they may easily lose\nconvergence guarantees, as can be observed in practice. In this paper, we propose a novel loss function, which can be optimized using standard gradient-based methods with guaranteed convergence. The key advantage is that its gradient can be easily approximated using sampled transitions, avoiding the need for double samples required by prior algorithms like residual gradient. Our approach may be combined with general function classes such as neural networks, using either on- or off-policy data, and is shown to work reliably and effectively in several benchmarks, including classic problems where standard algorithms are known to diverge.", "full_text": "A Kernel Loss for Solving the Bellman\n\nEquation\n\nYihao Feng\nUT Austin\n\nLihong Li\n\nGoogle Research\n\nQiang Liu\nUT Austin\n\nyihao@cs.utexas.edu\n\nlihong@google.com\n\nlqiang@cs.utexas.edu\n\nAbstract\n\nValue function learning plays a central role in many state-of-the-art reinforcement-\nlearning algorithms. Many popular algorithms like Q-learning do not optimize\nany objective function, but are \ufb01xed-point iterations of some variants of Bellman\noperator that are not necessarily a contraction. As a result, they may easily lose\nconvergence guarantees, as can be observed in practice. In this paper, we propose a\nnovel loss function, which can be optimized using standard gradient-based methods\nwith guaranteed convergence. 
The key advantage is that its gradient can be easily approximated using sampled transitions, avoiding the need for double samples required by prior algorithms like residual gradient. Our approach may be combined with general function classes such as neural networks, using either on- or off-policy data, and is shown to work reliably and effectively in several benchmarks, including classic problems where standard algorithms are known to diverge.

1 Introduction

The goal of a reinforcement learning (RL) agent is to optimize its policy to maximize the long-term return through repeated interaction with an external environment. The interaction is often modeled as a Markov decision process, whose value functions are the unique fixed points of their corresponding Bellman operators. Many state-of-the-art algorithms, including TD(\lambda), Q-learning and actor-critic, have value function learning as a key component (Sutton & Barto, 2018; Szepesvári, 2010).

A fundamental property of the Bellman operator is that it is a contraction in the value function space in the \ell_\infty-norm (Puterman, 1994). Therefore, starting from any bounded initial function, repeated application of the operator converges to the true value function. A number of algorithms are directly inspired by this property, such as temporal difference (Sutton, 1988) and its many variants (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 2018; Szepesvári, 2010). Unfortunately, when function approximation such as neural networks is used to represent the value function in large-scale problems, this critical contraction property is generally lost (e.g., Boyan & Moore, 1995; Baird, 1995; Tsitsiklis & Van Roy, 1997), except in rather restricted cases (e.g., Gordon, 1995; Tsitsiklis & Van Roy, 1997). Not only is this instability one of the core theoretical challenges in RL, but it also has broad practical significance, given the growing popularity of algorithms like DQN (Mnih et al., 2015), A3C (Mnih et al., 2016) and their many variants (e.g., Gu et al., 2016; Schulman et al., 2016; Wang et al., 2016; Wu et al., 2017), whose stability depends largely on the contraction property. The instability becomes even harder to avoid when training data (transitions) are sampled from an off-policy distribution, a situation known as the deadly triad (Sutton & Barto, 2018, Sec. 11.3).

The brittleness of the Bellman operator's contraction property has inspired a number of works that aim to reformulate value function learning as an optimization problem, where standard algorithms like stochastic gradient descent can be used to minimize the objective without the risk of divergence

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

(under mild and typical assumptions). One of the earliest attempts is residual gradient, or RG (Baird, 1995), which minimizes squared temporal differences. The algorithm is convergent, but its objective is not necessarily a good proxy due to a well-known "double sample" problem. As a result, it may converge to an inferior solution; see Sections 2 and 6 for further details and numerical examples. This drawback is inherited by similar algorithms like PCL (Nachum et al., 2017, 2018).

Another line of work seeks alternative objective functions whose minimization leads to the desired value functions (e.g., Sutton et al., 2009; Maei, 2011; Liu et al., 2015; Dai et al., 2017). Most existing works are restricted either to linear approximation, or to evaluation of a fixed policy. An exception is the SBEED algorithm (Dai et al., 2018b), which transforms the Bellman equation into an equivalent saddle-point problem, and can use nonlinear function approximations. 
While SBEED is provably convergent under fairly standard conditions, it relies on solving a minimax problem, whose optimization can be rather challenging in practice, especially with nonconvex approximation classes like neural networks.

In this paper, we propose a novel loss function for value function learning. It avoids the double-sample problem (unlike RG), and can be easily estimated and optimized using sampled transitions (in both on- and off-policy scenarios). This is made possible by leveraging an important property of integrally strictly positive definite kernels (Stewart, 1976; Sriperumbudur et al., 2010). The new objective function allows us to derive simple yet effective algorithms to approximate the value function, without risking instability or divergence (unlike TD algorithms), or solving a more sophisticated saddle-point problem (unlike SBEED). Our approach also allows great flexibility in choosing the value function approximation class, including nonlinear ones like neural networks. Experiments in several benchmarks demonstrate the effectiveness of our method, for both policy evaluation and optimization problems. We focus on the batch setting (or the growing-batch setting with a growing replay buffer), and leave the online setting for future work.

2 Background

This section starts with necessary notation and background information, then reviews two representative algorithms that work with general, nonlinear (differentiable) function classes.

Notation. A Markov decision process (MDP) is denoted by M = \langle S, A, P, R, \gamma \rangle, where S is a (possibly infinite) state space, A an action space, P(s' \mid s, a) the transition probability, R(s, a) the average immediate reward, and \gamma \in (0, 1) a discount factor. The value function of a policy \pi : S \mapsto \mathbb{R}^A_+, denoted

V^\pi(s) := \mathbb{E}\left[\textstyle\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s,\; a_t \sim \pi(\cdot \mid s_t)\right],

measures the expected long-term return of a state. It is well known that V = V^\pi is the unique solution to the Bellman equation (Puterman, 1994), V = \mathcal{B}^\pi V, where \mathcal{B}^\pi : \mathbb{R}^S \to \mathbb{R}^S is the Bellman operator, defined by

\mathcal{B}^\pi V(s) := \mathbb{E}_{a \sim \pi(\cdot|s),\, s' \sim P(\cdot|s,a)}[R(s, a) + \gamma V(s') \mid s].

While we develop and analyze our approach mostly for \mathcal{B}^\pi given a fixed \pi (policy evaluation), we will also extend the approach to the controlled case of policy optimization, where the corresponding Bellman operator becomes

\mathcal{B} V(s) := \max_a \mathbb{E}_{s' \sim P(\cdot|s,a)}[R(s, a) + \gamma V(s') \mid s, a].

The unique fixed point of \mathcal{B} is known as the optimal value function, denoted V^*; that is, \mathcal{B} V^* = V^*.

Our work is built on top of an alternative to the fixed-point view above: given some fixed distribution \mu whose support is S, V^\pi is the unique minimizer of the squared Bellman error:

L_2(V) := \|\mathcal{B}^\pi V - V\|^2_\mu = \mathbb{E}_{s \sim \mu}\left[\left(\mathcal{B}^\pi V(s) - V(s)\right)^2\right].

Denote by \mathcal{R}^\pi V := \mathcal{B}^\pi V - V the Bellman error operator. With a set D = \{(s_i, a_i, r_i, s'_i)\}_{1 \le i \le n} of transitions where a_i \sim \pi(\cdot|s_i), the Bellman operator at state s_i can be approximated by bootstrapping: \hat{\mathcal{B}}^\pi V(s_i) := r_i + \gamma V(s'_i). Similarly, \hat{\mathcal{R}}^\pi V(s_i) := r_i + \gamma V(s'_i) - V(s_i). Clearly, one has \mathbb{E}[\hat{\mathcal{B}}^\pi V(s_i) \mid s_i] = \mathcal{B}^\pi V(s_i) and \mathbb{E}[\hat{\mathcal{R}}^\pi V(s_i) \mid s_i] = \mathcal{R}^\pi V(s_i). In the literature, \hat{\mathcal{R}}^\pi V(s_i) is also known as the temporal difference or TD error, whose expectation is the Bellman error.

Finally, in this work, we use the same notation for a distribution and its probability density function.

Basic Algorithms. We are interested in estimating V^\pi, from a parametric family \{V_\theta : \theta \in \Theta\}, using data D. 
The residual gradient (RG) algorithm (Baird, 1995) minimizes the squared TD error:

\hat{L}_{\mathrm{RG}}(V_\theta) := \frac{1}{n} \sum_{i=1}^{n} \left(\hat{\mathcal{B}}^\pi V_\theta(s_i) - V_\theta(s_i)\right)^2,    (1)

with gradient descent update \theta_{t+1} = \theta_t - \epsilon \nabla_\theta \hat{L}_{\mathrm{RG}}(V_{\theta_t}), where

\nabla_\theta \hat{L}_{\mathrm{RG}}(V_\theta) = \frac{2}{n} \sum_{i=1}^{n} \left(\hat{\mathcal{B}}^\pi V_\theta(s_i) - V_\theta(s_i)\right) \cdot \nabla_\theta\left(\hat{\mathcal{B}}^\pi V_\theta(s_i) - V_\theta(s_i)\right).

However, the objective in (1) is a biased and inconsistent estimate of the squared Bellman error. This is because \mathbb{E}_{s \sim \mu}[\hat{L}_{\mathrm{RG}}(V)] = L_2(V) + \mathbb{E}_{s \sim \mu}[\mathrm{var}(\hat{\mathcal{B}}^\pi V(s) \mid s)] \ne L_2(V): there is an extra term involving the conditional variance of the empirical Bellman operator, which does not vanish unless the state transitions are deterministic. As a result, RG can converge to incorrect value functions (see also Section 6). With random transitions, correcting the bias requires double samples (i.e., at least two independent samples of (r, s') for the same (s, a) pair) to estimate the conditional variance.

More popular algorithms in the literature are instead based on fixed-point iterations, using \hat{\mathcal{B}}^\pi to construct a target value for updating V_\theta(s_i). An example is fitted value iteration, or FVI (Bertsekas & Tsitsiklis, 1996; Munos & Szepesvári, 2008), which includes as special cases the empirically successful DQN and its variants, and also serves as a key component in many state-of-the-art actor-critic algorithms. 
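The conditional-variance bias of (1) can be checked with a tiny numeric example. The sketch below (with arbitrarily chosen values, not from the paper) enumerates a single stochastic transition and verifies that the expected squared TD error equals the squared Bellman error plus the conditional variance of the empirical Bellman operator:

```python
import numpy as np

# A single state s with two equally likely successors, zero reward, and
# discount gamma. All numbers are illustrative.
gamma = 0.9
v_s = 1.0                      # current estimate V(s)
v_next = np.array([0.0, 2.0])  # V(s'_a), V(s'_b)
p = np.array([0.5, 0.5])       # transition probabilities

td_errors = gamma * v_next - v_s              # hat{R}^pi V(s) for each outcome
expected_sq_td = np.dot(p, td_errors ** 2)    # what RG minimizes, in expectation

bellman_error = np.dot(p, gamma * v_next) - v_s   # R^pi V(s) = B^pi V(s) - V(s)
cond_var = gamma ** 2 * np.dot(p, (v_next - np.dot(p, v_next)) ** 2)

# E[(TD error)^2] = (Bellman error)^2 + var(hat{B}^pi V(s) | s)
assert abs(expected_sq_td - (bellman_error ** 2 + cond_var)) < 1e-12
```

Here the variance term (0.81) dwarfs the squared Bellman error (0.01), so minimizing the squared TD error mostly penalizes the environment's stochasticity rather than the value error.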
In its basic form, FVI starts from an initial \theta_0 and iteratively updates the parameter by

\theta_{t+1} = \arg\min_{\theta \in \Theta} \left\{ \hat{L}^{(t+1)}_{\mathrm{FVI}}(V_\theta) := \frac{1}{n} \sum_{i=1}^{n} \left(V_\theta(s_i) - \hat{\mathcal{B}}^\pi V_{\theta_t}(s_i)\right)^2 \right\}.    (2)

Different from RG, when gradient-based methods are applied to solve (2), the current parameter \theta_t is treated as a constant: \nabla_\theta \hat{L}^{(t+1)}_{\mathrm{FVI}}(V_\theta) = \frac{2}{n} \sum_{i=1}^{n} \left(V_\theta(s_i) - \hat{\mathcal{B}}^\pi V_{\theta_t}(s_i)\right) \nabla_\theta V_\theta(s_i). TD(0) (Sutton, 1988) may be viewed as a stochastic version of FVI, where a single sample (i.e., n = 1) is drawn randomly (either from a stream of transitions or from a replay buffer) to estimate the gradient of (2).

Being fixed-point iteration methods, FVI-style algorithms do not optimize any objective function, and their convergence is guaranteed only in rather restricted cases (e.g., Gordon, 1995; Tsitsiklis & Van Roy, 1997; Antos et al., 2008). Such divergent behavior is well known and empirically observed (Baird, 1995; Boyan & Moore, 1995); see Section 6 for more numerical examples. It creates substantial difficulty for parameter tuning and model selection in practice.

3 Kernel Loss for Policy Evaluation

Much of the algorithmic challenge described earlier lies in the difficulty of estimating the squared Bellman error from data. In this section, we address this difficulty by proposing a new loss function that is more amenable to statistical estimation from empirical data. Proofs are deferred to the appendix.

Our framework relies on an integrally strictly positive definite (ISPD) kernel K : S \times S \to \mathbb{R}, which is a symmetric bivariate function that satisfies

\|f\|^2_K := \int_{S^2} K(s, \bar{s}) f(s) f(\bar{s}) \, ds \, d\bar{s} > 0

for any non-zero L_2-integrable function f. For simplicity, we consider two functions f and g equal if (f - g) has zero L_2 norm. We call \|f\|_K the K-norm of f. Many commonly used kernels, such as the Gaussian RBF kernel K(s, \bar{s}) = \exp(-\|s - \bar{s}\|_2^2 / h), are ISPD. More discussion of ISPD kernels can be found in Stewart (1976) and Sriperumbudur et al. (2010).

3.1 The New Loss Function

Recall that \mathcal{R}^\pi V = \mathcal{B}^\pi V - V is the Bellman error operator. Our new loss function is defined by

L_K(V) = \|\mathcal{R}^\pi V\|^2_{K,\mu} := \mathbb{E}_{s,\bar{s} \sim \mu}\left[K(s, \bar{s}) \cdot \mathcal{R}^\pi V(s) \cdot \mathcal{R}^\pi V(\bar{s})\right],    (3)

where \mu is any positive density function on states, and s, \bar{s} \sim \mu means s and \bar{s} are drawn i.i.d. from \mu. Here, \|\cdot\|_{K,\mu} is regarded as the K-norm under measure \mu; it is easy to show that \|f\|_{K,\mu} = \|f\mu\|_K. Note that \mu can be either the visitation distribution under policy \pi (the on-policy case), or some other distribution (the off-policy case); our approach handles both cases in a unified way. The following theorem shows that the loss L_K is consistent:

Theorem 3.1. Let K be an ISPD kernel and assume \mu(s) > 0 for all s \in S. Then L_K(V) \ge 0 for any V, and L_K(V) = 0 if and only if V = V^\pi. In other words, V^\pi = \arg\min_V L_K(V).

The next result relates the kernel loss to a "dual" kernel norm of the value function error, V - V^\pi.

Theorem 3.2. 
Under the same assumptions as Theorem 3.1, we have L_K(V) = \|V - V^\pi\|^2_{K^*,\mu}, where \|\cdot\|_{K^*,\mu} is the K^*-norm under measure \mu with a "dual" kernel K^*(s, \bar{s}), defined by

K^*(s', \bar{s}') := \mathbb{E}_{s, \bar{s} \sim d^*_{\pi,\mu}}\left[K(s', \bar{s}') + \gamma^2 K(s, \bar{s}) - \gamma\left(K(s', \bar{s}) + K(s, \bar{s}')\right) \,\middle|\, s', \bar{s}'\right],

where the expectation notation is shorthand for \mathbb{E}_{s \sim d^*_{\pi,\mu}}[f(s) \mid s'] = \int f(s) \, d^*_{\pi,\mu}(s|s') \, ds, with

d^*_{\pi,\mu}(s|s') := \sum_a \pi(a|s) P(s'|s, a) \mu(s) / \mu(s').

The norm involves a quantity, d^*_{\pi,\mu}(s|s'), which may be heuristically viewed as a "backward" conditional probability of state s given that the next state is s' (note, however, that d^*_{\pi,\mu}(s|s') is not normalized to sum to one unless \mu = d_\pi).

Empirical Estimation. The key advantage of the new loss L_K is that it can be easily estimated and optimized from observed transitions, without requiring double samples. Given a set of empirical data D = \{(s_i, a_i, r_i, s'_i)\}_{1 \le i \le n}, one way to estimate L_K is to use the so-called V-statistic,

\hat{L}_K(V_\theta) := \frac{1}{n^2} \sum_{1 \le i,j \le n} K(s_i, s_j) \cdot \hat{\mathcal{R}}^\pi V_\theta(s_i) \cdot \hat{\mathcal{R}}^\pi V_\theta(s_j).    (4)

Similarly, the gradient \nabla_\theta L_K(V_\theta) = 2\,\mathbb{E}_\mu\left[K(s, \bar{s}) \, \mathcal{R}^\pi V_\theta(s) \, \nabla_\theta\left(\mathcal{R}^\pi V_\theta(\bar{s})\right)\right] can be estimated by

\nabla_\theta \hat{L}_K(V_\theta) := \frac{2}{n^2} \sum_{1 \le i,j \le n} K(s_i, s_j) \cdot \hat{\mathcal{R}}^\pi V_\theta(s_i) \cdot \nabla_\theta \hat{\mathcal{R}}^\pi V_\theta(s_j).

Note that while calculating the exact gradient requires O(n^2) computation, in practice we may instead use stochastic gradient descent on mini-batches of data. 
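As an illustration of (4), the following sketch computes the V-statistic estimate for a linear value function on synthetic one-dimensional transitions with a Gaussian RBF kernel; the data-generating process and the helper `kernel_loss` are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

# Synthetic 1-d transitions: s' = 0.5 s + noise, with rewards chosen so that
# theta* = 1 solves the Bellman equation in expectation for V_theta(s) = theta * s.
rng = np.random.default_rng(0)
n, gamma, h = 50, 0.9, 1.0
s = rng.normal(size=n)                            # states s_i
s_next = 0.5 * s + 0.1 * rng.normal(size=n)       # next states s'_i
r = s - gamma * 0.5 * s                           # rewards r_i

def kernel_loss(theta):
    """V-statistic estimate (4) of the kernel loss for V_theta(s) = theta * s."""
    delta = r + gamma * theta * s_next - theta * s   # TD errors hat{R}^pi V(s_i)
    K = np.exp(-(s[:, None] - s[None, :]) ** 2 / h)  # Gram matrix K(s_i, s_j)
    return delta @ K @ delta / n ** 2                # (1/n^2) sum_ij K_ij d_i d_j

# The estimate is a nonnegative quadratic form (K is positive definite) and is
# much smaller near the Bellman solution theta = 1 than away from it.
assert kernel_loss(0.5) >= 0.0
assert kernel_loss(1.0) < kernel_loss(0.0)
assert kernel_loss(1.0) < kernel_loss(2.0)
```

Because the double sum couples TD errors at different sampled states rather than squaring each one, no second independent sample of s' is needed, which is exactly what breaks the double-sample requirement of RG.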
The precise formulas for unbiased estimates of the gradient of the kernel loss using a subset of samples are given in Appendix B.1.

Remark (unbiasedness). An alternative is to use the U-statistic, which removes the diagonal (i = j) terms from the pairwise average in (4). With i.i.d. samples, the U-statistic is an unbiased estimate of the true gradient, but it may have higher variance than the V-statistic. In our experiments, we observe that the V-statistic works better than the U-statistic.

Remark (consistency). Following standard statistical approximation theory (e.g., Serfling, 2009), both the U- and V-statistics provide consistent estimates of the expected quadratic quantity provided the sample is weakly dependent and satisfies certain mixing conditions (e.g., Denker & Keller, 1983; Beutner & Zähle, 2012); this often amounts to saying that \{s_i\} forms a Markov chain that converges to its stationary distribution \mu sufficiently fast. This is in contrast to the gradient computed by residual gradient, which is inconsistent in general.

Remark. Another advantage of our kernel loss is that L_K(V) = 0 iff V = V^\pi. Therefore, the magnitude of the empirical loss \hat{L}_K(V) reflects the closeness of V to the true value function V^\pi. In fact, using methods from kernel-based hypothesis testing (e.g., Gretton et al., 2012; Liu et al., 2016; Chwialkowski et al., 2016), one can design statistically calibrated methods to test whether V = V^\pi has been achieved, which may be useful for designing efficient exploration strategies. 
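The relation between the U- and V-statistics can be made concrete: the following sketch (synthetic data; `delta` stands in for the TD errors) forms both estimates from the same Gram matrix and checks that they differ exactly by a rescaled diagonal contribution:

```python
import numpy as np

# Synthetic states and stand-in TD errors, for illustration only.
rng = np.random.default_rng(1)
n = 40
s = rng.uniform(-1, 1, size=n)
delta = rng.normal(size=n)                     # stand-in for hat{R}^pi V(s_i)
K = np.exp(-(s[:, None] - s[None, :]) ** 2)    # Gaussian RBF Gram matrix

# V-statistic (4): full pairwise average, including i = j terms.
v_stat = delta @ K @ delta / n ** 2
# U-statistic: drop the diagonal terms and renormalize by n(n-1).
diag_sum = np.sum(np.diag(K) * delta ** 2)
u_stat = (delta @ K @ delta - diag_sum) / (n * (n - 1))

assert v_stat >= 0.0   # the V-statistic is a PSD quadratic form, hence nonnegative
# n^2 * v_stat - n(n-1) * u_stat recovers exactly the diagonal contribution.
assert abs(n ** 2 * v_stat - n * (n - 1) * u_stat - diag_sum) < 1e-9
```

The diagonal terms are always nonnegative, which is why the V-statistic is biased upward but never negative, while the U-statistic trades that bias for possibly higher variance.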
In this work, we focus on estimating V^\pi, and leave testing value function proximity as future work.

3.2 Interpretations of the Kernel Loss

We now provide some insight into the new loss function, based on two interpretations.

Eigenfunction Interpretation. Mercer's theorem implies the decomposition

K(s, \bar{s}) = \sum_{i=1}^{\infty} \lambda_i e_i(s) e_i(\bar{s})    (5)

of any continuous positive definite kernel on a compact domain, where \{e_i\}_{i=1}^\infty is a countable set of orthonormal eigenfunctions w.r.t. \mu (i.e., \mathbb{E}_{s \sim \mu}[e_i(s) e_j(s)] = \mathbb{1}\{i = j\}), and \{\lambda_i\}_{i=1}^\infty are the corresponding eigenvalues. For ISPD kernels, all the eigenvalues must be positive: \lambda_i > 0 for all i. The following shows that L_K is a squared projected Bellman error in the space spanned by \{e_i\}_{i=1}^\infty.

Proposition 3.3. If (5) holds, then

L_K(V) = \sum_{i=1}^{\infty} \lambda_i \left(\mathbb{E}_{s \sim \mu}\left[\mathcal{R}^\pi V(s) \cdot e_i(s)\right]\right)^2.

Moreover, if \{e_i\} is a complete orthonormal basis of the L_2 space under measure \mu, then the L_2 loss is

L_2(V) = \sum_{i=1}^{\infty} \left(\mathbb{E}_{s \sim \mu}\left[\mathcal{R}^\pi V(s) \cdot e_i(s)\right]\right)^2.

Therefore, L_K(V) \le \lambda_{\max} L_2(V), where \lambda_{\max} := \max_i \{\lambda_i\}.

This result shows that the eigenvalue \lambda_i controls the contribution to L_K of the Bellman error's projection onto the eigenfunction e_i. It may be tempting to take \lambda_i \equiv 1, in which case L_K(V) = L_2(V), but then the Mercer expansion in (5) can diverge to infinity, resulting in an ill-defined kernel K(s, \bar{s}). To avoid this, the eigenvalues must decay to zero fast enough that \sum_{i=1}^\infty \lambda_i < \infty. Therefore, the kernel loss L_K(V) can be viewed as prioritizing the projections onto the eigenfunctions with larger eigenvalues. For typical kernels such as the Gaussian RBF kernel, these dominant eigenfunctions are Fourier bases with low frequencies (and hence high smoothness), which may intuitively be more relevant for practical purposes than the higher-frequency bases.

RKHS Interpretation. The squared Bellman error has the variational form

L_2(V) = \max_f \left\{ \left(\mathbb{E}_{s \sim \mu}\left[\mathcal{R}^\pi V(s) \cdot f(s)\right]\right)^2 : \ \mathbb{E}_{s \sim \mu}[(f(s))^2] \le 1 \right\},    (6)

which involves finding a function f in the unit L_2 ball whose inner product with \mathcal{R}^\pi V is maximal. Our kernel loss has a similar interpretation, with a different unit ball.

Any positive kernel K(s, \bar{s}) is associated with a Reproducing Kernel Hilbert Space (RKHS) \mathcal{H}_K, which is the Hilbert space consisting of (the closure of) the linear span of K(\cdot, s) for s \in S, and which satisfies the reproducing property f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}_K} for any f \in \mathcal{H}_K. The RKHS has been widely used as a powerful tool in various machine learning and statistical problems; see Berlinet & Thomas-Agnan (2011) and Muandet et al. (2017) for overviews.

Proposition 3.4. Let \mathcal{H}_K be the RKHS of kernel K(s, \bar{s}). Then

L_K(V) = \max_{f \in \mathcal{H}_K} \left\{ \left(\mathbb{E}_{s \sim \mu}\left[\mathcal{R}^\pi V(s) \cdot f(s)\right]\right)^2 : \ \|f\|_{\mathcal{H}_K} \le 1 \right\}.    (7)

Since the RKHS is a subset of the L_2 space that contains smooth functions, we again see that L_K(V) places more emphasis on the projections onto smooth basis functions, matching the intuition from Proposition 3.3. It also draws a connection to the recent primal-dual reformulations of the Bellman equation (Dai et al., 2017, 2018b), which formulate V^\pi as a saddle point of the minimax problem

\min_V \max_f \ \mathbb{E}_{s \sim \mu}\left[2 \mathcal{R}^\pi V(s) \cdot f(s) - f(s)^2\right].    (8)

This is equivalent to minimizing L_2(V) as in (6), except that the L_2 constraint is replaced by a quadratic penalty term. 
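Proposition 3.3's eigenvalue weighting also has an empirical analogue: the V-statistic (4) is exactly the eigenvalue-weighted sum of squared projections of the TD-error vector onto the eigenvectors of the Gram matrix. A sketch with synthetic data (not the paper's experiments):

```python
import numpy as np

# Synthetic states and stand-in TD errors, for illustration only.
rng = np.random.default_rng(2)
n = 30
s = rng.uniform(-1, 1, size=n)
delta = rng.normal(size=n)                     # stand-in for hat{R}^pi V(s_i)
K = np.exp(-(s[:, None] - s[None, :]) ** 2)    # Gaussian RBF Gram matrix

loss = delta @ K @ delta / n ** 2              # V-statistic (4)

lam, E = np.linalg.eigh(K)                     # K = E diag(lam) E^T
proj = E.T @ delta                             # projections onto eigenvectors

# Empirical counterpart of Proposition 3.3: eigenvalue-weighted projections.
assert abs(loss - np.sum(lam * proj ** 2) / n ** 2) < 1e-8
assert lam.min() > -1e-10                      # Gaussian Gram matrix is PSD
```

In such a Gram matrix the top eigenvectors are smooth over the sampled states, so large TD errors aligned with smooth directions dominate the loss, mirroring the smoothness prioritization discussed above.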
When only samples are available, the expectation in (8) is replaced by its empirical version. If the optimization domain of f is unconstrained, solving the empirical (8) reduces to minimizing the empirical L_2 loss (1), which yields inconsistent estimation. Therefore, existing works further constrain the optimization of f in (8) to either an RKHS (Dai et al., 2017) or neural networks (Dai et al., 2018b), and hence derive a minimax strategy for learning V. Unfortunately, this is substantially more expensive than our method due to the cost of jointly updating another neural network f; the minimax procedure may also make training less stable and more difficult to converge in practice.

3.3 Connection to Temporal Difference (TD) Methods

We now instantiate our algorithm in the tabular and linear cases to gain further insight. Interestingly, in these cases our loss coincides with previous work, and as a result leads to the same value function as several classic algorithms. Hence, the approach developed here may be considered a strict extension of these algorithms to much more general, nonlinear function approximation classes.

Again, let D be a set of n transitions sampled from distribution \mu, and let linear approximation be used: V_\theta(s) = \theta^\top \phi(s), where \phi : S \to \mathbb{R}^d is a feature function, and \theta \in \mathbb{R}^d is the parameter to be learned. The TD solution, \hat{\theta}_{\mathrm{TD}}, in both the on- and off-policy cases, can be found by various algorithms (e.g., Sutton, 1988; Boyan, 1999; Sutton et al., 2009; Dann et al., 2014), and its theoretical properties have been studied extensively (e.g., Tsitsiklis & Van Roy, 1997; Lazaric et al., 2012).

Corollary 3.5. 
When using a linear kernel of the form K(s, \bar{s}) = \phi(s)^\top \phi(\bar{s}), minimizing the kernel objective (4) gives the TD solution \hat{\theta}_{\mathrm{TD}}.

Remark. The result follows from the observation that, in the linear case, our loss becomes the Norm of the Expected TD Update (NEU) (Dann et al., 2014), whose minimizer coincides with \hat{\theta}_{\mathrm{TD}}. Moreover, in finite-state MDPs, the corollary includes tabular TD as a special case, by using a one-hot vector (indicator basis) to represent states. In this case, the TD solution coincides with that of a model-based approach (Parr et al., 2008) known as certainty equivalence (Kumar & Varaiya, 1986). It follows that our algorithm includes certainty equivalence as a special case in finite-state problems.

4 Kernel Loss for Policy Optimization

There are different ways to extend our approach to policy optimization. One is to use the kernel loss (3) inside an existing algorithm, as an alternative to RG or TD, to learn V^\pi(s). For example, our loss fits naturally into an actor-critic algorithm, where we replace the critic update (often implemented by TD(\lambda) or a variant) with our method, and leave the actor update unchanged. Another, more general way is to design a kernelized loss for V(s) and the policy \pi(a|s) jointly, so that policy optimization can be solved in a single optimization procedure. Here, we take the first approach, leveraging our method to improve the critic update step of Trust-PCL (Nachum et al., 2018).

Trust-PCL is based on a temporal/path consistency condition resulting from policy smoothing (Nachum et al., 2017). 
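Before turning to the algorithmic details, Corollary 3.5 can be sanity-checked numerically: with a linear kernel, the empirical loss (4) equals the squared norm of the averaged TD update, so its minimizer is the familiar LSTD/TD fixed point. A sketch with synthetic features and rewards (an illustrative assumption, not the paper's experiments):

```python
import numpy as np

# Synthetic linear policy-evaluation problem with random features.
rng = np.random.default_rng(3)
n, d, gamma = 200, 3, 0.9
Phi = rng.normal(size=(n, d))          # features phi(s_i)
Phi_next = rng.normal(size=(n, d))     # features phi(s'_i)
r = rng.normal(size=n)                 # rewards r_i

# LSTD fixed point: sum_i phi_i (r_i + gamma phi'_i^T theta - phi_i^T theta) = 0.
A = Phi.T @ (Phi - gamma * Phi_next)
b = Phi.T @ r
theta_td = np.linalg.solve(A, b)       # hat{theta}_TD

def kernel_loss(theta):
    """Empirical loss (4) with K(s, sbar) = phi(s)^T phi(sbar):
    equals || (1/n) sum_i phi(s_i) * delta_i ||^2 (the NEU objective)."""
    delta = r + gamma * Phi_next @ theta - Phi @ theta
    g = Phi.T @ delta / n
    return g @ g

# The linear-kernel loss vanishes at the TD solution and is positive elsewhere.
assert kernel_loss(theta_td) < 1e-18
assert kernel_loss(theta_td + 0.1) > kernel_loss(theta_td)
```

This matches the NEU connection noted in the remark: the linear-kernel loss is zero exactly where the expected TD update is zero.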
We start with the smoothed Bellman operator, defined by

\mathcal{B}_\lambda V(s) = \max_{\pi(\cdot|s) \in \mathcal{P}_A} \mathbb{E}_\pi\left[R(s, a) + \gamma V(s') + \lambda H(\pi \mid s) \,\middle|\, s\right],    (9)

where \mathcal{P}_A is the set of distributions over the action space A; the conditional expectation \mathbb{E}_\pi[\cdot \mid s] denotes a \sim \pi(\cdot|s); \lambda > 0 is a smoothing parameter; and H is a state-dependent entropy term, H(\pi \mid s) := -\sum_{a \in A} \pi(a|s) \log \pi(a|s). Intuitively, \mathcal{B}_\lambda is a smoothed approximation of \mathcal{B}. It is known that \mathcal{B}_\lambda is a \gamma-contraction (Fox et al., 2016), so it has a unique fixed point V^*_\lambda. Furthermore, with \lambda = 0 we recover the standard Bellman operator, and \lambda smoothly controls \|V^*_\lambda - V^*\|_\infty (Dai et al., 2018b).

The entropy regularization above implies the following path consistency condition. Let \pi^*_\lambda be an optimal policy in (9) for \mathcal{B}_\lambda, which yields V^*_\lambda. Then (V, \pi) = (V^*_\lambda, \pi^*_\lambda) uniquely solves

V(s) = R(s, a) + \gamma \mathbb{E}_{s'|s,a}[V(s')] - \lambda \log \pi(a|s) \quad \text{for all } (s, a) \in S \times A.

This property inspires a natural extension of the kernel loss (3) to the controlled case:

L_K(V) = \mathbb{E}_{s,\bar{s} \sim \mu,\; a \sim \pi(\cdot|s),\; \bar{a} \sim \pi(\cdot|\bar{s})}\left[K([s, a], [\bar{s}, \bar{a}]) \cdot \mathcal{R}^{\pi,\lambda} V(s, a) \cdot \mathcal{R}^{\pi,\lambda} V(\bar{s}, \bar{a})\right],

where \mathcal{R}^{\pi,\lambda} V(s, a) is given by

\mathcal{R}^{\pi,\lambda} V(s, a) = R(s, a) + \gamma \mathbb{E}_{s'|s,a}[V(s')] - \lambda \log \pi(a|s) - V(s).

Given a set of transitions D = \{(s_i, a_i, r_i, s'_i)\}_{1 \le i \le n}, the objective can be estimated by

\hat{L}_K(V_\theta) = \frac{1}{n^2} \sum_{1 \le i,j \le n} K([s_i, a_i], [s_j, a_j]) \, \hat{R}_i \hat{R}_j, \quad \text{with} \quad \hat{R}_i = r_i + \gamma V_\theta(s'_i) - \lambda \log \pi_\theta(a_i|s_i) - V_\theta(s_i).

The 
U-statistic version and multi-step bootstraps can be obtained similarly (Nachum et al., 2017).

5 Related Work

In this work, we study value function learning, one of the most fundamental and well-studied problems in reinforcement learning. The dominant approach is based on fixed-point iterations (Bertsekas & Tsitsiklis, 1996; Szepesvári, 2010; Sutton & Barto, 2018), which can risk instability and even divergence when function approximation is used, as discussed in the introduction.

Our approach exemplifies more recent efforts that aim to improve the stability of value function learning by reformulating it as an optimization problem. Our key innovation is the use of a kernel method to estimate the squared Bellman error, which is otherwise hard to estimate directly from samples, thus avoiding the double-sample issue left unaddressed by prior algorithms like residual gradient (Baird, 1995) and PCL (Nachum et al., 2017, 2018). As a result, our algorithm is consistent: it finds the true value function given enough data and a sufficiently expressive function approximation class. Furthermore, the solution found by our algorithm minimizes a projected Bellman error, matching prior works when specialized to the same settings (Sutton et al., 2009; Maei et al., 2010; Liu et al., 2015; Macua et al., 2015). However, our algorithm is more general: it allows nonlinear value function classes and can be implemented naturally for policy optimization. Compared to nonlinear GTD2/TDC (Maei et al., 2009), our method is simpler (it requires no local linear expansion) and empirically more effective (as demonstrated in the next section).

As discussed in Section 3, our approach is related to the recently proposed SBEED algorithm (Dai et al., 2018b), which shares many advantages with this work. However, SBEED requires solving a minimax problem that can be rather challenging in practice. 
In contrast, our algorithm only needs to solve a minimization problem, for which a wide range of powerful methods exist (e.g., Bertsekas, 2016). Note that there exist other saddle-point formulations for RL, derived from the linear program of MDPs (Wang, 2017; Chen et al., 2018; Dai et al., 2018a). The connections and comparisons between these formulations are interesting directions to investigate.

Our work is also related to a line of interesting work on Bellman residual minimization (BRM) based on nested optimization (Antos et al., 2008; Farahmand et al., 2008, 2016; Hoffman et al., 2011; Chen & Jiang, 2019). These works formulate the value function as the solution to a coupled optimization problem, where both the inner and outer optimizations are over the same function space. While their inner optimization plays a similar role to our use of an RKHS in the kernel loss definition, our loss is derived in a different way, and it decouples the representations used in the inner and outer optimizations. Furthermore, the nested optimization formulation also involves solving a minimax problem (similar to SBEED), while our approach is much simpler as it only requires solving a minimization problem.

Finally, kernel methods have been widely used in machine learning (e.g., Schölkopf & Smola, 2001; Muandet et al., 2017). In RL, kernels have been used either to smooth the estimates of transition probabilities and rewards (Ormoneit & Sen, 2002), or to represent the value function (e.g., Xu et al., 2005, 2007; Taylor & Parr, 2009). Our method differs from these works in that we leverage kernels to design a proper loss function that addresses the double-sampling problem, while placing no constraints on the approximation class used to represent the value function. 
Our approach is thus expected to be more flexible and scalable in practice, allowing the value function to lie in flexible function classes like neural networks.

6 Experiments

We compare our method (labelled "K-loss" in all experiments) with several representative baselines in both classic examples and popular benchmark problems, for both policy evaluation and optimization.

6.1 Modified Example of Tsitsiklis & Van Roy

Fig. 1 (a) shows a modified version of the classic example of Tsitsiklis & Van Roy (1997), in which transitions are made stochastic.[1] It consists of 5 states, including 4 nonterminal states (circles) and 1 terminal state (square), and 1 action. The arrows represent transitions between states. The value function estimate is linear in the weight w = [w_1, w_2, w_3]: for example, the leftmost and bottom-right states' values are w_1 and 2w_3, respectively. Furthermore, we set \gamma = 1, so V(s) is exact with the optimal weight w^* = [0.8, 1.0, 0]. In the experiment, we randomly collect 2,000 transition tuples for training. We use a linear kernel in our method, so that it finds the TD solution (Corollary 3.5).

[1] Recall that the double-sample issue exists only in stochastic problems, so the modification is necessary to make the comparison to residual gradient meaningful.

[Figure 1: Modified example of Tsitsiklis & Van Roy (1997). (a) Our MDP Example; (b) MSE vs. Iteration; (c) \|w - w^*\| vs. Iteration.]

[Figure 2: Results on Puddle World. (a) MSE; (b) Bellman Error; (c) L2/K-Loss vs MSE; (d) L2/K-Loss vs Bellman Error.]

Fig. 1 (b&c) show the learning curves of mean squared error (\|V - V^*\|^2) and weight error (\|w - w^*\|) for different algorithms over iterations. 
Results are consistent with the theory: our method converges to the true weight w*, while both FVI and TD(0) diverge, and RG converges to a wrong solution.

6.2 Policy Evaluation with Neural Networks

While popular in the recent RL literature, neural networks have long been known to be unstable in value function learning. Here, we revisit the classic divergence example of Puddle World (Boyan & Moore, 1995), and demonstrate the stability of our method. Experimental details are found in Appendix B.2.

Fig. 2 summarizes the results of using a neural network as the value function, for two metrics: ||V − V*||_2^2 and ||BV − V||_2^2, both evaluated on the training transitions. First, as shown in (a-b), our method works well while residual gradient converges to inferior solutions. In contrast, FVI and TD(0) exhibit unstable/oscillating behavior, and can even diverge, which is consistent with past findings (Boyan & Moore, 1995). In addition, nonlinear GTD2 (Maei et al., 2009) and SBEED (Dai et al., 2017, 2018b) do not find a better solution than our method in terms of MSE.

Second, Fig. 2 (c&d) show how an algorithm's training objective correlates with the MSE and with the empirical Bellman error of the value function estimate, respectively. Our kernel loss appears to be a good proxy for learning the value function, in terms of both MSE and Bellman error. In contrast, the L2 loss (used by residual gradient) does not correlate well with either metric, which also explains why residual gradient has been observed not to work well empirically.

Fig. 3 shows more results on value function learning on CartPole and Mountain Car, which again demonstrate that our method generally performs better than the other methods.

6.3 Policy Optimization

To demonstrate the use of our method in policy optimization, we combine it with Trust-PCL, and compare with variants of Trust-PCL combined with FVI, TD(0) and RG.
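As context for how these variants differ only in the value-function update, a gradient step on the kernel loss needs nothing beyond a sampled batch of transitions. The following numpy sketch shows one such update loop with a Gaussian kernel and a linear value function (all data and names are hypothetical, not the Trust-PCL code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
phi = rng.normal(size=(n, d))        # phi(s_i)
phi_next = rng.normal(size=(n, d))   # phi(s'_i)
rewards = rng.normal(size=n)
gamma, lr = 0.9, 0.5

# Gaussian kernel Gram matrix over the sampled states (fixed unit bandwidth).
sqdist = ((phi[:, None, :] - phi[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sqdist)

def loss_and_grad(w):
    # Kernel loss L(w) = delta^T K delta / n^2 with single-sample residuals
    # delta_i = r_i + gamma*V(s'_i) - V(s_i); for V(s) = w^T phi(s),
    # d(delta_i)/dw = gamma*phi(s'_i) - phi(s_i).
    delta = rewards + gamma * phi_next @ w - phi @ w
    J = gamma * phi_next - phi               # (n, d) Jacobian of delta w.r.t. w
    return delta @ K @ delta / n**2, 2.0 * J.T @ (K @ delta) / n**2

# Plain gradient descent on the loss; each step touches only observed
# transitions, so the update needs no double samples.
w = np.zeros(d)
losses = []
for _ in range(500):
    loss, grad = loss_and_grad(w)
    losses.append(loss)
    w -= lr * grad
```

Since the Gram matrix K is positive semidefinite, the loss is a nonnegative quadratic in w under this linear parameterization, and plain gradient descent decreases it monotonically for small step sizes; with a neural network V_θ, the analogous gradient is obtained by backpropagating through the residuals.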
To fairly evaluate the performance of all four methods, we use the Trust-PCL (Nachum et al., 2018) framework and its public code for our experiments. We only modify the training of Vθ(s) for each method, and keep the rest the same as in the original release. Experimental details can be found in Appendix B.3.1.

[Figure 3: Policy evaluation results on CartPole and Mountain Car. (a) CartPole MSE; (b) CartPole Bellman error; (c) Mountain Car MSE; (d) Mountain Car Bellman error.]

[Figure 4: Results (average returns) of various variants of Trust-PCL on the MuJoCo benchmark. (a) Swimmer; (b) InvertedDoublePendulum; (c) Ant; (d) InvertedPendulum.]

We evaluate these four methods on the MuJoCo benchmark and report their best performance in Figure 4 (averaged over five different random seeds). K-loss consistently outperforms all the other methods, learning better policies with less data. Note that we only modify the update of the value function inside Trust-PCL, which can be implemented relatively easily. We expect that many other algorithms can be improved in similar ways, by improving the value function with our kernel loss.

7 Conclusion

This paper studies the fundamental problem of solving Bellman equations with parametric value functions. A novel kernel loss is proposed, which is easy to estimate and optimize using sampled transitions.
Empirical results show that, compared to prior algorithms, our method is convergent, produces more accurate value functions, and can be easily adapted for policy optimization.

These promising results open the door to many interesting directions for future work. An important question is finite-sample analysis, quantifying how fast the minimizer of the empirical kernel loss converges to the true minimizer of the population loss when the data are not i.i.d. Another is to extend the loss to the online setting, where data arrive in a stream and the learner cannot store all previous data. Such an online version may provide computational benefits in certain applications. Finally, it may be possible to quantify uncertainty in the value function estimate, and use this uncertainty information to guide efficient exploration.

Acknowledgment

This work is supported in part by NSF CRII 1830161 and NSF CAREER 1846421. We would like to acknowledge Google Cloud and Amazon Web Services (AWS) for their support. We also thank an anonymous reviewer and Bo Dai for helpful suggestions on related work that improved the paper.

References

Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

Baird, L. C. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37, 1995.

Berlinet, A.
and Thomas-Agnan, C. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.

Bertsekas, D. P. Nonlinear Programming. Athena Scientific, 3rd edition, 2016.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, September 1996.

Beutner, E. and Zähle, H. Deriving the asymptotic distribution of U- and V-statistics of dependent data using weighted empirical processes. Bernoulli, pp. 803–822, 2012.

Boyan, J. A. Least-squares temporal difference learning. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 49–56, 1999.

Boyan, J. A. and Moore, A. W. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7, pp. 369–376, 1995.

Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 1042–1051, 2019.

Chen, Y., Li, L., and Wang, M. Scalable bilinear π-learning using state and action features. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, pp. 833–842, 2018.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In Proceedings of The 33rd International Conference on Machine Learning, 2016.

Dai, B., He, N., Pan, Y., Boots, B., and Song, L. Learning from conditional distributions via dual embeddings. In Artificial Intelligence and Statistics, pp. 1458–1467, 2017.

Dai, B., Shaw, A., He, N., Li, L., and Song, L. Boosting the actor with dual critic. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018a.

Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. SBEED: Convergent reinforcement learning with nonlinear function approximation.
In Proceedings of the Thirty-Fifth International Conference on Machine Learning, pp. 1133–1142, 2018b.

Dann, C., Neumann, G., and Peters, J. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15(1):809–883, 2014.

Denker, M. and Keller, G. On U-statistics and v. Mises' statistics for weakly dependent processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 64(4):505–522, 1983.

Farahmand, A. M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. Regularized policy iteration. In Advances in Neural Information Processing Systems 21, pp. 441–448, 2008.

Farahmand, A. M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research, 17(130):1–66, 2016.

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016.

Gordon, G. J. Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268, 1995.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

Gu, S., Lillicrap, T. P., Sutskever, I., and Levine, S. Continuous deep Q-learning with model-based acceleration. In Proceedings of the Thirty-Third International Conference on Machine Learning, pp. 2829–2838, 2016.

Hoffman, M. W., Lazaric, A., Ghavamzadeh, M., and Munos, R. Regularized least squares temporal difference learning with nested ℓ2 and ℓ1 penalization. In Recent Advances in Reinforcement Learning. EWRL 2011, volume 7188 of Lecture Notes in Computer Science, pp. 102–114, 2011.

Kumar, P. and Varaiya, P.
Stochastic Systems: Estimation, Identification, and Adaptive Control. Prentice Hall, 1986.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, pp. 3041–3074, 2012.

Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 504–513, 2015.

Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pp. 276–284, 2016.

Macua, S. V., Chen, J., Zazo, S., and Sayed, A. H. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260–1274, 2015.

Maei, H. R. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, Edmonton, Alberta, Canada, 2011.

Maei, H. R., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D., and Sutton, R. S. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems 22, pp. 1204–1212, 2009.

Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. Toward off-policy learning control with function approximation. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pp. 719–726, 2010.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T.
P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the Thirty-Third International Conference on Machine Learning, pp. 1928–1937, 2016.

Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.

Munos, R. and Szepesvári, C. Finite-time bounds for sampling-based fitted value iteration. Journal of Machine Learning Research, 9:815–857, 2008.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems 30, pp. 2772–2782, 2017.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An off-policy trust region method for continuous control. In International Conference on Learning Representations, 2018.

Ormoneit, D. and Sen, Ś. Kernel-based reinforcement learning. Machine Learning, 49:161–178, 2002.

Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pp. 752–759, 2008.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York, 1994.

Schölkopf, B. and Smola, A. J. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2001.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations, 2016.

Serfling, R. J.
Approximation theorems of mathematical statistics, volume 162. John Wiley & Sons, 2009.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.

Stewart, J. Positive definite functions and generalizations, an historical survey. The Rocky Mountain Journal of Mathematics, 6(3):409–434, 1976.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 2nd edition, 2018.

Sutton, R. S., Maei, H., Precup, D., Bhatnagar, S., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, pp. 993–1000, 2009.

Szepesvári, C. Algorithms for Reinforcement Learning. Morgan & Claypool, 2010.

Taylor, G. and Parr, R. Kernelized value function approximation for reinforcement learning. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, pp. 1017–1024, 2009.

Tsitsiklis, J. N. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690, 1997.

Wang, M. Primal-dual π learning: Sample complexity and sublinear run time for ergodic Markov decision problems, 2017. CoRR abs/1710.06100.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the Thirty-Third International Conference on Machine Learning, pp. 1995–2003, 2016.

Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J.
Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems 30, pp. 5285–5294, 2017.

Xu, X., Xie, T., Hu, D., and Lu, X. Kernel least-squares temporal difference learning. International Journal of Information and Technology, 11(9):54–63, 2005.

Xu, X., Hu, D., and Lu, X. Kernel-based least-squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 18(4):973–992, 2007.