{"title": "A Non-Parametric Approach to Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1719, "page_last": 1727, "abstract": "In this paper, we consider the problem of policy evaluation for continuous-state systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin's method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods.", "full_text": "A Non-Parametric Approach to\n\nDynamic Programming\n\nOliver B. Kroemer1,2\n\nJan Peters1,2\n\n1Intelligent Autonomous Systems, Technische Universit\u00e4t Darmstadt\n2Robot Learning Lab, Max Planck Institute for Intelligent Systems\n\n{kroemer,peters}@ias.tu-darmstadt.de\n\nAbstract\n\nIn this paper, we consider the problem of policy evaluation for continuous-\nstate systems. We present a non-parametric approach to policy evaluation,\nwhich uses kernel density estimation to represent the system. The true\nform of the value function for this model can be determined, and can be\ncomputed using Galerkin\u2019s method. Furthermore, we also present a uni\ufb01ed\nview of several well-known policy evaluation methods.\nIn particular, we\nshow that the same Galerkin method can be used to derive Least-Squares\nTemporal Di\ufb00erence learning, Kernelized Temporal Di\ufb00erence learning, and\na discrete-state Dynamic Programming solution, as well as our proposed\nmethod. 
In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods.\n\n1 Introduction\n\nValue functions are an essential concept for determining optimal policies in both optimal control [1] and reinforcement learning [2, 3]. Given the value function of a policy, an improved policy is straightforward to compute. The improved policy can subsequently be evaluated to obtain a new value function. This loop of computing value functions and determining better policies is known as policy iteration. However, the main bottleneck in policy iteration is the computation of the value function for a given policy. Using the Bellman equation, only two classes of systems have been solved exactly: tabular discrete state and action problems [4] as well as linear-quadratic regulation problems [5]. The exact computation of the value function remains an open problem for most systems with continuous state spaces [6]. This paper focuses on steps toward solving this problem.\nAs an alternative to exact solutions, approximate policy evaluation methods have been developed in reinforcement learning. These approaches include Monte Carlo methods, temporal difference learning, and residual gradient methods. However, Monte Carlo methods are well-known to have an excessively high variance [7, 2], and tend to overfit the value function to the sampled data [2]. When using function approximations, temporal difference learning can result in a biased solution [8]. Residual gradient approaches are biased unless multiple samples are taken from the same states [9], which is often not possible for real continuous systems.\nIn this paper, we propose a non-parametric method for continuous-state policy evaluation. The proposed method uses a kernel density estimate to represent the system in a flexible manner. 
Model-based approaches are known to be more data efficient than direct methods, and lead to better policies [10, 11]. We subsequently show that the true value function for this model has a Nadaraya-Watson kernel regression form [12, 13]. Using Galerkin\u2019s projection method, we compute a closed-form solution for this regression problem. The resulting method is called Non-Parametric Dynamic Programming (NPDP), and is a stable as well as consistent approach to policy evaluation.\nThe second contribution of this paper is to provide a unified view of several sample-based algorithms for policy evaluation, including the NPDP algorithm. In Section 3, we show how Least-Squares Temporal Difference learning (LSTD) in [14], Kernelized Temporal Difference learning (KTD) in [15], and Discrete-State Dynamic Programming (DSDP) in [4, 16] can all be derived using the same Galerkin projection method used to derive NPDP. In Section 4, we compare these methods using empirical evaluations.\nIn reinforcement learning, the uncontrolled system is usually represented by a Markov Decision Process (MDP). An MDP is defined by the following components: a set of states S; a set of actions A; a transition distribution p(s'|a, s), where s' \u2208 S is the next state given action a \u2208 A in state s \u2208 S; a reward function r, such that r(s, a) is the immediate reward obtained for performing action a in state s; and a discount factor \u03b3 \u2208 [0, 1) on future rewards. Actions a are selected according to the stochastic policy \u03c0(a|s). The goal is to maximize the discounted rewards that are obtained; i.e., $\\max \\sum_{t=0}^{\\infty} \\gamma^t r(s_t, a_t)$. The term system will refer jointly to the agent\u2019s policy and the MDP.\nThe value of a state V(s), for a specific policy \u03c0, is defined as the expected discounted sum of rewards that an agent will receive after visiting state s and executing policy \u03c0; i.e.,\n\n$$V(s) = E\\left\\{ \\sum_{t=0}^{\\infty} \\gamma^t r(s_t, a_t) \\,\\middle|\\, s_0 = s, \\pi \\right\\}. \\tag{1}$$\n\nBy using the Markov property, Eq. (1) can be rewritten as the Bellman equation\n\n$$V(s) = \\int_A \\int_S \\pi(a|s)\\, p(s'|s, a) \\left[ r(s, a) + \\gamma V^{\\pi}(s') \\right] ds'\\, da. \\tag{2}$$\n\nThe advantage of using the Bellman equation is that it describes the relationship between the value function at one state s and its immediate follow-up states s' \u223c p(s'|s, a). In contrast, the direct computation of Eq. (1) relies on the rewards obtained from entire trajectories.\n\n2 Non-Parametric Model-based Dynamic Programming\nWe begin describing the NPDP approach by introducing the kernel density estimation framework used to represent the system. The true value function for this model has a kernel regression form, which can be computed by using Galerkin\u2019s projection method. We subsequently discuss some of the properties of this algorithm, including its consistency.\n\n2.1 Non-Parametric System Modeling\nThe dynamics of a system are compactly represented by the joint distribution p(s, a, s'). Using Bayes rule and marginalization, one can compute the transition probabilities p(s'|s, a) and the current policy \u03c0(a|s) from this joint distribution; e.g. $p(s'|s, a) = p(s, a, s') / \\int p(s, a, s')\\, ds'$. 
Rather than assuming that certain prior information is given, we will focus on the problem where only sampled information of the system is available. Hence, the system\u2019s joint distribution is modeled from a set of n samples obtained from the real system. The ith sample includes the current state $s_i \\in S$, the selected action $a_i \\in A$, and the follow-up state $s'_i \\in S$, as well as the immediate reward $r_i \\in \\mathbb{R}$. The state space S and the action space A are assumed to be continuous.\nWe propose using kernel density estimation to represent the joint distribution [17, 18] in a non-parametric manner. Unlike parametric models, non-parametric approaches use the collected data as features, which leads to accurate representations of arbitrary functions [19]. The system\u2019s joint distribution is therefore modeled as $p(s, a, s') = n^{-1} \\sum_{i=1}^{n} \\psi_i(s')\\, \\phi_i(a)\\, \\varphi_i(s)$, where $\\psi_i(s') = \\psi(s', s'_i)$, $\\phi_i(a) = \\phi(a, a_i)$, and $\\varphi_i(s) = \\varphi(s, s_i)$ are symmetric kernel functions. In practice, the kernel functions \u03c8 and \u03c6 will often be the same. To ensure a valid probability density, each kernel must integrate to one; i.e., $\\int \\varphi_i(s)\\, ds = 1, \\forall i$, and similarly for \u03c8 and \u03d5. As an additional constraint, the kernel must always be non-negative; i.e., $\\psi_i(s')\\, \\phi_i(a)\\, \\varphi_i(s) \\geq 0, \\forall s \\in S$. This representation implies a factorization into separate $\\psi_i(s')$, $\\phi_i(a)$, and $\\varphi_i(s)$ kernels. As a result, an individual sample cannot express correlations between s', a, and s. However, the representation does allow multiple samples to express correlations between these components in p(s, a, s').\n\nThe reward function r(s, a) must also be represented. 
Given the kernel density estimate representation, the expected reward for a state-action pair is denoted as [12]\n\n$$r(s, a) = E[r|s, a] = \\frac{\\sum_{k=1}^{n} r_k\\, \\phi_k(a)\\, \\varphi_k(s)}{\\sum_{i=1}^{n} \\phi_i(a)\\, \\varphi_i(s)}.$$\n\nHaving specified the model of the system dynamics and rewards, the next step is to derive the corresponding value function.\n\n2.2 Resulting Solution\nIn this section, we propose an approach to computing the value function for the continuous model specified in Section 2.1. Every policy has a unique value function, which fulfills the Bellman equation, Eq. (2), for all states [2, 20]. Hence, the goal is to solve the Bellman equation for the entire state space, and not just at the sampled states. This goal can be achieved by using the Galerkin projection method to compute the value function for the model [21].\nThe Galerkin method involves first projecting the integral equation into the space spanned by a set of basis functions. The integral equation is then solved in this projected space. To begin, the Bellman equation, Eq. (2), is rearranged as\n\n$$V(s) = \\int_A \\int_S \\pi(a|s)\\, r(s, a)\\, p(s'|s, a)\\, ds'\\, da + \\int_S \\int_A p(s'|s, a)\\, \\gamma V(s')\\, \\pi(a|s)\\, da\\, ds',$$\n\n$$p(s) V(s) = \\int_A p(a, s)\\, r(s, a)\\, da + \\gamma \\int_S p(s', s)\\, V(s')\\, ds'. \\tag{3}$$\n\nBefore applying the Galerkin method, we derive the exact form of the value function. 
Expanding the reward function and joint distributions, as defined in Section 2.1, gives\n\n$$p(s) V(s) = \\int_A n^{-1} \\sum_{k=1}^{n} \\phi_k(a)\\, \\varphi_k(s)\\, \\frac{\\sum_{i=1}^{n} r_i\\, \\phi_i(a)\\, \\varphi_i(s)}{\\sum_{j=1}^{n} \\phi_j(a)\\, \\varphi_j(s)}\\, da + \\gamma \\int_S p(s', s)\\, V(s')\\, ds',$$\n\n$$p(s) V(s) = \\int_A n^{-1} \\sum_{i=1}^{n} r_i\\, \\phi_i(a)\\, \\varphi_i(s)\\, da + \\gamma \\int_S n^{-1} \\sum_{i=1}^{n} \\psi_i(s')\\, \\varphi_i(s)\\, V(s')\\, ds',$$\n\n$$p(s) V(s) = n^{-1} \\sum_{i=1}^{n} r_i\\, \\varphi_i(s) + n^{-1} \\sum_{i=1}^{n} \\gamma \\int_S \\psi_i(s')\\, \\varphi_i(s)\\, V(s')\\, ds'.$$\n\nTherefore, $p(s) V(s) = n^{-1} \\sum_{i=1}^{n} \\theta_i \\varphi_i(s)$, where \u03b8 are value weights. Given that $p(s) = n^{-1} \\sum_{j=1}^{n} \\varphi_j(s)$, the true value function of the kernel density estimate system has a Nadaraya-Watson kernel regression [12, 13] form\n\n$$V(s) = \\frac{\\sum_{i=1}^{n} \\theta_i \\varphi_i(s)}{\\sum_{j=1}^{n} \\varphi_j(s)}. \\tag{4}$$\n\nHaving computed the true form of the value function, the Galerkin projection method can be used to compute the value weights \u03b8. The projection is performed by taking the expectation of the integral equation with respect to each of the n basis functions $\\varphi_i$. The resulting n simultaneous equations can be written as the vector equation\n\n$$\\int_S \\varphi(s)\\, p(s) V(s)\\, ds = \\int_S \\varphi(s)\\, n^{-1} \\varphi(s)^T r\\, ds + \\gamma \\int_S \\int_S \\varphi(s)\\, n^{-1} \\left( \\varphi(s)^T \\psi(s') \\right) V(s')\\, ds'\\, ds,$$\n\nwhere the ith elements of the vectors are given by $[r]_i = r_i$, $[\\varphi(s)]_i = \\varphi_i(s)$, and $[\\psi(s')]_i = \\psi_i(s')$. 
Expanding the value functions gives\n\n$$\\int_S \\varphi(s)\\, \\varphi(s)^T \\theta\\, ds = \\int_S \\varphi(s)\\, \\varphi(s)^T r\\, ds + \\gamma \\int_S \\int_S \\varphi(s) \\left( \\varphi(s)^T \\psi(s') \\right) \\frac{\\varphi(s')^T \\theta}{\\sum_{i=1}^{n} \\varphi_i(s')}\\, ds'\\, ds,$$\n\n$$C\\theta = Cr + \\gamma C \\lambda \\theta,$$\n\nwhere $C = \\int_S \\varphi(s)\\, \\varphi(s)^T ds$, and $\\lambda = \\int_S \\left( \\sum_{i=1}^{n} \\varphi_i(s') \\right)^{-1} \\psi(s')\\, \\varphi(s')^T ds'$ is a stochastic matrix; i.e., a transition matrix.\n\nAlgorithm 1 Non-Parametric Dynamic Programming\nInput:\n  n system samples: state $s_i$, next state $s'_i$, and reward $r_i$\n  Kernel functions: $\\varphi_i(s_j) = \\varphi(s_i, s_j)$, and $\\psi_i(s'_j) = \\psi(s'_i, s'_j)$\n  Discount factor: $0 \\leq \\gamma < 1$\nComputation:\n  Reward vector: $[r]_i = r_i$\n  Transition matrix: $[\\lambda]_{i,j} = \\int_S \\frac{\\varphi_j(s')\\, \\psi_i(s')}{\\sum_{k=1}^{n} \\varphi_k(s')}\\, ds'$\n  Value weights: $\\theta = (I - \\gamma \\lambda)^{-1} r$\nOutput:\n  Value function: $V(s) = \\frac{\\sum_{i=1}^{n} \\theta_i \\varphi_i(s)}{\\sum_{j=1}^{n} \\varphi_j(s)}$\n\nThe matrix C can become singular if two basis functions are coincident. In such cases, there exists an infinite set of solutions for \u03b8. However, all of the solutions result in identical values. The NPDP algorithm uses the solution given by\n\n$$\\theta = (I - \\gamma \\lambda)^{-1} r, \\tag{5}$$\n\nwhich always exists for any stochastic matrix \u03bb. Thus, the derivation has shown that the exact value function for the model in Section 2.1 has a Nadaraya-Watson kernel regression form, as shown in Eq. (4), with weights \u03b8 given by Eq. (5). 
The non-parametric dynamic programming algorithm is summarized in Alg. 1. The NPDP algorithm ultimately requires only the state information s and s', and not the actions a. In Section 3, we will show how this form of derivation can also be used to derive the LSTD, KTD, and DSDP algorithms.\n\n2.3 Properties of the NPDP Algorithm\nIn this section, we discuss some of the key properties of the proposed NPDP algorithm, including precision, accuracy, and computational complexity. Precision refers to how close the predicted value function is to the true value function of the model, while accuracy refers to how close the model is to the true system.\nOne of the key contributions of this paper is providing the true form of the value function for policy evaluation with the non-parametric model described in Section 2.1. The parameters of this value function can be computed precisely by solving Eq. (5). Even if \u03bb is evaluated numerically, a high level of precision can still be obtained.\nAs a non-parametric method, the accuracy of the NPDP algorithm depends on the number of samples obtained from the system. It is important that the model, and thus the value function, converges to that of the true system as the number of samples increases; i.e., that the model is statistically consistent. In fact, kernel density estimation can be proven to have almost sure convergence to the true distribution for a wide range of kernels [22].\nGiven that \u03bb is a stochastic matrix and 0 \u2264 \u03b3 < 1, it is well-known that the inversion of (I \u2212 \u03b3\u03bb) is well-defined [16]. The inversion can therefore also be expanded according to the Neumann series; i.e., $\\theta = \\sum_{i=0}^{\\infty} [\\gamma \\lambda]^i r$. Similar to other kernel-based policy evaluation methods [23, 24], NPDP has a computational complexity of $O(n^3)$ when performed naively. However, by taking advantage of sparse matrix computations, this complexity can be reduced to O(nz), where z is the number of non-zero elements in (I \u2212 \u03b3\u03bb).\n\n3 Relation to Existing Methods\n\nThe second contribution of this paper is to provide a unified view of Least Squares Temporal Difference learning (LSTD), Kernelized Temporal Difference learning (KTD), Discrete-State Dynamic Programming (DSDP), and the proposed Non-Parametric Dynamic Programming (NPDP). In this section, we utilize the Galerkin methodology from Section 2.2 to re-derive the LSTD, KTD, and DSDP algorithms, and discuss how these methods compare to NPDP. A numerical comparison is given in Section 4.\n\n3.1 Least Squares Temporal Difference Learning\nThe LSTD algorithm allows the value function V(s) to be represented by a set of m arbitrary basis functions $\\hat{\\varphi}_i(s)$, see [14]. Hence, $V(s) = \\sum_{i=1}^{m} \\hat{\\theta}_i \\hat{\\varphi}_i(s) = \\hat{\\varphi}(s)^T \\hat{\\theta}$, where $\\hat{\\theta}$ is a vector of coefficients learned during policy evaluation, and $[\\hat{\\varphi}(s)]_i = \\hat{\\varphi}_i(s)$. In order to re-derive the LSTD policy evaluation, the joint distribution is represented as a set of delta functions $p(s, a, s') = n^{-1} \\sum_{i=1}^{n} \\delta_i(s, a, s')$, where $\\delta_i(s, a, s')$ is a Dirac delta function centered on $(s_i, a_i, s'_i)$. Using Galerkin\u2019s method, the integral equation is projected into the space of the basis functions $\\hat{\\varphi}(s)$. Thus, Eq. (3) becomes\n\n$$\\int_S \\hat{\\varphi}(s)\\, p(s)\\, \\hat{\\varphi}(s)^T \\hat{\\theta}\\, ds = \\int_A \\int_S \\hat{\\varphi}(s)\\, p(s, a)\\, r(s, a)\\, ds\\, da + \\gamma \\int_S \\int_S \\hat{\\varphi}(s)\\, p(s, s')\\, \\hat{\\varphi}(s')^T \\hat{\\theta}\\, ds'\\, ds,$$\n\n$$\\sum_{i=1}^{n} \\hat{\\varphi}(s_i)\\, \\hat{\\varphi}(s_i)^T \\hat{\\theta} = \\sum_{j=1}^{n} r(s_j, a_j)\\, \\hat{\\varphi}(s_j) + \\gamma \\sum_{k=1}^{n} \\hat{\\varphi}(s_k)\\, \\hat{\\varphi}(s'_k)^T \\hat{\\theta},$$\n\n$$\\sum_{i=1}^{n} \\hat{\\varphi}(s_i) \\left( \\hat{\\varphi}(s_i)^T - \\gamma \\hat{\\varphi}(s'_i)^T \\right) \\hat{\\theta} = \\sum_{j=1}^{n} r(s_j, a_j)\\, \\hat{\\varphi}(s_j),$$\n\nand thus $A \\hat{\\theta} = b$, where $A = \\sum_{i=1}^{n} \\hat{\\varphi}(s_i) \\left( \\hat{\\varphi}(s_i)^T - \\gamma \\hat{\\varphi}(s'_i)^T \\right)$ and $b = \\sum_{j=1}^{n} r(s_j, a_j)\\, \\hat{\\varphi}(s_j)$. The final weights are therefore given by\n\n$$\\hat{\\theta} = A^{-1} b.$$\n\nThis equation is also solved by LSTD, including the incremental updates of A and b as new samples are acquired [14]. Therefore, LSTD can be seen as computing the transitions between the basis functions using a Monte Carlo approach. However, Monte Carlo methods rely on large numbers of samples to obtain accurate results.\nA key disadvantage of the LSTD method is the need to select a specific set of basis functions. The computed value function will always be a projection of the true value function into the space of these basis functions [8]. If the true value function does not lie within the space of these basis functions, the resulting approximation may be arbitrarily inaccurate, regardless of the number of acquired samples. 
However, using predefined basis functions only requires inverting an m \u00d7 m matrix, which results in a lower computational complexity than NPDP. LSTD may also need to be regularized, as the inversion of A becomes ill-posed if the basis functions are too densely spaced. Regularization has a similar effect to changing the transition probabilities of the system [25].\n\n3.2 Kernelized Temporal Difference Learning Methods\nThe proposed approach is of course not the first to use kernels for policy evaluation. Methods such as kernelized least-squares temporal difference learning [24] and Gaussian process temporal difference learning [23] have also employed kernels in policy evaluation. Taylor and Parr demonstrated that these methods differ mainly in their use of regularization [15]. The unified view of these methods is referred to as Kernelized Temporal Difference learning.\nThe KTD approach assumes that the reward and value functions can be represented by kernelized linear least-squares regression; i.e., $r(s) = k(s)^T K^{-1} r$ and $V(s) = k(s)^T \\hat{\\theta}$, where $[k(s)]_i = k(s, s_i)$, $[K]_{ij} = k(s_i, s_j)$, $[r]_i = r_i$, and $\\hat{\\theta}$ is a weight vector. In order to derive KTD using Galerkin\u2019s method, it is necessary to again represent the joint distribution as $p(s, a, s') = n^{-1} \\sum_{i=1}^{n} \\delta_i(s, a, s')$. The Galerkin method projects the integral equation into the space of the Kronecker delta functions $[\\check{\\delta}(s)]_i = \\check{\\delta}_i(s, a_i, s'_i)$, where $\\check{\\delta}_i(s, a, s') = 1$ if $s' = s'_i$, $a = a_i$, and $s = s_i$; otherwise $\\check{\\delta}_i(s, a, s') = 0$. Thus, Eq. (3) becomes\n\n$$\\int_S \\check{\\delta}(s)\\, p(s)\\, k(s)^T \\hat{\\theta}\\, ds = \\int_S \\check{\\delta}(s)\\, p(s)\\, r(s)\\, ds + \\gamma \\int_S \\int_S \\check{\\delta}(s)\\, p(s, s')\\, k(s')^T \\hat{\\theta}\\, ds'\\, ds.$$\n\nBy substituting p(s, a, s') and applying the sifting property of delta functions, this equation becomes\n\n$$\\sum_{i=1}^{n} \\check{\\delta}(s_i)\\, k(s_i)^T \\hat{\\theta} = \\sum_{j=1}^{n} \\check{\\delta}(s_j)\\, k(s_j)^T K^{-1} r + \\gamma \\sum_{k=1}^{n} \\check{\\delta}(s_k)\\, k(s'_k)^T \\hat{\\theta},$$\n\nand thus $K \\hat{\\theta} = r + \\gamma K' \\hat{\\theta}$, where $[K']_{ij} = k(s'_i, s_j)$. The value function weights are therefore\n\n$$\\hat{\\theta} = (K - \\gamma K')^{-1} r,$$\n\nwhich is identical to the solution found by the KTD approach [15]. In this manner, the KTD approach computes a weighting $\\hat{\\theta}$ such that the difference in the value at $s_i$ and the discounted value at $s'_i$ equals the observed empirical reward $r_i$. Thus, only the finite set of sampled states is considered for policy evaluation. Therefore, some KTD methods, e.g. Gaussian process temporal difference learning [23], require that the samples are obtained from a single trajectory to ensure that $s'_i = s_{i+1}$.\nA key difference between KTD and NPDP is the representation of the value function V(s). The form of the value function is a direct result of the representation used to embody the state transitions. In the original paper [15], the KTD algorithm represents the transitions by using linear kernelized regression $\\hat{k}(s') = k(s)^T K^{-1} K'$, where $[\\hat{k}(s')]_i = E[k(s', s_i)]$. The value function $V(s) = k(s)^T \\hat{\\theta}$ is the correct form for this transition model. However, the transition model does not explicitly represent a conditional distribution and can lead to inaccurate predictions. 
For example, consider two samples that start at $s_1 = 0$ and $s_2 = 0.75$ respectively, and both transition to s' = 0.75. For clarity, we use a boxcar kernel with a width of one; i.e., $k(s_i, s_j) = 1$ iff $\\|s_i - s_j\\| \\leq 0.5$ and 0 otherwise. Hence, K = I and each row of K' corresponds to (0, 1). In the region 0.25 \u2264 s \u2264 0.5, where the two kernels overlap, the transition model would then predict $\\hat{k}(s) = k(s)^T K^{-1} K' = [0 \\quad 2]$. This prediction is, however, impossible as it requires that $E[k(s', s_2)] > \\max_s k(s, s_2)$. In comparison, NPDP would predict the distribution $\\psi(s') \\equiv \\psi_1(s') \\equiv \\psi_2(s')$ for all states in the range \u22120.5 \u2264 s \u2264 1.25.\nAs with LSTD, the matrix (K \u2212 \u03b3K') may become singular and thus not be invertible. As a result, KTD usually needs to be regularized [15]. Given that KTD requires inverting an n \u00d7 n matrix, this approach has a computational complexity similar to NPDP.\n\n3.3 Discrete-State Dynamic Programming\nThe standard tabular DSDP approach can also be derived using the Galerkin method. Given a system with q discrete states, the value function has the form $V(s) = \\check{\\delta}(s)^T v$, where $\\check{\\delta}(s)$ is a vector of q Kronecker delta functions centered on the discrete states. The corresponding reward function is $r(s) = \\check{\\delta}(s)^T \\bar{r}$. The joint distribution is given by $p(s', s) = q^{-1} \\delta(s)^T P\\, \\delta(s')$, where P is a stochastic matrix $\\sum_{j=1}^{q} [P]_{ij} = 1, \\forall i$ and hence $p(s) = q^{-1} \\sum_{i=1}^{q} \\delta_i(s)$. Galerkin\u2019s method projects the integral equation into the space of the states $\\check{\\delta}(s)$. Thus, Eq. 
(3) becomes\n\n$$\\int_S \\check{\\delta}(s)\\, p(s)\\, \\check{\\delta}(s)^T v\\, ds = \\int_S \\check{\\delta}(s)\\, p(s)\\, \\check{\\delta}(s)^T \\bar{r}\\, ds + \\gamma \\int_S \\int_S \\check{\\delta}(s)\\, p(s, s')\\, \\check{\\delta}(s')^T v\\, ds'\\, ds,$$\n\n$$Iv = I\\bar{r} + \\gamma \\int_S \\int_S \\check{\\delta}(s)\\, \\delta(s)^T P\\, \\delta(s')\\, \\check{\\delta}(s')^T v\\, ds'\\, ds,$$\n\n$$v = \\bar{r} + \\gamma P v,$$\n\n$$v = (I - \\gamma P)^{-1} \\bar{r}, \\tag{6}$$\n\nwhich is the same computation used by DSDP [16]. The DSDP and NPDP methods actually use similar models to represent the system. While NPDP uses a kernel density estimation, the DSDP algorithm uses a histogram representation. Hence, DSDP can be regarded as a special case of NPDP for discrete state systems.\nThe DSDP algorithm has also been the basis for continuous-state policy evaluation algorithms [26, 27]. These algorithms first use the sampled states as the discrete states of an MDP and compute the corresponding values. The computed values are then generalized, under a smoothness assumption, to the rest of the state-space using local averaging. Unlike these methods, NPDP explicitly performs policy evaluation for a continuous set of states.\n\n4 Numerical Evaluation\n\nIn this section, we compare the different policy evaluation methods discussed in the previous section, with the proposed NPDP method, on an illustrative benchmark system.\n\n4.1 Benchmark Problem and Setup\nIn order to compare the LSTD, KTD, DSDP, and NPDP approaches, we evaluated the methods on a discrete-time continuous-state system. A standard linear-Gaussian system was used for the benchmark problem, with transitions given by s' = 0.95s + \u03c9 where \u03c9 is Gaussian noise N(\u00b5 = 0, \u03c3 = 0.025). The initial states are restricted to the range 0.95 to 1. The reward functions consist of three Gaussians, as shown by the black line in Fig. 
1.\nThe KTD method was implemented using a Gaussian kernel function and regularization. The LSTD algorithm was implemented using 15 uniformly-spaced normalized Gaussian basis functions, and did not require regularization. The DSDP method was implemented by discretizing the state-space into 10 equally wide regions. The NPDP method was also implemented using Gaussian kernels.\nThe hyper-parameters of all four methods, including the number of basis functions for LSTD and DSDP, were carefully tuned to achieve the best performance. As a performance baseline, the values of the system in the range 0 < s < 1 were computed using a Monte Carlo estimate based on 50000 trajectories. The policy evaluations performed by the tested methods were always based on only 500 samples; i.e. 100 times fewer samples than the baseline. The experiment was run 500 times using independent sets of 500 samples. The samples were not drawn from the same trajectory.\n\n4.2 Results\nThe performance of the different methods was compared using three performance measures. Two of the performance measures are based on the weighted Mean Squared Error (MSE) [2] $E(V) = \\int_0^1 W(s) \\left( V(s) - V^{\\star}(s) \\right)^2 ds$, where $V^{\\star}$ is the true value function and $W(s) \\geq 0$, for all states, is a weighting distribution $\\int_0^1 W(s)\\, ds = 1$. The first performance measure E_unif corresponds to the MSE where W(s) = 1 for all states in the range zero to one. The second performance measure E_samp corresponds to the MSE where $W(s) = n^{-1} \\sum_{i=1}^{n} \\delta_i(s)$. Thus, E_samp is an indicator of the accuracy in the space of the samples, while E_unif is an indicator of how well the computed value function generalizes to the entire state space. The third performance measure E_max is given by the maximum error in the value function. 
This performance measure is the basis of a bound on the overall value function approximation [20].\nThe results of the experiment are shown in Table 1. The performance measures were averaged over the 500 independent trials of the experiment. For all three performance measures, the NPDP algorithm achieved the highest levels of performance, while the DSDP approach consistently led to the worst performance.\n\n       E_unif            E_samp            E_max\nNPDP   0.5811 \u00b1 0.0333   0.7185 \u00b1 0.0321   1.4971 \u00b1 0.0309\nLSTD   0.6898 \u00b1 0.0443   0.8932 \u00b1 0.0412   1.5591 \u00b1 0.0382\nKTD    0.7585 \u00b1 0.0460   0.8681 \u00b1 0.0270   2.5329 \u00b1 0.0391\nDSDP   1.6979 \u00b1 0.0332   2.1548 \u00b1 0.1082   2.9985 \u00b1 0.0449\n\nTable 1: Each row corresponds to one of the four tested algorithms for policy evaluation. The columns indicate the performance of the approaches during the experiment. The performance indexes include the mean squared error evaluated uniformly over the zero to one range, the mean squared error evaluated at the 500 sampled points, and the maximum error. The results are averaged over 500 trials. The standard errors of the means are also given.\n\nFigure 1: Value functions obtained by the evaluated methods. The black lines show the reward function. The blue lines show the value function computed from the trajectories of 50,000 uniformly sampled points. The LSTD, KTD, DSDP, and NPDP methods evaluated the policy using only 500 points. The presentation was divided into two plots for improved clarity.\n\n4.3 Discussion\nThe LSTD algorithm achieved a relatively low E_unif value, which indicates that the tuned basis functions could accurately represent the true value function. However, the performance of LSTD is sensitive to the choice of basis functions and the number of samples per basis function. Using 20 basis functions instead of 15 reduces the performance of LSTD to E_unif = 2.8705 and E_samp = 1.0256 as a result of overfitting. 
The KTD method achieved the second best performance for E_samp, as a result of using a non-parametric representation. However, the value tended to drop in sparsely-sampled regions, which led to relatively high E_unif and E_max values. The discretization of states for DSDP is generally a disadvantage when modeling continuous systems, and resulted in poor overall performance for this evaluation.\nThe NPDP approach outperformed the other methods in all three performance measures. The performance of NPDP could be further improved by using adaptive kernel density estimation [28] to locally adapt the kernels\u2019 bandwidths according to the sampling density. However, all methods were restricted to using a single global bandwidth for the purpose of this comparison.\n\n5 Conclusion\nThis paper presents two key contributions to continuous-state policy evaluation. The first contribution is the Non-Parametric Dynamic Programming algorithm for policy evaluation. The proposed method uses a kernel density estimate to generate a consistent representation of the system. It was shown that the true form of the value function for this model is given by a Nadaraya-Watson kernel regression. The NPDP algorithm provides a solution for calculating the value function. As a kernel-based approach, NPDP simultaneously addresses the problems of function approximation and policy evaluation.\nThe second contribution of this paper is providing a unified view of Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and discrete-state Dynamic Programming, as well as NPDP. All four approaches can be derived from the Bellman equation using the Galerkin projection method. 
These four approaches were also evaluated\nand compared on an empirical problem with a continuous state space and non-linear reward\nfunction, wherein the NPDP algorithm out-performed the other methods.\n\nAcknowledgements\nThe project receives funding from the European Community\u2019s Seventh Framework\nProgramme under grant agreement n\u00b0 ICT-248273 GeRT and n\u00b0 270327 Complacs.\n\n[Figure 1, two panels: Value (0 to 12) plotted against State (0 to 1). Left panel: True Value, Reward, LSTD, KTD. Right panel: True Value, Reward, DSDP, NPDP.]\n\nReferences\n[1] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 2007.\n[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. 1998.\n[3] H. Maei, C. Szepesvari, S. Bhatnagar, D. Precup, D. Silver, and R. Sutton. Convergent temporal-difference learning with arbitrary smooth function approximation. In NIPS, pages 1204\u20131212, 2009.\n[4] Richard Bellman. Bottleneck problems and dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 39(9):947\u2013951, 1953.\n[5] R. E. Kalman. Contributions to the theory of optimal control, 1960.\n[6] Warren B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality (Wiley Series in Probability and Statistics). Wiley-Interscience, 2007.\n[7] R\u00e9mi Munos. Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation. Journal of Machine Learning Research, 7:413\u2013427, 2006.\n[8] Ralf Schoknecht. Optimality of reinforcement learning algorithms with linear function approximation. In NIPS, pages 1555\u20131562, 2002.\n[9] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In ICML, 1995.\n[10] Christopher G. Atkeson and Juan C. Santamaria. A Comparison of Direct and Model-Based Reinforcement Learning. 
In ICRA, pages 3557\u20133564, 1997.\n[11] H. Bersini and V. Gorrini. Three connectionist implementations of dynamic programming for optimal control: A preliminary comparative analysis. In Nicrosp, 1996.\n[12] E. Nadaraya. On estimating regression. Theory of Prob. and Appl., 9:141\u2013142, 1964.\n[13] G. Watson. Smooth regression analysis. Sankhya, Series A, 26:359\u2013372, 1964.\n[14] Justin A. Boyan. Least-squares temporal difference learning. In ICML, pages 49\u201356, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.\n[15] Gavin Taylor and Ronald Parr. Kernelized value function approximation for reinforcement learning. In ICML, pages 1017\u20131024, New York, NY, USA, 2009. ACM.\n[16] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.\n[17] Murray Rosenblatt. Remarks on Some Nonparametric Estimates of a Density Function. The Annals of Mathematical Statistics, 27(3):832\u2013837, September 1956.\n[18] Emanuel Parzen. On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 33(3):1065\u20131076, 1962.\n[19] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82\u201395, 1971.\n[20] R\u00e9mi Munos. Error bounds for approximate policy iteration. In ICML, pages 560\u2013567, 2003.\n[21] Kendall E. Atkinson. The Numerical Solution of Integral Equations of the Second Kind. Cambridge University Press, 1997.\n[22] Dominik Wied and Rafael Weissbach. Consistency of the kernel density estimator: a survey. Statistical Papers, pages 1\u201321, 2010.\n[23] Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In ICML, pages 201\u2013208, New York, NY, USA, 2005. ACM.\n[24] Xin Xu, Tau Xie, Dewen Hu, and Xicheng Lu. Kernel least-squares temporal difference learning. 
International Journal of Information Technology, 11:54\u201363, 1997.\n[25] J. Zico Kolter and Andrew Y. Ng. Regularization and feature selection in least-squares temporal difference learning. In ICML, pages 521\u2013528. ACM, 2009.\n[26] Nicholas K. Jong and Peter Stone. Model-based function approximation for reinforcement learning. In AAMAS, May 2007.\n[27] Dirk Ormoneit and \u015aaunak Sen. Kernel-Based reinforcement learning. Machine Learning, 49(2):161\u2013178, November 2002.\n[28] B. W. Silverman. Density Estimation for Statistics and Data Analysis. London, 1986.\n", "award": [], "sourceid": 974, "authors": [{"given_name": "Oliver", "family_name": "Kroemer", "institution": null}, {"given_name": "Jan", "family_name": "Peters", "institution": null}]}