{"title": "A Generalized Natural Actor-Critic Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1312, "page_last": 1320, "abstract": "Policy gradient Reinforcement Learning (RL) algorithms have received much attention in seeking stochastic policies that maximize the average reward. In addition, extensions based on the concept of the Natural Gradient (NG) show promising learning efficiency because they take a metric of the task into account. Though there are two candidate metrics, Kakade's Fisher Information Matrix (FIM) and Morimura's FIM, all RL algorithms with NG have followed Kakade's approach. In this paper, we describe a generalized Natural Gradient (gNG) obtained by linearly interpolating the two FIMs and propose an efficient implementation for gNG learning based on the theory of estimating functions, the generalized Natural Actor-Critic (gNAC). The gNAC algorithm involves a near optimal auxiliary function to reduce the variance of the gNG estimates. Interestingly, gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm, as long as the interpolating parameter is appropriately selected. Numerical experiments showed that the proposed gNAC algorithm can estimate the gNG efficiently and outperformed the NAC algorithm.", "full_text": "A Generalized Natural Actor-Critic Algorithm

Tetsuro Morimura†, Eiji Uchibe‡, Junichiro Yoshimoto‡, Kenji Doya‡

†: IBM Research – Tokyo, Kanagawa, Japan
‡: Okinawa Institute of Science and Technology, Okinawa, Japan

tetsuro@jp.ibm.com, {uchibe,jun-y,doya}@oist.jp

Abstract

Policy gradient Reinforcement Learning (RL) algorithms have received substantial attention, seeking stochastic policies that maximize the average (or discounted cumulative) reward. In addition, extensions based on the concept of the Natural Gradient (NG) show promising learning efficiency because they take a metric of the task into account.
Though there are two candidate metrics, Kakade's Fisher Information Matrix (FIM) for the policy (action) distribution and Morimura's FIM for the state-action joint distribution, all RL algorithms with NG have followed Kakade's approach. In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs and propose an efficient implementation for gNG learning based on the theory of estimating functions, the generalized Natural Actor-Critic (gNAC) algorithm. The gNAC algorithm involves a near optimal auxiliary function to reduce the variance of the gNG estimates. Interestingly, gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1], as long as the interpolating parameter is appropriately selected. Numerical experiments showed that the proposed gNAC algorithm can estimate the gNG efficiently and outperformed the NAC algorithm.

1 Introduction

Policy Gradient Reinforcement Learning (PGRL) attempts to find a policy that maximizes the average (or time-discounted) reward, based on gradient ascent in the policy parameter space [2, 3, 4]. Since it can directly handle the parameters controlling the randomness of the policy, PGRL, unlike value-based RL, can find an appropriate stochastic policy and has succeeded in several practical applications [5, 6, 7]. However, depending on the task, PGRL methods often require an excessively large number of learning time-steps to construct a good stochastic policy, due to learning plateaus where the optimization process falls into a stagnant state, as was observed even for a very simple Markov Decision Process (MDP) with only two states [8].
In this paper, we propose a new PGRL algorithm, a generalized Natural Actor-Critic (gNAC) algorithm, based on the natural gradient [9].

Because "natural gradient" learning is the steepest gradient method in a Riemannian space and the direction of the natural gradient is defined by that metric, how to design the Riemannian metric is an important issue. In the framework of PGRL, stochastic policies are represented as parametric probability distributions, so the Fisher Information Matrices (FIMs) with respect to the policy parameter induce appropriate Riemannian metrics. Kakade [8] used an average FIM for the policy over the states and proposed natural policy gradient (NPG) learning. Kakade's FIM has been widely adopted, and various algorithms for NPG learning have been developed by many researchers [1, 10, 11]. These are based on the actor-critic framework and are called natural actor-critic (NAC) methods [1]. Recently, the concept of "Natural State-action Gradient" (NSG) learning was proposed in [12]; it shows potential to reduce the learning time by being better at avoiding learning plateaus than the NPG. This natural gradient uses the FIM of the state-action joint distribution as the Riemannian metric for RL, which is directly associated with the average reward as the objective function. Morimura et al. [12] showed that the metric of the NSG corresponds with the changes in the stationary state-action joint distribution. In contrast, the metric of the NPG takes into account only changes in the action distribution and ignores changes in the state distribution, which in general also depends on the policy. They also showed experimental results with exact gradients where NSG learning outperformed NPG learning, especially with large numbers of states in the MDP.
However, no algorithm for estimating the NSG has been proposed, probably because the estimation of the derivative of the log stationary state distribution was difficult [13]. Therefore, the development of a tractable algorithm for the NSG would be of great importance, and this is one of the primary goals of this paper.

Meanwhile, it would be very difficult to select an appropriate FIM a priori, because the best choice depends on the given task. Accordingly, we define a linear interpolation of both FIMs as a generalized Natural Gradient (gNG) and derive an efficient approach to estimate the gNG by applying the theory of estimating functions for stochastic models [14] in Section 3. In Section 4, we derive a gNAC algorithm with an instrumental variable, where the policy parameter is updated by a gNG estimate that is a solution of the estimating function derived in Section 3, and show that gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1]. To validate the performance of the proposed algorithm, numerical experiments are shown in Section 5, where the proposed algorithm estimates the gNG efficiently and outperforms the NAC algorithm [1].

2 Background of Policy Gradient and Natural Gradient for RL

We briefly review policy gradient and natural gradient learning as gradient ascent methods for RL and also present the motivation for the gNAC approach.

2.1 Policy Gradient Reinforcement Learning

PGRL is modeled on a discrete-time Markov Decision Process (MDP) [15, 16]. It is defined by the quintuplet (S, A, p, r, π), where S ∋ s and A ∋ a are finite sets of states and actions, respectively. Also, p : S × A × S → [0, 1] is the state transition probability function of a state s, an action a, and the following state s₊₁, i.e.¹, p(s₊₁|s, a) ≜ Pr(s₊₁|s, a). R : S × A × S → ℝ is a bounded reward function of s, a, and s₊₁, which defines the immediate reward r = R(s, a, s₊₁) observed by the learning agent at each time step. The action probability function π : A × S × ℝᵈ → [0, 1] uses a, s, and a policy parameter θ ∈ ℝᵈ to define the decision-making rule of the learning agent, which is also called a policy, i.e., π(a|s; θ) ≜ Pr(a|s; θ). The policy is normally parameterized by the user and is controlled by tuning θ. Here, we make two assumptions about the MDP.

Assumption 1 The policy is always differentiable with respect to θ and is non-redundant for the task, i.e., the statistics F_a(θ) ∈ ℝ^{d×d} (defined in Section 2.2) are always bounded and non-singular.

Assumption 2 The Markov chain M(θ) ≜ {S, A, p, π, θ} is always ergodic (irreducible and aperiodic).

Under Assumption 2, there exists a unique stationary state distribution d_θ(s) ≜ Pr(s|M(θ)), which is equal to the limiting distribution and is independent of the initial state: d_θ(s′) = lim_{t→∞} Pr(S₊t = s′ | S = s, M(θ)), ∀s ∈ S. This distribution satisfies the balance equation:

d_θ(s₊₁) = Σ_{s∈S} Σ_{a∈A} p(s₊₁|s, a) π(a|s; θ) d_θ(s).

The goal of PGRL is to find the policy parameter θ* that maximizes the average of the immediate rewards, the average reward,

η(θ) ≜ E_θ[r] = Σ_{s∈S} Σ_{a∈A} Σ_{s₊₁∈S} d_θ(s) π(a|s; θ) p(s₊₁|s, a) R(s, a, s₊₁),   (1)

where E_θ[·] denotes the expectation on the Markov chain M(θ).
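Definition (1) can be evaluated exactly when the model is known. The following is a minimal, illustrative sketch (not from the paper): it assumes a randomly synthesized toy MDP and a softmax policy, solves the balance equation for d_θ as the leading left eigenvector of the induced state transition matrix, and then evaluates the average reward by the triple sum in (1).

```python
import numpy as np

# Toy MDP, purely for illustration (not the paper's experimental setup).
S, A = 4, 2
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(S), size=(S, A))       # p[s, a, s'] = Pr(s'|s, a)
R = rng.normal(size=(S, A, S))                   # reward function R(s, a, s')

def policy(theta):                               # softmax policy pi(a|s; theta)
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stationary_dist(pi):
    """Solve the balance equation d P = d, with P[s, s'] = sum_a pi(a|s) p(s'|s, a)."""
    P = np.einsum('sa,sat->st', pi, p)
    w, v = np.linalg.eig(P.T)                    # left eigenvector for eigenvalue 1
    d = np.real(v[:, np.argmax(np.real(w))])
    return d / d.sum()

def average_reward(theta):                       # Eq. (1)
    pi = policy(theta)
    d = stationary_dist(pi)
    return np.einsum('s,sa,sat,sat->', d, pi, p, R)
```

With Dirichlet-sampled transitions every entry of p is positive, so the chain is ergodic and the stationary distribution is unique, matching Assumption 2.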
The derivative of the average reward (1) with respect to the policy parameter, ∇_θη(θ) ≜ [∂η(θ)/∂θ₁, …, ∂η(θ)/∂θ_d]ᵀ, which is referred to as the Policy Gradient (PG), is

∇_θη(θ) = E_θ[ r ∇_θ ln{d_θ(s)π(a|s; θ)} ].

¹ Although to be precise it should be Pr(S₊₁ = s₊₁ | S = s, A = a) for the random variables S₊₁, S, and A, we write Pr(s₊₁|s, a) for simplicity. The same simplification is applied to the other distributions.

Therefore, the average reward η(θ) will be increased by updating the policy parameter as θ := θ + α∇_θη(θ), where := denotes right-to-left substitution and α is a sufficiently small learning rate. This framework is called PGRL [4].

It should be noted that ordinary PGRL methods ignore the differences in sensitivities and the correlations between the elements of θ, as defined by the probability distributions of the MDP, while most probability distributions expressed in the MDP have some form of manifold structure instead of a Euclidean structure. Accordingly, the updating direction of the policy parameter under the ordinary gradient method will differ from the steepest direction on these manifolds. Therefore, the optimization process sometimes falls into a stagnant state, commonly called a plateau [8, 12].

2.2 Natural Gradients for PGRL

To avoid the plateau problem, Amari [9] proposed the concept of the natural gradient, which is a gradient method on a Riemannian space.
The parameter space being a Riemannian space implies that the parameter θ ∈ ℝᵈ lies on a manifold with a Riemannian metric G(θ) ∈ ℝ^{d×d} (a positive semi-definite matrix), instead of on the Euclidean manifold of an arbitrarily parameterized policy, and the squared length of a small incremental vector Δθ connecting θ to θ + Δθ is given by ‖Δθ‖²_G = ΔθᵀG(θ)Δθ, where ᵀ denotes the transpose. Under the constraint ‖Δθ‖²_G = ε² for a sufficiently small constant ε, the steepest ascent direction of the function η(θ) on the manifold with metric G(θ) is given by

∇̃_{G(θ)} η(θ) = G(θ)⁻¹ ∇_θη(θ),

which is called the natural gradient (NG). Accordingly, to (locally) maximize η(θ), θ is incrementally updated with

θ := θ + α ∇̃_{G(θ)} η(θ).

The direction of the NG is defined by a Riemannian metric, so an appropriate choice of the Riemannian metric for the task is required. For RL, two kinds of Fisher Information Matrices (FIMs) F(θ) have been proposed as the Riemannian metric matrix G(θ):²

(I) Kakade [8] focuses only on the changes in the policy (action) distributions and proposes defining the metric matrix, with the notation ∇_θ a_θ b_θ ≜ (∇_θ a_θ) b_θ, as

F_a(θ) ≜ E_θ[∇_θ ln π(a|s; θ) ∇_θ ln π(a|s; θ)ᵀ] = E_θ[F_a(θ, s)],   (2)

where F_a(θ, s) ≜ E_θ[∇_θ ln π(a|s; θ) ∇_θ ln π(a|s; θ)ᵀ | s] is the FIM of the policy at a state s. The NG on this FIM, ∇̃_{F_a(θ)} η(θ) = F_a(θ)⁻¹ ∇_θη(θ), is called the Natural Policy Gradient (NPG).

(II) Considering that the average reward η(θ) in (1) is affected not only by the policy distribution π(a|s; θ) but also by the stationary state distribution d_θ(s), Morimura et al. [12] proposed the use of the FIM of the state-action joint distribution for RL,

F_{s,a}(θ) ≜ E_θ[∇_θ ln{d_θ(s)π(a|s; θ)} ∇_θ ln{d_θ(s)π(a|s; θ)}ᵀ] = F_s(θ) + F_a(θ),   (3)

where F_s(θ) ≜ Σ_{s∈S} d_θ(s) ∇_θ ln d_θ(s) ∇_θ ln d_θ(s)ᵀ is the FIM of d_θ(s). The NG on this FIM, ∇̃_{F_{s,a}(θ)} η(θ) = F_{s,a}(θ)⁻¹ ∇_θη(θ), is called the Natural State-action Gradient (NSG).

Some algorithms for NPG learning, such as NAC [1] and NTD [10, 11], have been successfully implemented using modifications of actor-critic frameworks based on LSTDQ(λ) [18] and TD(λ) [16]. In contrast, no tractable algorithm for NSG learning has been proposed to date. However, it has been suggested that NSG learning is better than NPG learning due to three differences [12]: (a) NSG learning appropriately benefits from the concepts of Amari's NG learning, since the metric F_{s,a}(θ) necessarily and sufficiently accounts for the probability distribution that the average reward depends on. (b) F_{s,a}(θ) is an analogue of the Hessian matrix of the average reward. (c) Numerical experiments show a strong tendency to avoid entrapment in learning plateaus³, especially with large numbers of states.
Therefore, the development of a tractable algorithm for the NSG is important, and this is one of the goals of our work.

² The reason for using F(θ) as G(θ) is that the FIM F_x(θ) is the unique metric matrix of the second-order Taylor expansion of the Kullback-Leibler divergence of Pr(x|θ+Δθ) from Pr(x|θ) [17].

³ Although there were numerical experiments involving the NSG in [12], they computed the NSG analytically with the state transition probabilities and the reward function, which are typically unknown in RL.

On the other hand, it was proven that the metric of NPG learning, F_a(θ), accounts for the infinite time-steps joint distribution in the Markov chain M(θ) [19, 1], while the metric of NSG learning, F_{s,a}(θ), accounts only for the single time-step distribution, which is the stationary state-action joint distribution d_θ(s)π(a|s; θ). Accordingly, the mixing time of M(θ) might change drastically under NSG learning compared to NPG learning, since the mixing time depends on the multiple (not necessarily infinite) time-steps rather than a single time-step; i.e., while various policies can lead to the same stationary state distribution, the Markov chains associated with these policies have different mixing times. A larger mixing time makes it difficult for the learning agent to explore the environment and to estimate the gradient from finite samples. The relative performance of NPG and NSG learning will therefore depend on the properties of the RL task.
Thus, we consider a mixture of the NPG and NSG as a generalized NG (gNG) and propose the approach of the "generalized Natural Actor-Critic" (gNAC), in which the policy parameter of an actor is updated by an estimate of the gNG from a critic.

3 Generalized Natural Gradient for RL

First we explain the definition and properties of the generalized Natural Gradient (gNG). Then we introduce estimating functions to build a foundation for an efficient estimation of the gNG.

3.1 Definition of gNG for RL

In order to define an interpolation between the NPG and NSG with a parameter ι ∈ [0, 1], we consider a linear interpolation from the FIM of (2) for the NPG to the FIM of (3) for the NSG, written as

F̃_{s,a}(θ, ι) ≜ ι F_s(θ) + F_a(θ).   (4)

Then the natural gradient on the interpolated FIM is

∇̃_{F̃_{s,a}(θ,ι)} η(θ) = F̃_{s,a}(θ, ι)⁻¹ ∇_θη(θ),   (5)

which we call the "generalized natural gradient" for RL with the interpolating parameter ι, gNG(ι). Obviously, gNG(ι = 0) and gNG(ι = 1) are equivalent to the NPG and the NSG, respectively. When ι is equal to 1/t, the FIM F̃_{s,a}(θ, ι) is equivalent to the FIM of the t time-steps joint distribution from the stationary state distribution d_θ(s) on M(θ) [12]. Thus, this interpolation controlled by ι can be interpreted as a continuous interpolation with respect to the time-steps of the joint distribution, so that ι : 1 → 0 is inversely proportional to t : 1 → ∞.
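When the model is known, the gNG(ι) of (4)-(5) can be computed directly by assembling both FIMs and solving one linear system. The sketch below is an illustrative assumption, not the paper's method: it uses a toy random MDP with a per-state sigmoid policy (so F_a has a closed form and is non-singular, consistent with Assumption 1), obtains ∇_θ ln d_θ(s) and ∇_θη(θ) by finite differences, and solves F̃_{s,a}(θ, ι) ω = ∇_θη(θ).

```python
import numpy as np

S = 3                                          # toy MDP, 2 actions per state
rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(S), size=(S, 2))     # p[s, a, s'] = Pr(s'|s, a)
R = rng.normal(size=(S, 2, S))                 # reward R(s, a, s')

def pi_of(theta):                              # pi(a=1|s) = sigmoid(theta[s])
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return np.stack([1.0 - p1, p1], axis=1)

def d_of(theta):                               # stationary state distribution
    P = np.einsum('sa,sat->st', pi_of(theta), p)
    w, v = np.linalg.eig(P.T)
    d = np.real(v[:, np.argmax(np.real(w))])
    return d / d.sum()

def eta_of(theta):                             # average reward, Eq. (1)
    return np.einsum('s,sa,sat,sat->', d_of(theta), pi_of(theta), p, R)

def num_grad(f, theta, h=1e-5):                # central finite differences
    g = np.zeros(S)
    for i in range(S):
        e = np.zeros(S); e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

def gng(theta, iota):
    pi, d = pi_of(theta), d_of(theta)
    # F_a: for this policy, grad_theta ln pi(a|s) = (a - pi(1|s)) e_s,
    # so F_a is diagonal with entries d(s) pi(1|s) pi(0|s).
    Fa = np.diag(d * pi[:, 1] * pi[:, 0])
    # F_s: E[grad ln d  grad ln d^T], with grad ln d by finite differences.
    Fs = np.zeros((S, S))
    for s in range(S):
        gs = num_grad(lambda t, s=s: np.log(d_of(t)[s]), theta)
        Fs += d[s] * np.outer(gs, gs)
    return np.linalg.solve(iota * Fs + Fa, num_grad(eta_of, theta))
```

Since ι F_s(θ) + F_a(θ) is positive definite here, the returned direction always has a positive inner product with the plain gradient, i.e., it is an ascent direction for η(θ) at any ι ∈ [0, 1].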
The term 'generalized' in gNG(ι) reflects this generalization over the time-steps of the joint distribution that the NG follows.

3.2 Estimating Function of gNG(ι)

We provide a general view of the estimation of gNG(ι) using the theory of estimating functions, which provides well-established results for parameter estimation [14].

A function g ∈ ℝᵈ of an estimator ω ∈ ℝᵈ (and a variable x) is called an estimating function when it satisfies the following conditions for all θ:

E_θ[g(x, ω*)] = 0,   (6)
det | E_θ[∇_ω g(x, ω)] | ≠ 0,  ∀ω,   (7)
E_θ[g(x, ω)ᵀg(x, ω)] < ∞,  ∀ω,   (8)

where ω* and det|·| denote the exact solution of this estimation and the determinant, respectively.

Proposition 1 The d-dimensional (random) function

g⁰_{ι,θ}(s, a, ω) ≜ ∇_θ ln{d_θ(s)π(a|s; θ)} ( r − ∇_θ ln{d_θ(s)^ι π(a|s; θ)}ᵀω )   (9)

is an estimating function for gNG(ι), such that the unique solution of E_θ[g⁰_{ι,θ}(s, a, ω)] = 0 with respect to ω is equal to the gNG(ι).

Proof: From (1) and (4), the equation

0 = E_θ[g⁰_{ι,θ}(s, a, ω*)] = ∇_θη(θ) − F̃_{s,a}(θ, ι) ω*

holds. Thus, ω* is equal to the gNG(ι) from (5). The remaining conditions (7) and (8), which the estimating function must satisfy, also obviously hold (under Assumption 1). □

In order to estimate gNG(ι) by using the estimating function (9) with finite T samples on M(θ), the simultaneous equation

(1/T) Σ_{t=0}^{T−1} g⁰_{ι,θ}(s_t, a_t, ω̂) = 0

is solved with respect to ω. The solution ω̂, which is also called the M-estimator [20], is an unbiased estimate of gNG(ι), so that ω̂ = ω* holds in the limit as T → ∞.

Note that solving the estimating function (9) is equivalent to linear regression with the instrumental variable ∇_θ ln{d_θ(s)π(a|s; θ)}, where the regressand, the regressor, and the model parameter (estimator) are r (or R(s, a)), ∇_θ ln{d_θ(s)^ι π(a|s; θ)}, and ω, respectively [21], so that the regression residuals 'r − ∇_θ ln{d_θ(s)^ι π(a|s; θ)}ᵀω' are not correlated with the instrumental variables ∇_θ ln{d_θ(s)π(a|s; θ)}.

3.3 Auxiliary Function of the Estimating Function

Although we could construct a simple algorithm implementing the gNAC approach with the M-estimator of the estimating function (9), the performance of the estimation of gNG(ι) may be unacceptable for real RL applications, since the variance of the gNG(ι) estimates tends to become too large. For that reason, we extend the estimating function (9) by embedding an auxiliary function, to create room for improvement over (9).

Lemma 1 The d-dimensional (random) function

g_{ι,θ}(s, a, ω) ≜ ∇_θ ln{d_θ(s)π(a|s; θ)} ( r − ∇_θ ln{d_θ(s)^ι π(a|s; θ)}ᵀω − ρ(s, s₊₁) )   (10)

is an estimating function for gNG(ι), where ρ(s, s₊₁) is called the auxiliary function for (9):

ρ(s, s₊₁) ≜ c + b(s) − b(s₊₁).   (11)

Here, c and b(s) are an arbitrary bounded constant and an arbitrary bounded function of the state, respectively.

Proof: See supplementary material. □

Let G_{ι,θ} denote the class of such functions g_{ι,θ} with various auxiliary functions ρ. An optimal auxiliary function, which minimizes the variance of the gNG estimate ω̂, is defined by the optimality criterion of estimating functions [22].
An estimating function g*_{ι,θ} is optimal in G_{ι,θ} if det|Σ_{g*_{ι,θ}}| ≤ det|Σ_{g_{ι,θ}}| for all g_{ι,θ} ∈ G_{ι,θ}, where Σ_{g_{ι,θ}} ≜ E_θ[g_{ι,θ}(s, a, ω*) g_{ι,θ}(s, a, ω*)ᵀ].

Lemma 2 Let us approximate (or assume)

r ≈ E_θ[R(s, a, s₊₁)|s, a] ≜ R(s, a),   (12)
ρ(s, s₊₁) ≈ E_θ[ρ(s, s₊₁)|s, a] ≜ ρ(s, a).   (13)

If the policy is non-degenerate for the task (so that the dimension of θ, d, is equal to Σ_{i=1}^{|S|}(|A_i| − 1), where |S| and |A_i| are the numbers of states and of available actions at state s_i, respectively) and ω* denotes the gNG(ι), then the 'near' optimal auxiliary function ρ* in the 'near' optimal estimating function g*_{ι,θ}(s, a, ω) satisfies⁴

R(s, a) = ∇_θ ln{d_θ(s)^ι π(a|s; θ)}ᵀω* + E_θ[ρ*(s, s₊₁)|s, a].   (14)

Proof Sketch: The covariance matrix for the criterion of the auxiliary function ρ is approximated as

Σ_{g_θ} ≈ E_θ[ ∇_θ ln{d_θ(s)π(a|s; θ)} ∇_θ ln{d_θ(s)π(a|s; θ)}ᵀ ( R(s, a) − ∇_θ ln{d_θ(s)^ι π(a|s; θ)}ᵀω* − ρ(s, a) )² ] ≜ Σ̂_{g_θ}.   (15)

The function ρ(s, a) usually has |S| degrees of freedom over all of the (s, a) couplets under the ergodicity of M(θ), because "b(s) − b(s₊₁)" in ρ has (|S| − 1) degrees of freedom over all of the (s, s₊₁) couplets. The value of ∇_θ ln{d_θ(s)^ι π(a|s; θ)}ᵀω has Σ_{i=1}^{|S|}(|A_i| − 1) degrees of freedom. R(s, a) has at most Σ_{i=1}^{|S|}|A_i| degrees of freedom. Therefore, there exist ρ* and ∇_θ ln{d_θ(s)^ι π(a|s; θ)}ᵀω° that satisfy (14). Remembering that ∇_θ ln{d_θ(s)^ι π(a|s; θ)}ᵀω* is the approximator of R(s, a) (or r) and that ω* is independent of the choice of ρ due to Lemma 1, we see that ω° = ω* holds. Therefore, if the estimating function has an auxiliary function ρ* satisfying (14), the optimality criterion for ρ is minimized, since det|Σ̂_{g*_θ}| = 0 by (15). □

From Lemma 2, the near optimal auxiliary function ρ* can be regarded as driving to zero the mean squared residuals between R(s, a) and the estimator R̂_ρ(s, a, ω) ≜ ∇_θ ln{d_θ(s)^ι π(a|s; θ)}ᵀω + ρ(s, s₊₁). Thus, the near optimality of g*_{ι,θ}(s, a, ω̂) is interpreted as a near minimization of the Euclidean distance between r and its approximator R̂_{ρ*}(s, a, ω̂), so that ρ* works to reduce the distance between the regressand r and the subspace of the regressor ∇_θ ln{d_θ(s)^ι π(a|s; θ)} of the M-estimator ω̂. In particular, R(s, a) is almost in this subspace at the point ω̂ = ω*.

⁴ The 'near' of the near optimal estimating function comes from the approximations (12) and (13), which implicitly assume that the sum of the (co)variances, E_θ[(r − R(s, a))² + (ρ(s, s₊₁) − ρ(s, a))² − 2(r − R(s, a))(ρ(s, s₊₁) − ρ(s, a)) | s, a], is not large. This assumption seems to hold in many RL tasks.
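The M-estimator underlying (9) and (10) is, at bottom, instrumental-variable linear regression: with instruments z_t, regressors x_t, and regressands r_t, the root of the empirical estimating function (1/T) Σ_t z_t(r_t − x_tᵀω) = 0 is ω̂ = (ZᵀX)⁻¹Zᵀr. The sketch below uses purely synthetic data as a stand-in for the paper's quantities (z_t would be ∇_θ ln{d_θ(s)π(a|s; θ)} and x_t would be ∇_θ ln{d_θ(s)^ι π(a|s; θ)}); it is illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 5000, 4
w_true = rng.normal(size=d)
z = rng.normal(size=(T, d))                  # instruments
x = z + 0.1 * rng.normal(size=(T, d))        # regressors, correlated with z
r = x @ w_true + rng.normal(size=T)          # regressands with residual noise

def m_estimate(z, x, r):
    """Root of (1/T) sum_t z_t (r_t - x_t^T w) = 0, i.e. w = (Z^T X)^{-1} Z^T r."""
    return np.linalg.solve(z.T @ x, z.T @ r)

w_hat = m_estimate(z, x, r)
```

Because the residual noise is uncorrelated with the instruments, ω̂ is consistent: its error shrinks as O(1/√T), mirroring the claim that ω̂ = ω* in the limit T → ∞.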
Lemma 2 leads directly to Corollary 1.

Corollary 1 Let b*_{ι=0}(s) and c*_{ι=0} be the functions in the near optimal auxiliary function ρ*(s, s₊₁) at ι = 0; then b*_{ι=0}(s) and c*_{ι=0} are equal to the (un-discounted) state value function [23] and the average reward, respectively.

Proof: For all s, ω, and θ, the following equation holds:

E_θ[∇_θ ln{d_θ(s)⁰π(a|s; θ)}ᵀω | s] = ωᵀ Σ_{a∈A} ∇_θπ(a|s; θ) = 0.

Therefore, the following equation, which is the same as the definition of the value function b*_{ι=0}(s) with the average reward c*_{ι=0} as the solution of the Poisson equation [23], can be derived from (14):

b*_{ι=0}(s) + c*_{ι=0} = E_θ[r + b*_{ι=0}(s₊₁) | s],  ∀s. □

4 A Generalized NAC Algorithm

We now propose a useful instrumental variable for the gNG(ι) estimation and then derive a gNAC algorithm along with an algorithm for estimating ∇_θ ln d_θ(s).

4.1 Bias from Estimation of ∇_θ ln d_θ(s)

To compute the M-estimator of g_{ι,θ}(s, a, ω) as the gNG(ι) estimate on M(θ), both of the derivatives ∇_θ ln π(a|s; θ) and ∇_θ ln d_θ(s) are required. While we can easily compute ∇_θ ln π(a|s; θ), since we have parameterized the policy, we cannot compute the Logarithm stationary State distribution Derivative (LSD) ∇_θ ln d_θ(s) analytically unless the state transition probabilities and the reward function are known. Thus, we use the LSD estimate from the algorithm LSLSD [13].
These LSD estimates ∇̂_θ ln d_θ(s) are unbiased but usually carry some estimation error with finite samples, so that ∇̂_θ ln d_θ(s) = ∇_θ ln d_θ(s) + ε(s), where ε(s) is a d-dimensional random variable satisfying E{ε(s)|s} = 0.

In such cases, the estimate of gNG(ι) from the estimating function (9) or (10) would be biased, because the first condition (6) for g_{ι,θ} is violated unless E_θ[ε(s)ε(s)ᵀ] = 0. Thus, in Section 4.2, we consider a refinement of the instrumental variable, i.e., the part ∇_θ ln{d_θ(s)π(a|s; θ)} in the estimating function (10), since the instrumental variable can be replaced with any function I that satisfies the following conditions⁵ for any s, θ, and ω and makes the solution ω* become the gNG(ι):

E_θ[ I ( r − {ι ∇̂_θ ln d_θ(s) + ∇_θ ln π(a|s; θ)}ᵀω* − ρ(s, s₊₁) ) ] = 0,   (16)
det | E_θ[ I {ι ∇̂_θ ln d_θ(s) + ∇_θ ln π(a|s; θ)}ᵀ ] | ≠ 0,   (17)
E_θ[ ( r − {ι ∇̂_θ ln d_θ(s) + ∇_θ ln π(a|s; θ)}ᵀω − ρ(s, s₊₁) )² IᵀI ] < ∞.   (18)

⁵ These correspond to the conditions (6), (7), and (8) for the estimating function.

4.2 Instrumental Variables of the Near Optimal Estimating Function for gNG(ι)

We use a linear function to introduce the auxiliary function (defined in (11)),

ρ(s, s₊₁; ν) ≜ ( φ̃(s) − [φ(s₊₁)ᵀ, 0]ᵀ )ᵀ ν,

where ν ∈ ℝ^{|S|+1} and φ(s) ∈ ℝ^{|S|} are the model parameter and the regressor (feature vector function) of the state s, respectively, and φ̃(s) ≜ [φ(s)ᵀ, 1]ᵀ. We assume that the set of φ(s) is linearly independent.

A Generalized Natural Actor-Critic Algorithm with LSLSD(λ)
Given: a policy π(a|s; θ) with an adjustable θ and a feature vector function of the state, φ(s).
Initialize: θ; β ∈ [0, 1]; α; λ ∈ [0, 1).
Set: A := 0; B := 0; C := 0; D := 0; E := 0; x := 0; y := 0.
For t = 0 to T − 1 do
  Critic: compute the gNG(ι) estimate ω̂_ι:
    A := βA + ∇_θ ln π(a_t|s_t; θ) ∇_θ ln π(a_t|s_t; θ)ᵀ;  B := βB + ∇_θ ln π(a_t|s_t; θ) ψ̃(s_t, s_{t+1})ᵀ;
    C := βC + φ̃(s_t) ∇_θ ln π(a_t|s_t; θ)ᵀ;  D := βD + φ̃(s_t) φ(s_t)ᵀ;  E := βE + φ̃(s_t) ψ̃(s_t, s_{t+1})ᵀ;
    x := βx + r_t ∇_θ ln π(a_t|s_t; θ);  y := βy + r_t φ̃(s_t);  Ω := "LSLSD(λ) algorithm" [13];
    ω̂_ι := {A + ι C̃ᵀΩ − BE⁻¹(C + ιDΩ)}⁻¹ (x − BE⁻¹y).
  Actor: update θ by the gNG(ι) estimate:
    θ := θ + α ω̂_ι.
End
Return: the policy π(a|s; θ).
(* C̃ is the sub-matrix of C with the lowest row removed.)
Accordingly, the whole model parameter of the estimating function is now [ωᵀ, νᵀ]ᵀ ≜ ϖ. We propose the following instrumental variable:

I°(s, a) ≜ [∇_θ ln π(a|s; θ); φ̃(s)]ᵀ.   (19)

Because this instrumental variable I° has the desirable property shown in Theorem 1, the estimating function g°_{ι,θ}(s, a, ϖ) with I° is a useful function even when the LSD is estimated.

Theorem 1 To estimate gNG(ι), let I°(s, a) be used in the estimating function as

g°_{ι,θ}(s, a, ϖ) = I°(s, a) ( r − {ι ∇̂_θ ln d_θ(s) + ∇_θ ln π(a|s; θ)}ᵀω − ρ(s, s₊₁; ν) ),   (20)

and let ω* and ν* be the solutions. Then ω* is equal to the gNG(ι), ω* = ∇̃_{F̃_{s,a}(θ,ι)} η(θ), and the auxiliary function with ν* is the near optimal auxiliary function provided in Lemma 2, ρ(s, s₊₁; ν*) = ρ*(s, s₊₁), even if the LSD estimates include (zero mean) random noises.

Proof Sketch: (i) Condition (18) for the instrumental variable is satisfied due to Assumption 1. (ii) Considering E_θ[∇̂_θ ln d_θ(s) ∇_θ ln π(a|s; θ)ᵀ] = 0 and Assumption 1, condition (17), det | E_θ[∇_ϖ g°_{ι,θ}] | ≠ 0, is satisfied. This guarantees that the solution ϖ* ≜ [ω*ᵀ, ν*ᵀ]ᵀ of E_θ[g°_{ι,θ}] = 0 is unique. (iii) Assuming the statements of Theorem 1, so that "ω* = ∇̃_{F̃_{s,a}(θ,ι)} η(θ)" and "ρ(s, s₊₁; ν*) = ρ*(s, s₊₁)" hold, then E[g°_{ι,θ}(s, a, ϖ*)|s, a] becomes I°(s, a){r − R(s, a)} from (14), and its expectation over M(θ) becomes equal to 0.
This means that (20) also satisfies the condition (16). From (i), (ii), and (iii), this theorem is proven. □

The optimal instrumental variable I*(s, a) with respect to variance minimization can be derived straightforwardly from the results of [21, 24]. However, since I* usually has to be estimated, we do not address I* here. Note that the proposed I⋆(s, a) of (19) can be computed analytically.

4.3 A Generalized Natural Actor-Critic Algorithm with LSLSD

We can straightforwardly derive a generalized Natural Actor-Critic algorithm, gNAC(ι), by solving the estimating function g⋆_{ι,θ}(s, a; ϖ) in (20), using the LSD estimate ∇̂_θ ln d_θ(s) ≜ Ω^⊤ φ(s). However, since ν in the model parameter is not required for updating the policy parameter θ, we compute only ω, using the results for the block matrices, to reduce the computational cost. The algorithm table above shows an instance of the gNAC(ι) algorithm with LSLSD(λ) [13], with the forgetting factor β for the statistics, the learning rate α of the policy, and the definitions ψ(s_{t−1}, s_t) ≜ φ(s_{t−1}) − φ(s_t) and ψ̃(s_{t−1}, s_t) ≜ [ψ(s_{t−1}, s_t)^⊤, 1]^⊤. Note that the LSD estimate is not used at all in the proposed gNAC(ι = 0). In addition, note that gNAC(ι = 0) is equivalent to a non-episodic NAC algorithm modified to optimize the average reward instead of the discounted cumulative reward [1].
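As a rough illustration of the per-step updates in the algorithm table (a minimal NumPy sketch of ours, not the authors' implementation), the statistics A–E, x, y and the closed-form gNG(ι) solve might look as follows; the LSD-estimate matrix `Omega` is assumed to be supplied by an external LSLSD(λ) routine and is simply passed in:

```python
import numpy as np

def gnac_step(stats, grad_logpi, phi_s, phi_next, r_t, beta):
    """One critic accumulation step with forgetting factor beta.
    grad_logpi: gradient of ln pi(a_t|s_t; theta);
    phi_s, phi_next: features of s_t and s_{t+1}."""
    psi = phi_s - phi_next                  # psi(s_t, s_{t+1})
    psi_t = np.append(psi, 1.0)             # psi~ = [psi^T, 1]^T
    phi_t = np.append(phi_s, 1.0)           # phi~ = [phi^T, 1]^T
    A, B, C, D, E, x, y = stats
    A = beta * A + np.outer(grad_logpi, grad_logpi)
    B = beta * B + np.outer(grad_logpi, psi_t)
    C = beta * C + np.outer(phi_t, grad_logpi)
    D = beta * D + np.outer(phi_t, phi_s)
    E = beta * E + np.outer(phi_t, psi_t)
    x = beta * x + r_t * grad_logpi
    y = beta * y + r_t * phi_t
    return (A, B, C, D, E, x, y)

def gng_estimate(stats, Omega, iota):
    """Solve for the gNG(iota) estimate, as in the algorithm table:
    omega = {A + iota C~^T Omega - B E^{-1}(C + iota D Omega)}^{-1} (x - B E^{-1} y)."""
    A, B, C, D, E, x, y = stats
    C_sub = C[:-1, :]                       # C~: C without its bottom row
    BEinv = B @ np.linalg.inv(E)
    M = A + iota * C_sub.T @ Omega - BEinv @ (C + iota * D @ Omega)
    return np.linalg.solve(M, x - BEinv @ y)
```

The actor step is then just `theta += alpha * gng_estimate(stats, Omega, iota)`; for ι = 0 the terms involving Ω vanish, which is why the LSD estimate is not needed in that case.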
This equivalence is consistent with the results of Corollary 1.

[Figure 1: plots omitted. (A) Angle [radian] vs. time step for the gNG(ι) estimates (ι = 1, 0.5, 0.25), each with and without ρ; (B) average reward vs. time step for gNAC(1), gNAC(0.5), gNAC(0.25), NAC = gNAC(0), and AC.]

Figure 1: Averages and standard deviations over 50 independent episodes: (A) the angles between the true gNG(ι) and estimates with and without the auxiliary function ρ(s, s_{+1}; ν) on the 5-state MDP; (B) the learning performances (average rewards) of the various (N)PGRL algorithms with the auxiliary functions on the 30-state MDP.

5 Numerical Experiment

We studied the proposed gNAC algorithm with various values of ι ∈ {0, 0.25, 0.5, 1} on randomly synthesized MDPs with |S| ∈ {5, 30} states and |A| = 2 actions. As the performance baseline among existing PG methods, we used Konda's actor-critic algorithm [23]; our variant uses a baseline function in which the state values are estimated by LSTD(0) [25], while the original version did not use any baseline function. Note that gNAC(ι = 0) can be regarded as the NAC proposed by [1], which serves as the current state-of-the-art PGRL baseline. We initialized the setting of the MDP in each episode so that the set of actions was always A = {l, m}.
The state transition probability function was set using the Dirichlet distribution Dir(α ∈ ℝ²) and a uniform distribution U(a, b) that generates an integer from 1 to a other than b: we first initialized it as p(s′|s, a) := 0, ∀(s′, s, a), and then, with q(s, a) ∼ Dir(α = [.3, .3]) and s_{n_b} ∼ U(|S|, b), set

    p(s_{+1}|s, l) := q₁(s, l),   p(s_{n_{s_{+1}}}|s, l) := q₂(s, l);
    p(s|s, m) := q₁(s, m),   p(s_{n_s}|s, m) := q₂(s, m),

where the states s′ = 1 and s′ = |S| + 1 are identical. The reward function R(s, a, s_{+1}) was first drawn from the Gaussian distribution N(μ = 0, σ² = 1) and then normalized so that max_θ η(θ) = 1 and min_θ η(θ) = −1: R(s, a, s_{+1}) := 2(R(s, a, s_{+1}) − min_θ η(θ))/(max_θ η(θ) − min_θ η(θ)) − 1. The policy is represented by the sigmoidal function π(l|s; θ) = 1/(1 + exp(−θ^⊤ φ(s))). Each ith element of the initial policy parameter θ₀ ∈ ℝ^{|S|} and of the feature vector of the jth state, φ(s_j) ∈ ℝ^{|S|}, was drawn from N(0, 1) and N(δ_ij, 0.5), respectively, where δ_ij is the Kronecker delta. Figure 1(A) shows the angles between the true gNG(ι) and the gNG(ι) estimates with and without the auxiliary function ρ(s, s_{+1}; ν), at α := 0 (fixed policy), β := 1, and λ := 0. The estimation without the auxiliary function was implemented by solving the estimating function of (9). We can confirm that the estimate using g⋆_{ι,θ}(s, a; ϖ) in (20), which implements the near-optimal estimating function, is a much more efficient estimator than the one without the auxiliary function.
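The randomized benchmark construction described above can be sketched roughly as follows (a hypothetical reconstruction of ours: the Dirichlet concentration [.3, .3] and the sigmoidal policy come from the text, while the choice of "next state on the ring" for action l is our reading of the s′ = 1 ≡ |S| + 1 identification, and the reward normalization through η(θ) is omitted):

```python
import numpy as np

def make_random_mdp(n_states, rng):
    """Synthesize a 2-action MDP in the spirit of the experiment: each (s, a)
    spreads its probability over two states with Dir(alpha=[.3, .3]) weights;
    rewards are standard normal (un-normalized here)."""
    n_actions = 2
    P = np.zeros((n_actions, n_states, n_states))   # P[a, s, s']
    for s in range(n_states):
        for a in range(n_actions):
            q = rng.dirichlet([0.3, 0.3])
            # action 0 ("l"): step forward on the ring; action 1 ("m"): stay
            target = (s + 1) % n_states if a == 0 else s
            # remaining mass goes to a random state other than the target
            other = rng.choice([x for x in range(n_states) if x != target])
            P[a, s, target] += q[0]
            P[a, s, other] += q[1]
    R = rng.standard_normal((n_states, n_actions, n_states))
    return P, R

def sigmoid_policy(theta, phi_s):
    """pi(l|s; theta) = 1 / (1 + exp(-theta^T phi(s)))."""
    return 1.0 / (1.0 + np.exp(-theta @ phi_s))
```

Each row of P sums to one by construction, so the sketch yields a valid (if only illustrative) transition kernel for experimenting with the algorithms above.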
Figure 1(B) shows the comparison in terms of learning performance, where the learning rates for the gNACs and for Konda's actor-critic were set as α := 3 × 10⁻⁴ and α_Konda := 60α, respectively. The other hyperparameters, β := 1 − α and λ := 0, were the same for all of the algorithms. We thus confirmed that our gNAC(ι > 0) algorithm outperformed the current state-of-the-art NAC algorithm (gNAC(ι = 0)).

6 Summary

In this paper, we proposed a generalized NG (gNG) learning algorithm that combines the two Fisher information matrices for RL. The theory of estimating functions provided the insight to prove several important theoretical results, from which our proposed gNAC algorithm was derived. Numerical experiments showed that the gNAC algorithm can estimate gNGs efficiently and that it can outperform a current state-of-the-art NAC algorithm. In order to utilize the auxiliary function of the estimating function for the gNG, we defined an auxiliary function based on a near-optimality criterion for the estimating function, minimizing the distance between the immediate reward as the regressand and the subspace of the regressors of the gNG at the solution of the gNG. However, it may be possible to use a different criterion, such as optimality under the Fisher information matrix metric instead of the Euclidean metric. Also, an analysis of the properties of the gNG itself will be necessary to understand more deeply the properties and efficacy of our proposed gNAC algorithm.

References
[1] J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In European Conference on Machine Learning, 2005.
[2] V. Gullapalli. A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3(6):671–692, 1990.
[3] R. J. Williams.
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[4] J. Baxter and P. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
[5] R. Tedrake, T. W. Zhang, and H. S. Seung. Stochastic policy gradient reinforcement learning on a simple 3D biped. In IEEE International Conference on Intelligent Robots and Systems, 2004.
[6] J. Peters and S. Schaal. Policy gradient methods for robotics. In IEEE International Conference on Intelligent Robots and Systems, 2006.
[7] S. Richter, D. Aberdeen, and J. Yu. Natural actor-critic for road traffic optimisation. In Advances in Neural Information Processing Systems. MIT Press, 2007.
[8] S. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002.
[9] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[10] T. Morimura, E. Uchibe, and K. Doya. Utilizing natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and its Applications, pages 256–263, 2005.
[11] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 105–112. MIT Press, 2008.
[12] T. Morimura, E. Uchibe, J. Yoshimoto, and K. Doya. A new natural policy gradient by stationary distribution metric. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2008.
[13] T. Morimura, E. Uchibe, J. Yoshimoto, J. Peters, and K. Doya. Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning. Neural Computation (in press).
[14] V. Godambe.
Estimating Functions. Oxford Science Publications, 1991.
[15] D. P. Bertsekas. Dynamic Programming and Optimal Control, Volumes 1 and 2. Athena Scientific, 1995.
[16] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[17] S. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
[18] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
[19] D. Bagnell and J. Schneider. Covariant policy search. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
[20] S. Amari and M. Kawanabe. Information geometry of estimating functions in semi-parametric statistical models. Bernoulli, 3(1), 1997.
[21] A. C. Singh and R. P. Rao. Optimal instrumental variable estimation for linear models with stochastic regressors using estimating functions. In Symposium on Estimating Functions, pages 177–192, 1996.
[22] B. Chandrasekhar and B. K. Kale. Unbiased statistical estimating functions in presence of nuisance parameters. Journal of Statistical Planning and Inference, 9:45–54, 1984.
[23] V. S. Konda and J. N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
[24] T. Ueno, M. Kawanabe, T. Mori, S. Maeda, and S. Ishii. A semiparametric statistical approach to model-free policy evaluation. In International Conference on Machine Learning, pages 857–864, 2008.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning.
Machine Learning, 49(2–3):233–246, 2002.