{"title": "Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3846, "page_last": 3855, "abstract": "Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning.  Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family. We then provide an empirical comparison of these techniques with the remaining algorithmic details fixed, and show how different mixing of off-policy gradient estimates with on-policy samples contribute to improvements in empirical performance. The final algorithm provides a generalization and unification of existing deep policy gradient techniques, has theoretical guarantees on the bias introduced by off-policy updates, and improves on the state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks.", "full_text": "Interpolated Policy Gradient: Merging On-Policy and\n\nOff-Policy Gradient Estimation for Deep\n\nReinforcement Learning\n\nShixiang Gu\n\nUniversity of Cambridge\n\nMax Planck Institute\nsg717@cam.ac.uk\n\nTimothy Lillicrap\n\nDeepMind\n\ncountzero@google.com\n\nZoubin Ghahramani\nUniversity of Cambridge\n\nUber AI Labs\n\nzoubin@eng.cam.ac.uk\n\nRichard E. 
Turner\n\nUniversity of Cambridge\n\nret26@cam.ac.uk\n\nBernhard Sch\u00f6lkopf\nMax Planck Institute\n\nbs@tuebingen.mpg.de\n\nSergey Levine\nUC Berkeley\n\nsvlevine@eecs.berkeley.edu\n\nAbstract\n\nOff-policy model-free deep reinforcement learning methods using previously col-\nlected data can improve sample ef\ufb01ciency over on-policy policy gradient techniques.\nOn the other hand, on-policy algorithms are often more stable and easier to use.\nThis paper examines, both theoretically and empirically, approaches to merging\non- and off-policy updates for deep reinforcement learning. Theoretical results\nshow that off-policy updates with a value function estimator can be interpolated\nwith on-policy policy gradient updates whilst still satisfying performance bounds.\nOur analysis uses control variate methods to produce a family of policy gradient\nalgorithms, with several recently proposed algorithms being special cases of this\nfamily. We then provide an empirical comparison of these techniques with the\nremaining algorithmic details \ufb01xed, and show how different mixing of off-policy\ngradient estimates with on-policy samples contribute to improvements in empirical\nperformance. The \ufb01nal algorithm provides a generalization and uni\ufb01cation of\nexisting deep policy gradient techniques, has theoretical guarantees on the bias\nintroduced by off-policy updates, and improves on the state-of-the-art model-free\ndeep RL methods on a number of OpenAI Gym continuous control benchmarks.\n\n1\n\nIntroduction\n\nReinforcement learning (RL) studies how an agent that interacts sequentially with an environment\ncan learn from rewards to improve its behavior and optimize long-term returns. Recent research has\ndemonstrated that deep networks can be successfully combined with RL techniques to solve dif\ufb01cult\ncontrol problems. 
Some of these include robotic control (Schulman et al., 2016; Lillicrap et al., 2016;\nLevine et al., 2016), computer games (Mnih et al., 2015), and board games (Silver et al., 2016).\nOne of the simplest ways to learn a neural network policy is to collect a batch of behavior wherein\nthe policy is used to act in the world, and then compute and apply a policy gradient update from\nthis data. This is referred to as on-policy learning because all of the updates are made using data\nthat was collected from the trajectory distribution induced by the current policy of the agent. It is\nstraightforward to compute unbiased on-policy gradients, and practical on-policy gradient algorithms\ntend to be stable and relatively easy to use. A major drawback of such methods is that they tend to\nbe data inef\ufb01cient, because they only look at each data point once. Off-policy algorithms based on\nQ-learning and actor-critic learning (Sutton et al., 1999) have also proven to be an effective approach\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fto deep RL such as in (Mnih et al., 2015) and (Lillicrap et al., 2016). Such methods reuse samples\nby storing them in a memory replay buffer and train a value function or Q-function with off-policy\nupdates. This improves data ef\ufb01ciency, but often at a cost in stability and ease of use.\nBoth on- and off-policy learning techniques have their own advantages. Most recent research has\nworked with on-policy algorithms or off-policy algorithms, and a few recent methods have sought to\nmake use of both on- and off-policy data for learning (Gu et al., 2017; Wang et al., 2017; O\u2019Donoghue\net al., 2017). Such algorithms hope to gain advantages from both modes of learning, whilst avoiding\ntheir limitations. Broadly speaking, there have been two basic approaches in recently proposed\nalgorithms that make use of both on- and off-policy data and updates. 
The first approach is to mix\nsome ratio of on- and off-policy gradients or update steps in order to update a policy, as in the\nACER and PGQ algorithms (Wang et al., 2017; O\u2019Donoghue et al., 2017). In this case, there are no\ntheoretical bounds on the error induced by incorporating off-policy updates. In the second approach,\nan off-policy Q critic is trained but is used as a control variate to reduce on-policy gradient variance,\nas in the Q-Prop algorithm (Gu et al., 2017). This case does not introduce additional bias to the\ngradient estimator, but the policy updates do not use off-policy data.\nWe seek to unify these two approaches using the method of control variates. We introduce a\nparameterized family of policy gradient methods that interpolate between on-policy and off-policy\nlearning. Such methods are in general biased, but we show that the bias can be bounded. We also show\nthat a number of recent methods (Gu et al., 2017; Wang et al., 2017; O\u2019Donoghue et al., 2017) can be\nviewed as special cases of this more general family. Furthermore, our empirical results show that in\nmost cases, a mix of policy gradient and actor-critic updates achieves the best results, demonstrating\nthe value of considering interpolated policy gradients.\n\n2 Preliminaries\n\nA key component of our interpolated policy gradient method is the use of control variates to mix\nlikelihood ratio gradients with deterministic gradient estimates obtained explicitly from a state-action\ncritic. 
In this section, we summarize both likelihood ratio and deterministic gradient methods, as well\nas how control variates can be used to combine these two approaches.\n\n2.1 On-Policy Likelihood Ratio Policy Gradient\n\nAt time t, the RL agent in state s_t takes action a_t according to its policy \u03c0(a_t|s_t), the state transitions\nto s_{t+1}, and the agent receives a reward r(s_t, a_t). For a parametrized policy \u03c0_\u03b8, the objective is to\nmaximize the \u03b3-discounted cumulative future return J(\u03b8) = J(\u03c0) = E_{s_0,a_0,\u00b7\u00b7\u00b7\u223c\u03c0}[\u2211_{t=0}^{\u221e} \u03b3^t r(s_t, a_t)].\nMonte Carlo policy gradient methods, such as REINFORCE (Williams, 1992) and TRPO (Schulman\net al., 2015), use the likelihood ratio policy gradient of the RL objective,\n\n\u2207_\u03b8 J(\u03b8) = E_{\u03c1^\u03c0,\u03c0}[\u2207_\u03b8 log \u03c0_\u03b8(a_t|s_t)(\u02c6Q(s_t, a_t) \u2212 b(s_t))] = E_{\u03c1^\u03c0,\u03c0}[\u2207_\u03b8 log \u03c0_\u03b8(a_t|s_t) \u02c6A(s_t, a_t)],    (1)\n\nwhere \u02c6Q(s_t, a_t) = \u2211_{t'=t}^{\u221e} \u03b3^{t'\u2212t} r(s_{t'}, a_{t'}) is the Monte Carlo estimate of the \u201ccritic\u201d Q^\u03c0(s_t, a_t) =\nE_{s_{t+1},a_{t+1},\u00b7\u00b7\u00b7\u223c\u03c0|s_t,a_t}[\u02c6Q(s_t, a_t)], and \u03c1^\u03c0 = \u2211_{t=0}^{\u221e} \u03b3^t p(s_t = s) are the unnormalized state visitation\nfrequencies, while b(s_t) is known as the baseline, and serves to reduce the variance of the gradient estimate\n(Williams, 1992). If the baseline estimates the value function, V^\u03c0(s_t) = E_{a_t\u223c\u03c0(\u00b7|s_t)}[Q^\u03c0(s_t, a_t)],\nthen \u02c6A(s_t, a_t) is an estimate of the advantage function A^\u03c0(s_t, a_t) = Q^\u03c0(s_t, a_t) \u2212 V^\u03c0(s_t). 
Likelihood\nratio policy gradient methods use unbiased gradient estimates (except for the technicality detailed\nby Thomas (2014)), but they often suffer from high variance and are sample-intensive.\n\n2.2 Off-Policy Deterministic Policy Gradient\n\nPolicy gradient methods with function approximation (Sutton et al., 1999), or actor-critic methods,\nare a family of policy gradient methods which first estimate the critic, or the value, of the policy by\nQ_w \u2248 Q^\u03c0, and then greedily optimize the policy \u03c0_\u03b8 with respect to Q_w. While it is not necessary for\nsuch algorithms to be off-policy, we primarily analyze the off-policy variants, such as (Riedmiller,\n2005; Degris et al., 2012; Heess et al., 2015; Lillicrap et al., 2016). For example, DDPG (Lillicrap\net al., 2016), which optimizes a continuous deterministic policy \u03c0_\u03b8(a_t|s_t) = \u03b4(a_t = \u00b5_\u03b8(s_t)), can be\nsummarized by the following update equations, where Q' denotes the target Q network and \u03b2 denotes\nsome off-policy distribution, e.g. from experience replay (Lillicrap et al., 2016):\n\nw \u2190 arg min E_\u03b2[(Q_w(s_t, a_t) \u2212 y_t)^2],    y_t = r(s_t, a_t) + \u03b3 Q'(s_{t+1}, \u00b5_\u03b8(s_{t+1})),\n\u03b8 \u2190 arg max E_\u03b2[Q_w(s_t, \u00b5_\u03b8(s_t))].    (2)\n\nThis provides the following deterministic policy gradient through the critic:\n\n\u2207_\u03b8 J(\u03b8) \u2248 E_{\u03c1^\u03b2}[\u2207_\u03b8 Q_w(s_t, \u00b5_\u03b8(s_t))].    (3)\n\nThis policy gradient is generally biased due to the imperfect estimator Q_w and off-policy state\nsampling from \u03b2. Off-policy actor-critic algorithms therefore allow training the policy on off-policy\nsamples, at the cost of introducing potentially unbounded bias into the gradient estimate. This usually\nmakes off-policy algorithms less stable during learning, compared to on-policy algorithms using a\nlarge batch size for each update (Duan et al., 2016; Gu et al., 2017).\n\n\u03b2 | \u03bd | CV | Examples\n- | 0 | No | REINFORCE (Williams, 1992), TRPO (Schulman et al., 2015)\n\u03c0 | 0 | Yes | Q-Prop (Gu et al., 2017)\n- | 1 | - | DDPG (Silver et al., 2014; Lillicrap et al., 2016), SVG(0) (Heess et al., 2015)\n\u2260 \u03c0 | - | No | \u2248PGQ (O\u2019Donoghue et al., 2017), \u2248ACER (Wang et al., 2017)\n\nTable 1: Prior policy gradient method objectives as special cases of IPG.\n\n2.3 Off-Policy Control Variate Fitting\n\nThe control variates method (Ross, 2006) is a general technique for variance reduction of a Monte\nCarlo estimator by exploiting a correlated variable about which more is known, such as its\nanalytical expectation. General control variates for RL include state-action baselines, and an example\ncan be an off-policy fitted critic Q_w. Q-Prop (Gu et al., 2017), for example, used \u02dcQ_w, the first-order\nTaylor expansion of Q_w, as the control variate, and showed improvement in stability and sample\nefficiency of policy gradient methods. \u00b5_\u03b8 here corresponds to the mean of the stochastic policy \u03c0_\u03b8.\n\n\u2207_\u03b8 J(\u03b8) = E_{\u03c1^\u03c0,\u03c0}[\u2207_\u03b8 log \u03c0_\u03b8(a_t|s_t)(\u02c6Q(s_t, a_t) \u2212 \u02dcQ_w(s_t, a_t))] + E_{\u03c1^\u03c0}[\u2207_\u03b8 Q_w(s_t, \u00b5_\u03b8(s_t))].    (4)\n\nThe gradient estimator combines both likelihood ratio and deterministic policy gradients in Eq. 1\nand 3. 
It has lower variance and stable gradient estimates and enables more sample-efficient learning.\nHowever, one limitation of Q-Prop is that it uses only on-policy samples for estimating the policy\ngradient. This ensures that the Q-Prop estimator remains unbiased, but limits the use of off-policy\nsamples for further variance reduction.\n\n3 Interpolated Policy Gradient\n\nOur proposed approach, interpolated policy gradient (IPG), mixes likelihood ratio gradient with \u02c6Q,\nwhich provides unbiased but high-variance gradient estimation, and deterministic gradient through an\noff-policy fitted critic Q_w, which provides low-variance but biased gradients. IPG directly interpolates\nthe two terms from Eq. 1 and 3:\n\n\u2207_\u03b8 J(\u03b8) \u2248 (1 \u2212 \u03bd) E_{\u03c1^\u03c0,\u03c0}[\u2207_\u03b8 log \u03c0_\u03b8(a_t|s_t) \u02c6A(s_t, a_t)] + \u03bd E_{\u03c1^\u03b2}[\u2207_\u03b8 \u00afQ^\u03c0_w(s_t)],    (5)\n\nwhere we generalized the deterministic policy gradient through the critic as \u2207_\u03b8 \u00afQ^\u03c0_w(s_t) =\n\u2207_\u03b8 E_\u03c0[Q_w(s_t, \u00b7)]. This generalization is to make our analysis applicable with more general forms of\nthe critic-based control variates, as discussed in the Appendix. This gradient estimator is biased from\ntwo sources: off-policy state sampling \u03c1^\u03b2, and inaccuracies in the critic Q_w. However, as we show in\nSection 4, we can bound the biases for all the cases, and in some cases, the algorithm still guarantees\nmonotonic convergence as in Kakade & Langford (2002); Schulman et al. (2015).\n\n3.1 Control Variates for Interpolated Policy Gradient\n\nWhile IPG includes \u03bd to trade off bias and variance directly, it contains a likelihood ratio gradient term,\nfor which we can introduce a control variate (CV) (Ross, 2006) to further reduce the estimator variance.\nThe expression for the IPG with control variates is below, where A^\u03c0_w(s_t, a_t) = Q_w(s_t, a_t) \u2212 \u00afQ^\u03c0_w(s_t),\n\n\u2207_\u03b8 J(\u03b8) \u2248 (1 \u2212 \u03bd) E_{\u03c1^\u03c0,\u03c0}[\u2207_\u03b8 log \u03c0_\u03b8(a_t|s_t) \u02c6A(s_t, a_t)] + \u03bd E_{\u03c1^\u03b2}[\u2207_\u03b8 \u00afQ^\u03c0_w(s_t)]\n= (1 \u2212 \u03bd) E_{\u03c1^\u03c0,\u03c0}[\u2207_\u03b8 log \u03c0_\u03b8(a_t|s_t)(\u02c6A(s_t, a_t) \u2212 A^\u03c0_w(s_t, a_t))] + (1 \u2212 \u03bd) E_{\u03c1^\u03c0}[\u2207_\u03b8 \u00afQ^\u03c0_w(s_t)] + \u03bd E_{\u03c1^\u03b2}[\u2207_\u03b8 \u00afQ^\u03c0_w(s_t)]\n\u2248 (1 \u2212 \u03bd) E_{\u03c1^\u03c0,\u03c0}[\u2207_\u03b8 log \u03c0_\u03b8(a_t|s_t)(\u02c6A(s_t, a_t) \u2212 A^\u03c0_w(s_t, a_t))] + E_{\u03c1^\u03b2}[\u2207_\u03b8 \u00afQ^\u03c0_w(s_t)].    (6)\n\nThe first approximation indicates the biased approximation from IPG, while the second approximation\nindicates replacing the \u03c1^\u03c0 in the control variate correction term with \u03c1^\u03b2 and merging with the last\nterm. The second approximation is a design decision and introduces additional bias when \u03b2 \u2260 \u03c0 but it\nhelps simplify the expression to be analyzed more easily, and the additional benefit from the variance\nreduction from the control variate could still outweigh this extra bias. The biases are analyzed in\nSection 4. The likelihood ratio gradient term is now proportional to the residual in on- and off-policy\nadvantage estimates \u02c6A(s_t, a_t) \u2212 A^\u03c0_w(s_t, a_t), and therefore, we call this term residual likelihood ratio\ngradient. 
Intuitively, if the off-policy critic estimate is accurate, this term has a low magnitude and\nthe overall variance of the estimator is reduced.\n\n3.2 Relationship to Prior Policy Gradient and Actor-Critic Methods\n\nCrucially, IPG allows interpolating a rich list of prior deep policy gradient methods using only three\nparameters: \u03b2, \u03bd, and the use of the control variate (CV). The connection is summarized in Table 1\nand the algorithm is presented in Algorithm 1. Importantly, a wide range of prior work has only\nexplored limiting cases of the spectrum, e.g. \u03bd = 0, 1, with or without the control variate. Our work\nprovides a thorough theoretical analysis of the biases, and in some cases performance guarantees,\nfor each of the methods in this spectrum and empirically demonstrates that the best performing\nalgorithms are often in the middle of the spectrum.\n\nAlgorithm 1 Interpolated Policy Gradient\ninput \u03b2, \u03bd, useCV\n1: Initialize w for critic Q_w, \u03b8 for stochastic policy \u03c0_\u03b8, and replay buffer R \u2190 \u2205.\n2: repeat\n3:   Roll-out \u03c0_\u03b8 for E episodes, T time steps each, to collect a batch of data B = {s, a, r}_{1:T,1:E} to R\n4:   Fit Q_w using R and \u03c0_\u03b8, and fit baseline V_\u03c6(s_t) using B\n5:   Compute Monte Carlo advantage estimate \u02c6A_{t,e} using B and V_\u03c6\n6:   if useCV then\n7:     Compute critic-based advantage estimate \u00afA_{t,e} using B, Q_w and \u03c0_\u03b8\n8:     Compute and center the learning signals l_{t,e} = \u02c6A_{t,e} \u2212 \u00afA_{t,e} and set b = 1\n9:   else\n10:     Center the learning signals l_{t,e} = \u02c6A_{t,e} and set b = \u03bd\n11:   end if\n12:   Multiply l_{t,e} by (1 \u2212 \u03bd)\n13:   Sample D = s_{1:M} from R and/or B based on \u03b2\n14:   Compute \u2207_\u03b8 J(\u03b8) \u2248 (1/(ET)) \u2211_e \u2211_t \u2207_\u03b8 log \u03c0_\u03b8(a_{t,e}|s_{t,e}) l_{t,e} + (b/M) \u2211_m \u2207_\u03b8 \u00afQ^\u03c0_w(s_m)\n15:   Update policy \u03c0_\u03b8 using \u2207_\u03b8 J(\u03b8)\n16: until \u03c0_\u03b8 converges.\n\n3.3 \u03bd = 1: Actor-Critic Methods\n\nBefore presenting our theoretical analysis, an important special case to discuss is \u03bd = 1, which\ncorresponds to a deterministic actor-critic method. Several advantages of this special case include\nthat the policy can be deterministic and the learning can be done completely off-policy, as it does not\nhave to estimate the on-policy Monte Carlo critic \u02c6Q. Prior work such as DDPG (Lillicrap et al., 2016)\nand related Q-learning methods have proposed aggressive off-policy exploration strategies to exploit\nthese properties of the algorithm. In this work, we compare alternatives such as using on-policy\nexploration and stochastic policy with classical DDPG algorithm designs, and show that in some\ndomains the off-policy exploration can significantly deteriorate the performance. Theoretically, we\nconfirm this empirical observation by showing that the bias from off-policy sampling in \u03b2 increases\nmonotonically with the total variation or KL divergence between \u03b2 and \u03c0. Both the empirical and\ntheoretical results indicate that well-designed actor-critic methods with an on-policy exploration\nstrategy could be a more reliable alternative than those with off-policy exploration.\n\n4 Theoretical Analysis\n\nIn this section, we present a theoretical analysis of the bias in the interpolated policy gradient. This is\ncrucial, since understanding the biases of the methods can improve our intuition about their performance\nand make it easier to design new algorithms in the future. Because IPG includes many prior methods\nas special cases, our analysis also applies to those methods and other intermediate cases. We first\nanalyze a special case and then derive results for general IPG. 
All proofs are in the Appendix.\n\n4.1 \u03b2 \u2260 \u03c0, \u03bd = 0: Policy Gradient with Control Variate and Off-Policy Sampling\n\nThis section provides an analysis of the special case of IPG with \u03b2 \u2260 \u03c0, \u03bd = 0, and the control\nvariate. Plugging into Eq. 6, we get an expression similar to Q-Prop in Eq. 4,\n\n\u2207_\u03b8 J(\u03b8) \u2248 E_{\u03c1^\u03c0,\u03c0}[\u2207_\u03b8 log \u03c0_\u03b8(a_t|s_t)(\u02c6A(s_t, a_t) \u2212 A^\u03c0_w(s_t, a_t))] + E_{\u03c1^\u03b2}[\u2207_\u03b8 \u00afQ^\u03c0_w(s_t)],    (7)\n\nexcept that it also supports utilizing off-policy data for updating the policy. To analyze the bias for\nthis gradient expression, we first introduce \u02dcJ(\u03c0, \u02dc\u03c0), a local approximation to J(\u03c0), which has been\nused in prior theoretical work (Kakade & Langford, 2002; Schulman et al., 2015). The derivation and\nthe bias from this approximation are discussed in the proof for Theorem 1 in the Appendix.\n\nJ(\u03c0) = J(\u02dc\u03c0) + E_{\u03c1^\u03c0,\u03c0}[A^{\u02dc\u03c0}(s_t, a_t)] \u2248 J(\u02dc\u03c0) + E_{\u03c1^{\u02dc\u03c0},\u03c0}[A^{\u02dc\u03c0}(s_t, a_t)] = \u02dcJ(\u03c0, \u02dc\u03c0).    (8)\n\nNote that J(\u03c0) = \u02dcJ(\u03c0, \u02dc\u03c0 = \u03c0) and \u2207_\u03c0 J(\u03c0) = \u2207_\u03c0 \u02dcJ(\u03c0, \u02dc\u03c0 = \u03c0). In practice, \u02dc\u03c0 corresponds to policy\n\u03c0_k at iteration k and \u03c0 corresponds to the next policy \u03c0_{k+1} after the parameter update. Thus, this approximation\nis often sufficiently good. Next, we write the approximate objective for Eq. 7,\n\n\u02dcJ^{\u03b2,\u03bd=0,CV}(\u03c0, \u02dc\u03c0) \u225c J(\u02dc\u03c0) + E_{\u03c1^{\u02dc\u03c0},\u03c0}[A^{\u02dc\u03c0}(s_t, a_t) \u2212 A^{\u02dc\u03c0}_w(s_t, a_t)] + E_{\u03c1^\u03b2}[\u00afA^{\u03c0,\u02dc\u03c0}_w(s_t)] \u2248 \u02dcJ(\u03c0, \u02dc\u03c0),\n\u00afA^{\u03c0,\u02dc\u03c0}_w(s_t) = E_\u03c0[A^{\u02dc\u03c0}_w(s_t, \u00b7)] = E_\u03c0[Q_w(s_t, \u00b7)] \u2212 E_{\u02dc\u03c0}[Q_w(s_t, \u00b7)].    (9)\n\nNote that \u02dcJ^{\u03b2,\u03bd=0,CV}(\u03c0, \u02dc\u03c0 = \u03c0) = \u02dcJ(\u03c0, \u02dc\u03c0 = \u03c0) = J(\u03c0), and \u2207_\u03c0 \u02dcJ^{\u03b2,\u03bd=0,CV}(\u03c0, \u02dc\u03c0 = \u03c0) equals Eq. 7. We\ncan bound the absolute error between \u02dcJ^{\u03b2,\u03bd=0,CV}(\u03c0, \u02dc\u03c0) and J(\u03c0) by the following theorem, where\nD^max_KL(\u03c0_i, \u03c0_j) = max_s D_KL(\u03c0_i(\u00b7|s), \u03c0_j(\u00b7|s)) is the maximum KL divergence between \u03c0_i, \u03c0_j.\n\nTheorem 1. If \u03b5 = max_s |\u00afA^{\u03c0,\u02dc\u03c0}_w(s)|, \u03b6 = max_s |\u00afA^{\u03c0,\u02dc\u03c0}(s)|, then\n\n\u2016J(\u03c0) \u2212 \u02dcJ^{\u03b2,\u03bd=0,CV}(\u03c0, \u02dc\u03c0)\u2016_1 \u2264 2 (\u03b3 / (1 \u2212 \u03b3)^2) (\u03b5 \u221a(D^max_KL(\u02dc\u03c0, \u03b2)) + \u03b6 \u221a(D^max_KL(\u03c0, \u02dc\u03c0))).\n\nTheorem 1 contains two terms: the second term confirms \u02dcJ^{\u03b2,\u03bd=0,CV} is a local approximation around\n\u03c0 and deviates from J(\u03c0) as \u02dc\u03c0 deviates, and the first term bounds the bias from off-policy sampling\nusing the KL divergence between the policies \u02dc\u03c0 and \u03b2. 
This means that the algorithm fits well with\npolicy gradient methods which constrain the KL divergence per policy update, such as covariant\npolicy gradient (Bagnell & Schneider, 2003), natural policy gradient (Kakade & Langford, 2002),\nREPS (Peters et al., 2010), and trust-region policy optimization (TRPO) (Schulman et al., 2015).\n\n4.1.1 Monotonic Policy Improvement Guarantee\n\nSome forms of on-policy policy gradient methods have theoretical guarantees on monotonic\nconvergence (Kakade & Langford, 2002; Schulman et al., 2015). Such guarantees often correspond to\nstable empirical performance on challenging problems, even when some of the constraints are relaxed\nin practice (Schulman et al., 2015; Duan et al., 2016; Gu et al., 2017). We can show that a variant of\nIPG allows off-policy sampling while still guaranteeing monotonic convergence. The algorithm and\nthe proof are provided in the Appendix. This algorithm is usually impractical to implement; however,\nIPG with trust-region updates when \u03b2 \u2260 \u03c0, \u03bd = 0, CV = true approximates this monotonic\nalgorithm, similar to how TRPO is an approximation to the theoretically monotonic algorithm proposed\nby Schulman et al. (2015).\n\n4.2 General Bounds on the Interpolated Policy Gradient\n\nWe can establish bias bounds for the general IPG algorithm, with and without the control variate,\nusing Theorem 2. The additional term that contributes to the bias in the general case is \u03b4, which\nrepresents the error between the advantage estimated by the off-policy critic and the true A^\u03c0 values.\nTheorem 2. 
If \u03b4 = max_{s,a} |A^{\u02dc\u03c0}(s, a) \u2212 A^{\u02dc\u03c0}_w(s, a)|, \u03b5 = max_s |\u00afA^{\u03c0,\u02dc\u03c0}_w(s)|, \u03b6 = max_s |\u00afA^{\u03c0,\u02dc\u03c0}(s)|, and\n\n\u02dcJ^{\u03b2,\u03bd}(\u03c0, \u02dc\u03c0) \u225c J(\u02dc\u03c0) + (1 \u2212 \u03bd) E_{\u03c1^{\u02dc\u03c0},\u03c0}[\u02c6A^{\u02dc\u03c0}] + \u03bd E_{\u03c1^\u03b2}[\u00afA^{\u03c0,\u02dc\u03c0}_w],\n\u02dcJ^{\u03b2,\u03bd,CV}(\u03c0, \u02dc\u03c0) \u225c J(\u02dc\u03c0) + (1 \u2212 \u03bd) E_{\u03c1^{\u02dc\u03c0},\u03c0}[\u02c6A^{\u02dc\u03c0} \u2212 A^{\u02dc\u03c0}_w] + E_{\u03c1^\u03b2}[\u00afA^{\u03c0,\u02dc\u03c0}_w],\n\nthen,\n\n\u2016J(\u03c0) \u2212 \u02dcJ^{\u03b2,\u03bd}(\u03c0, \u02dc\u03c0)\u2016_1 \u2264 \u03bd\u03b4 / (1 \u2212 \u03b3) + 2 (\u03b3 / (1 \u2212 \u03b3)^2) (\u03bd\u03b5 \u221a(D^max_KL(\u02dc\u03c0, \u03b2)) + \u03b6 \u221a(D^max_KL(\u03c0, \u02dc\u03c0))),\n\u2016J(\u03c0) \u2212 \u02dcJ^{\u03b2,\u03bd,CV}(\u03c0, \u02dc\u03c0)\u2016_1 \u2264 \u03bd\u03b4 / (1 \u2212 \u03b3) + 2 (\u03b3 / (1 \u2212 \u03b3)^2) (\u03b5 \u221a(D^max_KL(\u02dc\u03c0, \u03b2)) + \u03b6 \u221a(D^max_KL(\u03c0, \u02dc\u03c0))).\n\nThis bound shows that the bias from directly mixing the deterministic policy gradient through \u03bd\ncomes from two terms: how well the critic Q_w is approximating Q^\u03c0, and how close the off-policy\nsampling policy is to the actor policy. We also show that the bias introduced is proportional to \u03bd\nwhile the variance of the high variance likelihood ratio gradient term is proportional to (1 \u2212 \u03bd)^2, so \u03bd\nallows directly trading off bias and variance. 
Theorem 2 fully bounds bias in the full spectrum of IPG\nmethods; this enables us to analyze how biases arise and interact and helps us design better algorithms.\n\n5 Related Work\n\nAn overarching aim of this paper is to help unify on-policy and off-policy policy gradient\nalgorithms into a single conceptual framework. Our analysis examines how Q-Prop (Gu et al., 2017),\nPGQ (O\u2019Donoghue et al., 2017), and ACER (Wang et al., 2017), which are all recent works that\ncombine on-policy with off-policy learning, are connected to each other (see Table 1). IPG with\n0 < \u03bd < 1 and without the control variate relates closely to PGQ and ACER, but differs in the details.\nPGQ mixes in the Q-learning Bellman error objective, and ACER mixes parameter update steps\nrather than directly mixing gradients. Both PGQ and ACER also come with numerous additional\ndesign details that make fair comparisons with methods like TRPO and Q-Prop difficult. We instead\nfocus on the three minimal variables of IPG and explore their settings in relation to the closely related\nTRPO and Q-Prop methods, in order to theoretically and empirically understand in which situations\nwe might expect gains from mixing on- and off-policy gradients.\nAside from these more recent works, the use of off-policy samples with policy gradients has been\na popular direction of research (Peshkin & Shelton, 2002; Jie & Abbeel, 2010; Degris et al., 2012;\nLevine & Koltun, 2013). Most of these methods rely on variants of importance sampling (IS) to\ncorrect for bias. The use of importance sampling ensures unbiased estimates, but at the cost of\nconsiderable variance, as quantified by the ESS measure used by Jie & Abbeel (2010). Ignoring\nimportance weights produces bias but, as shown in our analysis, this bias can be bounded. Therefore,\nour IPG estimators have higher bias as the sampling distribution deviates from the policy, while\nIS methods have higher variance. 
Among these importance sampling methods, Levine & Koltun\n(2013) evaluates on tasks that are the most similar to our paper, but the focus is on using importance\nsampling to include demonstrations, rather than to speed up learning from scratch.\nLastly, there are many methods that combine on- and off-policy data for policy evaluation (Precup,\n2000; Mahmood et al., 2014; Munos et al., 2016), mostly through variants of importance sampling.\nCombining our methods with more sophisticated policy evaluation methods will likely lead to further\nimprovements, as done in (Degris et al., 2012). A more detailed analysis of the effect of importance\nsampling on bias and variance is left to future work, where some of the relevant work includes Precup\n(2000); Jie & Abbeel (2010); Mahmood et al. (2014); Jiang & Li (2016); Thomas & Brunskill (2016).\n\n6 Experiments\n\n(a) IPG with \u03bd = 0 and the control variate.\n\n(b) IPG with \u03bd = 1.\n\nFigure 1: (a) IPG-\u03bd=0 vs Q-Prop on HalfCheetah-v1, with batch size 5000. IPG-\u03b2-rand30000,\nwhich uses 30000 random samples from the replay as samples from \u03b2, outperforms Q-Prop in terms\nof learning speed. (b) IPG-\u03bd=1 vs other algorithms on Ant-v1. In this domain, on-policy IPG-\u03bd=1\nwith on-policy exploration significantly outperforms DDPG and IPG-\u03bd=1-OU, which use a heuristic\nOU (Ornstein\u2013Uhlenbeck) process noise exploration strategy, and marginally outperforms Q-Prop.\n\nIn this section, we empirically show that the three parameters of IPG can interpolate different\nbehaviors and often achieve superior performance versus prior methods that are limiting cases of this\napproach. Crucially, all methods share the same algorithmic structure as Algorithm 1, and we hold\nthe rest of the experimental details fixed. All experiments were performed on MuJoCo domains in\nOpenAI Gym (Todorov et al., 2012; Brockman et al., 2016), with results presented for the average\nover three seeds. 
Additional experimental details are provided in the Appendix.\n\n6.1 \u03b2 \u2260 \u03c0, \u03bd = 0, with the control variate\n\nWe evaluate the performance of the special case of IPG discussed in Section 4.1. This case is of\nparticular interest, since we can derive monotonic convergence results for a variant of this method\nunder certain conditions, despite the presence of off-policy updates. Figure 1a shows the performance\non the HalfCheetah-v1 domain, when the policy update batch size is 5000 transitions (i.e. 5 episodes).\n\u201clast\u201d and \u201crand\u201d indicate if \u03b2 samples from the most recent transitions or uniformly from the\nexperience replay. \u201clast05000\u201d would be equivalent to Q-Prop given \u03bd = 0. Comparing \u201cIPG-\u03b2-rand05000\u201d\nand \u201cQ-Prop\u201d curves, we observe that by drawing the same number of samples randomly\nfrom the replay buffer for estimating the critic gradient, instead of using the on-policy samples, we\nget faster convergence. If we sample batches of size 30000 from the replay buffer, the performance\nfurther improves. However, as seen in the \u201cIPG-\u03b2-last30000\u201d curve, if we instead use the 30000\nmost recent samples, the performance degrades. One possible explanation for this is that, while\nusing random samples from the replay increases the bound on the bias according to Theorem 1, it\nalso decorrelates the samples within the batch, providing more stable gradients. This is the original\nmotivation for experience replay in the DQN method (Mnih et al., 2015), and we have shown that\nsuch decorrelated off-policy samples can similarly produce gains for policy gradient algorithms. See\nTable 2 for results on other domains.\nThe results for this variant of IPG demonstrate that random sampling from the replay provides further\nimprovement on top of Q-Prop. 
Note that these replay buffer samples are different from standard\noff-policy samples in DDPG or DQN algorithms, which often use aggressive heuristic exploration\nstrategies. The samples used by IPG are sampled from prior policies that follow a conservative\ntrust-region update, resulting in greater regularity but less exploration. In the next section, we show\nthat in some cases, ensuring that the off-policy samples are not too off-policy is essential for good\nperformance.\n\n6.2 \u03b2 = \u03c0, \u03bd = 1\n\nIn this section, we empirically evaluate another special case of IPG, where \u03b2 = \u03c0, indicating on-\npolicy sampling, and \u03bd = 1, which reduces to a trust-region, on-policy variant of a deterministic\nactor-critic method. Although this algorithm performs actor-critic updates, the use of a trust region\nmakes it more similar to TRPO or Q-Prop than DDPG.\n\n7\n\n\fIPG-\u03bd=0.2\nIPG-cv-\u03bd=0.2\nIPG-\u03bd=1\nQ-Prop\nTRPO\n\nHalfCheetah-v1\n\u03b2 (cid:54)= \u03c0\n\u03b2 = \u03c0\n3458\n3356\n4216\n4023\n4767\n2962\n4182\n4178\n2889\nN.A.\n\n\u03b2 = \u03c0\n4237\n3943\n3469\n3374\n1520\n\n\u03b2 (cid:54)= \u03c0\n4415\n3421\n3780\n3479\nN.A.\n\nAnt-v1\n\nWalker-v1\n\n\u03b2 = \u03c0\n3047\n1896\n2704\n2832\n1487\n\n\u03b2 (cid:54)= \u03c0\n1932\n1411\n805\n1692\nN.A.\n\nHumanoid-v1\n\u03b2 (cid:54)= \u03c0\n\u03b2 = \u03c0\n920\n1231\n1613\n1651\n1530\n1571\n1519\n1423\n615\nN.A.\n\nTable 2: Comparisons on all domains with mini-batch size 10000 for Humanoid and 5000 otherwise.\nWe compare the maximum of average test rewards in the \ufb01rst 10000 episodes (Humanoid requires\nmore steps to fully converge; see the Appendix for learning curves). Results outperforming Q-Prop (or\nIPG-cv-\u03bd=0 with \u03b2 = \u03c0) are boldface. The two columns show results with on-policy and off-policy\nsamples for estimating the deterministic policy gradient.\n\nResults for all domains are shown in Table 2. 
Figure 1b shows the learning curves on Ant-v1.\nAlthough IPG-\u03bd=1 methods can be off-policy, the policy is updated every 5000 samples to keep it\nconsistent with other IPG methods, while DDPG updates the policy on every step in the environment\nand makes other design choices (Lillicrap et al., 2016). We see that, in this domain, standard DDPG\nbecomes stuck with a mean reward of 1000, while IPG-\u03bd=1 improves monotonically, achieving a\nsignificantly better result. To investigate why this large discrepancy arises, we also ran IPG-\u03bd=1 with\nthe same OU process exploration noise as DDPG, and observed a large degradation in performance.\nThis provides empirical support for Theorem 2. It is illuminating to contrast this result with the\nprevious experiment, where the off-policy samples did not adversely alter the results. In the previous\nexperiments, the samples came from Gaussian policies updated with trust regions. The difference\nbetween \u03c0 and \u03b2 was therefore approximately bounded by the trust regions. In the experiment with\nOU process noise, the behavior policy uses temporally correlated noise, with potentially unbounded\nKL-divergence from the learned Gaussian policy. In this case, the off-policy samples result in\nexcessive bias, wiping out the variance reduction benefits of off-policy sampling. In general, we\nobserved that for the harder Ant-v1 and Walker-v1 domains, on-policy exploration is more effective,\neven when doing off-policy state sampling from a replay buffer. 
This result suggests the following\nlesson for designing off-policy actor-critic methods: for domains where exploration is difficult, it may\nbe more effective to use on-policy exploration with bounded policy updates than to design heuristic\nexploration rules such as OU process noise, due to the resulting reduction in bias.\n\n6.3 General Cases of Interpolated Policy Gradient\n\nTable 2 shows the results for experiments where we compare IPG methods with varying values of\n\u03bd; additional results are provided in the Appendix. \u03b2 \u2260 \u03c0 indicates that the method uses off-policy\nsamples from the replay buffer, with the same batch size as the on-policy batch for a fair comparison.\nWe ran sweeps over \u03bd \u2208 {0.2, 0.4, 0.6, 0.8} and found that \u03bd = 0.2 consistently produces better\nperformance than Q-Prop, TRPO or prior actor-critic methods. This is consistent with the results in\nPGQ (O\u2019Donoghue et al., 2017) and ACER (Wang et al., 2017), which found that their equivalent\nof \u03bd = 0.1 performed best on their benchmarks. Importantly, we compared all methods with the\nsame algorithm designs (exploration, policy, etc.), since Q-Prop and TRPO are IPG-\u03bd=0 with and\nwithout the control variate. IPG-\u03bd=1 is a novel variant of the actor-critic method that differs from\nDDPG (Lillicrap et al., 2016) and SVG(0) (Heess et al., 2015) due to the use of a trust region. The\nresults in Table 2 suggest that, in most cases, the best performing algorithm is one that interpolates\nbetween the policy-gradient and actor-critic variants, with intermediate values of \u03bd.\n\n7 Discussion\n\nIn this paper, we introduced interpolated policy gradient methods, a family of policy gradient\nalgorithms that allow mixing off-policy learning with on-policy learning while satisfying performance\nbounds. 
This family of algorithms unifies and interpolates between the on-policy likelihood ratio policy gradient\nand the off-policy deterministic policy gradient, and includes a number of prior works as approximate\nlimiting cases. Empirical results confirm that, in many cases, interpolated gradients have improved\nsample efficiency and stability over the prior state-of-the-art methods, and the theoretical results\nprovide intuition for analyzing the cases in which the different methods perform well or poorly. Our\nhope is that this detailed analysis of interpolated gradient methods can not only yield more\neffective algorithms in practice, but also provide useful insight for future algorithm design.\n\nAcknowledgements\n\nThis work is supported by generous sponsorship from the Cambridge-T\u00fcbingen PhD Fellowship, NSERC,\nand a Google Focused Research Award.\n\nReferences\n\nBagnell, J Andrew and Schneider, Jeff. Covariant policy search. IJCAI, 2003.\n\nBrockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie,\nand Zaremba, Wojciech. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.\n\nDegris, Thomas, White, Martha, and Sutton, Richard S. Off-policy actor-critic. arXiv preprint\narXiv:1205.4839, 2012.\n\nDuan, Yan, Chen, Xi, Houthooft, Rein, Schulman, John, and Abbeel, Pieter. Benchmarking deep\nreinforcement learning for continuous control. International Conference on Machine Learning\n(ICML), 2016.\n\nGu, Shixiang, Lillicrap, Timothy, Ghahramani, Zoubin, Turner, Richard E, and Levine, Sergey.\nQ-Prop: Sample-efficient policy gradient with an off-policy critic. ICLR, 2017.\n\nHeess, Nicolas, Wayne, Gregory, Silver, David, Lillicrap, Tim, Erez, Tom, and Tassa, Yuval. Learning\ncontinuous control policies by stochastic value gradients. In Advances in Neural Information\nProcessing Systems, pp. 2944\u20132952, 2015.\n\nJiang, Nan and Li, Lihong. Doubly robust off-policy value evaluation for reinforcement learning. 
In International Conference on Machine Learning, pp. 652\u2013661, 2016.\n\nJie, Tang and Abbeel, Pieter. On a connection between importance sampling and the likelihood ratio\npolicy gradient. In Advances in Neural Information Processing Systems, pp. 1000\u20131008, 2010.\n\nKakade, Sham and Langford, John. Approximately optimal approximate reinforcement learning. In\nInternational Conference on Machine Learning (ICML), volume 2, pp. 267\u2013274, 2002.\n\nLevine, Sergey and Koltun, Vladlen. Guided policy search. In International Conference on Machine\nLearning (ICML), pp. 1\u20139, 2013.\n\nLevine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep\nvisuomotor policies. Journal of Machine Learning Research, 17(39):1\u201340, 2016.\n\nLillicrap, Timothy P, Hunt, Jonathan J, Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval,\nSilver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. ICLR,\n2016.\n\nMahmood, A Rupam, van Hasselt, Hado P, and Sutton, Richard S. Weighted importance sampling\nfor off-policy learning with linear function approximation. In Advances in Neural Information\nProcessing Systems, pp. 3014\u20133022, 2014.\n\nMnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare,\nMarc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al.\nHuman-level control through deep reinforcement learning. Nature, 518(7540):529\u2013533, 2015.\n\nMunos, R\u00e9mi, Stepleton, Tom, Harutyunyan, Anna, and Bellemare, Marc G. Safe and efficient\noff-policy reinforcement learning. arXiv preprint arXiv:1606.02647, 2016.\n\nO\u2019Donoghue, Brendan, Munos, Remi, Kavukcuoglu, Koray, and Mnih, Volodymyr. PGQ: Combining\npolicy gradient and Q-learning. ICLR, 2017.\n\nPeshkin, Leonid and Shelton, Christian R. Learning from scarce experience. 
In Proceedings of the\nNineteenth International Conference on Machine Learning, 2002.\n\nPeters, Jan, M\u00fclling, Katharina, and Altun, Yasemin. Relative entropy policy search. In AAAI.\nAtlanta, 2010.\n\nPrecup, Doina. Eligibility traces for off-policy policy evaluation. Computer Science Department\nFaculty Publication Series, pp. 80, 2000.\n\nRiedmiller, Martin. Neural fitted Q iteration\u2013first experiences with a data efficient neural reinforcement\nlearning method. In European Conference on Machine Learning, pp. 317\u2013328. Springer, 2005.\n\nRoss, Sheldon M. Simulation. Burlington, MA: Elsevier, 2006.\n\nSchulman, John, Levine, Sergey, Abbeel, Pieter, Jordan, Michael I, and Moritz, Philipp. Trust region\npolicy optimization. In ICML, pp. 1889\u20131897, 2015.\n\nSchulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter.\nHigh-dimensional continuous control using generalized advantage estimation. International Conference\non Learning Representations (ICLR), 2016.\n\nSilver, David, Lever, Guy, Heess, Nicolas, Degris, Thomas, Wierstra, Daan, and Riedmiller, Martin.\nDeterministic policy gradient algorithms. In International Conference on Machine Learning\n(ICML), 2014.\n\nSilver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche,\nGeorge, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al.\nMastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484\u2013489,\n2016.\n\nSutton, Richard S, McAllester, David A, Singh, Satinder P, Mansour, Yishay, et al. Policy\ngradient methods for reinforcement learning with function approximation. In Advances in Neural\nInformation Processing Systems (NIPS), volume 99, pp. 1057\u20131063, 1999.\n\nThomas, Philip. Bias in natural actor-critic algorithms. In ICML, pp. 441\u2013448, 2014.\n\nThomas, Philip and Brunskill, Emma. 
Data-efficient off-policy policy evaluation for reinforcement\nlearning. In International Conference on Machine Learning, pp. 2139\u20132148, 2016.\n\nTodorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control.\nIn 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026\u20135033.\nIEEE, 2012.\n\nWang, Ziyu, Bapst, Victor, Heess, Nicolas, Mnih, Volodymyr, Munos, Remi, Kavukcuoglu, Koray,\nand de Freitas, Nando. Sample efficient actor-critic with experience replay. ICLR, 2017.\n\nWilliams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement\nlearning. Machine learning, 8(3-4):229\u2013256, 1992.\n", "award": [], "sourceid": 2098, "authors": [{"given_name": "Shixiang (Shane)", "family_name": "Gu", "institution": "University of Cambridge and Max Planck Institute for Intelligent Systems"}, {"given_name": "Timothy", "family_name": "Lillicrap", "institution": "Google DeepMind"}, {"given_name": "Richard", "family_name": "Turner", "institution": "University of Cambridge"}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": "Uber and University of Cambridge"}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": "MPI for Intelligent Systems"}, {"given_name": "Sergey", "family_name": "Levine", "institution": "UC Berkeley"}]}