{"title": "Divergence-Augmented Policy Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 6099, "page_last": 6110, "abstract": "In deep reinforcement learning, policy optimization methods need to deal with issues such as function approximation and the reuse of off-policy data. Standard policy gradient methods do not handle off-policy data well, leading to premature convergence and instability. This paper introduces a method to stabilize policy optimization when off-policy data are reused. The idea is to include a Bregman divergence between the behavior policy that generates the data and the current policy to ensure small and safe policy updates with off-policy data. The Bregman divergence is calculated between the state distributions of two policies, instead of only on the action probabilities, leading to a divergence-augmentation formulation.\nEmpirical experiments on Atari games show that in the data-scarce scenario where the reuse of off-policy data becomes necessary, our method can achieve better performance than other state-of-the-art deep reinforcement learning algorithms.", "full_text": "Divergence-Augmented Policy Optimization

Qing Wang∗
Huya AI
Guangzhou, China

Yingru Li
The Chinese University of Hong Kong
Shenzhen, China

Jiechao Xiong
Tencent AI Lab
Shenzhen, China

Tong Zhang
The Hong Kong University of Science and Technology
Hong Kong, China

Abstract

In deep reinforcement learning, policy optimization methods need to deal with issues such as function approximation and the reuse of off-policy data. Standard policy gradient methods do not handle off-policy data well, leading to premature convergence and instability. This paper introduces a method to stabilize policy optimization when off-policy data are reused. 
The idea is to include a Bregman divergence between the behavior policy that generates the data and the current policy to ensure small and safe policy updates with off-policy data. The Bregman divergence is calculated between the state distributions of two policies, instead of only on the action probabilities, leading to a divergence-augmentation formulation. Empirical experiments on Atari games show that in the data-scarce scenario where the reuse of off-policy data becomes necessary, our method can achieve better performance than other state-of-the-art deep reinforcement learning algorithms.

1 Introduction

In recent years, many algorithms based on policy optimization have been proposed for deep reinforcement learning (DRL), leading to great successes in Go, video games, and robotics (Silver et al., 2016; Mnih et al., 2016; Schulman et al., 2015, 2017b). Real-world applications of policy-based methods commonly involve function approximation and data reuse. Typically, the reused data are generated with an earlier version of the policy, leading to off-policy learning. It is known that these issues may cause premature convergence and instability for policy gradient methods (Sutton et al., 2000; Sutton and Barto, 2017).

A standard technique that allows policy optimization methods to handle off-policy data is to use importance sampling to correct trajectories from the behavior policy that generates the data to the target policy (e.g. Retrace (Munos et al., 2016) and V-trace (Espeholt et al., 2018)). The efficiency of these methods depends on the divergence between the behavior policy and the target policy. Moreover, to improve the stability of training, one may introduce a regularization term (e.g. the Shannon-Gibbs entropy in (Mnih et al., 2016)), or use a proximal objective of the original policy gradient loss (e.g. clipping in (Schulman et al., 2017b; Wang et al., 2016a)). 
Although the well-adopted method of entropy regularization can stabilize the optimization process (Mnih et al., 2016), this additional entropy regularization alters the learning objective and prevents the algorithm from converging to the optimal action for each state. Even in the simple case of bandit problems, monotonically diminishing regularization may fail to converge to the best arm (Cesa-Bianchi et al., 2017).

In this work, we propose a method for policy optimization by adding a Bregman divergence term, which leads to more stable and sample-efficient off-policy learning. The Bregman divergence constraint is widely used to explore and exploit optimally in mirror descent methods (Nemirovsky and Yudin, 1983), in which a specific form of divergence can attain the optimal rate of regret (sample efficiency) for bandit problems (Audibert et al., 2011; Bubeck and Cesa-Bianchi, 2012). In contrast to the traditional approach of constraining the divergence between the target policy and the behavior policy conditioned on each state (Schulman et al., 2015), we consider the divergence over the joint state-action space. We show that the policy optimization problem with a Bregman divergence on the state-action space is equivalent to the standard policy gradient method with a divergence-augmented advantage. Under this view, the divergence-augmented policy optimization method not only considers the divergence on the current state but also takes into account the discrepancy of policies on future states, and thus provides a better constraint on the change of policy and encourages "deeper" exploration.

We experiment with the proposed method on the commonly used Atari 2600 environments from the Arcade Learning Environment (ALE) (Bellemare et al., 2013). 

∗The work was done when the first author was at Tencent AI Lab.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Empirical results show that the divergence-augmented policy optimization method performs better than state-of-the-art algorithms under data-scarce scenarios, i.e., when the sample-generating speed is limited and samples in replay memory are reused multiple times. We also conduct a comparative study of the main sources of improvement on these games.

The article is organized as follows: we give the basic background and notation in Section 2. The main method of divergence-augmented policy optimization is presented in Section 3, with connections to previous works discussed in Section 4. Empirical results and studies can be found in Section 5. We conclude this work with a short discussion in Section 6.

2 Preliminaries

In this section, we state the basic definition of the Markov decision process considered in this work, as well as the Bregman divergence used in the following discussions.

2.1 Markov Decision Process

We consider a Markov decision process (MDP) with infinite horizon and discounted reward, denoted by M = (S, A, P, r, d0, γ), where S is the finite state space, A is the finite action space, and P : S × A → Δ(S) is the transition function, with Δ(S) the space of all probability distributions on S. The reward function is denoted by r : S × A → R, the distribution of the initial state s0 by d0 ∈ Δ(S), and the discount factor by γ ∈ (0, 1).

A stochastic policy is denoted by π : S → Δ(A). The space of all policies is denoted by Π. 
We use the following standard notation for the state value V^π(s_t), the action value Q^π(s_t, a_t), and the advantage A^π(s_t, a_t), defined as

V^π(s_t) = E_{π|s_t} ∑_{l=0}^∞ γ^l r(s_{t+l}, a_{t+l}),  Q^π(s_t, a_t) = E_{π|s_t,a_t} ∑_{l=0}^∞ γ^l r(s_{t+l}, a_{t+l}),

and A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t), where E_{π|s_t} means a_l ∼ π(a|s_l), s_{l+1} ∼ P(s_{l+1}|s_l, a_l), ∀l ≥ t, and E_{π|s_t,a_t} means s_{l+1} ∼ P(s_{l+1}|s_l, a_l), a_{l+1} ∼ π(a|s_{l+1}), ∀l ≥ t. We also define the space of policy-induced state-action distributions under M as

Δ_Π = {μ ∈ Δ(S × A) : ∑_{a'} μ(s', a') = (1 − γ) d0(s') + γ ∑_{s,a} P(s'|s, a) μ(s, a), ∀s' ∈ S}.  (1)

We use the notation μ_π for the state-action distribution induced by π. On the other hand, for each μ ∈ Δ_Π, there also exists a unique policy π_μ(a|s) = μ(s, a) / ∑_b μ(s, b) which induces μ. We define the state distribution d_π as d_π(s) = (1 − γ) E_{τ|π} ∑_{t=0}^∞ γ^t 1(s_t = s). Then we have μ_π(s, a) = d_π(s) π(a|s). We sometimes write π_{μ_t} as π_t and d_{π_t} as d_t when there is no ambiguity.

In this paper, we mainly focus on the performance of a policy π, defined as

J(π) = (1 − γ) E_{τ|π} ∑_{t=0}^∞ γ^t r(s_t, a_t) = E_{d_π,π} r(s, a),  (2)

where E_{τ|π} means s0 ∼ d0, a_t ∼ π(a_t|s_t), s_{t+1} ∼ P(s_{t+1}|s_t, a_t), t ≥ 0. We use the notation E_{d,π} = E_{s∼d(·), a∼π(·|s)} for brevity.

2.2 Bregman Divergence

We define the Bregman divergence (Bregman, 1967) as follows (e.g. Definition 5.3 in (Bubeck and Cesa-Bianchi, 2012)). 
For D ⊂ R^d an open convex set with closure D̄, we consider a Legendre function F : D̄ → R, i.e., (1) F is strictly convex and admits continuous first partial derivatives on D, and (2) lim_{x→D̄∖D} ‖∇F(x)‖ = +∞. For such F, we define the Bregman divergence D_F : D̄ × D → R as

D_F(x, y) = F(x) − F(y) − ⟨∇F(y), x − y⟩.

The inner product is defined as ⟨x, y⟩ = ∑_i x_i y_i. For K ⊂ D̄ with K ∩ D ≠ ∅, the Bregman projection

z = argmin_{x∈K} D_F(x, y)

exists uniquely for all y ∈ D. Specifically, for F(x) = ∑_i x_i log(x_i) − ∑_i x_i, we recover the Kullback-Leibler (KL) divergence

D_KL(μ', μ) = ∑_{s,a} μ'(s, a) log (μ'(s, a) / μ(s, a))

for μ, μ' ∈ Δ(S × A) and π, π' ∈ Π. To measure the distance between two policies π and π', we also use the symbol for the conditional "Bregman divergence"² associated with a state distribution d, denoted as

D_F^d(π', π) = ∑_s d(s) D_F(π'(·|s), π(·|s)).  (3)

3 Method

In this section, we present the proposed method from the motivation of mirror descent, and then discuss the parametrization and off-policy correction employed in the practical learning algorithm.

3.1 Policy Optimization and Mirror Descent

The mirror descent (MD) method (Nemirovsky and Yudin, 1983) is a central topic in the optimization and online learning research literature. As a first-order method for optimization, the mirror descent method can recover several interesting algorithms discovered previously (Sutton et al., 2000; Kakade, 2002; Peters et al., 2010; Schulman et al., 2015). 
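As a quick numerical check of the Bregman divergence defined in Section 2.2 (our own illustration, not part of the paper), the negative-entropy Legendre function F(x) = ∑_i x_i log(x_i) − ∑_i x_i indeed reproduces the KL divergence on probability vectors:

```python
import math

def F(x):
    # Legendre function F(x) = sum_i x_i log(x_i) - sum_i x_i
    return sum(xi * math.log(xi) for xi in x) - sum(x)

def grad_F(y):
    # (grad F(y))_i = log(y_i)
    return [math.log(yi) for yi in y]

def bregman(x, y):
    # D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>
    g = grad_F(y)
    return F(x) - F(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))

def kl(x, y):
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
print(abs(bregman(p, q) - kl(p, q)) < 1e-12)  # D_F coincides with KL here
```

Note that the identity uses ∑x = ∑y = 1; on unnormalized measures, D_F is the generalized KL, which differs from KL by ∑y − ∑x.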
On the other hand, as an online learning method, the online (stochastic) mirror descent method can achieve (near-)optimal sample efficiency for a wide range of problems (Audibert and Bubeck, 2009; Audibert et al., 2011; Zimin and Neu, 2013). In this work, following a series of previous works (Zimin and Neu, 2013; Neu et al., 2017), we investigate the (online) mirror descent method for policy optimization. We denote the state-action distribution at iteration t as μ_t, and ℓ_t(μ) = ⟨g_t, μ⟩ as the linear loss function for μ at iteration t. Unless otherwise noted, we consider the negative reward as the loss objective ℓ_t(μ) = −⟨r, μ⟩, which also corresponds to the policy performance ℓ_t(μ) ≡ −J(π_μ) by Formula (2). We consider the mirror map method associated with the Legendre function F as

∇F(μ̃_{t+1}) = ∇F(μ_t) − η g_t,  (4)
μ_{t+1} ∈ Π_{Δ_Π}(μ̃_{t+1}),  (5)

where μ̃_{t+1} ∈ Δ(S × A) and g_t = ∇ℓ_t(μ_t). It is well known (Beck and Teboulle, 2003) that an equivalent formulation of the mirror map (4) is

μ_{t+1} = argmin_{μ∈Δ_Π} D_F(μ, μ̃_{t+1})  (6)
        = argmin_{μ∈Δ_Π} D_F(μ, μ_t) + η⟨g_t, μ⟩.  (7)

The former formulation (6) takes the view of non-linear sub-gradient projection in convex optimization, while the latter formulation (7) can be interpreted as a regularized optimization and is the usual definition of mirror descent (Nemirovsky and Yudin, 1983; Beck and Teboulle, 2003; Bubeck, 2015). In this work, we will mostly investigate the approximate algorithm in the latter formulation (7).

²Note that D_F^d may not be a Bregman divergence.

3.2 Parametric Policy-based Algorithm

In the mirror descent view of policy optimization on the state-action space as in Formula (7), we need to compute the projection of μ onto the space Δ_Π. For the special case of the KL divergence on μ, the sub-problem of finding the minimum in (7) can be solved efficiently, assuming knowledge of the transition function P (see Proposition 1 in (Zimin and Neu, 2013)). However, for a general divergence and real-world problems with unknown transition matrices, the projection in (7) is non-trivial to implement. In this section, we consider direct optimization in the (parametric) policy space without explicit projection. Specifically, we consider μ_π as a function of π, with π parametrized as π_θ. Formula (7) can then be written as

π_{t+1} = argmin_π D_F(μ_π, μ_t) + η⟨g_t, μ_π⟩.  (8)

Instead of solving (8) globally, we approximate it with gradient descent on π. 
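For intuition, with the negative-entropy F the dual update (4) followed by the KL projection reduces to an exponentiated-gradient step, μ_{t+1} ∝ μ_t exp(−ηg_t), which is also the closed-form minimizer of (7) when the constraint set is the whole simplex rather than Δ_Π. A minimal sketch of our own (bandit-style, with g = −r, ignoring the Δ_Π constraint):

```python
import math

def mirror_descent_step(mu, g, eta):
    """One step of (6)/(7) with F = negative entropy on the simplex:
    mu_{t+1} proportional to mu_t * exp(-eta * g)."""
    w = [m * math.exp(-eta * gi) for m, gi in zip(mu, g)]
    z = sum(w)
    return [wi / z for wi in w]

r = [0.3, 0.9, 0.5]           # per-action reward
g = [-ri for ri in r]         # loss gradient g_t = -r, as in the text
mu = [1.0 / 3] * 3            # uniform start
for _ in range(200):
    mu = mirror_descent_step(mu, g, eta=0.1)
print(mu.index(max(mu)))  # 1: mass concentrates on the best action
```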
From the celebrated policy gradient theorem (Sutton et al., 2000), we have the following lemma:

Lemma 1. (Policy Gradient Theorem (Sutton et al., 2000)) For d_π and μ_π defined previously, the following equation holds for any state-action function f : S × A → R:

∑_{s,a} f(s, a) ∇_θ μ_π(s, a) = ∑_{s,a} d_π(s) Q_π(f)(s, a) ∇_θ π(a|s),

where Q_π is defined as an operator such that

Q_π(f)(s, a) = E_{π|s_t=s, a_t=a} ∑_{l=0}^∞ γ^l f(s_{t+l}, a_{t+l}).

Decomposing the loss and divergence terms in (8), we have

∇_θ ⟨g_t, μ_π⟩ = ⟨d_π Q_π(g_t), ∇_θ π(a|s)⟩,  (9)

which is the usual policy gradient, and

∇_θ D_F(μ_π, μ_t) = ⟨∇F(μ_π) − ∇F(μ_t), ∇_θ μ_π⟩ = ⟨d_π Q_π(∇F(μ_π) − ∇F(μ_t)), ∇_θ π(a|s)⟩.  (10)

Similarly, we have the policy gradient for the conditional divergence (3) as

∇_θ D_F^{d_t}(π, π_t) = ⟨d_t (∇F(π) − ∇F(π_t)), ∇_θ π(a|s)⟩,

which does not have a discounted sum, since d_t is fixed and independent of π = π_μ.

3.3 Off-policy Correction

In this section, we discuss the practical method for estimating Q_π(f) under a behavior policy π_t. In distributed reinforcement learning with asynchronous gradient updates, the policy π_t which generated the trajectories may deviate from the policy π_θ currently being optimized. Thus off-policy correction is usually needed for the robustness of the algorithm (e.g. V-trace as in IMPALA (Espeholt et al., 2018)). 
Consider

∑_{s,a} d_π(s) Q_π(f)(s, a) ∇_θ π(a|s) = E_{(s,a)∼π d_π} Q_π(f)(s, a) ∇_θ log π(a|s)
 = E_{(s,a)∼π_t d_{π_t}} (d_π(s)/d_{π_t}(s)) (π(a|s)/π_t(a|s)) Q_π(f)(s, a) ∇_θ log π(a|s)

for f = g_t or f = ∇F(μ_π) − ∇F(μ_t). We would like to have an accurate estimation of Q_π(g_t) (9) and Q_π(∇F(μ_π) − ∇F(μ_t)) (10), and to correct the deviation from d_{π_t} to d_π and from π_t to π.

For the estimation of Q_π(f) under a behavior policy π_t, possible methods include Retrace (Munos et al., 2016), providing an estimator of the state-action value Q_π(f), and V-trace (Espeholt et al., 2018), providing an estimator of the state value E_{a∼π} Q_π(f)(s, a). In this work, we utilize the V-trace (Section 4.1 of Espeholt et al. (2018)) estimation v_{s_i} = v_i along a trajectory starting at (s_i, a_i) = (s, a) under π_t. Details of multi-step Q-value estimation can be found in Appendix A. With the value estimation v_s, Q_π(g_t) is estimated with

Â_{s,a} = r_i + γ v_{i+1} − V_θ(s_i).  (11)

We subtract a baseline V_θ(s_i) to reduce variance in estimation, as E_{π_t,d_t} (π_θ/π_t) V_θ(s) ∇_θ log π_θ = 0. For the estimation of Q_π(∇F(μ_π) − ∇F(μ_t)), we use the n-step truncated importance sampling

D̂_{s,a} = f(s_i, a_i) + ∑_{j=1}^n γ^j (∏_{k=0}^{j−1} c_{i+k}) ρ_{i+j} f(s_{i+j}, a_{i+j}),  (12)

in which we use the notation c_j = min(c̄_D, π_θ(a_j|s_j)/π_t(a_j|s_j)) and ρ_j = min(ρ̄_D, π_θ(a_j|s_j)/π_t(a_j|s_j)). The formula also corresponds to V-trace under the condition V(·) ≡ 0. 
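A direct transcription of the estimator (12) (a sketch of our own; the helper name and toy numbers are not from the paper):

```python
def divergence_estimate(f_vals, ratios, gamma, cbar, rhobar):
    """n-step truncated importance sampling estimate (12):
    D_hat = f_0 + sum_{j=1}^{n} gamma^j (prod_{k=0}^{j-1} c_k) * rho_j * f_j,
    where c_k = min(cbar, ratio_k), rho_j = min(rhobar, ratio_j),
    and ratio_j = pi_theta(a_j|s_j) / pi_t(a_j|s_j)."""
    d_hat, c_prod = f_vals[0], 1.0
    for j in range(1, len(f_vals)):
        c_prod *= min(cbar, ratios[j - 1])
        d_hat += gamma ** j * c_prod * min(rhobar, ratios[j]) * f_vals[j]
    return d_hat

# On-policy sanity check: all ratios equal 1, so with cbar = rhobar = 1 the
# estimate reduces to the plain discounted sum of f along the trajectory.
f_vals = [0.5, -0.2, 0.1, 0.4]
est = divergence_estimate(f_vals, [1.0] * 4, gamma=0.99, cbar=1.0, rhobar=1.0)
plain = sum(0.99 ** j * fj for j, fj in enumerate(f_vals))
print(abs(est - plain) < 1e-12)  # the two sums agree
```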
For RNN models trained on continuous roll-out samples, we set n equal to the maximum length until the end of the roll-out.

For the correction of the state distribution ratio d_π(s)/d_{π_t}(s), previous solutions include the use of emphatic algorithms as in (Sutton et al., 2016), or an estimate of the state density ratio as in (Liu et al., 2018). However, in our experience, the estimation of the density ratio introduces additional error, which may lead to worse performance in practice. Therefore, in this paper, we propose a different solution: we restrict our attention to correcting π_t to π via importance sampling and omit the difference d_π/d_{π_t} in the algorithm. This introduces a bias in the gradient estimation, which we handle with a new method proposed in this paper. Specifically, we show that although the omission of the state ratio introduces a bias in the gradient, the bias can be bounded by the regularization term of the conditional KL divergence (see Appendix B). Therefore, by explicitly adding a KL divergence regularization, we can effectively control the degree of off-policy bias caused by d_π/d_{π_t}, in that a small regularization value implies a small bias. This approach naturally combines mirror descent with KL divergence regularization, leading to a more stable algorithm that is robust to off-policy data, as we will demonstrate by empirical experiments.

The final loss consists of the policy loss L_π(θ) and the value loss L_v(θ). To be specific, the gradient of the policy loss is defined as

∇_θ L_π(θ) = E_{π_t,d_t} (π/π_t) (D̂_{s,a} − η Â_{s,a}) ∇_θ log π.  (13)

We can also use proximal methods like PPO (Schulman et al., 2017b) in conjunction with divergence augmentation. A practical implementation is elaborated later in Formula (19). 
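In an implementation, the gradient (13) is typically realized as the gradient of the surrogate E_{π_t,d_t}[(π_θ/π_t)(D̂_{s,a} − ηÂ_{s,a})] with D̂ and Â treated as constants, since ∇_θ(π_θ/π_t) = (π_θ/π_t)∇_θ log π_θ. The sketch below (a hypothetical single-state softmax policy of our own invention, not the paper's network) checks this identity by finite differences:

```python
import math

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [ei / z for ei in e]

# One state, three actions; expectations over a ~ pi_t are taken exactly.
pi_t = [0.5, 0.3, 0.2]            # behavior policy
theta = [0.2, -0.1, 0.4]          # softmax logits of pi_theta
D_hat = [0.05, 0.30, 0.10]        # per-action divergence estimates, as in (12)
A_hat = [1.00, -0.50, 0.20]       # per-action advantage estimates, as in (11)
eta = 0.5

def surrogate(th):
    # E_{pi_t}[(pi_theta/pi_t)(D_hat - eta A_hat)] = sum_a pi_theta(a)(D_hat - eta A_hat)
    p = softmax(th)
    return sum(p[a] * (D_hat[a] - eta * A_hat[a]) for a in range(3))

def grad_eq13(th):
    # (13): sum_a pi_theta(a)(D_hat - eta A_hat) * grad_theta log pi_theta(a),
    # using grad_{theta_i} log softmax(th)[a] = 1[i == a] - softmax(th)[i]
    p = softmax(th)
    g = [0.0] * 3
    for a in range(3):
        w = p[a] * (D_hat[a] - eta * A_hat[a])
        for i in range(3):
            g[i] += w * ((1.0 if i == a else 0.0) - p[i])
    return g

eps = 1e-6
num = []
for i in range(3):
    up = [t + eps * (i == j) for j, t in enumerate(theta)]
    dn = [t - eps * (i == j) for j, t in enumerate(theta)]
    num.append((surrogate(up) - surrogate(dn)) / (2 * eps))
ana = grad_eq13(theta)
print(max(abs(n - a) for n, a in zip(num, ana)) < 1e-6)  # gradients match
```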
In addition to the policy loss, we also update V_θ with the value gradient defined as

∇L_v(θ) = E_{π_t,d_t} (π/π_t) (V_θ(s) − v_s) ∇_θ V_θ(s),  (14)

where v_s = v_{s_i} is the multi-step value estimation with V-trace. The parameter θ is then updated with a mixture of the policy loss and the value loss

θ ← θ − α_t (∇_θ L_π(θ) + b ∇_θ L_v(θ)),  (15)

in which α_t is the current learning rate, and b is the loss scaling coefficient. The algorithm is summarized in Algorithm 1.

Algorithm 1 Divergence-Augmented Policy Optimization (DAPO)
  Input: D_F(μ', μ), total iterations T, batch size M, learning rate α_t.
  Initialize: randomly initialize θ_0.
  for t = 0 to T do
    (in parallel) Use π_t = π_{θ_t} to generate trajectories.
    for m = 1 to M do
      Sample (s_i, a_i) ∈ S × A w.p. d_t π_t.
      Estimate the state value v_{s_i} (e.g. by V-trace).
      Calculate the Q-value estimation Â_{s,a} (11) and the divergence estimation D̂_{s,a} (12).
      Update θ with respect to the policy loss (13, optionally 19) and the value loss (14):
        θ ← θ − α_t (∇_θ L_π(θ) + b ∇_θ L_v(θ)).
    end for
    Set θ_{t+1} = θ.
  end for

4 Related Works

The policy performance in Equation (2) and the well-known policy difference lemma (Kakade and Langford, 2002) serve a fundamental role in policy-based reinforcement learning (e.g. TRPO and PPO (Schulman et al., 2015, 2017b)). The gradient with respect to the policy performance and policy difference provides a natural direction for policy optimization. To restrict the changes in each policy improvement step, as well as to encourage exploration at the early stage, constraint-based policy optimization methods limit the changes in the policy by constraining the divergence between the behavior policy and the current policy. The use of entropy maximization in reinforcement learning can be dated back to the work of Williams and Peng (1991), and methods with relative entropy regularization include Peters et al. (2010); Schulman et al. (2015). The relationship between these methods and the mirror descent method has been discussed in Neu et al. (2017). With the notation of this work, consider the natural choice of F as the negative Shannon entropy defined as F(x) = ∑_i x_i log(x_i); then D_F(·,·) becomes the KL divergence D_KL(·,·). By the equivalence of the sub-gradient projection (6) and mirror descent (7), mirror descent policy optimization with the KL divergence can be written as

μ_{t+1} = argmin_{μ∈Δ_Π} D_KL(μ, μ̃_{t+1}) = argmin_{μ∈Δ_Π} D_KL(μ, μ_t) + η⟨g_t, μ⟩.  (16)

Under slightly different settings, this learning objective is the regularized version of the constrained optimization problem considered in Relative Entropy Policy Search (REPS) (Peters et al., 2010); and for ℓ_t(μ) depending on t, Equation (16) can also recover the O-REPS method considered in Zimin and Neu (2013). On the other hand, as the KL divergence (and the Bregman divergence in general) is asymmetric, we can also replace D_F(x, y) in either formulation (6, 7) with the reverse KL D_KL(y, x), which results in different iterative algorithms (as the reverse KL is no longer a Bregman divergence, the equivalence of Formulas (6) and (7) no longer holds). 
Replacing D_F(μ, μ̃_{t+1}) with D_KL(μ̃_{t+1}, μ) in the sub-gradient projection (6), we obtain the "mirror map" method with reverse KL as

μ_{t+1} = argmin_{μ∈Δ_Π} D_KL(μ̃_{t+1}, μ),  (17)

which is essentially the MPO algorithm (Abdolmaleki et al., 2018) under a probabilistic inference perspective, and the MARWIL algorithm (Wang et al., 2018) when learning from off-policy data. Similarly, replacing D_F(μ, μ_t) with D_KL(μ_t, μ) in mirror descent (7), we obtain the "mirror descent" method with reverse KL as

μ_{t+1} = argmin_{μ∈Δ_Π} D_KL(μ_t, μ) + η⟨g_t, μ⟩,  (18)

which can approximately recover the TRPO optimization objective (Schulman et al., 2015) (if the relative entropy between the two state-action distributions D_KL(μ_t, μ) in (18) is replaced by the conditional divergence D_KL^{d_t}(π_t, π); see also Section 5.1 of Neu et al. (2017)).

Besides, we note that there are other choices of constraint for policy optimization as well. For example, in (Lee et al., 2018; Chow et al., 2018; Lee et al., 2019), a Tsallis entropy is used to promote sparsity in the policy distribution. And in (Belousov and Peters, 2017), the authors generalize the KL, the Hellinger distance, and the reverse KL to the class of f-divergences. In preliminary results, we found that divergences based on 0-potentials (Audibert et al., 2011; Bubeck and Cesa-Bianchi, 2012) are also promising for policy optimization. We leave this for future research.

For multi-step KL-divergence-regularized policy optimization, we note that the formulation also corresponds to the KL-divergence-augmented return considered previously in several works (Fox et al. (2015); Section 3 of Schulman et al. (2017a)), although in Schulman et al. (2017a) the authors use a fixed behavior policy instead of π_t as in ours. 
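The asymmetry can be made concrete on the probability simplex (our own toy example; Δ_Π replaced by the full simplex): the forward-KL step (16) has the closed form μ_t exp(−ηg) normalized, whereas the first-order condition of the reverse-KL step (18) gives μ_i = μ_{t,i}/(λ + ηg_i) with λ a normalizing multiplier, and the two generally differ:

```python
import math

def forward_kl_step(mu_t, g, eta):
    # (16): argmin_mu KL(mu, mu_t) + eta<g, mu>  =>  mu proportional to mu_t exp(-eta g)
    w = [m * math.exp(-eta * gi) for m, gi in zip(mu_t, g)]
    z = sum(w)
    return [wi / z for wi in w]

def reverse_kl_step(mu_t, g, eta):
    # (18): argmin_mu KL(mu_t, mu) + eta<g, mu>. First-order condition:
    # mu_i = mu_t_i / (lam + eta g_i), with lam > max_i(-eta g_i) chosen by
    # bisection so that mu sums to one (search window is ad hoc for this toy).
    def total(lam):
        return sum(m / (lam + eta * gi) for m, gi in zip(mu_t, g))
    lo = max(-eta * gi for gi in g) + 1e-12
    hi = lo + 1e6
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if total(mid) > 1.0 else (lo, mid)
    lam = (lo + hi) / 2.0
    return [m / (lam + eta * gi) for m, gi in zip(mu_t, g)]

mu_t = [0.5, 0.3, 0.2]
g = [-1.0, 0.0, 0.5]
fwd = forward_kl_step(mu_t, g, eta=1.0)
rev = reverse_kl_step(mu_t, g, eta=1.0)
print(any(abs(f - r) > 1e-3 for f, r in zip(fwd, rev)))  # the updates differ
```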
More often, the Shannon-entropy-augmented return can be dated back to earlier works (Kappen, 2005; Todorov, 2007; Ziebart et al., 2008; Nachum et al., 2017), and is a central topic in "soft" reinforcement learning (Haarnoja et al., 2017, 2018).

Figure 1: Relative score improvement of PPO+DA compared with PPO on 58 Atari environments. The relative performance is calculated as (proposed − baseline) / (max(human, baseline) − random) (Wang et al., 2016b). The Atari games are categorized according to Figure 4 of (Oh et al., 2018).

The mirror descent method was originally introduced by the seminal work of Nemirovsky and Yudin (1983) as a convex optimization method. Also, the online stochastic mirror descent method has alternative views, e.g. Follow the Regularized Leader (McMahan, 2011) and the Proximal Point Algorithm (Rockafellar, 1976). For more discussions on mirror descent and online learning, we refer interested readers to the work of Cesa-Bianchi and Lugosi (2006) and Bubeck and Cesa-Bianchi (2012).

5 Experiments

In the experiments, we test the exploratory effect of divergence augmentation compared with entropy augmentation, and the empirical difference between multi-step and 1-step divergence. For the experiments, we mainly consider the DAPO algorithm (Algorithm 1) associated with the conditional KL divergence (see RC and DC in (Neu et al., 2017)). 
For F(μ) = ∑_{s,a} μ(s, a) log (μ(s, a) / ∑_b μ(s, b)), we have the gradient in (10) as

∇F(μ_π) − ∇F(μ_t) = log (π / π_t).

The multi-step divergence augmentation term as in (12) is then calculated as

D̂_{s,a}^{KL} = log (π(a_i|s_i) / π_t(a_i|s_i)) + ∑_{j=1}^n γ^j (∏_{k=0}^{j−1} c_{i+k}) ρ_{i+j} log (π(a_{i+j}|s_{i+j}) / π_t(a_{i+j}|s_{i+j})).

As a baseline, we also implement the PPO algorithm with a V-trace (Espeholt et al., 2018) estimation of the advantage function A^π for the target policy³. Specifically, we consider the policy loss:

L_π^PPO(θ) = E_{π_t,d_t} min((π_θ/π_t) A_{s,a}, clip(π_θ/π_t, 1 − ε, 1 + ε) A_{s,a}),  (19)

where we choose ε = 0.2 and the advantage is estimated by R_{s,a}. We also tested the DAPO algorithm with PPO, with the advantage estimation A_{s,a} in (19) replaced by Â_{s,a} − (1/η) D̂_{s,a} as defined in (11) and (12). We will refer to this algorithm as PPO+DA in the following sections.

5.1 Algorithm Settings

The algorithm is implemented with TensorFlow (Abadi et al., 2016). For efficient training with deep neural networks, we use the Adam (Kingma and Ba, 2014) method for optimization. The learning rate is linearly annealed from 1e-3 to 0. The parameters are updated according to a mixture of the policy loss and the value loss, with the loss scaling coefficient b = 0.5 (cf. Formula (15)). 
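A per-sample transcription of the clipped objective (19) and its PPO+DA variant (our own sketch; the function names are not from the paper):

```python
def ppo_objective(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate (19):
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)

def ppo_da_objective(ratio, a_hat, d_hat, eta, eps=0.2):
    # PPO+DA: the advantage is replaced by A_hat - (1/eta) * D_hat
    return ppo_objective(ratio, a_hat - d_hat / eta, eps)

print(ppo_objective(1.1, 2.0))   # 2.2: clip inactive for ratio in [0.8, 1.2]
print(ppo_objective(1.5, 2.0))   # 2.4: gain from a large ratio is capped at 1.2 * A
print(ppo_objective(0.5, -1.0))  # -0.8: the pessimistic min keeps the clipped term
```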
³In the original PPO (Schulman et al., 2017b), Â is used as the advantage estimation of the behavior policy, A^{π_t}.

[Figure 1 bar chart: per-game relative improvements, ranging from roughly −19% up to nearly 900%, with games grouped by exploration difficulty (easy exploration; hard exploration with dense reward; hard exploration with sparse reward; unknown).]

Figure 2: Performance comparison on selected environments of Atari games. The performance of PPO, PPO+DA, PPO+DA (1-step), and PPO+Entropy is plotted in different colors. The score for each game is plotted on the y-axis against running time on the x-axis, as the algorithm is parallelized asynchronously in a distributed environment. For each line in the plots, we run the experiment 5 times with the same parameters and environment settings. The median scores are plotted in solid lines, while the regions between the 25% and 75% quantiles are shaded in the respective colors.

In calculating the multi-step λ-returns R_{s,a} and the divergence D_{s,a}, we use fixed λ = 0.9 and γ = 0.99. The batch size is set to 1024, with the roll-out length set to 32, resulting in 1024/32 = 32 roll-outs in a batch. The policy π_t and value V_t are updated every 100 iterations (M = 100 in Algorithm 1). 
With our implementation, the training speed is about 25k samples per second, and the data generating speed is about 220 samples per second per actor, resulting in about 3500 samples per second for a total of 16 actors. Note that the PPO results may not be directly comparable with other works (Schulman et al., 2017b; Espeholt et al., 2018; Xu et al., 2018), mainly due to the different number of actors used. Unless otherwise noted, each experiment is allowed to run for 16000 seconds (about 4.5 hours), corresponding to a total of 60M samples generated and 400M samples (with replacement) trained. Details of the experimental settings can be found in Appendix A.

5.2 Empirical Results

We test the algorithm on 58 Atari environments and calculate its relative performance with respect to PPO (Schulman et al., 2017b). The empirical performance is plotted in Figure 1. We run PPO and PPO+DA with the same environment settings and computational resources. The relative performance is calculated as (proposed − baseline) / (max(human, baseline) − random) (Wang et al., 2016b). We also categorize the game environments into easy-exploration games and hard-exploration games (Oh et al., 2018). We see that with a KL-divergence-augmented return, the algorithm PPO+DA performs better than the baseline method, especially for games that may have local minima and require deeper exploration. We plot the learning curves of PPO+DA (in blue) compared with PPO (in black) and other baseline methods on 6 typical environments in Figure 2. Detailed learning curves for PPO and PPO+DA on the complete set of 58 games can be found in Figure 3 in the Appendix.

5.2.1 Divergence augmentation vs Entropy augmentation

We test the effect of divergence augmentation in contrast to entropy augmentation (plotted in orange in Figure 2). Entropy augmentation can prevent premature convergence and encourage exploration, as well as stabilize the policy during optimization. 
However, the additional entropy may hinder convergence to the optimal action, as it alters the original learning objective. We set f(s, a) to log π(a|s) in Formula (12), and run the algorithm with 1/η = 0.5, 0.1, 0.01, 0.001, among which we found that 1/η = 0.1 performs best. From the empirical results, we see that divergence-augmented PPO works better, while the entropy-augmented version may be too conservative on policy changes, resulting in inferior performance on these games.

[Figure 2 panels: DoubleDunk, Enduro, Gravitar, MontezumaRevenge, Qbert, and Seaquest; score vs. training time (hours) for PPO, PPO+DA, PPO+DA (1-step), and PPO+Entropy.]

5.2.2 Multi-step divergence vs 1-step divergence

In Figure 2, we also test the PPO+DA algorithm against its 1-step divergence-augmented counterpart (plotted in green). We rerun the experiments with the parameter c̄_D (Formula (12)) set to 0, which means we only aggregate the divergence on the current state and action f(s_i, a_i), without summing up the future discounted divergences f(s_{i+j}, a_{i+j}). This method also relates to the conditional divergence defined in Formula (3), and shares more similarities with previous work on regularized and constrained policy optimization methods (Schulman et al., 2015; Achiam et al., 2017). We see that with multi-step divergence augmentation, the algorithm achieves higher scores, especially on games requiring deeper exploration like Enduro and Qbert.
We hypothesize that the accumulated divergence on future states can encourage the policy to explore more efficiently.

6 Conclusion

In this paper, we proposed a divergence-augmented policy optimization method to improve the stability of policy gradient methods when it is necessary to reuse off-policy data. We showed that the proposed divergence augmentation technique can be viewed as imposing a Bregman divergence constraint on the state-action space, which is related to online mirror descent methods. Experiments on Atari games showed that in the data-scarce scenario, the proposed method works better than other state-of-the-art algorithms such as PPO. Our results showed that the technique of divergence augmentation is effective when data generated by previous policies are reused in policy optimization.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P. A., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zhang, X. (2016). TensorFlow: A system for large-scale machine learning. arXiv preprint arXiv:1605.08695.

Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. (2018). Maximum a posteriori policy optimisation. In International Conference on Learning Representations.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. arXiv preprint arXiv:1705.10528.

Asmussen, S. (2003). Markov chains. Applied Probability and Queues, pages 3–38.

Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), pages 217–226.

Audibert, J.-Y., Bubeck, S., and Lugosi, G. (2011). Minimax policies for combinatorial prediction games. In Kakade, S. M.
and von Luxburg, U., editors, Proceedings of the 24th Annual Conference on Learning Theory (COLT), volume 19 of Proceedings of Machine Learning Research, pages 107–132, Budapest, Hungary. PMLR.

Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.

Belousov, B. and Peters, J. (2017). f-Divergence constrained policy improvement. arXiv e-prints.

Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym.

Bubeck, S. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122.

Cesa-Bianchi, N., Gentile, C., Lugosi, G., and Neu, G. (2017). Boltzmann exploration done right. In Advances in Neural Information Processing Systems, pages 6284–6293.

Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.

Chow, Y., Nachum, O., and Ghavamzadeh, M. (2018). Path consistency learning in Tsallis entropy regularized MDPs. In International Conference on Machine Learning, pages 978–987.

Csiszar, I. and Körner, J. (2011). Information Theory: Coding Theorems for Discrete Memoryless Systems.
Cambridge University Press.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., and Kavukcuoglu, K. (2018). IMPALA: Scalable distributed deep RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.

Fox, R., Pakman, A., and Tishby, N. (2015). Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.

Hasselt, H. v., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2094–2100. AAAI Press.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., Van Hasselt, H., and Silver, D. (2018). Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933.

Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274.

Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538.

Kappen, H. J. (2005). Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11011.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Lee, K., Choi, S., and Oh, S. (2018).
Maximum causal Tsallis entropy imitation learning. In Advances in Neural Information Processing Systems, pages 4403–4413.

Lee, K., Kim, S., Lim, S., Choi, S., and Oh, S. (2019). Tsallis reinforcement learning: A unified framework for maximum entropy reinforcement learning.

Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pages 5361–5371.

McMahan, B. (2011). Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In Gordon, G., Dunson, D., and Dudík, M., editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 525–533, Fort Lauderdale, FL, USA. PMLR.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. (2016). Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2775–2785.

Nemirovsky, A. S. and Yudin, D. B. (1983).
Problem complexity and method efficiency in optimization. Wiley.

Neu, G., Jonsson, A., and Gómez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.

Oh, J., Guo, Y., Singh, S., and Lee, H. (2018). Self-imitation learning. In International Conference on Machine Learning, pages 3875–3884.

Peters, J., Mülling, K., and Altun, Y. (2010). Relative entropy policy search. In AAAI, pages 1607–1612.

Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898.

Schulman, J., Chen, X., and Abbeel, P. (2017a). Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017b). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.

Sutton, R. S. and Barto, A. G. (2017). Introduction to Reinforcement Learning, 2nd edition, in progress. MIT Press.

Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1):2603–2631.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000).
Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063.

Todorov, E. (2007). Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pages 1369–1376.

Wang, Q., Xiong, J., Han, L., Sun, P., Liu, H., and Zhang, T. (2018). Exponentially weighted imitation learning for batched historical data. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 6291–6300. Curran Associates, Inc.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2016a). Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224.

Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. (2016b). Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003.

Williams, R. J. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268.

Xu, Z., van Hasselt, H. P., and Silver, D. (2018). Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pages 2396–2407.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA.

Zimin, A. and Neu, G. (2013). Online learning in episodic Markovian decision processes by relative entropy policy search. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 26.
Curran Associates, Inc.