{"title": "Dual Policy Iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 7059, "page_last": 7069, "abstract": "Recently, a novel class of Approximate Policy Iteration (API) algorithms have demonstrated impressive practical performance (e.g., ExIt from [1], AlphaGo-Zero from [2]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search), that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.", "full_text": "Dual Policy Iteration\n\nWen Sun1, Geoffrey J. Gordon1, Byron Boots2, and J. 
Andrew Bagnell3\n\n1School of Computer Science, Carnegie Mellon University, USA\n2College of Computing, Georgia Institute of Technology, USA\n3Aurora Innovation, USA\n\n{wensun, ggordon, dbagnell}@cs.cmu.edu, bboots@cc.gatech.edu\n\nAbstract\n\nA novel class of Approximate Policy Iteration (API) algorithms has recently demonstrated impressive practical performance (e.g., ExIt [1], AlphaGo-Zero [2]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., Tree Search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved via guidance from the reactive policy. In this work we study this class of Dual Policy Iteration (DPI) strategies in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.\n\n1 Introduction\n\nApproximate Policy Iteration (API) methods [3, 4, 5, 6, 7], including conservative API (CPI) [5], API driven by learned critics [8], and gradient-based API with stochastic policies [9, 10, 11, 12], have played a central role in Reinforcement Learning (RL) for decades and motivated many modern practical RL algorithms. 
Several existing API methods [4, 5] can provide both local optimality guarantees and global guarantees under strong assumptions regarding the way samples are generated (e.g., access to a reset distribution that is similar to the optimal policy's state distribution). However, most modern practical API algorithms rely on myopic random exploration (e.g., REINFORCE-type [13] policy gradient or ε-greedy). Sample inefficiency due to random exploration can cause even sophisticated RL methods to perform worse than simple black-box optimization with random search in parameter space [14].\nRecently, a new class of API algorithms, which we call Dual Policy Iteration (DPI), has begun to emerge. These algorithms follow a richer strategy for improving the policy, with two policies under consideration at any time during training: a reactive policy, usually learned by some form of function approximation, used for generating samples and deployed at test time, and an intermediate policy that can only be constructed or accessed during training, used as an expert policy to guide the improvement of the reactive policy. For example, ExIt [1] maintains and updates a UCT-based policy [15] as an intermediate expert. ExIt then updates the reactive policy by directly imitating the tree-based policy, which we expect to be better than the reactive policy as it involves a multi-step lookahead search. AlphaGo-Zero [2] employs a similar strategy to achieve super-human performance at the ancient game of Go. The key difference that distinguishes ExIt and AlphaGo-Zero from previous APIs is that they leverage models to perform systematic forward search: the policy resulting from forward search acts as an expert and directly informs the improvement direction for the reactive policy. 
Hence the reactive policy improves by imitation instead of trial-and-error reinforcement learning. This strategy often provides better sample efficiency in practice compared to algorithms that simply rely on locally random search (e.g., AlphaGo-Zero abandons REINFORCE from AlphaGo [16]).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nIn this work we provide a general framework for synthesizing and analyzing DPI by considering a particular alternating optimization strategy, with different optimization approaches each forming a new family of approximate policy iteration methods. We additionally consider the extension to the RL setting with unknown dynamics. For example, we construct a simple instance of our framework, where the intermediate expert is computed from Model-Based Optimal Control (MBOC) locally around the reactive policy, and the reactive policy in turn is updated incrementally under the guidance of MBOC. The resulting algorithm iteratively learns a local dynamics model, applies MBOC to compute a locally optimal policy, and then updates the reactive policy by imitation, achieving larger policy improvement per iteration than classic APIs. The instantiation shares a similar spirit with previous work in the robotics and control literature, including [17, 18] and Guided Policy Search (GPS) [19] and its variants (e.g., [20, 21, 22]), i.e., using local MBOC to speed up learning global policies.\nTo evaluate our approach, we demonstrate our algorithm on discrete MDPs and continuous control tasks. 
We show that by integrating local model-based search with learned local dynamics into policy improvement via an imitation learning-style update, our algorithm is substantially more sample-efficient than classic API algorithms such as CPI [5], as well as more recent actor-critic baselines [23], albeit at the cost of slower computation per iteration due to the model-based search. We also apply the framework to a robust policy optimization setting [24, 25] where the goal is to learn a single policy that can generalize across environments. In summary, the major practical difference between DPI and many modern practical RL approaches is that instead of relying on random exploration, the DPI framework integrates local model learning, local model-based search for advanced exploration, and an imitation learning-style policy improvement, to improve the policy in a more systematic way.\nWe also provide a general convergence analysis to support our empirical findings. Although our analysis is similar to CPI's, it has a key difference: as long as MBOC succeeds, we can provide a larger policy improvement than CPI at each iteration. Our analysis is general enough to provide theoretical intuition for previous successful practical DPI algorithms such as Expert Iteration (ExIt) [1]. We also analyze how predictive error from a learned local model can mildly affect policy improvement and show that locally accurate dynamics (a model that accurately predicts next states under the current policy's state-action distribution) are enough for improving the current policy. We believe our analysis of local model predictive error versus local policy improvement can shed light on further development of model-based RL approaches with learned local models. 
In summary, DPI operates in the middle of two extremes: (1) API-type methods that update policies locally (e.g., first-order methods like policy gradient and CPI), and (2) global model-based optimization where one attempts to learn a global model and perform model-based search. First-order methods have small policy improvement per iteration, while learning a global model displays greater model bias and requires a dataset that covers the entire state space. DPI instead learns a local model and allows us to integrate models to leverage the power of model-based optimization to locally improve the reactive policy.\n\n2 Preliminaries\n\nA discounted infinite-horizon Markov Decision Process (MDP) is defined as (S, A, P, c, ρ_0, γ). Here, S is a set of states, A is a set of actions, and P is the transition dynamics: P(s'|s, a) is the probability of transitioning to state s' from state s by taking action a. We use P_{s,a} in short for P(·|s, a). We denote c(s, a) as the cost of taking action a while in state s. Finally, ρ_0 is the initial distribution of states, and γ ∈ (0, 1) is the discount factor. Throughout this paper, we assume that we know the form of the cost function c(s, a), but the transition dynamics P are unknown. We define a stochastic policy π such that for any state s ∈ S, π(·|s) outputs a distribution over the action space. The distribution of states at time step t, induced by running the policy π until and including t, is defined ∀s_t: d^t_π(s_t) = ∑_{{s_i, a_i}_{i ≤ t−1}} ρ_0(s_0) ∏_{i=0}^{t−1} π(a_i|s_i) P(s_{i+1}|s_i, a_i), where by definition d^0_π(s) = ρ_0(s) for any π. The state visitation distribution can be computed as d_π(s) = (1 − γ) ∑_{t=0}^{∞} γ^t d^t_π(s). Denote (d_π π) as the joint state-action distribution such that d_π π(s, a) = d_π(s) π(a|s). 
We define the value function V^π(s), state-action value function Q^π(s, a), and the objective function J(π) as:\n\nV^π(s) = E[∑_{t=0}^{∞} γ^t c(s_t, a_t) | s_0 = s], Q^π(s, a) = c(s, a) + γ E_{s'∼P_{s,a}}[V^π(s')], J(π) = E_{s∼ρ_0}[V^π(s)].\n\nWith V^π and Q^π, the advantage function A^π(s, a) is defined as A^π(s, a) = Q^π(s, a) − V^π(s). As we work in the cost setting, in the rest of the paper we refer to A^π as the disadvantage function. The goal is to learn a single stationary policy π* that minimizes J(π): π* = arg min_{π∈Π} J(π).\nFor two distributions P_1 and P_2, D_TV(P_1, P_2) denotes total variation distance, which is related to the L1 norm as D_TV(P_1, P_2) = ‖P_1 − P_2‖_1 / 2 (if we have a finite probability space), and D_KL(P_1, P_2) = ∫_x P_1(x) log(P_1(x)/P_2(x)) dx denotes the KL divergence.\nWe introduce the Performance Difference Lemma (PDL) [5], which will be used extensively in this work:\n\nLemma 2.1 For any two policies π and π', we have: J(π) − J(π') = (1/(1 − γ)) E_{(s,a)∼d_π π}[A^{π'}(s, a)].\n\n3 Dual Policy Iteration\n\nWe propose an alternating optimization framework inspired by the PDL (Lemma 2.1). Consider the min-max optimization framework: min_{π∈Π} max_{η∈Π} E_{s∼d_π}[E_{a∼π(·|s)}[A^η(s, a)]]. It is not hard to see that the unique Nash equilibrium for the above equation is (π, η) = (π*, π*). The above min-max suggests a general strategy, which we call Dual Policy Iteration (DPI): alternately fix one policy and update the second policy. Mapping to previous practical DPI algorithms [1, 2], π stands for the fast reactive policy and η corresponds to the tree search policy. For notation purposes, we use π_n and η_n to represent the two policies in the nth iteration. 
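As a sanity check on Lemma 2.1, the identity can be verified numerically on a small tabular MDP. The following is a minimal sketch and not part of the original paper: the helper names (random_mdp, random_policy, evaluate, visitation, pdl_gap) and the random MDP are hypothetical, and the code assumes finite state and action spaces so that V^π and d_π can be obtained by solving linear systems.

```python
import numpy as np

def random_mdp(n_states, n_actions, rng):
    # Hypothetical helper: random transitions P[s, a, s_next] and costs c[s, a].
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)
    c = rng.random((n_states, n_actions))
    return P, c

def random_policy(n_states, n_actions, rng):
    # Hypothetical helper: a random stochastic policy pi[s, a].
    pi = rng.random((n_states, n_actions))
    return pi / pi.sum(axis=1, keepdims=True)

def evaluate(P, c, pi, gamma):
    # Solve V = c_pi + gamma * P_pi V, then Q = c + gamma * E[V(s_next)].
    n_states = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)
    c_pi = (pi * c).sum(axis=1)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    Q = c + gamma * P @ V
    return V, Q

def visitation(P, pi, rho0, gamma):
    # d_pi = (1 - gamma) * sum_t gamma^t d^t_pi, written in closed form.
    n_states = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)
    return (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, rho0)

def pdl_gap(P, c, pi, pi_prime, rho0, gamma):
    # Returns both sides of Lemma 2.1: J(pi) - J(pi_prime) and the
    # scaled expected disadvantage of pi_prime under d_pi and pi.
    V, _ = evaluate(P, c, pi, gamma)
    V_p, Q_p = evaluate(P, c, pi_prime, gamma)
    A_p = Q_p - V_p[:, None]
    d = visitation(P, pi, rho0, gamma)
    lhs = rho0 @ V - rho0 @ V_p
    rhs = (d[:, None] * pi * A_p).sum() / (1 - gamma)
    return lhs, rhs
```

Calling pdl_gap on any pair of policies should return two values that agree up to numerical precision, which is the sense in which the inner expectation of the min-max objective measures a performance difference.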
Below we introduce one instance of DPI for settings with unknown models (hence no tree search): we first describe how to compute η_n from a given reactive policy π_n (Sec. 3.1), and then describe how to update π_n to π_{n+1} via imitating η_n (Sec. 3.2).\n\n3.1 Updating η with MBOC using Learned Local Models\n\nGiven π_n, the objective function for η becomes: max_η E_{s∼d_{π_n}}[E_{a∼π_n(·|s)}[A^η(s, a)]]. From the PDL we can see that updating η is equivalent to finding the optimal policy π*: arg max_η (J(π_n) − J(η)) ≡ arg min_η J(η), regardless of what π_n is. As directly minimizing J(η) is as hard as the original problem, we update η locally by constraining it to a trust region around π_n:\n\narg min_η J(η), s.t., E_{s∼d_{π_n}}[D_TV(η(·|s), π_n(·|s))] ≤ α. (1)\n\nTo solve the constrained optimization problem in Eq. 1, we propose to learn P_{s,a} and use it with any off-the-shelf model-based optimal control algorithm. Moreover, thanks to the trust region, we can simply learn a local dynamics model, under the state-action distribution d_{π_n} π_n. We denote the optimal solution to the above constrained optimization (Eq. 1) under the real model P_{s,a} as η*_n. Note that, due to the definition of optimality, η*_n must perform at least as well as π_n: J(π_n) − J(η*_n) ≥ Δ_n(α), where Δ_n(α) ≥ 0 is the performance gain from η*_n over π_n. 
When the trust region expands, i.e., α increases, Δ_n(α) approaches the performance difference between the optimal policy π* and π_n.\nTo perform MBOC, we learn a locally accurate model: a model P̂ that is close to P under the state-action distribution induced by π_n. We seek a model P̂ such that the quantity E_{(s,a)∼d_{π_n} π_n} D_TV(P̂_{s,a}, P_{s,a}) is small. Optimizing D_TV directly is hard, but note that, by Pinsker's inequality, we have D_KL(P_{s,a}, P̂_{s,a}) ≥ D_TV(P̂_{s,a}, P_{s,a})^2, which indicates that we can optimize a surrogate loss defined by a KL-divergence:\n\narg min_{P̂∈𝒫} E_{s∼d_{π_n}, a∼π_n(s)} D_KL(P_{s,a}, P̂_{s,a}) = arg min_{P̂∈𝒫} E_{s∼d_{π_n}, a∼π_n(s), s'∼P_{s,a}}[−log P̂_{s,a}(s')], (2)\n\nwhere we denote 𝒫 as the model class. Hence we reduce the local model fitting problem to a classic maximum likelihood estimation (MLE) problem, where the training data {s, a, s'} can be easily collected by executing π_n on the real system (i.e., P_{s,a}). As we will show later, to ensure policy improvement, we just need a learned model to perform well under d_{π_n} π_n (i.e., no training and testing distribution mismatch as one will have for global model learning). For later analysis purposes, we denote P̂ as the MLE in Eq. 2 and assume P̂ is δ-optimal under d_{π_n} π_n:\n\nE_{(s,a)∼d_{π_n} π_n} D_TV(P̂_{s,a}, P_{s,a}) ≤ δ, (3)\n\nwhere δ ∈ R+ is controlled by the complexity of the model class 𝒫 and by the amount of training data we sample using π_n, which can be analyzed by standard supervised learning theory. After achieving a locally accurate model P̂, we solve Eq. 1 using any existing stochastic MBOC solver. 
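For intuition about Eq. 2 and Eq. 3 in the simplest possible case, the sketch below fits a tabular model by maximum likelihood (which for a categorical model is just the empirical next-state frequency) and measures its total variation error under a given state-action weighting. It is an illustrative assumption, not the paper's implementation: the function names fit_tabular_mle and local_tv_error are hypothetical, the uniform fallback for unvisited pairs is an arbitrary choice, and the true model P is only available in a simulated setting.

```python
import numpy as np

def fit_tabular_mle(transitions, n_states, n_actions):
    # transitions: iterable of (s, a, s_next) integer triples gathered by running pi_n.
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform guess (an arbitrary choice here).
    uniform = np.full_like(counts, 1.0 / n_states)
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), uniform)

def local_tv_error(P_hat, P, weights):
    # weights[s, a] plays the role of d_pi_n(s) * pi_n(a|s) and should sum to 1.
    tv = 0.5 * np.abs(P_hat - P).sum(axis=2)
    return float((weights * tv).sum())
```

Because the error is weighted by the state-action distribution of π_n, pairs that π_n rarely visits contribute little to it, which is why a model that is poor far from π_n can still satisfy Eq. 3 with a small δ.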
Assume an MBOC solver returns an optimal policy η_n under the estimated model P̂ subject to the trust region:\n\nη_n = arg min_π J(π), s.t., s_{t+1} ∼ P̂_{s_t,a_t}, E_{s∼d_{π_n}} D_TV(π, π_n) ≤ α. (4)\n\nAt this point, a natural question is: if η_n is solved by an MBOC solver under P̂, by how much can η_n outperform π_n when executed under the real dynamics P? Recall that the performance gap between the real optimal solution η*_n (optimal under P) and π_n is denoted as Δ_n(α). The following theorem quantifies the performance gap between η_n and π_n using the learned local model's predictive error δ:\n\nTheorem 3.1 Assume P̂_{s,a} satisfies Eq. 3, and η_n is the output of an MBOC solver for the optimization problem defined in Eq. 4. Then we have:\n\nJ(η_n) ≤ J(π_n) − Δ_n(α) + O(γδ/(1 − γ) + γα/(1 − γ)^2).\n\nThe proof of the above theorem can be found in Appendix A.2. Theorem 3.1 indicates that when the model is locally accurate, i.e., δ is small (e.g., 𝒫 is rich and we have enough data from d_{π_n} π_n), α is small, and there exists a locally optimal solution that is significantly better than the current policy π_n (i.e., Δ_n(α) ∈ R+ is large), then the OC solver with the learned model P̂ finds a nearly locally optimal solution η_n that outperforms π_n. With a better η_n, we are now ready to improve π_n via imitating η_n.\n\n3.2 Updating π via Imitating η\n\nGiven η_n, we compute π_{n+1} by performing the following constrained optimization procedure:\n\narg min_π E_{s∼d_{π_n}}[E_{a∼π(·|s)}[A^{η_n}(s, a)]], s.t., E_{s∼d_{π_n}}[D_TV(π(·|s), π_n(·|s))] ≤ β. (5)\n
Note that the key difference between Eq. 5 and the classic API policy improvement procedure is that we use η_n's disadvantage function A^{η_n}, i.e., we are performing imitation learning by treating η_n as an expert in this iteration [26, 27]. We can solve Eq. 5 by converting it to a supervised learning problem such as cost-sensitive classification [5], by sampling states and actions from π_n and estimating A^{η_n} via rolling out η_n, subject to an L1 constraint.\nNote that a CPI-like update approximately solves the above constrained problem as well:\n\nπ_{n+1} = (1 − β)π_n + βπ*_n, where π*_n = arg min_π E_{s∼d_{π_n}}[E_{a∼π(·|s)}[A^{η_n}(s, a)]]. (6)\n\nNote that π_{n+1} satisfies the constraint as D_TV(π_{n+1}(·|s), π_n(·|s)) ≤ β, ∀s. Intuitively, the update in Eq. 6 can be understood as first solving the objective function to obtain π*_n without considering the constraint, and then moving π_n towards π*_n until the boundary of the constraint is reached.\n\n3.3 DPI: Combining Updates on π and η\n\nIn summary, assuming MBOC is used for Eq. 1, DPI operates in an iterative way. With π_n:\n\n1. Fit the MLE P̂ on states and actions from d_{π_n} π_n (Eq. 2).\n2. η_n ← MBOC(P̂), subject to trust region E_{s∼d_{π_n}} D_TV(π, π_n) ≤ α (Eq. 4).\n3. Update to π_{n+1} by imitating η_n, subject to trust region E_{s∼d_{π_n}} D_TV(π, π_n) ≤ β (Eq. 5).\n\nThe above framework shows how π and η are tied together to guide each other's improvements: the first step corresponds to classic MLE under π_n's state-action distribution d_{π_n} π_n; the second step corresponds to model-based policy search around π_n (P̂ is only locally accurate); the third step corresponds to updating π by imitating η (i.e., imitation). 
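The conservative update of Eq. 6 is easy to illustrate in a tabular setting. The sketch below is a hedged illustration rather than the authors' code: it assumes an unrestricted tabular policy class (so the unconstrained minimizer π*_n is the per-state greedy choice with respect to the expert's disadvantage), it assumes the disadvantage table A_eta is given exactly, and the names conservative_update and max_tv are hypothetical.

```python
import numpy as np

def conservative_update(pi_n, A_eta, beta):
    # pi_n: array [n_states, n_actions] whose rows sum to 1.
    # A_eta: array [n_states, n_actions], disadvantage of the expert eta_n.
    n_states, _ = pi_n.shape
    greedy = np.zeros_like(pi_n)
    # Costs are minimized, so the greedy policy picks the smallest disadvantage.
    greedy[np.arange(n_states), A_eta.argmin(axis=1)] = 1.0
    return (1.0 - beta) * pi_n + beta * greedy

def max_tv(pi_a, pi_b):
    # Largest per-state total variation distance between two policies.
    return float(0.5 * np.abs(pi_a - pi_b).sum(axis=1).max())
```

Since the new policy differs from π_n only through a β-weighted mixture, its per-state total variation distance to π_n is at most β, which is the trust-region property used in step 3.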
Note that in practice an MBOC solver (e.g., a second-order optimization method, as we will show in our practical algorithm below) could be computationally expensive and slow (e.g., tree search in ExIt and AlphaGo-Zero), but once P̂ is provided, MBOC does not require additional samples from the real system.\n\nConnections to Previous Works We can see that the above framework generalizes several previous works from API and IL. (a) If we set α = 0 in the limit, we recover CPI (assuming we optimize with Eq. 6), i.e., no attempt to search for a better policy using model-based optimization. (b) Mapping to ExIt, our η_n plays the role of the tree-based policy, our π_n plays the role of the apprentice policy, and MBOC plays the role of forward search. (c) When an optimal expert policy π* is available during and only during training, we can set every η_n to be π*, and DPI then recovers a previous IL algorithm, AGGREVATED [27].\n\n4 Analysis of Policy Improvement\n\nWe provide a general convergence analysis for DPI. The trust region constraints in Eq. 1 and Eq. 5 tightly combine MBOC and policy improvement, and are the key to ensuring monotonic improvement and achieving larger policy improvement per iteration than existing APIs.\nDefine A_n(π_{n+1}) as the disadvantage of π_{n+1} over η_n under d_{π_n}: A_n(π_{n+1}) = E_{s∼d_{π_n}}[E_{a∼π_{n+1}(·|s)}[A^{η_n}(s, a)]]. Note that A_n(π_{n+1}) is at least non-positive (if π and η are from the same function class, or π's policy class is rich enough to include η), since we could set π_{n+1} to η_n. In that case, we simply have A_n(π_{n+1}) = 0, which means we can hope that the IL procedure (Eq. 5) finds a policy π_{n+1} that achieves A_n(π_{n+1}) < 0 (i.e., local improvement over η_n). 
The question we want to answer is: by how much is the performance of π_{n+1} improved over π_n by solving the two trust-region optimization procedures detailed in Eq. 1 and Eq. 5? Following Theorem 4.1 from [5], we define ε = max_s |E_{a∼π_{n+1}(·|s)}[A^{η_n}(s, a)]|, which measures the maximum possible one-step improvement one can achieve from η_n. The following theorem states the performance improvement:\n\nTheorem 4.1 Solve Eq. 1 to get η_n and Eq. 5 to get π_{n+1}. The improvement of π_{n+1} over π_n is:\n\nJ(π_{n+1}) − J(π_n) ≤ βε/(1 − γ)^2 − |A_n(π_{n+1})|/(1 − γ) − Δ_n(α). (7)\n\nThe proof of Theorem 4.1 is provided in Appendix A.3. When β is small, we are guaranteed to find a policy π_{n+1} where the total cost decreases by Δ_n(α) + |A_n(π_{n+1})|/(1 − γ) compared to π_n. Note that classic CPI's per-iteration improvement [5, 12] only contains a term that has a similar meaning and magnitude to the second term on the RHS of Eq. 7. Hence DPI can improve on the performance of CPI by introducing an extra term Δ_n(α), and the improvement could be substantial when there exists a locally optimal policy η_n that is much better than the current reactive policy π_n. Such Δ(α) comes from the explicit introduction of a model-based search into the training loop, which does not exist in classic APIs. From a practical point of view, modern MBOCs are usually second-order methods, while APIs are usually first-order (e.g., REINFORCE and CPI). Hence it is reasonable to expect Δ(α) itself to be larger than API's policy improvement per iteration. Connecting back to ExIt and AlphaGo-Zero under the model-based setting, Δ(α) stands for the improvement of the tree-based policy over the current deep net reactive policy. 
In ExIt and AlphaGo-Zero, the tree-based policy η_n performs fixed-depth forward search followed by rolling out π_n (i.e., bottom up by V^{π_n}(s)), which ensures the expert η_n outperforms π_n.\nWhen |Δ_n(α)| and |A_n(π_{n+1})| are small, i.e., |Δ_n(α)| ≤ ξ and |A_n(π_{n+1})| ≤ ξ, we can guarantee that η_n and π_n are good policies, under the stronger assumption that the initial distribution ρ_0 happens to be a good distribution (e.g., close to d_{π*}), and that the realizable assumption min_{π∈Π} E_{s∼d_{π_n}}[E_{a∼π(·|s)}[A^{η_n}(s, a)]] = E_{s∼d_{π_n}}[min_{a∈A} A^{η_n}(s, a)] holds. We show in Appendix A.4 that under the realizable assumption:\n\nJ(η_n) − J(π*) ≤ (max_s (d_{π*}(s)/ρ_0(s))) (ξ/(β(1 − γ)^2) + ξ/(β(1 − γ))).\n\nThe term (max_s (d_{π*}(s)/ρ_0(s))) measures the distribution mismatch between the initial distribution ρ_0 and the optimal policy π*, and appears in some previous API algorithms, namely CPI [5] and PSDP [4]. A ρ_0 that is closer to d_{π*} (e.g., let experts reset the agent's initial position if possible) ensures a better global performance guarantee. CPI considers a setting where a good reset distribution ν (different from ρ_0) is available; DPI can leverage such a reset distribution by replacing ρ_0 with ν at training.\nIn summary, we can expect larger per-iteration policy improvement from DPI compared to CPI (and TRPO, which has similar per-iteration policy improvement to CPI), thanks to the introduction of local model-based search. 
The final performance bound of the learned policy is on par with CPI and PSDP.\n\n5 An Instance of DPI\n\nIn this section, we dive into the details of each update step of DPI and suggest one practical instance of DPI, which can be used in continuous control settings. We denote T as the maximum possible horizon.¹ We denote the state space S ⊆ R^{d_s} and action space A ⊆ R^{d_a}. We work on parameterized policies: we parameterize policy π as π(·|s; θ) for any s ∈ S (e.g., a neural network with parameter θ), and parameterize η by a sequence of time-varying linear-Gaussian policies η = {η_t}_{1≤t≤T}, where η_t(·|s) = N(K_t s + k_t, P_t) with control gain K_t ∈ R^{d_a×d_s}, bias term k_t ∈ R^{d_a} and covariance P_t ∈ R^{d_a×d_a}. We will use Θ = {K_t, k_t, P_t}_{0≤t≤H} to represent the collection of the parameters of all the linear-Gaussian policies across the entire horizon. One approximation we make here is to replace the policy divergence measure D_TV(π_n, π) (note total variation distance is symmetric) with the KL-divergence D_KL(π_n, π), which allows us to leverage Natural Gradient [11, 10].² To summarize, π_n and η_n are short for π_{θ_n} and η_{Θ_n} = {N(K_t s + k_t, P_t)}_t, respectively. Below we first describe how to compute η_{Θ_n} given π_n (Sec. 5.1), and then describe how to update π via imitating η_{Θ_n} using Natural Gradient (Sec. 5.2).\n\n¹ Note T is the maximum possible horizon, which could be long. Hence, we still want to output a single policy, especially when the policy is parameterized by complicated non-linear function approximators like deep nets.\n² Small D_KL leads to small D_TV, as by Pinsker's inequality, D_KL(q, p) (and D_KL(p, q)) ≥ D_TV(p, q)^2.\n\n5.1 Updating η_Θ with MBOC using Learned Time-Varying Linear Models\n\nWe explain here how to find η_n given π_n using MBOC. In our implementation, we use Linear Quadratic Gaussian (LQG) optimal control [28] as the black-box optimal control solver. 
We learn a sequence of time-varying linear-Gaussian transition models to represent P̂: ∀t ∈ [1, T],\n\ns_{t+1} ∼ N(A_t s_t + B_t a_t + c_t, Σ_t), (8)\n\nwhere A_t, B_t, c_t, Σ_t can be learned using classic linear regression techniques on a dataset {s_t, a_t, s_{t+1}} collected from executing π_n on the real system. Although the dynamics P(s, a) may be complicated over the entire space, linear dynamics could locally approximate the dynamics well (after all, our theorem only requires P̂ to have low predictive error under d_{π_n} π_n).\nNext, to find a locally optimal policy under linear-Gaussian transitions (i.e., Eq. 4), we add the KL constraint to the objective with Lagrange multiplier µ and form an equivalent min-max problem:\n\nmin_η max_{µ≥0} E[∑_{t=1}^{T} γ^{t−1} c(s_t, a_t)] + µ(∑_{t=1}^{T} γ^{t−1} E_{s∼d^t_η}[D_KL(η, π_n)] − α), (9)\n\nwhere µ is the Lagrange multiplier, which can be solved by alternately updating η and µ [19]. For a fixed µ, using the derivation from [19], ignoring terms that do not depend on η, Eq. 9 can be written:\n\narg min_η E[∑_{t=1}^{T} γ^{t−1} (c(s_t, a_t)/µ − log π_n(a_t|s_t))] + ∑_{t=1}^{T} γ^{t−1} E_{s∼d^t_η}[H(η(·|s))], (10)\n\nwhere H(π(·|s)) = ∑_a π(a|s) ln(π(a|s)) is the negative entropy. Hence the above formulation can be understood as using a new cost function c'(s_t, a_t) = c(s_t, a_t)/µ − log(π_n(a_t|s_t)), and an entropy regularization on π. It is well known in the optimal control literature that when c' is quadratic and the dynamics are linear, the optimal sequence of linear-Gaussian policies for the objective in Eq. 10 can be found exactly by a Dynamic Programming (DP) based approach, the Linear Quadratic Regulator (LQR) [28]. Given a dataset {(s_t, a_t), c'(s_t, a_t)} collected while executing π_n, we can fit a quadratic approximation of c'(s, a) [29, 19]. With a quadratic approximation of c' and linear dynamics, we solve Eq. 10 for η exactly by LQR [29]. 
Once we get η, we go back to Eq. 9 and update the Lagrange multiplier µ, for example, by projected gradient ascent [30]. Upon convergence, LQR gives us a sequence of time-dependent linear-Gaussian policies together with a sequence of analytic quadratic cost-to-go functions Q_t(s, a), and quadratic disadvantage functions A_t^{η_{Θ_n}}(s, a), for all t ∈ [T].\n\nAlgorithm 1 AGGREVATED-GPS\n1: Input: Parameters α ∈ R+, β ∈ R+.\n2: Initialize π_{θ_0}\n3: for n = 0 to ... do\n4: Execute π_{θ_n} to generate a set of trajectories\n5: Fit local linear dynamics P̂ (Eq. 8) using {s_t, a_t, s_{t+1}} collected from step 4\n6: Solve the min-max in Eq. 9 subject to P̂ to obtain η_{Θ_n} and form disadvantage A^{η_{Θ_n}}\n7: Compute θ_{n+1} by NGD (Eq. 12)\n8: end for\n\n5.2 Updating π_θ via Imitating η_Θ using Natural Gradient\n\nPerforming a second-order Taylor expansion of the KL constraint E_{s∼d_{π_n}}[D_KL(π_n(·|s), π(·|s; θ))] around θ_n [11, 10], we get the following constrained optimization problem:\n\nmin_θ E_{s∼d_{π_{θ_n}}}[E_{a∼π(·|s;θ)}[A^{η_{Θ_n}}(s, a)]], s.t., (θ − θ_n)^T F_{θ_n} (θ − θ_n) ≤ β, (11)\n\nwhere F_{θ_n} is the Hessian of the KL constraint E_{s∼d_{π_{θ_n}}} D_KL(π_{θ_n}, π_θ) (i.e., the Fisher information matrix), measured at θ_n. Denote the objective (i.e., the first term in Eq. 11) as L_n(θ), and denote ∇_{θ_n} as ∇_θ L_n(θ)|_{θ=θ_n}. We can optimize θ by performing natural gradient descent (NGD):\n\nθ_{n+1} = θ_n − µ F_{θ_n}^{−1} ∇_{θ_n}, where µ = sqrt(β / (∇_{θ_n}^T F_{θ_n}^{−1} ∇_{θ_n})). (12)\n\nThe specific µ above ensures the KL constraint is satisfied. More details about the imitation update on π can be found in Appendix B.3.\n
Summary If we consider η as an expert, NGD is similar to natural gradient AGGREVATED, a differentiable IL approach [27]. We summarize the procedures presented in Sec. 5.1 and 5.2 in Alg. 1, which we name AGGREVATED-GPS, reflecting the fact that we are using MBOC to Guide Policy Search [19, 21] via an AGGREVATED-type update. Every iteration, we run π_{θ_n} on P to gather samples. We estimate time-dependent local linear dynamics P̂ and then leverage an OC solver (e.g., LQR) to solve the Lagrangian in Eq. 9 to compute η_{Θ_n} and A^{η_{Θ_n}}. We then perform NGD to update to π_{n+1}.\n\n5.3 Additional Related Works\n\nThe most closely related work with respect to Alg. 1 is Guided Policy Search (GPS) for unknown dynamics [19] and its variants (e.g., [20, 21, 22]). GPS (including its variants) demonstrates that model-based optimal control approaches can be used to speed up training policies parameterized by rich non-linear function approximators (e.g., deep networks) in large-scale applications. While Alg. 1 at a high level follows GPS's iterative procedure of alternating reactive policy improvement and MBOC, the main difference between Alg. 1 and GPS is the update procedure of the reactive policy. 
Classic GPS, including the mirror descent version, phrases the update procedure of the reactive policy as a behavior cloning procedure, i.e., given an expert policy η, it performs min_π D_KL(d_η η || d_π π) (see footnote 3). Note that our approach to updating π is fundamentally on-policy, i.e., we generate samples from π. Moreover, we update π by performing policy iteration against η, i.e., π approximately acts greedily with respect to A^η, which results in a key difference: if we limit the power of MBOC, i.e., set the trust region size in the MBOC step to zero in both DPI and GPS, then our approach reduces to CPI and thus improves π to local optimality. GPS and its variants, by contrast, have no ability to improve the reactive policy in that setting.\n\nFootnote 3: See Line 3 in Alg. 2 in [21], where in principle behavior cloning on π uses samples from the expert η (i.e., off-policy samples). We note, however, that in actual implementations some variants of GPS tend to swap the order of π and η inside the KL, often resulting in an on-policy sampling strategy (e.g., [22]). We also note that a Mirror Descent interpretation and analysis used to explain GPS's convergence [21] implies that the correct way to perform a projection is to minimize the reverse KL, i.e., arg min_{π∈Π} D_KL(d_π π || d_η η). This in turn matches the DPI intuition: one should attempt to find a policy π that is similar to η under the state distribution of π itself.\n\n6 Experiments\n\nWe tested our approach on several MDPs: (1) a set of random discrete MDPs (Garnet problems [7]), (2) Cartpole balancing [31], (3) Helicopter Aerobatics (Hover and Funnel) [32], and (4) Swimmer, Hopper and Half-Cheetah from the MuJoCo physics simulator [33]. The goals of these experiments are: (a) to experimentally verify that using A^η from the intermediate expert η computed by model-based search to perform policy improvement is more sample-efficient than using A^π; and (b) to show that our approach can be applied to robust policy search and can outperform existing approaches [25].\n\n6.1 Comparison to CPI on Discrete MDPs\n\nFollowing [7], we randomly create ten discrete MDPs with 1000 states and 5 actions. Unlike the techniques we introduced in Sec. 5.2 for continuous settings, here we use the conservative update shown in Eq. 6 to update the reactive policy, where each π_n^* is a linear classifier trained using regression-based cost-sensitive classification on samples from d_{π_n} [5]. The feature φ(s) for each state s is a binary encoding of the state (φ(s) ∈ R^{log_2(|S|)}). We maintain the estimated transition P̂ in a tabular representation. The policy η is also in a tabular representation (hence the expert η and the reactive policy π have different feature representations) and is computed using exact VI under P̂ and c'(s, a) (hence we name our approach here AGGREVATED-VI). The setup and the conservative update implementation are detailed in Appendix B.1. Fig. 1a reports the statistical performance of our approach and CPI over the 10 random discrete MDPs. Note that our approach is more sample-efficient than CPI, although we observed it is slower than CPI per iteration because we run VI using the learned model. We tune β, and neither CPI nor our approach uses line search on β. The major difference between AGGREVATED-VI and CPI here is that we use A^η instead of A^π to update π.\n\nFigure 1: Performance (mean and standard error of cumulative cost in log2-scale on y-axis) versus number of episodes (n on x-axis). Panels: (a) Discrete MDP, (b) Cart-Pole, (c) Helicopter Hover, (d) Helicopter Funnel, (e) Swimmer, (f) Hopper, (g) Half-Cheetah.\n\n6.2 Comparison to Actor-Critic in Continuous Settings\n\nWe compare against TRPO-GAE [23] on a set of continuous control tasks. The setup is detailed in Appendix B.4. 
TRPO-GAE is an actor-critic-like approach where both actor and critic are updated using trust region optimization. We use a two-layer neural network to represent the policy π, which is updated by natural gradient descent. We use LQR as the underlying MBOC solver and we name our approach AGGREVATED-ILQR. Fig. 1 (b-g) shows the comparison between our method and TRPO-GAE over a set of continuous control tasks (the confidence interval is computed from 20 random trials). As we can see, our method is significantly more sample-efficient than TRPO-GAE, albeit slower per iteration because we perform MBOC. The major difference between our approach and TRPO-GAE is that we use A^η while TRPO-GAE uses A^π for the policy update. Note that both A^η and A^π are computed using the rollouts from π. The difference is that our approach uses rollouts to learn local dynamics and analytically estimates A^η using MBOC, while TRPO-GAE learns A^π using rollouts. Overall, our approach converges faster than TRPO-GAE (i.e., uses fewer samples), which again indicates the benefit of using A^η in policy improvement.\n\n6.3 Application on Robust Policy Optimization\n\nOne application for our approach is robust policy optimization [34], where we have multiple training environments that are all potentially different from, but similar to, the testing environments. The goal is to train a single reactive policy using the training environments and deploy the policy on a test environment without any further training. Previous work suggests that a policy that optimizes all the training models simultaneously is stable and robust during testing [24, 25], as the training environments together act as “regularization” to avoid overfitting and provide generalization.\n\n
More formally, let us assume that we have M training environments. At iteration n with π_{θ_n}, we execute π_{θ_n} on the i'th environment, generate samples, fit local models, and call MBOC associated with the i'th environment to compute η_{Θ_n^i}, for all i ∈ [M]. With A^{η_{Θ_n^i}}, for all i ∈ [M], we consider all training environments equally and formalize the objective L_n(θ) as L_n(θ) = Σ_{i=1}^{M} E_{s∼d_{π_{θ_n}}}[ E_{a∼π(·|s;θ)}[ A^{η_{Θ_n^i}}(s, a) ] ]. We update θ_n to θ_{n+1} by NGD on L_n(θ). Intuitively, we update π_θ by imitating η_{Θ_n^i} simultaneously for all i ∈ [M].\n\nWe consider two simulation tasks, cartpole balancing and helicopter funnel. For each task, we create ten environments by varying the physical parameters (e.g., mass of helicopter, mass and length of pole). We use seven of the environments for training and the remaining three for testing. We compare our algorithm against TRPO, which could be regarded as a model-free, natural gradient version of the first-order algorithm proposed in [25]. We also ran our algorithm on a single randomly picked training environment, but still tested on the test environments; this variant is denoted as non-robust in Fig. 2. Fig. 2 summarizes the comparison between our approach and the baselines. Similar to the trend we saw in the previous section, our approach is more sample-efficient in the robust policy optimization setup as well. Interestingly, the “non-robust” approach fails to converge further, which illustrates the overfitting phenomenon: the learned policy overfits to one particular training environment.\n\nFigure 2: Performance (mean in log-scale on y-axis) versus episodes (n on x-axis) in robust control. Panels: (a) Cart-Pole, (b) Helicopter Funnel.\n\n7 Conclusion\n\nWe present and analyze Dual Policy Iteration, a framework that alternately computes a non-reactive policy via more advanced and systematic search, and updates a reactive policy via imitating the non-reactive one. 
Recent algorithms that have been successful in practice, like AlphaGo-Zero and ExIt, are subsumed by the DPI framework. We then provide a simple instance of DPI for RL with unknown dynamics, where the instance integrates local model fitting, local model-based search, and reactive policy improvement via imitating the teacher, i.e., the nearly locally optimal policy resulting from model-based search. We theoretically show that integrating model-based search and imitation into policy improvement could result in larger policy improvement at each step. We also experimentally demonstrate the improved sample efficiency compared to strong baselines.\n\nOur work also opens some new problems. In theory, the performance improvement during one call of optimal control with the locally accurate model depends on a term that scales quadratically with respect to the horizon 1/(1 − γ). We believe the dependency on horizon can be brought down by leveraging system identification methods focusing on multi-step prediction [35, 36]. On the practical side, our specific implementation has some limitations due to the choice of LQG as the underlying OC algorithm. LQG-based methods usually require the dynamics and cost functions to be somewhat smooth so that they can be locally approximated by polynomials. We also found that LQG planning horizons must be relatively short, as the approximation error from polynomials will likely compound over the horizon. We plan to explore the possibility of learning non-linear dynamics and using more advanced non-linear optimal control techniques such as Model Predictive Control (MPC) for more sophisticated control tasks.\n\nAcknowledgement\n\nWe thank Sergey Levine for valuable discussion. WS is supported in part by Office of Naval Research contract N000141512365.\n\nReferences\n\n[1] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. 
arXiv preprint arXiv:1705.08439, 2017.\n\n[2] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.\n\n[3] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming: an overview. In Decision and Control, 1995., Proceedings of the 34th IEEE Conference on, volume 1, pages 560–564. IEEE, 1995.\n\n[4] J Andrew Bagnell, Sham M Kakade, Jeff G Schneider, and Andrew Y Ng. Policy search by dynamic programming. In Advances in neural information processing systems, pages 831–838, 2004.\n\n[5] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, 2002.\n\n[6] Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Analysis of a classification-based policy iteration algorithm. In ICML-27th International Conference on Machine Learning, pages 607–614. Omnipress, 2010.\n\n[7] Bruno Scherrer. Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pages 1314–1322, 2014.\n\n[8] Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering, 1994.\n\n[9] Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.\n\n[10] J Andrew Bagnell and Jeff Schneider. Covariant policy search. IJCAI, 2003.\n\n[11] Sham Kakade. A natural policy gradient. NIPS, 2002.\n\n[12] John Schulman, Sergey Levine, Pieter Abbeel, Michael I Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.\n\n[13] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. 
Machine learning, 1992.\n\n[14] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.\n\n[15] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006.\n\n[16] David Silver et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.\n\n[17] Christopher G Atkeson. Using local trajectory optimizers to speed up global optimization in dynamic programming. In Advances in neural information processing systems, pages 663–670, 1994.\n\n[18] Christopher G Atkeson and Jun Morimoto. Nonparametric representation of policies and value functions: A trajectory-based approach. In Advances in neural information processing systems, pages 1643–1650, 2003.\n\n[19] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.\n\n[20] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.\n\n[21] William H Montgomery and Sergey Levine. Guided policy search via approximate mirror descent. In Advances in Neural Information Processing Systems, pages 4008–4016, 2016.\n\n[22] William Montgomery, Anurag Ajay, Chelsea Finn, Pieter Abbeel, and Sergey Levine. Reset-free guided policy search: efficient deep reinforcement learning with stochastic initial states. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3373–3380. IEEE, 2017.\n\n[23] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. 
arXiv preprint arXiv:1506.02438, 2015.\n\n[24] J Andrew Bagnell and Jeff G Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Robotics and Automation, 2001. Proceedings 2001 ICRA. IEEE International Conference on, volume 2, pages 1615–1620. IEEE, 2001.\n\n[25] Christopher G Atkeson. Efficient robust policy optimization. In American Control Conference (ACC), 2012, pages 5220–5227. IEEE, 2012.\n\n[26] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.\n\n[27] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. ICML, 2017.\n\n[28] Huibert Kwakernaak and Raphael Sivan. Linear optimal control systems, volume 1. Wiley-Interscience New York, 1972.\n\n[29] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.\n\n[30] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In ICML, 2003.\n\n[31] Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 135. MIT Press Cambridge, 1998.\n\n[32] Pieter Abbeel, Varun Ganapathi, and Andrew Y Ng. Learning vehicular dynamics, with application to modeling helicopters. In NIPS, pages 1–8, 2005.\n\n[33] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.\n\n[34] Kemin Zhou, John Comstock Doyle, Keith Glover, et al. Robust and optimal control, volume 40. Prentice hall New Jersey, 1996.\n\n[35] Arun Venkatraman, Martial Hebert, and J Andrew Bagnell. Improving multi-step prediction of learned time series models. 
AAAI, 2015.\n\n[36] Wen Sun, Arun Venkatraman, Byron Boots, and J Andrew Bagnell. Learning to filter with predictive state inference machines. In ICML, 2016.\n\n[37] Stephane Ross and Drew Bagnell. Agnostic system identification for model-based reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1703–1710, 2012.\n\n[38] Stéphane Ross, Geoffrey J Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.\n\n[39] Alex Gorodetsky, Sertac Karaman, and Youssef Marzouk. Efficient high-dimensional stochastic optimal motion control using tensor-train decomposition. In Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.\n\n[40] C. Finn, M. Zhang, J. Fu, X. Tan, Z. McCarthy, E. Scharff, and S. Levine. Guided policy search code implementation, 2016. Software available from rll.berkeley.edu/gps.", "award": [], "sourceid": 3510, "authors": [{"given_name": "Wen", "family_name": "Sun", "institution": "Carnegie Mellon University"}, {"given_name": "Geoffrey", "family_name": "Gordon", "institution": "MSR Montr\u00e9al & CMU"}, {"given_name": "Byron", "family_name": "Boots", "institution": "Georgia Tech / Google Brain"}, {"given_name": "J.", "family_name": "Bagnell", "institution": "Carnegie Mellon University"}]}