{"title": "Guided Meta-Policy Search", "book": "Advances in Neural Information Processing Systems", "page_first": 9656, "page_last": 9667, "abstract": "Reinforcement learning (RL) algorithms have demonstrated promising results on complex tasks, yet often require impractical numbers of samples because they learn from scratch. Meta-RL aims to address this challenge by leveraging experience from previous tasks so as to more quickly solve new tasks. However, in practice, these algorithms generally also require large amounts of on-policy experience during the \\emph{meta-training} process, making them impractical for use in many problems. To this end, we propose to learn a reinforcement learning procedure in a federated way, where individual off-policy learners can solve the individual meta-training tasks, and then consolidate these solutions into a single meta-learner. Since the central meta-learner learns by imitating the solutions to the individual tasks, it can accommodate either the standard meta-RL problem setting, or a hybrid setting where some or all tasks are provided with example demonstrations. The former results in an approach that can leverage policies learned for previous tasks without significant amounts of on-policy data during meta-training, whereas the latter is particularly useful in cases where demonstrations are easy for a person to provide. 
Across a number of continuous control meta-RL problems, we demonstrate significant improvements in meta-RL sample efficiency in comparison to prior work as well as the ability to scale to domains with visual observations.", "full_text": "Guided Meta-Policy Search\n\nRussell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine, Chelsea Finn\n\nDepartment of Electrical Engineering and Computer Science\n\nUniversity of California, Berkeley\n\n{russellm, cbfinn}@berkeley.edu\n\n{abhigupta, pabbeel, svlevine}@eecs.berkeley.edu\n\nrdkralev@gmail.com\n\nAbstract\n\nReinforcement learning (RL) algorithms have demonstrated promising results on\ncomplex tasks, yet often require impractical numbers of samples since they learn\nfrom scratch. Meta-RL aims to address this challenge by leveraging experience\nfrom previous tasks so as to more quickly solve new tasks. However, in practice,\nthese algorithms generally also require large amounts of on-policy experience dur-\ning the meta-training process, making them impractical for use in many problems.\nTo this end, we propose to learn a reinforcement learning procedure in a federated\nway, where individual off-policy learners can solve the individual meta-training\ntasks, and then consolidate these solutions into a single meta-learner. Since the\ncentral meta-learner learns by imitating the solutions to the individual tasks, it can\naccommodate either the standard meta-RL problem setting, or a hybrid setting\nwhere some or all tasks are provided with example demonstrations. 
The former results in an approach that can leverage policies learned for previous tasks without significant amounts of on-policy data during meta-training, whereas the latter is particularly useful in cases where demonstrations are easy for a person to provide. Across a number of continuous control meta-RL problems, we demonstrate significant improvements in meta-RL sample efficiency in comparison to prior work as well as the ability to scale to domains with visual observations.

1 Introduction

Meta-learning is a promising approach for using previous experience across a breadth of tasks to significantly accelerate learning of new tasks. Meta-reinforcement learning considers this problem specifically in the context of learning new behaviors through trial and error with only a few interactions with the environment by building on previous experience. Building effective meta-RL algorithms is critical towards building agents that are flexible, such as an agent being able to manipulate new objects in new ways without learning from scratch for each new object and goal. Being able to reuse prior experience in such a way is arguably a fundamental aspect of intelligence.

Enabling agents to adapt via meta-RL is particularly useful for acquiring behaviors in real-world situations with diverse and dynamic environments. However, despite recent advances [7, 8, 17], current meta-RL methods are often limited to simpler domains, such as relatively low-dimensional continuous control tasks [8, 44] and navigation with discrete action commands [7, 24]. 
Optimization stability and sample complexity are major challenges for the meta-training phase of these methods, with some recent techniques requiring up to 250 million transitions for meta-learning in tabular MDPs [7], which typically require a fraction of a second to solve in isolation.

We make the following observation in this work: while the goal of meta-reinforcement learning is to acquire fast and efficient reinforcement learning procedures, those procedures themselves do not need to be acquired through reinforcement learning directly. Instead, we can use a significantly more stable and efficient algorithm for providing supervision at the meta-level. In this work we show that a practical choice is to use supervised imitation learning. A meta-reinforcement learning algorithm can receive more direct supervision during meta-training, in the form of expert actions, while still optimizing for the ability to quickly learn tasks via reinforcement. Crucially, these expert policies can themselves be produced automatically by standard reinforcement learning methods, such that no additional assumptions on supervision are actually needed. They can also be acquired using very efficient off-policy reinforcement learning algorithms which are otherwise challenging to use with meta-reinforcement learning. When available, incorporating human-provided demonstrations can enable even more efficient meta-training, particularly in domains where demonstrations are easy to collect. At meta-test time, when faced with a new task, the method solves the same problem as conventional meta-reinforcement learning: acquiring the new skill using only reward signals.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our main contribution is a meta-RL method that learns fast reinforcement learning via supervised imitation. 
We optimize for a set of parameters such that only one or a few gradient steps leads to a policy that matches the expert's actions. Since supervised imitation is stable and efficient, our approach can gracefully scale to visual control domains and high-dimensional convolutional networks. By using demonstrations during meta-training, there is less of a challenge with exploration in the meta-optimization, making it possible to effectively learn how to learn in sparse reward environments. While the combination of imitation and RL has been explored before [30, 20], the combination of imitation and RL in a meta-learning context has not been considered previously. As we show in our experiments, this combination is in fact extremely powerful: compared to meta-RL, our method can meta-learn comparable adaptation skills with up to 10x fewer interaction episodes, making meta-RL much more viable for real-world learning. Our experiments also show that, through our method, we can adapt convolutional neural network policies to new goals through trial-and-error, with only a few gradient descent steps, and adapt policies to sparse-reward manipulation tasks with a handful of trials. We believe this is a significant step towards making meta-RL practical for use in complex real-world environments.

2 Related Work

Our work builds upon prior work on meta-learning [39, 1, 47], where the goal is to learn how to learn efficiently. We focus on the particular case of meta-reinforcement learning [39, 7, 48, 8, 24, 14]. Prior methods learned reinforcement learners represented by a recurrent or recursive neural network [7, 48, 24, 41, 33], using gradient descent from a learned initialization [8, 14, 36], using a learned critic that provides gradients to the policy [44, 17], or using a planner and an adaptable model [5, 38]. 
In contrast, our approach aims to leverage supervised learning for meta-optimization rather than relying on high-variance algorithms such as policy gradient. We decouple the problem of obtaining expert trajectories for each task from the problem of learning a fast RL procedure. This allows us to obtain expert trajectories using efficient, off-policy RL algorithms. Recent work has used amortized probabilistic inference [34] to achieve data-efficient meta-training; however, such contextual methods cannot continually adapt to out-of-distribution test tasks. Further, the ability of our method to utilize example demonstrations, if available, enables much better performance on challenging sparse reward tasks. Our approach is also related to few-shot imitation [6, 11], in that we leverage supervised learning for meta-optimization. However, in contrast to these methods, we learn an automatic reinforcement learner, which can learn using only rewards and does not require demonstrations for new tasks.

Our algorithm performs meta-learning by first individually solving the tasks with local learners, and then consolidating them into a central meta-learner. This resembles methods like guided policy search, which also use local learners [37, 29, 46, 28, 12]. However, while these prior methods aim to learn a single policy that can solve all of the tasks, our approach instead aims to meta-learn a single learner that can adapt to the training task distribution, and generalize to adapt to new tasks that were not seen during training.

Prior methods have also sought to use demonstrations to make standard reinforcement learning more efficient in the single-task setting [30, 20, 21, 45, 4, 42, 16, 43, 32, 26, 19, 40]. 
These methods aim to learn a policy from demonstrations and rewards, using demonstrations to make the RL problem easier. Our approach instead aims to leverage demonstrations to learn how to efficiently reinforcement learn new tasks without demonstrations, learning new tasks only through trial-and-error. The version of our algorithm where data is aggregated across iterations is an extension of the DAgger algorithm [35] to the meta-learning setting, and this allows us to provide theoretical guarantees on performance.

Figure 1: Overview of the guided meta-policy search algorithm: We learn a policy π_θ which is capable of fast adaptation to new tasks via reinforcement learning, by using reinforcement learning in the inner loop of optimization and supervised learning in the meta-optimization. This algorithm either trains per-task experts π*_i or assumes that they are provided by human demonstrations, and then uses this for meta-optimization. Importantly, when faced with a new task we can simply perform standard reinforcement learning via policy gradient, and the policy will quickly adapt to new tasks because of the meta-training.

3 Preliminaries

In this section, we introduce the meta-RL problem and overview model-agnostic meta-learning (MAML) [8], which we build on in our work. We assume a distribution of tasks T ∼ p, where meta-training tasks are drawn from p and meta-testing consists of learning held-out tasks sampled from p through trial-and-error, by leveraging what was learned during meta-training. Formally, each task T = {r(s_t, a_t), q(s_1), q(s_{t+1}|s_t, a_t)} consists of a reward function r(s_t, a_t) → ℝ, an initial state distribution q(s_1), and unknown dynamics q(s_{t+1}|s_t, a_t). The state space, action space, and horizon H are shared across tasks. Meta-learning methods learn using experience from the meta-training tasks, and are evaluated on their ability to learn new meta-test tasks. 
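For concreteness, a task T can be represented as a small container bundling its reward function, initial-state sampler, and dynamics. The sketch below (all names hypothetical, not from the paper's code) draws goal-reaching tasks from a toy one-dimensional distribution p:

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """One task T = {r(s,a), q(s1), q(s'|s,a)}; a hypothetical container, not the paper's API."""
    reward: Callable      # r(s_t, a_t)
    init_state: Callable  # sample s_1 ~ q(s_1)
    step: Callable        # sample s_{t+1} ~ q(s_{t+1}|s_t, a_t)

def sample_task(rng):
    """Draw T ~ p: a 1-D 'reach the goal g' task; the goal defines the reward and
    is hidden from the learner, while the dynamics are shared across tasks."""
    g = rng.uniform(-1.0, 1.0)
    return Task(reward=lambda s, a, g=g: -abs(s - g),
                init_state=lambda: 0.0,
                step=lambda s, a: s + a)  # horizon H is handled by the caller

rng = random.Random(0)
task = sample_task(rng)
s = task.init_state()
s = task.step(s, 0.3)  # one environment transition
```

Only the reward differs across tasks here, which mirrors the goal-conditioned distributions used in the experiments.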
MAML in particular performs meta-learning by optimizing for a deep network's initial parameter setting such that one or a few steps of gradient descent on a small dataset leads to good generalization. Then, after meta-training, the learned parameters are fine-tuned on data from a new task.

Concretely, consider a supervised learning problem with a loss function denoted as L(θ, D), where θ denotes the model parameters and D denotes the labeled data. During meta-training, a task T is sampled, along with data from that task, which is randomly partitioned into two sets, D^tr and D^val. MAML optimizes for a set of model parameters θ such that one or a few gradient steps on D^tr produces good performance on D^val. Thus, using φ_T to denote the updated parameters, the MAML objective is the following:

min_θ Σ_T L(θ − α∇_θ L(θ, D^tr_T), D^val_T) = min_θ Σ_T L(φ_T, D^val_T),

where α is a step size that can be set as a hyperparameter or learned. Moving forward, we will refer to the outer objective as the meta-objective. Subsequently, at meta-test time, K examples from a new, held-out task T_test are presented and we can run gradient descent starting from θ to infer model parameters for the new task: φ_Ttest = θ − α∇_θ L(θ, D^tr_Ttest).

The MAML algorithm can also be applied to the meta-reinforcement learning setting, where each dataset D_Ti consists of trajectories of the form s_1, a_1, ..., a_{H−1}, s_H, and where the inner and outer loss function corresponds to the negative expected reward:

L_RL(φ, D_Ti) = −(1/|D_Ti|) Σ_{(s_t,a_t)∈D_Ti} r_i(s_t, a_t) = −E_{s_t,a_t ∼ π_φ, q_Ti} [ (1/H) Σ_{t=1}^H r_i(s_t, a_t) ].    (1)

Policy gradients [49] are used to estimate the gradient of this loss function. 
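To make the MAML objective concrete, the following minimal sketch (a toy supervised regression problem, not the paper's RL setting; all names are hypothetical) computes the exact meta-gradient of L(θ − α∇_θ L(θ, D^tr), D^val) for a scalar linear model, where the chain rule through the inner update is tractable in closed form:

```python
import numpy as np

def loss_grad(theta, x, y):
    """dL/dθ for L(θ, D) = mean((θ·x − y)²), a scalar linear model."""
    return 2.0 * np.mean(x * (theta * x - y))

def maml_meta_grad(theta, x_tr, y_tr, x_val, y_val, alpha):
    """Exact meta-gradient of L(θ − α∇θL(θ, D_tr), D_val) for this model.
    With φ = θ − α∇L_tr(θ), the chain rule gives
    dL_val(φ)/dθ = L_val'(φ) · (1 − α·L_tr''(θ))."""
    phi = theta - alpha * loss_grad(theta, x_tr, y_tr)   # inner adaptation step
    hess_tr = 2.0 * np.mean(x_tr ** 2)                   # L_tr is quadratic: constant curvature
    return loss_grad(phi, x_val, y_val) * (1.0 - alpha * hess_tr), phi

# One meta-training step on a toy task y = 3x
rng = np.random.default_rng(0)
x_tr, x_val = rng.normal(size=20), rng.normal(size=20)
y_tr, y_val = 3.0 * x_tr, 3.0 * x_val
theta = 0.0
g, phi = maml_meta_grad(theta, x_tr, y_tr, x_val, y_val, alpha=0.1)
theta -= 0.1 * g   # outer (meta) gradient step
```

In the full method, the inner loss is the policy-gradient RL loss of Eq. 1 rather than a regression loss, and the meta-gradient is estimated from samples rather than computed exactly.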
Thus, the algorithm proceeds as follows: for each task T_i, first collect samples D^tr_Ti from the policy π_θ, then compute the updated parameters using the policy gradient evaluated on D^tr_Ti, then collect new samples D^val_Ti via the updated policy parameters, and finally update the initial parameters θ by taking a gradient step on the meta-objective. In the next section, we will introduce a new approach to meta-RL that incorporates a more stable meta-optimization procedure that still converges to the same solution under some regularity assumptions, and that can naturally leverage demonstrations or policies learned for previous tasks if desired.

4 Guided Meta-Policy Search

Existing meta-RL algorithms generally perform meta-learning from scratch with on-policy methods. This typically requires a large number of samples during meta-training. What if we instead formulate meta-training as a data-driven process, where the agent has previously learned a variety of tasks with standard multi-task reinforcement learning techniques, and now must use the data collected from those tasks for meta-training? Can we use this experience or these policies in meaningful ways during meta-training? Our goal is to develop an approach that can use these previously learned skills to guide the meta-learning process. 
While we will still require on-policy data for inner loop sampling, we will require considerably less of it than we would need without using this prior experience. Surprisingly, as we will show in our experiments, separating meta-training into two phases in this way – a phase that individually solves the meta-training tasks and a second phase that uses them for meta-learning – actually requires less total experience overall, as the individual tasks can be solved using highly efficient off-policy reinforcement learning methods that require less experience taken together than a single meta-RL training phase. We can also improve sample efficiency during meta-training even further by incorporating explicit demonstrations. In the rest of this section, we describe our approach, analyze its theoretical properties, and discuss its practical implementation in multiple real-world scenarios.

4.1 Guided Meta-Policy Search Algorithm

In the first phase of the algorithm, task learning, we learn policies for each of the meta-training tasks. While these policies solve the meta-training tasks, they do not accelerate learning of future meta-test tasks. In Section 4.3, we describe how these policies are trained. Instead of learning policies explicitly through reinforcement learning, we can also obtain expert demonstrations from a human demonstrator, which can be used equivalently with the same algorithm. In the second phase, meta-learning, we learn to reinforcement learn using these policies as supervision at the meta-level. In particular, we train for a set of initial parameters θ such that only one or a few steps of gradient descent produces a policy that matches the policies learned in the first phase.

We will denote the optimal or near-optimal policies learned during the task-learning phase for each meta-training task T_i as {π*_i}. 
We will refer to these individual policies as "experts," because after the first phase, they represent optimal or near-optimal solutions to each of the tasks. Our goal in the meta-learning phase is to optimize the same meta-objective as the MAML algorithm, L_RL(φ_i, D_i), where φ_i denotes the parameters of the policy adapted to task T_i via gradient descent. The inner policy optimization will remain the same as the policy-gradient MAML algorithm; however, we will optimize this meta-objective by leveraging the policies learned in the first phase. In particular, we will base the outer objective on supervised imitation, or behavior cloning (BC), of expert actions. The behavioral cloning loss function we use is L_BC(φ_i, D_i) ≜ −Σ_{(s_t,a_t)∈D} log π_φ(a_t | s_t).

Gradients from supervised learning are lower variance, and hence more stable than reinforcement learning gradients [27]. The specific implementation of the second phase proceeds as follows: we first roll out each of the policies π*_i to collect a dataset of expert trajectories D*_i for each of the meta-training tasks T_i. Using this initial dataset, we update our policy according to the following meta-objective:

min_θ Σ_{T_i} E_{D^tr_i ∼ π_θ} E_{D^val_i ∼ D*_i} [ L_BC(θ − α∇_θ L_RL(θ, D^tr_i), D^val_i) ].    (2)

We discuss how this objective can be efficiently optimized in Section 4.3. The result of this optimization is a set of initial policy parameters θ that can adapt to a variety of tasks, to produce φ_i, in a way that comes close to the expert policy's actions. 
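The structure of this meta-objective, an RL update in the inner loop scored by a behavior-cloning loss in the outer loop, can be sketched for a one-state, discrete-action problem (a toy illustration with hypothetical names, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def inner_rl_step(theta, actions, rewards, alpha):
    """One REINFORCE step θ − α∇θ L_RL for a one-state (bandit) softmax policy,
    where ∇θ log π_θ(a) = onehot(a) − π_θ; ascending reward descends L_RL."""
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for a, r in zip(actions, rewards):
        grad += (np.eye(len(theta))[a] - probs) * r
    return theta + alpha * grad / len(actions)

def bc_loss(phi, expert_actions):
    """Outer objective: L_BC(φ, D*) = −Σ log π_φ(a*) on expert actions."""
    probs = softmax(phi)
    return -sum(np.log(probs[a]) for a in expert_actions)

# Eq. 2 in miniature: adapt with RL on sampled data, score against the expert.
theta = np.zeros(3)
actions, rewards = [2, 0, 2], [1.0, 0.0, 1.0]    # rollouts; action 2 is rewarded
phi = inner_rl_step(theta, actions, rewards, alpha=0.5)
outer = bc_loss(phi, expert_actions=[2, 2])      # lower when adaptation imitates the expert
```

Meta-training would then descend `outer` with respect to `theta` (through the inner step), which is where the low-variance supervised gradients come from.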
Note that, so far, we have not actually required querying the expert beyond access to the initial rollouts; hence, this first step of our method is applicable to problem domains where demonstrations are available in place of learned expert policies. However, when we do have policies for the meta-training tasks, we can continue to improve. In particular, while supervised learning provides stable, low-variance gradients, behavior cloning objectives are prone to compounding errors. In the single-task imitation learning setting, this issue can be addressed by collecting additional data from the learned policy, and then labeling the visited states with optimal actions from the expert policy, as in DAgger [35]. We can extend this idea to the meta-learning setting by alternating between data aggregation into dataset D* and meta-policy optimization in Eq. 2. Data aggregation entails (1) adapting the current policy parameters θ to each of the meta-training tasks to produce {φ_i}, (2) rolling out the current adapted policies {π_φi} to produce states {{s_t}_i} for each task, (3) querying the experts to produce supervised data D = {{(s_t, π*_i(s_t))}_i}, and finally (4) aggregating this data with the existing supervised data D* ← D* ∪ D. This meta-training algorithm is summarized in Alg. 1, and analyzed in Section 4.2. When provided with new tasks at meta-test time, we initialize π_θ and run the policy gradient algorithm.

Algorithm 1 GMPS: Guided Meta-Policy Search
Require: Set of meta-training tasks {T_i}
1: Use RL to acquire π*_i for each meta-training task T_i
2: Initialize D* = {D*_i} with roll-outs from each π*_i
3: Randomly initialize θ
4: while not done do
5:   Optimize meta-objective in Eq. 2 w.r.t. θ using Alg. 2 with aggregated data D*
6:   for each meta-training task T_i do
7:     Collect D^tr_i as K roll-outs from π_θ in task T_i
8:     Compute task-adapted parameters with gradient descent: φ_i = θ − α∇_θ L_RL(θ, D^tr_i)
9:     Collect roll-outs from π_φi, resulting in data {(s_t, a_t)}
10:    Aggregate D*_i ← D*_i ∪ {(s_t, π*_i(s_t))}
11:  end for
12: end while

Algorithm 2 Optimization of Meta Objective
Require: Set of meta-training tasks {T_i}
Require: Aggregated dataset D* := {D*_i}
Require: θ initial parameters
1: while not done do
2:   Sample task T_i ∼ {T_i} {or minibatch of tasks}
3:   Sample K roll-outs D^tr_i = {(s_1, a_1, ..., s_H)} with π_θ in T_i
4:   θ_init ← θ
5:   for n = 1...N_BC do
6:     Evaluate ∇_θ L_RL(θ, D^tr_i) according to Eq. 3 with importance weights π_θ(a_t|s_t)/π_θinit(a_t|s_t)
7:     Compute adapted parameters with gradient descent: φ_i = θ − α∇_θ L_RL(θ, D^tr_i)
8:     Sample expert trajectories D^val_i ∼ D*_i
9:     Update θ ← θ − β∇_θ L_BC(φ_i, D^val_i)
10:  end for
11: end while

Our algorithm, which we call guided meta-policy search (GMPS), has appealing properties that arise from decomposing the meta-learning problem explicitly into the task learning phase and the meta-learning phase. This decomposition enables the use of previously learned policies or human-provided demonstrations. We find that it also leads to increased stability of training. Lastly, the decomposition makes it easy to leverage privileged information that may only be available during meta-training, such as shaped rewards, task information, or low-level state information such as the positions of objects [23]. 
In particular, this privileged information can be provided to the initial policies as they are being learned and hidden from the meta-policy, such that the meta-policy can be applied in test settings where such information is not available. This technique makes it straightforward to learn vision-based policies, for example, as the bulk of learning can be done without vision, while visual features are learned with supervised learning in the second phase. Our method also inherits appealing properties from policy gradient MAML, such as the ability to continue to learn as more and more experience is collected, in contrast to recurrent neural networks that cannot be easily fine-tuned on new tasks.

4.2 Convergence Analysis

Now that we have derived a meta-RL algorithm that leverages supervised learning for increased stability, a natural question is: will the proposed algorithm converge to the same answer as the original (less stable) MAML algorithm? Here, we prove that GMPS with data aggregation, described above, will indeed obtain near-optimal cumulative reward when supplied with near-optimal experts. Our proof follows a similar technique to prior work that analyzes the convergence of imitation algorithms with aggregation [35, 18], but extends these results into the meta-learning setting. More specifically, we can prove the following theorem, for task distribution p and horizon H.

Theorem 4.1 For GMPS, assuming reward-to-go bounded by δ, and training error bounded by ε_θ*, we can show that E_{i∼p(T)}[E_{π_{θ+∇_θ E_πθ[R_i]}}[Σ_{t=1}^H r_i(s_t, a_t)]] ≥ E_{i∼p(T)}[E_{π*_i}[Σ_{t=1}^H r_i(s_t, a_t)]] − δ√(ε_θ*) O(H), where π*_i are per-task expert policies.

The proof of this theorem requires us to assume that the inner policy update in Eq. 2 can bring the learned policy to within a bounded error of each expert, which amounts to an assumption on the universality of gradient-based meta-learning [10]. The theorem amounts to saying that GMPS can achieve an expected reward that is within a bounded error of the optimal reward (i.e., the reward of the individual experts), and the error is linear in H and √(ε_θ*). The analysis holds for GMPS when each iteration generates samples by adapting the current meta-trained policy to each training task. However, we find in practice that the initial iteration, where data is simply sampled from per-task experts π*_i, is quite stable and effective; hence, we use this in our experimental evaluation. For the full proof of Theorem 4.1, see the Appendix.

4.3 Algorithm Implementation

We next describe the full meta-RL algorithm in detail.

Expert Policy Optimization. The first phase of GMPS entails learning policies for each meta-training task. The simplest approach is to learn a separate policy for each task from scratch. This can already improve over standard meta-RL, since we can employ efficient off-policy reinforcement learning algorithms. We can improve the efficiency of this approach by employing a contextual policy to represent the experts, which simultaneously uses data from all of the tasks. We can express such a policy as π_θ(a_t|s_t, ω), where ω represents the task context. Crucially, the context only needs to be known during meta-training – the end result of our algorithm, after the second phase, still uses raw task rewards without knowledge of the context at meta-test time. 
In our experiments, we employ this approach, together with soft actor-critic (SAC) [15], an efficient off-policy RL method.

For training the experts, we can also incorporate extra information during meta-training that is unavailable at meta-test time, such as knowledge of the state or better-shaped rewards, when available. The former has been explored in single-task RL settings [23, 31], while the latter has been studied for on-policy meta-RL settings [14].

Meta-Optimization Algorithm. In order to efficiently optimize the meta-objective in Eq. 2, we adopt an approach similar to MAML. At each meta-iteration and for each task T_i, we first draw samples D^tr_Ti from the policy π_θ, then compute the updated policy parameters φ_Ti using D^tr_Ti, then we update θ to optimize L_BC, averaging over all tasks in the minibatch. This requires sampling from π_θ, so for efficient learning, we should minimize the number of meta-iterations.

We note that we can take multiple gradient steps on the behavior cloning meta-objective in each meta-iteration, since this objective does not require on-policy samples. However, after the first gradient step on the meta-objective modifies the pre-update parameters θ, we need to recompute the adapted parameters φ_i starting from θ, and we would like to do so without collecting new data from π_θ. To achieve this, we use an importance-weighted policy gradient, with importance weights π_θ(a_t|s_t)/π_θinit(a_t|s_t), where θ_init denotes the policy parameters at the start of the meta-iteration. At the start of a meta-iteration, we sample trajectories τ from the current policy with parameters denoted as θ = θ_init. Then, we take many off-policy gradient steps on θ. 
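A minimal sketch of this importance-weighted recomputation of the adapted parameters, for a one-state, discrete-action softmax policy (hypothetical names, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adapt_off_policy(theta, theta_init, actions, advantages, alpha):
    """Recompute φ = θ + α·E[(π_θ/π_θinit)·∇θ log π_θ·A] using samples that were
    drawn once from π_θinit; importance weights correct for the current θ."""
    p, p_init = softmax(theta), softmax(theta_init)
    grad = np.zeros_like(theta)
    for a, adv in zip(actions, advantages):
        w = p[a] / p_init[a]                           # importance weight π_θ(a)/π_θinit(a)
        grad += w * (np.eye(len(theta))[a] - p) * adv  # w · ∇θ log π_θ(a) · A
    return theta + alpha * grad / len(actions)

theta_init = np.zeros(3)
acts, advs = [1, 1, 0], [1.0, 1.0, -0.5]               # collected once from π_θinit
phi0 = adapt_off_policy(theta_init, theta_init, acts, advs, alpha=0.5)
# after an outer-loop step moves θ, the same samples are reused off-policy:
theta = theta_init + np.array([0.2, -0.1, 0.0])
phi1 = adapt_off_policy(theta, theta_init, acts, advs, alpha=0.5)
```

When θ = θ_init the weights are identically 1 and this reduces to the ordinary on-policy gradient; as θ drifts, the weights reweight the stale samples instead of requiring fresh rollouts.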
Each off-policy gradient step involves recomputing the updated parameters φ_i using importance sampling:

φ_i = θ + α E_{τ∼π_θinit} [ (π_θ(τ)/π_θinit(τ)) ∇_θ log π_θ(τ) A_i(τ) ],    (3)

where A_i is the estimated advantage function. Then, the off-policy gradient step is computed and applied using the updated parameters using the behavioral cloning objective defined previously: θ ← θ − β∇_θ L_BC(φ_i, D^val_i). This optimization algorithm is summarized in Alg. 2.

5 Experimental Evaluation

We evaluate GMPS separately as a meta-reinforcement learning algorithm, and for learning fast RL procedures from multi-task demonstration data. We consider the following questions: As a meta-RL algorithm, (1) can GMPS meta-learn more efficiently than prior meta-RL methods? For learning from demonstrations, (2) does using imitation learning in the outer loop of optimization enable us to overcome challenges in exploration, and learn from sparse rewards? And further, (3) can we effectively meta-learn CNN policies that can quickly adapt to vision-based tasks?

To answer these questions, we consider multiple continuous control domains shown in Fig. 2.

5.1 Experimental Setup

Sawyer Manipulation Tasks. The tasks involving the 7-DoF Sawyer arm are performed with 3D position control of a parallel jaw gripper (four DoF total, including open/close). The Sawyer environments include:

• Pushing, full state: The tasks involve pushing a block with a fixed initial position to a target location sampled from a 20 cm × 10 cm region. The target location within this region is not observed and must be implicitly inferred through trial-and-error. 
The 'full state' observations include the 3D position of the end effector and of the block.

• Pushing, vision: Same as above, except the policy receives images instead of block positions.

• Door opening: The task distribution involves opening a door to a target angle sampled uniformly from 0 to 60 degrees. The target angle is not present in the observations, and must be implicitly inferred through trial-and-error. The 'full state' observations include the 3D end effector position of the arm, the state of the gripper, and the current position and angle of the door.

Figure 2: Illustration of pushing (left), door opening (center) and legged locomotion (right) used in our experiments, with the goal regions specified in green for pushing and locomotion.

Figure 3: Meta-training efficiency on full state pushing and dense reward locomotion. All methods reach similar asymptotic performance, but GMPS requires significantly fewer samples.

Quadrupedal Legged Locomotion. This environment uses the ant environment in OpenAI gym [3]. The task distribution comprises goal positions sampled uniformly from the edge of a circle with radius 2 m, between 0 and 90 degrees. We consider dense rewards when evaluating GMPS as a meta-RL algorithm, and a challenging sparse-reward setting when evaluating GMPS with demonstrations.

Further details, such as the reward functions for all environments, network architectures, and hyperparameters swept over, are in the appendix. Videos of our results are available online 1.

5.2 Meta-Reinforcement Learning

We first evaluate the sample efficiency of GMPS as a meta-RL algorithm, measuring performance as a function of the total number of samples used during meta-training. 
We compare to a recent inference-based off-policy method (PEARL) [34] and the policy gradient version of model-agnostic meta-learning (MAML) [8], which uses REINFORCE in the inner loop and TRPO in the outer loop. We also compare to RL2 [7], and to a single policy that is trained across all meta-training tasks (we refer to this comparison as MultiTask). At meta-training time (but not meta-test time), we assume access to the task context, i.e., information that completely specifies the task: the target location for the pushing and locomotion experiments. We train a policy conditioned on the target position with soft actor-critic (SAC) to obtain expert trajectories which are used by GMPS. The samples used to train this expert policy with SAC are included in our evaluation. At meta-test time, when adapting to new validation tasks, we only have access to the reward, which necessitates meta-learning without providing the task contexts to the policy.

From the meta-learning curves in Fig. 3, we see similar performance compared to PEARL, and a 4x improvement for Sawyer object pushing and about a 12x improvement for legged locomotion over MAML in terms of the number of samples required. We also see that GMPS performs substantially better than PEARL when evaluated on test tasks which are not in the training distribution for legged locomotion (Fig. 4). 
This is because PEARL cannot generate useful contexts for out-of-distribution tasks, while GMPS uses the policy gradient to adapt, which enables it to continue making progress.

Hence, the combination of (1) an off-policy RL algorithm such as SAC for obtaining per-task experts, and (2) the ability to take multiple off-policy supervised gradient steps w.r.t. the experts in the outer loop, enables us to obtain significant overall sample-efficiency gains compared to on-policy meta-RL algorithms such as MAML, while also showing much better extrapolation than data-efficient contextual methods like PEARL. These sample-efficiency gains are important because they bring us significantly closer to a robust meta-reinforcement learning algorithm that can be run on physical robots with practical time scales and sample complexity.

Figure 4: Test-time extrapolation for dense reward ant locomotion. The test tasks involve navigating to the red goals indicated (right). GMPS gets better average return across tasks (left).

¹ The website is at https://sites.google.com/berkeley.edu/guided-metapolicy-search/home

Figure 5: Meta-training comparisons for sparse reward door opening (left), sparse reward ant locomotion (middle) and vision pusher (right). Our method is able to learn when only sparse rewards are available for adaptation, whereas prior methods struggle. For vision-based tasks, we find that GMPS is able to effectively leverage the demonstrations to quickly and stably learn to adapt.

5.3 Meta-Learning from Demonstrations

For challenging tasks involving sparse rewards and image observations, access to demonstrations can greatly aid the learning of reinforcement learning procedures. GMPS allows us to incorporate supervision from demonstrations much more easily than prior methods. Here, we compare against PEARL, MAML, and MultiTask as in the previous section.
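Before turning to the results, the bilevel structure behind GMPS — per-task experts supplying supervision for an imitation outer loop, with a policy-gradient-style adaptation step in the inner loop — can be sketched in miniature. Everything below is an illustrative simplification: the policy is linear, the "experts" are analytic (standing in for the SAC-trained experts), and the inner loop uses the analytic reward gradient (standing in for the REINFORCE estimate):

```python
import numpy as np

# Toy GMPS sketch: 2-D goal-reaching "tasks" with a linear policy a = W s.

def adapt(W, s, g, alpha):
    """Inner loop: one gradient step on the task reward r = -||W s - g||^2
    (an analytic stand-in for a sampled policy-gradient estimate)."""
    a = W @ s
    grad = 2.0 * np.outer(a - g, s)          # d(-r)/dW
    return W - alpha * grad

def gmps_meta_train(tasks, alpha=0.1, beta=0.05, iters=500, seed=0):
    """Outer loop: supervised imitation of the per-task expert action
    (here simply the goal g), differentiated through the inner step."""
    rng = np.random.default_rng(seed)
    s = np.array([1.0])                      # fixed dummy observation
    W = rng.normal(scale=0.1, size=(2, 1))   # meta-parameters
    for _ in range(iters):
        meta_grad = np.zeros_like(W)
        for g in tasks:
            W_task = adapt(W, s, g, alpha)   # task-specific adaptation
            a_post = W_task @ s
            # Imitation loss ||a_post - g||^2; for this linear case
            # d a_post / dW = (1 - 2*alpha) s^T, so the meta-gradient is:
            meta_grad += np.outer(2.0 * (a_post - g), s) * (1.0 - 2.0 * alpha)
        W -= beta * meta_grad / len(tasks)
    return W
```

In the actual algorithm the outer loop imitates full expert trajectories and the inner loop uses sampled policy-gradient estimates; the toy preserves only the bilevel structure (off-policy supervised outer updates around an RL inner update).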
When evaluating on tasks requiring exploration, such as sparse-reward tasks, we additionally compare against model-agnostic exploration with structured noise (MAESN) [14], which is designed with sparse-reward tasks in mind. Finally, we compare to a single policy trained with imitation learning across all meta-training tasks using the provided demonstrations (we refer to this comparison as MultiTask Imitation), adapted to new validation tasks via fine-tuning. For all experiments, the goal position is not provided as input: the meta-learning algorithm must discover a strategy for inferring the goal from the reward.

Sparse Reward Tasks. One of the potential benefits of learning to learn from demonstrations is that exploration challenges are substantially reduced for the meta-optimizer, since the demonstrations provide detailed guidance on how the task should be performed. We hypothesize that in typical meta-RL, the lack of an easily available reward signal in sparse-reward tasks makes meta-optimization very challenging, and that using demonstrations makes this optimization significantly easier. To test this hypothesis, we experiment with learning to reinforcement learn from sparse reward signals in two different domains: door opening and sparse legged locomotion, as described in Section 5.1.

As seen in Fig. 5, we find that, unlike meta-RL methods such as MAESN, PEARL, and MAML, GMPS is able to successfully find a good solution in sparse-reward settings and learn to explore. This benefit arises largely because demonstrations let us tackle the exploration problem far more effectively than meta-reinforcement learning from scratch. We also find that GMPS adapts to validation tasks more successfully than a policy pre-trained with MultiTask Imitation (Fig. 6). The policy pre-trained with imitation learning does not transfer effectively to the new validation tasks via fine-tuning, since it is not trained for adaptability.

Vision-Based Tasks. Deep RL methods have the potential to acquire policies that produce actions based simply on visual input [22, 25, 9]. However, learning vision-based policies that can quickly adapt to new tasks via meta-reinforcement learning has proven challenging because of the difficulty of optimizing the meta-objective with extremely high-variance policy gradient algorithms. On the other hand, visual imitation learning algorithms and RL algorithms that leverage supervised learning have been far more successful [23, 2, 13, 50], due to the stability of supervised learning compared with RL. We evaluate GMPS with visual observations under the assumption that we have access to visual demonstrations for the meta-training tasks. Given these demonstrations, we directly train vision-based policies using GMPS with RL in the inner loop and imitation in the outer loop. To best leverage the added stability provided by imitation learning, we meta-optimize the entire policy (both fully connected and convolutional layers), but we only adapt the fully connected layers in the inner loop. This enables us to get the benefits of fast adaptation while retaining the stability of meta-imitation.

Figure 6: Comparison between GMPS and fine-tuning a policy pretrained with multi-task imitation, on held-out validation tasks for sparse-reward door opening (right) and vision pusher (left). By meta-learning the structure across tasks, GMPS achieves faster learning. Error bars are across different seeds.

As seen in Fig. 5, learning vision-based policies with GMPS is more stable and achieves higher reward than using meta-learning algorithms such as MAML.
Additionally, we \ufb01nd that both GMPS\nand MAML are able to achieve better performance than a single policy trained with reinforcement\nlearning across all the training tasks. In Fig. 6, we see that GMPS outperforms MultiTask Imitation\nfor adaptation to validation tasks, just as in the sparse reward case.\n6 Discussion and Future Work\nIn this work, we presented a meta-RL algorithm that learns ef\ufb01cient RL procedures via supervised\nimitation. This enables a substantially more ef\ufb01cient meta-training phase that incorporates expert-\nprovided demonstrations to drastically accelerate the acquisition of reinforcement learning procedures\nand priors. We believe that our method addresses a major limitation in meta-reinforcement learning:\nalthough meta-reinforcement learning algorithms can effectively acquire adaptation procedures that\ncan learn new tasks at meta-test time with just a few samples, they are extremely expensive in\nterms of sample count during meta-training, limiting their applicability to real-world problems.\nBy accelerating meta-training via demonstrations, we can enable sample-ef\ufb01cient learning both at\nmeta-training time and meta-test time. Given the ef\ufb01ciency and stability of supervised imitation, we\nexpect our method to be readily applicable to domains with high-dimensional observations, such\nas images. Further, given the number of samples needed in our experiments, our approach is likely\nef\ufb01cient enough to be practical to run on physical robotic systems. Investigating applications of our\napproach to real-world reinforcement learning is an exciting direction for future work.\n\n7 Acknowledgements\n\nThe authors would like to thank Tianhe Yu for contributions on an early version of the paper. This\nwork was supported by Intel, JP Morgan and a National Science Foundation Graduate Research\nFellowship for Abhishek Gupta.\n\nReferences\n[1] Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. 
Learning a synaptic learning rule.

[2] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016.

[3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI gym. arXiv preprint arXiv:1606.01540, 2016.

[4] Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E Taylor, and Ann Nowé. Reinforcement learning from demonstration through shaping. In IJCAI, 2015.

[5] Ignasi Clavera, Anusha Nagabandi, Ronald S Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt: Meta-learning for model-based control. arXiv preprint arXiv:1803.11347, 2018.

[6] Yan Duan, Marcin Andrychowicz, Bradly C. Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In NIPS, pages 1087–1098, 2017.

[7] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

[8] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.

[9] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017.

[10] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. International Conference on Learning Representations (ICLR), 2018.

[11] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine.
One-shot visual\n\nimitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.\n\n[12] Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, and Sergey Levine. Divide-and-\nconquer reinforcement learning. International Conference on Learning Representations (ICLR),\n2018.\n\n[13] Alessandro Giusti, J\u00e9r\u00f4me Guzzi, Dan C Cire\u00b8san, Fang-Lin He, Juan P Rodr\u00edguez, Flavio\nFontana, Matthias Faessler, Christian Forster, J\u00fcrgen Schmidhuber, Gianni Di Caro, et al. A\nmachine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics\nand Automation Letters, 1(2):661\u2013667, 2016.\n\n[14] Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Meta-\nreinforcement learning of structured exploration strategies. In Advances in Neural Information\nProcessing Systems, pages 5307\u20135316, 2018.\n\n[15] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-\npolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint\narXiv:1801.01290, 2018.\n\n[16] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan\nHorgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, et al. Deep q-learning from\ndemonstrations. AAAI, 2018.\n\n[17] Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho,\n\nand Pieter Abbeel. Evolved policy gradients. arXiv preprint arXiv:1802.04821, 2018.\n\n[18] Gregory Kahn, Tianhao Zhang, Sergey Levine, and Pieter Abbeel. PLATO: policy learning\n\nusing adaptive trajectory optimization. CoRR, abs/1603.00622, 2016.\n\n[19] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.\n\nThe International Journal of Robotics Research, 2013.\n\n[20] Jens Kober and Jan R Peters. 
Policy search for motor primitives in robotics. In Neural Information Processing Systems (NIPS), 2009.

[21] Petar Kormushev, Sylvain Calinon, and Darwin G Caldwell. Robot motor skill coordination with EM-based reinforcement learning. In International Conference on Intelligent Robots and Systems (IROS), 2010.

[22] Sascha Lange, Martin Riedmiller, and Arne Voigtländer. Autonomous reinforcement learning on raw visual input data in a real world application. In The 2012 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2012.

[23] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 17(39):1–40, 2016.

[24] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations (ICLR), 2018.

[25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[26] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. International Conference on Robotics and Automation (ICRA), 2018.

[27] Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, and Dale Schuurmans. Reward augmented maximum likelihood for neural structured prediction. CoRR, abs/1609.00150, 2016.

[28] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. Deep decentralized multi-task multi-agent RL under partial observability. International Conference on Machine Learning (ICML), 2017.

[29] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov.
Actor-mimic: Deep multitask and transfer reinforcement learning. International Conference on Learning Representations (ICLR), 2016.

[30] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In International Conference on Intelligent Robots and Systems (IROS), 2006.

[31] Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning. arXiv preprint arXiv:1710.06542, 2017.

[32] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. Robotics: Science and Systems, 2018.

[33] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.

[34] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. International Conference on Machine Learning (ICML), 2019.

[35] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 2011.

[36] Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy search. CoRR, abs/1810.06784, 2018.

[37] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. International Conference on Learning Representations (ICLR), 2016.

[38] Steindór Sæmundsson, Katja Hofmann, and Marc Peter Deisenroth. Meta reinforcement learning with latent variable Gaussian processes.
CoRR, abs/1803.07551, 2018.

[39] Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

[40] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

[41] Bradly C Stadie, Ge Yang, Rein Houthooft, Xi Chen, Yan Duan, Yuhuai Wu, Pieter Abbeel, and Ilya Sutskever. Some considerations on learning to explore via meta-reinforcement learning. arXiv preprint arXiv:1803.01118, 2018.

[42] Kaushik Subramanian, Charles L Isbell Jr, and Andrea L Thomaz. Exploration from demonstration for interactive reinforcement learning. In International Conference on Autonomous Agents & Multiagent Systems, 2016.

[43] Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning & imitation learning. International Conference on Learning Representations (ICLR), 2018.

[44] Flood Sung, Li Zhang, Tao Xiang, Timothy Hospedales, and Yongxin Yang. Learning to learn: Meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529, 2017.

[45] Matthew E Taylor, Halit Bener Suay, and Sonia Chernova. Integrating reinforcement learning with human demonstrations of varying ability. In International Conference on Autonomous Agents and Multiagent Systems, 2011.

[46] Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Neural Information Processing Systems (NIPS), 2017.

[47] Sebastian Thrun and Lorien Pratt. Learning to learn.
Springer Science & Business Media, 2012.

[48] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

[49] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning. Springer, 1992.

[50] Jiakai Zhang and Kyunghyun Cho. Query-efficient imitation learning for end-to-end simulated driving. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.