{"title": "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models", "book": "Advances in Neural Information Processing Systems", "page_first": 4754, "page_last": 4765, "abstract": "Model-based reinforcement learning (RL) algorithms can attain excellent sample efficiency, but often lag behind the best model-free algorithms in terms of asymptotic performance. This is especially true with high-capacity parametric function approximators, such as deep networks. In this paper, we study how to bridge this gap, by employing uncertainty-aware dynamics models. We propose a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation. Our comparison to state-of-the-art model-based and model-free deep RL algorithms shows that our approach matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples (e.g. 8 and 125 times fewer samples than Soft Actor Critic and Proximal Policy Optimization respectively on the half-cheetah task).", "full_text": "Deep Reinforcement Learning in a Handful of Trials\n\nusing Probabilistic Dynamics Models\n\nKurtland Chua\n\nRoberto Calandra\n\nRowan McAllister\n\nSergey Levine\n\nBerkeley Arti\ufb01cial Intelligence Research\n\nUniversity of California, Berkeley\n\n{kchua, roberto.calandra, rmcallister, svlevine}@berkeley.edu\n\nAbstract\n\nModel-based reinforcement learning (RL) algorithms can attain excellent sample\nef\ufb01ciency, but often lag behind the best model-free algorithms in terms of asymp-\ntotic performance. This is especially true with high-capacity parametric function\napproximators, such as deep networks. In this paper, we study how to bridge this\ngap, by employing uncertainty-aware dynamics models. 
We propose a new algo-\nrithm called probabilistic ensembles with trajectory sampling (PETS) that combines\nuncertainty-aware deep network dynamics models with sampling-based uncertainty\npropagation. Our comparison to state-of-the-art model-based and model-free deep\nRL algorithms shows that our approach matches the asymptotic performance of\nmodel-free algorithms on several challenging benchmark tasks, while requiring\nsigni\ufb01cantly fewer samples (e.g., 8 and 125 times fewer samples than Soft Actor\nCritic and Proximal Policy Optimization respectively on the half-cheetah task).\n\n1\n\nIntroduction\n\nReinforcement learning (RL) algorithms provide for an automated framework for decision making\nand control: by specifying a high-level objective function, an RL algorithm can, in principle,\nautomatically learn a control policy that satis\ufb01es this objective. This has the potential to automate a\nrange of applications, such as autonomous vehicles and interactive conversational agents. However,\ncurrent model-free reinforcement learning algorithms are quite data-expensive to train, which often\nlimits their application to simulated domains [Mnih et al., 2015, Lillicrap et al., 2016, Schulman et al.,\n2017], with a few exceptions [Kober and Peters, 2009, Levine et al., 2016]. A promising direction\nfor reducing sample complexity is to explore model-based reinforcement learning (MBRL) methods,\nwhich proceed by \ufb01rst acquiring a predictive model of the world, and then using that model to make\ndecisions [Atkeson and Santamar\u00eda, 1997, Kocijan et al., 2004, Deisenroth et al., 2014]. MBRL is\nappealing because the dynamics model is reward-independent and therefore can generalize to new\ntasks in the same environment, and it can easily bene\ufb01t from all of the advances in deep supervised\nlearning to utilize high-capacity models. However, the asymptotic performance of MBRL methods\non common benchmark tasks generally lags behind model-free methods. 
That is, although MBRL\nmethods tend to learn more quickly, they also tend to converge to less optimal solutions.\nIn this paper, we take a step toward narrowing the gap between model-based and model-free RL\nmethods. Our approach is based on several observations that, though relatively simple, are critical\nfor good performance. We \ufb01rst observe that model capacity is a critical ingredient in the success\nof MBRL methods: while ef\ufb01cient models such as Gaussian processes can learn extremely quickly,\nthey struggle to represent very complex and discontinuous dynamical systems [Calandra et al., 2016].\nBy contrast, neural network (NN) models can scale to large datasets with high-dimensional inputs,\nand can represent such systems more effectively. However, NNs struggle with the opposite problem:\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Our method (PE-TS): Model: Our probabilistic ensemble (PE) dynamics model is shown as\nan ensemble of two bootstraps (bootstrap disagreement far from data captures epistemic uncertainty:\nour subjective uncertainty due to a lack of data), each a probabilistic neural network that captures\naleatoric uncertainty (inherent variance of the observed data). Propagation: Our trajectory sampling\n(TS) propagation technique uses our dynamics model to re-sample each particle (with associated\nbootstrap) according to its probabilistic prediction at each point in time, up until horizon T . Planning:\nAt each time step, our MPC algorithm computes an optimal action sequence, applies the \ufb01rst action\nin the sequence, and repeats until the task-horizon.\n\nto learn fast means to learn with few data and NNs tend to over\ufb01t on small datasets, making poor\npredictions far into the future. 
For this reason, MBRL with NNs has proven exceptionally challenging.\nOur second observation is that this issue can, to a large extent, be mitigated by properly incorporating\nuncertainty into the dynamics model. While a number of prior works have explored uncertainty-aware\ndeep neural network models [Neal, 1995, Lakshminarayanan et al., 2017], including in the context\nof RL [Gal et al., 2016, Depeweg et al., 2016], our work is, to our knowledge, the \ufb01rst to bring\nthese components together in a deep MBRL framework that reaches the asymptotic performance of\nstate-of-the-art model-free RL methods on benchmark control tasks.\nOur main contribution is an MBRL algorithm called probabilistic ensembles with trajectory sampling\n(PETS)1 summarized in Figure 1 with high-capacity NN models that incorporate uncertainty via\nan ensemble of bootstrapped models, where each model encodes distributions (as opposed to point\npredictions), rivaling the performance of model-free methods on standard benchmark control tasks at\na fraction of the sample complexity. An advantage of PETS over prior probabilistic MBRL algorithms\nis an ability to isolate two distinct classes of uncertainty: aleatoric (inherent system stochasticity) and\nepistemic (subjective uncertainty, due to limited data). Isolating epistemic uncertainty is especially\nuseful for directing exploration [Thrun, 1992], although we leave this for future work. Finally,\nwe present a systematic analysis of how incorporating uncertainty into MBRL with NNs affects\nperformance, during both model training and planning. 
We show that PETS' particular treatment of uncertainty significantly reduces the amount of data required to learn a task, e.g., eight times fewer data on half-cheetah compared to the model-free Soft Actor Critic algorithm [Haarnoja et al., 2018].

2 Related work

Model choice in MBRL is delicate: we desire effective learning in both low-data regimes (at the beginning) and high-data regimes (in the later stages of the learning process). For this reason, Bayesian nonparametric models, such as Gaussian processes (GPs), are often the model of choice in MBRL, especially in low-dimensional problems where data efficiency is critical [Kocijan et al., 2004, Ko et al., 2007, Nguyen-Tuong et al., 2008, Grancharova et al., 2008, Deisenroth et al., 2014, Kamthe and Deisenroth, 2018]. However, such models introduce additional assumptions on the system, such as the smoothness assumption inherent in GPs with squared-exponential kernels [Rasmussen and Kuss, 2003]. Parametric function approximators have also been used extensively in MBRL [Hernandez and Arkun, 1990, Miller et al., 1990, Lin, 1992, Draeger et al., 1995], but were largely supplanted by Bayesian models in recent years. Methods based on local models, such as guided policy search algorithms [Levine et al., 2016, Finn et al., 2016, Chebotar et al., 2017], can efficiently train NN policies, but use time-varying linear models, which only locally model the system dynamics. Recent improvements in parametric function approximators, such as NNs, suggest that such methods are worth revisiting [Baranes and Oudeyer, 2013, Fu et al., 2016, Punjani and Abbeel, 2015, Lenz et al., 2015, Agrawal et al., 2016, Gal et al., 2016, Depeweg et al., 2016, Williams et al., 2017, Nagabandi et al., 2017].
Unlike Gaussian processes, NNs have constant-time inference and tractable training in the large data regime, and have the potential to represent more complex functions, including non-smooth dynamics that are often present in robotics [Fu et al., 2016, Mordatch et al., 2016, Nagabandi et al., 2017]. However, most works that use NNs focus on deterministic models, consequently suffering from overfitting in the early stages of learning. For this reason, our approach is able to achieve even higher data-efficiency than prior deterministic MBRL methods such as Nagabandi et al. [2017].

1Code available at https://github.com/kchua/handful-of-trials

Constructing good Bayesian NN models remains an open problem [MacKay, 1992, Neal, 1995, Osband, 2016, Guo et al., 2017], although recent promising work exists on incorporating dropout [Gal et al., 2017], ensembles [Osband et al., 2016, Lakshminarayanan et al., 2017], and α-divergence [Hernández-Lobato et al., 2016]. Such probabilistic NNs have previously been used for control, including using dropout [Gal et al., 2016, Higuera et al., 2018] and α-divergence [Depeweg et al., 2016]. In contrast to these prior methods, our experiments focus on more complex tasks with challenging dynamics – including contact discontinuities – and we compare directly to prior model-based and model-free methods on standard benchmark problems, where our method exhibits asymptotic performance that is comparable to model-free approaches.

3 Model-based reinforcement learning

We now detail the MBRL framework and the notation used.
Adhering to the Markov decision process formulation [Bellman, 1957], we denote the state s ∈ R^{d_s} and the actions a ∈ R^{d_a} of the system, the reward function r(s, a), and we consider dynamic systems governed by the transition function f_θ : R^{d_s + d_a} → R^{d_s} such that, given the current state s_t and current input a_t, the next state s_{t+1} is given by s_{t+1} = f(s_t, a_t). For probabilistic dynamics, we represent the conditional distribution of the next state given the current state and action as some parameterized distribution family: f_θ(s_{t+1}|s_t, a_t) = Pr(s_{t+1}|s_t, a_t; θ), overloading notation. Learning forward dynamics is thus the task of fitting an approximation f̃ of the true transition function f, given the measurements D = {(s_n, a_n), s_{n+1}}_{n=1}^N from the real system.

Once a dynamics model f̃ is learned, we use f̃ to predict the distribution over state-trajectories resulting from applying a sequence of actions. By computing the expected reward over state-trajectories, we can evaluate multiple candidate action sequences, and select the optimal action sequence to use. In Section 4 we discuss multiple methods for modeling the dynamics, and in Section 5 we detail how to compute the distribution over state-trajectories given a candidate action sequence.

4 Uncertainty-aware neural network dynamics models

Table 1: Model uncertainties captured.

Model                              Aleatoric uncertainty   Epistemic uncertainty
Baseline Models
  Deterministic NN (D)             No                      No
  Probabilistic NN (P)             Yes                     No
  Deterministic ensemble NN (DE)   No                      Yes
  Gaussian process baseline (GP)   Homoscedastic           Yes
Our Model
  Probabilistic ensemble NN (PE)   Yes                     Yes

This section describes several ways to model the task's true (but unknown) dynamic function, including our method: an ensemble of bootstrapped probabilistic neural networks. Whilst uncertainty-aware dynamics models have been explored in a number of prior works [Deisenroth et al., 2014, Gal et al., 2016, Depeweg et al., 2016], the particular details of the implementation and design decisions regarding the incorporation of uncertainty have not been rigorously analyzed empirically. As a result, prior work has found that expressive parametric models, such as deep neural networks, generally do not produce model-based RL algorithms that are competitive with their model-free counterparts in terms of asymptotic performance [Nagabandi et al., 2017], and often even found that simpler time-varying linear models can outperform expressive neural network models [Levine et al., 2016, Gu et al., 2016].

Any MBRL algorithm must select a class of model to predict the dynamics. This choice is often crucial for an MBRL algorithm, as even small bias can significantly influence the quality of the corresponding controller [Atkeson and Santamaría, 1997, Abbeel et al., 2006]. A major challenge is building a model that performs well in low and high data regimes: in the early stages of training, data is scarce, and highly expressive function approximators are liable to overfit; in the later stages of training, data is plentiful, but for systems with complex dynamics, simple function approximators might underfit. While Bayesian models such as GPs perform well in low-data regimes, they do not scale favorably with dimensionality and often use kernels ill-suited for discontinuous dynamics [Calandra et al., 2016], which is typical of robots interacting through contacts.

In this paper, we study how expressive NNs can be incorporated into MBRL. To account for uncertainty, we study NNs that model two types of uncertainty. The first type, aleatoric uncertainty, arises from inherent stochasticities of a system, e.g.
observation noise and process noise. Aleatoric uncertainty can be captured by outputting the parameters of a parameterized distribution, while still training the network discriminatively. The second type – epistemic uncertainty – corresponds to subjective uncertainty about the dynamics function, due to a lack of sufficient data to uniquely determine the underlying system exactly. In the limit of infinite data, epistemic uncertainty should vanish, but for datasets of finite size, subjective uncertainty remains when predicting transitions. It is precisely this subjective epistemic uncertainty at which Bayesian modeling excels, which helps mitigate overfitting. Below, we describe how we use combinations of 'probabilistic networks' to capture aleatoric uncertainty and 'ensembles' to capture epistemic uncertainty. Each combination is summarized in Table 1.

Probabilistic neural networks (P) We define a probabilistic NN as a network whose output neurons simply parameterize a probability distribution function, capturing aleatoric uncertainty; this should not be confused with Bayesian inference. We use the negative log prediction probability as our loss function loss_P(θ) = −∑_{n=1}^N log f̃_θ(s_{n+1}|s_n, a_n). For example, we might define our predictive model to output a Gaussian distribution with diagonal covariances parameterized by θ and conditioned on s_n and a_n, i.e.: f̃ = Pr(s_{t+1}|s_t, a_t) = N(μ_θ(s_t, a_t), Σ_θ(s_t, a_t)). Then the loss becomes

loss_Gauss(θ) = ∑_{n=1}^N [μ_θ(s_n, a_n) − s_{n+1}]^⊤ Σ_θ^{−1}(s_n, a_n) [μ_θ(s_n, a_n) − s_{n+1}] + log det Σ_θ(s_n, a_n).   (1)

Such network outputs, which in our particular case parameterize a Gaussian distribution, model aleatoric uncertainty, otherwise known as heteroscedastic noise (meaning the output distribution is a function of the input). However, they do not model epistemic uncertainty, which cannot be captured with purely discriminative training. A Gaussian distribution is a common choice for continuous-valued states, and reasonable if we assume that any stochasticity in the system is unimodal. However, in general, any tractable distribution class can be used. To provide for an expressive dynamics model, we can represent the parameters of this distribution (e.g., the mean and covariance of a Gaussian) as nonlinear, parametric functions of the current state and action, which can be arbitrarily complex but deterministic. This makes it feasible to incorporate NNs into a probabilistic dynamics model even for high-dimensional and continuous states and actions. Finally, an under-appreciated detail of probabilistic networks is that their variance has arbitrary values for out-of-distribution inputs, which can disrupt planning. We discuss how to mitigate this issue in Appendix ??.

Deterministic neural networks (D) For comparison, we define a deterministic NN as a special-case probabilistic network that outputs delta distributions centered around point predictions denoted as f̃_θ(s_t, a_t): f̃_θ(s_{t+1}|s_t, a_t) = Pr(s_{t+1}|s_t, a_t) = δ(s_{t+1} − f̃_θ(s_t, a_t)), trained using the MSE loss: loss_D(θ) = ∑_{n=1}^N ‖s_{n+1} − f̃_θ(s_n, a_n)‖².
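As a concrete illustration of Eq. (1), here is a minimal sketch in plain Python of the per-transition Gaussian negative log-likelihood with diagonal covariance. This is our own hypothetical helper, not the authors' code: the name `gaussian_nll` and its list-based inputs are assumptions for illustration.

```python
import math

def gaussian_nll(mu, var, target):
    """Gaussian negative log-likelihood (up to an additive constant) for one
    transition with diagonal covariance: the Mahalanobis term plus the
    log-determinant term of Eq. (1)."""
    assert len(mu) == len(var) == len(target)
    mahalanobis = sum((m - t) ** 2 / v for m, t, v in zip(mu, target, var))
    log_det = sum(math.log(v) for v in var)
    return mahalanobis + log_det

# With unit variance the log-det term vanishes and the loss reduces to the
# squared error used by the deterministic (D) baseline.
print(gaussian_nll([0.5, -1.0], [1.0, 1.0], [0.5, -1.0]))  # -> 0.0
```

Note that minimizing this loss jointly fits the mean and the (input-dependent) variance, which is what lets the network express heteroscedastic aleatoric noise.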
Although MSE can be interpreted as loss_P(θ) with a Gaussian model of fixed unit variance, in practice this variance cannot be used for uncertainty-aware propagation, since it does not correspond to any notion of uncertainty (e.g., a deterministic model with infinite data would be adding variance to particles for no good reason).

Ensembles (DE and PE) A principled means to capture epistemic uncertainty is with Bayesian inference. Whilst accurate Bayesian NN inference is possible with sufficient compute [Neal, 1995], approximate inference methods [Blundell et al., 2015, Gal et al., 2017, Hernández-Lobato and Adams, 2015] have enjoyed recent popularity given their simpler implementation and faster training times. Ensembles of bootstrapped models are simpler still: given a base model, no additional (hyper-)parameters need be tuned, whilst still providing reasonable uncertainty estimates [Efron and Tibshirani, 1994, Osband, 2016, Kurutach et al., 2018]. We consider ensembles of B-many bootstrap models, using θ_b to refer to the parameters of our b-th model f̃_{θ_b}. Ensembles can be composed of deterministic models (DE) or probabilistic models (PE) – as done by Lakshminarayanan et al. [2017] – both of which define predictive probability distributions: f̃_θ = (1/B) ∑_{b=1}^B f̃_{θ_b}. A visual example is provided in Appendix ??. Each of our bootstrap models has its own dataset D_b, generated by sampling (with replacement) N times from the dynamics dataset D recorded so far, where N is the size of D. We found B = 5 sufficient for all our experiments. To validate the number of layers and neurons of our models, we can visualize one-step predictions (e.g. Appendix ??).

5 Planning and control with learned dynamics

This section describes different ways uncertainty can be incorporated into planning using probabilistic dynamics models. Once a model f̃_θ is learned, we can use it for control by predicting the future outcomes of candidate policies or actions and then selecting the particular candidate that is predicted to result in the highest reward. MBRL planning in discrete time over long time horizons is generally performed by using the dynamics model to recursively predict how an estimated Markov state will evolve from one time step to the next, e.g.: s_{t+2} ∼ Pr(s_{t+2}|s_{t+1}, a_{t+1}) where s_{t+1} ∼ Pr(s_{t+1}|s_t, a_t). When planning, we might consider each action a_t to be a function of state, forming a policy π : s_t → a_t, a function to optimize. Alternatively, we can plan and optimize for a sequence of actions, a process called model predictive control (MPC) [Camacho and Alba, 2013]. We use MPC in our own experiments for several reasons, including implementation simplicity, lower computational burden (no gradients), and no requirement to specify the task-horizon in advance, whilst achieving the same data-efficiency as Gal et al. [2016], who used a Bayesian NN with a policy to learn the cart-pole task in 2000 time steps. Our full algorithm is summarized in Section 6.

Given the state of the system s_t at time t, the prediction horizon T of the MPC controller, and an action sequence a_{t:t+T} = {a_t, . . . , a_{t+T}}, the probabilistic dynamics model f̃ induces a distribution over the resulting trajectories s_{t:t+T}. At each time step t, the MPC controller applies the first action a_t of the sequence of optimized actions arg max_{a_{t:t+T}} ∑_{τ=t}^{t+T} E_{f̃}[r(s_τ, a_τ)]. A common technique to compute the optimal action sequence is a random sampling shooting method, due to its parallelizability and ease of implementation. Nagabandi et al. [2017] use deterministic NN models and MPC with random shooting to achieve data-efficient control in higher dimensional tasks than what is feasible for GPs to model. Our work improves upon Nagabandi et al. [2017]'s data efficiency in two ways: First, we capture uncertainty in modeling and planning, to prevent overfitting in the low-data regime. Second, we use CEM [Botev et al., 2013] instead of random shooting, which samples actions from a distribution closer to previous action samples that yielded high reward.

Computing the expected trajectory reward using recursive state prediction in closed form is generally intractable. Multiple approaches to approximate uncertainty propagation can be found in the literature [Girard et al., 2002, Quiñonero-Candela et al., 2003]. These approaches can be categorized by how they represent the state distribution: deterministic, particle, and parametric methods. Deterministic methods use the mean prediction and ignore the uncertainty, particle methods propagate a set of Monte Carlo samples, and parametric methods include Gaussian or Gaussian mixture models, etc. Although parametric distributions have been successfully used in MBRL [Deisenroth et al., 2014], experimental results [Kupcsik et al., 2013] suggest that particle approaches can be competitive both computationally and in terms of accuracy, without making strong assumptions about the distribution used. Hence, we use particle-based propagation, specifically suited to our PE dynamics model, which distinguishes two types of uncertainty, detailed in Section 5.1. Unfortunately, little prior work has empirically compared the design decisions involved in choosing the particular propagation method. Thus, we compare against several baselines in Section 5.2. Visual examples are provided in Appendix ??.

5.1 Our state propagation method: trajectory sampling (TS)

Our method to predict plausible state trajectories begins by creating P particles from the current state, s^p_{t=0} = s_0 ∀ p. Each particle is then propagated by: s^p_{t+1} ∼ f̃_{θ_{b(p,t)}}(s^p_t, a_t), according to a particular bootstrap b(p, t) in {1, . . . , B}, where B is the number of bootstrap models in the ensemble. A particle's bootstrap index can potentially change as a function of time t. We consider two TS variants:
• TS1 refers to particles uniformly re-sampling a bootstrap per time step. If we were to consider an ensemble as a Bayesian model, the particles would effectively be continually re-sampling from the approximate marginal posterior of plausible dynamics. We consider TS1's bootstrap re-sampling to place a soft restriction on trajectory multimodality: particle separation cannot be attributed to the compounding effects of differing bootstraps using TS1.
• TS∞ refers to particle bootstraps never changing during a trial. An ensemble is a collection of plausible models, which together represent the subjective uncertainty in function space of the true dynamics function f, which we assume is time invariant. TS∞ captures such time invariance since each particle's bootstrap index is made consistent over time. An advantage of using TS∞ is that aleatoric and epistemic uncertainties are separable [Depeweg et al., 2018]. Specifically, aleatoric state variance is the average variance of particles of the same bootstrap, whilst epistemic state variance is the variance of the average of particles of the same bootstrap indexes. Epistemic is the 'learnable' type of uncertainty, useful for directed exploration [Thrun, 1992]. Without a way to distinguish epistemic uncertainty from aleatoric, an exploration algorithm (e.g.
Bayesian optimization) might mistakenly choose actions with high predicted reward-variance 'hoping to learn something' when in fact such variance is caused by persistent and irreducible system stochasticity offering zero exploration value.

Both TS variants can capture multi-modal distributions and can be used with any probabilistic model. We found P = 20 and B = 5 sufficient in all our experiments.

5.2 Baseline state propagation methods for comparison

To validate our state propagation method, in the experiments of Section 7.2 we compare against four alternative state propagation methods, which we now discuss.

Expectation (E) To judge the importance of our TS method using multiple particles to represent a distribution, we compare against the aforementioned deterministic propagation technique. The simplest way to plan is iteratively propagating the expected prediction at each time step (ignoring uncertainty): s_{t+1} = E[f̃_θ(s_t, a_t)]. An advantage of this approach over TS is reduced computation and simple implementation: only a single particle is propagated. The main disadvantage of choosing E over TS is that small model biases can compound quickly over time, with no way to tell the quality of the state estimate.

Moment matching (MM) Whilst TS's particles can represent multimodal distributions, forcing a unimodal distribution via moment matching (MM) can (in some cases) benefit MBRL data efficiency [Gal et al., 2016]. Although unclear why, Gal et al. [2016] (who use Gaussian MM) hypothesize this effect may be caused by smoothing of the loss surface and implicitly penalizing multi-modal distributions (which often only occur with uncontrolled systems). To test this hypothesis we use Gaussian MM as a baseline and assume independence between bootstraps and particles for simplicity: s^p_{t+1} ∼iid N(E_{p,b}[s^{p,b}_{t+1}], V_{p,b}[s^{p,b}_{t+1}]), where s^{p,b}_{t+1} ∼ f̃_{θ_b}(s^p_t, a_t). Future work might consider other distributions too, such as the Laplace distribution.

Distribution sampling (DS) The previous MM approach made a strong unimodal assumption about state distributions: the state distribution at each time step was re-cast as Gaussian. A softer restriction on multimodality – between MM and TS – is to moment match w.r.t. the bootstraps only (noting the particles are otherwise independent if B = 1). This means that we effectively smooth the loss function w.r.t. epistemic uncertainty only (the uncertainty relevant to learning), whilst the aleatoric uncertainty remains free to be multimodal. We call this method distribution sampling (DS): s^p_{t+1} ∼ N(E_b[s^{p,b}_{t+1}], V_b[s^{p,b}_{t+1}]), with s^{p,b}_{t+1} ∼ f̃_{θ_b}(s^p_t, a_t).

6 Algorithm summary

Here we summarize our MBRL method PETS in Algorithm 1. We use the PE model to capture heteroskedastic aleatoric uncertainty and heteroskedastic epistemic uncertainty, which the TS planning method was able to best use.
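The trajectory sampling step at the heart of this pipeline can be sketched in a few lines of plain Python. This is our own simplified rendering, not the released code: the toy linear-Gaussian `ensemble` stands in for trained bootstrap networks, and the names `propagate_ts` and `step_fns` are our assumptions.

```python
import random

def propagate_ts(step_fns, s0, actions, num_particles, variant="TS1", seed=0):
    """Propagate particles through an ensemble of sampled dynamics models.

    step_fns: list of B functions (s, a, rng) -> next-state sample, one per
              bootstrap model.
    variant:  "TS1" re-samples each particle's bootstrap index every time
              step; "TSinf" fixes it for the whole trial.
    Returns a list of num_particles trajectories (lists of states).
    """
    rng = random.Random(seed)
    B = len(step_fns)
    particles = [s0] * num_particles
    bootstrap = [rng.randrange(B) for _ in range(num_particles)]  # TSinf choice
    trajectories = [[s0] for _ in range(num_particles)]
    for a in actions:
        for p in range(num_particles):
            if variant == "TS1":
                bootstrap[p] = rng.randrange(B)  # uniform re-sampling per step
            particles[p] = step_fns[bootstrap[p]](particles[p], a, rng)
            trajectories[p].append(particles[p])
    return trajectories

# Toy ensemble of two "bootstraps": disagreeing linear dynamics plus a small
# aleatoric noise term sampled at every step.
ensemble = [
    lambda s, a, rng: 0.9 * s + a + rng.gauss(0.0, 0.01),
    lambda s, a, rng: 1.1 * s + a + rng.gauss(0.0, 0.01),
]
trajs = propagate_ts(ensemble, 0.0, [1.0] * 5, num_particles=20, variant="TSinf")
```

Under TS∞, averaging within versus across bootstrap indices in `trajs` is what separates aleatoric from epistemic state variance, as described above.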
To guide the random shooting method of our MPC algorithm, we found that the CEM method learned faster (as discussed in Appendix ??).

Algorithm 1 Our model-based MPC algorithm 'PETS':
1: Initialize data D with a random controller for one trial.
2: for Trial k = 1 to K do
3:   Train a PE dynamics model f̃ given D.
4:   for Time t = 0 to TaskHorizon do
5:     for Actions sampled a_{t:t+T} ∼ CEM(·), 1 to NSamples do
6:       Propagate state particles s^p_τ using TS and f̃|{D, a_{t:t+T}}.
7:       Evaluate actions as ∑_{τ=t}^{t+T} (1/P) ∑_{p=1}^P r(s^p_τ, a_τ).
8:       Update CEM(·) distribution.
9:     Execute first action a*_t (only) from optimal actions a*_{t:t+T}.
10:    Record outcome: D ← D ∪ {s_t, a*_t, s_{t+1}}.

Figure 3: Learning curves for different tasks and algorithms. For all tasks, our algorithm learns in under 100K time steps or 100 trials. With the exception of Cartpole, which is sufficiently low-dimensional to efficiently learn a GP model, our proposed algorithm significantly outperforms all other baselines. For each experiment, one time step equals 0.01 seconds, except Cartpole with 0.02 seconds. For visual clarity, we plot the average over 10 experiments of the maximum rewards seen so far.

7 Experimental results

We now evaluate the performance of our proposed MBRL algorithm called PETS using a deep neural network probabilistic dynamics model. First, we compare our approach on standard benchmark tasks against state-of-the-art model-free and model-based approaches in Section 7.1. Then, in Section 7.2, we provide a detailed evaluation of the individual design decisions in the model and uncertainty propagation method and analyze their effect on performance. Additional considerations of horizon length, action sampling distribution, and stochastic systems are discussed in Appendix ??.
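The CEM update loop inside Algorithm 1 (lines 5-8) can be sketched as follows. This is a minimal, assumption-laden illustration: a 1-D toy objective with a known reward stands in for the TS particle rollouts, and the names `cem_plan` and `evaluate` are ours, not from the released code.

```python
import random

def cem_plan(evaluate, horizon, iters=5, pop=50, elites=10, seed=0):
    """Cross-entropy method over action sequences: sample sequences from a
    factorized Gaussian, keep the top `elites` by predicted reward, refit
    the mean/std to the elites, repeat, and return the final mean sequence
    (of which MPC would execute only the first action)."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        samples = [[rng.gauss(mean[t], std[t]) for t in range(horizon)]
                   for _ in range(pop)]
        samples.sort(key=evaluate, reverse=True)
        elite = samples[:elites]
        mean = [sum(seq[t] for seq in elite) / elites for t in range(horizon)]
        std = [max(1e-3, (sum((seq[t] - mean[t]) ** 2 for seq in elite)
                          / elites) ** 0.5) for t in range(horizon)]
    return mean

# Toy objective: reward peaks when every action equals 0.7. In PETS, this
# evaluation would instead average r(s, a) over TS particle rollouts.
best = cem_plan(lambda seq: -sum((a - 0.7) ** 2 for a in seq), horizon=4)
first_action = best[0]  # MPC executes only this action, then replans
```

Because sampling concentrates around previously high-reward sequences, far fewer candidate sequences are needed than with uniform random shooting.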
The experiment setup is shown in Figure 2, and NN architecture details are discussed in the supplementary materials, in Appendix ??. Videos of the experiments and code for reproducing them can be found at https://sites.google.com/view/drl-in-a-handful-of-trials.

Figure 2: Tasks evaluated. (a) Cartpole, (b) 7-dof Pusher, (c) 7-dof Reacher, (d) Half-cheetah.

7.1 Comparisons to prior reinforcement learning algorithms

We compare our Algorithm 1 against the following reinforcement learning algorithms for continuous state-action control:
• Proximal policy optimization (PPO): [Schulman et al., 2017] is a model-free, deep policy-gradient RL algorithm (we used the implementation from Dhariwal et al. [2017]).
• Deep deterministic policy gradient (DDPG): [Lillicrap et al., 2016] is an off-policy model-free deep actor-critic algorithm (we used the implementation from Dhariwal et al. [2017]).
• Soft actor critic (SAC): [Haarnoja et al., 2018] is a model-free deep actor-critic algorithm, which reports better data-efficiency than DDPG on MuJoCo benchmarks (we obtained the authors' data).
• Model-based model-free hybrid (MBMF): [Nagabandi et al., 2017] is a recent deterministic deep model-based RL algorithm, which we reimplement.
• Gaussian process dynamics model (GP): we compare against three MBRL algorithms based on GPs. GP-E learns a GP model, but only propagates the expectation. GP-DS uses the propagation method DS. GP-MM is the algorithm proposed by Kamthe and Deisenroth [2018], except that we do not update the dynamics model after each transition, but only at the end of each trial.
2017] (D-E)PPOat convergenceSAC at convergencePPO GP-E[Kamthe et al. 2018](GP-MM)SACGP-DSDDPGDDPG at convergence-2001800012000600000 100000 200000 300000 400000\fFigure 4: Final performance for different tasks, models, and uncertainty propagation techniques. The\nmodel choice seems to be more important than the technique used to propagate the state/action space.\nAmong the models the ranking in terms of performance is: P E > P > DE > D. A linear model\ncomparison can also be seen in Appendix ??.\n\nThe results of the comparison are presented in Figure 3. Our method reaches performance that is\nsimilar to the asymptotic performance of the state-of-the-art model-free baseline PPO. However, PPO\nrequires several orders of magnitude more samples to reach this point. We reach PPO\u2019s asymptotic\nperformance in fewer than 100 trials on all four tasks, faster than any prior model-free algorithm, and\nthe asymptotic performance substantially exceeds that of the prior MBRL algorithm by Nagabandi\net al. [2017], which corresponds to the deterministic variant of our approach (D-E). This result\nhighlights the value of uncertainty estimation. Moreover, the experiments show that NNs dynamics\ncan achieve similar performance to GPs on low-dimensional tasks (i.e., cartpole), while also scaling\nto higher dimensional tasks such as half-cheetah. Whilst the probabilistic baseline GP-MM slightly\noutperformed our method in cartpole, GP-MM scales cubically in time and quadratically in state\ndimensionality, so was infeasible to run on the remaining higher dimensional tasks. It is worth noting\nthat model-based deep RL algorithms have typically been considered to be ef\ufb01cient but incapable of\nachieving similar asymptotic performance as their model-free counterparts. 
Our results demonstrate that a purely model-based deep RL algorithm that only learns a dynamics model, omitting even a parameterized policy, can achieve comparable performance when properly incorporating uncertainty estimation during modeling and planning. In the next section, we study which specific design decisions and components of our approach are important for achieving this level of performance.

7.2 Analyzing dynamics modeling and uncertainty propagation

In this section, we compare different choices for the dynamics model in Section 4 and the uncertainty propagation technique in Section 5. The results in Figure 4 first show that, w.r.t. model choice, the model should consider both uncertainty types: the probabilistic ensembles (PE-XX) perform best in all tasks except cartpole ('X' symbolizes any character). Close seconds are the single-probability-type models: the probabilistic network (P-XX) and ensembles of deterministic networks (DE-XX). Worst is the deterministic network (D-E).
These observations shed some light on the role of uncertainty in MBRL, particularly as it relates to discriminatively trained, expressive parametric models such as NNs. Our results suggest that the quality of the model and the use of uncertainty at learning time significantly affect the performance of the MBRL algorithms tested, while the use of more advanced uncertainty propagation techniques seems to offer only minor improvements. We reconfirm that moment matching (MM) is competitive in low-dimensional tasks (consistent with [Gal et al., 2016]), but it is not a reliable MBRL choice in higher dimensions, e.g., the half-cheetah.
The analysis provided in this section summarizes the experiments we conducted to design our algorithm.
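As an illustration of the trajectory-sampling (TS) propagation compared above, the following is a minimal numpy sketch: the toy linear-Gaussian "bootstraps" stand in for the learned probabilistic networks, and the ensemble size, particle count, and dynamics are illustrative assumptions. The only difference between the TS∞ and TS1 variants is whether each particle keeps one bootstrap for the whole rollout or resamples one at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "probabilistic ensemble": B bootstraps, each predicting a Gaussian
# next state via its own linear map (stands in for the learned networks).
B, state_dim, P, T = 5, 3, 20, 10
weights = [np.eye(state_dim) * (0.9 + 0.02 * b) for b in range(B)]

def predict(b, s):
    # Sample from bootstrap b's Gaussian prediction (aleatoric noise).
    return s @ weights[b] + 0.01 * rng.standard_normal(s.shape)

def propagate(s0, ts_inf=True):
    """Propagate P particles for T steps. TS-inf fixes each particle's
    bootstrap for the whole rollout; TS1 resamples a bootstrap each step."""
    particles = np.tile(s0, (P, 1))
    bootstrap = rng.integers(B, size=P)  # one model index per particle
    for _ in range(T):
        if not ts_inf:
            bootstrap = rng.integers(B, size=P)  # TS1: resample each step
        for p in range(P):
            particles[p] = predict(bootstrap[p], particles[p])
    return particles  # reward is then averaged over particles per time step

states = propagate(np.ones(state_dim))
```

Keeping the bootstrap fixed (TS∞) separates the ensemble's epistemic disagreement from the per-step aleatoric noise, which is why the two variants can behave differently even with the same model.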
It is worth noting that the individual components of our method – ensembles, probabilistic networks, and various approximate uncertainty propagation techniques – have existed in various forms in supervised learning and RL. However, as our experiments here and in the previous section show, the particular choice of these components in our algorithm achieves substantially improved results over previous state-of-the-art model-based and model-free methods, experimentally confirming both the importance of uncertainty estimation in MBRL and the potential for MBRL to achieve asymptotic performance that is comparable to the best model-free methods at a fraction of the sample complexity.

8 Discussion & conclusion

Our experiments suggest several conclusions that are relevant for further investigation in model-based reinforcement learning. First, our results show that model-based reinforcement learning with neural network dynamics models can achieve results that are competitive not only with Bayesian nonparametric models such as GPs, but also on par with model-free algorithms such as PPO and SAC in terms of asymptotic performance, while attaining substantially more efficient convergence.
Although the individual components of our model-based reinforcement learning algorithm are not individually new – prior works have suggested both ensembling and outputting Gaussian distribution parameters [Lakshminarayanan et al., 2017], as well as the use of MPC for model-based RL [Nagabandi et al., 2017] – the particular combination of these components into a model-based reinforcement learning algorithm is, to our knowledge, novel, and the results provide a new state of the art for model-based reinforcement learning algorithms based on high-capacity parametric models such as neural networks. The systematic investigation in our experiments was a critical ingredient in determining the precise combination of these components that attains the best performance.
Our results indicate that the gap in asymptotic performance between model-based and model-free reinforcement learning can, at least in part, be bridged by incorporating uncertainty estimation into the model learning process. Our experiments further indicate that both epistemic and aleatoric uncertainty play a crucial role in this process. Our analysis considers a model-based algorithm based on dynamics estimation and planning. A compelling alternative class of methods uses the model to train a parameterized policy [Ko et al., 2007, Deisenroth et al., 2014, McAllister and Rasmussen, 2017]. While the choice of using the model for planning versus policy learning is largely orthogonal to the other design choices, a promising direction for future work is to investigate how policy learning can be incorporated into our framework to amortize the cost of planning at test-time. Our initial experiments with policy learning did not yield an effective algorithm when directly propagating gradients through our uncertainty-aware models. We believe this may be due to chaotic policy gradients, whose recent analysis [Parmas et al., 2018] could help yield a policy-based PETS in future work.
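The probabilistic-network component borrowed from Lakshminarayanan et al. [2017] trains each ensemble member by outputting Gaussian parameters and minimizing a heteroscedastic negative log-likelihood. A minimal numpy sketch of that loss (the network itself is abstracted away to its mean/log-variance outputs; the example values are illustrative):

```python
import numpy as np

def gaussian_nll(mean, log_var, target):
    """Per-sample negative log-likelihood of `target` under a diagonal
    Gaussian N(mean, exp(log_var)). Predicting the variance lets each
    network capture aleatoric uncertainty; disagreement across ensemble
    members then captures epistemic uncertainty."""
    inv_var = np.exp(-log_var)
    # Sum over state dimensions; constant log(2*pi) term omitted.
    return 0.5 * np.sum((target - mean) ** 2 * inv_var + log_var, axis=-1)

# A well-calibrated prediction scores better than an overconfident one
# with the same mean error.
target = np.array([[1.0, 2.0]])
mean = np.array([[0.0, 2.0]])            # off by 1 in the first dimension
calibrated = gaussian_nll(mean, np.log([[1.0, 0.1]]), target)
overconfident = gaussian_nll(mean, np.log([[0.01, 0.1]]), target)
```

The penalty on the overconfident prediction is exactly the property that pushes the learned model toward honest variance estimates rather than shrinking them to zero.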
Finally, the observation that model-based RL can match the performance of model-free algorithms suggests that substantial further investigation of such methods is in order, as a potential avenue for effective, sample-efficient, and practical general-purpose reinforcement learning.

References

P. Abbeel, M. Quigley, and A. Y. Ng. Using inaccurate models in reinforcement learning. In International Conference on Machine Learning (ICML), pages 1-8, 2006.

P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Neural Information Processing Systems (NIPS), pages 5074-5082, 2016.

C. G. Atkeson and J. C. Santamaría. A comparison of direct and model-based reinforcement learning. In International Conference on Robotics and Automation (ICRA), 1997.

A. Baranes and P.-Y. Oudeyer. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61(1):49-73, 2013.

R. Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679-684, 1957.

C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. International Conference on Machine Learning (ICML), 37:1613-1622, 2015.

Z. I. Botev, D. P. Kroese, R. Y. Rubinstein, and P. L'Ecuyer. The cross-entropy method for optimization. In Handbook of Statistics, volume 31, pages 35-59. Elsevier, 2013.

S. H. Brooks. A discussion of random methods for seeking maxima. Operations Research, 6(2):244-251, 1958.

R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian processes for regression. In International Joint Conference on Neural Networks (IJCNN), pages 3338-3345, 2016.

E. F. Camacho and C. B. Alba. Model Predictive Control. Springer Science & Business Media, 2013.

Y. Chebotar, K. Hausman, M. Zhang, G. Sukhatme, S. Schaal, and S. Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In International Conference on Machine Learning (ICML), 2017.

M. Deisenroth, D. Fox, and C. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 37(2):408-423, 2014.

S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. ArXiv e-prints, May 2016.

S. Depeweg, J.-M. Hernandez-Lobato, F. Doshi-Velez, and S. Udluft. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In International Conference on Machine Learning (ICML), pages 1192-1201, 2018.

P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.

A. Draeger, S. Engell, and H. Ranke. Model predictive control using neural networks. IEEE Control Systems, 15(5):61-66, Oct 1995.

B. Efron and R. Tibshirani. An Introduction to the Bootstrap. CRC Press, 1994.

C. Finn, X. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning. In International Conference on Robotics and Automation (ICRA), 2016.

J. Fu, S. Levine, and P. Abbeel. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4019-4026, 2016.

Y. Gal, R. McAllister, and C. Rasmussen. Improving PILCO with Bayesian neural network dynamics models. ICML Workshop on Data-Efficient Machine Learning, 2016.

Y. Gal, J. Hron, and A. Kendall. Concrete dropout. In Neural Information Processing Systems (NIPS), pages 3584-3593, 2017.

A. Girard, C. E. Rasmussen, J. Quinonero-Candela, R. Murray-Smith, O. Winther, and J. Larsen. Multiple-step ahead prediction for non linear dynamic systems - a Gaussian process treatment with propagation of the uncertainty. Neural Information Processing Systems (NIPS), 15:529-536, 2002.

A. Grancharova, J. Kocijan, and T. A. Johansen. Explicit stochastic predictive control of combustion plants based on Gaussian process models. Automatica, 44(6):1621-1631, 2008.

S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning (ICML), pages 2829-2838, 2016.

C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. International Conference on Machine Learning (ICML), 2017.

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), volume 80, pages 1856-1865, 2018.

E. Hernandaz and Y. Arkun. Neural network modeling and an extended DMC algorithm to control nonlinear systems. In American Control Conference, pages 2454-2459, May 1990.

J. M. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning (ICML), pages 1861-1869, 2015.

J. M. Hernández-Lobato, Y. Li, M. Rowland, D. Hernández-Lobato, T. Bui, and R. E. Turner. Black-box α-divergence minimization. International Conference on Machine Learning (ICML), 48:1511-1520, 2016.

J. C. G. Higuera, D. Meger, and G. Dudek. Synthesizing neural network controllers with probabilistic model based reinforcement learning. arXiv preprint arXiv:1803.02291, 2018.

S. Kamthe and M. P. Deisenroth. Data-efficient reinforcement learning with probabilistic model predictive control. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

J. Ko, D. J. Klein, D. Fox, and D. Haehnel. Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In IEEE International Conference on Robotics and Automation (ICRA), pages 742-747, 2007.

J. Kober and J. Peters. Policy search for motor primitives in robotics. In Neural Information Processing Systems (NIPS), pages 849-856, 2009.

J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard. Gaussian process model based predictive control. In American Control Conference, volume 3, pages 2214-2219. IEEE, 2004.

A. G. Kupcsik, M. P. Deisenroth, J. Peters, and G. Neumann. Data-efficient generalization of robot skills with contextual policy search. In Conference on Artificial Intelligence (AAAI), pages 1401-1407, 2013.

T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.

B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Neural Information Processing Systems (NIPS), pages 6405-6416, 2017.

I. Lenz, R. Knepper, and A. Saxena. DeepMPC: Learning deep latent features for model predictive control. In Robotics Science and Systems (RSS), 2015.

S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334-1373, 2016.

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR), 2016.

L.-J. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, 1992.

D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448-472, 1992.

R. McAllister and C. E. Rasmussen. Data-efficient reinforcement learning in continuous state-action Gaussian-POMDPs. In Neural Information Processing Systems (NIPS), pages 2037-2046, 2017.

W. T. Miller, R. P. Hewes, F. H. Glanz, and L. G. Kraft. Real-time dynamic control of an industrial manipulator using a neural network-based learning controller. IEEE Transactions on Robotics and Automation, 6(1):1-9, Feb 1990.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

I. Mordatch, N. Mishra, C. Eppner, and P. Abbeel. Combining model-based policy search with online model learning for control of physical humanoids. In IEEE International Conference on Robotics and Automation (ICRA), pages 242-248, May 2016.

A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. ArXiv e-prints, Aug. 2017.

R. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.

D. Nguyen-Tuong, J. Peters, and M. Seeger. Local Gaussian process regression for real time online model learning. In Neural Information Processing Systems (NIPS), pages 1193-1200, 2008.

I. Osband. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. NIPS Workshop on Bayesian Deep Learning, 2016.

I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Neural Information Processing Systems (NIPS), pages 4026-4034, 2016.

P. Parmas, C. E. Rasmussen, J. Peters, and K. Doya. PIPPS: Flexible model-based policy search robust to the curse of chaos. In International Conference on Machine Learning (ICML), volume 80, pages 4062-4071, 2018.

A. Punjani and P. Abbeel. Deep learning helicopter dynamics models. In IEEE International Conference on Robotics and Automation (ICRA), pages 3223-3230, May 2015.

J. Quiñonero-Candela, A. Girard, J. Larsen, and C. E. Rasmussen. Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 701-704, April 2003.

P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

C. E. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In Neural Information Processing Systems (NIPS), volume 4, page 1, 2003.

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

S. Thrun. Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, Carnegie Mellon University, 1992.

E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026-5033, 2012.

G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou. Information theoretic MPC for model-based reinforcement learning. In International Conference on Robotics and Automation (ICRA), 2017.