{"title": "Data-Efficient Reinforcement Learning in Continuous State-Action Gaussian-POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 2040, "page_last": 2049, "abstract": "We present a data-efficient reinforcement learning method for continuous state-action systems under significant observation noise. Data-efficient solutions under small noise exist, such as PILCO which learns the cartpole swing-up task in 30s. PILCO evaluates policies by planning state-trajectories using a dynamics model. However, PILCO applies policies to the observed state, therefore planning in observation space. We extend PILCO with filtering to instead plan in belief space, consistent with partially observable Markov decisions process (POMDP) planning. This enables data-efficient learning under significant observation noise, outperforming more naive methods such as post-hoc application of a filter to policies optimised by the original (unfiltered) PILCO algorithm. We test our method on the cartpole swing-up task, which involves nonlinear dynamics and requires nonlinear control.", "full_text": "Data-Ef\ufb01cient Reinforcement Learning in\n\nContinuous State-Action Gaussian-POMDPs\n\nRowan Thomas McAllister\nDepartment of Engineering\n\nCambridge University\nCambridge, CB2 1PZ\nrtm26@cam.ac.uk\n\nCarl Edward Rasmussen\nDepartment of Engineering\nUniversity of Cambridge\n\nCambridge, CB2 1PZ\ncer54@cam.ac.uk\n\nAbstract\n\nWe present a data-ef\ufb01cient reinforcement learning method for continuous state-\naction systems under signi\ufb01cant observation noise. Data-ef\ufb01cient solutions under\nsmall noise exist, such as PILCO which learns the cartpole swing-up task in\n30s. PILCO evaluates policies by planning state-trajectories using a dynamics\nmodel. However, PILCO applies policies to the observed state, therefore planning\nin observation space. 
We extend PILCO with filtering to instead plan in belief space, consistent with partially observable Markov decision process (POMDP) planning. This enables data-efficient learning under significant observation noise, outperforming more naive methods such as post-hoc application of a filter to policies optimised by the original (unfiltered) PILCO algorithm. We test our method on the cartpole swing-up task, which involves nonlinear dynamics and requires nonlinear control.

1 Introduction

The Probabilistic Inference and Learning for COntrol (PILCO) [5] framework is a reinforcement learning algorithm which uses Gaussian Processes (GPs) to learn the dynamics in continuous state spaces. The method has been shown to be highly efficient in the sense that it can learn with only very few interactions with the real system. However, a serious limitation of PILCO is that it assumes that the observation noise level is small. There are two main reasons which make this assumption necessary. Firstly, the dynamics are learnt from the noisy observations, but learning the transition model in this way does not correctly account for the noise in the observations. If the noise is assumed small, then this will be a good approximation to the real transition function. Secondly, PILCO uses the noisy observation directly to calculate the action, which is problematic if the observation noise is substantial. Consider a policy controlling an unstable system, where high-gain feedback is necessary for good performance. Observation noise is amplified when the noisy input is fed directly to the high-gain controller, which in turn injects noise back into the state, creating cycles of increasing variance and instability.

In this paper we extend PILCO to address these two shortcomings, enabling PILCO to be used in situations with substantial observation noise. 
The first issue is addressed using the so-called Direct method for training the transition model, see section 3.3. The second problem can be tackled by filtering the observations. One way to look at this is that PILCO does planning in observation space, rather than in belief space. In this paper we extend PILCO to allow filtering of the state, by combining the previous state distribution with the dynamics model and the observation using Bayes rule. Note that this is easily done when the controller is being applied, but to gain the full benefit, we have to also take the filter into account when optimising the policy.

PILCO trains its policy by minimising the expected predicted loss when simulating the system and controller actions. Since the dynamics are not known exactly, the simulation in PILCO had to simulate distributions of possible trajectories of the physical state of the system. This was achieved using an analytical approximation based on moment-matching and Gaussian state distributions. In this paper we thus need to augment the simulation over physical states to include the state of the filter, an information state or belief state. A complication is that the belief state is itself a probability distribution, necessitating simulating distributions over distributions. This allows our algorithm to not only apply filtering during execution, but also anticipate the effects of filtering during training, thereby learning a better policy.

We will first give a brief outline of related work in section 2 and the original PILCO algorithm in section 3, including the proposed use of the 'Direct method' for training dynamics from noisy observations in section 3.3. In section 4 we derive the algorithm for POMDP training, or planning in belief space. 

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Note an assumption is that we observe noisy versions of the state variables. We\ndo not handle more general POMDPs where other unobserved states are also learnt nor learn any\nother mapping from the state space to observations other than additive Gaussian noise. In the \ufb01nal\nsections we show experimental results of our proposed algorithm handling observation noise better\nthan competing algorithms.\n\n2 Related work\n\nImplementing a \ufb01lter is straightforward when the system dynamics are known and linear, referred to\nas Kalman \ufb01ltering. For known nonlinear systems, the extended Kalman \ufb01lter (EKF) is often adequate\n(e.g. [13]), as long as the dynamics are approximately linear within the region covered by the belief\ndistribution. Otherwise, the EKF\u2019s \ufb01rst order Taylor expansion approximation breaks down. Larger\nnonlinearities warrant the unscented Kalman \ufb01lter (UKF) \u2013 a deterministic sampling technique to\nestimate moments \u2013 or particle methods [7, 12]. However, if moments can be computed analytically\nand exactly, moment-matching methods are preferred. Moment-matching using distributions from\nthe exponential family (e.g. Gaussians) is equivalent to optimising the Kullback-Leibler divergence\nKL(p||q) between the true distribution p and an approximate distribution q. In such cases, moment-\nmatching is less susceptible to model bias than the EKF due to its conservative predictions [4].\nUnfortunately, the literature does not provide a continuous state-action method that is both data\nef\ufb01cient and resistant to noise when the dynamics are unknown and locally nonlinear. Model-free\nmethods can solve many tasks but require thousands of trials to solve the cartpole swing-up task [8],\nopposed to model-based methods like PILCO which requires about six. Sometimes the dynamics are\npartially-known, with known functional form yet unknown parameters. 
Such \u2018grey-box\u2019 problems\nhave the aesthetic solution of incorporating the unknown dynamics parameters into the state, reducing\nthe learning task to a POMDP planning task [6, 12, 14]. Finite state-action space tasks can be similarly\nsolved, perhaps using Dirichlet parameters to model the \ufb01nitely-many state-action-state transitions\n[10]. However, such solutions are not suitable for continuous-state \u2018black-box\u2019 problems with no prior\ndynamics knowledge. The original PILCO framework does not assume task-speci\ufb01c prior dynamics\nknowledge (only that the prior is vague, encoding only time-independent dynamics and smoothness\non some unknown scale) yet assumes full state observability, failing under moderate sensor noise.\nOne proposed solution is to \ufb01lter observations during policy execution [4]. However, without also\npredicting system trajectories w.r.t. the \ufb01ltering process, a policy is merely optimised for un\ufb01ltered\ncontrol, not \ufb01ltered control. The mismatch between un\ufb01ltered-prediction and \ufb01ltered-execution\nrestricts PILCO\u2019s ability to take full advantage of \ufb01ltering. Dallaire et al. [3] optimise a policy using\na more realistic \ufb01ltered-prediction. However, the method neglects model uncertainty by using the\nmaximum a posteriori (MAP) model. Unlike the method of Deisenroth and Peters [4] which gives a\nfull probabilistic treatment of the dynamics predictions, work by Dallaire et al. [3] is therefore highly\nsusceptible to model error, hampering data-ef\ufb01ciency.\nWe instead predict system trajectories using closed loop \ufb01ltered control precisely because we execute\nclosed loop \ufb01ltered control. The resulting policies are thus optimised for the speci\ufb01c case in which\nthey are used. Doing so, our method retains the same data-ef\ufb01ciency properties of PILCO whilst\napplicable to tasks with high observation noise. 
To evaluate our method, we use the benchmark cartpole swing-up task with noisy sensors. We show that realistic and probabilistic prediction enable our method to outperform the aforementioned methods.

Algorithm 1 PILCO
1: Define policy's functional form: π : z_t × ψ → u_t.
2: Initialise policy parameters ψ randomly.
3: repeat
4:   Execute policy, record data.
5:   Learn dynamics model p(f).
6:   Predict state trajectories from p(X_0) to p(X_T).
7:   Evaluate policy: J(ψ) = Σ_{t=0}^{T} γ^t E_t, where E_t = E_X[cost(X_t) | ψ].
8:   Improve policy: ψ ← argmin_ψ J(ψ).
9: until policy parameters ψ converge

3 The PILCO algorithm

PILCO is a model-based policy-search RL algorithm, summarised by Algorithm 1. It applies to continuous-state, continuous-action, continuous-observation and discrete-time control tasks. After the policy is executed, the additional data is recorded to train a probabilistic dynamics model. The probabilistic dynamics model is then used to predict one-step system dynamics (from one timestep to the next). This allows PILCO to probabilistically predict multi-step system trajectories over an arbitrary time horizon T, by repeatedly using the predictive dynamics model's output at one timestep as the (uncertain) input in the following timestep. For tractability PILCO uses moment-matching to keep the latent state distribution Gaussian. The result is an analytic distribution of state-trajectories, approximated as a joint Gaussian distribution over T states. The policy is evaluated as the expected total cost of the trajectories, where the cost function is assumed to be known. Next, the policy is improved using local gradient-based optimisation, searching over policy-parameter space. A distinct advantage of moment-matched prediction for policy search over particle methods is smoother policy gradients and fewer local optima [9]. 
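To make the evaluate-improve loop of Algorithm 1 concrete, the sketch below runs moment-matched policy evaluation on a toy one-dimensional linear-Gaussian system, where the Gaussian state marginals are exact. The function names (`rollout_cost`, `improve_policy`), the toy dynamics, and the direct search over gains are our own illustrative assumptions, not the paper's implementation (which uses GP dynamics and gradient-based optimisation of ψ).

```python
import numpy as np

def rollout_cost(psi, A=1.0, B=0.1, q=0.05, T=20, gamma=0.98):
    """Moment-matched policy evaluation (Algorithm 1, lines 6-7) on a toy 1-D
    linear-Gaussian system x' = A*x + B*u + noise with linear policy u = -psi*x.
    Everything is linear-Gaussian, so the predicted state marginals N(mu, var)
    stay exactly Gaussian, mirroring PILCO's Gaussian trajectory approximation."""
    mu, var = 1.0, 0.0          # initial state distribution p(X_0)
    s2 = 0.25**2                # cost width sigma_c^2 (value taken from Section 5)
    J = 0.0
    for t in range(T):
        k = A - B * psi                       # closed-loop gain
        mu, var = k * mu, k * k * var + q     # one-step Gaussian prediction
        # E[1 - exp(-X^2/(2 s2))] for X ~ N(mu, var), in closed form
        J += gamma**t * (1.0 - np.sqrt(s2 / (s2 + var))
                         * np.exp(-0.5 * mu * mu / (s2 + var)))
    return J

def improve_policy(candidate_gains):
    """Algorithm 1, line 8: crude policy improvement by direct search."""
    return min(candidate_gains, key=rollout_cost)
```

A stabilising gain (driving the closed-loop coefficient towards zero) should score a lower expected total cost than no feedback at all.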
This process then repeats for a small number of iterations before converging to a locally optimal policy. We now discuss details of each step in Algorithm 1 below, with policy evaluation and improvement discussed in Appendix B.

3.1 Execution phase
Once a policy is initialised, PILCO can execute the system (Algorithm 1, line 4). Let the latent state of the system at time t be x_t ∈ R^D, which is noisily observed as z_t = x_t + ε_t, where ε_t ~iid N(0, Σ_ε). The policy π, parameterised by ψ, takes observation z_t as input, and outputs a control action u_t = π(z_t, ψ) ∈ R^F. Applying action u_t to the dynamical system in state x_t results in a new system state x_{t+1}. Repeating until horizon T results in a new single state-trajectory of data.

3.2 Learning dynamics
To learn the unknown dynamics (Algorithm 1, line 5), any probabilistic model flexible enough to capture the complexity of the dynamics can be used. Bayesian nonparametric models are particularly suited given their resistance to both overfitting and underfitting. Overfitting otherwise leads to model bias, the result of optimising the policy on the erroneous model. Underfitting limits the complexity of the system this method can learn to control. In a nonparametric model no prior dynamics knowledge is required, not even knowledge of how complex the unknown dynamics might be, since the model's complexity grows with the available data. We define the latent dynamics f : x̃_t → x_{t+1}, where x̃_t := [x_t^T, u_t^T]^T. PILCO models the dynamics with D independent Gaussian process (GP) priors, one for each dynamics output variable: f^a : x̃_t → x^a_{t+1}, where a ∈ [1, D] is the a'th dynamics output, and f^a ~ GP(φ_a^T x̃, k_a(x̃_i, x̃_j)). Note we implement PILCO with a linear mean function¹, φ_a^T x̃, where φ_a are additional hyperparameters trained by optimising the marginal likelihood [11, Section 2.7]. The covariance function k_a is squared exponential, with length scales Λ_a = diag([l²_{a,1}, ..., l²_{a,D+F}]) and signal variance s²_a:

k_a(x̃_i, x̃_j) = s²_a exp( −½ (x̃_i − x̃_j)^T Λ_a^{−1} (x̃_i − x̃_j) ).

¹ The original PILCO [5] instead uses a zero mean function, and instead predicts relative changes in state.

3.3 Learning dynamics from noisy observations
The original PILCO algorithm ignored sensor noise when training each GP by assuming each observation z_t to be the latent state x_t. However, this approximation breaks down under significant noise. More complex training schemes are required for each GP that correctly treat each training datum x_t as latent, yet noisily-observed as z_t. We resort to GP state space model methods, specifically the 'Direct method' [9, section 3.5]. The Direct method infers the marginal likelihood p(z_{1:N}) approximately using moment-matching in a single forward-pass. Doing so, it specifically exploits the time series structure that generated observations z_{1:N}. We use the Direct method to set the GP's training data {x_{1:N}, u_{1:N}} and observation noise variance Σ_ε to the inducing point parameters and noise parameters that optimise the marginal likelihood. In this paper we use the superior Direct method to train GPs, both in our extended version of PILCO presented in section 4, and in our implementation of the original PILCO algorithm for fair comparison in the experiments.

3.4 Prediction phase
In contrast to the execution phase, PILCO also predicts analytic distributions of state-trajectories (Algorithm 1, line 6) for policy evaluation. PILCO does this offline, between the online system executions. 
Predicted control is identical to executed control, except each aforementioned quantity is now a random variable, distinguished with capitals: X_t, Z_t, U_t, X̃_t and X_{t+1}, all approximated as jointly Gaussian. These variables interact both in execution and prediction according to Figure 1. To predict X_{t+1} now that X̃_t is uncertain, PILCO uses the iterated laws of expectation and variance:

p(X_{t+1}) = N(µ^x_{t+1}, Σ^x_{t+1}),  µ^x_{t+1} = E_{X̃}[E_f[f(X̃_t)]],  Σ^x_{t+1} = V_{X̃}[E_f[f(X̃_t)]] + E_{X̃}[V_f[f(X̃_t)]].   (1)

After a one-step prediction from X_0 to X_1, PILCO repeats the process from X_1 to X_2, and up to X_T, resulting in a multi-step prediction whose joint we refer to as a distribution over state-trajectories.

4 Our method: PILCO extended with Bayesian filtering

Here we describe the novel aspects of our method. Our method uses the same high-level algorithm as PILCO (Algorithm 1). However, we modify (using PILCO's source code http://mlg.eng.cam.ac.uk/pilco/) two subroutines to extend PILCO from MDPs to a special case of POMDPs (specifically where the partial observability has the form of additive Gaussian noise on the unobserved state X). First, we filter observations during system execution (Algorithm 1, line 4), detailed in Section 4.1. Second, we predict belief-trajectories instead of state-trajectories (line 6), detailed in Section 4.2. Filtering maintains a belief posterior of the latent system state. The belief is conditioned on, not just the most recent observation, but all previous observations (Figure 2). Such additional conditioning has the benefit of providing a less-noisy and more-informed input to the policy: the filtered belief-mean instead of the raw observation z_t. 
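The moment-matched prediction of Eq. (1) rests on the iterated laws of expectation and variance. The snippet below is a hypothetical Monte Carlo illustration of those two laws; PILCO computes the moments analytically for its GP model, and `mean_fn`/`var_fn` here merely stand in for a model's predictive mean and variance.

```python
import numpy as np

def moment_match_mc(mean_fn, var_fn, mu_x, var_x, n=200_000, seed=0):
    """Monte Carlo illustration of Eq. (1):
        mu'  = E_X[E_f[f(X)]]
        var' = V_X[E_f[f(X)]] + E_X[V_f[f(X)]]
    for a scalar uncertain input X ~ N(mu_x, var_x). mean_fn/var_fn play the
    role of a GP's predictive mean and variance at a given input."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_x, np.sqrt(var_x), size=n)   # samples of the uncertain input
    m, v = mean_fn(x), var_fn(x)
    mu_out = m.mean()                              # E_X[E_f[f(X)]]
    var_out = m.var() + v.mean()                   # V_X[E_f] + E_X[V_f]
    return mu_out, var_out
```

For a linear mean function and constant predictive variance the exact moments are available by hand, which gives a quick sanity check of the estimator.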
Our implementation continues PILCO's distinction between executing the system (resulting in a single real belief-trajectory) and predicting the system's responses (which in our case yields an analytic distribution of multiple possible future belief-trajectories). During the execution phase, the system reads specific observations z_t. Our method additionally maintains a belief state b ~ N(m, V) by filtering observations. This belief state b can be treated as a random variable with a distribution parameterised by belief-mean m and belief-certainty V, seen in Figure 3. Note both m and V are functions of previous observations z_{1:t}. Now, during the (probabilistic) prediction phase, future observations are instead random variables (since they have not been observed yet), distinguished as Z. Since the belief parameters m and V are

Figure 1: The original (unfiltered) PILCO, as a probabilistic graphical model. At each timestep, the latent system X_t is observed noisily as Z_t, which is inputted directly into policy function π to decide action U_t. Finally, the latent system will evolve to X_{t+1}, according to the unknown, nonlinear dynamics function f of the previous state X_t and action U_t.

Figure 2: Our method (PILCO extended with Bayesian filtering). Our prior belief B_{t|t−1} (over latent system X_t) generates observation Z_t. The prior belief B_{t|t−1} then combines with observation Z_t, resulting in posterior belief B_{t|t} (the update step). Then, the mean posterior belief E[B_{t|t}] is inputted into policy function π to decide action U_t. 
Finally, the next timestep's prior belief B_{t+1|t} is predicted using dynamics model f (the prediction step).

Figure 3: Belief in execution phase: a Gaussian random variable parameterised by mean m and variance V.

Figure 4: Belief in prediction phase: a Gaussian random variable with random mean M and non-random variance V̄, where M is itself a Gaussian random variable parameterised by mean µ^m and variance Σ^m.

functions of the now-random observations, the belief parameters must be random also, distinguished as M and V′. Given that the belief's distribution parameters are now random, the belief is hierarchically-random, denoted B ~ N(M, V′), seen in Figure 4. Our framework allows us to consider multiple possible future belief-states analytically during policy evaluation. Intuitively, our framework is an analytical analogue of POMDP policy evaluation using particle methods. In particle methods, each particle is associated with a distinct belief, due to each conditioning on independent samples of future observations. A particle distribution thus defines a distribution over beliefs. Our method is the analytical analogue of this particle distribution, and requires no sampling. By restricting our beliefs to (parametric) Gaussians, we can tractably encode a distribution over beliefs by a distribution over belief-parameters.

4.1 Execution phase with a filter
When an actual filter is applied, it starts with three pieces of information: m_{t|t−1}, V_{t|t−1} and a noisy observation of the system z_t (the dual subscript means belief of the latent physical state x at time t given all observations up until time t − 1 inclusive). 
The filtering 'update step' combines prior belief b_{t|t−1} = X_t | z_{1:t−1}, u_{1:t−1} ~ N(m_{t|t−1}, V_{t|t−1}) with observational likelihood p(z_t | X_t) = N(X_t, Σ_ε) using Bayes rule to yield posterior belief b_{t|t} = X_t | z_{1:t}, u_{1:t−1}:

b_{t|t} ~ N(m_{t|t}, V_{t|t}),  m_{t|t} = W_m m_{t|t−1} + W_z z_t,  V_{t|t} = W_m V_{t|t−1},   (2)

with weight matrices W_m = Σ_ε (V_{t|t−1} + Σ_ε)^{−1} and W_z = V_{t|t−1} (V_{t|t−1} + Σ_ε)^{−1} computed from the standard result for Gaussian conditioning. The policy π instead uses the updated belief-mean m_{t|t} (smoother and better-informed than z_t) to decide the action: u_t = π(m_{t|t}, ψ). Thus, the joint distribution over the updated (random) belief and the (non-random) action is

b̃_{t|t} := [b_{t|t}; u_t] ~ N( m̃_{t|t} := [m_{t|t}; u_t],  Ṽ_{t|t} := [V_{t|t}, 0; 0, 0] ).   (3)

Next, the filtering 'prediction step' computes the predictive distribution of b_{t+1|t} = p(x_{t+1} | z_{1:t}, u_{1:t}) from the output of dynamics model f given random input b̃_{t|t}. The distribution f(b̃_{t|t}) is non-Gaussian yet has analytically computable moments [5]. For tractability, we approximate b_{t+1|t} as Gaussian-distributed using moment-matching:

b_{t+1|t} ~ N(m_{t+1|t}, V_{t+1|t}),  m^a_{t+1|t} = E_{b̃_{t|t}}[f^a(b̃_{t|t})],  V^{ab}_{t+1|t} = C_{b̃_{t|t}}[f^a(b̃_{t|t}), f^b(b̃_{t|t})],   (4)

where a and b refer to the a'th and b'th dynamics outputs. Both m^a_{t+1|t} and V^{ab}_{t+1|t} are derived in Appendix D. The process then repeats using the predictive belief (4) as the prior belief in the following timestep. This completes the specification of the system in execution.

4.2 Prediction phase with a filter
During the prediction phase, we compute the probabilistic behaviour of the filtered system via an analytic distribution of belief states (Figure 4). 
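The update step of Eq. (2) is ordinary Gaussian conditioning. A minimal NumPy sketch (the function name and array conventions are ours, not the paper's code):

```python
import numpy as np

def filter_update(m_prior, V_prior, z, Sigma_eps):
    """Filtering 'update step' of Eq. (2): combine prior belief N(m_prior, V_prior)
    over the latent state with a noisy observation z (noise covariance Sigma_eps)
    via Gaussian conditioning. Arguments are (D,) vectors / (D, D) matrices."""
    S_inv = np.linalg.inv(V_prior + Sigma_eps)
    Wm = Sigma_eps @ S_inv            # weight on the prior mean
    Wz = V_prior @ S_inv              # weight on the observation
    m_post = Wm @ m_prior + Wz @ z    # posterior belief-mean m_{t|t}
    V_post = Wm @ V_prior             # posterior belief-variance V_{t|t}
    return m_post, V_post
```

In one dimension with unit prior variance and unit noise this reduces to the familiar result: the posterior mean is the average of prior mean and observation, and the posterior variance is halved.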
We begin with a prior belief at time t = 0, before any observations are recorded (symbolised by '−1'), setting the prior Gaussian belief to have a distribution equal to the known initial Gaussian state distribution: B_{0|−1} ~ N(M_{0|−1}, V̄_{0|−1}), where M_{0|−1} ~ N(µ^x_0, 0) and V̄_{0|−1} = Σ^x_0. Note the variance of M_{0|−1} is zero, corresponding to a single prior belief at the beginning of the prediction phase. We probabilistically predict the yet-unobserved observation Z_t using our belief distribution B_{t|t−1} and the known additive Gaussian observation noise ε_t as per Figure 2. Since we restrict both the belief mean M and observation Z to being Gaussian random variables, we can express their joint distribution:

[M_{t|t−1}; Z_t] ~ N( [µ^m_{t|t−1}; µ^m_{t|t−1}],  [Σ^m_{t|t−1}, Σ^m_{t|t−1}; Σ^m_{t|t−1}, Σ^z_t] ),   (5)

where Σ^z_t = Σ^m_{t|t−1} + V̄_{t|t−1} + Σ_ε.

The filtering 'update step' combines prior belief B_{t|t−1} with observation Z_t using the same logic as (2), the only difference being that Z_t is now random. Since the updated posterior belief-mean M_{t|t} is a (deterministic) function of random Z_t, M_{t|t} is necessarily random (with non-zero variance, unlike M_{0|−1}). Their relationship, M_{t|t} = W_m M_{t|t−1} + W_z Z_t, results in the updated hierarchical belief posterior:

B_{t|t} ~ N(M_{t|t}, V̄_{t|t}), where M_{t|t} ~ N(µ^m_{t|t}, Σ^m_{t|t}),   (6)
µ^m_{t|t} = W_m µ^m_{t|t−1} + W_z µ^m_{t|t−1},   (7)
Σ^m_{t|t} = W_m Σ^m_{t|t−1} W_m^T + W_m Σ^m_{t|t−1} W_z^T + W_z Σ^m_{t|t−1} W_m^T + W_z Σ^z_t W_z^T,   (8)
V̄_{t|t} = W_m V̄_{t|t−1}.   (9)

The policy now has a random input M_{t|t}, thus the control output must also be random (even though π is a deterministic function): U_t = π(M_{t|t}, ψ), which we implement by overloading the policy function: (µ^u_t, Σ^u_t, C^{mu}_t) = π(µ^m_{t|t}, Σ^m_{t|t}, ψ), where µ^u_t is the output mean, Σ^u_t the output variance, and C^{mu}_t the input-output covariance with premultiplied inverse input variance, C^{mu}_t := (Σ^m_{t|t})^{−1} C[M_{t|t}, U_t]. Making a moment-matched approximation yields a joint Gaussian:

M̃_{t|t} := [M_{t|t}; U_t] ~ N( µ̃^m_{t|t} := [µ^m_{t|t}; µ^u_t],  Σ̃^m_{t|t} := [Σ^m_{t|t}, Σ^m_{t|t} C^{mu}_t; (C^{mu}_t)^T Σ^m_{t|t}, Σ^u_t] ).   (10)

Finally, we probabilistically predict the belief-mean M_{t+1|t} ~ N(µ^m_{t+1|t}, Σ^m_{t+1|t}) and the expected belief-variance V̄_{t+1|t} = E_{M̃_{t|t}}[V′_{t+1|t}]. To do this we use a novel generalisation of Gaussian process moment matching with uncertain inputs by Candela et al. [1], generalised to hierarchically-uncertain inputs, detailed in Appendix E. We have now discussed the one-step prediction of the filtered system, from B_{t|t−1} to B_{t+1|t}. 
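The prediction-phase update step propagates the hierarchical belief parameters (µ^m, Σ^m, V̄) rather than a single belief. A minimal numerical sketch of Eqs. (5)-(9), with our own function name and one-dimensional test conventions:

```python
import numpy as np

def predicted_update(mu_m, Sigma_m, V_bar, Sigma_eps):
    """Prediction-phase 'update step', Eqs. (5)-(9): the observation Z_t is still
    random, so the posterior belief-mean M_t|t is itself a Gaussian random
    variable. Propagates hierarchical-belief parameters (mu_m, Sigma_m, V_bar)."""
    Sigma_z = Sigma_m + V_bar + Sigma_eps          # Eq. (5): variance of Z_t
    S_inv = np.linalg.inv(V_bar + Sigma_eps)
    Wm, Wz = Sigma_eps @ S_inv, V_bar @ S_inv      # same weights as Eq. (2)
    mu_post = Wm @ mu_m + Wz @ mu_m                # Eq. (7); E[Z_t] = mu_m
    Sigma_post = (Wm @ Sigma_m @ Wm.T + Wm @ Sigma_m @ Wz.T
                  + Wz @ Sigma_m @ Wm.T + Wz @ Sigma_z @ Wz.T)   # Eq. (8)
    V_post = Wm @ V_bar                            # Eq. (9)
    return mu_post, Sigma_post, V_post
```

Note that since W_m + W_z = I and E[Z_t] = µ^m, the mean of the belief-mean is unchanged by the update; only its spread Σ^m and the per-belief variance V̄ change.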
Using this process repeatedly, from initial belief B_{0|−1} we one-step predict to B_{1|0}, then to B_{2|1}, up to B_{T|T−1}.

5 Experiments

We test our algorithm on the cartpole swing-up problem (shown in Appendix A), a benchmark for comparing controllers of nonlinear dynamical systems. We experiment using a physics simulator by solving the differential equations of the system. Each episode begins with the pendulum hanging downwards. The goal is then to swing the pendulum upright, thereafter continuing to balance it. We use a cart mass of m_c = 0.5 kg. A zero-order hold controller applies horizontal forces to the cart within range [−10, 10] N. The policy is a linear combination of 100 radial basis functions. Friction resists the cart's motion with damping coefficient b = 0.1 Ns/m. Connected to the cart is a pole of length l = 0.2 m and mass m_p = 0.5 kg located at its endpoint, which swings due to gravity's acceleration g = 9.82 m/s². An inexpensive camera observes the system. Frame rates of $10 webcams are typically 30 Hz at maximum resolution, thus the time discretisation is ∆t = 1/30 s. The state x comprises the cart position, pendulum angle, and their time derivatives: x = [x_c, θ, ẋ_c, θ̇]^T. We both randomly initialise the system and set the initial belief of the system according to B_{0|−1} ~ N(M_{0|−1}, V_{0|−1}), where M_{0|−1} ~ δ([0, π, 0, 0]^T) and V^{1/2}_{0|−1} = diag([0.2 m, 0.2 rad, 0.2 m/s, 0.2 rad/s]). The camera's noise standard deviation is (Σ_ε)^{1/2} = diag([0.03 m, 0.03 rad, 0.03/∆t m/s, 0.03/∆t rad/s]), noting 0.03 rad ≈ 1.7°. We use the 0.03/∆t terms since using a camera we cannot observe velocities directly but can estimate them with finite differences. Each episode has a two second time horizon (60 timesteps). 
The cost function we impose is 1 − exp(−d²/(2σ_c²)), where σ_c = 0.25 m and d² is the squared Euclidean distance between the pendulum's end point and its goal.

We compare four algorithms: 1) PILCO by Deisenroth and Rasmussen [5] as a baseline (unfiltered execution, and unfiltered full-prediction); 2) the method by Dallaire et al. [3] (filtered execution, and filtered MAP-prediction); 3) the method by Deisenroth and Peters [4] (filtered execution, and unfiltered full-prediction); and lastly 4) our method (filtered execution, and filtered full-prediction). For clear comparison we first control for data and dynamics models, where each algorithm has access to the exact same data and exact same dynamics model. The reason is to eliminate variance in performance caused by different algorithms choosing different actions. We generate a single dataset by running the baseline PILCO algorithm for 11 episodes (totalling 22 seconds of system interaction). The independent variables of our first experiment are 1) the method of system prediction and 2) the method of system execution. Each policy is then optimised from the same initialisation using their respective prediction methods, before comparing performances. Afterwards, we experiment allowing each algorithm to collect its own data, and also experiment with various noise levels.

6 Results and analysis

6.1 Results using a common dataset
We now compare algorithm performance, both predictive (Figure 5) and empirical (Figure 6). First, we analyse predictive costs per timestep (Figure 5). Since predictions are probabilistic, the costs have distributions, with the exception of Dallaire et al. [3], which predicts MAP trajectories and therefore has deterministic cost. Even though we plot distributed costs, policies are optimised w.r.t. expected total cost only. 
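The saturating cost from Section 5 is straightforward to compute given the pendulum endpoint. In the sketch below, the endpoint geometry (θ = π hanging down, goal directly above the cart's origin) is an assumed convention for illustration, not taken from the paper:

```python
import numpy as np

def swing_up_cost(xc, theta, l=0.2, sigma_c=0.25):
    """Saturating cost 1 - exp(-d^2 / (2 sigma_c^2)) from Section 5, where d is
    the Euclidean distance between the pendulum end point and the goal.
    Assumed geometry: theta = 0 is upright, theta = pi is hanging down,
    goal is the point directly above the track origin."""
    tip = np.array([xc + l * np.sin(theta), l * np.cos(theta)])  # pendulum endpoint
    goal = np.array([0.0, l])                                    # upright above origin
    d2 = np.sum((tip - goal) ** 2)
    return 1.0 - np.exp(-0.5 * d2 / sigma_c ** 2)
```

The cost is zero at the goal and saturates towards 1 once the endpoint is more than a few multiples of σ_c away, which is what makes maximum cost the expected value during the initial swing-up.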
Using the same dynamics, the different prediction methods optimise different policies (with the exception of Deisenroth and Rasmussen [5] and Deisenroth and Peters [4], whose prediction methods are identical). During the first 10 timesteps, we note identical performance with maximum cost, due to the non-zero time required to physically swing the pendulum up near the goal. Performances thereafter diverge. Since we predict w.r.t. a filtering process, less noise is predicted to be injected into the policy, and the optimiser can thus afford higher gain parameters w.r.t. the pole at the balance point. If we linearise our policy around the goal point, our policy has a gain of −81.7 N/rad w.r.t. pendulum angle, a larger magnitude than both Deisenroth methods' gains of −39.1 N/rad (negative values refer to left forces in Figure 11). This higher gain is advantageous here, corresponding to a more reactive system which is more likely to catch a falling pendulum. Finally, we note Dallaire et al. [3] predict very high performance. Without balancing the costs across multiple possible trajectories, the method instead optimises a sequence of deterministic states to near perfection.

To compare the predictive results against the empirical, we used 100 executions of each algorithm (Figure 6). First, we notice a stark difference between the predictive and executed performances of Dallaire et al. [3]: by neglecting model uncertainty, the method suffers model bias. In contrast, the other methods consider uncertainty and have relatively unbiased predictions, judging by the similarity between predictive and empirical performances. Deisenroth's methods, which differ only in execution, illustrate that filtering during execution only can be better than no filtering at all. However, the major benefit comes when the policy is evaluated from multi-step predictions of a filtered system. 
In contrast to Deisenroth and Peters [4], our method's predictions reflect reality more closely because we both predict and execute system trajectories using closed-loop filtered control.
To test the statistical significance of the empirical cost differences given 100 executions, we use a Wilcoxon rank-sum test at each time step. Excluding time steps t = [0, 29] (whose costs are similar), the minimum z-scores over time steps t = [30, 60] at which our method has lower average cost than each other method are: Deisenroth 2011's min(z) = 4.99, Dallaire 2009's min(z) = 8.08, and Deisenroth 2012's min(z) = 3.51. Since the smallest of these is min(z) = 3.51, we have p > 99.9% certainty that our method's average empirical cost is superior to that of each other method.
6.2 Results of the full reinforcement learning task
In the previous experiment we used a common dataset to compare the algorithms, isolating how well each algorithm makes use of data from the different ways each algorithm collects data. Here, we remove the constraint of a common dataset and test the full reinforcement learning task by allowing each algorithm to collect its own data over repeated trials of the cart-pole task. Each algorithm is allowed 15 trials (episodes), repeated 10 times with different random seeds. For a particular re-run experiment and episode number, an algorithm's predicted loss is unchanged when repeatedly computed, yet the empirical loss differs due to random initial states, observation noise, and process noise.
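The per-timestep rank-sum comparison can be sketched in plain Python using the standard normal approximation to the Wilcoxon rank-sum statistic; the helper below and its sample costs are hypothetical illustrations (in practice a library routine such as SciPy's ranksums would be used):

```python
import math

def rank_sum_z(x, y):
    """Wilcoxon rank-sum z-score via the normal approximation
    (average ranks for ties; no tie correction in the variance).
    Positive z indicates values in x tend to be smaller than values in y."""
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))

    def avg_rank(v):
        # average 1-based rank of value v in the pooled sample
        lo = pooled.index(v) + 1
        hi = lo + pooled.count(v) - 1
        return (lo + hi) / 2.0

    r1 = sum(avg_rank(v) for v in x)   # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2.0      # mean of r1 under the null hypothesis
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (mu - r1) / sigma

# Hypothetical per-timestep costs for two methods (the paper uses 100
# executions per method; these tiny samples just exercise the function):
ours = [0.10, 0.12, 0.11, 0.09, 0.13]
baseline = [0.30, 0.28, 0.33, 0.31, 0.29]
print(rank_sum_z(ours, baseline) > 0)  # True: our costs rank lower
```

Running such a test at each time step and taking the minimum z over the window of interest mirrors the min(z) summaries reported above.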
We therefore average the empirical results over 100 random executions of the controller at each episode and seed.

Figure 5: Predictive cost per timestep. The error bars show ±1 standard deviation. Each algorithm has access to the same data set (generated by baseline Deisenroth 2011) and dynamics model. Algorithms differ in their multi-step prediction methods (except Deisenroth's algorithms, whose predictions overlap).

Figure 6: Empirical cost per timestep. We generate empirical cost distributions from 100 executions per algorithm. Error bars show ±1 standard deviation. The plot colours and shapes correspond to the legend in Figure 5.

Figure 7: Predictive loss per episode. Error bars show ±1 standard error of the mean predicted loss given 10 repeats of each algorithm.

Figure 8: Empirical loss per episode. Error bars show ±1 standard error of the mean empirical loss given 10 repeats of each algorithm.
In each repeat we computed the mean empirical loss using 100 independent executions of the controller.

Figure 9: Empirical loss of Deisenroth 2011 for various noise levels. The error bars show ±1 standard deviation of the empirical loss distribution based on 100 repeats of the same learned controller, per noise level.

Figure 10: Empirical loss of Filtered PILCO for various noise levels. The error bars show ±1 standard deviation of the empirical loss distribution based on 100 repeats of the same learned controller, per noise level.

The predictive loss (cumulative cost) distributions of each algorithm are shown in Figure 7. Perhaps the most striking difference between the full reinforcement learning predictions and those made with a controlled dataset (Figure 5) is that Dallaire does not predict it will perform well. The quality of the data collected by Dallaire within the first 15 episodes is not sufficient to predict good performance. Our Filtered PILCO method accurately predicts its own strong performance and additionally outperforms the competing algorithms, as seen in Figure 8. Of interest is how each algorithm performs equally poorly during the first four episodes, with Filtered PILCO's performance breaking away and learning the task well by the seventh trial. This learning rate is similar to the original PILCO experiment with the noise-free cartpole.
6.3 Results with various observation noise levels
Different observation noise levels were also tested, comparing PILCO (Figure 9) with Filtered PILCO (Figure 10).
Both figures show a noise factor k, such that the observation noise standard deviations are √Σε = k × diag([0.01 m, 0.01/∆t m/s, 0.01 rad, 0.01/∆t rad/s]). For reference, our previous experiments used a noise factor of k = 3. At the low noise factor k = 1, both algorithms perform similarly well, since observations are precise enough to control the system without a filter. As observation noise increases, the performance of unfiltered PILCO soon drops, whilst Filtered PILCO successfully controls the system under higher noise levels (Figure 10).
6.4 Training time complexity
Training the GP dynamics model involved N = 660 data points, M = 50 inducing points under a sparse GP Fully Independent Training Conditional (FITC) approximation [2], P = 100 policy RBF centroids, D = 4 state dimensions, F = 1 action dimension, and a T = 60 timestep horizon, with time complexity O(DNM²). Policy optimisation (with 300 steps, each of which requires trajectory prediction with gradients) is the most computationally intense part: our method and both Deisenroth methods scale as O(M²D²(D + F)²T + P²D²F²T), whilst Dallaire's scales only as O(MD(D + F)T + PDFT). In the worst case we require M = O(exp(D + F)) inducing points to capture the dynamics; the average case is unknown. Total training time was four hours for the original PILCO method, with an additional hour to re-optimise the policy.

7 Conclusion and future work

In this paper, we extended the original PILCO algorithm [5] to filter observations, both during system execution and during the multi-step probabilistic prediction required for policy evaluation. The extended framework enables learning in a special case of partially observed MDP environments (POMDPs) whilst retaining PILCO's data-efficiency property. We demonstrated successful application to a benchmark control problem, the noisily-observed cartpole swing-up.
Our algorithm learned a good policy under significant observation noise in less than 30 seconds of system interaction. Importantly, our algorithm evaluates policies with predictions that are faithful to reality: we predict w.r.t. closed-loop filtered control precisely because we execute closed-loop filtered control. We showed experimentally that faithful and probabilistic predictions improved performance with respect to the baselines. For clear comparison we first constrained each algorithm to use the same dynamics dataset, to demonstrate the superior data-usage of our algorithm. Afterwards we relaxed this constraint, and showed our algorithm was able to learn from less data.
Several challenges remain for future work. Firstly, the assumption of zero variance of the belief-variance could be relaxed. Such a relaxation allows distributed trajectories to more accurately consider belief states having various degrees of certainty (belief-variance). For example, system trajectories have larger belief-variance when passing through data-sparse regions of state-space, and smaller belief-variance in data-dense regions. Secondly, the policy could be a function of the full belief distribution (mean and variance) rather than just the mean. Such flexibility could help the policy take more 'cautious' actions when it is more uncertain about the state. A third challenge is handling non-Gaussian noise and unobserved state variables. For example, in real-life scenarios using a camera sensor for self-driving, observations are occasionally fully or partially occluded, or limited by weather conditions; such occlusions and limitations change over time, as opposed to a fixed additive Gaussian noise. Lastly, experiments with a real robot would be important to show usefulness in practice.

References

[1] Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, and Carl Rasmussen.
Propagation of uncertainty in Bayesian kernel models – application to multiple-step ahead forecasting. In International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 701–704, 2003.

[2] Lehel Csató and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.

[3] Patrick Dallaire, Camille Besse, Stephane Ross, and Brahim Chaib-draa. Bayesian reinforcement learning in continuous POMDPs with Gaussian processes. In International Conference on Intelligent Robots and Systems, pages 2604–2609, 2009.

[4] Marc Deisenroth and Jan Peters. Solving nonlinear continuous state-action-observation POMDPs for mechanical systems with Gaussian noise. In European Workshop on Reinforcement Learning, 2012.

[5] Marc Deisenroth and Carl Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pages 465–472, New York, NY, USA, 2011.

[6] Michael Duff. Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. PhD thesis, Department of Computer Science, University of Massachusetts Amherst, 2002.

[7] Jonathan Ko and Dieter Fox. GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots, 27(1):75–90, 2009.

[8] Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[9] Andrew McHutchon. Nonlinear Modelling and Control using Gaussian Processes. PhD thesis, Department of Engineering, University of Cambridge, 2014.

[10] Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete Bayesian reinforcement learning.
In International Conference on Machine Learning, pages 697–704, 2006.

[11] Carl Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, USA, 2006.

[12] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayesian reinforcement learning in continuous POMDPs with application to robot navigation. In International Conference on Robotics and Automation, pages 2845–2851, 2008.

[13] Jur van den Berg, Sachin Patil, and Ron Alterovitz. Efficient approximate value iteration for continuous Gaussian POMDPs. In Association for the Advancement of Artificial Intelligence, 2012.

[14] Dustin Webb, Kyle Crandall, and Jur van den Berg. Online parameter estimation via real-time replanning of continuous Gaussian POMDPs. In International Conference on Robotics and Automation, pages 5998–6005, 2014.