{"title": "Gaussian Processes in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 751, "page_last": 758, "abstract": "", "full_text": "Gaussian Processes in Reinforcement Learning

Carl Edward Rasmussen and Malte Kuss

Max Planck Institute for Biological Cybernetics
Spemannstraße 38, 72076 Tübingen, Germany
{carl,malte.kuss}@tuebingen.mpg.de

Abstract

We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two-dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

1 Introduction

Model-based control of discrete-time non-linear dynamical systems is typically exacerbated by the existence of multiple relevant time scales: a short time scale (the sampling time) on which the controller makes decisions and where the dynamics are simple enough to be conveniently captured by a model learned from observations, and a longer time scale which captures the long-term consequences of control actions. For most non-trivial (non-minimum phase) control tasks a policy relying solely on short-term rewards will fail.

In reinforcement learning this problem is explicitly recognized by the distinction between short-term (reward) and long-term (value) desiderata.
The consistency between short- and long-term goals is expressed by the Bellman equation, for discrete states $s$ and actions $a$:

$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^\pi(s') \big], \qquad (1)$$

where $V^\pi(s)$ is the value (the expected long-term reward) of state $s$ while following policy $\pi(s,a)$, which is the probability of taking action $a$ in state $s$, $P^a_{ss'}$ is the transition probability of going to state $s'$ when applying action $a$ given that we are in state $s$, $R^a_{ss'}$ denotes the immediate expected reward, and $\gamma$ is the discount factor (see Sutton and Barto (1998) for a thorough review). The Bellman equations are either solved iteratively by policy evaluation, or alternatively solved directly (the equations are linear) and commonly interleaved with policy improvement steps (policy iteration).

While the concept of a value function is ubiquitous in reinforcement learning, this is not the case in the control community. Some non-linear model-based control is restricted to the easier minimum-phase systems. Alternatively, longer-term predictions can be achieved by concatenating short-term predictions, an approach which is made difficult by the fact that uncertainty in predictions typically grows (precluding approaches based on the certainty equivalence principle) as the time horizon lengthens. See Quiñonero-Candela et al.
(2003) for a full probabilistic approach based on Gaussian processes; however, implementing a controller based on this approach requires numerically solving multivariate optimisation problems for every control action. In contrast, having access to a value function makes computation of control actions much easier.

Much previous work has involved the use of function approximation techniques to represent the value function. In this paper, we exploit a number of useful properties of Gaussian process models for this purpose. This approach can be naturally applied to discrete-time, continuous-state-space systems. This avoids the tedious discretisation of state spaces often required by other methods, e.g. Moore and Atkeson (1995). In Dietterich and Wang (2002) kernel-based methods (support vector regression) were also applied to learning of the value function, but in discrete state spaces.

In the current paper we use Gaussian process (GP) models for two distinct purposes: first to model the dynamics of the system (actually, we use one GP per dimension of the state space), which we will refer to as the dynamics GP, and secondly the value GP for representing the value function. When computing the values, we explicitly take the uncertainties from the dynamics GP into account, and using the linearity of the GP, we are able to solve directly for the value function, avoiding slow policy evaluation iterations.

Experiments on a simple problem illustrate the viability of the method. For these experiments we use a greedy policy w.r.t. the value function. However, since our representation of the value function is stochastic, we could represent uncertainty about values, enabling a principled attack on the exploration vs. exploitation tradeoff, such as in Bayesian Q-learning as proposed by Dearden et al. (1998).
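Since the discrete-state Bellman equations (1) are linear in the values, policy evaluation for a fixed policy can be done by a single direct solve rather than by iteration, as noted above. A minimal tabular sketch (the 3-state transition matrix and rewards are made up for illustration, not taken from the paper):

```python
import numpy as np

# Made-up 3-state MDP under a fixed policy: P[i, j] = P(s'=j | s=i),
# r[i] = expected immediate reward in state i, gamma = discount factor.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Direct solve of the linear Bellman equations: (I - gamma * P) v = r.
v_direct = np.linalg.solve(np.eye(3) - gamma * P, r)

# Iterative policy evaluation converges to the same fixed point.
v_iter = np.zeros(3)
for _ in range(2000):
    v_iter = r + gamma * P @ v_iter
```

Both routes reach the same fixed point; the direct route costs one linear solve in the number of states.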
This potential is outlined in the discussion section.

2 Gaussian Processes and Value Functions

In a continuous state space we straightforwardly generalize the Bellman equation (1) by substituting sums with integrals; further, we assume for simplicity of exposition that the policy is deterministic (see Section 4 for a further discussion):

$$V^\pi(s) = \int \big[ R(s') + \gamma V^\pi(s') \big] \, p(s' \,|\, s, \pi(s)) \, ds' \qquad (2)$$
$$= \int R(s') \, p(s' \,|\, s, \pi(s)) \, ds' + \gamma \int V^\pi(s') \, p(s' \,|\, s, \pi(s)) \, ds'. \qquad (3)$$

This involves two integrals over the distribution of consecutive states $s'$ visited when following the policy $\pi$. The transition probabilities $p(s' \,|\, s, \pi(s))$ may include two sources of stochasticity: uncertainty in the model of the dynamics and stochasticity in the dynamics itself.

2.1 Gaussian Process Regression Models

In GP models we put a prior directly on functions and condition on observations to make predictions (see Williams and Rasmussen (1996) for details). The noisy targets $\mathbf{y}$ are assumed jointly Gaussian with covariance function $k$:

$$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K), \qquad K_{pq} = k(\mathbf{x}_p, \mathbf{x}_q). \qquad (4)$$

Throughout the remainder of this paper we use a Gaussian covariance function:

$$k(\mathbf{x}_p, \mathbf{x}_q) = v^2 \exp\big( -\tfrac{1}{2} (\mathbf{x}_p - \mathbf{x}_q)^\top \Lambda^{-1} (\mathbf{x}_p - \mathbf{x}_q) \big) + \sigma_n^2 \, \delta_{pq}, \qquad (5)$$

where the positive elements of the diagonal matrix $\Lambda$, $v$ and $\sigma_n$ are hyperparameters collected in $\theta$.
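As a concrete illustration of eqs. (4)–(5), the sketch below builds the covariance matrix for a toy one-dimensional data set and computes a prediction at a novel input using the standard conditional-Gaussian formulae; all data and hyperparameter values are made up for illustration:

```python
import numpy as np

def gauss_cov(Xp, Xq, lam, v2):
    """Gaussian covariance of eq. (5) without the noise term; `lam` holds
    the diagonal of Lambda (one squared length-scale per input dimension)."""
    d = Xp[:, None, :] - Xq[None, :, :]
    return v2 * np.exp(-0.5 * np.sum(d**2 / lam, axis=-1))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 1))                    # training inputs
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.standard_normal(20)  # noisy targets

lam, v2, sn2 = np.array([0.1]), 1.0, 0.05**2         # made-up hyperparameters
K = gauss_cov(X, X, lam, v2) + sn2 * np.eye(len(X))  # eq. (4) using (5)

xs = np.array([[0.1]])                    # novel test input
ks = gauss_cov(X, xs, lam, v2)[:, 0]      # covariances to the test input
mean = ks @ np.linalg.solve(K, y)         # predictive mean
var = v2 - ks @ np.linalg.solve(K, ks)    # predictive (latent) variance
```

With data this dense the predictive mean tracks the underlying sine closely and the latent variance is small.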
The hyperparameters are fit by maximising the marginal likelihood (see again Williams and Rasmussen (1996)) using conjugate gradients. The predictive distribution for a novel test input $\mathbf{x}_*$ is Gaussian:

$$\mu(\mathbf{x}_*) = \mathbf{k}(\mathbf{x}_*)^\top K^{-1} \mathbf{y}, \qquad \sigma^2(\mathbf{x}_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}(\mathbf{x}_*)^\top K^{-1} \mathbf{k}(\mathbf{x}_*), \qquad (6)$$

where $\mathbf{k}(\mathbf{x}_*)$ denotes the vector of covariances between the test input and the training inputs.

2.2 Model Identification of System Dynamics

Given a set of $n$ $d$-dimensional observations of the form $(s, a, s')$, we use a separate Gaussian process model for predicting each coordinate of the system dynamics. The inputs to each model are the state and action pair $(s, a)$; the output is a (Gaussian) distribution over the consecutive state variable, using eq. (6). Combining the predictive models we obtain a multivariate Gaussian distribution over the consecutive state, i.e. the transition probabilities $p(s' \,|\, s, a)$.

2.3 Policy Evaluation

We now turn towards the problem of evaluating $V^\pi(s)$ for a given policy $\pi$ over the continuous state space. In policy evaluation the Bellman equations are used as update rules. In order to apply this approach in the continuous case, we have to solve the two integrals in eq. (3). For simple (e.g. polynomial or Gaussian) reward functions $R$ we can directly compute¹ the first Gaussian integral of eq. (3).
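For a Gaussian reward the first integral of eq. (3) is indeed available in closed form: the expectation of an unnormalised Gaussian bump under a Gaussian state distribution is again a Gaussian in the means, with the two variances added. A one-dimensional numerical check (all constants made up for illustration):

```python
import numpy as np

def expected_gauss_reward(m, v, c, w2):
    """Closed-form E[R(s')] for the reward R(s') = exp(-(s'-c)^2 / (2*w2))
    with s' ~ N(m, v): an unnormalised Gaussian in m with variance w2 + v."""
    return np.sqrt(w2 / (w2 + v)) * np.exp(-0.5 * (m - c)**2 / (w2 + v))

# Check against Monte Carlo with made-up numbers.
m, v, c, w2 = 0.4, 0.02, 0.6, 0.01
rng = np.random.default_rng(1)
s = rng.normal(m, np.sqrt(v), 200_000)
mc = np.exp(-0.5 * (s - c)**2 / w2).mean()
```

The Monte Carlo average agrees with the closed form to sampling accuracy.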
Thus, the expected immediate reward from state $s_i$ following $\pi$ is:

$$r_i = \int R(s') \, \mathcal{N}(s' \,|\, \boldsymbol{\mu}_i, \Sigma_i) \, ds', \qquad (7)$$

in which the mean and covariance for the consecutive state are coordinate-wise given by eq. (6) evaluated on the dynamics GP.

The second integral of eq. (3) involves an expectation over the value function, which is modeled by the value GP as a function of the states. We need access to the value function at every point in the continuous state space, but we only explicitly represent values at a finite number of support points, $s_1, \ldots, s_m$, and let the GP generalise to the entire space. Here we use the mean of the GP to represent the value² – see Section 4 for an elaboration. Thus, we need to average the values over the distribution predicted for $s'$. For a Gaussian covariance function³ this can be done in closed form as shown by Girard et al. (2002). In detail, the Bellman equation for the value at support point $s_i$ is:

$$V(s_i) = r_i + \gamma \int V(s') \, \mathcal{N}(s' \,|\, \boldsymbol{\mu}_i, \Sigma_i) \, ds' = r_i + \gamma \, \mathbf{g}_i^\top K^{-1} \mathbf{v}, \qquad (8)$$

where $K$ denotes the covariance matrix of the value GP and boldface $\mathbf{v}$ is the vector of values at the support points. Note that this equation implies a consistency between the values at the support points and the values at all other points. Equation (8) could be used for iterative policy evaluation. Notice however, that eq.
(8) is a set of $m$ linear simultaneous equations in $\mathbf{v}$, which we can solve⁴ explicitly:

$$\mathbf{v} = \mathbf{r} + \gamma W \mathbf{v}, \qquad \text{where the } i\text{'th row of } W \text{ is } W_i = \mathbf{g}_i^\top K^{-1}, \qquad (9)$$

and the elements of $\mathbf{g}_i$ are the Gaussian integrals of the covariance function over the predicted distribution of the consecutive state,

$$(\mathbf{g}_i)_j = \int k(s_j, s') \, \mathcal{N}(s' \,|\, \boldsymbol{\mu}_i, \Sigma_i) \, ds' = v^2 \, |\Lambda^{-1} \Sigma_i + I|^{-1/2} \exp\big( -\tfrac{1}{2} (\boldsymbol{\mu}_i - s_j)^\top (\Lambda + \Sigma_i)^{-1} (\boldsymbol{\mu}_i - s_j) \big).$$

¹For more complex reward functions we may approximate it using e.g. a Taylor expansion.
²Thus, here we are using the GP for noise-free interpolation of the value function, and consequently set its noise parameter to a small positive constant (to avoid numerical problems).
³The covariance functions allowing analytical treatment in this way include Gaussian and polynomial, and mixtures of these.
⁴We conjecture that the matrix $(I - \gamma W)$ is non-singular under mild conditions, but have
not yet devised a formal proof.

$$\mathbf{v} = (I - \gamma W)^{-1} \mathbf{r}. \qquad (10)$$

The computational cost of solving this system is $\mathcal{O}(m^3)$, which is no more expensive than doing iterative policy evaluation, and equal to the cost of value GP prediction.

2.4 Policy Improvement

Above we demonstrated how to compute the value function for a given policy $\pi$. Now given a value function we can act greedily, thereby defining an implicit policy:

$$\pi(s) = \operatorname{argmax}_a \int \big[ R(s') + \gamma V(s') \big] \, p(s' \,|\, s, a) \, ds', \qquad (11)$$

giving rise to
one-dimensional optimisation problems (when the possible actions $a$ are scalar). As above we can solve the relevant integrals and in addition compute derivatives w.r.t. the action. Note also that application-specific constraints can often be reformulated as constraints in the above optimisation problem.

2.5 The Policy Iteration Algorithm

We now combine policy evaluation and policy improvement into policy iteration, in which both steps alternate until a stable configuration is reached⁵. Thus, given observations of system dynamics and a reward function, we can compute a continuous value function and thereby an implicitly defined policy.

Algorithm 1 Policy iteration, batch version

1. Given: $n$ observations of system dynamics of the form $(s_i, a_i, s'_i)$ for a fixed time interval $\Delta t$, discount factor $\gamma$ and reward function $R$.
2. Model Identification: Model the system dynamics by Gaussian processes for each state coordinate and combine them to obtain a model of the transition probabilities $p(s' \,|\, s, a)$.
3. Initialise Value Function: Choose a set $s_1, \ldots, s_m$ of support points and initialize $\mathbf{v}$. Fit Gaussian process hyperparameters for representing $V(s)$ using conjugate gradient optimisation of the marginal likelihood and set $\sigma_n$ to a small positive constant.
4.
Policy Iteration:
repeat
  for all support points $s_i$ do
    Find action $a_i$ by solving equation (11) subject to problem-specific constraints, using the dynamics Gaussian processes.
    Solve equation (7) in order to obtain $r_i$.
    Compute the $i$'th row of $W$ as in equation (9).
  end for
  Compute $\mathbf{v} = (I - \gamma W)^{-1} \mathbf{r}$.
  Update the Gaussian process hyperparameters for representing $V(s)$ to fit the new $\mathbf{v}$.
until stabilisation of $\mathbf{v}$

The selection of the support points remains to be determined. When using the algorithm in an online setting, support points could naturally be chosen as the states visited, possibly selecting the ones which conveyed most new information about the system. In the experimental section, for simplicity of exposition, we consider only the batch case and simply use a regular grid of support points.

We have assumed for simplicity that the reward function is deterministic and known, but it would not be too difficult to also use a (GP) model for the rewards; any model that allows

⁵Assuming convergence, which we have not proven.

[Figure 1: (a) the mountain car problem; (b) position of the controlled car over time.]

Figure 1: Figure (a) illustrates the mountain car problem.
The car is initially standing motionless at $x = -0.5$, and the goal is to bring it up and hold it in the target region around $x = 0.6$ with approximately zero velocity. The hatched area marks the target region, and below it the approximation of the reward by a Gaussian is shown (both projected onto the $x$ axis). Figure (b) shows the position $x$ of the car when controlled according to (11) using the approximated value function after 6 policy improvements shown in Figure 3. The car reaches the target region in about five time steps but does not end up exactly at the target centre due to uncertainty in the dynamics model. The circles mark the positions at $\Delta t$-second time steps.

evaluation of eq. (7) could be used. Similarly the greedy policy has been assumed, but generalisation to stochastic policies would not be difficult.

3 Illustrative Example

For reasons of presentability of the value function, we below consider the well-known mountain car problem "park on the hill", as described by Moore and Atkeson (1995), where the state space is only two-dimensional. The setting depicted in Figure 1(a) consists of a frictionless, point-like, unit-mass car on a hilly landscape described by

$$H(x) = \begin{cases} x^2 + x & \text{for } x < 0 \\ x / \sqrt{1 + 5x^2} & \text{for } x \geq 0 \end{cases} \qquad (12)$$

The state of the system $s = (x, \dot{x})$ is described by the position of the car and its speed respectively.
As action a horizontal force $F$ in the range $-4 \leq F \leq 4$ can be applied in order to bring the car up into the target region, which is a rectangle in state space around $x = 0.6$ and $\dot{x} = 0$. The position and speed are constrained to $-1 \leq x \leq 1$ and $-2 \leq \dot{x} \leq 2$. Note that the admissible range of forces is not sufficient to drive the car up greedily from the initial state $s_0 = (-0.5, 0)$, such that a strategy has to be found which utilises the landscape in order to accelerate up the slope, which gives the problem its non-minimum phase character.

For system identification we draw $n$ samples $(s_i, a_i)$ uniformly from their respective admissible regions and simulate $\Delta t$ seconds⁶ forward in time using an ODE solver in order to get the consecutive states $s'_i$. We then use two Gaussian processes to build a model predicting the system behavior from these examples for the two state variables independently, using covariance functions of type eq. (5). Based

⁶Note that $\Delta t$ seconds seems to be an order of magnitude slower than the time scale usually considered in the literature.
Our algorithm works equally well for shorter time steps (the discount factor $\gamma$ should then be increased); for even longer time steps, modeling of the dynamics gets more complicated, and eventually, for large enough $\Delta t$, control is no longer possible.

[Figure 2: panels (a)-(c) plot the estimated value function $V$ over the state space $(x, \dot{x})$.]

Figure 2: Figures (a-c) show the estimated value function for the mountain car example after initialisation (a), after the first iteration over $\mathbf{v}$ (b), and a nearly stabilised value function after 3 iterations (c). See also Figure 3 for the final value function and the corresponding state transition diagram.

on the $n$ random examples, the relations can already be approximated to within small root mean squared errors for predicting $x$ and $\dot{x}$ (estimated on independent test samples, considering the mean of the predicted distribution). This enables us to efficiently solve the optimization problem eq. (11) described above, as we can also evaluate its gradient with respect to the action. States outside the feasible region are assigned zero value and reward.

Having a model of the system dynamics, the other necessary element to provide to the proposed algorithm is a reward function.
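Solving eq. (11) for a scalar force is a bounded one-dimensional optimisation. The sketch below illustrates this with stand-in pieces, not the paper's learned GPs: a made-up deterministic dynamics model, a Gaussian reward centred on an assumed target region, and the reward itself standing in for the value GP mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Stand-in pieces (NOT the paper's models): toy linear mean dynamics,
# a Gaussian reward around an assumed target, and a stand-in value function.
def next_state_mean(x, dx, F, dt=0.3):
    return x + dt * dx, dx + dt * F          # made-up dynamics

def reward(x, dx):
    return np.exp(-0.5 * ((x - 0.6)**2 / 0.01 + dx**2 / 0.25))

def value(x, dx):                            # stand-in for the value GP mean
    return reward(x, dx)

def greedy_action(x, dx, gamma=0.8):
    # Eq. (11): maximise expected reward plus discounted value over the force.
    def neg_q(F):
        xn, dxn = next_state_mean(x, dx, F)
        return -(reward(xn, dxn) + gamma * value(xn, dxn))
    return minimize_scalar(neg_q, bounds=(-4.0, 4.0), method="bounded").x
```

With these toy ingredients the greedy force simply brakes the car towards zero velocity; with the learned GP models the same bounded search is run per support point.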
In the formulation by Moore and Atkeson (1995) the reward is equal to 1 if the car is in the target region and 0 elsewhere. For convenience we approximate this cube by a proportional Gaussian with maximum reward 1, as indicated in Figure 1(a). We now can solve the update equation (10) and also the problem eq. (11) subject to the constraints on $x$, $\dot{x}$ and $F$.

As support points for the value function we simply put a regular grid onto the state space and initialise the value function with the immediate rewards for these states, Figure 2(a). The standard deviation of the noise of the value GP representing $V(s)$ is set to a small value. Following the policy iteration algorithm we estimate the value of all support points following the implicit policy (11) w.r.t. the initial value function, Figure 2(a). We then evaluate this policy and obtain an updated value function shown in Figure 2(b), where all points which can expect to reach the reward region in one time step gain value. If we iterate this procedure two more times we obtain a value function as shown in Figure 2(c), in which the state space is already well organised. After five policy iterations the value function, and therefore the implicit policy, is stable, Figure 3(a). In Figure 3(b) a state-transition diagram based on the dynamics GP is shown, when following the implicit policy.
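Each policy evaluation step above is one application of the direct solve, eq. (10). A sketch with made-up ingredients (random non-negative averaging weights stand in for the Gaussian-integral matrix $W$ of eq. (9)):

```python
import numpy as np

rng = np.random.default_rng(2)
m, gamma = 25, 0.8

# Made-up stand-ins: W would come from the Gaussian integrals of eq. (9);
# here rows are random non-negative weights normalised to sum to one.
W = rng.uniform(size=(m, m))
W /= W.sum(axis=1, keepdims=True)
r = rng.uniform(size=m)

# Eq. (10): solve the linear Bellman system at the support points directly ...
v = np.linalg.solve(np.eye(m) - gamma * W, r)

# ... which agrees with iterating the backup v <- r + gamma * W v.
v_it = np.zeros(m)
for _ in range(500):
    v_it = r + gamma * W @ v_it
```

The direct solve replaces the slow iteration; both give the same values at the support points.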
For some of the support points the model correctly predicts that the car will leave the feasible region no matter what force is applied, which corresponds to the areas with zero value in Figure 3(a). In the diagram each support point $s_i$ is connected to its predicted (mean) consecutive state $s'_i$.

If we control the car from $s_0 = (-0.5, 0)$ according to the found policy, the car gathers momentum by first accelerating left before driving up into the target region, where it is balanced, as illustrated in Figure 1(b). This shows that the $n$ random examples of the system dynamics are sufficient for this task. The control policy found is probably very close to the optimally achievable.

[Figure 3: (a) the final estimated value function $V$ over $(x, \dot{x})$; (b) the corresponding state transition diagram.]

Figure 3: Figure (a) shows the estimated value function after 6 policy improvements (subsequent to Figures 2(a-c)), over which $\mathbf{v}$ has stabilised. Figure (b) is the corresponding state transition diagram illustrating the implicit policy on the support points. The black lines connect $s_i$ and the respective $s'_i$ estimated by the dynamics GP when following the implicit greedy policy with respect to (a). The thick line marks the trajectory of the car for the movement described in Figure 1(b) based on the physics of the system.
Note that the temporary violation of the constraint on $\dot{x}$ remains unnoticed using time intervals of $\Delta t$ seconds; to avoid this, the constraints could be enforced continuously in the training set.

4 Perspectives and Conclusion

Commonly the value function is defined to be the expected (discounted) future reward. Conceptually however, there is more to values than their expectations. The distribution over future reward could have small or large variance with identical means, two fairly different situations that are treated identically when only the value expectation is considered. It is clear however, that a principled approach to the exploitation vs. exploration tradeoff requires a more faithful representation of value, as was recently proposed in Bayesian Q-learning (Dearden et al. 1998); see also Attias (2003). For example, the large-variance case is more attractive for exploration than the small-variance case.

The GP representation of value functions proposed here lends itself naturally to this more elaborate concept of value. The GP model inherently maintains a full distribution over values, although in the present paper we have only used its expectation. Implementation of this would require a second set of Bellman-like equations for the second moment of the values at the support points. These equations would simply express consistency of uncertainty: the uncertainty of a value should be consistent with the uncertainty when following the policy. The values at the support points would be (Gaussian) distributions with individual variances, which is readily handled by using a full diagonal noise term in place of $\sigma_n^2 \delta_{pq}$ in eq. (5). The individual second moments can be computed in closed form (see derivations in Quiñonero-Candela et al. (2003)). However, iteration would be necessary to solve the combined system, as there would be no linearity corresponding to eq.
(10) for the second moments. In the near future we will be exploring these possibilities.

Whereas only a batch version of the algorithm has been described here, it would obviously be interesting to explore its capabilities in an online setting, starting from scratch. This will require that we abandon the use of a greedy policy, to avoid the risk of getting stuck in a local minimum caused by an incomplete model of the dynamics. Instead, a stochastic policy should be used, which should not cause further computational problems as long as it is represented by a Gaussian (or perhaps more appropriately a mixture of Gaussians). A good policy should actively explore regions where we may gain a lot of information, requiring the notion of the value of information (Dearden et al. 1998). Since the information gain would come from a better dynamics GP model, it may not be an easy task in practice to optimise information and value jointly.

We have introduced Gaussian process models into continuous-state reinforcement learning tasks, to model the state dynamics and the value function. We believe that the good generalisation properties and the simplicity of manipulation of GP models make them ideal candidates for these tasks. In a simple demonstration, our parameter-free algorithm converges rapidly to a good approximation of the value function.

Only the batch version of the algorithm was demonstrated. We believe that the full probabilistic nature of the transition model should facilitate the early stages of an on-line process. Also, online addition of new observations in GP models can be done very efficiently. Only a simple problem was used, and it will be interesting to see how the algorithm performs on more realistic tasks.
Direct implementation of GP models is suitable for up to a few thousand support points; in recent years a number of fast approximate GP algorithms have been developed, which could be used in more complex settings.

We are convinced that recent developments in powerful kernel-based probabilistic models for supervised learning, such as GPs, will integrate well into reinforcement learning and control. Both the modeling and analytic properties make them excellent candidates for reinforcement learning tasks. We speculate that their fully probabilistic nature offers promising prospects for some fundamental problems of reinforcement learning.

Acknowledgements

Both authors were supported by the German Research Council (DFG).

References

Attias, H. (2003). Planning by probabilistic inference. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics.

Dearden, R., N. Friedman, and S. J. Russell (1998). Bayesian Q-learning. In Fifteenth National Conference on Artificial Intelligence (AAAI).

Dietterich, T. G. and X. Wang (2002). Batch value function approximation via support vectors. In Advances in Neural Information Processing Systems 14, Cambridge, MA, pp. 1491–1498. MIT Press.

Girard, A., C. E. Rasmussen, J. Quiñonero-Candela, and R. Murray-Smith (2002). Multiple-step ahead prediction for non linear dynamic systems – a Gaussian process treatment with propagation of the uncertainty. In Advances in Neural Information Processing Systems 15.

Moore, A. W. and C. G. Atkeson (1995). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning 21, 199–233.

Quiñonero-Candela, J., A. Girard, J. Larsen, and C. E. Rasmussen (2003). Propagation of uncertainty in Bayesian kernel models – application to multiple-step ahead forecasting.
In Proceedings of the 2003 IEEE Conference on Acoustics, Speech, and Signal Processing.

Sutton, R. S. and A. G. Barto (1998). Reinforcement Learning. Cambridge, Massachusetts: MIT Press.

Williams, C. K. I. and C. E. Rasmussen (1996). Gaussian processes for regression. In Advances in Neural Information Processing Systems 8.
", "award": [], "sourceid": 2420, "authors": [{"given_name": "Malte", "family_name": "Kuss", "institution": null}, {"given_name": "Carl", "family_name": "Rasmussen", "institution": null}]}