Gaussian Processes for Regression

Christopher K. I. Williams
Neural Computing Research Group
Aston University
Birmingham B4 7ET, UK
c.k.i.williams@aston.ac.uk

Carl Edward Rasmussen
Department of Computer Science
University of Toronto
Toronto, ONT, M5S 1A4, Canada
carl@cs.toronto.edu

Abstract

The Bayesian analysis of neural networks is difficult because a simple prior over weights implies a complex prior distribution over functions. In this paper we investigate the use of Gaussian process priors over functions, which permit the predictive Bayesian analysis for fixed values of hyperparameters to be carried out exactly using matrix operations. Two methods, using optimization and averaging (via Hybrid Monte Carlo) over hyperparameters, have been tested on a number of challenging problems and have produced excellent results.

1 INTRODUCTION

In the Bayesian approach to neural networks a prior distribution over the weights induces a prior distribution over functions. This prior is combined with a noise model, which specifies the probability of observing the targets t given function values y, to yield a posterior over functions which can then be used for predictions. For neural networks the prior over functions has a complex form, which means that implementations must either make approximations (e.g. MacKay, 1992) or use Monte Carlo approaches to evaluate the integrals (Neal, 1993).

As Neal (1995) has argued, there is no reason to believe that, for real-world problems, neural network models should be limited to nets containing only a "small" number of hidden units. He has shown that it is sensible to consider a limit where the number of hidden units in a net tends to infinity, and that good predictions can be obtained from such models using the Bayesian machinery. He has also shown that a large class of neural network models will converge to a Gaussian process prior over functions in the limit of an infinite number of hidden units.

In this paper we use Gaussian processes specified parametrically for regression problems. The advantage of the Gaussian process formulation is that the combination of the prior and noise models can be carried out exactly using matrix operations. We also show how the hyperparameters which control the form of the Gaussian process can be estimated from the data, using either a maximum likelihood or Bayesian approach, and that this leads to a form of "Automatic Relevance Determination" (MacKay, 1993; Neal, 1995).

2 PREDICTION WITH GAUSSIAN PROCESSES

A stochastic process is a collection of random variables {Y(x) | x ∈ X} indexed by a set X. In our case X will be the input space with dimension d, the number of inputs. The stochastic process is specified by giving the probability distribution for every finite subset of variables Y(x^{(1)}), ..., Y(x^{(k)}) in a consistent manner. A Gaussian process is a stochastic process which can be fully specified by its mean function μ(x) = E[Y(x)] and its covariance function C(x, x') = E[(Y(x) - μ(x))(Y(x') - μ(x'))]; any finite set of points will have a joint multivariate Gaussian distribution. Below we consider Gaussian processes which have μ(x) ≡ 0.
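To make the definition concrete, the short sketch below evaluates a zero-mean Gaussian process at a finite set of inputs: the function values are simply a draw from a multivariate Gaussian whose covariance matrix is built from a covariance function. This is only an illustration under our own assumptions; the squared-exponential covariance and the input points used here are arbitrary stand-ins, not the covariance actually used in this paper (that one is given in equation (3) below).

```python
import numpy as np

def cov(x1, x2, length_scale=1.0, signal_var=1.0):
    """Placeholder squared-exponential covariance C(x, x') for illustration."""
    d = x1 - x2
    return signal_var * np.exp(-0.5 * np.dot(d, d) / length_scale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(5, 1))          # five arbitrary 1-d inputs

# Covariance matrix for this finite set of points; any such set is jointly Gaussian.
K = np.array([[cov(xi, xj) for xj in X] for xi in X])
K += 1e-9 * np.eye(len(X))                        # small jitter for numerical stability

# One draw of the function values Y(x^(1)), ..., Y(x^(5)) under the zero-mean GP prior.
y = rng.multivariate_normal(np.zeros(len(X)), K)
print(y)
```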
In section 2.1 we will show how to parameterise covariances using hyperparameters; for now we consider the form of the covariance C as given. The training data consist of n pairs of inputs and targets {(x^{(i)}, t^{(i)}), i = 1, ..., n}. The input vector for a test case is denoted x (with no superscript). The inputs are d-dimensional, x_1, ..., x_d, and the targets are scalar.

The predictive distribution for a test case x is obtained from the (n+1)-dimensional joint Gaussian distribution for the outputs of the n training cases and the test case, by conditioning on the observed targets in the training set. This procedure is illustrated in Figure 1 for the case where there is one training point and one test point. In general, the predictive distribution is Gaussian with mean and variance

    k^T(x) \, K^{-1} t,                                   (1)
    C(x, x) - k^T(x) \, K^{-1} k(x),                      (2)

where k(x) = (C(x, x^{(1)}), ..., C(x, x^{(n)}))^T, K is the covariance matrix for the training cases, K_{ij} = C(x^{(i)}, x^{(j)}), and t = (t^{(1)}, ..., t^{(n)})^T.

[Figure 1: An illustration of prediction using a Gaussian process. There is one training case (x^{(1)}, t^{(1)}) and one test case for which we wish to predict y. The ellipse in the left-hand plot is the one-standard-deviation contour of the joint distribution of y_1 and y. The dotted line represents an observation y_1 = t^{(1)}. In the right-hand plot we see the distribution of the output for the test case, obtained by conditioning on the observed target. The y axes have the same scale in both plots.]

The matrix inversion step in equations (1) and (2) implies that the algorithm has O(n^3) time complexity (if standard methods of matrix inversion are employed); for a few hundred data points this is certainly feasible on workstation computers, although for larger problems some iterative methods or approximations may be needed.

2.1 PARAMETERIZING THE COVARIANCE FUNCTION

There are many choices of covariance function which may be reasonable. Formally, we are required to specify functions which will generate a non-negative definite covariance matrix for any set of points (x^{(1)}, ..., x^{(k)}). From a modelling point of view we wish to specify covariances so that points with nearby inputs will give rise to similar predictions. We find that the following covariance function works well:

    C(x^{(i)}, x^{(j)}) = v_0 \exp\Big\{ -\tfrac{1}{2} \sum_{l=1}^{d} w_l \big(x_l^{(i)} - x_l^{(j)}\big)^2 \Big\}
                          + a_0 + a_1 \sum_{l=1}^{d} x_l^{(i)} x_l^{(j)} + v_1 \delta(i, j),        (3)

where θ = log(v_0, v_1, w_1, ..., w_d, a_0, a_1) plays the role of the hyperparameters.[1] We define the hyperparameters to be the log of the variables in equation (3) since these are positive scale-parameters.

[1] We call θ the hyperparameters as they correspond closely to hyperparameters in neural networks; in effect the weights have been integrated out exactly.
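As a concrete reference, here is a small sketch of equation (3) and of the covariance matrix K it generates for a set of training inputs. The function and variable names are ours, and the hyperparameter settings in the example are arbitrary placeholders rather than values used in the paper.

```python
import numpy as np

def covariance(xi, xj, i, j, theta):
    """Covariance function of equation (3); theta holds the *log* scale parameters."""
    v0, v1 = np.exp(theta["log_v0"]), np.exp(theta["log_v1"])
    a0, a1 = np.exp(theta["log_a0"]), np.exp(theta["log_a1"])
    w = np.exp(theta["log_w"])                       # one w_l per input dimension (ARD)
    local = v0 * np.exp(-0.5 * np.sum(w * (xi - xj) ** 2))
    linear = a0 + a1 * np.dot(xi, xj)
    noise = v1 if i == j else 0.0                    # the delta(i, j) noise term
    return local + linear + noise

def covariance_matrix(X, theta):
    """K_ij = C(x^(i), x^(j)) for the training inputs X (an n x d array)."""
    n = X.shape[0]
    return np.array([[covariance(X[i], X[j], i, j, theta) for j in range(n)]
                     for i in range(n)])

# Example with arbitrary hyperparameter settings for d = 3 inputs.
theta = {"log_v0": 0.0, "log_v1": -2.0, "log_a0": -3.0, "log_a1": -3.0,
         "log_w": np.full(3, -2.0)}
X = np.random.default_rng(1).normal(size=(10, 3))
K = covariance_matrix(X, theta)
```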
The covariance function is made up of three parts: the first term, a linear regression term (involving a_0 and a_1), and a noise term v_1 δ(i, j). The first term expresses the idea that cases with nearby inputs will have highly correlated outputs; the w_l parameters allow a different distance measure for each input dimension. For irrelevant inputs, the corresponding w_l will become small, and the model will ignore that input. This is closely related to the Automatic Relevance Determination (ARD) idea of MacKay and Neal (MacKay, 1993; Neal, 1995). The v_0 variable gives the overall scale of the local correlations. This covariance function is valid for all input dimensionalities, in contrast to splines, where the integrated squared m-th derivative is only a valid regularizer for 2m > d (see Wahba, 1990). a_0 and a_1 are variables controlling the scale of the bias and linear contributions to the covariance. The last term accounts for the noise on the data; v_1 is the variance of the noise.

Given a covariance function, the log likelihood of the training data is given by

    l = -\tfrac{1}{2} \log \det K - \tfrac{1}{2} t^T K^{-1} t - \tfrac{n}{2} \log 2\pi.        (4)

In section 3 we will discuss how the hyperparameters in C can be adapted in response to the training data.

2.2 RELATIONSHIP TO PREVIOUS WORK

The Gaussian process view provides a unifying framework for many regression methods. ARMA models used in time series analysis and spline smoothing (e.g. Wahba, 1990, and earlier references therein) correspond to Gaussian process prediction with a particular choice of covariance function.[2] Gaussian processes have also been used in the geostatistics field (e.g. Cressie, 1993), and are known there as "kriging", but this literature has concentrated on the case where the input space is two or three dimensional, rather than considering more general input spaces.

[2] Technically, splines require generalized covariance functions.

This work is similar to Regularization Networks (Poggio and Girosi, 1990; Girosi, Jones and Poggio, 1995), except that their derivation uses a smoothness functional rather than the equivalent covariance function. Poggio et al. suggested that the hyperparameters be set by cross-validation. The main contributions of this paper are to emphasize that a maximum likelihood solution for θ is possible, to recognize the connections to ARD and to use the Hybrid Monte Carlo method in the Bayesian treatment (see section 3).

3 TRAINING A GAUSSIAN PROCESS

The partial derivatives of the log likelihood of the training data l with respect to all the hyperparameters can be computed using matrix operations, and take time O(n^3). In this section we present two methods which can be used to adapt the hyperparameters using these derivatives.
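A minimal sketch of how equation (4) can be evaluated with standard matrix operations is given below. The Cholesky factorisation (rather than an explicit inverse) is our implementation choice for numerical stability; the covariance matrix is taken as given, for example built as in the sketch after equation (3). The gradient shown is the standard matrix identity for derivatives of equation (4), stated here for completeness rather than quoted from the paper.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_likelihood(K, t):
    """Equation (4): l = -1/2 log det K - 1/2 t^T K^{-1} t - (n/2) log 2*pi."""
    n = len(t)
    L, lower = cho_factor(K, lower=True)
    alpha = cho_solve((L, lower), t)                     # K^{-1} t
    log_det_K = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * log_det_K - 0.5 * t @ alpha - 0.5 * n * np.log(2.0 * np.pi)

def log_likelihood_gradient(K, dK_dtheta, t):
    """dl/dtheta_j = 1/2 tr((alpha alpha^T - K^{-1}) dK/dtheta_j), with alpha = K^{-1} t.

    dK_dtheta is a list of n x n matrices, one per hyperparameter; each derivative
    matrix follows from differentiating equation (3) with respect to that parameter."""
    L, lower = cho_factor(K, lower=True)
    alpha = cho_solve((L, lower), t)
    K_inv = cho_solve((L, lower), np.eye(len(t)))
    inner = np.outer(alpha, alpha) - K_inv
    return np.array([0.5 * np.trace(inner @ dK) for dK in dK_dtheta])
```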
3.1 MAXIMUM LIKELIHOOD

In a maximum likelihood framework, we adjust the hyperparameters so as to maximize the likelihood of the training data. We initialize the hyperparameters to random values (in a reasonable range) and then use an iterative method, for example conjugate gradients, to search for optimal values of the hyperparameters. Since there are only a small number of hyperparameters (d + 4), a relatively small number of iterations is usually sufficient for convergence. However, we have found that this approach is sometimes susceptible to local minima, so it is advisable to try a number of random starting positions in hyperparameter space.

3.2 INTEGRATION VIA HYBRID MONTE CARLO

According to the Bayesian formalism, we should start with a prior distribution P(θ) over the hyperparameters, which is modified using the training data D to produce a posterior distribution P(θ|D). To make predictions we then integrate over the posterior; for example, the predicted mean ŷ(x) for test input x is given by

    \hat{y}(x) = \int \hat{y}_{\theta}(x) \, P(\theta \mid D) \, d\theta,        (5)

where ŷ_θ(x) is the predicted mean (as given by equation (1)) for a particular value of θ. It is not feasible to do this integration analytically, but the Markov chain Monte Carlo method of Hybrid Monte Carlo (HMC) (Duane et al., 1987) seems promising for this application. We assign broad Gaussian priors to the hyperparameters, and use Hybrid Monte Carlo to give us samples from the posterior.

HMC works by creating a fictitious dynamical system in which the hyperparameters are regarded as position variables, and augmenting these with momentum variables p. The purpose of the dynamical system is to give the hyperparameters "inertia" so that random-walk behaviour in θ-space can be avoided. The total energy, H, of the system is the sum of the kinetic energy, K (a function of the momenta), and the potential energy, E. The potential energy is defined such that P(θ|D) ∝ exp(-E). We sample from the joint distribution for θ and p given by P(θ, p) ∝ exp(-E - K); the marginal of this distribution for θ is the required posterior. A sample of hyperparameters from the posterior can therefore be obtained by simply ignoring the momenta.

Sampling from the joint distribution is achieved in two steps: (i) finding new points in phase space with near-identical energies H by simulating the dynamical system using a discretised approximation to Hamiltonian dynamics, and (ii) changing the energy H by doing Gibbs sampling for the momentum variables.

Hamiltonian Dynamics

Hamilton's first-order differential equations for H are approximated by discrete steps (specifically using the leapfrog method). The derivatives of the likelihood (equation (4)) enter through the derivative of the potential energy. The proposed state is then accepted or rejected using the Metropolis rule, depending on the final energy H* (which is not necessarily equal to the initial energy H because of the discretization). The same step size ε is used for all hyperparameters, and should be as large as possible while keeping the rejection rate low.

Gibbs Sampling for Momentum Variables

The momentum variables are updated using a modified version of Gibbs sampling, thereby allowing the energy H to change. A "persistence" of 0.95 is used; the new value of the momentum is a weighted sum of the previous value (with weight 0.95) and the value obtained by Gibbs sampling (with weight (1 - 0.95^2)^{1/2}). With this form of persistence, the momenta change approximately twenty times more slowly, thus increasing the "inertia" of the hyperparameters, so as to further help in avoiding random walks. Larger values of the persistence will further increase the inertia, but reduce the rate of exploration of H.
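The sketch below shows one HMC update in this style: a few leapfrog steps through the fictitious dynamical system, a Metropolis accept/reject decision on the change in total energy, and a partial (persistence 0.95) Gibbs refresh of the momenta. The potential energy and its gradient are passed in as functions; in our setting E(θ) would be the negative log posterior built from equation (4) and the Gaussian priors over θ. The default step size, the number of leapfrog steps, and negating the momentum on rejection are choices made for this sketch, not details taken from the paper.

```python
import numpy as np

def hmc_update(theta, p, energy, grad_energy, eps=0.05, n_leapfrog=20,
               persistence=0.95, rng=np.random.default_rng()):
    """One Hybrid Monte Carlo update of the hyperparameters theta (a sketch).

    energy(theta) is the potential E with P(theta|D) proportional to exp(-E);
    grad_energy(theta) is its gradient. Kinetic energy is 0.5 * p.p."""
    # (i) Leapfrog simulation of the Hamiltonian dynamics.
    theta_new, p_new = theta.copy(), p.copy()
    p_new -= 0.5 * eps * grad_energy(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new += eps * p_new
        p_new -= eps * grad_energy(theta_new)
    theta_new += eps * p_new
    p_new -= 0.5 * eps * grad_energy(theta_new)

    # Metropolis rule on the total energy H = E + K.
    h_old = energy(theta) + 0.5 * p @ p
    h_new = energy(theta_new) + 0.5 * p_new @ p_new
    if rng.random() < np.exp(min(0.0, h_old - h_new)):
        theta, p = theta_new, p_new          # accept the proposed state
    else:
        p = -p                               # reject; negate momentum (a common choice)

    # (ii) Partial Gibbs refresh of the momenta ("persistence" of 0.95).
    p = persistence * p + np.sqrt(1.0 - persistence**2) * rng.standard_normal(p.shape)
    return theta, p
```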
Practical Details

The priors over hyperparameters are set to be Gaussian with a mean of -3 and a standard deviation of 3. In all our simulations a step size ε = 0.05 produced a very low rejection rate (≪ 1%). The hyperparameters corresponding to v_1 and to the w_l's were initialised to -2 and the rest to 0.

To apply the method we first rescale the inputs and outputs so that they have a mean of zero and a variance of one on the training set. The sampling procedure is run for the desired amount of time, saving the values of the hyperparameters 200 times during the last two-thirds of the run. The first third of the run is discarded; this "burn-in" is intended to give the hyperparameters time to come close to their equilibrium distribution. The predictive distribution is then a mixture of 200 Gaussians. For a squared error loss, we use the mean of this distribution as a point estimate. The width of the predictive distribution tells us the uncertainty of the prediction.

4 EXPERIMENTAL RESULTS

We report the results of prediction with Gaussian processes on (i) a modified version of MacKay's robot arm problem and (ii) five real-world data sets.

4.1 THE ROBOT ARM PROBLEM

We consider a version of MacKay's robot arm problem introduced by Neal (1995). The standard robot arm problem is concerned with the mappings

    y_1 = r_1 \cos x_1 + r_2 \cos(x_1 + x_2), \qquad y_2 = r_1 \sin x_1 + r_2 \sin(x_1 + x_2).        (6)

The data was generated by picking x_1 uniformly from [-1.932, -0.453] and [0.453, 1.932] and picking x_2 uniformly from [0.534, 3.142]. Neal added four further inputs, two of which were copies of x_1 and x_2 corrupted by additive Gaussian noise of standard deviation 0.02, and two further irrelevant Gaussian-noise inputs with zero mean and unit variance. Independent zero-mean Gaussian noise of variance 0.0025 was then added to the outputs y_1 and y_2. We used the same datasets as Neal and MacKay, with 200 examples in the training set and 200 in the test set.
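For reference, a small sketch of how such a dataset could be generated is given below. The arm-segment lengths r_1 and r_2 of the standard problem are not restated in this paper, so the values used here, like the helper itself, are placeholders of ours rather than the exact settings used by Neal and MacKay.

```python
import numpy as np

def make_robot_arm_data(n=200, r1=2.0, r2=1.3, seed=0):
    """Generate one robot-arm dataset in the style of section 4.1.

    r1 and r2 are placeholder arm lengths; the paper does not restate them."""
    rng = np.random.default_rng(seed)
    # x1 uniform on [-1.932, -0.453] union [0.453, 1.932]; x2 uniform on [0.534, 3.142].
    sign = rng.choice([-1.0, 1.0], size=n)
    x1 = sign * rng.uniform(0.453, 1.932, size=n)
    x2 = rng.uniform(0.534, 3.142, size=n)

    # Equation (6) plus output noise of variance 0.0025 (standard deviation 0.05).
    y1 = r1 * np.cos(x1) + r2 * np.cos(x1 + x2) + rng.normal(0.0, 0.05, size=n)
    y2 = r1 * np.sin(x1) + r2 * np.sin(x1 + x2) + rng.normal(0.0, 0.05, size=n)

    # Neal's six-input version: noisy copies of x1, x2 (std 0.02) plus two irrelevant inputs.
    X = np.column_stack([x1, x2,
                         x1 + rng.normal(0.0, 0.02, size=n),
                         x2 + rng.normal(0.0, 0.02, size=n),
                         rng.normal(0.0, 1.0, size=n),
                         rng.normal(0.0, 1.0, size=n)])
    return X, np.column_stack([y1, y2])

X_train, Y_train = make_robot_arm_data(n=200, seed=0)
X_test, Y_test = make_robot_arm_data(n=200, seed=1)
```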
The theory described in section 2 deals only with the prediction of a scalar quantity y, so predictors were constructed for the two outputs separately, although a joint prediction is possible within the Gaussian process framework (see co-kriging, §3.2.3 in Cressie, 1993).

Two experiments were conducted, the first using only the two "true" inputs, and the second using all six inputs. In this section we report results using maximum likelihood training; similar results were obtained with HMC. The log(v)'s and log(w)'s were all initialized to values chosen uniformly from [-3.0, 0.0], and were adapted separately for the prediction of y_1 and y_2 (in these early experiments the linear regression terms in the covariance function involving a_0 and a_1 were not present). The conjugate gradient search algorithm was allowed to run for 100 iterations, by which time the likelihood was changing very slowly. Results are reported for the run which gave the highest likelihood of the training data, although in fact all runs performed very similarly.

The results are shown in Table 1 and are encouraging, as they indicate that the Gaussian process approach gives very similar performance to two well-respected techniques. All of the methods obtain a level of performance which is quite close to the theoretical minimum error level of 1.0. It is interesting to look at the values of the w's obtained after the optimization; for the y_2 task the values were 0.243, 0.237, 0.0639, 7.0 x 10^{-4}, 2.32 x 10^{-6}, 1.70 x 10^{-6}, and v_0 and v_1 were 7.5278 and 0.0022 respectively. The w values show nicely that the first two inputs are the most important, followed by the corrupted inputs and then the irrelevant inputs. During training the irrelevant inputs are detected quite quickly, but the w's for the corrupted inputs shrink more slowly, implying that the input noise has relatively little effect on the likelihood.

    Method              No. of inputs    Sum squared test error
    Gaussian process          2                 1.126
    Gaussian process          6                 1.138
    MacKay                    2                 1.146
    Neal                      2                 1.094
    Neal                      6                 1.098

Table 1: Results on the robot arm task. The bottom three lines of data were obtained from Neal (1995). The MacKay result is the test error for the net with the highest "evidence".
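Putting the pieces together, the following self-contained sketch mimics the maximum likelihood procedure of section 3.1 on data of this kind: it builds the covariance of equation (3) without the linear terms (as in these experiments), maximizes the log likelihood of equation (4) with a conjugate gradient optimizer, and predicts with equations (1) and (2). The scipy call, its finite-difference gradients, the number of restarts and the small synthetic dataset are illustrative choices of ours standing in for the actual robot-arm setup and search used in the paper.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def build_K(X1, X2, theta, noise):
    """Covariance of equation (3) without the linear terms: v0, per-input w_l, noise v1."""
    w = np.exp(theta[2:])
    sq = np.sum(w * (X1[:, None, :] - X2[None, :, :]) ** 2, axis=2)
    K = np.exp(theta[0]) * np.exp(-0.5 * sq)
    if noise:
        K = K + np.exp(theta[1]) * np.eye(len(X1))   # v1 * delta(i, j) on the diagonal
    return K

def neg_log_likelihood(theta, X, t):
    """Negative of equation (4), evaluated via a Cholesky factorisation."""
    K = build_K(X, X, theta, noise=True)
    L = cho_factor(K, lower=True)
    alpha = cho_solve(L, t)
    return 0.5 * (2.0 * np.sum(np.log(np.diag(L[0]))) + t @ alpha + len(t) * np.log(2 * np.pi))

def predict(theta, X, t, X_star):
    """Equations (1) and (2) for the test inputs X_star."""
    K = build_K(X, X, theta, noise=True)
    k_star = build_K(X_star, X, theta, noise=False)
    L = cho_factor(K, lower=True)
    mean = k_star @ cho_solve(L, t)
    prior_var = np.exp(theta[0]) + np.exp(theta[1])  # C(x, x) for this covariance
    var = prior_var - np.sum(k_star * cho_solve(L, k_star.T).T, axis=1)
    return mean, var

# Small synthetic 2-d regression problem standing in for the robot-arm data.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(50, 2))
t = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(0, 0.05, size=50)

# A few random restarts of a conjugate gradient search, as advised in section 3.1.
best = None
for _ in range(3):
    theta0 = rng.uniform(-3.0, 0.0, size=2 + X.shape[1])   # [log v0, log v1, log w_1..w_d]
    res = minimize(neg_log_likelihood, theta0, args=(X, t), method="CG")
    if best is None or res.fun < best.fun:
        best = res

mean, var = predict(best.x, X, t, rng.uniform(-2, 2, size=(5, 2)))
print(mean, var)
```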
4.2 FIVE REAL-WORLD PROBLEMS

Gaussian processes as described above were compared to several other regression algorithms on five real-world data sets in (Rasmussen, 1996; in this volume). The data sets had between 80 and 256 training examples, and the input dimension ranged from 6 to 16. The length of the HMC sampling for the Gaussian processes was from 7.5 minutes for the smallest training set size up to 1 hour for the largest ones on a R4400 machine. The results rank the methods in the order (lowest error first): a full-blown Bayesian treatment of neural networks using HMC, Gaussian processes, ensembles of neural networks trained using cross-validation and weight decay, the Evidence framework for neural networks (MacKay, 1992), and MARS. We are currently working on assessing the statistical significance of this ordering.

5 DISCUSSION

We have presented the method of regression with Gaussian processes, and shown that it performs well on a suite of real-world problems.

We have also conducted some experiments on the approximation of neural nets (with a finite number of hidden units) by Gaussian processes, although space limitations do not allow these to be described here. Some other directions currently under investigation include (i) the use of Gaussian processes for classification problems by softmaxing the outputs of k regression surfaces (for a k-class classification problem), (ii) using non-stationary covariance functions, so that C(x, x') ≠ C(|x - x'|), and (iii) using a covariance function containing a sum of two or more terms of the form given in the first line of equation (3).

We hope to make our code for Gaussian process prediction publicly available in the near future. Check http://www.cs.utoronto.ca/neuron/delve/delve.html for details.

Acknowledgements

We thank Radford Neal for many useful discussions, David MacKay for generously providing the robot arm data used in this paper, and Chris Bishop, Peter Dayan, Radford Neal and Huaiyu Zhu for comments on earlier drafts. CW was partially supported by EPSRC grant GR/J75425.

References

Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195:216-222.

Girosi, F., Jones, M., and Poggio, T. (1995). Regularization Theory and Neural Networks Architectures. Neural Computation, 7(2):219-269.

MacKay, D. J. C. (1992). A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 4(3):448-472.

MacKay, D. J. C. (1993). Bayesian Methods for Backpropagation Networks. In van Hemmen, J. L., Domany, E., and Schulten, K., editors, Models of Neural Networks II. Springer.

Neal, R. M. (1993). Bayesian Learning via Stochastic Dynamics. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Neural Information Processing Systems, Vol. 5, pages 475-482. Morgan Kaufmann, San Mateo, CA.

Neal, R. M. (1995). Bayesian Learning for Neural Networks. PhD thesis, Dept. of Computer Science, University of Toronto.

Poggio, T. and Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78:1481-1497.

Rasmussen, C. E. (1996). A Practical Monte Carlo Implementation of Bayesian Learning. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8. MIT Press.

Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics. CBMS-NSF Regional Conference Series in Applied Mathematics.