{"title": "Predictive App roaches for Choosing Hyperparameters in Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 631, "page_last": 637, "abstract": null, "full_text": "Predictive Approaches For  Choosing \n\nHyperparameters in  Gaussian  Processes \n\nS.  Sundararajan \n\nS.  Sathiya Keerthi \n\nComputer Science and Automation \n\nIndian Institute of Science \nBangalore 560  012, India \nsundar@csa.iisc. ernet. in \n\nMechanical and Production Engg. \nNational University of Singapore \n\n10  Kentridge Crescent, Singapore 119260 \n\nmpessk@guppy. mpe. nus. edu. sg \n\nAbstract \n\nGaussian  Processes  are  powerful  regression  models  specified  by \nparametrized mean and covariance functions.  Standard approaches \nto  estimate  these  parameters  (known  by  the  name  Hyperparam(cid:173)\neters)  are  Maximum  Likelihood  (ML)  and  Maximum  APosterior \n(MAP)  approaches.  In this paper, we  propose and investigate pre(cid:173)\ndictive  approaches,  namely,  maximization  of  Geisser's  Surrogate \nPredictive Probability (GPP) and minimization of mean square er(cid:173)\nror with  respect to GPP  (referred to as  Geisser's  Predictive mean \nsquare  Error  (GPE))  to  estimate  the  hyperparameters.  We  also \nderive  results  for  the  standard  Cross-Validation  (CV)  error  and \nmake  a  comparison.  These approaches are tested on  a  number of \nproblems and experimental results show that these approaches are \nstrongly competitive to existing approaches. \n\n1 \n\nIntroduction \n\nGaussian Processes (GPs) are powerful regression models that have gained popular(cid:173)\nity recently, though they have appeared in different forms in the literature for years. \nThey can be used for  classification also; see MacKay (1997), Rasmussen (1996)  and \nWilliams and Rasmussen (1996).  Here, we  restrict ourselves to regression problems. \nNeal  (1996) showed that a large class of neural network models converge to a  Gaus(cid:173)\nsian Process prior over functions  in the limit of an infinite  number of hidden units. \nAlthough  GPs can  be  created  using  infinite  networks,  often  GPs  are specified  di(cid:173)\nrectly using parametric forms for  the mean and covariance functions  (Williams and \nRasmussen (1996)).  We assume that the process is zero mean.  Let ZN  =  {XN,yN} \nwhereXN  = {xCi):  i  = 1, ... ,N}andYN  = {y(i):  i  = 1, ... ,N}.  Here,y(i) \nrepresents the output corresponding to the input  vector  xCi).  Then, the Gaussian \nprior over the functions  is  given by \n\n(1) \n\nwhere eN is  the covariance matrix with  (i,j)th  element  [CN]i,j \nC(x(i),x(j);8) \nand C(.; 8) denotes the parametrized covariance function.  Now,  assuming that the \n\n\f632 \n\nS.  Sundararajan and S.  S.  Keerthi \n\nobserved output tN is modeled as tN  =  YN + eN and eN is zero mean multivariate \nGaussian with covariance matrix 0'2IN  and is  independent of YN,  we  get \n\n(t  IX  9)  =  exp(-t~Ci\\/tN) \np  N  N, \n\n(27r)~ICNli \n\n(2) \n\nwhere eN = eN + 0'2IN.  Therefore, [eN kj = [eN kj + 0'2 bi,j, where bi,j  = 1 when \ni  = j  and zero otherwise.  Note that 9 = (9,0'2)  is the new set of hyperparameters. \nThen, the predictive distribution ofthe output yeN + 1) for a test case x(N + 1)  is \nalso Gaussian with mean and variance \n\n(3) \n\nand \n\nM \n\nM \n\nO';(N+1)  =  bN+1  - k~+1 C;/kN +1 \n\n(4) \nC(x(N + 1), x(N + 1); 9)  and  kN+l  is  an  N  x  1  vector  with  ith \nwhere  bN+1 \nelement  given  by  C(x(N + 1),x(i); 9).  
Now, we need to specify the covariance function C(.;\theta). Williams and Rasmussen (1996) found the following covariance function to work well in practice:

C(x(i), x(j); \theta) = a_0 + a_1 \sum_{p=1}^{M} x_p(i) x_p(j) + v_0 \exp\left(-\frac{1}{2} \sum_{p=1}^{M} w_p (x_p(i) - x_p(j))^2\right)    (5)

where x_p(i) is the pth component of the ith input vector x(i). The w_p are the Automatic Relevance Determination (ARD) parameters. Note that \tilde{C}(x(i), x(j); \tilde{\theta}) = C(x(i), x(j); \theta) + \sigma^2 \delta_{i,j}. Also, all the parameters are positive and it is convenient to work on a logarithmic scale. Hence, \tilde{\theta} is given by \log(a_0, a_1, v_0, w_1, ..., w_M, \sigma^2). Then, the question is: how do we handle \tilde{\theta}? More sophisticated techniques like Hybrid Monte Carlo (HMC) methods (Rasmussen (1996) and Neal (1997)) are available which can numerically integrate over the hyperparameters to make predictions. Alternately, we can estimate \tilde{\theta} from the training data. We restrict ourselves to the latter approach here. In the classical approach, \tilde{\theta} is assumed to be deterministic but unknown and the estimate is found by maximizing the likelihood (2). That is, \tilde{\theta}_{ML} = \arg\max_{\tilde{\theta}} p(t_N | X_N, \tilde{\theta}). In the Bayesian approach, \tilde{\theta} is assumed to be random and a prior p(\tilde{\theta}) is specified. Then, the MAP estimate \tilde{\theta}_{MP} is obtained as \tilde{\theta}_{MP} = \arg\max_{\tilde{\theta}} p(t_N | X_N, \tilde{\theta}) p(\tilde{\theta}), with the motivation that the predictive distribution p(y(N+1) | x(N+1), Z_N) can be approximated as p(y(N+1) | x(N+1), Z_N, \tilde{\theta}_{MP}). With this background, in this paper we propose and investigate different predictive approaches to estimate the hyperparameters from the training data.

2 Predictive approaches for choosing hyperparameters

Geisser (1975) proposed the Predictive Sample Reuse (PSR) methodology, which can be applied to both model selection and parameter estimation problems. The basic idea is to define a partition scheme P(N, n, \Gamma) such that P^{(i)}_{N,n} = (Z^{(i)}_{N-n}, Z^{(i)}_n) is the ith partition belonging to a set \Gamma of partitions, with Z^{(i)}_{N-n} and Z^{(i)}_n representing the N - n retained and the n omitted data sets respectively. Then, the unknown \tilde{\theta} is estimated (or a model M_j is chosen among a set of models indexed by j = 1, ..., J) by optimizing a predictive measure that quantifies the predictive performance on the omitted observations Z^{(i)}_n using the retained observations Z^{(i)}_{N-n}, averaged over the partitions (i \in \Gamma). In the special case of n = 1, we have the leave-one-out strategy. Note that this approach was independently presented under the name of cross-validation (CV) by Stone (1974). Well known examples are the standard CV error and the negative of the average predictive likelihood. Geisser and Eddy (1979) proposed to maximize \prod_{i=1}^{N} p(t(i) | x(i), Z_N^{(i)}, M_j) (known as Geisser's Surrogate Predictive Probability (GPP)) by synthesizing the Bayesian and PSR methodologies in the context of (parametrized) model selection. Here, we propose to maximize \prod_{i=1}^{N} p(t(i) | x(i), Z_N^{(i)}, \tilde{\theta}) to estimate \tilde{\theta}, where Z_N^{(i)} is obtained from Z_N by removing the ith sample. Note that p(t(i) | x(i), Z_N^{(i)}, \tilde{\theta}) is nothing but the predictive distribution p(y(i) | x(i), Z_N^{(i)}, \tilde{\theta}) evaluated at y(i) = t(i); a direct leave-one-out computation of this quantity is sketched below.
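For concreteness, here is a naive leave-one-out computation of the quantity just described, using the covariance function (5). It is a minimal illustrative sketch under our own naming (ard_cov, loo_log_predictive, and the parameter names are hypothetical); it refits on N-1 points for every i, which costs O(N^4) and is only meant to make the definition explicit. The efficient formulas come from Theorem 1 below.

    import numpy as np

    def ard_cov(xi, xj, a0, a1, v0, w):
        # Equation (5): linear term plus ARD-weighted squared-exponential term.
        return a0 + a1 * np.dot(xi, xj) + v0 * np.exp(-0.5 * np.sum(w * (xi - xj) ** 2))

    def loo_log_predictive(X, t, a0, a1, v0, w, sigma2):
        """Average leave-one-out log predictive density (GPP maximizes this)."""
        N = X.shape[0]
        C = np.array([[ard_cov(X[i], X[j], a0, a1, v0, w) for j in range(N)]
                      for i in range(N)]) + sigma2 * np.eye(N)
        total = 0.0
        for i in range(N):
            keep = np.arange(N) != i
            C_i = C[np.ix_(keep, keep)]        # \tilde{C}_N with row/column i removed
            c_i = C[keep, i]                   # covariances to the held-out point
            sol = np.linalg.solve(C_i, t[keep])
            mean = c_i @ sol                                       # \hat{y}(i)
            var = C[i, i] - c_i @ np.linalg.solve(C_i, c_i)        # \sigma_y^2(i)
            total += -0.5 * ((t[i] - mean) ** 2 / var + np.log(2 * np.pi * var))
        return total / N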
Also, we introduce the notion of Geisser's Predictive mean square Error (GPE), defined as \frac{1}{N} \sum_{i=1}^{N} E[(y(i) - t(i))^2] (where the expectation is taken with respect to p(y(i) | x(i), Z_N^{(i)}, \tilde{\theta})), and propose to estimate \tilde{\theta} by minimizing GPE.

2.1 Expressions for GPP and its gradient

The objective function corresponding to GPP is given by

G(\tilde{\theta}) = -\frac{1}{N} \sum_{i=1}^{N} \log p(t(i) | x(i), Z_N^{(i)}, \tilde{\theta})    (6)

From (3) and (4) we get

G(\tilde{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \frac{(t(i) - \hat{y}(i))^2}{2 \sigma_y^2(i)} + \frac{1}{2N} \sum_{i=1}^{N} \log \sigma_y^2(i) + \frac{1}{2} \log 2\pi    (7)

where \hat{y}(i) = [\tilde{c}_N^{(i)}]^T [\tilde{C}_N^{(i)}]^{-1} t_N^{(i)} and \sigma_y^2(i) = \tilde{c}_{ii} - [\tilde{c}_N^{(i)}]^T [\tilde{C}_N^{(i)}]^{-1} \tilde{c}_N^{(i)}. Here, \tilde{C}_N^{(i)} is the (N-1) x (N-1) matrix obtained from \tilde{C}_N by removing the ith column and ith row. Similarly, t_N^{(i)} and \tilde{c}_N^{(i)} are obtained from t_N and \tilde{c}_i (i.e., the ith column of \tilde{C}_N) respectively by removing the ith element. Then, G(\tilde{\theta}) and its gradient can be computed efficiently using the following result.

Theorem 1 The objective function G(\tilde{\theta}) under the Gaussian Process model is given by

G(\tilde{\theta}) = \frac{1}{2N} \sum_{i=1}^{N} \frac{q_N^2(i)}{\bar{c}_{ii}} - \frac{1}{2N} \sum_{i=1}^{N} \log \bar{c}_{ii} + \frac{1}{2} \log 2\pi    (8)

where \bar{c}_{ii} denotes the ith diagonal entry of \tilde{C}_N^{-1} and q_N(i) denotes the ith element of q_N = \tilde{C}_N^{-1} t_N. Its gradient is given by

\frac{\partial G(\tilde{\theta})}{\partial \theta_j} = \frac{1}{2N} \sum_{i=1}^{N} \left(1 + \frac{q_N^2(i)}{\bar{c}_{ii}}\right) \frac{s_{j,i}}{\bar{c}_{ii}} + \frac{1}{N} \sum_{i=1}^{N} q_N(i) \frac{r_j(i)}{\bar{c}_{ii}}    (9)

where s_{j,i} = \bar{c}_i^T \frac{\partial \tilde{C}_N}{\partial \theta_j} \bar{c}_i and r_j = -\tilde{C}_N^{-1} \frac{\partial \tilde{C}_N}{\partial \theta_j} \tilde{C}_N^{-1} t_N = \frac{\partial q_N}{\partial \theta_j}. Here, \bar{c}_i denotes the ith column of the matrix \tilde{C}_N^{-1}.

Thus, using (8) and (9) we can compute GPP and its gradient. We will give a meaningful interpretation to the different terms shortly.

2.2 Expressions for the CV function and its gradient

We define the CV function as

H(\tilde{\theta}) = \frac{1}{N} \sum_{i=1}^{N} (t(i) - \hat{y}(i))^2    (10)

where \hat{y}(i) is the mean of the conditional predictive distribution as given above. Now, using the following result we can compute H(\tilde{\theta}) efficiently.

Theorem 2 The CV function H(\tilde{\theta}) under the Gaussian model is given by

H(\tilde{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \frac{q_N^2(i)}{\bar{c}_{ii}^2}    (11)

and its gradient is given by

\frac{\partial H(\tilde{\theta})}{\partial \theta_j} = \frac{2}{N} \sum_{i=1}^{N} \left( \frac{q_N(i)\, r_j(i)}{\bar{c}_{ii}^2} + \frac{q_N^2(i)\, s_{j,i}}{\bar{c}_{ii}^3} \right)    (12)

where s_{j,i}, r_j, q_N(i) and \bar{c}_{ii} are as defined in Theorem 1.

2.3 Expressions for GPE and its gradient

The GPE function is defined as

G_E(\tilde{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \int (t(i) - y(i))^2 \, p(y(i) | x(i), Z_N^{(i)}, \tilde{\theta}) \, dy(i)    (13)

which can be readily simplified to

G_E(\tilde{\theta}) = \frac{1}{N} \sum_{i=1}^{N} (t(i) - \hat{y}(i))^2 + \frac{1}{N} \sum_{i=1}^{N} \sigma_y^2(i)    (14)

On comparing (14) with (10), we see that while the CV error minimizes the deviation from the predictive mean, GPE takes the predictive variance also into account. Now, the gradient can be written as

\frac{\partial G_E(\tilde{\theta})}{\partial \theta_j} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{2 q_N(i)\, r_j(i)}{\bar{c}_{ii}^2} + \frac{2 q_N^2(i)\, s_{j,i}}{\bar{c}_{ii}^3} + \frac{s_{j,i}}{\bar{c}_{ii}^2} \right)    (15)

where we have used the results \sigma_y^2(i) = \frac{1}{\bar{c}_{ii}}, \frac{\partial \bar{c}_{ii}}{\partial \theta_j} = e_i^T \frac{\partial \tilde{C}_N^{-1}}{\partial \theta_j} e_i and \frac{\partial \tilde{C}_N^{-1}}{\partial \theta_j} = -\tilde{C}_N^{-1} \frac{\partial \tilde{C}_N}{\partial \theta_j} \tilde{C}_N^{-1}. Here, e_i denotes the ith column vector of the identity matrix I_N.
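Given the quantities of Theorems 1 and 2, all three criteria can be evaluated from a single inverse of \tilde{C}_N. The sketch below is ours (the names predictive_criteria, C_tilde and t are hypothetical) and simply mirrors equations (8), (11) and (14).

    import numpy as np

    def predictive_criteria(C_tilde, t):
        """Return (GPP objective G, CV error H, GPE G_E) for given \tilde{C}_N and t_N."""
        Cinv = np.linalg.inv(C_tilde)          # \tilde{C}_N^{-1}, computed once
        cbar = np.diag(Cinv)                   # \bar{c}_{ii}
        q = Cinv @ t                           # q_N = \tilde{C}_N^{-1} t_N
        G = 0.5 * np.mean(q**2 / cbar) - 0.5 * np.mean(np.log(cbar)) \
            + 0.5 * np.log(2 * np.pi)          # equation (8)
        H = np.mean(q**2 / cbar**2)            # equation (11)
        GE = H + np.mean(1.0 / cbar)           # equation (14), using sigma_y^2(i) = 1/cbar_ii
        return G, H, GE

A quick consistency check is to compare these values with the naive leave-one-out computation sketched earlier; they agree, at a cost of one O(N^3) inversion instead of N of them.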
2.4 Interpretations

More insight can be obtained by reparametrizing the covariance function as follows:

\tilde{C}(x(i), x(j); \tilde{\theta}) = \sigma^2 \left( \bar{a}_0 + \bar{a}_1 \sum_{p=1}^{M} x_p(i) x_p(j) + \bar{v}_0 \exp\left(-\frac{1}{2} \sum_{p=1}^{M} w_p (x_p(i) - x_p(j))^2\right) + \delta_{i,j} \right)    (16)

where a_0 = \sigma^2 \bar{a}_0, a_1 = \sigma^2 \bar{a}_1, v_0 = \sigma^2 \bar{v}_0. Let us define P(x(i), x(j); \tilde{\theta}) = \frac{1}{\sigma^2} \tilde{C}(x(i), x(j); \tilde{\theta}). Then P_N^{-1} = \sigma^2 \tilde{C}_N^{-1}. Therefore, \bar{c}_{i,j} = \frac{\bar{p}_{i,j}}{\sigma^2}, where \bar{c}_{i,j}, \bar{p}_{i,j} denote the (i,j)th elements of the matrices \tilde{C}_N^{-1} and P_N^{-1} respectively. From Theorem 2 (see (10) and (11)) we have t(i) - \hat{y}(i) = \frac{q_N(i)}{\bar{c}_{ii}} = \frac{\bar{q}_N(i)}{\bar{p}_{ii}}. Then, we can rewrite (8) as

G(\tilde{\theta}) = \frac{1}{2N\sigma^2} \sum_{i=1}^{N} \frac{\bar{q}_N^2(i)}{\bar{p}_{ii}} - \frac{1}{2N} \sum_{i=1}^{N} \log \bar{p}_{ii} + \frac{1}{2} \log 2\pi\sigma^2    (17)

Here, \bar{q}_N = P_N^{-1} t_N and \bar{p}_i, \bar{p}_{ii} denote, respectively, the ith column and ith diagonal entry of the matrix P_N^{-1}. Now, by setting the derivative of (17) with respect to \sigma^2 to zero, we can infer the noise level as

\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \frac{\bar{q}_N^2(i)}{\bar{p}_{ii}}    (18)

Similarly, the CV error (10) can be rewritten as

H(\tilde{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \frac{\bar{q}_N^2(i)}{\bar{p}_{ii}^2}    (19)

Note that H(\tilde{\theta}) depends only on the ratios of the hyperparameters to the noise variance (i.e., on \bar{a}_0, \bar{a}_1, \bar{v}_0), apart from the ARD parameters. Therefore, we cannot infer the noise level uniquely. However, we can estimate the ARD parameters and the ratios \bar{a}_0, \bar{a}_1, \bar{v}_0; once these parameters have been estimated, we can use (18) to estimate the noise level. Next, we note that the noise level preferred by the GPE criterion is zero. To see this, first let us rewrite (14) under the reparametrization as

G_E(\tilde{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \frac{\bar{q}_N^2(i)}{\bar{p}_{ii}^2} + \frac{\sigma^2}{N} \sum_{i=1}^{N} \frac{1}{\bar{p}_{ii}}    (20)

Since \bar{q}_N(i) and \bar{p}_{ii} are independent of \sigma^2, it follows that the GPE criterion prefers zero as the noise level, which is not correct in general. Therefore, this approach can be applied when either the noise level is known or a good estimate of it is available.
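As a small illustration of (18), one can build P_N (the covariance (16) with \sigma^2 factored out, so only the ratios \bar{a}_0, \bar{a}_1, \bar{v}_0 and the ARD weights enter) and read off the noise estimate. The function below is our own sketch; estimate_noise_level, P and t are not names from the paper.

    import numpy as np

    def estimate_noise_level(P, t):
        """P: (N, N) matrix with entries P(x(i), x(j)); t: (N,) observed targets."""
        Pinv = np.linalg.inv(P)
        pbar = np.diag(Pinv)                   # \bar{p}_{ii}
        qbar = Pinv @ t                        # \bar{q}_N = P_N^{-1} t_N
        return np.mean(qbar**2 / pbar)         # equation (18)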
3 Simulation results

We carried out simulations on four data sets. We considered MacKay's robot arm problem and its modified version introduced by Neal (1996). We used the same data set as MacKay (2 inputs and 2 outputs), with 200 examples in the training set and 200 in the test set. This data set is referred to as 'data set 1' in Table 1. Next, to evaluate the ability of the predictive approaches to estimate the ARD parameters, we carried out simulations on the robot arm data with 6 inputs (Neal's version), denoted as 'data set 2' in Table 1. This data set was generated by adding four further inputs: two of them copies of the two original inputs corrupted by additive zero mean Gaussian noise of standard deviation 0.02, and two further irrelevant Gaussian noise inputs with zero mean and unit variance (Williams and Rasmussen (1996)). The performance measures chosen were the average Test Set Error (TSE) (normalized by the true noise level of 0.0025) and the average negative logarithm of predictive probability (NLPP) (computed from the Gaussian density function with (3) and (4)). Friedman's (1991) data sets 1 and 2 were based on the problem of predicting impedance and phase, respectively, from four parameters of an electrical circuit. Training sets of three different sizes (50, 100, 200) with a signal-to-noise ratio of about 3:1 were replicated 100 times and, for each training set (at each sample size N), the scaled integral squared error

ISE = \frac{\int_D (y(x) - \hat{y}(x))^2 \, dx}{\mathrm{var}_D\, y(x)}

and the NLPP were computed using 5000 data points randomly generated from a uniform distribution over D (Friedman (1991)). In the case of GPE (denoted as GE in the tables), we used a noise level estimate drawn from a Gaussian distribution with mean NL_T (the true noise level) and standard deviation 0.03 NL_T. In the case of CV, we estimated the hyperparameters in the reparametrized form and estimated the noise level using (18). In the case of MAP (denoted as MP in the tables), we used the same prior as given in Rasmussen (1996). The GPP approach is denoted as Gp in the tables. For all these methods, the conjugate gradient (CG) algorithm (Rasmussen (1996)) was used to optimize the hyperparameters. The termination criterion (relative function error) had a tolerance of 10^{-7}, with the maximum number of CG iterations set to 100. In the case of the robot arm data sets, the algorithm was run with ten different initial conditions and the best solution (chosen according to the best objective function value) is reported. The optimization was carried out separately for the two outputs and the results reported are the average TSE and NLPP. In the case of Friedman's data sets, the optimization algorithm was run with three different initial conditions and the best solution was picked; when N = 200, the optimization algorithm was run with only one initial condition. For all the data sets, both the inputs and outputs were normalized to zero mean and unit variance.

Table 1: Results on the robot arm data sets. Average of normalized test set error (TSE) and negative logarithm of predictive probability (NLPP) for various methods.

          Data Set: 1            Data Set: 2
          TSE      NLPP          TSE      NLPP
    ML    1.126    -1.512        1.131    -1.512
    MP    1.131    -1.489        1.181    -1.511
    Gp    1.115    -1.516        1.116    -1.524
    CV    1.112    -1.514        1.146    -1.518
    GE    1.111    -1.524        1.112    -1.524

Table 2: Results on Friedman's data sets. Average of scaled integral squared error and negative logarithm of predictive probability (given in brackets) for different training sample sizes and various methods.

          Data Set: 1                                     Data Set: 2
          N = 50        N = 100       N = 200             N = 50        N = 100       N = 200
    ML    0.43 (7.24)   0.19 (6.71)   0.10 (6.49)         0.26 (1.05)   0.16 (0.82)   0.11 (0.68)
    MP    0.42 (7.18)   0.22 (6.78)   0.12 (6.56)         0.25 (1.01)   0.16 (0.82)   0.11 (0.69)
    Gp    0.47 (7.29)   0.20 (6.65)   0.10 (6.44)         0.33 (1.25)   0.20 (0.86)   0.12 (0.70)
    CV    0.55 (7.27)   0.22 (6.67)   0.10 (6.44)         0.42 (1.36)   0.21 (0.91)   0.13 (0.70)
    GE    0.35 (7.10)   0.15 (6.60)   0.08 (6.37)         0.28 (1.20)   0.18 (0.85)   0.12 (0.63)

A sketch of the optimization protocol used for these experiments is given below.
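The following is a rough, hedged sketch of the protocol just described, not the authors' code: it minimizes the GPP objective (8) over the log-hyperparameters of the covariance (5) with SciPy's conjugate-gradient optimizer, numerical gradients, and random restarts, whereas the paper uses the analytic gradient (9). All function and variable names here are ours.

    import numpy as np
    from scipy.optimize import minimize

    def make_objective(X, t):
        N, M = X.shape
        def G(log_params):
            a0, a1, v0, *rest = np.exp(log_params)
            w, sigma2 = np.array(rest[:M]), rest[M]
            sq = np.sum(w * (X[:, None, :] - X[None, :, :]) ** 2, axis=2)
            C = a0 + a1 * (X @ X.T) + v0 * np.exp(-0.5 * sq) + sigma2 * np.eye(N)
            Cinv = np.linalg.inv(C)
            cbar, q = np.diag(Cinv), Cinv @ t
            # GPP objective, equation (8)
            return 0.5 * np.mean(q**2 / cbar) - 0.5 * np.mean(np.log(cbar)) \
                   + 0.5 * np.log(2 * np.pi)
        return G

    def fit_hyperparameters(X, t, n_restarts=10, seed=0):
        rng = np.random.default_rng(seed)
        G = make_objective(X, t)
        best = None
        for _ in range(n_restarts):
            x0 = rng.normal(size=X.shape[1] + 4)   # log(a0, a1, v0, w_1..w_M, sigma^2)
            res = minimize(G, x0, method="CG", options={"maxiter": 100})
            if best is None or res.fun < best.fun:
                best = res
        return np.exp(best.x), best.fun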
From Table 1, we see that the performances (both TSE and NLPP) of the predictive approaches are better than those of the ML and MAP approaches for both data sets. In the case of data set 2, we observed that, like the ML and MAP methods, all the predictive approaches rightly identified the irrelevant inputs. The performance of the GPE approach is the best on the robot arm data and demonstrates the usefulness of this approach when a good noise level estimate is available. In the case of Friedman's data set 1 (see Table 2), the important observation is that the performances (both ISE and NLPP) of the GPP and CV approaches are relatively poor at low sample size (N = 50) and improve markedly as N increases. Note that the performances of the predictive approaches are better than the ML and MAP methods from N = 100 onwards (see NLPP). Again, GPE gives the best performance, and its performance at low sample size (N = 50) is also quite good. In the case of Friedman's data set 2, the ML and MAP approaches perform better than the predictive approaches except GPE. The performances of GPP and CV improve as N increases and are very close to those of the ML and MAP methods when N = 200. Next, it is clear that the MAP method gives the best performance at low sample size. This behavior, we believe, is because the prior plays an important role there and hence is very useful. Also, note that, unlike data set 1, the performance of GPE is inferior to the ML and MAP approaches at low sample sizes and improves over these approaches (see NLPP) as N increases. This suggests that knowledge of the noise level is not the only issue. The basic issue, we think, is that the predictive approaches estimate the predictive performance of a given model from the training samples. Clearly, the quality of this estimate improves as N increases; knowing the noise level also improves it.

4 Discussion

Simulation results indicate that the size N required to get good estimates of predictive performance is problem dependent. When N is sufficiently large, we find that the predictive approaches perform better than the ML and MAP approaches. The sufficient number of samples can be as low as 100, as evident from our results on Friedman's data set 1. Also, the MAP approach is the best when N is very low. As one would expect, the performances of the ML and MAP approaches become nearly the same as N increases. The comparison with existing approaches indicates that the predictive approaches developed here are strongly competitive. The overall cost of computing the function and the gradient (for all three predictive approaches) is O(M N^3). The cost of making predictions is the same as that required for the ML and MAP methods. The proofs of the results and detailed simulation results will be presented in another paper (Sundararajan and Keerthi, 1999).

References

Friedman, J.H. (1991) Multivariate Adaptive Regression Splines, Annals of Statistics, 19, 1-141.

Geisser, S. (1975) The Predictive Sample Reuse Method with Applications, Journal of the American Statistical Association, 70, 320-328.

Geisser, S., and Eddy, W.F. (1979) A Predictive Approach to Model Selection, Journal of the American Statistical Association, 74, 153-160.
MacKay, D.J.C. (1997) Gaussian Processes - A Replacement for Neural Networks?, available in Postscript via URL http://www.wol.ra.phy.cam.ac.uk/mackayj.

Neal, R.M. (1996) Bayesian Learning for Neural Networks, New York: Springer-Verlag.

Neal, R.M. (1997) Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification, Tech. Rep. No. 9702, Dept. of Statistics, University of Toronto.

Rasmussen, C. (1996) Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression, Ph.D. Thesis, Dept. of Computer Science, University of Toronto.

Stone, M. (1974) Cross-Validatory Choice and Assessment of Statistical Predictions (with discussion), Journal of the Royal Statistical Society, Ser. B, 36, 111-147.

Sundararajan, S., and Keerthi, S.S. (1999) Predictive Approaches for Choosing Hyperparameters in Gaussian Processes, submitted to Neural Computation, available at: http://guppy.mpe.nus.edu.sg/~mpessk/gp/gp.html.

Williams, C.K.I., and Rasmussen, C.E. (1996) Gaussian Processes for Regression. In Advances in Neural Information Processing Systems 8, ed. by D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo. MIT Press.
", "award": [], "sourceid": 1767, "authors": [{"given_name": "S.", "family_name": "Sundararajan", "institution": null}, {"given_name": "S.", "family_name": "Keerthi", "institution": null}]}