{"title": "Sparse Representation for Gaussian Process Models", "book": "Advances in Neural Information Processing Systems", "page_first": 444, "page_last": 450, "abstract": null, "full_text": "Sparse Representation for Gaussian Process \n\nModels \n\nLehel Csat6  and  Manfred Opper \n\nNeural Computing Research Group \n\nSchool of Engineering and Applied Sciences \n\nB4 7ET Birmingham, United Kingdom \n{csat o l, oppe r m} @as t o n. ac .uk \n\nAbstract \n\nWe develop an approach for a sparse representation for Gaussian Process \n(GP) models in order to overcome the limitations of GPs caused by large \ndata sets.  The method is based on a combination of a Bayesian online al(cid:173)\ngorithm together with a sequential construction of a relevant subsample \nof the  data  which  fully  specifies  the  prediction  of the  model.  Experi(cid:173)\nmental results on toy examples and large real-world data sets indicate the \nefficiency of the approach. \n\n1  Introduction \n\nGaussian  processes  (GP)  [1;  15]  provide  promising  non-parametric  tools  for  modelling \nreal-world statistical problems.  Like other kernel based methods, e.g.  Support Vector Ma(cid:173)\nchines (SVMs) [13], they combine a high flexibility ofthe model by working in high (often \n00) dimensional feature  spaces with  the  simplicity that all  operations are \"kernelized\" i.e. \nthey are performed in the (lower dimensional) input space using positive definite kernels. \n\nAn important advantage of GPs over other non-Bayesian models is the explicit probabilistic \nformulation of the model.  This does not only provide the modeller with (Bayesian) confi(cid:173)\ndence intervals (for regression) or posterior class probabilities (for classification) but also \nimmediately opens  the  possibility  to  treat other nonstandard data models  (e.g.  Quantum \ninverse statistics [4]). \n\nUnfortunately the drawback of GP models (which was originally apparent in SVMs as well, \nbut has  now  been  overcome [6]) lies in  the huge increase of the computational cost with \nthe number of training data.  This seems to preclude applications of GPs to large datasets. \n\nThis paper presents an  approach to overcome this problem. It is based on a combination of \nan  online learning approach requiring only a single sweep through the data and a method \nto reduce the number of parameters representing the model. \n\nMaking use of the proposed parametrisation the method extracts a subset of the examples \nand the prediction relies only on these basis vectors (BV). The memory requirement of the \nalgorithm scales thus  only with  the size  of this  set.  Experiments with real-world datasets \nconfirm the good performance of the proposed method.  1 \n\n1 A different approach for dealing with large datasets was suggested by V.  Tresp [12].  His method \n\n\f2  Gaussian Process Models \n\nGPs  belong to  Bayesian  non-parametric models where likelihoods  are parametrised by  a \nGaussian stochastic process (random field)  a(x) which is indexed by  the continuous input \nvariable x . The prior knowledge about a is expressed in the prior mean and the covariance \ngiven  by  the kernel  Ko(x,x')  =  Cov(a(x), a(x'))  [14;  15].  In  the following,  only  zero \nmean GP priors are used. \nIn  supervised  learning  the  process  a(x)  is  used  as  a  latent  variable  in  the  likelihood \nP(yla(x))  which denotes  the  probability  of output Y given  the  input x .  
In supervised learning the process a(x) is used as a latent variable in the likelihood P(y|a(x)), which denotes the probability of the output y given the input x. Based on a set of input-output pairs (x_n, y_n) with x_n ∈ R^m and y_n ∈ R (n = 1, ..., N), the Bayesian learning method computes the posterior distribution of the process a(x) using the prior and the likelihood [14; 15; 3].

Although the prior is a Gaussian process, the posterior process usually is not Gaussian (except for the special case of regression with Gaussian noise). Nevertheless, various approaches have been introduced recently to approximate the posterior averages [11; 9]. Our approach is based on the idea of approximating the true posterior process p{a} by a Gaussian process q{a} which is fully specified by a covariance kernel K_t(x, x') and a posterior mean ⟨a(x)⟩_t, where t is the number of training data processed by the algorithm so far. Such an approximation could be formulated within the variational approach, where q is chosen such that the relative entropy D(q, p) = E_q ln(q/p) is minimal [9]. However, in this formulation the expectation is over the approximate process q rather than over p. It seems intuitively better to minimise the other KL divergence, D(p, q) = E_p ln(p/q), because there the expectation is over the true distribution. Unfortunately, such a computation is generally not possible. The following online approach can be understood as an approximation to this task.

3 Online learning for Gaussian Processes

In this section we briefly review the main idea of the Bayesian online approach (see e.g. [5]) applied to GP models. We process the training data sequentially, one example after the other. Assume we have a Gaussian approximation q_t to the posterior process at time t. We use the next example t+1 to update the posterior using Bayes rule via

\hat{p}(a) = \frac{P(y_{t+1} \mid a(x_{t+1})) \, q_t(a)}{\langle P(y_{t+1} \mid a(x_{t+1})) \rangle_t}

Since the resulting posterior \hat{p}(a) is non-Gaussian, we project it to the closest Gaussian process q which minimises the KL divergence D(\hat{p}, q). Note that now the new approximation q is on the "correct" side of the KL divergence. The minimisation can be performed exactly, leading to a match of the means and covariances of \hat{p} and q. Since \hat{p} is much less complex than the full posterior, it is possible to write down the changes in the first two moments analytically [2]:

\langle a(x) \rangle_{t+1} = \langle a(x) \rangle_t + b_1 K_t(x, x_{t+1})
K_{t+1}(x, x') = K_t(x, x') + b_2 K_t(x, x_{t+1}) K_t(x_{t+1}, x')                (1)

where the scalar coefficients b_1 and b_2 are the first and second derivatives of the log averaged likelihood,

b_1 = \frac{\partial}{\partial \langle a(x_{t+1}) \rangle_t} \ln \langle P(y_{t+1} \mid a(x_{t+1})) \rangle_t , \qquad
b_2 = \frac{\partial^2}{\partial \langle a(x_{t+1}) \rangle_t^2} \ln \langle P(y_{t+1} \mid a(x_{t+1})) \rangle_t                (2)

with the averaging performed with respect to the marginal Gaussian distribution of the process variable a at the input x_{t+1}. Note that this yields a one-dimensional integral! The derivatives are taken with respect to ⟨a(x)⟩_t.

[Figure 1: Projection of the new input φ_{t+1} onto the subspace spanned by the previous inputs. \hat{φ}_{t+1} is the projection onto the linear span of {φ_i}_{i=1,...,t}, and φ_res is the residual vector. Subfigure (a) shows the projections onto the subspace, and (b) gives a geometric picture of the "measurable part" of the error ε_{t+1} from eq. (8).]
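Because the average in eq. (2) is only one-dimensional, it can be computed numerically for essentially arbitrary likelihoods. The following sketch is our illustration, not the authors' code; the function names, the quadrature order and the finite-difference step are assumptions. It obtains b_1 and b_2 by Gauss-Hermite quadrature and finite differences, and checks the result against the Gaussian regression likelihood of Section 5, for which the coefficients are known analytically.

```python
import numpy as np

def log_avg_likelihood(lik, y, mean, var, n_quad=40):
    """ln <P(y|a)>_t with a ~ N(mean, var): the one-dimensional integral of eq. (2),
    computed by Gauss-Hermite quadrature for an arbitrary likelihood lik(y, a)."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)   # nodes/weights for weight e^{-x^2}
    a = mean + np.sqrt(2.0 * var) * x                # change of variables
    return np.log(np.dot(w, lik(y, a)) / np.sqrt(np.pi))

def coefficients(lik, y, mean, var, eps=1e-3):
    """b1 and b2 of eq. (2): first and second derivatives of the log averaged
    likelihood with respect to the marginal mean <a(x_{t+1})>_t."""
    f = lambda m: log_avg_likelihood(lik, y, m, var)
    b1 = (f(mean + eps) - f(mean - eps)) / (2.0 * eps)
    b2 = (f(mean + eps) - 2.0 * f(mean) + f(mean - eps)) / eps**2
    return b1, b2

# Check against Gaussian regression noise (eq. (12) below), where the exact values
# are b1 = (y - mean) / (sigma0^2 + var) and b2 = -1 / (sigma0^2 + var).
sigma0_sq = 0.1
gauss = lambda y, a: np.exp(-(y - a)**2 / (2 * sigma0_sq)) / np.sqrt(2 * np.pi * sigma0_sq)
print(coefficients(gauss, y=0.3, mean=0.0, var=0.2))   # approx. (1.0, -3.333)
```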
Note also that this procedure does not equal the extended Kalman filter, which involves linearisations of the likelihoods, whereas in our approach it is possible to use non-smooth likelihoods (e.g. noise-free classification) without problems.

It turns out that the recursion (1) is solved by the parametrisation

\langle a(x) \rangle_t = \sum_{i=1}^{t} K_0(x, x_i) \, \alpha_t(i)
K_t(x, x') = K_0(x, x') + \sum_{i,j=1}^{t} K_0(x, x_i) \, C_t(ij) \, K_0(x_j, x')                (3)

such that in each online step we have to update only the vector of α's and the matrix of C's. For notational convenience we use the vector α_t = [α_t(1), ..., α_t(t)]^T and the matrix C_t = {C_t(ij)}_{i,j=1,...,t}. The zero-mean GP with kernel K_0 is used as the starting point of the algorithm: α_0 = 0 and C_0 = 0 are the starting parameters.

The update of the parameters defined in (3) is found to be

\alpha_{t+1} = \alpha_t + b_1 [C_t k_{t+1} + e_{t+1}]
C_{t+1} = C_t + b_2 [C_t k_{t+1} + e_{t+1}] [C_t k_{t+1} + e_{t+1}]^T                (4)

with k_{t+1} = [K_0(x_1, x_{t+1}), ..., K_0(x_t, x_{t+1})]^T, e_{t+1} the (t+1)-th unit vector (all components except the (t+1)-th are zero), and the scalar coefficients b_1 and b_2 computed from (2). The serious drawback of this approach, which it shares with many other kernel methods, is the quadratic increase of the matrix size with the number of training data.

4 Sparse representation

We use the following idea for reducing the growth of C and α (for a similar approach see [8]). We consider the feature expansion of the kernel K_0(x, x') = φ(x)^T φ(x') and decompose the new feature vector φ(x_{t+1}) into a linear combination of the previous features and a residual φ_res:

\varphi(x_{t+1}) = \hat{\varphi}_{t+1} + \varphi_{\mathrm{res}} = \sum_{i=1}^{t} \hat{e}_{t+1}(i) \, \varphi(x_i) + \varphi_{\mathrm{res}}                (5)

where \hat{φ}_{t+1} is the projection of φ(x_{t+1}) onto the span of the previous inputs and \hat{e}_{t+1} = [\hat{e}_{t+1}(1), ..., \hat{e}_{t+1}(t)]^T are the coordinates of \hat{φ}_{t+1} with respect to the basis {φ_i}_{i=1,...,t}. We can then re-express the GP mean as

\langle a(x) \rangle_{t+1} = \sum_{i=1}^{t+1} K_0(x, x_i) \, \alpha_{t+1}(i) = \sum_{i=1}^{t} K_0(x, x_i) \, \hat{\alpha}_{t+1}(i) + \alpha_{t+1}(t+1) \, \varphi_{\mathrm{res}}^T \varphi(x)                (6)

with \hat{α}_{t+1}(i) = α_{t+1}(i) + \hat{e}_{t+1}(i) α_{t+1}(t+1), and with γ_{t+1} denoting the residual (or novelty factor) associated with the new feature vector. The vector \hat{e}_{t+1} and the residual term γ_{t+1} are both expressed in terms of kernels:

\hat{e}_{t+1} = K_B^{-1} k_{t+1} , \qquad \gamma_{t+1} = k^*_{t+1} - k_{t+1}^T K_B^{-1} k_{t+1}                (7)

with K_B = {K_0(x_i, x_j)}_{i,j=1,...,t} and k^*_{t+1} = K_0(x_{t+1}, x_{t+1}). The relation between the quantities \hat{e}_{t+1} and γ_{t+1} is illustrated in Figure 1.

Neglecting the last term in the decomposition (5) of the new input and performing the update with the resulting vector is equivalent to the update rule (4) with e_{t+1} replaced by \hat{e}_{t+1}. Note that the dimension of the parameter space is not increased by this approximative update. The memory required by the algorithm thus scales quadratically only with the size of the set of "basis vectors", i.e. those examples for which the full update (4) is made. This is similar to Support Vectors [13], but without the need to solve the (high dimensional) convex optimisation problem. It is also related to kernel PCA and the reduced set method [8], where the full solution is computed first and then a reduced set is used for prediction.
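To make the two kinds of update concrete, here is a minimal NumPy sketch (ours; the names, the handling of the empty initial state, and the simple tolerance test on γ_{t+1} are assumptions) of one online step combining the full update (4), the projection coordinates (7), and the iterative inverse of the Gram matrix given later in eq. (9).

```python
import numpy as np

def online_update(alpha, C, Q, k, kstar, b1, b2, tol=1e-6):
    """One step of the sparse online GP update, sketching eqs. (4), (7) and (9).

    alpha : (t,)   coefficients of the posterior mean in the parametrisation (3)
    C     : (t,t)  coefficients of the posterior covariance in (3)
    Q     : (t,t)  inverse Gram matrix K_B^{-1} of the current basis vectors
    k     : (t,)   vector [K_0(x_1, x_new), ..., K_0(x_t, x_new)]
    kstar : scalar K_0(x_new, x_new)
    b1,b2 : scalar coefficients from eq. (2)
    Returns (alpha, C, Q, added); `added` says whether x_new became a new BV."""
    e_hat = Q @ k                     # projection coordinates \hat{e}_{t+1}, eq. (7)
    gamma = kstar - k @ e_hat         # residual ("novelty") gamma_{t+1}, eq. (7)

    if gamma < tol:
        # x_new is (numerically) in the span of the BVs: reduced update, no growth,
        # i.e. eq. (4) with e_{t+1} replaced by \hat{e}_{t+1}
        s = C @ k + e_hat
        return alpha + b1 * s, C + b2 * np.outer(s, s), Q, False

    # full update, eq. (4): extend all parameters by one dimension
    s = np.append(C @ k, 1.0)                  # C_t k_{t+1} + e_{t+1}
    alpha = np.append(alpha, 0.0) + b1 * s
    C = np.pad(C, ((0, 1), (0, 1))) + b2 * np.outer(s, s)
    diff = np.append(-e_hat, 1.0)              # e_{t+1} - \hat{e}_{t+1}
    Q = np.pad(Q, ((0, 1), (0, 1))) + np.outer(diff, diff) / gamma   # eq. (9)
    return alpha, C, Q, True
```

Starting from empty arrays (alpha = np.zeros(0), C = Q = np.zeros((0, 0))) and feeding the examples one at a time reproduces the single-sweep behaviour described above; the test `gamma < tol` only guards against linear dependence, while the score of eq. (8) below can additionally be used to skip inputs whose inclusion would barely change the mean.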
Replacing the input vector φ(x_{t+1}) by its projection onto the linear span of the BVs when updating the GP parameters induces changes in the GP.² However, the replacement of the true feature vector by its approximation leaves the mean function unchanged at each BV i = 1, ..., t. That is, the functions ⟨a(x)⟩_{t+1} from (6) and ⟨\hat{a}(x)⟩_{t+1} = Σ_{i=1}^{t} \hat{α}_{t+1}(i) K_0(x_i, x) have the same values at all x_i. The change at x_{t+1} is

\varepsilon_{t+1} = \left| \langle a(x_{t+1}) \rangle_{t+1} - \langle \hat{a}(x_{t+1}) \rangle_{t+1} \right| = |b_1| \, \gamma_{t+1}                (8)

with b_1 the factor from (2).

² Equation (7) also minimises the KL-distance between the full posterior (the one that increases the parameter space) and a parametric distribution using only the old BVs.

As a consequence, a good approximation to the full GP solution is obtained if an input for which there is only a small change in the mean function of the posterior process is not included in the set of BVs. The change is given by ε_{t+1}, and the decision of including x_{t+1} or not is based on this "score" associated with it.

The absence of matrix inversions is an important issue when dealing with large datasets. The matrix inversion in the projection equation (7) can be avoided by an iterative inversion³ of the Gram matrix, Q = K_B^{-1}:

Q_{t+1} = Q_t + \gamma_{t+1}^{-1} (e_{t+1} - \hat{e}_{t+1})(e_{t+1} - \hat{e}_{t+1})^T                (9)

³ A guide is available from Sam Roweis: http://www.gatsby.ucl.ac.uk/~roweis/notes.html

An important remark is that if the new input lies in the linear span of the BVs, then it is not included in the basis set, thus avoiding 1) small singular values of the matrix K_B and 2) redundancy in representing the problem.

4.1 Deleting a basis vector

The above section gave a method for leaving out a vector that is not significant for prediction purposes. However, it did not provide a method for eliminating one of the already existing BVs.

Let us assume that an input x_{t+1} has just been added to the set of BVs. Since we know that an addition has taken place, the update rule (4) with the (t+1)-th unit vector e_{t+1} was the last one performed. Since the model parameters at the previous step had an empty (t+1)-th row and column, the parameters before the full update can be identified.

The removal of the last basis vector can then be done with the following steps: 1) computing the parameters before the update of the GP and 2) performing a reduced update of the model without the inclusion of that basis vector (eq. (4) using \hat{e}_{t+1}). The updates of the model parameters α, C and Q are "inverted" by inverting the coupled equations (4) and (9):

\hat{\alpha} = \alpha^{(t)} - \alpha^* \frac{Q^*}{q^*}
\hat{C} = C^{(t)} + c^* \frac{Q^* Q^{*T}}{q^{*2}} - \frac{1}{q^*} \left[ Q^* C^{*T} + C^* Q^{*T} \right]
\hat{Q} = Q^{(t)} - \frac{Q^* Q^{*T}}{q^*}                (10)

where the elements needed to update the model are extracted from the extended parameters as illustrated in Figure 2.

[Figure 2: Decomposition of the model parameters C_{t+1} and Q_{t+1} into the blocks C^{(t)}, C^*, c^* and Q^{(t)}, Q^*, q^* used in the update equation (10).]

This identification permits us to evaluate the score for the last BV. Since the order of the BVs is approximately arbitrary, we can assign a score to each BV,

\varepsilon_i = \frac{ | \alpha_{t+1}(i) | }{ Q_{t+1}(i, i) }                (11)

Thus we have a method to estimate the score of each basis vector at any time and to eliminate the one with the smallest contribution to the GP output (the mean), providing a sparse GP with full control over the memory size.
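The deletion step and the score can likewise be written in a few lines. The sketch below is again ours, with hypothetical names: it implements eqs. (10) and (11) and removes an arbitrary basis vector by first permuting it to the last position, relying on the approximate arbitrariness of the BV order noted above.

```python
import numpy as np

def bv_scores(alpha, Q):
    """Score of each basis vector, eq. (11): eps_i = |alpha_i| / Q_ii."""
    return np.abs(alpha) / np.diag(Q)

def delete_last_bv(alpha, C, Q):
    """Remove the last basis vector via eq. (10); the starred blocks follow the
    decomposition of Figure 2."""
    a_t, a_star = alpha[:-1], alpha[-1]
    C_t, C_col, c_star = C[:-1, :-1], C[:-1, -1], C[-1, -1]
    Q_t, Q_col, q_star = Q[:-1, :-1], Q[:-1, -1], Q[-1, -1]

    alpha_new = a_t - a_star * Q_col / q_star
    C_new = (C_t + c_star * np.outer(Q_col, Q_col) / q_star**2
             - (np.outer(Q_col, C_col) + np.outer(C_col, Q_col)) / q_star)
    Q_new = Q_t - np.outer(Q_col, Q_col) / q_star
    return alpha_new, C_new, Q_new

def delete_bv(alpha, C, Q, j):
    """Delete basis vector j by swapping it to the last position first (the text
    notes that the order of the BVs is approximately arbitrary)."""
    idx = list(range(len(alpha)))
    idx[j], idx[-1] = idx[-1], idx[j]
    idx = np.array(idx)
    return delete_last_bv(alpha[idx], C[np.ix_(idx, idx)], Q[np.ix_(idx, idx)])
```

A sparse GP with a fixed memory budget would then keep adding BVs with an update step such as `online_update` above and, whenever the budget is exceeded, call `delete_bv` on the index returned by `bv_scores(...).argmin()`.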
5 Simulation results

To apply the online learning rules (4), the data likelihood of the specific problem has to be averaged with respect to a Gaussian. Using eq. (2), the coefficients b_1 and b_2 are then obtained. The marginal of the GP at x_{t+1} is a normal distribution with mean ⟨a(x_{t+1})⟩_t = α_t^T k_{t+1} and variance σ²_{x_{t+1}} = k^*_{t+1} + k_{t+1}^T C_t k_{t+1}, where the GP parameters at time t are used.

As a first example we consider regression with Gaussian output noise of variance σ²_0, for which

\ln \langle P(y_{t+1} \mid a(x_{t+1})) \rangle_t = - \frac{1}{2} \ln \left[ 2\pi \left( \sigma_0^2 + \sigma_{x_{t+1}}^2 \right) \right] - \frac{ \left( y_{t+1} - \langle a(x_{t+1}) \rangle_t \right)^2 }{ 2 \left( \sigma_0^2 + \sigma_{x_{t+1}}^2 \right) }                (12)

so that (2) gives b_1 = (y_{t+1} - α_t^T k_{t+1}) / (σ²_0 + σ²_{x_{t+1}}) and b_2 = -1 / (σ²_0 + σ²_{x_{t+1}}).

For classification we use the probit model. The outputs are binary, y ∈ {-1, 1}, and the probability is given by the cumulative Gaussian (with u = y a / σ_0):

P(y \mid a) = \mathrm{Erf}\!\left( \frac{y a}{\sigma_0} \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{u} e^{-t^2/2} \, dt

The averaged likelihood for the new data point at time t is then

\langle P(y_{t+1} \mid a(x_{t+1})) \rangle_t = \mathrm{Erf}\!\left( \frac{ y_{t+1} \, \alpha_t^T k_{t+1} }{ \sqrt{ \sigma_0^2 + \sigma_{x_{t+1}}^2 } } \right)                (13)

[Figure 3: Simulation results for regression (a) and classification (b); in (b) the test error is plotted against the number of basis vectors. For details see text.]

For the regression case we have chosen the toy data model y = sin(x)/x + ζ, where ζ is a zero-mean Gaussian random variable with variance σ²_ζ, together with an RBF kernel. Figure 3.a shows the result of applying the algorithm to 600 input data while restricting the number of BVs to 20. The dash-dotted line is the true function, the continuous line is the approximation, and the Bayesian standard deviation is plotted with dotted lines (a gradient-like approximation of the output noise, based on maximising the likelihood (12), recovered the variance with which the data had been generated).

For classification we used the data from the US postal database⁴ of handwritten zip codes together with an RBF kernel. The database has 7291 training and 2007 test data of 16x16 grey-scale images. To apply the classification method to this database, 10 binary classification problems were solved and the final output was the class with the largest probability. The same BVs were used for all classifiers and, if a deletion was required, the BV with the minimum cumulative score was deleted, the cumulative score being the maximum of the scores over the individual classifiers. Figure 3.b shows the test error as a function of the size of the basis set. We find that the test error is rather stable over a considerable range of basis set sizes. A comparison with a second sweep through the data also shows that the algorithm seems to have extracted the relevant information from the data already within a single sweep. Using a polynomial kernel for the USPS dataset and 500 BVs we achieved a test error of 4.83%, which compares favourably with other sparse approaches [10; 8] but uses a smaller basis set than the SVM (2540 reported in [8]).

We also applied our algorithm to the NIST dataset⁵, which contains 60000 data. Using a fourth order polynomial kernel with only 500 BVs we achieved a test error of 3.13%, and we expect that improvements are possible by using a kernel with tunable hyperparameters. The possibility of computing the posterior class probabilities allows us to reject data.
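As a small illustration of this rejection mechanism (our own sketch, not the authors' code; the helper name and the threshold argument are assumptions), the per-point decision can be written as:

```python
import numpy as np

def classify_with_reject(class_probs, threshold=0.5):
    """class_probs: array of shape (n_test, n_classes) with posterior class
    probabilities, e.g. the outputs of the 10 binary probit classifiers (13).
    Returns the predicted class per test point, or -1 when the largest
    probability falls below the rejection threshold."""
    best = class_probs.argmax(axis=1)
    rejected = class_probs.max(axis=1) < threshold
    return np.where(rejected, -1, best)
```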
When the test data for which the maximum class probability was below 0.5 were rejected, the test error was 1.53% at a rejection rate of 1.60%.

⁴ From: http://www.kernel-machines.org/data.html
⁵ Available from: http://www.research.att.com/~yann/ocr/mnist/

6 Conclusion and further research

This paper presents a sparse approximation for GPs similar to the ones found in SVMs [13] or relevance vector machines [10]. In contrast to these other approaches, our algorithm is fully online and does not construct the sparse representation from the full data set (for a sequential optimisation of SVMs see [6]).

An important open question (besides the issue of model selection) is how to choose the minimal size of the set of basis vectors such that the predictive performance is not much deteriorated by the approximation involved. In fact, our numerical classification experiments suggest that the prediction performance is considerably stable when the basis set is above a certain size. It would be interesting if one could relate this minimum size to the effective dimensionality of the problem, defined as the number of feature dimensions which are well estimated by the data. One may argue as follows: replacing the true kernel by a modified (finite dimensional) one which contains only the well estimated features will not change the predictive power. On the other hand, for kernels with a feature space of finite dimensionality M it is easy to see that we never need more than M basis vectors, because of linear dependence. Whether such reasoning will lead to a practical procedure for choosing the appropriate basis set size is a question for further research.

7 Acknowledgement

This work was supported by EPSRC grant no. GR/M81608.

References

[1] J. M. Bernardo and A. F. Smith. Bayesian Theory. John Wiley & Sons, 1994.
[2] L. Csató, E. Fokoué, M. Opper, B. Schottky, and O. Winther. Efficient approaches to Gaussian process classification. In NIPS, volume 12, pages 251-257. The MIT Press, 2000.
[3] M. Gibbs and D. J. MacKay. Efficient implementation of Gaussian processes. Technical report, http://wol.ra.phy.cam.ac.uk/mackay/abstracts/gpros.html, 1999.
[4] J. C. Lemm, J. Uhlig, and A. Weiguny. A Bayesian approach to inverse quantum statistics. Phys. Rev. Lett., 84:2006, 2000.
[5] M. Opper. A Bayesian approach to online learning. In Saad [7], pages 363-378.
[6] J. C. Platt. Fast training of Support Vector Machines using sequential minimal optimisation. In Advances in Kernel Methods (Support Vector Learning). The MIT Press, 1999.
[7] D. Saad, editor. On-Line Learning in Neural Networks. Cambridge Univ. Press, 1998.
[8] B. Schölkopf, S. Mika, C. J. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000-1017, September 1999.
[9] M. Seeger. Bayesian model selection for Support Vector Machines, Gaussian processes and other kernel classifiers. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, NIPS, volume 12. The MIT Press, 2000.
[10] M. Tipping. The Relevance Vector Machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, NIPS, volume 12. The MIT Press, 2000.
[11] G. F. Trecate, C. K. I. Williams, and M. Opper.
Finite-dimensional approximation of Gaussian processes. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS, volume 11. The MIT Press, 1999.
[12] V. Tresp. A Bayesian committee machine. Neural Computation, accepted.
[13] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, 1995.
[14] C. K. I. Williams. Prediction with Gaussian processes. In M. I. Jordan, editor, Learning in Graphical Models. The MIT Press, 1999.
[15] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, NIPS, volume 8. The MIT Press, 1996.
", "award": [], "sourceid": 1893, "authors": [{"given_name": "Lehel", "family_name": "Csat\u00f3", "institution": null}, {"given_name": "Manfred", "family_name": "Opper", "institution": null}]}