{"title": "Sparse Kernel Principal Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 639, "abstract": null, "full_text": "Sparse Kernel \n\nPrincipal  Component Analysis \n\nMichael E.  Tipping \n\nMicrosoft  Research \n\nSt  George House,  1 Guildhall St \n\nCambridge CB2  3NH,  U.K. \nmtipping~microsoft.com \n\nAbstract \n\n'Kernel'  principal  component  analysis  (PCA)  is  an  elegant  non(cid:173)\nlinear  generalisation  of the  popular  linear  data  analysis  method, \nwhere  a  kernel function  implicitly  defines  a  nonlinear transforma(cid:173)\ntion into a feature space wherein standard PCA is  performed.  Un(cid:173)\nfortunately,  the  technique  is  not  'sparse',  since  the  components \nthus obtained are expressed in terms of kernels associated with ev(cid:173)\nery training vector.  This paper shows  that by  approximating the \ncovariance matrix in  feature  space by a  reduced  number of exam(cid:173)\nple vectors, using a  maximum-likelihood approach, we may obtain \na highly sparse form  of kernel PCA without loss of effectiveness. \n\n1 \n\nIntroduction \n\nPrincipal component analysis  (PCA)  is  a well-established technique for  dimension(cid:173)\nality  reduction,  and  examples  of its  many  applications  include  data compression, \nimage processing, visualisation, exploratory data analysis, pattern recognition and \ntime series  prediction.  Given  a  set  of N  d-dimensional  data vectors  X n ,  which  we \ntake  to  have  zero  mean,  the  principal  components  are the linear projections  onto \nthe  'principal  axes',  defined  as  the  leading  eigenvectors  of the  sample  covariance \nmatrix  S  =  N-1Z=:=lXnX~ =  N-1XTX,  where  X  =  (Xl,X2, ... ,XN)T  is  the \nconventionally-defined  'design'  matrix.  These  projections  are  of interest  as  they \nretain maximum  variance and minimise error of subsequent linear reconstruction. \n\nHowever,  because  PCA  only  defines  a  linear  projection  of the  data,  the  scope  of \nits application is  necessarily  somewhat limited.  This has naturally motivated vari(cid:173)\nous developments of nonlinear 'principal component analysis' in an effort  to model \nnon-trivial data structures more faithfully,  and a particularly interesting recent in(cid:173)\nnovation has been  'kernel PCA'  [4]. \n\nKernel PCA, summarised in Section 2,  makes use of the 'kernel trick', so effectively \nexploited  by  the  'support  vector  machine',  in  that  a  kernel  function  k(\u00b7,\u00b7)  may \nbe  considered  to  represent  a  dot  (inner)  product  in  some  transformed  space  if it \nsatisfies  Mercer's  condition  -\nif it  is  the  continuous  symmetric  kernel  of a \npositive  integral  operator.  This  can  be  an  elegant  way  to  'non-linearise'  linear \n\ni.e. \n\n\fprocedures which depend only on inner products of the examples. \n\nApplications  utilising  kernel  PCA  are  emerging  [2],  but  in  practice  the  approach \nsuffers from  one important  disadvantage  in  that it  is  not  a  sparse  method.  Com(cid:173)\nputation of principal component projections for  a given input x  requires evaluation \nof the kernel  function  k(x, xn)  in  respect  of all  N  'training'  examples  Xn.  This is \nan unfortunate limitation as in practice, to obtain the best model, we  would like to \nestimate the kernel principal components from  as much  data as possible. 
Here we tackle this problem by first approximating the covariance matrix in feature space by a subset of outer products of feature vectors, using a maximum-likelihood criterion based on a 'probabilistic PCA' model detailed in Section 3. Subsequently applying (kernel) PCA defines sparse projections. Importantly, the approximation we adopt is principled and controllable, and is related to the choice of the number of components to 'discard' in the conventional approach. We demonstrate its efficacy in Section 4 and illustrate how it can offer similar performance to a full non-sparse kernel PCA implementation while offering much reduced computational overheads.

2 Kernel PCA

Although PCA is conventionally defined (as above) in terms of the covariance, or outer-product, matrix, it is well-established that the eigenvectors of X^T X can be obtained from those of the inner-product matrix X X^T. If V is an orthogonal matrix of column eigenvectors of X X^T with corresponding eigenvalues in the diagonal matrix Λ, then by definition (X X^T) V = V Λ. Pre-multiplying by X^T gives:

(X^T X)(X^T V) = (X^T V) Λ.   (1)

From inspection, it can be seen that the eigenvectors of X^T X are X^T V, with eigenvalues Λ. Note, however, that the column vectors X^T V are not normalised, since for column i, u_i^T X X^T u_i = λ_i u_i^T u_i = λ_i, so the correctly normalised eigenvectors of X^T X, and thus the principal axes of the data, are given by V_pca = X^T V Λ^{-1/2}.

This derivation is useful if d > N, when the dimensionality of x is greater than the number of examples, but it is also fundamental for implementing kernel PCA. In kernel PCA, the data vectors x_n are implicitly mapped into a feature space by a set of functions {φ}: x_n → φ(x_n). Although the vectors φ_n = φ(x_n) in the feature space are generally not known explicitly, their inner products are defined by the kernel: φ_m^T φ_n = k(x_m, x_n). Defining Φ as the (notional) design matrix in feature space, and exploiting the above inner-product PCA formulation, allows the eigenvectors of the covariance matrix in feature space¹, S_φ = N^{-1} ∑_n φ_n φ_n^T, to be specified as:

V_kpca = Φ^T V Λ^{-1/2},   (2)

where V, Λ are the eigenvectors/values of the kernel matrix K, with (K)_{mn} = k(x_m, x_n). Although we can't compute V_kpca since we don't know Φ explicitly, we can compute projections of arbitrary test vectors x_* → φ_* onto V_kpca in feature space:

φ_*^T V_kpca = φ_*^T Φ^T V Λ^{-1/2} = k_*^T V Λ^{-1/2},   (3)

where k_* is the N-vector of inner products of x_* with the data in kernel space: (k_*)_n = k(x_*, x_n). We can thus compute, and plot, these projections - Figure 1 gives an example for some synthetic 3-cluster data in two dimensions.

¹Here, and in the rest of the paper, we do not 'centre' the data in feature space, although this may be achieved if desired (see [4]). In fact, we would argue that when using a Gaussian kernel, it does not necessarily make sense to do so.
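As a concrete illustration of equations (2) and (3), the following is a minimal NumPy sketch of this (uncentred) kernel PCA procedure with the Gaussian kernel used for Figure 1; the function names and the choice of eigendecomposition routine are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, r=0.25):
    # k(x, x') = exp(-||x - x'||^2 / r^2), the kernel used in Figure 1
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / r ** 2)

def kernel_pca(X, n_components, r=0.25):
    # Uncentred kernel PCA: eigendecompose the kernel matrix K (equation (2))
    K = gaussian_kernel(X, X, r)
    lam, V = np.linalg.eigh(K)                    # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:n_components]    # keep the leading components
    return lam[idx], V[:, idx]

def kpca_projections(X_train, X_test, lam, V, r=0.25):
    # phi_*^T V_kpca = k_*^T V Lambda^{-1/2}   (equation (3))
    k_star = gaussian_kernel(X_test, X_train, r)
    return k_star @ V / np.sqrt(lam)
```

Evaluating something like kpca_projections over a grid of test points, for 90 training vectors drawn from three Gaussian clusters, would produce contour plots of the kind shown in Figure 1.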
Figure 1: Contour plots of the first nine principal component projections evaluated over a region of input space for data from 3 Gaussian clusters (standard deviation 0.1; axis scales are shown in Figure 3), each comprising 30 vectors. A Gaussian kernel, exp(-||x - x'||²/r²), with width r = 0.25, was used. The corresponding eigenvalues are given above each projection (0.218, 0.203, 0.191, 0.057, 0.053, 0.051, 0.047, 0.043, 0.036). Note how the first three components 'pick out' the individual clusters [4].

3 Probabilistic Feature-Space PCA

Our approach to sparsifying kernel PCA is to approximate, a priori, the sample covariance matrix in feature space, S_φ, with a sum of weighted outer products of a reduced number of feature vectors. (The basis of this technique is thus general and its application not necessarily limited to kernel PCA.) This is achieved probabilistically, by maximising the likelihood of the feature vectors under a Gaussian density model φ ~ N(0, C), where we specify the covariance C by:

C = σ²I + ∑_{i=1}^{N} w_i φ_i φ_i^T = σ²I + Φ^T W Φ,   (4)

where w_1 ... w_N are the adjustable weights, W is a matrix with those weights on the diagonal, and σ² is an isotropic 'noise' component common to all dimensions of feature space. Of course, a naive maximum of the likelihood under this model is obtained with σ² = 0 and all w_i = 1/N. However, if we fix σ², and optimise only the weighting factors w_i, we will find that the maximum-likelihood estimates of many w_i are zero, thus realising a sparse representation of the covariance matrix.

This probabilistic approach is motivated by the fact that if we relax the form of the model, by defining it in terms of outer products of N arbitrary vectors v_i (rather than the fixed training vectors), i.e. C = σ²I + ∑_{i=1}^{N} w_i v_i v_i^T, then we realise a form of 'probabilistic PCA' [6]. That is, if {u_i, λ_i} are the set of eigenvectors/values of S_φ, then the likelihood under this model is maximised by v_i = u_i and w_i = (λ_i - σ²)^{1/2}, for those i for which λ_i > σ². For λ_i ≤ σ², the most likely weights w_i are zero.

3.1 Computations in feature space

We wish to maximise the likelihood under a Gaussian model with covariance given by (4). Ignoring terms independent of the weighting parameters, its log is given by:

L = -(1/2) (N log|C| + ∑_{n=1}^{N} φ_n^T C^{-1} φ_n).   (5)

Computing (5) requires the quantities |C| and φ_n^T C^{-1} φ_n, which for infinite-dimensional feature spaces might appear problematic. However, by judicious re-writing of the terms of interest, we are able both to compute the log-likelihood (to within a constant) and to optimise it with respect to the weights. First, we can write:

log |σ²I + Φ^T W Φ| = D log σ² + log |W^{-1} + σ^{-2} Φ Φ^T| + log |W|.   (6)

The potential problem of infinite dimensionality, D, of the feature space now enters only in the first term, which is constant if σ² is fixed and so does not affect maximisation. The term in |W| is straightforward and the remaining term can be expressed in terms of the inner-product (kernel) matrix:

W^{-1} + σ^{-2} Φ Φ^T = W^{-1} + σ^{-2} K,   (7)

where K is the kernel matrix such that (K)_{mn} = k(x_m, x_n).
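As a quick numerical check of (6) and (7), in a toy setting where the feature space is explicitly finite-dimensional and Φ can be written down, the sketch below compares the direct evaluation of log|C| with its kernel-matrix form; the dimensions, random data and variable names are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 30, 5                              # toy setting: explicit D-dimensional feature space
Phi = rng.normal(size=(N, D))             # rows are the feature vectors phi_n^T
w = rng.uniform(0.1, 1.0, size=N)         # weights w_1 ... w_N
W = np.diag(w)
sigma2 = 0.1
K = Phi @ Phi.T                           # inner-product (kernel) matrix

# Direct evaluation of log|sigma^2 I + Phi^T W Phi| (possible only because D is finite here)
C = sigma2 * np.eye(D) + Phi.T @ W @ Phi
direct = np.linalg.slogdet(C)[1]

# Kernel-space evaluation, equation (6) with Phi Phi^T replaced by K as in (7)
kernel_form = (D * np.log(sigma2)
               + np.linalg.slogdet(np.diag(1.0 / w) + K / sigma2)[1]
               + np.log(w).sum())

assert np.allclose(direct, kernel_form)   # the two agree to numerical precision
```

The data-dependent term of (5) is handled analogously, via the Woodbury identity, as follows.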
For the data-dependent term in the likelihood, we can use the Woodbury matrix inversion identity to compute the quantities φ_n^T C^{-1} φ_n:

φ_n^T (σ²I + Φ^T W Φ)^{-1} φ_n = φ_n^T [σ^{-2}I - σ^{-4} Φ^T (W^{-1} + σ^{-2} Φ Φ^T)^{-1} Φ] φ_n
                               = σ^{-2} k(x_n, x_n) - σ^{-4} k_n^T (W^{-1} + σ^{-2} K)^{-1} k_n,   (8)

with k_n = [k(x_n, x_1), k(x_n, x_2), ..., k(x_n, x_N)]^T.

3.2 Optimising the weights

To maximise the log-likelihood with respect to the w_i, differentiating (5) gives us:

∂L/∂w_i = (1/2) (φ_i^T C^{-1} Φ^T Φ C^{-1} φ_i - N φ_i^T C^{-1} φ_i)   (9)
        = (1/(2w_i²)) (∑_{n=1}^{N} μ_{ni}² + N Σ_ii - N w_i),   (10)

where Σ and μ_n are defined respectively by

Σ = (W^{-1} + σ^{-2} K)^{-1},   (11)
μ_n = σ^{-2} Σ k_n.   (12)

Setting (10) to zero gives re-estimation equations for the weights:

w_i^new = N^{-1} ∑_{n=1}^{N} μ_{ni}² + Σ_ii.   (13)

The re-estimates (13) are equivalent to expectation-maximisation updates, which would be obtained by adopting a factor analytic perspective [3], and introducing a set of 'hidden' Gaussian explanatory variables whose conditional means and common covariance, given the feature vectors and the current values of the weights, are given by μ_n and Σ respectively (hence the notation). As such, (13) is guaranteed to increase L unless it is already at a maximum. However, an alternative re-arrangement of (10), motivated by [5], leads to a re-estimation update which typically converges significantly more quickly:

w_i^new = ∑_{n=1}^{N} μ_{ni}² / [N(1 - Σ_ii/w_i)].   (14)

Note that these w_i updates (14) are defined in terms of the computable (i.e. not dependent on explicit feature-space vectors) quantities Σ and μ_n.

3.3 Principal component analysis

The principal axes

Sparse kernel PCA proceeds by finding the principal axes of the covariance model C = σ²I + Φ^T W Φ. These are identical to those of Φ^T W Φ, but with eigenvalues all σ² larger. Letting Φ̃ = W^{1/2} Φ, then, we need the eigenvectors of Φ̃^T Φ̃.

Using the technique of Section 2, if the eigenvectors of Φ̃ Φ̃^T = W^{1/2} Φ Φ^T W^{1/2} = W^{1/2} K W^{1/2} are Ũ, with corresponding eigenvalues Λ̃, then the eigenvectors/values {U, Λ} of C that we desire are given by:

U = Φ^T W^{1/2} Ũ Λ̃^{-1/2},   (15)
Λ = Λ̃ + σ²I.   (16)

Computing projections

Again, we can't compute the eigenvectors U explicitly in (15), but we can compute the projections of a general feature vector φ_* onto the principal axes:

φ_*^T U = φ_*^T Φ^T W^{1/2} Ũ Λ̃^{-1/2} = k̃_*^T W^{1/2} Ũ Λ̃^{-1/2},   (17)

where k̃_* is the sparse vector containing the non-zero-weighted elements of k_*, defined earlier. The corresponding rows of W^{1/2} Ũ Λ̃^{-1/2} are combined into a single projecting matrix P, each column of which gives the coefficients of the kernel functions for the evaluation of each principal component.

3.4 Computing reconstruction error

The squared reconstruction error in kernel space for a test vector φ_* is given by:

E_*² = k(x_*, x_*) - 2 k̃_*^T P P^T k̃_* + k̃_*^T P P^T K̃ P P^T k̃_*,   (18)

with K̃ the kernel matrix evaluated only for the representing vectors.
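Pulling Sections 3.2 and 3.3 together, a minimal NumPy sketch of the resulting procedure might look as follows. The uniform initialisation of the weights, the pruning threshold and the fixed iteration count are our own assumptions (the paper does not specify them), and the function names are illustrative only.

```python
import numpy as np

def sparse_kpca(K, sigma2, n_iter=500, prune_tol=1e-8):
    # Sparse kernel PCA: optimise the weights via update (14), then take the
    # principal axes of the approximated covariance (Section 3.3).
    N = K.shape[0]
    w = np.full(N, 1.0 / N)              # assumed starting point
    active = np.arange(N)                # indices of kernels still in the model
    for _ in range(n_iter):
        Ka = K[np.ix_(active, active)]
        Sigma = np.linalg.inv(np.diag(1.0 / w) + Ka / sigma2)            # equation (11)
        Mu = K[:, active] @ Sigma / sigma2                               # row n is mu_n^T, equation (12)
        w = (Mu ** 2).sum(axis=0) / (N * (1.0 - np.diag(Sigma) / w))     # equation (14)
        keep = w > prune_tol             # weights driven to zero are pruned
        active, w = active[keep], w[keep]

    # Eigendecomposition of W^{1/2} K W^{1/2} over the representing vectors (Section 3.3)
    Ka = K[np.ix_(active, active)]
    rootW = np.sqrt(w)
    lam, U = np.linalg.eigh(rootW[:, None] * Ka * rootW[None, :])
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    pos = lam > 1e-12                    # discard numerically null directions
    lam, U = lam[pos], U[:, pos]

    P = rootW[:, None] * U / np.sqrt(lam)    # projecting matrix of equation (17)
    return active, w, P, lam                 # eigenvalues of C are lam + sigma2, equation (16)

def sparse_projections(k_star_rep, P):
    # k_star_rep: kernel evaluations of test points against the representing
    # vectors only (its rows are the sparse vectors written k~_* in the text).
    return k_star_rep @ P
```

Only the kernel matrix is required throughout; the per-iteration cost is dominated by inverting a matrix of the size of the currently active kernel set, which shrinks as weights are pruned.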
4 Examples

To obtain sparse kernel PCA projections, we first specify the noise variance σ², which is the amount of variance per co-ordinate that we are prepared to allow to be explained by the (structure-free) isotropic noise rather than by the principal axes (this choice is a surrogate for deciding how many principal axes to retain in conventional kernel PCA). Unfortunately, the measure is in feature space, which makes it rather more difficult to interpret than if it were in data space (equally so, of course, for interpretation of the eigenvalue spectrum in the non-sparse case).

We apply sparse kernel PCA to the Gaussian data of Figure 1 earlier, with the same kernel function and specifying σ = 0.25, deliberately chosen to give nine representing kernels so as to facilitate comparison. Figure 2 shows the nine principal component projections based on the approximated covariance matrix, and gives qualitatively equivalent results to Figure 1 while utilising only 10% of the kernels. Figure 3 shows the data and highlights those examples corresponding to the nine kernels with non-zero weights. Note, although we do not consider this aspect further here, that these representing vectors are themselves highly informative of the structure of the data (i.e. with a Gaussian kernel, for example, they tend to represent distinguishable clusters). Also in Figure 3, contours of reconstruction error, based only on those nine kernels, are plotted and indicate that the nonlinear model has more faithfully captured the structure of the data than would standard linear PCA.

Figure 2: The nine principal component projections obtained by sparse kernel PCA (eigenvalues 0.199, 0.184, 0.161, 0.082, 0.074, 0.074, 0.074, 0.072, 0.071).

To further illustrate the fidelity of the sparse approximation, we analyse the 200 training examples of the 7-dimensional 'Pima Indians diabetes' database [1]. Figure 4 (left) shows a plot of reconstruction error against the number of principal components utilised by both conventional kernel PCA and its sparse counterpart, with σ² chosen so as to utilise 20% of the kernels (40). An expected small reduction in accuracy is evident in the sparse case. Figure 4 (right) shows the error on the associated test set when using a linear support vector machine to classify the data based on those numbers of principal components. Here the sparse projections actually perform marginally better on average, a consequence both of randomness and, we note with interest, presumably of some inherent complexity control implied by the use of a sparse approximation.

Figure 3: The data with the nine representing kernels circled and contours of reconstruction error (computed in feature space although displayed as a function of x) overlaid.

Figure 4: RMS reconstruction error (left) and test set misclassifications (right) for numbers of retained principal components ranging from 1-25.
For the standard case, this was based on all 200 training examples; for the sparse form, a subset of 40. A Gaussian kernel of width 10 was utilised, which gives near-optimal results if used in an SVM classification.

References

[1] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996.

[2] S. Romdhani, S. Gong, and A. Psarrou. A multi-view nonlinear active shape model using kernel PCA. In Proceedings of the 1999 British Machine Vision Conference, pages 483-492, 1999.

[3] D. B. Rubin and D. T. Thayer. EM algorithms for ML factor analysis. Psychometrika, 47(1):69-76, 1982.

[4] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998. Technical Report No. 44, 1996, Max Planck Institut für biologische Kybernetik, Tübingen.

[5] M. E. Tipping. The Relevance Vector Machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 652-658. Cambridge, Mass.: MIT Press, 2000.

[6] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999.
", "award": [], "sourceid": 1791, "authors": [{"given_name": "Michael", "family_name": "Tipping", "institution": null}]}