{"title": "The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum", "book": "Advances in Neural Information Processing Systems", "page_first": 383, "page_last": 390, "abstract": null, "full_text": "The  Stability  of Kernel  Principal \n\nComponents Analysis  and  its  Relation to \n\nthe Process  Eigenspectrum \n\nJohn  Shawe-Taylor \n\nRoyal  Holloway \n\nUniversity of London \njohn\u00a9cs.rhul.ac.uk \n\nChristopher K.  I. Williams \n\nSchool of Informatics \n\nUniversity of Edinburgh \n\nc.k.i.williams\u00a9ed.ac.uk \n\nAbstract \n\nIn  this paper we  analyze the relationships between the eigenvalues \nof the m  x m  Gram matrix K  for  a kernel k(\u00b7, .)  corresponding to a \nsample  Xl, ... ,Xm  drawn from  a  density p(x)  and the eigenvalues \nof the corresponding continuous eigenproblem.  We  bound the dif(cid:173)\nferences between the two spectra and provide a performance bound \non kernel peA. \n\n1 \n\nIntroduction \n\nOver recent years there has been a considerable amount of interest in kernel methods \nfor  supervised learning (e.g.  Support Vector Machines and Gaussian Process predic(cid:173)\nt ion)  and for  unsupervised  learning  (e.g.  kernel  peA, Sch61kopf et al.  (1998)).  In \nthis paper we study the stability of the subspace of feature space extracted by kernel \npeA with respect to the sample of size m, and relate this to the feature space that \nwould be extracted in  the infinite sample-size limit.  This analysis essentially  \"lifts\" \ninto  (a  potentially  infinite  dimensional)  feature  space  an  analysis  which  can  also \nbe  carried  out  for  peA,  comparing  the  k-dimensional  eigenspace  extracted  from \na  sample  covariance  matrix  and  the  k-dimensional  eigenspace  extracted from  the \npopulation covariance matrix, and comparing the residuals from  the k-dimensional \ncompression for  the m-sample and the population. \nEarlier work by  Shawe-Taylor et  al.  (2002)  discussed the concentration of spectral \nproperties  of  Gram  matrices  and  of the  residuals  of fixed  projections.  However, \nthese  results  gave  deviation  bounds  on the  sampling  variability  the  eigenvalues  of \nthe  Gram  matrix,  but  did  not  address  the  relationship  of sample  and  population \neigenvalues,  or the estimation problem of the residual of peA on new  data. \nThe  structure  the  remainder  of the  paper  is  as  follows.  In  section  2  we  provide \nbackground on  the  continuous  kernel  eigenproblem,  and  the  relationship  between \nthe eigenvalues of certain matrices and the expected residuals when projecting into \nspaces  of  dimension  k.  Section  3  provides  inequality  relationships  between  the \nprocess eigenvalues  and the expectation of the Gram matrix eigenvalues.  Section 4 \npresents some concentration results and uses these to develop an approximate chain \nof inequalities.  In section 5 we obtain a performance bound on kernel peA, relating \nthe performance on the training sample to the expected performance wrt p(x). \n\n\f2  Background \n\n2.1  The kernel eigenproblern \n\nFor a  given kernel function  k(\u00b7,\u00b7)  the m  x m  Gram matrix K  has entries k(Xi,Xj), \ni, j  = 1, ... ,m, where  {Xi:  i  = 1, ... ,m} is  a  given  dataset.  For  Mercer kernels  K \nis  symmetric positive semi-definite.  We  denote the eigenvalues of the Gram matrix \nas  Al  2:  A2  .. . 2:  Am  2:  0  and write  its  eigendecomposition as  K  =  zAz'  where A \nis  a  diagonal  matrix of the eigenvalues  and  Z'  denotes  the transpose of matrix  Z. \nThe eigenvalues are also referred to as the spectrum of the  Gram matrix. \nWe  now  describe the relationship between the eigenvalues of the Gram matrix and \nthose of the underlying process.  For  a  given  kernel function  and density p(x)  on a \nspace  X, we  can also write down the eigenfunction problem \n\nIx k(x,Y)P(X)\u00a2i(X) dx =  AiC/Ji(Y)\u00b7 \n\n(1) \n\nthat  the  eigenfunctions  are  orthonormal  with  respect \n\nNote \ni.e. \nJ x  (Pi(x)p(x)\u00a2j (x)dx = 6ij.  Let  the  eigenvalues  be ordered so  that  Al  2:  A2  2:  .... \nThis  continuous  eigenproblem  can  be  approximated  in  the  following  way.  Let \n{Xi:  i  =  1, . .. , m}  be a  sample  drawn  according  to p(x).  Then  as  pointed  out  in \nWilliams  and Seeger  (2000),  we  can approximate the integral with  weight function \np(x) by an average over the sample points, and then plug in Y =  Xj  for  j  =  1, ... ,m \nto obtain the matrix eigenproblem. \n\nto  p(x), \n\nThus  we  see  that  J.1i d;j ~ Ai  is  an  obvious  estimator for  the  ith  eigenvalue  of the \ncontinuous  problem.  The  theory of the  numerical  solution of eigenvalue  problems \n(Baker  1977,  Theorem 3.4)  shows  that for  a  fixed  k,  J.1k  will  converge to  Ak  in  the \nlimit as m  -+  00. \nFor the case that  X  is  one dimensional, p(x)  is  Gaussian and  k(x, y)  =  exp -b(x(cid:173)\ny)2,  there are analytic results for  the eigenvalues and eigenfunctions of equation (1) \nas given in section 4 of Zhu et al.  (1998).  A plot in Williams and Seeger  (2000)  for \nm  =  500 with b =  3 and p(x) '\" N(O, 1/4) shows good agreement between J.1i  and Ai \nfor  small  i,  but that for  larger i  the  matrix eigenvalues  underestimate the  process \neigenvalues.  One of the by-products of this  paper will  be  bounds  on the degree of \nunderestimation for  this estimation problem in a  fully  general setting. \nKoltchinskii and Gine  (2000)  discuss a  number of results including rates of conver(cid:173)\ngence  of the  J.1-spectrum  to the  A-spectrum.  The  measure  they  use  compares  the \nwhole spectrum rather than individual eigenvalues or subsets of eigenvalues.  They \nalso do  not deal with the estimation problem for  PCA residuals. \n\n2.2  Projections,  residuals  and  eigenvalues \n\nThe approach adopted in the proofs of the next section is  to relate the eigenvalues \nto  the  sums  of squares  of residuals.  Let  X  be a  random variable  in  d  dimensions, \nand  let  X  be  a  d x  m  matrix  containing  m  sample  vectors  Xl, ... , X m .  Consider \nthe  m  x  m  matrix  M  =  XIX  with  eigendecomposition  M  =  zAz'.  Then taking \nX  =  Z VA we  obtain a  finite  dimensional  version  of Mercer's theorem.  To  set the \nscene,  we  now  present a  short description of the residuals  viewpoint. \nThe  starting  point  is  the  singular  value  decomposition  of  X  =  UY',Z' ,  where  U \nand Z  are orthonormal matrices and  Y',  is  a diagonal matrix containing the singular \n\n\fvalues  (in descending order).  We  can now reconstruct the eigenvalue decomposition \nof M  = X'X = Z~U'U~Z' = zAz', where A = ~2. But equally we  can construct \na  d x  d matrix N  = X X' = U~Z' Z~U' = u Au', with the same eigenvalues as M. \nWe  have  made  a  slight  abuse  of notation  by  using  A to represent  two  matrices  of \npotentially different dimensions, but the larger is  simply an extension of the smaller \nwith O's.  Note that N  = mCx , where Cx  is  the sample correlation matrix. \nLet  V  be  a  linear  space  spanned  by  k  linearly  independent  vectors.  Let  Pv(x) \n(PV(x))  be the projection of x onto V  (space perpendicular to V),  so that IlxW  = \nIIPv(x)112 + IIPv(x)112.  Using the Courant-Fisher minimax theorem it can be proved \n(Shawe-Taylor et al.,  2002,  equation 4)  that \n\nm \n\nm \n\nm \n\nm \n\nk \n\ni=l \n\nm \n\nm \n\nL IIxjl12 - L )...i(M)  =  min  L IlPv(xj)112. \nj=l \n\nL  )...i(M) \ni=k+1 \nHence the subspace spanned by the first  k eigenvectors is  characterised as  that for \nwhich the sum of the squares of the residuals is minimal.  We can also obtain similar \nresults for  the population case,  e.g.  L7=1  Ai  =  maXdim(V)=k lE[IIPv (x) 112]. \n2.3  Residuals  in feature  space \n\ndim(V)=k \n\nj=l \n\n(2) \n\nFrequently,  we  consider  all  of the  above  as  occurring  in  a  kernel  defined  feature \nspace,  so  that  wherever  we  have  written  a  vector  x  we  should  have  put  'l/J(x), \nwhere  'l/J  is  the  corresponding feature  map  'l/J  : x  E  X  f---t  'l/J(x)  E  F  to  a  feature \nspace  F.  Hence,  the  matrix  M  has  entries  Mij  =  ('l/J(Xi),'l/J(Xj)).  The  kernel \nfunction  computes  the  composition  of the  inner  product  with  the  feature  maps, \nk(x, z)  =  ('l/J(x) , 'l/J(z))  =  'l/J(x)''l/J(z) , which can in many cases be computed without \nexplicitly evaluating the mapping 'l/J.  We would also like to evaluate the projections \ninto eigenspaces without explicitly computing the feature mapping 'l/J .  This can be \ndone  as  follows.  Let  Ui  be  the  i-th  singular  vector  in  the  feature  space,  that  is \nthe  i-th eigenvector  of the  matrix N,  with  the  corresponding singular  value  being \nO\"i  =  ~ and the  corresponding eigenvector of M  being  Zi.  The projection of an \ninput x  onto  Ui  is  given  by \n\n'l/J(X)'Ui  = ('l/J(X)'U)i  = ('l/J(x)' X Z)W;l = k'ZW;l, \n\nwhere we  have used the fact  that X  = U~Z' and k j  = 'l/J(x)''l/J(Xj)  = k(x,xj). \nOur final  background observation concerns the kernel operator and its eigenspaces. \nThe operator in question is \n\nK(f)(x) = Ix k(x, z)J(z)p(z)dz. \n\nProvided  the  operator  is  positive  semi-definite,  by  Mercer's  theorem  we  can  de(cid:173)\ncompose  k(x,z)  as  a  sum  of  eigenfunctions,  k(x,z)  =  L :1 AiC!Ji(X) \u00a2i(Z)  = \n('l/J(x), 'l/J(z)),  where  the  functions  (\u00a2i(X)) ~l form  a  complete  orthonormal  basis \nwith  respect  to  the  inner  product  (j, g)p  =  Ix J(x)g(x)p(x)dx  and  'l/J(x)  is  the \nfeature space mapping \n\n'l/J  : x  --+ (1Pi(X)):l  =  ( A\u00a2i(X)):l E F. \n\nNote that \u00a2i(X)  has norm 1 and satisfies Ai\u00a2i(x)  =  Ix k(x, z)\u00a2i(z)p(z)dz  (equation \n1) , so that \n\nAi  =  r  k(y, Z) \u00a2i(Y)\u00a2i (Z)p(Z)p(y)dydz. \n\niX2 \n\n(3) \n\n\flE  [llPr(1jJ(x)) 112] \n\nIf we let cf>(x)  =  (cPi(X)):l  E  F, we can define the unit vector U i  E  F  corresponding \nto  Ai  by  Ui  =  Ix cPi(x)cf>(x)p(x)dx.  For  a  general function  J(x)  we  can  similarly \ndefine the vector f  =  Ix J(x)cf>(x)p(x)dx.  Now  the expected square of the norm of \nthe  projection Pr(1jJ(x))  onto the  vector  f  (assumed  to  be  of norm  1)  of an  input \n1jJ(x)  drawn according to p(x)  is  given  by \n\n=  L IlPr(1jJ(x))Wp(x)dx  =  L (f'1jJ(X))2 p(x)dx \n= L L L J(y) cf>(y)'1jJ (x)p(y)dyJ(z)cf> (z)'1jJ (x)p(z)dzp(x)dx \n= L3 J(y)J(z) t, A cPj(Y)cPj(x)p(y)dy ~ v>:ecPe(z)cPe(x)p(z)dzp(x)dx \n= L2 J(y)J(z) j~l AcPj(y)p(y)dyv'):ecPe(z)p(z)dz Ix cPj(x)cPe(x)p(x)dx \n=  L2 J(y)J(z) ~ AjcPj (Y)cPj (z)p(y)dyp(z)dz \n=  r J(y)J(z)k(y , z)p(y)p(z)dydz. \niX2 \n\nSince  all  vectors  f  in  the subspace spanned  by  the  image  of the  input  space  in  F \ncan be expressed in  this fashion,  it follows  using  (3)  that the sum of the finite  case \ncharacterisation of eigenvalues  and eigenvectors is  replaced by an expectation \n\nAk  =  max  min  lE[llPv (1jJ(x)) 112], \n\ndim(V)=k O#vEV \n\n(4) \n\nwhere V  is  a  linear subspace of the feature  space F.  Similarly, \n\nk \nL:Ai \ni=l \n\n00 \n\nmax \n\ndim(V)=k \n\nlE  [llPv(1jJ(x)) 112]  =  lE  [111jJ(x)112]  -\n\nmin \n\ndim(V)=k \n\nlE  [IIPv(1jJ(x))112]  , \n\n(5) \n\nwhere  Pv(1jJ(x))  (PV(1jJ(x)))  is  the  projection  of 1jJ(x)  into  the  subspace  V  (the \nprojection of 1jJ(x)  into the space orthogonal to V). \n\n2.4  Plan  of campaign \n\nWe are now in a position to motivate the main results ofthe paper.  We consider the \ngeneral  case  of a  kernel  defined  feature  space  with  input  space  X  and probability \ndensity p(x).  We fix a sample size m and a draw of m examples S =  (Xl, X2 , ... , xm ) \naccording to p.  Further we fix a feature dimension k.  Let Vk  be the space spanned by \nthe first k eigenvectors of the sample kernel matrix K  with corresponding eigenvalues \n'\\1, '\\2,\"\"  '\\k, while Vk  is  the space spanned by the first  k process eigenvectors with \ncorresponding eigenvalues  A1 , A2 , ... , Ak'  Similarly,  let  E[J(x)]  denote expectation \nwith  respect  to  the sample,  E[J(x)]  =  ~ 2:::1 J(Xi),  while  as  before  lE[\u00b7]  denotes \nexpectation with respect to p. \nWe  are  interested  in  the  relationships  between  the  following  quantities: \nE [IIPVk (x)11 2]  =  ~ 2:7=1 ~i  =  2:7=1 ILi ,  (ii) \n\n(i) \nlE  [IIPVk(X)112]  =  2:7=1  Ai  (iii) \n\n\flE  [IIPVk  (x)11 2]  and  (iv)  IE  [IIPVk (x)11 2] .  Bounding  the  difference  between  the  first \nand second will  relate the process eigenvalues  to the sample eigenvalues,  while  the \ndifference  between  the first  and third  will  bound the expected  performance of the \nspace identified by kernel  PCA when  used on new  data. \nOur first  two observations follow  simply from  equation  (5), \n\nIE  [IIPYk (x) 112] \n\nand \n\nlE [IIPVk  (x) 11 2] \n\nA \n\n[ \n\nk \n\n2] \nAi  ~ lE  IIPVk (x) II  , \n\n1 l: A \n-\nm  i=l \nk \nl: Ai  ~ lE  [IIPYk  (x)11 2]  . \ni=l \n\n(6) \n\n(7) \n\nOur strategy will  be to show that the right hand side of inequality  (6)  and the left \nhand side  of inequality  (7)  are  close  in  value  making the two  inequalities  approxi(cid:173)\nmately a  chain of inequalities.  We  then bound the difference  between the first  and \nlast entries in the chain. \n\n3  A veraging over  Samples and Population Eigenvalues \nThe sample  correlation matrix is  ex  =  ~XXI with eigenvalues  ILl  ~ IL2\u00b7\u00b7\u00b7  ~ ILd. \nIn  the  notation  of  the  section  2  ILi  =  (l/m),\\i '  The  corresponding  population \ncorrelation  matrix  has  eigenvalues  Al  ~ A2  ...  ~ Ad  and  eigenvectors  ul , . .. , U d. \nAgain by the observations above these are the process eigenvalues.  Let lE.n [.]  denote \naverages over random samples  of size  m . \n\nThe following proposition describes how lE.n [ILl ] is related to Al  and lE.n [ILd]  is related \nto  Ad.  It  requires no assumption of Gaussianity. \nProposition 1  (Anderson,  1963, pp  145-146)  lE.n [ILd  ~ Al  and lE.n[ILd]  :s:  Ad' \nProof: By the results of the previous section we  have \n\nWe  now  apply  the expectation operator lE.n  to both sides.  On the RHS  we  get \n\nlE.nIE  [llFul (x )11 2]  =  lE  [llFul (x)112]  =  Al \n\nby equation (5),  which completes the proof.  Correspondingly ILd  is  characterized by \nILd  = mino#c IE  [llFc(Xi) 11 2]  (minor components analysis).  D \nInterpreting this  result,  we  see  that lE.n [ILl]  overestimates  AI,  while  lE.n [ILd]  under(cid:173)\nestimates  Ad. \nProposition  1  can  be  generalized  to  give  the  following  result  where  we  have  also \nallowed for  a  kernel defined feature  space of dimension N F  :s:  00. \nProposition 2  Using  the  above  notation,  for  any  k,  1  :s:  k  :s:  m ,  lE.n [L:~=l ILi]  ~ \nL:~=l Ai  and lE.n [L::k+l ILi]  :s:  L:~k+l Ai\u00b7 \nProof: Let Vk  be the space spanned by the first  k process eigenvectors.  Then from \nthe derivations above we  have \n\nk \n\nl:ILi  =  v:  ::~=k IE  [11Fv('I/J(x))W]  ~ IE  [llFvk('I/J(x ))1 12]. \ni=l \n\n\fAgain,  applying  the  expectation  operator Em  to  both  sides  of this  equation  and \ntaking equation (5)  into account, the first inequality follows.  To prove the second we \nturn max into min, Pinto pl. and reverse the inequality.  Again taking expectations \nof both sides  proves the second part. 0 \nApplying the results obtained in this section, it follows that Em [ILl]  will overestimate \nA1,  and the  cumulative  sum 2::=1 Em [ILi ] will  overestimate 2::=1 Ai.  At  the other \nend, clearly for  N F \n\n:::::  k  > m, ILk  ==  0 is  an underestimate of Ak. \n\n4  Concentration of eigenvalues \n\nWe  now make use of results from Shawe-Taylor et al.  (2002)  concerning the concen(cid:173)\ntration of the eigenvalue spectrum of the Gram matrix.  We  have \nTheorem 3  Let K(x, z)  be  a positive  semi-definite  kernel function  on  a space  X, \nand let p  be  a  probability  density  function  on  X.  Fix  natural  numbers m  and 1 :::; \nk  < m  and let S  = (Xl, ... ,Xm)  E xm  be  a sample  of m  points  drawn  according  to \np.  Then  for  all  t  > 0, \n\np{ I ~~~k(S)_Em [~~9(S)] 1  :::::t} \n\n:::;  2exp(-~:m), \n\nwhere ~~k (S)  is  the sum of the largest k  eigenvalues of the matrix K(S)  with entries \nK(S)ij =  K(Xi,Xj)  and R2  =  maxxEX K(x, x). \nThis follows  by a  similar derivation to Theorem 5 in Shawe-Taylor et al.  (2002). \nOur next result  concerns the concentration of the residuals with  respect  to a  fixed \nsubspace.  For a  subspace V  and training set  S, we  introduce the notation \n\nFv(S)  =  t  [llPv('IjJ(x)) 112]  . \n\nTheorem 4  Let p  be  a probability  density function  on  X.  Fix natural numbers m \nand  a  subspace  V  and  let  S  =  (Xl' ... ' Xm)  E  xm  be  a  sample  of m  points  drawn \naccording  to  a probability  density function  p.  Then for  all  t  > 0, \n:::::  t} :::;  2exp (~~~) . \n\nP{Fv(S) - Em [Fv(S)] 1 \n\nThis is  theorem 6 in Shawe-Taylor et al.  (2002). \nThe concentration results of this section are very tight.  In the notation of the earlier \nsections they show that with high probability \n\nand \n\nk L Ai  ~ t  [IIPVk ('IjJ(x))W]  , \n\ni = l \n\n(9) \n\nwhere we  have used Theorem 3 to obtain the first  approximate equality and Theo(cid:173)\nrem 4 with V  =  Vk  to obtain the second approximate equality. \nThis gives the sought relationship to create an approximate chain of inequalities \n\n~  IE [IIPVk('IjJ(x))112]  =  L Ai::::: IE  [IIPVk ('IjJ(X)) 112]  . (10) \n\nk \n\ni = l \n\n\fThis approximate chain of inequalities could also have  been obtained using Propo(cid:173)\nsition 2.  It remains to bound the difference between the first  and last entries in this \nchain.  This  together  with  the  concentration results  of this  section  will  deliver  the \nrequired  bounds  on  the  differences  between  empirical  and  process  eigenvalues,  as \nwell  as providing a  performance bound on kernel peA. \n\n5  Learning  a  projection matrix \n\nThe  key  observation  that  enables  the  analysis  bounding  the  difference  between \nt  [IIPvJ!p(X)) 11 2]  and IE  [IIPvJ'I/J(x)) 11 2]  is  that  we  can view  the  projection norm \nIIPvJ'I/J(x))1 12 as a  linear function  of pairs offeatures from  the feature space F. \nProposition 5  The  projection  norm IIPVk ('I/J(X)) 11 2 is  a linear function  j  in  a fea(cid:173)\nture  space  F for  which  the  kernel  function  is  given  by  k(x, z)  =  k(x , Z)2.  Further(cid:173)\nmore  the  2-norm  of the  function  j  is  Vk. \nProof: Let X  =  Uy:.Z'  be the singular value decomposition of the sample matrix X \nin the feature space.  The projection norm is  then given by  j(x) =  IIPVk ('I/J(X)) 11 2 = \n'I/J(x)'UkUk'I/J(x),  where Uk  is  the matrix containing the first  k columns of U.  Hence \nwe  can write \n\nIIPvJ'I/J(x))11 2 =  l: (Xij'I/J(X) i'I/J(X)j  =  l: (Xij1p(X)ij, \n\nNF \n\nNF \n\nwhere 1p  is  the projection mapping into the feature  space F consisting of all  pairs \nof F  features  and (Xij  =  (UkUk)ij.  The standard polynomial construction gives \n\nij=l \n\nij=l \n\nk(x, z) \n\nNF \n\nl: 'I/J(X)i'I/J(Z)i'I/J(X)j'I/J(z)j  =  l: ('I/J(X)i'I/J(X)j)('I/J(Z)i'I/J(Z)j) \n\nNF \n\ni,j=l \n\ni,j=l \n\nIt remains  to  show  that  the  norm  of  the  linear  function  is  k.  The  norm  satisfies \n(note that  II  . IIF  denotes the Frobenius norm and  U i  the columns of U) \n\ni~' a1j ~ IIU,U;II} ~ (~\",U;, t, Ujuj) F ~ it, (U;Uj)' ~ k \n\nIlill' \n\nas  required. D \nWe  are  now  in  a  position  to  apply  a  learning  theory  bound  where  we  consider  a \nregression  problem  for  which  the  target  output  is  the  square  of the  norm  of the \nsample point  11'I/J(x)112.  We  restrict the linear function in the space F to have norm \nVk.  The loss function is  then the shortfall between the output of j and the squared \nnorm. \nUsing Rademacher complexity theory we  can obtain the following  theorems: \n\nTheorem 6  If we  perform  peA  in  the  feature  space  defined  by  a  kernel  k(x, z) \nthen  with  probability  greater  than  1 - 6,  for  all  1 :::;  k  :::;  m,  if we  project  new  data \n\n\fonto  the  space 11k ,  the  expected  squared  residual  is  bounded  by \n\n,\\,>.  :<:  IE  [ IIPt; (\"'(x)) II' 1  <  '~'~k [ ~ \\>l(S) + 7#  ,----------------, \n\n+R2  ~ln C:) \n\nl <!,f<!, k  m \n\nm  i=l \n\nwhere  the support of the  distribution  is in a ball  of radius  R  in the feature  space  and \nAi  and .xi  are  the  process  and  empirical  eigenvalues  respectively. \nTheorem 7  If we  perform  peA  in  the  feature  space  defined  by  a  kernel  k(x , z) \nthen  with  probability  greater  than  1 - 5,  for  all  1 :s:  k  :s:  m,  if we  project  new  data \nonto  the  space 11k ,  the  sum  of the  largest  k  process  eigenvalues  is  bounded  by \nA<!,k  ;:::  lE  [IIPVk (\"IjJ(x))W]  >  max  [~.x<!'f(S) - 1 + v'\u00a3 \nVm \n_R2  ~ln C(mt 1)) \nwhere  the support of the  distribution  is in a ball  of radius  R  in the feature  space  and \nAi  and .xi  are  the  process  and  empirical  eigenvalues  respectively. \nThe  proofs  of  these  results  are  given  in  Shawe-Taylor  et  al.  (2003).  Theorem  6 \nimplies  that if k  \u00ab m  the expected residuallE [11Pt;, (\"IjJ(x)) 112 ] closely  matches the \naverage  sample  residual  of IE  [11Pt;,(\"IjJ(x))112]  =  (1/m)E:k+ 1 .xi ,  thus  providing \na  bound for  kernel  peA on  new  data.  Theorem  7 implies  a  good  fit  between  the \npartial sums of the largest k empirical and process eigenvalues when Jk/m is small. \n\n!  f k(Xi' Xi)2 \n\nReferences \nAnderson,  T.  W.  (1963).  Asymptotic Theory for  Principal  Component  Analysis.  Annals \n\nof Mathematical  Statistics, 34( 1): 122- 148. \n\nBaker,  C.  T.  H.  (1977).  The  numerical treatm ent  of integral  equations.  Clarendon  Press, \n\nOxford. \n\nKoltchinskii, V.  and Gine, E.  (2000).  Random matrix approximation of spectra of integral \n\noperators.  B ernoulli,6(1):113- 167. \n\nSch6lkopf,  B.,  Smola,  A. ,  and  Miiller,  K-R.  (1998).  Nonlinear  component  analysis  as  a \n\nkernel eigenvalue  problem.  Neural  Computation,  10:1299- 1319. \n\nShawe-Taylor, J.,  Cristianini, N., and Kandola, J.  (2002).  On the Concentration of Spectral \nProperties.  In  Diettrich,  T.  G.,  Becker,  S.,  and  Ghahramani,  Z.,  editors,  Advances  in \nNeural  Information  Processing  Systems  14.  MIT Press. \n\nShawe-Taylor,  J.,  Williams,  C.  K  I. ,  Cristianini,  N.,  and  Kandola,  J.  (2003).  On  the \n\nEigenspectrum of the Gram Matrix and the Generalisation Error of Kernel  PCA.  Tech(cid:173)\nnical Report NC2-TR-2003-143, Dept of Computer Science,  Royal Holloway,  University \nof London.  Available from  http://www.neurocolt.com/archi ve . html. \n\nWilliams,  C.  K  I.  and Seeger, M.  (2000).  The Effect  of the Input Density Distribution on \n\nKernel-based Classifiers.  In Langley,  P.,  editor,  Proceedings  of the  Seventeenth  Interna(cid:173)\ntional  Conference  on  Machine  Learning  (ICML  2000).  Morgan  Kaufmann. \n\nZhu,  H.,  Williams, C.  K  I.,  Rohwer, R.  J.,  and Morciniec,  M.  (1998).  Gaussian regression \nand optimal finite dimensional linear models.  In Bishop, C.  M.,  editor,  Neural  Networks \nand Machine  Learning.  Springer-Verlag,  Berlin. \n\n\f", "award": [], "sourceid": 2309, "authors": [{"given_name": "Christopher", "family_name": "Williams", "institution": null}, {"given_name": "John", "family_name": "Shawe-taylor", "institution": null}]}