{"title": "On the Concentration of Spectral Properties", "book": "Advances in Neural Information Processing Systems", "page_first": 511, "page_last": 517, "abstract": null, "full_text": "On the  Concentration of Spectral \n\nProperties \n\nJohn  Shawe-Taylor \n\nRoyal Holloway,  University of London \n\njohn@cs.rhul.ac.uk \n\nN ella  Cristianini \nBIOwulf Technologies \nnello@support-vector. net \n\nJaz  Kandola \n\nRoyal Holloway,  University of London \n\njaz@cs.rhul.ac.uk \n\nAbstract \n\nWe  consider  the  problem  of  measuring  the  eigenvalues  of a  ran(cid:173)\ndomly drawn sample of points.  We  show  that these  values  can be \nreliably  estimated  as  can the  sum  of the  tail  of eigenvalues.  Fur(cid:173)\nthermore,  the  residuals  when  data is  projected  into a  subspace  is \nshown to be reliably estimated on  a  random sample.  Experiments \nare presented that confirm the theoretical results. \n\n1 \n\nIntroduction \n\nA  number  of learning algorithms  rely  on estimating  spectral  data on  a  sample  of \ntraining  points  and  using  this  data as  input  to  further  analyses.  For  example  in \nPrincipal  Component  Analysis  (PCA)  the  subspace  spanned  by  the  first  k  eigen(cid:173)\nvectors  is  used  to  give  a  k  dimensional  model  of the  data with  minimal  residual, \nhence  forming  a  low  dimensional  representation  of the  data  for  analysis  or  clus(cid:173)\ntering.  Recently  the  approach  has  been  applied  in  kernel  defined  feature  spaces \nin  what  has  become  known  as  kernel-PCA  [5].  This  representation  has  also  been \nrelated to an Information  Retrieval  algorithm  known  as  latent  semantic  indexing, \nagain with kernel defined feature  spaces  [2]. \nFurthermore eigenvectors have been used in the HITS [3]  and Google's PageRank [1] \nalgorithms.  In  both cases the entries in the eigenvector corresponding to the maxi(cid:173)\nmal eigenvalue are interpreted as authority weightings for  individual articles or web \npages. \nThe use of these techniques raises the question of how reliably these quantities can \nbe estimated from a random sample of data, or phrased differently, how much data is \nrequired to obtain an accurate empirical estimate with high confidence.  Ng  et al. [6] \nhave undertaken a study of the sensitivity of the estimate of the first  eigenvector to \nperturbations of the  connection  matrix.  They  have  also  highlighted  the  potential \ninstability that can arise when two eigenvalues are very close in value, so that their \neigenspaces become very difficult  to distinguish empirically. \nThe  aim  of this  paper is  to  study  the  error  in  estimation  that  can  arise  from  the \nrandom  sampling  rather than from  perturbations of the  connectivity.  We  address \n\n\fthis  question  using  concentration inequalities.  We  will  show  that  eigenvalues  esti(cid:173)\nmated from  a  sample of size  m  are indeed  concentrated,  and furthermore  the  sum \nof the  last  m  - k  eigenvalues  is  subject  to  a  similar  concentration effect,  both  re(cid:173)\nsults of independent mathematical interest.  The sum of the last  m  - k eigenvalues \nis  related to the error in  forming  a  k  dimensional  PCA approximation,  and hence \nwill  be shown to justify using empirical projection subspaces in such algorithms as \nkernel-PCA and latent semantic kernels. \nThe paper is  organised as  follows.  
The paper is organised as follows. In section 2 we give the background results and develop the basic techniques that are required to derive the main results in section 3. We provide experimental verification of the theoretical findings in section 4, before drawing our conclusions.

2 Background and Techniques

We will make use of the following results due to McDiarmid. Note that E_S is the expectation operator under the selection of the sample.

Theorem 1 (McDiarmid [4]) Let X_1, ..., X_n be independent random variables taking values in a set A, and assume that f : A^n -> R and f_i : A^{n-1} -> R satisfy for 1 ≤ i ≤ n

sup_{x_1,...,x_n} |f(x_1, ..., x_n) - f_i(x_1, ..., x_{i-1}, x_{i+1}, ..., x_n)| ≤ c_i.

Then for all ε > 0,

P{ |f(X_1, ..., X_n) - E[f(X_1, ..., X_n)]| > ε } ≤ 2 exp( -2ε² / Σ_{i=1}^n c_i² ).

Theorem 2 (McDiarmid [4]) Let X_1, ..., X_n be independent random variables taking values in a set A, and assume that f : A^n -> R satisfies for 1 ≤ i ≤ n

sup_{x_1,...,x_n, x̂_i} |f(x_1, ..., x_n) - f(x_1, ..., x_{i-1}, x̂_i, x_{i+1}, ..., x_n)| ≤ c_i.

Then for all ε > 0,

P{ |f(X_1, ..., X_n) - E[f(X_1, ..., X_n)]| > ε } ≤ 2 exp( -2ε² / Σ_{i=1}^n c_i² ).

We will also make use of the following theorem characterising the eigenvectors of a symmetric matrix.

Theorem 3 (Courant-Fischer Minimax Theorem) If M ∈ R^{m×m} is symmetric, then for k = 1, ..., m,

λ_k(M) = max_{dim(T)=k} min_{0≠v∈T} (v'Mv)/(v'v) = min_{dim(T)=m-k+1} max_{0≠v∈T} (v'Mv)/(v'v),

with the extrema achieved by the corresponding eigenvector.

The approach adopted in the proofs of the next section is to view the eigenvalues as sums of squares of residuals. This is applicable when the matrix is positive semi-definite and hence can be written as an inner product matrix M = X'X, where X' is the transpose of the matrix X containing the m vectors x_1, ..., x_m as columns. This is the finite dimensional version of Mercer's theorem, and follows immediately if we take X = √Λ V', where M = VΛV' is the eigenvalue decomposition of M. There may be more succinct ways of representing X, but we will assume for simplicity (but without loss of generality) that X is a square matrix with the same dimensions as M. To set the scene, we now present a short description of the residuals viewpoint.

The starting point is the singular value decomposition of X = UΣV', where U and V are orthonormal matrices and Σ is a diagonal matrix containing the singular values (in descending order). We can now reconstruct the eigenvalue decomposition of M = X'X = VΣU'UΣV' = VΛV', where Λ = Σ². But equally we can construct a matrix N = XX' = UΣV'VΣU' = UΛU', with the same eigenvalues as M.

As a simple example consider now the first eigenvalue, which by Theorem 3 and the above observations is given by

λ_1(M) = λ_1(N) = max_{0≠v∈R^m} (v'Nv)/(v'v) = max_{0≠v∈R^m} (v'XX'v)/(v'v) = max_{0≠v∈R^m} Σ_{j=1}^m ||P_v(x_j)||²
       = Σ_{j=1}^m ||x_j||² - min_{0≠v∈R^m} Σ_{j=1}^m ||P_v^⊥(x_j)||²,

where P_v(x) (P_v^⊥(x)) is the projection of x onto the space spanned by v (onto the space perpendicular to v), since ||x||² = ||P_v(x)||² + ||P_v^⊥(x)||². It follows that the first eigenvector is characterised as the direction for which the sum of the squares of the residuals is minimal.

Applying the same line of reasoning to the first equality of Theorem 3 delivers the following equality:

λ_k = max_{dim(V)=k} min_{0≠v∈V} Σ_{j=1}^m ||P_v(x_j)||².    (1)
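These identities are easy to check numerically. The short NumPy sketch below (illustrative code, not from the paper) verifies on random data that M = X'X and N = XX' share the same eigenvalues, and that the summed squared projections of the columns x_j onto an eigenvector of N recover the corresponding eigenvalue:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 6))            # columns x_1, ..., x_m with m = 6
M = X.T @ X                            # inner product matrix M = X'X
N = X @ X.T                            # N = XX'

lam_M = np.sort(np.linalg.eigvalsh(M))[::-1]
lam_N, U = np.linalg.eigh(N)
order = np.argsort(lam_N)[::-1]
lam_N, U = lam_N[order], U[:, order]   # eigenvalues/eigenvectors of N in descending order
assert np.allclose(lam_M, lam_N)       # M and N have the same eigenvalues

# The k-th eigenvalue equals the summed squared projections of the columns x_j
# onto the k-th eigenvector v_k of N (cf. equation (2) just below).
for k in range(6):
    v_k = U[:, k]
    proj_sq = sum((v_k @ X[:, j]) ** 2 for j in range(X.shape[1]))
    assert np.isclose(proj_sq, lam_N[k])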
Notice that this characterisation implies that if v_k is the k-th eigenvector of N, then

λ_k = Σ_{j=1}^m ||P_{v_k}(x_j)||²,    (2)

which in turn implies that if V_k is the space spanned by the first k eigenvectors, then

Σ_{i=1}^k λ_i = Σ_{j=1}^m ||P_{V_k}(x_j)||² = Σ_{j=1}^m ||x_j||² - Σ_{j=1}^m ||P_{V_k}^⊥(x_j)||²,    (3)

where P_V(x) (P_V^⊥(x)) is the projection of x into the space V (into the space perpendicular to V). It readily follows by induction over the dimension of V that we can equally characterise the sum of the first k and of the last m - k eigenvalues by

Σ_{i=1}^k λ_i = max_{dim(V)=k} Σ_{j=1}^m ||P_V(x_j)||² = Σ_{j=1}^m ||x_j||² - min_{dim(V)=k} Σ_{j=1}^m ||P_V^⊥(x_j)||²,

Σ_{j=1}^m ||x_j||² - Σ_{i=1}^k λ_i = min_{dim(V)=k} Σ_{j=1}^m ||P_V^⊥(x_j)||².    (4)

Hence, as for the case when k = 1, the subspace spanned by the first k eigenvectors is characterised as that for which the sum of the squares of the residuals is minimal.

Frequently, we consider all of the above as occurring in a kernel defined feature space, so that wherever we have written x_j we should have put φ(x_j), where φ is the corresponding feature mapping.

3 Concentration of eigenvalues

The previous section outlined the relatively well-known perspective that we now apply to obtain the concentration results for the eigenvalues of positive semi-definite matrices. The key to the results is the characterisation in terms of the sums of residuals given in equations (1) and (4).

Theorem 4 Let K(x, z) be a positive semi-definite kernel function on a space X, and let μ be a distribution on X. Fix natural numbers m and 1 ≤ k < m and let S = (x_1, ..., x_m) ∈ X^m be a sample of m points drawn according to μ. Then for all ε > 0,

P{ |(1/m) λ_k(S) - E_S[(1/m) λ_k(S)]| ≥ ε } ≤ 2 exp( -2ε²m / R⁴ ),

where λ_k(S) is the k-th eigenvalue of the matrix K(S) with entries K(S)_{ij} = K(x_i, x_j) and R² = max_{x∈X} K(x, x).

Proof: The result follows from an application of Theorem 1 provided

sup_S | (1/m) λ_k(S) - (1/m) λ_k(S \ {x_i}) | ≤ R²/m.

Let Ŝ = S \ {x_i} and let V (V̂) be the k-dimensional subspace spanned by the first k eigenvectors of K(S) (K(Ŝ)). Using equation (1) we have

λ_k(S) ≥ min_{0≠v∈V̂} Σ_{j=1}^m ||P_v(x_j)||² ≥ min_{0≠v∈V̂} Σ_{j≠i} ||P_v(x_j)||² = λ_k(Ŝ),

λ_k(Ŝ) ≥ min_{0≠v∈V} Σ_{j≠i} ||P_v(x_j)||² ≥ min_{0≠v∈V} Σ_{j=1}^m ||P_v(x_j)||² - R² = λ_k(S) - R²,

so that 0 ≤ λ_k(S) - λ_k(Ŝ) ≤ R², and dividing by m gives the required inequality. □
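The bounded-difference property at the heart of this proof is easy to check empirically. The sketch below (illustrative code, not the authors'; it uses a linear kernel on points confined to the unit ball, so that R² ≤ 1) verifies that deleting any single point changes λ_k by at most R²:

import numpy as np

rng = np.random.default_rng(2)
m, k = 30, 3
X = rng.normal(size=(m, 5))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # keep every point in the unit ball, so R^2 <= 1
K = X @ X.T                                                     # linear kernel matrix on the sample

def kth_eigenvalue(K, k):
    # k-th largest eigenvalue, with k counted from 1 as in the paper
    return np.sort(np.linalg.eigvalsh(K))[::-1][k - 1]

lam_full = kth_eigenvalue(K, k)
for i in range(m):
    keep = np.delete(np.arange(m), i)
    lam_minus = kth_eigenvalue(K[np.ix_(keep, keep)], k)
    diff = lam_full - lam_minus
    assert -1e-9 <= diff <= 1.0 + 1e-9   # 0 <= lambda_k(S) - lambda_k(S \ {x_i}) <= R^2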
Surprisingly, a very similar result holds when we consider the sum of the last m - k eigenvalues.

Theorem 5 Let K(x, z) be a positive semi-definite kernel function on a space X, and let μ be a distribution on X. Fix natural numbers m and 1 ≤ k < m and let S = (x_1, ..., x_m) ∈ X^m be a sample of m points drawn according to μ. Then for all ε > 0,

P{ |(1/m) λ_{>k}(S) - E_S[(1/m) λ_{>k}(S)]| ≥ ε } ≤ 2 exp( -2ε²m / R⁴ ),

where λ_{>k}(S) is the sum of all but the largest k eigenvalues of the matrix K(S) with entries K(S)_{ij} = K(x_i, x_j) and R² = max_{x∈X} K(x, x).

Proof: The result follows from an application of Theorem 1 provided

sup_S | (1/m) λ_{>k}(S) - (1/m) λ_{>k}(S \ {x_i}) | ≤ R²/m.

Let Ŝ = S \ {x_i} and let V (V̂) be the k-dimensional subspace spanned by the first k eigenvectors of K(S) (K(Ŝ)). Using equation (4) we have

λ_{>k}(Ŝ) ≤ Σ_{j≠i} ||P_V^⊥(x_j)||² ≤ Σ_{j=1}^m ||P_V^⊥(x_j)||² = λ_{>k}(S),

λ_{>k}(S) ≤ Σ_{j=1}^m ||P_{V̂}^⊥(x_j)||² ≤ Σ_{j≠i} ||P_{V̂}^⊥(x_j)||² + R² = λ_{>k}(Ŝ) + R²,

so that 0 ≤ λ_{>k}(S) - λ_{>k}(Ŝ) ≤ R², and dividing by m gives the required inequality. □

Our next result concerns the concentration of the residuals with respect to a fixed subspace. For a subspace V and training set S, we introduce the notation

F_V(S) = (1/m) Σ_{i=1}^m ||P_V^⊥(x_i)||².

Theorem 6 Let μ be a distribution on X. Fix a natural number m and a subspace V and let S = (x_1, ..., x_m) ∈ X^m be a sample of m points drawn according to μ. Then for all ε > 0,

P{ |F_V(S) - E_S[F_V(S)]| ≥ ε } ≤ 2 exp( -2ε²m / R⁴ ),

where, as before, R² bounds the squared norms of the points.

Proof: The result follows from an application of Theorem 2 provided

sup_{S, x̂_i} | F_V(S) - F_V((S \ {x_i}) ∪ {x̂_i}) | ≤ R²/m.

Clearly the largest change will occur if one of the points x_i and x̂_i lies in the subspace V and the other does not. In this case the change will be at most R²/m. □

4 Experiments

In order to test the concentration results we performed experiments with the Breast cancer data using a cubic polynomial kernel. The kernel was chosen to ensure that the spectrum did not decay too fast.

We randomly selected 50% of the data as a 'training' set and kept the remaining 50% as a 'test' set. We centred the whole data set so that the origin of the feature space is placed at the centre of gravity of the training set. We then performed an eigenvalue decomposition of the training set. The sum of the eigenvalues greater than the k-th gives the sum of the residual squared norms of the training points when we project onto the space spanned by the first k eigenvectors. Dividing this by the sum of all the eigenvalues (whose average measures the average squared norm of the training points in the transformed space) gives the fraction of the residual not captured in the k-dimensional projection. This quantity was averaged over 5 random splits and plotted against dimension in Figure 1 as the continuous line. The error bars give one standard deviation. Figure 1a shows the full spectrum, while Figure 1b shows a zoomed-in subwindow. The very tight error bars show clearly the very tight concentration of the sums of the tail of eigenvalues, as predicted by Theorem 5.

In order to test the concentration results for projections into a fixed subspace, we measured the residuals of the test points when they are projected into the subspace spanned by the first k eigenvectors generated above for the training set. The dashed lines in Figure 1 show the ratio of the average squares of these residuals to the average squared norm of the test points. We see the two curves tracking each other very closely, indicating that the subspace identified as optimal for the training set is indeed capturing almost the same amount of information in the test points.
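The experimental protocol can be sketched in a few lines. The following is an illustrative simplification (synthetic data and a linear kernel in place of the Breast cancer data and cubic polynomial kernel used above) of the train/test residual comparison plotted in Figure 1:

import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20))  # synthetic data with correlated features
perm = rng.permutation(len(data))
train, test = data[perm[:100]], data[perm[100:]]               # random 50/50 split

mu = train.mean(axis=0)                                        # centre of gravity of the training set
train, test = train - mu, test - mu                            # centre both sets on the training mean

# Eigen-directions of the training sample (right singular vectors of the centred data).
_, _, Vt = np.linalg.svd(train, full_matrices=False)

def residual_fraction(Z, Vt, k):
    # Fraction of the total squared norm of the rows of Z that falls outside
    # the subspace spanned by the first k training eigen-directions.
    captured = ((Z @ Vt[:k].T) ** 2).sum()
    return 1.0 - captured / (Z ** 2).sum()

for k in (1, 2, 5, 10, 20):
    print(k, round(residual_fraction(train, Vt, k), 3), round(residual_fraction(test, Vt, k), 3))

On such well-behaved synthetic data the training and test fractions track each other closely, mirroring the behaviour of the continuous and dashed curves in Figure 1.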
5 Conclusions

The paper has shown that the eigenvalues of a positive semi-definite matrix generated from a random sample are concentrated. Furthermore the sum of the last m - k eigenvalues is similarly concentrated, as is the residual when the data is projected into a fixed subspace.

[Figure 1 appears here. Caption: Plots of the fraction of the average squared norm captured in the subspace spanned by the first k eigenvectors, plotted against projection dimensionality for different values of k. The continuous line is the fraction for the training set, while the dashed line is for the test set. Panel (a) shows the full spectrum, while panel (b) zooms in on an interesting portion.]

Experiments are presented that confirm the theoretical predictions on a real world dataset. The results provide a basis for performing PCA or kernel-PCA from a randomly generated sample, as they confirm that the subspace identified by the sample will indeed 'generalise' in the sense that it will capture most of the information in a test sample.

Further research should look at the question of how the space identified by a subsample relates to the eigenspace of the underlying kernel operator.

References

[1] S. Brin and L. Page. The anatomy of a large-scale hypertextual (web) search engine. In Proceedings of the Seventh International World Wide Web Conference, 1998.

[2] N. Cristianini, H. Lodhi, and J. Shawe-Taylor. Latent semantic kernels for feature selection. Technical Report NC-TR-00-080, NeuroCOLT Working Group, http://www.neurocolt.org, 2000.

[3] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.

[4] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989, pages 148-188. Cambridge University Press, 1989.

[5] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems 11, 1998.

[6] A. Y. Ng, A. X. Zheng, and M. I. Jordan. Link analysis, eigenvectors and stability. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-01), 2001.
", "award": [], "sourceid": 2127, "authors": [{"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}, {"given_name": "Nello", "family_name": "Cristianini", "institution": null}, {"given_name": "Jaz", "family_name": "Kandola", "institution": null}]}