{"title": "On a Connection between Kernel PCA and Metric Multidimensional Scaling", "book": "Advances in Neural Information Processing Systems", "page_first": 675, "page_last": 681, "abstract": null, "full_text": "On a Connection between Kernel PCA\nand Metric Multidimensional Scaling\n\nChristopher K. I. Williams\n\nDivision of Informatics\n\nThe University of Edinburgh\n\n5 Forrest Hill, Edinburgh EH1 2QL, UK\n\nc.k.i.williams@ed.ac.uk\n\nhttp://anc.ed.ac.uk\n\nAbstract\n\nIn this paper we show that the kernel PCA algorithm of Sch\u00f6lkopf et al. (1998) can be interpreted as a form of metric multidimensional scaling (MDS) when the kernel function $k(x, y)$ is isotropic, i.e. it depends only on $\\|x - y\\|$. This leads to a metric MDS algorithm where the desired configuration of points is found via the solution of an eigenproblem rather than through the iterative optimization of the stress objective function. The question of kernel choice is also discussed.\n\n1  Introduction\n\nSuppose we are given $n$ objects, and for each pair $(i,j)$ we have a measurement of the \"dissimilarity\" $\\delta_{ij}$ between the two objects. In multidimensional scaling (MDS) the aim is to place $n$ points in a low dimensional space (usually Euclidean) so that the interpoint distances $d_{ij}$ have a particular relationship to the original dissimilarities. In classical scaling we would like the interpoint distances to be equal to the dissimilarities. For example, classical scaling can be used to reconstruct a map of the locations of some cities given the distances between them.\n\nIn metric MDS the relationship is of the form $d_{ij} \\approx f(\\delta_{ij})$ where $f$ is a specific function. In this paper we show that the kernel PCA algorithm of Sch\u00f6lkopf et al. [7] can be interpreted as performing metric MDS if the kernel function is isotropic.
\nThis is achieved by performing classical scaling in the feature space defined by the kernel.\n\nThe structure of the remainder of this paper is as follows: In section 2 classical and metric MDS are reviewed, and in section 3 the kernel PCA algorithm is described. The link between the two methods is made in section 4. Section 5 describes approaches to choosing the kernel function, and we finish with a brief discussion in section 6.\n\n2  Classical and metric MDS\n\n2.1  Classical scaling\n\nGiven $n$ objects and the corresponding dissimilarity matrix, classical scaling is an algebraic method for finding a set of points in space so that the dissimilarities are well-approximated by the interpoint distances. The classical scaling algorithm is introduced below by starting with the locations of $n$ points, constructing a dissimilarity matrix based on their Euclidean distances, and then showing how the configuration of the points can be reconstructed (as far as possible) from the dissimilarity matrix.\n\nLet the coordinates of $n$ points in $p$ dimensions be denoted by $x_i$, $i = 1, \\ldots, n$. These can be collected together in an $n \\times p$ matrix $X$. The dissimilarities are calculated by $\\delta_{ij}^2 = (x_i - x_j)^T (x_i - x_j)$. Given these dissimilarities, we construct the matrix $A$ such that $a_{ij} = -\\frac{1}{2} \\delta_{ij}^2$, and then set $B = HAH$, where $H$ is the centering matrix $H = I_n - \\frac{1}{n} \\mathbf{1}\\mathbf{1}^T$. With $\\delta_{ij}^2 = (x_i - x_j)^T (x_i - x_j)$, the construction of $B$ leads to $b_{ij} = (x_i - \\bar{x})^T (x_j - \\bar{x})$, where $\\bar{x} = \\frac{1}{n} \\sum_{i=1}^n x_i$. In matrix form we have $B = (HX)(HX)^T$, and $B$ is real, symmetric and positive semi-definite. Let the eigendecomposition of $B$ be $B = V \\Lambda V^T$, where $\\Lambda$ is a diagonal matrix and $V$ is a matrix whose columns are the eigenvectors of $B$. If $p < n$, there will be $n - p$ zero eigenvalues (see footnote 1).\n
If the eigenvalues are ordered $\\lambda_1 \\ge \\lambda_2 \\ge \\ldots \\ge \\lambda_n \\ge 0$, then $B = V_p \\Lambda_p V_p^T$, where $\\Lambda_p = \\mathrm{diag}(\\lambda_1, \\ldots, \\lambda_p)$ and $V_p$ is the $n \\times p$ matrix whose columns correspond to the first $p$ eigenvectors of $B$, with the usual normalization so that the eigenvectors have unit length. The matrix $\\hat{X}$ of the reconstructed coordinates of the points can be obtained as $\\hat{X} = V_p \\Lambda_p^{1/2}$, with $B = \\hat{X} \\hat{X}^T$. Clearly from the information in the dissimilarities one can only recover the original coordinates up to a translation, a rotation and reflections of the axes; the solution obtained for $\\hat{X}$ is such that the origin is at the mean of the $n$ points, and that the axes chosen by the procedure are the principal axes of the $\\hat{X}$ configuration.\n\nIt may not be necessary to use all $p$ dimensions to obtain a reasonable approximation; a configuration $\\hat{X}$ in $k$ dimensions can be obtained by using the largest $k$ eigenvalues so that $\\hat{X} = V_k \\Lambda_k^{1/2}$. These are known as the principal coordinates of $X$ in $k$ dimensions. The fraction of the variance explained by the first $k$ eigenvalues is $\\sum_{i=1}^k \\lambda_i / \\sum_{i=1}^p \\lambda_i$.\n\nClassical scaling as explained above works on Euclidean distances as the dissimilarities. However, one can run the same algorithm with a non-Euclidean dissimilarity matrix, although in this case there is no guarantee that the eigenvalues will be non-negative.\n\nClassical scaling derives from the work of Schoenberg and Young and Householder in the 1930s. Expositions of the theory can be found in [5] and [2].\n\n2.1.1  Optimality properties of classical scaling\n\nMardia et al. [5] (section 14.4) give the following optimality property of the classical scaling solution.\n\nFootnote 1: In fact if the points are not in \"general position\" the number of zero eigenvalues will be greater than $n - p$.\n
Below we assume that the points are in general position, although the arguments can easily be carried through with minor modifications if this is not the case.\n\nTheorem 1  Let $X$ denote a configuration of points in $\\mathbb{R}^p$, with interpoint distances $\\delta_{ij}^2 = (x_i - x_j)^T (x_i - x_j)$. Let $L$ be a $p \\times p$ rotation matrix and set $L = (L_1, L_2)$, where $L_1$ is $p \\times k$ for $k < p$. Let $\\hat{X} = X L_1$, the projection of $X$ onto a $k$-dimensional subspace of $\\mathbb{R}^p$, and let $d_{ij}^2 = (\\hat{x}_i - \\hat{x}_j)^T (\\hat{x}_i - \\hat{x}_j)$. Amongst all projections $\\hat{X} = X L_1$, the quantity $\\phi = \\sum_{i,j} (\\delta_{ij}^2 - d_{ij}^2)$ is minimized when $X$ is projected onto its principal coordinates in $k$ dimensions. For all $i, j$ we have $d_{ij} \\le \\delta_{ij}$. The value of $\\phi$ for the principal coordinate projection is $\\phi = 2n(\\lambda_{k+1} + \\ldots + \\lambda_p)$.\n\n2.2  Relationships between classical scaling and PCA\n\nThere is a well-known relationship between PCA and classical scaling; see e.g. Cox and Cox (1994) section 2.2.7.\n\nPrincipal components analysis (PCA) is concerned with the eigendecomposition of the sample covariance matrix $S = \\frac{1}{n} X^T H X$. It is easy to show that the eigenvalues of $nS$ are the $p$ non-zero eigenvalues of $B$. To see this note that $H^2 = H$ and thus that $nS = (HX)^T (HX)$. Let $v_i$ be a unit-length eigenvector of $B$ so that $B v_i = \\lambda_i v_i$. Premultiplying by $(HX)^T$ yields\n\n$(HX)^T (HX) (HX)^T v_i = \\lambda_i (HX)^T v_i$    (1)\n\nso we see that $\\lambda_i$ is an eigenvalue of $nS$. $y_i = (HX)^T v_i$ is the corresponding eigenvector; note that $y_i^T y_i = \\lambda_i$. Centering $X$ and projecting onto the unit vector $\\hat{y}_i = \\lambda_i^{-1/2} y_i$ we obtain\n\n$HX \\hat{y}_i = \\lambda_i^{-1/2} HX (HX)^T v_i = \\lambda_i^{1/2} v_i$.    (2)\n\nThus we see that the projection of $X$ onto the eigenvectors of $nS$ returns the classical scaling solution.
\n\n2.3  Metric MDS\n\nThe aim of classical scaling is to find a configuration of points $\\hat{X}$ so that the interpoint distances $d_{ij}$ well approximate the dissimilarities $\\delta_{ij}$. In metric MDS this criterion is relaxed, so that instead we require\n\n$d_{ij} \\approx f(\\delta_{ij})$    (3)\n\nwhere $f$ is a specified (analytic) function. For this definition see, e.g. Kruskal and Wish [4] (page 22), where polynomial transformations are suggested.\n\nA straightforward way to carry out metric MDS is to define an error function (or stress)\n\n(4)\n\nwhere the $\\{w_{ij}\\}$ are appropriately chosen weights. One can then obtain derivatives of $S$ with respect to the coordinates of the points that define the $d_{ij}$'s and use gradient-based (or more sophisticated) methods to minimize the stress. This method is known as least-squares scaling. An early reference to this kind of method is Sammon (1969) [6], where $w_{ij} = 1/\\delta_{ij}$ and $f$ is the identity function.\n\nNote that if $f(\\delta_{ij})$ has some adjustable parameters $\\theta$ and is linear with respect to $\\theta$ (see footnote 2), then the function $f$ can also be adapted and the optimal value for those parameters given the current $d_{ij}$'s can be obtained by (weighted) least-squares regression.\n\nFootnote 2: $f$ can still be a non-linear function of its argument.\n\nCritchley (1978) [3] (also mentioned in section 2.4.2 of Cox and Cox) carried out metric MDS by running the classical scaling algorithm on the transformed dissimilarities. Critchley suggests the power transformation $f(\\delta_{ij}) = \\delta_{ij}^{\\mu}$ (for $\\mu > 0$). If the dissimilarities are derived from Euclidean distances, we note that the kernel $k(x, y) = -\\|x - y\\|^{\\beta}$ is conditionally positive definite (CPD) if $\\beta \\le 2$ [1]. When the kernel is CPD, the centered matrix will be positive semi-definite.\n
Critchley's use of the classical scaling algorithm is similar to the algorithm discussed below, but crucially the kernel PCA method ensures that the matrix $B$ derived from the transformed dissimilarities is non-negative definite, while this is not guaranteed by Critchley's transformation for arbitrary $\\mu$.\n\nA further member of the MDS family is nonmetric MDS (NMDS), also known as ordinal scaling. Here it is only the relative rank ordering between the $d_{ij}$'s and the $\\delta_{ij}$'s that is taken to be important; this constraint can be imposed by demanding that the function $f$ in equation 3 is monotonic. This constraint makes sense for some kinds of dissimilarity data (e.g. from psychology) where only the rank orderings have real meaning.\n\n3  Kernel PCA\n\nIn recent years there has been an explosion of work on kernel methods. For supervised learning these include support vector machines [8], Gaussian process prediction (see, e.g. [10]) and spline methods [9]. The basic idea of these methods is to use the \"kernel trick\". A point $x$ in the original space is re-represented as a point $\\phi(x)$ in an $N_F$-dimensional feature space $F$ (see footnote 3), where $\\phi(x) = (\\phi_1(x), \\phi_2(x), \\ldots, \\phi_{N_F}(x))$. We can think of each function $\\phi_j(\\cdot)$ as a non-linear mapping. The key to the kernel trick is to realize that for many algorithms, the only quantities required are of the form $\\phi(x_i) \\cdot \\phi(x_j)$ (see footnote 4) and thus if these can be easily computed by a non-linear function $k(x_i, x_j) = \\phi(x_i) \\cdot \\phi(x_j)$ we can save much time and effort.\n\nSch\u00f6lkopf, Smola and M\u00fcller [7] used this trick to define kernel PCA. One could compute the covariance matrix in the feature space and then calculate its eigenvectors/eigenvalues.\n
However, using the relationship between $B$ and the sample covariance matrix $S$ described above, we can instead consider the $n \\times n$ matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ for $i, j = 1, \\ldots, n$. If $N_F > n$ using $K$ will be more efficient than working with the covariance matrix in feature space and anyway the latter would be singular.\n\nThe data should be centered in the feature space so that $\\sum_{i=1}^n \\phi(x_i) = 0$. This is achieved by carrying out the eigendecomposition of $\\tilde{K} = HKH$ which gives the coordinates of the approximating points as described in section 2.2. Thus we see that the visualization of data by projecting it onto the first $k$ eigenvectors is exactly classical scaling in feature space.\n\nFootnote 3: For some kernels $N_F = \\infty$.\n\nFootnote 4: We denote the inner product of two vectors as either $a \\cdot b$ or $a^T b$.\n\n4  A relationship between kernel PCA and metric MDS\n\nWe consider two cases. In section 4.1 we deal with the case that the kernel is isotropic and obtain a close relationship between kernel PCA and metric MDS. If the kernel is non-stationary a rather less close relationship is derived in section 4.2.\n\n4.1  Isotropic kernels\n\nA kernel function is stationary if $k(x_i, x_j)$ depends only on the vector $\\tau = x_i - x_j$. A stationary covariance function is isotropic if $k(x_i, x_j)$ depends only on the distance $\\delta_{ij}$ with $\\delta_{ij}^2 = \\tau \\cdot \\tau$, so that we write $k(x_i, x_j) = r(\\delta_{ij})$. Assume that the kernel is scaled so that $r(0) = 1$. An example of an isotropic kernel is the squared exponential or RBF (radial basis function) kernel $k(x_i, x_j) = \\exp\\{-\\theta (x_i - x_j)^T (x_i - x_j)\\}$, for some parameter $\\theta > 0$.\n\nConsider the Euclidean distance in feature space $\\tilde{\\delta}_{ij}^2 = (\\phi(x_i) - \\phi(x_j))^T (\\phi(x_i) - \\phi(x_j))$. With an isotropic kernel this can be re-expressed as $\\tilde{\\delta}_{ij}^2 = 2(1 - r(\\delta_{ij}))$.
\nThus the matrix $A$ has elements $a_{ij} = r(\\delta_{ij}) - 1$, which can be written as $A = K - \\mathbf{1}\\mathbf{1}^T$. It can be easily verified that the centering matrix $H$ annihilates $\\mathbf{1}\\mathbf{1}^T$, so that $HAH = HKH$.\n\nWe see that the configuration of points derived from performing classical scaling on $K$ actually aims to approximate the feature-space distances computed as $\\tilde{\\delta}_{ij} = \\sqrt{2(1 - r(\\delta_{ij}))}$. As the $\\tilde{\\delta}_{ij}$'s are a non-linear function of the $\\delta_{ij}$'s this procedure (kernel MDS) is an example of metric MDS.\n\nRemark 1  Kernel functions are usually chosen to be conditionally positive definite, so that the eigenvalues of the matrix $\\tilde{K} = HKH$ will be non-negative. Choosing arbitrary functions to transform the dissimilarities will not give this guarantee.\n\nRemark 2  In nonmetric MDS we require that $d_{ij} \\approx f(\\delta_{ij})$ for some monotonic function $f$. If the kernel function $r$ is monotonically decreasing then clearly $1 - r$ is monotonically increasing. However, there are valid isotropic kernel (covariance) functions which are non-monotonic (e.g. the exponentially damped cosine $r(\\delta) = e^{-\\theta \\delta} \\cos(\\omega \\delta)$; see [11] for details) and thus we see that $f$ need not be monotonic in kernel MDS.\n\nRemark 3  One advantage of PCA is that it defines a mapping from the original space to the principal coordinates, and hence that if a new point $x$ arrives, its projection onto the principal coordinates defined by the original $n$ data points can be computed (see footnote 5). The same property holds in kernel PCA, so that the projection of $\\phi(x)$ onto the $r$th principal direction in feature space can be computed using the kernel trick as $\\sum_{i=1}^n \\alpha_i^r k(x, x_i)$, where $\\alpha^r$ is the $r$th eigenvector of $\\tilde{K}$ (see equation 4.1 in [7]).\n
This projection property does not hold for algorithms that simply minimize the stress objective function; for example the Sammon \"mapping\" algorithm [6] does not in fact define a mapping.\n\nFootnote 5: Note that this will be, in general, different to the solution found by doing PCA on the full data set of $n + 1$ points.\n\n4.2  Non-stationary kernels\n\nSometimes non-stationary kernels (e.g. $k(x_i, x_j) = (1 + x_i \\cdot x_j)^m$ for integer $m$) are used. For non-stationary kernels we proceed as before and construct $\\tilde{\\delta}_{ij}^2 = (\\phi(x_i) - \\phi(x_j))^T (\\phi(x_i) - \\phi(x_j))$. We can again show that the kernel MDS procedure operates on the matrix $HKH$. However, the distance $\\tilde{\\delta}_{ij}$ in feature space is not a function of $\\delta_{ij}$ and so the relationship of equation 3 does not hold. The situation can be saved somewhat if we follow Mardia et al. (section 14.2.3) and relate similarities to dissimilarities through $\\tilde{\\delta}_{ij}^2 = c_{ii} + c_{jj} - 2c_{ij}$, where $c_{ij}$ denotes the similarity between items $i$ and $j$ in feature space. Then we see that the similarity in feature space is given by $c_{ij} = \\phi(x_i) \\cdot \\phi(x_j) = k(x_i, x_j)$. For kernels (such as polynomial kernels) that are functions of $x_i \\cdot x_j$ (the similarity in input space), we see then that the similarity in feature space is a non-linear function of the similarity measured in input space.\n\nFigure 1: The plot shows $\\gamma$ as a function of $k$ for various values of $\\beta = \\theta/256$ for the USPS test set.\n\n5  Choice of kernel\n\nHaving performed kernel MDS one can plot the scatter diagram (or Shepard diagram) of the dissimilarities against the fitted distances.\n
We know that for each pair the fitted distance $d_{ij}$ is no larger than the feature-space distance $\\tilde{\\delta}_{ij}$, i.e. $d_{ij} \\le \\tilde{\\delta}_{ij}$, because of the projection property in feature space. The sum of the residuals is given by $2n \\sum_{i=k+1}^n \\lambda_i$ where the $\\{\\lambda_i\\}$ are the eigenvalues of $\\tilde{K} = HKH$. (See Theorem 1 above and recall that at most $n$ of the eigenvalues of the covariance matrix in feature space will be non-zero.) Hence the fraction of the sum-squared distance explained by the first $k$ dimensions is $\\gamma = \\sum_{i=1}^k \\lambda_i / \\sum_{i=1}^n \\lambda_i$.\n\nOne idea for choosing the kernel would be to fix the dimensionality $k$ and choose $r(\\cdot)$ so that $\\gamma$ is maximized. Consider the effect of varying $\\theta$ in the RBF kernel\n\n$k(x_i, x_j) = \\exp\\{-\\theta (x_i - x_j)^T (x_i - x_j)\\}$.    (5)\n\nAs $\\theta \\to \\infty$ we have $\\tilde{\\delta}_{ij}^2 = 2(1 - \\delta(i,j))$ (where $\\delta(i,j)$ is the Kronecker delta), which are the distances corresponding to a regular simplex. Thus $K \\to I_n$, $HKH = H$ and $\\gamma = k/(n - 1)$. Letting $\\theta \\to 0$ and using $e^{-\\theta z} \\approx 1 - \\theta z$ for small $\\theta$, we can show that $K_{ij} = 1 - \\theta \\delta_{ij}^2$ as $\\theta \\to 0$, and thus that the classical scaling solution is obtained in this limit.\n\nExperiments have been run on the US Postal Service database of handwritten digits, as used in [7]. The test set of 2007 images was used. The size of each image is $16 \\times 16$ pixels, with the intensity of the pixels scaled so that the average variance over all 256 dimensions is 0.5. In Figure 1 $\\gamma$ is plotted against $k$ for various values of $\\beta = \\theta/256$. By choosing an index $k$ one can observe from Figure 1 what fraction of the variance is explained by the first $k$ eigenvalues. The trend is that as $\\theta$ decreases more and more variance is explained by fewer components, which fits in with the idea above that the $\\theta \\to \\infty$ limit gives rise to the regular simplex case. Thus there does not seem to be a non-trivial value of $\\theta$ which minimizes the residuals.
\n\n6  Discussion\n\nThe results above show that kernel PCA using an isotropic kernel function can be interpreted as performing a kind of metric MDS. The main difference between the kernel MDS algorithm and other metric MDS algorithms is that kernel MDS uses the classical scaling solution in feature space. The advantage of the classical scaling solution is that it is computed from an eigenproblem, and avoids the iterative optimization of the stress objective function that is used for most other MDS solutions. The classical scaling solution is unique up to the unavoidable translation, rotation and reflection symmetries (assuming that there are no repeated eigenvalues). Critchley's work (1978) is somewhat similar to kernel MDS, but it lacks the notion of a projection into feature space and does not always ensure that the matrix $B$ is non-negative definite.\n\nWe have also looked at the question of adapting the kernel so as to minimize the sum of the residuals. However, for the case investigated this leads to a trivial solution.\n\nAcknowledgements\n\nI thank David Willshaw, Matthias Seeger and Amos Storkey for helpful conversations, and the anonymous referees whose comments have helped improve the paper.\n\nReferences\n\n[1] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.\n\n[2] T. F. Cox and M. A. A. Cox. Multidimensional Scaling. Chapman and Hall, London, 1994.\n\n[3] F. Critchley. Multidimensional scaling: a short critique and a new method. In L. C. A. Corsten and J. Hermans, editors, COMPSTAT 1978. Physica-Verlag, Vienna, 1978.\n\n[4] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, Beverly Hills, 1978.\n\n[5] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis.\n
Academic Press, 1979.\n\n[6] J. W. Sammon. A nonlinear mapping for data structure analysis. IEEE Trans. on Computers, 18:401-409, 1969.\n\n[7] B. Sch\u00f6lkopf, A. Smola, and K.-R. M\u00fcller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.\n\n[8] V. N. Vapnik. The nature of statistical learning theory. Springer Verlag, New York, 1995.\n\n[9] G. Wahba. Spline models for observational data. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1990. CBMS-NSF Regional Conference series in applied mathematics.\n\n[10] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.\n\n[11] A. M. Yaglom. Correlation Theory of Stationary and Related Random Functions Volume I: Basic Results. Springer Verlag, 1987.\n", "award": [], "sourceid": 1873, "authors": [{"given_name": "Christopher", "family_name": "Williams", "institution": null}]}