{"title": "Automatic Choice of Dimensionality for PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 598, "page_last": 604, "abstract": null, "full_text": "Automatic choice of dimensionality for PCA\n\nThomas P. Minka\n\nMIT Media Lab\n\n20 Ames St, Cambridge, MA 02139\n\ntpminka@media.mit.edu\n\nAbstract\n\nA central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Stiefel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.\n\n1 Introduction\n\nRecovering the intrinsic dimensionality of a data set is a classic and fundamental problem in data analysis. A popular method for doing this is PCA or localized PCA. Modeling the data manifold with localized PCA dates back to [4]. Since then, the problem of spacing and sizing the local regions has been solved via the EM algorithm and split/merge techniques [2, 6, 14, 5].\n\nHowever, the task of dimensionality selection has not been solved in a satisfactory way. On the one hand we have crude methods based on eigenvalue thresholding [4] which are very fast; on the other we have iterative methods [1] which require excessive computing time. This paper resolves the situation by deriving a method which is both accurate and fast. 
It is an application of Bayesian model selection to the probabilistic PCA model developed by [12, 15].\n\nThe new method operates exclusively on the eigenvalues of the data covariance matrix. In the local PCA context, these would be the eigenvalues of the local responsibility-weighted covariance matrix, as defined by [14]. The method can be used to fit different PCA models to different classes, for use in Bayesian classification [11].\n\n2 Probabilistic PCA\n\nThis section reviews the results of [15]. The PCA model is that a d-dimensional vector x was generated from a smaller k-dimensional vector w by a linear transformation (H, m) plus a noise vector e: x = Hw + m + e. Both the noise and the principal component vector w are assumed spherical Gaussian:\n\n$$ p(w) \\sim N(0, I), \\quad p(e) \\sim N(0, vI) \\tag{1} $$\n\nThe observation x is therefore Gaussian itself:\n\n$$ p(x | H, m, v) \\sim N(m, HH^T + vI) \\tag{2} $$\n\nThe goal of PCA is to estimate the basis vectors H and the noise variance v from a data set $D = \\{x_1, \\ldots, x_N\\}$. The probability of the data set is\n\n$$ p(D | H, m, v) = (2\\pi)^{-Nd/2} |HH^T + vI|^{-N/2} \\exp(-\\frac{1}{2} \\mathrm{tr}((HH^T + vI)^{-1} S)) \\tag{3} $$\n\n$$ S = \\sum_i (x_i - m)(x_i - m)^T \\tag{4} $$\n\nAs shown by [15], the maximum-likelihood estimates are:\n\n$$ \\hat{m} = \\frac{1}{N} \\sum_i x_i, \\quad \\hat{H} = U(\\Lambda - vI)^{1/2} R, \\quad \\hat{v} = \\frac{\\sum_{j=k+1}^{d} \\lambda_j}{d - k} \\tag{5} $$\n\nwhere orthogonal matrix U contains the top k eigenvectors of S/N, diagonal matrix $\\Lambda$ contains the corresponding eigenvalues, and R is an arbitrary orthogonal matrix.\n\n3 Bayesian model selection\n\nBayesian model selection scores models according to the probability they assign the observed data [9, 8]. It is completely analogous to Bayesian classification. It automatically encodes a preference for simpler, more constrained models, as illustrated in figure 1. 
Simple models only fit a small fraction of data sets, but they assign correspondingly higher probability to those data sets. Flexible models spread themselves out more thinly.\n\nFigure 1: Why Bayesian model selection prefers simpler models. [Plot of p(D | M) against D for a constrained model and a flexible model, marking the regions where each model wins.]\n\nThe probability of the data given the model is computed by integrating over the unknown parameter values in that model:\n\n$$ p(D | M) = \\int_\\theta p(D | \\theta) p(\\theta | M) \\, d\\theta \\tag{6} $$\n\nThis quantity is called the evidence for model M. A useful property of Bayesian model selection is that it is guaranteed to select the true model, if it is among the candidates, as the size of the dataset grows to infinity.\n\n3.1 The evidence for probabilistic PCA\n\nFor the PCA model, we want to select the subspace dimensionality k. To do this, we compute the probability of the data for each possible dimensionality and pick the maximum. For a given dimensionality, this requires integrating over all PCA parameters (m, H, v). First we need to define a prior density for these parameters. Assuming there is no information other than the data D, the prior should be as noninformative as possible. A noninformative prior for m is uniform, and with such a prior we can integrate out m analytically, leaving\n\n$$ p(D | H, v) = N^{-d/2} (2\\pi)^{-(N-1)d/2} |HH^T + vI|^{-(N-1)/2} \\exp(-\\frac{1}{2} \\mathrm{tr}((HH^T + vI)^{-1} S)) \\tag{7} $$\n\n$$ \\text{where } S = \\sum_i (x_i - \\hat{m})(x_i - \\hat{m})^T \\tag{8} $$\n\nUnlike m, H must have a proper prior since it varies in dimension for different models. Let H be decomposed just as in (5):\n\n$$ H = U(L - vI_k)^{1/2} R, \\quad U^T U = I_k, \\quad R^T R = I_k \\tag{9} $$\n\nwhere L is diagonal with diagonal elements $l_i$. The orthogonal matrix U is the basis, L is the scaling (corrected for noise), and R is a rotation within the subspace (which will turn out to be irrelevant). 
A conjugate prior for (U, L, R, v), parameterized by $\\alpha$, is\n\n$$ p(U, L, R, v) \\propto |HH^T + vI|^{-(\\alpha+2)/2} \\exp(-\\frac{\\alpha}{2} \\mathrm{tr}((HH^T + vI)^{-1})) \\tag{10} $$\n\nThis distribution happens to factor into p(U)p(L)p(R)p(v), which means the variables are a priori independent:\n\n$$ p(L) \\propto |L|^{-(\\alpha+2)/2} \\exp(-\\frac{\\alpha}{2} \\mathrm{tr}(L^{-1})) \\tag{11} $$\n\n$$ p(v) \\propto v^{-(\\alpha+2)(d-k)/2} \\exp(-\\frac{\\alpha(d-k)}{2v}) \\tag{12} $$\n\n$$ p(U)p(R) = \\text{(constant, defined in (20))} \\tag{13} $$\n\nThe hyperparameter $\\alpha$ controls the sharpness of the prior. For a noninformative prior, $\\alpha$ should be small, making the prior diffuse. Besides providing a convenient prior, the decomposition (9) is important for removing redundant degrees of freedom (R) and for separating H into independent components, as described in the next section.\n\nCombining the likelihood with the prior gives\n\n$$ p(D | k) = c_k \\int |HH^T + vI|^{-n/2} \\exp(-\\frac{1}{2} \\mathrm{tr}((HH^T + vI)^{-1}(S + \\alpha I))) \\, dU \\, dL \\, dv \\tag{14} $$\n\n$$ n = N + 1 + \\alpha \\tag{15} $$\n\nThe constant $c_k$ includes $N^{-d/2}$ and the normalizing terms for p(U), p(L), and p(v) (given in [10]); only p(U) will matter in the end. In this formula R has already been integrated out; the likelihood does not involve R so we just get a multiplicative factor of $\\int_R p(R) \\, dR = 1$.\n\n3.2 Laplace approximation\n\nLaplace's method is a powerful method for approximating integrals in Bayesian statistics [8]:\n\n$$ \\int f(\\theta) \\, d\\theta \\approx f(\\hat{\\theta}) (2\\pi)^{\\mathrm{rows}(A)/2} |A|^{-1/2} \\tag{16} $$\n\n$$ A = -\\frac{d^2 \\log f(\\theta)}{d\\theta \\, d\\theta^T} \\Big|_{\\theta=\\hat{\\theta}} \\tag{17} $$\n\nThe key to getting a good approximation is choosing a good parameterization for $\\theta = (U, L, v)$. Since $l_i$ and v are positive scale parameters, it is best to use $l_i' = \\log(l_i)$ and $v' = \\log(v)$. This results in\n\n$$ \\hat{l}_i = \\frac{N\\lambda_i + \\alpha}{N - 1 + \\alpha}, \\quad \\hat{v} = \\frac{N \\sum_{j=k+1}^{d} \\lambda_j}{n(d-k) - 2} \\tag{18} $$\n\n$$ \\frac{d^2 \\log f(\\theta)}{(dl_i')^2} \\Big|_{\\theta=\\hat{\\theta}} = -\\frac{N - 1 + \\alpha}{2}, \\quad \\frac{d^2 \\log f(\\theta)}{(dv')^2} \\Big|_{\\theta=\\hat{\\theta}} = -\\frac{n(d-k) - 2}{2} \\tag{19} $$\n\nThe matrix U is an orthogonal k-frame and therefore lives on the Stiefel manifold [7], which is defined by condition (9). The dimension of the manifold is m = dk - k(k + 1)/2, since we are imposing k(k + 1)/2 constraints on a $d \\times k$ matrix. The prior density for U is the reciprocal of the area of the manifold [7]:\n\n$$ p(U) = 2^{-k} \\prod_{i=1}^{k} \\Gamma((d - i + 1)/2) \\, \\pi^{-(d-i+1)/2} \\tag{20} $$\n\nA useful parameterization of this manifold is given by the Euler vector representation:\n\n$$ U = U_d \\exp(Z) \\begin{bmatrix} I_k \\\\ 0 \\end{bmatrix} \\tag{21} $$\n\nwhere $U_d$ is a fixed orthogonal matrix and Z is a skew-symmetric matrix of parameters, such as\n\n$$ Z = \\begin{bmatrix} 0 & z_{12} & z_{13} \\\\ -z_{12} & 0 & z_{23} \\\\ -z_{13} & -z_{23} & 0 \\end{bmatrix} \\tag{22} $$\n\nThe first k rows of Z determine the first k columns of exp(Z), so the free parameters are $z_{ij}$ with $i < j$ and $i \\le k$; the others are constant. This gives d(d-1)/2 - (d-k)(d-k-1)/2 = m parameters, as desired. For example, in the case (d = 3, k = 1) the free parameters are $z_{12}$ and $z_{13}$, which define a coordinate system for the sphere.\n\nAs a function of U, the integrand is simply\n\n$$ p(U | D, L, v) \\propto \\exp(-\\frac{1}{2} \\mathrm{tr}((L^{-1} - v^{-1} I) U^T S U)) \\tag{23} $$\n\nThe density is maximized when U contains the top k eigenvectors of S. However, the density is unchanged if we negate any column of U. This means that there are actually $2^k$ different maxima, and we need to apply Laplace's method to each. Fortunately, these maxima are identical so we can simply multiply (16) by $2^k$ to get the integral over the whole manifold. If we set $U_d$ to the eigenvectors of S:\n\n$$ U_d^T S U_d = N\\Lambda \\tag{24} $$\n\nthen we just need to apply Laplace's method at Z = 0. 
As shown in [10], with $U_d^T S U_d = N\\Lambda$ as in (24), if we define the estimated eigenvalue matrix\n\n$$ \\hat{\\Lambda} = \\begin{bmatrix} \\hat{L} & 0 \\\\ 0 & \\hat{v} I_{d-k} \\end{bmatrix} \\tag{25} $$\n\nthen the second differential at Z = 0 simplifies to\n\n$$ d^2 \\log f(\\theta) \\big|_{Z=0} = -\\sum_{i=1}^{k} \\sum_{j=i+1}^{d} (\\hat{\\lambda}_j^{-1} - \\hat{\\lambda}_i^{-1})(\\lambda_i - \\lambda_j) N \\, dz_{ij}^2 \\tag{26} $$\n\nThere are no cross derivatives; the Hessian matrix $A_Z$ is diagonal. So its determinant is the product of these second derivatives:\n\n$$ |A_Z| = \\prod_{i=1}^{k} \\prod_{j=i+1}^{d} (\\hat{\\lambda}_j^{-1} - \\hat{\\lambda}_i^{-1})(\\lambda_i - \\lambda_j) N \\tag{27} $$\n\nLaplace's method requires this to be nonsingular, so we must have k < N. The cross-derivatives between the parameters are all zero:\n\n$$ \\frac{d^2 \\log f(\\theta)}{dl_i \\, dZ} \\Big|_{\\theta=\\hat{\\theta}} = \\frac{d^2 \\log f(\\theta)}{dv \\, dZ} \\Big|_{\\theta=\\hat{\\theta}} = \\frac{d^2 \\log f(\\theta)}{dl_i \\, dv} \\Big|_{\\theta=\\hat{\\theta}} = 0 \\tag{28} $$\n\nso A is block diagonal and $|A| = |A_Z| |A_L| |A_v|$. We know $A_L$ from (19), $A_v$ from (19), and $A_Z$ from (27). We now have all of the terms needed in (16), and so the evidence approximation is\n\n$$ p(D | k) \\approx 2^k c_k |\\hat{L}|^{-n/2} \\hat{v}^{-n(d-k)/2} e^{-nd/2} (2\\pi)^{(m+k+1)/2} |A_Z|^{-1/2} |A_L|^{-1/2} |A_v|^{-1/2} \\tag{29} $$\n\nFor model selection, the only terms that matter are those that strongly depend on k, and since $\\alpha$ is small and N reasonably large we can simplify this to\n\n$$ p(D | k) \\approx p(U) \\Big( \\prod_{j=1}^{k} \\lambda_j \\Big)^{-N/2} \\hat{v}^{-N(d-k)/2} (2\\pi)^{(m+k)/2} |A_Z|^{-1/2} N^{-k/2} \\tag{30} $$\n\n$$ \\hat{v} = \\frac{\\sum_{j=k+1}^{d} \\lambda_j}{d - k} \\tag{31} $$\n\nwhich is the recommended formula. Given the eigenvalues, the cost of computing p(D | k) is O(min(d, N)k), which is less than one loop over the data matrix.\n\nA simplification of Laplace's method is the BIC approximation [8]. This approximation drops all terms which do not grow with N, which in this case leaves only\n\n$$ p(D | k) \\approx \\Big( \\prod_{j=1}^{k} \\lambda_j \\Big)^{-N/2} \\hat{v}^{-N(d-k)/2} N^{-(m+k)/2} \\tag{32} $$\n\nBIC is compared to Laplace in section 4. 
\n\n4 Results\n\nTo test the performance of various algorithms for model selection, we sample data from a known model and see how often the correct dimensionality is recovered. The seven estimators implemented and tested in this study are Laplace's method (30), BIC (32), the two methods of [13] (called RR-N and RR-U), the algorithm in [3] (ER), the ARD algorithm of [1], and 5-fold cross-validation (CV). For cross-validation, the log-probability assigned to the held-out data is the scoring function. ER is the most similar to this paper, since it performs Bayesian model selection on the same model, but uses a different kind of approximation combined with explicit numerical integration. RR-N and RR-U are maximum likelihood techniques on models slightly different from probabilistic PCA; the details are in [10]. ARD is an iterative estimation algorithm for H which sets columns to zero unless they are supported by the data. The number of nonzero columns at convergence is the estimate of dimensionality.\n\nMost of these estimators work exclusively from the eigenvalues of the sample covariance matrix. The exceptions are RR-U, cross-validation, and ARD; the latter two require diagonalizing a series of different matrices constructed from the data. In our implementation, the algorithms are ordered from fastest to slowest as RR-N, BIC, Laplace, cross-validation, RR-U, ARD, and ER (ER is slowest because of the numerical integrations required).\n\nThe first experiment tests the data-rich case where N >> d. The data is generated from a 10-dimensional Gaussian distribution with 5 \"signal\" dimensions and 5 noise dimensions. 
The eigenvalues of the true covariance matrix are:\n\nSignal: 10 8 6 4 2\nNoise: 1 (x5)\nN = 100\n\nThe number of times the correct dimensionality (k = 5) was chosen over 60 replications is shown at right. The differences between ER, Laplace, and CV are not statistically significant. Results below the dashed line are worse than Laplace with a significance level of 95%.\n\n[Figure: results for ER, Laplace, CV, BIC, ARD, RR-N, RR-U]\n\nThe second experiment tests the case of sparse data and low noise:\n\nSignal: 10 8 6 4 2\nNoise: 0.1 (x10)\nN = 10\n\nThe results over 60 replications are shown at right. BIC and ER, which are derived from large N approximations, do poorly. Cross-validation also fails, because it doesn't have enough data to work with.\n\nThe third experiment tests the case of high noise dimensionality:\n\nSignal: 10 8 6 4 2\nNoise: 0.25 (x95)\nN = 60\n\nThe ER algorithm was not run in this case because of its excessive computation time for large d.\n\n[Figure: results for Laplace, CV, ARD, RR-U, BIC, RR-N]\n\nThe final experiment tests the robustness to having a non-Gaussian data distribution within the subspace. We start with four sound fragments of 100 samples each. To make things especially non-Gaussian, the values in the third fragment are squared and the values in the fourth fragment are cubed. All fragments are standardized to zero mean and unit variance. Gaussian noise in 20 dimensions is added to get:\n\nSignal: 4 sounds\nNoise: 0.5 (x20)\nN = 100\n\nThe results over 60 replications of the noise (the signals were constant) are reported at right.\n\n[Figure: results for Laplace, ARD, CV, BIC, RR-N, RR-U, ER]\n\n5 Discussion\n\nBayesian model selection has been shown to provide excellent performance when the assumed model is correct or partially correct. 
The evaluation criterion was the number of times the correct dimensionality was chosen. It would also be useful to evaluate the trained model with respect to its performance on new data within an applied setting. In this case, Bayesian model averaging is more appropriate, and it is conceivable that a method like ARD, which encompasses a soft blend between different dimensionalities, might perform better by this criterion than selecting one dimensionality.\n\nIt is important to remember that these estimators are for density estimation, i.e. accurate representation of the data, and are not necessarily appropriate for other purposes like reducing computation or extracting salient features. For example, on a database of 301 face images the Laplace evidence picked 120 dimensions, which is far more than one would use for feature extraction. (This result also suggests that probabilistic PCA is not a good generative model for face images.)\n\nReferences\n\n[1] C. Bishop. Bayesian PCA. In Neural Information Processing Systems 11, pages 382-388, 1998.\n\n[2] C. Bregler and S. M. Omohundro. Surface learning with applications to lipreading. In NIPS, pages 43-50, 1994.\n\n[3] R. Everson and S. Roberts. Inferring the eigenvalues of covariance matrices from limited, noisy data. IEEE Trans Signal Processing, 48(7):2083-2091, 2000.\nhttp://www.robots.ox.ac.uk/-sjrob/Pubs/spectrum.ps.gz.\n\n[4] K. Fukunaga and D. Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Trans Computers, 20(2):176-183, 1971.\n\n[5] Z. Ghahramani and M. Beal. Variational inference for Bayesian mixtures of factor analysers. In Neural Information Processing Systems 12, 1999.\n\n[6] Z. Ghahramani and G. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1996.\nhttp://www.gatsby.ucl.ac.uk/-zoubin/papers.html.\n\n[7] A. James. Normal multivariate analysis and the orthogonal group. Annals of Mathematical Statistics, 25(1):40-75, 1954.\n\n[8] R. E. Kass and A. E. Raftery. Bayes factors and model uncertainty. Technical Report 254, University of Washington, 1993.\nhttp://www.stat.washington.edu/tech.reports/tr254.ps.\n\n[9] D. J. C. MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6:469-505, 1995.\nhttp://wol.ra.phy.cam.ac.uk/mackay/abstracts/network.html.\n\n[10] T. Minka. Automatic choice of dimensionality for PCA. Technical Report 514, MIT Media Lab Vision and Modeling Group, 1999.\nftp://whitechapel.media.mit.edu/pub/tech-reports/TR-514-ABSTRACT.html.\n\n[11] B. Moghaddam, T. Jebara, and A. Pentland. Bayesian modeling of facial similarity. In Neural Information Processing Systems 11, pages 910-916, 1998.\n\n[12] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Trans Pattern Analysis and Machine Intelligence, 19(7):696-710, 1997.\n\n[13] J. J. Rajan and P. J. W. Rayner. Model order selection for the singular value decomposition and the discrete Karhunen-Loeve transform using a Bayesian approach. IEE Vision, Image and Signal Processing, 144(2):166-123, 1997.\n\n[14] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443-482, 1999.\nhttp://citeseer.nj.nec.com/362314.html.\n\n[15] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. J Royal Statistical Society B, 61(3), 1999.", "award": [], "sourceid": 1853, "authors": [{"given_name": "Thomas", "family_name": "Minka", "institution": null}]}