{"title": "Bayesian PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 382, "page_last": 388, "abstract": null, "full_text": "Bayesian peA \n\nChristopher M. Bishop \n\nMicrosoft Research \n\nSt. George House, 1 Guildhall Street \n\nCambridge CB2 3NH, u.K. \n\ncmbishop@microsoft.com \n\nAbstract \n\nThe technique of principal component analysis (PCA) has recently been \nexpressed as the maximum likelihood solution for a generative latent \nvariable model. In this paper we use this probabilistic reformulation \nas the basis for a Bayesian treatment of PCA. Our key result is that ef(cid:173)\nfective dimensionality of the latent space (equivalent to the number of \nretained principal components) can be determined automatically as part \nof the Bayesian inference procedure. An important application of this \nframework is to mixtures of probabilistic PCA models, in which each \ncomponent can determine its own effective complexity. \n\n1 Introduction \n\nPrincipal component analysis (PCA) is a widely used technique for data analysis. Recently \nTipping and Bishop (1997b) showed that a specific form of generative latent variable model \nhas the property that its maximum likelihood solution extracts the principal sub-space of \nthe observed data set. This probabilistic reformulation of PCA permits many extensions \nincluding a principled formulation of mixtures of principal component analyzers, as dis(cid:173)\ncussed by Tipping and Bishop (l997a). \n\nA central issue in maximum likelihood (as well as conventional) PCA is the choice of \nthe number of principal components to be retained. This is particularly problematic in a \nmixture modelling context since ideally we would like the components to have potentially \ndifferent dimensionalities. However, an exhaustive search over the choice of dimensionality \nfor each of the components in a mixture distribution can quickly become computationally \nintractable. 
In this paper we develop a Bayesian treatment of PCA, and we show how this leads to an automatic selection of the appropriate model dimensionality. Our approach avoids a discrete model search, involving instead the use of continuous hyper-parameters to determine an effective number of principal components. \n\n\fBayesian PCA \n\n383 \n\n2 Maximum Likelihood PCA \n\nConsider a data set D of observed d-dimensional vectors D = {t_n} where n ∈ {1, ..., N}. Conventional principal component analysis is obtained by first computing the sample covariance matrix given by \n\nS = (1/N) Σ_{n=1}^N (t_n − t̄)(t_n − t̄)^T    (1) \n\nwhere t̄ = N^{-1} Σ_n t_n is the sample mean. Next the eigenvectors u_i and eigenvalues λ_i of S are found, where S u_i = λ_i u_i and i = 1, ..., d. The eigenvectors corresponding to the q largest eigenvalues (where q < d) are retained, and a reduced-dimensionality representation of the data set is defined by x_n = U_q^T (t_n − t̄) where U_q = (u_1, ..., u_q). It is easily shown that PCA corresponds to the linear projection of a data set under which the retained variance is a maximum, or equivalently the linear projection for which the sum-of-squares reconstruction cost is minimized. \n\nA significant limitation of conventional PCA is that it does not define a probability distribution. Recently, however, Tipping and Bishop (1997b) showed how PCA can be reformulated as the maximum likelihood solution of a specific latent variable model, as follows. We first introduce a q-dimensional latent variable x whose prior distribution is a zero-mean Gaussian p(x) = N(0, I_q), where I_q is the q-dimensional unit matrix. The observed variable t is then defined as a linear transformation of x with additive Gaussian noise, t = Wx + μ + ε, where W is a d × q matrix, μ is a d-dimensional vector and ε is a zero-mean Gaussian-distributed vector with covariance σ²I_d. Thus p(t|x) = N(Wx + μ, σ²I_d). 
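Conventional PCA as just described can be sketched in a few lines of numpy (a minimal sketch; the function and variable names here are our own, not the paper's):

```python
import numpy as np

def pca(T, q):
    """Conventional PCA: project the rows of T (an N x d data matrix)
    onto the q principal eigenvectors of the sample covariance, eq. (1)."""
    t_bar = T.mean(axis=0)            # sample mean t-bar
    Tc = T - t_bar
    S = Tc.T @ Tc / len(T)            # sample covariance matrix S
    lam, U = np.linalg.eigh(S)        # eigenvalues come back in ascending order
    order = np.argsort(lam)[::-1]     # re-sort to descending
    U_q = U[:, order[:q]]             # the q principal eigenvectors
    X = Tc @ U_q                      # reduced representation x_n = U_q^T (t_n - t-bar)
    return X, U_q, lam[order]

# usage: 200 points in 5 dimensions, retain q = 2 components
rng = np.random.default_rng(0)
X, U_q, lam = pca(rng.normal(size=(200, 5)), q=2)
```

The same projection could equivalently be obtained from an SVD of the centred data matrix; the explicit eigendecomposition of S is used here only because it mirrors the notation of the text.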
The marginal distribution of the observed variable is then given by the convolution of two Gaussians and is itself Gaussian \n\np(t) = ∫ p(t|x) p(x) dx = N(μ, C)    (2) \n\nwhere the covariance matrix C = WW^T + σ²I_d. The model (2) represents a constrained Gaussian distribution governed by the parameters μ, W and σ². \n\nThe log likelihood of the parameters given the observed data set D is then \n\nL(μ, W, σ²) = −(N/2) {d ln(2π) + ln|C| + Tr[C^{-1}S]}    (3) \n\nwhere S is the sample covariance matrix given by (1). The maximum likelihood solution for μ is easily seen to be μ_ML = t̄. It was shown by Tipping and Bishop (1997b) that the stationary points of the log likelihood with respect to W satisfy \n\nW_ML = U_q (Λ_q − σ²I_q)^{1/2}    (4) \n\nwhere the columns of U_q are eigenvectors of S, with corresponding eigenvalues in the diagonal matrix Λ_q. It was also shown that the maximum of the likelihood is achieved when the q largest eigenvalues are chosen, so that the columns of U_q correspond to the principal eigenvectors, with all other choices of eigenvalues corresponding to saddle points. The maximum likelihood solution for σ² is then given by \n\nσ²_ML = (1/(d − q)) Σ_{i=q+1}^d λ_i    (5) \n\nwhich has a natural interpretation as the average variance lost per discarded dimension. The density model (2) thus represents a probabilistic formulation of PCA. It is easily verified that conventional PCA is recovered in the limit σ² → 0. \n\n\f384 \n\nC. M. Bishop \n\nProbabilistic PCA has been successfully applied to problems in data compression, density estimation and data visualization, and has been extended to mixture and hierarchical mixture models. As with conventional PCA, however, the model itself provides no mechanism for determining the value of the latent-space dimensionality q. 
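Because eqs. (4) and (5) are closed-form, maximum likelihood probabilistic PCA needs no iteration. A minimal numpy sketch (our own function names, and taking the rotational freedom in W to be the identity):

```python
import numpy as np

def ppca_ml(T, q):
    """Fit probabilistic PCA by maximum likelihood, using eqs. (4) and (5)."""
    N, d = T.shape
    mu = T.mean(axis=0)                        # mu_ML = sample mean
    Tc = T - mu
    S = Tc.T @ Tc / N                          # sample covariance, eq. (1)
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]             # eigenvalues descending
    sigma2 = lam[q:].mean()                    # eq. (5): mean discarded eigenvalue
    W = U[:, :q] * np.sqrt(lam[:q] - sigma2)   # eq. (4), scaling each column
    return mu, W, sigma2

def log_likelihood(T, mu, W, sigma2):
    """The log likelihood L(mu, W, sigma2) of eq. (3)."""
    N, d = T.shape
    Tc = T - mu
    S = Tc.T @ Tc / N
    C = W @ W.T + sigma2 * np.eye(d)           # model covariance of eq. (2)
    return -0.5 * N * (d * np.log(2 * np.pi)
                       + np.linalg.slogdet(C)[1]
                       + np.trace(np.linalg.solve(C, S)))
```

Since the top-q eigenvalues are each at least as large as the mean of the discarded ones, the square root in eq. (4) is always well defined.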
For q = d − 1 the model is equivalent to a full-covariance Gaussian distribution, while for q < d − 1 it represents a constrained Gaussian in which the variance in the remaining d − q directions is modelled by the single parameter σ². Thus the choice of q corresponds to a problem in model complexity optimization. If data is plentiful, then cross-validation to compare all possible values of q offers a possible approach. However, this can quickly become intractable for mixtures of probabilistic PCA models if we wish to allow each component to have its own q value. \n\n3 Bayesian PCA \n\nThe issue of model complexity can be handled naturally within a Bayesian paradigm. Armed with the probabilistic reformulation of PCA defined in Section 2, a Bayesian treatment of PCA is obtained by first introducing a prior distribution p(μ, W, σ²) over the parameters of the model. The corresponding posterior distribution p(μ, W, σ²|D) is then obtained by multiplying the prior by the likelihood function, whose logarithm is given by (3), and normalizing. Finally, the predictive density is obtained by marginalizing over the parameters, so that \n\np(t|D) = ∫∫∫ p(t|μ, W, σ²) p(μ, W, σ²|D) dμ dW dσ²    (6) \n\nIn order to implement this framework we must address two issues: (i) the choice of prior distribution, and (ii) the formulation of a tractable algorithm. Our focus in this paper is on the specific issue of controlling the effective dimensionality of the latent space (corresponding to the number of retained principal components). Furthermore, we seek to avoid discrete model selection and instead use continuous hyper-parameters to determine automatically an appropriate effective dimensionality for the latent space as part of the process of Bayesian inference. This is achieved by introducing a hierarchical prior p(W|α) over the matrix W, governed by a q-dimensional vector of hyper-parameters α = {α_1, ..., α_q}. 
\nThe dimensionality of the latent space is set to its maximum possible value q = d − 1, and each hyper-parameter controls one of the columns of the matrix W through a conditional Gaussian distribution of the form \n\np(W|α) = Π_{i=1}^{d−1} (α_i / 2π)^{d/2} exp{−(1/2) α_i ||w_i||²}    (7) \n\nwhere {w_i} are the columns of W. This form of prior is motivated by the framework of automatic relevance determination (ARD) introduced in the context of neural networks by Neal and MacKay (see MacKay, 1995). Each α_i controls the inverse variance of the corresponding w_i, so that if a particular α_i has a posterior distribution concentrated at large values, the corresponding w_i will tend to be small, and that direction in latent space will be effectively 'switched off'. The probabilistic structure of the model is displayed graphically in Figure 1. \n\nIn order to make use of this model in practice we must be able to marginalize over the posterior distribution of W. Since this is analytically intractable we have developed three alternative approaches based on (i) type-II maximum likelihood using a local Gaussian approximation to a mode of the posterior distribution (MacKay, 1995), (ii) Markov chain Monte Carlo using Gibbs sampling, and (iii) variational inference using a factorized approximation to the posterior distribution. Here we describe the first of these in more detail. \n\nFigure 1: Representation of Bayesian PCA as a probabilistic graphical model showing the hierarchical prior over W governed by the vector of hyper-parameters α. The box denotes a 'plate' comprising a data set of N independent observations of the visible vector t_n (shown shaded) together with the corresponding hidden variables x_n. \n\nThe location W_MP of the mode can be found by maximizing the log posterior distribution given, from Bayes' theorem, by \n\nln p(W|D) = L − (1/2) Σ_{i=1}^{d−1} α_i ||w_i||² + const.    (8) \n\nwhere L is given by (3). 
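The prior of eq. (7) is straightforward to evaluate. A small numpy sketch (the function name is ours) that also makes the 'switching off' mechanism concrete:

```python
import numpy as np

def log_prior_W(W, alpha):
    """log p(W|alpha) under eq. (7): column w_i ~ N(0, alpha_i^{-1} I_d)."""
    d = W.shape[0]
    sq_norms = (W ** 2).sum(axis=0)   # ||w_i||^2 for each column
    return np.sum(0.5 * d * np.log(alpha / (2 * np.pi)) - 0.5 * alpha * sq_norms)

# a large alpha_i makes any non-zero w_i expensive under the prior,
# which is how an unsupported latent direction is driven towards zero
```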
For the purpose of controlling the effective dimensionality of the latent space, it is sufficient to treat μ, σ² and α as parameters whose values are to be estimated, rather than as random variables. In this case there is no need to introduce priors over these variables, and we can determine μ and σ² by maximum likelihood. To estimate α we use type-II maximum likelihood, corresponding to maximizing the marginal likelihood p(D|α) in which we have integrated over W using the quadratic approximation. It is easily shown (Bishop, 1995) that this leads to a re-estimation formula for the hyper-parameters α_i of the form \n\nα_i = γ_i / ||w_i||²    (9) \n\nwhere γ_i = d − α_i Tr_i(H^{-1}) is the effective number of parameters in w_i, H is the Hessian matrix given by the second derivatives of ln p(W|D) with respect to the elements of W (evaluated at W_MP), and Tr_i(·) denotes the trace of the sub-matrix corresponding to the vector w_i. \n\nFor the results presented in this paper, we make the further simplification of replacing γ_i in (9) by d, corresponding to the assumption that all model parameters are 'well-determined'. This significantly reduces the computational cost since it avoids evaluation and manipulation of the Hessian matrix. An additional consequence is that vectors w_i for which there is insufficient support from the data will be driven to zero, with the corresponding α_i → ∞, so that unused dimensions are switched off completely. We define the effective dimensionality of the model to be the number of vectors w_i whose values remain non-zero. \n\nThe solution for W_MP can be found efficiently using the EM algorithm, in which the E-step involves evaluation of the expected sufficient statistics of the latent-space posterior distribution, given by \n\n⟨x_n⟩ = M^{-1} W^T (t_n − μ)    (10) \n\n⟨x_n x_n^T⟩ = σ² M^{-1} + ⟨x_n⟩⟨x_n⟩^T    (11) \n\nwhere M = W^T W + σ² I_q. 
The M-step involves updating the model parameters using \n\nW̃ = [Σ_n (t_n − μ)⟨x_n⟩^T] [Σ_n ⟨x_n x_n^T⟩ + σ²A]^{-1}    (12) \n\nσ̃² = (1/(Nd)) Σ_{n=1}^N {||t_n − μ||² − 2⟨x_n⟩^T W̃^T (t_n − μ) + Tr[⟨x_n x_n^T⟩ W̃^T W̃]}    (13) \n\nwhere A = diag(α_i). Optimization of W and σ² is alternated with re-estimation of α, using (9) with γ_i = d, until all of the parameters satisfy a suitable convergence criterion. \n\nAs an illustration of the operation of this algorithm, we consider a data set consisting of 300 points in 10 dimensions, in which the data is drawn from a Gaussian distribution having standard deviation 1.0 in 3 directions and standard deviation 0.5 in the remaining 7 directions. The result of fitting both maximum likelihood and Bayesian PCA models is shown in Figure 2. In this case the Bayesian model has an effective dimensionality of q_eff = 3. \n\nFigure 2: Hinton diagrams of the matrix W for a data set in 10 dimensions having m = 3 directions with larger variance than the remaining 7 directions. The left plot shows W from maximum likelihood PCA while the right plot shows W_MP from the Bayesian approach, showing how the model is able to discover the appropriate dimensionality by suppressing the 6 surplus degrees of freedom. \n\nThe effective dimensionality found by Bayesian PCA will be dependent on the number N of points in the data set. For N → ∞ we expect q_eff → d − 1, and in this limit the maximum likelihood framework and the Bayesian approach will give identical results. For finite data sets the effective dimensionality may be reduced, with degrees of freedom for which there is insufficient evidence in the data set being suppressed. 
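The complete scheme (E-step (10)-(11), M-step (12)-(13), and the hyper-parameter update (9) with γ_i = d) can be sketched as follows. This is a minimal implementation with our own names and an ad hoc pruning threshold, not the paper's code, applied to the synthetic experiment just described:

```python
import numpy as np

def bayesian_pca(T, n_iter=200, prune_tol=1e-4):
    """EM for Bayesian PCA with the gamma_i = d simplification (a sketch):
    E-step eqs. (10)-(11), M-step eqs. (12)-(13), alpha update eq. (9)."""
    N, d = T.shape
    q = d - 1                                  # maximal latent dimensionality
    mu = T.mean(axis=0)
    Tc = T - mu
    rng = np.random.default_rng(1)
    W = rng.normal(scale=0.1, size=(d, q))
    sigma2 = 1.0
    alpha = np.ones(q)
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables, eqs. (10)-(11)
        M = W.T @ W + sigma2 * np.eye(q)
        Minv = np.linalg.inv(M)
        Xm = Tc @ W @ Minv                     # rows are <x_n>
        Sxx = N * sigma2 * Minv + Xm.T @ Xm    # sum_n <x_n x_n^T>
        # M-step: eq. (12) with A = diag(alpha_i), then eq. (13)
        W = (Tc.T @ Xm) @ np.linalg.inv(Sxx + sigma2 * np.diag(alpha))
        sigma2 = (np.sum(Tc ** 2) - 2.0 * np.sum(Xm * (Tc @ W))
                  + np.trace(Sxx @ W.T @ W)) / (N * d)
        # type-II ML update of the hyper-parameters, eq. (9) with gamma_i = d
        alpha = d / np.maximum((W ** 2).sum(axis=0), 1e-12)
    # effective dimensionality: number of columns not driven to zero
    q_eff = int(np.sum((W ** 2).sum(axis=0) > prune_tol))
    return W, sigma2, alpha, q_eff

# the experiment above: 300 points in 10-d, sd 1.0 in 3 directions, 0.5 in 7
rng = np.random.default_rng(0)
data = rng.normal(size=(300, 10)) * np.array([1.0] * 3 + [0.5] * 7)
W, sigma2, alpha, q_eff = bayesian_pca(data)
```

Once a column's norm starts to shrink, its alpha grows as the inverse squared norm, so the shrinkage accelerates and unsupported columns collapse rapidly towards zero.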
The variance of the data in the remaining d − q_eff directions is then accounted for by the single degree of freedom defined by σ². This is illustrated by considering data in 10 dimensions generated from a Gaussian distribution with standard deviations given by {1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1}. In Figure 3 we plot q_eff (averaged over 50 independent experiments) versus the number N of points in the data set. \n\nThese results indicate that Bayesian PCA is able to determine automatically a suitable effective dimensionality q_eff for the principal component subspace, and therefore offers a practical alternative to exhaustive comparison of dimensionalities using techniques such as cross-validation. As an illustration of the generalization capability of the resulting model we consider a data set of 20 points in 10 dimensions generated from a Gaussian distribution having standard deviations in 5 directions given by (1.0, 0.8, 0.6, 0.4, 0.2) and standard deviation 0.04 in the remaining 5 directions. We fit maximum likelihood PCA models to this data having q values in the range 1-9 and compare their log likelihoods on both the training data and on an independent test set, with the results (averaged over 10 independent experiments) shown in Figure 4. Also shown are the corresponding results obtained from Bayesian PCA. \n\n\fFigure 3: Plot of the average effective dimensionality of the Bayesian PCA model versus the number N of data points for data in a 10-dimensional space. \n\n
Figure 4: Plot of the log likelihood for the training set (dashed curve) and the test set (solid curve) for maximum likelihood PCA models having q values in the range 1-9, showing that the best generalization is achieved for q = 5, which corresponds to the number of directions of significant variance in the data set. Also shown are the training (circle) and test (cross) results from a Bayesian PCA model, plotted at the average effective q value given by q_eff = 5.2. We see that the Bayesian PCA model automatically discovers the appropriate dimensionality for the principal component subspace, and furthermore that it has a generalization performance which is close to that of the optimal fixed-q model. \n\n4 Mixtures of Bayesian PCA Models \n\nGiven a probabilistic formulation of PCA it is straightforward to construct a mixture distribution comprising a linear superposition of principal component analyzers. In the case of maximum likelihood PCA we have to choose both the number M of components and the latent-space dimensionality q for each component. For moderate numbers of components and data spaces of several dimensions it quickly becomes intractable to explore the exponentially large number of combinations of q values for a given value of M. Here Bayesian PCA offers a significant advantage in allowing the effective dimensionalities of the models to be determined automatically. \n\nAs an illustration we consider a density estimation problem involving hand-written digits from the CEDAR database. The data set comprises 8 × 8 scaled and smoothed gray-scale images of the digits '2', '3' and '4', partitioned randomly into 1500 training, 900 validation and 900 test points. For mixtures of maximum likelihood PCA the model parameters can be determined using the EM algorithm in which the M-step uses (4) and (5), with eigenvectors and eigenvalues obtained from the weighted covariance matrices in which the weighting coefficients are the posterior probabilities for the components determined in the E-step. Since, for maximum likelihood PCA, it is computationally impractical to explore independent q values for each component, we consider mixtures in which every component has the same dimensionality. We therefore train mixtures having M ∈ {2, 4, 6, 8, 10, 12, 14, 16, 18} for all values q ∈ {2, 4, 8, 12, 16, 20, 25, 30, 40, 50}. In order to avoid singularities associated with the more complex models we omit any component from the mixture for which the value of σ² goes to zero during the optimization. The highest log likelihood on the validation set (−295) is obtained for M = 6 and q = 50. \n\nFor mixtures of Bayesian PCA models we need only explore alternative values for M, which are taken from the same set as for the mixtures of maximum likelihood PCA. Again, the best performance on the validation set (−293) is obtained for M = 6. The values of the log likelihood for the test set were −295 (maximum likelihood PCA) and −293 (Bayesian PCA). The mean vectors μ_i for each of the 6 components of the Bayesian PCA mixture model are shown in Figure 5. \n\nFigure 5: The mean vectors for each of the 6 components in the Bayesian PCA mixture model, displayed as an 8 × 8 image, together with the corresponding values of the effective dimensionality: 62, 54, 63, 60, 62, 59. \n\nThe Bayesian treatment of PCA discussed in this paper can be particularly advantageous for small data sets in high dimensions, as it can avoid the singularities associated with maximum likelihood (or conventional) PCA by suppressing unwanted degrees of freedom in the model. 
This is especially helpful in a mixture modelling context, since the effective number of data points associated with specific 'clusters' can be small even when the total number of data points appears to be large. \n\nReferences \n\nBishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press. \n\nMacKay, D. J. C. (1995). Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems 6 (3), 469-505. \n\nTipping, M. E. and C. M. Bishop (1997a). Mixtures of principal component analysers. In Proceedings IEE Fifth International Conference on Artificial Neural Networks, Cambridge, U.K., July, pp. 13-18. \n\nTipping, M. E. and C. M. Bishop (1997b). Probabilistic principal component analysis. Accepted for publication in the Journal of the Royal Statistical Society, B. \n", "award": [], "sourceid": 1549, "authors": [{"given_name": "Christopher", "family_name": "Bishop", "institution": null}]}