{"title": "Extreme Components Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 137, "page_last": 144, "abstract": "", "full_text": "Extreme Components Analysis\n\nMax Welling\n\nDepartment of Computer Science\n\nUniversity of Toronto\n10 King\u2019s College Road\n\nToronto, M5S 3G5 Canada\n\nwelling@cs.toronto.edu\n\nFelix Agakov, Christopher K. I. Williams\n\nInstitute for Adaptive and Neural Computation\n\nSchool of Informatics\nUniversity of Edinburgh\n{ckiw,felixa}@inf.ed.ac.uk\n\n5 Forrest Hill, Edinburgh EH1 2QL, UK\n\nAbstract\n\nPrincipal components analysis (PCA) is one of the most widely used\ntechniques in machine learning and data mining. Minor components\nanalysis (MCA) is less well known, but can also play an important role\nin the presence of constraints on the data distribution. In this paper we\npresent a probabilistic model for \u201cextreme components analysis\u201d (XCA)\nwhich at the maximum likelihood solution extracts an optimal combina-\ntion of principal and minor components. For a given number of compo-\nnents, the log-likelihood of the XCA model is guaranteed to be larger or\nequal than that of the probabilistic models for PCA and MCA. We de-\nscribe an ef\ufb01cient algorithm to solve for the globally optimal solution.\nFor log-convex spectra we prove that the solution consists of principal\ncomponents only, while for log-concave spectra the solution consists of\nminor components. In general, the solution admits a combination of both.\nIn experiments we explore the properties of XCA on some synthetic and\nreal-world datasets.\n\n1 Introduction\n\nThe simplest and most widely employed technique to reduce the dimensionality of a data\ndistribution is to linearly project it onto the subspace of highest variation (principal compo-\nnents analysis or PCA). This guarantees that the reconstruction error of the data, measured\nwith L2-norm, is minimized. 
For some data distributions, however, it is not the directions of large variation that are most distinctive, but the directions of very small variation, i.e. constrained directions. In this paper we argue that in reducing the dimensionality of the data, we may want to preserve these constrained directions alongside some of the directions of large variability.

The proposed method, termed "extreme components analysis" or XCA, holds the middle ground between PCA and MCA (minor components analysis, the method that projects on directions of low variability). The objective that determines the optimal combination of principal and minor components derives from the probabilistic formulation of XCA, which neatly generalizes the probabilistic models for PCA and MCA. For a fixed number of components, the XCA model will always assign higher probability to the (training) data than PCA or MCA, and as such be more efficient in encoding the data. We propose a very simple and efficient algorithm to extract the optimal combination of principal and minor components and prove some results relating the shape of the log-spectrum to this solution.

The XCA model is inspired by Hinton's "product of experts" (PoE) model [1]. In a PoE, linear combinations of an input vector are penalized according to their negative log-probability and act as constraints. Thus, configurations of high probability have most of their constraints approximately satisfied. As we will see, the same is true for the XCA model, which can therefore be considered as an under-complete product of Gaussians (PoG).

2 Variation vs. Constraint: PCA vs. MCA

Consider a plane embedded in 3 dimensions that cuts through the origin.
There are 2 distinct ways to mathematically describe points in that plane:

x = Ay \forall y \in R^2,   or   w^T x = 0 \forall x \in R^3    (1)

where A is a 3x2 matrix, the columns of which form a basis in the plane, and w is a vector orthogonal to the plane. In the first description we parameterize the modes of variation, while in the second we parameterize the direction of no variation, i.e. the direction in which the points are constrained. Note that we only need 3 real parameters to describe a plane in terms of its constraint, versus 6 parameters to describe it in terms of its modes of variation. More generally, if we want to describe a d-dimensional subspace in D dimensions we may use D - d constraint directions or d subspace directions.

Next consider the stochastic version of the above problem: find an accurate description of an approximately d-dimensional data-cloud in D dimensions. The solution that probabilistic PCA (PPCA) [3, 4] provides is to model those d directions using unit vectors a_i (organized as columns of a matrix A) while adding isotropic Gaussian noise in all directions,

x = Ay + n,   y ~ N[0, I_d],   n ~ N[0, \sigma_0^2 I_D].    (2)

The probability density of x is Gaussian with covariance

C_PCA = <x x^T> = \sigma_0^2 I_D + A A^T.    (3)

In [4] it was shown that at the maximum likelihood solution the columns of A are given by the first d principal components of the data with length ||a_i|| = \sqrt{\sigma_i^2 - \sigma_0^2}, where \sigma_i^2 is the i'th largest eigenvalue of the sample covariance matrix and \sigma_0^2 is equal to the average variance in the directions orthogonal to the hyperplane.

Alternatively, one may describe the data as D - d approximately satisfied constraints, embedded in a high variance background model. The noisy version of the constraint w^T x = 0 is given by z = w^T x where z ~ N[0, 1]. The variance of the constrained direction, 1/||w||^2, should be smaller than that of the background model.
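The PPCA maximum likelihood covariance can be sketched numerically. The following is a minimal illustration (function and variable names are our own), assuming the Tipping-Bishop result stated above: the columns of A lie along the top-d eigenvectors with lengths \sqrt{\sigma_i^2 - \sigma_0^2}, and \sigma_0^2 is the average of the discarded eigenvalues:

```python
import numpy as np

def ppca_covariance(S, d):
    """ML covariance of PPCA: C = sigma0^2 I + A A^T, with A built from
    the top-d eigenvectors of the sample covariance S."""
    D = S.shape[0]
    evals, evecs = np.linalg.eigh(S)            # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]  # make descending
    sigma0_sq = evals[d:].mean()                # average discarded variance
    A = evecs[:, :d] * np.sqrt(evals[:d] - sigma0_sq)
    return sigma0_sq * np.eye(D) + A @ A.T
```

The model covariance reproduces the top-d sample eigenvalues exactly and replaces the remaining ones by their average.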
By multiplying D - d of these "Gaussian pancake" models [6] a probabilistic model for MCA results, with inverse covariance given by

C^{-1}_MCA = I_D / \sigma_0^2 + W^T W    (4)

where the vectors w^T form the rows of W. It was shown that at the maximum likelihood solution the rows of W are given by the first D - d minor components of the data with length ||w_i|| = \sqrt{1/\sigma_i^2 - 1/\sigma_0^2}, where \sigma_i^2 is the i'th smallest eigenvalue of the sample covariance matrix and \sigma_0^2 is equal to the average variance in the directions orthogonal to the hyperplane. Thus, while PPCA explicitly models the directions of large variability, PMCA explicitly models the directions of small variability.

3 Extreme Components Analysis (XCA)

Probabilistic PCA can be interpreted as a low variance data cloud which has been stretched out in certain directions. Probabilistic MCA on the other hand can be thought of as a large variance data cloud which has been pushed inward in certain directions. Given the Gaussian assumption, the approximation that we make is due to the fact that we replace the variances in the remaining directions by their average. Intuitively, better approximations may be obtained by identifying the set of eigenvalues which, when averaged, induces the smallest error. The appropriate model, to be discussed below, will have both elongated and contracted directions in its equiprobable contours, resulting in a mix of principal and minor components.

3.1 A Probabilistic Model for XCA

The problem can be approached by starting at either the PPCA or the PMCA model. The restricting aspect of the PPCA model is that the noise n is added in all directions in input space.
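The PMCA solution in eqn. 4 can be checked the same way. A small sketch (our own code), assuming the stated result that the rows of W lie along the minor eigenvectors with squared lengths 1/\sigma_i^2 - 1/\sigma_0^2; the resulting inverse covariance then has eigenvalue 1/\sigma_i^2 along each modelled minor direction and 1/\sigma_0^2 elsewhere:

```python
import numpy as np

def pmca_inv_covariance(S, n_mc):
    """ML inverse covariance of PMCA: C^{-1} = I/sigma0^2 + W^T W, with the
    n_mc rows of W along the smallest eigenvectors of S."""
    D = S.shape[0]
    evals, evecs = np.linalg.eigh(S)        # ascending: minor directions first
    sigma0_sq = evals[n_mc:].mean()         # background (averaged) variance
    W = (evecs[:, :n_mc] * np.sqrt(1.0 / evals[:n_mc] - 1.0 / sigma0_sq)).T
    return np.eye(D) / sigma0_sq + W.T @ W
```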
Since adding random variables always results in increased variance, the directions modelled by the vectors a_i must necessarily have larger variance than the noise directions, resulting in principal components. In order to remove that constraint we need to add the noise only in the directions orthogonal to the a_i's. This leads to the following "causal generative model"1 for XCA,

x = Ay + P^\perp_A n,   y ~ N[0, I_d],   n ~ N[0, \sigma_0^2 I_D]    (5)

where P^\perp_A = I_D - A(A^T A)^{-1} A^T is the projection operator on the orthogonal complement of the space spanned by the columns of A. The covariance of this model is found to be

C_XCA = \sigma_0^2 P^\perp_A + A A^T.    (6)

Approaching the problem starting at the PMCA model, we start with d components {w_i} (organized as rows in W) and add isotropic noise to the remaining directions,

z_1 = W x,   z_2 = V x,   z_1 ~ N[0, I_d],   z_2 ~ N[0, \sigma_0^2 I_{D-d}]    (7)

where the rows of V form an orthonormal basis in the orthogonal complement of the space spanned by {w_i}. Importantly, we will not impose any constraints on the norms of {w_i} or \sigma_0, i.e. the components are allowed to model directions of large or small variance. To derive the PDF we note that ({z_1i}, {z_2i}) are independent random variables, implying that P(z_1, z_2) is a product of marginal distributions. This is then converted to P(x) by taking into account the Jacobian of the transformation, J_{(z_1,z_2)->x} = \sqrt{det(W W^T)}. The result is that x has a Gaussian distribution with inverse covariance

C^{-1}_XCA = P^\perp_W / \sigma_0^2 + W^T W    (8)

where P^\perp_W = I_D - W^T (W W^T)^{-1} W is the projection operator on the orthogonal complement of W. Also, det(C^{-1}_XCA) = det(W W^T) \sigma_0^{2(d-D)}.

It is now not hard to verify that by identifying A = W^# def= W^T (W W^T)^{-1} (the pseudo-inverse of W) the two models defined through eqns.
6 and 8 are indeed identical. Thus, by slightly changing the noise model, both PPCA and PMCA result in XCA (i.e. compare eqns. 3, 4, 6, 8).

1Note however that the semantics of a two-layer directed graphical model is problematic since p(x|y) is improper.

3.2 Maximum Likelihood Solution

For a centered (zero mean) dataset {x} of size N the log-likelihood is given by

L = - (ND/2) log(2\pi) + (N/2) log det(W W^T) + (N(D-d)/2) log(1/\sigma_0^2) - (N/2) tr(C^{-1}_XCA S)    (9)

where S = (1/N) \sum_{i=1}^N x_i x_i^T \in R^{DxD} is the covariance of the data. To solve for the stationary points of L we take derivatives w.r.t. W^T and 1/\sigma_0^2 and equate them to zero. Firstly, for W we find the following equation,

W^# - S W^T + (1/\sigma_0^2) P^\perp_W S W^# = 0.    (10)

Let W^T = U \Lambda R^T be the singular value decomposition (SVD) of W^T, so that U \in R^{Dxd} forms an incomplete orthonormal basis, \Lambda \in R^{dxd} is a full-rank diagonal matrix, and R \in R^{dxd} is a rigid rotation factor. Inserting this into eqn. 10 we find,

U \Lambda^{-1} R^T - S U \Lambda R^T + (1/\sigma_0^2)(I_D - U U^T) S U \Lambda^{-1} R^T = 0.    (11)

Next we note that the projections of this equation on the space spanned by W and its orthogonal complement should hold independently. Thus, multiplying equation 11 on the left by either P_W or P^\perp_W, and multiplying it on the right by R \Lambda^{-1}, we obtain the following two equations,

U \Lambda^{-2} = U U^T S U,    (12)

S U (I_d - \Lambda^{-2}/\sigma_0^2) = U U^T S U (I_d - \Lambda^{-2}/\sigma_0^2).    (13)

Inserting eqn. 12 into eqn.
13 and right multiplying with (I_d - \Lambda^{-2}/\sigma_0^2)^{-1} we find the eigenvalue equation2,

S U = U \Lambda^{-2}.    (14)

Inserting this solution back into eqn. 12 we note that it is satisfied as well. We thus conclude that U is given by the eigenvectors of the sample covariance matrix S, while the elements of the (diagonal) matrix \Lambda are given by \lambda_i = 1/\sigma_i, with \sigma_i^2 the eigenvalues of S (i.e. the spectrum).

Finally, taking derivatives w.r.t. 1/\sigma_0^2 we find,

\sigma_0^2 = (1/(D-d)) tr(P^\perp_W S) = (1/(D-d)) ( tr(S) - tr(U \Lambda^{-2} U^T) ) = (1/(D-d)) \sum_{i \in G} \sigma_i^2    (15)

where G is the set of all eigenvalues of S which are not represented in \Lambda^{-2}. The above equation expresses the fact that these eigenvalues are being approximated through their average \sigma_0^2.

Inserting the solutions 14 and 15 back into the log-likelihood (eqn. 9) we find,

L = - (ND/2) log(2\pi e) - (N/2) \sum_{i \in C} log(\sigma_i^2) - (N(D-d)/2) log( (1/(D-d)) \sum_{i \in G} \sigma_i^2 )    (16)

where C is the set of retained eigenvalues. The log-likelihood has now been reduced to a function of the discrete set of eigenvalues {\sigma_i^2} of S.

2As we will see later, the left-out eigenvalues have to be contiguous in the spectrum, implying that the matrix (I_d - \Lambda^{-2}/\sigma_0^2)^{-1} can only be singular if there is a retained eigenvalue that is equal to all left-out eigenvalues. This is clearly an uninteresting case, since the likelihood will not decrease if we leave this component out as well.

3.3 An Algorithm for XCA

To optimize 16 efficiently we first note that the sum of the eigenvalues {\sigma_i^2} is constant: \sum_{i \in C \cup G} \sigma_i^2 = tr(S).
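Eqn. 16 can be evaluated directly from the sample spectrum. A small sketch (function names are our own), with a sanity check exploiting the fact that when the left-out eigenvalues are all equal, averaging them is exact and eqn. 16 reduces to the log-likelihood of the full Gaussian with covariance S:

```python
import numpy as np

def xca_loglik(spectrum, retained, N):
    """Evaluate eqn. 16: XCA log-likelihood where `retained` indexes the
    explicitly modelled eigenvalues (set C) and the rest (set G) are averaged."""
    s = np.asarray(spectrum, dtype=float)
    D = len(s)
    C = np.asarray(retained)
    G = np.setdiff1d(np.arange(D), C)      # left-out ("gap") eigenvalues
    return (-0.5 * N * D * np.log(2 * np.pi * np.e)
            - 0.5 * N * np.log(s[C]).sum()
            - 0.5 * N * len(G) * np.log(s[G].mean()))
```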
We may use this to rewrite L in terms of the retained eigenvalues only. We define the following auxiliary cost to be minimized, which is proportional to -L up to irrelevant constants,

K = \sum_{i \in C} log \sigma_i^2 + (D-d) log( tr(S) - \sum_{i \in C} \sigma_i^2 ).    (17)

Next we recall an important result that was proved in [4]: the minimizing solution has eigenvalues \sigma_i^2, i \in G, which are contiguous in the (ordered) spectrum, i.e. the eigenvalues which are averaged form a "gap" in the spectrum. With this result, the search for the optimal solution has been reduced from exponential to linear in the number of retained dimensions d. Thus we obtain the following algorithm for determining the optimal d extreme components: (1) compute the first d principal components and the first d minor components, (2) for all d + 1 possible positions of the "gap" compute the cost K in eqn. 17, and (3) select the solution that minimizes K.

It is interesting to note that the same equations for the log-likelihood (L, eqn. 16) and cost (K, eqn. 17) appear in the analysis of PPCA [4] and PMCA [6]. The only difference is that certain constraints forcing the solution to contain only principal or minor components are absent in eqn. 16. For XCA, this opens the possibility of mixed solutions with both principal and minor components. From the above observation we may conclude that the optimal ML solution for XCA will always have larger log-likelihood on the training data than the optimal ML solutions for PPCA and PMCA. Moreover, when XCA contains only principal (or minor) components, it must have the same likelihood on the training data as PPCA (or PMCA).
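The three-step algorithm above amounts to a linear sweep of the gap over the ordered spectrum. A minimal sketch (function and variable names are ours), assuming the spectrum is sorted in descending order; note that tr(S) minus the retained eigenvalues is simply the sum of the gap eigenvalues:

```python
import numpy as np

def xca_components(spectrum, d):
    """Slide the gap of size g = D - d over all contiguous positions in the
    (descending) spectrum and minimize the cost K of eqn. 17.
    Returns (n_pcs, n_mcs): how many principal / minor components to keep."""
    s = np.asarray(spectrum, dtype=float)
    g = len(s) - d
    best_p, best_K = None, np.inf
    for p in range(d + 1):                  # keep p leading PCs, d - p trailing MCs
        gap = s[p:p + g]                    # eigenvalues replaced by their average
        retained = np.concatenate([s[:p], s[p + g:]])
        K = np.log(retained).sum() + g * np.log(gap.sum())
        if K < best_K:
            best_p, best_K = p, K
    return best_p, d - best_p
```

Consistent with Theorem 1 of section 4, a log-convex spectrum should drive the gap to the bottom (all principal components) and a log-concave one to the top (all minor components).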
In this sense XCA is the natural extension of PPCA and PMCA.

4 Properties of the Optimal ML Solution

We will now try to provide some insight into the nature of the optimal ML solutions. First we note that the objective K is shifted by a constant if we multiply all variances by a factor \alpha, \sigma_i^2 -> \alpha \sigma_i^2, which leaves its minima invariant. In other words, the objective is only sensitive to changing ratios between eigenvalues. This property suggests using the logarithm of the eigenvalues of S as the natural quantities, since multiplying all eigenvalues by a constant results in a vertical shift of the log-spectrum. Consequently, the properties of the optimal solution only depend on the shape of the log-spectrum. In appendix A we prove the following characterization of the optimal solution.

Theorem 1
• A log-linear spectrum has no preference for principal or minor components.
• The extreme components of log-convex spectra are principal components.
• The extreme components of log-concave spectra are minor components.

Although a log-linear spectrum with arbitrary slope has no preference for principal or minor components, the slope does have an impact on the accuracy of the approximation, because the variances in the gap are approximated by their average value. A spectrum that can be exactly modelled by PPCA with sufficient retained directions is one which has a pedestal, i.e. where the eigenvalues become constant beyond some value. Similarly, PMCA can model exactly a spectrum which is constant and then drops off, while XCA can model exactly a spectrum with a constant section at some arbitrary position. Some interesting examples of spectra can be obtained from the Fourier (spectral) representation of stationary Gaussian processes. Processes with power-law spectra S(\omega) \propto \omega^{-\alpha} are log-convex.
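The log-linear case of Theorem 1 is easy to check numerically: for \sigma_i^2 = e^{b + a i}, every gap position gives the same cost K, so the objective is indifferent to where the gap sits. A quick check (the helper below is our own):

```python
import numpy as np

def gap_costs(spectrum, d):
    """Cost K of eqn. 17 (up to constants) for each position of the
    gap of size g = D - d in a descending spectrum."""
    s = np.asarray(spectrum, dtype=float)
    g = len(s) - d
    costs = []
    for p in range(d + 1):
        retained = np.concatenate([s[:p], s[p + g:]])
        costs.append(np.log(retained).sum() + g * np.log(s[p:p + g].sum()))
    return np.array(costs)

# log-linear spectrum: all gap positions are equally good
s = np.exp(1.0 - 0.5 * np.arange(10))
print(np.ptp(gap_costs(s, 4)))   # spread across positions: ~0 up to floating point
```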
Table 1: Percent classification error of noisy sinusoids as a function of g = D - d.

g       2      3      4      5      6      7      8
e_XCA   1.88   1.91   2.35   1.88   2.37   3.27   28.24
e_MCA   2.37   3.10   4.64   4.06   2.37   3.27   28.24
e_PCA   1.88   2.50   12.21  14.57  19.37  32.99  30.14

An example of a spectrum which is log-linear is obtained from the RBF covariance function with a Gaussian weight function [7]. The RBF covariance function on the circle will give rise to eigenvalues \lambda_i \propto e^{-\beta i^2}, i.e. a log-concave spectrum.

Both PCA and MCA share the convenient property that a solution with d components is contained in the solution with d + 1 components. This is not the case for XCA: the solution with d + 1 components may look totally different from the solution with d components (see inset in Figure 1c); in fact, they may not even share a single component!

5 Experiments

Small Sample Effects

When the number of data cases is small relative to the dimensionality of the problem, the log-spectrum tends to bend down on the MC side, producing "spurious" minor components in the XCA solution. Minor components that result from finite sample effects, i.e. that do not exist in the infinite data limit, have an adverse effect on generalization performance. This is shown in Figure 1a for the "Frey-Faces" dataset, where we plot the log-likelihood for (centered) training and test data for both PCA and XCA. This dataset contains 1965 images of size 20 x 28, of which we used 1000 for training and 965 for testing. Since the number of cases is small compared to the number of dimensions, both PCA and XCA show a tendency to overfit.
Note that at the point where minor components appear in the XCA solution (d = 92) the log-likelihood of the training data improves relative to PCA, while the log-likelihood of the test data suffers.

Sinusoids in noise

Consider a sum of p sinusoids Y(t) = \sum_{i=1}^p A_i cos(\omega_i t + \phi_i) sampled at D equally-spaced time points. If each \phi_i is uniformly random in (0, 2\pi) then the covariance is <Y(t) Y(t')> = \sum_{i=1}^p P_i cos \omega_i (t - t'), where P_i = A_i^2 / 2. This signal defines a 2p-dimensional linear manifold in the D-dimensional space (see [2] §12.5). By adding white noise to this signal we obtain a non-singular covariance matrix. Now imagine we have two such signals, each described by p different powers and frequencies. Instead of using the exact covariance matrix for each, we approximate the covariance matrix using either XCA, PMCA or PPCA. We then compare the accuracy of a classification task using either the exact covariance matrix or the approximations. (Note that although the covariance can be calculated exactly, the generating process is not in fact a Gaussian process.) By adjusting p, the powers and the frequencies of the two signals, a variety of results can be obtained. We set D = 9 and p = 4. The first signal had P = (1.5, 2.5, 3, 2.5) and \omega = (1.9, 3.5, 4.5, 5), and the second P = (3, 2, 1.8, 1) and \omega = (1.7, 2.9, 3.3, 5.3). The variance of the background noise was 0.5. Table 1 demonstrates error rates on 10000 test cases obtained for XCA, PMCA and PPCA using g = D - d approximated components. For all values of g the error rate for XCA is less than or equal to that for PPCA and PMCA. For comparison, the optimal Gaussian classifier has an error rate of 1.87%. For g = 2 the XCA solution for both classes is PPCA, and for g = 6, 7, 8 it is PMCA; in between, both classes have true XCA solutions.
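The covariance of such a noisy-sinusoid signal is easy to construct. The sketch below (our own code) builds it for the first signal's published powers and frequencies; since the four sinusoids span only 2p = 8 of the D = 9 dimensions, the smallest eigenvalue of the exact covariance is the noise variance itself:

```python
import numpy as np

# <Y(t)Y(t')> = sum_i P_i cos(omega_i (t - t')), plus white noise on the diagonal
P = np.array([1.5, 2.5, 3.0, 2.5])
omega = np.array([1.9, 3.5, 4.5, 5.0])
noise_var = 0.5
t = np.arange(9)                       # D = 9 equally-spaced sample times
lags = t[:, None] - t[None, :]
C = sum(p * np.cos(w * lags) for p, w in zip(P, omega)) + noise_var * np.eye(9)

evals = np.sort(np.linalg.eigvalsh(C))
print(evals[0])                        # the noise floor: 0.5, up to round-off
```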
MCA behaviour is observed if \sigma_0^2 is low.

2-D Positions of Face Features

671 cases were extracted from a dataset containing 2-D coordinates of 6 features on frontal faces3. To obtain a translation and orientation invariant representation, we computed the 15 squared (Euclidean) distances between the features and removed their mean. In Figures 1b and 1c we show the log-likelihood for PCA, MCA and XCA of 335 training cases and 336 test cases respectively. Clearly, XCA is superior even on the test data. In the inset of Figure 1c we depict the number of PCs and MCs in the XCA solution as we vary the number of retained dimensions. Note the irregular behavior when the number of components is large.

Figure 1: (a) Log-likelihood of the "Frey-faces" training data (top curves) and test data (bottom curves) for PCA (dashed lines) and XCA (solid lines) as a function of the number of components. Inset: log-spectrum of training data. (b) Log-likelihood of training data for PCA (dash), MCA (dash-dot) and XCA (solid) as a function of the number of components. Inset: log-spectrum of training data. (c) Log-likelihood of test data. Inset: number of PCs (dash) versus number of MCs (dash-dot) as a function of the number of components.

6 Discussion

In this paper we have proposed XCA as the natural generalization of PCA and MCA for the purpose of dimensionality reduction. It is however also possible to consider a model with non-Gaussian components. In [5] the components were distributed according to a Student-t distribution, resulting in a probabilistic model for undercomplete independent components analysis (UICA).

There are quite a few interesting questions that remain unanswered in this paper. For instance, although we have shown how to efficiently find the global maximum of the log-likelihood, we have not identified the properties of the other stationary points.
Unlike PPCA, we expect many local maxima to be present. Also, can we formulate a Bayesian version of XCA where we predict the number and nature of the components supported by the data? Can we correct the systematic under-estimation of MCs in the presence of relatively few data cases? There are a number of extensions of the XCA model worth exploring: XCA with multiple noise models (i.e. multiple gaps in the spectrum), mixtures of XCA, and so on.

3The dataset was obtained by M. Weber at the computational vision lab at Caltech and contains the 2-D coordinates of 6 features (eyes, nose, 3 mouth features) of unregistered frontal face images.

A Proof of Theorem 1

Using the fact that the sum and the product of the eigenvalues are constant we can rewrite the cost eqn. 17 (up to irrelevant constants) in terms of the left-out eigenvalues of the spectrum only. We will also use the fact that the left-out eigenvalues are contiguous in the spectrum, and form a "gap" of size g def= D - d,

C = g log( \sum_{i=i^*}^{i^*+g-1} e^{f_i} ) - \sum_{i=i^*}^{i^*+g-1} f_i    (18)

where f_i are the log-eigenvalues and i^* is the location of the left hand side of the gap.
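That eqn. 18 agrees with eqn. 17 up to irrelevant constants can be verified numerically: K - C = \sum_i f_i over the whole spectrum, independent of where the gap sits. A quick check (our own code):

```python
import numpy as np

s = np.exp(np.array([2.0, 1.3, 0.9, 0.2, -0.4, -1.1]))  # a descending spectrum
g = 3                                                    # gap size D - d

# K (eqn. 17, using tr(S) - sum_C sigma^2 = sum of the gap eigenvalues)
# and C (eqn. 18) for every gap position; their difference should be constant.
diffs = []
for i in range(len(s) - g + 1):
    gap = s[i:i + g]
    retained = np.concatenate([s[:i], s[i + g:]])
    K = np.log(retained).sum() + g * np.log(gap.sum())
    C = g * np.log(gap.sum()) - np.log(gap).sum()
    diffs.append(K - C)
print(np.ptp(diffs))   # spread of K - C over gap positions: ~0
```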
We are interested in the change of this cost, \delta C, if we shift the gap one place to the right (or the left). This can be expressed as

\delta C = g log( 1 + (e^{f_{i^*+g}} - e^{f_{i^*}}) / \sum_{i=i^*}^{i^*+g-1} e^{f_i} ) - ( f_{i^*+g} - f_{i^*} ).    (19)

Inserting a log-linear spectrum, f_i = b + a \cdot i with a < 0, and using the result \sum_{i'=0}^{g-1} e^{a \cdot i'} = (e^{ag} - 1)/(e^a - 1), we find that the change in C vanishes for all log-linear spectra. This establishes the first claim. For the more general case we define corrections c_i to the log-linear spectrum that runs through the points f_{i^*} and f_{i^*+g}, i.e. f_i = b + a \cdot i + c_i. First consider the case of a convex spectrum between i^* and i^* + g, which implies that all c_i < 0. Inserting this into 19 we find after some algebra

\delta C = g log( 1 + (e^{ag} - 1) / \sum_{i'=0}^{g-1} e^{a \cdot i' + c_{i'+i^*}} ) - ag.    (20)

Because all c_i < 0, the first term must be smaller (more negative) than the corresponding term in the linear case, implying that \delta C < 0 (the second term is unchanged w.r.t. the linear case). Thus, if the entire spectrum is log-convex the gap will be located on the right, resulting in PCs. A similar argument shows that for log-concave spectra the solutions consist of MCs only. In general, log-spectra may have convex and concave pieces. The cost 18 is minimized when some of the c_i are positive and some negative in such a way that \sum_{i'=0}^{g-1} e^{a \cdot i' + c_{i'+i^*}} \approx \sum_{i'=0}^{g-1} e^{a \cdot i'}. Note that due to the exponent in this sum, positive c_i have a stronger effect than negative c_i.

Acknowledgements

We'd like to thank the following people for their invaluable input into this paper: Geoff Hinton, Sam Roweis, Yee Whye Teh, David MacKay and Carl Rasmussen.
We are also very grateful to Pietro Perona and Anelia Angelova for providing the "feature position" dataset used in this paper.

References

[1] G.E. Hinton. Products of experts. In Proceedings of the International Conference on Artificial Neural Networks, volume 1, pages 1-6, 1999.

[2] J.G. Proakis and D.G. Manolakis. Digital Signal Processing: Principles, Algorithms and Applications. Macmillan, 1992.

[3] S.T. Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, volume 10, pages 626-632, 1997.

[4] M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999.

[5] M. Welling, R.S. Zemel, and G.E. Hinton. A tractable probabilistic model for projection pursuit. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2003. Accepted for publication.

[6] C.K.I. Williams and F.V. Agakov. Products of Gaussians and probabilistic minor components analysis. Neural Computation, 14(5):1169-1182, 2002.

[7] H. Zhu, C.K.I. Williams, R.J. Rohwer, and M. Morciniec. Gaussian regression and optimal finite dimensional linear models. In C.M. Bishop, editor, Neural Networks and Machine Learning. Springer-Verlag, Berlin, 1998.
", "award": [], "sourceid": 2517, "authors": [{"given_name": "Max", "family_name": "Welling", "institution": null}, {"given_name": "Christopher", "family_name": "Williams", "institution": null}, {"given_name": "Felix", "family_name": "Agakov", "institution": null}]}