{"title": "Linear Dependent Dimensionality Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 145, "page_last": 152, "abstract": "", "full_text": "Linear Dependent Dimensionality Reduction\n\nNathan Srebro\n\nTommi Jaakkola\n\nDepartment of Electrical Engineering and Computer Science\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nnati@mit.edu,tommi@ai.mit.edu\n\nAbstract\n\nWe formulate linear dimensionality reduction as a semi-parametric esti-\nmation problem, enabling us to study its asymptotic behavior. We gen-\neralize the problem beyond additive Gaussian noise to (unknown) non-\nGaussian additive noise, and to unbiased non-additive models.\n\n1 Introduction\n\nFactor models are often natural in the analysis of multi-dimensional data. The underly-\ning premise of such models is that the important aspects of the data can be captured via a\nlow-dimensional representation (\u201cfactor space\u201d). The low-dimensional representation may\nbe useful for lossy compression as in typical applications of PCA, for signal reconstruc-\ntion as in factor analysis or non-negative matrix factorization [1], for understanding the\nsignal structure [2], or for prediction as in applying SVD for collaborative \ufb01ltering [3]. In\nmany situations, including collaborative \ufb01ltering and structure exploration, the \u201cimportant\u201d\naspects of the data are the dependencies between different attributes. For example, in col-\nlaborative \ufb01ltering we rely on a representation that summarizes the dependencies among\nuser preferences. More generally, we seek to identify a low-dimensional space that captures\nthe dependent aspects of the data, and separate them from independent variations. 
Our goal\nis to relax restrictions on the form of each of these components, such as Gaussianity, additivity\nand linearity, while maintaining a principled, rigorous framework that allows analysis\nof the methods.\n\nWe begin by studying the probabilistic formulations of the problem, focusing on the assumptions\nthat are made about the dependent, low-rank \u201csignal\u201d and independent \u201cnoise\u201d\ndistributions. We consider a general semi-parametric formulation that emphasizes what is\nbeing estimated and allows us to discuss asymptotic behavior (Section 2). We then study\nthe standard (PCA) approach, show that it is appropriate for additive i.i.d. noise (Section 3),\nand present a generic estimator that is appropriate also for unbiased non-additive models\n(Section 4). In Section 5 we confront the non-Gaussianity directly, develop maximum-likelihood\nestimators in the presence of Gaussian mixture additive noise, and show that the\nconsistency of such maximum-likelihood estimators should not be taken for granted.\n\n2 Dependent Dimensionality Reduction\n\nOur starting point is the problem of identifying linear dependencies in the presence of independent\nidentically distributed Gaussian noise. In this formulation, we observe a data\nmatrix Y \u2208 \u211d^{n\u00d7d} which we assume was generated as Y = X + Z, where the dependent,\nlow-dimensional component X \u2208 \u211d^{n\u00d7d} (the \u201csignal\u201d) is a matrix of rank k and the independent\ncomponent Z (the \u201cnoise\u201d) is i.i.d. zero-mean Gaussian with variance \u03c3\u00b2. We can\nwrite down the log-likelihood of X as \u2212(1/\u03c3\u00b2)|Y \u2212 X|_Fro + Const (where |\u00b7|_Fro is the Frobenius,\nor sum-squared, norm) and conclude that, regardless of the variance \u03c3\u00b2, the maximum-likelihood\nestimator of X is the rank-k matrix minimizing the Frobenius distance. 
It is\ngiven by the leading components of the singular value decomposition of Y.\u00b9\nAlthough the above formulation is perfectly valid, there is something displeasing about\nit. We view the entire matrix X as parameters, and estimate them according to a single\nobservation Y. The number of parameters is linear in the data, and even with more data,\nwe cannot hope to estimate the parameters (entries in X) beyond a fixed precision. What we\ncan estimate with more data rows is the rank-k row-space of X. Consider the factorization\nX = UV\u2032, where V\u2032 \u2208 \u211d^{k\u00d7d} spans this \u201csignal space\u201d.\nThe dependencies of each row y of Y are captured by a row\nu of U, which, through the parameters V and \u03c3, specifies\nhow each entry yi is generated independently given u.\u00b2\nA standard parametric analysis of the model would view u as a random vector (rather\nthan parameters) and impose some, possibly parametric, distribution over it (interestingly,\nif u is Gaussian, the maximum-likelihood reconstruction is the same Frobenius low-rank\napproximation [4]). However, in the analysis we started with, we did not make any assumptions\nabout the distribution of u, beyond its dimensionality. The model class is then\nnon-parametric, yet we still desire, and are able, to estimate a parametric aspect of the\nmodel: the estimator can be seen as an ML estimator for the signal subspace, where the\ndistribution over u is unconstrained nuisance.\nAlthough we did not impose any form on the distribution of u, we did impose a strict form\non the conditional distributions yi|u: we required them to be Gaussian with fixed variance\n\u03c3\u00b2 and mean (uV\u2032)_i. We would like to relax these requirements, and require only that y|u be\na product distribution, i.e. that its coordinates yi|u be (conditionally) independent. 
Since\nu is continuous, we cannot expect to forego all restrictions on yi|ui, but we can expect to\nset up a semi-parametric problem in which y|u may lie in an infinite-dimensional family\nof distributions, and is not strictly parameterized.\nRelaxing the Gaussianity leads to linear additive models y = uV\u2032 + z, with z independent\nof u, but not necessarily Gaussian. Further relaxing the additivity is appropriate, e.g., when\nthe noise has a multiplicative component, or when the features of y are not real numbers.\nThese types of models, with a known distribution yi|xi, have been suggested for classification\nusing logistic loss [5], when yi|xi forms an exponential family [6], and in a more\nabstract framework [7]. Relaxing the linearity assumption x = uV\u2032 is also appropriate in\nmany situations. Fitting a non-linear manifold by minimizing the sum-squared distance can\nbe seen as an ML estimator for y|u = g(u) + z, where z is i.i.d. Gaussian and g : \u211d^k \u2192 \u211d^d\nspecifies some smooth manifold. Combining these ideas leads us to discuss the conditional\ndistributions yi|gi(u), or yi|u directly.\nIn this paper we take our first steps in studying this problem, and relaxing restrictions on\ny|u. We continue to assume a linear model x = uV\u2032 and limit ourselves to additive noise\nmodels and unbiased models in which E [y|x] = x. We study the estimation of the rank-k\nsignal space in which x resides, based on a sample of n independent observations of y\n(forming the rows of Y), where the distribution on u is unconstrained nuisance.\n\n\u00b9 A mean term is also usually allowed. Incorporating a non-zero mean is straightforward, and in\norder to simplify derivations, we do not account for it in most of our presentation.\n\n\u00b2 We use uppercase letters to denote matrices, and lowercase letters for vectors, and use bold type\nto indicate random quantities.\n\nIn order to study estimators for a subspace, we must be able to compare two subspaces. 
A\nnatural way of doing so is through the canonical angles between them [8]. Define the angle\nbetween a vector v1 and a subspace V2 to be the minimal angle between v1 and any v2 \u2208 V2.\nThe largest canonical angle between two subspaces is then the maximal angle between a\nvector v1 \u2208 V1 and the subspace V2. The second largest angle is the maximum over all\nvectors orthogonal to v1, and so on. It is convenient to think of a subspace in terms\nof the matrix whose columns span it. Computationally, if the columns of V1 and V2 form\northonormal bases of V1 and V2, then the cosines of the canonical angles between V1 and\nV2 are given by the singular values of V1\u2032V2. Throughout the presentation, we will slightly\noverload notation and use a matrix to denote also its column subspace. In particular, we\nwill denote by V0 the true signal subspace, i.e. such that x = uV0\u2032.\n\n3 The L2 Estimator\n\nWe first consider the \u201cstandard\u201d approach to low-rank approximation: minimizing the sum-squared\nerror.\u00b3 This is the ML estimator when the noise is i.i.d. Gaussian. But the L2\nestimator is appropriate also in a more general setting. We will show that the L2 estimator\nis consistent for any i.i.d. additive noise with finite variance (as we will see later on, this is\nmore than can be said for some ML estimators).\nThe L2 estimator of the signal subspace is the subspace spanned by the leading eigenvectors\nof the empirical covariance matrix \u039b\u0302n of y, which is a consistent estimator of the true\ncovariance matrix \u039bY, which in turn is the sum of the covariance matrices of x and z,\nwhere \u039bX is of rank exactly\u2074 k, and if z is i.i.d., \u039bZ = \u03c3\u00b2I.\nLet s1 \u2265 s2 \u2265 \u00b7\u00b7\u00b7 \u2265 sk > 0 be the non-zero eigenvalues of \u039bX. 
Since z has variance exactly\n\u03c3\u00b2 in any direction, the principal directions of variation are not affected by it, and the\neigenvalues of \u039bY are exactly s1 + \u03c3\u00b2, . . . , sk + \u03c3\u00b2, \u03c3\u00b2, . . . , \u03c3\u00b2, with the leading k eigenvectors\nbeing the eigenvectors of \u039bX. This ensures an eigenvalue gap of sk > 0 between\nthe invariant subspace of \u039bY spanned by the eigenvectors of \u039bX and its complement, and\nwe can bound the norm of the canonical sines between V0 and the leading k eigenvectors of\n\u039b\u0302n by |\u039b\u0302n \u2212 \u039bY|/sk [8]. Since |\u039b\u0302n \u2212 \u039bY| \u2192 0 a.s., we conclude that the estimator is consistent.\n\n\u00b3 We call this an L2 estimator not because it minimizes the matrix L2-norm |Y \u2212 X|_2, which it\ndoes, but because it minimizes the vector L2-norms |y \u2212 x|_2\u00b2.\n\n\u2074 We should also be careful about signals that occupy only a proper subspace of V0, and be satisfied\nwith any rank-k subspace containing the support of x, but for simplicity of presentation we assume\nthis does not happen and x is of full rank k.\n\n4 The Variance-Ignoring Estimator\n\nWe turn to additive noise with independent, but not identically distributed, coordinates. If\nthe noise variances are known, the ML estimator corresponds to minimizing the column-weighted\n(inversely proportional to the variances) Frobenius norm of Y \u2212 X, and can be\ncalculated from the leading eigenvectors of a scaled empirical covariance matrix [9]. If the\nvariances are not known, e.g. when the scale of different coordinates is not known, there is\nno ML estimator: at least k coordinates of each y can always be exactly matched, and so\nthe likelihood is unbounded when up to k variances approach zero.\n\nFigure 1: Norm of sines of canonical angles to correct subspace: (a) Random rank-2 subspaces\nin \u211d^10. Gaussian noise of different scales in different coordinates, between 0.17 and 1.7 signal\nstrength. 
(b) Random rank-2 subspaces in \u211d^10, 500 sample rows, and Gaussian noise with varying\ndistortion (mean over 200 simulations, bars are one standard deviation tall). (c) Observations are\nexponentially distributed with means in rank-2 subspace (1 1 1 1 1 1 1 1 1 1; 1 0 1 0 1 0 1 0 1 0)\u2032.\n\nThe L2 estimator is not satisfactory in this scenario. The covariance matrix \u039bZ is still diagonal,\nbut is no longer a scaled identity. The additional variance introduced by the noise is\ndifferent in different directions, and these differences may overwhelm the \u201csignal\u201d variance\nalong V0, biasing the leading eigenvectors of \u039bY, and thus the limit of the L2 estimator,\ntoward axes with high \u201cnoise\u201d variance. The fact that this variability is independent of the\nvariability in other coordinates is ignored, and the L2 estimator is asymptotically biased.\nInstead of recovering the directions of greatest variability, we recover the covariance structure\ndirectly. In the limit, \u039b\u0302n \u2192 \u039bY = \u039bX + \u039bZ, a sum of a rank-k matrix and a diagonal\nmatrix. In particular, the non-diagonal entries of \u039b\u0302n approach those of \u039bX. We can thus\nseek a rank-k matrix \u039b\u0302X approximating \u039b\u0302n, e.g. in a sum-squared sense, except on the diagonal.\nThis is a (zero-one) weighted low-rank approximation problem. We optimize \u039b\u0302X\nby iteratively seeking a rank-k approximation of \u039b\u0302n with diagonal entries filled in from\nthe last iterate of \u039b\u0302X (this can be viewed as an EM procedure [5]). The row-space of the\nresulting \u039b\u0302X is then an estimator for the signal subspace. Note that the L2 estimator is the\nrow-space of the rank-k matrix minimizing the unweighted sum-squared distance to \u039b\u0302n.\nFigures 1(a,b) demonstrate this variance-ignoring estimator on simulated data with non-identical\nGaussian noise. 
The estimator reconstructs the signal-space almost as well as the\nML estimator, even though it does not have access to the true noise variance.\n\nDiscussing consistency in the presence of non-identical noise with unknown variances is\nproblematic, since the signal subspace is not necessarily identifiable. For example, the\ncombined covariance matrix \u039bY = (2 1; 1 2) can arise from a rank-one signal covariance\n\u039bX = (a 1; 1 1/a) for any 1/2 \u2264 a \u2264 2, each corresponding to a different signal subspace.\nCounting the number of parameters and constraints suggests identifiability when k < d \u2212\n(\u221a(8d+1) \u2212 1)/2, but this is by no means a precise guarantee. Anderson and Rubin [10] present\nseveral conditions on \u039bX which are sufficient for identifiability but require k < \u230ad/2\u230b, and\nother weaker conditions which are necessary.\n\nNon-Additive Noise The above estimation method is also useful in a less straightforward\nsituation. Until now we have considered only additive noise, in which the distribution\nof yi \u2212 xi was independent of xi. We will now relax this restriction and allow\nmore general conditional distributions yi|xi, requiring only that E [yi|xi] = xi. 
With this\nrequirement, together with the structural constraint (yi independent given x), for any i \u2260 j:\n\nCov [yi, yj] = E [yi yj] \u2212 E [yi] E [yj] = E [E [yi yj|x]] \u2212 E [E [yi|x]] E [E [yj|x]]\n\n= E [E [yi|x] E [yj|x]] \u2212 E [xi] E [xj] = E [xi xj] \u2212 E [xi] E [xj] = Cov [xi, xj].\n\n[Figure 1 plots: |sin \u0398|_2 against (a) sample size, (b) spread of noise scale (max/min ratio), (c) sample size (number of observed rows); curves: L2, variance-ignored, and ML with known variances in (a,b); full L2 and variance-ignored in (c).]\n\nAs in the non-identical additive noise case, \u039bY agrees with \u039bX except on the diagonal.\nEven if yi|xi is identically conditionally distributed for all i, the difference \u039bY \u2212 \u039bX is\nnot in general a scaled identity:\n\nVar [yi] = E [E [yi\u00b2|xi] \u2212 E [yi|xi]\u00b2] + E [E [yi|xi]\u00b2] \u2212 E [yi]\u00b2 = E [Var [yi|xi]] + Var [xi].\n\nUnlike the additive noise case, the variance of yi|xi\ndepends on xi, and so its expectation depends on the distribution of xi.\nThese observations suggest using the variance-ignoring estimator. Figure 1(c) demonstrates\nhow such an estimator succeeds in reconstruction when yi|xi is exponentially distributed\nwith mean xi, even though the standard L2 estimator is not applicable. We cannot guarantee\nconsistency because the decomposition of the covariance matrix might not be unique,\nbut when k < \u230ad/2\u230b this is not likely to happen. Note that if the conditional distribution\ny|x is known, even if the decomposition is not unique, the correct signal covariance might\nbe identifiable based on the relationship between the signal marginals and the expected\nconditional variance of y|x, but this is not captured by the variance-ignoring estimator.\n\n5 Low Rank Approximation with a Gaussian Mixture Noise Model\n\nWe return to additive noise, but seeking better estimation with limited data, we confront\nnon-Gaussian noise distributions directly: we would like to find the maximum-likelihood\nX when Y = X + Z, and Zij are distributed according to a Gaussian mixture:\n\npZ(zij) = \u2211_{c=1}^{m} p_c (2\u03c0\u03c3_c\u00b2)^{\u22121/2} exp(\u2212(zij \u2212 \u00b5_c)\u00b2/(2\u03c3_c\u00b2)).\n\nTo do so, we introduce latent variables Cij specifying the mixture component of the noise\nat Yij, and solve the problem using EM. In the Expectation step, we compute the posterior\nprobabilities Pr (Cij|Yij; X) based on the current low-rank parameter matrix X. In the\nMaximization step we need to find the low-rank matrix X that maximizes the posterior\nexpected log-likelihood:\n\nE_{C|Y} [log Pr (Y = X + Z|C; X)] = \u2212\u2211_{ij} \u2211_c Pr(Cij = c|Yij) (Xij \u2212 (Yij \u2212 \u00b5_c))\u00b2 / (2\u03c3_c\u00b2) + Const\n\n= \u2212(1/2) \u2211_{ij} Wij (Xij \u2212 Aij)\u00b2 + Const    (1)\n\nwhere Wij = \u2211_c Pr(Cij = c|Yij)/\u03c3_c\u00b2 and Aij = Yij \u2212 \u2211_c Pr(Cij = c|Yij) \u00b5_c/(\u03c3_c\u00b2 Wij).\n\nThis is a weighted Frobenius low-rank approximation (WLRA) problem. Equipped with a\nWLRA optimization method [5], we can now perform EM iterations in order to find the matrix\nX maximizing the likelihood of the observed matrix Y. At each M step it is enough to\nperform a single WLRA optimization iteration, which is guaranteed to improve the WLRA\nobjective, and so also the likelihood. The method can be augmented to handle an unknown\nGaussian mixture, by introducing an optimization of the mixture parameters at each M\niteration.\n\nExperiments with GSMs We report here initial experiments with ML estimation using\nbounded Gaussian scale mixtures [11], i.e. a mixture of Gaussians with zero mean, and\nvariance bounded from below. 
Gaussian scale mixtures (GSMs) are a rich class of symmetric\ndistributions, which include non-log-concave and heavy-tailed distributions. We\ninvestigated two noise distributions: a \u201cGaussian with outliers\u201d distribution formed as a\nmixture of two zero-mean Gaussians with widely varying variances; and a Laplace distribution\np(z) \u221d e^{\u2212|z|}, which is an infinite scale mixture of Gaussians. Figures 2(a,b)\nshow the quality of reconstruction of the L2 estimator and the ML bounded GSM estimator,\nfor these two noise distributions, for a fixed sample size of 300 rows, under varying\nsignal strengths. We allowed ten Gaussian components, and did not observe any significant\nchange in the estimator when the number of components increases.\n\nFigure 2: Norm of sines of canonical angles to correct subspace: (a) Random rank-3 subspace in\n\u211d^10 with Laplace noise. (b) Random rank-2 subspace in \u211d^10 with 0.99N (0, 1) + 0.01N (0, 100) noise.\n(c) span(2, 1, 1)\u2032 \u2282 \u211d^3 with 0.9N (0, 1) + 0.1N (0, 25) noise. The ML estimator converges to (2.34, 1, 1). Bars are one\nstandard deviation tall. Insert: sine norm of ML est. plotted against sine norm of L2 est.\n\nThe ML estimator is overall more accurate than the L2 estimator: it succeeds in reliably\nreconstructing the low-rank signal for signals which are approximately three times weaker\nthan those necessary for reliable reconstruction using the L2 estimator. The improvement\nin performance is not as dramatic, but still noticeable, for Laplace noise.\n\nComparison with Newton\u2019s Methods Confronted with a general additive noise distribution,\nthe approach presented here would be to rewrite, or approximate, it as a Gaussian\nmixture and use WLRA in order to learn X using EM. A different approach is to consider\nthe second-order Taylor expansions of the log-likelihood, with respect to the entries of\nX, and iteratively maximize them using WLRA [5, 7]. 
Such an approach requires calculating\nthe first and second derivatives of the density. If the density is not specified analytically,\nor is unknown, these quantities need to be estimated. But beyond these issues, which can be\novercome, lies the major problem of Newton\u2019s method: the noise density must be strictly\nlog-concave and differentiable. If the distribution is not log-concave, the quadratic expansion\nof the log-likelihood will be unbounded and will not admit an optimum. Attempting\nto ignore this fact, and for example \u201coptimizing\u201d U given V using the equations derived\nfor non-negative weights, would actually drive us towards a saddle-point rather than a local\noptimum. The non-concavity does not only mean that we are not guaranteed a global optimum\n(which we are not guaranteed in any case, due to the non-convexity of the low-rank\nrequirement); it does not yield even local improvements. On the other hand, approximating\nthe distribution as a Gaussian mixture and using the EM method might still get stuck\nin local minima, but is at least guaranteed local improvement.\n\nLimiting ourselves to only log-concave distributions is a rather strong limitation, as it\nprecludes, for example, all heavy-tailed distributions. Consider even the \u201cbalanced tail\u201d\nLaplace distribution p(z) \u221d e^{\u2212|z|}. Since the log-density is piecewise linear, a quadratic\napproximation of it is a line, which of course does not attain a minimum value.\n\nConsistency Despite the gains in reconstruction presented above, the ML estimator may\nsuffer from an asymptotic bias, making it inferior to the L2 estimator on large samples. We\nstudy the asymptotic limit of the ML estimator, for a known product distribution p. 
We first\nestablish a necessary and sufficient condition for consistency of the estimator.\nThe ML estimator is the minimizer of the empirical mean of the random function \u03a6(V) =\nmin_u(\u2212 log p(y \u2212 uV\u2032)). When the number of samples increases, the empirical means converge\nto the true means, and if E [\u03a6(V1)] < E [\u03a6(V2)], then with probability approaching\none V2 will not minimize \u00ca [\u03a6(V)]. For the ML estimator to be consistent, E [\u03a6(V)] must\nbe minimized by V0, establishing a necessary condition for consistency.\n\n[Figure 2 plots: |sin \u0398|_2 against signal variance / noise variance in (a,b) and against sample size (number of observed rows) in (c); curves: L2 and ML in (a); L2, ML with known noise model, and ML with nuisance noise model in (b); ML and L2 in (c).]\n\nThe sufficiency of this condition rests on the uniform convergence of {\u00ca [\u03a6(V)]}, which\ndoes not generally exist, or at least on uniform divergence from E [\u03a6(V0)]. It should be\nnoted that the issue here is whether the ML estimator converges at all, since if it does converge,\nit must converge to the minimizer of E [\u03a6(V)]. Such convergence can be demonstrated\nat least in the special case when the marginal noise density p(zi) is continuous,\nstrictly positive, and has finite variance and differential entropy. Under these conditions,\nthe ML estimator is consistent if and only if V0 is the unique minimizer of E [\u03a6(V)].\nWhen discussing E [\u03a6(V)], the expectation is with respect to the noise distribution and\nthe signal distribution. This is not quite satisfactory, as we would like results which are\nindependent of the signal distribution, beyond the rank of its support. To do so, we must\nensure the expectation of \u03a6(V) is minimized on V0 for all possible signals (and not only in\nexpectation). 
Denote the objective \u03c6(y; V) = min_u(\u2212 log p(y \u2212 uV\u2032)). For any x \u2208 \u211d^d,\nconsider \u03a8(V; x) = E_z [\u03c6(x + z; V)], where the expectation is only over the additive noise\nz. Under the previous conditions guaranteeing the ML estimator converges, it is consistent\nfor any signal distribution if and only if, for all x \u2208 \u211d^d, \u03a8(V; x) is minimized with respect\nto V exactly when x \u2208 span V.\nIt will be instructive to first revisit the ML estimator in the presence of i.i.d. Gaussian\nnoise, i.e. the L2 estimator which we already showed is consistent. We will consider the\ndecomposition y = y\u2225 + y\u22a5 of vectors into their projection y\u2225 onto the subspace V, and the\nresidual y\u22a5. Any rotation of p is an isotropic Gaussian, and so z\u22a5 and z\u2225 are independent,\nand p(y) = p\u2225(y\u2225)p\u22a5(y\u22a5). We can now analyze:\n\n\u03c6(y; V) = min_u (\u2212 log p\u2225(y\u2225 + uV\u2032) \u2212 log p\u22a5(y\u22a5)) = \u2212 log p\u2225(0) + (1/\u03c3\u00b2)|y\u22a5|\u00b2 + Const\n\nyielding \u03a8(V; x) \u221d E_{z\u22a5} [|x\u22a5 + z\u22a5|\u00b2] + Const, which is minimized when x\u22a5 = 0, i.e. x\nis spanned by V. We thus re-derived the consistency of the L2 estimator directly, for the\nspecial case in which the noise is indeed Gaussian.\n\nThis consistency proof employed a key property of the isotropic Gaussian: rotations of an\nisotropic Gaussian random variable remain i.i.d. As this property is unique to Gaussian\nrandom variables, other ML estimators might not be consistent. In fact, we will shortly see\nthat the ML estimator for a known Laplace noise model is not consistent. To do so, we will\nnote that a necessary condition for consistency, if the density function p is continuous, is\nthat \u03a8(V; 0) = E [\u03c6(z; V)] is constant over all V. Otherwise we have \u03a8(V1; 0) < \u03a8(V2; 0)\nfor some V1, V2, and for small enough x \u2208 V2, \u03a8(V1; x) < \u03a8(V2; x). 
A non-constant\n\u03a8(V; 0) indicates an a-priori bias towards certain sub-spaces.\nThe negative log-likelihood of a Laplace distribution, p(zi) = (1/2) e^{\u2212|zi|}, is essentially the\nL1 norm. Consider a rank-one approximation in a two-dimensional space with Laplace\nnoise. For any V = (1, \u03b1), 0 \u2264 \u03b1 \u2264 1, and (z1, z2), the L1 norm |z + uV\u2032|_1 is minimized\nwhen z1 + u = 0 yielding \u03c6(z; V) = |z2 \u2212 \u03b1z1|, ignoring a constant term, and\n\n\u03a8(V; 0) = \u222b\u222b (1/4) e^{\u2212|z1|\u2212|z2|} |z2 \u2212 \u03b1z1| dz1 dz2 = (\u03b1\u00b2 + \u03b1 + 1)/(\u03b1 + 1),\n\nwhich is monotonically increasing in \u03b1 in the\nvalid range [0, 1]. In particular, 1 = \u03a8((1, 0); 0) < \u03a8((1, 1); 0) = 3/2 and the estimator is\nbiased towards being axis-aligned.\n\nFigure 2(c) demonstrates such an asymptotic bias empirically. Two-component Gaussian\nmixture noise was added to a rank-one signal in \u211d^3, and the signal subspace was estimated\nusing an ML estimator with known noise model, and an L2 estimator. For small data sets,\nthe ML estimator is more accurate, but as the number of samples increases, the error of the\nL2 estimator vanishes, while the ML estimator converges to the wrong subspace.\n\n6 Discussion\n\nIn many applications few assumptions beyond independence can be made. We formulate\nthe problem of dimensionality reduction as semi-parametric estimation of the low-dimensional\nsignal, or \u201cfactor\u201d space, treating the signal distribution as unconstrained nuisance\nand the noise distribution as constrained nuisance. We present an estimator which is\nappropriate when the conditional means E [y|u] lie in a low-dimensional linear space, and\na maximum-likelihood estimator for additive Gaussian mixture noise.\nThe variance-ignoring estimator is also applicable when y can be transformed such that\nE [g(y)|u] lie in a low-rank linear space, e.g. 
in log-normal models.\nIf the conditional\ndistribution y|x is known, this amounts to an unbiased estimator for xi. When such a\ntransformation is not known, we may wish to consider it as nuisance.\n\nWe draw attention to the fact that maximum-likelihood low-rank estimation cannot be taken\nfor granted, and demonstrate that it might not be consistent even for known noise models.\nThe approach employed here can also be used to investigate the consistency of ML estimators\nwith non-additive noise models. Of particular interest are distributions yi|xi that form\nexponential families where xi are the natural parameters [6]. When the mean parameters\nform a low-rank linear subspace, the variance-ignoring estimator is applicable, but when\nthe natural parameters form a linear subspace, the means are in general curved, and there is\nno unbiased estimator for the natural parameters. Initial investigation reveals that, for example,\nthe ML estimator for a Bernoulli (logistic) conditional distribution is not consistent.\nThe problem of finding a consistent estimator for the linear subspace of natural parameters\nwhen yi|xi forms an exponential family remains open.\nWe also leave open the efficiency of the various estimators, and the problem of finding\nasymptotically efficient estimators, and consistent estimators exhibiting the finite-sample\ngains of the ML estimator for additive Gaussian mixture noise.\n\nReferences\n\n[1] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788\u2013791, 1999.\n\n[2] Orly Alter, Patrick O. Brown, and David Botstein. Singular value decomposition for genome-wide expression data processing and modeling. PNAS, 97(18):10101\u201310106, 2000.\n\n[3] Yossi Azar, Amos Fiat, Anna R. Karlin, Frank McSherry, and Jared Saia. Spectral analysis of data. In 33rd ACM Symposium on Theory of Computing, 2001.\n\n[4] M. E. Tipping and C. M. Bishop. 
Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611\u2013622, 1999.\n\n[5] Nathan Srebro and Tommi Jaakkola. Weighted low rank approximation. In 20th International Conference on Machine Learning, 2003.\n\n[6] M. Collins, S. Dasgupta, and R. E. Schapire. A generalization of principal components analysis to the exponential family. In Advances in Neural Information Processing Systems 14, 2002.\n\n[7] Geoffrey J. Gordon. Generalized\u00b2 linear\u00b2 models. In Advances in Neural Information Processing Systems 15, 2003.\n\n[8] G. W. Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, Inc, 1990.\n\n[9] Michal Irani and P Anandan. Factorization with uncertainty. In 6th European Conference on Computer Vision, 2000.\n\n[10] T. W. Anderson and Herman Rubin. Statistical inference in factor analysis. In Third Berkeley Symposium on Mathematical Statistics and Probability, volume V, pages 111\u2013150, 1956.\n\n[11] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In Advances in Neural Information Processing Systems 12, 2000.\n", "award": [], "sourceid": 2431, "authors": [{"given_name": "Nathan", "family_name": "Srebro", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}]}