{"title": "Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 2670, "page_last": 2678, "abstract": "We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-$d$ projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of $d=1$, our result implies the near- optimality of DSPCA even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical  properties of the method for arbitrary input matrices that extends the  applicability and provable guarantees to a wide array of settings.  We  demonstrate this with an application to Kendall's tau correlation matrices  and transelliptical component analysis.", "full_text": "Fantope Projection and Selection:\n\nA near-optimal convex relaxation of sparse PCA\n\nVincent Q. Vu\n\nThe Ohio State University\nvqv@stat.osu.edu\n\nJuhee Cho\n\nUniversity of Wisconsin, Madison\nchojuhee@stat.wisc.edu\n\nJing Lei\n\nCarnegie Mellon University\n\nleij09@gmail.com\n\nKarl Rohe\n\nUniversity of Wisconsin, Madison\nkarlrohe@stat.wisc.edu\n\nAbstract\n\nWe propose a novel convex relaxation of sparse principal subspace estimation\nbased on the convex hull of rank-d projection matrices (the Fantope). The convex\nproblem can be solved ef\ufb01ciently using alternating direction method of multipli-\ners (ADMM). 
We establish a near-optimal convergence rate, in terms of the spar-\nsity, ambient dimension, and sample size, for estimation of the principal subspace\nof a general covariance matrix without assuming the spiked covariance model.\nIn the special case of d = 1, our result implies the near-optimality of DSPCA\n(d\u2019Aspremont et al. [1]) even when the solution is not rank 1. We also provide a\ngeneral theoretical framework for analyzing the statistical properties of the method\nfor arbitrary input matrices that extends the applicability and provable guarantees\nto a wide array of settings. We demonstrate this with an application to Kendall\u2019s\ntau correlation matrices and transelliptical component analysis.\n\n1 Introduction\n\nPrincipal components analysis (PCA) is a popular technique for unsupervised dimension reduction\nthat has a wide range of application\u2014science, engineering, and any place where multivariate data\nis abundant. PCA uses the eigenvectors of the sample covariance matrix to compute the linear\ncombinations of variables with the largest variance. These principal directions of variation explain\nthe covariation of the variables and can be exploited for dimension reduction.\nIn contemporary\napplications where variables are plentiful (large p) but samples are relatively scarce (small n), PCA\nsuffers from two major weaknesses : 1) the interpretability and subsequent use of the principal\ndirections is hindered by their dependence on all of the variables; 2) it is generally inconsistent in\nhigh-dimensions, i.e. the estimated principal directions can be noisy and unreliable [see 2, and the\nreferences therein].\nOver the past decade, there has been a fever of activity to address the drawbacks of PCA with a class\nof techniques called sparse PCA that combine the essence of PCA with the assumption that the phe-\nnomena of interest depend mostly on a few variables. 
Examples include algorithmic [e.g., 1, 3\u201310]\nand theoretical [e.g., 11\u201314] developments. However, much of this work has focused on the \ufb01rst\nprincipal component. One rationale behind this focus is by analogy with ordinary PCA: additional\ncomponents can be found by iteratively de\ufb02ating the input matrix to account for variation uncovered\nby previous components. However, the use of de\ufb02ation with sparse PCA entails complications of\nnon-orthogonality, sub-optimality, and multiple tuning parameters [15]. Identi\ufb01ability and consis-\ntency present more subtle issues. The principal directions of variation correspond to eigenvectors\nof some population matrix \u03a3. There is no reason to assume a priori that the d largest eigenvalues\n\n1\n\n\fof \u03a3 are distinct. Even if the eigenvalues are distinct, estimates of individual eigenvectors can be\nunreliable if the gap between their eigenvalues is small. So it seems reasonable, if not necessary,\nto de-emphasize eigenvectors and to instead focus on their span, i.e.\nthe principal subspace of\nvariation.\nThere has been relatively little work on the problem of estimating the principal subspace or even\nmultiple eigenvectors simultaneously. Most works that do are limited to iterative de\ufb02ation schemes\nor optimization problems whose global solution is intractable to compute. Sole exceptions are the\ndiagonal thresholding method [2], which is just ordinary PCA applied to the subset of variables\nwith largest marginal sample variance, or re\ufb01nements such as iterative thresholding [16], which use\ndiagonal thresholding as an initial estimate. These works are limited, because they cannot be used\nwhen the variables have equal variances (e.g., correlation matrices). Theoretical results are equally\nlimited in their applicability. 
Although the optimal minimax rates for the sparse principal subspace problem are known in both the spiked [17] and general [18] covariance models, existing statistical guarantees only hold under the restrictive spiked covariance model, which essentially guarantees that diagonal thresholding has good properties, or for estimators that are computationally intractable.
In this paper, we propose a novel convex optimization problem to estimate the d-dimensional principal subspace of a population matrix Σ based on a noisy input matrix S. We show that if S is a sample covariance matrix and the projection Π of the d-dimensional principal subspace of Σ depends only on s variables, then with a suitable choice of regularization parameter, the Frobenius norm of the error of our estimator X̂ is bounded with high probability:

|||X̂ − Π|||2 = O( (λ1/δ) s √(log p / n) ) ,

where λ1 is the largest eigenvalue of Σ and δ is the gap between the dth and (d+1)th largest eigenvalues of Σ. This rate turns out to be nearly minimax optimal (Corollary 3.3), and under additional assumptions on signal strength, it also allows us to recover the support of the principal subspace (Theorem 3.2). Moreover, we provide easy-to-verify conditions (Theorem 3.3) that yield near-optimal statistical guarantees for other choices of input matrix, such as Pearson's correlation and Kendall's tau correlation matrices (Corollary 3.4).
Our estimator turns out to be a semidefinite program (SDP) that generalizes the DSPCA approach of [1] to d ≥ 1 dimensions. It is based on a convex body, called the Fantope, that provides a tight relaxation for simultaneous rank and orthogonality constraints on the positive semidefinite cone. Solving the SDP is non-trivial. 
We \ufb01nd that an alternating direction method of multipliers (ADMM)\nalgorithm [e.g., 19] can ef\ufb01ciently compute its global optimum (Section 4).\nIn summary, the main contributions of this paper are as follows.\n\n1. We formulate the sparse principal subspace problem as a novel semide\ufb01nite program with\n\na Fantope constraint (Section 2).\n\n2. We show that the proposed estimator achieves a near optimal rate of convergence in sub-\nspace estimation without assumptions on the rank of the solution or restrictive spiked co-\nvariance models. This is a \ufb01rst for both d = 1 and d > 1 (Section 3).\n\n3. We provide a general theoretical framework that accommodates other matrices, in addition\n\nto sample covariance, such as Pearson\u2019s correlation and Kendall\u2019s tau.\n\n4. We develop an ef\ufb01cient ADMM algorithm to solve the SDP (Section 4), and provide nu-\nmerical examples that demonstrate the superiority of our approach over de\ufb02ation methods\nin both computational and statistical ef\ufb01ciency (Section 5).\n\nThe remainder of the paper explains each of these contributions in detail, but we defer all proofs to\nAppendix A.\n\nRelated work Existing work most closely related to ours is the DSPCA approach for single com-\nponent sparse PCA that was \ufb01rst proposed in [1]. Subsequently, there has been theoretical analysis\nunder a spiked covariance model and restrictions on the entries of the eigenvectors [11], and algo-\nrithmic developments including block coordinate ascent [9] and ADMM [20]. The crucial difference\nwith our work is that this previous work only considered d = 1. The d > 1 case requires invention\nand novel techniques to deal with a convex body, the Fantope, that has never before been used in\nsparse PCA.\n\n2\n\n\fNotation For matrices A, B of compatible dimension \u27e8A, B\u27e9 := tr(AT B) is the Frobenius inner\nproduct, and |||A|||2\n2 := \u27e8A, A\u27e9 is the squared Frobenius norm. 
∥x∥q is the usual ℓq norm, with ∥x∥0 defined as the number of nonzero entries of x. ∥A∥a,b is the (a, b)-norm, defined to be the ℓb norm of the vector of rowwise ℓa norms of A; e.g., ∥A∥1,∞ is the maximum absolute row sum. For a symmetric matrix A, we define λ1(A) ≥ λ2(A) ≥ ··· to be the eigenvalues of A with multiplicity. When the context is obvious we write λj := λj(A) as shorthand. For two subspaces M1 and M2, sin Θ(M1, M2) is defined to be the matrix whose diagonals are the sines of the canonical angles between the two subspaces [see 21, §VII].

2 Sparse subspace estimation

Given a symmetric input matrix S, we propose a sparse principal subspace estimator X̂ that is defined to be a solution of the semidefinite program

maximize ⟨S, X⟩ − λ∥X∥1,1
subject to X ∈ F^d ,          (1)

in the variable X, where

F^d := {X : 0 ⪯ X ⪯ I and tr(X) = d}

is a convex body called the Fantope [22, §2.3.2], and λ ≥ 0 is a regularization parameter that encourages sparsity. When d = 1, the spectral norm bound in F^d is redundant and (1) reduces to the DSPCA approach of [1]. The motivation behind (1) is based on two key insights.
The first insight is a variational characterization of the principal subspace of a symmetric matrix. The sum of the d largest eigenvalues of a symmetric matrix A can be expressed as

∑_{i=1}^d λi(A) (a)= max_{V^T V = I_d} ⟨A, V V^T⟩ (b)= max_{X ∈ F^d} ⟨A, X⟩ .          (2)

Identity (a) is an extremal property known as Ky Fan's maximum principle [23]; (b) is based on the less well known observation that

F^d = conv({V V^T : V^T V = I_d}) ,

i.e. the extremal points of F^d are the rank-d projection matrices. 
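The two characterizations in (2) are easy to check numerically. The sketch below (our illustration, not from the paper) verifies Ky Fan's maximum principle by comparing the sum of the d largest eigenvalues of a random symmetric matrix with ⟨A, V V^T⟩, where V collects the top-d eigenvectors, so that V V^T is an extremal point of the Fantope:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 8, 3
A = rng.standard_normal((p, p))
A = (A + A.T) / 2  # symmetrize

# eigh returns eigenvalues in ascending order
evals, evecs = np.linalg.eigh(A)
top_sum = evals[-d:].sum()   # sum of the d largest eigenvalues
V = evecs[:, -d:]            # top-d eigenvectors, V^T V = I_d
X = V @ V.T                  # rank-d projection: an extremal point of F^d

# Ky Fan's maximum principle: the maximum of <A, X> over F^d
# is attained at V V^T and equals the sum of the top-d eigenvalues.
assert np.isclose(top_sum, np.trace(A @ X))

# X lies in the Fantope: eigenvalues in [0, 1] and trace equal to d
fantope_evals = np.linalg.eigvalsh(X)
assert np.all(fantope_evals > -1e-9) and np.all(fantope_evals < 1 + 1e-9)
assert np.isclose(np.trace(X), d)
```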
See [24] for proofs of both.
The second insight is a connection between the (1, 1)-norm and a notion of subspace sparsity introduced by [18]. Any X ⪰ 0 can be factorized (non-uniquely) as X = V V^T.

Lemma 2.1. If X = V V^T, then ∥X∥1,1 ≤ ∥V∥²2,1 ≤ ∥V∥²2,0 tr(X).

Consequently, any X ∈ F^d that has at most s non-zero rows necessarily has ∥X∥1,1 ≤ s²d. Thus, ∥X∥1,1 is a convex relaxation of what [18] call row sparsity for subspaces.
These two insights reveal that (1) is a semidefinite relaxation of the non-convex problem

maximize ⟨S, V V^T⟩ − λ∥V∥²2,0 d
subject to V^T V = I_d .

[18] proposed solving an equivalent form of the above optimization problem and showed that the estimator corresponding to its global solution is minimax rate optimal under a general statistical model for S. Their estimator requires solving an NP-hard problem. The advantage of (1) is that it is computationally tractable.

Subspace estimation  The constraint X̂ ∈ F^d guarantees that its rank is ≥ d. However X̂ need not be an extremal point of F^d, i.e. a rank-d projection matrix. In order to obtain a proper d-dimensional subspace estimate, we can extract the d leading eigenvectors of X̂, say V̂, and form the projection matrix Π̂ = V̂ V̂^T. The projection is unique, but the choice of basis is arbitrary. We can follow the convention of standard PCA by choosing an orthogonal matrix O so that (V̂O)^T S (V̂O) is diagonal, and take V̂O as the orthonormal basis for the subspace estimate.

3 Theory

In this section we describe our theoretical framework for studying the statistical properties of X̂ given by (1) with arbitrary input matrices that satisfy the following assumptions.
Assumption 1 (Symmetry). S and Σ are p × p symmetric matrices.
Assumption 2 (Identifiability). 
δ = δ(Σ) = λd(Σ) − λd+1(Σ) > 0.
Assumption 3 (Sparsity). The projection Π onto the subspace spanned by the eigenvectors of Σ corresponding to its d largest eigenvalues satisfies ∥Π∥2,0 ≤ s, or equivalently, ∥diag(Π)∥0 ≤ s.
The key result (Theorem 3.1 below) implies that the statistical properties of the error of the estimator

Δ := X̂ − Π

can be derived, in many cases, by routine analysis of the entrywise errors of the input matrix

W := S − Σ .

There are two main ideas in our analysis of X̂. The first is relating the difference in the values of the objective function in (1) at Π and X̂ to Δ. The second is exploiting the decomposability of the regularizer. Conceptually, this is the same approach taken by [25] in analyzing the statistical properties of regularized M-estimators. It is worth noting that the curvature result in our problem comes from the geometry of the constraint set in (1). It is different from the "restricted strong convexity" in [25], a notion of curvature tailored for regularization in the form of penalizing an unconstrained convex objective.

3.1 Variational analysis on the Fantope

The first step of our analysis is to establish a bound on the curvature of the objective function along the Fantope and away from the truth.
Lemma 3.1 (Curvature). Let A be a symmetric matrix and E be the projection onto the subspace spanned by the eigenvectors of A corresponding to its d largest eigenvalues λ1 ≥ λ2 ≥ ···. If δ_A = λd − λd+1 > 0, then

(δ_A / 2) |||E − F|||²2 ≤ ⟨A, E − F⟩

for all F satisfying 0 ⪯ F ⪯ I and tr(F) = d.
A version of Lemma 3.1 first appeared in [18] with the additional restriction that F is a projection matrix. 
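The curvature inequality of Lemma 3.1 can be sanity-checked numerically. The sketch below (our illustration, with an assumed random construction of a Fantope point F) builds F as a convex combination of two random rank-d projections, so F lies in the Fantope without being a projection, exactly the extension beyond [18]:

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 10, 3

# Random symmetric matrix with (almost surely) a positive eigengap
A = rng.standard_normal((p, p))
A = (A + A.T) / 2
evals, evecs = np.linalg.eigh(A)          # ascending order
delta_A = evals[-d] - evals[-d - 1]       # lambda_d - lambda_{d+1}
E = evecs[:, -d:] @ evecs[:, -d:].T       # projection onto top-d eigenspace

def random_projection(p, d, rng):
    """Random rank-d projection matrix (an extremal point of the Fantope)."""
    Q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return Q @ Q.T

# A convex combination of projections lies in the Fantope but need not be one
F = 0.5 * random_projection(p, d, rng) + 0.5 * random_projection(p, d, rng)

lhs = 0.5 * delta_A * np.sum((E - F) ** 2)   # (delta_A / 2) |||E - F|||^2
rhs = np.trace(A @ (E - F))                  # <A, E - F>
assert delta_A > 0 and lhs <= rhs + 1e-9     # Lemma 3.1 curvature bound holds
```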
Our proof of the above extension is a minor modification of their proof.
The following is an immediate corollary of Lemma 3.1 and the Ky Fan maximum principle.
Corollary 3.1 (A sin Θ theorem [18]). Let A, B be symmetric matrices and M_A, M_B be their respective d-dimensional principal subspaces. If δ_{A,B} = [λd(A) − λd+1(A)] ∨ [λd(B) − λd+1(B)] > 0, then

|||sin Θ(M_A, M_B)|||2 ≤ (√2 / δ_{A,B}) |||A − B|||2 .

The advantage of Corollary 3.1 over the Davis-Kahan Theorem [see, e.g., 21, §VII.3] is that it does not require a bound on the differences between eigenvalues of A and eigenvalues of B. This means that typical applications of the Davis-Kahan Theorem require the additional invocation of Weyl's Theorem. Our primary use of this result is to show that even if rank(X̂) ≠ d, its principal subspace will be close to that of Π if Δ is small.
Corollary 3.2 (Subspace error bound). If M is the principal d-dimensional subspace of Σ and M̂ is the principal d-dimensional subspace of X̂, then

|||sin Θ(M, M̂)|||2 ≤ √2 |||Δ|||2 .

3.2 Deterministic error

With Lemma 3.1, it is straightforward to prove the following theorem.
Theorem 3.1 (Deterministic error bound). If λ ≥ ∥W∥∞,∞ and s ≥ ∥Π∥2,0, then

|||Δ|||2 ≤ 4sλ/δ .

Theorem 3.1 holds for any global optimizer X̂ of (1). It does not assume that the solution is rank-d as in [11]. The next theorem gives a sufficient condition for support recovery by diagonal thresholding X̂.
Theorem 3.2 (Support recovery). For all t > 0,

|{j : Πjj = 0, X̂jj ≥ t}| + |{j : Πjj ≥ 2t, X̂jj < t}| ≤ |||Δ|||²2 / t² .

As a consequence, the variable selection procedure Ĵ(t) := {j : X̂jj ≥ t} succeeds if min_{j : Πjj ≠ 0} Πjj ≥ 2t > 2|||Δ|||2.

3.3 Statistical properties

In this section we use Theorem 3.1 to derive the statistical properties of X̂ in a generic setting where the entries of W uniformly obey a restricted sub-Gaussian deviation inequality. This is not the most general result possible, but it allows us to illustrate the statistical properties of X̂ for two different types of input matrices: sample covariance and Kendall's tau correlation. The former is the standard input for PCA; the latter has recently been shown to be a useful robust and nonparametric tool for high-dimensional graphical models [26].
Theorem 3.3 (General statistical error bound). If there exist σ > 0 and n > 0 such that Σ and S satisfy

max_{ij} P( |Sij − Σij| ≥ t ) ≤ 2 exp( −4nt²/σ² )          (3)

for all t ≤ σ, and

λ = σ √(log p / n) ≤ σ ,          (4)

then

|||X̂ − Π|||2 ≤ (4σ/δ) s √(log p / n)

with probability at least 1 − 2/p².

Sample covariance  Consider the setting where the input matrix is the sample covariance matrix of a random sample of size n > 1 from a sub-Gaussian distribution. A random vector Y with Σ = Var(Y) has sub-Gaussian distribution if there exists a constant L > 0 such that

P( |⟨Y − EY, u⟩| ≥ t ) ≤ exp( −Lt² / ∥Σ^{1/2}u∥²2 )          (5)

for all u and t ≥ 0. Under this condition we have the following corollary of Theorem 3.3.
Corollary 3.3. Let S be the sample covariance matrix of an i.i.d. sample of size n > 1 from a sub-Gaussian distribution (5) with population covariance matrix Σ. 
If λ is chosen to satisfy (4) with σ = cλ1, then

|||X̂ − Π|||2 ≤ C (λ1/δ) s √(log p / n)

with probability at least 1 − 2/p², where c, C are constants depending only on L.
Comparing with the minimax lower bounds derived in [17, 18], we see that the rate in Corollary 3.3 is roughly larger than the optimal minimax rate by a factor of

√(λ1/λd+1) · √(s/d) .

The first term only becomes important in the near-degenerate case where λd+1 ≪ λ1. It is possible with much more technical work to get sharp dependence on the eigenvalues, but we prefer to retain brevity and clarity in our proof of the version here. The second term is likely to be unimprovable without additional conditions on S and Σ such as a spiked covariance model. Very recently, [14] showed in a testing framework with similar assumptions as ours when d = 1 that the extra factor √s is necessary for any polynomial time procedure if the planted clique problem cannot be solved in randomized polynomial time.

Kendall's tau  Kendall's tau correlation provides a robust and nonparametric alternative to ordinary (Pearson) correlation. Given an n × p matrix whose rows are i.i.d. p-variate random vectors, the theoretical and empirical versions of Kendall's tau correlation matrix are

τij := Cor( sign(Y1i − Y2i), sign(Y1j − Y2j) ) ,

τ̂ij := (2 / (n(n − 1))) ∑_{s<t} sign(Ysi − Yti) sign(Ysj − Ytj) .

A key feature of Kendall's tau is that it is invariant under strictly monotone transformations, i.e.

sign(Ysi − Yti) sign(Ysj − Ytj) = sign(fi(Ysi) − fi(Yti)) sign(fj(Ysj) − fj(Ytj)) ,

where fi, fj are strictly monotone transformations. 
When Y is multivariate Gaussian, there is also a one-to-one correspondence between τij and ρij = Cor(Y1i, Y1j) [27]:

τij = (2/π) arcsin(ρij) .          (6)

These two observations led [26] to propose using

T̂ij = sin( (π/2) τ̂ij ) if i ≠ j ,  and  T̂ij = 1 if i = j          (7)

as an input matrix to Gaussian graphical model estimators in order to extend the applicability of those procedures to the wider class of nonparanormal distributions [28]. This same idea was extended to sparse PCA by [29]; they proposed and analyzed using T̂ as an input matrix to the non-convex sparse PCA procedure of [13]. A shortcoming of that approach is that their theoretical guarantees only hold for the global solution of an NP-hard optimization problem. The following corollary of Theorem 3.3 rectifies the situation by showing that X̂ with Kendall's tau is nearly optimal.
Corollary 3.4. Let S = T̂ as defined in (7) for an i.i.d. sample of size n > 1 and let Σ = T be the analogous quantity with τij in place of τ̂ij. If λ is chosen to satisfy (4) with σ = √8 π, then

|||X̂ − Π|||2 ≤ (8√2 π / δ) s √(log p / n)

with probability at least 1 − 2/p².
Note that Corollary 3.4 only requires that τ̂ be computed from an i.i.d. sample. It does not specify the marginal distribution of the observations. So Σ = T is not necessarily positive semidefinite and may be difficult to interpret. However, under additional conditions, the following lemma gives meaning to T by extending (6) to a wide class of distributions, called transelliptical by [29], that includes the nonparanormal. See [29, 30] for further information.
Lemma ([29, 30]). If (Y11, . . . , Y1p) has continuous distribution and there exist monotone transformations f1, . . . , fp such that (f1(Y11), . . . , fp(Y1p)) has elliptical distribution with scatter matrix Σ̃, then

Tij = Σ̃ij / √( Σ̃ii Σ̃jj ) .

Moreover, if fj(Y1j), j = 1, . . . , p have finite variance, then Tij = Cor( fi(Y1i), fj(Y1j) ).

This lemma together with Corollary 3.4 shows that Kendall's tau can be used in place of the sample correlation matrix for a wide class of distributions without much loss of efficiency.

4 An ADMM algorithm

The chief difficulty in directly solving (1) is the interaction between the penalty and the Fantope constraint. Without either of these features, the optimization problem would be much easier. ADMM can exploit this fact if we first rewrite (1) as the equivalent equality constrained problem

minimize ∞ · 1_{F^d}(X) − ⟨S, X⟩ + λ∥Y∥1,1
subject to X − Y = 0 ,          (8)

in the variables X and Y, where 1_{F^d} is the 0-1 indicator function for F^d and we adopt the convention ∞ · 0 = 0. The augmented Lagrangian associated with (8) has the form

L_ρ(X, Y, U) := ∞ · 1_{F^d}(X) − ⟨S, X⟩ + λ∥Y∥1,1 + (ρ/2) ( |||X − Y + U|||²2 − |||U|||²2 ) ,          (9)

where U = (1/ρ)Z is the scaled ADMM dual variable and ρ is the ADMM penalty parameter [see 19, §3.1].

Algorithm 1 Fantope Projection and Selection (FPS)
Require: S = S^T, d ≥ 1, λ ≥ 0, ρ > 0, ϵ > 0
  Y(0) ← 0, U(0) ← 0          ◃ Initialization
  repeat for t = 0, 1, 2, 3, . . .
    X(t+1) ← P_{F^d}( Y(t) − U(t) + S/ρ )          ◃ Fantope projection
    Y(t+1) ← S_{λ/ρ}( X(t+1) + U(t) )          ◃ Elementwise soft thresholding
    U(t+1) ← U(t) + X(t+1) − Y(t+1)          ◃ Dual variable update
  until max( |||X(t) − Y(t)|||²2, ρ² |||Y(t) − Y(t−1)|||²2 ) ≤ dϵ²          ◃ Stopping criterion
  return Y(t)
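As an illustration, this ADMM splitting fits in a few lines of NumPy. The following is our own sketch (not the authors' code), with a fixed penalty ρ rather than the varying-penalty scheme, and with the Fantope projection computed by bisection on the monotone equation ∑_i γi⁺(θ) = d:

```python
import numpy as np

def fantope_projection(X, d, tol=1e-10):
    """Euclidean projection onto F^d = {0 <= X <= I, tr(X) = d}:
    shift the eigenvalues by theta, clip to [0, 1], choose theta so they sum to d."""
    gamma, U = np.linalg.eigh((X + X.T) / 2)
    lo, hi = gamma.min() - 1.0, gamma.max()  # bracket for theta
    while hi - lo > tol:
        theta = (lo + hi) / 2
        s = np.clip(gamma - theta, 0.0, 1.0).sum()  # monotone decreasing in theta
        if s > d:
            lo = theta
        else:
            hi = theta
    g = np.clip(gamma - (lo + hi) / 2, 0.0, 1.0)
    return (U * g) @ U.T

def soft_threshold(A, t):
    """Elementwise soft thresholding S_t(A) = sign(A) * max(|A| - t, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def fps(S, d, lam, rho=1.0, eps=1e-5, max_iter=500):
    """ADMM sketch for: maximize <S, X> - lam * ||X||_{1,1} over the Fantope F^d."""
    p = S.shape[0]
    Y = np.zeros((p, p))
    U = np.zeros((p, p))
    for _ in range(max_iter):
        X = fantope_projection(Y - U + S / rho, d)   # X-update
        Y_old = Y
        Y = soft_threshold(X + U, lam / rho)         # Y-update
        U = U + X - Y                                # dual update
        primal = np.sum((X - Y) ** 2)
        dual = rho ** 2 * np.sum((Y - Y_old) ** 2)
        if max(primal, dual) <= d * eps ** 2:        # stopping criterion
            break
    return Y
```

The bisection runs in O(log(1/tol)) eigenvalue-clipping passes per iteration, so each ADMM step is dominated by one eigendecomposition, as the closed-form projection lemma suggests.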
ADMM consists of iteratively minimizing L_ρ with respect to X, minimizing L_ρ with respect to Y, and then updating the dual variable. Algorithm 1 summarizes the main steps.
In light of the separation of X and Y in (9) and some algebraic manipulation, the X and Y updates reduce to computing the proximal operators

P_{F^d}(Y − U + S/ρ) := argmin_{X ∈ F^d} (1/2) |||X − (Y − U + S/ρ)|||²2 ,

S_{λ/ρ}(X + U) := argmin_Y (λ/ρ) ∥Y∥1,1 + (1/2) |||(X + U) − Y|||²2 .

S_{λ/ρ} is the elementwise soft thresholding operator [e.g., 19, §4.4.3] defined as

S_{λ/ρ}(x) = sign(x) max(|x| − λ/ρ, 0) .

P_{F^d} is the Euclidean projection onto F^d and is given in closed form in the following lemma.
Lemma 4.1 (Fantope projection). If X = ∑_i γi ui ui^T is a spectral decomposition of X, then

P_{F^d}(X) = ∑_i γi⁺(θ) ui ui^T ,

where γi⁺(θ) = min(max(γi − θ, 0), 1) and θ satisfies the equation ∑_i γi⁺(θ) = d.
Thus, P_{F^d}(X) involves computing an eigendecomposition of X, and then modifying the eigenvalues by solving a monotone, piecewise linear equation.
Rather than fix the ADMM penalty parameter ρ in Algorithm 1 at some constant value, we recommend using the varying penalty scheme described in [19, §3.4.1] that dynamically updates ρ after each iteration of the ADMM to keep the primal and dual residual norms (the two sums of squares in the stopping criterion of Algorithm 1) within a constant factor of each other. This eliminates an additional tuning parameter, and in our experience, yields faster convergence.

5 Simulation results

We conducted a simulation study to compare the effectiveness of FPS against three deflation-based methods: DSPCA (which is just FPS with d = 1), GPowerℓ1 [7], and SPC [5, 6]. 
These methods obtain multiple component estimates by taking the kth component estimate v̂k from input matrix Sk, and then re-running the method with the deflated input matrix: S_{k+1} = (I − v̂k v̂k^T) Sk (I − v̂k v̂k^T). The resulting d-dimensional principal subspace estimate is the span of v̂1, . . . , v̂d. Tuning parameter selection can be much more complicated for these iterative deflation methods. In our simulations, we simply fixed the regularization parameter to be the same for all d components.
We generated input matrices by sampling n = 100 i.i.d. observations from a N_p(0, Σ), p = 200 distribution and taking S to be the usual sample covariance matrix. We considered two different types of sparse Π = V V^T of rank d = 5: those with disjoint support for the nonzero entries of the columns of V and those with shared support. We generated V by sampling its nonzero entries from a standard Gaussian distribution and then orthonormalizing V while retaining the desired sparsity pattern. In both cases, the number of nonzero rows of V is equal to s ∈ {10, 25}. 

[Figure 1: Mean squared error (log MSE) of FPS, DSPCA with deflation, GPowerℓ1, and SPC across 100 replicates each of a variety of simulation designs with n = 100, p = 200, d = 5, s ∈ {10, 25}, noise σ² ∈ {1, 10}; panels arranged by support type (disjoint, shared) and (s, noise); x-axis: (2, 1)-norm of the estimate.]

We then embedded Π inside the population covariance matrix Σ = αΠ + (I − Π)Σ0(I − Π), where Σ0 is a Wishart matrix with p degrees of freedom and α > 0 is chosen so that the effective noise level (in the optimal minimax rate [18]), σ² = √(λ1 λd+1) / (λd − λd+1), lies in {1, 10}.
Figure 1 summarizes the resulting mean squared error |||Π̂ − Π|||²2 across 100 replicates for each of the different combinations of simulation parameters. Each method's regularization parameter varies over a range and the x-axis shows the (2, 1)-norm of the corresponding estimate. At the right extreme, all methods essentially correspond to standard PCA. It is clear that regularization is beneficial, because all the methods have significantly smaller MSE than standard PCA when they are sufficiently sparse. Comparing between methods, we see that FPS dominates in all cases, but the competition is much closer in the disjoint support case. Finally, all methods degrade when the number of active variables or noise level increases.

6 Discussion

Estimating sparse principal subspaces in high dimensions poses both computational and statistical challenges. The contribution of this paper\u2014a novel SDP-based estimator, an efficient algorithm, and strong statistical guarantees for a wide array of input matrices\u2014is a significant leap forward on both fronts. Yet, there are newly open problems and many possible extensions related to this work. For instance, it would be interesting to investigate the performance of FPS under a weak, rather than exact, sparsity assumption on Π (e.g., ℓq sparsity, 0 < q ≤ 1). The optimization problem (1) and the ADMM algorithm can easily be modified to handle other types of penalties. In some cases, extensions of Theorem 3.1 would require minimal modifications to its proof. Finally, the choices of dimension
Finally, the choices of dimension\nd and regularization parameter \u03bb are of great practical interest. Techniques like cross-validation\nneed to be carefully formulated and studied in the context of principal subspace estimation.\n\nAcknowledgments\nThis research was supported in part by NSF grants DMS-0903120, DMS-1309998, BCS-0941518,\nand NIH grant MH057881.\n\n8\n\n\fReferences\n[1] A. d\u2019Aspremont et al. \u201cA direct formulation of sparse PCA using semide\ufb01nite programming \u201d. In: SIAM\n\n[2]\n\n[3]\n\nReview 49.3 (2007).\nI. M. Johnstone and A. Y. Lu. \u201cOn consistency and sparsity for principal components analysis in high\ndimensions \u201d. In: JASA 104.486 (2009), pp. 682\u2013693.\nI. T. Jolliffe, N. T. Trenda\ufb01lov, and M. Uddin. \u201cA modi\ufb01ed principal component technique based on the\nLasso \u201d. In: JCGS 12 (2003), pp. 531\u2013547.\n\n[4] H. Zou, T. Hastie, and R. Tibshirani. \u201cSparse principal component analysis \u201d. In: JCGS 15.2 (2006),\n\npp. 265\u2013286.\n\n[5] H. Shen and J. Z. Huang. \u201cSparse principal component analysis via regularized low rank matrix approx-\n\nimation \u201d. In: Journal of Multivariate Analysis 99 (2008), pp. 1015\u20131034.\n\n[6] D. M. Witten, R. Tibshirani, and T. Hastie. \u201cA penalized matrix decomposition, with applications to\nsparse principal components and canonical correlation analysis \u201d. In: Biostatistics 10 (2009), pp. 515\u2013\n534.\n\n[7] M. Journee et al. \u201cGeneralized power method for sparse principal component analysis \u201d. In: JMLR 11\n\n(2010), pp. 517\u2013553.\n\n[8] B. K. Sriperumbudur, D. A. Torres, and G. R. G. Lanckriet. \u201cA majorization-minimization approach to\n\nthe sparse generalized eigenvalue problem \u201d. In: Machine Learning 85.1\u20132 (2011), pp. 3\u201339.\n\n[9] Y. Zhang and L. E. Ghaoui. \u201cLarge-scale sparse principal component analysis with application to text\n\ndata \u201d. In: NIPS 24. Ed. by J. 
Shawe-Taylor et al. 2011, pp. 532–539.
[10] X. Yuan and T. Zhang. "Truncated power method for sparse eigenvalue problems". In: JMLR 14 (2013), pp. 899–925.
[11] A. A. Amini and M. J. Wainwright. "High-dimensional analysis of semidefinite relaxations for sparse principal components". In: Ann. Statis. 37.5B (2009), pp. 2877–2921.
[12] A. Birnbaum et al. "Minimax bounds for sparse PCA with noisy high-dimensional data". In: Ann. Statis. 41.3 (2013), pp. 1055–1084.
[13] V. Q. Vu and J. Lei. "Minimax rates of estimation for sparse PCA in high dimensions". In: AISTATS 15. Ed. by N. Lawrence and M. Girolami. Vol. 22. JMLR W&CP. 2012, pp. 1278–1286.
[14] Q. Berthet and P. Rigollet. "Computational lower bounds for sparse PCA". In: (2013). arXiv: 1304.0828.
[15] L. Mackey. "Deflation methods for sparse PCA". In: NIPS 21. Ed. by D. Koller et al. 2009, pp. 1017–1024.
[16] Z. Ma. "Sparse principal component analysis and iterative thresholding". In: Ann. Statis. 41.2 (2013).
[17] T. T. Cai, Z. Ma, and Y. Wu. "Sparse PCA: optimal rates and adaptive estimation". In: Ann. Statis. (2013). To appear. arXiv: 1211.1309.
[18] V. Q. Vu and J. Lei. "Minimax sparse principal subspace estimation in high dimensions". In: Ann. Statis. (2013). To appear. arXiv: 1211.0373.
[19] S. Boyd et al. "Distributed optimization and statistical learning via the alternating direction method of multipliers". In: Foundations and Trends in Machine Learning 3.1 (2010), pp. 1–122.
[20] S. Ma. "Alternating direction method of multipliers for sparse principal component analysis". In: (2011). arXiv: 1111.6703.
[21] R. Bhatia. Matrix Analysis. Springer-Verlag, 1997.
[22] J. Dattorro. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing USA, 2005.
[23] K. Fan. "On a theorem of Weyl concerning eigenvalues of linear transformations I". In: Proceedings of the National Academy of Sciences 35.11 (1949), pp. 652–655.
[24] M. Overton and R. Womersley. "On the sum of the largest eigenvalues of a symmetric matrix". In: SIAM Journal on Matrix Analysis and Applications 13.1 (1992), pp. 41–45.
[25] S. N. Negahban et al. "A unified framework for the high-dimensional analysis of M-estimators with decomposable regularizers". In: Statistical Science 27.4 (2012), pp. 538–557.
[26] H. Liu et al. "High-dimensional semiparametric Gaussian copula graphical models". In: Ann. Statis. 40.4 (2012), pp. 2293–2326.
[27] W. H. Kruskal. "Ordinal measures of association". In: JASA 53.284 (1958), pp. 814–861.
[28] H. Liu, J. Lafferty, and L. Wasserman. "The nonparanormal: semiparametric estimation of high dimensional undirected graphs". In: JMLR 10 (2009), pp. 2295–2328.
[29] F. Han and H. Liu. "Transelliptical component analysis". In: NIPS 25. Ed. by P. Bartlett et al. 2012, pp. 368–376.
[30] F. Lindskog, A. McNeil, and U. Schmock. "Kendall's tau for elliptical distributions". In: Credit Risk. Ed. by G. Bol et al. Contributions to Economics. Physica-Verlag HD, 2003, pp. 149–156.
", "award": [], "sourceid": 1250, "authors": [{"given_name": "Vincent", "family_name": "Vu", "institution": "Ohio State University"}, {"given_name": "Juhee", "family_name": "Cho", "institution": "UW-Madison"}, {"given_name": "Jing", "family_name": "Lei", "institution": "CMU"}, {"given_name": "Karl", "family_name": "Rohe", "institution": "UW-Madison"}]}