{"title": "Sparse PCA via Bipartite Matchings", "book": "Advances in Neural Information Processing Systems", "page_first": 766, "page_last": 774, "abstract": "We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with \emph{disjoint} supports that jointly capture the maximum possible variance. Such components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but this greedy procedure is suboptimal. We present a novel algorithm for sparse PCA that jointly optimizes multiple disjoint components. The extracted features capture variance that lies within a multiplicative factor arbitrarily close to $1$ from the optimal. Our algorithm is combinatorial and computes the desired components by solving multiple instances of the bipartite maximum weight matching problem. Its complexity grows as a low-order polynomial in the ambient dimension of the input data, but exponentially in its rank. However, it can be effectively applied on a low-dimensional sketch of the input data. We evaluate our algorithm on real datasets and empirically demonstrate that in many cases it outperforms existing, deflation-based approaches.", "full_text": "Sparse PCA via Bipartite Matchings

Megasthenis Asteris (The University of Texas at Austin, megas@utexas.edu)
Dimitris Papailiopoulos (University of California, Berkeley, dimitrisp@berkeley.edu)
Anastasios Kyrillidis (The University of Texas at Austin, anastasios@utexas.edu)
Alexandros G. Dimakis (The University of Texas at Austin, dimakis@austin.utexas.edu)

Abstract

We consider the following multi-component sparse PCA problem: given a set of data points, we seek to extract a small number of sparse components with disjoint supports that jointly capture the maximum possible variance. 
Such components can be computed one by one, repeatedly solving the single-component problem and deflating the input data matrix, but this greedy procedure is suboptimal. We present a novel algorithm for sparse PCA that jointly optimizes multiple disjoint components. The extracted features capture variance that lies within a multiplicative factor arbitrarily close to 1 from the optimal. Our algorithm is combinatorial and computes the desired components by solving multiple instances of the bipartite maximum weight matching problem. Its complexity grows as a low-order polynomial in the ambient dimension of the input data, but exponentially in its rank. However, it can be effectively applied on a low-dimensional sketch of the input data. We evaluate our algorithm on real datasets and empirically demonstrate that in many cases it outperforms existing, deflation-based approaches.

1 Introduction

Principal Component Analysis (PCA) reduces data dimensionality by projecting the data onto principal subspaces spanned by the leading eigenvectors of the sample covariance matrix. It is one of the most widely used algorithms, with applications ranging from computer vision and document clustering to network anomaly detection (see, e.g., [1, 2, 3, 4, 5]). Sparse PCA is a useful variant that offers higher data interpretability [6, 7, 8], a property that is sometimes desired even at the cost of statistical fidelity [5]. Furthermore, when the obtained features are used in subsequent learning tasks, sparsity potentially leads to better generalization error [9].

Given a real $n \times d$ data matrix $S$ representing $n$ centered data points in $d$ variables, the first sparse principal component is the sparse vector that maximizes the explained variance:
$$x^\star \triangleq \arg\max_{\|x\|_2 = 1,\, \|x\|_0 = s} x^\top A x, \qquad (1)$$
where $A = \tfrac{1}{n} S^\top S$ is the $d \times d$ empirical covariance matrix. 
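Although (1) is intractable in general, at toy scale it can be solved exactly by enumerating supports: for each fixed support, the best unit vector is the leading eigenvector of the corresponding principal submatrix. A minimal sketch of this baseline (the function name and test matrix are ours, for illustration only):

```python
from itertools import combinations

import numpy as np

def sparse_pc_bruteforce(A, s):
    """Exactly solve (1) for small d by enumerating all (d choose s) supports.

    For a fixed support I, the best unit vector supported on I is the leading
    eigenvector of the principal submatrix A[I, I].
    """
    d = A.shape[0]
    best_val, best_x = -np.inf, None
    for I in combinations(range(d), s):
        w, V = np.linalg.eigh(A[np.ix_(I, I)])  # eigenvalues in ascending order
        if w[-1] > best_val:
            best_val = w[-1]
            best_x = np.zeros(d)
            best_x[list(I)] = V[:, -1]
    return best_x, best_val

rng = np.random.default_rng(0)
S = rng.standard_normal((20, 8))   # n = 20 samples, d = 8 variables
A = S.T @ S / 20                   # empirical covariance matrix
x_star, val = sparse_pc_bruteforce(A, s=3)
```

The $\binom{d}{s}$ enumeration is exactly what makes the problem hard at scale, which motivates the approximation algorithms discussed next.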
Unfortunately, the directly enforced sparsity constraint makes the problem NP-hard and hence computationally intractable in general. A significant volume of prior work has focused on algorithms for approximately solving this optimization problem [3, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17], while some theoretical results have also been established under statistical or spectral assumptions on the input data.

In most cases one is not interested in finding only the first sparse eigenvector, but rather the first k, where k is the reduced dimension onto which the data will be projected. Contrary to the single-component problem, there has been very limited work on computing multiple sparse components. The scarcity is partially attributed to conventional wisdom stemming from PCA: multiple components can be computed one by one, repeatedly solving the single-component sparse PCA problem (1) and deflating [18] the input data to remove information captured by previously extracted components. In fact, multi-component sparse PCA is not a uniquely defined problem in the literature. Deflation-based approaches can lead to different outputs depending on the type of deflation [18]; extracted components may or may not be orthogonal, and they may have disjoint or overlapping supports. In the statistics literature, where the objective is typically to recover a "true" principal subspace, a branch of work has focused on "subspace row sparsity" [19], an assumption that leads to sparse components all supported on the same set of variables. 
The authors of [20] discuss an alternative perspective on the fundamental objective of the sparse PCA problem.

We focus on the multi-component sparse PCA problem with disjoint supports, i.e., the problem of computing a small number of sparse components with non-overlapping supports that jointly maximize the explained variance:
$$X^\star \triangleq \arg\max_{X \in \mathcal{X}_k} \mathrm{Tr}\!\left(X^\top A X\right), \qquad (2)$$
$$\mathcal{X}_k \triangleq \left\{ X \in \mathbb{R}^{d \times k} : \|X_j\|_2 = 1,\ \|X_j\|_0 = s,\ \mathrm{supp}(X_i) \cap \mathrm{supp}(X_j) = \emptyset,\ \forall\, j \in [k],\ i < j \right\},$$
with $X_j$ denoting the $j$th column of $X$. The number k of desired components is considered a small constant. Contrary to the greedy sequential approach that repeatedly uses deflation, our algorithm jointly computes all the vectors in X and comes with theoretical approximation guarantees. Note that even if we could solve the single-component sparse PCA problem (1) exactly, the greedy approach could be highly suboptimal. We show this with a simple example in Sec. 7 of the appendix.

Our Contributions:

1. We develop an algorithm that provably approximates the solution to the sparse PCA problem (2) within a multiplicative factor arbitrarily close to optimal. Our algorithm is the first that jointly optimizes multiple components with disjoint supports, and it operates by recasting the sparse PCA problem into multiple instances of the bipartite maximum weight matching problem.

2. The computational complexity of our algorithm grows as a low-order polynomial in the ambient dimension d, but is exponential in the intrinsic dimension of the input data, i.e., the rank of A. To alleviate the impact of this dependence, our algorithm can be applied on a low-dimensional sketch of the input data to obtain an approximate solution to (2). 
This extra level of approximation introduces an additional penalty in our theoretical approximation guarantees, which naturally depends on the quality of the sketch and, in turn, the spectral decay of A.

3. We empirically evaluate our algorithm on real datasets and compare it against state-of-the-art methods for the single-component sparse PCA problem (1) in conjunction with the appropriate deflation step. In many cases, our algorithm significantly outperforms these approaches.

2 Our Sparse PCA Algorithm

We present a novel algorithm for the sparse PCA problem with multiple disjoint components. Our algorithm approximately solves the constrained maximization (2) on a d × d rank-r Positive Semi-Definite (PSD) matrix A within a multiplicative factor arbitrarily close to 1. It operates by recasting the maximization into multiple instances of the bipartite maximum weight matching problem. Each instance ultimately yields a feasible solution to the original sparse PCA problem: a set of k s-sparse components with disjoint supports. Finally, the algorithm exhaustively determines and outputs the set of components that maximizes the explained variance, i.e., the quadratic objective in (2).

The computational complexity of our algorithm grows as a low-order polynomial in the ambient dimension d of the input, but exponentially in its rank r. Despite the unfavorable dependence on the rank, it is unlikely that a substantial improvement can be achieved in general [21]. However, decoupling the dependence on the ambient and the intrinsic dimension of the input has an interesting ramification: instead of the original input A, our algorithm can be applied on a low-rank surrogate to obtain an approximate solution, alleviating the dependence on r. We discuss this in Section 3. 
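As a concrete illustration of the low-rank surrogate idea, the following sketch builds a rank-r approximation of the data via exact truncated SVD (one of the sketching options discussed in Section 3). For this particular choice, the spectral error of the surrogate covariance equals the (r+1)-th eigenvalue of A; all variable names here are ours, not the paper's:

```python
import numpy as np

# Build a rank-r surrogate of the data via exact truncated SVD.
rng = np.random.default_rng(2)
n, d, r = 200, 30, 4
S = rng.standard_normal((n, d)) * np.linspace(2.0, 0.1, d)  # decaying column scales
A = S.T @ S / n                                             # full covariance

U, sig, Vt = np.linalg.svd(S, full_matrices=False)
S_r = (U[:, :r] * sig[:r]) @ Vt[:r]                         # rank-r sketch of the data
A_r = S_r.T @ S_r / n                                       # surrogate covariance

spectral_err = np.linalg.norm(A - A_r, 2)
eigs = np.linalg.eigvalsh(A)[::-1]                          # eigenvalues, descending
```

The faster the spectrum of A decays, the smaller the price paid for working with the rank-r surrogate, which is exactly the trade-off formalized in Theorem 2.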
In the sequel, we describe the key ideas behind our algorithm, leading up to its guarantees in Theorem 1.

Let $A = U \Lambda U^\top$ denote the truncated eigenvalue decomposition of A; Λ is a diagonal r × r matrix whose ith diagonal entry is equal to the ith largest eigenvalue of A, while the columns of U coincide with the corresponding eigenvectors. By the Cauchy-Schwarz inequality, for any $x \in \mathbb{R}^d$,
$$x^\top A x = \left\| \Lambda^{1/2} U^\top x \right\|_2^2 \ge \left\langle \Lambda^{1/2} U^\top x,\, c \right\rangle^2, \qquad \forall\, c \in \mathbb{R}^r : \|c\|_2 = 1. \qquad (3)$$
In fact, equality in (3) can always be achieved for c collinear to $\Lambda^{1/2} U^\top x \in \mathbb{R}^r$, and in turn
$$x^\top A x = \max_{c \in \mathbb{S}_2^{r-1}} \left\langle x,\, U \Lambda^{1/2} c \right\rangle^2,$$
where $\mathbb{S}_2^{r-1}$ denotes the $\ell_2$-unit sphere in r dimensions. More generally, for any $X \in \mathbb{R}^{d \times k}$,
$$\mathrm{Tr}\!\left(X^\top A X\right) = \sum_{j=1}^{k} X_j^\top A X_j = \max_{C :\, C_j \in \mathbb{S}_2^{r-1}\ \forall j}\ \sum_{j=1}^{k} \left\langle X_j,\, U \Lambda^{1/2} C_j \right\rangle^2. \qquad (4)$$
Under the variational characterization of the trace objective in (4), the sparse PCA problem (2) can be re-written as a joint maximization over the variables X and C as follows:
$$\max_{X \in \mathcal{X}_k} \mathrm{Tr}\!\left(X^\top A X\right) = \max_{X \in \mathcal{X}_k}\ \max_{C :\, C_j \in \mathbb{S}_2^{r-1}\ \forall j}\ \sum_{j=1}^{k} \left\langle X_j,\, U \Lambda^{1/2} C_j \right\rangle^2. \qquad (5)$$
The alternative formulation of the sparse PCA problem in (5) may seem more complicated than the original one in (2). However, it takes a step towards decoupling the dependence of the optimization on the ambient and intrinsic dimensions d and r, respectively. 
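The identity underlying (3)-(4) is easy to verify numerically: the maximizing c is the normalized vector $\Lambda^{1/2} U^\top x$, at which the inner product attains $x^\top A x$ exactly. A short check (variable names are ours):

```python
import numpy as np

# Numerical check of (3)-(4): x' A x is attained by <x, U L^{1/2} c>^2
# at c proportional to L^{1/2} U' x.
rng = np.random.default_rng(3)
d, r = 10, 4
B = rng.standard_normal((d, r))
A = B @ B.T                                  # rank-r PSD matrix

lam, U = np.linalg.eigh(A)
lam, U = lam[::-1][:r], U[:, ::-1][:, :r]    # truncated eigendecomposition A = U L U'

x = rng.standard_normal(d)
x /= np.linalg.norm(x)

g = np.sqrt(lam) * (U.T @ x)                 # L^{1/2} U' x
c_opt = g / np.linalg.norm(g)                # the maximizing c on the unit sphere
attained = (x @ ((U * np.sqrt(lam)) @ c_opt)) ** 2
```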
The motivation behind the introduction of the auxiliary variable C will become clearer in the sequel.

For a given C, the value of $X \in \mathcal{X}_k$ that maximizes the objective in (5) for that C is
$$\widehat{X} \triangleq \arg\max_{X \in \mathcal{X}_k} \sum_{j=1}^{k} \left\langle X_j,\, W_j \right\rangle^2, \qquad (6)$$
where $W \triangleq U \Lambda^{1/2} C$ is a real d × k matrix. The constrained, non-convex maximization (6) plays a central role in our developments. We will later describe a combinatorial $O(d \cdot (s \cdot k)^2)$ procedure to efficiently compute $\widehat{X}$, reducing the maximization to an instance of the bipartite maximum weight matching problem. For now, however, let us assume that such a procedure exists.

Let $X^\star, C^\star$ be the pair that attains the maximum in (5); in other words, $X^\star$ is the desired solution to the sparse PCA problem. If the optimal value $C^\star$ of the auxiliary variable were known, then we would be able to recover $X^\star$ by solving the maximization (6) for $C = C^\star$. Of course, $C^\star$ is not known, and it is not possible to exhaustively consider all possible values in the domain of C. Instead, we examine only a finite number of possible values of C over a fine discretization of its domain. In particular, let $\mathcal{N}_{\epsilon/2}(\mathbb{S}_2^{r-1})$ denote a finite ε/2-net of the r-dimensional $\ell_2$-unit sphere: for any point in $\mathbb{S}_2^{r-1}$, the net contains a point within an ε/2 radius from the former. There are several ways to construct such a net. Further, let $[\mathcal{N}_{\epsilon/2}(\mathbb{S}_2^{r-1})]^{\otimes k} \subset \mathbb{R}^{r \times k}$ denote the kth Cartesian power of the aforementioned ε/2-net. By construction, this collection of points contains a matrix C that is column-wise close to $C^\star$. 
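For intuition on such nets, here is one elementary construction for the special case r = 2: a uniform angular grid on the circle. With angular spacing at most ε, every unit vector lies within arc length, and hence Euclidean distance, at most ε/2 of some net point (the function name is ours, for illustration):

```python
import numpy as np

def eps_net_circle(eps):
    """A simple eps/2-net of the unit circle S^1: a uniform angular grid.

    With angular spacing at most eps, every point of the circle is within
    angle eps/2 (hence Euclidean distance at most eps/2) of a net point.
    """
    m = int(np.ceil(2 * np.pi / eps))
    angles = 2 * np.pi * np.arange(m) / m
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)  # m x 2 net points

net = eps_net_circle(0.2)
rng = np.random.default_rng(4)
pts = rng.standard_normal((500, 2))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
# distance from each random unit vector to its nearest net point
dists = np.linalg.norm(pts[:, None, :] - net[None, :, :], axis=2).min(axis=1)
```

In higher dimensions the same guarantee requires on the order of $(c/\epsilon)^{r}$ points, which is the source of the exponential dependence on r in the running time.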
In turn, it can be shown using the properties of the net that the candidate solution $X \in \mathcal{X}_k$ obtained through (6) at that point C will be approximately as good as the optimal $X^\star$ in terms of the quadratic objective in (2).

All the above observations yield a procedure for approximately solving the sparse PCA problem (2). The steps are outlined in Algorithm 1. Given the desired number of components k and an accuracy parameter $\epsilon \in (0, 1)$, the algorithm generates a net $[\mathcal{N}_{\epsilon/2}(\mathbb{S}_2^{r-1})]^{\otimes k}$ and iterates over its points. At each point C, it computes a feasible solution for the sparse PCA problem (a set of k s-sparse components) by solving maximization (6) via a procedure (Alg. 2) that will be described in the sequel. The algorithm collects the candidate solutions identified at the points of the net. The best among them achieves an objective in (2) that provably lies close to optimal. More formally,

Theorem 1. For any real d × d rank-r PSD matrix A, desired number of components k, number s of nonzero entries per component, and accuracy parameter $\epsilon \in (0, 1)$, Algorithm 1 outputs $X \in \mathcal{X}_k$ such that
$$\mathrm{Tr}\!\left(X^\top A X\right) \ge (1 - \epsilon) \cdot \mathrm{Tr}\!\left(X^{\star\top} A X^\star\right),$$
where $X^\star \triangleq \arg\max_{X \in \mathcal{X}_k} \mathrm{Tr}\!\left(X^\top A X\right)$, in time $T_{\mathrm{SVD}}(r) + O\!\left(\left(\tfrac{4}{\epsilon}\right)^{r \cdot k} \cdot d \cdot (s \cdot k)^2\right)$.

Algorithm 1 Sparse PCA (Multiple disjoint components)
input: PSD d × d rank-r matrix A, $\epsilon \in (0, 1)$, $k \in \mathbb{Z}_+$.
output: $X \in \mathcal{X}_k$ {Theorem 1}
1: $\mathcal{C} \leftarrow \{\}$
2: $[U, \Lambda] \leftarrow \mathrm{EIG}(A)$
3: for each $C \in [\mathcal{N}_{\epsilon/2}(\mathbb{S}_2^{r-1})]^{\otimes k}$ do
4:   $W \leftarrow U \Lambda^{1/2} C$ {$W \in \mathbb{R}^{d \times k}$}
5:   $\widehat{X} \leftarrow \arg\max_{X \in \mathcal{X}_k} \sum_{j=1}^{k} \langle X_j, W_j \rangle^2$ {Alg. 2}
6:   $\mathcal{C} \leftarrow \mathcal{C} \cup \{\widehat{X}\}$
7: end for
8: $X \leftarrow \arg\max_{X \in \mathcal{C}} \mathrm{Tr}(X^\top A X)$

Algorithm 1 is the first nontrivial algorithm that provably approximates the solution of the sparse PCA problem (2). According to Theorem 1, it achieves an objective value that lies within a multiplicative factor from the optimal, arbitrarily close to 1. Its complexity grows as a low-order polynomial in the dimension d of the input, but exponentially in the intrinsic dimension r. Note, however, that it can be substantially better than the $O(d^{s \cdot k})$ brute-force approach that exhaustively considers all candidate supports for the k sparse components. The complexity of our algorithm follows from the cardinality of the net and the complexity of Algorithm 2, the subroutine that solves the constrained maximization (6). The latter is a key ingredient of our algorithm and is discussed in detail in the next subsection. A formal proof of Theorem 1 is provided in Section 9.2.

2.1 Sparse Components via Bipartite Matchings

At the core of Alg. 1 lies a procedure that solves the constrained maximization (6) (Alg. 2). The latter breaks down the maximization into two stages. First, it identifies the support of the optimal solution $\widehat{X}$ by solving an instance of the maximum weight matching problem on a bipartite graph G. Then, it recovers the exact values of its nonzero entries based on the Cauchy-Schwarz inequality. In the sequel, we provide a brief description of Alg. 2, leading up to its guarantees in Lemma 2.1.

Let $I_j \triangleq \mathrm{supp}(\widehat{X}_j)$ be the support of the jth column of $\widehat{X}$, j = 1, ..., k. The objective in (6) becomes
$$\sum_{j=1}^{k} \left\langle \widehat{X}_j,\, W_j \right\rangle^2 = \sum_{j=1}^{k} \Bigl( \sum_{i \in I_j} \widehat{X}_{ij} \cdot W_{ij} \Bigr)^2 \le \sum_{j=1}^{k} \sum_{i \in I_j} W_{ij}^2. \qquad (7)$$
The inequality is due to Cauchy-Schwarz and the constraint $\|X_j\|_2 = 1$ for all $j \in \{1, \ldots, k\}$. 
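The Cauchy-Schwarz bound in (7) is tight for any fixed support: normalizing the restriction of $W_j$ to that support attains it exactly. A quick numerical check (names are ours):

```python
import numpy as np

# Tightness of the Cauchy-Schwarz bound in (7): for a fixed support I, setting
# the nonzeros of x to W_I / ||W_I||_2 makes <x, w>^2 meet the upper bound.
rng = np.random.default_rng(5)
d = 12
w = rng.standard_normal(d)      # one column W_j of W
I = [1, 4, 7, 9]                # an arbitrary support of size s = 4

x = np.zeros(d)
x[I] = w[I] / np.linalg.norm(w[I])
lhs = (x @ w) ** 2              # <x, w>^2
rhs = np.sum(w[I] ** 2)         # sum of squared weights on the support
```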
In fact, if an oracle reveals the supports $I_j$, j = 1, ..., k, the upper bound in (7) can always be achieved by setting the nonzero entries of $\widehat{X}$ as in Algorithm 2 (Line 6). Therefore, the key to solving (6) is determining the collection of supports that maximizes the right-hand side of (7).

By constraint, the sets $I_j$ must be pairwise disjoint, each with cardinality s. Consider a weighted bipartite graph $G = (U = \{U_1, \ldots, U_k\}, V, E)$ constructed as follows[1] (Fig. 1):

- V is a set of d vertices $v_1, \ldots, v_d$, corresponding to the d variables, i.e., the d rows of $\widehat{X}$.
- U is a set of k · s vertices, conceptually partitioned into k disjoint subsets $U_1, \ldots, U_k$, each of cardinality s. The jth subset, $U_j$, is associated with the support $I_j$; the s vertices $u^{(j)}_\alpha$, α = 1, ..., s, in $U_j$ serve as placeholders for the variables/indices in $I_j$.
- Finally, the edge set is $E = U \times V$. The edge weights are determined by the d × k matrix W in (6). In particular, the weight of edge $(u^{(j)}_\alpha, v_i)$ is equal to $W_{ij}^2$. Note that all vertices in $U_j$ are effectively identical; they all share a common neighborhood and edge weights.

[Figure 1: The graph G generated by Alg. 2, used to determine the support of the solution $\widehat{X}$ in (6). The k groups $U_1, \ldots, U_k$ of s placeholder vertices each are fully connected to the d variable vertices $v_1, \ldots, v_d$, with edge weights $W_{ij}^2$.]

[1] The construction is formally outlined in Algorithm 4 in Section 8.

Any feasible support $\{I_j\}_{j=1}^k$ corresponds to a perfect matching in G, and vice versa. Recall that a matching is a subset of the edges containing no two edges incident to the same vertex, while a perfect matching, in the case of an unbalanced bipartite graph G = (U, V, E) with |U| ≤ |V|, is a matching that contains one incident edge for each vertex in U. Given a perfect matching $M \subseteq E$, the disjoint neighborhoods of the $U_j$'s under M yield a support $\{I_j\}_{j=1}^k$. Conversely, any valid support yields a unique perfect matching in G (taking into account that all vertices in $U_j$ are isomorphic). Moreover, due to the choice of weights in G, the right-hand side of (7) for a given support $\{I_j\}_{j=1}^k$ is equal to the weight of the matching M in G induced by the former, i.e., $\sum_{j=1}^{k} \sum_{i \in I_j} W_{ij}^2 = \sum_{(u,v) \in M} w(u, v)$. It follows that determining the support of the solution in (6) reduces to solving the maximum weight matching problem on the bipartite graph G.

Algorithm 2 readily follows. Given $W \in \mathbb{R}^{d \times k}$, the algorithm generates a weighted bipartite graph G as described and computes its maximum weight matching. Based on the latter, it first recovers the desired support of $\widehat{X}$ (Line 5), and subsequently the exact values of its nonzero entries (Line 6). The running time is dominated by the computation of the matching, which can be done in $O(|E||U| + |U|^2 \log |U|)$ using a variant of the Hungarian algorithm [22]. Hence,

Lemma 2.1. For any $W \in \mathbb{R}^{d \times k}$, Algorithm 2 computes the solution to (6) in time $O(d \cdot (s \cdot k)^2)$.

Algorithm 2 Compute Candidate Solution
input: real d × k matrix W
output: $\widehat{X} = \arg\max_{X \in \mathcal{X}_k} \sum_{j=1}^{k} \langle X_j, W_j \rangle^2$
1: $G(\{U_j\}_{j=1}^k, V, E) \leftarrow \mathrm{GENBIGRAPH}(W)$ {Alg. 4}
2: $M \leftarrow \mathrm{MAXWEIGHTMATCH}(G)$ {$M \subseteq E$}
3: $\widehat{X} \leftarrow 0_{d \times k}$
4: for j = 1, ..., k do
5:   $I_j \leftarrow \{ i \in \{1, \ldots, d\} : (u, v_i) \in M,\ u \in U_j \}$
6:   $[\widehat{X}_j]_{I_j} \leftarrow [W_j]_{I_j} / \| [W_j]_{I_j} \|_2$
7: end for

A more formal analysis and proof of Lemma 2.1 is available in Sec. 9.1. This completes the description of our sparse PCA algorithm (Alg. 1) and the proof sketch of Theorem 1.

3 Sparse PCA on Low-Dimensional Sketches

Algorithm 1 approximately solves the sparse PCA problem (2) on a d × d rank-r PSD matrix A in time that grows as a low-order polynomial in the ambient dimension d, but depends exponentially on r. This dependence can be prohibitive in practice. To mitigate its effect, we can apply our sparse PCA algorithm on a low-rank sketch of A. Intuitively, the quality of the extracted components should depend on how well that low-rank surrogate approximates the original input.

More formally, let S be the real n × d data matrix representing n (potentially centered) datapoints in d variables, and A the corresponding d × d covariance matrix. Further, let $\bar{S}$ be a low-dimensional sketch of the original data: an n × d matrix whose rows lie in an r-dimensional subspace, with r being an accuracy parameter. Such a sketch can be obtained in several ways, including, for example, exact or approximate SVD, or online sketching methods [23]. Finally, let $\bar{A} = \tfrac{1}{n} \bar{S}^\top \bar{S}$ be the covariance matrix of the sketched data. Then, instead of A, we can approximately solve the sparse PCA problem by applying Algorithm 1 on the low-rank surrogate $\bar{A}$. The above are formally outlined in Algorithm 3. We note that the covariance matrix $\bar{A}$ does not need to be explicitly computed; Algorithm 1 can operate directly on the (sketched) input data matrix.

Algorithm 3 Sparse PCA on Low-Dim. Sketch
input: real n × d S, $r \in \mathbb{Z}_+$, $\epsilon \in (0, 1)$, $k \in \mathbb{Z}_+$.
output: $X^{(r)} \in \mathcal{X}_k$ {Thm. 2}
1: $\bar{S} \leftarrow \mathrm{SKETCH}(S, r)$
2: $\bar{A} \leftarrow \tfrac{1}{n} \bar{S}^\top \bar{S}$
3: $X^{(r)} \leftarrow \mathrm{ALGORITHM\ 1}(\bar{A}, \epsilon, k)$

Theorem 2. For any n × d input data matrix S, with corresponding empirical covariance matrix $A = \tfrac{1}{n} S^\top S$, any desired number of components k, and accuracy parameters $\epsilon \in (0, 1)$ and r, Algorithm 3 outputs $X^{(r)} \in \mathcal{X}_k$ such that
$$\mathrm{Tr}\!\left(X^{(r)\top} A X^{(r)}\right) \ge (1 - \epsilon) \cdot \mathrm{Tr}\!\left(X^{\star\top} A X^\star\right) - 2 \cdot k \cdot \|A - \bar{A}\|_2,$$
where $X^\star \triangleq \arg\max_{X \in \mathcal{X}_k} \mathrm{Tr}\!\left(X^\top A X\right)$, in time $T_{\mathrm{SKETCH}}(r) + T_{\mathrm{SVD}}(r) + O\!\left(\left(\tfrac{4}{\epsilon}\right)^{r \cdot k} \cdot d \cdot (s \cdot k)^2\right)$.

The error term $\|A - \bar{A}\|_2$, and in turn the tightness of the approximation guarantees, hinges on the quality of the sketch. Roughly, higher values of the parameter r should allow for a sketch that more accurately represents the original data, leading to tighter guarantees. That is the case, for example, when the sketch is obtained through exact SVD. In that sense, Theorem 2 establishes a natural trade-off between the running time of Algorithm 3 and the quality of the approximation guarantees. (See [24] for additional results.) A formal proof of Theorem 2 is provided in Appendix Section 9.3.

4 Related Work

A significant volume of work has focused on the single-component sparse PCA problem (1); we only scratch the surface and refer the reader to the citations therein. Representative examples range from early heuristics in [7], to the LASSO-based techniques in [8], the elastic net $\ell_1$-regression in [5], $\ell_1$- and $\ell_0$-regularized optimization methods such as GPower in [10], a greedy branch-and-bound technique in [11], and semidefinite programming approaches [3, 12, 13]. Many focus on a statistical analysis that pertains to specific data models and the recovery of a "true" sparse component. 
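Circling back to Algorithm 2 from Section 2.1: because all vertices within a group $U_j$ are interchangeable, its support-selection step is a rectangular assignment problem, so it can be prototyped with an off-the-shelf assignment solver in place of the Hungarian-algorithm variant cited there. A sketch under that substitution, with function names of our own choosing:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def disjoint_supports(W, s):
    """Support selection of Algorithm 2 as a rectangular assignment problem.

    Group j contributes s identical 'slot' rows with weight W[i, j]^2 on
    variable i; a maximum-weight assignment of slots to distinct variables
    yields pairwise disjoint supports of size s per component.
    """
    d, k = W.shape
    weights = np.repeat(W.T ** 2, s, axis=0)           # (k*s) x d, group-major rows
    rows, cols = linear_sum_assignment(weights, maximize=True)
    return [sorted(cols[rows // s == j]) for j in range(k)]

def candidate_solution(W, s):
    """Full Algorithm 2 sketch: supports via matching, values via Cauchy-Schwarz."""
    d, k = W.shape
    X = np.zeros((d, k))
    for j, I in enumerate(disjoint_supports(W, s)):
        X[I, j] = W[I, j] / np.linalg.norm(W[I, j])
    return X

rng = np.random.default_rng(6)
W = rng.standard_normal((6, 2))                        # d = 6 variables, k = 2 components
X = candidate_solution(W, s=2)
obj = sum((X[:, j] @ W[:, j]) ** 2 for j in range(2))
```

On a toy instance like this one, the matching objective coincides with exhaustive search over all disjoint support tuples, mirroring the exactness claim of Lemma 2.1.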
In practice, the most competitive results in terms of the maximization in (1) seem to be achieved by (i) the simple and efficient truncated power (TPower) iteration of [14], (ii) the approach of [15] stemming from an expectation-maximization (EM) formulation, and (iii) the SpanSPCA framework of [16], which solves the sparse PCA problem through low-rank approximations based on [17].

We are not aware of any algorithm that explicitly addresses the multi-component sparse PCA problem (2). Multiple components can be extracted by repeatedly solving (1) with one of the aforementioned methods. To ensure disjoint supports, variables "selected" by a component are removed from the dataset. However, this greedy approach can result in a highly suboptimal objective value (see Sec. 7). More generally, there has been relatively limited work on the estimation of principal subspaces or multiple components under sparsity constraints. Non-deflation-based algorithms include extensions of the diagonal [25] and iterative thresholding [26] approaches, while [27] and [28] propose methods that rely on the "row sparsity for subspaces" assumption of [19]. These methods yield components supported on a common set of variables, and hence solve a problem different from (2). In [20], the authors discuss the multi-component sparse PCA problem, propose an alternative objective function, and obtain interesting theoretical guarantees for that problem. In [29], the authors consider a structured variant of sparse PCA where higher-order structure is encoded by an atomic norm regularization. Finally, [30] develops a framework for sparse matrix factorization problems based on an atomic norm. 
Their framework captures sparse PCA (although not explicitly the constraint of disjoint supports), but the resulting optimization problem, albeit convex, is NP-hard.

5 Experiments

We evaluate our algorithm on a series of real datasets and compare it to deflation-based approaches for sparse PCA using TPower [14], EM [15], and SpanSPCA [16]. The latter are representative of the state of the art for the single-component sparse PCA problem (1). Multiple components are computed one by one. To ensure disjoint supports, the deflation step effectively amounts to removing from the dataset all variables used by previously extracted components. For algorithms that are randomly initialized, we depict the best results over multiple random restarts. Additional experimental results are listed in Section 11 of the appendix.

Our experiments are conducted in a Matlab environment. Due to its nature, our algorithm is easily parallelizable; its prototypical implementation utilizes the Parallel Pool Matlab feature to exploit multicore (or distributed cluster) capabilities. Recall that our algorithm operates on a low-rank approximation of the input data. Unless otherwise specified, it is configured for a rank-4 approximation obtained via truncated SVD. Finally, we note that our algorithm is slower than the deflation-based methods. We set a barrier on the execution time of our algorithm, at the cost of the theoretical approximation guarantees; the algorithm returns the best result at the time of termination. This "early termination" can only hurt the performance of our algorithm.

Leukemia Dataset. We evaluate our algorithm on the Leukemia dataset [31]. The dataset comprises 72 samples, each consisting of expression values for 12582 probe sets. We extract k = 5 sparse components, each active on s = 50 features. In Fig. 2(a), we plot the cumulative explained variance versus the number of components. 
Deflation-based approaches are greedy: the leading components capture high values of variance, but subsequent ones contribute less. On the contrary, our algorithm jointly optimizes the k = 5 components and achieves higher total cumulative variance; one cannot identify a single top component. We repeat the experiment for multiple values of k. Fig. 2(b) depicts the total cumulative variance captured by each method, for each value of k.

[Figure 2: Cumulative variance captured by k s-sparse extracted components; Leukemia dataset [31]. We arbitrarily set s = 50 nonzero entries per component. Fig. 2(a) depicts the cumulative variance vs. the number of components, for k = 5. Deflation-based approaches are greedy; the first components capture high variance, but subsequent ones contribute less. Our algorithm (SPCABiPart) jointly optimizes the k components and achieves a higher objective than TPower, EM-SPCA, and SpanSPCA. Fig. 2(b) depicts the cumulative variance achieved for various values of k.]

Additional Datasets. We repeat the experiment on multiple datasets, arbitrarily selected from [31]. Table 1 lists the total cumulative variance captured by k = 5 components, each with s = 40 nonzero entries, extracted using the four methods. Our algorithm achieves the highest values in most cases.

Bag of Words (BoW) Dataset. 
[31] This is a collection of text corpora stored under the "bag-of-words" model. For each text corpus, a vocabulary of d words is extracted upon tokenization and the removal of stopwords and words appearing fewer than ten times in total. Each document is then represented as a vector in that d-dimensional space, with the ith entry corresponding to the number of appearances of the ith vocabulary entry in the document.

We solve the sparse PCA problem (2) on the word-by-word cooccurrence matrix and extract k = 8 sparse components, each with cardinality s = 10. We note that the latter is not explicitly constructed; our algorithm can operate directly on the input word-by-document matrix. Table 2 lists the variance captured by each method; our algorithm consistently outperforms the other approaches.

Finally, note that here each sparse component effectively selects a small set of words. In turn, the k extracted components can be interpreted as a set of well-separated topics. In Table 3, we list the topics extracted from the NY Times corpus (part of the Bag of Words dataset). The corpus consists of 3 · 10^5 news articles and a vocabulary of d = 102660 words.

Dataset (#samples × #variables)    TPower     EM sPCA    SpanSPCA   SPCABiPart
AMZN COM REV (1500 × 10000)        7.31e+03   7.32e+03   7.31e+03   7.79e+03
ARCENCE TRAIN (100 × 10000)        1.08e+07   1.02e+07   1.08e+07   1.10e+07
CBCL FACE TRAIN (2429 × 361)       5.06e+00   5.18e+00   5.23e+00   5.29e+00
ISOLET-5 (1559 × 617)              3.31e+01   3.43e+01   3.34e+01   3.51e+01
LEUKEMIA (72 × 12582)              5.00e+09   5.03e+09   4.84e+09   5.37e+09
PEMS TRAIN (267 × 138672)          3.94e+00   3.58e+00   3.89e+00   3.75e+00
MFEAT PIX (2000 × 240)             5.00e+02   5.27e+02   5.08e+02   5.47e+02

Table 1: Total cumulative variance captured by k = 5 40-sparse extracted components on various datasets [31]. For each dataset, we list the size (#samples × #variables) and the value of variance captured by each method. Our algorithm operates on a rank-4 sketch in all cases.

Corpus (#documents × #vocabulary)    TPower     EM sPCA    SpanSPCA   SPCABiPart
BOW:NIPS (1500 × 12419)              2.51e+03   2.57e+03   2.53e+03   3.34e+03 (+29.98%)
BOW:KOS (3430 × 6906)                4.14e+01   4.24e+01   4.21e+01   6.14e+01 (+44.57%)
BOW:ENRON (39861 × 28102)            2.11e+02   2.00e+02   2.09e+02   2.38e+02 (+12.90%)
BOW:NYTIMES (300000 × 102660)        4.81e+01   -          4.81e+01   5.31e+01 (+10.38%)

Table 2: Total variance captured by k = 8 extracted components, each with s = 15 nonzero entries; Bag of Words dataset [31]. For each corpus, we list the size (#documents × #vocabulary-size) and the explained variance. Our algorithm operates on a rank-5 sketch in all cases.

6 Conclusions

We considered the sparse PCA problem for multiple components with disjoint supports. Existing methods for the single-component problem can be used along with an appropriate deflation step to compute multiple components one by one, leading to potentially suboptimal results. We presented a novel algorithm for jointly computing multiple sparse and disjoint components with provable approximation guarantees. Our algorithm is combinatorial and exploits interesting connections between the sparse PCA and the bipartite maximum weight matching problems. Its running time grows as a low-order polynomial in the ambient dimension of the input data, but depends exponentially on its rank. To alleviate this dependency, we can apply the algorithm on a low-dimensional sketch of the input, at the cost of an additional error in our theoretical approximation guarantees. 
Empirical evaluation showed that in many cases our algorithm outperforms deflation-based approaches.

Acknowledgments

DP is generously supported by NSF awards CCF-1217058 and CCF-1116404 and MURI AFOSR grant 556016. This research has been supported by NSF Grants CCF 1344179, 1344364, 1407278, 1422549 and ARO YIP W911NF-14-1-0258.

Topic 1: percent, million, money, high, program, number, need, part, problem, com
Topic 2: zzz united states, zzz u s, zzz american, attack, military, palestinian, war, administration, zzz white house, games
Topic 3: zzz bush, official, government, president, group, leader, country, political, american, law
Topic 4: company, companies, market, stock, business, billion, analyst, firm, sales, cost
Topic 5: team, game, season, player, play, point, run, right, home, won
Topic 6: cup, minutes, add, tablespoon, oil, teaspoon, water, pepper, large, food
Topic 7: school, student, children, women, show, book, family, look, hour, small
Topic 8: zzz al gore, zzz george bush, campaign, election, plan, tax, public, zzz washington, member, nation

Table 3: BOW:NYTIMES dataset [31]. Each line lists the words corresponding to the s = 10 nonzero entries of one of the k = 8 extracted components (topics). Words corresponding to higher-magnitude entries appear earlier in each topic.

References

[1] A. Majumdar, "Image compression by sparse PCA coding in curvelet domain," Signal, Image and Video Processing, vol. 3, no. 1, pp. 27–34, 2009.

[2] Z. Wang, F. Han, and H. Liu, "Sparse principal component analysis for high dimensional multivariate time series," in Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pp. 48–56, 2013.

[3] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet, "A direct formulation for sparse PCA using semidefinite programming," SIAM Review, vol. 49, no. 3, pp. 434–448, 2007.

[4] R. Jiang, H. Fei, and J. Huan, "Anomaly localization for network data streams with graph joint sparse PCA," in Proceedings of the 17th ACM SIGKDD, pp. 886–894, ACM, 2011.

[5] H. Zou, T. Hastie, and R. Tibshirani, "Sparse principal component analysis," Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265–286, 2006.

[6] H. Kaiser, "The varimax criterion for analytic rotation in factor analysis," Psychometrika, vol. 23, no. 3, pp. 187–200, 1958.

[7] I. Jolliffe, "Rotation of principal components: choice of normalization constraints," Journal of Applied Statistics, vol. 22, no. 1, pp. 29–35, 1995.

[8] I. Jolliffe, N. Trendafilov, and M. Uddin, "A modified principal component technique based on the lasso," Journal of Computational and Graphical Statistics, vol. 12, no. 3, pp. 531–547, 2003.

[9] C. Boutsidis, P. Drineas, and M. Magdon-Ismail, "Sparse features for PCA-like linear regression," in Advances in Neural Information Processing Systems, pp. 2285–2293, 2011.

[10] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre, "Generalized power method for sparse principal component analysis," The Journal of Machine Learning Research, vol. 11, pp. 517–553, 2010.

[11] B. Moghaddam, Y. Weiss, and S. Avidan, "Spectral bounds for sparse PCA: Exact and greedy algorithms," NIPS, vol. 18, p. 915, 2006.

[12] A. d'Aspremont, F. Bach, and L. El Ghaoui, "Optimal solutions for sparse principal component analysis," The Journal of Machine Learning Research, vol. 9, pp. 1269–1294, 2008.

[13] Y. Zhang, A. d'Aspremont, and L. El Ghaoui, "Sparse PCA: Convex relaxations, algorithms and applications," Handbook on Semidefinite, Conic and Polynomial Optimization, pp. 915–940, 2012.

[14] X.-T. Yuan and T. Zhang, "Truncated power method for sparse eigenvalue problems," The Journal of Machine Learning Research, vol. 14, no. 1, pp. 899–925, 2013.

[15] C. D. Sigg and J. M. Buhmann, "Expectation-maximization for sparse and non-negative PCA," in Proceedings of the 25th International Conference on Machine Learning, ICML '08, (New York, NY, USA), pp. 960–967, ACM, 2008.

[16] D. Papailiopoulos, A. Dimakis, and S. Korokythakis, "Sparse PCA through low-rank approximations," in Proceedings of The 30th International Conference on Machine Learning, pp. 747–755, 2013.

[17] M. Asteris, D. S. Papailiopoulos, and G. N. Karystinos, "The sparse principal component of a constant-rank matrix," IEEE Transactions on Information Theory, vol. 60, pp. 2281–2290, April 2014.

[18] L. Mackey, "Deflation methods for sparse PCA," NIPS, vol. 21, pp. 1017–1024, 2009.

[19] V. Vu and J. Lei, "Minimax rates of estimation for sparse PCA in high dimensions," in International Conference on Artificial Intelligence and Statistics, pp. 1278–1286, 2012.

[20] M. Magdon-Ismail and C. Boutsidis, "Optimal sparse linear auto-encoders and sparse PCA," arXiv preprint arXiv:1502.06626, 2015.

[21] M. Magdon-Ismail, "NP-hardness and inapproximability of sparse PCA," CoRR, vol. abs/1502.05675, 2015.

[22] L. Ramshaw and R. E. Tarjan, "On minimum-cost assignments in unbalanced bipartite graphs," HP Labs, Palo Alto, CA, USA, Tech. Rep. HPL-2012-40R1, 2012.

[23] N. Halko, P.-G. Martinsson, and J. A. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217–288, 2011.

[24] M. Asteris, D. Papailiopoulos, A. Kyrillidis, and A. G. Dimakis, "Sparse PCA via bipartite matchings," arXiv preprint arXiv:1508.00625, 2015.

[25] I. M. Johnstone and A. Y. Lu, "On consistency and sparsity for principal components analysis in high dimensions," Journal of the American Statistical Association, vol. 104, no. 486, 2009.

[26] Z. Ma, "Sparse principal component analysis and iterative thresholding," The Annals of Statistics, vol. 41, no. 2, pp. 772–801, 2013.

[27] V. Q. Vu, J. Cho, J. Lei, and K. Rohe, "Fantope projection and selection: A near-optimal convex relaxation of sparse PCA," in NIPS, pp. 2670–2678, 2013.

[28] Z. Wang, H. Lu, and H. Liu, "Nonconvex statistical optimization: minimax-optimal sparse PCA in polynomial time," arXiv preprint arXiv:1408.5352, 2014.

[29] R. Jenatton, G. Obozinski, and F. Bach, "Structured sparse principal component analysis," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS, pp. 366–373, 2010.

[30] E. Richard, G. R. Obozinski, and J.-P. Vert, "Tight convex relaxations for sparse matrix factorization," in Advances in Neural Information Processing Systems, pp. 3284–3292, 2014.

[31] M. Lichman, "UCI machine learning repository," 2013.