{"title": "Deflation Methods for Sparse PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 1017, "page_last": 1024, "abstract": "In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several heuristic deflation alternatives with more desirable properties. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets.", "full_text": "De\ufb02ation Methods for Sparse PCA\n\nComputer Science Division\n\nUniversity of California, Berkeley\n\nLester Mackey\n\nBerkeley, CA 94703\n\nAbstract\n\nIn analogy to the PCA setting, the sparse PCA problem is often solved by iter-\natively alternating between two subtasks: cardinality-constrained rank-one vari-\nance maximization and matrix de\ufb02ation. While the former has received a great\ndeal of attention in the literature, the latter is seldom analyzed and is typically\nborrowed without justi\ufb01cation from the PCA context. In this work, we demon-\nstrate that the standard PCA de\ufb02ation procedure is seldom appropriate for the\nsparse PCA setting. To rectify the situation, we \ufb01rst develop several de\ufb02ation al-\nternatives better suited to the cardinality-constrained context. We then reformulate\nthe sparse PCA optimization problem to explicitly re\ufb02ect the maximum additional\nvariance objective on each round. 
The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets.\n\n1 Introduction\n\nPrincipal component analysis (PCA) is a popular change of variables technique used in data compression, predictive modeling, and visualization. The goal of PCA is to extract several principal components, linear combinations of input variables that together best account for the variance in a data set. Often, PCA is formulated as an eigenvalue decomposition problem: each eigenvector of the sample covariance matrix of a data set corresponds to the loadings or coefficients of a principal component. A common approach to solving this partial eigenvalue decomposition is to iteratively alternate between two subproblems: rank-one variance maximization and matrix deflation. The first subproblem involves finding the maximum-variance loadings vector for a given sample covariance matrix or, equivalently, finding the leading eigenvector of the matrix. The second involves modifying the covariance matrix to eliminate the influence of that eigenvector.\n\nA primary drawback of PCA is its lack of sparsity. Each principal component is a linear combination of all variables, and the loadings are typically non-zero. Sparsity is desirable as it often leads to more interpretable results, reduced computation time, and improved generalization. Sparse PCA [8, 3, 16, 17, 6, 18, 1, 2, 9, 10, 12] injects sparsity into the PCA process by searching for \u201cpseudo-eigenvectors\u201d, sparse loadings that explain a maximal amount of variance in the data.\n\nIn analogy to the PCA setting, many authors attempt to solve the sparse PCA problem by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. 
The former is an NP-hard problem, and a variety of relaxations and approximate solutions have been developed in the literature [1, 2, 9, 10, 12, 16, 17]. The latter subtask has received relatively little attention and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several heuristic deflation alternatives with more desirable properties. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets.\n\nThe remainder of the paper is organized as follows. In Section 2, we discuss matrix deflation as it relates to PCA and sparse PCA. We examine the failings of typical PCA deflation in the sparse setting and develop several alternative deflation procedures. In Section 3, we present a reformulation of the standard iterative sparse PCA optimization problem and derive a generalized deflation procedure to solve the reformulation. Finally, in Section 4, we demonstrate the utility of our newly derived deflation techniques on real-world datasets.\n\nNotation\n\nI is the identity matrix. S^p_+ is the set of all symmetric, positive-semidefinite matrices in R^{p x p}. Card(x) represents the cardinality of, or number of non-zero entries in, the vector x.\n\n2 Deflation methods\n\nA matrix deflation modifies a matrix to eliminate the influence of a given eigenvector, typically by setting the associated eigenvalue to zero (see [14] for a more detailed discussion). 
We will first discuss deflation in the context of PCA and then consider its extension to sparse PCA.\n\n2.1 Hotelling's deflation and PCA\n\nIn the PCA setting, the goal is to extract the r leading eigenvectors of the sample covariance matrix, A_0 \\in S^p_+, as its eigenvectors are equivalent to the loadings of the first r principal components. Hotelling's deflation method [11] is a simple and popular technique for sequentially extracting these eigenvectors. On the t-th iteration of the deflation method, we first extract the leading eigenvector of A_{t-1},\n\nx_t = argmax_{x : x^T x = 1} x^T A_{t-1} x, (1)\n\nand we then use Hotelling's deflation to annihilate x_t:\n\nA_t = A_{t-1} - x_t x_t^T A_{t-1} x_t x_t^T. (2)\n\nThe deflation step ensures that the (t+1)-st leading eigenvector of A_0 is the leading eigenvector of A_t. The following proposition explains why.\n\nProposition 2.1. If \\lambda_1 >= ... >= \\lambda_p are the eigenvalues of A \\in S^p_+, x_1, ..., x_p are the corresponding eigenvectors, and \\hat{A} = A - x_j x_j^T A x_j x_j^T for some j \\in {1, ..., p}, then \\hat{A} has eigenvectors x_1, ..., x_p with corresponding eigenvalues \\lambda_1, ..., \\lambda_{j-1}, 0, \\lambda_{j+1}, ..., \\lambda_p.\n\nProof. \\hat{A} x_j = A x_j - x_j x_j^T A x_j x_j^T x_j = A x_j - x_j x_j^T A x_j = \\lambda_j x_j - \\lambda_j x_j = 0 x_j. \\hat{A} x_i = A x_i - x_j x_j^T A x_j x_j^T x_i = A x_i - 0 = \\lambda_i x_i, for all i != j.\n\nThus, Hotelling's deflation preserves all eigenvectors of a matrix and annihilates a selected eigenvalue while maintaining all others. Notably, this implies that Hotelling's deflation preserves positive-semidefiniteness. 
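As a quick numerical illustration of Eqs. (1)-(2) and Proposition 2.1, the following sketch (assuming NumPy; the random sample covariance is purely illustrative) deflates the leading eigenvector of a matrix and checks that its eigenvalue is annihilated while all others are preserved:

```python
import numpy as np

def hotelling_deflation(A, x):
    # Hotelling's deflation, Eq. (2): A_t = A_{t-1} - x x^T A_{t-1} x x^T
    return A - np.outer(x, x) * (x @ A @ x)

# A symmetric positive-semidefinite sample covariance matrix.
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 4))
A = Y.T @ Y / 50

vals, vecs = np.linalg.eigh(A)   # eigenvalues in ascending order
x1 = vecs[:, -1]                 # leading eigenvector, the solution of Eq. (1)

A1 = hotelling_deflation(A, x1)
new_vals = np.linalg.eigvalsh(A1)

# Proposition 2.1: the leading eigenvalue becomes 0; the rest are untouched.
assert min(abs(new_vals)) < 1e-10
assert np.allclose(np.sort(new_vals)[1:], vals[:-1], atol=1e-10)
```

Repeating this extract-and-deflate loop r times yields the loadings of the top r principal components, which is exactly the iterative scheme described above.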
In the case of our iterative deflation method, annihilating the t-th leading eigenvector of A_0 renders the (t+1)-st leading eigenvector dominant in the next round.\n\n2.2 Hotelling's deflation and sparse PCA\n\nIn the sparse PCA setting, we seek r sparse loadings which together capture the maximum amount of variance in the data. Most authors [1, 9, 16, 12] adopt the additional constraint that the loadings be produced in a sequential fashion. To find the first such \u201cpseudo-eigenvector\u201d, we can consider a cardinality-constrained version of Eq. (1):\n\nx_1 = argmax_{x : x^T x = 1, Card(x) <= k_1} x^T A_0 x. (3)\n\nThat leaves us with the question of how to best extract subsequent pseudo-eigenvectors. A common approach in the literature [1, 9, 16, 12] is to borrow the iterative deflation method of the PCA setting. Typically, Hotelling's deflation is utilized by substituting an extracted pseudo-eigenvector for a true eigenvector in the deflation step of Eq. (2). This substitution, however, is seldom justified, for the properties of Hotelling's deflation, discussed in Section 2.1, depend crucially on the use of a true eigenvector.\n\nTo see what can go wrong when Hotelling's deflation is applied to a non-eigenvector, consider the following example.\n\nExample. Let C = [2 1; 1 1], a 2 x 2 matrix. The eigenvalues of C are \\lambda_1 = 2.6180 and \\lambda_2 = .3820. Let x = (1, 0)^T, a sparse pseudo-eigenvector, and \\hat{C} = C - x x^T C x x^T, the corresponding deflated matrix. Then \\hat{C} = [0 1; 1 1] with eigenvalues \\hat{\\lambda}_1 = 1.6180 and \\hat{\\lambda}_2 = -.6180. Thus, Hotelling's deflation does not in general preserve positive-semidefiniteness when applied to a non-eigenvector.\n\nThat S^p_+ is not closed under pseudo-eigenvector Hotelling's deflation is a serious failing, for most iterative sparse PCA methods assume a positive-semidefinite matrix on each iteration. A second, related shortcoming of pseudo-eigenvector Hotelling's deflation is its failure to render a pseudo-eigenvector orthogonal to a deflated matrix. If A is our matrix of interest, x is our pseudo-eigenvector with variance \\lambda = x^T A x, and \\hat{A} = A - x x^T A x x^T is our deflated matrix, then \\hat{A} x = A x - x x^T A x x^T x = A x - \\lambda x, which is zero iff x is a true eigenvector. Thus, even though the \u201cvariance\u201d of x w.r.t. \\hat{A} is zero (x^T \\hat{A} x = x^T A x - x^T x x^T A x x^T x = \\lambda - \\lambda = 0), \u201ccovariances\u201d of the form y^T \\hat{A} x for y != x may still be non-zero. This violation of the Cauchy-Schwarz inequality betrays a lack of positive-semidefiniteness and may encourage the reappearance of x as a component of future pseudo-eigenvectors.\n\n2.3 Alternative deflation techniques\n\nIn this section, we will attempt to rectify the failings of pseudo-eigenvector Hotelling's deflation by considering several alternative deflation techniques better suited to the sparse PCA setting. Note that any deflation-based sparse PCA method (e.g. [1, 9, 16, 12]) can utilize any of the deflation techniques discussed below.\n\n2.3.1 Projection deflation\n\nGiven a data matrix Y \\in R^{n x p} and an arbitrary unit vector x \\in R^p, an intuitive way to remove the contribution of x from Y is to project Y onto the orthocomplement of the space spanned by x: \\hat{Y} = Y (I - x x^T). 
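Both the 2 x 2 counterexample above and the projection remedy can be verified numerically. A minimal sketch (assuming NumPy): Hotelling's deflation of C with respect to the non-eigenvector x produces an indefinite matrix, while deflating via the projection (I - x x^T) C (I - x x^T) does not.

```python
import numpy as np

C = np.array([[2., 1.],
              [1., 1.]])
x = np.array([1., 0.])   # sparse pseudo-eigenvector; not an eigenvector of C

# Hotelling's deflation (Eq. (2)) applied to a non-eigenvector:
C_hot = C - np.outer(x, x) * (x @ C @ x)
assert np.linalg.eigvalsh(C_hot).min() < 0        # PSD is lost (eigenvalue -0.618)

# Projection deflation: project onto the orthocomplement of x instead.
P = np.eye(2) - np.outer(x, x)
C_proj = P @ C @ P
assert np.linalg.eigvalsh(C_proj).min() >= -1e-12  # PSD preserved
assert np.allclose(C_proj @ x, 0)                  # every covariance with x is zero
```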
If A is the sample covariance matrix of Y, then the sample covariance of \\hat{Y} is given by \\hat{A} = (I - x x^T) A (I - x x^T), which leads to our formulation for projection deflation:\n\nProjection deflation: A_t = A_{t-1} - x_t x_t^T A_{t-1} - A_{t-1} x_t x_t^T + x_t x_t^T A_{t-1} x_t x_t^T = (I - x_t x_t^T) A_{t-1} (I - x_t x_t^T). (4)\n\nNote that when x_t is a true eigenvector of A_{t-1} with eigenvalue \\lambda_t, projection deflation reduces to Hotelling's deflation:\n\nA_t = A_{t-1} - x_t x_t^T A_{t-1} - A_{t-1} x_t x_t^T + x_t x_t^T A_{t-1} x_t x_t^T = A_{t-1} - \\lambda_t x_t x_t^T - \\lambda_t x_t x_t^T + \\lambda_t x_t x_t^T = A_{t-1} - x_t x_t^T A_{t-1} x_t x_t^T.\n\nHowever, in the general case, when x_t is not a true eigenvector, projection deflation maintains the desirable properties that were lost to Hotelling's deflation. For example, positive-semidefiniteness is preserved: for all y, y^T A_t y = y^T (I - x_t x_t^T) A_{t-1} (I - x_t x_t^T) y = z^T A_{t-1} z, where z = (I - x_t x_t^T) y. Thus, if A_{t-1} \\in S^p_+, so is A_t. Moreover, A_t is rendered left and right orthogonal to x_t, as (I - x_t x_t^T) x_t = x_t - x_t = 0 and A_t is symmetric. Projection deflation therefore annihilates all covariances with x_t: for all v, v^T A_t x_t = x_t^T A_t v = 0.\n\n2.3.2 Schur complement deflation\n\nSince our goal in matrix deflation is to eliminate the influence, as measured through variance and covariances, of a newly discovered pseudo-eigenvector, it is reasonable to consider the conditional variance of our data variables given a pseudo-principal component. While this conditional variance is non-trivial to compute in general, it takes on a simple closed form when the variables are normally distributed. 
Let x \\in R^p be a unit vector and W \\in R^p be a Gaussian random vector, representing the joint distribution of the data variables. If W has covariance matrix \\Sigma, then (W, W x) has covariance matrix V = [\\Sigma, \\Sigma x; x^T \\Sigma, x^T \\Sigma x], and Var(W | W x) = \\Sigma - \\Sigma x x^T \\Sigma / (x^T \\Sigma x) whenever x^T \\Sigma x != 0 [15]. That is, the conditional variance is the Schur complement of the vector variance x^T \\Sigma x in the full covariance matrix V. By substituting sample covariance matrices for their population counterparts, we arrive at a new deflation technique:\n\nSchur complement deflation: A_t = A_{t-1} - A_{t-1} x_t x_t^T A_{t-1} / (x_t^T A_{t-1} x_t). (5)\n\nSchur complement deflation, like projection deflation, preserves positive-semidefiniteness. To see this, suppose A_{t-1} \\in S^p_+. Then, for all v, v^T A_t v = v^T A_{t-1} v - (v^T A_{t-1} x_t)^2 / (x_t^T A_{t-1} x_t) >= 0, as v^T A_{t-1} v x_t^T A_{t-1} x_t - (v^T A_{t-1} x_t)^2 >= 0 by the Cauchy-Schwarz inequality and x_t^T A_{t-1} x_t >= 0 as A_{t-1} \\in S^p_+.\n\nFurthermore, Schur complement deflation renders x_t left and right orthogonal to A_t, since A_t is symmetric and A_t x_t = A_{t-1} x_t - A_{t-1} x_t x_t^T A_{t-1} x_t / (x_t^T A_{t-1} x_t) = A_{t-1} x_t - A_{t-1} x_t = 0.\n\nAdditionally, Schur complement deflation reduces to Hotelling's deflation when x_t is an eigenvector of A_{t-1} with eigenvalue \\lambda_t != 0:\n\nA_t = A_{t-1} - A_{t-1} x_t x_t^T A_{t-1} / (x_t^T A_{t-1} x_t) = A_{t-1} - \\lambda_t x_t x_t^T \\lambda_t / \\lambda_t = A_{t-1} - x_t x_t^T A_{t-1} x_t x_t^T.\n\nWhile we motivated Schur complement deflation with a Gaussianity assumption, the technique admits a more general interpretation as a column projection of a data matrix. 
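The update of Eq. (5), and the two properties just derived, can be checked with a short sketch (assuming NumPy; the data and pseudo-eigenvector are illustrative). The final lines also verify the column-projection interpretation of the technique:

```python
import numpy as np

def schur_deflation(A, x):
    # Schur complement deflation, Eq. (5):
    # A_t = A_{t-1} - A_{t-1} x x^T A_{t-1} / (x^T A_{t-1} x)
    Ax = A @ x
    return A - np.outer(Ax, Ax) / (x @ Ax)

rng = np.random.default_rng(1)
Y = rng.standard_normal((100, 5))
Y -= Y.mean(axis=0)                    # mean-center the data
A = Y.T @ Y / 100                      # sample covariance matrix

x = np.zeros(5)
x[:2] = [0.6, 0.8]                     # a sparse, unit-norm pseudo-eigenvector
A1 = schur_deflation(A, x)

assert np.linalg.eigvalsh(A1).min() >= -1e-12   # PSD preserved
assert np.allclose(A1 @ x, 0)                   # x orthogonal to the deflated matrix

# Equivalently: project the columns of Y onto the orthocomplement of Y x.
Yx = Y @ x
Y_proj = Y - np.outer(Yx, Yx) @ Y / (Yx @ Yx)
assert np.allclose(Y_proj.T @ Y_proj / 100, A1)
```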
Suppose Y \\in R^{n x p} is a mean-centered data matrix, x \\in R^p has unit norm, and \\hat{Y} = (I - Y x x^T Y^T / ||Y x||^2) Y, the projection of the columns of Y onto the orthocomplement of the space spanned by the pseudo-principal component, Y x. If Y has sample covariance matrix A, then the sample covariance of \\hat{Y} is given by \\hat{A} = (1/n) Y^T (I - Y x x^T Y^T / ||Y x||^2)^T (I - Y x x^T Y^T / ||Y x||^2) Y = (1/n) Y^T (I - Y x x^T Y^T / ||Y x||^2) Y = A - A x x^T A / (x^T A x).\n\n2.3.3 Orthogonalized deflation\n\nWhile projection deflation and Schur complement deflation address the concerns raised by performing a single deflation in the non-eigenvector setting, new difficulties arise when we attempt to sequentially deflate a matrix with respect to a series of non-orthogonal pseudo-eigenvectors.\n\nWhenever we deal with a sequence of non-orthogonal vectors, we must take care to distinguish between the variance explained by a vector and the additional variance explained, given all previous vectors. These concepts are equivalent in the PCA setting, as true eigenvectors of a matrix are orthogonal, but, in general, the vectors extracted by sparse PCA will not be orthogonal. The additional variance explained by the t-th pseudo-eigenvector, x_t, is equivalent to the variance explained by the component of x_t orthogonal to the space spanned by all previous pseudo-eigenvectors, q_t = x_t - P_{t-1} x_t, where P_{t-1} is the orthogonal projection onto the space spanned by x_1, ..., x_{t-1}. On each deflation step, therefore, we only want to eliminate the variance associated with q_t. Annihilating the full vector x_t will often lead to \u201cdouble counting\u201d and could re-introduce components parallel to previously annihilated vectors. Consider the following example:\n\nExample. Let C_0 = I. If we apply projection deflation w.r.t. x_1 = (\\sqrt{2}/2, \\sqrt{2}/2)^T, the result is C_1 = [1/2, -1/2; -1/2, 1/2], and x_1 is orthogonal to C_1. If we next apply projection deflation to C_1 w.r.t. x_2 = (1, 0)^T, the result, C_2 = [0, 0; 0, 1/2], is no longer orthogonal to x_1.\n\nThe authors of [12] consider this issue of non-orthogonality in the context of Hotelling's deflation. Their modified deflation procedure is equivalent to Hotelling's deflation (Eq. (2)) for t = 1 and can be easily expressed in terms of a running Gram-Schmidt decomposition for t > 1:\n\nOrthogonalized Hotelling's deflation (OHD):\n\nq_t = (I - Q_{t-1} Q_{t-1}^T) x_t / ||(I - Q_{t-1} Q_{t-1}^T) x_t||,\nA_t = A_{t-1} - q_t q_t^T A_{t-1} q_t q_t^T, (6)\n\nwhere q_1 = x_1, and q_1, ..., q_{t-1} form the columns of Q_{t-1}. Since q_1, ..., q_{t-1} form an orthonormal basis for the space spanned by x_1, ..., x_{t-1}, we have that Q_{t-1} Q_{t-1}^T = P_{t-1}, the aforementioned orthogonal projection.\n\nSince the first round of OHD is equivalent to a standard application of Hotelling's deflation, OHD inherits all of the weaknesses discussed in Section 2.2. However, the same principles may be applied to projection deflation to generate an orthogonalized variant that inherits its desirable properties.\n\nSchur complement deflation is unique in that it preserves orthogonality in all subsequent rounds. That is, if a vector v is orthogonal to A_{t-1} for any t, then A_t v = A_{t-1} v - A_{t-1} x_t x_t^T A_{t-1} v / (x_t^T A_{t-1} x_t) = 0, as A_{t-1} v = 0. This further implies the following proposition.\n\nProposition 2.2. Orthogonalized Schur complement deflation is equivalent to Schur complement deflation.\n\nProof. 
Consider the t-th round of Schur complement deflation. We may write x_t = o_t + p_t, where p_t is in the subspace spanned by all previously extracted pseudo-eigenvectors and o_t is orthogonal to this subspace. Then we know that A_{t-1} p_t = 0, as p_t is a linear combination of x_1, ..., x_{t-1}, and A_{t-1} x_i = 0 for all i < t. Thus, x_t^T A_{t-1} x_t = o_t^T A_{t-1} o_t + o_t^T A_{t-1} p_t + p_t^T A_{t-1} o_t + p_t^T A_{t-1} p_t = o_t^T A_{t-1} o_t. Further, A_{t-1} x_t x_t^T A_{t-1} = A_{t-1} o_t o_t^T A_{t-1} + A_{t-1} o_t p_t^T A_{t-1} + A_{t-1} p_t o_t^T A_{t-1} + A_{t-1} p_t p_t^T A_{t-1} = A_{t-1} o_t o_t^T A_{t-1}. Hence, A_t = A_{t-1} - A_{t-1} o_t o_t^T A_{t-1} / (o_t^T A_{t-1} o_t) = A_{t-1} - A_{t-1} q_t q_t^T A_{t-1} / (q_t^T A_{t-1} q_t), as q_t = o_t / ||o_t||.\n\nTable 1 compares the properties of the various deflation techniques studied in this section.\n\nMethod | x_t^T A_t x_t = 0 | A_t x_t = 0 | A_t in S^p_+ | A_s x_t = 0, for all s > t\nHotelling's | yes | no | no | no\nProjection | yes | yes | yes | no\nSchur complement | yes | yes | yes | yes\nOrth. Hotelling's | yes | no | no | no\nOrth. Projection | yes | yes | yes | yes\n\nTable 1: Summary of sparse PCA deflation method properties\n\n3 Reformulating sparse PCA\n\nIn the previous section, we focused on heuristic deflation techniques that allowed us to reuse the cardinality-constrained optimization problem of Eq. (3). In this section, we explore a more principled alternative: reformulating the sparse PCA optimization problem to explicitly reflect our maximization objective on each round.\n\nRecall that the goal of sparse PCA is to find r cardinality-constrained pseudo-eigenvectors which together explain the most variance in the data. 
If we additionally constrain the sparse loadings to be generated sequentially, as in the PCA setting and the previous section, then a greedy approach of maximizing the additional variance of each new vector naturally suggests itself.\n\nOn round t, the additional variance of a vector x is given by q^T A_0 q / (q^T q), where A_0 is the data covariance matrix, q = (I - P_{t-1}) x, and P_{t-1} is the projection onto the space spanned by previous pseudo-eigenvectors x_1, ..., x_{t-1}. As q^T q = x^T (I - P_{t-1})(I - P_{t-1}) x = x^T (I - P_{t-1}) x, maximizing additional variance is equivalent to solving a cardinality-constrained maximum generalized eigenvalue problem,\n\nmax_x x^T (I - P_{t-1}) A_0 (I - P_{t-1}) x subject to x^T (I - P_{t-1}) x = 1, Card(x) <= k_t. (7)\n\nIf we let q_s = (I - P_{s-1}) x_s for all s <= t - 1, then q_1, ..., q_{t-1} form an orthonormal basis for the space spanned by x_1, ..., x_{t-1}. Writing I - P_{t-1} = I - \\sum_{s=1}^{t-1} q_s q_s^T = \\prod_{s=1}^{t-1} (I - q_s q_s^T) suggests a generalized deflation technique that leads to the solution of Eq. (7) on each round. We embed the technique into the following algorithm for sparse PCA:\n\nAlgorithm 1 Generalized Deflation Method for Sparse PCA\n\nGiven: A_0 in S^p_+, r in N, {k_1, ..., k_r} subset of N\nExecute:\n1. B_0 <- I\n2. For t := 1, ..., r:\n   - x_t <- argmax_{x : x^T B_{t-1} x = 1, Card(x) <= k_t} x^T A_{t-1} x\n   - q_t <- B_{t-1} x_t\n   - A_t <- (I - q_t q_t^T) A_{t-1} (I - q_t q_t^T)\n   - B_t <- B_{t-1} (I - q_t q_t^T)\n   - x_t <- x_t / ||x_t||\nReturn: {x_1, ..., x_r}\n\nAdding a cardinality constraint to a maximum eigenvalue problem renders the optimization problem NP-hard [10], but any of several leading sparse eigenvalue methods, including GSLDA of [10], DCPCA of [12], and DSPCA of [1] (with a modified trace constraint), can be adapted to solve this cardinality-constrained generalized eigenvalue problem.\n\n4 Experiments\n\nIn this section, we present several experiments on real-world datasets to demonstrate the value added by our newly derived deflation techniques. We run our experiments with Matlab implementations of DCPCA [12] (with the continuity correction of [9]) and GSLDA [10], fitted with each of the following deflation techniques: Hotelling's (HD), projection (PD), Schur complement (SCD), orthogonalized Hotelling's (OHD), orthogonalized projection (OPD), and generalized (GD).\n\n4.1 Pit props dataset\n\nThe pit props dataset [5] with 13 variables and 180 observations has become a de facto standard for benchmarking sparse PCA methods. To demonstrate the disparate behavior of differing deflation methods, we utilize each sparse PCA algorithm and deflation technique to successively extract six sparse loadings, each constrained to have cardinality less than or equal to k_t = 4. We report the additional variances explained by each sparse vector in Table 2 and the cumulative percentage variance explained on each iteration in Table 3. 
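Algorithm 1 from the previous section can be prototyped in a few lines. This is a minimal sketch, not the authors' implementation: the cardinality-constrained generalized eigenvalue subproblem is solved by a hypothetical brute-force search over supports (practical only for small p), standing in for solvers such as GSLDA or DCPCA.

```python
import numpy as np
from itertools import combinations

def card_constrained_gev(A, B, k):
    # Brute-force stand-in for the subproblem of Algorithm 1 / Eq. (7):
    #   max_x  x^T A x   subject to  x^T B x = 1, Card(x) <= k.
    # Enumerates every size-k support; illustrative only.
    p = A.shape[0]
    best_val, best_x = -np.inf, None
    for S in map(list, combinations(range(p), k)):
        A_S, B_S = A[np.ix_(S, S)], B[np.ix_(S, S)]
        vals, vecs = np.linalg.eig(np.linalg.pinv(B_S) @ A_S)
        u = vecs[:, np.argmax(vals.real)].real
        norm2 = u @ B_S @ u
        if norm2 <= 1e-12:
            continue                      # support already fully deflated
        u = u / np.sqrt(norm2)            # enforce x^T B x = 1
        val = u @ A_S @ u
        if val > best_val:
            best_x = np.zeros(p)
            best_x[S] = u
            best_val = val
    return best_x

def generalized_deflation_spca(A0, r, ks):
    # Algorithm 1: generalized deflation method for sparse PCA.
    p = A0.shape[0]
    A, B = A0.copy(), np.eye(p)
    loadings = []
    for t in range(r):
        x = card_constrained_gev(A, B, ks[t])
        q = B @ x                         # component of x orthogonal to past q's
        q = q / np.linalg.norm(q)
        P = np.eye(p) - np.outer(q, q)
        A = P @ A @ P                     # A_t
        B = B @ P                         # B_t
        loadings.append(x / np.linalg.norm(x))
    return loadings

# Toy check: with a diagonal covariance and cardinality 1, the method
# recovers the coordinates with the largest variances, in order.
xs = generalized_deflation_spca(np.diag([4., 3., 2., 1.]), 2, [1, 1])
```

At the scale of the experiments below, the brute-force inner solver would be replaced by one of the sparse eigenvalue methods named above.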
For reference, the first 6 true principal components of the pit props dataset capture 87% of the variance.\n\nDCPCA\nPC | HD | PD | SCD | OHD | OPD | GD\n1 | 2.938 | 2.938 | 2.938 | 2.938 | 2.938 | 2.938\n2 | 2.209 | 2.209 | 2.076 | 2.209 | 2.209 | 2.209\n3 | 0.935 | 1.464 | 1.926 | 0.935 | 1.464 | 1.477\n4 | 1.301 | 1.464 | 1.164 | 0.799 | 1.464 | 1.464\n5 | 1.206 | 1.057 | 1.477 | 0.901 | 1.058 | 1.178\n6 | 0.959 | 0.980 | 0.725 | 0.431 | 0.904 | 0.988\n\nGSLDA\nPC | HD | PD | SCD | OHD | OPD | GD\n1 | 2.938 | 2.938 | 2.938 | 2.938 | 2.938 | 2.938\n2 | 2.107 | 2.280 | 2.065 | 2.107 | 2.280 | 2.280\n3 | 1.988 | 2.067 | 2.243 | 1.985 | 2.067 | 2.072\n4 | 1.352 | 1.304 | 1.120 | 1.335 | 1.305 | 1.360\n5 | 1.067 | 1.120 | 1.164 | 0.497 | 1.125 | 1.127\n6 | 0.557 | 0.853 | 0.841 | 0.489 | 0.852 | 0.908\n\nTable 2: Additional variance explained by each of the first 6 sparse loadings extracted from the Pit Props dataset.\n\nOn the DCPCA run, Hotelling's deflation explains 73.4% of the variance, while the best performing methods, Schur complement deflation and generalized deflation, explain approximately 79% of the variance each. Projection deflation and its orthogonalized variant also outperform Hotelling's deflation, while orthogonalized Hotelling's shows the worst performance with only 63.2% of the variance explained. Similar results are obtained when the discrete method of GSLDA is used. Generalized deflation and the two projection deflations dominate, with GD achieving the maximum cumulative variance explained on each round. 
In contrast, the more standard Hotelling's and orthogonalized Hotelling's underperform the remaining techniques.\n\nDCPCA\nPC | HD | PD | SCD | OHD | OPD | GD\n1 | 22.6% | 22.6% | 22.6% | 22.6% | 22.6% | 22.6%\n2 | 39.6% | 39.6% | 38.6% | 39.6% | 39.6% | 39.6%\n3 | 46.8% | 50.9% | 53.4% | 46.8% | 50.9% | 51.0%\n4 | 56.8% | 62.1% | 62.3% | 52.9% | 62.1% | 62.2%\n5 | 66.1% | 70.2% | 73.7% | 59.9% | 70.2% | 71.3%\n6 | 73.4% | 77.8% | 79.3% | 63.2% | 77.2% | 78.9%\n\nGSLDA\nPC | HD | PD | SCD | OHD | OPD | GD\n1 | 22.6% | 22.6% | 22.6% | 22.6% | 22.6% | 22.6%\n2 | 38.8% | 40.1% | 38.5% | 38.8% | 40.1% | 40.1%\n3 | 54.1% | 56.0% | 55.7% | 54.1% | 56.0% | 56.1%\n4 | 64.5% | 66.1% | 64.4% | 64.3% | 66.1% | 66.5%\n5 | 72.7% | 74.7% | 73.3% | 68.2% | 74.7% | 75.2%\n6 | 77.0% | 81.2% | 79.8% | 71.9% | 81.3% | 82.2%\n\nTable 3: Cumulative percentage variance explained by the first 6 sparse loadings extracted from the Pit Props dataset.\n\n4.2 Gene expression data\n\nThe Berkeley Drosophila Transcription Network Project (BDTNP) 3D gene expression data [4] contains gene expression levels measured in each nucleus of developing Drosophila embryos and averaged across many embryos and developmental stages. Here, we analyze 0-3 1160524183713 s10436-29ap05-02.vpc, an aggregate VirtualEmbryo containing 21 genes and 5759 example nuclei. 
We run GSLDA for eight iterations with cardinality pattern 9, 7, 6, 5, 3, 2, 2, 2 and report the results in Table 4.\n\nGSLDA additional variance explained\nPC | HD | PD | SCD | OHD | OPD | GD\n1 | 1.784 | 1.784 | 1.784 | 1.784 | 1.784 | 1.784\n2 | 1.464 | 1.453 | 1.453 | 1.464 | 1.453 | 1.466\n3 | 1.178 | 1.178 | 1.179 | 1.176 | 1.178 | 1.187\n4 | 0.716 | 0.736 | 0.716 | 0.713 | 0.721 | 0.743\n5 | 0.444 | 0.574 | 0.571 | 0.460 | 0.571 | 0.616\n6 | 0.303 | 0.306 | 0.278 | 0.354 | 0.244 | 0.332\n7 | 0.271 | 0.256 | 0.262 | 0.239 | 0.313 | 0.304\n8 | 0.223 | 0.239 | 0.299 | 0.257 | 0.245 | 0.329\n\nGSLDA cumulative percentage variance\nPC | HD | PD | SCD | OHD | OPD | GD\n1 | 21.0% | 21.0% | 21.0% | 21.0% | 21.0% | 21.0%\n2 | 38.2% | 38.1% | 38.1% | 38.2% | 38.1% | 38.2%\n3 | 52.1% | 51.9% | 52.0% | 52.0% | 51.9% | 52.2%\n4 | 60.5% | 60.6% | 60.4% | 60.4% | 60.4% | 61.0%\n5 | 65.7% | 67.4% | 67.1% | 65.9% | 67.1% | 68.2%\n6 | 69.3% | 71.0% | 70.4% | 70.0% | 70.0% | 72.1%\n7 | 72.5% | 74.0% | 73.4% | 72.8% | 73.7% | 75.7%\n8 | 75.1% | 76.8% | 77.0% | 75.9% | 76.6% | 79.6%\n\nTable 4: Additional variance and cumulative percentage variance explained by the first 8 sparse loadings of GSLDA on the BDTNP VirtualEmbryo.\n\nThe results of the gene expression experiment show a clear hierarchy among the deflation methods. The generalized deflation technique performs best, achieving the largest additional variance on every round and a final cumulative variance of 79.6%. Schur complement deflation, projection deflation, and orthogonalized projection deflation all perform comparably, explaining roughly 77% of the total variance after 8 rounds. 
In last place are the standard Hotelling's and orthogonalized Hotelling's deflations, both of which explain less than 76% of the variance after 8 rounds.\n\n5 Conclusion\n\nIn this work, we have exposed the theoretical and empirical shortcomings of Hotelling's deflation in the sparse PCA setting and developed several alternative methods more suitable for non-eigenvector deflation. Notably, the utility of these procedures is not limited to the sparse PCA setting. Indeed, the methods presented can be applied to any of a number of constrained eigendecomposition-based problems, including sparse canonical correlation analysis [13] and linear discriminant analysis [10].\n\nAcknowledgments\n\nThis work was supported by AT&T through the AT&T Labs Fellowship Program.\n\nReferences\n\n[1] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A Direct Formulation for Sparse PCA using Semidefinite Programming. In Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, December 2004.\n\n[2] A. d'Aspremont, F. R. Bach, and L. El Ghaoui. Full regularization path for sparse principal component analysis. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), vol. 227, pp. 177-184. ACM, New York, NY, 2007.\n\n[3] J. Cadima and I. Jolliffe. Loadings and correlations in the interpretation of principal components. Applied Statistics, 22:203-214, 1995.\n\n[4] C.C. Fowlkes, C.L. Luengo Hendriks, S.V. Ker\u00e4nen, G.H. Weber, O. R\u00fcbel, M.-Y. Huang, S. Chatoor, A.H. DePace, L. Simirenko, C. Henriquez, et al. Cell, 133, pp. 364-374, 2008.\n\n[5] J. Jeffers. Two case studies in the application of principal components. Applied Statistics, 16:225-236, 1967.\n\n[6] I.T. Jolliffe and M. Uddin. A Modified Principal Component Technique based on the Lasso. Journal of Computational and Graphical Statistics, 12:531-547, 2003.\n\n[7] I.T. 
Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.\n\n[8] I.T. Jolliffe. Rotation of principal components: choice of normalization constraints. Journal of Applied Statistics, 22:29-35, 1995.\n\n[9] B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. Advances in Neural Information Processing Systems, 18, 2006.\n\n[10] B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proc. ICML, 2006.\n\n[11] Y. Saad. Projection and deflation methods for partial pole assignment in linear state feedback. IEEE Trans. Automat. Contr., vol. 33, pp. 290-297, Mar. 1988.\n\n[12] B.K. Sriperumbudur, D.A. Torres, and G.R.G. Lanckriet. Sparse eigen methods by DC programming. In Proceedings of the 24th International Conference on Machine Learning, pp. 831-838, 2007.\n\n[13] D. Torres, B.K. Sriperumbudur, and G. Lanckriet. Finding Musically Meaningful Words by Sparse CCA. In Neural Information Processing Systems (NIPS) Workshop on Music, the Brain and Cognition, 2007.\n\n[14] P. White. The Computation of Eigenvalues and Eigenvectors of a Matrix. Journal of the Society for Industrial and Applied Mathematics, 6(4):393-437, Dec. 1958.\n\n[15] F. Zhang (Ed.). The Schur Complement and Its Applications. Springer, Dordrecht, 2005.\n\n[16] Z. Zhang, H. Zha, and H. Simon. Low-rank approximations with sparse factors I: Basic algorithms and error analysis. SIAM J. Matrix Anal. Appl., 23:706-727, 2002.\n\n[17] Z. Zhang, H. Zha, and H. Simon. Low-rank approximations with sparse factors II: Penalized methods with discrete Newton-like iterations. SIAM J. Matrix Anal. Appl., 25:901-920, 2004.\n\n[18] H. Zou, T. Hastie, and R. Tibshirani. Sparse Principal Component Analysis. 
Technical Report, Statistics Department, Stanford University, 2004.\n", "award": [], "sourceid": 197, "authors": [{"given_name": "Lester", "family_name": "Mackey", "institution": null}]}