{"title": "Sparse Features for PCA-Like Linear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 2285, "page_last": 2293, "abstract": "Principal Components Analysis~(PCA) is often used as a feature extraction procedure. Given a matrix $X \\in \\mathbb{R}^{n \\times d}$, whose rows represent $n$ data points with respect to $d$ features, the top $k$ right singular vectors of $X$ (the so-called \\textit{eigenfeatures}), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a \\textit{small} number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while \\emph{provably} achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.", "full_text": "Sparse Features for PCA-Like Linear Regression\n\nChristos Boutsidis\n\nMathematical Sciences Department\nIBM T. J. Watson Research Center\n\nYorktown Heights, New York\ncboutsi@us.ibm.com\n\nPetros Drineas\n\nComputer Science Department\nRensselaer Polytechnic Institute\n\nTroy, NY 12180\n\ndrinep@cs.rpi.edu\n\nMalik Magdon-Ismail\n\nComputer Science Department\nRensselaer Polytechnic Institute\n\nTroy, NY 12180\n\nmagdon@cs.rpi.edu\n\nAbstract\n\nPrincipal Components Analysis (PCA) is often used as a feature extraction proce-\ndure. Given a matrix X \u2208 Rn\u00d7d, whose rows represent n data points with respect\nto d features, the top k right singular vectors of X (the so-called eigenfeatures),\nare arbitrary linear combinations of all available features. 
The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.

1 Introduction

Least-squares analysis was introduced by Gauss in 1795 and has since bloomed into a staple of the data analyst. Assume the usual setting with n tuples (x1, y1), ..., (xn, yn) in R^d, where the xi are points and the yi are targets. The vector of regression weights w* ∈ R^d minimizes (over all w ∈ R^d) the RMS in-sample error

$$\mathcal{E}(\mathbf{w}) = \sqrt{\sum_{i=1}^{n} (\mathbf{x}_i \cdot \mathbf{w} - y_i)^2} = \|X\mathbf{w} - \mathbf{y}\|_2.$$

In the above, X ∈ R^{n×d} is the data matrix whose rows are the vectors xi (i.e., Xij = xi[j]), and y ∈ R^n is the target vector (i.e., y[i] = yi). We will use the more convenient matrix formulation¹: given X and y, we seek a vector w* that minimizes ‖Xw − y‖2. The minimal-norm such vector can be computed via the Moore-Penrose pseudo-inverse of X: w* = X⁺y. The optimal in-sample error is then

$$\mathcal{E}(\mathbf{w}^*) = \|\mathbf{y} - XX^{+}\mathbf{y}\|_2.$$

¹For the sake of simplicity, we assume d ≤ n and rank(X) = d in our exposition; neither assumption is necessary.

When the data is noisy and X is ill-conditioned, X⁺ becomes unstable to small perturbations and overfitting can become a serious problem. Practitioners deal with such situations by regularizing the regression.
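As a small concrete illustration (ours, not from the paper; the matrix sizes are arbitrary), the minimal-norm least-squares solution and the identity E(w*) = ‖y − XX⁺y‖2 can be checked with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 8                         # n data points, d features (arbitrary sizes)
X = rng.standard_normal((n, d))      # data matrix: rows are the points x_i
y = rng.standard_normal(n)           # target vector

w_star = np.linalg.pinv(X) @ y       # minimal-norm least-squares solution w* = X^+ y
err = np.linalg.norm(X @ w_star - y)

# E(w*) = ||y - X X^+ y||_2: the same error written via the projection X X^+.
err_proj = np.linalg.norm(y - X @ (np.linalg.pinv(X) @ y))
assert np.isclose(err, err_proj)
```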
Popular regularization methods include, for example, the Lasso [28], Tikhonov regularization [17], and top-k PCA regression or truncated SVD regularization [21]. In general, such methods encourage some form of parsimony, thereby reducing the number of effective degrees of freedom available to fit the data. Our focus is on top-k PCA regression, which can be viewed as regression onto the top-k principal components or, equivalently, the top-k eigenfeatures. The eigenfeatures are the top-k right singular vectors of X and are arbitrary linear combinations of all available input features. The question we tackle is whether one can efficiently extract sparse eigenfeatures (i.e., eigenfeatures that are linear combinations of only a small number of the available features) that have nearly the same performance as the top-k eigenfeatures.

Basic notation. A, B, ... are matrices; a, b, ... are vectors; i, j, ... are integers; In is the n × n identity matrix; 0m×n is the m × n matrix of zeros; ei is the i-th standard basis vector (whose dimensionality will be clear from the context). For vectors, we use the Euclidean norm ‖·‖2; for matrices, the Frobenius and the spectral norms: $\|X\|_F^2 = \sum_{i,j} X_{ij}^2$ and ‖X‖2 = σ1(X), i.e., the largest singular value of X.

Top-k PCA Regression. Let X = UΣVᵀ be the singular value decomposition of X, where U (resp. V) is the matrix of left (resp. right) singular vectors of X, with the singular values in the diagonal matrix Σ. For k ≤ d, let Uk, Σk, and Vk contain only the top-k singular vectors and associated singular values. The best rank-k reconstruction of X in the Frobenius norm can be obtained from this truncated singular value decomposition as Xk = UkΣkVkᵀ. The k right singular vectors in Vk are called the top-k eigenfeatures.
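For concreteness, here is a small NumPy sketch (ours, with arbitrary sizes) of extracting the top-k eigenfeatures and running top-k PCA regression on them; it also checks that the in-sample error equals ‖y − UkUkᵀy‖2, the form used repeatedly below:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 60, 10, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Truncated SVD: U_k and V_k hold the top-k left/right singular vectors of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, Vk = U[:, :k], Vt[:k].T

Fk = X @ Vk                               # eigenfeature matrix F_k = X V_k = U_k Sigma_k
wk = np.linalg.pinv(Fk) @ y               # top-k PCA regression weights w_k* = F_k^+ y

err_f = np.linalg.norm(y - Fk @ wk)       # in-sample error of top-k PCA regression
err_u = np.linalg.norm(y - Uk @ (Uk.T @ y))
assert np.isclose(err_f, err_u)           # same as projecting y onto span(U_k)
```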
The projections of the data points onto the top k eigenfeatures are obtained by projecting the xi's onto the columns of Vk to obtain Fk = XVk = UΣVᵀVk = UkΣk. Now, each data point (row) in Fk has only k dimensions. Each column of Fk contains a particular eigenfeature's value for every data point and is a linear combination of the columns of X.

Top-k PCA regression uses Fk as the data matrix and y as the target vector to produce regression weights w*k = Fk⁺y. The in-sample error of this k-dimensional regression is equal to

$$\|\mathbf{y} - F_k \mathbf{w}_k^*\|_2 = \|\mathbf{y} - F_k F_k^{+}\mathbf{y}\|_2 = \|\mathbf{y} - U_k \Sigma_k \Sigma_k^{-1} U_k^{\mathrm{T}}\mathbf{y}\|_2 = \|\mathbf{y} - U_k U_k^{\mathrm{T}}\mathbf{y}\|_2.$$

The weights w*k are k-dimensional and cannot be applied to X, but the equivalent weights Vk w*k can be applied to X, and they have the same in-sample error with respect to X:

$$\mathcal{E}(V_k \mathbf{w}_k^*) = \|\mathbf{y} - X V_k \mathbf{w}_k^*\|_2 = \|\mathbf{y} - F_k \mathbf{w}_k^*\|_2 = \|\mathbf{y} - U_k U_k^{\mathrm{T}}\mathbf{y}\|_2.$$

Hence, we will refer to both w*k and Vk w*k as the top-k PCA regression weights and, for simplicity, we will overload w*k to refer to both of these weight vectors (the dimension will make it clear which one we are talking about). In practice, k is chosen to measure the "effective dimension" of the data, and, typically, k ≪ rank(X) = d. One way to choose k is so that ‖X − Xk‖F ≪ σk(X) (the "energy" in the k-th principal component is large compared to the energy in all smaller principal components). We do not argue the merits of top-k PCA regression; we just note that it is a common tool for regularizing regression.

Problem Formulation.
Given X ∈ R^{n×d}, k (the number of target eigenfeatures for top-k PCA regression), and r > k (the sparsity parameter), we seek to extract a set of at most k sparse eigenfeatures V̂k which use at most r of the actual dimensions. Let F̂k = XV̂k ∈ R^{n×k} denote the matrix whose columns are the k extracted sparse eigenfeatures, each a linear combination of a set of at most r actual features. Our goal is to obtain sparse features for which the vector of sparse regression weights ŵk = F̂k⁺y results in an in-sample error ‖y − F̂kF̂k⁺y‖2 that is close to the top-k PCA regression error ‖y − FkFk⁺y‖2. Just as with top-k PCA regression, we can define the equivalent d-dimensional weights V̂kŵk; we will overload ŵk to refer to these weights as well.

Finally, we conclude by noting that while our discussion above has focused on simple linear regression, the problem can also be defined for multiple regression, where the vector y is replaced by a matrix Y ∈ R^{n×ω}, with ω ≥ 1. The weight vector w becomes a weight matrix W, where each column of W contains the weights from the regression of the corresponding column of Y onto the features. All our results hold in this general setting as well, and we will actually present our main contributions in the context of multiple regression.

2 Our contributions

Recall from our discussion at the end of the introduction that we will present all our results in the general setting, where the target vector y is replaced by a matrix Y ∈ R^{n×ω}. Our first theorem argues that there exists a polynomial-time deterministic algorithm that constructs a feature matrix F̂k ∈ R^{n×k}, such that each feature (column of F̂k) is a linear combination of at most r actual features (columns) of X and results in small in-sample error.
Again, this should be contrasted with top-k PCA regression, which constructs a feature matrix Fk such that each feature (column of Fk) is a linear combination of all features (columns) of X. Our theorems argue that the in-sample error of our features is almost as good as the in-sample error of top-k PCA regression, which uses dense features.

Theorem 1 (Deterministic Feature Extraction). Let X ∈ R^{n×d} and Y ∈ R^{n×ω} be the input matrices in a multiple regression problem. Let k > 0 be a target rank for top-k PCA regression on X and Y. For any r > k, there exists an algorithm that constructs a feature matrix F̂k = XV̂k ∈ R^{n×k}, such that every column of F̂k is a linear combination of (the same) at most r columns of X, and

$$\|Y - X\widehat{W}_k\|_F = \|Y - \widehat{F}_k \widehat{F}_k^{+} Y\|_F \le \|Y - X W_k^*\|_F + \left(1 + \sqrt{\frac{9k}{r}}\right) \frac{\|X - X_k\|_F}{\sigma_k(X)} \, \|Y\|_2.$$

(σk(X) is the k-th singular value of X.) The running time of the proposed algorithm is T(Vk) + O(ndk + nrk²), where T(Vk) is the time required to compute the matrix Vk, the top-k right singular vectors of X.

Theorem 1 says that one can construct k features with sparsity O(k) and obtain a regression error comparable to that attained by the dense top-k PCA features, up to an additive term that is proportional to Δk = ‖X − Xk‖F / σk(X).

To construct the features satisfying the guarantees of the above theorem, we first employ the algorithm DSF-Select (see Table 1 and Section 4.3) to select r columns of X and form the matrix C ∈ R^{n×r}. Now, let ΠC,k(Y) denote the best rank-k approximation (with respect to the Frobenius norm) to Y in the column span of C. In other words, ΠC,k(Y) is a rank-k matrix that minimizes ‖Y − ΠC,k(Y)‖F over all rank-k matrices in the column span of C.
Efficient algorithms are known for computing ΠC,k(Y) and have been described in [2]. Given ΠC,k(Y), the sparse eigenfeatures can be computed efficiently as follows: first, set Ψ = C⁺ΠC,k(Y). Observe that

$$C\Psi = CC^{+}\Pi_{C,k}(Y) = \Pi_{C,k}(Y).$$

The last equality follows because CC⁺ projects onto the column span of C and ΠC,k(Y) is already in the column span of C. Ψ has rank at most k because ΠC,k(Y) has rank at most k. Let the SVD of Ψ be Ψ = UψΣψVψᵀ and set F̂k = CUψΣψ ∈ R^{n×k}. Clearly, each column of F̂k is a linear combination of (the same) at most r columns of X (the columns in C). The sparse eigenfeatures themselves can also be obtained, because F̂k = XV̂k implies V̂k = X⁺F̂k.

To prove that F̂k is a good set of sparse features, we first relate the regression error from using F̂k to how well ΠC,k(Y) approximates Y:

$$\|Y - \Pi_{C,k}(Y)\|_F = \|Y - C\Psi\|_F = \|Y - C U_\psi \Sigma_\psi V_\psi^{\mathrm{T}}\|_F = \|Y - \widehat{F}_k V_\psi^{\mathrm{T}}\|_F \ge \|Y - \widehat{F}_k \widehat{F}_k^{+} Y\|_F.$$

The last inequality follows because F̂k⁺Y are the optimal regression weights for the features F̂k. The reverse inequality also holds, because ΠC,k(Y) is the best rank-k approximation to Y in the column span of C. Thus,

$$\|Y - \widehat{F}_k \widehat{F}_k^{+} Y\|_F = \|Y - \Pi_{C,k}(Y)\|_F.$$

The upshot of the above discussion is that if we can find a matrix C consisting of columns of X for which ‖Y − ΠC,k(Y)‖F is small, then we immediately have good sparse eigenfeatures. Indeed, all that remains to complete the proof of Theorem 1 is to bound ‖Y − ΠC,k(Y)‖F for the columns C returned by the algorithm DSF-Select.

Our second result employs the algorithm RSF-Select (see Table 2 and Section 4.4) to select r columns of X and again form the matrix C ∈ R^{n×r}.
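The construction just described can be sketched in NumPy (our illustration, not the authors' code; the column subset is hard-coded here for simplicity, whereas the paper selects it via DSF-Select or RSF-Select, and ΠC,k(Y) is computed as Q(QᵀY)k for an orthonormal basis Q of C, one standard way to realize the computation from [2]):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k, r, omega = 40, 12, 3, 6, 5
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, omega))       # multiple-regression targets

cols = [0, 2, 3, 7, 8, 11]                # stand-in for the columns chosen by DSF/RSF-Select
C = X[:, cols]                            # C in R^{n x r}

# Pi_{C,k}(Y): best rank-k approximation to Y inside the column span of C,
# computed as Q (Q^T Y)_k with Q an orthonormal basis for span(C).
Q, _ = np.linalg.qr(C)
Ub, sb, Vbt = np.linalg.svd(Q.T @ Y, full_matrices=False)
Pi = Q @ (Ub[:, :k] * sb[:k]) @ Vbt[:k]

# Sparse eigenfeatures: Psi = C^+ Pi_{C,k}(Y); with Psi = U S V^T, set F_hat = C U S.
Psi = np.linalg.pinv(C) @ Pi
Up, sp, Vpt = np.linalg.svd(Psi, full_matrices=False)
F_hat = C @ (Up[:, :k] * sp[:k])

# Regressing Y on F_hat reproduces the error ||Y - Pi_{C,k}(Y)||_F.
resid = Y - F_hat @ np.linalg.pinv(F_hat) @ Y
assert np.isclose(np.linalg.norm(resid), np.linalg.norm(Y - Pi))
```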
One then proceeds to construct ΠC,k(Y) and F̂k as described above. The advantage of this approach is simplicity, better efficiency, and a slightly better error bound, at the expense of logarithmically worse sparsity.

Theorem 2 (Randomized Feature Extraction). Let X ∈ R^{n×d} and Y ∈ R^{n×ω} be the input matrices in a multiple regression problem. Let k > 0 be a target rank for top-k PCA regression on X and Y. For any r > 144k ln(20k), there exists a randomized algorithm that constructs a feature matrix F̂k = XV̂k ∈ R^{n×k}, such that every column of F̂k is a linear combination of at most r columns of X and, with probability at least 0.7 (over the random choices made in the algorithm),

$$\|Y - X\widehat{W}_k\|_F = \|Y - \widehat{F}_k \widehat{F}_k^{+} Y\|_F \le \|Y - X W_k^*\|_F + \sqrt{\frac{36 k \ln(20k)}{r}} \, \frac{\|X - X_k\|_F}{\sigma_k(X)} \, \|Y\|_2.$$

The running time of the proposed algorithm is T(Vk) + O(dk + r log r).

3 Connections with prior work

A variant of our problem is the identification of a matrix C consisting of a small number (say r) of columns of X such that the regression of Y onto C (as opposed to k features derived from C) gives small in-sample error. This is the sparse approximation problem, where the number of non-zero weights in the regression vector is restricted to r. This problem is known to be NP-hard [25]. Sparse approximation has important applications, and many approximation algorithms have been presented [29, 9, 30]; the proposed algorithms are typically either greedy or based on convex optimization relaxations of the objective. An important difference between sparse approximation and sparse PCA regression is that our goal is not to minimize the error under a sparsity constraint, but to match the top-k PCA regularized regression under a sparsity constraint.
We argue that it is possible to achieve provably accurate sparse PCA regression, i.e., to use sparse features instead of dense ones.

If X = Y (approximating X using the columns of X), then this is the column-based matrix reconstruction problem, which has received much attention in the existing literature [16, 18, 14, 26, 5, 12, 20]. In this paper, we study the more general problem where X ≠ Y, which turns out to be considerably more difficult.

Input sparseness is closely related to feature selection and automatic relevance determination. Research in this area is vast, and we refer the reader to [19] for a high-level view of the field. Again, the goal in this area is different from ours: those methods seek to reduce dimensionality and improve out-of-sample error, whereas our goal is to provide sparse PCA features that are almost as good as the exact principal components. While it is definitely the case that many methods outperform top-k PCA regression, especially for d ≫ n, this discussion is orthogonal to our work.

The closest result to ours in the prior literature is the so-called rank-revealing QR (RRQR) factorization [8]. The authors use a QR-like decomposition to select exactly k columns of X and compare their sparse solution vector ŵk with the top-k PCA regularized solution w*k. They show that

$$\|\mathbf{w}_k^* - \widehat{\mathbf{w}}_k\|_2 \le \sqrt{k(n-k)+1}\; \frac{\|X - X_k\|_2}{\sigma_k(X)}\; \Delta,$$

where Δ = 2‖ŵk‖2 + ‖y − Xw*k‖2 / σk(X). This bound is similar to our bound in Theorem 1, but it only applies to r = k and is considerably weaker.
For example, $\sqrt{k(n-k)+1}\,\|X - X_k\|_2 \ge \sqrt{k}\,\|X - X_k\|_F$; note also that the dependence of the above bound on 1/σk(X) is generally worse than ours.

The importance of the right singular vectors in matrix reconstruction problems (including PCA) has been heavily studied in the prior literature, going back to work by Jolliffe in 1972 [22]. The idea of sampling columns from a matrix X with probabilities derived from Vkᵀ (as we do in Theorem 2) was introduced in [15] in order to construct coresets for regression problems by sampling data points (rows of the matrix X) as opposed to features (columns of the matrix X). Other prior work, including [15, 13, 27, 6, 4], has employed variants of this sampling scheme; indeed, we borrow proof techniques from the above papers in our work. Finally, we note that our deterministic feature selection algorithm (Theorem 1) uses a sparsification tool developed in [2] for column-based matrix reconstruction. This tool is a generalization of algorithms originally introduced in [1].

4 Our algorithms

Our algorithms emerge from the constructive proofs of Theorems 1 and 2. Both algorithms necessitate access to the top-k right singular vectors of X, namely the matrix Vk ∈ R^{d×k}. In our experiments, we used PROPACK [23] in order to compute Vk iteratively; PROPACK is a fast alternative to the exact SVD. Our first algorithm (DSF-Select) is deterministic, while the second algorithm (RSF-Select) is randomized, requiring logarithmically more columns to guarantee the theoretical bounds. Prior to describing our algorithms in detail, we introduce useful notation on sampling and rescaling matrices, as well as a matrix factorization lemma (Lemma 3) that will be critical in our proofs.

4.1 Sampling and rescaling matrices

Let C ∈ R^{n×r} contain r columns of X ∈ R^{n×d}.
We can express the matrix C as C = XΩ, where the sampling matrix Ω ∈ R^{d×r} is equal to [e_{i1}, ..., e_{ir}] and the ei are standard basis vectors in R^d. In our proofs, we will make use of S ∈ R^{r×r}, a diagonal rescaling matrix with positive entries on the diagonal. Our column selection algorithms return a sampling and a rescaling matrix, so that XΩS contains a subset of rescaled columns of X. The rescaling is benign, since it does not affect the span of the columns of C = XΩ and thus does not affect the quantity of interest, namely ΠC,k(Y).

4.2 A structural result using matrix factorizations

We now present a matrix reconstruction lemma that will be the starting point for our algorithms. Let Y ∈ R^{n×ω} be a target matrix and let X ∈ R^{n×d} be the basis matrix that we will use in order to reconstruct Y. More specifically, we seek a sparse reconstruction of Y from X; in other words, we would like to choose r ≪ d columns of X and form a matrix C ∈ R^{n×r} such that ‖Y − ΠC,k(Y)‖F is small. Let Z ∈ R^{d×k} be an orthogonal matrix (i.e., ZᵀZ = Ik), and express the matrix X as follows:

$$X = HZ^{\mathrm{T}} + E,$$

where H is some matrix in R^{n×k} and E ∈ R^{n×d} is the residual error of the factorization. It is easy to prove that the Frobenius or spectral norm of E is minimized when H = XZ. Let Ω ∈ R^{d×r} and S ∈ R^{r×r} be a sampling and a rescaling matrix, respectively, as defined in the previous section, and let C = XΩ ∈ R^{n×r}. Then, the following lemma holds (see [3] for a detailed proof).

Lemma 3 (Generalized Column Reconstruction). Using the above notation, if the rank of the matrix ZᵀΩS is equal to k, then

$$\|Y - \Pi_{C,k}(Y)\|_F \le \|Y - HH^{+}Y\|_F + \|E\Omega S\,(Z^{\mathrm{T}}\Omega S)^{+} H^{+} Y\|_F. \qquad (1)$$

We now parse the above lemma carefully in order to understand its implications in our setting.
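As a quick concrete check (ours) of the Ω and S bookkeeping from Section 4.1 and of the rank condition in Lemma 3, with Z = Vk and an arbitrary, hand-picked column subset:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k, r = 12, 7, 2, 3
X = rng.standard_normal((n, d))
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T                          # Z = V_k in Lemma 3

idx = [4, 1, 6]                        # selected column indices (illustrative)
scales = [0.5, 2.0, 1.0]               # positive rescaling factors (illustrative)

Omega = np.zeros((d, r))
S = np.diag(scales)
for tau, i in enumerate(idx):
    Omega[i, tau] = 1.0                # Omega = [e_{i_1}, ..., e_{i_r}]

C = X @ Omega
assert np.allclose(C, X[:, idx])       # sampling picks out exactly the chosen columns
# Rescaling changes the columns but not their span; the rank condition of Lemma 3:
assert np.linalg.matrix_rank(Vk.T @ Omega @ S) == k
```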
For our goals, the matrix C essentially contains a subset of r features from the data matrix X. Recall that ΠC,k(Y) is the best rank-k approximation to Y within the column space of C, so the difference Y − ΠC,k(Y) measures the error from performing regression using sparse eigenfeatures that are constructed as linear combinations of the columns of C. Moving to the right-hand side of eqn. (1), the two terms reflect a tradeoff between the accuracy of the reconstruction of Y using H and the error E in approximating X by the product HZᵀ. Ideally, we would like to choose H so that Y can be accurately approximated and, at the same time, the matrix X is approximated by the product HZᵀ with small residual error E. In general, these two goals might be competing and a balance must be struck. Here, we focus on one extreme of this tradeoff, namely choosing Z so that the (Frobenius) norm of the matrix E is minimized. More specifically, since Z has rank k, the best choice for HZᵀ in order to minimize ‖E‖F is Xk; then, E = X − Xk. Using the SVD of Xk, namely Xk = UkΣkVkᵀ, we apply Lemma 3 setting H = UkΣk and Z = Vk. The following corollary is immediate.

Lemma 4 (Generalization of Lemma 7 in [2]). Using the above notation, if the rank of the matrix VkᵀΩS is equal to k, then

$$\|Y - \Pi_{C,k}(Y)\|_F \le \|Y - U_k U_k^{\mathrm{T}} Y\|_F + \|(X - X_k)\,\Omega S\,(V_k^{\mathrm{T}}\Omega S)^{+}\,\Sigma_k^{-1} U_k^{\mathrm{T}} Y\|_F.$$

Our main results will follow by carefully choosing Ω and S in order to control the right-hand side of the above inequality.

Algorithm: DSF-Select
 1: Input: X, k, r.
 2: Output: r columns of X in C.
 3: Compute Vk and E = X − Xk = X − XVkVkᵀ.
 4: Run DetSampling to construct sampling and rescaling matrices Ω and S:
    [Ω, S] = DetSampling(Vkᵀ, E, r).
 5: Return C = XΩ.

Algorithm: DetSampling
 1: Input: Vᵀ = [v1, ..., vd], A = [a1, ..., ad], r.
 2: Output: Sampling and rescaling matrices [Ω, S].
 3: Initialize B0 = 0k×k, Ω = 0d×r, and S = 0r×r.
 4: for τ = 1 to r do
 5:   Set Lτ = τ − √(rk).
 6:   Pick an index i ∈ {1, 2, ..., d} and a t such that U(ai) ≤ 1/t ≤ L(vi, Bτ−1, Lτ).
 7:   Update Bτ = Bτ−1 + t vi viᵀ.
 8:   Set Ωiτ = 1 and Sττ = 1/√t.
 9: end for
10: Return Ω and S.

Table 1: DSF-Select: Deterministic Sparse Feature Selection

4.3 DSF-Select: Deterministic Sparse Feature Selection

DSF-Select deterministically selects r columns of the matrix X to form the matrix C (see Table 1, and note that the matrix C = XΩ might contain duplicate columns, which can be removed without any loss in accuracy). The heart of DSF-Select is the subroutine DetSampling, a near-greedy algorithm which selects columns of Vkᵀ iteratively to satisfy two criteria: the selected columns should form an approximately orthogonal basis for the columns of Vkᵀ, so that (VkᵀΩS)⁺ is well-behaved; and EΩS should also be well-behaved. These two properties will allow us to prove our results via Lemma 4. The implementation of the proposed algorithm is quite simple, since it relies only on standard linear algebraic operations.

DetSampling takes as input two matrices: Vᵀ ∈ R^{k×d} (satisfying VᵀV = Ik) and A ∈ R^{n×d}. In order to describe the algorithm, it is convenient to view these two matrices as two sets of column vectors, Vᵀ = [v1, ..., vd] (satisfying Σ_{i=1}^{d} vi viᵀ = Ik) and A = [a1, ..., ad]. In DSF-Select we set Vᵀ = Vkᵀ and A = E = X − Xk.
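To make the selection loop concrete, here is our illustrative NumPy sketch of DetSampling; the potential functions φ and L and the normalization U follow the definitions spelled out below, while the column choice (argmax of L − U, one valid way to instantiate step 6 of Table 1) and the fallback when no column passes the test are our own implementation choices, not the authors' code:

```python
import numpy as np

def det_sampling(Vt, A, r):
    """Sketch of DetSampling: greedily pick r columns, returning (Omega, S)."""
    k, d = Vt.shape
    B = np.zeros((k, k))
    Omega = np.zeros((d, r))
    S = np.zeros((r, r))
    # U(a_i), proportional to a_i^T a_i / ||A||_F^2 (see the text below).
    U = (1.0 - np.sqrt(k / r)) * (A * A).sum(axis=0) / (A * A).sum()

    def phi(L, B):
        return np.sum(1.0 / (np.linalg.eigvalsh(B) - L))

    for tau in range(1, r + 1):
        L_tau = tau - np.sqrt(r * k)
        M1 = np.linalg.inv(B - (L_tau + 1.0) * np.eye(k))
        M2 = M1 @ M1
        dphi = phi(L_tau + 1.0, B) - phi(L_tau, B)
        # L(v_i, B_{tau-1}, L_tau) for every column v_i of V^T, vectorized.
        Lv = (np.einsum('id,ij,jd->d', Vt, M2, Vt) / dphi
              - np.einsum('id,ij,jd->d', Vt, M1, Vt))
        i = int(np.argmax(Lv - U))         # ties broken arbitrarily by argmax
        inv_t = max(Lv[i], U[i])           # any t with U <= 1/t <= L; fallback if Lv < U
        t = 1.0 / inv_t
        B += t * np.outer(Vt[:, i], Vt[:, i])
        Omega[i, tau - 1] = 1.0
        S[tau - 1, tau - 1] = 1.0 / np.sqrt(t)
    return Omega, S

# Toy run, as in DSF-Select: V^T = V_k^T and A = E = X - X_k.
rng = np.random.default_rng(5)
n, d, k, r = 30, 15, 2, 8
X = rng.standard_normal((n, d))
Uf, sf, Vtf = np.linalg.svd(X, full_matrices=False)
Vkt = Vtf[:k]
E = X - (Uf[:, :k] * sf[:k]) @ Vkt
Omega, S = det_sampling(Vkt, E, r)
assert np.linalg.matrix_rank(Vkt @ Omega @ S) == k   # V_k^T Omega S has full rank k
```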
Given k and r, the algorithm performs r iterations (τ = 1, ..., r); its main operation is to compute the functions φ(L, B) and L(v, B, L), defined as follows:

$$\phi(\mathrm{L}, B) = \sum_{i=1}^{k} \frac{1}{\lambda_i - \mathrm{L}}, \qquad L(\mathbf{v}, B, \mathrm{L}) = \frac{\mathbf{v}^{\mathrm{T}} \left(B - (\mathrm{L}+1) I_k\right)^{-2} \mathbf{v}}{\phi(\mathrm{L}+1, B) - \phi(\mathrm{L}, B)} - \mathbf{v}^{\mathrm{T}} \left(B - (\mathrm{L}+1) I_k\right)^{-1} \mathbf{v}.$$

In the above, B ∈ R^{k×k} is a symmetric matrix with eigenvalues λ1, ..., λk and L ∈ R is a parameter. We also define the function U(a) for a vector a ∈ R^n as follows:

$$U(\mathbf{a}) = \left(1 - \sqrt{\frac{k}{r}}\right) \frac{\mathbf{a}^{\mathrm{T}}\mathbf{a}}{\|A\|_F^2}.$$

At every step τ, the algorithm selects a column ai such that U(ai) ≤ L(vi, Bτ−1, Lτ); note that Bτ−1 is a k × k matrix which is updated at every step of the algorithm (see Table 1). The existence of such a column is guaranteed by results in [1, 2].

It is worth noting that in practical implementations of the proposed algorithm, there might exist multiple columns which satisfy the above requirement. In our implementation we chose to break such ties arbitrarily. However, more careful and informed choices, such as breaking the ties in a way that makes maximum progress towards our objective, might result in considerable savings. This is indeed an interesting open problem.

The running time of our algorithm is dominated by the search for a column which satisfies U(ai) ≤ L(vi, Bτ−1, Lτ). To compute the function L, we first need to compute φ(Lτ, Bτ−1) (which necessitates the eigenvalues of Bτ−1) and then we need to compute the inverse of Bτ−1 − (Lτ + 1)Ik. These computations need O(k³) time per iteration, for a total of O(rk³) time over all r iterations. Now, in order to compute the function L for each vector vi for all i = 1, . . .
, d, we need an additional O(dk²) time per iteration; the total time for all r iterations is O(drk²). Next, in order to compute the function U, we need to compute aiᵀai (for all i = 1, ..., d), which necessitates O(nnz(A)) time, where nnz(A) is the number of non-zero elements of A. In our setting, A = E ∈ R^{n×d}, so the overall running time is O(drk² + nd). In order to get the final running time we also need to account for the computation of Vk and E.

Algorithm: RSF-Select
 1: Input: X, k, r.
 2: Output: r columns of X in C.
 3: Compute Vk.
 4: Run RandSampling to construct sampling and rescaling matrices Ω and S:
    [Ω, S] = RandSampling(Vkᵀ, r).
 5: Return C = XΩ.

Algorithm: RandSampling
 1: Input: Vᵀ = [v1, ..., vd] and r.
 2: Output: Sampling and rescaling matrices [Ω, S].
 3: For i = 1, ..., d compute the probabilities pi = (1/k) ‖vi‖2².
 4: Initialize Ω = 0d×r and S = 0r×r.
 5: for τ = 1 to r do
 6:   Select an index iτ ∈ {1, 2, ..., d}, where the probability of selecting index i is equal to pi.
 7:   Set Ωiττ = 1 and Sττ = 1/√(r piτ).
 8: end for
 9: Return Ω and S.

Table 2: RSF-Select: Randomized Sparse Feature Selection

The theoretical properties of DetSampling were analyzed in detail in [2], building on the original analysis of [1]. The following lemma from [2] summarizes important properties of Ω and S.

Lemma 5 ([2]).
DetSampling with inputs Vᵀ and A returns a sampling matrix Ω ∈ R^{d×r} and a rescaling matrix S ∈ R^{r×r} satisfying

$$\left\|(V^{\mathrm{T}}\Omega S)^{+}\right\|_2 \le \left(1 - \sqrt{\frac{k}{r}}\right)^{-1}; \qquad \|A\Omega S\|_F \le \|A\|_F.$$

We apply Lemma 5 with Vᵀ = Vkᵀ and A = E, and we combine it with Lemma 4 to conclude the proof of Theorem 1; see [3] for details.

4.4 RSF-Select: Randomized Sparse Feature Selection

RSF-Select is a randomized algorithm that selects r columns of the matrix X in order to form the matrix C (see Table 2). The main differences between RSF-Select and DSF-Select are two: first, RSF-Select only needs access to Vkᵀ and, second, RSF-Select uses a simple sampling procedure in order to select the columns of X to include in C. This sampling procedure is described in the algorithm RandSampling and essentially selects columns of X with probabilities that depend on the norms of the columns of Vkᵀ. Thus, RandSampling first computes a set of probabilities that are proportional to the squared norms of the columns of Vkᵀ and then samples r columns of X in r independent identical trials with replacement, where in each trial a column is sampled according to the computed probabilities. Note that a column could be selected multiple times. In terms of running time, and assuming that the matrix Vk that contains the top k right singular vectors of X has already been computed, the proposed algorithm needs O(dk) time to compute the sampling probabilities and an additional O(d + r log r) time to sample r columns of X. Similar to Lemma 5, we can prove analogous properties for the matrices Ω and S that are returned by the algorithm RandSampling. Again, combining with Lemma 4, we can prove Theorem 2; see [3] for details.

5 Experiments

The goal of our experiments is to illustrate that our algorithms produce sparse features which perform as well in-sample as the top-k PCA regression.
It turns out that the out-of-sample performance is comparable (if not better in many cases, perhaps due to the sparsity) to that of top-k PCA regression.

Data       (n; d)          k = 5, r = k + 1                              k = 5, r = 2k
                           w*_k       DSF        RSF        rnd          w*_k       DSF        RSF        rnd
Arcene     (100; 10,000)   0.93/0.99  0.88/0.94  0.91/0.98  1.0/1.0      0.93/1.0   0.86/0.98  0.89/0.97  1.0/1.0
I-sphere   (351; 34)       0.57/0.58  0.52/0.53  0.55/0.57  0.57/0.57    0.57/0.58  0.52/0.54  0.51/0.55  0.56/0.56
LibrasMov  (45; 90)        2.9/3.3    2.9/3.6    3.1/3.7    3.7/3.7      2.9/3.3    2.6/3.6    2.4/3.3    3.6/3.6
Madelon    (2,000; 500)    0.98/0.98  0.98/0.98  0.98/0.98  1.0/1.0      0.98/0.98  0.97/0.98  0.97/0.98  1.0/1.0
HillVal    (606; 100)      0.68/0.68  0.66/0.67  0.67/0.68  0.68/0.68    0.68/0.68  0.65/0.69  0.67/0.67  0.69/0.69
Spambase   (4601; 57)      0.30/0.30  0.30/0.30  0.31/0.30  0.28/0.38    0.3/0.3    0.3/0.3    0.3/0.3    0.25/0.35

Table 3: Comparison of DSF-Select and RSF-Select with top-k PCA. Each cell lists the in-sample error followed by the out-of-sample error (in/out). In bold is the method achieving the best out-of-sample error.

Compared to top-k PCA, our algorithms are efficient and work well in practice, even better than the theoretical bounds suggest.

We present our findings in Table 3, using data sets from the UCI machine learning repository. We used a five-fold cross-validation design with 1,000 random splits: we computed regression weights using 80% of the data and estimated the out-of-sample error on the remaining 20% of the data. We set k = 5 in the experiments (no attempt was made to optimize k).
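The evaluation protocol can be sketched as follows (our illustration; synthetic data stands in for the UCI data sets, and we use far fewer splits than the 1,000 used in the paper):

```python
import numpy as np

def rms_error(X, y, w):
    return np.linalg.norm(X @ w - y) / np.sqrt(len(y))

rng = np.random.default_rng(4)
n, d, k = 200, 20, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

in_errs, out_errs = [], []
for split in range(10):                               # the paper uses 1,000 random splits
    perm = rng.permutation(n)
    train, test = perm[: int(0.8 * n)], perm[int(0.8 * n):]
    _, _, Vt = np.linalg.svd(X[train], full_matrices=False)
    Vk = Vt[:k].T
    # Top-k PCA regression weights, mapped back to d dimensions via V_k.
    w = Vk @ (np.linalg.pinv(X[train] @ Vk) @ y[train])
    in_errs.append(rms_error(X[train], y[train], w))
    out_errs.append(rms_error(X[test], y[test], w))
```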
Table 3 shows the in-sample and out-of-sample errors for four methods: top-k PCA regression, w*_k; r-sparse feature regression using DSF-Select, ŵ^DSF_k; r-sparse feature regression using RSF-Select, ŵ^RSF_k; and r-sparse feature regression using r random columns, ŵ^rnd_k.

6 Discussion

Top-k PCA regression constructs "features" without looking at the targets; it is target-agnostic. So are all the algorithms we discussed here, as our goal was to compare with top-k PCA. However, there is unexplored potential in Lemma 3. We only explored one extreme choice for the factorization, namely the minimization of some norm of the matrix E. Other choices, in particular non-target-agnostic choices, could prove considerably better. Such investigations are left for future work.

As mentioned when we discussed our deterministic algorithm, it will often be the case that in some steps of the greedy selection process multiple columns satisfy the criterion for selection. In such a situation, we are free to choose any one; we broke ties arbitrarily in our implementation, and, even as is, the algorithm performed as well as or better than top-k PCA. However, we expect that breaking the ties so as to optimize the ultimate objective would yield considerable additional benefit; this would also be non-target-agnostic.

Acknowledgments

This work has been supported by two NSF CCF and DMS grants to Petros Drineas and Malik Magdon-Ismail.

References

[1] J. Batson, D. Spielman, and N. Srivastava. Twice-Ramanujan sparsifiers. In Proceedings of ACM STOC, pages 255-262, 2009.

[2] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near-optimal column-based matrix reconstruction. In Proceedings of IEEE FOCS, 2011.

[3] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Sparse features for PCA-like linear regression. Manuscript, 2011.

[4] C. Boutsidis and M.
Magdon-Ismail.\n\narXiv:1109.5664v1, 2011.\n\nDeterministic feature selection for k-means clustering.\n\n8\n\n\f[5] C. Boutsidis, M. W. Mahoney, and P. Drineas. An improved approximation algorithm for the column\n\nsubset selection problem. In Proceedings of ACM -SIAM SODA, pages 968\u2013977, 2009.\n\n[6] C. Boutsidis, M. W. Mahoney, and P. Drineas. Unsupervised feature selection for the k-means clustering\n\nproblem. In Proceedings of NIPS, 2009.\n\n[7] J. Cadima and I. Jolliffe. Loadings and correlations in the interpretation of principal components. Applied\n\nStatistics, 22:203\u2013214, 1995.\n\n[8] T. Chan and P. Hansen. Some applications of the rank revealing QR factorization. SIAM Journal on\n\nScienti\ufb01c and Statistical Computing, 13:727\u2013741, 1992.\n\n[9] A. Das and D. Kempe. Algorithms for subset selection in linear regression.\n\nSTOC, 2008.\n\nIn Proceedings of ACM\n\n[10] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney. Sampling algorithms and coresets for\n\nLp regression. In Proceedings of ACM-SIAM SODA, 2008.\n\n[11] A. d\u2019Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA\n\nusing semide\ufb01nite programming. In Proceedings of NIPS, 2004.\n\n[12] A. Deshpande and L. Rademacher. Ef\ufb01cient volume sampling for row/column subset selection. In Pro-\n\nceedings of ACM STOC, 2010.\n\n[13] P. Drineas, R. Kannan, and M. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating\n\nmatrix multiplication. SIAM Journal of Computing, 36(1):132\u2013157, 2006.\n\n[14] P. Drineas, M. Mahoney, and S. Muthukrishnan. Polynomial time algorithm for column-row based\n\nrelative-error low-rank matrix approximation. Technical Report 2006-04, DIMACS, March 2006.\n\n[15] P. Drineas, M. Mahoney, and S. Muthukrishnan. Sampling algorithms for \u21132 regression and applications.\n\nIn Proceedings of ACM-SIAM SODA, pages 1127\u20131136, 2006.\n\n[16] G. Golub. 
Numerical methods for solving linear least squares problems. Numerische Mathematik, 7:206\u2013\n\n216, 1965.\n\n[17] G. Golub, P. Hansen, and D. O\u2019Leary. Tikhonov regularization and total least squares. SIAM Journal on\n\nMatrix Analysis and Applications, 21(1):185\u2013194, 2000.\n\n[18] M. Gu and S. Eisenstat. Ef\ufb01cient algorithms for computing a strong rank-revealing QR factorization.\n\nSIAM Journal on Scienti\ufb01c Computing, 17:848\u2013869, 1996.\n\n[19] I. Guyon and A. Elisseeff. Special issue on variable and feature selection. Journal of Machine Learning\n\nResearch, 3, 2003.\n\n[20] N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness: probabilistic algorithms for\n\nconstructing approximate matrix decompositions. SIAM Review, 2011.\n\n[21] P. Hansen. The truncated SVD as a method for regularization. BIT Numerical Mathematics, 27(4):534\u2013\n\n553, 1987.\n\n[22] I. Jolliffe. Discarding variables in Principal Component Analysis: asrti\ufb01cial data. Applied Statistics,\n\n21(2):160\u2013173, 1972.\n\n[23] R. Larsen.\n\nPROPACK: A software package for the symmetric eigenvalue problem and sin-\ngular value problems on Lanczos and Lanczos bidiagonalization with partial reorthogonalization.\nhttp://soi.stanford.edu/\u223crmunk/\u223cPROPACK/.\nIn Proceedings of NIPS, 2005.\n\n[24] B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: exact and greedy algorithms.\n\n[25] B. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227\u2013\n\n234, 1995.\n\n[26] M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional\n\nanalysis. Journal of the ACM, 54, 2007.\n\n[27] N. Srivastava and D. Spielman. Graph sparsi\ufb01cations by effective resistances. In Proceedings of ACM\n\nSTOC, pages 563\u2013568, 2008.\n\n[28] R. Tibshirani. Regression shrinkage and selection via the lasso. 
Journal of the Royal Statistical Society,\n\npages 267\u2013288, 1996.\n\n[29] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information\n\nTheory, 50(10):2231\u20132242, 2004.\n\n[30] T. Zhang. Generating a d-dimensional linear subspace ef\ufb01ciently. In Adaptive forward-backward greedy\n\nalgorithm for sparse learning with linear models, 2008.\n\n9\n\n\f", "award": [], "sourceid": 1239, "authors": [{"given_name": "Christos", "family_name": "Boutsidis", "institution": null}, {"given_name": "Petros", "family_name": "Drineas", "institution": null}, {"given_name": "Malik", "family_name": "Magdon-Ismail", "institution": null}]}