{"title": "Nonnegative Sparse PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 1561, "page_last": 1568, "abstract": null, "full_text": "Nonnegative Sparse PCA\n\nRon Zass\n\nand Amnon Shashua \n\nAbstract\nWe describe a nonnegative variant of the \"Sparse PCA\" problem. The goal is to create a low dimensional representation from a collection of points which on the one hand maximizes the variance of the projected points and on the other uses only parts of the original coordinates, and thereby creating a sparse representation. What distinguishes our problem from other Sparse PCA formulations is that the projection involves only nonnegative weights of the original coordinates -- a desired quality in various fields, including economics, bioinformatics and computer vision. Adding nonnegativity contributes to sparseness, where it enforces a partitioning of the original coordinates among the new axes. We describe a simple yet efficient iterative coordinate-descent type of scheme which converges to a local optimum of our optimization criteria, giving good results on large real world datasets.\n\n1 Introduction\nBoth nonnegative and sparse decompositions of data are desirable in domains where the underlying factors have a physical interpretation: In economics, sparseness increases the efficiency of a portfolio, while nonnegativity both increases its efficiency and reduces its risk [7]. In biology, each coordinate axis may correspond to a specific gene, the sparseness is necessary for finding focalized local patterns hidden in the data, and the nonnegativity is required due to the robustness of biological systems  where observed change in the expression level of a specific gene emerges from either positive or negative influence, rather than a combination of both which partly cancel each other [1]. 
In computer vision, coordinates may correspond to pixels, and nonnegative sparse decomposition is related to the extraction of relevant parts from images [10]; in machine learning, sparseness is closely related to feature selection and to improved generalization in learning algorithms, while nonnegativity relates to probability distributions.

Principal Component Analysis (PCA) is a popular, widespread method of data decomposition with applications throughout science and engineering. The decomposition performed by PCA is a linear combination of the input coordinates, where the coefficients of the combination (the principal vectors) form a low-dimensional subspace that corresponds to the direction of maximal variance in the data. PCA is attractive for a number of reasons. First, the maximum variance property provides a way to compress the data with minimal information loss; in fact, the principal vectors provide the closest (in the least-squares sense) linear subspace to the data. Second, the representation of the data in the projected space is uncorrelated, which is a useful property for subsequent statistical analysis. Third, the PCA decomposition can be achieved via an eigenvalue decomposition of the data covariance matrix.

Two particular drawbacks of PCA are the lack of sparseness of the principal vectors, i.e., all the data coordinates participate in the linear combinations, and the fact that a linear combination may mix both positive and negative weights, which might partly cancel each other. The purpose of our work is to incorporate both nonnegativity and sparseness into PCA while maintaining its maximal variance property. In other words, the goal is to find a collection of sparse nonnegative principal vectors spanning a low-dimensional space that preserves as much of the variance of the data as possible.

School of Engineering and Computer Science, Hebrew University of Jerusalem, Jerusalem 91904, Israel.
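The eigendecomposition route to standard PCA mentioned above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's code; the 6 x 100 random data matrix and the choice k = 2 are our own illustrative assumptions:

```python
import numpy as np

# Standard PCA via an eigenvalue decomposition of the data scatter
# matrix, as described above. The data is an illustrative random
# example (d = 6 coordinates, n = 100 points, k = 2 components).
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 100))       # d x n data matrix
X = X - X.mean(axis=1, keepdims=True)   # zero-mean each coordinate
A = X @ X.T                             # scatter matrix A = X X^T
evals, evecs = np.linalg.eigh(A)        # eigenvalues in ascending order
U = evecs[:, ::-1][:, :2]               # the k = 2 leading principal vectors
# The projected representation is uncorrelated: U^T A U is diagonal.
C = U.T @ A @ U
```

Both the maximum-variance property and the decorrelation of the projected coordinates follow from choosing U as eigenvectors of A.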
We present an efficient and simple algorithm for Nonnegative Sparse PCA, and demonstrate good results over real-world datasets.

1.1 Related Work

The desire to add a sparseness property to PCA has been a focus of attention in the past decade, starting from the work of [8], who applied axis rotations and component thresholding, and continuing with more recent computational techniques: the SCoTLASS L1-norm approach [9], the elastic-net L1 regression of SPCA [14], DSPCA, based on relaxing a hard cardinality cap constraint with a convex approximation [2], and most recently the work of [12], which applies post-processing renormalization steps to improve any approximate solution, in addition to two different algorithms that search for the active coordinates of the principal component based on spectral bounds. These approaches can be divided into two paradigms: (i) adding L1-norm terms to the PCA formulation, as it is known that L1 approximates L0 much better than L2; and (ii) relaxing a hard cardinality (L0-norm) constraint on the principal vectors. In both cases the orthonormality of the principal vector set is severely compromised or even abandoned, and it is left unclear to what degree the resulting principal basis explains most of the variance present in the data. While the above methods do not deal with nonnegativity at all, other approaches focus on nonnegativity but are neutral to the variance of the resulting factors, and hence recover parts which are not necessarily informative. A popular example is Nonnegative Matrix Factorization (NMF) [10] and its sparse versions [6, 11, 5, 4], which seek the best reconstruction of the input using nonnegative (sparse) prototypes and weights.

We start by adding nonnegativity to PCA. An interesting direct byproduct of nonnegativity in PCA is that the coordinates split among the principal vectors. This makes the principal vectors disjoint: each coordinate is non-zero in at most one vector. 
We can therefore view the principal vectors as parts. We then relax the disjoint property, since for most applications some overlap among parts is desired. We further introduce a \"sparseness\" term to the optimization criterion to cover situations where the part (or semi-part) decomposition is not sufficient to guarantee sparsity (such as when the dimension of the input space far exceeds the number of principal vectors). The structure of the paper is as follows: In Sections 2 and 3 we introduce the formulation of Nonnegative Sparse PCA. An efficient coordinate-descent algorithm for finding a local optimum is derived in Section 4. Our experiments in Section 5 demonstrate the effectiveness of the approach on large real-world datasets, followed by conclusions in Section 6.

2 Nonnegative (Semi-Disjoint) PCA
To the original PCA, which maximizes the variance, we add nonnegativity, showing that this addition alone ensures some sparseness by turning the principal vectors into a disjoint set of vectors, meaning that each coordinate is non-zero in at most one principal vector. We will later relax the disjoint property, as it is too restrictive for most applications. Let x_1, ..., x_n \in R^d form a zero-mean collection of data points, arranged as the columns of the matrix X \in R^{d \times n}, and let u_1, ..., u_k \in R^d be the desired principal vectors, arranged as the columns of the matrix U \in R^{d \times k}. Adding a nonnegativity constraint to PCA gives us the following optimization problem:

\max_U \; \frac{1}{2} \|U^T X\|_F^2 \quad \text{s.t.} \quad U^T U = I, \; U \ge 0 \qquad (1)

where \|A\|_F^2 = \sum_{ij} a_{ij}^2 is the squared Frobenius norm. Clearly, the combination of U^T U = I and U \ge 0 entails that U is disjoint, meaning that each row of U contains at most one non-zero element. While having disjoint principal components may be considered a kind of sparseness, it is too restrictive for most problems. 
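The disjointness entailed by the constraints of eqn. (1) is easy to verify numerically: for distinct columns j and l, the off-diagonal entry (U^T U)_{jl} is a sum of nonnegative products, so it can vanish only if the columns have disjoint supports. A small sketch, where the matrix U below is an illustrative assumption:

```python
import numpy as np

# A nonnegative matrix with orthonormal columns: U^T U = I and U >= 0.
# The example matrix is illustrative, not from the paper.
U = np.array([[0.6, 0.0],
              [0.8, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
assert np.all(U >= 0)
assert np.allclose(U.T @ U, np.eye(2))
# Each off-diagonal entry (U^T U)_{jl} = sum_i u_ij * u_il is a sum of
# nonnegative terms; being zero forces disjoint column supports, so
# every row of U has at most one non-zero element:
assert np.all((U > 0).sum(axis=1) <= 1)
```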
For example, a stock may be a part of more than one sector, genes are typically involved in several biological processes [1], a pixel may be shared among several image parts, and so forth. We therefore wish to allow some overlap among the principal vectors. The degree of coordinate overlap can be represented by an orthonormality distance measure which is nonnegative and vanishes iff U is orthonormal. The function \|I - U^T U\|_F^2 is typically used in the literature as a measure of orthonormality (cf. [13], pp. 275-277), and the relaxed version of eqn. 1 becomes

\max_U \; \frac{1}{2} \|U^T X\|_F^2 - \frac{\alpha}{4} \|I - U^T U\|_F^2 \quad \text{s.t.} \quad U \ge 0 \qquad (2)

where α > 0 is a balancing parameter between reconstruction and orthonormality. We see that the tradeoff for relaxing the disjoint property of Nonnegative PCA is also to relax the maximum variance property of PCA -- the constrained optimization tries to preserve the variance when possible, but allows trading some variance for some degree of coordinate overlap among the principal vectors. Next, we add sparseness to this formulation.

3 Nonnegative Sparse PCA (NSPCA)
While semi-disjoint principal components can be considered sparse when the number of coordinates is small, they may be too dense when the number of coordinates highly exceeds the number of principal vectors. In such a case, the average number of non-zero elements per principal vector would be high. We therefore consider minimizing the number of non-zero elements directly, \|U\|_0 = \sum_{i=1}^{d} \sum_{j=1}^{k} |u_{ij}|_0, where |x|_0 equals one if x is non-zero and zero otherwise. Adding this to the criterion of eqn. 2 we have

\max_U \; \frac{1}{2} \|U^T X\|_F^2 - \frac{\alpha}{4} \|I - U^T U\|_F^2 - \beta \|U\|_0 \quad \text{s.t.} \quad U \ge 0

where β ≥ 0 controls the amount of additional sparseness required. The L0 norm can be relaxed by replacing it with an L1 term, and since U is nonnegative we obtain the relaxed sparseness term \|U\|_1 = \mathbf{1}^T U \mathbf{1}, where \mathbf{1} is a column vector with all elements equal to one. 
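The resulting criterion (variance term, orthonormality penalty, and L1 sparseness term) is straightforward to evaluate; a minimal sketch, where the function name and the test matrices are our own illustrative choices:

```python
import numpy as np

def nspca_objective(U, X, alpha, beta):
    """Relaxed NSPCA criterion: (1/2)||U^T X||_F^2
    - (alpha/4)||I - U^T U||_F^2 - beta * 1^T U 1.
    For nonnegative U the last term equals beta * ||U||_1."""
    k = U.shape[1]
    variance = 0.5 * np.linalg.norm(U.T @ X, 'fro') ** 2
    ortho = (alpha / 4.0) * np.linalg.norm(np.eye(k) - U.T @ U, 'fro') ** 2
    sparseness = beta * U.sum()
    return variance - ortho - sparseness

# For an orthonormal nonnegative U the penalty vanishes and the value
# reduces to the plain variance term of eqn. (1):
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
X = np.eye(3)
val = nspca_objective(U, X, alpha=1.0, beta=0.0)   # = 1.0
```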
The relaxed problem becomes

\max_U \; \frac{1}{2} \|U^T X\|_F^2 - \frac{\alpha}{4} \|I - U^T U\|_F^2 - \beta \, \mathbf{1}^T U \mathbf{1} \quad \text{s.t.} \quad U \ge 0 \qquad (3)

4 Algorithm
For certain values of α and β, solving the problem of eqn. 3 is NP-hard. For example, for large enough values of α and for β = 0 we obtain the original problem of eqn. 1, which is a concave quadratic program, an NP-hard problem [3]. It is therefore unrealistic to look for a global solution of eqn. 3, and we have to settle for a local maximum. The objective of eqn. 3 as a function of u_{rs} (the s-th entry of the column vector u_r) is

f(u_{rs}) = -\frac{\alpha}{4} u_{rs}^4 + \frac{c_2}{2} u_{rs}^2 + c_1 u_{rs} + \text{const} \qquad (4)

where const stands for terms that do not depend on u_{rs} and

c_1 = \sum_{i=1, i \ne s}^{d} a_{si} u_{ri} - \alpha \sum_{j=1, j \ne r}^{k} \sum_{i=1, i \ne s}^{d} u_{ri} u_{ji} u_{js} - \beta,
\qquad
c_2 = a_{ss} + \alpha - \alpha \sum_{i=1, i \ne s}^{d} u_{ri}^2 - \alpha \sum_{j=1, j \ne r}^{k} u_{js}^2

where A = X X^T. Setting the derivative with respect to u_{rs} to zero we obtain a cubic equation,

\frac{\partial f}{\partial u_{rs}} = -\alpha u_{rs}^3 + c_2 u_{rs} + c_1 = 0 \qquad (5)

Evaluating eqn. 4 at zero and at the nonnegative roots of eqn. 5, the nonnegative global maximum of f(u_{rs}) can be found (see Fig. 1). Note that as u_{rs} approaches infinity the criterion goes to -\infty, and since the function is continuous a nonnegative maximum must exist. A coordinate-descent scheme that updates the entries of U one after the other converges to a local maximum of the constrained objective function, as summarized below.

Figure 1: A 4th-order polynomial (left) and its derivative (right). In order to find the global nonnegative maximum, the function has to be inspected at all nonnegative extrema (where the derivative is zero) and at u_{rs} = 0.

Algorithm 1 Nonnegative Sparse PCA (NSPCA)
- Start with an initial guess for U.
- Iterate over the entries (r, s) of U until convergence:
  - Set the value of u_{rs} to the global nonnegative maximizer of eqn.
4 by evaluating it at zero and at all nonnegative roots of eqn. 5.

By caching some calculation results from the update of one element of U to the next, each update is done in O(d), and the entire matrix U is updated in O(d^2 k). It is easy to see that the gradient at the convergence point of Alg. 1 is orthogonal to the constraints in eqn. 3, and therefore Alg. 1 converges to a local maximum of the problem. It is also worthwhile to compare this nonnegative coordinate-descent scheme with that of Lee and Seung [10]. The update rule of [10] is multiplicative, which carries two inherent drawbacks. First, it cannot turn positive values into zero or vice versa, and therefore the solution will never be on the boundary itself, a drawback that does not exist in our scheme. Second, since it is multiplicative, the preservation of nonnegativity relies on the nonnegativity of the input; it therefore cannot be applied to our problem, while our scheme can also be applied to NMF. In other words, a practical aspect of the NSPCA algorithm is that it can handle general (not necessarily nonnegative) input matrices -- such as zero-mean covariance matrices.

5 Experiments
We start by demonstrating the role of the α and β parameters in the task of extracting face parts. We use the MIT CBCL Face Dataset #1 of 2429 aligned face images, 19 by 19 pixels each, a dataset that was extensively used to demonstrate the ability of Nonnegative Matrix Factorization (NMF) [10] methods. We start with α = 2 × 10^7 and β = 0 to extract the 10 principal vectors in Fig. 2(a), and then increase α to 5 × 10^8 to get the principal vectors in Fig. 2(b). Note that as α increases, the overlap among the principal vectors decreases and the holistic nature of some of the vectors in Fig. 2(a) vanishes. The vectors also become sparser, but this is only a byproduct of their nonoverlapping nature. Fig.
3(a) shows the amount of overlap \|I - U^T U\|_F as a function of α, showing a consistent drop in the overlap as α increases. We now set α back to 2 × 10^7 as in Fig. 2(a), but set the value of β to 2 × 10^6 to get the factors in Fig. 2(c). The vectors become sparser as β increases, but this time the sparseness emerges from dropping less informative pixels from the original vectors of Fig. 2(a), rather than from replacing the holistic principal vectors with ones that are part-based in nature. The number of non-zero elements in the principal vectors, \|U\|_0, is plotted as a function of β in Fig. 3(b), showing the increase in sparseness as β increases.

Figure 2: The role of α and β is demonstrated in the task of extracting ten image features using the MIT CBCL Face Dataset #1. In the top row (a), we use α = 2 × 10^7 and β = 0. In (b) we increase α to 5 × 10^8 while β stays zero, to get more localized parts that have a lower amount of overlap. In (c) we reset α to 2 × 10^7 as in (a), but increase β to 2 × 10^6. As β increases, pixels that explain less variance are dropped from the factors, but the overlapping nature of the factors remains. (See Fig. 3 for a detailed study.) In (d) we show the ten leading principal components of PCA, in (e) the ten factors of NMF, and in (f) the leading principal vectors of GSPCA when allowing 55 active pixels per principal vector.

Next we study how the different dimensionality reduction methods aid the generalization ability of SVM in the task of face detection. To measure the generalization ability we use the Receiver Operating Characteristics (ROC) curve, a two-dimensional graph measuring the classification ability of an algorithm over a dataset, showing the rate of true positives as a function of the rate of false positives. The larger the area under this curve, the better the generalization. 
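Before turning to the detection experiments, the per-entry update at the heart of Algorithm 1 (Section 4) admits a compact sketch: maximize the quartic of eqn. 4 over u ≥ 0 by evaluating it at zero and at the nonnegative real roots of the cubic of eqn. 5. The function below assumes the coefficients c1, c2 and the parameter alpha have already been computed; it is a sketch, not the authors' implementation:

```python
import numpy as np

def update_entry(c1, c2, alpha):
    """Globally maximize f(u) = -(alpha/4) u^4 + (c2/2) u^2 + c1 u
    over u >= 0 (eqn. 4), by checking u = 0 and the nonnegative real
    roots of f'(u) = -alpha u^3 + c2 u + c1 = 0 (eqn. 5)."""
    f = lambda u: -(alpha / 4.0) * u**4 + (c2 / 2.0) * u**2 + c1 * u
    roots = np.roots([-alpha, 0.0, c2, c1])   # coefficients of the cubic
    candidates = [0.0] + [r.real for r in roots
                          if abs(r.imag) < 1e-10 and r.real >= 0.0]
    return max(candidates, key=f)
```

For instance, with alpha = 1, c2 = 1 and c1 = 0 the maximizer is u = 1, while a strongly negative c1 (e.g. induced by a large β) drives the entry to exactly zero -- which is how sparsity arises in the scheme.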
Again, we use the MIT CBCL Face Dataset #1, where 1000 face images and 2000 non-face images were used as a training set, and the rest of the dataset was used as a test set. The dimensionality reduction was performed over the 1000 face images of the training set. We run linear SVM on the ten features extracted by NSPCA for different values of α and β, showing in Fig. 4(a) that as the principal factors become less overlapping (higher α) and sparser (higher β), the ROC curve is higher, meaning that SVM is able to generalize better. Next, we compare the ROC curve produced by linear SVM when using the NSPCA extracted features (with α = 5 × 10^8 and β = 2 × 10^6) to the ones produced when using PCA and NMF (the principal vectors are displayed in Fig. 2(d) and Fig. 2(e), respectively). As a representative of the Sparse PCA methods we use the recent Greedy Sparse PCA (GSPCA) of [12], which shows comparable or better results relative to all other Sparse PCA methods (see the principal vectors in Fig. 2(f)). Fig. 4(b) shows that better generalization is achieved when using the NSPCA extracted features, and hence a more reliable face detection. Since NSPCA is limited to nonnegative entries of the principal vectors, it can inherently explain less variance than Sparse PCA algorithms that are not constrained in that way, just as Sparse PCA algorithms can explain less variance than PCA. Even with this limitation, NSPCA still manages to explain a large amount of the variance. We demonstrate this in Fig. 5, where we compare the cumulative explained variance and cumulative cardinality of different Sparse PCA algorithms over the Pit Props dataset, a classic dataset used throughout the Sparse PCA literature. 
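The cumulative explained variance reported in such comparisons can be computed as the fraction of the total variance captured by the leading columns of U. A simplified sketch (the function name is our own; it ignores the adjustments some Sparse PCA papers apply when the components are far from orthogonal):

```python
import numpy as np

def cumulative_explained_variance(U, X):
    """Fraction of the total variance tr(A), A = X X^T, captured by
    the first j columns of U, for j = 1..k. A simplified measure:
    no adjustment is made for non-orthogonal components."""
    A = X @ X.T
    total = np.trace(A)
    return np.array([np.trace(U[:, :j].T @ A @ U[:, :j]) / total
                     for j in range(1, U.shape[1] + 1)])
```

With a full orthonormal basis the curve reaches 1; sparse or nonnegative bases trade some of that variance for interpretability.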
In domains where nonnegativity is intrinsic to the problem, however, using NSPCA extracted features improves the generalization ability of learning algorithms, as we have demonstrated above for the face detection problem.

Figure 3: (a) The amount of overlap and orthogonality as a function of α, where higher values of α decrease the overlap and increase the orthogonality, and (b) the amount of non-zero elements as a function of β, where higher values of β enforce sparseness.

Figure 4: The ROC curve of SVM in the task of face detection over the MIT CBCL Face Dataset #1 (a) when using different values of α and β, showing improved generalization when using principal vectors that have less overlap (higher α) and that are sparser (higher β); and (b) when using NMF, PCA, GSPCA and NSPCA extracted features, showing better generalization when using NSPCA.

Figure 5: (a) Cumulative explained variance and (b) cumulative cardinality as a function of the number of principal components on the Pit Props dataset, a classic dataset that is typically used to evaluate Sparse PCA algorithms. Although NSPCA is more constrained than other Sparse PCA algorithms, and can therefore explain less variance (just as Sparse PCA algorithms can explain less variance than PCA), and although the dataset is not nonnegative in nature, NSPCA shows competitive results as the number of principal components increases.

6 Summary
Our method differs substantially from previous approaches to sparse PCA -- a difference that begins with the definition of the problem itself. Other sparse PCA methods try to limit the cardinality (number of non-zero elements) of each principal vector, and therefore accept as input a (soft) limitation on that cardinality. In addition, most sparse PCA methods focus on the task of finding a single principal vector. Our method, on the other hand, splits the coordinates among the different principal vectors, and therefore its input is the number of principal vectors, or parts, rather than the size of each part. As a consequence, the natural way to use our algorithm is to search for all principal vectors together. In that sense, it bears resemblance to the Nonnegative Matrix Factorization problem, from which our method nevertheless departs significantly in that it focuses on informative parts, as it maximizes the variance. Furthermore, the nonnegativity of the output does not rely on having nonnegative input matrices, thereby permitting zero-mean covariance matrices to be fed into the process, just as is done with PCA.

References
[1] Liviu Badea and Doina Tilivea. Sparse factorizations of gene expression guided by binding data. In Pacific Symposium on Biocomputing, 2005.
[2] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. In Proceedings of the conference on Neural Information Processing Systems (NIPS), 2004.
[3] C. A. Floudas and V. Visweswaran. Quadratic optimization. In Handbook of Global Optimization, pages 217-269. Kluwer Acad. Publ., Dordrecht, 1995.
[4] Matthias Heiler and Christoph Schnorr. Learning non-negative sparse image codes by convex programming. In Proc. of the 10th IEEE Intl. Conf. on Comp. Vision (ICCV), 2005.
[5] Patrik O. Hoyer. Non-negative sparse coding. 
In Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pages 557-565, 2002.
[6] Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457-1469, 2004.
[7] Ravi Jagannathan and Tongshu Ma. Risk reduction in large portfolios: Why imposing the wrong constraints helps. Journal of Finance, 58(4):1651-1684, August 2003.
[8] Ian T. Jolliffe. Rotation of principal components: Choice of normalization constraints. Journal of Applied Statistics, 22(1):29-35, 1995.
[9] Ian T. Jolliffe, Nickolay T. Trendafilov, and Mudassir Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531-547, September 2003.
[10] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, October 1999.
[11] S. Li, X. Hou, H. Zhang, and Q. Cheng. Learning spatially localized, parts-based representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[12] Baback Moghaddam, Yair Weiss, and Shai Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. In Proceedings of the conference on Neural Information Processing Systems (NIPS), 2005.
[13] Beresford N. Parlett. The Symmetric Eigenvalue Problem. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1980.
[14] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis, 2004.

", "award": [], "sourceid": 3104, "authors": [{"given_name": "Ron", "family_name": "Zass", "institution": null}, {"given_name": "Amnon", "family_name": "Shashua", "institution": null}]}