{"title": "Selecting Diverse Features via Spectral Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 1583, "page_last": 1591, "abstract": "We study the problem of diverse feature selection in linear regression: selecting a small subset of diverse features that can predict a given objective. Diversity is useful for several reasons such as interpretability, robustness to noise, etc.  We propose several spectral regularizers that capture a notion of diversity of features and show that these are all submodular set functions. These regularizers, when added to the objective function for linear regression, result in approximately submodular functions, which can then be maximized approximately by efficient greedy and local search algorithms, with provable guarantees.  We compare our algorithms to traditional greedy and $\\ell_1$-regularization schemes and show that we obtain a more diverse set of features that result in the regression problem being stable under perturbations.", "full_text": "Selecting Diverse Features via Spectral\n\nRegularization\n\nAbhimanyu Das\u2217\nMicrosoft Research\n\nMountain View\n\nabhidas@microsoft.com\n\nAnirban Dasgupta\n\nYahoo! Labs\nSunnyvale\n\nanirban@yahoo-inc.com\n\nravi.k53@gmail.com\n\nRavi Kumar\u2217\n\nGoogle\n\nMountain View\n\nAbstract\n\nWe study the problem of diverse feature selection in linear regression: selecting\na small subset of diverse features that can predict a given objective. Diversity is\nuseful for several reasons such as interpretability, robustness to noise, etc. We pro-\npose several spectral regularizers that capture a notion of diversity of features and\nshow that these are all submodular set functions. These regularizers, when added\nto the objective function for linear regression, result in approximately submodu-\nlar functions, which can then be maximized by ef\ufb01cient greedy and local search\nalgorithms, with provable guarantees. 
We compare our algorithms to traditional\ngreedy and $\\ell_1$-regularization schemes and show that we obtain a more diverse set\nof features that result in the regression problem being stable under perturbations.\n\n1 Introduction\n\nFeature selection is a key component in many machine learning settings. The process involves\nchoosing a small subset of features in order to build a model to approximate the target concept\nwell. Feature selection offers several advantages in practice. These include reducing the dimension\nof the data and hence the space requirements, enhancing the interpretability of the learned model,\nmitigating over-\ufb01tting, decreasing generalization error, etc.\nIn this paper we focus on feature selection for linear regression, which is the prediction model of\nchoice for many practitioners. The goal is to obtain a linear model using a subset of k features (where\nk is user-speci\ufb01ed), to minimize the prediction error or, equivalently, maximize the squared multiple\ncorrelation [16]. In general, feature selection techniques can be categorized into two approaches.\nIn the \ufb01rst, features are greedily selected one by one up to the pre-speci\ufb01ed budget k; the Forward\nor Backward greedy methods [19] fall into this type. In the second, the feature selection process is\nintimately coupled with the regression objective itself by adding a (usually convex) regularizer. For\nexample, the Lasso [20] uses the $\\ell_1$-norm of the coef\ufb01cients as a regularizer to promote sparsity.\nIn this work we consider the feature selection problem of choosing the best set of features for pre-\ndicting a speci\ufb01ed target, coupled with the desire to choose features that are as \u201cdiverse\u201d as possible; our\ngoal will be to construct a regularizer that can capture diversity. Diversity among the chosen features\nis a useful property for many reasons. 
Firstly, it increases the interpretability of the chosen features,\nsince we are assured that they are not redundant and are more representative of the feature space cov-\nered by the entire dataset (see e.g. [7]). Secondly, as we show, the right notion of diversity can also\nmake the feature selection task resistant to noise in the data. Thirdly, it is well known that correlated\nfeatures can slow down the convergence of algorithms such as the stochastic gradient (e.g., [2]); by\ndemanding diversity, one can potentially obviate this slowdown.\n\n\u2217This work was done while the author was at Yahoo! Labs.\n\nUnfortunately, the traditional greedy and $\\ell_1$-relaxation approaches to feature selection do not ex-\nplicitly address feature diversity (see footnote 1). In this paper, we address this problem of diverse feature selection\nusing an approach that falls between that of greedy methods and convex-regularization methods. In\nparticular, we construct regularizers that capture a notion of diversity \u2014 unlike regularizers such as\nLasso, our regularizers are set functions as opposed to functions of the regression coef\ufb01cient vector.\nOur objective functions are thus a combination of the linear regression objective and the regular-\nizer. We then design provable approximation algorithms for such objectives using a combination of\ngreedy and local search techniques. While there is no unique way to de\ufb01ne feature diversity, we take\na spectral approach. 
By de\ufb01ning diversity to be a carefully chosen function of the spectrum of the\nchosen features, we tap into notions of submodularity and consequently into the rich literature for\nmaximizing submodular functions [5, 9, 14].\nOur contributions are as follows: (i) We formulate an optimization problem for diverse feature\nselection and construct a family of submodular spectral regularizers that capture diversity notions.\n(ii) We use a novel approach of combining the diversity regularizers with the optimization objective\nto obtain (approximately) submodular maximization problems, and optimize them using greedy\nand local search algorithms with provable guarantees.\n(iii) We validate the performance of our\nalgorithms using experiments on real and synthetic data sets.\n\n2 Related work\n\nFeature selection and the closely related problems of sparse approximation/recovery have been ex-\ntensively studied using two broad classes of methods: greedy [5, 19, 21, 11, 24] and convex relax-\nation [20, 25, 3, 22, 8]. None of these methods, however, takes feature diversity into account\nduring selection. The (greedy) methods in our paper are inspired by those of Das and Kempe [5],\nwho provide prediction error bounds using a notion of approximate submodularity. However, they\ndo not incorporate any notion of feature diversity; they also require monotonicity, which does not\nhold for several regularizers we construct. A related convex relaxation based approach is that of\nGrave et al. [12], who address the unstable behavior of Lasso in the presence of correlated features\nand propose adding a trace norm regularizer to the error objective. Their focus is to select groups\nof correlated variables together instead of selecting only one variable from each group. 
Our goal is\ndifferent: select variables that are relatively uncorrelated with each other.\nPrevious work on diverse feature selection includes greedy heuristics for trading off information-\ntheoretic feature relevance and feature redundancy criteria when selecting features [7, 23]. However,\nthe heuristics presented do not carry any theoretical guarantees.\nThere has been some work on selecting a diverse set of features to maximize the mutual information\nor the entropy of a set of variables [13, 17]. However, the problem de\ufb01nition in these works does not\nspecify a target prediction vector or variable; the goal instead is to select diverse features regardless\nof whether the features are relevant for predicting a particular target variable. On the other hand, our\nwork requires us to simultaneously optimize for both feature selection and diversity objectives.\nIf we consider orthogonality as a loose proxy for diversity, methods such as Principal Component\nAnalysis and Singular Value Decomposition [15] become relevant. However, these methods do not\nreturn elements from the original set of features and instead output linear combinations of the feature\nvectors; this might not be desirable for many applications.\n\n3 Preliminaries\nFor any symmetric positive semide\ufb01nite n \u00d7 n matrix A, we denote its eigenvalues by \u03bbmin(A) =\n\u03bb1(A) \u2264 \u00b7\u00b7\u00b7 \u2264 \u03bbn(A) = \u03bbmax(A). We use $\\det(A) = \\prod_{i=1}^{n} \\lambda_i(A)$ to denote the determinant of A.\nRecall the vector and matrix two-norms: $\\|x\\|_2 = \\sqrt{\\sum_i |x_i|^2}$ and $\\|A\\|_2 = \\lambda_{\\max}(A)$.\n\nLet X = {X1, . . . , Xn} be the set of feature vectors (or random variables) where each $X_i \\in \\mathbb{R}^m$ and\nlet $Z \\in \\mathbb{R}^m$ be the target vector. By appropriate normalization, we can assume $\\|X_i\\|_2 = 1 = \\|Z\\|_2$.\nWe wish to predict Z using linear regression on a small subset of X. The matrix of inner products (or\ncovariances) between the Xi and Xj is denoted by C, with entries Ci,j = Cov(Xi, Xj). Similarly,\nwe use b to denote the inner products between Z and the Xi\u2019s, with bi = Cov(Z, Xi).\nFor an n-dimensional Gaussian random vector v with covariance matrix C, we use $H(v) = \\frac{1}{2} \\log((2\\pi e)^n \\det(C))$ to denote the differential entropy of v.\nFor a set S \u2286 X, if Z\u2032(S) is the optimal linear predictor of Z using the vectors in S, then the\nsquared multiple correlation [6, 16] is de\ufb01ned as $R^2_Z(S) = 1 - \\|Z - Z'(S)\\|_2^2$. This is a widely\nused goodness-of-\ufb01t measure; it captures the length of the projection of Z on the subspace spanned\nby the vectors in S.\n\nFootnote 1: discussed in the supplementary material at http://cs.usc.edu/\u223cabhimand/nips12supplementary.pdf\n\nDe\ufb01nition 1 (Diverse feature selection) Given k > 0, \ufb01nd a set S \u2286 X satisfying\n\n$\\mathrm{argmax}_{S : |S| \\le k} \\; g(S) \\triangleq R^2_Z(S) + \\nu f(S)$, (1)\n\nwhere \u03bd > 0 is the regularization constant and f (S) is some \u201cdiversity-promoting\u201d regularizer.\n\nNote that diversity is not a uniquely de\ufb01ned notion; however, we call a regularizer f diversity-\npromoting if the following two conditions are satis\ufb01ed: for a \ufb01xed k, f (S) is maximized when S is\nan orthogonal set of vectors and is minimized when S has the lowest rank, where |S| \u2264 k.\nFor convenience, we do not distinguish between the index set S and the variables {Xi | i \u2208 S}. We\nuse CS to denote the submatrix of C with row and column set S, and bS to denote the vector with\nonly entries bi for i \u2208 S. Given the subset S of vectors used for prediction, the optimal regression\ncoef\ufb01cients \u03b1i are $(\\alpha_i)_{i \\in S} = C_S^{-1} b_S$ (e.g., [16]) and hence $R^2_Z(S) = b_S^T C_S^{-1} b_S$ (see footnote 2).\nMany of our results are phrased in terms of eigenvalues of the inner product matrix C and its subma-\ntrices. Since such matrices are positive semide\ufb01nite, their eigenvalues are real and non-negative [16].\n\nSubmodularity ratio. Das and Kempe [5] introduced the notion of submodularity ratio for a general\nset function to capture how close the function is to being submodular.\n\nDe\ufb01nition 2 (Submodularity ratio) Let f be a non-negative set function. The submodularity ratio of f with respect to a set U and a parameter k \u2265 1 is\n\n$\\gamma_{U,k}(f) = \\min_{L \\subseteq U,\\, S : |S| \\le k,\\, S \\cap L = \\emptyset} \\frac{\\sum_{x \\in S} (f(L \\cup \\{x\\}) - f(L))}{f(L \\cup S) - f(L)}$.\n\nThus, it captures how much f can increase by adding any subset S of size k to L, compared to the\ncombined bene\ufb01ts of adding its individual elements to L. In particular, [5] de\ufb01nes the submodularity\nratio for the $R^2$ function and relates it to the smallest eigenvalue of the covariance matrix of the data.\nThey also show that, in practice, the submodularity ratio for $R^2$ is often quite close to 1, and hence\na greedy algorithm is a good approximation to maximizing $R^2$ subject to a cardinality constraint.\n\nTheorem 3 (from [4]) Let f be a non-negative, monotone set function and let OPT be the maxi-\nmum value of f obtained by any set of size k. Then, the set $\\tilde{S}$ selected by the Greedy Algorithm\nhas the following approximation guarantee: $f(\\tilde{S}) \\ge (1 - e^{-\\gamma_{\\tilde{S},k}(f)}) \\cdot \\mathrm{OPT}$.\n\n3.1 Robustness to perturbations\n\nAs mentioned earlier, in addition to providing better interpretability, another bene\ufb01t of diverse fea-\nture selection is robustness to feature and label perturbations. Given a selected subset S, we now\nobtain a connection between the robustness of the estimated regression coef\ufb01cients and the spectrum\nof CS, in the presence of noise. Suppose S, a subset of size k, is used to predict the target vector\nZ \u2208 Rn. 
Let $A \\in \\mathbb{R}^{n \\times k}$ be the vectors from X corresponding to S. Then $C_S = A^T A$ and the\noptimal regression coef\ufb01cients are $\\alpha = C_S^{-1} A^T Z$.\nNow suppose the target vector is perturbed with an i.i.d. Gaussian noise, i.e., $Z' = Z + \\eta$, where\n$\\eta \\sim N(0, \\sigma^2 I_n)$ is a random vector corresponding to measurement errors. Let the corresponding\nregression coef\ufb01cient vector be $\\alpha' = C_S^{-1} A^T Z'$. We show the following perturbation result relating\nthe differential entropy of the perturbation error in the regression coef\ufb01cients to the spectrum of CS.\n\nFootnote 2: We assume throughout that CS is non-singular. For some of our results, an extension to singular matrices is possible using the Moore\u2013Penrose generalized inverse.\n\nLemma 4 $H(\\alpha' - \\alpha) = k \\log(2\\sigma^2 \\pi e) - \\sum_{i=1}^{k} \\log(\\lambda_i(C_S))$.\n\nProof. Let $\\delta = \\alpha' - \\alpha = C_S^{-1} A^T \\eta$. Since $\\eta \\sim N(0, \\sigma^2 I_{n \\times n})$, we have that $\\delta \\sim N(0, C_S^{-1} A^T \\cdot \\sigma^2 I_{n \\times n} \\cdot (C_S^{-1} A^T)^T)$. Or, $\\delta \\sim N(0, \\sigma^2 C_S^{-1})$. Thus, $H(\\delta) = \\log((2\\sigma^2 \\pi e)^k \\det(C_S^{-1})) = k \\log(2\\sigma^2 \\pi e) - \\sum_{i=1}^{k} \\log(\\lambda_i(C_S))$.\n\nThus the perturbation error entropy is minimized by maximizing $\\sum_{i=1}^{k} \\log(\\lambda_i(C_S))$, which moti-\nvates the smoothed differential-entropy regularizer used in Section 5.1.\nWe can also show (supplementary material) that the two-norm of the perturbation error in the re-\ngression coef\ufb01cients is also related to the spectrum of CS: the expected noise in the regression\ncoef\ufb01cients depends on the sum of the eigenvalues of $C_S^{-1}$. This suggests the use of $-\\sum_i \\frac{1}{\\lambda_i(C_S)}$ as\na diversity-promoting regularizer in De\ufb01nition 1. Unfortunately, this regularization function is not\nsubmodular and is thus hard to use directly. However, as seen in Sections 5.2 and 5.3, there are other\nrelated spectral functions that are indeed submodular and can thus be used as ef\ufb01cient regularizers.\n\n4 Algorithms\n\nIn this section we present a greedy and local-search based (GLS) approximation algorithm for solv-\ning (1) when f (S) is a non-negative (but not necessarily monotone) submodular function (w.l.o.g.,\n\u03bd = 1). In order to give an approximation algorithm for $\\mathrm{argmax}_{S:|S| \\le k}\\, g(S)$, we need to follow a\nsequence of steps. First we show a technical result (Theorem 5) that says that though the approxi-\nmation guarantees of [5] do not carry over to the non-monotone case, we can still prove a weaker\nresult that relates the solution obtained by a greedy algorithm with any feasible solution, as long\nas g(S) is approximately submodular and non-negative (which holds if f (S) is a non-negative sub-\nmodular function). Next, we modify a local-search based algorithm for unconstrained submodular\nmaximization to give an approximation of $\\mathrm{argmax}_S\\, g(S)$ (Theorem 7). We put these together using\nthe framework of [9] to show (Theorem 9) a constant factor approximation for solving (1).\nThe greedy Forward Regression (FR) algorithm is the following.\n1: S0 \u2190 \u2205 and U \u2190 {X1, . . . , Xn}.\n2: In each step i + 1, select Xj \u2208 U \\ Si maximizing g(Si \u222a {Xj}). Set Si+1 \u2190 Si \u222a {Xj} and U \u2190 U \\ {Xj}.\n3: Output Sk.\nTheorem 5 For any set T such that |T| \u2264 k, the set S selected by the greedy FR algorithm satis\ufb01es\n$g(S) = R^2_Z(S) + f(S) \\ge (1 - e^{-\\gamma_{S,2k}/2})\\, g(S \\cup T)$.\n\nThe proof is very similar to that of [5, Theorem 3.2] and is omitted due to space constraints. Next,\nwe consider the problem of unconstrained maximization of the function $g(S) = R^2_Z(S) + f(S)$. 
For\nthis, we use a local search (LS) algorithm similar to [9].\n1: $S \\leftarrow \\mathrm{argmax}_i f(X_i)$ and U \u2190 {X1, . . . , Xn}.\n2: If there exists an element x \u2208 U \\ S such that $f(S \\cup \\{x\\}) \\ge (1 + \\frac{\\epsilon}{n^2}) f(S)$, then set S \u2190 S \u222a {x} and go back to Step 2.\n3: Output $\\mathrm{argmax}_{T \\in \\{S,\\, U \\setminus S,\\, U\\}}\\, g(T)$.\n\nNotice that even though we are interested in maximizing g(S), our LS algorithm \ufb01nds a local optimum\nusing f, but then uses g to compute the maximum in the last step. To analyze the performance\nguarantees of LS, we \ufb01rst use the following result of [9, Theorem 3.4].\nLemma 6 If f is non-negative and submodular, then for any set T \u2286 U and any $\\epsilon > 0$, the LS\nalgorithm takes $O(\\frac{1}{\\epsilon} n^3 \\log n)$ time and outputs a solution S such that $(2 + \\frac{2\\epsilon}{n}) f(S) + f(U \\setminus S) \\ge f(T)$.\n\nUsing the above, we prove an approximation guarantee for unconstrained maximization of g(S).\n\nTheorem 7 The LS algorithm is a $\\frac{1}{4 + 4\\epsilon/n}$ approximation for solving $\\mathrm{argmax}_S\\, g(S)$.\nProof. Suppose the optimal solution is C\u2217 such that g(C\u2217) = OPT. Consider the set S obtained\nby the LS algorithm when it terminates. We obtain $g(C^*) = f(C^*) + R^2(C^*) \\le (2 + 2\\epsilon/n) f(S) + f(U \\setminus S) + R^2(U) \\le (2 + 2\\epsilon/n) g(S) + g(U \\setminus S) + g(U)$, where the second step follows from\nLemma 6 and the monotonicity of $R^2$ and the last step follows from the non-negativity of f and $R^2$.\nThus, $\\max(g(S), g(U \\setminus S), g(U)) \\ge \\frac{1}{4 + 4\\epsilon/n}\\, g(C^*)$.\n\nWe now present the greedy and local search (GLS) algorithm for solving (1) for any submodular,\nnon-monotone, non-negative regularizer.\n1: U \u2190 {X1, . . . , Xn}.\n2: S1 \u2190 FR(U ), S\u20321 \u2190 LS(S1), S2 \u2190 FR(U \\ S1).\n3: Output $\\mathrm{argmax}_{S \\in \\{S_1, S'_1, S_2\\}}\\, g(S)$.\nNext, we prove a multiplicative approximation guarantee for the GLS algorithm.\nLemma 8 Given sets C, S1 \u2286 U, let C\u2032 = C \\ S1 and S2 \u2286 U \\ S1. 
Then g(S1 \u222a C) + g(S2 \u222a\nC\u2032) + g(S1 \u2229 C) \u2265 g(C).\nProof. Using the submodularity of f and the monotonicity of $R^2_Z(S)$, we obtain $g(S_1 \\cup C) + g(S_2 \\cup C') = R^2_Z(S_1 \\cup C) + R^2_Z(S_2 \\cup C') + f(S_1 \\cup C) + f(S_2 \\cup C') \\ge R^2_Z(C) + f(S_1 \\cup S_2 \\cup C) + f(C')$.\nNow, f (C\u2032) + f (S1 \u2229 C) \u2265 f (C) + f (\u2205) \u2265 f (C), or f (C\u2032) \u2265 f (C) \u2212 f (S1 \u2229 C). Hence, we have\n$g(S_1 \\cup C) + g(S_2 \\cup C') + f(S_1 \\cap C) \\ge R^2_Z(C) + f(C) = g(C)$.\n\nTheorem 9 If f is non-negative and submodular and $\\epsilon < \\frac{n}{4}$, the set $\\tilde{S}$ selected by the GLS algorithm\ngives a\n\n$\\frac{1 - e^{-\\gamma_{\\tilde{S},2k}/2}}{2 + (1 - e^{-\\gamma_{\\tilde{S},2k}/2})(4 + 4\\epsilon/n)} \\ge \\frac{1 - e^{-\\gamma_{\\tilde{S},2k}/2}}{7}$\n\napproximation for solving $\\mathrm{argmax}_{S:|S| \\le k}\\, g(S)$.\n\nProof. Let C\u2217 be the optimal solution with g(C\u2217) = OPT. Then g(S1) \u2265 \u03bag(S1 \u222a C\u2217), where\n$\\kappa = (1 - e^{-\\gamma_{S_1,2k}/2})$. If $g(S_1 \\cap C^*) \\ge \\epsilon\\, \\mathrm{OPT}$, then using the LS algorithm on S1, we get (using\nTheorem 7) a solution of value at least $\\frac{\\epsilon}{\\alpha}\\, g(C^*)$, where $\\alpha = 4 + \\frac{4\\epsilon}{n}$. Else, $g(S_1) \\ge \\kappa\\, g(S_1 \\cup C^*) + \\kappa\\, g(S_1 \\cap C^*) - \\kappa \\epsilon\\, \\mathrm{OPT}$. Also, $g(S_2) \\ge \\kappa\\, g(S_2 \\cup (C^* \\setminus S_1))$. Thus, $g(S_1) + g(S_2) \\ge \\kappa\\, g(S_1 \\cup C^*) + \\kappa\\, g(S_1 \\cap C^*) - \\kappa \\epsilon\\, \\mathrm{OPT} + \\kappa\\, g(S_2 \\cup (C^* \\setminus S_1)) \\ge \\kappa\\, g(C^*) - \\kappa \\epsilon\\, \\mathrm{OPT} \\ge \\kappa (1 - \\epsilon)\\, \\mathrm{OPT}$,\nwhere the last inequality follows from Lemma 8. Thus, $\\max(g(S_1), g(S_2)) \\ge \\frac{\\kappa (1 - \\epsilon)\\, \\mathrm{OPT}}{2}$. Hence,\nthe approximation factor is $\\min(\\frac{\\epsilon}{\\alpha}, \\frac{\\kappa (1 - \\epsilon)}{2})$. Setting $\\epsilon = \\frac{\\kappa \\alpha}{\\kappa \\alpha + 2}$, we get a $\\frac{\\kappa}{\\kappa \\alpha + 2}$-approximation.\n\nWhen f (S) is a monotone, non-negative, submodular function, the problem becomes much easier\ndue to the proposition below that follows directly from the de\ufb01nition of the submodularity ratio.\n\nProposition 10 For any submodular set function f (S), the function $g(S) = R^2_Z(S) + f(S)$ satis\ufb01es\n$\\gamma_{U,k}(g) \\ge \\gamma_{U,k}(R^2)$ for any U and k.\nThus, since g(S) is monotone and approximately submodular, we can directly apply [4, Theorem 3]\nto show that the greedy FR algorithm gives a $(1 - e^{-\\gamma_{\\tilde{S},k}(f)})$-approximation.\n\n5 Spectral regularizers for diversity\n\nIn this section we propose a number of diversity-promoting regularizers for the feature selection\nproblem. We then prove that our algorithms in the previous section can obtain provable guarantees\nfor each of the corresponding regularized feature selection problems.\nMost of our analysis requires the notion of operator antitone function [1] and its connection with\nsubmodularity that was recently obtained by Friedland and Gaubert [10].\n\nDe\ufb01nition 11 (Operator antitone functions [1]) A real valued function h is operator antitone on\nthe interval \u0393 \u2286 R if for all n \u2265 1 and for all n \u00d7 n Hermitian matrices A and B, we have\nA \u227c B \u21d2 h(B) \u227c h(A), where A \u227c B denotes that B \u2212 A is positive semide\ufb01nite; the function\nh is called operator monotone if \u2212h is operator antitone.\n\nTheorem 12 ([10]) Let f be a real continuous function de\ufb01ned on an interval \u0393 of R. 
If the deriva-\ntive of f is operator antitone on the interior of \u0393, then for every n \u00d7 n Hermitian matrix C with\nspectrum in \u0393, the set function (from $2^n \\to \\mathbb{R}$) $\\mathrm{tr}(f(S)) = \\sum_{i=1}^{n} f(\\lambda_i(C_S))$ is submodular.\n\nWe will frequently use the following lemma for proving monotonicity of set functions. The proof is\ngiven in the supplementary material.\nLemma 13 If f is a monotone and non-negative function de\ufb01ned on R, then for every n \u00d7 n Her-\nmitian matrix C, the set function $\\mathrm{tr}(f(S)) = \\sum_{i=1}^{n} f(\\lambda_i(C_S))$ is monotone.\n\n5.1 Smoothed differential entropy regularizer\n\nFor any set S with the corresponding covariance matrix CS, we de\ufb01ne the smoothed differential\nentropy regularizer as $f_{de}(S) = \\sum_{i=1}^{|S|} \\log_2(\\delta + \\lambda_i(C_S)) - 3k \\log_2 \\delta$, where \u03b4 > 0 is the smoothing\nconstant. This is a smoothed version of the log-determinant function $f_{ld}(S) = \\log(\\det(C_S)) = \\sum_{i=1}^{|S|} \\log(\\lambda_i(C_S))$, that is also normalized by an additive term of $3k \\log_2 \\delta$ in order to make the\nregularizer non-negative (see footnote 3).\nAs shown in Lemma 4, this regularizer also helps improve the robustness of the regression model to\nnoise since maximizing fld(S) minimizes the entropy of the perturbation error. For a multivariate\nGaussian distribution, fld(S) is also equivalent (up to an additive |S| factor) to the differential entropy\nof S. However, fld(S) is unde\ufb01ned if S is rank-de\ufb01cient and might also take negative values; the\nsmoothed version fde(S) overcomes these issues. It is also easy to show that fde(S) is a diversity-\npromoting regularizer. 
We now show that using the GLS algorithm to solve (1) with f (S) = fde(S) gives\na constant-factor approximation.\n\nTheorem 14 The set $\\tilde{S}$ selected by the GLS algorithm gives a $\\frac{1 - e^{-\\gamma_{\\tilde{S},2k}/2}}{7}$ multiplicative approxima-\ntion guarantee for (1) using the smoothed differential entropy regularizer fde(S).\n\nProof. We \ufb01rst prove that fde(S) is non-negative and submodular. Consider the real-valued func-\ntion $\\tilde{f}(t) = \\log(\\delta + t)$ de\ufb01ned on the appropriate interval of R. We will show that the derivative of\n$\\tilde{f}$ is operator antitone. Let A, B be k \u00d7 k Hermitian matrices, such that 0 \u227a A \u227c B. Let I denote\nthe identity matrix. Then A + \u03b4I \u227c B + \u03b4I. Taking inverses, $(B + \\delta I)^{-1} \\preceq (A + \\delta I)^{-1}$. Thus,\nby De\ufb01nition 11, the function $h(t) = \\frac{1}{\\delta + t}$ is operator antitone. Since h(t) is the derivative of $\\tilde{f}(t)$,\na straightforward application of Theorem 12 gives us that fde(S) is submodular. By Proposition 10,\nwe obtain that g(S) is approximately submodular, with submodularity ratio at least $\\gamma_{\\tilde{S},k}(R^2)$. Since\ng(S) is also non-negative, we can now apply Theorem 9 to obtain the approximation guarantee.\n\nNotice that since fde(S) is not monotone in general [13], we cannot use Theorem 3. However, in\nthe case when \u03b4 \u2265 1, a simple application of Lemma 13 shows that fde(S) becomes monotonically\nincreasing and we can then use Theorem 3 to obtain a tighter approximation bound.\n\n5.2 Generalized rank regularizer\nFor any set S with covariance matrix CS, and constant \u03b1 such that 0 \u2264 \u03b1 \u2264 1, we de\ufb01ne the gen-\neralized rank regularizer as $f_{gr}(S) = \\sum_{i=1}^{|S|} \\lambda_i(C_S)^{\\alpha}$. Notice that for \u03b1 = 0, fgr(S) = rank(CS).\nThe rank function, however, does not discriminate between a full-rank matrix and an orthogonal ma-\ntrix, and hence we de\ufb01ne fgr(S) as a generalization of the rank function. 
It is easy to show that\nfgr(S) is diversity-promoting. We prove that fgr(S) is also monotone and submodular, and hence\nobtain approximation guarantees for the greedy FR algorithm for (1) with f (S) = fgr(S).\nTheorem 15 The set $\\tilde{S}$ selected by the greedy FR algorithm gives a $(1 - e^{-\\gamma_{\\tilde{S},k}(R^2)})$ multiplicative\napproximation guarantee for (1) using the generalized rank regularizer fgr(S).\n\nProof. Consider the real-valued function $\\tilde{f}(t) = t^{\\alpha}$ de\ufb01ned on t \u2208 R. It is well known [1] that\nthe derivative of $\\tilde{f}$ is operator antitone. Thus, Theorem 12 gives us that fgr(S) is submodular.\nHence, by applying Proposition 10, we obtain that g(S) is an approximately submodular function, with\nsubmodularity ratio at least $\\gamma_{\\tilde{S},k}(R^2)$. Also, by de\ufb01nition $\\tilde{f}(t)$ is non-negative and monotone.\nThus, using Lemma 13, we get that fgr(S) and consequently g(S) is a monotonically increasing set\nfunction. Since g(S) is non-negative, monotone, and approximately submodular, we can now apply Theorem 3 to\nobtain a $(1 - e^{-\\gamma_{\\tilde{S},k}(R^2)})$ approximation ratio.\n\nFootnote 3: we need this regularizer to be non-negative for sets of size up to 3k, because of the use of f (S1 \u222a S2 \u222a C) in the proof of Lemma 8.\n\n5.3 Spectral variance regularizer\n\nFor a set S with covariance matrix CS, we de\ufb01ne the spectral variance regularizer as\n$-\\sum_{i=1}^{|S|} (\\lambda_i(C_S) - 1)^2$. This regularizes the variance of the eigenvalues of the matrix (recall that\nfor an orthogonal matrix, all the eigenvalues are equal to 1) and can be shown to be diversity-\npromoting. For non-negativity, we add a constant $9k^2$ term (see footnote 4) to the regularizer and de\ufb01ne $f_{sv}(S) = 9k^2 - \\sum_{i=1}^{|S|} (\\lambda_i(C_S) - 1)^2$. As with fde(S), we can show (proof relegated to the supplementary ma-\nterial) that fsv(S) is submodular, but it is not monotonically increasing in general. 
Hence, appealing\nto Theorem 9, we obtain the following.\n\nTheorem 16 The set $\\tilde{S}$ selected by the GLS algorithm gives a $\\frac{1 - e^{-\\gamma_{\\tilde{S},2k}/2}}{7}$ multiplicative approxima-\ntion guarantee for (1) using the spectral variance regularizer fsv(S).\n\n6 Experiments and results\n\nIn this section we conduct experiments in different settings to validate the robustness of our spectral\nregularizers. We compare our approach against two baselines: Lasso and greedy FR. We use two\ndifferent datasets for the experiments, the mnist data (http://yann.lecun.com/exdb/mnist/) and simulated data (for which results are presented in the supplementary material).\nThe way we synthesize a regression problem out of the mnist dataset is as follows. Each image\nis regarded as a feature vector (of size 784) consisting of the pixel intensities. The target vector for\nthe regression problem consists of the vector of labels. We only sample 1000 images out of the set,\nand thus have a regression problem with $X \\in \\mathbb{R}^{1000 \\times 784}$ and $Z \\in \\mathbb{R}^{1000}$. We then preprocess the\ncolumns of matrix X and the target vector Z to have unit $\\ell_2$-length.\nWe use two baselines: lasso and no-reg, the greedy FR with no regularizer. We also use four\ndifferent spectral regularizers: ld (Section 5.1, with \u03b4 = 1), ld-0.1 (Section 5.1, with \u03b4 = 0.1),\nsv (Section 5.3), and gr (Section 5.2). We considered two different types of perturbations: perturb-\ning Z and X. In order to perturb Z, we \ufb01rst sample a random vector $\\eta \\in \\mathbb{R}^{1000}$, $\\eta_i \\sim N(0, 1)$. We\nthen create $Z' = Z + \\sigma \\frac{\\eta}{\\|\\eta\\|}$, where \u03c3 is varied in [0, 1] (see footnote 5). If S is the set of features selected, then the\nunperturbed regression coef\ufb01cients are de\ufb01ned as $\\alpha = C_S^{-1} X_S^T Z$, and the perturbed coef\ufb01cients as\n$\\alpha' = C_S^{-1} X_S^T Z'$. The error that we measure is $\\|\\alpha - \\alpha'\\|_2$. Similarly, in order to perturb X, we \ufb01rst\nsample $E \\in \\mathbb{R}^{1000 \\times 784}$. Let $E_{\\star i}$ denote the ith column of E. Then, we create X\u2032, the perturbed\nversion of X columnwise as $X'_{\\star i} = X_{\\star i} + \\sigma \\frac{E_{\\star i}}{\\|E_{\\star i}\\|}$. Here again, the perturbed regression coef\ufb01cients\nare $\\alpha' = (C'_S)^{-1} (X'_S)^T Z$ where $C'_S = (X'_S)^T X'_S$ and the error is measured as $\\|\\alpha - \\alpha'\\|_2$. For our\nexperiments, we apply each random perturbation 5 times and then take the average error. Note that\nthe differential entropy of \u03b1 \u2212 \u03b1\u2032 is directly given by Lemma 4; we will directly measure the quantity\non the RHS of the equation of Lemma 4.\n\nFootnote 4: as before, we need this regularizer to be non-negative for sets of size up to 3k due to the proof of Lemma 8.\nFootnote 5: Strictly speaking, normalizing \u03b7 makes it non-Gaussian, but for a high dimensional vector $\\|\\eta\\|$ is highly concentrated.\n\nFigure 1: All plots on mnist data. (a) Error when Z is perturbed (\u03c3 = 0.1). (b) Error when X is\nperturbed (\u03c3 = 0.1). (c) Diversity comparison for ld. (d) Diversity comparison for ld-0.1. (e)\nDiversity comparison for sv. (f) Diversity comparison for gr.\n\nResults. Figure 1 summarizes the result for the mnist data. For clarity of presentation, we have\nonly shown the results of greedy FR for monotone regularizers (ld and gr) and GLS for non-\nmonotone (ld-0.1, sv). We also show the results only for \u03c3 = 0.1; results for other values\nof \u03c3 are similar. The way we decided on the regularization parameters \u03bb is as follows. First we\nrun the lasso using a regularization path approach, and obtain a set of solutions for a range of\nregularization parameter values and corresponding sparsity (k) values. 
For the other algorithms, we\nuse each of these sparsity values as the target number of features to be selected. We chose the\nregularization constant (\u03bd) to be the maximum subject to the condition that the $R^2$ value for that\nsolution should be greater than that obtained by the lasso solution with corresponding sparsity.\nThis ensures we are not sacri\ufb01cing solution quality for diversity.\nFigure 1(a) shows the errors obtained when perturbing the Z vector. As is obvious from the \ufb01g-\nure, the coef\ufb01cient vector obtained by lasso is very susceptible to perturbation, and the effect of\nperturbation increases with the number of features used by lasso. This indicates that as lasso\nstarts incorporating more features, it does not ensure that the features are diverse enough so as to be\nrobust to perturbation. Greedy with no regularization seems more stable than lasso but still shows\nan increasing trend. On the other hand, the errors obtained under perturbation are much smaller for any of the\nregularizers, and are only very mildly increasing with k: it does not seem to matter which regularizer\nwe employ. Figure 1(b) shows the error obtained when perturbing the X matrix; the same story is\ntrue here also. In both cases, using any of the regularizers we are able to pick a set of features that\nare more robust to perturbation.\nFigures 1(c)-1(f) show that our features are also more diverse than the ones obtained by both lasso\nand no-reg. Since there is no one de\ufb01nition of diversity, in each of the plots, we take one of the\nde\ufb01nitions of diversity value corresponding to the four regularizers we use. In order to be able to\ncompare, the regularizer values for each k are normalized by the maximum value possible for that\nk. For each of the plots we show the diversity value for solutions at different levels of\nsparsity. It is obvious that we get more diverse solutions than both lasso and no-reg. 
The lines\ncorresponding to lasso or no-reg show an increasing trend because of the normalization.\n\n[Figure 1 panels: x-axis, number of features selected; y-axis, error in beta for (a) and (b), and regularizer value (logdet, logdet-0.1, spec-var, gen-rank) for (c)-(f). Curves in (a) and (b): lasso, no-reg, logdet, logdet-0.1, spec-variance, gen-rank; curves in (c)-(f): lasso, no-reg, and the corresponding regularizer.]\n\n7 Conclusions\n\nIn this paper we proposed submodular spectral regularizers for diverse feature selection and obtained\nefficient approximation algorithms based on greedy and local search. These algorithms\nobtain a more diverse and noise-insensitive set of features. It would be interesting to see whether we\ncan design convex relaxations for such approaches, and to compare our approach with related ones,\ne.g., CLASH [18], which presents a general framework for merging combinatorial constraints with the\nL1-norm constraint for LASSO, or Elastic-Net, which provides stability to the features selected\nwhen groups of correlated variables are present.\n\nReferences\n[1] R. Bhatia. Matrix Analysis. Springer, 1997.\n[2] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In ICML, pages 321\u2013328, 2011.\n[3] E. J. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. CPAM, 59:1207\u20131223, 2005.\n[4] A. Das. 
Subset Selection Algorithms for Prediction. PhD thesis, University of Southern California, 2011.\n[5] A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In ICML, pages 1057\u20131064, 2011.\n[6] G. Diekhoff. Statistics for the Social and Behavioral Sciences. Wm. C. Brown Publishers, 2002.\n[7] C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. In J. Bioinform. Comput. Biol., pages 523\u2013529, 2003.\n[8] D. Donoho. For most large underdetermined systems of linear equations, the minimal \u21131-norm near-solution approximates the sparsest near-solution. CPAM, 59:1207\u20131223, 2005.\n[9] U. Feige, V. S. Mirrokni, and J. Vondrak. Maximizing non-monotone submodular functions. SIAM J. Comput., 40(4):1133\u20131153, 2011.\n[10] S. Friedland and S. Gaubert. Submodular spectral functions of principal submatrices of a Hermitian matrix, extensions and applications. Linear Algebra and its Applications, 2011.\n[11] A. Gilbert, S. Muthukrishnan, and M. Strauss. Approximation of functions over redundant dictionaries using coherence. In SODA, 2003.\n[12] E. Grave, G. Obozinski, and F. R. Bach. Trace Lasso: a trace norm regularization for correlated designs. In NIPS, 2011.\n[13] C. Guestrin, A. Krause, and A. Singh. Near-optimal sensor placements in Gaussian processes. In ICML, 2005.\n[14] A. Gupta, A. Roth, G. Schoenebeck, and K. Talwar. Constrained non-monotone submodular maximization: Offline and secretary algorithms. In WINE, pages 246\u2013257, 2010.\n[15] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1999.\n[16] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, 2002.\n[17] C.-W. Ko, J. Lee, and M. Queyranne. An exact algorithm for maximum entropy sampling. OR, 43(4):684\u2013691, 1995.\n[18] A. Kyrillidis and V. Cevher. 
Combinatorial selection and least absolute shrinkage via the CLASH algorithm. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pages 2216\u20132220, July 2012.\n[19] A. Miller. Subset Selection in Regression. Chapman and Hall, second edition, 2002.\n[20] R. Tibshirani. Regression shrinkage and selection via the Lasso. JRSS, 58:267\u2013288, 1996.\n[21] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Information Theory, 50:2231\u20132242, 2004.\n[22] J. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE TOIT, 51:1030\u20131051, 2006.\n[23] L. Yu. Redundancy based feature selection for microarray data. In SIGKDD, pages 737\u2013742, 2004.\n[24] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In NIPS, 2008.\n[25] S. Zhou. Thresholding procedures for high dimensional variable selection and statistical estimation. In NIPS, 2009.\n", "award": [], "sourceid": 742, "authors": [{"given_name": "Abhimanyu", "family_name": "Das", "institution": null}, {"given_name": "Anirban", "family_name": "Dasgupta", "institution": null}, {"given_name": "Ravi", "family_name": "Kumar", "institution": null}]}