{"title": "Subset Selection by Pareto Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1774, "page_last": 1782, "abstract": "Selecting the optimal subset from a large set of variables is a fundamental problem in various learning tasks such as feature selection, sparse regression, dictionary learning, etc. In this paper, we propose the POSS approach which employs evolutionary Pareto optimization to find a small-sized subset with good performance. We prove that for sparse regression, POSS is able to achieve the best-so-far theoretically guaranteed approximation performance efficiently. Particularly, for the \\emph{Exponential Decay} subclass, POSS is proven to achieve an optimal solution. Empirical study verifies the theoretical results, and exhibits the superior performance of POSS to greedy and convex relaxation methods.", "full_text": "Subset Selection by Pareto Optimization\n\nChao Qian\n\nYang Yu\n\nZhi-Hua Zhou\n\nNational Key Laboratory for Novel Software Technology, Nanjing University\n\nCollaborative Innovation Center of Novel Software Technology and Industrialization\n\n{qianc,yuy,zhouzh}@lamda.nju.edu.cn\n\nNanjing 210023, China\n\nAbstract\n\nSelecting the optimal subset from a large set of variables is a fundamental problem\nin various learning tasks such as feature selection, sparse regression, dictionary\nlearning, etc. In this paper, we propose the POSS approach which employs evo-\nlutionary Pareto optimization to \ufb01nd a small-sized subset with good performance.\nWe prove that for sparse regression, POSS is able to achieve the best-so-far the-\noretically guaranteed approximation performance ef\ufb01ciently. Particularly, for the\nExponential Decay subclass, POSS is proven to achieve an optimal solution. 
Em-\npirical study veri\ufb01es the theoretical results, and exhibits the superior performance\nof POSS to greedy and convex relaxation methods.\n\n1\n\nIntroduction\n\nSubset selection is to select a subset of size k from a total set of n variables for optimizing some\ncriterion. This problem arises in many applications, e.g., feature selection, sparse learning and\ncompressed sensing. The subset selection problem is, however, generally NP-hard [13, 4]. Previous\nemployed techniques can be mainly categorized into two branches, greedy algorithms and convex\nrelaxation methods. Greedy algorithms iteratively select or abandon one variable that makes the\ncriterion currently optimized [9, 19], which are however limited due to its greedy behavior. Convex\nrelaxation methods usually replace the set size constraint (i.e., the (cid:96)0-norm) with convex constraints,\ne.g., the (cid:96)1-norm constraint [18] and the elastic net penalty [29]; then \ufb01nd the optimal solutions to\nthe relaxed problem, which however could be distant to the true optimum.\nPareto optimization solves a problem by reformulating it as a bi-objective optimization problem\nand employing a bi-objective evolutionary algorithm, which has signi\ufb01cantly developed recently in\ntheoretical foundation [22, 15] and applications [16]. This paper proposes the POSS (Pareto Opti-\nmization for Subset Selection) method, which treats subset selection as a bi-objective optimization\nproblem that optimizes some given criterion and the subset size simultaneously. To investigate the\nperformance of POSS, we study a representative example of subset selection, the sparse regression.\nThe subset selection problem in sparse regression is to best estimate a predictor variable by linear\nregression [12], where the quality of estimation is usually measured by the mean squared error, or\nequivalently, the squared multiple correlation R2 [6, 11]. Gilbert et al. 
[9] studied the two-phase approach with orthogonal matching pursuit (OMP), and proved a multiplicative approximation guarantee of 1 + Θ(µk^2) for the mean squared error when the coherence µ (i.e., the maximum correlation between any pair of observation variables) is O(1/k). This approximation bound was later improved by [20, 19]. Under the same small-coherence condition, Das and Kempe [2] analyzed the forward regression (FR) algorithm [12] and obtained an approximation guarantee of 1 − Θ(µk) for R^2. These results, however, break down when µ ∈ ω(1/k). By introducing the submodularity ratio γ, Das and Kempe [3] proved the approximation guarantee 1 − e^{−γ} on R^2 for the FR algorithm; this guarantee is considered the strongest, since it applies under any coherence. Note that sparse regression is similar to the problem of sparse recovery [7, 25, 21, 17], but the two serve different purposes: assuming that the predictor variable has a sparse representation, sparse recovery aims to recover the exact coefficients of the truly sparse solution.

We theoretically prove that, for sparse regression, POSS in polynomial time achieves a multiplicative approximation guarantee of 1 − e^{−γ} for the squared multiple correlation R^2, the best-so-far guarantee, previously obtained by the FR algorithm [3]. For the Exponential Decay subclass, which has clear applications in sensor networks [2], POSS can provably find an optimal solution, while FR cannot. The experimental results verify the theoretical results and exhibit the superior performance of POSS.

We start the rest of the paper by introducing the subset selection problem. We then present, in three subsequent sections, the POSS method, its theoretical analysis for sparse regression, and the empirical studies.
The final section concludes this paper.

2 Subset Selection

The subset selection problem originally aims at selecting a few columns from a matrix, so that the matrix is best represented by the selected columns [1]. In this paper, we present the generalized subset selection problem, which applies to an arbitrary criterion evaluating the selection.

2.1 The General Problem

Given a set of observation variables V = {X1, . . . , Xn}, a criterion f and a positive integer k, the subset selection problem is to select a subset S ⊆ V such that f is optimized under the constraint |S| ≤ k, where |·| denotes the size of a set. For notational convenience, we do not distinguish between S and its index set I_S = {i | Xi ∈ S}. Subset selection is formally stated as follows.

Definition 1 (Subset Selection). Given all variables V = {X1, . . . , Xn}, a criterion f and a positive integer k, the subset selection problem is to find the solution of the optimization problem:

arg min_{S⊆V} f(S)  s.t.  |S| ≤ k.  (1)

The subset selection problem is NP-hard in general [13, 4], except for some extremely simple criteria. In this paper, we take sparse regression as the representative case.

2.2 Sparse Regression

Sparse regression [12] finds a sparse approximate solution to the regression problem, where the solution vector can have only a few non-zero elements.

Definition 2 (Sparse Regression). Given all observation variables V = {X1, . . .
, Xn}, a predictor variable Z and a positive integer k, define the mean squared error of a subset S ⊆ V as

MSE_{Z,S} = min_{α ∈ R^{|S|}} E[(Z − Σ_{i∈S} α_i X_i)^2].

Sparse regression is to find a set of at most k variables minimizing the mean squared error, i.e.,

arg min_{S⊆V} MSE_{Z,S}  s.t.  |S| ≤ k.

For the ease of theoretical treatment, the squared multiple correlation

R^2_{Z,S} = (Var(Z) − MSE_{Z,S}) / Var(Z)

is used in place of MSE_{Z,S} [6, 11], so that sparse regression is equivalently

arg max_{S⊆V} R^2_{Z,S}  s.t.  |S| ≤ k.  (2)

Sparse regression is a representative example of subset selection [12]; we study Eq. (2) in this paper. Without loss of generality, we assume that all random variables are normalized to have expectation 0 and variance 1, so that R^2_{Z,S} simplifies to 1 − MSE_{Z,S}.

For sparse regression, Das and Kempe [3] proved that the forward regression (FR) algorithm, presented in Algorithm 1, produces a solution SFR with |SFR| = k and R^2_{Z,SFR} ≥ (1 − e^{−γ_{SFR,k}}) · OPT (where OPT denotes the optimal function value of Eq. (2)), which is the best currently known approximation guarantee. FR is a greedy approach that iteratively selects the variable giving the largest R^2 improvement.

Algorithm 1 Forward Regression
Input: all variables V = {X1, . . . , Xn}, a predictor variable Z and an integer parameter k ∈ [1, n]
Output: a subset of V with k variables
Process:
1: Let t = 0 and St = ∅.
2: repeat
3: Let X* be a variable maximizing R^2_{Z,St∪{X}}, i.e., X* = arg max_{X ∈ V\St} R^2_{Z,St∪{X}}.
4: Let St+1 = St ∪ {X*}, and t = t + 1.
5: until t = k
6: return Sk

3 The POSS Method

The subset selection in Eq.
(1) can be separated into two objectives: one optimizes the criterion, i.e., min_{S⊆V} f(S), while the other keeps the size small, i.e., min_{S⊆V} max{|S| − k, 0}. The two objectives are usually conflicting, that is, a subset with a better criterion value may have a larger size. The POSS method solves the two objectives simultaneously, as described below.

We use the binary vector representation for subset membership, i.e., s ∈ {0, 1}^n represents a subset S of V by assigning si = 1 if the i-th element of V is in S and si = 0 otherwise. We assign two properties to a solution s: o1 is the criterion value and o2 is the sparsity,

s.o1 = +∞ if s = {0}^n or |s| ≥ 2k, and s.o1 = f(s) otherwise;  s.o2 = |s|,

where setting o1 to +∞ excludes trivial or overly bad solutions. We further introduce the isolation function I : {0, 1}^n → R as in [22], which determines whether two solutions are allowed to be compared: they are comparable only if they have the same isolation function value. The implementation of I is left as a parameter of the method; its effect will become clear in the analysis.

As will be seen later, we need to compare solutions. For solutions s and s′, we first check whether they have the same isolation function value. If not, we say that they are incomparable. If they do, s is worse than s′ if s′ has a smaller or equal value on both properties; s is strictly worse if s′ has a strictly smaller value on one property and a smaller or equal value on the other. If neither s is worse than s′ nor s′ is worse than s, we also say that they are incomparable.

POSS is described in Algorithm 2.
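The bi-objective scheme just described can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the names `poss`, `props` and `crit` are assumptions of this sketch, a constant isolation function is assumed (so the isolation check disappears), and `f` is any criterion to be minimized over index subsets.

```python
import math
import random

def poss(f, n, k, T, seed=0):
    """Sketch of POSS with a constant isolation function.

    f: maps a frozenset of selected indices to the criterion value (minimized).
    Solutions of size 0 or >= 2k get o1 = +infinity, as in the text.
    """
    rng = random.Random(seed)

    def props(s):
        o1 = math.inf if len(s) == 0 or len(s) >= 2 * k else f(s)
        return (o1, len(s))

    archive = {frozenset(): props(frozenset())}  # the archive P
    for _ in range(T):
        s = rng.choice(list(archive))  # pick an archived solution uniformly
        # flip each bit independently with probability 1/n
        s2 = frozenset(i for i in range(n)
                       if (i in s) != (rng.random() < 1.0 / n))
        o1, o2 = props(s2)
        # reject s2 only if some archived z is strictly better:
        # strictly smaller on one property, no larger on the other
        if not any((z1 < o1 and z2 <= o2) or (z1 <= o1 and z2 < o2)
                   for (z1, z2) in archive.values()):
            # drop archived solutions that s2 renders "worse"
            archive = {z: p for z, p in archive.items()
                       if not (o1 <= p[0] and o2 <= p[1])}
            archive[s2] = (o1, o2)
    # return the best feasible archived solution
    feasible = [s for s, (o1, _) in archive.items()
                if len(s) <= k and o1 < math.inf]
    return min(feasible, key=lambda s: archive[s][0]) if feasible else frozenset()
```

Note how the archive keeps only mutually incomparable solutions, mirroring the domination-based comparison rule defined above.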
Starting from the solution representing the empty set, with the archive P containing only that solution (line 1), POSS generates new solutions by randomly flipping bits of an archived solution (in the binary vector representation), as in lines 4 and 5. A newly generated solution is compared with the previously archived solutions (line 6): if it is not strictly worse than any of them, it is archived. Before the newly generated solution is archived in line 8, the archive P is cleaned by removing the solutions in Q, i.e., previously archived solutions that are worse than the newly generated one.

The iteration of POSS repeats for T times. Note that T is a parameter, which may depend on the resources available to the user. We analyze the relationship between the solution quality and T in later sections, and use the theoretically derived T value in the experiments. After the iterations, the final solution is selected from the archive according to Eq. (1), i.e., the solution with the smallest f value among those satisfying the size constraint (line 12).

4 POSS for Sparse Regression

In this section, we examine the theoretical performance of the POSS method for sparse regression, where the criterion f is implemented as f(s) = −R^2_{Z,s}. Note that minimizing −R^2_{Z,s} is equivalent to the original objective of maximizing R^2_{Z,s} in Eq. (2).

We need some notation for the analysis. Let Cov(·, ·) be the covariance between two random variables, C be the covariance matrix of the observation variables, i.e., C_{i,j} = Cov(Xi, Xj),

Algorithm 2 POSS
Input: all variables V = {X1, . . .
, Xn}, a given criterion f and an integer parameter k ∈ [1, n]
Parameter: the number of iterations T and an isolation function I : {0, 1}^n → R
Output: a subset of V with at most k variables
Process:
1: Let s = {0}^n and P = {s}.
2: Let t = 0.
3: while t < T do
4: Select s from P uniformly at random.
5: Generate s′ from s by flipping each bit of s with probability 1/n.
6: if ∄ z ∈ P such that I(z) = I(s′) and ((z.o1 < s′.o1 ∧ z.o2 ≤ s′.o2) or (z.o1 ≤ s′.o1 ∧ z.o2 < s′.o2)) then
7: Q = {z ∈ P | I(z) = I(s′) ∧ s′.o1 ≤ z.o1 ∧ s′.o2 ≤ z.o2}.
8: P = (P \ Q) ∪ {s′}.
9: end if
10: t = t + 1.
11: end while
12: return arg min_{s∈P, |s|≤k} f(s)

and b be the covariance vector between Z and the observation variables, i.e., b_i = Cov(Z, Xi). Let C_S denote the submatrix of C with row and column set S, and b_S the subvector of b containing the elements b_i with i ∈ S. Let Res(Z, S) = Z − Σ_{i∈S} α_i X_i denote the residual of Z with respect to S, where α ∈ R^{|S|} is the least squares solution to MSE_{Z,S} [6]. The submodularity ratio presented in Definition 3 is a measure characterizing how close a set function f is to submodularity. It is easy to see that f is submodular iff γ_{U,k}(f) ≥ 1 for any U and k. For f being the objective function R^2, we write γ_{U,k} for short in the rest of the paper.
Definition 3 (Submodularity Ratio [3]). Let f be a non-negative set function.
The submodularity ratio of f with respect to a set U and a parameter k ≥ 1 is

γ_{U,k}(f) = min_{L⊆U, S: |S|≤k, S∩L=∅} ( Σ_{x∈S} (f(L ∪ {x}) − f(L)) ) / ( f(L ∪ S) − f(L) ).

4.1 On General Sparse Regression

Our first result is the theoretical approximation bound of POSS for sparse regression in Theorem 1. Let OPT denote the optimal function value of Eq. (2). The expected running time of POSS is the expected number of objective function (i.e., R^2) evaluations, the most time-consuming step; this equals the expected number of iterations T (denoted by E[T]), since each iteration performs only one objective evaluation, for the newly generated solution s′.

Theorem 1. For sparse regression, POSS with E[T] ≤ 2ek^2 n and I(·) = 0 (i.e., a constant function) finds a set S of variables with |S| ≤ k and R^2_{Z,S} ≥ (1 − e^{−γ_{∅,k}}) · OPT.

The proof relies on the property of R^2 in Lemma 1: for any subset of variables, there always exists another variable whose inclusion brings an improvement on R^2 proportional to the current distance to the optimum. Lemma 1 is extracted from the proof of Theorem 3.2 in [3].

Lemma 1. For any S ⊆ V, there exists one variable X̂ ∈ V − S such that

R^2_{Z,S∪{X̂}} − R^2_{Z,S} ≥ (γ_{∅,k}/k) (OPT − R^2_{Z,S}).

Proof. Let S*_k be the optimal set of variables of Eq. (2), i.e., R^2_{Z,S*_k} = OPT. Let S̄ = S*_k − S and S′ = {Res(X, S) | X ∈ S̄}. Using Lemmas 2.3 and 2.4 in [2], we can easily derive that R^2_{Z,S∪S̄} = R^2_{Z,S} + R^2_{Z,S′}. Because R^2_{Z,S} increases with S and S*_k ⊆ S ∪ S̄, we have R^2_{Z,S∪S̄} ≥ R^2_{Z,S*_k} = OPT. Thus, R^2_{Z,S′} ≥ OPT − R^2_{Z,S}. By Definition 3, |S′| = |S̄| ≤ k and R^2_{Z,∅} = 0, we get Σ_{X′∈S′} R^2_{Z,X′} ≥ γ_{∅,k} R^2_{Z,S′} ≥ γ_{∅,k} (OPT − R^2_{Z,S}). Let X̂′ = arg max_{X′∈S′} R^2_{Z,X′}. Then, R^2_{Z,X̂′} ≥ (γ_{∅,k}/|S′|) (OPT − R^2_{Z,S}) ≥ (γ_{∅,k}/k) (OPT − R^2_{Z,S}). Let X̂ ∈ S̄ correspond to X̂′, i.e., Res(X̂, S) = X̂′. Thus, R^2_{Z,S∪{X̂}} − R^2_{Z,S} = R^2_{Z,X̂′} ≥ (γ_{∅,k}/k) (OPT − R^2_{Z,S}). The lemma holds. □

Proof of Theorem 1. Since the isolation function is a constant function, all solutions are comparable, and we can ignore it. Let Jmax denote the maximum value of j ∈ [0, k] such that the archive P contains a solution s with |s| ≤ j and R^2_{Z,s} ≥ (1 − (1 − γ_{∅,k}/k)^j) · OPT; that is, Jmax = max{j ∈ [0, k] | ∃s ∈ P, |s| ≤ j ∧ R^2_{Z,s} ≥ (1 − (1 − γ_{∅,k}/k)^j) · OPT}. We analyze the expected number of iterations until Jmax = k, which implies that P contains a solution s with |s| ≤ k and R^2_{Z,s} ≥ (1 − (1 − γ_{∅,k}/k)^k) · OPT ≥ (1 − e^{−γ_{∅,k}}) · OPT.

The initial value of Jmax is 0, since POSS starts from {0}^n. Assume that currently Jmax = i < k, and let s be a corresponding solution, i.e., |s| ≤ i and R^2_{Z,s} ≥ (1 − (1 − γ_{∅,k}/k)^i) · OPT. Jmax cannot decrease, because removing s from P (lines 7 and 8 of Algorithm 2) implies that s is "worse" than a newly generated solution s′, which must have a size no larger and an R^2 value no smaller than those of s. By Lemma 1, flipping one specific 0 bit of s (i.e., adding a specific variable into S) generates a new solution s′ satisfying R^2_{Z,s′} − R^2_{Z,s} ≥ (γ_{∅,k}/k)(OPT − R^2_{Z,s}). Then, we have

R^2_{Z,s′} ≥ (1 − γ_{∅,k}/k) R^2_{Z,s} + (γ_{∅,k}/k) · OPT ≥ (1 − (1 − γ_{∅,k}/k)^{i+1}) · OPT.

Since |s′| = |s| + 1 ≤ i + 1, s′ will be included into P; otherwise, by line 6 of Algorithm 2, s′ would be "strictly worse" than some solution in P, which implies that Jmax is already larger than i, contradicting Jmax = i. After including s′, Jmax ≥ i + 1. Let Pmax denote the largest size of P. Then Jmax increases by at least 1 in one iteration with probability at least (1/Pmax) · (1/n)(1 − 1/n)^{n−1} ≥ 1/(en Pmax), where 1/Pmax is a lower bound on the probability of selecting s in line 4 of Algorithm 2, and (1/n)(1 − 1/n)^{n−1} is the probability of flipping a specific bit of s while keeping the other bits unchanged in line 5. Thus, at most en Pmax expected iterations are needed to increase Jmax, and after k · en Pmax expected iterations, Jmax must have reached k.

By the procedure of POSS, the solutions maintained in P are incomparable, so each value of one property corresponds to at most one solution in P. Because solutions with |s| ≥ 2k have value +∞ on the first property, they are excluded from P. Thus |s| ∈ {0, 1, . . . , 2k − 1}, which implies Pmax ≤ 2k. Hence, the expected number of iterations E[T] for finding the desired solution is at most 2ek^2 n. □

Comparing with the approximation guarantee of FR, (1 − e^{−γ_{SFR,k}}) · OPT [3], it is easy to see from Definition 3 that γ_{∅,k} ≥ γ_{SFR,k}. Thus, POSS with the simplest configuration of the isolation function performs at least as well as FR on any sparse regression problem, and achieves the best previous approximation guarantee. We next investigate whether POSS can be strictly better than FR.

4.2 On The Exponential Decay Subclass

Our second result is on a subclass of sparse regression, called Exponential Decay (Definition 4). In this subclass, the observation variables can be ordered on a line such that their covariances decrease exponentially with distance.

Definition 4 (Exponential Decay [2]). The variables Xi are associated with points y1 ≤ y2 ≤ . . . ≤ yn, and C_{i,j} = a^{|yi − yj|} for some constant a ∈ (0, 1).

Since we have shown that POSS with a constant isolation function is generally good, we prove below that POSS with a proper isolation function can be even better: it is strictly better than FR on the Exponential Decay subclass, as POSS finds an optimal solution (Theorem 2) while FR cannot (Proposition 1). The isolation function I(s ∈ {0, 1}^n) = min{i | si = 1} implies that two solutions are comparable only if they have the same minimum index for bit 1.

Theorem 2. For the Exponential Decay subclass of sparse regression, POSS with E[T] ∈ O(k^2 (n − k) n log n) and I(s ∈ {0, 1}^n) = min{i | si = 1} finds an optimal solution.

The proof of Theorem 2 utilizes the dynamic programming property of the problem, as in Lemma 2.

Lemma 2. [2] Let R^2(v, j) denote the maximum R^2_{Z,S} value obtained by choosing v variables from Xj, . . . , Xn, necessarily including Xj.
That is, R2(v, j) = max{R2\n(cid:17)\nS,|S| = v}. Then, the following recursive relation holds:\nwhere the term in(cid:0)(cid:1) is the R2 value by adding Xj into the variable subset corresponding to R2(v, i).\nR2(v + 1, j) = maxj+1\u2264i\u2264n\n\nj + (bj \u2212 bi)2 a2|yi\u2212yj|\n\n1 \u2212 a2|yi\u2212yj| \u2212 2bjbi\n\n1 + a|yi\u2212yj|\n\nR2(v, i) + b2\n\na|yi\u2212yj|\n\n(cid:16)\n\n,\n\n5\n\n\f1\n\n1\n\nsired solutions to \ufb01nd in the i-th phase. Then, E[\u03bei] \u2264(cid:80)n\u2212i\n\nProof of Theorem 2. We divide the optimization process into k + 1 phases, where the i-th (1 \u2264\ni \u2264 k) phase starts after the (i\u22121)-th phase has \ufb01nished. We de\ufb01ne that the i-th phase \ufb01nishes when\nfor each solution corresponding to R2(i, j) (1 \u2264 j \u2264 n \u2212 i + 1), there exists one solution in the\narchive P which is \u201cbetter\u201d than it. Here, a solution s is \u201cbetter\u201d than s(cid:48) is equivalent to that s(cid:48) is\n\u201cworse\u201d than s. Let \u03bei denote the iterations since phase i\u2212 1 has \ufb01nished, until phase i is completed.\nStarting from the solution {0}n, the 0-th phase has \ufb01nished. Then, we consider \u03bei (i \u2265 1). In this\nphase, from Lemma 2, we know that a solution \u201cbetter\u201d than a corresponding solution of R2(i, j) can\nbe generated by selecting a speci\ufb01c one from the solutions \u201cbetter\u201d than R2(i\u22121, j +1), . . . , R2(i\u2212\n1, n) and \ufb02ipping its j-th bit, which happens with probability at least\n.\nenPmax\nThus, if we have found L desired solutions in the i-th phase, the probability of \ufb01nding a new desired\nsolution in the next iteration is at least (n\u2212i+1\u2212L)\u00b7\n, where n\u2212i+1 is the total number of de-\nn\u2212i+1\u2212L \u2208 O(n log nPmax). 
There-\nfore, the expected number of iterations E[T ] is O(kn log nPmax) until the k-th phase \ufb01nishes, which\nimplies that an optimal solution corresponding to max1\u2264j\u2264n R2(k, j) has been found. Note that\nPmax \u2264 2k(n\u2212k), because the incomparable property of the maintained solutions by POSS ensures\nthat there exists at most one solution in P for each possible combination of |s| \u2208 {0, 1, . . . , 2k \u2212 1}\nand I(s) \u2208 {0, 1, . . . , n}. Thus, E[T ] for \ufb01nding an optimal solution is O(k2(n \u2212 k)n log n). (cid:3)\nThen, we analyze FR (i.e., Algorithm 1) for this special class. We show below that FR can be\nblocked from \ufb01nding an optimal solution by giving a simple example.\nExample 1. X1 = Y1, Xi = riXi\u22121 + Yi, where ri \u2208 (0, 1), and Yi are independent random\nvariables with expectation 0 such that each Xi has variance 1.\n\nn )n\u22121 \u2265 1\n\n\u00b7 1\nn (1\u2212 1\n\nenPmax\n\nenPmax\n\nPmax\n\nL=0\n\nFor i < j, Cov(Xi, Xj) = (cid:81)j\nExponential Decay class by letting y1 = 0 and yi =(cid:80)i\n\nk=i+1 rk. Then, it is easy to verify that Example 1 belongs to the\n\nk=2 loga rk for i \u2265 2.\n\nProposition 1. For Example 1 with n = 3, r2 = 0.03, r3 = 0.5, Cov(Y1, Z) = Cov(Y2, Z) = \u03b4\nand Cov(Y3, Z) = 0.505\u03b4, FR cannot \ufb01nd the optimal solution for k = 2.\nProof. The covariances between Xi and Z are b1 = \u03b4, b2 = 0.03b1 + \u03b4 = 1.03\u03b4 and b3 = 0.5b2 +\nZ,S can be simply represented\n0.505\u03b4 = 1.02\u03b4. Since Xi and Z have expectation 0 and variance 1, R2\n= 1.0609\u03b42,\nas bT\nZ,{X2,X3} = 1.4009\u03b42.\nZ,{X1,X3} = 2.0103\u03b42, R2\nR2\nThe optimal solution for k = 2 is {X1, X3}. FR \ufb01rst selects X2 since R2\nis the largest, then\nZ,X2\nselects X1 since R2\n\nS bS [11]. 
We then calculate the R2 value as follows: R2\n= 1.0404\u03b42; R2\n\nZ,{X2,X3}; thus produces a local optimal solution {X1, X2}.\n\nZ,{X1,X2} = 2.0009\u03b42, R2\n\nZ,{X2,X1} > R2\n\nS C\u22121\n\n= \u03b42, R2\n\nZ,X3\n\nZ,X2\n\nZ,X1\n\nIt is also easy to verify that other two previous methods OMP [19] and FoBa [26] cannot \ufb01nd the\noptimal solution for this example, due to their greedy nature.\n\n5 Empirical Study\n\nWe conducted experiments on 12 data sets1 in Table 1 to compare POSS with the following methods:\n\u2022 FR [12] iteratively adds one variable with the largest improvement on R2.\n\u2022 OMP [19] iteratively adds one variable that mostly correlates with the predictor variable residual.\n\u2022 FoBa [26] is based on OMP but deletes one variable adaptively when bene\ufb01cial. Set parameter\n\u03bd = 0.5, the solution path length is \ufb01ve times as long as the maximum sparsity level (i.e., 5 \u00d7 k),\nand the last active set containing k variables is used as the \ufb01nal selection [26].\n\u2022 RFE [10] iteratively deletes one variable with the smallest weight by linear regression.\n\u2022 Lasso [18], SCAD [8] and MCP [24] replaces the (cid:96)0 norm constraint with the (cid:96)1 norm penalty,\nthe smoothly clipped absolute deviation penalty and the mimimax concave penalty, respectively. For\nimplementing these methods, we use the SparseReg toolbox developed in [28, 27].\nFor POSS, we use I(\u00b7) = 0 since it is generally good, and the number of iterations T is set to be\n(cid:98)2ek2n(cid:99) as suggested by Theorem 1. To evaluate how far these methods are from the optimum, we\nalso compute the optimal subset by exhaustive enumeration, denoted as OPT.\n\n1The data sets are from http://archive.ics.uci.edu/ml/ and http://www.csie.ntu.\nedu.tw/\u02dccjlin/libsvmtools/datasets/. 
Some binary classification data are used for regression. All variables are normalized to have mean 0 and variance 1.

Table 1: The data sets.
data set | #inst | #feat
housing | 506 | 13
eunite2001 | 367 | 16
svmguide3 | 1284 | 21
ionosphere | 351 | 34
sonar | 208 | 60
triazines | 186 | 60
coil2000 | 9000 | 86
mushrooms | 8124 | 112
clean1 | 476 | 166
w5a | 9888 | 300
gisette | 7000 | 5000
farm-ads | 4143 | 54877

Table 2: The training R^2 value (mean±std.) of the compared methods on the 12 data sets for k = 8. On each data set, '•/◦' denote respectively that POSS is significantly better/worse than the corresponding method by the t-test [5] with confidence level 0.05. '–' means that no results were obtained after running for several days.

Data set | FR | FoBa | OMP | RFE | MCP | POSS | OPT
housing | .7415±.0300• | .7423±.0301• | .7429±.0300• | .7354±.0297• | .7388±.0304• | .7437±.0297 | .7437±.0297
eunite2001 | .8349±.0150• | .8442±.0144• | .8348±.0143• | .8320±.0150• | .8424±.0153• | .8482±.0132 | .8484±.0132
svmguide3 | .2557±.0270• | .2601±.0279• | .2615±.0260• | .2397±.0237• | .2136±.0325• | .2701±.0257 | .2705±.0255
ionosphere | .5921±.0353• | .5929±.0346• | .5920±.0352• | .5740±.0348• | .5832±.0415• | .5990±.0329 | .5995±.0326
sonar | .5112±.0425• | .5138±.0432• | .5171±.0440• | .4496±.0482• | .4321±.0636• | .5365±.0410 | –
triazines | .4073±.0591• | .4107±.0600• | .4150±.0592• | .3793±.0584• | .3615±.0712• | .4301±.0603 | –
coil2000 | .0619±.0075• | .0619±.0075• | .0624±.0076• | .0570±.0075• | .0363±.0141• | .0627±.0076 | –
mushrooms | .9909±.0022• | .9909±.0022• | .9909±.0021• | .8652±.0474• | .6813±.1294• | .9912±.0020 | –
clean1 | .4132±.0315• | .4145±.0309• | .4169±.0299• | .3563±.0364• | .1596±.0562• | .4368±.0300 | –
w5a | .3313±.0246• | .3341±.0258• | .3319±.0247• | .2694±.0385• | .3342±.0276• | .3376±.0267 | –
gisette | .6731±.0134• | .6747±.0145• | .7001±.0116• | .5709±.0123• | .5360±.0318• | .7265±.0098 | –
farm-ads | .4170±.0113• | .4170±.0113• | .4196±.0101• | .3771±.0110• | – | .4217±.0100 | –
POSS: win/tie/loss | 12/0/0 | 12/0/0 | 12/0/0 | 12/0/0 | 11/0/0 | – | –

To assess each method on each data set, we repeat the following process 100 times: the data set is randomly and evenly split into a training set and a test set; sparse regression is built on the training set and evaluated on the test set. We report the average training and test R^2 values.

5.1 On Optimization Performance

Table 2 lists the training R^2 for k = 8, which reveals the optimization quality of the methods. Note that the results of Lasso, SCAD and MCP are very close, and we report only that of MCP due to the page limit. By the t-test [5] with significance level 0.05, POSS is significantly better than all the compared methods on all data sets.

We plot the performance curves on two data sets for k ≤ 8 in Figure 1. For sonar, OPT is calculated only for k ≤ 5. We can observe that POSS tightly follows OPT, and has a clear advantage over the remaining methods. FR, FoBa and OMP have close performances, and are much better than MCP, SCAD and Lasso.
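For reference, the greedy FR baseline (Algorithm 1) that these curves compare against can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the names `r2` and `forward_regression` are assumptions, and the variables are assumed normalized to mean 0 and variance 1, so that R^2 = 1 − MSE.

```python
import numpy as np

def r2(X, z, subset):
    """Squared multiple correlation R^2_{Z,S}; assumes the columns of X and z
    are normalized to mean 0 and variance 1, so R^2 = 1 - MSE."""
    if not subset:
        return 0.0
    A = X[:, sorted(subset)]
    # least-squares coefficients alpha minimizing ||z - A alpha||^2
    alpha, *_ = np.linalg.lstsq(A, z, rcond=None)
    return 1.0 - float(np.mean((z - A @ alpha) ** 2))

def forward_regression(X, z, k):
    """Greedy FR baseline (Algorithm 1): repeatedly add the variable giving
    the largest R^2 improvement."""
    S = set()
    for _ in range(k):
        best = max((i for i in range(X.shape[1]) if i not in S),
                   key=lambda i: r2(X, z, S | {i}))
        S.add(best)
    return S
```

Each greedy step evaluates at most n candidates, which is where FR's O(kn) evaluation count comes from.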
The bad performance of Lasso is consistent with previous results in [3, 26]. We notice that, although the ℓ1 norm constraint is a tight convex relaxation of the ℓ0 norm constraint and can give good results in sparse recovery tasks, the performance of Lasso is not as good as that of POSS and the greedy methods on most data sets. This is because, unlike what is assumed in sparse recovery tasks, there may not exist a sparse structure in these data sets; in that case, the ℓ1 norm constraint can be a poor approximation of the ℓ0 norm constraint. Meanwhile, the ℓ1 norm constraint also shifts the optimization problem, making it hard to optimize the original R^2 criterion well.

Considering the running time (in the number of objective function evaluations), OPT performs exhaustive search and thus needs binom(n, k) ≥ (n/k)^k time, which can be unacceptable even for a slightly large data set. FR, FoBa and OMP are greedy-like approaches, and thus are efficient: their running times are all on the order of kn. POSS finds the solutions closest to those of OPT, taking 2ek^2 n time. Although POSS is slower by a factor of k, the difference is small when k is a small constant.

Since 2ek^2 n is a theoretical upper bound on the time for POSS to be at least as good as FR, we empirically examine how tight this bound is. Taking FR as the baseline, we plot the R^2 value over running time for POSS on the two largest data sets, gisette and farm-ads, in Figure 2. We do not split into training and test sets here, and the curve for POSS is averaged over 30 independent runs. The x-axis is in units of kn, the running time of FR. We observe that POSS takes only about 14% and 23% of the theoretical time to achieve a better performance, respectively on the two data sets.
The observed 14% and 23% fractions imply that POSS can be more efficient in practice than the theoretical analysis suggests.

[Figure 1: Training R2 (the larger the better). (a) on svmguide3; (b) on sonar.]
[Figure 2: Performance vs. running time of POSS. (a) on gisette; (b) on farm-ads.]
[Figure 3: Test R2 (the larger the better). (a) on svmguide3; (b) on sonar.]
[Figure 4: Sparse regression with ℓ2 regularization on sonar (RSS: the smaller the better). (a) on the training set (RSS); (b) on the test set.]

5.2 On Generalization Performance

When testing sparse regression on the test data, it is known that sparsity alone may not be a good complexity measure [26], since it only restricts the number of variables while leaving their range unrestricted. Thus better optimization does not always lead to better generalization performance. We also observe this in Figure 3: on svmguide3, the test R2 is consistent with the training R2 in Figure 1(a); on sonar, however, better training R2 (as in Figure 1(b)) leads to worse test R2 (as in Figure 3(b)), which may be due to the small number of instances making it prone to overfitting.
As suggested in [26], other regularization terms may be necessary. We add the ℓ2-norm regularization into the objective function, i.e.,

    RSS_{Z,S} = min_{α ∈ R^{|S|}} E[(Z − Σ_{i∈S} α_i X_i)^2] + λ‖α‖_2^2.

The optimization is now arg min_{S⊆V} RSS_{Z,S} s.t. |S| ≤ k. We then test all the compared methods on this optimization problem with λ = 0.9615 on sonar.
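Empirically, the ℓ2-regularized objective above is ridge regression restricted to the selected columns, so the inner minimization over α has a closed form. A minimal sketch (our own illustration, with the expectation replaced by a sample mean):

```python
import numpy as np

def regularized_rss(X, y, subset, lam):
    """Empirical version of the l2-regularized objective
    RSS_{Z,S} = min_a mean((y - X_S a)^2) + lam * ||a||_2^2,
    solved in closed form on the selected columns."""
    A = X[:, subset]
    n, d = A.shape
    # Minimizer of (1/n)||y - A a||^2 + lam ||a||^2 satisfies
    # (A^T A / n + lam I) a = A^T y / n
    alpha = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ y / n)
    resid = y - A @ alpha
    return np.mean(resid ** 2) + lam * np.dot(alpha, alpha)
```

As λ grows, the attainable objective value can only increase, since the penalty raises the whole function being minimized; the subset search over S is then run on this regularized criterion.
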
As plotted in Figure 4, we can observe that POSS still achieves the best optimization of the training RSS, and with the ℓ2 norm introduced, it also leads to the best generalization performance in R2.

6 Conclusion

In this paper, we study the problem of subset selection, which has many applications ranging from machine learning to signal processing. The general goal is to select a subset of size k from a large set of variables such that a given criterion is optimized. We propose the POSS approach, which solves the two objectives of the subset selection problem simultaneously, i.e., optimizing the criterion and reducing the subset size.
On sparse regression, a representative subset selection task, we theoretically prove that a simple POSS (i.e., one using a constant isolation function) can generally achieve the best previous approximation guarantee in 2ek^2n time. Moreover, we prove that, with a proper isolation function, it finds an optimal solution for the important Exponential Decay subclass in O(k^2(n−k)n log n) time, whereas other greedy-like methods may fail to find an optimal solution.
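The bi-objective scheme summarized above can be sketched as follows. This is a minimal illustration of the Pareto-optimization idea, not the paper's exact POSS procedure: the isolation function, termination rule, and output selection are simplified, and `f` stands for any subset cost to be minimized (e.g., the mean squared error).

```python
import random

def poss(f, n, k, iters=5000, seed=0):
    """Sketch of Pareto Optimization for Subset Selection: keep an
    archive of non-dominated (cost, size) solutions over {0,1}^n,
    mutate a random archive member by flipping each bit with
    probability 1/n, and finally return the best archived subset
    of size at most k."""
    rng = random.Random(seed)
    empty = (0,) * n
    archive = {empty: (f(empty), 0)}  # solution -> (cost, size)
    for _ in range(iters):
        s = rng.choice(list(archive))
        child = tuple(b ^ (rng.random() < 1.0 / n) for b in s)
        obj = (f(child), sum(child))
        # discard the child if weakly dominated by an archived solution
        if any(v[0] <= obj[0] and v[1] <= obj[1] for v in archive.values()):
            continue
        # otherwise keep it and drop solutions it weakly dominates
        archive = {t: v for t, v in archive.items()
                   if not (obj[0] <= v[0] and obj[1] <= v[1])}
        archive[child] = obj
    # best archived subset respecting the size constraint
    return min((v[0], t) for t, v in archive.items() if v[1] <= k)[1]
```

The archive plays the role of the Pareto front: for each subset size it retains only the best cost found so far, which is how the size objective keeps solutions sparse while the cost objective is driven down.
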
We verify the superior performance of POSS by experiments, which also show that POSS can be more efficient in practice than its theoretical time bound suggests. We will further study Pareto optimization from the aspects of using potential heuristic operators [14] and utilizing infeasible solutions [23], and try to apply it to more machine learning tasks.

Acknowledgements We want to thank Lijun Zhang and Jianxin Wu for their helpful comments. This research was supported by the 973 Program (2014CB340501) and NSFC (61333014, 61375061).

References

[1] C. Boutsidis, M. W. Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In SODA, pages 968–977, New York, NY, 2009.
[2] A. Das and D. Kempe. Algorithms for subset selection in linear regression. In STOC, pages 45–54, Victoria, Canada, 2008.
[3] A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In ICML, pages 1057–1064, Bellevue, WA, 2011.
[4] G. Davis, S. Mallat, and M. Avellaneda. Adaptive greedy approximations. Constructive Approximation, 13(1):57–98, 1997.
[5] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
[6] G. Diekhoff. Statistics for the Social and Behavioral Sciences: Univariate, Bivariate, Multivariate. William C Brown Pub, 1992.
[7] D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise.
IEEE Transactions on Information Theory, 52(1):6–18, 2006.
[8] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[9] A. C. Gilbert, S. Muthukrishnan, and M. J. Strauss. Approximation of functions over redundant dictionaries using coherence. In SODA, pages 243–252, Baltimore, MD, 2003.
[10] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002.
[11] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Pearson, 6th edition, 2007.
[12] A. Miller. Subset Selection in Regression. Chapman and Hall/CRC, 2nd edition, 2002.
[13] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
[14] C. Qian, Y. Yu, and Z.-H. Zhou. An analysis on recombination in multi-objective evolutionary optimization. Artificial Intelligence, 204:99–119, 2013.
[15] C. Qian, Y. Yu, and Z.-H. Zhou. On constrained Boolean Pareto optimization. In IJCAI, pages 389–395, Buenos Aires, Argentina, 2015.
[16] C. Qian, Y. Yu, and Z.-H. Zhou. Pareto ensemble pruning. In AAAI, pages 2935–2941, Austin, TX, 2015.
[17] M. Tan, I. Tsang, and L. Wang. Matching pursuit LASSO Part I: Sparse recovery over big dictionary. IEEE Transactions on Signal Processing, 63(3):727–741, 2015.
[18] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[19] J. A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.
[20] J. A. Tropp, A. C. Gilbert, S. Muthukrishnan, and M. J. Strauss.
Improved sparse approximation over quasiincoherent dictionaries. In ICIP, pages 37–40, Barcelona, Spain, 2003.
[21] L. Xiao and T. Zhang. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization, 23(2):1062–1091, 2013.
[22] Y. Yu, X. Yao, and Z.-H. Zhou. On the approximation ability of evolutionary optimization with application to minimum set cover. Artificial Intelligence, 180-181:20–33, 2012.
[23] Y. Yu and Z.-H. Zhou. On the usefulness of infeasible solutions in evolutionary search: A theoretical study. In IEEE CEC, pages 835–840, Hong Kong, China, 2008.
[24] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
[25] T. Zhang. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10:555–568, 2009.
[26] T. Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7):4689–4708, 2011.
[27] H. Zhou. Matlab SparseReg Toolbox Version 0.0.1. Available online, 2013.
[28] H. Zhou, A. Armagan, and D. Dunson. Path following and empirical Bayes model selection for sparse regression. arXiv:1201.3528, 2012.
[29] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.