{"title": "Lasso Screening Rules via Dual Polytope Projection", "book": "Advances in Neural Information Processing Systems", "page_first": 1070, "page_last": 1078, "abstract": "Lasso is a widely used regression technique to find sparse representations. When the dimension of the feature space and the number of samples are extremely large, solving the Lasso problem remains challenging. To improve the efficiency of solving large-scale Lasso problems, El Ghaoui and his colleagues have proposed the SAFE rules which are able to quickly identify the inactive predictors, i.e., predictors that have $0$ components in the solution vector. Then, the inactive predictors or features can be removed from the optimization problem to reduce its scale. By transforming the standard Lasso to its dual form, it can be shown that the inactive predictors include the set of inactive constraints on the optimal dual solution. In this paper, we propose an efficient and effective screening rule via Dual Polytope Projections (DPP), which is mainly based on the uniqueness and nonexpansiveness of the optimal dual solution due to the fact that the feasible set in the dual space is a convex and closed polytope. Moreover, we show that our screening rule can be extended to identify inactive groups in group Lasso. To the best of our knowledge, there is currently no \"exact\" screening rule for group Lasso. We have evaluated our screening rule using many real data sets. Results show that our rule is more effective in identifying inactive predictors than existing state-of-the-art screening rules for Lasso.", "full_text": "Lasso Screening Rules via Dual Polytope Projection\n\nJie Wang, Jiayu Zhou, Peter Wonka, Jieping Ye\n\nComputer Science and Engineering\n\nArizona State University, Tempe, AZ 85287\n\n{jie.wang.ustc, jiayu.zhou, peter.wonka, jieping.ye}@asu.edu\n\nAbstract\n\nLasso is a widely used regression technique to find sparse representations. 
When the dimension of the feature space and the number of samples are extremely large, solving the Lasso problem remains challenging. To improve the efficiency of solving large-scale Lasso problems, El Ghaoui and his colleagues have proposed the SAFE rules which are able to quickly identify the inactive predictors, i.e., predictors that have 0 components in the solution vector. Then, the inactive predictors or features can be removed from the optimization problem to reduce its scale. By transforming the standard Lasso to its dual form, it can be shown that the inactive predictors include the set of inactive constraints on the optimal dual solution. In this paper, we propose an efficient and effective screening rule via Dual Polytope Projections (DPP), which is mainly based on the uniqueness and nonexpansiveness of the optimal dual solution due to the fact that the feasible set in the dual space is a convex and closed polytope. Moreover, we show that our screening rule can be extended to identify inactive groups in group Lasso. To the best of our knowledge, there is currently no “exact” screening rule for group Lasso. We have evaluated our screening rule using many real data sets. Results show that our rule is more effective in identifying inactive predictors than existing state-of-the-art screening rules for Lasso.\n\n1 Introduction\n\nData with various structures and scales comes from almost every aspect of daily life. To effectively extract patterns in the data and build interpretable models with high prediction accuracy is always desirable. One popular technique to identify important explanatory features is by sparse regularization. For instance, consider the widely used ℓ1-regularized least squares regression problem known as Lasso [20]. The most appealing property of Lasso is the sparsity of the solutions, which is equivalent to feature selection. Suppose we have N observations and p predictors. 
Let y denote the N-dimensional response vector and X = [x_1, x_2, ..., x_p] be the N × p feature matrix. Let λ ≥ 0 be the regularization parameter. The Lasso problem is formulated as the following optimization problem:\n\n    inf_{β ∈ ℜ^p} (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1.    (1)\n\nLasso has achieved great success in a wide range of applications [5, 4, 28, 3, 23] and in recent years many algorithms have been developed to efficiently solve the Lasso problem [7, 12, 18, 6, 10, 1, 11]. However, when the dimension of feature space and the number of samples are very large, solving the Lasso problem remains challenging because we may not even be able to load the data matrix into main memory. The idea of a screening test proposed by El Ghaoui et al. [8] is to first identify inactive predictors that have 0 components in the solution and then remove them from the optimization. Therefore, we can work on a reduced feature matrix to solve Lasso efficiently.\n\nIn [8], the “SAFE” rule discards x_i when\n\n    |x_i^T y| < λ − ‖x_i‖_2 ‖y‖_2 (λ_max − λ)/λ_max,    (2)\n\nwhere λ_max = max_i |x_i^T y| is the largest parameter such that the solution is nontrivial. Tibshirani et al. [21] proposed a set of strong rules which were more effective in identifying inactive predictors. The basic version discards x_i if |x_i^T y| < 2λ − λ_max. However, it should be noted that the proposed strong rules might mistakenly discard active predictors, i.e., predictors which have nonzero coefficients in the solution vector. Xiang et al. 
[26, 25] developed a set of screening tests based on the estimation of the optimal dual solution and they have shown that the SAFE rules are in fact a special case of the general sphere test.\n\nIn this paper, we develop new efficient and effective screening rules for the Lasso problem; our screening rules are exact in the sense that no active predictors will be discarded. By transforming problem (1) to its dual form, our motivation is mainly based on three geometric observations in the dual space. First, the active predictors belong to a subset of the active constraints on the optimal dual solution, which is a direct consequence of the KKT conditions. Second, the optimal dual solution is in fact the projection of the scaled response vector onto the feasible set of the dual variables. Third, because the feasible set of the dual variables is closed and convex, the projection is nonexpansive with respect to λ [2], which results in an effective estimation of its variation. Moreover, based on the basic DPP rules, we propose the “Enhanced DPP” rules which are able to detect more inactive features than DPP. We evaluate our screening rules on real data sets from many different applications. The experimental results demonstrate that our rules are more effective in discarding inactive features than existing state-of-the-art screening rules.\n\n2 Screening Rules for Lasso via Dual Polytope Projections\n\nIn this section, we present the basics of the dual formulation of problem (1) including its geometric properties (Section 2.1). Based on the geometric properties of the dual optimal solution, we develop the fundamental principle in Section 2.2 (Theorem 2), which can be used to construct screening rules for Lasso. In Section 2.3, we discuss the relation between the dual optimal solution and LARS [7]. 
As a straightforward extension of DPP rules, we develop the sequential version of DPP (SDPP) in Section 2.4. Moreover, we present enhanced DPP rules in Section 2.5.\n\n2.1 Basics\n\nDifferent from [26, 25], we do not assume y and all x_i have unit length. We first transform problem (1) to its dual form (to make the paper self-contained, we provide the detailed derivation of the dual form in the supplemental materials):\n\n    sup_θ { (1/2)‖y‖_2^2 − (λ^2/2)‖θ − y/λ‖_2^2 : |x_i^T θ| ≤ 1, i = 1, 2, ..., p },    (3)\n\nwhere θ is the dual variable. Since the feasible set, denoted by F, is the intersection of 2p half-spaces, it is a closed and convex polytope. From the objective function of the dual problem (3), it is easy to see that the optimal dual solution θ* is a feasible θ which is closest to y/λ. In other words, θ* is the projection of y/λ onto the polytope F. Mathematically, for an arbitrary vector w and a convex set C, if we define the projection function as\n\n    P_C(w) = argmin_{u ∈ C} ‖u − w‖_2,    (4)\n\nthen\n\n    θ* = P_F(y/λ) = argmin_{θ ∈ F} ‖θ − y/λ‖_2.    (5)\n\nWe know that the optimal primal and dual solutions satisfy:\n\n    y = Xβ* + λθ*,    (6)\n\nand the KKT conditions for the Lasso problem (1) are\n\n    (θ*)^T x_i ∈ { sign([β*]_i) if [β*]_i ≠ 0;  [−1, 1] if [β*]_i = 0 },    (7)\n\nwhere [·]_k denotes the kth component.\n\nBy the KKT conditions in Eq. (7), if the inner product (θ*)^T x_i belongs to the open interval (−1, 1), then the corresponding component [β*]_i in the solution vector β*(λ) has to be 0. 
As a result, x_i is an inactive predictor and can be removed from the optimization.\n\nOn the other hand, let ∂H(x_i) = {z: z^T x_i = 1} and H(x_i)^− = {z: z^T x_i ≤ 1} be the hyperplane and half space determined by x_i respectively. Consider the dual problem (3); constraints induced by each x_i are equivalent to requiring each feasible θ to lie inside the intersection of H(x_i)^− and H(−x_i)^−. If |(θ*)^T x_i| = 1, i.e., either θ* ∈ ∂H(x_i)^− or θ* ∈ ∂H(−x_i)^−, we say the constraints induced by x_i are active on θ*. We define the “active” set on θ* as I_{θ*} = {i: |(θ*)^T x_i| = 1, i ∈ I} where I = {1, 2, ..., p}. Otherwise, if θ* lies between ∂H(x_i) and ∂H(−x_i), i.e., |(θ*)^T x_i| < 1, we can safely remove x_i from the problem because [β*]_i = 0 according to the KKT conditions in Eq. (7). Similarly, the “inactive” set on θ* is defined as Ī_{θ*} = I \\ I_{θ*}. Therefore, from a geometric perspective, if we know θ*, i.e., the projection of y/λ onto F, the predictors in the inactive set on θ* can be discarded from the optimization. It is worthwhile to mention that inactive predictors, i.e., predictors that have 0 components in the solution, are not the same as predictors in the inactive set. In fact, by the KKT conditions, predictors in the inactive set must be inactive predictors since they are guaranteed to have 0 components in the solution, but the converse may not be true.\n\n2.2 Fundamental Screening Rules via Dual Polytope Projections\n\nMotivated by the above geometric intuitions, we next show how to find the predictors in the inactive set on θ*. To emphasize the dependence on λ, let us write θ*(λ) and β*(λ). 
If we know exactly where θ*(λ) is, it will be trivial to find the predictors in the inactive set. Unfortunately, in most of the cases, we only have incomplete information about θ*(λ) without actually solving problem (1) or (3). Suppose we know the exact θ*(λ′) for a specific λ′. How can we estimate θ*(λ″) for another λ″ and its inactive set? To answer this question, we start from Eq. (5); θ*(λ) is nonexpansive because it is a projection operator. For convenience, we cite the projection theorem in [2] as follows.\n\nTheorem 1. Let C be a convex set, then the projection function defined in Eq. (4) is continuous and nonexpansive, i.e.,\n\n    ‖P_C(w_2) − P_C(w_1)‖_2 ≤ ‖w_2 − w_1‖_2, ∀w_2, w_1.    (8)\n\nGiven θ*(λ′), the next theorem shows how to estimate θ*(λ″) and its inactive set for another parameter λ″.\n\nTheorem 2. For the Lasso problem, assume we are given the solution of its dual problem θ*(λ′) for a specific λ′. Let λ″ be a nonnegative value different from λ′. Then [β*(λ″)]_i = 0 if\n\n    |x_i^T θ*(λ′)| < 1 − ‖x_i‖_2 ‖y‖_2 |1/λ′ − 1/λ″|.    (9)\n\nProof. From the KKT conditions in Eq. (7), we know |x_i^T θ*(λ″)| < 1 ⇒ [β*(λ″)]_i = 0. By the dual problem (3), θ*(λ) is the projection of y/λ onto the feasible set F. 
According to the projection theorem [2], that is, Theorem 1, for closed convex sets, θ*(λ) is continuous and nonexpansive, i.e.,\n\n    ‖θ*(λ″) − θ*(λ′)‖_2 ≤ ‖y/λ″ − y/λ′‖_2 = ‖y‖_2 |1/λ″ − 1/λ′|.    (10)\n\nThen\n\n    |x_i^T θ*(λ″)| ≤ |x_i^T θ*(λ″) − x_i^T θ*(λ′)| + |x_i^T θ*(λ′)|\n        < ‖x_i‖_2 ‖θ*(λ″) − θ*(λ′)‖_2 + 1 − ‖x_i‖_2 ‖y‖_2 |1/λ″ − 1/λ′|\n        ≤ ‖x_i‖_2 ‖y‖_2 |1/λ″ − 1/λ′| + 1 − ‖x_i‖_2 ‖y‖_2 |1/λ″ − 1/λ′| = 1,    (11)\n\nwhich completes the proof.\n\nFrom Theorem 2, it is easy to see our rule is quite flexible since every θ*(λ′) would result in a new screening rule. And the smaller the gap between λ′ and λ″, the more effective the screening rule is. By “more effective”, we mean a stronger capability of identifying inactive predictors.\n\nAs an example, let us find out θ*(λ_max). Recall that λ_max = max_i |x_i^T y|. It is easy to verify that y/λ_max is itself feasible. Therefore the projection of y/λ_max onto F is itself, i.e., θ*(λ_max) = y/λ_max. Moreover, by noting that for ∀λ > λ_max, we have |x_i^T y/λ| < 1, i ∈ I, i.e., all predictors are in the inactive set at θ*(λ), we conclude that the solution to problem (1) is 0. Combining all these together and plugging θ*(λ_max) = y/λ_max into Eq. (9), we obtain the following screening rule.\n\nCorollary 3. DPP: For the Lasso problem (1), let λ_max = max_i |x_i^T y|. If λ ≥ λ_max, then [β*]_i = 0, ∀i ∈ I. Otherwise, [β*(λ)]_i = 0 if\n\n    |x_i^T y/λ_max| < 1 − ‖x_i‖_2 ‖y‖_2 (1/λ − 1/λ_max).\n\nClearly, DPP is most effective when λ is close to λ_max. So how can we find a new θ*(λ′) with λ′ < λ_max? Note that Eq. (6) is in fact a natural bridge which relates the primal and dual optimal solutions. As long as we know β*(λ′), it is easy to get θ*(λ′) when λ is relatively small, e.g., LARS [7] and Homotopy [17] algorithms.\n\nTable 1: Illustration of the running time for DPP screening and for solving the Lasso problem after screening. Ts: time for screening. Tl: time for solving the Lasso problem after screening. To: the total time. Entries of the response vector y are i.i.d. standard Gaussian. Columns of the data matrix X ∈ ℜ^{1000×100000} are generated by x_i = y + αz where α is a random number drawn uniformly from [0, 1]. Entries of z are i.i.d. standard Gaussian. 
λ_max = 0.95 and λ/λ_max = 0.5.\n\n            LASSO     DPP      DPP2     DPP5     DPP10    DPP20\n    Ts (s)  -         0.035    0.073    0.152    0.321    0.648\n    Tl (s)  -         10.250   9.634    8.399    1.369    0.121\n    To (s)  103.314   10.285   9.707    8.552    1.690    0.769\n\nRemark: Xiang et al. [26] developed a general sphere test which says that if θ* is estimated to be inside a ball ‖θ* − q‖_2 ≤ r, then |x_i^T q| < (1 − r) ⇒ [β*]_i = 0. Considering the DPP rules in Theorem 2, it is equivalent to setting q = θ*(λ′) and r = |1/λ′ − 1/λ″|. Therefore, different from the sphere test and Dome developed in [26, 25] with the radius r fixed at the beginning, the construction of our DPP rules is equivalent to an “r” decreasing process. Clearly, the smaller r is, the more effective the DPP rules will be.\n\nRemark: Notice that DPP is not the same as ST1 [26] and SAFE [8], which discard the ith feature if |x_i^T y| < λ − ‖x_i‖_2 ‖y‖_2 (λ_max − λ)/λ_max. From the perspective of the sphere test, the radii of ST1/SAFE and DPP are the same. But the centers of ST1 and DPP are y/λ and y/λ_max respectively, which leads to different formulas, i.e., Eq. (2) and Corollary 3.\n\n2.3 DPP Rules with LARS/Homotopy Algorithms\n\nIt is well known that under mild conditions, the set {β*(λ) : λ > 0} (also known as the regularization path [15]) is continuous piecewise linear [17, 7, 15]. The output of LARS or Homotopy algorithms is in fact a sequence of values like (β*(λ^(0)), λ^(0)), (β*(λ^(1)), λ^(1)), ..., where β*(λ^(i)) corresponds to the ith breakpoint of the regularization path {β*(λ) : λ > 0} and the λ^(i) are monotonically decreasing. By Eq. 
(6), once we get β*(λ^(i)), we can immediately compute θ*(λ^(i)). Then according to Theorem 2, we can construct a DPP rule based on θ*(λ^(i)) and λ^(i). For convenience, if the DPP rule is built based on θ*(λ^(i)), we add the index i as suffix to DPP, e.g., DPP5 means it is developed based on θ*(λ^(5)). It should be noted that LARS or Homotopy algorithms are very efficient at finding the first few breakpoints of the regularization path and the corresponding parameters. For the first few breakpoints, the computational cost is roughly O(Np), i.e., linear in the size of the data matrix X. In Table 1, we report both the time used for screening and the time needed to solve the Lasso problem after screening. The Lasso solver is from the SLEP [14] package.\n\nFrom Table 1, we can see that compared with the time saved by the screening rules, the time used for screening is negligible. DPP20 speeds up the Lasso solver by a factor of more than 130 (103.314 s versus 0.769 s in total). In practice, DPP rules built on the first few θ*(λ^(i))’s lead to more significant performance improvement than existing state-of-the-art screening tests. We will demonstrate the effectiveness of our DPP rules in the experiment section. As another useful property of LARS/Homotopy algorithms, it is worthwhile to mention that changes of the active set only happen at the breakpoints [17, 7, 15]. Consequently, given the parameters corresponding to a pair of adjacent breakpoints, e.g., λ^(i) and λ^(i+1), the active set for λ ∈ (λ^(i+1), λ^(i)) is the same as that for λ = λ^(i). Therefore, besides the sequence of breakpoints and the associated parameters (β*(λ^(0)), λ^(0)), ..., (β*(λ^(k)), λ^(k)) computed by LARS/Homotopy algorithms, we know the active set for ∀λ ≥ λ^(k). 
Hence we can remove the predictors in the inactive set from the optimization problem (1). This scheme has been embedded in DPP rules.\n\nRemark: Some works, e.g., [21], [8], solve several Lasso problems for different parameters to improve the screening performance. However, the DPP algorithms do not aim to solve a sequence of Lasso problems, but just to accelerate one. The LARS/Homotopy algorithms are used to find the first few breakpoints of the regularization path and the corresponding parameters, instead of solving general Lasso problems. Thus, different from [21], [8], which need to iteratively compute a screening step and a Lasso step, DPP algorithms only compute one screening step and one Lasso step.\n\n2.4 Sequential Version of DPP Rules\n\nMotivated by the ideas of [21] and [8], we can develop a sequential version of DPP rules. In other words, if we are given a sequence of parameter values λ_1 > λ_2 > ... > λ_m, we can first apply DPP to discard inactive predictors for the Lasso problem (1) with parameter being λ_1. After solving the reduced optimization problem for λ_1, we obtain the exact solution β*(λ_1). Hence by Eq. (6), we can find θ*(λ_1). According to Theorem 2, once we know the optimal dual solution θ*(λ_1), we can construct a new screening rule to identify inactive predictors for problem (1) with λ = λ_2. By repeating the above process, we obtain the sequential version of the DPP rule (SDPP).\n\nCorollary 4. SDPP: For the Lasso problem (1), suppose we are given a sequence of parameter values λ_max = λ_0 > λ_1 > ... > λ_m. 
Then for any integer 0 ≤ k < m, we have [β*(λ_{k+1})]_i = 0 if β*(λ_k) is known and the following holds:\n\n    |x_i^T (y − Xβ*(λ_k))/λ_k| < 1 − ‖x_i‖_2 ‖y‖_2 (1/λ_{k+1} − 1/λ_k).\n\nRemark: There are some other related works on screening rules, e.g., Wu et al. [24] built screening rules for ℓ1 penalized logistic regression based on the inner products between the response vector and each predictor; Tibshirani et al. [21] developed strong rules for a set of Lasso-type problems via the inner products between the residual and predictors; in [9], Fan and Lv studied screening rules for Lasso and related problems. But all of the above works may mistakenly discard predictors that have non-zero coefficients in the solution. Similar to [8, 26, 25], our DPP rules are exact in the sense that the predictors discarded by our rules are inactive predictors, i.e., predictors that have zero coefficients in the solution.\n\n2.5 Enhanced DPP Rules\n\nIn this section, we show how to further improve the DPP rules. From the inequality in (9), we can see that the larger the right hand side is, the more inactive features can be detected. From the proof of Theorem 2, we need to make the right hand side of the inequality in (10) as small as possible. By noting that θ*(λ′) = P_F(y/λ′) and θ*(λ″) = P_F(y/λ″) [please refer to Eq. (5)], the inequality in (10) is in fact a direct consequence of Theorem 1 by letting C := F, w_1 := y/λ′ and w_2 := y/λ″.\n\nOn the other hand, suppose y/λ′ ∉ F, i.e., λ′ ∈ (0, λ_max). It is clear that y/λ′ ≠ P_F(y/λ′) = θ*(λ′). Let θ(t) = θ*(λ′) + t(y/λ′ − θ*(λ′)) for t ≥ 0, i.e., θ(t) is a point lying on the ray starting from θ*(λ′) and pointing to the same direction as y/λ′ − θ*(λ′). We can observe that P_F(θ(t)) = θ*(λ′), i.e., the projection of θ(t) onto the set F is θ*(λ′) as well (please refer to Lemma A in the supplement for details). By applying Theorem 1 again, we have\n\n    ‖θ*(λ″) − θ*(λ′)‖_2 = ‖P_F(y/λ″) − P_F(θ(t))‖_2 ≤ ‖y/λ″ − θ(t)‖_2 = ‖t(y/λ′ − θ*(λ′)) − (y/λ″ − θ*(λ′))‖_2.    (12)\n\nClearly, when t = 1, the inequality in (12) reduces to the one in (10). Because the inequality in (12) holds for all t ≥ 0, we may get a tighter bound by\n\n    ‖θ*(λ″) − θ*(λ′)‖_2 ≤ min_{t ≥ 0} ‖t v_1 − v_2‖_2,    (13)\n\nwhere v_1 = y/λ′ − θ*(λ′) and v_2 = y/λ″ − θ*(λ′). When λ′ = λ_max, we can set v_1 = sign(x_*^T y) x_* where x_* := argmax_{x_i} |x_i^T y| (please refer to Lemma B in the supplement for details). The minimization problem on the right hand side of the inequality (13) can be easily solved as follows:\n\n    min_{t ≥ 0} ‖t v_1 − v_2‖_2 = ϕ(λ′, λ″) = { ‖v_2‖_2, if ⟨v_1, v_2⟩ < 0;  ‖v_2 − (⟨v_1, v_2⟩/‖v_1‖_2^2) v_1‖_2, otherwise }.    (14)\n\nSimilar to Theorem 2, we have the following result:\n\nTheorem 5. For the Lasso problem, assume we are given the solution of its dual problem θ*(λ′) for a specific λ′. Let λ″ be a nonnegative value different from λ′. Then [β*(λ″)]_i = 0 if\n\n    |x_i^T θ*(λ′)| < 1 − ‖x_i‖_2 ϕ(λ′, λ″).    (15)\n\nAs we explained above, the right hand side of the inequality (15) is no less than that of the inequality (9). Thus, the enhanced DPP is able to detect more inactive features than DPP. The analogues of Corollaries 3 and 4 can be easily derived as well.\n\nCorollary 6. DPP*: For the Lasso problem (1), let λ_max = max_i |x_i^T y|. If λ ≥ λ_max, then [β*]_i = 0, ∀i ∈ I. Otherwise, [β*(λ)]_i = 0 if the following holds:\n\n    |x_i^T y/λ_max| < 1 − ‖x_i‖_2 ϕ(λ_max, λ).\n\nCorollary 7. SDPP*: For the Lasso problem (1), suppose we are given a sequence of parameter values λ_max = λ_0 > λ_1 > ... > λ_m. 
Then for any integer 0 ≤ k < m, we have [β*(λ_{k+1})]_i = 0 if β*(λ_k) is known and the following holds:\n\n    |x_i^T (y − Xβ*(λ_k))/λ_k| < 1 − ‖x_i‖_2 ϕ(λ_k, λ_{k+1}).\n\nTo simplify notations, we denote the enhanced DPP and SDPP by DPP* and SDPP* respectively.\n\n3 Extensions to Group Lasso\n\nTo demonstrate the flexibility of DPP rules, we extend our idea to the group Lasso problem [27]:\n\n    inf_{β ∈ ℜ^p} (1/2)‖y − Σ_{g=1}^G X_g β_g‖_2^2 + λ Σ_{g=1}^G √n_g ‖β_g‖_2,    (16)\n\nwhere X_g ∈ ℜ^{N×n_g} is the data matrix for the gth group and p = Σ_{g=1}^G n_g. The corresponding dual problem of (16) is (see detailed derivation in the supplemental materials):\n\n    sup_θ { (1/2)‖y‖_2^2 − (λ^2/2)‖θ − y/λ‖_2^2 : ‖X_g^T θ‖_2 ≤ √n_g, g = 1, 2, ..., G }.    (17)\n\nSimilar to the Lasso problem, the primal and dual optimal solutions of the group Lasso satisfy:\n\n    y = Σ_{g=1}^G X_g β*_g + λθ*,    (18)\n\nand the KKT conditions are:\n\n    (θ*)^T X_g ∈ { √n_g β*_g/‖β*_g‖_2 if β*_g ≠ 0;  √n_g u with ‖u‖_2 ≤ 1 if β*_g = 0 },    (19)\n\nfor g = 1, 2, ..., G. Clearly, if ‖(θ*)^T X_g‖_2 < √n_g, we can conclude that β*_g = 0.\n\nConsider problem (17). It is easy to see that the dual optimal θ* is the projection of y/λ onto the feasible set. For each g, the constraint ‖X_g^T θ‖_2 ≤ √n_g confines θ to an ellipsoid which is closed and convex. Therefore, the feasible set of the dual problem (17) is the intersection of ellipsoids and thus closed and convex. Hence θ*(λ) is also nonexpansive for the group Lasso problem. Similar to Theorem 2, we can readily develop the following theorem for group Lasso.\n\nTheorem 8. For the group Lasso problem, assume we are given the solution of its dual problem θ*(λ′) for a specific λ′. Let λ″ be a nonnegative value different from λ′. Then β*_g(λ″) = 0 if\n\n    ‖X_g^T θ*(λ′)‖_2 < √n_g − ‖X_g‖_F ‖y‖_2 |1/λ′ − 1/λ″|.    (20)\n\nSimilar to the Lasso problem, let λ_max = max_g ‖X_g^T y‖_2/√n_g, we can see that y/λ_max is itself feasible, and λ_max is the largest parameter such that problem (16) has a nonzero solution. Clearly, θ*(λ_max) = y/λ_max. Similar to DPP and SDPP, we can construct GDPP and SGDPP for group Lasso.\n\nCorollary 9. GDPP: For the group Lasso problem (16), let λ_max = max_g ‖X_g^T y‖_2/√n_g. If λ ≥ λ_max, β*_g(λ) = 0, ∀g = 1, 2, ..., G. Otherwise, we have β*_g(λ) = 0 if the following holds:\n\n    ‖X_g^T y/λ_max‖_2 < √n_g − ‖X_g‖_F ‖y‖_2 (1/λ − 1/λ_max).    (21)\n\nCorollary 10. SGDPP: For the group Lasso problem (16), suppose we are given a sequence of parameter values λ_max = λ_0 > λ_1 > ... > λ_m. For any integer 0 ≤ k < m, we have β*_g(λ_{k+1}) = 0 if β*(λ_k) is known and the following holds:\n\n    ‖X_g^T (y − Σ_{g=1}^G X_g β*_g(λ_k))/λ_k‖_2 < √n_g − ‖X_g‖_F ‖y‖_2 (1/λ_{k+1} − 1/λ_k).    (22)\n\nRemark: Similar to DPP*, we can develop the enhanced GDPP by simply replacing the term ‖y‖_2(1/λ − 1/λ_max) on the right hand side of the inequality (21) with ϕ(λ_max, λ). Notice that, to compute ϕ(λ_max, λ), we set v_1 = X_*(X_*)^T y where X_* = argmax_{X_g} ‖X_g^T y‖_2/√n_g (please refer to Lemma C in the supplement for details). The analogs of SDPP*, that is, SGDPP*, can be obtained by replacing the term ‖y‖_2(1/λ_{k+1} − 1/λ_k) on the right hand side of the inequality (22) with ϕ(λ_k, λ_{k+1}).\n\n4 Experiments\n\nIn Section 4.1, we first evaluate the DPP and DPP* rules on both real and synthetic data. We then compare the performance of DPP with Dome (see [25, 26]) which achieves state-of-the-art performance for the Lasso problem among exact screening rules [25]. We evaluate GDPP and SGDPP for the group Lasso problem on three synthetic data sets in Section 4.2. We are not aware of any “exact” screening rules for the group Lasso problem at this point.\n\nFigure 1: Comparison of DPP and DPP* rules on the MNIST and COIL data sets. Panels: (a) MNIST, DPP2/DPP*2; (b) MNIST, DPP5/DPP*5; (c) COIL, DPP2/DPP*2; (d) COIL, DPP5/DPP*5.\n\nTo measure the performance of our screening rules, we compute the rejection rate, i.e., the ratio between the number of predictors discarded by screening rules and the actual number of zero predictors in the ground truth. 
Because the DPP rules are exact, i.e., no active predictors will be mistakenly discarded, the rejection rate is at most one. For SAFE and Dome, it is not straightforward to extend them to the group Lasso problem. Similar to previous work [26], we do not report the computational time saved by screening because it can be easily computed from the rejection rate. Specifically, if the cost of the Lasso solver is linear in the size of the data matrix $X$, a K% rejection of the data can save K% of the computational time.\n\nThe general experiment settings are as follows. For each data set, after we construct the data matrix $X$ and the response $y$, we run the screening rules along a sequence of 100 values equally spaced on the $\\lambda/\\lambda_{\\max}$ scale from 0 to 1. We repeat the procedure 100 times and report the average performance at each of the 100 values of $\\lambda/\\lambda_{\\max}$. All of the screening rules are implemented in Matlab. The experiments are carried out on an Intel(R) (i7-2600) 3.4 GHz processor.\n\n4.1 DPPs and DPP\u2217s for the Lasso Problem\n\nIn this experiment, we first compare the performance of the proposed DPP rules with the enhanced DPP rules (DPP\u2217) on (a) the MNIST handwritten digit data set [13]; (b) the COIL rotational image data set [16] in Section 4.1.1. We show that the DPP\u2217 rules are more effective in identifying inactive features than the DPP rules. This corroborates our theoretical results in Section 2.5. Then we evaluate the DPP\u2217/SDPP\u2217 rules and Dome on (c) the ADNI data set; (d) the Olivetti Faces data set [19]; (e) Yahoo web pages data sets [22] and (f) a synthetic data set whose entries are drawn i.i.d. from a standard Gaussian.\n\n4.1.1 Comparison of DPP and DPP\u2217\n\nAs we explain in Section 2.5, all inactive features detected by the DPP rules can also be detected by the DPP\u2217 rules. The converse, however, is not necessarily true. 
To demonstrate the advantage of the DPP\u2217 rules, we run DPP2, DPP\u22172, DPP5 and DPP\u22175 on the MNIST and COIL data sets. a) The MNIST data set contains grey images of scanned handwritten digits, including 60,000 for training and 10,000 for testing. The dimension of each image is $28 \\times 28$. Each time, we first randomly select 100 images for each digit (1000 images in total) to form a data matrix $X \\in \\mathbb{R}^{784 \\times 1000}$. Then we randomly pick an image as the response $y \\in \\mathbb{R}^{784}$. b) The COIL data set includes 100 objects, each of which has 72 color images with $128 \\times 128$ pixels. The images that belong to the same object are taken every 5 degrees by rotating the object. We use the images of object 10. Each time, we randomly pick one of the images as the response vector $y \\in \\mathbb{R}^{49152}$ and use all the remaining ones to construct the data matrix $X \\in \\mathbb{R}^{49152 \\times 71}$. The average $\\lambda_{\\max}$ values for the MNIST and COIL data sets constructed in this way are 0.837 and 0.986, respectively. Clearly, the predictors in the data sets are highly correlated.\n\nFrom Figure 1, we observe that DPP\u22172 significantly outperforms DPP2 for both data sets, especially when $\\lambda/\\lambda_{\\max}$ is small. We also observe the same pattern for DPP5 and DPP\u22175, verifying the claims about DPP\u2217 made in the paper. Thus, in the following experiments, we only report the performance of DPP\u2217 and the competing algorithm Dome.\n\n4.1.2 Comparison of DPP\u2217/SDPP\u2217 and Dome\n\nIn this experiment, we compare DPP\u2217/SDPP\u2217 rules with Dome. We only report the performance of DPP\u22175 and DPP\u221710 among the family of DPP\u2217 rules on the following four data sets.\n\nc) The Alzheimer\u2019s disease neuroimaging initiative (ADNI; available at www.loni.ucla.edu/ADNI) studies the disease progression of Alzheimer\u2019s. The ADNI data set includes 434 patients with 306 features extracted from their baseline MRI scans. 
Each time we randomly select 90% of the samples to construct the data matrix $X \\in \\mathbb{R}^{391 \\times 306}$. The response $y$ is the patients\u2019 MMSE cognitive scores [29]. d) The Olivetti faces data set includes 400 grey scale face images of size $64 \\times 64$ for 40 people (10 for each). Each time, we randomly take one of the images as the response vector $y \\in \\mathbb{R}^{4096}$ and the data matrix $X \\in \\mathbb{R}^{4096 \\times 399}$ is constructed from the remaining ones. e) The Yahoo data sets include 11 top-level categories such as Computers, Education, Health, Recreation, and Science. Each category is further divided into a set of subcategories. Each time, we construct a balanced binary classification data set from the topic of Computers. We choose samples from one subcategory as the positive class and randomly sample an equal number of samples from the rest of the subcategories as the negative class. The size of the data matrix is $876 \\times 25259$ and the response vector is the binary label of the samples. f) For the synthetic data set $X \\in \\mathbb{R}^{100 \\times 5000}$ and the response vector $y \\in \\mathbb{R}^{100}$, all of the entries are drawn i.i.d. from a standard Gaussian.\n\nFigure 2: Comparison of DPP\u2217/SDPP\u2217 rules and Dome on three real data sets (Yahoo-Computers, ADNI, and Olivetti faces) and one synthetic data set. Panels: (a) ADNI; (b) Olivetti; (c) Yahoo-Computers; (d) Synthetic.\n\nFigure 3: Performance of GDPP and SGDPP applied to three synthetic data sets. Panels: (a) 20 groups; (b) 50 groups; (c) 100 groups.\n\nThe average $\\lambda_{\\max}$ values of the above four data sets are 0.7273, 0.989, 0.914, and 0.371, respectively. The predictors in the ADNI, Yahoo-Computers and Olivetti data sets are highly correlated, as indicated by the average $\\lambda_{\\max}$. In contrast with the real data sets, the average $\\lambda_{\\max}$ of the synthetic data is small. As noted in [26, 25], Dome is very effective in discarding inactive features when $\\lambda_{\\max}$ is large. 
From Fig. 2, we observe that Dome performs much better on the real data sets compared to the synthetic data. However, the proposed rules are able to identify far more inactive features than Dome on both real and synthetic data, even for the cases in which $\\lambda_{\\max}$ is small.\n\n4.2 GDPPs for the Group Lasso Problem\n\nWe apply GDPPs to three synthetic data sets. The entries of the data matrix $X \\in \\mathbb{R}^{100 \\times 1000}$ and the response vector $y$ are generated i.i.d. from the standard Gaussian distribution. For the three cases, we randomly divide $X$ into 20, 50, and 100 groups, respectively. We compare the performance of GDPP and SGDPP along a sequence of 100 parameter values equally spaced on the $\\lambda/\\lambda_{\\max}$ scale. We repeat the above procedure 100 times for each of the cases and report the average performance. The average $\\lambda_{\\max}$ values are 0.136, 0.167, and 0.219, respectively. As shown in Fig. 3, SGDPP significantly outperforms GDPP, as expected, since GDPP only makes use of the information of the dual optimal solution at a single point. For more discussions, please refer to the supplement.\n\n5 Conclusion\n\nIn this paper, we develop new screening rules for the Lasso problem by making use of the nonexpansiveness of the projection operator with respect to a closed convex set. Our new methods, i.e., DPP rules, are able to effectively identify inactive predictors of the Lasso problem, thus greatly reducing the size of the optimization problem. Moreover, we further improve DPP rules and propose the enhanced DPP rules, that is, the DPP\u2217 rules, which are even more effective in discarding inactive predictors than DPP rules. The idea of DPP and DPP\u2217 rules can be easily generalized to screen the inactive groups of the group Lasso problem. Extensive experiments on both synthetic and real data demonstrate the effectiveness of the proposed rules. Moreover, DPP and DPP\u2217 rules can be combined with any Lasso solver as a speedup tool. 
In the future, we plan to generalize our idea to other sparse formulations consisting of more general structured sparse penalties, e.g., tree/graph Lasso.\n\nAcknowledgments\n\nThis work was supported in part by NIH (LM010730) and NSF (IIS-0953662, CCF-1025177).\n\nReferences\n\n[1] S. R. Becker, E. Cand\u00e8s, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. Technical report, Stanford University, 2010.\n[2] D. P. Bertsekas. Convex Analysis and Optimization. Athena Scientific, 2003.\n[3] A. Bruckstein, D. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51:34\u201381, 2009.\n[4] E. Cand\u00e8s. Compressive sampling. In Proceedings of the International Congress of Mathematics, 2006.\n[5] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43:129\u2013159, 2001.\n[6] D. L. Donoho and Y. Tsaig. Fast solution of l-1 norm minimization problems when the solution may be sparse. IEEE Transactions on Information Theory, 54:4789\u20134812, 2008.\n[7] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407\u2013499, 2004.\n[8] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization, 8:667\u2013698, 2012.\n[9] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature spaces. Journal of the Royal Statistical Society Series B, 70:849\u2013911, 2008.\n[10] J. Friedman, T. Hastie, H. H\u00f6fling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1:302\u2013332, 2007.\n[11] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33:1\u201322, 2010.\n[12] S. J. Kim, K. Koh, M. Lustig, S. 
Boyd, and D. Gorinevsky. An interior-point method for large scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1:606\u2013617, 2007.\n[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.\n[14] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.\n[15] J. Mairal and B. Yu. Complexity analysis of the lasso regularization path. In ICML, 2012.\n[16] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (coil). Technical report, No. CUCS-006-96, Dept. Comp. Science, Columbia University, 1996.\n[17] M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20:389\u2013404, 2000.\n[18] M. Y. Park and T. Hastie. L1-regularized path algorithm for generalized linear models. Journal of the Royal Statistical Society Series B, 69:659\u2013677, 2007.\n[19] F. Samaria and A. Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of 2nd IEEE Workshop on Applications of Computer Vision, 1994.\n[20] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267\u2013288, 1996.\n[21] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. Tibshirani. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society Series B, 74:245\u2013266, 2012.\n[22] N. Ueda and K. Saito. Parametric mixture models for multi-labeled text. Advances in Neural Information Processing Systems, 15:721\u2013728, 2002.\n[23] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan. Sparse representation for computer vision and pattern recognition. In Proceedings of IEEE, 2010.\n[24] T. T. Wu, Y. F. 
Chen, T. Hastie, E. Sobel, and K. Lange. Genomewide association analysis by lasso penalized logistic regression. Bioinformatics, 25:714\u2013721, 2009.\n[25] Z. J. Xiang and P. J. Ramadge. Fast lasso screening tests based on correlations. In IEEE ICASSP, 2012.\n[26] Z. J. Xiang, H. Xu, and P. J. Ramadge. Learning sparse representation of high dimensional data on large scale dictionaries. In NIPS, 2011.\n[27] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68:49\u201367, 2006.\n[28] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541\u20132563, 2006.\n[29] J. Zhou, L. Yuan, J. Liu, and J. Ye. A multi-task learning formulation for predicting disease progression. In KDD, pages 814\u2013822. ACM, 2011.", "award": [], "sourceid": 569, "authors": [{"given_name": "Jie", "family_name": "Wang", "institution": "Arizona State University"}, {"given_name": "Jiayu", "family_name": "Zhou", "institution": "Arizona State University"}, {"given_name": "Peter", "family_name": "Wonka", "institution": "Arizona State University"}, {"given_name": "Jieping", "family_name": "Ye", "institution": "Arizona State University"}]}