{"title": "Multi-Layer Feature Reduction for Tree Structured Group Lasso via Hierarchical Projection", "book": "Advances in Neural Information Processing Systems", "page_first": 1279, "page_last": 1287, "abstract": "Tree structured group Lasso (TGL) is a powerful technique in uncovering the tree structured sparsity over the features, where each node encodes a group of features. It has been applied successfully in many real-world applications. However, with extremely large feature dimensions, solving TGL remains a significant challenge due to its highly complicated regularizer. In this paper, we propose a novel Multi-Layer Feature reduction method (MLFre) to quickly identify the inactive nodes (the groups of features with zero coefficients in the solution) hierarchically in a top-down fashion, which are guaranteed to be irrelevant to the response. Thus, we can remove the detected nodes from the optimization without sacrificing accuracy. The major challenge in developing such testing rules is due to the overlaps between the parents and their children nodes. By a novel hierarchical projection algorithm, MLFre is able to test the nodes independently from any of their ancestor nodes. Moreover, we can integrate MLFre---that has a low computational cost---with any existing solvers. Experiments on both synthetic and real data sets demonstrate that the speedup gained by MLFre can be orders of magnitude.", "full_text": "Multi-Layer Feature Reduction for Tree Structured\n\nGroup Lasso via Hierarchical Projection\n\n1Computational Medicine and Bioinformatics\n\n2Department of Electrical Engineering and Computer Science\n\nJie Wang1, Jieping Ye1,2\n\nUniversity of Michigan, Ann Arbor, MI 48109\n\n{jwangumi, jpye}@umich.edu\n\nAbstract\n\nTree structured group Lasso (TGL) is a powerful technique in uncovering the tree\nstructured sparsity over the features, where each node encodes a group of features.\nIt has been applied successfully in many real-world applications. 
However, with\nextremely large feature dimensions, solving TGL remains a signi\ufb01cant challenge\ndue to its highly complicated regularizer. In this paper, we propose a novel Multi-\nLayer Feature reduction method (MLFre) to quickly identify the inactive nodes\n(the groups of features with zero coef\ufb01cients in the solution) hierarchically in a\ntop-down fashion, which are guaranteed to be irrelevant to the response. Thus, we\ncan remove the detected nodes from the optimization without sacri\ufb01cing accura-\ncy. The major challenge in developing such testing rules is due to the overlaps\nbetween the parents and their children nodes. By a novel hierarchical projec-\ntion algorithm, MLFre is able to test the nodes independently from any of their\nancestor nodes. Moreover, we can integrate MLFre\u2014that has a low computation-\nal cost\u2014with any existing solvers. Experiments on both synthetic and real data\nsets demonstrate that the speedup gained by MLFre can be orders of magnitude.\n\nIntroduction\n\n1\nTree structured group Lasso (TGL) [13, 30] is a powerful regression technique in uncovering the\nhierarchical sparse patterns among the features. The key of TGL, i.e., the tree guided regularization,\nis based on a pre-de\ufb01ned tree structure and the group Lasso penalty [29], where each node represents\na group of features. In recent years, TGL has achieved great success in many real-world applications\nsuch as brain image analysis [10, 18], gene data analysis [14], natural language processing [27, 28],\nand face recognition [12]. Many algorithms have been proposed to improve the ef\ufb01ciency of TGL\n[1, 6, 11, 7, 16]. However, the application of TGL to large-scale problems remains a challenge due\nto its highly complicated regularizer.\nAs an emerging and promising technique in scaling large-scale problems, screening has received\nmuch attention in the past few years. 
Screening aims to identify the zero coefficients in the sparse solutions by simple testing rules such that the corresponding features can be removed from the optimization. Thus, the size of the data matrix can be significantly reduced, leading to substantial savings in computational cost and memory usage. Typical examples include TLFre [25], FLAMS [22], EDPP [24], Sasvi [17], DOME [26], SAFE [8], and strong rules [21]. We note that strong rules are inexact in the sense that features with nonzero coefficients may be mistakenly discarded, while the others are exact. Another important direction of screening is to detect the non-support vectors for support vector machine (SVM) and least absolute deviation (LAD) [23, 19]. Empirical studies have shown that the speedup gained by screening methods can be several orders of magnitude. Moreover, the exact screening methods improve the efficiency without sacrificing optimality.
However, to the best of our knowledge, existing screening methods are only applicable to sparse models with simple structures such as Lasso, group Lasso, and sparse group Lasso. In this paper, we propose a novel Multi-Layer Feature reduction method, called MLFre, for TGL. MLFre is exact and it tests the nodes hierarchically from the top level to the bottom level to quickly identify the inactive nodes (the groups of features with zero coefficients in the solution vector), which are guaranteed to be absent from the sparse representation. To the best of our knowledge, MLFre is the first screening method that is applicable to TGL with the highly complicated tree guided regularization.
The major technical challenges in developing MLFre for TGL are twofold. The first is that most existing exact screening methods are based on evaluating the norm of the subgradients of the sparsity-inducing regularizers with respect to the variables or groups of variables of interest.
However, for TGL, we only have access to a mixture of the subgradients due to the overlaps between parents and their children nodes. Therefore, our first major technical contribution is a novel hierarchical projection algorithm that is able to exactly and efficiently recover the subgradients with respect to every node from the mixture (Sections 3 and 4). The second technical challenge is that most existing exact screening methods need to estimate an upper bound involving the dual optimum. This turns out to be a complicated nonconvex optimization problem for TGL. Thus, our second major technical contribution is to show that this highly nontrivial nonconvex optimization problem admits closed form solutions (Section 5). Experiments on both synthetic and real data sets demonstrate that the speedup gained by MLFre can be orders of magnitude (Section 6). Please see the supplements for detailed proofs of the results in the main text.
Notation: Let ‖·‖ be the ℓ2 norm, [p] = {1, . . . , p} for a positive integer p, G ⊆ [p], and Ḡ = [p] \ G. For u ∈ R^p, let u_i be its ith component. For G ⊆ [p], we denote u_G = [u]_G = {v : v_i = u_i if i ∈ G, v_i = 0 otherwise} and H_G = {u ∈ R^p : u_Ḡ = 0}. If G_1, G_2 ⊆ [p] and G_1 ⊂ G_2, we emphasize that G_2 \ G_1 ≠ ∅, i.e., ⊂ denotes proper inclusion. For a set C, let int C, ri C, bd C, and rbd C be its interior, relative interior, boundary, and relative boundary, respectively [5]. If C is closed and convex, the projection operator is P_C(z) := argmin_{u∈C} ‖z − u‖, and its indicator function is I_C(·), which is 0 on C and ∞ elsewhere. Let Γ_0(R^p) be the class of proper closed convex functions on R^p. For f ∈ Γ_0(R^p), let ∂f be its subdifferential and dom f := {z : f(z) < ∞}. We denote γ_+ = max(γ, 0).
2 Basics
We briefly review some basics of TGL.
First, we introduce the so-called index tree.
Definition 1. [16] For an index tree T of depth d, we denote the node(s) of depth i by T_i = {G^i_1, . . . , G^i_{n_i}}, where n_0 = 1, G^0_1 = [p], G^i_j ⊆ [p], and n_i ≥ 1, ∀ i ∈ [d]. We assume that
(i): G^i_{j_1} ∩ G^i_{j_2} = ∅, ∀ i ∈ [d] and j_1 ≠ j_2 (different nodes of the same depth do not overlap);
(ii): if G^i_j is a parent node of G^{i+1}_ℓ, then G^{i+1}_ℓ ⊂ G^i_j.
When the tree structure is available (see supplement for an example), the TGL problem is
    min_β (1/2)‖y − Xβ‖² + λ Σ_{i=0}^{d} Σ_{j=1}^{n_i} w^i_j ‖β_{G^i_j}‖,    (TGL)
where y ∈ R^N is the response vector, X ∈ R^{N×p} is the data matrix, β_{G^i_j} and w^i_j are the coefficient vector and positive weight corresponding to node G^i_j, respectively, and λ > 0 is the regularization parameter. We derive the Lagrangian dual problem of TGL as follows.
Theorem 2. For the TGL problem, let φ(β) = Σ_{i=0}^{d} Σ_{j=1}^{n_i} w^i_j ‖β_{G^i_j}‖. The following hold:
(i): Let φ^i_j(β) = ‖β_{G^i_j}‖ and B^i_j = {ζ ∈ H_{G^i_j} : ‖ζ‖ ≤ w^i_j}. We can write ∂φ(0) as
    ∂φ(0) = Σ_{i=0}^{d} Σ_{j=1}^{n_i} w^i_j ∂φ^i_j(0) = Σ_{i=0}^{d} Σ_{j=1}^{n_i} B^i_j.    (1)
(ii): Let F = {θ : X^T θ ∈ ∂φ(0)}. The Lagrangian dual of TGL is
    sup_θ { (1/2)‖y‖² − (λ²/2)‖y/λ − θ‖² : θ ∈ F }.    (2)
(iii): Let β*(λ) and θ*(λ) be the optimal solutions of problems (TGL) and (2), respectively. Then,
    y = Xβ*(λ) + λθ*(λ),    (3)
    X^T θ*(λ) ∈ Σ_{i=0}^{d} Σ_{j=1}^{n_i} w^i_j ∂φ^i_j(β*(λ)).    (4)
The dual problem of TGL in (2) is equivalent to a projection problem, i.e., θ*(λ) = P_F(y/λ). This geometric property plays a fundamentally important role in developing MLFre (see Section 5).

3 Testing Dual Feasibility via Hierarchical Projection
Although the dual problem in (2) has nice geometric properties, it is challenging to determine the feasibility of a given θ due to the complex dual feasible set F. An alternative approach is to test if X^T θ = P_{∂φ(0)}(X^T θ). Although ∂φ(0) is very complicated, we show that P_{∂φ(0)}(·) admits a closed form solution by hierarchically splitting P_{∂φ(0)}(·) into a sum of projection operators with respect to a collection of simpler sets. We first introduce some notation. For an index tree T, let
    A^i_j = Σ { B^t_k : G^t_k ⊆ G^i_j }, ∀ i ∈ 0 ∪ [d], j ∈ [n_i],    (5)
    C^i_j = Σ { B^t_k : G^t_k ⊂ G^i_j }, ∀ i ∈ 0 ∪ [d], j ∈ [n_i].    (6)
For a node G^i_j, the set A^i_j is the sum of the sets B^t_k corresponding to all its descendant nodes and itself, and the set C^i_j is the sum excluding itself.
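The index-tree conventions of Definition 1 and the Minkowski sums in (5) and (6) can be made concrete with a small sketch. The toy tree, its dictionary encoding, and the helper `members` below are our own illustration (not code from the paper); `members` simply lists which balls B^t_k enter A^i_j and C^i_j for a given node.

```python
# A toy index tree per Definition 1, with p = 4 features and depth d = 2.
# tree[i] lists the depth-i nodes as (indices, weight); the encoding is ours.
tree = {
    0: [({0, 1, 2, 3}, 1.0)],                              # root G^0_1 = [p]
    1: [({0, 1}, 1.0), ({2, 3}, 1.0)],                     # G^1_1, G^1_2
    2: [({0}, 1.0), ({1}, 1.0), ({2}, 1.0), ({3}, 1.0)],   # leaves
}

# Definition 1(i): nodes of the same depth must not overlap.
for i, nodes in tree.items():
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            assert not (nodes[a][0] & nodes[b][0])

def members(tree, i, j, strict):
    """List the nodes G^t_k whose balls B^t_k enter the sum A^i_j (strict=False,
    all descendants and the node itself, Eq. (5)) or C^i_j (strict=True,
    descendants only, Eq. (6))."""
    G = tree[i][j][0]
    out = []
    for t, nodes in tree.items():
        for k, (Gtk, _) in enumerate(nodes):
            if Gtk <= G and not (strict and (t, k) == (i, j)):
                out.append((t, k))
    return out
```

For instance, A^0_1 collects every ball in the tree (here 7 of them), while C^1_1 collects only the balls of the two leaves under G^1_1, matching the "sum excluding itself" reading of (6).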
Therefore, by the definitions of A^i_j, B^i_j, and C^i_j, we have
    ∂φ(0) = A^0_1;  A^i_j = B^i_j + C^i_j, ∀ non-leaf node G^i_j;  A^i_j = B^i_j, ∀ leaf node G^i_j,    (7)
which implies that P_{∂φ(0)}(·) = P_{A^0_1}(·) = P_{B^0_1+C^0_1}(·). This motivates the first pillar of this paper, i.e., Lemma 3, which splits P_{B^0_1+C^0_1}(·) into the sum of two projections onto B^0_1 and C^0_1, respectively.
Lemma 3. Let G ⊆ [p], B = {u ∈ H_G : ‖u‖ ≤ γ} with γ > 0, C ⊆ H_G a nonempty closed convex set, and z an arbitrary point in H_G. Then, the following hold:
(i): [2] P_B(z) = min{1, γ/‖z‖} z if z ≠ 0. Otherwise, P_B(z) = 0.
(ii): I_{B+C}(z) = I_B(z − P_C(z)), i.e., P_C(z) ∈ argmin_{u∈C} I_B(z − u).
(iii): P_{B+C}(z) = P_C(z) + P_B(z − P_C(z)).
By part (iii) of Lemma 3, we can split P_{A^0_1}(X^T θ) in the following form:
    P_{A^0_1}(X^T θ) = P_{C^0_1}(X^T θ) + P_{B^0_1}(X^T θ − P_{C^0_1}(X^T θ)).    (8)
As P_{B^0_1}(·) admits a closed form solution by part (i) of Lemma 3, we can compute P_{A^0_1}(X^T θ) once P_{C^0_1}(X^T θ) is computed. By Eq. (5) and Eq. (6), for a non-leaf node G^i_j, we note that
    C^i_j = Σ_{k∈I_c(G^i_j)} A^{i+1}_k, where I_c(G^i_j) = {k : G^{i+1}_k ⊂ G^i_j}.    (9)
Inspired by (9), we have the following result.
Lemma 4. Let {G_ℓ ⊂ [p]}_ℓ be a set of nonoverlapping index sets, {C_ℓ ⊆ H_{G_ℓ}}_ℓ be a set of nonempty closed convex sets, and C = Σ_ℓ C_ℓ. Then, P_C(z) = Σ_ℓ P_{C_ℓ}(z_{G_ℓ}) for z ∈ R^p.
Remark 1. For Lemma 4, if all C_ℓ are balls centered at 0, then P_C(z) admits a closed form solution.
By Lemma 4 and Eq. (9), we can further split P_{C^0_1}(X^T θ) in Eq. (8) in the following form:
    P_{C^0_1}(X^T θ) = Σ_{k∈I_c(G^0_1)} P_{A^1_k}([X^T θ]_{G^1_k}), where I_c(G^0_1) = {k : G^1_k ⊂ G^0_1}.    (10)
Consider the right hand side of Eq. (10). If G^1_k is a leaf node, Eq. (7) implies that A^1_k = B^1_k, and thus P_{A^1_k}(·) admits a closed form solution by part (i) of Lemma 3. Otherwise, we continue to split P_{A^1_k}(·) by Lemmas 3 and 4. This procedure always terminates as we reach the leaf nodes [see the last equality in Eq. (7)]. Therefore, by a repeated application of Lemmas 3 and 4, the following algorithm computes the closed form solution of P_{A^0_1}(·).

Algorithm 1 Hierarchical Projection: P_{A^0_1}(·)
Input: z ∈ R^p, the index tree T as in Definition 1, and positive weights w^i_j for all nodes G^i_j in T.
Output: u^0 = P_{A^0_1}(z), and v^i for ∀ i ∈ 0 ∪ [d].
1: Set u^i ← 0 ∈ R^p, ∀ i ∈ 0 ∪ [d+1], and v^i ← 0 ∈ R^p, ∀ i ∈ 0 ∪ [d].
2: for i = d to 0 do    /*hierarchical projection*/
3:    for j = 1 to n_i do
4:       v^i_{G^i_j} = P_{B^i_j}(z_{G^i_j} − u^{i+1}_{G^i_j}),    (11)
         u^i_{G^i_j} ← u^{i+1}_{G^i_j} + v^i_{G^i_j}.    (12)
5:    end for
6: end for

The time complexity of Algorithm 1 is similar to that of solving its proximal operator [16], i.e., O(Σ_{i=0}^{d} Σ_{j=1}^{n_i} |G^i_j|), where |G^i_j| is the number of features contained in the node G^i_j. As Σ_{j=1}^{n_i} |G^i_j| ≤ p by Definition 1, the time complexity of Algorithm 1 is O(pd), and thus O(p log p) for a balanced tree, where d = O(log p). The next result shows that u^0 returned by Algorithm 1 is the projection of z onto A^0_1.
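As an illustration, Algorithm 1 fits in a few lines of pure Python. This is a sketch under our own encoding of the tree (a depth-indexed dictionary of (indices, weight) pairs), not the SLEP implementation; `ball_proj` implements Lemma 3(i).

```python
import math

def ball_proj(z, radius):
    # Lemma 3(i): Euclidean projection of z onto the l2-ball {u : ||u|| <= radius}
    n = math.sqrt(sum(t * t for t in z))
    if n <= radius or n == 0.0:
        return list(z)
    return [radius / n * t for t in z]

def hierarchical_projection(z, tree):
    """Algorithm 1 (sketch). tree maps depth i -> list of (indices, weight).
    Returns u (the projection of z onto A^0_1) and the per-node components v."""
    u = [0.0] * len(z)   # accumulates u^{i+1}: the sum of descendant components
    v = {}               # v[(i, j)]: the component v^i restricted to G^i_j, Eq. (11)
    for i in sorted(tree, reverse=True):          # bottom-up: i = d, ..., 0
        for j, (G, w) in enumerate(tree[i]):
            G = sorted(G)
            vloc = ball_proj([z[g] - u[g] for g in G], w)
            v[(i, j)] = vloc
            for t, g in enumerate(G):             # u^i = u^{i+1} + v^i, Eq. (12)
                u[g] += vloc[t]
    return u, v
```

For a tree containing only the root node, the routine reduces to a single ball projection, and when z already lies in ∂φ(0) it returns z unchanged, consistent with P_{∂φ(0)} being the identity on that set.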
Indeed, we have more general results as follows.
Theorem 5. For Algorithm 1, the following hold:
(i): u^i_{G^i_j} = P_{A^i_j}(z_{G^i_j}), ∀ i ∈ 0 ∪ [d], j ∈ [n_i].
(ii): u^{i+1}_{G^i_j} = P_{C^i_j}(z_{G^i_j}), for any non-leaf node G^i_j.

4 MLFre Inspired by the KKT Conditions and Hierarchical Projection
In this section, we motivate MLFre via the KKT condition in Eq. (4) and the hierarchical projection in Algorithm 1. Note that for any node G^i_j, we have
    w^i_j ∂φ^i_j(β*(λ)) = {ζ ∈ H_{G^i_j} : ‖ζ‖ ≤ w^i_j} if [β*(λ)]_{G^i_j} = 0, and w^i_j [β*(λ)]_{G^i_j} / ‖[β*(λ)]_{G^i_j}‖ otherwise.    (13)
Moreover, the KKT condition in Eq. (4) implies that
    ∃ {ξ^i_j ∈ w^i_j ∂φ^i_j(β*(λ)) : ∀ i ∈ 0 ∪ [d], j ∈ [n_i]} such that X^T θ*(λ) = Σ_{i=0}^{d} Σ_{j=1}^{n_i} ξ^i_j.    (14)
Thus, if ‖ξ^i_j‖ < w^i_j, we can see that [β*(λ)]_{G^i_j} = 0. However, we do not have direct access to ξ^i_j even if θ*(λ) is known, because X^T θ*(λ) is a mixture (sum) of all the ξ^i_j, as shown in Eq. (14). Indeed, Algorithm 1 turns out to be much more useful than testing the feasibility of a given θ: it is able to split all ξ^i_j ∈ w^i_j ∂φ^i_j(β*(λ)) from X^T θ*(λ). This will serve as a cornerstone in developing MLFre. Theorem 6 rigorously shows this property of Algorithm 1.
Theorem 6. Let v^i, i ∈ 0 ∪ [d], be the output of Algorithm 1 with input X^T θ*(λ), and {ξ^i_j : i ∈ 0 ∪ [d], j ∈ [n_i]} be the set of vectors that satisfy Eq. (14). Then, the following hold.
(i): If [β*(λ)]_{G^i_j} = 0, and [β*(λ)]_{G^l_r} ≠ 0 for all G^l_r ⊃ G^i_j, then P_{A^i_j}([X^T θ*(λ)]_{G^i_j}) = Σ_{(k,t): G^t_k ⊆ G^i_j} ξ^t_k.
(ii): If G^i_j is a non-leaf node, and [β*(λ)]_{G^i_j} ≠ 0, then P_{C^i_j}([X^T θ*(λ)]_{G^i_j}) = Σ_{(k,t): G^t_k ⊂ G^i_j} ξ^t_k.
(iii): v^i_{G^i_j} ∈ w^i_j ∂φ^i_j(β*(λ)), ∀ i ∈ 0 ∪ [d], j ∈ [n_i].
Combining Eq. (13) and part (iii) of Theorem 6, we can see that
    ‖v^i_{G^i_j}‖ < w^i_j ⇒ [β*(λ)]_{G^i_j} = 0.    (15)
By plugging Eq. (11) and part (ii) of Theorem 5 into (15), we have [β*(λ)]_{G^i_j} = 0 if
    ‖P_{B^i_j}([X^T θ*(λ)]_{G^i_j} − P_{C^i_j}([X^T θ*(λ)]_{G^i_j}))‖ < w^i_j, if G^i_j is a non-leaf node,    (R1)
    ‖P_{B^i_j}([X^T θ*(λ)]_{G^i_j})‖ < w^i_j, if G^i_j is a leaf node.    (R2)
Moreover, the definition of P_{B^i_j} implies that we can simplify (R1) and (R2) to the following form:
    ‖[X^T θ*(λ)]_{G^i_j} − P_{C^i_j}([X^T θ*(λ)]_{G^i_j})‖ < w^i_j ⇒ [β*(λ)]_{G^i_j} = 0, if G^i_j is a non-leaf node,    (R1')
    ‖[X^T θ*(λ)]_{G^i_j}‖ < w^i_j ⇒ [β*(λ)]_{G^i_j} = 0, if G^i_j is a leaf node.    (R2')
However, (R1') and (R2') are not applicable to detect inactive nodes as they involve
θ*(λ). Inspired by SAFE [8], we first estimate a set Θ containing θ*(λ). Let [X^T Θ]_{G^i_j} = {[X^T θ]_{G^i_j} : θ ∈ Θ} and
    S^i_j(z) = z_{G^i_j} − P_{C^i_j}(z_{G^i_j}).    (16)
Then, we can relax (R1') and (R2') as
    sup_ζ { ‖S^i_j(ζ)‖ : ζ_{G^i_j} ∈ Ξ^i_j } < w^i_j ⇒ [β*(λ)]_{G^i_j} = 0, if G^i_j is a non-leaf node,    (R1*)
    sup_ζ { ‖ζ_{G^i_j}‖ : ζ_{G^i_j} ∈ [X^T Θ]_{G^i_j} } < w^i_j ⇒ [β*(λ)]_{G^i_j} = 0, if G^i_j is a leaf node,    (R2*)
where Ξ^i_j ⊇ [X^T Θ]_{G^i_j} is a relaxation to be specified in Section 5.2. In view of (R1*) and (R2*), we sketch the procedure to develop MLFre in the following three steps.
Step 1: We estimate a set Θ that contains θ*(λ).
Step 2: We solve for the supreme values in (R1*) and (R2*), respectively.
Step 3: We develop MLFre by plugging the supreme values obtained in Step 2 into (R1*) and (R2*).

4.1 The Effective Interval of the Regularization Parameter λ
The geometric property of the dual problem in (2), i.e., θ*(λ) = P_F(y/λ), implies that θ*(λ) = y/λ if y/λ ∈ F. Moreover, (R1) for the root node G^0_1 leads to β*(λ) = 0 if y/λ is an interior point of F. Indeed, the following theorem presents stronger results.
Theorem 7. For TGL, let λmax = max{λ : y/λ ∈ F} and S^0_1(·) be defined by Eq. (16). Then,
(i): λmax = {λ : ‖S^0_1(X^T y/λ)‖ = w^0_1}.
(ii): y/λ ∈ F ⇔ λ ≥ λmax ⇔ θ*(λ) = y/λ ⇔ β*(λ) = 0.
For more discussions on λmax, please refer to Section H in the supplements.

5 The Proposed Multi-Layer Feature Reduction Method for TGL
We follow the three steps in Section 4 to develop MLFre. Specifically, we first present an accurate estimation of the dual optimum in Section 5.1, then we solve for the supreme values in (R1*) and (R2*) in Section 5.2, and finally we present the proposed MLFre in Section 5.3.

5.1 Estimation of the Dual Optimum
We estimate the dual optimum by the geometric properties of projection operators [recall that θ*(λ) = P_F(y/λ)]. We first introduce a useful tool to characterize projection operators.
Definition 8. [2] For a closed convex set C and a point z_0 ∈ C, the normal cone to C at z_0 is
    N_C(z_0) = {ζ : ⟨ζ, z − z_0⟩ ≤ 0, ∀ z ∈ C}.
Theorem 7 implies that θ*(λ) is known with λ ≥ λmax. Thus, we can estimate θ*(λ) in terms of a known θ*(λ0). This leads to Theorem 9, which bounds the dual optimum by a small ball.
Theorem 9. For TGL, suppose that θ*(λ0) is known with λ0 ≤ λmax.
For λ ∈ (0, λ0), we define
    n(λ0) = y/λ0 − θ*(λ0) if λ0 < λmax, and n(λ0) = X S^0_1(X^T y/λmax) if λ0 = λmax,
    r(λ, λ0) = y/λ − θ*(λ0),
    r⊥(λ, λ0) = r(λ, λ0) − (⟨r(λ, λ0), n(λ0)⟩ / ‖n(λ0)‖²) n(λ0).
Then, the following hold:
(i): n(λ0) ∈ N_F(θ*(λ0)).
(ii): ‖θ*(λ) − (θ*(λ0) + (1/2) r⊥(λ, λ0))‖ ≤ (1/2)‖r⊥(λ, λ0)‖.
Theorem 9 indicates that θ*(λ) lies inside the ball of radius (1/2)‖r⊥(λ, λ0)‖ centered at
    o(λ, λ0) = θ*(λ0) + (1/2) r⊥(λ, λ0).

5.2 Solving the Nonconvex Optimization Problems in (R1*) and (R2*)
We solve for the supreme values in (R1*) and (R2*). For notational convenience, let
    Θ = {θ : ‖θ − o(λ, λ0)‖ ≤ (1/2)‖r⊥(λ, λ0)‖},    (17)
    Ξ^i_j = {ζ : ζ ∈ H_{G^i_j}, ‖ζ − [X^T o(λ, λ0)]_{G^i_j}‖ ≤ (1/2)‖r⊥(λ, λ0)‖ ‖X_{G^i_j}‖_2}.    (18)
Theorem 9 implies that θ*(λ) ∈ Θ, and thus [X^T Θ]_{G^i_j} ⊆ Ξ^i_j for all non-leaf nodes G^i_j. To develop MLFre by (R1*) and (R2*), we need to solve the following optimization problems:
    s^i_j(λ, λ0) = sup_ζ { ‖S^i_j(ζ)‖ : ζ ∈ Ξ^i_j }, if G^i_j is a non-leaf node,    (19)
    s^i_j(λ, λ0) = sup_ζ { ‖ζ‖ : ζ ∈ Ξ^i_j }, if G^i_j is a leaf node.    (20)
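The quantities of Theorem 9 are cheap to compute. The following is a minimal sketch, with our own names, of the ball estimate (center o(λ, λ0) and radius (1/2)‖r⊥(λ, λ0)‖); the normal vector n(λ0) is passed in directly, e.g., y/λ0 − θ*(λ0) when λ0 < λmax.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dual_ball(y, theta0, lam, n_vec):
    """Theorem 9 (sketch): return the center o(lam, lam0) and the radius
    (1/2)||r_perp(lam, lam0)|| of a ball guaranteed to contain theta*(lam).
    theta0 = theta*(lam0); n_vec stands in for the normal vector n(lam0)."""
    r = [yi / lam - t for yi, t in zip(y, theta0)]            # r(lam, lam0)
    scale = dot(r, n_vec) / dot(n_vec, n_vec)
    r_perp = [ri - scale * ni for ri, ni in zip(r, n_vec)]    # part of r orthogonal to n(lam0)
    center = [t + 0.5 * rp for t, rp in zip(theta0, r_perp)]  # o(lam, lam0)
    radius = 0.5 * math.sqrt(dot(r_perp, r_perp))
    return center, radius
```

As λ approaches λ0 from below, r(λ, λ0) aligns with n(λ0), so r⊥(λ, λ0) and the radius shrink toward zero; a finer parameter grid therefore yields tighter screening.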
Before we solve problems (19) and (20), we first introduce some notation.
Definition 10. For a non-leaf node G^i_j of an index tree T, let I_c(G^i_j) = {k : G^{i+1}_k ⊂ G^i_j}. If G^i_j \ ∪_{k∈I_c(G^i_j)} G^{i+1}_k ≠ ∅, we define a virtual child node of G^i_j by G^{i+1}_{j'} = G^i_j \ ∪_{k∈I_c(G^i_j)} G^{i+1}_k for j' ∈ {n_{i+1}+1, n_{i+1}+2, . . . , n_{i+1}+n'_{i+1}}, where n'_{i+1} is the number of virtual nodes of depth i+1. We set the weights w^{i+1}_{j'} = 0 for all virtual nodes G^{i+1}_{j'}.
Another useful concept is the so-called unique path between the nodes in the tree.
Lemma 11. [16] For any non-root node G^i_j, we can find a unique path from G^i_j to the root G^0_1. Let the nodes on this path be G^l_{r_l}, where l ∈ 0 ∪ [i], r_0 = 1, and r_i = j. Then, the following hold:
    G^i_j ⊂ G^l_{r_l}, ∀ l ∈ 0 ∪ [i−1],    (21)
    G^i_j ∩ G^l_r = ∅, ∀ r ≠ r_l, l ∈ [i−1], r ∈ [n_l].    (22)
Solving Problem (19). We consider the following equivalent problem of (19):
    (1/2)(s^i_j(λ, λ0))² = sup_ζ { (1/2)‖S^i_j(ζ)‖² : ζ ∈ Ξ^i_j }, if G^i_j is a non-leaf node.    (23)
Although both the objective function and the feasible set of problem (23) are convex, it is nonconvex as we need to find the supreme value. We derive the closed form solutions of (19) and (23) as follows.
Theorem 12. Let c = [X^T o(λ, λ0)]_{G^i_j}, γ = (1/2)‖r⊥(λ, λ0)‖ ‖X_{G^i_j}‖_2, and v^i, i ∈ 0 ∪ [d], be the output of Algorithm 1 with input X^T o(λ, λ0).
(i): Suppose that c ∉ C^i_j. Then, s^i_j(λ, λ0) = ‖v^i_{G^i_j}‖ + γ.
(ii): Suppose that node G^i_j has a virtual child node. Then, for any c ∈ C^i_j, s^i_j(λ, λ0) = γ.
(iii): Suppose that node G^i_j has no virtual child node.
Then, the following hold.
(iii.a): If c ∈ rbd C^i_j, then s^i_j(λ, λ0) = γ.
(iii.b): If c ∈ ri C^i_j, then, for any node G^t_k ⊂ G^i_j with t ∈ {i+1, . . . , d} and k ∈ [n_t + n'_t], let the nodes on the path from G^t_k to G^i_j be G^l_{r_l}, where l = i, . . . , t, r_i = j, and r_t = k, and let
    Γ(G^{i+1}_{r_{i+1}}, G^t_k) = Σ_{l=i+1}^{t} ( w^l_{r_l} − ‖v^l_{G^l_{r_l}}‖ ).    (24)
Then, s^i_j(λ, λ0) = ( γ − min_{(k,t): G^t_k ⊂ G^i_j} Γ(G^{i+1}_{r_{i+1}}, G^t_k) )_+.
Solving Problem (20). We can solve problem (20) by the Cauchy-Schwarz inequality.
Theorem 13. For problem (20), we have s^i_j(λ, λ0) = ‖[X^T o(λ, λ0)]_{G^i_j}‖ + (1/2)‖r⊥(λ, λ0)‖ ‖X_{G^i_j}‖_2.

5.3 The Multi-Layer Screening Rule
In real-world applications, the optimal parameter values are usually unknown. Commonly used approaches to determine an appropriate parameter value, such as cross validation and stability selection, solve TGL many times along a grid of parameter values. This process can be very time consuming. Motivated by this challenge, we present MLFre in the following theorem by plugging the supreme values found by Theorems 12 and 13 into (R1*) and (R2*), respectively.
Theorem 14. For the TGL problem, suppose that we are given a sequence of parameter values λmax = λ0 > λ1 > · · · > λK. For each integer k = 0, . . . , K − 1, we compute θ*(λk) from a given β*(λk) via Eq. (3). Then, for i = 1, . . .
, d, MLFre takes the form of\n\nsi\nj(\u03bbk+1, \u03bbk) < wi\n\nj \u21d2 [\u03b2\u2217(\u03bb)]Gi\n\nj\n\n= 0, \u2200 j \u2208 [ni].\n\n(MLFre)\n\nRemark 2. We apply MLFre to identify inactive nodes hierarchically in a top-down fashion. Note\nthat, we do not need to apply MLFre to node Gi\nRemark 3. To simplify notations, we consider TGL with a single tree, in the proof. However, all\nmajor results are directly applicable to TGL with multiple trees, as they are independent from each\nother. We note that, many sparse models, such as Lasso, group Lasso, and sparse group Lasso, are\nspecial cases of TGL with multiple trees.\n\nj if one of its ancestor nodes passes the rule.\n\n6\n\n\f(a) synthetic 1, p = 20000\n\n(b) synthetic 1, p = 50000\n\n(c) synthetic 1, p = 100000\n\n(d) synthetic 2, p = 20000\n\n(e) synthetic 2, p = 50000\n\n(f) synthetic 2, p = 100000\n\nFigure 1: Rejection ratios of MLFre on two synthetic data sets with different feature dimensions.\n\n, where |Gi\n\np0\n\n(cid:80)\nk\u2208Gi |Gi\nk|\n\n6 Experiments\nWe evaluate MLFre on both synthetic and real data sets by two measurements. The \ufb01rst measure is\nthe rejection ratios of MLFre for each level of the tree. Let p0 be the number of zero coef\ufb01cients in\nthe solution vector and Gi be the index set of the inactive nodes with depth i identi\ufb01ed by MLFre.\nk| is the\nThe rejection ratio of the ith layer of MLFre is de\ufb01ned by ri =\nk. The second measure is speedup, namely, the ratio of the\nnumber of features contained in node Gi\nrunning time of the solver without screening to the running time of solver with MLFre.\nFor each data set, we run the solver combined with MLFre along a sequence of 100 parameter values\nequally spaced on the logarithmic scale of \u03bb/\u03bbmax from 1.0 to 0.05. The solver for TGL is from the\nSLEP package [15]. 
It also provides an ef\ufb01cient routine to compute \u03bbmax.\n6.1 Simulation Studies\nWe perform experiments on two synthetic data\nsets, named synthetic 1 and synthetic 2, which\nare commonly used in the literature [21, 31].\nThe true model is y = X\u03b2\u2217 + 0.01\u0001, \u0001 \u223c\nN (0, 1). For each of the data set, we \ufb01x N =\n250 and select p = 20000, 50000, 100000. We\ncreate a tree with height 4, i.e., d = 3. The\nsolver MLFre MLFre+solver speedup\naverage sizes of the nodes with depth 1, 2 and\n16.04\n3 are 50, 10, and 1, respectively. Thus, if\n483.96\n20000\n29.78\n50000 1175.91\np = 100000, we have roughly n1 = 2000,\n40.60\n100000 2391.43\nn2 = 10000, and n3 = 100000. For synthet-\n12.43\n470.54\nic 1, the entries of the data matrix X are i.i.d.\n20000\n25.53\n50000 1122.30\nstandard Gaussian with zero pair-wise correla-\n36.81\n100000 2244.06\ntion, i.e., corr (xi, xj) = 0 for the ith and jth\ncolumns of X with i (cid:54)= j. For synthetic 2,\n42.50\n39.29\nthe entries of X are drawn from standard Gaus-\n36.88\nsian with pair-wise correlation corr (xi, xj) =\n0.5|i\u2212j|. To construct \u03b2\u2217, we \ufb01rst randomly select 50% of the nodes with depth 1, and then ran-\ndomly select 20% of the children nodes (with depth 2) of the remaining nodes with depth 1. 
The\ncomponents of \u03b2\u2217 corresponding to the remaining nodes are populated from a standard Gaussian,\nand the remaining ones are set to zero.\n\nTable 1: Running time (in seconds) for solving\nTGL along a sequence of 100 tuning parame-\nter values of \u03bb equally spaced on the logarithmic\nscale of \u03bb/\u03bbmax from 1.0 to 0.05 by (a): the solver\n[15] without screening (see the third column); (b):\nthe solver with MLFre (see the \ufb01fth column).\n\n30.17\n39.49\n58.91\n37.87\n43.97\n60.96\n492.08\n556.19\n564.36\n\nsynthetic 1\n\n1.03\n2.95\n6.57\n1.19\n3.13\n6.18\nADNI+GMV 406262 20911.92 81.14\nADNI+WMV 406262 21855.03 80.83\nADNI+WBV 406262 20812.06 82.10\n\nsynthetic 2\n\nDataset\n\np\n\n7\n\n\f(a) ADNI+GMV\n\n(b) ADNI+WMV\n\n(c) ADNI+WBV\n\nrespectively. We can see that MLFre identi\ufb01es almost all inactive nodes, i.e.,(cid:80)3\n\nalmost all of the inactive nodes, i.e., (cid:80)3\nincreases, MLFre identi\ufb01es more inactive nodes, i.e.,(cid:80)3\n\nFigure 2: Rejection ratios of MLFre on ADNI data set with grey matter volume (GMV), white mater\nvolume (WMV), and whole brain volume (WBV) as response vectors, respectively.\nFig. 1 shows the rejection ratios of all three layers of MLFre. We can see that MLFre identi\ufb01es\ni=1 ri \u2265 90%, and the \ufb01rst layer contributes the most.\nMoreover, Fig. 1 also indicates that, as the feature dimension (and the number of nodes in each level)\ni=1 ri \u2248 100%. Thus, we can expect a more\nsigni\ufb01cant capability of MLFre in identifying inactive nodes on data sets with higher dimensions.\nTable 1 shows the running time of the solver with and without MLFre. We can observe signi\ufb01cant\nspeedups gained by MLFre, which are up to 40 times. Take synthetic 1 with p = 100000 for\nexample. The solver without MLFre takes about 40 minutes to solve TGL at 100 parameter values.\nCombined with MLFre, the solver only needs less than one minute for the same task. 
Table 1 also shows that the computational cost of MLFre is very low and is negligible compared to that of the solver without MLFre. Moreover, as MLFre identifies more inactive nodes with increasing feature dimensions, Table 1 shows that the speedup gained by MLFre becomes more significant as well.

6.2 Experiments on the ADNI data set

We perform experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) data set (http://adni.loni.usc.edu/). The data set consists of 747 patients with 406262 single nucleotide polymorphisms (SNPs). We create the index tree such that n1 = 4567, n2 = 89332, and n3 = 406262. Fig. 2 presents the rejection ratios of MLFre on the ADNI data set with grey matter volume (GMV), white matter volume (WMV), and whole brain volume (WBV) as response, respectively. We can see that MLFre identifies almost all inactive nodes, i.e., ∑_{i=1}^{3} r_i ≈ 100%. As a result, we observe from Table 1 significant speedups gained by MLFre, about 40 times. Specifically, with GMV as response, the solver without MLFre takes about six hours to solve TGL at 100 parameter values. However, combined with MLFre, the solver needs only about eight minutes for the same task. Moreover, Table 1 also indicates that the computational cost of MLFre is very low and is negligible compared to that of the solver without MLFre.

7 Conclusion

In this paper, we propose a novel multi-layer feature reduction (MLFre) method for TGL. Our major technical contributions are twofold. The first is the novel hierarchical projection algorithm that is able to exactly and efficiently recover the subgradients of the tree-guided regularizer with respect to each node from their mixture. The second is that we show that a highly nontrivial nonconvex problem admits a closed-form solution. To the best of our knowledge, MLFre is the first screening method that is applicable to TGL.
An appealing feature of MLFre is that it is exact, in the sense that the identified inactive nodes are guaranteed to be absent from the sparse representations. Experiments on both synthetic and real data sets demonstrate that MLFre is very effective in identifying inactive nodes, leading to substantial savings in computational cost and memory usage without sacrificing accuracy. Moreover, the capability of MLFre in identifying inactive nodes is even more significant on higher dimensional data sets. We plan to generalize MLFre to more general and complicated sparse models, e.g., overlapping group Lasso with logistic loss. In addition, we plan to apply MLFre to other applications, e.g., brain image analysis [10, 18] and natural language processing [27, 28].

Acknowledgments

This work is supported in part by research grants from NIH (R01 LM010730, U54 EB020403) and NSF (IIS-0953662, III-1539991, III-1539722).

References

[1] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, Jan. 2012.
[2] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.
[3] M. Bazaraa, H. Sherali, and C. Shetty. Nonlinear Programming: Theory and Algorithms. Wiley-Interscience, 2006.
[4] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization, Second Edition. Canadian Mathematical Society, 2006.
[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[6] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing. Smoothing proximal gradient method for general structured sparse regression. Annals of Applied Statistics, pages 719–752, 2012.
[7] W. Deng, W. Yin, and Y. Zhang. Group sparse optimization by alternating direction method. Technical report, Rice CAAM Report TR11-06, 2011.
[8] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization, 8:667–698, 2012.
[9] J.-B. Hiriart-Urruty. From convex optimization to nonconvex optimization: necessary and sufficient conditions for global optimality. In Nonsmooth Optimization and Related Topics. Springer, 1988.
[10] R. Jenatton, A. Gramfort, V. Michel, G. Obozinski, E. Eger, F. Bach, and B. Thirion. Multiscale mining of fMRI data with hierarchical structured sparsity. SIAM Journal on Imaging Sciences, pages 835–856, 2012.
[11] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334, 2011.
[12] K. Jia, T. Chan, and Y. Ma. Robust and practical face recognition via structured sparsity. In European Conference on Computer Vision, 2012.
[13] S. Kim and E. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In International Conference on Machine Learning, 2010.
[14] S. Kim and E. Xing. Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. The Annals of Applied Statistics, 2012.
[15] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[16] J. Liu and J. Ye. Moreau-Yosida regularization for grouped tree structure learning. In Advances in Neural Information Processing Systems, 2010.
[17] J. Liu, Z. Zhao, J. Wang, and J. Ye. Safe screening with variational inequalities and its application to lasso. In International Conference on Machine Learning, 2014.
[18] M. Liu, D. Zhang, P. Yap, and D. Shen. Tree-guided sparse coding for brain disease classification. In Medical Image Computing and Computer-Assisted Intervention, 2012.
[19] K. Ogawa, Y. Suzuki, and I. Takeuchi. Safe screening of non-support vectors in pathwise SVM computation. In International Conference on Machine Learning, 2013.
[20] A. Ruszczyński. Nonlinear Optimization. Princeton University Press, 2006.
[21] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. Tibshirani. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society Series B, 74:245–266, 2012.
[22] J. Wang, W. Fan, and J. Ye. Fused lasso screening rules via the monotonicity of subdifferentials. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2015.
[23] J. Wang, P. Wonka, and J. Ye. Scaling SVM and least absolute deviations via exact data reduction. In International Conference on Machine Learning, 2014.
[24] J. Wang, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projection. Journal of Machine Learning Research, 16:1063–1101, 2015.
[25] J. Wang and J. Ye. Two-layer feature reduction for sparse-group lasso via decomposition of convex sets. In Advances in Neural Information Processing Systems, 2014.
[26] Z. J. Xiang, H. Xu, and P. J. Ramadge. Learning sparse representation of high dimensional data on large scale dictionaries. In Advances in Neural Information Processing Systems, 2011.
[27] D. Yogatama, M. Faruqui, C. Dyer, and N. Smith. Learning word representations with hierarchical sparse coding. In International Conference on Machine Learning, 2015.
[28] D. Yogatama and N. Smith. Linguistic structured sparsity in text categorization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2014.
[29] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68:49–67, 2006.
[30] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 2009.
[31] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67:301–320, 2005.