{"title": "k-Support and Ordered Weighted Sparsity for Overlapping Groups: Hardness and Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 284, "page_last": 292, "abstract": "The k-support and OWL norms generalize the l1 norm, providing better prediction accuracy and better handling of correlated variables. We study the norms obtained from extending the k-support norm and OWL norms to the setting in which there are overlapping groups. The resulting norms are in general NP-hard to compute, but they are tractable for certain collections of groups. To demonstrate this fact, we develop a dynamic program for the problem of projecting onto the set of vectors supported by a fixed number of groups. Our dynamic program utilizes tree decompositions and its complexity scales with the treewidth. This program can be converted to an extended formulation which, for the associated group structure, models the k-group support norms and an overlapping group variant of the ordered weighted l1 norm. Numerical results demonstrate the efficacy of the new penalties.", "full_text": "k-Support and Ordered Weighted Sparsity for\nOverlapping Groups: Hardness and Algorithms\n\nCong Han Lim\n\nUniversity of Wisconsin-Madison\n\nclim9@wisc.edu\n\nStephen J. Wright\n\nUniversity of Wisconsin-Madison\n\nswright@cs.wisc.edu\n\nAbstract\n\nThe k-support and OWL norms generalize the (cid:96)1 norm, providing better prediction\naccuracy and better handling of correlated variables. We study the norms obtained\nfrom extending the k-support norm and OWL norms to the setting in which there\nare overlapping groups. The resulting norms are in general NP-hard to compute,\nbut they are tractable for certain collections of groups. To demonstrate this fact,\nwe develop a dynamic program for the problem of projecting onto the set of\nvectors supported by a \ufb01xed number of groups. 
Our dynamic program utilizes tree decompositions and its complexity scales with the treewidth. This program can be converted to an extended formulation which, for the associated group structure, models the k-group support norms and an overlapping group variant of the ordered weighted ℓ1 norm. Numerical results demonstrate the efficacy of the new penalties.

1 Introduction

The use of the ℓ1-norm to induce sparse solutions is ubiquitous in machine learning, statistics, and signal processing. When the variables can be grouped into sets corresponding to different explanatory factors, group variants of the ℓ1 penalty can be used to recover solutions supported on a small number of groups. When the collection of groups G forms a partition of the variables (that is, the groups do not overlap), the group lasso penalty [19]

    Ω_GL(x) := Σ_{G∈G} ‖x_G‖_p   (1)

is often used. In many cases, however, some variables may contribute to more than one explanatory factor, which leads naturally to overlapping-group formulations. Such is the case in applications such as finding relevant sets of genes in a biological process [10] or recovering coefficients in wavelet trees [17]. In such contexts, the standard group lasso may introduce artifacts, since variables that are contained in different numbers of groups are penalized differently. Another approach is to employ the latent group lasso [10]:

    Ω_LGL(x) := min_v Σ_{G∈G} ‖v_G‖_p such that Σ_{G∈G} v_G = x,   (2)

where each v_G is a separate vector of latent variables supported only on the group G. The latent group lasso (2) can be written in terms of atomic norms, where the atomic set is

    {x : ‖x‖_p ≤ 1, supp(x) ⊆ G for some G ∈ G}.

This set allows vectors supported on any one group.
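To make the atomic set concrete, here is a minimal sketch (ours, not from the paper, using a toy pair of overlapping groups) that checks membership: a vector is an atom of the latent group lasso if it has ℓp norm at most one and its support fits inside a single group.

```python
def is_lgl_atom(x, groups, p=2):
    """Membership test for {x : ||x||_p <= 1, supp(x) contained in some G}."""
    support = {i for i, v in enumerate(x) if v != 0}
    norm = sum(abs(v) ** p for v in x) ** (1.0 / p)
    return norm <= 1 and any(support <= set(G) for G in groups)

# Toy overlapping groups on indices 0..4 (an illustrative assumption).
G = [{0, 1, 2}, {2, 3, 4}]
print(is_lgl_atom([0.5, 0.5, 0.0, 0.0, 0.0], G))  # True: fits in the first group
print(is_lgl_atom([0.5, 0.0, 0.0, 0.5, 0.0], G))  # False: support straddles two groups
```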
The unit ball is the convex hull of this atomic set.

A different way of extending the ℓ1-norm involves explicit use of a sparsity parameter k. Argyriou et al. [1] introduce the k-support norm Ω_k from the atomic norm perspective. The atoms are the set of k-sparse vectors with unit norm, and the unit ball of the norm is thus

    conv({x : ‖x‖_p ≤ 1, |supp(x)| ≤ k}).   (3)

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The k-support norm with p = 2 offers a tighter alternative to the elastic net, and like the elastic net, it has better estimation performance than the ℓ1 norm, especially in the presence of correlated variables.

Another extension of the ℓ1 norm is to the OSCAR/OWL/SLOPE norms [5, 20, 4], which order the elements of x according to magnitude before weighting them:

    Ω_OWL(x) := Σ_{i∈[n]} w_i |x↓_i|,   (4)

where the weights w_i, i = 1, 2, . . . , n are nonnegative and decreasing, and x↓ denotes the vector x sorted by decreasing absolute value. This family of norms controls the false discovery rate and clusters correlated variables. These norms correspond to applying the ℓ∞ norm to a combinatorial penalty function in the framework of Obozinski and Bach [11, 12], and can be generalized by considering different ℓp-norms. For p = 2, we have the SOWL norm [18], whose variational form is

    Ω_SOWL(x) := (1/2) min_{η∈ℝⁿ₊} Σ_{i∈[n]} ( x²_i / η_i + w_i |η↓_i| ).

We will refer to the generalized version of these norms as pOWL norms. The pOWL norms can be viewed as extensions of the k-support norms from the atomic norm angle, which we will detail later.

Figure 1: Some sparsity-inducing norms. Each arrow represents an extension of a previous norm.
We study the two shaded norms on the right.

In this paper, we study the norms obtained by combining the overlapping group formulations with the k-sparse/OWL formulations, with the aim of obtaining the benefits of both worlds. When the groups do not overlap, the combination is fairly straightforward; see the GrOWL norm introduced by Oswal et al. [13]. We consider two classes of norms for overlapping groups. The latent k-group support (LG(k)) norm, very recently introduced by Rao et al. [15], is defined by the unit ball

    conv({x : ‖x‖_p ≤ 1, supp(x) ⊆ ∪_{G∈G_k} G for some subset G_k ⊆ G with k groups}),   (5)

directly extending the k-support norm definition to unions of groups. We introduce the latent group smooth OWL (LGS-OWL) norm, which similarly extends OWL/SOWL/GrOWL. These norms can be applied in the same settings where the latent group lasso has proven to be useful, while adapting better to correlations. We explain how the norms are derived from a combinatorial penalty perspective using the work of Obozinski and Bach [11, 12], and also provide explicit atomic-norm formulations. The LGS-OWL norm can be seen as a combination of k-support norms across different k.

The rest of this paper focuses on computational aspects of these norms. Both the LG(k) norm and the LGS-OWL norm are in general NP-hard to compute. Despite this hardness result, we devise a computational approach that utilizes tree decompositions of the underlying group intersection graph.
The key parameter affecting the efficiency of our algorithms is the treewidth tw of the group intersection graph, which is small for certain graph structures such as chains, trees, and cycles. Certain problems with hierarchical groups, like image recovery, can have a tree structure [17, 3].

Our first main technical contribution is a dynamic program for the best k-group sparse approximation problem, which has time complexity O(2^O(tw) · mk + n), where m is the total number of groups. For group intersection graphs with a tree structure (tw = 1), this leads to an O(mk + n) algorithm, significantly improving on the O(m²k + n) algorithm presented in [3]. Next, we build on the principles behind the dynamic program to construct extended formulations of O(2^O(tw) · mk² + n) size for LG(k) and O(2^O(tw) · m³ + n) for LGS-OWL, improving by a factor of k or m respectively in the special case in which the tree decomposition is a chain. This approach also yields extended formulations of size O(nk) and O(n²) for the k-support and pOWL norms, respectively. (Previously, only an O(n²) linear program was known for OWL [5].) We thus facilitate incorporation of these norms into standard convex programming solvers.

Related Work. Obozinski and Bach [11, 12] develop a framework for penalties derived by convexifying the sum of a combinatorial function F and an ℓp term. They describe algorithms for computing the proximal operators and norms for the case of submodular F. We use their framework, but note that the algorithms they provide cannot be applied since our functions are not submodular.

Two other works focus directly on sparsity of unions of overlapping groups. Rao et al. [15] introduce the LG(k) norm and approximate it via variable splitting.
Baldassarre et al. [3] study the best k-group sparse approximation problem, which they prove is NP-hard. For tree-structured intersection graphs, they derive the aforementioned dynamic program with complexity O(m²k + n).

For the case of p = ∞, a linear programming relaxation for the unit ball of the latent k-group support norm is provided by Halabi and Cevher [9, Section 5.4]. This linear program is tight if the group-element incidence matrix augmented with an all-ones row is totally unimodular. This condition can be violated by simple tree-structured intersection graphs with just four groups.

Notation and Preliminaries. Given A ⊆ [n], the vector x_A is the subvector of x ∈ ℝⁿ corresponding to the index set A. For collections of groups G, we use m to denote the number of groups in G, that is, m = |G|. We assume that ∪_{G∈G} G = [n], so that every index i ∈ [n] appears in at least one group G ∈ G. The discrete function C_G(A) denotes the minimum number of groups from G needed to cover A (the smallest set cover).

2 Overlapping Group Norms with Group Sparsity-Related Parameters

We now describe the LG(k) and LGS-OWL norms from the combinatorial penalty perspective by Obozinski and Bach [11, 12], providing an alternative theoretical motivation for the LG(k) norm and formally motivating and defining LGS-OWL. Given a combinatorial function F : {A ⊆ [n]} → ℝ ∪ {+∞} and an ℓp norm, a norm can be derived by taking the tightest positively homogeneous convex lower bound of the combined penalty function F(supp(x)) + ν‖x‖_p^p.
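The combinatorial functions used below are built from the cover number C_G(A). As a brute-force illustration (ours, with a toy chain of overlapping groups), C_G(A) can be computed by enumerating subcollections of groups; the exponential cost of this enumeration foreshadows the hardness results that follow.

```python
from itertools import combinations

def cover_number(A, groups):
    """C_G(A): minimum number of groups whose union contains A.
    Returns float('inf') if A cannot be covered. Exponential-time enumeration."""
    A = set(A)
    if not A:
        return 0
    for r in range(1, len(groups) + 1):
        for sub in combinations(groups, r):
            if A <= set().union(*sub):
                return r
    return float('inf')

# A chain of overlapping groups, as in the paper's running example.
G = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}]
print(cover_number({2, 4}, G))  # 2: the first two groups are needed
```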
Defining q to satisfy 1/p + 1/q = 1 (so that ℓp and ℓq are dual), this procedure results in the norm Ω^F_p, which is given by the convex envelope of the function Θ^F_p(x) := q^{1/q} (pν)^{1/p} F(supp(x))^{1/q} ‖x‖_p, whose unit ball is

    conv({x ∈ ℝⁿ : ‖x‖_p ≤ F(supp(x))^{−1/q}}).   (6)

The norms discussed in this paper can be cast in this framework. Recall that the definition of OWL (4) includes nonnegative weights w1 ≥ w2 ≥ . . . ≥ wn ≥ 0. Defining h : [n] → ℝ to be the monotonically increasing concave function h(k) = Σ_{i=1}^k w_i, we obtain

    k-support : F(A) = 0 if A = ∅; 1 if |A| ≤ k; ∞ otherwise,
    LG(k)     : F(A) = 0 if A = ∅; 1 if C_G(A) ≤ k; ∞ otherwise,
    pOWL      : F(A) = h(|A|),
    LGS-OWL   : F(A) = h(C_G(A)).

The definitions of the k-support and LG(k) balls from (3) and (5), respectively, match (6). As for the OWL norms, we can express their unit ball by

    conv( ∪_{i=1}^m { x ∈ ℝⁿ : ‖x‖_p ≤ h(i)^{−1/q}, C_G(supp(x)) = i } ).   (7)

This can be seen as taking all of the k-support or LG(k) atoms for each value of k, scaling them according to the value of k, then taking the convex hull of the resulting set. Hence, the OWL norms can be viewed as a way of interpolating the k-support norms across all values of k. We take advantage of this interpretation in constructing extended formulations.

Hardness Results. Optimizing with the cardinality or non-overlapping group based penalties is straightforward, since the well-known PAV algorithm [2] allows us to exactly compute the proximal operator in O(n log n) time [12].
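For intuition, here is a sketch of the standard OWL/SLOPE proximal operator via pool-adjacent-violators, in the spirit of [2, 4]; this is our own illustrative implementation, not the paper's code, and it assumes the weights are nonnegative and nonincreasing.

```python
def prox_owl(y, w):
    """Prox of x -> sum_i w_i |x|_(i) (OWL/SLOPE penalty), for nonnegative
    nonincreasing weights w. Sort, subtract weights, apply pool-adjacent-
    violators (PAV) to restore monotonicity, clip at zero, then unsort."""
    n = len(y)
    order = sorted(range(n), key=lambda i: -abs(y[i]))
    v = [abs(y[order[r]]) - w[r] for r in range(n)]
    # PAV: merge adjacent blocks whenever averages would violate nonincreasing order.
    blocks = []  # each block is [total, count]
    for val in v:
        blocks.append([val, 1])
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]:
            t, c = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
    x = [0.0] * n
    r = 0
    for total, count in blocks:
        avg = max(total / count, 0.0)  # clip at zero
        for _ in range(count):
            x[order[r]] = avg if y[order[r]] >= 0 else -avg
            r += 1
    return x

# Equal weights reduce to soft-thresholding by that weight.
print(prox_owl([1.0, 3.0], [0.5, 0.5]))  # [0.5, 2.5]
```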
However, the picture is different when we allow overlapping groups. There are no fast exact algorithms for the overlapping group lasso, and iterative algorithms are typically used. Introducing the group sparsity parameters makes the problem even harder.

Theorem 2.1. The following problems are NP-hard for both Ω_LG(k) and Ω_LGS-OWL when p > 1:

    Compute Ω(y),   (norm computation)
    argmin_{x∈ℝⁿ} (1/2)‖x − y‖²₂ such that Ω(x) ≤ μ,   (projection operator)
    argmin_{x∈ℝⁿ} (1/2)‖x − y‖²₂ + λΩ(x).   (proximal operator)

Therefore, other problems that incorporate these norms are also hard. Note that even if we only allow each element to be in at most two groups, the problem is already hard. We will show in the next two sections that these problems are tractable if the treewidth of the group intersection graph is small.

3 A Dynamic Program for Best k-Group Approximation

The best k-group approximation problem is the discrete optimization problem

    argmin_x ‖y − x‖²₂ such that C_G(supp(x)) ≤ k,   (8)

where the goal is to compute the projection of a vector y onto a union of subspaces, each defined by a subcollection of k groups. The solution to (8) has the form

    x′_i = y_i if i is in the chosen support, and x′_i = 0 otherwise.

As mentioned above, Baldassarre et al. [3] show that this problem is NP-hard. They provide a dynamic program that acts on the group intersection graph and focus specifically on the case where this graph is a tree, obtaining an O(m²k + n) dynamic programming algorithm.
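Before turning to the dynamic program, it may help to see problem (8) solved by direct enumeration over all subsets of at most k groups, keeping the support that retains the most energy of y. This exponential-time sketch (ours, with a toy group structure) is only a baseline against which the treewidth-based approach should be understood.

```python
from itertools import combinations

def best_k_group_approx(y, groups, k):
    """Brute-force solution of (8): zero out y outside the best union of <= k groups."""
    n = len(y)
    best_support, best_energy = set(), 0.0
    for r in range(1, k + 1):
        for sub in combinations(groups, r):
            support = set().union(*sub)
            energy = sum(y[i] ** 2 for i in support)
            if energy > best_energy:
                best_energy, best_support = energy, support
    return [y[i] if i in best_support else 0.0 for i in range(n)]

y = [0.1, 5.0, 0.2, 4.0, 0.1, 3.0, 0.1]
G = [{0, 1, 2}, {2, 3, 4}, {4, 5, 6}]
print(best_k_group_approx(y, G, k=2))  # keeps the first two groups
```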
In this section, we also start by using group intersection graphs, but instead focus on the tree decomposition of this graph, which yields a more general approach.

3.1 Group Intersection Graphs and Tree Decompositions

We can represent the interactions between the different groups using an intersection graph, which is an undirected graph I_G = (G, E_G) in which each vertex denotes a group and two groups are connected if and only if they overlap. For example, if the collection of groups is {{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, . . .}, then the intersection graph is simply a chain. If each group corresponds to a parent and all its children in a rooted tree, the intersection graph is also a tree.

The group intersection graph highlights the dependencies between different groups. Algorithms for this problem need to be aware of how picking one group may affect the choice of another, connected group. (If the groups do not overlap, then no groups are connected and a simple greedy approach suffices.) A tree decomposition of I_G is a more precise way of representing these dependencies. We provide the definition of tree decompositions and treewidth below and illustrate the core ideas in Figure 2. Tree decompositions are a fundamental tool in parametrized complexity, leading to efficient algorithms if the parameter in question is small. See [7, 8] for a more comprehensive overview.

Figure 2: From groups to a group intersection graph to a tree decomposition of width 2.

A tree decomposition of (V, E) is a tree T with vertices X = {X1, X2, . . . , XN}, satisfying the following conditions: (1) Each Xi is a subset of V, and the union of all sets Xi gives V. (2) For every edge (v, w) in E, there is a vertex Xi that contains both v and w. (3) For each v ∈ V, the vertices that contain v form a connected subtree of T.
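The three conditions above can be checked mechanically. The following sketch (ours, exercised on an illustrative 4-cycle, whose treewidth is 2) verifies whether a candidate set of bags arranged in a tree is a valid tree decomposition of a graph.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check the three tree-decomposition conditions. `tree_edges` are index
    pairs into `bags`; that they form a tree is assumed, not verified."""
    # (1) Every graph vertex appears in some bag.
    if set().union(*bags) != set(vertices):
        return False
    # (2) Every graph edge is contained in some bag.
    for v, w in edges:
        if not any(v in B and w in B for B in bags):
            return False
    # (3) The bags containing each vertex form a connected subtree.
    for v in vertices:
        hosts = {i for i, B in enumerate(bags) if v in B}
        reached, frontier = set(), [next(iter(hosts))]
        while frontier:
            i = frontier.pop()
            if i in reached:
                continue
            reached.add(i)
            for a, b in tree_edges:
                if a == i and b in hosts:
                    frontier.append(b)
                elif b == i and a in hosts:
                    frontier.append(a)
        if reached != hosts:
            return False
    return True

# A 4-cycle a-b-c-d-a with a valid width-2 decomposition: bags {a,b,c} - {a,c,d}.
V = ['a', 'b', 'c', 'd']
E = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'a')]
print(is_tree_decomposition(V, E, [{'a', 'b', 'c'}, {'a', 'c', 'd'}], [(0, 1)]))  # True
```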
The width of a tree decomposition is max_i |Xi| − 1, and the treewidth of a graph (denoted by tw) is the smallest width among all tree decompositions. The tree decomposition is not unique, and there is always a tree decomposition with number of nodes |X| ≤ |V| (see for example Lemma 10.5.2 in [7]). Henceforth, we will assume that |X| ≤ m.

The treewidth tw is modest for many types of graphs. For example, the treewidth is bounded for the tree (tw = 1), the cycle (tw = 2), and series-parallel graphs (tw = 2). Computing tree decompositions with optimal width for these graphs can be done in linear time. On the other hand, the grid graph has large treewidth (tw ≈ √n), and checking if a graph has tw ≤ k is NP-complete¹.

3.2 A Dynamic Program for Tree Decompositions

Given a collection of groups G, a corresponding tree decomposition T(G) of the group intersection graph, and a vector y ∈ ℝⁿ, we provide a dynamic program for problem (8), the best k-group approximation of y.

The tree decomposition has several features that we can exploit. The tree structure provides a natural order for processing the vertices, which are subcollections of groups. Properties (1) and (2) yield a natural way to map elements i ∈ [n] onto vertices in the tree, indicating when to include y_i in the process. Finally, the connected subtree corresponding to each group G as a result of property (3) means that we only need to keep explicit information about G for that part of the computation.

The high-level view of our approach is described below. Details appear in the supplementary material.

Preprocessing: For each i ∈ [n], let G(i) denote the set of all groups that contain i. We have three data structures: A and V, which are both indexed by (X, Y), with X ∈ X and Y ⊆ X; and T, which is indexed by (X, Y, s), with s ∈ {0, 1, . . . , k}.

1. Root the tree decomposition and process the nodes from root to leaves: At each node X, add an index i ∈ [n] to A(X, G(i)) if i is unassigned and G(i) ⊆ X.

2. Set V(X, Y) ← Σ { y²_i : i ∈ A(X, G(i)), Y ∩ G(i) ≠ ∅ }.

Main Process: At each vertex X in the tree decomposition, we are allowed to pick groups Y to include in the support of the solution. The s term in T(X, Y, s) indicates the current group sparsity "budget" that has been used. Proposition 3.2 below gives the semantic meaning behind each entry in T. We process the nodes from the leaves to the root to fill T. At each step, the entries for node Xp will be updated with information from its children.

The update for a leaf Xp is simply T(Xp, Yp, s) ← V(Xp, Yp) if |Yp| = s. If |Yp| ≠ s, we mark T(Xp, Yp, s) as invalid. For non-leaf Xp, we need to ensure that the groups chosen by the parent and the child are compatible. We ensure this property via constraints of the form Yc ∩ Xp = Yp ∩ Xc. For a single child Xc we have

    T(Xp, Yp, s) ← max_{Y : Y∩Xp = Yp∩Xc} { T(Xc, Y, s − s0) : |Yp ∩ Xc| = s0 } + V(Xp, Yp),   (9)

and finally for multiple children Xc(1), . . . , Xc(d) of Xp, we set T(Xp, Yp, s) as

    max_{Yi : Yi∩Xp = Yp∩Xc(i) for each i, Σ_{i=1}^d s_i = s − s0} { Σ_{i∈[d]} T(Xc(i), Yi, s_i) : |Yp ∩ ∪_{i∈[d]} Xc(i)| = s0 } + V(Xp, Yp).   (10)

After making each update, we keep track of which Yi was used for each of the children for T(Xp, Yp, s). This allows us to backtrack to recover the solution after T has been filled.

The next lemma and proposition prove the correctness of this dynamic program.
The lemma follows from the fact that every clique in a graph is contained in some node of any tree decomposition, while the proposition follows by induction from the leaf nodes.

¹Nonetheless, there is significant research on developing exact and heuristic tree decomposition algorithms. There are regular competitions for better implementations [6, pacechallenge.wordpress.com].

Lemma 3.1. Every index in [n] is assigned in the first preprocessing step.

Proposition 3.2. For a node X, let y_X be the vector y restricted to just the indices i assigned to nodes below and including X. Each entry T(X, Y, s) is the squared ℓ2-norm of the best projection of y_X, subject to the condition that, besides the groups in Y, at most s − |Y| further groups are allowed to be used.

We now prove the time complexity of this algorithm. Proposition 3.4 describes the time complexity of the update when there are many children. It uses the following simple lemma about max-convolutions. Computing the other updates is straightforward.

Lemma 3.3. The max-convolution f between two concave functions g1, g2 : {0, 1, . . . , k} → ℝ, defined by f(i) := max_j {g1(j) + g2(i − j)}, can be computed in O(k) time.

Proposition 3.4. The update (10) for a fixed Xp, Yp across all values s ∈ {0, 1, . . . , k} can be implemented in O(2^O(tw) · dk) time.

Combining timing and correctness results gives us the desired algorithmic result. This approach significantly improves on the results of Baldassarre et al. [3]. Their approach is specific to groups whose intersection graph is a tree and uses O(m²k + n) time.

Theorem 3.5. Given G and a corresponding tree decomposition T_G with treewidth tw, projection onto the corresponding k-group model can be done in O(2^O(tw) · (mk + n)) time.
When the group intersection graph is a tree, the projection takes O(mk + n) time.

4 Extended Formulations from Tree Decompositions

Here we model explicitly the unit balls of LG(k) (5) and LGS-OWL (7). The principles behind this formulation are very similar to the dynamic program in the previous section.

We first consider the latent k-group support norm, whose atoms are

    {x : ‖x‖_p ≤ 1, supp(x) ⊆ ∪_{G∈G_k} G for some subset G_k ⊆ G with k groups}.

The following process describes a way of selecting an atom; our extended formulation encodes this process mathematically. We introduce variables b, which represent the ℓp budget at a given node, choice of groups, and group sparsity budget. We start at the root Xr, with an ℓp budget of μ and a group sparsity budget of k:

    Σ_{Y⊆Xr} b(Xr, Y, k−|Y|) ≤ μ.   (11)

We then start moving towards the leaves, as follows.

1. Suppose we have picked some of the groups at a node. Assign some of the ℓp budget to the x_i terms, where the index i is compatible with the node and the choice of groups.

2. Move on to the child and pick the groups we want to use, considering only groups that are compatible with the parent. Debit the group budget accordingly. If there are multiple children, spread the ℓp and group budgets among them before picking the groups.

The first step is represented by the following relations.
Intermediate variables z and a are required to ensure that we spread the ℓp budget correctly among the valid x_i.

    b(X,Y,s) ≥ ‖(z(X,Y,s), u(X,Y,s))‖_p,   (12)
    z(X,Y,s) ≥ ‖{a(X,(Y,Y′),s) : Y′ ∩ Y ≠ ∅}‖_p,   (13)
    a(X,Y′,s) ≤ Σ_{Y⊆X} a(X,(Y,Y′),s),   (14)
    ‖x_{A(X,Y)}‖ ≤ Σ_{Y′ : A(X,Y)=A(X,Y′)} Σ_{s=0}^{k} a(X,Y′,s).   (15)

The second step is represented by the following inequality in the case of a single child:

    u(Xp,Yp,s) ≥ Σ { b(Xc,(Yp,Y),s−s0) : Y ∩ Xp = Yp ∩ Xc, |Yp ∩ Xc| = s0 }.   (16)

When there are multiple children, we need to introduce more intermediate variables to spread the group budget correctly. The technique here is similar to the one used in the proof of Proposition 3.4; we defer details to the supplementary material. In both cases, we need to collect the budgets that have been sent from each Yp:

    b(Xc,Y,s) ≤ Σ_{Yp} b(Xc,(Yp,Y),s).   (17)

Those b variables unreachable by the budget transfer process are set to 0. Our main theorem about the correctness of the construction in this section follows from the fact that when μ = 1, every extreme point with nonzero x in our extended formulation is an atom of the corresponding LG(k).

Theorem 4.1. We can model the set Ω_LG(k)(x) ≤ μ using O(2^O(tw) · (mk² + n)) variables and inequalities in general. When the tree decomposition is a chain, O(2^O(tw) · (mk + n)) suffices.

For the unit ball of Ω_LGS-OWL, we can exploit the fact that the atoms of Ω_LGS-OWL are obtained from Ω_LG(k) across different k at different scales.
Instead of using the inequality (11) at the root node, we have

    Σ_{Y⊆Xr} h(k)^{1/q} b(Xr, Y, k−|Y|) ≤ μ,

which leads to a program of size O(2^O(tw) · (m² + n)) for chains and O(2^O(tw) · (m³ + n)) for trees.

5 Empirical Observations and Results

The extended formulations above can be implemented in modeling software such as CVX. This may incur a large processing overhead, and it is often faster to implement these directly in a convex optimization solver such as Gurobi or MOSEK. Use of the ℓ∞-norm leads to a linear program, which can be significantly faster than the second-order conic program that results from the ℓ2-norm.

We evaluated the performance of LG(k) and LGS-OWL on linear regression problems min_x (1/2)‖y − Ax‖² + λΩ(x). In the scenarios considered, we use the latent group lasso as a baseline. We test both the ℓ2 and ℓ∞ variants of the various norms. Following [13] (which describes GrOWL), we consider two different types of weights for LGS-OWL. The linear variant sets w_i = 1 − (i − 1)/n for i ∈ [n], whereas in the spike version, we set w_1 = 1 and w_i = 0.25 for i = 2, 3, . . . , n. The regularization term λ was chosen by grid search over {10⁻², 10⁻¹·⁹⁵, . . . , 10⁴} for each experiment.

The metrics we use are support recovery and estimation quality. For the support recovery experiments, we count the number of times the correct support was identified. We also compute the root mean square error (RMSE) of ‖x − x∗‖₂ (estimation error).²

We also tested the standard lasso, elastic net, and k-support and OWL norms, but these norms performed poorly. In our experiments they were not able to recover the exact correct support in any run. The estimation performance of the k-support norms and the elastic net was worse than that of the corresponding latent group lasso, and likewise for OWL vs.
LGS-OWL.

Experiments. We used 20 groups of variables where each successive group overlaps by two elements with the next [10, 14]. The groups are given by {1, . . . , 10}, {9, . . . , 18}, . . . , {153, . . . , 162}. For the first set of experiments, the support of the true input x∗ is a cluster of five groups in the middle of x, with x_i = 1 on the support. For the second set of experiments, the original x is supported by two disjoint clusters of five overlapping groups each, with x_i = 2 on one cluster and x_i = 3 on the other.

Each entry of the matrix A is initially chosen to be i.i.d. N(0, 1). We then introduce correlations between groups in the same cluster in A. Within each cluster of groups, we replicate the same set of columns for each group in the non-overlapping portions of the group (that is, every pair of groups in a cluster shares at least 6 columns, and adjacent groups share 8 columns). We then introduce noise by adding i.i.d. elements from N(0, 0.05) so that the replications are not exact. Finally, we generate y by adding i.i.d. noise from N(0, 0.3) to each component of Ax∗.

We present support recovery results in Figure 3 for the ℓ2 variants of the norms, which perform better than the ℓ∞ versions, though the relative results between the different norms hold. In the appendix we provide the graphs for support recovery and estimation quality as well as other observations.

²It is standard in the literature to compute the RMSE of the prediction or estimation quality. RMSE metrics are not ideal in practice since we should "debias" x to offset shrinkage due to the regularization term.

Figure 3: Support recovery performance as the number of measurements (height of A) increases. The vertical axis indicates the number of trials (out of 100) for which the correct support was identified.
The two left graphs correspond to the first configuration of group supports (five groups), while the others correspond to the second configuration (ten groups). Each line represents a different method. In the first and third graphs, we plot LG(k) for different values of k, increasing from 1 to the "ground truth" value. Note that k = 1 is exactly the latent group lasso. In the second and fourth graphs, we plot LGS-OWL for the different choices of weights w_i discussed in the text.

Our methods can significantly outperform the latent group lasso in both support recovery and estimation quality. We provide a summary below; more details are provided in the supplementary material.

We first focus on support recovery. There is a significant jump in performance when k is the size of the true support. Note that exceeding the ground-truth value makes recovery of the true support impossible in the presence of noise. For smaller values of k, the results range from slight improvement (especially when k = 4 or k = 8 in the first and second experiments, respectively) to mixed results (for large numbers of rows in A and small k). The LGS-OWL norms can provide performance almost as good as the best settings of k for LG(k), and can be used when the number of groups is unknown. We expect to see better performance for well-tuned OWL weights. We see similar results for estimation performance. Smaller values of k provide little to no advantage, while larger values of k and the LGS-OWL norms can offer significant improvement.

6 Discussion and Extensions

We introduce a variant of the OWL norm for overlapping groups and provide the first tractable approaches for this and the latent k-group support norm (via extended formulations) under a bounded treewidth condition. The projection algorithm for the best k-group sparse approximation problem generalizes and improves on the algorithm by Baldassarre et al. [3].
Numerical results demonstrate that the norms can provide significant improvement in support recovery and estimation.

A family of graphs with many applications and large treewidth is the set of grid graphs. Groups over collections of adjacent pixels/voxels lead naturally to such group intersection graphs, and it remains an open question whether polynomial-time algorithms exist for this set of graphs. Another avenue for research is to derive and evaluate efficient approximations to these new norms.

It is tempting to apply recovery results on the latent group lasso here, since LG(k) can be cast as a latent group lasso instance with groups {G′ : G′ is a union of up to k groups of G}. However, the consistency results of [10] apply only under the strict condition that the target vector is supported exactly by a unique set of k groups, and the Gaussian width results of [16] do not give meaningful bounds even when the groups are disjoint and k = 2. Developing theoretical guarantees on the performance of these methods requires a much better understanding of the geometry of unions of overlapping groups.

We can easily extend the dynamic program to handle the case in which we want both k-group sparsity and overall sparsity of s. For tree-structured group intersection graphs, our dynamic program has time complexity O(mks + n log s) instead of the Õ(m²ks² + mn) of [3]. This yields a variant of the above norms that again has similar extended formulations. These variants could be employed as an alternative to the sparse overlapping sets LASSO by Rao et al. [14]. We leave this to future work.

Acknowledgements. This work was supported by NSF award CMMI-1634597, ONR Award N00014-13-1-0129, and AFOSR Award FA9550-13-1-0138.

References

[1] Argyriou, A., Foygel, R., and Srebro, N. (2012). Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems, pages 1457–1465.

[2] Ayer, M., Brunk, H.
D., Ewing, G. M., Reid, W. T., and Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26(4):641–647.

[3] Baldassarre, L., Bhan, N., Cevher, V., Kyrillidis, A., and Satpathi, S. (2016). Group-sparse model selection: Hardness and relaxations. IEEE Transactions on Information Theory, 62(11):6508–6534.

[4] Bogdan, M., van den Berg, E., Sabatti, C., Su, W., and Candès, E. J. (2015). SLOPE: Adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3):1103–1140.

[5] Bondell, H. D. and Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123.

[6] Dell, H., Husfeldt, T., Jansen, B. M. P., Kaski, P., Komusiewicz, C., and Rosamond, F. A. (2016). The first Parameterized Algorithms and Computational Experiments challenge. In 11th International Symposium on Parameterized and Exact Computation (IPEC 2016), August 24–26, 2016, Aarhus, Denmark, pages 30:1–30:9.

[7] Downey, R. G. and Fellows, M. R. (1999). Parameterized Complexity. Monographs in Computer Science. Springer, New York, NY.

[8] Downey, R. G. and Fellows, M. R. (2013). Fundamentals of Parameterized Complexity. Texts in Computer Science. Springer, London.

[9] Halabi, M. E. and Cevher, V. (2015). A totally unimodular view of structured sparsity. In Lebanon, G. and Vishwanathan, S., editors, Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS 2015), pages 223–231.

[10] Jacob, L., Obozinski, G., and Vert, J.-P. (2009). Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pages 433–440, New York, NY, USA. ACM.

[11] Obozinski, G. and Bach, F. (2012).
Convex relaxation for combinatorial penalties. Technical report.

[12] Obozinski, G. and Bach, F. (2016). A unified perspective on convex structured sparsity: Hierarchical, symmetric, submodular norms and beyond. Technical report.

[13] Oswal, U., Cox, C., Lambon-Ralph, M., Rogers, T., and Nowak, R. (2016). Representational similarity learning with application to brain networks. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pages 1041–1049, New York, NY, USA. PMLR.

[14] Rao, N., Cox, C., Nowak, R., and Rogers, T. T. (2013). Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information Processing Systems 26, pages 2202–2210.

[15] Rao, N., Dudík, M., and Harchaoui, Z. (2017). The group k-support norm for learning with structured sparsity. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2402–2406.

[16] Rao, N., Recht, B., and Nowak, R. (2012). Universal measurement bounds for structured sparse signal recovery. In Lawrence, N. D. and Girolami, M., editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012), pages 942–950, La Palma, Canary Islands. PMLR.

[17] Rao, N. S., Nowak, R. D., Wright, S. J., and Kingsbury, N. G. (2011). Convex approaches to model wavelet sparsity patterns. In 2011 18th IEEE International Conference on Image Processing (ICIP), pages 1917–1920. IEEE.

[18] Sankaran, R., Bach, F., and Bhattacharya, C. (2017). Identifying groups of strongly correlated variables through smoothed ordered weighted ℓ1-norms. In Singh, A. and Zhu, J., editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), pages 1123–1131, Fort Lauderdale, FL, USA.
PMLR.

[19] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67.

[20] Zeng, X. and Figueiredo, M. A. T. (2014). The ordered weighted ℓ1 norm: Atomic formulation, projections, and algorithms. arXiv:1409.4271.