{"title": "Sparse Prediction with the $k$-Support Norm", "book": "Advances in Neural Information Processing Systems", "page_first": 1457, "page_last": 1465, "abstract": "We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an $\\ell_2$ penalty. We show that this new norm provides a tighter relaxation than the elastic net, and is thus a good replacement for the Lasso or the elastic net in sparse prediction problems. But through studying our new norm, we also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use.", "full_text": "Sparse Prediction with the $k$-Support Norm\n\nAndreas Argyriou\n\u00c9cole Centrale Paris\nargyrioua@ecp.fr\n\nRina Foygel\nDepartment of Statistics, Stanford University\nrinafb@stanford.edu\n\nNathan Srebro\nToyota Technological Institute at Chicago\nnati@ttic.edu\n\nAbstract\n\nWe derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an $\\ell_2$ penalty. We show that this new $k$-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use.\n\n1 Introduction\n\nRegularizing with the $\\ell_1$ norm, when we expect a sparse solution to a regression problem, is often justified by $\\|w\\|_1$ being the \u201cconvex envelope\u201d of $\\|w\\|_0$ (the number of non-zero coordinates of a vector $w \\in \\mathbb{R}^d$). That is, $\\|w\\|_1$ is the tightest convex lower bound on $\\|w\\|_0$. But we must be careful with this statement: for sparse vectors with large entries, $\\|w\\|_0$ can be small while $\\|w\\|_1$ is large. In order to discuss convex lower bounds on $\\|w\\|_0$, we must impose some scale constraint. 
A more accurate statement is that $\\|w\\|_1 \\le \\|w\\|_\\infty \\|w\\|_0$, and so, when the magnitudes of entries in $w$ are bounded by 1, then $\\|w\\|_1 \\le \\|w\\|_0$, and indeed it is the largest such convex lower bound. Viewed as a convex outer relaxation,\n\n$$S_k^{(\\infty)} := \\{ w \\mid \\|w\\|_0 \\le k, \\|w\\|_\\infty \\le 1 \\} \\subseteq \\{ w \\mid \\|w\\|_1 \\le k \\}.$$\n\nIntersecting the right-hand side with the $\\ell_\\infty$ unit ball, we get the tightest convex outer bound (convex hull) of $S_k^{(\\infty)}$:\n\n$$\\{ w \\mid \\|w\\|_1 \\le k, \\|w\\|_\\infty \\le 1 \\} = \\mathrm{conv}(S_k^{(\\infty)}).$$\n\nHowever, in our view, this relationship between $\\|w\\|_1$ and $\\|w\\|_0$ yields disappointing learning guarantees, and does not appropriately capture the success of the $\\ell_1$ norm as a surrogate for sparsity. In particular, the sample complexity$^1$ of learning a linear predictor with $k$ non-zero entries by empirical risk minimization inside this class (an NP-hard optimization problem) scales as $O(k \\log d)$, but relaxing to the constraint $\\|w\\|_1 \\le k$ yields a sample complexity which scales as $O(k^2 \\log d)$, because the sample complexity of $\\ell_1$-regularized learning scales quadratically with the $\\ell_1$ norm [11, 20].\n\nPerhaps a better reason for the $\\ell_1$ norm being a good surrogate for sparsity is that, not only do we expect the magnitude of each entry of $w$ to be bounded, but we further expect $\\|w\\|_2$ to be small. In a regression setting, with a vector of features $x$, this can be justified when $\\mathbb{E}[(x^\\top w)^2]$ is bounded (a reasonable assumption) and the features are not too correlated; see, e.g., [15]. 
More broadly, especially in the presence of correlations, we might require this as a modeling assumption to aid in robustness and generalization. In any case, we have $\\|w\\|_1 \\le \\|w\\|_2 \\sqrt{\\|w\\|_0}$, and so if we are interested in predictors with bounded $\\ell_2$ norm, we can motivate the $\\ell_1$ norm through the following relaxation of sparsity, where the scale is now set by the $\\ell_2$ norm:\n\n$$\\{ w \\mid \\|w\\|_0 \\le k, \\|w\\|_2 \\le B \\} \\subseteq \\{ w \\mid \\|w\\|_1 \\le B\\sqrt{k} \\}.$$\n\nThe sample complexity when using the relaxation now scales as$^2$ $O(k \\log d)$.\n\n$^1$ We define this as the number of observations needed in order to ensure expected prediction error no more than $\\epsilon$ worse than that of the best $k$-sparse predictor, for an arbitrary constant $\\epsilon$ (that is, we suppress the dependence on $\\epsilon$ and focus on the dependence on the sparsity $k$ and dimensionality $d$).\n\nSparse + $\\ell_2$ constraint. Our starting point is then that of combining sparsity and $\\ell_2$ regularization, and learning a sparse predictor with small $\\ell_2$ norm. We are thus interested in classes of the form\n\n$$S_k^{(2)} := \\{ w \\mid \\|w\\|_0 \\le k, \\|w\\|_2 \\le 1 \\}.$$\n\nAs discussed above, the class $\\{ \\|w\\|_1 \\le \\sqrt{k} \\}$ (corresponding to the standard Lasso) provides a convex relaxation of $S_k^{(2)}$. But clearly we can get a tighter relaxation by keeping the $\\ell_2$ constraint:\n\n$$\\mathrm{conv}(S_k^{(2)}) \\subseteq \\{ w \\mid \\|w\\|_1 \\le \\sqrt{k}, \\|w\\|_2 \\le 1 \\} \\subsetneq \\{ w \\mid \\|w\\|_1 \\le \\sqrt{k} \\}. \\tag{1}$$\n\nConstraining (or equivalently, penalizing) both the $\\ell_1$ and $\\ell_2$ norms, as in (1), is known as the \u201celastic net\u201d [5, 21] and has indeed been advocated as a better alternative to the Lasso. In this paper, we ask whether the elastic net is the tightest convex relaxation to sparsity plus $\\ell_2$ (that is, to $S_k^{(2)}$) or whether a tighter, and better, convex relaxation is possible.\n\nA new norm. We consider the convex hull (tightest convex outer bound) of $S_k^{(2)}$,\n\n$$C_k := \\mathrm{conv}(S_k^{(2)}) = \\mathrm{conv}\\{ w \\mid \\|w\\|_0 \\le k, \\|w\\|_2 \\le 1 \\}. \\tag{2}$$\n\nWe study the gauge function associated with this convex set, that is, the norm whose unit ball is given by (2), which we call the $k$-support norm. We show that, for $k > 1$, this is indeed a tighter convex relaxation than the elastic net (that is, both inclusions in (1) are in fact strict), and is therefore a better convex constraint than the elastic net when seeking a sparse, low $\\ell_2$-norm linear predictor. We thus advocate using it as a replacement for the elastic net.\n\nHowever, we also show that the gap between the elastic net and the $k$-support norm is at most a factor of $\\sqrt{2}$, corresponding to a factor of two difference in the sample complexity. Thus, our work can also be interpreted as justifying the use of the elastic net, viewing it as a fairly good approximation to the tightest possible convex relaxation of sparsity intersected with an $\\ell_2$ constraint. 
Still, even a factor of two should not necessarily be ignored and, as we show in our experiments, using the tighter $k$-support norm can indeed be beneficial.\n\nTo better understand the $k$-support norm, we show in Section 2 that it can also be described as the group lasso with overlaps norm [10] corresponding to all $\\binom{d}{k}$ subsets of $k$ features. Despite the exponential number of groups in this description, we show that the $k$-support norm can be calculated efficiently in time $O(d \\log d)$ and that its dual is given simply by the $\\ell_2$ norm of the $k$ largest entries. We also provide efficient first-order optimization algorithms for learning with the $k$-support norm.\n\nRelated Work. In many learning problems of interest, Lasso has been observed to shrink too many of the variables of $w$ to zero. In particular, in many applications, when a group of variables is highly correlated, the Lasso may prefer a sparse solution, but we might gain more predictive accuracy by including all the correlated variables in our model. These drawbacks have recently motivated the use of various other regularization methods, such as the elastic net [21], which penalizes the regression coefficients $w$ with a combination of $\\ell_1$ and $\\ell_2$ norms:\n\n$$\\min \\left\\{ \\frac{1}{2} \\|Xw - y\\|^2 + \\lambda_1 \\|w\\|_1 + \\lambda_2 \\|w\\|_2^2 : w \\in \\mathbb{R}^d \\right\\}, \\tag{3}$$\n\nwhere for a sample of size $n$, $y \\in \\mathbb{R}^n$ is the vector of response values, and $X \\in \\mathbb{R}^{n \\times d}$ is a matrix with column $j$ containing the values of feature $j$.\n\n$^2$ More precisely, the sample complexity is $O(B^2 k \\log d)$, where the dependence on $B^2$ is to be expected. Note that if feature vectors are $\\ell_\\infty$-bounded (i.e., individual features are bounded), the sample complexity when using only $\\|w\\|_2 \\le B$ (without a sparsity or $\\ell_1$ constraint) scales as $O(B^2 d)$. That is, even after identifying the correct support, we still need a sample complexity that scales with $B^2$.\n\nThe elastic net can be viewed as a trade-off between $\\ell_1$ regularization (the Lasso) and $\\ell_2$ regularization (Ridge regression [9]), depending on the relative values of $\\lambda_1$ and $\\lambda_2$. In particular, when $\\lambda_2 = 0$, (3) is equivalent to the Lasso. This method, and the other methods discussed below, have been observed to significantly outperform Lasso in many real applications.\n\nThe pairwise elastic net (PEN) [13] is a penalty function that accounts for similarity among features:\n\n$$\\|w\\|_R^{\\mathrm{PEN}} = \\|w\\|_2^2 + \\|w\\|_1^2 - |w|^\\top R |w|,$$\n\nwhere $R \\in [0, 1]^{p \\times p}$ is a matrix with $R_{jk}$ measuring similarity between features $X_j$ and $X_k$. The trace Lasso [6] is a second method proposed to handle correlations within $X$, defined by\n\n$$\\|w\\|_X^{\\mathrm{trace}} = \\|X \\mathrm{diag}(w)\\|_*,$$\n\nwhere $\\|\\cdot\\|_*$ denotes the matrix trace-norm (the sum of the singular values) and promotes a low-rank solution. If the features are orthogonal, then both the PEN and the trace Lasso are equivalent to the Lasso. If the features are all identical, then both penalties are equivalent to Ridge regression (penalizing $\\|w\\|_2$). Another existing penalty is OSCAR [3], given by\n\n$$\\|w\\|_c^{\\mathrm{OSCAR}} = \\|w\\|_1 + c \\sum_{j<k} \\max\\{|w_j|, |w_k|\\}.$$\n\nLike the elastic net, each one of these three methods also \u201cprefers\u201d averaging similar features over selecting a single feature.\n\n2 The $k$-Support Norm\n\nOne argument for the elastic net has been the flexibility of tuning the cardinality $k$ of the regression vector $w$. 
Thus, when groups of correlated variables are present, a larger $k$ may be learned, which corresponds to a higher $\\lambda_2$ in (3). A more natural way to obtain such an effect of tuning the cardinality is to consider the convex hull of cardinality-$k$ vectors,\n\n$$C_k = \\mathrm{conv}(S_k^{(2)}) = \\mathrm{conv}\\{ w \\in \\mathbb{R}^d \\mid \\|w\\|_0 \\le k, \\|w\\|_2 \\le 1 \\}.$$\n\nClearly the sets $C_k$ are nested, and $C_1$ and $C_d$ are the unit balls for the $\\ell_1$ and $\\ell_2$ norms, respectively. Consequently we define the $k$-support norm as the norm whose unit ball equals $C_k$ (the gauge function associated with the $C_k$ ball).$^3$ An equivalent definition is the following variational formula:\n\nDefinition 2.1. Let $k \\in \\{1, \\ldots, d\\}$. The $k$-support norm $\\|\\cdot\\|_k^{sp}$ is defined, for every $w \\in \\mathbb{R}^d$, as\n\n$$\\|w\\|_k^{sp} := \\min \\left\\{ \\sum_{I \\in G_k} \\|v_I\\|_2 : \\mathrm{supp}(v_I) \\subseteq I, \\ \\sum_{I \\in G_k} v_I = w \\right\\},$$\n\nwhere $G_k$ denotes the set of all subsets of $\\{1, \\ldots, d\\}$ of cardinality at most $k$.\n\nThe equivalence is immediate by rewriting $v_I = \\mu_I z_I$ in the above definition, where $\\mu_I \\ge 0$, $z_I \\in C_k$, $\\forall I \\in G_k$, $\\sum_{I \\in G_k} \\mu_I = 1$. In addition, this immediately implies that $\\|\\cdot\\|_k^{sp}$ is indeed a norm. In fact, the $k$-support norm is equivalent to the norm used by the group lasso with overlaps [10], when the set of overlapping groups is chosen to be $G_k$ (however, the group lasso has traditionally been used for applications with some specific known group structure, unlike the case considered here).\n\n$^3$ The gauge function $\\gamma_{C_k} : \\mathbb{R}^d \\to \\mathbb{R} \\cup \\{+\\infty\\}$ is defined as $\\gamma_{C_k}(x) = \\inf\\{ \\lambda \\in \\mathbb{R}_+ : x \\in \\lambda C_k \\}$.\n\nAlthough the variational definition 2.1 is not amenable to computation because of the exponential growth of the set of groups $G_k$, the $k$-support norm is computationally very tractable, with an $O(d \\log d)$ algorithm described in Section 2.2.\n\nAs already mentioned, $\\|\\cdot\\|_1^{sp} = \\|\\cdot\\|_1$ and $\\|\\cdot\\|_d^{sp} = \\|\\cdot\\|_2$. The unit ball of this new norm in $\\mathbb{R}^3$ for $k = 2$ is depicted in Figure 1. We immediately notice several differences between this unit ball and the elastic net unit ball. For example, at points with cardinality $k$ and $\\ell_2$ norm equal to 1, the $k$-support norm is not differentiable, but unlike the $\\ell_1$ or elastic-net norm, it is differentiable at points with cardinality less than $k$. Thus, the $k$-support norm is less \u201cbiased\u201d towards sparse vectors than the elastic net and the $\\ell_1$ norm.\n\nFigure 1: Unit ball of the 2-support norm (left) and of the elastic net (right) on $\\mathbb{R}^3$.\n\n2.1 The Dual Norm\n\nIt is interesting and useful to compute the dual of the $k$-support norm. For $w \\in \\mathbb{R}^d$, denote by $|w|$ the vector of absolute values, and by $w_i^\\downarrow$ the $i$-th largest element of $w$ [2]. We have\n\n$$\\|u\\|_k^{sp*} = \\max\\{ \\langle w, u \\rangle : \\|w\\|_k^{sp} \\le 1 \\} = \\max\\left\\{ \\left( \\sum_{i \\in I} u_i^2 \\right)^{1/2} : I \\in G_k \\right\\} = \\left( \\sum_{i=1}^{k} (|u|_i^\\downarrow)^2 \\right)^{1/2} =: \\|u\\|_{(k)}^{(2)}.$$\n\nThis is the $\\ell_2$-norm of the $k$ largest entries of $u$ in absolute value, and is known as the 2-$k$ symmetric gauge norm [2]. Not surprisingly, this dual norm interpolates between the $\\ell_2$ norm (when $k = d$ and all entries are taken) and the $\\ell_\\infty$ norm (when $k = 1$ and only the largest entry is taken). 
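As a small numerical illustration of the dual norm formula above, the following is a minimal sketch (not the authors' code) that sorts the absolute values and takes the Euclidean norm of the $k$ largest. The function name k_support_dual_norm and the use of NumPy are assumptions made for this example.

```python
import numpy as np


def k_support_dual_norm(u, k):
    # Hypothetical helper (not from the paper): Euclidean norm of the
    # k largest entries of |u|, i.e. the dual of the k-support norm.
    a = np.sort(np.abs(np.asarray(u, dtype=float)))[::-1]  # |u| in decreasing order
    if not 1 <= k <= a.size:
        raise ValueError('k must lie in {1, ..., d}')
    return float(np.sqrt(np.sum(a[:k] ** 2)))


u = [3.0, -4.0, 1.0]
print(k_support_dual_norm(u, 1))  # k = 1: largest absolute entry
print(k_support_dual_norm(u, 2))  # Euclidean norm of the two largest entries
print(k_support_dual_norm(u, 3))  # k = d: Euclidean norm of u
```

With $k = 1$ the sketch returns the max norm and with $k = d$ the Euclidean norm, matching the two endpoints described above.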
This parallels the interpolation of the $k$-support norm between the $\\ell_1$ and $\\ell_2$ norms.\n\n2.2 Computation of the Norm\n\nIn this section, we derive an alternative formula for the $k$-support norm, which leads to computation of the value of the norm in $O(d \\log d)$ steps.\n\nProposition 2.1. For every $w \\in \\mathbb{R}^d$,\n\n$$\\|w\\|_k^{sp} = \\left( \\sum_{i=1}^{k-r-1} (|w|_i^\\downarrow)^2 + \\frac{1}{r+1} \\left( \\sum_{i=k-r}^{d} |w|_i^\\downarrow \\right)^2 \\right)^{1/2},$$\n\nwhere, letting $|w|_0^\\downarrow$ denote $+\\infty$, $r$ is the unique integer in $\\{0, \\ldots, k-1\\}$ satisfying\n\n$$|w|_{k-r-1}^\\downarrow > \\frac{1}{r+1} \\sum_{i=k-r}^{d} |w|_i^\\downarrow \\ge |w|_{k-r}^\\downarrow. \\tag{4}$$\n\nThis result shows that $\\|\\cdot\\|_k^{sp}$ trades off between the $\\ell_1$ and $\\ell_2$ norms in a way that favors sparse vectors but allows for cardinality larger than $k$. It combines the uniform shrinkage of an $\\ell_2$ penalty for the largest components, with the sparse shrinkage of an $\\ell_1$ penalty for the smallest components.\n\nProof of Proposition 2.1. We will use the inequality $\\langle w, u \\rangle \\le \\langle w^\\downarrow, u^\\downarrow \\rangle$ [7]. We have\n\n$$\\frac{1}{2} (\\|w\\|_k^{sp})^2 = \\max\\left\\{ \\langle u, w \\rangle - \\frac{1}{2} (\\|u\\|_{(k)}^{(2)})^2 : u \\in \\mathbb{R}^d \\right\\}$$\n$$= \\max\\left\\{ \\sum_{i=1}^{d} \\alpha_i |w|_i^\\downarrow - \\frac{1}{2} \\sum_{i=1}^{k} \\alpha_i^2 : \\alpha_1 \\ge \\cdots \\ge \\alpha_d \\ge 0 \\right\\}$$\n$$= \\max\\left\\{ \\sum_{i=1}^{k-1} \\alpha_i |w|_i^\\downarrow + \\alpha_k \\sum_{i=k}^{d} |w|_i^\\downarrow - \\frac{1}{2} \\sum_{i=1}^{k} \\alpha_i^2 : \\alpha_1 \\ge \\cdots \\ge \\alpha_k \\ge 0 \\right\\}.$$\n\nLet $A_r := \\sum_{i=k-r}^{d} |w|_i^\\downarrow$ for $r \\in \\{0, \\ldots, k-1\\}$. If $A_0 < |w|_{k-1}^\\downarrow$ then the solution $\\alpha$ is given by $\\alpha_i = |w|_i^\\downarrow$ for $i = 1, \\ldots, k-1$, $\\alpha_i = A_0$ for $i = k, \\ldots, d$. If $A_0 \\ge |w|_{k-1}^\\downarrow$ then the optimal $\\alpha_k$, $\\alpha_{k-1}$ lie between $|w|_{k-1}^\\downarrow$ and $A_0$, and have to be equal. So, the maximization becomes\n\n$$\\max\\left\\{ \\sum_{i=1}^{k-2} \\alpha_i |w|_i^\\downarrow - \\frac{1}{2} \\sum_{i=1}^{k-2} \\alpha_i^2 + A_1 \\alpha_{k-1} - \\alpha_{k-1}^2 : \\alpha_1 \\ge \\cdots \\ge \\alpha_{k-1} \\ge 0 \\right\\}.$$\n\nIf $A_0 \\ge |w|_{k-1}^\\downarrow$ and $|w|_{k-2}^\\downarrow > \\frac{A_1}{2}$ then the solution is $\\alpha_i = |w|_i^\\downarrow$ for $i = 1, \\ldots, k-2$, $\\alpha_i = \\frac{A_1}{2}$ for $i = k-1, \\ldots, d$. Otherwise we proceed as before and continue this process. At stage $r$ the process terminates if $A_0 \\ge |w|_{k-1}^\\downarrow, \\ldots, \\frac{A_{r-1}}{r} \\ge |w|_{k-r}^\\downarrow, \\frac{A_r}{r+1} < |w|_{k-r-1}^\\downarrow$ and all but the last two inequalities are redundant. Hence the condition can be rewritten as (4). One optimal solution is $\\alpha_i = |w|_i^\\downarrow$ for $i = 1, \\ldots, k-r-1$, $\\alpha_i = \\frac{A_r}{r+1}$ for $i = k-r, \\ldots, d$. This proves the claim.\n\n2.3 Learning with the $k$-support norm\n\nWe thus propose using learning rules with $k$-support norm regularization. These are appropriate when we would like to learn a sparse predictor that also has low $\\ell_2$ norm, and are especially relevant when features might be correlated (that is, in almost all learning tasks) but the correlation structure is not known in advance. E.g., for squared error regression problems we have:\n\n$$\\min\\left\\{ \\frac{1}{2} \\|Xw - y\\|^2 + \\frac{\\lambda}{2} (\\|w\\|_k^{sp})^2 : w \\in \\mathbb{R}^d \\right\\} \\tag{5}$$\n\nwith $\\lambda > 0$ a regularization parameter and $k \\in \\{1, \\ldots, d\\}$ also a parameter to be tuned. As is typical in regularization-based methods, both $\\lambda$ and $k$ can be selected by cross-validation [8]. 
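Because cross-validating over $k$ means evaluating the norm for many values of $k$, the closed form of Proposition 2.1 is convenient. The following is a minimal sketch of that formula, assuming NumPy; the function name k_support_norm and the linear scan over $r$ are choices made for this example rather than the authors' implementation.

```python
import numpy as np


def k_support_norm(w, k):
    # Hypothetical helper (not the authors' code): evaluates the closed form
    # of Proposition 2.1 by scanning r = 0, ..., k - 1 for condition (4).
    z = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]  # |w| in decreasing order
    d = z.size
    if not 1 <= k <= d:
        raise ValueError('k must lie in {1, ..., d}')
    tail = np.cumsum(z[::-1])[::-1]  # tail[j] = z[j] + ... + z[d - 1]
    for r in range(k):
        j = k - r - 1  # 0-based position of the 1-based index k - r
        avg = tail[j] / (r + 1)
        upper = np.inf if j == 0 else z[j - 1]  # convention |w|_0 = +infinity
        if upper > avg >= z[j]:
            head = np.sum(z[:j] ** 2)
            return float(np.sqrt(head + (r + 1) * avg ** 2))
    # In exact arithmetic some r always satisfies (4); reaching this line
    # would indicate a floating-point tie that this sketch does not handle.
    raise RuntimeError('no r satisfied condition (4)')


w = [3.0, -4.0]
print(k_support_norm(w, 1))  # k = 1: sum of absolute values
print(k_support_norm(w, 2))  # k = d: Euclidean norm
```

The sort costs $O(d log d)$ and the scan over $r$ adds at most $k$ constant-time checks, in line with the running time stated in Section 2.2.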
Despite the relationship to $S_k^{(2)}$, the parameter $k$ does not necessarily correspond to the sparsity of the actual minimizer of (5), and should be chosen via cross-validation rather than set to the desired sparsity.\n\n3 Relation to the Elastic Net\n\nRecall that the elastic net with penalty parameters $\\lambda_1$ and $\\lambda_2$ selects a vector of coefficients given by\n\n$$\\arg\\min \\left\\{ \\frac{1}{2} \\|Xw - y\\|^2 + \\lambda_1 \\|w\\|_1 + \\lambda_2 \\|w\\|_2^2 \\right\\}. \\tag{6}$$\n\nFor ease of comparison with the $k$-support norm, we first show that the set of optimal solutions for the elastic net, when the parameters are varied, is the same as for the norm\n\n$$\\|w\\|_k^{el} := \\max\\left\\{ \\|w\\|_2, \\ \\|w\\|_1 / \\sqrt{k} \\right\\},$$\n\nwhen $k \\in [1, d]$, corresponding to the unit ball in (1) (note that $k$ is not necessarily an integer). To see this, let $\\hat{w}$ be a solution to (6), and let $k := (\\|\\hat{w}\\|_1 / \\|\\hat{w}\\|_2)^2 \\in [1, d]$. Now for any $w \\ne \\hat{w}$, if $\\|w\\|_k^{el} \\le \\|\\hat{w}\\|_k^{el}$, then $\\|w\\|_p \\le \\|\\hat{w}\\|_p$ for $p = 1, 2$. Since $\\hat{w}$ is a solution to (6), therefore, $\\|Xw - y\\|_2^2 \\ge \\|X\\hat{w} - y\\|_2^2$. This proves that, for some constraint parameter $B$,\n\n$$\\hat{w} = \\arg\\min \\left\\{ \\frac{1}{n} \\|Xw - y\\|_2^2 : \\|w\\|_k^{el} \\le B \\right\\}.$$\n\nLike the $k$-support norm, the elastic net interpolates between the $\\ell_1$ and $\\ell_2$ norms. In fact, when $k$ is an integer, any $k$-sparse unit vector $w \\in \\mathbb{R}^d$ must lie in the unit ball of $\\|\\cdot\\|_k^{el}$. 
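To make the definition of the elastic net norm concrete, here is a minimal sketch, assuming NumPy; the function name elastic_net_norm is hypothetical. It also checks, on a single example, the claim that a $k$-sparse unit vector lies in the unit ball of this norm.

```python
import numpy as np


def elastic_net_norm(w, k):
    # Hypothetical helper: the larger of the Euclidean norm and the
    # sum of absolute values divided by sqrt(k); k need not be an integer.
    w = np.asarray(w, dtype=float)
    if k < 1:
        raise ValueError('k must be at least 1')
    return float(max(np.linalg.norm(w, 2), np.linalg.norm(w, 1) / np.sqrt(k)))


# A 2-sparse vector with unit Euclidean norm in four dimensions.
w = np.array([0.6, -0.8, 0.0, 0.0])
print(elastic_net_norm(w, 2) <= 1.0 + 1e-12)
```

By the Cauchy-Schwarz inequality, the sum of absolute values of a $k$-sparse vector is at most sqrt(k) times its Euclidean norm, which is why the check succeeds for any such vector and not only for this one.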
Since the $k$-support norm gives the convex hull of all $k$-sparse unit vectors, this immediately implies that\n\n$$\\|w\\|_k^{el} \\le \\|w\\|_k^{sp} \\quad \\forall w \\in \\mathbb{R}^d.$$\n\nThe two norms are not equal, however. The difference between the two is illustrated in Figure 1, where we see that the $k$-support norm is more \u201crounded\u201d.\n\nTo see an example where the two norms are not equal, we set $d = 1 + k^2$ for some large $k$, and let $w = (k^{1.5}, 1, 1, \\ldots, 1)^\\top \\in \\mathbb{R}^d$. Then\n\n$$\\|w\\|_k^{el} = \\max\\left\\{ \\sqrt{k^3 + k^2}, \\ \\frac{k^{1.5} + k^2}{\\sqrt{k}} \\right\\} = k^{1.5} \\left( 1 + \\frac{1}{\\sqrt{k}} \\right).$$\n\nTaking $u = \\left( \\frac{1}{\\sqrt{2}}, \\frac{1}{\\sqrt{2k}}, \\frac{1}{\\sqrt{2k}}, \\ldots, \\frac{1}{\\sqrt{2k}} \\right)^\\top$, we have $\\|u\\|_{(k)}^{(2)} < 1$, and recalling this norm is dual to the $k$-support norm:\n\n$$\\|w\\|_k^{sp} > \\langle w, u \\rangle = \\frac{k^{1.5}}{\\sqrt{2}} + k^2 \\cdot \\frac{1}{\\sqrt{2k}} = \\sqrt{2} \\cdot k^{1.5}.$$\n\nIn this example, we see that the ratio of the two norms approaches $\\sqrt{2}$ as $k$ grows. We now show that this is actually the most by which they can differ.\n\nProposition 3.1. $\\|\\cdot\\|_k^{el} \\le \\|\\cdot\\|_k^{sp} < \\sqrt{2} \\, \\|\\cdot\\|_k^{el}$.\n\nProof. We show that these bounds hold in the duals of the two norms. First, since $\\|\\cdot\\|_k^{el}$ is a maximum over the $\\ell_1$ and $\\ell_2$ norms, its dual is given by\n\n$$\\|u\\|_k^{(el)*} := \\inf_{a \\in \\mathbb{R}^d} \\left\\{ \\|a\\|_2 + \\sqrt{k} \\cdot \\|u - a\\|_\\infty \\right\\}.$$\n\nNow take any $u \\in \\mathbb{R}^d$. First we show $\\|u\\|_{(k)}^{(2)} \\le \\|u\\|_k^{(el)*}$. Without loss of generality, we take $u_1 \\ge \\cdots \\ge u_d \\ge 0$. For any $a \\in \\mathbb{R}^d$,\n\n$$\\|u\\|_{(k)}^{(2)} = \\|u_{1:k}\\|_2 \\le \\|a_{1:k}\\|_2 + \\|u_{1:k} - a_{1:k}\\|_2 \\le \\|a\\|_2 + \\sqrt{k} \\|u - a\\|_\\infty.$$\n\nFinally, we show that $\\|u\\|_k^{(el)*} < \\sqrt{2} \\, \\|u\\|_{(k)}^{(2)}$. Let $a = (u_1 - u_{k+1}, \\ldots, u_k - u_{k+1}, 0, \\ldots, 0)^\\top$. Then\n\n$$\\|u\\|_k^{(el)*} \\le \\|a\\|_2 + \\sqrt{k} \\cdot \\|u - a\\|_\\infty = \\sqrt{\\sum_{i=1}^{k} (u_i - u_{k+1})^2} + \\sqrt{k} \\cdot |u_{k+1}|$$\n$$\\le \\sqrt{\\sum_{i=1}^{k} (u_i^2 - u_{k+1}^2)} + \\sqrt{k u_{k+1}^2} \\le \\sqrt{2} \\cdot \\sqrt{\\sum_{i=1}^{k} (u_i^2 - u_{k+1}^2) + k u_{k+1}^2} = \\sqrt{2} \\, \\|u\\|_{(k)}^{(2)}.$$\n\nFurthermore, this yields a strict inequality, because if $u_1 > u_{k+1}$, the next-to-last inequality is strict, while if $u_1 = \\cdots = u_{k+1}$, then the last inequality is strict.\n\n4 Optimization\n\nSolving the optimization problem (5) efficiently can be done with a first-order proximal algorithm. Proximal methods (see [1, 4, 14, 18, 19] and references therein) are used to solve composite problems of the form $\\min\\{ f(x) + \\omega(x) : x \\in \\mathbb{R}^d \\}$, where the loss function $f(x)$ and the regularizer $\\omega(x)$ are convex functions, and $f$ is smooth with an $L$-Lipschitz gradient. These methods require fast computation of the gradient $\\nabla f$ and the proximity operator\n\n$$\\mathrm{prox}_\\omega(x) := \\arg\\min \\left\\{ \\frac{1}{2} \\|u - x\\|^2 + \\omega(u) : u \\in \\mathbb{R}^d \\right\\}.$$\n\nTo obtain a proximal method for $k$-support regularization, it suffices to compute the proximity map of $g = \\frac{1}{2\\beta} (\\|\\cdot\\|_k^{sp})^2$, for any $\\beta > 0$ (in particular, for problem (5) $\\beta$ corresponds to $L / \\lambda$). This computation can be done in $O(d(k + \\log d))$ steps with Algorithm 1.\n\nAlgorithm 1 Computation of the proximity operator.\n\nInput: $v \\in \\mathbb{R}^d$\nOutput: $q = \\mathrm{prox}_{\\frac{1}{2\\beta} (\\|\\cdot\\|_k^{sp})^2}(v)$\n\nFind $r \\in \\{0, \\ldots, k-1\\}$, $\\ell \\in \\{k, \\ldots, d\\}$ such that\n\n$$\\frac{1}{\\beta+1} z_{k-r-1} > \\frac{T_{r,\\ell}}{\\ell - k + (\\beta+1) r + \\beta + 1} \\ge \\frac{1}{\\beta+1} z_{k-r} \\tag{7}$$\n\n$$z_\\ell > \\frac{T_{r,\\ell}}{\\ell - k + (\\beta+1) r + \\beta + 1} \\ge z_{\\ell+1} \\tag{8}$$\n\nwhere $z := |v|^\\downarrow$, $z_0 := +\\infty$, $z_{d+1} := -\\infty$, $T_{r,\\ell} := \\sum_{i=k-r}^{\\ell} z_i$.\n\n$$q_i \\leftarrow \\begin{cases} \\frac{\\beta}{\\beta+1} z_i & \\text{if } i = 1, \\ldots, k-r-1 \\\\ z_i - \\frac{T_{r,\\ell}}{\\ell - k + (\\beta+1) r + \\beta + 1} & \\text{if } i = k-r, \\ldots, \\ell \\\\ 0 & \\text{if } i = \\ell+1, \\ldots, d \\end{cases}$$\n\nReorder and change signs of $q$ to conform with $v$.\n\nFigure 2: Solutions learned for the synthetic data. Left to right: $k$-support, Lasso and elastic net.\n\nProof of Correctness of Algorithm 1. Since the $k$-support norm is sign and permutation invariant, $\\mathrm{prox}_g(v)$ has the same ordering and signs as $v$. Hence, without loss of generality, we may assume that $v_1 \\ge \\cdots \\ge v_d \\ge 0$ and require that $q_1 \\ge \\cdots \\ge q_d \\ge 0$, which follows from inequality (7) and the fact that $z$ is ordered.\n\nNow, $q = \\mathrm{prox}_g(v)$ is equivalent to $\\beta z - \\beta q = \\beta v - \\beta q \\in \\partial \\frac{1}{2} (\\|\\cdot\\|_k^{sp})^2 (q)$. It suffices to show that, for $w = q$, $\\beta z - \\beta q$ is an optimal $\\alpha$ in the proof of Proposition 2.1. Indeed, $A_r$ corresponds to\n\n$$\\sum_{i=k-r}^{d} q_i = \\sum_{i=k-r}^{\\ell} \\left( z_i - \\frac{T_{r,\\ell}}{\\ell - k + (\\beta+1) r + \\beta + 1} \\right) = T_{r,\\ell} - \\frac{(\\ell - k + r + 1) T_{r,\\ell}}{\\ell - k + (\\beta+1) r + \\beta + 1} = (r+1) \\frac{\\beta T_{r,\\ell}}{\\ell - k + (\\beta+1) r + \\beta + 1}$$\n\nand (4) is equivalent to condition (7). For $i \\le k-r-1$, we have $\\beta z_i - \\beta q_i = q_i$. For $k-r \\le i \\le \\ell$, we have $\\beta z_i - \\beta q_i = \\frac{1}{r+1} A_r$. For $i \\ge \\ell+1$, since $q_i = 0$, we only need $\\beta z_i - \\beta q_i \\le \\frac{1}{r+1} A_r$, which is true by (8).\n\nWe can now apply a standard accelerated proximal method, such as FISTA [1], to (5), at each iteration using the gradient of the loss and performing a prox step using Algorithm 1. The FISTA guarantee ensures that, with appropriate step sizes, after $T$ such iterations, we have:\n\n$$\\frac{1}{2} \\|Xw_T - y\\|^2 + \\frac{\\lambda}{2} (\\|w_T\\|_k^{sp})^2 \\le \\left( \\frac{1}{2} \\|Xw^* - y\\|^2 + \\frac{\\lambda}{2} (\\|w^*\\|_k^{sp})^2 \\right) + \\frac{2L \\|w^* - w_1\\|^2}{(T+1)^2}.$$\n\n5 Empirical Comparisons\n\nOur theoretical analysis indicates that the $k$-support norm and the elastic net differ by at most a factor of $\\sqrt{2}$, corresponding to at most a factor of two difference in their sample complexities and generalization guarantees. We thus do not expect huge differences between their actual performances, but would still like to see whether the tighter relaxation of the $k$-support norm does yield some gains.\n\nSynthetic Data. For the first simulation we follow [21, Sec. 5, example 4]. In this experimental protocol, the target (oracle) vector equals $w^* = (\\underbrace{3, \\ldots, 3}_{15}, \\underbrace{0, \\ldots, 0}_{25})$, with $y = (w^*)^\\top x + N(0, 1)$.\n\nThe input data $X$ were generated from a normal distribution such that components 1, . . . 
, 5 have the same random mean $Z_1 \\sim N(0, 1)$, components 6, . . . , 10 have mean $Z_2 \\sim N(0, 1)$ and components 11, . . . , 15 have mean $Z_3 \\sim N(0, 1)$. A total of 50 data sets were created in this way, each containing 50 training points, 50 validation points and 350 test points. The goal is to achieve good prediction performance on the test data.\n\nWe compared the $k$-support norm with Lasso and the elastic net. We considered the ranges $k \\in \\{1, \\ldots, d\\}$ for $k$-support norm regularization, $\\lambda = 10^i$, $i \\in \\{-15, \\ldots, 5\\}$, for the regularization parameter of Lasso and $k$-support regularization and the same range for the $\\lambda_1, \\lambda_2$ of the elastic net. For each method, the optimal set of parameters was selected based on mean squared error on the validation set. The error reported in Table 1 is the mean squared error with respect to the oracle $w^*$, namely $MSE = (\\hat{w} - w^*)^\\top V (\\hat{w} - w^*)$, where $V$ is the population covariance matrix of $X_{test}$.\n\nTo further illustrate the effect of the $k$-support norm, in Figure 2 we show the coefficients learned by each method, in absolute value. For each image, one row corresponds to the $w$ learned for one of the 50 data sets. Whereas all three methods distinguish the 15 relevant variables, the elastic net result varies less within these variables.\n\nTable 1: Mean squared errors and classification accuracy for the synthetic data (median over 50 repetitions), SA heart data (median over 50 replications) and for the \u201c20 newsgroups\u201d data set. (SE = standard error)\n\nMethod | Synthetic MSE (SE) | Heart MSE (SE) | Heart Accuracy (SE) | Newsgroups MSE | Newsgroups Accuracy\nLasso | 0.2685 (0.02) | 0.18 (0.005) | 66.41 (0.53) | 0.70 | 73.02\nElastic net | 0.2274 (0.02) | 0.18 (0.005) | 66.41 (0.53) | 0.70 | 73.02\n$k$-support | 0.2143 (0.02) | 0.18 (0.005) | 66.41 (0.53) | 0.69 | 73.40\n\nSouth African Heart Data. This is a classification task which has been used in [8]. There are 9 variables and 462 examples, and the response is presence/absence of coronary heart disease. We normalized the data so that each predictor variable has zero mean and unit variance. We then split the data 50 times randomly into training, validation, and test sets of sizes 400, 30, and 32 respectively. For each method, parameters were selected using the validation data. In Table 1, we report the MSE and accuracy of each method on the test data. We observe that all three methods have identical performance.\n\n20 Newsgroups. This is a binary classification version of 20 newsgroups created in [12] which can be found in the LIBSVM data repository.$^4$ The positive class consists of the 10 groups with names of form sci.*, comp.*, or misc.forsale and the negative class consists of the other 10 groups. To reduce the number of features, we removed the words which appear in fewer than 3 documents. We randomly split the data into a training, a validation and a test set of sizes 14000, 1000 and 4996, respectively. We report MSE and accuracy on the test data in Table 1. We found that $k$-support regularization gave improved prediction accuracy over both other methods.$^5$\n\n6 Summary\n\nWe introduced the $k$-support norm as the tightest convex relaxation of sparsity plus $\\ell_2$ regularization, and showed that it is tighter than the elastic net by a factor of at most $\\sqrt{2}$. In our view, this sheds light on the elastic net as a close approximation to this tightest possible convex relaxation, and motivates using the $k$-support norm when a tighter relaxation is sought. 
This is also demonstrated in our empirical results.\n\nWe note that the $k$-support norm has better prediction properties, but not necessarily better sparsity-inducing properties, as evident from its more rounded unit ball. It is well understood that there is often a tradeoff between sparsity and good prediction, and that even if the population optimal predictor is sparse, a denser predictor often yields better predictive performance [3, 10, 21]. For example, in the presence of correlated features, it is often beneficial to include several highly correlated features rather than a single representative feature. This is exactly the behavior encouraged by $\\ell_2$ norm regularization, and the elastic net is already known to yield less sparse (but more predictive) solutions. The $k$-support norm goes a step further in this direction, often yielding solutions that are even less sparse (but more predictive) compared to the elastic net.\n\nNevertheless, it is interesting to consider whether compressed sensing results, where $\\ell_1$ regularization is of course central, can be refined by using the $k$-support norm, which might be able to handle more correlation structure within the set of features.\n\nAcknowledgements. The construction showing that the gap between the elastic net and the $k$-overlap norm can be as large as $\\sqrt{2}$ is due to joint work with Ohad Shamir. Rina Foygel was supported by NSF grant DMS-1203762.\n\n$^4$ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/\n\n$^5$ Regarding other sparse prediction methods, we did not manage to compare with OSCAR, due to memory limitations, or to PEN or trace Lasso, which do not have code available online.\n\nReferences\n\n[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n\n[2] R. Bhatia. Matrix Analysis. Graduate Texts in Mathematics. Springer, 1997.\n\n[3] H.D. 
Bondell and B.J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115\u2013123, 2008.\n\n[4] P.L. Combettes and V.R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168\u20131200, 2006.\n\n[5] C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201\u2013230, 2009.\n\n[6] E. Grave, G. R. Obozinski, and F. Bach. Trace lasso: a trace norm regularization for correlated designs. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, 2011.\n\n[7] G. H. Hardy, J. E. Littlewood, and G. P\u00f3lya. Inequalities. Cambridge University Press, 1934.\n\n[8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag Series in Statistics, 2001.\n\n[9] A.E. Hoerl and R.W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, pages 55\u201367, 1970.\n\n[10] L. Jacob, G. Obozinski, and J.P. Vert. Group Lasso with overlap and graph Lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433\u2013440. ACM, 2009.\n\n[11] S.M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, volume 22, 2008.\n\n[12] S. S. Keerthi and D. DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341\u2013361, 2005.\n\n[13] A. Lorbert, D. Eis, V. Kostina, D.M. Blei, and P.J. Ramadge. 
Exploiting covariate similarity in sparse regression via the pairwise elastic net. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.\n\n[14] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE, 2007.\n\n[15] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low-noise and fast rates. In Advances in Neural Information Processing Systems 23, 2010.\n\n[16] T. Suzuki and R. Tomioka. SpicyMKL: a fast algorithm for multiple kernel learning with thousands of kernels. Machine Learning, pages 1\u201332, 2011.\n\n[17] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 58(1):267\u2013288, 1996.\n\n[18] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Preprint, 2008.\n\n[19] P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2):263\u2013295, 2010.\n\n[20] T. Zhang. Covering number bounds of certain regularized linear function classes. The Journal of Machine Learning Research, 2:527\u2013550, 2002.\n\n[21] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301\u2013320, 2005.\n", "award": [], "sourceid": 698, "authors": [{"given_name": "Andreas", "family_name": "Argyriou", "institution": null}, {"given_name": "Rina", "family_name": "Foygel", "institution": null}, {"given_name": "Nathan", "family_name": "Srebro", "institution": null}]}