{"title": "Unified Methods for Exploiting Piecewise Linear Structure in Convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 4754, "page_last": 4762, "abstract": "We develop methods for rapidly identifying important components of a convex optimization problem for the purpose of achieving fast convergence times. By considering a novel problem formulation\u2014the minimization of a sum of piecewise functions\u2014we describe a principled and general mechanism for exploiting piecewise linear structure in convex optimization. This result leads to a theoretically justified working set algorithm and a novel screening test, which generalize and improve upon many prior results on exploiting structure in convex optimization. In empirical comparisons, we study the scalability of our methods. We find that screening scales surprisingly poorly with the size of the problem, while our working set algorithm convincingly outperforms alternative approaches.", "full_text": "Uni\ufb01ed Methods for Exploiting\n\nPiecewise Linear Structure in Convex Optimization\n\nTyler B. Johnson\n\nUniversity of Washington, Seattle\ntbjohns@washington.edu\n\nCarlos Guestrin\n\nUniversity of Washington, Seattle\n\nguestrin@cs.washington.edu\n\nAbstract\n\nWe develop methods for rapidly identifying important components of a convex\noptimization problem for the purpose of achieving fast convergence times. By\nconsidering a novel problem formulation\u2014the minimization of a sum of piecewise\nfunctions\u2014we describe a principled and general mechanism for exploiting piece-\nwise linear structure in convex optimization. This result leads to a theoretically\njusti\ufb01ed working set algorithm and a novel screening test, which generalize and\nimprove upon many prior results on exploiting structure in convex optimization.\nIn empirical comparisons, we study the scalability of our methods. 
We \ufb01nd that\nscreening scales surprisingly poorly with the size of the problem, while our working\nset algorithm convincingly outperforms alternative approaches.\n\n1\n\nIntroduction\n\nScalable optimization methods are critical for many machine learning applications. Due to tractable\nproperties of convexity, many optimization tasks are formulated as convex problems, many of which\nexhibit useful structure at their solutions. For example, when training a support vector machine, the\noptimal model is unin\ufb02uenced by easy-to-classify training instances. For sparse regression problems,\nthe optimal model makes predictions using a subset of features, ignoring its remaining inputs.\nIn these examples and others, the problem\u2019s \u201cstructure\u201d can be exploited to perform optimization\nef\ufb01ciently. Speci\ufb01cally, given the important components of a problem (for example the relevant\ntraining examples or features) we could instead optimize a simpler objective that results in the same\nsolution. In practice, since the important components are unknown prior to optimization, we focus on\nmethods that rapidly discover the relevant components as progress is made toward convergence.\nOne principled method for exploiting structure in optimization is screening, a technique that identi\ufb01es\ncomponents of a problem guaranteed to be irrelevant to the solution. First proposed by [1], screening\nrules have been derived for many objectives in recent years. These approaches are specialized to\nparticular objectives, so screening tests do not readily translate between optimization tasks. Prior\nworks have separately considered screening irrelevant features [1\u20138], training examples [9, 10], or\nconstraints [11]. No screening test applies to all of these applications.\nWorking set algorithms are a second approach to exploiting structure in optimization. 
By minimizing\na sequence of simpli\ufb01ed objectives, working set algorithms quickly converge to the problem\u2019s global\nsolution. Perhaps the most prominent working set algorithms for machine learning are those of the\nLIBLINEAR library [12]. As is common with working set approaches, there is little theoretical\nunderstanding of these algorithms. Recently a working set algorithm with some theoretical guarantees\nwas proposed [11]. This work fundamentally relies on the objective being a constrained function,\nhowever, making it unclear how to use this algorithm for other problems with structure.\nThe purpose of this work is to both unify and improve upon prior ideas for exploiting structure in\nconvex optimization. We begin by formalizing the concept of \u201cstructure\u201d using a novel problem\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fformulation: the minimization of a sum of many piecewise functions. Each piecewise function is\nde\ufb01ned by multiple simpler subfunctions, at least one of which we assume to be linear. With this\nformulation, exploiting structure amounts to selectively replacing piecewise terms in the objective\nwith corresponding linear subfunctions. The resulting objective can be considerably simpler to solve.\nUsing our piecewise formulation, we \ufb01rst present a general theoretical result on exploiting structure\nin optimization. This result guarantees quanti\ufb01able progress toward a problem\u2019s global solution by\nminimizing a simpli\ufb01ed objective. We apply this result to derive a new working set algorithm that\ncompares favorably to [11] in that (i) our algorithm results from a minimax optimization of new\nbounds, and (ii) our algorithm is not limited to constrained objectives. Later, we derive a state-of-\nthe-art screening test by applying the same initial theoretical result. 
Compared to prior screening tests, our screening result is more effective at simplifying the objective function. Moreover, unlike previous screening results, our screening test applies to a broad class of objectives.
We include empirical evaluations that compare the scalability of screening and working set methods on real-world problems. While many screening tests have been proposed for large-scale optimization, we have not seen the scalability of screening studied in prior literature. Surprisingly, although our screening test significantly improves upon many prior results, we find that screening scales poorly as the size of the problem increases. In fact, in many cases, screening has negligible effect on overall convergence times. In contrast, our working set algorithm improves convergence times considerably in a number of cases. This result suggests that compared to screening, working set algorithms are significantly more useful for scaling optimization to large problems.

2 Piecewise linear optimization framework

We consider optimization problems of the form

minimize_{x ∈ R^n}  f(x) := ψ(x) + Σ_{i=1}^m φ_i(x),   (P)

where ψ is γ-strongly convex, and each φ_i is convex and piecewise; for each φ_i, we assume a function π_i : R^n → {1, 2, ..., p_i} and convex subfunctions φ_i^1, ..., φ_i^{p_i} such that for all x ∈ R^n, we have

φ_i(x) = φ_i^{π_i(x)}(x).

As will later become clear, we focus on instances of (P) for which many of the subfunctions φ_i^k are linear. We denote by X_i^k the subset of R^n corresponding to the kth piecewise subdomain of φ_i:

X_i^k := {x : π_i(x) = k}.

The purpose of this work is to develop efficient and principled methods for solving (P) by exploiting the piecewise structure of f. 
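To make the formulation concrete, here is a minimal sketch (our own illustration, not code from the paper) of the hinge loss written as a piecewise function φ_i with p_i = 2 subfunctions, both linear; the helper names `make_hinge_piece`, `pi_i`, and `phi_i` are hypothetical:

```python
# Minimal sketch (illustrative, not from the paper): the hinge loss
# phi_i(x) = max(0, 1 - b_i <a_i, x>) as a piecewise function in the sense
# of (P), with p_i = 2 linear subfunctions and pi_i selecting the piece.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def make_hinge_piece(a_i, b_i):
    """Return (pi_i, phi_i) for one training example (a_i, b_i)."""
    def pi_i(x):
        # Piece 1: the flat (zero) subfunction, active when the example is
        # classified with margin >= 1; piece 2: the linear penalty piece.
        return 1 if b_i * dot(a_i, x) >= 1 else 2

    subfunctions = {
        1: lambda x: 0.0,                      # linear (constant) subfunction
        2: lambda x: 1.0 - b_i * dot(a_i, x),  # linear penalty subfunction
    }

    def phi_i(x):
        # phi_i(x) = phi_i^{pi_i(x)}(x), as in the definition of (P).
        return subfunctions[pi_i(x)](x)

    return pi_i, phi_i

pi_i, phi_i = make_hinge_piece(a_i=[1.0, -2.0], b_i=1.0)
print(phi_i([3.0, 0.0]))   # margin 3 >= 1: flat piece, loss 0.0
print(phi_i([0.0, 0.0]))   # margin 0 < 1: linear piece, loss 1.0
```

Here the subdomain X_i^1 is the half-space {x : b_i⟨a_i, x⟩ ≥ 1}, which is the case the methods below exploit.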
Our approach is based on the following observation:

Proposition 2.1 (Exploiting piecewise structure at x⋆). Let x⋆ be the minimizer of f. For each i ∈ [m], assume knowledge of π_i(x⋆) and whether x⋆ ∈ int(X_i^{π_i(x⋆)}). Define

φ⋆_i := { φ_i^{π_i(x⋆)}  if x⋆ ∈ int(X_i^{π_i(x⋆)}),
          φ_i            otherwise,

where int(·) denotes the interior of a set. Then x⋆ is also the solution to

minimize_{x ∈ R^n}  f⋆(x) := ψ(x) + Σ_{i=1}^m φ⋆_i(x).   (P⋆)

In words, Proposition 2.1 states that if x⋆ does not lie on the boundary of the subdomain X_i^{π_i(x⋆)}, then replacing φ_i with the subfunction φ_i^{π_i(x⋆)} in f does not affect the minimizer of f.
Despite having identical solutions, solving (P⋆) can require far less computation than solving (P). This is especially true when many φ⋆_i are linear, since the sum of linear functions is also linear. More formally, consider a set W⋆ ⊆ [m] such that for all i ∉ W⋆, φ⋆_i is linear, meaning φ⋆_i(x) = ⟨a⋆_i, x⟩ + b⋆_i for some a⋆_i and b⋆_i. Defining a⋆ = Σ_{i∉W⋆} a⋆_i and b⋆ = Σ_{i∉W⋆} b⋆_i, then (P⋆) is equivalent to

minimize_{x ∈ R^n}  f⋆(x) := ψ(x) + ⟨a⋆, x⟩ + b⋆ + Σ_{i∈W⋆} φ⋆_i(x).   (P⋆⋆)

That is, (P) has been reduced from a problem with m piecewise functions to a problem of size |W⋆|. Since often |W⋆| ≪ m, solving (P⋆) can be tremendously simpler than solving (P). This scenario is quite common in machine learning applications. 
Some important examples include:

• Piecewise loss minimization: φ_i is a piecewise loss with at least one linear subfunction.
• Constrained optimization: φ_i takes value 0 for a subset of R^n and +∞ otherwise.
• Optimization with sparsity inducing penalties: ℓ1-regularized regression, group lasso, fused lasso, etc., are instances of (P) via duality [13].

We include elaboration on these examples in Appendix A.

3 Theoretical results

We have seen that solving (P⋆) can be more efficient than solving (P). However, since W⋆ is unknown prior to optimization, solving (P⋆) is impractical. Instead, we can hope to design algorithms that rapidly learn W⋆. In this section, we propose principled methods for achieving this goal.

3.1 A general mechanism for exploiting piecewise linear structure

In this section, we focus on the consequences of minimizing the function

f′(x) := ψ(x) + Σ_{i=1}^m φ′_i(x),

where φ′_i ∈ {φ_i} ∪ {φ_i^1, ..., φ_i^{p_i}}. That is, φ′_i is either the original piecewise function φ_i or one of its subfunctions φ_i^k. With (P⋆) unknown, it is natural to consider this more general class of objectives (in the case that φ′_i = φ⋆_i for all i, we see f′ is the objective function of (P⋆)). The goal of this section is to establish choices of f′ such that by minimizing f′, we can make progress toward minimizing f. We later introduce working set and screening methods based on this result.
To guide the choice of f′, we assume points x0 ∈ R^n, y0 ∈ dom(f), where x0 minimizes a γ-strongly convex function f0 that lower bounds f. 
The point y0 represents an existing approximation of x⋆, while x0 can be viewed as a second approximation related to a point in (P)'s dual space. Since f0 lower bounds f and x0 minimizes f0, note that f0(x0) ≤ f0(x⋆) ≤ f(x⋆). Using this fact, we quantify the suboptimality of x0 and y0 in terms of the suboptimality gap

Δ0 := f(y0) − f0(x0) ≥ f(y0) − f(x⋆).   (1)

Importantly, we consider choices of f′ such that by minimizing f′, we can form points (x′, y′) that improve upon the existing approximations (x0, y0) in terms of the suboptimality gap. Specifically, we define x′ as the minimizer of f′, while y′ is a point on the segment [y0, x′] (to be defined precisely later). Our result in this section applies to choices of f′ that satisfy three natural requirements:

R1. Tight in a neighborhood of y0: For a closed set S with y0 ∈ int(S), f′(x) = f(x) for all x ∈ S.
R2. Lower bound on f: For all x, we have f′(x) ≤ f(x).
R3. Upper bound on f0: For all x, we have f′(x) ≥ f0(x).

Each of these requirements serves a specific purpose. After solving x′ := argmin_x f′(x), R1 enables a backtracking operation to obtain a point y′ such that f(y′) < f(y0) (assuming y0 ≠ x⋆). 
We define y′ as the point on the segment (y0, x′] that is closest to x′ while remaining in the set S:

θ′ := max{θ ∈ (0, 1] : θx′ + (1 − θ)y0 ∈ S},   y′ := θ′x′ + (1 − θ′)y0.   (2)

Since (i) f′ is convex, (ii) x′ minimizes f′, and (iii) y0 ∈ int(S), it follows that f(y′) ≤ f(y0). Applying R2 leads to the new suboptimality gap

Δ′ := f(y′) − f′(x′) ≥ f(y′) − f(x⋆).   (3)

R2 is also a natural requirement since we are interested in the scenario that many φ′_i are linear, in which case (i) φ′_i lower bounds φ_i as a result of convexity, and (ii) the resulting f′ likely can be minimized efficiently. Finally, R3 is useful for ensuring f′(x′) ≥ f0(x′) ≥ f0(x0). It follows that Δ′ ≤ Δ0. Moreover, this improvement in suboptimality gap can be quantified as follows:

Lemma 3.1 (Guaranteed suboptimality gap progress; proven in Appendix B). Consider points x0 ∈ R^n, y0 ∈ dom(f) such that x0 minimizes a γ-strongly convex function f0 that lower bounds f. For any function f′ that satisfies R1, R2, and R3, let x′ be the minimizer of f′, and define θ′ and y′ via backtracking as in (2). Then defining suboptimality gaps Δ0 and Δ′ as in (1) and (3), we have

Δ′ ≤ (1 − θ′) [ Δ0 − ((1 + θ′)/θ′²) (γ/2) min_{z ∉ int(S)} ‖z − (θ′x0 + y0)/(1 + θ′)‖² − (θ′/(1 + θ′)) (γ/2) ‖x0 − y0‖² ].

The primary significance of Lemma 3.1 is the bound's relatively simple dependence on S. We next design working set and screening methods that choose S to optimize this bound.

Algorithm 1 PW-BLITZ
  initialize y0 ∈ dom(f)
  # Initialize x0 by minimizing a simple lower bound on f:
  ∀i ∈ [m], φ′_{i,0}(x) := φ_i(y0) + ⟨g_i, x − y0⟩, where g_i ∈ ∂φ_i(y0)
  x0 ← argmin_x f′_0(x) := ψ(x) + Σ_{i=1}^m φ′_{i,0}(x)
  for t = 1, ..., T until x_T = y_T do
    # Form subproblem:
    Select β_t ∈ [0, 1/2]
    c_t ← β_t x_{t−1} + (1 − β_t) y_{t−1}
    Select threshold τ_t > β_t ‖x_{t−1} − y_{t−1}‖
    S_t := {x : ‖x − c_t‖ ≤ τ_t}
    for i = 1, ..., m do
      k ← π_i(y_{t−1})
      if (C1 and C2 and C3) then φ′_{i,t} := φ_i^k else φ′_{i,t} := φ_i
    # Solve subproblem:
    x_t ← argmin_x f′_t(x) := ψ(x) + Σ_{i=1}^m φ′_{i,t}(x)
    # Backtrack:
    α_t ← argmin_{α ∈ (0,1]} f(α x_t + (1 − α) y_{t−1})
    y_t ← α_t x_t + (1 − α_t) y_{t−1}
  return y_T

3.2 Piecewise working set algorithm

Lemma 3.1 suggests an iterative algorithm that, at each iteration t, minimizes a modified objective f′_t(x) := ψ(x) + Σ_{i=1}^m φ′_{i,t}(x), where φ′_{i,t} ∈ {φ_i} ∪ {φ_i^1, ..., φ_i^{p_i}}. To guide the choice of each φ′_{i,t}, our algorithm considers previous iterates x_{t−1} and y_{t−1}, where x_{t−1} minimizes f′_{t−1}. 
For all i ∈ [m] and k = π_i(y_{t−1}), we define φ′_{i,t} = φ_i^k if the following three conditions are satisfied:

C1. Tight in the neighborhood of y_{t−1}: We have S_t ⊆ X_i^k (implying φ_i(x) = φ_i^k(x) for all x ∈ S_t).
C2. Lower bound on φ_i: For all x, we have φ_i^k(x) ≤ φ_i(x).
C3. Upper bound on φ′_{i,t−1} in the neighborhood of x_{t−1}: For all x ∈ R^n and g_i ∈ ∂φ′_{i,t−1}(x_{t−1}), we have φ_i^k(x) ≥ φ′_{i,t−1}(x_{t−1}) + ⟨g_i, x − x_{t−1}⟩.

If any of the above conditions are unmet, then we let φ′_{i,t} = φ_i. As detailed in Appendix C, this choice of f′_t satisfies conditions analogous to conditions R1, R2, and R3 for Lemma 3.1.
After determining f′_t, the algorithm proceeds by solving x_t ← argmin_x f′_t(x). We then set y_t ← α_t x_t + (1 − α_t) y_{t−1}, where α_t is chosen via backtracking. Lemma 3.1 implies the suboptimality gap Δ_t := f(y_t) − f′_t(x_t) decreases with t until x_T = y_T, at which point Δ_T = 0 and x_T and y_T solve (P). Defined in Algorithm 1, we call this algorithm "PW-BLITZ" as it extends the BLITZ algorithm for constrained problems from [11] to a broader class of piecewise objectives.
An important consideration of Algorithm 1 is the choice of S_t. If S_t is large, C1 is easily violated, meaning φ′_{i,t} = φ_i for many i. This implies f′_t is difficult to minimize. In contrast, if S_t is small, then φ′_{i,t} is potentially linear for many i. In this case, f′_t is simpler to minimize, but Δ_t may be large.
Interestingly, conditioned on oracle knowledge of θ_t := max{θ ∈ (0, 1] : θx_t + (1 − θ)y_{t−1} ∈ S_t}, we can derive an optimal S_t according to Lemma 3.1 subject to a volume constraint vol(S_t) ≤ V:

S⋆_t := argmax_{S : vol(S) ≤ V}  min_{z ∉ int(S)} ‖z − (θ_t x_{t−1} + y_{t−1})/(1 + θ_t)‖.

S⋆_t is a ball with center (θ_t x_{t−1} + y_{t−1})/(1 + θ_t). Of course, this result cannot be used in practice directly, since θ_t is unknown when choosing S_t. Motivated by this result, Algorithm 1 instead defines S_t as a ball with radius τ_t and a similar center c_t := β_t x_{t−1} + (1 − β_t) y_{t−1} for some β_t ∈ [0, 1/2].
By choosing S_t in this manner, we can quantify the amount of progress Algorithm 1 makes at iteration t. Our first theorem lower bounds the amount of progress during iteration t of Algorithm 1 for the case in which β_t happens to be chosen optimally. That is, S_t is a ball with center (θ_t x_{t−1} + y_{t−1})/(1 + θ_t).

Theorem 3.2 (Convergence progress with optimal β_t). Let Δ_{t−1} and Δ_t be the suboptimality gaps after iterations t − 1 and t of Algorithm 1, and suppose that β_t = θ_t(1 + θ_t)^{−1}. Then

Δ_t ≤ Δ_{t−1} + (γ/2)τ_t² − (3/2)(γτ_t² Δ_{t−1}²)^{1/3}.

Since the optimal β_t is unknown when choosing S_t, our second theorem characterizes the worst-case performance of extremal choices of β_t (the cases β_t = 0 and β_t = 1/2).

Theorem 3.3 (Convergence progress with suboptimal β_t). Let Δ_{t−1} and Δ_t be the suboptimality gaps after iterations t − 1 and t of Algorithm 1, and suppose that β_t = 0. Then

Δ_t ≤ Δ_{t−1} + (γ/2)τ_t² − (2γτ_t² Δ_{t−1})^{1/2}.

Alternatively, suppose that β_t = 1/2, and define d_t := ‖x_{t−1} − y_{t−1}‖. Then

Δ_t ≤ Δ_{t−1} + (γ/2)(τ_t − d_t/2)² − (3/2)(γ(τ_t − d_t/2)² Δ_{t−1}²)^{1/3}.

These results are proven in Appendices D and E. Note that it is often desirable to choose τ_t such that (γ/2)τ_t² is significantly less than Δ_{t−1}. (In the alternative case, the subproblem objective f′_t may be no simpler than f. One could choose τ_t such that Δ_t = 0, for example, but as we will see in §3.3, we are only performing screening in this scenario.) Assuming (γ/2)τ_t² is small in relation to Δ_{t−1}, the ability to choose β_t is advantageous in terms of worst-case bounds if one manages to select β_t ≈ θ_t(1 + θ_t)^{−1}. At the same time, Theorem 3.3 suggests that Algorithm 1 is robust to the choice of β_t; the algorithm makes progress toward convergence even with worst-case choices of this parameter.

Practical considerations  We make several notes about using Algorithm 1 in practice. Since subproblem solvers are iterative, it is important to only compute x_t approximately. In Appendix F, we include a modified version of Lemma 3.1 that considers this case. This result suggests terminating subproblem t when f′_t(x_t) − min_x f′_t(x) ≤ εΔ_{t−1} for some ε ∈ (0, 1). Here ε trades off the amount of progress resulting from solving subproblem t with the time dedicated to solving this subproblem.
To choose β_t, we find it practical to initialize β_0 = 0 and let β_t = α_{t−1}(1 + α_{t−1})^{−1} for t > 0. This roughly approximates the optimal choice β_t = θ_t(1 + θ_t)^{−1}, since θ_t can be viewed as a worst-case version of α_t, and α_t often changes gradually with t. Selecting τ_t is problem dependent. 
By letting τ_t = β_t‖x_{t−1} − y_{t−1}‖ + ξΔ_{t−1}^{1/2} for a small ξ > 0, Algorithm 1 converges linearly in t. It can also be beneficial to choose τ_t in other ways; for example, one can choose τ_t so that subproblem t fits in memory.
It is also important to recognize the relative amount of time required for each stage of Algorithm 1. When forming subproblem t, the time-consuming step is checking condition C1. In the most common scenarios, in which X_i^k is a half-space or ball, this condition is testable in O(n) time. However, for arbitrary regions, this condition could be difficult to test. The time required for solving subproblem t is clearly application dependent, but we note it can be helpful to select subproblem termination criteria to balance time usage between stages of the algorithm. The backtracking stage is a 1D convex problem that at most requires evaluating f a logarithmic number of times. Simpler backtracking approaches are available for many objectives. It is also not necessary to perform exact backtracking.

Relation to BLITZ algorithm  Algorithm 1 is related to the BLITZ algorithm [11]. BLITZ applies only to constrained problems, however, while Algorithm 1 applies to a more general class of piecewise objectives. In Appendix G, we elaborate on Algorithm 1's connection to BLITZ and other algorithms.

3.3 Piecewise screening test

Lemma 3.1 can also be used to simplify the objective f in such a way that the minimizer x⋆ is unchanged. Recall Lemma 3.1 assumes a function f′ and set S for which f′(x) = f(x) for all x ∈ S. The idea of this section is to select the smallest region S such that in Lemma 3.1, Δ′ must equal 0 (according to the lemma). In this case, the minimizer of f′ is equal to the minimizer of f, even though f′ is potentially much simpler to minimize. 
This results in the following screening test:

Theorem 3.4 (Piecewise screening test; proven in Appendix H). Consider any x0, y0 ∈ R^n such that x0 minimizes a γ-strongly convex function f0 that lower bounds f. Define the suboptimality gap Δ0 := f(y0) − f0(x0) as well as the point c0 := (x0 + y0)/2. Then for any i ∈ [m] and k = π_i(y0), if

S := { x : ‖x − c0‖ ≤ √((1/γ)Δ0 − (1/4)‖x0 − y0‖²) } ⊆ int(X_i^k),

then x⋆ ∈ int(X_i^k). This implies φ_i may be replaced with φ_i^k in (P) without affecting x⋆.

Theorem 3.4 applies to general X_i^k, and testing whether S ⊆ int(X_i^k) may be difficult. Fortunately, X_i^k often is (or is a superset of) a simple region that makes applying Theorem 3.4 simple.

Corollary 3.5 (Piecewise screening test for half-space X_i^k). Suppose that X_i^k ⊇ {x : ⟨a_i, x⟩ ≤ b_i} for some a_i ∈ R^n, b_i ∈ R. Define x0, y0, Δ0, and c0 as in Theorem 3.4. Then x⋆ ∈ int(X_i^k) if

(b_i − ⟨a_i, c0⟩)/‖a_i‖ > √((1/γ)Δ0 − (1/4)‖x0 − y0‖²).

Corollary 3.6 (Piecewise screening test for ball X_i^k). Suppose that X_i^k ⊇ {x : ‖x − a_i‖ ≤ b_i} for some a_i ∈ R^n, b_i ∈ R_{>0}. Define x0, y0, Δ0, and c0 as in Theorem 3.4. Then x⋆ ∈ int(X_i^k) if

b_i − ‖a_i − c0‖ > √((1/γ)Δ0 − (1/4)‖x0 − y0‖²).

Corollary 3.5 applies to piecewise loss minimization (for SVMs, discarding examples that are not marginal support vectors), ℓ1-regularized learning (discarding irrelevant features), and optimization with linear constraints (discarding superfluous constraints). 
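As an illustration, the half-space test of Corollary 3.5 reduces to a few vector operations. The sketch below is our own hedged rendering (the function name and calling convention are hypothetical, not from the paper): piece i is screened when the distance from c0 = (x0 + y0)/2 to the half-space boundary exceeds the ball radius √(Δ0/γ − ‖x0 − y0‖²/4).

```python
import math

# Hedged sketch of the Corollary 3.5 half-space screening test. Given a
# pair (x0, y0) with suboptimality gap delta0 = f(y0) - f0(x0), piece i can
# be replaced by its linear subfunction if the ball S of radius
# sqrt(delta0/gamma - ||x0 - y0||^2 / 4) centered at c0 = (x0 + y0)/2 lies
# inside the half-space {x : <a_i, x> <= b_i}.

def can_screen_halfspace(a_i, b_i, x0, y0, delta0, gamma):
    c0 = [(u + v) / 2 for u, v in zip(x0, y0)]
    dist_sq = sum((u - v) ** 2 for u, v in zip(x0, y0))
    radius_sq = delta0 / gamma - dist_sq / 4
    if radius_sq < 0:
        # The ball S is empty, so S is trivially contained in int(X_i^k).
        return True
    norm_a = math.sqrt(sum(u ** 2 for u in a_i))
    margin = (b_i - sum(u * v for u, v in zip(a_i, c0))) / norm_a
    return margin > math.sqrt(radius_sq)

print(can_screen_halfspace([1.0, 0.0], 5.0, [0.0, 0.0], [1.0, 0.0],
                           0.5, 1.0))  # True: margin 4.5 > radius 0.5
```

Note the O(n) cost per piece, which matches the claim in §3.2 that condition C1 is cheap to check for half-space subdomains.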
Applications of Corollary 3.6 include group lasso and many constrained objectives. In order to obtain the point x0, it is usually practical to choose f0 as the sum of ψ and a first-order lower bound on Σ_{i=1}^m φ_i. In this case, computing x0 is as simple as finding the conjugate of ψ. We illustrate this idea with an SVM example in Appendix I.
Since Δ0 decreases over the course of an iterative algorithm, Theorem 3.4 is "adaptive," meaning it increases in effectiveness as progress is made toward convergence. In contrast, most screening tests are "nonadaptive." Nonadaptive screening tests depend on knowledge of an exact solution to a related problem, which is disadvantageous, since (i) solving a related problem exactly is generally computationally expensive, and (ii) the screening test can only be applied prior to optimization.

Relation to existing screening tests  Theorem 3.4 generalizes and improves upon many existing screening tests. We summarize Theorem 3.4's relation to previous results below. Unlike Theorem 3.4, existing tests typically apply to only one or two objectives. Elaboration is included in Appendix J.

• Adaptive tests for sparse optimization: Recently, [6], [7], and [8] considered adaptive screening tests for several sparse optimization problems, including ℓ1-regularized learning and group lasso. These tests rely on knowledge of primal and dual points (analogous to x0 and y0), but the tests are not as effective (nor as general) as Theorem 3.4.
• Adaptive tests for constrained optimization: [11] considered screening with primal-dual pairs for constrained optimization problems. The resulting test is a more general version (applies to more objectives) of [6], [7], and [8]. Thus, Theorem 3.4 improves upon [11] as well.
• Nonadaptive tests for degree 1 homogeneous loss minimization: [10] considered screening for ℓ2-regularized learning with hinge and ℓ1 loss functions. This is a special nonadaptive case of Theorem 3.4, which requires solving the problem with greater regularization prior to screening.
• Nonadaptive tests for sparse optimization: Some tests, such as [4] for the lasso, may screen components that Theorem 3.4 does not eliminate. In Appendix J, we show how Theorem 3.4 can be modified to generalize [4], but this change increases the time needed for screening. In practice, we were unable to overcome this drawback to speed up iterative algorithms.

Relation to working set algorithm  Theorem 3.4 is closely related to Algorithm 1. In particular, our screening test can be viewed as a working set algorithm that converges in one iteration. In the context of Algorithm 1, this amounts to choosing β_1 = 1/2 and τ_1 = √((1/γ)Δ0 − (1/4)‖x0 − y0‖²).
It is important to understand that it is usually not desirable that a working set algorithm converges in one iteration. Since screening rules do not make errors, these methods simplify the objective by only a modest amount. In many cases, screening may fail to simplify the objective in any meaningful way. In the following section, we consider real-world scenarios to demonstrate these points.

(a) m = 100  (b) m = 400  (c) m = 1600

Figure 1: Group lasso convergence comparison. While screening is marginally useful for the problem with only 100 groups, screening becomes ineffective as m increases. The working set algorithm convincingly outperforms dual coordinate descent in all cases.

4 Comparing the scalability of screening and working set methods

This section compares the scalability of our working set and screening approaches. We consider
two popular instances of (P): group lasso and linear SVMs. For each problem, we examine the performance of our working set algorithm and screening rule as m increases. This is an important comparison, as we have not seen such scalability experiments in prior works on screening.
We implemented dual coordinate ascent (DCA) to solve each instance of (P). DCA is known to be simple and fast, and there are no parameters to tune. We compare DCA to three alternatives:

1. DCA + screening: After every five DCA epochs we apply screening. "Piecewise screening" refers to Theorem 3.4. For group lasso, we also implement "gap screening" [7].
2. DCA + working sets: Implementation of Algorithm 1. DCA is used to solve each subproblem.
3. DCA + working sets + screening: Algorithm 1 with Theorem 3.4 applied after each iteration.

Group lasso comparisons  We define the group lasso objective as

g_GL(ω) := (1/2)‖Aω − b‖² + λ Σ_{i=1}^m ‖ω_{G_i}‖_2.

A ∈ R^{n×q} is a design matrix, and b ∈ R^n is a labels vector. λ > 0 is a regularization parameter, and G_1, ..., G_m are disjoint sets of feature indices such that ∪_{i=1}^m G_i = [q]. Denote a minimizer of g_GL by ω⋆. For large λ, groups of elements, ω⋆_{G_i}, have value 0 for many G_i. While g_GL is not directly an instance of (P), the dual of g_GL is strongly concave with m constraints (and thus an instance of (P)).
We consider an instance of g_GL to perform feature selection for an insurance claim prediction task¹. Given n = 250,000 training instances, we learned an ensemble of 1600 decision trees. To make predictions more efficiently, we use group lasso to reduce the number of trees in the model. The resulting problem has m = 1600 groups and q = 28,733 features. 
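For reference, the objective g_GL can be evaluated directly from its definition. The snippet below is a minimal illustrative sketch (not the paper's implementation, which works with the dual), using small hypothetical data:

```python
import math

# Illustrative sketch (not the paper's solver): evaluating the group lasso
# objective g_GL(w) = 0.5 * ||A w - b||^2 + lambda * sum_i ||w_{G_i}||_2,
# where the groups G_i are disjoint index sets covering all q features.

def group_lasso_objective(A, b, w, groups, lam):
    residual = [sum(A[r][c] * w[c] for c in range(len(w))) - b[r]
                for r in range(len(b))]
    loss = 0.5 * sum(r ** 2 for r in residual)
    penalty = lam * sum(math.sqrt(sum(w[j] ** 2 for j in G)) for G in groups)
    return loss + penalty

A = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]           # hypothetical 2x3 design matrix
b = [1.0, 1.0]                  # labels vector
w = [1.0, 0.0, 0.0]             # only the first group is active
groups = [[0, 1], [2]]          # disjoint index sets with union [q]
print(group_lasso_objective(A, b, w, groups, lam=0.1))  # loss 0.5 + penalty 0.1
```

The group norms ‖w_{G_i}‖_2 are exactly the quantities the penalty drives to zero for inactive groups, which is what the support-set precision metric below measures.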
To evaluate the dependence of the algorithms on m, we form smaller problems by uniformly subsampling 100 and 400 groups. For each problem we set λ so that exactly 5% of groups have nonzero weight in the optimal model.

Figure 1 contains results of this experiment. Our metrics include the relative suboptimality of the current iterate as well as the agreement of this iterate's nonzero groups with those of the optimal solution in terms of precision (all algorithms had high recall). This second metric is arguably more important, since the task is feature selection. Our results illustrate that while screening is marginally helpful when m is small, our working set method is more effective when scaling to large problems.

¹ https://www.kaggle.com/c/ClaimPredictionChallenge

[Figure 1 plots: relative suboptimality |g − g⋆|/|g⋆| and support set precision vs. time (s), for DCA, DCA + gap screening, DCA + piecewise screening, DCA + working sets, and DCA + working sets + piecewise screening.]

(a) m = 10⁴  (b) m = 10⁵  (c) m = 10⁶

Figure 2: SVM convergence comparison. (above) Relative suboptimality vs. time. (below) Heat map depicting fraction of examples screened by Theorem 3.4 when used in conjunction with dual coordinate ascent. y-axis is the number of epochs completed; x-axis is the tuning parameter C. C0 is the largest value of C for which each element of the dual solution takes value C. Darker regions indicate more successful screening.
The vertical line indicates the choice of C that minimizes validation loss; this is also the choice of C for the above plots. As the number of examples increases, screening becomes progressively less effective near the desirable choice of C.

SVM comparisons  We define the linear SVM objective as

    fSVM(x) := (1/2)‖x‖² + C Σ_{i=1}^m (1 − b_i⟨a_i, x⟩)₊ .

Here C is a tuning parameter, while a_i ∈ ℝⁿ and b_i ∈ {−1, +1} represent the ith training instance. We train an SVM model on the Higgs boson dataset². This dataset was generated by a team of particle physicists. The classification task is to determine whether an event corresponds to the Higgs boson. In order to learn an accurate model, we performed feature engineering on this dataset, resulting in 8010 features. In this experiment, we consider subsets of examples with size m = 10⁴, 10⁵, and 10⁶.

Results of this experiment are shown in Figure 2. For this problem, we plot the relative suboptimality in terms of objective value. We also include a heat map that shows screening's effectiveness for different values of C. Similar to the group lasso results, the utility of screening decreases as m increases. Meanwhile, working sets significantly improve convergence times, regardless of m.
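The SVM objective fSVM admits an equally short sketch; (·)₊ denotes the positive part, i.e., the hinge loss max(0, ·). As before, the function name and the convention of storing the examples a_i as rows of a matrix are illustrative choices, not from the paper's code.

```python
import numpy as np

def svm_objective(x, A, b, C):
    """Evaluate f_SVM(x) = 0.5 * ||x||^2 + C * sum_i max(0, 1 - b_i <a_i, x>).

    Rows of A are the training examples a_i; b holds labels in {-1, +1}."""
    margins = b * (A @ x)                  # b_i <a_i, x> for each example
    hinge = np.maximum(0.0, 1.0 - margins) # (1 - b_i <a_i, x>)_+
    return 0.5 * x @ x + C * hinge.sum()
```

With A the 2×2 identity, b = (1, −1), x = (1, 1), and C = 2, the margins are (1, −1), so the hinge terms are (0, 2) and the objective is 0.5·2 + 2·2 = 5.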
Using the same analysis, we have also proposed a screening rule that improves upon many prior screening results and enables screening for many new objectives.

Our empirical results highlight a significant disadvantage of using screening: unless a good approximate solution is already known, screening is often ineffective. This is perhaps understandable, since screening rules operate under the constraint that erroneous simplifications are forbidden. Working set algorithms are not subject to this constraint. Instead, working set algorithms achieve fast convergence times by aggressively simplifying the objective function, correcting for mistakes only as needed.

² https://archive.ics.uci.edu/ml/datasets/HIGGS

Acknowledgments

We thank Hyunsu Cho, Christopher Aicher, and Tianqi Chen for their helpful feedback as well as assistance preparing datasets used in our experiments. This work is supported in part by PECASE N00014-13-1-0023, NSF IIS-1258741, and the TerraSwarm Research Center 00008169.

References

[1] L. E. Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination for the lasso and sparse supervised learning problems. Pacific Journal of Optimization, 8(4):667–698, 2012.

[2] Z. J. Xiang and P. J. Ramadge. Fast lasso screening tests based on correlations.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2012.

[3] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society, Series B, 74(2):245–266, 2012.

[4] J. Liu, Z. Zhao, J. Wang, and J. Ye. Safe screening with variational inequalities and its application to lasso. In International Conference on Machine Learning, 2014.

[5] J. Wang, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projection. Journal of Machine Learning Research, 16(May):1063–1101, 2015.

[6] O. Fercoq, A. Gramfort, and J. Salmon. Mind the duality gap: safer rules for the lasso. In International Conference on Machine Learning, 2015.

[7] E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. GAP safe screening rules for sparse multi-task and multi-class models. In Advances in Neural Information Processing Systems 28, 2015.

[8] E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. Gap safe screening rules for sparse-group lasso. Technical Report arXiv:1602.06225, 2016.

[9] K. Ogawa, Y. Suzuki, and I. Takeuchi. Safe screening of non-support vectors in pathwise SVM computation. In International Conference on Machine Learning, 2013.

[10] J. Wang, P. Wonka, and J. Ye. Scaling SVM and least absolute deviations via exact data reduction. In International Conference on Machine Learning, 2014.

[11] T. B. Johnson and C. Guestrin. Blitz: a principled meta-algorithm for scaling sparse optimization. In International Conference on Machine Learning, 2015.

[12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[13] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski.
Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.