{"title": "On the Global Linear Convergence of Frank-Wolfe Optimization Variants", "book": "Advances in Neural Information Processing Systems", "page_first": 496, "page_last": 504, "abstract": "The Frank-Wolfe (FW) optimization algorithm has lately re-gained popularity thanks in particular to its ability to nicely handle the structured constraints appearing in machine learning applications. However, its convergence rate is known to be slow (sublinear) when the solution lies at the boundary. A simple less-known fix is to add the possibility to take `away steps' during optimization, an operation that importantly does not require a feasibility oracle. In this paper, we highlight and clarify several variants of the Frank-Wolfe optimization algorithm that has been successfully applied in practice: FW with away steps, pairwise FW, fully-corrective FW and Wolfe's minimum norm point algorithm, and prove for the first time that they all enjoy global linear convergence under a weaker condition than strong convexity. The constant in the convergence rate has an elegant interpretation as the product of the (classical) condition number of the function with a novel geometric quantity that plays the role of the `condition number' of the constraint set. We provide pointers to where these algorithms have made a difference in practice, in particular with the flow polytope, the marginal polytope and the base polytope for submodular optimization.", "full_text": "On the Global Linear Convergence\n\nof Frank-Wolfe Optimization Variants\n\nSimon Lacoste-Julien\n\nINRIA - SIERRA project-team\n\n\u00b4Ecole Normale Sup\u00b4erieure, Paris, France\n\nMartin Jaggi\n\nDept. of Computer Science\nETH Z\u00a8urich, Switzerland\n\nAbstract\n\nThe Frank-Wolfe (FW) optimization algorithm has lately re-gained popularity\nthanks in particular to its ability to nicely handle the structured constraints ap-\npearing in machine learning applications. 
However, its convergence rate is known to be slow (sublinear) when the solution lies at the boundary. A simple less-known fix is to add the possibility to take ‘away steps’ during optimization, an operation that importantly does not require a feasibility oracle. In this paper, we highlight and clarify several variants of the Frank-Wolfe optimization algorithm that have been successfully applied in practice: away-steps FW, pairwise FW, fully-corrective FW and Wolfe’s minimum norm point algorithm, and prove for the first time that they all enjoy global linear convergence, under a weaker condition than strong convexity of the objective. The constant in the convergence rate has an elegant interpretation as the product of the (classical) condition number of the function with a novel geometric quantity that plays the role of a ‘condition number’ of the constraint set. We provide pointers to where these algorithms have made a difference in practice, in particular with the flow polytope, the marginal polytope and the base polytope for submodular optimization.\n\nThe Frank-Wolfe algorithm [9] (also known as conditional gradient) is one of the earliest existing methods for constrained convex optimization, and has seen an impressive revival recently due to its nice properties compared to projected or proximal gradient methods, in particular for sparse optimization and machine learning applications.\n\nOn the other hand, the classical projected gradient and proximal methods have been known to exhibit a very nice adaptive acceleration property, namely that the convergence rate becomes linear for strongly convex objectives, i.e. 
that the optimization error of the same algorithm after t iterations will decrease geometrically with O((1 − ρ)^t) instead of the usual O(1/t) for general convex objective functions. It has recently become an active research topic whether such an acceleration is also possible for Frank-Wolfe type methods.\n\nContributions. We clarify several variants of the Frank-Wolfe algorithm and show that they all converge linearly for any strongly convex function optimized over a polytope domain, with a constant bounded away from zero that only depends on the geometry of the polytope. Our analysis does not depend on the location of the true optimum with respect to the domain, which was a disadvantage of earlier existing results such as [34, 12, 5] and the newer work of [28], as well as the line of work of [1, 19, 26], which relies on Robinson's condition [30]. Our analysis yields a weaker sufficient condition than Robinson's condition; in particular, we can have linear convergence even in some cases where the function has more than one global minimum and is not globally strongly convex. The constant also naturally separates as the product of the condition number of the function with a novel notion of condition number of a polytope, which might have applications in complexity theory.\n\nRelated Work. For the classical Frank-Wolfe algorithm, [5] showed a linear rate for the special case of quadratic objectives when the optimum is in the strict interior of the domain, a result already subsumed by the more general [12]. The early work of [23] showed linear convergence for strongly convex constraint sets, under the strong requirement that the gradient norm is not too small (see [11] for a discussion). 
The away-steps variant of the Frank-Wolfe algorithm, which can also remove weight from ‘bad’ atoms in the current active set, was proposed in [34], and later also analyzed in [12]. The precise method is stated below in Algorithm 1. [12] showed a (local) linear convergence rate on polytopes, but the constant unfortunately depends on the distance between the solution and its relative boundary, a quantity that can be arbitrarily small. More recently, [1, 19, 26] have obtained linear convergence results in the case that the optimum solution satisfies Robinson's condition [30]. In a different recent line of work, [10, 22] have studied a variation of FW that repeatedly moves mass from the worst vertices to the standard FW vertex until a specific condition is satisfied, yielding a linear rate on strongly convex functions. Their algorithm requires the knowledge of several constants though, and moreover is not adaptive to the best-case scenario, unlike the Frank-Wolfe algorithm with away steps and line-search. None of these previous works was shown to be affine invariant, and most require additional knowledge about problem-specific parameters.\n\nSetup. We consider general constrained convex optimization problems of the form:\n\nmin_{x ∈ M} f(x), with M = conv(A), and with only access to: LMO_A(r) ∈ argmin_{x ∈ A} ⟨r, x⟩, (1)\n\nwhere A ⊆ R^d is a finite set of vectors that we call atoms.1 We assume that the function f is μ-strongly convex with L-Lipschitz continuous gradient over M. We also consider weaker conditions than strong convexity for f in Section 4. As A is finite, M is a (convex and bounded) polytope. The methods that we consider in this paper only require access to a linear minimization oracle LMO_A(·) associated with the domain M through a generating set of atoms A. 
This oracle is defined to return a minimizer of a linear subproblem over M = conv(A), for any given direction r ∈ R^d.2\n\nExamples. Optimization problems of the form (1) appear widely in machine learning and signal processing applications. The set of atoms A can represent combinatorial objects of arbitrary type. Efficient linear minimization oracles often exist in the form of dynamic programs or other combinatorial optimization approaches. As an example from tracking in computer vision, A could be the set of integer flows on a graph [16, 7], where LMO_A can be efficiently implemented by a minimum cost network flow algorithm. In this case, M can also be described with a polynomial number of linear inequalities. But in other examples, M might not have a polynomial description in terms of linear inequalities, and testing membership in M might be much more expensive than running the linear oracle. This is the case when optimizing over the base polytope, an object appearing in submodular function optimization [3]. There, the LMO_A oracle is a simple greedy algorithm. Another example is when A represents the possible consistent value assignments on cliques of a Markov random field (MRF); M is then the marginal polytope [32], where testing membership is NP-hard in general, though efficient linear oracles exist for some special cases [17]. Optimization over the marginal polytope appears for example in structured SVM learning [21] and variational inference [18].\n\nThe Original Frank-Wolfe Algorithm. The Frank-Wolfe (FW) optimization algorithm [9], also known as conditional gradient [23], is particularly suited for the setup (1) where M is only accessed through the linear minimization oracle. 
It works as follows: At a current iterate x(t), the algorithm finds a feasible search atom st to move towards by minimizing the linearization of the objective function f over M (line 3 in Algorithm 1) – this is where the linear minimization oracle LMO_A is used. The next iterate x(t+1) is then obtained by doing a line-search on f between x(t) and st (line 11 in Algorithm 1). One reason for the recent increased popularity of Frank-Wolfe-type algorithms is the sparsity of their iterates: in iteration t of the algorithm, the iterate can be represented as a sparse convex combination of at most t + 1 atoms S(t) ⊆ A of the domain M, which we write as x(t) = Σ_{v ∈ S(t)} α_v^(t) v. We write S(t) for the active set, containing the previously discovered search atoms s_r for r < t that have non-zero weight α_{s_r}^(t) > 0 in the expansion (potentially also including the starting point x(0)). While tracking the active set S(t) is not necessary for the original FW algorithm, the improved variants of FW that we discuss will require that S(t) is maintained.\n\nZig-Zagging Phenomenon. When the optimal solution lies at the boundary of M, the convergence rate of the iterates is slow, i.e. sublinear: f(x(t)) − f(x*) ≤ O(1/t), for x* being an optimal solution [9, 6, 8, 15]. This is because the iterates of the classical FW algorithm start to zig-zag\n\n1 The atoms do not have to be extreme points (vertices) of M.\n2 All our convergence results can be carefully extended to approximate linear minimization oracles with multiplicative approximation guarantees; we state them for exact oracles in this paper for simplicity.\n\nFigure 1: (left) The FW algorithm zig-zags when the solution x* lies on the boundary. (middle) Adding the possibility of an away step attenuates this problem. 
(right) As an alternative, a pairwise FW step.\n\nbetween the vertices defining the face containing the solution x* (see left of Figure 1). In fact, the 1/t rate is tight for a large class of functions: Canon and Cullum [6] and Wolfe [34] showed (roughly) that f(x(t)) − f(x*) ≥ Ω(1/t^{1+δ}) for any δ > 0 when x* lies on a face of M, under some additional regularity assumptions. Note that this lower bound is different from the Ω(1/t) one presented in [15, Lemma 3], which holds for all one-atom-per-step algorithms but assumes high dimensionality d ≥ t.\n\n1 Improved Variants of the Frank-Wolfe Algorithm\n\nAlgorithm 1 Away-steps Frank-Wolfe algorithm: AFW(x(0), A, ε)\n1: Let x(0) ∈ A, and S(0) := {x(0)} (so that α_v^(0) = 1 for v = x(0) and 0 otherwise)\n2: for t = 0 . . . T do\n3: Let st := LMO_A(∇f(x(t))) and d_t^FW := st − x(t) (the FW direction)\n4: Let vt ∈ argmax_{v ∈ S(t)} ⟨∇f(x(t)), v⟩ and d_t^A := x(t) − vt (the away direction)\n5: if g_t^FW := ⟨−∇f(x(t)), d_t^FW⟩ ≤ ε then return x(t) (FW gap is small enough, so return)\n6: if ⟨−∇f(x(t)), d_t^FW⟩ ≥ ⟨−∇f(x(t)), d_t^A⟩ then\n7: dt := d_t^FW, and γmax := 1 (choose the FW direction)\n8: else\n9: dt := d_t^A, and γmax := α_{vt}/(1 − α_{vt}) (choose away direction; maximum feasible step-size)\n10: end if\n11: Line-search: γt ∈ argmin_{γ ∈ [0, γmax]} f(x(t) + γ dt)\n12: Update x(t+1) := x(t) + γt dt (and accordingly for the weights α(t+1), see text)\n13: Update S(t+1) := {v ∈ A s.t. α_v^(t+1) > 0}\n14: end for\n\nAlgorithm 2 Pairwise Frank-Wolfe algorithm: PFW(x(0), A, ε)\n1: . . . as in Algorithm 1, except replacing lines 6 to 10 by: dt := d_t^PFW := st − vt, and γmax := α_{vt}.\n\nAway-Steps Frank-Wolfe. To address the zig-zagging problem of FW, Wolfe [34] proposed to add the possibility to move away from an active atom in S(t) (see middle of Figure 1); this simple modification is sufficient to make the algorithm linearly convergent for strongly convex functions. We describe the away-steps variant of Frank-Wolfe in Algorithm 1.3 The away direction d_t^A is defined in line 4 by finding the atom vt in S(t) that maximizes the potential of descent given by g_t^A := ⟨−∇f(x(t)), x(t) − vt⟩. Note that this search is over the (typically small) active set S(t), and is fundamentally easier than the linear oracle LMO_A. The maximum step-size γmax as defined on line 9 ensures that the new iterate x(t) + γ d_t^A stays in M. In fact, this guarantees that the convex representation is maintained, and we stay inside conv(S(t)) ⊆ M. When M is a simplex, the barycentric coordinates are unique and x(t) + γmax d_t^A truly lies on the boundary of M. On the other hand, if |A| > dim(M) + 1 (e.g. for the cube), then it could hypothetically be possible to have a step-size bigger than γmax which is still feasible. Computing the true maximum feasible step-size would require the ability to know when we cross the boundary of M along a specific line, which is not possible for general M. Using the conservative maximum step-size of line 9 ensures that we\n\n3 The original algorithm presented in [34] was not convergent; this was corrected by Guélat and Marcotte [12], assuming a tractable representation of M with linear inequalities, who called it the modified Frank-Wolfe (MFW) algorithm. 
Our description in Algorithm 1 extends it to the more general setup of (1).\n\ndo not need this more powerful oracle. This is why Algorithm 1 requires maintaining S(t) (unlike standard FW). Finally, as in classical FW, the FW gap g_t^FW is an upper bound on the unknown suboptimality, and can be used as a stopping criterion:\n\ng_t^FW := ⟨−∇f(x(t)), d_t^FW⟩ ≥ ⟨−∇f(x(t)), x* − x(t)⟩ ≥ f(x(t)) − f(x*) (by convexity).\n\nIf γt = γmax, then we call this step a drop step, as it fully removes the atom vt from the currently active set of atoms S(t) (by setting its weight to zero). The weight updates for lines 12 and 13 are of the following form. For a FW step, we have S(t+1) = {st} if γt = 1; otherwise S(t+1) = S(t) ∪ {st}. Also, we have α_{st}^(t+1) := (1 − γt) α_{st}^(t) + γt and α_v^(t+1) := (1 − γt) α_v^(t) for v ∈ S(t) \ {st}. For an away step, we have S(t+1) = S(t) \ {vt} if γt = γmax (a drop step); otherwise S(t+1) = S(t). Also, we have α_{vt}^(t+1) := (1 + γt) α_{vt}^(t) − γt and α_v^(t+1) := (1 + γt) α_v^(t) for v ∈ S(t) \ {vt}. In contrast, classical FW shrinks all active weights at every iteration.\n\nPairwise Frank-Wolfe. The next variant that we present is inspired by an early algorithm by Mitchell et al. [25], called the MDM algorithm, originally invented for the polytope distance problem. Here the idea is to only move weight mass between two atoms in each step. More precisely, the generalized method as presented in Algorithm 2 moves weight from the away atom vt to the FW atom st, and keeps all other α weights unchanged. 
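To make this weight bookkeeping concrete, here is a minimal sketch of the three kinds of convex-combination updates just described. This dict-based representation (atom identifiers as keys, weights as values) is our own illustration, not code from the paper:

```python
def fw_step_weights(alpha, s, gamma):
    """FW step: x <- (1-gamma)*x + gamma*s, so every active weight
    shrinks by (1-gamma) and the FW atom s gains gamma."""
    new = {v: (1 - gamma) * w for v, w in alpha.items()}
    new[s] = new.get(s, 0.0) + gamma
    return {v: w for v, w in new.items() if w > 1e-12}

def away_step_weights(alpha, v_away, gamma):
    """Away step: x <- x + gamma*(x - v_away) = (1+gamma)*x - gamma*v_away,
    so every active weight grows by (1+gamma) and v_away loses gamma."""
    new = {v: (1 + gamma) * w for v, w in alpha.items()}
    new[v_away] -= gamma
    return {v: w for v, w in new.items() if w > 1e-12}

def pairwise_step_weights(alpha, s, v_away, gamma):
    """Pairwise step: move mass gamma from v_away to s; others unchanged."""
    assert gamma <= alpha[v_away] + 1e-12
    new = dict(alpha)
    new[v_away] -= gamma
    new[s] = new.get(s, 0.0) + gamma
    return {v: w for v, w in new.items() if w > 1e-12}
```

In all three cases the weights remain a convex combination (non-negative, summing to one), and taking γ = γmax produces a drop step: the away atom's weight reaches zero and it leaves the active set.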
We call such a swap of mass between the two atoms a pairwise FW step, i.e. α_{vt}^(t+1) := α_{vt}^(t) − γ and α_{st}^(t+1) := α_{st}^(t) + γ, for some step-size γ ≤ γmax := α_{vt}^(t).\n\nThe pairwise FW direction will also be central to our proof technique to provide the first global linear convergence rate for away-steps FW, as well as for the fully-corrective variant and Wolfe's min-norm-point algorithm.\n\nAs we will see in Section 2.2, the rate guarantee for the pairwise FW variant is looser than for the other variants, because we cannot provide a satisfactory bound on the number of the problematic swap steps (defined just before Theorem 1). Nevertheless, the algorithm seems to perform quite well in practice, often outperforming away-steps FW, especially in the important case of sparse solutions, that is, when the optimal solution x* lies on a low-dimensional face of M (and thus one wants to keep the active set S(t) small). The pairwise FW step is arguably more efficient at pruning the coordinates in S(t). In contrast to the away step, which moves the mass back uniformly onto all other active elements of S(t) (and might require more corrections later), the pairwise FW step only moves the mass onto the (good) FW atom st. A slightly different version of Algorithm 2 was also proposed by Ñanculef et al. [26], though their convergence proofs were incomplete (see Appendix A.3). The algorithm is related to classical working set algorithms, such as the SMO algorithm used to train SVMs [29]. We refer to [26] for an empirical comparison for SVMs, as well as to their Section 5 for more related work. See also Appendix A.3 for a link between pairwise FW and [10].\n\nFully-Corrective Frank-Wolfe, and Wolfe's Min-Norm Point Algorithm. When the linear oracle is expensive, it might be worthwhile to do more work to optimize over the active set S(t) in between each call to the linear oracle, rather than just performing an away or pairwise step. 
We give in Algorithm 3 the fully-corrective Frank-Wolfe (FCFW) variant, which maintains a correction polytope defined by a set of atoms A(t) (potentially larger than the active set S(t)). Rather than obtaining the next iterate by line-search, x(t+1) is obtained by re-optimizing f over conv(A(t)). Depending on how the correction is implemented, and how the correction atoms A(t) are maintained, several variants can be obtained. These variants are known under many names, such as the extended FW method by Holloway [14] or the simplicial decomposition method [31, 13]. Wolfe's min-norm point (MNP) algorithm [35] for polytope distance problems is often confused with FCFW for quadratic objectives. The major difference is that standard FCFW optimizes f over conv(A(t)), whereas MNP implements the correction as a sequence of affine projections that potentially yield a different update, but can be computed more efficiently in several practical applications [35]. We describe precisely in Appendix A.1 a generalization of the MNP algorithm as a specific case of the correction subroutine from step 7 of the generic Algorithm 3.\n\nThe original convergence analysis of the FCFW algorithm [14] (and also of the MNP algorithm [35]) only showed that they were finitely convergent, with a bound on the number of iterations in terms of the cardinality of A (unfortunately an exponential number in general). Holloway [14] also argued that FCFW had an asymptotic linear convergence based on the flawed argument of Wolfe [34]. 
As far as we know, our work is the first to provide global linear convergence rates for FCFW and MNP for general strongly convex functions. Moreover, the proof of convergence for FCFW does not require an exact solution to the correction step; instead, we show that the weaker properties stated for the approximate correction procedure in Algorithm 4 are sufficient for a global linear convergence rate (this correction could be implemented using away-steps FW, as done for example in [18]).\n\nAlgorithm 3 Fully-corrective Frank-Wolfe with approximate correction: FCFW(x(0), A, ε)\n1: Input: Set of atoms A, active set S(0), starting point x(0) = Σ_{v ∈ S(0)} α_v^(0) v, stopping criterion ε.\n2: Let A(0) := S(0) (optionally, a bigger A(0) could be passed as argument for a warm start)\n3: for t = 0 . . . T do\n4: Let st := LMO_A(∇f(x(t))) (the FW atom)\n5: Let d_t^FW := st − x(t) and g_t^FW := ⟨−∇f(x(t)), d_t^FW⟩ (FW gap)\n6: if g_t^FW ≤ ε then return x(t)\n7: (x(t+1), A(t+1)) := Correction(x(t), A(t), st, ε) (approximate correction step)\n8: end for\n\nAlgorithm 4 Approximate correction: Correction(x(t), A(t), st, ε)\n1: Return (x(t+1), A(t+1)) with the following properties:\n2: S(t+1) is the active set for x(t+1) and A(t+1) ⊇ S(t+1).\n3: f(x(t+1)) ≤ min_{γ ∈ [0,1]} f(x(t) + γ(st − x(t))) (make at least as much progress as a FW step)\n4: g_{t+1}^A := max_{v ∈ S(t+1)} ⟨−∇f(x(t+1)), x(t+1) − v⟩ ≤ ε (the away gap is small enough)\n\n2 Global Linear Convergence Analysis\n\n2.1 Intuition for the Convergence Proofs\n\nWe first give the general intuition for the linear convergence proof of the different FW variants, starting from the work of Guélat and Marcotte [12]. We assume that the objective function f is smooth over a compact set M, i.e. 
its gradient is Lipschitz continuous with constant L. Also let M := diam(M). Let dt be the direction in which the line-search is executed by the algorithm (line 11 in Algorithm 1). By the standard descent lemma [see e.g. (1.2.5) in 27], we have:\n\nf(x(t+1)) ≤ f(x(t) + γ dt) ≤ f(x(t)) + γ ⟨∇f(x(t)), dt⟩ + (γ²/2) L ‖dt‖² ∀γ ∈ [0, γmax]. (2)\n\nWe let rt := −∇f(x(t)) and let ht := f(x(t)) − f(x*) be the suboptimality error. Supposing for now that γmax ≥ γ*_t := ⟨rt, dt⟩/(L ‖dt‖²), we can set γ = γ*_t to minimize the RHS of (2), subtract f(x*) on both sides, and re-organize to get a lower bound on the progress:\n\nht − ht+1 ≥ ⟨rt, dt⟩²/(2L ‖dt‖²) = (1/2L) ⟨rt, d̂t⟩², (3)\n\nwhere we use the ‘hat’ notation to denote normalized vectors: d̂t := dt/‖dt‖. Let et := x* − x(t) be the error vector. By μ-strong convexity of f, we have:\n\nf(x(t) + γ et) ≥ f(x(t)) + γ ⟨∇f(x(t)), et⟩ + (γ²/2) μ ‖et‖² ∀γ ∈ [0, 1]. (4)\n\nThe RHS is lower bounded by its minimum as a function of γ (unconstrained), achieved using γ := ⟨rt, et⟩/(μ ‖et‖²). We are then free to use any value of γ on the LHS and maintain a valid bound. In particular, we use γ = 1 to obtain f(x*). Again re-arranging, we get:\n\nht ≤ ⟨rt, êt⟩²/(2μ), and combining with (3), we obtain: ht − ht+1 ≥ (μ/L) (⟨rt, d̂t⟩²/⟨rt, êt⟩²) ht. (5)\n\nThe inequality (5) is fairly general and valid for any line-search method in direction dt. 
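As a quick numerical sanity check of the progress bound (3) — our own illustration, not part of the paper — consider f(x) = ½‖x‖², for which L = μ = 1 and the exact line-search step γ* = ⟨r, d⟩/‖d‖² is available in closed form (assuming γmax is large enough that the step is unconstrained):

```python
# Check that h_t - h_{t+1} >= <r_t, d_t>^2 / (2 L ||d_t||^2) after an
# exact line-search, for f(x) = 0.5*||x||^2 (L = mu = 1, minimizer x* = 0).

def f(x):
    return 0.5 * sum(c * c for c in x)

def progress_after_line_search(x, d):
    r = [-c for c in x]                      # r = -grad f(x) = -x for this f
    rd = sum(a * b for a, b in zip(r, d))
    dd = sum(c * c for c in d)
    gamma = max(0.0, rd / dd)                # exact line-search step for this f
    x_new = [a + gamma * b for a, b in zip(x, d)]
    return f(x) - f(x_new)

# x = (1, 0), so h_t = 0.5; take an arbitrary descent direction d:
x, d = [1.0, 0.0], [-1.0, 0.5]
bound = 1.0 ** 2 / (2 * 1.0 * 1.25)          # <r,d>^2 / (2 L ||d||^2) = 0.4
assert progress_after_line_search(x, d) >= bound - 1e-12
```

For a quadratic whose Hessian is exactly L·I the descent lemma holds with equality, so here the realized progress matches the bound (0.4) up to rounding.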
To get a linear convergence rate, we need to lower bound (by a positive constant) the term in front of ht on the RHS, which depends on the angle between the update direction dt and the negative gradient rt. If we assume that the solution x* lies in the relative interior of M with a distance of at least δ > 0 from the boundary, then ⟨rt, dt⟩ ≥ δ ‖rt‖ for the FW direction d_t^FW, and by combining with ‖dt‖ ≤ M, we get a linear rate with constant 1 − (μ/L)(δ/M)² (this was the result from [12]). On the other hand, if x* lies on the boundary, then ⟨r̂t, d̂t⟩ gets arbitrarily close to zero for standard FW (the zig-zagging phenomenon) and the convergence is sublinear.\n\nProof Sketch for AFW. The key insight to prove the global linear convergence for AFW is to relate ⟨rt, dt⟩ with the pairwise FW direction d_t^PFW := st − vt. By the way the direction dt is chosen on lines 6 to 10 of Algorithm 1, we have:\n\n2 ⟨rt, dt⟩ ≥ ⟨rt, d_t^FW⟩ + ⟨rt, d_t^A⟩ = ⟨rt, d_t^FW + d_t^A⟩ = ⟨rt, d_t^PFW⟩. (6)\n\nWe thus have ⟨rt, dt⟩ ≥ ⟨rt, d_t^PFW⟩/2. Now the crucial property of the pairwise FW direction is that for any potential negative gradient direction rt, the worst-case inner product ⟨r̂t, d_t^PFW⟩ can be lower bounded away from zero by a quantity depending only on the geometry of M (unless we are at the optimum). We call this quantity the pyramidal width of A. The figure on the right shows the six possible pairwise FW directions d_t^PFW for a triangle domain, depending on which colored area the rt direction falls into. 
We will see that the pyramidal width is related to the smallest width of pyramids that we can construct from A in a specific way related to the choice of the away and towards atoms vt and st. See (9) and our main Theorem 3 in Section 3.\n\nThis gives the main argument for the linear convergence of AFW for steps where γ*_t ≤ γmax. When γmax is too small, AFW will perform a drop step, as the line-search will truncate the step-size to γt = γmax. We cannot guarantee sufficient progress in this case, but the drop step decreases the active set size by one, and thus drop steps cannot happen too often (not more than half the time). These are the main elements for the global linear convergence proof for AFW. The rest is to carefully consider various boundary cases. We can re-use the same techniques to prove the convergence for pairwise FW, though unfortunately the latter also has the possibility of problematic swap steps. While their number can be bounded, so far we have only found the extremely loose bound quoted in Theorem 1.\n\nProof Sketch for FCFW. For FCFW, by line 4 of the correction Algorithm 4, the away gap satisfies g_t^A ≤ ε at the beginning of a new iteration. Supposing that the algorithm does not exit at line 6 of Algorithm 3, we have g_t^FW > ε and therefore 2 ⟨rt, d_t^FW⟩ ≥ ⟨rt, d_t^PFW⟩, using a similar argument as in (6). Finally, by line 3 of Algorithm 4, the correction is guaranteed to make at least as much progress as a line-search in direction d_t^FW, and so the progress bound (5) applies also to FCFW.\n\n2.2 Convergence Results\n\nWe now give the global linear convergence rates for the four variants of the FW algorithm: away-steps FW (AFW, Alg. 1); pairwise FW (PFW, Alg. 2); fully-corrective FW (FCFW, Alg. 3 with approximate correction Alg. 4); and Wolfe's min-norm point algorithm (Alg. 3 with MNP-correction as Alg. 5 in Appendix A.1). For the AFW, MNP and PFW algorithms, we call a step a drop step when the active set shrinks: |S(t+1)| < |S(t)|. 
For the PFW algorithm, we also have the possibility of a swap step, where γt = γmax but |S(t+1)| = |S(t)| (i.e. the mass was fully swapped from the away atom to the FW atom). A nice property of FCFW is that it does not have any drop steps (it executes both FW steps and away steps simultaneously while guaranteeing enough progress at every iteration).\n\nTheorem 1. Suppose that f has L-Lipschitz gradient4 and is μ-strongly convex over M = conv(A). Let M = diam(M) and δ = PWidth(A) as defined by (9). Then the suboptimality ht of the iterates of all four variants of the FW algorithm decreases geometrically at each step that is not a drop step nor a swap step (i.e. when γt < γmax, called a ‘good step’), that is\n\nht+1 ≤ (1 − ρ) ht, where ρ := (μ/4L) (δ/M)².\n\nLet k(t) be the number of ‘good steps’ up to iteration t. We have k(t) = t for FCFW; k(t) ≥ t/2 for MNP and AFW; and k(t) ≥ t/(3|A|! + 1) for PFW (because of the swap steps). This yields a global linear convergence rate of ht ≤ h0 exp(−ρ k(t)) for all variants. If μ = 0 (general convex), then ht = O(1/k(t)) instead. See Theorem 8 in Appendix D for an affine invariant version and proof.\n\nNote that to our knowledge, none of the existing linear convergence results showed that the duality gap was also linearly convergent. The result for the gap follows directly from a simple manipulation of (2): putting the FW gap on the LHS and optimizing the RHS for γ ∈ [0, 1].\n\nTheorem 2. Suppose that f has L-Lipschitz gradient over M with M := diam(M). 
Then the FW gap g_t^FW for any algorithm is upper bounded by the primal error ht as follows:\n\ng_t^FW ≤ ht + LM²/2 when ht > LM²/2; g_t^FW ≤ M √(2 ht L) otherwise. (7)\n\n4 For AFW and PFW, we actually require that ∇f is L-Lipschitz over the larger domain M + M − M.\n\n3 Pyramidal Width\n\nWe now describe the claimed lower bound on the angle between the negative gradient and the pairwise FW direction, which depends only on the geometric properties of M. According to our argument about the progress bound (5) and the PFW gap (6), our goal is to find a lower bound on ⟨rt, d_t^PFW⟩/⟨rt, êt⟩. First note that ⟨rt, d_t^PFW⟩ = ⟨rt, st − vt⟩ = max_{s ∈ M, v ∈ S(t)} ⟨rt, s − v⟩, where S(t) is a possible active set for x(t). This looks like the directional width of a pyramid with base S(t) and summit st. To be conservative, we consider the worst case possible active set for x(t); this is what we will call the pyramid directional width PdirW(A, rt, x(t)). We start with the following definitions.\n\nDirectional Width. The directional width of a set A with respect to a direction r is defined as dirW(A, r) := max_{s, v ∈ A} ⟨r/‖r‖, s − v⟩. The width of A is the minimum directional width over all possible directions in its affine hull.\n\nPyramidal Directional Width. We define the pyramidal directional width of a set A with respect to a direction r and a base point x ∈ M to be\n\nPdirW(A, r, x) := min_{S ∈ Sx} dirW(S ∪ {s(A, r)}, r) = min_{S ∈ Sx} max_{s ∈ A, v ∈ S} ⟨r/‖r‖, s − v⟩, (8)\n\nwhere Sx := {S | S ⊆ A such that x is a proper5 convex combination of all the elements in S}, and s(A, r) := argmax_{v ∈ A} ⟨r, v⟩ is the FW atom used as a summit.\n\nPyramidal Width. To define the pyramidal width of a set, we take the minimum over the cone of possible feasible directions r (in order to avoid the problem of zero width). A direction r is feasible for A from x if it points inwards conv(A) (i.e. r ∈ cone(A − x)). We define the pyramidal width of a set A to be the smallest pyramidal width of all its faces, i.e.\n\nPWidth(A) := min_{K ∈ faces(conv(A)), x ∈ K, r ∈ cone(K − x) \ {0}} PdirW(K ∩ A, r, x). (9)\n\nTheorem 3. Let x ∈ M = conv(A) be a suboptimal point and S an active set for x. Let x* be an optimal point, with corresponding error direction ê := (x* − x)/‖x* − x‖ and negative gradient r := −∇f(x) (so that ⟨r, ê⟩ > 0). Let d = s − v be the pairwise FW direction obtained over A and S with negative gradient r. Then\n\n⟨r, d⟩/⟨r, ê⟩ ≥ PWidth(A). (10)\n\n3.1 Properties of Pyramidal Width and Consequences\n\nExamples of Values. The pyramidal width of a set A is lower bounded by the minimal width over all subsets of atoms, and thus is strictly greater than zero if the number of atoms is finite. On the other hand, this lower bound is often too loose to be useful: in particular, vertex subsets of the unit cube in dimension d can have exponentially small width O(d^{−d/2}) [see Corollary 27 in 36]. In contrast, as we show here, the pyramidal width of the unit cube is actually 1/√d, justifying why we kept the tighter but more involved definition (9). See Appendix B.1 for the proof.\n\nLemma 4. 
The pyramidal width of the unit cube in R^d is 1/√d.

For the probability simplex with d vertices, the pyramidal width is actually the same as its width, which is 2/√d when d is even, and 2/√(d − 1/d) when d is odd [2] (see Appendix B.1). In contrast, the pyramidal width of an infinite set can be zero. For example, for a curved domain, the set of active atoms S can contain vertices forming a very narrow pyramid, yielding a zero width in the limit.

Condition Number of a Set. The inverse of the rate constant ρ appearing in Theorem 1 is the product of two terms: L/µ is the standard condition number of the objective function appearing in the rates of gradient methods in convex optimization. The second quantity (M/δ)² (diameter over pyramidal width) can be interpreted as a condition number of the domain M, or its eccentricity. The more eccentric the constraint set (large diameter compared to its pyramidal width), the slower the convergence. The best condition number of a function is obtained when its level sets are spherical; the analog in terms of constraint sets is the regular simplex, which has the maximum width-to-diameter ratio amongst all simplices [see Corollary 1 in 2]. Its eccentricity is (at most) d/2. In contrast, the eccentricity of the unit cube is d², which is much worse.

⁵ By proper convex combination, we mean that all coefficients are non-zero in the convex combination.

We conjecture that the pyramidal width of a set of vertices (i.e. extrema of their convex hull) is non-increasing when another vertex is added (assuming that all previous points remain vertices). For example, the unit cube can be obtained by iteratively adding vertices to the regular probability simplex, and the pyramidal width thereby decreases from 2/√d to 1/√d.
This property could provide lower bounds for the pyramidal width of more complicated polytopes, such as 1/√d for the d-dimensional marginal polytope, as it can be obtained by removing vertices from the unit cube.

Complexity Lower Bounds. Combining the convergence Theorem 1 and the condition number of the unit simplex, we get a complexity of O(d (L/µ) log(1/ε)) to reach ε-accuracy when optimizing a strongly convex function over the unit simplex. Here the linear dependence on d should not come as a surprise, in view of the known lower bound of 1/t for t ≤ d for Frank-Wolfe type methods [15].

Applications to Submodular Minimization. See Appendix A.2 for a consequence of our linear rate for the popular MNP algorithm for submodular function optimization (over the base polytope).

4 Non-Strongly Convex Generalization

Building on the work of Beck and Shtern [4] and Wang and Lin [33], we can generalize our global linear convergence results for all Frank-Wolfe variants to the more general case where f(x) := g(Ax) + ⟨b, x⟩, for A ∈ R^(p×d), b ∈ R^d, and where g is µ_g-strongly convex and continuously differentiable over AM. We note that for a general matrix A, f is convex but not necessarily strongly convex. In this case, the linear convergence still holds, but with the constant µ appearing in the rate of Theorem 1 replaced by the generalized constant µ̃ appearing in Lemma 9 in Appendix F.

5 Illustrative Experiments

We illustrate the performance of the presented algorithm variants in two numerical experiments, shown in Figure 2. The first example is a constrained Lasso problem (ℓ1-regularized least squares regression), that is, min_{x∈M} f(x) = ‖Ax − b‖², with M = 20·L1 a scaled ℓ1-ball. We used a random Gaussian matrix A ∈ R^(200×500) and a noisy measurement b = Ax*, with x* a sparse vector with 50 entries ±1, and 10% of additive noise.
For the ℓ1-ball, the linear minimization oracle LMO_A just selects the column of A with the best inner product with the residual vector. The second application comes from video co-localization. The approach used by [16] is formulated as a quadratic program (QP) over a flow polytope, the convex hull of paths in a network. In this application, the linear minimization oracle is equivalent to finding a shortest path in the network, which can be done easily by dynamic programming. For the LMO_A, we re-use the code provided by [16] and their included aeroplane dataset, resulting in a QP over 660 variables. In both experiments, we see that the modified FW variants (away-steps and pairwise) outperform the original FW algorithm and exhibit linear convergence. In addition, the constant in the convergence rate of Theorem 1 can also be empirically shown to be fairly tight for AFW and PFW by running them on an increasingly obtuse triangle (see Appendix E).

Figure 2: Duality gap g_t^FW vs. iterations on the Lasso problem (top), and video co-localization (bottom). Code is available from the authors' website.

Discussion. Building on a preliminary version of our work [20], Beck and Shtern [4] also proved a linear rate for away-steps FW, but with a simpler lower bound for the LHS of (10) using linear duality arguments. However, their lower bound [see e.g. Lemma 3.1 in 4] is looser: they get a d² constant for the eccentricity of the regular simplex instead of the tighter d that we proved. Finally, the recently proposed generic scheme for accelerating first-order optimization methods in the sense of Nesterov from [24] applies directly to the FW variants, given the global linear convergence rate that we proved. This gives for the first time first-order methods that only use linear oracles and obtain the "near-optimal" Õ(1/k²) rate for smooth convex functions, or the accelerated Õ(√(L/µ)) constant in the linear rate for strongly convex functions.
Given that the constants also depend on the dimensionality, it remains an open question whether this acceleration is practically useful.

Acknowledgements. We thank J.B. Alayrac, E. Hazan, A. Hubard, A. Osokin and P. Marcotte for helpful discussions. This work was partially supported by the MSR-Inria Joint Center and a Google Research Award.

References
[1] S. D. Ahipaşaoğlu, P. Sun, and M. Todd. Linear convergence of a modified Frank-Wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optimization Methods and Software, 23(1):5–19, 2008.
[2] R. Alexander. The width and diameter of a simplex. Geometriae Dedicata, 6(1):87–94, 1977.
[3] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.
[4] A. Beck and S. Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions. arXiv:1504.05002v1, 2015.
[5] A. Beck and M. Teboulle. A conditional gradient method with linear rate of convergence for solving convex linear systems. Mathematical Methods of Operations Research (ZOR), 59(2):235–247, 2004.
[6] M. D. Canon and C. D. Cullum. A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM Journal on Control, 6(4):509–516, 1968.
[7] V. Chari et al. On pairwise costs for network flow multi-object tracking. In CVPR, 2015.
[8] J. C. Dunn. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization, 17(2):187–211, 1979.
[9] M. Frank and P. Wolfe.
An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
[10] D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv:1301.4666v5, 2013.
[11] D. Garber and E. Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In ICML, 2015.
[12] J. Guélat and P. Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 1986.
[13] D. Hearn, S. Lawphongpanich, and J. Ventura. Restricted simplicial decomposition: Computation and extensions. In Computation Mathematical Programming, volume 31, pages 99–118. Springer, 1987.
[14] C. A. Holloway. An extension of the Frank and Wolfe method of feasible directions. Mathematical Programming, 6(1):14–27, 1974.
[15] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
[16] A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In ECCV, 2014.
[17] V. Kolmogorov and R. Zabin. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.
[18] R. G. Krishnan, S. Lacoste-Julien, and D. Sontag. Barrier Frank-Wolfe for marginal inference. In NIPS, 2015.
[19] P. Kumar and E. A. Yildirim. A linearly convergent linear-time first-order algorithm for support vector classification with a core set result. INFORMS Journal on Computing, 2010.
[20] S. Lacoste-Julien and M. Jaggi. An affine invariant linear convergence analysis for Frank-Wolfe algorithms. arXiv:1312.7864v2, 2013.
[21] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, 2013.
[22] G.
Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv:1309.5550v2, 2013.
[23] E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):787–823, Jan. 1966.
[24] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In NIPS, 2015.
[25] B. Mitchell, V. F. Demyanov, and V. Malozemov. Finding the point of a polyhedron closest to the origin. SIAM Journal on Control, 12(1), 1974.
[26] R. Ñanculef, E. Frandi, C. Sartori, and H. Allende. A novel Frank-Wolfe algorithm. Analysis and applications to large-scale SVM training. Information Sciences, 2014.
[27] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.
[28] J. Pena, D. Rodriguez, and N. Soheili. On the von Neumann and Frank-Wolfe algorithms with away steps. arXiv:1507.04073v2, 2015.
[29] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning, pages 185–208. 1999.
[30] S. M. Robinson. Generalized Equations and their Solutions, Part II: Applications to Nonlinear Programming. Springer, 1982.
[31] B. Von Hohenbalken. Simplicial decomposition in nonlinear programming algorithms. Mathematical Programming, 13(1):49–68, 1977.
[32] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[33] P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15:1523–1548, 2014.
[34] P. Wolfe. Convergence theory in nonlinear programming. In Integer and Nonlinear Programming. 1970.
[35] P. Wolfe. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149, 1976.
[36] G. M.
Ziegler. Lectures on 0/1-polytopes. arXiv:math/9909177v1, 1999.