{"title": "A Simple Practical Accelerated Method for Finite Sums", "book": "Advances in Neural Information Processing Systems", "page_first": 676, "page_last": 684, "abstract": "Abstract We describe a novel optimization method for finite sums (such as empirical risk minimization problems) building on the recently introduced SAGA method. Our method achieves an accelerated convergence rate on strongly convex smooth problems. Our method has only one parameter (a step size), and is radically simpler than other accelerated methods for finite sums. Additionally it can be applied when the terms are non-smooth, yielding a method applicable in many areas where operator splitting methods would traditionally be applied.", "full_text": "A Simple Practical Accelerated Method for Finite Sums

Aaron Defazio
Ambiata, Sydney Australia

Abstract

We describe a novel optimization method for finite sums (such as empirical risk minimization problems) building on the recently introduced SAGA method. Our method achieves an accelerated convergence rate on strongly convex smooth problems. Our method has only one parameter (a step size), and is radically simpler than other accelerated methods for finite sums. Additionally it can be applied when the terms are non-smooth, yielding a method applicable in many areas where operator splitting methods would traditionally be applied.

Introduction

A large body of recent developments in optimization has focused on minimization of convex finite sums of the form:

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),$$

a very general class of problems including the empirical risk minimization (ERM) framework as a special case. 
Any function $h$ can be written in this form by setting $f_1(x) = h(x)$ and $f_i = 0$ for $i \neq 1$; however, when each $f_i$ is sufficiently regular in a way that can be made precise, it is possible to optimize such sums more efficiently than by treating them as black-box functions.

In most cases recently developed methods such as SAG [Schmidt et al., 2013] can find an $\epsilon$-minimum faster than either stochastic gradient descent or accelerated black-box approaches, both in theory and in practice. We call this class of methods fast incremental gradient methods (FIG).

FIG methods are randomized methods similar to SGD; however, unlike SGD they are able to achieve linear convergence rates under Lipschitz-smoothness and strong convexity conditions [Mairal, 2014, Defazio et al., 2014b, Johnson and Zhang, 2013, Konečný and Richtárik, 2013]. The linear rate of the first wave of FIG methods directly depended on the condition number $L/\mu$ of the problem, whereas recently several methods have been developed whose rates depend on the square root of the condition number [Lan and Zhou, 2015, Lin et al., 2015, Shalev-Shwartz and Zhang, 2013c, Nitanda, 2014], at least when $n$ is not too large. Analogously to the black-box case, these methods are known as accelerated methods.

In this work we develop another accelerated method, which is conceptually simpler and requires less tuning than existing accelerated methods. The method we give is a primal approach; however, it makes use of a proximal operator oracle for each $f_i$ instead of a gradient oracle, unlike other primal approaches. The proximal operator is also used by dual methods such as some variants of SDCA [Shalev-Shwartz and Zhang, 2013a].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Algorithm 1
Pick some starting point $x^0$ and step size $\gamma$. Initialize each $g^0_i = f'_i(x^0)$, where $f'_i(x^0)$ is any gradient/subgradient of $f_i$ at $x^0$. Then at step $k+1$:

1. 
Pick an index $j$ from 1 to $n$ uniformly at random.

2. Update $x$:

$$z^k_j = x^k + \gamma\left[g^k_j - \frac{1}{n}\sum_{i=1}^n g^k_i\right], \qquad x^{k+1} = \mathrm{prox}^\gamma_j\left(z^k_j\right).$$

3. Update the gradient table: set $g^{k+1}_j = \frac{1}{\gamma}\left(z^k_j - x^{k+1}\right)$, and leave the rest of the entries unchanged ($g^{k+1}_i = g^k_i$ for $i \neq j$).

1 Algorithm

Our algorithm's main step makes use of the proximal operator for a randomly chosen $f_i$. For convenience, we define:

$$\mathrm{prox}^\gamma_i(x) = \operatorname{argmin}_y \left\{ \gamma f_i(y) + \frac{1}{2}\left\|x - y\right\|^2 \right\}.$$

This proximal operator can be computed efficiently or in closed form in many cases; see Section 4 for details. Like SAGA, we also maintain a table of gradients $g_i$, one for each function $f_i$. We denote the state of $g_i$ at the end of step $k$ by $g^k_i$. The iterate (our guess at the solution) at the end of step $k$ is denoted $x^k$. The starting iterate $x^0$ may be chosen arbitrarily.

The full algorithm is given as Algorithm 1. The sum of gradients $\frac{1}{n}\sum_{i=1}^n g^k_i$ can be cached and updated efficiently at each step, and in most cases, instead of storing a full vector for each $g_i$, only a single real value needs to be stored. This is the case for linear regression or binary classification with logistic loss or hinge loss, in precisely the same way as for standard SAGA. 
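As a concrete sketch, the three steps above can be written in a few lines. This is an illustrative reimplementation, not the author's Cython code: the names `point_saga`, `prox_ops` and `grad_ops` are ours, each `prox_ops[i](gamma, x)` is assumed to evaluate $\mathrm{prox}^\gamma_i(x)$ exactly, and `grad_ops[i]` is used only to initialize the gradient table.

```python
import numpy as np

def point_saga(prox_ops, grad_ops, x0, gamma, steps, seed=0):
    """Illustrative sketch of Algorithm 1 (Point-SAGA).

    prox_ops[i](gamma, x): evaluates the prox of gamma*f_i at x.
    grad_ops[i](x): any (sub)gradient of f_i at x, for the table g_i^0."""
    rng = np.random.default_rng(seed)
    n = len(prox_ops)
    x = np.asarray(x0, dtype=float)
    g = np.array([grad_ops[i](x) for i in range(n)])  # table g_i^0 = f_i'(x^0)
    g_mean = g.mean(axis=0)                           # cached average, updated incrementally
    for _ in range(steps):
        j = int(rng.integers(n))                      # step 1: uniform random index
        z = x + gamma * (g[j] - g_mean)               # step 2: z_j^k
        x = prox_ops[j](gamma, z)                     #         x^{k+1} = prox_j(z_j^k)
        g_new = (z - x) / gamma                       # step 3: new table entry g_j^{k+1}
        g_mean = g_mean + (g_new - g[j]) / n          # keep the cached average exact
        g[j] = g_new
    return x
```

For example, with quadratic terms $f_i(x) = \frac{1}{2}(x - b_i)^2$ the proximal operator has the closed form $(z + \gamma b_i)/(1+\gamma)$, and the iterates converge to the mean of the $b_i$.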
A discussion of further implementation details is given in Section 4.

With step size

$$\gamma = \frac{\sqrt{(n-1)^2 + 4n\frac{L}{\mu}}}{2Ln} - \frac{1 - \frac{1}{n}}{2L},$$

the expected convergence rate in terms of squared distance to the solution is given by:

$$E\left\|x^k - x^*\right\|^2 \le \left(1 - \frac{\mu\gamma}{1+\mu\gamma}\right)^k \frac{\mu + L}{\mu}\left\|x^0 - x^*\right\|^2,$$

when each $f_i : R^d \to R$ is $L$-smooth and $\mu$-strongly convex. See Nesterov [1998] for definitions of these conditions. Using big-O notation, the number of steps required to reduce the distance to the solution by a factor $\epsilon$ is:

$$k = O\left(\left(\sqrt{\frac{nL}{\mu}} + n\right)\log\left(\frac{1}{\epsilon}\right)\right),$$

as $\epsilon \to 0$. This rate matches the lower bound known for this problem [Lan and Zhou, 2015] under the gradient oracle. We conjecture that this rate is optimal under the proximal operator oracle as well. Unlike other accelerated approaches, we have only a single tunable parameter (the step size $\gamma$), and the algorithm doesn't need knowledge of $L$ or $\mu$ except through their appearance in the step size.

Compared to the $O\left(\left(L/\mu + n\right)\log\left(1/\epsilon\right)\right)$ rate for SAGA and other non-accelerated FIG methods, accelerated FIG methods are significantly faster when $n$ is small compared to $L/\mu$; however, for $n \ge L/\mu$ the performance is essentially the same. All known FIG methods hit a kind of wall at $n \approx L/\mu$, where they decrease the error at each step by no more than a factor of $1 - \frac{1}{n}$. Indeed, when $n \ge L/\mu$ the problem is so well conditioned that any FIG method can solve it efficiently. 
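For intuition about these constants, the step size and the per-step contraction factor $1 - \mu\gamma/(1+\mu\gamma)$ can be computed directly. The helper below is ours, simply transcribing the two formulas above:

```python
import math

def point_saga_step_size(n, L, mu):
    # gamma = sqrt((n-1)^2 + 4 n L/mu) / (2 L n) - (1 - 1/n) / (2 L)
    return (math.sqrt((n - 1) ** 2 + 4.0 * n * L / mu) / (2.0 * L * n)
            - (1.0 - 1.0 / n) / (2.0 * L))

def contraction_factor(n, L, mu):
    # per-step factor (1 - mu*gamma/(1 + mu*gamma)) from the rate above
    gamma = point_saga_step_size(n, L, mu)
    return 1.0 - mu * gamma / (1.0 + mu * gamma)
```

In the regime $n \ge L/\mu$ the factor computed this way approaches $1 - 1/n$, the wall described above.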
This is sometimes called the big data setting [Defazio et al., 2014b].

Our convergence rate can also be compared to that of optimal first-order black-box methods, which have rates of the form $k = O\left(\sqrt{L/\mu}\,\log(1/\epsilon)\right)$ per epoch equivalent. We are able to achieve a $\sqrt{n}$ speedup on a per-epoch basis, for $n$ not too large. Of course, all of the mentioned rates are significantly better than the $O\left((L/\mu)\log(1/\epsilon)\right)$ rate of gradient descent.

For non-smooth but strongly convex problems, we prove a $1/\epsilon$-type rate under a standard iterate averaging scheme. This rate does not require the use of decreasing step sizes, so our algorithm requires less tuning than other primal approaches on non-smooth problems.

2 Relation to other approaches

Our method is most closely related to the SAGA method. To make the relation clear, we may write our method's main step as:

$$x^{k+1} = x^k - \gamma\left[f'_j(x^{k+1}) - g^k_j + \frac{1}{n}\sum_{i=1}^n g^k_i\right],$$

whereas SAGA has a step of the form:

$$x^{k+1} = x^k - \gamma\left[f'_j(x^k) - g^k_j + \frac{1}{n}\sum_{i=1}^n g^k_i\right].$$

The difference is the point at which the gradient of $f_j$ is evaluated. The proximal operator has the effect of evaluating the gradient at $x^{k+1}$ instead of $x^k$. While a small difference on the surface, this change has profound effects. It allows the method to be applied directly to non-smooth problems using fixed step sizes, a property not shared by SAGA or other primal FIG methods. Additionally, it allows much larger step sizes to be used, which is why the method is able to achieve an accelerated rate.

It is also illustrative to look at how the methods behave at $n = 1$. 
SAGA degenerates into regular gradient descent, whereas our method becomes the proximal point method [Rockafellar, 1976]:

$$x^{k+1} = \mathrm{prox}_{\gamma f}(x^k).$$

The proximal point method has quite remarkable properties. For strongly convex problems, it converges for any $\gamma > 0$ at a linear rate. The downside is the inherent difficulty of evaluating the proximal operator. For the $n = 2$ case, if each term is an indicator function of a convex set, our algorithm matches Dykstra's projection algorithm if we take $\gamma = 2$ and use cyclic instead of random steps.

Accelerated incremental gradient methods

Several acceleration schemes have been recently developed as extensions of non-accelerated FIG methods. The earliest approach developed was the ASDCA algorithm [Shalev-Shwartz and Zhang, 2013b,c]. The general approach of applying the proximal point method as the outer loop of a double-loop scheme has been dubbed the Catalyst algorithm [Lin et al., 2015]. It can be applied to accelerate any FIG method. Recently a very interesting primal-dual approach has been proposed by Lan and Zhou [2015]. All of the prior accelerated methods are significantly more complex than the approach we propose, and have more complex proofs.

3 Theory

3.1 Proximal operator bounds

In this section we restate some simple bounds from proximal operator theory that we will use in this work. Define the short-hand $p_{\gamma f}(x) = \mathrm{prox}_{\gamma f}(x)$, and let $g_{\gamma f}(x) = \frac{1}{\gamma}\left(x - p_{\gamma f}(x)\right)$, so that $p_{\gamma f}(x) = x - \gamma g_{\gamma f}(x)$. Note that $g_{\gamma f}(x)$ is a subgradient of $f$ at the point $p_{\gamma f}(x)$. This relation is known as the optimality condition of the proximal operator. 
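As a quick numerical illustration of this optimality condition, take $f(x) = |x|$, whose proximal operator is the standard soft-thresholding map; the helper names below are ours:

```python
import numpy as np

def prox_abs(gamma, x):
    # prox of gamma*|.| at x: the soft-thresholding map
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def g_gamma_f(gamma, x, prox):
    # g_{gamma f}(x) = (x - p_{gamma f}(x)) / gamma
    return (x - prox(gamma, x)) / gamma
```

With $\gamma = 1$ and $x = 2.5$ this gives $p_{\gamma f}(x) = 1.5$ and $g_{\gamma f}(x) = 1$, which is indeed the gradient of $|\cdot|$ at $1.5$; with $x = 0.3$ the prox lands at $0$ and $g_{\gamma f}(x) = 0.3 \in [-1, 1] = \partial|\cdot|(0)$, a valid subgradient as the optimality condition requires.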
Note that proofs for the following two propositions are in the supplementary material.

Notation | Description | Additional relation
$x^k$ | current iterate at step $k$ | $x^k \in R^d$
$x^*$ | solution | $x^* \in R^d$
$\gamma$ | step size |
$p_{\gamma f}(x)$ | short-hand in results for generic $f$ | $p_{\gamma f}(x) = \mathrm{prox}_{\gamma f}(x)$
$\mathrm{prox}^\gamma_i(x)$ | proximal operator of $\gamma f_i$ at $x$ | $= \operatorname{argmin}_y\left\{\gamma f_i(y) + \frac{1}{2}\|x-y\|^2\right\}$
$g^k_i$ | a stored subgradient of $f_i$ as seen at step $k$ |
$g^*_i$ | a subgradient of $f_i$ at $x^*$ | $\sum_{i=1}^n g^*_i = 0$
$v_i$ | $v_i = x^* + \gamma g^*_i$ | $x^* = \mathrm{prox}^\gamma_i(v_i)$
$j$ | chosen component index (random variable) |
$z^k_j$ | $z^k_j = x^k + \gamma\left[g^k_j - \frac{1}{n}\sum_{i=1}^n g^k_i\right]$ | $x^{k+1} = \mathrm{prox}^\gamma_j\left(z^k_j\right)$

Table 1: Notation quick reference

Proposition 1. (Strengthening firm non-expansiveness under strong convexity) For any $x, y \in R^d$, and any convex function $f : R^d \to R$ with strong convexity constant $\mu \ge 0$,

$$\left\langle x - y,\ p_{\gamma f}(x) - p_{\gamma f}(y)\right\rangle \ge (1 + \mu\gamma)\left\|p_{\gamma f}(x) - p_{\gamma f}(y)\right\|^2.$$

In operator theory this property is known as $(1+\mu\gamma)$-cocoerciveness of $p_{\gamma f}$.

Proposition 2. (Moreau decomposition) For any $x \in R^d$, and any convex function $f : R^d \to R$ with Fenchel conjugate $f^*$:

$$p_{\gamma f}(x) = x - \gamma\, p_{\frac{1}{\gamma} f^*}\left(x/\gamma\right). \qquad (1)$$

Recall our definition of $g_{\gamma f}(x) = \frac{1}{\gamma}\left(x - p_{\gamma f}(x)\right)$ also. After combining, the following relation thus holds between the proximal operator of the conjugate $f^*$ and $g_{\gamma f}$:

$$p_{\frac{1}{\gamma} f^*}\left(x/\gamma\right) = \frac{1}{\gamma}\left(x - p_{\gamma f}(x)\right) = g_{\gamma f}(x). \qquad (2)$$

Theorem 3. 
For any $x, y \in R^d$, and any convex $L$-smooth function $f : R^d \to R$:

$$\left\langle g_{\gamma f}(x) - g_{\gamma f}(y),\ x - y\right\rangle \ge \gamma\left(1 + \frac{1}{L\gamma}\right)\left\|g_{\gamma f}(x) - g_{\gamma f}(y)\right\|^2.$$

Proof. We will apply cocoerciveness of the proximal operator of $f^*$ as it appears in the decomposition. Note that $L$-smoothness of $f$ implies $1/L$-strong convexity of $f^*$. In particular we apply it to the points $\frac{1}{\gamma}x$ and $\frac{1}{\gamma}y$:

$$\left\langle p_{\frac{1}{\gamma}f^*}\left(\tfrac{1}{\gamma}x\right) - p_{\frac{1}{\gamma}f^*}\left(\tfrac{1}{\gamma}y\right),\ \tfrac{1}{\gamma}x - \tfrac{1}{\gamma}y\right\rangle \ge \left(1 + \frac{1}{L\gamma}\right)\left\|p_{\frac{1}{\gamma}f^*}\left(\tfrac{1}{\gamma}x\right) - p_{\frac{1}{\gamma}f^*}\left(\tfrac{1}{\gamma}y\right)\right\|^2.$$

Pulling $\frac{1}{\gamma}$ from the right side of the inner product out, and plugging in Equation 2, gives the result.

3.2 Notation

Let $x^*$ be the unique minimizer (due to strong convexity) of $f$. In addition to the notation used in the description of the algorithm, we also fix a set of subgradients $g^*_j$, one for each $f_j$ at $x^*$, chosen such that $\sum_{j=1}^n g^*_j = 0$. We also define $v_j = x^* + \gamma g^*_j$. Note that at the solution $x^*$, we want to apply a proximal step for component $j$ of the form:

$$x^* = \mathrm{prox}^\gamma_j\left(x^* + \gamma g^*_j\right) = \mathrm{prox}^\gamma_j\left(v_j\right).$$

Lemma 4. 
(Technical lemma needed by main proof) Under Algorithm 1, taking the expectation over the random choice of $j$, conditioned on $x^k$ and each $g^k_i$, allows us to bound the following inner product at step $k$:

$$E\left\langle \gamma\left[g^k_j - \frac{1}{n}\sum_{i=1}^n g^k_i\right] - \gamma g^*_j,\ \left(x^k - x^*\right) + \gamma\left[g^k_j - \frac{1}{n}\sum_{i=1}^n g^k_i\right] - \gamma g^*_j\right\rangle \le \gamma^2 \frac{1}{n}\sum_{i=1}^n \left\|g^k_i - g^*_i\right\|^2.$$

The proof is in the supplementary material.

3.3 Main result

Theorem 5. (Single-step Lyapunov descent) We define the Lyapunov function $T^k$ of our algorithm (Point-SAGA) at step $k$ as:

$$T^k = \frac{c}{n}\sum_{i=1}^n \left\|g^k_i - g^*_i\right\|^2 + \left\|x^k - x^*\right\|^2,$$

for $c = 1/\mu L$. Then using step size

$$\gamma = \frac{\sqrt{(n-1)^2 + 4n\frac{L}{\mu}}}{2Ln} - \frac{1 - \frac{1}{n}}{2L},$$

the expectation of $T^{k+1}$, over the random choice of $j$, conditioned on $x^k$ and each $g^k_i$, satisfies:

$$E\left[T^{k+1}\right] \le (1 - \kappa)\,T^k \qquad \text{for } \kappa = \frac{\mu\gamma}{1+\mu\gamma},$$

when each $f_i : R^d \to R$ is $L$-smooth and $\mu$-strongly convex and $0 < \mu < L$. This is the same Lyapunov function as used by Hofmann et al. [2015].

Proof. Term 1 of $T^{k+1}$ is straightforward to simplify:

$$E\left[\frac{c}{n}\sum_{i=1}^n \left\|g^{k+1}_i - g^*_i\right\|^2\right] = \left(1 - \frac{1}{n}\right)\frac{c}{n}\sum_{i=1}^n \left\|g^k_i - g^*_i\right\|^2 + \frac{c}{n}E\left\|g^{k+1}_j - g^*_j\right\|^2.$$

For term 2 of $T^{k+1}$ we start by applying cocoerciveness (Proposition 1):

$$(1+\mu\gamma)E\left\|x^{k+1} - x^*\right\|^2 = (1+\mu\gamma)E\left\|\mathrm{prox}^\gamma_j(z^k_j) - \mathrm{prox}^\gamma_j(v_j)\right\|^2 \le E\left\langle \mathrm{prox}^\gamma_j(z^k_j) - \mathrm{prox}^\gamma_j(v_j),\ z^k_j - v_j\right\rangle = E\left\langle x^{k+1} - x^*,\ z^k_j - v_j\right\rangle.$$

Now we add and subtract $x^k$:

$$E\left\langle x^{k+1} - x^k + x^k - x^*,\ z^k_j - v_j\right\rangle = E\left\langle x^k - x^*,\ z^k_j - v_j\right\rangle + E\left\langle x^{k+1} - x^k,\ z^k_j - v_j\right\rangle = \left\|x^k - x^*\right\|^2 + E\left\langle x^{k+1} - x^k,\ z^k_j - v_j\right\rangle,$$

where we have pulled out the quadratic term by using $E\left[z^k_j - v_j\right] = x^k - x^*$ (we can take the expectation since the left-hand side of the inner product doesn't depend on $j$). We now expand $E\left\langle x^{k+1} - x^k,\ z^k_j - v_j\right\rangle$ further, using $x^{k+1} = z^k_j - \gamma g^{k+1}_j$:

$$E\left\langle x^{k+1} - x^k,\ z^k_j - v_j\right\rangle = E\left\langle \gamma\left[g^k_j - \frac{1}{n}\sum_{i=1}^n g^k_i\right] - \gamma g^{k+1}_j,\ \left(x^k - x^*\right) + \gamma\left[g^k_j - \frac{1}{n}\sum_{i=1}^n g^k_i\right] - \gamma g^*_j\right\rangle. \qquad (3)$$

We further split the left side of the inner product to give two separate inner products:

$$= E\left\langle \gamma\left[g^k_j - \frac{1}{n}\sum_{i=1}^n g^k_i\right] - \gamma g^*_j,\ \left(x^k - x^*\right) + \gamma\left[g^k_j - \frac{1}{n}\sum_{i=1}^n g^k_i\right] - \gamma g^*_j\right\rangle + E\left\langle \gamma g^*_j - \gamma g^{k+1}_j,\ \left(x^k - x^*\right) + \gamma\left[g^k_j - \frac{1}{n}\sum_{i=1}^n g^k_i\right] - \gamma g^*_j\right\rangle. \qquad (4)$$

The first inner product in Equation 4 is the quantity we bounded in Lemma 4 by $\gamma^2\frac{1}{n}\sum_{i=1}^n\left\|g^k_i - g^*_i\right\|^2$. The second inner product in Equation 4 can be simplified using Theorem 3 (note the right side of the inner product is equal to $z^k_j - v_j$):

$$E\left\langle \gamma g^*_j - \gamma g^{k+1}_j,\ z^k_j - v_j\right\rangle \le -\gamma^2\left(1 + \frac{1}{L\gamma}\right)E\left\|g^{k+1}_j - g^*_j\right\|^2.$$

Combining these gives the following bound on $(1+\mu\gamma)E\left\|x^{k+1} - x^*\right\|^2$:

$$(1+\mu\gamma)E\left\|x^{k+1} - x^*\right\|^2 \le \left\|x^k - x^*\right\|^2 + \gamma^2\frac{1}{n}\sum_{i=1}^n\left\|g^k_i - g^*_i\right\|^2 - \gamma^2\left(1 + \frac{1}{L\gamma}\right)E\left\|g^{k+1}_j - g^*_j\right\|^2.$$

Define $\alpha = \frac{1}{1+\mu\gamma} = 1 - \kappa$, where $\kappa = \frac{\mu\gamma}{1+\mu\gamma}$. Now we multiply the above inequality through by $\alpha$ and combine with the rest of the Lyapunov function, giving:

$$E\left[T^{k+1}\right] \le T^k + \left(\alpha\gamma^2 - \frac{c}{n}\right)\frac{1}{n}\sum_{i=1}^n\left\|g^k_i - g^*_i\right\|^2 + \left(\frac{c}{n} - \alpha\gamma^2 - \frac{\alpha\gamma}{L}\right)E\left\|g^{k+1}_j - g^*_j\right\|^2 - \kappa\left\|x^k - x^*\right\|^2.$$

We want an $\alpha$ convergence rate, so we pull out the required terms:

$$E\left[T^{k+1}\right] \le \alpha T^k + \left(\alpha\gamma^2 + \kappa c - \frac{c}{n}\right)\frac{1}{n}\sum_{i=1}^n\left\|g^k_i - g^*_i\right\|^2 + \left(\frac{c}{n} - \alpha\gamma^2 - \frac{\alpha\gamma}{L}\right)E\left\|g^{k+1}_j - g^*_j\right\|^2.$$

Now to complete the proof we note that $c = 1/\mu L$ and $\gamma = \frac{\sqrt{(n-1)^2 + 4n\frac{L}{\mu}}}{2Ln} - \frac{1-\frac{1}{n}}{2L}$ ensure that both terms inside the round brackets are non-positive, 
giving $E\,T^{k+1} \le \alpha T^k$. These constants were found by equating the expressions in the brackets to zero, and solving with respect to the two unknowns, $\gamma$ and $c$. It is easy to verify that $\gamma$ is always positive, as a consequence of the condition number $L/\mu$ always being at least 1.

Corollary 6. (Smooth case) Chaining Theorem 5 gives a convergence rate for Point-SAGA at step $k$ under the constants given in Theorem 5 of:

$$E\left\|x^k - x^*\right\|^2 \le (1-\kappa)^k\,\frac{\mu + L}{\mu}\left\|x^0 - x^*\right\|^2,$$

if each $f_i : R^d \to R$ is $L$-smooth and $\mu$-strongly convex.

Theorem 7. (Non-smooth case) Suppose each $f_i : R^d \to R$ is $\mu$-strongly convex, $\left\|g^0_i - g^*_i\right\| \le B$ and $\left\|x^0 - x^*\right\| \le R$. Then after $k$ iterations of Point-SAGA with step size $\gamma = R/\left(B\sqrt{n}\right)$:

$$E\left\|\bar{x}^k - x^*\right\|^2 \le \frac{2\sqrt{n}\left(1 + \mu\left(R/\left(B\sqrt{n}\right)\right)\right)}{\mu k}RB,$$

where $\bar{x}^k = \frac{1}{k}E\sum_{t=1}^k x^t$. The proof of this theorem is included in the supplementary material.

4 Implementation

Care must be taken for efficient implementation, particularly in the sparse gradient case. We discuss the key points below. A fast Cython implementation is available on the author's website incorporating these techniques.

Proximal operators: For the most common binary classification and regression methods, implementing the proximal operator is straightforward. We include details of the computation of the proximal operators for the hinge, square and logistic losses in the supplementary material. The logistic loss does not have a closed-form proximal operator, however it may be computed very efficiently in practice using Newton's method on a 1D subproblem. 
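That 1D Newton solve for the logistic loss $f(x) = \log\left(1 + e^{-y\,a^\top x}\right)$ can be sketched as follows. This is our own illustrative code, not the paper's implementation; it assumes labels $y \in \{-1, +1\}$ and solves for the scalar $t = y\,a^\top x$ at the prox point.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def prox_logistic(gamma, z, a, y, iters=20):
    """prox of gamma*log(1 + exp(-y a.x)) at z (illustrative sketch).

    Stationarity gives x = z + gamma*y*sigmoid(-t)*a with t = y a.x,
    so t solves the 1D equation t = c + gamma*||a||^2*sigmoid(-t),
    where c = y a.z. We solve that equation with Newton's method."""
    s = float(a @ a)
    c = float(y * (a @ z))
    t = c
    for _ in range(iters):
        sig = sigmoid(-t)
        h = t - c - gamma * s * sig              # residual; monotone in t
        h_prime = 1.0 + gamma * s * sig * (1.0 - sig)
        t -= h / h_prime                         # Newton step on the scalar t
    return z + gamma * y * sigmoid(-t) * a
```

The returned point satisfies the prox optimality condition $\gamma\nabla f(x) + x - z = 0$ to high accuracy after a handful of Newton steps, since the 1D residual is monotone with derivative at least 1.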
For problems of a non-trivial dimensionality the cost of the dot products in the main step is much greater than the cost of the proximal operator evaluation. We also detail how to handle a quadratic regularizer within each term's prox operator, which has a closed form in terms of the unregularized prox operator.

Initialization: Instead of setting $g^0_i = f'_i(x^0)$ before commencing the algorithm, we recommend using $g^0_i = 0$ instead. This avoids the cost of an initial pass over the data. In practical effect this is similar to the SDCA initialization of each dual variable to 0.

5 Experiments

We tested our algorithm, which we call Point-SAGA, against SAGA [Defazio et al., 2014a], SDCA [Shalev-Shwartz and Zhang, 2013a], Pegasos/SGD [Shalev-Shwartz et al., 2011] and the Catalyst acceleration scheme [Lin et al., 2015]. SDCA was chosen as the inner algorithm for the Catalyst scheme as it doesn't require a step size, making it the most practical of the variants. Catalyst applied to SDCA is essentially the same algorithm as proposed in Shalev-Shwartz and Zhang [2013c]. A single inner epoch was used for each SDCA invocation. Accelerated MISO as well as the primal-dual FIG method [Lan and Zhou, 2015] were excluded as we wanted to test on sparse problems and they are not designed to take advantage of sparsity. The step-size parameter for each method ($\kappa$ for Catalyst-SDCA) was chosen using a grid search over powers of 2. The step size that gives the lowest error at the final epoch is used for each method.

We selected a set of commonly used datasets from the LIBSVM repository [Chang and Lin, 2011]. The pre-scaled versions were used when available. Logistic regression with L2 regularization was applied to each problem. The L2 regularization constant for each problem was set by hand to ensure $f$ was not in the big data regime $n \ge L/\mu$; as noted above, all the methods perform essentially the same when $n \ge L/\mu$. 
The constant used is noted beneath each plot. Open source code to exactly replicate the experimental results is available at https://github.com/adefazio/point-saga.

Algorithm scaling with respect to n: The key property that distinguishes accelerated FIG methods from their non-accelerated counterparts is their performance scaling with respect to the dataset size. For large datasets on well-conditioned problems we expect from the theory to see little difference between the methods. To this end, we ran experiments including versions of the datasets subsampled randomly without replacement at 10% and 5%, in order to show the scaling with $n$ empirically. The same amount of regularization was used for each subset.

Figure 1 shows the function value sub-optimality for each dataset-subset combination. We see that in general accelerated methods dominate the performance of their non-accelerated counterparts. Both SDCA and SAGA are much slower on some datasets comparatively than others. For example, SDCA is very slow on the 5 and 10% COVTYPE datasets, whereas both SAGA and SDCA are much slower than the accelerated methods on the AUSTRALIAN dataset. These differences reflect known properties of the two methods. SAGA is able to adapt to inherent strong convexity while SDCA can be faster on very well-conditioned problems.

There is no clear winner between the two accelerated methods; each gives excellent results on each problem. The Pegasos (stochastic gradient descent) algorithm, with its slower-than-linear rate, is a clear loser on each problem, appearing as an almost horizontal line on the log scale of these plots.

Non-smooth problems: We also tested the RCV1 dataset on the hinge loss. 
In general we did not expect an accelerated rate for this problem, and indeed we observe that Point-SAGA is roughly as fast as SDCA across the different dataset sizes.

(a) COVTYPE, $\mu = 2\times 10^{-6}$: 5%, 10%, 100% subsets
(b) AUSTRALIAN, $\mu = 10^{-4}$: 5%, 10%, 100% subsets
(c) MUSHROOMS, $\mu = 10^{-4}$: 5%, 10%, 100% subsets
(d) RCV1 with hinge loss, $\mu = 5\times 10^{-5}$: 5%, 10%, 100% subsets

Figure 1: Experimental results. [Each panel plots function suboptimality against epochs on a log scale, for Point-SAGA, Pegasos, SAGA, SDCA and Catalyst-SDCA; axis data omitted.]

References

Chih-Chung Chang and Chih-Jen Lin. 
LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014a.

Aaron Defazio, Tiberio Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. Proceedings of the 31st International Conference on Machine Learning, 2014b.

Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance reduced stochastic gradient descent with neighbors. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2296-2304. Curran Associates, Inc., 2015.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NIPS, 2013.

Jakub Konečný and Peter Richtárik. Semi-stochastic gradient descent methods. ArXiv e-prints, December 2013.

G. Lan and Y. Zhou. An optimal randomized incremental gradient method. ArXiv e-prints, July 2015.

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3366-3374. Curran Associates, Inc., 2015.

Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. Technical report, INRIA Grenoble Rhône-Alpes / LJK Laboratoire Jean Kuntzmann, 2014.

Yu. Nesterov. Introductory Lectures on Convex Programming. Springer, 1998.

Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. 
Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1574-1582. Curran Associates, Inc., 2014.

R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877-898, 1976.

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Technical report, INRIA, 2013.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 2013a.

Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 378-385. Curran Associates, Inc., 2013b.

Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Technical report, The Hebrew University, Jerusalem and Rutgers University, NJ, USA, 2013c.

Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.
", "award": [], "sourceid": 382, "authors": [{"given_name": "Aaron", "family_name": "Defazio", "institution": "Ambiata"}]}