{"title": "Natasha 2: Faster Non-Convex Optimization Than SGD", "book": "Advances in Neural Information Processing Systems", "page_first": 2675, "page_last": 2686, "abstract": "We design a stochastic algorithm to find $\\varepsilon$-approximate local minima of any smooth nonconvex function in rate $O(\\varepsilon^{-3.25})$, with only oracle access to stochastic gradients. The best result before this work was $O(\\varepsilon^{-4})$ by stochastic gradient descent (SGD).", "full_text": "Natasha 2: Faster Non-Convex Optimization Than\n\nSGD\n\nZeyuan Allen-Zhu\u2217\nMicrosoft Research AI\n\nzeyuan@csail.mit.edu\n\nAbstract\n\nWe design a stochastic algorithm to \ufb01nd \u03b5-approximate local minima of any\nsmooth nonconvex function in rate O(\u03b5\u22123.25), with only oracle access to stochas-\ntic gradients. The best result before this work was O(\u03b5\u22124) by stochastic gradient\ndescent (SGD).2\n\n1\n\nIntroduction\n\nIn diverse world of deep learning research has given rise to numerous architectures for neural net-\nworks (convolutional ones, long short term memory ones, etc). However, to this date, the underlying\ntraining algorithms for neural networks are still stochastic gradient descent (SGD) and its heuristic\nvariants. In this paper, we address the problem of designing a new algorithm that has provably faster\nrunning time than the best known result for SGD.\nMathematically, we study the problem of online stochastic nonconvex optimization:\n\n(cid:110)\n\n(cid:80)n\n\ni=1 fi(x)\n\n(cid:111)\n\nwhere both f (\u00b7) and each fi(\u00b7) can be nonconvex. We want to study\n\nminx\u2208Rd\n\nf (x) := Ei[fi(x)] = 1\n\nn\n\n(1.1)\n\nonline algorithms to \ufb01nd appx. local minimum of f (x).\n\nHere, we say an algorithm is online if its complexity is independent of n. This tackles the big-data\nscenarios when n is extremely large or even in\ufb01nite.3\nNonconvex optimization arises prominently in large-scale machine learning. 
Most notably, training deep neural networks corresponds to minimizing f(x) of this average structure: each training sample i corresponds to one loss function f_i(·) in the summation. This average structure allows one to perform stochastic gradient descent (SGD), which uses a random ∇f_i(x) —corresponding to computing backpropagation once— to approximate ∇f(x) and performs descent updates.

The standard goal of nonconvex optimization with provable guarantees is to find approximate local minima. This is not only because finding the global one is NP-hard, but also because there exists a rich literature on heuristics for turning a local-minima finding algorithm into a global one. This includes random seeding, graduated optimization [25], and others. Therefore, faster algorithms for finding approximate local minima translate into faster heuristic algorithms for finding the global minimum.

On a separate note, experiments [16, 17, 24] suggest that fast convergence to approximate local minima may be sufficient for training neural nets, while convergence to stationary points (i.e., points that may be saddle points) is not. In other words, we need to avoid saddle points.

*The full version of this paper can be found on https://arxiv.org/abs/1708.08694.
2When this manuscript first appeared online, the best rate was T = O(ε^{-4}) by SGD. Several follow-ups appeared after this paper. This includes stochastic cubic regularization [44], which gives T = O(ε^{-3.5}), and Neon+SCSG [10, 46], which gives T = O(ε^{-3.333}). These rates are worse than T = O(ε^{-3.25}).
3All of our results in this paper apply to the case when n is infinite, meaning f(x) = E_i[f_i(x)], because we focus on online methods.
However, we still introduce n to simplify notations.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Local minimum (left), saddle point (right) and its negative-curvature direction.

1.1 Classical Approach: Escaping Saddle Points Using Random Perturbation

One natural way to avoid saddle points is to use randomness to escape from them, whenever we meet one. For instance, Ge et al. [22] showed that, by injecting random perturbation, SGD will not be stuck at saddle points: whenever SGD moves into a saddle point, randomness shall help it escape. This partially explains why SGD performs well in deep learning.4 Jin et al. [27] showed that, equipped with random perturbation, full gradient descent (GD) also escapes from saddle points. Although this classical approach is easy to implement, we raise two main efficiency issues regarding it:
• Issue 1. If we want to escape from saddle points, is random perturbation the only way? Moving in a random direction is "blind" to the Hessian information of the function, and thus can we escape from saddle points faster?
• Issue 2. If we want to avoid saddle points, is it really necessary to first move close to saddle points and then escape from them? Can we design an algorithm that can somehow avoid saddle points without ever moving close to them?

1.2 Our Resolutions

Resolution to Issue 1: Efficient Use of Hessian. Mathematically, instead of using a random perturbation, the negative eigenvector of ∇²f(x) (a.k.a. the negative-curvature direction of f(·) at x) gives us a better direction to escape from saddle points. See Figure 1.

To make it concrete, suppose we apply the power method on ∇²f(x) to find its most negative eigenvector. If we run the power method for 0 iterations, then it gives us a totally random direction; if we run it for more iterations, then it converges to the most negative eigenvector of ∇²f(x). Unfortunately, applying the power method is unrealistic because f(x) = (1/n) Σ_i f_i(x) can possibly have infinitely many pieces.

We propose to use Oja's algorithm [37] to approximate the power method. Oja's algorithm can be viewed as an online variant of the power method, and requires only (stochastic) matrix-vector product computations. In our setting, this is the same as (stochastic) Hessian-vector products —namely, computing ∇²f_i(x) · w for arbitrary vectors w ∈ R^d and random indices i ∈ [n]. It is a known fact that computing Hessian-vector products is as cheap as computing stochastic gradients, and thus we can use Oja's algorithm to escape from saddle points.

Figure 2: Illustration of Natasha2^full — how to swing by a saddle point.
(a) move in a negative curvature direction if there is any (by applying Oja's algorithm)
(b) swing by a saddle point without entering its neighborhood (wishful thinking)
(c) swing by a saddle point using only stochastic gradients (by applying Natasha1.5^full)

4In practice, stochastic gradients naturally incur "random noise" and adding perturbation may not be needed.

Resolution to Issue 2: Swing by Saddle Points. If the function is sufficiently smooth,5 then any point close to a saddle point must have a negative curvature. Therefore, as long as we are close to saddle points, we can already use Oja's algorithm to find such a negative curvature, and move in its direction to decrease the objective; see Figure 2(a).

Therefore, we are left only with the case that the point is not close to any saddle point.
Using smoothness of f(·), this gives a "safe zone" near the current point, in which there is no strict saddle point; see Figure 2(b). Intuitively, we wish to use this property of the safe zone to design an algorithm that decreases the objective faster than SGD. Formally, f(·) inside this safe zone must be of "bounded nonconvexity," meaning that the eigenvalues of its Hessian are always greater than some negative threshold −σ (where σ depends on how long we run Oja's algorithm). Intuitively, the greater σ is, the more nonconvex f(x) is. We wish to design an (online) stochastic first-order method whose running time scales with σ.

Unfortunately, classical stochastic methods such as SGD or SCSG [30] cannot make use of this nonconvexity parameter σ. The only known ones that can make use of σ are offline algorithms. In this paper, we design a new stochastic first-order method Natasha1.5:

Theorem 1 (informal). Natasha1.5 finds x with ‖∇f(x)‖ ≤ ε in rate T = O(1/ε^3 + σ^{1/3}/ε^{10/3}).

Finally, we put Natasha1.5 together with Oja's to construct our final algorithm Natasha2:

Theorem 2 (informal). Natasha2 finds x with ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI in rate T = Õ(1/δ^5 + 1/(δε^3)). In particular, when δ ≥ ε^{1/4}, this gives T = Õ(1/ε^{3.25}).

In contrast, the convergence rate of SGD was T = Õ(poly(d) · ε^{-4}) [22].

1.3 Follow-Up Results

Since the original appearance of this work, there has been a lot of progress in stochastic nonconvex optimization. Most notably,
• If one swings by saddle points using Oja's algorithm and SGD variants (instead of Natasha1.5), the convergence rate is T = Õ(ε^{-3.5}) [5].
• If one applies SGD and only escapes from saddle points using Oja's algorithm, the convergence rate is T = Õ(ε^{-4}) [10, 46].
• If one applies SCSG and only escapes from saddle points using Oja's algorithm, the convergence rate is T = Õ(ε^{-3.333}) [10, 46].
• If one applies a stochastic version of cubic regularized Newton's method, the convergence rate is T = Õ(ε^{-3.5}) [44].
• If f(x) is of σ-bounded nonconvexity, the SGD4 method [5] gives rate T = Õ(ε^{-2} + σε^{-4}).

We include these results in Table 1 for a close comparison.

2 Preliminaries

Throughout this paper, we denote by ‖·‖ the Euclidean norm. We use i ∈_R [n] to denote that i is generated from [n] = {1, 2, . . . , n} uniformly at random. We denote by ∇f(x) the gradient of function f if it is differentiable, and by ∂f(x) any subgradient if f is only Lipschitz continuous. We denote by I[event] the indicator function of probabilistic events.

We denote by ‖A‖₂ the spectral norm of matrix A. For symmetric matrices A and B, we write A ⪰ B to indicate that A − B is positive semidefinite (PSD). Therefore, A ⪰ −σI if and only if all eigenvalues of A are no less than −σ. We denote by λ_min(A) and λ_max(A) the minimum and maximum eigenvalues of a symmetric matrix A.

Definition 2.1.
For a function f : R^d → R,
• f is σ-strongly convex if ∀x, y ∈ R^d, it satisfies f(y) ≥ f(x) + ⟨∂f(x), y − x⟩ + (σ/2)‖x − y‖².
• f is of σ-bounded nonconvexity (or σ-nonconvex for short) if ∀x, y ∈ R^d, it satisfies f(y) ≥ f(x) + ⟨∂f(x), y − x⟩ − (σ/2)‖x − y‖².6
• f is L-Lipschitz smooth (or L-smooth for short) if ∀x, y ∈ R^d, ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.
• f is second-order L₂-Lipschitz smooth (or L₂-second-order smooth for short) if ∀x, y ∈ R^d, it satisfies ‖∇²f(x) − ∇²f(y)‖₂ ≤ L₂‖x − y‖.

These definitions have other equivalent forms; see the textbook [33].

Definition 2.2. For a composite function F(x) = ψ(x) + f(x) where ψ(x) is proper convex, given a parameter η > 0, the gradient mapping of F(·) at point x is

    G_{F,η}(x) := (1/η)(x − x′)  where  x′ = arg min_y { ψ(y) + ⟨∇f(x), y⟩ + (1/2η)‖y − x‖² }.

In particular, if ψ(·) ≡ 0, then G_{F,η}(x) ≡ ∇f(x).

5As we shall see, smoothness is necessary for finding approximate local minima with provable guarantees.

                         | algorithm             | variance | Lipschitz | 2nd-order | gradient complexity T
                         |                       | bound    | smooth    | smooth    |
 convex only             | SGD1 [5, 23]          | needed   | needed    | no        | O(ε^{-2.667})  ♯
                         | SGD2 [5]              | needed   | needed    | no        | O(ε^{-2.5})  ♯
                         | SGD3 [5]              | needed   | needed    | no        | Õ(ε^{-2})  ♯
 approximate stationary  | SGD (folklore)        | needed   | needed    | no        | O(ε^{-4})  (see Appendix B)
 points                  | SCSG [30]             | needed   | needed    | no        | O(ε^{-3.333})
                         | Natasha1.5            | needed   | needed    | no        | O(ε^{-3} + σ^{1/3}ε^{-3.333})  (see Theorem 1)
                         | SGD4 [5]              | needed   | needed    | no        | Õ(ε^{-2} + σε^{-4})  ♯
 approximate local       | perturbed SGD [22]    | needed   | needed    | needed    | Õ(ε^{-4} · poly(d))
 minima                  | Natasha2              | needed   | needed    | needed    | Õ(ε^{-3.25})  (see Theorem 2)
                         | NEON + SGD [10, 46]   | needed   | needed    | needed    | Õ(ε^{-4})  ♯
                         | NEON + SCSG [10, 46]  | needed   | needed    | needed    | Õ(ε^{-3.333})  ♯
                         | cubic Newton [44]     | needed   | needed    | needed    | Õ(ε^{-3.5})  ♯
                         | SGD5 [5]              | needed   | needed    | needed    | Õ(ε^{-3.5})  ♯

Table 1: Comparison of online methods for finding ‖∇f(x)‖ ≤ ε. Following tradition, in these complexity bounds, we assume the variance and smoothness parameters are constants, and only show the dependency on n, d, ε and the bounded nonconvexity parameter σ ∈ (0, 1). We use ♯ to indicate results that appeared after this paper.

Remark 1. Variance bounds must be needed for online methods.
Remark 2. Lipschitz smoothness must be needed for achieving even approximate stationary points.
Remark 3. Second-order smoothness must be needed for achieving approximate local minima.

3 Natasha 1.5: Finding Approximate Stationary Points

We first make a detour to study how to find approximate stationary points using only first-order information. A point x ∈ R^d is an ε-approximate stationary point7 of f(x) if it satisfies ‖∇f(x)‖ ≤ ε.
Let gradient complexity T be the number of computations of ∇f_i(x).

6Previous authors also refer to this notion as "approximate convex", "almost convex", "hypo-convex", "semi-convex", or "weakly-convex." We call it σ-nonconvex to stress the point that σ can be as large as L (any L-smooth function is automatically L-nonconvex).
7Historically, in the first-order literature, x is called ε-approximate if ‖∇f(x)‖² ≤ ε; in the second-order literature, x is ε-approximate if ‖∇f(x)‖ ≤ ε. We adopt the latter notion, following Polyak and Nesterov [34, 36].

Figure 3: Comparison of first-order methods for finding ε-approximate stationary points of a σ-nonconvex function: (a) offline first-order methods, (b) online first-order methods. For simplicity, in the plots we let L = 1 and V = 1. The results SGD2/3/4 appeared after this work.

Before 2015, nonconvex first-order methods gave rise to two convergence rates. SGD converges in T = O(ε^{-4}) and GD converges in T = O(nε^{-2}). The proofs of both are simple (see Appendix B for completeness). In particular, the convergence of SGD relies on two minimal assumptions:

    f(x) has bounded variance V, meaning E_i[‖∇f_i(x) − ∇f(x)‖²] ≤ V, and    (A1)
    f(x) is L-Lipschitz smooth, meaning ‖∇f(x) − ∇f(y)‖ ≤ L · ‖x − y‖.    (A2')

Remark 3.1. Both assumptions are necessary to design online algorithms for finding stationary points.8 For offline algorithms —like GD— the first assumption is not needed.

Since 2016, the convergence rates have been improved to T = O(n + n^{2/3}ε^{-2}) for offline methods [6, 38], and to T = O(ε^{-10/3}) for online algorithms [30]. Both results are based on the SVRG (stochastic variance reduced gradient) method, and assume additionally (note (A2) implies (A2'))

    each f_i(x) is L-Lipschitz smooth.    (A2)

Lei et al. [30] gave their algorithm a new name, SCSG (stochastically controlled stochastic gradient).

Bounded Non-Convexity. In recent works [3, 13], it has been proposed to study a more refined convergence rate, by assuming that f(x) is of σ-bounded nonconvexity (or σ-nonconvex), meaning

    all the eigenvalues of ∇²f(x) lie in [−σ, L]    (A3)

for some σ ∈ (0, L]. This parameter σ is analogous to the strong-convexity parameter μ in convex optimization, where all the eigenvalues of ∇²f(x) lie in [μ, L] for some μ > 0.

In our illustrative process to "swing by a saddle point," the function inside the safe zone —see Figure 2(b)— is also of bounded nonconvexity. Since a larger σ means the function is "more nonconvex" and thus harder to optimize, can we design algorithms with gradient complexity T an increasing function of σ?

Remark 3.2. Most methods (SGD, SCSG, SVRG and GD) do not run faster in theory if σ < L.

In the offline setting, two methods are known to make use of the parameter σ. One is repeatSVRG, implicitly in [13] and formally in [3]. The other is Natasha1 [3]. repeatSVRG performs better when σ ≤ L/√n and Natasha1 performs better when σ ≥ L/√n. See Figure 3(a) and Table 2.

Before this work, no online method was known to take advantage of σ.

3.1 Our Theorem

We show that, under (A1), (A2) and (A3), one can non-trivially extend Natasha1 to an online version, taking advantage of σ, and achieving better complexity than SCSG.

Let Δf be any upper bound on f(x₀) − f(x*) where x₀ is the starting point.
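As a baseline for the rates discussed above, here is a minimal sketch of folklore SGD under (A1) and (A2') on a toy smooth nonconvex objective. The cosine losses, step sizes, and iteration counts are all illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 4
A = 0.5 * rng.standard_normal((n, d))       # rows a_i; ||a_i|| ~ 1 keeps f smooth

def f(x):
    # f(x) = (1/n) sum_i cos(a_i . x): smooth, nonconvex, with many stationary points
    return float(np.mean(np.cos(A @ x)))

def grad_f_i(x, i):
    # stochastic gradient oracle; (A1) holds since ||a_i|| is bounded
    return -np.sin(A[i] @ x) * A[i]

def sgd(x0, etas):
    x = x0.copy()
    for eta in etas:
        x = x - eta * grad_f_i(x, rng.integers(n))
    return x

x0 = 0.1 * np.ones(d)                       # near a strict local maximum of f
etas = [0.05] * 20000 + [0.01] * 10000      # a constant phase, then a smaller step
x_out = sgd(x0, etas)
```

With a constant step, SGD hovers near a stationary point at a noise floor proportional to the step size, which is why the step is shrunk in the second phase.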
In this section, to present the simplest results, we use the big-O notion to hide dependency in Δf and V. In Section 6, we shall add back such dependency, as well as support the existence of a proximal term. (That is, to minimize ψ(x) + f(x) where ψ(x) is a proper convex simple function.)

8For instance, if the variance V is unbounded, we cannot even tell if a point x satisfies ‖∇f(x)‖ ≤ ε using finite samples. Also, if f(x) is not Lipschitz smooth, it may contain sharp turning points (e.g., it behaves like the absolute value function |x|); in this case, finding ‖∇f(x)‖ ≤ ε can be as hard as finding ‖∇f(x)‖ = 0, and is NP-hard in general.

Algorithm 1 Natasha1.5(F, x∅, B, T′, α)
Input: f(·) = (1/n) Σ_{i=1}^n f_i(x), starting vector x∅, epoch length B ∈ [n], epoch count T′ ≥ 1, learning rate α > 0.
 1: x̂ ← x∅;  p ← Θ((σ/εL)^{2/3});  m ← B/p;  X ← [];
 2: for k ← 1 to T′ do    ▷ T′ epochs, each of length B
 3:    x̃ ← x̂;  μ ← (1/B) Σ_{i∈S} ∇f_i(x̃) where S is a uniform random subset of [n] with |S| = B;
 4:    for s ← 0 to p − 1 do    ▷ p sub-epochs, each of length m
 5:       x₀ ← x̂;  X ← [X, x̂];
 6:       for t ← 0 to m − 1 do
 7:          ∇̃ ← ∇f_i(x_t) − ∇f_i(x̃) + μ + 2σ(x_t − x̂) where i ∈_R [n];
 8:          x_{t+1} ← x_t − α∇̃;
 9:       end for
10:       x̂ ← a random choice from {x₀, x₁, . . . , x_{m−1}};    ▷ in practice, choose the average
11:    end for
12: end for
13: ŷ ← a random vector in X.    ▷ in practice, simply return ŷ
14: g(x) := f(x) + σ‖x − ŷ‖², and use convex SGD to minimize g(x) for T_sgd = T′B iterations.
15: return x_out ← the output of SGD.

Under such simplified notations, our main theorem can be stated as follows.

Theorem 1 (simple). Under (A1), (A2) and (A3), using the big-O notion to hide dependency in Δf and V, for every ε ∈ (0, σ/L], letting

    B = Θ(1/ε²),  T = Θ(L^{2/3}σ^{1/3}/ε^{10/3})  and  α = Θ(ε^{4/3}/(σ^{1/3}L^{2/3})),

we have that Natasha1.5(f, x∅, B, T/B, α) outputs a point x_out with E[‖∇f(x_out)‖] ≤ ε, and needs O(T) computations of stochastic gradients. (See also Figure 3(b).)

We emphasize that the additional factor σ^{1/3} in the numerator of T shall become our key to achieving a faster algorithm for finding approximate local minima in Section 4. Also, if the requirement ε ≤ σ/L is not satisfied, one can replace σ with εL; accordingly, T becomes O(L/ε³).

We note that the SGD4 method of [5] (which appeared after this paper) achieves T = O(L/ε² + σ/ε⁴). It is better than Natasha1.5 only when σ ≤ εL.
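Algorithm 1 can be exercised end to end on a toy problem. The sketch below is a hedged implementation: all parameter values are illustrative rather than the Θ(·) choices of Theorem 1, and the final convex-SGD stage (lines 13–15) is simplified to returning the last sub-epoch center.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 4
A = 0.5 * rng.standard_normal((n, d))        # rows a_i define the toy losses below

def f(x):
    # f(x) = (1/n) sum_i cos(a_i . x): smooth, nonconvex
    return float(np.mean(np.cos(A @ x)))

def grad_f_i(x, i):
    # stochastic gradient oracle
    return -np.sin(A[i] @ x) * A[i]

def natasha15(x0, B=50, n_epochs=60, p=5, alpha=0.05, sigma=0.3):
    """Illustrative sketch of Algorithm 1 (Natasha1.5)."""
    m = B // p
    xhat = x0.copy()
    for _ in range(n_epochs):                 # T' epochs, each of length B
        xtilde = xhat.copy()
        S = rng.choice(n, size=B, replace=False)
        mu = np.mean([grad_f_i(xtilde, i) for i in S], axis=0)   # batch snapshot gradient
        for _ in range(p):                    # p sub-epochs, each of length m
            xs = [xhat.copy()]
            for t in range(m):
                i = rng.integers(n)
                # variance-reduced estimator plus the 2*sigma retraction term
                g = grad_f_i(xs[t], i) - grad_f_i(xtilde, i) + mu + 2 * sigma * (xs[t] - xhat)
                xs.append(xs[t] - alpha * g)
            xhat = xs[rng.integers(m)]        # random iterate among x_0 .. x_{m-1}
    return xhat

x0 = 0.1 * np.ones(d)
x_out = natasha15(x0)
```

The retraction term pulls each sub-epoch's iterates toward its center x̂, which is what keeps the points of a sub-epoch close to each other in the analysis sketched in Section 3.2.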
We compare them in Figure 3(b), and emphasize that it is necessary to use Natasha1.5 (rather than SGD4) to design Natasha2 of the next section.

Extension. In fact, we show Theorem 1 in a more general proximal setting. That is, to minimize F(x) := f(x) + ψ(x) where ψ(x) is a proper convex function that can be non-smooth. For instance, if ψ(x) is the indicator function of a convex set, then Problem (1.1) becomes constrained minimization; and if ψ(x) = ‖x‖₁, we encourage sparsity. At a first reading of its proof, one can assume ψ(x) ≡ 0.

3.2 Our Intuition

We first recall the main idea of the SVRG method [28, 48], which is an offline algorithm. SVRG divides iterations into epochs, each of length n. It maintains a snapshot point x̃ for each epoch, and computes the full gradient ∇f(x̃) only for snapshots. Then, in each iteration t at point x_t, SVRG defines the gradient estimator ∇̃f(x_t) := ∇f_i(x_t) − ∇f_i(x̃) + ∇f(x̃), which satisfies E_i[∇̃f(x_t)] = ∇f(x_t), and performs the proximal update x_{t+1} ← x_t − α∇̃f(x_t) for learning rate α.

For minimizing non-convex functions, SVRG does not take advantage of the parameter σ even if the learning rate can be adapted to σ. This is because SVRG (and in fact SGD and GD too) relies on a gradient-descent analysis to argue for objective decrease per iteration. This is blind to σ.9

The prior work Natasha1 takes advantage of σ. Natasha1 is similar to SVRG, but it further divides each epoch into sub-epochs, each with a starting vector x̂. Then, it replaces ∇̃f(x_t) with ∇̃f(x_t) + 2σ(x_t − x̂). This is equivalent to replacing f(x) with f(x) + σ‖x − x̂‖², where the center x̂ changes every sub-epoch. We view this additional term 2σ(x_t − x̂) as a type of retraction. Conceptually, it stabilizes the algorithm by moving a bit in the backward direction. Technically, it enables us to perform only a mirror-descent type of analysis, and thus bypass the issue of SVRG.

Our Algorithm. Both SVRG and Natasha1 are offline methods, because the gradient estimator requires the full gradient computation ∇f(x̃) at snapshots x̃. A natural fix —originally studied by practitioners but first formally analyzed by Lei et al. [30]— is to replace the computation of ∇f(x̃) with (1/|S|) Σ_{i∈S} ∇f_i(x̃), for a random batch S ⊆ [n] with fixed cardinality B := |S| ≪ n. This allows us to shorten the epoch length from n to B, thus turning SVRG and Natasha1 into online methods.

How large should we pick B? By the Chernoff bound, we wish B ≈ 1/ε² because our desired accuracy is ε. One can thus hope to replace the parameter n in the complexities of SVRG and Natasha1.5 with B ≈ 1/ε² (ignoring the dependency on L):

    T = O(n + n^{2/3}/ε²)  and  T = O(n + n^{1/2}/ε² + σ^{1/3}n^{2/3}/ε²).

This "wishful thinking" gives

    T = O(ε^{-10/3})  and  T = O(ε^{-3} + σ^{1/3}ε^{-10/3}).

These are exactly the results achieved by SCSG [30] and to be achieved by our new Natasha1.5.

Unfortunately, the Chernoff bound itself is not sufficient for getting such rates. Let

    e := (1/|S|) Σ_{i∈S} ∇f_i(x̃) − ∇f(x̃)

denote the bias of this new gradient estimator. When performing iterative updates, this bias e gives rise to two types of error terms: "first-order error" terms —of the form ⟨e, x − y⟩— and a "second-order error" term ‖e‖². The Chernoff bound ensures that the second-order error E_S[‖e‖²] ≤ ε² is bounded. However, the first-order error terms are the true bottlenecks.

In the offline method SCSG, Lei et al. [30] carefully performed updates so that all "first-order errors" cancel out. To the best of our knowledge, this analysis cannot take advantage of σ even if the algorithm knows σ. (Again, for experts, this is because SCSG is based on a gradient-descent type of analysis but not mirror-descent.)

In Natasha1.5, we use the aforementioned retraction to ensure that all points in a single sub-epoch are close to each other (based on a mirror-descent type of analysis). Then, we use Young's inequality to bound ⟨e, x − y⟩ by (1/2)‖e‖² + (1/2)‖x − y‖². In this bound, ‖e‖² is already controlled by Chernoff concentration, and ‖x − y‖² can also be bounded as long as x and y are within the same sub-epoch. This summarizes the high-level technical contribution of Natasha1.5.

We formally state Natasha1.5 in Algorithm 1, and it uses big-O notions to hide dependency in L, Δf, and V.

9These results argue for objective decrease per iteration, of the form f(x_t) − f(x_{t+1}) ≥ (α/2)‖∇f(x_t)‖² − (α²L/2) · E[‖∇f(x_t) − ∇̃f(x_t)‖²]. Unlike mirror-descent analysis, this inequality cannot take advantage of the
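The claim that a batch of size B controls the second-order error, E_S[‖e‖²] = O(V/B), can be checked numerically. In the sketch below, random vectors stand in for the per-sample gradients ∇f_i(x̃); this illustrates only the concentration step, not Natasha1.5 itself:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 2000, 10
G = rng.standard_normal((n, d))              # stand-ins for per-sample gradients grad f_i(xtilde)
g_full = G.mean(axis=0)                      # the exact snapshot gradient grad f(xtilde)

def mean_sq_bias(B, trials=300):
    """Empirical E[||e||^2] for e = (1/B) sum_{i in S} grad f_i - grad f."""
    tot = 0.0
    for _ in range(trials):
        S = rng.choice(n, size=B, replace=False)
        e = G[S].mean(axis=0) - g_full
        tot += float(e @ e)
    return tot / trials

e100, e400 = mean_sq_bias(100), mean_sq_bias(400)
```

Quadrupling B cuts the mean squared bias by roughly a factor of four (slightly more here because sampling is without replacement), which is why B ≈ 1/ε² suffices to drive E[‖e‖²] below ε².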
The more general code to take care of the proximal term is in Algorithm 3 of Section 6.

4 Natasha 2: Finding Approximate Local Minima

Stochastic gradient descent (SGD) finds approximate local minima [22], under (A1), (A2) and an additional assumption (A4):

    f(x) is second-order L₂-Lipschitz smooth, meaning ‖∇²f(x) − ∇²f(y)‖₂ ≤ L₂ · ‖x − y‖.    (A4)

Remark 4.1. (A4) is necessary to make the task of finding approximate local minima meaningful, for the same reason Lipschitz smoothness was needed for finding stationary points.

Definition 4.2. We say x is an (ε, δ)-approximate local minimum of f(x) if10

    ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI,

or an ε-approximate local minimum if it is an (ε, ε^{1/C})-approximate local minimum for a constant C ≥ 1.

9(continued) bounded nonconvexity parameter of f(x). For readers interested in the difference between gradient and mirror descent, see [11].
10The notion "∇²f(x) ⪰ −δI" means all the eigenvalues of ∇²f(x) are above −δ.

Before our work, Ge et al. [22] was the only result giving a provable online complexity for finding approximate local minima. Other previous results, including SVRG, SCSG, Natasha1, and even Natasha1.5, do not find approximate local minima and may be stuck at saddle points.11 Ge et al. [22] showed that, hiding factors that depend on L, L₂ and V, SGD finds an ε-approximate local minimum of f(x) in gradient complexity T = O(poly(d)ε^{-4}). This ε^{-4} factor seems necessary since SGD needs T ≥ Ω(ε^{-4}) for just finding stationary points (see Appendix B and Table 1).

Remark 4.3. Offline methods are often studied under (ε, ε^{1/2})-approximate local minima. In the online setting, Ge et al. [22] used (ε, ε^{1/4})-approximate local minima, thus giving T = O(poly(d)/ε⁴ + poly(d)/δ¹⁶). In general, it is better to treat ε and δ separately to be more general; but nevertheless, (ε, ε^{1/C})-approximate local minima are always better than ε-approximate stationary points.

4.1 Our Theorem

We propose a new method Natasha2^full which, very informally speaking, alternatively
• finds approximate stationary points of f(x) using Natasha1.5, or
• finds a negative curvature of the Hessian ∇²f(x), using Oja's online eigenvector algorithm.

In this section, we define the gradient complexity T to be the number of stochastic gradient computations plus Hessian-vector products. Let Δf be any upper bound on f(x₀) − f(x*) where x₀ is the starting point. In this section, to present the simplest results, we use the big-O notion to hide dependency in L, L₂, Δf, and V. In Section 7, we shall add back such dependency for a more general description of the algorithm. Our main result can be stated as follows:

Theorem 2 (informal). Under (A1), (A2) and (A4), for any ε ∈ (0, 1) and δ ∈ (0, ε^{1/4}), Natasha2(f, y₀, ε, δ) outputs a point x_out so that, with probability at least 2/3:

    ‖∇f(x_out)‖ ≤ ε and ∇²f(x_out) ⪰ −δI.

Furthermore, its gradient complexity is T = Õ(1/δ⁵ + 1/(δε³)).12

Remark 4.4. If δ > ε^{1/4}, we can replace it with δ = ε^{1/4}. Therefore, T = Õ(1/δ⁵ + 1/(δε³) + 1/ε^{3.25}).

Corollary 4.6. T = Õ(ε^{-3.25}) for finding (ε, ε^{1/4})-approximate local minima. This is better than T = O(ε^{-10/3}) of SCSG for finding only ε-approximate stationary points.

Corollary 4.7. T = Õ(ε^{-3.5}) for finding (ε, ε^{1/2})-approximate local minima. This was not known before and is matched by several follow-up works using different algorithms [5, 10, 44, 46].

Remark 4.5. The follow-up work [10] replaced Hessian-vector products in Natasha2 with only stochastic gradient computations, turning Natasha2 into a pure first-order method.
4.2 Our Intuition

It is known that the problem of finding (ε, δ)-approximate local minima, at a high level, "reduces" to (repeatedly) finding ε-approximate stationary points of an O(δ)-nonconvex function [1, 13]. Specifically, Carmon et al. [13] proposed the following procedure. In every iteration at point y_k, detect whether the minimum eigenvalue of ∇²f(y_k) is below −δ:
• if yes, find the minimum eigenvector of ∇²f(y_k) approximately and move in this direction;
• if no, let F_k(x) := f(x) + L·(max{0, ‖x − y_k‖ − δ/L_2})², which can be proven to be 5L-smooth and 3δ-nonconvex; then find an ε-approximate stationary point of F_k(x) and move there. Intuitively, F_k(x) penalizes us for moving out of the "safe zone" {x : ‖x − y_k‖ ≤ δ/L_2}.

¹¹ These methods are based on the "variance reduction" technique to reduce the random noise of SGD. They have been criticized by practitioners for performing more poorly than SGD on training neural networks, because the noise of SGD allows it to escape from saddle points. Variance-reduction based methods have less noise and thus cannot escape from saddle points.
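As an illustration only, the procedure above can be mimicked offline in a few lines: an exact eigensolver stands in for the approximate eigenvector step, and plain gradient descent on F_k stands in for the stationary-point subroutine (all names, step sizes, and constants here are ours, not the paper's):

```python
import numpy as np

def find_local_min(f, grad, hess, y0, eps, delta, L, L2, max_iter=100):
    """Toy offline version of the Carmon et al. reduction sketched above."""
    y = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        lam, vecs = np.linalg.eigh(hess(y))       # eigenvalues in ascending order
        if lam[0] < -delta:
            # negative curvature: move along the bottom eigenvector,
            # picking the sign that decreases f more
            v = vecs[:, 0]
            y = min([y + (delta / L2) * v, y - (delta / L2) * v], key=f)
        else:
            # F_k(x) = f(x) + L * max(0, ||x - y|| - delta/L2)^2
            x = y.copy()
            for _ in range(2000):
                r = np.linalg.norm(x - y)
                pen = 2.0 * L * max(0.0, r - delta / L2)
                g = grad(x) + (pen / max(r, 1e-12)) * (x - y)
                if np.linalg.norm(g) <= eps:
                    break
                x -= g / (5.0 * L)
            if np.linalg.norm(grad(x)) <= eps:
                return x          # stationary point found inside the safe zone
            y = x                 # moved out of the safe zone: f decreased
    return y
```

For example, on f(x, y) = x² − y² + y⁴/4, starting at the saddle (0, 0), this loop first takes negative-curvature steps and then converges to the local minimum near (0, ±√2).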
¹² Throughout this paper, we use the Õ notion to hide at most one logarithmic factor in all the parameters (namely, n, d, L, L_2, V, 1/ε, 1/δ).

Algorithm 2 Natasha2(f, y_0, ε, δ)

Input: function f(x) = (1/n)·Σ_{i=1}^n f_i(x), starting vector y_0, target accuracies ε > 0 and δ > 0.
1: if ε^{1/3}/δ ≥ 1 then L̃ = σ̃ ← Θ(ε^{1/3}/δ) ≥ 1; ▷ the boundary case for large L_2
2: else L̃ ← 1 and σ̃ ← Θ(ε/δ^3) ∈ [δ, 1].
3: X ← [];
4: for k ← 0 to ∞ do
5:  apply Oja's algorithm for Θ̃(1/δ^2) iterations to find a minimum eigenvector v of ∇²f(y_k). ▷ see Lemma 5.3
6:  if a vector v ∈ R^d is found s.t. v^⊤∇²f(y_k)v ≤ −δ/2 then
7:   y_{k+1} ← y_k ± (δ/L_2)·v, where the sign is random.
8:  else ▷ in this case, ∇²f(y_k) ⪰ −δI
9:   F_k(x) := f(x) + L·(max{0, ‖x − y_k‖ − δ/L_2})²; ▷ F_k(·) is L̃-smooth and σ̃-nonconvex
10:   run Natasha1.5(F_k, y_k, Θ(ε^{-2}), 1, Θ(εδ)).
11:   let ŷ_k, y_{k+1} be the vectors ŷ and x̂ when Line 13 is reached in Natasha1.5. ▷ in practice, simply output ŷ_k
12:   X ← [X, (y_k, ŷ_k)];
13:  end if
14:  break the for loop if we have performed Θ(1/(δε)) first-order steps.
15: end for
16: (y, ŷ) ← a random pair in X.
17: define the convex function g(x) := f(x) + L·(max{0, ‖x − y‖ − δ/L_2})² + σ̃·‖x − ŷ‖².
18: use SGD to minimize g(x) for Θ̃(1/ε^2) steps and output x_out.

Previously, it was thought necessary to achieve high accuracy for both tasks above. This is why researchers had only been able to design offline methods: in particular, the shift-and-invert method [21] was applied to find the minimum eigenvector, and repeatSVRG was applied to find a stationary point of F_k(x).¹³

In this paper, we instead apply efficient online algorithms to the two tasks: Oja's algorithm (see Section 5.1) for finding minimum eigenvectors, and our new Natasha1.5 algorithm (see Section 3.2) for finding stationary points. More specifically, with Oja's algorithm we only decide whether there is an eigenvalue below the threshold −δ/2, or conclude that the Hessian has all eigenvalues above −δ. This can be done in an online fashion using O(δ^{-2}) Hessian-vector products (with high probability).
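For intuition, here is a minimal Oja-style sketch that uses one Hessian-vector product per step: it runs power iteration on I − ηH, whose top eigenvector (for small enough η) is the bottom eigenvector of H. The names, the fixed step size, and the offline test are ours; the analyzed online version appears in Section 5.1:

```python
import numpy as np

def oja_min_eigvec(hvp, d, steps, eta, seed=None):
    """Estimate a most-negative-curvature direction of the Hessian H,
    given only a Hessian-vector product oracle hvp(u) = H @ u."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(steps):
        w -= eta * hvp(w)          # w <- (I - eta*H) w
        w /= np.linalg.norm(w)     # amplifies H's most negative eigendirection
    return w

# offline check on a fixed Hessian with eigenvalues {1, -1}
H = np.diag([1.0, -1.0])
v = oja_min_eigvec(lambda u: H @ u, d=2, steps=400, eta=0.05, seed=0)
print(v @ H @ v)                   # close to -1: negative curvature certified
```

The returned v is then tested exactly as on Line 6 of Algorithm 2: accept it only if v^⊤Hv ≤ −δ/2.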
As for Natasha1.5, we only apply it for a single epoch of length B = Θ(ε^{-2}). Conceptually, this makes the above procedure online, with a complexity independent of n.

Unfortunately, technical issues arise in this "wishful thinking." Most notably, the above process finishes only if Natasha1.5 finds an approximate stationary point x of F_k(x) that is also inside the safe zone {x : ‖x − y_k‖ ≤ δ/L_2}. This is because F_k(x) = f(x) inside the safe zone, and therefore ‖∇F_k(x)‖ ≤ ε also implies ‖∇f(x)‖ ≤ 2ε.

What can we do if we move out of the safe zone? To tackle this case, we show an additional property of Natasha1.5 (see Lemma 6.5): the objective decrease, i.e., f(y_k) − f(x) when x moves out of the safe zone, must be proportional to the squared distance ‖x − y_k‖² traveled in space. Therefore, if x moves out of the safe zone, we decrease the objective sufficiently, which is also a good case. This summarizes some of the high-level technical ingredients of Natasha2.

We formally state Natasha2 in Algorithm 2; it uses the big-O notion to hide dependency on L, L_2, V and Δ_f. The more general code, taking care of all the parameters, can be found in Algorithm 5 of Section 7.

Finally, we stress that although we borrowed the construction f(x) + L·(max{0, ‖x − y_k‖ − δ/L_2})² from the offline algorithm of Carmon et al. [13], our Natasha2 algorithm and its analysis differ from theirs in all other aspects.

¹³ repeatSVRG is an offline algorithm that finds an ε-approximate stationary point of a function f(x) that is σ-nonconvex. It is divided into stages. In each stage t, it considers the modified function f_t(x) := f(x) + σ‖x − x_t‖², and applies the accelerated SVRG method to minimize f_t(x). It then moves to x_{t+1}, a sufficiently accurate minimizer of f_t(x).
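The proximal construction used by repeatSVRG (footnote 13) can be sketched in a few lines: adding σ‖x − x_t‖² to a σ-nonconvex f makes each stage objective σ-strongly convex. In the toy sketch below, plain gradient steps stand in for accelerated SVRG, and all names and constants are ours:

```python
import numpy as np

def repeat_svrg_sketch(grad, x0, sigma, eps, stages=50, inner=2000, lr=0.01):
    """Outline of repeatSVRG: stage t minimizes the strongly convex surrogate
    f_t(x) = f(x) + sigma * ||x - x_t||^2, then recenters at its minimizer."""
    x = np.asarray(x0, dtype=float)
    for _ in range(stages):
        center = x.copy()
        for _ in range(inner):                     # stand-in for accelerated SVRG
            x = x - lr * (grad(x) + 2.0 * sigma * (x - center))
        if np.linalg.norm(grad(x)) <= eps:         # eps-approximate stationary point
            return x
    return x

# f(x) = x^4/4 - x^2/2 is 1-nonconvex (f'' >= -1); grad f(x) = x^3 - x
grad = lambda x: x**3 - x
x = repeat_svrg_sketch(grad, x0=[0.1], sigma=1.0, eps=1e-3)
print(np.linalg.norm(grad(x)) <= 1e-3)   # True: converged to a stationary point
```

Each stage is an easy strongly convex problem, which is exactly why the scheme is fast per stage but inherently offline as used in prior work.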
References

[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding Approximate Local Minima for Nonconvex Optimization in Linear Time. In STOC, 2017.

[2] Zeyuan Allen-Zhu. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. In STOC, 2017.

[3] Zeyuan Allen-Zhu. Natasha: Faster Non-Convex Stochastic Optimization via Strongly Non-Convex Parameter. In ICML, 2017.

[4] Zeyuan Allen-Zhu. Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization. In ICML, 2018.

[5] Zeyuan Allen-Zhu. How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD. In NeurIPS, 2018.

[6] Zeyuan Allen-Zhu and Elad Hazan. Variance Reduction for Faster Non-Convex Optimization. In ICML, 2016.

[7] Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain. In NeurIPS, 2016.

[8] Zeyuan Allen-Zhu and Yuanzhi Li. First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate. In FOCS, 2017.

[9] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the Compressed Leader: Faster Online Learning of Eigenvectors and Faster MMWU. In ICML, 2017.

[10] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding Local Minima via First-Order Oracles. In NeurIPS, 2018.

[11] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In Proceedings of the 8th Innovations in Theoretical Computer Science, ITCS '17, 2017.

[12] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, Efficient, and Neural Algorithms for Sparse Coding. In COLT, 2015.
[13] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated Methods for Non-Convex Optimization. ArXiv e-prints, abs/1611.00756, November 2016.

[14] Yair Carmon, Oliver Hinder, John C. Duchi, and Aaron Sidford. "Convex Until Proven Guilty": Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions. In ICML, 2017.

[15] Yuxin Chen and Emmanuel Candes. Solving random quadratic systems of equations is nearly as easy as solving linear systems. In Advances in Neural Information Processing Systems, pages 739–747, 2015.

[16] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

[17] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NeurIPS, pages 2933–2941, 2014.

[18] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In NeurIPS, 2014.

[19] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[20] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, 2015.

[21] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. In ICML, 2016.

[22] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of the 28th Annual Conference on Learning Theory, COLT 2015, 2015.

[23] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, pages 1–26, February 2015. ISSN 0025-5610.

[24] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. ArXiv e-prints, December 2014.

[25] Elad Hazan, Kfir Yehuda Levy, and Shai Shalev-Shwartz. On graduated optimization for stochastic non-convex problems. In International Conference on Machine Learning, pages 1833–1841, 2016.

[26] Xi He, Dheevatsa Mudigere, Mikhail Smelyanskiy, and Martin Takáč. Distributed Hessian-Free Optimization for Deep Neural Network. ArXiv e-prints, abs/1606.00511, June 2016.

[27] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to Escape Saddle Points Efficiently. In ICML, 2017.

[28] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, NeurIPS 2013, pages 315–323, 2013.

[29] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ArXiv e-prints, abs/1412.6980, December 2014.

[30] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Nonconvex Finite-Sum Optimization Via SCSG Methods. In NeurIPS, 2017.

[31] Yuanzhi Li and Yang Yuan. Convergence Analysis of Two-layer Neural Networks with ReLU Activation. In NeurIPS, 2017.

[32] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A Universal Catalyst for First-Order Optimization. In NeurIPS, 2015.

[33] Yurii Nesterov. Introductory Lectures on Convex Programming Volume I: A Basic Course. Kluwer Academic Publishers, 2004. ISBN 1402075537.

[34] Yurii Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, 2008.

[35] Yurii Nesterov. How to make the gradients small. Optima, 88:10–11, 2012.

[36] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[37] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.

[38] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In ICML, 2016.

[39] Sashank J. Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, and Alexander J. Smola. A generic approach for escaping saddle points. ArXiv e-prints, abs/1709.01434, September 2017.

[40] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. ArXiv e-prints, abs/1309.2388, September 2013.

[41] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012. ISSN 1935-8237.

[42] Shai Shalev-Shwartz. SDCA without Duality, Regularization, and Individual Convexity. In ICML, 2016.

[43] Ruoyu Sun and Zhi-Quan Luo. Guaranteed Matrix Completion via Nonconvex Factorization. In FOCS, 2015.

[44] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan. Stochastic Cubic Regularization for Fast Nonconvex Optimization. ArXiv e-prints, abs/1711.02838, November 2017.

[45] Lin Xiao and Tong Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[46] Yi Xu and Tianbao Yang. First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time. ArXiv e-prints, abs/1711.01944, November 2017.

[47] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. ArXiv e-prints, abs/1212.5701, December 2012.

[48] Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, pages 980–988, 2013.