{"title": "Third-order Smoothness Helps: Faster Stochastic Optimization Algorithms for Finding Local Minima", "book": "Advances in Neural Information Processing Systems", "page_first": 4525, "page_last": 4535, "abstract": "We propose stochastic optimization algorithms that can find local minima faster than existing algorithms for nonconvex optimization problems, by exploiting the third-order smoothness to escape non-degenerate saddle points more efficiently. More specifically, the proposed algorithm only needs $\\tilde{O}(\\epsilon^{-10/3})$ stochastic gradient evaluations to converge to an approximate local minimum $\\mathbf{x}$, which satisfies $\\|\\nabla f(\\mathbf{x})\\|_2\\leq\\epsilon$ and $\\lambda_{\\min}(\\nabla^2 f(\\mathbf{x}))\\geq -\\sqrt{\\epsilon}$ in unconstrained stochastic optimization, where $\\tilde{O}(\\cdot)$ hides logarithm polynomial terms and constants. This improves upon the $\\tilde{O}(\\epsilon^{-7/2})$ gradient complexity achieved by the state-of-the-art stochastic local minima finding algorithms by a factor of $\\tilde{O}(\\epsilon^{-1/6})$. 
Experiments on two nonconvex optimization problems demonstrate the effectiveness of our algorithm and corroborate our theory.", "full_text": "Third-order Smoothness Helps: Faster Stochastic Optimization Algorithms for Finding Local Minima\n\nYaodong Yu*\nDepartment of Computer Science\nUniversity of Virginia\nCharlottesville, VA 22904\nyy8ms@virginia.edu\n\nPan Xu*\nDepartment of Computer Science\nUniversity of California, Los Angeles\nLos Angeles, CA 90095\npanxu@cs.ucla.edu\n\nQuanquan Gu\nDepartment of Computer Science\nUniversity of California, Los Angeles\nLos Angeles, CA 90095\nqgu@cs.ucla.edu\n\nAbstract\n\nWe propose stochastic optimization algorithms that can find local minima faster than existing algorithms for nonconvex optimization problems, by exploiting the third-order smoothness to escape non-degenerate saddle points more efficiently. More specifically, the proposed algorithm only needs $\\tilde{O}(\\epsilon^{-10/3})$ stochastic gradient evaluations to converge to an approximate local minimum $\\mathbf{x}$, which satisfies $\\|\\nabla f(\\mathbf{x})\\|_2 \\leq \\epsilon$ and $\\lambda_{\\min}(\\nabla^2 f(\\mathbf{x})) \\geq -\\sqrt{\\epsilon}$ in unconstrained stochastic optimization, where $\\tilde{O}(\\cdot)$ hides logarithm polynomial terms and constants. This improves upon the $\\tilde{O}(\\epsilon^{-7/2})$ gradient complexity achieved by the state-of-the-art stochastic local minima finding algorithms by a factor of $\\tilde{O}(\\epsilon^{-1/6})$. Experiments on two nonconvex optimization problems demonstrate the effectiveness of our algorithm and corroborate our theory.\n\n1 Introduction\n\nWe study the following unconstrained stochastic optimization problem\n\n$$\\min_{\\mathbf{x} \\in \\mathbb{R}^d} f(\\mathbf{x}) = \\mathbb{E}_{\\xi \\sim \\mathcal{D}}[F(\\mathbf{x}; \\xi)], \\qquad (1.1)$$\n\nwhere $F(\\mathbf{x}; \\xi): \\mathbb{R}^d \\to \\mathbb{R}$ is a stochastic function and $\\xi$ is a random variable sampled from a fixed distribution $\\mathcal{D}$. In particular, we are interested in nonconvex optimization where the expected function $f(\\mathbf{x})$ is not convex. 
This kind of nonconvex optimization is ubiquitous in machine learning, especially deep learning [24]. Finding a global minimum of the nonconvex problem (1.1) is generally NP-hard [18]. Nevertheless, for many nonconvex optimization problems in machine learning, a local minimum is adequate and can be as good as a global minimum in terms of generalization performance, such as in deep learning [10, 13].\n\nIn this paper, we aim to design efficient stochastic optimization algorithms that can find an approximate local minimum of (1.1), i.e., an $(\\epsilon, \\epsilon_H)$-second-order stationary point $\\mathbf{x}$ defined as follows\n\n$$\\|\\nabla f(\\mathbf{x})\\|_2 \\leq \\epsilon, \\quad \\text{and} \\quad \\lambda_{\\min}(\\nabla^2 f(\\mathbf{x})) \\geq -\\epsilon_H, \\qquad (1.2)$$\n\nwhere $\\epsilon, \\epsilon_H \\in (0, 1)$. Notably, when $\\epsilon_H = \\sqrt{L_2 \\epsilon}$ for Hessian Lipschitz $f$ with parameter $L_2$, (1.2) is equivalent to the definition of $\\epsilon$-second-order stationary point [28]. Algorithms based on cubic regularized Newton\u2019s method [28] and its variants [1, 7, 12, 23, 33, 31] have been proposed to find such approximate local minima. However, all of them need to solve the cubic problems exactly [28] or approximately [1, 7] in each iteration, which poses a rather heavy computational overhead. Another line of research employs the negative curvature direction to find the local minimum by combining accelerated gradient descent and negative curvature descent [8, 2], which, however, becomes impractical in large-scale and high-dimensional machine learning problems due to the frequent computation of negative curvature in each iteration.\n\n*Equal contribution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nTo alleviate the computational burden of local minimum finding algorithms, there has emerged a fresh line of research [34, 5, 21] that tries to achieve the same iteration complexity as the state-of-the-art second-order methods, while only utilizing first-order oracles. 
The key observation is that first-order methods with noise injection [15, 20] are essentially an equivalent way to extract the negative curvature direction around saddle points [34, 5]. Together with the Stochastically Controlled Stochastic Gradient (SCSG) method [25], the aforementioned methods [34, 5] converge to an $(\\epsilon, \\sqrt{\\epsilon})$-second-order stationary point (an approximate local minimum) within $\\tilde{O}(\\epsilon^{-7/2})$ stochastic gradient evaluations, where $\\tilde{O}(\\cdot)$ hides logarithm polynomial factors and constants. In this work, motivated by [9] which employed the third-order smoothness of $f$ in deterministic nonconvex optimization to find a first-order stationary point, we explore the benefits of third-order smoothness in finding an approximate local minimum in the stochastic nonconvex optimization. In particular, we propose a stochastic optimization algorithm, named as FLASH, which only utilizes first-order oracles and finds the $(\\epsilon, \\epsilon_H)$-second-order stationary point within $\\tilde{O}(\\epsilon^{-10/3})$ stochastic gradient evaluations. Note that our gradient complexity matches that of the state-of-the-art stochastic optimization algorithm SCSG [25] for finding first-order stationary points. At the core of our algorithm is an exploitation of the third-order smoothness of the objective function $f$ which enables us to choose a larger step size in the negative curvature descent stage, and therefore leads to a faster convergence rate. The main contributions of our work are as follows\n\u2022 We show that the third-order smoothness of the nonconvex function can lead to a faster escape from saddle points in the stochastic optimization. 
We characterize, for the first time, the improvement brought by third-order smoothness in finding the approximate local minimum.\n\u2022 We propose an efficient stochastic algorithm for general stochastic objective functions and prove faster convergence rates for finding local minima. More specifically, for stochastic optimization, our algorithm converges to an approximate local minimum with only $\\tilde{O}(\\epsilon^{-10/3})$ stochastic gradient evaluations.\n\u2022 In each outer iteration, our proposed algorithm only performs either one step of negative curvature descent, or an epoch of SCSG, which saves a lot of gradient and negative curvature computations compared with existing algorithms.\n\nNotation For a vector $\\mathbf{x} = (x_1, \\ldots, x_d)^\\top \\in \\mathbb{R}^d$, we denote the $\\ell_q$ norm as $\\|\\mathbf{x}\\|_q = (\\sum_{i=1}^d |x_i|^q)^{1/q}$ for $0 < q < +\\infty$. For a matrix $\\mathbf{A} = [A_{ij}] \\in \\mathbb{R}^{d \\times d}$, we use $\\|\\mathbf{A}\\|_2$ and $\\|\\mathbf{A}\\|_F$ to denote the spectral and Frobenius norm. For a three-way tensor $\\mathcal{T} \\in \\mathbb{R}^{d \\times d \\times d}$ and vector $\\mathbf{x} \\in \\mathbb{R}^d$, we denote their inner product as $\\langle \\mathcal{T}, \\mathbf{x}^{\\otimes 3} \\rangle$. For a symmetric matrix $\\mathbf{A}$, let $\\lambda_{\\max}(\\mathbf{A})$ and $\\lambda_{\\min}(\\mathbf{A})$ be the maximum and minimum eigenvalues of $\\mathbf{A}$. We use $\\mathbf{A} \\succeq 0$ to denote that $\\mathbf{A}$ is positive semidefinite. For two sequences $\\{a_n\\}$ and $\\{b_n\\}$, we denote $a_n = O(b_n)$ if $a_n \\leq C b_n$ for some constant $C$ independent of $n$. The notation $\\tilde{O}(\\cdot)$ hides logarithmic factors. Additionally, we denote $a_n \\lesssim b_n$ ($a_n \\gtrsim b_n$) if $a_n$ is less than (larger than) $b_n$ up to a constant.\n\n2 Related Work\n\nIn this section, we discuss related work for finding approximate second-order stationary points in nonconvex optimization. In general, existing literature can be divided into the following three categories.\n\nHessian-based: The pioneering work of [28] proposed the cubic regularized Newton\u2019s method to find an $(\\epsilon, \\epsilon_H)$-second-order stationary point in $O(\\max\\{\\epsilon^{-3/2}, \\epsilon_H^{-3}\\})$ iterations. Curtis et al. 
[12] showed that the trust-region Newton method can achieve the same iteration complexity as the cubic regularization method. Recently, Kohler and Lucchi [23], Xu et al. [33] showed that by using subsampled Hessian matrix instead of the entire Hessian matrix in cubic regularization method and trust-region method, the iteration complexity can still match the original exact methods under certain conditions. Zhou et al. [36] improved the second-order oracle complexity (including gradient and Hessian evaluations) by proposing a variance-reduced cubic regularization method. However, these methods need to compute the Hessian matrix and solve a very expensive subproblem either exactly or approximately in each iteration, which can be computationally intractable for high-dimensional problems.\n\nHessian-vector product-based: Through different approaches, Carmon et al. [8] and Agarwal et al. [1] independently proposed algorithms that are able to find $(\\epsilon, \\sqrt{\\epsilon})$-second-order stationary points within $\\tilde{O}(\\epsilon^{-7/4})$ full gradient and Hessian-vector product evaluations. By making an additional assumption of the third-order smoothness on the objective function and combining the negative curvature descent with the \u201cconvex until proven guilty\u201d algorithm, Carmon et al. [9] proposed an algorithm that is able to find an $(\\epsilon, \\sqrt{\\epsilon})$-second-order stationary point within $\\tilde{O}(\\epsilon^{-5/3})$ full gradient and Hessian-vector product evaluations.$^2$ For nonconvex finite-sum optimization problems, Agarwal et al. [1] proposed an algorithm which is able to find approximate local minima within $\\tilde{O}(n\\epsilon^{-3/2} + n^{3/4}\\epsilon^{-7/4})$ stochastic gradient and stochastic Hessian-vector product evaluations, where $n$ is the number of component functions. Reddi et al. [30] proposed an algorithm, which combines first-order and second-order methods to find approximate $(\\epsilon, \\epsilon_H)$-second-order stationary points, and requires $\\tilde{O}(n^{2/3}\\epsilon^{-2} + n\\epsilon_H^{-3} + n^{3/4}\\epsilon_H^{-7/2})$ stochastic gradient and stochastic Hessian-vector product evaluations. In the general stochastic optimization setting, Allen-Zhu [2] proposed an algorithm named Natasha2, which is based on variance reduction and negative curvature descent, and is able to find $(\\epsilon, \\sqrt{\\epsilon})$-second-order stationary points with at most $\\tilde{O}(\\epsilon^{-7/2})$ stochastic gradient and stochastic Hessian-vector product evaluations. Tripuraneni et al. [31] proposed a stochastic cubic regularization algorithm to find $(\\epsilon, \\sqrt{\\epsilon})$-second-order stationary points and achieved the same runtime complexity as [2].\n\nGradient-based: For general nonconvex problems, Ghadimi and Lan [16] proposed a randomized stochastic gradient method and established the complexity of this method for finding a first-order stationary point. Levy [26], Jin et al. [20, 21] showed that it is possible to escape from saddle points and find local minima only using gradient evaluations plus random perturbation. The best-known runtime complexity of these methods is $\\tilde{O}(\\epsilon^{-7/4})$ when $\\epsilon_H = \\sqrt{\\epsilon}$ [21]. For nonconvex finite-sum problems, Allen-Zhu and Li [5] proposed a first-order negative curvature finding method called Neon2 and combined it with the stochastic variance reduced gradient (SVRG) method [22, 29, 3, 25], leading to an algorithm that finds $(\\epsilon, \\epsilon_H)$-second-order stationary points within $\\tilde{O}(n^{2/3}\\epsilon^{-2} + n\\epsilon_H^{-3} + n^{3/4}\\epsilon_H^{-7/2} + n^{5/12}\\epsilon^{-2}\\epsilon_H^{-1/2})$ stochastic gradient evaluations. For nonconvex stochastic optimization problems, a variant of stochastic gradient descent (SGD) [15] is proved to find the $(\\epsilon, \\sqrt{\\epsilon})$-second-order stationary point within $O(\\epsilon^{-4}\\mathrm{poly}(d))$ stochastic gradient evaluations. 
More recently, Xu and Yang [34], Allen-Zhu and Li [5] turned the first-order stationary point finding method SCSG [25] into approximate local minima finding algorithms, which only involve stochastic gradient computation. The runtime complexity of these algorithms is $\\tilde{O}(\\epsilon^{-10/3} + \\epsilon^{-2}\\epsilon_H^{-3})$. In order to further save gradient and negative curvature computations, [35] considered the number of saddle points encountered in the algorithm and proposed the gradient descent with one-step escaping algorithm (GOSE) that saves negative curvature computation. However, none of the above algorithms explore the third-order smoothness of the nonconvex objective function.\n\n3 Preliminaries\n\nIn this section, we present definitions that will be used in our algorithm design and later theoretical analysis.\n\nDefinition 3.1 (Smoothness). A differentiable function $f$ is $L_1$-smooth, if for any $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^d$:\n\n$$\\|\\nabla f(\\mathbf{x}) - \\nabla f(\\mathbf{y})\\|_2 \\leq L_1 \\|\\mathbf{x} - \\mathbf{y}\\|_2.$$\n\n$^2$As shown in [9], the second-order accuracy parameter $\\epsilon_H$ can be set as $\\epsilon^{2/3}$ and the total runtime complexity remains the same, i.e., $\\tilde{O}(\\epsilon^{-5/3})$.\n\nDefinition 3.2 (Hessian Lipschitz). A twice-differentiable function $f$ is $L_2$-Hessian Lipschitz, if for any $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^d$:\n\n$$\\|\\nabla^2 f(\\mathbf{x}) - \\nabla^2 f(\\mathbf{y})\\|_2 \\leq L_2 \\|\\mathbf{x} - \\mathbf{y}\\|_2.$$\n\nNote that Hessian-Lipschitz is also referred to as the second-order smoothness. The above two smoothness conditions are widely used in nonconvex optimization problems [28]. In this paper, we will further explore the effectiveness of third-order derivative Lipschitz condition in nonconvex optimization. We use a three-way tensor $\\nabla^3 f(\\mathbf{x}) \\in \\mathbb{R}^{d \\times d \\times d}$ to denote the third-order derivative of a function, which is formally defined below.\n\nDefinition 3.3 (Third-order Derivative). The third-order derivative of function $f: \\mathbb{R}^d \\to \\mathbb{R}$ is a three-way tensor $\\nabla^3 f(\\mathbf{x}) \\in \\mathbb{R}^{d \\times d \\times d}$ which is defined as\n\n$$[\\nabla^3 f(\\mathbf{x})]_{ijk} = \\frac{\\partial^3}{\\partial x_i \\partial x_j \\partial x_k} f(\\mathbf{x}),$$\n\nfor $i, j, k = 1, \\ldots, d$ and $\\mathbf{x} \\in \\mathbb{R}^d$.\n\nNext we introduce the definition of third-order smoothness for function $f$, which implies that the third-order derivative will not change rapidly.\n\nDefinition 3.4 (Third-order Derivative Lipschitz). A thrice-differentiable function $f$ has $L_3$-Lipschitz third-order derivative, if for any $\\mathbf{x}, \\mathbf{y} \\in \\mathbb{R}^d$:\n\n$$\\|\\nabla^3 f(\\mathbf{x}) - \\nabla^3 f(\\mathbf{y})\\|_F \\leq L_3 \\|\\mathbf{x} - \\mathbf{y}\\|_2.$$\n\nThe above definition has been introduced in [6], and the third-order derivative Lipschitz is also referred to as third-order smoothness in [9]. One can also use another equivalent notion of third-order derivative Lipschitz condition used in [9]. Note that the third-order Lipschitz condition is critical in our algorithms and theoretical analysis in later sections. In the sequel, we will use third-order derivative Lipschitz and third-order smoothness interchangeably.\n\nDefinition 3.5 (Optimal Gap). For a function $f$, we define the optimal gap $\\Delta_f$ at point $\\mathbf{x}_0$ as\n\n$$f(\\mathbf{x}_0) - \\inf_{\\mathbf{x} \\in \\mathbb{R}^d} f(\\mathbf{x}) \\leq \\Delta_f.$$\n\nWithout loss of generality, we assume $\\Delta_f < +\\infty$.\n\nDefinition 3.6 (Geometric Distribution). For a random integer $X$, define $X$ has a geometric distribution with parameter $p$, denoted as $\\mathrm{Geom}(p)$, if it satisfies that\n\n$$\\mathbb{P}(X = k) = p^k (1 - p), \\quad \\forall k = 0, 1, \\ldots.$$\n\nDefinition 3.7 (Sub-Gaussian Stochastic Gradient). 
For any $\\mathbf{x} \\in \\mathbb{R}^d$ and random variable $\\xi \\sim \\mathcal{D}$, the stochastic gradient $\\nabla F(\\mathbf{x}; \\xi)$ is sub-Gaussian with parameter $\\sigma$ if it satisfies\n\n$$\\mathbb{E}\\bigg[\\exp\\bigg(\\frac{\\|\\nabla F(\\mathbf{x}; \\xi) - \\nabla f(\\mathbf{x})\\|_2^2}{\\sigma^2}\\bigg)\\bigg] \\leq \\exp(1).$$\n\nIn addition, we introduce $T_g$ to denote the time complexity of stochastic function value and gradient evaluation, i.e., $(F(\\mathbf{x}; \\xi_i), \\nabla F(\\mathbf{x}; \\xi_i))$ for $\\xi_i \\sim \\mathcal{D}$, and $T_h$ to denote the time complexity of stochastic Hessian-vector product evaluation, i.e., $\\nabla^2 F(\\mathbf{x}; \\xi_i)\\mathbf{v}$ for a given vector $\\mathbf{v}$ and $\\xi_i \\sim \\mathcal{D}$.\n\n4 Exploiting Third-order Smoothness\n\nIn this section we will show how to employ the third-order smoothness of the objective function to make better use of the negative curvature direction for escaping saddle points. We first give an enlightening explanation on why third-order smoothness helps in general nonconvex optimization problems. Then we present our main algorithm which is able to utilize the third-order smoothness to take a larger step size for general stochastic optimization.\n\nIn order to find local minima in nonconvex problems, different kinds of approaches have been explored to escape from saddle points. One of these approaches is to use negative curvature direction [27] to escape from saddle points, which has been explored in many existing studies [8, 11, 2]. According to recent work by [34, 5], one can extract the negative curvature direction by only using stochastic gradient evaluations, which makes the negative curvature descent approach more appealing.\n\nWe first consider a simple case to illustrate how to utilize the third-order smoothness when taking a negative curvature descent step. For nonconvex optimization problems, an $\\epsilon$-first-order stationary point $\\hat{\\mathbf{x}}$ can be found by using first-order methods such as gradient descent. If $\\hat{\\mathbf{x}}$ is not an $(\\epsilon, \\epsilon_H)$-second-order stationary point defined in (1.2), then there must exist a unit vector $\\hat{\\mathbf{v}}$ such that\n\n$$\\hat{\\mathbf{v}}^\\top \\nabla^2 f(\\hat{\\mathbf{x}}) \\hat{\\mathbf{v}} \\leq -\\frac{\\epsilon_H}{2}.$$\n\nAs studied in [8, 34, 5], one can take a negative curvature descent step along the direction of $\\hat{\\mathbf{v}}$ to escape from the saddle point $\\hat{\\mathbf{x}}$, i.e.,\n\n$$\\tilde{\\mathbf{y}} = \\mathop{\\mathrm{argmin}}_{\\mathbf{y} \\in \\{\\mathbf{u}, \\mathbf{w}\\}} f(\\mathbf{y}), \\quad \\mathbf{u} = \\hat{\\mathbf{x}} - \\tilde{\\alpha}\\hat{\\mathbf{v}}, \\quad \\mathbf{w} = \\hat{\\mathbf{x}} + \\tilde{\\alpha}\\hat{\\mathbf{v}}, \\qquad (4.1)$$\n\nwhere $\\tilde{\\alpha}$ is the step size. Suppose the function $f$ is $L_1$-smooth and $L_2$-Hessian Lipschitz, then the step size can be set as $\\tilde{\\alpha} = O(\\epsilon_H / L_2)$ and the negative curvature descent step (4.1) is guaranteed to attain the following function value decrease,\n\n$$f(\\tilde{\\mathbf{y}}) - f(\\hat{\\mathbf{x}}) = -O\\bigg(\\frac{\\epsilon_H^3}{L_2^2}\\bigg). \\qquad (4.2)$$\n\nInspired by the previous work [9], we aim to achieve more function value decrease than (4.2) by incorporating an additional assumption that the objective function has $L_3$-Lipschitz third-order derivatives (third-order smoothness). More specifically, we adjust the negative curvature descent step in (4.1) as follows,\n\n$$\\hat{\\mathbf{y}} = \\mathop{\\mathrm{argmin}}_{\\mathbf{y} \\in \\{\\mathbf{u}, \\mathbf{w}\\}} f(\\mathbf{y}), \\quad \\mathbf{u} = \\hat{\\mathbf{x}} - \\alpha\\hat{\\mathbf{v}}, \\quad \\mathbf{w} = \\hat{\\mathbf{x}} + \\alpha\\hat{\\mathbf{v}}, \\qquad (4.3)$$\n\nwhere $\\alpha = O(\\sqrt{\\epsilon_H / L_3})$ is the adjusted step size which can be much larger than the step size $\\tilde{\\alpha}$ in (4.1) when $\\epsilon_H$ is sufficiently small. The adjusted negative curvature descent step (4.3) is guaranteed to decrease the function value by a larger decrement, i.e.,\n\n$$f(\\hat{\\mathbf{y}}) - f(\\hat{\\mathbf{x}}) = -O\\bigg(\\frac{\\epsilon_H^2}{L_3}\\bigg). \\qquad (4.4)$$\n\nCompared with (4.2), the decrement in (4.4) can be substantially larger. In other words, if we make the additional assumption of the third-order smoothness, the negative curvature descent with larger step size will make more progress toward decreasing the function value. Note that [9] focuses on deterministic optimization, while our work is focused on the stochastic optimization. 
Here we need to carefully design our algorithm to improve the computational complexity in the stochastic setting.\n\nIn the following, we will present an algorithm for stochastic nonconvex optimization which exploits the benefits of third-order smoothness to escape from saddle points. Recall the general stochastic optimization problem in (1.1). In this setting, one cannot have access to the full gradient or Hessian information. Instead, only stochastic gradient and stochastic Hessian-vector product evaluations are accessible. As a result, we have to employ stochastic optimization methods to calculate the negative curvature direction. There exist two kinds of methods to calculate the negative curvature direction $\\hat{\\mathbf{v}}$ for the general stochastic problem. The first kind is an online PCA method, i.e., Oja\u2019s algorithm [4], which uses Hessian-vector product evaluations and can be seen as a stochastic variant of FastPCA [14]. Another method is the online version of the Neon algorithm, denoted as Neon2$^{\\text{online}}$ [5], which only requires stochastic gradient evaluations.\n\nBy using either Oja\u2019s algorithm or Neon2$^{\\text{online}}$, there exists an algorithm, denoted by ApproxNC-Stochastic, which uses stochastic gradient evaluations or stochastic Hessian-vector product evaluations to find the negative curvature direction for general stochastic nonconvex optimization problem (1.1). Specifically, ApproxNC-Stochastic returns a unit vector $\\hat{\\mathbf{v}}$ that satisfies $\\hat{\\mathbf{v}}^\\top \\nabla^2 f(\\mathbf{x}) \\hat{\\mathbf{v}} \\leq -\\epsilon_H/2$ provided $\\lambda_{\\min}(\\nabla^2 f(\\mathbf{x})) < -\\epsilon_H$, otherwise it will return $\\hat{\\mathbf{v}} = \\perp$. Based on ApproxNC-Stochastic, we present our negative curvature descent algorithm in Algorithm 1.\n\nNote that the Rademacher random variable $\\zeta$ is an important feature in Algorithm 1. 
As we cannot access the full objective function value in stochastic setting, we use a Rademacher variable ($\\zeta = 1$ or $\\zeta = -1$ with probability 1/2) in our algorithm to decide the direction of negative curvature descent step.\n\nAlgorithm 1 NCD3-Stochastic $(f, \\mathbf{x}, \\{L_i\\}_{i=1}^3, \\delta, \\epsilon_H)$\n1: Set $\\alpha = \\sqrt{3\\epsilon_H / L_3}$\n2: $\\hat{\\mathbf{v}} \\leftarrow$ ApproxNC-Stochastic$(f, \\mathbf{x}, L_1, L_2, \\delta, \\epsilon_H)$\n3: if $\\hat{\\mathbf{v}} \\neq \\perp$\n4:   generate a Rademacher random variable $\\zeta$\n5:   $\\hat{\\mathbf{y}} \\leftarrow \\mathbf{x} + \\zeta \\alpha \\hat{\\mathbf{v}}$\n6:   return $\\hat{\\mathbf{y}}$\n7: else\n8:   return $\\perp$\n\nTherefore, with the step size $\\alpha = O(\\sqrt{\\epsilon_H / L_3})$ for the negative curvature descent step, Algorithm 1 can make greater progress in expectation when $\\lambda_{\\min}(\\nabla^2 f(\\mathbf{x})) < -\\epsilon_H$, and we summarize this property as follows.\n\nLemma 4.1. Let $f(\\mathbf{x}) = \\mathbb{E}_{\\xi \\sim \\mathcal{D}}[F(\\mathbf{x}; \\xi)]$ and each stochastic function $F(\\mathbf{x}; \\xi)$ is $L_1$-smooth, $L_2$-Hessian Lipschitz continuous, and the third derivative of $f(\\mathbf{x})$ is $L_3$-Lipschitz. Set $\\epsilon_H \\in (0, 1)$ and step size as $\\alpha = \\sqrt{3\\epsilon_H / L_3}$. If the input $\\mathbf{x}$ of Algorithm 1 satisfies $\\lambda_{\\min}(\\nabla^2 f(\\mathbf{x})) < -\\epsilon_H$, then with probability $1 - \\delta$, Algorithm 1 will return $\\hat{\\mathbf{y}}$ such that $\\mathbb{E}_\\zeta[f(\\mathbf{x}) - f(\\hat{\\mathbf{y}})] \\geq 3\\epsilon_H^2 / (8L_3)$, where $\\delta \\in (0, 1)$ and $\\mathbb{E}_\\zeta$ denotes the expectation over the Rademacher random variable $\\zeta$. 
Furthermore, if we choose $\\delta \\leq \\epsilon_H / (3\\epsilon_H + 8L_2)$, it holds that\n\n$$\\mathbb{E}[f(\\hat{\\mathbf{y}}) - f(\\mathbf{x})] \\leq -\\frac{\\epsilon_H^2}{8L_3},$$\n\nwhere $\\mathbb{E}$ is over all randomness of the algorithm, and the total runtime is $\\tilde{O}((L_1^2/\\epsilon_H^2) T_h)$ if ApproxNC-Stochastic adopts online Oja\u2019s algorithm, and $\\tilde{O}((L_1^2/\\epsilon_H^2) T_g)$ if ApproxNC-Stochastic adopts Neon2$^{\\text{online}}$.\n\n5 Fast Local Minima Finding Algorithm\n\nIn this section, we present our main algorithm to find approximate local minima for nonconvex stochastic optimization problems, based on the negative curvature descent algorithm proposed in the previous section.\n\nTo find the local minimum, we use SCSG [25], which is the state-of-the-art stochastic optimization algorithm, to find a first-order stationary point and then apply Algorithm 1 to escape the saddle point using negative curvature direction. The proposed method is presented in Algorithm 2. We use a subsampled stochastic gradient $\\nabla f_{\\mathcal{S}}(\\mathbf{x})$ in the outer loop (Line 4) of Algorithm 2, which is defined as $\\nabla f_{\\mathcal{S}}(\\mathbf{x}) = \\frac{1}{|\\mathcal{S}|}\\sum_{i \\in \\mathcal{S}} \\nabla F(\\mathbf{x}; \\xi_i)$.\n\nAs shown in Algorithm 2, we use subsampled gradient to check whether $\\mathbf{x}_{k-1}$ is a first-order stationary point. Suppose the stochastic gradient $\\nabla F(\\mathbf{x}; \\xi)$ satisfies the gradient sub-Gaussian condition (Definition 3.7) and the batch size $|\\mathcal{S}_k|$ is large enough, then $\\|\\nabla f(\\mathbf{x}_{k-1})\\|_2 > \\epsilon/4$ holds with high probability if $\\|\\nabla f_{\\mathcal{S}_k}(\\mathbf{x}_{k-1})\\|_2 > \\epsilon/2$. Similarly, $\\|\\nabla f(\\mathbf{x}_{k-1})\\|_2 \\leq \\epsilon$ holds with high probability if $\\|\\nabla f_{\\mathcal{S}_k}(\\mathbf{x}_{k-1})\\|_2 \\leq \\epsilon/2$.\n\nNote that each iteration of the outer loop in Algorithm 2 consists of two cases: (1) if the norm of subsampled gradient $\\nabla f_{\\mathcal{S}_k}(\\mathbf{x}_{k-1})$ is small, then we run one subroutine NCD3-Stochastic, i.e., Algorithm 1; and (2) if the norm of $\\nabla f_{\\mathcal{S}_k}(\\mathbf{x}_{k-1})$ is large, then we run one epoch of SCSG algorithm. This design can reduce the number of negative curvature calculations. 
There are two major differences between Algorithm 2 and existing algorithms in [34, 5]: (1) the step size of negative curvature descent step in Algorithm 2 is larger; and (2) the minibatch size in each epoch of SCSG in Algorithm 2 can be set to 1 instead of being related to the accuracy parameters $\\epsilon$ and $\\epsilon_H$, while the minibatch size in each epoch of SCSG in the existing algorithms [34, 5] has to depend on $\\epsilon$ and $\\epsilon_H$.\n\nAlgorithm 2 Fast Local minimA finding with third-order SmootHness (FLASH-Stochastic)\n1: Input: $f, \\mathbf{x}_0, L_1, L_2, L_3, \\delta, \\epsilon, \\epsilon_H, b, K$\n2: Set $B \\leftarrow \\tilde{O}(\\sigma^2/\\epsilon^2)$, $\\eta = b^{2/3}/(3L_1 B^{2/3})$\n3: for $k = 1, 2, \\ldots, K$\n4:   uniformly sample a batch $\\mathcal{S}_k \\sim \\mathcal{D}$ with $|\\mathcal{S}_k| = B$\n5:   $\\mathbf{g}_k \\leftarrow \\nabla f_{\\mathcal{S}_k}(\\mathbf{x}_{k-1})$\n6:   if $\\|\\mathbf{g}_k\\|_2 > \\epsilon/2$\n7:     generate $T_k \\sim \\mathrm{Geom}(B/(B + b))$\n8:     $\\mathbf{y}_0^{(k)} \\leftarrow \\mathbf{x}_{k-1}$\n9:     for $t = 1, \\ldots, T_k$\n10:       randomly pick $\\mathcal{I}_t \\subset \\mathcal{D}$ with $|\\mathcal{I}_t| = b$\n11:       $\\boldsymbol{\\nu}_{t-1}^{(k)} \\leftarrow \\nabla f_{\\mathcal{I}_t}(\\mathbf{y}_{t-1}^{(k)}) - \\nabla f_{\\mathcal{I}_t}(\\mathbf{y}_0^{(k)}) + \\mathbf{g}_k$\n12:       $\\mathbf{y}_t^{(k)} \\leftarrow \\mathbf{y}_{t-1}^{(k)} - \\eta \\boldsymbol{\\nu}_{t-1}^{(k)}$\n13:     end for\n14:     $\\mathbf{x}_k \\leftarrow \\mathbf{y}_{T_k}^{(k)}$\n15:   else\n16:     $\\mathbf{x}_k \\leftarrow$ NCD3-Stochastic$(f, \\mathbf{x}_{k-1}, \\{L_i\\}_{i=1}^3, \\delta, \\epsilon_H)$\n17:     if $\\mathbf{x}_k = \\perp$\n18:       return $\\mathbf{x}_{k-1}$\n19: end for\n\nNow we present the following theorem which spells out the runtime complexity of Algorithm 2.\n\nTheorem 5.1. Let $f(\\mathbf{x}) = \\mathbb{E}_{\\xi \\sim \\mathcal{D}}[F(\\mathbf{x}; \\xi)]$. Suppose the third derivative of $f(\\mathbf{x})$ is $L_3$-Lipschitz, and each stochastic function $F(\\mathbf{x}; \\xi)$ is $L_1$-smooth and $L_2$-Hessian Lipschitz continuous. Suppose that the stochastic gradient $\\nabla F(\\mathbf{x}; \\xi)$ satisfies the gradient sub-Gaussian condition with parameter $\\sigma$. Set batch size $B = \\tilde{O}(\\sigma^2/\\epsilon^2)$ and $\\epsilon_H \\gtrsim \\epsilon^{2/3}$. 
If Algorithm 2 adopts online Oja\u2019s algorithm to compute the negative curvature, then Algorithm 2 finds an $(\\epsilon, \\epsilon_H)$-second-order stationary point with probability at least 1/3 in runtime\n\n$$\\tilde{O}\\bigg(\\bigg(\\frac{L_1 \\sigma^{4/3} \\Delta_f}{\\epsilon^{10/3}} + \\frac{L_3 \\sigma^2 \\Delta_f}{\\epsilon^2 \\epsilon_H^2}\\bigg) T_g + \\bigg(\\frac{L_1^2 L_3 \\Delta_f}{\\epsilon_H^4}\\bigg) T_h\\bigg).$$\n\nIf Algorithm 2 adopts Neon2$^{\\text{online}}$, then it finds an $(\\epsilon, \\epsilon_H)$-second-order stationary point with probability at least 1/3 in runtime\n\n$$\\tilde{O}\\bigg(\\bigg(\\frac{L_1 \\sigma^{4/3} \\Delta_f}{\\epsilon^{10/3}} + \\frac{L_3 \\sigma^2 \\Delta_f}{\\epsilon^2 \\epsilon_H^2} + \\frac{L_1^2 L_3 \\Delta_f}{\\epsilon_H^4}\\bigg) T_g\\bigg).$$\n\nRemark 5.2. Although the runtime complexity in Theorem 5.1 holds with a constant probability, one can repeatedly run Algorithm 2 for at most $\\log(1/\\delta)$ times to achieve a high probability result with probability at least $1 - \\delta$.\n\nRemark 5.3. Theorem 5.1 suggests that the runtime complexity of Algorithm 2 is $\\tilde{O}(\\epsilon^{-10/3} + \\epsilon^{-2}\\epsilon_H^{-2} + \\epsilon_H^{-4})$ to find an $(\\epsilon, \\epsilon_H)$-second-order stationary point. Compared with the $\\tilde{O}(\\epsilon^{-10/3} + \\epsilon^{-2}\\epsilon_H^{-3} + \\epsilon_H^{-5})$ runtime complexity achieved by the state-of-the-art [5], the runtime complexity of Algorithm 2 is improved upon the state-of-the-art in the second and third terms. If we set $\\epsilon_H = \\sqrt{\\epsilon}$, the runtime of Algorithm 2 is $\\tilde{O}(\\epsilon^{-10/3})$ and that of the state-of-the-art stochastic local minima finding algorithms [2, 31, 34, 5] becomes $\\tilde{O}(\\epsilon^{-7/2})$, thus Algorithm 2 outperforms the state-of-the-art algorithms by a factor of $\\tilde{O}(\\epsilon^{-1/6})$.\n\nRemark 5.4. Note that we can set $\\epsilon_H$ to a smaller value, i.e., $\\epsilon_H = \\epsilon^{2/3}$, and the total runtime complexity of Algorithm 2 remains $\\tilde{O}(\\epsilon^{-10/3})$. It is also worth noting that the runtime complexity of Algorithm 2 matches that of the state-of-the-art stochastic optimization algorithm (SCSG) [25] which only finds first-order stationary points but does not impose the third-order smoothness assumption.\n\n6 Experiments\n\nIn this section, we conduct numerical experiments on two nonconvex optimization problems, i.e., matrix sensing and deep autoencoder. All the experiments are carried out on Amazon AWS p2.xlarge nodes with NVIDIA GK210 GPUs, and we use Pytorch 0.3.0 to implement all the algorithms.\n\n[Figure 1 panels: (a) Matrix Sensing (d = 50); (b) Matrix Sensing (d = 100); (c) Varying NC Step Size (d = 50); (d) Varying NC Step Size (d = 100); (e) AE-1, Training; (f) AE-1, Test; (g) AE-2, Training; (h) AE-2, Test. Panels (a)-(d) plot objective function value against oracle calls; panels (e)-(h) plot training loss or test loss against oracle calls. Methods shown in (a)-(b) and (e)-(h): SGD, NSGD, SGD-m, SCSG, Neon, Neon2, FLASH.]\n\nFigure 1: Numerical results for matrix sensing and deep autoencoder. (a)-(b) Convergence of different algorithms for matrix sensing: objective function value versus the number of oracle calls. (c)-(d) Different negative curvature step size comparison of FLASH for matrix sensing. (e)-(h) Convergence of different algorithms for two deep autoencoders: Training loss versus the number of oracle calls and test loss versus the number of oracle calls.\n\nMatrix Sensing We consider the symmetric matrix sensing problem, which is defined as:\n\n$$\\min_{\\mathbf{U} \\in \\mathbb{R}^{d \\times r}} f(\\mathbf{U}) = \\frac{1}{2m} \\sum_{i=1}^m \\big(\\langle \\mathbf{A}_i, \\mathbf{U}\\mathbf{U}^\\top \\rangle - b_i\\big)^2, \\qquad (6.1)$$\n\nwhere the matrices $\\{\\mathbf{A}_i\\}_{i=1,\\ldots,m}$ are known sensing matrices, $b_i = \\langle \\mathbf{A}_i, \\mathbf{M}^* \\rangle$ is the $i$-th observation, and $\\mathbf{M}^* = \\mathbf{U}^*(\\mathbf{U}^*)^\\top$ is an unknown low-rank matrix, which needs to be recovered. For the data generation, we consider two matrix sensing problems: (1) $d = 50$, $r = 3$, and (2) $d = 100$, $r = 3$, then generate $m = 20d$ sensing matrices $\\mathbf{A}_1, \\ldots, \\mathbf{A}_m$, where each element of the sensing matrix $\\mathbf{A}_i$ follows i.i.d. standard normal distribution, and the unknown low-rank matrix $\\mathbf{M}^*$ as $\\mathbf{M}^* = \\mathbf{U}^*(\\mathbf{U}^*)^\\top$, where $\\mathbf{U}^* \\in \\mathbb{R}^{d \\times r}$ is randomly generated, and thus $b_i = \\langle \\mathbf{A}_i, \\mathbf{M}^* \\rangle$. Next we randomly initialize a vector $\\mathbf{u}_0 \\in \\mathbb{R}^d$ satisfying $\\|\\mathbf{u}_0\\|_2 < \\lambda_{\\max}(\\mathbf{M}^*)$ and set the initial input $\\mathbf{U}_0$ as $\\mathbf{U}_0 = [\\mathbf{u}_0, \\mathbf{0}, \\ldots, \\mathbf{0}]$.\n\nDeep Autoencoder We also perform experiments of training a deep autoencoder on MNIST dataset [19]. The MNIST dataset contains images of handwritten digits, including 60,000 training examples and 10,000 test examples. Each image has $28 \\times 28$ pixels. 
We consider two autoencoders: (1) a fully connected encoder with layers of size $(28 \\times 28)$-1024-512-256-32 and a symmetric decoder (AE-1) and (2) a fully connected encoder with layers of size $(28 \\times 28)$-1024-512-256-128-56-32 and a symmetric decoder (AE-2). The code layer with 32 units is linear and we use softplus function as the activation function for other layers. We use mean squared error (MSE) as the loss function.\n\nWe evaluate our algorithm FLASH-Stochastic (FLASH for short) together with the following state-of-the-art stochastic optimization algorithms for nonconvex problems: (1) stochastic gradient descent (SGD); (2) SGD with momentum (SGD-m); (3) noisy stochastic gradient descent (NSGD) [15]; (4) Stochastically Controlled Stochastic Gradient (SCSG) [25]; (5) NEgative-curvature-Originated-from-Noise (Neon) [34]; (6) NEgative-curvature-Originated-from-Noise 2 (Neon2) [5]. A fixed gradient mini-batch size of 100 is used for all the algorithms. We apply Oja\u2019s algorithm with a Hessian mini-batch size of 100 to calculate the negative curvature in FLASH. We perform a grid search over step sizes for each method. For the negative curvature step size $\\alpha$, we choose $\\alpha = O(\\epsilon_H / L_2)$ for Neon, Neon2 and $\\alpha = O(\\sqrt{\\epsilon_H / L_3})$ for our algorithm FLASH according to the corresponding theories, where $\\epsilon_H = 0.001$, and tune the constant parameter in the negative curvature step size by grid search. We report the objective function value versus oracle calls on matrix sensing, and training loss and test loss versus oracle calls on deep autoencoder.\n\nThe experimental results of the above two nonconvex problems are shown in Figure 1. For the matrix sensing problem, in Figure 1(a)-1(b), we observe that without adding noise or using second-order information, SGD, SGD-m and SCSG are not able to escape from saddle points. 
Our algorithm,\nNSGD, Neon and Neon2 can escape from saddle points, and our algorithm converges to the unknown\nmatrix faster than NSGD, Neon and Neon2. As we can see from Figure 1(e)-1(h), for the deep autoencoders,\ncompared with SGD, SGD-m, NSGD, SCSG, Neon and Neon2, our algorithm escapes from saddle\npoints faster and converges faster. Our algorithm outperforms Neon and Neon2 on both problems,\nwhich validates our theoretical analysis that a negative curvature step with a larger step size is helpful\nin stochastic nonconvex optimization problems. We also compare the convergence behavior of our\nalgorithm with different step sizes for negative curvature descent. We first set the initial step size $\\alpha = 0.2$\n(for negative curvature descent) and then decrease the step size by a factor of 0.1 each time, while the\nother parameters remain the same. We can see from Figure 1(c) and 1(d) that our algorithm FLASH\nconverges faster with larger step sizes for negative curvature descent, which validates our theory that\nthird-order smoothness can be helpful in nonconvex stochastic optimization.\n\n7 Conclusions\n\nIn this paper, we investigated the benefit of third-order smoothness of nonconvex objective functions\nin stochastic optimization. We illustrated that third-order smoothness can help escape saddle\npoints faster, by proposing a new negative curvature descent algorithm with an improved theoretical guarantee.\nBased on the proposed negative curvature descent algorithm, we further proposed a practical stochastic\noptimization algorithm with improved run time complexity that finds local minima for stochastic\nnonconvex optimization problems.\n\nAcknowledgements\n\nWe would like to thank the anonymous reviewers for their helpful comments, and Yu Chen and Xuwang\nYin for their helpful discussions on the experiments. This research was sponsored in part by\nthe National Science Foundation IIS-1652539 and BIGDATA IIS-1855099. 
We also thank AWS\nfor providing cloud computing credits associated with the NSF BIGDATA award. The views\nand conclusions contained in this paper are those of the authors and should not be interpreted as\nrepresenting any funding agencies.\n\nReferences\n[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding\napproximate local minima for nonconvex optimization in linear time. arXiv preprint\narXiv:1611.01146, 2016.\n\n[2] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint\narXiv:1708.08694, 2017.\n\n[3] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In\nInternational Conference on Machine Learning, pages 699\u2013707, 2016.\n\n[4] Zeyuan Allen-Zhu and Yuanzhi Li. Follow the compressed leader: Faster algorithms for matrix\nmultiplicative weight updates. arXiv preprint arXiv:1701.01722, 2017.\n\n[5] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. arXiv\npreprint arXiv:1711.06673, 2017.\n\n[6] Animashree Anandkumar and Rong Ge. Efficient approaches for escaping higher order saddle\npoints in non-convex optimization. In Conference on Learning Theory, pages 81\u2013102, 2016.\n\n[7] Yair Carmon and John C Duchi. Gradient descent efficiently finds the cubic-regularized\nnon-convex Newton step. arXiv preprint arXiv:1612.00547, 2016.\n\n[8] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for\nnon-convex optimization. arXiv preprint arXiv:1611.00756, 2016.\n\n[9] Yair Carmon, Oliver Hinder, John C Duchi, and Aaron Sidford. \"Convex until proven guilty\":\nDimension-free acceleration of gradient descent on non-convex functions. arXiv preprint\narXiv:1705.02766, 2017.\n\n[10] Anna Choromanska, Mikael Henaff, Michael Mathieu, G\u00e9rard Ben Arous, and Yann LeCun.\nThe loss surfaces of multilayer networks. 
In Artificial Intelligence and Statistics, pages 192\u2013204,\n2015.\n\n[11] Frank E Curtis and Daniel P Robinson. Exploiting negative curvature in deterministic and\nstochastic optimization. arXiv preprint arXiv:1703.00412, 2017.\n\n[12] Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with\na worst-case iteration complexity of $\\mathcal{O}(\\epsilon^{-3/2})$ for nonconvex optimization.\nMathematical Programming, 162(1-2):1\u201332, 2017.\n\n[13] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and\nYoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional\nnon-convex optimization. In Advances in Neural Information Processing Systems, pages 2933\u20132941,\n2014.\n\n[14] Dan Garber, Elad Hazan, Chi Jin, Cameron Musco, Praneeth Netrapalli, Aaron Sidford, et al.\nFaster eigenvector computation via shift-and-invert preconditioning. In International Conference\non Machine Learning, pages 2626\u20132634, 2016.\n\n[15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points\u2014online\nstochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797\u2013842,\n2015.\n\n[16] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex\nstochastic programming. SIAM Journal on Optimization, 23(4):2341\u20132368, 2013.\n\n[17] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation\nmethods for nonconvex stochastic composite optimization. Mathematical Programming,\n155(1-2):267\u2013305, 2016.\n\n[18] Christopher J Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM\n(JACM), 60(6):45, 2013.\n\n[19] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with\nneural networks. Science, 313(5786):504\u2013507, 2006.\n\n[20] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. 
How to escape\nsaddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.\n\n[21] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle\npoints faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017.\n\n[22] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance\nreduction. In Advances in Neural Information Processing Systems, pages 315\u2013323, 2013.\n\n[23] Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex\noptimization. arXiv preprint arXiv:1705.05933, 2017.\n\n[24] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436\u2013444,\n2015.\n\n[25] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Nonconvex finite-sum optimization\nvia SCSG methods. arXiv preprint arXiv:1706.09156, 2017.\n\n[26] Kfir Y Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint\narXiv:1611.04831, 2016.\n\n[27] Jorge J Mor\u00e9 and Danny C Sorensen. On the use of directions of negative curvature in a modified\nNewton method. Mathematical Programming, 16(1):1\u201320, 1979.\n\n[28] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global\nperformance. Mathematical Programming, 108(1):177\u2013205, 2006.\n\n[29] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic\nvariance reduction for nonconvex optimization. In International Conference on Machine Learning,\npages 314\u2013323, 2016.\n\n[30] Sashank J Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan\nSalakhutdinov, and Alexander J Smola. A generic approach for escaping saddle points. arXiv preprint\narXiv:1709.01434, 2017.\n\n[31] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic\ncubic regularization for fast nonconvex optimization. 
arXiv preprint arXiv:1711.02838, 2017.\n\n[32] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv\npreprint arXiv:1011.3027, 2010.\n\n[33] Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Newton-type methods for\nnon-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164,\n2017.\n\n[34] Yi Xu and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in\nalmost linear time. arXiv preprint arXiv:1711.01944, 2017.\n\n[35] Yaodong Yu, Difan Zou, and Quanquan Gu. Saving gradient and negative curvature\ncomputations: Finding local minima more efficiently. arXiv preprint arXiv:1712.03950, 2017.\n\n[36] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced cubic regularized\nNewton methods. In Proceedings of the 35th International Conference on Machine Learning,\nvolume 80, pages 5990\u20135999, 10\u201315 Jul 2018.\n", "award": [], "sourceid": 2206, "authors": [{"given_name": "Yaodong", "family_name": "Yu", "institution": "University of Virginia"}, {"given_name": "Pan", "family_name": "Xu", "institution": "UCLA"}, {"given_name": "Quanquan", "family_name": "Gu", "institution": "UCLA"}]}