{"title": "Efficiently escaping saddle points on manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 5987, "page_last": 5997, "abstract": "Smooth, non-convex optimization problems on Riemannian manifolds occur in machine learning as a result of orthonormality, rank or positivity constraints. First- and second-order necessary optimality conditions state that the Riemannian gradient must be zero, and the Riemannian Hessian must be positive semidefinite. Generalizing Jin et al.'s recent work on perturbed gradient descent (PGD) for optimization on linear spaces [How to Escape Saddle Points Efficiently (2017), Stochastic Gradient Descent Escapes Saddle Points Efficiently (2019)], we study a version of perturbed Riemannian gradient descent (PRGD) to show that necessary optimality conditions can be met approximately with high probability, without evaluating the Hessian. Specifically, for an arbitrary Riemannian manifold $\\mathcal{M}$ of dimension $d$, a sufficiently smooth (possibly non-convex) objective function $f$, and under weak conditions on the retraction chosen to move on the manifold, with high probability, our version of PRGD produces a point with gradient smaller than $\\epsilon$ and Hessian within $\\sqrt{\\epsilon}$ of being positive semidefinite in $O((\\log{d})^4 / \\epsilon^{2})$ gradient queries. This matches the complexity of PGD in the Euclidean case. Crucially, the dependence on dimension is low, which matters for large-scale applications including PCA and low-rank matrix completion, which both admit natural formulations on manifolds. 
The key technical idea is to generalize PRGD with a distinction between two types of gradient steps: ``steps on the manifold'' and ``perturbed steps in a tangent space of the manifold.'' Ultimately, this distinction makes it possible to extend Jin et al.'s analysis seamlessly.", "full_text": "Efficiently escaping saddle points on manifolds\n\nChris Criscitiello\n\nDepartment of Mathematics\n\nPrinceton University\nPrinceton, NJ 08544\n\nccriscitiello6@gmail.com\n\nNicolas Boumal\n\nDepartment of Mathematics\n\nPrinceton University\nPrinceton, NJ 08544\n\nnboumal@math.princeton.edu\n\nAbstract\n\nSmooth, non-convex optimization problems on Riemannian manifolds occur in machine learning as a result of orthonormality, rank or positivity constraints. First- and second-order necessary optimality conditions state that the Riemannian gradient must be zero, and the Riemannian Hessian must be positive semidefinite. Generalizing Jin et al.\u2019s recent work on perturbed gradient descent (PGD) for optimization on linear spaces [How to Escape Saddle Points Efficiently (2017) [17], Stochastic Gradient Descent Escapes Saddle Points Efficiently (2019) [18]], we propose a version of perturbed Riemannian gradient descent (PRGD) to show that necessary optimality conditions can be met approximately with high probability, without evaluating the Hessian. Specifically, for an arbitrary Riemannian manifold $\\mathcal{M}$ of dimension $d$, a sufficiently smooth (possibly non-convex) objective function $f$, and under weak conditions on the retraction chosen to move on the manifold, with high probability, our version of PRGD produces a point with gradient smaller than $\\epsilon$ and Hessian within $\\sqrt{\\epsilon}$ of being positive semidefinite in $O((\\log d)^4/\\epsilon^2)$ gradient queries. This matches the complexity of PGD in the Euclidean case. Crucially, the dependence on dimension is low. 
This matters for large-scale applications including PCA and low-rank matrix completion, which both admit natural formulations on manifolds. The key technical idea is to generalize PRGD with a distinction between two types of gradient steps: \u201csteps on the manifold\u201d and \u201cperturbed steps in a tangent space of the manifold.\u201d Ultimately, this distinction makes it possible to extend Jin et al.\u2019s analysis seamlessly.\n\n1 Introduction\n\nMachine learning has stimulated interest in obtaining global convergence rates in non-convex optimization. Consider a possibly non-convex objective function $f \\colon \\mathbb{R}^d \\to \\mathbb{R}$. We want to solve\n\n$$\\min_{x \\in \\mathbb{R}^d} f(x). \\tag{1}$$\n\nThis is hard in general. Instead, we usually settle for approximate first-order critical (or stationary) points where the gradient is small, or second-order critical (or stationary) points where the gradient is small and the Hessian is nearly positive semidefinite.\n\nOne of the simplest algorithms for solving (1) is gradient descent (GD): given $x_0$, iterate\n\n$$x_{t+1} = x_t - \\eta \\nabla f(x_t). \\tag{2}$$\n\nIt is well known that if $\\nabla f$ is Lipschitz continuous, with appropriate step-size $\\eta$, GD converges to first-order critical points. However, it may take exponential time to reach an approximate second-order critical point, thus, to escape saddle points [14]. There is an increasing amount of evidence that saddle points are a serious obstacle to the practical success of local optimization algorithms such as GD [25, 16]. This calls for algorithms which provably escape saddle points efficiently. We focus on methods which only have access to $f$ and $\\nabla f$ (but not $\\nabla^2 f$) through a black-box model.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nSeveral methods add noise to GD iterates in order to escape saddle points faster, under the assumption that $f$ has $L$-Lipschitz continuous gradient and $\\rho$-Lipschitz continuous Hessian. 
In this setting, an $\\epsilon$-second-order critical point is a point $x$ satisfying $\\|\\nabla f(x)\\| \\le \\epsilon$ and $\\nabla^2 f(x) \\succeq -\\sqrt{\\rho\\epsilon}\\, I$. Under the strict saddle assumption, with $\\epsilon$ small enough, such points are near (local) minimizers [16, 17].\n\nIn 2015, Ge et al. [16] gave a variant of stochastic gradient descent (SGD) which adds isotropic noise to iterates, showing it produces an $\\epsilon$-second-order critical point with high probability in $O(\\mathrm{poly}(d)/\\epsilon^4)$ stochastic gradient queries. In 2017, Jin et al. [17] presented a variant of GD, perturbed gradient descent (PGD), which reduces this complexity to $O((\\log d)^4/\\epsilon^2)$ full gradient queries. Recently, Jin et al. [18] simplified their own analysis of PGD, and extended it to stochastic gradient descent.\n\nJin et al.\u2019s PGD [18, Alg. 4] works as follows: If the gradient is large at iterate $x_t$, $\\|\\nabla f(x_t)\\| > \\epsilon$, then perform a gradient descent step: $x_{t+1} = x_t - \\eta \\nabla f(x_t)$. If the gradient is small at iterate $x_t$, $\\|\\nabla f(x_t)\\| \\le \\epsilon$, perturb $x_t$ by $\\eta\\xi$, with $\\xi$ sampled uniformly from a ball of fixed radius centered at zero. Starting from this new point $x_t + \\eta\\xi$, perform $\\mathcal{T}$ gradient descent steps, arriving at iterate $x_{t+\\mathcal{T}}$. From here, repeat this procedure starting at $x_{t+\\mathcal{T}}$. Crucially, Jin et al. [18] show that, if $x_t$ is not an $\\epsilon$-second-order critical point, then the function decreases enough from $x_t$ to $x_{t+\\mathcal{T}}$ with high probability, leading to an escape.\n\nIn this paper we generalize PGD to optimization problems on manifolds, i.e., problems of the form\n\n$$\\min_{x \\in \\mathcal{M}} f(x) \\tag{3}$$\n\nwhere $\\mathcal{M}$ is an arbitrary Riemannian manifold and $f \\colon \\mathcal{M} \\to \\mathbb{R}$ is sufficiently smooth [3]. Optimization on manifolds notably occurs in machine learning (e.g., PCA [35], low-rank matrix completion [12]), computer vision (e.g., [32]) and signal processing (e.g., [2]); see [4] for more. 
See [29] and [26] for examples of the strict saddle property on manifolds.\n\nGiven $x \\in \\mathcal{M}$, the (Riemannian) gradient of $f$ at $x$, $\\mathrm{grad} f(x)$, is a vector in the tangent space at $x$, $\\mathrm{T}_x\\mathcal{M}$. To perform gradient descent on a manifold, we need a way to move on the manifold along the direction of the gradient at $x$. This is provided by a retraction $\\mathrm{Retr}_x$: a smooth map from $\\mathrm{T}_x\\mathcal{M}$ to $\\mathcal{M}$. Riemannian gradient descent (RGD) performs steps on $\\mathcal{M}$ of the form\n\n$$x_{t+1} = \\mathrm{Retr}_{x_t}(-\\eta\\, \\mathrm{grad} f(x_t)). \\tag{4}$$\n\nFor Euclidean space, $\\mathcal{M} = \\mathbb{R}^d$, the standard retraction is $\\mathrm{Retr}_x(s) = x + s$, in which case (4) reduces to (2). For the sphere embedded in Euclidean space, $\\mathcal{M} = S^d \\subset \\mathbb{R}^{d+1}$, a natural retraction is given by metric projection to the sphere: $\\mathrm{Retr}_x(s) = (x + s)/\\|x + s\\|$.\n\nFor $x \\in \\mathcal{M}$, define the pullback $\\hat f_x = f \\circ \\mathrm{Retr}_x \\colon \\mathrm{T}_x\\mathcal{M} \\to \\mathbb{R}$, conveniently defined on a linear space. If Retr is nice enough (details below), the Riemannian gradient and Hessian of $f$ at $x$ equal the (classical) gradient and Hessian of $\\hat f_x$ at the origin of $\\mathrm{T}_x\\mathcal{M}$. Since $\\mathrm{T}_x\\mathcal{M}$ is a vector space, if we perform GD on $\\hat f_x$, we can almost directly apply Jin et al.\u2019s analysis [18]. This motivates the two-phase structure of our perturbed Riemannian gradient descent (PRGD), listed as Algorithm 1.\n\nOur PRGD is a variant of RGD (4) and a generalization of PGD. It works as follows: If the gradient is large at iterate $x_t \\in \\mathcal{M}$, $\\|\\mathrm{grad} f(x_t)\\| > \\epsilon$, perform an RGD step: $x_{t+1} = \\mathrm{Retr}_{x_t}(-\\eta\\, \\mathrm{grad} f(x_t))$. We call this a \u201cstep on the manifold.\u201d If the gradient at iterate $x_t$ is small, $\\|\\mathrm{grad} f(x_t)\\| \\le \\epsilon$, then perturb in the tangent space $\\mathrm{T}_{x_t}\\mathcal{M}$. After this perturbation, execute at most $\\mathcal{T}$ gradient descent steps on the pullback $\\hat f_{x_t}$, in the tangent space. We call these \u201ctangent space steps.\u201d We denote this sequence of $\\mathcal{T}$ tangent space steps by $\\{s_j\\}_{j \\ge 0}$. 
This sequence of steps is performed by TANGENTSPACESTEPS: a deterministic, vector-space procedure (see Algorithm 1).\n\nBy distinguishing between gradient descent steps on the manifold and those in a tangent space, we can apply Jin et al.\u2019s analysis almost directly [18], allowing us to prove PRGD reaches an $\\epsilon$-second-order critical point on $\\mathcal{M}$ in $O((\\log d)^4/\\epsilon^2)$ gradient queries. Regarding regularity of $f$, we require its pullbacks to satisfy Lipschitz-type conditions, as advocated in [11, 7]. The analysis is far less technical than if one runs all steps on the manifold. We expect that this two-phase approach may prove useful for the generalization of other algorithms and analyses from the Euclidean to the Riemannian realm.\n\nRecently, Sun and Fazel [30] provided the first generalization of PGD to certain manifolds with a polylogarithmic complexity in the dimension, improving earlier results by Ge et al. [16, App. B] which had a polynomial complexity. Both of these works focus on submanifolds of a Euclidean space, with the algorithm in [30] depending on the equality constraints chosen to describe this submanifold.\n\nAt the same time as the present paper, Sun et al. [31] improved their analysis to cover any complete Riemannian manifold with bounded sectional curvature. In contrast to ours, their algorithm executes all steps on the manifold. Their analysis requires the retraction to be the Riemannian exponential map (i.e., geodesics). Our regularity assumptions are similar but different: while we assume Lipschitz-type conditions on the pullbacks in small balls around the origins of tangent spaces, Sun et al. make Lipschitz assumptions on the cost function directly, using parallel transport and Riemannian distance. As a result, curvature appears in their results. We make no explicit assumptions on $\\mathcal{M}$ regarding curvature or completeness, though these may be implicit in our regularity assumptions: see Section 4. 
\n\nAlgorithm 1 PRGD($x_0, \\eta, r, \\mathcal{T}, \\epsilon, T, b$)\n$t \\leftarrow 0$\nwhile $t \\le T$ do\n    if $\\|\\mathrm{grad} f(x_t)\\| > \\epsilon$ then\n        $x_{t+1} \\leftarrow$ TANGENTSPACESTEPS($x_t, 0, \\eta, b, 1$)    $\\triangleright$ Riemannian gradient descent step\n        $t \\leftarrow t + 1$\n    else\n        $\\xi \\sim \\mathrm{Uniform}(B_{x_t,r}(0))$    $\\triangleright$ perturb\n        $s_0 = \\eta\\xi$\n        $x_{t+\\mathcal{T}} \\leftarrow$ TANGENTSPACESTEPS($x_t, s_0, \\eta, b, \\mathcal{T}$)    $\\triangleright$ perform $\\mathcal{T}$ steps in $\\mathrm{T}_{x_t}\\mathcal{M}$\n        $t \\leftarrow t + \\mathcal{T}$\n    end if\nend while\n\nprocedure TANGENTSPACESTEPS($x, s_0, \\eta, b, \\mathcal{T}$)\n    for $j = 0, 1, \\ldots, \\mathcal{T} - 1$ do\n        $s_{j+1} \\leftarrow s_j - \\eta \\nabla \\hat f_x(s_j)$\n        if $\\|s_{j+1}\\| \\ge b$ then    $\\triangleright$ if the iterate leaves the interior of the ball $B_{x,b}(0)$\n            $s_{\\mathcal{T}} \\leftarrow s_j - \\alpha\\eta \\nabla \\hat f_x(s_j)$, where $\\alpha \\in (0, 1]$ and $\\|s_j - \\alpha\\eta \\nabla \\hat f_x(s_j)\\| = b$\n            break\n        end if\n    end for\n    return $\\mathrm{Retr}_x(s_{\\mathcal{T}})$\nend procedure\n\n1.1 Main result\n\nHere we state our result informally. Formal results are stated in subsequent sections.\n\nTheorem 1.1 (Informal). Let $\\mathcal{M}$ be a Riemannian manifold of dimension $d$ equipped with a retraction Retr. Assume $f \\colon \\mathcal{M} \\to \\mathbb{R}$ is twice continuously differentiable, and furthermore:\n\nA1. $f$ is lower bounded.\nA2. The gradients of the pullbacks $f \\circ \\mathrm{Retr}_x$ uniformly satisfy a Lipschitz-type condition.\nA3. The Hessians of the pullbacks $f \\circ \\mathrm{Retr}_x$ uniformly satisfy a Lipschitz-type condition.\nA4. The retraction Retr uniformly satisfies a second-order condition.\n\nThen, setting $T = O((\\log d)^4/\\epsilon^2)$, PRGD visits several points with gradient smaller than $\\epsilon$ and, with high probability, at least two-thirds of those points are $\\epsilon$-second-order critical (Definition 3.1).\n\nPRGD uses $O((\\log d)^4/\\epsilon^2)$ gradient queries, and crucially no Hessian queries. 
The algorithm requires knowledge of the Lipschitz constants defined below, which makes this a mostly theoretical algorithm (but see Appendix D for explicit constants in the case of PCA).\n\n1.2 Other related work\n\nAlgorithms which efficiently escape saddle points can be classified into two families: first-order and second-order methods. First-order methods only use function value and gradient information. SGD and PGD are first-order methods. Second-order methods also access Hessian information. Newton\u2019s method, trust regions [24, 11] and adaptive cubic regularization [23, 7, 34] are second-order methods.\n\nAs noted above, Ge et al. [16] and Jin et al. [17] escape saddle points (in Euclidean space) by exploiting noise in iterations. There has also been similar work for normalized gradient descent [20]. Expanding on [17], Jin et al. [19] give an accelerated PGD algorithm (PAGD) which reaches an $\\epsilon$-second-order critical point of a non-convex function $f$ with high probability in $O((\\log d)^6/\\epsilon^{7/4})$ iterations. In [18], Jin et al. show that a stochastic version of PGD reaches an $\\epsilon$-second-order critical point in $O(d/\\epsilon^4)$ stochastic gradient queries; only $O(\\mathrm{poly}(\\log d)/\\epsilon^4)$ queries are needed if the stochastic gradients are well behaved. For an analysis of PGD under convex constraints, see [22].\n\nThere is another line of research, inspired by Langevin dynamics, in which judiciously scaled Gaussian noise is added at every iteration. We note that although this differs from the first incarnation of PGD in [17], this resembles a simplified version of PGD in [18]. Sang and Liu [27] develop an algorithm (adaptive stochastic gradient Langevin dynamics, ASGLD), which provably reaches an $\\epsilon$-second-order critical point in $O(\\log d/\\epsilon^4)$ with high probability. 
With full gradients, ASGLD reaches an $\\epsilon$-second-order critical point in $O(\\log d/\\epsilon^2)$ queries with high probability.\n\nOne might hope that the noise inherent in vanilla SGD would help it escape saddle points without noise injection. Daneshmand et al. [13] propose the correlated negative curvature assumption (CNC), under which they prove that SGD reaches an $\\epsilon$-second-order critical point in $O(\\epsilon^{-5})$ queries with high probability. They also show that, under the CNC assumption, a variant of GD (in which iterates are perturbed only by SGD steps) efficiently escapes saddle points. Importantly, these guarantees are completely dimension-free.\n\nA first-order method can include approximations of the Hessian (e.g., with a difference of gradients). For example, Allen-Zhu\u2019s Natasha 2 algorithm [8] uses first-order information (function value and stochastic gradients) to search for directions of negative curvature of the Hessian. Natasha 2 reaches an $\\epsilon$-second-order critical point in $O(\\epsilon^{-13/4})$ iterations.\n\nMany classical optimization algorithms have been generalized to optimization on manifolds, including gradient descent, Newton\u2019s method, trust regions and adaptive cubic regularization [15, 3, 1, 6, 11, 7, 9, 34]. Bonnabel [10] extends stochastic gradient descent to Riemannian manifolds and proves that Riemannian SGD converges to critical points of the cost function. Zhang et al. [33] and Sato et al. [28] both use variance reduction to speed up SGD on Riemannian manifolds.\n\n2 Preliminaries: Optimization on manifolds\n\nWe review the key definitions and tools for optimization on manifolds. For more information, see [3]. Let $\\mathcal{M}$ be a $d$-dimensional Riemannian manifold: a real, smooth $d$-manifold equipped with a Riemannian metric. We associate with each $x \\in \\mathcal{M}$ a $d$-dimensional real vector space $\\mathrm{T}_x\\mathcal{M}$, called the tangent space at $x$. 
For embedded submanifolds of $\\mathbb{R}^n$, we often visualize the tangent space as being tangent to the manifold at $x$. The Riemannian metric defines an inner product $\\langle \\cdot, \\cdot \\rangle_x$ on the tangent space $\\mathrm{T}_x\\mathcal{M}$, with associated norm $\\|\\cdot\\|_x$. We denote these by $\\langle \\cdot, \\cdot \\rangle$ and $\\|\\cdot\\|$ when $x$ is clear from context. A vector in the tangent space is a tangent vector. The set of pairs $(x, s_x)$ for $x \\in \\mathcal{M}$, $s_x \\in \\mathrm{T}_x\\mathcal{M}$ is called the tangent bundle $\\mathrm{T}\\mathcal{M}$. Define $B_{x,r}(s) = \\{\\dot s \\in \\mathrm{T}_x\\mathcal{M} : \\|\\dot s - s\\|_x \\le r\\}$: the closed ball of radius $r$ centered at $s \\in \\mathrm{T}_x\\mathcal{M}$. We occasionally denote $B_{x,r}(s)$ by $B_r(s)$ when $x$ is clear from context. Let $\\mathrm{Uniform}(B_{x,r}(s))$ denote the uniform distribution over the ball $B_{x,r}(s)$.\n\nThe Riemannian gradient $\\mathrm{grad} f(x)$ of a differentiable function $f$ at $x \\in \\mathcal{M}$ is the unique vector in $\\mathrm{T}_x\\mathcal{M}$ satisfying $\\mathrm{D}f(x)[s] = \\langle \\mathrm{grad} f(x), s \\rangle_x$ $\\forall s \\in \\mathrm{T}_x\\mathcal{M}$, where $\\mathrm{D}f(x)[s]$ is the directional derivative of $f$ at $x$ along $s$. The Riemannian metric gives rise to a well-defined notion of derivative of vector fields called the Riemannian (or Levi\u2013Civita) connection $\\nabla$. The Hessian of $f$ is the derivative of the gradient vector field: $\\mathrm{Hess} f(x)[u] = \\nabla_u \\mathrm{grad} f(x)$. The Hessian describes how the gradient changes. $\\mathrm{Hess} f(x)$ is a symmetric linear operator on $\\mathrm{T}_x\\mathcal{M}$. If the manifold is a Euclidean space, $\\mathcal{M} = \\mathbb{R}^d$, with the standard metric $\\langle x, y \\rangle = x^T y$, the Riemannian gradient $\\mathrm{grad} f$ and Hessian $\\mathrm{Hess} f$ coincide with the standard gradient $\\nabla f$ and Hessian $\\nabla^2 f$ (mind the overloaded notation $\\nabla$).\n\nAs discussed in Section 1, the retraction is a mapping which allows us to move along the manifold from a point $x$ in the direction of a tangent vector $s \\in \\mathrm{T}_x\\mathcal{M}$. Formally:\n\nDefinition 2.1 (Retraction, from [3]). A retraction on a manifold $\\mathcal{M}$ is a smooth mapping Retr from the tangent bundle $\\mathrm{T}\\mathcal{M}$ to $\\mathcal{M}$ satisfying properties 1 and 2 below. Let $\\mathrm{Retr}_x \\colon \\mathrm{T}_x\\mathcal{M} \\to \\mathcal{M}$ denote the restriction of Retr to $\\mathrm{T}_x\\mathcal{M}$.\n\n1. $\\mathrm{Retr}_x(0_x) = x$, where $0_x$ is the zero vector in $\\mathrm{T}_x\\mathcal{M}$.\n2. 
The differential of $\\mathrm{Retr}_x$ at $0_x$, $\\mathrm{D}\\mathrm{Retr}_x(0_x)$, is the identity map.\n\n(Our algorithm and theory only require Retr to be defined in balls of a fixed radius around the origins of tangent spaces.) Recall these special retractions, which are good to keep in mind for intuition: on $\\mathcal{M} = \\mathbb{R}^d$, we typically use $\\mathrm{Retr}_x(s) = x + s$, and on the unit sphere we typically use $\\mathrm{Retr}_x(s) = (x + s)/\\|x + s\\|$.\n\nFor $x$ in $\\mathcal{M}$, define the pullback of $f$ from the manifold to the tangent space by\n\n$$\\hat f_x = f \\circ \\mathrm{Retr}_x \\colon \\mathrm{T}_x\\mathcal{M} \\to \\mathbb{R}.$$\n\nThis is a real function on a vector space. Furthermore, for $x \\in \\mathcal{M}$ and $s \\in \\mathrm{T}_x\\mathcal{M}$, let\n\n$$T_{x,s} = \\mathrm{D}\\mathrm{Retr}_x(s) \\colon \\mathrm{T}_x\\mathcal{M} \\to \\mathrm{T}_{\\mathrm{Retr}_x(s)}\\mathcal{M}$$\n\ndenote the differential of $\\mathrm{Retr}_x$ at $s$ (a linear operator). The gradient and Hessian of the pullback admit the following nice expressions in terms of those of $f$, and the retraction.\n\nLemma 2.2 (Lemma 5.2 of [7]). For $f \\colon \\mathcal{M} \\to \\mathbb{R}$ twice continuously differentiable, $x \\in \\mathcal{M}$ and $s \\in \\mathrm{T}_x\\mathcal{M}$, with $T_{x,s}^*$ denoting the adjoint of $T_{x,s}$,\n\n$$\\nabla \\hat f_x(s) = T_{x,s}^* \\, \\mathrm{grad} f(\\mathrm{Retr}_x(s)), \\qquad \\nabla^2 \\hat f_x(s) = T_{x,s}^* \\, \\mathrm{Hess} f(\\mathrm{Retr}_x(s)) \\, T_{x,s} + W_s, \\tag{5}$$\n\nwhere $W_s$ is a symmetric linear operator on $\\mathrm{T}_x\\mathcal{M}$ defined through polarization by\n\n$$\\langle W_s[\\dot s], \\dot s \\rangle = \\langle \\mathrm{grad} f(\\mathrm{Retr}_x(s)), \\gamma''(0) \\rangle, \\tag{6}$$\n\nwith $\\gamma''(0) \\in \\mathrm{T}_{\\mathrm{Retr}_x(s)}\\mathcal{M}$ the intrinsic acceleration on $\\mathcal{M}$ of $\\gamma(t) = \\mathrm{Retr}_x(s + t\\dot s)$ at $t = 0$.\n\nThe velocity of a curve $\\gamma \\colon \\mathbb{R} \\to \\mathcal{M}$ is $\\frac{d\\gamma}{dt} = \\gamma'(t)$. The intrinsic acceleration $\\gamma''$ of $\\gamma$ is the covariant derivative (induced by the Levi\u2013Civita connection) of the velocity of $\\gamma$: $\\gamma'' = \\frac{\\mathrm{D}}{dt}\\gamma'$. When $\\mathcal{M}$ is a Riemannian submanifold of $\\mathbb{R}^n$, $\\gamma''(t)$ does not necessarily coincide with $\\frac{d^2\\gamma}{dt^2}$: in this case, $\\gamma''(t)$ is the orthogonal projection of $\\frac{d^2\\gamma}{dt^2}$ onto $\\mathrm{T}_{\\gamma(t)}\\mathcal{M}$.\n\n3 PRGD efficiently escapes saddle points\n\nWe now precisely state the assumptions, the main result, and some important parts of the proof of the main result, including the main obstacles faced in generalizing PGD to manifolds. 
A full proof of all results is provided in the appendix.\n\n3.1 Assumptions\n\nThe first assumption, namely, that $f$ is lower bounded, ensures that there are points on the manifold where the gradient is arbitrarily small.\n\nAssumption 1. $f$ is lower bounded: $f(x) \\ge f^*$ for all $x \\in \\mathcal{M}$.\n\nGeneralizing from the Euclidean case, we assume Lipschitz-type conditions on the gradients and Hessians of the pullbacks $\\hat f_x = f \\circ \\mathrm{Retr}_x$. For the special case of $\\mathcal{M} = \\mathbb{R}^d$ and $\\mathrm{Retr}_x(s) = x + s$, these assumptions hold if the gradient $\\nabla f(\\cdot)$ and Hessian $\\nabla^2 f(\\cdot)$ are each Lipschitz continuous, as in [18, A1] (with the same constants). The Lipschitz-type assumptions below are similar to assumption A2 of [7]. Notice that these assumptions involve both the cost function and the retraction: this dependency is further discussed in [11, 7] for a similar setting.\n\nAssumption 2. There exist $b_1 > 0$ and $L > 0$ such that $\\forall x \\in \\mathcal{M}$ and $\\forall s \\in \\mathrm{T}_x\\mathcal{M}$ with $\\|s\\| \\le b_1$,\n\n$$\\|\\nabla \\hat f_x(s) - \\nabla \\hat f_x(0)\\| \\le L\\|s\\|.$$\n\nAssumption 3. There exist $b_2 > 0$ and $\\rho > 0$ such that $\\forall x \\in \\mathcal{M}$ and $\\forall s \\in \\mathrm{T}_x\\mathcal{M}$ with $\\|s\\| \\le b_2$,\n\n$$\\|\\nabla^2 \\hat f_x(s) - \\nabla^2 \\hat f_x(0)\\| \\le \\rho\\|s\\|,$$\n\nwhere on the left-hand side we use the operator norm.\n\nMore precisely, we only need these assumptions to hold at the iterates $x_0, x_1, \\ldots$ Let $b = \\min\\{b_1, b_2\\}$ (to reduce the number of parameters in Algorithm 1). The next assumption requires the chosen retraction to be well behaved, in the sense that the (intrinsic) acceleration of curves $\\gamma_{x,s}$ on the manifold, defined below, must remain bounded; compare with Lemma 2.2.\n\nAssumption 4. There exists $\\beta \\ge 0$ such that, for all $x \\in \\mathcal{M}$ and $s \\in \\mathrm{T}_x\\mathcal{M}$ satisfying $\\|s\\| = 1$, the curve $\\gamma_{x,s}(t) = \\mathrm{Retr}_x(ts)$ has initial acceleration bounded by $\\beta$: $\\|\\gamma''_{x,s}(0)\\| \\le \\beta$.\n\nIf Assumption 4 holds with $\\beta = 0$, Retr is said to be second order [3, p107]. 
Second-order retractions include the so-called exponential map and the standard retractions on $\\mathbb{R}^d$ and the unit sphere mentioned earlier; see [5] for a large class of such retractions on relevant manifolds.\n\nDefinition 3.1. A point $x \\in \\mathcal{M}$ is an $\\epsilon$-second-order critical point of the twice-differentiable function $f \\colon \\mathcal{M} \\to \\mathbb{R}$ satisfying Assumption 3 if\n\n$$\\|\\mathrm{grad} f(x)\\| \\le \\epsilon, \\quad \\text{and} \\quad \\lambda_{\\min}(\\mathrm{Hess} f(x)) \\ge -\\sqrt{\\rho\\epsilon}, \\tag{7}$$\n\nwhere $\\lambda_{\\min}(H)$ denotes the smallest eigenvalue of the symmetric operator $H$.\n\nFor compact manifolds, all of these assumptions hold (all proofs are in the appendix):\n\nLemma 3.2. Let $\\mathcal{M}$ be a compact Riemannian manifold equipped with a retraction Retr. Assume $f \\colon \\mathcal{M} \\to \\mathbb{R}$ is three times continuously differentiable. Pick an arbitrary $b > 0$. Then, there exist $f^*$, $L > 0$, $\\rho > 0$ and $\\beta \\ge 0$ such that Assumptions 1, 2, 3 and 4 are satisfied.\n\n3.2 Main results\n\nRecall that PRGD (Algorithm 1) works as follows. If $\\|\\mathrm{grad} f(x_t)\\| > \\epsilon$, perform a Riemannian gradient descent step, $x_{t+1} = \\mathrm{Retr}_{x_t}(-\\eta\\, \\mathrm{grad} f(x_t))$. If $\\|\\mathrm{grad} f(x_t)\\| \\le \\epsilon$, then perturb, i.e., sample $\\xi \\sim \\mathrm{Uniform}(B_{x_t,r}(0))$ and let $s_0 = \\eta\\xi$. After this perturbation, remain in the tangent space $\\mathrm{T}_{x_t}\\mathcal{M}$ and do (at most) $\\mathcal{T}$ gradient descent steps on the pullback $\\hat f_{x_t}$, starting from $s_0$. We denote this sequence of $\\mathcal{T}$ tangent space steps by $\\{s_j\\}_{j \\ge 0}$. This sequence of gradient descent steps is performed by TANGENTSPACESTEPS: a deterministic procedure in the (linear) tangent space.\n\nOne difficulty with this approach is that, under our assumptions, for some $x = x_t$, $\\nabla \\hat f_x$ may not be Lipschitz continuous in all of $\\mathrm{T}_x\\mathcal{M}$. However, it is easy to show that $\\nabla \\hat f_x$ is Lipschitz continuous in the ball of radius $b$ by compactness, uniformly in $x$. This is why we limit our algorithm to these balls. 
If the sequence of iterates $\\{s_j\\}_{j \\ge 0}$ escapes the ball $B_{x,b}(0) \\subset \\mathrm{T}_x\\mathcal{M}$ for some $s_j$, TANGENTSPACESTEPS returns the point between $s_{j-1}$ and $s_j$ on the boundary of that ball.\n\nFollowing [18], we use a set of carefully balanced parameters. Parameters $\\epsilon$ and $\\delta$ are user defined. The claim in Theorem 3.4 below holds with probability at least $1 - \\delta$. Assumption 1 provides parameter $f^*$. Assumptions 2 and 3 provide parameters $L$, $\\rho$ and $b = \\min\\{b_1, b_2\\}$. As announced, the latter two assumptions further ensure Lipschitz continuity of the gradients of the pullbacks in balls of the tangent spaces, uniformly: this defines the parameter $\\ell$, as prescribed below.\n\nLemma 3.3. Under Assumptions 2 and 3, there exists $\\ell \\in [L, L + \\rho b]$ such that, for all $x \\in \\mathcal{M}$, the gradient of the pullback, $\\nabla \\hat f_x$, is $\\ell$-Lipschitz continuous in the ball $B_{x,b}(0)$.\n\nThen, choose $\\chi > 1/4$ (preferably small) such that\n\n$$\\chi \\ge 4 \\log_2\\left( \\frac{2^{31} \\ell^2 \\sqrt{d}\\, (f(x_0) - f^*)}{\\delta \\sqrt{\\rho}\\, \\epsilon^{5/2}} \\right), \\tag{8}$$\n\nand set algorithm parameters\n\n$$\\eta = \\frac{1}{\\ell}, \\qquad r = \\frac{\\epsilon}{400\\chi^3}, \\qquad \\mathcal{T} = \\frac{\\ell\\chi}{\\sqrt{\\rho\\epsilon}}, \\tag{9}$$\n\nwhere $\\chi$ is such that $\\mathcal{T}$ is an integer. We also use this notation in the proofs:\n\n$$\\mathcal{F} = \\frac{1}{50\\chi^3} \\sqrt{\\frac{\\epsilon^3}{\\rho}}, \\qquad \\mathcal{L} = \\frac{1}{4\\chi} \\sqrt{\\frac{\\epsilon}{\\rho}}. \\tag{10}$$\n\nTheorem 3.4. Assume $f$ satisfies Assumptions 1, 2 and 3. For any $x_0 \\in \\mathcal{M}$, with $0 < \\epsilon \\le b^2\\rho$, $L \\ge \\sqrt{\\rho\\epsilon}$, $\\epsilon^{3/2} \\le 3\\sqrt{\\rho}\\, (f(x_0) - f^*)$ and $\\delta \\in (0, 1)$, choose $\\eta, r, \\mathcal{T}$ as in (9). Then, setting\n\n$$T = 8 \\max\\left\\{ \\frac{\\mathcal{T}}{3}, \\frac{(f(x_0) - f^*)\\mathcal{T}}{\\mathcal{F}}, \\frac{f(x_0) - f^*}{\\eta\\epsilon^2} \\right\\} = O\\left( \\frac{\\ell (f(x_0) - f^*)}{\\epsilon^2} (\\log d)^4 \\right), \\tag{11}$$\n\nPRGD($x_0, \\eta, r, \\mathcal{T}, \\epsilon, T, b$) visits at least two iterates $x_t \\in \\mathcal{M}$ satisfying $\\|\\mathrm{grad} f(x_t)\\| \\le \\epsilon$. 
With probability at least $1 - \\delta$, at least two-thirds of those iterates satisfy\n\n$$\\|\\mathrm{grad} f(x_t)\\| \\le \\epsilon \\quad \\text{and} \\quad \\lambda_{\\min}(\\nabla^2 \\hat f_{x_t}(0)) \\ge -\\sqrt{\\rho\\epsilon}.$$\n\nThe algorithm uses at most $T + \\mathcal{T} \\le 2T$ gradient queries (and no function or Hessian queries).\n\nBy Assumption 4 and Lemma 2.2, $\\nabla^2 \\hat f_{x_t}(0)$ is close to $\\mathrm{Hess} f(x_t)$, which allows us to conclude:\n\nCorollary 3.5. Assume $f$ satisfies Assumptions 1, 2, 3 and 4. For an arbitrary $x_0 \\in \\mathcal{M}$, with $0 < \\epsilon \\le \\min\\{\\rho/\\beta^2, b^2\\rho\\}$, $L \\ge \\sqrt{\\rho\\epsilon}$, $\\epsilon^{3/2} \\le 3\\sqrt{\\rho}\\, (f(x_0) - f^*)$ and $\\delta \\in (0, 1)$, choose $\\eta, r, \\mathcal{T}$ as in (9). Then, setting $T$ as in (11), PRGD($x_0, \\eta, r, \\mathcal{T}, \\epsilon, T, b$) visits at least two iterates $x_t \\in \\mathcal{M}$ satisfying $\\|\\mathrm{grad} f(x_t)\\| \\le \\epsilon$. With probability at least $1 - \\delta$, at least two-thirds of those iterates are $(4\\epsilon)$-second-order points. If $\\beta = 0$ (that is, the retraction is second order), then the same claim holds for $\\epsilon$-second-order points instead of $4\\epsilon$. The algorithm uses at most $T + \\mathcal{T} \\le 2T$ gradient queries.\n\nAssume $\\mathcal{M} = \\mathbb{R}^d$ with standard inner product and standard retraction $\\mathrm{Retr}_x(s) = x + s$. As in [18], assume $f \\colon \\mathbb{R}^d \\to \\mathbb{R}$ is lower bounded, $\\nabla f$ is $L$-Lipschitz in $\\mathbb{R}^d$, and $\\nabla^2 f$ is $\\rho$-Lipschitz in $\\mathbb{R}^d$. Then, Assumptions 1, 2 and 3 hold with $b = +\\infty$. Furthermore, Assumption 4 holds with $\\beta = 0$ so that $\\nabla^2 \\hat f_x(0) = \\mathrm{Hess} f(x) = \\nabla^2 f(x)$ (Lemma 2.2). For all $x \\in \\mathcal{M}$, $\\nabla \\hat f_x(s)$ has Lipschitz constant $\\ell = L$ since $\\hat f_x(s) = f(x + s)$. Therefore, using $b = +\\infty$, $\\ell = L$ and choosing $\\eta, r, \\mathcal{T}$ as in (9), PRGD reduces to PGD, and Theorem 3.4 recovers the result of Jin et al. [18]: this confirms that the present result is a bona fide generalization.\n\nFor the important special case of compact manifolds, Lemmas 3.2 and 3.3 yield:\n\nCorollary 3.6. Assume $\\mathcal{M}$ is a compact Riemannian manifold equipped with a retraction Retr, and $f \\colon \\mathcal{M} \\to \\mathbb{R}$ is three times continuously differentiable. Pick an arbitrary $b > 0$. 
Then, Assumptions 1, 2, 3, 4 hold for some $L > 0$, $\\rho > 0$, $\\beta \\ge 0$, so that Corollary 3.5 applies with some $\\ell \\in [L, L + \\rho b]$.\n\nRemark 3.7. PRGD, like PGD (Algorithm 4 in [18]), does not specify which iterate is an $\\epsilon$-second-order critical point. However, it is straightforward to include a termination condition in PRGD which halts the algorithm and returns a suspected $\\epsilon$-second-order critical point. Indeed, Jin et al. include such a termination condition in their original PGD algorithm [17], which here would go as follows: After performing a perturbation and $\\mathcal{T}$ (tangent space) steps in $\\mathrm{T}_{x_t}\\mathcal{M}$, return $x_t$ if $\\hat f_{x_t}(s_{\\mathcal{T}}) - \\hat f_{x_t}(0) > -f_{\\mathrm{thres}}$, i.e., the function value does not decrease enough. The termination condition requires a threshold $f_{\\mathrm{thres}}$ which is balanced like the other parameters of PRGD in (9).\n\n3.3 Main proof ideas\n\nTheorem 3.4 follows from the following two lemmas which we prove in the appendix. These lemmas state that, in each round of the while-loop in PRGD, if $x_t$ is not at an $\\epsilon$-second-order critical point, PRGD makes progress, that is, decreases the cost function value (the first lemma is deterministic, the second one is probabilistic). Yet, the value of $f$ on the iterates can only decrease so much because $f$ is bounded below by $f^*$. Therefore, the probability that PRGD does not visit an $\\epsilon$-second-order critical point is low.\n\nLemma 3.8. Under Assumptions 2 and 3, set $\\eta = 1/\\ell$ for some $\\ell \\ge L$. If $x \\in \\mathcal{M}$ satisfies $\\|\\mathrm{grad} f(x)\\| > \\epsilon$ with $\\epsilon \\le b^2\\rho$ and $L \\ge \\sqrt{\\rho\\epsilon}$, then,\n\n$$f(\\mathrm{TANGENTSPACESTEPS}(x, 0, \\eta, b, 1)) - f(x) \\le -\\eta\\epsilon^2/2.$$\n\nLemma 3.9. Under Assumptions 2 and 3, let $x \\in \\mathcal{M}$ satisfy both $\\|\\mathrm{grad} f(x)\\| \\le \\epsilon$ and $\\lambda_{\\min}(\\nabla^2 \\hat f_x(0)) \\le -\\sqrt{\\rho\\epsilon}$ with $\\epsilon \\le b^2\\rho$ and $L \\ge \\sqrt{\\rho\\epsilon}$. Set $\\eta, r, \\mathcal{T}, \\mathcal{F}$ as in (9) and (10). Let $s_0 = \\eta\\xi$ with $\\xi \\sim \\mathrm{Uniform}(B_{x,r}(0))$. 
Then,\n\n$$\\mathbb{P}\\left[ f(\\mathrm{TANGENTSPACESTEPS}(x, s_0, \\eta, b, \\mathcal{T})) - f(x) \\le -\\mathcal{F}/2 \\right] \\ge 1 - \\frac{\\ell\\sqrt{d}}{\\sqrt{\\rho\\epsilon}}\\, 2^{10 - \\chi/2}.$$\n\nLemma 3.8 states that we are guaranteed to make progress if the gradient is large. This follows from the sufficient decrease of RGD steps. Lemma 3.9 states that, with perturbation, GD on the pullback escapes a saddle point with high probability. Lemma 3.9 is analogous to Lemma 11 in [18].\n\nLet $\\mathcal{X}_{\\mathrm{stuck}}$ be the set of tangent vectors $s_0$ in $B_{x,\\eta r}(0)$ for which GD on the pullback starting from $s_0$ does not escape the saddle point, i.e., the function value does not decrease enough after $\\mathcal{T}$ iterations. Following Jin et al.\u2019s analysis [18], we bound the width of this \u201cstuck region\u201d (in the direction of the eigenvector $e_1$ associated with the minimum eigenvalue of the Hessian of the pullback, $\\nabla^2 \\hat f_x(0)$). Like Jin et al., we do this with a coupling argument, showing that given two GD sequences with starting points sufficiently far apart, one of these sequences must escape. This is formalized in Lemma C.4 of the appendix. A crucial observation to prove Lemma C.4 is that, if the function value of GD iterates does not decrease much, then these iterates must be localized; this is formalized in Lemma C.3 of the appendix, which Jin et al. call \u201cimprove or localize.\u201d\n\nWe stress that the stuck region concept, coupling argument, improve or localize paradigm, and details of the analysis are due to Jin et al. [18]: our main contribution is to show a clean way to generalize the algorithm to manifolds in such a way that the analysis extends with little friction. We believe that the general idea of separating iterations between the manifold and the tangent spaces to achieve different objectives may prove useful to generalize other algorithms as well.\n\n4 About the role of curvature of the manifold\n\nAs pointed out in the introduction, concurrently with our work, Sun et al. 
[31] have proposed another generalization of PGD to manifolds. Their algorithm executes all steps on the manifold directly (as opposed to our own, which makes certain steps in the tangent spaces), and moves around the manifold using the exponential map. To carry out their analysis, Sun et al. assume $f$ is regular in the following way. The Riemannian gradient is Lipschitz continuous in a Riemannian sense, namely,\n\n$$\\|\\mathrm{grad} f(y) - \\Gamma_x^y \\mathrm{grad} f(x)\\| \\leq L\\, \\mathrm{dist}(x, y), \\qquad \\forall x, y \\in \\mathcal{M},$$\n\nwhere $\\Gamma_x^y : \\mathrm{T}_x\\mathcal{M} \\to \\mathrm{T}_y\\mathcal{M}$ denotes parallel transport from $x$ to $y$ along any minimizing geodesic, and $\\mathrm{dist}$ is the Riemannian distance. These notions are well defined if $\\mathcal{M}$ is a connected, complete manifold. Similarly, they assume the Riemannian Hessian of $f$ is Lipschitz continuous in a Riemannian sense:\n\n$$\\|\\mathrm{Hess} f(y) - \\Gamma_x^y \\circ \\mathrm{Hess} f(x) \\circ \\Gamma_y^x\\| \\leq \\rho\\, \\mathrm{dist}(x, y), \\qquad \\forall x, y \\in \\mathcal{M},$$\n\nin the operator norm. Using (and improving) sophisticated inequalities from Riemannian geometry, they map the perturbed sequences back to tangent spaces for analysis, where they run an adapted version of Jin et al.\u2019s argument. In so doing, it appears to be crucial to use the exponential map, owing to its favorable interplay with parallel transport along geodesics and Riemannian distance, providing a good fit with the regularity conditions above.\n\nAs they map sequences back from the manifold to a common tangent space through the inverse of the exponential map, the Riemannian curvature of the manifold comes into play. Consequently, they assume $\\mathcal{M}$ has bounded sectional curvature (both from below and from above), and these bounds on curvature come up in their final complexity result: constants degrade if the manifold is more curved.\n\nSince Riemannian curvature does not occur in our own complexity result for PRGD, it is legitimate to ask: is curvature supposed to occur? 
If so, it must be hidden in our analysis, for example in the regularity assumptions we make, which are expressed in terms of pullbacks rather than with parallel transports. And indeed, in several attempts to deduce our own assumptions from those of Sun et al., invariably, we had to degrade $L$ and $\\rho$ as a function of curvature, bearing in mind that these are only bounds. On the other hand, under the assumptions of Sun et al., one can deduce that the regularity assumptions required in [11, 7] for the analysis of Riemannian gradient descent, trust regions and adaptive regularization by cubics hold with the exponential map, leading to curvature-free complexity bounds for all three algorithms. Thus, it is not clear that curvature should occur.\n\nWe believe this poses an interesting question regarding the complexity of optimization on manifolds: to what extent should it be influenced by curvature of the manifold? We intend to study this.\n\n5 Perspectives\n\nTo perform PGD (Algorithm 4 of [18]), one must know the step size $\\eta$, perturbation radius $r$ and the number of steps $T$ to perform after perturbation. These parameters are carefully balanced, and their values depend on the smoothness parameters $L$ and $\\rho$. In most situations, we do not know $L$ or $\\rho$ (though see Appendix D for PCA). An algorithm which does not require knowledge of $L$ or $\\rho$ but still has the same guarantees as PGD would be useful. However, that certain regularity parameters must be known seems inevitable, in particular for the Hessian\u2019s $\\rho$. Indeed, the main theorems make statements about the spectrum of the Hessian, yet the algorithm is not allowed to query the Hessian.\n\nGD equipped with a backtracking line-search method achieves an $\\epsilon$-first-order critical point in $O(\\epsilon^{-2})$ gradient queries without knowledge of the Lipschitz constant $L$. 
At each iterate $x_t$ of GD, backtracking line-search essentially uses function and gradient queries to estimate the gradient Lipschitz parameter near $x_t$. Perhaps PGD can perform some kind of line-search to locally estimate $L$ and $\\rho$. We note that if $\\rho$ is known and we use line-search-type methods to estimate $L$, there are still difficulties applying Jin et al.\u2019s coupling argument.\n\nJin et al. [18] develop a stochastic version of PGD known as PSGD. Instead of perturbing when the gradient is small and performing $T$ GD steps, PSGD simply performs a stochastic gradient step and perturbation at each step. Distinguishing between manifold steps and tangent space steps, we suspect it is possible to develop a Riemannian version of perturbed stochastic gradient descent which achieves an $\\epsilon$-second-order critical point in $O(d/\\epsilon^4)$ stochastic gradient queries, like PSGD. Such a Riemannian version would perform a certain number of steps in the tangent space, like PRGD.\n\nMore broadly, we anticipate that it should be possible to extend several classical optimization methods from the Euclidean case to the Riemannian case through this approach of running many steps in a given tangent space before retracting. This ought to be particularly beneficial for algorithms whose computations or analysis rely intimately on linear structures, such as coordinate descent algorithms, certain parallelized schemes, and possibly also accelerated schemes. In preparing the final version of this paper, we found that this idea is also the subject of another paper at NeurIPS 2019, where it is called dynamic trivialization [21].\n\nAcknowledgments\n\nWe thank Yue Sun, Nicolas Flammarion and Maryam Fazel, authors of [31], for numerous relevant discussions. NB is partially supported by NSF grant DMS-1719558.\n\nReferences\n\n[1] P.-A. Absil, C. G. Baker, and K. A. Gallivan. 
Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303\u2013330, 2007.\n\n[2] P.-A. Absil and K. A. Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 5:945\u2013948, 2006.\n\n[3] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.\n\n[4] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization on manifolds: Methods and applications. Recent Advances in Optimization and its Applications in Engineering, Springer, pages 125\u2013144, 2010.\n\n[5] P.-A. Absil and J. Malick. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135\u2013158, 2012.\n\n[6] R. Adler, J. Dedieu, J. Margulies, M. Martens, and M. Shub. Newton\u2019s method on Riemannian manifolds and a geometric model for the human spine. IMA Journal of Numerical Analysis, 22(3):359\u2013390, 2002.\n\n[7] N. Agarwal, N. Boumal, B. Bullins, and C. Cartis. Adaptive regularization with cubics on manifolds. arXiv preprint arXiv:1806.00065, 2019.\n\n[8] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, pages 2675\u20132686, 2018.\n\n[9] G. C. Bento, O. P. Ferreira, and J. G. Melo. Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. Journal of Optimization Theory and Applications, 173(2):548\u2013562, 2017.\n\n[10] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217\u20132229, 2013.\n\n[11] N. Boumal, P.-A. Absil, and C. Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 2018.\n\n[12] Leopold Cambier and P.-A. Absil. Robust low-rank matrix completion by Riemannian optimization. 
SIAM Journal on Scientific Computing, 38(5):S440\u2013S460, 2016.\n\n[13] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1155\u20131164, Stockholmsm\u00e4ssan, Stockholm, Sweden, 10\u201315 Jul 2018. PMLR.\n\n[14] Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067\u20131077, 2017.\n\n[15] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303\u2013353, 1998.\n\n[16] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points\u2013online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797\u2013842, 2015.\n\n[17] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1724\u20131732. JMLR.org, 2017.\n\n[18] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan. Stochastic gradient descent escapes saddle points efficiently. arXiv:1902.04811, 2019.\n\n[19] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In S\u00e9bastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1042\u20131085. PMLR, 06\u201309 Jul 2018.\n\n[20] Kfir Y. Levy. The power of normalization: Faster evasion of saddle points. 
arXiv:1611.04831, 2016.\n\n[21] Mario Lezcano Casado. Trivializations for gradient-based optimization on manifolds. In Proceedings of NeurIPS, 2019.\n\n[22] Aryan Mokhtari, Asuman Ozdaglar, and Ali Jadbabaie. Escaping saddle points in constrained optimization. In Advances in Neural Information Processing Systems, pages 3629\u20133639, 2018.\n\n[23] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177\u2013205, 2006.\n\n[24] J. Nocedal and S. Wright. Numerical Optimization. Springer Verlag, 1999.\n\n[25] Razvan Pascanu, Yann N. Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point problem for non-convex optimization. arXiv:1405.4604, 2014.\n\n[26] T. Pumir, S. Jelassi, and N. Boumal. Smoothed analysis of the low-rank approach for smooth semidefinite programs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2283\u20132292. Curran Associates, Inc., 2018.\n\n[27] Hejian Sang and Jia Liu. Adaptive stochastic gradient Langevin dynamics: Taming convergence and saddle point escape time. arXiv:1805.09416, 2018.\n\n[28] Hiroyuki Sato, Hiroyuki Kasai, and Bamdev Mishra. Riemannian stochastic variance reduced gradient. SIAM Journal on Optimization, 29(2):1444\u20131472, 2017.\n\n[29] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853\u2013884, 2016.\n\n[30] Yue Sun and Maryam Fazel. Escaping saddle points efficiently in equality-constrained optimization problems. ICML, 2018.\n\n[31] Yue Sun, Nicolas Flammarion, and Maryam Fazel. Escaping from saddle points on Riemannian manifolds. In Proceedings of NeurIPS, 2019.\n\n[32] Pavan Turaga, Ashok Veeraraghavan, and Rama Chellappa. 
Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.\n\n[33] Hongyi Zhang, Sashank J. Reddi, and Suvrit Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4592\u20134600. Curran Associates, Inc., 2016.\n\n[34] J. Zhang and S. Zhang. A cubic regularized Newton\u2019s method over Riemannian manifolds. arXiv preprint arXiv:1805.05565, 2018.\n\n[35] Teng Zhang and Yi Yang. Robust PCA by manifold optimization. Journal of Machine Learning Research, 19:1\u201339, 2018.", "award": [], "sourceid": 3202, "authors": [{"given_name": "Christopher", "family_name": "Criscitiello", "institution": "None, formerly Princeton University"}, {"given_name": "Nicolas", "family_name": "Boumal", "institution": "Princeton University"}]}