{"title": "Shadowing Properties of Optimization Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 12692, "page_last": 12703, "abstract": "Ordinary differential equation (ODE) models of gradient-based optimization methods can provide insights into the dynamics of learning and inspire the design of new algorithms. Unfortunately, this thought-provoking perspective is weakened by the fact that, in the worst case, the error between the algorithm steps and its ODE approximation grows exponentially with the number of iterations. In an attempt to encourage the use of continuous-time methods in optimization, we show that, if some additional regularity on the objective is assumed, the ODE representations of Gradient Descent and Heavy-ball do not suffer from the aforementioned problem, once we allow for a small perturbation on the algorithm initial condition. In the dynamical systems literature, this phenomenon is called shadowing. Our analysis relies on the concept of hyperbolicity, as well as on tools from numerical analysis.", "full_text": "Shadowing Properties of Optimization Algorithms\n\nAntonio Orvieto\n\nDepartment of Computer Science\n\nETH Zurich, Switzerland \u21e4\n\nAurelien Lucchi\n\nDepartment of Computer Science\n\nETH Zurich, Switzerland\n\nAbstract\n\nOrdinary differential equation (ODE) models of gradient-based optimization meth-\nods can provide insights into the dynamics of learning and inspire the design of new\nalgorithms. Unfortunately, this thought-provoking perspective is weakened by the\nfact that \u2014 in the worst case \u2014 the error between the algorithm steps and its ODE\napproximation grows exponentially with the number of iterations. 
In an attempt to encourage the use of continuous-time methods in optimization, we show that, if some additional regularity on the objective is assumed, the ODE representations of Gradient Descent and Heavy-ball do not suffer from the aforementioned problem, once we allow for a small perturbation on the algorithm initial condition. In the dynamical systems literature, this phenomenon is called shadowing. Our analysis relies on the concept of hyperbolicity, as well as on tools from numerical analysis.\n\n1 Introduction\n\nWe consider the problem of minimizing a smooth function f : R^d → R. This is commonly solved using gradient-based algorithms due to their simplicity and provable convergence guarantees. Two of these approaches, both prevalent in machine learning, are: 1) Gradient Descent (GD), which computes a sequence (x_k)_{k=0}^∞ of approximations to the minimizer recursively:\n\nx_{k+1} = x_k − η∇f(x_k), (GD)\n\ngiven an initial point x_0 and a learning rate η > 0; and 2) Heavy-ball (HB) [43], which computes a different sequence of approximations (z_k)_{k=0}^∞ such that\n\nz_{k+1} = z_k + β(z_k − z_{k−1}) − η∇f(z_k), (HB)\n\ngiven an initial point z_0 = z_{−1} and a momentum β ∈ [0, 1). A method related to HB is Nesterov's accelerated gradient (NAG), for which ∇f is evaluated at a different point (see Eq. 2.2.22 in [38]). Analyzing the convergence properties of these algorithms can be complex, especially for NAG, whose convergence proof relies on algebraic tricks that reveal little detail about the acceleration phenomenon, i.e. the celebrated optimality of NAG in convex smooth optimization. An alternative approach is to view these methods as numerical integrators of some ordinary differential equations (ODEs). For instance, GD performs the explicit Euler method on ẏ = −∇f(y) and HB the semi-implicit Euler method on q̈ + αq̇ + ∇f(q) = 0 [43, 47]. 
This connection goes beyond the study of the dynamics of learning algorithms, and has recently been used to get a useful and thought-provoking viewpoint on the computations performed by residual neural networks [10, 12].\n\n*Correspondence to orvietoa@ethz.ch\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn optimization, the relation between discrete algorithms and ordinary differential equations is not new at all: the first contribution in this direction dates back to, at least, 1958 [18]. This connection has recently been revived by the work of [49, 28], where the authors show that the continuous-time limit of NAG is a second order differential equation with vanishing viscosity. This approach provides an interesting perspective on the somewhat mysterious acceleration phenomenon, connecting it to the theory of damped nonlinear oscillators and to Bessel functions. Based on the prior work of [52, 27] and follow-up works, it has become clear that the analysis of ODE models can provide simple intuitive proofs of (known) convergence rates and can also lead to the design of new discrete algorithms [55, 4, 53, 54]. Hence, one area of particular interest has been to study the relation between continuous-time models and their discrete analogs, specifically understanding the error resulting from the discretization of an ODE. The conceptual challenge is that, in the worst case, the approximation error of any numerical integrator grows exponentially (footnote 2) as a function of the integration interval [11, 21]; therefore, convergence rates derived for ODEs cannot be straightforwardly translated to the corresponding algorithms. In particular, obtaining convergence guarantees for the discrete case requires analyzing a discrete-time Lyapunov function, which often cannot be easily recovered from the one used for the continuous-time analysis [46, 47]. 
Alternatively, (sophisticated) numerical arguments can be used to get a rate through an approximation of such a Lyapunov function [55].\n\nIn this work, we follow a different approach and directly study conditions under which the flow of an ODE model is shadowed by (i.e. is uniformly close to; see footnote 3) the iterates of an optimization algorithm. The key difference with previous work, which makes our analysis possible, is that we allow the algorithm (i.e. the shadow) to start from a slightly perturbed initial condition compared to the ODE (see Fig. 2 for an illustration of this point). We rely on tools from numerical analysis [21] as well as concepts from dynamical systems [7], where solutions to ODEs and iterations of algorithms are viewed as the same object, namely maps in a topological space [7]. Specifically, our analysis builds on the theory of hyperbolic sets, which grew out of the works of Anosov [1] and Smale [48] in the 1960s and plays a fundamental role in several branches of dynamical systems, but has not yet been connected to optimization for machine learning.\n\nIn this work we pioneer the use of the theory of shadowing in optimization. In particular, we show that, if the objective is strongly-convex or if we are close to a hyperbolic saddle point, GD and HB are a shadow of (i.e. follow closely) the corresponding ODE models. We back up and complement our theory with experiments on machine learning problems.\n\nTo the best of our knowledge, our work is the first to focus on a (Lyapunov function independent) systematic and quantitative comparison of ODEs and algorithms for optimization. 
Also, we believe the tools we describe in this work can be used to advance our understanding of related machine learning problems, perhaps to better characterize the attractors of neural ordinary differential equations [10].\n\n2 Background\n\nThis section provides an overview of some fundamental concepts in the theory of dynamical systems, which we will use heavily in the rest of the paper.\n\n2.1 Differential equations and flows\n\nConsider the autonomous differential equation ẏ = g(y). Every y represents a point in R^n (a.k.a. phase space) and g : R^n → R^n is a vector field which, at any point, prescribes the velocity of the solution y that passes through that point. Formally, the curve y : R → R^n is a solution passing through y_0 at time 0 if ẏ(t) = g(y(t)) for t ∈ R and y(0) = y_0. We call this the solution to the initial value problem (IVP) associated with g (starting at y_0). The following results can be found in [42, 26].\n\nTheorem 1. Assume g is Lipschitz continuous and C^k. The IVP has a unique C^{k+1} solution in R.\n\nThis fundamental theorem tells us that, if we integrate for t time units from position y_0, the final position y(t) is uniquely determined. Therefore, we can define the family {φ^g_t}_{t∈R} of maps, called the flow of g, such that φ^g_t(y_0) is the solution at time t of the IVP. Intuitively, we can think of φ^g_t(y_0) as determining the location of a particle starting at y_0 and moving via the velocity field g for t seconds. Since the vector field is static, we can move along the solution (in both directions) by iteratively applying this map (or its inverse). 
This is formalized below.\n\n[Figure: illustration of the flow, with points labelled y_0, φ^g_{t1}(y_0) and φ^g_{t1+t2}(y_0).]\n\nFootnote 2: The error between the numerical approximation and the actual trajectory with the same initial conditions is, for a p-th order method, e^{Ct}h^p at time t with C > 0.\n\nFootnote 3: Formally defined in Sec. 2.\n\nProposition 1. Assume g is Lipschitz continuous and C^k. For any t ∈ R, φ^g_t ∈ C^{k+1} and, for any t1, t2 ∈ R, φ^g_{t1+t2} = φ^g_{t1} ∘ φ^g_{t2}. In particular, φ^g_t is a diffeomorphism (footnote 4) with inverse φ^g_{−t}.\n\nIn a way, the flow allows us to exactly discretize the trajectory of an ODE. Indeed, let us fix a stepsize h > 0; the associated time-h map φ^g_h integrates the ODE ẏ = g(y) for h seconds starting from some y_0, and we can apply this map recursively to compute a sampled ODE solution. In this paper, we study how flows can approximate the iterations of some optimization algorithm using the language of dynamical systems and the concept of shadowing.\n\n2.2 Dynamical systems and shadowing\n\nA dynamical system on R^n is a map Ψ : R^n → R^n. For p ∈ N, we define Ψ^p = Ψ ∘ ··· ∘ Ψ (p times). If Ψ is invertible, then Ψ^{−p} = Ψ^{−1} ∘ ··· ∘ Ψ^{−1} (p times). Since Ψ^{p+m} = Ψ^p ∘ Ψ^m, these iterates form a group if Ψ is invertible, and a semigroup otherwise. We proceed with more definitions.\n\nDefinition 1. A sequence (x_k)_{k=0}^∞ is a (positive) orbit of Ψ if, for all k ∈ N, x_{k+1} = Ψ(x_k).\n\nFor the rest of this subsection, the reader may think of Ψ as an optimization algorithm (such as GD, which maps x to x − η∇f(x)) and of the orbit (x_k)_{k=0}^∞ as its iterates. Also, the reader may think of (y_k)_{k=0}^∞ as the sequence of points derived from the iterative application of φ^g_h, which is itself a dynamical system, from some y_0. The latter sequence represents our ODE approximation of the algorithm Ψ. 
Our goal in this subsection is to understand when a sequence (y_k)_{k=0}^∞ is \"close to\" an orbit of Ψ. The first notion of such similarity is local.\n\nDefinition 2. The sequence (y_k)_{k=0}^∞ is a δ-pseudo-orbit of Ψ if, for all k ∈ N, ‖y_{k+1} − Ψ(y_k)‖ ≤ δ.\n\nIf (y_k)_{k=0}^∞ is locally similar to an orbit of Ψ (i.e. it is a pseudo-orbit of Ψ), then we may hope that such similarity extends globally. This is captured by the concept of shadowing.\n\nDefinition 3. A pseudo-orbit (y_k)_{k=0}^∞ of Ψ is ε-shadowed if there exists an orbit (x_k)_{k=0}^∞ of Ψ such that, for all k ∈ N, ‖x_k − y_k‖ ≤ ε.\n\n[Figure: pseudo-orbit points y_0, ..., y_5 together with orbit points x_0, ..., x_5 and the radius ε.]\n\nIt is crucial to notice that, as depicted in the figure above, we allow the shadowing orbit (a.k.a. the shadow) to start at a perturbed point x_0 ≠ y_0. A natural question is the following: which properties must Ψ have such that a general pseudo-orbit is shadowed? A lot of research has been carried out in the last decades on this topic (see e.g. [41, 31] for a comprehensive survey).\n\nShadowing for contractive/expanding maps. A straightforward sufficient condition is related to contraction. Ψ is said to be uniformly contracting if there exists ρ < 1 (contraction factor) such that for all x_1, x_2 ∈ R^n, ‖Ψ(x_1) − Ψ(x_2)‖ ≤ ρ‖x_1 − x_2‖. The next result can be found in [23].\n\nTheorem 2. (Contraction map shadowing theorem) Assume Ψ is uniformly contracting. For every ε > 0 there exists δ > 0 such that every δ-pseudo-orbit (y_k)_{k=0}^∞ of Ψ is ε-shadowed by the orbit (x_k)_{k=0}^∞ of Ψ starting at x_0 = y_0, that is x_k := Ψ^k(x_0). Moreover, δ ≤ (1 − ρ)ε.\n\nThe idea behind this result is simple: since all distances are contracted, errors that are made along the pseudo-orbit vanish asymptotically. 
For instructive purposes, we report the full proof.\n\nProof: We proceed by induction: the proposition is trivially true at k = 0, since ‖x_0 − y_0‖ ≤ ε; next, we assume the proposition holds at k ∈ N and we show validity for k + 1. We have\n\n‖x_{k+1} − y_{k+1}‖ ≤ ‖Ψ(x_k) − Ψ(y_k)‖ + ‖Ψ(y_k) − y_{k+1}‖ (subadditivity)\n≤ ‖Ψ(x_k) − Ψ(y_k)‖ + δ (δ-pseudo-orbit)\n≤ ρ‖x_k − y_k‖ + δ (contraction)\n≤ ρε + δ (induction).\n\nFinally, since δ ≤ ε(1 − ρ), ρε + δ ≤ ε. □\n\nFootnote 4: A C^1 map with well-defined and C^1 inverse.\n\nNext, assume Ψ is invertible. If Ψ is uniformly expanding (i.e. ρ > 1), then Ψ^{−1} is uniformly contracting with contraction factor 1/ρ and we can apply the same reasoning backwards: consider the pseudo-orbit {y_0, y_1, ···, y_K} up to iteration K, and set x_K = y_K (before, we had x_0 = y_0); then, apply the same reasoning as above using the map Ψ^{−1} and find the initial condition x_0 = Ψ^{−K}(y_K). In App. B.2 we show that this reasoning holds in the limit, i.e. that Ψ^{−K}(y_K) converges to a suitable x_0 to construct a shadowing orbit under δ ≤ (1 − 1/ρ)ε (precise statement in Thm. B.3).\n\nShadowing in hyperbolic sets. In general, for machine learning problems, the algorithm map Ψ can be a combination of the two cases above: an example is the pathological contraction-expansion behavior of Gradient Descent around a saddle point [16]. To illustrate the shadowing properties of such systems, we shall start by taking Ψ to be linear (footnote 5), that is Ψ(x) = Ax for some A ∈ R^{n×n}. Ψ is called linear hyperbolic if its spectrum σ(A) does not intersect the unit circle S^1. If this is the case, we call σ_s(A) the part of σ(A) inside S^1 and σ_u(A) the part of σ(A) outside S^1. The spectral decomposition theorem (see e.g. 
Corollary 4.56 in [24]) ensures that R^n decomposes into two Ψ-invariant closed subspaces (a.k.a. the stable and unstable manifolds) in direct sum: R^n = E_s ⊕ E_u. We call Ψ_s and Ψ_u the restrictions of Ψ to these subspaces and A_s and A_u the corresponding matrix representations; the theorem also ensures that σ(A_s) = σ_s(A) and σ(A_u) = σ_u(A). Moreover, using standard results in spectral theory (see e.g. App. 4 in [24]) we can equip E_s and E_u with norms equivalent to the standard Euclidean norm ‖·‖ such that, w.r.t. the new norms, Ψ_u is uniformly expanding and Ψ_s uniformly contracting. If A is symmetric, then it is diagonalizable and the norms above can be taken to be Euclidean. To wrap up this paragraph: we can think of a linear hyperbolic system as a map that allows us to decouple its stable and unstable components consistently across R^n. Therefore, a shadowing result directly follows from a combination of Thm. 2 and B.3 [23].\n\nAn important further question is whether the result above for linear maps can be generalized. From the classic theory of nonlinear systems [26, 2], we know that in a neighborhood of a hyperbolic point p for Ψ, i.e. a point where DΨ(p) has no eigenvalues on S^1, Ψ behaves like (footnote 6) a linear system. Similarly, in the analysis of optimization algorithms, it is often used that, near a saddle point, a Hessian-smooth function can be approximated by a quadratic [35, 15, 19]. Hence, it should not be surprising that pseudo-orbits of Ψ are shadowed in a neighborhood of a hyperbolic point, if such set is Ψ-invariant [31]: this happens for instance if p is a stable equilibrium point (see definition in [26]).\n\nStarting from the reasoning above, the celebrated shadowing theorem, which has its foundations in the work of Anosov [1] and was originally proved in [6], provides the strongest known result of this line of research: near a hyperbolic set of Ψ, pseudo-orbits are shadowed. 
Informally, Λ ⊂ R^n is called hyperbolic if it is Ψ-invariant and has clearly marked local directions of expansion and contraction which make Ψ behave similarly to a linear hyperbolic system. We provide the precise definition of hyperbolic set and the statement of the shadowing theorem in App. B.1.\n\nUnfortunately, despite the ubiquity of hyperbolic sets, it is practically infeasible to establish this property analytically [3]. Therefore, an important part of the literature [14, 44, 11, 51] is concerned with the numerical verification of the existence of a shadowing orbit a posteriori, i.e. given a particular pseudo-orbit.\n\n3 The Gradient Descent ODE\n\nWe assume some regularity on the objective function f : R^d → R which we seek to minimize.\n\n(H1) f ∈ C^2(R^d, R) is coercive (footnote 7), bounded from below and L-smooth (∀a ∈ R^d, ‖∇²f(a)‖ ≤ L).\n\nIn this section, we study the well-known continuous-time model for GD: ẏ = −∇f(y) (GD-ODE). Under (H1), Thm. 1 ensures that the solution to GD-ODE exists and is unique. We denote by φ^GD_h the corresponding time-h map. We show that, under some additional assumptions, the orbit of φ^GD_h (which we indicate as (y_k)_{k=0}^∞) is shadowed by an orbit of the GD map with learning rate h: Ψ^GD_h.\n\nFootnote 5: This case is restrictive, yet it includes the important cases of the dynamics of GD and HB when f is a quadratic (which is the case in, for instance, least squares regression).\n\nFootnote 6: For a precise description of this similarity, we invite the reader to read on topological conjugacy in [2].\n\nFootnote 7: f(x) → ∞ as ‖x‖ → ∞.\n\nRemark 1 (Bound on the ODE gradients). Under (H1) let G_y = {p : p = ‖∇f(y(t))‖, t ≥ 0} be the set of gradient magnitudes experienced along the GD-ODE solution starting at any y_0. It is easy to prove, using an argument similar to Prop. 2.2 in [8], that coercivity implies sup G_y < ∞. A similar argument holds for the iterates of Gradient Descent. Hence, for the rest of this section it is safe to assume that gradients are bounded: ‖∇f(y(t))‖ ≤ ℓ for all t ≥ 0. For instance, if f is a quadratic centered at x*, then we have ℓ = L‖y_0 − x*‖.\n\nThe next result follows from the fact that GD implements the explicit Euler method on GD-ODE.\n\nProposition 2. Assume (H1). (y_k)_{k=0}^∞ is a δ-pseudo-orbit of Ψ^GD_h with δ = (ℓL/2)h²: ∀k ∈ N, ‖y_{k+1} − Ψ^GD_h(y_k)‖ ≤ δ.\n\nProof. Thanks to Thm. 1, since the solution y of GD-ODE is a C^2 curve, we can write y(kh + h) = φ^GD_h(y(kh)) = y_{k+1} using Taylor's formula with Lagrange's Remainder in Banach spaces (see e.g. Thm 5.2. in [13]) around time t = kh. Namely: y(kh + h) = y(kh) + ẏ(kh)h + R(2, h), where R(2, ·) : R_{>0} → R^d is the approximation error as a function of h, which can be bounded as follows:\n\n‖R(2, h)‖ ≤ (h²/2) sup_{0≤θ≤1} ‖ÿ(t + θh)‖ = (h²/2) sup_{0≤θ≤1} ‖∇²f(y(t + θh))∇f(y(t + θh))‖ ≤ (h²/2) sup_{0≤θ≤1} ‖∇²f(y(t + θh))‖ · sup_{0≤θ≤1} ‖∇f(y(t + θh))‖ ≤ (h²/2)Lℓ.\n\nHence, since y(kh) + ẏ(kh)h = Ψ^GD_h(y_k), we have ‖y_{k+1} − Ψ^GD_h(y_k)‖ ≤ (ℓL/2)h². □\n\n3.1 Shadowing under strong convexity\n\nAs seen in Sec. 2.2, the last proposition provides the first step towards a shadowing result. We also discussed that if, in addition, Ψ^GD_h is a contraction, we directly have shadowing (Thm. 2). Therefore, we start with the following assumption that will be shown to imply contraction.\n\n(H2) f is µ-strongly-convex: for all a ∈ R^d, ‖∇²f(a)‖ ≥ µ.\n\nThe next result follows from standard techniques in convex optimization (see e.g. [25]).\n\nProposition 3. Assume (H1), (H2). If 0 < h ≤ 1/L, Ψ^GD_h is uniformly contracting with ρ = 1 − hµ.\n\nWe provide the proof in App. D.1 and sketch the idea using a quadratic form: let f(x) = ½⟨x, Hx⟩ with H symmetric s.t. µI ⪯ H ⪯ LI; if h ≤ 1/L then (1 − Lh) ≤ ‖I − hH‖ ≤ (1 − µh). Prop. 3 follows directly: ‖Ψ^GD_h(x) − Ψ^GD_h(y)‖ = ‖(I − hH)(x − y)‖ ≤ ‖I − hH‖‖x − y‖ ≤ ρ‖x − y‖.\n\nThe shadowing result for strongly-convex functions is then a simple application of Thm. 2.\n\nTheorem 3. Assume (H1), (H2) and let ε be the desired accuracy. Fix 0 < h ≤ min{2µε/(Lℓ), 1/L}; the orbit (y_k)_{k=0}^∞ of φ^GD_h is ε-shadowed by any orbit (x_k)_{k=0}^∞ of Ψ^GD_h with x_0 such that ‖x_0 − y_0‖ ≤ ε.\n\nProof. From Thm. 2, we need (y_k)_{k=0}^∞ to be a δ-pseudo-orbit of Ψ^GD_h with δ ≤ (1 − ρ)ε. From Prop. 2 we know δ = (ℓL/2)h², while from Prop. 3 we have ρ ≤ (1 − hµ). Putting it all together, we get (ℓL/2)h² ≤ hµε, which holds if and only if h ≤ 2µε/(Lℓ). □\n\nNotice that we can formulate the theorem in a dual way: namely, for every learning rate we can bound the ODE approximation error (i.e. find the shadowing radius).\n\nCorollary 1. Assume (H1), (H2). If 0 < h ≤ 1/L, (y_k)_{k=0}^∞ is ε-shadowed by any orbit (x_k)_{k=0}^∞ of Ψ^GD_h starting at x_0 with ‖x_0 − y_0‖ ≤ ε, with ε = hℓL/(2µ).\n\nThis result ensures that if the objective is smooth and strongly-convex, then GD-ODE is a theoretically sound approximation of GD. We validate this in Fig. 1 by integrating GD-ODE analytically.\n\nFigure 1: Orbit of Ψ^GD_h, φ^GD_h (sampled ODE sol.) on strongly-convex quadratic. h = 0.2.\n\nFigure 2: A few iterations of the maps Ψ^GD_h and φ^GD_h with different initializations on a quadratic saddle. (a) The orbit of GD is not a shadow: error blows up. (b) The orbit of GD is a shadow: error is bounded. GD-ODE was solved analytically and h = 0.2. On the right plot, the coordinates of x_0 are x_{0,1} = Ψ^{−7}(y_{7,1}) (expanding direction, need to reverse time) and x_{0,2} = y_{0,2} (contracting direction).\n\nSharpness of the result. First, we note that the bound for δ in Prop. 
2 cannot be improved; indeed it coincides with the well-known local truncation error of Euler's method [22]. Next, pick f(x) = x²/2, x_0 = 1 and h = 1/L = 1. For k ∈ N, gradients are smaller than 1 for both GD-ODE and GD, hence ℓ = L = µ = 1. Our formula for the global shadowing radius gives ε = hLℓ/(2µ) = 0.5, equal to the local error δ = ℓLh²/2, i.e. as tight as the well-established local result. In fact, GD jumps to 0 in one iteration, while y(t) = e^{−t}; hence y(1) − x_1 = 1/e ≈ 0.37 < 0.5. For smaller steps, like h = 0.1, our formula predicts ε = 0.05 = 10δ. In simulation, the maximum deviation occurs at k = 10 and is around 0.02 = 4δ, only 2.5 times smaller than our prediction.\n\nThe convex case. If f is convex but not strongly-convex, GD is non-expanding and the error between x_k and y_k cannot be bounded by a constant (footnote 8) but grows slowly: in App. C.1 we show that the error ε_k is upper bounded by δk = ℓLkh²/2 = O(kh²).\n\nExtension to stochastic gradients. We extend Thm. 3 to account for perturbations: let Ψ^SGD_h(x) = x − h∇̃f(x), where ∇̃f(x) is a stochastic gradient s.t. ‖∇̃f(x) − ∇f(x)‖ ≤ R. Then, for ε big enough, we can (proof in App. C.2) choose h ≤ 2(µε − R)/(ℓL) so that the orbit of φ^GD_h (deterministic) is shadowed by the stochastic orbit of Ψ^SGD_h starting from x_0 = y_0. So, if h is small enough, GD-ODE can be used to study SGD. This result is well known from stochastic approximation [30].\n\n3.2 Towards non-convexity: behaviour around local maxima and saddles\n\nWe first study the strong-concavity case and then combine it with strong-convexity to assess the shadowing properties of GD around a saddle.\n\nStrong-concavity. In this case, it follows from the same argument of Prop. 3 that GD is uniformly expanding with ρ = 1 + γh > 1, with −γ := max(σ(H)) < 0. As mentioned in the background section (see Thm. 
B.3 for the precise statement) this case is conceptually identical to strong-convexity once we reverse the arrow of time (so that expansions become contractions). We are allowed to make this step because, under (H1) and if h ≤ 1/L, Ψ^GD_h is a diffeomorphism (see e.g. [34], Prop. 4.5). In particular, the backwards GD map (Ψ^GD_h)^{−1} is contracting with factor 1/ρ. Consider now the initial part of an orbit of GD-ODE such that the gradient norms are still bounded by ℓ and let y_K = (φ^GD_h)^K(y_0) be the last point of such orbit. It is easy to realize that (y_k)_{k=0}^K is a pseudo-orbit, with reversed arrow of time, of (Ψ^GD_h)^{−1}. Hence, Thm. 3 directly ensures shadowing choosing x_K = y_K and x_k = (Ψ^GD_h)^{k−K}(y_K). Crucially, the initial condition of the shadow (x_k)_{k=0}^K we found is slightly (footnote 9) perturbed: x_0 = (Ψ^GD_h)^{−K}(y_K) ≈ y_0. Notice that, if we instead start GD from exactly x_0 = y_0, the iterates will diverge from the ODE trajectory, since every error made along the pseudo-orbit is amplified. We show this for the unstable direction of a saddle in Fig. 2a.\n\nFootnote 8: This is in line with the requirement of hyperbolicity in the shadowing theory: a convex function might have zero curvature, hence the corresponding gradient system is going to have an eigenvalue on the unit circle.\n\nFootnote 9: Indeed, Thm. 3 applied backward in time from y_K ensures that ‖(Ψ^GD_h)^{−K}(y_K) − y_0‖ ≤ ε.\n\nQuadratic saddles. As discussed in Sec. 2, if the space can be split into stable (contracting) and unstable (expanding) invariant subspaces (R^d = E_s ⊕ E_u), then every pseudo-orbit is shadowed. This is a particular case of the shadowing theorem for hyperbolic sets [6]. In particular, we saw that if Ψ^GD_h is linear hyperbolic such splitting exists and E_s and E_u are the subspaces spanned by the eigenvectors associated with the stable and unstable eigenvalues, respectively. 
It is easy to realize that Ψ^GD_h is linear if the objective is a quadratic; indeed f(x) = ½⟨x, Hx⟩ is such that Ψ^GD_h(x) = (I − hH)x. It is essential to note that hyperbolicity requires H to have no eigenvalue at 0, i.e. that the saddle has only directions of strictly positive or strictly negative curvature. This splitting allows us to study shadowing on E_s and E_u separately: for E_s we can use the shadowing result for strong-convexity and for E_u the shadowing result for strong-concavity, along with the computation of the initial condition for the shadow in these subspaces. We summarize this result in the next proposition, which we prove formally in App. C.3. To enhance understanding, we illustrate the procedure of construction of a shadow in Fig. 2.\n\nProposition 4. Let f : R^d → R be quadratic centered at x* with Hessian H with no eigenvalues in the interval (−γ, µ), for some µ, γ > 0. Assume the orbit (y_k)_{k=0}^∞ of φ^GD_h is s.t. (H1) holds up to iteration K. Let ε be the desired tracking accuracy; if 0 < h ≤ min{µε/(Lℓ), γε/(2Lℓ), 1/L}, then (y_k)_{k=0}^∞ is ε-shadowed by an orbit (x_k)_{k=0}^∞ of Ψ^GD_h up to iteration K.\n\nGeneral saddles. In App. C.4 we take inspiration from the literature on the approximation of stiff ODEs near stationary points [36, 5, 32] and use the Banach fixed-point theorem to generalize the result above to perturbed quadratic saddles f + ϕ, where ϕ is required to be L_ϕ-smooth with L_ϕ ≤ O(min{γ, µ}). This condition is intuitive, since ϕ effectively counteracts the contraction/expansion.\n\nTheorem 4. Let f : R^d → R be a quadratic centered at x* with Hessian H with no eigenvalues in the interval (−γ, µ), for some µ, γ > 0. Let g : R^d → R be our objective function, of the form g(x) = f(x) + ϕ(x) with ϕ : R^d → R an L_ϕ-smooth perturbation such that ∇ϕ(x*) = 0. Assume the orbit (y_k)_{k=0}^∞ of φ^GD_h on g is s.t. (H1) (stated for g) holds, with gradients bounded by ℓ up to iteration K. Assume 0 < h ≤ 1/L and let ε be the desired tracking accuracy; if also\n\nh ≤ ε(min{γ/2, µ} − 4L_ϕ)/(2ℓL),\n\nthen (y_k)_{k=0}^∞ is ε-shadowed by an orbit (x_k)_{k=0}^∞ of Ψ^GD_h on g up to iteration K.\n\nWe note that, in the special case of strongly-convex quadratics, the theorem above recovers the shadowing condition of Cor. 1 up to a factor 1/2 which is due to the different proof techniques.\n\nGluing landscapes. The last result can be combined with Thm. 3 to capture the dynamics of GD-ODE where directions of negative curvature are encountered during the early stage of training followed by a strongly-convex region as we approach a local minimum (such as the one in Fig. 3). Note that, since under (H1) the objective is C^2, there will be a few iterations in the \"transition phase\" (non-convex to convex) where the curvature is very close to zero. These few iterations are not captured by Thm. 3 and Thm. 4; indeed, the error behaviour in Fig. 3 is pathological at k ≈ 10. Nonetheless, as we showed for the convex case in Sec. 3.1, the approximation error during these iterations only grows as O(kh²). In the numerical analysis literature, the procedure we just sketched was made precise in [11], who proved that a gluing argument is successful if the number of unstable directions on the ODE path is non-increasing.\n\nFigure 3: Dynamics on the Hosaki function, h = 0.3 and lightly perturbed initial condition in the unstable subspace. ODE numerical simulation with Runge-Kutta 4 method [21].\n\n4 The Heavy-ball ODE\n\nWe now turn our attention to analyzing Heavy-ball, whose continuous representation is q̈ + αq̇ + ∇f(q) = 0, where α is a positive number called the viscosity parameter. Following [37], we introduce the velocity variable p = q̇ and consider the dynamics of y = (q, p) (i.e. 
in phase space).\n\nṗ = −αp − ∇f(q),\nq̇ = p. (HB-ODE)\n\nFigure 5: Shadowing results under the sigmoid loss in MNIST (2 digits). We show 5 runs for the ODE and for the algorithm, with the same (random) initialization. ODEs are simulated with 4th-order RK: our implementation uses 4 back-propagations and an integrator step of 0.1. When trying higher precisions, results do not change. Shown are also the strictly decreasing (since we use full gradients) losses for each run of the algorithms. The losses of the discretized ODEs are indistinguishable (because of shadowing) and are therefore not shown. We invite the reader to compare the results (in particular, for high λ) to the ones obtained in synthetic examples in Fig. 1 and 4.\n\nUnder (H1), we denote by φ^HB_{α,h} : R^{2d} → R^{2d} the corresponding joint time-h map and by (y_k)_{k=0}^∞ = ((p_k, q_k))_{k=0}^∞ its orbit (i.e. the sampled HB-ODE trajectory). First, we show that semi-implicit (footnote 10) integration of Eq. (HB-ODE) yields HB. Given a point x_k = (v_k, z_k) in phase space, this integrator computes (v_{k+1}, z_{k+1}) ≈ φ^HB_{α,h}(x_k) as\n\nv_{k+1} = v_k + h(−αv_k − ∇f(z_k)),\nz_{k+1} = z_k + hv_{k+1}. (HB-PS)\n\nNotice that v_{k+1} = (z_{k+1} − z_k)/h and z_{k+1} = z_k + (1 − αh)(z_k − z_{k−1}) − h²∇f(z_k), which is exactly one iteration of HB, with β = 1 − hα and η = h². We therefore have established a numerical link between HB and HB-ODE, similar to the one presented in [47]. In the following, we use Ψ^HB_{α,h} to denote the one step map in phase space defined by HB-PS.\n\nSimilarly to Remark 1, by Prop. 2.2 in [8], (H1) implies that gradients are bounded by a constant ℓ. Hence, we can get an analogue to Prop. 2 (see App. D.2 for the proof).\n\nProposition 5. Assume (H1) and let y_0 = (0, z_0). Then, (y_k)_{k=0}^∞ is a δ-pseudo-orbit of Ψ^HB_{α,h} with δ = ℓ(α + 1 + L)h².\n\nStrong-convexity. 
The next step, as done for GD, would be to consider strongly-convex landscapes and derive a formula for the shadowing radius (see Thm. 3). However, it is easy to realize that, in this setting, HB is not uniformly contracting. Indeed, it notoriously is not a descent method. Hence, it is unfortunately difficult to state an equivalent of Thm. 3 using similar arguments. We believe that the reason behind this difficulty lies at the very core of the acceleration phenomenon. Indeed, as noted by [20], the current known bounds for HB in the strongly-convex setting might be loose due to the tediousness of its analysis [46], which is also reflected here. Hence, we leave this theoretical investigation (as well as the connection to acceleration and symplectic integration [47]) to future research, and show instead experimental results in Sec. 5 and Fig. 4.\n\nQuadratics. In App. D.3 we show that, if f is quadratic, then Ψ^HB_{α,h} is linear hyperbolic. Hence, as discussed in Sec. 2.2 and thanks to Prop. 5, there exists a norm (footnote 11) under which we have shadowing, and we can recover a result analogous to Prop. 4 and to its perturbed variant (App. C.4). We show this empirically in Fig. 4 and compare with the GD formula for the shadowing radius.\n\nFigure 4: Orbit of the space variable in Ψ^HB_h, φ^HB_h (sampled ODE solution) on a strongly-convex quadratic with h = 0.2 and α = 1. Solution to HB-ODE was computed analytically.\n\nFootnote 10: Note that this integrator, when applied to a Hamiltonian system, is symplectic (see e.g. definition in [21]).\n\nFootnote 11: For GD, this was the Euclidean norm. For HB the norm we have to pick is different, since (differently from GD) φ^HB_{α,h}(x) = Ax with A non-symmetric. The interested reader can find more information in App. 4 of [24].\n\n5 Experiments on empirical risk minimization\n\nWe consider the problem of binary classification of digits 3 and 5 from the MNIST data-set [33]. We take n = 10000 training examples {(a_i, l_i)}_{i=1}^n, where a_i ∈ R^d is the i-th image (in R^{785} adding a bias) and l_i ∈ {−1, 1} is the corresponding label. We use the regularized sigmoid loss (non-convex) f(x) = (λ/2)‖x‖² + (1/n)Σ_{i=1}^n σ(⟨a_i, x⟩l_i), σ(t) = 1/(1 + e^t). Compared to the cross-entropy loss (convex), this choice of f often leads to better generalization [45]. For 2 different choices of λ, using the full gradient, we simulate GD-ODE using fourth-order Runge-Kutta [21] (high-accuracy integration) and run GD with learning rate h = 1, which yields a steady decrease in the loss. We simulate HB-ODE and run HB under the same conditions, using α = 0.3 (to induce a significant momentum). In Fig. 5, we show the behaviour of the approximation error, measured in percentage w.r.t. the discretized ODE trajectory, until convergence (with accuracy around 95%). We make a few comments on the results.\n\n1. Heavy regularization (in green) increases the contractiveness of GD around the solution, yielding a small approximation error (it converges to zero) after a few iterations, exactly as in Fig. 1. For a small λ (in magenta), the error between the trajectories is bounded but is slower to converge to zero, since local errors tend not to be corrected (cf. discussion for convex objectives in Sec. 3.1).\n\n2. Locally, as we saw in Prop. 2, large gradients make the algorithm deviate significantly from the ODE. Since regularization increases the norm of the gradients experienced in early training, a larger λ will cause the approximation error to grow rapidly at the first iterations (when gradients are large). Indeed, Cor. 1 predicts that the shadowing radius is proportional (footnote 12) to ℓ.\n\n3. Since HB has momentum, we notice that indeed it converges faster than GD [50]. 
As expected (see point 2), this has a bad effect on the global shadowing radius, which is 5 times bigger. On the other hand, the error from HB-ODE is also much quicker to decay to zero when compared to GD.\n\nLastly, in Fig. 6 we explore the effect of the learning rate $h$ on the shadowing radius $\\epsilon$ and find a good match with the prediction of Cor. 1. Indeed, the experiment confirms that this relation is linear: $\\epsilon = O(h)$, with no dependency on the number of iterations (as opposed to the classical results discussed in the introduction).\nAll in all, we conclude from the experiments that the intuition developed from our analysis can potentially explain the behaviour of the GD-ODE and the HB-ODE approximation in simple machine learning problems.\n\nFigure 6: Approx. error under the same setting as Fig. 5 for $\\lambda = 0.005$. The experiment validates the linear dependency on $h$ in Cor. 1.\n\n6 Conclusion\nIn this work we used the theory of shadowing to motivate the success of continuous-time models in optimization. In particular, we showed that, if the cost $f$ is strongly-convex or hyperbolic, then any GD-ODE trajectory is shadowed by a trajectory of GD, with a slightly perturbed initial condition. Weaker but similar results hold for HB. To the best of our knowledge, this work is the first to provide this type of quantitative link between ODEs and algorithms in optimization. Moreover, our work leaves open many directions for future research, including the derivation of a formula for the shadowing radius of Heavy-ball (which will likely give insights on acceleration), the extension to other algorithms (e.g. Newton\u2019s method) and to stochastic settings. Actually, a partial answer to the last question was provided in recent months for the strongly-convex case in [17]. In that work, the authors use backward error analysis to study how close SGD is to its approximation using a high-order (involving the Hessian) stochastic modified equation. 
It would be interesting to derive a similar result for a stochastic variant of GD-ODE, such as the one studied in [40].\n\nAcknowledgements. We are grateful for the enlightening discussions on numerical methods and shadowing with Prof. Lubich (in T\u00fcbingen), Prof. Hofmann (in Z\u00fcrich) and Prof. Benettin (in Padua). Also, we would like to thank Gary B\u00e9cigneul for his help in completing the proof of hyperbolicity of Heavy-ball and Foivos Alimisis for pointing out a mistake in the initial draft.\n\n$^{12}$ Alternatively, looking at the formula $\\epsilon = \\frac{hL\\ell}{2\\mu}$ in Cor. 1 and noting $\\ell \\le L\\|x_0 - x^*\\|$, we get $\\epsilon = O(L^2/\\mu)$. Hence, regularization, which increases $L$ and decreases $\\mu$ by the same amount, actually increases $\\epsilon$.\n\nReferences\n[1] Dmitry Victorovich Anosov. Geodesic flows on closed Riemannian manifolds of negative curvature. Trudy Matematicheskogo Instituta Imeni VA Steklova, 90:3\u2013210, 1967.\n\n[2] Vitor Ara\u00fajo and Marcelo Viana. Hyperbolic dynamical systems. Springer, 2009.\n\n[3] Bernd Aulbach and Fritz Colonius. Six lectures on dynamical systems. World Scientific, 1996.\n\n[4] Michael Betancourt, Michael I Jordan, and Ashia C Wilson. On symplectic optimization. arXiv preprint arXiv:1802.03653, 2018.\n\n[5] W-J Beyn. On the numerical approximation of phase portraits near stationary points. SIAM Journal on Numerical Analysis, 24(5):1095\u20131113, 1987.\n\n[6] Rufus Bowen. $\\omega$-limit sets for Axiom A diffeomorphisms. Journal of Differential Equations, 18(2):333\u2013339, 1975.\n\n[7] Michael Brin and Garrett Stuck. Introduction to dynamical systems. Cambridge University Press, 2002.\n\n[8] Alexandre Cabot, Hans Engler, and S\u00e9bastien Gadat. On the long time behavior of second order differential equations with asymptotically small dissipation. 
Transactions of the American Mathematical Society, 361(11):5983\u20136017, 2009.\n\n[9] Jos\u00e9 A Carrillo, Robert J McCann, and C\u00e9dric Villani. Contractions in the 2-Wasserstein length space and thermalization of granular media. Archive for Rational Mechanics and Analysis, 179(2):217\u2013263, 2006.\n\n[10] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6572\u20136583. Curran Associates, Inc., 2018.\n\n[11] Shui-Nee Chow and Erik S Van Vleck. A shadowing lemma approach to global error analysis for initial value ODEs. SIAM Journal on Scientific Computing, 15(4):959\u2013976, 1994.\n\n[12] Marco Ciccone, Marco Gallieri, Jonathan Masci, Christian Osendorfer, and Faustino Gomez. NAIS-Net: Stable deep networks from non-autonomous differential equations. arXiv preprint arXiv:1804.07209, 2018.\n\n[13] Rodney Coleman. Calculus on normed vector spaces. Springer Science & Business Media, 2012.\n\n[14] Brian A Coomes, Kenneth J Palmer, et al. Rigorous computational shadowing of orbits of ordinary differential equations. Numerische Mathematik, 69(4):401\u2013421, 1995.\n\n[15] Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999, 2018.\n\n[16] Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067\u20131077, 2017.\n\n[17] Yuanyuan Feng, Tingran Gao, Lei Li, Jian-Guo Liu, and Yulong Lu. Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation. arXiv preprint arXiv:1902.00635, 2019.\n\n[18] Mark Konstantinovich Gavurin. 
Nonlinear functional equations and continuous analogues of iteration methods. Izvestiya Vysshikh Uchebnykh Zavedenii. Matematika, pages 18\u201331, 1958.\n\n[19] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points\u2014online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797\u2013842, 2015.\n\n[20] Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. In Control Conference (ECC), 2015 European, pages 310\u2013315. IEEE, 2015.\n\n[21] Ernst Hairer, Christian Lubich, and Gerhard Wanner. Geometric numerical integration illustrated by the St\u00f6rmer\u2013Verlet method. Acta Numerica, 12:399\u2013450, 2003.\n\n[22] Ernst Hairer and Gerhard Wanner. Solving ordinary differential equations II: Stiff and differential-algebraic problems, second revised edition with 137 figures. Springer Series in Computational Mathematics, 14, 1996.\n\n[23] Wayne Hayes and Kenneth R Jackson. A survey of shadowing methods for numerical solutions of ordinary differential equations. Applied Numerical Mathematics, 53(2-4):299\u2013321, 2005.\n\n[24] Michael Charles Irwin. Smooth dynamical systems, volume 17. World Scientific, 2001.\n\n[25] Alexander Jung. A fixed-point of view on gradient methods for big data. Frontiers in Applied Mathematics and Statistics, 3:18, 2017.\n\n[26] Hassan K Khalil and Jessy W Grizzle. Nonlinear systems, volume 3. Prentice Hall, Upper Saddle River, NJ, 2002.\n\n[27] Walid Krichene and Peter L Bartlett. Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems, pages 6796\u20136806, 2017.\n\n[28] Walid Krichene, Alexandre Bayen, and Peter L Bartlett. Accelerated mirror descent in continuous and discrete time. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. 
Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2845\u20132853. Curran Associates, Inc., 2015.\n\n[29] Anastasiya Kulakova, Marina Danilova, and Boris Polyak. Non-monotone behavior of the heavy ball method. arXiv preprint arXiv:1811.00658, 2018.\n\n[30] Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.\n\n[31] Oscar E Lanford. Introduction to hyperbolic sets. In Regular and chaotic motions in dynamic systems, pages 73\u2013102. Springer, 1985.\n\n[32] Stig Larsson and J-M Sanz-Serna. The behavior of finite element solutions of semilinear parabolic problems near stationary points. SIAM Journal on Numerical Analysis, 31(4):1000\u20131018, 1994.\n\n[33] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.\n\n[34] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016.\n\n[35] Kfir Y Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.\n\n[36] Ch Lubich, Kaspar Nipp, and D Stoffer. Runge\u2013Kutta solutions of stiff differential equations near stationary points. SIAM Journal on Numerical Analysis, 32(4):1296\u20131307, 1995.\n\n[37] Chris J Maddison, Daniel Paulin, Yee Whye Teh, Brendan O\u2019Donoghue, and Arnaud Doucet. Hamiltonian descent methods. arXiv preprint arXiv:1809.05042, 2018.\n\n[38] Yurii Nesterov. Lectures on convex optimization, 2018.\n\n[39] Jerzy Ombach. The simplest shadowing. In Annales Polonici Mathematici, volume 58, pages 253\u2013258, 1993.\n\n[40] Antonio Orvieto and Aurelien Lucchi. Continuous-time models for stochastic optimization algorithms. arXiv preprint arXiv:1810.02565, 2018.\n\n[41] Kenneth James Palmer. 
Shadowing in dynamical systems: theory and applications, volume 501. Springer Science & Business Media, 2013.\n\n[42] Lawrence Perko. Differential equations and dynamical systems, volume 7. Springer Science & Business Media, 2013.\n\n[43] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1\u201317, 1964.\n\n[44] Tim Sauer and James A Yorke. Rigorous verification of trajectories for the computer simulation of dynamical systems. Nonlinearity, 4(3):961, 1991.\n\n[45] Shai Shalev-Shwartz, Ohad Shamir, and Karthik Sridharan. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40(6):1623\u20131646, 2011.\n\n[46] Bin Shi, Simon S Du, Michael I Jordan, and Weijie J Su. Understanding the acceleration phenomenon via high-resolution differential equations. arXiv preprint arXiv:1810.08907, 2018.\n\n[47] Bin Shi, Simon S Du, Weijie J Su, and Michael I Jordan. Acceleration via symplectic discretization of high-resolution differential equations. arXiv preprint arXiv:1902.03694, 2019.\n\n[48] Stephen Smale. Differentiable dynamical systems. Bulletin of the American Mathematical Society, 73(6):747\u2013817, 1967.\n\n[49] Weijie Su, Stephen Boyd, and Emmanuel J. Cand\u00e8s. A differential equation for modeling Nesterov\u2019s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1\u201343, 2016.\n\n[50] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139\u20131147, 2013.\n\n[51] Erik S Van Vleck. Numerical shadowing near hyperbolic trajectories. SIAM Journal on Scientific Computing, 16(5):1177\u20131189, 1995.\n\n[52] Ashia C Wilson, Benjamin Recht, and Michael I Jordan. A Lyapunov analysis of momentum methods in optimization. 
arXiv preprint arXiv:1611.02635, 2016.\n\n[53] Pan Xu, Tianhao Wang, and Quanquan Gu. Accelerated stochastic mirror descent: From continuous-time dynamics to discrete-time algorithms. In International Conference on Artificial Intelligence and Statistics, pages 1087\u20131096, 2018.\n\n[54] Pan Xu, Tianhao Wang, and Quanquan Gu. Continuous and discrete-time accelerated stochastic mirror descent for strongly convex functions. In International Conference on Machine Learning, pages 5488\u20135497, 2018.\n\n[55] Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct Runge-Kutta discretization achieves acceleration. arXiv preprint arXiv:1805.00521, 2018.", "award": [], "sourceid": 6899, "authors": [{"given_name": "Antonio", "family_name": "Orvieto", "institution": "ETH Zurich"}, {"given_name": "Aurelien", "family_name": "Lucchi", "institution": "ETH Zurich"}]}