{"title": "Acceleration and Averaging in Stochastic Descent Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 6796, "page_last": 6806, "abstract": "We formulate and study a general family of (continuous-time) stochastic dynamics for accelerated first-order minimization of smooth convex functions. Building on an averaging formulation of accelerated mirror descent, we propose a stochastic variant in which the gradient is contaminated by noise, and study the resulting stochastic differential equation. We prove a bound on the rate of change of an energy function associated with the problem, then use it to derive estimates of convergence rates of the function values (almost surely and in expectation), both for persistent and asymptotically vanishing noise. We discuss the interaction between the parameters of the dynamics (learning rate and averaging rates) and the covariation of the noise process. In particular, we show how the asymptotic rate of covariation affects the choice of parameters and, ultimately, the convergence rate.", "full_text": "Acceleration and Averaging\n\nIn Stochastic Descent Dynamics\n\nWalid Krichene\n\nGoogle, Inc.\n\nwalidk@google.com\n\nPeter Bartlett\nU.C. Berkeley\n\nbartlett@cs.berkeley.edu\n\nAbstract\n\nWe formulate and study a general family of (continuous-time) stochastic dynamics\nfor accelerated \ufb01rst-order minimization of smooth convex functions.\nBuilding on an averaging formulation of accelerated mirror descent, we propose a\nstochastic variant in which the gradient is contaminated by noise, and study the\nresulting stochastic differential equation. We prove a bound on the rate of change\nof an energy function associated with the problem, then use it to derive estimates\nof convergence rates of the function values (almost surely and in expectation),\nboth for persistent and asymptotically vanishing noise. 
We discuss the interaction\nbetween the parameters of the dynamics (learning rate and averaging rates) and the\ncovariation of the noise process. In particular, we show how the asymptotic rate of\ncovariation affects the choice of parameters and, ultimately, the convergence rate.\n\n1\n\nIntroduction\n\nWe consider the constrained convex minimization problem\n\nmin\nx\u2208X f (x),\n\nwhere X is a closed, convex, compact subset of Rn, and f is a proper closed convex function,\nassumed to be differentiable with Lipschitz gradient, and we denote X (cid:63) \u2282 X the set of its minimizers.\nFirst-order methods play an important role in minimizing such functions, in particular in large-scale\nmachine learning applications, in which the dimensionality (number of features) and size (number of\nsamples) in typical datasets makes higher-order methods intractable. Many such algorithms can be\nviewed as a discretization of continuous-time dynamics. The simplest example is gradient descent,\nwhich can be viewed as the discretization of the gradient \ufb02ow dynamics \u02d9x(t) = \u2212\u2207f (x(t)), where\n\u02d9x(t) denotes the time derivative of a C 1 trajectory x(t). An important generalization of gradient\ndescent was developed by Nemirovsky and Yudin [1983], and termed mirror descent: it couples a\ndual variable z(t) and its \u201cmirror\u201d primal variable x(t). More speci\ufb01cally, the dynamics are given by\n\n(cid:26) \u02d9z(t) = \u2212\u2207f (x(t))\n\nMD\n\nx(t) = \u2207\u03c8\u2217(z(t)),\n\n(1)\nwhere \u2207\u03c8\u2217 : Rn \u2192 X is a Lipschitz function de\ufb01ned on the entire dual space Rn, with values in the\nfeasible set X ; it is often referred to as a mirror map, and we will recall its de\ufb01nition and properties\nin Section 2. Mirror descent can be viewed as a generalization of projected gradient descent, where\nthe Euclidean projection is replaced by the mirror map \u2207\u03c8\u2217 [Beck and Teboulle, 2003]. 
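To make the mirror descent dynamics (1) concrete, the following sketch (our own illustration, not from the paper) forward-Euler discretizes MD on the probability simplex with the entropy mirror map, for which ∇ψ∗(z) is the softmax map; the objective f(x) = (1/2)‖x − p‖² is an assumed toy example, minimized at x⋆ = p.

```python
import numpy as np

def softmax(z):
    """Entropy mirror map: maps a dual point z in R^n to the simplex."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed toy objective: f(x) = 0.5 * ||x - p||^2, minimized over the simplex at x* = p.
p = np.array([0.5, 0.3, 0.2])
grad_f = lambda x: x - p

# Forward-Euler discretization of MD (1): zdot = -grad f(x), x = softmax(z).
z = np.zeros(3)
h = 0.1                      # step size (our choice)
for _ in range(5000):
    x = softmax(z)
    z -= h * grad_f(x)

x = softmax(z)
print(np.round(x, 3))        # close to p
```

The dual variable z accumulates negative gradients, and the mirror map keeps the primal iterate feasible without an explicit projection step.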
This makes\nit possible to adapt the choice of the mirror map to the geometry of the problem, leading to better\ndependence on the dimension n, see [Ben-Tal and Nemirovski, 2001], [Ben-Tal et al., 2001].\n\nContinuous-time dynamics Although optimization methods are inherently discrete,\nthe\ncontinuous-time point of view can help in their design and analysis, since it can leverage the\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\frich literature on dynamical systems, control theory, and mechanics, see [Helmke and Moore, 1994],\n[Bloch, 1994], and the references therein. Continuous-time models are also commonly used in \ufb01nan-\ncial applications, such as option pricing [Black and Scholes, 1973], even though the actions are taken\nin discrete time. In convex optimization, beyond simplifying the analysis, continuous-time models\nhave also motivated new algorithms: mirror descent is one such example, since it was originally\nmotivated in continuous time (Chapter 3 in [Nemirovsky and Yudin, 1983]). In a more recent line\nof work ([Su et al., 2014], [Krichene et al., 2015], [Wibisono et al., 2016]), Nesterov\u2019s accelerated\nmethod [Nesterov, 1983] was shown to be the discretization of a second-order ordinary differential\nequation (ODE), which, in the unconstrained case, can be interpreted as a damped non-linear oscilla-\ntor [Cabot et al., 2009, Attouch et al., 2015]. This motivated a restarting heuristic [O\u2019Donoghue and\nCand\u00e8s, 2015], which aims at further dissipating the energy. Krichene et al. [2015] generalized this\nODE to mirror descent, and gave an averaging interpretation of accelerated dynamics by writing it\nas two coupled \ufb01rst-order ODEs. 
This is the starting point of this paper, in which we introduce and study a stochastic variant of accelerated mirror descent.\n\nStochastic dynamics and related work The dynamics that we have discussed so far are deterministic first-order dynamics, since they use the exact gradient ∇f. However, in many machine learning applications, evaluating the exact gradient ∇f can be prohibitively expensive, e.g. in regularized empirical risk minimization problems, where the objective function f involves the sum of loss functions over a training set, of the form f(x) = (1/|I|) Σ_{i∈I} fi(x) + g(x), where I indexes the training samples, and g is a regularization function¹. Instead of computing the exact gradient ∇f(x) = (1/|I|) Σ_{i∈I} ∇fi(x) + ∇g(x), a common approach is to compute an unbiased, stochastic estimate of the gradient, given by (1/|Ĩ|) Σ_{i∈Ĩ} ∇fi(x) + ∇g(x), where Ĩ is a uniformly random subset of I, indexing a random batch of samples from the training set. This approach motivates the study of stochastic dynamics for convex optimization. But despite an extensive literature on stochastic gradient and mirror descent in discrete time, e.g. [Nemirovski et al., 2009], [Duchi et al., 2010], [Lan, 2012], [Johnson and Zhang, 2013], [Xiao and Zhang, 2014], and many others, few results are known for stochastic mirror descent in continuous time. To the best of our knowledge, the only published results are by Raginsky and Bouvrie [2012] and Mertikopoulos and Staudigl [2016]. In its simplest form, the stochastic gradient dynamics can be described by the (overdamped) Langevin equation\n\ndX(t) = −∇f(X(t)) dt + σ dB(t),\n\nwhere B(t) denotes a standard Wiener process (Brownian motion). 
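A minimal Euler–Maruyama simulation (our own sketch, with an assumed quadratic f) illustrates the Langevin dynamics above: for f(x) = x²/2 and constant σ, the invariant density proportional to e^{−2f(x)/σ²} is Gaussian with variance σ²/2, which the empirical variance of many long trajectories should approach.

```python
import numpy as np

rng = np.random.default_rng(0)

# Euler-Maruyama for dX = -f'(X) dt + sigma dB, with f(x) = x^2 / 2.
# The invariant density is proportional to exp(-2 f(x) / sigma^2),
# i.e. Gaussian with variance sigma^2 / 2.
sigma, dt, n_steps, n_chains = 1.0, 0.01, 2000, 10_000
x = np.zeros(n_chains)                 # 10k independent trajectories
for _ in range(n_steps):
    noise = rng.standard_normal(n_chains)
    x += -x * dt + sigma * np.sqrt(dt) * noise

print(x.var())                         # ~ sigma^2 / 2 = 0.5
```

All parameter values (step size, horizon, number of chains) are illustrative choices, not quantities from the paper.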
It has a long history in opti-\nmization [Chiang et al., 1987], dating back to simulated annealing, and it is known to have a unique\ninvariant measure with density proportional to the Gibbs distribution e\u2212 2f (x)\n(see, e.g., [Pavliotis,\n2014]). Langevin dynamics have recently played an important role in the analysis of sampling\nmethods [Dalalyan, 2017, Bubeck et al., 2015, Durmus and Moulines, 2016, Cheng and Bartlett,\n2017, Eberle et al., 2017, Cheng et al., 2017], where f is taken to be proportional to the logarithm of a\ntarget density. It has also been used to derive convergence rates for smooth, non-convex optimization\nwhere the objective is dissipative [Raginsky et al., 2017].\nFor mirror descent dynamics, Raginsky and Bouvrie [2012] were the \ufb01rst to propose a stochastic\nvariant of the mirror descent ODE (1), given by the SDE:\n\n\u03c3\n\n(cid:26)dZ(t) = \u2212\u2207f (X(t)) + \u03c3dB(t)\n\nX(t) = \u2207\u03c8\u2217(Z(t)),\n\nSMD\n\n(2)\n\nwhere \u03c3 is a constant volatility. They argued that the function values f (X(t)) along sample trajecto-\nries do not converge to the minimum value of f due to the persistent noise, but the optimality gap\nis bounded by a quantity proportional to \u03c32. They also proposed a method to reduce the variance\nby simultaneously sampling multiple trajectories and linearly coupling them. Mertikopoulos and\nStaudigl [2016] extended the analysis in some important directions: they replaced the constant \u03c3 with\na general volatility matrix \u03c3(x, t) which can be space and time dependent, and studied two regimes:\n\nthe small noise limit (\u03c3(x, t) vanishes at a O(1/\u221alog t) rate), in which case they prove almost sure\n\nconvergence; and the persistent noise regime (\u03c3(x, t) is uniformly bounded), in which case they de\ufb01ne\n\n1In statistical learning, one seeks to minimize the expected risk (with respect to the true, unknown data\ndistribution). 
A common approach is to minimize the empirical risk (observed on a given training set) then\nbound the distance between empirical and expected risk. Here we only focus on the optimization part.\n\n2\n\n\fa recti\ufb01ed variant of SMD, obtained by replacing the second equation by X(t) = \u2207\u03c8\u2217(Z(t)/s(t)),\nwhere 1/s(t) is a sensitivity parameter (intuitively, decreasing the sensitivity reduces the impact of\naccumulated noise). In particular, they prove that with s(t) = \u221at, the expected function values con-\nverge at a O(1/\u221at) rate. While these recent results paint a broad picture of mirror descent dynamics,\n\nthey leave many questions open: in particular, they do not provide estimates for convergence rates in\nthe vanishing noise limit, which is an important regime in machine learning applications, since one\ncan often control the variance of the gradient estimate, for example by gradually increasing the batch\nsize, as done by Xiao and Zhang [2014]. Besides, they do not study accelerated dynamics, and the\ninteraction between acceleration and noise remains unexplored in continuous time.\n\nOur contributions\nIn this paper, we answer many of the questions left open in previous works. We\nformulate and study a family of stochastic accelerated mirror descent dynamics, and we characterize\nthe interaction between its different parameters: the volatility of the noise, the (primal and dual)\nlearning rates, and the sensitivity of the mirror map. More speci\ufb01cally:\n\n\u2022 In Theorem 1, we give suf\ufb01cient conditions for almost sure convergence of solution trajec-\ntories to the set of minimizers X (cid:63). In particular, we show that it is possible to guarantee\nalmost sure convergence even when the volatility grows unbounded asymptotically.\n\u2022 In Theorem 2, we derive a bound on the expected function values. 
In particular, we can prove that in the vanishing noise regime, acceleration (with appropriate averaging) achieves a faster rate, see Corollary 2 and the discussion in Remark 3.\n\n• In Theorem 3, we provide estimates of sample trajectory convergence rates.\n\nThe rest of the paper is organized as follows: We review the building blocks of our construction in Section 2, then formulate the stochastic dynamics in Section 3, and prove two instrumental lemmas. Section 4 is dedicated to the convergence results. We conclude with a brief discussion in Section 5.\n\n2 Accelerated Mirror Descent Dynamics\n\n2.1 Smooth mirror map\n\nWe start by reviewing some definitions and preliminaries. Let (E, ‖·‖) be a normed vector space, and (E∗, ‖·‖∗) be its dual space equipped with the dual norm, and denote by ⟨x, z⟩ the pairing between x ∈ E, z ∈ E∗. To simplify, both E and E∗ can be identified with Rn, but we make the distinction for clarity. We say that a map F : E → E∗ is Lipschitz continuous on X ⊂ E with constant L if for all x, x′ ∈ X, ‖F(x) − F(x′)‖∗ ≤ L‖x − x′‖. Let ψ : E → R ∪ {+∞} be a convex function with effective domain X (i.e. X = {x ∈ E : ψ(x) < ∞}). Its convex conjugate ψ∗ is defined on E∗ by ψ∗(z) = sup_{x∈X} ⟨z, x⟩ − ψ(x). One can show that if ψ is strongly convex, then ψ∗ is differentiable on all of E∗, and its gradient ∇ψ∗ is a Lipschitz function that maps E∗ to X (see the supplementary material). This map is often called a mirror map [Nemirovsky and Yudin, 1983]. To give a concrete example, take ψ to be the squared Euclidean norm, ψ(x) = (1/2)‖x‖₂². Then one can show ∇ψ∗(z) = arg min_{x∈X} ‖z − x‖₂², and the mirror map reduces to the Euclidean projection on X. For additional examples, see e.g. Banerjee et al. [2005]. We make the following assumptions throughout the paper:\nAssumption 1. X is closed, convex and compact, the set of minimizers X⋆ is contained in the relative interior of X, ψ is non-negative (without loss of generality), ψ∗ is twice differentiable with a Lipschitz gradient, and f is differentiable with a Lipschitz gradient. We denote by Lψ∗ the Lipschitz constant of ∇ψ∗, and by Lf the Lipschitz constant of ∇f.\n\n2.2 Averaging formulation of accelerated mirror descent\n\nWe start from the averaging formulation of Krichene et al. [2015], and include a sensitivity parameter similar to Mertikopoulos and Staudigl [2016]. This results in the following ODE:\n\nAMD_{η,a,s}:  ż(t) = −η(t)∇f(x(t)),  ẋ(t) = a(t)(∇ψ∗(z(t)/s(t)) − x(t)),  (3)\n\nwith initial conditions² (x(t0), z(t0)) = (x0, z0). The ODE system is parameterized by the following functions, all assumed to be positive and continuous on [t0, ∞) (see Figure 1 for an illustration):\n\n• s(t) is a non-decreasing, inverse sensitivity parameter. As we will see, s(t) will be helpful in the stochastic case in scaling the noise term, in order to reduce its impact.\n• η(t) is a learning rate in the dual space.\n• a(t) is an averaging rate in the primal space. Indeed, the second ODE in (3) can be written in integral form as a weighted average of the mirror trajectory as follows: let w(t) = e^{∫_{t0}^t a(τ)dτ} (equivalently, a(t) = ẇ(t)/w(t)); then the ODE is equivalent to w(t)ẋ(t) + ẇ(t)x(t) = ẇ(t)∇ψ∗(z(t)/s(t)), and integrating and rearranging,\n\nx(t) = (x(t0)w(t0) + ∫_{t0}^t ẇ(τ)∇ψ∗(z(τ)/s(τ))dτ) / w(t).\n\nThere are other, different ways of formulating the accelerated dynamics: instead of two first-order ODEs, one can write one second-order ODE (such as in Su et al. [2014], Wibisono et al. [2016]), which has interesting interpretations related to Lagrangian dynamics. The averaging formulation given in Equation (3) is better suited to our analysis.\n\n2.3 Energy function\n\nThe analysis of continuous-time dynamics often relies on a Lyapunov argument (in reference to Lyapunov [1892]): one starts by defining a non-negative energy function, then bounding its rate of change along solution trajectories. This bound can then be used to prove convergence to the set of minimizers X⋆.
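Before proceeding, the weighted-average form of the primal variable from Section 2.2 can be checked numerically (our own sketch): below, m(t) is an assumed toy stand-in for the mirror trajectory t ↦ ∇ψ∗(z(t)/s(t)), and Euler-integrating ẋ = a(t)(m(t) − x) with a(t) = ẇ/w should reproduce the weighted average (x(t0)w(t0) + ∫ ẇ m dτ)/w(t) up to discretization error.

```python
import numpy as np

m = np.sin                            # toy stand-in for the mirror trajectory
beta, t0, T, h = 2.0, 1.0, 5.0, 1e-4  # illustrative parameters: a(t) = beta / t
w    = lambda t: t ** beta            # w(t) = exp(int_{t0}^t a), with t0 = 1
wdot = lambda t: beta * t ** (beta - 1.0)

x, num, t = 0.0, 0.0, t0              # x(t0) = 0, num = running int of wdot * m
while t < T:
    x   += h * (beta / t) * (m(t) - x)   # Euler step of xdot = a(t)(m(t) - x)
    num += h * wdot(t) * m(t)            # Euler step of int wdot(tau) m(tau) dtau
    t   += h

x_avg = num / w(t)                    # weighted-average form (x(t0) w(t0) = 0 here)
print(abs(x - x_avg))                 # small: the two forms agree
```

Both sides are first-order accurate, so their gap shrinks with the step size h; the identity itself is exact for the continuous dynamics.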
We will consider a modified version of the energy function used in Krichene et al. [2016]: given a positive, C¹ function r(t), and a pair of optimal primal-dual points (x⋆, z⋆) such that³ x⋆ ∈ X⋆ and ∇ψ∗(z⋆) = x⋆, let\n\nL(x, z, t) = r(t)(f(x) − f(x⋆)) + s(t)Dψ∗(z/s(t), z⋆).  (4)\n\nHere, Dψ∗ is the Bregman divergence associated with ψ∗, defined by\n\nDψ∗(z′, z) = ψ∗(z′) − ψ∗(z) − ⟨∇ψ∗(z), z′ − z⟩,  for all z, z′ ∈ E∗.\n\nThen we can prove a bound on the time derivative of L along solution trajectories of AMD_{η,a,s}, given in the following proposition. To keep the equations compact, we will occasionally omit explicit dependence on time, and write, e.g. η/r instead of η(t)/r(t).\nLemma 1. Suppose that a = η/r. Then under AMD_{η,η/r,s}, for all t ≥ t0,\n\n(d/dt) L(x(t), z(t), t) ≤ (f(x(t)) − f(x⋆))(ṙ(t) − η(t)) + ψ(x⋆)ṡ(t).  (5)\n\nProof. We start by bounding the rate of change of the Bregman divergence term:\n\n(d/dt) [s(t)Dψ∗(z(t)/s(t), z⋆)] = ṡDψ∗(z/s, z⋆) + s⟨∇ψ∗(z/s) − ∇ψ∗(z⋆), ż/s − ṡz/s²⟩\n= ⟨∇ψ∗(z/s) − x⋆, ż⟩ + ṡ(Dψ∗(z/s, z⋆) − ⟨∇ψ∗(z/s) − ∇ψ∗(z⋆), z/s⟩)\n= ⟨∇ψ∗(z/s) − x⋆, ż⟩ + ṡ(ψ(x⋆) − ψ(∇ψ∗(z/s)))\n≤ ⟨∇ψ∗(z/s) − x⋆, ż⟩ + ṡψ(x⋆),\n\nwhere the third equality can be proved using the fact that ψ(x) + ψ∗(z) = ⟨x, z⟩ ⇔ x ∈ ∂ψ∗(z) ⇔ z ∈ ∂ψ(x) (Theorem 23.5 in Rockafellar [1970]), and the last inequality follows from the assumption that s is non-decreasing, and that ψ is non-negative. Using this expression, we can then compute\n\n(d/dt) L(x(t), z(t), t) ≤ ṙ(f(x) − f(x⋆)) + r⟨∇f(x), ẋ⟩ + ⟨∇ψ∗(z/s) − x⋆, ż⟩ + ψ(x⋆)ṡ\n= ṙ(f(x) − f(x⋆)) + r⟨∇f(x), ẋ⟩ + ⟨ẋ/a + x − x⋆, −η∇f(x)⟩ + ψ(x⋆)ṡ\n≤ (f(x) − f(x⋆))(ṙ − η) + ⟨∇f(x), ẋ⟩(r − η/a) + ψ(x⋆)ṡ,\n\nwhere we plugged in the expression of ż and ∇ψ∗(z/s) from AMD_{η,a,s} in the second equality, and used convexity of f in the last inequality. The assumption a = η/r ensures that the middle term vanishes, which concludes the proof.\n\n²The initial conditions typically satisfy ∇ψ∗(z0) = x0, which ensures that the trajectory starts with zero velocity, but this is not necessary in general.\n³Such a z⋆ exists whenever x⋆ is in the relative interior of X (hence the condition X⋆ ⊂ relint X in Assumption 1). The analysis can be extended to minimizers that are on the relative boundary by replacing the Bregman divergence term in L by the Fenchel coupling defined by Mertikopoulos and Staudigl [2016].\n\nAs a consequence of the previous proposition, we can prove the following convergence rate:\nCorollary 1. Suppose that a = η/r and that η ≥ ṙ. Then under AMD_{η,η/r,s}, for all t ≥ t0,\n\nf(x(t)) − f(x⋆) ≤ (ψ(x⋆)(s(t) − s(t0)) + L(x0, z0, t0)) / r(t).\n\nProof. Starting from the bound (5), the first term is non-positive by the assumption that η ≥ ṙ. Integrating, we have L(x(t), z(t), t) − L(x0, z0, t0) ≤ ψ(x⋆)(s(t) − s(t0)), thus\n\nf(x(t)) − f(x⋆) ≤ L(x(t), z(t), t)/r(t) ≤ (ψ(x⋆)(s(t) − s(t0)) + L(x0, z0, t0)) / r(t).\n\nRemark 1. Corollary 1 can be interpreted as follows: given a desired convergence rate r(t), one can choose parameters a, η, s that satisfy the conditions of the corollary (e.g. by first setting η = ṙ, then choosing a = η/r). This defines an ODE, the solutions of which are guaranteed to converge at the rate r(t). While the convergence rate can seemingly be arbitrary for continuous time dynamics, discretizing the ODE does not always preserve the convergence rate. Wibisono et al. [2016], Wilson et al. 
[2016] give suf\ufb01cient conditions on the discretization scheme to preserve polynomial rates, for\nexample, a \ufb01rst-order discretization can preserve quadratic rates, and a higher-order discretization\n(using cubic-regularized Newton updates) can preserve cubic rates.\nRemark 2. As a special case, one can recover Nesterov\u2019s ODE by taking r(t) = t2, \u03b7(t) = \u03b2t,\na(t) = \u03b2/t (i.e. w(t) = w(t0)(t/t0)\u03b2), and s(t) = 1 (see the supplement for additional details). It\nis worth observing that in this case, both the primal and dual rates \u03b7(t) and w(t) are increasing. A\ndifferent choice of parameters leads to dynamics similar to Nesterov\u2019s but with different weights.\n\n3 Stochastic dynamics\n\nSAMD\u03b7,a,s\n\nWe now formulate the stochastic variant of accelerated mirror descent dynamics (SAMD). Intuitively,\nwe would like to replace the gradient term \u2207f (x) in AMD\u03b7,a,s by a noisy gradient. Writing the\nnoisy dynamics as an It\u00f4 SDE [\u00d8ksendal, 2003], we consider the system\ndX(t) = a(t)[\u2207\u03c8\u2217(Z(t)/s(t)) \u2212 X(t)]dt,\n\n(cid:26)dZ(t) = \u2212\u03b7(t)[\u2207f (X(t))dt + \u03c3(X(t), t)dB(t)]\n\nwith initial condition (X(t0), Z(t0)) = (x0, z0) (we assume deterministic initial conditions for\nsimplicity). Here, B(t) \u2208 Rn is a standard Wiener process with respect to a given \ufb01ltered probability\nspace (\u2126,F,{Ft}t\u2265t0, P), and \u03c3 : (x, t) (cid:55)\u2192 \u03c3(x, t) \u2208 Rn\u00d7n is a volatility matrix assumed measur-\nable and Lipschitz in x (uniformly in t), and continuous in t for all x. The drift term in SAMD\u03b7,a,s\nis identical to the deterministic case, and the volatility term \u2212\u03b7(t)\u03c3(X(t), t)dB(t) represents the\nnoise in the gradient. In particular, we note that the learning rate \u03b7(t) multiplies \u03c3(X(t), t)dB(t),\nto capture the fact that the gradient noise is scaled by the learning rate \u03b7. 
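The SAMD system can be simulated with a straightforward Euler–Maruyama scheme. The sketch below is our own construction (not the paper's code): it uses the entropy mirror map on the simplex, the Nesterov-style parameters of Remark 2 with β = 2 (r(t) = t², η(t) = 2t, a(t) = η/r = 2/t, s = 1), an assumed toy objective f(x) = (1/2)‖x − p‖², and an assumed vanishing scalar volatility σ(t) = σ0/t, so that η(t)σ(t) stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Entropy mirror map onto the simplex."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed toy objective on the simplex: f(x) = 0.5 ||x - p||^2, min value 0 at p.
p = np.array([0.5, 0.3, 0.2])
grad_f = lambda x: x - p

t0, T, dt, sigma0 = 1.0, 50.0, 1e-3, 0.05
eta = lambda t: 2.0 * t          # dual learning rate, eta = rdot with r(t) = t^2
a   = lambda t: 2.0 / t          # primal averaging rate, a = eta / r
sig = lambda t: sigma0 / t       # assumed vanishing volatility

z = np.zeros(3)
x = softmax(z)                   # x0 = grad psi*(z0): zero initial velocity
t = t0
while t < T:
    dB = np.sqrt(dt) * rng.standard_normal(3)
    z += -eta(t) * (grad_f(x) * dt + sig(t) * dB)   # dual SDE (noisy gradient)
    x += a(t) * (softmax(z) - x) * dt               # primal averaging ODE
    t += dt

gap = 0.5 * np.sum((x - p) ** 2)  # f(X(T)) - f(x*): small in this vanishing-noise run
print(gap)
```

Since the primal update is a convex combination of x and a mirror point, the trajectory stays feasible; the sensitivity s and the noise regime can be varied to probe the trade-offs discussed in Section 4.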
This formulation is fairly\ngeneral, and does not assume, in particular, that the different components of the noise are independent,\nas we can see in the quadratic covariation of the dual process Z(t):\n\n(6)\n\nd[Zi(t), Zj(t)] = \u03b7(t)2(\u03c3(X(t), t)\u03c3(X(t), t)T )i,jdt = \u03b7(t)2\u03a3ij(X(t), t)dt,\n\n(7)\nwhere we de\ufb01ned the in\ufb01nitesimal covariance matrix \u03a3(x, t) = \u03c3(x, t)\u03c3(x, t)T \u2208 Rn\u00d7n. In our\nanalysis, we will focus on different noise regimes, which can be characterized using4\n(8)\nwhere (cid:107)\u03a3(cid:107)i = sup(cid:107)z(cid:107)\u2217\u22641 (cid:107)\u03a3z(cid:107) is the induced matrix norm. Since \u03a3(x, t) is Lipschitz in x and\ncontinuous in t, and X is compact, \u03c3\u2217(t) is \ufb01nite for all t, and continuous. Contrary to [Raginsky\nand Bouvrie, 2012, Mertikopoulos and Staudigl, 2016], we do not assume that \u03c3\u2217(t) is uniformly\nbounded in t. We give an illustration of the stochastic dynamics in Figure 1 (see the supplement for\ndetails).\n\nx\u2208X (cid:107)\u03a3(x, t)(cid:107)i,\n\n\u03c32\u2217(t) = sup\n\n4In our model, we focus on the time dependence of the volatility. Note that in some settings, the variance of\nthe gradient estimates scales with the squared norm of the gradient, see [Bottou et al., 2016] in the discrete case.\nThus one can consider a model where \u03c3(x, t) scales with (cid:107)\u2207f (x)(cid:107)\u2217, which may lead to different rates.\n\n5\n\n\fFigure 1: Illustration of the SAMD dynamics. The dual variable Z(t) cumulates gradients. It is\nscaled by the sensitivity 1/s(t) then mapped to the primal space via the mirror map, resulting in\n\u2207\u03c8\u2217(Z/s) (dotted line). The primal variable is then a weighted average of the mirror trajectory.\nExistence and uniqueness First, we give the following existence and uniqueness result:\nProposition 1. 
For all T > t0, SAMD_{η,a,s} has a unique (up to redefinition on a P-null set) solution (X(t), Z(t)), continuous on [t0, T], with the property that (X(t), Z(t)) is adapted to the filtration {Ft}, and ∫_{t0}^T ‖X(t)‖² dt and ∫_{t0}^T ‖Z(t)‖∗² dt have finite expectations.\n\nProof. By assumption, ∇ψ∗ and ∇f are Lipschitz continuous, thus the function (x, z) ↦ (−η(t)∇f(x), a(t)[∇ψ∗(z/s(t)) − x]) is Lipschitz on [t0, T] (since a, η, s are positive continuous). Additionally, the function x ↦ σ(x, t) is also Lipschitz. Therefore, we can invoke the existence and uniqueness theorem for stochastic differential equations [Øksendal, 2003, Theorem 5.2.1]. Since T is arbitrary, we can conclude that there exists a unique continuous solution on [t0, ∞).\n\nEnergy decay Next, in order to analyze the convergence properties of the solution trajectories (X(t), Z(t)), we will need to bound the time-derivative of the energy function L.\nLemma 2. Suppose that the primal rate a = η/r, and let (X(t), Z(t)) be the unique solution to SAMD_{η,η/r,s}. Then for all t ≥ t0,\n\ndL(X(t), Z(t), t) ≤ [(f(X(t)) − f(x⋆))(ṙ(t) − η(t)) + ψ(x⋆)ṡ(t) + nLψ∗η²(t)σ∗²(t)/(2s(t))] dt + ⟨V(t), dB(t)⟩,\n\nwhere V(t) is the continuous process given by\n\nV(t) = −η(t)σ(X(t), t)ᵀ(∇ψ∗(Z(t)/s(t)) − ∇ψ∗(z⋆)).  (9)\n\nProof. By definition of the energy function L, ∇xL(x, z, t) = r(t)∇f(x) and ∇zL(x, z, t) = ∇ψ∗(z/s(t)) − ∇ψ∗(z⋆), which are Lipschitz continuous in (x, z) (uniformly in t on any bounded interval, since s, r are continuous positive functions of t). 
Thus by the Itô formula for functions with Lipschitz continuous gradients [Errami et al., 2002], we have\n\ndL = ∂tL dt + ⟨∇xL, dX⟩ + ⟨∇zL, dZ⟩ + (1/2) tr(ησᵀ∇²zzL ση) dt\n= ∂tL dt + ⟨∇xL, dX⟩ + ⟨∇zL, −η∇f(X)⟩ dt + ⟨∇zL, −ησ dB⟩ + (η²/2) tr(Σ∇²zzL) dt.\n\nThe first three terms correspond exactly to the deterministic case, and we can bound them by (5) from Lemma 1. The last two terms are due to the stochastic noise, and consist of a volatility term\n\n−η⟨∇zL(X, Z, t), σ dB⟩ = −η⟨∇ψ∗(Z/s) − ∇ψ∗(z⋆), σ dB⟩ = ⟨V, dB⟩,\n\nand the Itô correction term\n\n(η²/2) tr(Σ(X, t)∇²zzL(X, Z, t)) dt = (η²/(2s)) tr(Σ(X, t)∇²ψ∗(Z/s)) dt.\n\nWe can bound the last term using the fact that ∇ψ∗ is, by assumption, Lψ∗-Lipschitz, and the definition (8) of σ∗: for all x ∈ E, z ∈ E∗, and t ≥ t0, tr(Σ(x, t)∇²ψ∗(z)) ≤ nLψ∗σ∗²(t). Combining the previous inequalities, we obtain the desired bound.\n\nIntegrating the bound of Lemma 2 will allow us to bound changes in energy. This bound will involve the Itô martingale term ∫_{t0}^t ⟨V(τ), dB(τ)⟩, and in order to control this term, we give, in the following lemma, an asymptotic envelope (a consequence of the law of the iterated logarithm).\nLemma 3. Let b(t) = ∫_{t0}^t η²(τ)σ∗²(τ) dτ. Then\n\n∫_{t0}^t ⟨V(τ), dB(τ)⟩ = O(√(b(t) log log b(t)))  a.s. as t → ∞.  (10)\n\nProof. Let us denote the Itô martingale by V(t) = ∫_{t0}^t ⟨V(τ), dB(τ)⟩ = Σ_{i=1}^n ∫_{t0}^t Vi(τ) dBi(τ), and its quadratic variation by β(t) = [V(t), V(t)]. By definition of V, we have\n\ndβ = Σ_{i=1}^n Σ_{j=1}^n ViVj d[Bi, Bj] = Σ_{i=1}^n Vi² dt = ⟨V, V⟩ dt.\n\nBy the Dambis-Dubins-Schwartz time change theorem (e.g. Corollary 8.5.4 in [Øksendal, 2003]), there exists a Wiener process B̂ such that\n\nV(t) = B̂(β(t)).  (11)\n\nWe now proceed to bound β(t). Using the expression (9) of V, we have ⟨V, V⟩ = η²(t)Δ(t)ᵀΣ(X, t)Δ(t), where Δ(t) = ∇ψ∗(Z(t)/s(t)) − ∇ψ∗(z⋆). Since the mirror map has values in X and X is assumed compact, the diameter D = sup_{x,x′∈X} ‖x − x′‖ is finite, and ‖Δ(t)‖ ≤ D for all t. Thus, dβ(t) ≤ D²η(t)²σ∗²(t) dt, and integrating,\n\nβ(t) ≤ D²b(t)  a.s.  (12)\n\nSince β(t) is a non-decreasing process, two cases are possible: if lim_{t→∞} β(t) is finite, then lim sup_{t→∞} |V(t)| is a.s. finite and the result follows immediately. If lim_{t→∞} β(t) = ∞, then\n\nlim sup_{t→∞} V(t)/√(b(t) log log b(t)) ≤ lim sup_{t→∞} B̂(β(t))/√((β(t)/D²) log log(β(t)/D²)) = D√2  a.s.,\n\nwhere the inequality combines (11) and (12), and the equality is by the law of the iterated logarithm.\n\n4 Convergence results\n\nEquipped with Lemma 2 and Lemma 3, which bound, respectively, the rate of change of the energy and the asymptotic growth of the martingale term, we are now ready to prove our convergence results.\nTheorem 1. Suppose that η(t)σ∗(t) = o(1/√log t), and that ∫_{t0}^t η(τ) dτ dominates b(t) and √(b(t) log log b(t)) (where b(t) = ∫_{t0}^t η²(τ)σ∗²(τ) dτ as defined in Lemma 3). Consider SAMD dynamics with r = s = 1. Let (X(t), Z(t)) be the unique continuous solution of SAMD_{η,η,1}. Then\n\nlim_{t→∞} f(X(t)) − f(x⋆) = 0  a.s.\n\nProof sketch. We give a sketch of the proof here (the full argument is deferred to the supplement).\ni) The first step is to prove that under the conditions of the theorem, the continuous solution of SAMD_{η,η,1}, (X(t), Z(t)), is an asymptotic pseudo trajectory (a notion defined and studied by Benaïm and Hirsch [1996] and Benaïm [1999]) of the deterministic flow AMD_{η,η,1}. The rigorous definition is given in the supplementary material, but intuitively, this means that for large enough times, the sample paths of the process (X(t), Z(t)) get arbitrarily close to (x(t), z(t)), the solution trajectories of the deterministic dynamics.\nii) The second step is to show that under the deterministic flow, the energy L decreases enough for large enough times.\niii) The third step is to prove that under the stochastic process, f(X(t)) cannot stay bounded away from f(x⋆) for all t. 
Note that under the conditions of the theorem, integrating the bound of Lemma 2, and using the asymptotic envelope of Lemma 3, gives
$$\mathcal{L}(X(t), Z(t), t) - \mathcal{L}(x_0, z_0, t_0) \leq -\int_{t_0}^t \big(f(X(\tau)) - f(x^\star)\big)\,\eta(\tau)\,d\tau + O(b(t)) + O\left(\sqrt{b(t)\log\log b(t)}\right),$$
and if, say, $f(X(t)) - f(x^\star) \geq c > 0$ for all $t$, then the first term dominates the bound, and the energy would decrease to $-\infty$, a contradiction.
Combining these steps, we argue that $f(X(t))$ eventually becomes close to $f(x^\star)$ by (iii), then stays close by virtue of (i) and (ii).

The result of Theorem 1 makes it possible to guarantee almost sure convergence (albeit without guaranteeing a convergence rate) when the noise is persistent ($\sigma_*(t)$ is constant, or even increasing). To give a concrete example, suppose $\sigma_*(t) = O(t^\alpha)$ (with $\alpha < \frac{1}{2}$ but possibly positive), and let $\eta(t) = t^{-\alpha - \frac{1}{2}}$. Then $\eta(t)\sigma_*(t) = O(t^{-\frac{1}{2}})$, $\int_{t_0}^t \eta(\tau)\,d\tau = \Omega(t^{-\alpha + \frac{1}{2}})$, $b(t) = O(\log t)$, and $\sqrt{b(t)\log\log b(t)} = O(\sqrt{\log t \cdot \log\log\log t})$, and the conditions of the theorem are satisfied. Therefore, with the appropriate choice of learning rate $\eta(t)$ (and the corresponding averaging in the primal space given by $a(t) = \eta(t)$), one can guarantee almost sure convergence.

Next, we derive explicit bounds on convergence rates. We start by bounding expected function values.

Theorem 2. Suppose that $a = \eta/r$ and $\eta \geq \dot{r}$. Let $(X(t), Z(t))$ be the unique continuous solution to SAMD$_{\eta,\eta/r,s}$.
Then for all $t \geq t_0$,
$$\mathbb{E}[f(X(t))] - f(x^\star) \leq \frac{1}{r(t)}\left(\mathcal{L}(x_0, z_0, t_0) + \psi(x^\star)\,(s(t) - s(t_0)) + \frac{n L_{\psi^*}}{2}\int_{t_0}^t \frac{\eta^2(\tau)\,\sigma_*^2(\tau)}{s(\tau)}\,d\tau\right).$$

Proof. Integrating the bound of Lemma 2, and using the fact that $(f(X(t)) - f(x^\star))(\dot{r} - \eta) \leq 0$ by assumption on $\eta$, we have
$$\mathcal{L}(X(t), Z(t), t) - \mathcal{L}(x_0, z_0, t_0) \leq \psi(x^\star)\,(s(t) - s(t_0)) + \frac{n L_{\psi^*}}{2}\int_{t_0}^t \frac{\eta^2(\tau)\,\sigma_*^2(\tau)}{s(\tau)}\,d\tau + \int_{t_0}^t \langle V(\tau), dB(\tau)\rangle. \qquad (13)$$
Taking expectations, the last term vanishes since it is an Itô martingale, and we conclude by observing that $\mathbb{E}[f(X(t))] - f(x^\star) \leq \mathbb{E}[\mathcal{L}(X(t), Z(t), t)]/r(t)$.

To give a concrete example, suppose that $\sigma_*(t) = O(t^\alpha)$ is given, and let $r(t) = t^\beta$ and $s(t) = t^\gamma$, with $\beta, \gamma > 0$. To simplify, we will take $\eta(t) = \dot{r}(t) = \beta t^{\beta - 1}$. Then the bound of Theorem 2 shows that $\mathbb{E}[f(X(t))] - f(x^\star) = O(t^{\gamma - \beta} + t^{\beta + 2\alpha - \gamma - 1})$. To minimize the asymptotic rate, we can choose $\gamma - \beta = \beta + 2\alpha - \gamma - 1$, i.e. $\beta + \alpha - \gamma - \frac{1}{2} = 0$, and the resulting rate is $O(t^{\alpha - \frac{1}{2}})$. In particular, we have:

Corollary 2. Suppose that $\sigma_*(t) = O(t^\alpha)$, $\alpha < \frac{1}{2}$. Then with $\eta(t) = (1 - \alpha)\,t^{-\alpha}$, $a(t) = \frac{1-\alpha}{t}$, and $s(t) = t^{\frac{1}{2}}$, we have $\mathbb{E}[f(X(t))] - f(x^\star) = O(t^{\alpha - \frac{1}{2}})$.

Remark 3. Corollary 2 can be interpreted as follows: given a polynomial bound $\sigma_*(t) = O(t^\alpha)$ on the volatility of the noise process, one can adapt the choice of primal and dual averaging rates ($a(t)$ and $\eta(t)$), which leads to an $O(t^{\alpha - \frac{1}{2}})$ convergence rate.

- In the persistent noise regime ($\alpha = 0$), the dynamics use a constant $\eta$, and result in a $O(1/\sqrt{t})$ rate. This rate is similar to that of the rectified dynamics proposed by Mertikopoulos and Staudigl [2016], but while they show convergence of the ergodic average $\tilde{X}(t) = \frac{1}{t}\int_0^t X(\tau)\,d\tau$, we can show convergence of the original process $X(t)$ under acceleration.

- In the vanishing noise regime ($\alpha < 0$), we can take advantage of the decreasing volatility by making $\eta(t)$ increasing. With the appropriate averaging rate $a(t)$, this leads to the improved rate $O(t^{\alpha - \frac{1}{2}})$. It is worth observing here that when $\alpha \geq -\frac{1}{2}$, the same rate can be obtained for the ergodic average, without acceleration: one can show that the rectified SMD with $s(t) = t^{\max(0, \alpha + \frac{1}{2})}$ achieves a $O(t^{\max(\alpha - \frac{1}{2}, -1)})$ rate. However, for $\alpha < -\frac{1}{2}$, acceleration improves the rate from $O(t^{-1})$ to $O(t^{\alpha - \frac{1}{2}})$.

- In the increasing noise regime ($\alpha > 0$), as long as the volatility does not increase too fast ($\alpha < \frac{1}{2}$), one can still guarantee convergence by decreasing $\eta(t)$ with the appropriate rate.

Finally, we give an estimate of the asymptotic convergence rate along solution trajectories.

Theorem 3. Suppose that $a = \eta/r$ and $\eta \geq \dot{r}$. Let $(X(t), Z(t))$ be the unique continuous solution to SAMD$_{\eta,\eta/r,s}$. Then
$$f(X(t)) - f(x^\star) = O\!\left(\frac{s(t) + n\int_{t_0}^t \frac{\eta^2(\tau)\,\sigma_*^2(\tau)}{s(\tau)}\,d\tau + \sqrt{b(t)\log\log b(t)}}{r(t)}\right) \quad \text{a.s. as } t \to \infty,$$
where $b(t) = \int_{t_0}^t \eta^2(\tau)\,\sigma_*^2(\tau)\,d\tau$.

Proof. Integrating the bound of Lemma 2 once again, we get inequality (13), where we can bound the Itô martingale term $\int_{t_0}^t \langle V(\tau), dB(\tau)\rangle$ using Lemma 3. This concludes the proof.

Comparing the last bound to that of Theorem 2, we have the additional $\sqrt{b(t)\log\log b(t)}/r(t)$ term due to the envelope of the martingale term. This results in a slower a.s. convergence rate. Suppose again that $\sigma_*(t) = O(t^\alpha)$, and that $r(t) = t^\beta$ and $\eta(t) = \dot{r}(t) = \beta t^{\beta - 1}$ to simplify. Then $b(t) = \int_{t_0}^t \eta^2(\tau)\,\sigma_*^2(\tau)\,d\tau = O(t^{2\beta + 2\alpha - 1})$, and the martingale term becomes $O(\sqrt{b(t)\log\log b(t)}/r(t)) = O(t^{\alpha - \frac{1}{2}}\sqrt{\log\log t})$. Remarkably, the asymptotic rate of sample trajectories is, up to a $\sqrt{\log\log t}$ factor, the same as the asymptotic rate in expectation; one should observe, however, that the constant in the $O$ notation is trajectory-dependent.

Corollary 3. Suppose that $\sigma_*(t) = O(t^\alpha)$, $\alpha < \frac{1}{2}$. Then with $\eta(t) = (1 - \alpha)\,t^{-\alpha}$, $a(t) = \frac{1-\alpha}{t}$, and $s(t) = t^{\frac{1}{2}}$, we have $f(X(t)) - f(x^\star) = O(t^{\alpha - \frac{1}{2}}\sqrt{\log\log t})$ a.s.

5 Conclusion

Starting from the averaging formulation of accelerated mirror descent in continuous time, and motivated by stochastic optimization, we formulated a stochastic variant and studied the resulting SDE. We discussed the role played by each parameter: the dual learning rate $\eta(t)$, the inverse sensitivity parameter $s(t)$, and the noise covariation bound $\sigma_*(t)$. Our results show that in the persistent noise regime, thanks to averaging, it is possible to guarantee a.s. convergence, remarkably even when $\sigma_*(t)$ is increasing (as long as $\sigma_*(t) = o(\sqrt{t})$).
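As a numerical illustration of the persistent-noise regime ($\alpha = 0$), the following sketch runs an Euler–Maruyama discretization of stochastic accelerated dynamics on the toy objective $f(x) = \frac{1}{2}\|x\|^2$. It makes several simplifying assumptions not taken from the paper: the Euclidean mirror map $\nabla\psi^*(z) = z$ stands in for a general mirror map (so the domain is all of $\mathbb{R}^n$ rather than a compact set), and the dynamics are taken in the averaging form $dZ = -\eta\,\nabla f(X)\,dt + \eta\,\sigma\,dB$, $dX = a(t)\,(\nabla\psi^*(Z/s(t)) - X)\,dt$, with the parameters $\eta(t) = 1$, $a(t) = 1/t$, $s(t) = \sqrt{t}$ of Corollary 2.

```python
import numpy as np

# Euler-Maruyama sketch of stochastic accelerated mirror descent (Euclidean
# stand-in) on f(x) = ||x||^2 / 2, persistent noise (alpha = 0):
# eta(t) = 1, a(t) = 1/t, s(t) = sqrt(t), r(t) = t.
rng = np.random.default_rng(1)
n, n_paths, sigma = 2, 200, 0.5
t0, T, dt = 1.0, 100.0, 0.01
steps = int(round((T - t0) / dt))

X = np.full((n_paths, n), 3.0)   # X(t0) = x0
Z = X.copy()                     # Z(t0) = z0
t = t0
for _ in range(steps):
    dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
    dZ = -X * dt + sigma * dB                    # dZ = -eta grad f(X) dt + eta sigma dB
    dX = (1.0 / t) * (Z / np.sqrt(t) - X) * dt   # dX = a(t) (grad psi*(Z/s(t)) - X) dt
    Z += dZ
    X += dX
    t += dt

f0 = 0.5 * n * 3.0 ** 2                               # f(x0)
f_end = 0.5 * float(np.mean(np.sum(X ** 2, axis=1)))  # Monte Carlo estimate of E[f(X(T))]
print(f0, f_end)                                      # f_end far below f0
```

Averaged over sample paths, $f(X(T))$ falls well below $f(x_0)$ despite the persistent gradient noise, consistent with the $O(1/\sqrt{t})$ in-expectation rate of Corollary 2 for this parameter choice.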
In the vanishing noise regime, adapting the choice of $\eta(t)$ to the rate of $\sigma_*(t)$ (with the appropriate averaging) leads to improved convergence rates, e.g. to $O(t^{\alpha - \frac{1}{2}})$ in expectation and $O(t^{\alpha - \frac{1}{2}}\sqrt{\log\log t})$ almost surely, when $\sigma_*(t) = O(t^\alpha)$. These asymptotic bounds in continuous time can provide guidelines for setting the different parameters of accelerated stochastic mirror descent.

It is also worth observing that in the deterministic case, one can theoretically obtain arbitrarily fast convergence through a time change, as observed by Wibisono et al. [2016]: a time change simply results in using different weights $\eta(t)$ and $a(t)$. This can also be seen in Corollary 1, where the rate $r(t)$ can be arbitrarily fast. In the stochastic dynamics, such a time change would also rescale the noise covariation, and does not lead to a faster rate. To some extent, adding the noise prevents us from "artificially" accelerating convergence using a simple time change.

Finally, we believe this continuous-time analysis can be extended in several directions. For instance, it would be interesting to carry out a similar analysis for strongly convex functions, for which we expect faster convergence rates.

Acknowledgments

We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). We thank the anonymous reviewers for their insightful comments and suggestions.

References

H. Attouch, J. Peypouquet, and P. Redont. Fast convergence of an inertial gradient-like system with vanishing viscosity. CoRR, abs/1507.04782, 2015.

A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. J. Mach. Learn. Res., 6:1705–1749, Dec. 2005.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, May 2003.

A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. SIAM, 2001.

A. Ben-Tal, T. Margalit, and A. Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM J. on Optimization, 12(1):79–108, Jan. 2001.

M. Benaïm. Dynamics of stochastic approximation algorithms. In Séminaire de probabilités XXXIII, pages 1–68. Springer, 1999.

M. Benaïm and M. W. Hirsch. Asymptotic pseudotrajectories and chain recurrent flows, with applications. Journal of Dynamics and Differential Equations, 8(1):141–176, 1996.

F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637–654, 1973.

A. Bloch, editor. Hamiltonian and gradient flows, algorithms, and control. American Mathematical Society, 1994.

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. CoRR, abs/1606.04838, 2016.

S. Bubeck, R. Eldan, and J. Lehec. Finite-time analysis of projected Langevin Monte Carlo. In Advances in Neural Information Processing Systems (NIPS) 28, pages 1243–1251, 2015.

A. Cabot, H. Engler, and S. Gadat. On the long time behavior of second order differential equations with asymptotically small dissipation. Transactions of the American Mathematical Society, 361:5983–6017, 2009.

X. Cheng and P. Bartlett. Convergence of Langevin MCMC in KL-divergence. CoRR, abs/1705.09048, 2017.

X. Cheng, N. S. Chatterji, P. L. Bartlett, and M. I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. CoRR, abs/1707.03663, 2017.

T.-S. Chiang, C.-R. Hwang, and S. J. Sheu. Diffusion for global optimization in $\mathbb{R}^n$. SIAM Journal on Control and Optimization, 25(3):737–753, 1987.

A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.

J. C. Duchi, A. Agarwal, M. Johansson, and M. Jordan. Ergodic mirror descent. SIAM Journal on Optimization (SIOPT), 22(4):1549–1578, 2010.

A. Durmus and E. Moulines. Sampling from strongly log-concave distributions with the unadjusted Langevin algorithm. CoRR, 2016.

A. Eberle, A. Guillin, and R. Zimmer. Quantitative contraction rates for Langevin dynamics. CoRR, 2017.

M. Errami, F. Russo, and P. Vallois. Itô's formula for $C^{1,\lambda}$-functions of a càdlàg process and related calculus. Probability Theory and Related Fields, 122(2):191–221, 2002.

U. Helmke and J. Moore. Optimization and dynamical systems. Communications and control engineering series. Springer-Verlag, 1994.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), pages 315–323, 2013.

W. Krichene, A. Bayen, and P. Bartlett. Accelerated mirror descent in continuous and discrete time. In NIPS, 2015.

W. Krichene, A. Bayen, and P. Bartlett. Adaptive averaging in accelerated descent dynamics. In NIPS, 2016.

G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.

A. Lyapunov. General Problem of the Stability of Motion. Doctoral thesis, 1892.

P. Mertikopoulos and M. Staudigl. On the convergence of gradient-like flows with noisy gradient input. CoRR, abs/1611.06730, 2016.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience series in discrete mathematics. Wiley, 1983.

Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372–376, 1983.

B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

B. Øksendal. Stochastic Differential Equations: An Introduction with Applications. Universitext. Springer, 2003.

G. Pavliotis. Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations. Texts in Applied Mathematics. Springer New York, 2014.

M. Raginsky and J. Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In CDC 2012, pages 6793–6800, 2012.

M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. CoRR, abs/1702.03849, 2017.

R. Rockafellar. Convex Analysis. Princeton University Press, 1970.

W. Su, S. Boyd, and E. Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In NIPS, 2014.

A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. CoRR, abs/1603.04245, 2016.

A. C. Wilson, B. Recht, and M. I. Jordan. A Lyapunov analysis of momentum methods in optimization. CoRR, abs/1611.02635, 2016.

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.