NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6867 Continuous-time Models for Stochastic Optimization Algorithms

### Reviewer 1

I enjoyed reading the paper immensely-thank you very much for the effort made to provide intuition and to link your work to both recent results as well as to classic techniques (like the ODE method). I have very little to ask here. As a question of more academic nature: the Gaussianity assumption makes perfect sense, and the explanation provided in 117-126 is more than sufficient. It is also great that intuition via the smoothness of the objective is given as the reason as to why an "algebraic" relationship exists between the rate proofs for the discrete and continuous cases. That said, I believe that it could be possible to make both of these concepts precise, i.e., establish a form of "mean field limit": scaling n, b, and h jointly one may be able to show that the resulting rescaled discrete trajectory converges to the trajectory of the SDE in a precise manner. Such "mean field" results exist for the ODE method (see, e.g., https://doi.org/10.1016/j.peva.2008.03.005 and https://doi.org/10.1017/S0143385798097557). I do believe that this may be a whole lot lot of work to prove something that one is already inclined to believe (in the limit, the LLN kicks in!), and this sense this effort may not be worth it-and I am not advocating that this is needed for the present submission. It may be however a future direction worth pursuing even in a simpler setting than MB-SGD or SVRG, that could potentially make the link between the discrete approach and the continuous approach both formal and airtight.

### Reviewer 2

This paper builds continuous-time models for SGD and stochastic variance reduction methods, by providing explicit expressions of the evolution equations. Based on these expressions, this paper analyzes the convergence rates of the continuous-time models under certain conditions, and shows strong connections in convergence rate between the continuous-time model and its discrete-time counterpart (SGD or SVRG). Strengths: > The paper successfully builds continuous-time models that take into consideration both a) decaying step-size and b) growing batch-size of the corresponding discrete-time algorithms. And it shows the existence and uniqueness of the continuous-time models. > The paper provides convergence rates of the continuous-time models, under certain conditions. Convergence rates for the corresponding discrete-time counterparts are also developed. Weaknesses: >The concept of building continuous-time models (or ODEs) for stochastic optimization algorithms is not novel. Simpler continuous-time models for SGD can be found in the reference [47] [48] of the paper, where the stochasticity of gradient is also represented by Brownian motion. The continuous-time models built in this paper can be seen as extensions of the model in [47]. >Building the continuous-time models in this paper is not technically hard. Compared to the models in [47][48], the models in this paper additionally incorporate time-dependent factors that originate from decaying step-size and growing batch-size, which can be done by simply adding two time-dependent functions.

### Reviewer 3

I have read the rebuttal and I believe the authors have satisfactorily addressed my comments on prior work, so I have increased my rating. =========== Reasons for contribution ratings: 1. The SDE approximation method is well-established. Moreover, Minibatch SGD's continuous approximation has been considered by several prior works, e.g. [A] below. 2. Matching bounds are obtained, which is interesting 3. The time-change observation is quite immediate, but landscape change is non-trivial and potentially significant -- although I do have some doubts on its validity in general (more below). Summary and review comments: The paper is well-written and one of its strengths in generally good comparison with prior work. The main theoretical results are: * SDE approximation for minibatch SGD and SVRG * Well-posedness of the SDEs * Matching convergence bounds using Lyapunov functions * Interpreting time-dependent adjustments as time-change and landscape-stretching. The main weakness is the lack of focus of the discussion. I feel that too many points are scattered and there lacks a central message on the insights gained. Below are some specific questions and concerns: 1. Line 100-101: The theoretical results in [36] and also [26] do not assume that the gradient noise $Z_k$ is Gaussian. The weak approximation results do not depend on the actual distribution of gradient noise, which only need to satisfy some moment conditions. These are always satisfied when the objective is of finite-sum form, as considered in this work. See also [B] below for more general statements. This part should be rephrased accordingly to properly represent the results of prior work. 2. Line 180: The assumption $H\sigma$ is quite restrictive, as even in the quadratic case, as long as the covariance of gradients are not constant you would expect there to be some growth. I suggest relaxing this condition by some localization arguments, since at the end your results only depend on $\sigma^*$. 3. Line 127-137: 1) The reference appears wrong, [37] does not talk about the convergence rate of SGD to SDE. 2) Note that in previous work, explicit bounds between expectations over arbitrary test functions (not just $||\nabla f||^2$) on SGD and SDE are established. These are not the same as the results presented in Appendix D, which are matching rates just on $||\nabla f||^2$ (not arbitrary test functions). Moreover, the presented results are not bounding the difference between the expectation iterates, but rather show them having similar rates. This is a weaker statement. In my opinion, this point should be better clarified to avoid confusion of what actualy is derived in this paper -- in fact, without looking at the appendix I thought that the authors obtained uniform-in-time approximation results for non-convex cases, which would certainly be interesting! As far as I know, so far only [C] provides such estimates, but require strong convexity. I suggest the authors make space for the statements of results in this section in the main paper, since you have mentioned this in your abstract as one of your main results. 4. Line 277-286: This is an interesting observation. However, I have some concerns on its validity in general settings. It is well-known that 1D SDEs with multiplicative noise can be written as a noisy gradient flow of a modified potential function, but this fails to hold in high dimensions. It appears to me that by assuming $H$ is diagonal and $\sigma$ is constant, we fall into the 1D scenario, but this analogy is not likely to generalize. Perhaps the authors can comment on this. 5. Minor typos: 1) Theorem B.2, assumption 1 should not have a square on the RHS. 2) line 194: know -> known References: [A] Smith, Samuel L., and Quoc V. Le. "A bayesian perspective on generalization and stochastic gradient descent." arXiv preprint arXiv:1710.06451 (2017). [B] Li et al. "Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations." Journal of Machine Learning Research 20.40 (2019): 1-40. [C] Feng, Yuanyuan, et al. "Uniform-in-Time Weak Error Analysis for Stochastic Gradient Descent Algorithms via Diffusion Approximation." arXiv preprint arXiv:1902.00635 (2019).