NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6831
Title: Beating SGD Saturation with Tail-Averaging and Minibatching

Reviewer 1

*********After author response**********

I thank the authors for their response. The response addressed the clarity issues I raised. While I understand the high-level intuition that tail-averaging allows larger step sizes by reducing the variance, the authors could do a better job of making this intuition clear in the paper and supporting it with a quantitative relationship between tail length and step size. Therefore, I keep my evaluation.

Originality: While there are many previous analyses of least-squares learning in Hilbert spaces, the finding that tail-averaging overcomes saturation is novel and interesting, providing a deeper understanding of the role of averaging in SGD.

Quality: I did not go through the appendices, but the analysis and proofs in the main paper are well-supported and sound. The experiments support the theory well.

Clarity: The paper is clearly written in general, and Section 3 provides nice intuition, but I found several small spots that confused me (see entries 2-6 in the "improvements" part).

Significance: The paper seems to suggest using tail-averaging instead of full averaging in practice, but the reason for using averaging in the first place is to allow larger step sizes. The paper is not clear about why tail-averaging allows larger step sizes than no averaging, which makes the result less significant.

Reviewer 2

*************After author response************

Thank you for the answer. I'll keep my mark and vote for accepting this paper.

In this paper, the authors provide a clear analysis of Stochastic Gradient Descent for least-squares in a non-parametric setting, covering the interplay of:
- the number of passes over the data set
- the size of the mini-batches
- the size of the tail-averages
- the step-size of the stochastic gradient descent

Originality: As far as I understand the paper, the analysis of each term of the SGD estimator does not seem to use new tools or genuinely new techniques, even if the analysis of the convergence of the population-risk gradient-descent iterates to the true minimizer with tail-averaging is new. Still, the techniques for bounding each term seem borrowed and adapted from previous papers analyzing SGD for least-squares problems; related papers are adequately cited.

Quality and clarity: Theoretically speaking, the paper is self-contained and provides proofs of all theorems and a clear discussion of all the assumptions made in the paper. Furthermore, despite the number of parameters involved in the analysis, the main results (Theorem 1 and Corollary 1) are very clear and clearly compared with related work. However, the experimental section may lack a real dataset where r can be computed and where we could see the difference between tail and uniform averaging. Moreover, while Theorem 1 and Corollary 1 address the interplay between four different parameters, the experimental section only provides a clear illustration of the difference between tail and uniform averaging (even if this is the main result of the paper).

Significance: The paper is very clear, very well-written, and is a clear summary of SGD for least-squares in the non-parametric setting, but it lacks truly novel ideas.
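For readers unfamiliar with the setup the review summarizes, the following is a minimal sketch of minibatch SGD for least squares with tail-averaging. It is not the authors' implementation; the function name, parameter names, and defaults (e.g. `tail_fraction`) are illustrative choices, and the tail average is simply the mean of the iterates from the last portion of the run.

```python
import numpy as np

def tail_averaged_minibatch_sgd(X, y, step_size=0.05, batch_size=16,
                                n_passes=20, tail_fraction=0.5, seed=0):
    """Sketch: run minibatch SGD on the least-squares objective
    (1/2n)||Xw - y||^2 and return the average of the iterates from
    the last `tail_fraction` of the updates (the "tail average")."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    iterates = []
    n_updates = n_passes * (n // batch_size)
    for _ in range(n_updates):
        idx = rng.integers(0, n, size=batch_size)      # sample a minibatch
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch_size       # least-squares gradient
        w = w - step_size * grad
        iterates.append(w.copy())
    tail_start = int((1 - tail_fraction) * len(iterates))
    return np.mean(iterates[tail_start:], axis=0)      # tail average
```

Setting `tail_fraction=1.0` recovers uniform (full) averaging of all iterates, so the same sketch can be used to compare the two schemes the review discusses.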

Reviewer 3

I am not an expert in this area. I have some questions for you: have you evaluated your method in real scenarios instead of only on synthetic data? Does the conclusion still hold for other objective functions, such as cross-entropy?