Paper ID: | 8746 |
---|---|

Title: | Momentum-Based Variance Reduction in Non-Convex SGD |

######################################################## I thank the authors for their response. Given the authors' promises to address the reviewers concerns, I continue to support acceptance of this paper. I agree with the other reviewers that a more thorough comparison with previous work such as SARAH should be added to the paper, and I think care should be taken in making the theorem statement more understandable. ######################################################## This is a very nice paper. The ideas are presented clearly and concisely. The proofs are accompanied by intuitive explanations. Most of all, the algorithm is straightforward and gives me confidence that it would be a good and robust approach to solving non-convex optimization problems in practice. The connection between momentum and variance reduction is particularly useful and may be quite useful for the analysis of other algorithms in the future. The one trouble spot is the theorem statement, which is quite confusing and difficult to parse. I found it very hard to make sense of all the parameters b, k, c, w, M, etc.

##### Thank you for your feedback. I agree with R3 that you did a poor job on relating your work to existing methods, in particular SARAH. Please also make sure that you carefully address the question of optimality. I also realized that your method in fact has nothing to do with momentum. Consider for instance deterministic objective, f(x, \xi)=f(x). If one has a tight estimate, i.e. d_{t-1}=\nabla f(x_{t-1}), then from your update rules it follows that d_t=\nabla f(x_t), i.e. the method become gradient descent with no momentum! Your title, thus, is very confusing and I highly encourage you to change it. Otherwise it will pollute the literature on real momentum methods. Please provide some basic extra experiments in the supplementary. I suggest you to 1) check the difference between uniform random sampling and random permutation 2) do some grid search to show how sensitive your method is to the parameter selection. I agree that the theory is not directly applicable here, but since you decided to do some experiments, it's better to do a proper study rather than a shallow observation that it works. ##### First of all, let me note that in work [1] there has been developed a very similar method. However, I believe it is rather a coincidence since 1) works use significantly different stepsizes, 2) work [1] uses two data samples per iteration, while this work only uses one, 3) the analysis is different. The fact that the method achieves optimal complexity for nonconvex optimization is the main reason I vote for acceptance. The method is simple, and it's clear that it brings variance reduction closer to being practical for training neural networks. The lack of experimental results disappoints me, but I still think the contribution is significant. Probably the most important part is that the authors manage to do here is working with expectations without large batches. However, based on the proof I feel that it's mostly due to assumption that all functions are almost surely Lipschitz in terms of their gradients (for instance it's used in Lemma 2, which is a variance bound, suggesting why Storm doesn't need large batches to control the variance). In contrast, SVRG and SARAH work without this assumption meaning that the results presented in this work are not strictly better. Several experimental details are missing. It's interesting what minibatch sizes the authors used, and otherwise the plot with the number of steps does not tell us how many passes over the data you made. How did you sample data during training? It's common to run SGD or Adam with random reshuffling, while your theory is for uniform sampling. Please range y-axis from 0.8 to 0.95 in Figure 1(c) to make the comparison clearer. Clearly, I wouldn't have had this questions if the authors provided their code. Are the authors going to release their code online along with values of c that they used to get the plots? Minor comment: in line 99, did you mean to write "e.g." instead of "i.e."? [1] Hybrid Stochastic Gradient Descent Algorithms for Stochastic Nonconvex Optimization

1) First, the reviewer thinks that, this algorithm combines both momentum and SARAH estimator in (Nguyen et al, 2017). So, it would be better if the authors rigorously discuss this. 2) On page 1, the authors write \nabla{F}(x) = 0 is a criterion that used in stochastic methods? I think it should be in some clear sense such as expectation, probability 1, etc. 3) I am not sure if [27] provides a lower bound complexity for expectation problems. It only considered a finite sum problem. So far I have not seen any results on lower bound complexity for the expectation problems under standard assumptions, i.e., smoothness and bounded variance. 4) The authors may be missing some recent works based on SARAH estimators such as SPIDER, SpiderBoost, and ProxSARAH which achieve optimal rate in the finite-sum case, and the best-known rate in the expectation case. It should be useful if these works are mentioned. Note that some of these methods can work with single-sample, and still achieve optimal rates. 5) I don't think it is proper to claim that STORM achieves optimal rate. First, it relies on an additional assumption: bounded gradient apart from other two assumptions. Second, I have not seen any paper studying the lower bound complexity for the expectation problems. So, it is a bit unclear if this complexity is optimal. 6) It is not clear about the assumptions in Section 3. It relies on the smoothness and bounded gradients with probability 1, which seems unclear to me to where it is used. Everything in the proof is relied on expectation. 7) The authors also claim that (page 2 and conclusion), Algorithm 1 resolves the tuning parameters, but this is not true. It also relies on several parameters, at least three: k, c, w.