NeurIPS 2020

On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems


Meta Review

The paper claims three contributions, answering three questions. 1) Convergence to critical points without a "bounded iterates" assumption. In general, I think removing the boundedness assumption is useful. However, Reviewer 3 pointed out that this work adds a "bounded level set" assumption and a bounded gradient assumption, which together may not be much of a relaxation of "bounded iterates". In particular, the gap between "bounded level set" and "bounded iterates" is quite technical (not necessarily small, but not necessarily large) and thus requires more explanation; a schematic comparison is given below.
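For reference, a schematic statement of the two conditions being compared (my own formulation, not quoted from the paper or the reviews): "bounded iterates" requires

    \sup_n \|x_n\| < \infty \quad \text{almost surely},

whereas a "bounded level set" (coercivity-type) assumption requires that every sublevel set \{x : f(x) \le c\} be bounded; the latter yields boundedness of the trajectory only once one also controls f(x_n) along the iterates, which is presumably where the technical gap lies.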
2) Avoiding strict saddles. R1 clarified that he was not claiming that "[17, 18] or Pemantle alone already covered this paper's result", but was asking whether a simple combination of [17, 18] and Pemantle [27] would give the result (which is why R1 used "morally" in the review). This is of course a valid question. The rebuttal argued that "in the stochastic case, the presence of a non-trivial center manifold requires fundamentally different techniques from differential geometry, as we explain in detail in Appendices D.2 and D.3", which I think provides a reasonable answer. All reviewers still appreciate the technical contribution of this part, and it is the main reason for acceptance.

3) Convergence rate to local minima. R3 pointed out three limitations: a) the result only covers strict local minima with a positive-definite Hessian; b) it is only a local rate; c) there is a hidden dependence on m. The first restriction is especially important. The rebuttal highlighted the improvement from the O(1/\sqrt{n}) rate in [14] to the O(1/n) rate in this paper. During the discussion, however, R1 pointed out that this comparison with [14] is unfair: "the O(1/\sqrt{n}) result in [14] is for finding a Second-Order Stationary Point (SOSP), … this paper assumes a locally strongly convex local min. Their O(1/n) rate should be a direct consequence of this local strong convexity assumption. Theorem 5 in [14] also gives a result for the locally strongly convex setting; since [14] studies the deterministic gradient case, it can even achieve an e^{-n} rate." The authors should highlight "strict local min" or "locally strongly convex local min" whenever they mention this convergence result (see the sketch at the end of this review).

R2 questioned the advantage over Bonnabel (2013). I think results on compact Riemannian manifolds do not directly imply results in Euclidean space and might require substantially different techniques, so the existence of Bonnabel (2013) is not a major issue. But as R1 suggested, please give more credit to Bonnabel and Pemantle, since they partially resolved (morally) the same problem.

The reviewers agree that this paper proves new results on a problem of interest to the community, and the merits outweigh the pitfalls, so I recommend acceptance. Nevertheless, please revise the paper to soften the claims regarding the boundedness assumption and the convergence rate result; for example, in the introduction, the authors should highlight that they only provide a partial answer to Questions 1 and 3, since these two parts are significantly weaker than Part 2.
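For context on the rate comparison in point 3 (an editor's sketch under standard textbook assumptions, not a claim taken from the paper or [14]): if f is \mu-strongly convex near the minimizer x^* and the stochastic gradients have variance bounded by \sigma^2, then SGD with step size \gamma_n = \Theta(1/(\mu n)) satisfies, once the iterates remain in that neighborhood,

    \mathbb{E}[\|x_n - x^*\|^2] \le \frac{C \sigma^2}{\mu^2 n} = O(1/n)

for some absolute constant C, whereas with exact (deterministic) gradients, an L-Lipschitz gradient, and a suitable constant step size, the iteration contracts geometrically, \|x_n - x^*\|^2 \le (1 - \mu/L)^n \|x_0 - x^*\|^2, which is the e^{-n}-type rate R1 refers to.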