Paper ID: 1410

Title: SpiderBoost and Momentum: Faster Variance Reduction Algorithms

The paper contains technically novel contributions and a comprehensive theoretical analysis covering many important scenarios. It is also well written. Thus, I recommend accepting the paper for publication at NeurIPS. Below are my related questions:

1. SpiderBoost uses sampling with replacement. Since sampling without replacement is known to induce lower sampling variance, can a similar analysis be applied to handle such a sampling scheme and obtain a better complexity result?
2. The final output is selected uniformly at random among all past iterates. Does this require storing all the generated iterates?
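The variance claim in question 1 is a standard fact about finite populations; a minimal numerical check (the population, minibatch size, and trial counts below are illustrative, not from the paper) is:

```python
import numpy as np

# Illustrative check: for estimating a mean over a finite population of size N,
# the sample mean from n draws WITHOUT replacement has lower variance than
# from n draws WITH replacement, by the finite-population correction
# factor (N - n) / (N - 1).
rng = np.random.default_rng(0)
population = rng.normal(size=100)   # N = 100
n, trials = 10, 20000               # minibatch size and Monte Carlo repetitions

means_with = [rng.choice(population, size=n, replace=True).mean()
              for _ in range(trials)]
means_without = [rng.choice(population, size=n, replace=False).mean()
                 for _ in range(trials)]

var_with = np.var(means_with)
var_without = np.var(means_without)
# Expect var_without < var_with, with a ratio near (100 - 10)/(100 - 1) ~ 0.91.
```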

Apologies: I reviewed all three papers in one document and then cut and pasted the comments in. Here are my comments specific to "SpiderBoost and Momentum: Faster Variance Reduction Algorithms":

* Line 273. Which recent studies?
* Why did the numerics not include MNIST? From Figure 3 of "Faster Stochastic Algorithms via History-Gradient Aided Batch Size Adaptation" I can see that SpiderBoost is not doing well relative to mini-batch SGD. Don't be afraid to include results where your algorithm does not look good. It is interesting to discuss why (you could, for example, point out that the cause is perhaps the large batch sizes).
* Table 1. Why \eps^{-2} + \eps^{-3}? Just write \eps^{-3}, since that term dominates as \eps \to 0.
* It appears test error was not included in the numerics.
* I would suggest including a standard, widely used algorithm in the comparisons, e.g., ADAM.
* Big-O notation is not used consistently throughout the three papers. For example, Theorem 1 of 1410 bounds K including the Lipschitz constant and the function-value gap, but the SFO complexity then drops the Lipschitz constant and function-value gap, which should be included in the expression. As a general rule of thumb, outside of an introduction, big-O notation should include all dependencies on Lipschitz constants, etc.
* Line 111. In the context of this paper (since the gradient is not rescaled), the step size of SPIDER is actually O(\eps / (L \|\nabla f\|)). To see this, compare lines 11 and 14 of Algorithm 1 of the SPIDER paper with Algorithm 1 in this paper. The step-size selection of the original SPIDER is poor when \eps / \|\nabla f\| is much less than one.
* Lines 46-50. I don't understand this reasoning. The proximal mapping is non-expansive, which appears to contradict your logic.

Reviewer comment: "For example, creating one long, high-quality paper." Authors' response to reviewer: [the authors were worried the paper would be too long].
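For reference, the non-expansiveness invoked in the last bullet is the standard property of the proximal operator of a proper, closed, convex function h:

```latex
\operatorname{prox}_{h}(x) \;=\; \arg\min_{u}\Bigl\{\, h(u) + \tfrac{1}{2}\|u - x\|^{2} \,\Bigr\},
\qquad
\bigl\|\operatorname{prox}_{h}(x) - \operatorname{prox}_{h}(y)\bigr\| \;\le\; \|x - y\|.
```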
This is just one recommendation; in this case it would probably be best to send it to a journal. Also, I am not suggesting copying and pasting the papers together, but merging them into one coherent framework and removing a large number of results that are not core to your thesis. You could also merge 1410 and 2460 into one paper. These are just suggestions; it is up to the authors to figure out the best strategy.

Authors: [1495 is very different.] Yes, I agree that 1495 is more different from 1410 and 2460 than they are from each other, but all three still share a lot of ideas and similar theoretical results.

Reviewer: "All three papers develop a practical **variant** of the SPIDER algorithm which achieves the same bounds as SPIDER." Authors' response to the above comment: "First, not all three papers focus on SPIDER algorithms. The other two papers study other aspects of stochastic algorithms (paper 2460 on batch size adaptation, and paper 1495 on reinforcement learning) and both analyze three representative stochastic algorithms SGD, SVRG and SPIDER as examples." The keyword in my comment is "variant," which I have starred. SVRG is also extremely similar to SPIDER; the main distinguishing feature of the original SPIDER paper was the better theoretical runtime bound.

Paper 1410 specific response to the authors. Reviewer: "The PL results are straightforward corollaries of the main results. I would suggest removing them." Authors: "This is not true! PL results are linear rates obtained under the PL condition, whereas the main results are sublinear rates obtained for nonconvex optimization. Clearly, PL results cannot be a special case of the main results." You misinterpreted what I meant. This, however, is my fault, as I should have been clearer. Given an algorithm with a sublinear iteration complexity of the form O\bigl((f(x_0) - f(x_*)) / \eps^{2}\bigr) to drive the gradient norm below \eps, there is a very standard (and short) argument using the PL inequality to obtain linear convergence.
In my opinion there are way too many papers on this (the proof is extremely similar to the linear convergence proof for gradient descent). For this reason, I don't think the result is very original, and it detracts from the main results.
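The short argument alluded to above can be sketched as follows (assuming f is L-smooth, satisfies the PL inequality with constant \mu, and is minimized by gradient descent with step size 1/L; the same template applies to other descent-type methods):

```latex
% Descent lemma with step size 1/L:
f(x_{k+1}) \;\le\; f(x_k) - \tfrac{1}{2L}\,\|\nabla f(x_k)\|^{2}
% Apply the PL inequality \|\nabla f(x)\|^{2} \ge 2\mu\,(f(x) - f^{*}):
\;\Longrightarrow\;
f(x_{k+1}) - f^{*} \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)\,\bigl(f(x_k) - f^{*}\bigr),
```

which telescopes to a linear (geometric) rate.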

Pros:
+ This paper is well written and easy to follow.
+ This paper builds on the recently proposed algorithms SARAH and SPIDER for non-convex problems. Its main contribution is a novel analysis, which allows the step size to be a constant O(1/L) rather than depending on the desired accuracy \epsilon.
+ The authors provide different extensions of their method: an extension to the proximal setting, momentum, and methods for online optimization.
+ All the rates outperform or match the current state of the art.

Cons:
- The constant step size 1/(2L) is enabled by a big minibatch of size \sqrt{n}, which could be a problem for the practical usage of this new method.
- The experimental results are not appealing, since the parameters are chosen neither based on theory nor by any tuning. Moreover, there is no comparison to the SAGA algorithm, which would also be a natural choice.

Minor:
- In Algorithm 3, Prox(x_k - \lambda_k v_k) is used twice; the second usage could be replaced by x_{k+1}.

*** I have read the authors' response and the other reviews and decided to keep my score. I strongly recommend that the authors include a minibatch version and remove the step size of 1/(2L) from the definition of the algorithm, to avoid confusion and make their work more transparent.
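The minor point about the duplicated prox evaluation can be illustrated with a small sketch (the soft-thresholding prox, the numbers, and the variable names here are illustrative, not the paper's actual algorithm):

```python
import numpy as np

def prox_l1(z, lam):
    # Soft-thresholding: proximal operator of lam * ||.||_1 (illustrative choice of h).
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical inner step of Algorithm 3: evaluate the prox ONCE and reuse the
# result x_next, instead of computing Prox(x_k - lam * v_k) a second time.
x_k = np.array([0.5, -2.0, 1.5])
v_k = np.array([1.0, -1.0, 0.0])
lam = 0.1

x_next = prox_l1(x_k - lam * v_k, lam)   # single prox evaluation -> x_{k+1}
g_k = (x_k - x_next) / lam               # e.g. a generalized gradient map, reusing x_next
```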