Reviews: On the Ineffectiveness of Variance Reduced Optimization for Deep Learning

####
Dear authors,

I discussed the work with the other reviewers and I have to keep my score. I'm glad you commented on the learning rate selection, because this was a major point of our discussion. The main reason I can't increase my score is that many definitions, explanations and experimental details are missing, making it extremely hard to evaluate the real value of your experiments. This was further complicated by the fact that you didn't provide your code when submitting the paper.

I hope that you will do a major revision, for example by including a section in the supplementary material with all experimental details. Just in case, here are my suggestions for some extra experiments:

1. Measure and plot the effect of data augmentation on the bias of the gradient. For instance, show how increasing the number of data augmentation samples in the big batch changes the estimate. Please try to do a more quantitative analysis of your statements about data augmentation.
2. Show how sampling the data in a permutation changes the performance of SVRG. Right now what you present is not aligned with the theory.

Please also proofread the paper to add all missing details pointed out by the reviewers. I wish I could give the paper a higher score, but there are multiple significant issues and I was convinced by the other reviewers that we cannot be sure that you will fix all of them.

####
This work provides a detailed study of the interaction between neural network gradients and the SVRG update rule. The message of the paper is very simple: SVRG doesn't work unless the dataset and the network are very simple.

A few comments, suggestions and questions.

1. The title reads "the ineffectiveness of variance reduced optimization", while the only update rule considered is that of SVRG. I agree that SAGA, for example, is unlikely to be useful here because of the memory limits, and its update rule is very similar to that of SVRG. However, there are many other variance reduction methods, notably SARAH [1], SPIDER [2], Hybrid SARAH-SGD [3] and STORM [4], whose updates are different and whose theory for the nonconvex case is better than that of SVRG or SCSG. It is at least unfair to the authors of these works to claim the ineffectiveness of all variance reduced methods when only one has been tested. I'd also like to note that S-MISO [Bietti and Mairal, 2017] can be applied with limited memory and might be of some interest (maybe more for future work). For instance, if one takes each image class in CIFAR-10 to be an expectation, S-MISO can be used with the memory of 10 networks, which might be fine for smaller nets. To be fair, my limited experiments with SARAH didn't show an advantage in image classification, so I don't think that the comment above is a big issue.
2. "Deep learning" is also not only about supervised image classification. In work [5], for example, a mix between SVRG and extragradient is designed to solve variational inequalities and in particular smooth formulations of GANs. This work has not been published and we do not know if its results are reliable, but I suggest the authors state the limitations of their experiments explicitly in the abstract to avoid confusing people.
3. I might have missed it, but I didn't find in the paper how the authors compute the expectation of the stochastic gradient. Did you use only one sample per image (i.e. one instance of data augmentation) or did you sample many?
4. There is a lack of detail about the augmentation and I believe more experiments are needed to draw conclusions here. Figure 1 is nice, but is it possible that by locking transforms you introduce a bias for every iteration in the epoch?
5. I wish there were a clear plot showing how the number of data augmentation instances affects the values of the variance and the expectation. Can you guarantee that your plots will not change dramatically if you change the number of augmentation samples?
6. Why are there no plots of test accuracy for the experiments that you did on ResNet-101?
7. Some implementation details are missing. You write about streaming SVRG that "often a fixed m steps is used in practice." Does this mean you use a fixed number of steps as well?
8. A similar question: how did you sample the data? When using SGD, researchers usually do not use uniform sampling, but rather iterate over a random permutation of the data. There is also theoretical evidence that SGD benefits a lot from this trick [6]. Is it possible that some of the issues could be fixed by using SVRG with different sampling?
9. Since the paper is only about experiments, I wish the authors had provided their code with this submission. In any case, I'd like to know if you are going to make your code available online together with the hyperparameters needed to reproduce the experiments.

typo: in line 8 you use "n" in the denominator and "N" in the summation limit.

[1] Inexact SARAH Algorithm for Stochastic Optimization
[2] SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator
[3] Hybrid Stochastic Gradient Descent Algorithms for Stochastic Nonconvex Optimization
[4] Momentum-Based Variance Reduction in Non-Convex SGD
[5] Reducing Noise in GAN Training with Variance Reduced Extragradient
[6] SGD without Replacement: Sharper Rates for General Smooth Convex Functions
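
Comment 8 above asks whether SVRG behaves differently when the inner loop iterates over a random permutation instead of sampling uniformly with replacement. The following is a minimal sketch of how the two schemes could be compared, on a toy 1-D least-squares problem rather than the networks studied in the paper. The function names, step size, epoch count and synthetic data are illustrative assumptions and are not taken from the submission.

```python
import random


def make_data(n, seed=0):
    """Toy 1-D least squares: f_i(w) = 0.5 * (a_i * w - b_i) ** 2 (assumed setup)."""
    rng = random.Random(seed)
    a = [rng.uniform(0.5, 1.5) for _ in range(n)]
    b = [2.0 * ai + rng.gauss(0.0, 0.1) for ai in a]
    return a, b


def grad_i(w, ai, bi):
    """Gradient of a single component f_i at w."""
    return ai * (ai * w - bi)


def full_grad(w, a, b):
    """Exact gradient of the average loss at w."""
    return sum(grad_i(w, ai, bi) for ai, bi in zip(a, b)) / len(a)


def svrg(a, b, lr, epochs, permute, seed=0):
    """SVRG with one snapshot per epoch and n inner steps.

    permute=True  -> inner loop visits a random permutation of the data
    permute=False -> inner loop samples indices uniformly with replacement
    """
    rng = random.Random(seed)
    n = len(a)
    w = 0.0
    for _ in range(epochs):
        w_snap = w                      # snapshot point
        mu = full_grad(w_snap, a, b)    # full gradient at the snapshot
        if permute:
            order = list(range(n))
            rng.shuffle(order)
        else:
            order = [rng.randrange(n) for _ in range(n)]
        for i in order:
            # Variance-reduced estimate: g_i(w) - g_i(w_snap) + mu
            g = grad_i(w, a[i], b[i]) - grad_i(w_snap, a[i], b[i]) + mu
            w -= lr * g
    return w


a, b = make_data(50)
# Closed-form minimizer of the average loss, used only to measure error.
w_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
err_uniform = abs(svrg(a, b, lr=0.1, epochs=40, permute=False) - w_star)
err_permute = abs(svrg(a, b, lr=0.1, epochs=40, permute=True) - w_star)
print("uniform sampling error:", err_uniform)
print("permutation error:     ", err_permute)
```

On a problem this simple both schemes should reach the minimizer, so the sketch only shows where the sampling choice enters the inner loop. Whether the choice matters for deep networks is exactly the open question raised in the comment.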

Originality
- Related work is cited throughout the paper, but the authors do not include a dedicated related work section. The authors should verify that they are the first to try SVRG on deep neural networks that use batchnorm, dropout, and data augmentation.
- To the best of my knowledge, the authors are the first to propose such changes and to implement and test the updated algorithm, as well as to compare the updated SVRG to SGD on a ResNet on ImageNet.

Quality
- The updates made to SVRG are technically sound, well motivated, and do address the issues with deploying the algorithm on modern deep learning architectures.
- The authors compare SVRG and SGD on a LeNet and a DenseNet for CIFAR-10 and a ResNet for ImageNet, which are reasonable datasets and architectures to test, but I would have liked to see more than one ImageNet architecture (and an architecture closer to 80% top-1 accuracy on ImageNet rather than 70%).
- The authors should compare SGD and SVRG only after tuning the learning rate, the momentum, and possibly even the learning rate decay scheme.

Clarity
- The paper is well written, and the explanations given by the authors are easy to read and informative. Clarity is a major strength of the paper.

Significance
- The results are important because they show that even if one attempts all these fixes to make SVRG amenable to modern deep neural networks, SGD still outperforms SVRG. This means that variance reduced optimization is less effective for this style of problem, and the authors provide very thorough and sensible explanations for why the algorithm fails.

Paper ID:	1029
Title:	On the Ineffectiveness of Variance Reduced Optimization for Deep Learning
