NeurIPS 2020

Towards Better Generalization of Adaptive Gradient Methods


Review 1

Summary and Contributions: The paper studies the use of differential privacy as a way to enforce generalization guarantees in adaptive gradient-based optimization methods. There has been an overwhelming flurry of papers on adaptive gradient-based optimization recently. Numerous variants have been proposed, but none stands out clearly and consistently. In particular, there has been concern that these newer adaptive methods do not even consistently outperform stochastic gradient descent when it comes to generalization error in machine learning settings (rather than optimization error on the training objective itself). The paper employs two natural differentially private algorithms for perturbing the gradients of the objective function. Via the known generalization guarantees of differential privacy, this implies that the gradients encountered by the algorithm behave like population gradients even if the optimizer makes multiple passes over the data. The main results of the paper show convergence guarantees to a stationary point of the population objective. Both differentially private methods roughly guarantee that the squared norm of the gradient shrinks as O(1/n^{2/3}). Re rebuttal: The authors' response only lightly touched on my review; it neither increased nor decreased my score.
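For readers less familiar with the mechanism, here is a minimal sketch of the kind of Laplace-noise gradient perturbation described above; the function name and the scale parameter `lam` are illustrative, not the authors' exact calibration.

```python
import numpy as np

def laplace_perturbed_gradient(grad, lam):
    """Add i.i.d. Laplace noise to each gradient coordinate.

    In a differential-privacy analysis, `lam` would be calibrated to the
    per-example sensitivity of the gradient and the privacy budget; here it
    is just an assumed scale parameter.
    """
    noise = np.random.laplace(loc=0.0, scale=lam, size=grad.shape)
    return grad + noise
```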

Strengths: This is a natural use of differential privacy for gradient-based optimization. In particular, the use of Thresholdout to filter large gradients is natural, and it is good to have a reference for this approach. The analyses seem sound and the proofs are relatively short and clean. The empirical results look promising.
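For reference, a rough sketch (my own paraphrase, not the authors' code) of how a Thresholdout-style check could be applied coordinate-wise to a gradient, comparing a training estimate against a holdout estimate:

```python
import numpy as np

def thresholdout_gradient(grad_train, grad_holdout, threshold, sigma):
    """Coordinate-wise Thresholdout-style filter for two gradient estimates.

    Coordinates where the training estimate stays within a noisy threshold of
    the holdout estimate keep the training value; coordinates that deviate too
    much fall back to a Laplace-perturbed holdout value.
    """
    gap = np.abs(grad_train - grad_holdout)
    noisy_threshold = threshold + np.random.laplace(scale=sigma, size=grad_train.shape)
    deviates = gap > noisy_threshold
    noisy_holdout = grad_holdout + np.random.laplace(scale=sigma, size=grad_train.shape)
    return np.where(deviates, noisy_holdout, grad_train)
```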

Weaknesses: In the best case, this is a great idea that will lead to a more principled way to make sure that adaptive gradient methods generalize well. However, a more realistic scenario is that this adds yet another set of heuristics to the flurry of adaptive methods with no clear winner. This is to some extent an issue with the research area: despite thousands of papers on this topic, there are relatively few stable facts, and this paper does not necessarily reduce the entropy. The techniques involved are fairly standard at this point. I had a really hard time reading the plots in Figure 2, but if I read them correctly, SAGD gets the lowest test perplexity in both cases, which is nice. I also took some issue with the last sentence of the broader impact section; it does not seem to be well thought out or supported by empirical facts.

Correctness: The bounds look reasonable given the techniques that are involved.

Clarity: The clarity could be improved.

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback: There is no code for the experimental evaluation.


Review 2

Summary and Contributions: The authors propose a new optimization algorithm called Stable Adaptive Gradient Descent (SAGD) that uses ideas from differential privacy and adaptive data analysis to improve the generalization performance of adaptive gradient algorithms. The authors provide theoretical analysis to show that both SAGD and minibatch SAGD converge to a population stationary point with high probability. Experimental results on neural networks are also provided.

========================================================================

I thank the authors for their response. Unfortunately, one of my main concerns (about the low numbers for their ResNet18 and VGG19 baselines) remains after the rebuttal. While the authors mention in their rebuttal that proper learning rate tuning was done, this does not explain why the numbers obtained are lower than what is standard for these models. On another look at the experimental section of the paper, I think the low numbers are explained by the fact that the authors train their models for only 100 epochs instead of the standard 200 epochs. This makes me less sure about the experimental results provided in the paper, and about whether the improvements shown for the proposed algorithm actually scale to standard training setups. I am therefore reducing my score from 7 to 6.

Strengths: Overall, I really like the main idea presented in the paper and its analysis. Being able to simulate gradients that are close to the population gradient with high probability throughout the course of optimization seems really interesting. This would be particularly useful for smaller datasets, where regularization becomes all the more important. This is confirmed by the experimental results as well, where SAGD seems to outperform other techniques when the dataset size is smaller. I found the bounds on the convergence to a population stationary point quite interesting as well. What is curious is that adding Laplacian noise to the minibatch gradient seems to fix the convergence issues of minibatch RMSProp that were raised by Reddi et al. in "On the Convergence of Adam and Beyond".
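To make the observation concrete, the update I have in mind is roughly minibatch RMSProp driven by a Laplace-perturbed gradient; the sketch below uses assumed hyperparameter names and is not the paper's exact algorithm.

```python
import numpy as np

def noisy_rmsprop_step(w, grad_fn, v, lr=1e-3, beta2=0.999, eps=1e-8, lam=1e-3):
    """One minibatch RMSProp step on a Laplace-perturbed gradient estimate."""
    g = grad_fn(w)                                            # minibatch gradient at w
    g_tilde = g + np.random.laplace(scale=lam, size=g.shape)  # DP-style perturbation
    v = beta2 * v + (1.0 - beta2) * g_tilde ** 2              # second-moment accumulator
    w = w - lr * g_tilde / (np.sqrt(v) + eps)                 # adaptive update
    return w, v
```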

Weaknesses: The main weakness of this paper is the experimental section. I would have liked to see more thorough and rigorous experiments. For example, how is SAGD affected by the size of the minibatches in these experiments? I am also not sure whether the baselines were tuned properly, because both ResNet18 and VGG19 should reach slightly higher test accuracies with SGD/Adam than those reported in the paper. My other question is: does RMSProp offer any particular advantage when used with DPG-LAG/DPG-SPARSE? The proposed method seems general enough to be used with any first-order optimization algorithm. If that is indeed the case, then it would have been good to also include experiments on how DPG-LAG/DPG-SPARSE might affect the generalization of other popular algorithms like SGD/Adam.

Correctness: Yes.

Clarity: Overall the paper is well written and easy to follow. My only comment about the writing is that, while it is true that adaptive gradient methods may generalize worse than SGD when only the learning rate is tuned, it is worth mentioning that if all of Adam's hyperparameters are tuned properly, it can often reach the generalization performance of SGD. This was shown by Choi et al. in "On Empirical Comparisons of Optimizers for Deep Learning".

Relation to Prior Work: See the answer to the question above. I am not familiar with related work in this area in the differential privacy or adaptive data analysis literature, so I cannot be sure that all related work in those areas was discussed.

Reproducibility: Yes

Additional Feedback: A couple of additional questions: 1. How do the high-probability bounds change when using mini-batches of size m? 2. Is data augmentation used in the experiments? Data augmentation is a popular technique for improving the generalization of neural network models, and it could be thought of as extending the dataset size, albeit in a more biased way. It would be interesting to compare the performance of DPG-LAG/DPG-SPARSE with simple data augmentation techniques.


Review 3

Summary and Contributions: The authors propose Stable Adaptive Gradient Descent (SAGD), which utilizes differential-privacy-based approaches to guarantee better generalization for adaptive gradient methods. The authors provide both theoretical analyses and comprehensive experiments to demonstrate the improvements in performance attained by SAGD. Post Rebuttal: The authors have addressed some of my concerns, but the lack of code and the other reviews lead me to keep my current score.

Strengths: The paper adopts the simple idea of adding Laplacian noise to a gradient estimate. This idea is inspired by differential privacy and yields a simple, practical variant of adaptive methods. The paper provides convergence results to first-order stationary points for the full-gradient version, the DPG-Sparse analog, and the mini-batch version. The main strength of this paper lies in the applicability of the approach to most existing popular adaptive methods. The experimental section is quite strong: the authors evaluate their results over many runs, report means and standard deviations, evaluate performance across different datasets, and perform extensive hyperparameter tuning.

Weaknesses:
1. It is unclear how guaranteeing stationary points with small gradient norms translates to good generalization. The bounds only indicate that these algorithms reach one of the many stationary points, and they do not address how reaching one of the potentially many population stationary points, especially in the non-convex regime, translates to good generalization. A remark on this would be helpful.
2. Lines 124-125: For any fixed w, Hoeffding's bound holds as long as the samples are drawn independently, so it is always possible to show inequality (2); stochastic algorithms moreover condition on the previous iterate, which should also guarantee that the Hoeffding inequality holds. It would be great if the authors could elaborate on this (the standard fixed-w bound is written out after this list for reference).
3. The bounds in Theorem 1 depend on d, which the authors have discussed. However, if \mu is small, the bounds are moot, and if \mu is large, the concentration guarantees are not very useful. Based on the values in Theorem 2, the latter seems to be the case.
4. It seems strange that the bounds in Theorems 2 and 4 depend not on the initialization w_0 but on w_1.
5. For the experiments on Penn Treebank, the algorithms do not appear to be stable with respect to train perplexity.
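For reference, the fixed-$w$ concentration statement behind point 2, for i.i.d. samples $z_1, \dots, z_n$ with coordinate-wise bounded gradients $\|\nabla f(w; z_i)\|_\infty \le G$, is

$$\Pr\left[\,\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla f(w; z_i) - \nabla F(w)\Big\|_\infty \ge t\,\right] \le 2d\,\exp\!\left(-\frac{n t^2}{2G^2}\right),$$

where $F$ is the population risk and $d$ the dimension. The usual caveat is that once the iterate $w_t$ is chosen adaptively based on the same samples, this fixed-$w$ bound no longer applies directly, which appears to be the gap the differential-privacy argument is meant to close.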

Correctness: There are some concerns about the claims made in this paper (see point 2 in Weaknesses). The empirical methodology is very sound, and the authors provide details of the hyperparameter tuning, applicability to different datasets, and comparison against a comprehensive set of algorithms. (A GitHub link to the implementation would be helpful.)

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: This needs improvement as the authors miss out on a lot of theoretical progress made in the field of adaptive gradient methods.

Reproducibility: No

Additional Feedback: One disadvantage with respect to reproducibility is that the authors have not shared the code. This has been an increasing concern in optimization papers; a link to the code would make the case for this paper stronger. There is a typo in Line 134.


Review 4

Summary and Contributions: *** POST AUTHOR FEEDBACK *** After reading the rebuttal, I still have some concerns about the experiments. I strongly recommend that the authors include a new experiment (see <2>) similar to [Wilson et al., 2017], since this new experiment would be more useful than the experiments given in the paper.

<1> SGD with Gaussian noise seems to be a straightforward baseline, which is missing in the experiments. Missing this baseline makes me wonder why DP is useful, since the focus of this paper is on generalization. Ideally, the authors should give a bound for this baseline from the DP perspective and improve it using DP (e.g., use Laplace noise instead of Gaussian noise). In the rebuttal, the authors claim that "We believe it is of great interest to show how Gaussian noise works in our setting." and "We consider the theoretical details and experimental results as a future work." The response could suggest that SGD+Gaussian noise is as good as the proposed method. Moreover, the effectiveness of SGD+Gaussian noise can be explained from a non-DP perspective, which makes me wonder why DP is useful in this setting.

<2> The experimental setup is non-standard, which is another concern. In the rebuttal, the authors claim that "[Wilson et al., 2017] mainly plotted the training/test accuracy against the number of epochs. We agree that it would be interesting to add experiments to compare SGD with differential privacy." However, it seems that the authors will not add a new experiment showing that SGD+Laplace noise gives a better generalization error than SGD or SGD+Gaussian noise in the standard setup. If the proposed method works well, it should be very easy to reproduce the results of [Wilson et al., 2017] alongside the proposed method. I think that in practice this new experiment would be more useful than the experiments given in the paper.

------------

In this work, the authors propose using differential privacy to generate noisy gradients and show that such a procedure can lead to better generalization.

Strengths: The authors show that using differential privacy can lead to better generalization. Furthermore, the authors show that differential privacy preserves the statistical nature of gradients with provable guarantees. Finally, the authors empirically demonstrate that an adaptive method built on differential privacy could lead to better test results. 

Weaknesses: The empirical experiment design mainly follows [Wilson et al., 2017]. However, the authors use different evaluation metrics, which makes the results incomparable to those in [Wilson et al., 2017]. It would be great if the authors reproduced most of the results in [Wilson et al., 2017] alongside the proposed method, using the same evaluation metrics for a fair comparison. The empirical evaluation is not very thorough (see the additional feedback).

Correctness: The proposed method seems to be solid. I did not check the theoretical results since I am not familiar with differential privacy.

Clarity: This paper is well written.

Relation to Prior Work: I am not aware of prior work using differential privacy to achieve better generalization.

Reproducibility: Yes

Additional Feedback: I have some questions. Does SGD using gradients generated via differential privacy (e.g., with Laplace noise) achieve the same generalization as the proposed method? In other words, does the method perform well if I set v_t to 0 (or set \beta_2 to 1) in Algorithms 1-3? A follow-up question is whether SGD with the gradient corrupted by Gaussian noise performs well. If so, why not use Gaussian noise, since in this setting the focus is on generalization rather than differential privacy? I think SGD with Gaussian noise is a straightforward baseline method whose behavior is also well studied; for example, see Mandt et al. 2016 [A Variational Analysis of Stochastic Gradient Algorithms]. The last question is whether the proposed method works well for small datasets in terms of generalization.
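For clarity, the baseline I have in mind is plain SGD with the gradient corrupted by Laplace or Gaussian noise; the sketch below is hypothetical (names and scales are assumptions, not the authors' algorithm).

```python
import numpy as np

def noisy_sgd_step(w, grad_fn, lr=0.1, scale=1e-3, noise="laplace"):
    """One SGD step on a gradient corrupted by Laplace or Gaussian noise.

    Setting v_t = 0 (equivalently beta_2 = 1) in an adaptive method reduces
    its update to this form, up to a constant rescaling of the step size.
    """
    g = grad_fn(w)
    if noise == "laplace":
        g = g + np.random.laplace(scale=scale, size=g.shape)
    else:
        g = g + np.random.normal(scale=scale, size=g.shape)
    return w - lr * g
```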