NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:6214
Title:Uniform convergence may be unable to explain generalization in deep learning

Reviewer 1

This paper emphasizes that uniform convergence is inherently problematic for explaining generalizaiton of deep learning, as even the algorithm-dependent application would fail – casting doubt on post-Zhang et al. [31] algorithm-dependent approaches. The whole paper is well written and the related results are important.

Reviewer 2

The paper is clearly written, and the claim of the title is original. However I don't think that the claim is correct.

Reviewer 3

== SUMMARY == There has been much recent work that tries to explain why over-parameterized neural networks can generalize to unseen data, when classical learning theory says that they should not. This paper argues that the problem is uniform convergence, that there are certain scenarios in which it will fail to yield meaningful bounds, even when the learned model does generalize. The key observation is that the norm of the learned weights can grow with the number of examples. The paper demonstrates this empirically, then provides a theoretical construction that proves it for high-dimensional linear models. It also conducts an experiment to support the claim for 1-layer ReLU networks. The idea that uniform convergence bounds are inadequate for explaining generalization in some models/algorithms is not new. For example, it's well known that the 1-NN classifier has infinite VC dimension. But, it is surprising that the weight norm can grow with the number of examples. Then again, the construction that's used to show this is a bit of a "straw man." It doesn't use any regularization, the whole point of which is to constrain the norm of the weights. So, the paper has shown that uniform convergence fails to explain generalization in pure ERM, but it says nothing about SRM (structural risk minimization). I guess what I'm trying to say is that, while there _exist_ algorithms that can generalize for which uniform convergence fails, there are other algorithms that may generalize just as well for which uniform convergence _does_ hold. I appreciate the conclusion that there should be more research into algorithmic stability, but I don't think it's fair to say that it has "not been as extensively studied for deep networks." Hardt et al.'s paper (2016) sparked a lot of follow-up work. I think the big question in stability analysis is whether the common assumptions and constants (e.g., the Lipschitz constant of the network) are reasonable in deep learning. I enjoyed reading this paper. I like that it has a bold message that calls into question a lot of recent work. It's very well written, clear and easy to follow. The theoretical results are stated cleanly, with minimal notation. The experimental methodology is explained fairly well. I appreciate that the technical proofs were deferred, but that a proof sketch was provided. == DETAILED COMMENTS == For what it's worth, the construction used to prove Theorem 3.1 is basically Perceptron, only with updates on correct predictions too. Based on the way equations are typeset (especially in the appendix), it looks like the paper was originally written for a 2-column format. I suggest getting rid of some of the line breaks to take advantage of the single column format. The spade notation used in line 150 is fun, but seems a bit out of left field. Line 169: needs punctuation between "bounds namely". Line 251: "initialized to origin" --> "initialized to the origin". == POST AUTHOR RESPONSE == My main concern was that analyzing pure ERM (i.e., with no regularization) is a bit of a straw man. The author response reminded me that many prior works have studied this setting, and that it is common to train DNNs that generalize with no regularization. I am therefore raising my score to 8. That said, I would still like to see more discussion of regularization in the paper -- at the very least, an acknowledgement that capacity control would enable generalization with uniform convergence guarantees. It would be interesting to analyze learning with very small amounts of regularization (e.g., lambda = 10^{-5}) -- which is also a very common scenario -- and see whether uniform convergence with meaningful rates is possible there. Lastly, I do hope that the authors amend the paragraph about stability research in the conclusion. As written, it minimizes an extensive body of work.

Reviewer 4

Originality: I have seen a presentation/workshop paper on the same topic (I believe by the same authors) in the ICML2019 workshop on generalization in deep learning. Other than that, I am not familiar with earlier work presenting similar counterexamples showing UC cannot explain generalization. Clarity: The paper is mostly clear and easy to read. Quality: The proofs of the main theoretical results are correct. The results seem to be quite useful and informative. I believe that the paper is of relatively high quality. There was one point of confusion for me regarding Pseudo-Overfitting. Clarifying this concept and the framing of counterexamples presented with respect to this concept would be required (but possibly not sufficient) for me to increase my ranking to "Top 15%". From my understanding the counterexamples provided *do* pseudo-overfit, while the paper mentions (and purports to demonstrate in appendix B) that deep networks do not pseudo-overfit. This confusion casts doubt on the relevance of the counterexamples to actual models used in practice. I think that this could be clarified further by the authors. Significance: The work is significant as it informs theorists that certain tools may not be useful in their research. This is significant, IMO, to warrant acceptance at NeurIPS. It does not, however, provide an alternative set of tools for theorists to use when UC is not necessary. This would be required for me to consider this a "Top 5% paper". I imagine that the counterexamples presented in this work could become "standard text book examples" in the future since they are clear and easy to follow, and also readily demonstrate the phenomenon of interest.