NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, Vancouver Convention Center
Paper ID: 4627
Title: Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

Reviewer 1

I enjoyed reading this paper. It builds on the recent NTK paper and develops the rather surprising theory that, for wide neural networks, the gradient descent dynamics of a deep network are very close to the dynamics of a simple linearized model. I am also impressed by the empirical results, since the experiments corroborate the theory even for realistic-sized networks. Given that NTK-related results are receiving much attention these days, I believe this paper would be worth reading for many people. I vote for acceptance.

Minor comments:
- I believe \hat{\Theta}^{(n)} is not defined anywhere. Does it denote the empirical tangent kernel of a width-n network?
- Line 127: There is no e^{-\eta \Theta_0 t} term in Eqs. 2 and 3? (See the sketch below.)
- Line 263: Effects of depth? Not width? The paragraph does not contain any discussion of depth.
- Figure 4: The dashed lines are almost invisible; can you improve their readability?

[After rebuttal] I have read the authors' response and the other reviews, and I think my concerns were well addressed. I will keep my score.
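Regarding the e^{-\eta \Theta_0 t} comment: for concreteness, this is the closed form I have in mind for the linearized dynamics under squared loss (my own rendering, assuming \hat{\Theta}_0 is the empirical tangent kernel at initialization, \mathcal{X}/\mathcal{Y} are the training inputs/targets, and training is by gradient flow with learning rate \eta):

f^{\mathrm{lin}}_t(\mathcal{X}) = \big(I - e^{-\eta \hat{\Theta}_0 t}\big)\,\mathcal{Y} + e^{-\eta \hat{\Theta}_0 t} f_0(\mathcal{X})
f^{\mathrm{lin}}_t(x) = f_0(x) + \hat{\Theta}_0(x, \mathcal{X})\,\hat{\Theta}_0^{-1}\big(I - e^{-\eta \hat{\Theta}_0 t}\big)\big(\mathcal{Y} - f_0(\mathcal{X})\big)

i.e. the e^{-\eta \hat{\Theta}_0 t} factor governs how quickly the linearized predictions interpolate from f_0 to the kernel-regression solution, so I expected to see it in Eqs. 2 and 3.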

Reviewer 2

The paper shows that the dynamics of gradient descent for optimizing an infinite-width neural network can be explained by the first-order Taylor expansion of the network around its initial parameters. Furthermore, it shows that when the loss is the squared loss, the dynamics admit a closed-form solution. It also provides a learning rate threshold such that, whenever the learning rate is smaller than that threshold and the network is sufficiently wide, the trajectory of gradient descent stays in a neighborhood of the trajectory of gradient descent on the linearized neural net. Finally, it shows that the prediction of a neural network is described by a Gaussian process as the width goes to infinity.

The paper is well written and the insight is very interesting. I enjoyed reading the paper, and I think the contributions may be significant.

Q1: The theorem requires that \Theta is full rank. Does this hold in practice? Is it a strong assumption? It looks like the authors did not check this assumption in the experiments.

Q2: (line S84) Can you explain how Theorem G.3 is used to obtain (S84)?

Typos:
- (S77) \Theta_t should be \hat{\Theta}_t.
- (S96) Should there be a distance factor \| \theta - \tilde{\theta} \|?
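To make the object under discussion concrete, here is a minimal sketch (my own JAX code, not the authors') of the linearized model f^{lin}(x; \theta) = f(x; \theta_0) + \nabla_\theta f(x; \theta_0)(\theta - \theta_0) for a toy two-layer ReLU network; the architecture and widths are placeholders:

import jax
import jax.numpy as jnp

def init_params(key, d_in=10, width=512):
    # NTK-style parameterization: weights are N(0, 1); the scaling happens in the forward pass.
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (d_in, width)),
            "W2": jax.random.normal(k2, (width, 1))}

def f(params, x):
    h = jnp.maximum(x @ params["W1"] / jnp.sqrt(x.shape[-1]), 0.0)
    return h @ params["W2"] / jnp.sqrt(h.shape[-1])

def f_lin(params, params0, x):
    # First-order Taylor expansion of f around params0, evaluated at params.
    dparams = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
    y0, jvp_out = jax.jvp(lambda p: f(p, x), (params0,), (dparams,))
    return y0 + jvp_out

key = jax.random.PRNGKey(0)
params0 = init_params(key)
x = jax.random.normal(jax.random.PRNGKey(1), (4, 10))
# At theta = theta_0 the two models agree exactly.
print(jnp.allclose(f(params0, x), f_lin(params0, params0, x)))

At \theta = \theta_0 the two models agree exactly; the paper's claim is that for sufficiently wide networks they also stay close along the entire gradient descent trajectory.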

Reviewer 3

The paper is carefully proofread, well structured, and very clear. The experiments are clearly described in detail and provide relevant results. Below we outline some detailed comments on the results.

1) Relation to "On Lazy Training in Differentiable Programming" by Chizat and Bach. Some of the main results of this paper are very similar to those proved by Chizat and Bach. In particular, Chizat and Bach prove that the training of an NTK-parameterized network is closely modeled by "lazy training" (their terminology for a linearized model). Their paper is not referenced in the related work section, which seriously detracts from the novelty of the submission.

2) Applicability of the proven results to modern networks. The authors claim that the NTK parameterization closely models the modern neural networks used in practice. While it is true that the 1/sqrt(n) scaling is used in many modern networks, the optimization algorithms, and in particular the learning rates used during training, may invalidate the assumptions of this paper. In the aforementioned paper, Chizat and Bach present and cite empirical evidence that linearized networks do not model modern networks. For instance: a) Modern neural networks for image tasks have layers that learn interesting (non-random) filters. b) While the "Are all layers equal" paper does show that most layers can be reinitialized, this is not true for all layers; in particular, the first layer of each residual block in a residual network cannot be reinitialized, and such layers show more parameter movement than the others. The parameters in the first layer of the network, especially, tend to move substantially in comparison to the other layers.

3) Experimental results. While experimental results are presented for the CIFAR-10 dataset and do show that NTK-parameterized models track linear models well, this does not provide convincing evidence that NTK-parameterized networks are a good model for modern networks. In particular, the test accuracies of the wide residual networks used are far from state of the art.
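As a point of reference for comment 2, here is a minimal sketch (my own, with illustrative widths) contrasting the NTK parameterization with the standard parameterization of a single dense layer: the 1/sqrt(n) factor multiplies the forward pass rather than being folded into the weight initialization.

import jax
import jax.numpy as jnp

def dense_standard(key, d_in, d_out):
    # Standard parameterization: the 1/sqrt(fan_in) scale is folded into the initialization.
    W = jax.random.normal(key, (d_in, d_out)) / jnp.sqrt(d_in)
    return lambda x, W=W: x @ W

def dense_ntk(key, d_in, d_out):
    # NTK parameterization: weights are O(1) at initialization and the
    # 1/sqrt(fan_in) scale multiplies the forward pass instead.
    W = jax.random.normal(key, (d_in, d_out))
    return lambda x, W=W: x @ W / jnp.sqrt(d_in)

key = jax.random.PRNGKey(0)
x = jax.random.normal(jax.random.PRNGKey(1), (2, 128))
# With the same key the two layers compute the same function at initialization;
# they differ in how gradients with respect to W, and hence the training
# dynamics, scale with width.
print(jnp.allclose(dense_standard(key, 128, 256)(x), dense_ntk(key, 128, 256)(x)))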