NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Study of the Neural Tangent Kernel (Jacot et al., 2018) is a compelling approach to understanding the dynamics of learning in neural networks. As acknowledged by the authors, while this kernel determines the dynamics only for very wide networks, a better understanding of this simplified regime could be a first step towards a fuller understanding of learning dynamics more generally. In this work, the authors study the smoothness, approximation, and stability properties of this kernel for shallow fully connected and convolutional networks. It has been recognized for some time that, since overparametrized neural networks do not overfit, the choice of architecture and/or optimization algorithm must bias the solutions found towards ones that generalize well. In the past, it has been difficult to understand this implicit bias except in very simple architectures, and studying it in the wide-network regime is another step towards a better understanding of this phenomenon.

The presentation is very clear for the most part, and the authors make an effort to distinguish their contributions from previous work. They are also honest about the limitations of the setting of the analysis (the only exception is perhaps the issue of the decomposition in terms of spherical harmonics, see below).

When presenting the Mercer decomposition in terms of spherical harmonics in Proposition 5, isn't one assuming a uniform data distribution over the sphere? If this were not the case, the RKHS in eq. 11 would take a different form (since one could choose f to be supported on the entire sphere). If this assumption is made elsewhere in the text, it would be good to state it clearly in this section. It is also useful to note that in general one would expect realistic data distributions to be supported on low-dimensional subsets of the sphere, which limits the applicability of the results in this section.

Given the known stability results for convolutional networks, is it surprising that the convolutional NTK feature map is stable as well?
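For reference, a sketch of the standard Mercer setup I have in mind (my notation; the paper's eq. 11 and its normalization may differ): under a uniform distribution on the sphere S^{d-1}, a dot-product kernel decomposes as

    K(x, y) = \sum_{k \ge 0} \mu_k \sum_{j=1}^{N(d,k)} Y_{k,j}(x) Y_{k,j}(y),   x, y \in S^{d-1},

where the Y_{k,j} are spherical harmonics of degree k and \mu_k \ge 0 are the eigenvalues, and the induced RKHS consists of functions f = \sum_{k,j} a_{k,j} Y_{k,j} with

    ||f||_H^2 = \sum_{k : \mu_k > 0} \sum_{j=1}^{N(d,k)} a_{k,j}^2 / \mu_k < \infty.

If the data distribution is not uniform, the eigenfunctions are in general no longer the spherical harmonics and the norm takes a different form, which is the source of the concern above.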
Reviewer 2
Post author response: After reading the other reviews and the authors' response, my evaluation still holds. I thank the authors for the thoughtful response and look forward to seeing future developments in this direction.

______________________________________

In the infinite-width limit, a neural network's gradient-flow training dynamics are well captured by the Neural Tangent Kernel (NTK). The advantage of this picture is that it offers an angle for understanding deep neural networks through the study of kernel properties, which has been a hard theoretical challenge. Although the applicability of "lazy training" via the NTK to realistic models and datasets is still not clear, a good understanding of this kernel, which captures neural network training in a certain limit, is an important theoretical landmark. The authors set out to study the inductive bias of the NTK, which is a very timely and important contribution.

The authors show that: 1) the NTK of two-layer ReLU networks is not Lipschitz smooth but satisfies a weaker Hölder smoothness property; 2) a study of the stability of CNN NTKs shows that they are less stable than kernels obtained by fixing the weights of all layers except the last; and 3) a spherical harmonics decomposition shows that the eigenvalues decay more slowly than those of the arc-cosine kernel, which corresponds to a ReLU network with all weights fixed except the last layer. An interesting observation pointed out by the authors is the tradeoff between stability and approximation that these findings reveal: the better approximation properties captured by the NTK are traded off against less stable/smooth behavior.

While a few recent concurrent works discuss the NTK for convolutional networks, the current submission also, and independently, provides a definition of the NTK for CNNs, generalizing to linear operators that have not been considered in other works, such as patch extraction and pooling operators.

One obvious limitation is that the analysis, especially the spherical harmonic decomposition, assumes that the input data are sampled uniformly from the hypersphere. For real datasets, it is unclear how the kernel's spectral properties would be similar to or different from this simple toy data domain. While all of the work studies the properties of these kernels theoretically, it is a weakness that no empirical support is shown for the inductive bias described in the paper.

Nit: Line 167, "NTK kernel" repeats "kernel" twice.

For references to the GP limit of neural networks, one should also cite [1] along with the papers already cited.

[1] Alexander Matthews et al., "Gaussian Process Behaviour in Wide Deep Neural Networks," ICLR 2018.
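As an aside on point 1): the Hölder-versus-Lipschitz contrast can be checked numerically from standard closed forms for these kernels on the unit sphere. The sketch below is my own illustration, not code from the paper; it uses the usual normalized arc-cosine expressions, whose normalization may differ from the paper's.

# Numerical sketch: two-layer ReLU NTK on the unit sphere vs. the kernel with
# all weights fixed except the last layer (arc-cosine kernel of degree 1).
import numpy as np

def kappa0(u):
    # normalized arc-cosine kernel of degree 0: (pi - arccos(u)) / pi
    return (np.pi - np.arccos(np.clip(u, -1.0, 1.0))) / np.pi

def kappa1(u):
    # normalized arc-cosine kernel of degree 1
    u = np.clip(u, -1.0, 1.0)
    return (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u ** 2)) / np.pi

def ntk(u):
    # two-layer ReLU NTK for unit-norm inputs with cosine u = <x, y>
    return u * kappa0(u) + kappa1(u)

def feature_distance(kernel, u):
    # RKHS distance between feature maps: ||Phi(x) - Phi(y)||
    return np.sqrt(np.maximum(2.0 * kernel(1.0) - 2.0 * kernel(u), 0.0))

for theta in [1e-1, 1e-2, 1e-3, 1e-4]:
    u = np.cos(theta)                 # <x, y> for unit vectors at angle theta
    dxy = np.sqrt(2.0 - 2.0 * u)      # Euclidean distance ||x - y||
    r_ntk = feature_distance(ntk, u) / dxy
    r_rf = feature_distance(kappa1, u) / dxy
    print(f"theta={theta:.0e}  ||Phi(x)-Phi(y)||/||x-y||: NTK={r_ntk:.1f}  fixed-weights={r_rf:.2f}")

# The NTK ratio grows like ||x - y||^{-1/2} (Hölder-1/2, not Lipschitz), while
# the fixed-weights kernel ratio stays bounded, illustrating point 1) above.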
Reviewer 3
I have read the response.

----------------------------------------

This paper makes strong contributions to the recent line of work that demonstrates the equivalence between over-parameterized neural networks and a new class of kernels, neural tangent kernels (NTK). This new class of kernels is different from classical kernels like the Gaussian kernel, and we do not have a good understanding of why it gives better performance than classical kernels. This paper initiates a rigorous study of the properties of the NTK.

This paper provides a smoothness analysis of the NTK induced by a two-layer neural network. The result may be useful for understanding the robustness of neural networks as well. The paper further provides an approximation analysis of the NTK. I really like this result, as it demonstrates the advantage of the NTK compared with previous kernels. The analyses of the smoothness and stability of the CNTK are also interesting, and the authors did a good job of establishing the connection with previous work on CNNs in this direction.

Overall I really like this paper and recommend acceptance.

Minor:
1. If I understand correctly, the proof for deriving the CNTK relies on the sequential limit (taking the layer widths to infinity one at a time). This should be stated explicitly in the final version.