NeurIPS 2020

On the distance between two neural networks and the stability of learning


Meta Review

Three reviewers indicate "weak accept". They acknowledge the theoretical derivation and the proposed algorithm. One reviewer is still concerned that "ignoring the exponentially growing constant significantly undermines the theory", a concern I share. I read the paper and found a few other issues:

-- The fact that the fixed point (after adding the prefactor) is not a stationary point is not satisfactory. I think there is a way to resolve this, though I cannot think of it now.

-- R1 asked why the theory requires a layer-dependent eta while the simulations show that 0.01 works across different tasks. The rebuttal adds experiments showing that deeper MLPs require a smaller eta, which somewhat matches the theory; this is good and should be added to Figure 3. Figure 3 (right) shows 0.01 working for three tasks; I thought this was because they use a similar number of layers, but the authors seem to regard it as a "pleasant surprise". If the three tasks (classification, GAN, transformer) use similar depths, please add "the # of layers are ..." to the caption and explain that this matches the theory; if they do not, then Figure 3 (right) is somewhat misleading and the caption needs more discussion (e.g. add "this is unexpected and not predicted by the theory"), since otherwise readers may treat Figure 3 (right) as validating the "less hyperparameter tuning" claim in the abstract (or become confused if they read "one tuning hyperparameter" but now find that no tuning seems to be required). A sketch of the layer-wise relative update I have in mind is appended after this review.

-- It is not clear what the authors mean by "only one hyperparameter". If one uses SGD, there is still only one hyperparameter. It seems the authors are comparing against Adam or other more complicated methods that require tuning. If the authors mean that SGD requires tuning a schedule such as "decrease xx times at xx epochs", then Fromage also did this.

-- Quasi-exponential growth is expected even without the derivation in this paper; it is a well-known consequence of accumulation across layers. The figure still shows locally quadratic behavior, and that is what people typically refer to.

-- In Thm 1, an unstated assumption is that the weight matrix should be square or fat (or W' is fat; only one shape is possible) to ensure that the "condition number kappa is finite". I suggest defining the "condition number of W" more rigorously in Thm 1 and stating the restriction on the weight matrices explicitly; one possible formalisation is sketched after this review.

-- Showing loss values but not accuracy is rather unconventional. Training and/or test accuracy needs to be added.

-- Why check a conditional GAN rather than standard GAN experiments as in the SN-GAN paper? The latter is more standard for a generic paper. The architecture and hyperparameters are important factors for GAN results; e.g. using the same learning rate for G and D is known to be sub-optimal (see TTUR for GANs; a minimal sketch of that setup is also appended below). Please modify the paper accordingly.

Despite these issues, the paper directly relates a theoretical analysis to a new algorithm, which is interesting. In addition, the simulation results are quite good (though limited). For these reasons, I recommend acceptance.
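
Appended note on the layer-dependent eta point: below is a minimal PyTorch sketch of the kind of layer-wise relative update I have in mind, written from my reading of the paper rather than copied from it, so the exact form (in particular the prefactor) should be checked against the authors' algorithm; the depth-dependent choice of eta is only an illustration of the theory's suggestion that deeper networks need smaller steps.

    import torch

    @torch.no_grad()
    def layerwise_relative_step(params, eta):
        """One layer-wise relative update (a sketch; verify against the authors' algorithm)."""
        prefactor = 1.0 / (1.0 + eta ** 2) ** 0.5  # the prefactor discussed in the first point above
        for w in params:
            if w.grad is None:
                continue
            w_norm, g_norm = w.norm(), w.grad.norm()
            if w_norm > 0 and g_norm > 0:
                # Take a step whose *relative* size ||delta W|| / ||W|| is roughly eta for every layer.
                w.add_(w.grad, alpha=-eta * (w_norm / g_norm).item())
            else:
                w.add_(w.grad, alpha=-eta)
            w.mul_(prefactor)

    # Illustration of the depth dependence only: the theory suggests eta should shrink as the
    # network gets deeper, e.g. eta = eta0 / depth for some base rate eta0 (my choice, not the paper's).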
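
Appended note on the condition-number point in Thm 1: one standard way to state the definition and the shape restriction (my suggested formalisation, not the paper's wording) is the following, for a weight matrix W in R^{m x n}:

    % kappa(W) in terms of the largest and smallest of the min(m,n) singular values:
    \kappa(W) = \frac{\sigma_{\max}(W)}{\sigma_{\min}(W)},
    \qquad \text{finite iff } \operatorname{rank}(W) = \min(m, n).
    % If, in the proof, sigma_min must instead lower-bound \|Wx\| / \|x\| (respectively
    % \|W^\top y\| / \|y\|), then W additionally has to be square or tall (respectively
    % square or fat), which is the shape restriction referred to above.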
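
Appended note on the GAN learning rates: the TTUR point simply amounts to giving the discriminator and the generator separate (typically unequal) learning rates. A minimal PyTorch sketch with purely illustrative models and values, none of which are taken from the paper:

    import torch

    # Stand-in generator / discriminator modules, not the paper's architecture.
    G = torch.nn.Sequential(torch.nn.Linear(128, 784), torch.nn.Tanh())
    D = torch.nn.Sequential(torch.nn.Linear(784, 1))

    # TTUR: two time-scale update rule -- separate learning rates for D and G.
    # The 4e-4 / 1e-4 split is a common illustrative choice, not a recommendation from this paper.
    opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9))
    opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))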