NeurIPS 2020

### Review 1

Summary and Contributions: Post-rebuttal: Thank you for the rebuttal. I have maintained a recommendation for accept but lowered my score from 8 to 7 after reading the other reviewers' critiques. Pre-rebuttal: The authors conduct an empirical study into the effects of linear overparameterisation in nonlinear CNNs. They find that overparameterisation can lead to modest performance gains. They also conduct ablation studies into how overparameterisation affects generalisation error, the gradient distribution, and minima sharpness/flatness. Finally, they run some experiments targeted at pinpointing the exact cause of the performance improvements and conclude it is indeed overparameterisation.

Strengths:
- The central message of the paper is clear: they explore overparameterisation and find it helps.
- The methods are straightforward and I can see such simple methods being easily adopted by the community. I think this sort of work is relevant in its applicability.
- The experiments appear quite thorough and I believe I could reimplement them.
- My favourite part of the paper is the ablation studies, where I felt I learnt the most. A couple of key phenomena are observed, such as tighter generalisation gaps, improved gradient confusion, and reduced side-lobes in pairwise gradient similarity.
- I also really enjoyed page 8 concerning the two rejected hypotheses. This sort of empirical work, in my eyes, is very valuable for the community and adds significance to the paper.
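For readers unfamiliar with the gradient statistics mentioned above, the following is a minimal sketch of how pairwise gradient cosine similarity and its minimum can be computed. It is not the authors' code: the function names are hypothetical, and the three toy vectors are made-up stand-ins for flattened per-mini-batch gradients. Which parameters are flattened into each vector is left open here.

```python
import numpy as np

def pairwise_cosine(grads):
    """Cosine similarity between every pair of rows.

    grads: array of shape (num_batches, num_params), one flattened
    gradient per mini-batch (hypothetical layout, assumed for this sketch).
    """
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)  # guard against zero gradients
    return unit @ unit.T

def min_offdiag(sim):
    """Smallest similarity between two *different* mini-batches."""
    mask = ~np.eye(sim.shape[0], dtype=bool)
    return sim[mask].min()

# Toy gradients (made up): the first and third point in opposite directions.
grads = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [-1.0, 0.0]])
sim = pairwise_cosine(grads)
worst = min_offdiag(sim)
```

A strongly negative minimum indicates that at least two mini-batches pull the parameters in conflicting directions, which is the situation the gradient-confusion analysis is concerned with.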

Weaknesses:
- I do not have any great criticisms of this paper. On the whole I am very happy with it. My main criticisms concern two experiments (gradient analysis and initialisation), which are discussed in the next section.
- I guess the paper lacks novelty, but that is not important since it is compensated by good empirical analysis.

Correctness: The mathematics of the main method seems sound. The empirical methodology all seems sound to me as well, apart from two parts:
1) In the analysis of the minimum gradient confusion and the cosine gradient similarity, which gradients were analysed exactly? While the results are interesting, they may not be meaningful if they are either a) not taken from equivalent points in the networks, or b) not descriptive of the training of all the layers in each network as a whole.
2) For initialisation, it is shown that contracting a network from an ExpandNet leads to an equivalently initialised SmallNet. Indeed this is true for the forward pass of the network, but are the gradients in the backward pass exactly the same? Initialisation is typically designed to limit the explosion/vanishing of both forward activations and backward gradients, but the authors have only considered the forward pass. Maybe it is true for the backward pass too, I don't know. Knowing this would help distinguish whether it is the gradients themselves that differ between an equivalent ExpandNet and SmallNet, or whether it is the training dynamics (i.e. the optimizer, which we know not to be covariant to gradient reparameterisation).
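One way to probe the second question on a toy problem is sketched below. It is an illustration under stated assumptions, not the paper's setup: a single linear layer W is compared with its two-factor expansion W2 W1 (no nonlinearity, no bias), under a squared loss and plain SGD, and all names and sizes are hypothetical. The two parameterisations give identical outputs, yet one SGD step moves the contracted weights differently, because the expanded step equals W - lr (G W1^T W1 + W2 W2^T G) + lr^2 G W1^T W2^T G rather than W - lr G, where G is the gradient with respect to W.

```python
import numpy as np

def grad_wrt_product(w, x, y):
    """Gradient of 0.5 * ||W x - y||^2 with respect to W."""
    return (w @ x - y) @ x.T

def sgd_step_contracted(w, x, y, lr):
    """One plain SGD step on the single (contracted) layer."""
    return w - lr * grad_wrt_product(w, x, y)

def sgd_step_expanded(w1, w2, x, y, lr):
    """One plain SGD step on both factors of the expanded layer W2 @ W1."""
    g = grad_wrt_product(w2 @ w1, x, y)
    # Chain rule: dL/dW1 = W2^T G and dL/dW2 = G W1^T.
    return w1 - lr * (w2.T @ g), w2 - lr * (g @ w1.T)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, n_samples = 4, 8, 3, 16  # hypothetical sizes
w1 = rng.normal(size=(n_hidden, n_in))
w2 = rng.normal(size=(n_out, n_hidden))
x = rng.normal(size=(n_in, n_samples))
y = rng.normal(size=(n_out, n_samples))

w = w2 @ w1  # contracted layer, equivalent at initialisation
forward_gap = np.abs(w2 @ (w1 @ x) - w @ x).max()

lr = 1e-3
w_after_small = sgd_step_contracted(w, x, y, lr)
new_w1, new_w2 = sgd_step_expanded(w1, w2, x, y, lr)
w_after_expanded = new_w2 @ new_w1
update_gap = np.abs(w_after_small - w_after_expanded).max()

# Closed form of the expanded step, expressed on the contracted weights.
g = grad_wrt_product(w, x, y)
closed_form = (w - lr * (g @ w1.T @ w1 + w2 @ w2.T @ g)
               + lr ** 2 * (g @ w1.T @ w2.T @ g))
```

In this toy case `forward_gap` is at floating-point level while `update_gap` is not, so the difference already appears in the gradients expressed on the contracted weights, before any optimizer state is involved. Whether the same holds for the networks in the paper would still need to be checked by the authors.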

Clarity: The paper is well-written and easy to understand. The authors clearly define the scope and context of the work against the backdrop of contemporary literature.

Relation to Prior Work: The paper is clearly positioned against the backdrop of past and concurrent works, and the experimental section points to other works of note.

Reproducibility: Yes

Additional Feedback:
- Is there a definition of "compact"? Perhaps because I have my mathematics hat on I keep on thinking it may mean something technical that I don't know, but I'm guessing the authors just mean contracted/compressed post-overparameterisation?
- In Equation 1, is there a bijection between the original representation of a convolutional kernel and the "matrixised" form? I think so, but it would be nice to know that this is true.
- I would like to see a short discussion on the computational overhead/complexity of training in terms of expansion rate r.
- Line 147: what is i in the equation p = r^i m and why would this lead to an explosion in parameters? Surely if r and/or i is close to unity it should not be much of an issue?
- Line 157: It may be useful to note that k is odd, or instead of l = (k-1)/2 just write k = 2l+1.
- Padding and strides: is the correspondence between the padding/striding schemes and the contracted versions exact? A single line stating this, if true, would be useful.
- What is the logic behind including knowledge distillation in the experiments? I can see that this serves to improve performance, but I also think that this somewhat detracts from (what I see as) the core message of the paper, which is to explore the benefits of overparameterisation.
- Tables with standard deviations: Good job for including standard deviations, but does it make sense to bold the results with the highest mean when the standard deviations show overlap with the next few highest models? I would suggest either bolding all of the highest-performing models that overlap significantly, or removing the bolding.
- Did the authors record wall-clock training times anywhere? I would like to see these.
- Line 262: Does 3.5pp mean 3.5%?
- Cityscapes dataset: the mIOU values reported are quite a lot lower than what current models report (~84%). Could this be due to the very small learning rate of 1e-8?
- Figure 4: Are the CK and CL minima really that much flatter than for SmallNet? Since this is a qualitative result, I would be cautious in offering it as an explanation for why overparameterisation leads to better overall performance.
- Line 353: SmalleNet -> SmallNet

### Review 2

Summary and Contributions: This paper proposes an expanding strategy (ExpandNets) to facilitate the training of compact convolutional networks. Both the fully-connected and convolutional layers in networks are expanded into deeper ones with more parameters during training, but equivalently converted back to the original layers for inference. Positive results are achieved with ExpandNets in image classification, semantic segmentation and object detection. The authors also empirically show that ExpandNets accelerate training.
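The conversion back to the original layers can be illustrated with a small numerical check. The sketch below is not the authors' implementation. It assumes an expanded block of the form 1x1 conv, then k x k conv, then 1x1 conv with no nonlinearity or bias in between, intermediate widths of r times the original channel counts, stride 1, and no padding; the helper names `conv2d_valid` and `contract` are hypothetical.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Naive 'valid' cross-correlation, stride 1, no padding.

    x: (c_in, h, w); kernel: (c_out, c_in, k, k).
    """
    c_out, _, k, _ = kernel.shape
    out_h, out_w = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((c_out, out_h, out_w))
    for u in range(k):
        for v in range(k):
            patch = x[:, u:u + out_h, v:v + out_w]
            out += np.einsum("oc,chw->ohw", kernel[:, :, u, v], patch)
    return out

def contract(a, mid, b):
    """Fold 1x1 -> kxk -> 1x1 linear convolutions into a single kxk kernel.

    a: (p, n, 1, 1), mid: (q, p, k, k), b: (m, q, 1, 1) -> (m, n, k, k).
    """
    return np.einsum("oq,qpuv,pi->oiuv", b[:, :, 0, 0], mid, a[:, :, 0, 0])

rng = np.random.default_rng(0)
n, m, r, k = 3, 4, 2, 3  # hypothetical channel counts, expansion rate, kernel size
a = rng.normal(size=(r * n, n, 1, 1))
mid = rng.normal(size=(r * m, r * n, k, k))
b = rng.normal(size=(m, r * m, 1, 1))
x = rng.normal(size=(n, 8, 8))

expanded_out = conv2d_valid(conv2d_valid(conv2d_valid(x, a), mid), b)
compact_kernel = contract(a, mid, b)
contracted_out = conv2d_valid(x, compact_kernel)
max_gap = np.abs(expanded_out - contracted_out).max()
```

Under these assumptions the expanded block and the single contracted kernel agree to floating-point precision. The sketch says nothing about padded or strided variants.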

Strengths: It is interesting to leverage the benefits of over-parameterization during training while keeping inference efficient with fewer parameters. The empirical results on three tasks seem strong. ExpandNets are shown to produce flatter minima and reduce the variance of gradients. In addition, the effectiveness of over-parameterization is studied, which may motivate further research.

Weaknesses: Since the proposed method increases the channels of each convolutional layer by r times (ignoring the two 1x1 conv layers), ExpandNets may suffer from high computational complexity during training. I am concerned about the time and memory cost. In addition, it may be unfair to compare the results with baselines that actually use less training time. For example, network pruning also trains large models but performs inference with small models. There are no comparisons with such methods. The reported performance of MobileNetV2 on ImageNet is much weaker than in the original paper [48] (63.75% vs. 72.00%). The baseline models seem to be rather weak. For example, the accuracy on CIFAR-10 is about 80%, while the current SOTA has reached >98%. I’m not saying that SOTA accuracy should be pursued, but the presented results are clearly not competitive and not convincing.

Correctness: The claims and the empirical methodology are correct.

Clarity: This paper is well written, and easy to follow.

Relation to Prior Work: The differences of this work from previous ones are clearly discussed.

Reproducibility: Yes

### Review 3

Summary and Contributions: The author proposed a novel method to train a compact CNN, referred to as ExpandNets. The proposed method is simple but effective. Furthermore, the proposed method has the advantage that it can be combined with existing training methods such as knowledge distillation. The author conducted various experiments to validate the effectiveness of the proposed method. Experiments included image classification, object detection, image segmentation, and many ablation studies.

Strengths:
- The paper is well organized and easy to follow.
- The most novel point of the proposed method is the expanding scheme for the convolutional layers. The proposed scheme amplifies the number of parameters by increasing the number of channels via 1 by 1 convolution or by replacing convolutional layers with consecutive 3 by 3 convolutional layers. These techniques have been generally used to change the number of parameters. However, the idea of expanding linear layers for training and contracting them for testing to improve performance seems to be new.
- The experimental results of the submitted paper show that just expanding the linear layers (conv., fc) can improve the performance of the model. I think this is a meaningful result. The author empirically verified the effectiveness of the proposed model.

Weaknesses:
- I wonder why the proposed method works well. The paper claims that the proposed method works well because of over-parameterization through parameter expansion. This was experimentally verified through self-ablation (section 5.1). However, there are still questions. For networks of the same structure to exhibit different performance, the network parameters (weights) must be learned differently. I would like to see why back-propagation through the expanded layers is more beneficial than through the non-expanded layer. A mathematical analysis of how the parameter update of a linear layer changes with its expansion would be required.
- Generally, the power of deep neural networks is considered to come from the non-linearity of the model. However, after contracting the expanded linear layers (conv., fc) into one layer, the network is equivalent to the non-expanded network. This implies that the results learned via the expanded layers are also attainable by the non-expanded network. I suspect that the backpropagation settings for the non-expanded network have not been tuned to their best. If that is not the case, the authors should explain why the use of expanded layers (over-parameterization) leads to the improvement. A mathematical explanation would make this clear.
- The idea of simplifying multiple linear layers into a single layer has been proposed in Felix Wu, et al., Simplifying Graph Convolutional Networks, ICML 2019.

Correctness: There are no technical errors.

Clarity: The paper is well organized and easy to follow. However, the key claim is not supported by a rationale, as mentioned under Weaknesses.

Relation to Prior Work: The idea of simplifying multiple linear layers into a single layer has been proposed in Felix Wu, et al., Simplifying Graph Convolutional Networks, ICML 2019. Even though it is applied to graph neural networks, the idea is the same.

Reproducibility: Yes

Additional Feedback: The authors should provide a rationale for why training with expanded layers (over-parameterization) benefits the non-expanded network compared with training it directly.

### Review 4

Summary and Contributions: This paper proposes a training strategy to train a given compact network. Given an arbitrary compact network, the paper expands the network into an over-parameterized one. Three expansion methods are introduced, i.e., expanding convolutional layers (CL), convolutional kernels (CK) and fc layers (FC). Experiments show that using the proposed network over-parameterization strategies, a small network can be better trained than training the original network.

Strengths: The proposed expansion methods are technically reasonable and easy to understand. The expansion techniques are based on existing network factorization methods, but the authors use them to improve the training of a given fixed network, which differs from previous work that tries to compress networks with matrix factorization. Such an over-parameterization scheme seems easy to apply to any compact network. The experimental results on image classification, object detection and semantic segmentation broadly validate the efficacy of the methods.

Weaknesses: I think the expansion techniques are not strongly novel, as indicated above. The expansion of CL and FC is quite common.

Correctness: The methods look technically sound.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes