NeurIPS 2020

Residual Distillation: Towards Portable Deep Neural Networks without Shortcuts


Review 1

Summary and Contributions: This paper makes the observation that the skip connections in ResNets take up a nontrivial memory footprint and memory bandwidth at inference time. The authors propose to train ConvNets with residual connections but deploy them without. They show that, by jointly training the ResNet and the non-residual (plain) network, they are able to achieve good performance on the CIFAR and ImageNet classification benchmarks.
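To make the memory argument concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code) contrasting a residual block with its plain counterpart: the shortcut forces the block's input feature map to stay resident until the elementwise addition, which is the extra on-device memory traffic the paper targets.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard two-conv residual block (generic, not the paper's exact layer)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x  # the shortcut: x must be kept around until the addition below
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

class PlainBlock(nn.Module):
    """The same stack of convolutions with the shortcut removed for deployment."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # No shortcut: x can be freed as soon as conv1 has consumed it.
        out = self.relu(self.bn1(self.conv1(x)))
        return self.relu(self.bn2(self.conv2(out)))
```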

Strengths:
* The paper has clear figures and a clear explanation of the joint training algorithm, and it appears to contain enough detail to reproduce the results.
* I appreciate the runtime statistics reported on actual mobile platforms (Table 1, mobile NPU memory usage; Table 2, mobile inference latency). These are much more useful than proxy metrics such as FLOPs.
* Removing the residual connections at inference time appears to be a novel technique.

Weaknesses:
* There are numerous approaches to reducing a ConvNet's memory footprint and compute at inference time, including but not limited to channel pruning, dynamic computation graphs, and model distillation. Why is removing the shortcut connections the best way to achieve this goal? The baselines considered in Tables 3 and 4 are rather lacking. For example, how does the proposed method compare to:
  1. A pruning method that reduces the channel counts of ResNet-50 to match the memory footprint and FLOPs of plain-CNN 50. What would the drop in accuracy be?
  2. Distilling a ResNet-50 model into a smaller ResNet with a memory footprint similar to that of plain-CNN 50. How does this compare to the proposed training scheme?
* L107-111 states that "At the early stages of the training process, the gradients from ResNets play a bigger role with a larger weight, and at the later stages, the gradients contributed by ResNets fade out ...". There do not seem to be any empirical results or citations to back up this claim.
* L133-L143: are there any ablation studies with paths 1-3 removed? What is the result of a naive KD method (without those skip connections)?

EDIT AFTER REBUTTAL
==================
I have read the authors' rebuttal and the other reviews. The authors have done an excellent job of addressing my concerns, and the updated table of results and efficiency metrics is very convincing. Please incorporate this into the final paper. I have therefore updated my rating to accept.

Correctness: The results seem correct, but more baselines and ablations are needed.

Clarity: It is fine.

Relation to Prior Work: The discussion is fine.

Reproducibility: Yes

Additional Feedback:
* L48 and L96 should cite the same reference - looks like there's a typo.
* L92-L97 is a restatement of L46-L50. Consider removing to reduce verbosity.


Review 2

Summary and Contributions: The authors propose a residual distillation approach that trains a plain CNN using the knowledge of a teacher ResNet. The task is well motivated and the adopted method makes sense. Experiments demonstrate that the performance of a plain CNN trained this way is on par with the ResNet baseline, but with lower memory usage and computation.

Strengths: I think the overall submission is interesting. The strengths, as I see them, are as follows.
1. The experimental results are very promising, on par with or even better than the baselines.
2. The shortcut aggregates features and consequently generates substantial off-chip memory traffic, which is unfriendly to portable devices with limited power budgets. This paper introduces a method to distill residuals and remove the shortcuts. Apart from soft logits and intermediate features, the teacher's backbone is used to propagate the student's features, making optimization easier for plain CNNs (see the sketch below for the standard soft-logit term).
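For context on the soft-logit term mentioned above, the sketch below shows the standard softened-KL distillation loss in PyTorch. This is a generic illustration; the temperature T, the weight alpha, and the exact combination of losses used in the paper are assumptions on my part, not the authors' settings.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard soft-logit distillation: softened KL against the teacher plus hard CE.
    T and alpha are illustrative values, not the paper's hyperparameters."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```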

Weaknesses:
1. The student, in this case, is enforced to have the same architecture as the teacher. As a result, consistency between student and teacher features is expected. So when would you want to transform the feature maps as in Eq. 2?
2. The ablation study should be made clearer. Is Dirac initialization used in the other settings? If not, what is the performance when Dirac initialization is used? (A sketch of what Dirac initialization does is given below.)
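For readers unfamiliar with the term, the sketch below shows what Dirac (identity-preserving) initialization does, using torch.nn.init.dirac_. The layer shape is hypothetical; whether and where the paper actually applies it is exactly what the ablation question above asks.

```python
import torch.nn as nn
import torch.nn.init as init

# Hypothetical 3x3 conv with matching input/output channel counts.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
init.dirac_(conv.weight)  # center tap of each filter set to 1 on its own channel, rest 0

# At initialization the layer is an identity map (conv(x) == x), so a plain block
# starts out behaving like the identity path of a residual block.
```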

Correctness: Yes to my best knowledge.

Clarity: Yes in general.

Relation to Prior Work: Yes in general.

Reproducibility: Yes

Additional Feedback: Please see the weaknesses section.

---- Post Rebuttal ----
I have read the rebuttal and keep my original rating.


Review 3

Summary and Contributions: This paper investigates a new training scheme with which the shortcuts of residual blocks can be removed without sacrificing much accuracy. The authors propose a teacher-student training scheme: in addition to using the conventional KD technique, it uses the error gradients from the teacher network to guide the learning of the student network. While the idea is straightforward, the good performance described in the paper shows that the proposed technique is promising.
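To make the gradient-guidance idea concrete, here is a hedged PyTorch sketch of my reading of this mechanism; the names student_block and teacher_tail and the single cross-entropy loss are my assumptions, not the authors' code. The student stage's output is routed through the frozen downstream part of the teacher, so the teacher's error gradients flow back into the student stage.

```python
import torch.nn.functional as F

def guided_step(student_block, teacher_tail, optimizer, x_stage, labels, weight=1.0):
    """One training step in which a frozen teacher tail supplies error gradients
    to a plain student block (illustrative sketch, not the paper's algorithm)."""
    for p in teacher_tail.parameters():
        p.requires_grad_(False)      # the teacher stays fixed; only the student learns

    feat = student_block(x_stage)    # student features for this stage
    logits = teacher_tail(feat)      # propagate them through the teacher's remaining blocks
    loss = weight * F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()                  # teacher-side error gradients reach student_block
    optimizer.step()
    return loss.item()
```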

Strengths: This paper presents a valid motivation for removing the shortcuts of ResNets on mobile devices, and it quantitatively shows the significant improvements in running speed and memory consumption achieved by the proposed method. The strength of the paper probably comes from the simplicity of the proposed method: with such a straightforward and concise technique, the authors show that it is possible to safely remove the shortcuts with only a minor compromise in accuracy. The proposed method seems novel to me; however, since I am not an expert in model distillation, I am not completely sure about this.

Weaknesses: The authors did not do a good job of using the related work to position the novelty of the proposed method; I hope this can be clarified in the rebuttal. Also, the proposed method seems to offer limited advantage on both small models and small tasks. I hope the authors can clarify whether this is the case and, if so, what its implications are.

Correctness: Should be correct.

Clarity: Yes.

Relation to Prior Work: The authors did not do a good job of using the related work to position the novelty of the proposed method; I hope this can be clarified in the rebuttal.

Reproducibility: Yes

Additional Feedback: