NeurIPS 2020

Storage Efficient and Dynamic Flexible Runtime Channel Pruning via Deep Reinforcement Learning


Review 1

Summary and Contributions: The authors propose a deep reinforcement learning (DRL) based framework to perform runtime channel pruning on convolutional neural networks. The framework has three main parts: 1) a network that learns two types of channel importance, static importance and runtime importance; 2) two DRL agents that produce sparsity ratios for the runtime and static pruning procedures; and 3) a trade-off pruner that balances the runtime and static pruning results. Experimental results on two benchmark datasets show that this framework provides a trade-off between dynamic flexibility and storage efficiency in runtime channel pruning.
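To make the summarized pipeline concrete, below is a minimal sketch of how the two importance scores and the two agent-chosen sparsity ratios could be fused by a trade-off pruner into a channel mask. All names, the convex-combination rule, and the alpha parameter are my own assumptions for illustration, not the paper's exact formulation.

    # Hypothetical sketch of the trade-off pruner; not the authors' exact design.
    import numpy as np

    def tradeoff_pruner(u_runtime, u_static, a_runtime, a_static, alpha=0.5):
        """Combine runtime/static channel importance into one binary channel mask.

        u_runtime, u_static: per-channel importance scores, shape (C,)
        a_runtime, a_static: sparsity ratios in [0, 1] chosen by the two DRL agents
        alpha: assumed trade-off between dynamic flexibility and storage efficiency
        """
        combined = alpha * u_runtime + (1.0 - alpha) * u_static
        keep_ratio = 1.0 - (alpha * a_runtime + (1.0 - alpha) * a_static)
        num_keep = max(1, int(round(keep_ratio * combined.size)))
        mask = np.zeros_like(combined)
        mask[np.argsort(combined)[-num_keep:]] = 1.0  # keep the most important channels
        return mask

    # toy usage
    rng = np.random.default_rng(0)
    print(tradeoff_pruner(rng.random(8), rng.random(8), a_runtime=0.5, a_static=0.25))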

Strengths: The authors propose a reinforcement-learning-based pruning method that combines runtime pruning with static pruning, addressing both storage efficiency and dynamic flexibility. Comparative experiments verify that each component of the framework contributes to the final performance. The research area is relevant to the NeurIPS community.

Weaknesses: The novelty of the paper is limited. Moreover, the experimental section is incomplete: the number of iterations required to reach the reported results is an important indicator, but it is not reported. Hyperparameters likely have a large impact on the final results, yet the authors neither explain how the hyperparameters were chosen nor provide experiments illustrating how they were determined. The experimental results show that the proposed method does not achieve the best results on some metrics, and the robustness of the method remains to be discussed.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: In the introduction, the authors point out the differences between their method and existing methods. In the experiments, the proposed method is compared with existing methods in terms of accuracy, speed-up, number of parameters, and inference time.

Reproducibility: No

Additional Feedback:


Review 2

Summary and Contributions: This paper presents a new channel pruning method by combining the idea of dynamic pruning and static pruning, which can simultaneously reduce storage and runtime. Experiments on CIFAR and ImageNet show the effectiveness of the method.

Strengths:
- The idea of combining static and dynamic pruning methods is interesting.
- The proposed method can outperform both static and dynamic methods.

Weaknesses:
- Although combining static and dynamic pruning methods is interesting, the proposed method looks like incremental work on previous dynamic pruning methods, which limits the contribution of this work.
- DRL methods usually introduce considerable extra cost. There is no analysis of the cost of the pruning process.
- DRL-based methods are usually harder to implement. Since code is not provided, I have concerns about the reproducibility of this paper.

Correctness: I think the claims and the method are correct.

Clarity: This paper overall is well organized and easy to read.

Relation to Prior Work: I think the differences from previous work are sufficiently discussed in Introduction and Related Work.

Reproducibility: Yes

Additional Feedback: Overall, I think the proposed idea is interesting and some promising results are achieved. However, I still have some concerns about the learning cost and the implementation. In its current state, I would rate this paper as borderline and wait for further discussion.

----- Post rebuttal: The authors' feedback addressed my concerns about the extra cost. After reading the other reviews, I think this paper does make a contribution to dynamic pruning methods. Therefore, I raise my score to 6. The authors should provide the code if the paper is accepted.


Review 3

Summary and Contributions: This paper proposes a reinforcement-learning-based approach to dynamically prune CNN channels at test time. The agent has two parts, static and dynamic; hence, in addition to reducing MACs, the storage footprint is also reduced in some cases. The rebuttal addressed most of my concerns and I slightly raised my score post-rebuttal.

Strengths:
+++ The idea of combining static and dynamic pruning in an RL framework is interesting and novel.
+++ The ablation studies are pretty good.
+++ The experimental results are good.

Weaknesses:
--- There is one technical error. Ref [22] is not a dynamic pruning method as claimed in this paper. Ref [22] (now a Pattern Recognition journal paper, not an arXiv preprint) has a section devoted to explaining how it achieves static pruning. It is, however, appropriate to say that the approach in Ref [22] has inspired or been adopted by some dynamic pruning approaches.
-- Some key information is missing. For example, in Tables 1 and 2 and the subsequent figures, how is "sparsity" measured? What is the equation that defines this term?
-- The empirical comparisons need some improvements. 1. Wall-clock timing is needed. MACs are in fact a bad indicator for dynamic pruning: the method will surely slow down (relative to the MAC reduction) when a GPU is used, although I expect the CPU speedup to be somewhat closer to the MAC reduction. 2. One important result is missing: a comparison with static-only pruning using an existing method, e.g., FPGM (or AutoPruner from Ref [22]) in Table 3. On GPU, static pruning will have a significant speed advantage, which is missing in Tables 4 and 5, where wall-clock speed and model size comparisons are also missing.

Correctness: I think they are correct. There are some issues with the experiments, but they can be repaired.

Clarity: Mostly yes.

Relation to Prior Work: Mostly yes -- I have pointed out one error in the box above.

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions: (1) Proposed to prune CNN channels by taking both runtime and static information of the environment into consideration, where runtime information endows pruning with flexibility based on different input instances and static information reduces the number of parameters in deployment, leading to storage reduction. (2) Proposed to use Deep Reinforcement Learning (DRL) to determine sparsity ratios for pruning, which is different from the previous approaches that manually set sparsity ratios. (3) Experiments demonstrate the effectiveness of the proposed method.

Strengths: (1) Soundness of the claims. The proposed DRL method for CNN pruning is well motivated and theoretically sound. Experiments validate the effectiveness of the method. (2) Significance and novelty. Using DRL to combine the merits of runtime and static pruning is novel, and it is also significant as it provides a trade-off between dynamic flexibility and memory efficiency. (3) The work focuses on the optimization of neural network architectures and is thus sufficiently relevant to the NeurIPS community.

Weaknesses: Some details are missing, making the method less reproducible, e.g.: (1) In Line 113, M and u are the outputs of g(u_r, u_s, a^r_t, a^s_t). However, in Sec. 3.2, where the trade-off pruner g(·) is defined, a^r_t and a^s_t are missing. (2) Some implementation details (such as the RNN hyper-parameters) are missing.

Correctness: Yes.

Clarity: The paper is well-structured and easy to follow. There are some minor issues: (1) All variables in Fig. 1 should be set in math fonts rather than plain text. (2) All vector variables should be represented consistently (e.g., with bold lowercase); for example, the decision mask should be m rather than M.

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback: As the DRL method doesn't take the channel importance (u_r and u_s) as its action variables (it only takes the sparsity ratios a^r_t and a^s_t), u_r and u_s cannot receive feedback from the reward on the computation/parameter budget, leading to suboptimal solutions. Although taking u_r and u_s as actions requires their sizes to be fixed, they could then be resized to match the input channel number C_in (by down-sampling or interpolation), as sketched after this review.

==== AFTER REBUTTAL ==== I've read all the reviewing details. The authors have addressed my concerns about reproducibility, but I still feel that the model design lacks optimality (from my comment "8. Additional feedback"). Thus, I will keep my score.
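A minimal sketch of the resizing suggestion above, purely illustrative: the fixed action size, the function name, and the use of linear interpolation are my assumptions, not part of the paper.

    # Interpolate a fixed-size importance vector (a DRL action) to a layer's C_in.
    import torch
    import torch.nn.functional as F

    def resize_importance(u_fixed: torch.Tensor, c_in: int) -> torch.Tensor:
        """u_fixed: shape (K,), fixed-size channel importance produced as an action."""
        u = u_fixed.view(1, 1, -1)  # reshape to (N=1, C=1, L=K) for 1D interpolation
        u = F.interpolate(u, size=c_in, mode="linear", align_corners=False)
        return u.view(c_in)

    u_action = torch.rand(32)                         # fixed-size action from the agent
    u_layer = resize_importance(u_action, c_in=128)   # matches the layer's channel count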