Paper ID: 1747
Title: BinaryConnect: Training Deep Neural Networks with binary weights during propagations
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper suggests a modification to the backpropagation algorithm that enables training multilayer neural networks with binary weights. This is done by binarizing the weights only during the forward and backward propagation phases, while the updates are applied to the stored real-valued weights. This should accelerate training and, much more significantly, testing.
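
For concreteness, here is a minimal NumPy sketch of one such training step as I understand it from the summary above; the toy squared-error loss, the single linear layer, and the function names are my own placeholders, not the authors' code.

    import numpy as np

    def binarize(w_real):
        # Deterministic binarization: project the real-valued weights to {-1, +1}.
        return np.where(w_real >= 0, 1.0, -1.0)

    def training_step(w_real, x, y_true, lr=0.01):
        # One BinaryConnect-style step on a single linear layer (sketch only).
        w_bin = binarize(w_real)            # binary weights for propagation
        y_pred = x @ w_bin                  # forward pass with binary weights
        grad_out = y_pred - y_true          # gradient of a toy squared-error loss
        grad_w = x.T @ grad_out             # backward pass, also with binary weights
        w_real = w_real - lr * grad_w       # update the stored real-valued weights
        return np.clip(w_real, -1.0, 1.0)   # keep the real weights in [-1, 1]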

The fact that one can use binary weights during training, and still get good test performance, is novel and rather surprising (I had to check the code to believe this).

The (minor) comments I have are about the relation to previous works: 1. It is not clear why, on line 373, it is said that the comparisons were done on different datasets. Indeed, ref [29] is done for text classification, but doesn't ref [30] use the same regular (permutation-invariant) MNIST as in this paper? If that is the case, I see several possible explanations for the differences in results. First, EBP did not use the tricks used in this paper (L2-SVM output, batch normalization, and an adjustable/decaying learning rate), except dropout. Increasing the margin (using the L2-SVM) can be especially useful for decreasing the discretization sensitivity. Second, larger networks and many more epochs were used here. Third, in EBP the neurons are also binary (which is more hardware friendly than the hard sigmoid used here; see the sketch after these comments).

2. Lines 369-371 are slightly inaccurate. In EBP [29,30], it is also possible to use only a single binary network, at the price of reduced performance (as was done in [30]). This is somewhat similar to the decrease in performance of BinaryConnect (det.) relative to BinaryConnect (stoch.).
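
To make the contrast in point 3 above concrete, a small sketch of the two neuron types being compared; the exact hard sigmoid definition, clip((x+1)/2, 0, 1), is my assumption of the common form and may differ from the paper's.

    import numpy as np

    def hard_sigmoid(x):
        # Piecewise-linear activation (assumed form: clip((x+1)/2, 0, 1)).
        return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

    def binary_neuron(x):
        # Sign activation as in EBP [29,30]: the output itself is binary,
        # so the next layer again needs no multiplications.
        return np.where(x >= 0, 1.0, -1.0)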

Also, there are a few typos and grammatical errors ("Let us compared...").

In summary, very interesting paper!

%%%

Edit, after Author's Feedback: I raised the score from 8 to 9 due to the new experimental results, which seem quite promising. I especially like the fact that you can train with binary weights, in contrast to previous works. Note that the FPGA section should be improved: it should be made more concrete (showing at least a diagram of how the weights are placed and how the data are routed through the network, and specifically how to route the convolutional layers), or removed.

Q2: Please summarize your review in 1-2 sentences
This is the first time that a multilayer neural network was trained and tested with binary weights, and achieved competitive results.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents BinaryConnect, a dropout-like mechanism to train networks while constraining the weights to only two possible values. This reduces the multiply-accumulate operations required for deep nets to simple accumulate operations.
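
To illustrate the point: with weights constrained to two values (say +1/-1), a dot product reduces to sign-conditioned additions and subtractions; a hypothetical sketch, not taken from the paper's code.

    import numpy as np

    def dot_with_binary_weights(x, w_sign):
        # w_sign contains only +1/-1: add the inputs whose weight is +1
        # and subtract the others -- no multiplications needed.
        return x[w_sign > 0].sum() - x[w_sign < 0].sum()

    # Sanity check: matches the ordinary dot product.
    x = np.array([0.5, -1.2, 3.0])
    w = np.array([1.0, -1.0, 1.0])
    assert np.isclose(dot_with_binary_weights(x, w), x @ w)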

The paper is easy to read.

Code will be released; this is really good.

The idea is appealing and this avenue is worth investigating; however, I am afraid the experimental section is not yet ready for publication.

The only experiment the authors conducted is on permutation-invariant MNIST, and no evaluation is done to validate the claims in terms of speedup (while I believe this was the main topic of the paper). In addition, I would not put too much effort into the training speedup, as training can be done offline. Of course, binarization during training is necessary to keep performance from dropping considerably, but I would steer the focus towards FPGAs and inference-time performance.

MNIST has binary inputs, so it is reasonable for a network with binary weights to work on this problem. I think, though, that on more challenging problems, such as CIFAR-10, there will be a bigger drop in performance. Also, why didn't the authors try convnets? Is there a particular reason not to use data augmentation? Perhaps it makes training unstable?

Overall I like the idea very much, but it needs a more thorough evaluation.
Q2: Please summarize your review in 1-2 sentences
I find the idea interesting and it deserves further investigation, but unfortunately the paper lacks a thorough evaluation and analysis. For example, I don't see why the authors did not test it on convolutional nets.

Experiments on non-binary images and an experimental evaluation of the FPGA speedup are required, considering that this is the main topic of the paper.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a method for potentially speeding up neural network training by binarizing connection weights to one of two values. The authors show that this method regularizes the network during training, analogously to Dropout or DropConnect. Two variants of BinaryConnect are proposed, deterministic and stochastic. On an MNIST task, the proposed methods improve over the baseline but are not as good as standard dropout. The authors also provide an analysis of the potential computational savings and speedups possible with the BinaryConnect approach.

The paper shows an interesting variant of the dropout style of regularization methods. The paper is well motivated and the idea has been carefully considered in its implementation. It is unfortunate that the results are not more promising. It would be useful and interesting to know whether combining this with [32], training the network after binarization, could recover some of the loss in accuracy. The true computational benefit of this method is unclear, as the assessment is largely theoretical. As a result, it is unclear whether the claimed benefits can actually be realized in practice.

The paper is very clearly written. It would be straightforward for a reader to implement the proposed algorithm.

It is somewhat original. The authors are very aware of related literature and do an admirable job of putting this work in the proper context.

The significance as a method for more efficient training is unclear, since the hardware analysis is largely theoretical. It is not likely to be adopted for general-purpose implementations, as the algorithm's performance is not as good as that of other methods like dropout. Fast training of a DNN on an FPGA would be a great outcome, but there is still ample work to do before that can be demonstrated.
Q2: Please summarize your review in 1-2 sentences
This paper proposes a method to speed up DNN training by binarizing the weights during the forward and backward propagation phases. The results on MNIST are better than with no regularization but not as good as dropout.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Firstly, thank you all for taking the time to read our work and for your many constructive comments. We are currently doing our best to apply them.

The three main criticisms of our work were the following:
1) we did not evaluate our method on enough benchmark datasets,
2) the FPGA speedup resulting from our method is only theoretical,
3) and we had not compared our work with [1,2].

1) CIFAR-10 and SVHN experiments

We trained CNNs with BinaryConnect on CIFAR-10 and SVHN, and we obtained near state-of-the-art results (see the two tables below). The code is available on our (anonymous) github repository (https://github.com/AnonymousWombat/BinaryConnect).

Table 1: test error rate of a CNN trained on CIFAR-10 (without data-augmentation) depending on the method, showing that BinaryConnect can be used to make binary-weight nets competitive:

Method------------------------------------test error
No regularizer (this work)----------------15.24%
Deterministic BinaryConnect (this work)---13.38%
Stochastic BinaryConnect (this work)------12.20%
Maxout Networks [3]-----------------------11.68%
Network in Network [4]--------------------10.41%
Deeply Supervised Nets [5]----------------9.78%

Table 2: test error rate of a CNN trained on SVHN depending on the method:

Method------------------------------------test error
No regularizer (this work)----------------3.02%
Deterministic BinaryConnect (this work)---2.83%
Stochastic BinaryConnect (this work)------2.66%
Maxout Networks [3]-----------------------2.47%
Network in Network [4]--------------------2.35%
DropConnect [6]---------------------------1.94%
Deeply Supervised Nets [5]----------------1.92%

2) FPGA SPEEDUP

An experimental evaluation of the resulting FPGA speedup is important future work, but it is a major engineering effort which needs to be justified by the kind of simulation results and theoretical computational analysis that we are providing, i.e.,
1) we believe that our ideas and simulations represent an important contribution by themselves (and certainly a large amount of labor),
2) and we are waiting for the release of the latest FPGAs. Those FPGAs are roughly twice as dense as the previous generation (a big deal for making this kind of neural net implementable). Xilinx's VU440 was "released" in June 2015 and Altera's Stratix 10 is yet to be released.

3) RELATED WORKS

One reviewer rightly pointed out closely related work [1,2], in which the authors:
1) train a neural network with high-precision,
2) quantize the hidden states to a 3-bit fixed-point representation,
3) ternarize the weights to three possible values -H, 0 and +H (a generic illustration is sketched after this list),
4) adjust H to minimize the output error (mean cross-entropy for classification),
5) and retrain with ternary weights during propagations and high-precision weights during updates.
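
For illustration only, here is a generic threshold-based ternarization in the spirit of step 3 above; the fixed relative threshold is a placeholder of ours, not the procedure of [1,2], who instead adjust H to minimize the output error.

    import numpy as np

    def ternarize(w, H, threshold=0.5):
        # Map real-valued weights to {-H, 0, +H}. The threshold rule here
        # is a placeholder, not the rule used in [1,2].
        out = np.zeros_like(w)
        out[w > threshold * H] = H
        out[w < -threshold * H] = -H
        return out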

By comparison, we:
1) set H using the weight initialization scaling coefficient from [7] (see the sketch after this list)
2) and *train all the way* with binary weights (-H and +H) during propagations and high-precision weights during updates, i.e., our training procedure could be implemented with efficient specialized hardware avoiding 2/3 of the multiplications, unlike [1,2].
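
A minimal sketch of how H could be set from the initialization scaling of [7]; taking the Glorot uniform bound sqrt(6/(n_in+n_out)) is an assumption for illustration, not necessarily the exact coefficient we use.

    import numpy as np

    def glorot_scale(n_in, n_out):
        # Glorot/Xavier uniform bound from [7]; assumed here as the value of H.
        return np.sqrt(6.0 / (n_in + n_out))

    def binarize_to_pm_H(w_real, H):
        # Binarize the real-valued weights to {-H, +H} for the propagation phases.
        return np.where(w_real >= 0, H, -H)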

We believe our method is original in several regards:
1) unlike [1,2], we do not use high-precision training (i.e., we can take advantage of specialized hardware),
2) we show that binary weights are a powerful regularizer
3) and we obtain near state-of-the-art results on CIFAR-10 and SVHN.

REFERENCES

[1] J. Kim, K. Hwang, and W. Sung. X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks. In Proc. ICASSP, pp. 7510-7514. IEEE, 2014.

[2] K. Hwang and W. Sung. Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In Proc. IEEE Workshop on Signal Processing Systems (SiPS), pp. 1-6. IEEE, 2014.

[3] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013.

[4] M. Lin, Q. Chen, and S. Yan. Network in network. In Proc. ICLR, 2014.

[5] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv:1409.5185, 2014.

[6] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using DropConnect. In ICML, 2013.

[7] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.