{"title": "Dynamic Network Surgery for Efficient DNNs", "book": "Advances in Neural Information Processing Systems", "page_first": 1379, "page_last": 1387, "abstract": "Deep learning has become a ubiquitous technology to improve machine intelligence. However, most of the existing deep models are structurally very complex, making them difficult to be deployed on the mobile platforms with limited computational power. In this paper, we propose a novel network compression method called dynamic network surgery, which can remarkably reduce the network complexity by making on-the-fly connection pruning. Unlike the previous methods which accomplish this task in a greedy way, we properly incorporate connection splicing into the whole process to avoid incorrect pruning and make it as a continual network maintenance. The effectiveness of our method is proved with experiments. Without any accuracy loss, our method can efficiently compress the number of parameters in LeNet-5 and AlexNet by a factor of $\\bm{108}\\times$ and $\\bm{17.7}\\times$ respectively, proving that it outperforms the recent pruning method by considerable margins. Code and some models are available at https://github.com/yiwenguo/Dynamic-Network-Surgery.", "full_text": "Dynamic Network Surgery for Ef\ufb01cient DNNs\n\nYiwen Guo\u2217\nIntel Labs China\n\nyiwen.guo@intel.com\n\nAnbang Yao\n\nIntel Labs China\n\nanbang.yao@intel.com\n\nYurong Chen\nIntel Labs China\n\nyurong.chen@intel.com\n\nAbstract\n\nDeep learning has become a ubiquitous technology to improve machine intelligence.\nHowever, most of the existing deep models are structurally very complex, making\nthem dif\ufb01cult to be deployed on the mobile platforms with limited computational\npower. In this paper, we propose a novel network compression method called\ndynamic network surgery, which can remarkably reduce the network complexity\nby making on-the-\ufb02y connection pruning. 
Unlike the previous methods which\naccomplish this task in a greedy way, we properly incorporate connection splicing\ninto the whole process to avoid incorrect pruning and make it as a continual network\nmaintenance. The effectiveness of our method is proved with experiments. Without\nany accuracy loss, our method can ef\ufb01ciently compress the number of parameters\nin LeNet-5 and AlexNet by a factor of 108\u00d7 and 17.7\u00d7 respectively, proving that\nit outperforms the recent pruning method by considerable margins. Code and some\nmodels are available at https://github.com/yiwenguo/Dynamic-Network-Surgery.\n\n1\n\nIntroduction\n\nAs a family of brain inspired models, deep neural networks (DNNs) have substantially advanced a\nvariety of arti\ufb01cial intelligence tasks including image classi\ufb01cation [13, 19, 11], natural language\nprocessing, speech recognition and face recognition.\nDespite these tremendous successes, recently designed networks tend to have more stacked layers,\nand thus more learnable parameters. For instance, AlexNet [13] designed by Krizhevsky et al.\nhas 61 million parameters to win the ILSVRC 2012 classi\ufb01cation competition, which is over 100\ntimes more than that of LeCun\u2019s conventional model [15] (e.g., LeNet-5), let alone the much more\ncomplex models like VGGNet [19]. Since more parameters means more storage requirement and\nmore \ufb02oating-point operations (FLOPs), it increases the dif\ufb01culty of applying DNNs on mobile\nplatforms with limited memory and processing units. Moreover, the battery capacity can be another\nbottleneck [9].\nAlthough DNN models normally require a vast number of parameters to guarantee their superior\nperformance, signi\ufb01cant redundancies have been reported in their parameterizations [4]. Therefore,\nwith a proper strategy, it is possible to compress these models without signi\ufb01cantly losing their\nprediction accuracies. 
Among existing methods, network pruning appears to be an outstanding one\ndue to its surprising ability of accuracy loss prevention. For instance, Han et al. [9] recently propose to\nmake \"lossless\" DNN compression by deleting unimportant parameters and retraining the remaining\nones (as illustrated in Figure 1(b)), somehow similar to a surgery process.\nHowever, due to the complex interconnections among hidden neurons, parameter importance may\nchange dramatically once the network surgery begins. This leads to two main issues in [9] (and some\nother classical methods [16, 10] as well). The \ufb01rst issue is the possibility of irretrievable network\n\u2217This work was done when Yiwen Guo was an intern at Intel Labs China supervised by Anbang Yao who is\n\nresponsible for correspondence.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fdamage. Since the pruned connections have no chance to come back, incorrect pruning may cause\nsevere accuracy loss. In consequence, the compression rate must be over suppressed to avoid such\nloss. Another issue is learning inef\ufb01ciency. As in the paper [9], several iterations of alternate pruning\nand retraining are necessary to get a fair compression rate on AlexNet, while each retraining process\nconsists of millions of iterations, which can be very time consuming.\nIn this paper, we attempt to address these issues and pursue the compression limit of the pruning\nmethod. To be more speci\ufb01c, we propose to sever redundant connections by means of continual\nnetwork maintenance, which we call dynamic network surgery. The proposed method involves\ntwo key operations: pruning and splicing, conducted with two different purposes. Apparently,\nthe pruning operation is made to compress network models, but over pruning or incorrect pruning\nshould be responsible for the accuracy loss. 
To compensate for this unexpected loss, we properly incorporate the splicing operation into the network surgery, thus enabling connection recovery whenever the pruned connections turn out to be important. These two operations are integrated together by updating the parameter importance whenever necessary, making our method dynamic.
In fact, the above strategies help to make the whole process flexible. They are beneficial not only for better approaching the compression limit, but also for improving the learning efficiency, which will be validated in Section 4. In our method, pruning and splicing naturally constitute a circular procedure and dynamically divide the network connections into two categories, akin to the synthesis of excitatory and inhibitory neurotransmitters in the human nervous system [17].
The rest of this paper is structured as follows. In Section 2, we introduce the related methods of DNN compression, briefly discussing their merits and demerits. In Section 3, we highlight the intuition behind dynamic network surgery and introduce its implementation details. Section 4 experimentally analyses our method and Section 5 draws the conclusions.

Figure 1: The pipeline of (a) our dynamic network surgery and (b) Han et al.'s method [9], using AlexNet as an example. [9] needs more than 4800K iterations to get a fair compression rate (9×), while our method runs only 700K iterations to yield a significantly better result (17.7×) with comparable prediction accuracy.

2 Related Works

In order to make DNN models portable, a variety of methods have been proposed. Vanhoucke et al. [20] analyse the effectiveness of data layout, batching and the usage of Intel fixed-point instructions, achieving a 3× speedup on x86 CPUs. Mathieu et al.
[18] explore the fast Fourier transforms (FFTs)\non GPUs and improve the speed of CNNs by performing convolution calculations in the frequency\ndomain.\nAn alternative category of methods resorts to matrix (or tensor) decomposition. Denil et al. [4]\npropose to approximate parameter matrices with appropriately constructed low-rank decompositions.\nTheir method achieves 1.6\u00d7 speedup on the convolutional layer with 1% drop in prediction accuracy.\nFollowing similar ideas, some subsequent methods can provide more signi\ufb01cant speedups [5, 22, 14].\nAlthough matrix (or tensor) decomposition can be bene\ufb01cial to DNN compression and speedup, these\nmethods normally incur severe accuracy loss under high compression requirement.\nVector quantization is another way to compress DNNs. Gong et al. [6] explore several such methods\nand point out the effectiveness of product quantization. HashNet proposed by Chen et al. [1] handles\nnetwork compression by grouping its parameters into hash buckets. It is trained with a standard\nbackpropagation procedure and should be able to make substantial storage savings. The recently\n\n2\n\n\fFigure 2: Overview of the dynamic network surgery for a model with parameter redundancy.\n\nproposed BinaryConnect [2] and Binarized Neural Networks [3] are able to compress DNNs by a\nfactor of 32\u00d7, while a noticeable accuracy loss is sort of inevitable.\nThis paper follows the idea of network pruning. It starts from the early work of LeCun et al.\u2019s [16],\nwhich makes use of the second derivatives of loss function to balance training loss and model\ncomplexity. As an extension, Hassibi and Stork [10] propose to take non-diagonal elements of the\nHessian matrix into consideration, producing compression results with less accuracy loss. In spite\nof their theoretical optimization, these two methods suffer from the high computational complexity\nwhen tackling large networks, regardless of the accuracy drop. Very recently, Han et al. 
[9] explore\nthe magnitude-based pruning in conjunction with retraining, and report promising compression results\nwithout accuracy loss. It has also been validated that the sparse matrix-vector multiplication can\nfurther be accelerated by certain hardware design, making it more ef\ufb01cient than traditional CPU\nand GPU calculations [7]. The drawback of Han et al.\u2019s method [9] is mostly its potential risk of\nirretrievable network damage and learning inef\ufb01ciency.\nOur research on network pruning is partly inspired by [9], not only because it can be very effective to\ncompress DNNs, but also because it makes no assumption on the network structure. In particular,\nthis branch of methods can be naturally combined with many other methods introduced above, to\nfurther reduce the network complexity. In fact, Han et al. [8] have already tested such combinations\nand obtained excellent results.\n\n3 Dynamic Network Surgery\n\nIn this section, we highlight the intuition of our method and present its implementation details. In\norder to simplify the explanations, we only talk about the convolutional layers and the fully connected\nlayers. However, as claimed in [8], our pruning method can also be applied to some other layer types\nas long as their underlying mathematical operations are inner products on vector spaces.\n\n3.1 Notations\n\nFirst of all, we clarify the notations in this paper. Suppose a DNN model can be represented as\n{Wk : 0 \u2264 k \u2264 C}, in which Wk denotes a matrix of connection weights in the kth layer. For the\nfully connected layers with p-dimensional input and q-dimensional output, the size of Wk is simply\nqk \u00d7 pk. For the convolutional layers with learnable kernels, we unfold the coef\ufb01cients of each kernel\ninto a vector and concatenate all of them to Wk as a matrix.\nIn order to represent a sparse model with part of its connections pruned away, we use {Wk, Tk : 0 \u2264\nk \u2264 C}. 
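As an illustration of this notation, the unfolding of convolutional kernels into a single weight matrix can be sketched as follows (a minimal NumPy sketch; the layer shape is a hypothetical example, not one taken from the paper):

```python
import numpy as np

# Hypothetical convolutional layer: 8 output channels, 3 input channels, 5x5 kernels.
kernels = np.random.randn(8, 3, 5, 5)

# Unfold each kernel's coefficients into a vector and stack them row-wise,
# giving a q_k x p_k matrix W_k (here 8 x 75), so convolutional and fully
# connected layers share a single matrix representation.
W_k = kernels.reshape(kernels.shape[0], -1)

# The companion mask T_k has the same shape; every connection starts unpruned.
T_k = np.ones_like(W_k)
```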
Each Tk is a binary matrix whose entries indicate the states of the network connections, i.e., whether they are currently pruned or not. Therefore, these additional matrices can be considered as mask matrices.

3.2 Pruning and Splicing

Since our goal is network pruning, the desired sparse model shall be learnt from its dense reference. Apparently, the key is to abandon unimportant parameters and keep the important ones. However, the parameter importance (i.e., the connection importance) in a certain network is extremely difficult to measure because of the mutual influences and mutual activations among interconnected neurons. That is, a network connection may be redundant due to the existence of some others, but it will soon become crucial once the others are removed. Therefore, it should be more appropriate to conduct a learning process and continually maintain the network structure.
Taking the kth layer as an example, we propose to solve the following optimization problem:

$$\min_{\mathbf{W}_k,\, \mathbf{T}_k} L\left(\mathbf{W}_k \odot \mathbf{T}_k\right) \quad \text{s.t.} \quad \mathbf{T}_k^{(i,j)} = h_k\!\left(\mathbf{W}_k^{(i,j)}\right), \;\; \forall (i,j) \in \mathcal{I}, \tag{1}$$

in which L(·) is the network loss function, ⊙ indicates the Hadamard product operator, set I consists of all the entry indices in matrix Wk, and hk(·) is a discriminative function, which satisfies hk(w) = 1 if parameter w seems to be crucial in the current layer, and 0 otherwise. Function hk(·) is designed on the basis of some prior knowledge so that it can constrain the feasible region of Wk ⊙ Tk and simplify the original NP-hard problem. For the sake of conciseness, we leave the discussion of function hk(·) to Section 3.3.
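The role of the mask in problem (1) can be sketched as a masked forward pass, where the effective weights are the Hadamard product Wk ⊙ Tk (a toy NumPy example with made-up numbers; `masked_linear` is our own illustrative helper, not code from the paper):

```python
import numpy as np

def masked_linear(x, W, T):
    """Forward pass of one layer with the effective weights W * T from
    problem (1): pruned connections (T == 0) contribute nothing."""
    return x @ (W * T).T

x = np.array([[1.0, 2.0, 3.0]])
W = np.array([[0.5, -1.0, 2.0],
              [1.0,  0.0, -0.5]])
T = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])   # two connections currently pruned
y = masked_linear(x, W, T)        # only unmasked weights take effect
```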
Problem (1) can be solved by alternately updating Wk and Tk through the stochastic gradient descent (SGD) method, which will be introduced in the following paragraphs.
Since the binary matrix Tk can be determined with the constraints in (1), we only need to investigate the update scheme of Wk. Inspired by the method of Lagrange multipliers and gradient descent, we give the following scheme for updating Wk:

$$\mathbf{W}_k^{(i,j)} \leftarrow \mathbf{W}_k^{(i,j)} - \beta \, \frac{\partial}{\partial\!\left(\mathbf{W}_k^{(i,j)} \mathbf{T}_k^{(i,j)}\right)} L\left(\mathbf{W}_k \odot \mathbf{T}_k\right), \;\; \forall (i,j) \in \mathcal{I}, \tag{2}$$

in which β indicates a positive learning rate. It is worth mentioning that we update not only the important parameters, but also the ones corresponding to zero entries of Tk, which are considered unimportant and ineffective for decreasing the network loss. This strategy is beneficial to the flexibility of our method because it enables the splicing of improperly pruned connections.
The partial derivatives in formula (2) can be calculated by the chain rule with a randomly chosen minibatch of samples. Once matrices Wk and Tk are updated, they shall be applied to re-calculate the whole network activations and the loss function gradient. By repeating these steps iteratively, the sparse model becomes able to produce excellent accuracy.
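The key point of update scheme (2) — the gradient is taken with respect to the effective weights, yet the step is applied to every entry of Wk, masked ones included — can be sketched with toy numbers (`dns_weight_update` is a hypothetical helper name; the gradient values are made up):

```python
import numpy as np

def dns_weight_update(W, T, grad_eff, beta):
    """Update scheme (2), sketched: grad_eff is the loss gradient taken with
    respect to the effective weights W * T, but the step is applied to every
    entry of W, including those currently masked out (T == 0).  Keeping the
    masked weights alive is what makes later splicing possible."""
    return W - beta * grad_eff

# Toy numbers: the second connection is pruned (T == 0), yet its weight still
# moves, so it can grow back above the splicing threshold later.
W = np.array([[0.30, -0.02]])
T = np.array([[1, 0]])
grad_eff = np.array([[0.10, -0.50]])   # hypothetical dL/d(W * T) values
W_new = dns_weight_update(W, T, grad_eff, beta=0.1)
```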
The above procedure is summarized in Algorithm 1.

Algorithm 1 Dynamic network surgery: the SGD method for solving optimization problem (1)
  Input: X: training data (with or without labels), {Ŵk : 0 ≤ k ≤ C}: the reference model, α: base learning rate, f: learning policy.
  Output: {Wk, Tk : 0 ≤ k ≤ C}: the updated parameter matrices and their binary masks.
  Initialize Wk ← Ŵk, Tk ← 1, ∀ 0 ≤ k ≤ C, β ← 1 and iter ← 0
  repeat
    Choose a minibatch of network input from X
    Forward propagation and loss calculation with (W0 ⊙ T0), ..., (WC ⊙ TC)
    Backward propagation of the model output and generation of ∇L
    for k = 0, ..., C do
      Update Tk by function hk(·) and the current Wk, with a probability of σ(iter)
      Update Wk by formula (2) and the current loss function gradient ∇L
    end for
    Update: iter ← iter + 1 and β ← f(α, iter)
  until iter reaches its desired maximum

Note that the dynamic property of our method shows in two aspects. On the one hand, pruning operations can be performed whenever the existing connections seem to become unimportant. On the other hand, mistakenly pruned connections shall be re-established if they appear to be important again. The latter operation plays a role dual to network pruning, and thus it is called "network splicing" in this paper. Pruning and splicing constitute a circular procedure by constantly updating the connection weights and setting different entries in Tk, which is analogous to the synthesis of excitatory and inhibitory neurotransmitters in the human nervous system [17].
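Algorithm 1 can be sketched as a per-layer training loop; the following is a schematic NumPy version under our own simplifying assumptions (`grad_fn`, `update_mask`, `beta_fn` and the particular form of σ(·) are hypothetical stand-ins for the loss gradient, the function hk, the learning policy f, and the trigger schedule):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(it, decay=1e-3):
    # Monotonically non-increasing trigger probability with sigma(0) = 1.
    return 1.0 / (1.0 + decay * it)

def train_dns(W, T, grad_fn, update_mask, beta_fn, max_iter):
    """One-layer sketch of Algorithm 1: grad_fn returns the loss gradient with
    respect to the effective weights W * T; update_mask plays the role of h_k;
    beta_fn implements the learning policy f(alpha, iter)."""
    for it in range(max_iter):
        # Trigger pruning/splicing only with probability sigma(it) (Section 3.4).
        if rng.random() < sigma(it):
            T = update_mask(W, T)
        # Update every entry of W, masked ones included (scheme (2)).
        W = W - beta_fn(it) * grad_fn(W * T)
    return W, T
```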
See Figure 2 for an overview of our method; the method pipeline can be found in Figure 1(a).

3.3 Parameter Importance

Since the measure of parameter importance influences the state of the network connections, the functions hk(·), ∀ 0 ≤ k ≤ C, can be essential to our dynamic network surgery. We have tested several candidates and finally found the absolute value of the input to be the best choice, as claimed in [9]. That is, the parameters with relatively small magnitude are temporarily pruned, while the others with large magnitude are kept or spliced in each iteration of Algorithm 1. Obviously, the threshold values have a significant impact on the final compression rate. For a certain layer, a single threshold could be set based on the average absolute value and variance of its connection weights. However, to improve the robustness of our method, we use two thresholds ak and bk, introducing a small margin t and setting bk = ak + t in Equation (3). For the parameters whose magnitudes fall between these two thresholds, we set the function outputs to the corresponding current entries of Tk, which means these parameters will neither be pruned nor spliced in the current iteration:

$$h_k\!\left(\mathbf{W}_k^{(i,j)}\right) = \begin{cases} 0 & \text{if } a_k > \left|\mathbf{W}_k^{(i,j)}\right| \\ \mathbf{T}_k^{(i,j)} & \text{if } a_k \le \left|\mathbf{W}_k^{(i,j)}\right| < b_k \\ 1 & \text{if } b_k \le \left|\mathbf{W}_k^{(i,j)}\right| \end{cases} \tag{3}$$

3.4 Convergence Acceleration

Considering that Algorithm 1 is a bit more complicated than the standard backpropagation method, we shall take a few more steps to boost its convergence. First of all, we suggest slowing down the pruning and splicing frequencies, because these operations lead to network structure changes. This can be done by triggering the update scheme of Tk stochastically, with a probability of p = σ(iter), rather than doing it constantly. Function σ(·) shall be monotonically non-increasing and satisfy σ(0) = 1.
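A sketch of how the two thresholds of Equation (3) might be built and applied for one layer follows. The statistics-based choice of ak and the hyper-parameters `s` and `t` are our own assumptions; the paper only states that a single threshold can be set from the mean and variance of the layer's weight magnitudes, with bk = ak + t:

```python
import numpy as np

def make_hk(W, s=1.0, t=0.05):
    """Construct the layer-wise discriminative function of Equation (3).
    a_k is set from the mean and standard deviation of the layer's weight
    magnitudes (s and t are hypothetical hyper-parameters); b_k = a_k + t."""
    mag = np.abs(W)
    a_k = mag.mean() + s * mag.std()
    b_k = a_k + t

    def h_k(W, T):
        mag = np.abs(W)
        T_new = T.copy()
        T_new[mag < a_k] = 0      # clearly unimportant: prune
        T_new[mag >= b_k] = 1     # clearly important: keep or splice
        # magnitudes inside [a_k, b_k) keep their previous state in T
        return T_new

    return h_k
```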
After a prolonged decrease, the probability p may even be set to zero, i.e., no pruning or\nsplicing will be conducted any longer.\nAnother possible reason for slow convergence is the vanishing gradient problem. Since a large\npercentage of connections are pruned away, the network structure should become much simpler\nand probably even much \"thinner\" by utilizing our method. Thus, the loss function derivatives are\nlikely to be very small, especially when the reference model is very deep. We resolve this problem\nby pruning the convolutional layers and fully connected layers separately, in the dynamic way still,\nwhich is somehow similar to [9].\n\n4 Experimental Results\n\nIn this section, we will experimentally analyse the proposed method and apply it on some popular\nnetwork models. For fair comparison and easy reproduction, all the reference models are trained by\nthe GPU implementation of Caffe package [12] with .prototxt \ufb01les provided by the community.2 Also,\nwe follow the default experimental settings for SGD method, including the training batch size, base\nlearning rate, learning policy and maximal number of training iterations. Once the reference models\nare obtained, we directly apply our method to reduce their model complexity. 
A brief summary of the compression results is shown in Table 1.

Table 1: Dynamic network surgery can remarkably reduce the model complexity of some popular networks, while the prediction error rate does not increase.

Model                      Top-1 error   Parameters   Iterations   Compression
LeNet-5 reference          0.91%         431K         10K          -
LeNet-5 pruned             0.91%         4.0K         16K          108×
LeNet-300-100 reference    2.28%         267K         10K          -
LeNet-300-100 pruned       1.99%         4.8K         25K          56×
AlexNet reference          43.42%        61M          450K         -
AlexNet pruned             43.09%        3.45M        700K         17.7×

2 Except for the simulation experiment and the LeNet-300-100 experiments, for which we create the .prototxt files ourselves, because they are not available in the Caffe model zoo.

4.1 The Exclusive-OR Problem

To begin with, we consider an experiment on synthetic data to preliminarily test the effectiveness of our method and visualize its compression quality. The exclusive-OR (XOR) problem is a good option. It is a nonlinear classification problem, as illustrated in Figure 3(a). In this experiment, we turn the original problem into a more complicated one, shown in Figure 3(b), in which Gaussian noise is mixed with the original data (0, 0), (0, 1), (1, 0) and (1, 1).

Figure 3: The Exclusive-OR (XOR) classification problem (a) without noise and (b) with noise.

In order to classify these samples, we design a network model as illustrated in the left part of Figure 4(a), which consists of 21 connections, each with a weight to be learned. The sigmoid function is chosen as the activation function for all the hidden and output neurons. Twenty thousand samples were randomly generated for the experiment, half of which were used as training samples and the rest as test samples.
After 100,000 iterations of learning, this three-layer neural network achieves a prediction error rate of 0.31%.
The weight matrix of network connections between input and hidden neurons can be found in\nFigure 4(b). Apparently, its \ufb01rst and last row share the similar elements, which means there are two\nhidden neurons functioning similarly. Hence, it is appropriate to use this model as a compression\nreference, even though it is not very large. After 150,000 iterations, the reference model will be\ncompressed into the right side of Figure 4(a), and the new connection weights and their masks are\nshown in Figure 4(b). The grey and green patches in T1 stand for those entries equal to one, and the\ncorresponding connections shall be kept. In particular, the green ones indicate the connections were\nmistakenly pruned in the beginning but spliced during the surgery. The other patches (i.e., the black\nones) indicate the corresponding connections are permanently pruned in the end.\n\n(a)\n\n(b)\n\nFigure 4: Dynamic network surgery on a three-layer neural network for the XOR problem. (a): The\nnetwork complexity is reduced to be optimal. (b) The connection weights are updated with masks.\n\nThe compressed model has a prediction error rate of 0.30%, which is slightly better than that of the\nreference model, even though 40% of its parameters are set to be zero. Note that, the remaining\nhidden neurons (excluding the bias unit) act as three different logic gates and altogether make up\n\n6\n\n\fthe XOR classi\ufb01er. However, if the pruning operations are conducted only on the initial parameter\nmagnitude (as in [9]), then probably four hidden neurons will be \ufb01nally kept, which is obviously not\nthe optimal compression result.\nIn addition, if we reduce the impact of Gaussian noises and enlarge the margin between positive and\nnegative samples, then the current model can be further compressed, so that one more hidden neuron\nwill be pruned by our method.\nSo far, we have carefully explained the mechanism behind our method and preliminarily testi\ufb01ed\nits effectiveness. 
In the following subsections, we will further test our method on three popular NN\nmodels and make quantitative comparisons with other network compression methods.\n\n4.2 The MNIST database\n\nMNIST is a database of handwritten digits and it is widely used to experimentally evaluate machine\nlearning methods. Same with [9], we test our method on two network models: LeNet-5 and LeNet-\n300-100.\nLeNet-5 is a conventional CNN model which consists of 4 learnable layers, including 2 convolutional\nlayers and 2 fully connected layers. It is designed by LeCun et al. [15] for document recognition. With\n431K parameters to be learned, we train this model for 10,000 iterations and obtain a prediction error\nrate of 0.91%. LeNet-300-100, as described in [15], is a classical feedforward neural network with\nthree fully connected layers and 267K learnable parameters. It is also trained for 10,000 iterations,\nfollowing the same learning policy as with LeNet-5. The well trained LeNet-300-100 model achieves\nan error rate of 2.28%.\nWith the proposed method, we are able to compress these two models. The same batch size, learning\nrate and learning policy are set as with the reference training processes, except for the maximal\nnumber of iterations, which is properly increased. The results are shown in Table 1. After convergence,\nthe network parameters of LeNet-5 and LeNet-300-100 are reduced by a factor of 108\u00d7 and 56\u00d7,\nrespectively, which means less than 1% and 2% of the network connections are kept, while the\nprediction accuracies are as good or slightly better.\n\nTable 2: Compare our compression results on LeNet-5 and LeNet-300-100 with that of [9]. 
The percentage of remaining parameters after applying Han et al.'s method [9] and our method is shown in the last two columns.

Model            Layer   Params.   Params.% [9]   Params.% (Ours)
LeNet-5          conv1   0.5K      ~66%           14.2%
                 conv2   25K       ~12%           3.1%
                 fc1     400K      ~8%            0.7%
                 fc2     5K        ~19%           4.3%
                 Total   431K      ~8%            0.9%
LeNet-300-100    fc1     236K      ~8%            1.8%
                 fc2     30K       ~9%            1.8%
                 fc3     1K        ~26%           5.5%
                 Total   267K      ~8%            1.8%

To better demonstrate the advantage of our method, we make layer-by-layer comparisons between our compression results and Han et al.'s [9] in Table 2. To the best of our knowledge, their method is so far the most effective pruning method, if learning inefficiency is not a concern. However, our method still achieves at least a 4× improvement in compression over theirs. Besides, owing to this significant advantage over Han et al.'s models [9], our compressed models will undoubtedly also be much faster.

4.3 ImageNet and AlexNet

In the final experiment, we apply our method to AlexNet [13], which won the ILSVRC 2012 classification competition. As with the previous experiments, we train the reference model first. Without any data augmentation, we obtain a reference model with 61M well-learned parameters after 450K iterations of training (i.e., roughly 90 epochs). Then we perform the network surgery on it.
AlexNet consists of 8 learnable layers, which is considered to be deep, so we prune the convolutional layers and the fully connected layers separately, as discussed in Section 3.4. The training batch size, base learning rate and learning policy are kept the same as in the reference training process. We run 320K iterations for the convolutional layers and 380K iterations for the fully connected layers, i.e., 700K iterations in total (roughly 140 epochs).
In the test phase, we use just the center crop and test our compressed model on the validation set.

Table 3: Comparison of different compressed models, with the Top-1 and Top-5 prediction error rates, the number of training epochs and the final compression rate.

Model                             Top-1 error   Top-5 error   Epochs   Compression
Fastfood 32 (AD) [21]             41.93%        -             -        2×
Fastfood 16 (AD) [21]             42.90%        -             -        3.7×
Naive Cut [9]                     57.18%        23.23%        0        4.4×
Han et al. [9]                    42.77%        19.67%        ≥ 960    9×
Dynamic network surgery (Ours)    43.09%        19.99%        ~140     17.7×

Table 3 compares the result of our method with some others. The four compared models are built by applying Han et al.'s method [9] and the adaptive fastfood transform method [21]. Compared with these "lossless" methods, our method achieves the best result in terms of compression rate. Besides, after an acceptable number of epochs, the prediction error rate of our model is comparable to or even better than those of models compressed from better references.
In order to make more detailed comparisons, we compare the percentage of remaining parameters in our compressed model with that of Han et al.'s [9], since they achieve the second best compression rate. As shown in Table 4, our method compresses more parameters on almost every single layer of AlexNet, which means both the storage requirement and the number of FLOPs are better reduced compared with [9].
Besides, our learning process is also much more efficient, so considerably fewer epochs are needed (at least a 6.8× decrease).

Table 4: Compare our method with [9] on AlexNet.

Layer   Params.   Params.% [9]   Params.% (Ours)
conv1   35K       ~84%           53.8%
conv2   307K      ~38%           40.6%
conv3   885K      ~35%           29.0%
conv4   664K      ~37%           32.3%
conv5   443K      ~37%           32.5%
fc1     38M       ~9%            3.7%
fc2     17M       ~9%            6.6%
fc3     4M        ~25%           4.6%
Total   61M       ~11%           5.7%

5 Conclusions

In this paper, we have investigated the compression of DNNs and proposed a novel method called dynamic network surgery. Unlike previous methods, which conduct pruning and retraining alternately, our method incorporates connection splicing into the surgery and implements the whole process in a dynamic way. By utilizing our method, most parameters in the DNN models can be deleted, while the prediction accuracy does not decrease. The experimental results show that our method compresses the number of parameters in LeNet-5 and AlexNet by factors of 108× and 17.7×, respectively, which is superior to the recent pruning method by considerable margins. Besides, the learning efficiency of our method is also better, so fewer epochs are needed.

References
[1] Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
[2] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
[3] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830v3, 2016.
[4] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas.
Predicting\n\nparameters in deep learning. In NIPS, 2013.\n\n[5] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure\n\nwithin convolutional networks for ef\ufb01cient evaluation. In NIPS, 2014.\n\n[6] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks\n\nusing vector quantization. arXiv preprint arXiv:1412.6115, 2014.\n\n[7] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J. Dally.\n\nEIE: Ef\ufb01cient inference engine on compressed deep neural network. In ISCA, 2016.\n\n[8] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with\n\npruning, trained quantization and Huffman coding. In ICLR, 2016.\n\n[9] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for ef\ufb01cient\n\nneural networks. In NIPS, 2015.\n\n[10] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon.\n\nIn NIPS, 1993.\n\n[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\n\nIn CVPR, 2016.\n\n[12] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio\nGuadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM\nMM, 2014.\n\n[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classi\ufb01cation with deep convolutional\n\nneural networks. In NIPS, 2012.\n\n[14] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up\n\nconvolutional neural networks using \ufb01ne-tuned cp-decomposition. In ICLR, 2015.\n\n[15] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to\n\ndocument recognition. 
Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[16] Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. Optimal brain\n\ndamage. In NIPS, 1989.\n\n[17] Harvey Lodish, Arnold Berk, S Lawrence Zipursky, Paul Matsudaira, David Baltimore, and James Darnell.\nMolecular Cell Biology: Neurotransmitters, Synapses, and Impulse Transmission. W. H. Freeman, 2000.\n\n[18] Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through FFTs.\n\narXiv preprint arXiv:1312.5851, 2013.\n\n[19] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\n\ntion. In ICLR, 2014.\n\n[20] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on CPUs.\n\nIn NIPS Workshop, 2011.\n\n[21] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang.\n\nDeep fried convnets. In ICCV, 2015.\n\n[22] Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Ef\ufb01cient and accurate approxima-\n\ntions of nonlinear convolutional networks. In CVPR, 2015.\n\n9\n\n\f", "award": [], "sourceid": 756, "authors": [{"given_name": "Yiwen", "family_name": "Guo", "institution": "Intel Labs China"}, {"given_name": "Anbang", "family_name": "Yao", "institution": "Intel Labs China"}, {"given_name": "Yurong", "family_name": "Chen", "institution": "Intel Labs China"}]}