{"title": "Structured Bayesian Pruning via Log-Normal Multiplicative Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 6775, "page_last": 6784, "abstract": "Dropout-based regularization methods can be regarded as injecting random noise with pre-defined magnitude to different parts of the neural network during training. It was recently shown that Bayesian dropout procedure not only improves generalization but also leads to extremely sparse neural architectures by automatically setting the individual noise magnitude per weight. However, this sparsity can hardly be used for acceleration since it is unstructured. In the paper, we propose a new Bayesian model that takes into account the computational structure of neural networks and provides structured sparsity, e.g. removes neurons and/or convolutional channels in CNNs. To do this we inject noise to the neurons outputs while keeping the weights unregularized. We establish the probabilistic model with a proper truncated log-uniform prior over the noise and truncated log-normal variational approximation that ensures that the KL-term in the evidence lower bound is computed in closed-form. The model leads to structured sparsity by removing elements with a low SNR from the computation graph and provides significant acceleration on a number of deep neural architectures. 
The model is easy to implement as it can be formulated as a separate dropout-like layer.", "full_text": "Structured Bayesian Pruning via Log-Normal Multiplicative Noise\n\nKirill Neklyudov 1,2\nk.necludov@gmail.com\n\nDmitry Molchanov 1,3\ndmolchanov@hse.ru\n\nArsenii Ashukha 1,2\naashukha@hse.ru\n\nDmitry Vetrov 1,2\ndvetrov@hse.ru\n\n1 National Research University Higher School of Economics\n2 Yandex\n3 Skolkovo Institute of Science and Technology\n\nAbstract\n\nDropout-based regularization methods can be regarded as injecting random noise with pre-defined magnitude to different parts of the neural network during training. It was recently shown that Bayesian dropout procedure not only improves generalization but also leads to extremely sparse neural architectures by automatically setting the individual noise magnitude per weight. However, this sparsity can hardly be used for acceleration since it is unstructured. In the paper, we propose a new Bayesian model that takes into account the computational structure of neural networks and provides structured sparsity, e.g. removes neurons and/or convolutional channels in CNNs. To do this we inject noise to the neurons outputs while keeping the weights unregularized. We establish the probabilistic model with a proper truncated log-uniform prior over the noise and truncated log-normal variational approximation that ensures that the KL-term in the evidence lower bound is computed in closed-form. The model leads to structured sparsity by removing elements with a low SNR from the computation graph and provides significant acceleration on a number of deep neural architectures. The model is easy to implement as it can be formulated as a separate dropout-like layer.\n\n1 Introduction\n\nDeep neural networks are a flexible family of models which provides state-of-the-art results in many machine learning problems [14, 20]. However, this flexibility often results in overfitting. 
A common solution for this problem is regularization. One of the most popular ways of regularization is Binary Dropout [19] that prevents co-adaptation of neurons by randomly dropping them during training. An equally effective alternative is Gaussian Dropout [19] that multiplies the outputs of the neurons by Gaussian random noise. In recent years several Bayesian generalizations of these techniques have been developed, e.g. Variational Dropout [8] and Variational Spike-and-Slab Neural Networks [13]. These techniques provide theoretical justification of different kinds of Dropout and also allow for automatic tuning of dropout rates, which is an important practical result.\nBesides overfitting, compression and acceleration of neural networks are other important challenges, especially when memory or computational resources are restricted. Further studies of Variational Dropout show that individual dropout rates for each weight make it possible to shrink the original network architecture and result in a highly sparse model [16]. General sparsity provides a way of neural network compression, while the time of network evaluation may remain the same, as most modern DNN-oriented software can\u2019t work with sparse matrices efficiently. At the same time, it is possible to achieve acceleration by enforcing structured sparsity in convolutional filters or data tensors. In the simplest case it means removing redundant neurons or convolutional filters instead of separate weights; but more complex patterns can also be considered. 
This way Group-wise Brain Damage [10] employs group-wise sparsity in convolutional filters, Perforated CNNs [3] drop redundant rows from the intermediate dataframe matrices that are used to compute convolutions, and Structured Sparsity Learning [24] provides a way to remove entire convolutional filters or even layers in residual networks. These methods make it possible to obtain practical acceleration with little to no modification of the existing software. In this paper, we propose a tool that is able to induce an arbitrary pattern of structured sparsity on neural network parameters or intermediate data tensors. We propose a dropout-like layer with a parametric multiplicative noise and use stochastic variational inference to tune its parameters in a Bayesian way. We introduce a proper analog of sparsity-inducing log-uniform prior distribution [8, 16] that allows us to formulate a correct probabilistic model and avoid the problems that come from using an improper prior. This way we obtain a novel Bayesian method of regularization of neural networks that results in structured sparsity. Our model can be represented as a separate dropout-like layer that allows for a simple and flexible implementation with almost no computational overhead, and can be incorporated into existing neural networks.\nOur experiments show that our model leads to high group sparsity level and significant acceleration of convolutional neural networks with negligible accuracy drop. We demonstrate the performance of our method on LeNet and VGG-like architectures using MNIST and CIFAR-10 datasets.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n2 Related Work\n\nDeep neural networks are extremely prone to overfitting, and extensive regularization is crucial. The most popular regularization methods are based on injection of multiplicative noise over layer inputs, parameters or activations [8, 19, 22]. 
Different kinds of multiplicative noise have been used in practice; the most popular choices are Bernoulli and Gaussian distributions. Another type of regularization of deep neural networks is based on reducing the number of parameters. One approach is to use low-rank approximations, e.g. tensor decompositions [4, 17], and the other approach is to induce sparsity, e.g. by pruning [5] or L1 regularization [24]. Sparsity can also be induced by using the Sparse Bayesian Learning framework with empirical Bayes [21] or with sparsity-inducing priors [12, 15, 16].\nHigh sparsity is one of the key factors for the compression of DNNs [5, 21]. However, in addition to compression it is beneficial to obtain acceleration. Recent papers propose different approaches to acceleration of DNNs, e.g. Spatial Skipped Convolutions [3] and Spatially Adaptive Computation Time [2] that propose different ways to reduce the number of computed convolutions, Binary Networks [18] that achieve speedup by using only 1 bit to store a single weight of a DNN, Low-Rank Expansions [6] that use low-rank filter approximations, and Structured Sparsity Learning [24] that makes it possible to remove separate neurons or filters. As reported in [24] it is possible to obtain acceleration of DNNs by introducing structured sparsity, e.g. by removing whole neurons, filters or layers. However, non-adaptive regularization techniques require tuning of a huge number of hyperparameters, which makes them difficult to apply in practice. In this paper we apply the Bayesian learning framework to obtain structured sparsity and focus on acceleration of neural networks.\n\n3 Stochastic Variational Inference\n\nGiven a probabilistic model $p(y \\mid x, \\theta)$ we want to tune parameters $\\theta$ of the model using training dataset $D = \\{(x_i, y_i)\\}_{i=1}^{N}$. The prior knowledge about parameters $\\theta$ is defined by prior distribution $p(\\theta)$. 
Using the Bayes rule we obtain the posterior distribution $p(\\theta \\mid D) = p(D \\mid \\theta)\\,p(\\theta)/p(D)$. However, computing posterior distribution using the Bayes rule usually involves computation of intractable integrals, so we need to use approximation techniques.\nOne of the most widely used approximation techniques is Variational Inference. In this approach the unknown distribution $p(\\theta \\mid D)$ is approximated by a parametric distribution $q_\\phi(\\theta)$ by minimization of the Kullback-Leibler divergence $\\mathrm{KL}(q_\\phi(\\theta) \\,\\|\\, p(\\theta \\mid D))$. Minimization of the KL divergence is equivalent to maximization of the variational lower bound $\\mathcal{L}(\\phi)$.\n\n$$\\mathcal{L}(\\phi) = \\mathcal{L}_D(\\phi) - \\mathrm{KL}(q_\\phi(\\theta) \\,\\|\\, p(\\theta)), \\tag{1}$$\n\n$$\\text{where } \\mathcal{L}_D(\\phi) = \\sum_{i=1}^{N} \\mathbb{E}_{q_\\phi(\\theta)} \\log p(y_i \\mid x_i, \\theta) \\tag{2}$$\n\n$\\mathcal{L}_D(\\phi)$ is a so-called expected log-likelihood function which is intractable in case of complex probabilistic model $p(y \\mid x, \\theta)$. Following [8] we use the Reparametrization trick to obtain an unbiased differentiable minibatch-based Monte Carlo estimator of the expected log-likelihood. Here $N$ is the total number of objects, $M$ is the minibatch size, and $f(\\phi, \\varepsilon)$ provides samples from the approximate posterior $q_\\phi(\\theta)$ as a deterministic function of a non-parametric noise $\\varepsilon \\sim p(\\varepsilon)$.\n\n$$\\mathcal{L}_D(\\phi) \\simeq \\mathcal{L}_D^{SGVB}(\\phi) = \\frac{N}{M} \\sum_{k=1}^{M} \\log p(y_{i_k} \\mid x_{i_k}, \\theta_{i_k} = f(\\phi, \\varepsilon_{i_k})) \\tag{3}$$\n\n$$\\mathcal{L}(\\phi) \\simeq \\mathcal{L}^{SGVB}(\\phi) = \\mathcal{L}_D^{SGVB}(\\phi) - \\mathrm{KL}(q_\\phi(\\theta) \\,\\|\\, p(\\theta)) \\tag{4}$$\n\n$$\\nabla_\\phi \\mathcal{L}_D(\\phi) \\simeq \\nabla_\\phi \\mathcal{L}_D^{SGVB}(\\phi) \\tag{5}$$\n\nThis way we obtain a procedure of approximate Bayesian inference where we solve optimization problem (4) by stochastic gradient ascent w.r.t. variational parameters $\\phi$. This procedure can be efficiently applied to Deep Neural Networks and usually the computational overhead is very small, as compared to ordinary DNNs.\nIf the model $p(y \\mid x, \\theta, w)$ has another set of parameters $w$ that we do not want to be Bayesian about, we can still use the same variational lower bound objective:\n\n$$\\mathcal{L}(\\phi, w) = \\mathcal{L}_D(\\phi, w) - \\mathrm{KL}(q_\\phi(\\theta) \\,\\|\\, p(\\theta)) \\to \\max_{\\phi, w}, \\tag{6}$$\n\n$$\\text{where } \\mathcal{L}_D(\\phi, w) = \\sum_{i=1}^{N} \\mathbb{E}_{q_\\phi(\\theta)} \\log p(y_i \\mid x_i, \\theta, w) \\tag{7}$$\n\nThis objective corresponds to the maximum likelihood estimation $w_{ML}$ of parameters $w$, while finding the approximate posterior distribution $q_\\phi(\\theta) \\approx p(\\theta \\mid D, w_{ML})$. In this paper we denote the weights of the neural networks, the biases, etc. as $w$ and find their maximum likelihood estimation as described above. The parameters $\\theta$ that undergo the Bayesian treatment are the noisy masks in the proposed dropout-like layer (SBP layer). They are described in the following section.\n\n4 Group Sparsity with Log-normal Multiplicative Noise\n\nVariational Inference with a sparsity-inducing log-uniform prior over the weights of a neural network is an efficient way to enforce general sparsity on weight matrices [16]. However, it is difficult to apply this approach to explicitly enforce structured sparsity. We introduce a dropout-like layer with a certain kind of multiplicative noise. We also make use of the sparsity-inducing log-uniform prior, but put it over the noise variables rather than weights. By sharing those noise variables we can enforce group-wise sparsity with any form of groups.\n\n4.1 Variational Inference for Group Sparsity Model\nWe consider a single dropout-like layer with an input vector $x \\in \\mathbb{R}^I$ that represents one object with $I$ features, and an output vector $y \\in \\mathbb{R}^I$ of the same size. The input vector $x$ is usually supposed to come from the activations of the preceding layer. The output vector $y$ would then serve as an input vector for the following layer. We follow the general way to build dropout-like layers (8). Each input feature $x_i$ is multiplied by a noise variable $\\theta_i$ that comes from some distribution $p_{noise}(\\theta)$. 
For example, for Binary Dropout $p_{noise}(\\theta)$ would be a fully factorized Bernoulli distribution with $p_{noise}(\\theta_i) = \\mathrm{Bernoulli}(p)$, and for Gaussian dropout it would be a fully-factorized Gaussian distribution with $p_{noise}(\\theta_i) = \\mathcal{N}(1, \\alpha)$.\n\n$$y_i = x_i \\cdot \\theta_i, \\qquad \\theta \\sim p_{noise}(\\theta) \\tag{8}$$\n\nNote that if we have a minibatch $X^{M \\times I}$ of $M$ objects, we would independently sample a separate noise vector $\\theta^m$ for each object $x_m$. This would be the case throughout the paper, but for the sake of simplicity we would consider a single object $x$ in all following formulas. Also note that the noise $\\theta$ is usually only sampled during the training phase. A common approximation during the testing phase is to use the expected value $\\mathbb{E}\\theta$ instead of sampling $\\theta$. All implementation details are provided and discussed in Section 4.5.\nWe follow a Bayesian treatment of the variable $\\theta$, as described in Section 3. In order to obtain a sparse solution, we choose the prior distribution $p(\\theta)$ to be a fully-factorized improper log-uniform distribution. We denote this distribution as $\\mathrm{LogU}_\\infty(\\cdot)$ to stress that it has infinite domain. This distribution is known for its sparsification properties and works well in practice for deep neural networks [16].\n\n$$p(\\theta) = \\prod_{i=1}^{I} p(\\theta_i), \\qquad p(\\theta_i) = \\mathrm{LogU}_\\infty(\\theta_i) \\propto \\frac{1}{\\theta_i}, \\qquad \\theta_i > 0 \\tag{9}$$\n\nIn order to train the model, i.e. perform variational inference, we need to choose an approximation family $q_\\phi$ for the posterior distribution $p(\\theta \\mid D) \\approx q_\\phi(\\theta)$.\n\n$$q(\\theta) = \\prod_{i=1}^{I} q(\\theta_i \\mid \\mu_i, \\sigma_i) = \\prod_{i=1}^{I} \\mathrm{LogN}(\\theta_i \\mid \\mu_i, \\sigma_i^2) \\tag{10}$$\n\n$$\\theta_i \\sim \\mathrm{LogN}(\\theta_i \\mid \\mu_i, \\sigma_i^2) \\iff \\log \\theta_i \\sim \\mathcal{N}(\\log \\theta_i \\mid \\mu_i, \\sigma_i^2) \\tag{11}$$\n\nA common choice of variational distribution $q(\\cdot)$ is a fully-factorized Gaussian distribution. However, for this particular model we choose $q(\\theta)$ to be a fully-factorized log-normal distribution (10\u201311). To make this choice, we were guided by the following reasons:\n\n\u2022 The log-uniform distribution is a specific case of the log-normal distribution when the parameter $\\sigma$ goes to infinity and $\\mu$ remains fixed. Thus we can guarantee that in the case of no data our variational approximation can be made exact. Hence this variational family has no \"prior gap\".\n\n\u2022 We consider a model with multiplicative noise. The scale of this noise corresponds to its shift in the logarithmic space. By establishing the log-uniform prior we set no preferences on different scales of this multiplicative noise. The usual use of a Gaussian as a posterior immediately implies very asymmetric skewed distribution in the logarithmic space. Moreover log-uniform and Gaussian distributions have different supports and that will require establishing two log-uniform distributions for positive and negative noises. In this case Gaussian variational approximation would have quite exotic bi-modal form (one mode in the log-space of positive noises and another one in the log-space of negative noises). On the other hand, the log-normal posterior for the multiplicative noise corresponds to a Gaussian posterior for the additive noise in the logarithmic scale, which is much easier to interpret.\n\n\u2022 Log-normal noise is always non-negative both during training and testing phase, therefore it does not change the sign of its input. This is in contrast to Gaussian multiplicative noise $\\mathcal{N}(\\theta_i \\mid 1, \\alpha)$ that is a standard choice for Gaussian dropout and its modifications [8, 19, 23]. During the training phase Gaussian noise can take negative values, so the input to the following layer can be of arbitrary sign. 
However, during the testing phase noise $\\theta$ is equal to 1, so the input to the following layer is non-negative with many popular non-linearities (e.g. ReLU, sigmoid, softplus). Although Gaussian dropout works well in practice, it is difficult to justify notoriously different input distributions during training and testing phases.\n\n\u2022 The log-normal approximate posterior is tractable. Specifically, the KL divergence term $\\mathrm{KL}(\\mathrm{LogN}(\\theta \\mid \\mu, \\sigma^2) \\,\\|\\, \\mathrm{LogU}_\\infty(\\theta))$ can be computed analytically.\n\nThe final loss function is presented in equation (12) and is essentially the original variational lower bound (4).\n\n$$\\mathcal{L}^{SGVB}(\\phi) = \\mathcal{L}_D^{SGVB}(\\mu, \\sigma, W) - \\mathrm{KL}(q(\\theta \\mid \\mu, \\sigma) \\,\\|\\, p(\\theta)) \\to \\max_{\\mu, \\sigma, W}, \\tag{12}$$\n\nwhere $\\mu$ and $\\sigma$ are the variational parameters, and $W$ denotes all other trainable parameters of the neural network, e.g. the weight matrices, the biases, batch normalization parameters, etc.\nNote that we can optimize the variational lower bound w.r.t. the parameters $\\mu$ and $\\sigma$ of the log-normal noise $\\theta$. We do not fix the mean of the noise, thus making our variational approximation tighter.\n\n4.2 Problems of Variational Inference with Improper Log-Uniform Prior\nThe log-normal posterior in combination with a log-uniform prior has a number of attractive features. However, the maximization of the variational lower bound with a log-uniform prior and a log-normal posterior is an ill-posed optimization problem. As the log-uniform distribution is an improper prior, the KL-divergence between a log-normal distribution $\\mathrm{LogN}(\\mu, \\sigma^2)$ and a log-uniform distribution $\\mathrm{LogU}_\\infty$ is infinite for any finite value of parameters $\\mu$ and $\\sigma$.\n\n$$\\mathrm{KL}(\\mathrm{LogN}(x \\mid \\mu, \\sigma^2) \\,\\|\\, \\mathrm{LogU}_\\infty(x)) = C - \\log \\sigma, \\qquad C = +\\infty \\tag{13}$$\n\nA common way to tackle this problem is to consider the density of the log-uniform distribution to be equal to $C/\\theta$ and to treat $C$ as some finite constant. This trick works well for the case of a Gaussian posterior distribution [8, 16]. The KL divergence between a Gaussian posterior and a log-uniform prior has an infinite gap, but can be calculated up to this infinite constant in a meaningful way [16]. However, for the case of the log-normal posterior the KL divergence is infinite for any finite values of variational parameters, and is equal to zero for a fixed finite $\\mu$ and infinite $\\sigma$. As the data-term (3) is bounded for any value of variational parameters, the only global optimum of the variational lower bound is achieved when $\\mu$ is finite and fixed, and $\\sigma$ goes to infinity. In this case the posterior distribution collapses into the prior distribution and the model fails to extract any information about the data. This effect is wholly caused by the fact that the log-uniform prior is an improper (non-normalizable) distribution, which makes the whole probabilistic model flawed.\n\n4.3 Variational Inference with Truncated Approximation Family\nDue to the improper prior the optimization problem becomes ill-posed. But do we really need to use an improper prior distribution? The most common number format that is used to represent the parameters of a neural network is the floating-point format. The floating-point format is only able to represent numbers from a limited range. For example, a single-precision variable can only represent numbers from the range $-3.4 \\times 10^{38}$ to $+3.4 \\times 10^{38}$, and the smallest positive normalized number is approximately $1.2 \\times 10^{-38}$. All of probability mass of the improper log-uniform prior is concentrated beyond the single-precision range (and essentially any practical floating point precision), not to mention that the actual relevant range of values of neural network parameters is much smaller. 
It means that in practice this prior is not a good choice for software implementation of neural networks.\nWe propose to use a truncated log-uniform distribution (14) as a proper analog of the log-uniform distribution. Here $I_{[a,b]}(x)$ denotes the indicator function for the interval $x \\in [a, b]$. The posterior distribution should be defined on the same support as the prior distribution, so we also need to use a truncated log-normal distribution (14).\n\n$$\\mathrm{LogU}_{[a,b]}(\\theta_i) \\propto \\mathrm{LogU}_\\infty(\\theta_i) \\cdot I_{[a,b]}(\\log \\theta_i), \\qquad \\mathrm{LogN}_{[a,b]}(\\theta_i) \\propto \\mathrm{LogN}(\\theta_i \\mid \\mu_i, \\sigma_i^2) \\cdot I_{[a,b]}(\\log \\theta_i) \\tag{14}$$\n\nOur final model then can be formulated as follows.\n\n$$y_i = x_i \\cdot \\theta_i, \\qquad p(\\theta_i) = \\mathrm{LogU}_{[a,b]}(\\theta_i), \\qquad q(\\theta_i \\mid \\mu_i, \\sigma_i) = \\mathrm{LogN}_{[a,b]}(\\theta_i \\mid \\mu_i, \\sigma_i^2) \\tag{15}$$\n\nNote that all the nice facts about the log-normal posterior distribution from Section 4.1 are also true for the truncated log-normal posterior. However, now we have a proper probabilistic model and the Stochastic Variational Inference can be performed correctly. Unlike (13), now the KL divergence term (16\u201317) can be calculated correctly for all valid values of variational parameters (see Appendix A for details).\n\n$$\\mathrm{KL}(q(\\theta \\mid \\mu, \\sigma) \\,\\|\\, p(\\theta)) = \\sum_{i=1}^{I} \\mathrm{KL}(q(\\theta_i \\mid \\mu_i, \\sigma_i) \\,\\|\\, p(\\theta_i)) \\tag{16}$$\n\n$$\\mathrm{KL}(q(\\theta_i \\mid \\mu_i, \\sigma_i) \\,\\|\\, p(\\theta_i)) = \\log \\frac{b - a}{\\sqrt{2 \\pi e \\sigma_i^2}} - \\log(\\Phi(\\beta_i) - \\Phi(\\alpha_i)) - \\frac{\\alpha_i \\phi(\\alpha_i) - \\beta_i \\phi(\\beta_i)}{2(\\Phi(\\beta_i) - \\Phi(\\alpha_i))}, \\tag{17}$$\n\nwhere $\\alpha_i = \\frac{a - \\mu_i}{\\sigma_i}$, $\\beta_i = \\frac{b - \\mu_i}{\\sigma_i}$, and $\\phi(\\cdot)$ and $\\Phi(\\cdot)$ are the density and the CDF of the standard normal distribution.\nThe reparameterization trick also can still be performed (18) using the inverse CDF of the truncated normal distribution (see Appendix B).\n\n$$\\theta_i = \\exp\\left(\\mu_i + \\sigma_i \\Phi^{-1}\\left(\\Phi(\\alpha_i) + (\\Phi(\\beta_i) - \\Phi(\\alpha_i))\\, y_i\\right)\\right), \\quad \\text{where } y_i \\sim \\mathcal{U}(y \\mid 0, 1) \\tag{18}$$\n\nThe final loss and the set of parameters is the same as described in Section 4.1, and the training procedure remains the same.\n\n4.4 Sparsity\nLog-uniform prior is known to lead to a sparse solution [16]. In the variational dropout paper authors interpret the parameter $\\alpha$ of the multiplicative noise $\\mathcal{N}(1, \\alpha)$ as a Gaussian dropout rate and use it as a thresholding criterion for weight pruning. Unlike the binary or Gaussian dropout, in the truncated log-normal model there is no \"dropout rate\" variable. However, we can use the signal-to-noise ratio $\\mathbb{E}\\theta / \\sqrt{\\mathrm{Var}(\\theta)}$ (SNR) for thresholding.\n\n$$\\mathrm{SNR}(\\theta_i) = \\frac{(\\Phi(\\sigma_i - \\alpha_i) - \\Phi(\\sigma_i - \\beta_i)) / \\sqrt{\\Phi(\\beta_i) - \\Phi(\\alpha_i)}}{\\sqrt{\\exp(\\sigma_i^2)(\\Phi(2\\sigma_i - \\alpha_i) - \\Phi(2\\sigma_i - \\beta_i)) - (\\Phi(\\sigma_i - \\alpha_i) - \\Phi(\\sigma_i - \\beta_i))^2}} \\tag{19}$$\n\nThe SNR can be computed analytically; the derivation can be found in the appendix. It has a simple interpretation. If the SNR is low, the corresponding neuron becomes very noisy and its output no longer contains any useful information. If the SNR is high, it means that the neuron output contains little noise and is important for prediction. Therefore we can remove all neurons or filters with a low SNR and set their output to constant zero.\n\n4.5 Implementation details\nWe perform a minibatch-based stochastic variational inference for training. The training procedure looks as follows. On each training step we take a minibatch of $M$ objects and feed it into the neural network. Consider a single SBP layer with input $X^{M \\times I}$ and output $Y^{M \\times I}$. We independently sample a separate noise vector $\\theta^m \\sim q(\\theta)$ for each object $x_m$ and obtain a noise matrix $\\theta^{M \\times I}$. The output matrix $Y^{M \\times I}$ is then obtained by component-wise multiplication of the input matrix and the noise matrix: $y_{mi} = x_{mi} \\cdot \\theta^m_i$.\nTo be fully Bayesian, one would also sample and average over different dropout masks $\\theta$ during testing, i.e. perform Bayesian ensembling. Although this procedure can be used to slightly improve the final accuracy, it is usually avoided. Bayesian ensembling essentially requires sampling of different copies of neural networks, which makes the evaluation $K$ times slower for averaging over $K$ samples. Instead, during the testing phase in most dropout-based techniques the noise variable $\\theta$ is replaced with its expected value. In this paper we follow the same approach and replace all non-pruned $\\theta_i$ with their expectations (20) during testing. The derivation of the expectation of the truncated log-normal distribution is presented in Appendix C.\n\n$$\\mathbb{E}\\theta_i = \\frac{\\exp(\\mu_i + \\sigma_i^2/2)}{\\Phi(\\beta_i) - \\Phi(\\alpha_i)} \\left[ \\Phi\\left(\\frac{\\sigma_i^2 + \\mu_i - a}{\\sigma_i}\\right) - \\Phi\\left(\\frac{\\sigma_i^2 + \\mu_i - b}{\\sigma_i}\\right) \\right] \\tag{20}$$\n\nWe tried to use Bayesian ensembling with this model, and experienced almost no gain of accuracy. 
It means that the variance of the learned approximate posterior distribution is low and does not provide a rich ensemble.\nThroughout the paper we introduced the SBP dropout layer for the case when input objects are represented as one-dimensional vectors $x$. When defined like that, it would induce general sparsity on the input vector $x$. It works as intended for fully-connected layers, as a single input feature corresponds to a single output neuron of a preceding fully-connected layer and a single output neuron of the following layer. However, it is possible to apply the SBP layer in a more generic setting. Firstly, if the input object is represented as a multidimensional tensor $X$ with shape $I_1 \\times I_2 \\times \\dots \\times I_d$, the noise vector $\\theta$ of length $I = I_1 \\times I_2 \\times \\dots \\times I_d$ can be reshaped into a tensor with the same shape. Then the output tensor $Y$ can be obtained as a component-wise product of the input tensor $X$ and the noise tensor $\\theta$. Secondly, the SBP layer can induce any form of structured sparsity on this input tensor $X$. To do it, one would simply need to use a single random variable $\\theta_i$ for the group of input features that should be removed simultaneously. For example, consider an input tensor $X^{H \\times W \\times C}$ that comes from a convolutional layer, $H$ and $W$ being the size of the image, and $C$ being the number of channels. Then, in order to remove redundant filters from the preceding layer (and at the same time redundant channels from the following layer), one needs to share the random variables $\\theta$ in the following way:\n\n$$\\theta_c \\sim \\mathrm{LogN}_{[a,b]}(\\theta_c \\mid \\mu_c, \\sigma_c^2), \\qquad y_{hwc} = x_{hwc} \\cdot \\theta_c \\tag{21}$$\n\nNote that now there is one sample $\\theta \\in \\mathbb{R}^C$ for one object $X^{H \\times W \\times C}$ on each training step. If the signal-to-noise ratio becomes lower than 1 for a component $\\theta_c$, that would mean that we can permanently remove the $c$-th channel of the input tensor, and therefore delete the $c$-th filter from the preceding layer and the $c$-th channel from the following layer. All the experiments with convolutional architectures used this formulation of SBP. This is a general approach that is not limited to reducing the shape of the input tensor. It is possible to obtain any fixed pattern of group-wise sparsity using this technique.\nSimilarly, the SBP layer can be applied in a DropConnect fashion. One would just need to multiply the weight tensor $W$ by a noise tensor $\\theta$ of similar shape. The training procedure remains the same. It is still possible to enforce any structured sparsity pattern for the weight tensor $W$ by sharing the random variables as described above.\n\nFigure 1: The value of the SGVB for the case of fixed variational parameter $\\mu = 0$ (blue line) and for the case when both variational parameters $\\mu$ and $\\sigma$ are trained (green line).\n\nFigure 2: The learned signal-to-noise ratio for image features on the MNIST dataset.\n\n5 Experiments\nWe perform an evaluation on different supervised classification tasks and with different architectures of neural networks including deep VGG-like architectures with batch normalization layers. For each architecture, we report the number of retained neurons and filters, and obtained acceleration. Our experiments show that Structured Bayesian Pruning leads to a high level of structured sparsity in convolutional filters and neurons of DNNs without significant accuracy drop. We also demonstrate that optimization w.r.t. 
the full set of variational parameters $(\\mu, \\sigma)$ leads to improving model quality and allows us to perform sparsification in a more efficient way, as compared to tuning of only one free parameter that corresponds to the noise variance. As a nice bonus, we show that Structured Bayesian Pruning network does not overfit on randomly labeled data, which is a common weakness of non-Bayesian dropout networks. The source code is available in Theano [7] and Lasagne, and also in TensorFlow [1] (https://github.com/necludov/group-sparsity-sbp).\n\n5.1 Experiment Setup\n\nThe truncation parameters $a$ and $b$ are the hyperparameters of our model. As our layer is meant for regularization of the model, we would like our layer not to amplify the input signal and restrict the noise $\\theta$ to an interval [0, 1]. This choice corresponds to the right truncation threshold $b$ set to 0. We find empirically that the left truncation parameter $a$ does not influence the final result much. We use values $a = -20$ and $b = 0$ in all experiments.\nWe define redundant neurons by the signal-to-noise ratio of the corresponding multiplicative noise $\\theta$. See Section 4.4 for more details. By removing all neurons and filters with the SNR < 1 we experience no accuracy drop in all our experiments. SBP dropout layers were put after each convolutional layer to remove its filters, and before each fully-connected layer to remove its input neurons. As one filter of the last convolutional layer usually corresponds to a group of neurons in the following dense layer, it means that we can remove more input neurons in the first dense layer. Note that it means that we have two consecutive dropout layers between the last convolutional layer and the first fully-connected layer in CNNs, and a dropout layer before the first fully-connected layer in FC networks (see Fig. 
2).\n\nTable 1: Comparison of different structured sparsity inducing techniques on LeNet-5-Caffe and LeNet-500-300 architectures. SSL [24] is based on group lasso regularization, SparseVD [16] is a Bayesian model with a log-uniform prior that induces weight-wise sparsity. For SparseVD a neuron/filter is considered pruned, if all its weights are set to 0. Our method provides the highest speed-up with a similar accuracy. We report acceleration that was computed on CPU (Intel Xeon E5-2630), GPU (Tesla K40) and in terms of Floating Point Operations (FLOPs).\n\nNetwork | Method | Error % | Neurons per Layer | CPU | GPU | FLOPs\nLeNet-500-300 | Original | 1.54 | 784-500-300-10 | 1.00\u00d7 | 1.00\u00d7 | 1.00\u00d7\nLeNet-500-300 | SparseVD | 1.57 | 537-217-130-10 | 1.19\u00d7 | 1.03\u00d7 | 3.73\u00d7\nLeNet-500-300 | SSL | 1.49 | 434-174-78-10 | 2.21\u00d7 | 1.04\u00d7 | 6.06\u00d7\nLeNet-500-300 | StructuredBP (ours) | 1.55 | 245-160-55-10 | 2.33\u00d7 | 1.08\u00d7 | 11.23\u00d7\nLeNet5-Caffe | Original | 0.80 | 20-50-800-500 | 1.00\u00d7 | 1.00\u00d7 | 1.00\u00d7\nLeNet5-Caffe | SparseVD | 0.75 | 17-32-329-75 | 1.48\u00d7 | 1.41\u00d7 | 2.19\u00d7\nLeNet5-Caffe | SSL | 1.00 | 3-12-800-500 | 5.17\u00d7 | 1.80\u00d7 | 3.90\u00d7\nLeNet5-Caffe | StructuredBP (ours) | 0.86 | 3-18-284-283 | 5.41\u00d7 | 1.91\u00d7 | 10.49\u00d7\n\nTable 2: Comparison of different structured sparsity inducing techniques (SparseVD [16]) on VGG-like architectures on CIFAR-10 dataset. StructuredBP stands for the original SBP model, and StructuredBPa stands for the SBP model with KL scaling. 
k is a width scale factor that determines the number of neurons or filters on each layer of the network (width(k) = k \u00d7 original width).\n\nk | Method | Error % | Units per Layer | CPU | GPU | FLOPs\n1.0 | Original | 7.2 | 64-64-128-128-256-256-256-512-512-512-512-512-512-512 | 1.00\u00d7 | 1.00\u00d7 | 1.00\u00d7\n1.0 | SparseVD | 7.2 | 64-62-128-126-234-155-31-81-76-9-138-101-413-373 | 2.50\u00d7 | 1.69\u00d7 | 2.27\u00d7\n1.0 | StructuredBP (ours) | 7.5 | 64-62-128-126-234-155-31-79-73-9-59-73-56-27 | 2.71\u00d7 | 1.74\u00d7 | 2.30\u00d7\n1.0 | StructuredBPa (ours) | 9.0 | 44-54-92-115-234-155-31-76-55-9-34-35-21-280 | 3.68\u00d7 | 2.06\u00d7 | 3.16\u00d7\n1.5 | Original | 6.8 | 96-96-192-192-384-384-384-768-768-768-768-768-768-768 | 1.00\u00d7 | 1.00\u00d7 | 1.00\u00d7\n1.5 | SparseVD | 7.0 | 96-78-191-146-254-126-27-79-74-9-137-100-416-479 | 3.35\u00d7 | 2.16\u00d7 | 3.27\u00d7\n1.5 | StructuredBP (ours) | 7.2 | 96-77-190-146-254-126-26-79-70-9-71-82-79-49 | 3.63\u00d7 | 2.17\u00d7 | 3.32\u00d7\n1.5 | StructuredBPa (ours) | 7.8 | 77-74-161-146-254-125-26-78-66-9-47-55-54-237 | 4.47\u00d7 | 2.47\u00d7 | 3.93\u00d7\n\n5.2 More Flexible Variational Approximation\n\nUsually during automatic training of dropout rates the mean of the noise distribution remains fixed. In the case of our model it is possible to train both mean and variance of the multiplicative noise. By using a more flexible distribution we obtain a tighter variational lower bound and a higher sparsity level. In order to demonstrate this effect, we performed an experiment on MNIST dataset with a fully connected neural network that contains two hidden layers with 1000 neurons each. The results are presented in Fig. 1.\n\n5.3 LeNet5 and Fully-Connected Net on MNIST\n\nWe compare our method with other sparsity inducing methods on the MNIST dataset using a fully connected architecture LeNet-500-300 and a convolutional architecture LeNet-5-Caffe. 
These\nnetworks were trained with Adam without any data augmentation. The LeNet-500-300 network\nwas trained from scratch, and the LeNet-5-Caffe\u00b9 network was pretrained with weight decay. An\nillustration of the trained SNR for the image features for the LeNet-500-300\u00b2 network is shown in\nFig. 2. The final accuracy, group-wise sparsity levels and speedup for these architectures for different\nmethods are shown in Table 1.\n\n\u00b9 A modified version of LeNet5 from [11]. Caffe Model specification: https://goo.gl/4yI3dL\n\u00b2 Fully connected neural net with two hidden layers that contain 500 and 300 neurons respectively.\n\n5.4 VGG-like on CIFAR-10\n\nTo show that SBP scales to deep architectures, we apply it to a VGG-like network [25] that was\nadapted for the CIFAR-10 [9] dataset. The network consists of 13 convolutional and two fully-connected\nlayers, trained with pre-activation batch normalization and Binary Dropout. At the start of\nthe training procedure, we use pre-trained weights for initialization. Results with different scaling\nof the number of units are presented in Table 2. We present results for two architectures with\ndifferent scaling coefficients k \u2208 {1.0, 1.5}. For smaller values of the scaling coefficient, k \u2208 {0.25, 0.5},\nwe obtain less sparse architectures since these networks have small learning capacities. Besides\nthe results for the standard StructuredBP procedure, we also provide the results for SBP with KL\nscaling (StructuredBPa). Scaling the KL term of the variational lower bound proportionally to the\ncomputational complexity of the layer leads to a higher sparsity level for the first layers, providing\nmore acceleration. 
Despite the higher error values, KL scaling yields a higher value of the true variational\nlower bound; hence, it finds a different local maximum of the bound.\n\n5.5 Random Labels\n\nA recent work shows that Deep Neural Networks have so much capacity that they can easily memorize\nthe data even with random labeling [26]. Binary dropout as well as other standard regularization\ntechniques do not prevent the networks from overfitting in this scenario. However, recently it was\nshown that Bayesian regularization may help [16]. Following these works, we conducted similar\nexperiments. We used a LeNet5 network on the MNIST dataset and a VGG-like network on CIFAR-10.\nAlthough Binary Dropout does not prevent these networks from overfitting, SBP removes\nall neurons of the neural network and provides a constant prediction. In other words, in this case SBP\nchooses the simplest model that achieves the same test error rate. This is another confirmation that\nBayesian regularization is more powerful than other popular regularization techniques.\n\n6 Conclusion\n\nWe propose Structured Bayesian Pruning, or SBP, a dropout-like layer that induces multiplicative\nrandom noise over the output of the preceding layer. We put a sparsity-inducing prior over the noise\nvariables and tune the noise distribution using stochastic variational inference. The SBP layer can induce\nan arbitrary structured sparsity pattern over its input and provides adaptive regularization. We apply\nSBP to cut down the number of neurons and filters in convolutional neural networks and report\nsignificant practical acceleration with no modification of the existing software implementation of\nthese architectures.\n\nAcknowledgments\n\nWe would like to thank Christos Louizos and Max Welling for valuable discussions. 
Kirill Neklyudov\nand Arsenii Ashukha were supported by HSE International lab of Deep Learning and Bayesian Methods,\nwhich is funded by the Russian Academic Excellence Project \u20185-100\u2019. Dmitry Molchanov was\nsupported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001).\nDmitry Vetrov was supported by the Russian Science Foundation grant 17-11-01027.\n\nReferences\n\n[1] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.\n\n[2] Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. arXiv preprint arXiv:1612.02297, 2016.\n\n[3] Mikhail Figurnov, Aizhan Ibraimova, Dmitry P Vetrov, and Pushmeet Kohli. Perforatedcnns: Acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pages 947\u2013955, 2016.\n\n[4] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. arXiv preprint arXiv:1611.03214, 2016.\n\n[5] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.\n\n[6] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.\n\n[7] James Bergstra, Olivier Breuleux, Fr\u00e9d\u00e9ric Bastien, Pascal Lamblin, and Razvan Pascanu. Theano: a cpu and gpu math expression compiler. 
In Proceedings of the Python for Scientific Computing Conference (SciPy).\n\n[8] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2575\u20132583. Curran Associates, Inc., 2015.\n\n[9] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.\n\n[10] Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2554\u20132564, 2016.\n\n[11] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[12] Ekaterina Lobacheva, Nadezhda Chirkova, and Dmitry Vetrov. Bayesian sparsification of recurrent neural networks. arXiv preprint arXiv:1708.00077, 2017.\n\n[13] Christos Louizos. Smart regularization of deep architectures. Master\u2019s thesis, University of Amsterdam, 2015.\n\n[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.\n\n[15] Dmitry Molchanov, Arseniy Ashuha, and Dmitry Vetrov. Dropout-based automatic relevance determination. In Bayesian Deep Learning workshop, NIPS, 2016.\n\n[16] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.\n\n[17] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442\u2013450, 2015.\n\n[18] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 
Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525\u2013542. Springer, 2016.\n\n[19] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929\u20131958, 2014.\n\n[20] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1\u20139, 2015.\n\n[21] Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.\n\n[22] Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058\u20131066, 2013.\n\n[23] Sida I Wang and Christopher D Manning. Fast dropout training. In ICML (2), pages 118\u2013126, 2013.\n\n[24] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074\u20132082, 2016.\n\n[25] Sergey Zagoruyko. 92.45 on cifar-10 in torch, 2015.\n\n[26] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 
arXiv preprint arXiv:1611.03530, 2016.\n", "award": [], "sourceid": 3407, "authors": [{"given_name": "Kirill", "family_name": "Neklyudov", "institution": "Yandex"}, {"given_name": "Dmitry", "family_name": "Molchanov", "institution": "Yandex"}, {"given_name": "Arsenii", "family_name": "Ashukha", "institution": null}, {"given_name": "Dmitry", "family_name": "Vetrov", "institution": "Higher School of Economics, Yandex"}]}