{"title": "Channel Gating Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1886, "page_last": 1896, "abstract": "This paper introduces channel gating, a dynamic, fine-grained, and hardware\uff0defficient pruning scheme to reduce the computation cost for convolutional neural networks (CNNs). Channel gating identifies regions in the features that contribute less to the classification result, and skips the computation on a subset of the input channels for these ineffective regions. Unlike static network pruning, channel gating optimizes CNN inference at run-time by exploiting input-specific characteristics, which allows substantially reducing the compute cost with almost no accuracy loss. We experimentally show that applying channel gating in state-of-the-art networks achieves 2.7-8.0x reduction in floating-point operations (FLOPs) and 2.0-4.4x reduction in off-chip memory accesses with a minimal accuracy loss on CIFAR-10. Combining our method with knowledge distillation reduces the compute cost of ResNet-18 by 2.6x without accuracy drop on ImageNet. We further demonstrate that channel gating can be realized in hardware efficiently. Our approach exhibits sparsity patterns that are well-suited to dense systolic arrays with minimal additional hardware. We have designed an accelerator for channel gating networks, which can be implemented using either FPGAs or ASICs. Running a quantized ResNet-18 model for ImageNet, our accelerator achieves an encouraging speedup of 2.4x on average, with a theoretical FLOP reduction of 2.8x.", "full_text": "Channel Gating Neural Networks\n\nWeizhe Hua\n\nwh399@cornell.edu\n\nYuan Zhou\n\nyz882@cornell.edu\n\nChristopher De Sa\n\ncdesa@cornell.edu\n\nZhiru Zhang\n\nzhiruz@cornell.edu\n\nG. 
Edward Suh\n\ngs272@cornell.edu\n\nAbstract\n\nThis paper introduces channel gating, a dynamic, \ufb01ne-grained, and hardware-\nef\ufb01cient pruning scheme to reduce the computation cost for convolutional neural\nnetworks (CNNs). Channel gating identi\ufb01es regions in the features that contribute\nless to the classi\ufb01cation result, and skips the computation on a subset of the input\nchannels for these ineffective regions. Unlike static network pruning, channel\ngating optimizes CNN inference at run-time by exploiting input-speci\ufb01c charac-\nteristics, which allows substantially reducing the compute cost with almost no\naccuracy loss. We experimentally show that applying channel gating in state-of-\nthe-art networks achieves 2.7-8.0\u00d7 reduction in \ufb02oating-point operations (FLOPs)\nand 2.0-4.4\u00d7 reduction in off-chip memory accesses with a minimal accuracy loss\non CIFAR-10. Combining our method with knowledge distillation reduces the\ncompute cost of ResNet-18 by 2.6\u00d7 without accuracy drop on ImageNet. We\nfurther demonstrate that channel gating can be realized in hardware ef\ufb01ciently. Our\napproach exhibits sparsity patterns that are well-suited to dense systolic arrays with\nminimal additional hardware. We have designed an accelerator for channel gating\nnetworks, which can be implemented using either FPGAs or ASICs. Running a\nquantized ResNet-18 model for ImageNet, our accelerator achieves an encouraging\nspeedup of 2.4\u00d7 on average, with a theoretical FLOP reduction of 2.8\u00d7.\n\n1\n\nIntroduction\n\nThe past half-decade has seen unprecedented growth in the use of machine learning with convolutional\nneural networks (CNNs). CNNs represent the state-of-the-art in large scale computer vision, natural\nlanguage processing, and data mining tasks. However, CNNs have substantial computation and\nmemory requirements, greatly limiting their deployment in constrained mobile and embedded\ndevices [1, 23]. 
There are many lines of work on reducing CNN inference costs, including low-\nprecision quantization [3, 21], ef\ufb01cient architectures [14, 30], and static pruning [9, 12, 19, 22].\nHowever, most of these techniques optimize a network statically, and are agnostic of the input\ndata at run time. Several recent efforts propose to use additional fully-connected layers [6, 8, 28]\nor recurrent networks [20, 29] to predict if a fraction of the computation can be skipped based on\ncertain intermediate results produced by the CNN at run time. These approaches typically perform\ncoarse-grained pruning where an entire output channel or layer is skipped dynamically.\nIn this paper, we propose channel gating, a dynamic, \ufb01ne-grained, and hardware-ef\ufb01cient pruning\nscheme, which exploits the spatial structure of features to reduce CNN computation at the granularity\nof individual output activation. As illustrated in Figure 1, the essential idea is to divide a CNN layer\ninto a base path and a conditional path. For each output activation, the base path obtains a partial sum\nof the output activation by performing convolution on a subset of input channels. The activation-wise\ngate function then predicts whether the output activations are effective given the partial sum. Only\nthe effective activations take the conditional path which continues computing on the rest of the input\nchannels. 
For example, an output activation is ineffective if the activation is clipped to zero by ReLU, as it does not contribute to the classification result.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Illustration of channel gating — A subset of input channels (colored in green) are used to generate a decision map and to prune away unnecessary computation in the rest of the input channels (colored in blue).\n\nFigure 2: Computation intensity map — The computation intensity map is obtained by averaging decision maps over output channels.\n\nOur empirical study suggests that the partial and final sums are strongly correlated, which allows the gate function to make accurate predictions on whether the outputs are effective. Figure 2 further conceptualizes our idea by showing the heat maps of the normalized computation cost for classifying two images, with the “cool” colors indicating the computation that can be substantially pruned by channel gating.\nClearly, the channel gating policy must be learned through training, as it is practically infeasible to manually identify the “right” gating thresholds for all output channels without causing notable accuracy loss. To this end, we propose an effective method to train CNNs with channel gating (CGNets) from scratch in a single pass. The objective is to maximize the reduction of computational cost while minimizing the accuracy loss. In addition, we introduce channel grouping to ensure that all input channels are included and selected equally, without bias, when generating dynamic pruning decisions. The experimental results show that channel gating achieves a higher reduction in floating-point operations (FLOPs) than existing pruning methods with the same or better accuracy. 
In\naddition to pruning computation, channel gating can be extended with a coarser-grained channel-wise\ngate to further improve memory ef\ufb01ciency by reducing the off-chip memory accesses for weights.\nCGNet is also hardware friendly for two key reasons: (1) channel gating requires minimal additional\nhardware (e.g., small comparators for gating functions) to support \ufb01ne-grained pruning decisions\non a dense systolic array; (2) channel gating maintains the locality and the regularity in both\ncomputations and data accesses, and can be ef\ufb01ciently implemented with small changes to a systolic\narray architecture similar to the Google TPU [16]. Our ASIC accelerator design achieves a 2.4\u00d7\nspeed-up for the CGNet that has the theoretical FLOP reduction of 2.8\u00d7.\nThis paper makes the following major contributions:\n\n\u2022 We introduce channel gating, a new lightweight dynamic pruning scheme, which is trainable\nand can substantially improve the computation and memory ef\ufb01ciency of a CNN at inference\ntime. In addition, we show that channel grouping can naturally be integrated with our\nscheme to effectively minimize the accuracy loss. For ImageNet, our scheme achieves a\nhigher FLOP reduction than state-of-the-art pruning approaches [5, 8, 11, 18, 22, 31] with\nthe same or less accuracy drop.\n\u2022 Channel gating represents a new layer type that can be applied to many CNN architectures.\nIn particular, we experimentally demonstrate the bene\ufb01ts of our technique on ResNet, VGG,\nbinarized VGG, and MobileNet models.\n\u2022 We co-design a specialized hardware accelerator to show that channel gating can be ef-\n\ufb01ciently realized using a dense systolic array architecture. Our ASIC accelerator design\nachieves a 2.4\u00d7 speed-up for the CGNet that has the theoretical FLOP reduction of 2.8\u00d7.\n\n2 Related Work\n\nStatic Pruning. Many recent proposals suggest pruning unimportant \ufb01lters/features statically [12,\n19, 22, 31]. 
They identify ineffective channels in filters/features by examining the magnitude of the weights/activations in each channel.\n\nFigure 3: Channel gating block — xp, Wp and xr, Wr are the input features and weights to the base and conditional path, respectively. s is the gate function, which generates a binary pruning decision based on the partial sum for each output activation. f is the activation function.\n\nFigure 4: The computational graph of channel gating for training — Δ is a per-output-channel learnable threshold. J is an all-one tensor of rank three. Subtracting the binary decision d from J gives the complement of the decision, which is used to select the activations from the base path.\n\nThe relatively ineffective subset of the channels is then pruned from the model, and the pruned model is retrained to mitigate the accuracy loss from pruning. By pruning and retraining iteratively, these approaches can compress the model size and reduce the computation cost. PerforatedCNN [7] speeds up inference by skipping the computation of output activations at fixed spatial locations. Channel gating can provide a better trade-off between accuracy and computation cost than static approaches by utilizing run-time information to identify unimportant receptive fields in the input features and skipping the channels for those regions dynamically.\nDynamic Pruning. Figurnov et al. [6] introduce the spatially adaptive computation time (SACT) technique on Residual Networks [10], which adjusts the number of residual units applied to different regions of the input features. SACT stops computing on the spatial regions that reach a predefined confidence level, thus preserving only low-level features in those regions. 
McGill and Perona [24] propose to leverage features at different levels by taking input samples at different scales. Instead of bypassing residual units, dynamic channel pruning [8] generates decisions to skip the computation for a subset of output channels. Teja Mullapudi et al. [28] propose to replace the last residual block with a Mixture-of-Experts (MoE) layer and reduce the computation cost by activating only a subset of the experts in the MoE layer. The aforementioned methods embed fully-connected layers in a baseline network to help make run-time decisions. Lin et al. [20] and Wu et al. [29] propose to train a policy network with reinforcement learning to make run-time decisions to skip computations at the channel and residual-block levels, respectively. Both approaches require additional weights and extra FLOPs for computing the decision. In comparison, CGNet generates fine-grained decisions with no extra weights or computations. Moreover, instead of dropping unimportant features, channel gating approximates these features with their partial sums, which can be viewed as cost-efficient features. Our results show that both fine-grained pruning and high-quality decision functions are essential to significantly reduce the amount of computation with minimal accuracy loss.\n\n3 Channel Gating\n\nIn this section, we first describe the basic mechanism of channel gating. Then, we discuss how to design the gate function for different activation functions. Last, we address the biased weight update problem by introducing channel grouping.\n\n3.1 Channel Gating Block\n\nWithout loss of generality, we assume that features are rank-3 tensors consisting of c channels, where each channel is a 2-D feature of width w and height h. Let x_l, y_l, W_l be the input features, output features, and weights of layer l, respectively, where x_l ∈ R^{c_l×w_l×h_l}, y_l ∈ R^{c_{l+1}×w_{l+1}×h_{l+1}}, and W_l ∈ R^{c_{l+1}×c_l×k_l×k_l}. 
A typical block of a CNN layer includes convolution (∗), batch normalization (BN) [15], and an activation function (f). The output feature can be written as y_l = f(BN(W_l ∗ x_l))¹.\n\n¹The bias term is ignored because of batch normalization. A fully-connected layer is a special case where w, h, and k are all equal to 1.\n\nTo apply channel gating, we first split the input features statically along the channel dimension into two tensors, where x_l = [x_p, x_r]. For η ∈ (0, 1], x_p consists of an η fraction of the input channels while the rest of the channels form x_r, where x_p ∈ R^{ηc_l×w_l×h_l} and x_r ∈ R^{(1−η)c_l×w_l×h_l}. Similarly, let W_p and W_r be the weights associated with x_p and x_r. This decomposition means that W_l ∗ x_l = W_p ∗ x_p + W_r ∗ x_r. Then, the partial sum W_p ∗ x_p is fed into the gate to generate a binary decision tensor (d ∈ {0, 1}^{c_{l+1}×w_{l+1}×h_{l+1}}), where d_{i,j,k} = 0 means skipping the computation on the rest of the input channels (i.e., x_r) for y_{i,j,k}.\nFigure 3 illustrates the structure of the channel gating block for inference². There exist two possible paths with different frequencies of execution. We refer to the path which is always taken as the base path (colored in grey), and to the other path as the conditional path, given that it may be gated for some activations. The final output (y in Figure 3) is the element-wise combination of the outputs from both the base and conditional paths. 
The output of the channel gating block can be written as follows, where s denotes the gate function and i, j, k are the indices of a component in a tensor of rank three:\n\ny_{l,i,j,k} = f(W_p ∗ x_p)_{i,j,k} if d_{i,j,k} = s(W_p ∗ x_p)_{i,j,k} = 0, and f(W_p ∗ x_p + W_r ∗ x_r)_{i,j,k} otherwise. (1)\n\nChannel gating works only if the partial sum is a good predictor for the final sum. We hypothesize that the partial sum is strongly correlated with the final sum. We test this hypothesis by measuring the linear correlation between the partial and final sums of a layer for different values of η. The average Pearson correlation coefficient of 20 convolutional layers in ResNet-18 over 1000 training samples equals 0.56, 0.72, and 0.86 when η is 1/8, 1/4, and 1/2, respectively. The results suggest that the partial and final sums are still moderately correlated even when only 1/8 of the channels are used to compute the partial sum. While partial sums cannot accurately predict the exact values of all output activations, we find that they can effectively identify and approximate ineffective output activations.\n\n3.2 Learnable Gate Functions\n\nTo minimize the computational cost, the gate function should only allow a small fraction of the output activations to take the conditional path. We introduce a per-output-channel learnable threshold (Δ ∈ R^{c_{l+1}}) to learn a different gating policy for each output channel, and define the gate function using the Heaviside step function, where i, j, k are the indices of a component in a tensor of rank three:\n\nθ(x)_{i,j,k} = 1 if x_{i,j,k} ≥ 0, and 0 otherwise. (2)\n\nAn output activation is considered to be ineffective if the activation is zeroed out by ReLU or saturated to the limit values by sigmoid or hyperbolic tangent. Thus, the gate function is designed based on the activation function in the original network. 
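As a concrete illustration, the gating mechanism of Equation (1) can be sketched in a few lines of NumPy. This is a simplified sketch, not the paper's implementation: 1×1 convolutions stand in for the k×k convolutions, and the shapes and zero thresholds are illustrative.

```python
import numpy as np

def channel_gating_forward(x, Wp, Wr, delta):
    """Channel gating for a 1x1-conv layer (cf. Eq. 1), ReLU activation.

    x:     (c, w, h) input features; the first eta*c channels form x_p.
    Wp:    (c_out, c_p) base-path weights, Wr: (c_out, c_r) conditional-path weights.
    delta: (c_out,) per-output-channel gate thresholds.
    """
    c_p = Wp.shape[1]
    xp, xr = x[:c_p], x[c_p:]
    # Base path: partial sum over the first eta*c input channels.
    partial = np.einsum('oc,cwh->owh', Wp, xp)
    # Gate: d = 1 -> take the conditional path, d = 0 -> skip it.
    d = (partial >= delta[:, None, None]).astype(x.dtype)
    # Conditional path (in hardware, computed only where d == 1).
    full = partial + np.einsum('oc,cwh->owh', Wr, xr)
    y = np.maximum(d * full + (1 - d) * partial, 0.0)  # f = ReLU
    return y, d

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))     # c = 8, eta = 1/4 -> c_p = 2
Wp = rng.standard_normal((16, 2))
Wr = rng.standard_normal((16, 6))
y, d = channel_gating_forward(x, Wp, Wr, np.zeros(16))
print(y.shape, d.mean())               # d.mean() = fraction taking the conditional path
```

Note that the dense `full` computation here is only for clarity; the point of the scheme is that a hardware implementation evaluates `Wr ∗ xr` solely at positions where `d == 1`.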
For instance, if ReLU is used, we use s(x, Δ) = θ(x − Δ) as the gate function. Similarly, if the activation function has limit values, such as the hyperbolic tangent and sigmoid functions, we apply s(x, Δh, Δl) = θ(Δh − x) ◦ θ(x − Δl) as the gate function, where ◦ is the Hadamard product operator and Δh, Δl are the upper and lower thresholds, respectively. The same gate can be applied to binary neural networks [3, 26]. The rationale for choosing this gate function is twofold: (1) the step function is much cheaper to implement in hardware than the Softmax and Noisy Top-K gates proposed in [27]; (2) the gate function should turn off the rest of the channels for activations which are likely to be ineffective.\nWe use the gate for ReLU as a more concrete example. Let τ be the fraction of activations taking the conditional path. To find Δ which satisfies P(Wp ∗ xp ≥ Δ) = τ, we normalize the partial sums using batch normalization without scaling and shift. During inference, the batch normalization is merged with the gate to eliminate the extra parameters and computation. The merged gate is defined as:\n\ns̃(x, Δ) = θ(x − Δ · √Var(x) − E[x]) (3)\n\nwhere E[x] and Var(x) are the mean and variance, respectively. 
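As a quick numerical check of the merged gate: thresholding the batch-normalized partial sums at Δ gives exactly the same decisions as comparing the raw partial sums against Δ·√Var(x) + E[x]. The sketch below uses synthetic data; the statistics and the `delta` value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
partial = rng.standard_normal(1000) * 3.0 + 2.0   # raw partial sums
delta = 0.8                                       # learned per-channel threshold

# Training view: batch-normalize (no scale/shift), then compare against delta.
mu, var = partial.mean(), partial.var()
d_train = (partial - mu) / np.sqrt(var) >= delta

# Inference view (cf. Eq. 3): fold the batch statistics into a single threshold.
d_infer = partial >= delta * np.sqrt(var) + mu

print(np.array_equal(d_train, d_infer))  # True
```

Folding the statistics this way is why the merged gate needs no multiplications at inference time, only one comparison per output activation.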
The merged gate has c_{l+1} thresholds and performs w_{l+1} · h_{l+1} · c_{l+1} point-wise comparisons between the partial sums and the thresholds.\n\n²Batch normalization is omitted as it is only a scale-and-shift function during inference.\n\nFigure 5: Illustration of channel gating with and without channel grouping.\n(a) Channel gating without grouping — Solid lines indicate inputs to the base path, whereas dashed lines show inputs to the conditional path. More specifically, x0 is the input to the base path (i.e., xp) and x1, x2, x3 are the inputs to the conditional path (i.e., xr) for all output groups. The number of groups (G) is 4.\n(b) Channel gating with grouping — xi is the input to the base path and the rest of the input channels are the input to the conditional path for output group yi.\n\nIn addition to reducing computation, we can extend channel gating to further reduce the memory footprint. If the conditional path of an entire output channel is skipped, the corresponding weights do not need to be loaded. For weight access reduction, we introduce a per-layer threshold (τ_c ∈ R^l). Channel gating skips the entire output channel if less than a τ_c fraction of its activations take the conditional path. Here, the channel-wise gate function, which generates a binary pruning decision for each output channel, is defined as:\n\nS(x, Δ, τ_c)_i = θ( Σ_{j,k} s(x, Δ)_{i,j,k} − τ_c · w_{l+1} · h_{l+1} ) (4)\n\nwhere i, j, and k are the indices of the channel, width, and height dimensions, respectively. The channel-wise gate adds one more threshold and c_{l+1} additional comparisons. 
Overall, channel gating remains a lightweight pruning method, as it only introduces (w_{l+1} · h_{l+1} + 1) · c_{l+1} comparisons with c_{l+1} + 1 thresholds per layer.\n\n3.3 Unbiased Channel Selection with Channel Grouping\n\nIn the previous sections, we assumed a predetermined partition of the input channels between the base path (xp) and the conditional path (xr). In other words, a fixed set of input channels is used as xp for each layer. Since the base path is always taken, the weights associated with xp (i.e., Wp) are updated more frequently than the rest during training, making the weight update process biased. Our empirical results suggest that such biased updates can cause a notable accuracy drop. Thus, assigning input channels to the base and conditional paths without a bias is critical to minimizing the loss.\nWe address this problem with the help of channel grouping. Inspired partly by grouped convolution, channel grouping first divides the input and output features into the same number of groups along the channel dimension. Let x^i_l, y^i_l, W^i_l be the i-th group of input features, output features, and weights in a channel gating block, respectively, where x^i_l ∈ R^{ηc_l×w_l×h_l}, y^i_l ∈ R^{ηc_{l+1}×w_{l+1}×h_{l+1}}, and W^i_l ∈ R^{ηc_{l+1}×c_l×k_l×k_l}. Then, for the i-th output group, we choose the i-th input group as xp and the rest of the input groups as xr. Let x^i_p and x^i_r be xp and xr for the i-th output group, respectively. Channel gating with G groups can then be defined by substituting x^i_p = x^i_l and x^i_r = [x^0_l, ..., x^{i−1}_l, x^{i+1}_l, ..., x^{G−1}_l] into Equation (1). The number of groups (G) is set to 1/η, as each input group should contain an η fraction of all input channels. Intuitively, the base path of CGNet is an ordinary grouped convolution.\nFigure 5b shows an example that illustrates channel grouping. 
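In code, the per-output-group input selection described above can be sketched as follows. This is a minimal NumPy sketch; the 8-channel input and G = 4 mirror the figure but are otherwise illustrative.

```python
import numpy as np

def grouped_gating_inputs(x, i, G):
    """Return (x_p, x_r) for output group i, as in Section 3.3.

    x is split into G input groups along the channel axis; group i feeds
    the base path, and the remaining G-1 groups feed the conditional path.
    """
    groups = np.split(x, G, axis=0)
    xp = groups[i]
    xr = np.concatenate(groups[:i] + groups[i + 1:], axis=0)
    return xp, xr

x = np.arange(8).reshape(8, 1, 1)   # 8 channels, G = 4 -> 2 channels per group
xp, xr = grouped_gating_inputs(x, 1, 4)
print(xp.ravel(), xr.ravel())       # [2 3] [0 1 4 5 6 7]
```

Iterating `i` over all G output groups shows the unbiasedness property: every input channel appears in the base path of exactly one output group and in the conditional path of the other G − 1.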
In Figure 5a, we do not apply channel grouping, and x0 is always used for the base path for all output groups. In Figure 5b, where channel grouping is applied, x0 is fed to the base path only for output group y0. For the other output groups, x0 is used for the conditional path instead. In this way, we achieve unbiased channel selection, where every input channel is chosen as the input to the base path once and as the input to the conditional path (G − 1) times (here G = 4). As a result, all weights are updated with the same frequency, without bias. To improve cross-group information flow, we can further add an optional channel shuffle operation, as proposed in ShuffleNet [30]. Figure 5b illustrates the shuffling using the black solid lines between the output groups of the current layer and the input groups of the next layer.\n\n4 Training CGNet\n\nWe leverage the gradient descent algorithm to train CGNet. Figure 4 illustrates the computational graph during training. Applying batch normalization directly on the output diminishes the contribution from the base path, since the magnitude of Wp ∗ xp can be relatively small compared to Wp ∗ xp + Wr ∗ xr. To balance the magnitudes of the two paths, we apply two separate batch normalizations (BN1 and BN2) before combining the outputs from the two paths³. We subtract the output of the gate from an all-one tensor of rank three (J ∈ R^{c_{l+1}×w_{l+1}×h_{l+1}}) to express the if condition, and combine the two cases with an addition, which makes all the operators differentiable except the gate function.\nIn addition, we show two important techniques to make CGNet end-to-end learnable and reduce the computation cost: (1) approximating the non-differentiable gate function; (2) inducing sparsity in the decision maps during training. 
Last, we also discuss boosting the accuracy of CGNet using knowledge distillation.\nApproximating the non-differentiable gate function. As shown in Figure 4, we implement a custom operator in MXNet [2] which takes the outputs (x̂r, x̂p, x̂g) of the batch normalizations as inputs and combines the two paths to generate the final output (y). The gradients towards x̂r and x̂p are straightforward, whereas the gradients towards x̂g and Δ cannot be computed directly, since the gate is a non-differentiable function. We therefore approximate the gate with a smooth function which is differentiable with respect to x and Δ. Specifically, we propose to use s(x, Δ) = 1 / (1 + e^{ε·(x−Δ)}) to approximate the gate during backward propagation when the ReLU activation is used. ε is a hyperparameter which can be tuned to adjust the difference between the approximating function and the gate. With the approximating function, the gradients dx̂g and dΔ can be calculated as follows:\n\ndx̂g = −dΔ = dy · (x̂p − x̂r) · (−ε · e^{ε(x̂g−Δ)} · s(x̂g, Δ)²) (5)\n\nInducing sparsity. Without loss of generality, we assume that the ReLU activation is used and propose two approaches to reduce the FLOPs. As the input (x̂g) to the gate follows the standard normal distribution, the pruning ratio (F) increases monotonically with Δ. As a result, reducing the computation cost is equivalent to having a larger Δ. We set a target threshold value, named the target (T), and add the squared loss of the difference between Δ and T to the loss function.\nWe also show empirically that this proposed approach works better than an alternative approach in terms of trading off accuracy for FLOP reduction. 
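The surrogate gradient can be sanity-checked numerically. This sketch assumes the sigmoid-style surrogate s(x, Δ) = 1/(1 + e^{ε·(x−Δ)}) reconstructed above; the operating point and ε are illustrative, not values from the paper.

```python
import numpy as np

def s_approx(x, delta, eps):
    # Smooth surrogate for the gate, used only in the backward pass.
    return 1.0 / (1.0 + np.exp(eps * (x - delta)))

def ds_dx(x, delta, eps):
    # Closed-form derivative, matching the -eps * e^{eps(x - delta)} * s^2 factor in Eq. (5).
    s = s_approx(x, delta, eps)
    return -eps * np.exp(eps * (x - delta)) * s ** 2

x, delta, eps = 0.3, 0.1, 4.0
h = 1e-6
fd = (s_approx(x + h, delta, eps) - s_approx(x - h, delta, eps)) / (2 * h)
print(abs(fd - ds_dx(x, delta, eps)) < 1e-6)  # True
```

The central finite difference agrees with the closed-form factor, which is what a custom backward operator would return for dx̂g (and, with the opposite sign, for dΔ).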
The alternative approach directly optimizes the objective by adding the computation cost as a squared loss term:\n\ncomputation-cost loss = Σ_l ( (Σ_{c,w,h} (J − s(Wp ∗ xp))) · η c_l · k_l² / (w_{l+1} · h_{l+1} · c_{l+1}) )²\n\nWe observe that adding the computation-cost loss prunes layers with higher FLOPs first and introduces an imbalanced pruning ratio among the layers, while the proposed approach keeps a more balanced pruning ratio. As a result, we add the λ Σ_l (T − Δ_l)² term to the loss function, where λ is a scaling factor.\nKnowledge distillation (KD). KD [13] is a model compression technique which trains a student network using the softened output of a teacher network. Let T, S, yt, P^κ_T, P^κ_S represent a teacher network, a student network, the ground truth labels, and the probability distributions of the teacher and student networks after softmax with temperature κ, respectively. The loss of the student model is:\n\nL_S(W) = −( (1 − λ_kd) Σ yt log(P^κ_S) + λ_kd Σ P^κ_T log(P^κ_S) ) (6)\n\nAs a result, the student model is expected to achieve the same level of prediction accuracy as the teacher model. 
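A minimal sketch of this distillation loss in pure NumPy follows; the logits, κ, and λ_kd values below are illustrative.

```python
import numpy as np

def softmax(z, kappa=1.0):
    # Temperature-scaled, numerically stable softmax.
    z = z / kappa
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, y_onehot, kappa=1.0, lam=0.5):
    # Cf. Eq. (6): blend the hard-label and distillation cross-entropies.
    p_s = softmax(student_logits, kappa)
    p_t = softmax(teacher_logits, kappa)
    hard = -(y_onehot * np.log(p_s)).sum(axis=-1)
    soft = -(p_t * np.log(p_s)).sum(axis=-1)
    return ((1 - lam) * hard + lam * soft).mean()

rng = np.random.default_rng(2)
s_logits = rng.standard_normal((4, 10))
t_logits = rng.standard_normal((4, 10))
y = np.eye(10)[[1, 3, 5, 7]]
print(kd_loss(s_logits, t_logits, y))  # scalar loss
```

With λ_kd = 0 this reduces to the ordinary cross-entropy on the ground truth labels, and with λ_kd = 1 the student is trained purely against the teacher's softened distribution.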
We leverage KD to improve the accuracy of CGNets on ImageNet, where a ResNet-50 model is used as the teacher of our ResNet-18-based CGNets, with κ = 1 and λ_kd = 0.5.\n\n³BN1 and BN2 share the same parameters (γ, β).\n\nTable 1: The accuracy and the FLOP reduction of CGNets for CIFAR-10 without KD.\n\nBaseline      | # of Groups | Target Threshold | Top-1 Error Baseline (%) | Top-1 Error Pruned (%) | Top-1 Accu. Drop (%) | FLOP Reduction\nResNet-18     | 8  | 2.0 | 5.40  | 5.44  | 0.04  | 5.49×\nResNet-18     | 16 | 3.0 | 5.40  | 5.96  | 0.56  | 7.95×\nBinary VGG-11 | 8  | 1.0 | 16.85 | 16.95 | 0.10  | 3.02×\nBinary VGG-11 | 8  | 1.5 | 16.85 | 17.10 | 0.25  | 3.80×\nVGG-16        | 8  | 1.0 | 7.20  | 7.12  | -0.08 | 3.41×\nVGG-16        | 8  | 2.0 | 7.20  | 7.59  | 0.39  | 5.10×\nMobileNetV1   | 8  | 1.0 | 12.15 | 12.44 | 0.29  | 2.88×\nMobileNetV1   | 8  | 2.0 | 12.15 | 12.80 | 0.65  | 3.80×\n\n5 Experiments\n\nWe first evaluate CGNets with only the activation-wise gate on the CIFAR-10 [17] and ImageNet (ILSVRC 2012) [4] datasets to compare the accuracy and FLOP reduction trade-off with prior art. We apply channel gating to a modified ResNet-18⁴, binarized VGG-11, VGG-16, and MobileNetV1 on CIFAR-10. For ImageNet, we use ResNet-18, ResNet-34, and MobileNetV1 as the baselines. Furthermore, we explore channel gating with both activation-wise and channel-wise gates to reduce both the computation cost and off-chip memory accesses. We choose a uniform target threshold (T) and number of groups (G) for all CGNets in the experiments in Sections 5.1 and 5.2. Last, we show that the accuracy and FLOP reduction trade-off of CGNets can be further improved by exploring the design space.\n\n5.1 Reduction in Computation Cost (FLOPs)\n\nIn Table 1, we show the trade-off between accuracy and FLOP reduction when CGNet models are used for CIFAR-10 without KD. Channel gating can trade off accuracy for FLOP reduction by varying the group size and the target threshold. 
CGNets reduce the computation by 2.7-8.0× with minimal accuracy degradation on five state-of-the-art architectures using the two gate functions proposed in Section 3.2. It is worth noting that channel gating achieves a 3× FLOP reduction with negligible accuracy drop even for a binary model (Binary VGG-11).\nTable 2 compares our approach to prior art [5, 8, 11, 18, 22, 31] on ResNet and MobileNet without KD. The results show that channel gating outperforms all alternative pruning techniques, offering a smaller accuracy drop and higher FLOP savings. Discrimination-aware channel pruning [31] achieves the highest FLOP reduction among the three static pruning approaches on ResNet-18. The top-1 accuracy drop of channel gating (CGNet-A) is 1.9% less than that of discrimination-aware channel pruning, which demonstrates the advantage of dynamic pruning. Feature Boosting and Suppression (FBS) [8] is a channel-level dynamic pruning approach which achieves higher FLOP savings than the static pruning approaches. Channel gating is much simpler than FBS, yet achieves a 1.6% smaller accuracy drop and a slightly higher FLOP reduction (CGNet-B). Channel gating also works well on a lightweight CNN built for mobile and embedded applications such as MobileNet. MobileNet with channel gating (CGNet-A) achieves 1.2% higher accuracy with a larger FLOP saving than a thinner MobileNet model (0.75 MobileNet). We believe that channel gating outperforms existing pruning approaches for two reasons: (1) instead of dropping the ineffective features, channel gating approximates those features with partial sums; (2) channel gating performs more fine-grained, activation-level pruning.\nTable 3 shows additional comparisons of CGNet with and without channel grouping and knowledge distillation (KD) on ResNet-18 for ImageNet. CGNet with channel grouping achieves 0.9% higher top-1 accuracy and 20% higher FLOP reduction than the counterpart without channel grouping. 
Applying KD further boosts the top-1 accuracy of CGNet by 1.3% and improves the FLOP saving from 1.93× to 2.55×. We observe the same trend for CIFAR-10, where channel grouping improves the top-1 accuracy by 0.8% when the computation is reduced by 5×. KD does not improve the model accuracy on CIFAR-10: the small difference between the ground truth labels and the output of the teacher model makes the distilled loss ineffective.\n\n⁴The ResNet-18 variant architecture is similar to the ResNet-18 for ImageNet, while using a 3 × 3 filter in the first convolutional layer and having ten outputs from the last fully-connected layer.\n\nTable 2: Comparisons of accuracy drop and FLOP reduction of the pruned models for ImageNet without KD — CGNet A and B represent CGNet models with different target thresholds and scaling factors.\n\nBaseline  | Model                                | Dynamic | Top-1 Error Baseline (%) | Top-1 Error Pruned (%) | Top-1 Accu. Drop (%) | FLOP Reduction\nResNet-18 | Soft Filter Pruning [11]             | ✗ | 29.7 | 32.9 | 3.2 | 1.72×\nResNet-18 | Network Slimming [22]                | ✗ | 31.0 | 32.8 | 1.8 | 1.39×\nResNet-18 | Discrimination-aware Pruning [31]    | ✗ | 30.4 | 32.7 | 2.3 | 1.85×\nResNet-18 | Low-cost Collaborative Layers [5]    | ✓ | 30.0 | 33.7 | 3.7 | 1.53×\nResNet-18 | Feature Boosting and Suppression [8] | ✓ | 29.3 | 31.8 | 2.5 | 1.98×\nResNet-18 | CGNet-A                              | ✓ | 30.8 | 31.2 | 0.4 | 1.93×\nResNet-18 | CGNet-B                              | ✓ | 30.8 | 31.7 | 0.9 | 2.03×\nResNet-34 | Filter Pruning [18]                  | ✗ | 26.8 | 27.9 | 1.1 | 1.32×\nResNet-34 | Soft Filter Pruning [11]             | ✗ | 26.1 | 28.3 | 2.1 | 1.70×\nResNet-34 | CGNet-A                              | ✓ | 27.6 | 28.7 | 1.1 | 2.02×\nResNet-34 | CGNet-B                              | ✓ | 27.6 | 29.8 | 2.2 | 3.14×\nMobileNet | 0.75 MobileNet [14]                  | ✗ | 31.2 | 33.0 | 1.8 | 1.75×\nMobileNet | CGNet-A                              | ✓ | 31.2 | 31.8 | 0.6 | 1.88×\nMobileNet | CGNet-B                              | ✓ | 31.2 | 32.2 | 1.0 | 2.39×\n\nTable 3: Comparisons of CGNet-A with and without channel 
grouping and knowledge\ndistillation on ResNet-18 for ImageNet.\nChannel\nGrouping KD Top-1 Error\nPruned (%)\n\nTop-1 Accu.\nDrop (%)\n\nReduction\n\nFLOP\n\nTable 4: Power, performance, and energy com-\nparison of different platforms \u2014 Batch size for\nASIC and CPU/GPU is 1 and 32, respectively.\nASIC\n\nNVIDIA GTX\n\nIntel\n\nPlatform\n\ni7-7700k\n\n1080Ti\n\n\u0015\n(cid:88)\n(cid:88)\n(cid:88)\n\n\u0015\n\u0015\n(cid:88)\n(cid:88)\n\n32.1\n31.2\n30.3\n31.1\n\n1.3\n0.4\n-0.5\n0.3\n\n1.61\u00d7\n1.93\u00d7\n2.55\u00d7\n2.82\u00d7\n\nFrequency (MHz)\n\nPower (Watt)\n\nThroughput (fps)\nEnergy/Img (mJ)\n\n4200\n91\n13.8\n6594.2\n\n1923\n225\n\n1563.7\n143.9\n\n5.2 Reduction in Memory Accesss for Weights\n\nBaseline CGNet\n\n800\n0.20\n254.3\n0.79\n\n800\n0.25\n613.9\n0.41\n\nAs discussed in Section 3.2, a channel-wise gate can be introduced into the channel gating block\nto reduce the memory accesses for weights. Figure 6 shows the accuracy, the reduction in weight\naccesses, and the FLOP reduction of CGNets with different target thresholds for the activation-wise\ngate (T \u2208 {1.5, 2.0}) and the channel-wise gate (\u03c4c \u2208 {0, 0.05, 0.10, 0.20}). CGNets with the\nchannel-wise gate reduces the number of weight accesses by 3.1\u00d7 and 4.4\u00d7 with 0.6% and 0.3%\naccuracy degradation when T equals 1.5 and 2.0, respectively. We \ufb01nd that CGNets with a larger T\nachieve larger reduction in weight accesses with less accuracy drop as fewer activations are taking\nthe conditional path when T becomes larger.\n\n5.3 Design Space Exploration\n\nCGNets with a uniform target threshold (T ) and number of groups (G) only represent one design\npoint among many possible CGNet architectures. We explore the design space of CGNets by varying\nthe number of groups in each residual module. Speci\ufb01cally, we \ufb01x the group size of one speci\ufb01c\nmodule and then vary the group sizes of the other modules to obtain multiple design points per\nresidual module. 
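This per-module sweep can be sketched as a simple enumeration over candidate group counts (a minimal illustration; `evaluate` is a hypothetical stand-in for training and testing one CGNet configuration, and the module and group counts are those of the ResNet-18 variant used here):

```python
# Sketch of the per-module design-space sweep: fix G for one residual
# module and enumerate the group counts of the remaining modules.
from itertools import product

NUM_MODULES = 4                  # residual modules in the ResNet-18 variant
GROUP_CHOICES = (2, 4, 8, 16)    # candidate number of groups G per module

def sweep_module(module_idx, fixed_g, evaluate):
    """Collect one design point per configuration in which module
    `module_idx` uses `fixed_g` groups; `evaluate(config)` is assumed to
    return an (accuracy, flop_reduction) pair for a trained CGNet."""
    points = []
    others = [i for i in range(NUM_MODULES) if i != module_idx]
    for combo in product(GROUP_CHOICES, repeat=len(others)):
        config = [0] * NUM_MODULES
        config[module_idx] = fixed_g
        for i, g in zip(others, combo):
            config[i] = g
        points.append(evaluate(config))
    return points
```

Fixing one module and sweeping the other three yields 4³ = 64 design points per (module, G) pair, which is how the scatter in Figure 7 is populated.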
As depicted in Figure 7, each residual module may favor a different G. For example, for the second module, CGNets with two groups provide a better trade-off between accuracy and FLOP reduction than the ones with 16 groups. In contrast, for the fourth module, CGNets with 16 groups outperform the designs with two groups. Similarly, we can explore the design space by choosing a different T for each residual module. This study shows that one can further improve the accuracy and FLOP reduction trade-off by searching the design space.

Figure 6: Weight access reduction on ResNet-18 for CIFAR-10 with different T and τc combinations — Annotated values represent the top-1 accuracy.

Figure 7: Comparisons of accuracy and FLOP reduction of CGNets with different group size for each residual module on ResNet-18 for CIFAR-10 — The lines represent the linear regression for the corresponding design points.

Figure 8: Execution time breakdown for each residual block — The theoretical execution time of CGNet represents the best possible performance under a specific resource usage, and is computed as the total number of multiplications (FLOPs) per inference divided by the number of multipliers.

5.4 Speed-up Evaluation

The evaluation so far uses the FLOP reduction as a proxy for a speed-up. PerforatedCNN [7] introduces sparsity at the same granularity as CGNet, and shows that the speed-ups on CPUs/GPUs closely match the FLOP saving. Moreover, a recently developed GPU kernel for sampled dense matrix multiplication [25] can potentially be leveraged to implement the conditional path of CGNets efficiently.
Therefore, we believe that the FLOP reduction will also translate to a promising speed-up on CPUs/GPUs.

To evaluate the performance improvement and energy saving of applying channel gating in specialized hardware, we implemented a hardware prototype targeting a TSMC 28nm standard cell library. The baseline accelerator adopts a systolic array architecture similar to that of the Google TPU. CGNet requires only small changes to the baseline accelerator because it reuses partial sums and needs only additional comparators to make gating decisions. In addition to the comparators, CGNet uses a custom data layout and memory banking to support sparse data movements after pruning.

Figure 8 shows the speed-up of CGNet over the baseline accelerator. We compare the FLOP reduction with the actual speed-up. The actual speed-up is 2.4× when the FLOP reduction, which represents the theoretical speed-up, is 2.8×. This result shows that hardware can effectively exploit the dynamic sparsity in channel gating. Moreover, the small gap between the execution time of CGNet and the theoretical execution time suggests that CGNet is hardware-efficient. As shown in Table 4, CGNet outperforms a CPU by 42.1× in throughput and is four orders of magnitude more energy-efficient. Compared with an NVIDIA GTX GPU, CGNet is 326.3× more energy-efficient.

6 Conclusions and Future Work

We introduce a dynamic pruning technique named channel gating, along with a training method to effectively learn a gating policy from scratch.
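At inference time, the learned policy amounts to a per-activation threshold test on partial sums computed from a subset of the input channels. The following NumPy sketch of one gating decision is only illustrative: it uses 1×1 filters, a dense simulation, and an unnormalized threshold test, whereas the actual design operates on grouped convolutions with a learned normalization; all names and shapes are assumptions, not the paper's exact formulation.

```python
import numpy as np

def channel_gating_conv(x, w_base, w_cond, T):
    """Dense simulation of one channel gating decision (illustrative).
    x: input activations, shape (C_in, H, W)
    w_base: 1x1 filters over the base channel subset, (C_out, C_base)
    w_cond: 1x1 filters over the remaining channels, (C_out, C_in - C_base)
    """
    c_base = w_base.shape[1]
    # Base path: partial sums from the first subset of input channels.
    p = np.tensordot(w_base, x[:c_base], axes=([1], [0]))  # (C_out, H, W)
    # Gate: only activations whose partial sum reaches the target threshold
    # T take the conditional path; the rest are deemed ineffective, so the
    # remaining input channels are skipped for them (raising T prunes more).
    gate = p >= T
    # Conditional path: computed densely here; hardware would skip it
    # wherever gate is False.
    q = np.tensordot(w_cond, x[c_base:], axes=([1], [0]))
    return np.where(gate, p + q, p)
```

In hardware, the `gate` comparison maps to the comparators added next to the systolic array, and the skipped positions of `q` are the source of the dynamic FLOP and weight-access savings.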
Experimental results show that channel gating provides better trade-offs between accuracy and computation cost than existing pruning techniques. Potential future work includes applying channel gating to object detection tasks.

7 Acknowledgments

This work was partially sponsored by Semiconductor Research Corporation and DARPA, NSF Awards #1453378 and #1618275, a research gift from Xilinx, Inc., and a GPU donation from NVIDIA Corporation. The authors would like to thank the Batten Research Group, especially Christopher Torng (Cornell Univ.), for sharing their Modular VLSI Build System. The authors also thank Zhaoliang Zhang and Kaifeng Xu (Tsinghua Univ.) for the C++ implementation of channel gating, and Ritchie Zhao and Oscar Castañeda (Cornell Univ.) for insightful discussions.

References

[1] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. CoRR, abs/1605.07678, 2016. URL http://arxiv.org/abs/1605.07678.

[2] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015. URL http://arxiv.org/abs/1512.01274.

[3] Matthieu Courbariaux and Yoshua Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016. URL http://arxiv.org/abs/1602.02830.

[4] J. Deng, W. Dong, R. Socher, L. J. Li, Kai Li, and Li Fei-Fei.
ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, June 2009. doi: 10.1109/CVPR.2009.5206848.

[5] Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. More is less: A more complicated network with less inference complexity. CoRR, abs/1703.08651, 2017. URL http://arxiv.org/abs/1703.08651.

[6] Michael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry P. Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. CoRR, abs/1612.02297, 2016. URL http://arxiv.org/abs/1612.02297.

[7] Michael Figurnov, Aijan Ibraimova, Dmitry Vetrov, and Pushmeet Kohli. Perforated CNNs: Acceleration through elimination of redundant convolutions. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 955–963, USA, 2016. Curran Associates Inc. ISBN 978-1-5108-3881-9. URL http://dl.acm.org/citation.cfm?id=3157096.3157203.

[8] X. Gao, Y. Zhao, L. Dudziak, R. Mullins, and C.-z. Xu. Dynamic channel pruning: Feature boosting and suppression. ArXiv e-prints, October 2018.

[9] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015. URL http://arxiv.org/abs/1510.00149.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

[11] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2234–2240, 2018.

[12] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. CoRR, abs/1707.06168, 2017.
URL http://arxiv.org/abs/1707.06168.

[13] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015. URL http://arxiv.org/abs/1503.02531.

[14] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017. URL http://arxiv.org/abs/1704.04861.

[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.

[16] Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, Richard C. Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. CoRR, abs/1704.04760, 2017. URL http://arxiv.org/abs/1704.04760.

[17] Alex Krizhevsky.
Learning multiple layers of features from tiny images. Technical report, 2009.

[18] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016. URL http://arxiv.org/abs/1608.08710.

[19] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016. URL http://arxiv.org/abs/1608.08710.

[20] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2181–2191. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6813-runtime-neural-pruning.pdf.

[21] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 345–353. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6638-towards-accurate-binary-convolutional-neural-network.pdf.

[22] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. CoRR, abs/1708.06519, 2017. URL http://arxiv.org/abs/1708.06519.

[23] Zongqing Lu, Swati Rallapalli, Kevin S. Chan, and Thomas F. La Porta. Modeling the resource requirements of convolutional neural networks on mobile devices. CoRR, abs/1709.09503, 2017. URL http://arxiv.org/abs/1709.09503.

[24] Mason McGill and Pietro Perona.
Deciding how to decide: Dynamic routing in artificial neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2363–2372, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/mcgill17a.html.

[25] I. Nisa, A. Sukumaran-Rajam, S. E. Kurt, C. Hong, and P. Sadayappan. Sampled dense matrix multiplication for high-performance machine learning. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC), pages 32–41, Dec 2018. doi: 10.1109/HiPC.2018.00013.

[26] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016. URL http://arxiv.org/abs/1603.05279.

[27] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017. URL http://arxiv.org/abs/1701.06538.

[28] Ravi Teja Mullapudi, William R. Mark, Noam Shazeer, and Kayvon Fatahalian. HydraNets: Specialized dynamic architectures for efficient inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[29] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S. Davis, Kristen Grauman, and Rogério Schmidt Feris. BlockDrop: Dynamic inference paths in residual networks. CoRR, abs/1711.08393, 2017. URL http://arxiv.org/abs/1711.08393.

[30] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017. URL http://arxiv.org/abs/1707.01083.

[31] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu.
Discrimination-aware channel pruning for deep neural networks. ArXiv e-prints, October 2018.