{"title": "Heterogeneous Bitwidth Binarization in Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4006, "page_last": 4015, "abstract": "Recent work has shown that fast, compact low-bitwidth neural networks can\nbe surprisingly accurate. These networks use homogeneous binarization: all\nparameters in each layer or (more commonly) the whole model have the same low\nbitwidth (e.g., 2 bits). However, modern hardware allows efficient designs where\neach arithmetic instruction can have a custom bitwidth, motivating heterogeneous\nbinarization, where every parameter in the network may have a different bitwidth.\nIn this paper, we show that it is feasible and useful to select bitwidths at the\nparameter granularity during training. For instance a heterogeneously quantized\nversion of modern networks such as AlexNet and MobileNet, with the right mix\nof 1-, 2- and 3-bit parameters that average to just 1.4 bits can equal the accuracy\nof homogeneous 2-bit versions of these networks. Further, we provide analyses\nto show that the heterogeneously binarized systems yield FPGA- and ASIC-based\nimplementations that are correspondingly more efficient in both circuit area and\nenergy efficiency than their homogeneous counterparts.", "full_text": "Heterogeneous Bitwidth Binarization in\n\nConvolutional Neural Networks\n\nJosh Fromm\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\n\nSeattle, WA 98195\njwfromm@uw.edu\n\nShwetak Patel\n\nDepartment of Computer Science\n\nUniversity of Washington\n\nSeattle, WA 98195\n\nshwetak@cs.washington.edu\n\nMatthai Philipose\nMicrosoft Research\nRedmond, WA 98052\n\nmatthaip@microsoft.com\n\nAbstract\n\nRecent work has shown that fast, compact low-bitwidth neural networks can\nbe surprisingly accurate. These networks use homogeneous binarization: all\nparameters in each layer or (more commonly) the whole model have the same low\nbitwidth (e.g., 2 bits). 
However, modern hardware allows ef\ufb01cient designs where\neach arithmetic instruction can have a custom bitwidth, motivating heterogeneous\nbinarization, where every parameter in the network may have a different bitwidth.\nIn this paper, we show that it is feasible and useful to select bitwidths at the\nparameter granularity during training. For instance a heterogeneously quantized\nversion of modern networks such as AlexNet and MobileNet, with the right mix\nof 1-, 2- and 3-bit parameters that average to just 1.4 bits can equal the accuracy\nof homogeneous 2-bit versions of these networks. Further, we provide analyses\nto show that the heterogeneously binarized systems yield FPGA- and ASIC-based\nimplementations that are correspondingly more ef\ufb01cient in both circuit area and\nenergy ef\ufb01ciency than their homogeneous counterparts.\n\n1\n\nIntroduction\n\nWith Convolutional Neural Networks (CNNs) now outperforming humans in vision classi\ufb01cation\ntasks (Szegedy et al., 2015), it is clear that CNNs will be a mainstay of AI applications. However,\nCNNs are known to be computationally demanding, and are most comfortably run on GPUs. For\nexecution in mobile and embedded settings, or when a given CNN is evaluated many times, using\na GPU may be too costly. The search for inexpensive variants of CNNs has yielded techniques\nsuch as hashing (Chen et al., 2015), vector quantization (Gong et al., 2014), and pruning (Han et al.,\n2015). One particularly promising track is binarization (Courbariaux et al., 2015), which replaces\n32-bit \ufb02oating point values with single bits, either +1 or -1, and (optionally) replaces \ufb02oating point\nmultiplies with packed bitwise popcount-xnors Hubara et al. (2016). 
Binarization can reduce the size of models by up to 32\u00d7, and reduce the number of operations executed by up to 64\u00d7.\nIt has not escaped hardware designers that the popcount-xnor operations used in a binary network are especially well suited for FPGAs or ASICs. Taking the xnor of two bits requires a single logic gate compared to the hundreds required for even the most efficient floating point multiplication units (Ehliar, 2014). The drastically reduced area requirements allow binary networks to be implemented with fully parallel computations on even relatively inexpensive FPGAs (Umuroglu et al., 2017). The level of parallelization afforded by these custom implementations allows them to outperform GPU computation while expending a fraction of the power, which offers a promising avenue for moving state-of-the-art architectures to embedded environments. We seek to improve the occupancy, power, and/or accuracy of these solutions.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nOur approach is based on the simple observation that the power consumption, space needed, and accuracy of binary models on FPGAs and custom hardware are proportional to mn, where m is the number of bits used to binarize input activations and n is the number of bits used to binarize weights. Current binary algorithms restrict m and n to be integer values, in large part because efficient CPU implementations require parameters within a layer to be the same bitwidth. However, hardware has no such requirements. Thus, we ask whether bitwidths can be fractional. 
To address this question,\nwe introduce Heterogeneous Bitwidth Neural Networks (HBNNs), which allow each individual\nparameter to have its own bitwidth, giving a fractional average bitwidth to the model.\nOur main contributions are:\n\n(1) We propose the problem of selecting the bitwidth of individual parameters during training\n\nsuch that the bitwidths average out to a speci\ufb01ed value.\n\n(2) We show how to augment a state-of-the-art homogeneous binarization training scheme\nwith a greedy bitwidth selection technique (which we call \u201cmiddle-out\u201d) and a simple\nhyperparameter search to produce good heterogeneous binarizations ef\ufb01ciently.\n\n(3) We present a rigorous empirical evaluation (including on highly optimized modern networks\nsuch as Google\u2019s MobileNet) to show that heterogeneity yields equivalent accuracy at\nsigni\ufb01cantly lower average bitwidth.\n\n(4) Although implementing HBNNs ef\ufb01ciently on CPU/GPU may be dif\ufb01cult, we provide\nestimates based on recently proposed FPGA/ASIC implementations that HBNNs\u2019 lower\naverage bitwidths can translate to signi\ufb01cant reductions in circuit area and power.\n\n2 Homogeneous Network Binarization\n\nIn this section we discuss existing techniques for binarization. Table 1 summarizes their accuracy.1\nWhen training a binary network, all techniques including ours maintain weights in \ufb02oating point\nformat. During forward propagation, the weights (and activations, if both weights and activations are\nto be binarized) are passed through a binarization function B, which projects incoming values to a\nsmall, discrete set. In backward propagation, a custom gradient, which updates the \ufb02oating point\nweights, is applied to the binarization layer. 
After training is complete, the binarization function is applied one last time to the floating point weights to create a true binary (or more generally, small, discrete) set of weights, which is used for inference from then on.\nBinarization was first introduced by Courbariaux et al. (2015). In this initial investigation, dubbed BinaryConnect, 32-bit tensors T were converted to 1-bit variants $T^B$ using the stochastic equation\n\n$$B(T) \\triangleq T^B = \\begin{cases} +1 & \\text{with probability } p = \\sigma(T), \\\\ -1 & \\text{with probability } 1 - p \\end{cases} \\qquad (1)$$\n\nwhere $\\sigma$ is the hard sigmoid function defined by $\\sigma(x) = \\max(0, \\min(1, \\frac{x+1}{2}))$. For the custom gradient function, BinaryConnect simply used $\\frac{dT^B}{dT} = 1$.\nAlthough BinaryConnect showed excellent results on relatively simple datasets such as CIFAR-10 and MNIST, it performed poorly on ImageNet, achieving an accuracy of only 27.9%. Courbariaux et al. (2016) later improved this model by simplifying the binarization to $T^B = \\mathrm{sign}(T)$ and adding a gradient for this operation, namely the straight-through estimator:\n\n$$\\frac{dT^B}{dT} = 1_{|T| \\leq 1}. \\qquad (2)$$\n\nThe authors showed that the straight-through estimator allowed the binarization of activations as well as weights without collapse of model performance. However, they did not attempt to train a model on ImageNet in this work.\n\n1 In line with prior work, we use the AlexNet model trained on the ImageNet dataset as the baseline.\n\nFigure 1: Residual error binarization with n = 3 bits. Computing each bit takes a step from the position of the previous bit (see Equation 4).\n\nRastegari et al. (2016) made a slight modification to the simple pure single bit representation that showed improved results, taking the binarized approximation as\n\n$$T^B = \\alpha_i \\, \\mathrm{sign}(T) \\quad \\text{with} \\quad \\alpha_i = \\frac{1}{d} \\sum_{j=1}^{d} |T_j|. \\qquad (3)$$\n\nThis additional scalar term allows binarized values to better fit the distribution of the incoming floating-point values, giving a higher fidelity approximation for very little extra computation. The addition of scalars and the straight-through estimator gradient allowed the authors to achieve a Top-1 accuracy of 44.2% on ImageNet.\nHubara et al. (2016) and Zhou et al. (2016) found that increasing the number of bits used to quantize the activations of the network gave a considerable boost to the accuracy, achieving similar Top-1 accuracy of 51.03% and 50.7% respectively. The precise binarization function varied, but the typical approaches include linearly or logarithmically placing the quantization points between 0 and 1, clamping values below a threshold distance from zero to zero (Li et al., 2016), and computing higher bits by measuring the residual error from lower bits (Tang et al., 2017). All n-bit binarization schemes require similar amounts of computation at inference time, and have similar accuracy (see Table 1).\nIn this work, we extend the residual error binarization function of Tang et al. (2017) for binarizing to multiple (n) bits:\n\n$$\\begin{aligned} T^B_1 &= \\mathrm{sign}(T), & \\mu_1 &= \\mathrm{mean}(|T|) \\\\ E_n &= T - \\sum_{i=1}^{n} \\mu_i \\times T^B_i \\\\ T^B_{n>1} &= \\mathrm{sign}(E_{n-1}), & \\mu_{n>1} &= \\mathrm{mean}(|E_{n-1}|) \\\\ T &\\approx \\sum_{i=1}^{n} \\mu_i \\times T^B_i \\end{aligned} \\qquad (4)$$\n\nwhere T is the input tensor, $E_n$ is the residual error up to bit n, $T^B_n$ is a tensor representing the nth bit of the approximation, and $\\mu_n$ is a scaling factor for the nth bit. Note that the calculation of bit n is a recursive operation that relies on the values of all bits less than n. Residual error binarization has each additional bit take a step from the value of the previous bit. Figure 1 illustrates the process of binarizing a single value to 3 bits. 
Since every binarized value is derived by taking n steps, where each step goes left or right, residual error binarization approximates inputs using one of $2^n$ values.\n\n3 Heterogeneous Binarization\n\nTo date, there remains a considerable gap between the performance of 1-bit and 2-bit networks (compare rows 8 and 10 of Table 1). The highest full (i.e., where both weights and activations are quantized) single-bit performer on AlexNet, Xnor-Net, remains roughly 7 percentage points less accurate (top 1) than the 2-bit variant, which is itself about 5.5 points less accurate than the 32-bit variant (row 25). When only weights are binarized, very recent results (Dong et al., 2017) similarly find that binarizing to 2 bits can yield nearly full accuracy (row 2), while the 1-bit equivalent lags by 4 points (row 1). The flip side to using 2 bits for binarization is that the resulting models require double the number of operations of the 1-bit variants at inference time.\nThese observations naturally lead to the question, explored in this section, of whether it is possible to attain accuracies closer to those of 2-bit models while running at speeds closer to those of 1-bit variants. Of course, it is also fundamentally interesting to understand whether it is possible to match the accuracy of higher bitwidth models with those that have lower (on average) bitwidth. 
Below, we discuss how to extend residual error binarization to allow heterogeneous (effectively fractional) bitwidths and present a method for distributing the bits of a heterogeneous approximation.\n\n3.1 Heterogeneous Residual Error Binarization via a Mask Tensor\n\nWe modify Equation 4, which binarizes to n bits, to instead binarize to a mixture of bitwidths by changing the third line as follows:\n\n$$T^B_{n>1} = \\mathrm{sign}(E_{n-1,j}), \\quad \\mu_{n>1} = \\mathrm{mean}(|E_{n-1,j}|) \\quad \\text{with } j : M_j \\geq n \\qquad (5)$$\n\nNote that the only addition is the mask tensor M, which is the same shape as T, and specifies the number of bits $M_j$ that the jth entry of T should be binarized to. In each round n of the binarization recurrence, we now only consider values that are not finished binarizing, i.e., which have $M_j \\geq n$. Unlike homogeneous binarization, therefore, heterogeneous binarization generates binarized values by taking up to, not necessarily exactly, n steps. Thus, the number of distinct values representable is $\\sum_{i=1}^{n} 2^i = 2^{n+1} - 2$, which is roughly double that of the homogeneous binarization.\n\nIn the homogeneous case, on average, each step improves the accuracy of the approximation, but there may be certain individual values that would benefit from not taking a step. In Figure 1, for example, it is possible that $(\\mu_1 - \\mu_2)$ approximates the target value better than $(\\mu_1 - \\mu_2 + \\mu_3)$. If values that benefit from not taking a step can be targeted and assigned fewer bits, the overall approximation accuracy will improve despite there being a lower average bitwidth.\n\n3.2 Computing the Mask Tensor M\n\nThe question of how to distribute bits in a heterogeneous binary tensor to achieve high representational power is equivalent to asking how M should be generated. 
When computing M, our goal is to take an average bitwidth B and determine both what fraction P of M should be binarized to each bitwidth (e.g., P = 5% 3-bit, 10% 2-bit and 85% 1-bit for an average of B = 1.2 bits), and how to distribute these bitwidths across the individual entries in M. The full computation of M is described in Algorithm 1.\nWe treat the distribution P over bitwidths as a model-wide hyperparameter. Since we only search up to 3 bitwidths in practice, we perform a simple grid sweep over the values of P. As we discuss in Section 4.3, our discretization is relatively insensitive to these hyperparameters, so a coarse sweep is adequate. The results of the sweep are represented by the function DistFromAvg in Algorithm 1.\n\nFigure 2: Effectiveness of heterogeneous bit selection techniques. (a) Bit selection representational power: ability of different binarization schemes to approximate a large tensor of normally distributed random values. (b) 1.4 bit HBNN AlexNet accuracy: accuracy of 1.4 bit heterogeneous binarized AlexNet-BN trained using each bit-selection technique.\n\nAlgorithm 1 Generation of bit map M.\nInput: A tensor T of size N and an average bitwidth B.\nOutput: A bit map M that can be used in Equation 5 to heterogeneously binarize T.\nR = T    \u25b7 Initialize R, which contains values that have not yet been assigned a bitwidth.\nx = 0\nP = DistFromAvg(B)    \u25b7 Generate distribution of bits to fit average.\nfor (b, p_b) in P do    \u25b7 b is a bitwidth and p_b is the percentage of T to binarize to width b.\n    S = SortHeuristic(R)    \u25b7 Sort indices of remaining values by suitability for b-bit binarization.\n    M[S[x : x + p_b N]] = b\n    R = R \\ R[S[x : x + p_b N]]    \u25b7 Do not consider these indices in next step.\n    x += p_b N\nend for\n\nGiven P, we need to determine how to distribute the various bitwidths using a value-aware method: assigning low bitwidths to values that do not need additional approximation and high bitwidths to those that do. To this end, we propose several sorting heuristic methods: Top-Down (TD), Middle-Out (MO), Bottom-Up (BU), and Random (R). These methods all attempt to sort the values of T based on how many bits each value should be binarized with. For example, Top-Down sorting assumes that larger values need fewer bits, and so performs a standard descending sort. Similarly, Middle-Out sorting distributes fewer bits to values closest to the mean of T, while Bottom-Up sorting assigns fewer bits to smaller values. As a simple control, we also consider Random sorting, which assigns bits in a completely uninformed way. The definitions of the sorting heuristics are given by Equation 6.\n\n$$\\begin{aligned} \\mathrm{TD}(T) &= \\mathrm{sort}(|T|, \\text{descending}) \\\\ \\mathrm{MO}(T) &= \\mathrm{sort}(\\big|\\,|T| - \\mathrm{mean}(|T|)\\,\\big|, \\text{ascending}) \\\\ \\mathrm{BU}(T) &= \\mathrm{sort}(|T|, \\text{ascending}) \\\\ \\mathrm{R}(T) &= \\text{a fixed uniformly random permutation of } T \\end{aligned} \\qquad (6)$$\n\nTo evaluate the methods in Equation 6, we performed two experiments. In the first, we created a large tensor of normally distributed values and binarized it with a variety of bit distributions P and each of the sorting heuristics using Algorithm 1. 
We then computed the Euclidean distance between the\nbinarized tensor and the original full precision tensor. A lower normalized distance suggests a more\npowerful sorting heuristic. The results of this experiment are shown in Figure 2a, and show that\nMiddle-Out sorting outperforms other heuristics by a signi\ufb01cant margin. Notably, the results suggest\nthat using Middle-Out sorting can produce approximations with fewer than 2-bits that are comparably\naccurate to 3-bit integer binarization.\nTo con\ufb01rm these results translate to accuracy in binarized convolutional networks, we consider 1.4\nbit binarized AlexNet, with bit distribution P set to 70% 1-bit, 20% 2-bit, and 10% 3-bit, an average\nof 1.4 bits. The speci\ufb01cs of the model and training procedure are the same as those described in\nSection 4.1. We train this model with each of the sorting heuristics and compare the \ufb01nal accuracy\nto gauge the representational strength of each heuristic. The results are shown in Figure 2b. As\nexpected, Middle-Out sorting performs signi\ufb01cantly better than other heuristics and yields an accuracy\ncomparable to 2-bit integer binarization despite using on average 1.4 bits.\nThe intuition behind the exceptional performance of Middle-Out is based on Figure 1 . We can see\nthat the values that are most likely to be accurate without additional bits are those that are closest to\nthe average \u00b5n for each step n. By assigning low bitwidths to the most average values, we can not\njust minimize losses, but in some cases provide a better approximation using fewer average steps. 
In the following sections, all training and evaluation is performed with Middle-Out as the sorting heuristic in Algorithm 1.\n\n4 Experiments\n\nTo evaluate HBNNs we wished to answer the following three questions:\n\n(1) How does accuracy scale with an uninformed bit distribution?\n(2) How well do HBNNs perform on a challenging dataset compared to the state of the art?\n(3) Can the benefits of HBNNs be transferred to other architectures?\n\nIn this section we address each of these questions.\n\n4.1 Implementation Details\n\nAlexNet with batch-normalization (AlexNet-BN) is the standard model used in binarization work due to its longevity and the general acceptance that improvements made to accuracy transfer well to more modern architectures. Batch normalization layers are applied to the output of each convolution block, but the model is otherwise identical to the original AlexNet model proposed by Krizhevsky et al. (2012). Besides its benefits in improving convergence, batch normalization is especially important for binary networks because of the need to equally distribute values around zero (Rastegari et al., 2016). We additionally insert binarization functions within the convolutional layers of the network when binarizing weights and at the input of convolutional layers when binarizing inputs. We keep a floating point copy of the weights that is updated during back-propagation, and binarized during forward propagation as is standard for binary network training. We use the straight-through estimator for gradients.\nWhen binarizing the weights of the network\u2019s output layer, we add a single parameter scaling layer that helps reduce the numerically large outputs of a binary layer to a size more amenable to softmax, as suggested by Tang et al. (2017). 
We train all models using an SGD solver with learning rate 0.01, momentum 0.9, and weight decay 1e-4 and randomly initialized weights for 90 epochs in PyTorch.\n\n4.2 Layer-level Heterogeneity\n\nAs a baseline, we test a \u201cpoor man\u2019s\u201d approach to HBNNs, where we fix up front the number of bits each layer is allowed, require all values in a layer to have the layer\u2019s assigned bitwidth, and then train as with conventional homogeneous binarization. We consider 10 mixes of 1, 2 and 3-bit layers so as to sweep average bitwidths between 1 and 2. We trained as described in Section 4.1. For this experiment, we used the CIFAR-10 dataset with a deliberately hobbled (4-layer fully convolutional) model with a maximum accuracy of roughly 78% as the baseline 32-bit variant. We chose CIFAR-10 to allow quick experimentation. We chose not to use a large model for CIFAR-10, because for large models it is known that even 1-bit models have 32-bit-level accuracy (Courbariaux et al., 2016).\nFigure 3a shows the results. Essentially, accuracy increases roughly linearly with average bitwidth. Although such linear scaling of accuracy with bitwidth is itself potentially useful (since it allows finer grain tuning on FPGAs), we hope for even better scaling with the \u201cdata-aware\u201d bitwidth selection provided by HBNNs.\n\nFigure 3: Accuracy results of trained HBNN models. (a) CIFAR-10 uninformed bit selection: sweep of heterogeneous bitwidths on a deliberately simplified four layer convolutional model for CIFAR-10. (b) HBNN AlexNet with Middle-Out bit selection: accuracy of heterogeneous bitwidth AlexNet-BN models. Bits are distributed using the Middle-Out selection algorithm.\n\n4.3 Bit Distribution Generation\n\nAs described in Section 3.2, one of the considerations when using HBNNs is how to take a desired average bitwidth and produce a matching distribution of bits. For example, using 70% 1-bit, 20% 2-bit and 10% 3-bit values gives an average of 1.4 bits, but so too does 80% 1-bit and 20% 3-bit values. We suspected that the choice of this distribution would have a significant impact on the accuracy of trained HBNNs, and performed a hyperparameter sweep by varying DistFromAvg in Algorithm 1 when training AlexNet on ImageNet as described in the following sections. However, much to our surprise, models trained with the same average bitwidth achieved nearly identical accuracies regardless of distribution. For example, the two 1.4-bit distributions given above yield accuracies of 49.4% and 49.3% respectively. This suggests that the choice of DistFromAvg is actually unimportant, which is quite convenient as it simplifies training of HBNNs considerably.\n\nTable 1: Accuracy of related binarization work and our results\n\nRow | Model | Name | Binarization (Inputs / Weights) | Top-1 | Top-5\nBinarized weights with floating point activations\n1 | AlexNet | SQ-BWN (Dong et al., 2017) | full precision / 1-bit | 51.2% | 75.1%\n2 | AlexNet | SQ-TWN (Dong et al., 2017) | full precision / 2-bit | 55.3% | 78.6%\n3 | AlexNet | TWN (our implementation) | full precision / 1-bit | 48.3% | 71.4%\n4 | AlexNet | TWN | full precision / 2-bit | 54.2% | 77.9%\n5 | AlexNet | HBNN (our results) | full precision / 1.4-bit | 55.2% | 78.4%\n6 | MobileNet | HBNN | full precision / 1.4-bit | 65.1% | 87.2%\nBinarized weights and activations excluding input and output layers\n7 | AlexNet | BNN (Courbariaux et al., 2015) | 1-bit / 1-bit | 27.9% | 50.4%\n8 | AlexNet | Xnor-Net (Rastegari et al., 2016) | 1-bit / 1-bit | 44.2% | 69.2%\n9 | AlexNet | DoReFaNet (Zhou et al., 2016) | 2-bit / 1-bit | 50.7% | 72.6%\n10 | AlexNet | QNN (Hubara et al., 2016) | 2-bit / 1-bit | 51.0% | 73.7%\n11 | AlexNet | our implementation | 2-bit / 2-bit | 52.2% | 74.5%\n12 | AlexNet | our implementation | 3-bit / 3-bit | 54.2% | 78.1%\n13 | AlexNet | HBNN | 1.4-bit / 1.4-bit | 53.2% | 77.1%\n14 | AlexNet | HBNN | 1-bit / 1.4-bit | 49.4% | 72.1%\n15 | AlexNet | HBNN | 1.4-bit / 1-bit | 51.5% | 74.2%\n16 | AlexNet | HBNN | 2-bit / 1.4-bit | 52.0% | 74.5%\n17 | MobileNet | our implementation | 1-bit / 1-bit | 52.9% | 75.1%\n18 | MobileNet | our implementation | 2-bit / 1-bit | 61.3% | 80.1%\n19 | MobileNet | our implementation | 2-bit / 2-bit | 63.0% | 81.8%\n20 | MobileNet | our implementation | 3-bit / 3-bit | 65.9% | 86.7%\n21 | MobileNet | HBNN | 1-bit / 1.4-bit | 60.1% | 78.7%\n22 | MobileNet | HBNN | 1.4-bit / 1-bit | 62.0% | 81.3%\n23 | MobileNet | HBNN | 1.4-bit / 1.4-bit | 64.7% | 84.9%\n24 | MobileNet | HBNN | 2-bit / 1.4-bit | 63.6% | 82.2%\nUnbinarized (our implementation)\n25 | AlexNet | (Krizhevsky et al., 2012) | full precision / full precision | 56.5% | 80.1%\n26 | MobileNet | (Howard et al., 2017) | full precision / full precision | 68.8% | 89.0%\n\n4.4 AlexNet: Binarized Weights and Non-Binarized Activations\n\nRecently, Dong et al. (2017) were able to binarize the weights of an AlexNet-BN model to 2 bits and achieve nearly full precision accuracy (row 2 of Table 1). We consider this to be the state of the art in weight binarization since the model achieves excellent accuracy despite all layer weights being binarized, including the input and output layers which have traditionally been difficult to approximate. We perform a sweep of AlexNet-BN models binarized with fractional bitwidths using middle-out selection with the goal of achieving comparable accuracy using fewer than two bits.\nThe results of this sweep are shown in Figure 3b. We were able to achieve nearly identical top-1 accuracy to the best full 2 bit results (55.3%) with an average of only 1.4 bits (55.2%). As we had hoped, we also found that the accuracy scales in a super-linear manner with respect to bitwidth when using middle-out bit selection. 
Speci\ufb01cally, the model accuracy increases extremely quickly from 1\nbit to 1.3 bits before slowly approaching the full precision accuracy.\n\n7\n\n\f4.5 AlexNet: Binarized Weights and Activations\n\nIn order to realize the speed-up bene\ufb01ts of binarization (on CPU or FPGA) in practice, it is necessary\nto binarize both inputs the weights, which allows \ufb02oating point multiplies to be replaced with packed\nbitwise logical operations. The number of operations in a binary network is reduced by a factor of\nmn where m is the number of bits used to binarize inputs and n is the number of bits to binarize\n64\nweights. Thus, there is signi\ufb01cant motivation to keep the bitwidth of both inputs and weights as low\nas possible without losing too much accuracy. When binarizing inputs, the input and output layers are\ntypically not binarized as the effects on the accuracy are much larger than other layers. We perform\nanother sweep on AlexNet-BN with all layers but the input and output fully binarized and compare\nthe accuracy of HBNNs to several recent results. Row 8 of Table 1 is the top previously reported\naccuracy (44.2%) for single bit input and weight binarization, while row 10 (51%) is the top accuracy\nfor 2-bit inputs and 1-bit weights.\nTable 1 (rows 13 to 16) reports a selection of results from this search. Using 1.4 bits to binarize inputs\nand weights (mn = 1.4 \u00d7 1.4 = 1.96) gives a very high accuracy (53.2% top-1) while having the\nsame number of total operations mn as a network, such as the one from row 10, binarized with 2 bit\nactivations and 1 bit weights. We have similarly good results when leaving the input binarization\nbitwidth an integer. Using 1 bit inputs and 1.4 bit weights, we reach 49.4% top-1 accuracy which is a\nlarge improvement over Rastegari et al. (2016) at a small cost. We found that using more than 1.4\naverage bits had very little impact on the overall accuracy. 
Binarizing inputs to 1.4 bits and weights\nto 1 bit (row 15) similarly outperforms Hubara et al. (2016) (row 10).\n\n4.6 MobileNet Evaluation\n\nAlthough AlexNet serves as an essential measure to compare to previous and related work, it is\nimportant to con\ufb01rm that the bene\ufb01ts of heterogeneous binarization is model independent. To\nthis end, we perform a similar sweep of binarization parameters on MobileNet, a state of the art\narchitecture that has unusually high accuracy for its low number of parameters (Howard et al., 2017).\nMobileNet is made up of separable convolutions instead of the typical dense convolutions of AlexNet.\nEach separable convolution is composed of an initial spatial convolution followed by a depth-wise\nconvolution. Because the vast bulk of computation time is spent in the depth-wise convolution, we\nbinarize only its weights, leaving the spatial weights \ufb02oating point. We binarize the depth wise\nweights of each MobileNet layer in a similar fashion as in section 4.4 and achieve a Top-1 accuracy\nof 65.1% (row 6). This is only a few percent below our unbinarized implementation (row 26), which\nis an excellent result for the signi\ufb01cant reduction in model size.\nWe additionally perform a sweep of many different binarization bitwidths for both the depth-wise\nweights and input activations of MobileNet, with results shown in rows 17-24 of Table 1. Just as in\nthe AlexNet case, we \ufb01nd that MobileNet with an average of 1.4 bits (rows 21 and 22) achieves over\n10% higher accuracy than 1-bit binarization (row 17). We similarly observe that 1.4-bit binarization\noutperforms 2-bit binarization in each permutation of bitwidths. 
The excellent performance of HBNN\nMobileNet con\ufb01rms that heterogeneous binarization is fundamentally valuable, and we can safely\ninfer that it is applicable to many other network architectures as well.\n\n5 Hardware Implementability\n\nOur experiments demonstrate that HBNNs have signi\ufb01cant advantages compared to integer bitwidth\napproximations. However, with these representational bene\ufb01ts come added complexity in implemen-\ntation. Binarization typically provides a signi\ufb01cant speed up by packing bits into 64-bit integers,\nallowing a CPU or GPU to perform a single xnor operation in lieu of 64 \ufb02oating-point multiplications.\nHowever, Heterogeneous tensors are essentially composed of sparse arrays of bits. Array sparsity\nmakes packing bits inef\ufb01cient, nullifying much of the speed bene\ufb01ts one would expect from having\nfewer average bits. The necessity of bit packing exists because CPUs and GPUs are designed to\noperate on groups of bits rather than individual bits. However, programmable or custom hardware\nsuch as FPGAs and ASICs have no such restriction. In hardware, each parameter can have its own\nset of n xnor-popcount units, where n is the bitwidth of that particular parameter. In FPGAs and\nASICs, the total number of computational units in a network has a signi\ufb01cant impact on the power\nconsumption and speed of inference. 
Thus, the benefits of HBNNs, higher accuracy with fewer computational units, are fully realizable.\n\nTable 2: Hardware Implementation Metrics\n\nRow | Platform | Model | Unfolding | Bits | Occupancy | kFPS | P_chip (W) | Top-1\nCIFAR-10 Baseline Implementations\n1 | ZC706 | VGG-8 | 1\u00d7 | 1 | 21.2% | 21.9 | 3.6 | 80.90%\n2 | ZC706 | VGG-8 | 4\u00d7 | 1 | 84.8% | 87.6 | 14.4 | 80.90%\n3 | ASIC | VGG-8 | - | 2 | 6.06 mm^2 | 3.4 | 0.38 | 87.89%\nCIFAR-10 HBNN Customization\n4 | ZC706 | VGG-8 | 1\u00d7 | 1.2 | 25.4% | 18.25 | 4.3 | 85.8%\n5 | ZC706 | VGG-8 | 1\u00d7 | 1.4 | 29.7% | 15.6 | 5.0 | 89.4%\n6 | ZC706 | VGG-8 | 4\u00d7 | 1.2 | 100% | 73.0 | 17.0 | 85.8%\n7 | ASIC | VGG-8 | - | 1.2 | 2.18 mm^2 | 3.4 | 0.14 | 85.8%\n8 | ASIC | VGG-8 | - | 1.4 | 2.96 mm^2 | 3.4 | 0.18 | 89.4%\nExtrapolation to MobileNet with ImageNet Data\n9 | ZC706 | MobileNet | 1\u00d7 | 1 | 20.0% | 0.45 | 3.4 | 52.9%\n10 | ZC706 | MobileNet | 1\u00d7 | 2 | 40.0% | 0.23 | 6.8 | 63.0%\n11 | ZC706 | MobileNet | 1\u00d7 | 1.4 | 28.0% | 0.32 | 4.76 | 64.7%\n12 | ASIC | MobileNet | - | 2 | 297 mm^2 | 3.4 | 18.62 | 63.0%\n13 | ASIC | MobileNet | - | 1.4 | 145.5 mm^2 | 3.4 | 9.1 | 64.7%\n\nThere have been several recent binary convolutional neural network implementations on FPGAs and ASICs that provide a baseline we can use to estimate the performance of HBNNs on ZC706 FPGA platforms (Umuroglu et al., 2017) and on ASIC hardware (Alemdar et al., 2017). The results of these implementations are summarized in rows 1-3 of Table 2. Here, unfolding refers to the number of computational units placed for each parameter; by having multiple copies of a parameter, throughput can be increased through improved parallelization. Bits refers to the level of binarization of both the input activations and weights of the network. Occupancy is the number of LUTs required to implement the network divided by the total number of LUTs available for an FPGA, or the chip area for an ASIC. 
Rows 4-12 of Table 2 show the metrics of HBNN versions of the baseline\nmodels. Some salient points that can be drawn from the table include:\n\n\u2022 Comparing rows 1, 4, and 5 shows that on FPGA, fractional binarization offers fine-grained\ntuning of the performance-accuracy trade-off. Notably, a significant accuracy boost is\nobtainable for only slightly higher occupancy and power consumption.\n\u2022 Rows 2 and 6 both show the effect of unrolling. Notably, with 1.2 average bits, there is no\nremaining space on the ZC706. This means that using a full 2 bits, a designer would have\nto use a lower unrolling factor. In many cases, it may be ideal to adjust average bitwidth\nto reach maximum occupancy, giving the highest possible accuracy without sacrificing\nthroughput.\n\u2022 Rows 3, 7, and 8 show that in ASIC, the size and power consumption of a chip can be\ndrastically reduced without impacting accuracy at all.\n\u2022 Rows 9-13 demonstrate that the benefits of fractional binarization are not restricted to CIFAR,\nand extend to MobileNet in a similar way. The customization options and in many cases\ndirect performance boosts offered by HBNNs are valuable regardless of model architecture.\n\n6 Conclusion\n\nIn this paper, we present Heterogeneous Bitwidth Neural Networks (HBNNs), a new type of binary\nnetwork that is not restricted to integer bitwidths. Allowing effectively fractional bitwidths in\nnetworks gives a vastly improved ability to tune the trade-offs between accuracy, compression, and\nspeed that come with binarization. We introduce middle-out bit selection as the top performing\ntechnique for determining where to place bits in a heterogeneous bitwidth tensor. On the ImageNet\ndataset with AlexNet and MobileNet models, we perform extensive experiments to validate the\neffectiveness of HBNNs compared to the state of the art and full precision accuracy. 
The results\nof these experiments are highly compelling, with HBNNs matching or outperforming competing\nbinarization techniques while using fewer average bits.\n\nReferences\n\nAlemdar, Hande, Leroy, Vincent, Prost-Boucle, Adrien, and P\u00e9trot, Fr\u00e9d\u00e9ric. Ternary neural networks\nfor resource-efficient AI applications. In Neural Networks (IJCNN), 2017 International Joint\nConference on, pp. 2547\u20132554. IEEE, 2017.\n\nChen, Wenlin, Wilson, James, Tyree, Stephen, Weinberger, Kilian, and Chen, Yixin. Compressing\nneural networks with the hashing trick. In International Conference on Machine Learning, pp.\n2285\u20132294, 2015.\n\nCourbariaux, Matthieu, Bengio, Yoshua, and David, Jean-Pierre. BinaryConnect: Training deep\nneural networks with binary weights during propagations. In Advances in Neural Information\nProcessing Systems, pp. 3123\u20133131, 2015.\n\nCourbariaux, Matthieu, Hubara, Itay, Soudry, Daniel, El-Yaniv, Ran, and Bengio, Yoshua. Binarized\nneural networks: Training deep neural networks with weights and activations constrained to +1\nor -1. arXiv preprint arXiv:1602.02830, 2016.\n\nDong, Yinpeng, Ni, Renkun, Li, Jianguo, Chen, Yurong, Zhu, Jun, and Su, Hang. Learning accurate\nlow-bit deep neural networks with stochastic quantization. arXiv preprint arXiv:1708.01001, 2017.\n\nEhliar, Andreas. Area efficient floating-point adder and multiplier with IEEE-754 compatible semantics.\nIn Field-Programmable Technology (FPT), 2014 International Conference on, pp. 131\u2013138. IEEE,\n2014.\n\nGong, Yunchao, Liu, Liu, Yang, Ming, and Bourdev, Lubomir. Compressing deep convolutional\nnetworks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.\n\nHan, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural networks\nwith pruning, trained quantization and Huffman coding. 
arXiv preprint arXiv:1510.00149, 2015.\n\nHoward, Andrew G, Zhu, Menglong, Chen, Bo, Kalenichenko, Dmitry, Wang, Weijun, Weyand,\nTobias, Andreetto, Marco, and Adam, Hartwig. MobileNets: Efficient convolutional neural\nnetworks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.\n\nHubara, Itay, Courbariaux, Matthieu, Soudry, Daniel, El-Yaniv, Ran, and Bengio, Yoshua. Quantized\nneural networks: Training neural networks with low precision weights and activations. arXiv\npreprint arXiv:1609.07061, 2016.\n\nKrizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep\nconvolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097\u20131105,\n2012.\n\nLi, Fengfu, Zhang, Bo, and Liu, Bin. Ternary weight networks. arXiv preprint arXiv:1605.04711,\n2016.\n\nRastegari, Mohammad, Ordonez, Vicente, Redmon, Joseph, and Farhadi, Ali. XNOR-Net: ImageNet\nclassification using binary convolutional neural networks. In European Conference on Computer\nVision, pp. 525\u2013542. Springer, 2016.\n\nSzegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir,\nErhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1\u20139, 2015.\n\nTang, Wei, Hua, Gang, and Wang, Liang. How to train a compact binary neural network with high\naccuracy? In AAAI, pp. 2625\u20132631, 2017.\n\nUmuroglu, Yaman, Fraser, Nicholas J, Gambardella, Giulio, Blott, Michaela, Leong, Philip, Jahre,\nMagnus, and Vissers, Kees. FINN: A framework for fast, scalable binarized neural network inference.\nIn Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate\nArrays, pp. 65\u201374. ACM, 2017.\n\nZhou, Shuchang, Wu, Yuxin, Ni, Zekun, Zhou, Xinyu, Wen, He, and Zou, Yuheng. 
DoReFa-Net:\nTraining low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint\narXiv:1606.06160, 2016.", "award": [], "sourceid": 1984, "authors": [{"given_name": "Joshua", "family_name": "Fromm", "institution": "University of Washington"}, {"given_name": "Shwetak", "family_name": "Patel", "institution": "University of Washington"}, {"given_name": "Matthai", "family_name": "Philipose", "institution": "Microsoft Research"}]}