{"title": "MetaQuant: Learning to Quantize by Learning to Penetrate Non-differentiable Quantization", "book": "Advances in Neural Information Processing Systems", "page_first": 3916, "page_last": 3926, "abstract": "Tremendous amount of parameters make deep neural networks impractical to be deployed for edge-device-based real-world applications due to the limit of computational power and storage space. Existing studies have made progress on learning quantized deep models to reduce model size and energy consumption, i.e. converting full-precision weights ($r$'s) into discrete values ($q$'s) in a supervised training manner. However, the training process for quantization is non-differentiable, which leads to either infinite or zero gradients ($g_r$) w.r.t. $r$. To address this problem, most training-based quantization methods use the gradient w.r.t. $q$ ($g_q$) with clipping to approximate $g_r$ by Straight-Through-Estimator (STE) or manually design their computation. However, these methods only heuristically make training-based quantization applicable, without further analysis on how the approximated gradients can assist training of a quantized network. In this paper, we propose to learn $g_r$ by a neural network. Specifically, a meta network is trained using $g_q$ and $r$ as inputs, and outputs $g_r$ for subsequent weight updates. The meta network is updated together with the original quantized network. Our proposed method alleviates the problem of non-differentiability, and can be trained in an end-to-end manner. Extensive experiments are conducted with CIFAR10/100 and ImageNet on various deep networks to demonstrate the advantage of our proposed method in terms of a faster convergence rate and better performance. 
Codes are released at: \\texttt{https://github.com/csyhhu/MetaQuant}", "full_text": "MetaQuant: Learning to Quantize by Learning to\n\nPenetrate Non-differentiable Quantization\n\nShangyu Chen\n\nWenya Wang\n\nNanyang Technological University, Singapore\n\nNanyang Technological University, Singapore\n\nschen025@e.ntu.edu.sg\n\nwangwy@ntu.edu.sg\n\nSinno Jialin Pan\n\nNanyang Technological University, Singapore\n\nsinnopan@ntu.edu.sg\n\nAbstract\n\nTremendous amount of parameters make deep neural networks impractical to be\ndeployed for edge-device-based real-world applications due to the limit of compu-\ntational power and storage space. Existing studies have made progress on learning\nquantized deep models to reduce model size and energy consumption, i.e. convert-\ning full-precision weights (r\u2019s) into discrete values (q\u2019s) in a supervised training\nmanner. However, the training process for quantization is non-differentiable, which\nleads to either in\ufb01nite or zero gradients (gr) w.r.t. r. To address this problem, most\ntraining-based quantization methods use the gradient w.r.t. q (gq) with clipping\nto approximate gr by Straight-Through-Estimator (STE) or manually design their\ncomputation. However, these methods only heuristically make training-based\nquantization applicable, without further analysis on how the approximated gra-\ndients can assist training of a quantized network. In this paper, we propose to\nlearn gr by a neural network. Speci\ufb01cally, a meta network is trained using gq\nand r as inputs, and outputs gr for subsequent weight updates. The meta network\nis updated together with the original quantized network. Our proposed method\nalleviates the problem of non-differentiability, and can be trained in an end-to-end\nmanner. Extensive experiments are conducted with CIFAR10/100 and ImageNet\non various deep networks to demonstrate the advantage of our proposed method in\nterms of a faster convergence rate and better performance. 
Codes are released at:\nhttps://github.com/csyhhu/MetaQuant\n\n1\n\nIntroduction\n\nDeep neural networks have shown promising results in various computer vision tasks. However,\nmodern deep learning models usually contain many layers and enormous amount of parameters [9],\nwhich limits their applications on edge devices. To reduce parameters redundancy, continuous effects\nin architecture re\ufb01nement have been made, such as using small kernel convolutions [14] and reusing\nfeatures [6]. Consider a very deep model which is fully-trained. To use it for making predictions,\nmost of the computations involve multiplications of a real-valued weight by a real-valued activation\nin a forward pass. These multiplications are expensive as they are all \ufb02oat-point to \ufb02oat-point\nmultiplication operations. To alleviate this problem, a number of approaches have been proposed to\nquantize deep models. Courbariaux et al. [4] and Hubara et al. [7] proposed to binarize weights of the\ndeep model to be in {\u00b11}. To provide more \ufb02exibility for quantized values in each layer, Rastegari et\nal. [13] introduced a \ufb02oat value \u03b1l known as the scaling factor for layer l to turn binarized weights\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\finto \u03b1l \u00d7 {\u00b11}. Li et al. [11] extended binary weights to ternary values, and Zhou et al. [17] further\nincorporated activation and gradient quantization.\nTraining-based quantization methods generate quantized neural networks under the training mech-\nanism. Existing training-based quantization methods can be roughly categorized into \u201cSTE\u201d and\n\u201cNon-STE\u201d methods. \u201cSTE\u201d methods contain a non-differentiable discrete quantization function,\nconnecting the full-precision weights and quantized weights. During backpropagation, STE is used\nto penetrate this non-differentiable function. (e.g.[7], [13], [17]). 
"Non-STE" methods are referred to as learning without STE, by directly working on full-precision weights with a regularizer to obtain feasible quantization ([2]), or by weight projection using proximal gradient methods ([10], [5]). The training process in Non-STE quantization suffers from heavy hyper-parameter tuning, such as the weight-partition portion in each step [15] and the penalty setting in [10].
Specifically, STE quantization methods follow a rather simple and standard training protocol. Given a neural network f with full-precision weights W, a quantization function Q(·) (without loss of generality, Q(r) is set as a mapping from r to 1 if r ≥ 0, and −1 otherwise), and labeled data (x, y), the objective is to minimize the training loss: ℓ(f(Q(W); x), y). However, due to the non-differentiability of Q, the gradient of ℓ w.r.t. W cannot be computed using the chain rule: ∂ℓ/∂W = (∂ℓ/∂Q(W)) · (∂Q(W)/∂W), where ∂Q(W)/∂W is infinite when W = 0 and zero elsewhere. To enable stable quantization training, Hubara et al. [7] proposed the Straight-Through-Estimator (STE) to redefine ∂Q(r)/∂r:

\frac{\partial Q(r)}{\partial r} = \begin{cases} 1 & \text{if } |r| \le 1, \\ 0 & \text{otherwise.} \end{cases}

STE is widely used in training-based quantization methods1 as it provides an approximated gradient for penetration of Q with a simple implementation. However, it inevitably brings the problem of gradient mismatch: the gradients of the weights are not generated using the value of the weights, but rather their quantized values. Although STE provides an end-to-end training benefit under discrete constraints, few works have investigated how to obtain better gradients for quantization training.
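The STE rule above can be sketched in a few lines of NumPy. This is a minimal illustration in our own notation, not the authors' code: the forward pass applies the non-differentiable sign quantizer, and the backward pass passes the gradient w.r.t. q straight through, clipped to the region |r| ≤ 1.

```python
import numpy as np

def quantize_sign(r):
    """Forward: Q(r) = +1 if r >= 0 else -1 (non-differentiable)."""
    return np.where(r >= 0, 1.0, -1.0)

def ste_backward(g_q, r):
    """STE: pass the gradient w.r.t. q straight through,
    zeroed outside the region |r| <= 1 where dQ/dr is redefined as 1."""
    return g_q * (np.abs(r) <= 1.0)

r = np.array([-1.5, -0.3, 0.2, 2.0])      # full-precision weights
q = quantize_sign(r)                      # -> [-1., -1., 1., 1.]
g_q = np.array([0.5, -0.2, 0.1, 0.4])     # gradient of loss w.r.t. q
g_r = ste_backward(g_q, r)                # -> [0., -0.2, 0.1, 0.]
```

Note how the coordinates with |r| > 1 receive zero gradient: this clipping is the only "knowledge" STE encodes, which is exactly what MetaQuant replaces with a learned function.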
In the methods HWGQ [3] and Bi-real [12], ∂Q(r)/∂r is manually defined, but they focused on activation quantization.
To overcome the problem of gradient mismatch and explore better gradients in training-based methods, inspired by [1], we propose to learn ∂ℓ/∂W by a neural network (M) in quantization training. This additional neural network is referred to as the meta quantizer and is trained together with the base quantized model. The whole process is denoted by Meta Quantization (MetaQuant). Specifically, in each backward propagation, M takes ∂ℓ/∂Q(W) and W as inputs in a coordinate-wise manner, and its output is then used to compute ∂ℓ/∂W for updating the weights W with common optimization methods such as SGD or Adam [8]. In a forward pass, inference is performed using the quantized version of the updated weights, which produces the final outputs to be compared with the ground-truth labels for backward computation. During this process, gradient propagation from the quantized weights to the full-precision weights is handled by M, which avoids the problems of non-differentiability and gradient mismatch. Besides, the gradients generated by the meta quantizer are loss-aware, contributing to better performance of the quantization training.
Compared with the commonly-used STE and manually designed gradient propagation in quantization training, MetaQuant learns to generate proper gradients without any manually designed knowledge. The whole process is end-to-end. The meta quantizer can be viewed as a plug-in to any base model, making it easy and general to implement in modern architectures. After quantization training is finished, the meta quantizer can be removed, so it consumes no extra space for inference.
We compare MetaQuant with STE under different quantization functions (dorefa [17], BWN [13]) and optimization techniques (SGD, Adam) with CIFAR10/100 and ImageNet on various base models to verify MetaQuant's generalizability. Extensive experiments show that MetaQuant achieves a faster convergence speed under SGD and better performance under both SGD and Adam.

1 In the following description, training-based quantization refers to STE training-based quantization.

2 Related Work

Courbariaux et al. [4] proposed to train binarized networks through deterministic and stochastic rounding in the parameter update after backpropagation. This idea was further extended in [7] and [13] by introducing binary activations. Nevertheless, these pioneering attempts face the problem of a non-differentiable rounding operator during back-propagation, which is handled by directly penetrating the rounding with an unchanged gradient. To bypass non-differentiability, Leng et al. [10] modified the quantization training objective using ADMM, which separates the training of real-valued parameters from the quantization of the updated parameters. Zhou et al. [15] proposed to incrementally quantize a portion of the parameters based on weight partition and to update the un-quantized parameters by normal training. However, such methods introduce more hyper-parameter tuning, such as determining the procedure of partial quantization, thus complicating quantization. Bai et al. [2] added a regularizer in quantization training to transform full-precision weights into quantized values. Though this method simplifies the quantization training procedure, its optimization involves the proximal method, which makes training expensive.

3 Problem Statement

Given a training set of n labeled instances {x, y}'s, a pre-trained full-precision base model f with L layers is parameterized by W = [W1, ..., WL].
We define a pre-processing function A(·) and a quantization function Q(·). A(·) converts W into W̃, which is rescaled and centralized to make it easier to quantize. Q(·) discretizes W̃ to Ŵ using k bits. Specifically, two pre-processing functions and the corresponding quantization methods (dorefa2, BWN) are studied in this work:

\text{dorefa:}\quad \tilde{W} = A(W) = \frac{\tanh(W)}{2\max(|\tanh(W)|)} + \frac{1}{2}, \quad \hat{W} = Q(\tilde{W}) = \frac{2}{2^k - 1}\,\mathrm{round}\!\left[(2^k - 1)\tilde{W}\right] - 1. \quad (1)

\text{BWN:}\quad \tilde{W} = A(W) = W, \quad \hat{W} = Q(\tilde{W}) = \frac{1}{n}\|\tilde{W}\|_{\ell 1} \times \mathrm{sign}(\tilde{W}). \quad (2)

Training-based quantization aims at training a quantized version of W, i.e., Ŵ, such that the loss of the quantized f is minimized: min_Ŵ ℓ(f(Ŵ; x), y).

4 Meta Quantization

4.1 Generation of Meta Gradient

Our proposed MetaQuant incorporates a shared meta quantizer M_φ, parameterized by φ, across layers into quantization training. After W is quantized as Ŵ (the subscript l is omitted for ease of notation), a loss ℓ is generated by comparing f(Ŵ; x) with the ground truth.
In back-propagation, the gradient of ℓ w.r.t. Ŵ is computed by the chain rule and denoted by g_Ŵ = ∂ℓ/∂Ŵ. The meta quantizer M_φ receives g_Ŵ and W̃ as inputs, and outputs the gradient of ℓ w.r.t. W̃, denoted by g_W̃:

g_{\tilde{W}} = \frac{\partial \ell}{\partial \tilde{W}} = M_\phi(g_{\hat{W}}, \tilde{W}). \quad (3)

The gradient g_W̃ is further used to compute the gradient of ℓ w.r.t.
W, denoted by g_W, which is computed via:

g_W = \frac{\partial \ell}{\partial \tilde{W}} \frac{\partial \tilde{W}}{\partial W} = g_{\tilde{W}}\, \frac{\partial \tilde{W}}{\partial W} = M_\phi(g_{\hat{W}}, \tilde{W})\, \frac{\partial \tilde{W}}{\partial W}, \quad (4)

where ∂W̃/∂W depends on the pre-processing function between W and W̃: ∂W̃/∂W = (1 − tanh²(W)) / (2 max(|tanh(W)|)) for dorefa according to (1), and ∂W̃/∂W = 1 for BWN according to (2). This process is referred to as calibration.

2 In this work, we only consider the forward quantization function for weight quantization used in [17], and denote it as "dorefa".

Figure 1: The overall workflow of MetaQuant. During backward propagation, gradients are represented as blue lines. A dashed blue line means the propagation is non-differentiable and requires special handling. A shared meta network M takes g_Ŵ and W̃ as input, and outputs the gradient of W̃ (g_W̃). With g_W̃, the gradient of the weights W can be computed using (4). Finally, W is updated with (5), with the assistance of different optimization methods reflected in π(·).

Before using g_W to update W, g_W is first processed according to the chosen optimization method to produce the final update value for each weight. This process is named gradient refinement and is denoted by π(·) in the sequel. Specifically, for SGD, π(g_W) = g_W. For other optimization methods such as Adam, π(·) can be implemented as π(g_W) = g_W + residual, where the "residual" is computed according to the specific gradient refinement method. Finally, the full-precision weights W are updated as:

W^{t+1} = W^t - \alpha\,\pi(g_W^t), \quad (5)

where t denotes the t-th training iteration and α is the learning rate.
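To make (1)-(2) and the calibration-plus-update step of (4)-(5) concrete, here is a minimal NumPy sketch. All function names are ours, and `meta_fn` is only a stand-in for M_φ (which in MetaQuant is itself a trained network); this is an illustration, not the authors' implementation:

```python
import numpy as np

def dorefa_forward(W, k=1):
    """Pre-process and quantize as in Eq. (1): W_tilde in [0, 1], W_hat in [-1, 1]."""
    W_tilde = np.tanh(W) / (2 * np.max(np.abs(np.tanh(W)))) + 0.5
    n = 2 ** k - 1
    W_hat = 2 * np.round(n * W_tilde) / n - 1
    return W_tilde, W_hat

def bwn_forward(W):
    """BWN, Eq. (2): mean absolute value times the sign."""
    W_tilde = W
    W_hat = np.abs(W_tilde).mean() * np.sign(W_tilde)
    return W_tilde, W_hat

def dorefa_calibration(W):
    """dW_tilde/dW for the dorefa pre-processing (for BWN it is simply 1)."""
    return (1 - np.tanh(W) ** 2) / (2 * np.max(np.abs(np.tanh(W))))

def metaquant_update(W, W_tilde, g_q, meta_fn, lr=1e-3, refine=lambda g: g):
    """Eqs. (4)-(5): g_W = M(g_q, W_tilde) * dW_tilde/dW, then W <- W - lr * pi(g_W).
    `refine` plays the role of the gradient refinement pi (identity for SGD)."""
    g_W = meta_fn(g_q, W_tilde) * dorefa_calibration(W)
    return W - lr * refine(g_W)
```

With `meta_fn` set to an STE-like clipping function this reduces to plain STE training; MetaQuant instead learns `meta_fn`.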
Fig. 1 illustrates the overall procedure of MetaQuant.
Compared with [1], which directly learns g_W, MetaQuant constructs a neural network to learn g_W̃, which cannot be directly computed in quantization training due to the non-differentiability of the quantization function. Our work resolves the issue of non-differentiability and is general to different optimization methods. Insight into how and why MetaQuant works is given in Appendix 7.2.

4.2 Training of Meta Quantizer

Similar to [1], our proposed meta quantizer is a coordinate-wise neural network, which means that each weight parameter is processed independently. For a single weight index i, W̃_i receives its corresponding gradient g_W̃_i via g_W̃_i = M_φ(g_Ŵ_i, W̃_i). For efficient processing during inference, the inputs in (3) are arranged as batches of size 1. Specifically, suppose W comes from a convolution layer with shape R^{o×i×k×k}, where o, i and k denote the number of output channels, input channels and kernel size, respectively. Then W̃, Ŵ and the corresponding gradients share the same shape, and the inputs in (3) are reshaped to R^{(o×i×k²)×1}. Recall from (5) and (4) that the output of M_φ is incorporated into the value of the updated W^t, which is then quantized in the next iteration's inference.

Figure 2: Incorporation of the meta quantizer into quantization training. ΔW is composed of calibration, gradient refinement and multiplication by the learning rate α. The output of the meta quantizer is involved in W's update and contributes to the final loss, constructing a differentiable path from the loss to the φ-parameterized meta quantizer.

Therefore, M_φ is associated with the final quantization training loss, and receives gradient updates on φ backpropagated from the final loss.
By introducing the meta quantizer to produce g_W̃, MetaQuant not only addresses the non-differentiability issue for parameters in the base model, but also provides an end-to-end training benefit throughout the whole network. Moreover, the meta quantizer is loss-aware, hence it is trained to generate more accurate updates for W that reduce the final loss, which explores how gradients can be modified to assist quantization training. Figure 2 illustrates the detailed process of incorporating the meta quantizer into the quantization training of the base model, which forms a differentiable path from the final loss to φ. While W undergoes quantization training, φ is also learned in each training iteration t:

Forward:
\tilde{W}^t = A(W^t) = A\!\left[ W^{t-1} - \alpha \times \pi\!\left( M_\phi(g_{\hat{W}}^{t-1}, \tilde{W}^{t-1})\, \frac{\partial \tilde{W}^{t-1}}{\partial W^{t-1}} \right) \right], \quad (6)

\mathrm{Loss} = \ell\!\left( f\!\left[ Q(\tilde{W}^t); x \right], y \right), \quad (7)

Backward:
\frac{\partial \ell}{\partial \phi^t} = \frac{\partial \ell}{\partial \tilde{W}^t}\, \frac{\partial \tilde{W}^t}{\partial \phi^t} = M_\phi(g_{\hat{W}^t}, \tilde{W}^t)\, \frac{\partial \tilde{W}^t}{\partial \phi^t}. \quad (8)

In the Forward pass, we use a combination of W^{t−1} and the meta gradient to represent W^t, in order to incorporate M_φ. Specifically, in (6) the meta gradient is derived from M's output, which is first multiplied by the calibration term to obtain the gradient of W, then refined by the optimization π, and finally scaled by the learning rate. (8) calculates the gradient of φ; here ∂W̃/∂φ is differentiable because A is differentiable, and a differentiable meta neural network is chosen.
W^t is actually updated only after the Backward pass, which can be regarded as a late weight update.

4.3 Design of Meta Quantizer

The meta quantizer M_φ is a parameterized, differentiable neural network that generates the meta gradient. It can be viewed as a generalization of STE. For example, M_φ reduces to STE if it clips g_W̃ according to the absolute magnitude of W̃: g_W̃ = M_φ(g_Ŵ, W̃) = g_Ŵ · 1_{|W̃|≤1}.
We design three different architectures for the meta quantizer. The first simply uses a neural network composed of two or more fully-connected layers. It only requires g_Ŵ as input:

\text{FCGrad:}\quad M_\phi(g_{\hat{W}}) = \mathrm{FCs}(\phi, \sigma, g_{\hat{W}}), \quad (9)

where σ represents the nonlinear activation. Previous successful experimental results with STE suggest that a good g_W̃ should be generated by taking the value of W̃ into account. Based on this observation, we construct another two architectures of the meta quantizer with W̃ fed as input, and multiply the output of these networks with g_Ŵ to incorporate the gradient information from the subsequent step. Specifically, one is based on fully-connected (FC) layers:

\text{MultiFC:}\quad M_\phi(g_{\hat{W}}, \tilde{W}) = g_{\hat{W}} \cdot \mathrm{FCs}(\phi, \sigma, \tilde{W}). \quad (10)

The other incorporates an LSTM and FC layers to construct M, inspired by [1], which uses a memory-based neural network as the meta learner:

\text{LSTMFC:}\quad M_\phi(g_{\hat{W}}, \tilde{W}) = g_{\hat{W}} \cdot \mathrm{FCs}\!\left(\phi_{FCs}, \sigma, \mathrm{LSTM}(\phi_{LSTM}, \tilde{W})\right). \quad (11)

When using an LSTM in the meta quantizer, each weight coordinate keeps track of the hidden states generated by the LSTM, which contain the memory of historical information of g_Ŵ and W̃.
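A MultiFC-style meta quantizer as in (10) can be sketched coordinate-wise in plain NumPy. The hidden size of 100 and the absence of a non-linearity follow the experimental setup described later; the initialization scheme is our own assumption, and a real implementation would train these weights by backpropagation through (6)-(8):

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiFC:
    """Coordinate-wise MultiFC meta quantizer (Eq. (10)), sketched in NumPy:
    a 2-layer FC net maps each W_tilde coordinate to a scalar, which then
    scales the incoming gradient g_q. Initialization is our own choice."""
    def __init__(self, hidden=100):
        self.W1 = rng.normal(0.0, 0.1, (1, hidden))
        self.W2 = rng.normal(0.0, 0.1, (hidden, 1))

    def __call__(self, g_q, w_tilde):
        x = w_tilde.reshape(-1, 1)                    # coordinates as a batch
        h = x @ self.W1                               # no non-linearity
        scale = (h @ self.W2).reshape(w_tilde.shape)
        return g_q * scale                            # g_tilde = g_q * FCs(W_tilde)
```

Because the same small network is shared across all layers and all coordinates, its parameter count is independent of the base model's size.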
The meta quantizer's memory consumption and detailed hyper-parameters are studied in Appendix 7.1 and 7.3.

4.4 Algorithm and Implementation Details

The detailed process of MetaQuant is illustrated in Algorithm 1. A shared meta quantizer M_φ is first constructed and randomly initialized. During each training iteration, lines 2-6 describe the forward process: for each layer, g_Ŵ and W̃ from the previous iteration are fed into M_φ to generate the meta gradient g_W̃ used to perform inference, as indicated in lines 3-5. Since g_Ŵ is not available in the first iteration, normal quantization training is conducted then: Ŵ = Q(W̃) = Q[A(W)] replaces line 4. Lines 7-9 show the backward process: Ŵ's gradient is obtained through error backpropagation, as shown in line 7. During the backward process, g_Ŵ and W̃ of the current iteration are obtained, and their outputs from M_φ are saved for the computation in the next iteration, denoted by g_W̃^{t+1}, as described in lines 7-8. By incorporating M_φ into the inference graph, its gradient is obtained in line 9. Finally, g_W̃ is used to calculate g_W, which is then processed by the chosen optimization method via π(·), leading to the update of W shown in lines 10-12. In the first iteration, due to the lack of g_W̃, the weight update of W is not conducted.
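The late-update schedule described above can be made concrete with a toy single-layer run, heavily simplified under our own assumptions: BWN pre-processing (so the calibration term is 1), SGD refinement (π is the identity), a scalar linear model with squared loss, and an STE-like clipping function standing in for the meta quantizer, which MetaQuant would instead learn:

```python
import numpy as np

def run_metaquant_sgd(steps=8, lr=0.1):
    """Toy single-layer MetaQuant loop (our simplification of Algorithm 1):
    skip the weight update in the first iteration, then update W with the
    previous iteration's meta gradient before re-quantizing."""
    rng = np.random.default_rng(0)
    W = rng.normal(0.0, 1.0, 4)          # full-precision weights
    x = rng.normal(0.0, 1.0, 4)          # one training example
    y = 1.0
    meta_fn = lambda g_q, wt: g_q * (np.abs(wt) <= 1.0)  # STE stand-in for M_phi
    g_q = None
    losses = []
    for _ in range(steps):
        if g_q is not None:              # late weight update (skipped at t = 0)
            W = W - lr * meta_fn(g_q, W)     # BWN: calibration = 1, pi = identity
        w_hat = np.abs(W).mean() * np.sign(W)  # BWN forward quantization
        pred = float(w_hat @ x)
        losses.append((pred - y) ** 2)
        g_q = 2 * (pred - y) * x         # gradient of the loss w.r.t. w_hat
    return losses
```

The loop illustrates the bookkeeping only; with binary weights the loss need not decrease monotonically, and the learned M_φ is precisely what makes the update direction useful in practice.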
Note that φ of the meta quantizer is updated in line 13.

Algorithm 1 MetaQuant
Require: Training dataset {x, y}^n, well-trained full-precision base model W.
Ensure: Quantized base model Ŵ.
1: Construct shared meta quantizer M_φ, training iteration t = 0.
2: while not optimal do
3:   for layer l from 1 to L do
4:     Ŵ_l^t = Q(W̃_l^t) = Q[ A( W_l^{t-1} − α × π( M_φ(g_{Ŵ_l}^{t-1}, W̃_l^{t-1}) · ∂W̃_l^{t-1}/∂W_l^{t-1} ) ) ]
5:   end for
6:   Calculate loss: ℓ = Loss( f[ Q(W̃^t); x ], y )
7:   Generate g_{Ŵ^t} using chain rules.
8:   Calculate meta gradient g_{W̃^t} using M_φ.
9:   Calculate ∂ℓ/∂φ^t by (8).
10:  for layer l from 1 to L do
11:    W_l^t = W_l^{t-1} − α × π( M_φ(g_{Ŵ_l}^{t-1}, W̃_l^{t-1}) · ∂W̃_l^{t-1}/∂W_l^{t-1} )
12:  end for
13:  φ^{t+1} = φ^t − γ × ∂ℓ/∂φ^t (γ is the learning rate of the meta quantizer)
14:  t = t + 1
15: end while

5 Experiment

5.1 Experiment Setup

MetaQuant focuses on the penetration of the non-differentiable quantization function in training-based methods. We conduct comparison experiments with STE under two forward quantization methods, 1) dorefa [17] and 2) BWN [13], and two optimization methods, 1) SGD and 2) Adam [8]. When quantization training is conducted with dorefa or BWN as the forward quantization function and STE as the backward method, it becomes a weight-quantization version of [17] or the method proposed in [13], respectively. Three benchmark datasets are used: ImageNet ILSVRC-2012 and CIFAR10/100.
Regarding deep architectures, we experiment with ResNet20/32/44 on CIFAR10. Since CIFAR10/100 share the same input dimensions, we modify the output dimension of the last fully-connected layer from 10 to 100 in ResNet56/110 for CIFAR100. For ImageNet, ResNet18 is used for comparison. In all experiments, every layer in the networks is quantized using 1 bit: each layer contains only 2 values. For the experiments on CIFAR10/100, we set the initial learning rate to α = 1e−3 for the base models and to γ = 1e−3 for the meta quantizer. For fair comparison, we set the total number of training epochs to 100 for all experiments; α and γ are divided by 10 after every 30 epochs. For ImageNet, the initial learning rate is set to α = 1e−4 for the base model using dorefa and BWN, and the initial γ is set to 1e−3. α decreases to {1e−5, 1e−6} at epochs 10 and 20, and γ is reduced to {1e−4, 1e−5} accordingly, with 30 epochs in total. The batch size is 128 for both CIFAR and ImageNet. All experiments are run 5 times, and the statistics of the test accuracy over the last 10 (CIFAR) / 5 (ImageNet) epochs are reported as the performance of both the proposed and baseline methods. We also compare the empirical convergence speed of the different methods through training loss curves. The detailed hyper-parameters of the different realizations of MetaQuant in the CIFAR experiments are as follows. In MultiFC, a 2-layer fully-connected network is used with hidden size 100 and no non-linear activation. In LSTMFC, a 1-layer LSTM and a fully-connected layer are used, with the hidden dimension set to 100.
In FCGrad, a 2-layer fully-connected meta model is used with hidden size 100 and no non-linear activation. In the ImageNet experiments, we use MultiFC/FCGrad with 2-/1-layer fully-connected networks, whose hidden dimension is 100.

5.2 Experimental Results and Analysis

Table 1: Experimental results of MetaQuant and STE using dorefa and BWN on CIFAR10. Test accuracy (%) is reported as mean (std); "-" denotes a setting that was not run.

Network (FP Acc %) | Forward | Optim. | STE           | MultiFC       | LSTMFC        | FCGrad
ResNet20 (91.5)    | dorefa  | SGD    | 80.745(2.113) | 88.942(0.466) | 88.305(0.810) | 88.840(0.291)
ResNet20 (91.5)    | dorefa  | Adam   | 89.782(0.172) | 89.941(0.068) | 89.979(0.103) | 89.962(0.068)
ResNet20 (91.5)    | BWN     | SGD    | 75.913(3.495) | -             | 89.289(0.212) | 88.949(0.231)
ResNet20 (91.5)    | BWN     | Adam   | 89.896(0.182) | -             | 90.036(0.109) | 90.042(0.098)
ResNet32 (92.13)   | dorefa  | SGD    | 82.911(1.680) | 89.637(0.380) | 90.397(0.149) | 89.934(0.246)
ResNet32 (92.13)   | dorefa  | Adam   | 90.172(0.077) | 90.966(0.064) | 90.948(0.074) | 90.976(0.068)
ResNet32 (92.13)   | BWN     | SGD    | 79.768(2.062) | -             | 90.568(0.169) | 90.241(0.316)
ResNet32 (92.13)   | BWN     | Adam   | 91.015(0.087) | -             | 91.002(0.077) | 91.034(0.067)
ResNet44 (93.56)   | dorefa  | SGD    | 86.686(1.020) | 90.546(0.218) | 91.494(0.163) | 91.539(0.097)
ResNet44 (93.56)   | dorefa  | Adam   | 91.079(0.064) | 91.772(0.073) | 91.870(0.022) | 91.989(0.067)
ResNet44 (93.56)   | BWN     | SGD    | 82.647(0.334) | -             | 91.498(0.057) | 91.614(0.081)
ResNet44 (93.56)   | BWN     | Adam   | 91.121(0.023) | -             | 91.498(0.271) | 92.107(0.059)

Table 1 shows the overall experimental results on CIFAR10 for MetaQuant and STE using different forward quantization methods and optimizations.
Variants of MetaQuant show significant improvement over the STE baseline, especially when SGD is used.
CIFAR100 is a more difficult task than CIFAR10, containing much more fine-grained classes (100 in total). Table 2 shows the overall experimental results on CIFAR100 for MetaQuant and STE using different forward quantization methods and optimizations. Similar to CIFAR10, MetaQuant outperforms STE by a large margin in all cases, showing that MetaQuant also yields significant improvement over traditional methods in more challenging tasks.

5.3 Empirical Convergence Analysis

In this experiment, we compare the performance of the MetaQuant variants and STE during training to demonstrate their convergence speeds. ResNet20 with dorefa is used as an example. As Fig. 3 shows, under the same task and forward quantization method, MetaQuant has a tremendous convergence advantage over STE with SGD, including a much faster descent of the loss and clearly lower loss values.
Table 2: Experimental results of MetaQuant and STE using dorefa and BWN on CIFAR100. Test accuracy (%) is reported as mean (std); "-" denotes a setting that was not run.

Network (FP Acc %) | Forward | Optim. | STE            | MultiFC       | LSTMFC        | FCGrad
ResNet56 (71.22)   | dorefa  | SGD    | 42.265(8.143)  | 65.791(0.415) | 63.645(2.183) | 64.351(0.935)
ResNet56 (71.22)   | dorefa  | Adam   | 66.419(0.533)  | 66.588(0.375) | 66.483(0.793) | 66.564(0.351)
ResNet56 (71.22)   | BWN     | SGD    | 34.479(11.737) | -             | 63.346(2.253) | 64.402(1.434)
ResNet56 (71.22)   | BWN     | Adam   | 64.297(1.309)  | -             | 66.584(0.349) | 67.018(0.329)
ResNet110 (72.54)  | dorefa  | SGD    | 43.419(18.902) | 68.269(0.136) | 64.753(2.850) | 66.145(2.490)
ResNet110 (72.54)  | dorefa  | Adam   | 66.836(1.198)  | 68.418(0.235) | 67.138(1.286) | 68.741(0.363)
ResNet110 (72.54)  | BWN     | SGD    | 35.227(19.408) | -             | 66.242(2.979) | 64.791(4.096)
ResNet110 (72.54)  | BWN     | Adam   | 66.265(1.429)  | -             | 67.767(1.391) | 69.114(0.181)

Table 3: Experimental results of MetaQuant and STE using dorefa and BWN on ImageNet (ResNet18, FP Top1/Top5: 69.76/89.08, Adam).

Forward | Backward | Quant Top1/Top5 (%)
dorefa  | STE      | 58.349(2.072)/81.477(1.567)
dorefa  | MultiFC  | 59.472(0.025)/82.410(0.010)
dorefa  | FCGrad   | 59.835(0.359)/82.671(0.232)
BWN     | STE      | 59.503(0.835)/82.549(0.506)
BWN     | FCGrad   | 60.328(0.391)/83.025(0.234)

Figure 3: Convergence speed of MetaQuant vs. STE using SGD/Adam on ResNet20, CIFAR10, dorefa. (a) SGD; (b) Adam.

In Adam, although all the methods show a similar decreasing speed, the MetaQuant methods finally reach lower loss values, which is also reflected in the test accuracy reported in Table 1. Overall, MetaQuant shows better convergence than STE under different forward quantizations and optimizations.
The improvement is more obvious when SGD is chosen. We conjecture that the performance difference between SGD and Adam is due to the following reason: SGD simply updates the full-precision weights using the calibrated gradient from g_W̃, which directly reflects the output of the meta quantizer compared to STE. Adam aggregates the historical information of g_W and normalizes the current gradient, which to a certain degree shrinks the difference between the meta quantizer and STE. More comparisons of training and test accuracy on further tasks are listed in Appendix 7.4.

5.4 Performance Comparison with Non-STE Training-based Quantization

Table 4: Experimental results of MetaQuant vs. ProxQuant, LAB, ELQ and TTQ. The accuracy drop (%) after quantization is reported (smaller is better).

Network        | Method    | Acc Drop (%)
ResNet20       | ProxQuant | 1.29
ResNet20       | MetaQuant | 0.7
ResNet32       | ProxQuant | 1.28
ResNet32       | MetaQuant | 0.39
ResNet44       | ProxQuant | 0.99
ResNet44       | MetaQuant | 0.08
LABNet         | LAB       | 1.4
LABNet         | MetaQuant | -0.2
ResNet18       | ELQ       | 3.55/2.65
ResNet18       | MetaQuant | 6.32/4.31
ResNet18-2bits | TTQ [18]  | 3.00/2.00
ResNet18-2bits | MetaQuant | 5.17/3.59

MetaQuant aims at improving training-based quantization by learning better gradients for the penetration of non-differentiable quantization functions. Some advanced quantization methods avoid discrete quantization. In this section, we compare MetaQuant with Non-STE training-based quantization methods, ProxQuant ([2]) and LAB ([5]), to demonstrate that traditional STE training-based quantization is able to achieve better performance by using MetaQuant.
Due to differences in the initial full-precision models used, we only report the performance drop in test accuracy after quantization (the smaller the better). We compare MetaQuant with ProxQuant using ResNet20/32/44, and with LAB using its proposed architecture3, on CIFAR10 with all layers quantized to binary values.
As shown in Table 4, MetaQuant shows better performance than both baselines.
ELQ [16] and TTQ [18] are compared in the third row of Table 4 using the ImageNet dataset. Although ELQ outperforms MetaQuant, it combines a series of previous quantization methods and tricks based on incremental quantization, whereas MetaQuant focuses on improving STE-based training quantization without any extra loss terms or training tricks. TTQ is a non-symmetric ternarization with {0, α, −β} as ternary points; MetaQuant follows dorefa in using a symmetric quantization, which leads to more efficient inference.

5.5 MetaQuant Training Analysis

Training MetaQuant involves the extra computation of training the meta quantizer. To analyze the additional training time, we measure the training time per iteration for MetaQuant (using MultiFC and dorefa) and for STE, on ResNet20 with CIFAR10 (Intel Xeon CPU E5-1650 with a GeForce GTX 750 Ti). MetaQuant costs 51.15 seconds to finish one iteration of training while the baseline method uses 38.17s. However, since the meta quantizer is removed in real deployment, MetaQuant is able to provide better test performance without any extra inference time.

6 Conclusion

In this paper, we propose a novel method (MetaQuant) that learns the gradient used to penetrate the non-differentiable quantization function in training-based quantization via a meta quantizer. This meta network is general enough to be incorporated into various base models and can be updated using the loss of the base models.
We propose three types of meta quantizer and show that the meta gradients generated by these modules provide better convergence speed and final quantization performance under different forward quantization functions and optimization methods.

3 (2x128C3)-MP2-(2x256C3)-MP2-(2x512C3)-MP2-(2x1024FC)-10FC

Acknowledgement

This work is supported by NTU Singapore Nanyang Assistant Professorship (NAP) grant M4081532.020, and Singapore MOE AcRF Tier-2 grant MOE2016-T2-2-06.

References

[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.

[2] Yu Bai, Yu-Xiang Wang, and Edo Liberty. ProxQuant: Quantized neural networks via proximal operators. In International Conference on Learning Representations, 2019.

[3] Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave Gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5918–5926, 2017.

[4] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.

[5] Lu Hou, Quanming Yao, and James T. Kwok. Loss-aware binarization of deep networks. In International Conference on Learning Representations, 2017.

[6] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[7] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks.
In Advances in Neural Information Processing Systems, pages 4107–4115, 2016.

[8] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[10] Cong Leng, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely low bit neural network: Squeeze the last bit out with ADMM. In AAAI Conference on Artificial Intelligence, 2017.

[11] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. In International Workshop on Efficient Methods for Deep Neural Networks, Advances in Neural Information Processing Systems, 2016.

[12] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision, pages 722–737, 2018.

[13] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[14] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[15] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. In International Conference on Learning Representations, 2017.

[16] Aojun Zhou, Anbang Yao, Kuan Wang, and Yurong Chen. Explicit loss-error-aware quantization for low-bit deep neural networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9426–9435, 2018.

[17] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[18] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. In International Conference on Learning Representations, 2016.