{"title": "In-Place Zero-Space Memory Protection for CNN", "book": "Advances in Neural Information Processing Systems", "page_first": 5734, "page_last": 5743, "abstract": "Convolutional Neural Networks (CNN) are being actively explored for safety-critical applications such as autonomous vehicles and aerospace, where it is essential to ensure the reliability of inference results in the presence of possible memory faults. Traditional methods such as error correction codes (ECC) and Triple Modular Redundancy (TMR) are CNN-oblivious and incur substantial memory overhead and energy cost. This paper introduces in-place zero-space ECC assisted with a new training scheme weight distribution-oriented training. The new method provides the first known zero space cost memory protection for CNNs without compromising the reliability offered by traditional ECC.", "full_text": "In-Place Zero-Space Memory Protection for CNN\n\nHui Guan1, Lin Ning1, Zhen Lin1, Xipeng Shen1, Huiyang Zhou1, Seung-Hwan Lim2\n\n1North Carolina State University, Raleigh, NC, 27695\n2Oak Ridge National Laboratory, Oak Ridge, TN 37831\n\n{hguan2, lning, zlin4, xshen5, hzhou}@ncsu.edu, lims1@ornl.gov\n\nAbstract\n\nConvolutional Neural Networks (CNN) are being actively explored for safety-\ncritical applications such as autonomous vehicles and aerospace, where it is essen-\ntial to ensure the reliability of inference results in the presence of possible memory\nfaults. Traditional methods such as error correction codes (ECC) and Triple Modu-\nlar Redundancy (TMR) are CNN-oblivious and incur substantial memory overhead\nand energy cost. This paper introduces in-place zero-space ECC assisted with\na new training scheme weight distribution-oriented training. 
The new method provides the first known zero-space-cost memory protection for CNNs without compromising the reliability offered by traditional ECC.

1 Introduction

As CNNs are increasingly explored for safety-critical applications such as autonomous vehicles and aerospace, the reliability of CNN inference is becoming an important concern. A key threat is memory faults (e.g., bit flips in memory), which may result from environment perturbations, temperature variations, voltage scaling, manufacturing defects, wear-out, and radiation-induced soft errors. These faults change the stored data (e.g., CNN parameters), which may cause large deviations in the inference results [13, 19, 20]. In this work, fault rate is defined as the ratio between the number of bit flips experienced before correction is applied and the total number of bits.

Existing solutions have resorted to general memory fault protection mechanisms, such as Error Correction Codes (ECC) hardware [24], spatial redundancy, and radiation hardening [28]. Being CNN-oblivious, these protections incur large costs. ECC, for instance, uses eight extra bits to protect 64-bit memory; spatial redundancy requires at least two copies of CNN parameters to correct one error (called Triple Modular Redundancy (TMR) [14]); radiation hardening is subject to substantial area overhead and hardware cost. The spatial, energy, and hardware costs are especially concerning for safety-critical CNN inferences; as they often execute on resource-constrained (mobile) devices, the costs worsen the limit on model size and capacity, and increase the cost of the overall AI solution.

To address the fundamental tension between the need for reliability and the need for space/energy/cost efficiency, this work proposes the first zero-space-cost memory protection for CNNs.
The design capitalizes on the opportunities brought by the distinctive properties of CNNs. It further amplifies the opportunities by introducing a novel training scheme, Weight Distribution-Oriented Training (WOT), to regularize the weight distributions of CNNs such that they become more amenable to zero-space protection. It then introduces a novel protection method, in-place zero-space ECC, which removes all space cost of ECC protection while preserving protection guarantees. Experiments on VGG16, ResNet18, and SqueezeNet validate the effectiveness of the proposed solution. Across all tested scenarios, the method provides protection consistently comparable to that offered by existing hardware ECC logic, while removing all space costs. It hence offers a promising replacement of existing protection schemes for CNNs.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Related Work

There are some early studies on fault tolerance of earlier neural networks (NN) [16, 17, 26]; they examined the performance degradation of NNs with various fault models on networks that differ from modern CNNs in both network topologies and model complexities.

Fault tolerance of deep neural networks (DNN) has recently drawn increasing attention. Li et al. [13] studied soft error propagation in DNN accelerators and proposed to leverage symptom-based error detectors for detecting errors and a hardware-based technique, selective latch hardening, for detecting and correcting data-path faults. A recent work [3, 19] conducted empirical studies to quantify the fault tolerance of DNNs to memory faults and revealed that DNN fault tolerance varies with respect to model, layer type, and structure. Zhang et al. [30] proposed fault-aware pruning with retraining to mitigate the impact of permanent faults for systolic array-based CNN accelerators (e.g., TPUs).
They focused only on faults in the data-path and ignored faults in the memory. Qin et al. [18] studied the performance degradation of 16-bit quantized CNNs under different bit flip rates and proposed to set the values of detected erroneous weights to zero to mitigate the impact of faults. These prior works focused mainly on the characterization of DNNs' fault tolerance with respect to various data types and network topologies. While several software-based protection solutions were explored, they are preliminary. Some can only detect but not correct errors (e.g., detecting extreme values [13]); others have limited protection capability (e.g., setting faulty weights to zero [18]).

Some prior work proposes designs of energy-efficient DNN accelerators by exploiting the fault tolerance of DNNs [12, 25, 29]. An accelerator design [20] optimizes SRAM power by reducing the supply voltage. It leverages active hardware fault detection coupled with bit masking that shifts data towards zero to mitigate the impact of bit flips on DNNs' model accuracy without the need for re-training. Similar hardware fault detection techniques are later exploited in [7, 22, 27, 29] to improve the fault tolerance of DNNs. Azizimazreah et al. [4] proposed a novel memory cell designed to eliminate soft errors while achieving low power consumption. These designs are for special accelerators rather than general DNN reliability protection. They are still subject to various costs and offer no protection guarantees as existing ECC protections do. The current work aims to reduce the space cost of protection to zero without compromising the reliability of existing protections.

3 Premises and Scopes

This work focuses on protecting 8-bit quantized CNN models.
On the one hand, although the optimal bit width for a network depends on its weight distribution and might be lower than 8, we have observed that 8-bit quantization is a prevalent, robust, and general choice to reduce model size and latency while preserving accuracy. In our experiments, both activations and weights are quantized to 8 bits. Existing libraries that support quantized CNNs (e.g., NVIDIA TensorRT [15], Intel MKL-DNN [1], Google's GEMMLOWP [10], Facebook's QNNPACK [2]) mainly target fast operators using 8-bit rather than lower bit widths. On the other hand, previous studies [13, 19] have suggested that CNNs should use data types that provide just-enough numeric value range and precision to increase their fault tolerance. Our explorations on using higher precision, including float32, for representing CNN parameters also show that 8-bit quantized models are the most resilient to memory faults.

The quantization algorithm we used is symmetric range-based linear quantization, which is well supported by major CNN frameworks (e.g., Tensorflow [11], Pytorch [32]). Specifically, let X be a floating-point tensor and X^q be its 8-bit quantized version. X can be either weights or activations from a CNN. The quantization is based on the following formula:

X^q = round(X * (2^(n-1) - 1) / max{|X|}),    (1)

where n is the number of bits used for quantization. In our case, n = 8. The number of bits used for accumulation is 32. Biases, if they exist, are quantized to 32-bit integers.

Our work protects only weights, for two reasons. Firstly, weights are usually kept in the memory. The longer they are kept, the higher the number of bit flips they will suffer from. This easily results in a high fault rate (e.g., 1e-3) for weights. Activations, however, are useful only during an inference process.
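As an illustration, Eq. 1 can be sketched in a few lines (a minimal NumPy sketch; the function names are ours, not from any of the frameworks above):

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, n_bits: int = 8) -> np.ndarray:
    """Symmetric range-based linear quantization (Eq. 1): scale by
    (2^(n-1) - 1) / max|X| and round to the nearest integer."""
    scale = (2 ** (n_bits - 1) - 1) / np.max(np.abs(x))
    return np.round(x * scale).astype(np.int8)

def dequantize(xq: np.ndarray, max_abs: float, n_bits: int = 8) -> np.ndarray:
    """Approximate inverse: map quantized integers back to floats."""
    return xq * (max_abs / (2 ** (n_bits - 1) - 1))

w = np.array([-0.5, 0.1, 0.25, 0.5])
wq = quantize_symmetric(w)   # max|w| = 0.5 is mapped to 127
```

Note that with this symmetric scheme the used int8 range is [-127, 127]; -128 is never produced, which keeps the quantization grid symmetric around zero.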
Given the slight chance of having a bit flip during an inference process (usually in milliseconds), protecting activations is not as pressing as protecting weights. Secondly, previous work [19] has shown that activations are much less sensitive to faults compared with weights.

Table 1: Accuracy and weight distribution of 8-bit quantized CNN models on ImageNet. The percentage rows use absolute values.

| Model | AlexNet | VGG16 | VGG16_bn | Inception_V3 | ResNet18 | ResNet34 | ResNet50 | ResNet152 | SqueezeNet |
| #weights | 61.1M | 138.4M | 138.4M | 27.1M | 11.7M | 21.8M | 25.5M | 60.1M | 1.2M |
| Accuracy (%) Float32 | 56.52 | 71.59 | 73.36 | 69.54 | 69.76 | 73.31 | 76.13 | 78.31 | 58.09 |
| Accuracy (%) Int8 | 55.8 | 71.51 | 72.01 | 68.07 | 69.07 | 72.83 | 75.33 | 77.79 | 57.01 |
| Percentage (%) [0, 32) | 95.09 | 97.69 | 98.83 | 97.98 | 99.66 | 99.76 | 99.65 | 99.49 | 95.16 |
| Percentage (%) [32, 64) | 4.88 | 2.27 | 1.16 | 1.96 | 0.32 | 0.23 | 0.34 | 0.49 | 4.62 |
| Percentage (%) [64, 128] | 0.03 | 0.04 | 0.01 | 0.06 | 0.02 | 0.01 | 0.01 | 0.01 | 0.22 |

Error Correction Codes (ECC) are commonly used in computer systems to correct memory faults. They are usually described as a (k, d, t) code for a length-k code word, length-d data, and t-bit error correction. The number of required check bits is k - d.

4 In-Place Zero-Space ECC

Our proposed method, in-place zero-space ECC, builds on the following observation: weights of a well-trained CNN are mostly small values. The Percentage rows in Table 1 show the distributions of the absolute values of weights in some popular 8-bit quantized CNN models. The absolute values of more than 99% of the weights are less than 64. Even though eight bits are used to represent each weight, if we already know that the absolute value of a weight is less than 64, the number of effective bits to represent the value would be at most seven, and the remaining bit could possibly be used for other purposes, such as error correction.
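To see why such a bit is redundant, note that in two's complement an int8 value in [-64, 63] always has its bit 6 equal to its sign bit (bit 7). A small sketch (plain Python, viewing each weight as a byte; the helper name is ours):

```python
def bit(v: int, i: int) -> int:
    """The i-th bit of v's two's-complement byte representation."""
    return (v & 0xFF) >> i & 1

# For every int8 weight w in [-64, 63], bit 6 duplicates the sign bit,
# so it carries no information and can hold an ECC check bit instead.
for w in range(-64, 64):
    assert bit(w, 6) == bit(w, 7)

# Outside that range the two bits differ, so bit 6 is informative.
assert bit(64, 6) != bit(64, 7) and bit(-65, 6) != bit(-65, 7)
```

Restoring the weight once the check bit has served its purpose is then a single copy of the sign bit back into bit 6, which is exactly the recovery step described in Section 4.2.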
We call it a non-informative bit.

The core idea of in-place zero-space ECC is to use non-informative bits in CNN parameters to store error check bits. For example, the commonly used SEC-DED (64, 57, 1) code uses seven check bits to protect 57 data bits for single error correction; together they form a 64-bit code word. If seven out of eight consecutive weights are in the range [-64, 63], we then have seven non-informative bits, one per small weight. The essential idea of in-place ECC is to use these non-informative bits to store the error check bits for the eight weights. By embedding the check bits into the data, it hence avoids all space cost.

For in-place ECC to work, there cannot be more than one large weight in every 8 consecutive weights. The implementation also has to record the locations of the large weights such that the decoding step can find the error check bits in the data. It is, however, important to note that the requirement of recording the locations of large weights would disappear if the large weights were regularly distributed in the data; an example is that the only place in which a large weight could appear is the last byte of an 8-byte block. However, the distributions of large weights in CNNs are close to uniform, as Figure 1 shows.

Figure 1: Large weight (beyond [-64, 63]) distributions in 8-byte (64-bit data) blocks for SqueezeNet on ImageNet. For instance, the first bar in (a) shows that of all the 8-byte data blocks storing weights, around 380 have a large weight at the first byte.

4.1 WOT

To eliminate the need of storing large weight locations in in-place ECC, we enhance our design by introducing a new training scheme, namely weight distribution-oriented training (WOT). WOT aims to regularize the spatial distribution of large weights such that large values can appear only at specific places.
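The per-position counts plotted in Figure 1 can be gathered with a short sketch (NumPy; `large_weight_position_histogram` is our own helper name, applied to any flattened int8 weight vector):

```python
import numpy as np

def large_weight_position_histogram(wq: np.ndarray) -> np.ndarray:
    """For each of the 8 byte positions in consecutive 8-weight (64-bit)
    blocks, count how many weights fall outside [-64, 63]."""
    blocks = wq[: wq.size // 8 * 8].reshape(-1, 8)   # keep full blocks only
    large = (blocks < -64) | (blocks > 63)
    return large.sum(axis=0)                         # one count per position

rng = np.random.default_rng(0)
wq = rng.integers(-128, 128, size=8000).astype(np.int8)
hist = large_weight_position_histogram(wq)           # near-uniform counts here
```

On real CNN weights the bars come out close to uniform, which is the observation that motivates WOT below.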
We first formalize the WOT problem and then elaborate our regularized training process.

Let W_l be the float32 parameters (including both weights and biases) in the l-th convolutional layer and W_l^q be their values after quantization. Note that WOT applies to fully-connected layers as well even though our discussion focuses on convolutional layers. WOT minimizes the sum of the standard cross entropy loss f({W_l^q}_{l=1}^L) and a weighted weight regularization loss (Frobenius norm with the hyperparameter λ) subject to weight distribution constraints on the weights:

min_{W_l^q}  f({W_l^q}_{l=1}^L) + λ Σ_{l=1}^L ||W_l^q||_F^2,    (2)
s.t.  W_l^q ∈ S_l,  l = 1, ..., L.    (3)

The weights form a four-dimensional tensor. If flattened, it is a vector of length N_l × C_l × H_l × W_l, where N_l, C_l, H_l, and W_l are respectively the number of filters, the number of channels in a filter, the height of the filter, and the width of the filter in the l-th convolutional layer. WOT adds constraints to each 64-bit data block in the flattened weight vectors. Recall that, for in-place ECC to protect a 64-bit data block, we need seven non-informative bits (i.e., seven small weights in the range [-64, 63]) to store the seven check bits. To regularize the positions of large values in the weights, the constraint on the weights in the l-th convolutional layer can be given by S_l = {X | the first seven values in every 64-bit data block of X lie in the range [-64, 63]}.

We next describe two potential solutions to the optimization problem.

ADMM-based Training    The above optimization problem can be formulated in the Alternating Direction Method of Multipliers (ADMM) framework and solved in a way similar to an earlier work [31]. The optimization problem (Eq.
2) is equivalent to:

min_{W_l^q}  f({W_l^q}_{l=1}^L) + λ Σ_{l=1}^L ||W_l^q||_F^2 + Σ_{l=1}^L g_l(W_l^q),    (4)

where g_l(W_l^q) = 0 if W_l^q ∈ S_l, and +∞ otherwise. Rewriting Eq. 4 in the ADMM framework leads to:

min_{W_l^q}  f({W_l^q}_{l=1}^L) + λ Σ_{l=1}^L ||W_l^q||_F^2 + Σ_{l=1}^L g_l(Z_l),    (5)
s.t.  W_l^q = Z_l,  l = 1, ..., L.    (6)

ADMM alternates between the optimization of the model parameters {W_l^q}_{l=1}^L and the auxiliary variables {Z_l}_{l=1}^L by repeating the following three steps for k = 1, 2, ...:

{W_l^{q,k+1}}_{l=1}^L = argmin_{{W_l^q}}  f({W_l^q}_{l=1}^L) + λ Σ_{l=1}^L ||W_l^q||_F^2 + γ Σ_{l=1}^L ||W_l^q - Z_l^k + U_l^k||_F^2,    (7)

{Z_l^{k+1}}_{l=1}^L = argmin_{{Z_l}}  Σ_{l=1}^L g_l(Z_l) + γ Σ_{l=1}^L ||W_l^{q,k+1} - Z_l + U_l^k||_F^2,    (8)

U_l^{k+1} = U_l^k + W_l^{q,k+1} - Z_l^{k+1},    (9)

until the two conditions are met: ||W_l^{q,k+1} - Z_l^{k+1}||_F^2 ≤ ε and ||Z_l^{k+1} - Z_l^k||_F^2 ≤ ε.

Problem 7 can be solved using stochastic gradient descent (SGD) as its objective function is differentiable. The optimal solution to problem 8 is the projection of W_l^{q,k+1} + U_l^k onto the set S_l. In the implementation, we set a value in a 64-bit data block to 63 or -64 if the value is not in the eighth position and is larger than 63 or smaller than -64.

Previous work has successfully applied the ADMM framework to CNN weight pruning [31] and CNN weight quantization [21] and shown remarkable compression results.
But when it is applied to our problem, experiments show that ADMM-based training cannot help reduce the number of large values in the first seven positions of a 64-bit data block. Moreover, as the ADMM-based training cannot guarantee that the constraint in Eq. 3 is satisfied, it is necessary to bound the remaining large quantized values in the first seven positions to 63 or -64 after the training, resulting in large accuracy drops. Instead of ADMM-based training, WOT adopts the alternative approach described below.

Figure 2: Hardware design for in-place zero-space ECC protection.

QAT with Throttling (QATT)    Our empirical explorations indicate that a simple quantization-aware training (QAT) procedure combined with weight throttling can make the weights meet the constraint without jeopardizing the accuracy of an 8-bit quantized model. The training process iterates the following major steps for each batch:

1. QAT: It involves forward-propagation using quantized parameters ({W_l^q}_{l=1}^L and {b_l^q}_{l=1}^L) to get the loss defined in Equation 2, back-propagation using quantized parameters, an update step that applies float32 gradients to update the float32 parameters ({W_l}_{l=1}^L and {b_l}_{l=1}^L), and a quantization step that gets the new quantized parameters from their float32 versions.

2. Throttling: It forces the quantized weights to meet the constraint defined in Eq. 3: if any value in the first seven bytes of a 64-bit data block is larger than 63 (or less than -64), set the value to 63 (or -64). The float32 versions are updated accordingly.

After the training, all of the values in the first seven positions of a 64-bit data block are ensured to be within the range [-64, 63], eliminating the need of storing large value positions for the in-place ECC.
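The throttling step itself is a simple clamp over each 8-weight block. A sketch (NumPy; `throttle` is our own function name, and in the real training loop the clamped values would also be written back to the float32 copies, as described above):

```python
import numpy as np

def throttle(wq: np.ndarray) -> np.ndarray:
    """Clamp the first seven weights of every 8-weight (64-bit) block to
    [-64, 63]; the eighth position may keep any int8 value (Eq. 3)."""
    out = wq[: wq.size // 8 * 8].reshape(-1, 8).copy()
    out[:, :7] = np.clip(out[:, :7], -64, 63)
    return out.reshape(-1)

wq = np.array([70, -100, 1, 2, 3, 4, 5, -128], dtype=np.int8)
throttled = throttle(wq)   # [63, -64, 1, 2, 3, 4, 5, -128]
```

Only the first two values violate the constraint, so only they are moved; the eighth position is left free to hold a large weight.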
It is worth noting that with WOT, all tested CNNs converge without noticeable accuracy loss compared to their 8-bit quantized versions, as Section 5 shows.

4.2 Full Design of In-Place Zero-Space ECC

In this part, we provide the full design of in-place zero-space ECC. For a given CNN, it first applies WOT to regularize the CNN. After that, it conducts in-place error check encoding. The encoding uses the same encoding algorithm as standard error-correction encoding methods do; the difference lies only in where the error check bits are placed.

There are various error-correction encoding algorithms. In principle, our proposed in-place ECC could be generalized to various codes; we focus our implementation on SEC-DED codes for their popularity in existing hardware-based memory protections for CNNs.

Our in-place ECC features the same protection guarantees as the popular SEC-DED (72, 64, 1) code but at zero space cost. The in-place ECC uses the SEC-DED (64, 57, 1) code instead of (72, 64, 1) to protect a 64-bit data block with the same protection strength. It distributes the seven error check bits into the non-informative bits of the first seven weights.

As the ECC check bits are stored in place, a minor extension to the existing ECC hardware is required to support ECC decoding. As shown in Figure 2, the in-place ECC check bits and data bits are swizzled to the right inputs of the standard ECC logic. The output of the ECC logic is then used to recover the original weights: for each small weight (the first seven bytes in an 8-byte data block), simply copy the sign bit to its non-informative bit. As only additional wiring is needed to implement this copy operation, no latency overhead is incurred on the standard ECC logic.

5 Evaluations

We conducted a set of experiments to examine the efficacy of the proposed techniques in fault protection and overhead.
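As a software model of the design in Section 4.2, the sketch below implements a (64, 57, 1) SEC-DED Hamming code and stores its seven check bits in the bit-6 slots of the first seven bytes. The function names are ours; weights are handled as unsigned bytes (the two's-complement view of int8), and the real scheme performs the decoding in ECC hardware rather than software:

```python
def _bits(block):
    """The 57 informative bits: every bit except bit 6 of bytes 0..6."""
    return [(block[b] >> i) & 1 for b in range(8) for i in range(8)
            if not (b < 7 and i == 6)]

def _xor(items):
    r = 0
    for x in items:
        r ^= x
    return r

DATA_POS = [p for p in range(1, 64) if p & (p - 1)]  # non-powers-of-two: 57 slots

def encode_inplace_secded(block):
    """Encode 8 weights (the first seven must be in [-64, 63], so their
    bit 6 is non-informative); return the block with the 7 check bits
    stored in place, i.e. with zero extra space."""
    c = [0] * 64
    for p, bit in zip(DATA_POS, _bits(block)):
        c[p] = bit
    for i in range(6):   # Hamming parity bit 2^i covers positions with bit i set
        c[1 << i] = _xor(c[p] for p in range(1, 64)
                         if p >> i & 1 and p != 1 << i)
    c[0] = _xor(c[1:])   # overall parity enables double-error detection
    check = [c[0]] + [c[1 << i] for i in range(6)]
    return [(block[b] & ~0x40) | (check[b] << 6) if b < 7 else block[b]
            for b in range(8)]

def decode_inplace_secded(stored):
    """Correct up to one flipped bit anywhere in the 64-bit block, then
    restore each small weight's bit 6 from its sign bit."""
    c = [0] * 64
    for p, bit in zip(DATA_POS, _bits(stored)):
        c[p] = bit
    c[0] = stored[0] >> 6 & 1
    for i in range(6):
        c[1 << i] = stored[i + 1] >> 6 & 1
    syndrome = _xor(p for p in range(1, 64) if c[p])
    if syndrome and _xor(c):          # single bit error: flip it back
        c[syndrome] ^= 1
    elif syndrome:
        raise ValueError("double bit error detected")
    out, data = [0] * 8, iter(c[p] for p in DATA_POS)
    for b in range(8):
        for i in range(8):
            if not (b < 7 and i == 6):
                out[b] |= next(data) << i
    return [(out[b] | (out[b] >> 7 & 1) << 6) if b < 7 else out[b]
            for b in range(8)]
```

Flipping any single stored bit (a data bit, an in-place check bit, or the overall parity slot) is corrected transparently, and two flips are detected and reported, mirroring the SEC-DED guarantee of the (72, 64, 1) code it replaces.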
We first describe our experiment settings in Section 5.1 and then report the effects of WOT and the proposed fault protection technique in Sections 5.2 and 5.3.

5.1 Experiment Settings

Models, Datasets, and Machines    The models we used in the fault injection experiments include VGG16 [23], ResNet18 [8], and SqueezeNet [9]. We choose these CNN models as representatives because: 1) VGG is a typical CNN with stacked convolutional layers and is widely used in transfer learning because of its robustness. 2) ResNets are representative of CNNs with modular structure (e.g., the Residual Module) and are widely used in advanced computer vision tasks such as object detection. 3) SqueezeNet has much fewer parameters and represents CNNs that are designed for mobile applications. The accuracies of these models are listed in Table 1. By default, we use the ImageNet dataset [6] (ILSVRC 2012) for model training and evaluation. All the experiments are performed with PyTorch 1.0.1 on machines equipped with a 40-core 2.2GHz Intel Xeon Silver 4114 processor, 128GB of RAM, and an NVIDIA TITAN Xp GPU with 12GB memory. Distiller [32] is used for 8-bit quantization. The CUDA version is 10.1.

Counterparts for Comparisons    We compare our method (denoted as in-place) with the following three counterparts:

• No Protection (faulty): The CNN has no memory protection.
• Parity Zero (zero): It adds one parity bit to detect single bit errors in an eight-bit data block (e.g., a single weight parameter).
Once errors are detected, the weight is set to zero.1
• SEC-DED (ecc): It is the traditional SEC-DED (72, 64, 1) code-based protection in computer systems [24].

There are some previous proposals [4, 20] of memory protections, which are however designed for special CNN accelerators and provide no protection guarantees. Parity and ECC represent the state of the art in the industry for memory protection that works generally across processors and offers protection guarantees; they are hence the counterparts for our comparison.

5.2 WOT results

We evaluate the efficacy of WOT using the CNNs shown in Table 1. All the models are pre-trained on ImageNet (downloaded from TorchVision2). We set λ to 0.0001 for all of the CNNs. Model training uses stochastic gradient descent with a constant learning rate of 0.0001 and momentum 0.9. The batch size is 32 for VGG16_bn and ResNet152, 64 for ResNet50 and VGG16, and 128 for the remaining models. Training stops once the model accuracy after weight throttling reaches that of the 8-bit quantized version.

Figure 3 shows the changes of the total number of large values beyond [-64, 63] in the first seven positions of 8-byte blocks during the training on six of the CNNs. WOT successfully reduces this number from 3,500–80,000 (depending on the model) to near zero before throttling during the training process. The remaining few large values in non-eighth positions are set to -64 or 63 at the end of WOT. Note that VGG16_bn still has around 10,000 large values in the non-eighth positions after 8k iterations. Although more iterations further reduce this number, VGG16_bn can already reach its original accuracy after weight throttling.

The accuracy curves of the models during WOT training are shown in Figure 4. Overall, after WOT training, the original accuracies of all six networks are fully recovered.
During the training, the gap between the accuracy before throttling and after throttling is gradually reduced. For example, the top-1 accuracy of SqueezeNet after 8-bit quantization is 57.01%. After the first iteration of WOT, the accuracy before weight throttling is 31.38% and drops to 11.54% after throttling. WOT increases the accuracy to 57.11% after 46k iterations with batch size 128 (around 4 epochs). All the other CNNs are able to recover their original accuracy in only a few thousand iterations. An exception is VGG16, which reaches an accuracy of 71.50% (only 0.01% accuracy loss) after 20 epochs of training.

1 We have tried setting a detected faulty weight to the average of its neighbors but found it performs worse than Parity Zero.

2 https://pytorch.org/docs/master/torchvision/

Figure 3 (panels (a) AlexNet, (b) VGG16_bn, (c) ResNet18, (d) ResNet34, (e) ResNet50, (f) SqueezeNet): Changes of the total number of large values (beyond [-64, 63]) in the first 7 positions of 8-byte (64-bit data) blocks before the throttling step during the WOT training process.

Figure 4 (panels (a) AlexNet, (b) VGG16_bn, (c) ResNet18, (d) ResNet34, (e) ResNet50, (f) SqueezeNet): Accuracy curves before and after the throttling step during the WOT training process.

5.3 Fault injection results

In this set of experiments, we inject faults into CNN models and report the accuracy drops of the models under different protection strategies. The fault model is random bit flip. Faults are injected into the weights of the CNNs with memory fault rates varying from 10^-9 to 0.001. The number of faulty bits is the product of the number of bits used to represent the weights of a CNN and the memory fault rate. We repeated each fault injection experiment ten times.

Table 2 shows the mean accuracy drops with standard deviations under different memory fault rates and the overheads introduced by the protection strategies for each model.
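The random bit-flip fault model can be sketched as follows (NumPy; `inject_bit_flips` is our own helper, with the number of flips computed exactly as described above):

```python
import numpy as np

def inject_bit_flips(wq: np.ndarray, fault_rate: float, seed: int = 0) -> np.ndarray:
    """Flip round(#weight-bits * fault_rate) uniformly chosen, distinct
    bits in an int8 weight array (random bit-flip fault model)."""
    rng = np.random.default_rng(seed)
    flat = wq.reshape(-1).astype(np.uint8)           # bit-level view, copied
    n_bits = flat.size * 8
    n_faults = int(round(n_bits * fault_rate))
    for t in rng.choice(n_bits, size=n_faults, replace=False):
        flat[t // 8] ^= np.uint8(1 << (t % 8))
    return flat.view(np.int8).reshape(wq.shape)

w = np.zeros(1000, dtype=np.int8)                    # 8000 bits in total
faulty = inject_bit_flips(w, fault_rate=1e-3)        # exactly 8 bits flipped
```

Repeating the injection with different seeds mirrors the ten-trial repetition behind the means and standard deviations in Table 2.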
Overall, the in-place ECC protection and standard SEC-DED show similar accuracy drop patterns under the various fault rate settings, as expected, because they provide the same error correction capability, i.e., correcting a single bit error and detecting double bit errors in a 64-bit data block. Both methods provide stronger fault protection than the Parity Zero method. The space overhead is the ratio between the extra number of bytes introduced by a protection strategy and the number of bytes required to store the weights. Parity Zero and SEC-DED encode 8-byte data with eight extra check bits on average, making their space overhead 12.5%. In contrast, in-place ECC has zero space cost.

Table 2: Accuracy drop of VGG16, ResNet18, and SqueezeNet under different memory fault
rates.

| Model | Strategy | ECC HW (Y/N) | Space Overhead (%) | Drop (%) at 1e-06 | at 1e-05 | at 1e-04 | at 1e-03 |
| VGG16 | faulty | N | 0 | 0.31 ± 0.08 | 0.47 ± 0.09 | 1.35 ± 0.2 | 21.93 ± 5.7 |
| VGG16 | zero | N | 12.5 | 0.27 ± 0.05 | 0.36 ± 0.08 | 0.43 ± 0.13 | 1.04 ± 0.31 |
| VGG16 | ecc | Y | 12.5 | 0.0 ± 0.0 | 0.02 ± 0.02 | 0.35 ± 0.06 | 0.96 ± 0.14 |
| VGG16 | in-place | Y | 0 | 0.0 ± 0.0 | 0.02 ± 0.02 | 0.37 ± 0.07 | 0.93 ± 0.23 |
| ResNet18 | faulty | N | 0 | -0.09 ± 0.1 | 0.35 ± 0.23 | 4.35 ± 1.12 | 72.96 ± 1.48 |
| ResNet18 | zero | N | 12.5 | -0.06 ± 0.08 | -0.08 ± 0.13 | 0.59 ± 0.3 | 4.35 ± 1.21 |
| ResNet18 | ecc | Y | 12.5 | 0.0 ± 0.0 | 0.0 ± 0.01 | -0.03 ± 0.08 | 2.8 ± 0.31 |
| ResNet18 | in-place | Y | 0 | 0.0 ± 0.0 | 0.0 ± 0.01 | -0.08 ± 0.09 | 2.96 ± 0.81 |
| SqueezeNet | faulty | N | 0 | 0.12 ± 0.13 | 0.69 ± 0.31 | 9.39 ± 2.37 | 64.83 ± 0.5 |
| SqueezeNet | zero | N | 12.5 | 0.09 ± 0.12 | 0.11 ± 0.2 | 0.66 ± 0.29 | 8.16 ± 2.4 |
| SqueezeNet | ecc | Y | 12.5 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.12 ± 0.09 | 5.37 ± 0.66 |
| SqueezeNet | in-place | Y | 0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.12 ± 0.09 | 5.19 ± 1.08 |

The fault injection experiments give the following insights on memory fault protection for CNNs. First, larger models tend to suffer less from memory faults. For example, when the fault rate is 0.0001 and no protection is applied, the accuracy drops of VGG16, ResNet18, and SqueezeNet (less than 2%, 8%, and 16%, respectively) increase while the model size decreases (the numbers of parameters are 138M, 12M, and 1.2M, respectively). Second, when the fault rate is small (e.g.
less than 1e-05), in-place ECC and standard SEC-DED can almost guarantee the same accuracy as the fault-free model. Overall, the experiments confirm the potential of in-place zero-space ECC as an efficient replacement of the standard ECC without compromising the protection quality.

6 Future Directions

Besides 8-bit quantization, there are proposals of even fewer-bit quantizations for CNNs, in which there may be fewer non-informative bits in weight values. It is however worth noting that 8-bit quantization is the de facto standard in most existing CNN frameworks; it has repeatedly proven in practice to be a robust choice that offers an excellent balance between model size and accuracy. Improving the reliability of such models is hence essential. With that said, creating zero-space protections that work well with other model quantizations is a direction worth future exploration.

A second direction worth exploring is to extend the in-place zero-space protection to other error encoding methods (e.g., BCH [5]). Some of them require more parity bits, for which the regularized training may need to be extended to create more free bits in the data.

Finally, in-place zero-space ECC is in principle applicable to neural networks beyond CNNs. Empirically assessing its efficacy is left to future studies.

7 Conclusions

This paper presents in-place zero-space ECC assisted with a new training scheme named WOT to protect CNN memory. The protection scheme removes all space cost of ECC without compromising the reliability offered by ECC, opening new opportunities for enhancing the accuracy, energy efficiency, reliability, and cost effectiveness of CNN-driven AI solutions.

Acknowledgement    We would like to thank the anonymous reviewers for their helpful feedback. This material is based upon work supported by the National Science Foundation (NSF) under Grant No. CCF-1525609, CCF-1703487, CCF-1717550, and CCF-1908406.
Any opinions, comments, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

References

[1] Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN). https://intel.github.io/mkl-dnn/. Accessed: 2019-08-16.

[2] QNNPACK: Open source library for optimized mobile deep learning. https://code.fb.com/ml-applications/qnnpack/. Accessed: 2019-08-16.

[3] Austin P Arechiga and Alan J Michaels. The effect of weight errors on neural networks. In 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, 2018.

[4] Arash Azizimazreah, Yongbin Gu, Xiang Gu, and Lizhong Chen. Tolerating soft errors in deep learning accelerators with reliable on-chip memory designs. In 2018 IEEE International Conference on Networking, Architecture and Storage (NAS), pages 1–10. IEEE, 2018.

[5] Daniel J Costello. Error Control Coding: Fundamentals and Applications. Prentice Hall, 1983.

[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
IEEE, 2009.

[7] Ghouthi Boukli Hacene, François Leduc-Primeau, Amal Ben Soussia, Vincent Gripon, and François Gagnon. Training modern deep neural networks for memory-fault robustness. 2019.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[9] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.

[10] Benoit Jacob et al. gemmlowp: a small self-contained low-precision GEMM library, 2017.

[11] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713, 2018.

[12] Sung Kim, Patrick Howe, Thierry Moreau, Armin Alaghi, Luis Ceze, and Visvesh S Sathe. Energy-efficient neural network acceleration in the presence of bit-level memory errors. IEEE Transactions on Circuits and Systems I: Regular Papers, (99):1–14, 2018.

[13] Guanpeng Li, Siva Kumar Sastry Hari, Michael Sullivan, Timothy Tsai, Karthik Pattabiraman, Joel Emer, and Stephen W Keckler. Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 8. ACM, 2017.

[14] Robert E Lyons and Wouter Vanderkulk. The use of triple-modular redundancy to improve computer reliability. IBM Journal of Research and Development, 6(2):200–209, 1962.

[15] Szymon Migacz. 8-bit inference with TensorRT.
In GPU Technology Conference, volume 2, page 7, 2017.

[16] Dhananjay S Phatak and Israel Koren. Complete and partial fault tolerance of feedforward neural nets. IEEE Transactions on Neural Networks, 6(2):446–456, 1995.

[17] Peter W Protzel, Daniel L Palumbo, and Michael K Arras. Performance and fault-tolerance of neural networks for optimization. IEEE Transactions on Neural Networks, 4(4):600–614, 1993.

[18] Minghai Qin, Chao Sun, and Dejan Vucinic. Robustness of neural networks against storage media errors. arXiv preprint arXiv:1709.06173, 2017.

[19] Brandon Reagen, Udit Gupta, Lillian Pentecost, Paul Whatmough, Sae Kyu Lee, Niamh Mulholland, David Brooks, and Gu-Yeon Wei. Ares: A framework for quantifying the resilience of deep neural networks. In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2018.

[20] Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 267–278. IEEE, 2016.

[21] Ao Ren, Tianyun Zhang, Shaokai Ye, Jiayu Li, Wenyao Xu, Xuehai Qian, Xue Lin, and Yanzhi Wang. ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 925–938. ACM, 2019.

[22] Behzad Salami, Osman S Unsal, and Adrian Cristal Kestelman. On the resilience of RTL NN accelerators: Fault characterization and mitigation. In 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 322–329. IEEE, 2018.

[23] Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[24] Vilas Sridharan, Nathan DeBardeleben, Sean Blanchard, Kurt B Ferreira, Jon Stearley, John Shalf, and Sudhanva Gurumurthi. Memory errors in modern systems: The good, the bad, and the ugly. ACM SIGPLAN Notices, 50(4):297–310, 2015.

[25] Olivier Temam. A defect-tolerant accelerator for emerging high-performance applications. In ACM SIGARCH Computer Architecture News, volume 40, pages 356–367. IEEE Computer Society, 2012.

[26] Cesar Torres-Huitzil and Bernard Girau. Fault and error tolerance in neural networks: A review. IEEE Access, 5:17322–17341, 2017.

[27] Paul N Whatmough, Sae Kyu Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu-Yeon Wei. 14.3 A 28nm SoC with a 1.2 GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications. In 2017 IEEE International Solid-State Circuits Conference (ISSCC), pages 242–243. IEEE, 2017.

[28] Fa-Xin Yu, Jia-Rui Liu, Zheng-Liang Huang, Hao Luo, and Zhe-Ming Lu. Overview of radiation hardening techniques for IC design. Information Technology Journal, 9(6):1068–1080, 2010.

[29] Jeff Zhang, Kartheek Rangineni, Zahra Ghodsi, and Siddharth Garg. ThunderVolt: Enabling aggressive voltage underscaling and timing error resilience for energy efficient deep learning accelerators. In Proceedings of the 55th Annual Design Automation Conference, page 19. ACM, 2018.

[30] Jeff Jun Zhang, Tianyu Gu, Kanad Basu, and Siddharth Garg. Analyzing and mitigating the impact of permanent faults on a systolic array based neural network accelerator. In 2018 IEEE 36th VLSI Test Symposium (VTS), pages 1–6. IEEE, 2018.

[31] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and Yanzhi Wang. A systematic DNN weight pruning framework using alternating direction method of multipliers.
In Proceedings of the European Conference on Computer Vision (ECCV), pages 184–199, 2018.

[32] Neta Zmora, Guy Jacob, and Gal Novik. Neural network distiller. https://doi.org/10.5281/zenodo.1297430, June 2018.