{"title": "Norm matters: efficient and accurate normalization schemes in deep networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2160, "page_last": 2170, "abstract": "Over the past few years, Batch-Normalization has been commonly used in deep networks, allowing faster training and high performance for a wide variety of applications. However, the reasons behind its merits remained unanswered, with several shortcomings that hindered its use for certain tasks. In this work, we present a novel view on the purpose and function of normalization methods and weight-decay, as tools to decouple weights' norm from the underlying optimized objective. This property highlights the connection between practices such as normalization, weight decay and learning-rate adjustments. We suggest several alternatives to the widely used $L^2$ batch-norm, using normalization in $L^1$ and $L^\\infty$ spaces that can substantially improve numerical stability in low-precision implementations as well as provide computational and memory benefits. We demonstrate that such methods enable the first batch-norm alternative to work for half-precision implementations. Finally, we suggest a modification to weight-normalization, which improves its performance on large-scale tasks.", "full_text": "Norm matters: ef\ufb01cient and accurate normalization\n\nschemes in deep networks\n\nElad Hoffer1\u2217, Ron Banner2\u2217, Itay Golan1\u2217, Daniel Soudry1\n{elad.hoffer, itaygolan, daniel.soudry}@gmail.com\n\n{ron.banner}@intel.com\n\n(1) Technion - Israel Institute of Technology, Haifa, Israel\n(2) Intel - Arti\ufb01cial Intelligence Products Group (AIPG)\n\nAbstract\n\nOver the past few years, Batch-Normalization has been commonly used in deep\nnetworks, allowing faster training and high performance for a wide variety of\napplications. However, the reasons behind its merits remained unanswered, with\nseveral shortcomings that hindered its use for certain tasks. 
In this work, we present a novel view of the purpose and function of normalization methods and weight decay, as tools to decouple the weights' norm from the underlying optimized objective. This property highlights the connection between practices such as normalization, weight decay and learning-rate adjustments. We suggest several alternatives to the widely used L2 batch-norm, using normalization in L1 and L-infinity spaces that can substantially improve numerical stability in low-precision implementations as well as provide computational and memory benefits. We demonstrate that such methods enable the first batch-norm alternative to work for half-precision implementations. Finally, we suggest a modification to weight-normalization, which improves its performance on large-scale tasks.2

1 Introduction

Deep neural networks are known to benefit from normalization between consecutive layers. This was made noticeable with the introduction of Batch-Normalization (BN) [19], which normalizes the output of each layer to have zero mean and unit variance for each channel across the training batch. This idea was later developed to act across channels instead of the batch dimension in Layer-normalization [2], and was improved for certain tasks by methods such as Batch-Renormalization [18], Instance-normalization [33] and Group-Normalization [38]. In addition, normalization methods can also be applied to the layer parameters instead of their outputs: methods such as Weight-Normalization [27] and Normalization-Propagation [1] target the layer weights by normalizing their per-channel norm to a fixed value.
Instead of explicit normalization, effort was also made to enable self-normalization, by adapting the activation function so that intermediate activations converge towards zero mean and unit variance [21].

1.1 Issues with current normalization methods

Batch-normalization, despite its merits, suffers from several issues, as pointed out by previous work [27, 18, 1]. These issues are not yet solved by current normalization methods.

*Equal contribution
2Source code is available at https://github.com/eladhoffer/norm_matters

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Interplay with other regularization mechanisms. Batch normalization typically improves generalization performance and is therefore considered a regularization mechanism. Other regularization mechanisms are typically used in conjunction with it. For example, weight decay, also known as L2 regularization, is a common method which adds a penalty proportional to the weights' norm. Weight decay was proven to improve generalization in various problems [24, 5, 4], but, so far, not for non-linear deep neural networks. There, the authors of [40] performed an extensive set of experiments on regularization and concluded that explicit regularization, such as weight decay, may improve generalization performance, but is neither necessary nor, by itself, sufficient for reducing generalization error. It is therefore not clear how weight decay interacts with BN, or whether weight decay is even really necessary, given that batch norm already constrains the output norms [16].

Task-specific limitations. A key assumption in BN is the independence between samples appearing in each batch.
While this assumption seems to hold for most convolutional networks used to classify images in conventional datasets, it falls short when employed in domains with strong correlations between samples, such as time-series prediction, reinforcement learning, and generative modeling. For example, BN requires modifications to work in recurrent networks [6], for which alternatives such as weight-normalization [27] and layer-normalization [2] were explicitly devised, without reaching the success and wide adoption of BN. Another example is generative adversarial networks (GANs), which are also noted to suffer from the common form of BN: GAN training with BN proved unstable in some cases, decreasing the quality of the trained model [28]. Instead, it was replaced with virtual-BN [28], weight-norm [39] and spectral normalization [32]. BN may also be harmful even in plain classification tasks with unbalanced classes or correlated instances. In addition, while BN is defined for the training phase of a model, it requires a running estimate for the evaluation phase, causing a noticeable difference between the two [19]. This shortcoming was later addressed by batch-renormalization [18], which, however, still requires the original BN during the early steps of training.

Computational costs. From a computational perspective, BN is significant in modern neural networks, as it requires several floating-point operations across the activations of the entire batch for every layer in the network. A previous analysis by Gitman & Ginsburg [11] measured BN to constitute up to 24% of the computation time needed for the entire model. It is also not easily parallelized, as it is usually memory-bound on currently employed hardware. In addition, the operation requires saving the pre-normalized activations for back-propagation in the general case [26], thus using roughly twice the memory of a non-BN network in the training phase.
Other methods, such as Weight-Normalization [27], have a much smaller computational cost, but typically achieve significantly lower accuracy when used in large-scale tasks such as ImageNet [11].

Numerical precision. As the use of deep learning continues to evolve, the interest in low-precision training and inference increases [17, 36]. Optimized hardware was designed to leverage the benefits of low-precision arithmetic and memory operations, with the promise of better, more efficient implementations [22]. Although most mathematical operations employed in neural networks are known to be robust to low precision and quantized values, the current normalization methods are notably not suited for these cases. As far as we know, this has remained an unanswered issue, with no suggested alternatives. Specifically, all normalization methods, including BN, use an L2 normalization (a variance computation) to control the activation scale of each layer. This operation requires summing the squares of floating-point values, a square-root function, and a reciprocal operation. All of these require high precision to avoid a zero variance, and a large range to avoid overflow when adding many large numbers. This makes BN an operation that is not easily adapted to low-precision implementations. Using norm spaces other than L2 can alleviate these problems, as we shall see later.

1.2 Contributions

In this paper we make the following contributions, addressing the issues explained in the previous section:

• We find the mechanism through which weight decay before BN affects the learning dynamics: we demonstrate that by adjusting the learning rate or the normalization method we can exactly mimic the effect of weight decay on the learning dynamics.
We suggest this happens since certain normalization methods, such as BN, disentangle the effect of the weight vector's norm on the following activation layers.

• We show that we can replace the standard L2 BN with certain L1- and L∞-based variations of BN, which do not harm accuracy (on CIFAR and ImageNet) and even somewhat improve training speed. Importantly, we demonstrate that such norms can work well at low precision (16-bit), while L2 does not. Notably, for these normalization schemes to work well, a precise scale adjustment is required, which can be approximated analytically.

• We show that by bounding the norm in a weight-normalization scheme, we can significantly improve its performance in convnets (on ImageNet), and improve baseline performance in LSTMs (on WMT14 de-en). This method can alleviate several task-specific limitations of BN, and reduce its computational and memory costs (e.g., by allowing significantly larger batch sizes). Importantly, for the method to work well, we need to carefully choose the scale of the weights using the scale of the initialization.

Together, these findings emphasize that the learning dynamics in neural networks are very sensitive to the norms of the weights. Therefore, it is an important goal for future research to find precise and theoretically justifiable methods to adjust the scale of these norms.

2 Consequences of the scale invariance of Batch-Normalization

When BN is applied after a linear layer, it is well known that the output is invariant to the norm of the channel weight vector. Specifically, denoting a channel weight vector by w, its direction by \hat{w} = w / \|w\|_2, the channel input by x, and batch-norm by BN, we have

    BN(\|w\|_2 \hat{w} x) = BN(\hat{w} x).    (1)

This invariance to the weight vector's norm means that a BN applied after a layer renders the layer's norm irrelevant to the inputs of consecutive layers. The same can easily be shown for the per-channel weights of a convolutional layer. The gradient in this case is scaled by 1/\|w\|_2:

    \frac{\partial BN(\|w\|_2 \hat{w} x)}{\partial(\|w\|_2 \hat{w})} = \frac{1}{\|w\|_2} \frac{\partial BN(\hat{w} x)}{\partial \hat{w}}.    (2)

When a layer is rescaling invariant, the key feature of the weight vector is its direction. During training, the weights are typically updated through some variant of stochastic gradient descent, according to the gradient of the loss at mini-batch t, with learning rate \eta:

    w_{t+1} = w_t - \eta \nabla L_t(w_t).    (3)

Claim. During training, the weight direction \hat{w}_t = w_t / \|w_t\|_2 is updated according to

    \hat{w}_{t+1} = \hat{w}_t - \eta \|w_t\|_2^{-2} \left(I - \hat{w}_t \hat{w}_t^\top\right) \nabla L(\hat{w}_t) + O(\eta^2).

Proof. Denote \rho_t = \|w_t\|_2. Note that, from Eqs. (2) and (3), we have

    \rho_{t+1}^2 = \rho_t^2 - 2\eta \, \hat{w}_t^\top \nabla L(\hat{w}_t) + \eta^2 \rho_t^{-2} \|\nabla L(\hat{w}_t)\|^2

and therefore

    \rho_{t+1} = \rho_t \sqrt{1 - 2\eta \rho_t^{-2} \hat{w}_t^\top \nabla L(\hat{w}_t) + \eta^2 \rho_t^{-4} \|\nabla L(\hat{w}_t)\|^2}
               = \rho_t - \eta \rho_t^{-1} \hat{w}_t^\top \nabla L(\hat{w}_t) + O(\eta^2).

Additionally, from Eq. (3) we have

    \rho_{t+1} \hat{w}_{t+1} = \rho_t \hat{w}_t - \eta \nabla L(\rho_t \hat{w}_t)

and therefore, from Eq. (2),

    \hat{w}_{t+1} = \frac{\rho_t}{\rho_{t+1}} \hat{w}_t - \frac{\eta}{\rho_{t+1} \rho_t} \nabla L(\hat{w}_t)
                  = \left(1 + \eta \rho_t^{-2} \hat{w}_t^\top \nabla L(\hat{w}_t)\right) \hat{w}_t - \eta \rho_t^{-2} \nabla L(\hat{w}_t) + O(\eta^2)
                  = \hat{w}_t - \eta \rho_t^{-2} \left(I - \hat{w}_t \hat{w}_t^\top\right) \nabla L(\hat{w}_t) + O(\eta^2),

which proves the claim. □

Therefore, in the case of a linear layer followed by BN, and for a small learning rate \eta, the step size of the weight direction is approximately proportional to

    \hat{w}_{t+1} - \hat{w}_t \propto \frac{\eta}{\|w_t\|_2^2}.    (4)

Note that a similar conclusion was reached by van Laarhoven [34], who implicitly assumed \|w_{t+1}\| = \|w_t\|, though this is only approximately true. Here we show the conclusion remains true without such an assumption. This analysis continues to hold for non-linear functions that do not affect scale, such as the commonly used ReLU function. In addition, although stated for vanilla SGD, a similar argument can be made for adaptive methods such as Adagrad [9] or Adam [20].

3 Connection between weight-decay, learning rate and normalization

We claim that when using batch-norm (BN), weight decay (WD) improves optimization only by fixing the norm to a small range of values, leading to a more stable step size for the weight direction (the "effective step size"). Fixing the norm allows better control of the effective step size through the learning rate \eta. Without WD, the norm grows unbounded [31], resulting in a decreased effective step size, although the learning-rate hyper-parameter remains unchanged.

We show empirically that the accuracy gained by using WD can be achieved without it, only by adjusting the learning rate.
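The scale invariance of Eq. (1) and the 1/\|w\|_2 gradient scaling of Eq. (2), which together motivate the notion of effective step size, can be checked numerically. Below is a minimal numpy sketch; the helper names (batch_norm, loss, grad) and the arbitrary cosine loss are illustrative choices, not the paper's implementation:

```python
import numpy as np

def batch_norm(z, eps=1e-12):
    """Per-channel batch norm over the batch axis (no affine parameters)."""
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)

def loss(w, x):
    """An arbitrary scalar loss applied to the batch-normalized output."""
    return np.sum(np.cos(batch_norm(x @ w)))

def grad(f, w, x, h=1e-6):
    """Central-difference numerical gradient of f with respect to w."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = h
        g[j] = (f(w + e, x) - f(w - e, x)) / (2 * h)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))   # a batch of 256 inputs with 8 features
w = rng.normal(size=8)          # weight vector of a single channel

# Eq. (1): the BN output is invariant to the norm of w.
assert np.allclose(batch_norm(x @ w), batch_norm(x @ (5.0 * w)))

# Eq. (2): rescaling w by a factor c scales the loss gradient by 1/c.
g1 = grad(loss, w, x)
g5 = grad(loss, 5.0 * w, x)
assert np.allclose(g5, g1 / 5.0, atol=1e-4)
```

Because the loss depends on w only through its direction, growing \|w\| shrinks the gradient, which is exactly the effective-step-size effect discussed next.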
Given statistics on the norms of each channel from a training run with WD and BN, similar results can be achieved without WD by mimicking the effective step size using the following correction on the learning rate:

    \hat{\eta}_{\text{correction}} = \eta \, \frac{\|w\|_2^2}{\|w_{[\text{WD on}]}\|_2^2},    (5)

where w is the weight vector of a single channel, and w_{[WD on]} is the weight vector of the corresponding channel in a training run with WD. This correction requires access to the norms of a training run with WD, hence it is not a practical method to replace WD, but only a tool to demonstrate our claim on the connection between the weights' norm, WD and step size.

We conducted multiple experiments on CIFAR-10 [23] to show this connection. Figure 1 reports the test accuracy during training for all experiments. We were able to show that the WD results can be mimicked with step-size adjustments using the correction formula of Eq. (5). In another experiment, we replaced the learning-rate scheduling with norm scheduling. To do so, after every gradient-descent step we normalized the norm of each convolution-layer channel to match the norm of the corresponding channel in the training run with WD, keeping the learning rate constant. When the learning rate is multiplied by 0.1 in the WD training, we instead multiply the norm by \sqrt{10}, leading to an effective step size of

    \frac{\eta}{(\sqrt{10} \, \|W_{\text{WD on}}\|_2)^2} = 0.1 \, \frac{\eta}{\|W_{\text{WD on}}\|_2^2}.

As expected, when applying the correction on the step size, or when replacing learning-rate scheduling with norm scheduling, the accuracy is similar to the training with WD throughout the learning process, suggesting that WD affects the training process only indirectly, by modulating the effective learning rate. Implementation details appear in the supplementary material.

Figure 1: The connection between norm, effective step size and weight decay. WD on/WD off were trained with/without weight decay, respectively. WD off + LR correction was trained without weight decay but with the LR correction presented in Eq. (5). LR sched replaced with Norm sched is based on the WD-on norms, but replaces LR scheduling with norm scheduling. (VGG11, CIFAR-10)

4 Alternative Lp metrics for batch norm

We suggested above that the main function of BN is to neutralize the effect of the preceding layer's weights. If this hypothesis is true, then other operations might be able to replace BN, as long as they remain similarly scale invariant (as in Eq. (1)) and keep the same scale as BN. Following this reasoning, we next aim to replace the L2 norm with scale-invariant alternatives that are more appealing computationally and for low-precision implementations.

Batch normalization regularizes the input so that deviations from the mean are standardized according to the Euclidean L2 norm. For a layer with d-dimensional input x = (x^{(1)}, x^{(2)}, ..., x^{(d)}), L2 batch norm normalizes each dimension

    \hat{x}^{(k)} = \frac{x^{(k)} - \mu_k}{\sqrt{\mathrm{Var}[x^{(k)}]}},    (6)

where \mu_k is the expectation over x^{(k)}, n is the batch size, and \mathrm{Var}[x^{(k)}] = \frac{1}{n}\|x^{(k)} - \mu_k\|_2^2. The computational toll induced by \sqrt{\mathrm{Var}[x^{(k)}]} is often significant, with non-negligible overheads on memory and energy consumption.
In addition, as the above variance computation involves sums of squares, quantizing the L2 batch norm for training on optimized hardware can lead to numerical instability as well as to arithmetic overflows when dealing with large values.

In this section, we suggest alternative Lp metrics for BN. We focus on L1 and L∞ due to their appealing speed and memory properties. In our simulations, we were able to train models faster and with fewer GPUs using these normalizations. Strikingly, with a proper adjustment of these normalizations, we were able to train various complicated models without hurting classification performance. We begin with the L1-norm metric.

4.1 L1 batch norm

For a layer with d-dimensional input x = (x^{(1)}, x^{(2)}, ..., x^{(d)}), L1 batch normalization normalizes each dimension

    \hat{x}^{(k)} = \frac{x^{(k)} - \mu_k}{C_{L_1} \cdot \|x^{(k)} - \mu_k\|_1 / n},    (7)

where \mu_k is the expectation over x^{(k)}, n is the batch size, and C_{L_1} = \sqrt{\pi/2} is a normalization constant.

Unlike traditional L2 batch normalization, which computes the average squared deviation from the mean (the variance), L1 batch normalization computes only the average absolute deviation from the mean. This has two major advantages. First, L1 batch normalization eliminates the computational effort required for the square and square-root operations. Second, as the square of an n-bit number generally requires 2n bits, the absence of these square computations makes it much more suitable for low-precision training, which has been recognized to drastically reduce memory size and power consumption on dedicated deep learning hardware [7].

As can be seen in Eq. (7), L1 batch normalization quantifies the variability with the normalized average absolute deviation C_{L_1} \cdot \|x^{(k)} - \mu_k\|_1 / n.
To calculate an appropriate value for the constant C_{L_1}, we assume the input x^{(k)} follows a Gaussian distribution N(\mu_k, \sigma^2). This is a common approximation (e.g., Soudry et al. [30]), based on the fact that the neural input x^{(k)} is a sum of many inputs, so we expect it to be approximately Gaussian by the central limit theorem. In this case, \hat{x}^{(k)} = x^{(k)} - \mu_k follows the distribution N(0, \sigma^2). Therefore, each example |\hat{x}_i^{(k)}| follows a half-normal distribution with expectation E[|\hat{x}_i^{(k)}|] = \sigma \sqrt{2/\pi}. Accordingly, the expected L1 variability measure is related to the traditional standard deviation \sigma normally used with batch normalization as follows:

    E\left[\frac{C_{L_1}}{n} \|x^{(k)} - \mu_k\|_1\right] = \frac{\sqrt{\pi/2}}{n} \sum_{i=1}^{n} E[|\hat{x}_i^{(k)}|] = \sigma.

Figure 2 presents the validation accuracy of ResNet-18 and ResNet-50 on ImageNet using L1 and L2 batch norms. While L1 batch norm is more efficient in terms of resource usage, power, and speed, both share the same classification accuracy. We additionally verified that L1 layer-normalization works for the Transformer architecture [35]: using an L1 layer-norm we achieved a final perplexity of 5.2, vs. 5.1 for the original L2 layer-norm, using the base model on the WMT14 dataset.

We note the importance of C_{L_1} to the performance of the L1 normalization method. For example, using C_{L_1} helps the network reach a 20% validation error more than twice as fast as an equivalent configuration without this normalization constant. With C_{L_1}, the network converges at the same rate and to the same accuracy as L2 batch norm.
It is somewhat surprising that this constant can have such an impact on performance, considering the fact that it is so close to one (C_{L_1} = \sqrt{\pi/2} \approx 1.25). A demonstration of this effect can be found in the supplementary material (Figure 1).

We also note that the use of the L1 norm improved both running time and memory consumption for the models we tested. These benefits can be attributed to the fact that the absolute-value operation is computationally cheaper than the costly square and sqrt operations. Additionally, the derivative of |x| is sign(x); therefore, in order to compute the gradients, we only need to cache the signs of the values (not the actual values), allowing for substantial memory savings.

4.2 L∞ batch norm

Another alternative measure of variability that avoids the discussed limitations of the traditional L2 batch norm is the maximum absolute deviation. For a layer with d-dimensional input x = (x^{(1)}, x^{(2)}, ..., x^{(d)}), L∞ batch normalization normalizes each dimension

    \hat{x}^{(k)} = \frac{x^{(k)} - \mu_k}{C_{L_\infty}(n) \cdot \|x^{(k)} - \mu_k\|_\infty},    (8)

where \mu_k is the expectation over x^{(k)}, n is the batch size, and C_{L_\infty}(n) is computed similarly to C_{L_1} (the derivation appears in the appendix).

While normalizing according to the maximum absolute deviation offers a major performance advantage, we found it somewhat less robust to noise than L1 and L2 normalization. By replacing the maximum absolute deviation with the mean of the ten largest deviations, we were able to make the normalization much more robust to outliers. Formally, let s_m be the m-th largest absolute deviation in the batch; we define Top(k) as

    \mathrm{Top}(k) = \frac{1}{k} \sum_{m=1}^{k} |s_m|.

Given a batch of size n, the notion of Top(k) generalizes the L1 and L∞ metrics: L∞ is precisely Top(1), while L1 is by definition equivalent to Top(n).
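The L1 statistic of Eq. (7) and the Top(k) generalization can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's PyTorch implementation; the names l1_batch_norm and top_k are ours:

```python
import numpy as np

C_L1 = np.sqrt(np.pi / 2)   # ~1.2533; calibrates the L1 statistic to sigma for Gaussian inputs

def l1_batch_norm(x):
    """Eq. (7): normalize by the scaled mean absolute deviation (no affine parameters)."""
    mu = x.mean(axis=0)
    mad = np.abs(x - mu).mean(axis=0)          # ||x^(k) - mu_k||_1 / n
    return (x - mu) / (C_L1 * mad)

def top_k(x, k):
    """Top(k) variability: mean of the k largest absolute deviations per channel.
    Top(1) recovers the L-infinity deviation; Top(n) recovers the plain L1 mean."""
    d = np.abs(x - x.mean(axis=0))
    return np.sort(d, axis=0)[-k:].mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(100_000, 4))

# For Gaussian inputs, C_L1 * mean-absolute-deviation estimates sigma (= 2 here),
# so L1 batch norm matches the scale of the usual variance-based batch norm.
print(np.round(C_L1 * np.abs(x - x.mean(axis=0)).mean(axis=0), 2))
```

Note that, unlike the variance, the statistic needs no squares: only absolute values, a sum, and one division, which is what makes it attractive at low precision.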
As we could not find a closed-form expression for the normalization constant C_{TopK}(n), we approximated it as a linear interpolation between C_{L_1} and C_{L_\infty}(n). As can be seen in Figure 2, the use of Top(10) was sufficient to close the gap to L2 performance. For further details on the Top(10) implementation, see our code.

4.3 Batch norm at half precision

Due to numerical issues, prior attempts to train neural networks at low precision had to leave the batch norm operations at full precision (float32), as described by Micikevicius et al. [25] and Das et al. [8], thus enabling only mixed-precision training. This effectively means that low-precision hardware still needs to support full-precision data types. The sensitivity of BN to low-precision operations can be attributed both to the square and square-root operations used, and to the possible overflow of the sum of many large positive values. To avoid this overflow, a wide accumulator with full precision may further be required.

We provide evidence that by using L1 arithmetic, batch normalization can also be quantized to half precision with no apparent effect on validation accuracy, as can be seen in Figure 3. Using the standard L2 BN at low precision leads to overflow and significant quantization noise that quickly deteriorates the whole training process, while L1 BN allows training with no visible loss of accuracy. As far as we know, our work is the first to demonstrate a viable batch-norm alternative for half-precision training. We also note that the use of L∞ BN, or its Top(k) relaxation, may further help low-precision implementations by significantly lowering the extent of the reduction operation (as only k numbers need to be summed).

Figure 2: Classification error with L2 batch norm (baseline) and the L1, L∞ and Top(10) alternatives, for ResNet-18 and ResNet-50 on ImageNet. Compared to the baselines, the L1 and Top(10) normalizations reached similar final accuracy (difference < 0.2%), while L∞ had a lower accuracy, by 3%.

Figure 3: L1 BN is more robust to quantization noise than L2 BN, as demonstrated for ResNet-18 on ImageNet. The half-precision run of L2 BN was clearly diverging, even when done with a high-precision accumulator, and we stopped the run before termination at epoch 20.

5 Improving weight normalization

5.1 The advantages and disadvantages of weight normalization

Trying to address several of the limitations of BN, Salimans & Kingma [27] suggested weight normalization as a replacement. As weight-norm requires only an L2 normalization over the output channels of the weight matrix, it alleviates both the computational and the task-specific shortcomings of BN, ensuring no dependency on the current batch of sample activations within a layer.

While this alternative works well for small-scale problems, as demonstrated in the original work, Gitman & Ginsburg [11] noted that it falls short in large-scale usage. For example, on the ImageNet classification task, weight-norm exhibited unstable convergence and significantly lower performance (67% accuracy on ResNet-50 vs. 75% for the original).

An additional modification of weight-norm, called "normalization propagation" [1], adds multiplicative and additive corrections to address the change of activation distribution introduced by the ReLU non-linearity used between the layers of the network. These modifications do not apply trivially to architectures with complex structural elements such as residual connections [14].

So far, we have demonstrated that the key to the performance of normalization techniques lies in their property of neutralizing the effect of the weights' norm.
Next, we will use this reasoning to overcome the shortcomings of weight-norm.

5.2 Norm-bounded weight-normalization

We return to the original parametrization suggested for weight-norm. For a given initialized weight matrix V with N output channels,

    w_i = g_i \frac{v_i}{\|v_i\|_2},

where w_i is the parameterized weight for the i-th output channel, composed of an L2-normalized vector v_i and a scalar g_i.

Weight-norm successfully normalizes each output channel's weights to reside on the L2 sphere. However, it allows the weights' scale to change freely through the scalar g_i. Following the reasoning presented earlier in this work, we wish to make the weights' norm completely disjoint from their values. We can achieve this by keeping the norm fixed as follows:

    w_i = \rho \frac{v_i}{\|v_i\|_2},

where \rho is a fixed scalar for each layer, determined by its size (the number of input and output channels). A simple choice for \rho is the initial norm of the weights, e.g., \rho = \|V\|_F^{(t=0)} / \sqrt{N}, thus employing the various successful heuristics used to initialize modern networks [12, 13]. We also note that when using non-linearities with no scale sensitivity (e.g., ReLU), these \rho constants can instead be incorporated into only the final classifier's weights and biases throughout the network.

Previous works demonstrated that weight-normalized networks converge faster when augmented with mean-only batch normalization.
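The fixed-norm parametrization above amounts to projecting each output channel onto a sphere of fixed radius \rho. A minimal numpy sketch of this projection (the function name bounded_weight_norm is ours; in training, v_i would be the learned parameter and the projection applied on every forward pass):

```python
import numpy as np

def bounded_weight_norm(V, rho=None):
    """Bounded weight-norm: w_i = rho * v_i / ||v_i||_2 with a *fixed* per-layer
    scale rho, instead of weight-norm's freely learned per-channel gain g_i.
    By default rho is taken from the initial weights: ||V||_F / sqrt(N)."""
    N = V.shape[0]                               # number of output channels
    if rho is None:
        rho = np.linalg.norm(V) / np.sqrt(N)     # ||V||_F^{(t=0)} / sqrt(N)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return rho * V / norms

rng = np.random.default_rng(0)
V = rng.normal(size=(64, 128))                   # 64 output channels of 128 weights
W = bounded_weight_norm(V)

# Every output channel now has the same fixed L2 norm rho.
print(np.allclose(np.linalg.norm(W, axis=1), np.linalg.norm(V) / np.sqrt(64)))  # prints True
```

Because \rho comes from the initialization, the scale set by standard init heuristics is preserved exactly, while gradient updates can only rotate each channel's direction.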
We follow this regime, although we note that a similar final accuracy can be achieved without mean normalization, at the cost of slower convergence, or with the use of zero-mean-preserving activation functions [10].

After this modification, we find that weight-norm can be improved substantially, solving the stability issues observed for large-scale tasks by Gitman & Ginsburg [11] and achieving comparable accuracy (although still behind BN). Results on ImageNet using ResNet-50 are described in Figure 4, using the original settings and training regime [14]. We believe the still-apparent margin between the two methods can be further decreased with hyper-parameter tuning, such as a modified learning-rate schedule.

It is also interesting to observe BWN's effect in recurrent networks, where BN is not easily applicable [6]. We compare weight-norm vs. the common implementation (with layer-norm) of an attention-based LSTM model on the WMT14 en-de translation task [3]. The model consists of 2 LSTM cells for both encoder and decoder, with an attention mechanism. We also compared BWN on the Transformer architecture [35] as a replacement for layer-norm, again achieving comparable final performance (26.5 vs. 27.3 BLEU score on the original base model). Both sequence-to-sequence models were tested using beam-search decoding with a beam size of 4 and a length penalty of 0.6. Additional results for BWN can be found in the supplementary material (Figure 2 and Table 1).

Figure 4: Comparison between batch-norm (BN), weight-norm (WN) and bounded-weight-norm (BWN) on ResNet-50, ImageNet. For weight-norm, we show the final results from [11]. Our implementation of WN here could not converge (similar convergence issues were reported by [11]). Final accuracy: BN - 75.3%, WN - 67%, and BWN - 73.8%.

5.3 Lp weight normalization

As we did for BN, we can consider weight-normalization with norms other than L2, such that

    w_i = \rho \frac{v_i}{\|v_i\|_p}, \quad \rho = \|V\|_p^{(t=0)} / N^{1/p},

where computing the constant \rho with the desired (vector) norm ensures the proper scaling that was required in the BN case. We find that, similarly to BN, the L1 norm can serve as an alternative to the original L2 weight-norm, whereas L∞ causes a noticeable degradation when used in its proper form (top-1 absolute maximum).

6 Discussion

In this work, we analyzed common normalization techniques used in deep learning models, with BN as their prime representative. We considered a novel perspective on the role of these methods, as tools to decouple the weights' norm from the training objective. This perspective allowed us to re-evaluate the necessity of regularization methods such as weight decay, and to suggest new methods for normalization, targeting the computational, numerical and task-specific deficiencies of current techniques.

Specifically, we showed that the use of L1- and L∞-based normalization schemes can provide results similar to the standard BN while allowing low-precision computation. Such methods can be easily implemented and deployed to serve in current and future network architectures and low-precision devices. A similar L1 normalization scheme to ours was recently introduced by Wu et al. [37], appearing in parallel to our work (within a week). In contrast to Wu et al. [37], we found that the C_{L_1} normalization constant is crucial for achieving the same performance as L2 (see Figure 1 in the supplementary material).
We additionally demonstrated the benefits of L1 normalization: it allowed us to perform BN in half-precision floating point, which was noted to fail in previous works [25, 8] and previously required full- or mixed-precision hardware.

Moreover, we suggested a bounded weight-normalization method, which achieves improved results on large-scale tasks (ImageNet) and is nearly comparable with BN. Such a weight-normalization scheme reduces computational costs and can enable improved learning in tasks to which previous methods were not suited, such as reinforcement learning and temporal modeling.

We further suggest that the insights gained from our findings can have an additional impact on the way neural networks are devised and trained. As previous works demonstrated, a strong connection exists between the batch size used and the optimal learning-rate regime [15, 29], and between the weight-decay factor and the learning rate [34]. We deepen this connection and suggest that all of these factors, including the effective norm (or temperature), mutually affect one another. It is plausible, given our results, that some (or all) of these hyper-parameters can be fixed given the others, which could potentially ease the design and training of modern models.

Acknowledgments

This research was supported by the Israel Science Foundation (grant No. 31/1031), and by the Taub foundation. A Titan Xp used for this research was donated by the NVIDIA Corporation.

References

[1] Arpit, D., Zhou, Y., Kota, B., and Govindaraju, V. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In International Conference on Machine Learning, pp. 1168-1176, 2016.

[2] Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[3] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate.
arXiv preprint arXiv:1409.0473, 2014.

[4] Bös, S. Optimal weight decay in a perceptron. In International Conference on Artificial Neural Networks, pp. 551-556. Springer, 1996.

[5] Bös, S. and Chug, E. Using weight decay to optimize the generalization ability of a perceptron. In IEEE International Conference on Neural Networks, volume 1, pp. 241-246. IEEE, 1996.

[6] Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., and Courville, A. Recurrent batch normalization. arXiv preprint arXiv:1603.09025, 2016.

[7] Courbariaux, M., Bengio, Y., and David, J.-P. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024, 2014.

[8] Das, D., Mellempudi, N., Mudigere, D., et al. Mixed precision training of convolutional neural networks using integer operations. arXiv preprint arXiv:1802.00930, 2018.

[9] Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

[10] Eidnes, L. and Nøkland, A. Shifting mean activation towards zero with bipolar activation functions. arXiv preprint arXiv:1709.04054, 2017.

[11] Gitman, I. and Ginsburg, B. Comparison of batch normalization and weight normalization algorithms for the large-scale image classification. CoRR, abs/1709.08145, 2017.

[12] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.

[13] He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026-1034, 2015.

[14] He, K., Zhang, X., Ren, S., and Sun, J.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

[15] Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1729-1739, 2017.

[16] Huang, L., Liu, X., Lang, B., and Li, B. Projection based weight normalization for deep neural networks. arXiv preprint arXiv:1710.02338, 2017.

[17] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks. In Advances in Neural Information Processing Systems, pp. 4107-4115, 2016.

[18] Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pp. 1942-1950, 2017.

[19] Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448-456, 2015.

[20] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[21] Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pp. 971-980, 2017.

[22] Köster, U., Webb, T., Wang, X., et al. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 1740-1750, 2017.

[23] Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

[24] Krogh, A. and Hertz, J. A. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pp. 950-957, 1992.

[25] Micikevicius, P., Narang, S., Alben, J., et al.
Mixed precision training. In International Conference on Learning Representations, 2018.

[26] Rota Bulò, S., Porzi, L., and Kontschieder, P. In-place activated batchnorm for memory-optimized training of DNNs. arXiv preprint arXiv:1712.02616, 2017.

[27] Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901-909, 2016.

[28] Salimans, T., Goodfellow, I., Zaremba, W., et al. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234-2242, 2016.

[29] Smith, S. L., Kindermans, P.-J., and Le, Q. V. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.

[30] Soudry, D., Hubara, I., and Meir, R. Expectation backpropagation: parameter-free training of multilayer neural networks with continuous or discrete weights. In Neural Information Processing Systems, volume 2, pp. 963-971, Dec 2014.

[31] Soudry, D., Hoffer, E., and Srebro, N. The implicit bias of gradient descent on separable data. International Conference on Learning Representations, 2018.

[32] Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018.

[33] Ulyanov, D., Vedaldi, A., and Lempitsky, V. S. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.

[34] van Laarhoven, T. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.

[35] Vaswani, A., Shazeer, N., Parmar, N., et al. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000-6010, 2017.

[36] Venkatesh, G., Nurvitadhi, E., and Marr, D. Accelerating deep convolutional networks using low-precision and sparsity.
In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2861-2865. IEEE, 2017.

[37] Wu, S., Li, G., Deng, L., et al. L1-norm batch normalization for efficient training of deep neural networks. arXiv e-prints, February 2018.

[38] Wu, Y. and He, K. Group normalization. arXiv preprint arXiv:1803.08494, 2018.

[39] Xiang, S. and Li, H. On the effect of batch normalization and weight normalization in generative adversarial networks. arXiv preprint arXiv:1704.03971, 2017.

[40] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.