{"title": "How Does Batch Normalization Help Optimization?", "book": "Advances in Neural Information Processing Systems", "page_first": 2483, "page_last": 2493, "abstract": "Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs).\nDespite its pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly understood.\nThe popular belief is that this effectiveness stems from controlling the change of the layers' input distributions during training to reduce the so-called \"internal covariate shift\".\nIn this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm.\nInstead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother.\nThis smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.", "full_text": "How Does Batch Normalization Help Optimization?

Shibani Santurkar* (MIT, shibani@mit.edu)
Dimitris Tsipras* (MIT, tsipras@mit.edu)
Andrew Ilyas* (MIT, ailyas@mit.edu)
Aleksander Mądry (MIT, madry@mit.edu)

Abstract

Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs). Despite its pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly understood. The popular belief is that this effectiveness stems from controlling the change of the layers' input distributions during training to reduce the so-called "internal covariate shift". In this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm.
Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.

1 Introduction

Over the last decade, deep learning has made impressive progress on a variety of notoriously difficult tasks in computer vision [16, 7], speech recognition [5], machine translation [29], and game-playing [18, 25]. This progress hinged on a number of major advances in terms of hardware, datasets [15, 23], and algorithmic and architectural techniques [27, 12, 20, 28]. One of the most prominent examples of such advances was batch normalization (BatchNorm) [10].

At a high level, BatchNorm is a technique that aims to improve the training of neural networks by stabilizing the distributions of layer inputs. This is achieved by introducing additional network layers that control the first two moments (mean and variance) of these distributions.

The practical success of BatchNorm is indisputable. By now, it is used by default in most deep learning models, both in research (more than 6,000 citations) and real-world settings. Somewhat surprisingly, however, despite its prominence, we still have a poor understanding of where BatchNorm's effectiveness stems from. In fact, there are now a number of works that provide alternatives to BatchNorm [1, 3, 13, 31], but none of them seem to bring us any closer to understanding this issue. (A similar point was also raised recently in [22].)

Currently, the most widely accepted explanation of BatchNorm's success, as well as its original motivation, relates to so-called internal covariate shift (ICS). Informally, ICS refers to the change in the distribution of layer inputs caused by updates to the preceding layers. It is conjectured that such continual change negatively impacts training. The goal of BatchNorm was to reduce ICS and thus remedy this effect.

Even though this explanation is widely accepted, we seem to have little concrete evidence supporting it. In particular, we still do not understand the link between ICS and training performance. The chief goal of this paper is to address these shortcomings. Our exploration led to somewhat startling discoveries.

*Equal contribution.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Our Contributions. Our starting point is demonstrating that there does not seem to be any link between the performance gain of BatchNorm and the reduction of internal covariate shift, or that this link is tenuous, at best. In fact, we find that in a certain sense BatchNorm might not even be reducing internal covariate shift.

We then turn our attention to identifying the roots of BatchNorm's success. Specifically, we demonstrate that BatchNorm impacts network training in a fundamental way: it makes the landscape of the corresponding optimization problem significantly more smooth. This ensures, in particular, that the gradients are more predictive and thus allows for the use of a larger range of learning rates and faster network convergence. We provide an empirical demonstration of these findings as well as their theoretical justification. We prove that, under natural conditions, the Lipschitzness of both the loss and the gradients (also known as β-smoothness [21]) is improved in models with BatchNorm.

Finally, we find that this smoothening effect is not uniquely tied to BatchNorm. A number of other natural normalization techniques have a similar (and, sometimes, even stronger) effect.
In particular, they all offer similar improvements in the training performance.

We believe that understanding the roots of such a fundamental technique as BatchNorm will let us have a significantly better grasp of the underlying complexities of neural network training and, in turn, will inform further algorithmic progress in this context.

Our paper is organized as follows. In Section 2, we explore the connections between BatchNorm, optimization, and internal covariate shift. Then, in Section 3, we demonstrate and analyze the exact roots of BatchNorm's success in deep neural network training. We present our theoretical analysis in Section 4. We discuss further related work in Section 5 and conclude in Section 6.

2 Batch normalization and internal covariate shift

Batch normalization (BatchNorm) [10] has been arguably one of the most successful architectural innovations in deep learning. But even though its effectiveness is indisputable, we do not have a firm understanding of why this is the case.

Broadly speaking, BatchNorm is a mechanism that aims to stabilize the distribution (over a mini-batch) of inputs to a given network layer during training. This is achieved by augmenting the network with additional layers that set the first two moments (mean and variance) of the distribution of each activation to be zero and one respectively. Then, the batch normalized inputs are also typically scaled and shifted based on trainable parameters to preserve model expressivity. This normalization is applied before the non-linearity of the previous layer.

One of the key motivations for the development of BatchNorm was the reduction of so-called internal covariate shift (ICS). This reduction has been widely viewed as the root of BatchNorm's success. Ioffe and Szegedy [10] describe ICS as the phenomenon wherein the distribution of inputs to a layer in the network changes due to an update of parameters of the previous layers.
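The mechanics just described, normalizing each activation to zero mean and unit variance over the mini-batch and then applying a trainable scale and shift, can be sketched in a few lines of NumPy. This is a minimal illustration of our own (the function name, shapes, and the small ε term are our choices), not the paper's code:

```python
import numpy as np

def batchnorm_forward(y, gamma, beta, eps=1e-5):
    """Normalize each activation (column) over the mini-batch (rows) to mean 0
    and variance 1, then scale by gamma and shift by beta to preserve expressivity."""
    mu = y.mean(axis=0)                      # per-activation mean over the batch
    var = y.var(axis=0)                      # per-activation variance over the batch
    y_hat = (y - mu) / np.sqrt(var + eps)    # "whitened" activations
    z = gamma * y_hat + beta                 # trainable scale and shift
    return z, y_hat

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=(64, 8))   # a batch of 64 inputs, 8 activations
z, y_hat = batchnorm_forward(y, gamma=np.ones(8), beta=np.zeros(8))
```

After normalization, each column of y_hat has (up to the ε term) zero mean and unit variance regardless of the statistics of y; γ and β then restore the layer's expressive power.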
This change leads to a constant shift of the underlying training problem and is thus believed to have a detrimental effect on the training process.

Figure 1: Comparison of (a) training (optimization) and (b) test (generalization) performance of a standard VGG network trained on CIFAR-10 with and without BatchNorm (details in Appendix A). There is a consistent gain in training speed in models with BatchNorm layers. (c) Even though the gap between the performance of the BatchNorm and non-BatchNorm networks is clear, the difference in the evolution of layer input distributions seems to be much less pronounced. (Here, we sampled activations of a given layer and visualized their distribution over training steps.)

Despite its fundamental role and widespread use in deep learning, the underpinnings of BatchNorm's success remain poorly understood [22]. In this work, we aim to address this gap. To this end, we start by investigating the connection between ICS and BatchNorm. Specifically, we first train a standard VGG [26] architecture on CIFAR-10 [15] with and without BatchNorm. As expected, Figures 1(a) and (b) show a drastic improvement, both in terms of optimization and generalization performance, for networks trained with BatchNorm layers. Figure 1(c) presents, however, a surprising finding. In this figure, we visualize to what extent BatchNorm is stabilizing distributions of layer inputs by plotting the distribution (over a batch) of a random input over training.
Surprisingly, the\ndifference in distributional stability (change in the mean and variance) in networks with and without\nBatchNorm layers seems to be marginal. This observation raises the following questions:\n\n(1) Is the effectiveness of BatchNorm indeed related to internal covariate shift?\n(2) Is BatchNorm\u2019s stabilization of layer input distributions even effective in reducing ICS?\n\nWe now explore these questions in more depth.\n\n2.1 Does BatchNorm\u2019s performance stem from controlling internal covariate shift?\n\nThe central claim in [10] is that controlling the mean and variance of distributions of layer inputs is\ndirectly connected to improved training performance. Can we, however, substantiate this claim?\nWe propose the following experiment. We train networks with random noise injected after BatchNorm\nlayers. Speci\ufb01cally, we perturb each activation for each sample in the batch using i.i.d. noise sampled\nfrom a non-zero mean and non-unit variance distribution. We emphasize that this noise distribution\nchanges at each time step (see Appendix A for implementation details).\nNote that such noise injection produces a severe covariate shift that skews activations at every time\nstep. Consequently, every unit in the layer experiences a different distribution of inputs at each\ntime step. We then measure the effect of this deliberately introduced distributional instability on\nBatchNorm\u2019s performance. Figure 2 visualizes the training behavior of standard, BatchNorm and our\n\u201cnoisy\u201d BatchNorm networks. Distributions of activations over time from layers at the same depth in\neach one of the three networks are shown alongside.\nObserve that the performance difference between models with BatchNorm layers, and \u201cnoisy\u201d Batch-\nNorm layers is almost non-existent. Also, both these networks perform much better than standard\nnetworks. 
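The noise injection described above can be sketched as follows. The distribution ranges here are our own illustrative choices; the paper's exact settings are given in its Appendix A:

```python
import numpy as np

def noisy_batchnorm(y, rng, eps=1e-5):
    """Batch-normalize, then add i.i.d. noise with non-zero mean and non-unit
    variance. The noise distribution itself is resampled on every call (i.e.,
    every training step), deliberately re-introducing distributional instability."""
    y_hat = (y - y.mean(axis=0)) / np.sqrt(y.var(axis=0) + eps)
    shift = rng.uniform(-1.0, 1.0, size=y.shape[1])  # non-zero mean (illustrative range)
    scale = rng.uniform(0.5, 2.0, size=y.shape[1])   # non-unit variance (illustrative range)
    return y_hat + rng.normal(loc=shift, scale=scale, size=y.shape)

rng = np.random.default_rng(1)
out1 = noisy_batchnorm(rng.normal(size=(32, 4)), rng)  # step t
out2 = noisy_batchnorm(rng.normal(size=(32, 4)), rng)  # step t+1: a different noise distribution
```

Every unit thus sees a differently shifted and scaled input distribution at every step, which is exactly the kind of "covariate shift" the experiment injects.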
Moreover, the "noisy" BatchNorm network has qualitatively less stable distributions than even the standard, non-BatchNorm network, yet it still performs better in terms of training.

Figure 2: Connections between distributional stability and BatchNorm performance: We compare VGG networks trained without BatchNorm (Standard), with BatchNorm (Standard + BatchNorm) and with explicit "covariate shift" added to BatchNorm layers (Standard + "Noisy" BatchNorm). In the latter case, we induce distributional instability by adding time-varying, non-zero mean and non-unit variance noise independently to each batch normalized activation. The "noisy" BatchNorm model nearly matches the performance of the standard BatchNorm model, despite complete distributional instability. We sampled activations of a given layer and visualized their distributions (also cf. Figure 7).

Figure 3: Measurement of ICS (as defined in Definition 2.1) in networks with and without BatchNorm layers, for (a) VGG and (b) DLN. For a layer we measure the cosine angle (ideally 1) and ℓ2-difference of the gradients (ideally 0) before and after updates to the preceding layers (see Definition 2.1). Models with BatchNorm have similar, or even worse, internal covariate shift, despite performing better in terms of accuracy and loss. (The faster stabilization of BatchNorm during training is an artifact of parameter convergence.)

To put the magnitude of the noise into perspective, we plot the mean and variance of random activations for select layers in Figure 7.
Moreover, adding the same amount of noise to the activations of the\nstandard (non-BatchNorm) network prevents it from training entirely.\nClearly, these \ufb01ndings are hard to reconcile with the claim that the performance gain due to Batch-\nNorm stems from increased stability of layer input distributions.\n\n2.2\n\nIs BatchNorm reducing internal covariate shift?\n\nOur \ufb01ndings in Section 2.1 make it apparent that ICS is not directly connected to the training\nperformance, at least if we tie ICS to stability of the mean and variance of input distributions. One\nmight wonder, however: Is there a broader notion of internal covariate shift that has such a direct link\nto training performance? And if so, does BatchNorm indeed reduce this notion?\nRecall that each layer can be seen as solving an empirical risk minimization problem where given a\nset of inputs, it is optimizing some loss function (that possibly involves later layers). An update to the\nparameters of any previous layer will change these inputs, thus changing this empirical risk mini-\nmization problem itself. This phenomenon is at the core of the intuition that Ioffe and Szegedy [10]\nprovide regarding internal covariate shift. Speci\ufb01cally, they try to capture this phenomenon from\nthe perspective of the resulting distributional changes in layer inputs. However, as demonstrated in\nSection 2.1, this perspective does not seem to properly encapsulate the roots of BatchNorm\u2019s success.\nTo answer this question, we consider a broader notion of internal covariate shift that is more tied to\nthe underlying optimization task. (After all the success of BatchNorm is largely of an optimization\nnature.) Since the training procedure is a \ufb01rst-order method, the gradient of the loss is the most natural\nobject to study. 
To quantify the extent to which the parameters in a layer would have to "adjust" in reaction to a parameter update in the previous layers, we measure the difference between the gradients of each layer before and after updates to all the previous layers. This leads to the following definition.

Definition 2.1. Let L be the loss, W_1^(t), ..., W_k^(t) be the parameters of each of the k layers, and (x^(t), y^(t)) be the batch of input-label pairs used to train the network at time t. We define internal covariate shift (ICS) of activation i at time t to be the difference ||G_{t,i} − G'_{t,i}||_2, where

    G_{t,i} = ∇_{W_i^(t)} L(W_1^(t), ..., W_k^(t); x^(t), y^(t))
    G'_{t,i} = ∇_{W_i^(t)} L(W_1^(t+1), ..., W_{i−1}^(t+1), W_i^(t), W_{i+1}^(t), ..., W_k^(t); x^(t), y^(t)).

Here, G_{t,i} corresponds to the gradient of the layer parameters that would be applied during a simultaneous update of all layers (as is typical). On the other hand, G'_{t,i} is the same gradient after all the previous layers have been updated with their new values. The difference between G and G' thus reflects the change in the optimization landscape of W_i caused by the changes to its input. It thus captures precisely the effect of cross-layer dependencies that could be problematic for training.

Figure 4: Analysis of the optimization landscape of VGG networks, showing (a) the loss landscape, (b) gradient predictiveness, and (c) "effective" β-smoothness. At a particular training step, we measure the variation (shaded region) in loss (a) and ℓ2 changes in the gradient (b) as we move in the gradient direction. The "effective" β-smoothness (c) refers to the maximum difference (in ℓ2-norm) in gradient over distance moved in that direction. There is a clear improvement in all of these measures in networks with BatchNorm, indicating a more well-behaved loss landscape. (Here, we cap the maximum distance to be η = 0.4× the gradient since for larger steps the standard network just performs worse (see Figure 1). BatchNorm, however, continues to provide smoothing for even larger distances.) Note that these results are supported by our theoretical findings (Section 4).

Equipped with this definition, we measure the extent of ICS with and without BatchNorm layers. To isolate the effect of non-linearities as well as gradient stochasticity, we also perform this analysis on (25-layer) deep linear networks (DLN) trained with full-batch gradient descent (see Appendix A for details). The conventional understanding of BatchNorm suggests that the addition of BatchNorm layers in the network should increase the correlation between G and G', thereby reducing ICS.

Surprisingly, we observe that networks with BatchNorm often exhibit an increase in ICS (cf. Figure 3). This is particularly striking in the case of DLN. In fact, in this case, the standard network experiences almost no ICS for the entirety of training, whereas for BatchNorm it appears that G and G' are almost uncorrelated. We emphasize that this is the case even though BatchNorm networks continue to perform drastically better in terms of attained accuracy and loss. (The stabilization of the BatchNorm VGG network later in training is an artifact of faster convergence.)
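Definition 2.1 can be made concrete on a toy two-layer linear network with closed-form gradients. This is a stand-in of our own, not the paper's experimental setup; it computes the ℓ2-difference and cosine angle reported in Figure 3:

```python
import numpy as np

def grads(W1, W2, X, Y):
    """Closed-form gradients of L = 0.5 * ||W2 @ W1 @ X - Y||^2, a toy stand-in
    for the deep (linear) networks studied in the paper."""
    R = W2 @ W1 @ X - Y                       # residual
    return W2.T @ R @ X.T, R @ (W1 @ X).T     # dL/dW1, dL/dW2

def ics_layer2(W1, W2, X, Y, lr=0.1):
    """ICS of the second layer per Definition 2.1: its gradient before (G) and
    after (G') the preceding layer takes its gradient step."""
    g1, G = grads(W1, W2, X, Y)
    _, G_prime = grads(W1 - lr * g1, W2, X, Y)
    l2_diff = np.linalg.norm(G - G_prime)
    cos = (G * G_prime).sum() / (np.linalg.norm(G) * np.linalg.norm(G_prime))
    return l2_diff, cos

rng = np.random.default_rng(2)
W1, W2 = rng.normal(size=(5, 5)), rng.normal(size=(3, 5))
X, Y = rng.normal(size=(5, 16)), rng.normal(size=(3, 16))
l2_diff, cos = ics_layer2(W1, W2, X, Y)
```

Zero ICS would mean l2_diff = 0 and cos = 1; the measurements in Figure 3 are exactly these quantities, computed per layer during actual training.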
This evidence suggests that, from an optimization point of view, BatchNorm might not even reduce internal covariate shift.

3 Why does BatchNorm work?

Our investigation so far demonstrated that the generally asserted link between internal covariate shift (ICS) and optimization performance is tenuous, at best. But BatchNorm does significantly improve the training process. Can we explain why this is the case?

Aside from reducing ICS, Ioffe and Szegedy [10] identify a number of additional properties of BatchNorm. These include prevention of exploding or vanishing gradients, robustness to different settings of hyperparameters such as learning rate and initialization scheme, and keeping most of the activations away from saturation regions of non-linearities. All these properties are clearly beneficial to the training process. But they are fairly simple consequences of the mechanics of BatchNorm and do little to uncover the underlying factors responsible for BatchNorm's success. Is there a more fundamental phenomenon at play here?

3.1 The smoothing effect of BatchNorm

Indeed, we identify the key impact that BatchNorm has on the training process: it reparametrizes the underlying optimization problem to make its landscape significantly more smooth. The first manifestation of this impact is improvement in the Lipschitzness² of the loss function. That is, the loss changes at a smaller rate and the magnitudes of the gradients are smaller too. There is, however, an even stronger effect at play. Namely, BatchNorm's reparametrization makes gradients of the loss more Lipschitz too.

²Recall that f is L-Lipschitz if |f(x1) − f(x2)| ≤ L||x1 − x2||, for all x1 and x2.
In other words, the loss exhibits a significantly better "effective" β-smoothness³.

These smoothening effects impact the performance of the training algorithm in a major way. To understand why, recall that in a vanilla (non-BatchNorm) deep neural network, the loss function is not only non-convex but also tends to have a large number of "kinks", flat regions, and sharp minima [17]. This makes gradient descent-based training algorithms unstable, e.g., due to exploding or vanishing gradients, and thus highly sensitive to the choice of the learning rate and initialization.

Now, the key implication of BatchNorm's reparametrization is that it makes the gradients more reliable and predictive. After all, improved Lipschitzness of the gradients gives us confidence that when we take a larger step in a direction of a computed gradient, this gradient direction remains a fairly accurate estimate of the actual gradient direction after taking that step. It thus enables any (gradient-based) training algorithm to take larger steps without the danger of running into a sudden change of the loss landscape such as a flat region (corresponding to vanishing gradients) or a sharp local minimum (causing exploding gradients). This, in turn, enables us to use a broader range of (and thus larger) learning rates (see Figure 10 in Appendix B) and, in general, makes the training significantly faster and less sensitive to hyperparameter choices. (This also illustrates how the properties of BatchNorm that we discussed earlier can be viewed as a manifestation of this smoothening effect.)

3.2 Exploration of the optimization landscape

To demonstrate the impact of BatchNorm on the stability of the loss itself, i.e., its Lipschitzness, for each given step in the training process, we compute the gradient of the loss at that step and measure how the loss changes as we move in that direction; see Figure 4(a).
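On a toy loss with a closed-form gradient (our own stand-in for a network's training loss), the three measurements of Figure 4, loss variation along the gradient direction, ℓ2 change in the gradient, and the resulting "effective" β-smoothness, can be sketched as:

```python
import numpy as np

def landscape_probe(loss, grad, w, steps=np.linspace(0.05, 0.4, 8)):
    """At point w, walk in the gradient direction and record (i) the range of the
    loss and (ii) the l2 change in the gradient. "Effective" beta-smoothness is the
    max gradient change divided by the distance moved in that direction."""
    g = grad(w)
    losses, grad_changes, betas = [], [], []
    for eta in steps:                         # step sizes eta, capped as in Figure 4
        w_new = w - eta * g
        losses.append(loss(w_new))
        dg = np.linalg.norm(grad(w_new) - g)
        grad_changes.append(dg)
        betas.append(dg / (eta * np.linalg.norm(g)))
    return max(losses) - min(losses), max(grad_changes), max(betas)

# Toy nonconvex loss as a stand-in for a network's training loss:
loss = lambda w: np.sum(w**4 - 2 * w**2)
grad = lambda w: 4 * w**3 - 4 * w
loss_range, grad_change, eff_beta = landscape_probe(loss, grad, np.array([1.5, -0.5, 2.0]))
```

A smoother landscape shows up as a smaller loss range, a smaller gradient change, and a smaller effective β, which is the pattern the BatchNorm networks exhibit in Figure 4.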
We see that, in contrast to the case when BatchNorm is in use, the loss of a vanilla, i.e., non-BatchNorm, network has a very wide range of values along the direction of the gradient, especially in the initial phases of training. (In the later stages, the network is already close to convergence.)

Similarly, to illustrate the increase in the stability and predictiveness of the gradients, we make analogous measurements for the ℓ2 distance between the loss gradient at a given point of training and the gradients corresponding to different points along the original gradient direction. Figure 4(b) shows a significant difference (close to two orders of magnitude) in such gradient predictiveness between the vanilla and BatchNorm networks, especially early in training.

To further demonstrate the effect of BatchNorm on the stability/Lipschitzness of the gradients of the loss, we plot in Figure 4(c) the "effective" β-smoothness of the vanilla and BatchNorm networks throughout the training. ("Effective" refers here to measuring the change of gradients as we move in the direction of the gradients.) Again, we observe consistent differences between these networks.

We complement the above examination by considering deep linear networks: as shown in Figures 9 and 12 in Appendix B, the BatchNorm smoothening effect is present there as well.

Finally, we emphasize that even though our explorations were focused on the behavior of the loss along the gradient directions (as they are the crucial ones from the point of view of the training process), the loss behaves in a similar way when we examine other (random) directions too.

3.3 Is BatchNorm the best (only?) way to smoothen the landscape?

Given this newly acquired understanding of BatchNorm and the roots of its effectiveness, it is natural to wonder: Is this smoothening effect a unique feature of BatchNorm?
Or could a similar effect be achieved using some other normalization schemes?

To answer this question, we study a few natural data statistics-based normalization strategies. Specifically, we study schemes that fix the first order moment of the activations, as BatchNorm does, and then normalize them by the average of their ℓp-norm (before shifting the mean), for p = 1, 2, ∞. Note that for these normalization schemes, the distributions of layer inputs are no longer Gaussian-like (see Figure 14). Hence, normalization with such an ℓp-norm no longer guarantees any control over the distribution moments or distributional stability.

³Recall that f is β-smooth if its gradient is β-Lipschitz. It is worth noting that, due to the existence of non-linearities, one should not expect the β-smoothness to be bounded in an absolute, global sense.

Figure 5: The two network architectures we compare in our theoretical analysis: (a) the vanilla DNN (no BatchNorm layer); (b) the same network as in (a) but with a BatchNorm layer inserted after the fully-connected layer W. (All the layer parameters have exactly the same value in both networks.)

The results are presented in Figures 13, 11 and 12 in Appendix B. We observe that all the normalization strategies offer comparable performance to BatchNorm. In fact, for deep linear networks, ℓ1-normalization performs even better than BatchNorm. Note that, qualitatively, the ℓp-normalization techniques lead to larger distributional shift (as considered in [10]) than the vanilla, i.e., unnormalized, networks, yet they still yield improved optimization performance. Also, all these techniques result in an improved smoothness of the landscape that is similar to the effect of BatchNorm. (See Figures 11 and 12 of Appendix B.)
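One possible reading of such an ℓp-normalization scheme is sketched below; the precise formulation the paper uses is specified in its appendix, so the details here should be treated as our own illustration:

```python
import numpy as np

def lp_normalize(y, p, eps=1e-5):
    """Center each activation over the batch, then divide by a batch-average
    l_p statistic of the centered activations (our reading of the scheme).
    p may be 1, 2, or np.inf."""
    centered = y - y.mean(axis=0)
    if np.isinf(p):
        scale = np.abs(centered).max(axis=0)                    # l_inf: max magnitude
    else:
        scale = (np.abs(centered) ** p).mean(axis=0) ** (1.0 / p)  # average l_p statistic
    return centered / (scale + eps)

rng = np.random.default_rng(3)
y = rng.normal(loc=2.0, scale=3.0, size=(128, 6))
out = {p: lp_normalize(y, p) for p in (1, 2, np.inf)}
```

As with BatchNorm, the first moment is fixed (to zero here), but for p ≠ 2 the result is not variance-normalized, so the first two moments are no longer both controlled, which is exactly the point of the comparison.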
This suggests that the positive impact of BatchNorm on training might be somewhat serendipitous. Therefore, it might be valuable to perform a principled exploration of the design space of normalization schemes, as it can lead to better performance.

4 Theoretical Analysis

Our experiments so far suggest that BatchNorm has a fundamental effect on the optimization landscape. We now explore this phenomenon from a theoretical perspective. To this end, we consider an arbitrary linear layer in a DNN (we do not require the entire network to be linear).

4.1 Setup

We analyze the impact of adding a single BatchNorm layer after an arbitrary fully-connected layer W at a given step during training. Specifically, we compare the optimization landscape of the original training problem to the one that results from inserting the BatchNorm layer after the fully-connected layer, normalizing the output of this layer (see Figure 5). Our analysis therefore captures effects that stem from the reparametrization of the landscape and not merely from normalizing the inputs x.

We denote the layer weights (identical for both the standard and batch-normalized networks) as W_ij. Both networks have the same arbitrary loss function L that could potentially include a number of additional non-linear layers after the current one. We refer to the loss of the normalized network as L̂ for clarity. In both networks, we have input x, and let y = Wx. For networks with BatchNorm, we have an additional set of activations ŷ, which are the "whitened" version of y, i.e., standardized to mean 0 and variance 1. These are then multiplied by γ and shifted by β to form z. We assume β and γ to be constants for our analysis.
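Since the analysis that follows tracks gradients as they pass through the normalization, it may help to recall the standard BatchNorm backward pass. The sketch below (our own construction, with a toy linear loss) checks the closed-form gradient, with its γ/σ prefactor and its two mean-subtraction terms, against finite differences:

```python
import numpy as np

def bn_forward(y, eps=1e-5):
    sigma = np.sqrt(y.var(axis=0) + eps)
    return (y - y.mean(axis=0)) / sigma, sigma     # whitened activations y_hat, and sigma

def bn_backward(g_z, y_hat, sigma, gamma):
    """Gradient w.r.t. the pre-BN outputs y, given the gradient g_z w.r.t.
    z = gamma * y_hat + beta. Note the gamma/sigma prefactor and the two
    mean-subtraction terms, which underlie the bounds of Section 4."""
    g_yhat = gamma * g_z
    return (g_yhat - g_yhat.mean(axis=0) - y_hat * (g_yhat * y_hat).mean(axis=0)) / sigma

gamma, beta = 1.5, 0.5                  # treated as constants, as in the setup above
rng = np.random.default_rng(4)
y = rng.normal(size=(16, 3))            # m = 16 examples, 3 activations
C = rng.normal(size=(16, 3))            # toy linear loss: L_hat(y) = sum(C * z)

def loss(y):
    y_hat, _ = bn_forward(y)
    return float(np.sum(C * (gamma * y_hat + beta)))

y_hat, sigma = bn_forward(y)
analytic = bn_backward(C, y_hat, sigma, gamma)   # for a linear loss, dL_hat/dz = C

# Central finite difference on one entry of y:
h, E = 1e-6, np.zeros_like(y)
E[2, 1] = h
numeric = (loss(y + E) - loss(y - E)) / (2 * h)
```

The two subtracted terms project out the components of the gradient along the all-ones direction and along ŷ itself, which is where the additive reductions in the bounds below come from.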
In terms of notation, we let σ_j denote the standard deviation (computed over the mini-batch) of a batch of outputs y_j ∈ R^m.

4.2 Theoretical Results

We begin by considering the optimization landscape with respect to the activations y_j. We show that batch normalization causes this landscape to be more well-behaved, inducing favorable properties in Lipschitz-continuity and predictability of the gradients. We then show that these improvements in the activation-space landscape translate to favorable worst-case bounds in the weight-space landscape.

We first turn our attention to the gradient magnitude ||∇_{y_j} L||, which captures the Lipschitzness of the loss. The Lipschitz constant of the loss plays a crucial role in optimization, since it controls the amount by which the loss can change when taking a step (see [21] for details). Without any assumptions on the specific weights or the loss being used, we show that the batch-normalized landscape exhibits a better Lipschitz constant. Moreover, the Lipschitz constant is significantly reduced whenever the activations ŷ_j correlate with the gradient ∇_{ŷ_j} L̂ or the mean of the gradient deviates from 0. Note that this reduction is additive, and has effect even when the scaling of BN is identical to the original layer scaling (i.e., even when σ_j = γ).

Theorem 4.1 (The effect of BatchNorm on the Lipschitzness of the loss). For a BatchNorm network with loss L̂ and an identical non-BN network with (identical) loss L,

    ||∇_{y_j} L̂||² ≤ (γ²/σ_j²) ( ||∇_{y_j} L||² − (1/m)⟨1, ∇_{y_j} L⟩² − (1/√m)⟨∇_{y_j} L, ŷ_j⟩² ).

First, note that ⟨1, ∂L/∂y⟩² grows quadratically in the dimension, so the middle term above is significant. Furthermore, the final inner product term is expected to be bounded away from zero, as the gradient with respect to a variable is rarely uncorrelated to the variable itself. In addition to the additive reduction, σ_j tends to be large in practice (cf. Appendix Figure 8), and thus the scaling by γ/σ may contribute to the relative "flatness" we see in the effective Lipschitz constant.

We now turn our attention to the second-order properties of the landscape. We show that when a BatchNorm layer is added, the quadratic form of the loss Hessian with respect to the activations in the gradient direction is both rescaled by the input variance (inducing resilience to mini-batch variance) and decreased by an additive factor (increasing smoothness). This term captures the second order term of the Taylor expansion of the gradient around the current point. Therefore, reducing this term implies that the first order term (the gradient) is more predictive.

Theorem 4.2 (The effect of BN on smoothness). Let ĝ_j = ∇_{y_j} L and H_jj = ∂²L/(∂y_j ∂y_j) be the gradient and Hessian of the loss with respect to the layer outputs respectively.
Then

    (∇_{y_j} L̂)ᵀ (∂²L̂/(∂y_j ∂y_j)) (∇_{y_j} L̂) ≤ (γ²/σ²) ( (∂L̂/∂y_j)ᵀ H_jj (∂L̂/∂y_j) − (γ/(mσ²)) ⟨ĝ_j, ŷ_j⟩ ||∂L̂/∂y_j||² ).

If we also have that H_jj preserves the relative norms of ĝ_j and ∇_{y_j} L̂,

    (∇_{y_j} L̂)ᵀ (∂²L̂/(∂y_j ∂y_j)) (∇_{y_j} L̂) ≤ (γ²/σ²) ( ĝ_jᵀ H_jj ĝ_j − (1/(mγ)) ⟨ĝ_j, ŷ_j⟩ ||∂L̂/∂y_j||² ).

Note that if the quadratic forms involving the Hessian and the inner product ⟨ŷ_j, ĝ_j⟩ are non-negative (both fairly mild assumptions), the theorem implies more predictive gradients. The Hessian is positive semi-definite (PSD) if the loss is locally convex, which is true for the case of deep networks with piecewise-linear activation functions and a convex loss at the final layer (e.g., standard softmax cross-entropy loss or other common losses). The condition ⟨ŷ_j, ĝ_j⟩ > 0 holds as long as the negative gradient ĝ_j is pointing towards the minimum of the loss (w.r.t. normalized activations). Overall, as long as these two conditions hold, the steps taken by the BatchNorm network are more predictive than those of the standard network (similarly to what we observed experimentally).

Note that our results stem from the reparametrization of the problem and not a simple scaling.

Observation 4.3 (BatchNorm does more than rescaling).
Observation 4.3 (BatchNorm does more than rescaling). For any input data X and network configuration W, there exists a BN configuration (W, γ, β) that results in the same activations yj, and where γ = σj. Consequently, all of the minima of the normal landscape are preserved in the BN landscape.

Our theoretical analysis so far studied the optimization landscape of the loss w.r.t. the normalized activations. We will now translate these bounds to a favorable worst-case bound on the landscape with respect to the layer weights. Note that a (near exact) analogue of this theorem for minimax gradient predictiveness appears in Theorem C.1 of Appendix C.

Theorem 4.4 (Minimax bound on weight-space Lipschitzness). For a BatchNorm network with loss L̂ and an identical non-BN network (with identical loss L), if

\[
g_j = \max_{\|X\|\le\lambda} \big\|\nabla_W L\big\|^2, \qquad
\hat{g}_j = \max_{\|X\|\le\lambda} \big\|\nabla_W \hat{L}\big\|^2
\implies
\hat{g}_j \le \frac{\gamma^2}{\sigma_j^2}\left(g_j^2 - m\mu_{g_j}^2 - \lambda^2\big\langle \nabla_{y_j}L, \hat{y}_j\big\rangle^2\right).
\]

Finally, in addition to a desirable landscape, we find that BN also offers an advantage in initialization:

Lemma 4.5 (BatchNorm leads to a favourable initialization). Let W∗ and Ŵ∗ be the sets of local optima for the weights in the normal and BN networks, respectively. For any initialization W0,

\[
\big\|W_0 - \widehat{W}^*\big\|^2 \le \big\|W_0 - W^*\big\|^2 - \frac{1}{\|W^*\|^2}\big(\|W^*\|^2 - \langle W^*, W_0\rangle\big)^2,
\]

if ⟨W0, W∗⟩ > 0, where Ŵ∗ and W∗ are the closest optima for the BN and standard networks, respectively.
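The intuition behind this lemma is the scale-invariance that BatchNorm introduces: if W∗ is an optimum of the BN landscape, so is αW∗ for every α > 0, and the BN optimum closest to W0 is the projection of W0 onto the ray through W∗. A short NumPy sketch (our illustration under this scale-invariance assumption, not the paper's proof) verifies the lemma's bound, which for this construction holds with equality:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
W0 = rng.normal(size=d)                # initialization
Wstar = rng.normal(size=d)             # an optimum of the standard network
if np.dot(W0, Wstar) < 0:              # lemma assumes <W0, W*> > 0
    Wstar = -Wstar

# Under scale-invariance every alpha * Wstar (alpha > 0) is a BN optimum;
# the one closest to W0 is the projection of W0 onto the ray through Wstar.
alpha = np.dot(W0, Wstar) / np.dot(Wstar, Wstar)
Wstar_bn = alpha * Wstar

lhs = np.linalg.norm(W0 - Wstar_bn) ** 2
rhs = (np.linalg.norm(W0 - Wstar) ** 2
       - (np.linalg.norm(Wstar) ** 2 - np.dot(Wstar, W0)) ** 2
       / np.linalg.norm(Wstar) ** 2)

assert lhs <= rhs + 1e-9               # the lemma's bound
assert np.isclose(lhs, rhs)            # equality for this construction
```

Expanding both sides shows they equal ‖W0‖² − ⟨W0, W∗⟩²/‖W∗‖², so the BN reparametrization can only bring the closest optimum nearer to the initialization.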
5 Related work

A number of normalization schemes have been proposed as alternatives to BatchNorm, including normalization over layers [1], over subsets of the batch [31], or across image dimensions [30]. Weight Normalization [24] follows a complementary approach, normalizing the weights instead of the activations. Finally, ELU [3] and SELU [13] are two proposed examples of non-linearities that have a progressively decaying slope instead of a sharp saturation and can be used as an alternative to BatchNorm. These techniques offer an improvement over standard training that is comparable to that of BatchNorm, but they do not attempt to explain BatchNorm's success.

Additionally, work on topics related to DNN optimization has uncovered a number of other BatchNorm benefits. Im et al. [9] observe that networks with BatchNorm tend to have optimization trajectories that rely less on the parameter initialization. Balduzzi et al. [2] observe that models without BatchNorm tend to suffer from small correlation between different gradient coordinates and/or unit activations. They report that this behavior is more pronounced in deeper models and argue that it constitutes an obstacle to DNN optimization. Morcos et al. [19] focus on the generalization properties of DNNs.
They observe that the use of BatchNorm results in models that rely less on single directions in the activation space, which they find to be connected to the generalization properties of the model.

Recent work [14] identifies simple, concrete settings where a variant of training with BatchNorm provably improves over standard training algorithms. The main idea is that decoupling the length and direction of the weights (as done in BatchNorm and Weight Normalization [24]) can be exploited to a large extent. By designing algorithms that optimize these parameters separately, with (different) adaptive step sizes, one can achieve significantly faster convergence rates for these problems.

6 Conclusions

In this work, we have investigated the roots of BatchNorm's effectiveness as a technique for training deep neural networks. We find that the widely believed connection between the performance of BatchNorm and internal covariate shift is tenuous, at best. In particular, we demonstrate that the existence of internal covariate shift, at least when viewed from the generally adopted distributional-stability perspective, is not a good predictor of training performance. Also, we show that, from an optimization viewpoint, BatchNorm might not even be reducing that shift.

Instead, we identify a key effect that BatchNorm has on the training process: it reparametrizes the underlying optimization problem to make it more stable (in the sense of loss Lipschitzness) and smooth (in the sense of "effective" β-smoothness of the loss).
This implies that the gradients used in training are more predictive and well-behaved, which enables faster and more effective optimization. This phenomenon also explains and subsumes some of the other previously observed benefits of BatchNorm, such as robustness to hyperparameter settings and the avoidance of gradient explosion/vanishing. We also show that this smoothing effect is not unique to BatchNorm. In fact, several other natural normalization strategies have a similar impact and result in a comparable performance gain.

We believe that these findings not only challenge the conventional wisdom about BatchNorm but also bring us closer to a better understanding of this technique. We also view these results as an opportunity to encourage the community to pursue a more systematic investigation of the algorithmic toolkit of deep learning and the underpinnings of its effectiveness.

Finally, our focus here was on the impact of BatchNorm on training, but our findings might also shed some light on BatchNorm's tendency to improve generalization. Specifically, it could be the case that the smoothing effect of BatchNorm's reparametrization encourages the training process to converge to flatter minima. Such minima are believed to facilitate better generalization [8, 11]. We hope that future work will investigate this intriguing possibility.

Acknowledgements

We thank Ali Rahimi and Ben Recht for helpful comments on a preliminary version of this paper. Shibani Santurkar was supported by the National Science Foundation (NSF) under grants IIS-1447786, IIS-1607189, and CCF-1563880, and the Intel Corporation. Dimitris Tsipras was supported in part by the NSF grant CCF-1553428 and the NSF Frontier grant CNS-1413920. Andrew Ilyas was supported in part by NSF awards CCF-1617730 and IIS-1741137, a Simons Investigator Award, a Google Faculty Research Award, and an MIT-IBM Watson AI Lab research grant.
Aleksander Mądry was supported in part by an Alfred P. Sloan Research Fellowship, a Google Research Award, and the NSF grants CCF-1553428 and CNS-1815221.

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[2] David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? arXiv preprint arXiv:1702.08591, 2017.

[3] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[4] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[5] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.

[6] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[8] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.

[9] Daniel Jiwoong Im, Michael Tao, and Kristin Branson. An empirical analysis of deep network loss surfaces. arXiv preprint arXiv:1612.04010, 2016.

[10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015.

[11] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 972–981, 2017.

[14] Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Ming Zhou, Klaus Neymeyr, and Thomas Hofmann. Towards a theoretical understanding of batch normalization. arXiv preprint arXiv:1805.10694, 2018.

[15] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[17] Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.

[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[19] Ari S. Morcos, David G. T. Barrett, Neil C. Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959, 2018.

[20] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.

[21] Yurii Nesterov.
Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

[22] Ali Rahimi and Ben Recht. Back when we were kids. In NIPS Test-of-Time Award Talk, 2017.

[23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3), 2015.

[24] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, 2016.

[25] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[27] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 2014.

[28] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 2013.

[29] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[30] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization.
arXiv preprint arXiv:1607.08022, 2016.

[31] Yuxin Wu and Kaiming He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.