{"title": "SGD on Neural Networks Learns Functions of Increasing Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 3496, "page_last": 3506, "abstract": "We perform an experimental study of the dynamics of Stochastic Gradient Descent (SGD) in learning deep neural networks for several real and synthetic classification tasks.\nWe show that in the initial epochs, almost all of the performance improvement of the classifier obtained by SGD can be explained by a linear classifier.\nMore generally, we give evidence for the hypothesis that, as iterations progress,  SGD learns functions of increasing complexity. This hypothesis can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime.\nWe also show that the linear classifier learned in the initial stages is ``retained'' throughout the execution even if training is continued to the point of zero training error, and complement this with a theoretical result in a simplified model.\nKey to our work is a new measure of\nhow well one classifier explains the performance of another, based on conditional mutual information.", "full_text": "SGD on Neural Networks Learns\n\nFunctions of Increasing Complexity\n\nPreetum Nakkiran\nHarvard University\n\nGal Kaplun\n\nHarvard University\n\nDimitris Kalimeris\nHarvard University\n\nTristan Yang\n\nHarvard University\n\nBenjamin L. Edelman\n\nHarvard University\n\nFred Zhang\n\nHarvard University\n\nBoaz Barak\n\nHarvard University\u2217\n\nAbstract\n\nWe perform an experimental study of the dynamics of Stochastic Gradient Descent\n(SGD) in learning deep neural networks for several real and synthetic classi\ufb01cation\ntasks. 
We show that in the initial epochs, almost all of the performance improve-\nment of the classi\ufb01er obtained by SGD can be explained by a linear classi\ufb01er.\nMore generally, we give evidence for the hypothesis that, as iterations progress,\nSGD learns functions of increasing complexity. This hypothesis can be helpful in\nexplaining why SGD-learned classi\ufb01ers tend to generalize well even in the over-\nparameterized regime. We also show that the linear classi\ufb01er learned in the initial\nstages is \u201cretained\u201d throughout the execution even if training is continued to the\npoint of zero training error, and complement this with a theoretical result in a\nsimpli\ufb01ed model. Key to our work is a new measure of how well one classi\ufb01er\nexplains the performance of another, based on conditional mutual information.\n\n1\n\nIntroduction\n\nNeural networks have been extremely successful in modern machine learning, achieving the state-\nof-the-art in a wide range of domains, including image-recognition, speech-recognition, and game-\nplaying [14, 18, 23, 37]. Practitioners often train deep neural networks with hundreds of layers\nand millions of parameters and manage to \ufb01nd networks with good out-of-sample performance.\nHowever, this practical prowess is accompanied by feeble theoretical understanding. In particular,\nwe are far from understanding the generalization performance of neural networks\u2014why can we\ntrain large, complex models on relatively few training examples and still expect them to generalize\nto unseen examples? It has been observed in the literature that the classical generalization bounds\nthat guarantee small generalization gap (i.e., the gap between train and test error) in terms of VC\ndimension or Rademacher complexity do not yield meaningful guarantees in the context of real\nneural networks. 
More concretely, for most if not all real-world settings, there exist neural networks\nwhich \ufb01t the train set exactly, but have arbitrarily bad test error [41].\n\nThe existence of such \u201cbad\u201d empirical risk minimizers (ERMs) with large gaps between the train\nand test error means that the generalization performance of deep neural networks depends on the\nparticular algorithm (and initialization) used in training, which is most often stochastic gradient\ndescent (SGD). It has been conjectured that SGD provides some form of \u201cimplicit regularization\u201d by\noutputting \u201clow complexity\u201d models, but it is safe to say that the precise notion of complexity and\nthe mechanism by which this happens are not yet understood (see related works below).\n\n\u2217preetum@cs.harvard.edu, galkaplun@g.harvard.edu, kalimeris@g.harvard.edu,\n\ntristanyang@college.harvard.edu, bedelman@g.harvard.edu, hzhang@g.harvard.edu,\nb@boazbarak.org\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Left: An illustration of our hypothesis of how SGD dynamics progress. Initially, all\nprogress in learning can be attributed to a \u201csimple\u201d classi\ufb01er (in some precise sense to be later\nde\ufb01ned), then SGD continues in learning more complex but still meaningful classi\ufb01ers. Finally, the\nclassi\ufb01er will interpolate the training data, while retaining correlation with simpler classi\ufb01ers that\nallows it to generalize. Right: A plot of how the decision boundary evolves as a neural network\nis trained for a simple classi\ufb01cation task. The data distribution is uniform in a 2-dimensional ball\nof radius 1, labeled by a sinusoidal curve with 10% label noise. 
It is evident that an almost linear\ndecision boundary emerges in the \ufb01rst phases of training before more complex classi\ufb01ers are learned.\nIn the last stages, the network over\ufb01ts to the label noise, while still retaining the concept.\n\nIn this paper, we provide evidence for this hypothesis and shed some light on how it comes about.\nSpeci\ufb01cally, our thesis is that the dynamics of SGD play a crucial role and that SGD \ufb01nds generalizing\nERMs because:\n\n(i) In the initial epochs of learning, SGD has a bias towards simple classi\ufb01ers as opposed to\n\ncomplex ones; and\n\n(ii) in later epochs, SGD is relatively stable and retains the information from the simple classi\ufb01er\n\nit obtained in the initial epochs.\n\nFigure 1 illustrates qualitatively the predictions of this thesis for the dynamics of SGD over time. In\nthis work, we give experimental and theoretical evidence for both parts of this thesis. While several\nquantitative measures of complexity of neural networks have been proposed in the past, including\nthe classic notions of VC dimension, Rademacher complexity and margin [2, 6, 20, 32, 22, 5], we\ndo not propose such a measure here. Our focus is on the qualitative question of how much of SGD\u2019s\nearly progress in learning can be explained by simple models. Our main \ufb01ndings are the following:\n\nClaim 1 (Informal). In natural settings, the initial performance gains of SGD on a randomly ini-\ntialized neural network can be attributed almost entirely to its learning a function correlated with a\nlinear classi\ufb01er of the data.\n\nClaim 2 (Informal). 
In natural settings, once SGD \ufb01nds a simple classi\ufb01er with good generalization,\nit is likely to retain it, in the sense that it will perform well on the fraction of the population classi\ufb01ed\nby the simple classi\ufb01er, even if training continues until it \ufb01ts all training samples.\n\nWe state these claims broadly, using \u201cin natural settings\u201d to refer to settings of network architecture,\ninitialization, and data distributions that are used in practice. We emphasize that this holds for vanilla\nSGD with standard architecture and random initialization, without using any regularization, dropout,\nearly stopping or other explicit methods of biasing towards simplicity.\n\nSome indications for variants of Claim 2 have been observed in practice, but we provide further\nexperimental evidence and also show (Theorem 1) a simple setting where it provably holds. Our\nmain novelty is Claim 1, which is established via several experiments described in Sections 3 and 4.\nWe emphasize that our claims do not imply that during early stages of training the decision boundary\nis linear, but rather that there often exists a linear classi\ufb01er whose correct predictions highly agree\nwith the network\u2019s correct predictions. The decision boundary itself may be very complex.2\n\n2Figure 6 in Appendix C provides a simple illustration of this phenomenon.\n\n2\n\n\fFigure 2: Beyond linear classi\ufb01ers. The\ntwo phases of SGD learning in Figure 1 can\nbe broken into several sub-phases. Phase i\ninvolves learning classi\ufb01ers of lower \u201ccom-\nplexity\u201d than phase i+1. The precise notion\nof complexity may be algorithm, initializa-\ntion and architecture-dependent. 
In practice,\nwe expect that the phases will not be com-\npletely disjoint and some learning of classi-\n\ufb01ers of differing complexity will co-occur\nat the same time.\n\nThe other core contribution of this paper is a novel formulation of a mutual-information based\nmeasure to quantify how much of the prediction success of the neural network produced by SGD\ncan be attributed to a simple classi\ufb01er. We believe this measure is of independent interest.\n\nRemark 1 (Beyond linear classi\ufb01ers). While our main \ufb01ndings relate to linear classi\ufb01ers, our\nmethodology extends beyond this. We conjecture that generally, the dynamics of SGD are such that\nit initially learns simpler components of its \ufb01nal classi\ufb01er, and retains these as it continues to learn\nmore and more complex parts (see Figure 2). We provide evidence for this conjecture in Section 4.\n\nRemark 2 (Beyond binary classi\ufb01cation). This paper is focused on binary classi\ufb01cation tasks but our\nmutual-information based de\ufb01nitions and methodology can be extended to multi-class classi\ufb01cation.\nPreliminary results suggest that our results continue to hold.\n\nRelated Work. There is a substantial body of work that attempts to understand the generalization\nof (deep) neural networks, tackling the problem from different perspectives. Previous works by\nHardt et. al. (2016) and Kuzborskij & Lampert (2017) [17, 24] argue that generalization is due to\nstability. Neyshabur et. al. (2015); Keskar et. al. (2016); Bartlett et. al. (2016) consider margin-\nbased approaches [32, 22, 5], while Dziugaite & Roy (2017); Neyshabur et. al. (2017); Neyshabur\net. al. (2018); Golowich et. al. (2018); P\u00e9rez et. al. (2019); Zhou et. al. (2019) focus on PAC-\nBayes analysis and norm-based bounds [10, 31, 30, 12, 34, 42]. Arora et. al. 
(2018) [3] propose a\ncompression-based approach.\n\nThe implicit bias of (stochastic) gradient descent was also studied in various contexts, including\nlinear classi\ufb01cation, matrix factorization and neural networks. This includes the works of Brutzkus et.\nal. (2017); Gunasekar et. al. (2017); Soudry et. al. (2018); Gunasekar et. al. (2018); Li et. al. (2018);\nWu et. al. (2019) and Ji & Telgarsky (2019) [9, 16, 38, 15, 26, 39, 21]. There are also recent works\nproving generalization of overparameterized networks, by analyzing the speci\ufb01c behavior of SGD\nfrom random initialization [1, 8, 25]. These results are so far restricted to simpli\ufb01ed settings.\n\nSeveral prior works propose measures of the complexity of neural networks, and claim that training\ninvolves learning simple patterns [4, 40, 35, 33]. However, our formalization has many advantages\nover prior formalizations. A key difference is that our measures are intrinsic to the classi\ufb01cation\nfunction and data-distribution (and do not depend on the representation of the classi\ufb01er, or its\nbehavior outside the data distribution). Moreover, our measures address the extent by which one\nclassi\ufb01er \u201cexplains\u201d the performance of another. Finally, our metrics are tractable to estimate in high\ndimensions, and are experimentally demonstrated for real-world distributions.\n\nMost similar to our work is a concurrent work by Mangalam and Prabhu that also experimentally\ndemonstrates that neural networks trained with SGD \ufb01rst learn to be able to classify examples that\nare learnable by simpler models. Their focus is on the complexity of the examples, not the learned\nfunctions, and their metrics are different.\n\nThe concept of mutual information has also been used in the study of neural networks, though in\ndifferent ways than ours. 
For example, Shwartz-Ziv and Tishby (2017) [36] use it to argue that a network compresses information, saving only the most meaningful representation of the input.

Paper Organization. We begin by defining our mutual-information based formalization of Claims 1 and 2 in Section 2. In Section 3, we establish the main result of the paper: that for many synthetic and real data sets, the performance of neural networks in the early phase of training is well explained by a linear classifier. In Section 4, we investigate extensions to non-linear classifiers (see also Remark 1). We make the case that as training proceeds, SGD moves beyond this "linear learning" regime, and learns concepts of increasing complexity. In Section 5 we focus on the overfitting regime. We provide a simple theoretical setting where, provably, if we start from a "simple" generalizable solution, then overfitting to the train set will not hurt generalization. Moreover, the overfit classifier retains the information from the initial classifier. Finally, in Section 6 we discuss future directions.

2 Performance Correlation via Mutual Information

In this section, we present our measures for the contribution of a "simple classifier" to the performance of a "more complex" one. 
This allows us to state what it means for the performance of a\nneural network to be \u201calmost entirely explained by a linear classi\ufb01er\u201d, formalizing Claims 1 and 2.\n\n2.1 Notation and Preliminaries\n\nKey to our formalism are the quantities of mutual information and conditional mutual information.\nRecall that for three random variables X, Y, Z, the mutual information between X and Y is de\ufb01ned as\nI(X; Y ) = H(Y )\u2212H(Y |X) and the conditional mutual information between X and Y conditioned\non Z is de\ufb01ned as I(X; Y |Z) = H(Y |Z) \u2212 H(Y |X, Z), where H is the (conditional) entropy.\nWe consider a joint distribution (X, Y ) on data and labels\n(X ,Y) \u2286 Rd \u00d7 {0, 1}. For a classi\ufb01er f : X \u2192 Y we\nuse the capital letter F to denote the random variable f (X).\nWhile the standard measure of prediction success is the accu-\nracy P[F = Y ], we use the mutual information I(F ; Y ) in-\nstead. This makes no qualitative difference since the two are\nmonotonically related (see Figure 3). In all plots, we plot the\ncorresponding accuracy axis on the right for ease of use. We\nuse (XS, YS) for the empirical distribution over the training\nset, and use I(F ; YS) (with a slight abuse of notation) for the\nmutual information between f (XS) and YS, a proxy for f \u2019s\nsuccess on the training set.\n\nFigure 3: I(F ; Y ) as a function\nof P[F = Y ] for unbiased binary\nF, Y s.t. P[F = Y ] \u2265 1/2.\n\n2.2 Performance correlation\n\nIf f and \u2113 are classi\ufb01ers, and as above F and L are the corresponding random variables, then the\nchain rule for mutual information implies that3\n\nI(F ; Y ) = I(F ; Y |L) + I(L; Y ) \u2212 I(L; Y |F ).\n\nWe interpret the quantity I(F ; Y |L) as capturing the part of the success of f on predicting Y that\ncannot be explained by the classi\ufb01er \u2113. For example, I(F ; Y |L) = 0 if and only if the prediction\nf (X) is conditionally independent of the label Y , when given \u2113(X). 
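These quantities are straightforward to estimate for binary classifiers from a finite sample via plug-in (empirical) entropies. A minimal sketch, in bits (the code and names here are ours for illustration, not from the paper):

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in entropy H of an empirical sample, in bits."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def mutual_info(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def cond_mutual_info(xs, ys, zs):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(zs))

def perf_corr(fs, ls, ys):
    """Performance correlation mu_Y(F; L) = I(F; Y) - I(F; Y | L)."""
    return mutual_info(fs, ys) - cond_mutual_info(fs, ys, ls)
```

Because these are identities between entropies, the plug-in estimates satisfy the chain-rule decomposition above exactly; only the entropies themselves carry sampling error.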
In general, I(F ; Y |L) is the amount by which knowing f (X) helps in predicting Y , given that we already know ℓ(X). Based on this interpretation, we introduce the following definition:

Definition 1. For random variables F, L, Y we define the performance correlation of F and L as

μY (F ; L) := I(F ; Y ) − I(F ; Y |L) = I(L; Y ) − I(L; Y |F ) = I(F ; L) − I(F ; L|Y ).

The performance correlation is always upper bounded by the minimum of I(L; Y ), I(F ; Y ), and I(F ; L).4 If μY (F ; L) = I(F ; Y ) then I(F ; Y |L) = 0, which means that f does not help in predicting Y if we already know ℓ. Hence, when ℓ is a "simpler" model than f , we consider μY (F ; L) as denoting the part of F 's performance that can be attributed to ℓ.5

3 Specifically, the equation can be derived by using the chain rule I(A, B; C) = I(B; C|A) + I(A; C) to express I(F, L; Y ) as both I(F ; Y |L) + I(L; Y ) and I(L; Y |F ) + I(F ; Y ).
4 The quantity μY (F ; L) can also be thought of as a multivariate generalization of mutual information [28, 7].
5 This interpretation is slightly complicated by the fact that, like correlation, μY (F ; L) can sometimes be negative. However, this quantity is always non-negative under various weak assumptions which hold in practice, e.g. when both F and L have significant test accuracy, or when H(Y |F, L) ≥ min{H(Y |F ), H(Y |L)}.

The reason why we use μY (F ; L) instead of simply I(F ; L) is the following. While it is true that μY (F ; L) ≤ I(F ; L), μY captures the degree to which the information learned by F about Y is explained by L, whereas I(F ; L) only captures the correlation of F and L, regardless of whether this correlation is useful for predicting Y or not. For example, consider a scenario where F (x) = L(x) · Bernoulli(p). That is, F is a linear classifier L with noisy outputs. 
Here, I(F ; L) \u226a 1, due to\nthe noise in F . Hence we might infer that F does not agree with L. However, \u00b5Y (F ; L) = I(F ; Y ),\ni.e. our metric recovers the fact that all the performance of F in predicting Y is explained by L.\n\nThroughout this paper, we denote by ft the classi\ufb01er SGD outputs on a randomly-initialized neural\nnetwork after t gradient steps, and denote by Ft the corresponding random variable ft(X). We now\nformalize Claim 1 and Claim 2:\n\nClaim 1 (\u201cLinear Learning\u201d, Restated). In natural settings, there is a linear classi\ufb01er \u2113 and a step\nnumber T0 such that for all t \u2264 T0, \u00b5Y (Ft; L) \u2248 I(Ft; Y ). That is, almost all of ft\u2019s performance\nis explained by \u2113. Furthermore at T0, I(FT0 ; Y ) \u2248 I(L; Y ). That is, this initial phase lasts until ft\napproximately matches the performance of \u2113.\nClaim 2 (Restated) . In natural settings, for t > T0, \u00b5Y (Ft; L) plateaus at value \u2248 I(L; Y ) and\ndoes not shrink signi\ufb01cantly even if training continues until SGD \ufb01ts all the training set.\n\n3 SGD Learns a Linear Model First\n\nIn this section, we provide experimental evidence for Claim 1\u2014the \ufb01rst phase of SGD is dominated\nby \u201clinear learning\u201d\u2014and Claim 2\u2014at later stages SGD retains information from early phases. We\ndemonstrate these claims by evaluating our information-theoretic measures empirically on real and\nsimulated classi\ufb01cation tasks.\n\nExperimental Setup. We provide a brief description of our experimental setup here; a full descrip-\ntion is provided in Appendix B. We consider the following binary classi\ufb01cation tasks 6:\n\n(i) Binary MNIST: predict whether the image represents a number from 0 to 4 or from 5 to 9.\n\n(ii) CIFAR-10 Animals vs Objects: predict whether the image represents an animal or an object.\n(iii) CIFAR-10 First 5 vs Last 5: predict whether the image is in classes {0 . . . 4} or {5 . . . 
9}.

(iv) High-dimensional sinusoid: predict y := sign(⟨w, x⟩ + sin⟨w′, x⟩) for standard Gaussian x ∈ R^100, and w ⊥ w′.

We train neural networks with standard architectures: CNNs for image-recognition tasks and Multi-layer Perceptrons (MLPs) for the other tasks. We use standard uniform Xavier initialization [11] and we train with binary cross-entropy loss. In all experiments, we use vanilla SGD without regularization (e.g., dropout, weight decay) for simplicity and consistency. (Preliminary experiments suggest our results are robust with respect to these choices.) We use a relatively small step-size for SGD, in order to more closely examine the early phase of training.

In all of our experiments, we compare the classifier ft output by SGD to a linear classifier ℓ. If the population distribution has a unique optimal linear classifier ℓ∗ then we can use ℓ = ℓ∗. This is the case in tasks (i), (ii), and (iv). If there are different linear classifiers that perform equally well (task (iii)), then the classifier learned in the first stage could depend on the initialization. In this case, we pick ℓ by searching for the linear classifier that best fits fT0, where T0 is the step at which I(Ft; Y ) reaches the best linear performance maxL′ I(L′; Y ). In either case, it is a highly non-trivial fact that there is any linear classifier that accounts for the bulk of the performance of the SGD-produced classifier ft.

Results and Discussion. The results of our experiments are presented in Figure 4. We observe the following similar behaviors across several architectures and datasets:

Define the first phase of training as all steps t ≤ T0, where T0 is the first SGD step such that the network's performance I(Ft; Y ) reaches the linear model's performance I(L; Y ). 
Now:

6 We focus on binary classification because: (1) there is a natural choice for the "simplest" model class (i.e., linear models), and (2) our mutual-information based metrics can be more accurately estimated from samples. We have preliminary work extending our results to the multi-class setting.

Figure 4: SGD dynamics for various classification tasks. In each figure, we plot both the value of the mutual information and the corresponding accuracy. Observe that in the initial phases the bulk of the increase in performance is attributed to the linear classifier, since μY (Ft; L) ≈ I(Ft; Y ).

1. During the first phase of training, μY (Ft; L) is close to I(Ft; Y ); thus, most of the performance of Ft can be attributed to ℓ. In fact, we can often pick ℓ such that I(L; Y ) is close to maxL′ I(L′; Y ), the performance of the best linear classifier for the distribution. In this case, the fact that μY (FT0 ; L) ≈ I(FT0 ; Y ) ≈ maxL′ I(L′; Y ) means that SGD not only starts learning a linear model, but remains in the "linear learning" regime until it has learnt almost the best linear classifier. Beyond this point, the model Ft cannot increase in performance without learning more non-linear aspects of Y.

2. In the following epochs, for t > T0, μY (Ft; L) plateaus around I(L; Y ). This means that Ft retains its correlation with L, which keeps explaining as much of Ft's generalization performance as possible.

Observation (1) provides strong support for Claim 1. 
Since neural networks are a richer class than linear classifiers, a priori one might expect that throughout the learning process, some of the growth in the mutual information between the label Y and the classifier's output Ft will be attributable to the linear classifier, and some of this growth will be attributable to a more complex classifier. However, what we observe is a relatively clean (though not perfect) separation of the learning process: in the initial phase, all of the mutual information between Ft and Y disappears if we condition on L.

To understand this result's significance, it is useful to contrast it with a "null model" where we replace the linear classifier ℓ by a random classifier ℓ̃ having the same mutual information with Y as ℓ.7 Now, consider the ratio μY (FT0 ; L̃)/I(FT0 ; Y ) at the end of the first phase. It can be shown that this ratio is small, meaning that the performance of Ft is not well explained by L̃. However, in our case with a linear classifier, this ratio is much closer to 1 at the end of the first phase. For example, for CIFAR (iii), the linear model L has μY (FT0 ; L)/I(FT0 ; Y ) = 0.80 while the corresponding null model L̃ has ratio μY (FT0 ; L̃)/I(FT0 ; Y ) = 0.31. This illustrates that the early stage of learning is biased specifically towards linear functions, and not towards arbitrary functions with non-trivial accuracy. Similar metrics for all datasets are reported in Table 2 in the Appendix.

Observation (2) can be seen as offering support for Claim 2. If SGD "forgets" the linear model as it continues to fit the training examples, then we would expect the value of μY (Ft; L) to shrink with time. However, this does not occur. 
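The null-model comparison above is easy to reproduce on synthetic labels. A sketch (the 80%/90% accuracies and all names are our illustrative choices; ℓ̃ follows the construction described in footnote 7):

```python
import random
from collections import Counter
from math import log2

def ent(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mi(xs, ys):                      # plug-in I(X; Y)
    return ent(xs) + ent(ys) - ent(list(zip(xs, ys)))

def cmi(xs, ys, zs):                 # plug-in I(X; Y | Z)
    return (ent(list(zip(xs, zs))) + ent(list(zip(ys, zs)))
            - ent(list(zip(xs, ys, zs))) - ent(zs))

def mu(fs, ls, ys):                  # performance correlation mu_Y(F; L)
    return mi(fs, ys) - cmi(fs, ys, ls)

random.seed(0)
n = 50_000
ys = [random.randint(0, 1) for _ in range(n)]
# A "linear" model with 80% accuracy, and a "network" built partly on top of it.
ls = [y if random.random() < 0.8 else 1 - y for y in ys]
fs = [l if random.random() < 0.9 else y for l, y in zip(ls, ys)]
# Null model: outputs Y w.p. 0.6, else a fair coin -> also 80% accurate, so it
# has the same mutual information with Y as ls, but is unrelated to fs given ys.
lt = [y if random.random() < 0.6 else random.randint(0, 1) for y in ys]

ratio_linear = mu(fs, ls, ys) / mi(fs, ys)
ratio_null = mu(fs, lt, ys) / mi(fs, ys)
print(round(ratio_linear, 2), round(ratio_null, 2))
```

As in the paper's CIFAR comparison, the ratio for the correlated model comes out much closer to 1 than the ratio for an arbitrary classifier of the same accuracy (the exact numbers depend on the accuracies chosen here).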
Since the linear classifier itself would generalize, this explains at least part of the generalization performance of Ft. To fully explain the generalization performance, we would need to extend this theory to models more complex than linear; some preliminary investigations are given in Section 4.

7 That is, ℓ̃(X) = Y with probability p and random otherwise, where p is set to ensure I(L̃; Y ) = I(L; Y ).

[Figure 4 plots: mutual information and accuracy vs. SGD epochs for MNIST 0-4 vs 5-9, CIFAR10 Animals vs Objects, CIFAR10 First 5 vs Last 5, and the high-dimensional sinusoid.]

Table 1 summarizes the qualitative behavior of several information theoretic quantities we observe across different datasets and architectures. We stress that these phenomena would not occur for an arbitrary learning algorithm that increases model test accuracy. Rather, it is SGD (with a random, or at least "non-pathological", initialization; see Section 5) that produces such behavior. The initialization is important: in Figure 8 in the appendix we show that one can construct adversarial initializations for which this inductive bias of SGD breaks. Concurrent work by Liu et al. 
[27] also finds an initialization for SGD that leads to poor generalization, using a slightly different technique.

                First phase                             Middle phase                            Overfitting
Train acc       ↑                                       ↑                                       ↑ (overfits the train set)
Test acc        ↑                                       ↑                                       – (overfitting doesn't hurt or improve test)
I(Ft; Y)        ↑                                       ↑                                       –
μY(Ft; L)       ≈ I(Ft; Y): increase in acc of          plateaus near I(L; Y)                   –: Ft doesn't forget L
                Ft explained by L
I(Ft; Y | L)    ≈ 0                                     ↑: Ft becomes more expressive than L    –
I(L; Y | Ft)    ↓: Ft starts correlating with L         ≈ 0                                     ≈ 0: Ft still doesn't forget L

Table 1: Qualitative behavior of the quantities of interest in our experiments. We denote with ↑, ↓ and – increasing, decreasing and constant values respectively.

4 Beyond Linear: SGD Learns Functions of Increasing Complexity

In this section we investigate Remark 1, that SGD learns functions of increasing complexity, through the lens of the mutual information framework, and provide experimental evidence supporting the natural extension of the results from Section 3 to models more complex than linear.

Conjecture 1 (Beyond linear classifiers: Remark 1 restated). There exist increasingly complex functions (g1, g2, ...) under some measure of complexity, and a monotonically increasing sequence (T1, T2, ...) such that μY (Ft; Gi) ≈ I(Ft; Y ) for t ≤ Ti and μY (Ft; Gi) ≈ I(Gi; Y ) for t > Ti.8

It is difficult to establish Conjecture 1 in full generality, as the correct measure of complexity is unclear; it may depend on the distribution, architecture, and even initialization. 
Nevertheless, we are\nable to support it in the image-classi\ufb01cation setting, parameterizing complexity using the number of\nconvolutional layers.\n\nExperimental Setup.\nIn order to explore the behavior of more complex classi\ufb01ers we consider\nthe CIFAR \u201cFirst 5 vs. Last 5\u201d task introduced in Section 3, for which there is no high-accuracy\nlinear classi\ufb01er. We observed that the performance of various architectures on this task was similar\nto their performance on the full 10-way CIFAR classi\ufb01cation task, which supports the relevance of\nthis example to standard use-cases.9\n\nAs our model f , we train an 18-layer pre-activation ResNet described in [19] which achieves over\n90% accuracy on this task. For the simple models gi, we use convolutional neural networks corre-\nsponding to the 2nd, 4th, and 6th shallowest layers of the network for f . Similarly to Section 3, the\nmodels gi are trained on the images labeled by f\u221e (that is the model at the end of training). For\nmore details refer to Appendix B: \u201cFinding the Conditional Models\".\n\nResults and Discussion. Our results are illustrated in Figure 5. We can see a separation in phases\nfor learning, where all curves \u00b5Y (Ft; Gi) are initially close to I(Ft; Y ), before each successively\nplateaus as training progresses. Moreover, note that I(Gi; Y ) remains \ufb02at in the over\ufb01tting regime\nfor all three i, demonstrating that SGD does not \u201cforget\u201d the simpler functions as stated in Claim 2.\n\n8Note that implicit in our conjecture is that each Gi is itself explained by G<i, so we should not have to\n\ncondition on all previous Gi\u2019s; i.e. \u00b5Y (Ft; (G1:i)) \u2248 \u00b5Y (Ft; Gi).\n\n9Potentially since we need to distinguish between visually similar classes, e.g. automobile/truck or cat/dog.\n\n7\n\n\fFigure 5: Distinguishing be-\ntween the \ufb01rst vs.\nthe last 5\nclasses of CIFAR10. CNNk\ndenotes a convolutional neu-\nral network of k layers. 
We clearly see a separation in phases of learning, where all curves μY (Ft; Gi) are initially close to I(Ft; Y ), before each successively plateaus as training progresses. The plot matches the conjectured behavior illustrated in Figure 2.

Interestingly, the 4- and 6-layer CNNs exhibit less clear phase separation than the 2-layer CNN and linear model of Section 3. We attribute this to two possibilities: firstly, training gi on f∞ for larger models likely may not recover the best possible simple classifier that explains ft (footnote 10); secondly, the number of layers may not be a perfect approximation to the notion of simplicity. However, we can again verify our qualitative results by comparing to a random "null model" classifier g̃i with the same accuracy as gi. For the 6-layer CNN, μ(FT0 ; Gi)/I(FT0 ; Y ) = 0.72, while μ(FT0 ; G̃i)/I(FT0 ; Y ) = 0.40, with T0 estimated as before (see Table 3 in Appendix B for the 2- and 4-layer numbers). Thus, gi explains the behavior of ft significantly more than an arbitrary classifier of equivalent accuracy.

5 Overfitting Does Not Hurt Generalization

In the previous sections we investigated the early and middle phases of SGD training. In this section, we focus on the last phase, i.e. the overfitting regime. In practice, we often observe that in late phases of training, train error goes to 0, while test error stabilizes, despite the fact that bad ERMs exist. 
The previous sections suggest that this phenomenon is an inherent property of SGD in the overparameterized setting, where training starts from a "simpler" model at the beginning of the overfitting regime and does not forget it even as it learns more "complex" models and fits the noise.

In what follows, we demonstrate this intuition formally in an illustrative simplified setting where, provably, a heavily overparameterized (linear) model trained with SGD fits the training set exactly, and yet its population accuracy is optimal for a class of "simple" initializations.11

The Model. We confine ourselves to the linear classification setting. To formalize notions of "simple" we consider a data distribution that explicitly decomposes into a component explainable by a sparse classifier, and a remaining orthogonal noisy component on which it is possible to overfit. Specifically, we define the data distribution D as follows:

    y ∼u.a.r. {−1, +1},    k ∼u.a.r. {2, . . . , d},    η ∼ Bernoulli(p) over {±1},    x = η · y · e1 + ek.

Here ei refers to the ith vector of the standard basis of R^d, while p ≤ 1/2 is a noise parameter. For a 1 − p fraction of the points the first coordinate corresponds to the label, but a p fraction of the points are "noisy", i.e., their label is the opposite of their first coordinate. Notice that the classes are essentially linearly separable up to error p.

We deal with the heavily overparameterized regime, i.e., when we are presented with only n = o(√d) samples. We analyze the learning of a linear classifier w ∈ R^d by minimizing the empirical square loss L(w) = (1/n) Σ_{i=1}^{n} (1 − yi⟨w, xi⟩)^2 using SGD. 
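The model above is small enough to simulate end to end. A minimal sketch (d, n, p, the step size, and the epoch count are our illustrative choices, not values from the paper):

```python
import random

# Simulate the distribution D and run SGD on the per-sample square loss.
random.seed(1)
d, n, p, lr, epochs = 10_000, 40, 0.1, 0.05, 300

def sample(m):
    data = []
    for _ in range(m):
        y = random.choice([-1, 1])
        k = random.randrange(1, d)              # index of the noise coordinate e_k
        eta = -1 if random.random() < p else 1  # first coordinate flipped w.p. p
        data.append(({0: eta * y, k: 1.0}, y))  # sparse x = eta*y*e1 + e_k
    return data

def accuracy(w, data):
    return sum((sum(w[i] * v for i, v in x.items()) > 0) == (y > 0)
               for x, y in data) / len(data)

train, test = sample(n), sample(20_000)

w = [0.0] * d  # the origin satisfies the initialization conditions of Theorem 1
for _ in range(epochs):
    for x, y in train:
        margin = sum(w[i] * v for i, v in x.items())
        g = -2.0 * (1.0 - y * margin) * y       # derivative of (1 - y*<w, x>)^2 in the margin
        for i, v in x.items():
            w[i] -= lr * g * v

train_acc, test_acc = accuracy(w, train), accuracy(w, test)
# train_acc approaches 1 (the noise is memorized via the e_k coordinates),
# while test_acc should stay near 1 - p.

# By contrast, a hand-built poor ERM also fits the training set while having
# population accuracy around p:
w_bad = [0.0] * d
w_bad[0] = -1.0                                 # wrong sign on the signal coordinate
for x, y in train:
    k = next(i for i in x if i != 0)
    w_bad[k] = 2.0 * y                          # memorize every point via its e_k
print(train_acc, test_acc, accuracy(w_bad, train), accuracy(w_bad, test))
```

The explicit w_bad illustrates the poor ERMs discussed in the text: it interpolates the sample using only the noise coordinates, so its population accuracy is far below that of the SGD solution started from a "simple" initialization.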
Key to our setting is the existence of poor\nERMs\u2014classi\ufb01ers that have \u2264 50% population accuracy but achieve 100% training accuracy by\ntaking advantage of the ek components of the sample points, which are noise, not signal. We show\n\ni=1 (1 \u2212 ynhw, xii)2\n\nn Pn\n\n10In the extreme case, gi has same architecture as f . We cannot recover f exactly by training on its outputs.\n11A similar setting is analyzed in the concurrent work of Nagarajan and Kolter [29] to show the limitations\n\nof uniform convergence bounds for explaining generalization of deep learning.\n\n8\n\n02500500075001000012500150001750020000SGD Steps0.00.10.20.30.40.50.60.70.8Mutual InformationBest CNN2Best CNN4Best CNN6I(Ft; Y)I(Ft; YS)Y(Ft; CNN2)Y(Ft; CNN4)Y(Ft; CNN6)0.60.70.80.90.95AccuracyCIFAR10, First Five vs Last Five\f\u2217 = e1, the ERM found\nhowever, that as long as we begin not too far from the \u201csimplest\u201d classi\ufb01er w\nby SGD generalizes well. This holds empirically even for more complex models (Fig 8 in App C).\n\nTheorem 1. Consider training a linear classi\ufb01er via minimizing the empirical square loss using\n\nSGD. Let \u03b5 > 0 be a small constant and let the initial vector w0 satisfy w0(1) \u2265 \u2212n0.99, and\n|w0(i)| \u2264 1 \u2212 2p \u2212 \u03b5 for all i > 1. Then, with high probability, sample accuracy approaches 1 and\npopulation accuracy approaches 1 \u2212 p as the number of gradient steps goes to in\ufb01nity.\n\nProof sketch. The displacement of the weight vector from initialization will always lie in the span\nof the sample vectors which, because the samples are sparse, is in expectation almost orthogonal to\nthe population. Moreover, as long as the initialization is bounded suf\ufb01ciently, the \ufb01rst coordinate of\nthe learned vector will approach a constant. 
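This setting is small enough to simulate directly. The sketch below is our own illustration, not the paper's code, and the constants (d, n, p, the learning rate, and the step count) are arbitrary choices; it trains an overparameterized linear model with SGD on the square loss from the zero initialization, which satisfies the theorem's conditions, and exhibits both interpolation of the training set and population accuracy near 1 − p.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, p = 5_000, 40, 0.2       # heavily overparameterized: n is small relative to sqrt(d)

def sample(m):
    """Draw m points from the distribution D: x = eta * y * e1 + e_k."""
    y = rng.choice([-1.0, 1.0], size=m)
    k = rng.integers(1, d, size=m)                # noise coordinate (0-indexed: 1..d-1)
    eta = np.where(rng.random(m) < p, -1.0, 1.0)  # eta = -1 with probability p
    x = np.zeros((m, d))
    x[:, 0] = eta * y
    x[np.arange(m), k] = 1.0
    return x, y

X, y = sample(n)
w = np.zeros(d)                # zero init satisfies the conditions of Theorem 1
lr = 0.05
for _ in range(20_000):
    i = rng.integers(n)
    resid = 1.0 - y[i] * (w @ X[i])               # residual of (1 - y<w, x>)^2
    w += lr * 2.0 * resid * y[i] * X[i]           # SGD step on one sample

X_test, y_test = sample(2_000)
train_acc = float(np.mean(np.sign(X @ w) == y))
pop_acc = float(np.mean(np.sign(X_test @ w) == y_test))
# train_acc is near 1 (the e_k noise coordinates get memorized), while
# pop_acc stays near 1 - p: overfitting the noise does not hurt generalization.
```

The learned vector memorizes each training point through its private ek coordinate, yet on fresh samples those coordinates are almost surely unseen, so prediction is driven by the first coordinate alone.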
The full proof is deferred to Appendix A.

Theorem 1 implies in particular that if we initialize at a good bounded model (such as w∗), a version of Claim 2 provably applies to this setting: if Ft corresponds to the model at SGD step t and ℓ corresponds to w∗, then µY(Ft; L) will barely decrease in the long term.

6 Discussion and Future Work

Our findings yield new insight into the inductive bias of SGD on deep neural networks. In particular, it appears that SGD increases the complexity of the learned classifier as training progresses, starting by learning an essentially linear classifier.

There are several natural questions that arise from our work. First, why does this "linear learning" occur? We pose the problem of understanding why Claims 1 and 2 are true as an important direction for future work. Second, what is the correct measure of complexity that SGD increases over time? That is, we would like the correct formalization of Conjecture 1, ideally with a measure of complexity that implies generalization. We view our work as an initial step in a framework for understanding why neural networks generalize, and we believe that theoretically establishing our claims would be significant progress in this direction.

Acknowledgements. We thank all of the participants of the Harvard ML Theory Reading Group for many useful discussions and presentations that motivated this work. We especially thank Noah Golowich, Yamini Bansal, Thibaut Horel, Jarosław Błasiok, Alexander Rakhlin, and Madhu Sudan.

This work was supported by NSF awards CCF 1565264, CNS 1618026, CCF 1565641, CCF 1715187, NSF GRFP Grant No. DGE1144152, and a Simons Investigator Fellowship and Investigator Award.

References

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers.
arXiv preprint arXiv:1811.04918, 2018.

[2] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.

[3] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning (ICML), pages 254–263, 2018.

[4] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, and Yoshua Bengio. A closer look at memorization in deep networks. In International Conference on Machine Learning (ICML), pages 233–242, 2017.

[5] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 6240–6249, 2017.

[6] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[7] Anthony J Bell. The co-information lattice. In Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation: ICA, volume 2003, 2003.

[8] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data, 2017.

[9] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. In International Conference on Learning Representations (ICLR), 2018.

[10] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data.
In Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

[11] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256, 2010.

[12] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference on Learning Theory (COLT), pages 297–299, 2018.

[13] Gene H Golub and Charles F Van Loan. Matrix computations. Johns Hopkins University Press, 1996.

[14] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649, 2013.

[15] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 9461–9471, 2018.

[16] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems (NIPS), pages 6151–6159, 2017.

[17] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning (ICML), pages 1225–1234, 2016.

[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), pages 630–645, 2016.

[20] Geoffrey E. Hinton and Drew van Camp.
Keeping the neural networks simple by minimizing the description length of the weights. In Conference on Computational Learning Theory (COLT), pages 5–13, 1993.

[21] Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory (COLT), 2019.

[22] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations (ICLR), 2017.

[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.

[24] Ilja Kuzborskij and Christoph Lampert. Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning (ICML), pages 2820–2829, 2018.

[25] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems (NeurIPS), pages 8157–8166, 2018.

[26] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference on Learning Theory (COLT), pages 2–47, 2018.

[27] Shengchao Liu, Dimitris Papailiopoulos, and Dimitris Achlioptas. Bad global minima exist and SGD can reach them. arXiv preprint arXiv:1906.02613, 2019.

[28] William McGill. Multivariate information transmission. Transactions of the IRE Professional Group on Information Theory, 4(4):93–111, 1954.

[29] Vaishnavh Nagarajan and J Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. arXiv preprint arXiv:1902.04742, 2019.

[30] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro.
Exploring generalization in deep learning. In Advances in Neural Information Processing Systems (NIPS), pages 5947–5956, 2017.

[31] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations (ICLR), 2018.

[32] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory (COLT), pages 1376–1401, 2015.

[33] Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: An empirical study. arXiv preprint arXiv:1802.08760, 2018.

[34] Guillermo Valle Pérez, Ard A Louis, and Chico Q Camargo. Deep learning generalizes because the parameter-function map is biased towards simple functions. In International Conference on Learning Representations (ICLR), 2019.

[35] Nasim Rahaman, Devansh Arpit, Aristide Baratin, Felix Draxler, Min Lin, Fred A Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of deep neural networks. arXiv preprint arXiv:1806.08734, 2018.

[36] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. In International Conference on Learning Representations (ICLR), 2018.

[37] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484, January 2016.

[38] Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data.
In International Conference on Learning Representations (ICLR), 2018.

[39] Yifan Wu, Barnabas Poczos, and Aarti Singh. Towards understanding the generalization bias of two layer convolutional linear classifiers with gradient descent. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1070–1078, 2019.

[40] Zhiqin John Xu. Understanding training and generalization in deep learning by Fourier analysis. arXiv preprint arXiv:1808.04295, 2018.

[41] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.

[42] Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the ImageNet scale: A PAC-Bayesian compression approach. In International Conference on Learning Representations (ICLR), 2019.