{"title": "Critical initialisation for deep signal propagation in noisy rectifier neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 5717, "page_last": 5726, "abstract": "Stochastic regularisation is an important weapon in the arsenal of a deep learning practitioner. However, despite recent theoretical advances, our understanding of how noise influences signal propagation in deep neural networks remains limited. By extending recent work based on mean field theory, we develop a new framework for signal propagation in stochastic regularised neural networks. Our \\textit{noisy signal propagation} theory can incorporate several common noise distributions, including additive and multiplicative Gaussian noise as well as dropout. We use this framework to investigate initialisation strategies for noisy ReLU networks. We show that no critical initialisation strategy exists using additive noise, with signal propagation exploding regardless of the selected noise distribution. For multiplicative noise (e.g.\\ dropout), we identify alternative critical initialisation strategies that depend on the second moment of the noise distribution.  Simulations and experiments on real-world data confirm that our proposed initialisation is able to stably propagate signals in deep networks, while using an initialisation disregarding noise fails to do so. Furthermore, we analyse correlation dynamics between inputs. Stronger noise regularisation is shown to reduce the depth to which discriminatory information about the inputs to a noisy ReLU network is able to propagate, even when initialised at criticality. 
We support our theoretical predictions for these trainable depths with simulations, as well as with experiments on MNIST and CIFAR-10.", "full_text": "Critical initialisation for deep signal propagation in noisy rectifier neural networks

Arnu Pretorius* (Computer Science Division, CAIR†, Stellenbosch University), Elan Van Biljon (Computer Science Division, Stellenbosch University), Steve Kroon (Computer Science Division, Stellenbosch University), Herman Kamper (Department of Electrical and Electronic Engineering, Stellenbosch University)

Abstract

Stochastic regularisation is an important weapon in the arsenal of a deep learning practitioner. However, despite recent theoretical advances, our understanding of how noise influences signal propagation in deep neural networks remains limited. By extending recent work based on mean field theory, we develop a new framework for signal propagation in stochastic regularised neural networks. Our noisy signal propagation theory can incorporate several common noise distributions, including additive and multiplicative Gaussian noise as well as dropout. We use this framework to investigate initialisation strategies for noisy ReLU networks. We show that no critical initialisation strategy exists using additive noise, with signal propagation exploding regardless of the selected noise distribution. For multiplicative noise (e.g. dropout), we identify alternative critical initialisation strategies that depend on the second moment of the noise distribution. Simulations and experiments on real-world data confirm that our proposed initialisation is able to stably propagate signals in deep networks, while using an initialisation disregarding noise fails to do so. Furthermore, we analyse correlation dynamics between inputs. 
Stronger noise regularisation is shown to reduce the depth to which discriminatory information about the inputs to a noisy ReLU network is able to propagate, even when initialised at criticality. We support our theoretical predictions for these trainable depths with simulations, as well as with experiments on MNIST and CIFAR-10.‡

1 Introduction

Over the last few years, advances in network design strategies have made it easier to train large networks and have helped to reduce overfitting. These advances include improved weight initialisation strategies (Glorot and Bengio, 2010; Saxe et al., 2014; Sussillo and Abbott, 2014; He et al., 2015; Mishkin and Matas, 2016), non-saturating activation functions (Glorot et al., 2011) and stochastic regularisation techniques (Srivastava et al., 2014). Authors have noted, for instance, the critical dependence of successful training on noise-based methods such as dropout (Krizhevsky et al., 2012; Dahl et al., 2013).

*Correspondence: arnupretorius@gmail.com
†CSIR/SU Centre for Artificial Intelligence Research.
‡Code to reproduce all the results is available at https://github.com/ElanVB/noisy_signal_prop

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Noisy layer recursion. The input x^{l-1} from the previous layer gets corrupted by the sampled noise ε^{l-1}, either by vector addition or component-wise multiplication, producing the noisy inputs x̃^{l-1}. The lth layer's corrupted pre-activations are then computed by multiplication with the layer weight matrix W^l, followed by a vector addition of the biases b^l. Finally, the inputs to the next layer are simply the activations of the current layer, i.e. x^l = φ(h̃^l).

In many cases, successful results arise only from effective combination of these advances. 
Despite this interdependence, our theoretical understanding of how these mechanisms and their interactions affect neural networks remains impoverished.

One approach to studying these effects is through the lens of deep neural signal propagation. By modelling the empirical input variance dynamics at the point of random initialisation, Saxe et al. (2014) were able to derive equations capable of describing how signal propagates in nonlinear fully connected feed-forward neural networks. This "mean field" theory was subsequently extended by Poole et al. (2016) and Schoenholz et al. (2017), in particular, to analyse signal correlation dynamics. These analyses highlighted the existence of a critical boundary at initialisation, referred to as the "edge of chaos". This boundary defines a transition between ordered (vanishing) and chaotic (exploding) regimes for neural signal propagation. Subsequently, the mean field approximation to random neural networks has been employed to analyse other popular neural architectures (Yang and Schoenholz, 2017; Xiao et al., 2018; Chen et al., 2018).

This paper focuses on the effect of noise on signal propagation in deep neural networks. Firstly, we ask: How is signal propagation in deep neural networks affected by noise? To gain some insight into this question, we extend the mean field theory developed by Schoenholz et al. (2017) for the special case of dropout noise, into a generalised framework capable of describing the signal propagation behaviour of stochastically regularised neural networks for different noise distributions.

Secondly, we ask: How much are current weight initialisation strategies affected by noise-induced regularisation in terms of their ability to initialise at a critical point for stable signal propagation? Using our derived theory, we investigate this question specifically for rectified linear unit (ReLU) networks. 
In particular, we show that no such critical initialisation exists for arbitrary zero-mean additive noise distributions. However, for multiplicative noise, such an initialisation is shown to be possible, given that it takes into account the amount of noise being injected into the network. Using these insights, we derive novel critical initialisation strategies for several different multiplicative noise distributions.

Finally, we ask: Given that a network is initialised at criticality, in what way does noise influence the network's ability to propagate useful information about its inputs? By analysing the correlation between inputs as a function of depth in random deep ReLU networks, we highlight the following: even though the statistics of individual inputs are able to propagate arbitrarily deep at criticality, discriminatory information about the inputs becomes lost at shallower depths as the noise in the network is increased. This is because in the later layers of a random noisy network, the internal representations from different inputs become uniformly correlated. Therefore, the application of noise regularisation directly limits the trainable depth of critically initialised ReLU networks.

2 Noisy signal propagation

We begin by presenting mean field equations for stochastically regularised fully connected feed-forward neural networks, allowing us to study noisy signal propagation for a variety of noise distributions. To understand how noise influences signal propagation in a random network given an input x^0 ∈ R^{D_0}, we inject noise into the model

h̃^l = W^l (x^{l-1} ⊙ ε^{l-1}) + b^l,  for l = 1, ..., L,    (1)

using the operator ⊙ to denote either addition or multiplication, where ε^l is an input noise vector sampled from a pre-specified noise distribution. For additive noise, the distribution is assumed to have zero mean; for multiplicative noise distributions, the mean is assumed to be equal to one. The weights W^l ∈ R^{D_l × D_{l-1}} and biases b^l ∈ R^{D_l} are sampled i.i.d. from zero mean Gaussian distributions with variances σ_w²/D_{l-1} and σ_b², respectively, where D_l denotes the dimensionality of the lth hidden layer in the network. The hidden layer activations x^l = φ(h̃^l) are computed element-wise using an activation function φ(·), for layers l = 1, ..., L. Figure 1 illustrates this recursive sequence of operations.

To describe forward signal propagation for the model in (1), we make use of the mean field approximation as in Poole et al. (2016) and analyse the statistics of the internal representations of the network in expectation over the parameters and the noise. Since the weights and biases are sampled from zero mean Gaussian distributions with pre-specified variances, we can approximate the distribution of the pre-activations at layer l, in the large width limit, by a zero mean Gaussian with variance

q̃^l = σ_w² { E_z[ φ(√(q̃^{l-1}) z)² ] ⊙ μ₂^{l-1} } + σ_b²,    (2)

where z ∼ N(0, 1) (see Section A.1 in the supplementary material). Here, μ₂^l = E_ε[(ε^l)²] is the second moment of the noise distribution being sampled from at layer l. The initial input variance is given by q^0 = (1/D_0) x^0 · x^0. Furthermore, to study the behaviour of a pair of signals from two different inputs, x^{0,a} and x^{0,b}, passing through the network, we can compute the covariance at each layer as

q̃^l_{ab} = σ_w² E_{z1}[ E_{z2}[ φ(ũ_1) φ(ũ_2) ] ] + σ_b²,    (3)

where ũ_1 = √(q̃^{l-1}_{aa}) z_1 and ũ_2 = √(q̃^{l-1}_{bb}) [ c̃^{l-1} z_1 + √(1 − (c̃^{l-1})²) z_2 ], with the correlation between inputs at layer l given by c̃^l = q̃^l_{ab} / √(q̃^l_{aa} q̃^l_{bb}). Here, q̃^l_{aa} is the variance of h̃^{l,a} (see Section A.2 in the supplementary material for more details).

For the backward pass, we use the equations derived in Schoenholz et al. (2017) to describe error signal propagation.^1 In the context of mean field theory, the expected magnitude of the gradient at each layer can be shown to be proportional to the variance of the error, δ̃^l_i = φ'(h̃^l_i) Σ_{j=1}^{D_{l+1}} δ̃^{l+1}_j W^{l+1}_{ji}. This allows for the distribution of the error signal at layer l to be approximated by a zero mean Gaussian with variance

q̃^l_δ = q̃^{l+1}_δ (D_{l+1}/D_l) σ_w² E_z[ φ'(√(q̃^l) z)² ].    (4)

Similarly, for noise regularised networks, the covariance between error signals can be shown to be

q̃^l_{ab,δ} = q̃^{l+1}_{ab,δ} (D_{l+1}/D_{l+2}) σ_w² E_{z1}[ E_{z2}[ φ'(ũ_1) φ'(ũ_2) ] ],    (5)

where ũ_1 and ũ_2 are defined as in the forward pass.

Equations (2)-(5) fully capture the relevant statistics that govern signal propagation for a random network during both the forward and the backward pass. In the remainder of this paper, we consider, as was done by Schoenholz et al. 
(2017), the following necessary condition for training: "for a random network to be trained information about the inputs should be able to propagate forward through the network, and information about the gradients should be able to propagate backwards through the network." The behaviour of the network at this stage depends on the choice of activation, noise regulariser and initial parameters. In the following section, we will focus on networks that use the Rectified Linear Unit (ReLU) as activation function. The chosen noise regulariser is considered a design choice left to the practitioner. Therefore, whether a random noisy ReLU network satisfies the above stated necessary condition for training largely depends on the starting parameter values of the network, i.e. its initialisation.

^1 It is, however, important to note that the derivation relies on the assumption that the weights used in the forward pass are sampled independently from those used during backpropagation.

Figure 2: Deep signal propagation with and without noise. (a): Iterative variance map. (b): Variance dynamics during forward signal propagation. In (a) and (b), lines correspond to theoretical predictions and points to numerical simulations (means over 50 runs with shaded one standard deviation bounds), for noiseless tanh (yellow) and noiseless ReLU (purple) networks, as well as for noisy tanh (red) and noisy ReLU (brown) networks regularised using additive noise from a standard Gaussian. Both tanh networks use (σ_w, σ_b) = (1, 0), the "Xavier" initialisation (Glorot and Bengio, 2010), while the ReLU networks use (σ_w, σ_b) = (√2, 0), the "He" initialisation (He et al., 2015). 
In our experiments, we use network layers consisting of 1000 hidden units (see Section C in the supplementary material for more details on all our simulated experiments).

3 Critical initialisation for noisy rectifier networks

Unlike the tanh nonlinearity investigated in previous work (Poole et al., 2016; Schoenholz et al., 2017), rectifying activation functions such as ReLU are unbounded. This means that the statistics of signal propagation through the network are not guaranteed to naturally stabilise through saturating activations, as shown in Figure 2.

A point on the identity line in Figure 2 (a) represents a fixed point to the recursive variance map in equation (2). At a fixed point, signal will stably propagate through the remaining layers of the network. For tanh networks, such a fixed point always exists irrespective of the initialisation, or the amount of noise injected into the network. For ReLU networks, this is not the case. Consider the "He" initialisation (He et al., 2015) for ReLU, commonly used in practice. In (b), we plot the variance dynamics for this initialisation in purple and observe stable behaviour. But what happens when we inject noise into each network? In the case of tanh (shown in red), the added noise simply shifts the fixed point to a new stable value. However, for ReLU, the noise entirely destroys the fixed point for the "He" initialisation, making signal propagation unstable. This can be seen in (a), where the variance map for noisy ReLU (shown in brown) moves off the identity line entirely, causing the signal in (b) to explode.

Therefore, to investigate whether signal can stably propagate through a random noisy ReLU network, we examine (2) more closely, which for ReLU becomes (see Section B.1 in supplementary material)

q̃^l = σ_w² [ (q̃^{l-1}/2) ⊙ μ₂ ] + σ_b².    (6)

For ease of exposition we assume equal noise levels at each layer, i.e. μ₂^l = μ₂, ∀l. A critical initialisation for a noisy ReLU network occurs when the tuple (σ_w, σ_b, μ₂) provides a fixed point q̃* to the recurrence in (6). This at least ensures that the statistics of individual inputs to the network will be preserved throughout the first forward pass. The existence of such a solution depends on the type of noise that is injected into the network. In the case of additive noise, q̃* = (σ_w²/2) q̃* + μ₂ σ_w² + σ_b², implying that the only critical point initialisation for non-zero q̃* is given by (σ_w, σ_b, μ₂) = (√2, 0, 0). Therefore, critical initialisation is not possible using any amount of zero-mean additive noise, regardless of the noise distribution. For multiplicative noise, q̃* = (σ_w²/2) μ₂ q̃* + σ_b², so the solution (σ_w, σ_b, μ₂) = (√(2/μ₂), 0, μ₂) provides a critical initialisation for noise distributions with mean one and a non-zero second moment μ₂. For example, in the case of multiplicative Gaussian noise, μ₂ = σ_ε² + 1, yielding critical initialisation with (σ_w, σ_b) = (√(2/(σ_ε² + 1)), 0). For dropout noise, μ₂ = 1/p (with p the probability of retaining a neuron); thus, to initialise at criticality, we must set (σ_w, σ_b) = (√(2p), 0). Table 1 summarises critical initialisations for some commonly used noise distributions. We also note that similar results can be derived for other rectifying activation functions; for example, for multiplicative noise the critical initialisation for parametric ReLU (PReLU) activations (with slope parameter α) is given by (σ_w, σ_b, μ₂) = (√(2/(μ₂(α² + 1))), 0, μ₂).

Table 1: Critical point initialisation for noisy ReLU networks.

Additive noise:
  Gaussian: P(ε) = N(0, σ_ε²), μ₂ = σ_ε², critical initialisation (σ_w, σ_b, σ_ε) = (√2, 0, 0)
  Laplace: P(ε) = Lap(0, β), μ₂ = 2β², critical initialisation (σ_w, σ_b, β) = (√2, 0, 0)
Multiplicative noise:
  Gaussian: P(ε) = N(1, σ_ε²), μ₂ = σ_ε² + 1, critical initialisation (σ_w, σ_b, σ_ε) = (√(2/(σ_ε² + 1)), 0, σ_ε)
  Laplace: P(ε) = Lap(1, β), μ₂ = 2β² + 1, critical initialisation (σ_w, σ_b, β) = (√(2/(2β² + 1)), 0, β)
  Poisson: P(ε) = Poi(1), μ₂ = 2, critical initialisation (σ_w, σ_b, λ) = (1, 0, 1)
  Dropout: P(ε = 1/p) = p, P(ε = 0) = 1 − p, μ₂ = 1/p, critical initialisation (σ_w, σ_b, p) = (√(2p), 0, p)

Figure 3: Critical initialisation for noisy ReLU networks. (a): Iterative variance map. (b): Variance dynamics during forward signal propagation. In (a) and (b), lines correspond to theoretical predictions and points to numerical simulations. Dropout (p = 0.6) is shown in green for different initialisations, σ_w² = 2(0.6) = 2/μ₂ (critical), σ_w² = (1.25)² · 2/μ₂ > 2/μ₂ (exploding signal) and σ_w² = (0.85)² · 2/μ₂ < 2/μ₂ (vanishing signal). Similarly, multiplicative Gaussian noise (σ_ε = 0.25) is shown in red with σ_w² = 2/((0.25)² + 1) = 2/μ₂ (critical), σ_w² = (1.15)² · 2/μ₂ (exploding) and σ_w² = (0.75)² · 2/μ₂ (vanishing). (c): Variance critical boundary for initialisation, separating numerical overflow and underflow signal propagation regimes.

To see the effect of initialising on or off the critical point for ReLU networks, Figure 3 compares the predicted versus simulated variance dynamics for different initialisation schemes. For schemes not initialising at criticality, the variance map in (a) no longer lies on the identity line and as a result the forward propagating signal in (b) either explodes or vanishes. In contrast, the initialisations derived above lie on the critical boundary between these two extremes, as shown in (c) as a function of the noise. 
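These fixed-point conditions are easy to check numerically. The following is a minimal NumPy sketch (our illustration, not the paper's released code; the width D, depth L and seed are arbitrary choices) that iterates the ReLU variance map (6) for dropout and for additive noise, and then confirms the dropout fixed point against a wide simulated network using the critical initialisation from Table 1.

```python
import numpy as np

def var_map_mult(q, sigma_w2, sigma_b2, mu2):
    """ReLU variance map (6), multiplicative noise: q_l = sigma_w^2 * mu2 * q_{l-1}/2 + sigma_b^2."""
    return sigma_w2 * mu2 * q / 2.0 + sigma_b2

def var_map_add(q, sigma_w2, sigma_b2, mu2):
    """ReLU variance map (6), additive noise: q_l = sigma_w^2 * (q_{l-1}/2 + mu2) + sigma_b^2."""
    return sigma_w2 * (q / 2.0 + mu2) + sigma_b2

# Dropout with rate p: mu2 = 1/p and the critical point is (sigma_w, sigma_b) = (sqrt(2p), 0).
p = 0.6
mu2 = 1.0 / p
sigma_w2 = 2.0 / mu2  # = 2p, from Table 1
q = 1.0
for _ in range(50):
    q = var_map_mult(q, sigma_w2, 0.0, mu2)  # q stays at 1.0: a fixed point

# Additive standard Gaussian noise (mu2 = 1) at the "He" point sigma_w^2 = 2:
# each layer adds 2*mu2 to the variance, so no fixed point exists.
q_add = 1.0
for _ in range(5):
    q_add = var_map_add(q_add, 2.0, 0.0, 1.0)  # 1 -> 3 -> 5 -> 7 -> 9 -> 11

# Empirical check of the dropout fixed point with a wide random network.
rng = np.random.default_rng(0)
D, L = 2000, 10
x = rng.normal(size=D)
qs = []
for _ in range(L):
    eps = rng.binomial(1, p, size=D) / p               # dropout noise: mean 1, second moment 1/p
    W = rng.normal(0.0, np.sqrt(sigma_w2 / D), size=(D, D))
    h = W @ (x * eps)                                   # pre-activations (sigma_b = 0)
    qs.append(np.mean(h ** 2))                          # empirical pre-activation variance
    x = np.maximum(h, 0.0)                              # ReLU
ratio = qs[-1] / qs[0]                                  # close to 1 at criticality
```

Rescaling `sigma_w2` away from `2.0 / mu2` makes `ratio` grow or shrink geometrically with depth, mirroring the exploding and vanishing curves in Figure 3.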
By compensating for the amount of injected noise, the signal corresponding to the initialisation σ_w² = 2/μ₂ is preserved in (b) throughout the entire forward pass, with roughly constant variance dynamics.

Figure 4: Propagating correlation information in noisy ReLU networks. (a): Iterative correlation map with fixed points indicated by "X" marks on the identity line. (b): Correlation dynamics during forward signal propagation. In (a) and (b), lines correspond to theoretical predictions and points to numerical simulations. All simulated networks were initialised at criticality for each noise type and level. (c): Slope at the fixed point correlation as a function of the amount of noise injected into the network.

Next, we investigate the correlation dynamics between inputs. Assuming that (6) is at its fixed point q̃*, which exists only if σ_w² = 2/μ₂, the correlation map for a noisy ReLU network is given by (see Section B.2 in supplementary material)

c̃^l = (1/μ₂) { c̃^{l-1}/2 + (1/π) [ c̃^{l-1} sin⁻¹(c̃^{l-1}) + √(1 − (c̃^{l-1})²) ] }.    (7)

Figure 4 plots this theoretical correlation map against simulated dynamics for different noise types and levels. For no noise, the fixed point c* in (a) is situated at one (marked with an "X" on the blue line). The slope of the blue line indicates a non-decreasing function of the input correlations. After a certain depth, inputs end up perfectly correlated irrespective of their starting correlation, as shown in (b). In other words, random deep ReLU networks lose discriminatory information about their inputs as the depth of the network increases, even when initialised at criticality. When noise is added to the network, inputs decorrelate and c* moves away from one. However, more importantly, correlation information in the inputs becomes lost at shallower depths as the noise level increases, as can be seen in (b).

How quickly a random network loses information about its inputs depends on the rate of convergence to the fixed point c*. Using this observation, Schoenholz et al. (2017) derived so-called depth scales ξ_c, by assuming |c^l − c*| ∼ e^{−l/ξ_c}. These scales essentially control the feasible depth at which networks can be considered trainable, since they may still allow useful correlation information to propagate through the network. In our case, the depth scale for a noisy ReLU network under this assumption can be shown to be (see Section B.3 in supplementary material)

ξ_c = −1 / ln[χ(c*)],    (8)

where

χ(c*) = (1/(μ₂ π)) [ sin⁻¹(c*) + π/2 ].    (9)

The exponential rate assumption underlying the derivation of (8) is supported in Figure 5, where for different noise types and levels, we plot |c^l − c*| as a function of depth on a log-scale, with corresponding linear fits (see panels (a) and (c)). 
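The correlation map (7) and depth scale (8)-(9) can be evaluated directly. The sketch below is our own illustration of these equations (the starting correlation and iteration count are arbitrary choices): it finds the fixed point c* by iterating (7) and then computes ξ_c, here for dropout with p = 0.5, i.e. μ₂ = 2.

```python
import numpy as np

def corr_map(c, mu2):
    """Correlation map (7) for a noisy ReLU network at its variance fixed point."""
    return (c / 2.0 + (c * np.arcsin(c) + np.sqrt(1.0 - c * c)) / np.pi) / mu2

def chi(c_star, mu2):
    """Slope (9) of the correlation map at the fixed point c*."""
    return (np.arcsin(c_star) + np.pi / 2.0) / (mu2 * np.pi)

def depth_scale(mu2, c0=0.5, iters=500):
    """Find c* by iterating (7), then return (c*, xi_c) with xi_c from (8)."""
    c = c0
    for _ in range(iters):
        c = corr_map(c, mu2)
    return c, -1.0 / np.log(chi(c, mu2))

c_star, xi = depth_scale(mu2=2.0)  # dropout with p = 0.5
# With no noise (mu2 = 1), c = 1 is a fixed point; with mu2 = 2, c* drops well
# below 1 and xi_c is small, so correlation information is lost within a few layers.
```

Sweeping `mu2` upward reproduces the qualitative trend of Figure 5 (b) and (d): stronger noise gives smaller χ(c*) and hence shorter trainable depth scales.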
We then compare the theoretical depth scales from (8) to actual depth scales obtained through simulation (panels (b) and (d)), as a function of noise and observe a good fit for non-zero noise levels.^4 We thus find that noise limits the depth at which critically initialised ReLU networks are expected to perform well through training.

^4 We note Hayou et al. (2018) recently showed that the rate of convergence for noiseless ReLU networks is not exponential, but polynomial instead. Interestingly, keeping with the exponential rate assumption, we indeed find that the discrepancy between our theoretical depth scales from (8) and our simulated depth scales, is largest at very low noise levels. However, at more typical noise levels, such as a dropout rate of p = 0.5 for example, the assumption seems to provide a close fit, with good agreement between theory and simulation.

Figure 5: Noise dependent depth scales for training. (a): Linear fits (dashed lines) to |c^l − c*| as a function of depth on a log-scale (solid lines) for varying amounts of dropout (p = 0.1 to p = 0.9 by 0.1). (b): Theoretical depth scales (solid lines) versus empirically inferred scales (dashed lines) per dropout rate. Scales are inferred noting that if |c^l − c*| ∼ e^{−l/ξ_c}, then a linear fit, al + b, in the logarithmic domain gives ξ_c ≈ −1/a, for large l. In other words, the negative inverse slope of a linear fit to the log differences in correlation should match the theoretical values for ξ_c. Therefore, we compare ξ_c = −1/ln[χ(c*)] to −1/a for different levels of noise. (c) - (d): Similar to (a) and (b), but for Gaussian noise (σ_ε = 0.1 to σ_ε = 1.9 by 0.15).

We next briefly discuss error signal propagation during the backward pass for noise regularised ReLU networks. When critically initialised, the error variance recurrence relation in (4) for these networks is (see Section B.4 in supplementary material)

q̃^l_δ = q̃^{l+1}_δ D_{l+1}/(D_l μ₂),    (10)

with the covariance between error signals in (5), given by (see Section B.5 in supplementary material)

q̃^l_{ab,δ} = q̃^{l+1}_{ab,δ} (D_{l+1}/D_{l+2}) χ(c*).    (11)

Note the explicit dependence on the width of the layers of the network in (10) and (11). We first consider constant width networks, where D_{l+1} = D_l, for all l = 1, ..., L. For any amount of multiplicative noise, μ₂ > 1, and we see from (10) that gradients will tend to vanish for large depths. Furthermore, Figure 4 (c) plots χ(c*) as a function of μ₂. As μ₂ increases from one, χ(c*) decreases from one. Therefore, from (11), we also find that error signals from different inputs will tend to decorrelate at large depths.

Interestingly, for non-constant width networks, stable gradient information propagation may still be possible. If the network architecture adapts to the amount of noise being injected by having the widths of the layers grow as D_{l+1} = D_l μ₂, then (10) should be at its fixed point solution. For example, in the case of dropout D_{l+1} = D_l/p, which implies that for any p < 1, each successive layer in the network needs to grow in width by a factor of 1/p to promote stable gradient flow. 
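This width schedule can be sanity-checked by iterating the recurrence (10) directly. In the sketch below (our illustration; the base width of 100 and depth of 6 are arbitrary choices), constant-width layers lose gradient variance by a factor of 1/μ₂ per layer, while widths grown as D_{l+1} = D_l μ₂ keep it constant.

```python
def backward_variance(widths, mu2, q_out=1.0):
    """Iterate recurrence (10), q_delta^l = q_delta^{l+1} * D_{l+1} / (D_l * mu2),
    from the output layer back to the input layer."""
    q = q_out
    for d_l, d_lp1 in reversed(list(zip(widths[:-1], widths[1:]))):
        q = q * d_lp1 / (d_l * mu2)
    return q

p = 0.5            # dropout rate, so mu2 = 1/p = 2
mu2 = 1.0 / p
L = 6

# Constant width: gradient variance shrinks to (1/mu2)^L = 2**-6.
v_const = backward_variance([100] * (L + 1), mu2)

# Widths grown as D_{l+1} = D_l * mu2: gradient variance is preserved at 1.0.
v_grown = backward_variance([round(100 * mu2 ** l) for l in range(L + 1)], mu2)
```

The trade-off is explicit here: preserving gradient variance under a dropout rate of p = 0.5 doubles the width of every successive layer, which quickly becomes impractical at large depths.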
Similarly, for multiplicative Gaussian noise, D_{l+1} = D_l(σ_ε² + 1), which requires the network to grow in width unless σ_ε² = 0. Likewise, if D_{l+2} = D_{l+1} χ(c*) = D_l μ₂ χ(c*) in (11), the covariance of the error signal should be preserved during the backward pass, for arbitrary values of μ₂ and χ(c*).

Figure 6: Depth scale experiments on MNIST and CIFAR-10. (a) Variance propagation dynamics for MNIST on and off the critical point initialisation (dashed black line) with dropout (p = 0.6). The cyan curve represents the theoretical boundary at which numerical instability issues are predicted to occur and is computed as L* = ln(K)/ln((σ_w²/2) μ₂), where K is the largest (or smallest) positive number representable by the computer. Specifically, we use 32-bit floating point numbers and set K = 3.4028235 × 10^38 if σ_w² > 2/μ₂, and K = 1.1754944 × 10^{−38} if σ_w² < 2/μ₂. (b) Depth scales fit to the training loss on MNIST for networks initialised at criticality for dropout rates p = 0.1 (severe dropout) to p = 1 (no dropout). (c) Depth scales fit to the validation loss on MNIST. (d) - (f): Similar to (a) - (c), but for CIFAR-10. For each plot we highlight trends by smoothing the colour grid (for non-smoothed versions see Section C.5 in the supplementary material).

4 Experimental results

From our analysis of deep noisy ReLU networks in the previous section, we expect that a necessary condition for such a network to be trainable is that the network be initialised at criticality. However, whether the layer widths are varied or not for the sake of backpropagation, the correlation dynamics in the forward pass may still limit the depth at which these networks perform well.

We therefore investigate the performance of noise-regularised deep ReLU networks on real-world data. First, we validate the derived critical initialisation. As the depth of the network increases, any initialisation strategy that does not factor in the effects of noise will cause the forward propagating signal to become increasingly unstable. For very deep networks, this might cause the signal to either explode or vanish, even within the first forward pass, making the network untrainable. To test this, we sent inputs from MNIST and CIFAR-10 through ReLU networks using dropout (with p = 0.6) at varying depths and for different initialisations of the network. Figure 6 (a) and (d) shows the evolution of the input statistics as the input propagates through each network for the different data sets. For initialisations not at criticality, the variance grows or shrinks rapidly to the point of causing numerical overflow or underflow (indicated by black regions). For deep networks, this can happen well before any signal is able to reach the output layer. In contrast, initialising at criticality (as shown by the dashed black line) allows for the signal to propagate reliably even at very large depths. 
Furthermore, given the floating point precision, if σ_w² ≠ 2/μ₂, we can predict the depth at which numerical overflow (or underflow) will occur by solving for L* in K = (σ_w² μ₂/2)^{L*} q⁰, where K is the largest (or smallest) positive number representable by the computer (see Section C.4 in the supplementary material). These predictions are shown by the cyan line and provide a good fit to the empirical limiting depth from numerical instability.

We now turn to the issue of limited trainability. Due to the loss of correlation information between inputs as a function of noise and network depth, we expect noisy ReLU networks not to be able to perform well beyond certain depths. We investigated depth scales for ReLU networks with dropout initialised at criticality: we trained 100 networks on MNIST and CIFAR-10 for 200 epochs using SGD and a learning rate of 10⁻³, with dropout rates ranging from 0.1 to 1 and varying depths. The results are shown in Figure 6 (see Section C.5 of the supplementary material for additional experimental results). For each network configuration and noise level, the critical initialisation σ_w² = 2/μ₂ was used. We indeed observe a relationship between depth and noise on the loss of a network, even at criticality.
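Both the critical initialisation and the numerical-instability boundary above are simple closed forms. The sketch below is illustrative: it assumes zero bias variance and an input variance q⁰ of order one (so the q⁰ factor is dropped), and uses dropout, for which the noise second moment is μ₂ = E[ε²] = 1/p:

```python
import numpy as np

def critical_sigma_w2(mu2):
    """Critical weight variance for a noisy ReLU network: sigma_w^2 = 2 / mu2,
    where mu2 = E[eps^2] is the second moment of the multiplicative noise."""
    return 2.0 / mu2

def overflow_depth(sigma_w2, mu2):
    """Predicted depth L* at which 32-bit floats over- or underflow off
    criticality, by solving K = (sigma_w^2 * mu2 / 2)^L* (taking q0 ~ 1)."""
    rate = sigma_w2 * mu2 / 2.0          # per-layer variance growth factor
    if rate == 1.0:
        return float("inf")              # at criticality the variance is stable
    K = 3.4028235e38 if rate > 1.0 else 1.1754944e-38  # float32 max / min normal
    return np.log(K) / np.log(rate)

p = 0.6                                  # dropout keep probability
mu2 = 1.0 / p                            # dropout: eps in {0, 1/p}, so E[eps^2] = 1/p
print(critical_sigma_w2(mu2))            # ~1.2, the dashed critical line in Figure 6
print(overflow_depth(2.0, mu2))          # "He" init: overflow after ~174 layers
```

For multiplicative Gaussian noise ε ~ N(1, σ_ε²), one would instead use μ₂ = σ_ε² + 1; with σ_ε² = 1 this recovers σ_w² = 1, the "Xavier" variance discussed in Section 5.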
Interestingly, the line 6ξ_c (Schoenholz et al., 2017) seems to track the depth beyond which the relative performance on the validation loss becomes poor, more closely than it does for the training loss. However, in both cases, we find that even modest amounts of noise can limit performance.

5 Discussion

By developing a general framework to study signal propagation in noisy neural networks, we were able to show how different stochastic regularisation strategies may impact the flow of information in a deep network. Focusing specifically on ReLU networks, we derived novel critical initialisation strategies for multiplicative noise distributions and showed that no such critical initialisations exist for commonly used additive noise distributions. At criticality, however, our theory predicts that the statistics of the input should remain within a stable range during the forward pass and enable reliable signal propagation for noise-regularised deep ReLU networks. We verified these predictions by comparing them with numerical simulations as well as experiments on MNIST and CIFAR-10 using dropout, and found good agreement.

Interestingly, we note that a dropout rate of p = 0.5 has often been found to work well for ReLU networks (Srivastava et al., 2014). The critical initialisation corresponding to this rate is (σ_w, σ_b) = (√(2p), 0) = (1, 0). This is exactly the "Xavier" initialisation proposed by Glorot and Bengio (2010), which, prior to the development of the "He" initialisation, was often used in combination with dropout (Simonyan and Zisserman, 2014). This could therefore help to explain the initial success associated with this specific dropout rate. Similarly, Srivastava et al.
(2014) reported that adding multiplicative Gaussian noise ε ~ N(1, σ_ε²) with σ_ε² = 1 also seemed to perform well, for which the critical initialisation is (σ_w, σ_b) = (√(2/(σ_ε² + 1)), 0) = (1, 0), again corresponding to the "Xavier" method.

Although our initialisations ensure that individual input statistics are preserved, we further analysed the correlation dynamics between inputs and found the following: at large depths, inputs become predictably correlated with each other based on the amount of noise injected into the network. As a consequence, the representations for different inputs to a deep network may become indistinguishable from each other in the later layers of the network. This can make training infeasible for noisy ReLU networks beyond a certain depth, which depends on the amount of noise regularisation being applied.

We now note the following shortcomings of our work: firstly, our findings only apply to fully connected feed-forward neural networks and focus almost exclusively on the ReLU activation function. Furthermore, we limit the scope of our architectural design to a recursive application of a dense layer followed by a noise layer, whereas in practice a larger mix of layers is usually required to solve a specific task.

Ultimately, we are interested in reducing the number of decisions that need to be made when designing deep neural networks, and in understanding the implications of those decisions on network behaviour and performance. Any machine learning engineer exploring a neural network based solution to a practical problem will be faced with a large number of possible design decisions. All these decisions cost valuable time to explore.
In this work, we hope to have at least provided some guidance in this regard, specifically when choosing between different initialisation strategies for noise-regularised ReLU networks and understanding their associated implications.

Acknowledgements

We would like to thank the reviewers for their insightful comments which improved the quality of this work. Furthermore, we would like to thank Google, the CSIR/SU Centre for Artificial Intelligence Research (CAIR) as well as the Science Faculty and the Postgraduate and International Office of Stellenbosch University for financial support. Finally, we gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan Xp GPU used for this research.

References

X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.

A. M. Saxe, J. L. McClelland, and S. Ganguli, "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks," Proceedings of the International Conference on Learning Representations, 2014.

D. Sussillo and L. Abbott, "Random walk initialization for training very deep feedforward networks," arXiv preprint arXiv:1412.6558, 2014.

K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.

D. Mishkin and J. Matas, "All you need is a good init," Proceedings of the International Conference on Learning Representations, 2016.

X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2011, pp.
315–323.

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving deep neural networks for LVCSR using rectified linear units and dropout," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 8609–8613.

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, "Exponential expressivity in deep neural networks through transient chaos," in Advances in Neural Information Processing Systems, 2016, pp. 3360–3368.

S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein, "Deep information propagation," Proceedings of the International Conference on Learning Representations, 2017.

G. Yang and S. Schoenholz, "Mean field residual networks: On the edge of chaos," in Advances in Neural Information Processing Systems, 2017, pp. 7103–7114.

L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington, "Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks," Proceedings of the International Conference on Machine Learning, 2018.

M. Chen, J. Pennington, and S. S. Schoenholz, "Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks," Proceedings of the International Conference on Machine Learning, 2018.

S. Hayou, A. Doucet, and J.
Rousseau, "On the selection of initialization and activation function for deep neural networks," arXiv preprint arXiv:1805.08266, 2018.

K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.