{"title": "Posterior Concentration for Sparse Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 930, "page_last": 941, "abstract": "We introduce Spike-and-Slab Deep Learning (SS-DL), a fully Bayesian alternative to dropout for improving generalizability of deep ReLU networks. This new type of regularization enables provable recovery of smooth input-output maps with unknown levels of smoothness. Indeed, we show that the posterior distribution concentrates at the near minimax rate for \u03b1-H\u00f6lder smooth maps, performing as well as if we knew the smoothness level \u03b1 ahead of time. Our result sheds light on architecture design for deep neural networks, namely the choice of depth, width and sparsity level. These network attributes typically depend on unknown smoothness in order to be optimal. We obviate this constraint with the fully Bayes construction. As an aside, we show that SS-DL does not overfit in the sense that the posterior concentrates on smaller networks with fewer (up to the optimal number of) nodes and links. Our results provide new theoretical justifications for deep ReLU networks from a Bayesian point of view.", "full_text": "Posterior Concentration for Sparse Deep Learning\n\nNicholas G. Polson and Veronika Ro\u010dkov\u00e1\n\nBooth School of Business\n\nUniversity of Chicago\n\nChicago, IL 60637\n\nAbstract\n\nWe introduce Spike-and-Slab Deep Learning (SS-DL), a fully Bayesian alternative\nto dropout for improving generalizability of deep ReLU networks. This new type\nof regularization enables provable recovery of smooth input-output maps with\nunknown levels of smoothness. Indeed, we show that the posterior distribution\nconcentrates at the near minimax rate for \u03b1-H\u00f6lder smooth maps, performing\nas well as if we knew the smoothness level \u03b1 ahead of time. 
Our result sheds\nlight on architecture design for deep neural networks, namely the choice of depth,\nwidth and sparsity level. These network attributes typically depend on unknown\nsmoothness in order to be optimal. We obviate this constraint with the fully Bayes\nconstruction. As an aside, we show that SS-DL does not overfit in the sense that the\nposterior concentrates on smaller networks with fewer (up to the optimal number\nof) nodes and links. Our results provide new theoretical justifications for deep\nReLU networks from a Bayesian point of view.\n\n1 Introduction\n\nDeep learning constructs are powerful tools for pattern matching and prediction. Their empirical\nsuccess has been accompanied by a number of theoretical developments addressing (a) why and when\nneural networks generalize well, (b) when deep networks outperform shallow ones, and (c) which\nactivation functions and how many layers to use. Despite the flurry of research activity, there are still\nmany theoretical gaps in understanding why deep neural networks work so well. In this paper, we\nprovide several new insights by studying the speed of posterior concentration around the optimal\npredictor, and in doing so we make a contribution to the Bayesian literature on deep learning rates.\nBayesian non-parametric methods are proliferating rapidly in statistics and machine learning, but\ntheir theoretical study has not yet kept pace with their application. Lee (2000) showed consistency\nof posterior distributions over single-layer sigmoidal neural networks. Our contribution builds\non this work in three fundamental aspects: (a) we focus on deep rather than single-layer networks, (b) we\nfocus on rectified linear units (ReLU) rather than sigmoidal squashing functions, (c) deploying $\\ell_0$\nregularization, we show that the posterior converges at an optimal speed beyond the mere fact that it\nis consistent. 
To achieve these goals, we adopt a statistical perspective on deep learning through the\nlens of non-parametric regression.\nUsing deep versus shallow networks can be justified theoretically in many ways. First, while both\nshallow and deep neural networks (NNs) are universal approximators (i.e. can approximate any\ncontinuous multivariate function arbitrarily well on a compact domain), Mhaskar et al. (2017) show\nthat deep nets use exponentially fewer parameters to achieve the same level of approximation\naccuracy for compositional functions. Kolmogorov (1963) provided another motivation for\ndeep networks by showing that superpositions of univariate semi-affine functions provide a universal\nbasis for representing multivariate functions. Telgarsky (2016) provides examples of functions that\ncannot be represented efficiently with shallow networks and Kawaguchi et al. (2017) explain why\ndeep networks generalize well. In related work, Poggio et al. (2017) show how deep networks can\navoid the curse of dimensionality for compositional functions. These theoretical results are growing\nand our goal is to show how they can be leveraged to obtain posterior concentration rates for deep\nlearning. In particular, we will build on properties of deep ReLU networks characterized recently by\nSchmidt-Hieber (2017).\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nDeep ReLU activating functions can also be justified by theory. Evidence exists that training deep\nlearning proceeds best when the neurons are either off or operate in a linear way. For example,\nGlorot et al. (2011) show that ReLU functions outperform hyperbolic tangent or sigmoid squashing\nfunctions, both in terms of statistical and computational performance. The success of ReLUs has\nbeen partially attributed to their ability to avoid vanishing gradients and their expressibility. 
The\nattractive approximation properties of ReLUs were discussed by many authors including Telgarsky\n(2017), Vitushkin (1964) and Montufar et al. (2014). Schmidt-Hieber (2017) points out a very curious\naspect of ReLU activators: their composition can yield rate-optimal reconstructions of smooth\nfunctions of an arbitrary order, not only up to order 2 which would be expected from piecewise linear\napproximators.\nIt is commonly perceived (Goodfellow et al., 2016) that generalizability of neural networks can\nbe improved with regularization. Regularization, loosely defined as any modification to a learning\nalgorithm that is intended to reduce its test error but not its training error (Goodfellow et al., 2016),\ncan be achieved in many different ways. Beyond ReLU activators, another way to regularize is\nby adding noise to the learning process. For example, the dropout regularization (Srivastava et al.,\n2014) samples from (and averages over) thinned networks obtained by randomly dropping out nodes\ntogether with their connections. While motivated as stochastic regularization, dropout can be regarded\nas deterministic $\\ell_2$ regularization obtained by marginalizing out dropout noise (Wager, 2014). Dropout\naveraging over sparse architectures pertains, at least conceptually, to Bayesian model averaging\nunder spike-and-slab priors. Spike-and-slab approaches assign a prior distribution over sparsity\npatterns (models) and perform model averaging with posterior model probabilities as weights (George\nand McCulloch, 1993). Similarly to dropout, the spike-and-slab approach effectively switches off\nmodel coefficients. However, dropout averages out patterns using equal weights rather than posterior\nmodel probabilities. Our approach embeds $\\ell_0$ regularization within the layers of deep learning and\ncapitalizes on its connection to Bayesian subset selection. 
We exploit spike-and-slab constructions not\nnecessarily as a tool for model selection, but rather as a fully Bayesian alternative to dropout in order\nto (a) inject sparsity in deep learning to build stable network architectures, and (b) achieve adaptation to\nthe unknown aspects of the regression function in order to achieve near-minimax performance for\nestimating smooth regression surfaces.\nCasting deep ReLU networks with $\\ell_0$ penalization as a Bayesian hierarchical model, we study the\nspeed of posterior convergence around H\u00f6lder smooth regression functions. Our first result states\nthat, with properly chosen width, depth and sparsity level, the convergence rate is near minimax\noptimal when the smoothness is known. Going further, we show that adaptation to smoothness can\nbe achieved by assigning suitable complexity priors over the network width and sparsity.\nThe rest of the paper is outlined as follows. Section 2 describes our statistical framework for analyzing\ndeep learning predictors. Section 3 defines deep ReLU networks. Section 4 constructs an appropriate\nspike-and-slab regularization for deep learning. Section 5 reviews the posterior concentration framework\nand approximation rates of neural networks. Section 6 contains posterior concentration results for\nsparse deep ReLU networks. Finally, Section 7 concludes with a discussion.\n\n1.1 Notation\nThe \u03b5-covering number of a set $\\Omega$ for a semimetric $d$, denoted by $E(\\varepsilon; \\Omega; d)$, is the minimal number\nof $d$-balls of radius \u03b5 needed to cover set $\\Omega$. The notation $a_n \\lesssim b_n$ will be used to denote inequality\nup to a constant, where $a_n \\asymp b_n$ if $a_n \\lesssim b_n$ and $b_n \\lesssim a_n$. The symbol $\\lfloor a \\rfloor$ denotes the greatest\ninteger that is smaller than or equal to $a$, $\\propto$ is equality up to a constant and $\\|f\\|_\\infty$ is the supremum norm of\na function $f$.\n\n2 Deep Learning: A Statistical Framework\n\nDeep Learning, in its simplest form, reconstructs high-dimensional input-output mappings. 
To fix\nnotation, let $Y \\in \\mathbb{R}$ denote a (low dimensional) output and $x = (x_1, \\dots, x_p)' \\in [0, 1]^p$ a (high\ndimensional) set of inputs.\nFrom a machine learning viewpoint, predicting an outcome from a set of features is typically framed\nas noise-less non-parametric regression for recovering $f_0 : [0, 1]^p \\to \\mathbb{R}$. Given inputs $x_i$ of training\ndata and outputs $Y_i = f_0(x_i)$ for $1 \\le i \\le n$, the goal is to learn a deep learning architecture $\\widehat{f}^{DL}_B$\nsuch that $\\widehat{f}^{DL}_B(x) \\approx f_0(x)$ for $x \\notin \\{x_i\\}_{i=1}^n$. Training neural networks is then positioned as an\noptimization problem for finding values $\\widehat{B} \\in \\mathbb{R}^T$ that minimize empirical risk ($L_2$-recovery error on\ntraining data) together with a regularization term, i.e.\n\n$$\\widehat{B} = \\arg\\min_{B} \\sum_{i=1}^{n} [f_0(x_i) - f^{DL}_B(x_i)]^2 + \\phi(B), \\qquad (1)$$\n\nwhere $\\phi(B)$ is a penalty over the weights and offset parameters $B$. In practice, this is most often\ncarried out with some form of stochastic gradient descent (SGD) (see e.g. Polson and Sokolov (2017)\nfor an overview).\nFrom a statistical viewpoint, deep learning is often embedded within non-parametric regression where\nresponses are linked to fixed predictors in a stochastic fashion through\n\n$$Y_i = f_0(x_i) + \\varepsilon_i, \\qquad \\varepsilon_i \\overset{iid}{\\sim} N(0, 1), \\qquad 1 \\le i \\le n. \\qquad (2)$$\n\nWe define by $H^\\alpha_p = \\{f : [0, 1]^p \\to \\mathbb{R};\\ \\|f\\|_{H^\\alpha} < \\infty\\}$ the class of \u03b1-H\u00f6lder smooth functions on\na unit cube $[0, 1]^p$ for some $\\alpha > 0$, where $\\|f\\|_{H^\\alpha}$ is the H\u00f6lder norm. The true generative model,\ngiving rise to (2), will be denoted with $P^{(n)}_{f_0}$. 
Assuming $f_0 \\in H^\\alpha_p$, we want to reconstruct $f_0$ with\n$\\widehat{f}^{DL}_B$ so that the empirical $L_2$ distance\n\n$$\\|\\widehat{f}^{DL}_B - f_0\\|_n^2 = \\frac{1}{n} \\sum_{i=1}^{n} [\\widehat{f}^{DL}_B(x_i) - f_0(x_i)]^2$$\n\nis at most a constant multiple away from the minimax rate $\\varepsilon_n = n^{-\\alpha/(2\\alpha+p)}$ (up to a log factor).\nUnlike related statistical developments (Schmidt-Hieber (2017), Bauer and Kohler (2017)), we\napproach the reconstruction problem from a purely Bayesian point of view. While the optimization\nproblem (1) has a Bayesian interpretation as MAP estimation under regularization priors, here we\nstudy the behavior of the entire posterior, not just its mode.\nOur approach rests on careful constructions of prior distributions $\\pi(f^{DL}_B)$ over deep learning\narchitectures. In Bayesian non-parametrics, the quality of priors can often be quantified with the speed at\nwhich the posterior distribution shrinks around the true regression function as $n \\to \\infty$. Ideally, most\nof the posterior mass should be concentrated in a ball centered around the true value $f_0$ with a radius\nproportional to the minimax rate $\\varepsilon_n$. These statements are ultimately framed in a frequentist way,\ndescribing the typical behavior of the posterior under the true generative model $P^{(n)}_{f_0}$.\nIn the construction of deep learning priors, a few questions emerge. How does one choose the\narchitecture $f^{DL}_B$: how deep and what activation functions? The choice typically depends on how\nquickly one can reconstruct $f_0$. We focus on deep ReLU networks, motivated by the following\nexample of Mhaskar et al. (2017, remark 8).\n\n2.1 Motivating Example\n\nMhaskar et al. (2017, remark 8) show that the bivariate function $f_{10}(x_1, x_2) = (x_1^2 x_2^2 - x_1^2 x_2 + 1)^{2^{10}}$\ncan be approximated more efficiently by a deep ReLU neural net than a shallow combination of ridge\nfunctions. 
To verify this observation, we simulate data from the following polynomial\n\n$$f_1(x_1, x_2) = (x_1^2 x_2^2 - x_1^2 x_2 + 1)^2,$$\n\nwhere $(x_1, x_2)$ take values in $[-1, 1]^2$. We discretize the domain on a $201 \\times 201$ grid, for a total of\n40401 training observations. There exists an exact Kolmogorov representation for this function as a\nsuperposition of semi-affine functions if we use the identities for the inner polynomial functions\n\n$$x_1^2 x_2 = \\frac{1}{4}(x_1^2 + x_2)^2 - \\frac{1}{4}(x_1^2 - x_2)^2, \\qquad (3)$$\n\n$$(x_1 x_2)^2 = \\frac{1}{4}(x_1 + x_2)^4 + \\frac{7}{4 \\cdot 3^3}(x_1 - x_2)^4 - \\frac{1}{2 \\cdot 3^3}(x_1 + 2x_2)^4 - \\frac{2^3}{3^3}\\left(x_1 + \\tfrac{1}{2}x_2\\right)^4. \\qquad (4)$$\n\nFollowing the theoretical results of Mhaskar et al. (2017), we build an 11-layer deep ReLU network to\napproximate this polynomial. There are 9 units in the first hidden layer and 3 units in each of the remaining\nlayers. All activation functions are ReLU. For comparison, we also build a shallow network with\nonly 1 hidden layer but 2048 units. The MSEs for the two models, both trained with SGD in TensorFlow\nand Keras, are: 11 layers, 39 units with MSE(train) = 0.0229, MSE(validation) = 0.0112\nand 1 layer, 2048 units with MSE(train) = 0.0441, MSE(validation) = 0.09. Both models\noutperform random forests.\n\n3 Deep ReLU Networks\n\nWe now formally describe the generative model that gives rise to deep rectified linear unit networks.\nTo fix notation, we write a deep neural network $f^{DL}_B(x)$ as an iterative mapping specified by\nhierarchical layers of abstraction. With $L \\in \\mathbb{N}$ we denote the number of hidden layers and with\n$p_l \\in \\mathbb{N}$ the number of neurons at the $l$th layer. Setting $p_0 = p$ and $p_{L+1} = 1$, we denote with\n$\\mathbf{p} = (p_0, \\dots, p_{L+1})' \\in \\mathbb{N}^{L+2}$ the vector of neuron counts for the entire network. The deep network\nis then characterized by a set of model parameters\n\n$$B = \\{(W^1, b^1), (W^2, b^2), \\dots, (W^L, b^L)\\}, \\qquad (5)$$\n\n
where $b^l \\in \\mathbb{R}^{p_l}$ are shift vectors and $W^l$ are $p_l \\times p_{l-1}$ weight matrices that link neurons between the\n$(l-1)$th and $l$th layers. Nodes in the ReLU network are connected through the following activation\nfunction $\\sigma_b : \\mathbb{R}^r \\to \\mathbb{R}^r$,\n\n$$\\sigma_b\\big((y_1, y_2, \\dots, y_r)'\\big) = \\big(\\sigma(y_1 - b_1), \\sigma(y_2 - b_2), \\dots, \\sigma(y_r - b_r)\\big)',$$\n\nwhere $\\sigma(x) = \\mathrm{ReLU}(x) = \\max(x, 0)$ denotes the rectified linear unit activation function.\nDeep ReLU neural networks with $L$ layers and a vector of $\\mathbf{p}$ hidden nodes define an input-output\nmap $f^{DL}_B(x) : \\mathbb{R}^p \\to \\mathbb{R}$ of the form\n\n$$f^{DL}_B(x) = W^{L+1} \\sigma_{b^L}\\big(W^L \\sigma_{b^{L-1}} \\dots \\sigma_{b^1}(W^1 x)\\big). \\qquad (6)$$\n\nThe representation (6) casts neural networks as nested embeddings that allow one to express the data flow\nthrough a network using variable-size data structures. Varying the number of active neurons allows a\nmodel to control the effective dimensionality for a given input and achieve desired approximation\naccuracy. Similarly to Schmidt-Hieber (2017), we will focus on a specific type of network with an\nequal number of hidden neurons, i.e. $p_l = 12pN$ for each $1 \\le l \\le L$ for some $N \\in \\mathbb{N}$. We will see\nlater in Section 5 that the optimal network width multiplier $N$ should relate to the dimensionality $p$\nand smoothness \u03b1.\n\n4 Spike-and-Slab Regularization\n\nWe focus on uniformly bounded $s$-sparse deep nets with bounded parameters\n\n$$\\mathcal{F}(L, \\mathbf{p}, s) = \\{f^{DL}_B(x) \\text{ as in (6)} : \\|f^{DL}_B\\|_\\infty < F \\text{ and } \\|B\\|_\\infty \\le 1 \\text{ and } \\|B\\|_0 \\le s\\},$$\n\nwhere $s \\in \\mathbb{N}$ is the sparsity level, i.e. 
an upper bound on the number of edges in the network, and\nwhere $F > 0$.\nThe amount of regularization needed to achieve optimal performance typically depends on unknown\nproperties of functions one wishes to approximate such as their smoothness, compositional pattern\nand/or the number of variables they depend on. Hierarchical Bayes procedures have the potential to\nbecome fully adaptive and achieve (nearly) minimax performance, as if one knew these properties\nahead of time. We will leverage the fully Bayes framework and devise a hierarchical procedure which\ncan learn the optimal level of sparsity needed to achieve near-minimax rates of posterior convergence\nof neural networks. The cornerstone of this development will be the spike-and-slab framework.\nDenote with\n\n$$T = \\sum_{l=0}^{L} p_{l+1}(p_l + 1) - p_{L+1} < (12pN + 1)^{L+1} \\qquad (7)$$\n\nthe number of parameters in a fully connected network with $L$ layers and a vector of $\\mathbf{p}$ neurons,\nwhere the inequality in (7) holds when $L \\ge 2$ and $12pN \\ge 2$. We treat the stacked vector of model\ncoefficients $B = (\\beta_1, \\dots, \\beta_T)'$ in (5) as a random vector arising from the spike-and-slab prior\ndefined hierarchically through\n\n$$\\pi(\\beta_j \\mid \\gamma_j) = \\gamma_j \\widetilde{\\pi}(\\beta_j) + (1 - \\gamma_j) \\delta_0(\\beta_j), \\qquad (8)$$\n\nwhere\n\n$$\\widetilde{\\pi}(\\beta) = \\frac{1}{2} I_{[-1,1]}(\\beta)$$\n\nis a uniform prior on an interval $[-1, 1]$, $\\delta_0(\\beta)$ is a Dirac spike at zero, and where $\\gamma_j \\in \\{0, 1\\}$ indicates\nwhether or not $\\beta_j$ is nonzero. We collate the binary indicators into a vector $\\gamma = (\\gamma_1, \\dots, \\gamma_T)' \\in \\{0, 1\\}^T$\nthat encodes the connectivity pattern. 
We assume that, given the sparsity level $s = |\\gamma|$, all\narchitectures are equally likely a priori, i.e.\n\n$$\\pi(\\gamma \\mid s) = \\frac{1}{\\binom{T}{s}}. \\qquad (9)$$\n\nThe sparsity level $s$ will be first treated as fixed and later assigned a prior with exponential decay.\nThe spike-and-slab construction (8) and (9) has been studied in linear models by Castillo and van\nder Vaart (2012) and in trees/forests by Rockova and van der Pas (2017), who showed that with a\nsuitable prior on $s$, the posterior can adapt to the unknown level of sparsity. We establish a very\nsimilar property for our proposed spike-and-slab deep learning.\nIt is worthwhile to point out that the prior (8) effectively zeroes out individual links rather than entire\ngroups of links attached to one node. The latter approach was explored by Ghosh and Doshi-Velez\n(2017), who suggested assigning a Horseshoe prior on the node preactivators, diminishing the influence\nof individual neurons. The dropout procedure is also motivated as erasing nodes rather than links.\n\n5 Posterior Concentration for Deep Learning\n\nReconstruction of $f_0$ from the training data $(Y_i, x_i)_{i=1}^n$ can be achieved using a Bayesian approach.\nThis requires placing a prior measure $\\Pi(\\cdot)$ on $\\mathcal{F}(L, \\mathbf{p}, s)$, the set of qualitative guesses of $f_0$. Given\nobserved data $Y^{(n)} = (Y_1, \\dots, Y_n)'$, inference about $f_0$ is then carried out via the posterior distribution\n\n$$\\Pi(A \\mid Y^{(n)}, \\{x_i\\}_{i=1}^n) = \\frac{\\int_A \\prod_{i=1}^n \\Pi_f(Y_i \\mid x_i)\\, d\\Pi(f)}{\\int \\prod_{i=1}^n \\Pi_f(Y_i \\mid x_i)\\, d\\Pi(f)} \\qquad \\forall A \\in \\mathcal{B},$$\n\nwhere $\\mathcal{B}$ is a \u03c3-field on $\\mathcal{F}(L, \\mathbf{p}, s)$ and where $\\Pi_f(Y_i \\mid x_i)$ is the likelihood function for the output $Y_i$\nunder $f$.\nOur goal is to determine how fast the posterior probability measure concentrates around $f_0$ as\n$n \\to \\infty$. 
This speed can be assessed by inspecting the size of the smallest $\\|\\cdot\\|_n$-neighborhoods\naround $f_0$ that contain most of the posterior probability (Ghosal and van der Vaart, 2007). For a\ndiameter $\\varepsilon > 0$ and some $M > 0$, we denote with\n\n$$A_{\\varepsilon, M} = \\{f^{DL}_B \\in \\mathcal{F}(L, \\mathbf{p}, s) : \\|f^{DL}_B - f_0\\|_n \\le M \\varepsilon\\}$$\n\nthe $M\\varepsilon$-neighborhood centered around $f_0$. Our goal is to show that\n\n$$\\Pi(A^c_{\\varepsilon_n, M_n} \\mid Y^{(n)}) \\to 0 \\quad \\text{in } P^{(n)}_{f_0}\\text{-probability as } n \\to \\infty \\qquad (10)$$\n\nfor any $M_n \\to \\infty$ and for $\\varepsilon_n \\to 0$ such that $n \\varepsilon_n^2 \\to \\infty$. We will position our results using\n$\\varepsilon_n = n^{-\\alpha/(2\\alpha+p)} \\log^{\\delta}(n)$ for some $\\delta > 0$, the near-minimax rate for a $p$-dimensional \u03b1-smooth\nfunction. Proof techniques for statements of type (10) were established in several pioneering works\nincluding Ghosal, Ghosh and van der Vaart (2000), Ghosal and van der Vaart (2007), Shen and\nWasserman (2001), Wong and Shen (1995), Walker et al. (2007).\nThe statement (10) can be proved by verifying the following three conditions (suitably adapted from\nTheorem 4 of Ghosal and van der Vaart (2007)):\n\n$$\\sup_{\\varepsilon > \\varepsilon_n} \\log E\\left(\\tfrac{\\varepsilon}{36}; A_{\\varepsilon, 1} \\cap \\mathcal{F}_n; \\|\\cdot\\|_n\\right) \\le n \\varepsilon_n^2 \\qquad (11)$$\n\n$$\\Pi(A_{\\varepsilon_n, 1}) \\ge e^{-d\\, n \\varepsilon_n^2} \\qquad (12)$$\n\n$$\\Pi(\\mathcal{F} \\backslash \\mathcal{F}_n) = o(e^{-(d+2)\\, n \\varepsilon_n^2}) \\qquad (13)$$\n\nfor some $d > 2$. Above, $\\mathcal{F}_n \\subseteq \\mathcal{F}(L, \\mathbf{p}, s)$ is an approximating space (sieve) that captures the essence\nof the parameter space. Condition (11) restricts the size of the model as measured by the Le Cam\ndimension (or local entropy). The Le Cam dimension, defined here in terms of the log-covering\nnumber of $A_{\\varepsilon, 1} \\cap \\mathcal{F}_n$, gives rise to the minimax rate of convergence under certain conditions (Le Cam,\n1973). 
The sieve should not be too large (Condition (11)), it should be rich enough to approximate $f_0$\nwell and it should receive most of the prior mass (Condition (13)).\nThe prior concentration Condition (12) is needed to make sure that the prior rewards shrinking\nneighborhoods of $f_0$. This requirement is a bit at odds with Condition (11). The richer the model\nclass (i.e. the more layers/neurons), the better the approximation to $f_0$. It is essential that the prior\nis supported on models that are good approximators, but that do not overfit. It is commonly agreed\n(Ghosal and van der Vaart, 2007) that the approximation gap should be no larger than a constant\nmultiple of $\\varepsilon_n$. Below, we review some known results about expressibility of neural networks to\nget insights into how many layers/neurons are needed to achieve the desired level of approximation\naccuracy.\n\n5.1 Function Class Approximation Rates\n\nThere is an extensive literature on the approximation properties of neural nets. Many tight approximation\nresults are available for simple functions such as indicators $f(x) = I_B(x)$ where $B$ is a unit ball\n(Cheang and Barron, 2000) or a half-space (Cheang (2010), Kainen et al. (2003, 2007) and K\u016frkov\u00e1\net al. (1997)). Recent results on the efficiency of ridge NNs (which arise as shallow learners of the\nform $f = \\sum_{j=1}^{n} a_j \\sigma(w_j^T x - b_j)$ for sigmoidal $\\sigma(\\cdot)$) are available in Ismailov (2017), Klusowski and\nBarron (2016, 2017). 
Pinkus (1999) and Petrushev (1999) provide some of the early bounds.\nIn general, one tries to characterize the asymptotic behavior of the approximation error as follows:\n\n$$\\|f_0 - \\widehat{f}\\| = O(N^{-\\alpha/p}) \\iff \\|f_0 - \\widehat{f}\\| \\le \\varepsilon \\text{ where } N = O(\\varepsilon^{-p/\\alpha}), \\qquad (14)$$\n\nwhere $f_0$ is a real-valued \u03b1-smooth function, $\\widehat{f}$ is the neural-network reconstruction and where $N$ is\nthe \u201csize\u201d of the network (typically the number of hidden nodes). Different bounds can be obtained\nfor different classes of $f_0$ and different norms $\\|\\cdot\\|$. The goal is to assess how complex the network\nought to be for it to approximate $f_0$ well (up to a constant multiple of $\\varepsilon_n$).\nFor deep networks, one also wants to find the asymptotic behavior of the approximation error as a\nfunction of depth, not only its size. The following Lemma will be an essential building block in the\nproof of our main theorem. It summarizes the expressibility of deep ReLU networks by linking their\napproximation error (when estimating H\u00f6lder smooth functions) to the network depth, width and\nsparsity.\nLemma 5.1. (Schmidt-Hieber, 2017) Assume that $f_0 \\in H^\\alpha_p$ for some $\\alpha > 0$. Then for any\n$N \\ge (\\alpha+1)^p \\vee (\\|f_0\\|_{H^\\alpha} + 1)$ there exists a neural network $\\widehat{f} \\in \\mathcal{F}(L^\\star, \\mathbf{p}^{L^\\star}_N, s^\\star)$, where $\\mathbf{p}^{L^\\star}_N = (p, 12pN, \\dots, 12pN, 1)'$, with\n\n$$L^\\star = 8 + (\\lfloor \\log_2(n) \\rfloor + 5)(1 + \\lceil \\log_2 p \\rceil) \\qquad (15)$$\n\nlayers and sparsity level $s^\\star$ satisfying\n\n$$s^\\star \\le 94\\, p^2 (\\alpha + 1)^{2p} N (L^\\star + \\lceil \\log_2 p \\rceil) \\qquad (16)$$\n\nsuch that\n\n$$\\|\\widehat{f} - f_0\\|_\\infty \\le (2\\|f_0\\|_{H^\\alpha} + 1)\\, 3^{p+1} \\frac{N}{n} + \\|f_0\\|_{H^\\alpha} 2^{\\alpha} N^{-\\alpha/p}.$$\n\nProof. 
Apply Theorem 3 of Schmidt-Hieber (2017) with $m = \\lfloor \\log_2(n) \\rfloor$.\nRemark 5.1. In a related result, Yarotsky (2017) shows that there exists a ReLU network that satisfies\n$\\|f - \\widehat{f}^{DL}\\|_\\infty \\le \\varepsilon$ with sparsity $s = c \\cdot \\varepsilon^{-p/\\alpha} / \\log_2(1/\\varepsilon) + 1$ and depth $L = c \\cdot (\\log_2(1/\\varepsilon) + 1)$ where\n$c = c(p, \\alpha)$. Petersen and Voigtlaender (2017) extend this result to L2-smooth functions.\nWe assume that $p = O(1)$ as $n \\to \\infty$. Lemma 5.1 essentially states that in order to approximate\nan \u03b1-H\u00f6lder smooth function with an error that is at most a constant multiple of $\\varepsilon_n$, we have\nto choose $L \\asymp \\log(n)$ layers with sparsity $s \\le C_S \\lfloor n^{p/(2\\alpha+p)} \\rfloor$. This follows by setting\n$N = C_N \\lfloor n^{p/(2\\alpha+p)} / \\log(n) \\rfloor$.\n\n6 Posterior Concentration for Sparse ReLU Networks\n\nWe formalize large sample statistical properties of posterior distributions over ReLU networks. First,\nwe consider a hierarchical prior distribution on $\\mathcal{F}(L, \\mathbf{p}, s)$, keeping $L$, $\\mathbf{p}$ and $s$ fixed as if they were\nknown. The prior distribution now only consists of the prior on the connectivity pattern (9) and the\nspike-and-slab prior on the weights/offsets (8).\nOur first result provides guidance for calibrating Bayesian deep sparse ReLU networks (choosing the\nsparsity level and the number of neurons) when the level of smoothness \u03b1 is known. The result can be\nregarded as a Bayesian analogue of Theorem 1 of Schmidt-Hieber (2017), who showed near-minimax\nrate-optimality of a sparse multilayer ReLU network estimator that minimizes empirical least-squares.\nThis was the first result on rate-optimality of deep ReLU networks in non-parametric regression,\nobtained assuming that the sparsity $s$ is known and that the function $f_0$ is a composition of H\u00f6lder\nfunctions. 
We build on this result and show that the entire posterior distribution for deep sparse ReLU\nneural networks concentrates at the near-minimax rate, when \u03b1 is known and when $f_0$ is a H\u00f6lder\nsmooth function. In the next section, we provide an adaptive result which no longer requires the\nknowledge of \u03b1.\nTheorem 6.1. (Deep ReLUs are near-minimax.) Assume $f_0 \\in H^\\alpha_p$, where $p = O(1)$ as $n \\to \\infty$,\n$\\alpha < p$ and $\\|f_0\\|_\\infty \\le F$. Let $L^\\star$ be as in (15), $s^\\star$ as in (16) and $\\mathbf{p}^\\star = (p, 12pN^\\star, \\dots, 12pN^\\star, 1)' \\in \\mathbb{N}^{L^\\star+2}$,\nwhere $N^\\star = C_N \\lfloor n^{p/(2\\alpha+p)} / \\log(n) \\rfloor$. Then the posterior probability concentrates at the\nrate $\\varepsilon_n = n^{-\\alpha/(2\\alpha+p)} \\log^{\\delta}(n)$ for $\\delta > 1$ in the sense that\n\n$$\\Pi(f^{DL}_B \\in \\mathcal{F}(L^\\star, \\mathbf{p}^\\star, s^\\star) : \\|f^{DL}_B - f_0\\|_n > M_n \\varepsilon_n \\mid Y^{(n)}) \\to 0 \\qquad (17)$$\n\nin $P^{(n)}_{f_0}$ probability as $n \\to \\infty$ for any $M_n \\to \\infty$.\n\nProof. See the Supplementary Materials.\nRemark 6.1. In Theorem 6.1, we do not need to construct a sieve $\\mathcal{F}_n$, because $s$ and $N$ are fixed.\nWe can simply take $\\mathcal{F}_n = \\mathcal{F}(L^\\star, \\mathbf{p}^\\star, s^\\star)$ in which case $\\mathcal{F} \\backslash \\mathcal{F}_n = \\emptyset$ and (13) holds trivially.\n\nTheorem 6.1 continues the line of theoretical investigation of Bayesian machine learning procedures.\nLee (2000) obtained posterior consistency for single-layer sigmoidal networks. van der Pas and\nRockova (2017) and Rockova and van der Pas (2017) obtained concentration results for Bayesian\nregression trees and forests. Compared to these developments, deep neural networks (NN) seem to be\nmore flexible in estimating smooth regression functions. Indeed, trees or forests are ultimately step\nfunction approximators and, as such, are near-minimax only for $0 < \\alpha \\le 1$ (Rockova and van der\nPas, 2017). 
As we have shown in Theorem 6.1, Bayesian deep ReLU networks are near-minimax\nwhen $0 < \\alpha < p$, where $p$ can be much larger than 1. One practical implication is that one would\nexpect NNs to outperform trees for very smooth objects.\n\n6.1 Bayesian Deep Learning Adapts to Smoothness\n\nTheorem 6.1 was conceived for network architectures that are optimally tuned for \u03b1 that is fixed as if\nit were known. However, such oracle information is rarely available, rendering the result less relevant\nfor practical design of networks. In this section, we devise a hierarchical prior construction (by\nendowing the unknown network parameters with suitable priors), under which the posterior performs\nas well as if we knew \u03b1.\nFrom the previous section (and discussion in Schmidt-Hieber (2017)), we know that the number of\nlayers $L$ can be chosen without the knowledge of smoothness \u03b1. We will thus continue to assume\nthat the number of layers is fixed and equal to $L^\\star$ in (15).\nBoth the network width $N$ and sparsity level $s$ were chosen in an \u03b1-dependent way. To obviate this\nconstraint, we treat them as unknown with the following priors. For the network width multiplier $N$,\nwe deploy\n\n$$\\pi(N) = \\frac{\\lambda^N}{(e^{\\lambda} - 1) N!} \\quad \\text{for } N = 1, 2, \\dots \\quad \\text{for some } \\lambda > 0. \\qquad (18)$$\n\nThe prior (18) is one of the classical complexity priors used frequently in the Bayesian non-parametric\nliterature (Coram and Lalley (2006), Liu et al. (2017), Rockova and van der Pas (2017)). Similarly,\nthe sparsity level $s$ will be now treated as unknown with the following prior\n\n$$\\pi(s) \\propto e^{-\\lambda_s s} \\quad \\text{for } \\lambda_s > 0. \\qquad (19)$$\n\nDenote with $\\mathbf{p}^{L^\\star}_N = (p, 12pN, \\dots, 12pN, 1)' \\in \\mathbb{N}^{L^\\star+2}$ the now random vector of network widths that\ndepend on $N$ and $L^\\star$. 
Our parameter space now consists of shells of sparse deep nets with different\nwidths and sparsity levels, i.e.\n\n$$\\mathcal{F}(L^\\star) = \\bigcup_{N=1}^{\\infty} \\bigcup_{s=0}^{T} \\mathcal{F}(L^\\star, \\mathbf{p}^{L^\\star}_N, s),$$\n\nwhere $T$ is the number of links in a fully connected network (defined in (7)). We will design an\napproximating sieve as follows:\n\n$$\\mathcal{F}_n = \\bigcup_{N=1}^{N_n} \\bigcup_{s=0}^{s_n} \\mathcal{F}(L^\\star, \\mathbf{p}^{L^\\star}_N, s) \\qquad (20)$$\n\nfor some suitable $N_n \\in \\mathbb{N}$ and $s_n \\le T$. Following our discussion earlier in this section, the sieve $\\mathcal{F}_n$\nshould be rich enough to include networks that approximate well. To this end, we choose $N_n$ and $s_n$\nsimilar to the \u201coptimal choices\u201d obtained from the fixed \u03b1 case, i.e.\n\n$$N_n = \\lfloor \\widetilde{C}_N\\, n^{p/(2\\alpha+p)} \\log^{2\\delta-1}(n) \\rfloor \\asymp n \\varepsilon_n^2 / \\log n \\quad \\text{and} \\quad s_n = \\lfloor L^\\star N_n \\rfloor \\asymp n \\varepsilon_n^2$$\n\nfor $\\widetilde{C}_N > 0$. With these choices, we show that the posterior distribution concentrates at the same rate\nas before, but without assuming knowledge of \u03b1.\nTheorem 6.2. (Deep ReLUs adapt to smoothness.) Assume $f_0 \\in H^\\alpha_p$, where $p = O(1)$ as $n \\to \\infty$,\n$\\alpha < p$, and $\\|f_0\\|_\\infty \\le F$. Let $L^\\star$ be as in (15) and assume priors (18) and (19). Then the posterior\nprobability concentrates at the rate $\\varepsilon_n = n^{-\\alpha/(2\\alpha+p)} \\log^{\\delta}(n)$ for $\\delta > 1$ in the sense that\n\n$$\\Pi(f^{DL}_B \\in \\mathcal{F}(L^\\star) : \\|f^{DL}_B - f_0\\|_n > M_n \\varepsilon_n \\mid Y^{(n)}) \\to 0 \\qquad (21)$$\n\nin $P^{(n)}_{f_0}$ probability as $n \\to \\infty$ for any $M_n \\to \\infty$.\n\nProof. See the Supplementary Materials.\nTheorem 6.2 has a very important implication. It shows that, once we assign suitable complexity\npriors over the network size and sparsity, we can perform as well as if we knew the smoothness \u03b1.\nThis type of adaptation for deep learning is, to the best of our knowledge, a new phenomenon. 
It\noriginates from the fully Bayesian treatment of deep learning. Similar adaptations were obtained for\nBayesian forests (Rockova and van der Pas (2017)), where the adaptation costs only a small fraction\nof the log factor. Here, we have the same rate as in the non-adaptive case, suggesting that the analysis\ncould potentially be refined to obtain a sharper rate when \u03b1 is known.\nWe conclude the paper with the following important corollary stating that Bayesian deep ReLU\nnetworks with adaptive spike-and-slab priors do not overfit in the sense that the posterior probability\nof using more than the optimal number of nodes and links goes to zero as n \u2192 \u221e.\nCorollary 6.1. (Deep ReLUs do not overfit.) Under the assumptions in Theorem 6.2 we have\n\n$$\\Pi(N > N_n \\mid Y^{(n)}) \\to 0 \\quad \\text{and} \\quad \\Pi(s > s_n \\mid Y^{(n)}) \\to 0$$ (22)\n\nin $P^n_0$ probability as n \u2192 \u221e.\n\nProof. This statement follows from Lemma 1 of Ghosal and van der Vaart (2007) and holds upon the\nsatisfaction of the conditions\n\n$$\\Pi(N > N_n) = o(e^{-(d+2)n\\varepsilon_n^2}) \\quad \\text{and} \\quad \\Pi(s > s_n) = o(e^{-(d+2)n\\varepsilon_n^2})$$\n\nthat are verified in Supplementary Materials.\nThe key observation behind Corollary 6.1 is that the posterior does not overshoot in terms of the\nwidth and sparsity, rewarding only small networks that are sparse. That is, the posterior concentrates\non networks with up to the optimal number $s_n$ of links. This is purely a by-product of Bayesian\nregularization and, again, this property does not rely on any oracle information about \u03b1.\n\n6.2 Implementation Considerations\n\nFor a fixed architecture (i.e. 
N is non-random) and continuous spike-and-slab priors, one could\nperform an Expectation-Maximization algorithm by iteratively (a) deploying SGD with \u21131/\u21132 regularization\nand coefficient-specific penalties (M-step) and (b) computing the conditional probability that\nthe coefficient is non-negligible (E-step). The E-step is inexpensive and determines, one coefficient\nat a time, how much shrinkage should be deployed. The M-step can be readily obtained with existing\nsoftware. Such an EM strategy has been successfully deployed in linear models (Rockova and George\n(2014, 2018)) and related strategies have already been deployed for neural networks (via Variational\nBayes by Ullrich et al. (2017) or with Bayes by Backprop by Blundell et al. (2015)). The optimization\nstrategy is feasible for Gaussian/Laplace mixtures, which are continuous approximations of the\npoint-mass mixture prior that we analyze. Turning optimization into posterior sampling is feasible\nwith a weighted Bayesian bootstrap (Newton, Polson and Xu (2018)). When a random weight is attached to\neach observation in the likelihood, the modes of the resulting posteriors constitute approximate samples from the original\n(unweighted) posterior. Adaptive architectures (when N is random) can be learned\nas well using ideas from Liu, Rockova and Wang (2018).\n\n7 Closing Remarks\n\nThe goal of this paper was to study posterior concentration for Bayesian deep learning and to provide\nnew theoretical justifications for neural networks from a Bayesian point of view. Our theoretical\nresults can be summarized in three points. First, in Theorem 6.1 we show that Bayesian deep ReLU\nnetworks can be near-minimax, if tuned properly. Second, in Theorem 6.2 we show that, by assigning\nsuitable complexity priors over the network architecture, Bayesian deep ReLU networks can be\nnear-minimax tuning-free. 
In other words, they can adapt to unknown smoothness, giving rise to\nposteriors that concentrate around smooth surfaces at near-minimax rates. Third, in Corollary 6.1 we\nprovide some arguments for why Bayesian deep ReLU networks are less eager to overfit. The key\ningredients for these results were (a) sparsity through spike-and-slab regularization and (b) complexity\npriors on the network width and sparsity level. Posterior concentration rate results of this type are\nnow slowly entering the machine learning community as a tool for (a) obtaining more insights into\nBayesian methods (van der Pas and Rockova (2017), Rockova and van der Pas (2017)) and (b) prior\ncalibrations.\nThere are many non-parametric methods that can achieve near-minimax recovery of H\u00f6lder smooth\nfunctions. The appeal of deep networks is their compositional structure, which makes them ideal for\nregression surfaces that are themselves compositions. Indeed, there is evidence that deep learning\nhas an exponential advantage over shallow networks for approximating compositions. Schmidt-Hieber\n(2017) showed that sparsely connected deep ReLU networks achieve a near-minimax rate in learning\nfor compositions of smooth functions. It is possible to adapt our techniques to obtain a Bayesian\nanalogue of his compositional result.\nGeneralizing the results to other activators is possible, provided that one can show that H\u00f6lder smooth\nmaps can be approximated well with sufficiently small networks. One could follow the general recipe\nfrom Section 5 for e.g. sigmoidal functions. We focused on ReLU activations since they are typically preferred\nover sigmoidal ones.\n\n8 Acknowledgments\n\nThis work was supported by the James S. Kemper Research Fund at the University of Chicago Booth\nSchool of Business. The authors would like to thank the anonymous referees and the area chair for\nuseful feedback.\n\n9 References\n\nBauer, B. and Kohler, M. (2017). 
On Deep Learning as a remedy for the curse of dimensionality in\nnonparametric regression. arXiv.\n\nBlundell, C., Cornebise, J., Kavukcuoglu, K. and Wierstra, D. (2015). Weight uncertainty in neural\nnetworks. International Conference on Machine\nLearning, 37, 1613-1622.\n\nCastillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: Posterior concentration for\npossibly sparse sequences. Annals of Statistics, 40, 2069-2101.\n\nCheang, G. H. (2010). Approximation with neural networks activated by ramp sigmoids. Journal of\nApproximation Theory, 162, 1450-1465.\n\nCheang, G. H. and Barron, A. R. (2000). A better approximation for balls. Journal of Approximation\nTheory, 104, 183-203.\n\nCoram, M. and Lalley, S. (2006). Consistency of Bayes estimators of a binary regression function.\nAnnals of Statistics, 34, 1233-1269.\n\nDinh, L., Pascanu, R., Bengio, S. and Bengio, Y. (2017). Sharp Minima Can Generalize For Deep\nNets. arXiv.\n\nGeorge, E.I. and McCulloch, R. (1993). Variable selection via Gibbs sampling. Journal of the\nAmerican Statistical Association, 88, 881-889.\n\nGhosal, S., Ghosh, J. and van der Vaart, A. (2000). Convergence rates of posterior distributions.\nAnnals of Statistics, 28, 500-531.\n\nGhosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for noniid\nobservations. Annals of Statistics, 35, 192-223.\n\nGhosh, S. and Doshi-Velez, F. (2017). Model selection in Bayesian neural networks via horseshoe\npriors. Advances in Neural Information Processing Systems.\n\nGlorot, X., Bordes, A. and Bengio, Y. (2011). Deep sparse rectifier neural networks. Proceedings of\nthe 14th International Conference on Artificial Intelligence and Statistics.\n\nGoodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press.\n\nIsmailov, V. (2017). Approximation by sums of ridge functions with fixed directions. St. 
Petersburg\nMathematical Journal, 28, 741-772.\n\nKainen, P. C., K\u016frkov\u00e1, V. and Vogt, A. (2003). Best approximation by linear combinations of\ncharacteristic functions of half-spaces. Journal of Approximation Theory, 122, 151-159.\n\nKainen, P. C., K\u016frkov\u00e1, V. and Vogt, A. (2007). A Sobolev-type upper bound for rates of approximation\nby linear combinations of Heaviside plane waves. Journal of Approximation Theory, 147,\n1-10.\n\nKawaguchi, K., Kaelbling, L. P. and Bengio, Y. (2017). Generalization in deep learning. arXiv.\n\nKlusowski, J.M. and Barron, A.R. (2016). Risk bounds for high-dimensional ridge function combinations\nincluding neural networks. arXiv.\n\nKlusowski, J.M. and Barron, A.R. (2017). Minimax lower bounds for ridge combinations including\nneural networks. arXiv.\n\nKolmogorov, A. (1963). On the representation of continuous functions of many variables by superposition\nof continuous functions of one variable and addition. American Mathematical Society\nTranslation, 28, 55-59.\n\nK\u016frkov\u00e1, V., Kainen, P. C. and Kreinovich, V. (1997). Estimates of the number of hidden units and\nvariation with respect to half-spaces. Neural Networks, 10, 1061-1068.\n\nLe Cam, L. (1973). Convergence of estimates under dimensionality restrictions. Annals of Statistics,\n1, 38-53.\n\nLee, H. (2000). Consistency of posterior distributions for neural networks. Neural Networks, 13,\n629-642.\n\nLiu, L., Li, D. and Wong, W.H. (2017). Convergence rates of a partition based Bayesian multivariate\ndensity estimation method. Advances in Neural Information Processing Systems, 30.\n\nLiu, Y., Rockova, V. and Wang, Y. (2018). ABC Bayesian Forests for Variable Selection. arXiv.\n\nMhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions.\nNeural Computation, 8(1), 164-177.\n\nMhaskar, H., Liao, Q. and Poggio, T. A. (2017). When and why are deep networks better than\nshallow ones? 
In AAAI, 2343-2349.\n\nMontufar, G.F., Pascanu, R., Cho, K. and Bengio, Y. (2014). On the number of linear regions of deep\nneural networks. Advances in Neural Information Processing Systems, 27, 2924-2932.\n\nNewton, M., Polson, N.G. and Xu, J. (2018). Weighted Bayesian bootstrap for scalable Bayes. arXiv.\n\nvan der Pas, S. and Rockova, V. (2017). Bayesian dyadic trees and histograms for regression.\nAdvances in Neural Information Processing Systems.\n\nPetersen, P. and Voigtlaender, F. (2017). Optimal approximation of piecewise smooth functions using\ndeep ReLU neural networks. arXiv.\n\nPetrushev, P. P. (1999). Approximation by ridge functions and neural networks. SIAM J. Math Anal.,\n30, 155-189.\n\nPinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica,\n143-195.\n\nPoggio, T., Mhaskar, H., Rosasco, L., Miranda, B. and Liao, Q. (2017). Why and when can deep-\nbut not shallow-networks avoid the curse of dimensionality: A review. International Journal of\nAutomation and Computing, 14, 503-519.\n\nPolson, N. and Sokolov, V. (2017). Deep learning: a Bayesian perspective. Bayesian Analysis, 12,\n1275-1304.\n\nRockova, V. and George, E.I. (2014). EMVS: The EM Approach to Bayesian Variable Selection.\nJournal of the American Statistical Association, 109, 828-846.\n\nRockova, V. and George, E.I. (2018). The Spike-and-Slab LASSO. Journal of the American Statistical\nAssociation, 113, 431-444.\n\nRockova, V. and van der Pas, S. (2017). Posterior Concentration for Bayesian Regression Trees and\ntheir Ensembles. arXiv.\n\nSchmidt-Hieber, J. (2017). Nonparametric regression using deep neural networks with ReLU\nactivation function. arXiv:1708.06633.\n\nShen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. Annals of\nStatistics, 29, 687-714.\n\nSrivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2015). Dropout: a\nsimple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research, 15,\n1929-1958.\n\nTelgarsky, M. (2016). Benefits of depth in neural networks. JMLR: Workshop and Conference\nProceedings, 49, 1-23.\n\nTelgarsky, M. (2017). Neural Networks and Rational functions. arXiv.\n\nUllrich, K., Meeds, E. and Welling, M. (2017). Soft weight-sharing for neural network compression.\nInternational Conference on Learning Representations.\n\nVitushkin, A. G. (1964). Proof of the existence of analytic functions of several complex variables\nwhich are not representable by linear superpositions of continuously differentiable functions of fewer\nvariables. Soviet Mathematics, 5, 793-796.\n\nWalker, S., Lijoi, A. and Prunster, I. (2007). On rates of Convergence of Posterior Distributions in\nInfinite Dimensional Models. Annals of Statistics, 35, 738-746.\n\nWager, S., Wang, S. and Liang, P. (2014). Dropout training as adaptive regularization. Advances in\nNeural Information Processing Systems.\n\nWong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates\nof sieve MLEs. Annals of Statistics, 23, 339-362.\n\nYarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks,\n94, 103-114.\n", "award": [], "sourceid": 514, "authors": [{"given_name": "Nicholas", "family_name": "Polson", "institution": "Chicago Booth"}, {"given_name": "Veronika", "family_name": "Ro\u010dkov\u00e1", "institution": "University of Chicago"}]}