{"title": "Is Deeper Better only when Shallow is Good?", "book": "Advances in Neural Information Processing Systems", "page_first": 6429, "page_last": 6438, "abstract": "Understanding the power of depth in feed-forward neural networks is an ongoing challenge in the field of deep learning theory. While current works account for the importance of depth for the expressive power of neural-networks, it remains an open question whether these benefits are exploited during a gradient-based optimization process.\nIn this work we explore the relation between expressivity properties of deep networks and the ability to train them efficiently using gradient-based algorithms. We give a depth separation argument for distributions with fractal structure, showing that they can be expressed efficiently by deep networks, but not with shallow ones.\nThese distributions have a natural coarse-to-fine structure, and we show that the balance between the coarse and fine details has a crucial effect on whether the optimization process is likely to succeed. We prove that when the distribution is concentrated on the fine details, gradient-based algorithms are likely to fail.\nUsing this result we prove that, at least in some distributions, the success of learning deep networks depends on whether the distribution can be approximated by shallower networks, and we conjecture that this property holds in general.", "full_text": "Is Deeper Better only when Shallow is Good?\n\nEran Malach\n\neran.malach@mail.huji.ac.il\n\nSchool of Computer Science\n\nThe Hebrew University\n\nJerusalem, Israel\n\nShai Shalev-Shwartz\n\nSchool of Computer Science\n\nThe Hebrew University\n\nJerusalem, Israel\n\nshais@cs.huji.ac.il\n\nAbstract\n\nUnderstanding the power of depth in feed-forward neural networks is an ongoing\nchallenge in the \ufb01eld of deep learning theory. 
While current works account for the importance of depth for the expressive power of neural-networks, it remains an open question whether these benefits are exploited during a gradient-based optimization process. In this work we explore the relation between expressivity properties of deep networks and the ability to train them efficiently using gradient-based algorithms. We give a depth separation argument for distributions with fractal structure, showing that they can be expressed efficiently by deep networks, but not with shallow ones. These distributions have a natural coarse-to-fine structure, and we show that the balance between the coarse and fine details has a crucial effect on whether the optimization process is likely to succeed. We prove that when the distribution is concentrated on the fine details, gradient-based algorithms are likely to fail. Using this result we prove that, at least in some distributions, the success of learning deep networks depends on whether the distribution can be approximated by shallower networks, and we conjecture that this property holds in general.

1 Introduction

A fundamental question in studying deep networks is understanding why and when "deeper is better". In recent years there has been a large number of works studying the expressive power of deep and shallow networks. The main goal of this research direction is to show families of functions or distributions that are realizable with deep networks of modest width, but require an exponential number of neurons to be approximated by shallow networks. We refer to such results as depth separation results. Many of these works consider various measures of "complexity" that can grow exponentially fast with the depth of the network, but not with the width. Hence, such measures provide a clear separation between deep and shallow networks. 
For example, the works of [11, 10, 9, 16] show that the number\nof linear regions grows exponentially with the depth of the network, but only polynomially with the\nwidth. The works of [13, 14, 20] give similar results for other complexity measures, such as the\ncurvature, the trajectory length or the number of oscillations of the output function.\nWhile such works give general characteristics of function families, they take a seemingly worst-case\napproach. Hence, it is not clear whether such analysis applies to the typical cases encountered in the\npractice of neural-networks. To answer this concern, recent works show depth separation results for\nnarrower families of functions that appear simple or \u201cnatural\u201d. For example, the work of [19] shows\na very simple construction of a function on the real line that exhibits depth separation. The works of\n[6, 15] show a depth separation argument for very natural functions, like the indicator function of\nthe unit ball. The work of [3] gives similar results for a richer family of functions. Other works by\n[8, 12] show that compositional functions, namely functions of functions, are well approximated by\ndeep networks. The works of [4, 2] show similar depth separation results for sum-product networks.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fWhile current works provide a variety of depth separation results, these are all focused on expressivity\nanalysis. This is unsatisfactory, as the fact that a certain network architecture can express some\nfunction does not mean that we can learn it from training data in reasonable training time. 
In fact, there is theoretical evidence showing that gradient-based algorithms can only learn a small fraction of the functions that are expressed by a given neural-network (e.g. [18]).

This paper relates expressivity properties of deep networks to the ability to train them efficiently using a gradient-based algorithm. We start by giving depth separation arguments for distributions with fractal structure. In particular, we show that deep networks are able to exploit the self-similarity property of fractal distributions, and thus realize such distributions with a small number of parameters. On the other hand, we show that shallow networks need a number of parameters that grows exponentially with the intrinsic "depth" of the fractal. The advantage of fractal distributions is that they exhibit a clear coarse-to-fine structure. We show that if the distribution is more concentrated on the "coarse" details of the fractal, then even though shallower networks cannot exactly express the underlying distribution, they can still achieve a good approximation. We introduce the notion of an approximation curve, which characterizes how the examples are distributed between the "coarse" details and the "fine" details of the fractal. The approximation curve captures the relation between the growth in the network's depth and the improvement in approximation.

We next go beyond pure expressivity analysis, and claim that the approximation curve plays a key role not only in approximation analysis, but also in predicting the success of gradient-based optimization algorithms. Specifically, we show that if the distribution is concentrated on the "fine" details of the fractal, then gradient-based optimization algorithms are likely to fail. 
In other words, the "stronger" the depth separation is (in the sense that shallow networks cannot even approximate the distribution), the harder it is to learn a deep network with a gradient-based algorithm. While we prove this statement for a specific fractal distribution, we state a conjecture aiming at formalizing this statement in a more general sense. Namely, we conjecture that a distribution which cannot be approximated by a shallow network cannot be learned using a gradient-based algorithm, even when using a deep architecture. We perform experiments on learning fractal distributions with deep networks trained with SGD and assert that the approximation curve has a crucial effect on whether depth efficiency is observed or not. These results provide new insights as to when such deep distributions can be learned.

Admittedly, this paper is focused on analyzing a family of distributions that is synthetic by nature. That said, we note that the conclusions from this analysis may be interesting for the broader effort of understanding the power of depth in neural-networks. As mentioned, we show that there exist distributions with the depth separation property that cannot be learned by gradient-based optimization algorithms. This result implies that any depth separation argument that does not consider the optimization process should be taken with a grain of salt. Additionally, our results hint that the success of learning deep networks depends on whether the distribution can be approximated by shallower networks. Indeed, this property is often observed in real-world distributions, where deeper networks perform better, but shallower networks exhibit good (if not perfect) performance. We demonstrate this behavior empirically on the CIFAR-10 dataset. 
The work of [1] shows similar behavior on the challenging ImageNet dataset, where a one-hidden layer network achieves 25% Top-5 accuracy (much better than a random guess), and a three-layer network already achieves 60%.

2 Preliminaries

Let X = R^d be the domain space and Y = {±1} be the label space. We consider distributions defined over sets generated by an iterated function system (IFS). An IFS is a method for constructing fractals, where a finite set of contraction mappings are applied iteratively, starting with some arbitrary initial set. Applying such a process ad infinitum generates a self-similar fractal. In this work we will consider sets generated by performing a finite number of iterations of such a process. We refer to the number of iterations of the IFS as the "depth" of the generated set.

Figure 1: IFS and fractal distributions.

Formally, an IFS is defined by a set of r contractive affine¹ transformations F = (F1, . . . , Fr), where Fi(x) = M(i)x + v(i) with full-rank matrix M(i) ∈ R^{d×d}, vector v(i) ∈ R^d, s.t. ‖Fi(x) − Fi(y)‖ < ‖x − y‖ for all x, y ∈ X (we use ‖·‖ to denote the ℓ2 norm, unless stated otherwise). We define the set Kn ⊆ X recursively by K0 = [−1, 1]^d and Kn = F1(Kn−1) ∪ ··· ∪ Fr(Kn−1). The IFS construction is shown in figure 1.

We define a "fractal distribution", denoted Dn, to be any balanced distribution over X × Y such that positive examples are sampled from the set Kn and negative examples are sampled from its complement. Formally, Dn = (1/2)(D+n + D−n), where D+n is a distribution over X × Y that is supported on Kn × {+1}, and D−n is a distribution over X × Y that is supported on (X \ Kn) × {−1}. 
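As a concrete illustration of the construction above, here is a minimal sketch (ours, not the authors' code) that samples points of Kn for a 1D IFS. The two maps below are an assumed example (middle-thirds maps on K0 = [−1, 1]) chosen so that they are contractive, have disjoint images, and keep the fractal inside K0:

```python
import random

# Illustrative 1D IFS: two contractive affine maps on K0 = [-1, 1].
# F[0] maps [-1, 1] onto [-1, -1/3]; F[1] maps it onto [1/3, 1].
F = [lambda x: x / 3 - 2 / 3,
     lambda x: x / 3 + 2 / 3]

def sample_Kn(n, rng=random):
    """Sample a point of Kn = F1(Kn-1) ∪ ... ∪ Fr(Kn-1) by applying
    n randomly chosen maps to a point of K0 = [-1, 1]."""
    x = rng.uniform(-1.0, 1.0)      # start from a point of K0
    for _ in range(n):
        x = rng.choice(F)(x)        # descend one level of the fractal
    return x

pts = [sample_Kn(5) for _ in range(500)]
# Kn stays inside K0, and inside K1 = [-1, -1/3] ∪ [1/3, 1]:
assert all(-1.0 <= p <= 1.0 for p in pts)
```

Sampling a deeper set only composes more of the same maps, which is the self-similarity that the deep-network construction of Section 3 exploits.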
Examples for such distributions are given in figure 1 and figure 2.

We consider the problem of learning fractal distributions with feed-forward ReLU neural-networks. A ReLU neural-network NW,B : X → Y of depth t and width k is defined recursively:

1. x(0) = x
2. x(t′) = σ(W(t′)x(t′−1) + b(t′)) for 1 ≤ t′ ≤ t − 1, with W(t′) ∈ R^{k×dim x(t′−1)}, b(t′) ∈ R^k
3. NW,B(x) := x(t) = W(t)x(t−1) + b(t), for W(t) ∈ R^{1×k}, b(t) ∈ R

where σ(x) := max(x, 0).

We denote by Hk,t the family of all functions that are implemented by a neural-network of width k and depth t. Given a distribution D over X × Y, we denote the error of a network h ∈ Hk,t on distribution D to be LD(h) := P_{(x,y)∼D}[sign(h(x)) ≠ y]. We denote the approximation error of Hk,t on D to be the minimal error of any such function: LD(Hk,t) := min_{h∈Hk,t} LD(h).

3 Expressivity and Approximation

In this section we analyze the expressive power of deep and shallow neural-networks w.r.t. fractal distributions. We show two results. The first is a depth separation property of neural-networks. Namely, we show that shallow networks need an exponential number of neurons to realize such distributions, while deep networks need only a number of neurons that is linear in the problem's parameters. The second result bounds the approximation error achieved by networks that are not deep enough to achieve zero error. 
This bound depends on the specific properties of the fractal distribution.

We analyze IFSs where the images of the initial set K0 under the different mappings do not overlap. This property allows the neural-network to "reverse" the process that generates the fractal structure. Additionally, we assume that the images of K0 (and therefore the entire fractal) are contained in K0, which means that the fractal does not grow in size. This is a technical requirement that could be achieved by correctly scaling the fractal at each step. While these requirements are not generally assumed in the context of IFSs, they hold for many common fractals. Formally, we assume:

Assumption 1 There exists ε > 0 such that for i ≠ j ∈ [r] it holds that d(Fi(K0), Fj(K0)) > ε, where d(A, B) = min_{x∈A,y∈B} ‖x − y‖.

Assumption 2 For each i ∈ [r] it holds that Fi(K0) ⊆ K0.

As in many problems in machine learning, we assume the positive and negative examples are separated by some margin. Specifically, we assume that the positive examples are sampled from strictly inside the set Kn, with margin γ from the set boundary. Formally, for some set A, we define A^γ to be the set of all points that are far from the boundary of A by at least γ: A^γ := {x ∈ A : Bγ(x) ⊆ A}, where Bγ(x) denotes a ball around x with radius γ. So our assumption is the following:

Assumption 3 There exists γ > 0 such that D+n is supported on K^γ_n × {+1}.

We note that this assumption is used only in the proof of Theorem 1 below. 
In fact, a result similar to Theorem 1 can be shown without Assumption 3, as given in a recent preprint by [5].

¹In general, IFSs can be constructed with non-linear transformations, but we discuss only affine IFSs.

3.1 Depth Separation

We show that neural-networks with depth linear in n (where n is the "depth" of the fractal) can achieve zero error on any fractal distribution satisfying the above assumptions, with only linear width. On the other hand, a shallow network needs a width exponential in n to achieve zero error on such distributions. We start by giving the following expressivity result for deep networks:

Theorem 1 For any distribution Dn there exists a neural-network of width 5dr and depth 2n + 1, such that LDn(NW,B) = 0.

We defer the proof of Theorem 1 to the appendix, and give here an intuition of how deep networks can express these seemingly complex distributions with a small number of parameters. Note that by definition, the set Kn is composed of r copies of the set Kn−1, mapped by different affine transformations. In our construction, each block of the network folds the different copies of Kn−1 on top of each other, while "throwing away" the rest of the examples (by mapping them to a distinct value). The next block can then perform the same thing on all copies of Kn−1 together, instead of decomposing each subset separately. This allows a very efficient utilization of the network parameters.

The above result shows that deep networks require a number of parameters that grows linearly with r, d and n in order to realize any fractal distribution. Now, we want to consider the case of shallower networks, when the depth is not large enough to achieve zero error with linear width. 
Specifically, we show that when decreasing the depth of the network by a factor of s, we can achieve zero error by allowing the width to grow like r^s:

Corollary 1 For any distribution Dn and every natural s ≤ n there exists a neural-network of width 5dr^s and depth 2⌊n/s⌋ + 2, such that LDn(NW,B) = 0.

This is an upper bound on the required width of a network that can realize Dn, for any given depth. To show the depth separation property, we show that a shallow network needs an exponential number of neurons to realize any distribution without "holes" (areas of non-zero volume with no examples from Dn, outside the margin area). This gives the equivalent lower bound on the required width:

Theorem 2 Let Dn be some fractal distribution, s.t. for every ball B ⊆ K^γ_n ∪ (X \ Kn) it holds that P_{(x,y)∼Dn}[x ∈ B] > 0. Then for every depth t and width k, s.t. k < (d/e) r^{n/(td)}, we have LDn(Hk,t) > 0.

The previous result shows that in many cases we cannot guarantee exact realization of "deep" distributions by shallow networks that are not exponentially wide. On the other hand, we show that in some cases we may be able to give good guarantees on approximating such distributions with shallow networks, when we take into account how the examples are distributed within the fractal structure. We will formalize this notion in the next part of this section.

3.2 Approximation Curve

Given distribution Dn, we define the approximation curve of this distribution to be the function P : [n] → [0, 1], where: P(j) = P_{(x,y)∼Dn}[x ∉ Kj or y = 1]. Notice that P(0) = 1/2, P(n) = 1, and that P is non-decreasing. The approximation curve P captures exactly how the negative examples are distributed between the different levels of the fractal structure. 
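The approximation curve can be estimated empirically from a labeled sample, given a membership test for the level sets. A minimal sketch (our illustration, not the paper's code), using as membership test the 1D middle-thirds Cantor levels on [0, 1] that appear later in Section 4:

```python
def in_Cj(x, j):
    """Membership test for the level sets of the middle-thirds Cantor
    construction on [0, 1]: fold x down one level at a time; a point in
    a removed middle third at some level is rejected."""
    for _ in range(j):
        if x <= 1 / 3:
            x = 3 * x
        elif x >= 2 / 3:
            x = 3 * x - 2
        else:
            return False        # x fell in a removed middle third
    return True

def approximation_curve(sample, n):
    """Empirical estimate of P(j) = Pr[x ∉ C_j or y = 1] for j = 0..n,
    from a list of (x, y) pairs with y in {+1, -1}."""
    m = len(sample)
    return [sum(1 for x, y in sample if not in_Cj(x, j) or y == 1) / m
            for j in range(n + 1)]

# Balanced toy sample: positives on the Cantor set, negatives in the
# first-level gap, so the curve jumps to 1 already at level 1 ("coarse").
sample = [(0.0, 1), (1.0, 1), (0.5, -1), (0.45, -1)]
assert approximation_curve(sample, 2) == [0.5, 1.0, 1.0]
```

As the definition requires, the estimate starts at 1/2 on a balanced sample and is non-decreasing in j.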
If P grows fast at the beginning, then the distribution is more concentrated on the low levels of the fractal (coarse details). If P stays flat until the end, then most of the weight is on the high levels (fine details). Figure 2 shows samples from two distributions over the same fractal structure, with different approximation curves.

Figure 2: 2D Cantor distributions of depth 5, negative examples in orange and positive in blue. The negative examples are concentrated in the middle rectangle, and not in all X \ Kn. Left: "coarse" approximation curve (curve #1). Right: "fine" approximation curve (curve #4). The curves are as shown in figure 5 in the Experiments section.

A simple argument shows that distributions concentrated on coarse details can be well approximated by shallower networks. The following theorem characterizes the relation between the approximation curve and the "approximability" by networks of growing depth:

Theorem 3 Let Dn be some fractal distribution with approximation curve P. Fix some j, s; then for Hk,t with depth t = 2⌊j/s⌋ + 2 and width k = 5dr^s, we have: LDn(Hk,t) ≤ 1 − P(j).

This shows that using the approximation curve of distribution Dn allows us to give an upper bound on the approximation error for networks that are not deep enough. We give a lower bound for this error in a more restricted case. We limit ourselves to the case where d = 1, and observe networks of width k < r^s for some s. Furthermore, we assume that the probability of seeing each subset of the fractal is the same. Then we get the following theorem:

Theorem 4 Assume that Dn is a distribution on R (d = 1). Note that for every j, Kj is a union of r^j intervals, and we denote Kj = ∪_{i=1}^{r^j} Ii for intervals Ii. 
Assume that the distribution over each interval is equal, so for every i, ℓ, y′: P_{(x,y)∼Dn}[x ∈ Ii and y = y′] = P_{(x,y)∼Dn}[x ∈ Iℓ and y = y′]. Then for depth t and width k < r^s, for n > j > st we get: LDn(Hk,t) ≥ (1 − r^{st−j})(1 − P(j)).

The above theorem shows that for shallow networks, for which st ≪ j, the approximation curve gives a very tight lower bound on the approximation error. This is due to the fact that shallow networks have a limited number of linear regions, and hence effectively give constant prediction on most of the "finer" details of the fractal distribution. This result implies that there are fractal distributions that are not only hard to realize by shallow networks, but that are even hard to approximate. Indeed, fix some small ε > 0 and let j := st + log_r(1/(2ε)). Then if the approximation curve stays flat for the first j levels (i.e. P(j) = 1/2), then from Theorem 4 the approximation error is at least 1/2 − ε.

This gives a strong depth separation result: shallow networks have an error of ≈ 1/2 while a network of depth t ≥ 2⌊n/s⌋ + 2 can achieve zero error (on any fractal distribution). This strong depth separation result occurs when the distribution is concentrated on the "fine" details, i.e. when the approximation curve stays flat throughout the "coarse" levels. In the next section we relate the approximation curve to the success of fitting a deep network to the fractal distribution, using gradient-based optimization algorithms. Specifically, we claim that distributions with strong depth separation cannot be learned by any network, deep or shallow, using gradient-based algorithms.

4 Optimization Analysis

So far, we analyzed the ability of neural-networks to express and approximate different fractal distributions. 
But it remains unclear whether these networks can be learned with gradient-based optimization algorithms. In this section, we show that the success of the optimization highly depends on the approximation curve of the fractal distribution. Namely, we show that for distributions with a "fine" approximation curve, which are concentrated on the "fine" details of the fractal, the optimization fails with high probability, for any gradient-based optimization algorithm.

To simplify the analysis, we focus in this section on a very simple fractal distribution: a distribution over the Cantor set in R. The Cantor set Cn is defined recursively by C0 = [0, 1] and Cn = F1(Cn−1) ∪ F2(Cn−1), where F1(x) = (1/3)x and F2(x) = (1/3)x + 2/3. Now, fix margin γ < 3^{−n}/2. We define the distribution D+n to be the uniform distribution over C^γ_n × {+1}. The distribution D−n is a distribution over C0 \ Cn, where we sample from each "level" Cj (j < n) with probability pj. Formally, we define Ej := Cj−1 \ Cj to be the j-th level of the negative distribution. We use U(Ej) to denote the uniform distribution on set Ej; then: D−n = Σ_{j=1}^n pj (U(Ej) × {−1}). Notice that the approximation curve of this distribution is given by: P(j) = 1/2 + (1/2) Σ_{i=1}^j pi. As before, we wish to learn Dn = (1/2)(D+n + D−n). Figure 3 shows a construction of such a distribution.

Figure 3: "Fine" Cantor distributions of growing depth. Negative areas in orange, positive in blue.

The main theorem in this section shows the connection between the approximation curve and the behavior of a gradient-based optimization algorithm. 
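A sampler for this Cantor distribution can be sketched as follows (our illustration, not the authors' code; the margin γ is ignored for simplicity, and p is the vector of level probabilities p_1, ..., p_n):

```python
import random

# F1, F2 of the middle-thirds Cantor IFS on [0, 1].
F = [lambda x: x / 3, lambda x: x / 3 + 2 / 3]

def sample_Dn(n, p, rng=random):
    """Draw one labeled example from Dn = 1/2 (D+n + D-n).
    `p` = [p_1, ..., p_n] are the negative level probabilities."""
    if rng.random() < 0.5:
        # positive: uniform over Cn (choose a branch at every level,
        # then a uniform base point; margin gamma ignored here)
        x = rng.uniform(0.0, 1.0)
        for _ in range(n):
            x = rng.choice(F)(x)
        return x, +1
    # negative: pick a level j ~ p, then x ~ U(E_j) = U(C_{j-1} \ C_j):
    j = rng.choices(range(1, n + 1), weights=p)[0]
    x = rng.uniform(1 / 3, 2 / 3)   # the middle-third gap removed at step 1
    for _ in range(j - 1):          # push the gap down to level j
        x = rng.choice(F)(x)
    return x, -1
```

With p = (0, ..., 0, 1) this produces exactly the "fine" Cantor distribution discussed below, whose approximation curve stays at 1/2 until the last level.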
This result shows that for deep enough Cantor distributions, the value of the approximation curve on the fine details of the fractal bounds the norm of the population gradient for a randomly initialized network:

Theorem 5 Fix some depth t, width k and some δ ∈ (0, 1). Let n, n′ ∈ N such that n > n′ > log^{−1}(3/2) log(4tk²/δ). Let Dn be some Cantor distribution with approximation curve P. Assume we initialize a neural-network NW,B of depth t and width k, with weights initialized uniformly in [−1/(2n_in), 1/(2n_in)] (where n_in denotes the in-degree of each neuron), and biases initialized with a fixed value b = 1/2.² Denote the hinge-loss of the network on the population by: L(NW,B) = E_{(x,y)∼Dn}[max{1 − y NW,B(x), 0}]. Then with probability at least 1 − δ we have:

1. ‖∂/∂W L(NW,B)‖_max, ‖∂/∂B L(NW,B)‖_max ≤ 5(P(n′) − 1/2)
2. LDn(NW,B) ≥ (3/2 − P(n′))(1 − P(n′))

We now give some important implications of this theorem. First, notice that we can define Cantor distributions for which a gradient-based algorithm fails with high probability. Indeed, we define the "fine" Cantor distribution to be a distribution concentrated on the highest level of the Cantor set. Given our previous definition, this means p1, . . . , pn−1 = 0 and pn = 1. The approximation curve for this distribution is therefore P(0) = ··· = P(n − 1) = 1/2, P(n) = 1. Figure 3 shows the "fine" Cantor distribution drawn over its composing intervals. 
From Theorem 5 we get that for n > log^{−1}(3/2) log(4tk²/δ), with probability at least 1 − δ, the population gradient is zero and the error is 1/2. This result immediately implies that vanilla gradient-descent on the distribution will be stuck in the first step. But SGD, or GD on a finite sample, may move from the initial point, due to the stochasticity of the gradient estimation. What the theorem shows is that the objective is extremely flat almost everywhere in the regime of W, so stochastic gradient steps are highly unlikely to converge to any solution with error better than 1/2.

The above argument shows that there are fractal distributions that can be realized by deep networks, for which a standard optimization process is likely to fail. We note that this result is interesting by itself, in the broader context of depth separation results. It implies that for many deep architectures, there are distributions with the depth separation property that cannot be learned by gradient-descent:

Corollary 2 There exist two constants c1, c2, such that for every width k ≥ 10 and δ ∈ (0, 1), for every depth t > c1 log(k/δ) + c2 there exists a distribution D on R × {±1} for which:

1. D can be realized by a neural network of depth t and width 10.
2. D cannot be realized by a one-hidden layer network with less than 2^{t−1} units.
3. Any gradient-based algorithm trying to learn a neural-network of depth t and width k, with initialization and loss described in Theorem 5, returns a network with error 1/2 w.p. ≥ 1 − δ.

We can go further, and use Theorem 5 to give a better characterization of these hard distributions. Recall that in the previous section we showed distributions that exhibit a strong depth separation property: distributions that are realizable by deep networks, for which shallow networks get an error exponentially close to 1/2. 
From Theorem 5 we get that any Cantor distribution that gives a strong depth separation cannot be learned by gradient-based algorithms:

Corollary 3 Fix some depth t, width k and some δ ∈ (0, 1). Let n > 4 log^{−1}(3/2) log(4tk²/δ) + 2. Let Dn be some Cantor distribution such that any network of width 10 and depth t′ < n has an error of at least 1/2 − ε^{n−t′}, for some ε ∈ (0, 1) (i.e., strong depth separation). Assume we initialize a network of depth t and width k as described in Theorem 5. Then with probability at least 1 − δ:

1. ‖∂/∂W L(NW,B)‖_max, ‖∂/∂B L(NW,B)‖_max ≤ 5ε^{n/2}
2. LDn(NW,B) ≥ 1/2 − (3/2)ε^{n/2}

²We note that it is standard practice to initialize the bias to a fixed value. We fix b = 1/2 for simplicity, but a similar result can be given for any choice of b ∈ [0, 1/2].

Figure 4: The effect of depth on learning the Cantor set.

This shows that in the strong depth separation case, the population gradient is exponentially close to zero with high probability. Effectively, this property means that even a small amount of stochastic noise in the gradient estimation (for example, in SGD) makes the algorithm fail.

This result gives a very important property of Cantor distributions. It shows that every Cantor distribution that cannot be approximated by a shallow network (achieving error greater than 1/2 − ε^{n−t′}) cannot be learned by a deep network (when training with gradient-based algorithms). 
While we show this in a very restricted case, we conjecture that this property holds in general:

Conjecture 1 Let D be some distribution such that LD(Hk,t) = 0 (realizable with networks of width k and depth t). If LD(Hk,t′) is exponentially close to 1/2 when t′ → 1, then any gradient-based algorithm training a network of depth t and width k will fail with high probability.

We give an intuition of why such a result may hold in a more general case. Note that a strong depth separation property means that the loss of any shallow network is close to a random guess, which implies that positive and negative examples are extremely hard to separate. In other words, any separation of the space into a small number of linear regions will have approximately the same amount of positive and negative examples in the same linear region. Our proof technique shows that in this case, upon initialization, a deep network only "rearranges" the linear regions, but does not change this property. Namely, when we initialize a deep network, most linear regions have a similar amount of positive and negative examples in them. In this case, gradient descent will fail, as the gradient will be approximately zero on all linear regions. We conjecture that a similar technique can be used in a more general setting. The use of fractal distributions greatly simplifies our analysis, but the core idea of why deep networks fail does not depend on the fractal structure of the distribution.

5 Experiments

In the previous section, we saw that learning a "fine" distribution with gradient-based algorithms is likely to fail. To complete the picture, we now assert that when the distribution has enough weight on the "coarse" details, SGD succeeds in learning a deep network with small error. 
Moreover, we show that when training on such distributions, a clear depth separation is observed, and deeper networks indeed perform better than shallow networks. Unfortunately, giving theoretical evidence to support this claim seems out of reach. Instead, we perform experiments to show these desired properties.

In this section we present our experimental results on learning deep networks with the Adam optimizer ([7]). First, we show that depth separation is observed when training on samples from fractal distributions with a "coarse" approximation curve: deeper networks perform better and have better parameter utilization. Second, we demonstrate the effect of training on "coarse" vs. "fine" distributions, showing that the performance of the network degrades as the approximation curve becomes finer. Finally, we analyze the behavior of networks of growing depth on CIFAR-10. We show that CIFAR-10 resembles the "coarse" fractal distribution in the sense that deep networks perform better, but shallow networks already give a good approximation.

We start by observing a distribution with a "coarse" approximation curve (denoted curve #1), where the negative examples are evenly distributed between the levels. The underlying fractal structure is a two-dimensional variant of the Cantor set.

Figure 5: Learning depth 5 networks on a 2D Cantor set of depth 5, with different approximation curves. The figures show the values of the approximation curve (denoted P) at different levels of the fractal. Large values correspond to more weight. In red is the accuracy of the best depth 5 network architecture trained on these distributions. 
This set is constructed by an IFS with four mappings, each of which maps the structure to a rectangle in a different corner of the space. The negative examples are concentrated in the central rectangle of each structure. The distributions are shown in figure 2.

We train feed-forward networks of varying depth and width on a 2D Cantor distribution of depth 5. We sample 50K examples for a train dataset and 5K examples for a test dataset. We train the networks on this dataset with the Adam optimizer for 10^6 iterations, with a batch size of 100 and different learning rates. We observe the best performance of each configuration (depth and width) on the test data along the runs. The results of these experiments are shown in figure 4. In this experiment, we see that a wide enough depth 5 network gets almost zero error. Importantly, we can see a clear depth separation: deeper networks achieve better accuracy, and are more efficient in utilizing the network parameters.

Next, we observe the effect of the approximation curve on learning the distribution. We compare the performance of the best depth 5 networks, when trained on distributions with different approximation curves. The training and validation process is as described previously. We also plot the value of the approximation curve for each distribution, in levels 3, 4, 5 of the fractal. The results of this experiment are shown in figure 5. Clearly, the approximation curve has a crucial effect on learning the distribution. While for "coarse" approximation curves the network achieves an error that is close to zero, distributions with "fine" approximation curves cause a drastic degradation in performance.

The degradation in performance becomes even more dramatic when considering deeper and more complex fractals.
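As a concrete illustration, sampling from such a fractal distribution might be sketched as follows. This is a minimal sketch under our own assumptions: the contraction ratio, the placement of the central rectangle, and the per-level weighting scheme are chosen for illustration and are not taken from the paper's exact construction.

```python
import random

SCALE = 0.4  # contraction ratio (assumption; any value < 0.5 keeps the corner copies disjoint)
# Four affine corner mappings of the unit square: an IFS with four transformations.
CORNERS = [(0.0, 0.0), (1 - SCALE, 0.0), (0.0, 1 - SCALE), (1 - SCALE, 1 - SCALE)]

def sample_positive(depth):
    """Descend `depth` random corner maps; the limit set is a 2D Cantor-like fractal."""
    x, y = random.random(), random.random()
    for _ in range(depth):
        ox, oy = random.choice(CORNERS)
        x, y = ox + SCALE * x, oy + SCALE * y
    return x, y

def sample_negative(depth, level_weights):
    """Place a point in the central rectangle of a sub-structure at a random level.
    `level_weights` plays the role of the approximation curve: putting most of the
    mass on deep levels yields a "fine" distribution, even weights a "coarse" one."""
    level = random.choices(range(depth), weights=level_weights)[0]
    # Central rectangle of the unit square, between the four corner copies.
    x = random.uniform(SCALE, 1 - SCALE)
    y = random.uniform(SCALE, 1 - SCALE)
    for _ in range(level):  # push the central rectangle down `level` levels
        ox, oy = random.choice(CORNERS)
        x, y = ox + SCALE * x, oy + SCALE * y
    return x, y

def sample_dataset(n, depth=5, level_weights=None):
    """Balanced dataset of positive (fractal) and negative (central-gap) examples."""
    if level_weights is None:
        level_weights = [1.0] * depth  # "coarse" curve: negatives evenly spread over levels
    data = []
    for _ in range(n):
        if random.random() < 0.5:
            data.append((sample_positive(depth), +1))
        else:
            data.append((sample_negative(depth, level_weights), -1))
    return data
```

Under this sketch, the "coarse" vs. "fine" experiments correspond to varying `level_weights`, e.g. `[1, 0, 0, 0, 0]` concentrates all negatives at the top level, while `[0, 0, 0, 0, 1]` concentrates them on the finest details.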
To demonstrate this, we ran an experiment on the Vicsek distribution of depth 6, where the examples are concentrated on the "fine" details of the fractal. Such a distribution is hard to approximate by a shallow network, as shown in our theoretical analysis. We trained networks of various depth and width on this distribution, as described above. The results are shown in figure 6. As can be clearly seen, unlike distributions with a "coarse" approximation curve (shown in figure 4), in this case the benefit of depth is not noticeable, and all architectures achieve an accuracy of slightly more than 0.5 (i.e., chance level performance).

Figure 6: Performance on the "fine" Vicsek distribution, for networks of various depth and width.

Figure 7: The effect of depth on learning CIFAR-10. We train CNNs with Adam for 60K steps. All layers are 5x5 convolutions with ReLU activation, except the readout layer. We perform max-pooling only in the first two layers. We use the augmentations and training pipeline from [21].

We perform the same experiments with different fractal structures (figure 1 shows these distributions). Tables 1, 2 in the appendix summarize the results. We note that the effect of depth can be seen clearly in all fractal structures. The effect of the approximation curve is observed in all fractals, except the Sierpinski Triangle (generated with 3 transformations), where the approximation curve seems to have no effect when the width of the network is large enough.
This might be due to the fact that a depth 5 IFS with 3 transformations generates a small number of linear regions, making the problem overall relatively easy.

Finally, we want to show that the results given in this paper are interesting beyond the scope of our admittedly synthetic fractal distributions. We note that the use of fractal distributions is favorable from a theoretical perspective, as it allows us to develop a crisp analysis and insightful results. On the other hand, it may raise a valid concern regarding the applicability of these results to real-world scenarios. To address this concern, we performed similar experiments on the CIFAR-10 data, studying the effect of width and depth on the performance of neural networks on real data. The results are shown in figure 7. Notice that the trends on the CIFAR data resemble the behavior on the "coarse" fractal distributions. Importantly, note that the CIFAR data does not exhibit a strong depth separation, as depth gives only a gradual improvement in performance. That is, while deeper networks indeed exhibit better performance, a shallow network already gives a good approximation. A similar behavior is observed even on the ImageNet dataset (see fig. 2 in [1]).

Acknowledgements: This research is supported by the European Research Council (TheoryDL project).

References

[1] Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Greedy layerwise learning can scale to ImageNet. arXiv preprint arXiv:1812.11446, 2018.

[2] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.

[3] Amit Daniely. Depth separation for neural networks. arXiv preprint arXiv:1702.08489, 2017.

[4] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks.
In Advances in Neural Information Processing Systems, pages 666–674, 2011.

[5] Nadav Dym, Barak Sober, and Ingrid Daubechies. Expression of fractals through neural network functions. arXiv preprint arXiv:1905.11345, 2019.

[6] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.

[7] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[8] Hrushikesh N Mhaskar and Tomaso Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848, 2016.

[9] Guido Montúfar. Notes on the number of linear regions of deep neural networks. 2017.

[10] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

[11] Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013.

[12] Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503–519, 2017.

[13] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.

[14] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.

[15] Itay Safran and Ohad Shamir.
Depth-width tradeoffs in approximating natural functions with neural networks. arXiv preprint arXiv:1610.09887, 2016.

[16] Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and counting linear regions of deep neural networks. arXiv preprint arXiv:1711.02114, 2017.

[17] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[18] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. arXiv preprint arXiv:1703.07950, 2017.

[19] Matus Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.

[20] Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.

[21] TensorFlow. CIFAR-10 TensorFlow tutorial, models/tutorials/image/cifar10. 2018.