{"title": "Efficient Approximation of Deep ReLU Networks for Functions on Low Dimensional Manifolds", "book": "Advances in Neural Information Processing Systems", "page_first": 8174, "page_last": 8184, "abstract": "Deep neural networks have revolutionized many real world applications, due to their flexibility in data fitting and accurate predictions for unseen data. A line of research reveals that neural networks can approximate certain classes of functions with an arbitrary accuracy, while the size of the network scales exponentially with respect to the data dimension. Empirical results, however, suggest that networks of moderate size already yield appealing performance. To explain such a gap, a common belief is that many data sets exhibit low dimensional structures, and can be modeled as samples near a low dimensional manifold. In this paper, we prove that neural networks can efficiently approximate functions supported on low dimensional manifolds. The network size scales exponentially in the approximation error, with an exponent depending on the intrinsic dimension of the data and the smoothness of the function. Our result shows that exploiting low dimensional data structures can greatly enhance the efficiency in function approximation by neural networks. We also implement a sub-network that assigns input data to their corresponding local neighborhoods, which may be of independent interest.", "full_text": "Ef\ufb01cient Approximation of Deep ReLU Networks for\n\nFunctions on Low Dimensional Manifolds\n\nMinshuo Chen Haoming Jiang Wenjing Liao\n\nTuo Zhao\n\n{mchen393, jianghm, wliao60, tourzhao}@gatech.edu\n\nGeorgia Institute of Technology\n\nAbstract\n\nDeep neural networks have revolutionized many real world applications, due to\ntheir \ufb02exibility in data \ufb01tting and accurate predictions for unseen data. A line of\nresearch reveals that neural networks can approximate certain classes of functions\nwith an arbitrary accuracy, while the size of the network scales exponentially with\nrespect to the data dimension. Empirical results, however, suggest that networks\nof moderate size already yield appealing performance. To explain such a gap,\na common belief is that many data sets exhibit low dimensional structures, and\ncan be modeled as samples near a low dimensional manifold. In this paper, we\nprove that neural networks can ef\ufb01ciently approximate functions supported on low\ndimensional manifolds. The network size scales exponentially in the approximation\nerror, with an exponent depending on the intrinsic dimension of the data and the\nsmoothness of the function. Our result shows that exploiting low dimensional\ndata structures can greatly enhance the ef\ufb01ciency in function approximation by\nneural networks. We also implement a sub-network that assigns input data to their\ncorresponding local neighborhoods, which may be of independent interest.\n\n1\n\nIntroduction\n\nIn the past decade, neural networks have made astonishing breakthroughs in many real world\napplications, such as computer vision (Krizhevsky et al., 2012; Goodfellow et al., 2014; Long et al.,\n2015), natural language processing (Graves et al., 2013; Bahdanau et al., 2014; Young et al., 2018),\nhealthcare (Miotto et al., 2017; Jiang et al., 2017), robotics (Gu et al., 2017), etc.\nAlthough data sets in these applications are highly complex, neural networks have achieved over-\nwhelming successes. 
For image classification, the winner of the 2017 ImageNet challenge achieved a top-5 error rate of 2.25% (Hu et al., 2018), while the data set consists of about 1.2 million labeled high resolution images in 1000 categories. For speech recognition, Amodei et al. (2016) reported that deep neural networks outperformed humans with a 5.15% word error rate on the LibriSpeech corpus constructed from audio books (Panayotov et al., 2015). Such a data set consists of approximately 1000 hours of 16kHz read English speech from 8000 audio books. These empirical results suggest that neural networks can well approximate complex distributions and functions on data.

A line of research attempts to explain the success of neural networks through the lens of expressivity — neural networks can effectively approximate various classes of functions. Among existing works, the most well-known results are the universal approximation theorems, see Irie and Miyake (1988); Funahashi (1989); Cybenko (1989); Hornik (1991); Chui and Li (1992); Leshno et al. (1993). Specifically, Cybenko (1989) showed that neural networks with a single hidden layer and continuous sigmoidal¹ activations can approximate continuous functions on a unit cube with arbitrary accuracy. Later, Hornik (1991) extended the universal approximation theorem to general feed-forward networks with a single hidden layer, while the width of the network has to be exponentially large. Specific approximation rates of shallow networks (with one hidden layer) with smooth activation functions were given in Barron (1993) and Mhaskar (1996). Recently, Lu et al. (2017) proved the universal approximation theorem for width-bounded deep neural networks, and Hanin (2017) improved the result with ReLU (Rectified Linear Unit) activations, i.e., ReLU(x) = max{0, x}. Yarotsky (2017) further showed that deep ReLU networks can uniformly approximate functions in Sobolev spaces, while the network size scales exponentially in the approximation error with an exponent depending on the data dimension. Moreover, the network size in Yarotsky (2017) matches its lower bound.

¹ A function σ(x) is sigmoidal if σ(x) → 0 as x → −∞, and σ(x) → 1 as x → ∞.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The network size considered in applications, however, is significantly smaller than what is predicted by the theory above. In the ImageNet challenge, data are RGB images with a resolution of 224 × 224. The theory above suggests that, to achieve an ε uniform approximation error, the number of neurons has to scale as ε^{-224×224×3/2} (Barron, 1993). Setting ε = 0.1 already gives rise to 10^{224×224×3/2} neurons. However, AlexNet (Krizhevsky et al., 2012) consists of only 650,000 neurons and 60 million parameters, yet it beat the state-of-the-art. To boost the performance on ImageNet, more sophisticated network structures were proposed later, such as VGG16 (Simonyan and Zisserman, 2014), which consists of about 138 million parameters. The size of both networks remains extremely small compared to 10^{224×224×3/2}. Why is there a tremendous gap between theory and practice?

A common belief is that real world data sets often exhibit low dimensional structures.
Many images consist of projections of three-dimensional objects followed by some transformations, such as rotation, translation, and skeletonization. Such a generating mechanism induces a small number of intrinsic parameters. Speech data are composed of words and sentences following the grammar, and therefore have a small degree of freedom. More broadly, visual, acoustic, textual, and many other types of data all have low dimensional structures due to rich local regularities, global symmetries, repetitive patterns, or redundant sampling. It is plausible to model these data as samples near a low dimensional manifold (Tenenbaum et al., 2000; Roweis and Saul, 2000). Then a natural question is:

Can deep neural networks efficiently approximate functions supported on low dimensional manifolds?

Function approximation on manifolds has been well studied using local polynomials (Bickel et al., 2007) and wavelets (Coifman and Maggioni, 2006). However, studies using neural networks are very limited. Two noticeable works are Chui and Mhaskar (2016) and Shaham et al. (2018). In Chui and Mhaskar (2016), high order differentiable functions on manifolds are approximated by neural networks with smooth activations, e.g., sigmoid activations and rectified quadratic unit functions (σ(x) = (max{0, x})^2). These smooth activations, however, are rarely used in mainstream applications such as computer vision (Krizhevsky et al., 2012; Long et al., 2015; Hu et al., 2018). In Shaham et al. (2018), a 4-layer network with ReLU activations was proposed to approximate C^2 functions on low dimensional manifolds that have absolutely summable wavelet coefficients. However, this theory does not cover arbitrarily smooth functions, and the analysis is built upon a restrictive assumption — there exists a linear transformation that maps the input data to sparse coordinates, but such a transformation is not explicitly given.

In this paper, we propose a framework to construct deep neural networks with nonsmooth activations to approximate functions supported on a d-dimensional smooth manifold isometrically embedded in R^D. We prove that, in order to achieve a fixed approximation error, the network size scales exponentially with respect to the intrinsic dimension d, instead of the ambient dimension D. Our framework is flexible: 1) it applies to nonsmooth activations, e.g., ReLU and leaky ReLU activations; 2) it applies to a wide class of functions, such as Sobolev and Hölder classes, which are typical examples in nonparametric statistics (Györfi et al., 2006); 3) it exploits high order smoothness of functions to make the approximation as efficient as possible.

Theorem (informal). Let M be a d-dimensional compact Riemannian manifold isometrically embedded in R^D with d ≪ D. Assume M satisfies some mild regularity conditions. Given any ε ∈ (0, 1), there exists a ReLU neural network structure such that, for any C^n (n ≥ 1) function f : M → R, if the weight parameters are properly chosen, the network yields a function f̂ satisfying ‖f − f̂‖_∞ ≤ ε. Such a network has no more than c_1(log(1/ε) + log D) layers, and at most c_2(ε^{-d/n} log(1/ε) + D log(1/ε) + D log D) neurons and weight parameters, where c_1, c_2 depend on d, n, f, and M.

Our network size scales like ε^{-d/n} and only weakly depends on the ambient dimension D. This is consistent with empirical observations, and partially justifies why networks of moderate size have achieved great success on the aforementioned learning tasks.
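To make the contrast concrete, the short sketch below evaluates the leading terms ε^{-d/n} versus ε^{-D/n}; the particular values of d, n, and ε are illustrative assumptions, not quantities taken from the paper.

```python
import math

# Back-of-the-envelope comparison of the two scalings in the informal theorem.
# All numbers below are illustrative assumptions, not values from the paper.
eps = 0.1          # target uniform approximation error
n = 2              # smoothness order of f
d = 10             # hypothetical intrinsic dimension of the data manifold
D = 224 * 224 * 3  # ambient dimension of an ImageNet-sized RGB image

# Leading terms: eps^(-d/n) when the low dimensional structure is exploited,
# versus eps^(-D/n) when it is not.  We report log10 to avoid overflow.
log10_intrinsic = (d / n) * math.log10(1 / eps)
log10_ambient = (D / n) * math.log10(1 / eps)

print(f"exploiting the manifold: ~10^{log10_intrinsic:.0f} weight parameters")
print(f"ignoring the manifold:   ~10^{log10_ambient:.0f} weight parameters")
```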
Moreover, we show that our network size matches its lower bound up to a logarithmic factor (see Theorem 2).

Our theory applies to general C^n functions and leverages the benefits of exploiting high order smoothness. Our result improves Shaham et al. (2018) for C^n functions with n > 2. In this case, our network size scales like ε^{-d/n}, which is significantly smaller than the one in Shaham et al. (2018), which is of the order ε^{-d/2}.

Here we state the theorem for C^n functions for simplicity; similar results hold for Hölder functions (see Theorem 1). Our framework can be easily applied to leaky ReLU activations, since leaky ReLU can be implemented by the difference of two ReLU functions.

The high level idea of our framework is to partition the low dimensional manifold into a collection of open sets, and then use Taylor expansions to approximate the function in each neighborhood. A new technique is developed to implement a sub-network that assigns the input to its corresponding neighborhood on the manifold, which may be of independent interest.

Notations: We use bold-faced letters to denote vectors, and normal font letters with a subscript to denote their coordinates, e.g., x ∈ R^d with x_k being the k-th coordinate of x. Given a vector n = [n_1, ..., n_d]^T ∈ N^d, we define n! = Π_{i=1}^d n_i! and |n| = Σ_{i=1}^d n_i. We define x^n = Π_{i=1}^d x_i^{n_i}. Given a function f : R^d → R, we denote its derivative as D^n f = ∂^{|n|} f / (∂x_1^{n_1} ... ∂x_d^{n_d}), and its ℓ_∞ norm as ‖f‖_∞ = max_x |f(x)|. We use ∘ to denote the composition operator.

2 Preliminaries

We briefly review manifolds, partition of unity, and function spaces defined on smooth manifolds. Details can be found in Tu (2010) and Lee (2006). Let M be a d-dimensional Riemannian manifold isometrically embedded in R^D.

Definition 1 (Chart). A chart for M is a pair (U, φ) such that U ⊂ M is open and φ : U → R^d, where φ is a homeomorphism (i.e., bijective, with φ and φ^{-1} both continuous).

The open set U is called a coordinate neighborhood, and φ is called a coordinate system on U. A chart essentially defines a local coordinate system on M. We say two charts (U, φ) and (V, ψ) on M are C^k compatible if and only if the transition functions φ ∘ ψ^{-1} : ψ(U ∩ V) → φ(U ∩ V) and ψ ∘ φ^{-1} : φ(U ∩ V) → ψ(U ∩ V) are both C^k. We then give the definition of an atlas.

Definition 2 (C^k Atlas). An atlas for M is a collection {(U_α, φ_α)}_{α∈A} of pairwise C^k compatible charts such that ∪_{α∈A} U_α = M.

Definition 3 (Smooth Manifold). A smooth manifold is a manifold M together with a C^∞ atlas.

Classical examples of smooth manifolds are the Euclidean space R^D, the torus, and the unit sphere. The existence of an atlas on M allows us to define differentiable functions.
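As a concrete toy illustration of Definitions 1 and 2 (not part of the paper's construction), the following sketch builds a two-chart atlas for the unit circle S^1 ⊂ R^2 and checks that the transition map on the overlap is a smooth shift.

```python
import numpy as np

# Two angle charts covering the unit circle S^1 in R^2 (a toy atlas).
def phi_1(x):
    """Chart on U_1 = S^1 without the point (-1, 0): angle in (-pi, pi)."""
    return np.arctan2(x[1], x[0])

def phi_2(x):
    """Chart on U_2 = S^1 without the point (1, 0): angle in (0, 2*pi)."""
    theta = np.arctan2(x[1], x[0])
    return theta if theta > 0 else theta + 2 * np.pi

def phi_1_inv(t):
    return np.array([np.cos(t), np.sin(t)])

# On the overlap U_1 and U_2 share, the transition map phi_2 after phi_1^{-1}
# is either t -> t or t -> t + 2*pi, hence C-infinity: the charts are compatible.
t = 2.0                       # a coordinate in phi_1(U_1 intersected with U_2)
x = phi_1_inv(t)              # the corresponding point on the manifold
print(phi_2(x) - t)           # 0.0 on this branch of the overlap
```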
Definition 4 (C^n Functions on M). Let M be a smooth manifold in R^D. A function f : M → R is C^n if for any chart (U, φ), the composition f ∘ φ^{-1} : φ(U) → R is continuously differentiable up to order n.

Remark 1. The definition of C^n functions is independent of the choice of the chart (U, φ). Suppose (V, ψ) is another chart and V ∩ U ≠ ∅. Then we have f ∘ ψ^{-1} = (f ∘ φ^{-1}) ∘ (φ ∘ ψ^{-1}). Since M is a smooth manifold, (U, φ) and (V, ψ) are C^∞ compatible. Thus, f ∘ φ^{-1} is C^n and φ ∘ ψ^{-1} is C^∞, and their composition is C^n.

We next introduce the partition of unity, which plays a crucial role in our construction of neural networks.

Definition 5 (Partition of Unity). A C^∞ partition of unity on a manifold M is a collection of nonnegative C^∞ functions ρ_α : M → R_+ for α ∈ A such that 1) the collection of supports, {supp(ρ_α)}_{α∈A}, is locally finite²; 2) Σ_{α∈A} ρ_α = 1.

For a smooth manifold, a C^∞ partition of unity always exists.

² A collection {A_α} is locally finite if every point has a neighborhood that meets only finitely many of the A_α's.

Proposition 1 (Existence of a C^∞ partition of unity). Let {U_α}_{α∈A} be an open cover of a smooth manifold M. Then there is a C^∞ partition of unity {ρ_i}_{i=1}^∞ with every ρ_i having a compact support such that supp(ρ_i) ⊂ U_α for some α ∈ A.

Proposition 1 gives rise to the decomposition f = Σ_{i=1}^∞ f_i with f_i = f ρ_i. Note that the f_i's have the same regularity as f, since f_i ∘ φ^{-1} = (f ∘ φ^{-1}) × (ρ_i ∘ φ^{-1}) for a chart (U, φ). This decomposition has the advantage that every f_i is supported in a single chart. Then the approximation of f boils down to the approximations of the f_i's, which are localized and have the same regularity as f. A one-dimensional sketch of such a decomposition is given below.
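The following one-dimensional sketch (an illustration only, not the construction used in the proofs) shows two smooth bumps that form a partition of unity subordinate to two overlapping intervals, and the resulting decomposition f = f ρ_1 + f ρ_2 into localized pieces.

```python
import numpy as np

def bump(t):
    """C-infinity bump: positive on (-1, 1), identically zero outside."""
    out = np.zeros_like(t)
    inside = np.abs(t) < 1
    out[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
    return out

# Two overlapping "charts" U_1 = (-0.2, 0.7) and U_2 = (0.3, 1.2) covering [0, 1].
x = np.linspace(0.0, 1.0, 201)
raw_1 = bump((x - 0.25) / 0.45)        # supported inside U_1
raw_2 = bump((x - 0.75) / 0.45)        # supported inside U_2
rho_1 = raw_1 / (raw_1 + raw_2)        # normalize so that rho_1 + rho_2 = 1
rho_2 = raw_2 / (raw_1 + raw_2)

f = np.sin(3 * x)                      # any function to be decomposed
f_1, f_2 = f * rho_1, f * rho_2        # localized pieces, each supported in one chart
print(np.max(np.abs(f - (f_1 + f_2)))) # ~0: f = f_1 + f_2 exactly
```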
To characterize the curvature of a manifold, we adopt the following geometric concept.

Definition 6 (Reach, Definition 2.1 in Aamari et al. (2019)). Denote C(M) = {x ∈ R^D : ∃ p ≠ q ∈ M, ‖p − x‖_2 = ‖q − x‖_2 = inf_{y∈M} ‖y − x‖_2} as the set of points that have at least two nearest neighbors on M. Then the reach τ > 0 is defined as τ = inf_{x∈M, y∈C(M)} ‖x − y‖_2.

Figure 1: Manifolds with large and small reach.

Reach has a straightforward geometric interpretation: at each point x ∈ M, the radius of the osculating circle is greater than or equal to τ. A large reach essentially requires that the manifold M does not change "rapidly", as shown in Figure 1.

Reach determines a proper choice of an atlas for M. In Section 4, we choose each chart U_α contained in a ball of radius less than τ/2. For smooth manifolds with a small τ, we need a large number of charts. Therefore, the reach of a smooth manifold reflects the difficulty of function approximation on M.

3 Main Result

We next present how to construct a ReLU network to approximate f : M → R with error ε, under certain assumptions on M and f.

Assumption 1. M is a d-dimensional compact Riemannian manifold isometrically embedded in R^D. There exists a constant B such that for any point x ∈ M, we have |x_i| ≤ B for all i = 1, ..., D.

Assumption 2. The reach of M is τ > 0.

Assumption 3. f : M → R belongs to the Hölder space H^{n,α} with a positive integer n and α ∈ (0, 1], in the sense that f ∈ C^{n−1} and for any chart (U, φ) and any |n| = n, we have

    |D^n(f ∘ φ^{-1})|_{φ(x_1)} − D^n(f ∘ φ^{-1})|_{φ(x_2)}| ≤ ‖φ(x_1) − φ(x_2)‖_2^α,   for all x_1, x_2 ∈ U.    (1)

Assumption 3 says that all n-th order derivatives of f ∘ φ^{-1} are Hölder continuous. Here Hölder functions are defined on manifolds. We recover the standard Hölder class on Euclidean spaces by taking φ as the identity map. We also note that Assumption 3 does not depend on the choice of charts.

We now formally state our main result. Extensions to functions in Sobolev spaces are straightforward.

Theorem 1. Suppose Assumptions 1 and 2 hold. Given any ε ∈ (0, 1), there exists a ReLU network structure such that, for any f : M → R satisfying Assumption 3, if the weight parameters are properly chosen, the network yields a function f̂ satisfying ‖f̂ − f‖_∞ ≤ ε. Such a network has no more than c_1(log(1/ε) + log D) layers, and at most c_2(ε^{-d/(n+α)} log(1/ε) + D log(1/ε) + D log D) neurons and weight parameters, where c_1, c_2 depend on d, n, f, τ, and the surface area of M.

The network structure identified by Theorem 1 consists of three sub-networks as shown in Figure 2:
• Chart determination sub-network, which assigns the input to its corresponding neighborhoods;
• Taylor approximation sub-network, which approximates f by polynomials in each neighborhood;
• Pairing sub-network, which yields multiplications of the proper pairs of outputs from the chart determination and the Taylor approximation sub-networks.
A schematic sketch of how these pieces compose is given right after Figure 2.

Figure 2: The ReLU network identified by Theorem 1.
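The sketch below spells out how the three sub-networks compose into f̂. It is purely schematic: chart_indicator, taylor_branch, and approx_mul are hypothetical placeholders standing in for the sub-networks constructed in Section 4, and are assumptions of this illustration rather than the paper's implementation.

```python
import numpy as np

# Schematic forward pass of the network in Figure 2 (illustrative sketch only).
# Each chart carries: its center c_i, tangent basis V_i, rescaling (b_i, s_i),
# and callables standing in for the three sub-networks of Theorem 1.
def f_hat(x, charts):
    total = 0.0
    for chart in charts:
        c_i, V_i = chart["center"], chart["basis"]
        b_i, s_i = chart["scale"], chart["shift"]
        z_i = b_i * (V_i.T @ (x - c_i) + s_i)      # Step 2: linear chart map phi_i(x)
        f_i = chart["taylor_branch"](z_i)          # local Taylor approximation of f*rho_i
        ind_i = chart["chart_indicator"](x)        # ~1 inside U_i, ~0 outside (Step 3)
        total += chart["approx_mul"](f_i, ind_i)   # ReLU approximation of the product
    return total                                   # sum over the C_M charts
```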
Specifically, we partition the manifold as M = ∪_{i=1}^{C_M} U_i, where the U_i's are open sets contained in a Euclidean ball of radius less than τ/2. C_M depends on the reach τ, the surface area of M, and the dimension d (see Section 4 for an explicit characterization). For each chart, the chart determination sub-network computes an approximation of the indicator function on U_i. The Taylor approximation sub-network provides a local polynomial approximation of f on U_i. Then the pairing sub-network approximates the product for the proper pairs of outputs of the previous two sub-networks. Finally, f̂ is obtained by taking a sum over the C_M outputs of the pairing sub-network.

The size of our ReLU network matches its lower bound up to a logarithmic factor for the approximation of functions in Hölder spaces. Denote by F^{n,d} the functions defined on [0, 1]^d in the Hölder space H^{n−1,1}. We state a lower bound due to DeVore et al. (1989).

Theorem 2. Fix d and n. Let W be a positive integer and κ : R^W → C([0, 1]^d) be any mapping. Suppose there is a continuous map Θ : F^{n,d} → R^W such that ‖f − κ(Θ(f))‖_∞ ≤ ε for any f ∈ F^{n,d}. Then W ≥ c ε^{-d/n} with c depending on n only.

We take R^W as the parameter space of a ReLU network, and κ as the network structure. Then to approximate any f ∈ F^{n,d}, the ReLU network has at least c ε^{-d/n} weight parameters. Although Theorem 2 holds for functions on [0, 1]^d, our network size remains in the same order up to a logarithmic factor even when the function is supported on a manifold of dimension d.
4 Proof of the Main Result

Due to limited space, we present a sketch of the proof of Theorem 1. Before we proceed, we show how to approximate the multiplication operation using ReLU networks. This operation is heavily used in the Taylor approximation sub-network, since Taylor polynomials involve sums of products. We first show that ReLU networks can approximate quadratic functions.

Lemma 1 (Proposition 2 in Yarotsky (2017)). The function f(x) = x^2 with x ∈ [0, 1] can be approximated by a ReLU network with any error ε > 0. The network has depth and number of neurons and weight parameters no more than c log(1/ε) with an absolute constant c.

This lemma is proved in Appendix A.1. The idea is to approximate the quadratic function by a weighted sum of a series of sawtooth functions. Those sawtooth functions are obtained by composing the triangular function

    g(x) = 2 ReLU(x) − 4 ReLU(x − 1/2) + 2 ReLU(x − 1),

which can be implemented by a single layer ReLU network. We then approximate the multiplication operation by invoking the identity ab = ((a + b)^2 − (a − b)^2)/4, where the two squares can be approximated by ReLU networks as in Lemma 1.

Corollary 1 (Proposition 3 in Yarotsky (2017)). Given a constant C > 0 and ε ∈ (0, C^2), there is a ReLU network which implements a function ×̂ : R^2 → R such that: 1) for all inputs x and y satisfying |x| ≤ C and |y| ≤ C, we have |×̂(x, y) − xy| ≤ ε; 2) the depth and the number of weight parameters of the network are no more than c log(C^2/ε) with an absolute constant c.
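The construction behind Lemma 1 and Corollary 1 is easy to reproduce numerically. The sketch below composes the triangular function g to build Yarotsky's approximation of x^2 on [0, 1] and then uses the polarization identity for multiplication; the depth m and the test values are illustrative choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def g(z):
    """Triangular 'tooth' function from Lemma 1 (a single-layer ReLU net)."""
    return 2 * relu(z) - 4 * relu(z - 0.5) + 2 * relu(z - 1.0)

def sq_hat(z, m=8):
    """Yarotsky's approximation of z^2 on [0, 1]: z minus a weighted sum of
    sawtooth functions obtained by composing g; the error decays like 2^(-2m)."""
    approx, tooth = z, z
    for k in range(1, m + 1):
        tooth = g(tooth)                       # sawtooth with 2^(k-1) teeth
        approx = approx - tooth / 2 ** (2 * k)
    return approx

def mul_hat(a, b, C=1.0, m=8):
    """Approximate a*b for |a|, |b| <= C via ab = ((a+b)^2 - (a-b)^2)/4,
    rescaling each square into [0, 1] before calling sq_hat."""
    s_plus = sq_hat(np.abs(a + b) / (2 * C), m) * (2 * C) ** 2
    s_minus = sq_hat(np.abs(a - b) / (2 * C), m) * (2 * C) ** 2
    return (s_plus - s_minus) / 4.0

z = np.linspace(0.0, 1.0, 1001)
print(np.max(np.abs(sq_hat(z) - z ** 2)))      # ~4e-6 for m = 8
print(abs(mul_hat(0.3, -0.7) - 0.3 * (-0.7)))  # small multiplication error
```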
The ReLU network in Theorem 1 is constructed in the following five steps.

Step 1. Construction of an atlas. Denote the open Euclidean ball with center c and radius r in R^D by B(c, r). For any r, the collection {B(x, r)}_{x∈M} is an open cover of M. Since M is compact, there exists a finite collection of points c_i for i = 1, ..., C_M such that M ⊂ ∪_i B(c_i, r). Now we pick the radius r < τ/2 so that U_i = M ∩ B(c_i, r) is diffeomorphic³ to a ball in R^d (Niyogi et al., 2008). Let {(U_i, φ_i)}_{i=1}^{C_M} be an atlas on M, where φ_i is to be defined in Step 2. The number of charts C_M is upper bounded by

    C_M ≤ ⌈(SA(M)/r^d) T_d⌉,

where SA(M) is the surface area of M, and T_d is the thickness⁴ of the U_i's.

Remark 2. The thickness T_d scales approximately linearly in d. As shown in Conway et al. (1987), there exist coverings with d/(e√e) ≲ T_d ≤ d log d + d log log d + 5d.

Step 2. Projection with rescaling and translation. We denote the tangent space at c_i as T_{c_i}(M) = span(v_{i1}, ..., v_{id}), where {v_{i1}, ..., v_{id}} form an orthonormal basis. We obtain the matrix V_i = [v_{i1}, ..., v_{id}] ∈ R^{D×d} by concatenating the v_{ij}'s as column vectors. Define φ_i(x) = b_i(V_i^T(x − c_i) + s_i) ∈ [0, 1]^d for any x ∈ U_i, where b_i ∈ (0, 1] is a scaling factor and s_i is a translation vector. Since U_i is bounded, we can choose proper b_i and s_i to guarantee φ_i(x) ∈ [0, 1]^d. We rescale and translate the projection to ease the notation in the development of the local Taylor approximations in Step 4. We also remark that each φ_i is a linear (affine) function, and can be realized by a single-layer linear network.

³ P is diffeomorphic to Q if there is a mapping Γ : P → Q that is bijective, C^∞, and whose inverse is also C^∞.
⁴ Thickness is the average number of U_i's that contain a point on M (Conway et al., 1987).
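The chart map of Step 2 is just an affine map, i.e., one linear layer. The sketch below evaluates it for a toy chart on the unit sphere S^2 ⊂ R^3; the center, basis, and rescaling constants are illustrative assumptions.

```python
import numpy as np

# Step 2 as a single affine layer: phi_i(x) = b_i * (V_i^T (x - c_i) + s_i).
c_i = np.array([1.0, 0.0, 0.0])          # chart center on the unit sphere S^2
V_i = np.array([[0.0, 0.0],
                [1.0, 0.0],              # orthonormal basis of the tangent plane
                [0.0, 1.0]])             # at c_i (d = 2 columns in R^3)
b_i = 0.5                                # scaling factor in (0, 1]
s_i = np.array([1.0, 1.0])               # translation into the positive quadrant

def phi_i(x):
    return b_i * (V_i.T @ (x - c_i) + s_i)

t = 0.2                                  # a nearby point on the sphere
x = np.array([np.cos(t), np.sin(t), 0.0])
print(phi_i(x))                          # lands inside [0, 1]^2
```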
Step 3. Chart determination. This step is to locate the charts that a given input x belongs to. This avoids projecting x using unmatched charts (i.e., x ∉ U_j for some j), as illustrated in Figure 3.

Figure 3: Projecting x_j using a matched chart (blue) (U_j, φ_j), and an unmatched chart (green) (U_i, φ_i).

Proper charts⁵ can be determined by composing an indicator function with the squared Euclidean distance d_i^2(x) = ‖x − c_i‖_2^2 = Σ_{j=1}^D (x_j − c_{i,j})^2 for i = 1, ..., C_M. The squared distance d_i^2(x) is a sum of univariate quadratic functions; thus, we can apply Lemma 1 to approximate d_i^2(x) by ReLU networks. Denote by ĥ_sq an approximation of the quadratic function x^2 on [0, 1] with an approximation error ν. Then we define

    d̂_i^2(x) = 4B^2 Σ_{j=1}^D ĥ_sq(|x_j − c_{i,j}| / (2B))

as an approximation of d_i^2(x). The approximation error is ‖d̂_i^2 − d_i^2‖_∞ ≤ 4B^2 D ν, by the triangle inequality. We next consider an approximation of the indicator function of an interval as in Figure 4:

    1̂_Δ(a) = 1                                               if a ≤ r^2 − Δ + 4B^2 D ν,
    1̂_Δ(a) = (−a + r^2 − 4B^2 D ν) / (Δ − 8B^2 D ν)          if a ∈ [r^2 − Δ + 4B^2 D ν, r^2 − 4B^2 D ν],    (2)
    1̂_Δ(a) = 0                                               if a > r^2 − 4B^2 D ν,

where Δ (with Δ > 8B^2 D ν) will be chosen later according to the accuracy ε. Note that 1̂_Δ can be implemented exactly by a single layer ReLU network:

    1̂_Δ(a) = [ReLU(−a + r^2 − 4B^2 D ν) − ReLU(−a + r^2 − Δ + 4B^2 D ν)] / (Δ − 8B^2 D ν).

We use 1̂_Δ ∘ d̂_i^2 to approximate the indicator function on U_i: if x ∉ U_i, i.e., d_i^2(x) ≥ r^2, we have 1̂_Δ ∘ d̂_i^2(x) = 0; if x ∈ U_i and d_i^2(x) ≤ r^2 − Δ, we have 1̂_Δ ∘ d̂_i^2(x) = 1.

Figure 4: The approximation of the indicator function 1̂_Δ in (2).

⁵ Note that an input x can belong to multiple charts. Accordingly, the chart determination sub-network determines all these charts.
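Because 1̂_Δ is a difference of two ReLUs divided by a constant, it really is a single-layer network. The sketch below implements equation (2) with the slack term 4B^2 D ν collapsed into a generic constant; all numerical values are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def indicator_hat(a, r2=1.0, delta=0.2, slack=0.01):
    """Trapezoidal approximation of the indicator in (2):
    1 for a <= r^2 - delta + slack, 0 for a >= r^2 - slack, linear in between.
    'slack' stands in for the term 4*B**2*D*nu in the text."""
    hi = r2 - slack                 # output 0 beyond this threshold
    lo = r2 - delta + slack         # output 1 below this threshold
    return (relu(-a + hi) - relu(-a + lo)) / (hi - lo)

a = np.array([0.5, 0.85, 0.95, 1.2])    # candidate squared distances
print(indicator_hat(a))                 # [1.0, ~0.78, ~0.22, 0.0]
```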
Step 4. Taylor approximation. In each chart (U_i, φ_i), we locally approximate f using Taylor polynomials of order n. Specifically, we decompose f as f = Σ_{i=1}^{C_M} f_i with f_i = f ρ_i, where ρ_i is an element of a C^∞ partition of unity on M that is supported inside U_i. The existence of such a partition of unity is guaranteed by Proposition 1. Since M is compact and ρ_i is C^∞, f_i preserves the regularity (smoothness) of f, so that f_i ∈ H^{n,α} for i = 1, ..., C_M.

Lemma 2. Suppose Assumption 3 holds. For i = 1, ..., C_M, the function f_i belongs to H^{n,α}: there exists a Hölder coefficient L_i depending on d, f_i, and φ_i such that for any |n| = n, we have

    |D^n(f_i ∘ φ_i^{-1})|_{φ_i(x_1)} − D^n(f_i ∘ φ_i^{-1})|_{φ_i(x_2)}| ≤ L_i ‖φ_i(x_1) − φ_i(x_2)‖_2^α,   for all x_1, x_2 ∈ U_i.

Proof Sketch. We provide a sketch here; details can be found in Appendix B.1. Denote g_1 = f ∘ φ_i^{-1} and g_2 = ρ_i ∘ φ_i^{-1}. We have D^n(f_i ∘ φ_i^{-1}) = D^n(g_1 × g_2) = Σ_{|p|+|q|=n} (n choose |p|) D^p g_1 D^q g_2, by the Leibniz rule. Consider each term in the sum: for any x_1, x_2 ∈ U_i,

    |D^p g_1 D^q g_2|_{φ_i(x_1)} − D^p g_1 D^q g_2|_{φ_i(x_2)}|
      ≤ |D^p g_1(φ_i(x_1))| |D^q g_2|_{φ_i(x_1)} − D^q g_2|_{φ_i(x_2)}| + |D^q g_2(φ_i(x_2))| |D^p g_1|_{φ_i(x_1)} − D^p g_1|_{φ_i(x_2)}|
      ≤ λ_i θ_{i,α} ‖φ_i(x_1) − φ_i(x_2)‖_2^α + μ_i β_{i,α} ‖φ_i(x_1) − φ_i(x_2)‖_2^α.

Here λ_i and μ_i are uniform upper bounds on the derivatives of g_1 and g_2 of order up to n, respectively. The last inequality above is derived as follows: by the mean value theorem, we have

    |D^q g_2|_{φ_i(x_1)} − D^q g_2|_{φ_i(x_2)}| ≤ √d μ_i ‖φ_i(x_1) − φ_i(x_2)‖_2
      = √d μ_i ‖φ_i(x_1) − φ_i(x_2)‖_2^{1−α} ‖φ_i(x_1) − φ_i(x_2)‖_2^α
      ≤ √d μ_i (2r)^{1−α} ‖φ_i(x_1) − φ_i(x_2)‖_2^α,

where the last inequality is due to the fact that ‖φ_i(x_1) − φ_i(x_2)‖_2 ≤ b_i ‖V_i‖ ‖x_1 − x_2‖_2 ≤ 2r. We then set θ_{i,α} = √d μ_i (2r)^{1−α} and, by a similar argument, β_{i,α} = √d λ_i (2r)^{1−α}. We complete the proof by taking L_i = 2^{n+1} √d λ_i μ_i (2r)^{1−α}.

Lemma 2 is crucial for the error estimation in the local approximation of f_i ∘ φ_i^{-1} by Taylor polynomials. This error estimate is given in the following theorem, where some of the proof techniques are from Theorem 1 in Yarotsky (2017).

Theorem 3. Let f_i = f ρ_i as in Step 4. For any δ ∈ (0, 1), there exists a ReLU network structure such that, if the weight parameters are properly chosen, the network yields an approximation of f_i ∘ φ_i^{-1} uniformly with error δ. Such a network has no more than c(log(1/δ) + 1) layers, and at most c′ δ^{-d/(n+α)} (log(1/δ) + 1) neurons and weight parameters, with c, c′ depending on n, d, and f_i ∘ φ_i^{-1}.

Proof Sketch. The detailed proof is provided in Appendix B.2. The proof consists of two steps: 1) approximate f_i ∘ φ_i^{-1} by a weighted sum of Taylor polynomials; 2) implement the weighted sum of Taylor polynomials using ReLU networks. Specifically, we set up a uniform grid and divide [0, 1]^d into small cubes, and then approximate f_i ∘ φ_i^{-1} by its n-th order Taylor polynomial in each cube. To implement such polynomials by ReLU networks, we recursively apply the multiplication operator ×̂ of Corollary 1, since these polynomials are sums of products of different variables.
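A one-dimensional caricature of this proof strategy (an illustration under simplifying assumptions: n = 1, exact arithmetic instead of ReLU sub-networks) blends first-order Taylor polynomials at uniform grid points with piecewise-linear weights that form a partition of unity.

```python
import numpy as np

f = np.cos                               # the function to approximate on [0, 1]
df = lambda t: -np.sin(t)                # its derivative (first-order Taylor data)

N = 10                                   # number of grid cells (illustrative)
grid = np.linspace(0.0, 1.0, N + 1)

def hat(t, k):
    """Piecewise-linear bump centered at grid[k]; the hats sum to 1 on [0, 1]."""
    return np.clip(1.0 - np.abs(t - grid[k]) * N, 0.0, None)

def taylor_blend(t):
    """Weighted sum of local first-order Taylor polynomials."""
    return sum(hat(t, k) * (f(grid[k]) + df(grid[k]) * (t - grid[k]))
               for k in range(N + 1))

t = np.linspace(0.0, 1.0, 501)
print(np.max(np.abs(taylor_blend(t) - f(t))))   # O(1/N^2); a few times 1e-3 here
```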
Step 5. Estimating the total error. We have collected all the ingredients to implement the entire ReLU network that approximates f on M. Recall that the network structure consists of three main sub-networks, as demonstrated in Figure 2. Let ×̂ be an approximation of the multiplication operator in the pairing sub-network with error η. Accordingly, the function given by the whole network is

    f̂ = Σ_{i=1}^{C_M} ×̂(f̂_i, 1̂_Δ ∘ d̂_i^2)   with   f̂_i = f̃_i ∘ φ_i,

where f̃_i is the approximation of f_i ∘ φ_i^{-1} by Taylor polynomials from Theorem 3. The total error can be decomposed into three components according to the following theorem.
Theorem 4. For any i = 1, ..., C_M, we have ‖f̂ − f‖_∞ ≤ Σ_{i=1}^{C_M} (A_{i,1} + A_{i,2} + A_{i,3}), where

    A_{i,1} = ‖×̂(f̂_i, 1̂_Δ ∘ d̂_i^2) − f̂_i × (1̂_Δ ∘ d̂_i^2)‖_∞ ≤ η,
    A_{i,2} = ‖f̂_i × (1̂_Δ ∘ d̂_i^2) − f_i × (1̂_Δ ∘ d̂_i^2)‖_∞ ≤ δ,
    A_{i,3} = ‖f_i × (1̂_Δ ∘ d̂_i^2) − f_i × 1(x ∈ U_i)‖_∞ ≤ c(π + 1)/(r(1 − r/τ)) Δ for some constant c.

Here 1(x ∈ U_i) is the indicator function on U_i. Theorem 4 is proved in Appendix B.3. In order to achieve an ε total approximation error, i.e., ‖f − f̂‖_∞ ≤ ε, we need to control the errors of the three sub-networks. In other words, we need to decide ν for d̂_i^2, Δ for 1̂_Δ, δ for f̃_i, and η for ×̂. Note that A_{i,1} is the error from the pairing sub-network, A_{i,2} is the approximation error of the Taylor approximation sub-network, and A_{i,3} is the error from the chart determination sub-network. The error bounds on A_{i,1} and A_{i,2} are straightforward from the constructions of ×̂ and f̂_i. The estimate of A_{i,3} involves some technical analysis, since ‖1̂_Δ ∘ d̂_i^2 − 1(x ∈ U_i)‖_∞ = 1. Note that 1̂_Δ ∘ d̂_i^2(x) − 1(x ∈ U_i) = 0 whenever ‖x − c_i‖_2^2 < r^2 − Δ or ‖x − c_i‖_2^2 > r^2, so we only need to prove that |f_i(x)| is sufficiently small in the region K_i defined below.

Lemma 3. For any i = 1, ..., C_M, denote K_i = {x ∈ M : r^2 − Δ ≤ ‖x − c_i‖_2^2 ≤ r^2}. Then there exists a constant c depending on the f_i's and φ_i's such that

    max_{x∈K_i} |f_i(x)| ≤ c(π + 1)/(r(1 − r/τ)) Δ.

Proof Sketch. The detailed proof is in Appendix B.4. The function f_i ∘ φ_i^{-1} is defined on φ_i(U_i) ⊂ [0, 1]^d. We extend f_i ∘ φ_i^{-1} to [0, 1]^d by letting f_i ∘ φ_i^{-1}(x) = 0 for x ∈ [0, 1]^d \ φ_i(U_i). It is easy to verify that such an extension preserves the regularity of f_i ∘ φ_i^{-1}, since supp(f_i) is a compact subset of U_i. By the mean value theorem, for any x, y ∈ K_i, there exists z = β φ_i(x) + (1 − β) φ_i(y) for some β ∈ (0, 1) such that

    |f_i(x) − f_i(y)| ≤ ‖∇(f_i ∘ φ_i^{-1})(z)‖_2 ‖φ_i(x) − φ_i(y)‖_2 ≤ ‖∇(f_i ∘ φ_i^{-1})(z)‖_2 b_i ‖V_i‖_2 ‖x − y‖_2.

We pick y ∈ ∂U_i (the boundary of U_i), so that f_i(y) = 0.
Since f_i ∈ H^{n,α} and M is compact, ‖∇(f_i ∘ φ_i^{-1})(z)‖_2 b_i ‖V_i‖_2 ≤ c for some c > 0. To bound |f_i(x)|, the key is to estimate ‖x − y‖_2. We next prove that, for any x ∈ K_i, there exists y ∈ ∂U_i satisfying ‖x − y‖_2 ≤ (π + 1)/(r(1 − r/τ)) Δ. The idea is to consider a geodesic⁶ γ(t), parameterized by arc length, from x to ∂U_i, as in Figure 5. Denote y = ∂U_i ∩ γ. Without loss of generality, we shift the center c_i to 0 in the following analysis. To utilize polar coordinates, we define two auxiliary quantities: θ(t) = γ(t)^T γ̇(t)/‖γ(t)‖_2 and ℓ(t) = ‖γ(t)‖_2, where γ̇ denotes the derivative of γ.

⁶ A geodesic is the shortest path between two points on the manifold. We refer readers to Chapter 6 in Lee (2006) for a formal introduction.

We show that there exists a geodesic γ(t) satisfying inf_t ℓ̇(t) ≥ (1 − r/τ)/(π + 1) > 0. This implies that the geodesic continuously moves away from the center. Denote by T the value such that γ(T) = y. By the definition of a geodesic, T is the arc length of γ(t) between x and y. We have T inf_t ℓ̇(t) ≤ ℓ(T) − ℓ(0) ≤ r − √(r^2 − Δ) ≤ Δ/r. Therefore, ‖x − y‖_2 ≤ T ≤ Δ/(r inf_t ℓ̇(t)) ≤ (π + 1)/(r(1 − r/τ)) Δ.

Figure 5: A geometric illustration of θ and ℓ.

Given Theorem 4, we choose

    η = δ = ε/(3C_M)   and   Δ = r(1 − r/τ)ε / (3c(π + 1)C_M)    (3)

so that the approximation error is bounded by ε. Moreover, we choose ν = Δ/(16B^2 D) to guarantee Δ > 8B^2 D ν, so that the definition of 1̂_Δ is valid.

Finally, we quantify the size of the ReLU network. Recall that the chart determination sub-network has c_1 log(1/ν) layers, the Taylor approximation sub-network has c_2 log(1/δ) layers, and the pairing sub-network has c_3 log(1/η) layers. Here c_2 depends on d, n, f, and c_1, c_3 are absolute constants. Combining these with (3) yields the depth in Theorem 1. By a similar argument, we can obtain the number of neurons and weight parameters. A detailed analysis is given in Appendix B.5.

5 Discussions

ReLU activations. We consider neural networks with ReLU activations for a practical reason: ReLU activations are widely used in deep networks. Moreover, ReLU networks are easier to train than networks with sigmoid or hyperbolic tangent activations, which are known for their notorious vanishing gradient problem (Goodfellow et al., 2016; Glorot et al., 2011).

Low Dimensional Manifolds. The low dimensional manifold model plays a vital role in reducing the network size. As shown in Theorem 2, to approximate functions in F^{n,D} with accuracy ε, the minimal number of weight parameters is O(ε^{-D/n}). This lower bound is huge, and cannot be improved without low dimensional structures in the data.

Existence vs. Learnability and Generalization. Our Theorem 1 shows the existence of a ReLU network structure that gives efficient approximations of functions on low dimensional manifolds, if the weight parameters are properly chosen.
In practice, it is observed that larger neural networks are easier to train and yield better generalization performance (Li et al., 2018; Zhang et al., 2016; Arora et al., 2018). This is referred to as overparameterization. Establishing the connection between learnability and generalization is an important future direction.

Convolutional Filters. Convolutional neural networks (CNNs, Krizhevsky et al. (2012)) are widely used in computer vision, language modeling, etc. Empirical results reveal that different convolutional filters can capture various patterns in images, e.g., edge detection filters. An interesting question is whether convolutional filters serve as charts in our framework.

Equivalent Networks. The ReLU network identified in Theorem 1 is sparsely connected. Several other network structures can yield the same function as our ReLU network. It is interesting to investigate whether these network structures also possess the universal approximation property.

6 Acknowledgements

This work is supported by NSF grants DMS 1818751 and III 1717916. The authors would like to thank Ryan Tibshirani for his helpful discussions and insightful comments.

References

AAMARI, E., KIM, J., CHAZAL, F., MICHEL, B., RINALDO, A., WASSERMAN, L. ET AL. (2019). Estimating the reach of a manifold. Electronic Journal of Statistics, 13 1359–1399.
AMODEI, D., ANANTHANARAYANAN, S., ANUBHAI, R., BAI, J., BATTENBERG, E., CASE, C., CASPER, J., CATANZARO, B., CHENG, Q., CHEN, G. ET AL. (2016). Deep speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning.

ARORA, S., COHEN, N. and HAZAN, E. (2018). On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509.

BAHDANAU, D., CHO, K. and BENGIO, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

BARRON, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39 930–945.

BICKEL, P. J., LI, B. ET AL. (2007). Local polynomial regression on unknown manifolds. In Complex Datasets and Inverse Problems. Institute of Mathematical Statistics, 177–186.

CHUI, C. K. and LI, X. (1992). Approximation by ridge functions and neural networks with one hidden layer. Journal of Approximation Theory, 70 131–141.

CHUI, C. K. and MHASKAR, H. N. (2016). Deep nets for local manifold learning. arXiv preprint arXiv:1607.07110.

COIFMAN, R. R. and MAGGIONI, M. (2006). Diffusion wavelets. Applied and Computational Harmonic Analysis, 21 53–94.

CONWAY, J. H., SLOANE, N. J. A. and BANNAI, E. (1987). Sphere-packings, Lattices, and Groups. Springer-Verlag, Berlin, Heidelberg.

CYBENKO, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2 303–314.
DEVORE, R. A., HOWARD, R. and MICCHELLI, C. (1989). Optimal nonlinear approximation. Manuscripta Mathematica, 63 469–478.

FUNAHASHI, K.-I. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2 183–192.

GLOROT, X., BORDES, A. and BENGIO, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.

GOODFELLOW, I., BENGIO, Y. and COURVILLE, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.

GOODFELLOW, I., POUGET-ABADIE, J., MIRZA, M., XU, B., WARDE-FARLEY, D., OZAIR, S., COURVILLE, A. and BENGIO, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems.

GRAVES, A., MOHAMED, A.-R. and HINTON, G. (2013). Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE.

GU, S., HOLLY, E., LILLICRAP, T. and LEVINE, S. (2017). Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE.

GYÖRFI, L., KOHLER, M., KRZYZAK, A. and WALK, H. (2006). A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media.

HANIN, B. (2017). Universal function approximation by deep neural nets with bounded width and ReLU activations. arXiv preprint arXiv:1708.02691.

HORNIK, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4 251–257.

HU, J., SHEN, L. and SUN, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

IRIE, B. and MIYAKE, S. (1988). Capabilities of three-layered perceptrons. In IEEE International Conference on Neural Networks, vol. 1.

JIANG, F., JIANG, Y., ZHI, H., DONG, Y., LI, H., MA, S., WANG, Y., DONG, Q., SHEN, H. and WANG, Y. (2017). Artificial intelligence in healthcare: past, present and future. Stroke and Vascular Neurology, 2 230–243.

KRIZHEVSKY, A., SUTSKEVER, I. and HINTON, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

LEE, J. M. (2006). Riemannian Manifolds: An Introduction to Curvature, vol. 176. Springer Science & Business Media.

LESHNO, M., LIN, V. Y., PINKUS, A. and SCHOCKEN, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6 861–867.

LI, H., XU, Z., TAYLOR, G., STUDER, C. and GOLDSTEIN, T. (2018). Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems.

LONG, J., SHELHAMER, E. and DARRELL, T. (2015). Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

LU, Z., PU, H., WANG, F., HU, Z. and WANG, L. (2017). The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems.

MHASKAR, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8 164–177.

MIOTTO, R., WANG, F., WANG, S., JIANG, X. and DUDLEY, J. T. (2017). Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics, 19 1236–1246.
NIYOGI, P., SMALE, S. and WEINBERGER, S. (2008). Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39 419–441.

PANAYOTOV, V., CHEN, G., POVEY, D. and KHUDANPUR, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.

ROWEIS, S. T. and SAUL, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290 2323–2326.

SHAHAM, U., CLONINGER, A. and COIFMAN, R. R. (2018). Provable approximation properties for deep neural networks. Applied and Computational Harmonic Analysis, 44 537–557.

SIMONYAN, K. and ZISSERMAN, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

TENENBAUM, J. B., DE SILVA, V. and LANGFORD, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290 2319–2323.

TU, L. (2010). An Introduction to Manifolds. Universitext, Springer New York. https://books.google.com/books?id=br1KngEACAAJ

YAROTSKY, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks, 94 103–114.

YOUNG, T., HAZARIKA, D., PORIA, S. and CAMBRIA, E. (2018). Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13 55–75.

ZHANG, C., BENGIO, S., HARDT, M., RECHT, B. and VINYALS, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
", "award": [], "sourceid": 4453, "authors": [{"given_name": "Minshuo", "family_name": "Chen", "institution": "Georgia Tech"}, {"given_name": "Haoming", "family_name": "Jiang", "institution": "Georgia Institute of Technology"}, {"given_name": "Wenjing", "family_name": "Liao", "institution": "Georgia Tech"}, {"given_name": "Tuo", "family_name": "Zhao", "institution": "Georgia Tech"}]}