{"title": "On the Complexity of Learning Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 5514, "page_last": 5522, "abstract": "The stunning empirical successes of neural networks currently lack rigorous theoretical explanation. What form would such an explanation take, in the face of existing complexity-theoretic lower bounds? A first step might be to show that data generated by neural networks with a single hidden layer, smooth activation functions and benign input distributions can be learned efficiently. We demonstrate here a comprehensive lower bound ruling out this possibility: for a wide class of activation functions (including all currently used), and inputs drawn from any logconcave distribution, there is a family of one-hidden-layer functions whose output is a sum gate that are hard to learn in a precise sense: any statistical query algorithm (which includes all known variants of stochastic gradient descent with any loss function) needs an exponential number of queries even using tolerance inversely proportional to the input dimensionality. Moreover, this hard family of functions is realizable with a small (sublinear in dimension) number of activation units in the single hidden layer. The lower bound is also robust to small perturbations of the true weights. Systematic experiments illustrate a phase transition in the training error as predicted by the analysis.", "full_text": "On the Complexity of Learning Neural Networks\n\nLe Song\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nlsong@cc.gatech.edu\n\nSantosh Vempala\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nvempala@gatech.edu\n\nGeorgia Institute of Technology\n\nGeorgia Institute of Technology\n\nJohn Wilmes\n\nAtlanta, GA 30332\n\nwilmesj@gatech.edu\n\nBo Xie\n\nAtlanta, GA 30332\n\nbo.xie@gatech.edu\n\nAbstract\n\nThe stunning empirical successes of neural networks currently lack rigorous the-\noretical explanation. 
What form would such an explanation take, in the face of existing complexity-theoretic lower bounds? A first step might be to show that data generated by neural networks with a single hidden layer, smooth activation functions and benign input distributions can be learned efficiently. We demonstrate here a comprehensive lower bound ruling out this possibility: for a wide class of activation functions (including all currently used), and inputs drawn from any logconcave distribution, there is a family of one-hidden-layer functions whose output is a sum gate, that are hard to learn in a precise sense: any statistical query algorithm (which includes all known variants of stochastic gradient descent with any loss function) needs an exponential number of queries even using tolerance inversely proportional to the input dimensionality. Moreover, this hard family of functions is realizable with a small (sublinear in dimension) number of activation units in the single hidden layer. The lower bound is also robust to small perturbations of the true weights. Systematic experiments illustrate a phase transition in the training error as predicted by the analysis.

1 Introduction

It is well-known that Neural Networks (NN's) provide universal approximate representations [11, 6, 2] under mild assumptions, i.e., any real-valued function can be approximated by a NN. This holds for a wide class of activation functions (hidden layer units) and even with only a single hidden layer (although there is a trade-off between depth and width [8, 20]).
Typically, learning a NN is done by stochastic gradient descent applied to a loss function comparing the network's current output to the values of the given training data; for regression, the loss function is typically just the least-squares error. Variants of gradient descent include drop-out, regularization, perturbation, batch gradient descent, etc. In all cases, the training algorithm has the following form:

Repeat:

1. Compute a fixed function F_W(.) defined by the current network weights W on a subset of training examples.

2. Use F_W(.) to update the current weights W.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The empirical success of this approach raises the question: what can NN's learn efficiently in theory? In spite of much effort, at the moment there are no satisfactory answers to this question, even with reasonable assumptions on the function being learned and the input distribution.
When learning involves some computationally intractable optimization problem, e.g., learning an intersection of halfspaces over the uniform distribution on the Boolean hypercube, then any training algorithm is unlikely to be efficient. This is the case even for improper learning (when the complexity of the hypothesis class being used to learn can be greater than that of the target class). Such lower bounds are unsatisfactory to the extent that they rely on discrete (or at least nonsmooth) functions and distributions. What if we assume that the function to be learned is generated by a NN with a single hidden layer of smooth activation units, and the input distribution is benign?
Can such functions be learned efficiently by gradient descent?
Our main result is a lower bound, showing that a simple and natural family of functions generated by 1-hidden-layer NN's using any known activation function (e.g., sigmoid, ReLU), with each input drawn from a logconcave input distribution (e.g., Gaussian, uniform in an interval), is hard to learn by a wide class of algorithms, including those in the general form above. Our finding implies that efficient NN training algorithms need to use stronger assumptions on the target function and input distribution, beyond Lipschitzness and smoothness, even when the true data is generated by a NN with a single hidden layer.
The idea of the lower bound has two parts. First, NN updates can be viewed as statistical queries to the input distribution. Second, there are many very different 1-layer networks, and in order to learn the correct one, any algorithm that makes only statistical queries of not too small accuracy has to make an exponential number of queries. The lower bound uses the SQ framework of Kearns [13] as generalized by Feldman et al. [9].

1.1 Statistical query algorithms

A statistical query (SQ) algorithm is one that solves a computational problem over an input distribution; its interaction with the input is limited to querying the expected value of a bounded function up to a desired accuracy. More precisely, for any integer t > 0 and distribution D over X, a VSTAT(t) oracle takes as input a query function f : X → [0, 1] with expectation p = E_D(f(x)) and returns a value v such that

|E_{x∼D}(f(x)) − v| ≤ max{ 1/t, √(p(1 − p)/t) }.

The bound on the RHS is the standard deviation of the average of t independent Bernoulli coins with the desired expectation, i.e., the error that even a random sample of size t would yield.
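To make the oracle concrete, here is a minimal sketch (ours, not the paper's) of how a VSTAT(t) answer can be produced by averaging the query function over t i.i.d. samples; the function names are illustrative:

```python
import math
import random

def vstat_tolerance(p, t):
    """Allowed error of a VSTAT(t) oracle on a query with true expectation p."""
    return max(1.0 / t, math.sqrt(p * (1.0 - p) / t))

def simulate_vstat(f, draw, t):
    """Answer a query f : X -> [0, 1] the way a sample of size t would:
    average f over t i.i.d. draws from the input distribution."""
    return sum(f(draw()) for _ in range(t)) / t

# Example: query the probability that a standard Gaussian input is positive.
rng = random.Random(0)
estimate = simulate_vstat(lambda x: 1.0 if x > 0 else 0.0,
                          lambda: rng.gauss(0.0, 1.0), 1000)
```

The empirical mean of t samples lands within the VSTAT(t) tolerance up to constant factors with high probability, which is exactly the sense in which the oracle models sampling error.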
In this paper, we study SQ algorithms that access the input distribution only via the VSTAT(t) oracle. The remaining computation is unrestricted and can use randomization (e.g., to determine which query to ask next). In the case of an algorithm training a neural network via gradient descent, the relevant query functions are derivatives of the loss function.
The statistical query framework was first introduced by Kearns for supervised learning problems [14] using the STAT(τ) oracle, which, for τ ∈ R+, responds to a query function f : X → [0, 1] with a value v such that |E_D(f) − v| ≤ τ. The STAT(√τ) oracle can be simulated by the VSTAT(O(1/τ)) oracle. The VSTAT oracle was introduced by [9], who extended these oracles to general problems over distributions.

1.2 Main result

We will describe a family C of functions f : R^n → R that can be computed exactly by a small NN, but cannot be efficiently learned by an SQ algorithm. While our result applies to all commonly used activation units, we will use sigmoids as a running example. Let σ(z) be the sigmoid gate that goes to 0 for z < 0 and goes to 1 for z > 0. The sigmoid gates have sharpness parameter s, i.e., σ(x) = σ_s(x) = (1 + e^{−sx})^{−1}. Note that the parameter s also bounds the Lipschitz constant of σ(x).

A function f : R^n → R can be computed exactly by a single hidden layer NN with sigmoid gates precisely when it is of the form f(x) = h(σ(g(x))), where g : R^n → R^m and h : R^m → R are affine, and σ acts component-wise.
Here, m is the number of hidden units, or sigmoid gates, of the NN.
In the case of a learning problem for a class C of functions f : X → R, the input distribution to the algorithm is over labeled examples (x, f*(x)), where x ∼ D for some underlying distribution D on X, and f* ∈ C is a fixed concept (function).
As mentioned in the introduction, we can view a typical NN learning algorithm as a statistical query (SQ) algorithm: in each iteration, the algorithm constructs a function based on its current weights (typically a gradient or subgradient), evaluates it on a batch of random examples from the input distribution, then uses the evaluations to update the weights of the NN. Then we have the following result.

Theorem 1.1. Let n ∈ N, and let λ, s ≥ 1. There exists an explicit family C of functions f : R^n → [−1, 1], representable as a single hidden layer neural network with O(s√n log(λsn)) sigmoid units of sharpness s, a single output sum gate and a weight matrix with condition number O(poly(n, s, λ)), and an integer t = Ω(s^2 n) s.t. the following holds. Any (randomized) SQ algorithm A that uses λ-Lipschitz queries to VSTAT(t) and weakly learns C with probability at least 1/2, to within regression error 1/√t less than any constant function, over i.i.d. inputs from any logconcave distribution of unit variance on R, requires 2^{Ω(n)}/(λs^2) queries.

The Lipschitz assumption on the statistical queries is satisfied by all commonly used algorithms for training neural networks: they can be simulated with Lipschitz queries (e.g., gradients of natural loss functions with regularizers).
This assumption can be omitted if the output of the hard-to-learn family C is represented with bounded precision.
Informally, Theorem 1.1 shows that there exist simple realizable functions that are not efficiently learnable by NN training algorithms with polynomial batch sizes, assuming the algorithm allows for error as much as the standard deviation of random samples for each query. We remark that in practice, large batch sizes are seldom used for training NNs, not just for efficiency, but also since moderately noisy gradient estimates are believed to be useful for avoiding bad local minima. Even NN training algorithms with larger batch sizes will require Ω(t) samples to achieve lower error, whereas the NNs that represent functions in our class C have only Õ(√t) parameters.
Our lower bound extends to a broad family of activation units, including all the well-known ones (ReLU, sigmoid, softplus, etc.; see Section 3.1). In the case of sigmoid gates, the functions of C take the following form (cf. Figure 1.1). For a set S ⊆ {1, . . . , n}, we define f_{m,S}(x_1, . . . , x_n) = φ_m(Σ_{i∈S} x_i), where

φ_m(x) = −(2m + 1) + Σ_{k=−m}^{m} [ σ(x − (4k − 1)/s) + σ((4k + 1)/s − x) ].    (1.1)

Then C = {f_{m,S} : S ⊆ {1, . . . , n}}. We call the functions f_{m,S}, along with φ_m, the s-wave functions. It is easy to see that they are smooth and bounded. Furthermore, the size of the NN representing this hard-to-learn family of functions is only Õ(s√n), assuming the query functions (e.g., gradients of the loss function) are poly(s, n)-Lipschitz. We note that the lower bounds hold regardless of the architecture of the model, i.e., the NN used to learn.
Our lower bounds are asymptotic, but we show empirically in Section 4 that they apply even at practical values of n and s.
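As a sanity check on Eq. (1.1) (our illustration, not part of the paper), the sum in (1.1) agrees with the construction of φ_m in Figure 1.1 as a sum of translates of the L1-function ψ:

```python
import math

def sigmoid(x, s):
    """Sigmoid gate with sharpness s."""
    return 1.0 / (1.0 + math.exp(-s * x))

def phi_direct(x, m, s):
    """phi_m(x) exactly as written in Eq. (1.1)."""
    return -(2 * m + 1) + sum(
        sigmoid(x - (4 * k - 1) / s, s) + sigmoid((4 * k + 1) / s - x, s)
        for k in range(-m, m + 1))

def psi(x, s):
    """The L1 bump of Figure 1.1: sigma(1/s + x) + sigma(1/s - x) - 1."""
    return sigmoid(1 / s + x, s) + sigmoid(1 / s - x, s) - 1

def phi_translates(x, m, s):
    """phi_m(x) as a sum of 2m + 1 translates of psi spaced 4/s apart."""
    return sum(psi(x - 4 * k / s, s) for k in range(-m, m + 1))
```

The two expressions agree term by term, since ψ(x − 4k/s) = σ(x − (4k − 1)/s) + σ((4k + 1)/s − x) − 1, and summing the constant −1 over the 2m + 1 translates gives the leading −(2m + 1).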
We experimentally observe a threshold for the quantity s√n, above which stochastic gradient descent fails to train the NN to low error (that is, regression error below that of the best constant approximation) regardless of the choice of gates, architecture used for learning, learning rate, batch size, etc.
The condition number upper bound for C is significant in part because there do exist SQ algorithms for learning certain families of simple NNs with time complexity polynomial in the condition number of the weight matrix (the tensor factorization based algorithm of Janzamin et al. [12] can easily be seen to be SQ). Our results imply that this dependence cannot be substantially improved (see Section 1.3).
Remark 1. The class of input distributions can be relaxed further. Rather than being a product distribution, it suffices if the distribution is in isotropic position and invariant under reflections across and permutations of coordinate axes. And instead of being logconcave, it suffices for the marginals to be unimodal with variance σ, density O(1/σ) at the mode, and density Ω(1/σ) within a standard deviation of the mode.

Figure 1.1: (a) The sigmoid function σ, the L1-function ψ(x) = σ(1/s + x) + σ(1/s − x) − 1 constructed from sigmoid functions, and the nearly-periodic "wave" function φ_m(x) = ψ(x) + ψ(x − 4/s) + ψ(x + 4/s) + ··· constructed from ψ. (b) The architecture of the NNs computing the wave functions.

Overall, our lower bounds suggest that even the combination of small network size, smooth, standard activation functions, and benign input distributions is insufficient to make learning a NN easy, even improperly, via a very general family of algorithms. Instead, stronger structural assumptions on the NN, such as a small condition number, and very strong structural properties on the input distribution, are necessary to make learning tractable.
It is our hope that these insights will guide the discovery of provable efficiency guarantees.

1.3 Related Work

There is much work on complexity-theoretic hardness of learning neural networks [4, 7, 15]. These results have shown the hardness of learning functions representable as small (depth-2) neural networks over discrete input distributions. Since these input distributions bear little resemblance to the real-world data sets on which NNs have seen great recent empirical success, it is natural to wonder whether more realistic distributional assumptions might make learning NNs tractable. Our results suggest that benign input distributions are insufficient, even for functions realized as small networks with standard, smooth activation units.
Recent independent work of Shamir [17] shows a smooth family of functions for which the gradient of the squared loss function is not informative for training a NN over a Gaussian input distribution (more generally, for distributions with rapidly decaying Fourier coefficients). In fact, for this setting the paper shows an exponentially small bound on the gradient, relying on the fine structure of the Gaussian distribution and of the smooth functions (see [16] for a follow-up with experiments and further ideas). These smooth functions cannot be realized in small NNs using the most commonly studied activation units (though a related non-smooth family of functions for which the bounds apply can be realized by larger NNs using ReLU units).
In contrast, our bounds are (a) in the more general SQ framework, and in particular apply regardless of the loss function, regularization scheme, or specific variant of gradient descent, (b) apply to functions actually realized as small NNs using any of a wide family of activation units, (c) apply to any logconcave input distribution, and (d) are robust to small perturbations of the input layer weights.
Also related is the tensor-based algorithm of Janzamin et al. [12] to learn a 1-layer network under nondegeneracy assumptions on the weight matrix. The complexity is polynomial in the dimension, the size of the network being learned, and the condition number of the weight matrix. Since their tensor decomposition can also be implemented as a statistical query algorithm, our results give a lower bound indicating that such a polynomial dependence on the dimension and condition number is unavoidable.
Other algorithmic results for learning NNs apply in very restricted settings. For example, polynomial-time bounds are known for learning NNs with a single hidden ReLU layer over Gaussian inputs under the assumption that the hidden units use disjoint sets of inputs [5], as well as for learning a single ReLU [10] and for learning sparse polynomials via NNs [1].

1.4 Proof ideas

To prove Theorem 1.1, we wish to estimate the number of queries used by a statistical query algorithm learning the family of s-wave functions, regardless of the strategy employed by the algorithm. To that end, we estimate the statistical dimension of the family of s-wave functions. Statistical dimension is a key concept in the study of SQ algorithms, and is known to characterize the query complexity of supervised learning via SQ algorithms [3, 19, 9].
Briefly, a family C of distributions (e.g., over labeled examples) has "statistical dimension d with average correlation γ̄" if every (1/d)-fraction of C has average correlation γ̄; this condition implies that C cannot be learned with fewer than O(d) queries to VSTAT(O(1/γ̄)). See Section 2 for precise statements.
The SQ literature for supervised learning of boolean functions is rich. However, lower bounds for regression problems in the SQ framework have so far not appeared in the literature, and the existing notions of statistical dimension are too weak for this setting. We state a new, strengthened notion of statistical dimension for regression problems (Definition 2), and show that lower bounds for this dimension transfer to query complexity bounds (Theorem 2.1). The essential difference from the statistical dimension for learning is that we must additionally bound the average covariances of indicator functions (or, rather, continuous analogues of indicators) on the outputs of functions in C. The essential claim in our lower bounds is therefore in showing that a typical pair of (indicator functions on outputs of) s-wave functions has small covariance.
In other words, to prove Theorem 1.1, it suffices to upper-bound the quantity

E[(χ ∘ f_{m,S})(χ ∘ f_{m,T})] − E[χ ∘ f_{m,S}] E[χ ∘ f_{m,T}]    (1.2)

for most pairs f_{m,S}, f_{m,T} of s-wave functions, where χ is some smoothed version of an indicator function. Write h(t) = χ(φ_m(t)), so χ(f_{m,S}(x_1, . . . , x_n)) = h(Σ_{i∈S} x_i). We have

E_{(x_1,...,x_n)∼D}[ h(Σ_{i∈S} x_i) h(Σ_{i∈T} x_i) ] = E_z[ E(h(Σ_{i∈S\T} x_i + z)) · E(h(Σ_{i∈T\S} x_i + z)) ],

where z = Σ_{i∈S∩T} x_i. So to estimate Eq. (1.2), it suffices to show that the expectation of h(Σ_{i∈S} x_i) doesn't change much when we condition on the value of z = Σ_{i∈S∩T} x_i.
We now observe that if χ is Lipschitz, and φ_m is "close to" a periodic function with period θ > 0, then h is also "close to" a periodic function with period θ > 0 (see Section 3 for a precise statement). Under this near-periodicity assumption, we are now able to show, for any logconcave distribution D′ on R of variance σ > θ, and any translation z ∈ R, that

| E_{x∼D′}( h(x + z) − h(x) ) | = O(θ/σ) · E_{x∼D′}( |h(x)| ).

In particular, conditioning on the value of z = Σ_{i∈S∩T} x_i has little effect on the value of h(Σ_{i∈S} x_i).
The combination of these observations gives the query complexity lower bound. Precise statements of some of the technical lemmas are given in Section 3; the complete proof appears in the full version of this paper [18].

2 Statistical dimension

We now give a precise definition of the statistical dimension with average correlation for regression problems, extending the concept introduced in [9].
Let C be a finite family of functions f : X → R over some domain X, and let D be a distribution over X.
The average covariance and the average correlation of C with respect to D are

Cov_D(C) = (1/|C|^2) Σ_{f,g∈C} Cov_D(f, g)   and   ρ_D(C) = (1/|C|^2) Σ_{f,g∈C} ρ_D(f, g),

where ρ_D(f, g) = Cov_D(f, g)/√(Var(f) Var(g)) when both Var(f) and Var(g) are nonzero, and ρ_D(f, g) = 0 otherwise.
For y ∈ R and ε > 0, we define the ε-soft indicator function χ_y^{(ε)} : R → R as

χ_y^{(ε)}(x) = χ_y(x) = max{0, 1/ε − (1/ε)^2 |x − y|}.

So χ_y is (1/ε)^2-Lipschitz, is supported on (y − ε, y + ε), and has norm ||χ_y||_1 = 1.

Definition 2. Let γ̄ > 0, let D be a probability distribution over some domain X, and let C be a family of functions f : X → [−1, 1] that are identically distributed as random variables over D. The statistical dimension of C relative to D with average covariance γ̄ and precision ε, denoted by ε-SDA(C, D, γ̄), is defined to be the largest integer d such that the following holds: for every y ∈ R and every subset C′ ⊆ C of size |C′| > |C|/d, we have ρ_D(C′) ≤ γ̄. Moreover, Cov_D(C′_y) ≤ (max{ε, μ(y)})^2 γ̄, where C′_y = {χ_y^{(ε)} ∘ f : f ∈ C′} and μ(y) = E_D(χ_y^{(ε)} ∘ f) for some f ∈ C.

Note that the parameter μ(y) is independent of the choice of f ∈ C. The application of this notion of dimension is given by the following theorem.

Theorem 2.1. Let D be a distribution on a domain X and let C be a family of functions f : X → [−1, 1] identically distributed as random variables over D. Suppose there is d ∈ R and λ ≥ 1 ≥ γ̄ > 0 such that ε-SDA(C, D, γ̄) ≥ d, where ε ≤ γ̄/(2λ). Let A be a randomized algorithm learning C over D with probability greater than 1/2 to regression error less than Ω(1) − 2√γ̄. If A only uses queries to VSTAT(t) for some t = O(1/γ̄), which are λ-Lipschitz at any fixed x ∈ X, then A uses Ω(d) queries.

A version of the theorem for Boolean functions is proved in [9]. For completeness, in the full version of this paper [18] we include a proof of Theorem 2.1, following ideas in [19, Theorem 2].
As a consequence of Theorem 2.1, there is no need to consider an SQ algorithm's query strategy in order to obtain lower bounds on its query complexity. Instead, the lower bounds follow directly from properties of the concept class itself, in particular from bounds on average covariances of indicator functions. Theorem 1.1 will therefore follow from Theorem 2.1 by analyzing the statistical dimension of the s-wave functions.

3 Estimates of statistical dimension for one-layer functions

We now present the most general context in which we obtain SQ lower bounds.
A function φ : R → R is (M, δ, θ)-quasiperiodic if there exists a function φ̃ : R → R which is periodic with period θ such that |φ(x) − φ̃(x)| < δ for all x ∈ [−M, M]. In particular, any periodic function with period θ is (M, δ, θ)-quasiperiodic for all M, δ > 0.

Lemma 3.1. Let n ∈ N and let θ > 0. There exists γ̄ = O(θ^2/n) such that for all ε > 0, there exist M = O(√n log(n/(εθ))) and δ = Ω(ε^3 θ/√n) and a family C_0 of affine functions g : R^n → R of bounded operator norm with the following property.
Suppose φ : R → [−1, 1] is (M, δ, θ)-quasiperiodic and Var_{x∼U(0,θ)}(φ(x)) = Ω(1). Let D be a logconcave distribution with unit variance on R. Then for C = {φ ∘ g : g ∈ C_0}, we have ε-SDA(C, D^n, γ̄) ≥ 2^{Ω(n)} ε θ^2. Furthermore, the functions of C are identically distributed as random variables over D^n.

In other words, we have statistical dimension bounds (and hence query complexity bounds) for functions that are sufficiently close to periodic. However, the activation units of interest are generally monotonic increasing functions, such as sigmoids and ReLUs, that are quite far from periodic. Hence, in order to apply Lemma 3.1 in our context, we must show that the activation units of interest can be combined to make nearly periodic functions.
As an intermediate step, we analyze activation functions in L1(R), i.e., functions whose absolute value has bounded integral over the whole real line. These L1-functions analyzed in our framework are themselves constructed as affine combinations of the usual activation functions. For example, for the sigmoid unit with sharpness s, we study the following L1-function (cf. (1.1)):

ψ(x) = σ(1/s + x) + σ(1/s − x) − 1.    (3.1)

We now describe the properties of the integrable functions ψ that will be used in the proof.

Definition 3. For ψ ∈ L1(R), we say the essential radius of ψ is the number r ∈ R such that ∫_{−r}^{r} |ψ| = (5/6) ||ψ||_1.

Definition 4. We say ψ ∈ L1(R) has the mean bound property if for all x ∈ R and ε > 0, we have

ψ(x) = O( (1/ε) ∫_{x−ε}^{x+ε} |ψ(t)| dt ).

In particular, if ψ is bounded, and monotonic nonincreasing (resp. nondecreasing) for sufficiently large positive (resp. negative) inputs, then ψ satisfies Definition 4. Alternatively, it suffices for ψ to have bounded first derivative.
To complete the proof of Theorem 1.1, we show that we can combine activation units ψ satisfying the above properties in a function which is close to periodic, i.e., which satisfies the hypotheses of Lemma 3.1 above.

Lemma 3.2. Let ψ ∈ L1(R) have the mean bound property and let r > 0 be such that ψ has essential radius at most r and ||ψ||_1 = Θ(r). Let M, δ > 0. Then there is a pair of affine functions h : R^m → R and g : R → R^m such that if φ(x) = h(ψ(g(x))), where ψ is applied component-wise, then φ is (M, δ, 4r)-quasiperiodic. Furthermore, φ(x) ∈ [−1, 1] for all x ∈ R, and Var_{x∼U(0,4r)}(φ(x)) = Ω(1), and we may take m = (1/r) · O(max{m_1, M}), where m_1 satisfies

∫_{m_1}^{∞} (|ψ(x)| + |ψ(−x)|) dx < 4δr.

We now sketch how Lemmas 3.1 and 3.2 imply Theorem 1.1 for sigmoid units.

Sketch of proof of Theorem 1.1. The sigmoid function σ with sharpness s is not even in L1(R), so it is unsuitable as the function ψ of Lemma 3.2. Instead, we define ψ to be an affine combination of σ gates as in Eq. (3.1). Then ψ satisfies the hypotheses of Lemma 3.2.
Let θ = 4r and let γ̄ = O(θ^2/n) be as given by the statement of Lemma 3.1. Let ε = γ̄/(2λ), and let M = O(√n log(n/(εθ))) and δ = Ω(ε^3 θ/√n) be as given by the statement of Lemma 3.1. By Lemma 3.2, there is m ∈ N and functions h : R^m → R and g : R → R^m such that φ = h ∘ ψ ∘ g is (M, δ, θ)-quasiperiodic and satisfies the hypotheses of Lemma 3.1.
Therefore, we have a family C_0 of affine functions f : R^n → R such that C = {φ ∘ f : f ∈ C_0} satisfies ε-SDA(C, D, γ̄) ≥ 2^{Ω(n)} ε θ^2. Therefore, the functions in C satisfy the hypothesis of Theorem 2.1, giving the query complexity lower bound.
All details are given in the full version of the paper [18].

3.1 Different activation functions

Similar proofs give corresponding lower bounds for activation functions other than sigmoids. In every case, we reduce to gates satisfying the hypotheses of Lemma 3.2 by constructing an appropriate L1-function ψ as an affine combination of the activation functions.
For example, let σ(x) = σ_s(x) = max{0, sx} denote the ReLU unit with slope s. Then the affine combination

ψ(x) = σ(x + 1/s) − σ(x) + σ(−x + 1/s) − σ(−x) − 1    (3.2)

is in L1(R), and is zero for |x| ≥ 1/s (and hence has the mean bound property and essential radius O(1/s)). The proof of Theorem 1.1 therefore goes through almost identically, with the slope-s ReLU units replacing the s-sharp sigmoid units. In particular, there is a family of single hidden layer NNs using O(s√n log(λsn)) slope-s ReLU units, which is not learned by any SQ algorithm using fewer than 2^{Ω(n)}/(λs^2) queries to VSTAT(O(s^2 n)), when inputs are drawn i.i.d. from a logconcave distribution.
Similarly, we can consider the s-sharp softplus function σ(x) = log(exp(sx) + 1). Then Eq. (3.2) again gives an appropriate L1(R) function to which we can apply Lemma 3.2 and therefore follow the proof of Theorem 1.1.
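As a quick numerical check (ours, not from the paper), the ReLU-based ψ of Eq. (3.2) is indeed a bump of height 1 supported on [−1/s, 1/s]:

```python
def relu(x, s):
    """ReLU unit with slope s: max{0, sx}."""
    return max(0.0, s * x)

def psi_relu(x, s):
    """The affine combination of Eq. (3.2): a tent of height 1 at x = 0
    that is identically zero for |x| >= 1/s."""
    return relu(x + 1 / s, s) - relu(x, s) + relu(-x + 1 / s, s) - relu(-x, s) - 1
```

For s = 2, the tent peaks at ψ(0) = 1, falls linearly to 0 at |x| = 1/2, and vanishes beyond.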
For softsign functions σ(x) = x/(|x| + 1), we use the affine combination

ψ(x) = σ(x + 1) + σ(−x + 1).

Figure 4.1: Test error vs. sharpness times square-root of dimension, for (a, d) the normal distribution, (b, e) the exp(−|x_i|) distribution, and (c, f) the uniform distribution on the l1 ball. Each curve corresponds to a different input dimension n. The flat line corresponds to the best error by a constant function.

In the case of softsign functions, this function ψ converges much more slowly to zero as |x| → ∞ compared to sigmoid units. Hence, in order to obtain an adequate quasiperiodic function as an affine combination of ψ-units, a much larger number of ψ-units is needed: the bound on the number m of units in this case is polynomial in the Lipschitz parameter λ of the query functions, and a larger polynomial in the input dimension n. The case of other commonly used activation functions, such as ELU (exponential linear) or LReLU (Leaky ReLU), is similar to those discussed above.

4 Experiments

In the experiments, we show how the errors, E(f(x) − y)^2, change with respect to the sharpness parameter s and the input dimension n for three input distributions: 1) the multivariate normal distribution, 2) coordinate-wise independent exp(−|x_i|), and 3) uniform in the l1 ball {x : Σ_i |x_i| ≤ n}.
For a given sharpness parameter s ∈ {0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2}, input dimension n ∈ {50, 100, 200} and input distribution, we generate the true function according to Eq. (1.1). There are a total of 50,000 training data points and 1000 test data points. We then learn the true function with fully-connected neural networks with both ReLU and sigmoid activation functions.
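The data-generation step above can be sketched as follows (a minimal sketch under our own assumptions; the subset size |S| = n/2, the helper names, and the l1-ball sampler are ours, not the authors' code):

```python
import numpy as np

def phi_m(x, m, s):
    """The s-wave function of Eq. (1.1), applied elementwise to an array."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-s * z))
    total = -(2 * m + 1) * np.ones_like(x)
    for k in range(-m, m + 1):
        total += sig(x - (4 * k - 1) / s) + sig((4 * k + 1) / s - x)
    return total

def make_dataset(n, num_points, s, m, rng, dist="normal"):
    """Draw inputs from one of the three input distributions and label them
    with f_{m,S} for a hidden random subset S of the coordinates."""
    if dist == "normal":
        X = rng.standard_normal((num_points, n))
    elif dist == "laplace":  # coordinate-wise density proportional to exp(-|x_i|)
        X = rng.laplace(size=(num_points, n))
    else:  # uniform in the l1 ball of radius n: Laplace direction, U^(1/n) radius
        X = rng.laplace(size=(num_points, n))
        X *= (n * rng.uniform(size=(num_points, 1)) ** (1.0 / n)
              / np.abs(X).sum(axis=1, keepdims=True))
    S = rng.choice(n, size=n // 2, replace=False)  # the hidden subset S
    y = phi_m(X[:, S].sum(axis=1), m, s)
    return X, y
```

A training pipeline would then fit fully-connected networks to (X, y) and compare test error against the best constant predictor, as in Figure 4.1.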
We report the best test error over the following hyperparameter choices: the number of hidden layers is 1, 2, or 4; the number of hidden units per layer varies from 4n to 8n; training is carried out using SGD with momentum 0.9; and we enumerate learning rates in {0.1, 0.01, 0.001} and batch sizes in {64, 128, 256}.
From Theorem 1.1, learning such functions should become difficult as s√n increases beyond a threshold. In Figure 4.1, we illustrate this phenomenon. Each curve corresponds to a particular input dimension n, and each point on a curve corresponds to a particular sharpness parameter s. The x-axis is s√n and the y-axis denotes the test error. We can see that at roughly s√n = 5, the problem becomes hard even empirically.

Acknowledgments

The authors are grateful to Vitaly Feldman for discussions about statistical query lower bounds and for suggestions that simplified the presentation of our results, and also to Adam Kalai for an inspiring discussion. This research was supported in part by NSF grants CCF-1563838 and CCF-1717349.

References

[1] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In International Conference on Machine Learning, pages 1908–1916, 2014.

[2] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

[3] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In STOC, pages 253–262, 1994.

[4] Avrim Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.

[5] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs.
CoRR, abs/1702.07966, 2017.\n\n[6] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of\n\nControl, Signals, and Systems (MCSS), 2(4):303\u2013314, 1989.\n\n[7] Amit Daniely and Shai Shalev-Shwartz. Complexity theoretic limitations on learning dnf\u2019s. In\nProceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June\n23-26, 2016, pages 815\u2013830, 2016.\n\n[8] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In\n\nConference on Learning Theory, pages 907\u2013940, 2016.\n\n[9] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh Vempala, and Ying Xiao. Statistical\nalgorithms and a lower bound for planted clique. In Proceedings of the 45th annual ACM\nSymposium on Theory of Computing, pages 655\u2013664. ACM, 2013.\n\n[10] Surbhi Goel, Varun Kanade, Adam R. Klivans, and Justin Thaler. Reliably learning the ReLU\n\nin polynomial time. CoRR, abs/1611.10258, 2016.\n\n[11] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are\n\nuniversal approximators. Neural networks, 2(5):359\u2013366, 1989.\n\n[12] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Generalization bounds for neural\n\nnetworks through tensor factorization. CoRR, abs/1506.08473, 2015.\n\n[13] Michael Kearns. Ef\ufb01cient noise-tolerant learning from statistical queries. Journal of the ACM,\n\n45(6):983\u20131006, 1998.\n\n[14] Michael J. Kearns. Ef\ufb01cient noise-tolerant learning from statistical queries. In Proceedings\nof the Twenty-Fifth Annual ACM Symposium on Theory of Computing, May 16-18, 1993, San\nDiego, CA, USA, pages 392\u2013401, 1993.\n\n[15] Adam R. Klivans. Cryptographic hardness of learning. In Encyclopedia of Algorithms, pages\n\n475\u2013477. 2016.\n\n[16] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of deep learning. 
CoRR, abs/1703.07950, 2017.

[17] Ohad Shamir. Distribution-specific hardness of learning neural networks. CoRR, abs/1609.01037, 2016.

[18] Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural networks. arXiv preprint arXiv:1707.04615, 2017.

[19] B. Szörényi. Characterizing statistical query learning: simplified notions and proofs. In ALT, pages 186–200, 2009.

[20] Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.