{"title": "Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2192, "page_last": 2202, "abstract": "We consider the problem of learning function classes computed by   neural networks with various activations (e.g. ReLU or Sigmoid), a   task believed to be computationally intractable in the worst-case.   A major open problem is to understand the minimal assumptions under   which these classes admit provably efficient algorithms. In this work we show   that a natural distributional assumption corresponding to {\\em     eigenvalue decay} of the Gram matrix yields polynomial-time   algorithms in the non-realizable setting for expressive classes of   networks (e.g. feed-forward networks of ReLUs).  We make no    assumptions on the structure of the network or the labels.  Given   sufficiently-strong eigenvalue decay, we obtain {\\em     fully}-polynomial time algorithms in {\\em all} the relevant   parameters with respect to square-loss.  This is the first purely   distributional assumption that leads to polynomial-time algorithms   for networks of ReLUs.  Further, unlike   prior distributional assumptions (e.g., the marginal distribution is   Gaussian), eigenvalue decay has been observed in practice on common   data sets.", "full_text": "Eigenvalue Decay Implies Polynomial-Time\n\nLearnability for Neural Networks\n\nSurbhi Goel \u2217\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\n\nsurbhi@cs.utexas.edu\n\nAdam Klivans \u2020\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\nklivans@cs.utexas.edu\n\nAbstract\n\nWe consider the problem of learning function classes computed by neural net-\nworks with various activations (e.g. ReLU or Sigmoid), a task believed to be com-\nputationally intractable in the worst-case. 
A major open problem is to understand\nthe minimal assumptions under which these classes admit provably ef\ufb01cient algo-\nrithms. In this work we show that a natural distributional assumption correspond-\ning to eigenvalue decay of the Gram matrix yields polynomial-time algorithms in\nthe non-realizable setting for expressive classes of networks (e.g. feed-forward\nnetworks of ReLUs). We make no assumptions on the structure of the network or\nthe labels. Given suf\ufb01ciently-strong eigenvalue decay, we obtain fully-polynomial\ntime algorithms in all the relevant parameters with respect to square-loss. This is\nthe \ufb01rst purely distributional assumption that leads to polynomial-time algorithms\nfor networks of ReLUs. Further, unlike prior distributional assumptions (e.g., the\nmarginal distribution is Gaussian), eigenvalue decay has been observed in practice\non common data sets.\n\n1\n\nIntroduction\n\nUnderstanding the computational complexity of learning neural networks from random examples\nis a fundamental problem in machine learning. Several researchers have proved results showing\ncomputational hardness for the worst-case complexity of learning various networks\u2013 that is, when\nno assumptions are made on the underlying distribution or the structure of the network [10, 16,\n21, 26, 43]. As such, it seems necessary to take some assumptions in order to develop ef\ufb01cient\nalgorithms for learning deep networks (the most expressive class of networks known to be learnable\nin polynomial-time without any assumptions is a sum of one hidden layer of sigmoids [16]). A\nmajor open question is to understand what are the \u201ccorrect\u201d or minimal assumptions to take in\norder to guarantee ef\ufb01cient learnability3. An oft-taken assumption is that the marginal distribution is\nequal to some smooth distribution such as a multivariate Gaussian. 
Even under such a distributional\nassumption, however, there is evidence that fully polynomial-time algorithms are still hard to obtain\nfor simple classes of networks [19, 36]. As such, several authors have made further assumptions on\nthe underlying structure of the model (and/or work in the noiseless or realizable setting).\nIn fact, in an interesting recent work, Shamir [34] has given evidence that both distributional\nassumptions and assumptions on the network structure are necessary for efficient learnability using\ngradient-based methods. Our main result is that under only an assumption on the marginal\ndistribution, namely eigenvalue decay of the Gram matrix, there exist efficient algorithms for learning broad\nclasses of neural networks even in the non-realizable (agnostic) setting with respect to square loss.\nFurthermore, eigenvalue decay has been observed often in real-world data sets, unlike distributional\nassumptions that take the marginal to be unimodal or Gaussian. As one would expect, stronger\nassumptions on the eigenvalue decay result in polynomial learnability for broader classes of networks,\nbut even mild eigenvalue decay will result in savings in runtime and sample complexity.\n\n\u2217Work supported by a Microsoft Data Science Initiative Award.\n\u2020Part of this work was done while visiting the Simons Institute for Theoretical Computer Science.\n3For example, a very recent paper of Song, Vempala, Xie, and Williams [36] asks \u201cWhat form would such\nan explanation take, in the face of existing complexity-theoretic lower bounds?\u201d\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe relationship between our assumption on eigenvalue decay and prior assumptions on the\nmarginal distribution being Gaussian is similar in spirit to the dichotomy between the complexity of\ncertain algorithmic problems on power-law graphs versus Erd\u0151s-R\u00e9nyi graphs. 
Several important\ngraph problems such as clique-finding become much easier when the underlying model is a\nrandom graph with appropriate power-law decay (as opposed to assuming the graph is generated\nfrom the classical G(n, p) model) [6, 22].\nIn this work we prove that neural network learning\nproblems become tractable when the underlying distribution induces an empirical Gram matrix with\nsufficiently strong eigenvalue decay.\n\nOur Contributions. Our main result is quite general and holds for any function class that can\nbe suitably embedded in an RKHS (Reproducing Kernel Hilbert Space) with corresponding kernel\nfunction k (we refer readers unfamiliar with kernel methods to [30]). Given m draws (x_1, . . . , x_m)\nfrom a distribution and kernel k, recall that the Gram matrix K is an m \u00d7 m matrix where the (i, j)\nentry equals k(x_i, x_j). For ease of presentation, we begin with an informal statement of our main\nresult that highlights the relationship between the eigenvalue decay assumption and the run-time and\nsample complexity of our final algorithm.\nTheorem 1 (Informal). Fix function class C and kernel function k. Assume C is approximated in the\ncorresponding RKHS with norm bound B. After drawing m samples, let K/m be the (normalized)\nm \u00d7 m Gram matrix with eigenvalues {\u03bb_1, . . . , \u03bb_m}. For error parameter \u03b5 > 0,\n1. If, for sufficiently large i, \u03bb_i \u2248 O(i^{\u2212p}), then C is efficiently learnable with m = \u00d5(B^{1/p}/\u03b5^{2+3/p}).\n2. If, for sufficiently large i, \u03bb_i \u2248 O(e^{\u2212i}), then C is efficiently learnable with m = \u00d5(log B/\u03b5^2).\nWe allow a failure probability for the event that the eigenvalues do not decay. In all prior work,\nthe sample complexity m depends linearly on B, and for many interesting concept classes (such as\nReLUs), B is exponential in one or more relevant parameters. 
Given Theorem 1, we can use known\nstructural results for embedding neural networks into an RKHS to estimate B and take a corresponding\neigenvalue decay assumption to obtain polynomial-time learnability. Applying bounds recently\nobtained by Goel et al. [16] we have\nCorollary 2. Let C be the class of all fully-connected networks of ReLUs with one-hidden layer\nof \u2113 hidden ReLU activations feeding into a single ReLU output activation (i.e., two hidden layers\nor depth-3). Then, assuming eigenvalue decay of O(i^{\u2212\u2113/\u03b5}), C is learnable in polynomial time with\nrespect to square loss on S^{n\u22121}. If ReLU is replaced with sigmoid, then we require eigenvalue decay\nO(i^{\u2212\u221a\u2113 log(\u221a\u2113/\u03b5)}).\nFor higher depth networks, bounds on the required eigenvalue decay can be derived from structural\nresults in [16]. Without taking an assumption, the fastest known algorithms for learning the\nabove networks run in time exponential in the number of hidden units and accuracy parameter (but\npolynomial in the dimension) [16].\nOur proof develops a novel approach for bounding the generalization error of kernel methods,\nnamely we develop compression schemes tailor-made for classifiers induced by kernel-based\nregression, as opposed to current Rademacher-complexity based approaches. Roughly, a compression\nscheme is a mapping from a training set S to a small subsample S\u2032 and side-information I. Given\nthis compressed version of S, the decompression algorithm should be able to generate a classifier h.\nIn recent work, David, Moran and Yehudayoff [13] have observed that if the size of the compression\nis much less than m (the number of samples), then the empirical error of h on S is close to its true\nerror with high probability.\nAt the core of our compression scheme is a method for giving small description length (i.e., o(m)\nbit complexity), approximate solutions to instances of kernel ridge regression. 
Even though we\nassume K has decaying eigenvalues, K is neither sparse nor low-rank, and even a single column\nor row of K has bit complexity at least m, since K is an m \u00d7 m matrix! Nevertheless, we can\nprove that recent tools from Nystr\u00f6m sampling [28] imply a type of sparsification for solutions\nof certain regression problems involving K. Additionally, using preconditioning, we can bound\nthe bit complexity of these solutions and obtain the desired compression scheme. At each stage\nwe must ensure that our compressed solutions do not lose too much accuracy, and this involves\ncarefully analyzing various matrix approximations. Our methods are the first compression-based\ngeneralization bounds for kernelized regression.\n\nRelated Work. Kernel methods [30] such as SVM, kernel ridge regression and kernel PCA have\nbeen extensively studied due to their excellent performance and strong theoretical properties. For\nlarge data sets, however, many kernel methods become computationally expensive. The literature\non approximating the Gram matrix with the overarching goal of reducing the time and space\ncomplexity of kernel methods is now vast. Various techniques such as random sampling [39], subspace\nembedding [2], and matrix factorization [15] have been used to find a low-rank approximation that\nis efficient to compute and gives small approximation error. 
The most relevant set of tools for our\npaper is Nystr\u00f6m sampling [39, 14], which constructs an approximation of K using a subset of\nthe columns indicated by a selection matrix S to generate a positive semi-de\ufb01nite approximation.\nRecent work on leverage scores have been used to improve the guarantees of Nystr\u00f6m sampling in\norder to obtain linear time algorithms for generating these approximations [28].\nThe novelty of our approach is to use Nystr\u00f6m sampling in conjunction with compression schemes\nto give a new method for giving provable generalization error bounds for kernel methods. Compres-\nsion schemes have typically been studied in the context of classi\ufb01cation problems in PAC learning\nand for combinatorial problems related to VC dimension [23, 24]. Only recently some authors\nconsidered compression schemes in a general, real-valued learning scenario [13]. Cotter, Shalev-\nShwartz, and Srebro have studied compression for classi\ufb01cation using SVMs to prove that for gen-\neral distributions, compressing classi\ufb01ers with low generalization error is not possible [9].\nThe general phenomenon of eigenvalue decay of the Gram matrix has been studied from both a the-\noretical and applied perspective. Some empirical studies of eigenvalue decay and related discussion\ncan be found in [27, 35, 38]. There has also been prior work relating eigenvalue decay to gen-\neralization error in the context of SVMs or Kernel PCA (e.g., [29, 35]). Closely related notions to\neigenvalue decay are that of local Rademacher complexity due to Bartlett, Bousquet, and Mendelson\n[4] (see also [5]) and that of effective dimensionality due to Zhang [42].\nThe above works of Bartlett et al. and Zhang give improved generalization bounds via data-\ndependent estimates of eigenvalue decay of the kernel. 
At a high level, the goal of these works\nis to work under an assumption on the effective dimension and improve Rademacher-based\ngeneralization error bounds from 1/\u221am to 1/m (m is the number of samples) for functions embedded in an\nRKHS of unit norm. These works do not address the main obstacle of this paper, however, namely\novercoming the complexity of the norm of the approximating RKHS. Their techniques are mostly\nincomparable even though the intent of using effective dimension as a measure of complexity is the\nsame.\nShamir has shown that for general linear prediction problems with respect to square-loss and norm\nbound B, a sample complexity of \u2126(B) is required for gradient-based methods [33]. Our work\nshows that eigenvalue decay can dramatically reduce this dependence, even in the context of kernel\nregression where we want to run in time polynomial in n, the dimension, rather than the (much\nlarger) dimension of the RKHS.\n\nRecent work on Learning Neural Networks. Due in part to the recent exciting developments in\ndeep learning, there have been several works giving provable results for learning neural networks\nwith various activations (threshold, sigmoid, or ReLU). For the most part, these results take various\nassumptions on either 1) the distribution (e.g., Gaussian or Log-Concave) or 2) the structure of the\nnetwork architecture (e.g. sparse, random, or non-overlapping weight vectors) or both and often have\na bad dependence on one or more of the relevant parameters (dimension, number of hidden units,\ndepth, or accuracy). Another way to restrict the problem is to work only in the noiseless/realizable\nsetting. Works that fall into one or more of these categories include [20, 44, 40, 17, 31, 41, 11].\nKernel methods have been applied previously to learning neural networks [43, 26, 16, 12]. 
The\ncurrent broadest class of networks known to be learnable in fully polynomial-time in all parameters\nwith no assumptions is due to Goel et al. [16], who showed how to learn a sum of one hidden layer of\nsigmoids over the domain of S^{n\u22121}, the unit sphere in n dimensions. We are not aware of other prior\nwork that takes only a distributional assumption on the marginal and achieves fully polynomial-time\nalgorithms for even simple networks (for example, one hidden layer of ReLUs).\nMuch work has also focused on the ability of gradient descent to succeed in parameter estimation\nfor learning neural networks under various assumptions with an intense focus on the structure of\nlocal versus global minima [8, 18, 7, 37]. Here we are interested in the traditional task of learning in\nthe non-realizable or agnostic setting and allow ourselves to output a hypothesis outside the function\nclass (i.e., we allow improper learning). It is well known that for even simple neural networks, for\nexample for learning a sigmoid with respect to square-loss, there may be many bad local minima\n[1]. Improper learning allows us to avoid these pitfalls.\n\n2 Preliminaries\nNotation. The input space is denoted by X and the output space is denoted by Y. Vectors are\nrepresented with boldface letters such as x. We denote a kernel function by k_\u03c8(x, x\u2032) = \u27e8\u03c8(x), \u03c8(x\u2032)\u27e9\nwhere \u03c8 is the associated feature map for the kernel and K_\u03c8 is the corresponding reproducing\nkernel Hilbert space (RKHS). For necessary background material on kernel methods we refer the\nreader to [30].\n\nSelection and Compression Schemes. It is well known that in the context of PAC learning Boolean\nfunction classes, a suitable type of compression of the training data implies learnability [25]. 
Perhaps\nsurprisingly, the details regarding the relationship between compression and certain other real-valued\nlearning tasks have not been worked out until very recently. A convenient framework for us will be\nthe notion of compression and selection schemes due to David et al. [13].\nA selection scheme is a pair of maps (\u03ba, \u03c1) where \u03ba is the selection map and \u03c1 is the reconstruction\nmap. \u03ba takes as input a sample S = ((x_1, y_1), . . . , (x_m, y_m)) and outputs a sub-sample S\u2032 and a\nfinite binary string b as side information. \u03c1 takes the pair (S\u2032, b) as input and outputs a hypothesis h. The size of\nthe selection scheme is defined to be k(m) = |S\u2032| + |b|. We present a slightly modified version of\nthe definition of an approximate compression scheme due to [13]:\nDefinition 3 ((\u03b5, \u03b4)-approximate agnostic compression scheme). A selection scheme (\u03ba, \u03c1) is an\n(\u03b5, \u03b4)-approximate agnostic compression scheme for hypothesis class H and sample satisfying\nproperty P if for all samples S that satisfy P, with probability 1 \u2212 \u03b4, f = \u03c1(\u03ba(S)) satisfies\n\n\u2211_{i=1}^m l(f(x_i), y_i) \u2264 min_{h\u2208H} (\u2211_{i=1}^m l(h(x_i), y_i)) + \u03b5.\n\nCompression has connections to learning in the general loss setting through the following theorem\nwhich shows that as long as k(m) is small, the selection scheme generalizes.\nTheorem 4 (Theorem 30.2 [32], Theorem 3.2 [13]). Let (\u03ba, \u03c1) be a selection scheme of size k =\nk(m), and let A_S = \u03c1(\u03ba(S)). Given m i.i.d. 
samples drawn from any distribution D such that\nk \u2264 m/2, for constant bounded loss function l : Y\u2032 \u00d7 Y \u2192 R_+, with probability 1 \u2212 \u03b4, we have\n\n|E_{(x,y)\u223cD}[l(A_S(x), y)] \u2212 (1/m) \u2211_{i=1}^m l(A_S(x_i), y_i)| \u2264 \u221a(\u03b5 \u00b7 (1/m) \u2211_{i=1}^m l(A_S(x_i), y_i)) + \u03b5\n\nwhere \u03b5 = 50 \u00b7 (k log(m/k) + log(1/\u03b4))/m.\n\n3 Problem Overview\nIn this section we give a general outline for our main result. Let S = {(x_1, y_1), . . . , (x_m, y_m)} be a\ntraining set of samples drawn i.i.d. from some arbitrary distribution D on X \u00d7 [0, 1] where X \u2286 R^n.\nLet us consider a concept class C such that for all c \u2208 C and x \u2208 X we have c(x) \u2208 [0, 1]. We\nwish to learn the concept class C with respect to the square loss, that is, we wish to find c \u2208 C that\napproximately minimizes E_{(x,y)\u223cD}[(c(x) \u2212 y)^2]. A common way of solving this is by solving the\nempirical risk minimization (ERM) problem given below and subsequently proving that it generalizes.\n\nOptimization Problem 1\n\nminimize_{c\u2208C} (1/m) \u2211_{i=1}^m (c(x_i) \u2212 y_i)^2\n\nUnfortunately, it may not be possible to solve the ERM in polynomial time due to issues\nsuch as non-convexity. A way of tackling this is to show that the concept class can be approximated\nby another hypothesis class of linear functions in a high dimensional feature space (this\nin turn presents new obstacles for proving generalization-error bounds, which is the focus of this\npaper).\nDefinition 5 (\u03b5-approximation). Let C_1 and C_2 be function classes mapping domain X to R. 
C_1 is \u03b5-approximated\nby C_2 if for every c \u2208 C_1 there exists c\u2032 \u2208 C_2 such that for all x \u2208 X, |c(x) \u2212 c\u2032(x)| \u2264 \u03b5.\nSuppose C can be \u03b5-approximated in the above sense by the hypothesis class H_\u03c8 = {x \u2192\n\u27e8v, \u03c8(x)\u27e9 | v \u2208 K_\u03c8, \u27e8v, v\u27e9 \u2264 B} for some B and kernel function k_\u03c8. We further assume that the\nkernel is bounded, that is, |k_\u03c8(x, x\u2032)| \u2264 M for some M > 0 for all x, x\u2032 \u2208 X. Thus, the problem\nrelaxes to the following,\n\nOptimization Problem 2\n\nminimize_{v\u2208K_\u03c8} (1/m) \u2211_{i=1}^m (\u27e8v, \u03c8(x_i)\u27e9 \u2212 y_i)^2\nsubject to \u27e8v, v\u27e9 \u2264 B\n\nUsing the Representer theorem, we have that the optimum solution for the above is of the form\nv^\u2217 = \u2211_{i=1}^m \u03b1_i \u03c8(x_i) for some \u03b1 \u2208 R^m. Denoting the sample kernel matrix by K, with K_{i,j} =\nk_\u03c8(x_i, x_j), the above optimization problem is equivalent to the following optimization problem,\n\nOptimization Problem 3\n\nminimize_{\u03b1\u2208R^m} (1/m) ||K\u03b1 \u2212 Y||_2^2\nsubject to \u03b1^T K\u03b1 \u2264 B\n\nwhere Y is the vector corresponding to all y_i and ||Y||_\u221e \u2264 1 since \u2200i \u2208 [m], y_i \u2208 [0, 1]. Let \u03b1_B\nbe the optimal solution of the above problem. This is known to be efficiently solvable in poly(m, n)\ntime as long as the kernel function is efficiently computable.\nApplying Rademacher complexity bounds to H_\u03c8 yields generalization error bounds that decrease,\nroughly, on the order of B/\u221am (cf. Supplemental 1.1). If B is exponential in 1/\u03b5, the accuracy\nparameter, or in n, the dimension, as in the case of bounded depth networks of ReLUs, then this\ndependence leads to exponential sample complexity. 
As mentioned in Section 1, in the context of\neigenvalue decay, various results [42, 4, 5] have been obtained to improve the dependence from B/\u221am\nto B/m, but little is known about improving the dependence on B.\nOur goal is to show that eigenvalue decay of the empirical Gram matrix does yield generalization\nbounds with better dependence on B. The key is to develop a novel compression scheme for\nkernelized ridge regression. We give a step-by-step analysis for how to generate an approximate,\ncompressed version of the solution to Optimization Problem 3. Then, we will carefully analyze the\nbit complexity of our approximate solution and realize our compression scheme. Finally, we can put\neverything together and show how quantitative bounds on eigenvalue decay directly translate into\ncompression schemes with low generalization error.\n\n4 Compressing the Kernel Solution\n\nThrough a sequence of steps, we will sparsify \u03b1 to find a solution of much smaller bit complexity\nthat is still an approximate solution (to within a small additive error). The quality and size of the\napproximation will depend on the eigenvalue decay.\n\nLagrangian Relaxation. We relax Optimization Problem 3 and consider the Lagrangian version of\nthe problem to account for the norm bound constraint. This version is convenient for us, as it has a\nnice closed-form solution.\nOptimization Problem 4\n\nminimize_{\u03b1\u2208R^m} (1/m) ||K\u03b1 \u2212 Y||_2^2 + \u03bb\u03b1^T K\u03b1\n\nWe will later set \u03bb such that the error of considering this relaxation is small. It is easy to see that the\noptimal solution for the above Lagrangian version is \u03b1 = (K + \u03bbmI)^{\u22121} Y.\n\nPreconditioning. To avoid extremely small or zero eigenvalues, we consider a perturbed version\nof K, K_\u03b3 = K + \u03b3mI. This gives us that the eigenvalues of K_\u03b3 are always greater than\nor equal to \u03b3m. 
This property is useful for us in our later analysis. Henceforth, we consider the\nfollowing optimization problem on the perturbed version of K:\nOptimization Problem 5\n\nminimize_{\u03b1\u2208R^m} (1/m) ||K_\u03b3 \u03b1 \u2212 Y||_2^2 + \u03bb\u03b1^T K_\u03b3 \u03b1\n\nThe optimal solution for the perturbed version is \u03b1_\u03b3 = (K_\u03b3 + \u03bbmI)^{\u22121} Y = (K + (\u03bb + \u03b3)mI)^{\u22121} Y.\n\nSparsifying the Solution via Nystr\u00f6m Sampling. We will now use tools from Nystr\u00f6m Sampling\nto sparsify the solution obtained from Optimization Problem 5. To do so, we first recall the definition\nof effective dimension or degrees of freedom for the kernel [42]:\nDefinition 6 (\u03b7-effective dimension). For a positive semidefinite m \u00d7 m matrix K and parameter\n\u03b7, the \u03b7-effective dimension of K is defined as d_\u03b7(K) = tr(K(K + \u03b7mI)^{\u22121}).\nVarious kernel approximation results have relied on this quantity, and here we state a recent result\ndue to [28] who gave the first application-independent result that shows that there is an efficient way\nof computing a set of columns of K such that K\u0304, a matrix constructed from the columns, is close in\nterms of 2-norm to the matrix K. More formally,\nTheorem 7 ([28]). For kernel matrix K, there exists an algorithm that gives a set of\nO(d_\u03b7(K) log(d_\u03b7(K)/\u03b4)) columns, such that K\u0304 = KS(S^T KS)^\u2020 S^T K, where S is the matrix that\nselects the specific columns, satisfies with probability 1 \u2212 \u03b4, K\u0304 \u2aaf K \u2aaf K\u0304 + \u03b7mI.\nIt can be shown that K\u0304 is positive semi-definite. Also, the above implies ||K \u2212 K\u0304||_2 \u2264 \u03b7m. We use\nthe decay to approximate the kernel matrix with a low-rank matrix constructed using the columns\nof K. 
Let K\u0304_\u03b3 be the matrix obtained by applying Theorem 7 to K_\u03b3 for \u03b7 > 0 and consider the\nfollowing optimization problem,\nOptimization Problem 6\n\nminimize_{\u03b1\u2208R^m} (1/m) ||K\u0304_\u03b3 \u03b1 \u2212 Y||_2^2 + \u03bb\u03b1^T K\u0304_\u03b3 \u03b1\n\nThe optimal solution for the above is \u03b1\u0304_\u03b3 = (K\u0304_\u03b3 + \u03bbmI)^{\u22121} Y. Since K\u0304_\u03b3 = K_\u03b3 S(S^T K_\u03b3 S)^\u2020 S^T K_\u03b3,\nsolving for the above enables us to get a solution \u03b1^\u2217 = S(S^T K_\u03b3 S)^\u2020 S^T K_\u03b3 \u03b1\u0304_\u03b3, which is a k-sparse\nvector for k = O(d_\u03b7(K_\u03b3) log(d_\u03b7(K_\u03b3)/\u03b4)).\n\nBounding the Error of the Sparse Solution. We bound the additional error incurred by our sparse\nhypothesis \u03b1^\u2217 compared to \u03b1_B. To do so, we bound the error for each of the approximations:\nsparsification, preconditioning and Lagrangian relaxation and then combine them to give the following\ntheorem.\nTheorem 8 (Total Error). For \u03bb = \u03b5^2/(81B), \u03b7 \u2264 \u03b5^3/(729B) and \u03b3 \u2264 \u03b5^3/(729B), we have\n(1/m) ||K_\u03b3 \u03b1^\u2217 \u2212 Y||_2^2 \u2264 (1/m) ||K\u03b1_B \u2212 Y||_2^2 + \u03b5.\n\nComputing the Sparsity of the Solution. To compute the sparsity of the solution, we need to bound\nd_\u03b7(K_\u03b3). We consider the following different eigenvalue decays.\nDefinition 9 (Eigenvalue Decay). Let the real eigenvalues of a symmetric m \u00d7 m matrix A be\ndenoted by \u03bb_1 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03bb_m.\n1. A is said to have (C, p)-polynomial eigenvalue decay if for all i \u2208 {1, . . . , m}, \u03bb_i \u2264 Ci^{\u2212p}.\n2. A is said to have C-exponential eigenvalue decay if for all i \u2208 {1, . . . , m}, \u03bb_i \u2264 Ce^{\u2212i}.\nNote that in the above definitions C and p are not necessarily constants. 
We allow C and p to\ndepend on other parameters (the choice of these parameters will be made explicit in subsequent\ntheorem statements). We can now bound the effective dimension in terms of eigenvalue decay:\nTheorem 10 (Bounding effective dimension). For \u03b3m \u2264 \u03b7,\n1. If K/m has (C, p)-polynomial eigenvalue decay for p > 1 then d_\u03b7(K_\u03b3) \u2264 (C/((p\u22121)\u03b7))^{1/p} + 2.\n2. If K/m has C-exponential eigenvalue decay then d_\u03b7(K_\u03b3) \u2264 log(C/((e\u22121)\u03b7)) + 2.\n\n5 Compression Scheme\n\nThe above analysis gives us a sparse solution for the problem and, in turn, an \u03b5-approximation for\nthe error on the overall sample S with probability 1 \u2212 \u03b4. We can now fully define our compression\nscheme for the hypothesis class H_\u03c8 with respect to samples satisfying the eigenvalue decay property.\nSelection Scheme \u03ba: Given input S = (x_i, y_i)_{i=1}^m,\n1. Use RLS-Nystr\u00f6m Sampling [28] to compute K\u0304_\u03b3 = K_\u03b3 S(S^T K_\u03b3 S)^\u2020 S^T K_\u03b3 for \u03b7 = \u03b5^3/(5832B) and\n\u03b3 = \u03b5^3/(5832Bm). Let I be the sub-sample corresponding to the columns selected using S.\n2. Solve Optimization Problem 6 for \u03bb = \u03b5^2/(324B) to get \u03b1\u0304_\u03b3.\n3. Compute the |I|-sparse vector \u03b1^\u2217 = S(S^T K_\u03b3 S)^\u2020 S^T K_\u03b3 \u03b1\u0304_\u03b3 = K_\u03b3^{\u22121} K\u0304_\u03b3 \u03b1\u0304_\u03b3 (K_\u03b3 is invertible as all\neigenvalues are non-zero).\n4. Output subsample I along with \u03b1\u0303^\u2217 which is \u03b1^\u2217 truncated to precision \u03b5/(4M|I|) per non-zero index.\nReconstruction Scheme \u03c1: Given input subsample I and \u03b1\u0303^\u2217, output hypothesis\nh_S(x) = clip_{0,1}(w^T \u03b1\u0303^\u2217) where w is a vector with entries K(x_i, x) + \u03b3m \u00b7 1[x = x_i] for\ni \u2208 I and 0 otherwise where \u03b3 = \u03b5^3/(5832Bm). 
Note, clip_{a,b}(x) = max(a, min(b, x)) for some a < b.\n\nThe following theorem shows that the above is a compression scheme for H_\u03c8.\nTheorem 11. (\u03ba, \u03c1) is an (\u03b5, \u03b4)-approximate agnostic compression scheme for the hypothesis class\nH_\u03c8 for sample S of size k(m, \u03b5, \u03b4, B, M) = O(d log(d/\u03b4) log(\u221am \u00b7 BMd log(d/\u03b4)/\u03b5^4)) where d is the\n\u03b7-effective dimension of K_\u03b3 for \u03b7 = \u03b5^3/(5832B) and \u03b3 = \u03b5^3/(5832Bm).\n\n6 Putting It All Together: From Compression to Learning\n\nWe now present our final algorithm: Compressed Kernel Regression (Algorithm 1). Note that the\nalgorithm is efficient and takes at most O(m^3) time.\nFor our learnability result, we restrict distributions to those that satisfy eigenvalue decay.\nDefinition 12 (Distribution Satisfying Eigenvalue Decay). Consider distribution D over X and\nkernel function k_\u03c8. Let S be a sample drawn i.i.d. from the distribution D and K be the empirical\nGram matrix corresponding to kernel function k_\u03c8 on S.\n1. D is said to satisfy (C, p, N)-polynomial eigenvalue decay if with probability 1 \u2212 \u03b4 over the\ndrawn sample of size m \u2265 N, K/m satisfies (C, p)-polynomial eigenvalue decay.\n2. D is said to satisfy (C, N)-exponential eigenvalue decay if with probability 1 \u2212 \u03b4 over the drawn\nsample of size m \u2265 N, K/m satisfies C-exponential eigenvalue decay.\n\nAlgorithm 1 Compressed Kernel Regression\nInput: Samples S = (x_i, y_i)_{i=1}^m, Gram matrix K on S, constants \u03b5, \u03b4 > 0, norm bound B and\nmaximum kernel function value M on X.\n1: Using RLS-Nystr\u00f6m Sampling [28] with input (K_\u03b3, \u03b7m) for \u03b3 = \u03b5^3/(5832Bm) and \u03b7 = \u03b5^3/(5832B),\ncompute K\u0304_\u03b3 = K_\u03b3 S(S^T K_\u03b3 S)^\u2020 S^T K_\u03b3. Let I be the subsample corresponding to the columns\nselected using S. 
Note that the number of columns selected depends on the \u03b7-effective dimension\nof K_\u03b3.\n2: Solve Optimization Problem 6 for \u03bb = \u03b5^2/(324B) to get \u03b1\u0304_\u03b3 over S\n3: Compute \u03b1^\u2217 = S(S^T K_\u03b3 S)^\u2020 S^T K_\u03b3 \u03b1\u0304_\u03b3 = K_\u03b3^{\u22121} K\u0304_\u03b3 \u03b1\u0304_\u03b3\n4: Compute \u03b1\u0303^\u2217 by truncating each entry of \u03b1^\u2217 up to precision \u03b5/(4M|I|)\nOutput: h_S such that for all x \u2208 X, h_S(x) = clip_{0,1}(w^T \u03b1\u0303^\u2217) where w is a vector with entries\nK(x_i, x) + \u03b3m \u00b7 1[x = x_i] for i \u2208 I and 0 otherwise.\n\nOur main theorem proves generalization of the hypothesis output by Algorithm 1 for distributions\nsatisfying eigenvalue decay in the above sense.\nTheorem 13 (Formal for Theorem 1). Fix function class C with output bounded in [0, 1] and\nM-bounded kernel function k_\u03c8 such that C is \u03b5_0-approximated by H_\u03c8 = {x \u2192 \u27e8v, \u03c8(x)\u27e9 | v \u2208\nK_\u03c8, \u27e8v, v\u27e9 \u2264 B} for some \u03c8, B. Consider a sample S = {(x_i, y_i)}_{i=1}^m drawn i.i.d. from D on\nX \u00d7 [0, 1]. There exists an algorithm A that outputs hypothesis h_S = A(S), such that,\n1. If D_X satisfies (C, p, m)-polynomial eigenvalue decay with probability 1 \u2212 \u03b4/4 then with probability\n1 \u2212 \u03b4 for m = \u00d5((CB)^{1/p} log(M) log(1/\u03b4)/\u03b5^{2+3/p}),\n\nE_{(x,y)\u223cD}(h_S(x) \u2212 y)^2 \u2264 min_{c\u2208C} (E_{(x,y)\u223cD}(c(x) \u2212 y)^2) + 2\u03b5_0 + \u03b5\n\n2. 
If D_X satisfies (C, m)-exponential eigenvalue decay with probability 1 \u2212 \u03b4/4 then with probability\n1 \u2212 \u03b4 for m = \u00d5(log(CB) log(M) log(1/\u03b4)/\u03b5^2),\n\nE_{(x,y)\u223cD}(h_S(x) \u2212 y)^2 \u2264 min_{c\u2208C} (E_{(x,y)\u223cD}(c(x) \u2212 y)^2) + 2\u03b5_0 + \u03b5\n\nAlgorithm A runs in time poly(m, n).\nRemark: The above theorem can be extended to different rates of eigenvalue decay. For example,\nfor finite rank r the obtained bound is independent of B but dependent instead on r. Also, as in the\nproof of Theorem 10, it suffices for the eigenvalue decay to hold only after sufficiently large i.\n\n7 Learning Neural Networks\n\nHere we apply our main theorem to the problem of learning neural networks. For technical\ndefinitions of neural networks, we refer the reader to [43].\nDefinition 14 (Class of Neural Networks [16]). Let N[\u03c3, D, W, T] be the class of fully-connected,\nfeed-forward networks with D hidden layers, activation function \u03c3 and quantities W and T\ndescribed as follows:\n1. Weight vectors in layer 0 have 2-norm bounded by T.\n2. Weight vectors in layers 1, . . . , D have 1-norm bounded by W.\n3. For each hidden unit \u03c3(w \u00b7 z) in the network, we have |w \u00b7 z| \u2264 T (by z we denote the input feeding\ninto unit \u03c3 from the previous layer).\n\nWe consider activation functions \u03c3_relu(x) = max(0, x) and \u03c3_sig(x) = 1/(1 + e^{\u2212x}), though other activation\nfunctions fit within our framework. Goel et al. [16] showed that the class of ReLUs/Sigmoids along\nwith their compositions can be approximated by linear functions in a high dimensional Hilbert space\n(corresponding to a particular type of polynomial kernel). 
As mentioned earlier, the sample\ncomplexity of prior work depends linearly on B, which, for even a single ReLU, is exponential in 1/\u03b5.\nAssuming sufficiently strong eigenvalue decay, we can show that we can obtain fully polynomial\ntime algorithms for the above classes.\nTheorem 15. 
For \u03b5, \u03b4 > 0, consider D on S^{n\u22121} \u00d7 [0, 1] such that,\n1. For C_relu = N[\u03c3_relu, 0, \u00b7, 1], D_X satisfies (C, p, m)-polynomial eigenvalue decay for p \u2265 \u03be/\u03b5,\n2. For C_{relu-D} = N[\u03c3_relu, D, W, T], D_X satisfies (C, p, m)-polynomial eigenvalue decay for\np \u2265 (\u03be W^D D T/\u03b5)^D,\n3. For C_{sig-D} = N[\u03c3_sig, D, W, T], D_X satisfies (C, p, m)-polynomial eigenvalue decay for p \u2265\n(\u03be T log(W^D D/\u03b5))^D,\nwhere D_X is the marginal distribution on X = S^{n\u22121}, \u03be > 0 is some sufficiently large constant and\nC \u2264 (n \u00b7 1/\u03b5 \u00b7 log(1/\u03b4))^{\u03b6p} for some constant \u03b6 > 0. The value of m is obtained from Theorem 13\nas m = O\u0303(C^{1/p}/\u03b5^{2+3/p}).\nEach decay assumption above implies an algorithm for agnostically learning the corresponding\nclass on S^{n\u22121} \u00d7 [0, 1] with respect to the square loss in time poly(n, 1/\u03b5, log(1/\u03b4)).\n\nNote that assuming an exponential eigenvalue decay (stronger than polynomial) will result in efficient\nlearnability for much broader classes of networks.\nSince it is not known how to agnostically learn even a single ReLU with respect to arbitrary distributions\non S^{n\u22121} in polynomial-time^4, much less a network of ReLUs, we state the following corollary\nhighlighting the decay we require to obtain efficient learnability for simple networks:\nCorollary 16 (Restating Corollary 2). Let C be the class of all fully-connected networks of ReLUs\nwith one-hidden layer of size \u2113 feeding into a final output ReLU activation where the 2-norms of\nall weight vectors are bounded by 1. Then, (suppressing the parameter m for simplicity), assuming\n(C, i^{\u2212\u2113/\u03b5})-polynomial eigenvalue decay for C = poly(n, 1/\u03b5, \u2113), C is learnable in polynomial time\nwith respect to square loss on S^{n\u22121}. 
If ReLU is replaced with sigmoid, then we require eigenvalue\ndecay of i^{\u2212\u221a\u2113 log(\u221a\u2113/\u03b5)}.\n\n8 Conclusions and Future Work\n\nWe have proposed the first set of distributional assumptions that guarantee fully polynomial-time\nalgorithms for learning expressive classes of neural networks (without restricting the structure of\nthe network). The key abstraction was that of a compression scheme for kernel approximations,\nspecifically Nystr\u00f6m sampling. We proved that eigenvalue decay of the Gram matrix reduces the\ndependence on the norm B in the kernel regression problem.\nPrior distributional assumptions, such as the underlying marginal equaling a Gaussian, neither lead\nto fully polynomial-time algorithms nor are representative of real-world data sets^5. Eigenvalue decay,\non the other hand, has been observed in practice and does lead to provably efficient algorithms\nfor learning neural networks.\nA natural criticism of our assumption is that the rate of eigenvalue decay we require is too strong.\nIn some cases, especially for large depth networks with many hidden units, this may be true^6. Note,\nhowever, that our results show that even moderate eigenvalue decay will lead to improved algorithms.\nFurther, it is quite possible our assumptions can be relaxed. An obvious question for future\nwork is what is the minimal rate of eigenvalue decay needed for efficient learnability? Another direction\nwould be to understand how these eigenvalue decay assumptions relate to other distributional\nassumptions.\n\n^4 Goel et al. [16] show that agnostically learning a single ReLU over {\u22121, 1}^n is as hard as learning sparse\nparities with noise. 
This reduction can be extended to the case of distributions over S^{n\u22121} [3].\n\n^5 Despite these limitations, we still think uniform or Gaussian assumptions are worthwhile and have provided\nhighly nontrivial learning results.\n\n^6 It is useful to keep in mind that agnostically learning even a single ReLU with respect to all distributions\nseems computationally intractable, and that our required eigenvalue decay in this case is only a function of the\naccuracy parameter \u03b5.\n\nAcknowledgements. We would like to thank Misha Belkin and Nikhil Srivastava for very helpful\nconversations regarding kernel ridge regression and eigenvalue decay. We also thank Daniel Hsu,\nKarthik Sridharan, and Justin Thaler for useful feedback. The analogy between eigenvalue decay\nand power-law graphs is due to Raghu Meka.\n\nReferences\n[1] Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local minima for\nsingle neurons. In Advances in Neural Information Processing Systems, volume 8, pages 316\u2013322.\nThe MIT Press, 1996.\n\n[2] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polynomial\nkernel. In Advances in Neural Information Processing Systems, pages 2258\u20132266, 2014.\n\n[3] Peter Bartlett, Daniel Kane, and Adam Klivans. Personal communication.\n\n[4] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities.\n33(4), August 16 2005.\n\n[5] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds\nand structural results. Journal of Machine Learning Research, 3:463\u2013482, 2002.\n\n[6] Pawel Brach, Marek Cygan, Jakub Lacki, and Piotr Sankowski. Algorithmic complexity of\npower law networks. CoRR, abs/1507.02426, 2015.\n\n[7] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with\ngaussian inputs. 
CoRR, abs/1702.07966, 2017.\n\n[8] Anna Choromanska, Mikael Henaff, Micha\u00ebl Mathieu, G\u00e9rard Ben Arous, and Yann LeCun.\nThe loss surfaces of multilayer networks. In AISTATS, volume 38 of JMLR Workshop and\nConference Proceedings. JMLR.org, 2015.\n\n[9] Andrew Cotter, Shai Shalev-Shwartz, and Nati Srebro. Learning optimally sparse support\nvector machines. In Proceedings of the 30th International Conference on Machine Learning\n(ICML-13), pages 266\u2013274, 2013.\n\n[10] Amit Daniely. Complexity theoretic limitations on learning halfspaces. In STOC, pages 105\u2013117.\nACM, 2016.\n\n[11] Amit Daniely. SGD learns the conjugate kernel class of the network. CoRR, abs/1702.08503,\n2017.\n\n[12] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks:\nThe power of initialization and a dual view on expressivity. In NIPS, pages 2253\u20132261,\n2016.\n\n[13] Ofir David, Shay Moran, and Amir Yehudayoff. On statistical learning via the lens of compression.\narXiv preprint arXiv:1610.03592, 2016.\n\n[14] Petros Drineas and Michael W Mahoney. On the nystr\u00f6m method for approximating a\ngram matrix for improved kernel-based learning. Journal of Machine Learning Research,\n6(Dec):2153\u20132175, 2005.\n\n[15] Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Relative-error cur matrix decompositions.\nSIAM Journal on Matrix Analysis and Applications, 30(2):844\u2013881, 2008.\n\n[16] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the relu in\npolynomial time. arXiv preprint arXiv:1611.10258, 2016.\n\n[17] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity:\nGuaranteed training of neural networks using tensor methods. arXiv preprint\narXiv:1506.08473, 2015.\n\n[18] Kenji Kawaguchi. Deep learning without poor local minima. In NIPS, pages 586\u2013594, 2016.\n\n[19] Adam R. Klivans and Pravesh Kothari. 
Embedding hard learning problems into gaussian space.\nIn APPROX-RANDOM, volume 28 of LIPIcs, pages 793\u2013809. Schloss Dagstuhl - Leibniz-Zentrum\nfuer Informatik, 2014.\n\n[20] Adam R. Klivans and Raghu Meka. Moment-matching polynomials. Electronic Colloquium\non Computational Complexity (ECCC), 20:8, 2013.\n\n[21] Adam R. Klivans and Alexander A. Sherstov. Cryptographic hardness for learning intersections\nof halfspaces. J. Comput. Syst. Sci, 75(1):2\u201312, 2009.\n\n[22] Anton Krohmer. Finding Cliques in Scale-Free Networks. Master\u2019s thesis, Saarland University,\nGermany, 2012.\n\n[23] Dima Kuzmin and Manfred K. Warmuth. Unlabeled compression schemes for maximum\nclasses. Journal of Machine Learning Research, 8:2047\u20132081, 2007.\n\n[24] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. Technical\nreport, University of California, Santa Cruz, 1986.\n\n[25] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. Technical\nreport, 1986.\n\n[26] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training\nneural networks. In Advances in Neural Information Processing Systems, pages 855\u2013863,\n2014.\n\n[27] Siyuan Ma and Mikhail Belkin. Diving into the shallows: a computational perspective on\nlarge-scale shallow learning. CoRR, abs/1703.10622, 2017.\n\n[28] Cameron Musco and Christopher Musco. Recursive sampling for the nystr\u00f6m method. arXiv\npreprint arXiv:1605.07583, 2016.\n\n[29] B. Sch\u00f6lkopf, J. Shawe-Taylor, AJ. Smola, and RC. Williamson. Generalization bounds via\neigenvalues of the gram matrix. Technical Report 99-035, NeuroCOLT, 1999.\n\n[30] Bernhard Sch\u00f6lkopf and Alexander J Smola. Learning with kernels: support vector machines,\nregularization, optimization, and beyond. MIT press, 2002.\n\n[31] Hanie Sedghi and Anima Anandkumar. 
Provable methods for training neural networks with\nsparse connectivity. arXiv preprint arXiv:1412.2693, 2014.\n\n[32] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to\nalgorithms. Cambridge university press, 2014.\n\n[33] Ohad Shamir. The sample complexity of learning linear predictors with the squared loss.\nJournal of Machine Learning Research, 16:3475\u20133486, 2015.\n\n[34] Ohad Shamir. Distribution-specific hardness of learning neural networks. arXiv preprint\narXiv:1609.01037, 2016.\n\n[35] John Shawe-Taylor, Christopher KI Williams, Nello Cristianini, and Jaz Kandola. On the\neigenspectrum of the gram matrix and the generalization error of kernel-pca. IEEE Transactions\non Information Theory, 51(7):2510\u20132522, 2005.\n\n[36] Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural\nnetworks. arXiv preprint arXiv:1707.04615, 2017.\n\n[37] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees\nfor multilayer neural networks. CoRR, abs/1605.08361, 2016.\n\n[38] Ameet Talwalkar and Afshin Rostamizadeh. Matrix coherence and the nystrom method. CoRR,\nabs/1408.2044, 2014.\n\n[39] Christopher KI Williams and Matthias Seeger. Using the nystr\u00f6m method to speed up kernel\nmachines. In Proceedings of the 13th International Conference on Neural Information\nProcessing Systems, pages 661\u2013667. MIT press, 2000.\n\n[40] Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks.\nCoRR, abs/1611.03131, 2016.\n\n[41] Qiuyi Zhang, Rina Panigrahy, and Sushant Sachdeva. Electron-proton dynamics in deep learning.\nCoRR, abs/1702.00458, 2017.\n\n[42] Tong Zhang. Effective dimension and generalization of kernel learning. In Advances in Neural\nInformation Processing Systems, pages 471\u2013478, 2003.\n\n[43] Yuchen Zhang, Jason D Lee, and Michael I Jordan. 
l1-regularized neural networks are improperly\nlearnable in polynomial time. In International Conference on Machine Learning, pages\n993\u20131001, 2016.\n\n[44] Yuchen Zhang, Jason D. Lee, Martin J. Wainwright, and Michael I. Jordan. Learning halfspaces\nand neural networks with random initialization. CoRR, abs/1511.07948, 2015.", "award": [], "sourceid": 1314, "authors": [{"given_name": "Surbhi", "family_name": "Goel", "institution": "University of Texas at Austin"}, {"given_name": "Adam", "family_name": "Klivans", "institution": "UT Austin"}]}