{"title": "How Many Samples are Needed to Estimate a Convolutional Neural Network?", "book": "Advances in Neural Information Processing Systems", "page_first": 373, "page_last": 383, "abstract": "A widespread folklore for explaining the success of Convolutional Neural Networks (CNNs) is that CNNs use a more compact representation than the Fully-connected Neural Network (FNN) and thus require fewer training samples to accurately estimate their parameters. We initiate the study of rigorously characterizing the sample complexity of estimating CNNs. We show that for an $m$-dimensional convolutional filter with linear activation acting on a $d$-dimensional input, the sample complexity of achieving population prediction error of $\\epsilon$ is $\\widetilde{O}(m/\\epsilon^2)$, whereas the sample-complexity for its FNN counterpart is lower bounded by $\\Omega(d/\\epsilon^2)$ samples. Since, in typical settings $m \\ll d$, this result demonstrates the advantage of using a CNN. We further consider the sample complexity of estimating a one-hidden-layer CNN with linear activation where both the $m$-dimensional convolutional filter and the $r$-dimensional output weights are unknown. For this model, we show that the sample complexity is $\\widetilde{O}\\left((m+r)/\\epsilon^2\\right)$ when the ratio between the stride size and the filter size is a constant. For both models, we also present lower bounds showing our sample complexities are tight up to logarithmic factors. Our main tools for deriving these results are a localized empirical process analysis and a new lemma characterizing the convolutional structure. We believe that these tools may inspire further developments in understanding CNNs.", "full_text": "How Many Samples are Needed to Estimate a Convolutional Neural Network?\n\nSimon S. 
Du*\n\nCarnegie Mellon University\n\nYining Wang*\n\nCarnegie Mellon University\n\nXiyu Zhai\n\nMassachusetts Institute of Technology\n\nSivaraman Balakrishnan\n\nCarnegie Mellon University\n\nRuslan Salakhutdinov\n\nCarnegie Mellon University\n\nAarti Singh\n\nCarnegie Mellon University\n\nAbstract\n\nA widespread folklore for explaining the success of Convolutional Neural Networks (CNNs) is that CNNs use a more compact representation than the Fully-connected Neural Network (FNN) and thus require fewer training samples to accurately estimate their parameters. We initiate the study of rigorously characterizing the sample complexity of estimating CNNs. We show that for an $m$-dimensional convolutional filter with linear activation acting on a $d$-dimensional input, the sample complexity of achieving population prediction error of $\\epsilon$ is $\\widetilde{O}(m/\\epsilon^2)$ (see footnote 2), whereas the sample complexity for its FNN counterpart is lower bounded by $\\Omega(d/\\epsilon^2)$ samples. Since, in typical settings, $m \\ll d$, this result demonstrates the advantage of using a CNN. We further consider the sample complexity of estimating a one-hidden-layer CNN with linear activation where both the $m$-dimensional convolutional filter and the $r$-dimensional output weights are unknown. For this model, we show that the sample complexity is $\\widetilde{O}\\left((m+r)/\\epsilon^2\\right)$ when the ratio between the stride size and the filter size is a constant. For both models, we also present lower bounds showing our sample complexities are tight up to logarithmic factors. Our main tools for deriving these results are a localized empirical process analysis and a new lemma characterizing the convolutional structure. 
We believe that these tools may inspire further developments in understanding CNNs.\n\n*Equal contribution.\n\nFootnote 2: We use the standard big-O notation in this paper and use $\\widetilde{O}(\\cdot)$ when we ignore poly-logarithmic factors.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1 Introduction\n\nConvolutional Neural Networks (CNNs) have achieved remarkable impact in many machine learning applications, including computer vision (Krizhevsky et al., 2012), natural language processing (Yu et al., 2018) and reinforcement learning (Silver et al., 2016). The key building block of these improvements is the use of convolutional (weight sharing) layers to replace traditional fully connected layers, dating back to LeCun et al. (1995). A common folklore for explaining the success of CNNs is that they are a more compact representation than Fully-connected Neural Networks (FNNs) and thus require fewer samples to estimate. However, to our knowledge, there is no rigorous characterization of the sample complexity of learning a CNN.\n\nThe main difficulty lies in the convolution structure. Consider the simplest CNN, a single convolutional filter with linear activation followed by average pooling (see Figure 1a), which represents a function $F_1 : \\mathbb{R}^d \\to \\mathbb{R}$ of the form:\n\n$$F_1(x; w) = \\sum_{\\ell=0}^{r-1} w^\\top P_s^\\ell x, \\qquad (1)$$\n\nwhere $w \\in \\mathbb{R}^m$ is the filter of size $m$ with stride size $s$, $r \\approx d/s$ is the total number of times filter $w$ is applied to an input vector $x \\in \\mathbb{R}^d$, and $P_s^\\ell x := [x_{\\ell s+1}, x_{\\ell s+2}, \\ldots, x_{\\ell s+m}]$ is an $m$-dimensional segment of the feature vector $x$. Noting that $F_1$ is a linear function of $x$, we can also represent $F_1$ by a one-layer fully connected neural network (linear predictor):\n\n$$F_1^{\\mathrm{FNN}}(x, \\theta) = \\theta^\\top x \\qquad (2)$$\n\nfor some $\\theta \\in \\mathbb{R}^d$. Suppose we have $n$ samples $\\{x_i, y_i\\}_{i=1}^n$, where $x$ is the input and $y$ is the label, and use the least squares estimator:\n\n$$\\widehat{\\theta} := \\arg\\min_{\\theta \\in \\mathbb{R}^d} \\sum_{i=1}^n (y_i - \\theta^\\top x_i)^2.$$\n\nBy classical results analyzing the prediction error for linear regression (see, for instance, Wasserman, 2013), under mild regularity conditions, we need $n \\asymp d/\\epsilon^2$ to have $\\sqrt{\\mathbb{E}_{x \\sim \\mu} |\\widehat{\\theta}^\\top x - \\theta_0^\\top x|^2} \\le \\epsilon$, where $\\mu$ is the input distribution and $\\theta_0$ is the optimal linear predictor. The proof for FNN is fairly simple because we can write $\\widehat{\\theta} = (X^\\top X)^{-1} X^\\top Y$ (normal equation), where $X$ and $Y$ are the aggregated features and labels, respectively, and then directly analyze this expression.\n\nOn the other hand, the network $F_1$ can be viewed as a linear regression model with respect to $w$, by considering a \u201cstacked\u201d version of feature vectors $\\widetilde{x}_i = \\sum_{\\ell=0}^{r-1} P_s^\\ell x_i \\in \\mathbb{R}^m$. The classical analysis of ordinary least squares in linear regression does not directly yield the optimal sample complexity in this case, because the distributional properties of $\\widetilde{x}_i$ as well as the spectral properties of the sample covariance $\\sum_i \\widetilde{x}_i \\widetilde{x}_i^\\top$ are difficult to analyze due to the heavy correlation between coordinates of $\\widetilde{x}$ corresponding to overlapping patches. We discuss further details of this aspect after our main positive result in Theorem 1.\n\nIn this paper, we take a step towards understanding the statistical behavior of the CNN model described above. 
We adopt tools from localized empirical process theory (van de Geer, 2000) and combine them with a structural property of convolutional filters (see Lemma 2) to give a complete characterization of the statistical behavior of this simple CNN.\n\nWe first consider the problem of learning a convolutional filter with average pooling as in Eq. (1) using the least squares estimator. We show that, in the standard statistical learning setting and under fairly natural conditions on the input distribution, $\\widehat{w}$ satisfies\n\n$$\\sqrt{\\mathbb{E}_{x \\sim \\mu} |F_1(x, \\widehat{w}) - F_1(x, w_0)|^2} = \\widetilde{O}\\left(\\sqrt{m/n}\\right),$$\n\nwhere $\\mu$ is the input distribution and $w_0$ is the underlying true convolutional filter. Notably, to achieve an $\\epsilon$ error, the CNN only needs $\\widetilde{O}(m/\\epsilon^2)$ samples whereas the FNN needs $\\Omega(d/\\epsilon^2)$. Since the filter size $m \\ll d$, this result clearly justifies the folklore that the convolutional layer is a more compact representation. Furthermore, we complement this upper bound with a minimax lower bound which shows the error bound $\\widetilde{O}(\\sqrt{m/n})$ is tight up to logarithmic factors.\n\nNext, we consider a one-hidden-layer CNN (see Figure 1b):\n\n$$F_2(x; w, a) = \\sum_{\\ell=0}^{r-1} a_\\ell w^\\top P_s^\\ell x, \\qquad (3)$$\n\nwhere both the shared convolutional filter $w \\in \\mathbb{R}^m$ and output weights $a \\in \\mathbb{R}^r$ are unknown. This architecture was previously considered in Du et al. (2017b). However, the focus of that work is to understand the dynamics of gradient descent. Using similar tools as in analyzing a single convolutional filter, we show that the least squares estimator achieves the error bound $\\widetilde{O}(\\sqrt{(m+r)/n})$ if the ratio between the stride size and the filter size is a constant. Further, we present a minimax lower bound showing that the obtained rate is tight up to logarithmic factors.\n\nTo our knowledge, these theoretical results are the first sharp analyses of the statistical efficiency of the CNN. 
These results suggest that if the input follows a (linear) CNN model, then it can be learned more easily than treating it as an FNN since a CNN model reuses weights.\n\nFigure 1: CNN architectures that we consider in this paper. (a) Prediction function formalized in Eq. (1). It consists of a convolutional filter followed by averaged pooling. The convolutional filter is unknown. (b) Prediction function formalized in Eq. (3). It consists of a convolutional filter followed by a linear prediction layer. Both layers are unknown.\n\n1.1 Comparison with Existing Work\n\nOur work is closely related to the analysis of the generalization ability of neural networks (Arora et al., 2018; Anthony & Bartlett, 2009; Bartlett et al., 2017b,a; Neyshabur et al., 2017; Konstantinos et al., 2017). These generalization bounds are often of the form:\n\n$$L(\\theta) - L_{\\mathrm{tr}}(\\theta) \\le D/\\sqrt{n} \\qquad (4)$$\n\nwhere $\\theta$ represents the parameters of a neural network, $L(\\cdot)$ and $L_{\\mathrm{tr}}(\\cdot)$ represent population and empirical error under some additive loss, and $D$ is the model capacity and is finite only if the (spectral) norm of the weight matrix for each layer is bounded. Compared with generalization bounds based on model capacity, our result has two advantages:\n\n1. If $L(\\cdot)$ is taken to be the mean-squared error $\\mathbb{E}|\\cdot|^2$ (see footnote 3), Eq. (4) implies an $\\widetilde{O}(1/\\epsilon^4)$ sample complexity to achieve a standardized mean-square error of $\\sqrt{\\mathbb{E}|\\cdot|^2} \\le \\epsilon$, which is considerably larger than the $\\widetilde{O}(1/\\epsilon^2)$ sample complexity we established in this paper.\n2. 
Since the complexity of a model class in regression problems typically depends on the magnitude of model parameters (e.g., $\\|w\\|_2$), generalization error bounds like Eq. (4) are not scale-independent and deteriorate if $\\|w\\|_2$ is large. In contrast, our analysis has no dependency on the scale of $w$ and also places no constraints on $\\|w\\|_2$.\n\nOn the other hand, we consider the special case where the neural network model is well-specified and the label is generated according to a neural network with unbiased additive noise (see Eq. (5)), whereas the generalization bounds discussed in this section are typically model agnostic.\n\n1.2 Other Related Work\n\nRecently, researchers have been making progress in theoretically understanding various aspects of neural networks, including hardness of learning (Goel et al., 2016; Song et al., 2017; Brutzkus & Globerson, 2017), landscape of the loss function (Kawaguchi, 2016; Choromanska et al., 2015; Hardt & Ma, 2016; Haeffele & Vidal, 2015; Freeman & Bruna, 2016; Safran & Shamir, 2016; Zhou & Feng, 2017; Nguyen & Hein, 2017b,a; Ge et al., 2017b; Safran & Shamir, 2017; Du & Lee, 2018), dynamics of gradient descent (Tian, 2017; Zhong et al., 2017b; Li & Yuan, 2017), provable learning algorithms (Goel & Klivans, 2017a,b; Zhang et al., 2015), etc.\n\nFocusing on the convolutional neural network, most existing work has analyzed the convergence rate of gradient descent or its variants (Du et al., 2017a,b; Goel et al., 2018; Brutzkus & Globerson, 2017; Zhong et al., 2017a). Our paper differs from them in that we do not consider the computational complexity but only the sample complexity and information theoretical limits of learning a CNN. 
It is an open question what the optimal estimator for a CNN is when the computational budget is taken into account.\n\nFootnote 3: Because the standardized mean-square error $\\sqrt{\\mathbb{E}|\\cdot|^2}$ is not a sum of independent random variables, it is difficult, if not impossible, to apply generalization error bounds directly for $\\sqrt{\\mathbb{E}|\\cdot|^2}$.\n\nConvolutional structure has also been studied in the dictionary learning (Singh et al., 2018; Huang & Anandkumar, 2015) and blind de-convolution (Zhang et al., 2017) literature. These papers studied the unsupervised setting where their goal is to recover structured signals from observations generated according to convolution operations, whereas our paper focuses on the supervised learning setting with the predictor having the convolution structure.\n\n1.3 Organization\n\nThis paper is organized as follows. In Section 2, we formally set up the problem and assumptions. In Section 3 we present our main theoretical results for learning a convolutional filter (see Eq. (1)). In Section 4 we present our main theoretical results for learning a one-hidden-layer CNN (see Eq. (3)). In Section 5, we use numerical experiments to verify our theoretical findings. We conclude and list future directions in Section 6. Most technical proofs are deferred to the appendix.\n\n2 Problem specification and assumptions\n\nLet $\\{x_i, y_i\\}_{i=1}^n$ be a sample of $n$ training data points, where $x_i \\in \\mathbb{R}^d$ denotes the $d$-dimensional feature vector of the $i$th data point and $y_i \\in \\mathbb{R}$ is the corresponding real-valued response. We consider a generic model of\n\n$$y_i = F(x_i; w_0) + \\varepsilon_i, \\quad \\text{where } \\mathbb{E}[\\varepsilon_i | x_i] = 0. \\qquad (5)$$\n\nIn the model of Eq. (5), $F$ represents a certain network parameterized by a fixed but unknown parameter $w_0$ that takes a $d$-dimensional vector $x_i$ as input and outputs a single real-valued prediction $F(x_i; w_0)$. $\\{\\varepsilon_i\\}_{i=1}^n$ represents stochastic noise inherent in the data, and is assumed to have mean zero. 
The feature vectors of training data $\\{x_i\\}_{i=1}^n$ are sampled i.i.d. from an unknown distribution $\\mu$ supported on $\\mathbb{R}^d$.\n\nThroughout this paper we make the following assumptions:\n\n(A1) Sub-gaussian noise: there exists a constant $\\sigma^2 < \\infty$ such that for any $t \\in \\mathbb{R}$, $\\mathbb{E} e^{t\\varepsilon_i} \\le e^{\\sigma^2 t^2/2}$;\n(A2) Sub-gaussian design: there exists a constant $\\nu^2 < \\infty$ such that for any $a \\in \\mathbb{R}^d$, $\\mathbb{E}_\\mu x = 0$ and $\\mathbb{E}_\\mu \\exp\\{a^\\top x\\} \\le \\exp\\{\\nu^2 \\|a\\|_2^2/2\\}$;\n(A3) Non-degeneracy: there exists a constant $\\kappa > 0$ such that $\\lambda_{\\min}(\\mathbb{E}_\\mu x x^\\top) \\ge \\kappa$.\n\nWe remark that the assumptions (A1) through (A3) are quite mild. In particular, we only impose sub-Gaussianity conditions on the distributions of $x_i$ and $\\varepsilon_i$, and do not assume they are generated/sampled from any exact distributions. The last non-degeneracy condition (A3) assumes that there is a non-negligible probability mass along any direction of the input distributions. It is very likely to be satisfied after simple pre-processing steps of input data, such as mean removal and whitening of the sample covariance.\n\nWe are interested in learning a parameter $\\widehat{w}_n$ using a training sample $\\{(x_i, y_i)\\}_{i=1}^n$ of size $n$ so as to minimize the standardized population mean-square prediction error\n\n$$\\mathrm{err}_\\mu(\\widehat{w}_n, w_0; F) = \\sqrt{\\mathbb{E}_{x \\sim \\mu} |F(x; \\widehat{w}_n) - F(x; w_0)|^2}. \\qquad (6)$$\n\n3 Convolutional filters with average pooling\n\nWe first consider a convolutional network with one convolutional layer, one convolutional filter, an average pooling layer and linear activations. More specifically, for a single convolutional filter $w \\in \\mathbb{R}^m$ of size $m$ and a stride of size $s$, the network can be written as\n\n$$F_1(x; w) = \\sum_{\\ell=0}^{r-1} w^\\top P_s^\\ell x, \\qquad (7)$$\n\nwhere $r \\approx d/s$ is the total number of times filter $w$ is applied to an input vector $x$, and $P_s^\\ell x := [x_{\\ell s+1}, x_{\\ell s+2}, \\ldots, x_{\\ell s+m}]$ is an $m$-dimensional segment of the $d$-dimensional feature vector $x$. For simplicity, we assume that $m$ is divisible by $s$ and let $J = m/s \\in \\mathbb{N}$ denote the number of strides within a single filter of size $m$.\n\n3.1 The upper bound\n\nGiven training sample $\\{(x_i, y_i)\\}_{i=1}^n$, we consider the following least-squares estimator:\n\n$$\\widehat{w}_n \\in \\arg\\min_{w \\in \\mathbb{R}^m} \\frac{1}{n} \\sum_{i=1}^n (y_i - F_1(x_i; w))^2. \\qquad (8)$$\n\nNote the subscript $n$, which emphasizes that $\\widehat{w}_n$ is trained using a sample of $n$ data points. In addition, because the objective is a quadratic function in $w$, Eq. (8) is actually a convex optimization problem and a global optimal solution $\\widehat{w}_n$ can be obtained efficiently. More specifically, $\\widehat{w}_n$ admits the closed-form solution $\\widehat{w}_n = (\\sum_{i=1}^n \\widetilde{x}_i \\widetilde{x}_i^\\top)^{-1} \\sum_{i=1}^n y_i \\widetilde{x}_i$, where $\\widetilde{x}_i = \\sum_{\\ell=0}^{r-1} P_s^\\ell x_i$ is the stacked version of input feature vector $x_i$.\n\nThe following theorem upper bounds the expected population mean-square prediction error $\\mathrm{err}_\\mu(\\widehat{w}_n, w_0; F_1)$ of the least-square estimate $\\widehat{w}_n$ in Eq. (8).\n\nTheorem 1. Fix an arbitrary $\\delta \\in (0, 1/2)$. Suppose (A1) through (A3) hold and $\\nu\\sqrt{\\log(n/\\delta)} \\ge \\kappa$, $n \\gtrsim \\kappa^{-2} \\nu^2 m \\log(\\nu d \\log \\delta^{-1}) \\log(n \\delta^{-1})$. Then there exists a universal constant $C > 0$ such that with probability $1 - \\delta$ over the random draws of $x_1, \\ldots, x_n \\sim \\mu$,\n\n$$\\mathbb{E}\\, \\mathrm{err}_\\mu(\\widehat{w}_n, w_0; F_1) \\le C \\sqrt{\\frac{\\sigma^2 m \\log(\\kappa^{-1} \\nu d \\log(\\delta^{-1}))}{n}} \\quad \\text{conditioned on } x_1, \\ldots, x_n. \\qquad (9)$$\n\nHere the expectation is taken with respect to the randomness in $\\{\\varepsilon_i\\}_{i=1}^n$.\n\nTheorem 1 shows that, with $n = \\widetilde{\\Omega}(m)$ samples, the expected population mean-square error $\\mathrm{err}_\\mu(\\widehat{w}_n, w_0; F_1)$ scales as $\\widetilde{O}(\\sqrt{\\sigma^2 m/n})$. This matches the $1/\\sqrt{n}$ statistical error for classical parametric statistics problems, and also confirms the \u201cparameter count\u201d intuition that the estimation error scales approximately with the number of parameters in a network ($m$ in network $F_1$).\n\nWe next briefly explain the strategies we employ to prove Theorem 1. While it\u2019s tempting to directly use the closed-form expression $\\widehat{w}_n = (\\sum_{i=1}^n \\widetilde{x}_i \\widetilde{x}_i^\\top)^{-1} \\sum_{i=1}^n y_i \\widetilde{x}_i$ to analyze $\\widehat{w}_n$, such an approach has two limitations. First, because we consider the population mean-square error $\\mathrm{err}_\\mu(\\widehat{w}_n, w_0; F_1)$, such an approach would inevitably require the analysis of spectral properties (e.g., the least eigenvalue) of $\\sum_{i=1}^n \\widetilde{x}_i \\widetilde{x}_i^\\top$, which is very challenging as heavy correlation occurs in $\\widetilde{x}_i$ when filters are overlapping (i.e., $s < m$ and $J > 1$). It is likely that strong assumptions such as exact isotropic Gaussianity of the feature vectors are needed to analyze the distributional properties of $\\widetilde{x}_i$ (Qu et al., 2017). Second, such an approach relies on closed-form expressions of $\\widehat{w}_n$ and is difficult to extend to other potential activations such as the ReLU activation, for which no closed-form expressions of $\\widehat{w}_n$ exist.\n\nTo overcome the above difficulties, we adopt a localized empirical process approach introduced in (van de Geer, 2000) to upper bound the expected population mean-square prediction error. At the core of the analysis is an upper bound on the covering number of a localized parameter set, with an interesting argument that partitions a $d$-dimensional equivalent regressor for compactification purposes (see Lemmas 2 and 4 in the appendix for details). Our proof does not rely on the exact/closed-form expression of $\\widehat{w}_n$, and has the potential to be extended to other activation functions, as we discuss in Section 6. The complete proof of Theorem 1 is placed in the appendix.\n\n3.2 The lower bound\n\nWe prove the following information-theoretic lower bound on $\\mathbb{E}\\, \\mathrm{err}_\\mu(\\widehat{w}_n, w_0)$ of any estimator $\\widehat{w}_n$ calculated on a training sample of size $n$.\n\nTheorem 2. Suppose $x_1, \\ldots, x_n \\sim N(0, I)$ and $\\varepsilon_1, \\ldots, \\varepsilon_n \\sim N(0, \\sigma^2)$. Suppose also that $m - s$ is an even number. Then there exists a universal constant $C' > 0$ such that\n\n$$\\inf_{\\widehat{w}_n} \\sup_{w_0 \\in \\mathbb{R}^m} \\mathbb{E}\\, \\mathrm{err}_\\mu(\\widehat{w}_n, w_0; F_1) \\ge C' \\sqrt{\\frac{\\sigma^2 m}{n}}. \\qquad (10)$$\n\nRemark 1. Theorem 2 is valid for any pair of (filter size, stride) combinations $(m, s)$, provided that $m$ is divisible by $s$ and $m - s$ is an even number. The latter requirement is a technical condition in our proof and is not critical, because one can double the size of $m$ and $s$, and the lower bound in Theorem 2 remains asymptotically on the same order.\n\nTheorem 2 shows that any estimator $\\widehat{w}_n$ computed on a training set of size $n$ must have a worst-case error of at least $\\sqrt{\\sigma^2 m/n}$. This suggests that our upper error bound in Theorem 1 is tight up to logarithmic factors.\n\nOur proof of Theorem 2 draws on tools from standard information-theoretical lower bounds such as Fano\u2019s inequality (Yu, 1997; Tsybakov, 2009). The high-level idea is to construct a finite candidate set of parameters $W \\subseteq \\mathbb{R}^m$ and upper bound the Kullback-Leibler (KL) divergence of induced observable distributions and the population prediction mean-square error between parameters in the candidate set $W$. 
The complete proof of Theorem 2 is placed in the appendix.\n\n4 Convolutional filters with prediction layers\n\nWe consider a slightly more complicated convolutional network with two layers: the first layer is a single convolutional filter of size $m$, applied $r$ times to a $d$-dimensional input vector with stride $s$; the second layer is a linear regression prediction layer that produces a single real-valued output.\n\nFor such a two-layer network the parameter $\\mathbf{w}$ can be specified as $\\mathbf{w} = (w, a)$, where $w \\in \\mathbb{R}^m$ is the weights in the first-layer convolutional filter and $a \\in \\mathbb{R}^r$ is the weight in the second linear prediction layer. The network $F_2(x; \\mathbf{w}) = F_2(x; w, a)$ can then be written as\n\n$$F_2(x; w, a) = \\sum_{\\ell=0}^{r-1} a_\\ell w^\\top P_s^\\ell x. \\qquad (11)$$\n\nNote that in Eq. (11) the vector $a \\in \\mathbb{R}^r$ is labeled as $a = (a_0, a_1, \\ldots, a_{r-1})$ for convenience, which matches the labels of the operator $P_s^\\ell$ for $\\ell = 0, \\ldots, r-1$.\n\nCompared to network $F_1$ with average pooling, the new network $F_2$ can be viewed as a weighted pooling of convolutional filters, with weights $a \\in \\mathbb{R}^r$ unknown and to be learnt. An illustration of the network $F_2$ is given in Figure 1b.\n\n4.1 The upper bound\n\nWe again consider the least-squares estimator\n\n$$\\widehat{\\mathbf{w}}_n = (\\widehat{w}_n, \\widehat{a}_n) \\in \\arg\\min_{w \\in \\mathbb{R}^m, a \\in \\mathbb{R}^r} \\frac{1}{n} \\sum_{i=1}^n (y_i - F_2(x_i; w, a))^2. \\qquad (12)$$\n\nAgain, we use subscript $n$ to emphasize that both $\\widehat{w}_n$ and $\\widehat{a}_n$ are computed on a training set $\\{x_i, y_i\\}_{i=1}^n$ of size $n$.\n\nUnlike the least squares problem in Eq. (8) for the $F_1$ network, the optimization problem in Eq. (12) has two optimization variables $w, a$ and is therefore no longer convex. This means that popular optimization algorithms like gradient descent do not necessarily converge to a global minimum of Eq. (12). Nevertheless, in this paper we choose to focus on the statistical properties of $(\\widehat{w}_n, \\widehat{a}_n)$ and assume global minimality of Eq. (12) is achieved. 
On the other hand, because Eq. (12) resembles the matrix sensing problem, it is possible that all local minima are global minima and saddle points can be efficiently escaped (Ge et al., 2017a), which we leave as future work.\n\nThe following theorem upper bounds the population mean-square prediction error of any global minimizer $\\widehat{\\mathbf{w}}_n = (\\widehat{w}_n, \\widehat{a}_n)$ of Eq. (12).\n\nTheorem 3. Fix arbitrary $\\delta \\in (0, 1/2)$ and define $J := m/s$, where $m$ is the filter size and $s$ is the stride. Suppose (A1) through (A3) hold and $\\nu\\sqrt{\\log(n/\\delta)} \\ge \\kappa$, $n \\gtrsim \\kappa^{-2} \\nu^2 (rJ + m) \\log(\\nu d \\log \\delta^{-1}) \\log(n \\delta^{-1})$. Then there exists a universal constant $C > 0$ such that with probability $1 - \\delta$ over the random draws of $x_1, \\ldots, x_n \\sim \\mu$,\n\n$$\\mathbb{E}\\, \\mathrm{err}_\\mu(\\widehat{\\mathbf{w}}_n, \\mathbf{w}_0; F_2) \\le C \\sqrt{\\frac{\\sigma^2 (rJ + m) \\log(\\kappa^{-1} \\nu d \\log(\\delta^{-1}))}{n}} \\quad \\text{conditioned on } x_1, \\ldots, x_n. \\qquad (13)$$\n\nHere the expectation is taken with respect to the randomness in $\\{\\varepsilon_i\\}_{i=1}^n$.\n\nTheorem 3 is proved by localized empirical process arguments similar to those in the proof of Theorem 1. Due to space constraints we defer the complete proof of Theorem 3 to the appendix.\n\nFigure 2: Experiments on the problem of learning a convolutional filter with average pooling described in Section 3 with stride size $s = 1$. Panels: (a) filter size $m = 2$; (b) filter size $m = 8$; (c) filter size $m = 16$. Each panel plots testing error (log scale) against the number of training data (500 to 2000) for CNN and FNN.\n\nTheorem 3 shows that $\\mathrm{err}_\\mu(\\widehat{\\mathbf{w}}_n, \\mathbf{w}_0; F_2)$ can be upper bounded by $\\widetilde{O}(\\sqrt{\\sigma^2 (rJ + m)/n})$, provided that at least $n = \\widetilde{\\Omega}(rJ + m)$ samples are available. Compared to the intuitive \u201cparameter count\u201d of $r + m$ ($r$ parameters for $a$ and $m$ parameters for $w$), our upper bound has an additional multiplicative $J = m/s$ term, which is the number of strides within each $m$-dimensional filter. Therefore, our upper bound only matches parameter counts when $J$ is very small (e.g., non-overlapping filters or fast-moving filters where the stride $s$ is at least a constant fraction of filter size $m$), and becomes large when the stride $s$ is very small, leading to many convolutions being computed.\n\nWe conjecture that such an increase in error/sample complexity is due to an inefficiency in one of our key technical lemmas. More specifically, in Lemma 7, in which we derive upper bounds on the covering number of localized parameter sets, we use the boundedness and low-dimensionality of each segment of differences of equivalent parameters for compactification purposes; such an argument is not ideal, as it overlooks the correlation between different segments, connected by an $r$-dimensional parameter $a$. 
A sharper covering number argument would potentially improve the error analysis and achieve sample complexity scaling with $r + m$.\n\n4.2 The lower bound\n\nWe prove the following information-theoretical lower bound on $\\mathbb{E}\\, \\mathrm{err}_\\mu(\\widehat{\\mathbf{w}}_n, \\mathbf{w}_0)$ of any estimator $\\widehat{\\mathbf{w}}_n = (\\widehat{w}_n, \\widehat{a}_n)$ calculated on a training sample of size $n$.\n\nTheorem 4. Suppose $x_1, \\ldots, x_n \\sim N(0, I)$ and $\\varepsilon_1, \\ldots, \\varepsilon_n \\sim N(0, \\sigma^2)$. Then there exists a universal constant $C' > 0$ such that\n\n$$\\inf_{\\widehat{\\mathbf{w}}_n} \\sup_{\\mathbf{w}_0} \\mathbb{E}\\, \\mathrm{err}_\\mu(\\widehat{\\mathbf{w}}_n, \\mathbf{w}_0; F_2) \\ge C' \\sqrt{\\frac{\\sigma^2 (r + m)}{n}}. \\qquad (14)$$\n\nTheorem 4 shows that the error of any estimator $\\widehat{\\mathbf{w}}_n$ computed on a training sample of size $n$ must scale as $\\sqrt{\\sigma^2 (r + m)/n}$, matching the parameter counts of $r + m$ for $F_2$. It is proved by reducing the regression problem under $F_2$ to two separate ordinary linear regression problems and invoking classical lower bounds for linear regression models (Wasserman, 2013; Van der Vaart, 1998). A complete proof of Theorem 4 is given in the appendix.\n\n5 Experiments\n\nIn this section we use simulations to verify our theoretical findings. For all experiments, we let the ambient dimension $d$ be 64 and the input distribution be Gaussian with mean 0 and identity covariance. We use the population mean-square prediction error defined in Eq. (6) as the evaluation metric. In all plots, CNN represents using convolutional parameterization corresponding to Eq. (1) or Eq. (3) and FNN represents using fully connected parameterization corresponding to Eq. (2).\n\nIn Figure 2 and Figure 3, we consider the problem of learning a convolutional filter with average pooling which we analyzed in Section 3. We vary the number of samples, the dimension of filters and the stride size. Here we compare parameterizing the prediction function as a $d$-dimensional linear predictor and as a convolutional filter followed by average pooling. 
Experiments show CNN parameterization is consistently better than the FNN parameterization. Further, as the number of training samples increases, the prediction error goes down, and as the dimension of the filter increases, the error goes up. These facts qualitatively justify our derived error bound $\\widetilde{O}\\left(\\frac{m}{n}\\right)$. Lastly, in Figure 2 we choose stride $s = 1$ and in Figure 3 we choose the stride size equal to the filter size, $s = m$, i.e., non-overlapping. Our experiment shows the stride does not affect the prediction error in this setting, which coincides with our theoretical bound, in which there is no stride size factor.\n\nFigure 3: Experiments on the problem of learning a convolutional filter with average pooling described in Section 3 with stride size $s = m$, i.e., non-overlapping. Panels: (a) filter size $m = 2$; (b) filter size $m = 8$; (c) filter size $m = 16$. Each panel plots testing error (log scale) against the number of training data (500 to 2000) for CNN and FNN.\n\nFigure 4: Experiment on the problem of one-hidden-layer convolutional neural network with a shared filter and a prediction layer described in Section 4. The filter size $m$ is chosen to be 8. Panels: (a) stride size $s = 1$; (b) stride size $s = m/2$; (c) stride size $s = m$, i.e., non-overlapping. Each panel plots testing error (log scale) against the number of training data (500 to 2000) for CNN and FNN.\n\nIn Figure 4, we consider the one-hidden-layer CNN model analyzed in Section 4. Here we fix the filter size $m = 8$ and vary the number of training samples and the stride size. When stride $s = 1$, the convolutional parameterization has the same order of parameters as the linear predictor parameterization ($r = 57$ so $r + m = 65 \\approx d = 64$) and Figure 4a shows they have similar performance. In Figure 4b and Figure 4c we choose the stride to be $m/2 = 4$ and $m = 8$ (non-overlapping), respectively. Note these settings have fewer parameters ($r + m = 23$ for $s = 4$ and $r + m = 16$ for $s = 8$) than the case when $s = 1$ and so CNN gives better performance than FNN.\n\n6 Conclusion and Future Directions\n\nIn this paper we give rigorous characterizations of the statistical efficiency of CNNs with simple architectures. We now discuss how to extend our work to more complex models and the main difficulties in doing so.\n\nNon-linear Activation: Our paper only considered CNNs with linear activation. A natural question is what the sample complexity of learning a CNN with a non-linear activation like Rectified Linear Units (ReLU) is. We find that even without convolution structure, this is a difficult problem. For the linear activation function, we can show the empirical loss is a good approximation to the population loss and we used this property to derive our upper bound. However, for ReLU activation, we can find a counterexample for any finite $n$, which breaks our Lemma 3. 
We believe that, given a better understanding of non-smooth activations that can replace our Lemma 3, we can extend our analysis framework to derive sharp sample complexity bounds for CNNs with non-linear activation functions.\n\nMultiple Filters: For both models we considered in this paper, there is only one shared filter. In commonly used CNN architectures, there are multiple filters in each layer and multiple layers. Note that if one considers a model of $k$ filters with linear activation with $k > 1$, one can always replace this model by a single convolutional filter that equals the sum of these $k$ filters. Thus, we can formally study the statistical behavior of wide and deep architectures only after we have understood the non-linear activation function. Nevertheless, we believe our empirical process based analysis is still applicable.\n\nAcknowledgment\n\nThis research was partly funded by AFRL grant FA8750-17-2-0212 and DARPA D17AP00001.\n\nReferences\n\nAnthony, M., & Bartlett, P. L. (2009). Neural network learning: Theoretical foundations. Cambridge University Press.\n\nArora, S., Ge, R., Neyshabur, B., & Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296.\n\nBartlett, P. L., Foster, D. J., & Telgarsky, M. J. (2017a). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, (pp. 6241\u20136250).\n\nBartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A. (2017b). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv preprint. arXiv, 1703.\n\nBickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4), 1705\u20131732.\n\nBrutzkus, A., & Globerson, A. (2017). Globally optimal gradient descent for a Convnet with Gaussian inputs. 
arXiv preprint arXiv:1702.07966.\n\nChoromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2015). The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, (pp. 192\u2013204).\n\nDu, S. S., & Lee, J. D. (2018). On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206.\n\nDu, S. S., Lee, J. D., & Tian, Y. (2017a). When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129.\n\nDu, S. S., Lee, J. D., Tian, Y., Poczos, B., & Singh, A. (2017b). Gradient descent learns one-hidden-layer CNN: Don\u2019t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779.\n\nDudley, R. M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3), 290\u2013330.\n\nFreeman, C. D., & Bruna, J. (2016). Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540.\n\nGe, R., Jin, C., & Zheng, Y. (2017a). No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, (pp. 1233\u20131242).\n\nGe, R., Lee, J. D., & Ma, T. (2017b). Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501.\n\nGoel, S., Kanade, V., Klivans, A., & Thaler, J. (2016). Reliably learning the ReLU in polynomial time. arXiv preprint arXiv:1611.10258.\n\nGoel, S., & Klivans, A. (2017a). Eigenvalue decay implies polynomial-time learnability for neural networks. arXiv preprint arXiv:1708.03708.\n\nGoel, S., & Klivans, A. (2017b). Learning depth-three neural networks in polynomial time. arXiv preprint arXiv:1709.06010.\n\nGoel, S., Klivans, A., & Meka, R. (2018). Learning one convolutional layer with overlapping patches. arXiv preprint arXiv:1802.02547.\n\nGraham, R., & Sloane, N. (1980). 
Lower bounds for constant weight codes. IEEE Transactions on Information Theory, 26(1), 37\u201343.\n\nHaeffele, B. D., & Vidal, R. (2015). Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540.\n\nHardt, M., & Ma, T. (2016). Identity matters in deep learning. arXiv preprint arXiv:1611.04231.\n\nHoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13\u201330.\n\nHuang, F., & Anandkumar, A. (2015). Convolutional dictionary learning through tensor factorization. In Feature Extraction: Modern Questions and Challenges, (pp. 116\u2013129).\n\nKawaguchi, K. (2016). Deep learning without poor local minima. In Advances in Neural Information Processing Systems, (pp. 586\u2013594).\n\nKonstantinos, P., Davies, M., & Vandergheynst, P. (2017). PAC-Bayesian margin bounds for convolutional neural networks-technical report. arXiv preprint arXiv:1801.00171.\n\nKrizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, (pp. 1097\u20131105).\n\nLeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10), 1995.\n\nLi, Y., & Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886.\n\nNeyshabur, B., Bhojanapalli, S., McAllester, D., & Srebro, N. (2017). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564.\n\nNguyen, Q., & Hein, M. (2017a). The loss surface and expressivity of deep convolutional neural networks. arXiv preprint arXiv:1710.10928.\n\nNguyen, Q., & Hein, M. (2017b). The loss surface of deep and wide neural networks. 
arXiv preprint arXiv:1704.08045.\n\nQu, Q., Zhang, Y., Eldar, Y. C., & Wright, J. (2017). Convolutional phase retrieval via gradient descent. arXiv preprint arXiv:1712.00716.\n\nSafran, I., & Shamir, O. (2016). On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, (pp. 774\u2013782).\n\nSafran, I., & Shamir, O. (2017). Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968.\n\nSilver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484\u2013489.\n\nSingh, S., P\u00f3czos, B., & Ma, J. (2018). Minimax reconstruction risk of convolutional sparse dictionary learning. In International Conference on Artificial Intelligence and Statistics, (pp. 1327\u20131336).\n\nSong, L., Vempala, S., Wilmes, J., & Xie, B. (2017). On the complexity of learning neural networks. In Advances in Neural Information Processing Systems, (pp. 5520\u20135528).\n\nTian, Y. (2017). An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560.\n\nTsybakov, A. B. (2009). Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York.\n\nvan de Geer, S. A. (2000). Empirical Processes in M-estimation, vol. 6. Cambridge University Press.\n\nVan der Vaart, A. W. (1998). Asymptotic statistics, vol. 3. Cambridge University Press.\n\nVershynin, R. (2012). How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3), 655\u2013686.\n\nWang, Y., & Singh, A. (2016). Noise-adaptive margin-based active learning and lower bounds under Tsybakov noise condition. 
In AAAI.\n\nWasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.\n\nYu, A. W., Dohan, D., Luong, M.-T., Zhao, R., Chen, K., Norouzi, M., & Le, Q. V. (2018). QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.\n\nYu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, (pp. 423\u2013435). Springer.\n\nZhang, Y., Lau, Y., Kuo, H.-w., Cheung, S., Pasupathy, A., & Wright, J. (2017). On the global geometry of sphere-constrained sparse blind deconvolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 4894\u20134902).\n\nZhang, Y., Lee, J. D., Wainwright, M. J., & Jordan, M. I. (2015). Learning halfspaces and neural networks with random initialization. arXiv preprint arXiv:1511.07948.\n\nZhong, K., Song, Z., & Dhillon, I. S. (2017a). Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440.\n\nZhong, K., Song, Z., Jain, P., Bartlett, P. L., & Dhillon, I. S. (2017b). Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175.\n\nZhou, P., & Feng, J. (2017). The landscape of deep learning algorithms. arXiv preprint arXiv:1705.07038.", "award": [], "sourceid": 252, "authors": [{"given_name": "Simon", "family_name": "Du", "institution": "Carnegie Mellon University"}, {"given_name": "Yining", "family_name": "Wang", "institution": "CMU"}, {"given_name": "Xiyu", "family_name": "Zhai", "institution": "MIT"}, {"given_name": "Sivaraman", "family_name": "Balakrishnan", "institution": "Carnegie Mellon University"}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": "Carnegie Mellon University"}, {"given_name": "Aarti", "family_name": "Singh", "institution": "CMU"}]}