{"title": "On the Expressive Power of Deep Polynomial Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 10310, "page_last": 10319, "abstract": "We study deep neural networks with polynomial activations, particularly their expressive power.  For a fixed architecture and activation degree, a polynomial neural network defines an algebraic map from weights to polynomials.  The image of this map is the functional space associated to the network, and it is an irreducible algebraic variety upon taking closure.  This paper proposes the dimension of this variety as a precise measure of the expressive power of polynomial neural networks.  We obtain several theoretical results regarding this dimension as a function of architecture, including an exact formula for high activation degrees, as well as upper and lower bounds on layer widths in order for deep polynomials networks to fill the ambient functional space. We also present computational evidence that it is profitable in terms of expressiveness for layer widths to increase monotonically and then decrease monotonically.  Finally, we link our study to favorable optimization properties when training weights, and we draw  intriguing connections with tensor and polynomial decompositions.", "full_text": "On the Expressive Power of\n\nDeep Polynomial Neural Networks\n\nJoe Kileel\u21e4\n\nPrinceton University\n\nMatthew Trager\u21e4\nNew York University\n\nJoan Bruna\n\nNew York University\n\nAbstract\n\nWe study deep neural networks with polynomial activations, particularly their\nexpressive power. For a \ufb01xed architecture and activation degree, a polynomial\nneural network de\ufb01nes an algebraic map from weights to polynomials. The image\nof this map is the functional space associated to the network, and it is an irreducible\nalgebraic variety upon taking closure. 
This paper proposes the dimension of this\nvariety as a precise measure of the expressive power of polynomial neural networks.\nWe obtain several theoretical results regarding this dimension as a function of\narchitecture, including an exact formula for high activation degrees, as well as\nupper and lower bounds on layer widths in order for deep polynomials networks to\n\ufb01ll the ambient functional space. We also present computational evidence that it is\npro\ufb01table in terms of expressiveness for layer widths to increase monotonically and\nthen decrease monotonically. Finally, we link our study to favorable optimization\nproperties when training weights, and we draw intriguing connections with tensor\nand polynomial decompositions.\n\n1\n\nIntroduction\n\nA fundamental problem in the theory of deep learning is to study the functional space of deep neural\nnetworks. A network can be modeled as a composition of elementary maps, however the family of\nall functions that can be obtained in this way is extremely complex. Many recent papers paint an\naccurate picture for the case of shallow networks (e.g., using mean \ufb01eld theory [7, 27]) and of deep\nlinear networks [2, 3, 21], however a similar investigation of deep nonlinear networks appears to be\nsigni\ufb01cantly more challenging, and require very different tools.\nIn this paper, we consider a general model for deep polynomial neural networks, where the activation\nfunction is a polynomial (r-th power) exponentiation. The advantage of this framework is that\nthe functional space associated with a network architecture is algebraic, so we can use tools from\nalgebraic geometry [17] for a precise investigation of deep neural networks. Indeed, for a \ufb01xed\nactivation degree r and architecture d = (d0, . . . , dh) (expressed as a sequence of widths), the family\nof all networks with varying weights can be identi\ufb01ed with an algebraic variety Vd,r, embedded\nin a \ufb01nite-dimensional Euclidean space. 
In this setting, an algebraic variety can be thought of as a\nmanifold that may have singularities.\nIn this paper, our main object of study is the dimension of Vd,r as a variety (in practice, as a manifold),\nwhich may be regarded as a precise measure of the architecture\u2019s expressiveness. Speci\ufb01cally, we\nprove that this dimension stabilizes when activations are high degree, and we provide an exact\ndimension formula for this case (Theorem 14). We also investigate conditions under which Vd,r\n\ufb01lls its ambient space. This question is important from the vantage point of optimization, since an\narchitecture is \u201c\ufb01lling\u201d if and only if it corresponds to a convex functional space (Proposition 6). In\n\n\u21e4Equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fthis direction, we prove a bottleneck property, that if a width is not suf\ufb01ciently large, the network can\nnever \ufb01ll the ambient space regardless of the size of other layers (Theorem 19).\nIn a broader sense, our work introduces a powerful language and suite of mathematical tools for\nstudying the geometry of network architectures. Although this setting requires polynomial activations,\nit may be used as a testing ground for more general situations and, e.g., to verify rules of thumb\nrigorously. Finally, our results show that polynomial neural networks are intimately related to\nthe theory of tensor decompositions [22]. In fact, representing a polynomial as a deep network\ncorresponds to a type of decomposition of tensors which may be viewed as a composition of\ndecompositions of a recently introduced sort [24]. Using this connection, we establish general\nnon-trivial upper bounds on \ufb01lling widths (Theorem 10). 
We believe that our work can serve as a\n\ufb01rst step towards many interesting research challenges in developing the theoretical underpinnings of\ndeep learning.\n\n1.1 Related work\n\nThe study of the expressive power of neural networks dates back to seminal work on the universality\nof networks as function approximators [10, 19]. More recently, there has been research supporting\nthe hypothesis of \u201cdepth ef\ufb01ciency\u201d, i.e., the fact that deep networks can approximate functions more\nef\ufb01ciently than shallow networks [11, 25, 8, 9]. In contrast to this line of work, we study the class\nof functions that can be expressed exactly using a network. Our analysis may of course be used to\ninvestigate the problem of approximation, however this is not the focus of this paper.\nMost of the aforementioned studies make strong hypotheses on the network architecture. In par-\nticular, [11, 25] focus on arithmetic circuits, or sum-product networks [29]. These are networks\ncomposed of units that compute either the product or a weighted sum of their inputs. In [8], the\nauthors introduce a model of convolutional arithmetic circuits. This is a particular class of arithmetic\ncircuits that includes networks with layers of 1D convolutions and product pooling. This model does\nnot allow for non-linear activations (beside the product pooling), although the follow-up paper [9]\nextends some results to ReLU activations with sum pooling. Interestingly, these networks are related\nto Hierarchical Tucker (HT) decomposition of tensors.\nThe polynomial networks studied in this paper are not arithmetic circuits, but feedforward deep\nnetworks with polynomial r-th power activations. This is a vast generalization of a setting consid-\nered in several recent papers [33, 14, 31], that study shallow (two layer) networks with quadratic\nactivations (r = 2). 
These papers show that if the width of the intermediate layer is at least twice\nthe input dimension, then the quadratic loss has no \u201cbad\u201d local minima. This result is in line with our\nProposition 5, which explains that in this case the functional space is convex and fills the ambient space.\nWe also point out that polynomial activations are required for the functional space of the network to\nspan a finite-dimensional vector space [23, 33].\nThe polynomial networks considered in this paper do not correspond to HT tensor decompositions as\nin [8, 9]; rather, they are related to a different polynomial/tensor decomposition attracting very recent\ninterest [16, 24]. These generalize usual decompositions, but their algorithmic and theoretical\nunderstanding is mostly wide open. Neural networks motivate several questions in this vein.\nFinally, we mention other recent works that study neural networks from the perspective of algebraic\ngeometry [26, 32, 20].\n\nMain contributions. Our main contributions can be summarized as follows.\n\n\u2022 We give a precise formulation of the expressiveness of polynomial networks in terms of the\nalgebraic dimension of the functional space as an algebraic variety.\n\n\u2022 We spell out the close, two-way relationship between polynomial networks and a particular\nfamily of decompositions of tensors.\n\n\u2022 We prove several theoretical results on the functional space of polynomial networks. Notably,\nwe give a formula for the dimension that holds for sufficiently high activation degrees\n(Theorem 14) and we prove a tight lower bound on the width of the layers for the network\nto be \u201cfilling\u201d in the functional space (Theorem 19).\n\nNotation. We use $\\mathrm{Sym}_d(\\mathbb{R}^n)$ to denote the space of homogeneous polynomials of degree $d$ in $n$\nvariables with coefficients in $\\mathbb{R}$. This set is a vector space over $\\mathbb{R}$ of dimension $N_{d,n} = \\binom{n+d-1}{d}$,\nspanned by all monomials of degree $d$ in $n$ variables. In practice, $\\mathrm{Sym}_d(\\mathbb{R}^n)$ is isomorphic to $\\mathbb{R}^{N_{d,n}}$,\nand our networks will correspond to points in this high-dimensional space. The notation $\\mathrm{Sym}_d(\\mathbb{R}^n)$\nexpresses the fact that a polynomial of degree $d$ in $n$ variables can always be identified with a\nsymmetric tensor in $(\\mathbb{R}^n)^{\\otimes d}$ that collects all of its coefficients.\n\n2 Basic setup\n\nA polynomial network is a function $p_\\theta : \\mathbb{R}^{d_0} \\to \\mathbb{R}^{d_h}$ of the form\n\n$$p_\\theta(x) = W_h \\rho_r W_{h-1} \\rho_r \\cdots \\rho_r W_1 x, \\qquad W_i \\in \\mathbb{R}^{d_i \\times d_{i-1}},$$\n\nwhere the activation $\\rho_r(z)$ raises all elements of $z$ to the $r$-th power ($r \\in \\mathbb{N}$). The parameters\n$\\theta = (W_h, \\ldots, W_1) \\in \\mathbb{R}^{d_\\theta}$ (with $d_\\theta = \\sum_{i=1}^{h} d_i d_{i-1}$) are the network\u2019s weights, and the network\u2019s\narchitecture is encoded by the sequence $\\mathbf{d} = (d_0, \\ldots, d_h)$ (specifying the depth $h$ and widths\n$d_i$). Clearly, $p_\\theta$ is a homogeneous polynomial mapping $\\mathbb{R}^{d_0} \\to \\mathbb{R}^{d_h}$ of degree $r^{h-1}$, i.e.,\n$p_\\theta \\in \\mathrm{Sym}_{r^{h-1}}(\\mathbb{R}^{d_0})^{d_h}$.\nFor fixed degree $r$ and architecture $\\mathbf{d} = (d_0, \\ldots, d_h)$, there exists an algebraic map\n\n$$\\Phi_{\\mathbf{d},r} : \\theta \\mapsto p_\\theta = \\begin{bmatrix} p_{\\theta 1} \\\\ \\vdots \\\\ p_{\\theta d_h} \\end{bmatrix}, \\tag{1}$$\n\nwhere each $p_{\\theta i}$ is a polynomial in $d_0$ variables. The image of $\\Phi_{\\mathbf{d},r}$ is a set of vectors of polynomials,\ni.e., a subset $F_{\\mathbf{d},r}$ of $\\mathrm{Sym}_{r^{h-1}}(\\mathbb{R}^{d_0})^{d_h}$, and it is the functional space represented by the network. In\nthis paper, we consider the \u201cZariski closure\u201d $V_{\\mathbf{d},r} = \\overline{F_{\\mathbf{d},r}}$ of the functional space.\u00b9 We refer to $V_{\\mathbf{d},r}$\nas the functional variety of the network architecture, as it is in fact an irreducible algebraic variety. In\nparticular, $V_{\\mathbf{d},r}$ can be studied using powerful machinery from algebraic geometry.\nRemark 1. The functional variety $V_{\\mathbf{d},r}$ may be significantly larger than the actual functional space\n$F_{\\mathbf{d},r}$, since the Zariski closure is typically larger than the closure with respect to the standard\nEuclidean topology. 
On the other hand, the dimensions of the spaces $V_{\\mathbf{d},r}$ and $F_{\\mathbf{d},r}$ agree, and the\nset $V_{\\mathbf{d},r}$ is usually \u201cnicer\u201d (it can be described by polynomial equations, whereas an exact implicit\ndescription of $F_{\\mathbf{d},r}$ may require inequalities).\n\n\u00b9 The Zariski closure of a set $X$ is the smallest set containing $X$ that can be described by polynomial\nequations.\n\n2.1 Examples\n\nWe present some examples that describe the functional variety $V_{\\mathbf{d},r}$ in simple cases.\nExample 2. A linear network is a polynomial network with $r = 1$. In this case, the network map\n$\\Phi_{\\mathbf{d},r} : \\mathbb{R}^{d_\\theta} \\to \\mathrm{Sym}_1(\\mathbb{R}^{d_0})^{d_h} \\cong \\mathbb{R}^{d_h \\times d_0}$ is simply matrix multiplication:\n\n$$\\theta = (W_h, W_{h-1}, \\ldots, W_1) \\mapsto p_\\theta = W_h W_{h-1} \\cdots W_1 x.$$\n\nThe functional space $F_{\\mathbf{d},r} \\subseteq \\mathbb{R}^{d_h \\times d_0}$ is the set of matrices with rank at most $d_{\\min} = \\min_i \\{d_i\\}$. This\nset is already characterized by polynomial equations, as the common zero set of all $(1 + d_{\\min}) \\times (1 + d_{\\min})$\nminors, so $F_{\\mathbf{d},r} = V_{\\mathbf{d},r}$ in this case. The dimension of $V_{\\mathbf{d},r} \\subset \\mathbb{R}^{d_h \\times d_0}$ is $d_{\\min}(d_0 + d_h - d_{\\min})$.\nExample 3. Consider $\\mathbf{d} = (2, 2, 3)$ and $r = 2$. The input variables are $x = [x_1, x_2]^T$, and the\nparameters $\\theta$ are the weights\n\n$$W_1 = \\begin{bmatrix} w_{111} & w_{112} \\\\ w_{121} & w_{122} \\end{bmatrix}, \\qquad W_2 = \\begin{bmatrix} w_{211} & w_{212} \\\\ w_{221} & w_{222} \\\\ w_{231} & w_{232} \\end{bmatrix}.$$\n\nThe network map $p_\\theta$ is a triple of quadratic polynomials in $x_1, x_2$ that can be written as\n\n$$W_2 \\rho_2 W_1 x = \\begin{bmatrix} w_{211}(w_{111}x_1 + w_{112}x_2)^2 + w_{212}(w_{121}x_1 + w_{122}x_2)^2 \\\\ w_{221}(w_{111}x_1 + w_{112}x_2)^2 + w_{222}(w_{121}x_1 + w_{122}x_2)^2 \\\\ w_{231}(w_{111}x_1 + w_{112}x_2)^2 + w_{232}(w_{121}x_1 + w_{122}x_2)^2 \\end{bmatrix}. \\tag{2}$$\n\nThe map $\\Phi_{(2,2,3),2}$ in (1) takes $W_1, W_2$ (which have $d_\\theta = 10$ parameters) to the three quadratics in\n$x_1, x_2$ displayed above. The quadratics have a total of $\\dim \\mathrm{Sym}_2(\\mathbb{R}^2)^3 = 9$ coefficients; however,\nthese coefficients are not arbitrary, i.e., not all possible triples of polynomials occur in the functional\nspace. Writing $c^{(k)}_{ij}$ for the coefficient of $x_i x_j$ in $p_{\\theta k}$ in (2) (with $k = 1, 2, 3$ and $i, j = 1, 2$), it is a\nsimple exercise to show that\n\n$$\\det \\begin{bmatrix} c^{(1)}_{11} & c^{(1)}_{12} & c^{(1)}_{22} \\\\ c^{(2)}_{11} & c^{(2)}_{12} & c^{(2)}_{22} \\\\ c^{(3)}_{11} & c^{(3)}_{12} & c^{(3)}_{22} \\end{bmatrix} = 0.$$\n\nThis cubic equation describes the functional variety $V_{(2,2,3),2}$, which is in this case an eight-dimensional\nsubset (hypersurface) of $\\mathrm{Sym}_2(\\mathbb{R}^2)^3 \\cong \\mathbb{R}^9$.\n\n2.2 Objectives\n\nThe main goal of this paper is to study the dimension of $V_{\\mathbf{d},r}$ as the network\u2019s architecture $\\mathbf{d}$ and\nthe activation degree $r$ vary. This dimension may be considered a precise and intrinsic measure of\nthe polynomial network\u2019s expressivity, quantifying degrees of freedom of the functional space. For\nexample, the dimension reflects the number of input/output pairs the network can interpolate, as each\nsample imposes one linear constraint on the variety $V_{\\mathbf{d},r}$.\nIn general, the variety $V_{\\mathbf{d},r}$ lives in the ambient space $\\mathrm{Sym}_{r^{h-1}}(\\mathbb{R}^{d_0})^{d_h}$, which in turn only depends\non the activation degree $r$, network depth $h$, and the input/output dimensions $d_0$ and $d_h$. We are thus\ninterested in the role of the intermediate widths in the dimension of $V_{\\mathbf{d},r}$.\nDefinition 4. A network architecture $\\mathbf{d} = (d_0, \\ldots, d_h)$ has a filling functional variety for the\nactivation degree $r$ if $V_{\\mathbf{d},r} = \\mathrm{Sym}_{r^{h-1}}(\\mathbb{R}^{d_0})^{d_h}$.\nIt is important to note that if the functional variety $V_{\\mathbf{d},r}$ is filling, then the actual functional space\n$F_{\\mathbf{d},r}$ (before taking closure) is in general only thick, i.e., it has positive Lebesgue measure in\n$\\mathrm{Sym}_{r^{h-1}}(\\mathbb{R}^{d_0})^{d_h}$ (see Remark 1). On the other hand, given an architecture with a thick functional\nspace, we can find another architecture whose functional space is the whole ambient space.\nProposition 5 (Filling functional space). Fix $r$ and suppose $\\mathbf{d} = (d_0, d_1, \\ldots, d_{h-1}, d_h)$ has a filling\nfunctional variety $V_{\\mathbf{d},r}$. Then the architecture d\u2032 = (d_0, 2d_1, . . . 
, 2d_{h-1}, d_h) has a filling functional\nspace, i.e., $F_{\\mathbf{d}',r} = \\mathrm{Sym}_{r^{h-1}}(\\mathbb{R}^{d_0})^{d_h}$.\nIn summary, while an architecture with a filling functional variety may not necessarily have a filling\nfunctional space, it is sufficient to double all the intermediate widths for this stronger condition to\nhold. As argued below, we expect architectures with thick/filling functional spaces to have more\nfavorable properties in terms of optimization and training. On the other hand, non-filling architectures\nmay lead to interesting functional spaces for capturing patterns in data. In fact, we show in Section 3.2\nthat non-filling architectures generalize families of low-rank tensors.\n\n2.3 Connection to optimization\n\nThe following two results illustrate that thick/filling functional spaces are helpful for optimization.\nProposition 6. If the closure of a set $C \\subset \\mathbb{R}^n$ is not convex, then there exists a convex function $f$\non $\\mathbb{R}^n$ whose restriction to $C$ has arbitrarily \u201cbad\u201d local minima (that is, there exist local minima\nwhose value is arbitrarily larger than that of a global minimum).\nProposition 7. If a functional space $F_{\\mathbf{d},r}$ is not thick, then it is not convex.\n\nThese two facts show that if the functional space is not thick, we can always find a convex loss\nfunction and a data distribution that lead to a landscape with arbitrarily bad local minima. There is\nalso an obvious weak converse, namely that if the functional space is filling, $F_{\\mathbf{d},r} = \\mathrm{Sym}_{r^{h-1}}(\\mathbb{R}^{d_0})^{d_h}$,\nthen any convex loss function on $F_{\\mathbf{d},r}$ will have a unique global minimum (although there may be\n\u201cspurious\u201d critical points that arise from the non-convex parameterization).\n\n3 Architecture dimensions\n\nIn this section, we begin our study of the dimension of $V_{\\mathbf{d},r}$. 
We describe the connection between\npolynomial networks and tensor decompositions for both shallow (Section 3.1) and deep (Section 3.2)\nnetworks, and we present some computational examples (Section 3.3).\n\n3.1 Shallow networks and tensors\n\nPolynomial networks with $h = 2$ are closely related to CP tensor decomposition [22]. Indeed, in the\nshallow case, we can verify that the network map $\\Phi_{(d_0,d_1,d_2),r}$ sends $W_1 \\in \\mathbb{R}^{d_1 \\times d_0}$, $W_2 \\in \\mathbb{R}^{d_2 \\times d_1}$ to:\n\n$$W_2 \\rho_r W_1 x = \\Big( \\sum_{i=1}^{d_1} W_2(:, i) \\otimes W_1(i, :)^{\\otimes r} \\Big) \\cdot x^{\\otimes r} =: \\Phi(W_2, W_1) \\cdot x^{\\otimes r}.$$\n\nHere $\\Phi(W_2, W_1) \\in \\mathbb{R}^{d_2} \\times \\mathrm{Sym}_r(\\mathbb{R}^{d_0})$ is a partially symmetric $d_2 \\times d_0^{\\times r}$ tensor, expressed as a sum\nof $d_1$ partially symmetric rank-1 terms, and $\\cdot$ denotes contraction of the last $r$ indices. Thus the\nfunctional space $F_{(d_0,d_1,d_2),r}$ is the set of partially symmetric tensors of rank at most $d_1$. Algorithms for\nlow-rank CP decomposition could be applied to $\\Phi(W_2, W_1)$ to recover $W_2$ and $W_1$. In particular,\nwhen $d_2 = 1$, we obtain a symmetric $d_0^{\\times r}$ tensor. For this case, we have the following.\nLemma 8. A shallow architecture $\\mathbf{d} = (d_0, d_1, 1)$ is filling for the activation degree $r$ if and only if\nevery symmetric tensor $T \\in \\mathrm{Sym}_r(\\mathbb{R}^{d_0})$ has rank at most $d_1$.\nFurthermore, the celebrated Alexander-Hirschowitz Theorem [1] from algebraic geometry provides\nthe dimension of $V_{\\mathbf{d},r}$ for all shallow, single-output architectures.\nTheorem 9 (Alexander-Hirschowitz). If $\\mathbf{d} = (d_0, d_1, 1)$, the dimension of $V_{\\mathbf{d},r}$ is given by\n\n$$\\min\\left( d_0 d_1, \\binom{d_0 + r - 1}{r} \\right),$$\n\nexcept for the following cases:\n\n\u2022 $r = 2$, $2 \\le d_1 \\le d_0 - 1$,\n\u2022 $r = 3$, $d_0 = 5$, $d_1 = 7$,\n\u2022 $r = 4$, $d_0 = 3$, $d_1 = 5$,\n\u2022 $r = 4$, $d_0 = 4$, $d_1 = 9$,\n\u2022 $r = 4$, $d_0 = 5$, $d_1 = 14$.\n\n3.2 Deep networks and tensors\n\nDeep polynomial networks also relate to a certain iterated tensor decomposition. We first note that the\nmap $\\Phi_{\\mathbf{d},r}$ may be expressed via the so-called Khatri-Rao product from multilinear algebra. 
Indeed $\\theta$\nmaps to:\n\n$$\\mathrm{SymRow}\\; W_h\\big( (W_{h-1} \\cdots (W_2 (W_1^{\\bullet r}))^{\\bullet r} \\cdots )^{\\bullet r} \\big). \\tag{3}$$\n\nHere the Khatri-Rao product operates on rows: for $M \\in \\mathbb{R}^{a \\times b}$, the power $M^{\\bullet r} \\in \\mathbb{R}^{a \\times b^r}$ replaces\neach row, $M(i, :)$, by its vectorized $r$-fold outer product, $\\mathrm{vec}(M(i, :)^{\\otimes r})$. Also in (3), SymRow\ndenotes symmetrization of rows, regarded as points in $(\\mathbb{R}^{d_0})^{\\otimes r^{h-1}}$, a certain linear operator.\nAnother viewpoint comes from using polynomials and inspecting the layers in reverse order. Writing\n$[p_{\\theta 1}, \\ldots, p_{\\theta d_{h-1}}]^T$ for the output polynomials at depth $h - 1$, the top output at depth $h$ is:\n\n$$w_{h11}\\, p_{\\theta 1}^r + w_{h12}\\, p_{\\theta 2}^r + \\ldots + w_{h1d_{h-1}}\\, p_{\\theta d_{h-1}}^r. \\tag{4}$$\n\nThis expresses a polynomial as a weighted sum of $r$-th powers of other (nonlinear) polynomials.\nRecently, a study of such decompositions has been initiated in the algebra community [24]. Such\nexpressions extend usual tensor decompositions, since weighted sums of powers of homogeneous\nlinear forms correspond to CP symmetric decompositions. Accounting for earlier layers, our neural\nnetwork expresses each $p_{\\theta i}$ in (4) as $r$-th powers of lower-degree polynomials at depth $h - 2$, and so forth.\nIterating the main result in [16] on decompositions of type (4), we obtain the following bound on\nfilling intermediate widths.\nTheorem 10 (Bound on filling widths). Suppose $\\mathbf{d} = (d_0, d_1, \\ldots, d_h)$ and $r \\ge 2$ satisfy\n\n$$d_{h-i} \\ge \\min\\left( d_h \\cdot r^{i(d_0 - 1)}, \\binom{r^{h-i} + d_0 - 1}{r^{h-i}} \\right)$$\n\nfor each $i = 1, \\ldots, h - 1$. Then the functional variety $V_{\\mathbf{d},r}$ is filling.\n\n3.3 Computational investigation of dimensions\n\nWe have written code\u00b2 in the mathematical software SageMath [12] that computes the dimension\nof $V_{\\mathbf{d},r}$ for a general architecture $\\mathbf{d}$ and activation degree $r$. Our approach is based on randomly\nselecting parameters \u03b8 = (W_h, . . . 
, W_1) and computing the rank of the Jacobian of $\\Phi_{\\mathbf{d},r}(\\theta)$ in (1).\nThis method is based on the following lemma, coming from the fact that the map $\\Phi_{\\mathbf{d},r}$ is algebraic.\nLemma 11. For all $\\theta \\in \\mathbb{R}^{d_\\theta}$, the rank of the Jacobian matrix $\\mathrm{Jac}\\, \\Phi_{\\mathbf{d},r}(\\theta)$ is at most the dimension\nof the variety $V_{\\mathbf{d},r}$. Furthermore, there is equality for almost all $\\theta$ (i.e., for a non-empty Zariski-open\nsubset of $\\mathbb{R}^{d_\\theta}$).\nThus if $\\mathrm{Jac}\\, \\Phi_{\\mathbf{d},r}(\\theta)$ is full rank at any $\\theta$, this provides a mathematical proof that $V_{\\mathbf{d},r}$ is filling. On the\nother hand, if the Jacobian is rank-deficient at random $\\theta$, this indicates with \u201cprobability 1\u201d that $V_{\\mathbf{d},r}$\nis not filling. We have implemented two variations of this strategy, by leveraging backpropagation.\nBoth work over a finite field $\\mathbb{F} = \\mathbb{Z}/p\\mathbb{Z}$ to avoid floating-point computations (for almost all primes $p$,\nthis provides the correct dimension over $\\mathbb{R}$).\n\n1. Backpropagation over a polynomial ring. We defined a network class over a ring\n$\\mathbb{F}[x_1, \\ldots, x_{d_0}]$, taking as input a vector of variables $x = (x_1, \\ldots, x_{d_0})$. Performing automatic\ndifferentiation (backpropagation) of the output function yields polynomials corresponding\nto $dp_\\theta(x)/dw$, for any entry $w$ of a weight matrix $W_i$. Extracting the coefficients of the\nmonomials in $x$, we recover the entries of the Jacobian of $\\Phi_{\\mathbf{d},r}(\\theta)$.\n\n2. Backpropagation over a finite field. We defined a network class over the finite field $\\mathbb{F} =\n\\mathbb{Z}/p\\mathbb{Z}$. After performing backpropagation at a sufficient number of random sample points $x$,\nwe can recover the entries of the Jacobian of $\\Phi_{\\mathbf{d},r}(\\theta)$ by solving a linear system (this system\nis overdetermined, but it will have an exact solution in finite field arithmetic).\n\nThe first algorithm is simpler and does not require interpolation, but is generally slower. We present\nexamples of some of our computations in Tables 1 and 2. Table 1 shows minimal architectures\nd = (d_0, . . . 
, d_h) that are filling, as the depth $h$ varies. Here, \u201cminimal\u201d is with respect to the partial\nordering comparing all widths. It is interesting to note that for deeper networks, there is not a unique\nminimally filling network. Also conspicuous is that minimal filling widths are \u201cunimodal\u201d, (weakly)\nincreasing and then (weakly) decreasing. Arguably, this pattern conforms with common wisdom.\nConjecture 12 (Minimal filling widths are unimodal). Fix $r$, $h$, $d_0$ and $d_h$. If $\\mathbf{d} = (d_0, d_1, \\ldots, d_h)$\nis a minimal filling architecture, there is $i$ such that $d_0 \\le d_1 \\le \\ldots \\le d_i$ and $d_i \\ge d_{i+1} \\ge \\ldots \\ge d_h$.\nTable 2 shows examples of computed dimensions, for varying architectures and degrees. Notice that\nthe dimension of an architecture stabilizes as the degree $r$ increases.\n\n\u00b2 Available at https://github.com/mtrager/polynomial_networks.\n\nTable 1: Minimal filling widths for $r = 2$, $d_0 = 2$, $d_h = 1$\n\nDepth ($h$) | Degree ($r^{h-1}$) | Minimal filling ($\\mathbf{d}$)\n3 | 4 | (2,2,2,1)\n4 | 8 | (2,3,3,2,1)\n5 | 16 | (2,3,3,3,2,1)\n6 | 32 | (2,3,3,4,4,2,1)\n7 | 64 | (2,3,4,5,6,4,2,1)\n8 | 128 | (2,3,4,5,7,7,6,2,1) or (2,3,5,5,7,7,5,2,1)\n9 | 256 | (2,3,4,8,8,8,8,8,4,1) or (2,3,4,5,8,9,8,8,4,1)\n\nTable 2: Examples of dimensions of $V_{\\mathbf{d},r}$\n\nArchitecture | $r = 2$ | $r = 3$ | $r = 4$ | $r = 5$ | $r = 6$\n$\\mathbf{d} = (3, 2, 1)$ | 5 | 6 | 6 | 6 | 6\n$\\mathbf{d} = (2, 3, 2)$ | 6 | 8 | 9 | 9 | 9\n$\\mathbf{d} = (2, 3, 2, 3)$ | 10 | 12 | 13 | 13 | 13\n$\\mathbf{d} = (2, 3, 2, 3, 4)$ | 16 | 21 | 22 | 22 | 22\n\n4 General results\n\nThis section presents general results on the dimension of $V_{\\mathbf{d},r}$. We begin by pointing out symmetries\nin the network map $\\Phi_{\\mathbf{d},r}$, under suitable scaling and permutation.\nLemma 13 (Multi-homogeneity). For arbitrary invertible diagonal matrices $D_i \\in \\mathbb{R}^{d_i \\times d_i}$ and\npermutation matrices $P_i \\in \\mathbb{Z}^{d_i \\times d_i}$ (i = 1, . . . 
, h \u2212 1), the map $\\Phi_{\\mathbf{d},r}$ returns the same output under\nthe replacement:\n\n$$\\begin{aligned} W_1 &\\leftarrow P_1 D_1 W_1 \\\\ W_2 &\\leftarrow P_2 D_2 W_2 D_1^{-r} P_1^T \\\\ W_3 &\\leftarrow P_3 D_3 W_3 D_2^{-r} P_2^T \\\\ &\\;\\;\\vdots \\\\ W_h &\\leftarrow W_h D_{h-1}^{-r} P_{h-1}^T. \\end{aligned}$$\n\nThus the dimension of a generic fiber (pre-image) of $\\Phi_{\\mathbf{d},r}$ is at least $\\sum_{i=1}^{h-1} d_i$.\nOur next result deduces a general upper bound on the dimension of $V_{\\mathbf{d},r}$. Conditional on a standalone\nconjecture in algebra, we prove that equality in the bound is achieved for all sufficiently high activation\ndegrees $r$. An unconditional result is achieved by varying the activation degrees per layer.\nTheorem 14 (Naive bound and equality for high activation degree). If $\\mathbf{d} = (d_0, \\ldots, d_h)$, then\n\n$$\\dim V_{\\mathbf{d},r} \\le \\min\\left( d_h + \\sum_{i=1}^{h} (d_{i-1} - 1) d_i, \\; d_h \\binom{d_0 + r^{h-1} - 1}{r^{h-1}} \\right). \\tag{5}$$\n\nConditional on Conjecture 16, for fixed $\\mathbf{d}$ satisfying $d_i > 1$ ($i = 1, \\ldots, h - 1$), there exists $\\tilde{r} = \\tilde{r}(\\mathbf{d})$\nsuch that whenever $r > \\tilde{r}$, we have an equality in (5). Unconditionally, for fixed $\\mathbf{d}$ satisfying\n$d_i > 1$ ($i = 1, \\ldots, h - 1$), there exist infinitely many $(r_{h-1}, r_{h-2}, \\ldots, r_1)$ such that the image of\n$(W_h, \\ldots, W_1) \\mapsto W_h \\rho_{r_{h-1}} W_{h-1} \\rho_{r_{h-2}} \\cdots \\rho_{r_1} W_1 x$ has dimension $d_h + \\sum_i (d_{i-1} - 1) d_i$.\nProposition 15. Given positive integers $d, k, s$, there exists $\\tilde{r} = \\tilde{r}(d, k, s)$ with the following property.\nWhenever $p_1, \\ldots, p_k \\in \\mathbb{R}[x_1, \\ldots, x_d]$ are $k$ homogeneous polynomials of the same degree $s$ in $d$\nvariables, no two of which are linearly dependent, then $p_1^r, \\ldots, p_k^r$ are linearly independent if $r > \\tilde{r}$.\n\nConjecture 16. In the setting of Proposition 15, $\\tilde{r}$ may be taken to depend only on $d$ and $k$.\n\nProposition 15 and Conjecture 16 are used in induction on $h$ for the equality statements in Theorem 14.\nWe remark that following our arXiv version of this paper, progress toward Conjecture 16 was made\nin [30]. There, it is shown that there exists $r$ between $1$ and $k!$ 
such that $p_1^r, \\ldots, p_k^r$ are linearly\nindependent; however, it remains open whether there exists $\\tilde{r}$ as we conjecture.\nThe next result uses the iterative nature of neural networks to provide a recursive dimension bound.\nProposition 17 (Recursive Bound). For all $(d_0, \\ldots, d_k, \\ldots, d_h)$ and $r$, we have:\n\n$$\\dim V_{(d_0,\\ldots,d_h),r} \\le \\dim V_{(d_0,\\ldots,d_k),r} + \\dim V_{(d_k,\\ldots,d_h),r} - d_k.$$\n\nUsing the recursive bound, we can prove an interesting bottleneck property for polynomial networks.\nDefinition 18. The width $d_i$ in layer $i$ is an asymptotic bottleneck (for $r$, $d_0$ and $i$) if there exists $\\tilde{h}$\nsuch that for all $h > \\tilde{h}$ and all $d_1, \\ldots, d_{i-1}, d_{i+1}, \\ldots, d_h$, then the widths $(d_0, d_1, \\ldots, d_i, \\ldots, d_h)$\nare non-filling.\n\nThis expresses our finding that too narrow a layer can \u201cchoke\u201d a polynomial network, such that there\nis no hope of filling the ambient space, regardless of how wide elsewhere or how deep the network is.\nTheorem 19 (Bottlenecks). If $r \\ge 2$, $d_0 \\ge 2$, $i \\ge 1$, then $d_i = 2d_0 - 2$ is an asymptotic bottleneck.\nMoreover conditional on Conjecture 2 in [28], then $d_i = 2d_0$ is not an asymptotic bottleneck.\nProposition 17 affords a simple proof that $d_i = d_0 - 1$ is an asymptotic bottleneck. However to obtain\nthe full statement of Theorem 19, we seem to need more powerful tools from algebraic geometry.\n\n5 Conclusion\n\nWe have studied the functional space of neural networks from a novel perspective. Deep polynomial\nnetworks furnish a framework for nonlinear networks, to which the powerful mathematical machinery\nof algebraic geometry may be applied. In this respect, we believe polynomial networks can help us\naccess a better understanding of deep nonlinear architectures, for which a precise theoretical analysis\nhas been extremely difficult to obtain. Furthermore, polynomials can be used to approximate any\ncontinuous activation function over any compact support (Stone-Weierstrass theorem). 
For these\nreasons, developing a theory of deep polynomial networks is likely to pay dividends in building\nunderstanding of general neural networks.\nIn this paper, we have focused our attention on the dimension of the functional space of polynomial\nnetworks. The dimension is the first and most basic descriptor of an algebraic variety, and in this\ncontext it provides an exact measure of the expressive power of an architecture. Our novel theoretical\nresults include a general formula for the dimension of the architecture attained in high degree, as well\nas a tight lower bound and nontrivial upper bounds on the width of layers in order for the functional\nvariety to be filling. We have also demonstrated intriguing connections with tensor and polynomial\ndecompositions, including some which appear in very recent literature in algebraic geometry.\nThe tools and concepts introduced in this work for fully connected feedforward polynomial networks\ncan be applied in principle to more general algebraic network architectures. Variations of our algebraic\nmodel could include multiple polynomial activations (rather than just single exponentiations) or\nmore complex connectivity patterns of the network (convolutions, skip connections, etc.). The\nfunctional varieties of these architectures could be studied in detail and compared. Another possible\nresearch direction is a geometric study of the functional varieties, beyond the simple dimension. For\nexample, the degree or the Euclidean distance degree [13] of these varieties could be used to bound\nthe number of critical points of a loss function. Additionally, motivated by Section 3.2, we would\nlike to develop computational methods for constructing a network architecture that represents an\nassigned polynomial mapping. Such algorithms might lead to \u201cclosed form\u201d approaches for learning\nusing polynomial networks (similar to SVD or tensor decomposition), as a provable counterpoint to\ngradient descent methods. Our research program might also shed light on the practical problem of\nchoosing an appropriate architecture for a given application.\n\nAcknowledgements\n\nWe thank Justin Chen, Amit Moscovich, Claudiu Raicu and Steven Sam for helpful conversations. JK\nwas partially supported by the Simons Collaboration on Algorithms and Geometry. MT and JB were\npartially supported by the Alfred P. Sloan Foundation, NSF RI-1816753 and Samsung Electronics.\n\nReferences\n\n[1] James Alexander and Andr\u00e9 Hirschowitz. Polynomial interpolation in several variables. Journal of Algebraic Geometry, 4(2):201\u2013222, 1995.\n\n[2] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. In International Conference on Learning Representations, 2019.\n\n[3] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: implicit acceleration by overparameterization. In International Conference on Machine Learning, pages 244\u2013253, 2018.\n\n[4] Pranav Bisht. On hitting sets for special depth-4 circuits. Master\u2019s thesis, Indian Institute of Technology Kanpur, 2017.\n\n[5] Grigoriy Blekherman and Zach Teitler. On maximum, typical and generic ranks. Mathematische Annalen, 362(3-4):1021\u20131031, 2015.\n\n[6] Winfried Bruns and J\u00fcrgen Herzog. Cohen-Macaulay rings, volume 39 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 1993.\n\n[7] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, pages 3036\u20133046, 2018.\n\n[8] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: a tensor analysis. In Conference on Learning Theory, pages 698\u2013728, 2016.\n\n[9] Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning, pages 955\u2013963, 2016.\n\n[10] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303\u2013314, 1989.\n\n[11] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666\u2013674, 2011.\n\n[12] The Sage Developers. SageMath, the Sage Mathematics Software System (Version 8.0.0), 2017. http://www.sagemath.org.\n\n[13] Jan Draisma, Emil Horobe\u021b, Giorgio Ottaviani, Bernd Sturmfels, and Rekha R. Thomas. The Euclidean distance degree of an algebraic variety. Foundations of Computational Mathematics, 16(1):99\u2013149, 2016.\n\n[14] Simon S. Du and Jason D. Lee. On the power of over-parametrization in neural networks with quadratic activation. In International Conference on Machine Learning, pages 1329\u20131338, 2018.\n\n[15] David Eisenbud. Commutative algebra: with a view toward algebraic geometry, volume 150 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1995.\n\n[16] Ralf Fr\u00f6berg, Giorgio Ottaviani, and Boris Shapiro. On the Waring problem for polynomial rings. Proceedings of the National Academy of Sciences, 109(15):5600\u20135602, 2012.\n\n[17] Joe Harris. Algebraic geometry: a first course, volume 133 of Graduate Texts in Mathematics. Springer-Verlag, New York, corrected 3rd print edition, 1995.\n\n[18] Robin Hartshorne. Algebraic geometry, volume 52 of Graduate Texts in Mathematics. Springer-Verlag, New York-Heidelberg, corrected 8th print edition, 1997.\n\n[19] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359\u2013366, 1989.\n\n[20] Hamza Jaffali and Luke Oeding. Learning algebraic models of quantum entanglement. arXiv preprint arXiv:1908.10247, 2019.\n\n[21] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586\u2013594, 2016.\n\n[22] J. M. Landsberg. Tensors: geometry and applications, volume 128 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2012.\n\n[23] Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861\u2013867, 1993.\n\n[24] Samuel Lundqvist, Alessandro Oneto, Bruce Reznick, and Boris Shapiro. On generic and maximal k-ranks of binary forms. Journal of Pure and Applied Algebra, 223(5):2062\u20132079, 2019.\n\n[25] James Martens and Venkatesh Medabalimi. On the expressive efficiency of sum product networks. arXiv preprint arXiv:1411.7717, 2014.\n\n[26] Dhagash Mehta, Tianran Chen, Tingting Tang, and Jonathan D. Hauenstein. The loss surface of deep linear networks viewed through the algebraic geometry lens. arXiv preprint arXiv:1810.07716, 2018.\n\n[27] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):7665\u20137671, 2018.\n\n[28] Lisa Nicklasson. On the Hilbert series of ideals generated by generic forms. Communications in Algebra, 45(8):3390\u20133395, 2017.\n\n[29] Hoifung Poon and Pedro Domingos. Sum-product networks: a new deep architecture. arXiv preprint arXiv:1202.3732, 2012.\n\n[30] Steven V. Sam and Andrew Snowden. Linear independence of power. arXiv preprint arXiv:1907.02659, 2019.\n\n[31] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742\u2013769, 2019.\n\n[32] Matthew Trager, Kathl\u00e9n Kohn, and Joan Bruna. Pure and spurious critical points: a geometric study of linear networks. arXiv preprint arXiv:1910.01671, 2019.\n\n[33] Luca Venturi, Afonso S. Bandeira, and Joan Bruna. Spurious valleys in two-layers neural network optimization landscapes. arXiv preprint arXiv:1802.06384, 2018.", "award": [], "sourceid": 5442, "authors": [{"given_name": "Joe", "family_name": "Kileel", "institution": "Princeton University"}, {"given_name": "Matthew", "family_name": "Trager", "institution": "NYU"}, {"given_name": "Joan", "family_name": "Bruna", "institution": "NYU"}]}