{"title": "Splines, Rational Functions and Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1040, "page_last": 1047, "abstract": null, "full_text": "Splines, Rational Functions and Neural Networks \n\nRobert C. Willialnson \n\nDepartment of Systems Engineering \n\nAustralian National University \n\nCanberra, 2601 \n\nAustralia \n\nPeter L. Bartlett \n\nDepartment of Electrical Engineering \n\nUniversity of Queensland \n\nQueensland, 4072 \n\nAustralia \n\nAbstract \n\nConnections between spline approximation, approximation with rational \nfunctions, and feedforward neural networks are studied. The potential \nimprovement in the degree of approximation in going from single to two \nhidden layer networks is examined. Some results of Birman and Solomjak \nregarding the degree of approximation achievable when knot positions are \nchosen on the basis of the probability distribution of examples rather than \nthe function values are extended. \n\n1 \n\nINTRODUCTION \n\nFeedforward neural networks have been proposed as parametrized representations \nsuitable for nonlinear regression. Their approximation theoretic properties are still \nnot well understood. This paper shows some connections with the more widely \nknown methods of spline and rational approximation. A result due to Vitushkin is \napplied to determine the relative improvement in degree of approximation possible \nby having more than one hidden layer. Furthermore, an approximation result rel(cid:173)\nevant to statistical regression originally due to Birman and Solomjak for Sobolev \nspace approximation is extended to more general Besov spaces. The two main \nresults are theorems 3.1 and 4.2. \n\n1040 \n\n\fSplines, Rational Functions and Neural Networks \n\n1041 \n\n2 SPLINES AND RATIONAL FUNCTIONS \n\nThe two most widely studied nonlinear approximation methods are splines with free \nknots and rational functions. It is natural to ask what connection, if any, these have \nwith neural networks. It is already known that splines with free knots and rational \nfunctions are closely related, as Petrushev and Popov's remarkable result shows: \n\nTheorem 2.1 ([10, chapter 8]) Let \n\nRn(J)p := inf{llf -\nS!(f)p := inf{lIf - slip: s a spline of degree k - 1 with n - 1 free knots}. \n\nrllp: r a rational function of degree n} \n\nIf f E Lp [a, b],oo < a < b < 00, 1 < p < 00, k ~ 1, 0 < a < k, then \nif and only if S~(f)p = D(n- a). \n\nRn(f)p = D(n- a) \n\nIn both cases the efficacy of the methods can be understood in terms of their \nflexibility in partitioning the domain of definition: the partitioning amounts to a \n\"balancing\" of the error of local linear approximation [4]. \nThere is an obvious connection between single hidden layer neural networks and \nsplines. For example, replacing the sigmoid (1 + e-x)-l by the piecewise linear \nfunction (Ix + II-Ix - 11)/2 results in networks that are in one dimension splines, \nand in d dimensions can be written in \"Canonical Piecewise Linear\" form [3]: \n\nf( x) := a + bT x + L Ci I aT x - fJi I \n\ni=l \n\ndefines f: IR d --+ IR, where a, Ci, fJi E IR and b, O'i E IR d. Note that canonical piece(cid:173)\nwise linear representations are unique on a compact domain if we use the form \nf(x) := L;~ll Ci laT x -11. Multilayer piecewise linear nets are not generally canon(cid:173)\nical piecewise linear: Let g(x) := Ix+y-ll-lx+y+ll-lx-y+ll-lx-y-ll+x+y. 
The connection between certain single hidden layer networks and rational functions has been exploited in [13]. \n\n
3 COMPOSITIONS OF RATIONAL FUNCTIONS \n\n
There has been little effort in the nonlinear approximation literature in understanding nonlinearly parametrized approximation classes \"more complex\" than splines or rational functions. Multiple hidden layer neural networks are in this more complex class. As a first step towards understanding the utility of these representations we now consider the degree of approximation of certain smooth function classes by rational functions or compositions of rational functions in the sup-metric. A function $\phi: \mathbb{R} \to \mathbb{R}$ is rational of degree $\pi$ if $\phi$ can be expressed as a ratio of polynomials in $x \in \mathbb{R}$ of degree at most $\pi$. Thus \n\n
(3.1) $\phi_\theta := \phi_\theta(x) := \frac{\sum_{i=1}^{\pi} \alpha_i x^i}{\sum_{i=1}^{\pi} \beta_i x^i}$, $\quad x \in \mathbb{R}$, $\quad \theta := [\alpha, \beta]$. \n\n
Let $\sigma_\pi(f, \phi) := \inf\{\|f - \phi_\theta\| : \deg \phi \le \pi\}$ denote the degree of approximation of $f$ by a rational function of degree $\pi$ or less. Let $\psi := \phi \circ \rho$, where $\phi$ and $\rho$ are rational functions, $\rho: \mathbb{R} \times \Theta_\rho \to \mathbb{R}$ and $\phi: \mathbb{R} \times \Theta_\phi \to \mathbb{R}$, both of degree $\pi$. Let $\mathbb{F}$ be some function space (metrized by $\|\cdot\|_\infty$) and let $\sigma_\pi(\mathbb{F}, \cdot) := \sup\{\sigma_\pi(f, \cdot) : f \in \mathbb{F}\}$ denote the degree of approximation of the function class $\mathbb{F}$. \n\n
Theorem 3.1 Let $\mathbb{F}_\alpha := W_\infty^\alpha(\Omega)$ denote the Sobolev space of functions from a compact subset $\Omega \subset \mathbb{R}$ to $\mathbb{R}$ with $s := \lfloor \alpha \rfloor$ continuous derivatives and the $s$th derivative satisfying a Lipschitz condition with order $\alpha - s$. Then there exist positive constants $c_1$ and $c_2$ not depending on $\pi$ such that for sufficiently large $\pi$ \n\n
(3.2) $\sigma_\pi(\mathbb{F}_\alpha, \rho) \ge c_1 \left(\frac{1}{2\pi}\right)^{\alpha}$ \n\n
and \n\n
(3.3) $\sigma_\pi(\mathbb{F}_\alpha, \psi) \ge c_2 \left(\frac{1}{4\pi \log(\pi+1)}\right)^{\alpha}$. \n\n
Note that (3.2) is tight: it is achievable. Whether (3.3) is achievable is unknown. The proof is a consequence of Theorem 3.4. The above result, although only for rational functions of a single variable, suggests that no great benefit in terms of degree of approximation is to be obtained by using multiple hidden layer networks. \n\n
3.1 PROOF OF THEOREM \n\n
Definition 3.2 Let $\Gamma^d \subset \mathbb{R}^d$. A map $r: \Gamma^d \to \mathbb{R}$ is called a piecewise rational function of degree $k$ with barrier $b_q^d$ of order $q$ if there is a polynomial $b_q^d$ of degree $q$ in $x \in \Gamma^d$ such that on any connected component $\Gamma_i \subset \Gamma^d \setminus \{x : b_q^d(x) = 0\}$, $r$ is a rational function on $\Gamma_i$ of degree $k$: \n\n
$r := r(x) := \frac{P_{k,i}^d(x)}{Q_{k,i}^d(x)}, \qquad P_{k,i}^d, Q_{k,i}^d \in \mathbb{R}[x]$. \n\n
Note that at any point $x \in \bar{\Gamma}_i \cap \bar{\Gamma}_j$ ($i \ne j$), $r$ is not necessarily single valued. \n\n
Definition 3.3 Let $\mathbb{F}$ be some function class defined on a set $G$ metrized with $\|\cdot\|_\infty$ and let $\Theta = \mathbb{R}^\nu$. Then $F_\varepsilon^{k,q}: G \times \Theta \to \mathbb{R}$, $F_\varepsilon^{k,q}: (x, \theta) \mapsto F(x, \theta)$, where \n\n
1. $F(x, \theta)$ is a piecewise rational function of $\theta$ of degree $k$ or less with barrier $b_q^{\nu,x}$ (possibly depending on $x$) of order $q$; \n
2. for all $f \in \mathbb{F}$ there is a $\theta \in \Theta$ such that $\|f - F(\cdot, \theta)\| \le \varepsilon$; \n\n
is called an $\varepsilon$-representation of $\mathbb{F}$ of degree $k$ and order $q$. \n\n
Theorem 3.4 ([12, page 191, theorem 1]) If $F_\varepsilon^{k,q}$ is an $\varepsilon$-representation of $\mathbb{F}_\alpha$ of degree $k$ and order $q$ with barrier $b$ not depending on $x$, then for sufficiently small $\varepsilon$ \n\n
(3.4) $\nu \log[(q+1)(k+1)] \ge C \left(\frac{1}{\varepsilon}\right)^{1/\alpha}$ \n\n
where $C$ is a constant not dependent on $\varepsilon$, $\nu$, $k$ or $q$. \n\n
Theorem 3.4 holds for any $\varepsilon$-representation $F$ and therefore (by rearrangement of (3.4) and setting $\nu = 2\pi$) \n\n
(3.5) $\sigma_\pi(\mathbb{F}, F) \ge c \left(\frac{1}{2\pi \log[(q+1)(k+1)]}\right)^{\alpha}$. \n\n
Now $\phi_\theta$ given by (3.1) is, for any given and fixed $x \in \mathbb{R}$, a piecewise rational function of $\theta$ of degree 1 with barrier of degree 0 (no barrier is actually required). Thus (3.5) immediately gives (3.2). \n\n
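Before turning to the composition $\psi$, it is worth seeing numerically how little separates the two lower bounds in Theorem 3.1. The following sketch sets the unknown constants $c_1 = c_2 = 1$ purely for illustration: \n\n
import math \n\n
# Lower bounds of Theorem 3.1 with c1 = c2 = 1 (illustrative only). \n
alpha = 2.0 \n
for p in [10, 100, 1000, 10000]: \n
    single = (1.0 / (2.0 * p)) ** alpha                        # bound (3.2) \n
    composed = (1.0 / (4.0 * p * math.log(p + 1.0))) ** alpha  # bound (3.3) \n
    print(f'pi={p:6d}  (3.2): {single:.3e}  (3.3): {composed:.3e}  ratio: {single / composed:.1f}') \n\n
The ratio of the two bounds grows only like $(2 \log(\pi+1))^{\alpha}$, which quantifies the sense in which a second hidden layer buys little here. \n\n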
Now consider $\psi_\theta = \phi \circ \rho$ where \n\n
$\phi = \frac{\sum_{i=1}^{\pi_\phi} \alpha_i y^i}{\sum_{i=1}^{\pi_\phi} \beta_i y^i}$ ($y \in \mathbb{R}$) $\qquad$ and $\qquad$ $\rho = \frac{\sum_{j=1}^{\pi_\rho} \gamma_j x^j}{\sum_{j=1}^{\pi_\rho} \delta_j x^j}$ ($x \in \mathbb{R}$). \n\n
Direct substitution and rearrangement gives \n\n
$\psi_\theta = \frac{\sum_{i=1}^{\pi_\phi} \alpha_i \left[\sum_{j=1}^{\pi_\rho} \gamma_j x^j\right]^i \left[\sum_{j=1}^{\pi_\rho} \delta_j x^j\right]^{\pi_\phi - i}}{\sum_{i=1}^{\pi_\phi} \beta_i \left[\sum_{j=1}^{\pi_\rho} \gamma_j x^j\right]^i \left[\sum_{j=1}^{\pi_\rho} \delta_j x^j\right]^{\pi_\phi - i}}$ \n\n
where we write $\theta = [\alpha, \beta, \gamma, \delta]$ and for simplicity set $\pi_\phi = \pi_\rho = \pi$. Thus $\dim \theta = 4\pi =: \nu$. For arbitrary but fixed $x$, $\psi$ is a rational function of $\theta$ of degree $k = \pi$. No barrier is needed, so $q = 0$, and hence by (3.4) \n\n
$\sigma_\pi(\mathbb{F}_\alpha, \psi) \ge c_2 \left(\frac{1}{4\pi \log(\pi+1)}\right)^{\alpha}$. \n\n
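The rearrangement above is easy to check symbolically. The following sketch (sympy and the small degree $\pi = 3$ are our choices, not the paper's) verifies the identity and the parameter count $\nu = 4\pi$: \n\n
import sympy as sp \n\n
deg = 3  # illustrative small degree pi \n
x, y = sp.symbols('x y') \n
a = sp.symbols(f'a1:{deg + 1}')  # alpha_1 .. alpha_pi \n
b = sp.symbols(f'b1:{deg + 1}')  # beta_i \n
g = sp.symbols(f'g1:{deg + 1}')  # gamma_j \n
d = sp.symbols(f'd1:{deg + 1}')  # delta_j \n\n
phi = sum(a[i] * y**(i + 1) for i in range(deg)) / sum(b[i] * y**(i + 1) for i in range(deg)) \n
N = sum(g[j] * x**(j + 1) for j in range(deg))  # numerator of rho \n
D = sum(d[j] * x**(j + 1) for j in range(deg))  # denominator of rho \n\n
composed = phi.subs(y, N / D)  # psi_theta = phi(rho(x)) \n
claimed = (sum(a[i] * N**(i + 1) * D**(deg - i - 1) for i in range(deg)) \n
           / sum(b[i] * N**(i + 1) * D**(deg - i - 1) for i in range(deg))) \n\n
assert sp.cancel(composed - claimed) == 0  # the two forms agree identically \n
print('dim theta =', 4 * deg)  # nu = 4 * pi parameters \n\n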
3.2 OPEN PROBLEMS \n\n
An obvious further question is whether results as in the previous section hold for multivariable approximation, perhaps for multivariable rational approximation. \n\n
A popular method of $d$-dimensional nonlinear spline approximation uses dyadic splines [2, 5, 8]. These are piecewise polynomial representations where the partition used is a dyadic decomposition. Given that such a partition $\Xi$ is a subset of a partition generated by the zero level set of a barrier polynomial of degree $\le |\Xi|$, can Vitushkin's results be applied to this situation? Note that in Vitushkin's theory it is the parametrization that is piecewise rational (PR), not the representation. What connections are there in general (if any) between PR representations and PR parametrizations? \n\n
4 DEGREE OF APPROXIMATION AND LEARNING \n\n
Determining the degree of approximation for given parametrized function classes is not only of curiosity value. It is now well understood that the statistical sample complexity of learning depends on the size of the approximating class. Ideally the approximating class is small whilst approximating as large an approximated class as possible. Furthermore, in order to make statements such as in [1] regarding the overall degree of approximation achieved by statistical learning, the classical degree of approximation is required. \n\n
For regression purposes the metric used is $L_{p,\mu}$, where \n\n
$\|f\|_{L_{p,\mu}} := \left( \int |f(x)|^p \, d\mu(x) \right)^{1/p}$ \n\n
and $\mu$ is a probability measure. Ideally one would like to avoid calculating the degree of approximation for an endless series of different function spaces. Fortunately, for the case of spline approximation (with free knots) this is not necessary because (thanks to Petrushev and others) there now exist both direct and converse theorems characterizing such approximation classes. Let $S_n(f)_p$ denote the error of $n$-knot spline approximation in $L_p[0,1]$. Let $I$ denote the identity operator and $T(h)$ the translation operator ($T(h)(f,x) := f(x+h)$), and let $\Delta_h^k := (T(h) - I)^k$, $k = 1, 2, \ldots$, be the difference operators. The modulus of smoothness of order $k$ for $f \in L_p(\Omega)$ is \n\n
$\omega_k(f,t)_p := \sup_{|h| \le t} \|\Delta_h^k f(\cdot)\|_{L_p(\Omega)}$. \n\n
Petrushev [9] has obtained \n\n
Theorem 4.1 Let $\tau = (\alpha + 1/p)^{-1}$. Then \n\n
(4.1) $\sum_{n=1}^{\infty} \left[ n^{\alpha} S_n(f)_p \right]^{\tau} \frac{1}{n} < \infty$ \n\n
if and only if \n\n
(4.2) $\left( \int_0^1 \left( t^{-\alpha} \omega_k(f,t)_\tau \right)^{\tau} \frac{dt}{t} \right)^{1/\tau} < \infty$. \n\n
The somewhat strange quantity in (4.2) is the norm of $f$ in a Besov space $B^{\alpha}_{\tau,\tau,k}$. Note that for $\alpha$ large enough, $\tau < 1$; that is, the smoothness is measured in an $L_p$ ($p < 1$) space. More generally [11], we have (on domain $[0,1]$) \n\n
$\|f\|_{B^{\alpha}_{p,q,k}} := \left( \int_0^1 \left( t^{-\alpha} \omega_k(f,t)_p \right)^{q} \frac{dt}{t} \right)^{1/q}$. \n\n
Besov spaces are generalizations of classical smoothness spaces such as Sobolev spaces (see [11]). \n\n
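The modulus of smoothness can be estimated numerically by brute force. In the sketch below (the grid resolutions and the kinked test function are our own choices) $\omega_2(f,t)_1$ for $f(x) = |x - 1/2|$ decays like $t^2$, illustrating how $\omega_k$ measures smoothness beyond what pointwise derivatives can see: \n\n
import math \n
import numpy as np \n\n
def modulus_of_smoothness(f, k, t, p, m=2000): \n
    '''Approximate sup_{0<h<=t} ||Delta_h^k f||_{L_p(0, 1-kh)} on a grid.''' \n
    best = 0.0 \n
    for h in np.linspace(0.0, t, 50)[1:]: \n
        x = np.linspace(0.0, 1.0 - k * h, m) \n
        # k-th difference (T(h) - I)^k f(x), expanded binomially \n
        diff = sum((-1)**(k - j) * math.comb(k, j) * f(x + j * h) for j in range(k + 1)) \n
        lp = (np.mean(np.abs(diff)**p) * (1.0 - k * h))**(1.0 / p)  # L_p quasi-norm \n
        best = max(best, lp) \n
    return best \n\n
f = lambda x: np.abs(x - 0.5)  # one kink: here omega_2(f, t)_1 = 2 t^2 exactly \n
for t in [0.2, 0.1, 0.05]: \n
    print(f't={t:<5}  omega_2(f,t)_1 ~ {modulus_of_smoothness(f, 2, t, 1.0):.5f}') \n\n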
We are interested in approximation in $L_{p,\mu}$ and, following Birman and Solomjak [2], ask what degree of approximation in $L_{p,\mu}$ can be obtained when the knot positions are chosen according to $\mu$ rather than $f$. This is of interest because it makes the problem of determining the parameter values on the basis of observations linear. \n\n
Theorem 4.2 Let $f \in L_{p,\mu}$ where $\mu \in L_\lambda$ for some $\lambda > 1$ and is absolutely continuous. Choose the $n$ knot positions of a spline approximant $v$ to $f$ on the basis of $\mu$ only. Then for all such $f$ there is a constant $c$ not dependent on $n$ such that \n\n
(4.3) $\|f - v\|_{L_{p,\mu}} \le c \, n^{-\alpha} \, \|f\|_{B^{\alpha}_{\sigma,\sigma,k}}$ \n\n
where $\sigma = (\alpha + (1 - \lambda^{-1})p^{-1})^{-1}$ and $p < \sigma$. The constant $c$ depends on $\mu$ and $\lambda$. \n\n
If $p \ge 1$ and $\sigma \le p$ then, for any $\lambda \ge 1$ and any $\alpha < \sigma^{-1}$, for all $f$ under the conditions above there is a $v$ such that \n\n
(4.4) $\|f - v\|_{L_{p,\mu}} \le c \, n^{-(\alpha + 1/p - 1/\sigma)} \, \|f\|_{B^{\alpha}_{\sigma,\sigma,k}}$ \n\n
and again $c$ depends on $\mu$ and $\lambda$ but does not depend on $n$. \n\n
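The point of Theorem 4.2 is that once the knots are fixed from $\mu$ alone, fitting the spline to data is a linear problem. The following sketch (the density, target function and sample size are hypothetical choices; the equal-mass knot rule anticipates the construction used in the proof below) illustrates this: \n\n
import numpy as np \n\n
rng = np.random.default_rng(0) \n
lam = 2.0 \n
density = lambda x: 2.0 * x  # d(mu)/dx on [0, 1]: a non-uniform choice \n\n
# Knots from mu alone: cells of equal mass of (d(mu)/dx)^lambda, found via \n
# the inverse CDF of the normalised weight on a fine grid. \n
grid = np.linspace(0.0, 1.0, 10001) \n
cdf = np.cumsum(density(grid)**lam) \n
cdf /= cdf[-1] \n
knots = np.interp(np.linspace(0.0, 1.0, 11)[1:-1], cdf, grid)  # 9 interior knots \n\n
# CPWL basis 1, x, |x - knot_i| (section 2): fitting is linear least squares. \n
def design(x): \n
    return np.column_stack([np.ones_like(x), x] + [np.abs(x - t) for t in knots]) \n\n
f = lambda x: np.sin(2.0 * np.pi * x**2)  # hypothetical target \n
xs = rng.uniform(0.0, 1.0, 500)**0.5  # samples from mu (CDF x^2 => x = sqrt(u)) \n
coef, *_ = np.linalg.lstsq(design(xs), f(xs), rcond=None) \n
resid = f(xs) - design(xs) @ coef \n
print('RMS error on the sample:', float(np.sqrt(np.mean(resid**2)))) \n\n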
Proof First we prove (4.3). Let $[0,1]$ be partitioned by $\Xi$. Thus if $v$ is the approximant to $f$ on $[0,1]$ we have \n\n
$\|f - v\|_{L_{p,\mu}}^p = \sum_{\Delta \in \Xi} \|f - v\|_{L_{p,\mu}(\Delta)}^p = \sum_{\Delta \in \Xi} \int_\Delta |f(x) - v(x)|^p \, d\mu(x)$. \n\n
By Hölder's inequality, \n\n
$\int_\Delta |f(x) - v(x)|^p \, d\mu(x) = \int_\Delta |f - v|^p \frac{d\mu}{dx} \, dx \le \left[ \int_\Delta |f - v|^{p(1-\lambda^{-1})^{-1}} dx \right]^{1-\lambda^{-1}} \left[ \int_\Delta \left( \frac{d\mu}{dx} \right)^{\lambda} dx \right]^{1/\lambda} = \|f - v\|_{L_\psi(\Delta)}^p \, \|d\mu/dx\|_{L_\lambda(\Delta)}$ \n\n
where $\psi = p(1 - \lambda^{-1})^{-1}$. Now Petrushev and Popov [10, p. 216] have shown that there exists a polynomial of degree $k$ on $\Delta = [r, s]$ such that \n\n
$\|f - v\|_{L_\psi(\Delta)}^p \le c \|f\|_{B(\Delta)}^p$ \n\n
where \n\n
$\|f\|_{B(\Delta)} := \left( \int_0^{(s-r)/k} \left( t^{-\alpha} \|\Delta_t^k f(\cdot)\|_{L_\sigma(r, s-kt)} \right)^{\sigma} \frac{dt}{t} \right)^{1/\sigma}$ \n\n
and $\sigma := (\alpha + \psi^{-1})^{-1}$, $0 < \psi < \infty$ and $k > 1$. Let $|\Xi| =: n$ and choose $\Xi = \bigcup_i \Delta_i$ ($\Delta_i = [r_i, s_i]$) such that \n\n
$\int_{\Delta_i} \left( \frac{d\mu}{dx} \right)^{\lambda} dx = \frac{1}{n} \|d\mu/dx\|_{L_\lambda(0,1)}^{\lambda}$. \n\n
Thus $\|d\mu/dx\|_{L_\lambda(\Delta)} = n^{-1/\lambda} \|d\mu/dx\|_{L_\lambda(0,1)}$. Hence \n\n
(4.5) $\|f - v\|_{L_{p,\mu}}^p \le c \|d\mu/dx\|_{L_\lambda} \sum_{\Delta \in \Xi} n^{-1/\lambda} \|f\|_{B(\Delta)}^p$. \n\n
Since (by hypothesis) $p < \sigma$, Hölder's inequality gives \n\n
$\|f - v\|_{L_{p,\mu}}^p \le c \|d\mu/dx\|_{L_\lambda} \left[ \sum_{\Delta \in \Xi} \left( \frac{1}{n} \right)^{\frac{\sigma}{\lambda(\sigma-p)}} \right]^{1 - \frac{p}{\sigma}} \left[ \sum_{\Delta \in \Xi} \|f\|_{B(\Delta)}^{\sigma} \right]^{\frac{p}{\sigma}}$. \n\n
Now for arbitrary partitions $\Xi$ of $[0,1]$ Petrushev and Popov [10, page 216] have shown \n\n
$\sum_{\Delta \in \Xi} \|f\|_{B(\Delta)}^{\sigma} \le \|f\|_{B^{\alpha}_{\sigma;k}}^{\sigma}$ \n\n
where $B^{\alpha}_{\sigma;k} = B^{\alpha}_{\sigma,\sigma;k} = B([0,1])$. Hence \n\n
$\|f - v\|_{L_{p,\mu}}^p \le c \|d\mu/dx\|_{L_\lambda} n^{-\frac{1}{\lambda} + 1 - \frac{p}{\sigma}} \|f\|_{B^{\alpha}_{\sigma;k}}^p$ \n\n
and, since $-\lambda^{-1} + 1 - p/\sigma = -p\alpha$, \n\n
(4.6) $\|f - v\|_{L_{p,\mu}} \le c \|d\mu/dx\|_{L_\lambda}^{1/p} \, n^{-\alpha} \, \|f\|_{B^{\alpha}_{\sigma;k}}$ \n\n
with $\sigma = (\alpha + \psi^{-1})^{-1}$, $\psi = p(1 - \lambda^{-1})^{-1}$; hence $\sigma = (\alpha + \frac{1 - \lambda^{-1}}{p})^{-1}$. Thus given $\alpha$ and $p$, choosing different $\lambda$ adjusts the $\sigma$ used to measure $f$ on the right-hand side of (4.6). This proves (4.3). \n\n
Note that because of the restriction that $p < \sigma$, $\alpha > 1$ is only achievable for $p < 1$ (which is rarely used in statistical regression [6]). Note also the effect of the term $\|d\mu/dx\|_{L_\lambda}^{1/p}$: when $\lambda = 1$ it is identically 1 (since $\mu$ is a probability measure), and when $\lambda > 1$ it measures the departure from the uniform distribution, suggesting that the degree of approximation achievable under non-uniform distributions is worse than under uniform distributions. \n\n
Equation (4.4) is proved similarly. When $\sigma \le p$ with $p \ge 1$, for any $\alpha \le 1/\sigma$ we can set $\lambda := (1 - \frac{p}{\sigma} + p\alpha)^{-1} \ge 1$. From (4.5) we have \n\n
$\|f - v\|_{L_{p,\mu}}^p \le c \|d\mu/dx\|_{L_\lambda} \sum_{\Delta \in \Xi} \left( \frac{1}{n} \right)^{1/\lambda} \|f\|_{B(\Delta)}^p \le c \|d\mu/dx\|_{L_\lambda} \left( \frac{1}{n} \right)^{1/\lambda} \left[ \sum_{\Delta \in \Xi} \|f\|_{B(\Delta)}^{\sigma} \right]^{p/\sigma} \le c \|d\mu/dx\|_{L_\lambda} \, n^{-1 + \frac{p}{\sigma} - p\alpha} \, \|f\|_{B^{\alpha}_{\sigma;k}}^p$ \n\n
(the middle step uses $\sum_\Delta \|f\|_{B(\Delta)}^p \le [\sum_\Delta \|f\|_{B(\Delta)}^{\sigma}]^{p/\sigma}$, valid for $\sigma \le p$), and taking $p$th roots gives (4.4). $\blacksquare$ \n\n
5 CONCLUSIONS AND FURTHER WORK \n\n
In this paper a result of Vitushkin has been applied to \"multi-layer\" rational approximation. Furthermore, the degree of approximation achievable by spline approximation with free knots when the knots are chosen according to a probability distribution has been examined. \n\n
The degree of approximation of neural networks, particularly multiple layer networks, is an interesting open problem. Ideally one would like both direct and converse theorems, completely characterizing the degree of approximation. If it turns out that from an approximation point of view neural networks are no better than dyadic splines (say), then there is a strong incentive to study the PAC-like learning theory (in the style of [7]) for such spline representations. We are currently working on this topic. \n\n
Acknowledgements \n\n
This work was supported in part by the Australian Telecommunications and Electronics Research Board and OTC. The first author thanks Federico Girosi for providing him with a copy of [4]. The second author was supported by an Australian Postgraduate Research Award. \n\n
References \n\n
[1] A. R. Barron, Approximation and Estimation Bounds for Artificial Neural Networks, to appear in Machine Learning, 1992. \n\n
[2] M. S. Birman and M. Z. Solomjak, Piecewise-Polynomial Approximations of Functions of the Classes $W_p^\alpha$, Mathematics of the USSR - Sbornik, 2 (1967), pp. 295-317. \n\n
[3] L. Chua and A.-C. Deng, Canonical Piecewise-Linear Representation, IEEE Transactions on Circuits and Systems, 35 (1988), pp. 101-111. \n\n
[4] R. A. DeVore, Degree of Nonlinear Approximation, in Approximation Theory VI, Volume 1, C. K. Chui, L. L. Schumaker and J. D. Ward, eds., Academic Press, Boston, 1991, pp. 175-201. \n\n
[5] R. A. DeVore, B. Jawerth and V. Popov, Compression of Wavelet Decompositions, to appear in American Journal of Mathematics, 1992. \n\n
[6] H. Ekblom, $L_p$-methods for Robust Regression, BIT, 14 (1974), pp. 22-32. \n\n
[7] D. Haussler, Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications, Report UCSC-CRL-90-52, Baskin Center for Computer Engineering and Information Sciences, University of California, Santa Cruz, 1990. \n\n
[8] P. Oswald, On the Degree of Nonlinear Spline Approximation in Besov-Sobolev Spaces, Journal of Approximation Theory, 61 (1990), pp. 131-157. \n\n
[9] P. P. Petrushev, Direct and Converse Theorems for Spline and Rational Approximation and Besov Spaces, in Function Spaces and Applications (Lecture Notes in Mathematics 1302), M. Cwikel, J. Peetre, Y. Sagher and H. Wallin, eds., Springer-Verlag, Berlin, 1988, pp. 363-377. \n\n
[10] P. P. Petrushev and V. A. Popov, Rational Approximation of Real Functions, Cambridge University Press, Cambridge, 1987. \n\n
[11] H. Triebel, Theory of Function Spaces, Birkhäuser Verlag, Basel, 1983. \n\n
[12] A. G. Vitushkin, Theory of the Transmission and Processing of Information, Pergamon Press, Oxford, 1961. Originally published as Otsenka slozhnosti zadachi tabulirovaniya (Estimation of the Complexity of the Tabulation Problem), Fizmatgiz, Moscow, 1959. \n\n
[13] R. C. Williamson and U. Helmke, Existence and Uniqueness Results for Neural Network Approximations, submitted, 1992. \n", "award": [], "sourceid": 442, "authors": [{"given_name": "Robert", "family_name": "Williamson", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}]}