{"title": "Lower Bounds on the Complexity of Approximating Continuous Functions by Sigmoidal Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 328, "page_last": 334, "abstract": null, "full_text": "Lower Bounds on the  Complexity of \n\nApproximating  Continuous Functions by \n\nSigmoidal Neural Networks \n\nMichael Schmitt \n\nLehrstuhl Mathematik und Informatik \n\nFakuWit ftir  Mathematik \nRuhr-Universitat Bochum \nD-44780 Bochum, Germany \n\nmschmitt@lmi.ruhr-uni-bochum.de \n\nAbstract \n\nWe calculate lower bounds on the size of sigmoidal neural networks \nthat  approximate  continuous  functions. \nIn  particular,  we  show \nthat  for  the  approximation  of  polynomials  the  network  size  has \nto grow as  O((logk)1/4)  where  k  is  the degree of the polynomials. \nThis bound is  valid for  any input dimension, i.e.  independently of \nthe  number of variables.  The result  is  obtained  by  introducing a \nnew method employing upper bounds on the Vapnik-Chervonenkis \ndimension  for  proving lower  bounds  on  the  size  of networks  that \napproximate continuous functions. \n\n1 \n\nIntroduction \n\nSigmoidal neural networks are known to be universal approximators.  This is one of \nthe  theoretical results  most frequently  cited  to justify the use  of sigmoidal neural \nnetworks  in  applications.  By  this  statement one  refers  to the fact  that  sigmoidal \nneural networks have been shown to be able to approximate any continuous function \narbitrarily  well.  Numerous  results  in  the  literature  have  established  variants  of \nthis universal approximation property by considering distinct function classes to be \napproximated  by  network  architectures  using  different  types  of neural  activation \nfunctions with respect to various approximation criteria, see for instance [1,  2,  3, 5, \n6,  11,  12,  14,  15].  (See  in particular Scarselli and Tsoi  [15]  for  a  recent survey and \nfurther  references.) \n\nAll these results and many others not referenced here, some of them being construc(cid:173)\ntive, some being merely existence proofs, provide upper bounds for  the network size \nasserting  that  good  approximation  is  possible  if there  are  sufficiently  many  net(cid:173)\nwork nodes  available.  This, however,  is  only a  partial answer  to the question that \nmainly arises  in  practical applications:  \"Given some function,  how  many  network \nnodes  are  needed  to  approximate  it?\"  Not  much  attention  has  been  focused  on \nestablishing lower  bounds  on  the  network  size  and,  in  particular, for  the approx(cid:173)\nimation  of functions  over  the  reals.  As  far  as  the  computation  of  binary-valued \n\n\fComplexity of Approximating Continuous Functions by Neural Networks \n\n329 \n\nfunctions  by sigmoidal networks is concerned (where the output value of a network \nis  thresholded to yield 0 or 1)  there are a  few  results  in  this direction.  For  a  spe(cid:173)\ncific  Boolean function  Koiran [9]  showed that networks using the standard sigmoid \nu(y)  =  1/(1 + e- Y )  as  activation  function  must  have  size  O(nl/4)  where  n  is  the \nnumber of inputs.  (When measuring network size we  do not count the input nodes \nhere and in what follows.)  Maass [13]  established a larger lower bound by construct(cid:173)\ning a binary-valued function over IRn  and showing that standard sigmoidal networks \nrequire O(n)  many network  nodes for  computing this function.  
The first work on the complexity of sigmoidal networks for approximating continuous functions is due to DasGupta and Schnitger [4]. They showed that the standard sigmoid in network nodes can be replaced by other types of activation functions without increasing the size of the network by more than a polynomial. This yields indirect lower bounds on the size of sigmoidal networks in terms of other network types. DasGupta and Schnitger [4] also claimed the size bound A^{Ω(1/d)} for sigmoidal networks with d layers approximating the function sin(Ax).

In this paper we consider the problem of using the standard sigmoid σ(y) = 1/(1 + e^{-y}) in neural networks for the approximation of polynomials. We show that at least Ω((log k)^{1/4}) network nodes are required to approximate polynomials of degree k with small error in the l_∞ norm. This bound is valid for arbitrary input dimension, i.e. it does not depend on the number of variables. (Lower bounds can also be obtained from the results on binary-valued functions mentioned above by interpolating the corresponding functions by polynomials. This, however, requires growing input dimension and does not yield a lower bound in terms of the degree.) Further, the bound established here holds for networks with any number of layers. As far as we know, this is the first lower bound result for the approximation of polynomials. From the computational point of view polynomials are a very simple class of functions: they can be computed using the basic operations addition and multiplication only. Polynomials also play an important role in approximation theory since they are dense in the class of continuous functions, and some approximation results for neural networks rely on the approximability of polynomials by sigmoidal networks (see, e.g., [2, 15]).

We obtain the result by introducing a new method that employs upper bounds on the Vapnik-Chervonenkis dimension of neural networks to establish lower bounds on the network size. The first use of the Vapnik-Chervonenkis dimension to obtain a lower bound is due to Koiran [9], who calculated the above-mentioned bound on the size of sigmoidal networks for a Boolean function. Koiran's method was further developed and extended by Maass [13] using a similar argument but another combinatorial dimension. Both papers derived lower bounds for the computation of binary-valued functions (Koiran [9] for inputs from {0,1}^n, Maass [13] for inputs from ℝ^n). Here, we present a new technique showing that, and how, lower bounds can be obtained for networks that approximate continuous functions. It rests on two fundamental results about the Vapnik-Chervonenkis dimension of neural networks. On the one hand, we use constructions provided by Koiran and Sontag [10] to build networks that have large Vapnik-Chervonenkis dimension and consist of gates computing certain arithmetic functions. On the other hand, we follow the lines of reasoning of Karpinski and Macintyre [7] to derive an upper bound on the Vapnik-Chervonenkis dimension of these networks from the estimates of Khovanskii [8] and a result due to Warren [16].

In the following section we give the definitions of sigmoidal networks and the Vapnik-Chervonenkis dimension.
Then we present the lower bound result for function approximation. Finally, we conclude with some discussion and open questions.

2 Sigmoidal Neural Networks and VC Dimension

We briefly recall the definitions of a sigmoidal neural network and the Vapnik-Chervonenkis dimension (see, e.g., [7, 10]). We consider feedforward neural networks which have a certain number of input nodes and one output node. The nodes which are not input nodes are called computation nodes; associated with each of them is a real number t, the threshold. Further, each edge is labelled with a real number w called its weight. Computation in the network takes place as follows: The input values are assigned to the input nodes. Each computation node applies the standard sigmoid σ(y) = 1/(1 + e^{-y}) to the sum w_1 x_1 + ... + w_r x_r - t, where x_1, ..., x_r are the values computed by the node's predecessors, w_1, ..., w_r are the weights of the corresponding edges, and t is the threshold. The output value of the network is defined to be the value computed by the output node. As is common for approximation results by means of neural networks, we assume that the output node is a linear gate, i.e. it just outputs the sum w_1 x_1 + ... + w_r x_r - t. (Clearly, for computing functions on finite sets with output range [0,1] the output node may apply the standard sigmoid as well.) Since σ is the only sigmoidal function that we consider here, we will refer to such networks as sigmoidal neural networks. (Sigmoidal functions in general need to satisfy much weaker assumptions than σ does.) The definition naturally generalizes to networks employing other types of gates that we will make use of (e.g. linear, multiplication, and division gates).

The Vapnik-Chervonenkis dimension is a combinatorial dimension of a function class and is defined as follows: A dichotomy of a set S ⊆ ℝ^n is a partition of S into two disjoint subsets (S_0, S_1) such that S_0 ∪ S_1 = S. Given a set F of functions mapping ℝ^n to {0,1} and a dichotomy (S_0, S_1) of S, we say that F induces the dichotomy (S_0, S_1) on S if there is some f ∈ F such that f(S_0) ⊆ {0} and f(S_1) ⊆ {1}. We say further that F shatters S if F induces all dichotomies on S. The Vapnik-Chervonenkis (VC) dimension of F, denoted VCdim(F), is defined as the largest number m such that there is a set of m elements that is shattered by F. We refer to the VC dimension of a neural network, which is given in terms of a \"feedforward architecture\", i.e. a directed acyclic graph, as the VC dimension of the class of functions obtained by assigning real numbers to all its programmable parameters, which are in general the weights and thresholds of the network or a subset thereof. Further, we assume that the output value of the network is thresholded at 1/2 to obtain binary values.

3 Lower Bounds on Network Size

Before we present the lower bound on the size of sigmoidal networks required for the approximation of polynomials, we first give a brief outline of the proof idea.
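(Editorial aside, not part of the original paper: the following minimal Python sketch illustrates the network model just defined, in which each computation node applies the standard sigmoid to a weighted sum of its predecessors minus a threshold and the output node is a linear gate, thresholded at 1/2 in the VC-dimension setting. The architecture, weights, and thresholds below are arbitrary placeholder values chosen only for illustration.)

import math

def sigma(y):
    # the standard sigmoid used throughout the paper
    return 1.0 / (1.0 + math.exp(-y))

def sigmoidal_node(inputs, weights, t):
    # sigma(w_1 x_1 + ... + w_r x_r - t)
    return sigma(sum(w * x for w, x in zip(weights, inputs)) - t)

def linear_output(inputs, weights, t):
    # the output node is a linear gate: it outputs the affine sum itself
    return sum(w * x for w, x in zip(weights, inputs)) - t

def tiny_network(x1, x2):
    # two inputs, one hidden layer of two sigmoidal computation nodes
    h1 = sigmoidal_node([x1, x2], [2.0, -1.0], t=0.5)
    h2 = sigmoidal_node([x1, x2], [-1.5, 1.0], t=-0.25)
    return linear_output([h1, h2], [1.0, 1.0], t=0.3)

y = tiny_network(0.2, 0.7)
print("real-valued output:", y)
print("binary output (thresholded at 1/2):", 1 if y >= 0.5 else 0)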
We will define a sequence of univariate polynomials (p_n)_{n≥1} by means of which we show how to construct neural architectures N_n consisting of various types of gates, such as linear, multiplication, and division gates, and, in particular, gates that compute some of the polynomials. This architecture has a single weight as its programmable parameter (all other weights and thresholds are fixed). We then demonstrate that, assuming the gates computing the polynomials can be approximated by sigmoidal neural networks sufficiently well, the architecture N_n can shatter a certain set by assigning suitable values to its programmable weight. The final step is to reason along the lines of Karpinski and Macintyre [7] to obtain, via Khovanskii's estimates [8] and a result due to Warren [16], an upper bound on the VC dimension of N_n in terms of the number of its computation nodes. (Note that we cannot directly apply Theorem 7 of [7] since it does not deal with division gates.) Comparing this bound with the cardinality of the shattered set, we will then be able to conclude with a lower bound on the number of computation nodes in N_n and thus in the networks that approximate the polynomials.

Let the sequence (p_n)_{n≥1} of polynomials over ℝ be inductively defined by

p_1(x) = 4x(1 - x),   p_n(x) = p_1(p_{n-1}(x)) for n ≥ 2.

Clearly, this uniquely defines p_n for every n ≥ 1, and it can readily be seen that p_n has degree 2^n. The main lower bound result is made precise in the following statement.

Theorem 1. Sigmoidal neural networks that approximate the polynomials (p_n)_{n≥1} on the interval [0,1] with error at most O(2^{-n}) in the l_∞ norm must have at least Ω(n^{1/4}) computation nodes.

Proof. For each n a neural architecture N_n can be constructed as follows. The network has four input nodes x_1, x_2, x_3, x_4. Figure 1 shows the network with input values assigned to the input nodes in the order x_4 = 1, x_3 = i, x_2 = j, x_1 = k. There is one weight which we consider as the (only) programmable parameter of N_n; it is associated with the edge outgoing from input node x_4 and is denoted by w. The computation nodes are partitioned into six levels, as indicated by the boxes in Figure 1, and each level is itself a network. Let us first assume, for the sake of simplicity, that all computations over real numbers are exact. There are three levels labeled Π, having n + 1 input nodes and one output node each, that compute so-called projections π : ℝ^{n+1} → ℝ, where π(y_1, ..., y_n, a) = y_a for a ∈ {1, ..., n}. The levels labeled P_3, P_2, P_1 have one input node and n output nodes each.

[Figure 1: The network N_n with values k, j, i, 1 assigned to the input nodes x_1, x_2, x_3, x_4, respectively. The weight w is the only programmable parameter of the network.]
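(Editorial aside before continuing with the construction of the levels, not part of the original proof: since p_1 has degree 2 and p_n = p_1 ∘ p_{n-1}, the polynomial p_n is the n-fold iterate of the logistic map, which is why its degree is 2^n, and the iterates compose additively, p_a(p_b(x)) = p_{a+b}(x). This identity is what allows each level below to compute the polynomials p_{b·n^{λ-1}} gate by gate. A small numerical sketch, with an arbitrary test point:)

def p(n, x):
    # p_1(x) = 4x(1 - x); p_n(x) = p_1(p_{n-1}(x)) for n >= 2,
    # i.e. the n-fold iterate of the logistic map
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
    return x

x0 = 0.3141
a, b = 3, 5
print(p(a, p(b, x0)))   # composing the iterates ...
print(p(a + b, x0))     # ... gives the same value as p_{a+b}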
Level P_3 receives the constant 1 as input and thus the value w, which is the parameter of the network. We define the output values of level P_λ for λ = 3, 2, 1 by

w_b^{(λ)} = p_{b·n^{λ-1}}(v),   b = 1, ..., n,

where v denotes the input value to level P_λ. This value is equal to w for λ = 3 and to π(w_1^{(λ+1)}, ..., w_n^{(λ+1)}, x_{λ+1}) otherwise. We observe that w_{b+1}^{(λ)} can be calculated from w_b^{(λ)} as p_{n^{λ-1}}(w_b^{(λ)}). Therefore, the computations of level P_λ can be implemented using n gates, each of them computing the function p_{n^{λ-1}}.

We show now that N_n can shatter a set of cardinality n^3. Let S = {1, ..., n}^3. It has been shown in Lemma 2 of [10] that for each (β_1, ..., β_r) ∈ {0,1}^r there exists some w ∈ [0,1] such that for q = 1, ..., r

p_q(w) ∈ [0, 1/2) if β_q = 0,   and   p_q(w) ∈ (1/2, 1] if β_q = 1.

This implies that for each dichotomy (S_0, S_1) of S there is some w ∈ [0,1] such that for every (i, j, k) ∈ S

p_k(p_{j·n}(p_{i·n^2}(w))) < 1/2 if (i, j, k) ∈ S_0,   and   p_k(p_{j·n}(p_{i·n^2}(w))) > 1/2 if (i, j, k) ∈ S_1.

Note that p_k(p_{j·n}(p_{i·n^2}(w))) is the value computed by N_n given input values k, j, i, 1. Therefore, by choosing a suitable value for w, which is the parameter of N_n, the network can induce any dichotomy on S. In other words, S is shattered by N_n.

It has been shown in Lemma 1 of [10] that there is an architecture A_n such that for each ε > 0 weights can be chosen for A_n such that the function f_{n,ε} computed by this network satisfies lim_{ε→0} f_{n,ε}(y_1, ..., y_n, a) = y_a. Moreover, this architecture consists of O(n) computation nodes, which are linear, multiplication, and division gates. (Note that the size of A_n does not depend on ε.) Therefore, choosing ε sufficiently small, we can implement the projections π in N_n by networks of O(n) computation nodes such that the resulting network N'_n still shatters S. Now in N'_n we have O(n) computation nodes for implementing the three levels labeled Π, and we have in each level P_λ a number of O(n) computation nodes for computing p_{n^{λ-1}}. Assume now that the computation nodes for p_{n^{λ-1}} can be replaced by sigmoidal networks such that, on inputs from S and with the parameter values defined above, the resulting network N''_n computes the same functions as N'_n. (Note that the computation nodes for p_{n^{λ-1}} have no programmable parameters.)

We estimate the size of N''_n. According to Theorem 7 of Karpinski and Macintyre [7], a sigmoidal neural network with l programmable parameters and m computation nodes has VC dimension O((ml)^2). We have to generalize this result slightly before being able to apply it. It can readily be seen from the proof of Theorem 7 in [7] that the result also holds if the network additionally contains linear and multiplication gates. For division gates we can derive the same bound by taking into account that for a gate computing a division, say x/y, we can introduce a defining equality x = z · y, where z is a new variable. (See [7] for how to proceed.)
Thus, we have that a network with l programmable parameters and m computation nodes, which are linear, multiplication, division, and sigmoidal gates, has VC dimension O((ml)^2). In particular, if m is the number of computation nodes of N''_n, the VC dimension is O(m^2). On the other hand, as we have shown above, N''_n can shatter a set of cardinality n^3. Since there are O(n) sigmoidal networks in N''_n computing the functions p_{n^{λ-1}}, and since the number of linear, multiplication, and division gates is bounded by O(n), for some value of λ a single network computing p_{n^{λ-1}} must have size at least Ω(√n). This yields a lower bound of Ω(n^{1/4}) for the size of a sigmoidal network computing p_n.

Thus far, we have assumed that the polynomials p_n are computed exactly. Since polynomials are continuous functions, and since we require them to be calculated only on a finite set of input values (those resulting from S and from the parameter values chosen for w to shatter S), an approximation of these polynomials is sufficient. A straightforward analysis, based on the fact that the output value of the network has a \"tolerance\" close to 1/2, shows that if p_n is approximated with error O(2^{-n}) in the l_∞ norm, the resulting network still shatters the set S. This completes the proof of the theorem. □

The statement of the previous theorem is restricted to the approximation of polynomials on the input domain [0,1]. However, the result immediately generalizes to any interval in ℝ. Moreover, it remains valid for multivariate polynomials of arbitrary input dimension.

Corollary 2. The approximation of polynomials of degree k by sigmoidal neural networks with approximation error O(1/k) in the l_∞ norm requires networks of size Ω((log k)^{1/4}). This holds for polynomials over any number of variables.

4 Conclusions and Open Questions

We have established lower bounds on the size of sigmoidal networks for the approximation of continuous functions. In particular, for a concrete class of polynomials we have calculated a lower bound in terms of the degree of the polynomials. The main result already holds for the approximation of univariate polynomials. Intuitively, approximation of multivariate polynomials seems to become harder when the dimension increases. Therefore, it would be interesting to have lower bounds both in terms of the degree and the input dimension.

Further, in our result the approximation error and the degree are coupled. Naturally, one would expect that the number of nodes has to grow, for each fixed function, when the error decreases. At present we do not know of any such lower bound.

We have not aimed at calculating the constants in the bounds. For practical applications such values are indispensable. Refining our method and using tighter results, it should be straightforward to obtain such numbers. Further, we expect that better lower bounds can be obtained by considering networks of restricted depth.

To establish the result we have introduced a new method for deriving lower bounds on network sizes.
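(Editorial aside, not part of the original paper: the single-parameter shattering property of Lemma 2 of Koiran and Sontag [10], on which the proof of Theorem 1 rests, can be observed numerically. The sketch below searches a grid over w in [0,1] for every sign pattern of the iterates p_1(w), ..., p_r(w) relative to 1/2; the choice r = 4 and the grid resolution are arbitrary, and the search merely illustrates the lemma rather than proving it.)

from itertools import product

def p(n, x):
    # n-fold iterate of the logistic map p_1(x) = 4x(1 - x)
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
    return x

def realizes(w, pattern):
    # p_q(w) > 1/2 exactly for those q with beta_q = 1
    return all((p(q, w) > 0.5) == bool(beta) for q, beta in enumerate(pattern, 1))

r = 4
grid = [i / 20000.0 for i in range(20001)]
for pattern in product([0, 1], repeat=r):
    w = next((w for w in grid if realizes(w, pattern)), None)
    print(pattern, "realized by w =", w)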
One of the main arguments is to use the functions to be approximated to construct networks with large VC dimension. The method seems suitable for obtaining bounds also for the approximation of other types of functions, as long as they are computationally powerful enough.

Moreover, the method could be adapted to obtain lower bounds also for networks using other activation functions (e.g. more general sigmoidal functions, ridge functions, radial basis functions). This may lead to new separation results for the approximation capabilities of different types of neural networks. In order for this to be accomplished, however, an essential requirement is that small upper bounds can be calculated for the VC dimension of such networks.

Acknowledgments

I thank Hans U. Simon for helpful discussions. This work was supported in part by the ESPRIT Working Group in Neural and Computational Learning II, NeuroCOLT2, No. 27150.

References

[1] A. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39:930-945, 1993.

[2] C. K. Chui and X. Li. Approximation by ridge functions and neural networks with one hidden layer. Journal of Approximation Theory, 70:131-141, 1992.

[3] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303-314, 1989.

[4] B. DasGupta and G. Schnitger. The power of approximating: A comparison of activation functions. In C. L. Giles, S. J. Hanson, and J. D. Cowan, editors, Advances in Neural Information Processing Systems 5, pages 615-622. Morgan Kaufmann, San Mateo, CA, 1993.

[5] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4:251-257, 1991.

[6] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.

[7] M. Karpinski and A. Macintyre. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54:169-176, 1997.

[8] A. G. Khovanskii. Fewnomials, volume 88 of Translations of Mathematical Monographs. American Mathematical Society, Providence, RI, 1991.

[9] P. Koiran. VC dimension in circuit complexity. In Proceedings of the 11th Annual IEEE Conference on Computational Complexity CCC'96, pages 81-85. IEEE Computer Society Press, Los Alamitos, CA, 1996.

[10] P. Koiran and E. D. Sontag. Neural networks with quadratic VC dimension. Journal of Computer and System Sciences, 54:190-198, 1997.

[11] V. Y. Kreinovich. Arbitrary nonlinearity is sufficient to represent all functions by neural networks: A theorem. Neural Networks, 4:381-383, 1991.

[12] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6:861-867, 1993.

[13] W. Maass. Noisy spiking neurons with temporal coding have more computational power than sigmoidal neurons. In M. Mozer, M. I. Jordan, and
T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 211-217. MIT Press, Cambridge, MA, 1997.

[14] H. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8:164-177, 1996.

[15] F. Scarselli and A. C. Tsoi. Universal approximation using feedforward neural networks: A survey of some existing methods and some new results. Neural Networks, 11:15-37, 1998.

[16] H. E. Warren. Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133:167-178, 1968.", "award": [], "sourceid": 1692, "authors": [{"given_name": "Michael", "family_name": "Schmitt", "institution": null}]}