{"title": "Networks with Learned Unit Response Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1048, "page_last": 1055, "abstract": null, "full_text": "Networks with Learned Unit Response Functions\n\nJohn Moody and Norman Yarvin\nYale Computer Science, 51 Prospect St.\nP.O. Box 2158 Yale Station, New Haven, CT 06520-2158\n\nAbstract\n\nFeedforward networks composed of units which compute a sigmoidal function of a weighted sum of their inputs have been much investigated. We tested the approximation and estimation capabilities of networks using functions more complex than sigmoids. Three classes of functions were tested: polynomials, rational functions, and flexible Fourier series. Unlike sigmoids, these classes can fit non-monotonic functions. They were compared on three problems: prediction of Boston housing prices, the sunspot count, and robot arm inverse dynamics. The complex units attained clearly superior performance on the robot arm problem, which is a highly non-monotonic, pure approximation problem. On the noisy and only mildly nonlinear Boston housing and sunspot problems, differences among the complex units were revealed; polynomials did poorly, whereas rationals and flexible Fourier series were comparable to sigmoids.\n\n1 Introduction\n\nA commonly studied neural architecture is the feedforward network in which each unit of the network computes a nonlinear function $g(x)$ of a weighted sum of its inputs $x = w^T u$. Generally this function is a sigmoid, such as $g(x) = \\tanh x$ or $g(x) = 1/(1 + e^{(x-\\theta)})$. 
To these we compared units of a substantially different type: they also compute a nonlinear function of a weighted sum of their inputs, but the unit response function is able to fit a much higher degree of nonlinearity than can a sigmoid. The nonlinearities we considered were polynomials, rational functions (ratios of polynomials), and flexible Fourier series (sums of cosines). Our comparisons were done in the context of two-layer networks consisting of one hidden layer of complex units and an output layer of a single linear unit.\n\nThis network architecture is similar to that built by projection pursuit regression (PPR) [1, 2], another technique for function approximation. The one difference is that in PPR the nonlinear function of the units of the hidden layer is a nonparametric smooth. This nonparametric smooth has two disadvantages for neural modeling: it has many parameters, and, as a smooth, it is easily trained only if desired output values are available for that particular unit. The latter property makes the use of smooths in multilayer networks inconvenient. If a parametrized function of a type suitable for one-dimensional function approximation is used instead of the nonparametric smooth, then these disadvantages do not apply. The functions we used are all suitable for one-dimensional function approximation.\n\n2 Representation\n\nA few details of the representation of the unit response functions are worth noting.\n\nPolynomials: Each polynomial unit computed the function\n\n$$g(x) = a_1 x + a_2 x^2 + \\cdots + a_n x^n$$\n\nwith $x = w^T u$ being the weighted sum of the input. A zeroth-order term was not included in the above formula, since it would have been redundant among all the units. 
The zeroth-order term was dealt with separately and stored in only one location.\n\nRationals: A rational function representation was adopted which could not have zeros in the denominator. This representation used a sum of squares of polynomials, as follows:\n\n$$g(x) = \\frac{a_0 + a_1 x + \\cdots + a_n x^n}{1 + (b_0 + b_1 x)^2 + (b_2 x + b_3 x^2)^2 + (b_4 x + b_5 x^2 + b_6 x^3 + b_7 x^4)^2 + \\cdots}$$\n\nThis representation has the qualities that the denominator is never less than 1, and that $n$ parameters are used to produce a denominator of degree $n$. If the above formula were continued, the next terms in the denominator would be of degrees eight, sixteen, and thirty-two. This powers-of-two sequence was used for the following reason: of the $2(n - m)$ terms in the square of a polynomial $p = a_m x^m + \\cdots + a_n x^n$, it is possible by manipulating $a_m, \\ldots, a_n$ to determine the $n - m$ highest coefficients, with the exception that the very highest coefficient must be non-negative. Thus if we consider the coefficients of the polynomial that results from squaring and adding together the terms of the denominator of the above formula, the highest degree squared polynomial may be regarded as determining the highest half of the coefficients, the second highest degree polynomial may be regarded as determining the highest half of the rest of the coefficients, and so forth. This process cannot set all the coefficients arbitrarily; some must be non-negative.\n\nFlexible Fourier series: The flexible Fourier series units computed\n\n$$g(x) = \\sum_{i=0}^{n} a_i \\cos(b_i x + c_i)$$\n\nwhere the amplitudes $a_i$, frequencies $b_i$ and phases $c_i$ were unconstrained and could assume any value.
\n\nSigmoids: We used the standard logistic function:\n\n$$g(x) = \\frac{1}{1 + e^{(x-\\theta)}}$$\n\n3 Training Method\n\nAll the results presented here were obtained by training with the Levenberg-Marquardt modification of the Gauss-Newton nonlinear least squares algorithm. Stochastic gradient descent was also tried at first, but on the problems where the two were compared, Levenberg-Marquardt was much superior both in convergence time and in quality of result. Levenberg-Marquardt required substantially fewer iterations than stochastic gradient descent to converge. However, it needs $O(p^2)$ space and $O(p^2 n)$ time per iteration in a network with $p$ parameters and $n$ input examples, as compared to $O(p)$ space and $O(pn)$ time per epoch for stochastic gradient descent. Further details of the training method will be discussed in a longer paper.\n\nWith some data sets, a weight decay term was added to the energy function to be optimized. The added term was of the form $\\lambda \\sum_i w_i^2$. When weight decay was used, a range of values of $\\lambda$ was tried for every network trained.\n\nBefore training, all the data was normalized: each input variable was scaled so that its range was (-1, 1), then scaled so that the maximum sum of squares of input variables for any example was 1. The output variable was scaled to have mean zero and mean absolute value 1. This helped the training algorithm, especially in the case of stochastic gradient descent.\n\n4 Results\n\nWe present results of training our networks on three data sets: robot arm inverse dynamics, Boston housing data, and sunspot count prediction. The Boston and sunspot data sets are noisy, but have only mild nonlinearity. The robot arm inverse dynamics data has no noise, but a high degree of nonlinearity. 
Noise-free problems have low estimation error. Models for linear or mildly nonlinear problems typically have low approximation error. The robot arm inverse dynamics problem is thus a pure approximation problem, while performance on the noisy Boston and sunspot problems is limited more by estimation error than by approximation error.\n\nFigure 1a is a graph, like those used in PPR, of the unit response function of a one-unit network trained on the Boston housing data. The x axis is a projection (a weighted sum of inputs $w^T u$) of the 13-dimensional input space onto 1 dimension, using those weights chosen by the unit in training. The y axis is the fit to data. The response function of the unit is a sum of three cosines. Figure 1b is the superposition of five graphs of the five unit response functions used in a five-unit rational function solution (RMS error less than 2%) of the robot arm inverse dynamics problem. The domain for each curve lies along a different direction in the six-dimensional input space. Four of the five fits along the projection directions are non-monotonic, and thus can be fit only poorly by a sigmoid.\n\n[Figure 1, panels a and b. Panel title: Robot arm fit to data.]\n\nTwo different error measures are used in the following. The first is the RMS error, normalized so that error of 1 corresponds to no training. The second measure is the square of the normalized RMS error, otherwise known as the fraction of explained variance. We used whichever error measure was used in earlier work on that data set.
\n\n4.1 Robot arm inverse dynamics\n\nThis problem is the determination of the torque necessary at the joints of a two-joint robot arm to achieve a given acceleration of each segment of the arm, given each segment's velocity and position. There are six input variables to the network, and two output variables. This problem was treated as two separate estimation problems, one for the shoulder torque and one for the elbow torque. The shoulder torque was a slightly more difficult problem, for almost all networks. The 1000 points in the training set covered the input space relatively thoroughly. This, together with the fact that the problem had no noise, meant that there was little difference between training set error and test set error.\n\nPolynomial networks of limited degree are not universal approximators, and that is quite evident on this data set; polynomial networks of low degree reached their minimum error after a few units. Figure 2a shows this. If polynomial, cosine, rational, and sigmoid networks are compared as in Figure 2b, leaving out low degree polynomials, the sigmoids have relatively high approximation error even for networks with 20 units. As shown in the following table, the complex units have more parameters each, but still get better performance with fewer parameters total.\n\nType                 Units  Parameters  Error\ndegree 7 polynomial      5          65   .024\ndegree 6 rational        5          95   .027\n2 term cosine            6          73   .020\nsigmoid                 10          81   .139\nsigmoid                 20         161   .119\n\nSince the training set is noise-free, these errors represent pure approximation error.\n\n[Figure 2, panels a and b. x axis: number of units.]\n\nThe superior performance of the complex units on this problem is probably due to their ability to approximate non-monotonic functions.\n\n4.2 Boston housing\n\nThe second data set is a benchmark for statistical algorithms: the prediction of Boston housing prices from 13 factors [3]. This data set contains 506 exemplars and is relatively simple; it can be approximated well with only a single unit. Networks of between one and six units were trained on this problem. Figure 3a is a graph of training set performance from networks trained on the entire data set; the error measure used was the fraction of explained variance. From this graph it is apparent that training set performance does not vary greatly between different types of units, though networks with more units do better.\n\n[Figure 3, panels a and b. Legend: polynomial, rational, 2 term cosine, 3 term cosine, sigmoid.]\n\nOn the test set there is a large difference. This is shown in Figure 3b. Each point on the graph is the average performance of ten networks of that type. Each network was trained using a different permutation of the data into test and training sets, the test set being 1/3 of the examples and the training set 2/3. It can be seen that the cosine nets perform the best, the sigmoid nets a close second, the rationals third, and the polynomials worst (with the error increasing quite a bit with increasing polynomial degree).
\n\nIt should be noted that the distribution of errors is far from a normal distribution, and that the training set error gives little clue as to the test set error. The following table of errors, for nine networks of four units using a degree 5 polynomial, is somewhat typical:\n\nSet       Error\ntraining  0.091\ntest      0.395\n\nOur speculation on the cause of these extremely high errors is that polynomial approximations do not extrapolate well; if the prediction of some data point results in a polynomial being evaluated slightly outside the region on which the polynomial was trained, the error may be extremely high. Rational functions where the numerator and denominator have equal degree have less of a problem with this, since asymptotically they are constant. However, over small intervals they can have the extrapolation characteristics of polynomials. Cosines are bounded, and so, though they may not extrapolate well if the function is not somewhat periodic, at least do not reach large values like polynomials.\n\n4.3 Sunspots\n\nThe third problem was the prediction of the average monthly sunspot count in a given year from the values of the previous twelve years. We followed previous work in using as our error measure the fraction of variance explained, and in using as the training set the years 1700 through 1920 and as the test set the years 1921 through 1955. This was a relatively easy test set: every network of one unit which we trained (whether sigmoid, polynomial, rational, or cosine) had, in each of ten runs, a training set error between .147 and .153 and a test set error between .105 and .111. For comparison, the best test set error achieved by us or previous testers was about .085. 
A set of runs similar to those for the Boston housing data was done, but using at most four units; similar results were obtained. Figure 4a shows training set error and Figure 4b shows test set error on this problem.\n\n[Figure 4, panels a and b. x axis: number of units.]\n\n4.4 Weight Decay\n\nThe performance of almost all networks was improved by some amount of weight decay. Figure 5 contains graphs of test set error for sigmoidal and polynomial units, using various values of the weight decay parameter $\\lambda$. For the sigmoids, very little weight decay seems to be needed to give good results, and there is an order of magnitude range (between .001 and .01) which produces close to optimal results. For polynomials of degree 5, more weight decay seems to be necessary for good results; in fact, the highest value of weight decay is the best. Since very high values of weight decay are needed, and at those values there is little improvement over using a single unit, it may be supposed that using those values of weight decay restricts the multiple units to producing a very similar solution to the one-unit solution. Figure 6 contains the corresponding graphs for sunspots. 
Weight decay seems to help less here for the sigmoids, but for the polynomials, moderate amounts of weight decay produce an improvement over the one-unit solution.\n\nAcknowledgements\n\nThe authors would like to acknowledge support from ONR grant N00014-89-J-1228, AFOSR grant 89-0478, and a fellowship from the John and Fannie Hertz Foundation. The robot arm data set was provided by Chris Atkeson.\n\nReferences\n\n[1] J. H. Friedman, W. Stuetzle, \"Projection Pursuit Regression\", Journal of the American Statistical Association, December 1981, Volume 76, Number 376, 817-823\n\n[2] P. J. Huber, \"Projection Pursuit\", The Annals of Statistics, 1985, Vol. 13, No. 2, 435-475\n\n[3] L. Breiman et al., Classification and Regression Trees, Wadsworth and Brooks, 1984, pp. 217-220\n\n[Figure 5: Boston housing test error with various amounts of weight decay]\n\n[Figure 6: Sunspot test error with various amounts of weight decay. x axis: number of units.]\n", "award": [], "sourceid": 568, "authors": [{"given_name": "John", "family_name": "Moody", "institution": null}, {"given_name": "Norman", "family_name": "Yarvin", "institution": null}]}