{"title": "Generalization Error and the Expected Network Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 367, "page_last": 374, "abstract": null, "full_text": "Generalization Error  and  The  Expected \n\nNetwork  Complexity \n\nChuanyi Ji \n\nDept.  of Elec.,  Compt.  and Syst  Engl' . \n\nRensselaer  Polytechnic Inst.itu( e \n\nTroy,  NY  12180-3590 \nchuanyi@ecse.rpi.edu \n\nFor  two  layer  networks  with  n  sigmoidal  hidden  units,  the  generalization  error  is \nshown  to be  bounded  by \n\nAbstract \n\nO(E~)  O( (EK)d l  N) \n, \n\nK  + \n\nN \n\nog \n\nwhere  d  and  N  are  the  input  dimension  and  the  number  of training  samples,  re(cid:173)\nspectively.  E  represents  the  expectation  on  random  number  K  of  hidden  units \n(1  :::;  I\\  :::;  n).  The  probability  Pr(I{  = k)  (1  :::;  k  :::;  n)  is  (kt.erl11ined  by  a  prior \ndistribution  of weights,  which  corresponds  to  a  Gibbs  distribtt! ion  of a  regularizeI'. \nThis  relationship  makes  it  possible  to  characterize  explicitly  how  a  regularization \nterm  affects  bias/variance  of networks.  The  bound  can  be  obtained  analytically \nfor  a  large  class  of  commonly used  priors.  It can  also  be  applied  to  estimate  the \nexpected  net.work  complexity  Ef{  in  practice.  The  result  provides  a  quantitative \nexplanation on  how  large networks  can generalize  well . \n\n1 \n\nIntroduction \n\nRegularization (or weight-decay) methods are widely used in supervised learning by \nadding a  regularization term t.o  an energy function.  
Although it is well known that such a regularization term effectively reduces network complexity by introducing more bias and less variance [4] to the networks, it is not clear whether and how the information given by a regularization term can be used alone to characterize the effective network complexity, and how the estimated effective network complexity relates to the generalization error. This research attempts to provide answers to these questions for two layer feedforward networks with sigmoidal hidden units.\n\nSpecifically, the effective network complexity is characterized by the expected number of hidden units determined by a Gibbs distribution corresponding to a regularization term. The generalization error can then be bounded by the expected network complexity, and the bound is thus tighter than the original bound given by Barron [2]. The new bound shows explicitly, through a bigger approximation error and a smaller estimation error, how a regularization term introduces more bias and less variance to the networks. It therefore provides a quantitative explanation of how a network larger than necessary can also generalize well under certain conditions, which cannot be explained by the existing learning theory [9].\n\nFor a class of commonly used regularizers, the expected network complexity can be obtained in a closed form. It is then used to estimate the expected network complexity for the Gaussian mixture model [6].\n\n2 Background and Previous Results\n\nA relationship has been developed by Barron [2] between generalization error and network complexity for two layer networks used for function approximation. We will briefly describe this result in this section and give our extension subsequently. 
\n\nConsider a class of two layer networks of fixed architecture with n sigmoidal hidden units and one (linear) output unit. Let f_n(x; w) = sum_{l=1}^{n} w_l^(2) g_l(w_l^(1)T x) be a network function, where w in Θ_n is the network weight vector comprising both w_l^(2) and w_l^(1) for 1 <= l <= n. w_l^(1) and w_l^(2) are the incoming weights to the l-th hidden unit and the weight from the l-th hidden unit to the output, respectively. Θ_n ⊆ R^{n(d+1)} is the weight space for n hidden units (and input dimension d). Each sigmoid unit g_l(z) is assumed to be of tanh type: g_l(z) -> ±1 as z -> ±∞ for 1 <= l <= n (1). The input is x in D ⊆ R^d. Without loss of generality, D is assumed to be a unit hypercube in R^d, i.e., all the components of x are in [-1, 1].\n\nLet f(x) be a target function defined on the same domain D and satisfying some smoothness conditions [2]. Consider N training samples independently drawn from some distribution p(x): (x_1, f(x_1)), ..., (x_N, f(x_N)). Define an energy function e, where e = e_1 + λ L_{n,N}(w). L_{n,N}(w) is a regularization term as a function of w for a fixed n. λ is a constant. e_1 is a quadratic error function on the N training samples: e_1 = (1/N) sum_{i=1}^{N} (f_n(x_i; w) - f(x_i))^2. Let f_{n,N}(x; ŵ) be the (optimal) network function such that ŵ minimizes the energy function e: ŵ = arg min_{w in Θ_n} e. The generalization error E_g is defined to be the squared L_2 norm E_g = E || f - f_{n,N} ||^2 = E ∫_D (f(x) - f_{n,N}(x; ŵ))^2 dp(x), where E is the expectation over all training sets of size N drawn from the same distribution. 
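The setup above can be sketched numerically. The following is a minimal, self-contained Python sketch: the function names, the toy data, and the squared-magnitude choice for the regularization term L_{n,N}(w) are illustrative assumptions, since the paper leaves L_{n,N} generic.

```python
import math

def f_n(x, W1, w2):
    # Two layer network f_n(x; w) = sum_l w2[l] * g(<W1[l], x>), with g = tanh
    return sum(w2l * math.tanh(sum(wi * xi for wi, xi in zip(w1l, x)))
               for w1l, w2l in zip(W1, w2))

def energy(X, Y, W1, w2, lam):
    # e = e_1 + lambda * L_{n,N}(w): quadratic error on the N training
    # samples plus a weight penalty (squared magnitudes, one common choice)
    N = len(Y)
    e1 = sum((f_n(x, W1, w2) - y) ** 2 for x, y in zip(X, Y)) / N
    L = sum(wi ** 2 for row in W1 for wi in row) + sum(v ** 2 for v in w2)
    return e1 + lam * L
```

Minimizing this energy over the weights yields the network f_{n,N}(x; ŵ) whose distance to the target f defines the generalization error.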
Thus, the generalization error measures the mean squared distance between the unknown function and the best network function that can be obtained for training sets of size N. The generalization error E_g is shown [2] to be bounded as\n\nE_g <= O(R_{n,N}),    (1)\n\nwhere R_{n,N}, called the index of resolvability [2], can be expressed as\n\nR_{n,N} = min_{w in Θ_n} { || f - f̄_n ||^2 + L_{n,N}(w) },    (2)\n\nwhere f̄_n is the clipped f_n(x; w) (see [2]). The index of resolvability can be further bounded as R_{n,N} <= O(1/n) + O((nd/N) log N). Therefore, the generalization error is bounded as\n\nE_g <= O(1/n) + O((nd/N) log N),    (3)\n\nwhere O(1/n) and O((nd/N) log N) are the bounds for the approximation error (bias) and the estimation error (variance), respectively.\n\nIn addition, the bound for E_g can be minimized if an additional regularization term L_N(n) is used in the energy function to minimize the number of hidden units, i.e., E_g <= O(sqrt((d log N)/N)).\n\n(1) In the previous work by Barron, the sigmoidal hidden units are (g(z)+1)/2. It is easy to show that his results are applicable to the class of g_l(z)'s we consider here.\n\n3 Open Questions and Motivations\n\nTwo open questions, which cannot be answered by the previous result, are of primary interest in this work.\n\n1) How do large networks generalize?\n\nThe large networks refer to those with a ratio W/N that is somewhat big, where W and N are the total number of independently modifiable weights (W ≈ nd for n large) and the number of training samples, respectively. Networks trained with regularization terms may fall into this category. 
Such large networks are found to be able to generalize well sometimes. However, when W/N is big, the bound in Equation (3) is too loose to bound the actual generalization error meaningfully. Therefore, for the large networks, the total number of hidden units n may no longer be a good estimate of network complexity. Efforts have been made to develop measures of effective network complexity both analytically and empirically [1][5][10]. These measures depend on training data as well as a regularization term in an implicit way, which makes it difficult to see direct effects of a regularization term on the generalization error. This naturally leads to our second question.\n\n2) Is it possible to characterize network complexity for a class of networks using only the information given by a regularization term (2)? How can the estimated network complexity be related rigorously to the generalization error?\n\nIn practice, when a regularization term L_{n,N}(w) is used to penalize the magnitude of the weights, it effectively minimizes the number of hidden units as well, even though an additional regularization term L_N(n) is not used. This is due to the fact that some of the hidden units may only operate in the linear region of a sigmoid when their incoming weights are small and the inputs are bounded. Therefore, an L_{n,N}(w) term can effectively act like an L_N(n) term that reduces the effective number of hidden units, and thus result in a degenerate parameter space whose degrees of freedom are fewer than nd. This fact was not taken into consideration in the previous work, and, as shown later in this work, it leads to a tighter bound on R_{n,N}.\n\n(2) This was posed as an open problem by Solla et al. [8].
\n\nIn what follows, we will first define the expected network complexity, then use it to bound the generalization error.\n\n4 The Expected Network Complexity\n\nFor reasons that will become apparent, we choose to define the effective complexity of a feedforward two layer network as the expected number EK of hidden units (1 <= K <= n) which are effectively nonlinear, i.e., operating outside the central linear regions of their sigmoid response function g(z). We define the linear region as an interval |z| < b, with b a positive constant.\n\nConsider the presynaptic input z = w'^T x to a hidden unit g(z), where w' is the incoming weight vector for the unit. Then the unit is considered to be effectively linear if |z| < b for all x in D. This will happen if |z'| < b, where z' = w'^T x' with x' being any vertex of the unit hypercube D. This is because |z| <= w'^T x̂, where x̂ is the vertex of D whose elements are the sgn functions of the elements of w'.\n\nNext, consider the network weights as random variables with a distribution p(w) = A exp(-L_{n,N}(w)), which corresponds to a Gibbs distribution of a regularization term with a normalizing constant A. Consider the vector x' to be a random vector also, with equally probable 1's and -1's. Then |z'| < b will be a random event. The probability for this hidden unit to be effectively nonlinear equals 1 - Pr(|z'| < b), which is completely determined by the distributions of the weights p(w) and of x' (equally probable). Let K be the number of hidden units which are effectively nonlinear. 
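The vertex argument above gives a simple deterministic test: the maximum of |w'^T x| over the hypercube D is attained at the vertex x̂ = sgn(w'), where it equals sum_i |w'_i|, so a unit is effectively linear exactly when that sum stays below b. A minimal Python sketch (the function names are illustrative assumptions):

```python
def effectively_linear(w_in, b):
    # |z| = |<w_in, x>| over x in D = [-1, 1]^d is maximized at the vertex
    # x_hat = sgn(w_in), where it equals sum_i |w_in[i]|; the unit is
    # effectively linear iff this maximum stays below b
    return sum(abs(wi) for wi in w_in) < b

def count_nonlinear(W1, b):
    # K = number of hidden units that are effectively nonlinear,
    # given the list of incoming weight vectors W1
    return sum(1 for w_in in W1 if not effectively_linear(w_in, b))
```

Averaging count_nonlinear over weight vectors drawn from the Gibbs prior p(w) = A exp(-L_{n,N}(w)) would give a Monte Carlo estimate of the expected network complexity EK defined next.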
Then the probability Pr(K = k) (1 <= k <= n) can be determined through a joint probability of k hidden units operating beyond the central linear region of the sigmoid functions. The expected network complexity, EK, can then be obtained through Pr(K = k), which is determined by the Gibbs distribution of L_{n,N}(w). The motivation for utilizing such a Gibbs distribution comes from the fact that R_{k,N} is independent of the training samples but dependent on a regularization term, which corresponds to a prior distribution of weights. Using such a formulation, as will be shown later, the effect of a regularization term on bias and variance can be characterized explicitly.\n\n5 A New Bound for The Generalization Error\n\nTo develop a tighter bound for the generalization error, we consider subspaces of the weights indexed by different numbers of effectively nonlinear hidden units: Θ_1 ⊆ Θ_2 ⊆ ... ⊆ Θ_n. For each Θ_j, there are j out of the n hidden units which are effectively nonlinear, for 1 <= j <= n. Therefore, the index of resolvability R_{n,N} can be expressed as\n\nR_{n,N} = min_{1 <= k <= n} R_{k,N},    (4)\n\nwhere each R_{k,N} = min_{w in Θ_k} { || f - f̄_n ||^2 + L_{n,N}(w) }. Next let us consider the number of effectively nonlinear units to be random. Since the minimum is no bigger than the average, we have\n\nR_{n,N} <= E R_{K,N},    (5)\n\nwhere the expectation is taken over the random variable K utilizing the probability Pr(K = k). 
For each K, however, the two terms in R_{K,N} can be bounded as\n\n|| f - f̄_n ||^2 <= O(1/K) + O(|| f_{n-K,n} - f_K ||^2),    (6)\n\nby the triangle inequality, where f_{n-K,n} is the actual network function with n - K hidden units operating in the region bounded by the constant b, and f_K is the corresponding network function which treats the n - K units as linear units. In addition, we have\n\nL_{n,N}(w) <= O(|| f_{n-K,n} - f_K ||^2) + O((Kd/N) log N),    (7)\n\nwhere the first term also results from the triangle inequality, and the second term is obtained by discretizing the degenerate parameter space Θ_K using similar techniques as in [2] (3). Applying a Taylor expansion to the term || f_{n-K,n} - f_K ||^2, we have\n\n|| f_{n-K,n} - f_K ||^2 <= O(b^6 (n - K)).    (8)\n\nPutting Equations (5), (6), (7) and (8) into Equation (1), we have\n\nE_g <= O(E(1/K)) + O(((EK)d/N) log N) + O(b^6 (n - EK)) + o(b^6),    (9)\n\nwhere EK is the expected number of hidden units which are effectively nonlinear. If b <= O(1/n^{1/3}), we have\n\nE_g <= O(E(1/K)) + O(((EK)d/N) log N).    (10)\n\n(3) Details will be given in a longer version of the paper in preparation.\n\n6 A Closed Form Expression For a Class of Regularization Terms\n\nFor commonly used regularization terms, how can we actually find the probability distribution of the number of (nonlinear) hidden units, Pr(K = k)? And how shall we evaluate EK and E(1/K)?\n\nAs a simple example, we consider a special class of prior distributions for iid weights, i.e., p(w) = Π_i p(w_i), where the w_i are the elements of w in Θ_n. This corresponds to a large class of regularization terms which minimize the magnitudes of individual weights independently [7].\n\nConsider each weight as a random variable with zero mean and a common variance σ^2. 
Then for large input dimension d, w'^T x' is approximately normal with zero mean and variance d σ^2 by the Central Limit Theorem [3]. Let q denote the probability that a unit is effectively nonlinear. We have\n\nq = 2 Q(b / (σ sqrt(d))),    (11)\n\nwhere Q(x) = (1/sqrt(2π)) ∫_x^∞ e^{-y^2/2} dy. Next consider the probability that K out of the n hidden units are nonlinear. Based on our (independence) assumptions on w' and x', K has a binomial distribution\n\nPr(K = k) = C(n, k) q^k (1 - q)^{n-k},    (12)\n\nwhere 1 <= k <= n. Then\n\nEK = n q,    (13)\n\nE(1/K) = 1/n + Δ,    (14)\n\nwhere Δ = sum_{i=1}^{n-1} (1/i)(1 - q)^{n-i} + (1 - q)^n. Then the generalization error E_g satisfies\n\nE_g <= O(1/n + Δ) + O((nqd/N) log N).    (15)\n\n7 Application\n\nAs an example of applications of the theoretical results, the expected network complexity EK is estimated for the Gaussian mixture model used for time-series prediction (details can be found in [6]) (4).\n\nIn general, using only a prior distribution of weights to estimate the network complexity EK may lead to a less accurate measure of the effective network complexity than incorporating information from the training data as well. However, if the parameters of a regularization term also get optimized during training, as shown in this example, the resulting Gibbs prior distribution of weights may lead to a good estimate of the effective number of hidden units.\n\nSpecifically, the corresponding Gibbs distribution p(w) of the weights from the Gaussian mixture is iid, and consists of a linear combination of eight Gaussian distributions. 
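Equations (11)-(14) can be evaluated directly. The following is a minimal Python sketch; the numeric values in the usage note below are illustrative assumptions, not the fitted mixture of [6].

```python
import math

def Q(x):
    # Gaussian tail probability: Q(x) = (1/sqrt(2*pi)) * int_x^inf e^{-y^2/2} dy
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q_nonlinear(b, sigma, d):
    # Eq. (11): probability that one unit is effectively nonlinear, with
    # the presynaptic input approximately N(0, d * sigma^2)
    return 2.0 * Q(b / (sigma * math.sqrt(d)))

def pr_K(n, q, k):
    # Eq. (12): binomial probability that k of the n units are nonlinear
    return math.comb(n, k) * q**k * (1.0 - q)**(n - k)

def expected_complexity(n, q):
    # Eq. (13): EK = n * q
    return n * q

def expected_inverse(n, q):
    # Eq. (14): E(1/K) = 1/n + Delta
    delta = sum((1.0 / i) * (1.0 - q)**(n - i) for i in range(1, n)) + (1.0 - q)**n
    return 1.0 / n + delta
```

For instance, taking b = 0.6, d = 12, n = 8 and treating σ = 0.2 as the common weight standard deviation (an assumption for illustration), q_nonlinear gives q ≈ 0.39 and expected_complexity gives EK ≈ 3.1, in line with the magnitudes reported in this section.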
This results in a skewed distribution with a sharp peak around zero (see [6]). The mean and variance of the presynaptic inputs z to the hidden units can thus be estimated as 0.02 and 0.04, respectively. The other parameters used are n = 8 and d = 12; b = 0.6 is chosen. Then q ≈ 0.4 is obtained through Equation (11). The effective network complexity is EK ≈ 3 (or 4). The empirical result [10], which estimates the effective number of hidden units using the dominant eigenvalues at the hidden layer, yields about 3 effective hidden units.\n\n(4) Strictly speaking, the theoretical results deal with regularization terms with discrete weights. They can be, and have been, extended to continuous weights by D.F. McCaffrey and A.R. Gallant. Details are beyond the scope of this paper.\n\nFigure 1: Illustration of an increase Δ in bias and of the variance βqn as functions of q. A scaling factor β = 0.25 is used for the convenience of the plot. n = 20 is chosen.\n\n8 Discussions\n\nIs this new bound for the generalization error tighter than the old one, which takes no account of network-weight-dependent information? If so, what does it tell us?\n\nCompared with the bound in Equation (3), the new bound results in an increase Δ in the approximation error (bias), and in qn instead of n in the estimation error (variance). These two terms are plotted as functions of q in Figure 1. Since q is a 
function of σ, which characterizes how strongly the magnitude of the weights is penalized, the larger the σ, the less the weights get penalized, the larger the q, and the more hidden units are likely to be effectively nonlinear, thus the smaller the bias and the larger the variance. When q = 1, all the hidden units are effectively nonlinear and the new bound reduces to the old one. This shows how a regularization term directly affects bias/variance.\n\nWhen the estimation error dominates, the bound for the generalization error will be proportional to nq instead of n. The value of nq, however, depends on the choice of σ. For small σ, the new bound can be much tighter than the old one, especially for large networks with n large but nq small. This provides a practical method to estimate the generalization error for large networks, as well as an explanation of when and why large networks can generalize well.\n\nHow tight the bound really is depends on how well L_{n,N}(w) is chosen. Let n_0 denote the optimal number of (nonlinear) hidden units needed to approximate f(x). If L_{n,N}(w) is chosen so that the corresponding p(w) is almost a delta function at n_0, then E R_{K,N} ≈ R_{n_0,N}, which gives a very tight bound. Otherwise, if, for instance, L_{n,N}(w) penalizes network complexity so little that E R_{K,N} ≈ R_{n,N}, the bound will be as loose as the original one. It should also be noted that an exact value for the bound cannot be obtained unless some information on the unknown function f itself is available.\n\nFor commonly used regularization terms, the expected network complexity can be estimated through a closed-form expression. 
Such expected network complexity is shown to be a good estimate of the actual network complexity if the Gibbs prior distribution of weights also gets optimized through training and is sharply peaked. More research will be done to evaluate the applicability of the theoretical results.\n\nAcknowledgement\n\nThe support of the National Science Foundation is gratefully acknowledged.\n\nReferences\n\n[1] S. Amari and N. Murata, \"Statistical Theory of Learning Curves under Entropic Loss Criterion,\" Neural Computation, 5, 140-153, 1993.\n\n[2] A. Barron, \"Approximation and Estimation Bounds for Artificial Neural Networks,\" Proc. of The 4th Workshop on Computational Learning Theory, 243-249, 1991.\n\n[3] W. Feller, An Introduction to Probability Theory and Its Applications, New York: John Wiley and Sons, 1968.\n\n[4] S. Geman, E. Bienenstock, and R. Doursat, \"Neural Networks and the Bias/Variance Dilemma,\" Neural Computation, 4, 1-58, 1992.\n\n[5] J. Moody, \"Generalization, Weight Decay, and Architecture Selection for Nonlinear Learning Systems,\" Proc. of Neural Information Processing Systems, 1991.\n\n[6] S.J. Nowlan and G.E. Hinton, \"Simplifying Neural Networks by Soft Weight Sharing,\" Neural Computation, 4, 473-493, 1992.\n\n[7] R. Reed, \"Pruning Algorithms - A Survey,\" IEEE Trans. Neural Networks, Vol. 4, 740-747, 1993.\n\n[8] S. Solla, \"The Emergence of Generalization Ability in Learning,\" presented at NIPS92.\n\n[9] V. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.\n\n[10] A.S. Weigend and D.E. Rumelhart, \"The Effective Dimension of the Space of Hidden Units,\" Proc. 
of International Joint Conference on Neural Networks, 1992.\n", "award": [], "sourceid": 811, "authors": [{"given_name": "Chuanyi", "family_name": "Ji", "institution": null}]}