{"title": "Network Generality, Training Required, and Precision Required", "book": "Neural Information Processing Systems", "page_first": 219, "page_last": 222, "abstract": null, "full_text": "219 \n\nNetwork  Generality,  Training  Required, \n\nand  PrecisIon Required \n\nJohn  S.  Denker  and Ben S.  Wittner 1 \n\nAT&T  Bell  Laboratories \n\nHolmdel,  New  Jersey 07733 \n\nKeep  your hand on  your wallet. \n- Leon  Cooper,  1987 \n\nAbstract \n\nWe  show  how  to estimate  (1)  the  number  of functions  that  can  be implemented  by  a \nparticular  network  architecture,  (2)  how  much  analog  precision  is  needed  in  the  con(cid:173)\nnections in the network, and (3) the number of training examples the network must see \nbefore it can  be expected  to form  reliable  generalizations. \n\nGenerality versus Training  Data Required \n\nConsider  the following  objectives:  First, the network  should be very  powerful and ver(cid:173)\nsatile,  i.e.,  it  should  implement  any  function  (truth  table)  you  like,  and  secondly,  it \nshould learn easily, forming  meaningful generalizations from  a small number of training \nexamples.  Well, it is  information-theoretically impossible to create such a  network.  We \nwill  present  here a  simplified  argument; a  more complete and sophisticated version can \nbe found  in  Denker et al.  (1987). \n\nIt is  customary to regard learning as  a  dynamical process:  adjusting the weights  (etc.) \nin  a  single  network.  In  order  to  derive  the  results  of  this  paper,  however,  we  take \na  different  viewpoint,  which  we  call  the  ensemble  viewpoint.  Imagine  making  a  very \nlarge  number of replicas of the network.  Each  replica has  the same architecture as  the \noriginal,  but  the  weights  are  set  differently  in  each  case.  
No further adjustment takes place; the \"learning process\" consists of winnowing the ensemble of replicas, searching for the one(s) that satisfy our requirements. \n\nTraining proceeds as follows: We present each item in the training set to every network in the ensemble. That is, we use the abscissa of the training pattern as input to the network, and compare the ordinate of the training pattern to see if it agrees with the actual output of the network. For each network, we keep a score reflecting how many times (and how badly) it disagreed with a training item. Networks with the lowest score are the ones that agree best with the training data. If we had complete confidence in the reliability of the training set, we could at each step simply throw away all networks that disagree. \n\n^1 Currently at NYNEX Science and Technology, 500 Westchester Ave., White Plains, NY 10604 \n\n© American Institute of Physics 1988 \n\nFor definiteness, let us consider a typical network architecture, with N_0 input wires and N_l units in each processing layer l, for l ∈ {1···L}. For simplicity we assume N_L = 1. We recognize the importance of networks with continuous-valued inputs and outputs, but we will concentrate for now on training (and testing) patterns that are discrete, with N ≡ N_0 bits of abscissa and N_L = 1 bit of ordinate. This allows us to classify the networks into bins according to what Boolean input-output relation they implement, and simply consider the ensemble of bins. \n\nThere are 2^(2^N) possible bins. If the network architecture is completely general and powerful, all 2^(2^N) functions will exist in the ensemble of bins. On average, one expects that each training item will throw away at most half of the bins. 
Assuming maximal efficiency, if m training items are used, then when m ≈ 2^N there will be only one bin remaining, and that must be the unique function that consistently describes all the data. But there are only 2^N possible abscissas using N bits. Therefore a truly general network cannot possibly exhibit meaningful generalization: 100% of the possible data is needed for training. \n\nNow suppose that the network is not completely general, so that even with all possible settings of the weights we can only create functions in 2^(S_0) bins, where S_0 < 2^N. We call S_0 the initial entropy of the network. A more formal and general definition is given in Denker et al. (1987). Once again, we can use the training data to winnow the ensemble, and when m ≈ S_0, there will be only one remaining bin. That function will presumably generalize correctly to the remaining 2^N - m possible patterns. Certainly that function is the best we can do with the network architecture and the training data we were given. \n\nThe usual problem with automatic learning is this: If the network is too general, S_0 will be large, and an inordinate amount of training data will be required. The required amount of data may be simply unavailable, or it may be so large that training would be prohibitively time-consuming. This shows the critical importance of building a network that is not more general than necessary. \n\nEstimating the Entropy \n\nIn real engineering situations, it is important to be able to estimate the initial entropy of various proposed designs, since that determines the amount of training data that will be required. Calculating S_0 directly from the definition is prohibitively difficult, but we can use the definition to derive useful approximate expressions. 
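The winnowing argument above is easy to check numerically. The sketch below is our illustration, not part of the paper; the names (bins, survivors, target) are ours. For N = 3 input bits, a fully general ensemble holds 2^(2^3) = 256 bins, and each training item discards exactly half of them, so all 2^N = 8 items are needed before a single bin remains:

```python
# Toy winnowing of a fully general ensemble (illustrative sketch only).
# Each Boolean function of N bits is encoded as a tuple of 2**N output bits,
# indexed in the same order as the list of abscissas.
from itertools import product
import random

N = 3
abscissas = list(product([0, 1], repeat=N))   # the 2**N = 8 possible inputs
bins = list(product([0, 1], repeat=2 ** N))   # one bin per Boolean function
assert len(bins) == 2 ** (2 ** N)             # 256 bins: fully general

random.seed(0)
target = random.choice(bins)                  # the function generating the data

survivors = bins
for m in range(len(abscissas)):
    # keep only the bins whose ordinate agrees with training item m
    survivors = [f for f in survivors if f[m] == target[m]]
    # each item throws away exactly half of the remaining bins
    assert len(survivors) == 2 ** (2 ** N - (m + 1))

assert survivors == [target]                  # one bin left only after all 8 items
```

A network restricted to 2^(S_0) bins would, by the same loop, be pinned down after roughly S_0 informative training items rather than 2^N.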
(You wouldn't want to calculate the thermodynamic entropy of a bucket of water directly from the definition, either.) \n\nSuppose that the weights in the network at each connection i were not continuously adjustable real numbers, but rather were specified by a discrete code with b_i bits. Then the total number of bits required to specify the configuration of the network is \n\nB = Σ_i b_i   (1) \n\nNow the total number of functions that could possibly be implemented by such a network architecture would be at most 2^B. The actual number will always be smaller than this, since there are various ways in which different settings of the weights can lead to identical functions (bins). For one thing, for each hidden layer l ∈ {1···L-1}, the numbering of the hidden units can be permuted, and the polarity of the hidden units can be flipped, which means that 2^(S_0) is less than 2^B by a factor (among others) of Π_l N_l! 2^(N_l). In addition, if there is an inordinately large number of bits b_i at each connection, there will be many settings where small changes in the connection will be immaterial. This will make 2^(S_0) smaller by an additional factor. We expect ∂S_0/∂b_i ≈ 1 when b_i is small, and ∂S_0/∂b_i ≈ 0 when b_i is large; we must now figure out where the crossover occurs. \n\nThe number of \"useful and significant\" bits of precision, which we designate b*, typically scales like the logarithm of the number of connections to the unit in question. This can be understood as follows: suppose there are N connections into a given unit, and an input signal to that unit of some size A is observed to be significant (the exact value of A drops out of the present calculation). Then there is no point in having a weight with magnitude much larger than A, nor much smaller than A/N. 
That is, the dynamic range should be comparable to the number of connections. (This argument is not exact, and it is easy to devise exceptions, but the conclusion remains useful.) If only a fraction 1/S of the units in the previous layer are active (nonzero) at a time, the needed dynamic range is reduced. This implies b* ≈ log(N/S). \n\nNote: our calculation does not involve the dynamics of the learning process. Some numerical methods (including versions of back propagation) commonly require a number of temporary \"guard bits\" on each weight, as pointed out by Richard Durbin (private communication). Another log N bits ought to suffice. These bits are not needed after learning is complete, and do not contribute to S_0. \n\nIf we combine these ideas and apply them to a network with N units in each layer, fully connected, we arrive at the following expression for the number of different Boolean functions that can be implemented by such a network: \n\n2^(S_0) ≈ 2^B   (2) \n\nwhere \n\nB ≈ L N^2 log N   (3) \n\nThese results depend on the fact that we are considering only a very restricted type of processing unit: the output is a monotone function of a weighted sum of inputs. Cover (1965) discussed in considerable depth the capabilities of such units. Valiant (1986) has explored the learning capabilities of various models of computation. \n\nAbu-Mostafa has emphasized the principles of information and entropy and applied them to measuring the properties of the training set. At this conference, formulas similar to equation 3 arose in the work of Baum, Psaltis, and Venkatesh, in the context of calculating the number of different training patterns a network should be able to memorize. 
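The estimates b* ≈ log(N/S) and B ≈ L N^2 log N lend themselves to back-of-envelope evaluation. The sketch below is ours, not the paper's; it assumes log means log base 2 (the text does not fix the base), and the helper names are hypothetical:

```python
# Rough evaluation of b* ~ log(N/S) and B ~ L * N**2 * log N (our sketch,
# assuming log base 2; the paper leaves the base unspecified).
from math import log2

def bits_per_weight(fan_in, active_fraction=1.0):
    # b*: enough bits to span weight magnitudes from about A/N up to A,
    # reduced when only a fraction 1/S of the previous layer is active
    return log2(fan_in * active_fraction)

def total_bits(layers, width):
    # B ~ L * N**2 * log N for a fully connected net with N units per layer
    return layers * width ** 2 * log2(width)

L, N = 3, 64
print(bits_per_weight(N))        # 6.0 bits of precision per weight
print(int(total_bits(L, N)))     # 73728 bits, so at most 2**73728 functions
```

The point of the exercise is the scaling: doubling the width N quadruples the connection count but adds only one bit per weight, so B, and hence the training data required, grows roughly as N^2 log N.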
We originally proposed equation 2 as an estimate of the number of patterns the network would have to memorize before it could form a reliable generalization. The basic idea, which has numerous consequences, is to estimate the number of (bins of) networks that can be realized. \n\nReferences \n\n1. Yaser Abu-Mostafa, these proceedings. \n\n2. Eric Baum, these proceedings. \n\n3. T. M. Cover, \"Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,\" IEEE Trans. Elec. Comp., EC-14, 326-334 (June 1965). \n\n4. John Denker, Daniel Schwartz, Ben Wittner, Sara Solla, John Hopfield, Richard Howard, and Lawrence Jackel, Complex Systems, in press (1987). \n\n5. Demetri Psaltis, these proceedings. \n\n6. L. G. Valiant, SIAM J. Comput. 15(2), 531 (1986), and references therein. \n\n7. Santosh Venkatesh, these proceedings. \n", "award": [], "sourceid": 16, "authors": [{"given_name": "John", "family_name": "Denker", "institution": null}, {"given_name": "Ben", "family_name": "Wittner", "institution": null}]}