{"title": "Statistical Mechanics of Learning in a Large Committee Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 523, "page_last": 530, "abstract": null, "full_text": "Statistical Mechanics of Learning in a Large Committee Machine\n\nHolm Schwarze\nCONNECT, The Niels Bohr Institute\nBlegdamsvej 17, DK-2100 Copenhagen \u00d8, Denmark\n\nJohn Hertz*\nNordita\nBlegdamsvej 17, DK-2100 Copenhagen \u00d8, Denmark\n\n* Address in 1993: Laboratory of Neuropsychology, NIMH, Bethesda, MD 20892, USA\n\nAbstract\n\nWe use statistical mechanics to study generalization in large committee machines. For an architecture with nonoverlapping receptive fields a replica calculation yields the generalization error in the limit of a large number of hidden units. For continuous weights the generalization error falls off asymptotically inversely proportional to $\\alpha$, the number of training examples per weight. For binary weights we find a discontinuous transition from poor to perfect generalization followed by a wide region of metastability. Broken replica symmetry is found within this region at low temperatures. For a fully connected architecture the generalization error is calculated within the annealed approximation. For both binary and continuous weights we find transitions from a symmetric state to one with specialized hidden units, accompanied by discontinuous drops in the generalization error.\n\n1 Introduction\n\nThere has been a good deal of theoretical work on calculating the generalization ability of neural networks within the framework of statistical mechanics (for a review see e.g. Watkin et al., 1992; Seung et al., 1992). 
This approach has mostly been applied to single-layer nets (e.g. Gyorgyi and Tishby, 1990; Seung et al., 1992). Extensions to networks with a hidden layer include a model with small hidden receptive fields (Sompolinsky and Tishby, 1990), some general results on networks whose outputs are continuous functions of their inputs (Seung et al., 1992; Krogh and Hertz, 1992), and calculations for a so-called committee machine (Nilsson, 1965), a two-layer Boolean network which implements a majority decision of the hidden units (Schwarze et al., 1992; Schwarze and Hertz, 1992; Mato and Parga, 1992; Barkai et al., 1992; Engel et al., 1992). This model has previously been studied when learning a function which could be implemented by a simple perceptron (i.e. one with no hidden units) in the high-temperature (i.e. high-noise) limit (Schwarze et al., 1992). In most practical applications, however, the function to be learnt is not linearly separable. Therefore, we consider here a committee machine trained on a rule which is itself defined by another committee machine (the 'teacher' network) and hence is not linearly separable.\n\nWe calculate the generalization error, the probability of misclassifying an arbitrary new input, as a function of $\\alpha$, the ratio of the number of training examples $P$ to the number of adjustable weights in the network. First we present results for the 'tree' committee machine, a restricted version of the model in which the receptive fields of the hidden units do not overlap. In section 3 we study a fully connected architecture allowing for correlations between different hidden units in the student network. In both cases we study a large-net limit in which the total number of inputs ($N$) and the number of hidden units ($K$) both go to infinity, but with $K \\ll N$. 
\n\n2 Committee machine with nonoverlapping receptive fields\n\nIn this model each hidden unit receives its input from $N/K$ input units, subject to the restriction that different hidden units do not share common inputs. Therefore there is only one path from each input unit to the output. The hidden-output weights are all fixed to +1 so as to implement a majority decision of the hidden units. The overall network output for inputs $S_l \\in \\mathbb{R}^{N/K}$, $l = 1, \\ldots, K$, to the $K$ branches is given by\n\n$$\\sigma(\\{S_l\\}) = \\mathrm{sign}\\left( \\frac{1}{\\sqrt{K}} \\sum_{l=1}^{K} \\sigma_l(S_l) \\right), \\qquad (1)$$\n\nwhere $\\sigma_l$ is the output of the $l$th hidden unit, given by\n\n$$\\sigma_l(S_l) = \\mathrm{sign}\\left( \\sqrt{\\frac{K}{N}}\\, W_l \\cdot S_l \\right). \\qquad (2)$$\n\nHere $W_l$ is the $N/K$-dimensional weight vector connecting the input with the $l$th hidden unit. The training examples $(\\{\\xi^\\mu_l\\}, \\tau(\\{\\xi^\\mu_l\\}))$, $\\mu = 1, \\ldots, P$, are generated by another committee machine with weight vectors $V_l$ and an overall output $\\tau(\\{\\xi^\\mu_l\\})$, defined analogously to (1). There are $N$ adjustable weights in the network, and therefore we have $\\alpha = P/N$.\n\nAs in the corresponding calculations for simple perceptrons (Gardner and Derrida, 1988; Gyorgyi and Tishby, 1990; Seung et al., 1992), we consider a stochastic learning algorithm which for long training times yields a Gibbs distribution of networks. The statistical mechanics approach starts out from the partition function $Z = \\int d\\rho_0(\\{W_l\\})\\, e^{-\\beta E(\\{W_l\\})}$, an integral over weight space with a priori measure $\\rho_0(\\{W_l\\})$, weighted with a thermal factor $e^{-\\beta E(\\{W_l\\})}$, where $E$ is the total error on the training examples\n\n$$E(\\{W_l\\}) = \\sum_{\\mu=1}^{P} \\Theta\\left[ -\\sigma(\\{\\xi^\\mu_l\\})\\, \\tau(\\{\\xi^\\mu_l\\}) \\right]. \\qquad (3)$$\n\nThe formal temperature $T = 1/\\beta$ defines the level of noise during the training process. 
For $T = 0$ this procedure corresponds to simply minimizing the training error $E$.\n\nFrom this the average free energy $F = -T \\langle\\langle \\ln Z \\rangle\\rangle$, averaged over all possible sets of training examples, can be calculated using the replica method (for details see Schwarze and Hertz, 1992). Like the calculations for simple perceptrons, our theory has two sets of order parameters:\n\n$$q_l^{\\alpha\\beta} = \\frac{K}{N}\\, W_l^\\alpha \\cdot W_l^\\beta, \\qquad R_l^\\alpha = \\frac{K}{N}\\, W_l^\\alpha \\cdot V_l,$$\n\nwhere the superscripts $\\alpha$ and $\\beta$ label replicas. Note that these are the only order parameters in this model. Due to the tree structure no correlations between different hidden units exist. Assuming both replica symmetry and 'translational symmetry' we are left with two parameters: $q$, the pattern average of the square of the average input-hidden weight vector, and $R$, the average overlap between this weight vector and a corresponding one for the teacher.\n\nWe then obtain expressions for the replica-symmetric free energy of the form $G(q, R, \\hat{q}, \\hat{R}) = \\alpha\\, G_1(q, R) + G_2(q, R, \\hat{q}, \\hat{R})$, where the 'entropy' terms $G_2$ for the continuous- and binary-weight cases are exactly the same as in the simple perceptron (Gyorgyi and Tishby, 1990; Seung et al., 1992). In the large-$K$ limit another simplification similar to the zero-temperature capacity calculation (Barkai et al., 1992) is found in the tree model. The 'energy' term $G_1$ is the same as the corresponding term in the calculation for the simple perceptron, except that the order parameters have to be replaced by $f(q) = (2/\\pi) \\sin^{-1} q$ and $f(R) = (2/\\pi) \\sin^{-1} R$. The generalization error\n\n$$\\epsilon_g = \\frac{1}{\\pi} \\arccos\\left[ f(R) \\right] \\qquad (4)$$\n\ncan then be obtained from the value of $R$ at the saddle point of the free energy. 
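\n\nAs a consistency check on (4), offered here as a worked illustration rather than a result quoted from the calculation, the two limiting overlaps can be evaluated directly from $f(R) = (2/\\pi) \\sin^{-1} R$:\n\n$$R = 0: \\; f(0) = 0, \\; \\epsilon_g = \\frac{1}{\\pi} \\arccos 0 = \\frac{1}{2}; \\qquad R = 1: \\; f(1) = 1, \\; \\epsilon_g = \\frac{1}{\\pi} \\arccos 1 = 0.$$\n\nA student uncorrelated with the teacher therefore guesses at chance level, while perfect overlap gives perfect generalization. 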
\n\nFor a network with continuous weights, the solution of the saddle point equations yields an algebraically decreasing generalization error. There is no phase transition at any value of $\\alpha$ or $T$. For $T = 0$ the asymptotic form of the generalization error in powers of $1/\\alpha$ can be easily obtained as $1.25/\\alpha + O(1/\\alpha^2)$, twice the $\\epsilon_g$ found for the simple perceptron in this limit.\n\nFigure 1: Learning curve for the large-$K$ tree committee (solid line) with binary weights at $T = 1$. The phase transition occurs at $\\alpha_c = 1.98$, and the spinodal point is at $\\alpha_s = 3.56$. The analytic results are compared with Monte Carlo simulations with $K = 9$, $N = 75$ and $T = 1$, averaged over 10 runs. In each simulation the number of training examples is gradually increased (dotted line) and decreased (dashed line), respectively. The broken line shows the generalization error for the simple perceptron.\n\nIn contrast, the model with binary weights exhibits a phase transition at all temperatures from poor to perfect generalization. The corresponding generalization error as a function of $\\alpha$ is shown in figure 1. At small values of $\\alpha$ the free energy has two saddle points, one at $R < 1$ and the other at $R = 1$. Initially the solution with $R < 1$ and poor generalization ability has the lower free energy and therefore corresponds to the equilibrium state. When the load parameter is increased to a critical value $\\alpha_c$, the situation changes and the solution at $R = 1$ becomes the global minimum of the free energy. The system exhibits a first order phase transition to the state of perfect generalization. 
In the region $\\alpha_c < \\alpha < \\alpha_s$ the $R < 1$ solution remains metastable and disappears at the spinodal point $\\alpha_s$. We find the same qualitative picture at all temperatures, and the complete replica-symmetric phase diagram is shown in figure 2. The solid line corresponds to the phase transition to perfect generalization, and in the region between the solid and the dashed lines the $R < 1$ state of poor generalization is metastable. Below the dotted line, the replica-symmetric solution yields a negative entropy for the metastable state. This is unphysical in a binary system and replica symmetry has to be broken in this region, indicating the existence of many different metastable states.\n\nFigure 2: Replica-symmetric phase diagram of the large-$K$ tree committee machine with binary weights. The solid line shows the locations of the phase transition, and the spinodal line is shown dashed. Below the dotted line the replica-symmetric solution is incorrect. [Plot regions labelled: $R < 1$, metastability, RSB, $R = 1$.]\n\nThe simple perceptron without hidden units corresponds to the case $K = 1$ in our model. A comparison of the generalization properties with the large-$K$ limit shows that both limits exhibit qualitatively similar behavior. The locations of the thermodynamic transitions and the spinodal line, however, are different and the generalization error of the $R < 1$ state in the large-$K$ committee machine is higher than in the simple perceptron.\n\nThe case of general finite $K$ is rather more involved, but the annealed approximation for finite $K$ indicates a rather smooth $K$-dependence for $1 < K < \\infty$ (Mato and Parga, 1992).\n\nWe performed Monte Carlo simulations to check the validity of the assumptions made in our calculation and found good agreement with our analytic results. Figure 1 compares the analytic predictions for large $K$ with Monte Carlo simulations for $K = 9$. The simulations were performed for a slowly increasing and decreasing training set size, respectively, yielding a hysteresis loop around the location of the phase transition.\n\n3 Fully connected committee machine\n\nIn contrast to the previous model the hidden units in the fully connected committee machine receive inputs from the entire input layer. Their output for a given $N$-dimensional input vector $S$ is given by\n\n$$\\sigma_l(S) = \\mathrm{sign}\\left( \\frac{1}{\\sqrt{N}}\\, W_l \\cdot S \\right), \\qquad (5)$$\n\nwhile the overall output is again of the form (1). Note that the weight vectors $W_l$ are now $N$-dimensional, and the load parameter is given by $\\alpha = P/(KN)$.\n\nFor this model we solved the annealed approximation, which replaces $\\langle\\langle \\ln Z \\rangle\\rangle$ by $\\ln \\langle\\langle Z \\rangle\\rangle$. This approximation becomes exact at high temperatures (high noise level during training). For learnable target rules, as in the present problem, previous work indicates that the annealed approximation yields qualitatively correct results and correctly predicts the shape of the learning curves even at low temperatures (Seung et al., 1992). Performing the average over all possible training sets again leads to two sets of order parameters: the overlaps between the student and teacher weight vectors, $R_{lk} = N^{-1}\\, W_l \\cdot V_k$, and the mutual overlaps in the student network, $C_{lk} = N^{-1}\\, W_l \\cdot W_k$. The weight vectors of the target rule are assumed to be uncorrelated and normalized, $N^{-1}\\, V_l \\cdot V_k = \\delta_{lk}$. 
As in the previous model we make symmetry assumptions for the order parameters. In the fully connected architecture we have to allow for correlations between different hidden units ($R_{lk}, C_{lk} \\neq 0$ for $l \\neq k$) but also include the possibility of a specialization of individual units ($R_{ll} \\neq R_{lk}$). This is necessary because the ground state of the system with vanishing generalization error is achieved for the choice $R_{lk} = C_{lk} = \\delta_{lk}$. Therefore we make the ansatz\n\n$$R_{lk} = R + \\Delta\\, \\delta_{lk}, \\qquad C_{lk} = C + (1 - C)\\, \\delta_{lk} \\qquad (6)$$\n\nand evaluate the annealed free energy of the system using the saddle point method (details will be reported elsewhere). The values of the order parameters at the minimum of the free energy finally yield the average generalization error $\\epsilon_g$ as a function of $\\alpha$.\n\nFor a network with continuous weights and small $\\alpha$ the global minimum of the free energy occurs at $\\Delta = 0$ and $R \\sim O(K^{-3/4})$. Hence, for small training sets each hidden unit in the student network has a small symmetric overlap to all the hidden units in the teacher network. The information obtained from the training examples is not sufficient for a specialization of hidden units, and the generalization error approaches a plateau. To order $1/\\sqrt{K}$, this approach is given by\n\n$$\\epsilon_g = \\epsilon_0 + [\\ldots] + O(1/K), \\qquad \\epsilon_0 = \\frac{1}{\\pi} \\arccos\\left( \\sqrt{2/\\pi} \\right) \\approx 0.206, \\qquad (7)$$\n\nwith $\\gamma(\\beta) = \\sqrt{\\pi/2 - 1}\\, \\left[ (1 - e^{-\\beta})^{-1} - \\epsilon_0 \\right] / (4\\pi)$. Figure 3 shows the generalization error as a function of $\\alpha$, including $1/\\sqrt{K}$-corrections for different values of $K$.\n\nFigure 3: Generalization error for continuous weights and $T = 0.5$. The approach to the residual error is shown including $1/\\sqrt{K}$-corrections for $K = 5$ (solid line), $K = 11$ (dotted line), and $K = 100$ (dashed line). The broken line corresponds to the solution with nonvanishing $\\Delta$.\n\nWhen the training set size is increased to a critical value $\\alpha_s$ of the load parameter, a second minimum of the free energy appears at a finite value of $\\Delta$ close to 1. For a larger value $\\alpha_c > \\alpha_s$ this becomes the global minimum of the free energy and the system exhibits a first order phase transition. The generalization error of the specialized solution decays smoothly with an asymptotic behavior inversely proportional to $\\alpha$. However, the poorly generalizing symmetric state remains metastable for all $\\alpha > \\alpha_c$. Therefore, a stochastic learning procedure starting with $\\Delta = 0$ will first settle into the metastable state. For large $N$ it will take an exponentially long time to cross the free energy barrier to the global minimum of the free energy.\n\nIn a network with binary weights and for large $K$ we find the same initial approach to a finite generalization error as in (7) for continuous weights. In the large-$K$ limit the discreteness of the weights does not influence the behavior for small training sets. However, while a perfect match of the student to the teacher network ($R_{lk} = C_{lk} = \\delta_{lk}$) cannot happen for $\\alpha < \\infty$ in the continuous model, such a 'freezing' is possible in a discrete system. The free energy of the binary model always has a local minimum at $R_{lk} = C_{lk} = \\delta_{lk}$. 
When the load parameter is increased to a critical value, this minimum becomes the global minimum of the free energy, and a discontinuous transition into this perfectly generalizing state occurs, just as in the binary-weight simple perceptron and the tree described in section 2. As in the case of continuous weights, the symmetric solution remains metastable here even for large values of $\\alpha$. Figure 4 shows the generalization error for binary weights, including $1/\\sqrt{K}$-corrections for $K = 5$. The predictions of the large-$K$ theory are compared with Monte Carlo simulations. Although we cannot expect a good quantitative agreement for such a small committee, the simulations support our qualitative results. Note that the leading order correction to $\\epsilon_0$ in eqn. (7) is only small for $\\alpha \\gg 1/K$. However, we have obtained a different solution, which is valid for $\\alpha \\sim O(1/K)$. The corresponding generalization error is shown as a dotted line in figure 4.\n\nFigure 4: Generalization error for binary weights at $T = 5$. The large-$K$ theory for different regions of $\\alpha$ is compared with simulations for $K = 5$ and $N = 45$, averaged over all simulations (+) and over simulations in which no freezing occurred (*), respectively. The solid line shows the finite-$\\alpha$ results including $1/\\sqrt{K}$-corrections. The dotted line shows the small-$\\alpha$ solution.\n\nCompared to the tree model the fully connected committee machine shows a qualitatively different behavior. This difference is particularly pronounced in the continuous model. 
While the generalization error of the tree architecture decays smoothly for all values of $\\alpha$, the fully connected model exhibits a discontinuous phase transition. Compared to the tree model, the fully connected architecture has an additional symmetry, because each permutation of hidden units in the student network yields the same output for a given input (Barkai et al., 1992). This additional degree of freedom causes the poor generalization ability for small training sets. Only if the training set size is sufficiently large can the hidden units specialize on one of the hidden units in the teacher network and achieve good generalization. However, the poorly generalizing states remain metastable even for arbitrarily large $\\alpha$. A similar phenomenon has also been found in a different architecture with only 2 hidden units performing a parity operation (Hansel et al., 1992).\n\nAcknowledgements\n\nH. Schwarze acknowledges support from the EC under the SCIENCE programme and by the Danish Natural Science Council and the Danish Technical Research Council through CONNECT.\n\nReferences\n\nE. Barkai, D. Hansel, and H. Sompolinsky (1992), Phys. Rev. A 45, 4146.\n\nA. Engel, H.M. Kohler, F. Tschepke, H. Vollmayr, and A. Zippelius (1992), Phys. Rev. A 45, 7590.\n\nE. Gardner and B. Derrida (1988), J. Phys. A 21, 271.\n\nG. Gyorgyi and N. Tishby (1990), in Neural Networks and Spin Glasses, edited by K. Thuemann and R. Koberle (World Scientific, Singapore).\n\nD. Hansel, G. Mato, and C. Meunier (1992), Europhys. Lett. 20, 471.\n\nA. Krogh and J. Hertz (1992), Advances in Neural Information Processing Systems IV, edited by J.E. Moody, S.J. Hanson, and R.P. Lippmann (Morgan Kaufmann, San Mateo).\n\nG. Mato and N. Parga (1992), J. Phys. A 25, 5047.\n\nN.J. 
Nilsson (1965), Learning Machines (McGraw-Hill, New York).\n\nH. Schwarze, M. Opper, and W. Kinzel (1992), Phys. Rev. A 45, R6185.\n\nH. Schwarze and J. Hertz (1992), Europhys. Lett. 20, 375.\n\nH.S. Seung, H. Sompolinsky, and N. Tishby (1992), Phys. Rev. A 45, 6056.\n\nH. Sompolinsky and N. Tishby (1990), Europhys. Lett. 13, 567.\n\nT. Watkin, A. Rau, and M. Biehl (1992), to be published in Reviews of Modern Physics.\n", "award": [], "sourceid": 617, "authors": [{"given_name": "Holm", "family_name": "Schwarze", "institution": null}, {"given_name": "John", "family_name": "Hertz", "institution": null}]}