{"title": "Microscopic Equations in Rough Energy Landscape for Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 302, "page_last": 308, "abstract": null, "full_text": "Microscopic Equations in Rough Energy Landscape for Neural Networks\n\nK. Y. Michael Wong\nDepartment of Physics,\nThe Hong Kong University of Science and Technology,\nClear Water Bay, Kowloon, Hong Kong.\nE-mail: phkywong@usthk.ust.hk\n\nAbstract\n\nWe consider the microscopic equations for learning problems in neural networks. The aligning fields of an example are obtained from the cavity fields, which are the fields that would result if that example were absent from the learning process. In a rough energy landscape, we assume that the density of the local minima obeys an exponential distribution, yielding macroscopic properties agreeing with the first step replica symmetry breaking solution. Iterating the microscopic equations provides a learning algorithm, which results in a higher stability than conventional algorithms.\n\n1 INTRODUCTION\n\nMost neural networks learn iteratively by gradient descent. As a result, closed expressions for the final network state after learning are rarely known. This precludes further analysis of their properties, and insights into the design of learning algorithms. To complicate the situation, metastable states (i.e. local minima) are often present in the energy landscape of the learning space so that, depending on the initial configuration, any one of them may be the final state.\n\nHowever, large neural networks are mean field systems since the examples and weights strongly interact with each other during the learning process. 
This means that when one example or weight is considered, the influence of the rest of the system can be regarded as a background satisfying some averaged properties. The situation is similar to a number of disordered systems such as spin glasses, in which mean field theories are applicable (Mezard, Parisi & Virasoro, 1987). This explains the success of statistical mechanical techniques such as the replica method in deriving the macroscopic properties of neural networks, e.g. the storage capacity (Gardner & Derrida, 1988) and the generalization ability (Watkin, Rau & Biehl, 1993). The replica method, though, provides much less information on the microscopic conditions of the individual dynamical variables.\n\nAn alternative mean field approach is the cavity method. It is a generalization of the Thouless-Anderson-Palmer approach to spin glasses, which started from microscopic equations of the system elements (Thouless, Anderson & Palmer, 1977). Mezard applied the method to neural network learning (Mezard, 1989). Subsequent extensions were made to the teacher-student perceptron (Bouten, Schietse & Van den Broeck, 1995), the AND machine (Griniasty, 1993) and the multiclass perceptron (Gerl & Krey, 1995). They yielded macroscopic properties identical to those of the replica approach, but the microscopic equations were not discussed, and the existence of local minima was neglected.\n\nRecently, the cavity method was applied to general classes of single and multilayer networks with smooth energy landscapes, i.e. without the local minima (Wong, 1995a). The aligning fields of the examples satisfy a set of microscopic equations. 
Solving these equations iteratively provides a learning algorithm, as confirmed by simulations in the maximally stable perceptron and the committee tree. The method is also useful in solving the dynamics of feedforward networks which were unsolvable previously (Wong, 1995b).\n\nDespite its success, the theory is so far applicable only to the regime of smooth energy landscapes. Beyond this regime, a stability condition is violated, and local minima begin to appear (Wong, 1995a). In this paper I present a mean field theory for the regime of rough energy landscapes. The complete analysis will be published elsewhere and here I sketch the derivations, emphasizing the underlying physical picture. As shown below, a similar set of microscopic equations holds in this case, as confirmed by simulations in the committee tree. In fact, we find that the solutions to these equations have a higher stability than those found by conventional learning algorithms.\n\n2 MICROSCOPIC EQUATIONS FOR SMOOTH ENERGY LANDSCAPES\n\nWe proceed by reviewing the cavity method for the case of smooth energy landscapes. For illustration we consider the single layer neural network (for two layer networks see Wong, 1995a). There are $N \\gg 1$ input nodes $\\{S_j\\}$ connecting to a single output node by the synaptic weights $\\{J_j\\}$. The output state is determined by the sign of the local field at the output node, i.e. $S_{\\rm out} = {\\rm sgn}(\\sum_j J_j S_j)$. Learning a set of $p$ examples means finding the weights $\\{J_j\\}$ such that the network gives the correct input-to-output mapping for the examples. If example $\\mu$ maps the inputs $S_j^\\mu$ to the output $O^\\mu$, then a successful learning process should find a weight vector $J_j$ such that ${\\rm sgn}(\\sum_j J_j \\xi_j^\\mu) = 1$, where $\\xi_j^\\mu = O^\\mu S_j^\\mu$. 
Thus the usual approach to learning is to first define an energy function (or error function) $E = \\sum_\\mu g(\\Lambda_\\mu)$, where $\\Lambda_\\mu \\equiv \\sum_j J_j \\xi_j^\\mu / \\sqrt{N}$ are the aligning fields, i.e. the local fields in the direction of the correct output, normalized by the factor $\\sqrt{N}$. For example, the Adatron algorithm uses the energy function $g(\\Lambda) = (\\kappa - \\Lambda)\\Theta(\\kappa - \\Lambda)$ where $\\kappa$ is the stability parameter and $\\Theta$ is the step function (Anlauf & Biehl, 1989). Next, one should minimize $E$ by gradient descent dynamics. To avoid ambiguity, the weights are normalized to $\\sum_j S_j^2 = \\sum_j J_j^2 = N$.\n\nThe cavity method uses a self-consistency argument to consider what happens when a set of $p$ examples is expanded to $p + 1$ examples. The central quantity in this method is the cavity field. For an added example labelled 0, the cavity field is the aligning field when it is fed to a network which learns examples 1 to $p$ (but never learns example 0), i.e. $t_0 \\equiv \\sum_j J_j \\xi_j^0 / \\sqrt{N}$. Since the original network has no information about example 0, $J_j$ and $\\xi_j^0$ are uncorrelated. Thus the cavity field obeys a Gaussian distribution for random example inputs.\n\nAfter the network has learned examples 0 to $p$, the weights adjust from $\\{J_j\\}$ to $\\{J_j^0\\}$, and the cavity field $t_0$ adjusts to the generic aligning field $\\Lambda_0$. As shown schematically in Fig. 1(a), we assume that the adjustments of the aligning fields of the original examples are small, typically of the order $O(N^{-1/2})$. Perturbative analysis concludes that the aligning field is a well defined function of the cavity field, i.e. $\\Lambda_0 = \\lambda(t_0)$ where $\\lambda(t)$ is the inverse function of\n\n$$t = \\lambda + \\gamma g'(\\lambda), \\qquad (1)$$\n\nand $\\gamma$ is called the local susceptibility. 
The cavity fields satisfy a set of self-consistent equations\n\n$$t_\\mu = \\sum_{\\nu \\neq \\mu} [\\lambda(t_\\nu) - t_\\nu] Q_{\\nu\\mu} + \\alpha\\chi\\lambda(t_\\mu), \\qquad (2)$$\n\nwhere $Q_{\\nu\\mu} = \\sum_j \\xi_j^\\nu \\xi_j^\\mu / N$. $\\chi$ is called the nonlocal susceptibility, and $\\alpha \\equiv p/N$. The weights $J_j$ are given by\n\n$$J_j = (1 - \\alpha\\chi)^{-1} \\frac{1}{\\sqrt{N}} \\sum_\\mu [\\lambda(t_\\mu) - t_\\mu] \\xi_j^\\mu. \\qquad (3)$$\n\nNoting the Gaussian distribution of the cavity fields, the macroscopic properties of the neural network, such as the storage capacity, can be derived, and the results are identical to those obtained by the replica method (Gardner & Derrida, 1988).\n\nHowever, the real advantage of the cavity method lies in the microscopic information it provides. The above equations can be iterated sequentially, resulting in a general learning algorithm. Simulations confirm that the equations are satisfied in the single layer perceptron, and their generalized version holds in the committee tree at low loading (Wong, 1995a).\n\nFigure 1: Schematic drawing of the change in the energy landscape in the weight space when example 0 is added, for the regime of (a) smooth energy landscape, (b) rough energy landscape.\n\n3 MICROSCOPIC EQUATIONS FOR ROUGH ENERGY LANDSCAPES\n\nHowever, the above argument holds under the assumption that the adjustment due to the addition of a new example is controllable. We can derive a stability condition for this assumption, and we find that it is equivalent to the Almeida-Thouless condition in the replica method (Mezard, Parisi & Virasoro, 1987).\n\nAn example of such an instability occurs in the committee tree, which consists of hidden nodes $a = 1, \\ldots, K$ with binary outputs, each fed by one of $K$ nonoverlapping groups of $N/K$ input nodes. The output of the committee tree is the majority state of the $K$ hidden nodes. 
The solution in the cavity method minimizes the change from the cavity fields $\\{t_a\\}$ to the aligning fields $\\{\\Lambda_a\\}$, as measured by $\\sum_a (\\Lambda_a - t_a)^2$ in the space of correct outputs. Thus for a stability parameter $\\kappa$, $\\Lambda_a = \\kappa$ when $t_a < \\kappa$ and the value of $t_a$ is above median among the $K$ hidden nodes, otherwise $\\Lambda_a = t_a$. Note that a discontinuity exists in the aligning field function. Now suppose $t_a < \\kappa$ is the median, but the next highest value $t_b$ happens to be slightly less than $t_a$. Then the addition of example 0 may induce a change from $t_b < t_a$ to $t_b^0 > t_a^0$. Hence $\\Lambda_b^0$ changes from $t_b$ to $\\kappa$ whereas $\\Lambda_a^0$ changes from $\\kappa$ to $t_a^0$. The adjustment of the system is no longer small, and the previous perturbative analysis is not valid. In fact, it has been shown that all networks having a gap in the aligning field function are not stable against the addition of examples (Wong, 1995a).\n\nTo consider what happens beyond the stability regime, one has to take into account the rough energy landscape of the learning space. Suppose that the original global minimum for examples 1 to $p$ is $\\alpha$. After adding example 0, a nonvanishing change to the system is induced, and the global minimum shifts to the neighborhood of the local minimum $\\beta$, as schematically shown in Fig. 1(b). Hence the resultant aligning fields $\\Lambda_0^\\beta$ are no longer well-defined functions of the cavity fields $t_0^\\alpha$. Instead they are well-defined functions of the cavity fields $t_0^\\beta$. Nevertheless, one may expect that correlations exist between the states $\\alpha$ and $\\beta$.\n\nLet $q_0$ be the correlation between the network states, i.e. $\\langle J_j^\\alpha J_j^\\beta \\rangle = q_0$. Since both states $\\alpha$ and $\\beta$ are determined in the absence of the added example 0, the correlation $\\langle t_0^\\alpha t_0^\\beta \\rangle = q_0$ as well. 
Knowing that both $t_0^\\alpha$ and $t_0^\\beta$ obey Gaussian distributions, the cavity field distribution can be determined if we know the prior distribution of the local minima.\n\nAt this point we introduce the central assumption in the cavity method for rough energy landscapes: we assume that the number of local minima at energy $E$ obeys an exponential distribution $d\\mathcal{N}(E) = C \\exp(-wE) dE$. Similar assumptions have been used in specifying the density of states in disordered systems (Mezard, Parisi & Virasoro, 1987). Thus for single layer networks (and for two layer networks with appropriate generalizations), the cavity field distribution is given by\n\n$$P(t_0^\\beta | t_0^\\alpha) = \\frac{G(t_0^\\beta | t_0^\\alpha) \\exp[-w \\Delta E(\\lambda(t_0^\\beta))]}{\\int dt_0^\\beta \\, G(t_0^\\beta | t_0^\\alpha) \\exp[-w \\Delta E(\\lambda(t_0^\\beta))]}, \\qquad (4)$$\n\nwhere $G(t_0^\\beta | t_0^\\alpha)$ is a Gaussian distribution, $w$ is a parameter describing the distribution, and $\\lambda(t_0^\\beta)$ is the aligning field function. The weights $J_j^\\beta$ are given by\n\n$$J_j^\\beta = (1 - \\alpha\\chi)^{-1} \\frac{1}{\\sqrt{N}} \\sum_\\mu [\\lambda(t_\\mu^\\beta) - t_\\mu^\\beta] \\xi_j^\\mu. \\qquad (5)$$\n\nNoting the Gaussian distribution of the cavity fields, self-consistent equations for both $q_0$ and the local susceptibility $\\gamma$ can be derived.\n\nTo determine the distribution of local minima, namely the parameters $C$ and $w$, we introduce a \"free energy\" $F(p, N)$ for $p$ examples and $N$ input nodes, given by $d\\mathcal{N}(E) = \\exp[w(F(p, N) - E)] dE$. This \"free energy\" determines the averaged energy of the local minima and should be an extensive quantity, i.e. it should scale with the system size. Cavity arguments enable us to find an expression for $F(p + 1, N) - F(p, N)$. Similarly, we may consider a cavity argument for the addition of one input node, expanding the network size from $N$ to $N + 1$. This yields an expression for $F(p, N + 1) - F(p, N)$. 
Since $F$ is an extensive quantity, $F(p, N)$ should scale as $N$ for a given ratio $\\alpha = p/N$. This implies\n\n$$\\frac{F}{N} = \\alpha (F(p + 1, N) - F(p, N)) + (F(p, N + 1) - F(p, N)). \\qquad (6)$$\n\nWe have thus obtained an expression for the averaged energy of the local minima. Minimizing the free energy with respect to the parameter $w$ gives a self-consistent equation.\n\nThe three equations for $q_0$, $\\gamma$ and $w$ completely determine the model. The macroscopic properties of the neural network, such as the storage capacity, can be derived, and the results are identical to the first step replica symmetry breaking solution in the replica method.\n\nIt remains to check whether the microscopic equations have been modified due to the roughening of the energy landscape. It turns out that while the cavity fields in the initial state $\\alpha$ do not satisfy the microscopic equations (2), those at the final metastable state $\\beta$ do, except that the nonlocal susceptibility $\\chi$ has to be replaced by its average over the distribution of the local minima. In fact, the nonlocal susceptibility describes the reactive effects due to the background examples, which adjust on the addition of the new example. (Technically, this is called the Onsager reaction.) The adjustments due to hopping between valleys in a rough energy landscape have thus been taken into account.\n\n4 SIMULATION RESULTS\n\nTo verify the theory, I simulate a committee tree learning random examples. Learning can be done by the more conventional Least Action algorithm (Nilsson, 1965), or by iterating the microscopic equations.\n\nWe verify that the Least Action algorithm yields an aligning field function $\\lambda(t)$ consistent with the cavity theory. Suppose the weights from input $j$ to hidden node $a$ are given by $J_{aj} = \\sum_\\mu x_{a\\mu} \\xi_j^\\mu / \\sqrt{N}$. 
Comparing with $J_{aj} = (1 - \\alpha\\chi)^{-1} \\sum_\\mu (\\Lambda_{a\\mu} - t_{a\\mu}) \\xi_j^\\mu / \\sqrt{N}$, we estimate the nonlocal susceptibility $\\chi$ by requiring the distribution of $\\tilde{t}_{a\\mu} \\equiv \\Lambda_{a\\mu} - (1 - \\alpha\\chi) x_{a\\mu}$ to have a zero first moment. $\\tilde{t}_{a\\mu}$ is then an estimate of $t_{a\\mu}$. Fig. 2 shows the resultant relation between $\\Lambda_{a\\mu}$ and $\\tilde{t}_{a\\mu}$. It agrees with the predictions of the cavity theory. Fig. 3 shows the values of the stability parameter $\\kappa$, measured from the Least Action algorithm and the microscopic equations. They agree better with the predictions of the rough energy landscape (first step replica symmetry breaking solution) than with those of the smooth energy landscape (replica symmetric solution). Note that the microscopic equations yield a higher stability than the Least Action algorithm.\n\nFigure 2: The aligning fields versus the cavity fields for a branch of the committee tree with $K = 3$, $\\alpha = 0.8$ and $N = 600$. The dashed line is the prediction of the cavity theory for the regime of rough energy landscape.\n\nFigure 3: The stability parameter $\\kappa$ versus the storage level $\\alpha$ in the committee tree with $K = 3$ for the cavity theory of: (a) smooth energy landscape (dashed line), (b) rough energy landscape (solid line), and the simulation of: (c) iterating the microscopic equations (circles), (d) the Least Action algorithm (squares). Error bars are smaller than the size of the symbols. 
\n\n5 CONCLUSION\n\nIn summary, we have derived the microscopic equations for neural network learning in the regime of rough energy landscapes. They turn out to have the same form as in the case of smooth energy landscape, except that the parameters are averaged over the distribution of local minima. Iterating the equations results in a learning algorithm, which yields a higher stability than more conventional algorithms in the committee tree. However, for high loading, the iterations may not converge.\n\nThe success of the present scheme lies in its ability to take into account the underlying physical picture of many local minima of comparable energy. It correctly describes the experience that slightly different training sets may lead to vastly different neural networks. The stability parameter predicted by the rough landscape ansatz agrees better with simulations than the one predicted by the smooth landscape ansatz. The scheme provides a physical interpretation of the replica symmetry breaking solution in the replica method. It is possible to generalize the theory to the physical picture with hierarchies of clusters of local minima, which corresponds to the infinite step replica symmetry breaking solution, though the mathematics is much more involved.\n\nAcknowledgements\n\nThis work is supported by the Hong Kong Telecom Institute of Information Technology, HKUST.\n\nReferences\n\nAnlauf, J.K. & Biehl, M. (1989) The AdaTron: an adaptive perceptron algorithm. Europhysics Letters 10(7):687-692.\nBouten, M., Schietse, J. & Van den Broeck, C. (1995) Gradient descent learning in perceptrons: A review of its possibilities. Physical Review E 52(2):1958-1967.\nGardner, E. & Derrida, B. (1988) Optimal storage properties of neural network models. Journal of Physics A: Mathematical and General 21(1):271-284.\nGerl, F. 
& Krey, U. (1995) A Kuhn-Tucker cavity method for generalization with applications to perceptrons with Ising and Potts neurons. Journal of Physics A: Mathematical and General 28(23):6501-6516.\nGriniasty, M. (1993) \"Cavity-approach\" analysis of the neural-network learning problem. Physical Review E 47(6):4496-4513.\nMezard, M. (1989) The space of interactions in neural networks: Gardner's computation with the cavity method. Journal of Physics A: Mathematical and General 22(12):2181-2190.\nMezard, M., Parisi, G. & Virasoro, M. (1987) Spin Glass Theory and Beyond. Singapore: World Scientific.\nNilsson, N.J. (1965) Learning Machines. New York: McGraw-Hill.\nThouless, D.J., Anderson, P.W. & Palmer, R.G. (1977) Solution of 'solvable model of a spin glass'. Philosophical Magazine 35(3):593-601.\nWatkin, T.L.H., Rau, A. & Biehl, M. (1993) The statistical mechanics of learning a rule. Reviews of Modern Physics 65(2):499-556.\nWong, K.Y.M. (1995a) Microscopic equations and stability conditions in optimal neural networks. Europhysics Letters 30(4):245-250.\nWong, K.Y.M. (1995b) The cavity method: Applications to learning and retrieval in neural networks. In J.-H. Oh, C. Kwon and S. Cho (eds.), Neural Networks: The Statistical Mechanics Perspective, pp. 175-190. Singapore: World Scientific.", "award": [], "sourceid": 1177, "authors": [{"given_name": "K. Y. Michael", "family_name": "Wong", "institution": null}]}