{"title": "Statistical Mechanics of the Mixture of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 183, "page_last": 189, "abstract": null, "full_text": "Statistical Mechanics of the Mixture of Experts\n\nKukjin Kang and Jong-Hoon Oh\n\nDepartment of Physics\n\nPohang University of Science and Technology\n\nHyoja San 31, Pohang, Kyongbuk 790-784, Korea\n\nE-mail: kkj.jhoh@galaxy.postech.ac.kr\n\nAbstract\n\nWe study the generalization capability of the mixture of experts learning from examples generated by another network with the same architecture. When the number of examples is smaller than a critical value, the network shows a symmetric phase where the role of the experts is not specialized. Upon crossing the critical point, the system undergoes a continuous phase transition to a symmetry breaking phase where the gating network partitions the input space effectively and each expert is assigned to an appropriate subspace. We also find that the mixture of experts with multiple levels of hierarchy shows multiple phase transitions.\n\n1 Introduction\n\nRecently there has been considerable interest in the neural network community in techniques that integrate the collective predictions of a set of networks [1, 2, 3, 4]. The mixture of experts [1, 2] is a well known example which implements the philosophy of divide-and-conquer elegantly. Although this model is gaining popularity in various applications, there has been little effort to evaluate the generalization capability of these modular approaches theoretically. Here we present the first analytic study of generalization in the mixture of experts from the statistical physics perspective. 
Statistical mechanics formulations have so far focused on feedforward neural network architectures close to the multilayer perceptron [5, 6], together with the VC theory [8]. We expect that the statistical mechanics approach can also be used effectively to evaluate more advanced architectures, including mixture models.\n\nIn this letter we study generalization in the mixture of experts [1] and its variant with a two-level hierarchy [2]. The network is trained on examples given by a teacher network with the same architecture. We find an interesting phase transition driven by symmetry breaking among the experts. This phase transition is closely related to the divide-and-conquer mechanism which this mixture model was originally designed to accomplish.\n\n2 Statistical Mechanics Formulation for the Mixture of Experts\n\nThe mixture of experts [2] is a tree consisting of expert networks and gating networks which assign weights to the outputs of the experts. The expert networks sit at the leaves of the tree and the gating networks sit at its branching points. For the sake of simplicity, we consider a network with one gating network and two experts. Each expert produces its output $\\mu_j$ as a generalized linear function of the $N$-dimensional input $x$:\n\n$$\\mu_j = f(W_j \\cdot x), \\quad j = 1, 2, \\tag{1}$$\n\nwhere $W_j$ is the weight vector of the $j$th expert, subject to a spherical constraint [5]. We consider the transfer function $f(x) = \\mathrm{sgn}(x)$, which produces binary outputs. The principle of divide-and-conquer is implemented by assigning each expert to a subspace of the input space with different local rules. 
A gating network partitions the input space and assigns each expert a weighting factor:\n\n$$g_j(x) = \\Theta(V_j \\cdot x), \\tag{2}$$\n\nwhere the gating function $\\Theta(x)$ is the Heaviside step function. For two experts, this gating function defines a sharp boundary between the two subspaces, perpendicular to the vector $V_1 = -V_2 = V$, whereas the softmax function used in the original literature [2] yields a soft boundary. The weighted output of the mixture of experts is then written:\n\n$$\\mu(V, W; x) = \\sum_{j=1}^{2} g_j(x)\\, \\mu_j(x). \\tag{3}$$\n\nThe whole network, as well as the individual experts, generates binary outputs. Therefore, it can learn only dichotomy rules. The training examples are generated by a teacher with the same architecture:\n\n$$\\sigma(x) = \\sum_{j=1}^{2} \\Theta(V_j^0 \\cdot x)\\, \\mathrm{sgn}(W_j^0 \\cdot x), \\tag{4}$$\n\nwhere $V_j^0$ and $W_j^0$ are the weights of the $j$th gating network and the $j$th expert of the teacher.\n\nThe learning of the mixture of experts is usually interpreted probabilistically, hence the learning algorithm is considered as a maximum likelihood estimation. Learning algorithms originating from statistical methods, such as the EM algorithm, are often used. Here we consider the Gibbs algorithm with noise level $T$ ($= 1/\\beta$), which leads to a Gibbs distribution of the weights after a long time:\n\n$$P(V, W_j) = \\frac{1}{Z} \\exp(-\\beta E(V, W_j)), \\tag{5}$$\n\nwhere $Z = \\int dV\\, dW \\exp(-\\beta E(V, W_j))$ is the partition function. Training both the experts and the gating network is necessary for good generalization performance. The energy $E$ of the system is defined as a sum of errors over $P$ examples:\n\n$$E(V, W_j) = \\sum_{l=1}^{P} \\epsilon(V, W_j; x^l). \\tag{6}$$\n\n[Equation (7), defining the single-example error $\\epsilon(V, W_j; x)$, is missing from the extracted text.]\n\nThe performance of the network is measured by the generalization function $\\epsilon(V, W_j) = \\int dx\\, \\epsilon(V, W_j; x)$, where $\\int dx$ represents an average over the whole input space. 
The generalization error $\\epsilon_g$ is defined by $\\epsilon_g = \\langle\\langle \\langle \\epsilon(V, W_j) \\rangle_T \\rangle\\rangle$, where $\\langle\\langle \\cdots \\rangle\\rangle$ denotes the quenched average over the examples and $\\langle \\cdots \\rangle_T$ denotes the thermal average over the probability distribution of Eq. (5).\n\nSince the replica calculation turns out to be intractable, we use the annealed approximation:\n\n$$\\langle\\langle \\log Z \\rangle\\rangle \\approx \\log \\langle\\langle Z \\rangle\\rangle. \\tag{8}$$\n\nThe annealed approximation is exact only in the high temperature limit, but it is known that the approximation usually gives qualitatively good results for the case of learning realizable rules [5, 6].\n\n3 Generalization Curve and the Phase Transition\n\nThe generalization function $\\epsilon(V, W_j)$ can be written as a function of the overlaps between the weight vectors of the teacher and the student:\n\n$$\\epsilon(V, W_j) = \\sum_{i=1}^{2} \\sum_{j=1}^{2} P_{ij}\\, \\epsilon_{ij}, \\tag{9}$$\n\nwhere $P_{ij}$ and $\\epsilon_{ij}$ are given by Eqs. (10) and (11) [missing from the extracted text], and\n\n$$R^V_{ij} = \\frac{1}{N} V_i \\cdot V_j^0, \\tag{12}$$\n\n$$R_{ij} = \\frac{1}{N} W_i \\cdot W_j^0 \\tag{13}$$\n\nare the overlap order parameters. Here, $P_{ij}$ is the probability that the $i$th expert of the student learns from examples generated by the $j$th expert of the teacher. It is the volume fraction of the input space where $V_i \\cdot x$ and $V_j^0 \\cdot x$ are both positive. For those examples, the $i$th expert of the student gives a wrong answer with probability $\\epsilon_{ij}$ with respect to the $j$th expert of the teacher. We assume that the weight vectors of the teacher, $V^0$, $W_1^0$ and $W_2^0$, are orthogonal to each other, so that the overlap order parameters other than the ones shown above vanish. We use the symmetry properties of the network such as $R_V = R^V_{11} = R^V_{22} = -R^V_{12}$, $R = R_{11} = R_{22}$, and $r = R_{12} = R_{21}$.\n\nThe free energy can also be written as a function of the three order parameters $R_V$, $R$, and $r$. 
Now we consider the thermodynamic limit, where the dimension of the input space $N$ and the number of examples $P$ go to infinity, keeping the ratio $\\alpha = P/N$ finite. By minimizing the free energy with respect to the order parameters, we find the most probable values of the order parameters as well as the generalization error.\n\nFig. 1(a) plots the overlap order parameters $R_V$, $R$ and $r$ versus $\\alpha$ at temperature $T = 5$. Examining the plot, we find an interesting phase transition driven by symmetry breaking among the experts. Below the phase transition point $\\alpha_c = 51.5$, the overlap between the gating networks of the teacher and the student is zero ($R_V = 0$) and the overlaps between the experts are symmetric ($R = r$). In the symmetric phase, the gating network does not have enough examples to learn the proper partitioning, so its performance is not much better than a random partitioning. Consequently, each expert of the student cannot specialize for the subspace with a particular local rule given by an expert of the teacher. Each expert has to learn multiple linear rules with its linear structure, which leads to poor generalization performance. Unless more than a critical number of examples is provided, the divide-and-conquer strategy does not work.\n\nUpon crossing the critical point $\\alpha_c$, the system undergoes a continuous phase transition to the symmetry breaking phase. The order parameter $R_V$, related to the goodness of the partition, begins to increase abruptly and approaches 1 with increasing $\\alpha$. The gating network now provides a better partition which is close to that of the teacher. The curves of the order parameters $R$ and $r$, the overlaps between the experts of the teacher and the student, branch at $\\alpha_c$ and approach 1 and 0, respectively. 
This means that each expert specializes its role by pairing with a particular expert of the teacher. Fig. 1(b) plots the generalization curve ($\\epsilon_g$ versus $\\alpha$) on the same scale. Though the generalization curve is continuous, its slope changes discontinuously at the transition point, so that the generalization curve has a cusp. The asymptotic behavior of $\\epsilon_g$ at large $\\alpha$ is given by:\n\n$$\\epsilon_g \\approx \\frac{3}{1 - e^{-\\beta}} \\frac{1}{\\alpha}, \\tag{14}$$\n\nwhere the $1/\\alpha$ decay is often observed in the learning of other feedforward networks.\n\nFigure 1: (a) The overlap order parameters $R_V$, $R$, $r$ versus $\\alpha$ at $T = 5$. For $\\alpha < \\alpha_c = 51.5$, we find $R_V = 0$ (solid line that follows the x axis) and $R = r$ (dashed line). At the transition point, $R_V$ begins to increase abruptly, and $R$ (dotted line) and $r$ (dashed line) branch, approaching 1 and 0 respectively. (b) The generalization curve ($\\epsilon_g$ versus $\\alpha$) for the mixture of experts on the same scale. A cusp at the transition point $\\alpha_c$ is shown.\n\nFigure 2: A typical generalization error curve for the HME network with continuous weights, at $T = 5$. 
\n\n4 The Mixture of Experts with Two-Level Hierarchy\n\nWe also study generalization in the hierarchical mixture of experts [2]. Consider a two-level hierarchical mixture of experts consisting of three gating networks and four experts. At the top level the tree is divided into two branches, and they are in turn divided into two branches at the lower level. The experts sit at the four leaves of the tree, and the three gating networks sit at the top-level and lower-level branching points. The network again learns from training examples drawn from a teacher network with the same architecture.\n\nFig. 2(b) shows the corresponding learning curve, which has two cusps related to the phase transitions. For $\\alpha < \\alpha_{c1}$, the system is in the fully symmetric phase. The gating networks do not provide a correct partition for the experts at either level of the hierarchy and the experts cannot specialize at all. All the overlaps with the weights of the teacher experts have the same value. The first phase transition, at the smaller $\\alpha_{c1}$, is related to the symmetry breaking by the top-level gating network. For $\\alpha_{c1} < \\alpha < \\alpha_{c2}$, the top-level gating network partitions the input space into two parts, but the lower-level gating networks are not functioning properly. The overlap between the gating networks at the lower level of the tree and those of the teacher is still zero. The experts partially specialize into two groups. Specialization within each group is not accomplished yet. The overlap order parameter $R_{ij}$ can have two distinct values. The bigger one is the overlap with the two experts of the teacher for which the group is specializing, and the smaller is with the experts of the teacher which belong to the other group. 
At the second transition point $\\alpha_{c2}$, the symmetry related to the lower level of the hierarchy breaks. For $\\alpha > \\alpha_{c2}$, all the gating networks work properly and the input space is divided into four. Each expert pairs with the appropriate expert of the teacher. Now the overlap order parameters can have three distinct values. The largest is the overlap with the matching expert of the teacher. The next largest is the overlap with the neighboring teacher expert in the tree hierarchy. The smallest is with the experts of the other group. The two phase transitions result in the two cusps of the learning curve.\n\n5 Conclusion\n\nWhereas the phase transition of the mixture of experts can be interpreted as a symmetry breaking phenomenon similar to the one already observed in the committee machine and the multilayer perceptron [6, 7], the transition is novel in that it is continuous. This means that symmetry breaking is easier for the mixture of experts than for the multilayer perceptron. This can be a big advantage in learning highly nonlinear rules, as we do not have to worry about the existence of local minima. We find that the hierarchical mixture of experts can have multiple phase transitions, which are related to symmetry breaking at different levels. Note that symmetry breaking comes first from the higher-level branch, which is a desirable property of the model.\n\nWe thank M. I. Jordan, L. K. Saul, H. Sompolinsky, H. S. Seung, H. Yoon and C. Kwon for useful discussions and comments. This work was partially supported by the Basic Science Special Program of the POSTECH Basic Science Research Institute.\n\nReferences\n\n[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, Neural Computation 3, 79 (1991).\n\n[2] M. I. Jordan and R. A. 
Jacobs, Neural Computation 6, 181 (1994).\n\n[3] M. P. Perrone and L. N. Cooper, in Neural Networks for Speech and Image Processing, R. J. Mammone, Ed., Chapman-Hall, London, 1993.\n\n[4] D. Wolpert, Neural Networks 5, 241 (1992).\n\n[5] H. S. Seung, H. Sompolinsky, and N. Tishby, Phys. Rev. A 45, 6056 (1992).\n\n[6] K. Kang, J.-H. Oh, C. Kwon and Y. Park, Phys. Rev. E 48, 4805 (1993); K. Kang, J.-H. Oh, C. Kwon and Y. Park, Phys. Rev. E 54, 1816 (1996).\n\n[7] E. Baum and D. Haussler, Neural Computation 1, 151 (1989).\n", "award": [], "sourceid": 1176, "authors": [{"given_name": "Kukjin", "family_name": "Kang", "institution": null}, {"given_name": "Jong-Hoon", "family_name": "Oh", "institution": null}]}