{"title": "Evaluation of Adaptive Mixtures of Competing Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 774, "page_last": 780, "abstract": null, "full_text": "Evaluation of Adaptive Mixtures of Competing Experts\n\nSteven J. Nowlan and Geoffrey E. Hinton\n\nComputer Science Dept.\nUniversity of Toronto\nToronto, ONT M5S 1A4\n\nAbstract\n\nWe compare the performance of the modular architecture, composed of competing expert networks, suggested by Jacobs, Jordan, Nowlan and Hinton (1991) to the performance of a single back-propagation network on a complex, but low-dimensional, vowel recognition task. Simulations reveal that this system is capable of uncovering interesting decompositions in a complex task. The type of decomposition is strongly influenced by the nature of the input to the gating network that decides which expert to use for each case. The modular architecture also exhibits consistently better generalization on many variations of the task.\n\n1 Introduction\n\nIf back-propagation is used to train a single, multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects which lead to slow learning and poor generalization. If we know in advance that a set of training cases may be naturally divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system (see Fig. 1) composed of several different \"expert\" networks plus a gating network that decides which of the experts should be used for each training case.\n\nSystems of this type have been suggested by a number of authors (Hampshire and Waibel, 1989; Jacobs, Jordan and Barto, 1990; Jacobs et al., 1991) (see also the paper by Jacobs and Jordan in this volume (1991)). 
Jacobs, Jordan, Nowlan and Hinton (1991) show that this system can be trained by performing gradient descent in the following error function:\n\n$$E^c = -\\log \\sum_i p_i^c \\, e^{-\\|\\mathbf{d}^c - \\mathbf{o}_i^c\\|^2 / 2\\sigma^2} \\qquad (1)$$\n\nwhere $E^c$ is the error on training case $c$, $p_i^c$ is the output of the gating network for expert $i$, $\\mathbf{d}^c$ is the desired output vector and $\\mathbf{o}_i^c$ is the output vector of expert $i$, and $\\sigma$ is a constant.\n\nThe error defined by Equation 1 is simply the negative log probability of generating the desired output vector under a mixture of gaussians model of the probability distribution of possible output vectors given the current input. The output vector of each expert specifies the mean of a multidimensional gaussian distribution. These means are a function of the inputs to the experts. The outputs of the gating network specify the mixing proportions of the experts, so these too are determined by the current input.\n\n[Figure: Expert 1, Expert 2, Expert 3 and the Gating Network, each with an Input.]\n\nFigure 1: A system of expert and gating networks. Each expert is a feedforward network and all experts receive the same input and have the same number of outputs. The gating network is also feedforward and may receive a different input than the expert networks. It has normalized outputs $p_j = \\exp(x_j) / \\sum_i \\exp(x_i)$, where $x_j$ is the total weighted input received by output unit $j$ of the gating network. $p_j$ can be viewed as the probability of selecting expert $j$ for a particular case.\n\nDuring learning, the gradient descent in $E$ has two effects. 
It raises the mixing proportion of experts that do better than average in predicting the desired output vector for a particular case, and it also makes each expert better at predicting the desired output for those cases for which it has a high mixing proportion. The result of these two effects is that, after learning, the gating network nearly always assigns a mixing proportion near 1 to one expert on each case. So towards the end of the learning, each expert can focus on modelling the cases it is good at without interference from the cases for which it has a negligible mixing proportion.\n\nIn this paper, we compare mixtures of experts to single back-propagation networks on a vowel recognition task. We demonstrate that the mixtures are better at fitting the training data and better at generalizing than comparable single back-propagation networks.\n\n2 Data and Experimental Procedures\n\nThe data used in these experiments consisted of the frequencies of the first and second formants for 10 vowels from 75 speakers (32 Males, 28 Females, and 15 Children) (Peterson and Barney, 1952).1 The vowels, which were uttered in an hVd context, were {heed, hid, head, had, hud, hod, hawed, hood, who'd, heard}. The word list was repeated twice by each speaker, with the words in a different random order for each presentation. The resulting spectrograms were hand segmented and the frequencies of the formants extracted from the middle portion of the vowel.\n\nThe simulations were performed using a conjugate gradient technique, with one weight change after each pass through the training set. For the back-propagation experiments, each simulation was initialised randomly with weight values in the range [-0.5,0.5]. 
For the mixture systems, the last layer of weights in the gating network was always initialised to 0 so that all experts initially had equal a priori selection probabilities, Pi,k, while all other weights in the gating and expert networks were initialized randomly with values in the range [-0.5,0.5] to break symmetry. The value of $\\sigma$ used was 0.25 for all of the mixture simulations. In all cases, the input formant values were linearly scaled by dividing them by 1000, so the first formant was in the range (0,1.5) and the second was in the range (0,4).\n\nTwo sets of experiments were performed: one in which the performance of different systems on the training data was compared and a second in which the ability of different systems to generalize was compared.\n\nFive different types of input were used in each set of experiments:\n\n1. Frequencies of first and second formants only (Form.).\n2. Form. plus a localist encoding of the speaker identity (Form. + Speaker ID).\n3. Form. plus a localist encoding of whether the speaker was a male, female, or child (Form. + MFC).\n4. Form. plus the minimum and maximum frequency for the first and second formant (as real values) over all samples from the speaker (Form. + Range).\n5. Form. + MFC + Range.\n\nFor the simulations in which a single back-propagation network was used, the network received the entire set of input values. However, for the mixture systems the expert networks saw only the formant frequencies, while the gating network saw everything but the formant frequencies (except of course when the input consisted only of the formant frequencies).\n\n1 Obtained, with thanks, from Ray Watrous, who originally obtained the data from Ann Syrdal at AT&T Bell Labs. 
\n\nType of Input          # Experts   # Hid per Expert   # Hid Gating\nForm.                  20          3-5                10\nForm. + Speaker ID     10          25                 0\nForm. + MFC            10          25                 0\nForm. + MFC + Range    10          25                 5\nForm. + Range          10          25                 5\n\nTable 1: Summary of mixture architecture used with each type of input.\n\nType of Input          Mixture Error %   BP Error %     Sig.(p)\nFormants only          13.9 \u00b1 0.9        21.8 \u00b1 0.6     >> 0.9999\nForm. + Speaker ID     4.6 \u00b1 0.7         6.2 \u00b1 0.6      > 0.97\nForm. + MFC            13.0 \u00b1 0.4        15.4 \u00b1 0.3     >> 0.9999\nForm. + MFC + Range    5.6 \u00b1 0.6         13.1 \u00b1 1.0     ~ 0.9999\nForm. + Range          11.6 \u00b1 0.9        13.5 \u00b1 0.4     > 0.998\n\nTable 2: Performance comparison of associative mixture systems and single back-propagation networks on vowel classification task. Results reported are based on an average over 25 simulations for each back-propagation network or mixture system.\n\nThe BP networks used in the single network simulations contained one layer of hidden units.2 In the mixture systems, the expert networks also contained one layer of hidden units although the number of hidden units in each expert varied. The gating network in some cases contained hidden units, while in other cases it did not (see Table 1). Further details of the simulations may be found in (Nowlan, 1991).\n\n3 Results of Performance Studies\n\nIn the set of performance experiments, each system was trained with the entire set of 1494 tokens until the magnitude of the gradient vector was $< 10^{-8}$. The error rate (as percent of total cases) was evaluated on the training data (generalization studies are described in the next section). 
The very high degree of class overlap in this task makes it extremely difficult to find good solutions with a gradient descent procedure and this is reflected by the far from optimal average performance of all systems on the training data (see Table 2). For purposes of comparison, the best performance ever obtained on this vowel data using speaker-dependent classification methods is an error rate of about 2.5% (Gerstman, 1968; Watrous, 1990).\n\n2 The number of hidden units was selected by performing a number of initial simulations with different numbers of hidden units for each network and choosing the smallest number which gave near optimal performance. These numbers were 50, 150, 60, 150, and 80 respectively for the five types of input listed above.\n\n3 Based on a t-test with 48 degrees of freedom.\n\nSpec. #   % Male   % Female   % Child   % Total\n0         0.0      0.0        6.7       1.3\n4         3.1      3.6        0.0       2.7\n5         84.4     17.8       0.0       42.7\n7         9.4      7.1        6.7       8.0\n8         3.1      42.9       0.0       17.3\n9         0.0      28.6       86.7      28.0\n\nTable 3: Speaker decomposition in terms of Male, Female and Child categories for a mixture with speaker identity as input to the gating network.\n\nTable 2 reveals that in every case the mixture system performs significantly better3 than a single network given the same input. The most striking, and interesting, result in Table 2 is contained in the fourth row of the table. While the associative mixture architecture is able to combine the two separate cues of MFC categories and speaker formant range quite effectively, the single back-propagation network fails to do so. 
The combination of these two different cues in the associative mixture system was obtained by a hierarchical training procedure in which three different experts were first created using the MFC cue alone, and copies of these networks were further specialized when the formant range cue was added to the input received by the gating network (see (Nowlan, 1990; Nowlan, 1991) for details). Since the single back-propagation network is much less modular than the associative mixture system, it is difficult to implement such a hierarchical training procedure in the single network case. (A variety of techniques were explored and details may again be found in (Nowlan, 1991).)\n\nAnother interesting aspect of the mixture systems, not revealed in Table 2, is the manner in which the training cases were divided among the different expert networks. Once the network was trained, the training cases were clustered by assigning each case to the expert that was selected most strongly by the gating network.\n\nThe mixture which used only the formant frequencies as input to both the gating and expert networks tended to cluster training cases according to the position of the tongue hump when the vowel is uttered. In all simulations, the four front vowels were always clustered together and handled by a single expert. The low back and high back vowels also tended to be grouped together, but each of these groups was divided among several experts and not always in exactly the same way.\n\nThe mixture which received speaker identity as well as formant frequencies as input tended to group speakers roughly according to the categories male, female, and child. A typical grouping of speakers by the mixture is shown in Table 3. 
\n\n4 Results of Generalization Studies\n\nIn the set of generalization experiments, for all but the input which contained the speaker identity, each system was trained on data from 65 speakers until the magnitude of the gradient vector was $< 10^{-4}$. The performance was then tested on the data from the 10 speakers not in the training set. Twenty different test sets were created by leaving out different speakers for each, and results are an average over one simulation with each of the test sets. Each test set consisted of 4 male, 3 female and 3 child speakers.\n\nType of Input          Mixture Error %   BP Error %     Sig.(p)\nFormants only          15.1 \u00b1 0.9        23.3 \u00b1 1.2     ~ 0.9999\nForm. + Speaker ID     6.4 \u00b1 1.3         -              ~ 0.9999\nForm. + MFC            13.5 \u00b1 0.6        18.4 \u00b1 1.1     >> 0.9999\nForm. + MFC + Range    6.2 \u00b1 0.9         16.1 \u00b1 1.0     ~ 0.9999\nForm. + Range          12.8 \u00b1 0.9        16.2 \u00b1 0.8     > 0.9999\n\nTable 4: Generalization comparison of associative mixture systems and single back-propagation networks on vowel classification task. Results reported are based on an average over 20 simulations for each back-propagation network or mixture system.\n\nThe generalization tests for the mixture in which speaker identity was part of the input used a different testing strategy. In this case, the training set consisted of 70 speakers and the testing set contained the remaining 5 speakers (2 male, 2 female, 1 child). Again, results are averaged over 20 different testing sets. After the mixture was trained, an expert was selected for each test speaker using one utterance of each of the first 3 vowels, and the performance of the selected expert was tested on the remaining 17 utterances of that speaker. 
No generalization results are reported for the single back-propagation network which received the speaker identity as well as the first and second formant values, since there is no straightforward way to perform rapid speaker adaptation with this architecture. (See Watrous (1990) for some approaches to speaker adaptation in single networks.) The percentages of misclassifications on the test set for the mixture systems and corresponding single back-propagation networks are summarized in Table 4, and in all cases the mixture system generalizes significantly better4 than a single network.\n\nThe relatively poor generalization performance of the single back-propagation networks is not due to overfitting on the training data because the single back-propagation networks perform worse on the training data than the mixture systems on the test data. Also, the associative mixture systems initially contained even more parameters than the corresponding back-propagation networks. (The associative mixture which received formant range data for gating input initially contained almost 3600 parameters, while the corresponding single back-propagation network contained only slightly more than 1200 parameters.) Part of the explanation for the good generalization performance of the mixtures is the pruning of excess parameters as the system is trained. The number of effective parameters in the final mixture is very often less than half the number in the original system, because a large number of experts have negligible mixing proportions in the final mixture.\n\n4 Based on a t-test with 38 degrees of freedom.\n\n5 Discussion\n\nThe mixture systems outperform single back-propagation networks which receive the same input, and show much better generalization properties when forced to deal with relatively small training sets. In addition, the mixtures can easily be refined hierarchically by learning a few experts and then making several copies of each and adding additional contextual input to the gating network.\n\nThe best performance for either single networks or mixture systems is obtained by including the speaker identity as part of the input. When given such input, the mixture systems are capable of discovering speaker categories which give levels of classification performance close to those obtained by speaker dependent classification schemes. Good performance can also be obtained on novel speakers by determining which existing speaker category the new speaker is most similar to (using a small number of labelled utterances). If, instead, the speaker is represented in terms of features such as male, female, child, and formant range, the mixtures also exhibit good generalization to novel speakers described in terms of these features.\n\nAcknowledgements\n\nThis research was supported by grants from the Natural Sciences and Engineering Research Council, the Ontario Information Technology Research Center, and Apple Computer Inc. Hinton is the Noranda fellow of the Canadian Institute for Advanced Research.\n\nReferences\n\nGerstman, L. J. (1968). Classification of self-normalized vowels. IEEE Trans. on Audio and Electroacoustics, AU-16(1):78-80.\n\nHampshire, J. and Waibel, A. (1989). The Meta-Pi network: Building distributed knowledge representations for robust pattern recognition. Technical Report CMU-CS-89-166, Carnegie-Mellon, Pittsburgh, PA.\n\nJacobs, R. A. and Jordan, M. I. (1991). A competitive modular connectionist architecture. In Touretzky, D. 
S., editor, Neural Information Processing Systems 3. Morgan Kaufmann, San Mateo, CA.\n\nJacobs, R. A., Jordan, M. I., and Barto, A. G. (1990). Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cognitive Science. In Press.\n\nJacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1).\n\nNowlan, S. J. (1990). Competing experts: An experimental investigation of associative mixture models. Technical Report CRG-TR-90-5, Department of Computer Science, University of Toronto.\n\nNowlan, S. J. (1991). Soft Competitive Adaptation: Neural Network Learning Algorithms based on Fitting Statistical Mixtures. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.\n\nPeterson, G. E. and Barney, H. L. (1952). Control methods used in a study of vowels. The Journal of the Acoustical Society of America, 24:175-184.\n\nWatrous, R. L. (1990). Speaker normalization and adaptation using second order connectionist networks. Technical Report CRG-TR-90-6, University of Toronto.\n", "award": [], "sourceid": 318, "authors": [{"given_name": "Steven", "family_name": "Nowlan", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}