{"title": "Constructive Algorithms for Hierarchical Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 584, "page_last": 590, "abstract": null, "full_text": "Constructive  Algorithms  for  Hierarchical \n\nMixtures  of Experts \n\nS.R.Waterhouse \n\nA.J.Robinson \n\nCambridge University Engineering  Department, \nTrumpington St.,  Cambridge, CB2  1PZ,  England. \nTel:  [+44]  1223  332754,  Fax:  [+44]  1223  332662, \n\nEmail: srw1001.ajr@eng.cam.ac.uk \n\nAbstract \n\nWe  present  two  additions  to  the  hierarchical  mixture  of experts \n(HME)  architecture.  By  applying a  likelihood splitting criteria to \neach expert in the HME we  \"grow\" the tree adaptively during train(cid:173)\ning.  Secondly,  by considering only the most probable path through \nthe tree we  may \"prune\"  branches away, either temporarily, or per(cid:173)\nmanently  if they  become  redundant.  We  demonstrate  results  for \nthe  growing  and  path  pruning  algorithms  which  show  significant \nspeed  ups  and  more efficient  use  of parameters over  the standard \nfixed  structure  in  discriminating  between  two  interlocking spirals \nand classifying 8-bit parity patterns. \n\nINTRODUCTION \n\nThe  HME  (Jordan  &  Jacobs  1994)  is  a  tree  structured  network  whose  terminal \nnodes  are simple function approximators in the case of regression or classifiers in the \ncase of classification.  The outputs of the  terminal nodes  or experts  are recursively \ncombined upwards towards the root node, to form the overall output of the network, \nby  \"gates\"  which  are situated at the non-terminal nodes. \nThe HME  has clear similarities with tree  based  statistical methods such  as  Classi(cid:173)\nfication  and Regression Trees  (CART)  (Breiman,  Friedman, Olshen &  Stone  1984). \nWe  may consider  the  gate  as  replacing  the  set  of  \"questions\"  which  are  asked  at \neach  branch of CART.  From this  analogy,  we  may consider  the  application  of the \nsplitting rules  used  to build  CART.  We  start  with  a  simple tree  consisting  of two \nexperts  and  one  gate.  After  partially training  this  simple tree  we  apply  the split(cid:173)\nting criterion  to each  terminal node.  This evaluates  the log-likelihood increase  by \nsplitting each  expert  into  two  experts  and  a  gate.  The split  which  yields  the  best \nincrease  in  log-likelihood is  then  added  permanently  to  the  tree.  This  process  of \ntraining followed  by growing continues until the desired  modelling power is reached. \n\n\fConstructive Algorithms  for  Hierarchical Mixtures of Experts \n\n585 \n\nFigure  1:  A simple mixture of experts. \n\nThis  approach is  reminiscent  of Cascade  Correlation  (Fahlman &  Lebiere  1990)  in \nwhich  new  hidden  nodes  are  added  to  a  multi-layer perceptron  and  trained  while \nthe rest  of the  network  is  kept  fixed. \n\nThe HME  also  has  similarities with model merging techniques  such  as  stacked  re(cid:173)\ngression  (Wolpert  1993),  in  which  explicit  partitions  of the  training  set  are  com(cid:173)\nbined.  However the HME differs from model merging in that each expert  considers \nthe  whole  input  space  in forming its  output.  Whilst this allows  the  network  more \nflexibility since each gate may implicitly partition the whole  input space in a  \"soft\" \nmanner,  it  leads  to  unnecessarily  long  computation in  the  case  of near  optimally \ntrained  models.  
Figure 1: A simple mixture of experts.

This approach is reminiscent of Cascade Correlation (Fahlman & Lebiere 1990), in which new hidden nodes are added to a multi-layer perceptron and trained while the rest of the network is kept fixed.

The HME also has similarities with model merging techniques such as stacked regression (Wolpert 1993), in which explicit partitions of the training set are combined. However, the HME differs from model merging in that each expert considers the whole input space in forming its output. Whilst this allows the network more flexibility, since each gate may implicitly partition the whole input space in a \"soft\" manner, it leads to unnecessarily long computation in the case of near-optimally trained models. At any one time only a few paths through a large network may have high probability. In order to overcome this drawback, we introduce the idea of \"path pruning\", which considers only those paths from the root node which have probability greater than a certain threshold.

CLASSIFICATION USING HIERARCHICAL MIXTURES OF EXPERTS

The mixture of experts, shown in Figure 1, consists of a set of \"experts\" which perform local function approximation. The expert outputs are combined by a gate to form the overall output. In the hierarchical case, the experts are themselves mixtures of further experts, thus extending the architecture in a tree-structured fashion. Each terminal node or \"expert\" may take on a variety of forms, depending on the application. In the case of multi-way classification, each expert outputs a vector y_j in which element m is the conditional probability of class m (m = 1 ... M), computed using the softmax function:

P(c_m | x^{(n)}, W_j) = \exp(w_{mj}^T x^{(n)}) / \sum_{k=1}^{M} \exp(w_{kj}^T x^{(n)}),

where W_j = [w_{1j} w_{2j} ... w_{Mj}] is the parameter matrix for expert j and c_i denotes class i.

The outputs of the experts are combined using a \"gate\" which sits at the non-terminal nodes. The gate outputs are estimates of the conditional probability of selecting the daughters of the non-terminal node given the input and the path taken to that node from the root node. This is once again computed using the softmax function:

P(z_j | x^{(n)}, \xi) = \exp(\xi_j^T x^{(n)}) / \sum_{k=1}^{J} \exp(\xi_k^T x^{(n)}),

where \xi = [\xi_1 \xi_2 ... \xi_J] is the parameter matrix for the gate, and z_j denotes expert j.

The overall output is given by a probabilistic mixture in which the gate outputs are the mixture weights and the expert outputs are the mixture components. The probability of class m is then given by:

P(c_m | x^{(n)}, \Theta) = \sum_{i=1}^{J} P(z_i | x^{(n)}, \xi) P(c_m | x^{(n)}, W_i).

A straightforward extension of this model also gives us the conditional probability h_j^{(n)} of selecting expert j given input x^{(n)} and correct class c_k:

h_j^{(n)} = P(z_j | x^{(n)}, \xi) P(c_k | x^{(n)}, W_j) / \sum_{i=1}^{J} P(z_i | x^{(n)}, \xi) P(c_k | x^{(n)}, W_i).    (1)

In order to train the HME to perform classification we maximise the log-likelihood L = \sum_{n=1}^{N} \sum_{m=1}^{M} t_m^{(n)} \log P(c_m | x^{(n)}, \Theta), where the variable t_m^{(n)} is one if m is the correct class at exemplar (n) and zero otherwise. This is done via the expectation maximisation (EM) algorithm of Dempster, Laird & Rubin (1977), as described by Jordan & Jacobs (1994).
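To make these equations concrete, here is a small, self-contained numerical sketch of the forward pass in Python (numpy). The class names ExpertNode and GateNode and the recursive layout are illustrative assumptions rather than the representation used in the paper; each gate mixes the class distributions of its daughters exactly as in the mixture equation above.

import numpy as np

def softmax(a):
    a = a - np.max(a)                          # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

class ExpertNode:
    def __init__(self, W):                     # W: (M classes) x (input dim)
        self.W = W
    def class_probs(self, x):                  # P(c_m | x, W_j): softmax over classes
        return softmax(self.W @ x)

class GateNode:
    def __init__(self, xi, children):          # xi: (J daughters) x (input dim)
        self.xi, self.children = xi, children
    def gate_probs(self, x):                   # P(z_j | x, xi): softmax over daughters
        return softmax(self.xi @ x)
    def class_probs(self, x):                  # mixture: sum_j P(z_j | x) P(c_m | x, daughter j)
        g = self.gate_probs(x)
        return sum(gj * child.class_probs(x) for gj, child in zip(g, self.children))

# Toy example: one gate over two experts, 3 classes, 4-dimensional inputs.
rng = np.random.default_rng(0)
root = GateNode(rng.normal(size=(2, 4)),
                [ExpertNode(rng.normal(size=(3, 4))) for _ in range(2)])
x = rng.normal(size=4)
print(root.class_probs(x))                     # a distribution over the 3 classes

Hierarchies of any depth follow by letting the children of a GateNode themselves be GateNodes.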
TREE GROWING

The standard HME differs from most tree-based statistical models in that its architecture is fixed. By relaxing this constraint and allowing the tree to grow, we achieve a greater degree of flexibility in the network. Following the work on CART, we start with a simple tree, for instance with two experts and one gate, which we train for a small number of cycles. Given this semi-trained network, we then make a set of candidate splits {S_i} of the terminal nodes {z_i}. Each split involves replacing an expert z_i with a pair of new experts {z_{ij}}, j = 1, 2, and a gate, as shown in Figure 2.

Figure 2: Making a candidate split of a terminal node.

We wish to select eventually only the \"best\" split S out of these candidate splits. Let us define the best split as being that which maximises the increase in overall log-likelihood due to the split, \Delta L = L^{(p+1)} - L^{(p)}, where L^{(p)} is the likelihood at the pth generation of the tree. If we make the constraint that all the parameters of the tree remain fixed apart from the parameters of the new split whenever a candidate split is made, then the maximisation is simplified into a dependency on the increases in the local likelihoods {L_i} of the nodes {z_i}. We thus constrain the tree growing process to be localised such that we find the node which gains the most by being split:

\max_i \Delta L(S_i) = \max_i \Delta L_i = \max_i ( L_i^{(p+1)} - L_i^{(p)} ),

where

L_i^{(p+1)} = \sum_n \sum_m t_m^{(n)} \log \sum_j P(z_{ij} | x^{(n)}, \xi_i, z_i) P(c_m | x^{(n)}, z_{ij}, W_{ij}).

This splitting rule is similar in form to the CART splitting criterion, which uses maximisation of the entropy of the node split, equivalent to our local increase in log-likelihood.

Figure 3: Growing the HME. This figure shows the addition of a pair of experts to the partially grown tree.

The final growing algorithm starts with a tree of generation p and firstly fixes the parameters of all non-terminal nodes. All terminal nodes are then split into two experts and a gate. A split is only made if the sum of posterior probabilities \sum_n h_i^{(n)}, as described in (1), at the node is greater than a small threshold. This prevents splits being made on nodes which have very little data assigned to them. In order to break symmetry, the new experts of a split are initialised by adding small random noise to the original expert parameters. The gate parameters are set to small random weights. For each node i, we then evaluate \Delta L_i by training the tree using the standard EM method. Since all non-terminal node parameters are fixed, the only changes to the log-likelihood are due to the new splits. Since the parameters of each split are thus independent of one another, all splits can be trained at once, removing the need to train multiple trees separately.

After each split has been evaluated, the best split is chosen. This split is kept and all other splits are discarded. The original tree structure is then recovered except for the additional winning split, as shown in Figure 3. The new tree, of generation p+1, is then trained as usual using EM. At present the decision on when to add a new split to the tree is fairly straightforward: a candidate split is made after training the fixed tree for a set number of iterations. An alternative scheme we have investigated is to make a split when the overall log-likelihood of the fixed tree has not increased for a set number of cycles. In addition, splits are rejected if they add too little to the local log-likelihood.
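As a concrete illustration of the selection criterion, the short numpy sketch below computes the local gain \Delta L_i for one candidate split from per-exemplar probabilities. It assumes those probabilities have already been produced by training the candidate split with everything else frozen, and the array names (p_old, gate, p_new, targets) are illustrative only.

import numpy as np

# p_old[n, m]    : P(c_m | x_n) under the original expert z_i
# gate[n, j]     : P(z_ij | x_n) for the two new experts (j = 1, 2)
# p_new[n, j, m] : P(c_m | x_n, z_ij) under each new expert
# targets[n]     : index of the correct class for exemplar n

def local_log_likelihood(class_probs, targets):
    # L_i = sum_n log P(c_target_n | x_n)
    return np.sum(np.log(class_probs[np.arange(len(targets)), targets]))

def split_gain(p_old, gate, p_new, targets):
    # Mixture of the two daughter experts: sum_j gate_j P(c_m | x, z_ij)
    p_split = np.einsum('nj,njm->nm', gate, p_new)
    return (local_log_likelihood(p_split, targets)    # L_i^(p+1)
            - local_log_likelihood(p_old, targets))   # minus L_i^(p)

# Toy usage with random (normalised) probabilities:
rng = np.random.default_rng(0)
N, M = 8, 3
p_old = rng.dirichlet(np.ones(M), size=N)
gate = rng.dirichlet(np.ones(2), size=N)
p_new = rng.dirichlet(np.ones(M), size=(N, 2))
targets = rng.integers(0, M, size=N)
print(split_gain(p_old, gate, p_new, targets))

The candidate split with the largest gain is kept; in practice a split would also be rejected if its gain is too small, as described above.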
Although we have not discussed the issue of over-fitting in this paper, a number of techniques to prevent over-fitting can be used in the HME. The simplest technique, akin to those used in CART, involves growing a large tree and successively removing nodes from the tree until the performance on a cross-validation set reaches an optimum. Alternatively, the Bayesian techniques of Waterhouse, MacKay & Robinson (1995) could be applied.

Tree growing simulations

This algorithm was used to solve the 8-bit parity classification task. We compared the growing algorithm to a fixed HME of depth 4 with binary branches. As can be seen in Figures 4(a) and (b), the factorisation enabled by the growing algorithm significantly speeds up computation over the standard fixed structure. The final tree shape obtained is shown in Figure 4(c). We showed in an earlier paper (Waterhouse & Robinson 1994) that the XOR problem may be solved using at least 2 experts and a gate. The 8-bit parity problem is therefore being solved by a series of XOR classifiers, each gated by its parent node, which is an intuitively appealing form with an efficient use of parameters.

Figure 4: HME growing on the 8-bit parity problem; (i) growing HME with 6 generations; (ii) 4-deep binary-branching HME (no growing). (a) Evolution of log-likelihood vs. time in CPU seconds. (b) Evolution of log-likelihood for (i) vs. generations of the tree. (c) Final tree structure obtained from (i), showing the utilisation U_i of each node, where U_i = \sum_n P(z_i, R_i | x^{(n)}) / N and R_i is the path taken from the root node to node i.

PATH PRUNING

If we consider the HME to be a good model for the data generation process, the case for path pruning becomes clear. In a tree with sufficient depth to model the underlying sub-processes producing each data point, we would expect the activation of each expert to tend to binary values, such that only one expert is selected for each exemplar.

The path pruning scheme is depicted in Figure 5. The pruning scheme utilises the \"activation\" of each node at each exemplar. The activation is defined as the sum of log node probabilities along the path from the root node to the current node, l_i^{(n)} = \sum_{j \in R_i} \log P(z_j | R_j, x^{(n)}), where R_i is the path taken to node i from the root node. If l_l^{(n)} for node l at exemplar n falls below a threshold value \epsilon, then we ignore the subtree S_l and we backtrack up to the parent node of l. During training this involves not accumulating the statistics of the subtree S_l; during evaluation it involves setting the output of subtree S_l to zero.
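The sketch below shows how this threshold test can be folded into the recursive evaluation of the tree, reusing the hypothetical ExpertNode and GateNode classes from the earlier sketch. The accumulated log path probability plays the role of the activation l^{(n)}; log_eps stands for the log threshold \epsilon and is an illustrative name.

import numpy as np

def pruned_class_probs(node, x, log_activation=0.0, log_eps=-10.0, n_classes=3):
    # Activation below threshold: ignore this subtree (its output is set to zero).
    if log_activation < log_eps:
        return np.zeros(n_classes)
    if isinstance(node, ExpertNode):
        # Leaf contribution: path probability times the expert's class distribution.
        return np.exp(log_activation) * node.class_probs(x)
    out = np.zeros(n_classes)
    for gj, child in zip(node.gate_probs(x), node.children):
        out += pruned_class_probs(child, x, log_activation + np.log(gj),
                                  log_eps, n_classes)
    return out   # approximates P(c_m | x); exact when no subtree is pruned

With log_eps set to minus infinity no subtree is ever skipped and the exact mixture is recovered; raising the threshold trades a small approximation for fewer node evaluations.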
In addition to this path pruning scheme we can use the activation of the nodes to do more permanent pruning. If the overall utilisation U_i = \sum_n P(z_i, R_i | x^{(n)}) / N of a node falls below a small threshold, then the node is pruned completely from the tree. The sister subtrees of the removed node then subsume their parent nodes. This process is used solely to improve computational efficiency in this paper, although conceivably it could be used as a regularisation method, akin to the brain surgery techniques of Cun, Denker & Solla (1990). In such a scheme, however, a more useful measure of node utilisation would be the effective number of parameters (Moody 1992).

Figure 5: Path pruning in the HME.

Path pruning simulations

Figure 6 shows the application of the pruning algorithm to the task of discriminating between two interlocking spirals. With no pruning the solution to the two-spirals problem takes over 4,000 CPU seconds, whereas with pruning the solution is achieved in 155 CPU seconds.

One problem which we encountered when implementing this algorithm was in computing updates for the parameters of the tree in the case of high pruning thresholds. If a node is visited too few times during a training pass, it will sometimes have too little data to form reliable statistics, and thus the new parameter values may be unreliable and lead to instability. This is particularly likely when the gates are saturated. To avoid this saturation we use a simplified version of the regularisation scheme described in Waterhouse et al. (1995).

CONCLUSIONS

We have presented two extensions to the standard HME architecture. By pruning branches either during training or evaluation we may significantly reduce the computational requirements of the HME. By applying tree growing we allow greater flexibility in the HME, which results in faster training and more efficient use of parameters.

Figure 6: The effect of pruning on the two-spirals classification problem by an 8-deep binary-branching HME: (a) log-likelihood vs. time (CPU seconds), with log pruning thresholds for experts and gates \epsilon: (i) \epsilon = -5.6, (ii) \epsilon = -10, (iii) \epsilon = -15, (iv) no pruning; (b) training set for the two-spirals task, with the two classes indicated by crosses and circles; (c) solution to the two-spirals problem.

References

Breiman, L., Friedman, J., Olshen, R. & Stone, C. J. (1984), Classification and Regression Trees, Wadsworth and Brooks/Cole.

Cun, Y. L., Denker, J. S. & Solla, S. A. (1990), Optimal brain damage, in D. S. Touretzky, ed., 'Advances in Neural Information Processing Systems 2', Morgan Kaufmann, pp. 598-605.

Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977), 'Maximum likelihood from incomplete data via the EM algorithm', Journal of the Royal Statistical Society, Series B 39, 1-38.

Fahlman, S. E. & Lebiere, C. (1990), The Cascade-Correlation learning architecture, Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.

Jordan, M. I. & Jacobs, R. A. (1994), 'Hierarchical Mixtures of Experts and the EM algorithm', Neural Computation 6, 181-214.
Moody, J. E. (1992), The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems, in J. E. Moody, S. J. Hanson & R. P. Lippmann, eds, 'Advances in Neural Information Processing Systems 4', Morgan Kaufmann, San Mateo, California, pp. 847-854.

Waterhouse, S. R. & Robinson, A. J. (1994), Classification using hierarchical mixtures of experts, in 'IEEE Workshop on Neural Networks for Signal Processing', pp. 177-186.

Waterhouse, S. R., MacKay, D. J. C. & Robinson, A. J. (1995), Bayesian methods for mixtures of experts, in D. S. Touretzky, M. C. Mozer & M. E. Hasselmo, eds, 'Advances in Neural Information Processing Systems 8', MIT Press.

Wolpert, D. H. (1993), Stacked generalization, Technical Report LA-UR-90-3460, The Santa Fe Institute, 1660 Old Pecos Trail, Suite A, Santa Fe, NM, 87501.
", "award": [], "sourceid": 1165, "authors": [{"given_name": "Steve", "family_name": "Waterhouse", "institution": null}, {"given_name": "Anthony", "family_name": "Robinson", "institution": null}]}