{"title": "Hierarchical Mixtures of Experts Methodology Applied to Continuous Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 859, "page_last": 865, "abstract": null, "full_text": "Hierarchical Mixtures of Experts Methodology Applied to Continuous Speech Recognition\n\nYing Zhao, Richard Schwartz, Jason Sroka*, John Makhoul\n\nBBN Systems and Technologies\n70 Fawcett Street\nCambridge MA 02138\n\nAbstract\n\nIn this paper, we incorporate the Hierarchical Mixtures of Experts (HME) method of probability estimation, developed by Jordan [1], into an HMM-based continuous speech recognition system. The resulting system can be thought of as a continuous-density HMM system, but instead of using Gaussian mixtures, the HME system employs a large set of hierarchically organized but relatively small neural networks to perform the probability density estimation. The hierarchical structure is reminiscent of a decision tree except for two important differences: each \"expert\" or neural net performs a \"soft\" decision rather than a hard decision, and, unlike ordinary decision trees, the parameters of all the neural nets in the HME are automatically trainable using the EM algorithm. We report results on the ARPA 5,000-word and 40,000-word Wall Street Journal corpus using HME models.\n\n1 Introduction\n\nRecent research has shown that a continuous-density HMM (CD-HMM) system can outperform a more constrained tied-mixture HMM system for large-vocabulary continuous speech recognition (CSR) when a large amount of training data is available [2]. 
In other work, the utility of decision trees has been demonstrated in classification problems through effective use of the \"divide and conquer\" paradigm, in which a problem is divided into a hierarchical set of simpler problems. We present here a new CD-HMM system which has similar properties and possesses the same advantages as decision trees, but has the additional important advantage of having automatically trainable \"soft\" decision boundaries.\n\n*MIT, Cambridge MA 02139\n\n2 Hierarchical Mixtures of Experts\n\nThe method of Hierarchical Mixtures of Experts (HME) developed recently by Jordan [1] breaks a large-scale task into many small ones by partitioning the input space into a nested set of regions, then building a simple but specific model (local expert) in each region. The idea behind this method follows the principle of divide-and-conquer, which has been utilized in certain approaches to classification problems, such as decision trees. In the decision tree approach, at each level of the tree, the data are divided explicitly into regions. In contrast, the HME model makes use of \"soft\" splits of the data: instead of being explicitly divided into regions, the data may lie simultaneously in multiple regions with certain probabilities. Therefore, the variance-increasing effect of lopping off distant data in the decision tree can be ameliorated. Furthermore, the \"hard\" boundaries in the decision tree are fixed once a decision is made, while the \"soft\" boundaries in the HME are parameterized with generalized sigmoidal functions, which can be adjusted automatically using the Expectation-Maximization (EM) algorithm during the splitting. 
\n\nNow we describe how to apply the HME methodology to the CSR problem. For each state of a phonetic HMM, a separate HME is used to estimate the likelihood. The HME first computes a posterior probability $P(l | z, s)$, the probability of phoneme class $l$ given the input feature vector $z$ and state $s$. That probability is then divided by the a priori probability of the phone class $l$ at state $s$. A one-level HME performs the following computation:\n\n$$P(l | z, s) = \\sum_{i=1}^{C} P(l | c_i, z, s) P(c_i | z, s) \\qquad (1)$$\n\nwhere $l = 1, \\ldots, L$ indicates the phoneme class, $c_i$ represents a local region in the input space, and $C$ is the number of regions. $P(c_i | z, s)$ can be viewed as a gating network, while $P(l | c_i, z, s)$ can be viewed as a local expert classifier (expert network) in the region $c_i$ [1]. In a two-level HME, each region $c_i$ is divided in turn into $C$ subregions. The term $P(l | c_i, z, s)$ is then computed in a similar manner to equation (1), and so on. If no data are available in some of these subregions, we back off to the parent network.\n\n3 Technical Details\n\nAs in Jordan's paper, we use a generalized sigmoidal function to parameterize $P(c_i | z)$ as follows:\n\n$$P(c_i | z) = \\frac{e^{v_i^T z}}{\\sum_{k=1}^{C} e^{v_k^T z}} \\qquad (2)$$\n\nwhere $z$ can be the direct input (in a one-layer neural net) or the hidden layer vector (in a two-layer neural net), and $v_i$, $i = 1, \\ldots, C$, are weights that need to be trained. Similarly, the local phoneme classifier in region $c_i$, $P(l | c_i, z)$, can also be parameterized with a generalized sigmoidal function:\n\n$$P(l | c_i, z) = \\frac{e^{\\theta_{li}^T z}}{\\sum_{j=1}^{L} e^{\\theta_{ji}^T z}} \\qquad (3)$$\n\nwhere $\\theta_{ji}$, $j = 1, \\ldots, L$, are weights. The whole system consists of two sets of parameters, $v_i$, $i = 1, \\ldots, C$, and $\\theta_{ji}$, $j = 1, \\ldots, L$, collected as $\\Theta = \\{\\theta_{ji}, v_i\\}$. All parameters are estimated using the EM algorithm. 
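As an illustration only, the one-level computation in equation (1), with a softmax (generalized sigmoidal) gate and softmax experts, might be sketched in Python as follows. The function names, the toy weights, and the use of plain lists are assumptions made for this sketch and do not come from the system described here.

```python
import math

def softmax(scores):
    # Generalized sigmoidal function: exp(s_i) / sum_k exp(s_k).
    # Subtracting the maximum keeps the exponentials numerically stable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, z):
    return sum(wi * zi for wi, zi in zip(w, z))

def hme_posterior(z, gate_weights, expert_weights):
    # gate_weights[i] plays the role of v_i (one vector per region).
    # expert_weights[i][j] plays the role of theta_ji (class j, region i).
    gate = softmax([dot(v, z) for v in gate_weights])        # P(c_i | z)
    num_classes = len(expert_weights[0])
    posterior = [0.0] * num_classes
    for g, thetas in zip(gate, expert_weights):
        expert = softmax([dot(th, z) for th in thetas])      # P(l | c_i, z)
        for l in range(num_classes):
            posterior[l] += g * expert[l]                    # sum in equation (1)
    return posterior

# Hypothetical toy setup: C = 2 regions, L = 3 classes, 2-dimensional input.
z = [1.0, -0.5]
gate_weights = [[0.5, 0.0], [-0.5, 1.0]]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]],
]
posterior = hme_posterior(z, gate_weights, expert_weights)
print(posterior, sum(posterior))
```

Because both the gate and each expert are normalized, the mixture is itself a proper distribution over the phoneme classes; dividing each entry by the class prior would then give the scaled likelihood used by the HMM.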
\n\nEM is an iterative approach to maximum likelihood estimation. Each iteration of an EM algorithm is composed of two steps: an Expectation (E) step and a Maximization (M) step. The M step involves the maximization of a likelihood function that is redefined in each iteration by the E step. Using the parameterizations in (2) and (3), we obtain the following iterative procedure for computing the parameters $\\Theta = \\{v_i, \\theta_{ji}\\}$:\n\n1. Initialize $v_i^{(0)}$ and $\\theta_{ji}^{(0)}$ for $i = 1, \\ldots, C$, $j = 1, \\ldots, L$.\n\n2. E-step: In each iteration $n$, for each data pair $(z(t), l(t))$, $t = 1, \\ldots, N$, compute\n\n$$Z_i(t)^{(n)} = P(c_i | z(t), l(t), \\Theta^{(n)}) = \\frac{P(c_i | z(t), v_i^{(n)}) P(l(t) | c_i, z(t), \\theta_{l(t),i}^{(n)})}{\\sum_{k=1}^{C} P(c_k | z(t), v_k^{(n)}) P(l(t) | c_k, z(t), \\theta_{l(t),k}^{(n)})} \\qquad (4)$$\n\nwhere $i = 1, \\ldots, C$. $Z_i(t)^{(n)}$ represents the probability of the data point $t$ lying in the region $i$, given the current parameter estimate $\\Theta^{(n)}$. It is used as the weight for this data point in region $i$ in the M-step. The idea of \"soft\" splitting is reflected in the fact that these weights are probabilities between 0 and 1, instead of a \"hard\" decision of 0 or 1.\n\n3. M-step:\n\n$$\\theta_{ji}^{(n+1)} = \\arg\\max_{\\theta_{ji}} \\sum_{t=1}^{N} Z_i(t)^{(n)} \\log P(l(t) | c_i, z(t), \\theta_{ji}) \\qquad (5)$$\n\n$$v_i^{(n+1)} = \\arg\\max_{v_i} \\sum_{t=1}^{N} \\sum_{k=1}^{C} Z_k(t)^{(n)} \\log P(c_k | z(t), v_k) \\qquad (6)$$\n\n4. Iterate until $\\theta_{ji}$ and $v_i$ converge.\n\nThe first maximization means fitting a generalized sigmoidal model (3) using the labeled data $(z(t), l(t))$ with weights $Z_i(t)^{(n)}$. The second means fitting a generalized sigmoidal model (2) using inputs $z(t)$ and target outputs $Z_i(t)^{(n)}$. The criterion for fitting is the cross-entropy. Typically, the fitting can be solved by the Newton-Raphson method; however, this is quite expensive. Viewing this type of fitting as a multi-class classification task, we developed a technique to invert a generalized sigmoidal function more efficiently, which is described in the remainder of this section. 
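The E-step weight in equation (4) is a Bayes-rule posterior over regions for one labeled frame. A minimal Python sketch, assuming the gate probabilities and the expert probabilities of the observed label have already been computed, is shown below; the function name and the numbers are hypothetical.

```python
def e_step_weights(gate_probs, expert_probs_for_label):
    # gate_probs[i]             = P(c_i | z(t)) from the gating network
    # expert_probs_for_label[i] = P(l(t) | c_i, z(t)) from expert i
    # Returns the posterior P(c_i | z(t), l(t)) for every region i.
    joint = [g * e for g, e in zip(gate_probs, expert_probs_for_label)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical frame with C = 3 regions.
weights = e_step_weights([0.5, 0.3, 0.2], [0.1, 0.6, 0.3])
print(weights)
```

In this toy case the gate prefers the first region, but the second region explains the observed label much better, so it receives the largest weight. These fractional weights are what make the split soft: every region sees the frame, weighted by how responsible it is for it.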
\n\nA common method in multi-class classification is to divide the problem into many 2-class classifications. However, this method usually results in an imbalance between positive and negative training examples. To avoid this imbalance, the following technique can be used to solve for the multi-class posterior probabilities simultaneously.\n\nSuppose we have a labeled data set $(z(t), l(t))$, $t = 1, \\ldots, N$, where $l(t) \\in \\{1, \\ldots, L\\}$ is the label for the $t$-th data point. We use a generalized sigmoidal function to model the posterior probability $P(l | z)$, where $l = 1, \\ldots, L$, as follows:\n\n$$P_l(z) = P(l | z) = \\frac{e^{\\theta_l^T z}}{\\sum_{k} e^{\\theta_k^T z}} \\qquad (7)$$\n\nSince these probabilities sum to one, we have\n\n$$P_L(z) = 1 - \\sum_{l=1}^{L-1} P_l(z). \\qquad (8)$$\n\nNow, a training sample $z(t)$ with class label $l(t)$ can be interpreted as:\n\n$$P_l(z(t)) = \\begin{cases} 0.9 & l = l(t) \\\\ \\frac{0.1}{L-1} & l \\neq l(t) \\end{cases} \\qquad (9)$$\n\nIf we define\n\n$$\\theta_l^T z = \\log \\frac{P_l(z)}{P_L(z)} \\qquad (10)$$\n\nequation (10) implies that\n\n$$P_l(z) = \\frac{e^{\\theta_l^T z}}{\\sum_{k=1}^{L} e^{\\theta_k^T z}} \\qquad (11)$$\n\nfor $l = 1, \\ldots, L$ with $\\theta_L^T z = 0$. This expression is the generalized sigmoidal function in (7). This means that we can train the parameters in (7) to satisfy equation (10) on the data. Using a least squares criterion, the objective is\n\n$$\\min_{\\theta_l} \\sum_{t} \\left[ \\theta_l^T z(t) - \\log \\frac{P_l(z(t))}{P_L(z(t))} \\right]^2 \\qquad (12)$$\n\nfor $l = 1, \\ldots, L - 1$. Denote the data matrix as\n\n$$X = \\begin{bmatrix} z(1)^T \\\\ z(2)^T \\\\ \\vdots \\\\ z(N)^T \\end{bmatrix}$$\n\nA least squares solution to (12) is\n\n$$\\theta_l = (\\log a) (X^T X)^{-1} \\left[ \\sum_{l(t)=l} z(t) - \\sum_{l(t)=L} z(t) \\right] \\qquad (13)$$\n\nfor $l = 1, \\ldots, L$, where $a = 9(L - 1)$. Substituting (13) into (11), we get\n\n$$P_l(z) = \\frac{a^{z^T (X^T X)^{-1} \\sum_{l(t)=l} z(t)}}{\\sum_{k} a^{z^T (X^T X)^{-1} \\sum_{l(t)=k} z(t)}} \\qquad (14)$$\n\nEquations (13) and (14) are very easy to compute. 
Basically, we only have to accumulate the matrix $X^T X$ and sum the $z(t)$ into the different classes $l = 1, \\ldots, L$. We can obtain the probabilities $P_l(z)$ with a single inversion of the matrix $X^T X$ after one pass through the training data.\n\n4 Relation to Other Work\n\nThe work reported here is very different from our previous work utilizing neural nets for CSR. There, a single segmental neural network (SNN) is used to model a complete phonetic segment [3]. Here, each HME estimates the probability density for each state of a phonetic HMM. The work here is more similar to that of Cohen et al. [4], the major difference being that in [4] a single very large neural net is used to perform the probability density modeling. The training of such a large network requires the use of a specialized parallel processing machine so that the training can be done in a reasonable amount of time. By using the HME method and dividing the problem into many smaller problems, we are able to perform the needed training computation on regular workstations.\n\nMost of the previous work on CD-HMMs has utilized mixtures of Gaussians to estimate the probability densities of an HMM. Since a multilayer feedforward neural network is a universal continuous function approximator, we decided to explore the use of neural nets as an alternative approach for continuous density estimation. 
\n\n5 Experimental Results\n\nSystem                     Word Error Rate (%)\nHMM                        7.8\nSNN                        8.5\nHMM+SNN                    7.1\nHME                        7.6\nHME+HMM                    6.8\nPrior-modified HME+HMM     6.2\n\nTable 1: Error Rates for the ARPA WSJ 5K Development Test, Trigram Grammar\n\nSystem                     Word Error Rate (%)\nHMM                        9.5\nHME+HMM                    8.7\n\nTable 2: Error Rates for the ARPA WSJ 40K Test Set, Trigram Grammar\n\nIn our initial application of the HME method to large-vocabulary CSR, we used phonetic context-independent HMEs to estimate the likelihoods at each state of 5-state HMMs. We implemented a two-level HME, with the input space divided into 46 regions and each of those regions further divided into 46 subregions. The initial divisions were accomplished by supervised training, with each division trained to one of the 46 phonemes in the system. All gating and local expert networks in the HME had identical structures: a two-layer generalized sigmoidal network. The whole HME system was implemented within an N-best paradigm [3], where the recognized sequence was obtained by rescoring an N-best list produced by our baseline BYBLOS system (tied-mixture HMM) with a statistical trigram grammar.\n\nWe then built a context-dependent HME system based on the structure of the context-independent HME models described above. For each state, the whole training data was divided into 46 parts according to its left or right context. Then, for each context, a separate HME model was built. To be computationally feasible, we used only one-level HMEs here. We first experimented using a left-context and right-context model.\n\nWe tested the HME implementation on the ARPA 5,000-word Wall Street Journal corpus (WSJ1, H2 dev set). 
We report the word error rates on the same test set for a number of different systems. Table 1 shows the word error rates for i) the baseline HMM system; ii) the segment-based neural net system (SNN); iii) the hybrid SNN/HMM system; iv) an HME system alone; v) an HME system combined with the HMM; and vi) an HME+HMM system with modified priors.\n\nFrom Table 1, the word error rate of the baseline tied-mixture HMM is 7.8%. The performance of the SNN system (8.5%) is comparable to the HMM alone. We see that the performance of an HME (7.6%) is as good as that of the HMM system and better than that of the SNN system. When combined with the baseline HMM system, the HME and SNN both improve performance over the HMM alone by about 10%, from 7.8% to 6.8% and from 7.8% to 7.1% respectively. We found that the improvement could be made larger for a hybrid HME/HMM by adjusting the context-dependent priors with the context-independent priors, and then smoothing the context-dependent models with a context-independent model.\n\nMore specifically, in a context-dependent HME model, we usually estimate the posterior probability of phoneme $l$, $P(l | c, z, s)$, given the left or right context $c$ and the acoustic input $z$ in a particular state $s$. Because the samples may be sparse for many of the context models, it is necessary to regularize (smooth) the context-dependent models with a context-independent model, for which much more data is available. However, the two models have different priors: $P(l | c, s)$ in a context-dependent model and $P(l | s)$ in a context-independent model. A simple interpolation between the two models, which is $P(l | c, z, s) = \\frac{P(z | l, c, s) P(l | c, s)}{P(z | c, s)}$ in a context-dependent model and $P(l | z, s) = \\frac{P(z | l, s) P(l | s)}{P(z | s)}$ in a context-independent model, is therefore inconsistent. 
To scale the context-dependent priors $P(l | c, s)$ to the context-independent prior $P(l | s)$, we weighted each input data point $z$ with the weight $\\frac{P(l | s)}{P(l | c, s)}$ as a prior adjustment. After this modification, a context-dependent HME actually estimates $\\frac{P(z | l, c, s) P(l | s)}{P(z | c, s)}$, which combines better with a context-independent model. For the same experiment shown in Table 1, the word error rate for the HME (with HMM) dropped from 6.8% to 6.2% when the priors were modified. For this 5,000-word development set, we obtained a total word error reduction of about 20% over the tied-mixture HMM system using an HME-based neural network system.\n\nWe then switched our experimental domain from the 5,000-word to the 40,000-word test set. During this year, the BYBLOS system has been improved from a tied-mixture system to a continuous density system. We also switched to using this new continuous density BYBLOS in our hybrid HME/HMM system. The language model used here was a 40,000-word trigram grammar. The result is shown in Table 2.\n\nFrom Table 2, we see that there is about a 10% word error rate reduction over the continuous density HMM system from combining it with a context-dependent HME system. Compared with the 20% improvement over the tied-mixture system obtained on the 5,000-word development set, the improvement over the continuous density system on this 40,000-word task is smaller. This may be due to the large improvement of the HMM system itself.\n\n6 Conclusions\n\nThe method of hierarchical mixtures of experts can be used as a continuous density estimator for speech recognition. 
Experimental results showed that estimates from this approach are consistent with the estimates from the HMM system. The frame-based neural net system using hierarchical mixtures of experts improves the performance of both the state-of-the-art tied-mixture HMM system and the continuous density HMM system. The HME system by itself has the same performance as the state-of-the-art tied-mixture HMM system.\n\n7 Acknowledgments\n\nThis work was funded by the Advanced Research Projects Agency of the Department of Defense.\n\nReferences\n\n[1] Michael Jordan, \"Hierarchical Mixtures of Experts and the EM Algorithm,\" Neural Computation, 1994, in press.\n\n[2] D. Pallett, J. Fiscus, W. Fisher, J. Garofolo, B. Lund, and M. Przybocki, \"1993 Benchmark Tests for the ARPA Spoken Language Program,\" Proc. ARPA Human Language Technology Workshop, Plainsboro, NJ, Morgan Kaufmann Publishers, 1994.\n\n[3] G. Zavaliagkos, Y. Zhao, R. Schwartz and J. Makhoul, \"A Hybrid Neural Net System for State-of-the-Art Continuous Speech Recognition,\" in Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan and C. L. Giles, eds., Morgan Kaufmann Publishers, 1993.\n\n[4] M. Cohen, H. Franco, N. Morgan, D. Rumelhart and V. Abrash, \"Context-Dependent Multiple Distribution Phonetic Modeling with MLPs,\" in Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan and C. L. Giles, eds., Morgan Kaufmann Publishers, 1993.\n", "award": [], "sourceid": 929, "authors": [{"given_name": "Ying", "family_name": "Zhao", "institution": null}, {"given_name": "Richard", "family_name": "Schwartz", "institution": null}, {"given_name": "Jason", "family_name": "Sroka", "institution": null}, {"given_name": "John", "family_name": "Makhoul", "institution": null}]}