{"title": "REMAP: Recursive Estimation and Maximization of A Posteriori Probabilities - Application to Transition-Based Connectionist Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 388, "page_last": 394, "abstract": null, "full_text": "REMAP:  Recursive Estimation and \n\nMaximization of A  Posteriori \nProbabilities - Application to \n\nTransition-Based  Connectionist  Speech \n\nRecognition \n\nYochai Konig,  Herve  Bourlard~ and Nelson Morgan \n\n{konig, bourlard,morgan }@icsi.berkeley.edu \nInternational Computer Science Institute \n\n1947 Center Street  Berkeley,  CA  94704,  USA. \n\nAbstract \n\nIn  this  paper,  we  introduce  REMAP,  an  approach for  the training \nand estimation of posterior probabilities using a recursive algorithm \nthat is  reminiscent of the EM-based  Forward-Backward  (Liporace \n1982)  algorithm  for  the  estimation  of sequence  likelihoods.  Al(cid:173)\nthough  very  general,  the  method  is  developed  in  the  context  of a \nstatistical  model for  transition-based speech  recognition  using  Ar(cid:173)\ntificial  Neural  Networks  (ANN)  to  generate  probabilities for  Hid(cid:173)\nden  Markov  Models  (HMMs).  In  the  new  approach,  we  use  local \nconditional posterior probabilities of transitions to estimate global \nposterior  probabilities of word  sequences.  Although  we  still  use \nANNs  to  estimate  posterior  probabilities,  the  network  is  trained \nwith targets that are themselves estimates of local posterior proba(cid:173)\nbilities.  An  initial experimental result shows a significant decrease \nin error-rate in  comparison to a  baseline system. \n\n1 \n\nINTRODUCTION \n\nThe ultimate goal in speech  recognition is to determine the sequence of words  that \nhas  been  uttered.  Classical  pattern  recognition  theory  shows  that the  best  possi(cid:173)\nble  system  (in  the  sense  of minimum probability of error)  is  the one  that  chooses \nthe word sequence  with the maximum a  posteriori  probability (conditioned on the \n\n* Also  affiliated  with with  Faculte Poly technique  de  Mons,  Mons,  Belgium \n\n\fREMAP:  Recursive  Estimation and  Maximization of A Posteriori  Probabilities \n\n389 \n\nevidence).  If word  sequence  i  is  represented  by  the  statistical  model  M i ,  and  the \nevidence  (which,  for  the application reported  here,  is  acoustical)  is  represented  by \na  sequence  X  =  {Xl, ... , X n , ... , X N  },  then  we  wish  to  choose  the  sequence  that \ncorresponds  to  the  largest  P(MiIX).  In  (Bourlard  &  Morgan  1994),  summarizing \nearlier  work  (such  as  (Bourlard  &  Wellekens  1989)),  we  showed  that it was  possi(cid:173)\nble  to compute the global a  posteriori  probability P(MIX) of a  discriminant form \nof Hidden  Markov  Model  (Discriminant  HMM),  M,  given  a  sequence  of acoustic \nvectors  X.  In  Discriminant HMMs,  the global  a  posteriori  probability P(MIX) is \ncomputed as follows:  if r  represents  all legal  paths (state sequences  ql, q2, ... , qN) \nin  Mi,  N  being the length of the sequence,  then \n\nP(Mi IX) = L P(Mi, ql, q2,  ... , qNIX) \n\nr \n\nin  which  ~n represents  the specific  state hypothesized  at time n, from  the set  Q  = \n{ql, ... , q , qk, ... , qK}  of all  possible  HMM  states  making up  all  possible  models \nMi.  
We can further decompose this into:

P(M_i, q_1, q_2, \ldots, q_N | X) = P(q_1, q_2, \ldots, q_N | X) \, P(M_i | q_1, q_2, \ldots, q_N, X)

Under the assumptions stated in (Bourlard & Morgan 1994) we can compute

P(q_1, q_2, \ldots, q_N | X) = \prod_{n=1}^{N} p(q_n | q_{n-1}, x_n)

The Discriminant HMM is thus described in terms of conditional transition probabilities p(q_n^\ell | q_{n-1}^k, x_n), in which q_n^\ell stands for the specific state q^\ell of Q hypothesized at time n, and can be schematically represented as in Figure 1.

[Figure 1: a three-state left-to-right model over the phone states /k/, /ae/, /t/, with self-loop and forward transitions labeled P(/k/|/k/, x), P(/ae/|/k/, x), P(/ae/|/ae/, x), P(/t/|/ae/, x), and P(/t/|/t/, x).]

Figure 1: An example Discriminant HMM for the word "cat". The variable x refers to a specific acoustic observation x_n at time n.

Finally, given a state sequence we assume the following approximation:

P(M_i | q_1, q_2, \ldots, q_N, X) \approx P(M_i | q_1, q_2, \ldots, q_N)

We can estimate the right side of this last equation from a phonological model (in the case that a given state sequence can belong to two different models). All the required (local) conditional transition probabilities p(q_n^\ell | q_{n-1}^k, x_n) can be estimated by the Multi-Layer Perceptron (MLP) shown in Figure 2.

[Figure 2: an MLP whose inputs are the current acoustic vector and a one-of-K encoding of the previous state, and whose outputs estimate P(current_state | acoustics, previous_state).]

Figure 2: An MLP that estimates local conditional transition probabilities.

Recent work at ICSI has provided us with further insight into the Discriminant HMM, particularly in light of recent work on transition-based models (Konig & Morgan 1994; Morgan et al. 1994). This new perspective has motivated us to further develop the original Discriminant HMM theory. The new approach uses posterior probabilities at both local and global levels and is more discriminant in nature. In this paper, we introduce the Recursive Estimation-Maximization of A Posteriori Probabilities (REMAP) training algorithm for hybrid HMM/MLP systems. The proposed algorithm models a probability distribution over all possible transitions (from all possible states and for all possible time frames n) rather than picking a single time point as a transition target. Furthermore, the algorithm incrementally increases the posterior probability of the correct model, while reducing the posterior probabilities of all other models. Thus, it brings the overall system closer to the optimal Bayes classifier.

A wide range of discriminant approaches to speech recognition have been studied by researchers (Katagiri et al. 1991; Bengio et al. 1992; Bourlard et al. 1994). A significant difficulty that has remained in applying these approaches to continuous speech recognition is the requirement to run computationally intensive algorithms on all of the rival sentences. Since this is not generally feasible, compromises must always be made in practice. For instance, estimates for all rival sentences can be derived from a list of the "N-best" utterance hypotheses, or by using a fully connected word model composed of all phonemes.
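The factorization above can be evaluated by a simple forward accumulation. The sketch below is ours, not from the paper: the table p_local of local probabilities is fabricated for illustration and would in the real system come from a forward pass of the MLP of Figure 2 on each acoustic vector paired with a one-of-K encoding of the previous state.

```python
import numpy as np

# Toy setup: K = 3 states (/k/, /ae/, /t/) of a left-to-right model, N = 4 frames.
# p_local[n, k, l] stands in for the MLP output P(q_n = q^l | q_{n-1} = q^k, x_n).
rng = np.random.default_rng(0)
K, N = 3, 4
p_local = rng.random((N, K, K))
p_local /= p_local.sum(axis=2, keepdims=True)   # each row: a distribution over l

# Left-to-right topology of Figure 1: self-loops plus single forward steps.
legal = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]], dtype=float)

# alpha[l] = sum over legal partial paths ending in state l of the product
# of local conditional transition probabilities; paths start in the first state.
alpha = np.zeros(K)
alpha[0] = 1.0
for n in range(N):
    alpha = (alpha[:, None] * p_local[n] * legal).sum(axis=0)

# With P(M | q_1 ... q_N) ~ 1 for state sequences belonging only to M, the
# global posterior is the mass of legal paths finishing in the final state.
print("P(M | X) =", alpha[-1])
```

In recognition, the same accumulation would be carried out for every rival word model M_i, and the model with the largest accumulated posterior would be chosen.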


2 REMAP TRAINING OF THE DISCRIMINANT HMM

2.1 MOTIVATIONS

The Discriminant HMM/MLP theory as described above uses transition-based probabilities as the key building block for acoustic recognition. However, it is well known that estimating transitions accurately is a difficult problem (Glass 1988). Due to the inertia of the articulators, the boundaries between phones are blurred and overlapped in continuous speech. In our previous hybrid HMM/MLP system, targets were typically obtained by using a standard forced Viterbi alignment (segmentation). For a transition-based system as defined above, this procedure would thus yield rigid transition targets, which is not realistic.

Another problem related to the Viterbi-based training of the MLP presented in Figure 2 and used in Discriminant HMMs is the lack of coverage of the input space during training. Indeed, during training (based on hard transitions), the MLP only processes inputs consisting of "correct" pairs of acoustic vectors and correct previous state, while in recognition the net should generalize to all possible combinations of acoustic vectors and previous states, since all possible models and transitions will be hypothesized for each acoustic input. For example, some hypothesized inputs may correspond to an impossible condition that has thus never been observed, such as the acoustics of the temporal center of a vowel in combination with a previous state that corresponds to a plosive. It is unfortunately possible that the interpolative capabilities of the network may not be sufficient to give these "impossible" pairs a sufficiently low probability during recognition.

One possible solution to these problems is to use a full MAP algorithm to find transition probabilities at each frame for all possible transitions by a forward-backward-like algorithm (Liporace 1982), taking all possible paths into account.

2.2 PROBLEM FORMULATION

As described above, global maximum a posteriori training of HMMs should find the optimal parameter set \Theta maximizing

\prod_{j=1}^{J} P(M_j | X_j, \Theta)    (1)

in which M_j represents the Markov model associated with each training utterance X_j, with j = 1, \ldots, J.

Although in principle we could use a generalized back-propagation-like gradient procedure in \Theta to maximize (1) (Bengio et al. 1992), an EM-like algorithm should have better convergence properties, and could preserve the statistical interpretation of the ANN outputs. In this case, training of the Discriminant HMM by a global MAP criterion requires a solution to the following problem: given a trained MLP at iteration t providing a parameter set \Theta_t and, consequently, estimates of P(q_n^\ell | x_n, q_{n-1}^k, \Theta_t), how can we determine new MLP targets that:

1. will be smooth estimates of conditional transition probabilities q_{n-1}^k -> q_n^\ell, for all k, \ell in [1, K] and all n in [1, N], and

2. when training the MLP for iteration t+1, will lead to new estimates \Theta_{t+1} and P(q_n^\ell | x_n, q_{n-1}^k, \Theta_{t+1}) that are guaranteed to incrementally increase the global posterior probability P(M_i | X, \Theta)?
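Two properties are worth making explicit here (our reading of the two conditions above, not a statement from the original text): for each reachable previous state q^k and each frame n, the new targets must form a proper distribution over the current state, and a fixed point of the iteration is reached when the local network outputs agree with the conditionals given all of the data:

```latex
% Our restatement of conditions 1 and 2; notation as in the paper.
\begin{align*}
  &\sum_{\ell=1}^{K} P(q_n^{\ell} \mid q_{n-1}^{k}, X, M) = 1,
     \qquad \forall k \in [1,K],\ \forall n \in [1,N], \\
  % At a fixed point (\Theta_{t+1} = \Theta_t), outputs equal targets:
  &P(q_n^{\ell} \mid x_n, q_{n-1}^{k}, \Theta^{*})
     = P(q_n^{\ell} \mid q_{n-1}^{k}, X, M, \Theta^{*}).
\end{align*}
```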
In (Bourlard et al. 1994), we prove that a re-estimate of MLP targets that guarantees convergence to a local maximum of (1) is given by^1:

P(q_n^\ell | q_{n-1}^k, x_n) = P(q_n^\ell | q_{n-1}^k, X, M)    (2)

where we have estimated the left-hand side using a mapping from the previous state and the local acoustic data to the current state, thus making the estimator realizable by an MLP with a local acoustic window.^2 Thus, we will want to estimate the transition probability conditioned on the local data (as MLP targets) by using the transition probability conditioned on all of the data.

^1 In most of the following, we consider only one particular training sequence X associated with one particular model M. It is, however, easy to see that all of our conclusions remain valid for the case of several training sequences X_j, j = 1, \ldots, J. A simple way to look at the problem is to consider all training sequences as a single training sequence obtained by concatenating all the X_j's with boundary conditions at every possible beginning and ending point.

^2 Note that, as done in our previous hybrid HMM/MLP systems, all conditionals on x_n can be replaced by X_{n-c}^{n+d} = \{x_{n-c}, \ldots, x_n, \ldots, x_{n+d}\} to take some acoustic context into account.

In (Bourlard et al. 1994), we further prove that alternating MLP target estimation (the "estimation" step) and MLP training (the "maximization" step) is guaranteed to incrementally increase (1) over t.^3 The remaining problem is to find an efficient algorithm to express P(q_n^\ell | X, q_{n-1}^k, M) in terms of P(q_n^\ell | x_n, q_{n-1}^k) so that the next iteration targets can be found. We have developed several approaches to this estimation, some of which are described in (Bourlard et al. 1994). Currently, we are implementing this with an efficient recursion that estimates the sum of all possible paths in a model, for every possible transition at each possible time. From these values we can compute the desired targets (2) for network training by

P(q_n^\ell | X, q_{n-1}^k, M) = P(M, q_n^\ell, q_{n-1}^k | X) / \sum_{j=1}^{K} P(M, q_n^j, q_{n-1}^k | X)    (3)

2.3 REMAP TRAINING ALGORITHM

The general scheme of REMAP training of hybrid HMM/MLP systems can be summarized as follows (a minimal sketch of one iteration appears after the list):

1. Start from some initial net providing P(q_n^\ell | x_n, q_{n-1}^k, \Theta_t), t = 0, for all possible (k, \ell)-pairs.^4

2. Compute MLP targets P(q_n^\ell | X_j, q_{n-1}^k, \Theta_t, M_j) according to (3), for all training sentences X_j associated with HMM M_j, all possible (k, \ell) state transition pairs in M_j, and all x_n, n = 1, \ldots, N, in X_j (see next point).

3. For every x_n in the training database, train the MLP to minimize the relative entropy between the outputs and targets. See (Bourlard et al. 1994) for more details. This provides us with a new set of parameters \Theta_{t+1}.

4. Iterate from 2 until convergence.
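The sketch below is ours and uses illustrative stand-ins: the MLP is replaced by a lookup table p_local of fabricated local probabilities, and the M step simply adopts the targets instead of retraining a network. It implements one REMAP estimation step for a single utterance: a forward-backward-like recursion over the posterior-product model gives the joint terms P(M, q_{n-1}^k, q_n^\ell | X), which are then normalized as in (3) to obtain the new soft targets.

```python
import numpy as np

def remap_step(p_local, legal):
    """One REMAP estimation step for a single utterance (illustrative code).

    p_local[n, k, l] stands in for the MLP output P(q_n^l | q_{n-1}^k, x_n);
    legal[k, l] masks the model topology.  Returns the new soft targets
    targets[n, k, l] = P(q_n^l | X, q_{n-1}^k, M) of eq. (3), together with
    the global posterior P(M | X) given by the forward recursion.
    """
    N, K, _ = p_local.shape
    trans = p_local * legal                      # drop illegal transitions

    # Forward: alpha[n, l] = mass of legal prefixes ending in state l at time n.
    alpha = np.zeros((N + 1, K))
    alpha[0, 0] = 1.0                            # paths start in the first state
    for n in range(N):
        alpha[n + 1] = alpha[n] @ trans[n]

    # Backward: beta[n, k] = mass of legal suffixes leaving state k after time n.
    beta = np.zeros((N + 1, K))
    beta[N, K - 1] = 1.0                         # paths must end in the last state
    for n in range(N - 1, -1, -1):
        beta[n] = trans[n] @ beta[n + 1]

    # Joint terms P(M, q_{n-1}^k, q_n^l | X), then normalize over l as in (3).
    joint = alpha[:-1, :, None] * trans * beta[1:, None, :]
    denom = joint.sum(axis=2, keepdims=True)
    targets = np.divide(joint, denom, out=np.zeros_like(joint), where=denom > 0)
    return targets, alpha[N, K - 1]

# Tiny demo: a 3-state left-to-right model and 5 frames of fabricated outputs.
rng = np.random.default_rng(0)
K, N = 3, 5
p_local = rng.random((N, K, K))
p_local /= p_local.sum(axis=2, keepdims=True)
legal = np.triu(np.tri(K, K, 1))                 # self-loops + one step forward

for t in range(3):
    targets, posterior = remap_step(p_local, legal)
    print(f"iteration {t}: P(M | X) = {posterior:.4f}")
    # M-step stand-in: the real system retrains the MLP toward these soft
    # targets (relative entropy); here we simply adopt them where defined.
    mask = targets.sum(axis=2, keepdims=True) > 0
    p_local = np.where(mask, targets, p_local)
```

Because this stand-in "network" can represent the targets exactly, the posterior jumps to its maximum in a single adoption step; a real MLP can only move toward the targets, which is the GEM-like behavior discussed next.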
This procedure is thus composed of two steps: an Estimation (E) step, corresponding to step 2 above, and a Maximization (M) step, corresponding to step 3 above. In this regard, it is reminiscent of the Estimation-Maximization (EM) algorithm as discussed in (Dempster et al. 1977). However, in the standard EM algorithm, the M step involves the actual maximization of the likelihood function. In a related approach, usually referred to as the Generalized EM (GEM) algorithm, the M step does not actually maximize the likelihood but simply increases it (by using, e.g., a gradient procedure). Similarly, REMAP increases the global posterior function during the M step (in the direction of targets that actually maximize that global function), rather than actually maximizing it. Recently, a similar approach was suggested for mapping input sequences to output sequences (Bengio & Frasconi 1995).

^3 Note here that one "iteration" does not stand for one iteration of the MLP training but for one estimation-maximization iteration, for which a complete MLP training will be required.

^4 This can be done, for instance, by training up such a net from a hand-labeled database like TIMIT or from some initial forward-backward estimator of equivalent local probabilities (usually referred to as "gamma" probabilities in the Baum-Welch procedure).

3 EXPERIMENTS AND RESULTS

For testing our theory we chose the Numbers'93 corpus, a continuous speech database collected by CSLU at the Oregon Graduate Institute. It consists of numbers spoken naturally over telephone lines on the public-switched network (Cole et al. 1994). The Numbers'93 database consists of 2167 speech files of spoken numbers produced by 1132 callers. We used 877 of these utterances for training and 657 for cross-validation and testing (200 for cross-validation), saving the remaining utterances for final testing purposes. There are 36 words in the vocabulary, namely zero, oh, 1, 2, 3, ..., 20, 30, 40, 50, ..., 100, 1000, a, and, dash, hyphen, and double.

All our nets have 214 inputs: 153 inputs for the acoustic features, and 61 to represent the previous state (one unit for every possible previous state, one state per phoneme in our case). The acoustic features are combined from 9 frames with 17 features each (RASTA-PLP8 + delta features + delta log gain), computed with an analysis window of 25 ms every 12.5 ms (overlapping windows) at a sampling rate of 8 kHz. The nets have 200 hidden units and 61 outputs.

  System               Word Error Rate
  DHMM, pre-REMAP      14.9%
  1 REMAP iteration    13.6%
  2 REMAP iterations   13.2%

Table 1: Training and testing on continuous numbers, no syntax, no durational models.

Our results are summarized in Table 1. The row entitled "DHMM, pre-REMAP" corresponds to a Discriminant HMM using the same training approach, with hard targets determined by the first system, and additional inputs to represent the previous state. The improvement in the recognition rate as a result of REMAP iterations is significant at p < 0.05. However, all the experiments were done using acoustic information alone. Using our (baseline) hybrid system under equal conditions, i.e., no duration information and no language information, we get 31.6% word error; adding the duration information back, we get 12.4% word error. We are currently experimenting with enforcing minimum duration constraints in our framework.
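As a quick consistency check on the dimensions quoted above (our addition; the paper contains no such code), the input layer size and the frame geometry work out as follows:

```python
# Network input dimensionality from Section 3.
frames, feats_per_frame = 9, 17          # RASTA-PLP8 + deltas + delta log gain
acoustic_inputs = frames * feats_per_frame
prev_state_inputs = 61                   # one-of-61 previous phone state
assert acoustic_inputs == 153
assert acoustic_inputs + prev_state_inputs == 214   # total MLP inputs

# Frame geometry: 25 ms windows every 12.5 ms at an 8 kHz sampling rate.
fs = 8000
window_samples = int(0.025 * fs)         # 200 samples per analysis window
step_samples = int(0.0125 * fs)          # 100 samples -> 50% window overlap
print(acoustic_inputs, window_samples, step_samples)
```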
4 CONCLUSIONS

In summary:

- We have a method for MAP training and estimation of sequences.

- This can be used in a new form of hybrid HMM/MLP. Note that recurrent nets or TDNNs could also be used. As with standard HMM/MLP hybrids, the network is used to estimate local posterior probabilities (though in this case they are conditional transition probabilities, that is, state probabilities conditioned on the acoustic data and the previous state). However, in the case of REMAP these nets are trained with probabilistic targets that are themselves estimates of local posterior probabilities.

- Initial experiments demonstrate a significant reduction in error rate for this process.

Acknowledgments

We would like to thank Kristine Ma and Su-Lin Wu for their help with the Numbers'93 database. We also thank OGI, and in particular Ron Cole, for providing the database. We gratefully acknowledge the support of the Office of Naval Research, URI No. N00014-92-J-1617 (via UCB), the European Commission via ESPRIT project 20077 (SPRACH), and ICSI and FPMs in general for supporting this work.

References

BENGIO, Y., & P. FRASCONI. 1995. An input output HMM architecture. In Advances in Neural Information Processing Systems, ed. by G. Tesauro, D. Touretzky, & T. Leen, volume 7. Cambridge: MIT Press.

BENGIO, Y., R. DE MORI, G. FLAMMIA, & R. KOMPE. 1992. Global optimization of a neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks 3:252-258.

BOURLARD, H., Y. KONIG, & N. MORGAN. 1994. REMAP: Recursive estimation and maximization of a posteriori probabilities, application to transition-based connectionist speech recognition. Technical Report TR-94-064, International Computer Science Institute, Berkeley, CA.

BOURLARD, H., & N. MORGAN. 1994. Connectionist Speech Recognition - A Hybrid Approach. Kluwer Academic Publishers.

BOURLARD, H., & C. J. WELLEKENS. 1989. Links between Markov models and multilayer perceptrons. In Advances in Neural Information Processing Systems 1, ed. by D. J. Touretzky, 502-510, San Mateo: Morgan Kaufmann.

COLE, R. A., M. FANTY, & T. LANDER. 1994. Telephone speech corpus development at CSLU. In Proceedings Int'l Conference on Spoken Language Processing, Yokohama, Japan.

DEMPSTER, A. P., N. M. LAIRD, & D. B. RUBIN. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39:1-38.

GLASS, J. R. 1988. Finding Acoustic Regularities in Speech: Applications to Phonetic Recognition. M.I.T. dissertation.

KATAGIRI, S., C. H. LEE, & B. H. JUANG. 1991. New discriminative training algorithms based on the generalized probabilistic descent method. In Proc. of the IEEE Workshop on Neural Networks for Signal Processing, ed. by B. H. Juang, S. Y. Kung, & C. A. Kamm, 299-308.

KONIG, Y., & N. MORGAN. 1994. Modeling dynamics in connectionist speech recognition - the time index model. In Proceedings Int'l Conference on Spoken Language Processing, 1523-1526, Yokohama, Japan.

LIPORACE, L.
A. 1982. Maximum likelihood estimation for multivariate observations of Markov sources. IEEE Transactions on Information Theory IT-28:729-734.

MORGAN, N., H. BOURLARD, S. GREENBERG, & H. HERMANSKY. 1994. Stochastic perceptual auditory-event-based models for speech recognition. In Proceedings Int'l Conference on Spoken Language Processing, 1943-1946, Yokohama, Japan.
", "award": [], "sourceid": 1027, "authors": [{"given_name": "Yochai", "family_name": "Konig", "institution": null}, {"given_name": "Herv\u00e9", "family_name": "Bourlard", "institution": null}, {"given_name": "Nelson", "family_name": "Morgan", "institution": null}]}