{"title": "An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1237, "page_last": 1244, "abstract": null, "full_text": "An Asynchronous  Hidden  Markov  Model \n\nfor  Audio-Visual  Speech Recognition \n\nSamy Bengio \n\nDalle  Molle Institute for  Perceptual Artificial Intelligence  (IDIAP) \n\nCP  592,  rue du Simplon 4, \n1920 Martigny, Switzerland \n\nbengio@idiap.ch.http://www.idiap.ch/-bengio \n\nAbstract \n\nThis paper presents a  novel  Hidden  Markov Model architecture to \nmodel the joint probability of pairs of asynchronous sequences de(cid:173)\nscribing the same event.  It is based on two other Markovian models, \nnamely  Asynchronous  Input/ Output  Hidden  Markov  Models  and \nPair Hidden Markov Models.  An EM algorithm to train the model \nis  presented,  as  well  as  a  Viterbi  decoder  that can  be  used  to  ob(cid:173)\ntain  the  optimal  state sequence  as  well  as  the  alignment  between \nthe two  sequences.  The model has  been tested on an audio-visual \nspeech  recognition  task  using  the  M2VTS  database  and  yielded \nrobust performances under various noise  conditions. \n\n1 \n\nIntroduction \n\nHidden  Markov Models  (HMMs)  are statistical tools  that have been  used  success(cid:173)\nfully  in the  last  30  years to model  difficult  tasks  such  as  speech  recognition  [6)  or \nbiological sequence analysis [4).  They are very well suited to handle discrete of con(cid:173)\ntinuous  sequences of varying sizes.  Moreover,  an efficient  training algorithm  (EM) \nis  available, as well as  an efficient  decoding algorithm (Viterbi),  which provides the \noptimal  sequence  of states  (and  the  corresponding  sequence  of high  level  events) \nassociated with a  given sequence of low-level data. \n\nOn the other hand, multimodal information processing is  currently a  very challeng(cid:173)\ning  framework  of applications  including  multimodal  person  authentication,  multi(cid:173)\nmodal speech recognition, multimodal event analyzers, etc.  In that framework,  the \nsame  sequence  of events  is  represented  not  only  by  a  single  sequence  of data but \nby  a  series  of sequences  of data,  each  of them  coming  eventually from  a  different \nmodality:  video streams with various viewpoints,  audio stream(s), etc. \n\nOne such  task,  which  will  be  presented in  this  paper,  is  multimodal speech  recog(cid:173)\nnition  using  both  a  microphone  and  a  camera recording  a  speaker  simultaneously \nwhile  he  (she)  speaks.  It is  indeed well  known that seeing the speaker's face  in  ad(cid:173)\ndition to hearing his  (her)  voice can often improve speech intelligibility, particularly \nin  noisy environments  [7),  mainly thanks to the complementarity of the visual and \nacoustic signals.  Previous  solutions  proposed for  this  task can be subdivided  into \n\n\ftwo  categories  [8]:  early  integration,  where  both signals  are first  modified  to reach \nthe  same  frame  rate  and  are  then  modeled  jointly,  or  late  integration,  where  the \nsignals  are modeled  separately and  are combined  later, during decoding.  While  in \nthe former solution, the alignment between the two sequences is  decided  a priori, in \nthe latter, there is  no explicit learning of the joint probability of the two sequences. 
\nAn example of late integration is presented in [3],  where the authors present a multi(cid:173)\nstream approach where each stream is  modeled by a different HMM,  while decoding \nis  done on a  combined HMM  (with various combination approaches proposed) . \n\nIn  this  paper,  we  present  a  novel  Asynchronous  Hidden  Markov  Model  (AHMM) \nthat  can  learn  the joint  probability of pairs  of sequences  of data representing  the \nsame  sequence  of events,  even  when  the  events  are not  synchronized  between  the \nsequences.  In fact,  the model enables to  desynchronize  the streams by temporarily \nstretching one  of them in order to obtain a  better match between the  correspond(cid:173)\ning frames .  The model  can thus be directly  applied to the problem of audio-visual \nspeech  recognition  where  sometimes  lips  start  to  move  before  any  sound  is  heard \nfor  instance.  The  paper  is  organized  as  follows:  in  the  next  section,  the  AHMM \nmodel is  presented, followed  by the corresponding EM training and Viterbi  decod(cid:173)\ning  algorithms.  Related models  are  then presented and  implementation  issues  are \ndiscussed.  Finally,  experiments on a  audio-visual speech recognition task based on \nthe M2VTS  database are presented, followed  by a  conclusion. \n\n2  The Asynchronous  Hidden  Markov  Model \nmodeling  the  joint  probability  of  2  asynchronous  sequences,  denoted  xi  and yr \n\nFor  the  sake  of simplicity,  let  us  present  here  the  case  where  one  is  interested  in \n\nwith  S  ::;  T  without loss  of generality!. \nWe are thus interested in modeling p(xi, Yr).  As it is intractable if we  do it directly \nby  considering  all  possible  combinations,  we  introduce  a  hidden  variable  q  which \nrepresents the state as in the classical HMM formulation, and which is synchronized \nwith the longest sequence.  Let N  be the number of states. \nMoreover, in the model presented here,  we  always emit Xt  at time t and sometimes \nemit  Ys  at  time  t.  Let  us  first  define  E(i, t)  =  P(Tt=sh- l =s - 1, qt=i, xLyf)  as \nthe probability that the system emits the next observation of sequence  y  at time t \nwhile in state i.  The additional hidden variable Tt  = s can be seen as the alignment \nbetween y and q (and x which is aligned with q).  Hence,  we model p(x f,yr, qf, T'[). \n\n2.1  Likelihood  Computation \n\nUsing classical HMM independence assumptions, a simple forward procedure can \nbe  used  to  compute  the  joint  likelihood  of the  two  sequences,  by  introducing  the \nfollowing 0: intermediate variable for each state and each possible alignment between \nthe sequences  x  and y: \n\no:(i,s,t) \n\no:(i , s,t) \n\nE(i, t)p( Xt , yslqt=i) L P(qt=ilqt- l =j)o:(j, s - 1, t - 1) \n\nN \n\nj = l \n\n(1) \n\nlIn  fact ,  we  assume that for  all  pairs  of sequences  (x, y),  the  sequence  x  is  always  at \nleast  as  long  as  the sequence  y.  If this is  not  the case,  a  straightforward extension  of the \nproposed model is  then necessary. \n\n\f+  (1  - E(i, t))p(xtlqt=i) L  P(qt=ilqt- 1 =j)a(j, s, t  - 1) \n\nN \n\nj=l \n\nwhich  is  very  similar  to  the  corresponding  a  variable  used  in  normal  HMMs2.  It \ncan then be used to compute the joint likelihood of the two  sequences as follows: \n\np(xi, yf) \n\nN \nL  p( qT=i , TT=S, xi, yf) \ni=l \nN \nL a(i,S,T) . 
\ni=l \n\n(2) \n\n2.2  Viterbi Decoding \n\nUsing  the same technique  and replacing  all  the  sums  by  max operators, a  Viterbi \ndecoding algorithm can be derived in order to obtain the most probable path along \nthe sequence of states and alignments between x  and y : \n\nV(i,s ,t) \n\nS) \nmax  P qt=Z, Tt=S, Xl' Y1 \n\nt \n\n. \n\n(\n\nt -\nT1 \n\nl \n\nt -\n, Ql \n\nl \n\n(3) \n\n{ \n\nmax \n\n(E(i, t)p(Xt, Ys Iqt=i) mJx P(qt=ilqt- 1 =j)V(j, s - 1, t  - 1), \n(1  - E(i, t))p(xtlqt=i) maxP(qt=i lqt- 1 =j)V(j, s, t  - 1)) \n\nJ \n\nThe best  path is then obtained after  having computed V(i , S, Ti  for  the  best final \n\nstate i  and backtracking along the best path that could reach it  . \n\n2.3  An EM Training Algorithm \n\nAn  EM  training  algorithm  can  also  be  derived  in  the  same fashion  as  in  classical \nHMMs.  We  here sketch the resulting algorithm, without going into more details4 . \n\nBackward  Step:  Similarly  to  the  forward  step  based  on  the  a  variable  used  to \ncompute the joint likelihood, a backward variable, (3  can also be derived as follows: \n\n(3(i,s, t) \n\n(3(i, s, t) \n\nL  E(j, t + l)p(xt+1' Ys+ 1Iqt+1 =j)P(qt+ 1 =j lqt=i)(3(j, s + 1, t + 1) \nj=l \nN \n\n+  L (l  - E(j, t + 1))P(Xt+ 1Iqt+ 1 =j)P(qt+1 =jlqt=i)(3(j, s, t + 1)  . \n\nN \n\nj=l \n\n(4) \n\n2The full  derivations are not given in this paper but can be found in the appendix of [1). \n3In  the  case  where  one  is  only  interested  in  the  best  state  sequence  (no  matter  the \nalignment),  the  solution  is  then  to  marginalize  over  all  the  alignments  during  decoding \n(essentially  keeping  the  sums  on  the  alignments  and the  max on  the  state  space).  This \nsolut ion  has not yet  been tested. \n\n4See  the appendix of [1)  for  more details. \n\n\fE-Step:  Using  both  the  forward  and  backward  variables,  one  can  compute  the \nposterior probabilities of the hidden variables of the system, namely the posterior \non the state when it emits on both sequences,  the posterior on the state when it \nemits on x  only,  and the posterior on  transitions. \nLet al(i , s, t)  be the part of a(i, s, t)  when  state i  emits on Y at time  t: \n\nE(i, t)p( Xt , ysl qt=i) L P(qt =ilqt- l =j)a(j , s  - 1, t - 1) \n\nN \n\n(5) \n\nj = l \n\nand similarly, let aO(i, s, t)  be the part of a(i, s, t)  when state i  does  not emit on \ny  at time t: \n\n(1  - E(i, t))p(xtlqt =i) L P(qt =ilqt- l =j)a(j , s , t  - 1). \n\nN \n\n(6) \n\nj = l \n\nThen the posterior on state i  when it emits joint observations of sequences x  and \ny  is \n\n.  ITS )   a l (i,s,t)(3(i,s,t) \n\nPqt =Z,Tt =STt- I=S- l ,XI ,YI  = \n\n( \n\n(T  S) \n\nP  Xl , YI \n\n, \n\n(7) \n\nthe posterior on state i  when it emits the next observation of sequence  x  only is \n\n.  ITS )   aO(i , s,t)(3 (i,s,t) \n\nP  qt=Z, Tt=S  Tt - l =S , Xl  , YI  = \n\n( \n\n(T  S ) '  \n\nP  Xl ,YI \n\nand the posterior on the transition between states i  and j  is \n\n(  ) \n8 \n\n(9) \n\nP(qt=ilqt- l =j) \n\nP(x f, yf) \n\n( * a(j\" \n\n- 1, t  - 1 )p(x\"  y., Iq,~i)'( i, t) fi ( i, \"  t) +  ) \n\nL a(j, s , t - 1 )p(Xt Iqt=i) (1  - E( i , t) )(3( i , s, t) \n\ns=O \n\nM-Step:  The Maximization step is  performed exactly as in normal HMMs:  when \n\nthe  distributions  are  modeled  by  exponential  functions  such  as  Gaussian  Mix(cid:173)\nture Models,  then an exact maximization can be performed  using the posteriors. 
\nOtherwise, a  Generalized EM is  performed by gradient ascent, back-propagating \nthe posteriors through the parameters  of the distributions. \n\n3  Related  Models \n\nThe present AHMM  model  is  related to the  Pair HMM model  [4],  which  was  pro(cid:173)\nposed  to  search for  the  best  alignment  between  two  DNA  sequences.  It was  thus \ndesigned and used mainly for  discrete sequences.  Moreover, the architecture of the \nPair HMM model is such that a given state is designed to always emit either one OR \ntwo vectors, while  in the proposed AHMM model, each state can always emit both \none  or  two  vectors,  depending  on  E(i, t),  which  is  learned.  In  fact,  when  E(i, t)  is \ndeterministic and solely depends on i , we  can indeed  recover the Pair HMM model \nby slightly transforming the architecture. \nIt is also very similar to the asynchronous version of Input/ Output HMMs [2],  which \nwas proposed for speech recognition applications.  The main difference here is that in \n\n\fAHMMs both sequences are considered as output, while in Asynchronous IOHMMs \none  of the  sequence  (the  shorter one,  the  output)  is  conditioned  on  the  other one \n(the  input).  The  resulting  Viterbi  decoding  algorithm  is  thus  different  since  in \nAsynchronous IOHMMs one of the sequence, the input, is  known during decoding, \nwhich is  not the case in AHMMs. \n\n4 \n\nImplementation Issues \n\n4.1  Time and  Space  Complexity \n\nThe  proposed  algorithms  (either  training  or  decoding)  have  a  complexity  of \nO(N 2 ST)  where  N  is  the  number  of  states  (and  assuming  the  worst  case  with \nergodic connectivity) , S  is  the length of sequence y  and T  is  the length of sequence \nx .  This  can become quickly  intractable if both x  and yare longer than,  say,  1000 \nframes.  It can however be shortened when  a priori knowledge is  known about possi(cid:173)\nble alignments between x  and \u00a5.  For instance, one can force  the alignment between \nXt  and Ys  to be such that It - 5s1  < k  where  k  is  a  constant representing the maxi(cid:173)\nmum stretching allowed between x  and y, which should not depend on S  nor T.  In \nthat case,  the complexity  (both in  time and space)  becomes  O(N2 Tk),  which  is  k \ntimes the usual HMM  training/ decoding complexity. \n\n4.2  Distributions to Model \n\nIn order to implement this system, we thus need to model the following distributions: \n\n\u2022  P(qt=ilqt- l =j):  the transition distribution, as in  normal HMMs; \n\u2022  p(xtlqt=i):  the  emission  distribution  in  the  case  where  only  x  is  emitted, \n\nas  in normal HMMs; \n\n\u2022  p(Xt , yslqt =i):  the  emission  distribution in the  case  where  both sequences \n\nare emitted.  This distribution could  be implemented in  various forms,  de(cid:173)\npending on the assumptions made on the data: \n\n- x  and y  are independent  given state i: \n\np(Xt, Ys Iqt=i)  = p(Xt Iqt=i)p(ys Iqt=i) \n\n- y  is  conditioned on x : \n\np(Xt , Ys Iqt =i)  =  p(Ys IXt , qt=i)p(xt Iqt =i) \n\n(10) \n\n(11) \n\n-\n\nthe joint probability is  modeled directly,  eventually forcing some com(cid:173)\nmon parameters from p(Xt Iqt=i)  and p(Xt , Ys Iqt=i)  to be shared. \n\nIn  the experiments described later in  the paper, we  have  chosen the latter \nimplementation, with  no sharing except  during initialization; \n\n\u2022  E(i, t)  =  P(Tt=slTt- l =s - 1, qt=i, xi,yf):  the  probability  to  emit  on  se(cid:173)\nquence  y  at time  t  on  state i.  
5 Experiments

Audio-visual speech recognition experiments were performed using the M2VTS database [5], which contains 185 recordings of 37 subjects, each containing acoustic and video signals of the subject pronouncing the French digits from zero to nine. The video consisted of 286x360 pixel color images with a 25 Hz frame rate, while the audio was recorded at 48 kHz using a 16 bit PCM coding. Although the M2VTS database is one of the largest databases of its type, it is still relatively small compared to reference audio databases used in speech recognition. Hence, in order to increase the significance level of the experimental results, a 5-fold cross-validation method was used. Note that all the subjects always pronounced the same sequence of words, but this information was not used during recognition^5.

^5 Nevertheless, it can be argued that transitions between words could have been learned using the training data.

The audio data was down-sampled to 8 kHz and every 10 ms a vector of 16 MFCC coefficients and their first derivatives, as well as the derivative of the log energy, was computed, for a total of 33 features. Each image of the video stream was coded using 12 shape features and 12 intensity features, as described in [3]. The first derivative of each of these features was also computed, for a total of 48 features.

The HMM topology was as follows: we used left-to-right HMMs for each instance of the vocabulary, which consisted of the following 11 words: zero, un, deux, trois, quatre, cinq, six, sept, huit, neuf, silence. Each model had between 3 and 9 states, including non-emitting begin and end states. In each emitting state, there were 3 distributions: p(x_t | q_t), the emission distribution of audio-only data, which consisted of a Gaussian mixture of 10 Gaussians (of dimension 33), p(x_t, y_s | q_t), the joint emission distribution of audio and video data, which also consisted of a Gaussian mixture of 10 Gaussians (of dimension 33+48=81), and ε(i, t), the probability that the system should emit on the video sequence, which was implemented for these preliminary experiments as a simple table.

Training was done using the EM algorithm described in the paper. However, in order to keep the computational time tractable, a constraint was imposed on the alignment between the audio and video streams: we did not consider alignments where audio and video information were farther than 0.5 second from each other.

Comparisons were made between the AHMM (taking into account audio and video) and a normal HMM taking into account either the audio or the video only. We also compared the model with a normal HMM trained on both audio and video streams manually synchronized (each frame of the video stream was repeated in multiple copies in order to reach the same rate as the audio stream; a small sketch of this naive synchronization is given below). Moreover, in order to show the interest of robust multimodal speech recognition, we injected various levels of noise into the audio stream during decoding (training was always done using clean audio). The noise was taken from the Noisex database [9], and was injected in order to reach signal-to-noise ratios of 10 dB, 5 dB and 0 dB.
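For reference, the manually synchronized baseline mentioned above simply repeats each video frame so that the video feature stream reaches the audio frame rate before the two feature vectors are concatenated. A minimal sketch, with made-up function and array names, could look as follows.

import numpy as np

def naive_sync_features(audio_feats, video_feats):
    """Repeat each video frame to match the audio frame rate, then concatenate.

    audio_feats: (T_audio, 33) array of acoustic features (one frame every 10 ms).
    video_feats: (T_video, 48) array of visual features (one frame every 40 ms).
    Returns a (T_audio, 81) array usable by a standard (synchronous) HMM.
    """
    T_audio = len(audio_feats)
    # map every audio frame index to the video frame covering the same instant
    idx = np.minimum((np.arange(T_audio) * len(video_feats)) // T_audio,
                     len(video_feats) - 1)
    return np.concatenate([audio_feats, video_feats[idx]], axis=1)

This corresponds to the "audio+video HMM" baseline reported in Figure 1 and Table 1, where the alignment between the two streams is fixed a priori rather than learned.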
Note that all the hyper-parameters of these systems, such as the number of Gaussians in the mixtures, the number of EM iterations, or the minimum value of the variances of the Gaussians, were not tuned using the M2VTS dataset. They were taken from a model previously trained on a different task, Numbers'95.

Figure 1 and Table 1 present the results. As can be seen, the AHMM yielded better results as soon as the noise level was significant (for clean data, the performance using the audio stream only was almost perfect, hence no enhancement was expected). Moreover, it never deteriorated significantly (using a 95% confidence interval) below the performance of the video stream, no matter the level of noise in the audio stream.

Figure 1: Word Error Rates (in percent, the lower the better) of various systems (audio HMM, audio+video HMM, video HMM, and the proposed audio+video AHMM using both audio and video streams) under various noise conditions during decoding (from 15 to 0 dB additive noise).

Observations   Model   WER (%) and 95% CI
                       15 dB          10 dB          5 dB           0 dB
audio          HMM     2.9 (± 2.4)    11.9 (± 4.7)   38.7 (± 7.1)   79.1 (± 5.9)
audio+video    HMM     21.5 (± 6.0)   28.1 (± 6.5)   35.3 (± 6.9)   45.4 (± 7.2)
audio+video    AHMM    4.8 (± 3.1)    11.4 (± 4.6)   22.3 (± 6.0)   41.1 (± 7.1)

Table 1: Word Error Rates (WER, in percent, the lower the better) and corresponding 95% Confidence Intervals (CI, in parentheses) of various systems under various noise conditions during decoding (from 15 to 0 dB additive noise). The proposed model is the AHMM using both audio and video streams. An HMM using the clean video data only obtains 39.6% WER (± 7.1).

An interesting side effect of the model is to provide an optimal alignment between the audio and the video streams. Figure 2 shows the alignment obtained while decoding sequence cd01 on data corrupted with 10 dB Noisex noise. It shows that the rate between video and audio is far from constant (it would otherwise have followed the stepped line), and hence computing the joint probability using the AHMM appears more informative than using a naive alignment and a normal HMM.

6 Conclusion

In this paper, we have presented a novel asynchronous HMM architecture to handle multiple sequences of data representing the same sequence of events. The model was inspired by two other well-known models, namely Pair HMMs and Asynchronous IOHMMs. An EM training algorithm was derived as well as a Viterbi decoding algorithm, and speech recognition experiments were performed on a multimodal database, yielding significant improvements on noisy audio data.
Various propositions were made to implement the model, but only the simplest ones were tested in this paper. Other solutions should thus be investigated soon. Moreover, other applications of the model should also be investigated, such as multimodal authentication.

Figure 2: Alignment obtained by the model between video and audio streams on sequence cd01 corrupted with a 10 dB Noisex noise. The vertical lines show the obtained segmentation between the words. The stepped line represents a constant alignment.

Acknowledgments

This research has been partially carried out in the framework of the European project LAVA, funded by the Swiss OFES project number 01.0412. The Swiss NCCR project IM2 has also partly funded this research. The author would like to thank Stephane Dupont for providing the extracted visual features and the experimental protocol used in the paper.

References

[1] S. Bengio. An asynchronous hidden markov model for audio-visual speech recognition. Technical Report IDIAP-RR 02-26, IDIAP, 2002.

[2] S. Bengio and Y. Bengio. An EM algorithm for asynchronous input/output hidden markov models. In Proceedings of the International Conference on Neural Information Processing, ICONIP, Hong Kong, 1996.

[3] S. Dupont and J. Luettin. Audio-visual speech modelling for continuous speech recognition. IEEE Transactions on Multimedia, 2:141-151, 2000.

[4] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

[5] S. Pigeon and L. Vandendorpe. The M2VTS multimodal face database (release 1.00). In Proceedings of the First International Conference on Audio- and Video-based Biometric Person Authentication, AVBPA, 1997.

[6] Lawrence R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.

[7] W. H. Sumby and I. Pollack. Visual contributions to speech intelligibility in noise. Journal of the Acoustical Society of America, 26:212-215, 1954.

[8] A. Q. Summerfield. Lipreading and audio-visual speech perception. Philosophical Transactions of the Royal Society of London, Series B, 335:71-78, 1992.

[9] A. Varga, H. J. M. Steeneken, M. Tomlinson, and D. Jones. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical report, DRA Speech Research Unit, 1992.
", "award": [], "sourceid": 2301, "authors": [{"given_name": "Samy", "family_name": "Bengio", "institution": null}]}