{"title": "Speaker Independent Speech Recognition with Neural Networks and Speech Knowledge", "book": "Advances in Neural Information Processing Systems", "page_first": 218, "page_last": 225, "abstract": null, "full_text": "218 \n\nBengio, De Mori and Cardin \n\nSpeaker Independent Speech Recognition with \n\nNeural Networks and Speech Knowledge \n\nY oshua Bengio \nDept Computer Science \nMcGill University \nMontreal, Canada H3A2A 7 \n\nDept Computer Science \n\nMcGill University \n\nRenato De Mori \n\nRegis Cardin \nDept Computer Science \nMcGill  University \n\nABSTRACT \n\nWe  attempt to  combine neural  networks with  knowledge  from \nspeech  science to build  a speaker independent speech recogni(cid:173)\ntion  system.  This  knowledge  is  utilized  in  designing  the \npreprocessing,  input coding, output coding,  output supervision \nand  architectural  constraints.  To  handle  the  temporal  aspect \nof speech we  combine  delays,  copies  of activations  of hidden \nand  output  units  at  the  input  level,  and  Back-Propagation  for \nSequences  (BPS),  a  learning algorithm for networks with  local \nself-loops.  This  strategy  is  demonstrated  in  several  experi(cid:173)\nments,  in  particular  a  nasal  discrimination  task  for  which  the \napplication  of  a  speech  theory  hypothesis  dramatically  im(cid:173)\nproved generalization. \n\n1  INTRODUCTION \nThe  strategy  put  forward  in  this  research  effort  is  to  combine  the  flexibility \nand learning abilities  of neural networks with as much knowledge from  speech \nscience  as  possible in  order to  build  a  speaker independent automatic  speech \nrecognition  system.  This  knowledge  is  utilized in each of the steps in the con(cid:173)\nstruction  of  an  automated  speech  recognition  system:  preprocessing,  input \ncoding,  output  coding,  output  supervision,  architectural  design.  
In particular, for preprocessing we explored the advantages of various possible ways of processing the speech signal, such as comparing an ear model vs. the Fast Fourier Transform (FFT), or compressing the frame sequence in such a way as to conserve an approximately constant rate of change. To handle the temporal aspect of speech we propose to combine various algorithms depending on the demands of the task, including an algorithm for a type of recurrent network which includes only self-loops and is local in space and time (BPS). This strategy is demonstrated in several experiments, in particular a nasal discrimination task for which the application of a speech theory hypothesis drastically improved generalization. \n\n2 Application of Speech Knowledge \n\n2.1 Preprocessing \n\nOur previous work has shown us that the choice of preprocessing significantly influences the performance of a neural network recognizer (e.g., Bengio & De Mori, 1988). Different types of preprocessing processes and acoustic features can be utilized at the input of a neural network. We used several acoustic features (such as counts of zero crossings), filters derived from the FFT, energy levels (of both the signal and its derivative) and ratios (Gori, Bengio & De Mori, 1989), as well as an ear model and synchrony detector. \n\nEar model vs. FFT \n\nWe performed experiments in speaker-independent recognition of 10 English vowels in isolated words that compared the use of an ear model with an FFT as preprocessing. The FFT was done using a mel scale and the same number of filters (40) as for the ear model. The ear model was derived from the one proposed by Seneff (1985). 
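A mel-scaled FFT front end of the kind mentioned above can be sketched roughly as follows. This is a minimal illustration, not the exact filterbank used in the paper; the sample rate, FFT size, and mel formula are assumptions.

```python
import numpy as np

def mel(f):
    # Hz -> mel (O'Shaughnessy formula)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sample_rate=16000):
    """Triangular filters equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    # filter edge frequencies: n_filters + 2 points on the mel scale
    edges = mel_inv(np.linspace(mel(0.0), mel(sample_rate / 2.0), n_filters + 2))
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (fft_freqs - lo) / (center - lo)      # rising slope
        down = (hi - fft_freqs) / (hi - center)    # falling slope
        bank[i] = np.maximum(0.0, np.minimum(up, down))
    return bank

# apply to one power-spectrum frame to obtain 40 spectral parameters
bank = mel_filterbank()
spectrum = np.abs(np.fft.rfft(np.random.randn(512))) ** 2
params = bank @ spectrum   # shape (40,)
```

Each output coefficient is a weighted sum of FFT bins, with filters narrower at low frequencies and wider at high frequencies, mimicking the frequency resolution of the ear.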
Recognition was performed with a neural network with one hidden layer of 20 units. We obtained 87% recognition with the FFT preprocessing vs. 96% recognition with the ear model (plus a synchrony detector to extract spectral regularity from the instantaneous output of the ear model) (Bengio, Cosi, De Mori, 1989). This was an example of the successful application of knowledge about human audition to the automatic recognition of speech with machines. \n\nCompression in time resulting in constant rate of change \n\nThe motivation for this processing step is the following. The rate of change of the speech signal (as well as of the output of networks performing acoustic-phonetic mappings) varies a lot. It would be nice to have more temporal precision in parts of the signal where there is a lot of variation (bursts, fast transitions) and less temporal precision in more stable parts of the signal (e.g., vowels, silence). \n\nGiven a sequence of vectors (parameters, which can be acoustic parameters, such as spectral coefficients, as well as outputs from neural networks) we transform it by compressing it in time in order to obtain a shorter sequence where frames refer to segments of varying length of the original sequence. \n\nVery simple algorithm that maps sequence X(t) -> sequence Y(s), where X and Y are vectors: \n{ Accumulate and average X(t), X(t+1) ... X(t+n) in Y(s) as long as the sum Distance(X(t),X(t+1)) + ... + Distance(X(t+n-1),X(t+n)) is less than a threshold.
\nWhen this threshold is reached, emit the average as Y(s); t <- t+n+1; s <- s+1; } \n\nThe advantages of this system are the following: 1) more temporal precision where needed, 2) reduction of the dimensionality of the problem, 3) constant rate of change of the resulting signal, so that when using input windows in a neural net the windows may have fewer frames, 4) better generalization, since several realizations of the same word spoken at different rates of speech tend to be reduced to more similar sequences. \n\nInitial results when this system is used to compress spectral parameters (24 mel-scaled FFT filters + energy) computed every 5 ms were interesting. The task was the classification of phonemes into 14 classes. The size of the database was reduced by 30%. The size of the window was reduced (4 frames instead of 8), hence the network size was reduced as well. Half the size of the window was necessary in order to obtain similar performance on the training set. Generalization on the test set was slightly better (from 38% to 33% classification error by frame). The idea of using a measure of rate of change to process speech is not new (Atal, 1983) but we believe that it might be particularly useful when the recognition device is a neural network with an input of several frames of acoustic parameters. \n\n2.2 Input coding \n\nOur previous work has shown us that information should be as easily accessible as possible to the network. For example, compression of the spectral information into cepstrum coefficients (with the first few coefficients having very large variance) resulted in poorer performance with respect to experiments done with the spectrum itself. 
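The frame-compression procedure of section 2.1 can be sketched as follows. This is a minimal illustration; the distance function and threshold value are assumptions, not the paper's exact choices.

```python
import numpy as np

def compress(X, threshold, distance=lambda a, b: np.linalg.norm(a - b)):
    """Compress a sequence of frames X[t] so that the output changes at a
    roughly constant rate: consecutive frames are accumulated and averaged
    as long as the summed frame-to-frame distance stays below `threshold`."""
    Y = []
    t = 0
    while t < len(X):
        acc, count, dist = X[t].copy(), 1, 0.0
        n = t
        while n + 1 < len(X):
            d = distance(X[n], X[n + 1])
            if dist + d >= threshold:
                break
            dist += d
            n += 1
            acc += X[n]
            count += 1
        Y.append(acc / count)   # average of the accumulated segment -> Y(s)
        t = n + 1
    return np.array(Y)

# stable frames collapse into one output frame; fast transitions stay distinct
X = np.array([[0.0], [0.0], [0.0], [5.0], [10.0]])
Y = compress(X, threshold=1.0)
```

Stable stretches (small inter-frame distances) are merged into a single averaged frame, while bursts and fast transitions keep nearly full temporal resolution, which is exactly the constant-rate-of-change property motivated above.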
The recognition was performed with a neural network whose units compute the sigmoid of the weighted sum of their inputs. The task was the broad classification of phonemes into 4 classes. The error on the test set increased from 15% to 20% when using cepstral rather than spectral coefficients. \n\nAnother example concerns recognition experiments for which there is a lot of variance in the quantities presented in the input. A grid representation with coarse coding improved learning time as well as generalization (since the problem became more separable and thus the network needed fewer hidden units) (Bengio, De Mori, 1988). \n\n2.3 Output coding \n\nWe have chosen an output coding scheme based on phonetic features defined by the way speech is produced. This is generally more difficult to learn but results in better generalization, especially with respect to new sounds that had not been seen by the network during training. We have demonstrated this with experiments on vowel recognition in which the networks were trained to recognize the place and the manner of articulation (Bengio, Cosi, De Mori, 1989). In addition, the resulting representation is more compact than when using one output for each phoneme. However, this representation remains meaningful, i.e., each output can be attributed a meaning almost independently of the values of the other outputs. \n\nIn general, an explicit representation is preferred to an arbitrary and compact one (such as a compact binary coding of the classes). Otherwise, the network must perform an additional step of encoding. 
This can be costly in terms of the size of the networks, and generally also in terms of generalization (given the need for a larger number of weights). \n\n2.4 Output supervision \n\nWhen using a network with some recurrences it is not necessary that supervision be provided at every frame for every output (particularly for transition periods, which are difficult to label). Instead the supervision should be provided to the network when the speech signal clearly corresponds to the categories one is trying to learn. We have used this approach when performing the discrimination between /b/ and /d/ with the BPS (Back-Propagation for Sequences) algorithm (self-loops only, cf. section 3.2). \n\nGiving additional information to the network through more supervision (with extra output units) improved learning time and generalization (cf. section 4). \n\n2.5 Architectural design \n\nHypotheses about the nature of the processing to be performed by the network, based on speech science knowledge, make it possible to put constraints on the architecture. These constraints result in a network that generalizes better than a fully connected network. This strategy is most useful when the speech recognition task has been modularized in the appropriate way, so that the same architectural constraints do not have to apply to all of the subtasks. Here are several examples of the application of modularization. We initially explored modularization by acoustic context (different networks are triggered when various acoustic contexts are detected) (Bengio, Cardin, De Mori, Merlo, 1989). We also implemented modularization by independent articulatory features (vertical and horizontal place of articulation) (Bengio, Cosi, De Mori, 1989). 
Another type of modularization, by subsets of phonemes, was explored by several researchers, in particular Alex Waibel (Waibel, 1988). \n\n3 Temporal aspect of the speech recognition task \n\nBoth of the algorithms presented in the following subsections assume that one is using the Least Mean Square error criterion, but both can easily be modified for any type of error criterion. We used and sometimes combined the following techniques: \n\n3.1 Delays \n\nIf the speech signal is preprocessed in such a way as to obtain a frame of acoustic parameters for every interval of time, one can use delays from the input units representing these acoustic parameters to implement an input window on the input sequence, as in NETtalk, or use this strategy at every level, as in TDNNs (Waibel, 1988). Even when we use a recurrent network, a small number of delays on the outgoing links of the input units might be useful. It enables the network to make a direct comparison between successive frames. \n\n3.2 BPS (Back-Propagation for Sequences) \n\nThis is a learning algorithm that we have introduced for networks that have a certain constrained type of recurrence (local self-loops). It makes it possible to compute the gradient of the error with respect to all weights. This algorithm has the same order of space and time requirements as back-propagation for feedforward networks. Experiments with the /b/ vs. /d/ speaker independent discrimination yielded 3.45% error on the test set for the BPS network as opposed to 6.9% error for a feedforward network (Gori, Bengio, De Mori, 1989). \n\nBPS equations, feedforward pass: \n\nDynamic units: these have a local self-loop and their input must come directly from the input layer. 
\nXi(t+1) = Wii Xi(t) + Σj Wij f(Xj(t)) \n∂Xi(t+1)/∂Wij = Wii ∂Xi(t)/∂Wij + f(Xj(t))   for i != j \n∂Xi(t+1)/∂Wii = Wii ∂Xi(t)/∂Wii + Xi(t)      for i == j \n\nStatic units, i.e., without feedback, follow the usual Back-Propagation (BP) equations (Rumelhart et al., 1986): \nXi(t+1) = Σj Wij f(Xj(t)) \n∂Xi(t+1)/∂Wij = f(Xj(t)) \n\nBack-propagation pass, after every frame: as usual, but using the above definition of ∂Xi(t)/∂Wii instead of the usual f(Xj(t)). \n\nThis algorithm has time complexity O(L · Nw) (as static BP) and needs space O(Nu), where L is the length of a sequence, Nw is the number of weights and Nu is the number of units. Note that it is local in time (it is causal, with no back-propagation through time) and in space (only information coming from direct neighbors is needed). \n\n3.3 Discrete Recurrent Net without Constraints \n\nThis is how we compute the gradient in an unconstrained discrete recurrent net. The derivation is similar to that of Pearlmutter (1989). It is another way to view the computation of the gradient for recurrent networks, called time unfolding, which was presented by Rumelhart et al. (1986). Here the units have a memory of their past activations during the forward pass (from frame 1 to L) and a \"memory\" of the future ∂E/∂Xi during the backward pass (from frame L down to frame 1). \n\nForward phase: consider the possibility of an arbitrary number of connections from unit i to unit j, each having a different delay d. \nXi(t) = Σj,d Wijd f(Xj(t-d)) + I(i,t) \nHere, the basic idea is to compute ∂E/∂Wijd by computing ∂E/∂Xi(t): \n∂E/∂Wijd = Σt ∂E/∂Xi(t) ∂Xi(t)/∂Wijd \nwhere ∂Xi(t)/∂Wijd = f(Xj(t-d)) as usual. 
In the backward phase we back-propagate ∂E/∂Xi(t) recursively from the last time frame t = L down to frame 1: \n∂E/∂Xi(t) = Σk,d Wkid ∂E/∂Xk(t+d) f'(Xi(t)) + (if i is an output unit) (f(Xi(t)) - Yi*(t)) f'(Xi(t)) \nwhere Yi*(t) is the target output for unit i at time t. In this equation the first term represents back-propagation from future times and downstream units, while the second one comes from direct external supervision. This algorithm works for any connectivity of the recurrent network with delays. Its time complexity is O(L · Nw) (as static BP). However, the space requirements are O(L · Nu). The algorithm is local in space but not in time; however, we found that restriction not to be very important in speech recognition, where we consider at most a few hundred frames of left context (one sentence). \n\n4 Nasal experiment \n\nAs an example of the application of the strategy described above, we performed the following experiment on the discrimination of the nasals /m/ and /n/ in a fixed context. The speech material consisted of 294 tokens from 70 training speakers (male and female with various accents) and 38 tokens from 10 test speakers. The speech signal is preprocessed with an ear model followed by a generalized synchrony detector, yielding 40 spectral parameters every 10 ms. Early experiments with a simple output coding {vowel, m, n}, a window of two consecutive frames as input, and a two-layer fully connected architecture with 10 hidden units gave poor results: 15% error on the test set. 
A speech theory hypothesis claiming that the most critical discriminatory information for the nasals is available during the transition between the vowel and the nasal inspired us to try the following output coding: {vowel, transition to m, transition to n, nasal}. Since the transition was more important, we chose as input a window of 4 frames at times t, t-10ms, t-30ms and t-70ms. To reduce the connectivity, the architecture included a constrained first hidden layer of 40 units where each unit was meant to correspond to one of the 40 spectral frequencies of the preprocessing stage. Each such hidden unit associated with filter bank F was connected (when possible) to input units corresponding to frequency banks (F-2, F-1, F, F+1, F+2) and times (t, t-10ms, t-30ms, t-70ms). \n\nExperiments with this feedforward delay network (160 inputs, 40 hidden, 10 hidden, 4 outputs) showed that, indeed, the strongest clues about the identity of the nasal seemed to be available during the transition and for a very short time, just before the steady part of the nasal started. In order to extract that critical information from the stream of outputs of this network, a second network was trained on the outputs of the first one to clearly provide the discrimination of the nasal during the whole of the nasal. That higher level network used the BPS algorithm to learn about the temporal nature of the task and keep the detected critical information during the length of the nasal. Recognition performance reached a plateau of 1.14% errors on the training set. Generalization was very good, with only 2.63% error on the test set. 
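The BPS recurrences of section 3.2, used by the higher level network above, can be sketched as follows. This is a minimal illustration under stated assumptions: sigmoid units, LMS error, and online weight updates after every supervised frame; the layer sizes, learning rate, and initialization are arbitrary and not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class BPSLayer:
    """One layer of 'dynamic' units with local self-loops. Gradient traces
    dXi/dW are updated forward in time (the BPS recurrences), so no
    back-propagation through time is needed: local in time and in space."""

    def __init__(self, n_in, n_out, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (n_out, n_in))  # input weights Wij
        self.w_self = np.full(n_out, 0.5)             # self-loop weights Wii
        self.dx_dW = np.zeros((n_out, n_in))          # traces dXi(t)/dWij
        self.dx_dself = np.zeros(n_out)               # traces dXi(t)/dWii
        self.x = np.zeros(n_out)                      # activations Xi(t)
        self.lr = lr

    def step(self, inp, target=None):
        fin = sigmoid(inp)                            # f(Xj(t)) of input units
        x_new = self.w_self * self.x + self.W @ fin   # Xi(t+1)
        # forward update of the gradient traces (BPS recurrences)
        self.dx_dW = self.w_self[:, None] * self.dx_dW + fin[None, :]
        self.dx_dself = self.w_self * self.dx_dself + self.x
        self.x = x_new
        if target is not None:                        # supervision on this frame
            out = sigmoid(self.x)
            err = (out - target) * out * (1 - out)    # dE/dXi for LMS + sigmoid
            self.W -= self.lr * err[:, None] * self.dx_dW
            self.w_self -= self.lr * err * self.dx_dself
        return sigmoid(self.x)

layer = BPSLayer(n_in=4, n_out=2)
out = None
for t in range(5):
    out = layer.step(np.ones(4), target=np.array([1.0, 0.0]))
```

Because the traces are carried forward frame by frame, storage is O(Nu) regardless of sequence length, and supervision can be applied only on the frames where the target category is clearly present, as in section 2.4.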
\n\n5 Future experiments \n\nOne of the advantages of using phonetic features instead of phonemes to describe speech is that they could help to learn more robustly about the influence of context. If one uses a phonemic representation and tries to characterize the influence of the past phoneme on the current phoneme, one faces the problem of poor statistical sampling of many of the corresponding diphones (in a realistic database). On the other hand, if speech is characterized by several independent dimensions, such as horizontal and vertical place of articulation and voicing, then the number of possible contexts to consider for each value of one of the dimensions is much more limited. Hence the set of examples characterizing those contexts is much richer. \n\nWe now present some observations on continuous speech based on our initial work with the TIMIT database, in which we try learning articulatory features. Although we have obtained good results for the recognition of articulatory features (horizontal and vertical place of articulation) for isolated words, initial results with continuous speech are less encouraging. Indeed, whereas the measured place of articulation (by the networks) for phonemes in isolated speech corresponds well to expectations (as defined by acousticians who physically measured these features for isolated short words), this is not the case for continuous speech. In the latter case, phonemes have a much shorter duration, so that the articulatory features are most of the time in transition, and the place of articulation generally does not reach the expected target values (although it always moves in the right direction). 
This is probably due to the inertia of the production system and to coarticulation effects. In order to attack that problem we intend to perform the following experiments. We could use the subset of the database for which the phoneme duration is sufficiently long to learn an approximation of the articulatory features. We could then improve that approximation in order to be able to learn about the trajectories of these features found in the transitions from one phoneme to the next. This could be done by using a two-stage network (similar to the encoder network) with a bottleneck in the middle. The first stage of the network produces phonetic features and receives supervision only on the steady parts of the speech. The second stage of the network (which would be a recurrent network) has as input the trajectory of the approximation of the phonetic features and produces as output the previous, current and next phoneme. As an additional constraint, we propose to use self-loops with various time constants on the units of the bottleneck. Units that represent fast varying descriptors of speech will have a short time constant, units that we want to represent information about the past acoustic context will have a slightly longer time constant, and units that could represent very long time range information - such as information about the speaker or the recording conditions - will receive a very long time constant. \n\nThis paper has proposed a general strategy for setting up a speaker independent speech recognition system with neural networks using as much speech knowledge as possible. 
We explored several aspects of this problem, including preprocessing, input coding, output coding, output supervision, architectural design, and algorithms for recurrent networks, and have described several initial experimental results to support these ideas. \n\nReferences \n\nAtal B.S. (1983), Efficient coding of LPC parameters by temporal decomposition, Proc. ICASSP 83, Boston, pp. 81-84. \nBengio Y., Cardin R., De Mori R., Merlo E. (1989), Programmable execution of multi-layered networks for automatic speech recognition, Communications of the Association for Computing Machinery, 32 (2). \nBengio Y., Cardin R., De Mori R. (1990), Speaker independent speech recognition with neural networks and speech knowledge, in D.S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, San Mateo, CA: Morgan Kaufmann. \nBengio Y., De Mori R. (1988), Speaker normalization and automatic speech recognition using spectral lines and neural networks, Proc. Canadian Conference on Artificial Intelligence (CSCSI-88), Edmonton, Alta., May 1988. \nBengio Y., Cosi P., De Mori R. (1989), On the generalization capability of multi-layered networks in the extraction of speech properties, Proc. International Joint Conference on Artificial Intelligence (IJCAI-89), Detroit, August 1989, pp. 1531-1536. \nGori M., Bengio Y., De Mori R. (1989), BPS: a learning algorithm for capturing the dynamic nature of speech, Proc. IEEE International Joint Conference on Neural Networks, Washington, June 1989. \nPearlmutter B.A. (1989), Learning state space trajectories in recurrent neural networks, Neural Computation, vol. 1, no. 2, pp. 263-269. 
\nRumelhart D.E., Hinton G., Williams R.J. (1986), Learning internal representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, MIT Press. \nSeneff S. (1985), Pitch and spectral analysis of speech based on an auditory synchrony model, RLE Technical Report 504, MIT. \nWaibel A. (1988), Modularity in neural networks for speech recognition, Advances in Neural Information Processing Systems 1, San Mateo, CA: Morgan Kaufmann.", "award": [], "sourceid": 273, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Renato", "family_name": "de Mori", "institution": null}, {"given_name": "R\u00e9gis", "family_name": "Cardin", "institution": null}]}