{"title": "Time-Warping Network: A Hybrid Framework for Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 151, "page_last": 158, "abstract": null, "full_text": "Time-Warping Network: \n\nA Hybrid Framework for Speech Recognition \n\nEsther Levin \nRoberto Pieraccini \nEnrico Bocchieri \n\nAT&T Bell Laboratories \nSpeech Research Department \nMurray Hill, NJ 07974 USA \n\nABSTRACT \n\nRecently, much interest has been generated regarding speech recognition systems based on Hidden Markov Models (HMMs) and neural network (NN) hybrids. Such systems attempt to combine the best features of both models: the temporal structure of HMMs and the discriminative power of neural networks. In this work we define a time-warping (TW) neuron that extends the operation of the formal neuron of a back-propagation network by warping the input pattern to match it optimally to its weights. We show that a single-layer network of TW neurons is equivalent to a Gaussian density HMM-based recognition system, and we propose to improve the discriminative power of this system by using back-propagation discriminative training, and/or by generalizing the structure of the recognizer to a multi-layered net. The performance of the proposed network was evaluated on a highly confusable, isolated word, multi-speaker recognition task. The results indicate that not only does the recognition performance improve, but the separation between classes is enhanced also, allowing us to set up a rejection criterion to improve the confidence of the system. \n\nI. INTRODUCTION \nSince their first application in speech recognition systems in the late seventies, hidden Markov models have been established as a most useful tool, 
mainly due to their ability to handle the sequential dynamical nature of the speech signal. With the revival of connectionism in the mid-eighties, considerable interest arose in applying artificial neural networks for speech recognition. This interest was based on the discriminative power of NNs and their ability to deal with non-explicit knowledge. These two paradigms, namely HMM and NN, inspired by different philosophies, were seen at first as different and competing tools. Recently, links have been established between these two paradigms, aiming at a hybrid framework in which the advantages of the two models can be combined. For example, Bourlard and Wellekens [1] showed that neural networks with proper architecture can be regarded as non-parametric models for computing \"discriminant probabilities\" related to HMM. Bridle [2] introduced \"Alpha-nets\", a recurrent neural architecture that implements the alpha computation of HMM, and found connections between back-propagation [3] training and discriminative HMM parameter estimation. Predictive neural nets were shown to have a statistical interpretation [4], generalizing the conventional hidden Markov model by assuming that the speech signal is generated by nonlinear dynamics contaminated by noise. \nIn this work we establish one more link between the two paradigms by introducing the time-warping network (TWN) that is a generalization of both an HMM-based recognizer and a back-propagation net. The basic element of such a network, a time-warping neuron, generalizes the function of a formal neuron by warping the input signal in order to maximize its activation. In the special case of network parameter 
values, a single-layered network of time-warping (TW) neurons is equivalent to a recognizer based on Gaussian HMMs. This equivalence of the HMM-based recognizer and single-layer TWN suggests ways of using discriminative neural tools to enhance the performance of the recognizer. For instance, a training algorithm, like back-propagation, that minimizes a quantity related to the recognition performance, can be used to train the recognizer instead of the standard non-discriminative maximum likelihood training. Then, the architecture of the recognizer can be expanded to contain more than one layer of units, enabling the network to form discriminant feature detectors in the hidden layers. \nThis paper is organized as follows: in the first part of Section 2 we describe a simple HMM-based recognizer. Then we define the time-warping neuron and show that a single-layer network built with such neurons is equivalent to the HMM recognizer. In Section 3 two methods are proposed to improve the discriminative power of the recognizer, namely, adopting neural training algorithms and extending the structure of the recognizer to a multi-layer net. For special cases of such multi-layer architecture, such a net can implement a conventional or weighted [5] HMM recognizer. Results of experiments using a TW network for recognition of the English E-set are presented in Section 4. The results indicate that not only does the recognition performance improve, but the separation between classes is enhanced also, allowing us to set up a rejection criterion to improve the confidence of the system. A summary and discussion of this work are included in Section 5. \n\nII. 
THE MODEL \nIn this section we first describe the basic HMM-based speech recognition system that is used in many applications, including isolated and connected word recognition [6] and large vocabulary subword-based recognition [7]. Though in this paper we treat the case of isolated word recognition, generalization to connected speech can be made as in [6,7]. In the second part of this section we define a single-layered time-warping network and show that it is equivalent to the HMM-based recognizer when certain conditions constraining the network parameter values apply. \nII.1 THE HIDDEN MARKOV MODEL-BASED RECOGNITION SYSTEM \nAn HMM-based recognition system consists of K N-state HMMs, where K is the vocabulary size (number of words or subword units in the defined task). The k-th HMM, $\\Theta^k$, is associated with the k-th word in the vocabulary and is characterized by a matrix $A^k = \\{a^k_{i,j}\\}$ of transition probabilities between states, \n\n$a^k_{i,j} = \\Pr(s_t = j \\mid s_{t-1} = i), \\quad 0 \\le i \\le N, \\quad 1 \\le j \\le N,$ (1) \n\nwhere $s_t$ denotes the active state at time t ($s_0 = 0$ is a dummy initial state), and by a set of emission probabilities (one per state): \n\n$\\Pr(X_t \\mid s_t = i) = \\frac{1}{(2\\pi)^{d/2} \\, \\|\\Sigma^k_i\\|^{1/2}} \\exp\\left[ -\\frac{1}{2} (X_t - \\mu^k_i)^* (\\Sigma^k_i)^{-1} (X_t - \\mu^k_i) \\right], \\quad i = 1, \\ldots, N,$ (2) \n\nwhere $X_t$ is the d-dimensional observation vector describing some parametric representation of the t-th frame of the spoken token, and $(\\cdot)^*$ denotes the transpose operation. \nFor the case discussed here, we concentrate on strictly left-to-right HMMs, where $a^k_{i,j} \\ne 0$ only if j = i or j = i + 1, and a simplified case of (2) where all $\\Sigma^k_i = I_d$, the d-dimensional unit matrix. 
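\nAs a short worked step, assuming the standard normalization of a d-dimensional Gaussian in (2), setting the covariance of state i of model k to the unit matrix and taking the logarithm gives \n\n$\\log \\Pr(X_t \\mid s_t = i) = -\\frac{1}{2} \\|X_t - \\mu^k_i\\|^2 - \\frac{d}{2} \\log 2\\pi = X_t^* \\mu^k_i - \\frac{1}{2} \\|\\mu^k_i\\|^2 - \\frac{1}{2} \\|X_t\\|^2 - \\frac{d}{2} \\log 2\\pi.$ \n\nOnly the first two terms on the right depend on the state i and the model k. The last two are identical for every model and every state sequence, so they shift all word scores by the same amount and cannot change which model scores highest. 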
\nThe system recognizes a speech token of duration T, $X = \\{X_1, X_2, \\ldots, X_T\\}$, by classifying the token into the class $k_0$ with the highest likelihood $L^{k_0}(X)$, \n\n$k_0 = \\arg\\max_{1 \\le k \\le K} L^k(X).$ (3) \n\nThe likelihood $L^k(X)$ is computed for the k-th HMM as \n\n$L^k(X) = \\max_{\\{i_1, \\ldots, i_T\\}} \\log[\\Pr(X \\mid \\Theta^k, s_1 = i_1, \\ldots, s_T = i_T)] = \\max_{\\{i_1, \\ldots, i_T\\}} \\sum_{t=1}^{T} \\left( -\\frac{1}{2} \\|X_t - \\mu^k_{i_t}\\|^2 + \\log a^k_{i_{t-1}, i_t} - \\frac{d}{2} \\log 2\\pi \\right).$ (4) \n\nThe state sequence that maximizes (4) is found by using the Viterbi [8] algorithm. \nII.2 THE EQUIVALENT SINGLE-LAYER TIME-WARPING NETWORK \nA single-layer TW network is composed of K TW neurons, one for each word in the vocabulary. The TW neuron is an extension of a formal neuron that can handle dynamic and temporally distorted patterns. The k-th TW neuron, associated with the k-th vocabulary word, is characterized by a bias $w^k$ and a set of weights $W^k = \\{W^k_1, W^k_2, \\ldots, W^k_N\\}$, where $W^k_j$ is a column vector of dimensionality d+2. Given an input speech token of duration T, $X = \\{X_1, X_2, \\ldots, X_T\\}$, the output activation $y^k$ of the k-th unit is computed as \n\n$y^k = g\\left( \\sum_{t=1}^{T} \\tilde{X}_t^* W^k_{i_t} + w^k \\right) = g\\left( \\sum_{j=1}^{N} \\left( \\sum_{t:\\, i_t = j} \\tilde{X}_t \\right)^* W^k_j + w^k \\right),$ (5) \n\nwhere $g(\\cdot)$ is a sigmoidal, smooth, strictly increasing nonlinearity, and $\\tilde{X}_t^* = [X_t^*, 1, 1]$ is a (d+2)-dimensional augmented input vector. The corresponding indices $i_t$, t = 1, ..., T, are determined by the following condition: \n\n$\\{i_1, \\ldots, i_T\\} = \\arg\\max \\sum_{t=1}^{T} \\tilde{X}_t^* W^k_{i_t} + w^k.$ (6) \n\nIn other words, a TW neuron warps the input pattern to match it optimally to its weights (6) and computes its output using this warped version of the input (5). 
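\nOne way to carry out the maximization in (6) is sketched here under an assumption that is not stated explicitly above: the indices are restricted to strictly left-to-right warpings with $i_1 = 1$, $i_T = N$ and $i_t - i_{t-1} \\in \\{0, 1\\}$. Under that assumption a dynamic-programming recursion suffices, \n\n$\\delta_1(1) = \\tilde{X}_1^* W^k_1, \\qquad \\delta_t(j) = \\tilde{X}_t^* W^k_j + \\max\\{\\delta_{t-1}(j), \\delta_{t-1}(j-1)\\}, \\quad 2 \\le t \\le T,$ \n\nwhere $\\delta_t(j)$ is a helper symbol introduced only for this sketch, denoting the best partial sum over warpings that end with $i_t = j$, and unreachable entries are taken as $-\\infty$. The activation in (5) is then $y^k = g(\\delta_T(N) + w^k)$, and backtracking the maximizing choices recovers $i_1, \\ldots, i_T$. This needs on the order of NT inner products rather than an enumeration of all warpings. \n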
The time-warping process of (6) is a distinguishing feature of this neural model, enabling it to deal with the dynamic nature of a speech signal and to handle temporal distortions. \nAll TW neurons in this single-layer net recognizer receive the same input speech token X. Recognition is performed by selecting the word class corresponding to the neuron with the maximal output activation. \nIt is easy to show that when \n\n$(W^k_j)^* = [(\\mu^k_j)^*, \\; -\\frac{1}{2} \\|\\mu^k_j\\|^2, \\; \\log a^k_{j,j}],$ (7a) \n\nand \n\n$w^k = \\sum_{j=1}^{N} \\left( \\log a^k_{j-1,j} - \\log a^k_{j,j} \\right),$ (7b) \n\nthis network is equivalent to an HMM-based recognition system, with K N-state HMMs, as described above (footnote 1). \nThis equivalent neural representation of an HMM-based system suggests ways of improving the discriminative power of the recognizer, while preserving the temporal structure of the HMM, thus allowing generalization to more complicated tasks (e.g., continuous speech, subword units, etc.). \n\nIII. IMPROVING DISCRIMINATION \nThere are two important differences between the HMM-based system and a neural net approach to speech recognition that contribute to the improved discrimination power of the latter, namely, training and structure. \nIII.1 DISCRIMINATIVE TRAINING \nThe HMM parameters are usually estimated by applying the maximum likelihood approach, using only the examples of the word represented by the model and disregarding the rival classes completely. This is a non-discriminative approach: the learning criterion is not directly connected to the improvement of recognition accuracy. Here we propose to enhance the discriminative power of the system by adopting a neural training approach. \nNN training algorithms are based on minimizing an error function E, 
which is related to the performance of the network on the training set of labeled examples, $\\{X^l, Z^l\\}$, l = 1, ..., L, where $Z^l = [z^l_1, \\ldots, z^l_K]^*$ denotes the vector of target neural outputs for the l-th input token. $Z^l$ has +1 only in the entry corresponding to the right word class, and -1 elsewhere. Then, \n\n$E = \\sum_{l=1}^{L} E^l(Z^l, Y^l),$ (8) \n\nwhere $Y^l = [y^l_1, \\ldots, y^l_K]^*$ is a vector of neural output activations for the l-th input token, and $E^l(Z^l, Y^l)$ measures the distortion between the two vectors. One choice for $E^l(Z^l, Y^l)$ is a quadratic error measure, i.e., $E^l(Z^l, Y^l) = \\|Z^l - Y^l\\|^2$. Other choices include the cross-entropy error [9] and the recently proposed discriminative error functions, which measure the misclassification rate more directly [10]. \n\nFootnote 1: With minor changes we can show equivalence to a general Gaussian HMM, where the covariance matrices are not restricted to be the unit matrix. \n\nThe gradient-based training algorithms (such as back-propagation) modify the parameters of the network after presentation of each training token to minimize the error (8). The change in the j-th weight subvector of the k-th model after presentation of the l-th training token, $\\Delta^l \\bar{W}^k_j$, is proportional to the negative of the derivative of the error $E^l$ with respect to this weight subvector, \n\n$\\Delta^l \\bar{W}^k_j = -\\alpha \\frac{\\partial E^l}{\\partial \\bar{W}^k_j} = -\\alpha \\sum_{m=1}^{K} \\frac{\\partial E^l}{\\partial y^l_m} \\frac{\\partial y^l_m}{\\partial \\bar{W}^k_j}, \\quad 1 \\le j \\le N, \\quad 1 \\le k \\le K,$ (9) \n\nwhere $\\alpha > 0$ is a step-size, resulting in an updated weight vector $(W^k_j)^* = [(\\bar{W}^k_j + \\Delta^l \\bar{W}^k_j)^*, \\; -\\frac{1}{2} \\|\\bar{W}^k_j + \\Delta^l \\bar{W}^k_j\\|^2, \\; \\log a^k_{j,j}]$, with $\\bar{W}^k_j$ denoting the first d components of $W^k_j$. To compute the terms $\\partial y^l_m / \\partial \\bar{W}^k_j$ we have to consider (5) and (6) that define the operation of the neuron. 
Equation (6) expresses the dependence of the warping indices $i_1, \\ldots, i_T$ on $W^k$. In the proposed learning rule we compute the gradient for the quadratic error criterion using only (5), \n\n$\\Delta^l \\bar{W}^k_j = \\alpha (z^l_k - y^l_k) \\, g'(\\cdot) \\sum_{t:\\, i_t = j} (X_t - \\bar{W}^k_j),$ (10) \n\nwhere $\\bar{W}^k_j$ is the subvector formed by the first d components of $W^k_j$ and the values of $i_t$ fulfill condition (6). Although the weights do not change according to the exact gradient descent rule (since (6) is not taken into account for back-propagation), we found experimentally that the error made by the network always decreases after the weight update. This fact also can be proved when certain conditions restricting the step-size $\\alpha$ hold, and we conjecture that it is always true for $\\alpha > 0$. \nIII.2 THE STRUCTURE OF THE RECOGNIZER \nWhen the equivalent neural representation of the HMM-based recognizer is used, there exists a natural way of adaptively increasing the complexity of the decision boundaries and developing discriminative feature detectors. This can be done by extending the structure of the recognizer to a multi-layered net. There are many possible architectures that result from such an extension by changing the number of hidden layers, as well as the number and the type (i.e., standard or TW) of neurons in the hidden layers. Moreover, the role of the TW neurons in the first hidden layer is different now: they are no longer class representatives, as in a single-layered net, but just abstract computing elements with built-in time scale normalization. In this work we investigate only a simple special case of such multi-layered architecture. The multi-layered network we use has a single hidden layer, with NxK TW neurons. 
Each hidden neuron corresponds to one state of one of the original HMMs, and is characterized by a weight vector $W^k_j$ and a bias $w^k_j$. The output activation $h^k_j$ of the neuron is given as \n\n(11) \n\nwhere \n\nand \n\n$\\{i_1, \\ldots, i_T\\} = \\arg\\max \\sum_{j=1}^{N} u^k_j.$ \n\nThe output layer is composed of K standard neurons. The activation of the output neurons $y^k$, k = 1, ..., K, is determined by the hidden layer neurons' activations as \n\n$y^k = g(H^* V^k + v^k),$ (12) \n\nwhere $V^k$ is an NxK-dimensional weight vector, H is the vector of hidden neurons' activations, and $v^k$ is a bias term. \nIn a special case of parameter values, when $W^k_j$ satisfy the conditions (7a,b) and \n\n$w^k_j = \\log a^k_{j-1,j} - \\log a^k_{j,j},$ (13) \n\nthe activation $h^k_j$ corresponds to an accumulated j-th state likelihood of the k-th HMM, and the network implements a weighted [5] HMM recognizer where the connection weight vectors $V^k$ determine the relative weights assigned to each state likelihood in the final classification. Such a network can learn to adapt these weights to enhance discrimination by giving large positive weights to states that contain information important for discrimination and ignoring (by forming zero or close to zero weights) those states that do not contribute to discrimination. A back-propagation algorithm can be used for training this net. \n\nIV. EXPERIMENTAL RESULTS \nTo evaluate the effectiveness of the proposed TWN, we conducted several experiments that involved recognition of the highly confusable English E-set (i.e., /b, c, d, e, g, p, t, v, z/). The utterances were collected from 100 speakers, 50 males and 50 females, each speaking every word in the E-set twice, once for training and once for testing. 
\nThe signal was sampled at 6.67 kHz. We used 12 cepstral and 12 delta-cepstral LPC-derived [11] coefficients to represent each 45 msec frame of the sampled signal. \nWe used a baseline conventional HMM-based recognizer to initialize the TW network, and to get a benchmark performance. Each strictly left-to-right HMM in this system has five states, and the observation densities are modeled by four Gaussian mixture components. The recognition rates of this system are 61.7% on the test data, and 80.2% on the training data. \nExperiment with single-layer TWN: In this experiment the single-layer TW network was initialized according to (7), using the parameters of the baseline HMMs. The four mixture components of each state were treated as a fully connected set of four states, with transition probabilities that reflect the original transition probabilities and the relative weights of the mixtures. This corresponds to the case in which the local likelihood is computed using the dominant mixture component only. The network was trained using the suggested training algorithm (10), with quadratic error function. The recognition rate of the trained network increased to 69.4% on the test set and 93.6% on the training set. \nExperiment with multi-layer TWN: In this experiment we used the multi-layer network architecture described in the previous section. The recognition performance of this network after training was 74.4% on the test set and 91% on the training set. \nFigures 1, 2, and 3 show the recognition performance of a single-layer TWN initialized by a baseline HMM, the trained single-layer TWN, and the trained multi-layer TWN, respectively. 
In these figures the activation of the unit representing the correct class is plotted against the activation of the best wrong unit (i.e., the incorrect class with the highest score) for each input utterance. Therefore, the utterances that correspond to the marks above the diagonal line are correctly recognized, and those under it are misclassified. The most interesting observation that can be made from these plots is the striking difference between the multi-layer and the single-layer TWNs. The single-layer TWNs in Figures 1 and 2 (the baseline and the trained) exhibit the same typical behavior, in which the utterances are concentrated around the diagonal line. For the multi-layer net, the utterances that were recognized correctly tend to concentrate in the upper part of the graph, having the correct unit activation close to 1.0. This property of a multi-layer net can be used for introducing error rejection criteria: utterances for which the difference between the highest activation and the second highest activation is less than a prescribed threshold are rejected. In Figure 4 we compare the test performance of the multi-layer net and the baseline system, both with such a rejection mechanism, for different values of the rejection threshold. As expected, the multi-layer net outperforms the baseline recognizer, showing a much smaller misclassification rate for the same number of rejections. \n\nV. SUMMARY AND DISCUSSION \nIn this paper we established a hybrid framework for speech recognition, combining the characteristics of hidden Markov models and neural networks. 
We showed that an HMM-based recognizer has an equivalent representation as a single-layer network composed of time-warping neurons, and proposed to improve the discriminative power of the recognizer by using back-propagation training and by generalizing the structure of the recognizer to a multi-layer net. Several experiments were conducted for testing the performance of the proposed network on a highly confusable vocabulary (the English E-set). The recognition performance on the test set of a single-layer TW net improved from 61.7% (when initialized with baseline HMMs) to 69.4% after training. Expanding the structure of the recognizer by one more layer of neurons, we obtained further improvement of recognition accuracy up to 74.4%. Scatter plots of the results indicate that in the multi-layer case, there is a qualitative change in the performance of the recognizer, allowing us to set up a rejection criterion to improve the confidence of the system. \n\nReferences \n1. H. Bourlard, C.J. Wellekens, \"Links between Markov models and multilayer perceptrons,\" Advances in Neural Information Processing Systems, pp. 502-510, Morgan Kaufmann, 1989. \n2. J.S. Bridle, \"Alphanets: a recurrent 'neural' network architecture with a hidden Markov model interpretation,\" Speech Communication, April 1990. \n3. D.E. Rumelhart, G.E. Hinton and R.J. Williams, \"Learning internal representations by error propagation,\" Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, 1986. \n4. E. Levin, \"Word recognition using hidden control neural architecture,\" Proc. of ICASSP, Albuquerque, April 1990. \n5. K.-Y. Su, C.-H. Lee, \"Speech Recognition Using Weighted HMM and Subspace Projection Approaches,\" Proc. of ICASSP, Toronto, 1991. \n6. L.R. Rabiner, \"A tutorial on hidden Markov models and selected applications in speech recognition,\" Proc. of IEEE, vol. 77, No. 2, pp. 257-286, February 1989. \n7. C.-H. Lee, L.R. Rabiner, R. Pieraccini, J.G. Wilpon, \"Acoustic Modeling for Large Vocabulary Speech Recognition,\" Computer Speech and Language, 1990, No. 4, pp. 127-165. \n8. G.D. Forney, \"The Viterbi algorithm,\" Proc. IEEE, vol. 61, pp. 268-278, Mar. 1973. \n9. S.A. Solla, E. Levin, M. Fleisher, \"Improved targets for multilayer perceptron learning,\" Neural Networks Journal, 1988. \n10. B.-H. Juang, S. Katagiri, \"Discriminative Learning for Minimum Error Classification,\" IEEE Trans. on SP, to be published. \n11. B.S. Atal, \"Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,\" J. Acoust. Soc. Am., vol. 55, No. 6, pp. 1304-1312, June 1974. \n\nFigure 1: Scatter plot for baseline recognizer \n\nFigure 2: Scatter plot for trained single-layer TWN \n\nFigure 3: Scatter plot for multi-layer TWN \n\n
Figure 4: Rejection performance of baseline recognizer and the multi-layer TWN \n", "award": [], "sourceid": 449, "authors": [{"given_name": "Esther", "family_name": "Levin", "institution": null}, {"given_name": "Roberto", "family_name": "Pieraccini", "institution": null}, {"given_name": "Enrico", "family_name": "Bocchieri", "institution": null}]}