{"title": "Continuous Speech Recognition by Linked Predictive Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 199, "page_last": 205, "abstract": null, "full_text": "Continuous Speech Recognition by \nLinked  Predictive Neural Networks \n\nJoe Tebelskis, Alex Waibel, Bojan Petek, and  Otto Schmidbauer \n\nSchool of Computer Science \nCarnegie Mellon  University \n\nPittsburgh,  PA  15213 \n\nAbstract \n\nWe present a large vocabulary, continuous speech recognition system based \non  Linked  Predictive  Neural  Networks  (LPNN's).  The system  uses  neu(cid:173)\nral  networks  as  predictors  of speech  frames,  yielding  distortion  measures \nwhich  are  used  by  the  One Stage DTW algorithm to perform  continuous \nspeech  recognition.  The system,  already  deployed  in  a  Speech  to Speech \nTranslation system, currently achieves 95%,  58%,  and 39% word accuracy \non  tasks  with  perplexity  5,  111,  and  402  respectively,  outperforming sev(cid:173)\neral simple HMMs  that  we  tested.  We  also  found  that  the  accuracy  and \nspeed of the LPNN can be slightly improved by the judicious use of hidden \ncontrol  inputs.  We  conclude  by  discussing  the  strengths  and  weaknesses \nof the predictive approach. \n\n1 \n\nINTRODUCTION \n\nNeural networks  are proving to be  useful for  difficult  tasks such  as speech recogni(cid:173)\ntion,  because  they  can  easily  be  trained  to  compute  smooth,  nonlinear,  nonpara(cid:173)\nmetric functions  from  any input space  to output space.  In speech  recognition,  the \nfunction most often computed by networks is  classification, in which spectral frames \nare  mapped  into  a  finite  set of classes,  such  as  phonemes.  In  theory,  classification \nnetworks  approximate the optimal Bayesian discriminant function  [1],  and in prac(cid:173)\ntice they have yielded very  high accuracy  [2,  3, 4].  However,  integrating a  phoneme \nclassifier into a speech  recognition system is nontrivial, since classification decisions \ntend  to  be  binary,  and  binary  phoneme-level  errors  tend  to  confound  word-level \nhypotheses.  To circumvent  this problem, neural network  training must be carefully \nintegrated into word level training [1, 5].  An alternative function which can be com-\n\n199 \n\n\f200 \n\nTebelskis, Waibel, Petek, and Schmidbauer \n\nputed  by  networks  is  prediction,  where  spectral frames  are  mapped  into  predicted \nspectral frames.  This provides a simple way  to get non-binary distortion measures, \nwith  straightforward  integration into a  speech  recognition  system.  Predictive  net(cid:173)\nworks  have  been  used  successfully  for  small vocabulary  [6,  7]  and large  vocabulary \n[8,  9]  speech  recognition  systems.  In  this  paper  we  describe  our  prediction-based \nLPNN  system  [9],  which  performs large vocabulary  continuous speech  recognition, \nand which has already been deployed within a Speech to Speech Translation system \n[10].  We present our experimental results, and discuss the strengths and weaknesses \nof the predictive approach. \n\n2  LINKED PREDICTIVE NEURAL  NETWORKS \n\nThe  LPNN  system  is  based  on  canonical  phoneme models,  which  can  be  logically \nconcatenated  in  any  order  (using  a  \"linkage  pattern\")  to  create  templates for  dif(cid:173)\nferent  words;  this makes  the LPNN  suitable for  large vocabulary  recognition. 
\n\nEach  canonical  phoneme is  modeled  by  a  short  sequence  of neural  networks.  The \nnumber  of nets  in  the  sequence,  N  >=  1,  corresponds  to  the  granularity  of the \nphoneme model.  These phone modeling networks are nonlinear, multilayered, feed(cid:173)\nforward,  and  \"predictive\"  in  the  sense  that,  given  a  short  section  of speech,  the \nnetworks  are  required  to extrapolate the  raw  speech signal,  rather than  to classify \nit.  Thus,  each  predictive  network  produces  a  time-varying  model  of the  speech \nsignal  which  will  be  accurate  in  regions  corresponding  to  the  phoneme  for  which \nthat  network  has  been  trained,  but  inaccurate  in  other  regions  (which  are  better \nmodeled by  other networks).  Phonemes are thus  \"recognized\"  indirectly,  by  virtue \nof the relative  accuracies  of the  different  predictive networks  in various  sections of \nspeech.  Note, however, that phonemes are not classified at the frame level.  Instead, \ncontinuous scores  (prediction errors)  are  accumulated for  various word  candidates, \nand a  decision  is  made only at the word  level,  where  it is finally  appropriate. \n\n2.1  TRAINING  AND  TESTING  ALGORITHMS \n\nThe purpose of the training procedure  is  both (a)  to train the  networks to become \nbetter predictors, and (b) to cause the networks to specialize on different phonemes. \nGiven a  known  training utterance,  the training procedure  consists of three  steps: \n\n1.  Forward Pass:  All the  networks  make their predictions across  the speech sam(cid:173)\n\nple, and we compute the Euclidean distance matrix of prediction errors between \npredicted  and actual speech frames.  (See  Figure  1.) \n\n2.  Alignment  Step:  We  compute  the  optimal time-alignment  path  between  the \ninput speech  and corresponding predictor nets,  using Dynamic Time Warping. \n3.  Backward Pass:  Prediction error is backpropagated into the networks according \n\nto the segmentation given  by  the  alignment  path.  (See  Figure 2.) \n\nHence  backpropagation causes  the nets to become better predictors,  and the align(cid:173)\nment path induces specialization of the networks for  different  phonemes. \n\nTesting is performed using the One Stage algorithm [11],  which is a  classical exten(cid:173)\nsion of the Dynamic Time Warping algorithm for  continuous speech. \n\n\fContinuous Speech Recognition by Linked ftedictive Neural Networks \n\n201 \n\n\u2022 \n\n\u2022 \n\n\u2022  \u2022 \u2022  - - - - prediction errors \n\n..--+--+-----if---~ ......\u2022................ . ... \n+--+---+-+--+-----4: \u2022......\u2022.................... \nt--t---t--+--t---t--~ .... . .\u2022......... . . . ........ \n\n\u00ab \nCtJ \n\u00ab \n\nphoneme \"a\" \npredictors \n\nphoneme \"b\" \npredictors \n\nFigure  1:  The forward  pass  during  training.  Canonical  phonemes  are  modeled  by \nsequences  of  N  predictive  networks,  shown  as  triangles  (here  N=3).  Words  are \nrepresented  by  \"linkage  patterns\"  over these  canonical  phoneme models  (shown  in \nthe area above the triangles), according to the phonetic spelling of the words.  Here \nwe  are  training on  the word  \"ABA\". In  the forward  pass,  prediction errors  (shown \nas black circles)  are computed for  all predictors, for each frame of the input speech. \nAs these prediction errors are routed through the linkage pattern, they fill a distance \nmatrix (upper right). 
\n\n+--+---1--1---1--+---1-: : :: : :A(igr)~~~(p.~th\u00b7 \u00b7  .... \n\n\u00ab \nCtJ \n\u00ab \\~~-\\tI------+-\n\n~;:;:1tt:;:;:;:;:::;::;:;::;::;::;:;:;:;:;::;:::;::;J;;tt \n\nFigure 2:  The backward pass  during training.  After the DTW alignment  path has \nbeen  computed, error is  backpropagated into the various predictors responsible for \neach point along the alignment path.  The back propagated error signal at each such \npoint is  the vector difference  between the  predicted and actual frame.  This teaches \nthe networks to become better predictors,  and also causes the networks to specialize \non different  phonemes. \n\n\f202 \n\nTebelskis, Waibel, Petek, and Schmidbauer \n\n3  RECOGNITION EXPERIMENTS \n\nWe  have evaluated the  LPNN system on a  database of continuous speech  recorded \nat CMU.  The database consists of 204  English sentences  using a  vocabulary of 402 \nwords,  comprising  12  dialogs  in  the  domain  of conference  registration.  Training \nand  testing  versions  of this  database  were  recorded  in  a  quiet  office  by  multiple \nspeakers  for  speaker-dependent  experiments.  Recordings  were  digitized  at  a  sam(cid:173)\npling rate of 16 KHz.  A Hamming window and an FFT were  computed, to produce \n16  melscale  spectral  coefficients  every  10  msec.  In  our  experiments  we  used  40 \ncontext-independent phoneme models (including one for silence), each of which had \na  6-state phoneme topology similar to the one  used  in  the SPICOS system [12]. \n\ntV>o tt..1s; \n\"*,,,theOl .. d  p/l<QM.; \n\nI\u00a3UD  IS  THIS  Il\u00a3  IFfll\u00a3  f~ lIE  ClII'EREKI \n111  IH  5 \n\niii EH  l  (1/ \n\nlit  Z \n\n, \n\n(seero  = 17. 61 \n111  1ft \n\nIII  f  1ft  5 \n\nf  ER \n\nKRAHfRIftNS \n\n, \n\n(1/ \n\n, \n\nI' \n\n, . \u2022 \u2022 , llll .............. ,IIIIIIIIU .. '  \u2022 \u2022 \".. \n\n:: : ::~ml;i!E::!~:l;::~:!i;~: U \".\" U lu, ........ ,III.,IIII.III ........................... \u00b7 . \u00b7... \n\n. , ,; n .... . \n.. 11 \u2022 \u2022  , 11 ..... \" \", \u00b7\u00b7 , . , ...... ,., ... . . . . .. . . . . . . . .  ,  ..... ,  ........ . \n... \"  \u2022 \u2022 \u2022  1 . 11' .\n. .. \"  .... 11 . ........ \"  \u2022\u2022\u2022\u2022\u2022\u2022\u2022\u2022 \n''':::::::::~:: 1I11 ............ lh \u2022\u2022 , ..... lllIltlll \u2022 \u2022 \u2022\u2022\u2022 .. III ........ \u00b7' .. ...... .. \n.. ... ,11 . . . . . .  \" \"  \u2022\u2022 \u2022 \u2022 h' U \u2022\u2022\u2022 ,I.'\u00b7\u00b7, ,111\"\" \nI\"\"  .... , \u2022\u2022 U .... , .... I.II ............ UI ..... , ................ 111 ........... ' .,.111 .. 11 \u2022\u2022 11. ,\u00b7  . 111\"' \u2022\u2022 111_1111,,0, \u2022\u2022 1111111'1'\" '1\" \" \" \"  \n\"\"\" ' \"  , .\u2022 '\".IIIII .... IIII ... IIIIUIIIlIlIl\"I . . . . . . . . . . .  \"  \u2022\u2022 \u2022 \u2022\u2022 IU ... 'IIUIIlIlIlIl ...... II'h .. .  ' \n\u2022 ,II'.\" I. \n, ..................... IIIU' ..... , \u2022\n\u2022 \".\"\"\" \n. ............... ,., \", .. II .... . ,\u00b7  \"  .. \" 1' \" \n'\" .... UIlIlI ...... \" ''' .... 'I .... \u2022 \n\u2022 '11.\" \u2022\u2022 \n. ' \"  \u2022\u2022\u2022\u2022 1111111 .. . 11\"11 . . . . . . . . . . .  , . . . . . . . . . . . . .  '  . .. .. '.1' ....... , \n' ...... '11' ...... 1111 \u2022\u2022\u2022\u2022 ,.'.'10\u00b7, .\u2022\u2022\u2022\u2022 1 ...... 11 ..... .  \"\"\"'11'\" \n, ., ..... 11 ........ \".0 , ............ ,. \n\" \" .\u2022 00.\"\" \n... ,  \u2022\u2022\u2022 ,. \n'\nI \n'01' '',,, \n. 
\n\"'''.\" \n\"\"\"11 \n\n''';;:;;;:;:~:::: ,1I1I1I1I11t1l11l\"11I.1I\"\"\",\"1I1O  \u2022 \u2022\u2022\u2022 ,It .......... ,\u00b7, \" \"III\"hl. . \n\n\"\"'''''1 \n\nIUIlIl ................ \u00b7 ., \n\n\"ll ........... \"  ..\n\n. .. .  -. \"  .... 1111 . .. .. \" \n\nI \n\n!!~ \n\n. \n\nFigure 3:  Actual and predicted spectrograms. \n\nFigure  3  shows  the  result  of testing  the  LPNN  system  on  a  typical sentence.  The \ntop portion is  the actual spectrogram for  this utterance;  the bottom portion shows \nthe frame-by-frame predictions made by the networks specified by each  point along \nthe optimal alignment path.  The similarity of these two spectrograms indicates that \nthe  hypothesis  forms  a  good  acoustic  model  of the unknown utterance  (in fact  the \nhypothesis  was  correct  in  this  case).  In  our speaker-dependent  experiments  using \ntwo  males  speakers,  our  system  averaged  95%,  58%,  and  39%  word  accuracy  on \ntasks with  perplexity 5,  111,  and 402  respectively. \n\nIn order  to confirm  that  the predictive  networks  were  making a  positive  contribu(cid:173)\ntion  to  the  overall  system,  we  performed  a  set  of comparisons between  the  LPNN \nand several  pure  HMM  systems.  When  we  replaced  each  predictive  network  by  a \nunivariate  Gaussian  whose  mean  and  variance  were  determined  analytically from \nthe  labeled  training  data,  the  resulting  HMM  achieved  44%  word  accuracy,  com(cid:173)\npared  to  60%  achieved  by  the  LPNN  under  the  same  conditions  (single  speaker, \nperplexity  111).  When  we  also  provided  the  HMM  with  delta  coefficients  (which \nwere  not  directly  available  to  the  LPNN),  it  achieved  55%.  Thus  the  LPNN  was \noutperforming each  of these  simple HMMs. \n\n\fContinuous Speech Recognition by Linked R-edictive Neural Networks \n\n203 \n\n4  HIDDEN CONTROL EXPERIMENTS \n\nIn  another series  of experiments,  we  varied  the LPNN  architecture  by  introducing \nhidden  control  inputs,  as  proposed  by  Levin  [7].  The  idea,  illustrated  in  Figure 4, \nis  that a sequence  of independent networks is  replaced  by a single network which is \nmodulated by an equivalent number of \"hidden control\"  input bits that distinguish \nthe state. \n\nSequence of \n\nPredictive Networks \n\nHidden Control \n\nNetwork \n\nFigure 4:  A sequence  of networks corresponds to a single  Hidden  Control network. \n\nA  theoretical  advantage  of  hidden  control  architectures  is  that  they  reduce  the \nnumber offree parameters in the system.  As the number of networks is reduced, each \none  is  exposed  to more training data,  and - up  to  a  certain  point - generalization \nmay  improve.  The  system  can  also  run  faster,  since  partial  results  of redundant \nforward  pass computations can  be saved.  (Notice,  however,  that the  total number \nof forward  passes is  unchanged.)  Finally,  the savings in memory can be significant. \n\nIn our experiments, we found  that by replacing 2-state phoneme models by equiva(cid:173)\nlent  Hidden  Control  networks,  recognition  accuracy  improved  slightly  and the sys(cid:173)\ntem  ran  much  faster.  On  the  other  hand,  when  we  replaced  all of the  phonemic \nnetworks  in  the  entire  system  by  a  single  Hidden  Control  network  (whose  hidden \ncontrol inputs  represented  the  phoneme  as  well  as  its state),  recognition  accuracy \ndegraded  significantly.  
5 CURRENT LIMITATIONS OF PREDICTIVE NETS

While the LPNN system is good at modeling the acoustics of speech, it presently tends to suffer from poor discrimination. In other words, for a given segment of speech, all of the phoneme models tend to make similarly good predictions, rendering all phoneme models fairly confusable. For example, Figure 5 shows an actual spectrogram and the frame-by-frame predictions made by the /eh/ model and the /z/ model. Disappointingly, both models are fairly accurate predictors for the entire utterance.

[Figure 5 spectrograms omitted; panel labels: /eh/, /z/.]

Figure 5: Actual spectrogram, and corresponding predictions by the /eh/ and /z/ phoneme models.

This problem arises because each predictor receives training in only a small region of input acoustic space (i.e., those frames corresponding to that phoneme). Consequently, when a predictor is shown any other input frames, it will compute an undefined output, which may overlap with the outputs of other predictors.
\" \" . ' . \"  ............ . .  , \u2022 \u2022 \u2022 ,,\"\", . \n11' . . . . . . . . . . .  1111'111 \u2022\u2022 ' .. \"  .... 11' .... 11111111111 . . . . . . . . . . . . . . . . . . .  111 .... \"  . . . . . .  '. 1.11' ... 1''' . . . .  ' .... 1111 11 11 111111 ...... 11 . . . . . . . . . . . . . . .......... , ..... . . ... 1 . '  ... .. \n\u2022 1 . . . . . .. . . . . . . . . . . . . . . . . . .  111 11 .. 11111 ... 11.' \u2022 \u2022\u2022\u2022 , . . . .  \"  \u2022\u2022 \u2022 \"1 .... 1 . . . . . . .  IIH \u2022\u2022 \u2022 \u2022 '  .. , ............. IIII'I' . . . . . . . .  , \u2022\u2022\u2022\u2022 \u2022 \"., . . . . .  I1 ......... . , ......... . ... .. '. 1.1 \u2022\u2022\u2022 \" \n................................... , ..... ,\" \u2022\u2022 \"'1111 \u2022\u2022 11111\"111 ..... 1 .. '\".11\" . . . . . . .  1 ..... 1 ..... 1 .. 111'\"11 \u2022\u2022 \"1 . . . . .  1 \u2022\u2022\u2022 \"  \u2022\u2022 \" ....... ., ....... , \" \"',I., IIf,.,. \" .I.\"\"  \u2022\u2022\u2022 \n\n..... _ . - - . .  .................................. , .......... _ . ,  ...... , ....... ,.,\u00b7 ....... 1 ..... \u2022 ....... - . . .  .. \u2022\u2022 ......... ' ..... ,111 ..... .. \n\u2022\u2022\u2022\u2022\u2022\u2022\u2022 , ........................................................... 1 ...................................................... , .......... , ......... I'II \u2022\u2022 .! \n\n. ...... .. .. . .... . \" \n\n\" 1 \n\n. . ..... \u2022 ... \" ,. \u00b7  .\n\n. ....... \" \n\nr\"TT~!\".\"\"\"'\"\"'\"TmmTmTmTnm\"\"'TnT\"\"'IJII1\"\"'rrm~!!!'I1TT!\".\"\"rrmrrm\"\"'...,.,nT11\"\"';nnIlPl_rm\"\"'m;1,m,mn'\"\"rrmTm\u00b7nnTTi1TtTTT11TTTm.mml \u2022\u2022 t.'1 \u2022 \n. ..... .  , \u2022\u2022 \u2022 \u2022\u2022\u2022\u2022 ,,, .............. ,  . . . . . . . . . .  1.1 ............. .  '1111, ...... . ...... , ., \u2022\u2022\u2022 , .. ......... . .. , ..... ,It.\"., '''' ,., \u2022. u  . , .... , \".' .. '11' ... , . ....... .. ...... 11.111 \n. . ...... 1 \u2022\u2022 \"\"  I .\n.. . 1 .... 11 ..... .. \" \" ,\u00b7\" \u00b7 \u00b7 .. 1 .. ............ .. '.11 . . . . . . .  '\",, ....... '.11'\".11 \u2022\u2022 \"  .. , .. ,,' \u2022\u2022 1 ... 1111' ... 111.1111.1111\" ....... , .... , . ............ '00 . .. .. . .. , . , ... , ........... , ...... , \n\u2022\u2022\u2022 I .. ' .. . .  I1 .. ' \u2022\u2022 I.II .... I ..... ' ........... ,,' ...... \"III III I \u2022 \u2022  HIIIIII' ., 111 \u2022\u2022 ' 11 11111011' ..... \"  ... 1 ........ 1'\"'''.' .... \",., \u2022\u2022 . ' .\u2022\u2022 111 ''.' .' ... . 10 ...... '. , ..... , ............. . \n\u2022\u2022 \u2022\u2022 111\" ............ 1111.\"\".' .... ' . ....... \"  ........ 1.11111 .......... \".1 \u2022\u2022 1 ........ 11\"' \u2022\u2022\u2022 \u2022 11'\".1, ........... , ........ , ..... \u00b7\u00b7\u00b7 \u2022\u2022 111 1 11.11111\"  ...... .  , ... . ... ,. \u00b7 .. .. .. , . . . .  .. \n\u2022 , , ......... \"1 '1. 11111111 1' ..... . .. . , \"\"  .. , ' .... U '  .. I IIII \"  ....... , 11 1 ' .. 111 ..... '11 .. \"  .... . , , . .. ... , .. , ........ , . .. .. '\"11 ' \u2022\u2022\u2022 \u2022 \u2022 , \" .1111111'1 11'\"  \"' \"  ... .. , \u2022\u2022\u2022\u2022\u2022  ' . \"\"\"111\" \n. , . IUff'.\"\"I\".'.IIIIIIIIIII'\" . \" , . \"  \u2022 \u2022 11, .... , ... ..... ' .. 1'1 \u2022\u2022 ' .. . 11\".' ............ , ... \u00b7 ' \u2022 \u2022 1 \u2022 \u2022 \u2022\u2022 11 .... \",,, ,, ,,,,,\" ... ,11' \"'  ,  .... . ...... , ... . \"  .. ....... , ..  , \n\n/z/  .. \" '1 ....... 
In other words, the predictors are currently only trained on positive instances, because it is not obvious what predictive output target is meaningful for negative instances; and this leads to problematic "undefined regions" for the predictors. Clearly some type of discriminatory training technique should be introduced, to yield better performance in prediction-based recognizers.

6 CONCLUSION

We have studied the performance of Linked Predictive Neural Networks for large vocabulary, continuous speech recognition.
Using a 6-state phoneme topology, without duration modeling or other optimizations, the LPNN achieved an average of 95%, 58%, and 39% accuracy on tasks with perplexity 5, 111, and 402, respectively. This was better than the performance of several simple HMMs that we tested. Further experiments revealed that the accuracy and speed of the LPNN can be slightly improved by the judicious use of hidden control inputs.

The main advantages of predictive networks are that they produce non-binary distortion measures in a simple and elegant way, and that by virtue of their nonlinearity they can model the dynamic properties of speech (e.g., curvature) better than linear predictive models [13]. Their main current weakness is that they have poor discrimination, since their strictly positive training causes them all to make confusably accurate predictions in any context. Future research should concentrate on improving the discriminatory power of the LPNN, by such techniques as corrective training, explicit context-dependent phoneme modeling, and function word modeling.

Acknowledgements

The authors gratefully acknowledge the support of DARPA, the National Science Foundation, ATR Interpreting Telephony Research Laboratories, and NEC Corporation. B. Petek also acknowledges support from the University of Ljubljana and the Research Council of Slovenia. O. Schmidbauer acknowledges support from his employer, Siemens AG, Germany.

References

[1] H. Bourlard and C. J. Wellekens. Links Between Markov Models and Multilayer Perceptrons. Pattern Analysis and Machine Intelligence, 12:12, December 1990.

[2] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, March 1989.

[3] M. Miyatake, H. Sawai, and K. Shikano. Integrated Training for Spotting Japanese Phonemes Using Large Phonemic Time-Delay Neural Networks. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1990.

[4] E. McDermott and S. Katagiri. Shift-Invariant, Multi-Category Phoneme Recognition using Kohonen's LVQ2. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, May 1989.

[5] P. Haffner, M. Franzini, and A. Waibel. Integrating Time Alignment and Connectionist Networks for High Performance Continuous Speech Recognition. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, May 1991.

[6] K. Iso and T. Watanabe. Speaker-Independent Word Recognition Using a Neural Prediction Model. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1990.

[7] E. Levin. Speech Recognition Using Hidden Control Neural Network Architecture. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1990.

[8] J. Tebelskis and A. Waibel. Large Vocabulary Recognition Using Linked Predictive Neural Networks. In Proc.
IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1990.

[9] J. Tebelskis, A. Waibel, B. Petek, and O. Schmidbauer. Continuous Speech Recognition Using Linked Predictive Neural Networks. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, May 1991.

[10] A. Waibel, A. Jain, A. McNair, H. Saito, A. Hauptmann, and J. Tebelskis. A Speech-to-Speech Translation System Using Connectionist and Symbolic Processing Strategies. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, May 1991.

[11] H. Ney. The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32:2, April 1984.

[12] H. Ney and A. Noll. Phoneme Modeling Using Continuous Mixture Densities. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1988.

[13] N. Tishby. A Dynamic Systems Approach to Speech Processing. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1990.
", "award": [], "sourceid": 358, "authors": [{"given_name": "Joe", "family_name": "Tebelskis", "institution": null}, {"given_name": "Alex", "family_name": "Waibel", "institution": null}, {"given_name": "Bojan", "family_name": "Petek", "institution": null}, {"given_name": "Otto", "family_name": "Schmidbauer", "institution": null}]}