{"title": "Speech Recognition Experiments with Perceptrons", "book": "Neural Information Processing Systems", "page_first": 144, "page_last": 153, "abstract": null, "full_text": "144 \n\nSPEECH RECOGNITION EXPERIMENTS \n\nWITH PERCEPTRONS \n\nD.  J. Burr \n\nBell  Communications Research \n\nMorristown, NJ 07960 \n\nABSTRACT \n\nArtificial  neural  networks  (ANNs)  are  capable  of accurate  recognition  of \nsimple speech  vocabularies such  as  isolated  digits  [1].  This paper looks  at two \nmore  difficult  vocabularies,  the  alphabetic  E-set  and  a  set  of  polysyllabic \nwords.  The  E-set  is  difficult  because  it  contains  weak  discriminants  and \npolysyllables  are  difficult  because  of  timing  variation.  Polysyllabic  word \nrecognition  is  aided  by a  time  pre-alignment technique  based on  dynamic pro(cid:173)\ngramming  and  E-set  recognition  is  improved  by  focusing  attention.  Recogni(cid:173)\ntion  accuracies  are  better  than  98%  for  both  vocabularies  when  implemented \nwith a single  layer perceptron. \n\nINTRODUCTION \n\nArtificial  neural  networks  perform  well  on  simple  pattern  recognition \n\ntasks.  On  speaker  trained  spoken  digits  a  layered  network  performs  as  accu(cid:173)\nrately  as  a  conventional  nearest  neighbor classifier  trained  on  the same tokens \n[1].  Spoken  digits  are  easy  to  recognize  since  they  are  for  the  most  part \nmonosyllabic  and are distinguished by strong vowels. \n\nIt is  reasonable  to  ask  whether  artificial  neural  networks  can  also  solve \nmore difficult  speech  recognition  problems.  Polysyllabic  recognition  is  difficult \nbecause  multi-syllable  words  exhibit  large  timing  variation.  Another  difficult \nvocabulary,  the  alphabetic E-set,  consists of the words B,  C,  D,  E,  G,  P, T, V, \nand  Z.  This  vocabulary  is  hard  since  the  distinguishing  sounds  are  short  in \nduration and low  in energy. \n\nWe  show  that  a  simple one-layer  perceptron  [7]  can  solve  both  problems \nvery  well  if  a  good  input  representation  is  used  and  sufficient  examples  are \ngiven.  We  examine  two  spectral  representations  -\na  smoothed  FFT  (fast \nFourier  transform)  and  an  LPC  (linear  prediction  coefficient)  spectrum.  A \ntime  stabilization  technique  is  described  which  pre-aligns  speech  templates \nbased  on  peaks  in  the  energy  contour.  Finally,  by  focusing  attention  of the \nartificial  neural  network to  the beginning of the word,  recognition  accuracy of \nthe E-set can be consistently increased. \n\nA  layered  neural  network,  a  relative  of the  earlier  percept ron  [7],  can  be \ntrained  by  a  simple gradient descent  process  [8].  Layered  networks  have been \n\n\u00a9 American Institute of Physics 1988 \n\n\f145 \n\napplied  successflJ.lly  to speech  recognition  [1],  handwriting  recognition  [2],  and \nto  speech synthesis  [11].  A  variation  of a  layered network  [3]  uses  feedback  to \nmodel causal  constraints, which  can be useful  in  learning speech  and  language. \nHidden neurons within a  layered network  are the building blocks that are used \nto form solutions to specific problems.  The number of hidden units required  is \nrelated  to the problem  [1,2].  Though a  single hidden  layer  can  form  any map(cid:173)\nping  [12],  no more  than two  layers  are needed  for  disjunctive normal form  [4]. \nThe  second  layer  may  be  useful  in  providing  more  stable  learning  and \nrepresentation  in  the  presence  of noise.  Though neural  nets have  been  shown \nto  perform  as  well  as  conventional  techniques[I,5],  neural  nets  may  do  better \nwhen classes have outliers  [5]. \n\nPERCEPTRONS \n\nA  simple  perceptron  contains  one  input  layer  and  one  output  layer  of \nneurons  directly  connected  to  each  other  (no  hidden  neurons).  This  is  often \ncalled  a  one-layer  system,  referring  to  the  single  layer  of  weights  connecting \ninput  to  output.  Figure  1.  shows  a  one-layer  perceptron  configured  to  sense \nspeech  patterns  on  a  two-dimensional  grid.  The  input  consists  of  a  64-point \nspectrum  at each  of twenty  time  slices.  Each  of the  1280  inputs  is  connected \nto  each  of  the  output  neurons,  though  only  a  sampling  of  connections  are \nshown.  There is  one output neuron corresponding  to each  pattern class.  Neu(cid:173)\nrons have standard linear-weighted  inputs with logistic activation. \n\nC(1) \n\nC(2) \n\nC(N-1)  C(N) \n\nFR:<lBC'V  .... \n64  units \n\nFigure  1.  A  single  layer  perceptron  sensing  a  time-frequency  array  of sample \ndata.  Each output neuron  CU)  (1 <i<N) corresponds  to  a  pattern class and \nis  full  connected  to  the  input  array  (for  clarity  only  a  few  connections  are \nshown). \n\nAn input word is  fit  to the grid  region  by applying an  automatic endpoint \ndetection  algorithm.  The  algorithm is  a  variation of one proposed by Rabiner \nand  Sambur  [9]  which  employs  a  double  threshold  successive  approximation \n\n\f146 \n\nmethod.  Endpoints  are  determined  by  first  detecting  threshold  crossings  of \nenergy  and  then  of zero  crossing  rate.  In  practice  a  level  crossing  other  than \nzero  is  used  to prevent endpoints from  being  triggered by background sounds. \n\nINPUT REPRESENTATIONS \n\nTwo different input representations were used in  this study.  The first  is  a \n\nFourier  representation  smoothed  in  both time  and  frequency.  Speech  is  sam(cid:173)\npled  at  10  KHz  ap.d  Hamming  windowed  at  a  number  of  sample  points.  A \n128-point  FFT  spectrum  is  computed  to  produce  a  template  of  64  spectral \nsamples at each of twenty time frames.  The template is  smoothed twice with  a \ntime window of length three and  a  frequency window of length eight. \n\nFor  comparison  purposes  an  LPC  spectrum  is  computed  using  a  tenth \norder  model  on  300-sample  Hamming  windows.  Analysis  is  performed  using \nthe autocorrelation method with Durbin recursion  [6].  The resulting  spectrum \nis  smoothed over three time frames. \n\nSample  spectra  for  the  utterance  \"neural-nets\"  is  shown  in  Figure  2. \nNotice  the  relative  smoothness  of  the  LPC  spectrum  which  directly  models \nspectral peaks. \n\nFFT \n\nLPC \n\nFigure 2.  FFT and LPC  time-frequency  plots  for  the  utterance  \"neural nets\". \nTime is  toward the left,  and frequency,  toward  the right. \n\nDYNAMIC TIME ALIGMv1ENT \n\nConventional  speech  recognition  systems  often  employ  a  time  normaliza(cid:173)\n\ntion  technique  based  on  dynamic  programming  [10].  It is  used  to  warp  the \ntime scales  of two  utterances  to  obtain  optimal  alignment  between  their  spec(cid:173)\ntral  frames.  We  employ  a  variation  of  dynamic  programming  which  aligns \nenergy  contours  rather  than  spectra.  A  reference  energy  template  is  chosen \nfor  each  pattern  class,  and  incoming  patterns  are  warped  onto  it.  Figure  3 \nshows  five  utterances  of  \"neural-nets\"  both  before  and  after  time  alignment. \nNotice the improved alignment of energy peaks. \n\n\f147 \n\n\u00a7 \n\n\u00a7 \n\n\u00a7 \n\n>-\n\\b \n~ ~ \nIII \nz \nW \n\n~ \n\n!I \n\nI \n\n\u00a7 \n\nI \n\nII \n\n~ \n\n! \n\n.. \n\n.. \n\n(a. ) \n\n10 \n\n10 \n\n. .. \n\nTIME \n\n(b) \n\nFigure  3.  (a)  Superimposed  energy plots  of five  different  utterances  of  \"neural \nnets\".  (b).  Same utterances after dynamic time  alignment. \n\nPOLYSYLLABLE  RECOGNITION \n\nTwenty  polysyllabic  words  containing  three  to  five  syllables  were  chosen, \nand  five  tokens  of  each  were  recorded  by  a  single  male  speaker.  A  variable \nnumber of tokens were used  to train  a  simple perceptron to study the effect  of \ntraining  set  size  on  performance.  Two  performance  measures  were  used: \nclassification  accuracy,  and  an  RMS  error measure.  Training  tokens were  per(cid:173)\nmuted  to obtain additional experimental data points. \n\nFigure  4.  Output  responses  of  a  perceptron  trained  with  one  token  per  class \n(left)  and four  tokens per class  (right). \n\n\f148 \n\nFigure  4  shows  two  representative  perspective  plots  of  the  output  of  a \nperceptron  trained  on  one  and  four  tokens  respectively  per  class.  Plots  show \nnetwork  response  (z-coordinate)  as  a  function  of  output  node  (left  axis)  and \ntest  word  index  (right  axis).  Note  that  more  training  tokens  produce  a  more \nideal map  - a  map should have ones along  the diagonal and zeroes everywhere \nelse. \n\nTable  1  shows  the  results  of  these  experiments  for \n\nthree  different \nrepresentations:  (1)  FFT,  (2)  LPC  and  (3)  time  aligned  LPC.  This  table  lists \nclassification  accuracy  as  a  function  of  number  of  training  tokens  and  input \nrepresentation.  The  perceptron  learned  to  classify  the  unseen  patterns  per(cid:173)\nfectly for  all  cases  except the FFT with  a  single  training pattern. \n\nTable 1.  Polysyllabic Word Recognition AccuraclT \n\nNumber Training Tokens \n\nFFT \nLPC \n\nTime Aligned LPC \nPermuted Trials \n\n1 \n98.7% \n100% \n100% \n400 \n\n2 \n\n100% \n100% \n100% \n300 \n\n3 \n\n100% \n100% \n100% \n200 \n\n4 \n\n100% \n100% \n100% \n100 \n\nA  different  performance measure,  the RMS  error,  evaluates the  degree  to \nwhich  the  trained  network output responses  Rjk  approximate the ideal  targets \nTjk \u2022  The measure \"is  evaluated over  the  N  non-trained  tokens  and  M  output \nnodes of the network.  Tik  equals 1 for  J=k  and 0 for  J=I=k. \n\nFigure  5  shows  plots  of RMS  error  as  a  function  of input  representation \nand training patterns.  Note that the FFT representation produced the highest \nerror,  LPC was  about  40%  less,  and  time-aligned  LPC  only marginally  better \nthan  non-aligned  LPC.  In  a  situation where  many choices must  be made  (i.e. \nvocabularies  much  larger  than  20  words)  LPC  is  the  preferred  choice,  and \ntime  alignment  could  be  useful  to  disambiguate  similar  words. \nIncreased \nnumber of training tokens results in improved performance in  all cases. \n\n\f149 \n\no ci  ,-----------------------------~ \n\n'\" 0 \n\ni \n\"! \nl-\nii I-\n5  0 g \n\n0 \n\n.. \n\nW \ntJl \n~ \na: \n\nFFT \n\nLPC \n\n'\" 0 \n0 \n\nTIme  Aligned LPC \n\no \no  ~--~----~--~----~ __ ~ ____ ~ \n\n1.0 \n\n2.0 \n\n3.0 \n\n4.0 \n\nNumber Traln'ng Tokens \n\nFigure  5.  RMS  error  versus  number  of  training  tokens  for  various  input \nrepresentations. \n\nE-SET VOCABULARY \n\nThe E-Set  vocabulary  consists  of  the  nine  E-words  of the  English  alpha(cid:173)\n\nbet  - B,  C,  D,  E,  G,  P,  T,  V,  Z.  Twenty  tokens  of each  of  the  nine  classes \nwere  recorded  by a single male speaker.  To maximize the sizes  of training  and \ntest  sets,  half were  used  for  training  and  the  other  half for  testing.  Ten  per(cid:173)\nmutations produced  a  total of 900  separate recognition  trials. \n\nFigure  6  shows  typical  LPC  templates  for  the  nine  classes.  Notice  the \ndouble  formant  ridge  due  to  the  ''E''  sound,  which  is  common  to  all  tokens. \nAnother characteristic feature  is  the FO  ridge  -\nthe upward fold  on  the  left of \nall  tokens which characterizes voicing or pitched sound. \n\n\f150 \n\nFigure  6.  LP C  time-frequency  plots  for  representative  tokens  of  the  E-set \nwords. \n\nFigure  7.  Time-frequency  plots  of  weight  values  connected  to  each  output \nneuron ''E'' through \"z\" in  a  trained  perceptron. \n\n\f151 \n\nFigure  7  shows  similar  plots  illustrating  the  weights  learned  by  the  net(cid:173)\n\nwork when  trained  on  ten  tokens of each  class.  These are  plotted like spectra, \nsince  one  weight  is  associated  with  each  spectral  sample.  Note  that  the  pat(cid:173)\nterns have some  formant  structure.  A  recognition  accuracy of 91.4%  included \nperfect scores for  classes  C, E,  and  G. \n\nNotice  that  weights  along  the FO  contour  are  mostly  small  and  some  are \nslightly  negative.  This  is  a  response  to  the  voiced  ''E\"  sound  common  to  all \nclasses.  The  network  has  learned  to  discount  \"voicing\"  as  a  discriminator  for \nthis vocabulary. \n\nNotice  also  the  strong  \"hilly\"  terrain  near  the  beginning  of  most  tem(cid:173)\n\nplates.  This  shows  where  the  network  has  decided  to  focus  much  of  its \ndiscriminating  power.  Note  in  particular  the  hill-valley  pair  at  the  beginning \nof  ''p''  and  \"T\".  These  are  near  to  formants  F2/F3  and  could  conceivably be \nformant  onset  detectors.  Note  the  complicated  detector  pattern  for  the  ''V'' \nsound. \n\nThe  classes  that  are easy  to  discriminate  (C,  E,  G)  produce  relatively fiat \nand  uninter~sting  weight  spaces.  A  highly  convoluted  weight  space  must \ntherefore  be  correlated  with  difficulty  in  discrimination.  It  makes  little  sense \nhowever  that  the  network should  be  working  hard  in the late time C'E\" sound) \nportion  of  the  utterance.  Perhaps  additional  training  might  reduce  this \nactivity,  since  the  network  would  eventually  find  little  consistent  difference \nthere. \n\nA  second  experiment  was  conducted  to  help  the  network  to  focus  atten(cid:173)\ntion.  The first  k  frames of each input token were averaged to produce an  aver(cid:173)\nage  spectrum.  These  average  spectra  were  then  used  in  a  simple  nearest \nneighbor  recognizer scheme.  Recognition accuracy was measured  as  a  function \nof k.  The  highest  performance  was  for  k=8,  indicating  that  the  first  40%  of \nthe word contained most of the \"action\". \n\nB \n\nC \n\nD \n\nE \n\nC \n\nP \n\nT \n\nV \n\nZ \n\n08 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n2 \n\n0 \n\n100  0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n08 \n\n0 \n\n0 \n\n3 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n100  0 \n\n0 \n\n0 \n\n2 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n100  0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n03 \n\n0 \n\n2 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n4 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n100  0 \n\n08 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n0 \n\n09 \n\nB \n\nc \n\nD \n\nE \n\nc \n\np \n\nT \n\nV \n\nZ \n\nFigure  8.  Confusion  matrix  of  the  E-set  focused  on  the  first  40%  of  each \nword. \n\n\f152 \n\nAll  words  were  resampled  to  concentrate  20  time  frames  into  the  first \n40%  of  the  word.  LPC  spectra  were  recomputed  using  a  16th  order  model \nand  the  network  was  trained  on  the  new  templates.  Performance  increased \nfrom  91.4%  to  98.2%.  There were  only  16  classification  errors out  of the 900 \nrecognition  tests.  The  confusion  matrix  is  shown  in  Figure  8.  Learning  times \nfor  all  experiments  consisted  of  about  ten  passes  through  the  training  set. \nWhen  weights  were  primed  with  average  spectral  values  rather  than  random \nvalues,  learning time decreased slightly. \n\nCONCLUSIONS \n\nArtificial  neural  networks  are  capable  of  high  performance  in  pattern \nrecognition  applications,  matching  or  exceeding \nthat  of  conventional \nclassifiers.  We  have  shown  that  for  difficult  speech  problems  such  as  time \nalignment  and  weak  discriminability,  artificial  neural  networks  perform  at \nhigh  accuracy exceeding 98%.  One-layer perceptrons learn these  difficult  tasks \nalmost  effortlessly - not in spite of their simplicity, but because of it. \n\nREFERENCES \n\n1.  D.  J.  Burr,  \"A  Neural  Network  Digit  Recognizer\",  Proceedings  of  IEEE \nConference  on  Systems,  Man,  and  Cybernetics,  Atlanta,  GA,  October,  1986, \npp.  1621-1625. \n\n2.  D.  J.  Burr,  \"Experiments  with  a  Connectionist  Text Reader,\"  IEEE  Inter(cid:173)\nnational  Conference on  Neural Networks,  San Diego,  CA,  June,  1987. \n\n3.  M.  I.  Jordan,  \"Serial  Order:  A  Parallel  Distributed  Processing  Approach,\" \nICS  Report  8604,  UCSD  Institute  for  Cognitive  Science,  La  Jolla,  CA,  May \n1986. \n\n4.  S.  J.  Hanson,  and  D.  J.  Burr,  'What  Connectionist Models Learn:  Toward \na  Theory of Representation  in Multi-Layered  Neural  Networ.ks,\" submitted for \npu blication. \n\n5.  W.  Y.  Huang and  R.  P. Lippmann,  \"Comparisons Between  Neural  Net and \nConventional  Classifiers,\" IEEE International  Conference on Neural  Networks, \nSan Diego,  CA,  June 21-23,  1987. \n\n6.  J.  D.  Markel  and  A.  H.  Gray,  Jr.,  Linear  Prediction  of  Speech,  Springer(cid:173)\nVerlag,  New York,  1976. \n\n7.  M.  L.  Minsky  and  S.  Papert,  Perceptrons,  MIT  Press,  Cambridge,  Mass., \n1969. \n\n\f153 \n\n8.  D.  E.  Rumelhart,  G.  E.  Hinton,  and  R.  J.  Williams,  ''Learning  Internal \nRepresentations  by  Error  Propagation,\"  in  Parallel  Distributed  Processing, \nVol.  1,  D.  E.  Rumelhart  and J. L.  McClelland,  eds.,  MIT Press,  1986,  pp.  318-\n362. \n\n9.  L.  R.  Rabiner and M.  R.  Sambur,  \"An Algorithm for  Determining the End(cid:173)\npoints of Isolated Utterances,\" BSTJ, Vol.  54,297-315, Feb.  1975. \n\n10.  H.  Sakoe and  S.  Chiba,  \"Dynamic Programming  Optimization for  Spoken \nWord  Recognition,\"  IEEE  Trans.  Acoust.,  Speech,  Signal  Processing,  Vol. \nASSP-26,  No.1, 43-49, Feb.  1978. \n\n11.  T.  J.  Sejnowski  and  C.  R.  Rosenberg,  \"NETtalk: A  Parallel  Network  that \nLearns  to  Read  Aloud,\"  Technical  Report  JHU/EECS-86/01,  Johns  Hopkins \nUniversity Electrical Engineering and Computer Science,  1986. \n\n12.  A.  Wieland  and  R.  Leighton,  \"Geometric  Analysis  of  Neural  Network \nCapabilities,\" IEEE  International  Conference  on  Neural  Networks,  San Deigo, \nCA,  June 21-24,  1987. \n\n\f", "award": [], "sourceid": 88, "authors": [{"given_name": "David", "family_name": "Burr", "institution": null}]}