{"title": "Speech Recognition Using Connectionist Approaches", "book": "Advances in Neural Information Processing Systems", "page_first": 262, "page_last": 269, "abstract": null, "full_text": "Speech  Recognition using Connectionist Approaches \n\nKhalid Choukri \n\nSPRINT Coordinator \n\nCAP GEMINI  INNOVATION \n\n118 rue  de  Tocqueville,  75017 Paris.  France \n\ne-mail:  choukri@capsogeti.fr \n\nAbstract \n\nThis  paper  is  a  summary  of SPRINT  project  aims  and  results.  The  project \nfocus  on the use  of neuro-computing techniques to tackle various problems that \nremain  unsolved  in  speech  recognition.  First  results  concern  the  use  of feed(cid:173)\nforward  nets  for  phonetic  units  classification,  isolated  word  recognition,  and \nspeaker  adaptation. \n\n1 \n\nINTRODUCTION \n\nSpeech  is  a  complex  phenomenon  but  it  is  useful  to divide  it into levels  of repre(cid:173)\nsentation.  Connectionism paradigms and particularities are  exploited to tackle the \nmajor problems in relationship with intra and inter speaker variabilities in order to \nimprove  the  recognizer  performance.  For  that  purpose  the  project  has  been  split \ninto individual tasks which  are  depicted  below: \n\nI..-._S_ign_al_-,H \n\nParam ... ,.  H  Phon.tie  H  Lexieon \n\nThe work described  herein concerns  : \n\n\u2022  Parameters-to-Phonetic:  Classification  of  speech  parameters  using  a  set  of \n\n\"phonetic\"  symbols and extraction of speech features  from signal. \n\n\u2022  Parameters-to-Lexical:  Classification of a sequence  of feature  vectors  by lexi(cid:173)\n\ncal access  (isolated word recognition)  in various environments. \n\n\u2022  Parameters-to-Parameters:  Adaptation to new  speakers  and environments. \n\n262 \n\n\fSpeech Recognition using Connectionist Approaches \n\n263 \n\nThe  following  sections  summarize  the  work  carried  out  within  this  project.  
Details, including descriptions of the different nets, are reported in the project deliverables (Choukri, 1990), (Bimbot, 1990), (Varga, 1990). \n\n2 PARAMETERS-TO-PHONETIC \n\nThe objectives of this task were to assess various neural network topologies, and to examine the use of prior knowledge in improving results, in the process of acoustic-phonetic decoding of natural speech. These results were compared to classical pattern classification approaches such as k-nearest neighbour classifiers (k-NN), dynamic programming, and k-means. \n\n2.1 DATABASES \n\nThe speech was uttered by one male speaker in French. Two databases were used: DB_1, made of isolated nonsense words (logatomes), which contains 6672 phonemes, and DB_2, provided by the recording of 200 sentences, which contains 5270 phonemes. DB_2 was split equally into training and test sets (2635 data each). 34 different labels were used: 1 per phoneme (not per allophone) and one for silence. For each phoneme occurrence, 16 frames of signal (8 on each side of the label) were processed to provide a 16-channel Mel-scaled filter-bank vector. \n\n2.2 CLASSICAL CLASSIFIERS \n\nExperiments using k-NN and k-means classifiers were conducted to check the consistency of the data and to obtain reference scores. A first protocol considered each pattern as a 256-dimension vector and applied k-nearest neighbours with the Euclidean distance between references and tests. A second protocol attempted to decrease the influence of time misalignments by carrying out Dynamic Time Warping (DTW) between references and tests, taking the sum of distances along the best path as a distance measure between patterns. 
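The DTW pattern distance of this second protocol can be sketched as a classic dynamic-programming recursion (a minimal version; the function name and interface are ours):

```python
def dtw_distance(a, b, frame_dist):
    # Sum of frame distances along the best warping path between
    # sequences a and b, allowing insertions, deletions, and matches.
    INF = float('inf')
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(a[i - 1], b[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

With a per-frame Euclidean distance, this value replaces the plain 256-dimension Euclidean distance of the first protocol inside the k-NN decision rule.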
The same data was used in the framework of a k-means classifier, for various values of k (number of representatives per class). The best results are: \n\nMethod:  k-means (K > 16)   k-NN (K=5)   k-NN + DTW (K=5) \nScore:   61.3 %             72.2 %       77.5 % \n\n2.3 NEURAL CLASSIFIERS \n\n2.3.1 LVQ Classifiers \n\nExperiments were conducted using the Learning Vector Quantization (LVQ) technique (Bennani, 1990). A study of the weights initialization procedure proved it to be an important parameter for classification performance. We compared three initialization algorithms: k-means, LBG, and Multiedit. With k-means and LBG, tests were conducted with different numbers of reference vectors, while with Multiedit the algorithm automatically discovers representative vectors in the training set, so their number is not specified in advance. \n\nInitialization by LBG gave better performance for self-consistency (evaluation on the training database, DB_1), whereas test performance on DB_2 (sentences) was similar for all procedures and very low. Further experiments were carried out on DB_2 both for training and testing. LBG initialization with 16 and 32 classes was tried (since these gave the best performances in the previous experiment). Even though the self-consistency for sentences is slightly lower than that for logatomes, the improvement in recognition scores is far better, as illustrated here: \n\nnb refs per class    16                  32 \nK-means              60.3 %              61.3 % \nLBG -> LVQ           62.4 % -> 66.1 %    63.2 % -> 67.2 % \n\nThis experiment and some others (not presented here) (Bimbot, 1990) confirm that the failure of the previous experiments is due more to a mismatch between the corpora for this recognition method than to an inadequacy of the classification technique itself. 
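The codebook refinement at the heart of these classifiers can be sketched as a minimal LVQ1 training loop in plain Python (initialisation by k-means, LBG, or Multiedit is assumed to have produced the starting references; all function names are ours):

```python
def nearest(refs, x):
    # index of the codebook vector closest to x (squared Euclidean distance)
    return min(range(len(refs)),
               key=lambda j: sum((r - v) ** 2 for r, v in zip(refs[j], x)))

def lvq1_train(data, refs, ref_labels, lr=0.05, epochs=20):
    # data: list of (vector, label) pairs; refs: initial codebook.
    for _ in range(epochs):
        for x, t in data:
            j = nearest(refs, x)
            sign = 1.0 if ref_labels[j] == t else -1.0  # attract same class, repel others
            refs[j] = [r + sign * lr * (v - r) for r, v in zip(refs[j], x)]
    return refs

def classify(refs, ref_labels, x):
    return ref_labels[nearest(refs, x)]
```

The initialization sensitivity discussed above enters through `refs`: the update rule only moves existing references, so a poor starting codebook cannot be fully repaired by training.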
\n\n2.3.2 The Time-Delay Neural Network (TDNN) Classifiers \n\nA TDNN, as introduced by A. Waibel (Waibel, 1987), can be described by its set of topological parameters, i.e.: \n\nM0xN0 / P0,S0 - M1xN1 / P1,S1 - M2xN2 - Kx1. \n\nIn the following, a \"TDNN-derived\" network has a similar architecture, except that M2 is not constrained to be equal to K, and the connectivity between the last 2 layers is full. Various TDNN-derived architectures were tested on recognizing phonemes from sentences (DB_2) after learning on the logatomes (DB_1). The best results are given below: \n\nTDNN-derived structure                    self-consist.   reco score \n16x16 / 2,1 - 8x15 / 7,2 - 5x5 - 34x1     63.9 %          48.1 % \n16x16 / 2,1 - 16x15 / 7,4 - 11x3 - 34x1   75.1 %          54.8 % \n16x16 / 4,1 - 16x13 / 5,2 - 16x5 - 34x1   81.0 %          60.5 % \n16x16 / 2,1 - 16x15 / 7,4 - 16x3 - 34x1   79.8 %          60.8 % \n\nThe first net is clearly not powerful enough for the task, so the number of free parameters had to be increased. This immediately improved the results, as can be seen for the other nets. The third and fourth nets have equivalent performance; they differ in their local window widths and delays. Other tested architectures did not improve on this performance. The main difference between the training and test sets is certainly the different speaking rate, and therefore the existence of important time distortions. TDNN-derived architectures seem more able to handle this kind of distortion than LVQ, as the generalization performance is significantly higher for similar learning self-consistency, but both fail to remove all temporal misalignment effects. 
\nIn order to improve classification performance, we changed the cost function minimized by the network: the error term corresponding to the desired output is multiplied by a constant H greater than 1, while the error terms corresponding to the other outputs are left unchanged, to compensate for the deficiency of the simple mean square error procedure. We obtained our best results with the best TDNN-derived net we experimented with, for H=2: \n\nDatabase   Net                                       self-consist.   reco score \nDB_1       16x16 / 4,1 - 16x13 / 5,2 - 16x5 - 34x1   87.0 %          63.0 % \nDB_2       16x16 / 4,1 - 16x13 / 5,2 - 16x5 - 34x1   87.0 %          78.0 % \n\nA too small number of independent weights (a too low-dimensioned TDNN-derived architecture) makes the problem too constrained. A well-chosen TDNN-derived architecture can perform as well as the best k-nearest neighbours strategy. Performance drops for data that mainly differ by a significant speaking-rate mismatch, which could indicate that TDNN-derived architectures do not manage to handle all kinds of time distortions. It is therefore encouraging to combine different networks and classical methods to deal with the temporal and sequential aspects of speech. \n\n2.3.3 Combination of TDNN and LVQ \n\nA set of experiments using a combined TDNN-derived network and LVQ architecture was conducted. For these experiments, we used the best nets found in the previous experiments. The main parameter of these experiments is the number of hidden cells in the last layer of the TDNN-derived network, which is the input layer of LVQ (Bennani, 1990). 
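The time-delayed, shift-invariant connectivity that such a TDNN-derived front end applies to the input frames can be sketched as a single windowed layer (linear activations only, following the M x N / P,S notation of section 2.3.2; the function itself is ours, not code from the project):

```python
def time_delay_layer(frames, weights, delay, shift):
    # frames: sequence of input feature vectors (one per time step).
    # weights: one flat weight list per output unit, covering a window
    # of `delay` consecutive frames (the P parameter); `shift` is the
    # window step, i.e. the undersampling S. Nonlinearity is omitted.
    out = []
    pos = 0
    while pos + delay <= len(frames):
        window = [v for frame in frames[pos:pos + delay] for v in frame]
        out.append([sum(w * x for w, x in zip(unit, window)) for unit in weights])
        pos += shift
    return out
```

Because the same `weights` are applied at every window position, a phonetic event produces the same response wherever it occurs in time, which is the shift-invariance property exploited here; the final layer of such a stack is what feeds the LVQ module in the combined architecture.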
\nEvaluation on DB_1 with various numbers of references per class gave the following recognition scores: \n\nrefs per class                     4        8        16 \nTDNN + k-means                     76.2 %   78.1 %   79.8 % \nTDNN + LBG                         77.7 %   79.9 %   81.3 % \nTDNN + LVQ (LBG initialization)    78.4 %   82.1 %   81.4 % \n\nThe best results were obtained with 8 references per class and the LBG algorithm to initialize the LVQ module. The best performance on the test set (82.1 %) represents a significant increase (4 %) compared to the best TDNN-derived network. \n\nOther experiments were performed on TDNN + LVQ using a modified LVQ architecture, presented in (Bennani, 1990), which is an extension of LVQ built to automatically weight the variables according to their importance for the classification. We obtain a recognition score of 83.6 % on DB_2 (training and tests on sentences). \n\nWe also used low-dimensioned TDNNs for discriminating between phonetic features (Bimbot, 1990), assuming that phonetics will provide a description of speech that will appropriately constrain a neural network a priori, the TDNN structure guaranteeing the desirable property of shift invariance. \n\nThe feature extraction approach can be considered as another way to use prior knowledge for solving a complex problem with neural networks. The results obtained in these experiments are an interesting starting point for designing a large modular network where each module is in charge of a simple task, directly related to a well-defined linguistic phenomenon (Bimbot, 1990). 
\n\n2.4 CONCLUSIONS \n\nExperiments with LVQ alone, a TDNN-derived network alone, and combined TDNN-LVQ architectures proved the combined architecture to be the most efficient with respect to our databases, as summarized below (training and tests on DB_2): \n\nk-means   LVQ      k-NN     k-NN + DTW   TDNN     TDNN + LVQ \n61.3 %    67.2 %   72.2 %   77.5 %       78.0 %   83.6 % \n\n3 PARAMETERS-TO-LEXICAL \n\nThe main objective of this task is to use neural nets for the classification of a sequence of speech frames into lexical items (isolated words). Many factors affect the performance of automatic speech recognition systems. They have been categorized into those relating to speaker-independent recognition mode, the time evolution of speech (time representation of the neural network input), and the effects of noise. The first two topics are described herein, while the third is described in (Varga, 1990). \n\n3.1 USE OF VARIOUS NETWORK TOPOLOGIES \n\nExperiments were carried out to examine the performance of several network topologies such as those evaluated in section 2. A TDNN can be thought of as a single Hidden Markov Model state spread out in time. The lower levels of the network are forced to be shift-invariant, and instantiate the idea that the absolute time of an event is not important. Scaly networks are similar to TDNNs in that the hidden units of a scaly network are fed by partially overlapping input windows. As reported in previous sections, LVQ proved to be efficient for the phoneme classification task, and an \"optimal\" architecture was found as a combination of a TDNN and LVQ. It was used herein for isolated word recognition. 
\nFrom experiments reported in detail in (Varga, 1990), there seems little justification for fully-connected networks with their thousands of weights when TDNNs and Scaly networks with hundreds of weights have very similar performance. This performance is about 83 % (the nearest-class-mean classifier gave a performance of 69 %) on the E-set database (a portion of the larger CONNEX alphabet database which British Telecom Research Laboratories have prepared for experiments on neural networks). The first utterance by each speaker of the \"E\" words \"B, C, D, E, G, P, T, V\" was used. The database is divided into training and test sets, each consisting of approximately 400 words and 50 speakers. \n\nOther experiments were conducted on an isolated digits recognition task, in speaker-independent mode (25 speakers for training and 15 for test), using the networks already introduced. A summary of the best performance obtained is: \n\n          K-means          TDNN            LVQ             TDNN + LVQ \n          train.   test    train.   test   train.   test   train.   test \n          98.90    90.57   98.26    94.0   97.38    92.57   99.90    97.50 \n\nPerformance for training is roughly equivalent for all algorithms. For generalization, performance of the combined architecture is clearly superior to the other techniques. \n\n3.2 TIME EVOLUTION OF SPEECH \n\nIn contrast to images as patterns of specific size, speech signals display a temporal evolution. Approaches have to be developed on how a network with its fixed number of input units can cover word patterns of variable size and also account for the dynamic time variations within words. 
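One simple projection of a word's N' frames onto a fixed number of network input frames can be sketched as follows (a minimal form of the linear normalization evaluated in this section, compressing by picking and expanding by duplicating vectors; the averaging variant mentioned in the text is omitted and the function name is ours):

```python
def linear_normalize(frames, n_out):
    # Map a variable-length frame sequence onto exactly n_out frames by
    # sampling input positions at a constant rate: vectors are duplicated
    # when expanding and dropped when compressing.
    n_in = len(frames)
    return [frames[min(i * n_in // n_out, n_in - 1)] for i in range(n_out)]
```

Endpoint detection is assumed to have already trimmed the word, as in the text; the remaining methods differ mainly in how they choose which frames to keep, stretch, or pad.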
\n\nDifferent projections onto the fixed-size collection of NxM network input elements (number of vectors x number of coefficients per vector) have been tested, such as: \n\nLinear Normalization: the boundaries of a word are determined by a conventional endpoint detection algorithm, and the N' feature vectors are linearly compressed or expanded to N by averaging or duplicating vectors. \n\nTime Warp: word boundaries are located initially. Some parts of a word of length N' are compressed, while others are stretched and some remain constant, with respect to speech characteristics. \n\nNoise Boundaries: the sequence of N' vectors of a word is placed in the middle of, or at random within, the area of the desired N vectors, and the margins are padded with the noise from the speech pauses. \n\nTrace Segmentation: the procedure essentially involves the division of the trace that is followed by the temporal course in the M-dimensional feature vector space into a constant number of new sections of identical length. \n\nThese time normalization procedures were used with the scaly neural network (Varga, 1990). It turned out that three methods for time representation - time normalization, trace segmentation with endpoint detection or with noise boundaries - are well suited to solve the transformation problem for a fixed input network layer. The recognition scores are in the 98.5 % range (within a \u00b11 % deviation) for 10 digits and 99.5 % for a 57-word vocabulary in speaker-independent mode. There is no clear indication that one of these approaches is superior to the others. \n\n3.3 CONCLUSIONS \n\nThe neural network techniques investigated have delivered performance comparable to classical techniques. 
It is now widely agreed that hybrid systems (the integration of Hidden Markov Modeling and MLPs) yield enhanced performance. Initial steps have been made towards the integration of Hidden Markov Models and MLPs. Mathematical formulations are required to unify hybrid models. The temporal aspect of speech has to be carefully considered and taken into account by the formalism. \n\n4 PARAMETERS-TO-PARAMETERS \n\nThe main objective of this task was to provide the speech recognizer with a set of parameters adapted to the current user without any training phase. \n\nSpectral parameters corresponding to the same sound uttered by two speakers are generally different. Speaker-independent recognizers usually take this variability into account using stochastic models and/or multi-references. An alternative approach consists in learning spectral mappings to transform the original set of parameters into another one better adapted to the characteristics of the current user and the speech acquisition conditions. The procedure can be summed up as follows: \n\n\u2022 Load the standard dictionary of the reference speaker. \n\n\u2022 Acquire an adaptation vocabulary from the new speaker. \n\n\u2022 Time-warp each new utterance against the corresponding reference utterance. Temporal variability is thus softened, and corresponding feature vectors become available (input-output pairs). \n\n\u2022 Learn the spectral transformations from these associated vectors. \n\n\u2022 Apply the adaptation operator to the reference dictionary, leading to an adapted one. \n\n\u2022 Evaluate the recognizer using the obtained adapted dictionary. 
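The steps above can be sketched end to end for the mapping-learning and dictionary-adaptation stages. The project uses a multi-layer perceptron for the mapping; as a minimal stand-in we fit a linear map by per-sample gradient descent (all names are ours, and the learning rate assumes small input magnitudes):

```python
def learn_spectral_map(pairs, dim, lr=0.1, epochs=500):
    # pairs: DTW-aligned (reference_frame, new_speaker_frame) vectors.
    # W starts at the identity, b at zero; plain stochastic gradient
    # descent on the squared mapping error.
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    b = [0.0] * dim
    for _ in range(epochs):
        for x, y in pairs:
            pred = [sum(W[i][j] * x[j] for j in range(dim)) + b[i]
                    for i in range(dim)]
            for i in range(dim):
                e = pred[i] - y[i]
                b[i] -= lr * e
                for j in range(dim):
                    W[i][j] -= lr * e * x[j]
    return W, b

def apply_map(W, b, x):
    # transform one reference-dictionary frame into the adapted space
    return [sum(W[i][j] * x[j] for j in range(len(x))) + b[i]
            for i in range(len(x))]
```

Applying `apply_map` to every frame of every reference template yields the adapted dictionary of the fifth step; an MLP with one hidden layer would replace the linear map without changing the surrounding procedure.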
\n\nThe mathematical formulation is based on a very important result, regarding input(cid:173)\noutput  mappings,  and  demonstrated  by  Funahashi (Funahashi,  1989)  and  Hornik, \nStinchcombe  & White  (Hornik,  1989).  They proved  that a  network  using  a single \nhidden layer (a net with 3 layers)  with an arbitrary squashing function can approx(cid:173)\nimate any  Borel measurable function  to any desired  degree  of accuracy. \n\nExperiments  were  conducted  (see  details  in  (Choukri,  1990))  on  a  speech  isolated \nword  database  consisting  of  20  English  words  recorded  26  times  by  16  different \nspeakers  (TI data base  (Choukri,  1987)).  The  first  repetition  of the  20  words  are \nreference  templates,  tests  are  conducted  on  the  remaining  25  repetitions.  Before \nadaptation, the cross-speaker  scores is of 68%.  On the average  adaptation with the \nmulti-layer perceptron  provides a  15%  improvement compared  to  the non-adapted \nresults. \n\n\fSpeech Recognition using Connectionist Approaches \n\n269 \n\n5  CONCLUSIONS \n\nFor  phonetic  classifications,  sophisticated  networks,  combinations  of TONNs  and \nLVQ,  revealed  to be more efficient  than classical approaches or simple network  ar(cid:173)\nchitectures;  their use for isolated word recognition offered comparable performance. \nVarious approaches to cope with temporal distortions were implemented and demon(cid:173)\nstrate that combination of sophisticated neural networks and their cooperation with \nHMM is a promising research axis.  It has also been established that basic MLPs are \nefficient  tools  to  learn  speaker-to-speaker  mappings for speaker  adaptation proce(cid:173)\ndures.  We are expecting more sophisticated MLPs (recurrent  and context sensitive) \nto perform better. \n\nAcknowledgements: \n\nThis project is partially supported by the European ESPRIT Basic research  Actions \nprogramme (BRA 3228).  
The partners involved are: CGInn (F), ENST (F), IRIAC (F), RSRE (UK), SEL (FRG), and UPM (Spain). \n\nReferences \n\nK. Choukri. (1990) Speech processing and recognition using integrated neurocomputing techniques: ESPRIT Project SPRINT (BRA 3228), First deliverable of Task f, June 1990. \n\nF. Bimbot. (1990) Speech processing and recognition using integrated neurocomputing techniques: ESPRIT Project SPRINT (BRA 3228), First deliverable of Task 9, June 1990. \n\nA. Varga. (1990) Speech processing and recognition using integrated neurocomputing techniques: ESPRIT Project SPRINT (BRA 3228), First deliverable of Task S, June 1990. \n\nA. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. (1987) Phoneme recognition using Time-Delay Neural Networks. Technical Report, CMU / ATR, Oct 30, 1987. \n\nY. Bennani, N. Chaourar, P. Gallinari, and A. Mellouk. (1990) Comparison of Neural Net models on speech recognition tasks. Technical Report, University of Paris-Sud, LRI, 1990. \n\nKen-Ichi Funahashi. (1989) On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183-192, March 1989. \n\nK. Hornik, M. Stinchcombe, and H. White. (1989) Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989. \n\nK. Choukri. (1987) Several approaches to Speaker Adaptation in Automatic Speech Recognition Systems. PhD thesis, ENST (Telecom Paris), Paris, 1987. \n\nAUTHORS AND CONTRIBUTORS \n\nY. BENNANI, F. BIMBOT, J. BRIDLE, N. CHAOURAR, K. CHOUKRI, L. DODD, F. FOGELMAN, P. GALLINARI, D. HOWELL, M. IMMENDORFER, A. KRAUSE, K. McNAUGHT, A. MELLOUK, C. MONTACIE, R. MOORE, H. VALBRET, A. VARGA, A. WALLYN, O. 
SEGARD \n", "award": [], "sourceid": 390, "authors": [{"given_name": "Khalid", "family_name": "Choukri", "institution": null}]}