{"title": "Speech Recognition: Statistical and Neural Information Processing Approaches", "book": "Advances in Neural Information Processing Systems", "page_first": 796, "page_last": 801, "abstract": null, "full_text": "SPEECH RECOGNITION: STATISTICAL AND NEURAL INFORMATION PROCESSING APPROACHES

John S. Bridle
Speech Research Unit and National Electronics Research Initiative in Pattern Recognition
Royal Signals and Radar Establishment, Malvern UK

Automatic Speech Recognition (ASR) is an artificial perception problem: the input is raw, continuous patterns (no symbols!) and the desired output, which may be words, phonemes, meaning or text, is symbolic. The most successful approach to automatic speech recognition is based on stochastic models. A stochastic model (SM) is a theoretical system whose internal state and output undergo a series of transformations governed by probabilistic laws [1]. In the application to speech recognition the unknown patterns of sound are treated as if they were outputs of a stochastic system [18,2]. Information about the classes of patterns is encoded as the structure of these "laws" and the probabilities that govern their operation. The most popular type of SM for ASR is also known as a "hidden Markov model" (HMM).

There are several reasons why the SM approach has been so successful for ASR. It can describe the shape of the spectrum, and has a principled way of describing temporal order, together with variability of both. It is compatible with the hierarchical nature of speech structure [20,18,4], there are powerful algorithms for decoding with respect to the model (recognition) and for adapting the model to fit significant amounts of example data (learning), and firm theoretical (mathematical) foundations enable extensions to be accommodated smoothly (e.g. [3]).
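As a concrete illustration of such a stochastic model, the following is a minimal sketch of scoring an observed symbol sequence under a discrete-output hidden Markov model via the forward recursion; the particular parameter values and function names are illustrative assumptions, not taken from any system described here.

```python
import numpy as np

def hmm_likelihood(pi, A, B, obs):
    """P(obs | model) for a discrete-output HMM, by the forward recursion.

    pi : initial state probabilities, shape (n,)
    A  : state transition probabilities, shape (n, n)
    B  : emission probabilities, shape (n, n_symbols)
    obs: sequence of observed symbol indices
    """
    pi, A, B = map(np.asarray, (pi, A, B))
    alpha = pi * B[:, obs[0]]          # joint prob of state and first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate one step, then emit
    return float(alpha.sum())          # sum over all final states
```

Because the recursion sums over every state path, the likelihoods of all possible observation sequences of a given length sum to one, which makes the model a proper probability distribution over patterns.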
There are many deficiencies, however. In a typical system the speech signal is first described as a sequence of acoustic vectors (spectrum cross-sections or equivalent) at a rate of, say, 100 per second. The pattern is assumed to consist of a sequence of segments corresponding to discrete states of the model. In each segment the acoustic vectors are drawn from a distribution characteristic of the state, but are otherwise independent of one another and of the states before and after. In some systems there is a controlled relationship between states and the phonemes or phones of speech science, but most of the properties and notions which speech scientists assume are important are ignored.

Most SM approaches are also deficient at a pattern-recognition theory level: the parameters of the models are usually adjusted (using the Baum-Welch re-estimation method [5,2]) so as to maximise the likelihood of the data given the model. This is the right thing to do if the form of the model is actually appropriate for the data, but if not, the parameter-optimisation method needs to be concerned with discrimination between classes (phonemes, words, meanings, ...) [28,29,30].

An HMM recognition algorithm is designed to find the best explanation of the input in terms of the model. It tracks scores for all plausible current states of the generator and throws away explanations which lead to a current state for which there is a better explanation (Bellman's Dynamic Programming). It may also throw away explanations which lead to a current state much worse than the best current state (score pruning), producing a Beam Search method. (It is important to keep many hypotheses in hand, particularly when the current input is ambiguous.)
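The decoding scheme just described, keeping only the best explanation per current state (Dynamic Programming) and discarding hypotheses whose score falls too far below the best (score pruning), can be sketched as follows; the model parameters and the `beam` width are hypothetical choices for illustration.

```python
import numpy as np

def viterbi_beam(pi, A, B, obs, beam=10.0):
    """Best-path decoding for a discrete-output HMM with score pruning.

    pi : initial state probabilities, shape (n,)
    A  : state transition probabilities, shape (n, n)
    B  : emission probabilities, shape (n, n_symbols)
    obs: sequence of observed symbol indices
    beam: hypotheses more than `beam` below the best log score are dropped
    """
    pi, A, B = map(np.asarray, (pi, A, B))
    n = len(pi)
    log_A, log_B = np.log(A), np.log(B)
    score = np.log(pi) + log_B[:, obs[0]]
    backptr = []
    for o in obs[1:]:
        cand = score[:, None] + log_A            # score of every (prev -> next) move
        best_prev = np.argmax(cand, axis=0)      # Bellman: one explanation per state
        score = cand[best_prev, np.arange(n)] + log_B[:, o]
        score[score < score.max() - beam] = -np.inf  # score pruning (Beam Search)
        backptr.append(best_prev)
    # trace back the surviving best hypothesis
    path = [int(np.argmax(score))]
    for prev in reversed(backptr):
        path.append(int(prev[path[-1]]))
    return path[::-1]
```

With `beam` set very large this reduces to plain Viterbi decoding; a narrow beam trades accuracy on ambiguous input for less computation, which is exactly why many hypotheses should be kept in hand.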
Connectionist (or "Neural Network") approaches start with a strong preconception of the types of process to be used. They can claim some legitimacy by reference to new (or renewed) theories of cognitive processing. The actual mechanisms used are usually simpler than those of the SM methods, but the mathematical theory (of what can be learnt or computed, for instance) is more difficult, particularly for structures which have been proposed for dealing with temporal structure.

One of the dreams for connectionist approaches to speech is a network whose inputs accept the speech data as it arrives, whose internal state contains all necessary information about the past input, and whose output is as accurate and early as it can be. The training of networks with their own dynamics is particularly difficult, especially when we are unable to specify what the internal state should be. Some are working on methods for training the fixed points of continuous-valued recurrent non-linear networks [15,16,27]. Prager [6] has attempted to train various types of network in a full state-feedback arrangement. Watrous [9] limits his recurrent connections to self-loops on hidden and output units, but even so the theory of such recursive non-linear filters is formidable.

At the other extreme are systems which treat a whole time-frequency-amplitude array (resulting from initial acoustic analysis) as the input to a network, and require a label as output. For example, the performance that Peeling et al. [7] report on multi-speaker small-vocabulary isolated word recognition tasks approaches that of the best HMM techniques available on the same data. Invariance to temporal position was trained into the network by presenting the patterns at random positions in a fixed time-window. Waibel et al.
[8] use a powerful compromise arrangement which can be thought of either as the replication of smaller networks across the time-window (a time-spread network [19]) or as a single small network with internal delay lines (a Time-Delay Neural Network [8]). There are no recurrent links except for trivial ones at the output, so training (using backpropagation) is no great problem. We may think of this as a finite-impulse-response non-linear filter. Reported results on consonant discrimination are encouraging, and better than those of an HMM system on the same data. The system is insensitive to position by virtue of its construction.

Kohonen has constructed and demonstrated large vocabulary isolated word [12] and unrestricted vocabulary continuous speech transcription [13] systems which are inspired by neural network ideas, but implemented as algorithms more suitable for current programmed digital signal processor and CPU chips. Kohonen's phonotopic map technique can be thought of as an unsupervised adaptive quantiser constrained to put its reference points in a non-linear low-dimensional sub-space. His learning vector quantiser technique, used for initial labeling, combines the advantages of the classic nearest-neighbor method and discriminant training.

Among other types of network which have been applied to speech we must mention an interesting class based not on correlations with weight vectors (dot products) but on distances from reference points. Radial Basis Function theory [22] was developed for multi-dimensional interpolation, and was shown by Broomhead and Lowe [23] to be suitable for many of the jobs that feed-forward networks are used for. The advantage is that it is not difficult to find useful positions for the reference points which define the first, non-linear, transformation.
If this is followed by a linear output transformation then the weights can be found by methods which are fast and straightforward. The reference points can be adapted using methods based on back-propagation. Related methods include potential functions [24], Kernel methods [25] and the modified Kanerva network [26].

There is much to be gained from a careful comparison of the theory of stochastic model and neural network approaches to speech recognition. If a NN is to perform speech decoding in a way anything like a SM algorithm it will have a state which is not just one of the states of the hypothetical generative model; the state must include information about the distribution of possible generator states given the pattern so far, and the state transition function must update this distribution depending on the current speech input. It is not clear whether such an internal representation and behavior can be 'learned' from scratch by an otherwise unstructured recurrent network.

Stochastic model based algorithms seem to have the edge at present for dealing with temporal sequences. Discrimination-based training inspired by NN techniques may make a significant difference in performance.

It would seem that the area where NNs have most to offer is in finding non-linear transformations of the data which take us to a space (perhaps related to formant or articulatory parameters) where comparisons are more relevant to phonetic decisions than purely auditory ones (e.g., [17,10,11]). The resulting transformation could also be viewed as a set of 'feature detectors'. Or perhaps the NN should deliver posterior probabilities of the states of a SM directly [14].
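The Radial Basis Function arrangement discussed above, a fixed non-linear layer whose units respond to distance from reference points followed by a linear output layer found by fast, direct methods, might be sketched like this; Gaussian basis functions and a least-squares fit are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def rbf_features(X, centres, width):
    """First, non-linear layer: unit responses depend on distance from the
    reference points (Gaussian basis functions, one illustrative choice)."""
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def fit_rbf(X, y, centres, width):
    """Linear output layer fitted directly by least squares -- the fast,
    straightforward step once the reference points are fixed."""
    Phi = rbf_features(X, centres, width)
    W, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return W

def rbf_predict(X, centres, width, W):
    return rbf_features(X, centres, width) @ W
```

When the reference points coincide with the training points the Gaussian design matrix is invertible, so the linear layer interpolates the training targets exactly, which is the multi-dimensional interpolation setting the theory came from.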
The art of applying a stochastic model or neural network approach is to choose a class of models or networks which is realistic enough to be likely to be able to capture the distinctions (between speech sounds or words, for instance) and yet has a structure which makes it amenable to algorithms for building the detail of the models based on examples, and for interpreting particular unknown patterns. Future systems will need to exploit the regularities described by phonetics, to allow the construction of high-performance systems with large vocabularies, and their adaptation to the characteristics of each new user.

There is no doubt that the stochastic model based methods work best at present, but current systems are generally far inferior to humans even in situations where the usefulness of higher-level processing is minimal. I predict that the next generation of ASR systems will be based on a combination of connectionist and SM theory and techniques, with mainstream speech knowledge used in a rather soft way to decide the structure. It should not be long before the distinction I have been making will disappear [29].

[1] D. R. Cox and H. D. Miller, "The Theory of Stochastic Processes", Methuen, 1965, pp. 721-741.

[2] S. E. Levinson, L. R. Rabiner and M. M. Sondhi, "An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition", Bell Syst. Tech. J., vol. 62, no. 4, pp. 1035-1074, Apr. 1983.

[3] M. R. Russell and R. K. Moore, "Explicit modeling of state occupancy in hidden Markov models of automatic speech recognition", IEEE ICASSP-85.

[4] S. E. Levinson, "A unified theory of composite pattern analysis for automatic speech recognition", in F. Fallside and W.
Woods (eds.), "Computer Speech Processing", Prentice-Hall, 1984.

[5] L. E. Baum, "An inequality and associated maximisation technique in statistical estimation of probabilistic functions of a Markov process", Inequalities, vol. 3, pp. 1-8, 1972.

[6] R. W. Prager et al., "Boltzmann machines for speech recognition", Computer Speech and Language, vol. 1, no. 1, 1986.

[7] S. M. Peeling, R. K. Moore and M. J. Tomlinson, "The multi-layer perceptron as a tool for speech pattern processing research", Proc. Inst. Acoustics Conf. on Speech and Hearing, Windermere, November 1986.

[8] Waibel et al., ICASSP88, NIPS88 and ASSP forthcoming.

[9] R. L. Watrous, "Connectionist speech recognition using the Temporal Flow model", Proc. IEEE Workshop on Speech Recognition, Harriman NY, June 1988.

[10] I. S. Howard and M. A. Huckvale, "Acoustic-phonetic attribute determination using multi-layer perceptrons", IEEE Colloquium Digest 1988/11.

[11] M. A. Huckvale and I. S. Howard, "High performance phonetic feature analysis for automatic speech recognition", ICASSP89.

[12] T. Kohonen et al., "On-line recognition of spoken words from a large vocabulary", Information Sciences, vol. 33, pp. 3-30, 1984.

[13] T. Kohonen, "The 'Neural' phonetic typewriter", IEEE Computer, March 1988.

[14] H. Bourlard and C. J. Wellekens, "Multilayer perceptrons and automatic speech recognition", IEEE First Intl. Conf. Neural Networks, San Diego, 1987.

[15] R. Rohwer and S. Renals, "Training recurrent networks", Proc. N'Euro-88, Paris, June 1988.

[16] L. Almeida, "A learning rule for asynchronous perceptrons with feedback in a combinatorial environment", Proc. IEEE Intl. Conf. Neural Networks, San Diego, 1987.

[17] A. R.
Webb and D. Lowe, "Adaptive feed-forward layered networks as pattern classifiers: a theorem illuminating their success in discriminant analysis", submitted to Neural Networks.

[18] J. K. Baker, "The Dragon system: an overview", IEEE Trans. ASSP-23, no. 1, pp. 24-29, Feb. 1975.

[19] J. S. Bridle and R. K. Moore, "Boltzmann machines for speech pattern processing", Proc. Inst. Acoust., November 1984, pp. 1-8.

[20] B. H. Repp, "On levels of description in speech research", J. Acoust. Soc. Amer., vol. 69, pp. 1462-1464, 1981.

[21] R. A. Cole et al., "Performing fine phonetic distinctions: templates vs. features", in J. Perkell and D. H. Klatt (eds.), "Symposium on invariance and variability of speech processes", Hillsdale, NJ, Erlbaum, 1984.

[22] M. J. D. Powell, "Radial basis functions for multi-variate interpolation: a review", IMA Conf. on algorithms for the approximation of functions and data, Shrivenham, 1985.

[23] D. Broomhead and D. Lowe, "Multi-variable interpolation and adaptive networks", RSRE memo 4148, Royal Signals and Radar Est., 1988.

[24] M. A. Aizerman, E. M. Braverman and L. I. Rozonoer, "On the method of potential functions", Automatika i Telemekhanika, vol. 26, no. 11, pp. 2086-2088, 1964.

[25] Hand, "Kernel discriminant analysis", Research Studies Press, 1982.

[26] R. W. Prager and F. Fallside, "Modified Kanerva model for automatic speech recognition", submitted to Computer Speech and Language.

[27] F. J. Pineda, "Generalisation of back-propagation to recurrent neural networks", Physical Review Letters, 1987.

[28] L. R. Bahl et al., Proc. ICASSP88, pp. 493-496.

[29] H. Bourlard and C. J.
Wellekens,  \"Links between  Markov  models  and  multi(cid:173)\n\nlayer perceptrons\",  this  volume. \n\n[30]  L.  Niles,  H. Silverman, G. Tajchman, M.  Bush,  \"How limited training data can \nallow a neural network to out-perform an 'optimal' classifier\" , Proc. ICASSP89. \n\n\f", "award": [], "sourceid": 174, "authors": [{"given_name": "John", "family_name": "Bridle", "institution": null}]}