{"title": "Dimensionality Reduction and Prior Knowledge in E-Set Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 178, "page_last": 185, "abstract": null, "full_text": "178 \n\nLang and Hinton \n\nDimensionality Reduction and Prior Knowledge  in \n\nE-set Recognition \n\nKevin J. Lang1 \nComputer Science Dept. \nCarnegie Mellon University \nPittsburgh, PA  15213 \nUSA \n\nGeoffrey E.  Hinton \nComputer Science Dept. \nUniversity of Toronto \nToronto, Ontario M5S  lA4 \nCanada \n\nABSTRACT \n\nIt is  well known  that  when an  automatic  learning algorithm  is  applied \nto a  fixed  corpus  of data,  the size of the corpus  places  an  upper bound \non  the  number  of degrees  of freedom  that  the  model  can  contain  if \nit  is  to generalize  well.  Because  the  amount  of hardware  in  a  neural \nnetwork  typically  increases  with  the  dimensionality  of  its  inputs,  it \ncan be challenging to build a high-performance network for classifying \nlarge input patterns.  In this paper, several techniques for addressing this \nproblem  are  discussed  in  the context of an  isolated  word  recognition \ntask. \n\nIntroduction \n\n1 \nThe domain for our research was a speech recognition task that requires distinctions to be \nlearned between recordings of four highl y confusable words:  the names of the letters \"B\", \n\"D\", \"E\", and \"V\". The task was  created at IBM's T. J.  Watson Research Center, and is \ndifficult because many speakers were included and also because the recordings were made \nunder noisy  office conditions  using  a  remote  microphone.  One  hundred  male  speakers \nsaid each  of the 4  words  twice, once for  training and again for  testing.  The words  were \nspoken  in  isolation,  and  the  recordings  averaged  1.1  seconds  in  length.  The  signal-to(cid:173)\nnoise  ratio  of the  data  set has  been  estimated  to  be about  15  decibels,  as  compared  to \n\n1 Now at  NEC Research Institute, 4 Independence Way, Princeton, NJ 08540. \n\n\fDimensionality Reduction and Prior Knowledge in E-Set Recognition \n\n179 \n\n50 decibels  for  typical  lip-mike recordings  (Brown,  1987).  The key  feature  of the  data \nset from  our point of view  is  that each  utterance contains  a  tiny information-laden event \nthe  release  of the  consonant - which  can  easily  be  overpowered  by  meaningless \n-\nvariation  in  the  strong \"E\" vowel and by background noise. \n\nOur first step in processing these recordings was to convert them  into spectrograms using \na standard DFI' program.  The spectrograms  encoded the energy in  128  frequency  bands \n(ranging  up  to  8  kHz)  at  3  msec  intervals,  and  so  they  contained an  average  of about \n45,000 energy values.  Thus, a naive back-propagation network which devoted a separate \nweight  to  each  of these  input  components  would  contain  far  too  many  weights  to  be \nproperly constrained by the task's 400 training patterns. \n\nAs described in the next section, we drastically reduced the dimensionality of our training \npatterns  by  decreasing  their  resolution  in  both  frequency  and  time  and also  by  using  a \nsegmentation algorithm to extract the most relevant portion of each pattern.  However, our \nnetwork  still contained too many  weights,  and  many of them  were devoted to detecting \nspurious features.  
2 Reducing the Dimensionality of the Input Patterns \n\nBecause it would have been futile to feed our gigantic raw spectrograms into a back-propagation network, we first decreased the time resolution of our input format by a factor of 4 and the frequency resolution of the format by a factor of 8. While our compression along the time axis preserved the linearity of the scale, we combined different numbers of raw frequencies into the various frequency bands to create a mel scale, which is linear up to 2 kHz and logarithmic above that, and thus provides more resolution in the more informative lower frequency bands. \n\nNext, a segmentation heuristic was used to locate the consonant in each training pattern so that the rest of the pattern could be discarded. On average, all but 1/7 of each recording was thrown away, but we would have liked to have discarded more. The useful information in a word from the E-set is concentrated in a roughly 50 msec region around the consonant release in the word, but current segmentation algorithms aren't good enough to accurately position a 50 msec window on that region. To prevent the loss of potentially useful information, we extracted a 150 msec window from around each consonant release. This safeguard meant that our networks contained about 3 times as many weights as would be required with an ideal segmentation. \n\nWe were also concerned that segmentation errors during recognition could lower our final system's performance, so we adopted a simple segmentation-free testing method in which the trained network is scanned over the full-length version of each testing utterance. Figures 3(a) and 3(b) show the activation traces generated by two different networks when scanned over four sample utterances. To the right of each of the capital letters which identifies a particular sample word is a set of 4 wiggly lines that should be viewed as the output of a 4-channel chart recorder which is connected to the network's four output units. Our recognition rule for unsegmented utterances states that the output unit which generates the largest activation spike (and hence the highest peak in the chart recorder's traces) on a given utterance determines the network's classification of that utterance.2 \n\n2 One can't reasonably expect a network that has been trained on pre-segmented patterns to function well when tested in this way, but our best network (a 3-layer TDNN) actually does perform better in this mode than when trained and tested on segments selected by a Viterbi alignment with an IBM hidden Markov model. Moreover, because the Viterbi alignment procedure is told the identity of the words in advance, it is probably more accurate than any method that could be used in a real recognition system. \n\n
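(A minimal sketch of this recognition rule, assuming the trained network is available as a callable; the names net and window and the 12-frame default are assumptions, but the peak-then-argmax logic is the rule stated above.) \n\nimport numpy as np \n \ndef scan_classify(net, utterance, window=12): \n    # utterance: (n_frames, n_bands) full-length input; net maps a \n    # flattened window to 4 output activations (one per word) \n    peaks = np.full(4, -np.inf) \n    for t in range(utterance.shape[0] - window + 1): \n        y = net(utterance[t:t + window].reshape(-1)) \n        peaks = np.maximum(peaks, y)  # highest spike seen so far per unit \n    return int(np.argmax(peaks))  # index of the winning word: B, D, E, or V \n\n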
[Figure 1 shows the learned weights of each network's output units as grey-scale patterns over time and frequency, with the frequency axis marked at 1 kHz, 2 kHz, and 8 kHz.] \n\nFigure 1: Output Unit Weights from Four Different 2-layer BDEV Networks: (a) baseline, (b) smoothed, (c) decayed, (d) TDNN \n\nTo establish a performance baseline for the experiments that will be described in the next two sections, we trained the simple 2-layer network of figure 2(a) until it had learned to correctly identify 94 percent of our training segments.3 \n\n3 This rather arbitrary halting rule for the learning procedure was uniformly employed during the experiments of sections 2, 3 and 4. \n\nThis network contains 4 output units (one for each word) but no hidden units.4 The weights that this network used to recognize the words B and D are shown in figure 1(a). While these weight patterns are quite noisy, people who know how to read spectrograms can see sensible feature detectors amidst the clutter. For example, both of the units appear to be stimulated by an energy burst near the 9th time frame. However, the units expect to see this energy at different frequencies because the tongue position is different in the consonants that the two units represent. \n\n4 Experiments performed with multi-layer networks support the same general conclusions as the results reported here. \n\nUnfortunately, our baseline network's weights also contain many details that don't make any sense to speech recognition experts. These spurious features are artifacts of our small, noisy training set, and are partially to blame for the very poor performance of the network; it achieved only 37 percent recognition accuracy when scanned across the unsegmented testing utterances. \n\n
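(For concreteness, the baseline model can be sketched as follows. This is a reconstruction under assumptions: the 16 x 12 input size is inferred from figure 2(a), and the sigmoid units, squared-error gradient, and learning rate are illustrative choices that the text does not specify.) \n\nimport numpy as np \n \nrng = np.random.default_rng(0) \nW = rng.normal(0.0, 0.01, size=(4, 16 * 12))  # one weight per input component \nb = np.zeros(4) \n \ndef forward(x):  # x: flattened (192,) input segment \n    return 1.0 / (1.0 + np.exp(-(W @ x + b)))  # 4 sigmoid outputs, no hidden units \n \ndef train_step(x, target, lr=0.1):  # target: one-hot vector over B, D, E, V \n    global W, b \n    y = forward(x) \n    delta = (y - target) * y * (1.0 - y)  # squared-error gradient at the sigmoids \n    W -= lr * np.outer(delta, x) \n    b -= lr * delta \n    return y \n\n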
3 Limiting the Complexity of a Network using a Cost Function \n\nOur baseline network performed poorly because it had lots of free parameters with which it could model spurious features of the training set. However, we had already taken our brute force techniques for input dimensionality reduction (pre-segmenting the utterances and reducing the resolution of the input format) about as far as possible while still retaining most of the useful information in the patterns. Therefore it was necessary to resort to a more subtle form of dimensionality reduction in which the back-propagation learning algorithm is allowed to create complicated weight patterns only to the extent that they actually reduce the network's error. \n\nThis constraint is implemented by including a cost term for the network's complexity in its objective function. The particular cost function that should be used is induced by a particular definition of what constitutes a complicated weight pattern, and this definition should be chosen with care. For example, the rash of tiny details in figure 1(a) originally led us to penalize weights that were different from their neighbors, thus encouraging the network to develop smooth, low-resolution weight patterns whenever possible: \n\nC = \\frac{1}{2} \\sum_i \\frac{1}{\\|N_i\\|} \\sum_{j \\in N_i} (w_i - w_j)^2     (1) \n\nTo compute the total tax on non-smoothness, each weight w_i was compared to all of its neighbors (which are indexed by the set N_i). When a weight differed from a neighbor, a penalty was assessed that was proportional to the square of their difference. The term 1/\\|N_i\\| normalized for the fact that units at the edge of a receptive field have fewer neighbors than units in the middle. \n\nWhen a cost function is used, a tradeoff factor \\lambda is typically used to control the relative importance of the error and cost components of the overall objective function O = E + \\lambda C. The gradient of the overall objective function is then \\nabla O = \\nabla E + \\lambda \\nabla C. To compute \\nabla C, we needed the derivative of our cost function with respect to each weight w_i. This derivative is just the difference between the weight and the average of its neighbors, \\partial C / \\partial w_i = w_i - \\frac{1}{\\|N_i\\|} \\sum_{j \\in N_i} w_j, so minimizing the combined objective function was equivalent to minimizing the network's error while simultaneously smoothing the weight patterns by decaying each weight towards the average of its neighbors. \n\nFigure 1(b) shows the B and D weight patterns of a 2-layer network that was trained under the influence of this cost function. As we had hoped, sharp transitions between neighboring weights occurred primarily in the maximally informative consonant release of each word, while the spurious details that had plagued our baseline network were smoothed out of existence. However, this network was even worse at the task of generalizing to unsegmented test cases than the baseline network, getting only 35 percent of them correct. \n\n
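(Equation 1 and the decay-towards-neighbors derivative can be sketched as follows; treating an output unit's weights as a 2-D frequency-by-time grid with 4-connected neighbors is an assumption about what the neighbor set N_i means for a spectrogram-shaped receptive field.) \n\nimport numpy as np \n \ndef smoothness_cost_and_grad(W): \n    # W: (n_freq, n_time) weight pattern of one output unit \n    nf, nt = W.shape \n    cost, grad = 0.0, np.zeros_like(W) \n    for i in range(nf): \n        for j in range(nt): \n            nbrs = np.array([W[a, b] \n                             for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1)) \n                             if 0 <= a < nf and 0 <= b < nt]) \n            cost += 0.5 * np.mean((W[i, j] - nbrs) ** 2)  # equation 1 \n            grad[i, j] = W[i, j] - nbrs.mean()  # the derivative given in the text \n    return cost, grad \n \n# combined objective, as above: grad_O = grad_E + lam * grad \n\n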
While equation 1 might be a good cost function for some other task, it doesn't capture our prior knowledge that the discrimination cues in E-set recognition are highly localized in time. This cost function tells the network to treat unimportant neighboring input components similarly, but we really want to tell the network to ignore these components altogether. Therefore, a better cost function for this task is the one associated with standard weight decay: \n\nC = \\frac{1}{2} \\sum_i w_i^2     (2) \n\nEquation 2 causes weights to remain close to zero unless they are particularly valuable for reducing the network's error on the training set. Unfortunately, the weights that our network learns under the influence of this function merely look like smaller versions of the baseline weights of figure 1(a) and perform just as poorly. No matter what value is used for \\lambda, there is very little size differentiation between the weights that we know to be valuable for this task and the weights that we know to be spurious. Weight decay fails because our training set is so small that spurious weights do not appear to be as irrelevant as they really are for performing the task in general. Fortunately, there is a modified form of weight decay (Scalettar and Zee, 1988) that expresses the idea that the disparity between relevant and irrelevant weights is greater than can be deduced from the training set: \n\nC = \\frac{1}{2} \\sum_i \\frac{w_i^2}{0.5 + w_i^2}     (3) \n\nThe weights of figure 1(c) were learned under the influence of equation 3.5 In these patterns, the feature detectors that make sense to speech recognition experts stand out clearly above a highly suppressed field of less important weights. This network generalizes to 48 percent of the unsegmented test cases, while our earlier networks had managed only 37 percent accuracy. \n\n5 We trained with \\lambda = 100 here as opposed to the setting of \\lambda = 10 that worked best with standard weight decay. \n\n4 A Time-Delay Neural Network \n\nThe preceding experiments with cost functions show that controlling attention (rather than resolution) is the key to good performance on the BDEV task. The only way to accurately classify the utterances in this task is to focus on the tiny discrimination cues in the spectrograms while ignoring the remaining material in the patterns. \n\nBecause we know that the BDEV discrimination cues are highly localized in time, it would make sense to build a network whose architecture reflected that knowledge. One such network (see figure 2(b)) contains many copies of each output unit. These copies apply identical weight patterns to the input in all possible positions. The activation values from all of the copies of a given output unit are summed to generate the overall output value for that unit.6 \n\n6 We actually designed this network before performing our experiments with cost functions, and were originally attracted by its translation invariance rather than by the advantages mentioned here (Lang, 1987). \n\n[Figure 2 diagrams the two architectures: (a) a conventional 2-layer network whose input array is fully connected to the 4 output units, and (b) a time-delay network in which 8 copies of each output unit are applied across the 12-frame input.] \n\nFigure 2: Conventional and Time-Delay 2-layer Networks \n\nNow, assuming that the learning algorithm can construct weight patterns which recognize the characteristic features of each word while rejecting the rest of the material in the words, then when an instance of a particular word is shown to the network, the only unit that will be activated is the output unit copy for that word which happens to be aligned with the recognition cues in the pattern. Then, the summation step at the output stage of the network serves as an OR gate which transmits that activation to the outside world. \n\n
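(The forward pass of such a network can be sketched as follows; the 5-frame kernel width is inferred from figure 2(b), where 8 copies span a 12-frame input, and the sigmoid copies are an illustrative choice that the text does not pin down.) \n\nimport numpy as np \n \ndef tdnn_outputs(x, kernels): \n    # x: (n_bands, n_frames) input; kernels: (4, n_bands, width) weights \n    # shared by all copies of each word's output unit \n    n_words, n_bands, width = kernels.shape \n    n_pos = x.shape[1] - width + 1  # e.g. 12 - 5 + 1 = 8 copies, as in figure 2(b) \n    out = np.zeros(n_words) \n    for k in range(n_words):  # one replicated unit per word (B, D, E, V) \n        for t in range(n_pos):  # identical weights applied at every position \n            a = float(np.sum(kernels[k] * x[:, t:t + width])) \n            out[k] += 1.0 / (1.0 + np.exp(-a))  # summing the copies acts as the OR gate \n    return out \n\n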
This network architecture, which has been named the \"Time-Delay Neural Network\" or \"TDNN\", has several useful properties for E-set recognition, all of which are consequences of the fact that the network essentially performs its own segmentation by recognizing the most relevant portion of each input and rejecting the rest. One benefit is that sharp weight patterns can be learned even when the training patterns have been sloppily segmented. For example, in the TDNN weight patterns of figure 1(d), the release-burst detectors are localized in a single time frame, while in the earlier weight patterns from conventional networks they were smeared over several time frames. \n\nAlso, the network learns to actively discriminate between the relevant and irrelevant portions of its training segments, rather than trying to ignore the latter by using small weights. This turns out to be a big advantage when the network is later scanned across unsegmented utterances, as evidenced by the vastly different appearances of the output activity traces in figures 3(a) and 3(b).7 \n\n7 While the main text of this paper compares the performance of a sequence of 2-layer networks, the plots of figure 3 show the output traces of 3-layer versions of the networks. The correct plots could not be conveniently generated because our CMU Common Lisp program for creating them has died of bit rot. \n\n[Figure 3 plots, for four sample utterances (B, D, E, V), the activation traces of the four output units (labeled b, d, e, v) over a 0-500 msec time axis, for the conventional network in panel (a) and the time-delay network in panel (b).] \n\nFigure 3: Output Unit Activation Traces of a Conventional Network and a Time-Delay Network, on Four Sample Utterances \n\nFinally, because the TDNN can locate and attend to the most relevant portion of its input, we are able to make its receptive fields very narrow, thus reducing the number of free parameters in the network and making it highly trainable with the small number of training cases that are available in this task. In fact, the scanning mode generalization rate of our 2-layer TDNN is 65 percent, which is nearly twice the accuracy of our baseline 2-layer network. \n\n5 Comparison with other systems \n\nThe 2-layer networks described up to this point were trained and tested under identical conditions so that their performances could be meaningfully compared. No attempt was made to achieve really high performance in these experiments. On the other hand, when we trained a 3-layer TDNN using the slightly fancier methodology described in (Lang, Hinton, and Waibel, 1990),8 we obtained a system that generalized to about 91 percent of the unsegmented test cases. 
By comparison, the standard, large-vocabulary IBM hidden Markov model accounts for 80 percent of the test cases, and the accuracy of human listeners has been measured at 94 percent. In fact, the TDNN is probably the best automatic recognition system built for this task to date; it even performs slightly better than the continuous acoustic parameter, maximum mutual information hidden Markov model proposed in (Brown, 1987). \n\n8 Wider but less precisely aligned training segments were employed, as well as randomly selected \"counterexample\" segments that further improved the network's already good \"E\" and background noise rejection. Also, a preliminary cross-validation run was performed to locate a nearly optimal stopping point for the learning procedure. When trained using this improved methodology, a conventional 3-layer network achieved a generalization score in the mid 50's. \n\n6 Conclusion \n\nThe performance of a neural network can be improved by building a priori knowledge into the network's architecture and objective function. In this paper, we have exhibited two successful examples of this technique in the context of a speech recognition task where the crucial information for making an output decision is highly localized and where the number of training cases is limited. Tony Zee's modified version of weight decay and our time-delay architecture both yielded networks that focused their attention on the short-duration discrimination cues in the utterances. Conversely, our attempts to use weight smoothing and standard weight decay during training got us nowhere because these cost functions didn't accurately express our knowledge about the task. \n\nAcknowledgements \n\nThis work was supported by Office of Naval Research contract N00014-86-K-0167, and by a grant from the Ontario Information Technology Research Center. Geoffrey Hinton is a fellow of the Canadian Institute for Advanced Research. \n\nReferences \n\nP. Brown. (1987) The Acoustic-Modeling Problem in Automatic Speech Recognition. Doctoral Dissertation, Carnegie Mellon University. \n\nK. Lang. (1987) Connectionist Speech Recognition. PhD Thesis Proposal, Carnegie Mellon University. \n\nK. Lang, G. Hinton, and A. Waibel. (1990) A Time-Delay Neural Network Architecture for Isolated Word Recognition. Neural Networks 3(1). \n\nR. Scalettar and A. Zee. (1988) In D. Waltz and J. Feldman (eds.), Connectionist Models and their Implications, p. 309. Publisher: Ablex. \n", "award": [], "sourceid": 234, "authors": [{"given_name": "Kevin", "family_name": "Lang", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}