{"title": "Exploratory Feature Extraction in Speech Signals", "book": "Advances in Neural Information Processing Systems", "page_first": 241, "page_last": 247, "abstract": null, "full_text": "Exploratory Feature Extraction in Speech Signals \n\nNathan Intrator \n\nCenter for Neural Science \n\nBrown University \n\nProvidence, RI 02912 \n\nAbstract \n\nA novel unsupervised neural network for dimensionality reduction which \nseeks directions emphasizing multimodality is presented, and its connection \nto exploratory projection pursuit methods is discussed. This leads to \na new statistical insight into the synaptic modification equations governing \nlearning in Bienenstock, Cooper, and Munro (BCM) neurons (1982). \nThe importance of a dimensionality reduction principle based solely on \ndistinguishing features is demonstrated using a linguistically motivated \nphoneme recognition experiment, and compared with feature extraction \nusing a back-propagation network. \n\n1 Introduction \n\nDue to the curse of dimensionality (Bellman, 1961) it is desirable to extract features \nfrom a high dimensional data space before attempting a classification. How to \nperform this feature extraction/dimensionality reduction is not obvious. A first \nsimplification is to consider only features defined by linear (or semi-linear) projections \nof high dimensional data. This class of features is used in projection pursuit \nmethods (see review in Huber, 1985). \n\nEven after this simplification, it is still difficult to characterize what interesting \nprojections are, although it is easy to point at projections that are uninteresting. 
\nA statement that has recently been made precise by Diaconis and Freedman (1984) \nsays that for most high-dimensional clouds, most low-dimensional projections are \napproximately normal. This finding suggests that the important information in the \ndata is conveyed in those directions whose single dimensional projected distribution \nis far from Gaussian, especially at the center of the distribution. Friedman (1987) \nargues that the most computationally attractive measures for deviation from normality \n(projection indices) are based on polynomial moments. However, they very \nheavily emphasize departure from normality in the tails of the distribution (Huber, \n1985). Second order polynomials (measuring the variance, i.e. principal components) \nare not sufficient for characterizing the important features of a distribution (see \nexample in Duda & Hart (1973) p. 212), therefore higher order polynomials are \nneeded. We shall be using the observation that high dimensional clusters translate \nto multimodal low dimensional projections, and if we are after such structures, \nmeasuring multimodality defines an interesting projection. In some special cases, \nwhere the data is known in advance to be bimodal, it is relatively straightforward \nto define a good projection index (Hinton & Nowlan, 1990). When the structure \nis not known in advance, defining a general multimodal measure of the projected \ndata is not straightforward, and will be discussed in this paper. \n\nThere are cases in which it is desirable to make the projection index invariant \nunder certain transformations, and maybe even remove second order structure (see \nHuber, 1985, for desirable invariant properties of projection indices). 
In such cases \nit is possible to make such transformations beforehand (Friedman, 1987), and then \nassume that the data possesses these invariant properties already. \n\n2 Feature Extraction using ANN \n\nIn this section, the intuitive idea presented above is used to form a statistically \nplausible objective function whose minimizers are those projections having a \nsingle dimensional projected distribution that is far from Gaussian. This is done \nusing a loss function whose expected value leads to the desired projection index. \nMathematical details are given in Intrator (1990). \n\nBefore presenting this loss function, let us review some necessary notation and assumptions. \nConsider a neuron with input vector $x = (x_1, \\ldots, x_N)$, synaptic weight \nvector $m = (m_1, \\ldots, m_N)$, both in $R^N$, and activity (in the linear region) $c = x \\cdot m$. \nDefine the threshold $\\Theta_m = E[(x \\cdot m)^2]$, and the functions $\\hat{\\phi}(c, \\Theta_m) = c^2 - \\frac{2}{3} c \\Theta_m$, \n$\\phi(c, \\Theta_m) = c^2 - \\frac{4}{3} c \\Theta_m$. The $\\phi$ function has been suggested as a biologically plausible \nsynaptic modification function that explains visual cortical plasticity (Bienenstock, \nCooper and Munro, 1982). Note that at this point $c$ represents the linear projection \nof $x$ onto $m$, and we seek an optimal projection in some sense. \n\nWe want to base our projection index on polynomial moments of low order, and \nto use the fact that a bimodal distribution is already interesting, and any additional \nmode should make the distribution even more interesting. With this in mind, consider \nthe following family of loss functions which depend on the synaptic weight \nvector and on the input $x$: \n\n$$L_m(x) = -\\mu \\int_0^{(x \\cdot m)} \\hat{\\phi}(s, \\Theta_m)\\, ds = -\\frac{\\mu}{3} \\left\\{ (x \\cdot m)^3 - E[(x \\cdot m)^2]\\, (x \\cdot m)^2 \\right\\}.$$ \n\nThe motivation for this loss function can be seen in the following graph, which \nrepresents the $\\phi$ function and the associated loss function $L_m(x)$. \n
For simplicity \nthe loss for a fixed threshold $\\Theta_m$ and synaptic vector $m$ can be written as $L_m(c) = -\\frac{\\mu}{3} c^2 (c - \\Theta_m)$, \nwhere $c = (x \\cdot m)$. \n\nFigure 1: The function $\\phi$ and the loss functions for a fixed $m$ and $\\Theta_m$. \n\nThe graph of the loss function shows that for any fixed $m$ and $\\Theta_m$, the loss is \nsmall for a given input $x$, when either $(x \\cdot m)$ is close to zero, or when $(x \\cdot m)$ is \nlarger than $\\frac{4}{3}\\Theta_m$. Moreover, the loss function remains negative for $(x \\cdot m) > \\frac{4}{3}\\Theta_m$, \ntherefore, any kind of distribution at the right hand side of $\\frac{4}{3}\\Theta_m$ is possible, and \nthe preferred ones are those which are concentrated further away from $\\frac{4}{3}\\Theta_m$. \n\nWe must still show why a minimizer of the average loss cannot concentrate \nall the mass of the distribution in one of the regions. \nRoughly speaking, this cannot happen because the threshold $\\Theta_m$ is dynamic and \ndepends on the projections in a nonlinear way, namely, $\\Theta_m = E[(x \\cdot m)^2]$. This \nimplies that $\\Theta_m$ will always move itself to a stable point such that the distribution \nwill not be concentrated on only one of its sides. It follows that the part of the \ndistribution for $c < \\frac{4}{3}\\Theta_m$ has a high loss, making those distributions in which the \ndistribution for $c < \\frac{4}{3}\\Theta_m$ has its mode at zero more plausible. \n\nThe risk (expected value of the loss) is given by: \n\n$$R_m = -\\frac{\\mu}{3} \\left\\{ E[(x \\cdot m)^3] - E^2[(x \\cdot m)^2] \\right\\}.$$ \n\nSince the risk is continuously differentiable, its minimization can be achieved via a \ngradient descent method with respect to $m$, namely: \n\n$$\\frac{dm_i}{dt} = -\\frac{\\partial}{\\partial m_i} R_m = \\mu\\, E[\\phi(x \\cdot m, \\Theta_m)\\, x_i].$$ \n\nThe resulting differential equations suggest a modified version of the law governing \nsynaptic weight modification in the BCM theory for learning and memory (Bienenstock, \nCooper and Munro, 1982). This theory was presented to account for various \nexperimental results in visual cortical plasticity. The biological relevance of the \ntheory has been extensively studied (Saul et al., 1986; Bear et al., 1987; Cooper et \nal., 1987; Bear et al., 1988), and it was shown that the theory is in agreement with \nthe classical deprivation experiments (Clothiaux et al., 1990). \n\nThe fact that the distribution has part of its mass on both sides of $\\frac{4}{3}\\Theta_m$ makes this \nloss a plausible projection index that seeks multimodalities. However, we still need \nto reduce the sensitivity of the projection index to outliers, and for full generality, \nallow any projected distribution to be shifted so that the part of the distribution \nthat satisfies $c < \\frac{4}{3}\\Theta_m$ will have its mode at zero. The over-sensitivity to outliers \nis addressed by considering a nonlinear neuron in which the neuron's activity is \ndefined to be $c = q(x \\cdot m)$, where $q$ usually represents a smooth sigmoidal function. \nA more general definition, which allows symmetry breaking of the projected \ndistributions, solves the second problem raised above, and is still \nconsistent with the statistical formulation, is $c = q(x \\cdot m - a)$, for an arbitrary \nthreshold $a$ which can be found by using gradient descent as well. For the nonlinear \nneuron, $\\Theta_m$ is defined to be $\\Theta_m = E[q^2(x \\cdot m)]$. \n\nBased on this formulation, a network of $Q$ identical nodes may be constructed. All \nthe neurons in this network receive the same input and inhibit each other, so as \nto extract several features in parallel. \n
A similar network has been studied in the \ncontext of mean field theory by Scofield and Cooper (1985). The activity of neuron \n$k$ in the network is defined as $c_k = q(x \\cdot m_k - a_k)$, where $m_k$ is the synaptic weight \nvector of neuron $k$, and $a_k$ is its threshold. The inhibited activity and threshold of \nthe $k$'th neuron are given by $\\tilde{c}_k = c_k - \\eta \\sum_{j \\neq k} c_j$, $\\tilde{\\Theta}_m^k = E[\\tilde{c}_k^2]$. \n\nWe omit the derivation of the synaptic modification equations, which is similar to \nthe one for a single neuron, and present only the resulting modification equations \nfor a synaptic vector $m_k$ in a lateral inhibition network of nonlinear neurons: \n\n$$\\dot{m}_k = \\mu\\, E\\left\\{ \\phi(\\tilde{c}_k, \\tilde{\\Theta}_m^k) \\left( q'(\\tilde{c}_k) - \\eta \\sum_{j \\neq k} q'(\\tilde{c}_j) \\right) x \\right\\}.$$ \n\nThe lateral inhibition network performs a direct search of Q-dimensional projections \ntogether, and therefore may find a richer structure that a stepwise approach may \nmiss (see, e.g., example 14.1 in Huber, 1985). \n\n3 Comparison with other feature extraction methods \n\nWhen dealing with a classification problem, the interesting features are those that \ndistinguish between classes. The network presented above has been shown to seek \nmultimodality in the projected distributions, which translates to clusters in the \noriginal space, and therefore to find those directions that make a distinction between \ndifferent sets in the training data. \n\nIn this section we compare classification performance of a network that performs \ndimensionality reduction (before the classification) based upon multimodality, and \na network that performs dimensionality reduction based upon minimization of misclassification \nerror (using back-propagation with MSE criterion). This is done using \na phoneme classification experiment whose linguistic motivation is described below. 
\nIn the latter we regard the hidden units representation as a new reduced feature \nrepresentation of the input space. Classification on the new feature space was done \nusing back-propagation [1]. \n\n[1] See Intrator (1990) for comparison with principal components feature extraction and \nwith k-NN as a classifier. \n\nConsider the six stop consonants [p,k,t,b,g,d], which have been a subject of recent \nresearch in evaluating neural networks for phoneme recognition (see review in Lippmann, \n1989). According to phonetic feature theory, these stops possess several common \nfeatures, but only two distinguishing phonetic features, place of articulation \nand voicing (see Blumstein & Lieberman 1984, for a review and related references \non phonetic feature theory). This theory suggests an experiment in which features \nextracted from unvoiced stops can be used to distinguish place of articulation in \nvoiced stops as well. It is of interest whether these features can be found from a single \nspeaker, how sensitive they are to voicing, and whether they are speaker invariant. \n\nThe speech data consists of 20 consecutive time windows of 32 msec with 30 msec \noverlap, aligned to the beginning of the burst. In each time window, a set of 22 \nenergy levels is computed. These energy levels correspond to Zwicker critical band \nfilters (Zwicker, 1961). The consonant-vowel (CV) pairs were pronounced in isolation \nby native American speakers (two male, BSS and LTN, and one female, JES). \nAdditional details on biological motivation for the preprocessing, and linguistic motivation \nrelated to child language acquisition can be found in Seebach (1990), and \nSeebach and Intrator (1991). \n
An average (over 25 tokens) of the six stop consonants \nfollowed by the vowel [a] is presented in Figure 2. All the images are smoothed \nusing a moving average. One can see some similarities between the voiced and \nunvoiced stops, especially in the upper left corner of the image (high frequencies, \nbeginning of the burst), and the radical difference between them in the low frequencies. \n\nFigure 2: An average of the six stop consonants followed by the vowel [a]. \nTheir order from left to right is [pa] [ba] [ka] [ga] [ta] [da]. Time increases \nfrom the burst release on the X axis, and frequency increases on the Y axis. \n\nIn the experiments reported here, 5 features were extracted from the 440-dimensional \noriginal space (20 time windows of 22 energy levels each). Although the dimensionality reduction methods were trained \nonly with the unvoiced tokens of a single speaker, the classifier was trained on (5 \ndimensional) voiced and unvoiced data from the other speakers as well. \n\nThe classification results, which are summarized in Table 1, show that the back-propagation \nnetwork does well in finding structure useful for classification of the \ntraining data, but this structure is more sensitive to voicing. Classification results \nusing a BCM network suggest that, for this specific task, structure that is less \nsensitive to voicing can be extracted, even though voicing has significant effects \non the speech signal itself. The results also suggest that these features are more \nspeaker invariant. 
\n\nPlace of Articulation Classification (B-P) \n\nSpeaker, stops    B-P     BCM \nBSS /p,k,t/       100     100 \nBSS /b,g,d/       83.4    94.7 \nLTN /p,k,t/       95.6    97.7 \nLTN /b,g,d/       78.3    93.2 \nJES (Both)        88.0    99.4 \n\nTable 1: Percentage of correct classification of place of articulation in voiced \nand unvoiced stops. \n\nFigure 3: Synaptic weight images of the 5 hidden units of back-propagation \n(top), and of the 5 BCM neurons (bottom). \n\nThe difference in performance between the two feature extractors may be partially \nexplained by looking at the synaptic weight vectors (images) extracted by both \nmethods. For the back-propagation feature extraction it can be seen that although \n5 units were used, fewer features were extracted. One of the main \ndistinctions between the unvoiced stops in the training set is the high frequency burst \nat the beginning of the consonant (the upper left corner). The back-propagation \nmethod concentrated mainly on this feature, probably because it is sufficient to base \nthe recognition of the training set on this feature, and because training stops \nwhen misclassification error falls to zero. On the other hand, the BCM method does \nnot try to reduce the misclassification error and is able to find a richer, linguistically \nmeaningful structure, containing burst locations and formant tracking of the three \ndifferent stops, that allowed a better generalization to other speakers and to voiced \nstops. \n\nThe network and its training paradigm present a different approach to speaker \nindependent speech recognition. \n
In this approach the speaker variability problem \nis addressed by training a network that concentrates mainly on the distinguishing \nfeatures of a single speaker, as opposed to training a network that concentrates on \nboth the distinguishing and common features, on multi-speaker data. \n\nAcknowledgements \n\nI wish to thank Leon N Cooper for suggesting the problem and for providing many \nhelpful hints and insights. Geoff Hinton made invaluable comments. The application \nof BCM to speech is discussed in more detail in Seebach (1990) and in a \nforthcoming article (Seebach and Intrator, 1991). Research was supported by the \nNational Science Foundation, the Army Research Office, and the Office of Naval \nResearch. \n\nReferences \n\nBellman, R. E. (1961) Adaptive Control Processes, Princeton, NJ, Princeton University \nPress. \n\nBienenstock, E. L., L. N Cooper, and P. W. Munro (1982) Theory for the development \nof neuron selectivity: orientation specificity and binocular interaction in \nvisual cortex. J. Neurosci. 2:32-48 \n\nBear, M. F., L. N Cooper, and F. F. Ebner (1987) A Physiological Basis for a \nTheory of Synapse Modification. Science 237:42-48 \n\nDiaconis, P., and D. Freedman (1984) Asymptotics of Graphical Projection Pursuit. \nThe Annals of Statistics, 12:793-815. \n\nFriedman, J. H. (1987) Exploratory Projection Pursuit. Journal of the American \nStatistical Association 82-397:249-266 \n\nHinton, G. E. and S. J. Nowlan (1990) The bootstrap Widrow-Hoff rule as a cluster-formation \nalgorithm. Neural Computation. \n\nHuber, P. J. (1985) Projection Pursuit. The Annals of Statistics 13:435-475 \n\nIntrator, N. (1990) A Neural Network For Feature Extraction. In D. S. \n
Touretzky (ed.), Advances in Neural Information Processing Systems 2. San Mateo, CA: \nMorgan Kaufmann. \n\nLippmann, R. P. (1989) Review of Neural Networks for Speech Recognition. Neural \nComputation 1, 1-38. \n\nReilly, D. L., C. L. Scofield, L. N Cooper and C. Elbaum (1988) GENSEP: a multiple \nneural network with modifiable network topology. INNS Conference on Neural \nNetworks. \n\nSaul, A. and E. E. Clothiaux (1986) Modeling and Simulation II: Simulation of \na Model for Development of Visual Cortical specificity. J. of Electrophysiological \nTechniques, 13:279-306 \n\nScofield, C. L. and L. N Cooper (1985) Development and properties of neural networks. \nContemp. Phys. 26:125-145 \n\nSeebach, B. S. (1990) Evidence for the Development of Phonetic Property Detectors \nin a Neural Net without Innate Knowledge of Linguistic Structure. Ph.D. \nDissertation, Brown University. \n\nDuda, R. O. and P. E. Hart (1973) Pattern classification and scene analysis. John \nWiley, New York \n\nZwicker, E. (1961) Subdivision of the audible frequency range into critical bands \n(Frequenzgruppen). Journal of the Acoustical Society of America 33:248 \n\n", "award": [], "sourceid": 320, "authors": [{"given_name": "Nathan", "family_name": "Intrator", "institution": null}]}