{"title": "Modeling Acoustic Correlations by Factor Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 749, "page_last": 755, "abstract": null, "full_text": "Modeling acoustic correlations  by \n\nfactor  analysis \n\nLawrence Saul and Mazin Rahim \n{lsaul.mazin}~research.att.com \n\nAT&T  Labs - Research \n\n180  Park  Ave,  D-130 \n\nFlorham Park,  NJ  07932 \n\nAbstract \n\nHidden  Markov  models (HMMs)  for  automatic speech  recognition \nrely  on  high  dimensional feature  vectors  to  summarize the  short(cid:173)\ntime properties  of speech.  Correlations between  features  can  arise \nwhen the speech signal is non-stationary or corrupted by  noise.  We \ninvestigate  how  to  model  these  correlations  using  factor  analysis, \na  statistical method for  dimensionality reduction .  Factor  analysis \nuses  a  small number of parameters to model  the  covariance struc(cid:173)\nture  of  high  dimensional  data.  These  parameters  are  estimated \nby  an  Expectation-Maximization (EM)  algorithm that can  be  em(cid:173)\nbedded  in  the  training  procedures  for  HMMs.  We  evaluate  the \ncombined  use  of  mixture  densities  and  factor  analysis  in  HMMs \nthat  recognize  alphanumeric strings.  Holding the  total  number of \nparameters fixed,  we  find  that  these  methods, properly  combined, \nyield  better models than either  method on its own. \n\n1 \n\nIntroduction \n\nHidden  Markov  models  (HMMs)  for  automatic speech  recognition[l]  rely  on  high \ndimensional  feature  vectors  to  summarize  the  short-time,  acoustic  properties  of \nspeech.  Though  front-ends  vary  from  recognizer  to  recognizer,  the  spectral  infor(cid:173)\nmation in  each  frame of speech  is  typically  codified  in a  feature  vector  with  thirty \nor  more  dimensions.  In  most  systems,  these  vectors  are  conditionally modeled  by \nmixtures of Gaussian probability density functions  (PDFs).  In this case,  the corre(cid:173)\nlations between  different  features  are  represented  in  two  ways[2]:  implicitly by  the \nuse of two or more mixture components, and explicitly by the non-diagonal elements \nin  each  covariance  matrix.  Naturally,  these  strategies for  modeling correlations(cid:173)\nimplicit  versus  explicit-involve  tradeoffs  in  accuracy,  speed,  and  memory.  This \npaper examines these  tradeoffs  using the statistical method of factor  analysis. \n\n\f750 \n\nL  Saul and M  Rahim \n\nThe present work is motivated by the following observation.  Currently, most HMM(cid:173)\nbased  recognizers  do  not  include  any  explicit  modeling  of correlations;  that  is  to \nsay-conditioned on the hidden states, acoustic features are modeled by mixtures of \nGaussian PDFs with  diagonal covariance matrices.  The reasons for this practice are \nwell known.  The use offull covariance matrices imposes a heavy computational bur(cid:173)\nden,  making it  difficult  to  achieve  real-time recognition.  Moreover,  one  rarely  has \nenough data to (reliably) estimate full covariance matrices.  Some of these disadvan(cid:173)\ntages can  be overcome by parameter-tying[3]-e.g., sharing the covariance matrices \nacross  different  states  or  models.  But  parameter-tying has  its  own  drawbacks:  it \nconsiderably  complicates  the  training  procedure,  and  it  requires  some  artistry  to \nknow  which  states should  and  should not be  tied. 
Unconstrained and diagonal covariance matrices clearly represent two extreme choices for the hidden Markov modeling of speech. The statistical method of factor analysis[4,5] represents a compromise between these two extremes. The idea behind factor analysis is to map systematic variations of the data into a lower dimensional subspace. This enables one to represent, in a very compact way, the covariance matrices for high dimensional data. These matrices are expressed in terms of a small number of parameters that model the most significant correlations without incurring much overhead in time or memory. Maximum likelihood estimates of these parameters are obtained by an Expectation-Maximization (EM) algorithm that can be embedded in the training procedures for HMMs.

In this paper we investigate the use of factor analysis in continuous density HMMs. Applying factor analysis at the state and mixture component level[6,7] results in a powerful form of dimensionality reduction, one tailored to the local properties of speech. Briefly, the organization of this paper is as follows. In section 2, we review the method of factor analysis and describe what makes it attractive for large problems in speech recognition. In section 3, we report experiments on the speaker-independent recognition of connected alpha-digits. Finally, in section 4, we present our conclusions as well as ideas for future research.

2 Factor analysis

Factor analysis is a linear method for dimensionality reduction of Gaussian random variables[4,5]. Many forms of dimensionality reduction (including those implemented as neural networks) can be understood as variants of factor analysis. There are particularly close ties to methods based on principal components analysis (PCA) and the notion of tangent distance[8]. The combined use of mixture densities and factor analysis, resulting in a non-linear form of dimensionality reduction, was first applied by Hinton et al[6] to the modeling of handwritten digits. The EM procedure for mixtures of factor analyzers was subsequently derived by Ghahramani et al[7]. Below we describe the method of factor analysis for Gaussian random variables, then show how it can be applied to the hidden Markov modeling of speech.

2.1 Gaussian model

Let x ∈ R^D denote a high dimensional Gaussian random variable. For simplicity, we will assume that x has zero mean. If the number of dimensions, D, is very large, it may be prohibitively expensive to estimate, store, multiply, or invert a full covariance matrix. The idea behind factor analysis is to find a subspace of much lower dimension, f << D, that captures most of the variations in x. To this end, let z ∈ R^f denote a low dimensional Gaussian random variable with zero mean and identity covariance matrix:

    P(z) = (2\pi)^{-f/2} \exp\left(-\tfrac{1}{2} z^T z\right).        (1)

We now imagine that the variable x is generated by a random process in which z is a latent (or hidden) variable; the elements of z are known as the factors. Let Λ denote an arbitrary D x f matrix, and let Ψ denote a diagonal, positive-definite D x D matrix. We imagine that x is generated by sampling z from eq. (1), computing the D-dimensional vector Λz, then adding independent Gaussian noise (with variances Ψ_ii) to each component of this vector. The matrix Λ is known as the factor loading matrix.
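As a concrete illustration of this generative process, the following minimal NumPy sketch (our own, with arbitrary, illustrative choices of D, f, Λ, and Ψ) draws samples of x; as eq. (4) below makes precise, the sample covariance of x then approaches Ψ + ΛΛ^T.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative (arbitrary) dimensions: D observed features, f factors.
D, f, N = 10, 3, 100000

Lambda = rng.normal(size=(D, f))           # factor loading matrix
psi = rng.uniform(0.1, 0.5, size=D)        # diagonal noise variances Psi_ii

# Generate x = Lambda z + noise, with z ~ N(0, I) as in eq. (1).
z = rng.normal(size=(N, f))
noise = rng.normal(size=(N, D)) * np.sqrt(psi)
x = z @ Lambda.T + noise

# The sample covariance of x approaches Psi + Lambda Lambda^T (cf. eq. (4)).
empirical_cov = np.cov(x, rowvar=False)
model_cov = np.diag(psi) + Lambda @ Lambda.T
print(np.max(np.abs(empirical_cov - model_cov)))   # small for large N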
The relation between x and z is captured by the conditional distribution:

    P(x|z) = \frac{|\Psi|^{-1/2}}{(2\pi)^{D/2}} \exp\left[-\tfrac{1}{2}(x - \Lambda z)^T \Psi^{-1} (x - \Lambda z)\right].        (2)

The marginal distribution for x is found by integrating out the hidden variable z. The calculation is straightforward because both P(z) and P(x|z) are Gaussian:

    P(x) = \int dz\, P(x|z)\, P(z)        (3)

         = \frac{|\Psi + \Lambda\Lambda^T|^{-1/2}}{(2\pi)^{D/2}} \exp\left[-\tfrac{1}{2}\, x^T (\Psi + \Lambda\Lambda^T)^{-1} x\right].        (4)

From eq. (4), we see that x is normally distributed with mean zero and covariance matrix Ψ + ΛΛ^T. It follows that when the diagonal elements of Ψ are small, most of the variation in x occurs in the subspace spanned by the columns of Λ. The variances Ψ_ii measure the typical size of componentwise fluctuations outside this subspace.

Covariance matrices of the form Ψ + ΛΛ^T have a number of useful properties. Most importantly, they are expressed in terms of a small number of parameters, namely the D(f + 1) non-zero elements of Λ and Ψ. If f << D, then storing Λ and Ψ requires much less memory than storing a full covariance matrix. Likewise, estimating Λ and Ψ also requires much less data than estimating a full covariance matrix. Covariance matrices of this form can be efficiently inverted using the matrix inversion lemma[9],

    (\Psi + \Lambda\Lambda^T)^{-1} = \Psi^{-1} - \Psi^{-1}\Lambda \left(I + \Lambda^T \Psi^{-1} \Lambda\right)^{-1} \Lambda^T \Psi^{-1},        (5)

where I is the f x f identity matrix. This decomposition also allows one to compute the probability P(x) with only O(fD) multiplies, as opposed to the O(D^2) multiplies that are normally required when the covariance matrix is non-diagonal.

Maximum likelihood estimates of the parameters Λ and Ψ are obtained by an EM procedure[4]. Let {x_t} denote a sample of data points (with mean zero). The EM procedure iteratively maximizes the log-likelihood, \sum_t \ln P(x_t), with P(x_t) given by eq. (4). The E-step of this procedure is to compute:

    Q(\Lambda', \Psi'; \Lambda, \Psi) = \sum_t \int dz\, P(z|x_t, \Lambda, \Psi)\, \ln P(z, x_t | \Lambda', \Psi').        (6)

The right hand side of eq. (6) depends on Λ and Ψ through the statistics[7]:

    E[z|x_t] = \left(I + \Lambda^T \Psi^{-1} \Lambda\right)^{-1} \Lambda^T \Psi^{-1} x_t,        (7)

    E[zz^T|x_t] = \left(I + \Lambda^T \Psi^{-1} \Lambda\right)^{-1} + E[z|x_t]\, E[z^T|x_t].        (8)

Here, E[·|x_t] denotes an average with respect to the posterior distribution, P(z|x_t, Λ, Ψ). The M-step of the EM algorithm is to maximize the right hand side of eq. (6) with respect to Ψ' and Λ'. This leads to the iterative updates[7]:

    \Lambda' = \left(\sum_t x_t\, E[z^T|x_t]\right) \left(\sum_t E[zz^T|x_t]\right)^{-1},        (9)

    \Psi' = \mathrm{diag}\left\{ \frac{1}{N} \sum_t \left[ x_t x_t^T - \Lambda'\, E[z|x_t]\, x_t^T \right] \right\},        (10)

where N is the number of data points, and Ψ' is constrained to be purely diagonal. These updates are guaranteed to converge monotonically to a (possibly local) maximum of the log-likelihood.
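For readers who prefer code to equations, here is a minimal NumPy sketch of this EM procedure, assuming the zero-mean data points are stacked into an N x D array; the variable names, initialization, and the small floor that keeps Ψ positive are our own choices, not the paper's.

import numpy as np

def factor_analysis_em(X, f, n_iters=50):
    """EM for factor analysis on zero-mean data X (N x D), following eqs. (7)-(10)."""
    N, D = X.shape
    rng = np.random.default_rng(0)
    Lam = rng.normal(scale=0.01, size=(D, f))    # factor loading matrix (Lambda)
    psi = np.var(X, axis=0) + 1e-6               # diagonal of Psi

    for _ in range(n_iters):
        # E-step: posterior statistics, eqs. (7) and (8).
        G = np.linalg.inv(np.eye(f) + (Lam.T / psi) @ Lam)   # (I + Lam^T Psi^-1 Lam)^-1
        Ez = X @ (Lam / psi[:, None]) @ G                    # E[z|x_t], one row per point
        sum_Ezz = N * G + Ez.T @ Ez                          # sum_t E[z z^T | x_t]

        # M-step: eqs. (9) and (10).
        Lam = (X.T @ Ez) @ np.linalg.inv(sum_Ezz)
        psi = np.mean(X * X - (Ez @ Lam.T) * X, axis=0)
        psi = np.maximum(psi, 1e-6)                          # keep Psi positive definite

    return Lam, psi

Note that the E-step uses the inverse of an f x f matrix rather than a D x D matrix, which is the practical payoff of eq. (5).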
2.2 Hidden Markov modeling of speech

Consider a continuous density HMM whose feature vectors, conditioned on the hidden states, are modeled by mixtures of Gaussian PDFs. If the dimensionality of the feature space is very large, we can make use of the parameterization in eq. (4). Each mixture component thus obtains its own means, variances, and factor loading matrix. Taken together, these amount to a total of C(f + 2)D parameters per mixture model, where C is the number of mixture components, f the number of factors, and D the dimensionality of the feature space. Note that these models capture feature correlations in two ways: implicitly, by using two or more mixture components, and explicitly, by using one or more factors. Intuitively, one expects the mixture components to model discrete types of variability (e.g., whether the speaker is male or female), and the factors to model continuous types of variability (e.g., due to coarticulation or noise). Both types of variability are important for building accurate models of speech.

It is straightforward to integrate the EM algorithm for factor analysis into the training of HMMs. Suppose that S = {x_t} represents a sequence of acoustic vectors. The forward-backward procedure enables one to compute the posterior probability, γ_t^{sc} = P(s_t = s, c_t = c | S), that the HMM used state s and mixture component c at time t. The updates for the matrices Λ^{sc} and Ψ^{sc} (within each state and mixture component) have essentially the same form as eqs. (9)-(10), except that now each observation x_t is weighted by the posterior probability γ_t^{sc}. Additionally, one must take into account that the mixture components have non-zero means[7]. A complete derivation of these updates (along with many additional details) will be given in a longer version of this paper.
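Although the complete derivation is deferred, the structure of these weighted updates can be sketched. The snippet below is our own illustrative adaptation of eqs. (9)-(10) for a single state/mixture component: each frame's sufficient statistics are scaled by its posterior occupancy γ_t, and the non-zero mean is handled by simple weighted centering, which is a simplification of the full treatment in [7].

import numpy as np

def weighted_fa_update(X, gamma, Lam, psi):
    """One EM update for a single state/mixture component, with each frame x_t
    weighted by its posterior occupancy gamma_t (cf. eqs. (9)-(10)).
    Simplification: the component mean is removed by weighted centering;
    the paper's full derivation also re-estimates the mean jointly."""
    mu = (gamma / gamma.sum()) @ X                            # weighted mean of this component
    Xc = X - mu                                               # centered observations

    f = Lam.shape[1]
    G = np.linalg.inv(np.eye(f) + (Lam.T / psi) @ Lam)        # (I + Lam^T Psi^-1 Lam)^-1
    Ez = Xc @ (Lam / psi[:, None]) @ G                        # E[z|x_t] for each frame

    sum_xz = (gamma[:, None] * Xc).T @ Ez                     # sum_t gamma_t x_t E[z^T|x_t]
    sum_zz = gamma.sum() * G + (gamma[:, None] * Ez).T @ Ez   # sum_t gamma_t E[z z^T|x_t]

    Lam_new = sum_xz @ np.linalg.inv(sum_zz)
    psi_new = (gamma[:, None] * (Xc * Xc - (Ez @ Lam_new.T) * Xc)).sum(axis=0) / gamma.sum()
    return mu, Lam_new, np.maximum(psi_new, 1e-6)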
\n\nconfusable  letters  such  as  BjV ,  C jZ,  and  MjN  make  this  a  challenging  problem \nin  speech  recognition .  The  training  and  test  data were  recorded  over  a  telephone \nnetwork  and  consisted  of  14622  and  7255  utterances,  respectively.  Recognizers \nwere  built from  285 left-to-right HMMs trained by  maximum likelihood estimation; \neach  HMM  modeled  a  context-dependent  sub-word  unit.  Testing was  done  with  a \nfree  grammar network  (i .e.,  no grammar constraints).  We  ran  several experiments, \nvarying  both  the  number  of mixture  components and  the  number of factors.  The \ngoal  was  to determine the best  model of acoustic feature  correlations. \n\nTable  1  summarizes  the  results  of these  experiments.  The  columns  from  left  to \nright show  the  number of mixture components, the  number of factors,  the number \nof parameters per mixture model (divided by the feature dimension), the word error \nrates (including insertion , deletion, and substition errors) on the test set, the average \nlog-likelihood  per frame  of speech  on  the  test  set ,  and  the  CPU  time  to  recognize \ntwenty  test  utterances  (on  an  SGI  R4000).  Not  surprisingly,  the  word  accuracies \nand  likelihood  scores  increase  with  the  number  of modeling  parameters;  likewise, \nso  do  the  CPU  times.  The most interesting  comparisons are  between  models with \nthe  same  number  of parameters-e.g.,  four  mixture  components  with  no  factors \nversus  two  mixture components with  two factors.  The left  graph  in figure  1 shows \na  plot  of the  average  log-likelihood versus  the  number  of parameters  per  mixture \nmodel; the stars and circles  in this plot indicate models with and  without diagonal \ncovariance matrices.  One sees  quite clearly from this plot that given a  fixed number \nof  parameters,  models  with  non-diagonal  (factored)  covariance  matrices  tend  to \nhave higher likelihoods. The right graph in figure  1 shows a similar plot of the word \nerror rates versus the number of parameters.  Here one does not see much difference; \npresumably, because  HMMs  are such  poor  models of speech  to  begin  with ,  higher \nlikelihoods do  not necessarily  translate into lower error rates.  We  will return to this \npoint later. \nIt is  worth  noting  that  the  above  experiments  used  a  fixed  number  of factors  per \nmixture  component .  In  fact ,  because  the  variability  of speech  is  highly  context(cid:173)\ndependent,  it  makes sense  to vary  the  number of factors , even  across states  within \nthe same HMM . A simple heuristic is to adjust the number of factors  depending on \nthe amount of training data for each state (as determined by an initial segmentation \nof the  training  utterances).  
We found that this heuristic led to more pronounced differences in likelihood scores and error rates. In particular, substantial improvements were observed for three recognizers whose HMMs employed an average of two factors per mixture component; see the dashed lines in figure 1. Table 2 summarizes these results. The reader will notice that these recognizers are extremely competitive in all aspects of performance (accuracy, memory, and speed) with the baseline (zero factor) models in table 1.

 C   f   C(f+2)   word error (%)   log-likelihood   CPU time (sec)
 1   0      2          16.2             32.9              25
 1   1      3          14.6             34.2              30
 1   2      4          13.7             34.9              30
 1   3      5          13.0             35.3              38
 1   4      6          12.5             35.8              39
 2   0      4          13.4             34.0              30
 2   1      6          12.0             35.1              44
 2   2      8          11.4             35.8              48
 2   3     10          10.9             36.2              61
 2   4     12          10.8             36.6              67
 4   0      8          11.5             34.9              46
 4   1     12          10.4             35.9              80
 4   2     16          10.1             36.5              93
 4   3     20          10.0             36.9             132
 4   4     24           9.8             37.3             153
 8   0     16          10.2             35.6              93
 8   1     24           9.7             36.5             179
 8   2     32           9.6             37.0             226
16   0     32           9.5             36.2             222

Table 1: Results for different recognizers. The columns indicate the number of mixture components, the number of factors, the number of parameters per mixture model (divided by the number of features), the word error rates and average log-likelihood scores on the test set, and the CPU time to recognize twenty utterances.

 C   f   C(f+2)   word error (%)   log-likelihood   CPU time (sec)
 1   2      4          12.3             35.4              32
 2   2      8          10.5             36.3              53
 4   2     16           9.6             37.0             108

Table 2: Results for recognizers with variable numbers of factors; f denotes the average number of factors per mixture component.

4 Discussion

In this paper we have studied the combined use of mixture densities and factor analysis for speech recognition. This was done in the framework of hidden Markov modeling, where acoustic features are conditionally modeled by mixtures of Gaussian PDFs. We have shown that mixture densities and factor analysis are complementary means of modeling acoustic correlations. Moreover, when used together, they can lead to smaller, faster, and more accurate recognizers than either method on its own. (Compare the last lines of tables 1 and 2.)

Several issues deserve further investigation. First, we have seen that increases in likelihood scores do not always correspond to reductions in error rates. (This is a common occurrence in automatic speech recognition.) We are currently investigating discriminative methods[10] for training HMMs with factor analysis; the idea here is to optimize an objective function that more directly relates to the goal of minimizing classification errors. Second, it is important to extend our results to large vocabulary tasks in speech recognition. The extreme sparseness of data in these tasks makes factor analysis an appealing strategy for dimensionality reduction. Finally, there are other questions that need to be answered.
Given a limited number of parameters, what is the best way to allocate them among factors and mixture components? Do the cepstral features used by HMMs throw away informative correlations in the speech signal? Could such correlations be better modeled by factor analysis? Answers to these questions can only lead to further improvements in overall performance.

Acknowledgements

We are grateful to A. Ljolje (AT&T Labs), Z. Ghahramani (University of Toronto) and H. Seung (Bell Labs) for useful discussions. We also thank P. Modi (AT&T Labs) for providing an initial segmentation of the training utterances.

References

[1] Rabiner, L., and Juang, B. (1993) Fundamentals of Speech Recognition. Englewood Cliffs: Prentice Hall.

[2] Ljolje, A. (1994) The importance of cepstral parameter correlations in speech recognition. Computer Speech and Language 8:223-232.

[3] Bellegarda, J., and Nahamoo, D. (1990) Tied mixture continuous parameter modeling for speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 38:2033-2045.

[4] Rubin, D., and Thayer, D. (1982) EM algorithms for factor analysis. Psychometrika 47:69-76.

[5] Everitt, B. (1984) An Introduction to Latent Variable Models. London: Chapman and Hall.

[6] Hinton, G., Dayan, P., and Revow, M. (1996) Modeling the manifolds of images of handwritten digits. To appear in IEEE Transactions on Neural Networks.

[7] Ghahramani, Z., and Hinton, G. (1996) The EM algorithm for mixtures of factor analyzers. University of Toronto Technical Report CRG-TR-96-1.

[8] Simard, P., LeCun, Y., and Denker, J. (1993) Efficient pattern recognition using a new transformation distance. In J. Cowan, S. Hanson, and C. Giles, eds. Advances in Neural Information Processing Systems 5:50-58. Cambridge: MIT Press.

[9] Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. (1992) Numerical Recipes in C: The Art of Scientific Computing. Cambridge: Cambridge University Press.

[10] Bahl, L., Brown, P., deSouza, P., and Mercer, R. (1986) Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In Proceedings of ICASSP 86: 49-52.