{"title": "Factored Semi-Tied Covariance Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 779, "page_last": 785, "abstract": null, "full_text": "Factored Semi-Tied Covariance Matrices \n\nM.J.F. Gales \n\nCambridge University Engineering Department \n\nTrumpington Street, Cambridge. CB2 1PZ \n\nUnited Kingdom \n\nmjfg@eng.cam.ac.uk \n\nAbstract \n\nA new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. The use of a factored decorrelating transform is presented in this paper. This factoring effectively increases the number of possible transforms without increasing the number of free parameters. Maximum likelihood estimation schemes for all the model parameters are presented including the component/transform assignment, transform and component parameters. This new model form is evaluated on a large vocabulary speech recognition task. It is shown that using this factored form of covariance modelling reduces the word error rate. \n\n1  Introduction \n\nA standard problem in machine learning is how to efficiently model correlations in multi-dimensional data. Solutions should be efficient both in terms of number of model parameters and cost of the likelihood calculation. For speech recognition this is particularly important due to the large number of Gaussian components used, typically in the tens of thousands, and the relatively large dimensionality of the data, typically 30-60. 
\n\nThe following generative model has been used in speech recognition\u00b9 \n\n$$\\mathbf{x}(\\tau) = \\mathbf{w} \\qquad (1)$$ \n\n$$\\mathbf{o}(\\tau) = \\mathbf{F} \\begin{bmatrix} \\mathbf{x}(\\tau) \\\\ \\mathbf{v} \\end{bmatrix} \\qquad (2)$$ \n\nwhere $\\mathbf{x}(\\tau)$ is the underlying speech signal, $\\mathbf{F}$ is the observation transformation matrix, $\\mathbf{w}$ is generated by a hidden Markov model (HMM) with a diagonal covariance matrix Gaussian mixture model (GMM) to model each state\u00b2 and $\\mathbf{v}$ is usually assumed to be generated by a GMM, which is common to all HMMs. This differs from the static linear Gaussian models presented in [7] in two important ways. First $\\mathbf{w}$ is generated by either an HMM or GMM, rather than a simple Gaussian distribution. The second difference is that the \"noise\" is now restricted to the null space of the signal $\\mathbf{x}(\\tau)$. This type of system can be considered to have two streams. The first stream, the $n_1$ dimensions associated with $\\mathbf{x}(\\tau)$, is the set of discriminating, useful, dimensions. The second stream, the $n_2$ dimensions associated with $\\mathbf{v}$, is the set of non-discriminating, nuisance, dimensions. Linear discriminant analysis (LDA) and heteroscedastic LDA (HLDA) [5] are both based on this form of generative model. When the dimensionality of the nuisance dimensions is reduced to zero this generative model becomes equivalent to a semi-tied covariance matrix system [3] with a single, global, semi-tied class. \n\n\u00b9 This describes the static version of the generative model. The more general version is described by replacing equation 1 by $\\mathbf{x}(\\tau) = \\mathbf{C}\\mathbf{x}(\\tau - 1) + \\mathbf{w}$. \n\nThis generative model has a clear advantage during recognition compared to the standard linear Gaussian models [2] in the reduction in the computational cost of the likelihood calculation. The likelihood for component $m$ may be computed as\u00b3 \n\n$$p(\\mathbf{o}(\\tau); \\boldsymbol{\\mu}^{(m)}, \\boldsymbol{\\Sigma}^{(m)}_{\\mathrm{diag}}, \\mathbf{F}) = \\frac{l(\\tau)}{|\\det(\\mathbf{F})|} \\mathcal{N}\\left((\\mathbf{F}^{-1})_{[1]} \\mathbf{o}(\\tau); \\boldsymbol{\\mu}^{(m)}, \\boldsymbol{\\Sigma}^{(m)}_{\\mathrm{diag}}\\right) \\qquad (3)$$ \n\nwhere $\\boldsymbol{\\mu}^{(m)}$ is the $n_1$-dimensional mean and $\\boldsymbol{\\Sigma}^{(m)}_{\\mathrm{diag}}$ the diagonal covariance matrix of Gaussian component $m$. $l(\\tau)$ is the nuisance dimension likelihood which is independent of the component being considered and only needs to be computed once for each time instance. The initial normalisation term is only required during recognition when multiple transforms are used. The dominant cost is a diagonal Gaussian computation for each component, $O(n_1)$ per component. In contrast a scheme such as factor analysis (a covariance modelling scheme from the linear Gaussian model in [7]) has a cost of $O(n_1^2)$ per component (assuming there are $n_1$ factors). The disadvantage of this form of generative model is that there is no simple expectation-maximisation (EM) [1] scheme for estimating the model parameters. However, a simple iterative scheme is available [3]. \n\nFor some tasks, such as speech recognition where there are many different \"sounds\" to be recognised, it is unlikely that a single transform is sufficient to model the data well. To reflect this there has been some work on using multiple feature-spaces [3, 2]. The standard approach for using multiple transforms is to assign each component, $m$, to a particular transform, $\\mathbf{F}^{(r_m)}$. To simplify the description of the new scheme only modifications to the semi-tied covariance matrix scheme, where the nuisance dimension is zero, are considered. The generative model is modified to be $\\mathbf{o}(\\tau) = \\mathbf{F}^{(r_m)} \\mathbf{x}(\\tau)$, where $r_m$ is the transform class associated with the generating component, $m$, at time instance $\\tau$. The assignment variable, $r_m$, may either be determined by an \"expert\", for example using phonetic context information, or it may be assigned in a maximum likelihood (ML) fashion [3]. 
Simply increasing the number of transforms increases the number of model parameters to be estimated, hence reducing the robustness of the estimates. There is a corresponding increase in the computational cost during recognition. In the limit there is a single transform per component, the standard full-covariance matrix case. The approach adopted in this paper is to factor the transform into multiple streams. Each component can then use a different transform for each stream. Hence instead of using an assignment variable an assignment vector is used. In order to maintain the efficient likelihood computation of equation 3, $\\mathbf{F}^{(r)-1}$, rather than $\\mathbf{F}^{(r)}$, must be factored into rows. This is a partitioning of the feature space into a set of observation streams. In common with other factoring schemes this dramatically increases the effective number of transforms from which each component may select without increasing the number of transform parameters. Though this paper only considers factoring semi-tied covariance matrices the extension to the \"projection\" schemes presented in [2] is straightforward. \n\n\u00b2 Although it is not strictly necessary to use diagonal covariance matrices, these currently dominate applications in speech recognition. $\\mathbf{w}$ could also be generated by a simple GMM. \n\n\u00b3 This paper uses the following convention: capital bold letters refer to matrices e.g. $\\mathbf{A}$, bold letters refer to vectors e.g. $\\mathbf{b}$, and scalars are not bold e.g. $c$. When referring to elements of a matrix or vector subscripts are used e.g. $\\mathbf{a}_i$ is the $i$th row of matrix $\\mathbf{A}$, $a_{ij}$ is the element of row $i$ column $j$ of matrix $\\mathbf{A}$ and $b_i$ is element $i$ of vector $\\mathbf{b}$. Diagonal matrices are indicated by $\\mathbf{A}_{\\mathrm{diag}}$. Where multiple streams are used this is indicated, for example, by $\\mathbf{A}_{[s]}$, which is an $n_s \\times n$ matrix ($n$ is the dimensionality of the feature vector and $n_s$ is the size of stream $s$). Where subsets of the diagonal matrices are specified the matrices are square, e.g. $\\mathbf{A}_{\\mathrm{diag}[s]}$ is an $n_s \\times n_s$ square diagonal matrix. $\\mathbf{A}^T$ is the transpose of the matrix and $\\det(\\mathbf{A})$ is the determinant of the matrix. \n\nThis paper describes how to estimate the set of transforms and determine which subspaces a particular component should use. The next section describes how to assign components to transforms and, given this assignment, how to estimate the appropriate transforms. Some initial experiments on a large vocabulary speech recognition task are presented in the following section. \n\n2  Factored Semi-Tied Covariance Matrices \n\nIn order to factor semi-tied covariance matrices the inverse of the observation transformation for a component is broken into multiple streams. The feature space of each stream is then determined by selecting from an inventory of possible transforms. Consider the case where there are $S$ streams. The effective full covariance matrix of component $m$, $\\boldsymbol{\\Sigma}^{(m)}$, may be written as $\\boldsymbol{\\Sigma}^{(m)} = \\mathbf{F}^{(\\mathbf{z}^{(m)})} \\boldsymbol{\\Sigma}^{(m)}_{\\mathrm{diag}} \\mathbf{F}^{(\\mathbf{z}^{(m)})T}$, where the form of $\\mathbf{F}^{(\\mathbf{z}^{(m)})}$ is restricted so that\u2074 \n\n$$\\mathbf{F}^{(\\mathbf{z}^{(m)})-1} = \\mathbf{A}^{(\\mathbf{z}^{(m)})} = \\begin{bmatrix} \\mathbf{A}^{(z_1^{(m)})}_{[1]} \\\\ \\vdots \\\\ \\mathbf{A}^{(z_S^{(m)})}_{[S]} \\end{bmatrix} \\qquad (4)$$ \n\nand $\\mathbf{z}^{(m)}$ is the $S$-dimensional assignment vector for component $m$. The complete set of model parameters, $\\mathcal{M}$, consists of the standard model parameters, the component means, variances, weights and, additionally, the set of transforms $\\{\\mathbf{A}^{(1)}_{[s]}, \\ldots, \\mathbf{A}^{(R_s)}_{[s]}\\}$ for each stream $s$ ($R_s$ is the number of transforms associated with stream $s$) and the assignment vector $\\mathbf{z}^{(m)}$ for each component. Note that the semi-tied covariance matrix scheme is the case when $S = 1$. The likelihood is efficiently estimated by storing transformed observations for each stream transform, i.e. $\\mathbf{A}^{(r)}_{[s]} \\mathbf{o}(\\tau)$. 
\n\n\u2074 A similar factorisation has also been proposed in [4]. \n\nThe model parameters are estimated using ML training on a labelled set of training data $\\mathbf{O} = \\{\\mathbf{o}(1), \\ldots, \\mathbf{o}(T)\\}$. The likelihood of the training data may be written as \n\n$$p(\\mathbf{O}|\\mathcal{M}) = \\sum_{\\Theta} \\prod_{\\tau} \\left( P(q(\\tau)|q(\\tau-1)) \\sum_{m \\in \\theta(\\tau)} w^{(m)} p(\\mathbf{o}(\\tau); \\boldsymbol{\\mu}^{(m)}, \\boldsymbol{\\Sigma}^{(m)}_{\\mathrm{diag}}, \\mathbf{A}^{(\\mathbf{z}^{(m)})}) \\right) \\qquad (5)$$ \n\nwhere $\\Theta$ is the set of all valid state sequences according to the transcription for the data, $q(\\tau)$ is the state at time $\\tau$ of the current path, $\\theta(\\tau)$ is the set of Gaussian components belonging to state $q(\\tau)$, and $w^{(m)}$ is the prior of component $m$. Directly optimising equation 5 is a very large optimisation task, as there are typically millions of model parameters. Alternatively, as is common with standard HMM training, an EM-based approach is used. The posterior probability of a particular component, $m$, generating the observation at a given time instance is denoted as $\\gamma_m(\\tau)$. This may be simply found using the forward-backward algorithm [6] and the old set of model parameters $\\mathcal{M}$. The new set of model parameters will be denoted as $\\hat{\\mathcal{M}}$. The component priors and HMM transition matrices are estimated in the standard fashion [6]. Directly optimising the auxiliary function for the model parameters is computationally expensive [3] and does not allow the embedding of the assignment process. Instead a simple iterative optimisation scheme is used as follows: \n\n1.  Estimate the within class covariance matrix for each Gaussian component in the system, $\\mathbf{W}^{(m)}$, using the values of $\\gamma_m(\\tau)$. Initialise the set of assignment vectors, $\\{\\mathbf{z}\\} = \\{\\mathbf{z}^{(1)}, \\ldots, \\mathbf{z}^{(M)}\\}$ and the set of transforms for each stream $\\{\\mathbf{A}\\} = \\{\\mathbf{A}^{(1)}_{[1]}, \\ldots, \\mathbf{A}^{(R_1)}_{[1]}, \\ldots, \\mathbf{A}^{(1)}_{[S]}, \\ldots, \\mathbf{A}^{(R_S)}_{[S]}\\}$. \n\n2.  
Using the current estimates of the transforms and assignment vectors obtain the ML estimate of the set of component specific diagonal covariance matrices incorporating the appropriate parameter tying as required. This set of parameters will be denoted as $\\{\\hat{\\boldsymbol{\\Sigma}}\\} = \\{\\hat{\\boldsymbol{\\Sigma}}^{(1)}_{\\mathrm{diag}}, \\ldots, \\hat{\\boldsymbol{\\Sigma}}^{(M)}_{\\mathrm{diag}}\\}$. \n\n3.  Estimate the new set of transforms, $\\{\\mathbf{A}\\}$, using the current set of component covariance matrices $\\{\\hat{\\boldsymbol{\\Sigma}}\\}$ and assignment vectors $\\{\\mathbf{z}\\}$. The new auxiliary function at this stage will be written as $Q(\\mathcal{M}, \\hat{\\mathcal{M}}; \\{\\hat{\\boldsymbol{\\Sigma}}\\}, \\{\\mathbf{z}\\})$. \n\n4.  Update the set of assignment variables for each component $\\{\\mathbf{z}\\}$, given the current set of model transforms, $\\{\\mathbf{A}\\}$. \n\n5.  Go to (2) unless convergence, or an appropriate stopping criterion, has been reached. Otherwise update $\\{\\hat{\\boldsymbol{\\Sigma}}\\}$ and the component means using the latest transforms and assignment variables. \n\nThere are three distinct optimisation problems within this task. First the ML estimate of the set of component specific diagonal covariance matrices is required. Second, the new set of transforms must be estimated. Finally the new set of assignment vectors is required. The ML estimation of the component specific variances (and means) under a transformation is a standard problem, e.g. for the semi-tied case see [3], and is not described further. The ML estimation of the transforms and assignment variables is described below. \n\nThe transforms are estimated in an iterative fashion. The proposed scheme is derived by modifying the standard semi-tied covariance optimisation equation in [3]. A row by row optimisation is used. Consider row $i$ of stream $p$ of transform $r$, $\\mathbf{a}^{(r)}_{[p]i}$. The auxiliary function may be written as (ignoring constant scalings and elements independent of $\\mathbf{a}^{(r)}_{[p]i}$) \n\n$$Q(\\mathcal{M}, \\hat{\\mathcal{M}}; \\{\\hat{\\boldsymbol{\\Sigma}}\\}, \\{\\mathbf{z}\\}) = \\sum_{m} \\beta^{(m)} \\log\\left(\\left(\\mathbf{c}^{(\\mathbf{z}^{(m)})}_{[p]i} \\mathbf{a}^{(z_p^{(m)})T}_{[p]i}\\right)^2\\right) - \\sum_{s,r,j} \\mathbf{a}^{(r)}_{[s]j} \\mathbf{K}^{(srj)} \\mathbf{a}^{(r)T}_{[s]j} \\qquad (6)$$ \n\nwhere \n\n$$\\mathbf{K}^{(srj)} = \\sum_{m:\\{z_s^{(m)}=r\\}} \\frac{\\mathbf{W}^{(m)}}{\\sigma^{(m)2}_{\\mathrm{diag}[s]j}} \\sum_{\\tau} \\gamma_m(\\tau)$$ \n\nand $\\mathbf{c}^{(\\mathbf{z}^{(m)})}_{[p]i}$ is the cofactor of row $i$ of stream $p$ of transform $\\mathbf{A}^{(\\mathbf{z}^{(m)})}$. The gradient $\\mathbf{f}^{(r)}_{[p]i}$, differentiating the auxiliary function with respect to $\\mathbf{a}^{(r)}_{[p]i}$, is given by\u2075 \n\n$$\\mathbf{f}^{(r)}_{[p]i} = \\sum_{m:\\{z_p^{(m)}=r\\}} \\left\\{ \\frac{2\\beta^{(m)} \\mathbf{c}^{(\\mathbf{z}^{(m)})}_{[p]i}}{\\mathbf{c}^{(\\mathbf{z}^{(m)})}_{[p]i} \\mathbf{a}^{(r)T}_{[p]i}} \\right\\} - 2\\mathbf{a}^{(r)}_{[p]i} \\mathbf{K}^{(pri)} \\qquad (8)$$ \n\nThe main cost for computing the gradient is calculating the cofactors for each component. Having computed the gradient the Hessian may also be simply calculated as \n\n$$\\mathbf{H}^{(r)}_{[p]i} = \\sum_{m:\\{z_p^{(m)}=r\\}} \\left\\{ -2\\beta^{(m)} \\frac{\\mathbf{c}^{(\\mathbf{z}^{(m)})T}_{[p]i} \\mathbf{c}^{(\\mathbf{z}^{(m)})}_{[p]i}}{\\left(\\mathbf{c}^{(\\mathbf{z}^{(m)})}_{[p]i} \\mathbf{a}^{(r)T}_{[p]i}\\right)^2} \\right\\} - 2\\mathbf{K}^{(pri)} \\qquad (9)$$ \n\nThe Hessian is guaranteed to be negative definite so the Newton direction must head towards a maximum. At the $(t+1)$th iteration \n\n$$\\mathbf{a}^{(r)}_{[p]i}(t+1) = \\mathbf{a}^{(r)}_{[p]i}(t) - \\mathbf{f}^{(r)}_{[p]i} \\mathbf{H}^{(r)-1}_{[p]i} \\qquad (10)$$ \n\nwhere the gradient and Hessian are based on the $t$th parameter estimates. In practice this estimation scheme was highly stable. \n\nThe assignment for stream $s$ of component $m$ is found using a greedy search technique based on ML estimation. Stream $s$ of component $m$ is assigned using \n\n$$z_s^{(m)} = \\arg\\max_{r \\in R_s} \\left\\{ \\frac{\\left|\\det\\left(\\mathbf{A}^{(\\mathbf{u}^{(srm)})}\\right)\\right|^2}{\\left|\\det\\left(\\mathrm{diag}\\left(\\mathbf{A}^{(r)}_{[s]} \\mathbf{W}^{(m)} \\mathbf{A}^{(r)T}_{[s]}\\right)\\right)\\right|} \\right\\} \\qquad (11)$$ \n\nwhere the hypothesised assignment of factor stream $s$, $\\mathbf{u}^{(srm)}$, is given by \n\n$$u_j^{(srm)} = \\begin{cases} r, & j = s \\\\ z_j^{(m)}, & \\text{otherwise} \\end{cases} \\qquad (12)$$ \n\nAs the assignment is dependent on the cofactors, which themselves are dependent on the other stream assignments for that component, an iterative scheme is required. In practice this was found to converge rapidly. \n\n\u2075 When the standard semi-tied system is used (i.e. $S = 1$) the estimation of row $i$ has the closed form solution \n\n$$\\mathbf{a}^{(r)}_{[1]i} = \\mathbf{c}^{(r)}_{[1]i} \\mathbf{K}^{(1ri)-1} \\sqrt{\\frac{\\sum_{m:\\{z_1^{(m)}=r\\}} \\beta^{(m)}}{\\mathbf{c}^{(r)}_{[1]i} \\mathbf{K}^{(1ri)-1} \\mathbf{c}^{(r)T}_{[1]i}}} \\qquad (7)$$ \n\n3  Results and Discussion \n\nAn initial investigation of the use of factored semi-tied covariance matrices was carried out on a large-vocabulary speaker-independent continuous-speech recognition task. The recognition experiments were performed on the 1994 ARPA Hub 1 data (the H1 task), an unlimited vocabulary task. The results were averaged over the development and evaluation data. Note that no tuning on the \"development\" data was performed. The baseline system used for the recognition task was a gender-independent cross-word-triphone mixture-Gaussian tied-state HMM system. For details of the system see [8]. The total number of phones (counting silence as a separate phone) was 46, from which 6399 distinct context states were formed. The speech was parameterised into a 39-dimensional feature vector. \n\nThe set of baseline experiments with semi-tied covariance matrices ($S = 1$) used \"expert\" knowledge to determine the transform classes. Two sets were used. The first was based on phone level transforms where all components of all states from the same phone shared the same class (phone classes). The second used an individual transform per state (state classes). In addition a global transform (global class) and a full-covariance matrix system (comp class) were tested. Two systems were examined, a system with four Gaussian components per state and a system with twelve Gaussian components per state. The twelve component system is the standard system described in [8].  
In both cases a diagonal covariance matrix system (labelled none) was generated in the standard HTK fashion [9]. These systems were then used to generate the initial alignments to build the semi-tied systems. An additional iteration of Baum-Welch estimation was then performed. \n\nThree forms of assignment training were compared: the previously described expert system and two ML-based schemes, standard and factored. The standard scheme used a single stream ($S = 1$) which is similar to the scheme described in [3]. The factored scheme used the new approach described in this paper with a separate stream for each of the elements of the feature vector ($S = 39$). \n\nTable 1: System performance on the 1994 ARPA H1 task \n\nAssignment \n\nScheme \n\nnone \nglobal \nphone \nstate \ncomp \nphone \nphone \n\n-\n\nexpert \nexpert \n\n-\n\nstandard \nfactored \n\n10.34  8.87 \n10.04  8.86 \n8.84 \n9.20 \n9.22 \n9.98 \n8.62 \n9.73 \n9.48 \n8.42 \n\nThe results of the baseline semi-tied covariance matrix systems are shown in table 1. For the four component system the full covariance matrix system achieved approximately the same performance as that of the expert state semi-tied system. Both systems significantly (at the 95% level) outperformed the standard 12-component system (9.71%). The expert phone system shows around a 9% degradation in performance compared to the state system, but used less than a hundredth of the number of transforms (46 versus 6399). Using the standard ML assignment scheme with initial phone classes, $S = 1$, reduced the error rate of the phone system by around 3% over the expert system. The factored scheme, $S = 39$, achieved further reductions in error rate. A 5% reduction in word error rate was achieved over the expert system, which is significant at the 95% level. \n\nTable 1 also shows the performance of the twelve component system.  
The use of a global semi-tied transform significantly reduced the error rate by around 9% relative. Increasing the number of transforms using the expert assignment showed no reduction in error rate. Again, using the phone level system and training the component transform assignments, with either the standard or the factored scheme, reduced the word error rate. Using the factored semi-tied transforms ($S = 39$) significantly reduced the error rate, by around 5%, compared to the expert systems. \n\n4  Conclusions \n\nThis paper has presented a new form of semi-tied covariance, the factored semi-tied covariance matrix. The theory for estimating these transforms has been developed and implemented on a large vocabulary speech recognition task. On this task the use of these factored transforms was found to decrease the word error rate by around 5% over using a single transform, or multiple transforms, where the assignments are expertly determined. The improvement was significant at the 95% level. In future work the problems of determining the required number of transforms for each of the streams and how to determine the appropriate dimensions will be investigated. \n\nReferences \n\n[1]  A P Dempster, N M Laird, and D B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1-38, 1977. \n\n[2]  M J F Gales. Maximum likelihood multiple projection schemes for hidden Markov models. Technical Report CUED/F-INFENG/TR365, Cambridge University, 1999. Available via anonymous ftp from: svr-ftp.eng.cam.ac.uk. \n\n[3]  M J F Gales. Semi-tied covariance matrices for hidden Markov models. IEEE Transactions Speech and Audio Processing, 7:272-281, 1999. \n\n[4]  N K Goel and R Gopinath. Multiple linear transforms. In Proceedings ICASSP, 2001. To appear. \n\n[5]  N Kumar. Investigation of Silicon-Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition. PhD thesis, Johns Hopkins University, 1997. \n\n[6]  L R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, February 1989. \n\n[7]  S Roweis and Z Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305-345, 1999. \n\n[8]  P C Woodland, J J Odell, V Valtchev, and S J Young. The development of the 1994 HTK large vocabulary speech recognition system. In Proceedings ARPA Workshop on Spoken Language Systems Technology, pages 104-109, 1995. \n\n[9]  S J Young, J Jansen, J Odell, D Ollason, and P Woodland. The HTK Book (for HTK Version 2.0). Cambridge University, 1996. \n", "award": [], "sourceid": 1871, "authors": [{"given_name": "Mark", "family_name": "Gales", "institution": null}]}