{"title": "Batch and On-Line Parameter Estimation of Gaussian Mixtures Based on the Joint Entropy", "book": "Advances in Neural Information Processing Systems", "page_first": 578, "page_last": 584, "abstract": null, "full_text": "Batch and On-line Parameter Estimation of \nGaussian Mixtures Based on the Joint Entropy \n\nYoram Singer \n\nAT&T Labs \n\nsinger@research.att.com \n\nManfred K. Warmuth \n\nUniversity of California, Santa Cruz \n\nmanfred@cse.ucsc.edu \n\nAbstract \n\nWe  describe  a  new  iterative method  for parameter estimation of Gaus(cid:173)\nsian  mixtures.  The new method is based on a framework developed by \nKivinen and Warmuth for supervised on-line learning. In contrast to gra(cid:173)\ndient descent and EM, which estimate the mixture's covariance matrices, \nthe proposed method estimates the inverses  of the covariance matrices. \nFurthennore, the new parameter estimation procedure can  be applied in \nboth on-line and  batch settings.  We show experimentally that it is typi(cid:173)\ncally faster than EM, and usually requires about half as  many  iterations \nas EM. \n\n1  Introduction \n\nMixture models,  in particular mixtures of Gaussians, have been a popular tool for density \nestimation, clustering, and  un-supervised learning  with a wide range of applications (see \nfor instance [5, 2]  and the references therein).  Mixture models are one of the most useful \ntools for handling incomplete data,  in  particular hidden variables.  For Gaussian  mixtures \nthe hidden variables indicate for each data point the index of the Gaussian that generated it. \nThus, the model is specified by ajoint density between the observed and hidden variables. \nThe common technique used for estimating the parameters of a stochastic source with hid(cid:173)\nden variables is the EM  algorithm. In this paper we describe a new technique for estimating \nthe parameters of Gaussian mixtures.  The new parameter estimation method is based on a \nframework  developed  by  Kivinen and Warmuth  [8]  for supervised on-line learning.  This \nframework was successfully used in a large number of supervised and un-supervised prob(cid:173)\nlems (see for instance [7, 6, 9,  1]). \n\nOur goal  is to find  a local minimum of a loss function which, in our case,  is the negative \nlog likelihood induced by  a  mixture of Gaussians.  However,  rather than minimizing the \n\n\fParameter Estimation of Gaussian Mixtures \n\n579 \n\nloss directly we add  a tenn measuring the distance of the new parameters to the old ones. \nThis distance is useful for iterative parameter estimation procedures.  Its purpose is to keep \nthe  new  parameters  close  to  the  old  ones.  The  method  for  deriving  iterative  parameter \nestimation can  be used  in batch  settings as  well  as  on-line settings where the parameters \nare updated after each observation. The distance used for deriving the parameter estimation \nmethod in this paper is the relative entropy between the old and  new joint density of the \nobserved and hidden variables.  For brevity we tenn the new iterative parameter estimation \nmethod the joint-entropy (JE) update. \n\nThe JE update shares a common characteristic with the Expectation Maximization [4,  10] \nalgorithm as it first calculates the same expectations. However, it replaces the maximization \nstep  with a different update of the parameters.  
2  Notation and preliminaries

Let S be a sequence of training examples (x_1, x_2, ..., x_N) where each x_i is a d-dimensional vector in \mathbb{R}^d. To model the distribution of the examples we use m d-dimensional Gaussians. The parameters of the i-th Gaussian are denoted by \theta_i = (\mu_i, C_i); they include the mean vector \mu_i and the covariance matrix C_i. The density function of the i-th Gaussian, denoted P(x|\theta_i), is

P(x|\theta_i) = (2\pi)^{-d/2} |C_i|^{-1/2} \exp\left( -\tfrac{1}{2} (x - \mu_i)^T C_i^{-1} (x - \mu_i) \right).

We denote the entire set of parameters of a Gaussian mixture by \Theta = \{\theta_i\}_{i=1}^m = \{w_i, \mu_i, C_i\}_{i=1}^m, where w = (w_1, ..., w_m) is a non-negative vector of mixture coefficients such that \sum_{i=1}^m w_i = 1. We denote by P(x|\Theta) = \sum_{i=1}^m w_i P(x|\theta_i) the likelihood of an observation x according to a Gaussian mixture with parameters \Theta. Let \theta_i and \tilde{\theta}_i be two Gaussian distributions. For brevity, we denote by E_i(Z) and \tilde{E}_i(Z) the expectation of a random variable Z with respect to \theta_i and \tilde{\theta}_i. Let f be a parametric function whose parameters constitute a matrix A = (a_{ij}). We denote by \partial f / \partial A the matrix of partial derivatives of f with respect to the elements in A. That is, the ij element of \partial f / \partial A is \partial f / \partial a_{ij}. Similarly, let B = (b_{ij}(x)) be a matrix whose elements are functions of a scalar x. Then we denote by dB/dx the matrix of derivatives of the elements in B with respect to x; namely, the ij element of dB/dx is db_{ij}(x)/dx.
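To make the notation concrete, the following is a minimal NumPy sketch (ours, not part of the paper) of the Gaussian density P(x|\theta_i), the mixture likelihood P(x|\Theta), and the loss loss(S|\Theta) defined in the next section. All function names are ours, and no numerical safeguards are included.

    import numpy as np

    def gaussian_pdf(X, mu, C):
        # Density P(x|theta_i) of a d-dimensional Gaussian N(mu, C),
        # evaluated at each row of X (shape (N, d)).
        d = mu.shape[0]
        diff = X - mu
        Cinv = np.linalg.inv(C)
        quad = np.einsum('nd,de,ne->n', diff, Cinv, diff)  # (x-mu)^T C^{-1} (x-mu)
        norm = (2 * np.pi) ** (-d / 2) / np.sqrt(np.linalg.det(C))
        return norm * np.exp(-0.5 * quad)

    def mixture_likelihood(X, w, mus, Cs):
        # P(x|Theta) = sum_i w_i P(x|theta_i), evaluated at each row of X.
        return sum(w[i] * gaussian_pdf(X, mus[i], Cs[i]) for i in range(len(w)))

    def loss(X, w, mus, Cs):
        # Negative mean log-likelihood, loss(S|Theta) = -(1/|S|) sum_x ln P(x|Theta).
        return -np.mean(np.log(mixture_likelihood(X, w, mus, Cs)))

Evaluated on an (N, d) array of observations, loss gives the quantity that both EM and the JE update below try to decrease.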
3  The framework for deriving updates

Kivinen and Warmuth [8] introduced a general framework for deriving on-line parameter updates. In this section we describe how to apply their framework to the problem of parameter estimation of Gaussian mixtures in a batch setting. We later discuss how a simple modification gives the on-line updates.

Given a set of data points S in \mathbb{R}^d and a number m, the goal is to find a set of m Gaussians that minimize the loss on the data, denoted loss(S|\Theta). For density estimation the natural loss function is the negative log-likelihood of the data, loss(S|\Theta) = -(1/|S|) \ln P(S|\Theta) \stackrel{def}{=} -(1/|S|) \sum_{x \in S} \ln P(x|\Theta). The parameters minimizing this loss cannot be found analytically. The common approach is to use iterative methods such as EM [4, 10] to find a local minimizer of the loss.

In an iterative parameter estimation framework we are given the old set of parameters \Theta_t and we need to find a set of new parameters \Theta_{t+1} that induces a smaller loss. The framework introduced by Kivinen and Warmuth [8] deviates from the common approaches in that it also requires the new parameter vector to stay "close" to the old set of parameters, which incorporates all that was learned in the previous iterations. The distance of the new parameter setting \Theta_{t+1} from the old setting \Theta_t is measured by a non-negative distance function \Delta(\Theta_{t+1}, \Theta_t). We then search for a new set of parameters \Theta_{t+1} that minimizes the distance summed with the loss multiplied by \eta. Here \eta is a non-negative number measuring the relative importance of the distance versus the loss; this parameter will become the learning rate of the update. More formally, the update is found by setting \Theta_{t+1} = \arg\min_{\tilde{\Theta}} U_t(\tilde{\Theta}) where U_t(\tilde{\Theta}) = \Delta(\tilde{\Theta}, \Theta_t) + \eta \, loss(S|\tilde{\Theta}) + \lambda (\sum_{i=1}^m \tilde{w}_i - 1). (We use a Lagrange multiplier \lambda to enforce the constraint that the mixture coefficients sum to one.) By choosing the appropriate distance function and \eta = 1, one can show that EM becomes the above update.

For most distance functions and learning rates the minimizer of the function U_t(\tilde{\Theta}) cannot be found analytically, as both the distance function and the log-likelihood are usually non-linear in \tilde{\Theta}. Instead, we expand the log-likelihood using a first-order Taylor expansion around the old parameter setting. This approximation degrades the further the new parameter values are from the old ones, which further motivates the use of the distance function \Delta(\tilde{\Theta}, \Theta_t) (see also the discussion in [7]). We now seek a new set of parameters \Theta_{t+1} = \arg\min_{\tilde{\Theta}} V_t(\tilde{\Theta}) where

V_t(\tilde{\Theta}) = \Delta(\tilde{\Theta}, \Theta_t) + \eta \left( loss(S|\Theta_t) + (\tilde{\Theta} - \Theta_t) \cdot \nabla_\Theta loss(S|\Theta_t) \right) + \lambda \left( \sum_{i=1}^m \tilde{w}_i - 1 \right).   (1)

Here \nabla_\Theta loss(S|\Theta_t) denotes the gradient of the loss at \Theta_t. We use the above method, Eq. (1), to derive the updates of this paper. For density estimation it is natural to use the relative entropy between the new and old densities as a distance. In this paper we use the joint density between the observed variables (data points) and hidden variables (the indices of the Gaussians). This motivates the name joint-entropy update.

4  Entropy based distance functions

We first consider the relative entropy between the new and old parameters of a single Gaussian. Using the notation introduced in Sec. 2, the relative entropy between two Gaussian distributions \tilde{\theta}_i and \theta_i is

\Delta(\tilde{\theta}_i, \theta_i) \stackrel{def}{=} \int_{x \in \mathbb{R}^d} P(x|\tilde{\theta}_i) \ln \frac{P(x|\tilde{\theta}_i)}{P(x|\theta_i)} \, dx = \frac{1}{2} \ln \frac{|C_i|}{|\tilde{C}_i|} + \frac{1}{2} \tilde{E}_i\left( (x - \mu_i)^T C_i^{-1} (x - \mu_i) \right) - \frac{1}{2} \tilde{E}_i\left( (x - \tilde{\mu}_i)^T \tilde{C}_i^{-1} (x - \tilde{\mu}_i) \right).

Using standard (though tedious) algebra we can rewrite the expectations as follows:

\Delta(\tilde{\theta}_i, \theta_i) = \frac{1}{2} \ln \frac{|C_i|}{|\tilde{C}_i|} - \frac{d}{2} + \frac{1}{2} tr\left( C_i^{-1} \tilde{C}_i \right) + \frac{1}{2} (\tilde{\mu}_i - \mu_i)^T C_i^{-1} (\tilde{\mu}_i - \mu_i).   (2)
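As a sanity check on Eq. (2), the following sketch (ours; the helper name is hypothetical) computes the relative entropy between two Gaussians directly from the closed form, using slogdet for numerical robustness.

    import numpy as np

    def gaussian_kl(mu_new, C_new, mu_old, C_old):
        # Relative entropy Delta(theta_new, theta_old) between two
        # d-dimensional Gaussians, following Eq. (2).
        d = mu_old.shape[0]
        Cinv_old = np.linalg.inv(C_old)
        diff = mu_new - mu_old
        _, logdet_old = np.linalg.slogdet(C_old)
        _, logdet_new = np.linalg.slogdet(C_new)
        # 1/2 ln(|C|/|C~|) - d/2 + 1/2 tr(C^{-1} C~) + 1/2 (mu~-mu)^T C^{-1} (mu~-mu)
        return 0.5 * (logdet_old - logdet_new - d
                      + np.trace(Cinv_old @ C_new)
                      + diff @ Cinv_old @ diff)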
The relative entropy between the new and the old mixture models is

\Delta(\tilde{\Theta}, \Theta) \stackrel{def}{=} \int_x P(x|\tilde{\Theta}) \ln \frac{P(x|\tilde{\Theta})}{P(x|\Theta)} \, dx = \int_x \sum_{i=1}^m \tilde{w}_i P(x|\tilde{\theta}_i) \ln \frac{\sum_{i=1}^m \tilde{w}_i P(x|\tilde{\theta}_i)}{\sum_{i=1}^m w_i P(x|\theta_i)} \, dx.   (3)

Ideally, we would like to use the above distance function in V_t to give us an update of \tilde{\Theta} in terms of \Theta. However, there is no closed-form expression for Eq. (3). Although the relative entropy between two Gaussians is a convex function of their parameters, the relative entropy between two Gaussian mixtures is non-convex. Thus, the loss function V_t(\tilde{\Theta}) may have multiple minima, making the problem of finding \arg\min_{\tilde{\Theta}} V_t(\tilde{\Theta}) difficult.

In order to sidestep this problem we use the log-sum inequality [3] to obtain an upper bound on the distance function \Delta(\tilde{\Theta}, \Theta). We denote this upper bound by \tilde{\Delta}(\tilde{\Theta}, \Theta):

\Delta(\tilde{\Theta}, \Theta) \le \tilde{\Delta}(\tilde{\Theta}, \Theta) \stackrel{def}{=} \int_x \sum_{i=1}^m \tilde{w}_i P(x|\tilde{\theta}_i) \ln \frac{\tilde{w}_i P(x|\tilde{\theta}_i)}{w_i P(x|\theta_i)} \, dx = \sum_{i=1}^m \tilde{w}_i \ln \frac{\tilde{w}_i}{w_i} + \sum_{i=1}^m \tilde{w}_i \Delta(\tilde{\theta}_i, \theta_i).   (4)

We call the new distance function \tilde{\Delta}(\tilde{\Theta}, \Theta) the joint-entropy distance. Note that in this distance the parameters \tilde{w}_i and w_i are "coupled" in the sense that the bound is a convex combination of the distances \Delta(\tilde{\theta}_i, \theta_i). In particular, \tilde{\Delta}(\tilde{\Theta}, \Theta) as a function of the parameters \tilde{w}_i, \tilde{\mu}_i, \tilde{C}_i no longer remains constant when the parameters of the individual Gaussians are permuted. Furthermore, \tilde{\Delta}(\tilde{\Theta}, \Theta) is sufficiently convex that finding the minimizer of V_t is possible (see below).

5  The updates

We are now ready to derive the new parameter estimation scheme. This is done by setting the partial derivatives of V_t with respect to \tilde{\Theta} to 0. That is, our problem consists of solving the following equations:

\frac{\partial \tilde{\Delta}(\tilde{\Theta}, \Theta)}{\partial \tilde{w}_i} - \frac{\eta}{|S|} \frac{\partial \ln P(S|\Theta)}{\partial w_i} + \lambda = 0, \qquad \frac{\partial \tilde{\Delta}(\tilde{\Theta}, \Theta)}{\partial \tilde{\mu}_i} - \frac{\eta}{|S|} \frac{\partial \ln P(S|\Theta)}{\partial \mu_i} = 0, \qquad \frac{\partial \tilde{\Delta}(\tilde{\Theta}, \Theta)}{\partial \tilde{C}_i} - \frac{\eta}{|S|} \frac{\partial \ln P(S|\Theta)}{\partial C_i} = 0.

We now use the fact that C_i, and thus C_i^{-1}, is symmetric. The derivatives of \tilde{\Delta}(\tilde{\Theta}, \Theta), as defined by Eq. (4) and Eq. (2), with respect to \tilde{w}_i, \tilde{\mu}_i, and \tilde{C}_i are

\frac{\partial \tilde{\Delta}(\tilde{\Theta}, \Theta)}{\partial \tilde{w}_i} = \ln \frac{\tilde{w}_i}{w_i} + 1 + \frac{1}{2} \ln \frac{|C_i|}{|\tilde{C}_i|} - \frac{d}{2} + \frac{1}{2} tr\left( C_i^{-1} \tilde{C}_i \right) + \frac{1}{2} (\tilde{\mu}_i - \mu_i)^T C_i^{-1} (\tilde{\mu}_i - \mu_i),   (5)

\frac{\partial \tilde{\Delta}(\tilde{\Theta}, \Theta)}{\partial \tilde{\mu}_i} = \tilde{w}_i C_i^{-1} (\tilde{\mu}_i - \mu_i),   (6)

\frac{\partial \tilde{\Delta}(\tilde{\Theta}, \Theta)}{\partial \tilde{C}_i} = \frac{1}{2} \tilde{w}_i \left( -\tilde{C}_i^{-1} + C_i^{-1} \right).   (7)

To simplify the notation throughout the rest of the paper we define the following variables:

\beta_i(x) \stackrel{def}{=} \frac{P(x|\theta_i)}{P(x|\Theta)} \quad and \quad \alpha_i(x) \stackrel{def}{=} \frac{w_i P(x|\theta_i)}{P(x|\Theta)} = P(i|x, \Theta) = w_i \beta_i(x).

The partial derivatives of the log-likelihood are computed similarly:

\frac{\partial \ln P(S|\Theta)}{\partial w_i} = \sum_{x \in S} \frac{P(x|\theta_i)}{P(x|\Theta)} = \sum_{x \in S} \beta_i(x),   (8)

\frac{\partial \ln P(S|\Theta)}{\partial \mu_i} = \sum_{x \in S} \frac{w_i P(x|\theta_i)}{P(x|\Theta)} C_i^{-1} (x - \mu_i) = \sum_{x \in S} \alpha_i(x) C_i^{-1} (x - \mu_i),   (9)

\frac{\partial \ln P(S|\Theta)}{\partial C_i} = -\frac{1}{2} \sum_{x \in S} \frac{w_i P(x|\theta_i)}{P(x|\Theta)} \left( C_i^{-1} - C_i^{-1} (x - \mu_i)(x - \mu_i)^T C_i^{-1} \right) = -\frac{1}{2} \sum_{x \in S} \alpha_i(x) \left( C_i^{-1} - C_i^{-1} (x - \mu_i)(x - \mu_i)^T C_i^{-1} \right).   (10)

We now need to decide on an order for updating the parameter classes w_i, \mu_i, and C_i. We use the same order that EM uses, namely w_i, then \mu_i, and finally C_i. (After doing one pass over all three groups we start again using the same order.) Using this order results in a simplified set of equations, as several terms in Eq. (5) cancel out. Denote the size of the sample by N = |S|. We now sum the derivatives from Eq. (5) and Eq. (8), using the fact that the Lagrange multiplier \lambda simply ensures that the new weights \tilde{w}_i sum to one. Setting the result to zero, we get

w_i \leftarrow \frac{w_i \exp\left( \frac{\eta}{N} \sum_{x \in S} \beta_i(x) \right)}{\sum_{j=1}^m w_j \exp\left( \frac{\eta}{N} \sum_{x \in S} \beta_j(x) \right)}.   (11)

Similarly, we sum Eq. (6) and Eq. (9), set the result to zero, and get

\mu_i \leftarrow \mu_i + \frac{\eta}{N} \sum_{x \in S} \beta_i(x) (x - \mu_i).   (12)

Finally, we do the same for C_i. We sum Eq. (7) and Eq. (10) using the newly obtained \mu_i:

C_i^{-1} \leftarrow C_i^{-1} + \frac{\eta}{N} \sum_{x \in S} \beta_i(x) \left( C_i^{-1} - C_i^{-1} (x - \mu_i)(x - \mu_i)^T C_i^{-1} \right).   (13)

We call the new iterative parameter estimation procedure the joint-entropy (JE) update. To summarize, the JE update is composed of the following alternating steps: we first calculate for each observation x the value \beta_i(x) = P(x|\theta_i)/P(x|\Theta) and then update the parameters as given by Eq. (11), Eq. (12), and Eq. (13). The JE update and EM differ in several aspects. First, EM uses a simple update for the mixture weights w. Second, EM uses the expectations (with respect to the current parameters) of the sufficient statistics [4] for \mu_i and C_i to find new sets of mean vectors and covariance matrices; the JE update uses a (slightly different) weighted average of the observations and, in addition, it adds in the old parameters. The learning rate \eta determines the proportions used when combining the old parameters with the newly estimated ones. Last, EM estimates the covariance matrices C_i whereas the new update estimates the inverses, C_i^{-1}, of these matrices. Thus, it is potentially more stable numerically in cases where the covariance matrices are ill-conditioned.
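The following sketch (ours, assuming the update order w_i, then \mu_i, then C_i described above, and that Eq. (13) uses the newly obtained means) implements one batch JE iteration from Eqs. (11)-(13). It works with the inverse covariance matrices throughout, as the update itself does.

    import numpy as np

    def je_batch_step(X, w, mus, Cinvs, eta):
        # One batch joint-entropy (JE) update, Eqs. (11)-(13).
        # X: (N, d) data; w: (m,) weights; mus: (m, d) means;
        # Cinvs: (m, d, d) inverse covariances; eta: learning rate.
        N, d = X.shape
        m = len(w)
        # Component densities P(x|theta_i), computed from the inverses:
        # ln P(x|theta_i) = 0.5 ln|C_i^{-1}| - (d/2) ln(2 pi) - 0.5 quad.
        dens = np.empty((m, N))
        for i in range(m):
            diff = X - mus[i]
            quad = np.einsum('nd,de,ne->n', diff, Cinvs[i], diff)
            _, logdet_inv = np.linalg.slogdet(Cinvs[i])
            dens[i] = np.exp(0.5 * logdet_inv - 0.5 * d * np.log(2 * np.pi)
                             - 0.5 * quad)
        mix = w @ dens                    # P(x|Theta) for each observation
        beta = dens / mix                 # beta_i(x) = P(x|theta_i)/P(x|Theta)
        # Eq. (11): multiplicative (exponentiated-gradient) weight update.
        w_new = w * np.exp((eta / N) * beta.sum(axis=1))
        w_new /= w_new.sum()
        mus_new = np.empty_like(mus)
        Cinvs_new = np.empty_like(Cinvs)
        for i in range(m):
            diff = X - mus[i]
            # Eq. (12): additive update of the mean.
            mus_new[i] = mus[i] + (eta / N) * beta[i] @ diff
            # Eq. (13): additive update of the inverse covariance,
            # using the newly obtained mean.
            diff = X - mus_new[i]
            outer = np.einsum('n,nd,ne->de', beta[i], diff, diff)
            Cinvs_new[i] = Cinvs[i] + (eta / N) * (
                beta[i].sum() * Cinvs[i] - Cinvs[i] @ outer @ Cinvs[i])
        return w_new, mus_new, Cinvs_new

Iterating je_batch_step from an initial parameter set while monitoring the loss from the sketch after Sec. 2 mirrors, in spirit, the batch experiments of Sec. 6, where \eta ranged over 1.05-1.9.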
To obtain an on-line procedure we update the parameters after each new observation, one observation at a time. That is, rather than summing over all x \in S, upon receiving a new observation x_t we update the parameters and obtain a new set \Theta_{t+1} from the current parameters \Theta_t. The new parameters are then used for inducing the likelihood of the next observation x_{t+1}. The on-line parameter estimation procedure is composed of the following steps:

1. Set: \beta_i(x_t) = P(x_t|\theta_i) / P(x_t|\Theta).
2. Parameter updates:
(a) w_i \leftarrow w_i \exp(\eta_t \beta_i(x_t)) / \sum_{j=1}^m w_j \exp(\eta_t \beta_j(x_t)),
(b) \mu_i \leftarrow \mu_i + \eta_t \beta_i(x_t) (x_t - \mu_i),
(c) C_i^{-1} \leftarrow C_i^{-1} + \eta_t \beta_i(x_t) \left( C_i^{-1} - C_i^{-1} (x_t - \mu_i)(x_t - \mu_i)^T C_i^{-1} \right).

To guarantee convergence of the on-line update one should use a diminishing learning rate, that is, \eta_t \to 0 as t \to \infty (for further motivation see [11]).
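A corresponding sketch of the on-line step (again ours; the diminishing schedule \eta_t = \eta_0 / (1 + t) mentioned in the comment is one common choice and only an assumption here):

    import numpy as np

    def je_online_step(x_t, w, mus, Cinvs, eta_t):
        # One on-line JE update after observing x_t (steps 1 and 2 above).
        # Returns updated copies of (w, mus, Cinvs); the caller supplies a
        # diminishing learning rate, e.g. eta_t = eta_0 / (1 + t).
        m, d = mus.shape
        dens = np.empty(m)
        for i in range(m):
            diff = x_t - mus[i]
            _, logdet_inv = np.linalg.slogdet(Cinvs[i])
            dens[i] = np.exp(0.5 * logdet_inv - 0.5 * d * np.log(2 * np.pi)
                             - 0.5 * diff @ Cinvs[i] @ diff)
        beta = dens / (w @ dens)                         # step 1
        w_new = w * np.exp(eta_t * beta)                 # step 2(a)
        w_new /= w_new.sum()
        mus_new = mus + eta_t * beta[:, None] * (x_t - mus)  # step 2(b)
        Cinvs_new = np.empty_like(Cinvs)
        for i in range(m):
            # Step 2(c); as in the batch case, we assume the newly
            # obtained mean is used here.
            diff = x_t - mus_new[i]
            rank1 = Cinvs[i] @ np.outer(diff, diff) @ Cinvs[i]
            Cinvs_new[i] = Cinvs[i] + eta_t * beta[i] * (Cinvs[i] - rank1)
        return w_new, mus_new, Cinvs_new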
\n\n6  Experiments \n\nWe conducted numerous experiments with the new update.  Due to the lack of space we de(cid:173)\nscribe here only two.  In the first experiment we compared the JE  update and EM in batch \nsettings.  We generated  data from Gaussian mixture distributions with varying number of \ncomponents  (m  = 2  to  100)  and  dimensions  (d  = 2 to  20).  Due  to  the  lack  of space \nwe describe here results obtained from only one setting.  In this setting the examples were \ngenerated  by a  mixture of 5 components with w  =  (0.4 , 0.3,0.2,0.05,0.05).  The mean \nvectors were the 5 standard unit vectors in the Euclidean space 1R5  and we set all of covari(cid:173)\nances matrices to the identity matrix.  We generated 1000 examples.  We then run EM and \nthe JE update with different learning rates (1]  = 1.9,1.5,1.1,1.05). To make sure that all \nthe runs will end in the same local maximum we fist  performed three EM iterations.  The \nresults are shown on the left hand side of Figure 1. In this setting, the JE update with high \nlearning rates achieves much faster convergence than EM. We  would like to note that this \nbehavior is by no means esoteric - most of our experiments data yielded similar results. \n\nWe found a different behavior in low dimensional settings.  On the right hand side of Fig(cid:173)\nure  1 we show convergence rate results for a mixture containing two components each of \nwhich  is  a  single dimension  Gaussians.  The mean  of the  two components were  located \n\n\f584 \n\nY.  Singer and M.  K.  Warmuth \n\nat  1 and  -1 with the same  variance of 2.  Thus,  there is a  significant \"overlap\"  between \nthe two Gaussian constituting the mixture.  The mixture weight vector was  (0 .5,0 .5).  We \ngenerated  50 examples according to this distribution and  initialized the parameters as fol(cid:173)\nlows:  1-'1  = 0.01,1-'2  = -0.01,  0\"1  = 0\"2  = 2,  WI  = W2  = 0.5 We  see that  initially \nEM  increases  the  likelihood much  faster  than  the JE  update.  Eventually,  the  JE update \nconvergences faster than EM when using a small learning rate (in the example appearing in \nFigure 1 we  set 'rJ  =  1.05).  However,  in this setting, the JE update diverges when learning \nrates larger than 'rJ  = 1.1 are used.  This behavior underscores the advantages of both meth(cid:173)\nods.  EM uses a fixed learning rate and is guaranteed to converge to a local maximum of the \nlikelihood, under conditions that typically hold for mixture of Gaussians [4,  12].  the JE up(cid:173)\ndate, on the other hand, encompasses a learning rate and in many settings it converges much \nfaster than EM. However, the superior performance in high dimensional cases demands its \nprice in low dimensional \"dense\" cases.  Namely, a very  conservative learning rate,  which \nis hard to tune,  need to be used.  In these cases,  EM  is a better alternative, offering almost \nthe same convergence rate without the need to tune any parameters. \n\nAcknowledgments  Thanks to Duncan  Herring for careful  proof reading and providing \nus with interesting data sets. \n\nReferences \n\n[1]  E. Bauer, D. Koller, and Y. Singer.  Update rules for parameter estimation in Bayesian \nnetworks.  In Proc.  of the 13th Annual Con! on Uncertainty in AI, pages 3-13, 1997. \n\n[2]  C.M. Bishop.  Neural Networks and Pattern Recognition.  Oxford Univ. Press,  1995. \n\n[3]  Thomas M. Cover and Joy A  Thomas.  Elements of Information Theory.  Wiley,  1991. \n\n[4]  AP. Dempster, N.M. Laird, and D.B. 
[4] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39:1-38, 1977.

[5] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

[6] D.P. Helmbold, J. Kivinen, and M.K. Warmuth. Worst-case loss bounds for sigmoided neurons. In Advances in Neural Information Processing Systems 7, pages 309-315, 1995.

[7] D.P. Helmbold, R.E. Schapire, Y. Singer, and M.K. Warmuth. A comparison of new and old algorithms for a mixture estimation problem. Machine Learning, 27(1), 1997.

[8] J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1-64, January 1997.

[9] J. Kivinen and M.K. Warmuth. Relative loss bounds for multidimensional regression problems. In Advances in Neural Information Processing Systems 10, 1997.

[10] R.A. Redner and H.F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2), 1984.

[11] D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.

[12] C.F. Wu. On the convergence properties of the EM algorithm. Annals of Statistics, 11:95-103, 1983.
", "award": [], "sourceid": 1525, "authors": [{"given_name": "Yoram", "family_name": "Singer", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}