{"title": "The Infinite Gaussian Mixture Model", "book": "Advances in Neural Information Processing Systems", "page_first": 554, "page_last": 560, "abstract": null, "full_text": "The Infinite Gaussian Mixture Model \n\nCarl Edward Rasmussen \n\nDepartment of Mathematical Modelling \n\nTechnical University of Denmark \n\nBuilding 321, DK-2800 Kongens Lyngby, Denmark \n\ncarl@imm.dtu.dk  http://bayes.imm.dtu.dk \n\nAbstract \n\nIn a Bayesian mixture model it is not necessary a priori to limit the num(cid:173)\nber of components to be finite.  In this paper an infinite Gaussian mixture \nmodel is  presented which neatly sidesteps the difficult problem of find(cid:173)\ning the \"right\" number of mixture components. Inference in the model is \ndone using an efficient parameter-free Markov Chain that relies entirely \non Gibbs sampling. \n\n1 \n\nIntroduction \n\nOne of the major advantages in the Bayesian methodology is  that \"overfitting\" is avoided; \nthus  the difficult task of adjusting model complexity  vanishes.  For neural networks,  this \nwas  demonstrated by Neal [1996]  whose  work on infinite networks led to  the reinvention \nand  popularisation of Gaussian Process  models  [Williams  &  Rasmussen,  1996].  In this \npaper  a  Markov  Chain  Monte  Carlo  (MCMC)  implementation  of a  hierarchical  infinite \nGaussian  mixture model  is  presented.  Perhaps  surprisingly,  inference  in  such models  is \npossible using finite amounts of computation. \n\nSimilar models are  known  in statistics  as  Dirichlet Process mixture models  and  go  back \nto  Ferguson  [1973]  and  Antoniak  [1974].  Usually,  expositions  start  from  the  Dirichlet \nprocess itself [West et al,  1994]; here we derive the model as the limiting case of the well(cid:173)\nknown finite  mixtures.  Bayesian methods for  mixtures with  an unknown (finite)  number \nof components have been explored by Richardson &  Green [1997], whose methods are not \neasily extended to multivariate observations. \n\n2  Finite hierarchical mixture \n\nThe finite  Gaussian mixture model with  k  components may be written as: \n\nP(yl/l-l, ... ,/l-k,Sl,\u00b7 ..  ,Sk,7rl,\u00b7 .. , 7rk)  = L7rjN(/l-j,sjl), \n\nk \n\n(1) \n\nj=l \n\nwhere /l-j  are  the means,  Sj  the precisions (inverse variances),  7rj  the mixing proportions \n(which must be positive and sum to one) and N  is  a (normalised) Gaussian with specified \nmean and variance.  For simplicity, the exposition will initially assume scalar observations, \nn  of which  comprise  the  training  data  y  = {Yl, ... , Yn}.  First  we  will  consider these \nmodels for a fixed value of k, and later explore the properties in the limit where k  -+  00. \n\n\fThe Infinite Gaussian Mixture Model \n\n555 \n\nGibbs sampling is  a well known technique for generating samples from complicated mul(cid:173)\ntivariate distributions that  is  often used in Monte Carlo procedures.  In  its  simplest form, \nGibbs  sampling  is  used  to  update  each  variable  in  turn  from  its  conditional distribution \ngiven all other variables in the system. It can be shown that Gibbs sampling generates sam(cid:173)\nples from the joint distribution, and that the entire distribution is explored as the number of \nGibbs sweeps grows large. \n\nWe introduce stochastic indicator variables, Ci, one for each observation, whose role is  to \nencode  which  class  has  generated  the  observation;  the  indicators  take  on  values  1 ... k. 
In the following sections the priors on component parameters and hyperparameters will be specified, and the conditional distributions for these, which will be needed for Gibbs sampling, will be derived. In general the forms of the priors are chosen to have (hopefully) reasonable modelling properties, with an eye to mathematical convenience (through the use of conjugate priors).

2.1 Component parameters

The component means, \mu_j, are given Gaussian priors:

p(\mu_j | \lambda, r) ~ \mathcal{N}(\lambda, r^{-1}),    (2)

whose mean, \lambda, and precision, r, are hyperparameters common to all components. The hyperparameters themselves are given vague Normal and Gamma priors:

p(\lambda) ~ \mathcal{N}(\mu_y, \sigma_y^2),    p(r) ~ \mathcal{G}(1, \sigma_y^{-2}) \propto r^{-1/2} \exp(-r \sigma_y^2 / 2),    (3)

where \mu_y and \sigma_y^2 are the mean and variance of the observations [1]. The shape parameter of the Gamma prior is set to unity, corresponding to a very broad (vague) distribution.

The conditional posterior distributions for the means are obtained by multiplying the likelihood from eq. (1) conditioned on the indicators, by the prior, eq. (2):

p(\mu_j | c, y, s_j, \lambda, r) ~ \mathcal{N}\Big( \frac{\bar{y}_j n_j s_j + \lambda r}{n_j s_j + r}, \frac{1}{n_j s_j + r} \Big),    \bar{y}_j = \frac{1}{n_j} \sum_{i: c_i = j} y_i,    (4)

where the occupation number, n_j, is the number of observations belonging to class j, and \bar{y}_j is the mean of these observations. For the hyperparameters, eq. (2) plays the role of the likelihood which together with the priors from eq. (3) give conditional posteriors of standard form:

p(\lambda | \mu_1, \dots, \mu_k, r) ~ \mathcal{N}\Big( \frac{\mu_y \sigma_y^{-2} + r \sum_{j=1}^{k} \mu_j}{\sigma_y^{-2} + k r}, \frac{1}{\sigma_y^{-2} + k r} \Big),
p(r | \mu_1, \dots, \mu_k, \lambda) ~ \mathcal{G}\Big( k + 1, \Big[ \frac{1}{k + 1} \big( \sigma_y^2 + \sum_{j=1}^{k} (\mu_j - \lambda)^2 \big) \Big]^{-1} \Big).    (5)

The component precisions, s_j, are given Gamma priors:

p(s_j | \beta, w) ~ \mathcal{G}(\beta, w^{-1}),    (6)

whose shape, \beta, and mean, w^{-1}, are hyperparameters common to all components, with priors of inverse Gamma and Gamma form:

p(\beta^{-1}) ~ \mathcal{G}(1, 1),    p(w) ~ \mathcal{G}(1, \sigma_y^2).    (7)

[1] Strictly speaking, the priors ought not to depend on the observations. The current procedure is equivalent to normalising the observations and using unit priors. A wide variety of reasonable priors will lead to similar results.

The conditional posterior precisions are obtained by multiplying the likelihood from eq. (1) conditioned on the indicators, by the prior, eq. (6):

p(s_j | c, y, \mu_j, \beta, w) ~ \mathcal{G}\Big( \beta + n_j, \Big[ \frac{1}{\beta + n_j} \big( w \beta + \sum_{i: c_i = j} (y_i - \mu_j)^2 \big) \Big]^{-1} \Big).    (8)

For the hyperparameters, eq. (6) plays the role of likelihood which together with the priors from eq. (7) give:

p(w | s_1, \dots, s_k, \beta) ~ \mathcal{G}\Big( k \beta + 1, \Big[ \frac{1}{k \beta + 1} \big( \sigma_y^{-2} + \beta \sum_{j=1}^{k} s_j \big) \Big]^{-1} \Big),
p(\beta | s_1, \dots, s_k, w) \propto \Gamma(\beta/2)^{-k} \exp\big( - \tfrac{1}{2 \beta} \big) \, (\beta/2)^{(k \beta - 3)/2} \prod_{j=1}^{k} (s_j w)^{\beta/2} \exp(-\beta s_j w / 2).    (9)

The latter density is not of standard form, but it can be shown that p(\log(\beta) | s_1, \dots, s_k, w) is log-concave, so we may generate independent samples from the distribution for \log(\beta) using the Adaptive Rejection Sampling (ARS) technique [Gilks & Wild, 1992], and transform these to get values for \beta.
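The conjugate updates of eqs. (4) and (8) can be sketched in a few lines. This is a hedged illustration rather than the paper's code: it assumes NumPy, invents the function names, and maps the paper's G(shape, mean) notation onto the standard shape/rate Gamma parameterisation (eq. (3) fixes this mapping as shape/2 and rate shape/(2 mean)).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mu_j(y_j, s_j, lam, r):
    """Draw mu_j from its conditional posterior, eq. (4): a Gaussian with
    precision n_j*s_j + r and mean (n_j*s_j*ybar_j + r*lam)/(n_j*s_j + r)."""
    y_j = np.asarray(y_j, dtype=float)
    n_j = len(y_j)
    prec = n_j * s_j + r
    mean = (s_j * np.sum(y_j) + r * lam) / prec
    return rng.normal(mean, 1.0 / np.sqrt(prec))

def sample_s_j(y_j, mu_j, beta, w):
    """Draw s_j from its conditional posterior, eq. (8).  In the standard
    shape/rate parameterisation this is
    Gamma(shape=(beta + n_j)/2, rate=(beta*w + sum_i (y_i - mu_j)^2)/2),
    assuming the paper's G(a, m) denotes shape a/2 and rate a/(2m)."""
    y_j = np.asarray(y_j, dtype=float)
    n_j = len(y_j)
    shape = 0.5 * (beta + n_j)
    rate = 0.5 * (beta * w + np.sum((y_j - mu_j) ** 2))
    return rng.gamma(shape, 1.0 / rate)       # numpy's gamma takes (shape, scale)

# y_j would be the observations currently assigned to class j, e.g.:
y_j = np.array([0.8, 1.1, 1.3])
mu_j = sample_mu_j(y_j, s_j=2.0, lam=0.0, r=0.1)
s_j = sample_s_j(y_j, mu_j, beta=1.0, w=1.0)
```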
The mixing proportions, \pi_j, are given a symmetric Dirichlet (also known as multivariate beta) prior with concentration parameter \alpha/k:

p(\pi_1, \dots, \pi_k | \alpha) ~ Dirichlet(\alpha/k, \dots, \alpha/k) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/k)^k} \prod_{j=1}^{k} \pi_j^{\alpha/k - 1},    (10)

where the mixing proportions must be positive and sum to one. Given the mixing proportions, the prior for the occupation numbers, n_j, is multinomial and the joint distribution of the indicators becomes:

p(c_1, \dots, c_n | \pi_1, \dots, \pi_k) = \prod_{j=1}^{k} \pi_j^{n_j},    n_j = \sum_{i=1}^{n} \delta_{\mathrm{Kronecker}}(c_i, j).    (11)

Using the standard Dirichlet integral, we may integrate out the mixing proportions and write the prior directly in terms of the indicators:

p(c_1, \dots, c_n | \alpha) = \int p(c_1, \dots, c_n | \pi_1, \dots, \pi_k) \, p(\pi_1, \dots, \pi_k) \, d\pi_1 \cdots d\pi_k
= \frac{\Gamma(\alpha)}{\Gamma(\alpha/k)^k} \int \prod_{j=1}^{k} \pi_j^{n_j + \alpha/k - 1} d\pi_1 \cdots d\pi_k = \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{j=1}^{k} \frac{\Gamma(n_j + \alpha/k)}{\Gamma(\alpha/k)}.    (12)

In order to be able to use Gibbs sampling for the (discrete) indicators, c_i, we need the conditional prior for a single indicator given all the others; this is easily obtained from eq. (12) by keeping all but a single indicator fixed:

p(c_i = j | c_{-i}, \alpha) = \frac{n_{-i,j} + \alpha/k}{n - 1 + \alpha},    (13)

where the subscript -i indicates all indexes except i and n_{-i,j} is the number of observations, excluding y_i, that are associated with component j. The posteriors for the indicators are derived in the next section.

Lastly, a vague prior of inverse Gamma shape is put on the concentration parameter \alpha:

p(\alpha^{-1}) ~ \mathcal{G}(1, 1)  =>  p(\alpha) \propto \alpha^{-3/2} \exp(-1/(2\alpha)).    (14)

The likelihood for \alpha may be derived from eq. (12), which together with the prior from eq. (14) gives:

p(n_1, \dots, n_k | \alpha) = \frac{\alpha^k \Gamma(\alpha)}{\Gamma(n + \alpha)},    p(\alpha | k, n) \propto \frac{\alpha^{k - 3/2} \exp(-1/(2\alpha)) \Gamma(\alpha)}{\Gamma(n + \alpha)}.    (15)

Notice that the conditional posterior for \alpha depends only on the number of observations, n, and the number of components, k, and not on how the observations are distributed among the components. The distribution p(\log(\alpha) | k, n) is log-concave, so we may efficiently generate independent samples from this distribution using ARS.

3 The infinite limit

So far, we have considered k to be a fixed finite quantity. In this section we will explore the limit k -> \infty and make the final derivations regarding the conditional posteriors for the indicators. For all the model variables except the indicators, the conditional posteriors for the infinite limit are obtained by substituting for k the number of classes that have data associated with them, k_rep, in the equations previously derived for the finite model. For the indicators, letting k -> \infty in eq. (13), the conditional prior reaches the following limits:

components where n_{-i,j} > 0:    p(c_i = j | c_{-i}, \alpha) = \frac{n_{-i,j}}{n - 1 + \alpha},
all other components combined:    p(c_i \neq c_{i'} \text{ for all } i' \neq i | c_{-i}, \alpha) = \frac{\alpha}{n - 1 + \alpha}.    (16)

This shows that the conditional class prior for components that are associated with other observations is proportional to the number of such observations; the combined prior for all other classes depends only on \alpha and n. Notice how the analytical tractability of the integral in eq. (12) is essential, since it allows us to work directly with the (finite number of) indicator variables, rather than the (infinite number of) mixing proportions.
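A small numerical illustration of the limiting conditional prior, eq. (16), assuming NumPy; the function and argument names are invented for this sketch.

```python
import numpy as np

def indicator_prior(counts_minus_i, alpha, n):
    """Conditional prior over classes for one indicator, eq. (16).

    counts_minus_i holds the occupation numbers n_{-i,j} of the represented
    classes, excluding observation i.  Returns the prior probabilities of the
    represented classes followed by the combined prior mass of all
    (infinitely many) unrepresented classes."""
    counts_minus_i = np.asarray(counts_minus_i, dtype=float)
    denom = n - 1 + alpha
    return np.append(counts_minus_i / denom, alpha / denom)

# e.g. nine observations besides y_i, spread over three classes, alpha = 1
print(indicator_prior([5, 3, 1], alpha=1.0, n=10))
# -> [0.5, 0.3, 0.1, 0.1]; the entries sum to one
```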
We may now combine the likelihood from eq. (1) conditioned on the indicators with the prior from eq. (16) to obtain the conditional posteriors for the indicators:

components for which n_{-i,j} > 0:
p(c_i = j | c_{-i}, \mu_j, s_j, \alpha) \propto p(c_i = j | c_{-i}, \alpha) \, p(y_i | \mu_j, s_j, c_{-i}) \propto \frac{n_{-i,j}}{n - 1 + \alpha} \, s_j^{1/2} \exp(-s_j (y_i - \mu_j)^2 / 2),
all other components combined:
p(c_i \neq c_{i'} \text{ for all } i' \neq i | c_{-i}, \lambda, r, \beta, w, \alpha) \propto p(c_i \neq c_{i'} \text{ for all } i' \neq i | c_{-i}, \alpha) \int p(y_i | \mu_j, s_j) \, p(\mu_j, s_j | \lambda, r, \beta, w) \, d\mu_j \, ds_j.    (17)

The likelihood for components with observations other than y_i currently associated with them is Gaussian with component parameters \mu_j and s_j. The likelihood pertaining to the currently unrepresented classes (which have no parameters associated with them) is obtained through integration over the prior distribution for these. Note that we need not differentiate between the infinitely many unrepresented classes, since their parameter distributions are all identical. Unfortunately, this integral is not analytically tractable; I follow Neal [1998], who suggests sampling from the priors (which are Gaussian and Gamma shaped) in order to generate a Monte Carlo estimate of the probability of "generating a new class". Notice that this approach effectively generates parameters (by sampling from the prior) for the classes that are unrepresented. Since this Monte Carlo estimate is unbiased, the resulting chain will sample from exactly the desired distribution, no matter how many samples are used to approximate the integral; I have found that using a single sample works fairly well in many applications.

In detail, there are three possibilities when computing conditional posterior class probabilities, depending on the number of observations associated with the class:

if n_{-i,j} > 0: there are other observations associated with class j, and the posterior class probability is as given by the top line of eq. (17).

if n_{-i,j} = 0 and c_i = j: observation y_i is currently the only observation associated with class j; this is a peculiar situation, since there are no other observations associated with the class, but the class still has parameters. It turns out that this situation should be handled as an unrepresented class, but rather than sampling for the parameters, one simply uses the existing class parameters; consult [Neal, 1998] for a detailed derivation.

unrepresented classes: values for the mixture parameters are picked at random from the prior for these parameters, which is Gaussian for \mu_j and Gamma shaped for s_j.

Now that all classes have parameters associated with them, we can easily evaluate their likelihoods (which are Gaussian) and the priors, which take the form n_{-i,j}/(n - 1 + \alpha) for components with observations other than y_i associated with them, and \alpha/(n - 1 + \alpha) for the remaining class. When hitherto unrepresented classes are chosen, a new class is introduced in the model; classes are removed when they become empty.
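Below is a minimal sketch of one indicator update following the three cases above, using a single draw from the prior for the unrepresented classes as suggested by Neal [1998]. It is an illustration rather than the paper's implementation: the data structures (dicts keyed by integer class labels), function names and the pruning convention are assumptions made for this sketch, and the common factor 1/(n - 1 + \alpha) is dropped because it cancels when normalising.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_lik(y, mu, s):
    """N(y | mu, s^{-1}) for a scalar observation y, mean mu and precision s."""
    return np.sqrt(s / (2.0 * np.pi)) * np.exp(-0.5 * s * (y - mu) ** 2)

def update_indicator(i, y, c, mu, s, alpha, lam, r, beta, w):
    """Resample the indicator c[i].  mu and s map integer class labels to
    parameters, c is an integer label array; a single prior draw stands in
    for the intractable integral over unrepresented classes."""
    counts = {j: int(np.sum(c == j)) for j in mu}
    counts[c[i]] -= 1                                  # n_{-i,j}: leave out y_i

    labels, weights = [], []
    for j, n_mij in counts.items():
        if n_mij > 0:                                  # case 1: shared class
            labels.append(j)
            weights.append(n_mij * gaussian_lik(y[i], mu[j], s[j]))

    if counts[c[i]] == 0:
        # case 2: y_i is alone in its class; reuse the existing parameters,
        # but give the class the "new class" prior mass alpha
        new_label, mu_new, s_new = c[i], mu[c[i]], s[c[i]]
    else:
        # case 3: a brand-new class, parameters drawn from the prior
        new_label = max(mu) + 1
        mu_new = rng.normal(lam, 1.0 / np.sqrt(r))
        s_new = rng.gamma(beta / 2.0, 2.0 / (beta * w))   # prior G(beta, w^{-1})
    labels.append(new_label)
    weights.append(alpha * gaussian_lik(y[i], mu_new, s_new))

    probs = np.array(weights) / np.sum(weights)        # 1/(n-1+alpha) cancels
    choice = labels[rng.choice(len(labels), p=probs)]
    if choice not in mu:                               # a new class was accepted
        mu[choice], s[choice] = mu_new, s_new
    c[i] = choice
    # classes left empty after a full sweep can be pruned by the caller
```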
4 Inference; the "spirals" example

To illustrate the model, we use the 3 dimensional "spirals" dataset from [Ueda et al., 1998], containing 800 data points, plotted in figure 1. Five data points are generated from each of 160 isotropic Gaussians, whose means follow a spiral pattern.

Figure 1: The 800 cases from the three dimensional spirals data. The crosses represent a single (random) sample from the posterior for the mixture model. The k_rep = 20 represented classes account for n/(n + \alpha) ~ 99.6% of the mass. The lines indicate 2 std. dev. in the Gaussian mixture components; the thickness of the lines represents the mass of the class. To the right, histograms for 100 samples from the posterior for k_rep, \alpha and \beta are shown.

4.1 Multivariate generalisation

The generalisation to multivariate observations is straightforward. The means, \mu_j, and precisions, s_j, become vectors and matrices respectively, and their prior (and posterior) distributions become multivariate Gaussian and Wishart. Similarly, the hyperparameter \lambda becomes a vector (multivariate Gaussian prior) and r and w become matrices with Wishart priors. The \beta parameter stays scalar, with the prior on (\beta - D + 1)^{-1} being Gamma with mean 1/D, where D is the dimension of the dataset. All other specifications stay the same. Setting D = 1 recovers the scalar case discussed in detail.

4.2 Inference

The mixture model is started with a single component, and a large number of Gibbs sweeps are performed, updating all parameters and hyperparameters in turn by sampling from the conditional distributions derived in the previous sections. In figure 2 the auto-covariance for several quantities is plotted, which reveals a maximum correlation-length of about 270. Then 30000 iterations are performed for modelling purposes (taking 18 minutes of CPU time on a Pentium PC): 3000 steps initially for "burn-in", followed by 27000 to generate 100 roughly independent samples from the posterior (spaced evenly 270 apart). In figure 1, the represented components of one sample from the posterior are visualised with the data. To the right of figure 1 we see that the posterior number of represented classes is very concentrated around 18-20, and the concentration parameter takes values around \alpha ~ 3.5, corresponding to only \alpha/(n + \alpha) ~ 0.4% of the mass of the predictive distribution belonging to unrepresented classes. The shape parameter \beta takes values around 5-6, which gives the "effective number of points" contributed from the prior to the covariance matrices of the mixture components.

4.3 The predictive distribution

Given a particular state in the Markov Chain, the predictive distribution has two parts: the represented classes (which are Gaussian) and the unrepresented classes. As when updating the indicators, we may choose to approximate the unrepresented classes by a finite mixture of Gaussians, whose parameters are drawn from the prior. The final predictive distribution is an average over the (e.g. 100) samples from the posterior. For the spirals data this density has roughly 1900 components for the represented classes plus however many are used to represent the remaining mass. I have not attempted to show this distribution. However, one can imagine a smoothed version of the single sample shown in figure 1, from averaging over models with slightly varying numbers of classes and parameters. The (small) mass from the unrepresented classes spreads diffusely over the entire observation range.
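The predictive density described above can be sketched as follows, assuming each stored posterior sample carries the represented component parameters and occupation numbers together with the hyperparameters. This is a hedged scalar-observation sketch; the dictionary field names are illustrative assumptions, not the paper's data format.

```python
import numpy as np

def predictive_density(x, samples, n, n_unrep=1, rng=None):
    """Approximate predictive density at points x, averaged over posterior
    samples.  Each sample is a dict with arrays 'mu', 's', 'nj' for the
    represented components and scalars 'alpha', 'lam', 'r', 'beta', 'w'.
    The mass alpha/(n + alpha) of the unrepresented classes is approximated
    by n_unrep components drawn from the prior."""
    rng = rng or np.random.default_rng()
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    dens = np.zeros(x.shape[0])
    for smp in samples:
        mu, s = np.asarray(smp['mu']), np.asarray(smp['s'])
        wts = np.asarray(smp['nj'], dtype=float)
        a = smp['alpha']
        # components standing in for the unrepresented classes, drawn from the prior
        mu_u = rng.normal(smp['lam'], 1.0 / np.sqrt(smp['r']), n_unrep)
        s_u = rng.gamma(smp['beta'] / 2.0, 2.0 / (smp['beta'] * smp['w']), n_unrep)
        mu = np.concatenate([mu, mu_u])
        s = np.concatenate([s, s_u])
        wts = np.concatenate([wts, np.full(n_unrep, a / n_unrep)]) / (n + a)
        comp = np.sqrt(s / (2.0 * np.pi)) * np.exp(-0.5 * s * (x - mu) ** 2)
        dens += comp @ wts                 # weights n_j/(n+alpha) and alpha/(n+alpha)
    return dens / len(samples)
```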
[Figure 2: left panel, auto-covariance versus iteration lag time for log(k_rep), log(\alpha) and related quantities; right panel, number of represented components versus Monte Carlo iteration.]

Figure 2: The left plot shows the auto-covariance for various parameters in the Markov chain, based on 10^5 iterations. Only the number of represented classes, k_rep, has a significant correlation; the effective correlation length is approximately 270, computed as the sum of covariance coefficients between lag -1000 and 1000. The right hand plot shows the number of represented classes growing during the initial phase of sampling. The initial 3000 iterations are discarded.
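The effective correlation length quoted in the caption (a sum of normalised auto-covariance coefficients over lags -1000 to 1000) could be computed along the following lines. This is a hedged sketch in NumPy, not code from the paper; the function name and the thinning rule in the usage comment are assumptions.

```python
import numpy as np

def effective_correlation_length(x, max_lag=1000):
    """Effective correlation length of a Markov chain trace x, computed as in
    Figure 2: the sum of normalised auto-covariance coefficients over lags
    -max_lag .. max_lag (max_lag must be smaller than len(x))."""
    x = np.asarray(x, dtype=float)
    x = x - np.mean(x)
    var = np.mean(x ** 2)
    rho = [np.mean(x[:len(x) - lag] * x[lag:]) / var for lag in range(1, max_lag + 1)]
    return 1.0 + 2.0 * np.sum(rho)        # lag 0 plus both signs of every other lag

# trace = [...]   # e.g. k_rep recorded at each of the 10^5 Gibbs sweeps
# tau = effective_correlation_length(np.array(trace))
# keeping every ceil(tau)-th sample after burn-in gives roughly independent draws
```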
5 Conclusions

The infinite hierarchical Bayesian mixture model has been reviewed and extended into a practical method. It has been shown that good performance (without overfitting) can be achieved on multidimensional data. An efficient and practical MCMC algorithm with no free parameters has been derived and demonstrated on an example. The model is fully automatic, without needing specification of parameters of the (vague) prior. This corroborates the falsity of the common misconception that "the only difference between Bayesian and non-Bayesian methods is the prior, which is arbitrary anyway...".

Further tests on a variety of problems reveal that the infinite mixture model produces densities whose generalisation is highly competitive with other commonly used methods. Current work is under way to explore performance on high dimensional problems, in terms of computational efficiency and generalisation.

The infinite mixture model has several advantages over its finite counterpart: 1) in many applications, it may be more appropriate not to limit the number of classes, 2) the number of represented classes is automatically determined, 3) the use of MCMC effectively avoids local minima which plague mixtures trained by optimisation based methods, e.g. EM [Ueda et al., 1998], and 4) it is much simpler to handle the infinite limit than to work with finite models with unknown sizes, as in [Richardson & Green, 1997] or traditional approaches based on extensive cross-validation. The Bayesian infinite mixture model solves simultaneously several long-standing problems with mixture models for density estimation.

Acknowledgments

Thanks to Radford Neal for helpful comments, and to Naonori Ueda for making the spirals data available. This work is funded by the Danish Research Councils through the Computational Neural Network Center (CONNECT) and the THOR Center for Neuroinformatics.

References

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics 2, 1152-1174.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics 1, 209-230.

Gilks, W. R. and P. Wild (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics 41, 337-348.

Neal, R. M. (1996). Bayesian Learning for Neural Networks, Lecture Notes in Statistics No. 118, New York: Springer-Verlag.

Neal, R. M. (1998). Markov chain sampling methods for Dirichlet process mixture models. Technical Report 9815, Department of Statistics, University of Toronto. http://www.cs.toronto.edu/~radford/mixmc.abstract.html.

Richardson, S. and P. Green (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B 59, 731-792.

Ueda, N., R. Nakano, Z. Ghahramani and G. E. Hinton (1998). SMEM Algorithm for Mixture Models, NIPS 11, MIT Press.

West, M., P. Muller and M. D. Escobar (1994). Hierarchical priors and mixture models with applications in regression and density estimation. In P. R. Freeman and A. F. M. Smith (editors), Aspects of Uncertainty, pp. 363-386. John Wiley.

Williams, C. K. I. and C. E. Rasmussen (1996). Gaussian Processes for Regression, in D. S. Touretzky, M. C. Mozer and M. E. Hasselmo (editors), NIPS 8, MIT Press.
", "award": [], "sourceid": 1745, "authors": [{"given_name": "Carl", "family_name": "Rasmussen", "institution": null}]}