{"title": "Learning Mixture Hierarchies", "book": "Advances in Neural Information Processing Systems", "page_first": 606, "page_last": 612, "abstract": null, "full_text": "Learning Mixture Hierarchies \n\nN uno Vasconcelos \n\nAndrew Lippman \n\nMIT Media Laboratory, 20 Ames St, EI5-320M, Cambridge, MA 02139, \n\n{nuno,lip} @media.mit.edu, \n\nhttp://www.media.mit.edwnuno \n\nAbstract \n\nThe hierarchical representation of data has various applications in do(cid:173)\nmains such as data mining, machine vision, or information retrieval. In \nthis paper we introduce an extension of the Expectation-Maximization \n(EM) algorithm that learns mixture hierarchies in a computationally ef(cid:173)\nficient manner. Efficiency is achieved by progressing in a bottom-up \nfashion, i.e. by clustering the mixture components of a given level in the \nhierarchy to obtain those of the level above. This cl ustering requires onl y \nknowledge of the mixture parameters, there being no need to resort to \nintermediate samples. In addition to practical applications, the algorithm \nallows a new interpretation of EM that makes clear the relationship with \nnon-parametric kernel-based estimation methods, provides explicit con(cid:173)\ntrol over the trade-off between the bias and variance of EM estimates, and \noffers new insights about the behavior of deterministic annealing methods \ncommonly used with EM to escape local minima of the likelihood. \n\n1 \n\nIntroduction \n\nThere are many practical applications of statistical learning where it is useful to characterize \ndata hierarchically. Such characterization can be done according to either top-down or \nbottom-up strategies. 
While the former start by generating a coarse model that roughly describes the entire space, and then successively refine the description by partitioning the space and generating sub-models for each of the regions in the partition, the latter start from a fine description, and successively agglomerate sub-models to generate the coarser descriptions at the higher levels in the hierarchy. \n\nBottom-up strategies are particularly useful when not all the data is available at once, or when the dataset is so big that processing it as a whole is computationally infeasible. This is the case for machine vision tasks such as object recognition, or the indexing of video databases. In object recognition, it is often convenient to determine not only which object is present in the scene but also its pose [2], a goal that can be attained by a hierarchical description where at the lowest level a model is learned for each object pose and all pose models are then combined into a generic model at the top level of the hierarchy. Similarly, for video indexing, one may be interested in learning a description for each frame and then combining these into shot descriptions or descriptions for some other sort of high-level temporal unit [6]. \n\nIn this paper we present an extension of the EM algorithm [1] for the estimation of hierarchical mixture models in a bottom-up fashion. It turns out that the attainment of this goal has consequences reaching far beyond the practical applications above. 
In particular, because a kernel density estimate can be seen as a limiting case of a mixture model (where a mixture component is superimposed on each sample), this extension establishes a direct connection between so-called parametric and non-parametric density estimation methods, making it possible to exploit results from the vast non-parametric smoothing literature [4] to improve the accuracy of parametric estimates. Furthermore, the original EM algorithm becomes a particular case of the one now presented, and a new intuitive interpretation becomes available for an important variation of EM (known as deterministic annealing) that had previously been derived from statistical physics. With regards to practical applications, the algorithm leads to computationally efficient methods for estimating density hierarchies capable of describing data at different resolutions. \n\n2 Hierarchical mixture density estimation \n\nOur model consists of a hierarchy of mixture densities, where the data at a given level is described by \n\nP(X) = sum_{k=1}^{C^l} pi_k^l P(X | z_k^l = 1, M_l),   (1) \n\nwhere l is the level in the hierarchy (l = 0 providing the coarsest characterization of the data), M_l the mixture model at this level, C^l the number of mixture components that compose it, pi_k^l the prior probability of the kth component, and z_k^l a binary variable that takes the value 1 if and only if the sample X was drawn from this component. The only restriction on the model is that if node j of level l + 1 is a child of node k of level l, then \n\npi_j^{l+1} = pi_{j|k}^{l+1} pi_k^l,   (2) \n\nwhere k is the parent of j in the hierarchy of hidden variables. \n\nThe basic problem is to compute the mixture parameters of the description at level l given the knowledge of the parameters at level l + 1. This can also be seen as a problem of clustering mixture components. 
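To make the model concrete, the following sketch (toy numbers and variable names of our own, not from the paper) evaluates a one-dimensional instance of equation (1) and checks that the tree constraint of equation (2) turns the conditional child priors into a valid mixture at the finer level:

```python
import numpy as np

def gauss(x, mu, var):
    """1-D Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, pis, mus, variances):
    """Equation (1): P(x) = sum_k pi_k P(x | z_k = 1, M_l)."""
    return sum(p * gauss(x, m, v) for p, m, v in zip(pis, mus, variances))

# level-l (parent) priors pi_k^l
pi_parent = np.array([0.6, 0.4])
# each level-(l+1) component's parent k, and its conditional prior pi_{j|k}^{l+1}
parent_of = np.array([0, 0, 1, 1])
pi_cond = np.array([0.5, 0.5, 0.25, 0.75])

# equation (2): unconditional child priors pi_j^{l+1} = pi_{j|k}^{l+1} * pi_k^l
pi_child = pi_cond * pi_parent[parent_of]
assert np.isclose(pi_child.sum(), 1.0)  # the finer level is again a mixture

# level-(l+1) component parameters (illustrative)
mus = np.array([-2.0, -1.0, 1.0, 2.0])
variances = np.array([0.5, 0.5, 0.5, 0.5])
print(mixture_pdf(0.0, pi_child, mus, variances))
```

The same evaluation with `pi_parent`, its means and variances would give the coarse, level-l view of the same data.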
A straightforward solution would be to draw a sample from the mixture density at level l + 1 and simply run EM with the number of classes of level l to estimate the corresponding parameters. Such a solution would have at least two major limitations. First, there would be no guarantee that the constraint of equation (2) would be enforced, i.e. there would be no guarantee of structure in the resulting mixture hierarchy; and second, it would be computationally expensive, as all the models in the hierarchy would have to be learned from a large sample. In the next section, we show that this is really not necessary. \n\n3 Estimating mixture hierarchies \n\nThe basic idea behind our approach is, instead of generating a real sample from the mixture model at level l + 1, to consider a virtual sample generated from the same model, use EM to find the expressions for the parameters of the mixture model of level l that best explain this virtual sample, and establish a closed-form relationship between these parameters and those of the model at level l + 1. For this, we start by considering a virtual sample X = {X_1, ..., X_{C^{l+1}}} from M_{l+1}, where each of the X_i is a virtual sample from one of the C^{l+1} components of this model, with size M_i = pi_i^{l+1} N, where N is the total number of virtual points. \n\nWe next establish the likelihood for the virtual sample under the model M_l. For this, as is usual in the EM literature, we assume that samples from different blocks are independent, i.e. \n\nP(X | M_l) = prod_{i=1}^{C^{l+1}} P(X_i | M_l),   (3) \n\nbut, to ensure that the constraint of equation (2) is enforced, samples within the same block are assigned to the same component of M_l. 
Assuming further that, given the knowledge of the assignment, the samples are drawn independently from the corresponding mixture component, the likelihood of each block is given by \n\nP(X_i | M_l) = sum_{j=1}^{C^l} pi_j^l P(X_i | z_ij = 1, M_l) = sum_{j=1}^{C^l} pi_j^l prod_{m=1}^{M_i} P(x_i^m | z_ij = 1, M_l),   (4) \n\nwhere z_ij = z_i^{l+1} z_j^l is a binary variable with value one if and only if the block X_i is assigned to the jth component of M_l, and x_i^m is the mth data point in X_i. Combining equations (3) and (4) we obtain the incomplete data likelihood, under M_l, for the whole sample \n\nP(X | M_l) = prod_{i=1}^{C^{l+1}} sum_{j=1}^{C^l} pi_j^l prod_{m=1}^{M_i} P(x_i^m | z_ij = 1, M_l).   (5) \n\nThis equation is similar to the incomplete data likelihood of standard EM, the main difference being that instead of having a hidden variable for each sample point, we now have one for each sample block. The likelihood of the complete data is given by \n\nP(X, Z | M_l) = prod_{i=1}^{C^{l+1}} prod_{j=1}^{C^l} [pi_j^l P(X_i | z_ij = 1, M_l)]^{z_ij},   (6) \n\nwhere Z is a vector containing all the z_ij, and the log-likelihood becomes \n\nlog P(X, Z | M_l) = sum_{i=1}^{C^{l+1}} sum_{j=1}^{C^l} z_ij log(pi_j^l P(X_i | z_ij = 1, M_l)).   (7) \n\nRelying on EM to estimate the parameters of M_l leads to the following E-step \n\nh_ij = E[z_ij | X_i, M_l] = P(z_ij = 1 | X_i, M_l) = P(X_i | z_ij = 1, M_l) pi_j^l / sum_k P(X_i | z_ik = 1, M_l) pi_k^l.   (8) \n\nThe key quantity to compute is therefore P(X_i | z_ij = 1, M_l). Taking its logarithm, \n\nlog P(X_i | z_ij = 1, M_l) = M_i [ (1/M_i) sum_{m=1}^{M_i} log P(x_i^m | z_ij = 1, M_l) ] = M_i E_{M_{l+1},i}[log P(x | z_ij = 1, M_l)],   (9) \n\nwhere we have used the law of large numbers, and E_{M_{l+1},i}[x] is the expected value of x according to the ith mixture component of M_{l+1} (the one from which X_i was drawn). This is an easy computation for most densities commonly used in mixture modeling. 
It can be shown [5] that for the Gaussian case it leads to \n\nP(X_i | z_ij = 1, M_l) = [ G(mu_i^{l+1}; mu_j^l, Sigma_j^l) e^{-(1/2) trace((Sigma_j^l)^{-1} Sigma_i^{l+1})} ]^{M_i},   (10) \n\nwhere G(x; mu, Sigma) is the expression for a Gaussian with mean mu and covariance Sigma. \n\nThe M-step consists of maximizing \n\nQ = sum_{i=1}^{C^{l+1}} sum_{j=1}^{C^l} h_ij log(pi_j^l P(X_i | z_ij = 1, M_l))   (11) \n\nsubject to the constraint sum_j pi_j^l = 1. Once again, this is a relatively simple task for common mixture models and in [5] we show that for the Gaussian case it leads to the following parameter update equations \n\npi_j^l = sum_i h_ij / C^{l+1},   (12) \n\nmu_j^l = sum_i h_ij M_i mu_i^{l+1} / sum_i h_ij M_i,   (13) \n\nSigma_j^l = (1 / sum_i h_ij M_i) [ sum_i h_ij M_i Sigma_i^{l+1} + sum_i h_ij M_i (mu_i^{l+1} - mu_j^l)(mu_i^{l+1} - mu_j^l)^T ].   (14) \n\nNotice that neither equation (10) nor equations (12) to (14) depend explicitly on the underlying sample X_i; all can be computed directly from the parameters of M_{l+1}. The algorithm is thus very efficient from a computational standpoint, as the number of mixture components in M_{l+1} is typically much smaller than the size of the sample at the bottom of the hierarchy. \n\n4 Relationships with standard EM \n\nThere are interesting relationships between the algorithm derived above and the standard EM procedure. The first thing to notice is that by making M_i = 1 and Sigma_i^{l+1} = 0, the E and M-steps become those obtained by applying standard EM to the sample composed of the points mu_i^{l+1}. Thus, standard EM can be seen as a particular case of the new algorithm, one that learns a two-level mixture hierarchy. An initial estimate is first obtained at the bottom of this hierarchy by placing a Gaussian with zero covariance on top of each data point, the model at the second level being then computed from this estimate. 
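One full iteration of the Gaussian E- and M-steps above (equations (8), (10), (12)-(14)) can be sketched as follows; this is our own illustrative code, not the authors' implementation, and the log-domain stabilization of the E-step is an implementation choice of ours:

```python
import numpy as np

def gauss_logpdf(x, mu, S):
    """Log-density of a Gaussian with mean mu and covariance S."""
    d = len(mu)
    diff = x - mu
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(S))
                   + diff @ np.linalg.inv(S) @ diff)

def hier_em_step(pi1, mu1, S1, pi0, mu0, S0, N=1000):
    """One EM iteration clustering the C^{l+1} components (pi1, mu1, S1)
    into C^l components (pi0, mu0, S0).  Only the level-(l+1) parameters
    are used; no samples are drawn."""
    C1, C0 = len(pi1), len(pi0)
    M = pi1 * N  # virtual block sizes M_i = pi_i^{l+1} N
    # E-step: log of pi_j^l P(X_i | z_ij = 1, M_l), equations (8) and (10)
    logh = np.empty((C1, C0))
    for i in range(C1):
        for j in range(C0):
            logh[i, j] = (np.log(pi0[j])
                          + M[i] * (gauss_logpdf(mu1[i], mu0[j], S0[j])
                                    - 0.5 * np.trace(np.linalg.inv(S0[j]) @ S1[i])))
    logh -= logh.max(axis=1, keepdims=True)  # stabilize before exponentiating
    h = np.exp(logh)
    h /= h.sum(axis=1, keepdims=True)        # normalized responsibilities h_ij
    # M-step, equations (12)-(14)
    pi_new = h.sum(axis=0) / C1
    w = h * M[:, None]
    w /= w.sum(axis=0, keepdims=True)        # w_ij = h_ij M_i / sum_i h_ij M_i
    mu_new = w.T @ mu1
    S_new = np.empty_like(S0)
    for j in range(C0):
        diffs = mu1 - mu_new[j]
        S_new[j] = sum(w[i, j] * (S1[i] + np.outer(diffs[i], diffs[i]))
                       for i in range(C1))
    return pi_new, mu_new, S_new
```

For instance, clustering four well-separated 1-D components into two parents recovers one parent per pair, with the parent covariance absorbing both the child covariances and the spread of the child means, as equation (14) dictates.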
The fact that the estimate at the bottom level is nothing more than a kernel estimate with zero bandwidth suggests that other choices of the kernel bandwidth may lead to better overall EM estimates. \n\nUnder this interpretation, the Sigma_i^{l+1} become free parameters that can be used to control the smoothness of the density estimates, and the whole procedure is equivalent to the composition of three steps: 1) find the kernel density estimate that best fits the sample under analysis, 2) draw a larger virtual sample from that density, and 3) compute EM estimates from this larger sample. In section 5, we show that this can lead to significant improvements in estimation accuracy, particularly when the initial sample is small, the free parameters allowing explicit control over the trade-off between the bias and variance of the estimator. \n\nAnother interesting relationship between the hierarchical method and standard EM can be derived by investigating the role of the size of the underlying virtual sample (which determines M_i) on the estimates. Assuming M_i constant, M_i = M for all i, it factors out of all summations in equations (12) to (14), the contributions of numerator and denominator canceling each other. In this case, the only significance of the choice of M is its impact on the E-step. Assuming, as before, that Sigma_i^{l+1} = 0, we once again have the EM algorithm, but where the class-conditional likelihoods of the E-step are now raised to the Mth power. If M is seen as the inverse of temperature, both the E and M steps become those of standard EM under deterministic annealing (DA) (1) [3]. \n\nThe DA process is therefore naturally derived from our hierarchical formulation, which gives it a new interpretation that is significantly simpler and more intuitive than those derived from statistical physics. At the start of the process M is set to zero, i.e. 
no virtual samples are drawn from the Gaussians superimposed on the real dataset, and there is no virtual data. Thus, the assignments h_ij of the E-step simply become the prior mixing proportions pi_j^l, and the M-step simply sets the parameters of all Gaussians in the model to the sample mean and sample covariance of the real sample. As M increases, the number of virtual points drawn from each Gaussian also increases, and for M = 1 we have a single point that coincides with the point in the real training sample. We therefore obtain the standard EM equations. Increasing M further will make the E-step assignments harder (in the limit of M = infinity each point is assigned to a single mixture component) because a larger virtual probability mass is attached to each real point, leading to much higher certainty with regards to the reliability of the assignment. \n\nOverall, while in the beginning of the process the reduced size of the virtual sample allows the points in the real sample to switch from mixture to mixture easily, as M is increased the switching becomes much less likely. The \"exploratory\" nature of the initial iterations drives the process towards solutions that are globally good, therefore allowing it to escape local minima. \n\n5 Experimental results \n\nIn this section, we present experimental results that illustrate the properties of the hierarchical EM algorithm now proposed. We start with a simple example that illustrates how the algorithm can be used to estimate hierarchical mixtures. \n\nFigure 1: Mixture hierarchy derived from the model shown on the left. The plot relative to each level of the hierarchy is superimposed on a sample drawn from this model. Only the one-standard-deviation contours are shown for each Gaussian. 
\n\nThe plot on the left of Figure 1 presents a Gaussian mixture with 16 uniformly weighted \ncomponents. A sample with 1000 points was drawn from this model, and the algorithm \nused to find the best descriptions for it at three resolutions (mixtures with 16, 4, and 2 \nGaussian). These descriptions are shown in the figure. Notice how the mixture hierarchy \nnaturally captures the various levels of structure exhibited by the data. \n\nThis example suggests how the algorithm could be useful for applications such as object \nrecognition or image retrieval. Suppose that each of the Gaussians in the leftmost plot of \n\nIDA is a technique drawn from analogies with statistical physics that avoids local maxima of \nthe likelihood function (in which standard EM can get trapped) by perfonning a succession of \noptimizations at various temperatures [31. \n\n\fLearning Mixture Hierarchies \n\n611 \n\n( \nI'\u00b7 \nj \n\n'~~~~--H~~W~~~~ \n\no-.I \u2022 ...,r'Itol \n\nFigure 2: Object recognition task. Left: 8 of the 100 objects in the database. Right: computational \nsavings achieved with hierarchical recognition vs full search. \n\nthe figure describes how a given pose of a given object populates a 2-D feature space on \nwhich object recognition is to be perfonned. In this case, higher levels in the hierarchical \nrepresentation provide a more generic description of the object. E.g. each of the Gaussians \nin the model shown in the middle of the figure might provide a description for all the poses \nin which the camera is on the same quadrant of the viewing sphere, while those in the \nmodel shown in the right might represent views from the same hemisphere. The advantage, \nfor recognition or retrieval, of relying on a hierarchal structure is that the search can be \nperfonned first at the highest resolution, where it is much less expensive, only the best \nmatches being considered at the subsequent levels. 
\n\nFigure 2 illustrates the application of hierarchical mixture modeling to a real object recog(cid:173)\nnition task. Shown on the left side of the figure are 8 objects from the 100 contained in the \nColumbia object database [2]. The database consists of 72 views (obtained by positioning \nthe camera in 5\u00b0 intervals along a circle on the viewing sphere), which were evenly sepa(cid:173)\nrated into a training and a test set. A set of features was computed for each image, and a \nhierarchical model was then learned for each object in the resulting feature space. While \nthe process could be extended to any number of levels, here we only report on the case of \na two-level hierarchy: at the bottom each image is described by a mixture of 8 Gaussians, \nand at the top each mixture (also with 8 Gaussians) describes 3 consecutive views. Thus, \nthe entire training set is described by 3600 mixtures at the bottom resolution and 1200 at \nthe top. \n\nGiven an image of an object to recognize, recognition takes place by computing its projection \ninto the feature space, measuring the likelihood of the resulting sample according to each \nof the models in the database, and choosing the most likely. The complexity of the process \nis proportional to the database size. The plot on the left of Figure 2 presents the recognition \naccuracy achieved with the hierarchical representation vs the corresponding complexity, \nshown as a percent of the complexity required by full search. The full-search accuracy is \nin this case 90%, and is also shown as a straight line in the graph. As can be seen from the \nfigure, the hierarchical search achieves the full search accuracy with less than 40% of its \ncomplexity. We are now repeating this experiments with deeper trees, where we expect the \ngains to be even more impressive. \n\nWe finalize by reporting on the impact of smoothing on the quality of EM estimates. 
\nFor this, we conducted the following Monte Carlo experiment: 1) draw 200 datasets S_i, i = 1, ..., 200, from the model shown on the left of Figure 1; 2) fit each dataset with EM; 3) measure the correlation coefficient rho_i, i = 1, ..., 200, between each of the EM fits and the original model; and 4) compute the sample mean rho_bar and variance sigma_rho. The correlation coefficient is defined by rho_i = integral f(x) f_i(x) dx / sqrt( integral f^2(x) dx integral f_i^2(x) dx ), where f(x) is the true model and f_i(x) the ith estimate, and can be computed in closed form for Gaussian mixtures. The experiment was repeated with various dataset sizes and various degrees of smoothing (by setting the bandwidth of the underlying Gaussian kernel to sigma_k I for various values of sigma_k). \n\nFigure 3: Results of the Monte Carlo experiment described in the text. Left: rho_bar as a function of sigma_k. Right: sigma_rho as a function of sigma_k. The various curves in each graph correspond to different sample sizes. \n\nFigure 3 presents the results of this experiment. It is clear, from the graph on the left, that smoothing can have a significant impact on the quality of the EM estimates. This impact is largest for small samples, where smoothing can provide up to a two-fold improvement in estimation accuracy, but can be found even for large samples. \n\nThe kernel bandwidth allows control over the trade-off between the bias and variance of the estimates. When sigma_k is zero (standard EM), bias is small but variance can be large, as illustrated by the graph on the right of the figure. 
As sigma_k is increased, variance decreases at the cost of an increase in bias (the reason why, for large sigma_k, all lines in the graph on the left meet at the same point regardless of the sample size). The point where rho_bar is highest is the point at which the bias-variance trade-off is optimal. Operating at this point leads to a much smaller dependence of the accuracy of the estimates on the sample size or, conversely, the need for much smaller samples to achieve a given degree of accuracy. \n\nReferences \n\n[1] A. Dempster, N. Laird, and D. Rubin. Maximum-likelihood from Incomplete Data via the EM Algorithm. J. of the Royal Statistical Society, B-39, 1977. \n\n[2] H. Murase and S. Nayar. Visual Learning and Recognition of 3-D Objects from Appearance. International Journal of Computer Vision, 14:5-24, 1995. \n\n[3] K. Rose, E. Gurewitz, and G. Fox. Vector Quantization by Deterministic Annealing. IEEE Trans. on Information Theory, Vol. 38, July 1992. \n\n[4] J. Simonoff. Smoothing Methods in Statistics. Springer-Verlag, 1996. \n\n[5] N. Vasconcelos and A. Lippman. Learning Mixture Hierarchies. Technical report, MIT Media Laboratory, 1998. Available from ftp://ftp.media.mit.edu/pub/nuno/HierMix.ps.gz. \n\n[6] N. Vasconcelos and A. Lippman. Content-based Pre-Indexed Video. In Proc. Int. Conf. Image Processing, Santa Barbara, California, 1997. \n", "award": [], "sourceid": 1543, "authors": [{"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}, {"given_name": "Andrew", "family_name": "Lippman", "institution": null}]}