{"title": "Fast Large-scale Mixture Modeling with Component-specific Data Partitions", "book": "Advances in Neural Information Processing Systems", "page_first": 2289, "page_last": 2297, "abstract": "Remarkably easy implementation and guaranteed convergence have made the EM algorithm one of the most used algorithms for mixture modeling. On the downside, the E-step is linear in both the sample size and the number of mixture components, making it impractical for large-scale data. Based on the variational EM framework, we propose a fast alternative that uses component-specific data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. Our approach builds on previous work, but is significantly faster and scales much better in the number of mixture components. We demonstrate this speedup by experiments on large-scale synthetic and real data.", "full_text": "Fast Large-scale Mixture Modeling with Component-specific Data Partitions

Bo Thiesson* (Microsoft Research)    Chong Wang*† (Princeton University)

Abstract

Remarkably easy implementation and guaranteed convergence have made the EM algorithm one of the most used algorithms for mixture modeling. On the downside, the E-step is linear in both the sample size and the number of mixture components, making it impractical for large-scale data. Based on the variational EM framework, we propose a fast alternative that uses component-specific data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. Our approach builds on previous work, but is significantly faster and scales much better in the number of mixture components. We demonstrate this speedup by experiments on large-scale synthetic and real data.

1 Introduction

Probabilistic mixture modeling [7] has been widely used for density estimation and clustering applications.
The Expectation-Maximization (EM) algorithm [4, 11] is one of the most used methods for this task for clear reasons: elegant formulation of an iterative procedure, ease of implementation, and guaranteed monotone convergence for the objective. On the other hand, the EM algorithm also has some acknowledged shortcomings. In particular, the E-step is linear in both the number of data points and the number of mixture components, and therefore computationally impractical for large-scale applications. Our work was motivated by a large-scale geo-spatial problem, demanding a mixture model of a customer base (a huge number of data points) for competing businesses (a large number of mixture components), as the basis for site evaluation (where to locate a new store).

Several approximation schemes for EM have been proposed to address the scalability problem, e.g. [2, 12, 14, 10, 17, 16], to mention a few. Besides [17, 16], none of these variants has both an E-step that is truly sub-linear in sample size and also enjoys provable convergence for a well-defined objective function. More details are discussed in Section 5. Our work is inspired by the "chunky EM" algorithm in [17, 16], a smart application of the variational EM framework [11], where a lower bound on the objective function increases at each iteration and convergence is guaranteed.

An E-step in standard EM calculates expected sufficient statistics under mixture-component membership probabilities calculated for each individual data point given the most recent model estimate. The variational EM framework alters the E-step to use sufficient statistics calculated under a variational distribution instead. In chunky EM, the speedup is obtained by using a variational distribution with shared (variational) membership probabilities for blocks of data (in an exhaustive partition of the entire data into non-overlapping blocks).
The chunky EM starts from a coarse partition of the data and gradually refines the partition until convergence.

However, chunky EM does not scale well in the number of components, since all components share the same partition. The individual components are different: in order to obtain membership probabilities of appropriate quality, one component may need fine-grained blocks in one area of the data space, while another component is perfectly fine with coarse blocks in that area. Chunky EM expands the shared partition to match the needed granularity for the most demanding mixture component in any area of the data space, which might unnecessarily increase the computational cost. Here, we derive a principled variation, called component-specific EM (CS-EM), that allows component-specific partitions. We demonstrate a significant performance improvement over standard and chunky EM for experiments on synthetic data and the mentioned customer-business data.

*Equal contributors. †Work done during internship at Microsoft Research.

2 Background: Variational and Chunky EM

Variational EM. Given a set of i.i.d. data $x \triangleq \{x_1, \cdots, x_N\}$, we are interested in estimating the parameters $\theta = \{\eta_{1:K}, \pi_{1:K}\}$ in the $K$-component mixture model with log-likelihood function

$$\mathcal{L}(\theta) = \sum\nolimits_n \log \sum\nolimits_k p(x_n|\eta_k)\pi_k. \quad (1)$$

For this task, we consider a variational generalization [11] of standard EM [4], which maximizes a lower bound of $\mathcal{L}(\theta)$ through the introduction of a variational distribution $q$. We assume that the variational distribution factorizes in accordance with data points, i.e., $q = \prod_n q_n$, where each $q_n$ is an arbitrary discrete distribution over mixture components $k = 1, \ldots, K$.
We can lower bound $\mathcal{L}(\theta)$ by multiplying each $p(x_n|\eta_k)\pi_k$ in (1) with $\frac{q_n(k)}{q_n(k)}$ and applying Jensen's inequality to get

$$\mathcal{L}(\theta) \geq \sum\nolimits_n \sum\nolimits_k q_n(k)\left[\log p(x_n|\eta_k)\pi_k - \log q_n(k)\right] \quad (2)$$
$$= \mathcal{L}(\theta) - \sum\nolimits_n \mathrm{KL}\left(q_n \,\|\, p(\cdot|x_n, \theta)\right) \triangleq \mathcal{F}(\theta, q), \quad (3)$$

where $p(\cdot|x_n, \theta)$ defines the posterior distribution of membership probabilities and $\mathrm{KL}(q\|p)$ is the Kullback-Leibler (KL) divergence between $q$ and $p$. The variational EM algorithm alternates the following two steps, i.e. coordinate ascent on $\mathcal{F}(\theta, q)$, until convergence.

E-step: $q^{t+1} = \arg\max_q \mathcal{F}(\theta^t, q)$,  M-step: $\theta^{t+1} = \arg\max_\theta \mathcal{F}(\theta, q^{t+1})$.

If $q$ is not restricted in any form, the E-step produces $q^{t+1} = \prod_n p(\cdot|x_n, \theta^t)$, because the KL-divergence is the only term in (3) depending on $q$. The variational EM is in this case equivalent to the standard EM, and hence produces the maximum likelihood (ML) estimate. In the following, we consider certain ways of restricting $q$ to attain a speedup over standard EM, implying that the minimum KL-divergence between $q_n$ and $p(\cdot|x_n, \theta)$ is not necessarily zero. Still, the variational EM defines a convergent algorithm, which instead optimizes a lower bound of the log-likelihood.

Chunky EM. The chunky EM algorithm [17, 16] falls into the framework of variational EM algorithms. In chunky EM, the variational distribution $q = \prod_n q_n$ is restricted according to a partition of the data into exhaustive and mutually exclusive blocks. For a given partition, if data points $x_i$ and $x_j$ are in the same block, then $q_i = q_j$. The intuition is that data points in the same block are somewhat similar and can be treated in the same way, which leads to computational savings in the E-step. If $M$ is the number of blocks in a given partition, the E-step for chunky EM has cost $O(KM)$, whereas in standard EM the cost is $O(KN)$.
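To make the $O(KN)$ cost of the unrestricted E-step concrete, the following minimal sketch computes the posterior membership probabilities $p(k|x_n, \theta)$ for a one-dimensional Gaussian mixture; the function and variable names are our own illustration, not code from the paper:

```python
import math

def e_step(x, pi, mu, sigma):
    """Standard O(KN) E-step for a 1-D Gaussian mixture: for every data
    point x_n, compute q_n(k) = pi_k N(x_n; mu_k, sigma_k^2) / normalizer."""
    q = []
    for xn in x:
        # Unnormalized component weights pi_k * p(x_n | eta_k).
        weights = [pk * math.exp(-0.5 * ((xn - mk) / sk) ** 2) / sk
                   for pk, mk, sk in zip(pi, mu, sigma)]
        z = sum(weights)  # per-point normalizer (mixture density up to const)
        q.append([w / z for w in weights])
    return q
```

The double loop over all $N$ points and $K$ components is exactly the cost that the block-based variants below avoid.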
The speedup can be tremendous for $M \ll N$. The speedup is gained by a trade-off between the tightness of the lower bound for the log-likelihood and the restrictiveness of constraints. Chunky EM starts from a coarse partition and iteratively refines it. This refinement process always produces a tighter bound, since restrictions on the variational distribution are gradually relaxed. The chunky EM algorithm stops when refining any block in a partition will not significantly increase the lower bound.

3 Component-specific EM

In chunky EM, all mixture components share the same data partition. However, for a particular block of data, the variation in membership probabilities differs across components, resulting in varying differences from the equality-constrained variational probabilities. Roughly, the variation in membership probabilities is greatest for components closer to a block of data; in particular, for components far away the membership probabilities are all so small that the variation is insignificant. This intuition suggests that we might gain a computational speedup if we create component-specific data partitions, where a component pays more attention to nearby data (fine-grained blocks) than to data far away (coarser blocks). Let $M_k$ be the number of data blocks in the partition for component $k$. The complexity for the E-step is then $O(\sum_k M_k)$, compared to $O(KM)$ in chunky EM. Our conjecture is that we can lower bound the log-likelihood equally well with $\sum_k M_k$ significantly smaller than $KM$, resulting in a much faster E-step. Since our model maintains different partitions for different mixture components, we call it the component-specific EM algorithm (CS-EM).

Figure 1: Trees 1-5 represent 5 mixture components with individual tree-consistent partitions (B1-B5) indicated by the black nodes.
The bottom-right figure is the corresponding MPT, where {·} indicates the component marks and a, b, c, d, e, f, g enumerate all the marked nodes. This MPT encodes all the component-specific information for the 5 mixtures.

Main Algorithm. Figure 2 (on p. 6) shows the main flow of CS-EM. Starting from a coarse partition for each component (see Section 4.1 for examples), CS-EM runs variational EM to convergence and then selectively refines the component-specific partitions. This process continues until further refinements will not significantly improve the lower bound. Sections 3.1-3.5 provide a detailed description of basic concepts in support of this brief outline of the main structure of the algorithm.

3.1 Marked Partition Trees

It is convenient to organize the data into a pre-computed partition tree, where a node in the tree represents the union of the data represented by its children. Individual data points are not actually stored in each node; rather, the sufficient statistics necessary for our estimation operations are pre-computed and stored here. (We discuss these statistics in Section 3.3.) Any hierarchical decomposition of data that ensures some degree of similarity between data in a block is suitable for constructing a partition tree. We exemplify our work by using KD-trees [9]. Creating a KD-tree and storing the sufficient statistics in its nodes has cost $O(N \log N)$, where $N$ is the number of data points.

We will in the following consider tree-consistent partitions, where each data block in a partition corresponds to exactly one node for a cut (possibly across different levels) in the tree; see Figure 1. Let us now define a marked partition tree (MPT), a simple encoding of all component-specific partitions, as follows. Let $\mathcal{B}_k$ be the data partition (a set of blocks) in the tree-consistent partition for mixture component $k$.
In Figure 1, for example, $\mathcal{B}_1$ is the partition into data blocks associated with the nodes {e, c, d}. In the shared data partition tree used to generate the component-specific partitions, we mark the corresponding nodes for the data blocks in each $\mathcal{B}_k$ by the component identifier $k$. Each node $v$ in the tree will in this way contain a (possibly empty) set of component marks, denoted by $\mathcal{K}_v$. The MPT is now the subtree obtained by pruning all unmarked nodes without marked descendants from the tree. Figure 1 shows an example of an MPT. This example is special in the sense that all nodes in the MPT are marked. In general, an MPT may have unmarked nodes at any location above the leaves. For example, in chunky EM, the component-specific partitions are the same for each mixture component. In this case, only the leaves in the MPT are marked, with each leaf marked by all mixture components. The following important property for an MPT holds since all component-specific partitions are constructed with respect to the same data partition tree.

Property 1. Let $\mathcal{T}$ denote an MPT. The marked nodes on a path from leaf to root in $\mathcal{T}$ mark exactly one data block from each of the $K$ component-specific data partitions.

In the following, it becomes important to identify the data block in a component-specific partition which embeds the block defined by a leaf. Let $\mathcal{L}$ denote the set of leaves in $\mathcal{T}$, and let $\mathcal{B}_{\mathcal{L}}$ denote a partition with data blocks $B_l \in \mathcal{B}_{\mathcal{L}}$ according to these leaves. We let $B_{k(l)}$ denote the specific $B_k \in \mathcal{B}_k$ with the property that $B_l \subseteq B_k$. Property 1 ensures that $B_{k(l)}$ exists for all $l, k$.

Example: In Figure 1, the path $a \to e \to g$ in turn marks the components $\mathcal{K}_a = \{3, 4\}$, $\mathcal{K}_e = \{1, 2\}$, and $\mathcal{K}_g = \{5\}$, and we see that each component is marked exactly once on this path, as stated in Property 1.
Accordingly, for the leaf $a$, $(B_{3(a)} = B_{4(a)}) \subseteq (B_{1(a)} = B_{2(a)}) \subseteq B_{5(a)}$. □

3.2 The Variational Distribution

Our variational distribution $q$ assigns the same variational membership probability to mixture component $k$ for all data points in a component-specific block $B_k \in \mathcal{B}_k$. That is,

$$q_n(k) = q_{B_k} \text{ for all } x_n \in B_k, \quad (4)$$

which we denote as the component-specific block constraint. Unlike chunky EM, we do not assume that the data partition $\mathcal{B}_k$ is the same across different mixture components. The extra flexibility complicates the estimation of $q$ in the E-step. This is the central challenge of our algorithm.

To further drive intuition behind the E-step complication, let us make the sum-to-one constraint for the variational distributions $q_n(\cdot)$ explicit. That is, $\sum_k q_n(k) = 1$ for all data points $n$, which according to the above block constraint and using Property 1 can be reformulated as the $|\mathcal{L}|$ constraints

$$\sum\nolimits_k q_{B_{k(l)}} = 1 \text{ for all } l \in \mathcal{L}. \quad (5)$$

Notice that since $q_{B_k}$ can be associated with an internal node in $\mathcal{T}$, it may be the case that $q_{B_{k(l)}}$ represents the same $q_{B_k}$ across different constraints in (5). In fact,

$$q_{B_{k(l)}} = q_{B_k} \text{ for all } l \in \{l \in \mathcal{L} \mid B_l \subseteq B_k\}, \quad (6)$$

implying that the constraints in (5) are intertwined according to the nested structure given by $\mathcal{T}$. The closer a data block $B_k$ is to the root of $\mathcal{T}$, the more constraints simultaneously involve the same $q_{B_k}$.

Example: Consider the MPT in Figure 1. Here, $q_{B_{5(a)}} = q_{B_{5(b)}} = q_{B_{5(c)}} = q_{B_{5(d)}}$, and hence the density for component 5 is the same across all four sum-to-one constraints. Similarly, $q_{B_{1(a)}} = q_{B_{1(b)}}$, so the density is the same for component 1 in the two constraints associated with leaves $a$ and $b$. □
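The partition tree of Section 3.1 simply caches, per node, the block statistics that the later E- and M-steps consume. As a minimal one-dimensional stand-in for a KD-tree (median splits; class and field names are our own, not the paper's), each node might store the count, sum, and sum of squares of the data it covers:

```python
class Node:
    """Partition-tree node caching per-block statistics (1-D sketch):
    the count |B|, the sum of x, and the sum of x*x over the block."""

    def __init__(self, points, leaf_size=2):
        self.n = len(points)
        self.sum_x = sum(points)
        self.sum_xx = sum(p * p for p in points)
        self.children = []
        if self.n > leaf_size:
            # Median split, the 1-D analogue of a KD-tree split.
            pts = sorted(points)
            mid = self.n // 2
            self.children = [Node(pts[:mid], leaf_size),
                             Node(pts[mid:], leaf_size)]
```

Any tree-consistent partition then corresponds to a cut through such a tree, and a block's statistics are read off its node in O(1) regardless of the block's size.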
3.3 Efficient Variational E-step

Accounting for the component-specific block constraint in (4), the lower bound $\mathcal{F}(\theta, q)$ in Eq. (2) can be expressed as a sum of local parts, $\mathcal{F}(\theta, q_{B_k})$, as follows:

$$\mathcal{F}(\theta, q) = \sum\nolimits_k \sum\nolimits_{B_k \in \mathcal{B}_k} |B_k| \, q_{B_k} (g_{B_k} + \log \pi_k - \log q_{B_k}) = \sum\nolimits_k \sum\nolimits_{B_k \in \mathcal{B}_k} \mathcal{F}(\theta, q_{B_k}), \quad (7)$$

where we have defined the block-specific geometric mean

$$g_{B_k} = \langle \log p(x|\eta_k) \rangle_{B_k} = \sum\nolimits_{x \in B_k} \log p(x|\eta_k) / |B_k|. \quad (8)$$

We integrate the sum-to-one constraints in (5) into the lower bound in (7) by using the standard principle of Lagrange duality (see, e.g., [1]). Accordingly, we construct the Lagrangian

$$\mathcal{F}(\theta, q, \lambda) = \sum\nolimits_k \sum\nolimits_{B_k} \mathcal{F}(\theta, q_{B_k}) + \sum\nolimits_l \lambda_l \Big( \sum\nolimits_k q_{B_{k(l)}} - 1 \Big),$$

where $\lambda \triangleq \{\lambda_1, \ldots, \lambda_L\}$ are the Lagrange multipliers for the constraints in Eq. (5). Recall the relationship between $q_{B_k}$ and $q_{B_{k(l)}}$ in (6). By setting $\partial \mathcal{F}(\theta, q, \lambda) / \partial q_{B_k} = 0$, we obtain

$$q_{B_k}(\lambda) = \exp\Big( (1/|B_k|) \sum\nolimits_{l: B_l \subseteq B_k} \lambda_l - 1 \Big) \, \pi_k \exp(g_{B_k}). \quad (9)$$

Solving the dual optimization problem $\lambda^* = \arg\min_\lambda \mathcal{F}(\theta, q(\lambda), \lambda)$ now leads to the primal solution given by $q^*_{B_k} = q_{B_k}(\lambda^*)$.¹

For chunky EM, the E-step is straightforward, because $B_{k(l)} = B_l$ and therefore $\sum_{l: B_l \subseteq B_{k(l)}} \lambda_l = \lambda_l$ for all $k = 1, \ldots, K$.
Substituting (9) into the sum-to-one constraints in (5) reveals that each $\lambda_l$ can be solved independently, leading to the following closed-form solution for $q_{B_{k(l)}}$:

$$\lambda^*_l = |B_l| \Big( 1 - \log \sum\nolimits_k \pi_k \exp(g_{B_{k(l)}}) \Big), \quad q^*_{B_{k(l)}} = \pi_k \exp(g_{B_{k(l)}}) / Z, \quad (10)$$

where $Z = \sum_k \pi_k \exp(g_{B_{k(l)}})$ is a normalizing constant.

CS-EM does not enjoy a similarly simple optimization, because of the intertwined constraints described in Section 3.2. Fortunately, we can still obtain a closed-form solution. Essentially, we use the nesting structure of the constraints to eliminate Lagrange multipliers from the solution one at a time until only one is left, in which case the optimization is easily solved. We describe the basic approach here and defer the technical details (and pseudo-code) to the supplement.

Consider a leaf node $l \in \mathcal{L}$ and recall that $\mathcal{K}_l$ denotes the components with $B_{k(l)} = B_l$ in their partitions. The sum-to-one constraint in (5) that is associated with leaf $l$ can therefore be written as

$$\sum\nolimits_{k \in \mathcal{K}_l} q_{B_{k(l)}} + \sum\nolimits_{k \notin \mathcal{K}_l} q_{B_{k(l)}} = 1.$$

Furthermore, for all $k \in \mathcal{K}_l$, the $q_{B_{k(l)}}$, as defined in (9), is a function of the same $\lambda_l$. Accordingly,

$$q_l \triangleq \sum\nolimits_{k \in \mathcal{K}_l} q_{B_{k(l)}} = \exp(\lambda_l / |B_l| - 1) \sum\nolimits_{k \in \mathcal{K}_l} \pi_k \exp(g_{B_{k(l)}}). \quad (11)$$

¹Notice that Eq. (9) implies that the positivity constraints $q_n(k) \geq 0$ are automatically satisfied during estimation.

Now, consider $l$'s leaf-node sibling, $l'$. For example, in Figure 1, node $l = a$ and $l' = b$. The two leaves share the same path from their parent to the root in $\mathcal{T}$. Hence, using Property 1, it must be the case that $B_{k(l)} = B_{k(l')}$ for $k \notin \mathcal{K}_l$. The two sum-to-one constraints, one for each leaf, therefore imply that $q_l = q_{l'}$.
Using (11), it now follows that

$$\lambda_{l'} = |B_{l'}| \Big( \lambda_l / |B_l| + \log \sum\nolimits_{k \in \mathcal{K}_l} \pi_k \exp(g_{B_{k(l)}}) - \log \sum\nolimits_{k' \in \mathcal{K}_{l'}} \pi_{k'} \exp(g_{B_{k'(l')}}) \Big) \triangleq f(\lambda_l).$$

Thus, we can replace $\lambda_{l'}$ with $f(\lambda_l)$ in all $q_{B_k}$ expressions. Further analysis (detailed in the supplement) shows how we more efficiently account for this parameter reduction and continue the process, now considering the parent node a new "leaf" node once all children have been processed. When reaching the root, every $q_{B_k}$ expression on the path from $l$ only involves the single $\lambda_l$, and the optimal $\lambda^*_l$ can therefore be found analytically by solving the corresponding sum-to-one constraint in (5). Following this, all optimal $q^*_{B_k}$ are found by inserting $\lambda^*_l$ into the reduced $q_{B_k}$ expressions.

Finally, it is important to notice that $g_{B_k}$ is the only data-dependent part in the above E-step solution. It is therefore key to the computational efficiency of the CS-EM algorithm that $g_{B_k}$ can be calculated from pre-computed statistics, which is in fact the case for the large class of exponential family distributions. These are the statistics that are stored in the nodes of the MPT.

Example: Let $p(x|\eta_k)$ be an exponential family distribution

$$p(x|\eta_k) = h(x) \exp(\eta_k^T T(x) - A(\eta_k)), \quad (12)$$

where $\eta_k$ is the natural parameter, $h(x)$ is the reference function, $T(x)$ is the sufficient statistic, and $A(\eta_k)$ is the normalizing constant. Then

$$g_{B_k} = \langle \log h(x) \rangle_{B_k} + \eta_k^T \langle T(x) \rangle_{B_k} - A(\eta_k),$$

where $\langle \log h(x) \rangle_{B_k}$ and $\langle T(x) \rangle_{B_k}$ are the statistics that we pre-compute for (8).
In particular, if $p(x|\eta_k) = \mathcal{N}_d(\mu_k, \Sigma_k)$, a Gaussian distribution, then

$$h(x) = 1, \quad T(x) = (x, xx^T), \quad \eta_k = \big(\Sigma_k^{-1}\mu_k, -\tfrac{1}{2}\Sigma_k^{-1}\big), \quad A(\eta_k) = \tfrac{1}{2}\big(d \log(2\pi) + \log|\Sigma_k| + \mu_k^T \Sigma_k^{-1} \mu_k\big),$$

and the statistics $\langle \log h(x) \rangle_{B_k} = 0$ and $\langle T(x) \rangle_{B_k} = (\langle x \rangle_{B_k}, \langle xx^T \rangle_{B_k})$ can be pre-computed. □

3.4 Efficient Variational M-step

In the variational M-step, the model parameters $\theta = \{\eta_{1:K}, \pi_{1:K}\}$ are updated by maximizing Eq. (7) w.r.t. $\theta$ under the constraint $\sum_k \pi_k = 1$. Hereby, the update is

$$\pi_k \propto \sum\nolimits_{B_k \in \mathcal{B}_k} |B_k| \, q_{B_k}, \quad \eta_k = \arg\max\nolimits_{\eta_k} \sum\nolimits_{B_k \in \mathcal{B}_k} |B_k| \, q_{B_k} \, g_{B_k}. \quad (13)$$

Thus, the M-step can be efficiently computed using the pre-computed sufficient statistics as well.

Example: If $p(x|\eta_k)$ has the exponential family form in Eq. (12), $\eta_k$ is obtained by solving

$$\eta_k = \arg\max\nolimits_{\eta_k} \Big( \sum\nolimits_{B_k \in \mathcal{B}_k} q_{B_k} \sum\nolimits_{x \in B_k} T(x) \Big)^T \eta_k - \Big( \sum\nolimits_{B_k \in \mathcal{B}_k} |B_k| \, q_{B_k} \Big) A(\eta_k).$$

In particular, if $p(x|\eta_k) = \mathcal{N}_d(\mu_k, \Sigma_k)$, then

$$\mu_k = \Big( \sum\nolimits_{B_k \in \mathcal{B}_k} |B_k| \, q_{B_k} \langle x \rangle_{B_k} \Big) / (N\pi_k), \quad \Sigma_k = \Big( \sum\nolimits_{B_k \in \mathcal{B}_k} |B_k| \, q_{B_k} \langle xx^T \rangle_{B_k} \Big) / (N\pi_k) - \mu_k \mu_k^T. \; \square$$

3.5 Efficient Variational R-step

Given the current component-specific data partitions, as marked in the MPT $\mathcal{T}$, a refining step (R-step) selectively refines these partitions. Any refinement enlarges the family of variational distributions, and therefore always tightens the optimal lower bound for the log-likelihood. We define a refinement unit as the refinement of one data block in the current partition for one component in the model.
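Returning to the Gaussian M-step example above: given per-block statistics $(|B_k|, \langle x \rangle_{B_k}, \langle xx \rangle_{B_k})$ and block probabilities $q_{B_k}$, the updates reduce to weighted sums over blocks rather than over data points. A one-dimensional sketch for a single component (our own function and variable names, not the paper's code):

```python
def m_step(blocks, q):
    """Block-weighted 1-D Gaussian M-step for one component (sketch).
    blocks: list of (size, mean_x, mean_xx) tuples, one per data block B_k
    q:      variational probability q_{B_k} for each block
    Returns (sum |B_k| q_{B_k},  mu_k,  var_k), where the first value is
    proportional to the mixture weight pi_k."""
    w = [size * qb for (size, _, _), qb in zip(blocks, q)]
    n_k = sum(w)  # = N * pi_k after normalization across components
    mu = sum(wi * mx for wi, (_, mx, _) in zip(w, blocks)) / n_k
    var = sum(wi * mxx for wi, (_, _, mxx) in zip(w, blocks)) / n_k - mu * mu
    return n_k, mu, var
```

The cost is linear in the number of blocks, not in the number of data points, which is what makes the block-based M-step cheap once the statistics are cached in the tree.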
The efficiency of CS-EM is affected by the number of refinement units performed at each R-step. With too few units we spend too much time on refining, and with too many units some of the refinements may be far from optimal and therefore unnecessarily slow down the algorithm. We have empirically found K refinement units at each R-step to be a good choice. This introduces K new free variational parameters, which is similar to a refinement step in chunky EM. However, chunky EM refines the same data block across all components, which is not the case in CS-EM.

Figure 2: The CS-EM algorithm.
1: Initialization: build KD-tree, set initial MPT, set initial θ, run E-step to set q, set t, s = 0, compute F_t, F_s using (7).
2: repeat
3:   repeat
4:     Run variational E-step and M-step.
5:     Set t ← t + 1 and compute F_t using (7).
6:   until (F_t − F_{t−1})/(F_t − F_0) < 10^{−4}.
7:   Run variational R-step.
8:   Set s ← s + 1 and F_s = F_t.
9: until (F_s − F_{s−1})/(F_s − F_0) < 10^{−4}.

Figure 3: Variational R-step algorithm.
1: Initialize priority queue Q favoring high ΔF_{v,k} values.
2: for each marked node v in T do
3:   Compute q̄ via E-step with constraints as in (14).
4:   for all k ∈ K_v do
5:     Insert candidate (v, k) into Q according to ΔF_{v,k}.
6:   end for
7: end for
8: Select K top-ranked (v, k) in Q for refinement.

Ideally, an R-step should select the refinement units leading to optimal improvement for $\mathcal{F}$. Good candidates can be found by performing a single E-step for each candidate and then selecting the units that improve $\mathcal{F}$ the most. This demands the evaluation of an E-step for each of the $\sum_k M_k$ possible refinement units.
Exact evaluation of this many full E-steps is prohibitively expensive, and we therefore instead approximate these refinement-guiding E-steps by a local computation scheme, based on the intuition that refining a block for a specific component mostly affects components with similar local partition structures. The algorithm is described in Figure 3, with details as follows.

Consider moving all component marks for $v \in \mathcal{T}$ to its children $ch(v)$, where each child $u \in ch(v)$ receives a copy. Let $\bar{\mathcal{T}}$ denote the altered MPT, and $\bar{\mathcal{K}}_v, \bar{\mathcal{K}}_u$ denote the sets of marks at $v, u \in \bar{\mathcal{T}}$. Hence, $\bar{\mathcal{K}}_v = \emptyset$ and $\bar{\mathcal{K}}_u = \mathcal{K}_u \cup \mathcal{K}_v$. To approximate the new variational distribution $\bar{q}$, we fix the value of each $\bar{q}_{B_{k(l)}}$, with $k \notin \bar{\mathcal{K}}_u$ and $l \in \mathcal{L}$, to the value obtained for the distribution $q$ before the refinement. In this case, the sum-to-one constraints for $\bar{q}$ simplify as

$$\sum\nolimits_{k \in \bar{\mathcal{K}}_u} \bar{q}_{B_{k(l)}} + R_l = 1 \text{ for all } l \in \mathcal{L}, \quad (14)$$

with $R_l = 1 - \sum_{k \notin \bar{\mathcal{K}}_u} q_{B_{k(l)}}$ being the fixed values. Notice that the constraints in (14) involve no free parameters for any leaf $l$ not under $u$, and that $q_{B_{k(l)}} = q_{B_{k(u)}}$ and $\bar{q}_{B_{k(l)}} = \bar{q}_{B_{k(u)}}$ for $k \in \bar{\mathcal{K}}_u$ and any leaf $l$ under $u$. The constraints in (14) therefore reduce to the following $|ch(v)|$ independent constraints:

$$\sum\nolimits_{k \in \bar{\mathcal{K}}_u} \bar{q}_{B_{k(u)}} + R_u = 1 \text{ for all } u \in ch(v).$$

Each $\bar{q}_{B_{k(u)}}, k \in \bar{\mathcal{K}}_u$, now has a local closed-form solution similar to (10), with $Z = \sum_{k \in \bar{\mathcal{K}}_u} \pi_k \exp(g_{B_{k(u)}}) / (1 - R_u)$.
The improvement to $\mathcal{F}$ that is achieved by the refinement-guiding E-step for the refinement unit refining data block $v$ for component $k$ is denoted $\Delta\mathcal{F}_{v,k}$, and can be computed as

$$\Delta\mathcal{F}_{v,k} = \sum\nolimits_{u \in ch(v)} \mathcal{F}(\theta, \bar{q}_{B_{k(u)}}) - \mathcal{F}(\theta, q_{B_{k(v)}}).$$

This improvement is computed for all possible refinement units, and the K highest-scoring units are then selected in the R-step. Notice that this selective refinement step will most likely not refine the same data block for all components and therefore creates component-specific partitions.

Example: In Figure 1, node $e$ and its children $\{a, b\}$ are marked $\mathcal{K}_e = \{1, 2\}$ and $\mathcal{K}_a = \mathcal{K}_b = \{3, 4\}$. For the two candidate refinement units associated with $e$, we have $\bar{\mathcal{K}}_e = \emptyset$ and $\bar{\mathcal{K}}_a = \bar{\mathcal{K}}_b = \{1, 2, 3, 4\}$. With $q_{5(u)}$ held fixed, we will for each child $u \in \{a, b\}$ optimize $\bar{q}_{B_{k(u)}}, k = 1, 2, 3, 4$, and following this, $(e, 1)$ and $(e, 2)$ are inserted into the priority queue of candidates according to their $\Delta\mathcal{F}_{v,k}$ values. □

4 Experiments

In this section we provide a systematic evaluation of CS-EM, chunky EM, and standard EM on synthetic data, as well as a comparison between CS-EM and chunky EM on the business-customer data mentioned in Section 1.
(Standard EM is too slow to be included in the latter experiment.)

4.1 Experimental setup

For the synthetic experiments, we generated random training and test data sets from Gaussian mixture models (GMMs) by varying one (in a single case two) of the following default settings: #data points N = 100,000, #mixture components K = 40, #dimensions d = 2, and c-separation² c = 2.

²A GMM is c-separated [3] if, for any $i \neq j$, $f(i, j) \triangleq \|\mu_i - \mu_j\|^2 / \max(\lambda_{\max}(\Sigma_i), \lambda_{\max}(\Sigma_j)) \geq dc^2$, where $\lambda_{\max}(\Sigma)$ denotes the maximum eigenvalue of $\Sigma$. We only require that $\mathrm{Median}[f(i, j)] \geq dc^2$.

The (proprietary) business-customer data was obtained through collaboration with PitneyBowes Inc. and Yellowpages.com LLC. For the experiments on this data, N = 6.5 million and d = 2, corresponding to the latitude and longitude for potential customers in Washington state. The basic assumption is that potential customers act as rational consumers and frequent the somewhat closest business locations to purchase a good or service. The locations of competing stores of a particular type, in this way, correspond to fixed centers for components in a mixture model. (A less naive model, with the penetration level for a good or service and the relative attractiveness of stores, is the object of related research, but is not important for the computational feasibility studied here.)

The synthetic experiments are initialized as follows. After constructing the KD-tree, the first tree level containing at least K nodes (level $\lceil \log_2 K \rceil$) is used as the initial data partition for both chunky EM and all components in CS-EM. For all algorithms (including standard EM), we randomly chose K data blocks from the initial partition and initialized parameters for the individual mixture components accordingly. Mixture weights are initialized with a uniform distribution.
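The c-separation criterion from footnote 2 is easy to check directly. The sketch below tests the strict all-pairs version (recall that the paper only requires the median of $f(i, j)$ to exceed $dc^2$); the function and argument names are our own illustration:

```python
def is_c_separated(mus, max_eigs, c, d):
    """Check the all-pairs c-separation criterion:
    ||mu_i - mu_j||^2 / max(lmax_i, lmax_j) >= d * c^2 for all i != j.
    mus:      list of component mean vectors (tuples)
    max_eigs: largest eigenvalue of each component's covariance matrix"""
    for i in range(len(mus)):
        for j in range(i + 1, len(mus)):
            dist2 = sum((a - b) ** 2 for a, b in zip(mus[i], mus[j]))
            if dist2 / max(max_eigs[i], max_eigs[j]) < d * c * c:
                return False
    return True
```

Larger c forces the components apart, which (as the results below show) makes the block approximations more accurate and increases the speedup.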
The experiments on the business-customer data are initialized in the same way, except that the component centers are fixed and the initial data blocks that cover these centers are used for initializing the remaining parameters.

For CS-EM we also considered an alternative initialization of data partitions, which better matches the rationale behind component-specific partitions. It starts from the CS-EM initialization and recursively, according to the KD-tree structure, merges two data blocks in a component-specific partition if the merge has little effect on that component.³ We name this variant CS-EM*.

4.2 Results

For the synthetic experiments, we compared the run-times for the competing algorithms to reach a parameter estimate of the same quality (and therefore similar clustering performance, not counting different local maxima), defined as follows. We recorded the log-likelihood for the test data at each iteration of the EM algorithm, and before each R-step in chunky EM and CS-EM. We ran all algorithms to convergence at level $10^{-4}$, and the test log-likelihood for the algorithm with the lowest value was chosen as the baseline.⁴ We then recorded the run-time for each algorithm to reach this baseline, and computed the EM-speedup factors for chunky EM, CS-EM, and CS-EM*, each defined as the standard EM run-time divided by the run-time for the alternative algorithm. We repeated all experiments with five different parameter initializations and report the averaged results.

Figure 4 shows the EM-speedups for the synthetic data. First of all, we see that both CS-EM and CS-EM* are significantly faster than chunky EM in all experiments. In general, the $\sum_k M_k$ variational parameters needed for the CS-EM algorithms are far fewer than the KM parameters needed for chunky EM in order to reach an estimate of the same quality.
For example, for the default experimental\nk Mk is 2.0 and 2.1 for, respectively, CS-EM and CS-EM\u2217. We also see\nthat there is no signi\ufb01cant difference in speedup between CS-EM and CS-EM\u2217. This observation can\nbe explained by the fact that the resulting component-speci\ufb01c data partitions greatly re\ufb01ne the initial\npartitions, and any computational speedup due to the smarter initial partition in CS-EM\u2217 is therefore\noverwhelmed. Hence, a simple initial partition, as in CS-EM, is suf\ufb01cient.\nFinally, similar to results already reported for chunky EM in [17, 16], we see for all of chunky\nEM, CS-EM, and CS-EM\u2217 that the number of data points and the amount of c-separation have a\npositive effect on EM-speedup, while the number of dimensions and the number of components\nhave a negative effect. However, the last plot in Figure 4 reveals an important difference between\nchunky EM and CS-EM: with a \ufb01xed ratio between number of data points and number of clusters, the\nEM-speedup declines a lot for chunky EM, as the number of clusters and data points increases. This\nobservation is important for the business-customer data, where increasing the area of investigation\n(from city to county to state to country) has this characteristic for the data.\nIn the second experiment on the business-customer data, standard EM is computationally too de-\nmanding. For example, for the \u201cNail salon\u201d example in Figure 5, a single EM iteration takes about\n5 hours. In contrast, CS-EM runs to convergence in 20 minutes. To compare run-times for chunky\n\n3Let \u00b5 and \u03a3 be the mean and variance parameter for an initial component, and \u00b5p, \u00b5l, and \u00b5r denote the\nsample mean for data in the considered parent, left and right child. 
We merge if |M D(\u00b5l, \u00b5|\u03a3)/M D(\u00b5p, \u00b5|\u03a3)\u2212\n1| < 0.05 and |M D(\u00b5r, \u00b5|\u03a3)/M D(\u00b5p, \u00b5|\u03a3) \u2212 1| < 0.05, where M D(\u00b7,\u00b7|\u03a3) is the Mahalanobis distance.\n4For the default experimental setting, for example, the baseline is reached at 96% of the log-likelihood\n\nimprovement from initialization to standard EM convergence.\n\n7\n\n\fFigure 4: EM-speedup factors on synthetic data.\n\nFigure 5: A comparison of run-time and \ufb01nal number\nof variational parameters for Chunky EM vs. CS-EM\nfor exemplary business types with different number of\nstores (mixture components).\n#stores\n\nBusiness type\n\nBowling\nDry cleaning\nNail salon\nPizza\nTax \ufb01ling\nConv. store\n\n129\n815\n1290\n1327\n1459\n1739\n\ntime\nratio\n5.0\n21.2\n35.8\n33.0\n34.8\n29.4\n\nparameter\n\nratio\n2.41\n2.81\n3.51\n3.18\n3.41\n3.42\n\nEM and CS-EM, we therefore slightly modi\ufb01ed the way we ensure that the two algorithm reach a\nparameter estimate of same quality. We use the lowest of the F values (on training data) obtained for\nthe two algorithms at convergence as the baseline, and record the time for each algorithm to reach\nthis baseline. Figure 5 shows the speedup (time ratio) and the reduction in number of variational\nparameters (parameter ratio) for CS-EM compared to chunky EM, as evaluated on exemplary types\nof businesses. Again, CS-EM is signi\ufb01cantly faster than chunky EM and the speedup is achieved by a\nbetter targeting of variational distribution through the component-speci\ufb01c partitions.\n5 Related and Future Work\nRelated work. CS-EM combines the best from two major directions in the literature regarding\nspeedup of EM for mixture modeling. The \ufb01rst direction is based on powerful heuristic ideas, but\nwithout provable convergence due to the lack of a well-de\ufb01ned objective function. 
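As a concrete illustration, the block-merge test in footnote 3 can be sketched as follows. This is a minimal sketch only; the function names and the way the component is passed in are our own choices, not the paper's implementation.

```python
import numpy as np

def mahalanobis(x, mu, sigma):
    """Mahalanobis distance MD(x, mu | sigma)."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(d @ np.linalg.solve(np.asarray(sigma, dtype=float), d)))

def should_merge(mu, sigma, mu_parent, mu_left, mu_right, tol=0.05):
    """Merge two sibling KD-tree blocks with respect to the component (mu, sigma)
    if each child's sample mean lies at nearly the same Mahalanobis distance from
    the component mean as the parent's sample mean (ratio within tol of 1)."""
    d_parent = mahalanobis(mu_parent, mu, sigma)
    d_left = mahalanobis(mu_left, mu, sigma)
    d_right = mahalanobis(mu_right, mu, sigma)
    return (abs(d_left / d_parent - 1.0) < tol
            and abs(d_right / d_parent - 1.0) < tol)
```

For example, with a unit-variance component at the origin, two children whose means sit at almost the same distance as the parent's mean pass the test, while a child pulled much farther away does not.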
The work in [10] is a prominent example, where KD-tree partitions were first used for speeding up EM. As also pointed out in [17, 16], the method will likely, but not provably, converge for fine-grained partitions. In contrast, CS-EM is provably convergent, even for arbitrarily rough partitions if extreme speedup is needed. The granularity of partitions in [10] is controlled by a user-specified threshold on the minimum and maximum membership probabilities that are reachable within the boundaries of a node in the KD-tree. In contrast, we have almost no tuning parameters. We instead let the data speak for itself by having the final convergence determine the granularity of the partitions. Finally, [10] "prunes" a component (sets the membership probability to zero) for data far away from the component. This relates to our component-specific partitions, but ours is more principled, with convergence guarantees.

The second direction of speedup approaches is based on the variational EM framework [11]. In [11], a "sparse" EM was presented, which, at some iterations, only updates part of the parameters and hence relates to the pruning idea in [10]. [14] presents an "incremental" and a "lazy" EM, which gain speedup by performing E-steps on varying subsets of the data rather than the entire data set. All three methods guarantee convergence. However, they need to periodically perform an E-step over the entire data set, and, in contrast to CS-EM, their E-step is therefore not truly sub-linear in sample size, making them potentially unsuitable for large-scale applications. The chunky EM in [17, 16] is the approach most similar to our CS-EM. Both are based on the variational EM framework and therefore guarantee convergence, but CS-EM is faster and scales better in the number of clusters.

In addition, heuristic sub-sampling is common practice when faced with a large amount of data.
One could argue that chunky EM is an intelligent sub-sampling method, where 1) instead of sampled data points it uses geometric averages for blocks of data in a given data partition, and 2) it automatically chooses the "sampling size" by a learning-curve method, where F is used to measure the utility of increasing the granularity of the partition. Sub-sampling therefore has the same computational complexity as chunky EM, and our results suggest that we should expect CS-EM to be much faster than sub-sampling and to scale better in the number of mixture components.

Finally, we exemplified our work by using KD-trees as the tree-consistent partition structure for generating the component-specific partitions in CS-EM, which limits its effectiveness in high dimensions. However, any hierarchical partition structure can be used, and the work in [8] therefore suggests that changing to an anchor tree (a special kind of metric tree [15]) will also render CS-EM effective in high dimensions, under the assumption of a lower intrinsic dimensionality for the data.

Future work. Future work will include parallelization of the algorithm and extensions to 1) non-probabilistic clustering methods, e.g., k-means clustering [6, 13, 5], and 2) general EM applications beyond mixture modeling.

References

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] P. S. Bradley, U. M. Fayyad, and C. A. Reina. Scaling EM (expectation maximization) clustering to large databases. Technical Report MSR-TR-98-3, Microsoft Research, 1998.
[3] S. Dasgupta. Learning mixtures of Gaussians. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, pages 634-644, 1999.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[5] G. Hamerly.
Making k-means even faster. In SIAM International Conference on Data Mining (SDM), 2010.
[6] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881-892, 2002.
[7] G. J. McLachlan and D. Peel. Finite Mixture Models. Wiley Interscience, New York, USA, 2000.
[8] A. Moore. The anchors hierarchy: Using the triangle inequality to survive high-dimensional data. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 397-405. AAAI Press, 2000.
[9] A. W. Moore. A tutorial on kd-trees. Technical Report 209, University of Cambridge, 1991.
[10] A. W. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. In Advances in Neural Information Processing Systems, pages 543-549. Morgan Kaufmann, 1999.
[11] R. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355-368, 1998.
[12] L. E. Ortiz and L. P. Kaelbling. Accelerating EM: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 512-521, 1999.
[13] D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. In S. Chaudhuri and D. Madigan, editors, Proceedings of the Fifth International Conference on Knowledge Discovery in Databases, pages 277-281. AAAI Press, 1999.
[14] B. Thiesson, C. Meek, and D. Heckerman. Accelerating EM for large databases. Machine Learning, 45(3):279-299, 2001.
[15] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4):175-179, 1991.
[16] J. J. Verbeek, J. R. Nunnink, and N. Vlassis.
Accelerated EM-based clustering of large data sets. Data Mining and Knowledge Discovery, 13(3):291-307, 2006.
[17] J. J. Verbeek, N. Vlassis, and J. R. J. Nunnink. A variational EM algorithm for large-scale mixture modeling. In Proceedings of the 8th Annual Conference of the Advanced School for Computing and Imaging (ASCI), 2003.