{"title": "Scalable Training of Mixture Models via Coresets", "book": "Advances in Neural Information Processing Systems", "page_first": 2142, "page_last": 2150, "abstract": "How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of $O(dk^3/\\eps^2)$ data points suffices for computing a $(1+\\eps)$-approximation for the optimal model on the original $n$ data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones.", "full_text": "Scalable Training of Mixture Models via Coresets\n\nDan Feldman\n\nMIT\n\nMatthew Faulkner\n\nCaltech\n\nAndreas Krause\n\nETH Zurich\n\nAbstract\n\nHow can we train a statistical mixture model on a massive data set? In this paper, we\nshow how to construct coresets for mixtures of Gaussians and natural generalizations. A\ncoreset is a weighted subset of the data, which guarantees that models \ufb01tting the coreset\nwill also provide a good \ufb01t for the original data set. We show that, perhaps surprisingly,\nGaussian mixtures admit coresets of size independent of the size of the data set. 
More precisely, we prove that a weighted set of O(dk^3/\u03b5^2) data points suffices for computing a (1 + \u03b5)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones.\n\n1 Introduction\n
We consider the problem of training statistical mixture models, in particular mixtures of Gaussians and some natural generalizations, on massive data sets. Such data sets may be distributed across a cluster, or arrive in a data stream, and have to be processed with limited memory. In contrast to parameter estimation for models with compact sufficient statistics, mixture models generally require inference over latent variables, which in turn depends on the full data set. In this paper, we show that Gaussian mixture models (GMMs), and some generalizations, admit small coresets: A coreset is a weighted subset of the data which guarantees that models fitting the coreset will also provide a good fit for the original data set. Perhaps surprisingly, we show that Gaussian mixtures admit coresets of size independent of the size of the data set.\n
We focus on \u03b5-semi-spherical Gaussians, where the covariance matrix \u03a3i of each component i has eigenvalues bounded in [\u03b5, 1/\u03b5], but some of our results generalize even to the semi-definite case. In particular, we show how, given a data set D of n points in R^d, \u03b5 > 0 and k \u2208 N, one can efficiently construct a weighted set C of O(dk^3/\u03b5^2) points, such that for any mixture of k \u03b5-semi-spherical Gaussians \u03b8 = [(w1, \u00b51, \u03a31), . . . , (wk, \u00b5k, \u03a3k)] it holds that the log-likelihood ln P(D | \u03b8) of D under \u03b8 is approximated by the (properly weighted) log-likelihood ln P(C | \u03b8) of C under \u03b8 to arbitrary accuracy as \u03b5 \u2192 0. Thus solving the estimation problem on the coreset C (e.g., using weighted variants of the EM algorithm, see Section 3.3) is almost as good as solving the estimation problem on the large data set D. Our algorithm for constructing C is based on adaptively sampling points from D and is simple to implement. Moreover, coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting (using space and update time per point of poly(dk\u03b5^{-1} log n log(1/\u03b4))).\n
Existence and construction of coresets have been investigated for a number of problems in computational geometry (such as k-means and k-median) in many recent papers (cf., surveys in [1, 2]). In this paper, we demonstrate how these techniques from computational geometry can be lifted to the realm of statistical estimation. As a by-product of our analysis, we also close an open question on the VC dimension of arbitrary mixtures of Gaussians. We evaluate our algorithms on several synthetic and real data sets.\n
In particular, we use our approach for density estimation for acceleration data, motivated by an application in earthquake detection using mobile phones.\n\n2 Background and Problem Statement\n
Fitting mixture models by MLE. Suppose we are given a data set D = {x1, . . . , xn} \u2286 R^d. We consider fitting a mixture of Gaussians \u03b8 = [(w1, \u00b51, \u03a31), . . . , (wk, \u00b5k, \u03a3k)], i.e., the distribution\n\n$P(x \\mid \\theta) = \\sum_{i=1}^{k} w_i N(x; \\mu_i, \\Sigma_i)$,\n\nwhere w1, . . . , wk \u2265 0 are the mixture weights, $\\sum_i w_i = 1$, and \u00b5i and \u03a3i are mean and covariance of the i-th mixture component, which is modeled as a multivariate normal distribution\n\n$N(x; \\mu_i, \\Sigma_i) = \\frac{1}{\\sqrt{|2\\pi\\Sigma_i|}} \\exp\\left(-\\frac{1}{2} (x - \\mu_i)^T \\Sigma_i^{-1} (x - \\mu_i)\\right)$.\n\n
In Section 4, we will discuss extensions to more general mixture models. Assuming the data was generated i.i.d., the negative log likelihood of the data is $L(D \\mid \\theta) = -\\sum_j \\ln P(x_j \\mid \\theta)$, and we wish to obtain the maximum likelihood estimate (MLE) of the parameters \u03b8\u2217 = argmin_{\u03b8\u2208C} L(D | \u03b8), where C is a set of constraints ensuring that degenerate solutions are avoided (equivalently, C can be interpreted as prior thresholding). Hereby, for a symmetric matrix A, spec A is the set of all eigenvalues of A. We define\n\nC = C\u03b5 = {\u03b8 = [(w1, \u00b51, \u03a31), . . . , (wk, \u00b5k, \u03a3k)] | \u2200i : spec(\u03a3i) \u2286 [\u03b5, 1/\u03b5]}\n\nto be the set of all mixtures of k Gaussians \u03b8, such that all the eigenvalues of the covariance matrices of \u03b8 are bounded between \u03b5 and 1/\u03b5 for some small \u03b5 > 0.\n
Approximating the log-likelihood. Our goal is to approximate the data set D by a weighted set $C = \\{(\\gamma_1, x'_1), \\ldots, (\\gamma_m, x'_m)\\}$ \u2286 R \u00d7 R^d, such that L(D | \u03b8) \u2248 L(C | \u03b8) for all \u03b8, where we define $L(C \\mid \\theta) = -\\sum_i \\gamma_i \\ln P(x'_i \\mid \\theta)$.\n
What kind of approximation accuracy may we hope to expect? Notice that there is a nontrivial issue of scale: Suppose we have a MLE \u03b8\u2217 for D, and let \u03b1 > 0. Then straightforward linear algebra shows that we can obtain an MLE \u03b8\u2217_\u03b1 for a scaled data set \u03b1D = {\u03b1x : x \u2208 D} by simply scaling all means by \u03b1, and covariance matrices by \u03b1^2. For the log-likelihood, however, it holds that L(\u03b1D | \u03b8\u2217_\u03b1) = d ln \u03b1 + L(D | \u03b8\u2217). Therefore, optimal solutions on one scale can be efficiently transformed to optimal solutions at a different scale, while maintaining the same additive error. This means, that any algorithm which achieves absolute error \u03b5 at any scale could be used to achieve parameter estimates (for means, covariances) with arbitrarily small error, simply by applying the algorithm to a scaled data set and transforming back the obtained solution. An alternative, scale-invariant approach may be to strive towards approximating L(D | \u03b8) up to multiplicative error (1 + \u03b5). Unfortunately, this goal is also hard to achieve: Choosing a scaling parameter \u03b1 such that d ln \u03b1 + L(D | \u03b8\u2217) = 0 would require any algorithm that achieves any bounded multiplicative error to essentially incur no error at all when evaluating L(\u03b1D | \u03b8\u2217). The above observations hold even for the case k = 1 and \u03a3 = I, where the mixture \u03b8 consists of a single Gaussian, and the log-likelihood is the sum of squared distances to a point \u00b5 and an additive term.\n
Motivated by the scaling issues discussed above, we use the following error bound that was suggested in [3] (who studied the case where all Gaussians are identical spheres). We decompose the negative log-likelihood L(D | \u03b8) of a data set D as\n\n$L(D \\mid \\theta) = -\\sum_{j=1}^{n} \\ln \\sum_{i=1}^{k} \\frac{w_i}{\\sqrt{|2\\pi\\Sigma_i|}} \\exp\\left(-\\frac{1}{2} (x_j - \\mu_i)^T \\Sigma_i^{-1} (x_j - \\mu_i)\\right) = -n \\ln Z(\\theta) + \\phi(D \\mid \\theta)$,\n\nwhere $Z(\\theta) = \\sum_i \\frac{w_i}{\\sqrt{|2\\pi\\Sigma_i|}}$ is a normalizer, and the function \u03c6 is defined as\n\n$\\phi(D \\mid \\theta) = -\\sum_{j=1}^{n} \\ln \\sum_{i=1}^{k} \\frac{w_i}{Z(\\theta) \\sqrt{|2\\pi\\Sigma_i|}} \\exp\\left(-\\frac{1}{2} (x_j - \\mu_i)^T \\Sigma_i^{-1} (x_j - \\mu_i)\\right)$.\n\n
Hereby, Z(\u03b8) plays the role of a normalizer, which can be computed exactly, independently of the set D. \u03c6(D | \u03b8) captures all dependencies of L(D | \u03b8) on D, and via Jensen\u2019s inequality, it can be seen that \u03c6(D | \u03b8) is always nonnegative.\n
We can now use this term \u03c6(D | \u03b8) as a reference for our error bounds. In particular, we call $\\tilde{\\theta}$ a (1 + \u03b5)-approximation for \u03b8 if $(1 - \\varepsilon) \\phi(D \\mid \\theta) \\leq \\phi(D \\mid \\tilde{\\theta}) \\leq \\phi(D \\mid \\theta)(1 + \\varepsilon)$.\n
Coresets. We call a weighted data set C a (k, \u03b5)-coreset for another (possibly weighted) set D \u2286 R^d, if for all mixtures \u03b8 \u2208 C of k Gaussians it holds that\n\n(1 \u2212 \u03b5)\u03c6(D | \u03b8) \u2264 \u03c6(C | \u03b8) \u2264 \u03c6(D | \u03b8)(1 + \u03b5).\n\n
[Figure 1 panels: (a) Example data set; (b) Iteration 1; (c) Iteration 3; (d) Final approximation B; (e) Sampling distribution; (f) Coreset]\nFigure 1: Illustration of the coreset construction for example data set (a). (b,c) show two iterations of constructing the set B. Solid squares are points sampled uniformly from remaining points, hollow squares are points selected in previous iterations.\n
Red color indicates half the points furthest away from B, which are kept for next iteration. (d) Final approximate clustering B on top of original data set. (e) Induced non-uniform sampling distribution: radius of circles indicates probability; color indicates weight, ranging from red (high weight) to yellow (low weight). (f) Coreset sampled from distribution in (e).\n\n
Hereby \u03c6(C | \u03b8) is generalized to weighted data sets C in the natural way (weighing the contribution of each summand x\u2032j \u2208 C by \u03b3j). Thus, as \u03b5 \u2192 0, for a sequence of (k, \u03b5)-coresets C\u03b5 we have that sup_{\u03b8\u2208C} |L(C\u03b5 | \u03b8) \u2212 L(D | \u03b8)| \u2192 0, i.e., L(C\u03b5 | \u03b8) uniformly (over \u03b8 \u2208 C) approximates L(D | \u03b8). Further, under the additional condition that all variances are sufficiently large (formally $\\prod_{\\lambda \\in \\mathrm{spec}(\\Sigma_i)} \\lambda \\geq \\frac{1}{(2\\pi)^d}$ for all components i), the log-normalizer ln Z(\u03b8) is negative, and consequently the coreset in fact provides a multiplicative (1 + \u03b5) approximation to the log-likelihood, i.e.,\n\n(1 \u2212 \u03b5)L(D | \u03b8) \u2264 L(C | \u03b8) \u2264 L(D | \u03b8)(1 + \u03b5).\n\nMore details can be found in the supplemental material.\n
Note that if we had access to a (k, \u03b5)-coreset C, then we could reduce the problem of fitting a mixture model on D to one of fitting a model on C, since the optimal solution \u03b8C is a good approximation (in terms of log-likelihood) of \u03b8\u2217. While finding the optimal \u03b8C is a difficult problem, one can use a (weighted) variant of the EM algorithm to find a good solution. Moreover, if |C| \u226a |D|, running EM on C may be orders of magnitude faster than running it on D. In Section 3.3, we give more details about solving the density estimation problem on the coreset.\n
The key question is whether small (k, \u03b5)-coresets exist, and whether they can be efficiently constructed. In the following, we answer this question affirmatively. We show that, perhaps surprisingly, one can efficiently find coresets C of size independent of the size n of D, and with polynomial dependence on 1/\u03b5, d and k.\n
3 Efficient Coreset Construction via Adaptive Sampling\n
Naive approach: uniform sampling. A naive approach towards approximating D would be to just pick a subset C uniformly at random. In particular, suppose the data set is generated from a mixture of two spherical Gaussians (\u03a3i = I) with weights $w_1 = \\frac{1}{\\sqrt{n}}$ and $w_2 = 1 - \\frac{1}{\\sqrt{n}}$. Unless $m = \\Omega(\\sqrt{n})$ points are sampled, with constant probability no data point generated from Gaussian 1 (the component with the small weight) is selected. By moving the means of the Gaussians arbitrarily far apart, L(D | \u03b8C) can be made arbitrarily worse than L(D | \u03b8D), where \u03b8C and \u03b8D are MLEs on C and D respectively. Thus, even for two well-separated Gaussians, uniform sampling can perform arbitrarily poorly. This example already suggests that, intuitively, in order to achieve small multiplicative error, we must devise a sampling scheme that adaptively selects representative points from all \u201cclusters\u201d present in the data set. However, this suggests that obtaining a coreset requires solving a chicken-and-egg problem, where we need to understand the density of the data to obtain the coreset, but simultaneously would like to use the coreset for density estimation.\n
Better approximation via adaptive sampling. The key idea behind the coreset construction is that we can break the chicken-and-egg problem by first obtaining a rough approximation B of the clustering solution (using more than k components, but far fewer than n), and then using this solution to bias the random sampling. Surprisingly, a simple procedure which iteratively samples a small number \u03b2 of points, and removes half of the data set closest to the sampled points, provides a sufficiently accurate first approximation B for this purpose. This initial clustering is then used to sample the data points comprising coreset C according to probabilities which are roughly proportional to the squared distance to the set B. This non-uniform random sampling can be understood as an importance-weighted estimate of the log-likelihood L(D | \u03b8), where the weights are optimized in order to reduce the variance. The same general idea has been found successful in constructing coresets for geometric clustering problems such as k-means and k-median [4]. The pseudocode for obtaining the approximation B, and for using it to obtain coreset C is given in Algorithm 1.\n\n
Algorithm 1: Coreset construction\nInput: Data set D, \u03b5, \u03b4, k\nOutput: Coreset C = {(\u03b3(x1), x1), . . . , (\u03b3(x|C|), x|C|)}\nD\u2032 \u2190 D; B \u2190 \u2205;\nwhile |D\u2032| > 10dk ln(1/\u03b4) do\n    Sample set S of \u03b2 = 10dk ln(1/\u03b4) points uniformly at random from D\u2032;\n    Remove \u2308|D\u2032|/2\u2309 points x \u2208 D\u2032 closest to S (i.e., minimizing dist(x, S)) from D\u2032;\n    Set B \u2190 B \u222a S;\nSet B \u2190 B \u222a D\u2032;\nfor each b \u2208 B do D_b \u2190 the points in D whose closest point in B is b. Ties broken arbitrarily;\nfor each b \u2208 B and x \u2208 D_b do\n    $m(x) \\leftarrow \\left\\lceil \\frac{5}{|D_b|} + \\frac{\\mathrm{dist}(x, B)^2}{\\sum_{x' \\in D} \\mathrm{dist}(x', B)^2} \\right\\rceil$;\nPick a non-uniform random sample C of 10\u2308dk|B|^2 ln(1/\u03b4)/\u03b5^2\u2309 points from D, where for every x\u2032 \u2208 C and x \u2208 D, we have x\u2032 = x with probability $m(x) / \\sum_{x' \\in D} m(x')$;\nfor each x\u2032 \u2208 C do $\\gamma(x') \\leftarrow \\frac{\\sum_{x \\in D} m(x)}{|C| \\cdot m(x')}$;\n\n
We have the following result, proved in the supplemental material:\nTheorem 3.1. Suppose C is sampled from D using Algorithm 1 for parameters \u03b5, \u03b4 and k. Then, with probability at least 1 \u2212 \u03b4 it holds that for all \u03b8 \u2208 C\u03b5,\n\n\u03c6(D | \u03b8)(1 \u2212 \u03b5) \u2264 \u03c6(C | \u03b8) \u2264 \u03c6(D | \u03b8)(1 + \u03b5).\n\n
In our experiments, we compare the performance of clustering on coresets constructed via adaptive sampling, vs. clustering on a uniform sample. The size of C in Algorithm 1 depends on |B|^2 = log^2 n. By replacing B in the algorithm with a constant factor approximation B\u2032, |B\u2032| = l for the k-means problem, we can get a coreset C of size independent of n.\n
Such a set B\u2032 can be computed in O(ndk) time either by applying exhaustive search on the output C of the original Algorithm 1 or by using one of the existing constant-factor approximation algorithms for k-means (say, [5]).\n
3.1 Sketch of Analysis: Reduction to Euclidean Spaces\n
Due to space limitations, the proof of Theorem 3.1 is included in the supplemental material; here we only provide a sketch of the analysis, carrying the main intuition. The key insight in the proof is that the contribution ln P(x | \u03b8) to the likelihood L(D | \u03b8) can be expressed in the following way:\n
Lemma 3.2. There exist functions \u03c6, \u03c8, and f such that, for any point x \u2208 R^d and mixture model \u03b8, ln P(x | \u03b8) = \u2212f_{\u03c6(x)}(\u03c8(\u03b8)) + Z(\u03b8), where\n\n$f_{\\tilde{x}}(y) = -\\ln \\sum_i \\tilde{w}_i \\exp\\left(-W_i \\, \\mathrm{dist}(\\tilde{x} - \\tilde{\\mu}_i, s_i)^2\\right)$.\n\n
Hereby, \u03c6 is a function that maps a point x \u2208 R^d into $\\tilde{x}$ = \u03c6(x) \u2208 R^{2d}, and \u03c8 is a function that maps a mixture model \u03b8 into a tuple y = (s, w, $\\tilde{\\mu}$, W) where w is a k-tuple of nonnegative weights $\\tilde{w}_1, \\ldots, \\tilde{w}_k$ summing to 1, s = s1, . . . , sk \u2286 R^{2d} is a set of k d-dimensional subspaces that are weighted by weights W1, . . . , Wk > 0, and $\\tilde{\\mu} = \\tilde{\\mu}_1, \\ldots, \\tilde{\\mu}_k$ \u2208 R^{2d} is a set of k means.\n
The main idea behind Lemma 3.2 is that level sets of distances between points and subspaces are quadratic forms, and can thus represent level sets of the Gaussian probability density function (see Figure 2(a) for an illustration). We recognize the \u201csoft-min\u201d function $\\wedge_{w'}(\\eta) \\equiv -\\ln \\sum_i w'_i \\exp(-\\eta_i)$ as an approximation upper-bounding the minimum min(\u03b7) = min_i \u03b7i for $\\eta_i = W_i \\, \\mathrm{dist}(\\tilde{x} - \\tilde{\\mu}_i, s_i)^2$ and \u03b7 = [\u03b71, . . . , \u03b7k]. The motivation behind this transformation is that it allows expressing the likelihood P(x | \u03b8) of a data point x given a model \u03b8 in a purely geometric manner as soft-min over distances between points and subspaces in a transformed space. Notice that if we use the minimum min(\u00b7) instead of the soft-min $\\wedge_{\\tilde{w}}(\\cdot)$, we recover the problem of approximating the data set D (transformed via \u03c6) by k-subspaces. For semi-spherical Gaussians, it can be shown that the subspaces can be chosen as points while incurring a multiplicative error of at most 1/\u03b5, and thus we recover the well-known k-means problem in the transformed space. This insight suggests using a known coreset construction for k-means, adapted to the transformation employed.\n\n
[Figure 2 panels: (a) Gaussian pdf as Euclidean distances; (b) Tree for coreset construction]\nFigure 2: (a) Level sets of the distances between points on a plane (green) and (disjoint) k-dimensional subspaces are ellipses, and thus can represent contour lines of the multivariate Gaussian. (b) Tree construction for generating coresets in parallel or from data streams. Black arrows indicate \u201cmerge-and-compress\u201d operations. The (intermediate) coresets C1, . . . , C7 are enumerated in the order in which they would be generated in the streaming case. In the parallel case, C1, C2, C4 and C5 would be constructed in parallel, followed by parallel construction of C3 and C6, finally resulting in C7.\n\n
The remaining challenge in the proof is to bound the additional error incurred by using the soft-min function $\\wedge_{\\tilde{w}}(\\cdot)$ instead of the minimum min(\u00b7). We tackle this challenge by proving a generalized triangle inequality adapted to the exponential transformation, and employing the framework described in [4], which provides a general method for constructing coresets for clustering problems of the form $\\min_s \\sum_i f_{\\tilde{x}_i}(s)$.\n
As proved in [4], the key quantity that controls the size of a coreset is the pseudo-dimension of the functions F_d = {$f_{\\tilde{x}}$ for $\\tilde{x}$ \u2208 R^{2d}}. This notion of dimension is closely related to the VC dimension of the (sub-level sets of the) functions F_d and therefore represents the complexity of this set of functions. The final ingredient in the proof of Theorem 3.1 is a new bound on the complexity of mixtures of k Gaussians in d dimensions proved in the supplemental material.\n
3.2 Streaming and Parallel Computation\n
One major advantage of coresets is that they can be constructed in parallel, as well as in a streaming setting where data points arrive one by one, and it is impossible to remember the entire data set due to memory constraints. The key insight is that coresets satisfy certain composition properties, which have previously been used by [6] for streaming and parallel construction of coresets for geometric clustering problems such as k-median and k-means.\n
1. Suppose C1 is a (k, \u03b5)-coreset for D1, and C2 is a (k, \u03b5)-coreset for D2. Then C1 \u222a C2 is a (k, \u03b5)-coreset for D1 \u222a D2.\n2. Suppose C is a (k, \u03b5)-coreset for D, and C\u2032 is a (k, \u03b4)-coreset for C. Then C\u2032 is a (k, (1 + \u03b5)(1 + \u03b4) \u2212 1)-coreset for D.\n\n
In the following, we review how to exploit these properties for parallel and streaming computation.\n
Streaming. In the streaming setting, we assume that points arrive one-by-one, but we do not have enough memory to remember the entire data set. Thus, we wish to maintain a coreset over time, while keeping only a small subset of O(log n) coresets in memory.\n
There is a general reduction that shows that a small coreset scheme to a given problem suffices to solve the corresponding problem on a streaming input [7, 6]. The idea is to construct and save in memory a coreset for every block of poly(dk/\u03b5) consecutive points arriving in a stream. When we have two coresets in memory, we can merge them (resulting in a (k, \u03b5)-coreset via property (1)), and compress by computing a single coreset from the merged coresets (via property (2)) to avoid increase in the coreset size. An important subtlety arises: While merging two coresets (via property (1)) does not increase the approximation error, compressing a coreset (via property (2)) does increase the error. A naive approach that merges and compresses immediately as soon as two coresets have been constructed, can incur an exponential increase in approximation error. Fortunately, it is possible to organize the merge-and-compress operations in a binary tree of height O(log n), where we need to store in memory a single coreset for each level on the tree (thus requiring only poly(dk\u03b5^{-1} log n) memory). Figure 2(b) illustrates this tree computation. In order to construct a coreset for the union of two (weighted) coresets, we use a weighted version of Algorithm 1, where we consider a weighted point as duplicate copies of a non-weighted point (possibly with fractional weight). A more formal description can be found in [8]. We summarize our streaming result in the following theorem.\n
Theorem 3.3. A (k, \u03b5)-coreset for a stream of n points in R^d can be computed for the \u03b5-semi-spherical GMM problem with probability at least 1 \u2212 \u03b4 using space and update time poly(dk\u03b5^{-1} log n log(1/\u03b4)).\n
Parallel/Distributed computations.\n
Using the same ideas from the streaming model, a (non-\nparallel) coreset construction can be transformed into a parallel one. We partition the data into\nsets, and compute coresets for each set, independently, on different computers in a cluster. We then\n(in parallel) merge (via property (1)) two coresets, and compute a single coreset for every pair of\nsuch coresets (via property (2)). Continuing in this manner yields a process that takes O(log n)\niterations of parallel computation. This computation is also naturally suited for map-reduce [9] style\ncomputations, where the map tasks compute coresets for disjoint parts of D, and the reduce tasks\nperform the merge-and-compress operations. Figure 2(b) illustrates this parallel construction.\nTheorem 3.4. A (k, \u03b5)-coreset for a set of n points in Rd can be computed for the \u03b5-semi-\nspherical GMM problem with probability at least 1 \u2212 \u03b4 using m machines in time (n/m) \u00b7\npoly(dk\u03b5\u22121 log(1/\u03b4) log n).\n3.3 Fitting a GMM on the Coreset using Weighted EM\nOne approach, which we employ in our experiments, is to use a natural generalization of the EM\nalgorithm, which takes the coreset weights into account. We here describe the algorithm for the case\nof GMMs. For other mixture distributions, the E and M steps are modi\ufb01ed appropriately.\n\nAlgorithm 2: Weighted EM for Gaussian mixtures\nInput: Coreset C, k, TOL\nOutput: Mixture model \u03b8C\nLold = \u221e; Initialize means \u00b51, . . . , \u00b5k by sampling k points from C with probability proportional to their\nweight. 
Initialize \u03a3i = I and wi = 1/k for all i;\nrepeat\n    Lold = L(C | \u03b8);\n    for j = 1 to |C| do for i = 1 to k do Compute $\\eta_{i,j} = \\gamma_j \\frac{w_i N(x'_j; \\mu_i, \\Sigma_i)}{\\sum_{\\ell} w_\\ell N(x'_j; \\mu_\\ell, \\Sigma_\\ell)}$;\n    for i = 1 to k do\n        $w_i \\leftarrow \\sum_j \\eta_{i,j} / \\sum_{\\ell} \\sum_j \\eta_{\\ell,j}$; $\\mu_i \\leftarrow \\sum_j \\eta_{i,j} x'_j / \\sum_j \\eta_{i,j}$; $\\Sigma_i \\leftarrow \\sum_j \\eta_{i,j} (x'_j - \\mu_i)(x'_j - \\mu_i)^T / \\sum_j \\eta_{i,j}$;\nuntil L(C | \u03b8) \u2265 Lold \u2212 TOL;\n\n
Using a similar analysis as for the standard EM algorithm, Algorithm 2 is guaranteed to converge, but only to a local optimum. However, since it is applied on a much smaller set, it can be initialized using multiple random restarts.\n
4 Extensions and Generalizations\n
We now show how the connection between estimating the parameters for mixture models and problems in computational geometry can be leveraged further. Our observations are based on the link between mixture of Gaussians and projective clustering (multiple subspace approximation) as shown in Lemma 3.2.\n
Generalizations to non-semi-spherical GMMs. For simplicity, we generalized the coreset construction for the k-means problem, which required assumptions that the Gaussians are \u03b5-semi-spherical. However, several more complex coresets for projective clustering were suggested recently (cf., [4]). Using the tools developed in this article, each such coreset implies a corresponding coreset for GMMs and generalizations. As an example, the coresets for approximating points by lines [10] imply that we can construct small coresets for GMMs even if the smallest singular value of one of the corresponding covariance matrices is zero.\n
Generalizations to \u2113q distances and other norms.\n
Our analysis is based on combinatorics (such as the complexity of sub-level sets of GMMs) and probabilistic methods (non-uniform random sampling). Therefore, generalizations to other non-Euclidean distance functions, or error functions such as (non-squared) distances (mixture of Laplace distributions) are straightforward. The main property that we need is a generalization of the triangle inequality, as proved in the supplemental material. For example, replacing the squared distances by non-squared distances yields a coreset for mixture of Laplace distributions. The double triangle inequality $\\|a - c\\|^2 \\leq 2(\\|a - b\\|^2 + \\|b - c\\|^2)$ that we used in this paper is replaced by H\u00f6lder\u2019s inequality, $\\|a - c\\|^q \\leq 2^{O(q)} (\\|a - b\\|^q + \\|b - c\\|^q)$. Such a result is straightforward from our analysis, and we summarize it in the following theorem.\n
Theorem 4.1. Let q \u2265 1 be an integer. Consider Algorithm 1, where dist(\u00b7,\u00b7)^2 is replaced by dist(\u00b7,\u00b7)^q and \u03b5^2 is replaced by \u03b5^{O(q)}. Suppose C is sampled from D using this updated version of Algorithm 1 for parameters \u03b5, \u03b4 and k. Then, with probability at least 1 \u2212 \u03b4 it holds that for all \u03b8 \u2208 C\u03b5,\n\n\u03c6(D | \u03b8)(1 \u2212 \u03b5) \u2264 \u03c6(C | \u03b8) \u2264 \u03c6(D | \u03b8)(1 + \u03b5),\n\nwhere $Z(\\theta) = \\sum_i \\frac{w_i}{g(\\theta_i)}$ and $\\phi(D \\mid \\theta) = -\\sum_{x \\in D} \\ln \\sum_{i=1}^{k} \\frac{w_i}{Z(\\theta) g(\\theta_i)} \\exp\\left(-\\frac{1}{2} \\left\\|\\Sigma_i^{-1/2} (x - \\mu_i)\\right\\|^q\\right)$, using the normalizer $g(\\theta_i) = \\int \\exp\\left(-\\frac{1}{2} \\left\\|\\Sigma_i^{-1/2} (x - \\mu_i)\\right\\|^q\\right) dx$.\n\n
[Figure 3 panels: (a) MNIST; (b) Tetrode recordings; (c) CSN data; (d) CSN detection. Horizontal axis: training set size; vertical axis: log likelihood on test data set in (a)-(c), area under ROC curve in (d); curves: full set, uniform sample, coreset.]\nFigure 3: Experimental results for three real data sets. We compare likelihood of the best model obtained on subsets C constructed by uniform sampling, and by the adaptive coreset sampling procedure.\n\n
5 Experiments\n
We experimentally evaluate the effectiveness of using coresets of different sizes for training mixture models. We compare against running EM on the full set, as well as on an unweighted, uniform sample from D. Results are presented for three real datasets.\n
MNIST handwritten digits. The MNIST dataset contains 60,000 training and 10,000 testing grayscale images of handwritten digits. As in [11], we normalize each component of the data to have zero mean and unit variance, and then reduce each 784-pixel (28x28) image using PCA, retaining only the top d = 100 principal components as a feature vector. From the training set, we produce coresets and uniformly sampled subsets of sizes between 30 and 5000, using the parameters k = 10 (a cluster for each digit), \u03b2 = 20 and \u03b4 = 0.1 (see Algorithm 1), and fit GMMs using EM with 3 random restarts. The log likelihood (LLH) of each model on the testing data is shown in Figure 3(a). Notice that coresets significantly outperform uniform samples of the same size, and even a coreset of 30 points performs very well. Further note how the test log likelihood begins to flatten out for |C| = 1000.\n
Constructing the coreset and running EM on this size takes 7.9 seconds (Intel Xeon 2.6\nGHz), over 100 times faster than running EM on the full set (15 minutes).\nNeural tetrode recordings. We also compare coresets and uniform sampling on a large dataset\ncontaining 319,209 records of rat hippocampal action potentials, measured by four co-located\nelectrodes. As done by [11], we concatenate the 38-sample waveforms produced by each electrode to\nobtain a 152-dimensional vector. The vectors are normalized so each component has zero mean\nand unit variance. The 319,209 records are divided in half to obtain training and testing sets. From\nthe training set, we produce coresets and uniformly sampled subsets of sizes between 70 and 1000,\nusing the parameters k = 33 (as in [11]), \u03b2 = 66, and \u03b4 = 0.1, and fit GMMs. The log likelihood of\neach model on the held-out testing data is shown in Figure 3(b). Coreset GMMs obtain consistently\nhigher LLH than uniform sample GMMs for sets of the same size, and even a coreset of 100 points\nperforms very well. Overall, training on coresets achieves approximately the same likelihood as\ntraining on the full set about 95 times faster (1.2 minutes vs. 1.9 hours).\nCSN cell phone accelerometer data. Smart phones with accelerometers are being used by the\nCommunity Seismic Network (CSN) as inexpensive seismometers for earthquake detection. In [12],\n7 GB of acceleration data were recorded from volunteers while carrying and operating their phone in\nnormal conditions (walking, talking, on desk, etc.). From this data, 17-dimensional feature vectors\nwere computed (containing frequency information, moments, etc.).
The goal is to train, in an online\nfashion, GMMs based on normal data, which then can be used to perform anomaly detection to\ndetect possible seismic activity. Motivated by the limited storage on smart phones, we evaluate coresets\non a data set of 40,000 accelerometer feature vectors, using the parameters k = 6, \u03b2 = 12, and \u03b4 =\n0.1. Figure 3(c) presents the results of this experiment. Notice that on this data set, coresets show an\neven larger improvement over uniform sampling. We hypothesize that this is due to the fact that the\nrecorded accelerometer data is imbalanced, and contains clusters of vastly varying size, so uniform\nsampling does not represent smaller clusters well. Overall, the coresets obtain a speedup of\napproximately 35 compared to training on the full set. We also evaluate how GMMs trained on the coreset\ncompare with the baseline GMMs in terms of anomaly detection performance. For each GMM, we\ncompute ROC curves measuring the performance of detecting earthquake recordings from the\nSouthern California Seismic Network (cf., [12]). Note that even very small coresets lead to performance\ncomparable to training on the full set, drastically outperforming uniform sampling (Fig. 3(d)).\n\n[Figure 3 plots: x-axis, Training Set Size; y-axis, Log Likelihood on Test Data Set in panels (a)-(c) and Area Under ROC Curve in panel (d); curves: Full Set, Uniform Sample, Coreset.]\n\n6 Related Work\nTheoretical results on mixtures of Gaussians. There has been a significant amount of work on\nlearning and applying GMMs (and more general distributions).
Perhaps the most commonly used\ntechnique in practice is the EM algorithm [13], which is however only guaranteed to converge to a\nlocal optimum of the likelihood. Dasgupta [14] is the first to show that parameters of an unknown\nGMM P can be estimated in polynomial time, with arbitrary accuracy \u03b5, given i.i.d. samples from\nP. However, his algorithm assumes a common covariance, bounded eccentricity, a (known) bound\non the smallest component weight, as well as a separation (distance of the means) that scales as\n\u2126(\u221ad). Subsequent works relax the assumption on separation to d^{1/4} [15] and k^{1/4} [16]. [3] is\nthe first to learn general GMMs, with separation d^{1/4}. [17] provides the first result that does not\nrequire any separation, but assumes that the Gaussians are axis-aligned. Recently, [18] and [19]\nprovide algorithms with polynomial running time (except exponential dependence on k) and sample\ncomplexity for arbitrary GMMs. However, in contrast to our results, all the results described above\ncrucially rely on the fact that the data set D is actually generated by a mixture of Gaussians. The\nproblem of fitting a mixture model with near-optimal log-likelihood for arbitrary data is studied\nby [3], who provide a PTAS for this problem. However, their result requires that the Gaussians\nare identical spheres, in which case the maximum likelihood problem is identical to the k-means\nproblem. In contrast, our results make only mild assumptions about the Gaussian components.\nFurthermore, none of the algorithms described above applies to the streaming or parallel setting.\n\nCoresets. Approximation algorithms in computational geometry often make use of random\nsampling, feature extraction, and \u03b5-samples [20]. Coresets can be viewed as a general concept that\nincludes all of the above, and more. See a comprehensive survey on this topic in [4].
It is not clear\nthat there is any commonly agreed-upon definition of a coreset, despite several inconsistent attempts\nto provide one [6, 8]. Coresets have been the subject of many recent papers and several surveys [1, 2]. They\nhave been used to great effect for a host of geometric and graph problems, including k-median [6],\nk-means [8], k-center [21], k-line median [10], subspace approximation [10, 22], etc. Coresets also\nimply streaming algorithms for many of these problems [6, 1, 23, 8]. A framework that generalizes\nand improves several of these results has recently appeared in [4].\n\n7 Conclusion\nWe have shown how to construct coresets for estimating parameters of GMMs and natural\ngeneralizations. Our construction hinges on a natural connection between statistical estimation and\nclustering problems in computational geometry. To our knowledge, our results provide the first rigorous\nguarantees for obtaining compressed \u03b5-approximations of the log-likelihood of mixture models for\nlarge data sets. The coreset construction relies on an intuitive adaptive sampling scheme, and can be\neasily implemented. By exploiting certain closure properties of coresets, it is possible to construct\nthem in parallel, or in a single pass through a stream of data, using only poly(dk\u03b5^{\u22121} log n log(1/\u03b4))\nspace and update time. Unlike most of the related work, our coresets provide guarantees for any\ngiven (possibly unstructured) data, without assumptions on the distribution or model that generated\nit. Lastly, we apply our construction to three real data sets, demonstrating significant gains over no\nor naive subsampling.\n\nAcknowledgments This research was partially supported by ONR grant N00014-09-1-1044, NSF\ngrants CNS-0932392, IIS-0953413 and DARPA MSEE grant FA8650-11-1-7156.\n\nReferences\n[1] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Geometric approximations via coresets.
Combinatorial and Computational Geometry - MSRI Publications, 52:1\u201330, 2005.\n\n[2] A. Czumaj and C. Sohler. Sublinear-time approximation algorithms for clustering via random sampling. Random Struct. Algorithms (RSA), 30(1-2):226\u2013256, 2007.\n\n[3] Sanjeev Arora and Ravi Kannan. Learning mixtures of separated nonspherical Gaussians. Annals of Applied Probability, 15(1A):69\u201392, 2005.\n\n[4] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Proc. 43rd Annu. ACM Symp. on Theory of Computing (STOC), 2011.\n\n[5] S. Har-Peled and A. Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3\u201319, 2007.\n\n[6] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In Proc. 36th Annu. ACM Symp. on Theory of Computing (STOC), pages 291\u2013300, 2004.\n\n[7] Jon Louis Bentley and James B. Saxe. Decomposable searching problems I: Static-to-dynamic transformation. J. Algorithms, 1(4):301\u2013358, 1980.\n\n[8] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In Proc. 23rd ACM Symp. on Computational Geometry (SoCG), pages 11\u201318, 2007.\n\n[9] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI\u201904: Sixth Symposium on Operating System Design and Implementation, 2004.\n\n[10] D. Feldman, A. Fiat, and M. Sharir. Coresets for weighted facilities and their applications. In Proc. 47th IEEE Annu. Symp. on Foundations of Computer Science (FOCS), pages 315\u2013324, 2006.\n\n[11] Ryan Gomes, Andreas Krause, and Pietro Perona. Discriminative clustering by regularized information maximization. In Proc. Neural Information Processing Systems (NIPS), 2010.\n\n[12] Matthew Faulkner, Michael Olson, Rishi Chandy, Jonathan Krause, K. Mani Chandy, and Andreas Krause.
The next big one: Detecting earthquakes and other rare events from community-based sensors. In Proc. ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), 2011.\n\n[13] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39:1\u201338, 1977.\n\n[14] S. Dasgupta. Learning mixtures of Gaussians. In Fortieth Annual IEEE Symposium on Foundations of Computer Science (FOCS), 1999.\n\n[15] S. Dasgupta and L.J. Schulman. A two-round variant of EM for Gaussian mixtures. In Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), 2000.\n\n[16] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002.\n\n[17] J. Feldman, R. A. Servedio, and R. O\u2019Donnell. PAC learning axis-aligned mixtures of Gaussians with no separation assumption. In COLT, 2006.\n\n[18] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Proc. Foundations of Computer Science (FOCS), 2010.\n\n[19] M. Belkin and K. Sinha. Polynomial learning of distribution families. In Proc. Foundations of Computer Science (FOCS), 2010.\n\n[20] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inf. Comput., 100(1):78\u2013150, 1992.\n\n[21] S. Har-Peled and K. R. Varadarajan. High-dimensional shape fitting in linear time. Discrete & Computational Geometry, 32(2):269\u2013288, 2004.\n\n[22] M.W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697, 2009.\n\n[23] G. Frahling and C. Sohler. Coresets in dynamic geometric data streams. In Proc. 37th Annu.
ACM Symp. on Theory of Computing (STOC), pages 209\u2013217, 2005.", "award": [], "sourceid": 1186, "authors": [{"given_name": "Dan", "family_name": "Feldman", "institution": null}, {"given_name": "Matthew", "family_name": "Faulkner", "institution": null}, {"given_name": "Andreas", "family_name": "Krause", "institution": null}]}