{"title": "Coresets for Scalable Bayesian Logistic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 4080, "page_last": 4088, "abstract": "The use of Bayesian methods in large-scale data settings is attractive because of the rich hierarchical models, uncertainty quantification, and prior specification they provide. Standard Bayesian inference algorithms are computationally expensive, however, making their direct application to large datasets difficult or infeasible. Recent work on scaling Bayesian inference has focused on modifying the underlying algorithms to, for example, use only a random data subsample at each iteration. We leverage the insight that data is often redundant to instead obtain a weighted subset of the data (called a coreset) that is much smaller than the original dataset. We can then use this small coreset in any number of existing posterior inference algorithms without modification. In this paper, we develop an efficient coreset construction algorithm for Bayesian logistic regression models. We provide theoretical guarantees on the size and approximation quality of the coreset -- both for fixed, known datasets, and in expectation for a wide class of data generative models. Crucially, the proposed approach also permits efficient construction of the coreset in both streaming and parallel settings, with minimal additional effort. We demonstrate the efficacy of our approach on a number of synthetic and real-world datasets, and find that, in practice, the size of the coreset is independent of the original dataset size. Furthermore, constructing the coreset takes a negligible amount of time compared to that required to run MCMC on it.", "full_text": "Coresets for Scalable Bayesian Logistic Regression\n\nJonathan H. 
Huggins\n\nTrevor Campbell\n\nComputer Science and Arti\ufb01cial Intelligence Laboratory, MIT\n\n{jhuggins@, tdjc@, tbroderick@csail.}mit.edu\n\nTamara Broderick\n\nAbstract\n\nThe use of Bayesian methods in large-scale data settings is attractive because of\nthe rich hierarchical models, uncertainty quanti\ufb01cation, and prior speci\ufb01cation\nthey provide. Standard Bayesian inference algorithms are computationally ex-\npensive, however, making their direct application to large datasets dif\ufb01cult or in-\nfeasible. Recent work on scaling Bayesian inference has focused on modifying\nthe underlying algorithms to, for example, use only a random data subsample at\neach iteration. We leverage the insight that data is often redundant to instead ob-\ntain a weighted subset of the data (called a coreset) that is much smaller than the\noriginal dataset. We can then use this small coreset in any number of existing\nposterior inference algorithms without modi\ufb01cation. In this paper, we develop an\nef\ufb01cient coreset construction algorithm for Bayesian logistic regression models.\nWe provide theoretical guarantees on the size and approximation quality of the\ncoreset \u2013 both for \ufb01xed, known datasets, and in expectation for a wide class of\ndata generative models. Crucially, the proposed approach also permits ef\ufb01cient\nconstruction of the coreset in both streaming and parallel settings, with minimal\nadditional effort. We demonstrate the ef\ufb01cacy of our approach on a number of\nsynthetic and real-world datasets, and \ufb01nd that, in practice, the size of the coreset\nis independent of the original dataset size. Furthermore, constructing the coreset\ntakes a negligible amount of time compared to that required to run MCMC on it.\n\n1\n\nIntroduction\n\nLarge-scale datasets, comprising tens or hundreds of millions of observations, are becoming the\nnorm in scienti\ufb01c and commercial applications ranging from population genetics to advertising. 
At\nsuch scales even simple operations, such as examining each data point a small number of times,\nbecome burdensome; it is sometimes not possible to \ufb01t all data in the physical memory of a sin-\ngle machine. These constraints have, in the past, limited practitioners to relatively simple statis-\ntical modeling approaches. However, the rich hierarchical models, uncertainty quanti\ufb01cation, and\nprior speci\ufb01cation provided by Bayesian methods have motivated substantial recent effort in making\nBayesian inference procedures, which are often computationally expensive, scale to the large-data\nsetting.\nThe standard approach to Bayesian inference for large-scale data is to modify a speci\ufb01c inference al-\ngorithm, such as MCMC or variational Bayes, to handle distributed or streaming processing of data.\nExamples include subsampling and streaming methods for variational Bayes [6, 7, 16], subsampling\nmethods for MCMC [4, 18, 24], and distributed \u201cconsensus\u201d methods for MCMC [8, 19, 21, 22].\nExisting methods, however, suffer from both practical and theoretical limitations. Stochastic varia-\ntional inference [16] and subsampling MCMC methods use a new random subset of the data at each\niteration, which requires random access to the data and hence is infeasible for very large datasets\nthat do not \ufb01t into memory. Furthermore, in practice, subsampling MCMC methods have been found\nto require examining a constant fraction of the data at each iteration, severely limiting the compu-\ntational gains obtained [5, 23]. 
More scalable methods such as consensus MCMC [19, 21, 22]\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fand streaming variational Bayes [6, 7] lead to gains in computational ef\ufb01ciency, but lack rigorous\njusti\ufb01cation and provide no guarantees on the quality of inference.\nAn important insight in the large-scale setting is that much of the data is often redundant, though\nthere may also be a small set of data points that are distinctive. For example, in a large document\ncorpus, one news article about a hockey game may serve as an excellent representative of hundreds\nor thousands of other similar pieces about hockey games. However, there may only be a few articles\nabout luge, so it is also important to include at least one article about luge. Similarly, one individ-\nual\u2019s genetic information may serve as a strong representative of other individuals from the same\nancestral population admixture, though some individuals may be genetic outliers. We leverage data\nredundancy to develop a scalable Bayesian inference framework that modi\ufb01es the dataset instead of\nthe common practice of modifying the inference algorithm. Our method, which can be thought of as\na preprocessing step, constructs a coreset \u2013 a small, weighted subset of the data that approximates\nthe full dataset [1, 9] \u2013 that can be used in many standard inference procedures to provide posterior\napproximations with guaranteed quality. The scalability of posterior inference with a coreset thus\nsimply depends on the coreset\u2019s growth with the full dataset size. To the best of our knowledge,\ncoresets have not previously been used in a Bayesian setting.\nThe concept of coresets originated in computational geometry (e.g. [1]), but then became popular\nin theoretical computer science as a way to ef\ufb01ciently solve clustering problems such as k-means\nand PCA (see [9, 11] and references therein). 
Coreset research in the machine learning community\nhas focused on scalable clustering in the optimization setting [3, 17], with the exception of Feldman\net al. [10], who developed a coreset algorithm for Gaussian mixture models. Coreset-like ideas have\npreviously been explored for maximum likelihood-learning of logistic regression models, though\nthese methods either lack rigorous justi\ufb01cation or have only asymptotic guarantees (see [15] and\nreferences therein).\nThe job of the coreset in the Bayesian setting is to provide an approximation of the full data log-\nlikelihood up to a multiplicative error uniformly over the parameter space. As this paper is the \ufb01rst\nforay into applying coresets in Bayesian inference, we begin with a theoretical analysis of the quality\nof the posterior distribution obtained from such an approximate log-likelihood. The remainder of the\npaper develops the ef\ufb01cient construction of small coresets for Bayesian logistic regression, a useful\nand widely-used model for the ubiquitous problem of binary classi\ufb01cation. We develop a core-\nset construction algorithm, the output of which uniformly approximates the full data log-likelihood\nover parameter values in a ball with a user-speci\ufb01ed radius. The approximation guarantee holds for\na given dataset with high probability. We also obtain results showing that the boundedness of the\nparameter space is necessary for the construction of a nontrivial coreset, as well as results charac-\nterizing the algorithm\u2019s expected performance under a wide class of data-generating distributions.\nOur proposed algorithm is applicable in both the streaming and distributed computation settings,\nand the coreset can then be used by any inference algorithm which accesses the (gradient of the)\nlog-likelihood as a black box. 
Although our coreset algorithm is specifically for logistic regression, our approach is broadly applicable to other Bayesian generative models.
Experiments on a variety of synthetic and real-world datasets validate our approach and demonstrate robustness to the choice of algorithm hyperparameters. An empirical comparison to random subsampling shows that, in many cases, coreset-based posteriors are orders of magnitude better in terms of maximum mean discrepancy, including on a challenging 100-dimensional real-world dataset. Crucially, our coreset construction algorithm adds negligible computational overhead to the inference procedure. All proofs are deferred to the Supplementary Material.

2 Problem Setting

We begin with the general problem of Bayesian posterior inference. Let $\mathcal{D} = \{(X_n, Y_n)\}_{n=1}^N$ be a dataset, where $X_n \in \mathcal{X}$ is a vector of covariates and $Y_n \in \mathcal{Y}$ is an observation. Let $\pi_0(\theta)$ be a prior density on a parameter $\theta \in \Theta$ and let $p(Y_n \mid X_n, \theta)$ be the likelihood of observation $n$ given the parameter $\theta$. The Bayesian posterior is given by the density $\pi_N(\theta)$, where
$$\pi_N(\theta) := \frac{\exp(\mathcal{L}_N(\theta))\,\pi_0(\theta)}{\mathcal{E}_N}, \quad \mathcal{L}_N(\theta) := \sum_{n=1}^N \ln p(Y_n \mid X_n, \theta), \quad \mathcal{E}_N := \int \exp(\mathcal{L}_N(\theta))\,\pi_0(\theta)\,\mathrm{d}\theta.$$

Algorithm 1 Construction of logistic regression coreset
Require: Data $\mathcal{D}$, $k$-clustering $\mathcal{Q}$, radius $R > 0$, tolerance $\varepsilon > 0$, failure rate $\delta \in (0, 1)$
1: for $n = 1, \dots, N$ do  ▷ calculate sensitivity upper bounds using the $k$-clustering
2:   $m_n \leftarrow \frac{N}{1 + \sum_{i=1}^k |G_i^{(-n)}|\, e^{-R \|\bar Z_{G,i}^{(-n)} - Z_n\|_2}}$
3: end for
4: $\bar m_N \leftarrow \frac{1}{N} \sum_{n=1}^N m_n$
5: $M \leftarrow \big\lceil \frac{c\,\bar m_N}{\varepsilon^2} \left[(D + 1)\log \bar m_N + \log(1/\delta)\right] \big\rceil$  ▷ coreset size; $c$ is from proof of Theorem B.1
6: for $n = 1, \dots, N$ do
7:   $p_n \leftarrow \frac{m_n}{N \bar m_N}$  ▷ importance weights of data
8: end for
9: $(K_1, \dots, K_N) \sim \mathsf{Multi}(M, (p_n)_{n=1}^N)$  ▷ sample data for coreset
10: for $n = 1, \dots, N$ do
11:   $\gamma_n \leftarrow \frac{K_n}{p_n M}$  ▷ calculate coreset weights
12: end for
13: $\tilde{\mathcal{D}} \leftarrow \{(\gamma_n, X_n, Y_n) \mid \gamma_n > 0\}$  ▷ only keep data points with non-zero weights
14: return $\tilde{\mathcal{D}}$

Our aim is to construct a weighted dataset $\tilde{\mathcal{D}} = \{(\gamma_m, \tilde X_m, \tilde Y_m)\}_{m=1}^M$ with $M \ll N$ such that the weighted log-likelihood $\tilde{\mathcal{L}}_N(\theta) = \sum_{m=1}^M \gamma_m \ln p(\tilde Y_m \mid \tilde X_m, \theta)$ satisfies
$$|\mathcal{L}_N(\theta) - \tilde{\mathcal{L}}_N(\theta)| \le \varepsilon\,|\mathcal{L}_N(\theta)|, \quad \forall \theta \in \Theta. \qquad (1)$$
If $\tilde{\mathcal{D}}$ satisfies Eq. (1), it is called an $\varepsilon$-coreset of $\mathcal{D}$, and the approximate posterior
$$\tilde\pi_N(\theta) = \frac{\exp(\tilde{\mathcal{L}}_N(\theta))\,\pi_0(\theta)}{\tilde{\mathcal{E}}_N}, \quad \text{with} \quad \tilde{\mathcal{E}}_N = \int \exp(\tilde{\mathcal{L}}_N(\theta))\,\pi_0(\theta)\,\mathrm{d}\theta,$$
has a marginal likelihood $\tilde{\mathcal{E}}_N$ which approximates the true marginal likelihood $\mathcal{E}_N$, as shown by Proposition 2.1. Thus, from a Bayesian perspective, the $\varepsilon$-coreset is a useful notion of approximation.
Proposition 2.1. Let $\mathcal{L}(\theta)$ and $\tilde{\mathcal{L}}(\theta)$ be arbitrary non-positive log-likelihood functions that satisfy $|\mathcal{L}(\theta) - \tilde{\mathcal{L}}(\theta)| \le \varepsilon\,|\mathcal{L}(\theta)|$ for all $\theta \in \Theta$.
Then for any prior $\pi_0(\theta)$ such that the marginal likelihoods
$$\mathcal{E} = \int \exp(\mathcal{L}(\theta))\,\pi_0(\theta)\,\mathrm{d}\theta \quad \text{and} \quad \tilde{\mathcal{E}} = \int \exp(\tilde{\mathcal{L}}(\theta))\,\pi_0(\theta)\,\mathrm{d}\theta$$
are finite, the marginal likelihoods satisfy $|\ln \mathcal{E} - \ln \tilde{\mathcal{E}}| \le \varepsilon\,|\ln \mathcal{E}|$.

3 Coresets for Logistic Regression

3.1 Coreset Construction
In logistic regression, the covariates are real feature vectors $X_n \in \mathbb{R}^D$, the observations are labels $Y_n \in \{-1, 1\}$, $\Theta \subseteq \mathbb{R}^D$, and the likelihood is defined as
$$p(Y_n \mid X_n, \theta) = p_{\mathrm{logistic}}(Y_n \mid X_n, \theta) := \frac{1}{1 + \exp(-Y_n X_n \cdot \theta)}.$$
The analysis in this work allows any prior $\pi_0(\theta)$; common choices are the Gaussian, Cauchy [12], and spike-and-slab [13]. For notational brevity, we define $Z_n := Y_n X_n$ and let $\phi(s) := \ln(1 + \exp(-s))$. Choosing the optimal $\varepsilon$-coreset is not computationally feasible, so we take a less direct approach. We design our coreset construction algorithm and prove its correctness using a quantity $\sigma_n(\Theta)$ called the sensitivity [9], which quantifies the redundancy of a particular data point $n$ (the larger the sensitivity, the less redundant). In the setting of logistic regression, the sensitivity is
$$\sigma_n(\Theta) := \sup_{\theta \in \Theta} \frac{N \phi(Z_n \cdot \theta)}{\sum_{\ell=1}^N \phi(Z_\ell \cdot \theta)}.$$
Intuitively, $\sigma_n(\Theta)$ captures how much influence data point $n$ has on the log-likelihood $\mathcal{L}_N(\theta)$ when varying the parameter $\theta \in \Theta$, and thus data points with high sensitivity should be included in the coreset. Evaluating $\sigma_n(\Theta)$ exactly is not tractable, however, so an upper bound $m_n \ge \sigma_n(\Theta)$ must be used in its place.
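To make the sensitivity concrete, the supremum can be estimated by brute force on a toy dataset. The sketch below is ours and is not part of the paper's method (the function names are hypothetical); sampling $\theta$ in the ball only lower-bounds the supremum, but it illustrates that a point far from the rest of the data has much larger sensitivity than a duplicated one.

```python
import numpy as np

def phi(s):
    """Logistic loss phi(s) = ln(1 + exp(-s)), computed stably."""
    return np.logaddexp(0.0, -s)

def sensitivity_lower_estimate(Z, n, R, num_samples=5000, seed=0):
    """Monte Carlo estimate of sigma_n(B_R) =
    sup_{||theta||_2 <= R} N * phi(Z_n . theta) / sum_l phi(Z_l . theta).
    Sampling theta only lower-bounds the supremum."""
    rng = np.random.default_rng(seed)
    N, D = Z.shape
    best = 0.0
    for _ in range(num_samples):
        v = rng.standard_normal(D)
        theta = rng.uniform(0.0, R) * v / np.linalg.norm(v)  # random point in B_R
        losses = phi(Z @ theta)
        best = max(best, N * losses[n] / losses.sum())
    return best
```

On a toy dataset with nine copies of one point and a single opposite point, the estimated sensitivity of the outlier is several times that of a duplicate, matching the intuition that less redundant points are more sensitive.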
Thus, the key challenge is to efficiently compute a tight upper bound on the sensitivity.
For the moment we will consider $\Theta = B_R$ for any $R > 0$, where $B_R := \{\theta \in \mathbb{R}^D : \|\theta\|_2 \le R\}$; we discuss the case of $\Theta = \mathbb{R}^D$ shortly. Choosing the parameter space to be a Euclidean ball is reasonable since data is usually preprocessed to have mean zero and variance 1 (or, for sparse data, to be between -1 and 1), so each component of $\theta$ is typically in a range close to zero (e.g. between -4 and 4) [12].
The idea behind our sensitivity upper bound construction is that we would expect data points that are bunched together to be redundant, while data points that are far from other data have a large effect on inferences. Clustering is an effective way to summarize data and detect outliers, so we will use a $k$-clustering of the data $\mathcal{D}$ to construct the sensitivity bound. A $k$-clustering is given by $k$ cluster centers $\mathcal{Q} = \{Q_1, \dots, Q_k\}$. Let $G_i := \{Z_n \mid i = \arg\min_j \|Q_j - Z_n\|_2\}$ be the set of vectors closest to center $Q_i$, and let $G_i^{(-n)} := G_i \setminus \{Z_n\}$. Define $Z_{G,i}^{(-n)}$ to be a uniform random vector from $G_i^{(-n)}$ and let $\bar Z_{G,i}^{(-n)} := \mathbb{E}[Z_{G,i}^{(-n)}]$ be its mean. The following lemma uses a $k$-clustering to establish an efficiently computable upper bound on $\sigma_n(B_R)$:
Lemma 3.1. For any $k$-clustering $\mathcal{Q}$,
$$\sigma_n(B_R) \le m_n := \frac{N}{1 + \sum_{i=1}^k |G_i^{(-n)}|\, e^{-R \|\bar Z_{G,i}^{(-n)} - Z_n\|_2}}. \qquad (2)$$
Furthermore, $m_n$ can be calculated in $O(k)$ time.
The bound in Eq. (2) captures the intuition that if the data forms tight clusters (that is, each $Z_n$ is close to one of the cluster centers), we expect each cluster to be well-represented by a small number of typical data points. For example, if $Z_n \in G_i$, $\|\bar Z_{G,i}^{(-n)} - Z_n\|_2$ is small, and $|G_i^{(-n)}| = \Theta(N)$, then $\sigma_n(B_R) = O(1)$. We use the (normalized) sensitivity bounds obtained from Lemma 3.1 to form an importance distribution $(p_n)_{n=1}^N$ from which to sample the coreset. If we sample $Z_n$, then we assign it weight $\gamma_n$ proportional to $1/p_n$. The size of the coreset depends on the mean sensitivity bound, the desired error $\varepsilon$, and a quantity closely related to the VC dimension of $\theta \mapsto \phi(\theta \cdot Z)$, which we show is $D + 1$. Combining these pieces, we obtain Algorithm 1, which constructs an $\varepsilon$-coreset with high probability, as stated in Theorem 3.2.
Theorem 3.2. Fix $\varepsilon > 0$, $\delta \in (0, 1)$, and $R > 0$. Consider a dataset $\mathcal{D}$ with $k$-clustering $\mathcal{Q}$. With probability at least $1 - \delta$, Algorithm 1 with inputs $(\mathcal{D}, \mathcal{Q}, R, \varepsilon, \delta)$ constructs an $\varepsilon$-coreset of $\mathcal{D}$ for logistic regression with parameter space $\Theta = B_R$. Furthermore, Algorithm 1 runs in $O(Nk)$ time.
Remark 3.3. The coreset algorithm is efficient, with an $O(Nk)$ running time. However, the algorithm requires a $k$-clustering, which must also be constructed. A high-quality clustering can be obtained cheaply via k-means++ in $O(Nk)$ time [2], although a coreset algorithm could also be used.
Examining Algorithm 1, we see that the coreset size $M$ is of order $\bar m_N \log \bar m_N$, where $\bar m_N = \frac{1}{N}\sum_n m_n$. So for $M$ to be smaller than $N$, at a minimum, $\bar m_N$ should satisfy $\bar m_N = \tilde o(N)$,¹ and preferably $\bar m_N = O(1)$. Indeed, for the coreset size to be small, it is critical that (a) $\Theta$ is chosen such that most of the sensitivities satisfy $\sigma_n(\Theta) \ll N$ (since $N$ is the maximum possible sensitivity), (b) each upper bound $m_n$ is close to $\sigma_n(\Theta)$, and (c) ideally, $\bar m_N$ is bounded by a constant.
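Given cluster assignments and the bounds of Lemma 3.1, the sampling stage of Algorithm 1 is straightforward to sketch. The code below is our illustrative sketch, not the authors' released implementation: it assumes a precomputed label vector (e.g. from k-means++) and takes the coreset size $M$ as an input rather than deriving it from $(\varepsilon, \delta)$ as in line 5 of Algorithm 1.

```python
import numpy as np

def sensitivity_bounds(Z, labels, R):
    """Upper bounds m_n on sigma_n(B_R) from a k-clustering (Lemma 3.1):
    m_n = N / (1 + sum_i |G_i^(-n)| * exp(-R * ||mean(G_i^(-n)) - Z_n||_2))."""
    N, _ = Z.shape
    clusters = np.unique(labels)
    sums = {i: Z[labels == i].sum(axis=0) for i in clusters}
    counts = {i: int((labels == i).sum()) for i in clusters}
    m = np.empty(N)
    for n in range(N):
        denom = 1.0
        for i in clusters:
            in_cluster = labels[n] == i
            cnt = counts[i] - (1 if in_cluster else 0)  # |G_i^(-n)|
            if cnt == 0:
                continue
            mean = (sums[i] - (Z[n] if in_cluster else 0.0)) / cnt
            denom += cnt * np.exp(-R * np.linalg.norm(mean - Z[n]))
        m[n] = N / denom
    return m

def build_coreset(Z, labels, R, M, rng=None):
    """Importance-sample an M-point weighted coreset (Algorithm 1 sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    m = sensitivity_bounds(Z, labels, R)
    p = m / m.sum()                      # importance distribution p_n
    K = rng.multinomial(M, p)            # counts (K_1, ..., K_N)
    idx = np.nonzero(K)[0]               # keep only points with K_n > 0
    gamma = K[idx] / (p[idx] * M)        # weights gamma_n = K_n / (p_n * M)
    return idx, gamma
```

Note that since each $m_n$ lies in $[1, N]$ by construction, the importance distribution is well-defined whenever $N \ge 1$.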
In Section 3.2, we address (a) by providing sensitivity lower bounds, thereby showing that the constraint $\Theta = B_R$ is necessary for nontrivial sensitivities even for "typical" (i.e. non-pathological) data. We then apply our lower bounds to address (b) and show that our bound in Lemma 3.1 is nearly tight. In Section 3.3, we address (c) by establishing the expected performance of the bound in Lemma 3.1 for a wide class of data-generating distributions.
¹Recall that the tilde notation suppresses logarithmic terms.

3.2 Sensitivity Lower Bounds

We now develop lower bounds on the sensitivity to demonstrate that essentially we must limit ourselves to bounded $\Theta$,² thus making our choice of $\Theta = B_R$ a natural one, and to show that the sensitivity upper bound from Lemma 3.1 is nearly tight.
We begin by showing that in both the worst case and the average case, for all $n$, $\sigma_n(\mathbb{R}^D) = N$, the maximum possible sensitivity, even when the $Z_n$ are arbitrarily close together. Intuitively, the reason for the worst-case behavior is that if there is a separating hyperplane between a data point $Z_n$ and the remaining data points, and $\theta$ is in the direction of that hyperplane, then when $\|\theta\|_2$ becomes very large, $Z_n$ becomes arbitrarily more important than any other data point.
Theorem 3.4. For any $D \ge 3$, $N \in \mathbb{N}$, and $0 < \varepsilon' < 1$, there exist $\varepsilon > 0$ and unit vectors $Z_1, \dots, Z_N \in \mathbb{R}^D$ such that for all pairs $n, n'$, $Z_n \cdot Z_{n'} \ge 1 - \varepsilon'$, and for all $R > 0$ and all $n$,
$$\sigma_n(B_R) \ge \frac{N}{1 + (N - 1)\,e^{-R\varepsilon\sqrt{\varepsilon'}/4}}, \quad \text{and hence} \quad \sigma_n(\mathbb{R}^D) = N.$$
The proof of Theorem 3.4 is based on choosing $N$ distinct unit vectors $V_1, \dots, V_N \in \mathbb{R}^{D-1}$ and setting $\varepsilon = 1 - \max_{n \ne n'} V_n \cdot V_{n'} > 0$. But what is a "typical" value for $\varepsilon$?
In the case of the vectors being uniformly distributed on the unit sphere, we have the following scaling for $\varepsilon$ as $N$ increases:
Proposition 3.5. If $V_1, \dots, V_N$ are independent and uniformly distributed on the unit sphere $S^D := \{v \in \mathbb{R}^D : \|v\| = 1\}$ with $D \ge 2$, then with high probability
$$1 - \max_{n \ne n'} V_n \cdot V_{n'} \ge C_D N^{-4/(D-1)},$$
where $C_D$ is a constant depending only on $D$.
Furthermore, $N$ can be exponential in $D$ even with $\varepsilon$ remaining very close to 1:
Proposition 3.6. If $N = \lfloor \exp((1 - \varepsilon)^2 D/4)/\sqrt{D} \rfloor$ and $V_1, \dots, V_N$ are i.i.d. such that $V_{ni} = \pm\frac{1}{\sqrt{D}}$ with probability 1/2, then with probability at least 1/2, $1 - \max_{n \ne n'} V_n \cdot V_{n'} \ge \varepsilon$.
Propositions 3.5 and 3.6 demonstrate that the data vectors $Z_n$ found in Theorem 3.4 are, in two different senses, "typical" vectors and should not be thought of as worst-case data only occurring in some "negligible" or zero-measure set. These three results thus demonstrate that it is necessary to restrict attention to bounded $\Theta$. We can also use Theorem 3.4 to show that our sensitivity upper bound is nearly tight.
Corollary 3.7. For the data $Z_1, \dots, Z_N$ from Theorem 3.4,
$$\frac{N}{1 + (N - 1)\,e^{-R\varepsilon\sqrt{\varepsilon'}/4}} \le \sigma_n(B_R) \le \frac{N}{1 + (N - 1)\,e^{-R\sqrt{2\varepsilon'}}}.$$

3.3 k-Clustering Sensitivity Bound Performance

While Lemma 3.1 and Corollary 3.7 provide an upper bound on the sensitivity given a fixed dataset, we would also like to understand how the expected mean sensitivity increases with $N$. We might expect it to be finite since the logistic regression likelihood model is parametric; the coreset would thus be acting as a sort of approximate finite sufficient statistic.
Proposition 3.8 characterizes the expected performance of the upper bound from Lemma 3.1 under a wide class of generating distributions. This result demonstrates that, under reasonable conditions, the expected value of $\bar m_N$ is bounded for all $N$. As a concrete example, Corollary 3.9 specializes Proposition 3.8 to data with a single shared Gaussian generating distribution.
Proposition 3.8. Let $X_n \overset{indep}{\sim} \mathcal{N}(\mu_{L_n}, \Sigma_{L_n})$, where $L_n \overset{indep}{\sim} \mathsf{Multi}(\pi_1, \pi_2, \dots)$ is the mixture component responsible for generating $X_n$. For $n = 1, \dots, N$, let $Y_n \in \{-1, 1\}$ be conditionally independent given $X_n$ and set $Z_n = Y_n X_n$. Select $0 < r < 1/2$, and define $\eta_i = \max(\pi_i - N^{-r}, 0)$. The clustering of the data implied by $(L_n)_{n=1}^N$ results in the expected sensitivity bound
$$\mathbb{E}[\bar m_N] \le \frac{1}{N^{-1} + \sum_{i : \eta_i > 0} \eta_i\, e^{-R\sqrt{A_i N^{-1} \eta_i^{-1} + B_i}}} + N e^{-2N^{1-2r}} \;\overset{N \to \infty}{\longrightarrow}\; \frac{1}{\sum_i \pi_i\, e^{-R\sqrt{B_i}}},$$
where $A_i := \mathrm{Tr}[\Sigma_i] + (1 - \bar y_i^2)\,\mu_i^T\mu_i$, $B_i := \sum_j \pi_j \left(\mathrm{Tr}[\Sigma_j] + \bar y_i^2\,\mu_i^T\mu_i - 2\bar y_i \bar y_j\,\mu_i^T\mu_j + \mu_j^T\mu_j\right)$, and $\bar y_j = \mathbb{E}[Y_1 \mid L_1 = j]$.
Corollary 3.9. In the setting of Proposition 3.8, if $\pi_1 = 1$ and all data is assigned to a single cluster, then there is a constant $C$ such that for sufficiently large $N$, $\mathbb{E}[\bar m_N] \le C e^{R\sqrt{\mathrm{Tr}[\Sigma_1] + (1 - \bar y_1^2)\,\mu_1^T\mu_1}}$.
²Certain pathological datasets allow us to use unbounded $\Theta$, but we do not assume we are given such data.

[Figure 1: panels (a), (b) BINARY10, (c) WEBSPAM]
Figure 1: (a) Percentage of time spent creating the coreset relative to the total inference time (including 10,000 iterations of MCMC). Except for very small coreset sizes, coreset construction is a small fraction of the overall time. (b,c) The mean sensitivities for varying choices of R and k. When R varies k = 6, and when k varies R = 3. The mean sensitivity increases exponentially in R, as expected, but is robust to the choice of k.

3.4 Streaming and Parallel Settings

Algorithm 1 is a batch algorithm, but it can easily be used in parallel and streaming computation settings using standard methods from the coreset literature, which are based on the following two observations (cf. [10, Section 3.2]):
1. If $\tilde{\mathcal{D}}_i$ is an $\varepsilon$-coreset for $\mathcal{D}_i$, $i = 1, 2$, then $\tilde{\mathcal{D}}_1 \cup \tilde{\mathcal{D}}_2$ is an $\varepsilon$-coreset for $\mathcal{D}_1 \cup \mathcal{D}_2$.
2. If $\tilde{\mathcal{D}}$ is an $\varepsilon$-coreset for $\mathcal{D}$ and $\tilde{\mathcal{D}}'$ is an $\varepsilon'$-coreset for $\tilde{\mathcal{D}}$, then $\tilde{\mathcal{D}}'$ is an $\varepsilon''$-coreset for $\mathcal{D}$, where $\varepsilon'' := (1 + \varepsilon)(1 + \varepsilon') - 1$.
We can use these observations to merge coresets that were constructed either in parallel or sequentially, in a binary tree. Coresets are computed for two data blocks, merged using observation 1, then compressed further using observation 2. The next two data blocks have coresets computed and merged/compressed in the same manner; then the coresets from blocks 1&2 and 3&4 can be merged/compressed analogously. We continue in this way and organize the merge/compress operations into a binary tree. Then, if there are B data blocks total, only log B blocks ever need be maintained simultaneously. In the streaming setting we would choose blocks of constant size, so B = O(N), while in the parallel setting B would be the number of machines available.

4 Experiments

We evaluated the performance of the logistic regression coreset algorithm on a number of synthetic and real-world datasets.
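As a small numerical illustration of how approximation error accumulates in the merge/compress tree of Section 3.4 (function names are ours; we assume unions add no error, per observation 1, and that each of the $\lceil \log_2 B \rceil$ tree levels applies one compression, i.e. one factor of $(1 + \varepsilon)$, per observation 2):

```python
import math

def merge_eps(eps1, eps2):
    """Observation 2: compressing an eps1-coreset with an eps2-coreset
    construction yields an eps''-coreset with eps'' = (1+eps1)(1+eps2) - 1."""
    return (1.0 + eps1) * (1.0 + eps2) - 1.0

def tree_eps(eps, num_blocks):
    """Total error at the root of the binary merge/compress tree over
    num_blocks blocks, assuming one eps-compression per level:
    (1 + eps)^ceil(log2(num_blocks)) - 1."""
    levels = 0 if num_blocks <= 1 else math.ceil(math.log2(num_blocks))
    total = 0.0
    for _ in range(levels):
        total = merge_eps(total, eps)
    return total
```

For example, with $\varepsilon = 0.01$ at each of the $\lceil \log_2 1024 \rceil = 10$ levels, the root is roughly a $0.105$-coreset, so per-level tolerances must shrink only logarithmically in the number of blocks.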
We used a maximum dataset size of 1 million examples because we wanted to be able to calculate the true posterior, which would be infeasible for extremely large datasets.
Synthetic Data. We generated synthetic binary data according to the model $X_{nd} \overset{indep}{\sim} \mathrm{Bern}(p_d)$, $d = 1, \dots, D$, and $Y_n \overset{indep}{\sim} p_{\mathrm{logistic}}(\cdot \mid X_n, \theta)$. The idea is to simulate data in which there are a small number of rarely occurring but highly predictive features, which is a common real-world phenomenon. We thus took $p = (1, .2, .3, .5, .01, .1, .2, .007, .005, .001)$ and $\theta = (-3, 1.2, -.5, .8, 3, -1., -.7, 4, 3.5, 4.5)$ for the D = 10 experiments (BINARY10) and the first 5 components of $p$ and $\theta$ for the D = 5 experiments (BINARY5). The generative model is the same one used by Scott et al. [21], and the first 5 components of $p$ and $\theta$ correspond to those used in the Scott et al. experiments (given in [21, Table 1b]). We generated a synthetic mixture dataset with continuous covariates (MIXTURE) using a model similar to that of Han et al. [15]: $Y_n \overset{i.i.d.}{\sim} \mathrm{Bern}(1/2)$ and $X_n \overset{indep}{\sim} \mathcal{N}(\mu_{Y_n}, I)$, where $\mu_{-1} = (0,0,0,0,0,1,1,1,1,1)$ and $\mu_1 = (1,1,1,1,1,0,0,0,0,0)$.

[Figure 2: panels (a) BINARY5, (b) BINARY10, (c) MIXTURE, (d) CHEMREACT, (e) WEBSPAM, (f) COVTYPE]
Figure 2: Polynomial MMD and negative test log-likelihood of random sampling and the logistic regression coreset algorithm for synthetic and real data with varying subset sizes (lower is better for all plots). For the synthetic data, $N = 10^6$ total data points were used and $10^3$ additional data points were generated for testing. For the real data, 2,500 (resp. 50,000 and 29,000) data points of the CHEMREACT (resp. WEBSPAM and COVTYPE) dataset were held out for testing. One standard deviation error bars were obtained by repeating each experiment 20 times.

Real-world Data. The CHEMREACT dataset consists of N = 26,733 chemicals, each with D = 100 properties. The goal is to predict whether each chemical is reactive. The WEBSPAM corpus consists of N = 350,000 web pages, approximately 60% of which are spam. The covariates consist of the D = 127 features that each appear in at least 25 documents. The cover type (COVTYPE) dataset consists of N = 581,012 cartographic observations with D = 54 features. The task is to predict the type of trees that are present at each observation location.

4.1 Scaling Properties of the Coreset Construction Algorithm

Constructing Coresets. In order for coresets to be a worthwhile preprocessing step, it is critical that the time required to construct the coreset is small relative to the time needed to complete the inference procedure. We implemented the logistic regression coreset algorithm in Python.³ In Fig. 1a, we plot the relative time to construct the coreset for each type of dataset (k = 6) versus the total inference time, including 10,000 iterations of the MCMC procedure described in Section 4.2. Except for very small coreset sizes, the time to run MCMC dominates.
³More details on our implementation are provided in the Supplementary Material. Code to recreate all of our experiments is available at https://bitbucket.org/jhhuggins/lrcoresets.
Sensitivity. An important question is how the mean sensitivity $\bar m_N$ scales with $N$, as it determines how the size of the coreset scales with the data. Furthermore, ensuring that the mean sensitivity is robust to the number of clusters $k$ is critical, since needing to adjust the algorithm hyperparameters for each dataset could lead to an unacceptable increase in computational burden. We also seek to understand how the radius $R$ affects the mean sensitivity. Figs. 1b and 1c show the results of our scaling experiments on the BINARY10 and WEBSPAM data. The mean sensitivity is essentially constant across a range of dataset sizes.
For both datasets, the mean sensitivity is robust to the choice of $k$ and scales exponentially in $R$, as we would expect from Lemma 3.1.

4.2 Posterior Approximation Quality

Since the ultimate goal is to use coresets for Bayesian inference, the key empirical question is how well a posterior formed using a coreset approximates the true posterior distribution. We compared the coreset algorithm to random subsampling of data points, since that is the approach used in many existing scalable versions of variational inference and MCMC [4, 16]. Indeed, coreset-based importance sampling could be used as a drop-in replacement for the random subsampling used by these methods, though we leave the investigation of this idea for future work.
Experimental Setup. We used the adaptive Metropolis-adjusted Langevin algorithm (MALA) [14, 20] for posterior inference. For each dataset, we ran the coreset and random subsampling algorithms 20 times for each choice of subsample size M. We ran adaptive MALA for 100,000 iterations on the full dataset and each subsampled dataset. The subsampled datasets were fixed for the entirety of each run, in contrast to subsampling algorithms that resample the data at each iteration. For the synthetic datasets, which are lower dimensional, we used k = 4, while for the real-world datasets, which are higher dimensional, we used k = 6. We used a heuristic to choose $R$ as large as was feasible while still obtaining moderate total sensitivity bounds. For a clustering $\mathcal{Q}$ of data $\mathcal{D}$, let $\mathcal{I} := N^{-1}\sum_{i=1}^k \sum_{Z \in G_i} \|Z - Q_i\|_2$ be the normalized k-means score. We chose $R = a/\sqrt{\mathcal{I}}$, where $a$ is a small constant. The idea is that, for $i \in [k]$ and $Z_n \in G_i$, we want $R\|\bar Z_{G,i}^{(-n)} - Z_n\|_2 \approx a$ on average, so the term $\exp\{-R\|\bar Z_{G,i}^{(-n)} - Z_n\|_2\}$ in Eq. (2) is not too small and hence $\sigma_n(B_R)$ is not too large.
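The radius heuristic above takes only a few lines to sketch. This is our sketch under the stated definitions (the function name and argument layout are hypothetical; `centers[labels]` gives each point's assigned cluster center):

```python
import numpy as np

def choose_radius(Z, centers, labels, a=3.0):
    """Heuristic R = a / sqrt(I), where I is the normalized k-means score
    I = (1/N) * sum_n ||Z_n - Q_{label(n)}||_2."""
    score = np.linalg.norm(Z - centers[labels], axis=1).mean()
    return a / np.sqrt(score)
```

Tighter clusterings (smaller score) thus permit a larger radius, i.e. a larger parameter ball, for the same sensitivity budget.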
Our experiments used a = 3. We obtained similar results for 4 ≤ k ≤ 8 and 2.5 ≤ a ≤ 3.5, indicating that the logistic regression coreset algorithm has some robustness to the choice of these hyperparameters. We used negative test log-likelihood and maximum mean discrepancy (MMD) with a 3rd degree polynomial kernel as comparison metrics (so smaller is better).
Synthetic Data Results. Figures 2a-2c show the results for synthetic data. In terms of test log-likelihood, coresets did as well as or outperformed random subsampling. In terms of MMD, the coreset posterior approximation typically outperformed random subsampling by 1-2 orders of magnitude and never did worse. These results suggest much can be gained by using coresets, with comparable performance to random subsampling in the worst case.
Real-world Data Results. Figures 2d-2f show the results for real data. Using coresets led to better performance on CHEMREACT for small subset sizes. Because the dataset was fairly small and random subsampling was done without replacement, coresets were worse for larger subset sizes. Coreset and random subsampling performance was approximately the same for WEBSPAM. On WEBSPAM and COVTYPE, coresets either outperformed or did as well as random subsampling in terms of MMD and test log-likelihood on almost all subset sizes. The only exception was that random subsampling was superior on WEBSPAM for the smallest subset size. We suspect this is due to the variance introduced by the importance sampling procedure used to generate the coreset.
For both the synthetic and real-world data, in many cases we are able to obtain a high-quality logistic regression posterior approximation using a coreset that is many orders of magnitude smaller than the full dataset, sometimes just a few hundred data points.
Using such a small coreset represents a substantial reduction in the memory and computational requirements of the Bayesian inference algorithm that uses the coreset for posterior inference. We expect that the use of coresets could lead to similar gains for other Bayesian models. Designing coreset algorithms for other widely used models is an exciting direction for future research.

Acknowledgments

All authors are supported by the Office of Naval Research under ONR MURI grant N000141110688. JHH is supported by a National Defense Science and Engineering Graduate (NDSEG) Fellowship.

References
[1] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52:1–30, 2005.
[2] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[3] O. Bachem, M. Lucic, S. H. Hassani, and A. Krause. Approximate k-means++ in sublinear time. In AAAI Conference on Artificial Intelligence, 2016.
[4] R. Bardenet, A. Doucet, and C. C. Holmes. On Markov chain Monte Carlo methods for tall data. arXiv.org, May 2015.
[5] M. J. Betancourt. The fundamental incompatibility of Hamiltonian Monte Carlo and data subsampling. In International Conference on Machine Learning, 2015.
[6] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, Dec. 2013.
[7] T. Campbell, J. Straub, J. W. Fisher, III, and J. P. How. Streaming, distributed variational inference for Bayesian nonparametrics. In Advances in Neural Information Processing Systems, 2015.
[8] R. Entezari, R. V. Craiu, and J. S. Rosenthal. Likelihood inflating sampling algorithm. arXiv.org, May 2016.
[9] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Symposium on Theory of Computing. ACM, June 2011.
[10] D. Feldman, M. Faulkner, and A. Krause. Scalable training of mixture models via coresets. In Advances in Neural Information Processing Systems, pages 2142–2150, 2011.
[11] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Symposium on Discrete Algorithms, pages 1434–1453. SIAM, 2013.
[12] A. Gelman, A. Jakulin, M. G. Pittau, and Y.-S. Su. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360–1383, Dec. 2008.
[13] E. I. George and R. E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
[14] H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, pages 223–242, 2001.
[15] L. Han, T. Yang, and T. Zhang. Local uncertainty sampling for large-scale multi-class logistic regression. arXiv.org, Apr. 2016.
[16] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.
[17] M. Lucic, O. Bachem, and A. Krause. Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In International Conference on Artificial Intelligence and Statistics, 2016.
[18] D. Maclaurin and R. P. Adams. Firefly Monte Carlo: Exact MCMC with subsets of data. In Uncertainty in Artificial Intelligence, Mar. 2014.
[19] M. Rabinovich, E. Angelino, and M. I. Jordan. Variational consensus Monte Carlo. arXiv.org, June 2015.
[20] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, Nov. 1996.
[21] S. L. Scott, A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George, and R. E. McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. In Bayes 250, 2013.
[22] S. Srivastava, V. Cevher, Q. Tran-Dinh, and D. Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In International Conference on Artificial Intelligence and Statistics, 2015.
[23] Y. W. Teh, A. H. Thiery, and S. Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(7):1–33, Mar. 2016.
[24] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, 2011.