{"title": "PASS-GLM: polynomial approximate sufficient statistics for scalable Bayesian GLM inference", "book": "Advances in Neural Information Processing Systems", "page_first": 3611, "page_last": 3621, "abstract": "Generalized linear models (GLMs)---such as logistic regression, Poisson regression, and robust regression---provide interpretable models for diverse data types. Probabilistic approaches, particularly Bayesian ones, allow coherent estimates of uncertainty, incorporation of prior information, and sharing of power across experiments via hierarchical models. In practice, however, the approximate Bayesian methods necessary for inference have either failed to scale to large data sets or failed to provide theoretical guarantees on the quality of inference. We propose a new approach based on constructing polynomial approximate sufficient statistics for GLMs (PASS-GLM). We demonstrate that our method admits a simple algorithm as well as trivial streaming and distributed extensions that do not compound error across computations. We provide theoretical guarantees on the quality of point (MAP) estimates, the approximate posterior, and posterior mean and uncertainty estimates. We validate our approach empirically in the case of logistic regression using a quadratic approximation and show competitive performance with stochastic gradient descent, MCMC, and the Laplace approximation in terms of  speed and multiple measures of accuracy---including on an advertising data set with 40 million data points and 20,000 covariates.", "full_text": "PASS-GLM: polynomial approximate suf\ufb01cient\nstatistics for scalable Bayesian GLM inference\n\nJonathan H. Huggins\n\nCSAIL, MIT\n\nRyan P. 
Adams\n\nGoogle Brain and Princeton\n\nTamara Broderick\n\nCSAIL, MIT\n\njhuggins@mit.edu\n\nrpa@princeton.edu\n\ntbroderick@csail.mit.edu\n\nAbstract\n\nGeneralized linear models (GLMs)\u2014such as logistic regression, Poisson regres-\nsion, and robust regression\u2014provide interpretable models for diverse data types.\nProbabilistic approaches, particularly Bayesian ones, allow coherent estimates of\nuncertainty, incorporation of prior information, and sharing of power across exper-\niments via hierarchical models. In practice, however, the approximate Bayesian\nmethods necessary for inference have either failed to scale to large data sets or\nfailed to provide theoretical guarantees on the quality of inference. We propose a\nnew approach based on constructing polynomial approximate suf\ufb01cient statistics\nfor GLMs (PASS-GLM). We demonstrate that our method admits a simple algo-\nrithm as well as trivial streaming and distributed extensions that do not compound\nerror across computations. We provide theoretical guarantees on the quality of\npoint (MAP) estimates, the approximate posterior, and posterior mean and un-\ncertainty estimates. We validate our approach empirically in the case of logistic\nregression using a quadratic approximation and show competitive performance\nwith stochastic gradient descent, MCMC, and the Laplace approximation in terms\nof speed and multiple measures of accuracy\u2014including on an advertising data set\nwith 40 million data points and 20,000 covariates.\n\n1\n\nIntroduction\n\nScientists, engineers, and companies increasingly use large-scale data\u2014often only available via\nstreaming\u2014to obtain insights into their respective problems. For instance, scientists might be in-\nterested in understanding how varying experimental inputs leads to different experimental outputs;\nor medical professionals might be interested in understanding which elements of patient histories\nlead to certain health outcomes. 
Generalized linear models (GLMs) enable these practitioners to\nexplicitly and interpretably model the effect of covariates on outcomes while allowing \ufb02exible noise\ndistributions\u2014including binary, count-based, and heavy-tailed observations. Bayesian approaches\nfurther facilitate (1) understanding the importance of covariates via coherent estimates of parameter\nuncertainty, (2) incorporating prior knowledge into the analysis, and (3) sharing of power across dif-\nferent experiments or domains via hierarchical modeling. In practice, however, an exact Bayesian\nanalysis is computationally infeasible for GLMs, so an approximation is necessary. While some\napproximate methods provide asymptotic guarantees on quality, these methods often only run suc-\ncessfully in the small-scale data regime. In order to run on (at least) millions of data points and thou-\nsands of covariates, practitioners often turn to heuristics with no theoretical guarantees on quality.\nIn this work, we propose a novel and simple approximation framework for probabilistic inference in\nGLMs. We demonstrate theoretical guarantees on the quality of point estimates in the \ufb01nite-sample\nsetting and on the quality of Bayesian posterior approximations produced by our framework. We\nshow that our framework trivially extends to streaming data and to distributed architectures, with\nno additional compounding of error in these settings. We empirically demonstrate the practicality\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fof our framework on datasets with up to tens of millions of data points and tens of thousands of\ncovariates.\nLarge-scale Bayesian inference. Calculating accurate approximate Bayesian posteriors for large\ndata sets together with complex models and potentially high-dimensional parameter spaces is a long-\nstanding problem. 
We seek a method that satis\ufb01es the following criteria: (1) it provides a posterior\napproximation; (2) it is scalable; (3) it comes equipped with theoretical guarantees; and (4) it\nprovides arbitrarily good approximations. By posterior approximation we mean that the method\noutputs an approximate posterior distribution, not just a point estimate. By scalable we mean that\nthe method examines each data point only a small number of times, and further can be applied to\nstreaming and distributed data. By theoretical guarantees we mean that the posterior approximation\nis certi\ufb01ed to be close to the true posterior in terms of, for example, some metric on probability\nmeasures. Moreover, the distance between the exact and approximate posteriors is an ef\ufb01ciently\ncomputable quantity. By an arbitrarily good approximation we mean that, with a large enough\ncomputational budget, the method can output an approximation that is as close to the exact posterior\nas we wish.\nMarkov chain Monte Carlo (MCMC) methods provide an approximate posterior, and the approxi-\nmation typically becomes arbitrarily good as the amount of computation time grows asymptotically;\nthereby MCMC satis\ufb01es criteria 1, 3, and 4. But scalability of MCMC can be an issue. Conversely,\nvariational Bayes (VB) and expectation propagation (EP) [27] have grown in popularity due to their\nscalability to large data and models\u2014though they typically lack guarantees on quality (criteria 3\nand 4). Subsampling methods have been proposed to speed up MCMC [1, 5, 6, 21, 25, 41] and\nVB [18]. Only a few of these algorithms preserve guarantees asymptotic in time (criterion 4), and\nthey often require restrictive assumptions. On the scalability front (criterion 2), many though not\nall subsampling MCMC methods have been found to require examining a constant fraction of the\ndata at each iteration [2, 6, 7, 30, 31, 38], so the computational gains are limited. 
Moreover, the\nrandom data access required by these methods may be infeasible for very large datasets that do not\n\ufb01t into memory. Finally, they do not apply to streaming and distributed data, and thus fail criterion\n2 above. More recently, authors have proposed subsampling methods based on piecewise determin-\nistic Markov processes (PDMPs) [8, 9, 29]. These methods are promising since subsampling data\nhere does not change the invariant distribution of the continuous-time Markov process. But these\nmethods have not yet been validated on large datasets nor is it understood how subsampling affects\nthe mixing rates of the Markov processes. Authors have also proposed methods for coalescing in-\nformation across distributed computation (criterion 2) in MCMC [12, 32, 34, 35], VB [10, 11], and\nEP [15, 17]\u2014and in the case of VB, across epochs as streaming data is collected [10, 11]. (See An-\ngelino et al. [3] for a broader discussion of issues surrounding scalable Bayesian inference.) While\nthese methods lead to gains in computational ef\ufb01ciency, they lack rigorous justi\ufb01cation and provide\nno guarantees on the quality of inference (criteria 3 and 4).\nTo address these dif\ufb01culties, we are inspired in part by the observation that not all Bayesian models\nrequire expensive posterior approximation. When the likelihood belongs to an exponential family,\nBayesian posterior computation is fast and easy. In particular, it suf\ufb01ces to \ufb01nd the suf\ufb01cient statis-\ntics of the data, which require computing a simple summary at each data point and adding these\nsummaries across data points. The latter addition requires a single pass through the data and is\ntrivially streaming or distributed. With the suf\ufb01cient statistics in hand, the posterior can then be\ncalculated via, e.g., MCMC, and point estimates such as the MLE can be computed\u2014all in time in-\ndependent of the data set size. 
Unfortunately, sufficient statistics are not generally available (except\nin very special cases) for GLMs. We propose to instead develop a notion of approximate sufficient\nstatistics. Previously authors have suggested using a coreset (a weighted data subset) as a summary\nof the data [4, 13, 14, 16, 19, 24]. While these methods provide theoretical guarantees on the\nquality of inference via the model evidence, the resulting guarantees are better suited to approximate\noptimization and do not translate to guarantees on typical Bayesian desiderata, such as the accuracy\nof posterior mean and uncertainty estimates. Moreover, while these methods do admit streaming\nand distributed constructions, the approximation error is compounded across computations.\nOur contributions. In the present work we instead propose to construct our approximate sufficient\nstatistics via a much simpler polynomial approximation for generalized linear models. We therefore\ncall our method polynomial approximate sufficient statistics for generalized linear models (PASS-GLM).\nPASS-GLM satisfies all of the criteria laid out above. It provides a posterior approximation\nwith theoretical guarantees (criteria 1 and 3). It is scalable since it requires only a single pass over\nthe data and can be applied to streaming and distributed data (criterion 2). And by increasing the\nnumber of approximate sufficient statistics, PASS-GLM can produce arbitrarily good approximations\nto the posterior (criterion 4).\nThe Laplace approximation [39] and variational methods with a Gaussian approximation family\n[20, 22] may be seen as polynomial (quadratic) approximations in the log-likelihood space. But we\nnote that the VB variants still suffer from the issues described above. 
A Laplace approximation relies on\na Taylor series expansion of the log-likelihood around the maximum a posteriori (MAP) solution,\nwhich requires \ufb01rst calculating the MAP\u2014an expensive multi-pass optimization in the large-scale\ndata setting. Neither Laplace nor VB offers the simplicity of suf\ufb01cient statistics, including in stream-\ning and distributed computations. The recent work of Stephanou et al. [36] is similar in spirit to ours,\nthough they address a different statistical problem: they construct sequential quantile estimates using\nHermite polynomials.\nIn the remainder of the paper, we begin by describing generalized linear models in more detail in\nSection 2. We construct our novel polynomial approximation and specify our PASS-GLM algorithm\nin Section 3. We will see that streaming and distributed computation are trivial for our algorithm\nand do not compound error. In Section 4.1, we demonstrate \ufb01nite-sample guarantees on the quality\nof the MAP estimate arising from our algorithm, with the maximum likelihood estimate (MLE)\nas a special case.\nIn Section 4.2, we prove guarantees on the Wasserstein distance between the\nexact and approximate posteriors\u2014and thereby bound both posterior-derived point estimates and\nuncertainty estimates.\nIn Section 5, we demonstrate the ef\ufb01cacy of our approach in practice by\nfocusing on logistic regression. We demonstrate experimentally that PASS-GLM can be scaled\nwith almost no loss of ef\ufb01ciency to multi-core architectures. We show on a number of real-world\ndatasets\u2014including a large, high-dimensional advertising dataset (40 million examples with 20,000\ndimensions)\u2014that PASS-GLM provides an attractive trade-off between computation and accuracy.\n\n2 Background\n\nGeneralized linear models. 
Generalized linear models (GLMs) combine the interpretability of\nlinear models with the flexibility of more general outcome distributions, including binary, ordinal,\nand heavy-tailed observations. Formally, we let Y \u2286 R be the observation space, X \u2286 R^d be the\ncovariate space, and \u0398 \u2286 R^d be the parameter space. Let D := {(x_n, y_n)}_{n=1}^N be the observed data.\nWe write X \u2208 R^{N\u00d7d} for the matrix of all covariates and y \u2208 R^N for the vector of all observations.\nWe consider GLMs\n\nlog p(y | X, \u03b8) = \u2211_{n=1}^N log p(y_n | g^{-1}(x_n \u00b7 \u03b8)) = \u2211_{n=1}^N \u03c6(y_n, x_n \u00b7 \u03b8),\n\nwhere \u00b5 := g^{-1}(x_n \u00b7 \u03b8) is the expected value of y_n and g^{-1} : R \u2192 R is the inverse link function.\nWe call \u03c6(y, s) := log p(y | g^{-1}(s)) the GLM mapping function.\nExamples include some of the most widely used models in the statistical toolbox. For instance,\nfor binary observations y \u2208 {\u00b11}, the likelihood model is Bernoulli, p(y = 1 | \u00b5) = \u00b5,\nand the link function is often either the logit g(\u00b5) = log(\u00b5/(1 \u2212 \u00b5)) (as in logistic regression) or the probit\ng(\u00b5) = \u03a6^{-1}(\u00b5), where \u03a6 is the standard Gaussian CDF. 
When modeling count data y \u2208 N, the\nlikelihood model might be Poisson, p(y | \u00b5) = \u00b5^y e^{\u2212\u00b5}/y!, and g(\u00b5) = log(\u00b5) is the typical log link.\nOther GLMs include gamma regression, robust regression, and binomial regression, all of which are\ncommonly used for large-scale data analysis (see Examples A.1 and A.3).\nIf we place a prior \u03c0_0(d\u03b8) on the parameters, then a full Bayesian analysis aims to approximate the\n(typically intractable) GLM posterior distribution \u03c0_D(d\u03b8), where\n\n\u03c0_D(d\u03b8) = p(y | X, \u03b8) \u03c0_0(d\u03b8) / \u222b p(y | X, \u03b8') \u03c0_0(d\u03b8').\n\nThe maximum a posteriori (MAP) solution gives a point estimate of the parameter:\n\n\u03b8_MAP := arg max_{\u03b8\u2208\u0398} \u03c0_D(\u03b8) = arg max_{\u03b8\u2208\u0398} log \u03c0_0(\u03b8) + L_D(\u03b8), (1)\n\nwhere L_D(\u03b8) := log p(y | X, \u03b8) is the data log-likelihood. The MAP problem strictly generalizes\nfinding the maximum likelihood estimate (MLE), since the MAP solution equals the MLE when\nusing the (possibly improper) prior \u03c0_0(\u03b8) = 1.\n\nAlgorithm 1 PASS-GLM inference\nRequire: data D, GLM mapping function \u03c6 : R \u2192 R, degree M, polynomial basis (\u03c8_m)_{m\u2208N} with base measure \u03c2\n1: Calculate basis coefficients b_m \u2190 \u222b \u03c6 \u03c8_m d\u03c2 using numerical integration for m = 0, ..., M\n2: Calculate polynomial coefficients b_m^{(M)} \u2190 \u2211_{k=m}^M \u03b1_{k,m} b_k for m = 0, ..., M\n3: for k \u2208 N^d with \u2211_j k_j \u2264 M do\n4:     Initialize t_k \u2190 0\n5: for n = 1, ..., N do    \u25b7 Can be done with any combination of batch, parallel, or streaming\n6:     for k \u2208 N^d with \u2211_j k_j \u2264 M do\n7:         Update t_k \u2190 t_k + (y_n x_n)^k\n8: Form approximate log-likelihood \u02dcL_D(\u03b8) = \u2211_{k\u2208N^d : \u2211_j k_j \u2264 M} (m choose k) b_m^{(M)} t_k \u03b8^k, where m := \u2211_j k_j\n9: Use \u02dcL_D(\u03b8) to construct approximate posterior \u02dc\u03c0_D(\u03b8)\n\nComputation and exponential families. In large part due to the high-dimensional integral implicit\nin the normalizing constant, approximating the posterior, e.g., via MCMC or VB, is often\nprohibitively expensive. Approximating this integral will typically require many evaluations of the\n(log-)likelihood, or its gradient, and each evaluation may require \u03a9(N) time.\nComputation is much more efficient, though, if the model is in an exponential family (EF). In the EF\ncase, there exist functions t, \u03b7 : R^d \u2192 R^m, such that (footnote 1)\n\nlog p(y_n | x_n, \u03b8) = t(y_n, x_n) \u00b7 \u03b7(\u03b8) =: L_{D,EF}(\u03b8; t(y_n, x_n)).\n\nThus, we can rewrite the log-likelihood as\n\nL_D(\u03b8) = \u2211_{n=1}^N L_{D,EF}(\u03b8; t(y_n, x_n)) =: L_{D,EF}(\u03b8; t(D)),\n\nwhere t(D) := \u2211_{n=1}^N t(y_n, x_n). The sufficient statistics t(D) can be calculated in O(N) time,\nafter which each evaluation of L_{D,EF}(\u03b8; t(D)) or \u2207L_{D,EF}(\u03b8; t(D)) requires only O(1) time. Thus,\ninstead of K passes over N data (requiring O(NK) time), only O(N + K) time is needed. Even\nfor moderate values of N, the time savings can be substantial when K is large.\nThe Poisson distribution is an illustrative example of a one-parameter exponential family\nwith t(y) = (1, y, log y!) and \u03b7(\u03b8) = (\u2212\u03b8, log \u03b8, \u22121), so that t(y) \u00b7 \u03b7(\u03b8) = y log \u03b8 \u2212 \u03b8 \u2212 log y!. Thus, if we have data y (there are no covariates),\nt(y) = (N, \u2211_n y_n, \u2211_n log y_n!). In this case it is easy to calculate the maximum likelihood\nestimate of \u03b8 from t(y) as t_1(y)/t_0(y) = N^{-1} \u2211_n y_n.\n\nFootnote 1: Our presentation is slightly different from the standard textbook account because we have implicitly\nabsorbed the base measure and log-partition function into t and \u03b7.\n\nUnfortunately, GLMs rarely belong to an exponential family: even if the outcome distribution is in an\nexponential family, the use of a link destroys the EF structure. In logistic regression, we write\n(overloading the \u03c6 notation) log p(y_n | x_n, \u03b8) = \u03c6_logit(y_n x_n \u00b7 \u03b8), where \u03c6_logit(s) := \u2212log(1 + e^{\u2212s}). For Poisson regression\nwith log link, log p(y_n | x_n, \u03b8) = \u03c6_Poisson(y_n, x_n \u00b7 \u03b8), where \u03c6_Poisson(y, s) := ys \u2212 e^s \u2212 log y!. In\nboth cases, we cannot express the log-likelihood as an inner product between a function solely of\nthe data and a function solely of the parameter.\n\n3 PASS-GLM\n\nSince exact sufficient statistics are not available for GLMs, we propose to construct approximate\nsufficient statistics. In particular, we propose to approximate the mapping function \u03c6 with\nan order-M polynomial \u03c6_M. We therefore call our method polynomial approximate sufficient\nstatistics for GLMs (PASS-GLM). We illustrate our method next in the logistic regression case,\nwhere log p(y_n | x_n, \u03b8) = \u03c6_logit(y_n x_n \u00b7 \u03b8). The fully general treatment appears in Appendix A.\nLet b_0^{(M)}, b_1^{(M)}, ..., b_M^{(M)} be constants such that\n\n\u03c6_logit(s) \u2248 \u03c6_M(s) := \u2211_{m=0}^M b_m^{(M)} s^m.\n\nLet v^k := \u220f_{j=1}^d v_j^{k_j} for vectors v, k \u2208 R^d. Taking s = yx \u00b7 \u03b8, we obtain\n\n\u03c6_logit(yx \u00b7 \u03b8) \u2248 \u03c6_M(yx \u00b7 \u03b8) = \u2211_{m=0}^M b_m^{(M)} (yx \u00b7 \u03b8)^m\n    = \u2211_{m=0}^M b_m^{(M)} \u2211_{k\u2208N^d : \u2211_j k_j = m} (m choose k) (yx)^k \u03b8^k\n    = \u2211_{m=0}^M \u2211_{k\u2208N^d : \u2211_j k_j = m} a(k, m, M) (yx)^k \u03b8^k,\n\nwhere (m choose k) is the multinomial coefficient and a(k, m, M) := (m choose k) b_m^{(M)}. Thus, \u03c6_M is an M-degree\npolynomial approximation to \u03c6_logit(yx \u00b7 \u03b8) with the (d+M choose d) monomials of degree at most M serving\nas sufficient statistics derived from yx. Specifically, we have an exponential family model with\n\nt(yx) = ([yx]^k)_k and \u03b7(\u03b8) = (a(k, m, M) \u03b8^k)_k,\n\nwhere k is taken over all k \u2208 N^d such that \u2211_j k_j \u2264 M. We next discuss the calculation of the b_m^{(M)}\nand the choice of M.\nChoosing the polynomial approximation. To calculate the coefficients b_m^{(M)}, we choose a polynomial\nbasis (\u03c8_m)_{m\u2208N} orthogonal with respect to a base measure \u03c2, where \u03c8_m is degree m [37]. That\nis, \u03c8_m(s) = \u2211_{j=0}^m \u03b1_{m,j} s^j for some \u03b1_{m,j}, and \u222b \u03c8_m \u03c8_{m'} d\u03c2 = \u03b4_{mm'}, where \u03b4_{mm'} = 1 if m = m'\nand zero otherwise. If b_m := \u222b \u03c6 \u03c8_m d\u03c2, then \u03c6(s) = \u2211_{m=0}^\u221e b_m \u03c8_m(s) and the approximation\nis \u03c6_M(s) = \u2211_{m=0}^M b_m \u03c8_m(s). Collecting the coefficient of s^m, we conclude that b_m^{(M)} = \u2211_{k=m}^M \u03b1_{k,m} b_k. The complete PASS-GLM\nframework appears in Algorithm 1.\nChoices for the orthogonal polynomial basis include Chebyshev, Hermite, Laguerre, and Legendre\npolynomials [37]. We choose Chebyshev polynomials since they provide a uniform quality\nguarantee on a finite interval, e.g., [\u2212R, R] for some R > 0 in what follows. If \u03c6 is smooth, the\nchoice of Chebyshev polynomials (scaled appropriately, along with the base measure \u03c2, based on\nthe choice of R) yields error exponentially small in M: sup_{s\u2208[\u2212R,R]} |\u03c6(s) \u2212 \u03c6_M(s)| \u2264 C\u03c1^M for\nsome 0 < \u03c1 < 1 and C > 0 [26]. We show in Appendix B that the error in the approximate derivative\n\u03c6'_M is also exponentially small in M: sup_{s\u2208[\u2212R,R]} |\u03c6'(s) \u2212 \u03c6'_M(s)| \u2264 C'\u03c1^M, where C' > C.\nChoosing the polynomial degree. For fixed d, the number of monomials is O(M^d) while for fixed\nM the number of monomials is O(d^M). The number of approximate sufficient statistics can remain\nmanageable when either M or d is small but becomes unwieldy if M and d are both large. Since\nour experiments (Section 5) generally have large d, we focus on the small M case here.\nIn our experiments we further focus on the choice of logistic regression as a particularly popular\nGLM example with log p(y_n | x_n, \u03b8) = \u03c6_logit(y_n x_n \u00b7 \u03b8), where \u03c6_logit(s) := \u2212log(1 + e^{\u2212s}).\nIn general, the smallest and therefore most compelling choice of M a priori is 2, and we demonstrate the\nreasonableness of this choice empirically in Section 5 for a number of large-scale data analyses. In\naddition, in the logistic regression case, M = 6 is the next usable choice beyond M = 2. This is because\nb_{2k+1}^{(M)} = 0 for all integers k \u2265 1 with 2k + 1 \u2264 M. So any approximation beyond M = 2 must\nhave M \u2265 4. Also, b_{4k}^{(M)} > 0 for all integers k \u2265 1 with 4k \u2264 M. So choosing M = 4k, k \u2265 1,\nleads to a pathological approximation of \u03c6_logit where the log-likelihood can be made arbitrarily\nlarge by taking \u2016\u03b8\u2016_2 \u2192 \u221e. Thus, a reasonable polynomial approximation for logistic regression\nrequires M = 2 + 4k, k \u2265 0. 
We have discussed the relative drawbacks of other popular quadratic\napproximations, including the Laplace approximation and variational methods, in Section 1.\n\n4 Theoretical Results\n\nWe next establish quality guarantees for PASS-GLM. We first provide finite-sample and asymptotic\nguarantees on the MAP (point estimate) solution, and therefore on the MLE, in Section 4.1. We then\nprovide guarantees on the Wasserstein distance between the approximate and exact posteriors, and\nshow these bounds translate into bounds on the quality of posterior mean and uncertainty estimates,\nin Section 4.2. See Appendix C for extended results, further discussion, and all proofs.\n\n4.1 MAP approximation\n\nIn Appendix C, we state and prove Theorem C.1, which provides guarantees on the quality of the\nMAP estimate for an arbitrary approximation \u02dcL_D(\u03b8) to the log-likelihood L_D(\u03b8). The approximate\nMAP (i.e., the MAP under \u02dcL_D) is (cf. Eq. (1))\n\n\u02dc\u03b8_MAP := arg max_{\u03b8\u2208\u0398} log \u03c0_0(\u03b8) + \u02dcL_D(\u03b8).\n\nRoughly, we find in Theorem C.1 that the error in the MAP estimate naturally depends on the error\nof the approximate log-likelihood as well as the peakedness of the posterior near the MAP. In the\nlatter case, if log \u03c0_D is very flat, then even a small error from using \u02dcL_D in place of L_D could lead\nto a large error in the approximate MAP solution. We measure the peakedness of the distribution in\nterms of the strong convexity constant (footnote 2) of \u2212log \u03c0_D near \u03b8_MAP.\nWe apply Theorem C.1 to PASS-GLM for logistic regression and robust regression. We require the\nassumption that\n\n\u03c6_M(t) \u2264 \u03c6(t) for all t \u2209 [\u2212R, R], (2)\n\nwhich, in the cases of logistic regression and smoothed Huber regression, we conjecture holds\nfor M = 2 + 4k, k \u2208 N. For a matrix A, \u2016A\u2016_2 denotes its spectral norm.\nCorollary 4.1. 
For the logistic regression model, assume that \u2016(\u2207^2 L_D(\u03b8_MAP))^{-1}\u2016_2 \u2264 cd/N for\nsome constant c > 0 and that \u2016x_n\u2016_2 \u2264 1 for all n = 1, ..., N. Let \u03c6_M be the order-M Chebyshev\napproximation to \u03c6_logit on [\u2212R, R] such that Eq. (2) holds. Let \u02dc\u03c0_D(\u03b8) denote the posterior approximation\nobtained by using \u03c6_M with a log-concave prior. Then there exist numbers r = r(R) > 1,\n\u03b5 = \u03b5(M) = O(r^{\u2212M}), and \u03b1* \u2265 27/(\u03b5 d^3 c^3 + 54), such that if R \u2212 \u2016\u03b8_MAP\u2016_2 \u2265 2 \u221a(cd\u03b5/\u03b1*), then\n\n\u2016\u03b8_MAP \u2212 \u02dc\u03b8_MAP\u2016_2^2 \u2264 4cd\u03b5/\u03b1* \u2264 (4/27) c^4 d^4 \u03b5^2 + 8cd\u03b5.\n\nThe main takeaways from Corollary 4.1 are that (1) the error decreases exponentially in M thanks\nto the \u03b5 term, (2) the error does not depend on the amount of data, and (3) in order for the bound\non the approximate MAP solution to hold, the norm of the true MAP solution must be sufficiently\nsmaller than R.\nRemark 4.2. Some intuition for the assumption on the Hessian of L_D, i.e.,\n\u2207^2 L_D(\u03b8) = \u2211_{n=1}^N \u03c6''_logit(y_n x_n \u00b7 \u03b8) x_n x_n^\u22a4, is as follows. Typically for \u03b8 near \u03b8_MAP, the minimum eigenvalue\nof \u2207^2 L_D(\u03b8) is at least N/(cd) for some c > 0. The minimum eigenvalue condition in Corollary 4.1\nholds if, for example, a constant fraction of the data satisfies 0 < b \u2264 \u2016x_n\u2016_2 \u2264 B < \u221e and that\nsubset of the data does not lie too close to any (d \u2212 1)-dimensional hyperplane. This condition\nessentially requires the data not to be degenerate and is similar to ones used to show asymptotic\nconsistency of logistic regression [40, Ex. 5.40].\n\nThe approximate MAP error bound in the robust regression case using, for example, the smoothed\nHuber loss (Example A.1), is quite similar to the logistic regression result.\nCorollary 4.3. For robust regression with smoothed Huber loss, assume that a constant fraction of\nthe data satisfies |x_n \u00b7 \u03b8_MAP \u2212 y_n| \u2264 b/2 and that \u2016x_n\u2016_2 \u2264 1 for all n = 1, ..., N. Let \u03c6_M be the\norder-M Chebyshev approximation to \u03c6_Huber on [\u2212R, R] such that Eq. (2) holds. Let \u02dc\u03c0_D(\u03b8) denote\nthe posterior approximation obtained by using \u03c6_M with a log-concave prior. Then if R \u226b \u2016\u03b8_MAP\u2016_2,\nthere exists r > 1 such that for M sufficiently large, \u2016\u03b8_MAP \u2212 \u02dc\u03b8_MAP\u2016_2^2 = O(d r^{\u2212M}).\n\n4.2 Posterior approximation\n\nWe next establish guarantees on how close the approximate and exact posteriors are in Wasserstein\ndistance, d_W. For distributions P and Q on R^d, d_W(P, Q) := sup_{f : \u2016f\u2016_L \u2264 1} |\u222b f dP \u2212 \u222b f dQ|,\nwhere \u2016f\u2016_L denotes the Lipschitz constant of f (footnote 3). This choice of distance is particularly useful\nsince, if d_W(\u03c0_D, \u02dc\u03c0_D) \u2264 \u03b4, then \u02dc\u03c0_D can be used to estimate the expectation of any function f with bounded gradient\nwith error at most \u03b4 sup_w \u2016\u2207f(w)\u2016_2. 
Wasserstein error bounds therefore give bounds on the mean\nestimates (corresponding to f(\u03b8) = \u03b8_i) as well as uncertainty estimates such as mean absolute deviation\n(corresponding to f(\u03b8) = |\u00af\u03b8_i \u2212 \u03b8_i|, where \u00af\u03b8_i is the expected value of \u03b8_i).\n\nFootnote 2: Recall that a twice-differentiable function f : R^d \u2192 R is \u03f5-strongly convex at \u03b8 if the minimum eigenvalue\nof the Hessian of f evaluated at \u03b8 is at least \u03f5 > 0.\nFootnote 3: The Lipschitz constant of a function f : R^d \u2192 R is \u2016f\u2016_L := sup_{v,w\u2208R^d} |f(v) \u2212 f(w)| / \u2016v \u2212 w\u2016_2.\n\nFigure 1: Validating the use of PASS-GLM with M = 2. (a) The second-order Chebyshev approximation\nto \u03c6 = \u03c6_logit on [\u22124, 4] is very accurate, with error of at most 0.069. (b) For a variety of\ndatasets, the inner products \u27e8y_n x_n, \u03b8_MAP\u27e9 are mostly in the range of [\u22124, 4].\n\nOur general result (Theorem C.3) is stated and proved in Appendix C. Similar to Theorem C.1,\nthe result primarily depends on the peakedness of the approximate posterior and the error of the\napproximate gradients. If the gradients are poorly approximated then the error can be large while\nif the (approximate) posterior is flat then even small gradient errors could lead to large shifts in\nexpected values of the parameters and hence large Wasserstein error.\nWe apply Theorem C.3 to PASS-GLM for logistic regression and Poisson regression. We give\nsimplified versions of these corollaries in the main text and defer the more detailed versions to\nAppendix C. For logistic regression we assume M = 2 and \u0398 = R^d since this is the setting we\nuse for our experiments. The result is similar in spirit to Corollary 4.1, though more straightforward\nsince M = 2. 
Critically, we see in this result how having small error depends on |y_n x_n \u00b7 \u00af\u03b8| \u2264 R\nwith high probability. Otherwise the second term in the bound will be large.\nCorollary 4.4. Let \u03c6_2 be the second-order Chebyshev approximation to \u03c6_logit on [\u2212R, R] and\nlet \u02dc\u03c0_D(\u03b8) = N(\u03b8 | \u02dc\u03b8_MAP, \u02dc\u03a3) denote the posterior approximation obtained by using \u03c6_2 with a Gaussian\nprior \u03c0_0(\u03b8) = N(\u03b8 | \u03b8_0, \u03a3_0). Let \u00af\u03b8 := \u222b \u03b8 \u03c0_D(d\u03b8), let \u03b4_1 := N^{-1} \u2211_{n=1}^N \u27e8y_n x_n, \u00af\u03b8\u27e9, and let \u03c3_1\nbe the subgaussianity constant of the random variable \u27e8y_n x_n, \u00af\u03b8\u27e9 \u2212 \u03b4_1, where n \u223c Unif{1, ..., N}.\nAssume that |\u03b4_1| \u2264 R, that \u2016\u02dc\u03a3\u2016_2 \u2264 cd/N, and that \u2016x_n\u2016_2 \u2264 1 for all n = 1, ..., N. Then\nwith \u03c3_0^2 := \u2016\u03a3_0\u2016_2, we have\n\nd_W(\u03c0_D, \u02dc\u03c0_D) = O( dR^4 + d\u03c3_0 exp( \u03c3_1^2 \u03c3_0^{-2} \u2212 \u221a2 \u03c3_0^{-1} (R \u2212 |\u03b4_1|) ) ).\n\nThe main takeaway from Corollary 4.4 is that if (a) for most n, |\u27e8x_n, \u00af\u03b8\u27e9| < R, so that \u03c6_2 is a good\napproximation to \u03c6_logit, and (b) the approximate posterior concentrates quickly, then we get a high-quality\napproximate posterior. This result matches up with the experimental results (see Section 5\nfor further discussion).\nFor Poisson regression, we return to the case of general M. Recall that in the Poisson regression\nmodel the expectation of y_n is \u00b5 = e^{x_n \u00b7 \u03b8}. If y_n is bounded and has non-trivial probability of\nbeing greater than zero, we lose little by restricting x_n \u00b7 \u03b8 to be bounded. Thus, we will assume that\nthe parameter space is bounded. 
As in Corollaries 4.1 and 4.3, the error is exponentially small in M\nand, as long as \u2016\u2211_{n=1}^{N} xn xn^\u22a4\u20162 grows linearly in N, does not depend on the amount of data.\n\nCorollary 4.5. Let fM(s) be the order-M Chebyshev approximation to e^t on the interval\n[\u2212R, R], and let \u02dc\u03c0D(\u03b8) denote the posterior approximation obtained by using the approximation\nlog \u02dcp(yn | xn, \u03b8) := ynxn \u00b7 \u03b8 \u2212 fM(xn \u00b7 \u03b8) \u2212 log yn! with a log-concave prior on \u0398 = BR(0). If\ninf_{s \u2208 [\u2212R,R]} fM\u2033(s) \u2265 \u02dc\u03b5 > 0, \u2016\u2211_{n=1}^{N} xn xn^\u22a4\u20162 = \u03a9(N/d), and \u2016xn\u20162 \u2264 1 for all n = 1, . . . , N,\nthen\n\ndW(\u03c0D, \u02dc\u03c0D) = O(d \u02dc\u03b5^{\u22121} M^2 e^R 2^{\u2212M}).\n\nFigure 1 panels: (a) \u03c6(t) and \u03c62(t) on [\u22124, 4]; (b) histograms of \u27e8ynxn, \u03b8MAP\u27e9 for ChemReact, CovType, Webspam, and CodRNA.\n\nFigure 2: Batch inference results for (a) WEBSPAM, (b) COVTYPE, (c) CHEMREACT, and (d) CODRNA. In all metrics smaller is better.\n\nNote that although \u02dc\u03b5^{\u22121} does depend on R and M, as M becomes large it converges to e^R. Observe\nthat if we truncate a prior on R^d to be on BR(0), by making R and M sufficiently large, the Wasserstein\ndistance between \u03c0D and the PASS-GLM posterior approximation \u02dc\u03c0D can be made arbitrarily\nsmall. Similar results could be shown for other GLM likelihoods.\n\n5 Experiments\n\nIn our experiments, we focus on logistic regression, a particularly popular GLM example.4 As\ndiscussed in Section 3, we choose M = 2 and call our algorithm PASS-LR2. Empirically, we observe\nthat M = 2 offers a high-quality approximation of \u03c6 on the interval [\u22124, 4] (Fig. 1a). In fact,\nsup_{s \u2208 [\u22124,4]} |\u03c62(s) \u2212 \u03c6(s)| < 0.069. 
Moreover, we observe that for many datasets, the inner\nproducts ynxn \u00b7 \u03b8MAP tend to be concentrated within [\u22124, 4], and therefore a high-quality approximation\non this range is sufficient for our analysis. In particular, Fig. 1b shows histograms of\nynxn \u00b7 \u03b8MAP for four datasets from our experiments. In all but one case, over 98% of the data points\nsatisfy |ynxn \u00b7 \u03b8MAP| \u2264 4. In the remaining dataset (CODRNA), only \u223c80% of the data satisfy this\ncondition, and this is the dataset for which PASS-LR2 performed most poorly (cf. Corollary 4.4).\n\n5.1 Large dataset experiments\n\nIn order to compare PASS-LR2 to other approximate Bayesian methods, we first restrict our attention\nto datasets with fewer than 1 million data points. We compare to the Laplace approximation and the\nadaptive Metropolis-adjusted Langevin algorithm (MALA). We also compare to stochastic gradient\ndescent (SGD), although SGD provides only a point estimate and no approximate posterior. In all\nexperiments, no method performs as well as PASS-LR2 given the same (or less) running time.\n\nDatasets. The CHEMREACT dataset consists of N = 26,733 chemicals, each with d = 100 properties.\nThe goal is to predict whether each chemical is reactive. The WEBSPAM corpus consists\nof N = 350,000 web pages and the covariates consist of the d = 127 features that each appear in\nat least 25 documents. The cover type (COVTYPE) dataset consists of N = 581,012 cartographic\nobservations with d = 54 features. The task is to predict the type of trees that are present at each\nobservation location. The CODRNA dataset consists of N = 488,565 sequences, each with d = 8 RNA-related features.\nThe task is to predict whether the sequences are non-coding RNA.\n\nFig. 2 shows average errors of the posterior mean and variance estimates as well as negative test\nlog-likelihood for each method versus the time required to run the method. SGD was run for between\n1 and 20 epochs. 
The true posterior was estimated by running three chains of adaptive MALA for\n50,000 iterations, which produced Gelman-Rubin statistics well below 1.1 for all datasets.\n\n4 Code is available at https://bitbucket.org/jhhuggins/pass-glm.\n\nFigure 2 panels: negative test log-likelihood, average mean error, and average variance error versus time (sec) for PASS-LR2, Laplace, SGD, MALA, and the true posterior.\n\nFigure 3: (a) ROC curves for streaming inference on 40 million CRITEO data points. SGD and\nPASS-LR2 had negative test log-likelihoods of, respectively, 0.07 and 0.045. (b) Cores vs. speedup\n(compared to one core) for parallelization experiment on 6 million examples from the CRITEO\ndataset.\n\nSpeed. For all four datasets, PASS-LR2 was an order of magnitude faster than SGD and 2\u20133 orders\nof magnitude faster than the Laplace approximation.\n\nMean and variance estimates. For CHEMREACT, WEBSPAM, and COVTYPE, PASS-LR2 was superior to or competitive with SGD, with\nMALA taking 10\u2013100x longer to produce comparable results. Laplace again outperformed all other\nmethods. Critically, on all datasets the PASS-LR2 variance estimates were competitive with Laplace\nand MALA.\n\nTest log-likelihood. For CHEMREACT and WEBSPAM, PASS-LR2 produced results\ncompetitive with all other methods. MALA took 10\u2013100x longer to produce comparable results.\nFor COVTYPE, PASS-LR2 was competitive with SGD but took a tenth of the time, and MALA took\n1000x longer for comparable results. 
Laplace outperformed all other methods, but was orders of\nmagnitude slower than PASS-LR2. CODRNA was the only dataset where PASS-LR2 performed\npoorly. However, this performance was expected based on the ynxn \u00b7 \u03b8MAP histogram (Fig. 1b).\n\n5.2 Very large dataset experiments using streaming and distributed PASS-GLM\n\nWe next test PASS-LR2, which supports streaming without requiring any modifications, on a subset of 40\nmillion data points from the Criteo terabyte ad click prediction dataset (CRITEO). The covariates are\n13 integer-valued features and 26 categorical features. After one-hot encoding, on the subset of the\ndata we considered, d \u2248 3 million. For tractability we used sparse random projections [23] to reduce\nthe dimensionality to 20,000. At this scale, comparing to the other fully Bayesian methods from\nSection 5.1 was infeasible; we compare only to the predictions and point estimates from SGD.\nPASS-LR2 performs slightly worse than SGD in AUC (Fig. 3a), but outperforms SGD in negative test\nlog-likelihood (0.07 for SGD, 0.045 for PASS-LR2). Since PASS-LR2 estimates a full covariance, it\nwas about 10x slower than SGD. A promising approach to speeding up and reducing memory usage\nof PASS-LR2 would be to use a low-rank approximation to the second-order moments.\n\nTo validate the efficiency of distributed computation with PASS-LR2, we compared running times\non 6M examples with dimensionality reduced to 1,000 when using 1\u201322 cores. As shown in Fig. 3b,\nthe speed-up is close to optimal: K cores produce a speedup of about K/2 (baseline 3 minutes\nusing 1 core). We used Ray to implement the distributed version of PASS-LR2 [28].5\n\n6 Discussion\n\nWe have presented PASS-GLM, a novel framework for scalable parameter estimation and Bayesian\ninference in generalized linear models. Our theoretical results provide guarantees on the quality of\npoint estimates as well as approximate posteriors derived from PASS-GLM. 
We validated our approach\nempirically with logistic regression and a quadratic approximation. We showed competitive\nperformance on a variety of real-world data, scaling to 40 million examples with 20,000 covariates,\nand trivial distributed computation with no compounding of approximation error.\n\nThere are a number of important directions for future work. The first is to use randomization methods\nalong the lines of random projections and random feature mappings [23, 33] to scale to larger M\nand d. We conjecture that the use of randomization will allow experimentation with other GLMs for\nwhich quadratic approximations are insufficient.\n\n5 https://github.com/ray-project/ray\n\nFigure 3a legend: PASS-LR2 (area = 0.696), SGD (area = 0.725).\n\nAcknowledgments\n\nJHH and TB are supported in part by ONR grant N00014-17-1-2072, ONR MURI grant N00014-11-1-0688,\nand a Google Faculty Research Award. RPA is supported by NSF IIS-1421780 and the Alfred P. Sloan Foundation.\n\nReferences\n\n[1] S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In International Conference on Machine Learning, 2012.\n\n[2] P. Alquier, N. Friel, R. Everitt, and A. Boland. Noisy Monte Carlo: convergence of Markov chains with approximate transition kernels. Statistics and Computing, 26:29\u201347, 2016.\n\n[3] E. Angelino, M. J. Johnson, and R. P. Adams. Patterns of scalable Bayesian inference. Foundations and Trends\u00ae in Machine Learning, 9(2-3):119\u2013247, 2016.\n\n[4] O. Bachem, M. Lucic, and A. Krause. Practical coreset constructions for machine learning. arXiv.org, Mar. 2017.\n\n[5] R. Bardenet, A. Doucet, and C. C. Holmes. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In International Conference on Machine Learning, pages 405\u2013413, 2014.\n\n[6] R. Bardenet, A. Doucet, and C. C. Holmes. 
On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18:1\u201343, 2017.\n\n[7] M. J. Betancourt. The fundamental incompatibility of Hamiltonian Monte Carlo and data subsampling. In International Conference on Machine Learning, 2015.\n\n[8] J. Bierkens, P. Fearnhead, and G. O. Roberts. The zig-zag process and super-efficient sampling for Bayesian analysis of big data. arXiv.org, July 2016.\n\n[9] A. Bouchard-C\u00f4t\u00e9, S. J. Vollmer, and A. Doucet. The bouncy particle sampler: A non-reversible rejection-free Markov chain Monte Carlo method. arXiv.org, pages 1\u201337, Jan. 2016.\n\n[10] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, Dec. 2013.\n\n[11] T. Campbell, J. Straub, J. W. Fisher, III, and J. P. How. Streaming, distributed variational inference for Bayesian nonparametrics. In Advances in Neural Information Processing Systems, 2015.\n\n[12] R. Entezari, R. V. Craiu, and J. S. Rosenthal. Likelihood inflating sampling algorithm. arXiv.org, May 2016.\n\n[13] D. Feldman, M. Faulkner, and A. Krause. Scalable training of mixture models via coresets. In Advances in Neural Information Processing Systems, pages 2142\u20132150, 2011.\n\n[14] W. Fithian and T. Hastie. Local case-control sampling: Efficient subsampling in imbalanced data sets. The Annals of Statistics, 42(5):1693\u20131724, Oct. 2014.\n\n[15] A. Gelman, A. Vehtari, P. Jyl\u00e4nki, T. Sivula, D. Tran, S. Sahai, P. Blomstedt, J. P. Cunningham, D. Schiminovich, and C. Robert. Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data. arXiv.org, Dec. 2014.\n\n[16] L. Han, T. Yang, and T. Zhang. Local uncertainty sampling for large-scale multi-class logistic regression. arXiv.org, Apr. 2016.\n\n[17] L. Hasenclever, S. Webb, T. Lienart, S. Vollmer, B. Lakshminarayanan, C. 
Blundell, and Y. W. Teh. Distributed Bayesian learning with stochastic natural-gradient expectation propagation and the posterior server. Journal of Machine Learning Research, 18:1\u201337, 2017.\n\n[18] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303\u20131347, 2013.\n\n[19] J. H. Huggins, T. Campbell, and T. Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, May 2016.\n\n[20] T. Jaakkola and M. I. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, volume 82, 1997.\n\n[21] A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In International Conference on Machine Learning, 2014.\n\n[22] A. Kucukelbir, R. Ranganath, A. Gelman, and D. M. Blei. Automatic variational inference in Stan. In Advances in Neural Information Processing Systems, June 2015.\n\n[23] P. Li, T. J. Hastie, and K. W. Church. Very sparse random projections. In SIGKDD Conference on Knowledge Discovery and Data Mining, 2006.\n\n[24] M. Lucic, M. Faulkner, A. Krause, and D. Feldman. Training mixture models at scale via coresets. arXiv.org, Mar. 2017.\n\n[25] D. Maclaurin and R. P. Adams. Firefly Monte Carlo: Exact MCMC with subsets of data. In Uncertainty in Artificial Intelligence, Mar. 2014.\n\n[26] J. C. Mason and D. C. Handscomb. Chebyshev Polynomials. Chapman and Hall/CRC, New York, 2003.\n\n[27] T. P. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc, Aug. 2001.\n\n[28] R. Nishihara, P. Moritz, S. Wang, A. Tumanov, W. Paul, J. Schleier-Smith, R. Liaw, M. Niknami, M. I. Jordan, and I. Stoica. 
Real-time machine learning: The missing pieces. In Workshop on Hot Topics in Operating Systems, 2017.\n\n[29] A. Pakman, D. Gilboa, D. Carlson, and L. Paninski. Stochastic bouncy particle sampler. In International Conference on Machine Learning, Sept. 2017.\n\n[30] N. S. Pillai and A. Smith. Ergodicity of approximate MCMC chains with applications to large data sets. arXiv.org, May 2014.\n\n[31] M. Pollock, P. Fearnhead, A. M. Johansen, and G. O. Roberts. The scalable Langevin exact algorithm: Bayesian inference for big data. arXiv.org, Sept. 2016.\n\n[32] M. Rabinovich, E. Angelino, and M. I. Jordan. Variational consensus Monte Carlo. arXiv.org, June 2015.\n\n[33] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313\u20131320, 2009.\n\n[34] S. L. Scott, A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George, and R. E. McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. In Bayes 250, 2013.\n\n[35] S. Srivastava, V. Cevher, Q. Tran-Dinh, and D. Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In International Conference on Artificial Intelligence and Statistics, 2015.\n\n[36] M. Stephanou, M. Varughese, and I. Macdonald. Sequential quantiles via Hermite series density estimation. Electronic Journal of Statistics, 11(1):570\u2013607, 2017.\n\n[37] G. Szeg\u00f6. Orthogonal Polynomials. American Mathematical Society, 4th edition, 1975.\n\n[38] Y. W. Teh, A. H. Thiery, and S. Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(7):1\u201333, Mar. 2016.\n\n[39] L. Tierney and J. B. Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82\u201386, 1986.\n\n[40] A. W. van der Vaart. Asymptotic Statistics. 
Cambridge University Press, 1998.\n\n[41] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, 2011.\n", "award": [], "sourceid": 2024, "authors": [{"given_name": "Jonathan", "family_name": "Huggins", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Ryan", "family_name": "Adams", "institution": null}, {"given_name": "Tamara", "family_name": "Broderick", "institution": "MIT"}]}