{"title": "Stochastic Expectation Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 2323, "page_last": 2331, "abstract": "Expectation propagation (EP) is a deterministic approximation algorithm that is often used to perform approximate Bayesian parameter learning. EP approximates the full intractable posterior distribution through a set of local-approximations that are iteratively refined for each datapoint. EP can offer analytic and computational advantages over other approximations, such as Variational Inference (VI), and is the method of choice for a number of models. The local nature of EP appears to make it an ideal candidate for performing Bayesian learning on large models in large-scale datasets settings. However, EP has a crucial limitation in this context: the number approximating factors needs to increase with the number of data-points, N, which often entails a prohibitively large memory overhead. This paper presents an extension to EP, called stochastic expectation propagation (SEP), that maintains a global posterior approximation (like VI) but updates it in a local way (like EP).  Experiments on a number of canonical learning problems using synthetic and real-world datasets indicate that SEP performs almost as well as full EP, but reduces the memory consumption by a factor of N. SEP is therefore ideally suited to performing approximate Bayesian learning in the large model, large dataset setting.", "full_text": "Stochastic Expectation Propagation\n\nYingzhen Li\n\nUniversity of Cambridge\nCambridge, CB2 1PZ, UK\n\nyl494@cam.ac.uk\n\nJos\u00b4e Miguel Hern\u00b4andez-Lobato\n\nHarvard University\n\nCambridge, MA 02138 USA\njmh@seas.harvard.edu\n\nRichard E. Turner\n\nUniversity of Cambridge\nCambridge, CB2 1PZ, UK\n\nret26@cam.ac.uk\n\nAbstract\n\nExpectation propagation (EP) is a deterministic approximation algorithm that is\noften used to perform approximate Bayesian parameter learning. 
EP approximates the full intractable posterior distribution through a set of local approximations that are iteratively refined for each datapoint. EP can offer analytic and computational advantages over other approximations, such as Variational Inference (VI), and is the method of choice for a number of models. The local nature of EP appears to make it an ideal candidate for performing Bayesian learning on large models in large-scale dataset settings. However, EP has a crucial limitation in this context: the number of approximating factors needs to increase with the number of datapoints, N, which often entails a prohibitively large memory overhead. This paper presents an extension to EP, called stochastic expectation propagation (SEP), that maintains a global posterior approximation (like VI) but updates it in a local way (like EP). Experiments on a number of canonical learning problems using synthetic and real-world datasets indicate that SEP performs almost as well as full EP, but reduces the memory consumption by a factor of N. SEP is therefore ideally suited to performing approximate Bayesian learning in the large model, large dataset setting.\n\n1 Introduction\n\nRecently a number of methods have been developed for applying Bayesian learning to large datasets. Examples include sampling approximations [1, 2], distributional approximations including stochastic variational inference (SVI) [3] and assumed density filtering (ADF) [4], and approaches that mix distributional and sampling approximations [5, 6]. One family of approximation methods has garnered less attention in this regard: Expectation Propagation (EP) [7, 8]. EP constructs a posterior approximation by iterating simple local computations that refine factors which approximate the posterior contribution from each datapoint. 
At first sight, it therefore appears well suited to large-data problems: the locality of computation makes the algorithm simple to parallelise and distribute, and good practical performance on a range of small data applications suggests that it will be accurate [9, 10, 11]. However, the elegance of local computation has been bought at the price of a prohibitive memory overhead that grows with the number of datapoints N, since local approximating factors need to be maintained for every datapoint, and these typically incur the same memory overhead as the global approximation. The same pathology exists for the broader class of power EP (PEP) algorithms [12] that includes variational message passing [13]. In contrast, variational inference (VI) methods [14, 15] utilise global approximations that are refined directly, which prevents memory overheads from scaling with N.\n\nIs there ever a case for preferring EP (or PEP) to VI methods for large data? We believe that there certainly is. First, EP can provide significantly more accurate approximations. It is well known that variational free-energy approaches are biased, often severely so [16], and for particular models, such as those with non-smooth likelihood functions [11, 17], the variational free-energy objective is pathologically ill-suited. Second, the fact that EP is truly local (to factors in the posterior distribution and not just likelihoods) means that it affords different opportunities for tractable algorithm design, as the updates can be simpler to approximate.\n\nAs EP appears to be the method of choice for some applications, researchers have attempted to push it to scale. One approach is to swallow the large computational burden and simply use large data structures to store the approximating factors (e.g. TrueSkill [18]). This approach can only be pushed so far. 
A second approach is to use ADF, a simple variant of EP that only requires a global approximation to be maintained in memory [19]. ADF, however, provides poorly calibrated uncertainty estimates [7], which was one of the main motivations for developing EP in the first place. A third idea, complementary to the one described here, is to use approximating factors that have simpler structure (e.g. low rank, [20]). This reduces memory consumption (e.g. for Gaussian factors from O(ND^2) to O(ND)), but does not stop the scaling with N. Another idea uses EP to carve up the dataset [5, 6], using approximating factors for collections of datapoints. This results in coarse-grained, rather than local, updates and other methods must be used to compute them. (Indeed, the spirit of [5, 6] is to extend sampling methods to large datasets, not EP itself.)\n\nCan we have the best of both worlds? That is, accurate global approximations that are derived from truly local computation. To address this question we develop an algorithm based upon the standard EP and ADF algorithms that maintains a global approximation which is updated in a local way. We call this class of algorithms Stochastic Expectation Propagation (SEP) since it updates the global approximation with (damped) stochastic estimates on data sub-samples, in an analogous way to SVI. Indeed, the generalisation of the algorithm to the PEP setting directly relates to SVI. Importantly, SEP reduces the memory footprint by a factor of N when compared to EP. We further extend the method to control the granularity of the approximation, and to treat models with latent variables without compromising accuracy or incurring unnecessary memory demands. 
Finally, we demonstrate the scalability and accuracy of the method on a number of real-world and synthetic datasets.\n\n2 Expectation Propagation and Assumed Density Filtering\n\nWe begin by briefly reviewing the EP and ADF algorithms upon which our new method is based. Consider for simplicity observing a dataset comprising N i.i.d. samples D = {xn}_{n=1}^N from a probabilistic model p(x|\u03b8) parametrised by an unknown D-dimensional vector \u03b8 that is drawn from a prior p0(\u03b8). Exact Bayesian inference involves computing the (typically intractable) posterior distribution of the parameters given the data,\n\np(\u03b8|D) \u221d p0(\u03b8) \u220f_{n=1}^N p(xn|\u03b8) \u2248 q(\u03b8) \u221d p0(\u03b8) \u220f_{n=1}^N fn(\u03b8). (1)\n\nHere q(\u03b8) is a simpler tractable approximating distribution that will be refined by EP. The goal of EP is to refine the approximate factors so that they capture the contribution of each of the likelihood terms to the posterior, i.e. fn(\u03b8) \u2248 p(xn|\u03b8). In this spirit, one approach would be to find each approximating factor fn(\u03b8) by minimising the Kullback-Leibler (KL) divergence between the posterior and the distribution formed by replacing one of the likelihoods by its corresponding approximating factor, KL[p(\u03b8|D)||p(\u03b8|D)fn(\u03b8)/p(xn|\u03b8)]. Unfortunately, such an update is still intractable as it involves computing the full posterior. Instead, EP approximates this procedure by replacing the exact leave-one-out posterior p\u2212n(\u03b8) \u221d p(\u03b8|D)/p(xn|\u03b8) on both sides of the KL by the approximate leave-one-out posterior (called the cavity distribution) q\u2212n(\u03b8) \u221d q(\u03b8)/fn(\u03b8). Since this couples the updates for the approximating factors, the updates must now be iterated.\n\nIn more detail, EP iterates four simple steps. 
First, the factor selected for update is removed from the approximation to produce the cavity distribution. Second, the corresponding likelihood is included to produce the tilted distribution \u02dcpn(\u03b8) \u221d q\u2212n(\u03b8)p(xn|\u03b8). Third, EP updates the approximating factor by minimising KL[\u02dcpn(\u03b8)||q\u2212n(\u03b8)fn(\u03b8)]. The hope is that the contribution the true likelihood makes to the posterior is similar to the effect the same likelihood has on the tilted distribution. If the approximating distribution is in the exponential family, as is often the case, then the KL minimisation reduces to a moment matching step [21] that we denote fn(\u03b8) \u2190 proj[\u02dcpn(\u03b8)]/q\u2212n(\u03b8). Finally, having updated the factor, it is included into the approximating distribution.\n\nWe summarise the update procedure for a single factor in Algorithm 1. Critically, the approximation step of EP involves local computations since one likelihood term is treated at a time. 
\n\nAlgorithm 1 EP\n1: choose a factor fn to refine:\n2: compute cavity distribution q\u2212n(\u03b8) \u221d q(\u03b8)/fn(\u03b8)\n3: compute tilted distribution \u02dcpn(\u03b8) \u221d p(xn|\u03b8)q\u2212n(\u03b8)\n4: moment matching: fn(\u03b8) \u2190 proj[\u02dcpn(\u03b8)]/q\u2212n(\u03b8)\n5: inclusion: q(\u03b8) \u2190 q\u2212n(\u03b8)fn(\u03b8)\n\nAlgorithm 2 ADF\n1: choose a datapoint xn \u223c D:\n2: compute cavity distribution q\u2212n(\u03b8) = q(\u03b8)\n3: compute tilted distribution \u02dcpn(\u03b8) \u221d p(xn|\u03b8)q\u2212n(\u03b8)\n4: moment matching: fn(\u03b8) \u2190 proj[\u02dcpn(\u03b8)]/q\u2212n(\u03b8)\n5: inclusion: q(\u03b8) \u2190 q\u2212n(\u03b8)fn(\u03b8)\n\nAlgorithm 3 SEP\n1: choose a datapoint xn \u223c D:\n2: compute cavity distribution q\u22121(\u03b8) \u221d q(\u03b8)/f(\u03b8)\n3: compute tilted distribution \u02dcpn(\u03b8) \u221d p(xn|\u03b8)q\u22121(\u03b8)\n4: moment matching: fn(\u03b8) \u2190 proj[\u02dcpn(\u03b8)]/q\u22121(\u03b8)\n5: inclusion: q(\u03b8) \u2190 q\u22121(\u03b8)fn(\u03b8)\n6: implicit update: f(\u03b8) \u2190 f(\u03b8)^{1\u22121/N} fn(\u03b8)^{1/N}\n\nFigure 1: Comparing the Expectation Propagation (EP), Assumed Density Filtering (ADF), and Stochastic Expectation Propagation (SEP) update steps. Typically, the algorithms will be initialised using q(\u03b8) = p0(\u03b8) and, where appropriate, fn(\u03b8) = 1 or f(\u03b8) = 1.\n\nThe assumption is that these local computations, although possibly requiring further approximation, are far simpler to handle compared to the full posterior p(\u03b8|D). In practice, EP often performs well when the updates are parallelised. 
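As a concrete illustration of these four steps, the following sketch (our code, not from the paper) runs Algorithm 1 on a toy model in which theta is a scalar Gaussian mean and the likelihoods are Gaussian, so that proj[.] is exact; the function and variable names are ours.

```python
def ep_gaussian_mean(xs, sigma2=1.0, tau0=1.0, sweeps=3):
    """Algorithm 1 (EP) for the toy model p(x|theta) = N(x; theta, sigma2) with
    prior p0(theta) = N(theta; 0, 1/tau0). Gaussian factors are stored as natural
    parameters (precision tau, precision-times-mean nu), so factor division and
    multiplication become subtraction and addition, and proj[.] is the identity."""
    N = len(xs)
    site_tau = [0.0] * N          # one local factor f_n per datapoint: O(N) memory
    site_nu = [0.0] * N
    q_tau, q_nu = tau0, 0.0       # global approximation q = prior * prod_n f_n
    for _ in range(sweeps):
        for n in range(N):
            # step 2: cavity q_{-n} = q / f_n
            c_tau, c_nu = q_tau - site_tau[n], q_nu - site_nu[n]
            # step 3: tilted ~p_n = p(x_n|theta) * q_{-n} (Gaussian here, so exact)
            t_tau, t_nu = c_tau + 1.0 / sigma2, c_nu + xs[n] / sigma2
            # step 4: moment matching, f_n = proj[~p_n] / q_{-n}
            site_tau[n], site_nu[n] = t_tau - c_tau, t_nu - c_nu
            # step 5: inclusion, q = q_{-n} * f_n
            q_tau, q_nu = t_tau, t_nu
    return q_tau, q_nu            # q(theta) = N(theta; q_nu/q_tau, 1/q_tau)
```

Because this toy likelihood is conjugate, EP here simply recovers the exact posterior (precision tau0 + N/sigma2); the point is only the mechanics of the cavity/tilted/projection/inclusion cycle, and the O(N) site storage that the paper targets.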
Moreover, by using approximating factors for groups of datapoints, and then running additional approximate inference algorithms to perform the EP updates (which could include nesting EP), EP carves up the data, making it suitable for distributed approximate inference.\n\nThere is, however, one wrinkle that complicates deployment of EP at scale. Computation of the cavity distribution requires removal of the current approximating factor, which means any implementation of EP must store the factors explicitly, necessitating an O(N) memory footprint. One option is to simply ignore the removal step, replacing the cavity distribution with the full approximation, resulting in the ADF algorithm (Algorithm 2) that need only maintain a global approximation in memory. But as the moment matching step now over-counts the underlying approximating factor (consider the new form of the objective KL[q(\u03b8)p(xn|\u03b8)||q(\u03b8)]), the variance of the approximation shrinks to zero as multiple passes are made through the dataset. Early stopping is therefore required to prevent overfitting and, generally speaking, ADF does not return uncertainties that are well-calibrated to the posterior. In the next section we introduce a new algorithm that sidesteps EP\u2019s large memory demands whilst avoiding the pathological behaviour of ADF.\n\n3 Stochastic Expectation Propagation\n\nIn this section we introduce a new algorithm, inspired by EP, called Stochastic Expectation Propagation (SEP) that combines the benefits of local approximation (tractability of updates, distributability, and parallelisability) with global approximation (reduced memory demands). The algorithm can be interpreted as a version of EP in which the approximating factors are tied, or alternatively as a corrected version of ADF that prevents overfitting. 
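The ADF over-counting pathology that SEP corrects can be seen numerically. In a toy setting of our own construction (scalar Gaussian mean with Gaussian likelihoods, not an example from the paper), skipping the removal step makes the posterior precision grow linearly with the number of passes:

```python
def adf_gaussian_mean(xs, sigma2=1.0, tau0=1.0, sweeps=3):
    """Algorithm 2 (ADF) on p(x|theta) = N(x; theta, sigma2) with prior N(0, 1/tau0),
    in natural parameters (precision, precision-times-mean). The cavity is q itself,
    so every pass through the data re-counts every likelihood once more."""
    q_tau, q_nu = tau0, 0.0
    for _ in range(sweeps):
        for x in xs:
            # cavity = q (no removal); tilted = q * likelihood; inclusion: q = tilted
            q_tau += 1.0 / sigma2
            q_nu += x / sigma2
    # after S sweeps the precision is tau0 + S*N/sigma2 rather than the correct
    # tau0 + N/sigma2, so the approximate posterior variance shrinks towards zero
    return q_tau, q_nu
```

This is the collapse-to-a-delta behaviour described above, and the reason ADF needs early stopping.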
The key idea is that, at convergence, the approximating factors in EP can be interpreted as parameterising a global factor, f(\u03b8), that captures the average effect of a likelihood on the posterior: f(\u03b8)^N \u225c \u220f_{n=1}^N fn(\u03b8) \u2248 \u220f_{n=1}^N p(xn|\u03b8). In this spirit, the new algorithm employs direct iterative refinement of a global approximation comprising the prior and N copies of a single approximating factor, f(\u03b8), that is q(\u03b8) \u221d f(\u03b8)^N p0(\u03b8).\n\nSEP uses updates that are analogous to EP\u2019s in order to refine f(\u03b8) in such a way that it captures the average effect a likelihood function has on the posterior. First the cavity distribution is formed by removing one of the copies of the factor, q\u22121(\u03b8) \u221d q(\u03b8)/f(\u03b8). Second, the corresponding likelihood is included to produce the tilted distribution \u02dcpn(\u03b8) \u221d q\u22121(\u03b8)p(xn|\u03b8) and, third, SEP finds an intermediate factor approximation by moment matching, fn(\u03b8) \u2190 proj[\u02dcpn(\u03b8)]/q\u22121(\u03b8). Finally, having updated the factor, it is included into the approximating distribution. It is important here not to make a full update since fn(\u03b8) captures the effect of just a single likelihood function p(xn|\u03b8). Instead, damping should be employed to make a partial update f(\u03b8) \u2190 f(\u03b8)^{1\u2212\u03b5} fn(\u03b8)^{\u03b5}. A natural choice uses \u03b5 = 1/N, which can be interpreted as minimising KL[\u02dcpn(\u03b8)||p0(\u03b8)f(\u03b8)^N] in the moment update, but other choices of \u03b5 may be more appropriate, including decreasing \u03b5 according to the Robbins-Monro condition [22].\n\nSEP is summarised in Algorithm 3. Unlike ADF, the cavity is formed by dividing out f(\u03b8), which captures the average effect of the likelihood and prevents the posterior from collapsing. 
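A minimal sketch of Algorithm 3 (our code and toy model, not the paper's: a scalar Gaussian mean with Gaussian likelihoods so that moment matching is exact; the paper samples datapoints at random, whereas we sweep in order so the result is deterministic):

```python
def sep_gaussian_mean(xs, sigma2=1.0, tau0=1.0, sweeps=3):
    """Algorithm 3 (SEP) for p(x|theta) = N(x; theta, sigma2), prior N(theta; 0, 1/tau0).
    Only the prior and one tied factor f(theta) are stored, in natural parameters
    (precision, precision-times-mean): O(1) factor storage instead of EP's O(N)."""
    N = len(xs)
    f_tau, f_nu = 0.0, 0.0        # tied factor f(theta); q = prior * f(theta)^N
    t = 0
    for _ in range(sweeps):
        for n in range(N):
            t += 1
            # cavity q_{-1} = q / f = prior * f^(N-1); for a Gaussian likelihood the
            # intermediate site f_n = proj[tilted]/cavity does not depend on it:
            fn_tau, fn_nu = 1.0 / sigma2, xs[n] / sigma2
            # damped implicit update f <- f^(1-eps) * f_n^(eps), i.e. a convex
            # combination of natural parameters; eps_t = 1/t is a Robbins-Monro schedule
            eps = 1.0 / t
            f_tau = (1.0 - eps) * f_tau + eps * fn_tau
            f_nu = (1.0 - eps) * f_nu + eps * fn_nu
    return tau0 + N * f_tau, N * f_nu     # natural parameters of q(theta)
```

With this schedule f ends up holding the average of the per-datapoint sites, so on this conjugate toy problem q matches the exact posterior while storing a single tied factor.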
Like ADF, however, SEP only maintains the global approximation q(\u03b8), since f(\u03b8) \u221d (q(\u03b8)/p0(\u03b8))^{1/N} and q\u22121(\u03b8) \u221d q(\u03b8)^{1\u22121/N} p0(\u03b8)^{1/N}. When Gaussian approximating factors are used, for example, SEP reduces the storage requirement of EP from O(ND^2) to O(D^2), which is a substantial saving that enables models with many parameters to be applied to large datasets.\n\n4 Algorithmic extensions to SEP and theoretical results\n\nSEP has been motivated from a practical perspective by the limitations inherent in EP and ADF. In this section we extend SEP in four orthogonal directions and relate SEP to SVI. Many of the algorithms described here are summarised in Figure 2 and they are detailed in the supplementary material.\n\n4.1 Parallel SEP: relating the EP fixed points to SEP\n\nThe SEP algorithm outlined above approximates one likelihood at a time, which can be computationally slow. However, it is simple to parallelise the SEP updates by following the same recipe by which EP is parallelised. Consider a minibatch comprising M datapoints (for a full parallel batch update use M = N). First we form the cavity distribution for each likelihood. Unlike EP these are all identical. Next, in parallel, compute M intermediate factors fm(\u03b8) \u2190 proj[\u02dcpm(\u03b8)]/q\u22121(\u03b8). In EP these intermediate factors become the new likelihood approximations and the approximation is updated to q(\u03b8) = p0(\u03b8) \u220f_{n\u2260m} fn(\u03b8) \u220f_m fm(\u03b8). In SEP, the same update is used for the approximating distribution, which becomes q(\u03b8) \u2190 p0(\u03b8) fold(\u03b8)^{N\u2212M} \u220f_m fm(\u03b8) and, by implication, the approximating factor is fnew(\u03b8) = fold(\u03b8)^{1\u2212M/N} \u220f_{m=1}^M fm(\u03b8)^{1/N}. One way of understanding parallel SEP is as a double loop algorithm. 
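The batch M = N case of this update (Averaged EP) has a particularly simple form, since the geometric average of exponential-family factors is an arithmetic average of their natural parameters. A toy sketch (our code, not the paper's; a scalar Gaussian-mean model with Gaussian likelihoods so the intermediate sites are exact):

```python
def aep_gaussian_mean(xs, sigma2=1.0, tau0=1.0, iters=5):
    """Parallel SEP with M = N (Averaged EP) for p(x|theta) = N(x; theta, sigma2)
    and prior N(theta; 0, 1/tau0), in natural parameters (precision, precision-mean).
    All N cavities are identical; the intermediate sites are computed "in parallel"
    and combined as f_new = f_old^(1 - M/N) * prod_m f_m^(1/N), whose exponent on
    the old factor vanishes when M = N."""
    N = len(xs)
    f_tau, f_nu = 0.0, 0.0
    for _ in range(iters):
        # intermediate sites f_m = proj[tilted_m]/cavity (exact for Gaussian likelihoods)
        sites = [(1.0 / sigma2, x / sigma2) for x in xs]
        f_tau = sum(s[0] for s in sites) / N     # geometric average of the sites =
        f_nu = sum(s[1] for s in sites) / N      # arithmetic average of naturals
    return tau0 + N * f_tau, N * f_nu            # q = prior * f^N
```

On this conjugate toy problem the batch update reaches the EP fixed point after a single outer iteration; in general the cavities change between iterations and more are needed.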
The inner loop produces intermediate approximations qm(\u03b8) \u2190 arg min_q KL[\u02dcpm(\u03b8)||q(\u03b8)]; these are then combined in the outer loop: q(\u03b8) \u2190 arg min_q \u2211_{m=1}^M KL[q(\u03b8)||qm(\u03b8)] + (N \u2212 M) KL[q(\u03b8)||qold(\u03b8)].\n\nFor M = 1 parallel SEP reduces to the original SEP algorithm. For M = N parallel SEP is equivalent to the so-called Averaged EP algorithm proposed in [23] as a theoretical tool to study the convergence properties of normal EP. This work showed that, under fairly restrictive conditions (likelihood functions that are log-concave and varying slowly as a function of the parameters), AEP converges to the same fixed points as EP in the large data limit (N \u2192 \u221e).\n\nThere is another illuminating connection between SEP and AEP. Since SEP\u2019s approximating factor f(\u03b8) converges to the geometric average of the intermediate factors, \u00aff(\u03b8) \u221d [\u220f_{n=1}^N fn(\u03b8)]^{1/N}, SEP converges to the same fixed points as AEP if the learning rates satisfy the Robbins-Monro condition [22], and therefore, under certain conditions [23], to the same fixed points as EP. But it is still an open question whether there are more direct relationships between EP and SEP.\n\n4.2 Stochastic power EP: relationships to variational methods\n\nThe relationship between variational inference and stochastic variational inference [3] mirrors the relationship between EP and SEP. Can these relationships be made more formal? If the moment projection step in EP is replaced by a natural parameter matching step then the resulting algorithm is equivalent to the Variational Message Passing (VMP) algorithm [24] (see also the supplementary material). Moreover, VMP has the same fixed points as variational inference [13] (since minimising the local variational KL divergences is equivalent to minimising the global variational KL).\n\nThese results carry over to the new algorithms with minor modifications. 
Specifically, VMP can be transformed into SVMP by replacing VMP\u2019s local approximations with the global form employed by SEP. In the supplementary material we show that this algorithm is an instance of standard SVI and that it therefore has the same fixed points as VI when \u03b5 satisfies the Robbins-Monro condition [22]. More generally, the procedure can be applied to any member of the power EP (PEP) [12] family of algorithms, which replace the moment projection step in EP with alpha-divergence minimization [21], but care has to be taken when taking the limiting cases (see supplementary). These results lend weight to the view that SEP is a natural stochastic generalisation of EP.\n\nFigure 2: Relationships between algorithms. Note that care needs to be taken when interpreting the alpha-divergence as a \u2192 \u22121 (see supplementary material).\n\n4.3 Distributed SEP: controlling granularity of the approximation\n\nEP uses a fine-grained approximation comprising a single factor for each likelihood. SEP, on the other hand, uses a coarse-grained approximation comprising a single global factor to approximate the average effect of all likelihood terms. One might worry that SEP\u2019s approximation is too severe if the dataset contains sets of datapoints that have very different likelihood contributions (e.g. for odd-vs-even handwritten digits classification, consider the effect of a 5 and a 9 on the posterior). It might be more sensible in such cases to partition the dataset into K disjoint pieces {Dk = {xn}_{n=N_{k\u22121}}^{N_k}}_{k=1}^K with N = \u2211_{k=1}^K Nk and use an approximating factor for each partition. If normal EP updates are performed on the subsets, i.e. treating p(Dk|\u03b8) as a single true factor to be approximated, we arrive at the Distributed EP algorithm [5, 6]. But such updates are challenging as multiple likelihood terms must be included during each update, necessitating additional approximations (e.g. MCMC). 
A simpler alternative uses SEP/AEP inside each partition, implying a posterior approximation of the form q(\u03b8) \u221d p0(\u03b8) \u220f_{k=1}^K fk(\u03b8)^{Nk}, with fk(\u03b8)^{Nk} approximating p(Dk|\u03b8). The limiting cases of this algorithm, when K = 1 and K = N, recover SEP and EP respectively.\n\n4.4 SEP with latent variables\n\nMany applications of EP involve latent variable models. Although this is not the main focus of the paper, we show that SEP is applicable in this case without scaling the memory footprint with N. Consider a model containing hidden variables, hn, associated with each observation p(xn, hn|\u03b8), that are drawn i.i.d. from a prior p0(hn). The goal is to approximate the true posterior over parameters and hidden variables p(\u03b8, {hn}|D) \u221d p0(\u03b8) \u220f_n p0(hn)p(xn|hn, \u03b8). Typically, EP would approximate the effect of each intractable term as p(xn|hn, \u03b8)p0(hn) \u2248 fn(\u03b8)gn(hn). Instead, SEP ties the approximate parameter factors p(xn|hn, \u03b8)p0(hn) \u2248 f(\u03b8)gn(hn), yielding:\n\nq(\u03b8, {hn}) \u221d p0(\u03b8) f(\u03b8)^N \u220f_{n=1}^N gn(hn). (2)\n\nCritically, as proved in the supplementary, the local factors gn(hn) do not need to be maintained in memory. 
This means that all of the advantages of SEP carry over to more complex models involving latent variables, although this can potentially increase computation time in cases where updates for gn(hn) are not analytic, since then they will be initialised from scratch at each update.\n\n[Figure 2 graphic: panel A relates the algorithms (EP, PEP, SEP, AEP, VMP, AVMP and their parallel variants) via alpha-divergence updates (a = 1 vs a = \u22121), parallel minibatch updates (M = 1 vs M = N) and multiple approximating factors (K = 1 vs K = N); panel B relates their fixed points (stochastic and averaged variants share fixed points, and coincide with those of EP/VI in the large data limit, conditions applying). Abbreviations: AEP: Averaged EP; AVMP: Averaged VMP; EP: Expectation Propagation; PEP: Power EP; SEP: Stochastic EP; SVMP: Stochastic VMP; par-EP/par-SEP/par-VMP: parallel updates; VI: Variational Inference; VMP: Variational Message Passing.]\n\n5 Experiments\n\nThe purpose of the experiments was to evaluate SEP on a number of datasets (synthetic and real-world, small and large) and on a number of models (probit regression, mixture of Gaussians and Bayesian neural networks).\n\n5.1 Bayesian probit regression\n\nThe first experiments considered a simple Bayesian classification problem and investigated the stability and quality of SEP in relation to EP and ADF, as well as the effect of using minibatches and varying the granularity of the approximation. The model comprised a probit likelihood function P(yn = 1|\u03b8) = \u03a6(\u03b8^T xn) and a Gaussian prior over the hyper-plane parameter p(\u03b8) = N(\u03b8; 0, \u03b3I). The synthetic data comprised N = 5,000 datapoints {(xn, yn)}, where xn were D = 4 dimensional and were either sampled from a single Gaussian distribution (Fig. 3(a)) or from a mixture of Gaussians (MoGs) with J = 5 components (Fig. 3(b)) to investigate the sensitivity of the methods to the homogeneity of the dataset. The labels were produced by sampling from the generative model. 
We followed [6] in measuring performance by computing an approximation of KL[p(\u03b8|D)||q(\u03b8)], where p(\u03b8|D) was replaced by a Gaussian that had the same mean and covariance as samples drawn from the posterior using the No-U-Turn Sampler (NUTS) [25], to quantify the calibration of uncertainty estimates.\n\nResults in Fig. 3(a) indicate that EP is the best performing method and that ADF collapses towards a delta function. SEP converges to a solution which appears to be of similar quality to that obtained by EP for the dataset containing Gaussian inputs, but slightly worse when the MoG inputs were used. Variants of SEP that used larger minibatches fluctuated less, but typically took longer to converge (although for the small minibatches shown this effect is not clear). The utility of finer-grained approximations depended on the homogeneity of the data. For the second dataset containing MoG inputs (shown in Fig. 3(b)), finer-grained approximations were found to be advantageous if the datapoints from each mixture component were assigned to the same approximating factor. Generally it was found that there is no advantage to retaining more approximating factors than there were clusters in the dataset.\n\nTo verify whether these conclusions about the granularity of the approximation hold in real datasets, we sampled N = 1,000 datapoints for each of the digits in MNIST and performed odd-vs-even classification. Each digit class was assigned its own global approximating factor, K = 10. We compare the log-likelihood of a test set using ADF, SEP (K = 1), full EP and DSEP (K = 10) in Figure 3(c). EP and DSEP significantly outperform ADF. DSEP is slightly worse than full EP initially; however, it reduces the memory to 0.001% of full EP without losing accuracy substantially. SEP\u2019s accuracy was still increasing at the end of learning and was slightly better than ADF. 
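For reference, the moment-matching step for this probit model is available in closed form via standard Gaussian identities. The sketch below (our code, not the paper's; written for a scalar parameter, whereas the experiments use D = 4 with the analogous multivariate expressions) computes the tilted mean and variance needed in steps 3-4 of the EP/SEP updates:

```python
import math

def probit_tilted_moments(m, v, x, y):
    """Mean and variance of the tilted distribution ~p(theta) proportional to
    Phi(y*x*theta) * N(theta; m, v) for scalar theta, using the standard
    probit-Gaussian identities behind EP's moment matching for the
    likelihood P(y = 1 | theta) = Phi(theta * x), with y in {-1, +1}."""
    s2 = x * x * v                            # cavity variance of the activation x*theta
    z = y * x * m / math.sqrt(1.0 + s2)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    r = pdf / cdf                             # the N(z)/Phi(z) ratio
    mean = m + y * x * v * r / math.sqrt(1.0 + s2)
    var = v - (x * v) ** 2 * r * (z + r) / (1.0 + s2)
    return mean, var
```

In the multivariate case the vector Vx / sqrt(1 + x^T V x) plays the role of the scalar xv / sqrt(1 + s2); the formulas can be checked against one-dimensional quadrature.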
Further empirical comparisons are reported in the supplementary, and in summary the three EP methods are indistinguishable when likelihood functions have similar contributions to the posterior.\n\nFinally, we tested SEP\u2019s performance on six small binary classification datasets from the UCI machine learning repository.1 We did not consider the effect of minibatches or the granularity of the approximation, using K = M = 1. We ran the tests with damping and stopped learning after convergence (by monitoring the updates of approximating factors). The classification results are summarised in Table 1. ADF performs reasonably well on the mean classification error metric, presumably because it tends to learn a good approximation to the posterior mode. However, the posterior variance is poorly approximated and therefore ADF returns poor test log-likelihood scores. EP achieves significantly higher test log-likelihood than ADF, indicating that a superior approximation to the posterior variance is attained. Crucially, SEP performs very similarly to EP, implying that SEP is an accurate alternative to EP even though it is refining a cheaper global posterior approximation.\n\n5.2 Mixture of Gaussians for clustering\n\nThe small scale experiments on probit regression indicate that SEP performs well for fully-observed probabilistic models. Although it is not the main focus of the paper, we sought to test the flexibility of the method by applying it to a latent variable model, specifically a mixture of Gaussians. A synthetic MoGs dataset containing N = 200 datapoints was constructed comprising J = 4 Gaussians.\n\n1https://archive.ics.uci.edu/ml/index.html\n\nFigure 3: Bayesian logistic regression experiments. Panels (a) and (b) show synthetic data experiments. Panel (c) shows the results on MNIST (see text for full details).\n\nTable 1: Average test results for all methods on probit regression. 
All methods appear to capture the posterior\u2019s mode; however, EP outperforms ADF in terms of test log-likelihood on almost all of the datasets, with SEP performing similarly to EP.\n\nDataset | mean error: ADF / SEP / EP | test log-likelihood: ADF / SEP / EP\nAustralian | 0.328\u00b10.0127 / 0.325\u00b10.0135 / 0.330\u00b10.0133 | -0.634\u00b10.010 / -0.631\u00b10.009 / -0.631\u00b10.009\nBreast | 0.037\u00b10.0045 / 0.034\u00b10.0034 / 0.034\u00b10.0039 | -0.100\u00b10.015 / -0.094\u00b10.011 / -0.093\u00b10.011\nCrabs | 0.056\u00b10.0133 / 0.033\u00b10.0099 / 0.036\u00b10.0113 | -0.242\u00b10.012 / -0.125\u00b10.013 / -0.110\u00b10.013\nIonos | 0.126\u00b10.0166 / 0.130\u00b10.0147 / 0.131\u00b10.0149 | -0.373\u00b10.047 / -0.336\u00b10.029 / -0.324\u00b10.028\nPima | 0.242\u00b10.0093 / 0.244\u00b10.0098 / 0.241\u00b10.0093 | -0.516\u00b10.013 / -0.514\u00b10.012 / -0.513\u00b10.012\nSonar | 0.198\u00b10.0208 / 0.198\u00b10.0217 / 0.198\u00b10.0243 | -0.461\u00b10.053 / -0.418\u00b10.021 / -0.415\u00b10.021\n\nThe means were sampled from a Gaussian distribution, p(\u00b5j) = N(\u00b5; m, I), the cluster identity variables were sampled from a uniform categorical distribution p(hn = j) = 1/4, and each mixture component was isotropic p(xn|hn) = N(xn; \u00b5_{hn}, 0.5^2 I). EP, ADF and SEP were performed to approximate the joint posterior over the cluster means {\u00b5j} and cluster identity variables {hn} (the other parameters were assumed known).\n\nFigure 4(a) visualises the approximate posteriors after 200 iterations. All methods return good estimates for the means, but ADF collapses towards a point estimate as expected. SEP, in contrast, captures the uncertainty and returns nearly identical approximations to EP. The accuracy of the methods is quantified in Fig. 4(b) by comparing the approximate posteriors to those obtained from NUTS. 
In this case the approximate KL-divergence measure is analytically intractable; instead, we used the averaged F-norm of the difference of the Gaussian parameters fitted by NUTS and the EP methods. These measures confirm that SEP approximates EP well in this case.\n\n5.3 Probabilistic backpropagation\n\nThe final set of tests considered more complicated models and large datasets. Specifically, we evaluate the methods for probabilistic backpropagation (PBP) [4], a recent state-of-the-art method for scalable Bayesian learning in neural network models. Previous implementations of PBP perform several iterations of ADF over the training data. The moment matching operations required by ADF are themselves intractable and they are approximated by first propagating the uncertainty on the synaptic weights forward through the network in a sequential way, and then computing the gradient of the marginal likelihood by backpropagation. ADF is used to reduce the large memory cost that would be required by EP when the amount of available data is very large.\n\nWe performed several experiments to assess the accuracy of different implementations of PBP based on ADF, SEP and EP on regression datasets following the same experimental protocol as in [4] (see supplementary material). We considered neural networks with 50 hidden units (except for Year and Protein, for which we used 100). Table 2 shows the average test RMSE and test log-likelihood for each method. Interestingly, SEP can outperform EP in this setting (possibly because the stochasticity enabled it to find better solutions), and typically it performed similarly. Memory reductions using\n\nFigure 4: Posterior approximation for the mean of the Gaussian components. (a) visualises posterior approximations over the cluster means (98% confidence level). The coloured dots indicate the true label (top-left) or the inferred cluster assignments (the rest). 
In (b) we show the error (in F-norm) of\nthe approximate Gaussians\u2019 means (top) and covariances (bottom).\n\nTable 2: Average test results for all methods. Datasets are also from the UCI machine learning\nrepository.\n\nADF\n\nDataset\n1.005\u00b10.007\nKin8nm 0.098\u00b10.0007 0.088\u00b10.0009 0.089\u00b10.0006\n0.006\u00b10.0000 0.002\u00b10.0000 0.004\u00b10.0000\n4.207\u00b10.011\nNaval\n4.124\u00b10.0345 4.165\u00b10.0336 4.191\u00b10.0349 -2.837\u00b10.009 -2.846\u00b10.008 -2.852\u00b10.008\nPower\n4.727\u00b10.0112 4.670\u00b10.0109 4.748\u00b10.0137 -2.973\u00b10.003 -2.961\u00b10.003 -2.979\u00b10.003\nProtein\n0.635\u00b10.0079 0.650\u00b10.0082 0.637\u00b10.0076 -0.968\u00b10.014 -0.976\u00b10.013 -0.958\u00b10.011\nWine\n-3.929\u00b1NA\n8.879\u00b1 NA\nYear\n\n-3.603\u00b1 NA -3.924\u00b1NA\n\n0.896\u00b10.006\n3.731\u00b10.006\n\n8.922\u00b1NA\n\n8.914\u00b1NA\n\nEP\n\nADF\n\nRMSE\nSEP\n\nSEP\n\ntest log-likelihood\n1.013\u00b10.011\n4.590\u00b10.014\n\nEP\n\nSEP instead of EP were large e.g. 694Mb for the Protein dataset and 65,107Mb for the Year dataset\n(see supplementary). Surprisingly ADF often outperformed EP, although the results presented for\nADF use a near-optimal number of sweeps and further iterations generally degraded performance.\nADF\u2019s good performance is most likely due to an interaction with additional moment approximation\nrequired in PBP that is more accurate as the number of factors increases.\n\n6 Conclusions and future work\n\nThis paper has presented the stochastic expectation propagation method for reducing EP\u2019s large\nmemory consumption which is prohibitive for large datasets. We have connected the new algorithm\nto a number of existing methods including assumed density \ufb01ltering, variational message passing,\nvariational inference, stochastic variational inference and averaged EP. 
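The factor-of-N memory saving at the heart of this comparison can be illustrated with a rough parameter count. This is a back-of-envelope sketch with hypothetical helper names: it counts only the floats in full-covariance Gaussian factors, whereas the megabyte figures reported for PBP also include implementation-specific state.

```python
def gaussian_factor_size(d):
    """Floats stored per d-dimensional Gaussian factor: a mean vector
    (d entries) plus a symmetric covariance or precision matrix
    (d*(d+1)/2 unique entries)."""
    return d + d * (d + 1) // 2

def ep_memory(n_data, d):
    """EP keeps one local approximating factor per datapoint, plus the
    global approximation itself: storage grows linearly with N."""
    return (n_data + 1) * gaussian_factor_size(d)

def sep_memory(d):
    """SEP keeps only the global approximation and a single averaged
    factor, independent of the dataset size."""
    return 2 * gaussian_factor_size(d)
```

Under this count the EP/SEP storage ratio is (N + 1)/2, i.e. it grows linearly in the number of datapoints, which is the scaling behind the large savings quoted for the Protein and Year datasets.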
Experiments on Bayesian logistic regression (both synthetic and real-world) and mixture-of-Gaussians clustering indicated that the new method had an accuracy that was competitive with EP. Experiments on probabilistic backpropagation on large real-world regression datasets again showed that SEP performed comparably to EP with a vastly reduced memory footprint. Future experimental work will focus on developing data-partitioning methods to leverage finer-grained approximations (DSEP), which showed promising experimental performance, and also on mini-batch updates. There is also a need for further theoretical understanding of these algorithms, and indeed of EP itself. Theoretical work will study the convergence properties of the new algorithms, for which we only have limited results at present. Systematic comparisons of EP-like algorithms and variational methods will guide practitioners in choosing the appropriate scheme for their application.

Acknowledgements

We thank the reviewers for valuable comments. YL thanks the Schlumberger Foundation Faculty for the Future fellowship for supporting her PhD study. JMHL acknowledges support from the Rafael del Pino Foundation. RET thanks EPSRC grants EP/G050821/1 and EP/L000776/1.

References

[1] Sungjin Ahn, Babak Shahbaba, and Max Welling. Distributed stochastic gradient MCMC. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1044-1052, 2014.

[2] Rémi Bardenet, Arnaud Doucet, and Chris Holmes. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 405-413, 2014.

[3] Matthew D. Hoffman, David M. Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303-1347, 2013.

[4] José Miguel Hernández-Lobato and Ryan P. Adams.
Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv:1502.05336, 2015.

[5] Andrew Gelman, Aki Vehtari, Pasi Jylänki, Christian Robert, Nicolas Chopin, and John P. Cunningham. Expectation propagation as a way of life. arXiv:1412.4869, 2014.

[6] Minjie Xu, Balaji Lakshminarayanan, Yee Whye Teh, Jun Zhu, and Bo Zhang. Distributed Bayesian posterior sampling via moment sharing. In Advances in Neural Information Processing Systems, 2014.

[7] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence, volume 17, pages 362-369, 2001.

[8] Manfred Opper and Ole Winther. Expectation consistent approximate inference. The Journal of Machine Learning Research, 6:2177-2204, 2005.

[9] Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. The Journal of Machine Learning Research, 6:1679-1704, 2005.

[10] Simon Barthelmé and Nicolas Chopin. Expectation propagation for likelihood-free inference. Journal of the American Statistical Association, 109(505):315-333, 2014.

[11] John P. Cunningham, Philipp Hennig, and Simon Lacoste-Julien. Gaussian probabilities and expectation propagation. arXiv:1111.6832, 2011.

[12] Thomas P. Minka. Power EP. Technical Report MSR-TR-2004-149, Microsoft Research, Cambridge, 2004.

[13] John M. Winn and Christopher M. Bishop. Variational message passing. Journal of Machine Learning Research, pages 661-694, 2005.

[14] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

[15] Matthew James Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University of London, 2003.

[16] Richard E. Turner and Maneesh Sahani. Two problems with variational expectation maximisation for time-series models. In D. Barber, T. Cemgil, and S. Chiappa, editors, Bayesian Time Series Models, chapter 5, pages 109-130. Cambridge University Press, 2011.

[17] Richard E. Turner and Maneesh Sahani. Probabilistic amplitude and frequency demodulation. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 981-989. 2011.

[18] Ralf Herbrich, Tom Minka, and Thore Graepel. TrueSkill: A Bayesian skill rating system. In Advances in Neural Information Processing Systems, pages 569-576, 2006.

[19] Peter S. Maybeck. Stochastic Models, Estimation and Control. Academic Press, 1982.

[20] Yuan Qi, Ahmed H. Abdel-Gawad, and Thomas P. Minka. Sparse-posterior Gaussian processes for general likelihoods. In Uncertainty in Artificial Intelligence (UAI), 2010.

[21] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry, volume 191. Oxford University Press, 2000.

[22] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.

[23] Guillaume Dehaene and Simon Barthelmé. Expectation propagation in the large-data limit. arXiv:1503.08060, 2015.

[24] Thomas Minka. Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research, Cambridge, 2005.

[25] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research, 15(1):1593-1623, 2014.