{"title": "Improving on Expectation Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 1241, "page_last": 1248, "abstract": "We develop as series of corrections to Expectation Propagation (EP), which is one of the most popular methods for approximate probabilistic inference. These corrections can lead to improvements of the inference approximation or serve as a sanity check, indicating when EP yields unrealiable results.", "full_text": "Improving on Expectation Propagation\n\nManfred Opper\n\nComputer Science, TU Berlin\n\nopperm@cs.tu-berlin.de\n\nUlrich Paquet\n\nComputer Laboratory, University of Cambridge\n\nulrich@cantab.net\n\nInformatics and Mathematical Modelling, Technical University of Denmark\n\nOle Winther\n\nowi@imm.dtu.dk\n\nAbstract\n\nA series of corrections is developed for the \ufb01xed points of Expectation Propaga-\ntion (EP), which is one of the most popular methods for approximate probabilistic\ninference. These corrections can lead to improvements of the inference approxi-\nmation or serve as a sanity check, indicating when EP yields unrealiable results.\n\n1 Introduction\n\nThe expectation propagation (EP) message passing algorithm is often considered as the method of\nchoice for approximate Bayesian inference when both good accuracy and computational ef\ufb01ciency\nare required [5]. One recent example is a comparison of EP with extensive MCMC simulations for\nGaussian process (GP) classi\ufb01ers [4], which has shown that not only the predictive distribution, but\nalso the typically much harder marginal likelihood (the partition function) of the data, are approxi-\nmated remarkably well for a variety of data sets. However, while such empirical studies hold great\nvalue, they can not guarantee the same performance on other data sets or when completely different\ntypes of Bayesian models are considered.\n\nIn this paper methods are developed to assess the quality of the EP approximation. We compute\nexplicit expressions for the remainder terms of the approximation. This leads to various corrections\nfor partition functions and posterior distributions. Under the hypothesis that the EP approximation\nworks well, we identify quantities which can be assumed to be small and can be used in a series\nexpansion of the corrections with increasing complexity. The computation of low order corrections\nin this expansion is often feasible, typically require only moderate computational efforts, and can\nlead to an improvement to the EP approximation or to the indication that the approximation cannot\nbe trusted.\n\n2 Expectation Propagation in a Nutshell\n\nSince it is the goal of this paper to compute corrections to the EP approximation, we will not dis-\ncuss details of EP algorithms but rather characterise the \ufb01xed points which are reached when such\nalgorithms converge.\n\nEP is applied to probabilistic models with an unobserved latent variable x having an intractable\ndistribution p(x). In applications p(x) is usually the Bayesian posterior distribution conditioned on\na set of observations. Since the dependency on the latter variables is not important for the subsequent\ntheory, we will skip them in our notation.\n\n1\n\n\fIt is assumed that p(x) factorizes into a product of terms fn such that\n\np(x) =\n\nfn(x) ,\n\n1\n\nZ Yn\n\n(1)\n\nwhere the normalising partition function Z = R dx Qn fn(x) is also intractable. 
We then assume an approximation to p(x) in the form

q(x) = \prod_n g_n(x) ,    (2)

where the terms g_n(x) belong to a tractable, e.g. exponential, family of distributions. To compute the optimal parameters of the g_n term approximation, a set of auxiliary tilted distributions is defined via

q_n(x) = \frac{1}{Z_n} \frac{q(x) f_n(x)}{g_n(x)} .    (3)

Here a single approximating term g_n is replaced by an original term f_n. Assuming that this replacement leaves q_n still tractable, the parameters in g_n are determined by the condition that q(x) and all q_n(x) should be made as similar as possible. This is usually achieved by requiring that these distributions share a set of generalised moments (which usually coincide with the sufficient statistics of the exponential family). Note that we will not assume that this expectation consistency [8] for the moments is derived by minimising a Kullback–Leibler divergence, as was done in the original derivations of EP [5]. Such an assumption would limit the applicability of the approximate inference and exclude, e.g., the approximation of models with binary, Ising variables by a Gaussian model, as in one of the applications in the last section.

The corresponding approximation to the normalising partition function in (1) was given in [8] and [7] and reads in our present notation¹

Z_{EP} = \prod_n Z_n .    (4)

¹The definition of the partition functions Z_n is slightly different from previous works.

3 Corrections to EP

An expression for the remainder terms which are neglected by the EP approximation can be obtained by solving for f_n in (3), and taking the product to get

\prod_n f_n(x) = \prod_n \frac{Z_n q_n(x) g_n(x)}{q(x)} = Z_{EP} \, q(x) \prod_n \frac{q_n(x)}{q(x)} .    (5)

Hence Z = \int dx \prod_n f_n(x) = Z_{EP} R, with

R = \int dx \, q(x) \prod_n \frac{q_n(x)}{q(x)}   and   p(x) = \frac{1}{R} q(x) \prod_n \frac{q_n(x)}{q(x)} .    (6)

This shows that corrections to EP are small when all distributions q_n are indeed close to q, justifying the optimality criterion of EP. For related expansions, see [2, 3, 9].

Exact probabilistic inference with the corrections described here again leads to intractable computations. However, we can derive exact perturbation expansions involving a series of corrections with increasing computational complexity. Assuming that EP already yields a good approximation, the computation of a small number of these terms may be sufficient to obtain the most dominant corrections. On the other hand, when the leading corrections come out large or do not sufficiently decrease with order, this may indicate that the EP approximation is inaccurate. Two such perturbation expansions are presented in this section.

3.1 Expansion I: Clusters

The most basic expansion is based on the variables ε_n(x) = q_n(x)/q(x) − 1, which we can assume to be typically small when the EP approximation is good. Expanding the products in (6) we obtain the correction to the partition function

R = \int dx \, q(x) \prod_n (1 + \varepsilon_n(x))    (7)
  = 1 + \sum_{n_1 < n_2} \langle \varepsilon_{n_1}(x) \varepsilon_{n_2}(x) \rangle_q + \sum_{n_1 < n_2 < n_3} \langle \varepsilon_{n_1}(x) \varepsilon_{n_2}(x) \varepsilon_{n_3}(x) \rangle_q + \ldots ,    (8)

which is a finite series in terms of growing clusters of "interacting" variables ε_n(x). Here the brackets ⟨...⟩_q denote expectations with respect to the distribution q. Note that the first order term \sum_n \langle \varepsilon_n(x) \rangle_q = 0 vanishes by the normalization of q_n and q.
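A minimal sketch of how the second-order term in (8) could be estimated: assuming the densities q(x) and q_n(x) can be evaluated pointwise and q can be sampled, each ⟨ε_{n1}ε_{n2}⟩_q is a plain Monte Carlo average. The 1-D Gaussian setup and the slightly perturbed tilted marginals below are illustrative stand-ins for real EP output, not the paper's models.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Stand-in for the EP approximation q: a standard Gaussian in 1-D.
def q_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Stand-in tilted marginals q_n: mildly non-Gaussian mixtures. At a true EP
# fixed point each <eps_n>_q would vanish exactly; here it is only small.
def qn_pdf(x, n):
    return 0.9 * q_pdf(x) + 0.1 * np.exp(-0.5 * (x - 0.1 * n)**2) / np.sqrt(2.0 * np.pi)

N = 4
x = rng.standard_normal(100_000)                                     # samples from q
eps = np.array([qn_pdf(x, n) / q_pdf(x) - 1.0 for n in range(N)])    # eps_n(x)

# Second-order cluster correction of eq. (8): R ~ 1 + sum_{n1<n2} <eps_n1 eps_n2>_q
R2 = 1.0 + sum(np.mean(eps[i] * eps[j]) for i, j in combinations(range(N), 2))
print("second-order estimate of R:", R2)
```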
As we will see later, the computation of corrections is feasible when q_n is just a finite mixture of K simpler densities from the exponential family to which q belongs. The number of mixture components in the j-th term of the expansion of R is then of the order O(K^j), and an evaluation of low-order terms should be tractable. In a similar way, we get

p(x) = \frac{q(x) \left( 1 + \sum_n \varepsilon_n(x) + \sum_{n_1 < n_2} \varepsilon_{n_1}(x) \varepsilon_{n_2}(x) + \ldots \right)}{1 + \sum_{n_1 < n_2} \langle \varepsilon_{n_1}(x) \varepsilon_{n_2}(x) \rangle_q + \ldots} .    (9)

In order to keep the resulting density normalized to one, we should keep as many terms in the numerator as in the denominator. As an example, the first order correction to q(x) is

p(x) \approx \sum_n q_n(x) - (N - 1) \, q(x) .    (10)

3.2 Expansion II: Cumulants

One of the most important applications of EP is to the case of statistical models with Gaussian process priors. Here x is a latent variable with Gaussian prior distribution and covariance E[xx^⊤] = K, where K is the kernel matrix. In this case we have N + 1 terms f_0, f_1, ..., f_N in (1), where f_0(x) = g_0(x) = exp[−(1/2) x^⊤ K^{−1} x]. For n ≥ 1 each f_n(x) = t_n(x_n) is the likelihood term for the n-th observation, which depends only on a single component x_n of the vector x.

The corresponding approximating terms are chosen to be Gaussian of the form g_n(x) ∝ e^{γ_n x_n − (1/2) λ_n x_n²}. The 2N parameters γ_n and λ_n are determined in such a way that q(x) and the distributions q_n(x) have the same first and second marginal moments ⟨x_n⟩ and ⟨x_n²⟩. In this case, the computation of corrections (7) would require the computation of multivariate integrals of increasing dimensionality. Hence, a different type of expansion seems more appropriate. The main idea is to expand with respect to the higher order cumulants of the distributions q_n.

To derive this expansion, we simplify (6) using the fact that q(x) = q(x_{\setminus n}|x_n) q(x_n) and q_n(x) = q(x_{\setminus n}|x_n) q_n(x_n), where we have (with a slight abuse of notation) introduced q(x_n) and q_n(x_n), the marginals of q(x) and q_n(x). Thus p(x) = (1/R) q(x) F(x) and R = \int dx \, q(x) F(x), where

F(x) = \prod_n \frac{q_n(x_n)}{q(x_n)} .    (11)

Since q(x_n) and the q_n(x_n) have the same first two cumulants, corrections can be expressed by the higher cumulants of the q_n(x_n) (note that the higher cumulants of q(x_n) vanish). The cumulants c_{ln} of q_n(x_n) are defined by their characteristic functions χ_n(k) via

q_n(x_n) = \int \frac{dk}{2\pi} e^{-ikx_n} \chi_n(k)   and   \ln \chi_n(k) = \sum_l \frac{(i)^l c_{ln}}{l!} k^l .    (12)

Expressing the Gaussian marginals q(x_n) by their first and second cumulants, the means m_n and the variances S_{nn}, and introducing the function

r_n(k) = \sum_{l \ge 3} \frac{(i)^l c_{ln}}{l!} k^l ,    (13)

which contains the contributions of all higher order cumulants, we get

F(x) = \prod_n \frac{\int dk_n \exp\left[ -ik_n(x_n - m_n) - \frac{1}{2} S_{nn} k_n^2 + r_n(k_n) \right]}{\int dk_n \exp\left[ -ik_n(x_n - m_n) - \frac{1}{2} S_{nn} k_n^2 \right]}    (14)
     = \int d\eta \prod_n \sqrt{\frac{S_{nn}}{2\pi}} \exp\left[ -\sum_n \frac{S_{nn} \eta_n^2}{2} \right] \exp\left[ \sum_n r_n\!\left( \eta_n - i \frac{x_n - m_n}{S_{nn}} \right) \right] ,    (15)

where in the last equality we have introduced a shift of variables η_n = k_n + i(x_n − m_n)/S_{nn}. An expansion can be performed with respect to the terms r_n, which contain the cumulants neglected in the EP approximation.
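To make the objects in (12)–(13) concrete, the sketch below computes c_{3n} and c_{4n} for a one-dimensional tilted density by grid quadrature, using the fact that for l = 3, 4 the cumulants equal μ₃ and μ₄ − 3μ₂² in terms of central moments. The probit-tilted Gaussian is an illustrative choice, anticipating sec. 4.2.

```python
import numpy as np
from scipy.stats import norm

# Illustrative tilted marginal q_n(x) ∝ Φ(x) N(x; 0, 1), normalised on a grid.
x = np.linspace(-12.0, 12.0, 40001)
dx = x[1] - x[0]
unnorm = norm.cdf(x) * norm.pdf(x)
qn = unnorm / (np.sum(unnorm) * dx)

mean = np.sum(x * qn) * dx
mu = lambda p: np.sum((x - mean)**p * qn) * dx   # central moments of q_n

c3 = mu(3)                      # third cumulant  (= third central moment)
c4 = mu(4) - 3.0 * mu(2)**2     # fourth cumulant (excess over Gaussian)
print("c3 =", c3, " c4 =", c4)
```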
The basic computations are most easily explained for the correction R to the partition function.

3.2.1 Correction to the partition function

Since q(x) is a multivariate Gaussian of the form q(x) = N(x; m, S), the correction R to the partition function Z involves a double Gaussian average over the vector x and the set of η_n. This can be simplified by combining them into a single complex zero mean Gaussian random vector defined as z_n = η_n − i(x_n − m_n)/S_{nn}, such that

R = \left\langle \exp\left[ \sum_n r_n(z_n) \right] \right\rangle_z .    (16)

The most remarkable property of the Gaussian z is its covariance, which is easily found to be

\langle z_i z_j \rangle_z = -\frac{S_{ij}}{S_{ii} S_{jj}} \ \text{when } i \neq j, \quad \text{and} \quad \langle z_i^2 \rangle_z = 0 .    (17)

The last equation has important consequences for the surviving terms in an expansion of R! Assuming that the r_n are small, we perform a power series expansion of ln R:

\ln R = \ln \left\langle \exp\left[ \sum_n r_n(z_n) \right] \right\rangle_z = \sum_n \langle r_n \rangle_z + \frac{1}{2} \left[ \left\langle \Big( \sum_n r_n \Big)^2 \right\rangle_z - \Big( \sum_n \langle r_n \rangle_z \Big)^2 \right] \pm \ldots    (18)
      = \frac{1}{2} \sum_{m \neq n} \langle r_m r_n \rangle_z \pm \ldots = \sum_{m \neq n} \sum_{l \ge 3} \frac{c_{ln} c_{lm}}{2 \, l!} \left( \frac{S_{nm}}{S_{nn} S_{mm}} \right)^l \pm \ldots    (19)

Here we have repeatedly used the fact that each factor z_n in expectations ⟨z_n^l z_m^s⟩ has to be paired (by Wick's theorem) with a factor z_m where m ≠ n (diagonal terms vanish by (17)). This gives nonzero contributions only when l = s, and there are l! ways of pairing.²

This expansion gives a hint as to why EP typically works well for multivariate models when the covariances S_{ij} are small compared to the variances S_{ii}. While we may expect that ln Z_{EP} = O(N), where N is the number of variables x_n, the vanishing of the "self interactions" indicates that corrections may not scale with N.

²The terms in the expansion might be organised in Feynman graphs, where "self interaction" loops are absent.
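Once the EP covariance S and the tilted cumulants c_{ln} are in hand, the second-order correction (19) truncated at l = 3, 4 is a few lines of linear algebra. A minimal sketch (function name is ours; the random S and cumulant arrays are placeholders for quantities a real EP fixed point would supply):

```python
import numpy as np
from math import factorial

def lnR_second_order(S, cumulants):
    """Second-order term of eq. (19): sum over m != n and l >= 3 of
    c_ln * c_lm / (2 * l!) * (S_nm / (S_nn * S_mm))^l, truncated at the
    cumulant orders supplied in `cumulants`, e.g. {3: c3, 4: c4}."""
    d = np.diag(S)
    C = S / np.outer(d, d)           # C[n, m] = S_nm / (S_nn * S_mm)
    np.fill_diagonal(C, 0.0)         # diagonal ("self interaction") terms vanish, eq. (17)
    return sum((np.outer(c, c) * C**l).sum() / (2.0 * factorial(l))
               for l, c in cumulants.items())

# Placeholder inputs standing in for a real EP fixed point:
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
S = A @ A.T + 5.0 * np.eye(5)        # a valid covariance matrix
c3 = 0.10 * rng.standard_normal(5)   # third cumulants of the q_n(x_n)
c4 = 0.05 * rng.standard_normal(5)   # fourth cumulants
print("ln R (2nd order, l = 3, 4):", lnR_second_order(S, {3: c3, 4: c4}))
```

Summing the zero-diagonal matrix over all entries counts both (n, m) and (m, n), which is exactly the unordered-pair sum with the factor 1/2 already included in the formula.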
3.2.2 Correction to marginal moments

The predictive density of a novel observation can be treated by extending the Gaussian prior to include a new latent variable x_* with E[x_* x] = k_* and E[x_*²] = k_*, and appears as an average of a likelihood term over the posterior marginal of x_*. A correction for the predictive density can also be derived in terms of the cumulant expansion by averaging the conditional distribution p(x_*|x) = N(x_*; k_*^⊤ K^{−1} x, σ_*²) with σ_*² = k_* − k_*^⊤ K^{−1} k_*. Using the expression (15) we obtain (where we set R = 1 in (6) to lowest order)

p(x_*) = \int dx \, p(x_*|x) \, p(x) = N(x_*; \mu_{x_*}, s^2_{x_*}) \left\langle 1 + \sum_n r_n\!\left( \eta_n - i \frac{x_n - m_n}{S_{nn}} \right) + \ldots \right\rangle_{\eta, \, x \sim N(x; \mu, \Sigma)} ,    (20)

where μ_{x_*} = k_*^⊤ K^{−1} m and the variance s²_{x_*} = k_* − k_*^⊤ (K + Λ^{−1})^{−1} k_*, with Λ = diag(λ) denoting the parameters in the Gaussian terms g_n. The average in (20) is over a Gaussian x with Σ^{−1} = (K − k_*^{−1} k_* k_*^⊤)^{−1} + Λ and μ = (x_* − μ_{x_*}) σ_*^{−2} Σ K^{−1} k_* + m. By simplifying the inner expectation over the complex Gaussian variables η we obtain

p(x_*) = N(x_*; \mu_{x_*}, s^2_{x_*}) \left[ 1 + \sum_n \sum_{l \ge 3} \frac{c_{ln}}{l!} \left( \frac{1}{\sqrt{S_{nn}}} \right)^l \left\langle h_l\!\left( \frac{x_n - m_n}{\sqrt{S_{nn}}} \right) \right\rangle_{x \sim N(x; \mu, \Sigma)} + \cdots \right] ,    (21)

where h_l is the l-th Hermite polynomial. The Hermite polynomials are averaged over a Gaussian density where the only occurrence of x_* is through (x_* − μ_{x_*}) in μ, so that the expansion ultimately appears as a polynomial in x_*. A correction to the predictive density follows from averaging t_*(x_*) over (21).

4 Applications

4.1 Mixture of Gaussians

This section illustrates an example where a large first nontrivial correction term in (8) reflects an inaccurate EP approximation. We explain this for a K-component Gaussian mixture model.

Consider N observed data points ζ_n with likelihood terms f_n(x) = \sum_\kappa \pi_\kappa N(\zeta_n; \mu_\kappa, \Gamma_\kappa^{-1}), with n ≥ 1 and with the mixing weights π_κ forming a probability vector. The latent variables are then x = {π_κ, μ_κ, Γ_κ}_{κ=1}^K. For our prior on x we use a Dirichlet distribution and a product of Normal–Wishart densities, so that f_0(x) = D(π) \prod_\kappa NW(\mu_\kappa, \Gamma_\kappa). When we multiply the f_n terms we see that intractability for the mixture model arises because the number of terms in the marginal likelihood is K^N, rather than because integration is intractable. The computation of lower-order terms in (8) should therefore be immediately feasible. The approximation q(x) and each g_n(x) are chosen to be of the same exponential family form as f_0(x), where we don't require g_n(x) to be normalizable.

For brevity we omit the details of the EP algorithm for this mixture model, and assume here that an EP fixed point has been found, possibly using some damping. Fig. 1 shows various approximations to the log marginal likelihood ln Z for ζ_n coming from the acidity data set. It is evident that the "true peak" doesn't match the peak obtained by approximate inference, and we will wrongly predict which K maximizes the log marginal likelihood. Without having to resort to Monte Carlo methods, the second order correction for K = 3 both corrects our prediction and already confirms that the original approximation might be inadequate.

[Figure 1: ln Z approximations, plotted against the number of components K, obtained from q(x)'s factorization in (2) for sec. 4.1's mixture model: variational Bayes (see [1] for details) as red squares; α = 1/2 in Minka's α-divergence message passing scheme, described in [6], as magenta triangles; EP as blue circles; EP with the 2nd order correction in (8) as green diamonds. For 20 runs each, the colour intensities correspond to the frequency of reaching different estimates. A Monte Carlo estimate of the true ln Z, as found by parallel tempering with thermodynamic integration, is shown as a line with two-standard deviation error bars.]

4.2 Gaussian Process Classification

The GP classification model arises when we observe N data points ζ_n with class labels y_n ∈ {−1, 1}, and model y through a latent function x with the GP prior mentioned in sec. 3.2.
The likelihood terms for y_n are assumed to be t_n(x_n) = Φ(y_n x_n), where Φ(·) denotes the cumulative Normal distribution function.

[Figure 2: A comparison of a perturbation expansion of (19) against Monte Carlo estimates of the true correction ln R, using the USPS data set from [4]. (a) ln R, second order, with l = 3, 4. (b) Monte Carlo ln R. Both panels show contours over the log lengthscale ln(ℓ) and the log magnitude ln(σ).]

Eq. (19) shows how to compute the cumulant expansion by dovetailing the EP fixed point with the characteristic function of q_n(x_n): from the EP fixed point we have q(x) = N(x; m, S) and g_n ∝ e^{γ_n x_n − (1/2) λ_n x_n²}; consequently the marginal density of x_n in q(x)/g_n(x_n) from (3) is N(x_n; μ, v²), where v^{−2} = 1/S_{nn} − λ_n and μ = v²(m_n/S_{nn} − γ_n). Using (3) again we have

q_n(x_n) = \frac{1}{Z_n} \Phi(y_n x_n) \, N(x_n; \mu, v^2) .    (22)

The characteristic function of q_n(x_n) is obtained by the inversion of (12),

\chi_n(k) = \langle e^{ikx_n} \rangle = e^{ik\mu - \frac{1}{2} k^2 v^2} \frac{\Phi(w_k)}{\Phi(w)} , \quad \text{with} \quad w = \frac{y_n \mu}{\sqrt{1 + v^2}} \quad \text{and} \quad w_k = \frac{y_n \mu + ikv^2}{\sqrt{1 + v^2}} ,    (23)

with expectations ⟨···⟩ being with respect to q_n(x_n). Raw moments are computed through derivatives of the characteristic function, i.e. ⟨x_n^j⟩ = i^{−j} χ_n^{(j)}(0). The cumulants c_{ln} are determined from the derivatives of ln χ_n(k) evaluated at zero (or equally from raw moments, e.g. c_{3n} = 2⟨x_n⟩³ − 3⟨x_n⟩⟨x_n²⟩ + ⟨x_n³⟩), such that

c_{3n} = \alpha^3 \beta \left[ 2\beta^2 + 3w\beta + w^2 - 1 \right]    (24)
c_{4n} = -\alpha^4 \beta \left[ 6\beta^3 + 12w\beta^2 + 7w^2\beta + w^3 - 4\beta - 3w \right] ,    (25)

where α = v²/√(1 + v²) and β = N(w; 0, 1)/Φ(w).

An extensive MCMC evaluation of EP for GP classification on various data sets was recently given by [4], showing that the log marginal likelihood of the data can be approximated remarkably well. An even more accurate estimation of the approximation error is given by considering the second order correction in (19) (computed here up to l = 4). For GPC we generally found that the l = 3 term dominates l = 4, and we do not include any higher cumulants here. Fig. 2 illustrates the ln R correction on the binary subproblem of the USPS 3's vs. 5's digits data set, with N = 767, as was used by [4]. We used the same kernel k(ζ, ζ′) = σ² exp(−‖ζ − ζ′‖²/2ℓ²) as [4], and evaluated (19) on a similar grid of ln ℓ and ln σ values.
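Before turning to the Monte Carlo comparison, here is a sketch of (24)–(25) with a quadrature cross-check of the cumulants against the tilted density (22) for y_n = +1; the function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm

def probit_cumulants(mu, v2, y=1):
    """c_{3n}, c_{4n} of q_n(x) ∝ Φ(y*x) N(x; mu, v2), via eqs. (24)-(25)."""
    w = y * mu / np.sqrt(1.0 + v2)
    alpha = v2 / np.sqrt(1.0 + v2)
    beta = norm.pdf(w) / norm.cdf(w)
    c3 = alpha**3 * beta * (2*beta**2 + 3*w*beta + w**2 - 1)
    c4 = -alpha**4 * beta * (6*beta**3 + 12*w*beta**2 + 7*w**2*beta + w**3 - 4*beta - 3*w)
    return c3, c4

# Quadrature cross-check for y = +1 (illustrative parameter values):
mu, v2 = 0.3, 0.8
x = np.linspace(-15.0, 15.0, 60001)
dx = x[1] - x[0]
qn = norm.cdf(x) * norm.pdf(x, loc=mu, scale=np.sqrt(v2))
qn /= qn.sum() * dx                              # normalise the tilted density
m1 = (x * qn).sum() * dx
mom = lambda p: (((x - m1)**p) * qn).sum() * dx  # central moments
print(probit_cumulants(mu, v2))                  # analytic  (c3, c4)
print(mom(3), mom(4) - 3.0 * mom(2)**2)          # numerical (c3, c4)
```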
For the same grid values we obtained Monte Carlo estimates of ln Z, and hence ln R. They are plotted in fig. 2(b) for the cases where they estimate ln Z to sufficient accuracy (up to four decimal places) to obtain a smoothly varying plot of ln R.³ The correction from (19), as computed here, is O(N²), and compares favourably to the O(N³) complexity of EP for GPC.

³The Monte Carlo estimates in [4] are accurate enough for showing EP's close approximation to ln Z, but not accurate enough to make any quantified statement about ln R.

[Figure 3: The initial coefficients (of x_*⁰, x_*¹, x_*², x_*³) of the polynomial in x_*, as they ultimately appear in the first nontrivial correction term in (21). Cumulants l = 3 and l = 4 were used. The coefficients are shown for test points ζ_* after observing data points ζ_n with labels y_n = +1 and y_n = −1. The ratio p_gpc(y_* = 1)/p_corr(y_* = 1) between the standard and (1st order) corrected GP classification predictive density is also illustrated.]

In fig. 3 we show the coefficients of the polynomial corrections (21) in powers of x_* to the predictive density p(x_*), using 3rd and 4th cumulants. The corrections are small because, whenever the terms y_n m_n are positive and large compared to the posterior variance, the non-Gaussian terms f_n(x) = t_n(x_n) ≈ 1 for almost all values of x_n which have significant probability under the Gaussian distribution that is proportional to q(x)/g_n(x_n). For these terms q_n(x) is therefore almost Gaussian and the higher cumulants are small. An example where this will no longer be the case is a GP model with t_n(x_n) = 1 for |x_n| < a and t_n(x_n) = 0 for |x_n| > a. This is a regression model y_n = x_n + ν_n, where the i.i.d. noise variables ν_n have a uniform distribution and the observed outputs are all zero, i.e. y_n = 0. For this case, the exact posterior variance does not shrink to zero even if the number of data points goes to infinity. The EP approximation, however, has the variance decrease to zero, and our corrections increase with sample size.

4.3 Ising models

Somewhat surprising (and probably less known) is the fact that EP and our corrections apply well to a fairly extreme limiting case of the GP model, where the terms are of the form t_n(x_n) = e^{θ_n x_n} (δ(x_n + 1) + δ(x_n − 1)), where δ(x) is the Dirac distribution. These terms, together with a "Gaussian" f_0(x) = exp[\sum_{i<j} J_{ij} x_i x_j] (where we do not assume that the matrix J is negative definite), make this GP model an Ising model with binary variables x_n = ±1. As shown in [8], this model can still be treated with the same type of Gaussian term approximations as ordinary GP models, allowing for surprisingly accurate estimation of the means and covariances. Here we will show the effect of our corrections for toy models, where exact inference is possible by enumeration. The tilted distributions q_n(x_n) are biased binary distributions with cumulants c_{3n} = −2m_n(1 − m_n²), c_{4n} = −2 + 8m_n² − 6m_n⁴, etc.
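The binary cumulants quoted above follow directly from the moments of a ±1 variable with mean m_n; a small sketch with a check by enumeration of the two-point distribution (our naming):

```python
import numpy as np

def binary_cumulants(m):
    """c3, c4 of a ±1 variable with mean m, as quoted in sec. 4.3."""
    return -2.0*m*(1.0 - m**2), -2.0 + 8.0*m**2 - 6.0*m**4

# Check by direct enumeration: P(x=+1) = (1+m)/2, P(x=-1) = (1-m)/2.
m = 0.4
p = np.array([(1.0 + m) / 2.0, (1.0 - m) / 2.0])
xs = np.array([1.0, -1.0])
mu = lambda k: np.sum(p * (xs - m)**k)    # central moments
print(binary_cumulants(m))                # analytic  (c3, c4)
print(mu(3), mu(4) - 3.0 * mu(2)**2)      # numerical (c3, c4)
```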
We will consider two different scenarios for random \u03b8 and J\nn), c4n = \u22122 + 8m2\nm2\n\nn \u2212 6m4\n\nl\n\ni\n\ns\na\nn\ng\nr\na\nm\n \ne\nd\no\nn\n \n2\n \nD\nA\nM\n\n100\n\n10\u22122\n\n10\u22124\n\n10\u22126\n\n10\u22121\n\n105\n\n100\n\n10\u22125\n\ny\ng\nr\ne\nn\ne\n\n \n\ne\ne\nr\nF\nD\nA\n\n \n\n100\n\u03b2\n\n101\n\n10\u221210\n\n10\u22121\n\n100\n\u03b2\n\n101\n\nFigure 4: The left plot shows the MAD of the estimated covariance matrix from the exact one for\ndifferent values of \u03b2 for EP (blue), EP 2nd order l = 4 corrections (blue with triangles), Bethe or\nloopy belief propagation (LBP; dashed green) and Kikuchi or generalized LBP (dash\u2013dotted red).\nThe Bethe and Kikuchi approximations both give covariance estimates for all variable pairs as the\nmodel is fully connected. The right plot shows the absolute deviation of ln Z from the true value\nusing second order perturbations with l = 3, 4, 5 (l = 3 is the smallest change). The remaining line\nstyles are the same as in the left plot.\n\n7\n\n\fdescribed in detail in [8]. In the \ufb01rst scenario, with N = 10, the Jij\u2019s are generated independently\nat random according to Jij = \u03b2wij and wij \u223c N (0, 1). For varying \u03b2, the maximum absolute\ndeviation (MAD) of the estimated covariance matrices from the exact one maxi,j |\u03a3est\nij \u2212 \u03a3exact\n|\nis shown in \ufb01g. 4 left. The absolute deviation on the log partition function is shown in \ufb01g. 4 right.\nIn the Wainwright-Jordan set-up N = 16 nodes are either fully connected or connected to nearest\nneighbors in a 4\u2013by\u20134 grid. The external \ufb01eld (observation) strengths \u03b8i are drawn from a uniform\ndistribution \u03b8i \u223c U[\u2212dobs, dobs] with dobs = 0.25. Three types of coupling strength statistics are\nconsidered: repulsive (anti-ferromagnetic) Jij \u223c U[\u22122dcoup, 0], mixed Jij \u223c U[\u2212dcoup, +dcoup]\nand attractive (ferromagnetic) Jij \u223c U[0, +2dcoup]. Table 1 gives the MAD of marginals averaged\nof 100 repetitions. The results for both set-ups give rise to the conclusion that when the EP approx-\nimation works well then the correction give an order of magnitude of improvement. 
Table 1: Average MAD of marginals in a Wainwright–Jordan set-up, comparing loopy belief propagation (LBP), log-determinant relaxation (LD), EP, EP with l = 5 correction (EP+), and EP with only one spanning tree approximating term (EP tree).

Graph  Coupling    d_coup   LBP     LD      EP      EP+         EP tree
Full   Repulsive   0.25     0.037   0.020   0.003   0.00058487  0.0017
Full   Repulsive   0.50     0.071   0.018   0.031   0.0157      0.0143
Full   Mixed       0.25     0.004   0.020   0.002   0.00042727  0.0013
Full   Mixed       0.50     0.055   0.021   0.022   0.0159      0.0151
Full   Attractive  0.06     0.024   0.027   0.004   0.0023      0.0025
Full   Attractive  0.12     0.435   0.033   0.117   0.1066      0.0211
Grid   Repulsive   1.0      0.294   0.047   0.153   0.1693      0.0031
Grid   Repulsive   2.0      0.342   0.041   0.198   0.4244      0.0021
Grid   Mixed       1.0      0.014   0.016   0.011   0.0122      0.0018
Grid   Mixed       2.0      0.095   0.038   0.082   0.0984      0.0068
Grid   Attractive  1.0      0.440   0.047   0.125   0.1759      0.0028
Grid   Attractive  2.0      0.520   0.042   0.177   0.4730      0.0002

5 Outlook

We expect that it will be possible to develop similar corrections to other approximate inference methods, such as the variational approach or the "power EP" approximations which interpolate between the variational method and EP. This may help the user to decide which approximation is more accurate for a given problem. We will also attempt an analysis of the scaling of higher order terms in these expansions, to see if they are asymptotic or have a finite radius of convergence.

References

[1] H. Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems 12, 2000.
[2] M. Chertkov and V. Y. Chernyak. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics: Theory and Experiment, page P06009, 2006.
[3] S. Ikeda, T. Tanaka, and S. Amari. Information geometry of turbo and low-density parity-check codes. IEEE Transactions on Information Theory, 50(6):1097, 2004.
[4] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[5] T. P. Minka. Expectation propagation for approximate Bayesian inference. In UAI 2001, pages 362–369, 2001.
[6] T. P. Minka. Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research, Cambridge, UK, 2005.
[7] T. P. Minka. The EP energy function and minimization schemes. Technical report, 2001.
[8] M. Opper and O. Winther. Expectation consistent approximate inference. Journal of Machine Learning Research, 6:2177–2204, 2005.
[9] E. Sudderth, M. Wainwright, and A. Willsky. Loop series and Bethe variational bounds in attractive graphical models. In Advances in Neural Information Processing Systems 20, pages 1425–1432, 2008.

"award": [], "sourceid": 997, "authors": [{"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Ulrich", "family_name": "Paquet", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}]}