{"title": "Tree-structured Approximations by Expectation Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 193, "page_last": 200, "abstract": "", "full_text": "Tree-structured approximations by expectation\n\npropagation\n\nThomas Minka\n\nDepartment of Statistics\n\nCarnegie Mellon University\nPittsburgh, PA 15213 USA\nminka@stat.cmu.edu\n\nYuan Qi\n\nMedia Laboratory\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139 USA\nyuanqi@media.mit.edu\n\nAbstract\n\nApproximation structure plays an important role in inference on loopy\ngraphs. As a tractable structure, tree approximations have been utilized\nin the variational method of Ghahramani & Jordan (1997) and the se-\nquential projection method of Frey et al. (2000). However, belief propa-\ngation represents each factor of the graph with a product of single-node\nmessages. In this paper, belief propagation is extended to represent fac-\ntors with tree approximations, by way of the expectation propagation\nframework. That is, each factor sends a \u201cmessage\u201d to all pairs of nodes\nin a tree structure. The result is more accurate inferences and more fre-\nquent convergence than ordinary belief propagation, at a lower cost than\nvariational trees or double-loop algorithms.\n\n1\n\nIntroduction\n\nAn important problem in approximate inference is improving the performance of belief\npropagation on loopy graphs. Empirical studies have shown that belief propagation (BP)\ntends not to converge on graphs with strong positive and negative correlations (Welling\n& Teh, 2001). 
One approach is to force the convergence of BP by appealing to a free-energy interpretation (Welling & Teh, 2001; Teh & Welling, 2001; Yuille, in press 2002). Unfortunately, this doesn't really solve the problem, because it dramatically increases the computational cost and doesn't necessarily lead to good results on these graphs (Welling & Teh, 2001).

The expectation propagation (EP) framework (Minka, 2001a) gives another interpretation of BP, as an algorithm which approximates multi-variable factors by single-variable factors (f(x1, x2) -> ~f1(x1) ~f2(x2)). This explanation suggests that it is BP's target approximation which is to blame, not the particular iterative scheme it uses. Factors which encode strong correlations cannot be well approximated in this way. The connection between failure to converge and poor approximation holds true for EP algorithms in general, as shown by Minka (2001a) and Heskes & Zoeter (2002).

Yedidia et al. (2000) describe an extension of BP involving the Kikuchi free energy. The resulting algorithm resembles BP on a graph of node clusters, where again multi-variable factors are decomposed into independent parts (f(x1, x2, x3) -> ~f1(x1) ~f23(x2, x3)).

In this paper, the target approximation of BP is enriched by exploiting the connection to expectation propagation. Instead of approximating each factor by disconnected nodes or clusters, it is approximated by a tree distribution. The algorithm is a strict generalization of belief propagation: if the tree has no edges, the results are identical to (loopy) belief propagation.

This approach is inspired by previous work employing trees. For example, Ghahramani & Jordan (1997) showed that tree-structured approximations could improve the accuracy of variational bounds. Such bounds are tuned to minimize the 'exclusive' KL-divergence KL(q||p), where q is the approximation. Frey et al.
(2000) criticized this error measure and described an alternative method for minimizing the 'inclusive' divergence KL(p||q). Their method, which sequentially projects graph potentials onto a tree structure, is closely related to expectation propagation and to the method in this paper. However, their method is not iterative, and is therefore sensitive to the order in which the potentials are sequenced.

There are also two tangentially related papers by Wainwright et al. (2001, 2002). In the first paper, a "message-free" version of BP was derived, which used multiple tree structures to propagate evidence. The results it gives are nevertheless the same as BP. In the second paper, tree structures were used to obtain an upper bound on the normalizing constant of a Markov network. The trees produced by that method do not necessarily approximate the original distribution well.

The following section describes the EP algorithm for updating the potentials of a tree approximation with known structure. Section 3 then describes the method we use to choose the tree structure. Section 4 gives numerical results on various graphs, comparing the new algorithm to BP, Kikuchi, and variational methods.

2 Updating the tree potentials

This section describes an expectation-propagation algorithm to approximate a given distribution (of arbitrary structure) by a tree with known structure. It elaborates section 4.2.2 of Minka (2001b), with special attention to efficiency. Denote the original distribution by p(x), written as a product of factors:

    p(x) = prod_i fi(x)    (1)

For example, if p(x) is a Bayesian network or Markov network, the factors are conditional probability distributions or potentials which each depend on a small subset of the variables in x.
In this paper, the variables are assumed to be discrete, so that the factors fi(x) are simply multidimensional tables.

2.1 Junction tree representation

The target approximation q(x) will have pairwise factors along a tree T:

    q(x) = [ prod_{(j,k) in T} q(xj, xk) ] / [ prod_{s in S} q(xs) ]    (2)

In this notation, q(xs) is the marginal distribution for variable xs, and q(xj, xk) is the marginal distribution for the two variables xj and xk. These are going to be stored as multidimensional tables. The division is necessary to cancel over-counting in the numerator. A useful way to organize these divisions is to construct a junction tree connecting the cliques (j, k) in T (Jensen et al., 1990). This tree has a different structure than T: the nodes in the junction tree represent cliques in T, and the edges in the junction tree represent variables which are shared between cliques. These separator variables S in the junction tree are exactly the variables that go in the denominator of (2). Note that the same variable could be a separator more than once, so technically S is a multiset.

Figure 1 shows an example of how this all works. We want to approximate the distribution p(x), which has a complete graph, by q(x), whose graph is a spanning tree. The marginal representation of q can be read directly off the junction tree:

    q(x) = q(x1, x4) q(x2, x4) q(x3, x4) / ( q(x4) q(x4) )    (3)

[Figure 1: Approximating a complete graph p by a tree q. The junction tree of q is used to organize computations.]

2.2 EP updates

The algorithm iteratively tunes q(x) so that it matches p(x) as closely as possible, in the sense of 'inclusive' KL-divergence.
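As a concrete illustration of the representation in (2)-(3), the sketch below evaluates a star-shaped tree distribution (a 0-indexed version of figure 1, with made-up tables) and checks that it normalizes. The names `tree_prob`, `pair_marg`, and `sep_marg` are ours, not from the paper.

```python
import itertools

import numpy as np

def tree_prob(x, pair_marg, sep_marg):
    """Evaluate q(x) = prod_{(j,k) in T} q(xj,xk) / prod_{s in S} q(xs)  (eq. 2)."""
    num = 1.0
    for (j, k), table in pair_marg.items():
        num *= table[x[j], x[k]]
    den = 1.0
    for s, table in sep_marg:   # S is a multiset, so it is kept as a list
        den *= table[x[s]]
    return num / den

# Star tree with edges (0,3),(1,3),(2,3): x3 appears as a separator twice.
q3 = np.array([0.5, 0.5])                      # marginal q(x3)
cond = [np.array([[0.9, 0.1], [0.2, 0.8]]),    # cond[j][v3, vj] = q(xj=vj | x3=v3)
        np.array([[0.7, 0.3], [0.4, 0.6]]),
        np.array([[0.6, 0.4], [0.1, 0.9]])]
# pairwise marginal q(xj, x3), indexed [vj, v3]
pair_marg = {(j, 3): (cond[j] * q3[:, None]).T for j in range(3)}
sep_marg = [(3, q3), (3, q3)]

# A valid tree distribution sums to 1 over all joint assignments.
total = sum(tree_prob(x, pair_marg, sep_marg)
            for x in itertools.product([0, 1], repeat=4))
```

The division by the two copies of q(x4) in (3) is exactly what `sep_marg` being a list (multiset) captures.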
Specifically, q tries to preserve the marginals and pairwise marginals of p:

    q(xj) ~ p(xj)    (4)
    q(xj, xk) ~ p(xj, xk)    for (j, k) in T    (5)

Expectation propagation is a general framework for approximating distributions of the form (1) by approximating the factors one by one. The final approximation q is then the product of the approximate factors. The functional form of the approximate factors is determined by considering the ratio of two different q's. In our case, this leads to approximations of the form

    fi(x) ~ ~fi(x) = [ prod_{(j,k) in T} ~fi(xj, xk) ] / [ prod_{s in S} ~fi(xs) ]    (6)

A product of such factors gives a distribution of the desired form (2). Note that ~fi(xj, xk) is not a proper marginal distribution, but just a non-negative function of two variables.

The algorithm starts by initializing the clique and separator potentials on the junction tree to 1. If a factor in p only depends on one variable, or on variables which are adjacent in T, then its approximation is trivial: it can be multiplied into the corresponding clique potential right away and removed from further consideration. The remaining factors in p, the off-tree factors, have their approximations ~fi initialized to 1.

To illustrate, consider the graph of figure 1. Suppose all the potentials in p are pairwise, one for each edge. The edges {(1,4), (2,4), (3,4)} are absorbed directly into q. The off-tree edges are {(1,2), (1,3), (2,3)}.

The algorithm then iteratively passes through the off-tree factors in p, performing the following three steps until all ~fi converge:

(a) Deletion. Remove ~fi from q to get an 'old' approximation q\i:

    q\i(xj, xk) = q(xj, xk) / ~fi(xj, xk)    for (j, k) in T    (7)
    q\i(xs) = q(xs) / ~fi(xs)    for s in S    (8)

(b) Incorporate evidence. Form the product fi(x) q\i(x), by considering fi(x) as 'evidence' for the junction tree.
Propagate the evidence to obtain new clique marginals q(xj, xk) and separators q(xs) (details below).

(c) Update. Re-estimate ~fi by division:

    ~fi(xj, xk) = q(xj, xk) / q\i(xj, xk)    for (j, k) in T    (9)
    ~fi(xs) = q(xs) / q\i(xs)    for s in S    (10)

2.3 Incorporating evidence by cutset conditioning

The purpose of the "incorporate evidence" step is to find a distribution q minimizing KL(fi(x) q\i || q). This is equivalent to matching the marginal distributions corresponding to each clique in q. By definition, fi depends on a set of variables which are not adjacent in T, so the graph structure corresponding to fi(x) q\i(x) is not a tree, but has one or more loops. One approach is to apply a generic exact inference algorithm to fi(x) q\i(x) to obtain the desired marginals, e.g. construct a new junction tree in which fi(x) is a clique and propagate evidence in this tree. But this does not exploit the fact that we already have a junction tree for q\i on which we can perform efficient inference.

Instead we use a more efficient approach, Pearl's cutset conditioning algorithm, to incorporate the evidence. Suppose fi(x) depends on a set of variables V. The domain of fi(x) is the set of all possible assignments to V. Find the clique (j, k) in T which has the largest overlap with this domain; call this the root clique. Then enumerate the rest of the domain, V \ (xj, xk). For each possible assignment to these variables, enter it as evidence in q's junction tree and propagate to get marginals and an overall scale factor (which is the probability of that assignment). When the variables V \ (xj, xk) are fixed, entering evidence simply reduces to zeroing out conflicting entries in the junction tree, and multiplying the root clique (j, k) by fi(x).
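The deletion and update steps (7)-(10) are just elementwise divisions on the junction-tree tables. A minimal sketch for a single clique follows; the incorporate-evidence step is replaced by a hypothetical stand-in, and all names (`q`, `ft`, `q_ni`) are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

# One clique table q(xj,xk) and the current approximate factor ~fi(xj,xk).
q = rng.random((2, 2)); q /= q.sum()
ft = rng.random((2, 2))              # ~fi: an arbitrary non-negative function

# (a) Deletion: q\i = q / ~fi                            (eq. 7)
q_ni = q / ft

# (b) Incorporate evidence: a stand-in for exact propagation of fi * q\i.
# In the real algorithm this is cutset conditioning on the junction tree.
fi = rng.random((2, 2))
q_new = fi * q_ni
q_new /= q_new.sum()

# (c) Update: ~fi = q_new / q\i                          (eq. 9)
ft_new = q_new / q_ni

# Consistency check: multiplying the updated factor back in recovers q_new.
assert np.allclose(ft_new * q_ni, q_new)
```

The same divisions run over every clique (j,k) in T and every separator s in S in the full algorithm.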
After propagating evidence once for each assignment, average the results together, weighted by their scale factors, to get the final marginals and separators of q.

Continuing the example of figure 1, suppose we want to process edge (1,2), whose factor is f1(x1, x2). When added to q, this creates a loop. We cut the loop by conditioning on the variable with smallest arity. Suppose x1 is binary, so we condition on it. The other clique, (2,4), becomes the root. In one case, the evidence is (x1 = 0, f1(0, x2)) and in the other it is (x1 = 1, f1(1, x2)). Propagating evidence for both cases and averaging the results gives the new junction tree potentials.

Because this is an expectation-propagation algorithm, we know that a fixed point always exists, but we may not always find one. In these cases, the algorithm could be stabilized by a stepsize or a double-loop iteration. But overall the method is very stable, and in this paper no convergence control is used.

2.4 Within-loop propagation

A further optimization is also used, by noting that evidence does not need to be propagated to the whole junction tree. In particular, it only needs to be propagated within the subtree that connects the nodes in V. Evidence propagated to the rest of the tree will be exactly canceled by the separators, so even though the potentials may change, the ratios in (2) will not. For example, when we process edge (1,2) in figure 1, there is no need to propagate evidence to clique (3,4), because when q(x3, x4) is divided by the separator q(x4), we have q(x3|x4), which is the same before and after the evidence.

Thus evidence is propagated as follows: first collect evidence from V to the root, then distribute evidence from the root back to V, bypassing the rest of the tree (these operations are defined formally by Jensen et al. (1990)).
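The conditioning-and-averaging loop of section 2.3 can be sketched as below. Here `propagate` is a hypothetical stand-in for junction-tree propagation under one cutset assignment; it returns the unnormalized scale factor (the probability of that assignment) and the resulting clique marginals.

```python
import numpy as np

def cutset_marginals(propagate, n_values):
    """Average per-assignment marginals, weighted by their scale factors."""
    scales, margs = [], []
    for v in range(n_values):
        scale, marg = propagate(v)   # one run of evidence propagation for c=v
        scales.append(scale)
        margs.append(marg)
    w = np.array(scales) / np.sum(scales)
    return sum(wi * mi for wi, mi in zip(w, margs))

# Illustrative use: two assignments of a binary cut variable, with made-up
# conditional marginals and scale factors.
margs = [np.array([0.8, 0.2]), np.array([0.4, 0.6])]
scales = [1.0, 3.0]
result = cutset_marginals(lambda v: (scales[v], margs[v]), 2)
```

In the figure-1 example this corresponds to the two runs with x1 = 0 and x1 = 1, averaged to give the new junction tree potentials.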
In the example, this means we collect evidence from clique (1,4) to the root (2,4), then distribute back from (2,4) to (1,4), ignoring (3,4). This simplification also means that we don't need to store ~fi for the cliques that are never updated by factor i. When moving to the next factor, once we've designated the root for that factor, we collect evidence from the previous root. In this way, the results are the same as if we always propagated evidence to the whole junction tree.

3 Choosing the tree structure

This section describes a simple method to choose the tree structure. It leaves open the problem of finding the 'optimal' approximation structure; instead, it presents a simple rule which works reasonably well in practice.

Intuitively, we want edges between the variables which are the most correlated. The approach is based on Chow & Liu (1968): estimate the mutual information between adjacent nodes in p's graph, call this the 'weight' of the edge between them, and then find the spanning tree with maximal total weight. The mutual information between two nodes requires an estimate of their joint distribution. In our implementation, this is obtained from the product of factors involving only these two nodes, i.e. the single-node potentials times the edge between them. While crude, this estimate does capture the amount of correlation provided by the edge, and thus whether we should have it in the approximation.

4 Numerical results

4.1 The four-node network

This section illustrates the algorithm on a concrete problem, comparing it to other methods for approximate inference. The network and approximation will be the ones pictured in figure 1, with all nodes binary. The potentials were chosen randomly and can be obtained from the authors' website.

Five approximate inference methods were compared. The proposed method (TreeEP) used the tree structure specified in figure 1.
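The structure-selection rule of section 3 amounts to a maximum-weight spanning tree over mutual-information edge weights. A sketch under that reading, with our own helper names (Kruskal-style union-find for the tree; the joint-distribution estimate itself would come from the potentials, as described above):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information of a 2-D joint distribution table."""
    joint = joint / joint.sum()
    pj = joint.sum(axis=1, keepdims=True)
    pk = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pj * pk)[nz])).sum())

def max_spanning_tree(n, weights):
    """Kruskal-style maximum spanning tree over edge weights {(j,k): w}."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    tree = []
    for (j, k), _ in sorted(weights.items(), key=lambda e: -e[1]):
        rj, rk = find(j), find(k)
        if rj != rk:
            parent[rj] = rk
            tree.append((j, k))
    return tree

# Toy example: the two strongest edges form the tree.
tree = max_spanning_tree(3, {(0, 1): 0.5, (1, 2): 0.9, (0, 2): 0.1})
mi_indep = mutual_information(np.outer([0.5, 0.5], [0.3, 0.7]))
mi_dep = mutual_information(np.array([[0.4, 0.1], [0.1, 0.4]]))
```

Independent variables get weight (near) zero, so their edge is dropped first, which matches the intuition that the tree should keep the most correlated pairs.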
Mean-field (MF) fit a variational bound with independent variables, and TreeVB fit a tree-structured variational bound with the same structure as TreeEP. TreeVB was implemented using the general method described by Wiegerinck (2000), with the same junction tree optimizations as in TreeEP.

Generalized belief propagation (GBP) was implemented using the parent-child algorithm of Yedidia et al. (2002) (with special attention to the damping described in section 8). We also used GBP to perform ordinary loopy belief propagation (BP). Our implementation tries to be efficient in terms of FLOPS, but we do not know if it is the fastest possible. GBP and BP were first run using stepsize 0.5; if a run didn't converge, the stepsize was halved and the run started over. The time for these 'trial runs' was not counted.

Method   FLOPS   E[x1]   E[x2]   E[x3]   E[x4]   Error
Exact      200   0.536   0.474   0.468   0.482   0
TreeEP     800   0.535   0.467   0.459   0.477   0.008
GBP       2200   0.535   0.467   0.459   0.477   0.008
TreeVB   11700   0.540   0.460   0.460   0.476   0.014
BP         500   0.501   0.499   0.499   0.500   0.035
MF       11500   0.946   0.000   0.000   0.094   0.474

Table 1: Node means estimated by various methods (TreeEP = the proposed method, BP = loopy belief propagation, GBP = generalized belief propagation on triangles, MF = mean-field, TreeVB = variational tree). FLOPS are rounded to the nearest hundred.

The algorithms were all implemented in Matlab using Kevin Murphy's BNT toolbox (Murphy, 2001). Computational cost was measured by the number of floating-point operations (FLOPS). Because the algorithms are iterative and can be stopped at any time to get a result, we used a "5% rule" to determine FLOPS. The algorithm was run for a large number of iterations, and the error at each iteration was computed. At each iteration, we then get an error bound, which is the maximum error from that iteration onwards.
The first iteration whose error bound is within 5% of the final error is chosen for the official FLOP count. (The official error is still the final error.)

The results are shown in table 1. TreeEP is more accurate than BP, with less cost than TreeVB and GBP. GBP was run with clusters {(1,2,4), (1,3,4), (2,3,4)}. This gives the same result as TreeEP, because these clusters are exactly the off-tree loops.

4.2 Complete graphs

The next experiment tests the algorithms on complete graphs of varying size. The graphs have random single-node and pairwise potentials, of the form

    fi(xj) = [ exp(θj)  exp(-θj) ]
    fi(xj, xk) = [ exp(wjk)   exp(-wjk)
                   exp(-wjk)  exp(wjk) ]

The "external fields" θj were drawn independently from a Gaussian with mean 0 and standard deviation 1. The "couplings" wjk were drawn independently from a Gaussian with mean 0 and standard deviation 3/sqrt(n-1), where n is the number of nodes. Each node has n-1 neighbors, so this tries to keep the overall coupling level constant.

Figure 2(a) shows the approximation error as n increases. For each n, 10 different potentials were drawn, giving 110 networks in all. For each network, the maximum absolute difference between the estimated means and the exact means was computed. These errors are averaged over potentials and shown separately for each graph size. TreeEP and TreeVB always used the same structure, picked according to section 3. TreeEP outperforms BP consistently, but TreeVB does not.

For this type of graph, we found that GBP works well with clusters in a 'star' pattern, i.e. the clusters are {(1,2,3), (1,3,4), (1,4,5), ..., (1,n,2)}. Node 1 is the center of the star, and was chosen to be the node with highest average coupling to its neighbors.
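The "5% rule" used for FLOP counting in section 4.1 can be sketched as follows, under one reading of the rule (assuming "within 5% of the final error" means the suffix-maximum error bound is at most 1.05 times the final error); the function name is ours.

```python
def flop_iteration(errors, tol=0.05):
    """1-based index of the first iteration whose error bound (the maximum
    error from that iteration onwards) is within `tol` of the final error."""
    bounds = list(errors)
    for i in range(len(bounds) - 2, -1, -1):   # suffix maxima
        bounds[i] = max(bounds[i], bounds[i + 1])
    final = errors[-1]
    for i, b in enumerate(bounds):
        if b <= final * (1 + tol):
            return i + 1
    return len(errors)

# Errors per iteration: bound at iteration 3 is 0.105 <= 1.05 * 0.1.
it = flop_iteration([0.5, 0.2, 0.105, 0.1, 0.1])
```

The FLOPs accumulated up to that iteration are then reported, while the official error is still the final one.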
As shown in figure 2(a), this works much better than using all triples of nodes, as done by Kappen & Wiegerinck (2001). Note that if TreeEP is given a similar 'star' structure, the results are the same as GBP. This is because the GBP clusters coincide with the off-tree loops. In general, if the off-tree loops are triangles, then GBP on those triangles will give identical results.

Figure 2(b) shows the cost as n increases. TreeEP and TreeVB scale the best, with TreeEP being the fastest method on large graphs.

[Figure 2: (a) Error in the estimated means for complete graphs with randomly chosen potentials. Each point is an average over 10 potentials. (b) Average FLOPS for the results in (a).]

[Figure 3: (a) Error in the estimated means for grid graphs with randomly chosen potentials. Each point is an average over 10 potentials. (b) Average FLOPS for the results in (a).]

4.3 Grids

The next experiment tests the algorithms on square grids of varying size. The external fields θj were drawn as before, and the couplings wjk had standard deviation 1. The GBP clusters were overlapping squares, as in Yedidia et al. (2000).

Figure 3(a) shows the approximation error as n increases, with results averaged over 10 trials as in the previous section.
TreeVB performs consistently worse than BP, even though it is using the same tree structures as TreeEP. The plot also shows that these structures, being automatically chosen, are not as good as the hand-crafted clusters used by GBP. We have hand-crafted tree structures that perform just as well on grids, but for simplicity we do not include these results.

Figure 3(b) shows that TreeEP is the fastest on large grids, even faster than BP, because BP must use increasingly smaller stepsizes. GBP is more than a factor of ten slower.

5 Conclusions

Tree approximation allows a smooth tradeoff between cost and accuracy in approximate inference. It improves on BP for a modest increase in cost. In particular, when ordinary BP doesn't converge, TreeEP is an attractive alternative to damping or double-loop iteration. TreeEP performs better than the corresponding variational bounds, because it minimizes the inclusive KL-divergence. We found that TreeEP was equivalent to GBP in some cases, which deserves further study.

We hope that these results encourage more investigation into approximation structure for inference algorithms, such as finding the 'optimal' structure for a given problem. There are many other opportunities for special approximation structure to be exploited, especially in hybrid networks, where not only the independence assumptions matter but also the distributional forms.

Acknowledgments

We thank an anonymous reviewer for advice on comparisons to GBP.

References

Chow, C. K., & Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 462-467.

Frey, B. J., Patrascu, R., Jaakkola, T., & Moran, J. (2000). Sequentially fitting inclusive trees for inference in noisy-OR networks. NIPS 13.

Ghahramani, Z., & Jordan, M. I. (1997). Factorial hidden Markov models.
Machine Learning, 29, 245-273.

Heskes, T., & Zoeter, O. (2002). Expectation propagation for approximate inference in dynamic Bayesian networks. Proc UAI.

Jensen, F. V., Lauritzen, S. L., & Olesen, K. G. (1990). Bayesian updating in causal probabilistic networks by local computations. Computational Statistics Quarterly, 5, 269-282.

Kappen, H. J., & Wiegerinck, W. (2001). Novel iteration schemes for the cluster variation method. NIPS 14.

Minka, T. P. (2001a). Expectation propagation for approximate Bayesian inference. UAI (pp. 362-369).

Minka, T. P. (2001b). A family of algorithms for approximate Bayesian inference. Doctoral dissertation, Massachusetts Institute of Technology.

Murphy, K. (2001). The Bayes Net Toolbox for Matlab. Computing Science and Statistics, 33.

Teh, Y. W., & Welling, M. (2001). The unified propagation and scaling algorithm. NIPS 14.

Wainwright, M. J., Jaakkola, T., & Willsky, A. S. (2001). Tree-based reparameterization for approximate estimation on loopy graphs. NIPS 14.

Wainwright, M. J., Jaakkola, T. S., & Willsky, A. S. (2002). A new class of upper bounds on the log partition function. Proc UAI.

Welling, M., & Teh, Y. W. (2001). Belief optimization for binary networks: A stable alternative to loopy belief propagation. UAI.

Wiegerinck, W. (2000). Variational approximations between mean field theory and the junction tree algorithm. Proc UAI.

Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2000). Generalized belief propagation. NIPS 13.

Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2002). Constructing free energy approximations and generalized belief propagation algorithms (Technical Report). MERL Research Lab.

Yuille, A. (In press, 2002).
A double-loop algorithm to minimize the Bethe and Kikuchi free energies. Neural Computation.