{"title": "Density Propagation and Improved Bounds on the Partition Function", "book": "Advances in Neural Information Processing Systems", "page_first": 2762, "page_last": 2770, "abstract": "Given a probabilistic graphical model, its density of states is a function that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this function. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decompostion, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds.", "full_text": "Density Propagation and\n\nImproved Bounds on the Partition Function\u2217\n\nStefano Ermon, Carla P. Gomes\n\nDept. of Computer Science\n\nCornell University\n\nIthaca NY 14853, U.S.A.\n\nAshish Sabharwal\n\nBart Selman\n\nIBM Watson Research Ctr.\n\nDept. of Computer Science\n\nYorktown Heights\nNY 10598, U.S.A.\n\nCornell University\n\nIthaca NY 14853, U.S.A.\n\nAbstract\n\nGiven a probabilistic graphical model, its density of states is a distribution that,\nfor any likelihood value, gives the number of con\ufb01gurations with that probabil-\nity. We introduce a novel message-passing algorithm called Density Propagation\n(DP) for estimating this distribution. We show that DP is exact for tree-structured\ngraphical models and is, in general, a strict generalization of both sum-product and\nmax-product algorithms. 
Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decomposition, the new upper bound based on finer-grained density-of-states information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds.

1 Introduction

Associated with any undirected graphical model [1] is the so-called density of states, a term borrowed from statistical physics indicating a distribution that, for any likelihood value, gives the number of configurations with that probability. The density of states plays an important role in statistical physics because it provides a fine-grained description of the system, and can be used to efficiently compute many properties of interest, such as the partition function and its parameterized version [2, 3]. Computing the density of states is computationally intractable in the worst case, since it subsumes a #P-complete problem (computing the partition function) and an NP-hard one (MAP inference). All current approximate techniques for estimating the density of states are based on sampling, the most prominent being the Wang-Landau algorithm [3] and its improved variants [2]. These methods have been shown to be very effective in practice. However, they do not provide any guarantee on the quality of the results. Furthermore, they ignore the structure of the underlying graphical model, effectively treating the energy function (which is proportional to the negative log-likelihood of a configuration) as a black box.

As a first step towards exploiting the structure of the graphical model when computing the density of states, we propose an algorithm called DENSITYPROPAGATION (DP).
The algorithm is based on dynamic programming and can be conveniently expressed in terms of message passing on the graphical model. We show that DENSITYPROPAGATION computes the density of states exactly for any tree-structured graphical model. It is closely related to the popular Sum-Product (Belief Propagation, BP) and Max-Product (MP) algorithms, and can be seen as a generalization of both. However, it computes something much richer, namely the density of states, which contains information such as the partition function and variable marginals. Although we do not work at the level of individual configurations, DENSITYPROPAGATION allows us to reason in terms of groups of configurations with the same probability (energy).

*Supported by NSF Expeditions in Computing award for Computational Sustainability (grant 0832782).

Being able to solve inference tasks for certain tractable classes of problems (e.g., trees) is important because one can often decompose a complex problem into tractable subproblems (such as spanning trees) [4], and the solutions to these simpler problems can be combined to recover useful properties of the original graphical model [5, 6]. In this paper we show that by combining the additional information given by the density of states, we can obtain a new family of upper and lower bounds on the partition function. We prove that the new upper bound is always at least as tight as the one based on the convexity of the log-partition function [4], and we provide a general condition under which the new bound is strictly tighter.
Further, we illustrate empirically that the new upper bound improves upon the convexity-based one on Ising grid and clique models, and that the new lower bound is empirically slightly stronger than the one given by mean-field theory [4, 7].

2 Problem definition and setup

We consider a graphical model specified as a factor graph with N = |V| discrete random variables x_i, i ∈ V, where x_i ∈ X_i. The global random vector x = {x_s, s ∈ V} takes values in the Cartesian product X = X_1 × X_2 × ··· × X_N, with cardinality D = |X| = ∏_{i=1}^N |X_i|. We consider a probability distribution over elements x ∈ X (called configurations)

\[ p(x) = \frac{1}{Z} \prod_{\alpha \in I} \psi_\alpha(\{x\}_\alpha) \qquad (1) \]

that factors into factors ψ_α : {x}_α → R^+, where I is an index set, {x}_α ⊆ V is the subset of variables the factor ψ_α depends on, and Z is a normalization constant known as the partition function. The corresponding factor graph is a bipartite graph with vertex set V ∪ I. In the factor graph, each variable node i ∈ V is connected with all the factors α ∈ I that depend on i. Similarly, each factor node α ∈ I is connected with all the variable nodes i ∈ {x}_α. We denote the neighbors of i and α by N(i) and N(α), respectively.

We will also make use of the related exponential representation [8]. Let φ be a collection of potential functions {φ_α, α ∈ I}, defined over the index set I. Given an exponential parameter vector Θ = {Θ_α, α ∈ I}, the exponential family defined by φ is the family of probability distributions over X defined as follows:

\[ p(x, \Theta) = \frac{1}{Z(\Theta)} \exp(\Theta \cdot \phi(x)) = \frac{1}{Z(\Theta)} \exp\left( \sum_{\alpha \in I} \Theta_\alpha \phi_\alpha(\{x\}_\alpha) \right) \qquad (2) \]

where we assume p(x) = p(x, Θ*).
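As a quick sanity check of the definitions in (1) and (2), the partition function of a small model can be evaluated by brute-force enumeration. The sketch below is illustrative only (the 4-cycle model and all names are our own, chosen to match the 2 × 2 Ising example used later in Section 6): it computes Θ · φ(x) and Z(Θ) directly.

```python
import itertools
import math

# Illustrative toy instance: a 4-cycle Ising-style model with Theta_s = 0 and
# one sufficient statistic per edge, phi_ij(x) = 1{x_i = x_j} (cf. Eq. (2)).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]
THETA = {e: 1.0 for e in EDGES}            # exponential parameters Theta_alpha

def energy(x, theta):
    """Theta . phi(x): weighted count of agreeing edges."""
    return sum(w for (i, j), w in theta.items() if x[i] == x[j])

def partition_function(theta, n_vars=4):
    """Z(Theta) by enumerating all |X| = 2^n_vars configurations."""
    return sum(math.exp(energy(x, theta))
               for x in itertools.product([0, 1], repeat=n_vars))

print(round(partition_function(THETA), 2))   # 2 + 12 e^2 + 2 e^4 ~ 199.86
```

This brute-force route is of course exponential in N; the point of the paper is to avoid it for tractable structures.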
Given an exponential family, we define the density of states [2] as the following distribution:

\[ n(E, \Theta) = \sum_{x \in \mathcal{X}} \delta\left(E - \Theta \cdot \phi(x)\right) \qquad (3) \]

where δ(E − Θ · φ(x)) indicates a Dirac delta centered at Θ · φ(x). For any exponential parameter Θ, it holds that

\[ \int_{-\infty}^{A} n(E, \Theta) \, dE = \left|\{x \in \mathcal{X} \mid \Theta \cdot \phi(x) \le A\}\right| \]

and ∫_R n(E, Θ) dE = |X|. We will refer to the quantity Σ_{α∈I} Θ*_α φ_α({x}_α) = Σ_{α∈I} log ψ_α({x}_α) as the energy of a configuration x, although it has an additional minus sign with respect to the conventional energy in statistical physics.

3 Density Propagation

Since any propositional Satisfiability (SAT) instance can be efficiently encoded as a factor graph (e.g., by defining a uniform probability measure over satisfying assignments), it is clear that computing the density of states is computationally intractable in the worst case, as a generalization of an NP-complete problem (satisfiability testing) and a #P-complete problem (model counting).

We show that the density of states can be computed efficiently¹ for acyclic graphical models. We provide a dynamic programming algorithm, which can also be interpreted as a message passing algorithm on the factor graph, called DENSITYPROPAGATION (DP), which computes the density of states exactly for acyclic graphical models.

¹Polynomial in the cardinality of the support, which could be exponential in N in the worst case.

3.1 Density propagation equations

DENSITYPROPAGATION works by exchanging messages from variable to factor nodes and vice versa.
Unlike traditional message passing algorithms, where messages represent marginal probabilities (vectors of real numbers), a DENSITYPROPAGATION message m_{a→i}(x_i) is, for every x_i ∈ X_i, a distribution (a "marginal" density of states), i.e. a sum of Dirac deltas: m_{a→i}(x_i) = Σ_k c_k(a→i, x_i) δ_{E_k(a→i, x_i)}.

At every iteration, messages are updated according to the following rules. The message from variable node i to factor node a is updated as follows:

\[ m_{i \to a}(x_i) = \bigotimes_{b \in N(i) \setminus a} m_{b \to i}(x_i) \qquad (4) \]

where ⊗ is the convolution operator (commutative, associative and distributive). Intuitively, the convolution operation gives the distribution of the sum of (conditionally) independent random variables, in this case corresponding to distinct subtrees in a tree-structured graphical model. The message from factor a to variable i is updated as follows:

\[ m_{a \to i}(x_i) = \sum_{\{x\}_\alpha \setminus i} \left( \bigotimes_{j \in N(a) \setminus i} m_{j \to a}(x_j) \right) \otimes \delta_{E_\alpha(\{x\}_\alpha)} \qquad (5) \]

where δ_{E_α({x}_α)} is a Dirac delta function centered at E_α({x}_α) = log ψ_α({x}_α).

For tree-structured graphical models, DENSITYPROPAGATION converges after a finite number of iterations, independent of the initial condition, to the true density of states. Formally,

Theorem 1. For any variable i ∈ V and any initial condition, after a finite number of iterations, for all E ∈ R,

\[ \left( \sum_{q \in X_i} \bigotimes_{b \in N(i)} m_{b \to i}(q) \right)(E) = n(E, \Theta^*). \]

The proof is by induction on the size of the tree (omitted due to lack of space).

3.1.1 Complexity and Approximation with Energy Bins

The most efficient message update schedule for tree-structured models is a two-pass procedure where messages are first sent from the leaves to the root node, and then propagated backwards from the root to the leaves.
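For intuition, the updates (4) and (5) can be sketched with discrete energy histograms standing in for the sums of Dirac deltas. The toy chain below is a hypothetical 3-variable, 2-factor tree of our own (all names illustrative, not the authors' implementation); the forward pass reproduces the exact density of states promised by Theorem 1.

```python
import itertools
from collections import Counter, defaultdict

# A message is, per variable state, a Counter {energy E_k: coefficient c_k},
# a discrete stand-in for a sum of Dirac deltas.

def convolve(m1, m2):
    """Eq. (4): distribution of the sum of independent energy histograms."""
    out = Counter()
    for e1, c1 in m1.items():
        for e2, c2 in m2.items():
            out[e1 + e2] += c1 * c2
    return out

def factor_to_var(msg_j, E):
    """Eq. (5) for a pairwise factor with energies E[xj][xi] = log psi(xj, xi)."""
    out = defaultdict(Counter)
    for xj, hist in msg_j.items():
        for xi in (0, 1):
            out[xi] = out[xi] + Counter({e + E[xj][xi]: c for e, c in hist.items()})
    return out

# Chain x0 - f01 - x1 - f12 - x2, energy +1 when the two endpoints agree.
E = [[1, 0], [0, 1]]
leaf = {x: Counter({0: 1}) for x in (0, 1)}     # empty product = delta at 0
m1 = factor_to_var(leaf, E)                     # f01 -> x1
# Variable-to-factor (Eq. 4): x1 has a single other neighbor, so this is a
# convolution with the identity element (a delta at energy 0).
m1_fwd = {x: convolve(m1[x], Counter({0: 1})) for x in (0, 1)}
m2 = factor_to_var(m1_fwd, E)                   # f12 -> x2

# Theorem 1 at the root: sum the incoming histogram over the root's states.
density = Counter()
for x in (0, 1):
    density += m2[x]
print(dict(density))   # energy 0 -> 2 configs, 1 -> 4, 2 -> 2
```

The output matches brute-force enumeration of the 8 configurations of the chain (2 with both edges agreeing, 4 with one, 2 with none).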
However, as with other message-passing algorithms, for tree-structured instances the algorithm will converge with either a sequential or a parallel update schedule, with any initial condition for the messages. Although DP requires the same number of message updates as BP and MP, DP updates are more expensive because they require the computation of convolutions. Specifically, each variable-to-factor update rule (4) requires (N − 2)L convolutions, where N is the number of neighbors of the variable node and L is the number of states of the random variable. Each factor-to-variable update rule (5) requires summation over N − 1 variables, each of size L, requiring O(L^N) convolutions. Using the Fast Fourier Transform (FFT), each convolution takes O(K log K) time, where K is the maximum number of non-zero entries in a message. In the worst case, the density of states can have an exponential number of non-zero entries (i.e., distinct possible energy values, which we will also refer to as "buckets"), for instance when potentials are set to logarithms of prime numbers, making every x ∈ X have a different probability. However, in many practical problems of interest (e.g., SAT/CSP models and certain grounded Markov Logic Networks [9]), the number of energy "buckets" is limited, e.g., bounded by the total number of constraints. For general graphical models, coarse-grain energy bins can be used, similar to the Wang-Landau algorithm [3], without losing much precision. Specifically, if we use bins of size ε/M, where each bin corresponds to configurations with energy in the interval [kε/M, (k + 1)ε/M), the energy estimated for each configuration through O(M) convolutions is at most an O(ε) additive value away from its true energy (as the quantization error introduced by energy binning is summed up across convolution steps).
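The coarse-grain binning step can be sketched as follows; this is a minimal illustration of our own (names illustrative), with energies expressed in integer units of the bin width ε/M so that bucket arithmetic stays exact.

```python
from collections import Counter

def binned(hist, width):
    """Snap every energy in a histogram to the left edge of its bin
    [k*width, (k+1)*width), merging the counts (coarse-grain binning)."""
    out = Counter()
    for e, c in hist.items():
        out[(e // width) * width] += c
    return out

# Illustrative histogram: energies 10 and 14 fall in the same bin of width 10.
h = Counter({10: 3, 14: 1, 31: 2})
print(dict(binned(h, 10)))   # {10: 4, 30: 2}
```

Each convolution can then be followed by such a merge, keeping messages to at most one entry per bin at the cost of the O(ε) additive energy error discussed above.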
This also guarantees that the density of states with coarse-grain energy bins gives a constant-factor approximation of the true partition function.

3.1.2 Relationship with sum and max product algorithms

DENSITYPROPAGATION is closely related to traditional message passing algorithms such as BP (Belief Propagation, Sum-Product) and MP (Max-Product), since it is based on the same (conditional) independence assumptions. Specifically, as shown by the next theorem, both BP and MP can be seen as simplified versions of DENSITYPROPAGATION that consider only certain global statistics of the distributions represented by DENSITYPROPAGATION messages.

Theorem 2. With the same initial condition and message update schedule, at every iteration we can recover Belief Propagation and Max-Product marginals from DENSITYPROPAGATION messages.

Proof. Given a DP message m_{i→j}(x_j) = Σ_k c_k(i→j, x_j) δ_{E_k(i→j, x_j)}, the Max-Product algorithm corresponds to considering only the entry associated with the highest probability, i.e. γ_{i→j}(x_j) = f(m_{i→j}(x_j)) ≜ max_k {E_k(i→j, x_j)}. According to the DP updates in equations (4) and (5), the quantities γ_{i→j}(x_j) are updated as follows:

\[ \gamma_{i \to a}(x_i) = f\left( \bigotimes_{b \in N(i) \setminus a} m_{b \to i}(x_i) \right) = \sum_{b \in N(i) \setminus a} \gamma_{b \to i}(x_i) \]

\[ \gamma_{a \to i}(x_i) = f\left( \sum_{\{x\}_\alpha \setminus i} \left( \bigotimes_{j \in N(a) \setminus i} m_{j \to a}(x_j) \right) \otimes \delta_{E_\alpha(\{x\}_\alpha)} \right) = \max_{\{x\}_\alpha \setminus i} \left( \sum_{j \in N(a) \setminus i} \gamma_{j \to a}(x_j) + E_\alpha(\{x\}_\alpha) \right) \]

These results show that the quantities γ_{i→j}(x_j) are updated according to the Max-Product algorithm (with messages in log-scale).
To see the relationship with BP, for every DP message m_{i→j}(x_j), let us define

\[ \mu_{i \to j}(x_j) = \left\| m_{i \to j}(x_j)(E) \exp(E) \right\|_1 = \int_{\mathbb{R}} m_{i \to j}(x_j)(E) \exp(E) \, dE \]

Notice that μ_{i→j}(x_j) would correspond to an unnormalized marginal probability, assuming that m_{i→j}(x_j) is the density of states of the instance when variable j is clamped to value x_j. According to the DP updates in equations (4) and (5),

\[ \mu_{i \to a}(x_i) = \left\| m_{i \to a}(x_i)(E) \exp(E) \right\|_1 = \left\| \left( \bigotimes_{b \in N(i) \setminus a} m_{b \to i}(x_i) \right)(E) \exp(E) \right\|_1 = \prod_{b \in N(i) \setminus a} \mu_{b \to i}(x_i) \]

\[ \mu_{a \to i}(x_i) = \left\| m_{a \to i}(x_i)(E) \exp(E) \right\|_1 = \left\| \left( \sum_{\{x\}_\alpha \setminus i} \left( \bigotimes_{j \in N(a) \setminus i} m_{j \to a}(x_j) \right) \otimes \delta_{E_\alpha(\{x\}_\alpha)} \right)(E) \exp(E) \right\|_1 = \sum_{\{x\}_\alpha \setminus i} \psi_\alpha(\{x\}_\alpha) \prod_{j \in N(a) \setminus i} \mu_{j \to a}(x_j) \]

that is, we recover the BP updates for the μ_{i→j} quantities.
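The two reductions in Theorem 2 amount to collapsing a DP message to a single statistic; a minimal sketch (the message values below are illustrative, not taken from any real run):

```python
import math
from collections import Counter

# One DP "message" stored as a Counter {energy E_k: coefficient c_k}.
m = Counter({0.0: 2, 1.0: 3, 2.5: 1})

# Max-Product statistic: keep only the highest-energy entry, f(m) = max_k E_k.
gamma = max(m)

# BP statistic: ||m(E) exp(E)||_1, an unnormalized marginal.
mu = sum(c * math.exp(E) for E, c in m.items())

# Temperature version ||m(E) exp(E/T)||_1; T = 1 recovers mu, and the
# limit T -> 0 of T * log(.) recovers gamma.
def mu_T(T):
    return sum(c * math.exp(E / T) for E, c in m.items())

print(gamma)          # 2.5
print(round(mu, 2))   # 2 + 3e + e^2.5
```

The point of the theorem is that applying these collapses commutes with the DP updates (4) and (5).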
Similarly, if we define temperature versions of the marginals, μ^T_{i→j}(x_j) ≜ ||m_{i→j}(x_j)(E) exp(E/T)||_1, we recover the temperature versions of the Belief Propagation updates, similar to [10] and [11].

As with other message passing algorithms, DENSITYPROPAGATION updates are also well defined for loopy graphical models, even though there is no guarantee of convergence or correctness [12]. The correspondence with BP and MP (Theorem 2), however, still holds: if loopy BP converges, then the corresponding quantities μ_{i→j} computed from DP messages will converge as well, and to the same value (assuming the same initial condition and update schedule). Notice however that the convergence of the μ_{i→j} does not imply the convergence of the DENSITYPROPAGATION messages (e.g., in probability, law, or L_p). In fact, we have observed empirically that the situation where the μ_{i→j} converge but the m_{i→j} do not converge (not even in distribution) is fairly common. It would be interesting to see whether there is a variational interpretation of the DENSITYPROPAGATION equations, similar to [13]. Notice also that Junction Tree style algorithms could be used in conjunction with DP updates for the messages, as an instance of the generalized distributive law [14].

4 Bounds on the density of states using tractable families

Using techniques such as DENSITYPROPAGATION, we can compute the density of states exactly for tractable families such as tree-structured graphical models. Let p(x, Θ*) be a general (intractable) probabilistic model of interest, and let Θ_i be a family of tractable parameters (e.g., corresponding to trees) such that Θ* is a convex combination of the Θ_i, as defined formally below and used previously by Wainwright et al. [5, 6]. See below (Figure 1) for an example of a possible decomposition of a 2 × 2 Ising model into 2 tractable distributions.
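A convex decomposition of this kind is easy to verify numerically. The sketch below mirrors the Figure 1 decomposition described in Section 6 (edge indexing and names are our own, illustrative choices): each tractable parameter puts weight 2 on the edges of one spanning subgraph, and averaging with γ_1 = γ_2 = 1/2 recovers Θ*.

```python
# Convex decomposition Theta* = (1/2) Theta1 + (1/2) Theta2 for a 4-cycle
# Ising model with Theta*_ij = 1 on every edge (cf. Figure 1).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]
theta_star = {e: 1.0 for e in EDGES}
theta1 = {(0, 1): 2.0, (1, 2): 2.0, (2, 3): 2.0, (3, 0): 0.0}  # a spanning tree
theta2 = {(0, 1): 0.0, (1, 2): 0.0, (2, 3): 0.0, (3, 0): 2.0}  # remaining edge

recovered = {e: 0.5 * theta1[e] + 0.5 * theta2[e] for e in EDGES}
assert recovered == theta_star
print("convex decomposition checks out")
```

Both Θ_1 and Θ_2 are acyclic, so their densities of states are exactly computable by DENSITYPROPAGATION.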
By computing the partition function or MAP estimates for the tree-structured subproblems, Wainwright et al. showed that one can recover useful information about the original intractable problem, for instance by exploiting convexity of the log-partition function log Z(Θ).

We present a way to exploit the decomposition idea to derive an upper bound on the density of states n(E, Θ*) of the original intractable model, despite the fact that the density of states is not a convex function of Θ*. The result below gives a point-by-point upper bound which, to the best of our knowledge, is the first bound of this kind for the density of states. In the following, with some abuse of notation, we denote by n(E, Θ*) = Σ_{x∈X} 1_{{Θ*·φ(x)=E}} the function giving the number of configurations with energy E (zero almost everywhere).

Theorem 3. Let Θ* = Σ_{i=1}^n γ_i Θ_i with Σ_{i=1}^n γ_i = 1, and let y_n = E − Σ_{i=1}^{n−1} y_i. Then

\[ n(E, \Theta^*) \le \int_{\mathbb{R}} \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} \min_{i=1,\dots,n} \left\{ n(y_i, \gamma_i \Theta_i) \right\} \, dy_1 \, dy_2 \cdots dy_{n-1} \]

Proof. From the definition of the density of states and using 1_{·} to denote the 0-1 indicator function,

\[ \begin{aligned} n(E, \Theta^*) &= \sum_{x \in \mathcal{X}} 1_{\{\Theta^* \cdot \phi(x) = E\}} = \sum_{x \in \mathcal{X}} 1_{\{(\sum_i \gamma_i \Theta_i) \cdot \phi(x) = E\}} \\ &= \sum_{x \in \mathcal{X}} \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} \left( \prod_{i=1}^{n} 1_{\{\gamma_i \Theta_i \cdot \phi(x) = y_i\}} \right) dy_1 \, dy_2 \cdots dy_{n-1} \quad \text{where } y_n = E - \sum_{i=1}^{n-1} y_i \\ &= \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} \sum_{x \in \mathcal{X}} \left( \prod_{i=1}^{n} 1_{\{\gamma_i \Theta_i \cdot \phi(x) = y_i\}} \right) dy_1 \, dy_2 \cdots dy_{n-1} \\ &= \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} \sum_{x \in \mathcal{X}} \left( \min_{i=1,\dots,n} \left\{ 1_{\{\gamma_i \Theta_i \cdot \phi(x) = y_i\}} \right\} \right) dy_1 \, dy_2 \cdots dy_{n-1} \\ &\le \int_{\mathbb{R}} \cdots \int_{\mathbb{R}} \min_{i=1,\dots,n} \left\{ \sum_{x \in \mathcal{X}} 1_{\{\gamma_i \Theta_i \cdot \phi(x) = y_i\}} \right\} dy_1 \, dy_2 \cdots dy_{n-1} \end{aligned} \]

Observing that Σ_{x∈X} 1_{{γ_iΘ_i·φ(x)=y_i}} is precisely n(y_i, γ_iΘ_i) finishes the proof.

5 Bounds on the partition function using n-dimensional matching

The density of states n(E, Θ*) can be used to compute the partition function, since by definition Z(Θ*) = ||n(E, Θ*) exp(E)||_1. We can therefore get an upper bound on Z(Θ*) by integrating the point-by-point upper bound on n(E, Θ*) from Theorem 3. This bound can be tighter than the known bound [6] obtained by applying Jensen's inequality to the log-partition function (which is convex), given by log Z(Θ*) ≤ Σ_i γ_i log Z(Θ_i). For instance, consider a graphical model with weights that are large enough such that the density-of-states sum defining Z(Θ*) is dominated by the contribution of the highest-energy bucket. As a concrete example, consider the decomposition in Figure 1. As the edge weight w (w = 2 in the figure) grows, the convexity-based bound will approximately equal the geometric average of 2 exp(6w) and 8 exp(2w), which is 4 exp(4w). On the other hand, the bound based on Theorem 3 will approximately equal min{2, 8} exp((2 + 6)w/2) = 2 exp(4w). In general, the latter bound will always be strictly better for large enough w unless the highest-energy bucket counts are identical across all Θ_i.

While this is already promising, we can, in fact, obtain a much tighter bound by taking into account the interactions between different energy levels across any parameter decomposition, e.g., by enforcing the fact that there are a total of |X| configurations. For compactness, in the following let us define y_i(x) = exp(Θ_i · φ(x)) for any x ∈ X and i = 1, ..., n.
Then,

\[ Z(\Theta^*) = \sum_{x \in \mathcal{X}} \exp(\Theta^* \cdot \phi(x)) = \sum_{x \in \mathcal{X}} \prod_i y_i(x)^{\gamma_i} \]

Theorem 4. Let Π be the (finite) set of all possible permutations of X. Given σ = (σ_1, ..., σ_n) ∈ Π^n, let Z(Θ*, σ) = Σ_{x∈X} ∏_i y_i(σ_i(x))^{γ_i}. Then,

\[ \min_{\sigma \in \Pi^n} Z(\Theta^*, \sigma) \le Z(\Theta^*) \le \max_{\sigma \in \Pi^n} Z(\Theta^*, \sigma) \qquad (6) \]

Proof. Let σ_I ∈ Π^n denote the collection of n identity permutations. Then we have Z(Θ*) = Z(Θ*, σ_I), which proves the upper and lower bounds in equation (6).

Algorithm 1 Greedy algorithm for the maximum matching (upper bound).
1: while there exists E such that n(E, Θ_i) > 0 do
2:   E_max(Θ_i) ← max_E {E | n(E, Θ_i) > 0}, for i = 1, ..., n
3:   c′ ← min {n(E_max(Θ_1), Θ_1), ..., n(E_max(Θ_n), Θ_n)}
4:   ub(γ_1 E_max(Θ_1) + ··· + γ_n E_max(Θ_n), Θ_1, ..., Θ_n) ← c′
5:   n(E_max(Θ_i), Θ_i) ← n(E_max(Θ_i), Θ_i) − c′, for i = 1, ..., n
6: end while

We can think of σ ∈ Π^n as an n-dimensional matching over the exponential-size configuration space X. For any i, j, σ_i(x) matches with σ_j(x), and σ(x) gives the corresponding hyper-edge. If we define the weight of each hyper-edge in the matching graph as w(σ(x)) = ∏_i y_i(σ_i(x))^{γ_i}, then Z(Θ*, σ) = Σ_{x∈X} w(σ(x)) corresponds to the weight of the matching represented by σ. We can therefore think of the bounds in equation (6) as given by a maximum and a minimum matching, respectively.
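The greedy bucket-level computation of the maximum matching (Algorithm 1) can be written as a short executable sketch; the representation and names below are illustrative choices of ours, not the authors' implementation, applied to the Figure 1 densities described in Section 6.

```python
import math
from collections import Counter

def max_matching_ub(densities, gammas):
    """Algorithm 1 (greedy maximum matching) at the bucket level.
    Each density is a Counter {energy: #configurations}."""
    densities = [Counter(d) for d in densities]     # work on copies
    ub = Counter()
    # All densities integrate to |X|, so they are depleted simultaneously.
    while all(len(d) > 0 for d in densities):
        tops = [max(d) for d in densities]          # E_max(Theta_i)
        c = min(d[t] for d, t in zip(densities, tops))
        ub[sum(g * t for g, t in zip(gammas, tops))] += c
        for d, t in zip(densities, tops):           # remove c configurations
            d[t] -= c
            if d[t] == 0:
                del d[t]
    return ub

# Figure 1 example: n(E, Theta1), n(E, Theta2), gamma1 = gamma2 = 1/2.
n1 = {0: 2, 2: 6, 4: 6, 6: 2}
n2 = {0: 8, 2: 8}
ub = max_matching_ub([n1, n2], [0.5, 0.5])
Z_ub = sum(c * math.exp(E) for E, c in ub.items())
print(dict(ub))          # buckets 2 + 6e + 6e^3 + 2e^4
print(round(Z_ub, 1))    # ~ 248.0
```

Note the runtime argument of Proposition 1 is visible here: each iteration empties at least one bucket, so the loop runs at most as many times as there are non-empty buckets in total.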
Intuitively, the maximum matching corresponds to the case where the configurations in the high-energy buckets of the densities happen to be the same configurations (matching), so that their energies are summed up.

5.1 Upper bound

The maximum matching max_σ Z(Θ*, σ) (i.e., the upper bound on the partition function) can be computed using Algorithm 1. Algorithm 1 returns a distribution ub such that ∫ ub(E) dE = |X| and ∫ ub(E) exp(E) dE = max_σ Z(Θ*, σ). Notice however that ub(E) is not a valid point-by-point upper bound on the density n(E, Θ*) of the original model.

Proposition 1. Algorithm 1 computes the maximum matching and its runtime is bounded by the total number of non-empty buckets Σ_i |{E | n(E, Θ_i) > 0}|.

Proof. The correctness of Algorithm 1 follows from observing that exp(E_1 + E_2) + exp(E′_1 + E′_2) ≥ exp(E_1 + E′_2) + exp(E′_1 + E_2) when E_1 ≥ E′_1 and E_2 ≥ E′_2. Intuitively, this means that for n = 2 parameters it is always optimal to connect the highest-energy configurations, therefore the greedy method is optimal. This result can be generalized to n > 2 by induction. The runtime is proportional to the total number of buckets because we remove one bucket from at least one density at every iteration.

A key property of Algorithm 1 is that even though it defines a matching over an exponential number of configurations |X|, its runtime is proportional only to the total number of buckets, because it matches configurations in groups at the bucket level.

The following result shows that the value of the maximum matching is at least as tight as the bound provided by the convexity of the log-partition function, which is used for example by Tree-Reweighted Belief Propagation (TRWBP) [6].

Theorem 5.
For any parameter decomposition Σ_{i=1}^n γ_iΘ_i = Θ*, the upper bound given by the maximum matching in equation (6) and computed using Algorithm 1 is always at least as tight as the bound obtained using the convexity of the log-partition function.

Proof. The bound obtained by applying Jensen's inequality to the log-partition function (which is convex), given by log Z(Θ*) ≤ Σ_i γ_i log Z(Θ_i) [6], leads to the following geometric-average bound: Z(Θ*) ≤ ∏_i (Σ_x y_i(x))^{γ_i}. Given any n permutations of the configurations σ_i : X → X for i = 1, ..., n (in particular, for the ones attaining the maximum matching value) we have

\[ \sum_x \prod_i y_i(\sigma_i(x))^{\gamma_i} = \left\| \prod_i y_i(\sigma_i(x))^{\gamma_i} \right\|_1 \le \prod_i \left\| y_i(\sigma_i(x))^{\gamma_i} \right\|_{1/\gamma_i} = \prod_i \left( \sum_x y_i(\sigma_i(x)) \right)^{\gamma_i} = \prod_i \left( \sum_x y_i(x) \right)^{\gamma_i} \]

where we used the generalized Hölder inequality, and the norm ||·||_ℓ indicates a sum over X.

Algorithm 2 Greedy algorithm for the minimum matching with n = 2 parameters (lower bound).
1: while there exists E such that n(E, Θ_i) > 0 do
2:   E_max(Θ_1) ← max_E {E | n(E, Θ_1) > 0}; E_min(Θ_2) ← min_E {E | n(E, Θ_2) > 0}
3:   c′ ← min {n(E_max(Θ_1), Θ_1), n(E_min(Θ_2), Θ_2)}
4:   lb(γ_1 E_max(Θ_1) + γ_2 E_min(Θ_2), Θ_1, Θ_2) ← c′
5:   n(E_max(Θ_1), Θ_1) ← n(E_max(Θ_1), Θ_1) − c′; n(E_min(Θ_2), Θ_2) ← n(E_min(Θ_2), Θ_2) − c′
6: end while

5.2 Lower bound

We also provide Algorithm 2 to compute the minimum matching when there are n = 2 parameters. The proof of correctness is similar to that for Proposition 1.

Proposition 2.
For n = 2, Algorithm 2 computes the minimum matching and its runtime is bounded by the total number of non-empty buckets Σ_i |{E | n(E, Θ_i) > 0}|.

For the minimum matching case, the induction argument does not apply and the result does not extend to the case n > 2. For that case, we can obtain a weaker lower bound by applying the reverse generalized Hölder inequality [15], obtaining from a different perspective a bound previously derived in [16]. Specifically, let s_1, ..., s_{n−1} < 0 and s_n be such that Σ_i 1/s_i = 1. We then have

\[ \min_{\sigma} Z(\Theta^*, \sigma) = \sum_x \prod_i y_i(\sigma_{\min,i}(x))^{\gamma_i} = \left\| \prod_i y_i(\sigma_{\min,i}(x))^{\gamma_i} \right\|_1 \ge \prod_i \left\| y_i(\sigma_{\min,i}(x))^{\gamma_i} \right\|_{s_i} = \prod_i \left( \sum_x y_i(\sigma_{\min,i}(x))^{s_i \gamma_i} \right)^{\frac{1}{s_i}} = \prod_i \left( \sum_x y_i(x)^{s_i \gamma_i} \right)^{\frac{1}{s_i}} \qquad (7) \]

Notice this result cannot be applied if y_i(x) = 0, i.e. if there are factors assigning probability zero (hard constraints) in the probabilistic model.

6 Empirical evaluation

To evaluate the quality of the bounds, we consider an Ising model from statistical physics, where given a graph (V, E), single-node variables x_s, s ∈ V are Bernoulli distributed (x_s ∈ {0, 1}), and the global random vector is distributed according to

\[ p(x, \Theta) = \frac{1}{Z(\Theta)} \exp\left( \sum_{s \in V} \Theta_s x_s + \sum_{(i,j) \in E} \Theta_{ij} 1_{\{x_i = x_j\}} \right) \]

Figure 1 shows a simple 2 × 2 grid Ising model with exponential parameter Θ* = [0, 0, 0, 0, 1, 1, 1, 1] (Θ_s = 0 and Θ_ij = 1) decomposed as the convex sum of two parameters Θ_1 and Θ_2 corresponding to tractable distributions, i.e. Θ* = (1/2)Θ_1 + (1/2)Θ_2. The corresponding partition function is Z(Θ*) = 2 + 12 exp(2) + 2 exp(4) ≈ 199.86. In panels 1(d) and 1(e) we report the corresponding densities of states n(E, Θ_1) and n(E, Θ_2) as histograms.
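The numbers just quoted can be re-derived by brute-force enumeration; the sketch below (vertex and edge indexing are our own illustrative choices for the 2 × 2 grid of Figure 1) confirms Z(Θ*), Z(Θ_1), Z(Θ_2), and the convexity-based geometric average.

```python
import itertools
import math

def Z(theta):
    """Partition function of a 4-variable Ising-style model; theta maps
    edge -> weight, with energy = sum of weights of agreeing edges."""
    total = 0.0
    for x in itertools.product([0, 1], repeat=4):
        E = sum(w for (i, j), w in theta.items() if x[i] == x[j])
        total += math.exp(E)
    return total

cycle = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1}   # Theta*
tree1 = {(0, 1): 2, (1, 2): 2, (2, 3): 2}               # Theta1 (panel a)
tree2 = {(3, 0): 2}                                     # Theta2 (panel b)

print(round(Z(cycle), 2))                        # ~ 199.86
print(round(Z(tree1), 1), round(Z(tree2), 2))    # ~ 1180.8 and ~ 67.11
print(round(math.sqrt(Z(tree1) * Z(tree2)), 1))  # convexity bound, ~ 281.5
```

The geometric average 281.5 is the convexity-based upper bound that the matching-based bound of Section 5 improves upon for this instance.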
For instance, for the model corresponding to Θ_1 there are only two global configurations (all variables positive and all negative) that give an energy of 6. It can be seen from the densities reported that Z(Θ_1) = 2 + 6 exp(2) + 6 exp(4) + 2 exp(6) ≈ 1180.8, while Z(Θ_2) = 8 + 8 exp(2) ≈ 67.11. The corresponding geometric average (obtained from the convexity of the log-partition function) is √(Z(Θ_1)) √(Z(Θ_2)) ≈ 281.50. In panels 1(c) and 1(f) we show ub and lb computed using Algorithms 1 and 2, i.e. the solutions to the maximum and minimum matching problems, respectively. For instance, in the maximum matching case the 2 configurations with energy 6 from n(E, Θ_1) are matched with 2 of the 8 with energy 2 from n(E, Θ_2), giving an energy 6/2 + 2/2 = 4. Notice that ub and lb are not valid bounds on individual densities of states themselves, but they nonetheless provide upper and lower bounds on the partition function as shown in the figure: ≈ 248.01 and ≈ 134.27, respectively. The bound (7) given by the reverse Hölder inequality with s_1 = −1, s_2 = 1/2 is ≈ 126.22, while the mean-field lower bound [4, 7] is ≈ 117.91. In this case, the additional information provided by the density leads to tighter upper and lower bounds on the partition function.

In Figure 2 we report the upper bounds obtained for several types of Ising models (in all cases, Θ_s = 0, i.e., there is no external field). In the two left plots, we consider an N × N square Ising model, once with attractive interactions (Θ_ij ∈ [0, w]) and once with mixed interactions (Θ_ij ∈ [−w, w]). In the two right plots, we use a complete graph (a clique) with N = 15 vertices.
For each model, we compute the upper bound given by TRWBP (with edge appearance probabilities μ_e based on a subset of 10 randomly selected spanning trees) and the mean-field bound using the implementations in libDAI [17]. We then compute the bound based on the maximum matching using the same set of spanning trees. For the grid case, we also use a combination of 2 spanning trees and compute the corresponding lower bound based on the minimum matching (notice it is not possible to cover all the edges of a clique with only 2 spanning trees). For each bound, we report the relative error, defined as (log(bound) − log(Z))/log(Z), where Z is the true partition function, computed using the junction tree method.

Figure 1: Decomposition of a 2 × 2 Ising model, densities obtained with the maximum and minimum matching algorithms, and the corresponding upper and lower bounds on Z(Θ*). Panels: (a) graph for Θ_1; (b) graph for Θ_2; (c) Z_ub = 2 + 6e + 6e³ + 2e⁴; (d) histogram n(E, Θ_1); (e) histogram n(E, Θ_2); (f) Z_lb = 2e + 12e² + 2e³.

Figure 2: Relative error of the upper bounds. Panels: (a) 15 × 15 grid, attractive; (b) 10 × 10 grid, mixed; (c) 15-clique, attractive; (d) 15-clique, mixed.

In these experiments, both our upper and lower bounds improve over the ones obtained with TRWBP [6] and mean-field, respectively.
The lower bound based on the minimum matching visually overlaps with the mean-field bound and is thus omitted from Figure 2. It is, however, strictly better, even if by a small amount. Notice that we might be able to get a better bound by choosing a different set of parameters Θi (which may be suboptimal for TRWBP). By optimizing the parameters si in the inverse Hölder bound (8) using numerical optimization (BFGS and BOBYQA [18]), we were always able to obtain a lower bound at least as good as the one given by mean field.

7 Conclusions

We presented DENSITYPROPAGATION, a novel message-passing algorithm for computing the density of states while exploiting the structure of the underlying graphical model. We showed that DENSITYPROPAGATION computes the exact density for tree-structured graphical models and is a generalization of both the Belief Propagation and Max-Product algorithms. We introduced a new family of bounds on the partition function based on n-dimensional matching and tree decomposition, without relying on convexity. The additional information provided by the density of states leads, both theoretically and empirically, to tighter bounds than known convexity-based ones.

References

[1] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[2] S. Ermon, C. Gomes, A. Sabharwal, and B. Selman. Accelerated adaptive Markov chain for partition function computation. In Neural Information Processing Systems, 2011.

[3] F. Wang and D.P. Landau. Efficient, multiple-range random walk algorithm to calculate the density of states. Physical Review Letters, 86(10):2050–2053, 2001.

[4] M.J. Wainwright. Stochastic processes on graphs with cycles: geometric and variational approaches. PhD thesis, Massachusetts Institute of Technology, 2002.

[5] M. Wainwright, T. Jaakkola, and A. Willsky.
Exact MAP estimates by (hyper)tree agreement. In Advances in Neural Information Processing Systems, pages 833–840, 2003.

[6] M.J. Wainwright. Tree-reweighted belief propagation algorithms and approximate ML estimation via pseudo-moment matching. In AISTATS, 2003.

[7] G. Parisi and R. Shankar. Statistical field theory. Physics Today, 41:110, 1988.

[8] L.D. Brown. Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory. Institute of Mathematical Statistics, 1986.

[9] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1):107–136, 2006.

[10] Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In Uncertainty in Artificial Intelligence, 2007.

[11] T. Hazan and A. Shashua. Norm-product belief propagation: Primal-dual message-passing for approximate inference. IEEE Transactions on Information Theory, 56(12):6294–6316, 2010.

[12] K.P. Murphy, Y. Weiss, and M.I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 467–475. Morgan Kaufmann Publishers Inc., 1999.

[13] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, 8:236–239, 2003.

[14] S.M. Aji and R.J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.

[15] W.S. Cheung. Generalizations of Hölder's inequality. International Journal of Mathematics and Mathematical Sciences, 26:7–10, 2001.

[16] Qiang Liu and Alexander Ihler. Negative tree reweighted belief propagation.
In Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI-10), pages 332–339, Corvallis, Oregon, 2010. AUAI Press.

[17] J.M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. The Journal of Machine Learning Research, 11:2169–2173, 2010.

[18] M.J.D. Powell. The BOBYQA algorithm for bound constrained optimization without derivatives. University of Cambridge Technical Report, 2009.