{"title": "Probabilistic Variational Bounds for Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1432, "page_last": 1440, "abstract": "Variational algorithms such as tree-reweighted belief propagation can provide deterministic bounds on the partition function, but are often loose and difficult to use in an ``any-time'' fashion, expending more computation for tighter bounds. On the other hand, Monte Carlo estimators such as importance sampling have excellent any-time behavior, but depend critically on the proposal distribution. We propose a simple Monte Carlo based inference method that augments convex variational bounds by adding importance sampling (IS). We argue that convex variational methods naturally provide good IS proposals that ``cover'' the probability of the target distribution, and reinterpret the variational optimization as designing a proposal that minimizes an upper bound on the variance of our IS estimator. This both provides an accurate estimator and enables the construction of any-time probabilistic bounds that improve quickly and directly on state-of-the-art variational bounds, and provide certificates of accuracy given enough samples relative to the error in the initial bound.", "full_text": "Probabilistic Variational Bounds\n\nfor Graphical Models\n\nQiang Liu\nComputer Science\nDartmouth College\nqliu@cs.dartmouth.edu\n\nJohn Fisher III\nCSAIL\nMIT\nfisher@csail.mit.edu\n\nAlexander Ihler\nComputer Science\nUniv. of California, Irvine\nihler@ics.uci.edu\n\nAbstract\n\nVariational algorithms such as tree-reweighted belief propagation can provide deterministic bounds on the partition function, but are often loose and difficult to use in an \u201cany-time\u201d fashion, expending more computation for tighter bounds. On the other hand, Monte Carlo estimators such as importance sampling have excellent any-time behavior, but depend critically on the proposal distribution. 
We propose\na simple Monte Carlo based inference method that augments convex variational\nbounds by adding importance sampling (IS). We argue that convex variational\nmethods naturally provide good IS proposals that \u201ccover\u201d the target probability,\nand reinterpret the variational optimization as designing a proposal to minimize an\nupper bound on the variance of our IS estimator. This both provides an accurate\nestimator and enables construction of any-time probabilistic bounds that improve\nquickly and directly on state-of-the-art variational bounds, and provide certi\ufb01cates\nof accuracy given enough samples relative to the error in the initial bound.\n\n1\n\nIntroduction\n\nGraphical models such as Bayesian networks, Markov random \ufb01elds and deep generative models\nprovide a powerful framework for reasoning about complex dependency structures over many vari-\nables [see e.g., 14, 13]. A fundamental task is to calculate the partition function, or normalization\nconstant. This task is #P-complete in the worst case, but in many practical cases it is possible to\n\ufb01nd good deterministic or Monte Carlo approximations. The most useful approximations should\ngive not only accurate estimates, but some form of con\ufb01dence interval, so that for easy problems\none has a certi\ufb01cate of accuracy, while harder problems are identi\ufb01ed as such. Broadly speaking,\napproximations fall into two classes: variational optimization, and Monte Carlo sampling.\nVariational inference [29] provides a spectrum of deterministic estimates and upper and lower\nbounds on the partition function; these include loopy belief propagation (BP), which is often quite\naccurate; its convex variants, such as tree reweighted BP (TRW-BP), which give upper bounds on the\npartition function; and mean \ufb01eld type methods that give lower bounds. 
Unfortunately, these meth-\nods often lack useful accuracy assessments; although in principle a pair of upper and lower bounds\n(such as TRW-BP and mean \ufb01eld) taken together give an interval containing the true solution, the\ngap is often too large to be practically useful. Also, improving these bounds typically means using\nlarger regions, which quickly runs into memory constraints.\nMonte Carlo methods, often based on some form of importance sampling (IS), can also be used\nto estimate the partition function [e.g., 15]. In principle, IS provides unbiased estimates, with the\npotential for a probabilistic bound: a bound which holds with some user-selected probability 1 \u2212 \u03b4.\nSampling estimates can also easily trade time for increased accuracy, without using more memory.\nUnfortunately, choosing the proposal distribution in IS is often both crucial and dif\ufb01cult; if poorly\nchosen, not only is the estimator high-variance, but the samples\u2019 empirical variance estimate is also\nmisleading, resulting in both poor accuracy and poor con\ufb01dence estimates; see e.g., [35, 1].\n\n1\n\n\fWe propose a simple algorithm that combines the advantages of variational and Monte Carlo meth-\nods. Our result is based on an observation that convex variational methods, including TRW-BP and\nits generalizations, naturally provide good importance sampling proposals that \u201ccover\u201d the proba-\nbility of the target distribution; the simplest example is a mixture of spanning trees constructed by\nTRW-BP. We show that the importance weights of this proposal are uniformly bounded by the con-\nvex upper bound itself, which admits a bound on the variance of the estimator, and more importantly,\nallows the use of exponential concentration inequalities such as the empirical Bernstein inequality\nto provide explicit con\ufb01dence intervals. 
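Schematically, an empirical Bernstein interval for i.i.d. weights known to lie in [0, b] can be computed as below (an illustrative sketch, not the paper's code; the bound b, the toy weight distribution, and delta are all hypothetical placeholders):

```python
import math
import random

random.seed(1)

def bernstein_radius(ws, b, delta=0.05):
    # Empirical Bernstein deviation (Maurer-Pontil style) for i.i.d. samples
    # bounded in [0, b]:
    #   sqrt(2 * var_hat * log(2/delta) / n) + 7 * b * log(2/delta) / (3*(n-1))
    n = len(ws)
    mean = sum(ws) / n
    var_hat = sum((w - mean) ** 2 for w in ws) / (n - 1)
    return (math.sqrt(2.0 * var_hat * math.log(2.0 / delta) / n)
            + 7.0 * b * math.log(2.0 / delta) / (3.0 * (n - 1)))

b = 10.0                                              # known bound on the weights
ws = [random.uniform(0.0, 4.0) for _ in range(2000)]  # toy weights, true mean 2
mean_w = sum(ws) / len(ws)
r = bernstein_radius(ws, b)
# Each of mean_w + r and mean_w - r is a one-sided bound on E[w] that holds
# with probability at least 1 - delta.
```

Note how the radius uses both the empirical variance and the known range b; this is the mechanism the paper exploits once the weights are bounded.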
Our method provides several important advantages:\nFirst, the upper bounds resulting from our sampling approach improve directly on the initial vari-\national upper bound. This allows our bound to start at a state-of-the-art value, and be quickly and\neasily improved in an any-time, memory ef\ufb01cient way. Additionally, using a two-sided concentration\nbound provides a \u201ccerti\ufb01cate of accuracy\u201d which improves over time at an easily analyzed rate. Our\nupper bound is signi\ufb01cantly better than existing probabilistic upper bounds, while our correspond-\ning lower bound is typically worse with few samples but eventually outperforms state-of-the-art\nprobabilistic bounds [11].\nOur approach also results in improved estimates of the partition function. As in previous work [32,\n34, 31], applying importance sampling serves as a \u201cbias correction\u201d to variational approximations.\nHere, we interpret the variational bound optimization as equivalent to minimizing an upper bound\non the IS estimator\u2019s variance. Empirically, this translates into estimates that can be signi\ufb01cantly\nmore accurate than IS using other variational proposals, such as mean \ufb01eld or belief propagation.\nRelated Work.\nImportance sampling and related approaches have been widely explored in the\nBayesian network literature, in which the partition function corresponds to the probability of ob-\nserved evidence; see e.g., [8, 26, 33, 11] and references therein. Dagum and Luby [4] derive a\nsample size to ensure a probabilistic bound with given relative accuracy; however, they use the\nnormalized Bayes net distribution as a proposal, leading to prohibitively large numbers of samples\nwhen the partition function is small, and making it inapplicable to Markov random \ufb01elds. 
Cheng [2]\nre\ufb01nes this result, including a user-speci\ufb01ed bound on the importance weights, but leaves the choice\nof proposal unspeci\ufb01ed.\nSome connections between IS and variational methods are also explored in Yuan and Druzdzel\n[32, 34], Wexler and Geiger [31], Gogate and Dechter [11], in which proposals are constructed\nbased on loopy BP or mean \ufb01eld methods. While straightforward in principle, we are not aware of\nany prior work which uses variational upper bounds to construct a proposal, or more importantly,\nanalyzes their properties. An alternative probabilistic upper bound can be constructed using \u201cper-\nturb and MAP\u201d methods [23, 12] combined with recent concentration results [22]; however, in our\nexperiments the resulting bounds were quite loose. Although not directly related to our work, there\nare also methods that connect variational inference with MCMC [e.g., 25, 6].\nOur work is orthogonal to the line of research on adaptive importance sampling, which re\ufb01nes the\nproposal as more samples are drawn [e.g., 21, 3]; we focus on developing a good \ufb01xed proposal\nbased on variational ideas, and leave adaptive improvement as a possible future direction.\nOutline. We introduce background on graphical models in Section 2. Our main result is presented\nin Section 3, where we construct a tree reweighted IS proposal, discuss its properties, and propose\nour probabilistic bounds based on it. We give a simple extension of our method to higher order\ncliques based on the weighted mini-bucket framework in Section 4. We then show experimental\ncomparisons in Section 5 and conclude with Section 6.\n\n2 Background\n\n2.1 Undirected Probabilistic Graphical Models\nLet x = [x1, . . . 
, xp] be a discrete random vector taking values in X = X1 \u00d7 \u00b7\u00b7\u00b7 \u00d7 Xp; a probabilistic graphical model on x, in an over-complete exponential family form, is\n\np(x; \u03b8) = f(x; \u03b8) / Z(\u03b8), with f(x; \u03b8) = exp( \u2211_{\u03b1\u2208I} \u03b8\u03b1(x\u03b1) ), Z(\u03b8) = \u2211_{x\u2208X} f(x; \u03b8), (1)\n\nwhere I = {\u03b1} is a set of subsets of variable indices, and \u03b8\u03b1 : X\u03b1 \u2192 R are functions of x\u03b1; we denote by \u03b8 = {\u03b8\u03b1(x\u03b1) : \u2200\u03b1 \u2208 I, x\u03b1 \u2208 X\u03b1} the vector formed by the elements of \u03b8\u03b1(\u00b7), called the natural parameters. Our goal is to calculate the partition function Z(\u03b8) that normalizes the distribution; we often drop the dependence on \u03b8 and write p(x) = f(x)/Z for convenience.\nThe factorization of p(x; \u03b8) can be represented by an undirected graph G = (V, EG), called its Markov graph, where each vertex k \u2208 V is associated with a variable xk, and nodes k, l \u2208 V are connected (i.e., (kl) \u2208 EG) iff there exists some \u03b1 \u2208 I that contains both k and l; then, I is a set of cliques of G. A simple special case of (1) is the pairwise model, in which I = V \u222a EG:\n\nf(x; \u03b8) = exp( \u2211_{k\u2208V} \u03b8k(xk) + \u2211_{(kl)\u2208EG} \u03b8kl(xk, xl) ). (2)\n\n2.2 Monte Carlo Estimation via Importance Sampling\n\nImportance sampling (IS) is at the core of many Monte Carlo methods for estimating the partition function. The idea is to take a tractable, normalized distribution q(x), called the proposal, and estimate Z using samples {xi}_{i=1}^n \u223c q(x):\n\n\u02c6Z = (1/n) \u2211_{i=1}^n w(xi), with w(xi) = f(xi) / q(xi),\n\nwhere w(x) is called the importance weight. 
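In code, the estimator above amounts to a few lines; the following sketch (a hypothetical toy model, not from the paper) checks it against brute-force enumeration on a tiny chain:

```python
import itertools
import math
import random

random.seed(0)

# Toy pairwise model on 3 binary variables (illustrative only):
# f(x) = exp(theta * (x0*x1 + x1*x2)), with x_k in {-1, +1}
theta = 0.5

def f(x):
    return math.exp(theta * (x[0] * x[1] + x[1] * x[2]))

# Exact partition function by enumeration (feasible only for tiny models).
states = list(itertools.product([-1, 1], repeat=3))
Z = sum(f(x) for x in states)

# Importance sampling with a uniform proposal q(x) = 1/8.
q = 1.0 / len(states)
n = 100000
weights = [f(random.choice(states)) / q for _ in range(n)]
Z_hat = sum(weights) / n   # unbiased estimate of Z; close to Z for large n
```

With a uniform proposal the weights are bounded, so the estimate concentrates; the sections below replace the uniform proposal with one built from a convex variational bound.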
It is easy to show that \u02c6Z is an unbiased estimator of Z, in that E\u02c6Z = Z, if q(x) > 0 whenever p(x) > 0, and has a mean squared error of E(\u02c6Z \u2212 Z)^2 = (1/n) var(w(x)).\nUnfortunately, the IS estimator often has very high variance if the choice of proposal distribution is very different from the target, especially when the proposal is more peaked or has thinner tails than the target. In these cases, there exist configurations x such that q(x) \u226a p(x), giving importance weights w(x) = f(x)/q(x) with extremely large values, but very small probabilities. Due to the low probability of seeing these large weights, a \u201ctypical\u201d run of IS often underestimates Z in practice, that is, \u02c6Z \u2264 Z with high probability, despite being unbiased.\nSimilarly, the empirical variance of {w(xi)} can also severely underestimate the true variance var(w(x)), and so fail to capture the true uncertainty of the estimator. For this reason, concentration inequalities that make use of the empirical variance (see Section 3) also require that w, or its variance, be bounded. It is thus desirable to construct proposals that are similar to, and less peaked than, the target distribution p(x). The key observation of this work is to show that tree reweighted BP and its generalizations provide an easy way to construct such good proposals.\n\n2.3 Tree Reweighted Belief Propagation\n\nNext we describe the tree reweighted (TRW) upper bound on the partition function, restricting to pairwise models (2) for notational ease. In Section 4 we give an extension that includes both more general factor graphs, and more general convex upper bounds.\nLet T = {T} be a set of spanning trees T = (V, ET) of G that covers G: \u222aT ET = EG. 
We assign a set of nonnegative weights {\u03c1T : T \u2208 T} on T such that \u2211T \u03c1T = 1. Let \u03b8^T = {\u03b8T : T \u2208 T} be a set of natural parameters that satisfies \u2211T \u03c1T \u03b8T = \u03b8, and each \u03b8T respects the structure of T (so that \u03b8T_kl(xk, xl) \u2261 0 for \u2200(kl) \u2209 ET). Define\n\npT(x) := p(x; \u03b8T) = f(x; \u03b8T) / Z(\u03b8T), with f(x; \u03b8T) = exp( \u2211_{k\u2208V} \u03b8T_k(xk) + \u2211_{(kl)\u2208ET} \u03b8T_kl(xk, xl) );\n\nthen pT(x) is a tree structured graphical model with Markov graph T. Wainwright et al. [30] use the fact that log Z(\u03b8) is a convex function of \u03b8 to propose to upper bound log Z(\u03b8) by\n\nlog Ztrw(\u03b8^T) = \u2211_{T\u2208T} \u03c1T log Z(\u03b8T) \u2265 log Z( \u2211_{T\u2208T} \u03c1T \u03b8T ) = log Z(\u03b8),\n\nvia Jensen\u2019s inequality. Wainwright et al. [30] find the tightest bound via a convex optimization:\n\nlog Z*trw(\u03b8) = min_{\u03b8^T} { log Ztrw(\u03b8^T) s.t. \u2211T \u03c1T \u03b8T = \u03b8 }. (3)\n\nWainwright et al. [30] solve this optimization by a tree reweighted belief propagation (TRW-BP) algorithm, and note that the optimality condition of (3) is equivalent to enforcing a marginal consistency condition on the trees \u2013 a \u03b8^T optimizes (3) if and only if there exists a set of common singleton and pairwise \u201cpseudo-marginals\u201d {bk(xk), bkl(xk, xl)}, corresponding to the fixed point of TRW-BP in Wainwright et al. [30], such that\n\nbkl(xk, xl) = pT(xk, xl), \u2200(kl) \u2208 ET, and bk(xk) = pT(xk), \u2200k \u2208 V,\n\nwhere pT(xk) and pT(xk, xl) are the marginals of pT(x). 
Thus, after running TRW-BP, we can calculate pT(x) via\n\npT(x) = p(x; \u03b8T) = \u220f_{k\u2208V} bk(xk) \u220f_{(kl)\u2208ET} bkl(xk, xl) / ( bk(xk) bl(xl) ). (4)\n\nBecause TRW provides a convex upper bound, it is often well-suited to the inner loop of learning algorithms [e.g., 28]. However, it is often far less accurate than its non-convex counterpart, loopy BP; in some sense, this can be viewed as the cost of being a bound. In the next section, we show that our importance sampling procedure can \u201cde-bias\u201d the TRW bound, to produce an estimator that significantly outperforms loopy BP; in addition, due to the nice properties of our TRW-based proposal, we can use an empirical Bernstein inequality to construct a non-asymptotic confidence interval for our estimator, turning the deterministic TRW bound into a much tighter probabilistic bound.\n\n3 Tree Reweighted Importance Sampling\n\nWe propose to use the collection of trees pT(x) and weights \u03c1T in TRW to form an importance sampling proposal,\n\nq(x; \u03b8^T) = \u2211_{T\u2208T} \u03c1T pT(x), (5)\n\nwhich defines an estimator \u02c6Z = (1/n) \u2211_{i=1}^n w(xi) with xi drawn i.i.d. from q(x; \u03b8^T). Our observation is that this proposal is good due to the special convex construction of TRW. To see this, we note that the reparameterization constraint \u2211T \u03c1T \u03b8T = \u03b8 can be rewritten as\n\nf(x; \u03b8) = Ztrw(\u03b8^T) \u220f_T [pT(x)]^{\u03c1T}, (6)\n\nthat is, f(x; \u03b8) is the {\u03c1T}-weighted geometric mean of pT(x) up to a constant Ztrw; on the other hand, q(x; \u03b8^T), by its definition, is the arithmetic mean of pT(x), and hence will always be larger than the geometric mean by the AM-GM inequality, guaranteeing good coverage of the target\u2019s probability. 
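This AM-GM argument is easy to verify numerically; in the sketch below (illustrative numbers only: p1 and p2 stand in for two tree distributions pT, rho for the tree weights, and Z_trw for Ztrw), the weights f(x)/q(x) never exceed the constant:

```python
# Two normalized 'tree' distributions on a 4-state toy space (illustrative).
p1 = [0.1, 0.2, 0.3, 0.4]
p2 = [0.4, 0.3, 0.2, 0.1]
rho = [0.5, 0.5]
Z_trw = 2.0  # plays the role of Ztrw in the reparameterization

# f(x) = Ztrw * prod_T pT(x)^rhoT  (the geometric-mean form);
# q(x) = sum_T rhoT * pT(x)        (the arithmetic-mean mixture proposal).
f = [Z_trw * (a ** rho[0]) * (b ** rho[1]) for a, b in zip(p1, p2)]
q = [rho[0] * a + rho[1] * b for a, b in zip(p1, p2)]

# AM-GM: geometric mean <= arithmetic mean, so w(x) = f(x)/q(x) <= Ztrw.
w = [fi / qi for fi, qi in zip(f, q)]
assert all(wi <= Z_trw + 1e-12 for wi in w)

# Z = sum_x f(x) equals E_q[w(x)], and is itself below Z_trw here.
Z = sum(f)
assert Z <= Z_trw
```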
To be specific, we have that q(x; \u03b8^T) is always no smaller than f(x; \u03b8)/Ztrw(\u03b8^T), and hence the importance weight w(x) is always upper bounded by Ztrw(\u03b8^T). Note that (5)\u2013(6) immediately implies that q(x; \u03b8^T) > 0 whenever f(x; \u03b8) > 0. We summarize our result as follows.\n\nProposition 3.1. i) If \u2211T \u03c1T \u03b8T = \u03b8, \u03c1T \u2265 0, \u2211T \u03c1T = 1, then the importance weight w(x) = f(x; \u03b8)/q(x; \u03b8^T), with q(x; \u03b8^T) defined in (5), satisfies\n\nw(x) \u2264 Ztrw(\u03b8^T), \u2200x \u2208 X, (7)\n\nthat is, the importance weights of (5) are always bounded by the TRW upper bound; this reinterprets the TRW optimization (3) as finding the mixture proposal in (5) that has the smallest upper bound on the importance weights.\nii) As a result, we have max{var(w(x)), \u02c6var(w(x))} \u2264 (1/4) Ztrw^2 for x \u223c q(x; \u03b8^T), where \u02c6var(w(x)) is the empirical variance of the weights. This implies that E(\u02c6Z \u2212 Z)^2 \u2264 Ztrw^2 / (4n).\n\nProof. i) Directly apply the AM-GM inequality on (5) and (6). ii) Note that E(w(x)) = Z and hence var(w(x)) = E(w(x)^2) \u2212 E(w(x))^2 \u2264 Ztrw Z \u2212 Z^2 \u2264 (1/4) Ztrw^2.\n\nNote that the TRW reparameterization (6) is key to establishing our results. Its advantage is two-fold: First, it provides a simple upper bound on w(x); for an arbitrary q(\u00b7), establishing such an upper bound may require a difficult combinatorial optimization over x. 
Second, it enables that bound to be optimized over q(\u00b7), resulting in a good proposal.\n\nEmpirical Bernstein Confidence Bound. The upper bound of w(x) in Proposition 3.1 allows us to use exponential concentration inequalities and construct tight finite-sample confidence bounds. Based on the empirical Bernstein inequality in Maurer and Pontil [19], we have\n\nCorollary 3.2 (Maurer and Pontil [19]). Let \u02c6Z be the IS estimator resulting from q(x) in (5). Define\n\n\u2206 = \u221a( 2 \u02c6var(w(x)) log(2/\u03b4) / n ) + 7 Ztrw(\u03b8^T) log(2/\u03b4) / ( 3(n \u2212 1) ), (8)\n\nwhere \u02c6var(w(x)) is the empirical variance of the weights; then \u02c6Z+ = \u02c6Z + \u2206 and \u02c6Z\u2212 = \u02c6Z \u2212 \u2206 are upper and lower bounds of Z with at least probability (1 \u2212 \u03b4), respectively, that is, Pr(Z \u2264 \u02c6Z+) \u2265 1 \u2212 \u03b4 and Pr(\u02c6Z\u2212 \u2264 Z) \u2265 1 \u2212 \u03b4.\nThe quantity \u2206 is quite intuitive, with the first term proportional to the empirical standard deviation and decaying at the classic 1/\u221an rate. The second term captures the possibility that the empirical variance is inaccurate; it depends on the boundedness of w(x) and decays at rate 1/n. Since \u02c6var(w) < Ztrw^2, the second term typically dominates for small n, and the first term for large n.\nWhen \u2206 is large, the lower bound \u02c6Z \u2212 \u2206 may be negative; this is most common when n is small and Ztrw is much larger than Z. In this case, we may replace \u02c6Z\u2212 with any deterministic lower bound, or with \u02c6Z\u03b4, which is a (1 \u2212 \u03b4) probabilistic bound by the Markov inequality; see Gogate and Dechter [11] for more Markov inequality based lower bounds. 
However, once n is large enough, we expect \u02c6Z\u2212 to be much tighter than using Markov\u2019s inequality, since \u02c6Z\u2212 also leverages boundedness and variance information.1 On the other hand, the Bernstein upper bound \u02c6Z+ readily gives a good upper bound, and is usually much tighter than Ztrw even with a relatively small n.\nFor example, if \u02c6Z \u226a Ztrw (e.g., the TRW bound is not tight), our upper bound \u02c6Z+ improves rapidly on Ztrw at rate 1/n and passes Ztrw when n \u2265 (7/3) log(2/\u03b4) + 1 (for example, for \u03b4 = 0.025 used in our experiments, we have \u02c6Z+ \u2264 Ztrw by n = 12). Meanwhile, one can show that the lower bound must be non-trivial (\u02c6Z\u2212 > 0) if n > 6 (Ztrw/\u02c6Z) log(2/\u03b4) + 1. During sampling, we can roughly estimate the point at which it will become non-trivial, by finding n such that \u02c6Z \u2265 \u2206. More rigorously, one can apply a stopping criterion [e.g., 5, 20] on n to guarantee a relative error \u03f5 with probability at least 1 \u2212 \u03b4, using the bound on w(x); roughly, the expected number of samples will depend on Ztrw/Z, the relative accuracy of the variational bound.\n\n4 Weighted Mini-bucket Importance Sampling\n\nWe have so far presented our results for tree reweighted BP on pairwise models, which approximates the model using combinations of trees. In this section, we give an extension of our results to general higher order models, and approximations based on combinations of low-treewidth graphs. Our extension is based on the weighted mini-bucket framework [7, 17, 16], but extensions based on other higher order generalizations of TRW, such as Globerson and Jaakkola [9], are also possible. We only sketch the main idea in this section.\nWe start by rewriting the distribution using the chain rule along some order o = [x1, . . .
, xp],\n\nf(x) = Z \u220f_k p(xk | xpa(k)), (9)\n\nwhere pa(k), called the induced parent set of k, is the set of variables adjacent to xk when it is eliminated along order o. The largest parent size \u03c9 := max_{k\u2208V} |pa(k)| is called the induced width of G along order o, and the computational complexity of exact variable elimination along order o is O(exp(\u03c9)), which is intractable when \u03c9 is large.\n\n[Footnote 1: The Markov lower bounds by Gogate and Dechter [11] have the undesirable property that they may not become tighter with increasing n, and may even decrease.]\n\nWeighted mini-bucket is an approximation method that avoids the O(exp(\u03c9)) complexity by splitting each pa(k) into several smaller \u201cmini-buckets\u201d pa\u2113(k), such that \u222a\u2113 pa\u2113(k) = pa(k), where the size of the pa\u2113(k) is controlled by a predefined number ibound \u2265 |pa\u2113(k)|, so that the ibound trades off the computational complexity with approximation quality. We associate each pa\u2113(k) with a nonnegative weight \u03c1k\u2113, such that \u2211\u2113 \u03c1k\u2113 = 1. The weighted mini-bucket algorithm in Liu [16] then frames a convex optimization to output an upper bound Zwmb \u2265 Z together with a set of \u201cpseudo-\u201d conditional distributions bk\u2113(xk|xpa\u2113(k)), such that\n\nf(x) = Zwmb \u220f_k \u220f_\u2113 bk\u2113(xk|xpa\u2113(k))^{\u03c1k\u2113}, (10)\n\nwhich, intuitively speaking, can be treated as approximating each conditional distribution p(xk|xpa(k)) with a geometric mean of the bk\u2113(xk|xpa\u2113(k)); while we omit the details of weighted mini-bucket [17, 16] for space, what is most important for our purpose is the representation (10). Similarly to TRW, we define a proposal distribution by replacing the geometric mean with an arithmetic mean:\n\nq(x) = \u220f_k \u2211_\u2113 \u03c1k\u2113 bk\u2113(xk|xpa\u2113(k)). (11)\n\nWe can again use the AM-GM inequality to obtain a bound on w(x), that w(x) \u2264 Zwmb.\n\nProposition 4.1. Let w(x) = f(x)/q(x), where f(x) and q(x) satisfy (10) and (11), with \u2211\u2113 \u03c1k\u2113 = 1, \u03c1k\u2113 \u2265 0, \u2200k, \u2113. Then,\n\nw(x) \u2264 Zwmb, \u2200x \u2208 X.\n\nProof. Use the AM-GM inequality, \u220f_\u2113 bk\u2113(xk|xpa\u2113(k))^{\u03c1k\u2113} \u2264 \u2211_\u2113 \u03c1k\u2113 bk\u2113(xk|xpa\u2113(k)), for each k.\n\nNote that the form of q(x) makes it convenient to sample by sequentially drawing each variable xk from the mixture \u2211_\u2113 \u03c1k\u2113 bk\u2113(xk|xpa\u2113(k)) along the reverse order [xp, . . . , x1]. The proposal q(x)
The proposal q(x)\nalso can be viewed as a mixture of a large number of models with induced width controlled by\nibound; this can be seen by expanding the form in (11),\n\n(cid:89)\n\n(cid:89)\n\nbk(cid:96)k (xk|xpa(cid:96)(k)).\n\nq(x) =\n\n\u03c1(cid:96)1\u00b7\u00b7\u00b7(cid:96)p q(cid:96)1\u00b7\u00b7\u00b7(cid:96)p (x), where \u03c1(cid:96)1\u00b7\u00b7\u00b7(cid:96)p =\n\n\u03c1k(cid:96)k ,\n\nq(cid:96)1\u00b7\u00b7\u00b7(cid:96)p (x) =\n\nk\n\nk\n\n(cid:88)\n\n(cid:96)1\u00b7\u00b7\u00b7(cid:96)p\n\n5 Experiments\n\nWe demonstrate our algorithm using synthetic Ising models, and real-world models from recent\nUAI inference challenges. We show that our TRW proposal can provide better estimates than other\nproposals constructed from mean \ufb01eld or loopy BP, particularly when it underestimates the partition\nfunction; in this case, the proposal may be too peaked and fail to approach the true value even for\nextremely large sample sizes n. Using the empirical Bernstein inequality, our TRW proposal also\nprovides strong probabilistic upper and lower bounds. When the model is relatively easy or n is\nlarge, our upper and lower bounds are close, demonstrating the estimate has high con\ufb01dence.\n5.1 MRFs on 10 \u00d7 10 Grids\nWe illustrate our method using pairwise Markov random \ufb01elds (2) on a 10 \u00d7 10 grid. We start with\na simple Ising model with \u03b8k(xk) = \u03c3sxk and \u03b8kl(xk, xl) = \u03c3pxkxl, xk \u2208 {\u22121, 1}, where \u03c3s\nrepresents the external \ufb01eld and \u03c3p the correlation. We \ufb01x \u03c3s = 0.01 and vary \u03c3p from \u22121.5 (strong\nnegative correlation) to 1.5 (strong positive correlation). 
Different \u03c3p lead to different inference hardness: inference is easy when the correlation is either very strong (|\u03c3p| large) or very weak (|\u03c3p| small), but difficult for an intermediate range of values, corresponding to a phase transition.\n\n[Figure 1: Experiments on 10 \u00d7 10 Ising models with interaction strength \u03c3p ranging from strong negative (-1.5) to strong positive (1.5). Panels (a)\u2013(c) plot log \u02c6Z \u2212 log Z against pairwise strength \u03c3p, with (a) and (b) at fixed n = 10^4 and (c) at fixed n = 10^7; panel (d) plots against sample size n at fixed \u03c3p = \u22120.5.]\n\nWe first run the standard variational algorithms, including loopy BP (LBP), tree reweighted BP (TRW), and mean field (MF). We then calculate importance sampling estimators based on each of the three algorithms. The TRW trees are chosen by adding random spanning trees until their union covers the grid; we assign uniform probability \u03c1T to each tree. The LBP proposal follows Gogate [10], constructing a (randomly selected) tree structured proposal based on the LBP pseudo-marginals. The MF proposal is q(x) = \u220f_{k\u2208V} qk(xk), where the qk(xk) are the mean field beliefs.\nFigure 1(a) shows the result of the IS estimates based on a relatively small number of importance samples (n = 10^4). 
In this case the TRW proposal outperforms both the MF and LBP proposals;\nall the methods degrade when \u03c3p \u2248 \u00b1.5, corresponding to inherently more dif\ufb01cult inference.\nHowever, the TRW proposal converges to the correct values when the correlation is strong (e.g.,\n|\u03c3p| > 1), while the MF and LBP proposals underestimate the true value, indicating that the MF and\nLBP proposals are too peaked, and miss a signi\ufb01cant amount of probability mass of the target.\nExamining the deterministic estimates, we note that the LBP approximation, which can be shown to\nbe a lower bound on these models [27, 24], is also signi\ufb01cantly worse than IS with the TRW pro-\nposal, and slightly worse than IS based on the LBP proposal. The TRW and MF bounds, of course,\nare far less accurate compared to either LBP or the IS methods, and are shown separately in Fig-\nure 1(b). This suggests it is often bene\ufb01cial to follow the variational procedure with an importance\nsampling process, and use the corresponding IS estimators instead of the variational approximations\nto estimate the partition function.\nFigure 1(b) compares the 95% con\ufb01dence interval of the IS based on the TRW proposal (\ufb01lled with\nred), with the interval formed by the TRW upper bound and the MF lower bound (\ufb01lled with green).\nWe can see that the Bernstein upper bound is much tighter than the TRW upper bound, although at\nthe cost of turning a deterministic bound into a (1 \u2212 \u03b4) probabilistic bound. On the other hand, the\nBernstein interval fails to report a meaningful lower bound when the model is dif\ufb01cult (\u03c3p \u2248 \u00b10.5),\nbecause n = 104 is small relative to the dif\ufb01culty of the model. 
As shown in Figure 1(c), our method eventually produces both tight upper and lower bounds as the sample size increases.\nFigure 1(d) shows the Bernstein bound as we increase n on a fixed model with \u03c3p = \u22120.5, which is relatively difficult according to Figure 1. Of the methods, our IS estimator becomes the most accurate by around n = 10^3 samples. We also show the Markov lower bound \u02c6Zmarkov = \u02c6Z\u03b4 as suggested by Gogate [10]; it provides non-negative lower bounds for all sample sizes, but does not converge to the true value even with n \u2192 +\u221e (in fact, it converges to Z\u03b4).\nIn addition to the simple Ising model, we also tested grid models with normally distributed parameters: \u03b8k(xk) \u223c N(0, \u03c3s^2) and \u03b8kl(xk, xl) \u223c N(0, \u03c3p^2). Figure 2 shows the results when \u03c3s = 0.01 and we vary \u03c3p. In this case, LBP tends to overestimate the partition function, and IS with the LBP proposal performs quite well (similarly to our TRW IS); but together with the previous example, this illustrates that it is hard to know whether BP will result in a high- or low-variance proposal. On this model, mean field IS is significantly worse and is not shown in the figure.\n\n[Figure 2: MRF with mixed interactions; log \u02c6Z \u2212 log Z against pairwise strength \u03c3p.]\n\nFigure 3: The Bernstein interval on (a) BN 6 and (b) BN 11 using ibound = 1 and different sample sizes n. 
These problems are relatively easy for variational approximations; we illustrate that our method gives tight bounds despite using no more memory than the original model.\n\nFigure 4: Results on a harder instance, pedigree20, at ibound = 8, 15 and different n. Panels: (a) ibound = 8; (b) ibound = 15.\n\n5.2 UAI Instances\n\nWe test the weighted mini-bucket (WMB) version of our algorithm on instances from past UAI approximate inference challenges. For space reasons, we only report a few instances for illustration.\nBN Instances. Figure 3 shows two Bayes net instances, BN 6 (true log Z = \u221258.41) and BN 11 (true log Z = \u221239.37). These examples are very easy for loopy BP, which estimates log Z nearly exactly, but of course gives no accuracy guarantees. For comparison, we run our WMB IS estimator using ibound = 1, i.e., cliques equal to the original factors. We find that we get tight confidence intervals by around 10^4\u201310^5 samples. For comparison, the method of Dagum and Luby [4], using the normalized distribution as a proposal, would require samples proportional to 1/Z: approximately 10^25 and 10^17, respectively.\nPedigree Instances. We next show results for our method on pedigree20 (log Z = \u221268.22, induced width \u03c9 = 21) and various ibounds; Figure 4 shows the results for ibound = 8 and 15. For comparison, we also evaluate GBP, defined on a junction graph with cliques found in the same way as WMB [18], and complexity controlled by the same ibound. Again, LBP and GBP generally give accurate estimates; the absolute error of LBP (not shown) is about 0.7, reducing to 0.4 and 0.2 at ibound = 8 and 15, respectively. The initial WMB bounds overestimate by 6.3 and 2.4 at ibound = 8 and 15, and are much less accurate. 
However, our method surpasses GBP's accuracy with a modest number of samples: for example, with ibound = 15 (Figure 4b), our IS estimator is more accurate than GBP with fewer than 100 samples, and our 95% Bernstein confidence interval passes GBP at roughly 1000 samples.

6 Conclusion
We propose a simple approximate inference method that augments convex variational bounds by adding importance sampling. Our formulation allows us to frame the variational optimization as designing a proposal that minimizes an upper bound on our estimator's variance, providing guarantees on the goodness of the resulting proposal. More importantly, this enables the construction of any-time probabilistic bounds that improve quickly and directly on state-of-the-art variational bounds, and provide certificates of accuracy given enough samples, relative to the error in the initial bound. One potential future direction is to adaptively improve the proposal during sampling.

Acknowledgements This work is supported in part by VITALITE, under the ARO MURI program (Award number W911NF-11-1-0391); NSF grants IIS-1065618 and IIS-1254071; and by the United States Air Force under Contract No. FA8750-14-C-0011 under the DARPA PPAML program.

References
[1] T. Bengtsson, P. Bickel, and B. Li. Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In Probability and Statistics: Essays in Honor of David A. Freedman, pages 316–334. Institute of Mathematical Statistics, 2008.
[2] J. Cheng. Sampling algorithms for estimating the mean of bounded random variables. Computational Statistics, 16(1):1–23, 2001.
[3] J. Cheng and M. Druzdzel.
AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research, 2000.
[4] P. Dagum and M. Luby. An optimal approximation algorithm for Bayesian inference. Artificial Intelligence, 93(1):1–27, 1997.
[5] P. Dagum, R. Karp, M. Luby, and S. Ross. An optimal algorithm for Monte Carlo estimation. SIAM Journal on Computing, 29:1484–1496, 2000.
[6] N. De Freitas, P. Højen-Sørensen, M. Jordan, and S. Russell. Variational MCMC. In UAI, 2001.
[7] R. Dechter and I. Rish. Mini-buckets: A general scheme for bounded inference. Journal of the ACM, 50(2):107–153, 2003.
[8] R. Fung and K. Chang. Weighing and integrating evidence for stochastic simulation in Bayesian networks. In UAI, 1990.
[9] A. Globerson and T. Jaakkola. Approximate inference using conditional entropy decompositions. In UAI, pages 130–138, 2007.
[10] V. Gogate. Sampling Algorithms for Probabilistic Graphical Models with Determinism. PhD thesis, UC Irvine, 2009.
[11] V. Gogate and R. Dechter. Sampling-based lower bounds for counting queries. Intelligenza Artificiale, 5(2):171–188, 2011.
[12] T. Hazan and T. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In ICML, 2012.
[13] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[14] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[15] J. Liu. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media, 2008.
[16] Q. Liu. Reasoning and Decisions in Probabilistic Graphical Models: A Unified Framework. PhD thesis, UC Irvine, 2014.
[17] Q. Liu and A. Ihler. Bounding the partition function using Hölder's inequality. In ICML, 2011.
[18] R. Mateescu, K. Kask, V. Gogate, and R. Dechter. Join-graph propagation algorithms. JAIR, 37(1):279–328, 2010.
[19] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, pages 115–124, 2009.
[20] V. Mnih, C. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In ICML, 2008.
[21] M.-S. Oh and J. Berger. Adaptive importance sampling in Monte Carlo integration. J. Stat. Comput. Simul., 41(3-4):143–168, 1992.
[22] F. Orabona, T. Hazan, A. Sarwate, and T. Jaakkola. On measure concentration of random maximum a-posteriori perturbations. In ICML, 2014.
[23] G. Papandreou and A. Yuille. Perturb-and-map random fields: Using discrete optimization to learn and sample from energy models. In ICCV, 2011.
[24] N. Ruozzi. The Bethe partition function of log-supermodular graphical models. In NIPS, 2012.
[25] T. Salimans, D. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, 2015.
[26] R. Shachter and M. Peot. Simulation approaches to general probabilistic inference on belief networks. In UAI, 1990.
[27] E. Sudderth, M. Wainwright, and A. Willsky. Loop series and Bethe variational bounds in attractive graphical models. In NIPS, pages 1425–1432, 2007.
[28] M. Wainwright. Estimating the wrong graphical model: Benefits in the computation-limited setting. JMLR, 7:1829–1859, 2006.
[29] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
[30] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. Information Theory, 51(7):2313–2335, 2005.
[31] Y. Wexler and D. Geiger. Importance sampling via variational optimization. In UAI, 2007.
[32] C. Yuan and M. Druzdzel. An importance sampling algorithm based on evidence pre-propagation. In UAI, pages 624–631, 2002.
[33] C. Yuan and M. Druzdzel. Importance sampling algorithms for Bayesian networks: Principles and performance. Mathematical and Computer Modeling, 43(9):1189–1207, 2006.
[34] C. Yuan and M. Druzdzel. Generalized evidence pre-propagated importance sampling for hybrid Bayesian networks. In AAAI, volume 7, pages 1296–1302, 2007.
[35] C. Yuan and M. Druzdzel. Theoretical analysis and practical insights on importance sampling in Bayesian networks. International Journal of Approximate Reasoning, 46(2):320–333, 2007.