{"title": "Particle-based Variational Inference for Continuous Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 826, "page_last": 834, "abstract": "Since the development of loopy belief propagation, there has been considerable work on advancing the state of the art for approximate inference over distributions defined on discrete random variables. Improvements include guarantees of convergence, approximations that are provably more accurate, and bounds on the results of exact inference.  However, extending these methods  to continuous-valued systems has lagged behind.  While several methods have been developed to use belief propagation on systems with continuous values, they have not as yet incorporated the recent advances for discrete variables. In this context we extend a recently proposed particle-based belief propagation algorithm to provide a general framework for adapting discrete message-passing algorithms to perform inference in continuous systems.  The resulting algorithms behave similarly to their purely discrete counterparts, extending the benefits of these more advanced inference techniques to the continuous domain.", "full_text": "Particle-based Variational Inference\n\nfor Continuous Systems\n\nAlexander T. Ihler\n\nDept. of Computer Science\nUniv. of California, Irvine\nihler@ics.uci.edu\n\nAndrew J. Frank\n\nDept. of Computer Science\nUniv. of California, Irvine\najfrank@ics.uci.edu\n\nPadhraic Smyth\n\nDept. of Computer Science\nUniv. of California, Irvine\nsmyth@ics.uci.edu\n\nAbstract\n\nSince the development of loopy belief propagation, there has been considerable\nwork on advancing the state of the art for approximate inference over distributions\nde\ufb01ned on discrete random variables. Improvements include guarantees of con-\nvergence, approximations that are provably more accurate, and bounds on the re-\nsults of exact inference. 
However, extending these methods to continuous-valued\nsystems has lagged behind. While several methods have been developed to use be-\nlief propagation on systems with continuous values, recent advances for discrete\nvariables have not as yet been incorporated.\nIn this context we extend a recently proposed particle-based belief propagation\nalgorithm to provide a general framework for adapting discrete message-passing\nalgorithms to inference in continuous systems. The resulting algorithms behave\nsimilarly to their purely discrete counterparts, extending the bene\ufb01ts of these more\nadvanced inference techniques to the continuous domain.\n\n1\n\nIntroduction\n\nGraphical models have proven themselves to be an effective tool for representing the underlying\nstructure of probability distributions and organizing the computations required for exact and ap-\nproximate inference. Early examples of the use of graph structure for inference include join or\njunction trees [1] for exact inference, Markov chain Monte Carlo (MCMC) methods [2], and vari-\national methods such as mean \ufb01eld and structured mean \ufb01eld approaches [3]. Belief propagation\n(BP), originally proposed by Pearl [1], has gained in popularity as a method of approximate infer-\nence, and in the last decade has led to a number of more sophisticated algorithms based on conjugate\ndual formulations and free energy approximations [4, 5, 6].\nHowever, the progress on approximate inference in systems with continuous random variables has\nnot kept pace with that for discrete random variables. Some methods, such as MCMC techniques, are\ndirectly applicable to continuous domains, while others such as belief propagation have approximate\ncontinuous formulations [7, 8]. Sample-based representations, such as are used in particle \ufb01ltering,\nare particularly appealing as they are relatively easy to implement, have few numerical issues, and\nhave no inherent distributional assumptions. 
Our aim is to extend particle methods to take advantage\nof recent advances in approximate inference algorithms for discrete-valued systems.\nSeveral recent algorithms provide signi\ufb01cant advantages over loopy belief propagation. Double-\nloop algorithms such as CCCP [9] and UPS [10] use the same approximations as BP but guarantee\nconvergence. More general approximations can be used to provide theoretical bounds on the results\nof exact inference [5, 3] or are guaranteed to improve the quality of approximation [6], allowing\nan informed trade-off between computation and accuracy. Like belief propagation, they can be\nformulated as local message-passing algorithms on the graph, making them amenable to parallel\ncomputation [11] or inference in distributed systems [12, 13].\n\n1\n\n\fIn short, the algorithmic characteristics of these recently-developed algorithms are often better, or at\nleast more \ufb02exible, than those of BP. However, these methods have not been applied to continuous\nrandom variables, and in fact this subject was one of the open questions posed at a recent NIPS\nworkshop [14].\nIn order to develop particle-based approximations for these algorithms, we focus on one particular\ntechnique for concreteness: tree-reweighted belief propagation (TRW) [5]. TRW represents one\nof the earliest of a recent class of inference algorithms for discrete systems, but as we discuss in\nSection 2.2 the extensions of TRW can be incorporated into the same framework if desired.\nThe basic idea of our algorithm is simple and extends previous particle formulations of exact infer-\nence [15] and loopy belief propagation [16]. We use collections of samples drawn from the con-\ntinuous state space of each variable to de\ufb01ne a discrete problem, \u201clifting\u201d the inference task from\nthe original space to a restricted, discrete domain on which TRW can be performed. 
At any point,\nthe current results of the discrete inference can be used to re-select the sample points from a vari-\nable\u2019s continuous domain. This iterative interaction between the sample locations and the discrete\nmessages produces a dynamic discretization that adapts itself to the inference results.\nWe demonstrate that TRW and similar methods can be naturally incorporated into the lifted, discrete\nphase of particle belief propagation and that they confer similar benefits on the continuous problem\nas hold in truly discrete systems. To this end we measure the performance of the algorithm on an\nIsing grid, an analogous continuous model, and the sensor localization problem. In each case, we\nshow that tree-reweighted particle BP exhibits behavior similar to TRW and produces significantly\nmore robust marginal estimates than ordinary particle BP.\n\n2 Graphical Models and Inference\n\nGraphical models provide a convenient formalism for describing structure within a probability dis-\ntribution p(X) defined over a set of variables X = {x1, . . . , xn}. This structure can then be applied\nto organize computations over p(X) and construct efficient algorithms for many inference tasks,\nincluding optimization to find a maximum a posteriori (MAP) configuration, marginalization, or\ncomputing the likelihood of observed data.\n\n2.1 Factor Graphs\n\nFactor graphs [17] are a particular type of graphical model that describe the factorization struc-\nture of the distribution p(X) using a bipartite graph consisting of factor nodes and variable nodes.\nSpecifically, suppose such a graph G consists of factor nodes F = {f1, . . . , fm} and variable nodes\nX = {x1, . . . , xn}. Let Xu \u2286 X denote the neighbors of factor node fu and Fs \u2286 F denote the\nneighbors of variable node xs. Then, G is consistent with a distribution p(X) if and only if\n\n$$p(x_1, \\ldots, x_n) = \\frac{1}{Z} \\prod_{u=1}^{m} f_u(X_u). \\qquad (1)$$\n\nIn a common abuse of notation, we use the same symbols to represent each variable node and its\nassociated variable xs, and similarly for each factor node and its associated function fu. Each factor\nfu corresponds to a strictly positive function over a subset of the variables. The graph connectivity\ncaptures the conditional independence structure of p(X), enabling the development of efficient exact\nand approximate inference algorithms [1, 17, 18]. The quantity Z, called the partition function, is\nalso of importance in many problems; for example in normalized distributions such as Bayes nets, it\ncorresponds to the probability of evidence and can be used for model comparison.\nA common inference problem is that of computing the marginal distributions of p(X). Specifically,\nfor each variable xs we are interested in computing the marginal distribution\n\n$$p_s(x_s) = \\int_{X \\setminus x_s} p(X) \\, \\partial X.$$\n\nFor discrete-valued variables X, the integral is replaced by a summation.\nWhen the variables are discrete and the graph G representing p(X) forms a tree (G has no cy-\ncles), marginalization can be performed efficiently using the belief propagation or sum-product al-\ngorithm [1, 17]. For inference in more general graphs, the junction tree algorithm [19] creates a\ntree-structured hypergraph of G and then performs inference on this hypergraph. The computational\ncomplexity of this process is $O(n d^b)$, where d is the number of possible values for each variable and\nb is the maximal clique size of the hypergraph. Unfortunately, for even moderate values of d, this\ncomplexity becomes prohibitive for even relatively small b.\n\n2.2 Approximate Inference\n\nLoopy BP [1] is a popular alternative to exact methods and proceeds by iteratively passing \u201cmes-\nsages\u201d between variable and factor nodes in the graph as though the graph were a tree (ignoring\ncycles). 
The algorithm is exact when the graph is tree-structured and can provide excellent approx-\nimations in some cases even when the graph has loops. However, in other cases loopy BP may\nperform poorly, have multiple fixed points, or fail to converge at all.\nMany of the more recent varieties of approximate inference are framed explicitly as an optimiza-\ntion of local approximations over locally defined cost functions. Variational or free-energy based\napproaches convert the problem of exact inference into the optimization of a free energy function\nover the set of realizable marginal distributions M, called the marginal polytope [18]. Approximate\ninference then corresponds to approximating the constraint set and/or energy function. Formally,\n\n$$\\max_{\\mu \\in M} \\mathbb{E}_\\mu[\\log P(X)] + H(\\mu) \\;\\approx\\; \\max_{\\mu \\in \\widehat{M}} \\mathbb{E}_\\mu[\\log P(X)] + \\widehat{H}(\\mu)$$\n\nwhere H is the entropy of the distribution corresponding to \u00b5. Since the solution \u00b5 may not corre-\nspond to the marginals of any consistent joint distribution, these approximate marginals are typically\nreferred to as pseudomarginals. If both the constraints in $\\widehat{M}$ and approximate entropy $\\widehat{H}$ decompose\nlocally on the graph, the optimization process can be interpreted as a message-passing procedure,\nand is often performed using fixed-point equations like those of BP.\nBelief propagation can be understood in this framework as corresponding to an outer approxima-\ntion $\\widehat{M} \\supseteq M$ enforcing local consistency and the Bethe approximation to H [4]. This viewpoint\nprovides a clear path to directly improve upon the properties of BP, leading to a number of differ-\nent algorithms. For example, CCCP [9] and UPS [10] make the same approximations but use an\nalternative, direct optimization procedure to ensure convergence. 
Fractional belief propagation [20]\ncorresponds to a more general Bethe-like approximation with additional parameters, which can be\nmodified to ensure that the cost function is convex and used with convergent algorithms [21]. A\nspecial case includes tree-reweighted belief propagation [5], which both ensures convexity and pro-\nvides an upper bound on the partition function Z. The approximation of M can also be improved\nusing cutting plane methods, which include additional, higher-order consistency constraints on the\npseudomarginals [6]. Other choices of local cost functions lead to alternative families of approxi-\nmations [8].\nOverall, these advances have provided significant improvements in the state of the art for approxi-\nmate inference in discrete-valued systems. They provide increased flexibility, theoretical bounds on\nthe results of exact inference, and can provably increase the quality of the estimates. However, these\nadvances have not been carried over into the continuous domain.\nFor concreteness, in the rest of the paper we will use tree-reweighted belief propagation (TRW) [5]\nas our inference method of choice, although the same ideas can be applied to any of the discussed\ninference algorithms. As we will see shortly, the details specific to TRW are nicely encapsulated\nand can be swapped out for those of another algorithm with minimal effort.\nThe fixed-point equations for TRW lead to a message-passing algorithm similar to BP, defined by\n\n$$m_{x_s \\to f_u}(x_s) \\propto \\frac{\\prod_{f_v \\in F_s} m_{f_v \\to x_s}(x_s)^{\\rho_v}}{m_{f_u \\to x_s}(x_s)}, \\qquad m_{f_u \\to x_s}(x_s) \\propto \\sum_{X_u \\setminus x_s} f_u(X_u)^{1/\\rho_u} \\prod_{x_t \\in X_u \\setminus x_s} m_{x_t \\to f_u}(x_t) \\qquad (2)$$\n\nThe parameters \u03c1v are called edge weights or appearance probabilities. For TRW, the \u03c1 are required\nto correspond to the fractional occurrence rates of the edges in some collection of tree-structured\nsubgraphs of G. 
The choice of \u03c1 affects the quality of the approximation; the tightest upper bound\ncan be obtained via a convex optimization of \u03c1 which computes the pseudomarginals as an inner\nloop.\n\n3 Continuous Random Variables\n\nFor continuous-valued random variables, many of these algorithms cannot be applied directly. In\nparticular, any reasonably fine-grained discretization produces a discrete variable whose domain size\nd is quite large. The domain size is typically exponential in the dimension of the variable and the\ncomplexity of the message-passing algorithms is $O(n d^b)$, where n is the total number of variables\nand b is the number of variables in the largest factor. Thus, the computational cost can quickly\nbecome intractable even with pairwise factors over low dimensional variables. Our goal is to adapt\nthe algorithms of Section 2.2 to perform efficient approximate inference in such systems.\nFor time-series problems, in which G forms a chain, a classical solution is to use sequential Monte\nCarlo approximations, generally referred to as particle filtering [22]. These methods use samples to\ndefine an adaptive discretization of the problem with fine granularity in regions of high probability.\nThe stochastic nature of the discretization is simple to implement and enables probabilistic assur-\nances of quality including convergence rates which are independent of the problem\u2019s dimensionality.\n(In sufficiently few dimensions, deterministic adaptive discretizations can also provide a competitive\nalternative, particularly if the factors are analytically tractable [23, 24].)\n\n3.1 Particle Representations for Message-Passing\n\nParticle-based approximations have been extended to loopy belief propagation as well. 
For example,\nin the nonparametric belief propagation (NBP) algorithm [7], the BP messages are represented as\nGaussian mixtures and message products are approximated by drawing samples, which are then\nsmoothed to form new Gaussian mixture distributions. A key aspect of this approach is the fact that\nthe product of several mixtures of Gaussians is also a mixture of Gaussians, and thus can be sampled\nfrom with relative ease. However, it is difficult to see how to extend this algorithm to more general\nmessage-passing algorithms, since for example the TRW fixed point equations (2) involve ratios and\npowers of messages, which do not have a simple form for Gaussian mixtures and may not even form\nfinitely integrable functions.\nInstead, we adapt a recent particle belief propagation (PBP) algorithm [16] to work on the tree-\nreweighted formulation. In PBP, samples (particles) are drawn for each variable, and each message\nis represented as a set of weights over the available values of the target variable. At a high level,\nthe procedure iterates between sampling particles from each variable\u2019s domain, performing inference\nover the resulting discrete problem, and adaptively updating the sampling distributions. This process\nis illustrated in Figure 1. Formally, we define a proposal distribution Ws(xs) for each variable xs\nsuch that Ws(xs) is non-zero over the domain of xs. Note that we may rewrite the factor message\ncomputation (2) as an importance reweighted expectation:\n\n$$m_{f_u \\to x_s}(x_s) \\propto \\mathbb{E}_{X_u \\setminus x_s} \\left[ f_u(X_u)^{1/\\rho_u} \\prod_{x_t \\in X_u \\setminus x_s} \\frac{m_{x_t \\to f_u}(x_t)}{W_t(x_t)} \\right] \\qquad (3)$$\n\nLet us index the variables that are neighbors of factor fu as $X_u = \\{x_{u_1}, \\ldots, x_{u_b}\\}$. Then, after\nsampling particles $\\{x_s^{(1)}, \\ldots, x_s^{(N)}\\}$ from Ws(xs), we can index a particular assignment of parti-\ncle values to the variables in Xu with $X_u^{(\\vec{j})} = [x_{u_1}^{(j_1)}, \\ldots, x_{u_b}^{(j_b)}]$. We then obtain a finite-sample\napproximation of the factor message in the form\n\n$$m_{f_u \\to x_{u_k}}\\left(x_{u_k}^{(j)}\\right) \\propto \\frac{1}{N^{b-1}} \\sum_{\\vec{i} : i_k = j} \\left[ f_u\\left(X_u^{(\\vec{i})}\\right)^{1/\\rho_u} \\prod_{l \\neq k} \\frac{m_{x_{u_l} \\to f_u}\\left(x_{u_l}^{(i_l)}\\right)}{W_{x_{u_l}}\\left(x_{u_l}^{(i_l)}\\right)} \\right] \\qquad (4)$$\n\nIn other words, we construct a Monte Carlo approximation to the integral using importance weighted\nsamples from the proposal. Each of the values in the message then represents an estimate of the\ncontinuous function (2) evaluated at a single particle. Observe that the sum is over $N^{b-1}$ elements,\nand hence the complexity of computing an entire factor message is $O(N^b)$; this could be made more\nefficient at the price of increased stochasticity by summing over a random subsample of the vectors\n$\\vec{i}$.\n\n[Figure 1 labels: (1) Sample $\\{x_s^{(i)}\\} \\sim W_s(x_s)$; (2) Inference on discrete system, with node labels $\\mu(x_s^{(i)})$, $\\mu(x_t^{(j)})$ and factor label $f(x_s^{(i)}, x_t^{(j)})$; (3) Adjust $W_t(x_t)$.]\n\nFigure 1: Schematic view of particle-based inference.\n(1) Samples for each variable provide a\ndynamic discretization of the continuous space; (2) inference proceeds by optimization or message-\npassing in the discrete space; (3) the resulting local functions can be used to change the proposals\nWs(\u00b7) and choose new sample locations for each variable.\n\n
Likewise, we compute variable messages and beliefs as simple point-wise products:\n\n$$m_{x_s \\to f_u}\\left(x_s^{(j)}\\right) \\propto \\frac{\\prod_{f_v \\in F_s} m_{f_v \\to x_s}\\left(x_s^{(j)}\\right)^{\\rho_v}}{m_{f_u \\to x_s}\\left(x_s^{(j)}\\right)}, \\qquad b_s\\left(x_s^{(j)}\\right) \\propto \\prod_{f_v \\in F_s} m_{f_v \\to x_s}\\left(x_s^{(j)}\\right)^{\\rho_v} \\qquad (5)$$\n\nThis parallels the development in [16], except here we use factor weights $\\vec{\\rho}$ to compute messages\naccording to TRW rather than standard loopy BP.\nJust as in discrete problems, it is often desirable to obtain estimates of the log partition function for\nuse in goodness-of-fit testing or model comparison. Our implementation of TRW-PBP gives us a\nstochastic estimate of an upper bound on the true partition function. Using other message passing\napproaches that fit into this framework, such as mean field, can provide a similar lower bound.\nThese bounds provide a possible alternative to Monte Carlo estimates of marginal likelihood [25].\n\n3.2 Rao-Blackwellized Estimates\n\nQuantities about xs such as expected values under the pseudomarginal can be computed using the\nsamples $x_s^{(i)}$. However, for any given variable node xs, the incoming messages to xs given in (4) are\ndefined in terms of the importance weights and sampled values of the neighboring variables. Thus,\nwe can compute an estimate of the messages and beliefs defined in (4)\u2013(5) at arbitrary values of xs,\nsimply by evaluating (4) at that point. This allows us to perform Rao-Blackwellization, conditioning\non the samples at the neighbors of xs rather than using xs\u2019s samples directly.\nUsing this trick we can often get much higher quality estimates from the inference for small N. 
In\nparticular, if the variable state spaces are sufficiently small that they can be discretized (for example,\nin 3 or fewer dimensions the discretized domain size d may be manageable) but the resulting factor\ndomain size, $d^b$, is intractably large, we can evaluate (4) on the discretized grid for only $O(d N^{b-1})$.\nMore generally, we can substitute a larger number of samples $N' \\gg N$ with cost that grows only\nlinearly in $N'$.\n\n3.3 Resampling and Proposal Distributions\n\nAnother critical point is that the efficiency of this procedure hinges on the quality of the proposal\ndistributions Ws. Unfortunately, this forms a circular problem \u2013 W must be chosen to perform\ninference, but the quality of W depends on the distribution and its pseudomarginals. This interde-\npendence motivates an attempt to learn the sampling distributions in an online fashion, adaptively\nupdating them based on the results of the partially completed inference procedure. Note that this\nprocedure depends on the same properties as Rao-Blackwellized estimates: that we be able to com-\npute our messages and beliefs at a new set of points given the message weights at the other nodes.\nBoth [15] and [16] suggest using the current belief at each iteration to form a new proposal dis-\ntribution. In [15], parametric density estimates are formed using the message-weighted samples\nat the current iteration, which form the sampling distributions for the next phase. In [16], a short\nMetropolis-Hastings MCMC sequence is run at a single node, using the Rao-Blackwellized belief\nestimate to compute an acceptance probability. A third possibility is to use a sampling/importance\nresampling (SIR) procedure, drawing a large number of samples, computing weights, and prob-\nabilistically retaining only N. In our experiments we draw samples from the current beliefs, as\napproximated by Rao-Blackwellized estimation over a fine grid of particles. For variables in more\nthan 2 dimensions, we recommend the Metropolis-Hastings approach.\n\nFigure 2: 2-D Ising model performance. L1 error for PBP (left) and TRW-PBP (center) for varying\nnumbers of particles; (right) PBP and TRW-PBP juxtaposed to reveal the gap for high \u03b7.\n\n4 Ising-like Models\n\nThe Ising model corresponds to a graphical model, typically a grid, over binary-valued variables with\npairwise factors. Originating in statistical physics, similar models are common in many applications\nincluding image denoising and stereo depth estimation.\nIsing models are well understood, and\nprovide a simple example of how BP can fail and the benefits of more general forms such as TRW.\nWe initially demonstrate the behavior of our particle-based algorithms on a small (3 \u00d7 3) lattice\nof binary-valued variables to compare with the exact discrete implementations, then show that the\nsame observed behavior arises in an analogous continuous-valued problem.\n\n4.1 Ising model\n\nOur factors consist of single-variable and pairwise functions, given by\n\n$$f(x_s) = [\\, 0.5 \\;\\; 0.5 \\,], \\qquad f(x_s, x_t) = \\begin{bmatrix} \\eta & 1 - \\eta \\\\ 1 - \\eta & \\eta \\end{bmatrix} \\qquad (6)$$\n\nfor \u03b7 > .5. By symmetry, it is easy to see that the true marginal of each variable is uniform, [.5 .5].\nHowever, around \u03b7 \u2248 .78 there is a phase transition; the uniform fixed point becomes unstable and\nseveral others appear, becoming more skewed toward one state or another as \u03b7 increases. 
As the\nstrength of coupling in an Ising model increases, the performance of BP often degrades sharply,\nwhile TRW is comparatively robust and remains near the true marginals [5].\nFigure 2 shows the performance of PBP and TRW-PBP on this model. Each data point represents\nthe median L1 error between the beliefs and the true marginals, across all nodes and 40 randomly\ninitialized trials, after 50 iterations. The left plot (BP) clearly shows the phase shift; in contrast,\nthe error of TRW remains low even for very strong interactions. In both cases, as N increases the\nparticle versions of the algorithms converge to their discrete equivalents.\n\n4.2 Continuous grid model\n\nThe results for discrete systems, and their corresponding intuition, carry over naturally into contin-\nuous systems as well. To illustrate on an interpretable analogue of the Ising model, we use the same\ngraph structure but with real-valued variables, and factors given by:\n\n$$f(x_s) = \\exp\\left(-\\frac{x_s^2}{2\\sigma_l^2}\\right) + \\exp\\left(-\\frac{(x_s - 1)^2}{2\\sigma_l^2}\\right), \\qquad f(x_s, x_t) = \\exp\\left(-\\frac{|x_s - x_t|^2}{2\\sigma_p^2}\\right). \\qquad (7)$$\n\nLocal factors consist of bimodal Gaussian mixtures centered at 0 and 1, while pairwise factors\nencourage similarity using a zero-mean Gaussian on the distance between neighboring variables.\nWe set \u03c3l = 0.2 and vary \u03c3p analogously to \u03b7 in the discrete model. Since all potentials are Gaussian\nmixtures, the joint distribution is also a Gaussian mixture and can be computed exactly.\n\n[Figure 2 plots: L1 error versus \u03b7 for 20, 100, and 500 particles (BP and TRW panels), and PBP 500 versus TRW-PBP 500.]\n\nFigure 3: Continuous grid model performance. 
L1 error for PBP (left) and TRW-PBP (center) for\nvarying numbers of particles; (right) PBP and TRW-PBP juxtaposed to reveal the gap for low \u03c3p.\n\nFigure 3 shows the results of running PBP and TRW-PBP on the continuous grid model, demon-\nstrating similar characteristics to the discrete model. The left panel reveals that our continuous grid\nmodel also induces a phase shift in PBP, much like that of the Ising model. For suf\ufb01ciently small\nvalues of \u03c3p (large values on our transformed axis), the beliefs in PBP collapse to unimodal distri-\nbutions with an L1 error of 1. In contrast, TRW-PBP avoids this collapse and maintains multi-modal\ndistributions throughout; its primary source of error (0.2 at 500 particles) corresponds to overdis-\npersed bimodal beliefs. This is expected in attractive models, in which BP tends to \u201covercount\u201d\ninformation leading to underestimates of variance; TRW removes some of this overcounting and\nmay overestimate uncertainty.\nAs mentioned in Section 3.1, we can use the results of\nTRW-PBP to compute an upper bound on the log parti-\ntion function. We implement naive mean \ufb01eld within this\nsame framework to achieve a lower bound as well. The\nresulting bounds, computed for a continuous grid model\nin which mean \ufb01eld collapses to a single mode, are shown\nin Figure 4. With suf\ufb01ciently many particles, the values\nproduced by TRW-PBP and MF inference bound the true\nvalue, as they should. With only 20 particles per variable,\nhowever, TRW-PBP occasionally fails and yields \u201cupper\nbounds\u201d below the true value. This is not surprising; the\nconsistency guarantees associated with the importance-\nreweighted expectation take effect only when N is suf\ufb01-\nciently large.\n\nFigure 4: Bounds on the log partition\nfunction.\n\n5 Sensor Localization\nWe also demonstrate the presence of these effects in a simulation of a real-world application. 
Sensor\nlocalization considers the task of estimating the position of a collection of sensors in a network given\nnoisy estimates of a subset of the distances between pairs of sensors, along with known positions\nfor a small number of anchor nodes. Typical localization algorithms operate by optimizing to \ufb01nd\nthe most likely joint con\ufb01guration of sensor positions. A classical model consists of (at a minimum)\nthree anchor nodes, and a Gaussian model on the noise in the distance observations.\nIn [12], this problem is formulated as a graphical model and an alternative solution is proposed\nusing nonparametric belief propagation to perform approximate marginalization. A signi\ufb01cant ad-\nvantage of this approach is that by providing approximate marginals, we can estimate the degree\nof uncertainty in the sensor positions. Gauging this uncertainty can be particularly important when\nthe distance information is suf\ufb01ciently ambiguous that the posterior belief is multi-modal, since in\nthis case the estimated sensor position may be quite far from its true value. Unfortunately, belief\npropagation is not ideal for identifying multimodality, since the model is essentially attractive. BP\nmay underestimate the degree of uncertainty in the marginal distributions and (as in the case of the\nIsing-like models in the previous section) collapse into a single mode, providing beliefs which are\nmisleadingly overcon\ufb01dent.\nFigure 5 shows a set of sensor con\ufb01gurations where this is the case. The distance observations\ninduce a fully connected graph; the edges are omitted for clarity. In this network the anchor nodes\nare nearly collinear. 
This induces a bimodal uncertainty about the locations of the remaining nodes\n\u2013 the configuration in which they are all reflected across the crooked line formed by the anchors is\nnearly as likely as the true configuration. Although this example is anecdotal, it reflects a situation\nwhich can arise regularly in practice [26].\n\n[Figure 3 plots: L1 error versus $\\log(\\sigma_p^{-2})$ for 20, 100, and 500 particles, and PBP 500 versus TRW-PBP 500.]\n\n(a) Exact (b) PBP (c) TRW-PBP\n\nFigure 5: Sensor location belief at the target node. (a) Exact belief computed using importance\nsampling. (b) PBP collapses and represents only one of the two modes. (c) TRW-PBP\noverestimates the uncertainty around each mode, but represents both.\n\nFigure 5a shows the true marginal distribution for one node, estimated exhaustively using importance\nsampling with $5 \\times 10^6$ samples. It shows a clear bimodal structure \u2013 a slightly larger mode near the\nsensor\u2019s true location and a smaller mode at a point corresponding to the reflection. In this system\nthere is not enough information in the measurements to resolve the sensor positions. We compare\nthese marginals to the results found using PBP.\nFigure 5b displays the Rao-Blackwellized belief estimate for one node after 20 iterations of PBP\nwith each variable represented by 100 particles. Only one mode is present, suggesting that PBP\u2019s\nbeliefs have \u201ccollapsed,\u201d just as in the highly attractive Ising model. Examination of the other\nnodes\u2019 beliefs (not shown for space) confirms that all are unimodal distributions centered around\ntheir reflected locations. It is worth noting that PBP converged to the alternative set of unimodal\nbeliefs (supporting the true locations) in about half of our trials. 
Such an outcome is only slightly\nbetter; an accurate estimate of con\ufb01dence is equally important.\nThe corresponding belief estimate generated by TRW-PBP is shown in Figure 5c.\nIt is clearly\nbimodal, with signi\ufb01cant probability mass supporting both the true and re\ufb02ected locations. Also,\neach of the two modes is less concentrated than the belief in 5b. As with the continuous grid model\nwe see increased stability at the price of conservative overdispersion. Again, similar effects occur\nfor the other nodes in the network.\n\n6 Conclusion\nWe propose a framework for extending recent advances in discrete approximate inference for appli-\ncation to continuous systems. The framework directly integrates reweighted message passing algo-\nrithms such as TRW into the lifted, discrete phase of PBP. Furthermore, it allows us to iteratively\nadjust the proposal distributions, providing a discretization that adapts to the results of inference,\nand allows us to use Rao-Blackwellized estimates to improve our \ufb01nal belief estimates.\nWe consider the particular case of TRW and show that its bene\ufb01ts carry over directly to continuous\nproblems. Using an Ising-like system, we argue that phase transitions exist for particle versions of\nBP similar to those found in discrete systems, and that TRW signi\ufb01cantly improves the quality of the\nestimate in those regimes. This improvement is highly relevant to approximate marginalization for\nsensor localization tasks, in which it is important to accurately represent the posterior uncertainty.\nThe \ufb02exibility in the choice of message passing algorithm makes it easy to consider several instan-\ntiations of the framework and use the one best suited to a particular problem. 
Furthermore, future\nimprovements in message-passing inference algorithms on discrete systems can be directly incorporated\ninto continuous problems.\n\nAcknowledgements: This material is based upon work partially supported by the Office of Naval\nResearch under MURI grant N00014-08-1-1015.\n\nReferences\n\n[1] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.\n[2] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, 6(6):721\u2013741, November 1984.\n[3] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183\u2013233, 1999.\n[4] J. Yedidia, W. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. Technical Report 2004-040, MERL, May 2004.\n[5] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. Info. Theory, 51(7):2313\u20132335, July 2005.\n[6] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In NIPS 20, pages 1393\u20131400. MIT Press, Cambridge, MA, 2008.\n[7] E. Sudderth, A. Ihler, W. Freeman, and A. Willsky. Nonparametric belief propagation. In CVPR, 2003.\n[8] T. Minka. Divergence measures and message passing. Technical Report 2005-173, Microsoft Research Ltd, January 2005.\n[9] A. Yuille. CCCP algorithms to minimize the Bethe and Kikuchi free energies: convergent alternatives to belief propagation. Neural Comput., 14(7):1691\u20131722, 2002.\n[10] Y.-W. Teh and M. Welling. The unified propagation and scaling algorithm. In NIPS 14. 2002.\n[11] J. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. 
In Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, Florida, April 2009.\n[12] A. Ihler, J. Fisher, R. Moses, and A. Willsky. Nonparametric belief propagation for self-calibration in sensor networks. IEEE J. Select. Areas Commun., pages 809\u2013819, April 2005.\n[13] J. Schiff, D. Antonelli, A. Dimakis, D. Chu, and M. Wainwright. Robust message-passing for statistical inference in sensor networks. In IPSN, pages 109\u2013118, April 2007.\n[14] A. Globerson, D. Sontag, and T. Jaakkola. Approximate inference \u2013 How far have we come? (NIPS\u201908 Workshop), 2008. http://www.cs.huji.ac.il/~gamir/inference-workshop.html.\n[15] D. Koller, U. Lerner, and D. Angelov. A general algorithm for approximate inference and its application to hybrid Bayes nets. In UAI 15, pages 324\u2013333, 1999.\n[16] A. Ihler and D. McAllester. Particle belief propagation. In AI & Statistics: JMLR W&CP, volume 5, pages 256\u2013263, April 2009.\n[17] F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Trans. Info. Theory, 47(2):498\u2013519, February 2001.\n[18] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Technical Report 629, UC Berkeley Dept. of Statistics, September 2003.\n[19] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological), pages 157\u2013224, 1988.\n[20] W. Wiegerinck and T. Heskes. Fractional belief propagation. In NIPS 15, pages 438\u2013445. 2003.\n[21] T. Hazan and A. Shashua. Convergent message-passing algorithms for inference over general graphs with convex free energies. In UAI 24, pages 264\u2013273. July 2008.\n[22] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. 
A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Processing, 50(2):174\u2013188, February 2002.\n[23] J. Coughlan and H. Shen. Dynamic quantization for belief propagation in sparse spaces. Comput. Vis. Image Underst., 106(1):47\u201358, 2007.\n[24] M. Isard, J. MacCormick, and K. Achan. Continuously-adaptive discretization for message-passing algorithms. In NIPS 21, pages 737\u2013744. 2009.\n[25] S. Chib. Marginal likelihood from the Gibbs output. JASA, 90(432):1313\u20131321, 1995.\n[26] D. Moore, J. Leonard, D. Rus, and S. Teller. Robust distributed network localization with noisy range measurements. In 2nd Int\u2019l Conf. on Emb. Networked Sensor Sys. (SenSys\u201904), pages 50\u201361, 2004.", "award": [], "sourceid": 862, "authors": [{"given_name": "Andrew", "family_name": "Frank", "institution": null}, {"given_name": "Padhraic", "family_name": "Smyth", "institution": null}, {"given_name": "Alexander", "family_name": "Ihler", "institution": null}]}