{"title": "Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations", "book": "Advances in Neural Information Processing Systems", "page_first": 553, "page_last": 560, "abstract": "We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but unlike max-product it always converges, and can be proven to find the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches.", "full_text": "Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations\n\nAmir Globerson, Tommi Jaakkola\nComputer Science and Artificial Intelligence Laboratory\nMassachusetts Institute of Technology\nCambridge, MA 02139\ngamir,tommi@csail.mit.edu\n\nAbstract\n\nWe present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but, unlike max-product, it always converges, and can be proven to find the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches.\n\nGraphical models are an effective approach for modeling complex objects via local interactions. 
In such models, a distribution over a set of variables is assumed to factor according to cliques of a graph, with potentials assigned to each clique. Finding the assignment with highest probability in these models is key to using them in practice, and is often referred to as the MAP (maximum a posteriori) assignment problem. In the general case the problem is NP-hard, with complexity exponential in the tree-width of the underlying graph.\n\nLinear programming (LP) relaxations have proven very useful in approximating the MAP problem, and often yield satisfactory empirical results. These approaches relax the constraint that the solution be integral, and generally yield non-integral solutions. However, when the LP solution is integral, it is guaranteed to be the exact MAP. For some classes of problems the LP relaxation is provably correct. These include the minimum cut problem and maximum weight matching in bipartite graphs [8]. Although LP relaxations can be solved using standard LP solvers, this may be computationally intensive for large problems [13]. The key problem with generic LP solvers is that they do not exploit the graph structure explicitly, and thus may be sub-optimal in terms of computational efficiency.\n\nThe max-product method [7] is a message passing algorithm that is often used to approximate the MAP problem. In contrast to generic LP solvers, it makes direct use of the graph structure in constructing and passing messages, and is also very simple to implement. The relation between max-product and the LP relaxation has remained largely elusive, although there are some notable exceptions: for tree-structured graphs, max-product and LP both yield the exact MAP. A recent result showed that for maximum weight matching on bipartite graphs, max-product and LP also yield the exact MAP [1]. 
Finally, tree-reweighted max-product (TRMP) algorithms [5, 10] were shown in [6] to converge to the LP solution for binary variables.\n\nIn this work, we propose the Max Product Linear Programming algorithm (MPLP) - a very simple variation on max-product that is guaranteed to converge and has several advantageous properties. MPLP is derived from the dual of the LP relaxation, and is equivalent to block coordinate descent in the dual. Although this results in monotone improvement of the dual objective, global convergence is not always guaranteed, since coordinate descent may get stuck at suboptimal points. This can be remedied using various approaches, but in practice we have found MPLP to converge to the LP solution in a majority of the cases we studied. To derive MPLP we use a special form of the dual LP, which involves the introduction of redundant primal variables and constraints. We show how the dual variables corresponding to these constraints turn out to be the messages of the algorithm. We evaluate the method on Potts models and protein design problems, and show that it compares favorably with max-product (which often does not converge on these problems) and TRMP.\n\n1 The Max-Product and MPLP Algorithms\n\nThe max-product algorithm [7] is one of the most widely used methods for solving MAP problems. Although it is neither guaranteed to converge to the correct solution nor, in fact, to converge at all, it provides satisfactory results in some cases. Here we present two algorithms: EMPLP (edge based MPLP) and NMPLP (node based MPLP), which are structurally very similar to max-product but have several key advantages:\n\n• After each iteration, the messages yield an upper bound on the MAP value, and the sequence of bounds is monotone decreasing and convergent. 
The messages also have a limit point that is a fixed point of the update rule.\n\n• No additional parameters (e.g., tree weights as in [6]) are required.\n\n• If the fixed point beliefs have a unique maximizer, then they correspond to the exact MAP.\n\n• For binary variables, MPLP can be used to obtain the solution to an LP relaxation of the MAP problem. Thus, when this LP relaxation is exact and the variables are binary, MPLP will find the MAP solution. Moreover, for any variable whose beliefs are not tied, the MAP assignment can be found (i.e., the solution is partially decodable).\n\nPseudo-code for the algorithms (and for max-product) is given in Fig. 1. As we show in the next sections, MPLP is essentially a block coordinate descent algorithm in the dual of a MAP LP relaxation. Every update of the MPLP messages corresponds to exact minimization over a set of dual variables. For EMPLP the minimization is over the set of variables corresponding to an edge, and for NMPLP it is over the set of variables corresponding to all the edges a given node appears in (i.e., a star). The properties of MPLP result from its relation to the LP dual. In what follows we describe the derivation of the MPLP algorithms and prove their properties.\n\n2 The MAP Problem and its LP Relaxation\n\nWe consider functions over n variables x = {x1, . . . , xn} defined as follows. Given a graph G = (V, E) with n vertices, and potentials θij(xi, xj) for all edges ij ∈ E, define the function1\n\nf(x; θ) = Σ_{ij∈E} θij(xi, xj) .   (1)\n\nThe MAP problem is defined as finding an assignment xM that maximizes the function f(x; θ). Below we describe the standard LP relaxation for this problem. Denote by {µij(xi, xj)}_{ij∈E} distributions over variables corresponding to edges ij ∈ E, and by {µi(xi)}_{i∈V} distributions corresponding to nodes i ∈ V. 
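On tiny instances, the objective in Eq. 1 can be evaluated and maximized by brute force, which is useful as a ground-truth reference for the bounds discussed below. The following is a minimal sketch; the function names and the toy chain model are our own illustration, not from the paper:

```python
import itertools

def f_value(x, theta):
    """Evaluate f(x; theta) = sum over edges ij of theta_ij(x_i, x_j) (Eq. 1)."""
    return sum(th[x[i]][x[j]] for (i, j), th in theta.items())

def map_brute_force(n_nodes, n_states, theta):
    """Exhaustive MAP: enumerates n_states ** n_nodes assignments, so tiny models only."""
    return max(itertools.product(range(n_states), repeat=n_nodes),
               key=lambda x: f_value(x, theta))

# Toy chain 0 - 1 - 2 with two states per variable and attractive potentials.
theta = {(0, 1): [[2.0, 0.0], [0.0, 1.0]],
         (1, 2): [[1.5, 0.0], [0.0, 0.5]]}
x_map = map_brute_force(3, 2, theta)  # -> (0, 0, 0), with f value 3.5
```

This exhaustive baseline is only a sanity check; the point of the LP relaxation and of MPLP is to avoid exactly this exponential enumeration.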
We will use µ to denote a given set of distributions over all edges and nodes. The set ML(G) is defined as the set of µ for which the pairwise and singleton distributions are consistent:\n\nML(G) = { µ ≥ 0 | Σ_{x̂i} µij(x̂i, xj) = µj(xj), Σ_{x̂j} µij(xi, x̂j) = µi(xi) ∀ij ∈ E, xi, xj ;  Σ_{xi} µi(xi) = 1 ∀i ∈ V }\n\nNow consider the following linear program:\n\nMAPLPR :  µL* = arg max_{µ∈ML(G)} µ · θ .   (2)\n\nwhere µ·θ is shorthand for µ·θ = Σ_{ij∈E} Σ_{xi,xj} θij(xi, xj)µij(xi, xj). It is easy to show (see e.g., [10]) that the optimum of MAPLPR yields an upper bound on the MAP value, i.e., µL* · θ ≥ f(xM). Furthermore, when the optimal µi(xi) have only integral values, the assignment that maximizes µi(xi) yields the correct MAP assignment. In what follows we show how the MPLP algorithms can be derived from the dual of MAPLPR.\n\n1We note that some authors also add a term Σ_{i∈V} θi(xi) to f(x; θ). However, these terms can be included in the pairwise functions θij(xi, xj), so we ignore them for simplicity.\n\n3 The LP Relaxation Dual\n\nSince MAPLPR is an LP, it has an equivalent convex dual. In App. A we derive a special dual of MAPLPR using a different representation of ML(G) with redundant variables. The advantage of this dual is that it allows the derivation of simple message passing algorithms. 
The dual is described in the following proposition.\n\nProposition 1 The following optimization problem is a convex dual of MAPLPR\n\nDMAPLPR :  min Σ_i max_{xi} Σ_{k∈N(i)} max_{xk} βki(xk, xi)   (3)\ns.t.  βji(xj, xi) + βij(xi, xj) = θij(xi, xj) ,\n\nwhere the dual variables are βij(xi, xj) for all ij, ji ∈ E and all values of xi and xj.\n\nThe dual has an intuitive interpretation in terms of re-parameterizations. Consider the star-shaped graph Gi consisting of node i and all its neighbors N(i). Assume the potential on edge ki (for k ∈ N(i)) is βki(xk, xi). The value of the MAP assignment for this model is max_{xi} Σ_{k∈N(i)} max_{xk} βki(xk, xi). This is exactly the term in the objective of DMAPLPR. Thus the dual corresponds to individually decoding star graphs around all nodes i ∈ V, where the potentials on the graph edges should sum to the original potential. It is easy to see that this always results in an upper bound on the MAP value. The somewhat surprising result of the duality is that there exists a β assignment such that star decoding yields the optimal value of MAPLPR.\n\n4 Block Coordinate Descent in the Dual\n\nTo obtain a convergent algorithm we use a simple block coordinate descent strategy. At every iteration, we fix all variables except a subset, and optimize over that subset. It turns out that this can be done in closed form for the cases we consider. We begin by deriving the EMPLP algorithm. Consider fixing all the β variables except those corresponding to some edge ij ∈ E (i.e., βij and βji), and minimizing DMAPLPR over the non-fixed variables. Only two terms in the DMAPLPR objective depend on βij and βji. 
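The star-decoding interpretation of Proposition 1 is easy to check numerically: any feasible split of each θij into βij + βji yields an upper bound on the MAP value. A small sketch under our own naming, using the symmetric split βij = βji = θij/2 (just one feasible choice, not the optimal one in general):

```python
import itertools

def star_bound(n_nodes, n_states, theta):
    """Dual objective of DMAPLPR at the feasible point beta_ij = beta_ji = theta_ij / 2:
    sum over i of max_xi sum over k in N(i) of max_xk beta_ki(xk, xi)."""
    nbr_pots = {i: [] for i in range(n_nodes)}
    for (i, j), th in theta.items():
        # edge potential seen from node j (max over the other endpoint xi) ...
        nbr_pots[j].append(lambda xj, th=th: max(th[xi][xj] for xi in range(n_states)) / 2.0)
        # ... and seen from node i (max over xj)
        nbr_pots[i].append(lambda xi, th=th: max(th[xi][xj] for xj in range(n_states)) / 2.0)
    return sum(max(sum(g(x) for g in nbr_pots[i]) for x in range(n_states))
               for i in range(n_nodes))

def map_value(n_nodes, n_states, theta):
    """Exact MAP value by enumeration, for comparison on tiny models."""
    return max(sum(th[x[i]][x[j]] for (i, j), th in theta.items())
               for x in itertools.product(range(n_states), repeat=n_nodes))

theta = {(0, 1): [[2.0, 0.0], [0.0, 1.0]], (1, 2): [[1.5, 0.0], [0.0, 0.5]]}
assert star_bound(3, 2, theta) >= map_value(3, 2, theta) - 1e-9  # always an upper bound
```

The duality result says more: some β split makes this bound equal the MAPLPR optimum, which is what the coordinate descent updates below search for.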
We can write those as\n\nf(βij, βji) = max_{xi} [ λ_i^{-j}(xi) + max_{xj} βji(xj, xi) ] + max_{xj} [ λ_j^{-i}(xj) + max_{xi} βij(xi, xj) ]   (4)\n\nwhere we defined λ_i^{-j}(xi) = Σ_{k∈N(i)\\j} λki(xi) and λki(xi) = max_{xk} βki(xk, xi) as in App. A. Note that the function f(βij, βji) depends on the other β values only through λ_i^{-j}(xi) and λ_j^{-i}(xj). This implies that the optimization can be done solely in terms of λij(xj), and there is no need to store the β values explicitly. The optimal βij, βji are obtained by minimizing f(βij, βji) subject to the re-parameterization constraint βji(xj, xi) + βij(xi, xj) = θij(xi, xj). The following proposition characterizes the minimum of f(βij, βji). In fact, as mentioned above, we do not need to characterize the optimal βij(xi, xj) itself, but only the new λ values.\n\nProposition 2 Minimizing the function f(βij, βji) yields the following λji(xi) (and the equivalent expression for λij(xj))\n\nλji(xi) = -(1/2) λ_i^{-j}(xi) + (1/2) max_{xj} [ λ_j^{-i}(xj) + θij(xi, xj) ]\n\nThe proposition is proved in App. B. The λ updates above result in the EMPLP algorithm, described in Fig. 1. Note that since the β optimization affects both λji(xi) and λij(xj), both messages need to be updated simultaneously.\n\nWe proceed to derive the NMPLP algorithm. For a given node i ∈ V, we consider all its neighbors j ∈ N(i), and wish to optimize over the variables βji(xj, xi) for ji, ij ∈ E (i.e., all the edges in a star centered on i), while the other variables are fixed. One way of doing so is to run the EMPLP updates on the edges of the star and iterate them until convergence. 
We now show that the result of this optimization can be found in closed form.\n\nInputs: A graph G = (V, E), potential functions θij(xi, xj) for each edge ij ∈ E.\n\nInitialization: Initialize messages to any value.\n\nAlgorithm: Iterate until a stopping criterion is satisfied:\n\n– Max-product: Iterate over messages and update (cji shifts the max to zero)\n\nmji(xi) ← max_{xj} [ m_j^{-i}(xj) + θij(xi, xj) ] - cji\n\n– EMPLP: For each ij ∈ E, update λji(xi) and λij(xj) simultaneously (the update for λij(xj) is the same with i and j exchanged)\n\nλji(xi) ← -(1/2) λ_i^{-j}(xi) + (1/2) max_{xj} [ λ_j^{-i}(xj) + θij(xi, xj) ]\n\n– NMPLP: Iterate over nodes i ∈ V and update all γij(xj) where j ∈ N(i)\n\nγij(xj) ← max_{xi} [ θij(xi, xj) - γji(xi) + (2/(|N(i)| + 1)) Σ_{k∈N(i)} γki(xi) ]\n\nThen calculate node “beliefs”: set bi(xi) to be the sum of incoming messages into node i ∈ V (e.g., for NMPLP set bi(xi) = Σ_{k∈N(i)} γki(xi)).\n\nOutput: Return the assignment x defined by xi = arg max_{x̂i} bi(x̂i).\n\nFigure 1: The max-product, EMPLP and NMPLP algorithms. Max-product, EMPLP and NMPLP use messages mij, λij and γij respectively. We use the notation m_j^{-i}(xj) = Σ_{k∈N(j)\\i} mkj(xj).\n\nThe assumption that β is fixed outside the star implies that λ_j^{-i}(xj) is fixed. Define γji(xi) = max_{xj} [ θij(xi, xj) + λ_j^{-i}(xj) ]. Simple algebra yields the following relation between λ_i^{-j}(xi) and γki(xi) for k ∈ N(i)\n\nλ_i^{-j}(xi) = -γji(xi) + (2/(|N(i)| + 1)) Σ_{k∈N(i)} γki(xi)   (5)\n\nPlugging this into the definition of γji(xi) we obtain the NMPLP update in Fig. 1. 
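The EMPLP update from Fig. 1 fits in a few lines of code. Below is our own minimal re-implementation for illustration (not the authors' code); it tracks the dual objective Σ_i max_{xi} Σ_{k∈N(i)} λki(xi), which should be non-increasing across sweeps, and decodes from the beliefs at the end:

```python
def emplp(theta, n_nodes, n_states, n_iters=50):
    """EMPLP sweeps: for each edge ij, update lambda_ji and lambda_ij simultaneously.
    theta maps an edge (i, j) to a table th[xi][xj]; lam[(j, i)] is the message
    into node i from neighbor j, a vector indexed by the state of x_i."""
    edges = list(theta)
    nbrs = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        nbrs[i].add(j); nbrs[j].add(i)
    lam = {(a, b): [0.0] * n_states for i, j in edges for a, b in ((i, j), (j, i))}

    def incoming(i, excl):
        # lambda_i^{-excl}(x_i): sum of messages into i, excluding the one from excl
        return [sum(lam[(k, i)][x] for k in nbrs[i] if k != excl) for x in range(n_states)]

    duals = []
    for _ in range(n_iters):
        for i, j in edges:
            th = theta[(i, j)]
            li, lj = incoming(i, j), incoming(j, i)  # computed once => simultaneous update
            lam[(j, i)] = [-0.5 * li[xi] + 0.5 * max(lj[xj] + th[xi][xj] for xj in range(n_states))
                           for xi in range(n_states)]
            lam[(i, j)] = [-0.5 * lj[xj] + 0.5 * max(li[xi] + th[xi][xj] for xi in range(n_states))
                           for xj in range(n_states)]
        duals.append(sum(max(sum(lam[(k, i)][x] for k in nbrs[i]) for x in range(n_states))
                         for i in range(n_nodes)))
    # decode from beliefs b_i(x_i) = sum of incoming messages into node i
    x = [max(range(n_states), key=lambda s: sum(lam[(k, i)][s] for k in nbrs[i]))
         for i in range(n_nodes)]
    return x, duals

theta = {(0, 1): [[2.0, 0.0], [0.0, 1.0]], (1, 2): [[1.5, 0.0], [0.0, 0.5]]}
x, duals = emplp(theta, n_nodes=3, n_states=2)
# on this tree the relaxation is tight: x == [0, 0, 0] and duals[-1] == 3.5
```

Since this example graph is a tree, the dual bound meets the MAP value; on loopy graphs the same code still gives a monotone sequence of upper bounds, but may stop above the LP optimum at a coordinate-descent fixed point, as discussed in Sec. 5.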
The messages for both algorithms can be initialized to any value, since it can be shown that after one iteration they will correspond to valid β values.\n\n5 Convergence Properties\n\nThe MPLP algorithm decreases the dual objective (i.e., an upper bound on the MAP value) at every iteration, and thus its dual objective values form a convergent sequence. Using arguments similar to [5] it can be shown that MPLP has a limit point that is a fixed point of its updates. This in itself does not guarantee convergence to the dual optimum, since coordinate descent algorithms may get stuck at a point that is not a global optimum. There are ways of overcoming this difficulty, for example by smoothing the objective [4] or using techniques as in [2] (see p. 636). We leave such extensions for further work. In this section we provide several results about the properties of the MPLP fixed points and their relation to the corresponding LP. First, we claim that if all beliefs have unique maxima then the exact MAP assignment is obtained.\n\nProposition 3 If the fixed point of MPLP has bi(xi) such that for all i the function bi(xi) has a unique maximizer x*i, then x* is the solution to the MAP problem and the LP relaxation is exact.\n\nSince the dual objective is always greater than or equal to the MAP value, it suffices to show that there exists a dual feasible point whose objective value is f(x*). Denote by β*, λ* the values of the corresponding dual parameters at the fixed point of MPLP. Then the dual objective satisfies\n\nΣ_i max_{xi} Σ_{k∈N(i)} λ*ki(xi) = Σ_i Σ_{k∈N(i)} max_{xk} β*ki(xk, x*i) = Σ_i Σ_{k∈N(i)} β*ki(x*k, x*i) = f(x*)\n\nTo see why the second equality holds, note that bi(x*i) = max_{xi,xj} [ λ_i^{-j}(xi) + βji(xj, xi) ] and bj(x*j) = max_{xi,xj} [ λ_j^{-i}(xj) + βij(xi, xj) ]. By the equalization property in Eq. 9 the arguments of the two max operations are equal. From the unique maximum assumption it follows that x*i, x*j are the unique maximizers of the above, and hence that βji, βij are also maximized at x*i, x*j.\n\nIn the general case, the MPLP fixed point may not correspond to a primal optimum because of the local optima problem with coordinate descent. However, when the variables are binary, fixed points do correspond to primal solutions, as the following proposition states.\n\nProposition 4 When the xi are binary, the MPLP fixed point can be used to obtain the primal optimum.\n\nThe claim can be shown by constructing a primal optimal solution µ*. For tied bi, set µ*i(xi) to 0.5, and for untied bi, set µ*i(x*i) to 1. If bi, bj are not tied, we set µ*ij(x*i, x*j) = 1. If bi is not tied but bj is, we set µ*ij(x*i, xj) = 0.5. If bi, bj are both tied, then βji, βij can be shown to be maximized at either x*i, x*j = (0, 0), (1, 1) or x*i, x*j = (0, 1), (1, 0); we then set µ*ij to be 0.5 at one of these assignment pairs. The resulting µ* is clearly primal feasible. Setting δ*i = b*i, we obtain that the dual variables (δ*, λ*, β*) and the primal µ* satisfy complementary slackness for the LP in Eq. 7, and therefore µ* is primal optimal. The binary optimality result implies partial decodability, since [6] shows that the LP is partially decodable for binary variables.\n\n6 Beyond pairwise potentials: Generalized MPLP\n\nIn the previous sections we considered maximizing functions which factor according to the edges of the graph. A more general setting considers clusters c1, . . . , ck ⊂ {1, . . . 
, n} (the set of clusters is denoted by C), and a function f(x; θ) = Σ_c θc(xc) defined via potentials θc(xc) over clusters. The MAP problem in this case also has an LP relaxation (see e.g. [11]). To define the LP we introduce the following definitions: S = {c ∩ ĉ : c, ĉ ∈ C, c ∩ ĉ ≠ ∅} is the set of intersections between clusters, and S(c) = {s ∈ S : s ⊆ c} is the set of overlap sets for cluster c. We now consider marginals over the variables in c ∈ C and s ∈ S, and require that cluster marginals agree on their overlaps. Denote this set by ML(C). The LP relaxation is then to maximize µ · θ subject to µ ∈ ML(C).\n\nAs in Sec. 4, we can derive message passing updates that result in a monotone decrease of the dual LP of the above relaxation. The derivation is similar and we omit the details. The key observation is that one needs to introduce |S(c)| copies of each marginal µc(xc) (instead of the two copies in the pairwise case). Next, as in the EMPLP derivation, we assume all β are fixed except those corresponding to some cluster c. The resulting messages are λc→s(xs) from a cluster c to all of its intersection sets s ∈ S(c). The update on these messages turns out to be:\n\nλc→s(xs) = -(1 - 1/|S(c)|) λ_s^{-c}(xs) + (1/|S(c)|) max_{xc\\s} [ Σ_{ŝ∈S(c)\\s} λ_ŝ^{-c}(xŝ) + θc(xc) ]\n\nwhere for a given c ∈ C all λc→s should be updated simultaneously for s ∈ S(c), and λ_s^{-c}(xs) is defined as the sum of messages into s that are not from c. We refer to this algorithm as Generalized EMPLP (GEMPLP). It is possible to derive an algorithm similar to NMPLP that updates several clusters simultaneously, but its structure is more involved and we do not address it here.\n\n7 Related Work\n\nWeiss et al. [11] recently studied the fixed points of a class of max-product like algorithms. Their analysis focused on properties of fixed points rather than convergence guarantees. Specifically, they showed that if the counting numbers used in a generalized max-product algorithm satisfy certain properties, then its fixed points will be the exact MAP if the beliefs have unique maxima, and for binary variables the solution is partially decodable. Both these properties hold for the MPLP fixed points; in fact, we can show that MPLP satisfies the conditions in [11], so that we obtain these properties as corollaries of [11]. We stress, however, that [11] does not address convergence of algorithms, but rather properties of their fixed points, if they converge.\n\nMPLP is similar in some respects to Kolmogorov's TRW-S algorithm [5]. TRW-S is also a monotone coordinate descent method in a dual of the LP relaxation, and its fixed points have guarantees similar to those of MPLP [6]. Furthermore, convergence to a local optimum may occur, as it does for MPLP. One advantage of MPLP lies in the simplicity of its updates and the fact that it is parameter free. The other is its simple generalization to potentials over clusters of nodes (Sec. 6).\n\nRecently, several new dual LP algorithms have been introduced which are more closely related to our formalism. Werner [12] presented a class of algorithms that also improve the dual LP at every iteration. The simplest of these is the max-sum-diffusion algorithm, which is similar to our EMPLP algorithm, although its updates differ from ours. Independently, Johnson et al. 
[4] presented\na class of algorithms that improve duals of the MAP-LP using coordinate descent. They decompose\nthe model into tractable parts by replicating variables and enforce replication constraints within the\nLagrangian dual. Our basic formulation in Eq. 3 could be derived from their perspective. However,\nthe updates in the algorithm and the analysis differ. Johnson et al. also presented a method for\novercoming the local optimum problem, by smoothing the objective so that it is strictly convex.\nSuch an approach could also be used within our algorithms. Vontobel and Koetter [9] recently\nintroduced a coordinate descent algorithm for decoding LDPC codes. Their method is speci\ufb01cally\ntailored for this case, and uses updates that are similar to our edge based updates.\n\nFinally, the concept of dual coordinate descent may be used in approximating marginals as well. In\n[3] we use such an approach to optimize a variational bound on the partition function. The derivation\nuses some of the ideas used in the MPLP dual, but importantly does not \ufb01nd the minimum for each\ncoordinate. Instead, a gradient like step is taken at every iteration to decrease the dual objective.\n\n8 Experiments\n\nWe compared NMPLP to three other message passing algorithms:2 Tree-Reweighted max-product\n(TRMP) [10],3 standard max-product (MP), and GEMPLP. For MP and TRMP we used the standard\napproach of damping messages using a factor of \u03b1 = 0.5. We ran all algorithms for a maximum of\n2000 iterations, and used the hit-time measure to compare their speed of convergence. This measure\nis de\ufb01ned as follows: At every iteration the beliefs can be used to obtain an assignment x with value\nf (x). We de\ufb01ne the hit-time as the \ufb01rst iteration at which the maximum value of f (x) is achieved.4\nWe \ufb01rst experimented with a 10 \u00d7 10 grid graph, with 5 values per state. 
The function f(x) was a Potts model: f(x) = Σ_{ij∈E} θij I(xi = xj) + Σ_{i∈V} θi(xi).5 The values for θij and θi(xi) were drawn uniformly at random from [-cI, cI] and [-cF, cF] respectively, and we used values of cI and cF in the range [0.1, 2.35] (with intervals of 0.25), resulting in 100 different models. The clusters for GEMPLP were the faces of the graph [14]. To see whether NMPLP converges to the LP solution, we also used an LP solver to solve the LP relaxation. We found that the normalized difference between the NMPLP and LP objectives was at most 10^-3 (median 10^-7), suggesting that NMPLP typically converged to the LP solution. Fig. 2 (top row) shows the results for the three algorithms. While all non-cluster based algorithms obtain similar f(x) values, NMPLP has a better hit-time (in the median) than TRMP and MP, and MP does not converge in many cases (see caption). GEMPLP converges more slowly than NMPLP, but obtains much better f(x) values. In fact, in 99% of the cases the normalized difference between the GEMPLP objective and the f(x) value was less than 10^-5, suggesting that the exact MAP solution was found.\n\nWe next applied the algorithms to the real-world problem of protein design. In [13], Yanover et al. show how these problems can be formalized in terms of finding a MAP in an appropriately constructed graphical model.6 We used all algorithms except GEMPLP (since there is no natural choice of clusters in this case) to approximate the MAP solution on the 97 models used in [13]. In these models the number of states per variable is 2–158, and there are up to 180 variables per model. Fig. 2 (bottom) shows results for all the design problems. In this case only 11% of the MP runs converged, and NMPLP was better than TRMP in terms of hit-time and comparable in f(x) value. 
The performance of MP was good on the runs where it converged.\n\n2As expected, NMPLP was faster than EMPLP, so only NMPLP results are given.\n3The edge weights for TRMP corresponded to a uniform distribution over all spanning trees.\n4This is clearly a post-hoc measure, since it can only be obtained after the algorithm has exceeded its maximum number of iterations. However, it is a reasonable algorithm-independent measure of convergence.\n5The potential θi(xi) may be folded into the pairwise potentials to yield a model as in Eq. 1.\n6Data available from http://jmlr.csail.mit.edu/papers/volume7/yanover06a/Rosetta Design Dataset.tgz\n\nFigure 2: Evaluation of message passing algorithms on Potts models and protein design problems. (a,c): Convergence time results for the Potts models (a) and protein design problems (c). The box-plots (horiz. red line indicates median) show the difference between the hit-time for the other algorithms and NMPLP. (b,d): Value of integer solutions for the Potts models (b) and protein design problems (d). The box-plots show the normalized difference between the value of f(x) for NMPLP and the other algorithms. All figures are such that better MPLP performance yields positive Y-axis values. Max-product converged on 58% of the cases for the Potts models, and on 11% of the protein problems. 
Only convergent max-product runs are shown.\n\n9 Conclusion\n\nWe have presented a convergent algorithm for MAP approximation that is based on block coordinate descent in the dual of the MAP-LP relaxation. The algorithm can also be extended to cluster based functions, which empirically result in improved MAP estimates. This is in line with the observations in [14] that generalized belief propagation algorithms can yield significant performance improvements. However, generalized max-product algorithms [14] are not guaranteed to converge, whereas GMPLP is. Furthermore, the GMPLP algorithm does not require a region graph and only involves intersections between pairs of clusters. In conclusion, MPLP has the advantage of resolving the convergence problems of max-product while retaining its simplicity, and offering the theoretical guarantees of LP relaxations. We thus believe it should be useful in a wide array of applications.\n\nA Derivation of the dual\n\nBefore deriving the dual, we first express the constraint set ML(G) in a slightly different way. The definition of ML(G) in Sec. 2 uses a single distribution µij(xi, xj) for every ij ∈ E. In what follows, we use two copies of this pairwise distribution for every edge, which we denote µ̄ij(xi, xj) and µ̄ji(xj, xi), and we add the constraint that these two copies both equal the original µij(xi, xj). For this extended set of pairwise marginals, we consider the following set of constraints, which is clearly equivalent to ML(G). 
In the rightmost column we give the dual variable that will correspond to each constraint (we omit non-negativity constraints).\n\nµ̄ij(xi, xj) = µij(xi, xj)   ∀ij ∈ E, xi, xj   βij(xi, xj)\nµ̄ji(xj, xi) = µij(xi, xj)   ∀ij ∈ E, xi, xj   βji(xj, xi)\nΣ_{x̂i} µ̄ij(x̂i, xj) = µj(xj)   ∀ij ∈ E, xj   λij(xj)\nΣ_{x̂j} µ̄ji(x̂j, xi) = µi(xi)   ∀ji ∈ E, xi   λji(xi)\nΣ_{xi} µi(xi) = 1   ∀i ∈ V   δi\n   (6)\n\nWe denote the set of (µ, µ̄) satisfying these constraints by M̄L(G). We can now state an LP that is equivalent to MAPLPR, only with an extended set of variables and constraints. The equivalent problem is to maximize µ · θ subject to (µ, µ̄) ∈ M̄L(G) (note that the objective uses the original µ copy). LP duality transformation of the extended problem yields the following LP\n\nmin Σ_i δi\ns.t.  λij(xj) - βij(xi, xj) ≥ 0   ∀ij, ji ∈ E, xi, xj\n  βij(xi, xj) + βji(xj, xi) = θij(xi, xj)   ∀ij ∈ E, xi, xj   (7)\n  -Σ_{k∈N(i)} λki(xi) + δi ≥ 0   ∀i ∈ V, xi\n\nWe next simplify the above LP by eliminating some of its constraints and variables. Since each variable δi appears in only one constraint, and the objective minimizes δi, it follows that δi = max_{xi} Σ_{k∈N(i)} λki(xi), and the constraints with δi can be discarded. Similarly, since λij(xj) appears in a single constraint, we have that for all ij ∈ E, ji ∈ E, xi, xj, λij(xj) = max_{xi} βij(xi, xj), and the constraints with λij(xj), λji(xi) can also be discarded. Using the eliminated δi and λji(xi) variables, we obtain that the LP in Eq. 7 is equivalent to that in Eq. 3. Note that the objective in Eq. 
3 is convex, since it is a sum of point-wise maxima of convex functions.\n\nB Proof of Proposition 2\n\nWe wish to minimize f in Eq. 4 subject to the constraint that βij + βji = θij. Rewrite f as\n\nf(βij, βji) = max_{xi,xj} [ λ_i^{-j}(xi) + βji(xj, xi) ] + max_{xi,xj} [ λ_j^{-i}(xj) + βij(xi, xj) ]   (8)\n\nThe sum of the two arguments in the max operations is λ_i^{-j}(xi) + λ_j^{-i}(xj) + θij(xi, xj) (because of the constraints on β). Thus the minimum must be greater than or equal to (1/2) max_{xi,xj} [ λ_i^{-j}(xi) + λ_j^{-i}(xj) + θij(xi, xj) ]. One assignment to β that achieves this minimum is obtained by requiring an equalization condition:7\n\nλ_j^{-i}(xj) + βij(xi, xj) = λ_i^{-j}(xi) + βji(xj, xi) = (1/2) ( θij(xi, xj) + λ_i^{-j}(xi) + λ_j^{-i}(xj) )   (9)\n\nwhich implies βij(xi, xj) = (1/2) ( θij(xi, xj) + λ_i^{-j}(xi) - λ_j^{-i}(xj) ) and a similar expression for βji. The resulting λij(xj) = max_{xi} βij(xi, xj) are then the ones in Prop. 2.\n\nAcknowledgments\nThe authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson was also supported by the Rothschild Yad-Hanadiv fellowship.\n\nReferences\n[1] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. IEEE Trans. on Information Theory (to appear), 2007.\n[2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.\n[3] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI, 2007.\n[4] J.K. Johnson, D.M. Malioutov, and A.S. Willsky. Lagrangian relaxation for MAP estimation in graphical models. In Allerton Conf. 
Communication, Control and Computing, 2007.\n[5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006.\n[6] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message passing. In 21st Conference on Uncertainty in Artificial Intelligence (UAI), 2005.\n[7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.\n[8] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and Bregman projections. Journal of Machine Learning Research, pages 1627–1653, 2006.\n[9] P.O. Vontobel and R. Koetter. Towards low-complexity linear-programming decoding. In Proc. 4th Int. Symposium on Turbo Codes and Related Topics, 2006.\n[10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005.\n[11] Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In UAI, 2007.\n[12] T. Werner. A linear programming approach to max-sum, a review. IEEE Trans. on PAMI, 2007.\n[13] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Journal of Machine Learning Research, 7:1887–1907, 2006.\n[14] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.\n\n7Other solutions are possible but may not yield some of the properties of MPLP.\n", "award": [], "sourceid": 940, "authors": [{"given_name": "Amir", "family_name": "Globerson", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}]}