{"title": "Cooperative Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 262, "page_last": 270, "abstract": "We study a rich family of distributions that capture variable interactions significantly more expressive than those representable with low-treewidth or pairwise graphical models, or log-supermodular models. We call these cooperative graphical models. Yet, this family retains structure, which we carefully exploit for efficient inference techniques. Our algorithms combine the polyhedral structure of submodular functions in new ways with variational inference methods to obtain both lower and upper bounds on the partition function. While our fully convex upper bound is minimized as an SDP or via tree-reweighted belief propagation, our lower bound is tightened via belief propagation or mean-field algorithms. The resulting algorithms are easy to implement and, as our experiments show, effectively obtain good bounds and marginals for synthetic and real-world examples.", "full_text": "Cooperative Graphical Models\n\nJosip Djolonga\n\nDept. of Computer Science, ETH Zürich\n\njosipd@inf.ethz.ch\n\nStefanie Jegelka\n\nCSAIL, MIT\n\nstefje@mit.edu\n\nSebastian Tschiatschek\n\nDept. of Computer Science, ETH Zürich\n\nstschia@inf.ethz.ch\n\nAndreas Krause\n\nDept. of Computer Science, ETH Zürich\n\nkrausea@inf.ethz.ch\n\nAbstract\n\nWe study a rich family of distributions that capture variable interactions significantly more expressive than those representable with low-treewidth or pairwise graphical models, or log-supermodular models. We call these cooperative graphical models. Yet, this family retains structure, which we carefully exploit for efficient inference techniques. Our algorithms combine the polyhedral structure of submodular functions in new ways with variational inference methods to obtain both lower and upper bounds on the partition function. 
While our fully convex upper bound is minimized as an SDP or via tree-reweighted belief propagation, our lower bound is tightened via belief propagation or mean-field algorithms. The resulting algorithms are easy to implement and, as our experiments show, effectively obtain good bounds and marginals for synthetic and real-world examples.\n\n1 Introduction\n\nProbabilistic inference in high-order discrete graphical models has been an ongoing computational challenge, and all existing methods rely on exploiting specific structure: either low-treewidth or pairwise graphical models, or functional properties of the distribution such as log-submodularity. Here, we aim to compute approximate marginal probabilities in complex models with long-range variable interactions that do not possess any of these properties. Instead, we exploit a combination of structural and functional properties in new ways.\n\nThe classical example of image segmentation may serve to motivate our family of models: we would like to estimate a posterior marginal distribution over k labels for each pixel in an image. A common approach uses Conditional Random Fields on a pixel neighborhood graph with pairwise potentials that encourage neighboring pixels to take on the same label. From the perspective of the graph, this model prefers configurations with few edges cut, where an edge is said to be cut if its endpoints have different labels. Such cut-based models, however, short-cut elongated structures (e.g. 
tree branches), a problem known as shrinking bias. Jegelka and Bilmes [1] hence replace the bias towards short cuts (boundaries) by a bias towards configurations with certain higher-order structure: the cut edges occur at similar-looking pixel pairs. They group the graph edges into clusters (based on, say, color gradients across the endpoints), observing that the true object boundary is captured by few of these clusters. To encourage cutting edges from few clusters, the cost of cutting an edge decreases as more edges in its cluster are cut. In short, the edges “cooperate”. In Figure 1, each pixel takes on one of two labels (colors), and cut edges are indicated by dotted lines.\n\nFigure 1: Example cooperative model. Edge colors indicate the edge cluster. Dotted edges are cut under the current assignment.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThe current configuration cuts three red edges and one blue edge, and has lower probability than the configuration that swaps X3,1 to gray, cutting only red edges. Such a model can be implemented by an energy (cost) h(#red edges cut) + h(#blue edges cut), where e.g. h(u) = √u. Similar cooperative models can express a preference for shapes [2].\n\nWhile being expressive, such models are computationally very challenging: the nonlinear function on pairs of variables (edges) is equivalent to a graphical model of extremely high order (up to the number of variables). Previous work hence addressed only MAP inference [3, 4]; the computation of marginals and partition functions was left as an open problem. In this paper, we close this gap, even for a larger family of models.\n\nWe address models, which we call cooperative graphical models, that are specified by an undirected graph G = (V, E): each node i ∈ V is associated with a random variable Xi that takes values in X = {1, 2, . . . , k}. To each vertex i ∈ V and edge {i, j} ∈ E, we attach a potential function θi : X → R and θi,j : X² → R, respectively. Our distribution is then\n\nP(x) = (1/Z) exp(−(Σ_{i∈V} θi(xi) + Σ_{{i,j}∈E} θi,j(xi, xj) + f(y(x)))) ν(x),   (1)\n\nwhere we call y : X^n → {0, 1}^E the disagreement variable¹, defined as yi,j = ⟦xi ≠ xj⟧. The term ν : X^n → R≥0 is the base measure and allows us to encode constraints, e.g., conditioning on some variables. With f ≡ 0 we obtain a Markov random field.\n\nProbabilistic inference in our model class (1) is very challenging, since we make no factorization assumption about f. One solution would be to encode P(x) as a log-linear model via a new variable z ∈ {0, 1}^E and constraints ν(x, z) = ⟦y(x) = z⟧, but this in general requires computing exponential-sized sufficient statistics from z. In contrast, we make one additional key assumption that will enable the development of efficiently computable variational lower and upper bounds: we henceforth assume that f : {0, 1}^E → R is submodular, i.e., it satisfies\n\nf(min(y, y′)) + f(max(y, y′)) ≤ f(y) + f(y′) for all y, y′ ∈ {0, 1}^E,\n\nwhere the min and max operations are taken element-wise. For example, the pairwise potentials θi,j are submodular if θi,j(0, 0) + θi,j(1, 1) ≤ θi,j(0, 1) + θi,j(1, 0). In our introductory example, f is submodular if h is concave. 
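As a quick sanity check on this definition, the element-wise submodularity condition can be verified by brute force on small ground sets. The sketch below is our own illustration (the helper name `is_submodular` and the toy functions are not from the paper); it confirms that a concave function of the number of cut edges is submodular, while a convex one is not:

```python
import math
from itertools import product

def is_submodular(f, m):
    """Brute-force check of f(min(y, y')) + f(max(y, y')) <= f(y) + f(y')
    over all pairs y, y' in {0,1}^m (only feasible for small m)."""
    points = list(product([0, 1], repeat=m))
    for y in points:
        for yp in points:
            lo = tuple(min(a, b) for a, b in zip(y, yp))   # element-wise min
            hi = tuple(max(a, b) for a, b in zip(y, yp))   # element-wise max
            if f(lo) + f(hi) > f(y) + f(yp) + 1e-12:
                return False
    return True

# h(u) = sqrt(u) is concave, so f(y) = h(#edges cut) is submodular
f_concave = lambda y: math.sqrt(sum(y))
assert is_submodular(f_concave, 4)

# a convex h, e.g. h(u) = u^2, violates submodularity
f_convex = lambda y: sum(y) ** 2
assert not is_submodular(f_convex, 4)
```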
As opposed to [3], we do not assume that f is monotone increasing. Importantly, even if f is submodular, P(x) neither has low treewidth, nor is its logarithm sub- or supermodular in x, properties that have commonly been exploited for inference.\n\nContributions. We make the following contributions: (1) We introduce a new family of probabilistic models that can capture rich non-submodular interactions, while still admitting efficient inference. This family includes pairwise and certain higher-order graphical models, cooperative cuts [1], and other, new models. We develop new inference methods for these models; in particular, (2) upper bounds that are amenable to convex optimization, and (3) lower bounds that we optimize with traditional variational methods. Finally, we demonstrate the efficacy of our methods empirically.\n\n1.1 Related work\n\nMaximum-a-posteriori (MAP). Computing the mode of (1) for binary models is also known as the cooperative cut problem, and has been analyzed for the case when both the pairwise interactions θi,j are submodular and f is monotone [1]. While the general problem is NP-hard, it can be solved if f is defined by a piecewise linear concave function [4].\n\nVariational inference. Since computing marginal probabilities for (1) is #P-hard even for pairwise models (when f ≡ 0) [5, 6], we revert to approximate inference. Variational inference methods for discrete pairwise models have been studied extensively; a comprehensive overview may be found in [7]. We will build on a selection of techniques that we discuss in the next section. Most existing methods focus on pairwise models (f ≡ 0), and many scale exponentially with the size of the largest factor, which is infeasible for our cooperative models. 
Some specialized tractable inference methods exist for higher-order models [8, 9], but they do not apply to our family of models (1).\n\n¹The results presented in this paper can be easily extended to arbitrary binary-valued functions y(x).\n\nLog-supermodular models. A related class of relatively tractable models are distributions P(x) = (1/Z) exp(−g(x)) for some submodular function g; Djolonga and Krause [10] showed variational inference methods for those models. However, our models are not log-supermodular. While [10] also obtain upper and lower bounds, we need different optimization techniques, and also different polytopes. In fact, submodular and multi-class submodular [11] settings are a strict subset of ours: the function g(x) can be expressed via an auxiliary variable z ∈ {0, 1} that is fixed to zero using ν(x, z) = ⟦z = 0⟧. We then set f(y(x, z)) = g(⟦x1 ≠ z⟧, ⟦x2 ≠ z⟧, . . . , ⟦xn ≠ z⟧).\n\n2 Notation and Background\n\nThroughout this paper, we have n variables in a graph with m edges, and the potentials θi and θi,j are stored in a vector θ. The characteristic vector (or indicator vector) 1_A of a set A is the binary vector which contains 1 in the positions corresponding to elements in A, and zeros elsewhere. Moreover, the vector of all ones is 1, and the neighbours of i ∈ V are denoted by δ(i) ⊆ V.\n\nSubmodularity. We assume that f in Eqn. (1) is submodular. Occasionally (in Sec. 4 and 5, where stated), we assume that f is monotone: for any y and y′ in {0, 1}^E such that y ≤ y′ coordinate-wise, it holds that f(y) ≤ f(y′). When defining the inference schemes, we make use of two polytopes associated with f. 
First, the base polytope of a submodular function f is\n\nB(f) = {g ∈ R^m | ∀y ∈ {0, 1}^E : g^T y ≤ f(y)} ∩ {g ∈ R^m | g^T 1 = f(1)}.\n\nAlthough B(f) is defined by exponentially many inequalities, an influential result [12] states that it is tractable: we can optimize linear functions over B(f) in time O(m log m + mF), where F is the time complexity of evaluating f. This algorithm is part of our scheme in Figure 2. Moreover, as a result of this (linear) tractability, it is possible to compute orthogonal projections onto B(f). Projection is equivalent to the minimum norm point problem [13]. While the general projection problem has a high degree polynomial time complexity, there are many very commonly used models that admit practically fast projections [14, 15, 16].\n\nThe second polytope is the upper submodular polyhedron of f [17], defined as\n\nU(f) = {(g, c) ∈ R^{m+1} | ∀y ∈ {0, 1}^E : g^T y + c ≥ f(y)}.\n\nUnfortunately, U(f) is not as tractable as B(f): even checking membership in U(f) is hard [17]. However, we can still succinctly describe specific elements of U(f). In §4, we show how to efficiently optimize over those elements.\n\nVariational inference. We briefly summarize key results for variational inference for pairwise models, following Wainwright and Jordan [7]. We write pairwise models as²\n\nP(x) = exp(−(Σ_{i∈V} θi(xi) + Σ_{{i,j}∈E} (gi,j ⟦xi ≠ xj⟧ + θi,j(xi, xj))) − A(g)) ν(x),\n\nwhere g ∈ R^E is an arbitrary vector and A(g) is the log-partition function. For any choice of parameters (θ, g), there is a resulting vector of marginals μ ∈ [0, 1]^{k|V|+k²|E|}. Specifically, for every i ∈ V, μ has k elements μi,xi = P(Xi = xi), one for each xi ∈ X. 
Similarly, for each {i, j} ∈ E, there are k² elements μij,xixj so that μij,xixj = P(Xi = xi, Xj = xj). The marginal polytope M is now the set of all such vectors μ that are realizable under some distribution P(x), and the partition function can equally be expressed in terms of the marginals [7]:\n\nA(g) = sup_{μ∈M} (−Σ_{i∈V,xi} μi,xi θi(xi) − Σ_{{i,j}∈E} Σ_{xi,xj} μij,xixj θi,j(xi, xj) − Δ(μ)^T g) + H(μ),   (2)\n\nwhere the term in parentheses is denoted ⟨stack(θ, g), μ⟩, H(μ) is the entropy of the distribution, Δ(μ) is the vector of disagreement probabilities with entries Δ(μ)i,j = Σ_{xi≠xj} μij,xixj, and stack(θ, g) adds the elements of θ and g into a single vector so that the sum can be written as an inner product.\n\n²This formulation is slightly nonstandard, but will be very useful for the subsequent discussion in §3.\n\nAlas, neither M nor H(μ) have succinct descriptions and we will have to approximate them. Because the vectors in the approximation of M are in general not correct marginals, they are called pseudo-marginals and will be denoted by τ instead of μ. Different approximations of M and H yield various methods, e.g. mean-field [7], the semidefinite programming (SDP) relaxation of Wainwright and Jordan [18], tree-reweighted belief propagation (TRWBP) [19], or the family of weighted entropies [20, 21]. Due to the space constraints, we only discuss the latter. 
They approximate M with the local polytope\n\nL = {τ ≥ 0 | (∀i ∈ V) Σ_{xi} τi,xi = 1 and (∀j ∈ δ(i)) τi,xi = Σ_{xj} τij,xixj}.\n\nThe approximations H to the entropy H are parametrized by one weight ρi,j per edge and one ρi per vertex i, all collected in a vector ρ ∈ R^{|V|+|E|}. Then, they take the following form\n\nH(τ, ρ) = Σ_{i∈V} ρi Hi(τi) + Σ_{{i,j}∈E} ρi,j Hi,j(τi,j), where\n\nHi(τi) = −Σ_{xi} τi,xi log τi,xi, and Hi,j(τi,j) = −Σ_{xi,xj} τij,xixj log τij,xixj.\n\nThe most prominent example is traditional belief propagation, i.e., using the Bethe entropy, which sets ρe = 1 for all e ∈ E, and assigns to each vertex i ∈ V a weight of ρi = 1 − |δ(i)|.\n\n3 Convex upper bounds\n\nThe above variational methods do not directly generalize to our cooperative models: the vectors of marginals could be exponentially large. Hence, we derive a different approach that relies on the submodularity of f. Our first step is to approximate f(y(x)) by a linear lower bound, f(y(x)) ≈ g^T y(x), so that the resulting (pairwise) linearized model will have a partition function upper bounding that of the original model. Ensuring that g indeed remains a lower bound means to satisfy an exponential number of constraints f(y(x)) ≥ g^T y(x), one for each x ∈ {0, 1}^n. While this is hard in general, the submodularity of f implies that these constraints are easily satisfied if g ∈ B(f), a very tractable constraint. 
For g ∈ B(f), we have\n\nlog Z = log Σ_{x∈{0,1}^V} exp(−(Σ_{i∈V} θi(xi) + Σ_{{i,j}∈E} θi,j(xi, xj) + f(y(x))))\n ≤ log Σ_{x∈{0,1}^V} exp(−(Σ_{i∈V} θi(xi) + Σ_{{i,j}∈E} (θi,j(xi, xj) + gi,j ⟦xi ≠ xj⟧))) ≡ A(g).\n\nUnfortunately, A(g) is still very hard to compute and we need to approximate it. If we use an approximation Ā(g) that upper bounds A(g), then the above inequality will still hold when we replace A by Ā. Such approximations can be obtained by relaxing the marginal polytope M to an outer bound M̄ ⊇ M, and using a concave entropy surrogate H̄ that upper bounds the true entropy H. TRWBP [19] or the SDP formulation [18] implement this approach. Our central optimization problem is now to find the tightest upper bound, an optimization problem³ in g:\n\nminimize_{g∈B(f)} sup_{τ∈M̄} ⟨stack(θ, g), τ⟩ + H̄(τ).   (3)\n\nBecause the inner problem is linear in g, this is a convex optimization problem over the base polytope. To obtain the gradient with respect to g (equal to the negative disagreement probabilities −Δ(τ)), we have to solve the inner problem. This subproblem corresponds to performing variational inference in a pairwise model, e.g. via TRWBP or an SDP. The optimization properties of problem (3) depend on the Lipschitz continuity of its gradients (smoothness). Informally, the inferred pseudo-marginals should not drastically change if we perturb the linearization g. The formal condition is that there exists some σ > 0 so that ‖Δ(τ) − Δ(τ′)‖ ≤ σ‖τ − τ′‖ for all τ, τ′ ∈ M̄. We discuss below when this condition holds. 
Before that, we discuss two different algorithms for solving problem (3), and how their convergence depends on σ.\n\n³If we compute the Fenchel dual, we obtain a special case of the problem considered in [22] with the Lovász extension acting as a non-smooth non-local energy function (in the terminology introduced therein).\n\nFrank-Wolfe. Given that we can efficiently solve linear programs over B(f), the Frank-Wolfe [23] algorithm is a natural candidate for solving the problem. We present it in Figure 2. It iteratively moves towards the minimizer of a linearization of the objective around the current iterate. The method has a convergence rate of O(σ/t) [24], where σ is the assumed smoothness parameter. One can either use a fixed step size γ = 2/(t + 2), or determine it using line search. In each iteration, the algorithm calls the procedure LINEAR-ORACLE, which finds the vector s ∈ B(f) that minimizes the linearization of the objective function in (3) over the base polytope B(f). The linearization is given by the (approximate) gradient Δ(τ), determined by the computed approximate marginals τ.\n\nWhen taking a step towards s, the weight of edge ei is changed by sei = f({e1, e2, . . . , ei}) − f({e1, e2, . . . , ei−1}). Due to the submodularity⁴ of f, an edge will obtain a higher weight if it appears earlier in the order determined by the disagreement probabilities Δ(τ). Hence, in every iteration, the algorithm re-adjusts the pairwise potentials by encouraging the variables to agree more as a function of their (approximate) disagreement probability.\n\n1: procedure FW-INFERENCE(f, θ)\n2:  g ← LINEAR-ORACLE(f, 0)\n3:  for t = 0, 1, . . . , max_steps do\n4:    τ ← VAR-INFERENCE(θ, g)\n5:    s ← LINEAR-ORACLE(f, τ)\n6:    γ ← COMPUTE-STEP-SIZE(g, s)\n7:    g ← (1 − γ)g + γs\n8:  return τ, Â\n\n1: procedure LINEAR-ORACLE(f, τ)\n2:  Let e1, e2, . . . , e|E| be the edges E sorted so that Δ(τ)e1 ≥ Δ(τ)e2 ≥ . . . ≥ Δ(τ)e|E|\n3:  for i = 1, 2, . . . , |E| do\n4:    f−i ← f({e1, e2, . . . , ei−1})\n5:    f+i ← f({e1, e2, . . . , ei})\n6:    sei ← f+i − f−i\n7:  return s\n\nFigure 2: Inference with Frank-Wolfe, assuming that VAR-INFERENCE guarantees an upper bound.\n\nProjected gradient descent (PGD). Since it is possible to compute projections onto B(f), and practically so for many submodular functions f, we can alternatively use projected (sub)gradient descent (PGD). Without smoothness, PGD converges at a rate of O(1/√t). If the objective is smooth, we can use an accelerated method like FISTA [25], which has both a much better O(σ/t²) rate and seems to converge faster than many Frank-Wolfe variants in our experiments.\n\nSmoothness and convergence. The final question that remains to be answered is under which conditions problem (3) is smooth (the proof can be found in the appendix).\n\nTheorem 1 Problem (3) is k²σ-smooth over B(f) if the entropy surrogate −H̄ is (1/σ)-strongly convex.\n\nThis result follows from the duality between smoothness and strong convexity for convex conjugates, see e.g. [26]. It implies that the convergence rates of the proposed algorithms depend on the strong convexity of the entropy approximation −H̄. The benefits of strongly convex entropy approximations are known. For instance, the tree-reweighted entropy approximation is strongly convex with a modulus σ depending on the size of the graph; similarly, the SDP relaxation is strongly convex [27]. London et al. 
[28] provide an even sharper bound for the tree-reweighted entropy, and show how one can strong-convexify any weighted entropy by solving a QP over the weights ρ.\n\nIn practice, because the inner problem is typically solved using an iterative algorithm and because the problem is smooth, we obtain speedups by warm-starting the solver with the solution at the previous iterate. We can moreover easily obtain duality certificates using the results in [24].\n\nJoint optimization. When using weighted entropy approximations, it makes sense to optimize over both the linearization g and the weights ρ jointly. Specifically, let T be some set of weights that yield an entropy approximation H̄ that upper bounds H. Then, if we expand H̄ in problem (3), we obtain\n\nminimize_{g∈B(f), ρ∈T} sup_{τ∈L} ⟨stack(θ, g), τ⟩ + Σ_{i∈V} ρi Hi(τi) + Σ_{{i,j}∈E} ρi,j Hi,j(τi,j).\n\nNote that inside the supremum, both g and ρ appear only linearly, and there is no summand that has terms from both of them. Thus, the problem is convex in (g, ρ), and we can optimize jointly over both variables. As a final remark, if we already perform inference in a pairwise model and repeatedly tighten the approximation by optimizing over ρ via Frank-Wolfe (as suggested in [19]), then the complexity per iteration remains the same even if we use the higher-order term f.\n\n⁴This is also known as the diminishing returns property.\n\n4 Submodular lower bounds\n\nWhile we just derived variational upper bounds, we next develop lower bounds on the partition function. Specifically, analogously to the linearization for the upper bound, if we pick an element (g, c) of U(f), the partition function of the resulting pairwise approximation always lower bounds the partition function of (1). 
Formally,\n\nlog Z ≥ log Σ_{x∈{0,1}^V} exp(−(Σ_{i∈V} θi(xi) + Σ_{{i,j}∈E} θi,j(xi, xj) + Σ_{{i,j}∈E} gi,j ⟦xi ≠ xj⟧ + c)) = A(g) − c.\n\nAs before, after plugging in a lower bound estimate of A, we obtain a variational lower bound on the partition function, which takes the form\n\nlog Z ≥ sup_{(g,c)∈U(f), τ∈M} −c + ⟨stack(θ, g), τ⟩ + H(τ),   (4)\n\nfor any pair of approximations of M and H that guarantee a lower bound for the pairwise model. We propose to optimize this lower bound in a block-coordinate-wise manner: first with respect to the pseudo-marginals τ (which amounts to approximate inference in the linearized model), and then with respect to the supergradient (g, c) ∈ U(f). As already noted, this step is in general intractable. However, it is well known [29] that for any Y ⊆ E we can construct a point (a so-called bar supergradient) in U(f) as follows. First, define the vectors a and b with entries ai,j = f(1_{{i,j}}) and bi,j = f(1) − f(1 − 1_{{i,j}}). Then, the vector (g, c) with g = b ⊙ 1_Y + (1 − 1_Y) ⊙ a and c = f(Y) − b^T 1_Y belongs to U(f), where ⊙ denotes element-wise multiplication.\n\nTheorem 2 Optimizing problem (4) for a fixed τ over all bar supergradients is equal to the following submodular minimization problem: min_{Y⊆E} f(Y) + (Δ(τ) ⊙ (b − a) − b)^T 1_Y.\n\nIn contrast to computing the MAP, the above problem has no constraints and can be easily solved using existing algorithms. As the approximation algorithm for the linearized pairwise model, one can always use mean-field [7]. Moreover, if (i) the problem is binary with submodular pairwise potentials θi,j and (ii) f is monotone, we can also use belief propagation. 
This is an implication of a result of Ruozzi [30], who shows that traditional belief propagation yields a lower bound on the partition function for binary pairwise log-supermodular models. It is easy to see that the above conditions are sufficient for the log-supermodularity of the linearized model, as g ≥ 0 when f is monotone (because both a and b then have non-negative components). Moreover, in this setting both the mean-field and belief propagation objectives (i.e. computing τ) can be cast as instances of continuous submodular minimization (see e.g. [31]), which means that they can be solved to arbitrary precision in polynomial time. Unfortunately, problem (4) will not be jointly submodular, so we still need to use the block-coordinate ascent method we have just outlined.\n\n5 Approximate inference via MAP perturbations\n\nFor binary models with submodular pairwise potentials and monotone f we can (approximately) solve the MAP problem using the techniques in [1, 4]. This opens up, as an alternative approach, the perturb-and-MAP method of Papandreou and Yuille [32]. This method relies on a set of tractable first-order perturbations: for any i ∈ V define θ′i(xi) = θi(xi) − ηi,xi, where η = (ηi,xi)i∈V,xi∈X is a set of independently drawn Gumbel random variables. The optimizer argmin_x Gη(x) of the perturbed model energy Gη(x) = Σ_{i∈V} θ′i(xi) + Σ_{{i,j}∈E} θi,j(xi, xj) + f(y(x)) is then a sample from (an approximation to) the true distribution. If this MAP problem can be solved exactly (which is not always the case here), then it is possible to obtain an upper bound on the partition function [33].\n\n6 Experiments\n\nSynthetic experiments. Our first set of experiments uses a complete graph on n variables. The unary potentials were sampled as θi(xi) ∼ Uniform(−α, α). The edges E were randomly split 
The edges E were randomly split\n\n6\n\n\finto \ufb01ve disjoint buckets E1, E2, . . . , E5, and we used f (y) =(cid:80)5\n\nj=1 hj(yEj ), where yEi are the\ncoordinates of y corresponding to that group, and the functions {hj} will be de\ufb01ned below. To\nperform inference in the linearized pairwise models, we used: trwbp, jtree+ (exact inference, upper\nbound), jtree- (same, lower bound), sdp (SDP), mf (mean-\ufb01eld), bp (belief propagation), pmap\n(perturb-and-MAP with approximate MAP) and epmap (perturb-and-MAP with exact MAP). We\nused libDAI [34] and implemented sdp using cvxpy [35] and SCS [36]. As a max\ufb02ow solver we\nused [37]. Errors bars denote three standard errors.\n\nye/(cid:112)|Ei|, with weights wi \u223c Uniform(0, \u03b2).\nwi,j(cid:74)xi (cid:54)= xj(cid:75), where wi,j \u223c Uniform(\u2212\u03b2, \u03b2). First, the results imply that the methods optimizing\n\nFigure 3 shows the results for hi(yEi) = wi\nIn panel (c) we use mixed (attractive and repulsive) pairwise potentials, chosen as \u03b8i,j(xi, xj) =\n\n(cid:113)(cid:80)\n\ne\u2208Ei\n\nthe fully convex upper bound yield very good marginal probabilities over a large set of parameter\ncon\ufb01gurations. The estimate of the log-partition function from trwbp is also very good, while sdp is\nmuch worse, which we believe can be attributed to the very loose entropy bound used in the relaxation.\nThe lower bounds (bp and mf) work well for settings when the pairwise strength \u03b2 is small compared\nto the unary strength \u03b1. Otherwise, both the bound and the marginals become worse, while jtree-\nstill performs very well. This could be explained by the hardness of the pairwise models obtained\nafter linearizing f. Finally, pmap (when applicable) seems very promising for small \u03b2.\nTo better understand the regimes when one should use trwbp or pmap, we compare their marginal\nerrors in Figure 5. 
We see that for most parameter configurations, trwbp performs better, and significantly so when the edge interactions are strong.\n\nFinally, we evaluate the effects of the approximate MAP solver for pmap in Figure 4. To be able to solve the MAP problem exactly (see [4]), we used h(yEj) = max{Σ_{e∈Ej} ye ve, Σ_{e∈Ej} ve/2}, where ve ∼ Uniform(0, β). As evident from the figure, the gains from the exact solver seem minimal, and solving the MAP problem approximately does not strongly affect the results.\n\nAn example from computer vision. To demonstrate the scalability of our method and obtain a better qualitative understanding of the resulting marginals, we ran trwbp and pmap on a real-world image segmentation task. We use the same setting, data and models as [1], as implemented in the pycoop⁵ package. Because libDAI was too slow, we wrote our own TRWBP implementation. Figure 6 shows the results for two specific images (size 305 × 398 and 214 × 320). The example in the first row is particularly difficult for pairwise models, but the rich higher-order model has no problem capturing the details even in the challenging shaded regions of the image. The second row shows results for two different model parameters. The second model uses a function f that is closer to being linear, while the first one is more curved (see the appendix for details). We observe that trwbp requires lower temperature parameters (i.e. relatively larger functions θi, θi,j and f) than pmap, and that the bottleneck of the complete inference procedure is running the trwbp updates. In other words, the added complexity from our method is minimal and the runtime is dominated by the message passing updates of TRWBP. 
Hence, any algorithms that speed up TRWBP (e.g., by parallelization or better message scheduling) will result in a direct improvement of the proposed inference procedure.\n\n7 Conclusion\n\nWe developed new inference techniques for a new broad family of discrete probabilistic models by exploiting the (indirect) submodularity in the model, and carefully combining it with ideas from classical variational inference in graphical models. The results are inference schemes that optimize rigorous bounds on the partition function. For example, our upper bounds lead to convex variational inference problems. Our experiments indicate the scalability, efficacy and quality of these schemes.\n\nAcknowledgements. This research was supported in part by SNSF grant CRSII2 147633, ERC StG 307036, a Microsoft Research Faculty Fellowship, a Google European Doctoral Fellowship, and NSF CAREER 1553284.\n\nReferences\n\n[1] S. Jegelka and J. Bilmes. “Submodularity beyond submodular energies: coupling edges in graph cuts”. CVPR. 2011.\n\n⁵https://github.com/shelhamer/coop-cut.\n\n(a) α = 2, binary, K15  (b) α = 0.1, binary, K15  (c) α = 0.1, mixed, 4 labels, K10\n\nFigure 3: Results on several synthetic models. The methods that optimize the convex upper bound (trwbp, sdp) obtain very good marginals for a large set of parameter settings. Those maximizing the lower bound (bp, mf) fail when there is strong coupling between the edges. In the strong coupling regime the results of pmap also deteriorate, but not as strongly. In (c) bp, pmap, sdp are not applicable.\n\nFigure 4: α = 2, K15, model where epmap is applicable. Solving the MAP problem exactly only marginally improves over pmap. The other observations are similar to those in Fig. 3b.\n\nFigure 5: errorpmap − errortrwbp on K15. 
Missing entries were not significant at the 0.05 level.

(a) Original image    (b) trwbp, pairwise    (c) pmap, pairwise    (d) trwbp, coop.    (e) pmap, coop.

(f) Original image    (g) trwbp, model 1    (h) pmap, model 1    (i) trwbp, model 2    (j) pmap, model 2

Figure 6: Inferred marginals on an image segmentation task. The first row showcases an example that is particularly hard for pairwise models. In the second row we show the results for two different models (the cooperative function f is more curved for model 1).

(Axes in Figures 3–5: mean absolute error in marginals and error in the estimate log Ẑ − log Z, plotted against the pairwise strength β.)

[2] N. Silberman, L. Shapira, R. Gal, and P. Kohli. "A Contour Completion Model for Augmenting Surface Reconstructions". ECCV. 2014.
[3] S. Jegelka and J. Bilmes. "Approximation Bounds for Inference using Cooperative Cuts". ICML. 2011.
[4] P. Kohli, A. Osokin, and S. Jegelka. "A principled deep random field model for image segmentation". CVPR. 2013.
[5] M. Jerrum and A. Sinclair. "Polynomial-time approximation algorithms for the Ising model". SIAM Journal on Computing 22.5 (1993), pp. 1087–1116.
[6] L. A. Goldberg and M. Jerrum. "The complexity of ferromagnetic Ising with local fields". Combinatorics, Probability and Computing 16.01 (2007), pp. 43–61.
[7] M. J. Wainwright and M. I. Jordan. "Graphical models, exponential families, and variational inference". Foundations and Trends in Machine Learning 1.1-2 (2008).
[8] D. Tarlow, K. Swersky, R. S. Zemel, R. P. Adams, and B. J. Frey. "Fast Exact Inference for Recursive Cardinality Models". UAI. 2012.
[9] V. Vineet, J. Warrell, and P. H. Torr. "Filter-based mean-field inference for random fields with higher-order terms and product label-spaces". IJCV 110 (2014).
[10] J. Djolonga and A. Krause. "From MAP to Marginals: Variational Inference in Bayesian Submodular Models". NIPS. 2014.
[11] J. Zhang, J. Djolonga, and A. Krause. "Higher-Order Inference for Multi-class Log-supermodular Models". ICCV. 2015.
[12] J. Edmonds. "Submodular functions, matroids, and certain polyhedra". Combinatorial Structures and Their Applications (1970), pp. 69–87.
[13] S. Fujishige and S. Isotani. "A submodular function minimization algorithm based on the minimum-norm base". Pacific Journal of Optimization 7.1 (2011), pp. 3–17.
[14] P. Stobbe and A. Krause. "Efficient Minimization of Decomposable Submodular Functions". NIPS. 2010.
[15] S. Jegelka, F. Bach, and S. Sra. "Reflection methods for user-friendly submodular optimization". NIPS. 2013.
[16] F. Bach. "Learning with submodular functions: a convex optimization perspective". Foundations and Trends in Machine Learning 6.2-3 (2013).
[17] R. Iyer and J. Bilmes. "Polyhedral aspects of Submodularity, Convexity and Concavity". arXiv:1506.07329 (2015).
[18] M. J. Wainwright and M. I. Jordan. "Log-determinant relaxation for approximate inference in discrete Markov random fields". IEEE Transactions on Signal Processing 54.6 (2006).
[19] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. "A new class of upper bounds on the log partition function". UAI. 2002.
[20] T. Heskes. "Convexity Arguments for Efficient Minimization of the Bethe and Kikuchi Free Energies". JAIR 26 (2006).
[21] O. Meshi, A. Jaimovich, A. Globerson, and N. Friedman. "Convexifying the Bethe free energy". UAI. 2009.
[22] L. Vilnis, D. Belanger, D. Sheldon, and A. McCallum. "Bethe Projections for Non-Local Inference". UAI. 2015.
[23] M. Frank and P. Wolfe. "An algorithm for quadratic programming". Naval Research Logistics Quarterly (1956).
[24] M. Jaggi. "Revisiting Frank-Wolfe: Projection-free sparse convex optimization". ICML. 2013.
[25] A. Beck and M. Teboulle. "A fast iterative shrinkage-thresholding algorithm for linear inverse problems". SIAM Journal on Imaging Sciences 2.1 (2009), pp. 183–202.
[26] S. Kakade, S. Shalev-Shwartz, and A. Tewari. "On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization". Technical Report (2009).
[27] M. J. Wainwright. "Estimating the wrong graphical model: Benefits in the computation-limited setting". JMLR 7 (2006).
[28] B. London, B. Huang, and L. Getoor. "The benefits of learning with strongly convex approximate inference". ICML. 2015.
[29] R. Iyer, S. Jegelka, and J. Bilmes. "Fast Semidifferential-based Submodular Function Optimization". ICML. 2013.
[30] N. Ruozzi. "The Bethe partition function of log-supermodular graphical models". NIPS. 2012.
[31] A. Weller and T. Jebara. "Approximating the Bethe Partition Function". UAI. 2014.
[32] G. Papandreou and A. L. Yuille. "Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models". ICCV. 2011.
[33] T. Hazan and T. Jaakkola. "On the partition function and random maximum a-posteriori perturbations". ICML. 2012.
[34] J. M. Mooij. "libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models". JMLR (2010), pp. 2169–2173.
[35] S. Diamond and S. Boyd. "CVXPY: A Python-Embedded Modeling Language for Convex Optimization". JMLR (2016). To appear.
[36] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. "Conic Optimization via Operator Splitting and Homogeneous Self-Dual Embedding". Journal of Optimization Theory and Applications (2016).
[37] Y. Boykov and V. Kolmogorov. "An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision". IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004).