{"title": "Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2654, "page_last": 2662, "abstract": "Probabilistic graphical models are powerful tools for analyzing constrained, continuous domains. However, finding most-probable explanations (MPEs) in these models can be computationally expensive. In this paper, we improve the scalability of MPE inference in a class of graphical models with piecewise-linear and piecewise-quadratic dependencies and linear constraints over continuous domains. We derive algorithms based on a consensus-optimization framework and demonstrate their superior performance over state of the art. We show empirically that in a large-scale voter-preference modeling problem our algorithms scale linearly in the number of dependencies and constraints.", "full_text": "Scaling MPE Inference for Constrained Continuous\nMarkov Random Fields with Consensus Optimization\n\nStephen H. Bach\nUniversity of Maryland, College Park\nCollege Park, MD 20742\nbach@cs.umd.edu\n\nMatthias Broecheler\nAurelius LLC\nmatthias@thinkaurelius.com\n\nDianne P. O\u2019Leary\nUniversity of Maryland, College Park\nCollege Park, MD 20742\noleary@cs.umd.edu\n\nLise Getoor\nUniversity of Maryland, College Park\nCollege Park, MD 20742\ngetoor@cs.umd.edu\n\nAbstract\n\nProbabilistic graphical models are powerful tools for analyzing constrained, con-\ntinuous domains. However, finding most-probable explanations (MPEs) in these\nmodels can be computationally expensive. In this paper, we improve the scala-\nbility of MPE inference in a class of graphical models with piecewise-linear and\npiecewise-quadratic dependencies and linear constraints over continuous domains.\nWe derive algorithms based on a consensus-optimization framework and demon-\nstrate their superior performance over state of the art. 
We show empirically that in\na large-scale voter-preference modeling problem our algorithms scale linearly in\nthe number of dependencies and constraints.\n\n1 Introduction\n\nThere is a growing need for statistical models which can capture rich dependencies in structured\ndata. Link prediction, collective classification, modeling information diffusion, entity resolution,\nand viral marketing are all important tasks where incorporating structural dependencies is crucial\nfor good predictive performance. Graphical models [1] are an expressive class of statistical models\nto address such problems, but their applicability to large datasets is often limited by impractically\nexpensive inference and learning algorithms.\nIn this paper, we focus on scaling up most-probable-explanation (MPE) inference for a particular\nclass of graphical models called constrained continuous Markov random fields (CCMRFs) [2]. Like\nother Markov random fields (MRFs), CCMRFs define a joint distribution over a collection of ran-\ndom variables and capture local dependencies through potential functions. However, unlike many\npopular discrete MRFs which are defined over binary random variables, CCMRFs are defined over\ncontinuous random variables. They also allow their domains to be constrained. This makes CCM-\nRFs ideally suited to reason over continuous quantities, such as similarity, affinity, or probability,\nwithout making assumptions about the variables\u2019 marginal distributions.1\nMPE inference for CCMRFs is tractable under mild convexity assumptions because it can be cast as a\nconvex numeric optimization problem, which can be solved by interior-point methods [3]. 
However,\nfor large problems, interior-point methods are impractically slow because each step takes time up to\ncubic in the size of the problem.\n\n1In contrast with Gaussian random \ufb01elds where random variables are assumed to be Gaussian.\n\n1\n\n\fWe show how hinge-loss potential functions that are often used to model real world problems in\nCCMRFs (see, e.g., [3, 2, 4, 5, 6, 7]) can be exploited to signi\ufb01cantly speed up the numeric opti-\nmization and therefore MPE inference. To do so, we rely on a consensus optimization framework [8].\nConsensus optimization has recently been shown to perform well on relaxations of discrete optimiza-\ntion problems, like MRF MPE inference [8, 9, 10].\nThe contributions of this paper are as follows: First, we derive algorithms for the MPE problem\nin CCMRFs with piecewise-linear and piecewise-quadratic dependencies in Section 3. Next, we\nimprove the performance of consensus optimization by deriving an algorithm that exploits oppor-\ntunities for closed-form solutions to subproblems, based on the current optimization iterate, before\nresorting to an iterative solver when the closed-form solution is not applicable. Then, we present an\nexperimental evaluation (Section 4) that demonstrates superior performance of our approach over\na commercial interior-point method, the current state-of-the-art for CCMRF MPE inference. In a\nvoter-preference modeling problem, our algorithms scaled linearly in the number of dependencies\nand constraints. In addition, compared to an exact solver, our method achieves at least 99.6% of the\noptimal solution. Finally, we show that our improved consensus-optimization algorithm more than\ndoubles the speed of a less sophisticated approach. 
To the best of our knowledge, we are the first\nto show results on MPE inference for any MRF variant using consensus optimization with iterative\nmethods to solve subproblems.\n\n2 Background\n\nIn this section we formally introduce the class of probabilistic graphical models for which we derive\ninference algorithms and present a simple running example (this is the same example used in our\nexperiments in Section 4). We also give an overview of consensus optimization [8], the abstract\nframework we will use to derive our algorithms in Section 3.\n\n2.1 Constrained continuous Markov random fields and the MPE problem\n\nA constrained continuous Markov random field (CCMRF) is a probabilistic graphical model defined\nover continuous random variables with a constrained domain [2].\nIn this paper, we focus on a\ncommon subclass in which dependencies among continuous random variables are defined in terms\nof hinge-loss functions and linear constraints:\nDefinition 1. A hinge-loss constrained continuous Markov random field f is a probability density\nover a finite set of n random variables X = {X1, . . . , Xn} with domain D = [0, 1]^n. Let \u03c6 =\n{\u03c61, . . . , \u03c6m} be a finite set of m continuous potential functions of the form\n\n\u03c6j(X) = [max{\u2113j(X), 0}]^{pj}\n\nwhere \u2113j is a linear function of X and pj \u2208 {1, 2}. Let C = {C1, . . . , Cr} be a finite set of r linear\nconstraint functions associated with two index sets denoting equality and inequality constraints, E\nand I, which define the feasible set D\u0303 = {X \u2208 D | Ck(X) = 0, \u2200k \u2208 E and Ck(X) \u2265 0, \u2200k \u2208 I}.\nIf X \u2209 D\u0303, then f(X) = 0.\nIf X \u2208 D\u0303, then, for a set of non-negative free parameters \u039b =\n{\u039b1, . . . 
, \u039bm},\n\nf(X) = (1/Z(\u039b)) exp[\u2212\u2211_{j=1}^{m} \u039bj \u03c6j(X)];    Z(\u039b) = \u222b_{D\u0303} exp[\u2212\u2211_{j=1}^{m} \u039bj \u03c6j(X)] dX.\n\nDefinition 1 is a special case of the definition of CCMRFs of Broecheler and Getoor [2]. It says that\nhinge-loss CCMRFs are models in which densities of assignments to variables are defined by an\nexponential of the negated, weighted sum of functions over those assignments, unless any constraint\nis violated, in which case the density is zero.\nThe MPE problem is to maximize f(X) such that X \u2208 D\u0303. In a hinge-loss CCMRF, the normal-\nizing function Z(\u039b) is constant over X for fixed parameters and the exponential is maximized by\nminimizing its negated argument, so the MPE problem is\n\narg max_{X} f(X) \u2261 arg min_{X \u2208 [0,1]^n} \u2211_{j=1}^{m} \u039bj \u03c6j(X)\n    s.t. Ck(X) = 0, \u2200k \u2208 E and Ck(X) \u2265 0, \u2200k \u2208 I.    (1)\n\nHinge-loss CCMRFs have two main desirable properties. First, the MPE problem is convex. Second,\nthey are expressive. Hinge-loss functions are useful for many domains. Instances of hinge-loss\nCCMRFs have been used previously to model many problems, including link prediction, collective\nclassification [3, 2], prediction of opinion diffusion [4], medical decision making [5], trust analysis\nin social networks [6], and group detection in social networks [7].\nFor ease of presentation, in the rest of this paper, when we refer to CCMRFs we mean hinge-loss\nCCMRFs. Next, we present a motivating CCMRF, using an example from Broecheler et al. [4].\nExample 1 (Opinion diffusion). Consider a social network S \u2261 (V, E) of voters in a set V with\nrelationships defined by annotated, unweighted, directed edges (va, vb)\u03c4 \u2208 E. 
Here, va, vb \u2208 V\nand \u03c4 is an annotation denoting the type of relationship: friend, boss, etc. To reason about\nvoters\u2019 opinions towards two hypothetical political parties, liberal (L) and conservative (C), we\nintroduce two nonnegative random variables Xa,L and Xa,C, summing to at most one, representing\nthe strength of voter va\u2019s preferences for each political party. We assume that va\u2019s preference results\nfrom an intrinsic opinion and the influence of va\u2019s social group. We represent the intrinsic opinion\nby opinion(va), ranging from \u22121 (strongly favoring L) to 1 (strongly favoring C).\nThe influence of the social group is modeled by potential functions that we generically denote\nas \u03c6. First we penalize deviations from intrinsic opinions. If opinion(va) < 0, then \u03c6 \u2261\n[max{|opinion(va)| \u2212 Xa,L, 0}]p, which penalizes preferences that are weaker than intrinsic opin-\nions. Similarly, \u03c6 \u2261 [max{opinion(va) \u2212 Xa,C, 0}]p when opinion(va) > 0. These hinge-loss\npotential functions are weighted by a fixed parameter \u039bopinion.\nNext we penalize disagreements between voters in a social group. For each edge (va, vb)\u03c4 we\nintroduce potential functions \u03c6 \u2261 [max{Xb,L \u2212 Xa,L, 0}]p and \u03c6 \u2261 [max{Xb,C \u2212 Xa,C, 0}]p,\npenalizing preferences of va that are not as strong as those of vb. These potential functions are\nweighted by parameters \u039b\u03c4 defining the relative influence of the \u03c4 relationship. For example, we\nexpect more influence from a close friend than from a co-worker.\nWe consider p = 1, meaning that the model has no preference between distributing the loss and\naccumulating it on a single potential function, and p = 2, meaning that the model prefers to\ndistribute the loss among multiple hinge-loss functions. 
To illustrate the choice, consider a single\nvoter in a CCMRF with two equally-weighted potential functions \u03c61 \u2261 [max{0.9 \u2212 Xa,L, 0}]p and\n\u03c62 \u2261 [max{0.6 \u2212 Xa,C, 0}]p. Let 0.9 and 0.6 represent the preferences of the voter\u2019s two friends.\nIf p = 1, then any assignment Xa,L, Xa,C with Xa,L \u2208 [0.4, 0.9] and Xa,C = 1 \u2212 Xa,L is a\nMPE. However, if p = 2, then only the assignment Xa,L = 0.65, Xa,C = 0.35 is a MPE. We see\nthat, all else being equal, squared potential functions \u201crespect\u201d the minima of individual potential\nfunctions if they cannot all be minimized. However, this useful modeling feature generally increases\nthe computational cost. As we demonstrate in Section 4, scaling MPE inference for CCMRFs with\npiecewise-quadratic potential functions is one of the contributions of our work.\n\n2.2 Consensus optimization\n\nConsensus optimization is a framework that optimizes an objective by dividing it into independent\nsubproblems and then iterating to reach a consensus on the optimum [8]. In this subsection we\npresent an abstract consensus optimization algorithm for Problem (1), the MPE problem for CCM-\nRFs. In Section 3 we will derive specialized versions for different potential functions.\nGiven a CCMRF (X, \u03c6, C,E,I, \u039b) and parameter \u03c1 > 0, the algorithm \ufb01rst constructs a modi\ufb01ed\nMPE problem in which each potential and constraint is a function of different variables. The vari-\nables are constrained to make the new and original MPE problems equivalent. We let xj be a copy\nof the variables in X that are used in the potential function \u03c6j, j = 1, . . . , m and xk+m be a copy\nof those used in the constraint function Ck, k = 1, . . . , r. We also introduce an indicator function\nIk for each constraint function where Ik [Ck(xk+m)] = 0 if the constraint is satis\ufb01ed and \u221e if it is\nnot. Finally, let Xi be the variables in X that are copied in xi, i = 1, . . . 
, m + r.\nConsensus optimization solves the new MPE problem\n\narg min_{xi \u2208 [0,1]^{ni}} \u2211_{j=1}^{m} \u039bj \u03c6j(xj) + \u2211_{k=1}^{r} Ik[Ck(x_{k+m})]    subject to xi = Xi    (2)\n\nwhere i = 1, . . . , m + r and ni is the number of components of xi. Inspection shows that Prob-\nlems (1) and (2) are equivalent.\n\nAlgorithm Consensus optimization\nInput: CCMRF (X, \u03c6, C, E, I, \u039b), \u03c1 > 0\nInitialize xj as a copy of the variables in X that appear in \u03c6j, j = 1, . . . , m\nInitialize xk+m as a copy of the variables in X that appear in Ck, k = 1, . . . , r\nInitialize yi at 0, i = 1, . . . , m + r\nwhile not converged do\n    for i = 1, . . . , m + r do\n        yi \u2190 yi + \u03c1(xi \u2212 Xi)\n    end for\n    for j = 1, . . . , m do\n        xj \u2190 arg min_{xj \u2208 [0,1]^{nj}} \u039bj \u03c6j(xj) + (\u03c1/2) ||xj \u2212 Xj + (1/\u03c1) yj||_2^2\n    end for\n    for k = 1, . . . , r do\n        x_{k+m} \u2190 arg min_{x_{k+m} \u2208 [0,1]^{n_{k+m}}} Ik[Ck(x_{k+m})] + (\u03c1/2) ||x_{k+m} \u2212 X_{k+m} + (1/\u03c1) y_{k+m}||_2^2\n    end for\n    Set each variable in X to the average of its copies\nend while\n\nWe use the alternating direction method of multipliers (ADMM) [11, 12, 8] to solve Problem (2).\nADMM can be viewed as an approach to combining the scalability of dual decomposition and the\nconvergence properties of augmented Lagrangian methods [8]. We outline the algorithm in the\nabove pseudocode. At each step in the iteration, it solves m + r independent optimization problems,\none for each \u03c6j and each Ck. It then averages the copies of variables to get the consensus variables\nX for the next iteration. Lagrange multipliers yi for each xi ensure convergence. The objective is\nknown to converge to its optimum and the iterates to approach feasibility under mild assumptions\n[13, 14, 8]. See Boyd et al. [8] or this paper\u2019s supplementary material for more information. 
In the\nnext section we derive algorithms with specific methods for updating each xj.\n\n3 Solving the MPE problem with consensus optimization\n\nWe now derive algorithms to update xj for each potential function \u03c6j. At this point we drop the\nmore complex notation and view each update as an instance of the problem\n\narg min_{x \u2208 [0,1]^n} \u039b[max{c^T x + c0, 0}]^p + (\u03c1/2)||x \u2212 d||_2^2    (3)\n\nwhere c, d \u2208 R^n, c0 \u2208 R, \u039b \u2265 0, p \u2208 {1, 2}, and \u03c1 > 0. To map an update to Problem (3) for a\npotential function \u03c6j and parameter \u039bj, let n = nj, c^T x + c0 = \u2113j(xj), d = Xj \u2212 (1/\u03c1)yj, \u039b = \u039bj,\np = pj, and keep \u03c1 the same.\nOur first algorithm, CO-Linear, solves the MPE problem when p = 1 and n \u2264 2 in each instance\nof Problem (3), i.e., each potential function has at most two unknowns and is piecewise-linear. We\npresent the update in terms of the intermediate optimization problems it solves. (We use variables\n\u03b1 with parenthetical superscripts to easily refer to the solutions of intermediate problems, but im-\nplementations should not treat them as separate variables.) It first finds \u03b1(1), which is easy to do by\ninspection. For each component \u03b1(1)_j of \u03b1(1)\n\n\u03b1(1)_j = 0     if dj < 0\n         dj    if 0 \u2264 dj \u2264 1\n         1     if dj > 1\n\nwhere j = 1, . . . , n. We refer to this procedure as clipping the vector d to the interval [0, 1].\nIn this section, when we refer to clipping to [a, b], we mean an identical vector except that any\ncomponent outside a bound a or b is changed to that bound. \u03b1(2) is also easy to find: clip the vector\nd \u2212 (\u039b/\u03c1)c to [0, 1]. There are two cases when finding \u03b1(3). 
If n = 1, clip the scalar \u2212c0/c1 to\n[0, 1]. If n = 2, solve c^T x = \u2212c0 for one of the components of x, substitute to eliminate that\ncomponent in the objective, and compute the interval [min, max] of values of the remaining\ncomponent for which x \u2208 [0, 1]^2 and c^T x = \u2212c0. Inspect the reduced objective and clip the\nunconstrained minimizer to [min, max]. Substitute the result back into c^T x = \u2212c0 to find the other\ncomponent.\n\nAlgorithm Update for CO-Linear\nInput: c, d \u2208 R^n where n \u2264 2, c0 \u2208 R, \u039b \u2265 0, \u03c1 > 0\nOutput: x* = arg min_{x \u2208 [0,1]^n} \u039b[max{c^T x + c0, 0}] + (\u03c1/2)||x \u2212 d||_2^2\n\u03b1(1) \u2190 arg min_{x \u2208 [0,1]^n} (\u03c1/2)||x \u2212 d||_2^2    (by inspection)\nif c^T \u03b1(1) + c0 \u2264 0 then\n    x* \u2190 \u03b1(1)\nelse\n    \u03b1(2) \u2190 arg min_{x \u2208 [0,1]^n} \u039b c^T x + (\u03c1/2)||x \u2212 d||_2^2    (by inspection)\n    if c^T \u03b1(2) + c0 \u2265 0 then\n        x* \u2190 \u03b1(2)\n    else\n        x* \u2190 \u03b1(3) \u2190 arg min_{x \u2208 [0,1]^n s.t. c^T x + c0 = 0} (\u03c1/2)||x \u2212 d||_2^2    (by substitution and inspection)\n    end if\nend if\n\nTo verify that the CO-Linear update is correct, first consider the case when c^T \u03b1(1) + c0 \u2264 0. Since\n\u03b1(1) minimizes (\u03c1/2)||x \u2212 d||_2^2 and \u039b[max{c^T x + c0, 0}] \u2265 0, each term of the update objective\nis minimized at \u03b1(1), so x* = \u03b1(1). In the second case, if c^T \u03b1(1) + c0 > 0, but c^T \u03b1(2) + c0 \u2265 0,\nthen observe that \u03b1(2) minimizes an objective which bounds the update objective below, but the two\nobjectives are equal at \u03b1(2). Therefore, x* = \u03b1(2). Finally, in the third case, c^T \u03b1(1) + c0 > 0 and\nc^T \u03b1(2) + c0 < 0. 
We know \u2203x \u2208 [0, 1]^n such that c^T x + c0 = 0, so the problem can be split into\ntwo feasible problems:\n\n\u03b2(1) \u2261 arg min_{x \u2208 [0,1]^n s.t. c^T x + c0 \u2264 0} (\u03c1/2)||x \u2212 d||_2^2\n\u03b2(2) \u2261 arg min_{x \u2208 [0,1]^n s.t. c^T x + c0 \u2265 0} \u039b c^T x + (\u03c1/2)||x \u2212 d||_2^2.\n\nEither x* = \u03b2(1) or x* = \u03b2(2) (or both). We use Lemma 4 of Martins et al. [9], which states that\ngiven a convex, feasible optimization problem over a nonempty convex subset of R^n with a convex\nconstraint, if that constraint is violated by the minimizer to a relaxed problem without that constraint\nover the same set, then that constraint will be active at the minimizer to the original problem. Since\nc^T \u03b1(1) + c0 > 0 and c^T \u03b1(2) + c0 < 0, we conclude that c^T \u03b2(1) + c0 = 0 and c^T \u03b2(2) + c0 = 0.\nTherefore x* = \u03b2(1) = \u03b2(2) = \u03b1(3).\nCO-Linear is sufficient to solve many useful and interesting models. Unfortunately, the piecewise-\nquadratic case (p = 2) is more difficult. If n > 1 and it cannot be established that c^T x* + c0 \u2264 0,\nthen the approach of CO-Linear is not applicable, because minimizing \u039b c^T x x^T c + 2\u039b c0 c^T x +\n(\u03c1/2)||x \u2212 d||_2^2 over [0, 1]^n does not have a (known) closed-form solution in general. That motivates\nus to derive an algorithm for the piecewise-quadratic case that can resort to a sufficiently general\niterative solver if necessary. Obviously, a naive algorithm could use an iterative method immediately\nif n > 1. However, CO-Linear still offers some insight into the problem. If clipping d to [0, 1] gives\na vector \u03b1(1) such that c^T \u03b1(1) + c0 \u2264 0, then again it is the minimizer.\nOur second algorithm, CO-Quad, first tries to find x* by clipping d to [0, 1] for any n. If it does\nnot succeed and n = 1, then \u03b1(2) can be found by inspection. 
If n > 1, then an iterative method\nis required. Note that, after concluding that c^T x* + c0 \u2265 0, we can simply minimize \u039b c^T x x^T c +\n2\u039b c0 c^T x + (\u03c1/2)||x \u2212 d||_2^2 to find x*, since \u039b c^T x x^T c + 2\u039b c0 c^T x is symmetric about the hyperplane\nc^T x + c0 = 0, (\u03c1/2)||x \u2212 d||_2^2 is minimized for some x such that c^T x + c0 \u2265 0, and the objective is\nthe same as the subproblem on that region.\n\nAlgorithm Update for CO-Quad\nInput: c, d \u2208 R^n, c0 \u2208 R, \u039b \u2265 0, \u03c1 > 0\nOutput: x* = arg min_{x \u2208 [0,1]^n} \u039b[max{c^T x + c0, 0}]^2 + (\u03c1/2)||x \u2212 d||_2^2\n\u03b1(1) \u2190 arg min_{x \u2208 [0,1]^n} (\u03c1/2)||x \u2212 d||_2^2    (by inspection)\nif c^T \u03b1(1) + c0 \u2264 0 then\n    x* \u2190 \u03b1(1)\nelse\n    if n = 1 then\n        x* \u2190 \u03b1(2) \u2190 arg min_{x \u2208 [0,1]^n} \u039b c^T x x^T c + 2\u039b c0 c^T x + (\u03c1/2)||x \u2212 d||_2^2    (by inspection)\n    else\n        x* \u2190 \u03b1(3) \u2190 arg min_{x \u2208 [0,1]^n} \u039b c^T x x^T c + 2\u039b c0 c^T x + (\u03c1/2)||x \u2212 d||_2^2    (by iterative method)\n    end if\nend if\n\nTo update xk+m for each constraint Ck, both CO-Linear and CO-Quad use the method proposed by\nMartins et al. [9], which handles the case when Ck(xk+m) = 0 defines a probability simplex. This is\nsufficient for the purposes of this work.\n\n4 Experiments\n\nWe evaluated the scalability of CO-Linear and CO-Quad by generating social networks of varying\nsizes, constructing CCMRFs with them, and measuring the running time required to find a MPE.\nWe compared our approach to the previous state-of-the-art approach for finding MPEs in CCMRFs,\nwhich uses an interior-point method implemented in MOSEK, a commercial optimization package\n(http://www.mosek.com). 
Next we describe the social-network and CCMRF generation procedure,\nthe implementations and setup, and then present the results.\n\n4.1 Social-network and CCMRF generation\n\nOur social-network generation process follows Example 1 and is based on the procedure described\nby Broecheler et. al. [4] to generate social networks using power-law degree distributions. Given\na desired number of vertices N (which the procedure matches approximately) and a list of edge\ntypes, along with parameters \u03b3 and \u03b1 for each type, the procedure samples in- and out-degrees\nfor each node for each edge type from the power-law distribution D(k) \u2261 \u03b1k\u2212\u03b3. Incoming and\noutgoing edges of the same type are then matched randomly to create edges until no more matches\nare possible. Vertices with no incoming or outgoing edges are removed from the network. We used\nsix edge types with various parameters to represent relationships in social networks with different\ncombinations of abundance and exclusivity, choosing \u03b3 between 2 and 3, and \u03b1 between 0 and 1, as\nsuggested by Broecheler et. al. We then annotated each vertex with a value in [\u22121, 1] uniformly at\nrandom to represent intrinsic opinions as described in Example 1.\nWe generated social networks with between 22,050 and 66,150 vertices, which induced CCMRFs\nwith between 130,082 and 397,494 total potential functions and constraints. In all the CCMRFs,\nbetween 83% and 85% of those totals were potential functions and between 15% and 17% were\nconstraints. For each social network, we created both a CCMRF to test CO-Linear (p = 1 in\nDe\ufb01nition 1) and one to test CO-Quad (p = 2). We chose \u039bopinion = 0.5 and chose \u039b\u03c41, . . . , \u039b\u03c46\nbetween 0 and 1 to model both more and less in\ufb02uential relationships.\n\n4.2\n\nImplementation\n\nWe implemented CO-Linear and CO-Quad in Java. 
We used the interior-point method in MOSEK\nto find \u03b1(3) in the update for CO-Quad when necessary by passing the problem via MOSEK\u2019s Java\nnative interface wrapper. We also compared with MOSEK\u2019s interior-point method by encoding the\nentire MPE problem as a linear program or a second-order cone program as appropriate, and passing\nthe encoded problem via the Java native interface wrapper.\n\nFigure 1: Average running times to find a most probable explanation (MPE) in CCMRFs.\n(a) Piecewise-linear MPE problems. (b) Piecewise-quadratic MPE problems.\n\nAll experiments were performed on a single machine with two 6-core 3.06 GHz Intel Xeon X5675\nprocessors with 48 GB of RAM. Each optimizer used a single thread. All results are averaged over 3\nruns. All differences between CO-Linear and the interior-point method are significant at p = 0.0005.\nAll differences between CO-Quad and the interior-point method are significant at p = 0.005 on\nproblems with more than 175,000 potential functions and constraints. (The interior-point method\nexhibited much higher variance in running times on piecewise-quadratic problems.) All differences\nbetween CO-Quad and Naive CO-Quad are significant at p = 0.0005.\n\n4.3 Results\n\nWe first evaluated the scalability of CO-Linear and compared with MOSEK\u2019s interior-point method.\nFigure 1a shows the results. The running time of the interior-point method quickly exploded as\nthe problem size increased. Although we do not show it in the figure, the average running time\non the largest problem was about 4,900 seconds (over 1 hour, 20 minutes). This demonstrates the\nlimited scalability of the interior-point method. 
In contrast, CO-Linear displays excellent scalability.\nThe average running time on the largest problem was about 130 seconds (2 minutes, 10 seconds).\nFurther, the running time grows linearly in the number of potential functions and constraints in the\nCCMRF, i.e., the number of subproblems that must be solved at each iteration. The line of best\n\ufb01t has R2 = 0.99834. Combined with Figure 1a, this shows that CO-Linear scaled linearly with\nincreasing problem size. We emphasize that the implementation of CO-Linear is research code\nwritten in Java and the interior-point method is a commercial package running as native code. The\ndramatic differences in running times illustrate the superior utility of CO-Linear for these problems.\nWe then evaluated CO-Quad. Figure 1b shows the results (note the 2-orders-of-magnitude increase\non the vertical axis between CO-Linear and CO-Quad). Again, the running time of the interior-\npoint method quickly exploded. We could only test it on the three smallest problems, the largest of\nwhich took an average of about 56,500 seconds to solve (over 15 hours, 40 minutes). Consensus\noptimization again scaled linearly to the problem. The line of best \ufb01t has R2 = 0.9842. To compare\nwith the interior-point method, on the third-smallest problem, CO-Quad took an average of about\n5,250 seconds (under 1 hour, 28 minutes). We also evaluated a naive variant of CO-Quad which\nimmediately updates xj via the interior-point method when there are two unknowns. As Figure 1b\nshows, the difference is signi\ufb01cant. 
This demonstrates that CO-Quad further improves on a\nless sophisticated consensus-optimization approach, in addition to improving on the previous state of the art.\n\n[Figure 1 axes: number of potential functions and constraints (horizontal) versus time in seconds (vertical). Series: (a) CO-Linear, interior-point method; (b) CO-Quad, Naive CO-Quad, interior-point method.]\n\nOne of the advantages of interior-point methods is great numerical stability and accuracy. Consensus\noptimization, which treats both objective terms and constraints as subproblems, often returns points\nthat are only optimal and feasible to moderate precision for non-trivially constrained problems [8].\nAlthough this is often acceptable, we quantified the mix of infeasibility and suboptimality by repair-\ning the infeasibility and measuring the resulting total suboptimality. We first projected the solutions\nreturned by consensus optimization onto the feasible region, which took a negligible amount of com-\nputational time. Let p_C be the value of the objective in Problem (1) at such a point and let p_IPM be\nthe value of the objective at the point returned by the interior-point method. Then the relative error\non that problem is (p_C \u2212 p_IPM)/p_IPM. The relative error was consistently small. 
For CO-Linear,\nit varied between 0.2% and 0.4%, and did not trend upward as the problem size increased. For\nCO-Quad, when the interior-point method also returned a solution, the relative error was always less\nthan 0.05% and also did not trend upward. This shows that consensus optimization was accurate,\nin addition to being dramatically faster (lower absolute time) and more scalable (smaller growth in\ntime with problem size).\n\n5 Discussion and conclusion\n\nIn this paper we advanced the state-of-the-art in solving the MPE problem for CCMRFs. With spe-\ncialized algorithms, consensus optimization offers far superior scalability. In our experiments the\ncomputational cost grew linearly with the number of potential functions and constraints. This is cru-\ncially important if models are to scale to the sizes of data now available. As we build bigger models,\nit will be important to understand the trade-off between speed and accuracy. The well-understood\ntheory of consensus optimization can help here. It is a major difference between our work and that\nof Broecheler et. al. [4], which used heuristics to solve the MPE problem by partitioning CCMRFs,\n\ufb01xing values of variables at the boundaries, solving relatively large subproblems with interior-point\nmethods, and repeating with different partitions. A direction for future work is studying how to\nenforce desired combinations of speed and accuracy when solving MPE problems.\nSuch work could have a broader impact for research on solving the MPE problem for MRFs using\ndecomposition-based approaches, which is an active area of research. Much work has studied dual\ndecomposition for solving relaxations of discrete MPE problems [15]. Martins et. al. [9], and Meshi\nand Globerson [10] recently studied using consensus optimization to solve convex relaxations of the\nMPE problem for discrete MRFs. They solved the problem for MRFs which induced subproblems\nwith closed-form solutions. 
Meshi and Globerson [10] also showed advantages of solving the dual\nof the relaxation and decoding the values of the discrete primal variables, but such an approach\nis not applicable to our work. Other recent approaches include that of Ravikumar et al. [16], an\nalgorithm for solving a relaxed MPE problem by solving a sequence of subproblems in a process\ncalled proximal minimization.\nThere are a number of remaining research problems. The first is to expand the number of unknowns\nin subproblems that can be solved in closed form. Another is analyzing the Karush-Kuhn-Tucker\noptimality conditions for the subproblems to eliminate variables when possible and solve them more\nefficiently. While all (hinge-loss) CCMRF subproblems could be solved with a general-purpose\nalgorithm, such as an interior-point method, we showed that even in cases when an algorithm might\nhave to resort to an interior-point method, exploiting opportunities for closed-form solutions greatly\nimproved speed.\n\nAcknowledgments\n\nThe authors would like to thank Neal Parikh and the anonymous reviewers for their helpful sugges-\ntions. This material is based upon work supported by the National Science Foundation under Grant\nNo. 0937094, the Department of Energy under Grant No. DESC0002218, and the Intelligence Ad-\nvanced Research Projects Activity (IARPA) via Department of Interior National Business Center\n(DoI/NBC) contract number D12PC00337. The U.S. Government is authorized to reproduce and\ndistribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.\nDisclaimer: The views and conclusions contained herein are those of the authors and should not\nbe interpreted as necessarily representing the official policies or endorsements, either expressed or\nimplied, of IARPA, DoI/NBC, or the U.S. Government.\n\nReferences\n[1] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The\nMIT Press, 2009.\n\n[2] M. 
Broecheler and L. Getoor. Computing marginal distributions over continuous Markov networks\nfor statistical relational learning. In Advances in Neural Information Processing Systems\n(NIPS), 2010.\n\n[3] M. Broecheler, L. Mihalkova, and L. Getoor. Probabilistic similarity logic. In Proceedings of\nthe 26th Conference on Uncertainty in Artificial Intelligence (UAI), 2010.\n\n[4] M. Broecheler, P. Shakarian, and V. S. Subrahmanian. A scalable framework for modeling\ncompetitive diffusion in social networks. In Proceedings of the Second International Conference\non Social Computing (SocialCom), 2010.\n\n[5] S. H. Bach, M. Broecheler, S. Kok, and L. Getoor. Decision-driven models with probabilistic\nsoft logic. In NIPS Workshop on Predictive Models in Personalized Medicine, 2010.\n\n[6] B. Huang, A. Kimmig, L. Getoor, and J. Golbeck. Probabilistic soft logic for trust analysis\nin social networks. In International Workshop on Statistical Relational Artificial Intelligence\n(StaRAI), 2012.\n\n[7] B. Huang, S. H. Bach, E. Norris, J. Pujara, and L. Getoor. Social group modeling with probabilistic\nsoft logic. In NIPS Workshop on Social Network and Social Media Analysis: Methods,\nModels, and Applications, 2012.\n\n[8] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimization and Statistical\nLearning via the Alternating Direction Method of Multipliers. Now Publishers, 2011.\n\n[9] A. Martins, M. Figueiredo, P. Aguiar, N. Smith, and E. Xing. An augmented Lagrangian\napproach to constrained MAP inference. In Proceedings of the 28th International Conference\non Machine Learning (ICML), 2011.\n\n[10] O. Meshi and A. Globerson. An alternating direction method for dual MAP LP relaxation. In\nProceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery\nin Databases (ECML), 2011.\n\n[11] R. Glowinski and A. Marrocco. 
Sur l\u2019approximation, par \u00e9l\u00e9ments finis d\u2019ordre un, et la\nr\u00e9solution, par p\u00e9nalisation-dualit\u00e9, d\u2019une classe de probl\u00e8mes de Dirichlet non lin\u00e9aires. Revue\nfran\u00e7aise d\u2019automatique, informatique, recherche op\u00e9rationnelle, 9(2):41\u201376, 1975.\n\n[12] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems\nvia finite element approximation. Computers & Mathematics with Applications, 2(1):17\u201340,\n1976.\n\n[13] D. Gabay. Applications of the method of multipliers to variational inequalities, volume 15,\nchapter 9, pages 299\u2013331. Elsevier, 1983.\n\n[14] J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and the proximal\npoint algorithm for maximal monotone operators. Math. Program., 55(3):293\u2013318, 1992.\n\n[15] D. Sontag, A. Globerson, and T. Jaakkola. Introduction to dual decomposition for inference,\nchapter 8, pages 219\u2013254. MIT Press, 2011.\n\n[16] P. Ravikumar, A. Agarwal, and M. J. Wainwright. Message-passing for graph-structured linear\nprograms: proximal methods and rounding schemes. Journal of Machine Learning Research,\n11:1043\u20131080, 2010.\n", "award": [], "sourceid": 4772, "authors": [{"given_name": "Stephen", "family_name": "Bach", "institution": null}, {"given_name": "Matthias", "family_name": "Broecheler", "institution": null}, {"given_name": "Lise", "family_name": "Getoor", "institution": null}, {"given_name": "Dianne", "family_name": "O'leary", "institution": null}]}