{"title": "M-Best-Diverse Labelings for Submodular Energies and Beyond", "book": "Advances in Neural Information Processing Systems", "page_first": 613, "page_last": 621, "abstract": "We consider the problem of finding M best diverse solutions of energy minimization problems for graphical models. Contrary to the sequential method of Batra et al., which greedily finds one solution after another, we infer all $M$ solutions jointly. It was shown recently that such jointly inferred labelings not only have smaller total energy but also qualitatively outperform the sequentially obtained ones. The only obstacle for using this new technique is the complexity of the corresponding inference problem, since it is considerably slower algorithm than the method of Batra et al. In this work we show that the joint inference of $M$ best diverse solutions can be formulated as a submodular energy minimization if the original MAP-inference problem is submodular, hence fast inference techniques can be used. In addition to the theoretical results we provide practical algorithms that outperform the current state-of-the art and can be used in both submodular and non-submodular case.", "full_text": "M-Best-Diverse Labelings\n\nfor Submodular Energies and Beyond\n\nAlexander Kirillov1\n\nDmitrij Schlesinger1\n\nDmitry Vetrov2\n\nCarsten Rother1\n\nBogdan Savchynskyy1\n\n1 TU Dresden, Dresden, Germany\n\n2 Skoltech, Moscow, Russia\n\nalexander.kirillov@tu-dresden.de\n\nAbstract\n\nWe consider the problem of \ufb01nding M best diverse solutions of energy minimiza-\ntion problems for graphical models. Contrary to the sequential method of Batra\net al., which greedily \ufb01nds one solution after another, we infer all M solutions\njointly. It was shown recently that such jointly inferred labelings not only have\nsmaller total energy but also qualitatively outperform the sequentially obtained\nones. 
The only obstacle to using this new technique is the complexity of the corresponding inference problem, whose solver is considerably slower than the method of Batra et al. In this work we show that the joint inference of M best diverse solutions can be formulated as a submodular energy minimization if the original MAP-inference problem is submodular, hence fast inference techniques can be used. In addition to the theoretical results we provide practical algorithms that outperform the current state-of-the-art and can be used in both the submodular and non-submodular cases.\n\n1\n\nIntroduction\n\nA variety of tasks in machine learning can be formulated as an energy minimization problem, also known as maximum a posteriori (MAP) or maximum likelihood estimation (MLE) inference in undirected graphical models (related to Markov or conditional random fields). Its modeling power and importance are well-recognized, which has resulted in specialized benchmarks, e.g. [18], and computational challenges [8] for its solvers. This underlines the importance of finding the most probable solution. Following [3] and [25] we argue, however, that finding M > 1 diverse configurations with low energies is also of importance in a number of scenarios, such as: (a) Expressing uncertainty of the found solution [27]; (b) Faster training of model parameters [14]; (c) Ranking of inference results [32]; (d) Empirical risk minimization [26].\nWe build on the new formulation for finding M-best-diverse configurations, which was recently proposed in [19]. In this formulation all M configurations are inferred jointly, contrary to the established method [3], where a sequential greedy procedure is used. As shown in [19], the new formulation not only reliably produces configurations with lower total energy, but also leads to better results in several application scenarios. 
In particular, for the image segmentation scenario the results of [19] significantly outperform those of [3]. This is true even when [19] uses a plain Hamming distance as a diversity measure and [3] uses more powerful diversity measures.\nOur contributions.\n\n\u2022 We show that finding M-best-diverse configurations of a binary submodular energy minimization problem can be formulated as a submodular MAP-inference problem, and hence can be solved efficiently for any node-wise diversity measure.\n\n\u2022 We show that for certain diversity measures, such as e.g. the Hamming distance, finding the M-best-diverse configurations of a multilabel submodular energy minimization problem can be formulated as a submodular MAP-inference problem, which also implies applicability of efficient graph cut-based solvers.\n\n\u2022 We give the insight that if the MAP-inference problem is submodular then the M-best-diverse configurations can always be fully ordered with respect to the natural partial order induced in the space of all configurations.\n\n\u2022 We show experimentally that if the MAP-inference problem is submodular, we are quantitatively at least as good as [19] and considerably better than [3]. The main advantage of our method is a major speed-up over [19], by up to two orders of magnitude. Our method has the same order of magnitude run-time as [3]. In the non-submodular case our results are slightly inferior to [19], but the advantage in speed still holds.\n\nThis project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 647769). D. Vetrov was supported by RFBR proj. (No. 15-31-20596) and by Microsoft (RPD 1053945).\n\n1\n\n\fRelated work. 
The importance of the considered problem is underlined by the fact that a procedure for computing the M best solutions to discrete optimization problems was proposed as early as 1972 [23]. Later, more efficient specialized procedures were introduced for MAP-inference on a tree [29, Ch. 8], on junction-trees [24] and for general graphical models [33, 12, 2]. Such methods are, however, not suited for scenarios where diversity of the solutions is required (as in machine translation, search engines, or producing M best hypotheses in cascaded algorithms), since they do not enforce it explicitly.\nStructured Determinantal Point Processes [22] are a tool to model probabilistic distributions over structured models. Unfortunately an efficient sampling procedure is feasible for tree-structured graphical models only. The recently proposed algorithm [7] to find the M best modes of a distribution is limited to the same narrow class of problems.\nTraining M independent graphical models to produce diverse solutions was proposed in [13, 15]. In contrast, we assume a single fixed model supporting reasonable MAP-solutions.\nAlong with [3], the work most related to ours is the recent paper [25], which proposes a subclass of new diversity penalties for which the greedy nature of the algorithm [3] can be substantiated due to submodularity of the used diversity measures. In contrast to [25] we do not limit ourselves to diversity measures fulfilling such properties; moreover, we define a class of problems for which our joint inference approach leads to polynomially solvable problems that are efficient in practice.\nWe build on top of the work [19], which is explained in detail in Section 2.\nOrganization of the paper. Section 2 provides the background necessary for the formulation of our results: energy minimization for graphical models and existing approaches to obtain diverse solutions. 
In Section 3 we introduce submodularity for graphical models and formulate the main results of our work. Finally, Sections 4 and 5 are devoted to the experimental evaluation of our technique and to conclusions. The supplementary material contains proofs of all mathematical claims and the concurrent submission [19].\n\n2 Preliminaries\n\nEnergy minimization. Let $2^A$ denote the powerset of a set $A$. The pair $G = (V, F)$ is called a hyper-graph and has $V$ as a finite set of variable nodes and $F \subseteq 2^V$ as a set of factors. Each variable node $v \in V$ is associated with a variable $y_v$ taking its values in a finite set of labels $L_v$. The set $L_A = \prod_{v \in A} L_v$ denotes the Cartesian product of the sets of labels corresponding to the subset $A \subseteq V$ of variables. Functions $\theta_f : L_f \to \mathbb{R}$, associated with factors $f \in F$, are called potentials and define local costs on values of variables and their combinations. Potentials $\theta_f$ with $|f| = 1$ are called unary, with $|f| = 2$ pairwise, and with $|f| > 2$ higher order. The set $\{\theta_f : f \in F\}$ of all potentials is referred to as $\theta$. For any factor $f \in F$ the corresponding set of variables $\{y_v : v \in f\}$ will be denoted by $y_f$. The energy minimization problem consists of finding a labeling $y^* = \{y_v : v \in V\} \in L_V$ which minimizes the total sum of the corresponding potentials, cf. (1) below. Problem (1) is also known as MAP-inference. 
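As a toy illustration of problem (1), a minimal pairwise model can be minimized by brute force; the chain, its potentials and all names below are made up for illustration (practical solvers rely on min-cut/max-flow or message passing instead):

```python
import itertools

# Brute-force MAP-inference (Eq. 1) for a tiny binary chain model.
# All potentials are illustrative only; real models come from the application.
NODES, LABELS = 3, (0, 1)
unary = [[0.0, 0.4], [0.3, 0.1], [0.5, 0.0]]      # theta_v(y_v)

def energy(y):
    """E(y) = sum of unary terms plus Potts-like pairwise terms."""
    e = sum(unary[v][y[v]] for v in range(NODES))
    e += sum(0.2 for v in range(NODES - 1) if y[v] != y[v + 1])
    return e

# Enumerate all |L|^|V| labelings and keep the cheapest one.
map_labeling = min(itertools.product(LABELS, repeat=NODES), key=energy)
```

With the toy numbers above the minimizer is a specific labeling of the three nodes; on realistic grids this exhaustive search is of course replaced by the graph-cut reductions discussed later.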
$y^* = \arg\min_{y \in L_V} E(y) = \arg\min_{y \in L_V} \sum_{f \in F} \theta_f(y_f)$.   (1)\n\nA labeling $y^*$ satisfying (1) will later be called a solution of the energy-minimization or MAP-inference problem, shortly a MAP-labeling or MAP-solution.\n\n2\n\n\f(a) General diversity measure   (b) Node-wise diversity measure   (c) Hamming distance diversity\n\nFigure 1: Examples of factor graphs for 3 diverse solutions of the original MRF (1) with different diversity measures. The circles represent nodes of the original model, which are copied 3 times. For clarity, the diversity factors of order higher than 2 are shown as squares. Pairwise factors are depicted by edges connecting the nodes. We omit $\lambda$ for readability. (a) The most general diversity measure (4); (b) the node-wise diversity measure (6); (c) the Hamming distance as a diversity measure (5).\n\nFinally, a model is defined by the triple $(G, L_V, \theta)$, i.e. the underlying hyper-graph, the sets of labels and the potentials.\nIn the following, we use brackets to distinguish between an upper index and a power, i.e. $(A)^n$ means the n-th power of A, whereas n is an upper index in the expression $A^n$. We keep, however, the standard notation $\mathbb{R}^n$ for the n-dimensional vector space.\nSequential computation of M best diverse solutions [3]. Instead of looking for a single labeling with the lowest energy, one might ask for a set of labelings with low energies, yet significantly different from each other. In order to find such M diverse labelings $y^1, \dots, y^M$, the method proposed in [3] solves a sequence of problems of the form\n\n$y^m = \arg\min_{y} \big[ E(y) - \lambda \sum_{i=1}^{m-1} \Delta(y, y^i) \big]$   (2)\n\nfor $m = 1, 2, \dots, M$, where $\lambda > 0$ determines a trade-off between diversity and energy, $y^1$ is the MAP-solution, and the function $\Delta : L_V \times L_V \to \mathbb{R}$ defines the diversity of two labelings. 
In other words, $\Delta(y, y')$ takes a large value if $y$ and $y'$ are diverse, in a certain sense, and a small value otherwise. This problem can be seen as an energy minimization problem in which, in addition to the initial potentials $\theta$, the potentials $-\lambda \Delta(\cdot, y^i)$, associated with an additional factor $V$, are used. In the simplest and most commonly used form, $\Delta(y, y')$ is represented by a sum of node-wise diversity measures $\Delta_v : L_v \times L_v \to \mathbb{R}$,\n\n$\Delta(y, y') = \sum_{v \in V} \Delta_v(y_v, y'_v)$,   (3)\n\nand the potentials split into a sum of unary potentials, i.e. those associated with the additional factors $\{v\}$, $v \in V$. This implies that if efficient graph-cut based inference methods (including $\alpha$-expansion [6], $\alpha$-$\beta$-swap [6] or their generalizations [1, 10]) are applicable to the initial problem (1), then they remain applicable to the augmented problem (2), which assures the efficiency of the method.\nJoint computation of M-best-diverse labelings. The notation $f^M(\{y\})$ will be used as a shortcut for $f^M(y^1, \dots, y^M)$, for any function $f^M : (L_V)^M \to \mathbb{R}$. Instead of the greedy sequential procedure (2), in [19] it was suggested to infer all M labelings jointly, by minimizing\n\n$E^M(\{y\}) = \sum_{i=1}^{M} E(y^i) - \lambda \Delta^M(\{y\})$   (4)\n\nfor $y^1, \dots, y^M$ and some $\lambda > 0$. The function $\Delta^M$ defines the total diversity of any M labelings. It was shown in [19] that the M labelings obtained according to (4) both have lower total energy $\sum_{i=1}^{M} E(y^i)$ and are better from the applied point of view than those obtained by the sequential method (2). Hence we build on the formulation (4) in this work.\n\n3\n\n\f
Though expression (4) looks complicated, it can be nicely represented in the form (1) and hence constitutes an energy minimization problem. To achieve this, one creates M copies $(G^i, L^i_V, \theta^i) = (G, L_V, \theta)$ of the initial model $(G, L_V, \theta)$. The hyper-graph $G^M_1 = (V^M_1, F^M_1)$ for the new task is defined as follows. The set of nodes in the new graph is the union of the node sets of the considered copies, $V^M_1 = \bigcup_{i=1}^{M} V^i$. The factors are $F^M_1 = \bigcup_{i=1}^{M} F^i \cup \{V^M_1\}$, i.e. again the union of the initial ones, extended by a special factor corresponding to the diversity penalty, which depends on all nodes of the new graph. Each node $v \in V^i$ is associated with the label set $L^i_v = L_v$. The corresponding potentials $\theta^M_1$ are defined as $\{-\lambda \Delta^M, \theta^1, \dots, \theta^M\}$; see Fig. 1a for an illustration. The model $(G^M_1, L_{V^M_1}, \theta^M_1)$ corresponds to the energy (4). An optimal M-tuple of labelings, corresponding to a minimum of (4), is a trade-off between low energies of the individual labelings $y^i$ and their total diversity.\nComplexity of the diversity problem (4). Though the formulation (4) leads to better results than those of (2), minimizing $E^M$ is computationally demanding even if the original energy E can easily be (approximately) optimized. This is due to the intrinsic repulsive structure of the diversity potentials $-\lambda \Delta^M$: according to the intuitive meaning of diversity, similar labels are penalized more than different ones. Consider the simplest case, with the Hamming distance applied node-wise as a diversity measure:\n\n$\Delta^M(\{y\}) = \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \sum_{v \in V} \Delta_v(y^i_v, y^j_v)$, where $\Delta_v(y, y') = [\![y \neq y']\!]$.   (5)\n\nHere the expression $[\![A]\!]$ equals 1 if A is true and 0 otherwise. The corresponding factor graph is sketched in Fig. 1c. 
Such potentials cannot be optimized with efficient graph-cut based methods; moreover, as shown in [19], the bounds delivered by solvers based on the LP-relaxation [31] are very loose in practice. Indeed, solutions delivered by such solvers are significantly inferior even to the results of the sequential method (2).\nTo cope with this issue, a clique encoding representation of (4) was proposed in [19]. In this representation the M-tuples of labels $y^1_v, \dots, y^M_v$ (in the M nodes corresponding to a single initial node v) are considered as the new labels. In this way the difficult diversity factors are incorporated into the unary factors of the new representation, and the pairwise factors are adjusted respectively. This allows one to (approximately) solve problem (4) with graph-cut based techniques whenever those techniques are applicable to the energy E of a single labeling. The disadvantage of the clique encoding representation is the exponential growth of the label space, which is reflected in a significantly higher inference time for problem (4) compared to the procedure (2). In what follows, we show an alternative transformation of problem (4) which (i) does not have this drawback (its size is basically the same as that of (4)) and (ii) allows to solve (4) exactly in case the energy E is submodular.\nNode-wise diversity. In what follows we mainly consider node-wise diversity measures, i.e. those which can be represented in the form\n\n$\Delta^M(\{y\}) = \sum_{v \in V} \Delta^M_v(\{y\}_v)$   (6)\n\nfor some node diversity measures $\Delta^M_v : (L_v)^M \to \mathbb{R}$; see Fig. 1b for an illustration.\n\n3 M-Best-Diverse Labelings for Submodular Problems\n\nSubmodularity. In what follows we assume that the sets $L_v$, $v \in V$, of labels are completely ordered. This implies that for any $s, t \in L_v$ their maximum and minimum, denoted by $s \vee t$ and $s \wedge t$ respectively, are well-defined. 
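Given such a total order on the labels, whether a concrete pairwise potential satisfies the submodularity inequality introduced next (see (7)) can be checked exhaustively on a small label set. A sketch with made-up potentials; note that the multilabel Potts term, which reappears in Section 4.2, fails the check, while the binary Potts and total-variation terms pass:

```python
import itertools

# Exhaustive check of pairwise submodularity (Eq. 7):
#   theta(s1,t1) + theta(s2,t2) >= theta(s1 v s2, t1 v t2) + theta(s1 ^ s2, t1 ^ t2)
# for all pairs of labelings of a single edge, under the label order.
def is_submodular(theta, labels):
    for (s1, t1), (s2, t2) in itertools.product(
            itertools.product(labels, repeat=2), repeat=2):
        join = theta(max(s1, s2), max(t1, t2))   # coordinate-wise maximum
        meet = theta(min(s1, s2), min(t1, t2))   # coordinate-wise minimum
        if theta(s1, t1) + theta(s2, t2) < join + meet - 1e-12:
            return False
    return True

potts = lambda a, b: float(a != b)
tv    = lambda a, b: float(abs(a - b))    # total variation (linear) term

binary_potts_ok     = is_submodular(potts, [0, 1])
multilabel_potts_ok = is_submodular(potts, [0, 1, 2])
tv_ok               = is_submodular(tv, [0, 1, 2, 3])
```

The binary Potts check reduces to the footnoted inequality θ(0,1)+θ(1,0) ≥ θ(0,0)+θ(1,1), while for three or more labels the Potts term violates (7), which is why the multilabel experiments of Section 4.2 need the workarounds described there.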
Similarly, let $y^1 \vee y^2$ and $y^1 \wedge y^2$ denote the node-wise maximum and minimum of any two labelings $y^1, y^2 \in L_A$, $A \subseteq V$. A potential $\theta_f$ is called submodular if for any two labelings $y^1, y^2 \in L_f$ it holds that\n\n$\theta_f(y^1) + \theta_f(y^2) \ge \theta_f(y^1 \vee y^2) + \theta_f(y^1 \wedge y^2)$.   (7)\n\nA potential $\theta$ will be called supermodular if $(-\theta)$ is submodular. (Footnote 1: Pairwise binary potentials satisfying $\theta_f(0,1) + \theta_f(1,0) \ge \theta_f(0,0) + \theta_f(1,1)$ form an important special case of this definition.)\n\n4\n\n\fThe energy E is called submodular if for any two labelings $y^1, y^2 \in L_V$ it holds that\n\n$E(y^1) + E(y^2) \ge E(y^1 \vee y^2) + E(y^1 \wedge y^2)$.   (8)\n\nSubmodularity of the energy trivially follows from the submodularity of all its non-unary potentials $\theta_f$, $f \in F$, $|f| > 1$. In the pairwise case the converse also holds: submodularity of the energy implies submodularity of all its (pairwise) potentials (e.g. [31, Thm. 12]). There are efficient methods for solving energy minimization problems with submodular potentials, based on a transformation into a min-cut/max-flow problem [21, 28, 16] in case all potentials are either unary or pairwise, or into a submodular max-flow problem in the higher-order case [20, 10, 1].\nOrdered M solutions. In what follows we write $z^1 \le z^2$ for any two vectors $z^1$ and $z^2$, meaning that the inequality holds coordinate-wise.\nFor an arbitrary set A we call a function $f : (A)^n \to \mathbb{R}$ of n variables permutation invariant if for any $(x_1, \dots, x_n) \in (A)^n$ and any permutation $\pi$ it holds that $f(x_1, \dots, x_n) = f(x_{\pi(1)}, \dots, x_{\pi(n)})$. In what follows we consider mainly permutation invariant diversity measures.\nConsider two arbitrary labelings $y^1, y^2 \in L_V$ and their node-wise minimum $y^1 \wedge y^2$ and maximum $y^1 \vee y^2$. Since $(y^1_v \wedge y^2_v, y^1_v \vee y^2_v)$ is either equal to $(y^1_v, y^2_v)$ or to $(y^2_v, y^1_v)$ for any $v$, for any permutation invariant node diversity measure it holds that $\Delta^2_v(y^1_v \wedge y^2_v, y^1_v \vee y^2_v) = \Delta^2_v(y^1_v, y^2_v)$. This in its turn implies $\Delta^2(y^1 \wedge y^2, y^1 \vee y^2) = \Delta^2(y^1, y^2)$ for any node-wise diversity measure of the form (6). If E is submodular, then from (8) it additionally follows that\n\n$E^2(y^1 \wedge y^2, y^1 \vee y^2) \le E^2(y^1, y^2)$,   (9)\n\nwhere $E^2$ is defined as in (4). Note that $(y^1 \wedge y^2) \le (y^1 \vee y^2)$. Generalizing these considerations to M labelings one obtains\nTheorem 1. Let E be submodular and $\Delta^M$ be a node-wise diversity measure with each component $\Delta^M_v$ being permutation invariant. Then there exists an ordered M-tuple $(y^1, \dots, y^M)$, $y^i \le y^j$ for $1 \le i < j \le M$, such that for any $(z^1, \dots, z^M) \in (L_V)^M$ it holds that\n\n$E^M(\{y\}) \le E^M(\{z\})$,   (10)\n\nwhere $E^M$ is defined as in (4).\nTheorem 1 in particular claims that in the binary case $L_v = \{0, 1\}$, $v \in V$, the optimal M labelings define nested subsets of the nodes corresponding to label 1.\nSubmodular formulation of the M-best-diverse problem. Due to Theorem 1, for submodular energies and node-wise diversity measures it is sufficient to consider only ordered M-tuples of labelings. This order can be enforced by modifying the diversity measure accordingly:\n\n$\hat{\Delta}^M_v(y^1, \dots, y^M) := \Delta^M_v(y^1, \dots, y^M)$ if $y^1 \le y^2 \le \dots \le y^M$, and $-\infty$ otherwise,   (11)\n\nand using it instead of the initial measure $\Delta^M_v$. Note that $\hat{\Delta}^M_v$ is not permutation invariant. In practice one can use sufficiently big numbers in place of $\infty$ in (11). This implies\nLemma 1. Let E be submodular and $\Delta^M$ be a node-wise diversity measure with each component $\Delta^M_v$ being permutation invariant. Then any solution of the ordering enforcing M-best-diverse problem\n\n$\hat{E}^M(\{y\}) = \sum_{i=1}^{M} E(y^i) - \lambda \sum_{v \in V} \hat{\Delta}^M_v(y^1_v, \dots, y^M_v)$   (12)\n\nis a solution of the corresponding M-best-diverse problem (4)\n\n$E^M(\{y\}) = \sum_{i=1}^{M} E(y^i) - \lambda \sum_{v \in V} \Delta^M_v(y^1_v, \dots, y^M_v)$,   (13)\n\nwhere $\hat{\Delta}^M_v$ and $\Delta^M_v$ are related by (11).\nWe will say that a vector $(y^1, \dots, y^M) \in (L_v)^M$ is ordered if it holds that $y^1 \le y^2 \le \dots \le y^M$.\n\n5\n\n\fGiven submodularity of E, the submodularity (and hence the solvability) of $E^M$ in (13) would trivially follow from supermodularity of $\Delta^M$. However, there hardly exist supermodular diversity measures. The ordering provided by Theorem 1 and the corresponding form of the ordering-enforcing diversity measure $\hat{\Delta}^M$ significantly weaken this condition, which is precisely stated by the following lemma. In the lemma we substitute the $\infty$ of (11) with sufficiently big values, such as $C_\infty \ge \max_{\{y\}} E^M(\{y\})$, for the sake of numerical implementation. Moreover, these values differ from each other to keep $\hat{\Delta}^M_v$ supermodular.\nLemma 2. Let for any two ordered vectors $y = (y^1, \dots, y^M) \in (L_v)^M$ and $z = (z^1, \dots, z^M) \in (L_v)^M$ it hold that\n\n$\Delta_v(y \vee z) + \Delta_v(y \wedge z) \ge \Delta_v(y) + \Delta_v(z)$,   (14)\n\nwhere $y \vee z$ and $y \wedge z$ are the element-wise maximum and minimum respectively. Then $\hat{\Delta}_v$, defined as\n\n$\hat{\Delta}_v(y^1, \dots, y^M) = \Delta_v(y^1, \dots, y^M) - C_\infty \cdot \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \big( 3^{\max(0,\, y^i - y^j)} - 1 \big)$,   (15)\n\nis supermodular.\nNote that eqs. (11) and (15) are the same up to the infinity values in (11). Though condition (14) resembles the supermodularity condition, it has to be fulfilled for ordered vectors only. The following corollaries of Lemma 2 give the two most important examples of diversity measures fulfilling (14).\nCorollary 1. Let $|L_v| = 2$ for all $v \in V$. Then the statement of Lemma 2 holds for an arbitrary $\Delta_v : (L_v)^M \to \mathbb{R}$.\nCorollary 2. Let $\Delta^M_v(y^1, \dots, y^M) = \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \Delta^{ij}(y^i, y^j)$. Then the condition of Lemma 2 is equivalent to\n\n$\Delta^{ij}(y^i, y^j) + \Delta^{ij}(y^i + 1, y^j + 1) \ge \Delta^{ij}(y^i + 1, y^j) + \Delta^{ij}(y^i, y^j + 1)$ for $y^i < y^j$ and $1 \le i < j \le M$.   (16)\n\nIn particular, condition (16) is satisfied for the Hamming distance $\Delta^{ij}(y, y') = [\![y \neq y']\!]$.\nThe following theorem trivially summarizes Lemmas 1 and 2:\nTheorem 2. Let the energy E and the diversity measure $\Delta^M$ satisfy the conditions of Lemmas 1 and 2. Then the ordering enforcing problem (12) delivers a solution to the M-best-diverse problem (13) and is submodular. Moreover, submodularity of all non-unary potentials of the energy E implies submodularity of all non-unary potentials of the ordering enforcing energy $\hat{E}^M$.\n\n4 Experimental evaluation\n\nWe have tested our algorithms in two application scenarios: (a) interactive foreground/background image segmentation, where annotation is available in the form of scribbles [3], and (b) category level segmentation on the PASCAL VOC 2012 data [9].\nAs baselines we use: (i) the sequential method DivMBest (2) proposed in [3, 25] and (ii) the clique-encoding CE method [19] for an (approximate) joint computation of M-best-diverse labelings. 
As mentioned in Section 2, this method addresses the energy $E^M$ defined in (4); however, it has the disadvantage that its label space grows exponentially with M.\nOur method, which solves problem (12) with the Hamming diversity measure (5) by transforming it into a min-cut/max-flow problem [21, 28, 16] and running the solver [5], is denoted as Joint-DivMBest.\nThe diversity measures used in the experiments are: the Hamming distance (5) HD, Label Cost LC, Label Transitions LT and Hamming Ball HB. The last three measures are higher order diversity potentials introduced in [25] and used only in connection with the DivMBest algorithm. If not stated otherwise, the Hamming distance (5) is used as the diversity measure. Both the clique encoding (CE) based approaches and the submodularity-based methods proposed in this work use only the Hamming distance as a diversity measure.\nAs [25] suggests, certain combinations of different diversity measures may lead to better results. To denote such combinations, the signs \u2297 and \u2295 were used in [25]. We refer to [25] for a detailed description of this notation and treat such combined methods as black boxes in our comparison.\n\n6\n\n\f                 |    M=2        |    M=6        |    M=10\n                 | quality  time | quality  time | quality  time\nDivMBest         |  93.16   0.45 |  95.02   2.4  |  95.16     4.4\nCE               |  95.13   2.9  |  96.01  47.6  |  96.19  1247\nJoint-DivMBest   |  95.13   0.77 |  96.01   5.2  |  96.19    20.4\n\nTable 1: Interactive segmentation: per-pixel accuracies (quality) for the best segmentation out of M, and run-times. Compare to the average quality 91.57 of a single labeling. The Hamming distance is used as the diversity measure. The run-time is in milliseconds (ms). 
Joint-DivMBest quantitatively outperforms DivMBest, equals CE in quality, and is considerably faster than CE.\n\n4.1 Interactive segmentation\n\nInstead of returning a single segmentation corresponding to a MAP-solution, diversity methods provide the user with a small number of possible low-energy results based on the scribbles. Following [3] we model only the first iteration of such an interactive procedure, i.e. we consider the user scribbles as given and compare the sets of segmentations returned by the competing diversity methods.\nThe authors of [3] kindly provided us with their 50 graphical model instances, corresponding to the MAP-inference problem (1). They are based on a subset of the PASCAL VOC 2010 [9] segmentation challenge with manually added scribbles. The pairwise potentials constitute contrast sensitive Potts terms [4], which are submodular. This implies that (i) the MAP-inference is solvable by min-cut/max-flow algorithms [21] and (ii) Theorem 2 is applicable, so the M-best-diverse solutions can be found by reducing the ordering preserving problem (12) to min-cut/max-flow and applying the corresponding algorithm.\nA quantitative comparison and the run-times of the considered methods are provided in Table 1, where each method was used with the parameter $\lambda$ (see (2), (4)) optimally tuned via cross-validation. Following [3], as a quality measure we used the per-pixel accuracy of the best solution for each sample, averaged over all test images. The methods CE and Joint-DivMBest gave the same quality, which confirms the observation made in [19] that CE returns an exact MAP solution for each sample in this dataset. Combined methods with more sophisticated diversity measures return results that are either inferior to DivMBest or only negligibly improved ones, hence we omit them. The run-time provided is also averaged over all samples. 
The max-flow algorithm was used for DivMBest and Joint-DivMBest, and $\alpha$-expansion for CE.\nSummary. Joint-DivMBest qualitatively outperforms DivMBest and is equal to CE. However, it is considerably faster than the latter (the difference grows exponentially with M), and its run-time is of the same order of magnitude as that of DivMBest.\n\n4.2 Category level segmentation\n\nThe category level segmentation task of the PASCAL VOC 2012 challenge [9] contains 1449 validation images with known ground truth, which we used for the evaluation of diversity methods. Corresponding pairwise models with contrast sensitive Potts terms of the form $\theta_{uv}(y, y') = w_{uv}[\![y \neq y']\!]$, $uv \in F$, were used in [25] and kindly provided to us by the authors. Contrary to interactive segmentation, the label sets contain 21 elements and hence the respective MAP-inference problem (1) is no longer submodular. It can, however, still be approximately solved by $\alpha$-expansion or $\alpha$-$\beta$-swap.\nSince the MAP-inference problem (1) is not submodular in this experiment, Theorem 2 is not applicable. We used two ways to overcome this. First, we modified the diversity potentials according to (15), as if Theorem 2 were applicable. This basically means we explicitly looked for ordered M best diverse labelings. The resulting inference problem was addressed with $\alpha$-$\beta$-swap (since neither the max-flow nor the $\alpha$-expansion algorithm is applicable). We refer to this method as Joint-DivMBest-ordered. The second way to overcome the non-submodularity problem is based on learning. Using the structured SVM technique, we trained pairwise potentials with additional constraints enforcing their submodularity, as done e.g. in [11]. We kept the contrast terms $w_{uv}$ and learned only a single submodular function $\hat{\theta}(y, y')$, which we used in place of $[\![y \neq y']\!]$. After the learning, all our potentials had the form $\theta_{uv}(y, y') = w_{uv}\hat{\theta}(y, y')$, $uv \in F$. We refer to this method as Joint-DivMBest-learned. For this model we use max-flow [5] as an exact inference method and $\alpha$-expansion [4] as a fast approximate inference method.\n\n7\n\n\f                              | MAP inference        |    M=5        |    M=15        |    M=16\n                              |                      | quality  time | quality  time  | quality  time\nDivMBest                      | \u03b1-exp [4]            |  51.21   0.01 |  52.90   0.03  |  53.07   0.03\nHB*                           | HB-HOP-MAP [30]      |  51.71   -    |  55.32   -     |   -      -\nDivMBest*\u2295HB*                 | HB-HOP-MAP [30]      |   -      -    |  55.89   -     |   -      -\nHB*\u2297LC*\u2297LT*                   | LT - coop. cuts [17] |   -      -    |  56.97   -     |   -      -\nDivMBest*\u2297HB*\u2297LC*\u2297LT*        | LT - coop. cuts [17] |   -      -    |  57.39   -     |   -      -\nCE                            | \u03b1-exp [4]            |  54.22   733  |   -      -     |   -      -\nCE3                           | \u03b1-exp [4]            |  54.14   2.28 |  57.76   5.87  |  58.36   7.24\nJoint-DivMBest-ordered        | \u03b1-\u03b2-swap [4]         |  53.81   0.01 |  56.08   0.08  |  56.31   0.08\nJoint-DivMBest-learned        | max-flow [5]         |  53.85   0.38 |  56.14   35.47 |  56.33   38.67\nJoint-DivMBest-learned        | \u03b1-exp [4]            |  53.84   0.01 |  56.08   0.08  |  56.31   0.08\n\nTable 2: PASCAL VOC 2012. Intersection over union quality measure / running time for the best segmentation out of M. Compare to the average quality 43.51 of a single labeling. Time is in seconds (s). The entry '-' corresponds to the absence of a result, due to computational reasons or inapplicability of the method. The (*)-methods were not run by us; their results were taken directly from [25]. The MAP-inference column references the slowest inference technique among those used by the method.\n\nA quantitative comparison and the run-times of the considered methods are provided in Table 2, where each method was used with the parameter $\lambda$ (see (2), (4)) optimally tuned via cross-validation on the PASCAL VOC 2012 validation set. 
Following [3], we used the intersection-over-union quality measure, averaged over all images. Among the combined methods with higher-order diversity measures we selected only those providing the best results. The method CE3 [19] is a hybrid of DivMBest and CE, delivering a reasonable trade-off between running time and accuracy of inference for the model $E^M$ (4). Quantitative results delivered by Joint-DivMBest-ordered and Joint-DivMBest-learned are very similar (though the latter is negligibly better); they significantly outperform those of DivMBest and are only slightly inferior to those of CE3. However, the run-times of Joint-DivMBest-ordered and of the α-expansion version of Joint-DivMBest-learned are comparable to that of DivMBest and beat all other competitors, due to the use of fast inference algorithms and a linearly growing label space, in contrast to the label space of CE3, which grows as $(L_v)^3$. Though we do not know the exact run-times of the combined methods (where ⊕ and ⊗ are used), we expect them to be significantly higher than those of DivMBest and Joint-DivMBest-ordered because of the intrinsically slow MAP-inference techniques used. However, contrary to the latter, the inference in Joint-DivMBest-learned can be exact, due to submodularity of the underlying energy.

5 Conclusions

We have shown that submodularity of the MAP-inference problem implies a fully ordered set of M best diverse solutions, given a node-wise permutation-invariant diversity measure. Enforcing such an ordering leads to a submodular formulation of the joint M-best-diverse problem and implies its efficient solvability. Moreover, we have shown that even in non-submodular cases, when the MAP-inference is (approximately) solvable with efficient graph-cut based methods, enforcing this ordering leads to an M-best-diverse problem which is (approximately) solvable with graph-cut based methods as well.
In our test cases (and there are likely others), such an approximate technique leads to notably better results than those provided by the established sequential DivMBest technique [3], whereas its run-time remains comparable to the run-time of DivMBest and is much smaller than the run-times of the other competitors.

References

[1] C. Arora, S. Banerjee, P. Kalra, and S. Maheshwari. Generalized flows for optimal inference in higher order MRF-MAP. TPAMI, 2015.
[2] D. Batra. An efficient message-passing algorithm for the M-best MAP problem. arXiv:1210.4841, 2012.
[3] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse M-best solutions in Markov random fields. In ECCV, 2012.
[4] Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In ICCV, 2001.
[5] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. TPAMI, 26(9):1124–1137, 2004.
[6] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. TPAMI, 23(11):1222–1239, 2001.
[7] C. Chen, V. Kolmogorov, Y. Zhu, D. N. Metaxas, and C. H. Lampert. Computing the M most probable modes of a graphical model. In AISTATS, 2013.
[8] G. Elidan and A. Globerson. The probabilistic inference challenge (PIC2011).
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
[10] A. Fix, A. Gruber, E. Boros, and R. Zabih. A graph cut algorithm for higher-order Markov random fields. In ICCV, 2011.
[11] V. Franc and B. Savchynskyy. Discriminative learning of max-sum classifiers. JMLR, 9:67–104, 2008.
[12] M. Fromer and A. Globerson. An LP view of the M-best MAP problem. In NIPS 22, 2009.
[13] A. Guzman-Rivera, D. Batra, and P. Kohli. Multiple choice learning: Learning to produce multiple structured outputs. In NIPS 25, 2012.
[14] A. Guzman-Rivera, P. Kohli, and D. Batra. DivMCuts: Faster training of structural SVMs with diverse M-best cutting-planes. In AISTATS, 2013.
[15] A. Guzman-Rivera, P. Kohli, D. Batra, and R. A. Rutenbar. Efficiently enforcing diversity in multi-output structured prediction. In AISTATS, 2014.
[16] H. Ishikawa. Exact optimization for Markov random fields with convex priors. TPAMI, 2003.
[17] S. Jegelka and J. Bilmes. Submodularity beyond submodular energies: coupling edges in graph cuts. In CVPR, 2011.
[18] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, T. Kröger, J. Lellmann, N. Komodakis, B. Savchynskyy, and C. Rother. A comparative study of modern inference techniques for structured discrete energy minimization problems. IJCV, pages 1–30, 2015.
[19] A. Kirillov, B. Savchynskyy, D. Schlesinger, D. Vetrov, and C. Rother. Inferring M-best diverse labelings in a single one. In ICCV, 2015.
[20] V. Kolmogorov. Minimizing a sum of submodular functions. Discrete Applied Mathematics, 2012.
[21] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? TPAMI, 2004.
[22] A. Kulesza and B. Taskar. Structured determinantal point processes. In NIPS 23, 2010.
[23] E. L. Lawler. A procedure for computing the K best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 18(7), 1972.
[24] D. Nilsson. An efficient algorithm for finding the M most probable configurations in probabilistic expert systems. Statistics and Computing, 8(2):159–173, 1998.
[25] A. Prasad, S. Jegelka, and D. Batra. Submodular meets structured: Finding diverse subsets in exponentially-large structured item sets. In NIPS 27, 2014.
[26] V. Premachandran, D. Tarlow, and D. Batra. Empirical minimum Bayes risk prediction: How to extract an extra few % performance from vision models with just three more parameters. In CVPR, 2014.
[27] V. Ramakrishna and D. Batra. Mode-marginals: Expressing uncertainty via diverse M-best solutions. In NIPS Workshop on Perturbations, Optimization, and Statistics, 2012.
[28] D. Schlesinger and B. Flach. Transforming an arbitrary minsum problem into a binary one. TU Dresden, Fak. Informatik, 2006.
[29] M. I. Schlesinger and V. Hlavac. Ten lectures on statistical and structural pattern recognition, volume 24. Springer Science & Business Media, 2002.
[30] D. Tarlow, I. E. Givoni, and R. S. Zemel. HOP-MAP: Efficient message passing with high order potentials. In AISTATS, 2010.
[31] T. Werner. A linear programming approach to max-sum problem: A review. TPAMI, 29(7), 2007.
[32] P. Yadollahpour, D. Batra, and G. Shakhnarovich. Discriminative re-ranking of diverse segmentations. In CVPR, 2013.
[33] C. Yanover and Y. Weiss. Finding the M most probable configurations using loopy belief propagation. In NIPS 17, 2004.