{"title": "On fast approximate submodular minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 460, "page_last": 468, "abstract": "We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n^7 ) (O(n^5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies.", "full_text": "On fast approximate submodular minimization\n\nStefanie Jegelka\u2020, Hui Lin\u2217, Jeff Bilmes\u2217\n\njegelka@tuebingen.mgp.de,{hlin,bilmes}@ee.washington.edu\n\n\u2020 Max Planck Institute for Intelligent Systems, Tuebingen, Germany\n\n\u2217 University of Washington, Dept. of EE, Seattle, U.S.A.\n\nAbstract\n\nWe are motivated by an application to extract a representative subset of machine\nlearning training data and by the poor empirical performance we observe of the\npopular minimum norm algorithm. In fact, for our application, minimum norm can\nhave a running time of about O(n7) (O(n5) oracle calls). We therefore propose\na fast approximate method to minimize arbitrary submodular functions. For a\nlarge sub-class of submodular functions, the algorithm is exact. Other submodular\nfunctions are iteratively approximated by tight submodular upper bounds, and then\nrepeatedly optimized. 
We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies.\n\n1 Introduction\n\nSubmodularity has been and continues to be an important property in many fields. A set function f : 2^V → R defined on subsets of a finite ground set V is submodular if it satisfies the inequality f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T) for all S, T ⊆ V. Submodular functions include entropy, graph cuts (defined as a function of graph nodes), potentials in many Markov Random Fields [3], clustering objectives [23], covering functions (e.g., sensor placement objectives), and many more. One might consider submodular functions as being on the boundary between “efficiently”, i.e., polynomial-time, and “not efficiently” optimizable set functions. Submodularity is gaining importance in machine learning too, but many machine learning data sets are so large that mere “polynomial-time” efficiency is not enough. Indeed, the submodular function minimization (SFM) algorithms with proven polynomial running time are practical only for very small data sets. An alternative, often considered to be faster in practice, is the minimum-norm point algorithm [7]. Its worst-case running time however is still an open question.\n\nContrary to current wisdom, we demonstrate that for certain functions relevant in practice (see Section 1.1), the minimum-norm algorithm has an impractical empirical running time of about O(n^7), requiring about O(n^5) oracle function calls. To our knowledge, and interesting from an optimization perspective, this is worse than any results reported in the literature, where times of O(n^{3.3}) were obtained with simpler graph cut functions [22].\n\nSince we found the minimum-norm algorithm to be either slow (when accurate), or inaccurate (when fast), in this work we take a different approach. 
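As a concrete aside (not from the paper), the defining inequality f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T) can be verified by exhaustive enumeration on tiny ground sets; the functions and sizes below are illustrative only, and the brute force is exponential:

```python
from itertools import combinations

def is_submodular(f, V):
    """Brute-force check of f(S) + f(T) >= f(S | T) + f(S & T)
    over all pairs of subsets of the ground set V (exponential;
    only meant for tiny toy examples)."""
    subsets = [frozenset(c) for r in range(len(V) + 1)
               for c in combinations(sorted(V), r)]
    return all(f(S) + f(T) >= f(S | T) + f(S & T)
               for S in subsets for T in subsets)

# A truncated cardinality function min{|S|, k} -- the uniform-matroid
# rank that appears later in the paper -- satisfies the inequality:
k = 2
f = lambda S: min(len(S), k)
print(is_submodular(f, {1, 2, 3, 4}))   # True

# A supermodular counterexample |S|^2 fails it:
g = lambda S: len(S) ** 2
print(is_submodular(g, {1, 2, 3}))      # False
```

The check is purely a sanity test of the definition; none of the algorithms in the paper enumerate subsets this way.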
We view the SFM problem as an instance of a larger class of problems that includes NP-hard instances. This class admits approximation algorithms, and we apply those instead of an exact method. Contrary to the possibly poor performance of “exact” methods, our approximate method is fast, is exact for a large class of submodular functions, and approximates all other functions with bounded deviation.\n\nOur approach combines two ingredients: 1) the representation of functions by graphs; and 2) a recent generalization of graph cuts that combines edge-costs non-linearly. Representing functions as graph cuts is a popular basis for optimization, but cuts cannot efficiently represent all submodular functions. Contrary to previous constructions, including 2) leads to exact representations for any submodular function. To optimize an arbitrary submodular function f represented in our formalism, we construct a graph-representable tractable submodular upper bound ˆf that is tight at a given set T ⊆ V, i.e., ˆf(T) = f(T), and ˆf(S) ≥ f(S) for all S ⊆ V. We repeat this “submodular majorization” step and optimize, in at most a linear number of iterations. The resulting algorithm efficiently computes good approximate solutions for our motivating application and other difficult functions as well.\n\n1.1 Motivating application and the failure of the minimum-norm point algorithm\n\nOur motivating problem is how to empirically evaluate new or expensive algorithms on large data sets without spending an inordinate amount of time doing so [20, 21]. If a new idea ends up performing poorly, knowing this sooner will avoid futile work. Often the complexity of a training iteration is linear in the number of samples n but polynomial in the number c of classes or types. 
For example, for object recognition, it typically takes O(c^k) time to segment an image into regions that each correspond to one of c objects, using an MRF with non-submodular k-interaction potential functions. In speech recognition, moreover, a k-gram language model with size-c vocabulary has a complexity of O(c^k), where c is in the hundreds of thousands and k can be as large as six.\n\nTo reduce complexity one can reduce k, but this can be unsatisfactory since the novelty of the algorithm might entail this very cost. An alternative is to extract and use a subset of the training data, one with small c. We would want any such subset to possess the richness and intricacy of the original data while simultaneously ensuring that c is bounded.\n\nThis problem can be solved via SFM using the following Bipartite neighborhoods class of submodular functions: Define a bipartite graph H = (V, U, E, w) with left/right nodes V/U, and a modular weight function w : U → R+. A function is modular if w(U) = ∑_{u∈U} w(u). Let the neighborhood of a set S ⊆ V be N(S) = {u ∈ U : ∃ edge (i, u) ∈ E with i ∈ S}. Then f : 2^V → R+, defined as f(S) = ∑_{u∈N(S)} w(u), is non-decreasing submodular. This function class encompasses e.g. set covers of the form f(S) = |⋃_{i∈S} U_i| for sets U_i covered by element i. We say f is the submodular function induced by modular function w and graph H.\n\nLet U be the set of types in a set of training samples V. Moreover, let w measure the cost of a type u ∈ U (this corresponds e.g. to the “undesirability” of type u). Define also a modular function m : 2^V → R+, m(S) = ∑_{i∈S} m(i) as the benefit of training samples (e.g., in vision, m(i) is the number of different objects in an image i ∈ V, and in speech, this is the length of utterance i). 
Then the above optimization problem can be solved by finding argmin_{S⊆V} w(N(S)) − λ m(S) = argmin_{S⊆V} w(N(S)) + λ m(V \ S), where λ is a tradeoff coefficient. As shown below, this can be easily represented and solved efficiently via graph cuts. In some cases, however, we prefer to pick certain subclasses of U together. We partition U = U1 ∪ U2 into blocks, and make it beneficial to pick items from the same block. Benefit restricted to blocks can arise from non-negative non-decreasing submodular functions g : 2^U → R+ restricted to blocks. The resulting optimization problem is min_{S⊆V} ∑_i g(Ui ∩ N(S)) + λ m(V \ S); the sum over i expresses the obvious generalization to a partition into more than just two blocks. Unfortunately, this class of submodular functions is no longer representable by a bipartite graph, and general SFM must be used. With such a function, f(S) = m(S) + 100 √(w(N(S))), the empirical running time of the minimum norm point algorithm (MN) scales as O(n^7), with O(n^5) oracle calls (Figure 1). This rules out large data sets for our application, but is interesting with regard to the unknown complexity of MN.\n\nFigure 1: Running time of MN (axes: ground set size vs. CPU time, both as powers of 2; curves: min-norm and O(n^7)).\n\n1.2 Background on Algorithms for submodular function minimization (SFM)\n\nThe first polynomial algorithm for SFM was by Grötschel et al. [13], with further milestones being the first combinatorial algorithms [15, 27] ([22] contains a survey). The currently fastest strongly polynomial combinatorial algorithm has a running time of O(n^5 T + n^6) [24] (where T is function evaluation time), far from practical. Thus, the minimum-norm algorithm [7] is often the method of choice.\n\nLuckily, many sub-families of submodular functions permit specialized, faster algorithms. 
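The subset-extraction objective of Section 1.1, argmin_{S⊆V} w(N(S)) + λ m(V \ S), can be sanity-checked by brute force on a toy instance; the bipartite graph, weights, and λ below are made up for illustration (real instances are far too large for enumeration):

```python
from itertools import combinations

def neighborhood(S, E):
    """N(S): right-nodes of the bipartite graph adjacent to S."""
    return {u for (i, u) in E if i in S}

def objective(S, E, w, m, lam):
    """w(N(S)) + lam * m(V \\ S), the form minimized in the paper."""
    V = {i for (i, _) in E}
    return (sum(w[u] for u in neighborhood(S, E))
            + lam * sum(m[i] for i in V - S))

# Toy corpus: samples 1..3, types 'a'..'c' (hypothetical data).
E = [(1, 'a'), (2, 'a'), (2, 'b'), (3, 'c')]
w = {'a': 1.0, 'b': 4.0, 'c': 2.0}     # cost ("undesirability") of a type
m = {1: 1.0, 2: 2.0, 3: 1.0}           # benefit of a sample
V = {1, 2, 3}

best = min((frozenset(c) for r in range(len(V) + 1)
            for c in combinations(V, r)),
           key=lambda S: objective(S, E, w, m, 1.5))
print(sorted(best))   # [1]
```

Sample 1 covers only the cheap type 'a', so for this λ it is the unique optimum; the paper's point is that the same argmin can be found with graph cuts instead of enumeration.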
Graph cut functions fall into this category [1]. They have found numerous applications in computer vision [2, 12], begging the question as to which functions can be represented and minimized using graph cuts [9, 6, 31]. Živný et al. [32] show that cut representations are indeed limited: even when allowing exponentially many additional variables, not all submodular functions can be expressed as graph cuts. Moreover, to maintain efficiency, we do not wish to add too many auxiliary variables, i.e., graph nodes. Other specific cases of relatively efficient SFM include graphic matroids [25] and symmetric submodular functions, minimizable in cubic time [26].\n\nA further class of benign functions are those of the form f(S) = ψ(∑_{i∈S} w(i)) + m(S) for nonnegative weights w : V → R+, and certain concave functions ψ : R → R. Fujishige and Iwata [8] minimize such a function via a parametric max-flow, and we build on their results in Section 4. However, restrictions apply to the effective number of breakpoints of ψ. Stobbe and Krause [29] generalize this class to arbitrary concave functions and exploit Nesterov’s accelerated gradient descent. Whereas Fujishige and Iwata [8] decompose ψ as a minimum of modular functions, Stobbe and Krause [29] decompose it into a sum of truncated functions of the form f(A) = min{∑_{i∈A} w′(i), γ} — this class of functions, however, is also limited. Truncations are expressible by graph cuts, as we show in Figure 3(b). Thus, if truncations could express any submodular function, then so could graph cuts, contradicting the results in [32]. This was proven independently in [30]. 
Moreover, the formulation itself of some representable functions in terms of concave functions can be challenging. In this paper, by contrast, we propose a model that is exact for graph-representable functions, and yields an approximation for all other functions.\n\n2 Representing submodular functions by generalized graph cuts\n\nWe begin with the representation of a set function f : 2^V → R by a graph cut, and then extend this to submodular edge weights. Formally, f is graph-representable if there exists a graph G = (V ∪ U ∪ {s, t}, E) with terminal nodes s, t, one node for each element i in V, a set U of auxiliary nodes (U can be empty), and edge weights w : E → R+ such that, for any S ⊆ V:\n\nf(S) = min_{U⊆U} w(δ(s ∪ S ∪ U)) = min_{U⊆U} ∑_{e∈δs(S∪U)} w(e).   (1)\n\nFigure 2: max\n\nHere δ(S) is the set of edges leaving S, and δs(S) = δ({s} ∪ S). Recall that any minimal (s, t)-cut partitions the graph nodes into the set Ts ⊆ V ∪ U reachable from s and the set Tt = (V ∪ U) \ Ts disconnected from s. That means, f(S) equals the weight of the minimum (s, t)-cut that assigns S to Ts and V \ S to Tt, and the auxiliary nodes to achieve the minimum. The nodes in U act as auxiliary variables. As an illustrative example, Figure 2 represents the function f(S) = max_{i∈S} w(i) + ∑_{j∈V\S} m(j) for two elements V = {1, 2} and w(2) > w(1), using one auxiliary node u. For any query set S, u might be joined with S (u ∈ Ts) or not (u ∈ Tt). If S = {1}, then w(δs({1, u})) = m(2) + w(2), and w(δs({1})) = m(2) + w(1) = f(S) < w(δs({1, u})). If S = {1, 2}, then w(δs({1, 2, u})) = w(2) < w(δs({1, 2})) = w(1) + w(2), and indeed f(S) = w(2). The graph representation (1) leads to the equivalence between minimum cuts and the minimizers of f:\n\nLemma 1. 
Let S∗ be a minimizer of f, and let U∗ ∈ argmin_{U⊆U} w(δs(S∗ ∪ U)). Then the boundary δs(S∗ ∪ U∗) ⊆ E is a minimum cut in G.\n\nThe lemma (proven in [18]) is good news since minimum cuts can be computed efficiently. To derive S∗ from a minimum cut, recall that any minimum cut is the boundary of some set T∗_s ⊆ V ∪ U that is still reachable from s after cutting. Then S∗ = T∗_s ∩ V, so S∗ ⊆ T∗_s and (V \ S∗) ⊆ T∗_t. A large sub-family of submodular functions can be expressed exactly in the form (1), but possibly with an exponentially large U. For efficiency, the size of U should remain small. To express any submodular function with few auxiliary nodes, in this paper we extend Equation (1) as is seen below.\n\nUnless the submodular function f is already a graph cut function (and directly representable), we first decompose f into a modular function and a nondecreasing submodular function, and then build up the graph part by part. This accounts for any graph-representable component of f. To approximate the remaining component of the function that is not exactly representable, we use submodular costs on graph edges (in contrast with graph nodes), a construction that has been introduced recently in computer vision [16].\n\n(a) maximum  (b) truncation  (c) partition matroid  (d) bipartite  (e) bipartite & truncation  (f) basic submodular construction\n\nFigure 3: Example graph constructions. Dashed blue edges can have submodular weights; auxiliary nodes are white and ground set nodes are shaded. The bipartite graph can have arbitrary representations between U and t, 3(e) is one example. (All figures are best viewed in color.)\n\nWe first introduce a relevant decomposition result by Cunningham [4]. 
A polymatroid rank function is totally normalized if f(V \ i) = f(V) for all i ∈ V. The marginal costs are defined as ρf(i|S) = f(S ∪ {i}) − f(S) for all i ∈ V \ S.\n\nTheorem 1 ([4, Thm. 18]). Any submodular function f can be decomposed as f(S) = m(S) + g(S) into a modular function m and a totally normalized polymatroid rank function g. The components are defined as m(S) = ∑_{i∈S} ρf(i|V \ i) and g(S) = f(S) − m(S) for all S ⊆ V.\n\nWe may assume that m(i) < 0 for all i ∈ V. If m(i) ≥ 0 for any i ∈ V, then diminishing marginal costs, a property of submodular functions, imply that we can discard element i immediately [5, 18]. To express such negative costs in a graph cut, we point out an equivalent formulation with positive weights: since m(V) is constant, minimizing m(S) = ∑_{i∈S} m(i) is equivalent to minimizing the shifted function m(S) − m(V) = −m(V \ S). Thus, we instead minimize the sum of positive weights on the complement of the solution. We implement this shifted function in the graph by adding an edge (s, i) with nonnegative weight −m(i) for each i ∈ V. Every element j ∈ Tt (i.e., j ∉ S) that is not selected must be separated from s, and the edge (s, j) contributes −m(j) to the total cut cost.\n\nHaving constructed the modular part of the function f by edges (s, i) for all i ∈ V, we address its submodular part g. If g is a sum of functions, we can add a subgraph for each function. We begin with some example functions that are explicitly graph-representable with polynomially many auxiliary nodes U. The illustrations in Figure 3 include the modular part m as well.\n\nMaximum. The function g(S) = max_{i∈S} w(i) for nonnegative weights w is an extension of Figure 2. Without loss of generality, we assume the elements to be ordered by weight, so that w(1) ≤ w(2) ≤ . . . ≤ w(n). 
We introduce n−1 auxiliary nodes uj, and connect them to form an imbalanced tree with leaves V, as illustrated in Figure 3(a). The minimum way to disconnect a set S from t is to cut the single edge (u_{j−1}, u_j) with weight w(j) of the largest element j = argmax_{i∈S} w(i).\n\nTruncations. Truncated functions f(S) = min{w(S), γ} for w, γ ≥ 0 can be modeled by one extra variable, as shown in Figure 3(b). If w(S) > γ, then the minimization in (1) puts u in Ts and cuts the γ-edge. This construction has been successfully used in computer vision [19]. Truncations can model piecewise linear concave functions of w(S) [19, 29], and also represent negative terms in a pseudo-boolean polynomial [18]. Furthermore, these functions include rank functions g(S) = min{|S|, k} of uniform matroids, and rank functions of partition matroids. If V is partitioned into groups G ⊂ V, then the rank of the associated partition matroid counts the number of groups that S intersects: f(S) = |{G | G ∩ S ≠ ∅}| (Fig. 3(c)).\n\nBipartite neighborhoods. We already encountered bipartite submodular functions f(S) = ∑_{u∈N(S)} w(u) in Section 1.1. The bipartite graph that defines N(S) is part of the representation shown in Figure 3(d), and its edges get infinite weight. As a result, if S ⊆ Ts, then all neighbors N(S) of S must also be in Ts, and the edges (u, t) for all u ∈ N(S) are cut. Each u ∈ U has such an edge (u, t), and the weight of that edge is the weight w(u) of u.\n\nAdditional examples are given in [18].\n\nOf course, all the above constructions can also be applied to subsets Q ⊂ V of nodes. 
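Equation (1) for the two-element example of Figure 2 can be verified directly by enumerating the choices of the auxiliary node; the edge list below is reconstructed from the worked example in the text (weights are illustrative):

```python
from itertools import combinations

def cut_value(source_side, edges):
    """Total weight of directed edges leaving {s} | source_side."""
    side = {'s'} | set(source_side)
    return sum(c for (a, b, c) in edges if a in side and b not in side)

def represented_f(S, edges, aux):
    """Right-hand side of Eq. (1): minimize the cut over all subsets
    of auxiliary nodes joined to the source side."""
    return min(cut_value(set(S) | set(U), edges)
               for r in range(len(aux) + 1)
               for U in combinations(aux, r))

# Figure 2's graph for f(S) = max_{i in S} w(i) + m(V \ S),
# V = {1, 2}, w(2) > w(1), one auxiliary node u (illustrative weights).
w = {1: 2.0, 2: 5.0}
m = {1: 1.0, 2: 3.0}
edges = [('s', 1, m[1]), ('s', 2, m[2]),
         (1, 'u', w[1]), (2, 'u', w[2]), ('u', 't', w[2])]

f = lambda S: (max((w[i] for i in S), default=0.0)
               + sum(m[j] for j in {1, 2} - set(S)))
subsets = [set(), {1}, {2}, {1, 2}]
print(all(abs(represented_f(S, edges, ['u']) - f(S)) < 1e-12
          for S in subsets))   # True
```

This matches the case analysis in the text: for S = {1} the cheaper option keeps u on the t-side (cost m(2) + w(1)), while for S = {1, 2} it joins u to the source side (cost w(2)).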
In fact, the decomposition and constructions above permit us to address arbitrary sums and restrictions of such graph-representable functions. These example families of functions already cover a wide variety of functions needed in applications. Minimizing a graph-represented function is equivalent to finding the minimum (s, t)-cut, and all edge weights in the above are nonnegative. Thus we can use any efficient min-cut or max-flow algorithm for any of the above functions.\n\n2.1 Submodular edge weights\n\nNext we address the generic case of a submodular function that is not (efficiently) graph-representable or whose functional form is unknown. We can still decompose this function into a modular part m and a polymatroid g. Then we construct a simple graph as shown in Figure 3(f). The representation of m is the same as above, but the cost of the edges (i, t) will be charged differently. Instead of a sum of weights, we define the cost of a set of these edges to be a non-additive function on sets of edges, a polymatroid rank function. Each edge (i, t) is associated with exactly one ground set element i ∈ V, and selecting i (i ∈ Ts) is equivalent to cutting the edge (i, t). Thus, the cost of edge (i, t) will model the cost g(i) of its element i ∈ V. Let Et be the set of such edges (i, t), and denote, for any subset C ⊆ Et, the set of ground set elements adjacent to C by V(C) = {i ∈ V | (i, t) ∈ C}. Equivalently, C is the boundary of V(C) in Et: δs(V(C)) ∩ Et = C. We define the cost of C to be the cost of its adjacent ground set elements, hg(C) ≜ g(V(C)); this implies hg(δs(S) ∩ Et) = g(S). The equivalent of Equation (1) becomes\n\nf(S) = min_{U⊆U} w(δs(S ∪ U) \ Et) + hg(δs(S ∪ U) ∩ Et) = −m(V \ S) + g(S),   (2)\n\nwith U = ∅ in Figure 3(f). 
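The decomposition into a modular part m and a polymatroid g (Theorem 1) used throughout this construction can be computed directly from oracle calls; a toy-scale sketch (the weight data is made up), which also confirms that g is totally normalized, i.e., g(V \ i) = g(V) for every i:

```python
import math

def decompose(f, V):
    """Cunningham's decomposition f = m + g: m(i) is the marginal
    rho_f(i | V \\ i) = f(V) - f(V \\ {i}), and g = f - m."""
    V = frozenset(V)
    m = {i: f(V) - f(V - {i}) for i in V}
    def g(S):
        return f(frozenset(S)) - sum(m[i] for i in S)
    return m, g

# Toy submodular function: square root of a modular weight
# (a concave function of a sum, as in Section 1.1; weights made up).
wt = {1: 1.0, 2: 4.0, 3: 4.0}
f = lambda S: math.sqrt(sum(wt[i] for i in S))

V = {1, 2, 3}
m, g = decompose(f, V)
# Total normalization g(V \ i) == g(V) holds by construction:
print(all(abs(g(V - {i}) - g(V)) < 1e-12 for i in V))   # True
```

Note that g(V) − g(V \ i) = f(V) − f(V \ i) − m(i) = 0 for any f, which is exactly why the theorem's g is totally normalized.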
This generalization from the standard sum of edge weights to a nondecreas-\ning submodular function permits us to express many more functions, in fact any submodular function\n[5]. Such expressiveness comes at a price, however: in general, \ufb01nding a minimum (s, t)-cut with\nsuch submodular edge weights is NP-hard, and even hard to approximate [17]. The graphs here that\nrepresent submodular functions correspond to benign examples that are not NP-hard. Nevertheless,\nwe will use an approximation algorithm that applies to all such non-additive cuts. We describe the\nalgorithm in Section 3. For the moment, we assume that we can handle submodular costs on edges.\nThe simple construction in Figure 3(f) itself corresponds to a general submodular function mini-\nmization. It becomes powerful when combined with parts of f that are explicitly representable. If g\ndecomposes into a sum of graph-representable functions and a (nondecreasing submodular) remainder\ngr, then we construct a subgraph for each graph-representable function, and combine these subgraphs\nwith the submodular-edge construction for gr. All the subgraphs share the same ground set nodes V.\nIn addition, we are in no way restricted to separating graph-representable and general submodular\nfunctions. The cost function in our application is a submodular function induced by a bipartite graph\nH = (V,U,E). Let, as before, N (S) be the neighborhood of S \u2286 V in U. Given a nondecreasing\nsubmodular function gU : 2U \u2192 R+ on U, the graph H de\ufb01nes a function g(S) = gU (N (S)). If\ngU is nondecreasing submodular, then so is g [28, \u00a744.6 g]. For any such function, we represent H\nexplicitly in G, and then add submodular-cost edges from U to t with hg(\u03b4s(N (S))) = gU (N (S)),\nas shown in Figure 3(d). 
If gU is itself exactly representable, then we add the appropriate subgraph instead (Figure 3(e)).\n\n3 Optimization\n\nTo minimize a function f, we find a minimum (s, t)-cut in its representation graph. Algorithm 1 applies to any submodular-weight cut; this algorithm is exact if the edge costs are modular (a sum of weights). In each iteration, we approximate f by a function ˆf that is efficiently graph-representable, and minimize ˆf instead. In this section, we switch from costs f, ˆf of node sets S, T to equivalent costs w, h of edge sets A, B, C and back.\n\nAlgorithm 1: Minimizing graph-based approximations.\ncreate the representation graph G = (V ∪ U ∪ {s, t}, E) and set S0 = T0 = ∅;\nfor i = 1, 2, . . . do\n  compute edge weights ν_{i−1} = ν_{δs(T_{i−1})} (Equation 4);\n  find the (maximal) minimum (s, t)-cut Ti = argmin_{T⊆V∪U} ν_{i−1}(δs(T));\n  if f(Ti) = f(T_{i−1}) then\n    return Si = Ti ∩ V;\n  end\nend\n\nThe approximation ˆf arises from the cut representation constructed in Section 2: we replace the exact edge costs by approximate modular edge weights ν in G. Recall that the representation G has two types of edges: those whose weights w are counted as the usual sum, and those charged via a submodular function hg derived from g. We denote the latter set by Et, and the former by Em. For any e ∈ Em, we use the exact cost ν(e) = w(e). The submodular cost hg of the remaining edges is upper bounded by referring to a fixed set B ⊆ E that we specify later. For any A ⊆ Et, we define\n\nˆhB(A) ≜ hg(B) + ∑_{e∈A\B} ρh(e|B ∩ Et) − ∑_{e∈B\A} ρh(e|Et \ e) ≥ hg(A).   (3)\n\nThis inequality holds thanks to diminishing marginal costs, and the approximation is tight at B, ˆhB(B) = hg(B). Up to a constant shift, this function is equivalent [16] to the edge weights:\n\nνB(e) = ρh(e|B ∩ Et) if e ∈ Et \ B;  and  νB(e) = ρh(e|Et \ e) if e ∈ B ∩ Et.   (4)\n\nPlugging νB into Equation (2) yields an approximation ˆf of f. In the algorithm, B is always the boundary B = δs(T) of a set T ⊆ (V ∪ U). Then G with weights νB represents\n\nˆf(S) = min_{U⊆U} νB(δs(S ∪ U) ∩ Em) + νB(δs(S ∪ U) ∩ Et)\n     = min_{U⊆U} w(δs(S ∪ U) ∩ Em) + ∑_{(u,t)∈δs(S∪U)∩B} ρg(u|V ∪ U \ u) + ∑_{(u,t)∈δs(S∪U)\B} ρg(u|T).\n\nHere, we used the definition hg(C) ≜ g(V(C)). Importantly, the edge weights νB are always nonnegative, because, by Theorem 1, g is guaranteed to be nondecreasing. Hence, we can efficiently minimize ˆf as a standard minimum cut. If in Algorithm 1 there is more than one set T defining a minimum cut, then we pick the largest (i.e., maximal) such set. Lemma 2 states properties of the Ti.\n\nLemma 2. Assume G is any of the graphs in Figure 3, and let T∗ ⊆ V ∪ U be the maximal set defining a minimum-cost cut δs(T∗) in G, so that S∗ = T∗ ∩ V is a minimizer of the function represented by G. Then, in any iteration i of Algorithm 1, it holds that T_{i−1} ⊆ Ti ⊆ T∗. In particular, S ⊆ S∗ for the returned solution S.\n\nLemma 2 has three important implications. First, the algorithm never picks any element outside the maximal optimal solution. Second, because the Ti are growing, there are at most |T∗| ≤ |V ∪ U| iterations, and the algorithm is strongly polynomial. Finally, the chain property permits more efficient implementations. 
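The modular upper bound of Equation (3) is short to state in code. A sketch for a generic submodular cost h on a small edge set (here all edges carry submodular cost, so the B ∩ Et restriction reduces to B; the function and weights are illustrative), verifying tightness at B and the upper-bound property by exhaustive check:

```python
import math
from itertools import combinations

def modular_upper_bound(h, E, B):
    """Eq. (3): h_hat_B(A) = h(B) + sum_{e in A\\B} rho(e|B)
                            - sum_{e in B\\A} rho(e|E \\ e).
    Tight at B, and >= h everywhere, for submodular h."""
    E, B = frozenset(E), frozenset(B)
    rho = lambda e, S: h(S | {e}) - h(S)   # marginal cost of edge e
    def h_hat(A):
        A = frozenset(A)
        return (h(B)
                + sum(rho(e, B) for e in A - B)
                - sum(rho(e, E - {e}) for e in B - A))
    return h_hat

# Toy nondecreasing submodular edge cost: sqrt of a modular weight.
wt = {'a': 1.0, 'b': 2.0, 'c': 3.0}
h = lambda S: math.sqrt(sum(wt[e] for e in S))

E, B = {'a', 'b', 'c'}, {'a', 'b'}
h_hat = modular_upper_bound(h, E, B)
subsets = [frozenset(c) for r in range(4) for c in combinations(E, r)]
print(abs(h_hat(B) - h(B)) < 1e-12 and
      all(h_hat(A) >= h(A) - 1e-12 for A in subsets))   # True
```

Up to the constant h(B) − Σ_{e∈B} ρ(e|E \ e), this bound is exactly the modular edge weight ν_B of Equation (4): edges outside B are charged ρ(e|B), edges inside B are charged ρ(e|E \ e).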
The proof of Lemma 2 relies on the definition of ν and submodularity [18]. Moreover, the weights ν lead to a bound on the worst-case approximation factor [18].\n\n3.1 Improvement via summarizations\n\nThe approximation ˆf is loosest if the sum of edge weights νi(A) significantly overestimates the true joint cost hg(A) of sets of edges A ⊆ δs(T∗) \ δs(Ti) still to be cut. This happens if the joint marginal cost ρh(A|δs(Ti)) is much smaller than the estimated sum of weights, νi(A) = ∑_{e∈A} ρh(e|δs(Ti)). Luckily, many of the functions that show this behavior strongly resemble truncations. Thus, to tighten the approximation, we summarize the joint cost of groups of edges by a construction similar to Figure 3(b). Then the algorithm can take larger steps and pick groups of elements.\n\nWe partition Et into disjoint groups Gk of edges (u, t). For each group, we introduce an auxiliary node tk and re-connect all edges (u, t) ∈ Gk to end in tk instead of t. Their cost remains the same. An extra edge ek connects tk to t, and carries the joint weight νi(ek) of all edges in Gk; a tighter approximation. The weight νi(ek) is also adapted in each iteration. Initially, we set ν0(ek) = hg(Gk) = g(V(Gk)). Subsequent approximations νi refer to cuts δs(Ti), and such a cut can contain either single edges from Gk, or the group edge ek. We set the next reference set Bi to be a copy of δs(Ti) in which each group edge ek was replaced by all its group members Gk. 
The joint group\n\u03bdi(e).\n\nweight \u03bdi(ek) for any k is then \u03bdi(ek) = \u03c1h(Gk \\ Bi|Bi) +(cid:80)\n\ne\u2208Gk\n\ne\u2208Gk\u2229Bi \u03c1h(e|Et \\ e) \u2264(cid:80)\n\u03c1h(e|B)\u2212 (cid:88)\n\n\u03c1h(e|Et\\ e) \u2264 \u02c6h(A),\n\ne\u2208(Gk\u2229A)\\B,Gk(cid:54)\u2286A\n\ne\u2208B\\A\n\nFormally, these weights represent the upper bound\n(cid:48)\n\u02c6h\nB(A) = hg(B) +\n\n\u03c1h(Gk\\ B|B) +\n\n(cid:88)\n\n(cid:88)\n\nGk\u2286A\n\nwhere we replace Gk by ek whenever Gk \u2286 A. In our experiments, this summarization helps improve\nthe results while simultaneously reducing running time.\n\npick the best range. For this construction, g must have the form g(U ) = \u03c8((cid:80)\n\u02dcw(U ) =(cid:80)\n\n4 Parametric constructions for special cases\nFor certain functions of the form f (S) = m(S) + g(N (S)), the graph representation in Figure 3(d)\nadmits a speci\ufb01c algorithm. We use approximations that are exact on limited ranges, and eventually\nu\u2208U \u02dcw(u)) for\nweights \u02dcw \u2265 0 and one piecewise linear, concave function \u03c8 with a small (polynomial) number\n(cid:96) of breakpoints. Alternatively, \u03c8 can be any concave function if the weights \u02dcw are such that\nu\u2208U \u02dcw(u) can take at most polynomially many distinct values xk; e.g., if \u02dcw(u) = 1 for\nall u, then effectively (cid:96) = |U| + 1 by using the xk as breakpoints and interpolating. In all these cases,\n\u03c8 is equivalent to the minimum of at most (cid:96) linear (modular) functions.\nWe build on the approach in [8], but, whereas their functions are de\ufb01ned on V, g here is de\ufb01ned on U.\nContrary to their functions and owing to our decomposition, the \u03c8 here is nondecreasing. 
We define ℓ linear functions, one for each breakpoint xk (and use x0 = 0):\n\nψk(t) = (ψ(xk) − ψ(x_{k−1}))/(xk − x_{k−1}) · (t − xk) + ψ(xk) = αk t + βk.   (5)\n\nThe ψk are defined such that ψ(t) = min_k ψk(t). Therefore, we approximate f by a series ˆfk(S) = −m(V \ S) + ψk(˜w(N(S))), and find the exact minimizer Sk for each k. To compute Sk via a minimum cut in G (Fig. 3(d)), we define edge weights νk(e) = w(e) for edges e ∉ Et as in Section 3, and νk(u, t) = αk ˜w(u) for e ∈ Et. Then Tk = Sk ∪ N(Sk) defines a minimum cut δs(Tk) in G. We compute ˆfk(Sk) = νk(δs(Tk)) + βk + m(V); the optimal solution is the Sk with minimum cost ˆfk(Sk). This method is exact. To solve for all k within one max-flow, we use a parametric max-flow method [10, 14]. Parametric max-flow usually works with both edges from s and to t. Here, νk ≥ 0 because ψ is nondecreasing, and thus we only need t-edges which already exist in the bipartite graph G.\n\nThis method is limited to few breakpoints. For more general concave ψ and arbitrary ˜w ≥ 0, we can approximate ψ by a piecewise linear function. Still, the parametric approach does not directly generalize to more than one nonlinearity, e.g., g(U) = ∑_i g_i(U ∩ W_i) for sets W_i ⊆ U. In contrast, Algorithm 1 (with the summarization) can handle all of these cases. We point out that without indirection via the bipartite graph, i.e., f(S) = m(S) + ψ(w(S)) for a ψ with few breakpoints, we can minimize f very simply: The solution for ψk includes all j ∈ V with αk ≤ −m(j)/w(j). 
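The lower envelope ψ(t) = min_k ψk(t) of Equation (5) is easy to construct; a sketch with an illustrative concave ψ and made-up breakpoints, checking that the envelope of the chord lines agrees with ψ at every breakpoint (between breakpoints it is the piecewise-linear interpolation):

```python
import math

def envelope_lines(psi, breakpoints):
    """One line per breakpoint x_k (Eq. (5)):
    psi_k(t) = alpha_k * (t - x_k) + psi(x_k) = alpha_k * t + beta_k,
    with slope alpha_k = (psi(x_k) - psi(x_{k-1})) / (x_k - x_{k-1})."""
    lines = []
    for xk_prev, xk in zip(breakpoints, breakpoints[1:]):
        alpha = (psi(xk) - psi(xk_prev)) / (xk - xk_prev)
        beta = psi(xk) - alpha * xk
        lines.append((alpha, beta))
    return lines

psi = math.sqrt                      # concave and nondecreasing
xs = [0.0, 1.0, 4.0, 9.0, 16.0]      # breakpoints, x_0 = 0
lines = envelope_lines(psi, xs)

env = lambda t: min(a * t + b for (a, b) in lines)
# Concavity makes the minimum over chords equal psi at each breakpoint:
print(all(abs(env(x) - psi(x)) < 1e-12 for x in xs[1:]))   # True
```

Each pair (alpha_k, beta_k) then supplies the t-edge weights νk(u, t) = αk ˜w(u) and the offset βk for one parametric cut problem.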
The advantage of the graph cut is that it easily combines with other objectives.\n\n5 Experiments\n\nIn the experiments, we test whether the graph-based methods improve over the minimum-norm point algorithm in the difficult cases of Section 1.1. We compare the following methods:\nMN: a re-implementation of the minimum norm point algorithm in C++ that is about four times faster than the C code in [7] (see [18]), ensuring that our results are not due to a slow implementation;\nMC: a minimum cut with static edge weights ν(e) = hg(e);\nGI: the graph-based iterative Algorithm 1, implemented in C++ with the max-flow code of [3], (i) by itself; (ii) with summarization via √|Et| random groups (GIr); (iii) with summarization via groups generated by sorting the edges in Et by their weights hg(e), and then forming groups Gk of edges adjacent in the order such that for each e ∈ Gk, hg(e) ≤ 1.1 hg(Gk) (GIs);\nGP: the parametric method from Section 4, using |Et| equispaced breakpoints; based on C code from RIOT¹.\n\nFigure 4: (a) Running time, (b) relative and (c) absolute error with varying λ for a data set as described in Section 1.1, |V| = 54915, |U| = 6871, and f(S) = −m(S) + λ √(w(N(S))). Where f(S∗) = 0, we show absolute errors. (d) Running times with respect to |V|, f(S) = −m(S) + λ √|N(S)|.\n\nWe also implemented the SLG method from [29] in C++ (public code is not available), but found it to be impractical on the problems here, as gradient computation of our function requires finding gradients of |U| truncation functions, which is quite expensive [18]. Thus, we did not include it in the tests on the large graphs. We use bipartite graphs of the form described in Section 1.1, with a cost function f(S) = m(S) + λ g(N(S)). 
The function g uses a square root, g(U) = √(w(U)). More results, also on other functions, can be found in [18].
Solution quality with solution size. Running time and results depend on the size of S*. Thus, we vary λ from 50 (S* ≈ V) to 9600 (S* = ∅) on a speech recognition data set [11]. The bipartite graph represents a corpus subset extraction problem (Section 1.1) and has |V| = 54915, |U| = 6871 nodes, and uniform weights w(u) = 1 for all u ∈ U. The results look similar with non-uniform weights, but for uniform weights the parametric method from Section 4 always finds the optimal solution and thus allows us to report errors. Figure 4 shows the running times and the relative error err(S) = |f(S) − f(S*)|/|f(S*)| (note that f(S*) ≤ 0). If f(S*) = 0, we report absolute errors. Because of the large graph, we used the minimum-norm algorithm with accuracy 10^−5. Still, it takes up to 100 times longer than the other methods. It works well if S* is large, but as λ grows, its accuracy becomes poor. In particular, when f(S*) = f(∅) = 0, it returns large sets with large positive cost. In contrast, the deviation of the approximate edge weights ν_i from the true cost is bounded [18]. All algorithms except MN return an optimal solution for λ ≥ 2000. Updating the weights ν clearly improves the performance of Algorithm 1, as does the summarization (GIr/GIs perform identically here). With the latter, the solutions are very often optimal, and almost always very good.
Scaling: To test how the methods scale with the size |V|, we sample small graphs from the big graph, and report average running times across 20 graphs for each size. As the graphs have non-uniform weights, we use GP as an approximation method and estimate the nonlinearity √(w(U)) by a piecewise linear function with |U| breakpoints. All algorithms find the same (optimal) solution. Figure 4(d) shows that the minimum-norm algorithm with high accuracy is much slower than the other methods. Empirically, MN scales as up to O(n^5) (note that Figure 1 is a specific worst-case graph), the parametric version approximately O(n^2), and the variants of GI up to O(n^1.5).
Acknowledgments: This material is based upon work supported in part by the National Science Foundation under grant IIS-0535100, by an Intel research award, a Microsoft research award, and a Google research award.

¹ http://riot.ieor.berkeley.edu/riot/Applications/Pseudoflow/parametric.html

References
[1] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows. Prentice Hall, 1993.
[2] Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, 2001.
[3] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE TPAMI, 26(9):1124–1137, 2004.
[4] W. H. Cunningham. Decomposition of submodular functions. Combinatorica, 3(1):53–68, 1983.
[5] W. H. Cunningham. Testing membership in matroid polyhedra. J. Combinatorial Theory B, 36:161–188, 1984.
[6] D. Freedman and P. Drineas. Energy minimization via graph cuts: Settling what is possible. In CVPR, 2005.
[7] S. Fujishige and S. Isotani. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7:3–17, 2011.
[8] S. Fujishige and S. Iwata. Minimizing a submodular function arising from a concave function.
Discrete Applied Mathematics, 92, 1999.
[9] S. Fujishige and S. B. Patkar. Realization of set functions as cut functions of graphs and hypergraphs. Discrete Mathematics, 226:199–210, 2001.
[10] G. Gallo, M. D. Grigoriadis, and R. E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM J. Computing, 18(1), 1989.
[11] J. J. Godfrey, E. C. Holliman, and J. McDaniel. Switchboard: Telephone speech corpus for research and development. In Proc. ICASSP, volume 1, pages 517–520, 1992.
[12] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, 51(2), 1989.
[13] M. Grötschel, L. Lovász, and A. Schrijver. The ellipsoid algorithm and its consequences in combinatorial optimization. Combinatorica, 1:499–513, 1981.
[14] D. Hochbaum. The pseudoflow algorithm: a new algorithm for the maximum flow problem. Operations Research, 58(4), 2008.
[15] S. Iwata, L. Fleischer, and S. Fujishige. A combinatorial strongly polynomial algorithm for minimizing submodular functions. J. ACM, 48:761–777, 2001.
[16] S. Jegelka and J. Bilmes. Submodularity beyond submodular energies: coupling edges in graph cuts. In CVPR, 2011.
[17] S. Jegelka and J. Bilmes. Approximation bounds for inference using cooperative cuts. In ICML, 2011.
[18] S. Jegelka, H. Lin, and J. Bilmes. Fast approximate submodular minimization: Extended version, 2011.
[19] P. Kohli, L. Ladický, and P. Torr. Robust higher order potentials for enforcing label consistency. Int. J. Computer Vision, 82, 2009.
[20] H. Lin and J. Bilmes. An application of the submodular principal partition to training data subset selection. In NIPS Workshop on Discrete Optimization in Machine Learning, 2010.
[21] H. Lin and J. Bilmes. Optimal selection of limited vocabulary speech corpora. In Proc. Interspeech, 2011.
[22] S. T. McCormick.
Submodular function minimization. In K. Aardal, G. Nemhauser, and R. Weismantel, editors, Handbook on Discrete Optimization, pages 321–391. Elsevier, 2006. Updated version 3a (2008).
[23] M. Narasimhan, N. Jojic, and J. Bilmes. Q-clustering. In NIPS, 2005.
[24] J. B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming, 118(2):237–251, 2009.
[25] M. Preissmann and A. Sebő. Research Trends in Combinatorial Optimization, chapter Graphic Submodular Function Minimization: A Graphic Approach and Applications, pages 365–385. Springer, 2009.
[26] M. Queyranne. Minimizing symmetric submodular functions. Mathematical Programming, 82:3–12, 1998.
[27] A. Schrijver. A combinatorial algorithm minimizing submodular functions in strongly polynomial time. J. Combin. Theory Ser. B, 80:346–355, 2000.
[28] A. Schrijver. Combinatorial Optimization. Springer, 2004.
[29] P. Stobbe and A. Krause. Efficient minimization of decomposable submodular functions. In NIPS, 2010.
[30] J. Vondrák. Personal communication, 2011.
[31] S. Živný and P. G. Jeavons. Classes of submodular constraints expressible by graph cuts. Constraints, 15:430–452, 2010. ISSN 1383-7133.
[32] S. Živný, D. A. Cohen, and P. G. Jeavons. The expressive power of binary submodular functions. Discrete Applied Mathematics, 157(15):3347–3358, 2009.