{"title": "Reflection methods for user-friendly submodular optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1313, "page_last": 1321, "abstract": "Recently, it has become evident that submodularity naturally captures widely occurring concepts in machine learning, signal processing and computer vision. In consequence, there is need for efficient optimization procedures for submodular functions, in particular for minimization problems. While general submodular minimization is challenging, we propose a new approach that exploits existing decomposability of submodular functions. In contrast to previous approaches, our method is neither approximate, nor impractical, nor does it need any cumbersome parameter tuning. Moreover, it is easy to implement and parallelize. A key component of our approach is a formulation of the discrete submodular minimization problem as a continuous best approximation problem. It is solved through a sequence of reflections and its solution can be automatically thresholded to obtain an optimal discrete solution. Our method solves both the continuous and discrete formulations of the problem, and therefore has applications in learning, inference, and reconstruction. In our experiments, we show the benefits of our new algorithms for two image segmentation tasks.", "full_text": "Reflection methods for user-friendly submodular optimization

Stefanie Jegelka
UC Berkeley
Berkeley, CA, USA

Francis Bach
INRIA - ENS
Paris, France

Suvrit Sra
MPI for Intelligent Systems
Tübingen, Germany

Abstract

Recently, it has become evident that submodularity naturally captures widely occurring concepts in machine learning, signal processing and computer vision. Consequently, there is need for efficient optimization procedures for submodular functions, especially for minimization problems. 
While general submodular minimization is challenging, we propose a new method that exploits existing decomposability of submodular functions. In contrast to previous approaches, our method is neither approximate, nor impractical, nor does it need any cumbersome parameter tuning. Moreover, it is easy to implement and parallelize. A key component of our method is a formulation of the discrete submodular minimization problem as a continuous best approximation problem that is solved through a sequence of reflections, and its solution can be easily thresholded to obtain an optimal discrete solution. This method solves both the continuous and discrete formulations of the problem, and therefore has applications in learning, inference, and reconstruction. In our experiments, we illustrate the benefits of our method on two image segmentation tasks.

1 Introduction

Submodularity is a rich combinatorial concept that expresses widely occurring phenomena such as diminishing marginal costs and preferences for grouping. A set function F : 2^V → R on a set V is submodular if for all subsets S, T ⊆ V, we have F(S ∪ T) + F(S ∩ T) ≤ F(S) + F(T).

Submodular functions underlie the goals of numerous problems in machine learning, computer vision and signal processing [1]. Several problems in these areas can be phrased as submodular optimization tasks: notable examples include graph cut-based image segmentation [7], sensor placement [30], or document summarization [31]. A longer list of examples may be found in [1].

The theoretical complexity of submodular optimization is well understood: unconstrained minimization of submodular set functions is polynomial-time [19], while submodular maximization is NP-hard. Algorithmically, however, the picture is different. 
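The defining inequality above is easy to sanity-check by brute force on a tiny ground set. The sketch below (function names are illustrative, not from the paper) verifies it for a small graph cut function, the class of submodular functions used throughout:

```python
from itertools import combinations

def cut_value(S, edges):
    """Weight of the edges crossing between S and its complement."""
    S = set(S)
    return sum(w for (u, v, w) in edges if (u in S) != (v in S))

def is_submodular(F, V):
    """Brute-force check of F(S u T) + F(S n T) <= F(S) + F(T) for all S, T."""
    subsets = [frozenset(c) for k in range(len(V) + 1)
               for c in combinations(V, k)]
    return all(F(S | T) + F(S & T) <= F(S) + F(T) + 1e-12
               for S in subsets for T in subsets)

# A 4-node cycle with unit edge weights; cut functions are submodular.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)]
F = lambda S: cut_value(S, edges)
print(is_submodular(F, list(range(4))))  # True
```

The same checker returns False for a supermodular function such as S ↦ |S|², which violates the inequality already for two crossing singletons.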
Generic submodular maximization admits efficient algorithms that can attain approximate optima with global guarantees; these algorithms are typically based on local search techniques [16, 35]. In contrast, although polynomial-time solvable, submodular function minimization (SFM), which seeks to solve

min_{S⊆V} F(S),   (1)

poses substantial algorithmic difficulties. This is partly due to the fact that one is commonly interested in an exact solution (or an arbitrarily close approximation thereof), and "polynomial-time" is not necessarily equivalent to "practically fast".

Submodular minimization algorithms may be obtained from two main perspectives: combinatorial and continuous. Combinatorial algorithms for SFM typically use close connections to matroid and maximum flow methods; the currently theoretically fastest combinatorial algorithm for SFM scales as O(n^6 + n^5 τ), where τ is the time to evaluate the function oracle [37] (for an overview of other algorithms, see e.g., [33]). These combinatorial algorithms are typically nontrivial to implement. Continuous methods offer an alternative by instead minimizing a convex extension. This idea exploits the fundamental connection between a submodular function F and its Lovász extension f [32], which is continuous and convex. The SFM problem (1) is then equivalent to

min_{x∈[0,1]^n} f(x).   (2)

The Lovász extension f is nonsmooth, so we might have to resort to subgradient methods. While a fundamental result of Edmonds [15] demonstrates that a subgradient of f can be computed in O(n log n) time, subgradient methods can be sensitive to the choice of step size, and can be slow. They theoretically converge at a rate of O(1/√t) (after t iterations). 
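Edmonds' O(n log n) subgradient computation is constructive: sort the components of x and take marginal gains of F along that order (this is the "greedy algorithm" reviewed in Section 2). A minimal sketch, with an illustrative concave-of-cardinality submodular function:

```python
def greedy_subgradient(F, x):
    """Sort the components of x, then take marginal gains of F along that
    order. The resulting vector y maximizes y'x over the base polytope of F
    and is a subgradient of the Lovasz extension f at x (see Section 2)."""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    y, prefix, prev = [0.0] * len(x), [], F(frozenset())
    for i in order:
        prefix.append(i)
        cur = F(frozenset(prefix))
        y[i], prev = cur - prev, cur
    return y

# Illustrative submodular function: F(S) = 2 * min(|S|, 1).
F = lambda S: 2.0 * min(len(S), 1)
x = [0.7, 0.2, 0.9]
y = greedy_subgradient(F, x)
print(sum(a * b for a, b in zip(y, x)))  # f(x) = 1.8
```

By construction, the inner product of the returned y with an indicator vector 1_S recovers F(S), consistent with the Lovász extension interpolating F on the hypercube vertices.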
The "smoothing technique" of [36] does not in general apply here, because computing a smoothed gradient is equivalent to solving the submodular minimization problem. We discuss this issue further in Section 2.

An alternative to minimizing the Lovász extension directly on [0,1]^n is to consider a slightly modified convex problem. Specifically, the exact solution of the discrete problem min_{S⊆V} F(S) and of its nonsmooth convex relaxation min_{x∈[0,1]^n} f(x) may be found as a level set S_0 = {k | x*_k ≥ 0} of the unique point x* that minimizes the strongly convex function [1, 10]:

f(x) + ½‖x‖².   (3)

We will refer to the minimization of (3) as the proximal problem due to its close similarity to proximity operators used in convex optimization [12]. When F is a cut function, (3) becomes a total variation problem (see, e.g., [9] and references therein) that also occurs in other regularization problems [1]. Two noteworthy points about (3) are: (i) the addition of the strongly convex component ½‖x‖²; (ii) the ensuing removal of the box constraints x ∈ [0,1]^n. These changes allow us to consider a convex dual which is amenable to smooth optimization techniques.

Typical approaches to generic SFM include Frank-Wolfe methods [17] that have cheap iterations and O(1/t) convergence, but can be quite slow in practice (Section 5); or the minimum-norm-point/Fujishige-Wolfe algorithm [20] that has expensive iterations but finite convergence. Other recent methods are approximate [24]. In contrast to several iterative methods based on convex relaxations, we seek to obtain exact discrete solutions.

To the best of our knowledge, all generic algorithms that use only submodularity are several orders of magnitude slower than specialized algorithms when they exist (e.g., for graph cuts). 
However, the submodular function is not always generic and given via a black box, but often has known structure. Following [28, 29, 38, 41], we make the assumption that F(S) = Σ_{i=1}^r F_i(S) is a sum of sufficiently "simple" functions (see Sec. 3). This structure allows the use of (parallelizable) dual decomposition techniques for the problem in Eq. (2), with [11, 38] or without [29] Nesterov's smoothing technique, or with direct smoothing [41] techniques. But existing approaches typically have two drawbacks: (1) they use smoothing or step-size parameters whose selection may be critical and quite tedious; and (2) they still exhibit slow convergence (see Section 5).

These drawbacks arise from working with formulation (2). Our main insight is that, although it may seem counter-intuitive, the proximal problem (3) offers a much more user-friendly tool for solving (1) than its natural convex counterpart (2), both in implementation and running time. We approach problem (3) via its dual. This allows decomposition techniques which combine well with orthogonal projection and reflection methods that (a) exhibit faster convergence, (b) are easily parallelizable, (c) require no extra hyperparameters, and (d) are extremely easy to implement.

The three main algorithms that we consider are: (i) dual block-coordinate descent (equivalently, primal-dual proximal-Dykstra), which was already shown to be extremely efficient for total variation problems [2] that are special cases of Problem (3); (ii) Douglas-Rachford splitting using the careful variant of [4], which for our formulation (Section 4.2) requires no hyperparameters; and (iii) accelerated projected gradient [5]. We will see that these alternative algorithms can offer speedups beyond known efficiencies. Our observations have two implications: first, from the viewpoint of solving Problem (3), they offer speedups for frequently occurring denoising and reconstruction problems that employ total variation. 
Second, our experiments suggest that projection and reflection methods can work very well for solving the combinatorial problem (1).

In summary, we make the following contributions: (1) In Section 3, we cast the problem of minimizing decomposable submodular functions as an orthogonal projection problem and show how existing optimization techniques may be brought to bear on this problem, to obtain fast, easy-to-code and easily parallelizable algorithms. In addition, we show examples of classes of functions amenable to our approach. In particular, for simple functions, i.e., those for which minimizing F(S) − a(S) is easy for all vectors¹ a ∈ R^n, the problem in Eq. (3) may be solved in O(log(1/ε)) calls to such minimization routines, to reach a precision ε (Sections 2 and 3). (2) In Section 5, we demonstrate the empirical gains of using accelerated proximal methods, Douglas-Rachford and block coordinate descent methods over existing approaches: fewer hyperparameters and faster convergence.

2 Review of relevant results from submodular analysis

The relevant concepts we review here are the Lovász extension, base polytopes of submodular functions, and relationships between proximal and discrete problems. For more details, see [1, 19].

Lovász extension and convexity. The power set 2^V may be naturally identified with the vertices of the hypercube, i.e., {0,1}^n. The Lovász extension f of any set function is defined by linear interpolation, so that for any S ⊂ V, F(S) = f(1_S). It may be computed in closed form once the components of x are sorted: if x_{σ(1)} ≥ ··· ≥ x_{σ(n)}, then f(x) = Σ_{k=1}^n x_{σ(k)} [F({σ(1), …, σ(k)}) − F({σ(1), …, σ(k−1)})] [32]. 
For the graph cut function, f is the total variation.

In this paper, we are going to use two important results: (a) if the set function F is submodular, then its Lovász extension f is convex, and (b) minimizing the set function F is equivalent to minimizing f(x) with respect to x ∈ [0,1]^n. Given x ∈ [0,1]^n, all of its level sets may be considered and the function may be evaluated (at most n times) to obtain a set S. Moreover, for a submodular function, the Lovász extension happens to be the support function of the base polytope B(F) defined as

B(F) = {y ∈ R^n | ∀S ⊂ V, y(S) ≤ F(S) and y(V) = F(V)},

that is, f(x) = max_{y∈B(F)} y⊤x [15]. A maximizer of y⊤x (and hence the value of f(x)) may be computed by the "greedy algorithm", which first sorts the components of x in decreasing order x_{σ(1)} ≥ ··· ≥ x_{σ(n)}, and then computes y_{σ(k)} = F({σ(1), …, σ(k)}) − F({σ(1), …, σ(k−1)}). In other words, a linear function can be maximized over B(F) in time O(n log n + nτ) (note that the term nτ may be improved in many special cases). This is crucial for exploiting convex duality.

Dual of discrete problem. We may derive a dual problem to the discrete problem in Eq. (1) and the convex nonsmooth problem in Eq. (2) as follows:

min_{S⊆V} F(S) = min_{x∈[0,1]^n} f(x) = min_{x∈[0,1]^n} max_{y∈B(F)} y⊤x = max_{y∈B(F)} min_{x∈[0,1]^n} y⊤x = max_{y∈B(F)} (y)_−(V),   (4)

where (y)_− = min{y, 0} applied elementwise. This allows us to obtain dual certificates of optimality from any y ∈ B(F) and x ∈ [0,1]^n.

Proximal problem. 
The optimization problem (3), i.e., min_{x∈R^n} f(x) + ½‖x‖², has intricate relations to the SFM problem [10]. Given the unique optimal solution x* of (3), the maximal (resp. minimal) optimizer of the SFM problem is the set S* of nonnegative (resp. positive) elements of x*. More precisely, solving (3) is equivalent to minimizing F(S) + µ|S| for all µ ∈ R. A solution S*_µ ⊆ V is obtained from a solution x* as S*_µ = {i | x*_i ≥ µ}. Conversely, x* may be obtained from all S*_µ as x*_k = sup{µ ∈ R | k ∈ S*_µ} for all k ∈ V. Moreover, if x is an ε-optimal solution of Eq. (3), then we may construct √(εn)-optimal solutions for all S_µ [1; Prop. 10.5]. In practice, the duality gap of the discrete problem is usually much lower than that of the proximal version of the same problem, as we will see in Section 5. Note that the problem in Eq. (3) provides much more information than Eq. (2), as all µ-parameterized discrete problems are solved.

The dual problem of Problem (3) reads as follows:

min_{x∈R^n} f(x) + ½‖x‖² = min_{x∈R^n} max_{y∈B(F)} y⊤x + ½‖x‖² = max_{y∈B(F)} min_{x∈R^n} y⊤x + ½‖x‖² = max_{y∈B(F)} −½‖y‖²,

where primal and dual variables are linked as x = −y. Observe that this dual problem is equivalent to finding the orthogonal projection of 0 onto B(F).

¹Every vector a ∈ R^n may be viewed as a modular (linear) set function: a(S) ≜ Σ_{i∈S} a(i).

Divide-and-conquer strategies for the proximal problems. Given a solution x* of the proximal problem, we have seen how to get S*_µ for any µ by simply thresholding x* at µ. Conversely, one can recover x* exactly from at most n well-chosen values of µ. 
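One instantiation of such a divide-and-conquer recovery (in the spirit of [19, 21]) can be sketched concretely. The sketch below uses a brute-force SFM oracle, so it is only meant for tiny ground sets, and all function names are ours, not the paper's; it computes the projection of 0 onto B(F), hence x* = −y*, and thresholding x* at zero then yields an SFM optimum as described above:

```python
from itertools import combinations

def brute_sfm(F, V, alpha):
    """Largest minimizer of F(S) - alpha*|S| over subsets of V (brute force,
    so only sensible for tiny ground sets)."""
    best, best_val = None, float("inf")
    for k in range(len(V) + 1):
        for c in combinations(sorted(V), k):
            S, val = frozenset(c), F(frozenset(c)) - alpha * k
            if val < best_val - 1e-12 or (val <= best_val + 1e-12
                                          and best is not None and k > len(best)):
                best, best_val = S, val
    return best, best_val

def proximal_solve(F, V):
    """Divide and conquer for the projection y* of 0 onto B(F); the unique
    minimizer of (3) is then x* = -y*. Assumes F(empty set) = 0."""
    if not V:
        return {}
    alpha = F(frozenset(V)) / len(V)
    S, val = brute_sfm(F, V, alpha)
    if val >= -1e-9:                      # no set beats the uniform value
        return {i: alpha for i in V}
    y = proximal_solve(F, S)              # restriction of F to S
    FS = F(S)
    y.update(proximal_solve(lambda T: F(S | T) - FS, V - S))  # contraction
    return y

# Cut on a 3-node path plus unary costs (an illustrative submodular function).
edges, unary = [(0, 1, 1.0), (1, 2, 1.0)], [-2.0, 1.0, 0.5]
def F(S):
    return (sum(unary[i] for i in S)
            + sum(w for (u, v, w) in edges if (u in S) != (v in S)))

y = proximal_solve(F, frozenset({0, 1, 2}))
S_min = {i for i, v in y.items() if -v > 0}   # threshold x* = -y* at zero
print(sorted(S_min), F(S_min))                # [0] -1.0
```

On this toy instance the recursion splits the ground set once and then assigns uniform values on each block; the thresholded set indeed attains the minimum of F over all subsets.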
A known divide-and-conquer strategy [19, 21] hinges upon the fact that for any µ, one can easily see which components of x* are greater or smaller than µ by computing S*_µ. The resulting algorithm makes O(n) calls to the submodular function oracle. In [25], we extend an alternative approach by Tarjan et al. [42] for cuts to general submodular functions and obtain a solution to (3) up to precision ε in O(min{n, log(1/ε)}) iterations. This result is particularly useful if our function F is a sum of functions for each of which by itself the SFM problem is easy. Beyond squared ℓ2-norms, our algorithm equally applies to computing all minimizers of f(x) + Σ_{j=1}^n h_j(x_j) for arbitrary smooth strictly convex functions h_j, j = 1, …, n.

3 Decomposition of submodular functions

Following [28, 29, 38, 41], we assume that our function F may be decomposed as the sum F(S) = Σ_{j=1}^r F_j(S) of r "simple" functions. In this paper, by "simple" we mean functions G for which G(S) − a(S) can be minimized efficiently for all vectors a ∈ R^n (more precisely, we require that S ↦ G(S ∪ T) − a(S) can be minimized efficiently over all subsets of V \ T, for any T ⊆ V and a). Efficiency may arise from the functional form of G, or from the fact that G has small support. For such functions, Problems (1) and (3) become

min_{S⊆V} Σ_{j=1}^r F_j(S) = min_{x∈[0,1]^n} Σ_{j=1}^r f_j(x), or equivalently, min_{x∈R^n} Σ_{j=1}^r f_j(x) + ½‖x‖².   (5)

The key to the algorithms presented here is being able to minimize ½‖x − z‖² + f_j(x), or equivalently, to orthogonally project z onto B(F_j): min ½‖y − z‖² subject to y ∈ B(F_j).

We next sketch some examples of functions F and their decompositions into simple functions F_j. 
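As a first concrete example of a "simple" function: for the cut function of a single graph edge (one of the decompositions discussed below), the proximal problem has a closed form that preserves the mean of the two coordinates and soft-thresholds their gap. A small sketch (the derivation is standard; the helper name is ours):

```python
def prox_single_edge(zi, zj, w):
    """Closed-form minimizer of 0.5*(xi-zi)^2 + 0.5*(xj-zj)^2 + w*|xi-xj|:
    keep the mean of (zi, zj) and soft-threshold their gap by 2*w."""
    mean, d = 0.5 * (zi + zj), zi - zj
    s = max(abs(d) - 2.0 * w, 0.0)
    s = s if d >= 0 else -s
    return mean + 0.5 * s, mean - 0.5 * s

xi, xj = prox_single_edge(2.0, -1.0, 0.5)
print(xi, xj)  # 1.5 -0.5  (the gap of 3.0 shrinks by 2*w = 1.0)
```

The closed form follows because the optimality conditions force x_i + x_j = z_i + z_j, after which the objective reduces to a one-dimensional soft-thresholding problem in the difference x_i − x_j.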
As shown at the end of Section 2, projecting onto B(F_j) is easy as soon as the corresponding submodular minimization problems are easy. Here we outline some cases for which specialized fast algorithms are known.

Graph cuts. A widely used class of submodular functions are graph cuts. Graphs may be decomposed into substructures such as trees, simple paths or single edges. Message passing algorithms apply to trees, while the proximal problem for paths is very efficiently solved by [2]. For single edges, it is solvable in closed form. Tree decompositions are common in graphical models, whereas path decompositions are frequently used for TV problems [2].

Concave functions. Another important class of submodular functions is that of concave functions of cardinality, i.e., F_j(S) = h(|S|) for a concave function h. Problem (3) for such functions may be solved in O(n log n) time (see [18] and our appendix in [25]). Functions of this class have been used in [24, 27, 41]. Such functions also include covering functions [41].

Hierarchical functions. Here, the ground set corresponds to the leaves of a rooted, undirected tree. Each node has a weight, and the cost of a set of nodes S ⊆ V is the sum of the weights of all nodes in the smallest subtree (including the root) that spans S. This class of functions, too, admits solving the proximal problem in O(n log n) time [22, 23, 26].

Small support. Any general, potentially slower algorithm such as the minimum-norm-point algorithm can be applied if the support of each F_j is only a small subset of the ground set.

3.1 Dual decomposition of the nonsmooth problem

We first review existing dual decomposition techniques for the nonsmooth problem (1). We always assume that F = Σ_{j=1}^r F_j, and define H_r := Π_{j=1}^r R^n ≅ R^{n×r}. We follow [29] to derive a dual formulation (see appendix in [25]):

Lemma 1. 
The dual of Problem (1) may be written in terms of variables λ_1, …, λ_r ∈ R^n as

max_λ Σ_{j=1}^r g_j(λ_j)  s.t.  λ ∈ {(λ_1, …, λ_r) ∈ H_r | Σ_{j=1}^r λ_j = 0},   (6)

where g_j(λ_j) = min_{S⊂V} F_j(S) − λ_j(S) is a nonsmooth concave function.

The dual is the maximization of a nonsmooth concave function over a convex set onto which it is easy to project: the projection of a vector y has j-th block equal to y_j − (1/r) Σ_{k=1}^r y_k. Moreover, in our setup, the functions g_j and their subgradients may be computed efficiently through SFM.

We consider several existing alternatives for the minimization of f(x) over x ∈ [0,1]^n, most of which use Lemma 1. Computing subgradients for any f_j means calling the greedy algorithm, which runs in time O(n log n). All of the following algorithms require the tuning of an appropriate step size.

Primal subgradient descent (primal-sgd): Agnostic to any decomposition properties, we may apply a standard simple subgradient method to f. A subgradient of f may be obtained from the subgradients of the components f_j. This algorithm converges at rate O(1/√t).

Dual subgradient descent (dual-sgd) [29]: Applying a subgradient method to the nonsmooth dual in Lemma 1 leads to a convergence rate of O(1/√t). Computing a subgradient requires minimizing the submodular functions F_j individually. In simulations, following [29], we consider a step-size rule similar to Polyak's rule (dual-sgd-P) [6], as well as a decaying step size (dual-sgd-F), and use discrete optimization for all F_j.

Primal smoothing (primal-smooth) [41]: The nonsmooth primal may be smoothed in several ways by smoothing the f_j individually; one example is f̃_j^ε(x) = max_{y_j∈B(F_j)} y_j⊤x − (ε/2)‖y_j‖². This leads to a function that is (1/ε)-smooth. 
Computing f̃_j^ε means solving the proximal problem for F_j. The convergence rate is O(1/t), but, apart from the step size, which may be set relatively easily, the smoothing constant ε needs to be defined.

Dual smoothing (dual-smooth): Instead of the primal, the dual (6) may be smoothed, e.g., by entropy [8, 38] applied to each g_j as g̃_j^ε(λ_j) = min_{x∈[0,1]^n} f_j(x) + εh(x), where h(x) is a negative entropy. Again, the convergence rate is O(1/t), but there are two free parameters (in particular the smoothing constant ε, which is hard to tune). This method too requires solving proximal problems for all F_j in each iteration.

Dual smoothing with entropy also admits coordinate descent methods [34] that exploit the decomposition, but we do not compare to those here.

3.2 Dual decomposition methods for proximal problems

We may also consider Eq. (3) and first derive a dual problem using the same technique as in Section 3.1. Lemma 2 (proved in the appendix in [25]) formally presents our dual formulation as a best approximation problem. The primal variable can be recovered as x = −Σ_j y_j.

Lemma 2. The dual of Eq. (3) may be written as the best approximation problem

min_{λ,y} ‖y − λ‖²  s.t.  λ ∈ {(λ_1, …, λ_r) ∈ H_r | Σ_{j=1}^r λ_j = 0},  y ∈ Π_{j=1}^r B(F_j).   (7)

We can actually eliminate the λ_j and obtain the simpler-looking dual problem

max_y −½‖Σ_{j=1}^r y_j‖²  s.t.  y_j ∈ B(F_j), j ∈ {1, …, r}.   (8)

Such a dual was also used in [40]. In Section 5, we will see the effect of solving one of these duals or the other. 
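As noted after Lemma 1, projecting onto the zero-sum constraint set that appears in both (6) and (7) is just block-mean subtraction. A minimal sketch (the function name is ours):

```python
def project_zero_sum(blocks):
    """Project (lambda_1, ..., lambda_r), each a vector in R^n, onto the
    subspace {sum_j lambda_j = 0}: subtract the across-block mean."""
    r, n = len(blocks), len(blocks[0])
    mean = [sum(b[i] for b in blocks) / r for i in range(n)]
    return [[b[i] - mean[i] for i in range(n)] for b in blocks]

lam = project_zero_sum([[1.0, 2.0], [3.0, -2.0], [2.0, 0.0]])
print([sum(col) for col in zip(*lam)])  # [0.0, 0.0]
```

The operation is an orthogonal projection onto a linear subspace, so it is idempotent and costs only O(nr) per application.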
For the simpler dual (8), the case r = 2 is of special interest; it reads

max_{y_1∈B(F_1), y_2∈B(F_2)} −½‖y_1 + y_2‖²  ⟺  min_{y_1∈B(F_1), −y_2∈−B(F_2)} ‖y_1 − (−y_2)‖².   (9)

We write problem (9) in this suggestive form to highlight its key geometric structure: it is, like (7), a best approximation problem, i.e., the problem of finding the closest point between the polytopes B(F_1) and −B(F_2). Notice, however, that (7) is very different from (9): the former operates in a product space while the latter does not, a difference that can have an impact in practice (see Section 5). We are now ready to present algorithms that exploit our dual formulations.

4 Algorithms

We describe a few competing methods for solving our smooth dual formulations. We describe the details for the special 2-block case (9); the same arguments apply to the block dual from Lemma 2.

4.1 Block coordinate descent or proximal-Dykstra

Perhaps the simplest approach to solving (9) (viewed as a minimization problem) is to use a block coordinate descent (BCD) procedure, which in this case performs the alternating projections:

y_1^{k+1} ← argmin_{y_1∈B(F_1)} ‖y_1 − (−y_2^k)‖²;   y_2^{k+1} ← argmin_{y_2∈B(F_2)} ‖y_2 − (−y_1^{k+1})‖².   (10)

The iterations for solving (8) are analogous. This BCD method (applied to (9)) is equivalent to applying the so-called proximal-Dykstra method [12] to the primal problem. This may be seen by comparing the iterates. Notice that the BCD iteration (10) is nothing but alternating projections onto the convex polyhedra B(F_1) and B(F_2). 
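Iteration (10) is easy to visualize with simple stand-ins for the two polytopes. The sketch below alternates projections between two disjoint boxes (hypothetical sets with the sign convention of (9) folded in; projecting onto a real base polytope requires the machinery of Section 3) and converges to the pair of closest points:

```python
def proj_box(p, lo, hi):
    """Euclidean projection onto an axis-aligned box: clamp each coordinate."""
    return tuple(min(max(c, l), h) for c, l, h in zip(p, lo, hi))

# Hypothetical stand-ins for the two polytopes: disjoint boxes in the plane.
A = ((0.0, 0.0), (1.0, 1.0))           # [0,1] x [0,1]
B = ((2.0, 0.0), (3.0, 1.0))           # [2,3] x [0,1]

y1 = (0.7, 0.3)
for _ in range(20):                    # iteration (10): alternate projections
    y2 = proj_box(y1, *B)
    y1 = proj_box(y2, *A)
print(y1, y2)  # (1.0, 0.3) (2.0, 0.3): the closest pair between the boxes
```

On this toy pair of sets the iteration reaches the closest pair after a single sweep; as discussed next, for unfavorably oriented sets the same scheme can be arbitrarily slow.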
There exists a large body of literature studying the method of alternating projections; we refer the interested reader to the monograph [13] for further details. However, despite its attractive simplicity, it is known that BCD (in its alternating projections form) can converge arbitrarily slowly [4], depending on the relative orientation of the convex sets onto which one projects. Thus, we turn to a potentially more effective method.

4.2 Douglas-Rachford splitting

The Douglas-Rachford (DR) splitting method [14] includes algorithms like ADMM as a special case [12]. It avoids the slowdowns alluded to above by replacing alternating projections with alternating "reflections". Formally, DR applies to convex problems of the form [3, 12]

min_x φ_1(x) + φ_2(x),   (11)

subject to the qualification ri(dom φ_1) ∩ ri(dom φ_2) ≠ ∅. To solve (11), DR starts with some z^0 and performs the three-step iteration (for k ≥ 0):

1. x^k = prox_{φ_2}(z^k);   2. v^k = prox_{φ_1}(2x^k − z^k);   3. z^{k+1} = z^k + γ_k(v^k − z^k),   (12)

where γ_k ∈ [0, 2] is a sequence of scalars that satisfy Σ_k γ_k(2 − γ_k) = ∞. The sequence {x^k} produced by iteration (12) can be shown to converge to a solution of (11) [3; Thm. 25.6]. Introducing the reflection operator

R_φ := 2 prox_φ − I,

and setting γ_k = 1, the DR iteration (12) may be written in a more symmetric form as

x^k = prox_{φ_2}(z^k),   z^{k+1} = ½[R_{φ_1} R_{φ_2} + I] z^k,   k ≥ 0.   (13)

Applying DR to the duals (7) or (9) requires first putting them in the form (11), either by introducing extra variables or by going back to the primal, which is unnecessary. 
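A minimal sketch of iteration (13) with γ_k = 1, using two toy indicator functions (a line and a box, standing in for the sets in our problem) whose prox operators are projections:

```python
def prox_box(p):
    """prox of the indicator of the box [0,1]^2 (phi_2): projection."""
    return tuple(min(max(c, 0.0), 1.0) for c in p)

def prox_diag(p):
    """prox of the indicator of the line {x1 = x2} (phi_1): projection."""
    m = 0.5 * (p[0] + p[1])
    return (m, m)

def reflect(prox, p):
    """Reflection operator R = 2*prox - I."""
    q = prox(p)
    return tuple(2.0 * a - b for a, b in zip(q, p))

z = (2.0, -1.0)
for _ in range(10):                    # DR iteration (13) with gamma_k = 1
    x = prox_box(z)                    # x^k = prox_{phi2}(z^k)
    r = reflect(prox_diag, reflect(prox_box, z))
    z = tuple(0.5 * (a + b) for a, b in zip(r, z))
print(x)  # (0.5, 0.5): a point in the intersection of the two sets
```

Here the two sets intersect, so the qualification holds and the sequence {x^k} settles on a point of the intersection within a few iterations.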
This is where the special structure of our dual problem proves crucial, a recognition that is subtle yet remarkably important. Instead of applying DR to (9), consider the closely related problem

min_y δ_1(y) + δ_2^−(y),   (14)

where δ_1 and δ_2^− are indicator functions for B(F_1) and −B(F_2), respectively. Applying DR directly to (14) does not work, because usually ri(dom δ_1) ∩ ri(dom δ_2^−) = ∅. Indeed, applying DR to (14) generates iterates that diverge to infinity [4; Thm. 3.13(ii)]. Fortunately, even though the DR iterates for (14) may diverge, Bauschke et al. [4] show how to extract convergent sequences from these iterates, which actually solve the corresponding best approximation problem; for us this is nothing but the dual (9) that we wanted to solve in the first place. Theorem 3, which is a simplified version of [4; Thm. 3.13], formalizes the above discussion.

Theorem 3. [4] Let A and B be nonempty polyhedral convex sets. Let Π_A (Π_B) denote orthogonal projection onto A (B), and let R_A := 2Π_A − I (similarly R_B) be the corresponding reflection operator. Let {z^k} be the sequence generated by the DR method (13) applied to (14). If A ∩ B ≠ ∅, then {z^k}_{k≥0} converges weakly to a fixed point of the operator T := ½[R_A R_B + I]; otherwise ‖z^k‖₂ → ∞. The sequences {x^k} and {Π_A Π_B z^k} are bounded; the weak cluster points of either of the two sequences

{(Π_A R_B z^k, x^k)}_{k≥0},   {(Π_A x^k, x^k)}_{k≥0}   (15)

are solutions of the best approximation problem min_{a,b} ‖a − b‖ such that a ∈ A and b ∈ B.

The key consequence of Theorem 3 is that we can apply DR with impunity to (14), and extract from its iterates the optimal solution to problem (9) (from which recovering the primal is trivial). 
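The disjoint case of Theorem 3 can be illustrated with toy sets (hypothetical stand-ins for B(F_1) and −B(F_2)): the DR iterates z^k drift off to infinity, yet the extracted pair (Π_A x^k, x^k) settles on the best approximation pair:

```python
def proj(p, lo, hi):
    """Projection onto an axis-aligned box."""
    return tuple(min(max(c, l), h) for c, l, h in zip(p, lo, hi))

A = ((0.0, 0.0), (1.0, 1.0))            # plays A in Theorem 3
B = ((2.0, 0.0), (3.0, 1.0))            # plays B; A and B are disjoint

z = (0.0, 0.0)
for _ in range(30):                     # DR iteration (13) on the indicators
    x = proj(z, *B)                     # x^k = prox of the indicator of B
    rb = tuple(2.0 * a - b for a, b in zip(x, z))
    ra = tuple(2.0 * a - b for a, b in zip(proj(rb, *A), rb))
    z = tuple(0.5 * (a + b) for a, b in zip(ra, z))
print(proj(x, *A), x)   # (1.0, 0.0) (2.0, 0.0): the best approximation pair
```

Tracing the iterates shows z^k = (−k, 0) marching off along the first axis, exactly the divergence predicted for A ∩ B = ∅, while the extracted pair is stationary from the first iteration on.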
The most important feature of solving the dual (9) in this way is that absolutely no step-size tuning is required, making the method very practical and user-friendly.

[Figure 1: Segmentation results for the slowest and fastest projection method (four panels: pBCD, iterations 1 and 7; DR, iterations 1 and 4), with smooth (ν_s) and discrete (ν_d) duality gaps. Reported per-panel gap values: ν_s = 8.05·10^4 / ν_d = 5.9·10^-1; 4.17·10^5 / 6.6·10^3; 3.4·10^6 / 4.6·10^3; 4.4·10^5 / 5.5·10^2. Note how the background noise disappears only for small duality gaps.]

5 Experiments

We empirically compare the proposed projection methods² to the (smoothed) subgradient methods discussed in Section 3.1. For solving the proximal problem, we apply block coordinate descent (BCD) and Douglas-Rachford (DR) to Problem (8) if applicable, and also to (7) (BCD-para, DR-para). In addition, we use acceleration to solve (8) or (9) [5]. The main iteration cost of all methods except for the primal subgradient method is the orthogonal projection onto the polytopes B(F_j). The primal subgradient method uses the greedy algorithm in each iteration, which runs in O(n log n). However, as we will see, its convergence is so slow that it counteracts any benefit that may arise from not using projections. We do not include Frank-Wolfe methods here, since FW is equivalent to subgradient descent on the primal and converges correspondingly slowly.

As benchmark problems, we use (i) graph cut problems for segmentation, or MAP inference in a 4-neighborhood grid-structured MRF, and (ii) concave functions similar to [41], but together with graph cut functions. The functions in (i) decompose as sums over vertical and horizontal paths. All horizontal paths are independent and can be solved together in parallel, and similarly all vertical paths. 
The functions in (ii) are constructed by extracting regions R_j via superpixels and, for each R_j, defining the function F_j(S) = |S||R_j \ S|. We use 200 and 500 regions. The problems have size 640 × 427. Hence, for (i) we have r = 640 + 427 (but solve it as r = 2) and for (ii) r = 640 + 427 + 500 (solved as r = 3). More details and experimental results may be found in [25].

Two functions (r = 2). Figure 2 shows the duality gaps for the discrete and smooth (where applicable) problems for two instances of segmentation problems. The algorithms working with the proximal problems are much faster than the ones directly solving the nonsmooth problem. In particular, DR converges extremely fast, faster even than BCD, which is known to be a state-of-the-art algorithm for this problem [2]. This, in itself, is a new insight for solving TV. If we aim for parallel methods, then again DR outperforms BCD. Figure 3 (right) shows the speedup gained from parallel processing: using 8 cores, we obtain a 5-fold speed-up. We also see that the discrete gap shrinks faster than the smooth gap, i.e., obtaining the optimal discrete solution does not require solving the smooth problem to extremely high accuracy. Figure 1 illustrates example results for different gaps.

More functions (r > 2). Figure 3 shows example results for four problems of sums of concave and cut functions. Here, we can only run DR-para. Overall, BCD, DR-para and the accelerated gradient method perform very well.

In summary, our experiments suggest that projection methods can be extremely useful for solving the combinatorial submodular minimization problem. Of the tested methods, DR, cyclic BCD and accelerated gradient perform very well. For parallelism, applying DR to (9) converges much faster than BCD on the same problem. 
Moreover, in terms of running times, running the DR method with a mixed Matlab/C implementation until convergence on a single core is only 3-8 times slower than the optimized, efficient C code of [7], and only 2-4 times slower on 2 cores. These numbers should be read keeping in mind that, unlike [7], the projection methods naturally lead to parallel implementations and can integrate a large variety of functions.

Figure 2: Comparison of convergence behaviors. Left: discrete duality gaps for various optimization schemes for the nonsmooth problem, from 1 to 1000 iterations. Middle: discrete duality gaps for various optimization schemes for the smooth problem, from 1 to 100 iterations. Right: corresponding continuous duality gaps. From top to bottom: two different images.

Figure 3: Left two plots: convergence behavior for graph cut plus concave functions. Right: speedup due to parallel processing.

6 Conclusion
We have presented a novel approach to submodular function minimization based on the equivalence with a best approximation problem. The use of reflection methods avoids any hyperparameters and reduces the number of iterations significantly, suggesting the suitability of reflection methods for combinatorial problems. Given the natural parallelization abilities of our approach, it would be interesting to perform detailed empirical comparisons with existing parallel implementations of graph cuts (e.g., [39]). Moreover, generalizing the relationships between combinatorial optimization problems and convex problems beyond submodular functions would enable the application of our framework to other common situations such as multiple labels (see, e.g., [29]).

2Code and data corresponding to this paper are available at https://sites.google.com/site/mloptstat/drsubmod

Acknowledgments.
This research was in part funded by the Office of Naval Research under contract/grant number N00014-11-1-0688, by NSF CISE Expeditions award CCF-1139158, by DARPA XData Award FA8750-12-2-0331, and the European Research Council (SIERRA project), as well as gifts from Amazon Web Services, Google, SAP, Blue Goji, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, General Electric, Hortonworks, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware and Yahoo!.

References
[1] F. Bach. Learning with submodular functions: A convex optimization perspective. arXiv preprint arXiv:1111.6453v2, 2013.
[2] A. Barbero and S. Sra. Fast Newton-type methods for total variation regularization. In ICML, 2011.
[3] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.
[4] H. H. Bauschke, P. L. Combettes, and D. R. Luke. Finding best approximation pairs relative to two closed convex sets in Hilbert spaces. J. Approx. Theory, 127(2):178–192, 2004.
[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[6] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[7] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE TPAMI, 23(11):1222–1239, 2001.
[8] B. Savchynskyy, S. Schmidt, J. H. Kappes, and C. Schnörr. Efficient MRF energy minimization via adaptive diminishing smoothing. In UAI, 2012.
[9] A. Chambolle. An algorithm for total variation minimization and applications. J. Math. Imaging and Vision, 20(1):89–97, 2004.
[10] A. Chambolle and J. Darbon. On total variation minimization and surface evolution using parametric maximum flows. Int. Journal of Comp. Vision, 84(3):288–307, 2009.
[11] F. Chudak and K. Nagano. Efficient solutions to relaxations of combinatorial problems with submodular penalties via the Lovász extension and non-smooth convex optimization. In SODA, 2007.
[12] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.
[13] F. R. Deutsch. Best Approximation in Inner Product Spaces. Springer Verlag, first edition, 2001.
[14] J. Douglas and H. H. Rachford. On the numerical solution of the heat conduction problem in 2 and 3 space variables. Trans. Amer. Math. Soc., 82:421–439, 1956.
[15] J. Edmonds. Submodular functions, matroids, and certain polyhedra. In Combinatorial Optimization – Eureka, You Shrink!, pages 11–26. Springer, 2003.
[16] U. Feige, V. S. Mirrokni, and J. Vondrák. Maximizing non-monotone submodular functions. SIAM J. Comp., 40(4):1133–1153, 2011.
[17] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
[18] S. Fujishige. Lexicographically optimal base of a polymatroid with respect to a weight vector. Mathematics of Operations Research, pages 186–196, 1980.
[19] S. Fujishige. Submodular Functions and Optimization. Elsevier, 2005.
[20] S. Fujishige and S. Isotani. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7:3–17, 2011.
[21] H. Groenevelt. Two algorithms for maximizing a separable concave function over a polymatroid feasible region. European Journal of Operational Research, 54(2):227–236, 1991.
[22] D. S. Hochbaum and S.-P. Hong. About strongly polynomial time algorithms for quadratic optimization over submodular constraints. Math. Prog., pages 269–309, 1995.
[23] S. Iwata and N. Zuiki. A network flow approach to cost allocation for rooted trees. Networks, 44:297–301, 2004.
[24] S. Jegelka, H. Lin, and J. Bilmes. On fast approximate submodular minimization. In NIPS, 2011.
[25] S. Jegelka, F. Bach, and S. Sra. Reflection methods for user-friendly submodular optimization (extended version). arXiv, 2013.
[26] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, pages 2297–2334, 2011.
[27] P. Kohli, L. Ladický, and P. Torr. Robust higher order potentials for enforcing label consistency. Int. Journal of Comp. Vision, 82, 2009.
[28] V. Kolmogorov. Minimizing a sum of submodular functions. Disc. Appl. Math., 160(15), 2012.
[29] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond via dual decomposition. IEEE TPAMI, 33(3):531–552, 2011.
[30] A. Krause and C. Guestrin. Submodularity and its applications in optimized information gathering. ACM Transactions on Intelligent Systems and Technology, 2(4), 2011.
[31] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In NAACL/HLT, 2011.
[32] L. Lovász. Submodular functions and convexity. Mathematical Programming: The State of the Art, Bonn, pages 235–257, 1982.
[33] S. T. McCormick. Submodular function minimization. Discrete Optimization, 12:321–391, 2005.
[34] O. Meshi, T. Jaakkola, and A. Globerson. Convergence rate analysis of MAP coordinate minimization algorithms. In NIPS, 2012.
[35] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions–I. Math. Prog., 14(1):265–294, 1978.
[36] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Prog., 103(1):127–152, 2005.
[37] J. B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Math. Prog., 118(2):237–251, 2009.
[38] B. Savchynskyy, S. Schmidt, J. Kappes, and C. Schnörr. A study of Nesterov's scheme for Lagrangian decomposition and MAP labeling. In CVPR, 2011.
[39] A. Shekhovtsov and V. Hlaváč. A distributed mincut/maxflow algorithm combining path augmentation and push-relabel. In Energy Minimization Methods in Computer Vision and Pattern Recognition, 2011.
[40] P. Stobbe. Convex Analysis for Minimizing and Learning Submodular Set Functions. PhD thesis, California Institute of Technology, 2013.
[41] P. Stobbe and A. Krause. Efficient minimization of decomposable submodular functions. In NIPS, 2010.
[42] R. Tarjan, J. Ward, B. Zhang, Y. Zhou, and J. Mao. Balancing applied to maximum network flow problems. In European Symp. on Algorithms (ESA), pages 612–623, 2006.