{"title": "Continuous DR-submodular  Maximization: Structure and Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 486, "page_last": 496, "abstract": "DR-submodular continuous functions are important objectives with wide real-world applications spanning MAP inference in  determinantal point processes (DPPs), and mean-field inference for probabilistic submodular models, amongst others. DR-submodularity captures a subclass of non-convex functions that enables both exact minimization and approximate maximization in polynomial time.  In this work we study the  problem of maximizing  non-monotone DR-submodular continuous functions under general down-closed convex constraints. We start by investigating geometric properties that underlie such objectives, e.g., a strong relation between  (approximately) stationary points and global optimum is proved. These properties are then used to devise two optimization algorithms with provable guarantees. Concretely, we first devise a \"two-phase'' algorithm with 1/4 approximation guarantee. This algorithm allows the use of existing methods for finding (approximately) stationary points as a subroutine, thus, harnessing recent progress in non-convex optimization. Then we present a non-monotone Frank-Wolfe variant with 1/e approximation guarantee and sublinear convergence rate. Finally, we extend our approach to a broader class of generalized DR-submodular continuous functions, which captures a wider spectrum of applications. Our theoretical findings are validated on synthetic and real-world problem instances.", "full_text": "Continuous DR-submodular Maximization:\n\nStructure and Algorithms\n\nAn Bian\nETH Zurich\n\nybian@inf.ethz.ch\n\nK\ufb01r Y. Levy\nETH Zurich\n\nyehuda.levy@inf.ethz.ch\n\nAndreas Krause\n\nETH Zurich\n\nkrausea@ethz.ch\n\nJoachim M. Buhmann\n\nETH Zurich\n\njbuhmann@inf.ethz.ch\n\nAbstract\n\nDR-submodular continuous functions are important objectives with wide real-world\napplications spanning MAP inference in determinantal point processes (DPPs),\nand mean-\ufb01eld inference for probabilistic submodular models, amongst others.\nDR-submodularity captures a subclass of non-convex functions that enables both\nexact minimization and approximate maximization in polynomial time.\nIn this work we study the problem of maximizing non-monotone continuous DR-\nsubmodular functions under general down-closed convex constraints. We start\nby investigating geometric properties that underlie such objectives, e.g., a strong\nrelation between (approximately) stationary points and global optimum is proved.\nThese properties are then used to devise two optimization algorithms with provable\nguarantees. Concretely, we \ufb01rst devise a \u201ctwo-phase\u201d algorithm with 1/4 approxi-\nmation guarantee. This algorithm allows the use of existing methods for \ufb01nding\n(approximately) stationary points as a subroutine, thus, harnessing recent progress\nin non-convex optimization. Then we present a non-monotone FRANK-WOLFE\nvariant with 1/e approximation guarantee and sublinear convergence rate. Finally,\nwe extend our approach to a broader class of generalized DR-submodular continu-\nous functions, which captures a wider spectrum of applications. Our theoretical\n\ufb01ndings are validated on synthetic and real-world problem instances.\n\n1\n\nIntroduction\n\nSubmodularity is classically most well known for set function optimization, where it enables ef\ufb01cient\nminimization [23] and approximate maximization [31; 25] in polynomial time. Submodularity has\nrecently been studied on the integer lattice [34; 33] and on continuous domains [3; 4; 36; 21], with\nsigni\ufb01cant theoretical results and practical applications. For set functions, it is well known that\nsubmodularity is equivalent to the diminishing returns (DR) property. However, this does not hold\nfor integer-lattice functions or continuous functions, where the DR property de\ufb01nes a subclass of\nsubmodular functions, called DR-submodular functions.\nIn continuous domains, applying convex optimization techniques enables ef\ufb01cient minimization of\nsubmodular continuous functions [3; 36] (despite the non-convex nature of such objectives). In [4]\nit is further shown that continuous submodularity enables constant-factor approximation schemes\nfor constrained monotone DR-submodular maximization and \u201cbox\u201d constrained non-monotone\nsubmodular maximization problems.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fMany real-world non-convex problems, such as maximizing the softmax extension of DPPs, require\nmaximizing a non-monotone DR-submodular function over a general down-closed convex constraint.\nYet, current theory [3; 4; 36] does not apply to this general problem setting, which motivates us to\ndevelop guaranteed and ef\ufb01cient algorithms for such problems.\nExploring the structure that underlies DR-submodularity is crucial to deriving guaranteed algorithms.\nCombined with a notion of non-stationarity for constrained optimization problems and a new notion\nof \u201cstrong DR-submodularity\u201d, we \ufb01nd a rich structure in the problem of continuous DR-submodular\nmaximization. This in turn gives rise to two approximation algorithms with provable guarantees.\nSpeci\ufb01cally, we make the following contributions:\n\n- We bound the difference between objective values of stationary points and the global optimum.\nOur analysis shows that the bound is even tighter if the objective is strongly DR-submodular\n(see De\ufb01nition 3).\n- Based on the geometric properties, we present two algorithms: (i) A two-phase FRANK-\nWOLFE-style algorithm with 1/4 approximation guarantee converges with a 1/pk rate; (ii) a\nnon-monotone FRANK-WOLFE variant exhibits a 1/e approximation guarantee and converges\nsublinearly. Even though the worst-case guarantee of the \ufb01rst one is worse than the second, it\nyields several practical advantages, which we discuss in Section 4.2.\n\n- We investigate a generalized class of submodular functions on \u201cconic\u201d lattices. This allows\nus to model a larger class of non-trivial applications. These include logistic regression with\na non-convex separable regularizer, non-negative PCA, etc. To optimize them, we provide a\nreduction that enables to invoke algorithms for continuous submodular optimization problems.\n- We experimentally demonstrate the applicability of our methods on both synthetic and real-\n\nworld problem instances.\n\n1.1 Problem Statement\nNotation. We use boldface letters, e.g., x to represent a vector, boldface capital letters, e.g., A\nto denote a matrix. xi is the ith entry of x, Aij is the (ij)th entry of A. We use ei to denote the\nstandard ith basis vector. f (\u00b7) is used to denote a continuous function, and F (\u00b7) to represent a set\nfunction. [n] := {1, ..., n} for an integer n  1. k\u00b7k means the Euclidean norm by default. Given\ntwo vectors x, y, x \uf8ff y means xi \uf8ff yi,8i. x _ y and x ^ y denote coordinate-wise maximum and\ncoordinate-wise minimum, respectively.\nThe general setup of constrained non-monotone DR-submodular (see De\ufb01nition 1 below) maximiza-\ntion is,\n\n(P)\nwhere f : X! R is continuous DR-submodular, X =Qn\ni=1 Xi, each Xi is an interval [3; 4]. Wlog1,\nwe assume that the lower bound u of X is 0, i.e., X = [0, \u00afu]. The set P\u2713 [0, \u00afu] is assumed to be a\ndown-closed convex set, where down-closedness means: x 2P and 0 \uf8ff y \uf8ff x implies that y 2P .\nThe diameter of P is D := maxx,y2P kx  yk, and it holds that D \uf8ff k \u00afuk. We use x\u21e4 to denote the\nglobal maximum of (P). One can assume f is non-negative over X , since otherwise one just needs to\n\ufb01nd a lower bound for the minimum function value of f over [0, \u00afu] (and box-constrained submodular\nminimization can be solved to arbitrary precision in polynomial time [3]). Over continuous domains,\na DR-submodular function [4] is a submodular function with the diminishing returns (DR) property,\nDe\ufb01nition 1 (DR-submodular & DR property). A function f : X 7! R is DR-submodular (has the\nDR property) if 8a \uf8ff b 2X , 8i 2 [n],8k 2 R+ s.t. (kei + a) and (kei + b) are still in X , it holds,\n(1)\nIf f is differentiable, one can show that De\ufb01nition 1 is equivalent to rf being an antitone mapping\nfrom Rn to Rn. Furthermore, if f is twice-differentiable, the DR property is equivalent to all of\nthe entries of its Hessian being non-positive, i.e., r2\nijf (x) \uf8ff 0,8x 2X , i, j 2 [n]. A function\nf : X 7! R is DR-supermodular iff f is DR-submodular. We also assume that f has Lipschitz\ngradients,\n1Since otherwise one can work on a new function g(x) := f (x + u) that has 0 as the lower bound of its\n\nf (kei + a)  f (a)  f (kei + b)  f (b).\n\nf (x),\n\nmax\nx2P\n\ndomain, and all properties of the function are still preserved.\n\n2\n\n\fDe\ufb01nition 2. A function f has L-Lipschitz gradients if for all x, y 2X it holds that,\n\nkrf (x)  rf (y)k \uf8ff Lkx  yk.\n\n(2)\n\nA brief summary of related work appears in Section 6.\n\n2 Motivating Real-world Examples\n\nMany continuous objectives in practice turn out to be DR-submodular. Here we list several of them.\nMore can be found in Appendix B.\nSoftmax extension. Determinantal point processes (DPPs) are probabilistic models of repulsion,\nthat have been used to model diversity in machine learning [26]. The constrained MAP (maximum a\nposteriori) inference problem of a DPP is an NP-hard combinatorial problem in general. Currently,\nthe methods with the best approximation guarantees are based on either maximizing the multilinear\nextension [6] or the softmax extension [20], both of which are DR-submodular functions (details in\nAppendix F.1). The multilinear extension is given as an expectation over the original set function\nvalues, thus evaluating the objective of this extension requires expensive sampling. In constast, the\nsoftmax extension has a closed form expression, which is much more appealing from a computational\nperspective. Let L be the positive semide\ufb01nite kernel matrix of a DPP, its softmax extension is:\n\nf (x) = log det (diag(x)(L  I) + I) , x 2 [0, 1]n,\n\n(3)\nwhere I is the identity matrix, diag(x) is the diagonal matrix with diagonal elements set as x.\nThe problem of MAP inference in DPPs corresponds to the problem maxx2P f (x), where P is a\ndown-closed convex constraint, e.g., a matroid polytope or a matching polytope.\nMean-\ufb01eld inference for log-submodular models. Log-submodular models [9] are a class of\nprobabilistic models over subsets of a ground set V = [n], where the log-densities are submodular\nZ exp(F (S)). The partition function Z = PS\u2713V exp(F (S)) is\nset functions F (S): p(S) = 1\ntypically hard to evaluate. One can use mean-\ufb01eld inference to approximate p(S) by some factorized\ndistribution qx(S) :=Qi2S xiQj /2S(1  xj), x 2 [0, 1]n, by minimizing the distance measured\nw.r.t. the Kullback-Leibler divergence between qx and p, i.e.,PS\u2713V qx(S) log qx(S)\nKL(x) = XS\u2713VYi2S\n\n[xi log xi + (1  xi) log(1  xi)] + log Z.\n\nxiYj /2S\n\n(1  xj)F (S) +Xn\n\np(S) . It is,\n\nKL(x) is DR-supermodular w.r.t. x (details in Appendix F.1). Minimizing the Kullback-Leibler\ndivergence KL(x) amounts to maximizing a DR-submodular function.\n\ni=1\n\n2.1 Motivating Example Captured by Generalized Submodularity on Conic Lattices\nSubmodular continuous functions can already model many scenarios. Yet, there are several inter-\nesting cases which are in general not (DR-)Submodular, but can still be captured by a generalized\nnotion. This generalization enables to develop polynomial algorithms with guarantees by using ideas\nfrom continuous submodular optimization. We present one representative objective here (more in\nAppendix B). In Appendix A we show the technical details on how they are covered by a class of\nsubmodular continuous functions over conic lattices.\nConsider the logistic regression model with a non-convex separable regularizer. This \ufb02exibility may\nresult in better statistical performance (e.g., in recovering discontinuities, [2]) compared to classical\nmodels with convex regularizers. Let z1, ..., zm in Rn be m training points with corresponding\nbinary labels y 2 {\u00b11}m. Assume that the following mild assumption is satis\ufb01ed: For any \ufb01xed\ndimension i, all the data points have the same sign, i.e., sign(zj\ni ) is the same for all j 2 [m] (which\ncan be achieved by easily scaling if not). The task is to solve the following non-convex optimization\nproblem,\n\nmin\nx2Rn\n\nf (x) := m1Xm\n\nj=1\n\nfj(x) + r(x),\n\n(4)\n\nwhere fj(x) = log(1 + exp(yjx>zj)) is the logistic loss; > 0 is the regularization parameter,\nand r(x) is some non-convex separable regularizer. Such separable regularizers are popular in\n\n3\n\n\fx2\ni\n\ni=1\n\n1+x2\ni\n\ni=1 min{x2\n\n, and r(x) =Pn\ni ), i 2 [n] and l(x) := 1\n\nstatistics, and two notable choices are r(x) =Pn\ni , 1} (see\n[2]). Let us de\ufb01ne a vector \u21b5 2 {\u00b11}n as \u21b5i = sign(zj\nj=1 fj(x). One\ncan show that l(x) is not DR-submodular or DR-supermodular. Yet, in Appendix A we will show\nthat l(x) is K\u21b5-DR-supermodular, where the latter generalizes DR-supermodularity. Usually, one\ncan assume the optimal solution x\u21e4 lies in some box [u, \u00afu]. Then the problem is an instance of\nconstrained non-monotone K\u21b5-DR-submodular maximization.\n3 Underlying Properties of Constrained DR-submodular Maximization\nIn this section we present several properties arising in DR-submodular function maximization. First\nwe show properties related to concavity of the objective along certain directions, then we establish the\nrelation between locally stationary points and the global optimum (thus called \u201clocal-global relation\u201d).\nThese properties will be used to derive guarantees for the algorithms in Section 4. All omitted proofs\nare in Appendix D.\n\nmPm\n\n3.1 Properties Along Non-negative/Non-positive Directions\nA DR-submodular function f is concave along any non-negative/non-positive direction [4]. Notice\n+: for instance,\nthat DR-submodularity is a stronger condition than concavity along directions v 2\u00b1 Rn\na concave function is concave along any direction, but it may not be a DR-submodular function.\nFor a DR-submodular function with L-Lipschitz gradients, one can get the following quadratic lower\nbound using standard techniques by combing the concavity and Lipschitz gradients in (2).\nQuadratic lower bound. If f is DR-submodular with a L-Lipschitz gradient, then for all x 2X\nand v 2\u00b1 Rn\n\n+, it holds,\n\nf (x + v)  f (x) + hrf (x), vi \n\nL\n2 kvk2.\n\n(5)\n\n(6)\n\n(7)\n\nIt will be used in Section 4.2 for analyzing the non-monotone FRANK-WOLFE variant (Algorithm 2).\nStrong DR-submodularity. DR-submodular objectives may be strongly concave along directions\n+, e.g., for DR-submodular quadratic functions. We will show that such additional structure\nv 2\u00b1 Rn\nmay be exploited to obtain stronger guarantees for the local-global relation.\nDe\ufb01nition 3 (Strongly DR-submodular). A function f is \u00b5-strongly DR-submodular (\u00b5  0) if for\nall x 2X and v 2\u00b1 Rn\n\n+, it holds that,\n\nf (x + v) \uf8ff f (x) + hrf (x), vi \n\n\u00b5\n2kvk2.\n\n3.2 Relation Between Approximately Stationary Points and Global Optimum\nFirst of all, we present the following Lemma, which will motivate us to consider a non-stationarity\nmeasure for general constrained optimization problems.\nLemma 1. If f is \u00b5-strongly DR-submodular, then for any two points x, y in X , it holds:\n\n(y  x)>rf (x)  f (x _ y) + f (x ^ y)  2f (x) +\n\n\u00b5\n2kx  yk2.\n\nLemma 1 implies that if x is stationary (i.e., rf (x) = 0), then 2f (x)  f (x _ y) + f (x ^ y) +\n\u00b5\n2kx  yk2, which gives an implicit relation between x and y. While in practice \ufb01nding an exact\nstationary point is not easy, usually non-convex solvers will arrive at an approximately stationary\npoint, thus requiring a proper measure of non-stationarity for the constrained optimization problem.\nNon-stationarity measure. Looking at the LHS of (7), it naturally suggests to use maxy2P (y \nx)>rf (x) as the non-stationarity measure, which happens to coincide with the measure proposed\nby recent work of [27], and it can be calculated for free for Frank-Wolfe-style algorithms (e.g.,\nAlgorithm 3). In order to adapt it to the local-global relation, we give a slightly more general\nde\ufb01nition here: For any constraint set Q\u2713X , the non-stationarity of a point x 2Q is,\n\n(non-stationarity).\n\n(8)\n\ngQ(x) := max\n\nv2Qhv  x,rf (x)i\n\n4\n\n\f\u00b5\n\n1\n4\n\n[f (x\u21e4)  gP (x)  gQ(z)] +\n\nIt always holds that gQ(x)  0, and x is a stationary point in Q iff gQ(x) = 0, so (8) is a natural\ngeneralization of the non-stationarity measure krf (x)k for unconstrained optimization. As the next\nstatement shows, gQ(x) plays an important role in characterizing the local-global relation.\nProposition 1 (Local-Global Relation). Let x be a point in P with non-stationarity gP (x), and\nQ := {y 2P | y \uf8ff \u00afu  x}. Let z be a point in Q with non-stationarity gQ(z). It holds that,\n8kx  x\u21e4k2 + kz  z\u21e4k2 ,\n\nmax{f (x), f (z)}\nwhere z\u21e4 := x _ x\u21e4  x.\nProof sketch of Proposition 1: The proof uses Lemma 1, the non-stationarity in (8) and a key\nobservation in the following Claim. The detailed proof is in Appendix D.2.\nClaim 1. It holds that f (x _ x\u21e4) + f (x ^ x\u21e4) + f (z _ z\u21e4) + f (z ^ z\u21e4)  f (x\u21e4).\nNote that [7; 20] propose a similar relation for the special cases of multilinear/softmax extensions by\nmainly proving the same conclusion as in Claim 1. Their relation does not incorporate the properties\nof non-stationarity or strong DR-submodularity. They both use the proof idea of constructing a\ncomplicated auxiliary set function tailored to speci\ufb01c DR-submodular functions. We present a\ndifferent proof method by directly utilizing the DR property on carefully constructed auxiliary points\n(e.g., (x + z) _ x\u21e4 in the proof of Claim 1).\n4 Algorithms for Constrained DR-submodular Maximization\n\n(9)\n\nBased on the properties, we present two algorithms for solving (P). The \ufb01rst is based on the local-\nglobal relation, and the second is a FRANK-WOLFE variant adapted for the non-monotone setting.\nAll the omitted proofs are deferred to Appendix E.\n\n4.1 An Algorithm Based on the Local-Global Relation\n\nAlgorithm 1: TWO-PHASE FRANK-WOLFE for non-monotone DR-submodular maximization\n\nInput: maxx2P f (x), stopping tolerance \u270f1,\u270f 2, #iterations K1, K2\n1 x NON-CONVEX FRANK-WOLFE(f,P, K1,\u270f 1, x(0)) ;\n2 Q P\\{ y 2 Rn\n3 z NON-CONVEX FRANK-WOLFE(f,Q, K2,\u270f 2, z(0)) ;\nOutput: arg max{f (x), f (z)} ;\n\n+ | y \uf8ff \u00afu  x};\n\n// x(0) 2P\n// z(0) 2Q\n\nWe summarize the TWO-PHASE algorithm in Algorithm 1. It is generalized from the \u201ctwo-phase\u201d\nmethod in [7; 20]. It invokes some non-convex solver (we use the NON-CONVEX FRANK-WOLFE\nby [27]; pseudocode is included in Algorithm 3 of Appendix C) to \ufb01nd approximately stationary\npoints in P and Q, respectively, then returns the solution with the larger function value. Though\nwe use NON-CONVEX FRANK-WOLFE as the subroutine here, it is worth noting that any algorithm\nthat is guaranteed to \ufb01nd an approximately stationary point can be plugged into Algorithm 1 as\nthe subroutine. We give an improved approximation bound by considering more properties of\nDR-submodular functions. Borrowing the results from [27] for the NON-CONVEX FRANK-WOLFE\nsubroutine, we get the following,\nTheorem 1. The output of Algorithm 1 satis\ufb01es,\n\n\u00b5\n\n1\n\n+\n\nmax{f (x), f (z)}\n\n8kx  x\u21e4k2 + kz  z\u21e4k2\n4\uf8fff (x\u21e4)  min\u21e2 max{2h1, Cf (P)}\npK1 + 1\n\n,\u270f 2 ,\nwhere h1 := maxx2P f (x) f (x(0)), h2 := maxz2Q f (z) f (z(0)) are the initial suboptimalities,\n2 (f (y)  f (x)  (y  x)>rf (x)) is the curvature of f\nCf (P) := supx,v2P,2[0,1],y=x+(vx)\nw.r.t. P, and z\u21e4 = x _ x\u21e4  x.\n\n,\u270f 1  min\u21e2 max{2h2, Cf (Q)}\n\npK2 + 1\n\n(10)\n\n2\n\n5\n\n\fTheorem 1 indicates that Algorithm 1 has a 1/4 approximation guarantee and 1/pk rate. However,\nit has good empirical performance as demonstrated by the experiments in Section 5. Informally, this\n8kx  x\u21e4k2 + kz  z\u21e4k2 in (10): if x is away from x\u21e4,\ncan be partially explained by the term \u00b5\nthis term will augment the bound; if x is close to x\u21e4, by the smoothness of f, it should be close to\noptimal.\n\n4.2 The Non-monotone FRANK-WOLFE Variant\n\nAlgorithm 2: Non-monotone FRANK-WOLFE variant for DR-submodular maximization\n\nInput: maxx2P f (x), prespeci\ufb01ed step size  2 (0, 1]\n1 x(0) 0, t(0) 0, k 0;\n2 while t(k) < 1 do\n3\n4\n5\n\nv(k) arg maxv2P,v\uf8ff \u00afux(k)hv,rf (x(k))i;\nuse uniform step size k = ; set k min{k, 1  t(k)};\nx(k+1) x(k) + kv(k), t(k+1) t(k) + k, k k + 1;\n\nOutput: x(K) ;\n\n// k : iteration index, t(k) : cumulative step size\n\n// shrunken LMO\n\n// assuming there are K iterations in total\n\nAlgorithm 2 summarizes the non-monotone FRANK-WOLFE variant, which is inspired by the uni\ufb01ed\ncontinuous greedy algorithm in [13] for maximizing the multilinear extension of a submodular set\nfunction. It initializes the solution x(0) to be 0, and maintains t(k) as the cumulative step size. At\niteration k, it maximizes the linearization of f over a \u201cshrunken\u201d constraint set: {v|v 2P , v \uf8ff\n\u00afu  x(k)}, which is different from the classical LMO of Frank-Wolfe-style algorithms (hence we\nrefer to it as the \u201cshrunken LMO\u201d). Then it employs an update step in the direction v(k) chosen by\nthe LMO with a uniform step size k = . The cumulative step size t(k) is used to ensure that the\noverall step sizes sum to one, thus the output solution x(K) is a convex combination of the LMO\noutputs, hence also lies in P.\nThe shrunken LMO (Step 3) is the key difference compared to the monotone FRANK-WOLFE variant\nin [4]. The extra constraint v \uf8ff \u00afu  x(k) is added to prevent too aggressive growth of the solution,\nsince in the non-monotone setting such aggressive growth may hurt the overall performance. The\nnext theorem states the guarantees of Algorithm 2.\nTheorem 2. Consider Algorithm 2 with uniform step size . For k = 1, ..., K it holds that,\n\nf (x(k))  t(k)et(k)\n\nf (x\u21e4) \n\nLD2\n\n2\n\nk 2  O(2)f (x\u21e4).\n\n(11)\n\n2K  O 1\n\nK2 f (x\u21e4).\n\nBy observing that t(K) = 1 and applying Theorem 2, we get the following Corollary:\nCorollary 1. The output of Algorithm 2 satis\ufb01es f (x(K))  e1f (x\u21e4)  LD2\nCorollary 1 shows that Algorithm 2 enjoys a sublinear convergence rate towards some point x(K)\ninside P, with a 1/e approximation guarantee.\nProof sketch of Theorem 2: The proof is by induction. To prepare the building blocks, we \ufb01rst of\nall show that the growth of x(k) is indeed bounded,\nLemma 2. Assume x(0) = 0. For k = 0, ..., K  1, it holds x(k)\nThen the following Lemma provides a lower bound, which gets the global optimum involved,\nLemma 3 (Generalized from Lemma 7 in [8]). Given \u2713 2 (0, \u00afu], let 0 = mini2[n]\nx 2 [0, \u2713], it holds f (x _ x\u21e4)  (1  1\nThen the key ingredient for induction is the relation between f (x(k+1)) and f (x(k)) indicated by:\nClaim 2. For k = 0, ..., K1 it holds f (x(k+1))  (1)f (x(k))+(1)t(k)/f (x\u21e4) LD2\nwhich is derived by a combination of the quadratic lower bound in (5), Lemma 2 and Lemma 3.\n\ni \uf8ff \u00afui[1  (1  )t(k)/],8i 2 [n].\n\n. Then for all\n\n0 )f (x\u21e4).\n\n\u00afui\n\u2713i\n\n2 2,\n\n6\n\n\fRemarks on the two algorithms. Notice that though the TWO-PHASE algorithm has a worse\nguarantee than the non-monotone FRANK-WOLFE variant, it is still of interest: i) It allows \ufb02exibility\nin using a wide range of existing solvers for \ufb01nding an (approximately) stationary point. ii) The\nguarantees that we present rely on a worst-case analysis. The empirical performance of the TWO-\nPHASE algorithm is often comparable or better than that of the FRANK-WOLFE variant. This suggests\nto explore more properties in concrete problems that may favor the TWO-PHASE algorithm, which we\nleave for future work.\n\n5 Experimental Results\n\nWe test the performance of the analyzed algorithms, while considering the following baselines: 1)\nQUADPROGIP [39], which is a global solver for non-convex quadratic programming; 2) Projected\ngradient ascent (PROJGRAD) with diminishing step sizes ( 1\nk+1, k starts from 0). We run all the\nalgorithms for 100 iterations. For the subroutine (Algorithm 3) of TWO-PHASE FRANK-WOLFE,\nwe set \u270f1 = \u270f2 = 106, K1 = K2 = 100. All the synthetic results are the average of 20 repeated\nexperiments. All experiments were implemented using MATLAB. Source code can be found at:\nhttps://github.com/bianan/non-monotone-dr-submodular.\n\n5.1 DR-submodular Quadratic Programming\n\nAs a state-of-the-art global solver, QUADPROGIP2 [39] can \ufb01nd the global optimum (possibly in\nexponential time), which were used to calculate the approximation ratios. Our problem instances are\nsynthetic DR-submodular quadratic objectives with down-closed polytope constraints, i.e., f (x) =\n+ | Ax \uf8ff b, x \uf8ff \u00afu, A 2 Rm\u21e5n\n2 x>Hx + h>x + c and P = {x 2 Rn\n+}. Both objective\n1\nand constraints were randomly generated, in the following two manners:\n1) Uniform distribution. H 2 Rn\u21e5n is a symmetric matrix with uniformly distributed entries in\n[1, 0]; A 2 Rm\u21e5n has uniformly distributed entries in [\u232b, \u232b + 1], where \u232b = 0.01 is a small positive\nconstant in order to make entries of A strictly positive.\n\n++ , b 2 Rm\n\n1\n\n0.95\n\n0.9\n\no\ni\nt\na\nr\n \n.\nx\no\nr\np\np\nA\n\n1\n\no\ni\nt\na\nr\n \n.\nx\no\nr\np\np\nA\n\n0.95\n\n0.9\n\n0.85\n\n8\n\n12\n\n10\nDimensionality\n\n14\n\n16\n\n(a) m = b0.5nc\n\n14\n\n16\n\n8\n\n12\n\n10\nDimensionality\n(b) m = n\n\n1\n\no\ni\nt\na\nr\n \n.\nx\no\nr\np\np\nA\n\n0.95\n\n0.9\n\n0.85\n\n8\n\n12\n\n10\nDimensionality\n\n14\n\n16\n\n(c) m = b1.5nc\n\nFigure 1: Results on DR-submodular quadratic instances with uniform distribution.\n\n2) Exponential distribution. The entries of H and A were sampled from exponential distributions\nExp() (For a random variable y  0, its probability density function is ey, and for y < 0, its\ndensity is 0). Speci\ufb01cally, each entry of H was sampled from Exp(1), then the matrix H was\nmade to be symmetric. Each entry of A was sampled from Exp(0.25) + \u232b, where \u232b = 0.01 is a small\npositive constant.\nIn both the above two cases, we set b = 1m, and \u00afu to be the tightest upper bound of P by\n,8j 2 [n]. In order to make f non-monotone, we set h = 0.2 \u21e4 H> \u00afu. To\n\u00afuj = mini2[m]\nmake sure that f is non-negative, we \ufb01rst of all solve the problem minx2P\n2 x>Hx + h>x using\nQUADPROGIP, let the solution to be \u02c6x, then set c = f ( \u02c6x) + 0.1 \u21e4| f ( \u02c6x)|.\nThe approximation ratios w.r.t. dimensionalities (n) are plotted in Figures 1 and 2, for the two\nmanners of data generation. We set the number of constraints to be m = b0.5nc, m = n and\nm = b1.5nc in Figures 1a to 1c (and Figures 2a to 2c), respectively.\n\nbi\nAij\n\n1\n\n2We used the open source code provided by [39], and the IBM CPLEX optimization studio https://www.\n\nibm.com/jm-en/marketplace/ibm-ilog-cplex as the subroutine.\n\n7\n\n\f1\n0.95\n0.9\n0.85\n0.8\n0.75\n\no\n\ni\nt\n\na\nr\n \n.\nx\no\nr\np\np\nA\n\n1\n\n0.95\n\n0.9\n\n0.85\n\n0.8\n\n0.75\n\no\n\ni\nt\n\na\nr\n \n.\nx\no\nr\np\np\nA\n\n8\n\n12\n\n10\nDimensionality\n\n14\n\n16\n\n(a) m = b0.5nc\n\n1\n0.95\n0.9\n0.85\n0.8\n0.75\n\no\n\ni\nt\n\na\nr\n \n.\nx\no\nr\np\np\nA\n\n8\n\n12\n\n10\nDimensionality\n\n14\n\n16\n\n(c) m = b1.5nc\n\n8\n\n12\n\n10\nDimensionality\n(b) m = n\n\n14\n\n16\n\nFigure 2: Results on quadratic instances with exponential distribution.\n\nOne can see that TWO-PHASE FRANK-WOLFE usually performs the best, PROJGRAD follows, and\nnon-monotone FRANK-WOLFE variant is the last. The good performance of TWO-PHASE FRANK-\nWOLFE can be partially explained by the strong DR-submodularity of quadratic functions according\nto Theorem 1. Performance of the two analyzed algorithms is consistent with the theoretical bounds:\nthe approximation ratios of FRANK-WOLFE variant are always much higher than 1/e.\n\n5.2 Maximizing Softmax Extensions\nWith some derivation, one can see the derivative of the softmax extension in (3) is: rif (x) =\ntr((diag(x)(L  I) + I)1(L  I)i),8i 2 [n], where (L  I)i denotes the matrix obtained by\nzeroing all entries except the ith row of (L  I). Let C := (diag(x)(L  I) + I)1, D := (L  I),\none can see that rif (x) = D>i\u00b7 C\u00b7i, which gives an ef\ufb01cient way to calculate the gradient rf (x).\n\n0.2\n0.15\n0.1\n0.05\n0\n-0.05\n\nl\n\ne\nu\na\nv\n \nn\no\ni\nt\nc\nn\nu\nF\n\n8\n\n12\n\n10\nDimensionality\n\n14\n\n16\n\n0.2\n\n0.15\n\n0.1\n\n0.05\n\n0\n\n-0.05\n\nl\n\ne\nu\na\nv\n \nn\no\ni\nt\nc\nn\nu\nF\n\n(a) m = b0.5nc\n\n14\n\n16\n\n8\n\n12\n\n10\nDimensionality\n(b) m = n\n\n0.2\n0.15\n0.1\n0.05\n0\n-0.05\n\nl\n\ne\nu\na\nv\n \nn\no\ni\nt\nc\nn\nu\nF\n\n8\n\n12\n\n10\nDimensionality\n\n14\n\n16\n\n(c) m = b1.5nc\n\nFigure 3: Results on softmax instances with polytope constraints generated from uniform distribution.\n\nResults on synthetic data. We generate the softmax objectives (see (3)) in the following way: \ufb01rst\ngenerate the n eigenvalues d 2 Rn\n+, each randomly distributed in [0, 1.5], and set D = diag(d).\nAfter generating a random unitary matrix U, we set L = UDU>. One can verify that L is positive\nsemide\ufb01nite and has eigenvalues as the entries of d.\nWe generate the down-closed polytope constraints in the same form and same way as that for DR-\nsubmodular quadratic functions, except for setting b = 2 \u21e4 1m. Function values returned by different\nsolvers w.r.t. n are shown in Figure 3, for which the random polytope constraints were generated\nwith uniform distribution (results for which the random polytope constraints were generated with\nexponential distribution are deferred to Appendix G). The number of constraints was set to be\nm = b0.5nc, m = n and m = b1.5nc in Figures 3a to 3c, respectively. One can observe that\nTWO-PHASE FRANK-WOLFE still has the best performance, the non-monotone FRANK-WOLFE\nvariant follows, and PROJGRAD has the worst performance.\nReal-world results on matched summarization. The task of \u201cmatched summarization\u201d is to select\na set of document pairs out of a corpus of documents, such that the two documents within a pair\nare similar, and the overall set of pairs is as diverse as possible. The motivation for this task is very\npractical: it could be, for example, to compare the opinions of various politicians on a range of\nrepresentative topics.\nIn our experiments, we used a similar setting to the one in [20]. We experimented on the 2012\nUS Republican debates data, which consists of 8 candidates: Bachman, Gingrich, Huntsman, Paul,\nPerry, Romney and Santorum. Each task involves one pair of candidates, so in total there are\n\n8\n\n\f10\n\n28 = 7 \u21e4 8/2 tasks. Figure 4a plots the averaged function values returned by the three solvers over 28\ntasks, w.r.t. different values of a hyperparameter re\ufb02ecting the matching quality (details see [20]).\nFigure 4b traces the objectives\nw.r.t.\niterations for a spe-\nci\ufb01c candidate pair (Bachman,\nRomney).\nFor TWO-PHASE\nFRANK-WOLFE, the objectives\nof the selected phase were plot-\nted. One can see that TWO-\nPHASE FRANK-WOLFE also\nachieves the best performance,\nwhile the performance of non-\nmonotone FRANK-WOLFE vari-\nant and PROJGRAD is compara-\nble.\n\n(a) Average on 28 tasks\n(b) Objectives w.r.t. iterations\nFigure 4: Results on 2012 US Republican debates data.\n\nMatch quality controller\n\n40\nIteration\n\nl\n\ne\nu\na\nv\n \nn\no\n\nl\n\ne\nu\na\nv\n \nn\no\n\ni\nt\nc\nn\nu\nF\n\ni\nt\nc\nn\nu\nF\n\n100\n\n8\n\n6\n\n4\n\n2\n\n0\n\n0\n\n0.4\n\n0.2\n\n0.6\n\n2.5\n\n1.5\n\n2\n\n1\n\n0.5\n\n60\n\n80\n\n0.8\n\n1\n\n20\n\n6 Related Work\n\nSubmodular optimization and, more broadly, non-convex optimization are extensively studied in the\nliterature, which renders it very dif\ufb01cult comprehensively surveying all previous work. Here we only\nbrie\ufb02y summarize some of the most related papers.\nSubmodular optimization over integer-lattice and continuous domains. Many results from sub-\nmodular set function optimization have been generalized to the integer-lattice case [34; 33; 12; 24].\nOf particular interest is the reduction [12] from an integer-lattice DR-submodular maximization prob-\nlem to a submodular set function maximization problem. Submodular optimization over continuous\ndomains has attracted considerable attention recently [3; 4; 36]. Two classes of functions that are\ncovered by continuous submodularity are the Lovasz extensions [28] and multilinear extensions\n[6] of submodular set functions. Particularly, multilinear extensions of submodular set functions\nare also continuous DR-submodular [3], but with the special property that they are coordinate-wise\nlinear. Combined with the rounding technique of contention resolution [7], maximizing multilinear\nextensions [38; 19; 13; 8; 11] has become the state-of-the-art method for submodular set function\nmaximization. Some of the techniques in maximizing multilinear extensions [13; 7; 8] have inspired\nthis work. However, we are the \ufb01rst to explore the rich properties and devise algorithms for the\ngeneral constrained DR-submodular maximization problem over continuous domains.\nNon-convex optimization. Non-convex optimization receives a surge of attention in the past years.\nOne active research topic is to reach a stationary point for unconstrained optimization [35; 32; 1] or\nconstrained optimization [18; 27]. However, without proper assumptions, a stationary point may not\nlead to any global approximation guarantee. The local-global relation (in Proposition 1) provides a\nstrong relation between (approximately) stationary points and global optimum, thus making it \ufb02exible\nto incorporate progress in this area.\n\n7 Conclusion\n\nWe have studied the problem of constrained non-monotone DR-submodular continuous maximization.\nWe explored the structural properties of such problems, and established a local-global relation. Based\non these properties, we presented a TWO-PHASE algorithm with a 1/4 approximation guarantee, and a\nnon-monotone FRANK-WOLFE variant with a 1/e approximation guarantee. We further generalized\nsubmodular continuous function over conic lattices, which enabled us to model a larger class of\napplications. Lastly, our theoretical \ufb01ndings were veri\ufb01ed by synthetic and real-world experiments.\n\nAcknowledgement. This research was partially supported by ERC StG 307036, by the Max Planck\nETH Center for Learning Systems, and by the ETH Z\u00fcrich Postdoctoral Fellowship program.\n\n9\n\n\fReferences\n[1] Allen-Zhu, Zeyuan and Hazan, Elad. Variance reduction for faster non-convex optimization. In\n\nInternational Conference on Machine Learning (ICML), pp. 699\u2013707, 2016.\n\n[2] Antoniadis, Anestis, Gijbels, Ir\u00e8ne, and Nikolova, Mila. Penalized likelihood regression for\ngeneralized linear models with non-quadratic penalties. Annals of the Institute of Statistical\nMathematics, 63(3):585\u2013615, 2011.\n\n[3] Bach, Francis. Submodular functions: from discrete to continous domains. arXiv preprint\n\narXiv:1511.00394, 2015.\n\n[4] Bian, Andrew An, Mirzasoleiman, Baharan, Buhmann, Joachim M., and Krause, Andreas.\nGuaranteed non-convex optimization: Submodular maximization over continuous domains. In\nInternational Conference on Arti\ufb01cial Intelligence and Statistics (AISTATS), pp. 111\u2013120, 2017.\n[5] Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge university press,\n\n2004.\n\n[6] Calinescu, Gruia, Chekuri, Chandra, P\u00e1l, Martin, and Vondr\u00e1k, Jan. Maximizing a submod-\nular set function subject to a matroid constraint. In Integer programming and combinatorial\noptimization, pp. 182\u2013196. Springer, 2007.\n\n[7] Chekuri, Chandra, Vondr\u00e1k, Jan, and Zenklusen, Rico. Submodular function maximization via\nthe multilinear relaxation and contention resolution schemes. SIAM Journal on Computing, 43\n(6):1831\u20131879, 2014.\n\n[8] Chekuri, Chandra, Jayram, TS, and Vondr\u00e1k, Jan. On multiplicative weight updates for concave\nand submodular function maximization. In Proceedings of the 2015 Conference on Innovations\nin Theoretical Computer Science, pp. 201\u2013210. ACM, 2015.\n\n[9] Djolonga, Josip and Krause, Andreas. From map to marginals: Variational inference in bayesian\n\nsubmodular models. In Neural Information Processing Systems (NIPS), pp. 244\u2013252, 2014.\n\n[10] Eghbali, Reza and Fazel, Maryam. Designing smoothing functions for improved worst-case\ncompetitive ratio in online optimization. In Advances in Neural Information Processing Systems\n(NIPS), pp. 3279\u20133287. 2016.\n\n[11] Ene, Alina and Nguyen, Huy L. Constrained submodular maximization: Beyond 1/e.\n\nIn\nFoundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 248\u2013257,\n2016.\n\n[12] Ene, Alina and Nguyen, Huy L. A reduction for optimizing lattice submodular functions with\n\ndiminishing returns. arXiv preprint arXiv:1606.08362, 2016.\n\n[13] Feldman, Moran, Naor, Joseph, and Schwartz, Roy. A uni\ufb01ed continuous greedy algorithm\nfor submodular maximization. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd\nAnnual Symposium on, pp. 570\u2013579. IEEE, 2011.\n\n[14] Friedland, S and Gaubert, S. Submodular spectral functions of principal submatrices of a\nhermitian matrix, extensions and applications. Linear Algebra and its Applications, 438(10):\n3872\u20133884, 2013.\n\n[15] Fuchssteiner, Benno and Lusky, Wolfgang. Convex cones, volume 56. Elsevier, 2011.\n[16] Fujishige, Satoru. Submodular functions and optimization, volume 58. Elsevier, 2005.\n[17] Garg, Vijay K. Introduction to lattice theory with computer science applications. John Wiley &\n\nSons, 2015.\n\n[18] Ghadimi, Saeed, Lan, Guanghui, and Zhang, Hongchao. Mini-batch stochastic approximation\nmethods for nonconvex stochastic composite optimization. Mathematical Programming, 155\n(1-2):267\u2013305, 2016.\n\n[19] Gharan, Shayan Oveis and Vondr\u00e1k, Jan. Submodular maximization by simulated annealing. In\nProceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pp.\n1098\u20131116. Society for Industrial and Applied Mathematics, 2011.\n\n[20] Gillenwater, Jennifer, Kulesza, Alex, and Taskar, Ben. Near-optimal map inference for deter-\nminantal point processes. In Advances in Neural Information Processing Systems (NIPS), pp.\n2735\u20132743, 2012.\n\n10\n\n\f[21] Hassani, Hamed, Soltanolkotabi, Mahdi, and Karbasi, Amin. Gradient methods for submodular\nmaximization. In Advances in Neural Information Processing Systems (NIPS), pp. 5837\u20135847,\n2017.\n\n[22] Ito, Shinji and Fujimaki, Ryohei. Large-scale price optimization via network \ufb02ow. In Advances\n\nin Neural Information Processing Systems (NIPS), pp. 3855\u20133863, 2016.\n\n[23] Iwata, Satoru, Fleischer, Lisa, and Fujishige, Satoru. A combinatorial strongly polynomial\n\nalgorithm for minimizing submodular functions. Journal of the ACM, 48(4):761\u2013777, 2001.\n\n[24] Khodabakhsh, Ali and Nikolova, Evdokia. Maximizing non-monotone dr-submodular functions\n\nwith cardinality constraints. arXiv preprint arXiv:1611.09474, 2016.\n\n[25] Krause, Andreas and Golovin, Daniel. Submodular function maximization. Tractability:\n\nPractical Approaches to Hard Problems, 3:19, 2012.\n\n[26] Kulesza, Alex, Taskar, Ben, et al. Determinantal point processes for machine learning. Founda-\n\n[27] Lacoste-Julien, Simon. Convergence rate of frank-wolfe for non-convex objectives. arXiv\n\ntions and Trends R in Machine Learning, 5(2\u20133):123\u2013286, 2012.\npreprint arXiv:1607.00345, 2016.\n\n[28] Lov\u00e1sz, L\u00e1szl\u00f3. Submodular functions and convexity. In Mathematical Programming The State\n\nof the Art, pp. 235\u2013257. Springer, 1983.\n\n[29] Montanari, Andrea and Richard, Emile. Non-negative principal component analysis: Message\npassing algorithms and sharp asymptotics. IEEE Transactions on Information Theory, 62(3):\n1458\u20131484, 2016.\n\n[30] Motzkin, Theodore S and Straus, Ernst G. Maxima for graphs and a new proof of a theorem of\n\ntur\u00e1n. Canad. J. Math, 17(4):533\u2013540, 1965.\n\n[31] Nemhauser, George L, Wolsey, Laurence A, and Fisher, Marshall L. An analysis of approxima-\ntions for maximizing submodular set functions \u2013 i. Mathematical Programming, 14(1):265\u2013294,\n1978.\n\n[32] Reddi, Sashank J., Sra, Suvrit, Poczos, Barnabas, and Smola, Alexander J. Proximal stochastic\nmethods for nonsmooth nonconvex \ufb01nite-sum optimization. In Advances in Neural Information\nProcessing Systems (NIPS), pp. 1145\u20131153. 2016.\n\n[33] Soma, Tasuku and Yoshida, Yuichi. A generalization of submodular cover via the diminishing\nreturn property on the integer lattice. In Advances in Neural Information Processing Systems\n(NIPS), pp. 847\u2013855, 2015.\n\n[34] Soma, Tasuku, Kakimura, Naonori, Inaba, Kazuhiro, and Kawarabayashi, Ken-ichi. Optimal\nbudget allocation: Theoretical guarantee and ef\ufb01cient algorithm. In International Conference\non Machine Learning (ICML), pp. 351\u2013359, 2014.\n\n[35] Sra, Suvrit. Scalable nonconvex inexact proximal splitting. In Advances in Neural Information\n\nProcessing Systems (NIPS), pp. 530\u2013538, 2012.\n\n[36] Staib, Matthew and Jegelka, Stefanie. Robust budget allocation via continuous submodular\nfunctions. In International Conference on Machine Learning (ICML), pp. 3230\u20133240, 2017.\n\n[37] Topkis, Donald M. Minimizing a submodular function on a lattice. Operations research, 26(2):\n\n305\u2013321, 1978.\n\n[38] Vondr\u00e1k, Jan. Optimal approximation for the submodular welfare problem in the value oracle\nmodel. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 67\u201374,\n2008.\n\n[39] Xia, Wei, Vera, Juan, and Zuluaga, Luis F. Globally solving non-convex quadratic programs via\n\nlinear integer programming techniques. arXiv preprint arXiv:1511.02423, 2015.\n\n[40] Zass, Ron and Shashua, Amnon. Nonnegative sparse pca. Advances in Neural Information\n\nProcessing Systems (NIPS), pp. 1561\u20131568, 2007.\n\n11\n\n\f", "award": [], "sourceid": 342, "authors": [{"given_name": "An", "family_name": "Bian", "institution": "ETH Zurich"}, {"given_name": "Kfir", "family_name": "Levy", "institution": "ETH"}, {"given_name": "Andreas", "family_name": "Krause", "institution": "ETHZ"}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": "ETH Zurich"}]}