{"title": "Stochastic Submodular Maximization: The Case of Coverage Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 6853, "page_last": 6863, "abstract": "Stochastic optimization of continuous objectives is at the heart of modern machine learning. However, many important problems are of discrete nature and often involve submodular objectives. We seek to unleash the power of stochastic continuous optimization, namely stochastic gradient descent and its variants, to such discrete problems. We first introduce the problem of stochastic submodular optimization, where one needs to optimize a submodular objective which is given as an expectation. Our model captures situations where the discrete objective arises as an empirical risk (e.g., in the case of exemplar-based clustering), or is given as an explicit stochastic model (e.g., in the case of influence maximization in social networks). By exploiting that common extensions act linearly on the class of submodular functions, we employ projected stochastic gradient ascent and its variants in the continuous domain, and perform rounding to obtain discrete solutions. We focus on the rich and widely used family of weighted coverage functions. 
We show that our approach yields solutions that are guaranteed to match the optimal approximation guarantees, while reducing the computational cost by several orders of magnitude, as we demonstrate empirically.", "full_text": "Stochastic Submodular Maximization: The Case of Coverage Functions\n\nMohammad Reza Karimi\nDepartment of Computer Science\nETH Zurich\nmkarimi@ethz.ch\n\nMario Lucic\nDepartment of Computer Science\nETH Zurich\nlucic@inf.ethz.ch\n\nHamed Hassani\nDepartment of Electrical and Systems Engineering\nUniversity of Pennsylvania\nhassani@seas.upenn.edu\n\nAndreas Krause\nDepartment of Computer Science\nETH Zurich\nkrausea@ethz.ch\n\nAbstract\n\nStochastic optimization of continuous objectives is at the heart of modern machine learning. However, many important problems are of discrete nature and often involve submodular objectives. We seek to unleash the power of stochastic continuous optimization, namely stochastic gradient descent and its variants, to such discrete problems. We first introduce the problem of stochastic submodular optimization, where one needs to optimize a submodular objective which is given as an expectation. Our model captures situations where the discrete objective arises as an empirical risk (e.g., in the case of exemplar-based clustering), or is given as an explicit stochastic model (e.g., in the case of influence maximization in social networks). By exploiting that common extensions act linearly on the class of submodular functions, we employ projected stochastic gradient ascent and its variants in the continuous domain, and perform rounding to obtain discrete solutions. We focus on the rich and widely used family of weighted coverage functions. 
We show that our approach yields solutions that are guaranteed to match the optimal approximation guarantees, while reducing the computational cost by several orders of magnitude, as we demonstrate empirically.\n\n1 Introduction\n\nSubmodular functions are discrete analogs of convex functions. They arise naturally in many areas, such as the study of graphs, matroids, covering problems, and facility location problems. These functions are extensively studied in operations research and combinatorial optimization [22]. Recently, submodular functions have proven to be key concepts in other areas such as machine learning, algorithmic game theory, and the social sciences. As such, they have been applied to a host of important problems such as modeling valuation functions in combinatorial auctions, feature and variable selection [23], data summarization [27], and influence maximization [20].\nClassical results in submodular optimization consider the oracle model, whereby access to the optimization objective is provided through a black box (an oracle). However, in many applications the objective has to be estimated from data and is subject to stochastic fluctuations. In other cases the value of the objective may only be obtained through simulation. As such, exact computation might not be feasible due to statistical or computational constraints. As a concrete example, consider the problem of influence maximization in social networks [20]. The objective function is defined as the expectation of a stochastic process, quantifying the size of the (random) subset of nodes influenced by a selected seed set. This expectation cannot be computed efficiently, and is typically approximated via random sampling, which introduces an error in the estimate of the value of a seed set.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. 
Another practical example is the exemplar-based clustering problem, which is an instance of the facility location problem. Here, the objective is the sum of similarities of all the points inside a (large) collection of data points to a selected set of centers. Given a distribution over point locations, the true objective is defined as the expected value w.r.t. this distribution, and can only be approximated as a sample average. Moreover, evaluating the function on a sample involves the computation of many pairwise similarities, which is computationally prohibitive in the context of massive data sets.\nIn this work, we provide a formalization of such stochastic submodular maximization tasks. More precisely, we consider set functions f : 2^V → R_+ defined as f(S) = E_{γ∼Γ}[f_γ(S)] for S ⊆ V, where Γ is an arbitrary distribution and for each realization γ ∼ Γ, the set function f_γ : 2^V → R_+ is monotone and submodular (hence f is monotone submodular). The goal is to maximize f subject to some constraints (e.g. the k-cardinality constraint), having access only to i.i.d. samples f_γ(·).\nMethods for submodular maximization fall into two major categories: (i) the classic approach is to directly optimize the objective using discrete optimization methods (e.g. the GREEDY algorithm and its accelerated variants), which are state-of-the-art both in practice and in theory, at least in the case of simple constraints, and are the most widely considered in the literature; (ii) the alternative is to lift the problem into a continuous domain and exploit the continuous optimization techniques available therein [7]. 
While the continuous approaches may lead to provably good results, even for more complex constraints, their high computational complexity inhibits their practicality.\nIn this paper we demonstrate how modern stochastic optimization techniques (such as SGD, ADAGRAD [8] and ADAM [21]) can be used to solve an important class of discrete optimization problems which can be modeled using weighted coverage functions. In particular, we show how to efficiently maximize them under matroid constraints by (i) lifting the problem into the continuous domain using the multilinear extension [37], (ii) efficiently computing a concave relaxation of the multilinear extension [32], (iii) efficiently computing an unbiased estimate of the gradient of the concave relaxation, thus enabling (projected) stochastic gradient ascent-style algorithms to maximize it, and (iv) rounding the resulting fractional solution without loss of approximation quality [7]. In addition to providing convergence and approximation guarantees, we demonstrate that our algorithms enjoy strong empirical performance, often achieving an order of magnitude speedup with less than 1% error with respect to GREEDY. 
As a result, the presented approach unleashes the powerful toolkit of stochastic gradient-based methods on discrete optimization problems.\n\nOur contributions. In this paper we (i) introduce a framework for stochastic submodular optimization, (ii) provide a general methodology for constrained maximization of stochastic submodular objectives, (iii) prove that the proposed approach guarantees a (1 − 1/e)-approximation in expectation for the class of weighted coverage functions, which is the best approximation guarantee achievable in polynomial time unless P = NP, (iv) highlight the practical benefit and efficiency of using continuous stochastic optimization techniques for submodular maximization, and (v) demonstrate the practical utility of the proposed framework in an extensive experimental evaluation. We show for the first time that continuous optimization is a highly practical, scalable avenue for maximizing submodular set functions.\n\n2 Background and problem formulation\n\nLet V be a ground set of n elements. A set function f : 2^V → R_+ is submodular if for every A, B ⊆ V it holds that f(A) + f(B) ≥ f(A ∩ B) + f(A ∪ B). The function f is said to be monotone if f(A) ≤ f(B) for all A ⊆ B ⊆ V. We focus on maximizing f subject to some constraints on S ⊆ V. The prototypical example is maximization under a cardinality constraint, i.e., for a given integer k, find S ⊆ V, |S| ≤ k, which maximizes f. Finding an exact solution for monotone submodular functions is NP-hard [10], but a (1 − 1/e)-approximation can be efficiently determined [30]. Going beyond the (1 − 1/e)-approximation is NP-hard for many classes of submodular functions [30, 24]. More generally, one may consider matroid constraints, whereby (V, I) is a matroid with family of independent sets I, and the goal is to maximize f such that S ∈ I. 
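Since coverage functions recur throughout the paper, the submodularity inequality above can be verified mechanically on a toy instance; the following sketch (the instance and helper names are illustrative, not from the paper) checks f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B) exhaustively:

```python
from itertools import combinations

def coverage(sets, S):
    """Unweighted coverage: f(S) = size of the union of the chosen subsets."""
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)

def is_submodular(f, n):
    """Exhaustively verify f(A) + f(B) >= f(A | B) + f(A & B); exponential, demo only."""
    all_subsets = [frozenset(c) for r in range(n + 1)
                   for c in combinations(range(n), r)]
    return all(f(A) + f(B) >= f(A | B) + f(A & B)
               for A in all_subsets for B in all_subsets)

# Toy ground set U = {0,...,4} covered by V = {B1, B2, B3}.
sets = [{0, 1, 2}, {1, 3}, {2, 3, 4}]
print(is_submodular(lambda S: coverage(sets, S), len(sets)))  # True
```

The same exhaustive check applied to a non-submodular function (e.g. f(S) = |S|²) would return False, which makes it a handy sanity test for hand-written objectives on small ground sets.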
The GREEDY algorithm achieves a 1/2-approximation [13], but CONTINUOUS GREEDY, introduced by Vondrák [37] and Calinescu et al. [6], can achieve a (1 − 1/e)-optimal solution in expectation. Their approach is based on the multilinear extension of f, F : [0, 1]^V → R_+, defined as\n\nF(x) = Σ_{S⊆V} f(S) Π_{i∈S} x_i Π_{j∉S} (1 − x_j),   (1)\n\nfor all x = (x_1, · · · , x_n) ∈ [0, 1]^V. In other words, F(x) is the expected value of f over random sets wherein each element i is included with probability x_i independently. Then, instead of optimizing f(S) over I, we can optimize F over the matroid base polytope corresponding to (V, I): P = {x ∈ R^n_+ | x(S) ≤ r(S) for all S ⊆ V, x(V) = r(V)}, where r(·) is the matroid's rank function. The CONTINUOUS GREEDY algorithm then finds a solution x ∈ P which provides a (1 − 1/e)-approximation. Finally, the continuous solution x is efficiently rounded to a feasible discrete solution without loss in objective value, using PIPAGE ROUNDING [1, 6]. The idea of converting a discrete optimization problem into a continuous one was first exploited by Lovász [28] in the context of submodular minimization, and this approach was recently applied to a variety of problems [36, 19, 3].\n\nProblem formulation. The aforementioned results are based on the oracle model, whereby the exact value of f(S) for any S ⊆ V is given by an oracle. In the absence of such an oracle, we face additional challenges in evaluating f, both statistical and computational. In particular, consider set functions that are defined as expectations, i.e. for S ⊆ V we have\n\nf(S) = E_{γ∼Γ}[f_γ(S)],   (2)\n\nwhere Γ is an arbitrary distribution and for each realization γ ∼ Γ, the set function f_γ : 2^V → R is submodular. 
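To make (1) concrete, the multilinear extension can be evaluated by brute-force enumeration on a tiny coverage instance and compared against its sampling interpretation ("include each element i independently with probability x_i"); a minimal sketch with invented data:

```python
import itertools
import random

def multilinear(f, x):
    """Eq. (1): F(x) = sum over S of f(S) * prod_{i in S} x_i * prod_{j not in S} (1 - x_j)."""
    n = len(x)
    total = 0.0
    for bits in itertools.product([0, 1], repeat=n):
        p = 1.0
        for i, b in enumerate(bits):
            p *= x[i] if b else 1 - x[i]
        total += p * f({i for i, b in enumerate(bits) if b})
    return total

def sample_estimate(f, x, trials=20000, seed=0):
    """Monte Carlo view of F(x): average f over random sets drawn coordinate-wise."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        acc += f({i for i, xi in enumerate(x) if rng.random() < xi})
    return acc / trials

cover = [{0, 1}, {1, 2}, {2}]   # cover sets B1, B2, B3 over ground elements {0, 1, 2}
f = lambda S: len(set().union(*(cover[i] for i in S))) if S else 0
x = [0.5, 0.3, 0.8]
print(multilinear(f, x))        # ~2.01 = 0.5 + 0.65 + 0.86, element by element
print(sample_estimate(f, x))    # close to the exact value
```

The element-by-element decomposition visible in the printed value (each ground element contributes w(u) times its probability of being covered) is exactly the structure exploited for weighted coverage functions in Section 3.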
The goal is to efficiently maximize f subject to constraints such as the k-cardinality constraint, or more generally, a matroid constraint.\nAs a motivating example, consider the problem of propagation of contagions through a network. The objective is to identify the most influential seed set of a given size. A propagation instance (a concrete realization of a contagion) is specified by a graph G = (V, E). The influence f_G(S) of a set of nodes S in instance G is the fraction of nodes reachable from S using the edges E. To handle uncertainty in the concrete realization, it is natural to introduce a probabilistic model such as the Independent Cascade [20] model, which defines a distribution 𝒢 over instances G ∼ 𝒢 that share a set V of nodes. The influence of a seed set S is then the expectation f(S) = E_{G∼𝒢}[f_G(S)], which is a monotone submodular function. Estimating the expected influence exactly is computationally demanding, as it requires summing over exponentially many functions f_G. Assuming f is of the form (2), one can easily obtain an unbiased estimate of f for a fixed set S by random sampling according to Γ. The critical question is: given that the underlying function is an expectation, can we optimize it more efficiently?\nOur approach is based on continuous extensions that are linear operators on the class of set functions, namely, linear continuous extensions. As a specific example, considering the multilinear extension, we can write F(x) = E_{γ∼Γ}[F_γ(x)], where F_γ denotes the multilinear extension of f_γ. As a consequence, the value of F_γ(x), when γ ∼ Γ, is an unbiased estimator of F(x), and unbiased estimates of the (sub)gradients may be obtained analogously. We explore this avenue to develop efficient algorithms for maximizing an important subclass of submodular functions that can be expressed as weighted coverage functions. 
Our approach harnesses a concave relaxation detailed in Section 3.\nFurther related work. The emergence of new applications, combined with a massive increase in the amount of data, has created a demand for fast algorithms for submodular optimization. A variety of approximation algorithms have been presented, ranging from submodular maximization subject to a cardinality constraint [29, 39, 4], submodular maximization subject to a matroid constraint [6], non-monotone submodular maximization [11], and approximately submodular functions [17], to algorithms for submodular maximization subject to a wide variety of constraints [25, 12, 38, 18, 9]. A closely related setting to ours is online submodular maximization [35], where functions arrive one at a time and the goal is to provide time-dependent solutions (sets) such that a cumulative regret is minimized. In contrast, our goal is to find a single (time-independent) set that maximizes the objective (2). Another relevant setting is noisy submodular maximization, where the evaluations returned by the oracle are noisy [16, 34]. Specifically, [34] assumes a noisy but unbiased oracle (with independent sub-Gaussian noise), which allows one to estimate the marginal gains of items sufficiently well by averaging. In the context of cardinality constraints, some of these ideas can be carried over to our setting by introducing additional assumptions on how the values f_γ(S) vary w.r.t. their expectation f(S). However, we provide a different approach that does not rely on uniform convergence, and we compare sample and running-time complexity with variants of GREEDY in Section 3.\n\n3 Stochastic Submodular Optimization\n\nWe follow the general framework of [37], whereby the problem is lifted into the continuous domain, a continuous optimization algorithm is designed to maximize the transferred objective, and the resulting solution is rounded. 
Maximizing f subject to a matroid constraint can then be done by first maximizing its multilinear extension F over the matroid base polytope and then rounding the solution. Methods such as projected stochastic gradient ascent can be used to maximize F over this polytope.\nCritically, we have to ensure that the computed local optima are good in expectation. Unfortunately, the multilinear extension F lacks concavity and therefore may have bad local optima. Hence, we consider concave continuous extensions of F that are efficiently computable and at most a constant factor away from F, which ensures solution quality. Such a concave extension F̄ can then be efficiently maximized over a polytope using projected stochastic gradient ascent, which enables the application of modern continuous optimization techniques. One important class of functions for which such an extension can be efficiently computed is the class of weighted coverage functions.\n\nThe class of weighted coverage functions (WCF). Let U be a set and let g be a nonnegative modular function on U, i.e. g(S) = Σ_{u∈S} w(u) for S ⊆ U. Let V = {B_1, . . . , B_n} be a collection of subsets of U. The weighted coverage function f : 2^V → R_+ defined as\n\nf(S) = g(∪_{B_i∈S} B_i) for all S ⊆ V\n\nis monotone submodular. For all u ∈ U, let us denote by P_u := {B_i ∈ V | u ∈ B_i} the collection of sets covering u, and by I(·) the indicator function. The multilinear extension of f can be expressed in a more compact way:\n\nF(x) = E_S[f(S)] = E_S[ Σ_{u∈U} I(u ∈ B_i for some B_i ∈ S) · w(u) ] = Σ_{u∈U} w(u) · P(u ∈ B_i for some B_i ∈ S) = Σ_{u∈U} w(u) ( 1 − Π_{B_i∈P_u} (1 − x_i) ),   (3)\n\nwhere we used the fact that each element B_i ∈ V is chosen independently with probability x_i.\n\nConcave upper bound for weighted coverage functions. To efficiently compute a concave upper bound on the multilinear extension we use the framework of Seeman and Singer [32]. 
Given that all the weights w(u), u ∈ U, in (3) are non-negative, we can construct a concave upper bound for the multilinear extension F(x) using the following lemma. Proofs can be found in Appendix A.\n\nLemma 1. For x ∈ [0, 1]^ℓ define α(x) := 1 − Π_{i=1}^ℓ (1 − x_i). Then the Fenchel concave biconjugate of α(·) is β(x) := min{1, Σ_{i=1}^ℓ x_i}. Also,\n\n(1 − 1/e) β(x) ≤ α(x) ≤ β(x) for all x ∈ [0, 1]^ℓ.\n\nFurthermore, β is an extension of α, i.e. for all x ∈ {0, 1}^ℓ: α(x) = β(x).\n\nConsequently, given a weighted coverage function f with F(x) represented as in (3), we can define\n\nF̄(x) := Σ_{u∈U} w(u) min{1, Σ_{B_v∈P_u} x_v}   (4)\n\nand conclude using Lemma 1 that (1 − 1/e) F̄(x) ≤ F(x) ≤ F̄(x), as desired. Furthermore, F̄ has three interesting properties: (1) it is a concave function over [0, 1]^V, (2) it is equal to f on the vertices of the hypercube, i.e. for x ∈ {0, 1}^n one has F̄(x) = f({i : x_i = 1}), and (3) it can be computed efficiently and deterministically given access to the sets P_u, u ∈ U. In other words, we can compute the value of F̄(x) using at most O(|U| × |V|) operations. Note that F̄ is not the tightest concave upper bound of F, even though we use the tightest concave upper bound for each term of F.\n\nOptimizing the concave upper bound by stochastic gradient ascent. Instead of maximizing F over a polytope P, one can now attempt to maximize F̄ over P. Critically, this task can be done efficiently, as F̄ is concave, by using projected stochastic gradient ascent. 
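Lemma 1 and the sandwich (1 − 1/e) F̄(x) ≤ F(x) ≤ F̄(x) are easy to sanity-check numerically; in the sketch below the sets P_u and the weights w(u) are toy values chosen for illustration:

```python
import itertools
import math
import random

def alpha(x):
    """alpha(x) = 1 - prod_i (1 - x_i): one coverage term of the multilinear extension (3)."""
    p = 1.0
    for xi in x:
        p *= 1 - xi
    return 1 - p

def beta(x):
    """beta(x) = min{1, sum_i x_i}: its Fenchel concave biconjugate (Lemma 1)."""
    return min(1.0, sum(x))

P = [[0], [0, 1], [1, 2]]          # P_u: indices of the cover sets containing element u
w = [1.0, 2.0, 0.5]                # weights w(u)

def F(x):                          # eq. (3), summed term by term
    return sum(wu * alpha([x[i] for i in Pu]) for wu, Pu in zip(w, P))

def F_bar(x):                      # eq. (4)
    return sum(wu * beta([x[i] for i in Pu]) for wu, Pu in zip(w, P))

c = 1 - 1 / math.e
rng = random.Random(0)
for _ in range(1000):              # sandwich: (1 - 1/e) * F_bar <= F <= F_bar
    x = [rng.random() for _ in range(3)]
    assert c * F_bar(x) <= F(x) + 1e-12 <= F_bar(x) + 1e-12
for bits in itertools.product([0.0, 1.0], repeat=3):   # F_bar agrees with f on vertices
    assert abs(F(list(bits)) - F_bar(list(bits))) < 1e-12
print("Lemma 1 bounds hold on this instance")
```

Because the bound holds term by term, it holds for any non-negative weights, which is exactly why the lemma transfers from a single coverage term to the full F̄ in (4).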
In particular, one can control the convergence speed by choosing from the toolbox of modern continuous optimization algorithms, such as SGD, ADAGRAD and ADAM. Let us denote a maximizer of F̄ over P by x̄*, and a maximizer of F over P by x*. We can thus write\n\nF(x̄*) ≥ (1 − 1/e) F̄(x̄*) ≥ (1 − 1/e) F̄(x*) ≥ (1 − 1/e) F(x*),\n\nwhich is exactly the guarantee that previous methods give, and in general the best near-optimality ratio that one can achieve in polynomial time. Finally, to round the continuous solution we may apply RANDOMIZED-PIPAGE-ROUNDING [7], as the quality of the approximation is preserved in expectation.\n\nAlgorithm 1 Stochastic Submodular Maximization via concave relaxation\nRequire: matroid M with base polytope P, η_t (step size), T (maximum # of iterations)\n1: x(0) ← starting point in P\n2: for t ← 0 to T − 1 do\n3:   Choose g_t at random from a distribution such that E[g_t | x(0), . . . , x(t)] ∈ ∂F̄(x(t))\n4:   x(t+1/2) ← x(t) + η_t g_t\n5:   x(t+1) ← Project_P(x(t+1/2))\n6: end for\n7: x̄_T ← (1/T) Σ_{t=1}^T x(t)\n8: S ← RANDOMIZED-PIPAGE-ROUND(x̄_T)\n9: return S such that S ∈ M and E[f(S)] ≥ (1 − 1/e) f(OPT) − ε(T).\n\nMatroid constraints. Constrained optimization can be efficiently performed by projected gradient ascent, whereby after each step of the stochastic ascent we need to project the solution back onto the feasible set. For the case of matroid constraints, it is sufficient to consider projection onto the matroid base polytope. The problem of projecting onto the base polytope has been widely studied and fast algorithms exist in many cases [2, 5, 31]. While these projection algorithms were used as a key subprocedure in constrained submodular minimization, here we employ them for submodular maximization. Details of a fast projection algorithm for the problems considered in this work are presented in Appendix D. 
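For the k-uniform matroid, whose base polytope is {x ∈ [0,1]^n : Σ_i x_i = k}, the ascent-and-project loop of Algorithm 1 can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not the paper's implementation: the instance is invented, the projection uses a simple bisection, and a deterministic top-k selection stands in for RANDOMIZED-PIPAGE-ROUND:

```python
import random

def project(x, k, iters=60):
    """Euclidean projection onto {y in [0,1]^n : sum(y) = k} by bisecting on a shift theta,
    since the projection has the form y_i = clip(x_i - theta, 0, 1)."""
    lo, hi = min(x) - 1.0, max(x)
    for _ in range(iters):
        theta = (lo + hi) / 2
        if sum(min(1.0, max(0.0, xi - theta)) for xi in x) > k:
            lo = theta
        else:
            hi = theta
    return [min(1.0, max(0.0, xi - (lo + hi) / 2)) for xi in x]

def stochastic_subgrad(x, P, w, rng):
    """Unbiased subgradient estimate of F_bar: sample one element u uniformly, scale by |U|."""
    u = rng.randrange(len(P))
    g = [0.0] * len(x)
    if sum(x[i] for i in P[u]) < 1.0:        # min{1, .} term still in its linear regime
        for i in P[u]:
            g[i] = len(P) * w[u]
    return g

def maximize_wcf(P, w, n, k, T=3000, seed=0):
    """Projected stochastic gradient ascent on F_bar, returning the averaged iterate."""
    rng = random.Random(seed)
    x = project([0.0] * n, k)
    avg = [0.0] * n
    for t in range(1, T + 1):
        g = stochastic_subgrad(x, P, w, rng)
        eta = 0.5 / t ** 0.5                  # schedule in the spirit of eta_t ~ B/(rho*sqrt(t))
        x = project([xi + eta * gi for xi, gi in zip(x, g)], k)
        avg = [a + xi / T for a, xi in zip(avg, x)]
    return avg

# Cover sets B1 = {a, b}, B2 = {b}, B3 = {c}; picking {B1, B3} covers everything.
P = [[0], [0, 1], [2]]                        # P_u for elements u = a, b, c
w = [1.0, 1.0, 1.0]
x_bar = maximize_wcf(P, w, n=3, k=2)
S = sorted(range(3), key=lambda i: -x_bar[i])[:2]   # top-k stand-in for pipage rounding
print(sorted(S))  # [0, 2]
```

The averaged iterate concentrates its mass on the first and third cover sets, so the rounded solution recovers the optimal pair; swapping in ADAGRAD or ADAM for the plain step-size schedule only changes the `eta`/update lines.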
Algorithm 1 summarizes all steps required to maximize f subject to matroid constraints.\n\nConvergence rate. Since we are maximizing a concave function F̄(·) over a matroid base polytope P, the convergence rate (and hence the running time) depends on B := max_{x∈P} ||x||, as well as on the maximum gradient norm ρ (i.e. ||g_t|| ≤ ρ with probability 1).¹ In the case of the base polytope of a matroid of rank r, B is √r, since each vertex of the polytope has exactly r ones. Also, from (4), one can derive a rough upper bound on the norm of the gradient:\n\n||g|| ≤ || Σ_{u∈U} w(u) 1_{P_u} || ≤ max_{u∈U} |P_u|^{1/2} Σ_{u∈U} w(u),\n\nwhich depends on the weights w(u) as well as on |P_u| and is hence problem-dependent. We will provide tighter upper bounds on the gradient norm for our specific examples in the later sections. With η_t = B/(ρ√t) and classic results for SGD [33], we have that\n\nF̄(x̄*) − E[F̄(x̄_T)] ≤ Bρ/√T,\n\nwhere T is the total number of SGD iterations and x̄_T is the final outcome of SGD (see Algorithm 1). Therefore, for a given ε > 0, after T ≥ B²ρ²/ε² iterations we have\n\nF̄(x̄*) − E[F̄(x̄_T)] ≤ ε.\n\nSumming up, we have the following theorem:\n\nTheorem 2. Let f be a weighted coverage function, P be the base polytope of a matroid M, and ρ and B be as above. Then for each ε > 0, Algorithm 1 after T = B²ρ²/ε² iterations produces a set S* ∈ M such that E[f(S*)] ≥ (1 − 1/e) max_{S∈M} f(S) − ε.\n\n¹Note that the function F̄ is neither smooth nor strongly concave, as functions such as min{1, x} are not smooth or strongly concave.\n\nRemark. Indeed, this approximation ratio is the best one can achieve unless P = NP [10]. A key point here is that our approach also works for more general constraints (in particular, it is efficient for simple matroids such as partition matroids). 
In the latter case, GREEDY only gives a 1/2-approximation and fast discrete methods like STOCHASTIC-GREEDY [29] do not apply, whereas our method still yields a (1 − 1/e)-optimal solution.\n\nTime Complexity. One can compute an upper bound on the running time of Algorithm 1 by estimating the time required to perform gradient computations, the projection onto P, and rounding. For the case of uniform matroids, projection and rounding take O(n log n) and O(n) time, respectively (see Appendix D). Furthermore, for the applications considered in this work, namely expected influence maximization and exemplar-based clustering, we provide linear-time algorithms to compute the gradients. Also, when our matroid is the k-uniform matroid (i.e. the k-cardinality constraint), we have B = √k. By Theorem 2, the total computational complexity of our algorithm is O(ρ²kn(log n)/ε²).\n\nComparison to GREEDY. Let us relate our results to the classical approach. When running the GREEDY algorithm in the stochastic setting, one estimates f̂(S) := (1/s) Σ_{i=1}^s f_{γ_i}(S), where γ_1, . . . , γ_s are i.i.d. samples from Γ. The following proposition bounds the sample and computational complexity of GREEDY. The proof is detailed in Appendix B.\n\nProposition 3. Let f be a submodular function defined as in (2). Suppose 0 ≤ f_γ(S) ≤ H for all S ⊆ V and all γ ∼ Γ. Assume S* denotes the optimal solution for f subject to the k-cardinality constraint and S_k denotes the solution computed by the greedy algorithm on f̂ after k steps. Then, in order to guarantee\n\nP[ f(S_k) ≥ (1 − 1/e) f(S*) − ε ] ≥ 1 − δ,\n\nit is enough to have\n\ns ∈ Ω( H² (k log n + log(1/δ)) / ε² )\n\ni.i.d. samples from Γ. The running time of GREEDY is then bounded by\n\nO( τ H² nk (k log n + log(1/δ)) / ε² ),\n\nwhere τ is an upper bound on the computation time for a single evaluation of f_γ(S).\n\nAs an example, let us compare the worst-case complexity bound obtained for SGD (i.e. O(ρ²kn(log n)/ε²)) with that of GREEDY for the influence maximization problem. Each single function evaluation for GREEDY amounts to computing the total influence of a set in a sample graph, which makes τ = O(n) (here we assume our sample graphs satisfy |E| = O(|V|)). Also, a crude upper bound on the size of the gradient for each sample function is H√n (see Appendix E.1). Hence, we can deduce that SGD can have a factor-k speedup w.r.t. GREEDY.\n\n4 Applications\n\nWe will now show how to instantiate the stochastic submodular maximization framework using several prototypical discrete optimization problems.\n\nInfluence maximization. As discussed in Section 2, the Independent Cascade [20] model defines a distribution 𝒢 over instances G ∼ 𝒢 that share a set V of nodes. The influence f_G(S) of a set of nodes S in instance G is the fraction of nodes reachable from S using the edges E(G). The following lemma shows that the influence belongs to the class of WCF.\n\nLemma 4. The influence function f_G(·) is a WCF. Moreover,\n\nF_G(x) = E_S[f_G(S)] = (1/|V|) Σ_{v∈V} ( 1 − Π_{u∈P_v} (1 − x_u) ),   (5)\n\nF̄_G(x) = (1/|V|) Σ_{v∈V} min{1, Σ_{u∈P_v} x_u},   (6)\n\nwhere P_v is the set of all nodes having a (directed) path to v.\n\nWe return to the problem of maximizing f(S) = E_{G∼𝒢}[f_G(S)] given a distribution over graphs G sharing the nodes V. Since f is a weighted sum of submodular functions, it is submodular. Moreover,\n\nF(x) = E_S[f(S)] = E_S[E_G[f_G(S)]] = E_G[E_S[f_G(S)]] = E_G[F_G(x)].\n\nLet U be the uniform distribution over vertices. 
Then,\n\nF(x) = E_G[ (1/|V|) Σ_{v∈V} (1 − Π_{u∈P_v} (1 − x_u)) ] = E_G[ E_{v∼U}[ 1 − Π_{u∈P_v} (1 − x_u) ] ],   (7)\n\nand the corresponding upper bound would be\n\nF̄(x) = E_G[ E_{v∼U}[ min{1, Σ_{u∈P_v} x_u} ] ].   (8)\n\nThis formulation proves helpful for the efficient calculation of subgradients, as one can obtain a random subgradient in linear time. For more details see Appendix E.1. We also provide a more efficient, biased estimator of the expectation in the Appendix.\n\nFacility location. Let G = (X ∪̇ Y, E) be a complete weighted bipartite graph with parts X and Y and nonnegative weights w_{x,y}. The weights can be considered as utilities or some similarity metric. We select a subset S ⊆ X and each y ∈ Y selects the s ∈ S with the highest weight w_{s,y}. Our goal is to maximize the average weight of these selected edges, i.e. to maximize\n\nf(S) = (1/|Y|) Σ_{y∈Y} max_{s∈S} w_{s,y},   (9)\n\ngiven some constraints on S. This is indeed the Facility Location problem, if one takes X to be the set of facilities, Y to be the set of customers, and w_{x,y} to be the utility of facility x for customer y. Another interesting instance is the Exemplar-based Clustering problem, in which X = Y is a set of objects and w_{x,y} is the similarity (or inverted distance) between objects x and y, and one tries to find a subset S of exemplars (i.e. centroids) for these objects.\nThe stochastic nature of this problem is revealed when one writes (9) as the expectation f(S) = E_{y∼Γ}[f_y(S)], where Γ is the uniform distribution over Y and f_y(S) := max_{s∈S} w_{s,y}. One can also consider the more general case where the y's are drawn from an unknown distribution, and one tries to maximize the aforementioned expectation.\nFirst, we claim that f_y(·) for each y ∈ Y is again a weighted coverage function. For simplicity, let X = {1, . . . , n} and set m_i := w_{i,y}, with m_1 ≥ · · · ≥ m_n and m_{n+1} := 0.\n\nLemma 5. The utility function f_y(·) is a WCF. Moreover,\n\nF_y(x) = Σ_{i=1}^n (m_i − m_{i+1}) ( 1 − Π_{j=1}^i (1 − x_j) ),   (10)\n\nF̄_y(x) = Σ_{i=1}^n (m_i − m_{i+1}) min{1, Σ_{j=1}^i x_j}.   (11)\n\nWe remark that the gradients of both F_y and F̄_y can be computed in linear time using a recursive procedure. We refer to Appendix E.2 for more details.\n\n5 Experimental Results\n\nWe demonstrate the practical utility of the proposed framework and compare it to standard baselines. We compare the performance of the algorithms in terms of their wall-clock running time and the obtained utility. We consider the following problems:\n• Influence Maximization for the Epinions network². The network consists of 75 879 nodes and 508 837 directed edges. We consider the subgraph induced by the top 10 000 nodes with the largest out-degree and use the independent cascade model [20]. The diffusion model is specified by a fixed probability for each node to influence its neighbors in the underlying graph. We set this probability p to 0.02, and chose the number of seeds k = 50.\n²http://snap.stanford.edu/\n\nFigure 1: In the case of Facility location for Blog selection as well as on influence maximization on Epinions, the proposed approach reaches the same utility significantly faster. On the exemplar-based clustering of CIFAR, the proposed approach is outperformed by STOCHASTIC-GREEDY, but nevertheless reaches 98.4% of the GREEDY utility in a few seconds (after less than 1000 iterations). On Influence Maximization over partition matroids, the proposed approach significantly outperforms GREEDY.\n\n• Facility Location for Blog Selection. We use the data set from [14], consisting of 45 193 blogs and 16 551 cascades. The goal is to detect information cascades/stories spreading over the blogosphere. 
This dataset is heavy-tailed, hence a small random sample of the events has high variance in terms of the cascade sizes. We set k = 100.\n• Exemplar-based Clustering on CIFAR-10. The data set contains 60 000 color images with resolution 32 × 32. We use a single batch of 10 000 images and compare our algorithms to variants of GREEDY over the full data set. We use the Euclidean norm as the distance function and set k = 50. Further details about the preprocessing of the data as well as the formulation of the submodular function can be found in Appendix E.3.\n\nBaselines. In the case of cardinality constraints, we compare our stochastic continuous optimization approach against the most efficient discrete approaches, (LAZY-)GREEDY and (LAZY-)STOCHASTIC-GREEDY, which both provide optimal approximation guarantees. For STOCHASTIC-GREEDY, we vary the parameter ε in order to explore the running time/utility tradeoff. We also report the performance of randomly selected sets. For the two facility location problems, when applying the greedy variants we can evaluate the exact objective (true expectation). In the Influence Maximization application, computing the exact expectation is intractable; hence, we use an empirical average of s samples (cascades) from the model. We note that the number of samples suggested by Proposition 3 is overly conservative, and instead we make a practical choice of s = 10^3 samples.\n\n[Figure 1: utility vs. wall-clock cost plots for Blogs, CIFAR-10, Epinions, and Epinions with a partition matroid, comparing SSM/AdaGrad, LAZY-GREEDY, LAZY-STOCH-GREEDY, and RANDOM-SELECT.]\n\nResults. 
The results are summarized in Figure 1. On the blog selection and influence maximization applications, the proposed continuous optimization approach outperforms STOCHASTIC-GREEDY in terms of the running time/utility tradeoff. In particular, for blog selection we can compute a solution with the same utility 26× faster than STOCHASTIC-GREEDY with ε = 0.5. Similarly, for influence maximization on Epinions we obtain a solution of the same utility 88× faster than STOCHASTIC-GREEDY with ε = 0.1. On the exemplar-based clustering application, STOCHASTIC-GREEDY outperforms the proposed approach. We note that the proposed approach is still competitive, as it recovers 98.4% of the value after fewer than a thousand iterations.

We also include an experiment on influence maximization over partition matroids for the Epinions network. In this case, GREEDY only provides a 1/2 approximation guarantee and STOCHASTIC-GREEDY does not apply. To create the partition, we first sorted all the vertices by their out-degree. Using this order on the vertices, we divided the vertices into two partitions, one containing the vertices in even positions, the other containing the rest. Figure 1 clearly demonstrates that the proposed approach outperforms GREEDY in terms of utility (as well as running time).

Acknowledgments. The research was partially supported by ERC StG 307036. We would like to thank Yaron Singer for helpful comments and suggestions.

References

[1] Alexander A. Ageev and Maxim I. Sviridenko. Pipage rounding: A new method of constructing algorithms with proven performance guarantee. Journal of Combinatorial Optimization, 8(3):307–328, 2004.

[2] Francis Bach et al. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.

[3] Francis R. Bach.
Convex analysis and optimization with submodular functions: a tutorial. CoRR, abs/1010.4207, 2010.

[4] Ashwinkumar Badanidiyuru and Jan Vondrák. Fast algorithms for maximizing submodular functions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1497–1514. SIAM, 2014.

[5] P. Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3(3):163–166, 1984.

[6] Gruia Calinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a submodular set function subject to a matroid constraint. In International Conference on Integer Programming and Combinatorial Optimization, pages 182–196. Springer, 2007.

[7] Gruia Calinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing, 40(6):1740–1766, 2011.

[8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.

[9] Alina Ene and Huy L. Nguyen. Constrained submodular maximization: Beyond 1/e. In Foundations of Computer Science (FOCS), pages 248–257, 2016.

[10] Uriel Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 45(4):634–652, 1998.

[11] Uriel Feige, Vahab S. Mirrokni, and Jan Vondrák. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133–1153, 2011.

[12] Moran Feldman, Joseph Naor, and Roy Schwartz. A unified continuous greedy algorithm for submodular maximization. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 570–579. IEEE, 2011.

[13] Marshall L. Fisher, George L. Nemhauser, and Laurence A. Wolsey. An analysis of approximations for maximizing submodular set functions. In Polyhedral combinatorics, pages 73–87.
Springer, 1978.

[14] Natalie Glance, Matthew Hurst, Kamal Nigam, Matthew Siegler, Robert Stockton, and Takashi Tomokiyo. Deriving marketing intelligence from online discussion. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 419–428, 2005.

[15] Ryan Gomes and Andreas Krause. Budgeted nonparametric learning from data streams. Proceedings of the 27th International Conference on Machine Learning, 2010.

[16] Avinatan Hassidim and Yaron Singer. Submodular optimization under noise. CoRR, abs/1601.03095, 2016.

[17] Thibaut Horel and Yaron Singer. Maximizing approximately submodular functions. NIPS, 2016.

[18] Rishabh K. Iyer and Jeff A. Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In Advances in Neural Information Processing Systems, pages 2436–2444, 2013.

[19] Rishabh K. Iyer and Jeff A. Bilmes. Polyhedral aspects of submodularity, convexity and concavity. CoRR, abs/1506.07329, 2015.

[20] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. 9th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 137–146, 2003.

[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.

[22] Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability: Practical Approaches to Hard Problems, 3(19):8, 2012.

[23] Andreas Krause and Carlos Guestrin. Near-optimal nonmyopic value of information in graphical models. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.

[24] Andreas Krause and Carlos Guestrin. Near-optimal nonmyopic value of information in graphical models. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 324–331.
AUAI Press, 2005.

[25] Ariel Kulik, Hadas Shachnai, and Tami Tamir. Maximizing submodular set functions subject to multiple linear constraints. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 545–554. Society for Industrial and Applied Mathematics, 2009.

[26] K. S. Sesh Kumar and Francis Bach. Active-set methods for submodular minimization problems. hal-01161759v3, 2016.

[27] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 510–520. Association for Computational Linguistics, 2011.

[28] László Lovász. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pages 235–257. Springer, 1983.

[29] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. Association for the Advancement of Artificial Intelligence, 2015.

[30] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1):265–294, 1978.

[31] P. M. Pardalos and N. Kovoor. An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Mathematical Programming, 46(1):321–328, 1990.

[32] Lior Seeman and Yaron Singer. Adaptive seeding in social networks. In Foundations of Computer Science (FOCS), pages 459–468, 2013.

[33] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[34] Adish Singla, Sebastian Tschiatschek, and Andreas Krause. Noisy submodular maximization via adaptive sampling with applications to crowdsourced image collection summarization. In Proc.
Conference on Artificial Intelligence (AAAI), February 2016.

[35] Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. NIPS, 2008.

[36] Jan Vondrák. Submodularity in combinatorial optimization. Charles University, Prague, 2007.

[37] Jan Vondrák. Optimal approximation for the submodular welfare problem in the value oracle model. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 67–74. ACM, 2008.

[38] Jan Vondrák. Symmetry and approximability of submodular maximization problems. SIAM Journal on Computing, 42(1):265–304, 2013.

[39] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Fast multi-stage submodular maximization. In International Conference on Machine Learning (ICML), Beijing, China, 2014.