{"title": "Submodularity Cuts and Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 916, "page_last": 924, "abstract": "Several key problems in machine learning, such as feature selection and active learning, can be formulated as submodular set function maximization.  We present herein a novel algorithm for maximizing a submodular set function under a cardinality constraint --- the algorithm is based on a cutting-plane method and is implemented as an iterative small-scale binary-integer linear programming procedure. It is well known that this problem is NP-hard, and the approximation factor achieved by the greedy algorithm is the theoretical limit for polynomial time. As for (non-polynomial time) exact algorithms that perform reasonably in practice, there has been very little in the literature although the problem is quite important for many applications. Our algorithm is guaranteed to find the exact solution in finite iterations, and it converges fast in practice due to the efficiency of the cutting-plane mechanism. Moreover, we also provide a method that produces successively decreasing upper-bounds of the optimal solution, while our algorithm provides successively increasing lower-bounds.  Thus, the accuracy of the current solution can be estimated at any point, and the algorithm can be stopped early once a desired degree of tolerance is met.  We evaluate our algorithm on sensor placement and feature selection applications showing good performance.", "full_text": "Submodularity Cuts and Applications\n\nThe Inst. of Scienti\ufb01c and Industrial Res. (ISIR),\n\n\u2044\nYoshinobu Kawahara\n\nOsaka Univ., Japan\n\nKiyohito Nagano\n\nDept. of Math. and Comp. Sci.,\nTokyo Inst. of Technology, Japan\n\nkawahara@ar.sanken.osaka-u.ac.jp\n\nnagano@is.titech.ac.jp\n\nComp. Bio. Research Center,\n\nKoji Tsuda\n\nAIST, Japan\n\nJeff A. Bilmes\n\nDept. of Electrical Engineering,\n\nUniv. 
of Washington, USA\n\nkoji.tsuda@aist.go.jp\n\nbilmes@u.washington.edu\n\nAbstract\n\nSeveral key problems in machine learning, such as feature selection and active\nlearning, can be formulated as submodular set function maximization. We present\nherein a novel algorithm for maximizing a submodular set function under a cardinality\nconstraint. The algorithm is based on a cutting-plane method and is\nimplemented as an iterative small-scale binary-integer linear programming procedure.\nIt is well known that this problem is NP-hard, and the approximation factor\nachieved by the greedy algorithm is the theoretical limit for polynomial time. As\nfor (non-polynomial time) exact algorithms that perform reasonably in practice,\nthere has been very little in the literature although the problem is quite important\nfor many applications. Our algorithm is guaranteed to find the exact solution\nin finitely many iterations, and it converges fast in practice due to the efficiency of\nthe cutting-plane mechanism. Moreover, we also provide a method that produces\nsuccessively decreasing upper bounds of the optimal value, while our algorithm\nprovides successively increasing lower bounds. Thus, the accuracy of the current\nsolution can be estimated at any point, and the algorithm can be stopped early\nonce a desired degree of tolerance is met. We evaluate our algorithm on sensor\nplacement and feature selection applications showing good performance.\n\n1 Introduction\n\nIn many fundamental problems in machine learning, such as feature selection and active learning,\nwe try to select a subset of a finite set so that some utility of the subset is maximized. A number of\nsuch utility functions are known to be submodular, i.e., the set function f satisfies f(S) + f(T ) \u2265\nf(S \u2229 T ) + f(S \u222a T ) for all S, T \u2286 V , where V is a finite set [2, 5]. 
This type of function can\nbe regarded as a discrete counterpart of convex functions, and includes entropy, symmetric mutual\ninformation, information gain, graph cut functions, and so on. In recent years, treating machine\nlearning problems as submodular set function maximization (usually under some constraint, such as\nlimited cardinality) has been addressed in the community [10, 13, 22].\n\n\u2217 URL: http://www.ar.sanken.osaka-u.ac.jp/ kawahara/\n\nIn this paper, we address submodular function maximization under a cardinality constraint:\n\nmax_{S \u2286 V} f(S)   s.t. |S| \u2264 k,   (1)\n\nwhere V = {1, 2, . . . , n} and k is a positive integer with k \u2264 n. Note that this formulation is\nconsiderably general and covers a broad range of problems. The main difficulty of this problem\ncomes from a potentially exponentially large number of locally optimal solutions. In the field of\ncombinatorial optimization, it is well known that submodular maximization is NP-hard and that the\napproximation factor of (1 \u2212 1/e) (\u2248 0.63) achieved by the greedy algorithm [19] is the theoretical\nlimit of a polynomial-time algorithm for positive and nondecreasing submodular functions [3]. That\nis, unless P=NP, no polynomial-time algorithm can guarantee in the worst case a solution whose\nfunction value exceeds (1 \u2212 1/e) times the optimal value. In recent years, it has been\nreported that greedy-based algorithms work well in several machine-learning problems [10, 1, 13,\n22]. However, in some applications of machine learning, one seeks a solution closer to the optimum\nthan what is guaranteed by this bound. In feature selection or sensor placement, for example, one\nmay be willing to spend much more time in the selecting phase, since once selected, items are used\nmany times or for a long duration. 
Unfortunately, there has been very little in the literature on\nfinding exact but still practical solutions to submodular maximization [17, 14, 8]. To the best of our\nknowledge, the algorithm by Nemhauser and Wolsey [17] is the only method for exactly maximizing\na general form of nondecreasing submodular functions (other than naive brute force). However, as\nstated below, this approach is inefficient even for moderate problem sizes.\n\nIn this paper, we present a novel algorithm for maximizing a submodular set function under a cardinality\nconstraint based on a cutting-plane method, which is implemented as an iterative small-scale\nbinary-integer linear programming (BILP) procedure. To this end, we derive the submodularity cut,\na cutting plane that cuts off feasible sets whose objective function values are guaranteed\nto be no better than the current best one; it is based on the submodularity of a function and its\nLov\u00e1sz extension [15, 16]. This cut assures convergence to the optimum in a finite number of iterations and\nallows searching for better subsets in an efficient manner so that the algorithm can be applied\nto suitably-sized problems. The existing algorithm [17] is impractical for such problems since, as\noriginally presented, it has no criterion for improving the solution efficiently at each iteration (we\ncompare these algorithms empirically in Sect. 5.1). Moreover, we present a new way to evaluate an\nupper bound of the optimal value with the help of the idea of Nemhauser and Wolsey [17]. This\nenables us to judge the accuracy of the current best solution and to calculate an \u03b5-optimal solution\nfor a predetermined \u03b5 > 0 (cf. Sect. 4). In our algorithm, one needs to iteratively solve small-scale\nBILP (and mixed integer programming (MIP) for the upper bound) problems, which are also\nNP-hard. However, due to their small size, these can be solved using efficient modern software\npackages such as CPLEX. 
Note that BILP is a special case of MIP and is generally more efficient to solve,\nand the presented algorithm can be applied to any submodular function, whereas the existing\none requires the nondecreasing property (see footnote 1). We evaluate the proposed algorithm on the applications of\nsensor placement and feature selection in text classification.\n\nThe remainder of the paper is organized as follows: In Sect. 2, we present submodularity cuts and\ngive a general description of the algorithm using this cutting plane. Then, we describe a specific\nprocedure for performing the submodularity cut algorithm in Sect. 3 and the way of updating an\nupper bound for calculating an \u03b5-optimal solution in Sect. 4. Finally, we give several empirical\nexamples in Sect. 5, and conclude the paper in Sect. 6.\n\n2 Submodularity Cuts and Cutting-Plane Algorithm\nWe start with a subset S_0 \u2286 V of some ground set V with a reasonably good lower bound \u03b3 =\nf(S_0) \u2264 max{f(S) : S \u2286 V }. Using this information, we cut off the feasible sets on which the\nobjective function values are guaranteed to be no better than f(S_0). In this section, we address\na method for solving the submodular maximization problem (1) based on this idea along the line\nof cutting-plane methods, as described by Tuy [23] (see also [6, 7]) and often successfully used in\nalgorithms for solving mathematical programming problems [18, 11, 20].\n\n2.1 Lov\u00e1sz extension\n\nFor dealing with the submodular maximization problem (1) in a way analogous to the continuous\ncounterpart, i.e., convex maximization, we briefly describe a useful extension to submodular functions,\ncalled the Lov\u00e1sz extension [15, 16]. The relationship between the discrete and the continuous,\ndescribed in this subsection, is summarized in Table 1.\n\n1 A submodular function is called nondecreasing if f(A) \u2264 f(B) for A \u2286 B. 
For example, an entropy\nfunction is nondecreasing but a cut function on nodes is not.\n\nTable 1: Correspondence between continuous and discrete.\n\n(discrete)                               (continuous)\nf : 2^V \u2192 R        Eq. (2) =\u21d2      \u02c6f : R^n \u2192 R\nS \u2286 V              Eq. (3) \u21d0\u21d2      I_S \u2208 R^n\nf is submodular    Thm. 1 \u21d0\u21d2       \u02c6f is convex\n\nFigure 1: Illustration of cutting plane H. For H\u2217 and c\u2217, see Section 3.2.\n\nGiven any real vector p \u2208 R^n, we denote the m distinct elements of p by \u02c6p_1 > \u02c6p_2 > \u00b7\u00b7\u00b7 > \u02c6p_m.\nThen, the Lov\u00e1sz extension \u02c6f : R^n \u2192 R corresponding to a general set function f : 2^V \u2192 R, which\nis not necessarily submodular, is defined as\n\n\u02c6f(p) = \u2211_{k=1}^{m\u22121} (\u02c6p_k \u2212 \u02c6p_{k+1}) f(U_k) + \u02c6p_m f(U_m),   (2)\n\nwhere U_k = {i \u2208 V : p_i \u2265 \u02c6p_k}. From the definition, \u02c6f is a piecewise linear (i.e., polyhedral) function\n(see footnote 2). In general, \u02c6f is not convex. However, the following relationship between the submodularity\nof f and the convexity of \u02c6f is given [15, 16]:\nTheorem 1 For a set function f : 2^V \u2192 R and its Lov\u00e1sz extension \u02c6f : R^n \u2192 R, f is submodular\nif and only if \u02c6f is convex.\nNow, we define I_S \u2208 {0, 1}^n as I_S = \u2211_{i\u2208S} e_i, where e_i is the i-th unit vector. Obviously, there is\na one-to-one correspondence between I_S and S. I_S is called the characteristic vector of S (see footnote 3). Then,\nthe Lov\u00e1sz extension \u02c6f is a natural extension of f in the sense that it satisfies the following [15, 16]:\n\n\u02c6f(I_S) = f(S)   (S \u2286 V ).   (3)\n\nIn what follows, we assume that f is submodular. Now we introduce a continuous relaxation of the\nproblem (1) using the Lov\u00e1sz extension \u02c6f. 
A polytope P \u2286 R^n is a bounded intersection of a finite\nset of half-spaces; that is, P is of the form P = {x \u2208 R^n : A_j^T x \u2264 b_j, j = 1,\u00b7\u00b7\u00b7 , m}, where\nA_j is a real vector and b_j is a real scalar. According to the correspondence between discrete and\ncontinuous functions described above, it is natural to replace the objective function f : 2^V \u2192 R and\nthe feasible region {S \u2286 V : |S| \u2264 k} of the problem (1) by the Lov\u00e1sz extension \u02c6f : R^n \u2192 R and\na polytope D_0 \u2286 R^n defined by\n\nD_0 = {x \u2208 R^n : 0 \u2264 x_i \u2264 1 (i = 1,\u00b7\u00b7\u00b7 , n), \u2211_{i=1}^{n} x_i \u2264 k},\n\nrespectively. The resulting problem is a convex maximization problem. For problem (1), we will use\nthe analogy with the way of solving the continuous problem: max{ \u02c6f(x) : x \u2208 D_0}. The question\nis, can we solve it and how good is the solution?\n\n2.2 Submodularity cuts\n\nHere, we derive what we call the submodularity cut, a cutting plane that cuts off the feasible sets\nwith optimality guarantees using the submodularity of f, and with the help of the relationship between\nsubmodularity and convexity described in Thm. 1. Note that the algorithm using this cutting\nplane, described later, converges to an optimal solution in a finite number of iterations (cf. Thm. 
5).\nThe presented technique is essentially a discrete analog of concavity cut techniques for continuous\nconcave minimization, which rests on the following property (see, e.g., [11]).\nTheorem 2 A convex function g : R^n \u2192 R attains its global maximum over a polytope P \u2282 R^n at\na vertex of P .\n\n2 For a submodular function, the Lov\u00e1sz extension (2) is known to be equal to\n\n\u02c6f(p) = sup{p^T x : x \u2208 B(f)}   (p \u2208 R^n),\n\nwhere B(f) = {x \u2208 R^n : x(S) \u2264 f(S) (\u2200S \u2282 V ), x(V ) = f(V )} is the base polyhedron associated with\nf [15] and x(S) = \u2211_{i\u2208S} x_i.\n3 For example, in the case of |V | = 6, the characteristic vector of S = {1, 3, 4} becomes I_S = (1, 0, 1, 1, 0, 0).\n\nFirst, we clarify the relation between discrete and continuous problems. Let P be a polytope with\nP \u2286 D_0. Denote by S(P ) the subsets of V whose characteristic vectors are inside of P , i.e.,\nI_{S\u2032} \u2208 P for any S\u2032 \u2208 S(P ), and denote by V (P ) the set consisting of all vertices of P . Note\nthat any characteristic vector I_S \u2208 P is a vertex of P . Also, there is a one-to-one correspondence\nbetween S(D_0) and V (D_0). Now clearly, we have\n\nmax{f(S\u2032) : S\u2032 \u2208 S(P )} \u2264 max{ \u02c6f(x) : x \u2208 P}.   (4)\n\nIf we can find a subset \u00afP where the function value of \u02c6f is always smaller than the currently-known\nlargest value, any f( \u00afS) for \u00afS \u2208 S( \u00afP ) is also smaller than the value. Thus, the cutting plane for the\nproblem max{ \u02c6f(x) : x \u2208 D_0} can be applied to our problem (1) through the relationship (4).\nTo derive the submodularity cut, we use the following definition:\nDefinition 3 (\u03b3-extension) Let g : R^n \u2192 R be a convex function, x \u2208 R^n, \u03b3 be a real number\nsatisfying \u03b3 \u2265 g(x) and t > 0. 
Then, a point y \u2208 R^n defined by the following formula is called the\n\u03b3-extension of x in direction d \u2208 R^n \\ {0} (with respect to g), where \u03b8 \u2208 R \u222a {\u221e}:\n\ny = x + \u03b8d   with   \u03b8 = sup{t : g(x + td) \u2264 \u03b3}.   (5)\n\nWe may have \u03b8 = \u221e depending on g and d, but this is unproblematic in practice. The \u03b3-extension\nof x \u2208 R^n can be defined with respect to the Lov\u00e1sz extension because it is a convex function.\nThe submodularity cut algorithm is an iterative procedure. At each iteration, the algorithm keeps a\npolytope P \u2286 D_0, the current best function value \u03b3, and a set S\u2217 \u2286 V satisfying f(S\u2217) = \u03b3. We\nconstruct a submodularity cut as follows. Let v \u2208 V (P ) be a vertex of P such that v = I_S for some\nS \u2208 S(P ), and let K = K(v; d_1, . . . , d_n) be a convex polyhedral cone with vertex v generated by\nlinearly independent vectors d_1, . . . , d_n, i.e., K = {v + t_1 d_1 + \u00b7\u00b7\u00b7 + t_n d_n : t_l \u2265 0}. For each\nl = 1,\u00b7\u00b7\u00b7 , n, let y_l = v + \u03b8_l d_l be the \u03b3-extension of v in direction d_l with respect to \u02c6f. We choose\nthe vectors d_1, . . . , d_n so that P \u2282 K and \u03b8_l > 0 (cf. Sect. 3.1). These directions are not necessarily\nchosen tightly on P (in fact, the directions described in Sect. 3.1 enclose not only P but also a larger set).\nSince the vectors d_l are linearly independent, there exists a unique hyperplane H = H(y_1,\u00b7\u00b7\u00b7 , y_n)\nthat contains y_l (l = 1,\u00b7\u00b7\u00b7 , n), which we call a submodularity cut. It is defined by (cf. Fig. 
1)\n\nH = {x : e^T Y^{\u22121} x = 1 + e^T Y^{\u22121} v},   (6)\n\nwhere e = (1,\u00b7\u00b7\u00b7 , 1)^T \u2208 R^n and Y = ((y_1 \u2212 v),\u00b7\u00b7\u00b7 , (y_n \u2212 v)). The hyperplane H generates\ntwo halfspaces H\u2212 = {x : e^T Y^{\u22121} x \u2264 1 + e^T Y^{\u22121} v} and H+ = {x : e^T Y^{\u22121} x \u2265 1 + e^T Y^{\u22121} v}.\nObviously the point v is in the halfspace H\u2212, and moreover, we have:\nLemma 4 Let P \u2286 D_0 be a polytope, \u03b3 be the current best function value, v be a vertex of P such\nthat v = I_S for some S \u2208 S(P ) and H\u2212 be the halfspace determined by the cutting plane, i.e.,\nH\u2212 = {x : e^T Y^{\u22121} x \u2264 1 + e^T Y^{\u22121} v}, where Y = ((y_1 \u2212 v),\u00b7\u00b7\u00b7 , (y_n \u2212 v)) and y_1, . . . , y_n are the\n\u03b3-extensions of v in linearly independent directions d_1, . . . , d_n. Then, it holds that\n\nf(S\u2032) \u2264 \u03b3   for all S\u2032 \u2208 S(P \u2229 H\u2212).\n\nProof Since P \u2282 K = K(I_S; d_1,\u00b7\u00b7\u00b7 , d_n), it follows that P \u2229 H\u2212 is contained in the simplex\nR = [I_S, y_1,\u00b7\u00b7\u00b7 , y_n]. Since the Lov\u00e1sz extension \u02c6f is convex and the maximum of a convex\nfunction over a compact convex set is attained at a vertex of the convex set (Thm. 2), the maximum\nof \u02c6f over R is attained at a vertex of R. Therefore, we have\n\nmax{ \u02c6f(x) : x \u2208 P \u2229 H\u2212} \u2264 max{ \u02c6f(x) : x \u2208 R} = max{ \u02c6f(v), \u02c6f(y_1),\u00b7\u00b7\u00b7 , \u02c6f(y_n)} \u2264 \u03b3.\n\nFrom Eq. (4), max{f(S\u2032) : S\u2032 \u2208 S(P \u2229 H\u2212)} \u2264 max{ \u02c6f(x) : x \u2208 P \u2229 H\u2212} \u2264 \u03b3.\nThe above lemma shows that we can cut off the feasible subsets S(P \u2229 H\u2212) from S(P ) without\nloss of any feasible set whose objective function value is better than \u03b3. If S(P ) = S(P \u2229 H\u2212), then\n\u03b3 = max{f(S) : |S| \u2264 k} is achieved. A specific way to check whether S(P ) = S(P \u2229 H\u2212) will\nbe given in Sect. 3.2. As S \u2208 S(P \u2229 H\u2212) and S \u2209 S(P \u2229 H+) for the set S with v = I_S, we have\n\n|S(P )| > |S(P \u2229 H+)|.   (7)\n\nThe submodularity cut algorithm updates P \u2190 P \u2229 H+ until the global optimality of \u03b3 is guaranteed.\nThe general description is shown in Alg. 1 (also see Fig. 2). Furthermore, the finiteness of the\nalgorithm is assured by the following theorem.\n\nFigure 2: Outline of the submodularity cuts algorithm.\n\nAlgorithm 1 General description of the submodularity cuts algorithm.\n1. Compute a subset S_0 s.t. |S_0| \u2264 k, and set a lower bound \u03b3_0 = f(S_0).\n2. Set P_0 \u2190 D_0, stop \u2190 false, i \u2190 1 and S\u2217 = S_0.\n3. while stop=false do\n4.   Construct with respect to S_{i\u22121}, P_{i\u22121} and \u03b3_{i\u22121} a submodularity cut H^i.\n5.   if S(P_{i\u22121}) = S(P_{i\u22121} \u2229 H^i_\u2212) then\n6.     stop \u2190 true (S\u2217 is an optimal solution and \u03b3_{i\u22121} the optimal value).\n7.   else\n8.     Update \u03b3_i (using S_i and other available information) and set S\u2217 s.t. f(S\u2217) = \u03b3_i.\n9.     Compute S_i \u2208 S(P_i), and set P_i \u2190 P_{i\u22121} \u2229 H^i_+ and i \u2190 i + 1.\n10.   end if\n11. end while\n\nTheorem 5 Alg. 1 gives an optimal solution to the problem (1) in a finite number of iterations.\nProof In the beginning, |S(D_0)| is finite. In view of (7), each iteration decreases |S(P )| by at least\n1. So, the number of iterations is finite.\n\n3 Implementation\n\nIn this section, we describe a specific way to perform Alg. 1 using a binary-integer linear programming\n(BILP) solver. The pseudo-code of the resulting algorithm is shown in Alg. 2.\n\n3.1 Construction of submodularity cuts\nGiven a vertex of a polytope P \u2286 D_0, which is of the form I_S, we describe how to compute linearly\nindependent directions d_1,\u00b7\u00b7\u00b7 , d_n for the construction of the submodularity cut at each iteration of\nthe algorithm (Line 4 in Alg. 1). 
Note that the way described here is just one option and any other\nchoice satisfying P \u2282 K can be substituted.\nIf |S| < k, then directions d_1, . . . , d_n can be chosen as \u2212e_l (l \u2208 S) and e_l (l \u2208 V \\ S). Now we\nfocus on the case where |S| = k. Define a neighbor S^(i,j) of S as\n\nS^(i,j) := (S \\ {i}) \u222a {j}   (i \u2208 S, j \u2208 V \\ S).\n\nThat is, the neighbor S^(i,j) is given by replacing one of the elements of S with that of V \\ S. Note\nthat I_{S^(i,j)} \u2212 I_S = e_j \u2212 e_i for any neighbor S^(i,j) of S. Let S^(i\u2217,j\u2217) be a neighbor that maximizes\nf(S^(i,j)) among all neighbors of S. Since a subset S of size k has k \u00d7 (n \u2212 k) neighbors S^(i,j)\n(i \u2208 S, j \u2208 V \\ S), this computation is O(nk). Suppose that S = {i_1, . . . , i_k} with i_1 = i\u2217\nand V \\ S = {j_{k+1}, . . . , j_n} with j_n = j\u2217. If f(S^(i\u2217,j\u2217)) > \u03b3, we update \u03b3 \u2190 f(S^(i\u2217,j\u2217)) and\nS\u2217 \u2190 S^(i\u2217,j\u2217). Thus, in either case it holds that \u03b3 \u2265 f(S^(i\u2217,j\u2217)). As an example of the set of\ndirections {d_1, . . . , d_n}, we choose\n\nd_l = e_{j\u2217} \u2212 e_{i_l}   if l \u2208 {1, . . . , k},\nd_l = e_{j_l} \u2212 e_{j\u2217}   if l \u2208 {k + 1, . . . , n \u2212 1},   (8)\nd_l = \u2212e_{j\u2217}   if l = n.\n\nIt is easy to see that d_1, . . . , d_n are linearly independent. Moreover, we obtain the following lemma:\n\nLemma 6 For the directions d_1, . . . , d_n defined in (8), a cone\n\nK(I_S; d_1, . . . , d_n) = {I_S + t_1 d_1 + \u00b7\u00b7\u00b7 + t_n d_n : t_l \u2265 0}\n\ncontains the polytope D_0 = {x \u2208 R^n : 0 \u2264 x_l \u2264 1 (l = 1,\u00b7\u00b7\u00b7 , n), \u2211_{l=1}^{n} x_l \u2264 k}.\nThe proof of this lemma is included in the supplementary material (Sect. A). The \u03b3-extensions, i.e.,\n\u03b8\u2019s, in these directions can be obtained in closed forms. The details of this are also included in the\nsupplementary material (Sect. A).\n\nAlgorithm 2 Pseudo-code of the submodularity cuts algorithm using BILP.\n1. Compute a subset S_0 s.t. |S_0| \u2264 k, and set a lower bound \u03b3_0 = f(S_0).\n2. Set P_0 \u2190 D_0, stop \u2190 false, i \u2190 1 and S\u2217 = S_0.\n3. while stop=false do\n4.   Construct with respect to S_{i\u22121}, P_{i\u22121} and \u03b3_{i\u22121} a submodularity cut H.\n5.   Solve the BILP problem (9) with respect to Aj and bj (j = 1,\u00b7\u00b7\u00b7 , nk), and let the optimal\n     solution and value be S_i and c\u2217, respectively.\n6.   if c\u2217 \u2264 1 + e^T Y^{\u22121} v_{i\u22121} then\n7.     stop \u2190 true (S\u2217 is an optimal solution and \u03b3_{i\u22121} the optimal value).\n8.   else\n9.     Update \u03b3_i (using S_i and other available information) and set S\u2217 s.t. f(S\u2217) = \u03b3_i.\n10.     Set P_i \u2190 P_{i\u22121} \u2229 H+ and i \u2190 i + 1.\n11.   end if\n12. end while\n\n3.2 Stopping criterion and next starting point\nNext, we address the checking of optimality, i.e., whether S(P ) = S(P \u2229 H\u2212), and also finding\nthe next starting subset S_i (respectively, in Lines 5 and 9 in Alg. 1). Let \u02dcP \u2286 R^n be the minimum\npolytope containing S(P ). Geometrically, checking S(P ) = S(P \u2229 H\u2212) can be done by considering\na parallel hyperplane H\u2217 of H which is tangent to \u02dcP . If H = H\u2217 or H\u2217 is given by translating\nH towards v, then S(P ) = S(P \u2229 H\u2212). Numerically, such a translation corresponds to linear\nprogramming. Using Eq. (6), we obtain:\nProposition 7 Let c\u2217 be the optimal value of the binary integer program\n\nmax_{x \u2208 {0,1}^n} {e^T Y^{\u22121} x : Ajx \u2265 bj, j = 1,\u00b7\u00b7\u00b7 , mk}.   (9)\n\nThen S(P ) \u2282 H\u2212 if c\u2217 \u2264 1 + e^T Y^{\u22121} v.\nNote that, if c\u2217 > 1 + e^T Y^{\u22121} v, then the optimal solution x\u2217 of Eq. (9) yields a subset of S(P \\ H\u2212)\nwhich can be used as a starting subset of the next iteration (see Fig. 1).\n\n4 Upper bound and \u03b5-optimal solution\n\nAlthough our algorithm can find an exact solution in a finite number of iterations, the computational\ncost could be expensive for a high-dimensional case. Therefore, we present here an iterative update\nof an upper bound of the optimal value, and thus a way to obtain an \u03b5-optimal solution.\nTo this end, we combine the idea of the algorithm by Nemhauser and Wolsey [17] with our cutting\nplane algorithm. Note that this hybrid approach is effective only when f is nondecreasing.\nIf the submodular function f : 2^V \u2192 R is nondecreasing, the submodular maximization problem\n(1) can be reformulated [17] as\n\nmax \u03b7\ns.t. \u03b7 \u2264 f(S) + \u2211_{j\u2208V \\ S} \u03c1_j(S) y_j   (S \u2286 V ),\n     \u2211_{j\u2208V} y_j = k,   y_j \u2208 {0, 1}   (j \u2208 V ),   (10)\n\nwhere \u03c1_j(S) := f(S \u222a {j}) \u2212 f(S). This formulation is a MIP with regard to one continuous and\nn binary variables, and has approximately 2^n constraints. The first type of constraint corresponds\nto all feasible subsets S, and the number of inequalities is as large as 2^n. This approach is therefore\ninfeasible for certain problem sizes. Nemhauser and Wolsey [17] address this problem by adding the\nconstraints one by one and calculating a reduced MIP problem iteratively. 
In the worst case, however,\nthe number of iterations becomes equal to that of the case where all constraints are added. The optimal value\nof a maximization problem with a subset of the constraints is no smaller than that with all constraints, so\nthe good news is that this value is guaranteed to improve (by monotonically decreasing down to\nthe true optimal value) at each iteration. In our algorithm, by contrast, the best current value increases\nmonotonically to the true optimal value. Therefore, by adding the constraint corresponding to S_i at each\niteration of our algorithm and solving the reduced MIP above, we can evaluate an upper bound of\nthe optimal value. Thus, we can assure the optimality of a current solution, or obtain a desired\n\u03b5-optimal solution using both the lower and upper bounds.\n\nFigure 3: Averaged computational time (log-scale) for computing exact and \u03b5-optimal solutions by the submodularity cut\nalgorithm and the existing algorithm by Nemhauser and Wolsey. (Panels: k = 5 and k = 8; x-axis: dimensionality n; y-axis: time (log-scale) [s].)\n\nFigure 4: An example of computational time (log-scale) versus the\ncalculated upper and lower bounds. (x-axis: time (log-scale) [s]; y-axis: function value.)\n\n5 Experimental Evaluation\n\nWe first empirically compare the proposed algorithm with the existing algorithm by Nemhauser and\nWolsey [17] in Sect. 5.1, and then apply the algorithm to the real-world applications of sensor placement\nand feature selection in text classification (Sect. 5.2 and 5.3, respectively). In the experiments,\nwe used the solution by a greedy algorithm as the initial subset S_0. The experiments below were run\non a 2.5 GHz 64-bit workstation using Matlab and a Parallel CPLEX ver. 11.2 (8 threads) through a\nmex function. If \u03b8 = \u221e in Eq. (5), we set \u03b8 = \u03b8_1, where \u03b8_1 is large (i.e., \u03b8_1 = 10^6).\n\n5.1 Artificial example\n\nHere, we evaluate empirically and illustrate the submodularity cut algorithm (Alg. 2) with respect\nto (1) computational time for exact solutions compared with the existing algorithm and (2) how\nfast the algorithm can sandwich the true solution between the upper and lower bounds, using artificial\ndatasets. The considered problem here is the K-location problem [17], i.e., the submodular\nmaximization problem (1) with respect to the nondecreasing submodular function:\n\nf(S) = \u2211_{i=1}^{m} max_{j\u2208S} c_ij,\n\nwhere C = (c_ij) is an m \u00d7 n nonnegative matrix and V = {1,\u00b7\u00b7\u00b7 , n}. We generated several matrices\nC of different size n (we fixed m = n+1), and solved the above problem with respect to k = 5, 8 for\nexact and \u03b5-optimal solutions, using the two algorithms. The graphs in Fig. 3 show the computational\ntime (log-scale) for several n and k = 5, 8, where the results were averaged over three randomly generated\nmatrices C. Note that, for example, the number of combinations exceeds two hundred\nmillion for n = 45 and k = 8. As the figure shows, the computational cost of Alg. 2 was lower than that of the\nexisting algorithm, especially for large search spaces. This could be because the cutting-plane\nalgorithm searches feasible subsets in an efficient manner by eliminating worse ones with the\nsubmodularity cuts. Fig. 4 shows an example of the calculated upper and lower bounds vs.\ntime (k = 5 and n = 45). The lower bound is updated rarely and converges to the optimal solution\nquickly while the upper bound decreases gradually.\n\n5.2 Sensor placements\n\nOur first example with real data is the sensor placements problem, where we try to select sensor\nlocations to minimize the variance of observations. The dataset we used here is temperature measurements\nat discretized finite locations V obtained using the NIMS sensor node deployed at a lake\nnear the University of California, Merced [9, 12] (|V | = 86) (see footnote 4). As in [12], we evaluated the set of\nlocations S \u2286 V using the averaged variance reduction f(S) = Var(\u2205) \u2212 Var(S) = (1/n) \u2211_s F_s(S),\nwhere F_s(S) = \u03c3_s^2 \u2212 \u03c3_{s|S}^2 is the variance reduction and \u03c3_{s|S}^2 denotes the predictive variance at\nlocation s \u2208 V after observing locations S \u2286 V . This function is monotone and submodular. The\ngraphs in Fig. 5 show the computation time of our algorithm, and the accuracy improvement of our\ncalculated solution over that of the greedy algorithm (%), respectively, for \u03b5 = 0.05, 0.1, 0.2. Both\nthe computation time and improvement are large at around k = 5 compared with other choices of k.\nThis is because the greedy solutions are good when k is either very small or large.\n\n4 The covariance matrix of the Gaussian process that models the measurements is available in Matlab Toolbox\nfor Submodular Function Optimization (http://www.cs.caltech.edu/\u223ckrausea/sfo/).\n\nFigure 5: Computational time (left) and accuracy improvement over the greedy algorithm (right).\n\nTable 1: Selected words with [the values of information gain, classification precision].\nk    greedy    \u2192    submodularity cuts\n5    (tonn,\u2018agricultur\u2019,trade,pct,\u2018market\u2019)[2.59,0.53]    \u2192    (\u2018week\u2019,tonn,trade,pct,\u2018washington\u2019)[2.66,0.58]\n10    ( . . .,week,oil,price,\u2018dollar\u2019,\u2018offici\u2019)[3.55,0.57]    \u2192    ( . . .,price,oil,\u2018bank\u2019,\u2018produc\u2019,\u2018blah\u2019)[3.88,0.62]\n\n5.3 Feature selection in text classification\n\nOur second real test case is feature selection in document classification using the Reuters-21578\ndataset. We applied the greedy and submodularity cuts algorithms to the training set that includes\n7,770 documents with 5,180 words (features) and 90 categories, where we used the information\ngain as a criterion [4]. 
Table 1 shows the words selected by the two algorithms for k = 5, 10 (for the proposed algorithm, \u03b5 = 0.003 in both cases), together with the values of information gain and classification precision (tp/(tp + fp), where tp and fp denote the numbers of true and false positives). For classification on the test set (3,019 documents with 5,180 words and 90 categories), we applied a Naive Bayes classifier with the selected features. The submodularity cuts algorithm selected several words different from those selected by the greedy algorithm. We can see that the words selected by our algorithm would have high predictive power even though the number of chosen words is very small.\n\n6 Conclusions\n\nIn this paper, we presented a cutting-plane algorithm for submodular maximization problems, which can be implemented as an iterative binary-integer linear programming procedure. We derived a cutting-plane procedure, called the submodularity cut, based on the submodularity of a set function through the Lov\u00e1sz extension, and showed that this cut ensures that the algorithm converges to the optimum in finitely many iterations. Moreover, we presented a way to evaluate an upper bound of the optimal value, building on Nemhauser and Wolsey [17], which enables us to assess the accuracy of the current best solution and to calculate an \u03b5-optimal solution for a predetermined \u03b5 > 0. Our new algorithm compared favorably in computation time against the existing algorithm on artificial datasets, and also showed improved performance on the real-world applications of sensor placement and feature selection in text classification.\n\nThe submodular maximization problem treated in this paper covers a broad range of applications in machine learning. In future work, we will develop frameworks with \u03b5-optimality guarantees for more general problem settings such as knapsack constraints [21] and submodular functions that are not nondecreasing.\n
This will make the submodularity cuts framework applicable to a still wider variety of machine learning problems.\n\nAcknowledgments\n\nThis research was supported in part by the JSPS Global COE program \u201cComputationism as a Foundation for the Sciences\u201d, KAKENHI (20800019 and 21680025), the JFE 21st Century Foundation, and the Functional RNA Project of the New Energy and Industrial Technology Development Organization (NEDO). Further support was received from a PASCAL2 grant and from NSF grant IIS-0535100. We are also very grateful to the reviewers for their helpful comments.\n\n[Figure 5 plots. Left: x-axis Cardinality (k), y-axis Time (log-scale) [s]. Right: x-axis Cardinality (k), y-axis Improvement [%]. Legend: 0.2, 0.15, 0.1.]\n\nReferences\n\n[1] A. Das and D. Kempe. Algorithms for subset selection in linear regression. In R. E. Ladner and C. Dwork, editors, Proc. of the 40th Annual ACM Symp. on Theory of Computing (STOC 2008), pages 45\u201354, 2008.\n[2] J. Edmonds. Submodular functions, matroids, and certain polyhedra. In R. Guy, H. Hanani, N. Sauer, and J. Sch\u00f6nheim, editors, Combinatorial Structures and Their Applications, pages 69\u201387. Gordon and Breach, 1970.\n[3] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 45:634\u2013652, 1998.\n[4] G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289\u20131305, 2003.\n[5] S. Fujishige. Submodular Functions and Optimization. Elsevier, second edition, 2005.\n[6] F. Glover.\n
Convexity cuts and cut search. Operations Research, 21:123\u2013134, 1973.\n[7] F. Glover. Polyhedral convexity cuts and negative edge extension. Zeitschrift f\u00fcr Operations Research, 18:181\u2013186, 1974.\n[8] B. Goldengorin. Maximization of submodular functions: Theory and enumeration algorithms. European Journal of Operational Research, 198(1):102\u2013112, 2009.\n[9] T. C. Harmon, R. F. Ambrose, R. M. Gilbert, J. C. Fisher, M. Stealey, and W. J. Kaiser. High resolution river hydraulic and water quality characterization using rapidly deployable. Technical report, CENS, 2006.\n[10] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In Proc. of the 23rd Int\u2019l Conf. on Machine Learning (ICML 2006), pages 417\u2013424, 2006.\n[11] R. Horst and H. Tuy. Global Optimization (Deterministic Approaches). Springer, 3rd edition, 1996.\n[12] A. Krause, H. B. McMahan, C. Guestrin, and A. Gupta. Robust submodular observation selection. Journal of Machine Learning Research, 9:2761\u20132801, 2008.\n[13] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9:235\u2013284, 2009.\n[14] H. Lee, G. L. Nemhauser, and Y. Wang. Maximizing a submodular function by integer programming: Polyhedral results for the quadratic case. European Journal of Operational Research, 94:154\u2013166, 1996.\n[15] L. Lov\u00e1sz. Submodular functions and convexity. In A. Bachem, M. Gr\u00f6tschel, and B. Korte, editors, Mathematical Programming \u2013 The State of the Art, pages 235\u2013257. 1983.\n[16] K. Murota. Discrete Convex Analysis, volume 10 of Monographs on Discrete Math and Applications. Society for Industrial & Applied, 2000.\n[17] G. L. Nemhauser and L. A. Wolsey.\n
Maximizing submodular set functions: formulations and analysis of algorithms. In P. Hansen, editor, Studies on Graphs and Discrete Programming, volume 11 of Annals of Discrete Mathematics. 1981.\n[18] G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization. Wiley-Interscience, 1988.\n[19] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions \u2013 I. Mathematical Programming, 14:265\u2013294, 1978.\n[20] M. Porembski. Finitely convergent cutting planes for concave minimization. Journal of Global Optimization, 20(2):109\u2013132, 2001.\n[21] M. Sviridenko. A note on maximizing a submodular set function subject to a knapsack constraint. Operations Research Letters, 32(1):41\u201343, 2004.\n[22] M. Thoma, H. Cheng, A. Gretton, J. Han, H. P. Kriegel, A. J. Smola, L. Song, P. S. Yu, X. Yan, and K. M. Borgwardt. Near-optimal supervised feature selection among frequent subgraphs. In Proc. of the 2009 SIAM Conference on Data Mining (SDM 2009), pages 1075\u20131086, 2009.\n[23] H. Tuy. Concave programming under linear constraints. Soviet Mathematics Doklady, 5:1437\u20131440, 1964.\n", "award": [], "sourceid": 583, "authors": [{"given_name": "Yoshinobu", "family_name": "Kawahara", "institution": null}, {"given_name": "Kiyohito", "family_name": "Nagano", "institution": null}, {"given_name": "Koji", "family_name": "Tsuda", "institution": null}, {"given_name": "Jeff", "family_name": "Bilmes", "institution": null}]}