{"title": "Minimizing a Submodular Function from Samples", "book": "Advances in Neural Information Processing Systems", "page_first": 814, "page_last": 822, "abstract": "In this paper we consider the problem of minimizing a submodular function from training data. Submodular functions can be efficiently minimized and are conse- quently heavily applied in machine learning. There are many cases, however, in which we do not know the function we aim to optimize, but rather have access to training data that is used to learn the function. In this paper we consider the question of whether submodular functions can be minimized in such cases. We show that even learnable submodular functions cannot be minimized within any non-trivial approximation when given access to polynomially-many samples. Specifically, we show that there is a class of submodular functions with range in [0, 1] such that, despite being PAC-learnable and minimizable in polynomial-time, no algorithm can obtain an approximation strictly better than 1/2 \u2212 o(1) using polynomially-many samples drawn from any distribution. Furthermore, we show that this bound is tight using a trivial algorithm that obtains an approximation of 1/2.", "full_text": "Minimizing a Submodular Function from Samples\n\nEric Balkanski\nHarvard University\n\nericbalkanski@g.harvard.edu\n\nAbstract\n\nYaron Singer\n\nHarvard University\n\nyaron@seas.harvard.edu\n\nIn this paper we consider the problem of minimizing a submodular function from\ntraining data. Submodular functions can be ef\ufb01ciently minimized and are conse-\nquently heavily applied in machine learning. There are many cases, however, in\nwhich we do not know the function we aim to optimize, but rather have access\nto training data that is used to learn it. In this paper we consider the question of\nwhether submodular functions can be minimized when given access to its training\ndata. 
We show that even learnable submodular functions cannot be minimized\nwithin any non-trivial approximation when given access to polynomially-many samples.\nSpecifically, we show that there is a class of submodular functions with range\nin [0, 1] such that, despite being PAC-learnable and minimizable in polynomial-time,\nno algorithm can obtain an approximation strictly better than 1/2 \u2212 o(1) using\npolynomially-many samples drawn from any distribution. Furthermore, we show\nthat this bound is tight via a trivial algorithm that obtains an approximation of 1/2.\n\n1 Introduction\n\nFor well over a decade now, submodular minimization has been heavily studied in machine learning\n(e.g. [SK10, JB11, JLB11, NB12, EN15, DTK16]). This focus can be largely attributed to the fact\nthat if a set function f : 2^N \u2192 \u211d is submodular, meaning it has the following property of diminishing\nreturns: f(S \u222a {a}) \u2212 f(S) \u2265 f(T \u222a {a}) \u2212 f(T) for all S \u2286 T \u2286 N and a \u2209 T, then it can\nbe optimized efficiently: its minimizer can be found in time that is polynomial in the size of the\nground set N [GLS81, IFF01]. In many cases, however, we do not know the submodular function,\nand instead learn it from data (e.g. [BH11, IJB13, FKV13, FK14, Bal15, BVW16]). The question\nwe address in this paper is whether submodular functions can be (approximately) minimized when\nthe function is not known but can be learned from training data.\nAn intuitive approach for optimization from training data is to learn a surrogate function from training\ndata that predicts the behavior of the submodular function well, and then find the minimizer of the\nsurrogate learned and use that as a proxy for the true minimizer we seek. The problem, however, is that\nthis approach does not generally guarantee that the resulting solution is close to the true minimum of\nthe function. 
One pitfall is that the surrogate may be non-submodular, and despite approximating the\ntrue submodular function arbitrarily well, the surrogate can be intractable to minimize. Alternatively,\nit may be that the surrogate is submodular, but its minimum is arbitrarily far from the minimum of\nthe true function we aim to optimize (see examples in Appendix A).\nSince optimizing a surrogate function learned from data may generally result in poor approximations,\none may seek learning algorithms that are guaranteed to produce surrogates whose optima well-approximate\nthe true optima and are tractable to compute. More generally, however, it is possible that\nthere is some other approach for optimizing the function from the training samples, without learning\na model. Therefore, at a high level, the question is whether a reasonable number of training samples\nsuffices to minimize a submodular function. We can formalize this as optimization from samples.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nOptimization from samples. We will say that a class of functions F = {f : 2^N \u2192 [0, 1]} is\n\u03b1-optimizable from samples over distribution D if for every f \u2208 F and \u03b4 \u2208 (0, 1), when given\npoly(|N|) i.i.d. samples {(S_i, f(S_i))}_{i=1}^m where S_i \u223c D, with probability at least 1 \u2212 \u03b4 over the\nsamples one can construct an algorithm that returns a solution S \u2286 N s.t.\n\nf(S) \u2212 min_{T \u2286 N} f(T) \u2264 \u03b1.\n\nThis framework was recently introduced in [BRS17] for the problem of submodular maximization\nwhere the standard notion of approximation is multiplicative. For submodular minimization, since\nthe optimum may have zero value, the suitable measure is that of additive approximations for [0, 1]-bounded\nfunctions, and the goal is to obtain a solution which is a o(1) additive approximation to the\nminimum (see e.g. [CLSW16, EN15, SK10]). 
The question is then:\n\nCan submodular functions be minimized from samples?\n\nSince submodular functions can be minimized in polynomial-time, it is tempting to conjecture that\nwhen the function is learnable it also has desirable approximation guarantees from samples, especially\nin light of positive results in related settings of submodular maximization:\n\n\u2022 Constrained maximization. For functions that can be maximized in polynomial time under\na cardinality constraint, like modular and unit-demand functions, there are polynomial\ntime algorithms that obtain an arbitrarily good approximation using polynomially-many\nsamples [BRS16, BRS17]. For general monotone submodular functions which are NP-\nhard to maximize under cardinality constraints, there is no algorithm that can obtain a\nreasonable approximation from polynomially-many samples [BRS17]. For the problem of\nunconstrained minimization, submodular functions can be optimized in polynomial time;\n\u2022 Unconstrained maximization. For unconstrained maximization of general submodular\nfunctions, the problem is NP-hard to maximize (e.g. MAX-CUT) and one seeks constant\nfactor approximations. For this problem, there is an extremely simple algorithm that uses\nno queries and obtains a good approximation: choose elements uniformly at random with\nprobability 1/2 each. This algorithm achieves a constant factor approximation of 1/4 for\ngeneral submodular functions. For symmetric submodular functions (i.e. f (S) = f (N \\ S)),\nthis algorithm is a 1/2-approximation which is optimal, since no algorithm can obtain an\napproximation ratio strictly better than 1/2 using polynomially-many value queries, even\nfor symmetric submodular functions [FMV11]. For unconstrained symmetric submodular\nminimization, there is an appealing analogue: the empty set and the ground set N are\nguaranteed to be minimizers of the function (see Section 2). This algorithm, of course, uses\nno queries either. 
The parallel between these two problems seems quite intuitive, and it\nis tempting to conjecture that, as for unconstrained submodular maximization, there are\noptimization from samples algorithms for general unconstrained submodular minimization\nwith good approximation guarantees.\n\nMain result. Somewhat counter-intuitively, we show that despite being computationally tractable\nto optimize, submodular functions cannot be minimized from samples to within a desirable guarantee,\neven when these functions are learnable. In particular, we show that there is no algorithm for\nminimizing a submodular function from polynomially-many samples drawn from any distribution\nthat obtains an additive approximation of 1/2 \u2212 o(1), even when the function is PAC-learnable.\nFurthermore, we show that this bound is tight: the algorithm which returns the empty set or ground\nset each with probability 1/2 achieves a 1/2 additive approximation. Notice that this also implies that\nin general, there is no learning algorithm that can produce a surrogate whose minimum is close to the\nminimum of the function we aim to optimize, as otherwise this would contradict our main result.\n\nTechnical overview. At a high level, hardness results in optimization from samples are shown\nby constructing a family of functions, where the values of functions in the family are likely to be\nindistinguishable for the samples, while having very different optimizers. The main technical difficulty\nis to construct a family of functions that concurrently satisfy these two properties (indistinguishability\nand different optimizers), and that are also PAC-learnable. En route to our main construction, we\nfirst construct a family of functions that are completely indistinguishable given samples drawn from\nthe uniform distribution, in which case we obtain a 1/2 \u2212 o(1) impossibility result (Section 2). 
The\n\n2\n\n\fgeneral result that holds for any distribution requires heavier machinery to argue about more general\nfamilies of functions where some subset of functions can be distinguished from others given samples.\nInstead of satisfying the two desired properties for all functions in a \ufb01xed family, we show that these\nproperties hold for all functions in a randomized subfamily (Section 3.2). We then develop an ef\ufb01cient\nlearning algorithm for the family of functions constructed for the main hardness result (Section 3.3).\nThis algorithm builds multiple linear regression predictors and a classi\ufb01er to direct a fresh set to the\nappropriate linear predictor. The learning of the classi\ufb01er and the linear predictors relies on multiple\nobservations about the speci\ufb01c structure of this class of functions.\n\n1.1 Related work\n\nThe problem of optimization from samples was introduced in the context of constrained submodular\nmaximization [BRS17, BRS16]. In general, for maximizing a submodular function under a cardinality\nconstraint, no algorithm can obtain a constant factor approximation guarantee from any samples. As\ndiscussed above, for special classes of submodular functions that can be optimized in polynomial time\nunder a cardinality constraint, and for unconstrained maximization, there are desirable optimization\nfrom samples guarantees. It is thus somewhat surprising that submodular minimization, which is an\nunconstrained optimization problem that is optimizable in polynomial time in the value query model,\nis hard to optimize from samples. From a technical perspective the constructions are quite different.\nIn maximization, the functions constructed in [BRS17, BRS16] are monotone so the ground set\nwould be an optimal solution if the problem was unconstrained. Instead, we need to construct novel\nnon-monotone functions. 
In convex optimization, recent work shows a tight 1/2-inapproximability\nfor convex minimization from samples [BS17]. Although there is a conceptual connection between\nthat paper and this one, from a technical perspective these papers are orthogonal. The discrete\nanalogue of the family of convex functions constructed in that paper is not (even approximately) a\nfamily of submodular functions, and the constructions are significantly different.\n\n2 Warm up: the Uniform Distribution\n\nAs a warm up to our main impossibility result, we sketch a tight lower bound for the special case in\nwhich the samples are drawn from the uniform distribution. At a high level, the idea is to construct a\nfunction which considers some special subset of \u201cgood\u201d elements that make its value drop when a\nset contains all such \u201cgood\u201d elements. When samples are drawn from the uniform distribution and\n\u201cgood\u201d elements are sufficiently rare, there is a relatively simple construction that obfuscates which\nelements the function considers \u201cgood\u201d, which then leads to the inapproximability.\n\n2.1 Hardness for uniform distribution\nWe construct a family of functions F where f_i \u2208 F is defined in terms of a set G_i \u2282 N of size \u221an.\nFor each such function we call G_i the set of good elements, and B_i = N \\ G_i its bad elements. We\ndenote the number of good and bad elements in a set S by g_S and b_S, dropping the subscripts (S and\ni) when clear from context, so g = |G_i \u2229 S| and b = |B_i \u2229 S|. The function f_i is defined as follows:\n\nf_i(S) := 1/2 + (1/(2n)) \u00b7 (g + b)   if g < \u221an\nf_i(S) := (1/(2n)) \u00b7 b   if g = \u221an\n\nIt is easy to verify that these functions are submodular with range in [0, 1] (see illustration in\nFigure 1a). Given samples drawn uniformly at random (u.a.r.), it is impossible to distinguish good\nand bad elements since with high probability (w.h.p.) g < \u221an for all samples. Informally, this\nimplies that a good learner for F over the uniform distribution D is f'(S) = 1/2 + |S|/(2n).\nIntuitively, F is not 1/2 \u2212 o(1) optimizable from samples because if an algorithm cannot learn the\nset of good elements G_i, then it cannot find S such that f_i(S) < 1/2 \u2212 o(1), whereas the optimal solution\nS*_i = G_i has value f_i(G_i) = 0.\nTheorem 1. Submodular functions f : 2^N \u2192 [0, 1] are not 1/2 \u2212 o(1) optimizable from samples\ndrawn from the uniform distribution for the problem of submodular minimization.\n\nProof. The details for the derivation of concentration bounds are in Appendix B. Consider f_k drawn\nu.a.r. from F and let f* = f_k and G* = G_k. Since the samples are all drawn from the uniform\ndistribution, by standard application of the Chernoff bound we have that every set S_i in the sample\nrespects |S_i| \u2264 3n/4, w.p. 1 \u2212 e^{\u2212\u03a9(n)}. For sets S_1, . . . , S_m, all of size at most 3n/4, when f_j is\ndrawn u.a.r. from F we get that |S_i \u2229 G_j| < \u221an, w.p. 1 \u2212 e^{\u2212\u03a9(n^{1/2})} for all i \u2208 [m], again by\nChernoff, and since m = poly(n). Notice that this implies that w.p. 1 \u2212 e^{\u2212\u03a9(n^{1/2})} for all i \u2208 [m]:\n\nf_j(S_i) = 1/2 + |S_i|/(2n).\n\nNow, let F' be the collection of all functions f_j for which f_j(S_i) = 1/2 + |S_i|/(2n) on all sets\n{S_i}_{i=1}^m. The argument above implies that |F'| = (1 \u2212 e^{\u2212\u03a9(n^{1/2})})|F|. Thus, since f* is drawn u.a.r.\nfrom F we have that f* \u2208 F' w.p. 1 \u2212 e^{\u2212\u03a9(n^{1/2})}, and we condition on this event.\nLet S be the (possibly randomized) solution returned by the algorithm. Observe that S is independent\nof f* \u2208 F'. In other words, the algorithm cannot learn any information about which function in F'\ngenerates the samples. By Chernoff, if we fix S and choose f u.a.r. from F, then, w.p. 1 \u2212 e^{\u2212\u03a9(n^{1/6})}:\n\nf(S) \u2265 1/2 \u2212 o(1).\n\nSince |F'|/|F| = 1 \u2212 e^{\u2212\u03a9(n^{1/2})}, it is also the case that f*(S) \u2265 1/2 \u2212 o(1) w.p. 1 \u2212 e^{\u2212\u03a9(n^{1/6})} over the\nchoice of f* \u2208 F'. By the probabilistic method and since all the events we conditioned on occur\nwith exponentially high probability, there exists f* \u2208 F s.t. the value of the set S returned by the\nalgorithm is at least 1/2 \u2212 o(1) whereas the optimal solution is f*(G*) = 0.\n\n2.2 A tight upper bound\nWe now show that the result above is tight. In particular, by randomizing between the empty set and\nthe ground set we get a solution whose expected value is within an additive 1/2 of the minimum. In the case of symmetric functions, i.e.\nf(S) = f(N \\ S) for all S \u2286 N, \u2205 and N are minima since f(N) + f(\u2205) \u2264 f(S) + f(N \\ S) for\nall S \u2286 N as shown below (see also footnote 1). Notice that this does not require any samples.\nProposition 2. The algorithm which returns the empty set \u2205 or the ground set N with probability 1/2\neach is a 1/2 additive approximation for the problem of unconstrained submodular minimization.\n\nProof. Let S \u2286 N, and observe that\n\nf(N \\ S) \u2212 f(\u2205) \u2265 f((N \\ S) \u222a S) \u2212 f(S) = f(N) \u2212 f(S),\n\nwhere the inequality is by submodularity. Thus, we obtain\n\n(1/2) \u00b7 (f(N) + f(\u2205)) \u2264 (1/2) \u00b7 f(S) + (1/2) \u00b7 f(N \\ S) \u2264 f(S) + 1/2.\n\nIn particular, this holds for S \u2208 argmin_{T \u2286 N} f(T).\n\n3 General Distribution\n\nIn this section, we show our main result, namely that there exists a family of submodular functions\nsuch that, despite being PAC-learnable for all distributions, no algorithm can obtain an approximation\nbetter than 1/2 \u2212 o(1) for the problem of unconstrained minimization.\nThe functions in this section build upon the previous construction, though are inevitably more involved\nin order to achieve learnability and inapproximability on any distribution. The functions constructed\nfor the uniform distribution do not yield inapproximability for general distributions due to the fact that\nthe indistinguishability between two functions no longer holds when sets S of large size are sampled\nwith non-negligible probability. 
Intuitively, in the previous construction, once a set is sufficiently\nlarge the good elements of the function can be distinguished from the bad ones. The main idea to get\naround this issue is to introduce masking elements M. We construct functions such that, for sets S of\nlarge size, good and bad elements are indistinguishable if S contains at least one masking element.\n\nFootnote 1: Although \u2205 and N are trivial minima if f is symmetric, the problem of minimizing a symmetric submodular\nfunction over proper nonempty subsets is non-trivial (see [Que98]).\n\n[Figure 1, two panels plotting f(S) against |S|: (a) Uniform distribution; (b) General distribution.]\n\nFigure 1: An illustration of the value of a set S of good (blue), bad (red), and masking (green)\nelements as a function of |S| for the functions constructed. For the general distribution case, we\nalso illustrate the value of a set S of good (dark blue) and bad (dark red) elements when S also\ncontains at least one masking element.\n\nThe construction. Each function f_i \u2208 F is defined in terms of a partition P_i of the ground set\ninto good, bad, and masking elements. The partitions we consider are P_i = (G_i, B_i, M_i) with\n|G_i| = n/2, |B_i| = \u221an, and |M_i| = n/2 \u2212 \u221an. Again, when clear from context, we drop indices of\ni and S and the number of good, bad, and masking elements in a set S are denoted by g, b, and m.\nFor such a given partition P_i, the function f_i is defined as follows (see illustration in Figure 1b):\n\nf_i(S) = 1/2 + (1/(2\u221an)) \u00b7 (b + g)   Region X: if m = 0 and g < n^{1/4}\nf_i(S) = 1/2 + (1/(2\u221an)) \u00b7 (b + n^{1/4}) \u2212 (1/n) \u00b7 (g \u2212 n^{1/4})   Region Y: if m = 0 and g \u2265 n^{1/4}\nf_i(S) = 1/2 + 1/2 \u2212 (1/n) \u00b7 (b + g)   Region Z: otherwise\n\n3.1 Submodularity\nIn the appendix, we prove that the functions f_i constructed as above are indeed submodular\n(Lemma 10). By rescaling f_i with an additive term of n^{1/4}/(2\u221an) = 1/(2n^{1/4}), it can be easily\nverified that its range is in [0, 1]. We use the non-normalized definition as above for ease of notation.\n\n3.2 Inapproximability\nWe now show that F cannot be minimized within a 1/2 \u2212 o(1) approximation given samples from\nany distribution. We first define F^M, which is a randomized subfamily of F. We then give a general\nlemma that shows that if two conditions of indistinguishability and gap are satisfied then we obtain\ninapproximability. We then show that these two conditions are satisfied for the subfamily F^M.\nA randomization over masking elements.\nInstead of considering a function f drawn u.a.r. from\nF as in the uniform case, we consider functions f in a randomized subfamily of functions F^M \u2286 F\nto obtain the indistinguishability and gap conditions. 
Given the family of functions F, let M be a\nuniformly random subset of size n/2 \u2212 \u221an and define F^M \u2282 F:\nF^M := {f_i \u2208 F : (G_i, B_i, M)}.\n\nSince masking elements are distinguishable from good and bad elements, they need to be the same\nset of elements for each function in family F^M to obtain indistinguishability of functions in F^M.\nThe inapproximability lemma.\nIn addition to this randomized subfamily of functions, another\nmain conceptual departure of the following inapproximability lemma from the uniform case is that\nno assumption can be made about the samples, such as their size, since the distribution is arbitrary.\nWe denote by U(A) the uniform distribution over the set A.\n\nLemma 3. Let F be a family of functions and F' = {f_1, . . . , f_p} \u2286 F be a subfamily of functions\ndrawn from some distribution. Assume the following two conditions hold:\n\n1. Indistinguishability. For all S \u2286 N, w.p. 1 \u2212 e^{\u2212\u03a9(n^{1/4})} over F': for every f_i, f_j \u2208 F',\n\nf_i(S) = f_j(S);\n\n2. \u03b1-gap. Let S*_i be a minimizer of f_i, then w.p. 1 over F': for all S \u2286 N,\n\nE_{f_i \u223c U(F')} [f_i(S) \u2212 f_i(S*_i)] \u2265 \u03b1;\n\nThen, F is not \u03b1-minimizable from strictly less than e^{\u03a9(n^{1/4})} samples over any distribution D.\nNote that the ordering of the quantifiers is crucial. The proof is deferred to the appendix, but the main\nideas are summarized as follows. We use a probabilistic argument to switch from the randomization\nover F' to the randomization over S \u223c D and show that there exists a deterministic subfamily {f_1, . . . , f_p} \u2286 F such that\nf_i(S) = f_j(S) for all i, j \u2208 [p] w.h.p. over S \u223c D. By a union bound this holds for all samples\nS. Thus, for such a subfamily, the choices of an algorithm that is given\nsamples from f_i for i \u2208 [p] are independent of i. 
By the \u03b1-gap condition, this implies that there exists\nf_i in this subfamily for which a solution S returned by the algorithm is at least \u03b1 away from f_i(S*_i).\nIndistinguishability and gap of F. We now show the indistinguishability and gap conditions, with\n\u03b1 = 1/2 \u2212 o(1), which immediately imply a 1/2 \u2212 o(1) inapproximability by Lemma 3. For the\nindistinguishability, it suffices to show that good and bad elements are indistinguishable since the\nmasking elements are identical for all functions in F^M. Good and bad elements are indistinguishable\nsince, w.h.p., a set S is not in region Y, which is the only region distinguishing good and bad elements.\nLemma 4. For all S \u2286 N s.t. |S| < n^{1/4}: For all f_i \u2208 F^M,\n\nf_i(S) = 1/2 + (1/(2\u221an)) \u00b7 (b + g)   if m = 0 (Region X)\nf_i(S) = 1/2 + 1/2 \u2212 (1/n) \u00b7 (b + g)   otherwise (Region Z)\n\nand for all S \u2286 N such that |S| \u2265 n^{1/4}, with probability 1 \u2212 e^{\u2212\u03a9(n^{1/4})} over F^M: For all f_i \u2208 F^M,\n\nf_i(S) = 1 \u2212 (1/n) \u00b7 (b + g)   (Region Z).\n\nProof. Let S \u2286 N. If |S| < n^{1/4}, then the proof follows immediately from the definition of f_i. If\n|S| \u2265 n^{1/4}, then the number of masking elements m in S is m = |M \u2229 S| for all f_i \u2208 F^M. We\nthen get m \u2265 1, for all f_i \u2208 F^M, with probability 1 \u2212 e^{\u2212\u03a9(n^{1/4})} over F^M by Chernoff bound. The\nproof then follows again immediately from the definition of f_i.\n\nNext, we show the gap. The gap arises because the good elements can be any subset of N \\ M.\nLemma 5. Let S*_i be a minimizer of f_i. With probability 1 over F^M, for all S \u2286 N,\n\nE_{f_i \u223c U(F^M)} [f_i(S)] \u2265 1/2 \u2212 o(1).\n\nProof. Let S \u2286 N and f_i \u223c U(F^M). Note that the order of the quantifiers in the statement of the\nlemma implies that S can be dependent on M, but that it is independent of i. There are three cases. If\nm \u2265 1, then S is in region Z and f_i(S) \u2265 1/2. If m = 0 and |S| \u2264 n^{7/8}, then S is in region X or Y\nand f_i(S) \u2265 1/2 \u2212 n^{7/8}/n = 1/2 \u2212 o(1). 
Otherwise, m = 0 and |S| \u2265 n^{7/8}. Since S is independent\nof i, by Chernoff bound, we get\n\n(1 \u2212 o(1)) \u00b7 |S| \u2264 ((n/2 + \u221an)/\u221an) \u00b7 b and ((n/2 + \u221an)/(n/2)) \u00b7 g \u2264 (1 + o(1)) \u00b7 |S|\n\nwith probability 1 \u2212 e^{\u2212\u03a9(n^{1/4})}. Thus S is in region Y and\n\nf_i(S) \u2265 1/2 + (1 \u2212 o(1)) \u00b7 (1/(2\u221an)) \u00b7 (\u221an/(n/2 + \u221an)) \u00b7 |S| \u2212 (1 + o(1)) \u00b7 (1/n) \u00b7 ((n/2)/(n/2 + \u221an)) \u00b7 |S| \u2265 1/2 \u2212 o(1).\n\nThus, we obtain E_{f_i \u223c U(F^M)} [f_i(S)] \u2265 1/2 \u2212 o(1).\n\nCombining the above three lemmas, we obtain the inapproximability result.\nLemma 6. The problem of submodular minimization cannot be approximated with a 1/2 \u2212 o(1)\nadditive approximation given poly(n) samples from any distribution D.\nProof. For any set S \u2286 N, observe that the number g + b of elements in S that are either good or\nbad is the same for any two functions f_i, f_j \u2208 F^M and for any F^M. Thus, by Lemma 4, we obtain\nthe indistinguishability condition. Next, the optimal solution S*_i = G_i of f_i has value f_i(G_i) = o(1),\nso by Lemma 5, we obtain the \u03b1-gap condition with \u03b1 = 1/2 \u2212 o(1). Thus F is not 1/2 \u2212 o(1)\nminimizable from samples from any distribution D by Lemma 3. The class of functions F is a class\nof submodular functions by Lemma 10 (in Appendix C).\n\n3.3 Learnability of F\nWe now show that every function in F is efficiently learnable from samples drawn from any distribution\nD. Specifically, we show that for any \u03b5, \u03b4 \u2208 (0, 1) the functions are (\u03b5, \u03b4)-PAC learnable\nwith the absolute loss function (or any Lipschitz loss function) using poly(1/\u03b5, 1/\u03b4, n) samples and\nrunning time. At a high level, since each function f_i is piecewise-linear over three different regions\nX_i, Y_i, and Z_i, the main idea is to exploit this structure by first training a classifier to distinguish\nbetween regions and then apply linear regression in different regions.\n\nThe learning algorithm. 
Since every function f \u2208 F is piecewise linear over three different\nregions, there are three different linear functions f_X, f_Y, f_Z s.t. for every S \u2286 N its value f(S)\ncan be expressed as f_R(S) for some region R \u2208 {X, Y, Z}. The learning algorithm produces a\npredictor f\u0303 by using a multi-label classifier and a set of linear predictors {f_{X\u0303}, f_{Y\u0303}} \u222a {\u222a_{i \u2208 M\u0303} f_{Z\u0303_i}}.\nThe multi-label classifier creates a mapping from sets to regions, g : 2^N \u2192 {X\u0303, Y\u0303} \u222a {\u222a_{i \u2208 M\u0303} Z\u0303_i},\ns.t. X, Y, Z are approximated by X\u0303, Y\u0303, \u222a_{i \u2208 M\u0303} Z\u0303_i. Given a sample S \u223c D, the algorithm then\nreturns f\u0303(S) = f_{g(S)}(S). We give a formal description below (detailed description is in Appendix D).\n\nAlgorithm 1 A learning algorithm for f \u2208 F which combines classification and linear regression.\nInput: samples S = {(S_j, f(S_j))}_{j \u2208 [m]}\n(Z\u0303, M\u0303) \u2190 (\u2205, \u2205)\nfor i = 1 to n do\n  Z\u0303_i \u2190 {S : a_i \u2208 S, S \u2209 Z\u0303}\n  f_{Z\u0303_i} \u2190 ERM_reg({(S_j, f(S_j)) : S_j \u2208 Z\u0303_i})   (linear regression)\n  if \u2211_{(S_j, f(S_j)) : S_j \u2208 Z\u0303_i} |f_{Z\u0303_i}(S_j) \u2212 f(S_j)| = 0 then\n    Z\u0303 \u2190 Z\u0303 \u222a Z\u0303_i, M\u0303 \u2190 M\u0303 \u222a {a_i}\nC \u2190 ERM_cla({(S_j, f(S_j)) : S_j \u2209 Z\u0303, j \u2264 m/2})   (train a classifier for regions X, Y)\n(X\u0303, Y\u0303) \u2190 ({S : S \u2209 Z\u0303, C(S) = 1}, {S : S \u2209 Z\u0303, C(S) = \u22121})\nreturn f\u0303 \u2190 S \u21a6\n  |S|/(2\u221an)   if S \u2208 X\u0303\n  f_{Y\u0303}(S) = ERM_reg({(S_j, f(S_j)) : S_j \u2208 Y\u0303, j > m/2})   if S \u2208 Y\u0303\n  f_{Z\u0303_i}(S) : i = min({i' : a_{i'} \u2208 S \u2229 M\u0303})   if S \u2208 Z\u0303\n\nOverview of analysis of the learning algorithm. There are two main challenges in training the\nalgorithm. The first is that the region X, Y, or Z that a sample (S_j, f(S_j)) belongs to is not known.\nThus, even before being able to train a classifier which learns the regions X\u0303, Y\u0303, Z\u0303 using the samples,\nwe need to learn the region a sample S_j belongs to using f(S_j). 
The second is that the samples S_R\nused for training a linear regression predictor f_R over region R need to be carefully selected so that\nS_R is a collection of i.i.d. samples from the distribution S \u223c D conditioned on S \u2208 R (Lemma 20).\nWe first discuss the challenge of labeling samples with the region they belong to. Observe that for a\nfixed masking element a_i \u2208 M, f \u2208 F is linear over all sets S containing a_i since these sets are all in\nregion Z. Thus, there must exist a linear regression predictor f_{Z\u0303_i} = ERM_reg(\u00b7) with zero empirical\nloss over all samples S_j containing a_i if a_i \u2208 M (and thus S_j \u2208 Z). ERM_reg(\u00b7) minimizes the\nempirical loss on the input samples over the class of linear regression predictors with bounded norm\n(Lemma 19).\n\nFigure 2: An illustration of the regions. The dots represent the samples, the corresponding full circles represent\nthe regions X (red), Y (blue), and Z (green). The ellipsoids represent the regions X\u0303, Y\u0303, Z\u0303 learned by the\nclassifier. Notice that Z\u0303 has no false negatives.\n\nIf f_{Z\u0303_i} has zero empirical loss, we directly classify any set S containing a_i as being in\nZ\u0303. Next, for a sample (S_j, f(S_j)) not in Z\u0303, we can label these samples since S_j \u2208 X if and only if\nf(S_j) = |S_j|/(2\u221an). With these labeled samples S', we train a binary classifier C = ERM_cla(S')\nthat indicates if S s.t. S \u2209 Z\u0303 is in region X\u0303 or Y\u0303. ERM_cla(S') minimizes the empirical loss on\nlabeled samples S' over the class of halfspaces w \u2208 \u211d^n (Lemma 23).\nRegarding the second challenge, we cannot use all samples S_j s.t. S_j \u2208 Y\u0303 to train a linear predictor\nf_{Y\u0303} for region Y\u0303 since these same samples were used to define Y\u0303, so they are not a collection of i.i.d.\nsamples from the distribution S \u223c D conditioned on S \u2208 Y\u0303. 
To get around this issue, we partition\nthe samples into two distinct collections, one to train the classifier C and one to train f_{Y\u0303} (Lemma 24).\nNext, given T \u2208 Z\u0303, we predict f_{Z\u0303_i}(T) where i is s.t. a_i \u2208 T \u2229 M\u0303 (breaking ties lexicographically)\nwhich performs well since f\u0303_{Z\u0303_i} has zero empirical error for a_i \u2208 M\u0303 (Lemma 22). Since we break\nties lexicographically, f\u0303_{Z\u0303_i} must be trained over samples S_j such that a_i \u2208 S_j and a_{i'} \u2209 S_j for i' s.t.\ni' < i and a_{i'} \u2208 M\u0303 to obtain i.i.d. samples from the same distribution as T \u223c D conditioned on T\nbeing directed to f\u0303_{Z\u0303_i} (Lemma 21).\nThe analysis of the learning algorithm leads to the following main learning result.\nLemma 7. Let f\u0303 be the predictor returned by Algorithm 1, then w.p. 1 \u2212 \u03b4 over m \u2208 O(n^3 +\nn^2 (log(2n/\u03b4))/\u03b5^2) samples S drawn i.i.d. from any distribution D, E_{S \u223c D}[|f\u0303(S) \u2212 f(S)|] \u2264 \u03b5.\n3.4 Main Result\nWe conclude this section with our main result which combines Lemmas 6 and 7.\nTheorem 8. There exists a family of [0, 1]-bounded submodular functions F that is efficiently PAC-learnable\nand that cannot be optimized from polynomially many samples drawn from any distribution\nD within a 1/2 \u2212 o(1) additive approximation for unconstrained submodular minimization.\n4 Discussion\n\nIn this paper, we studied the problem of submodular minimization from samples. Our main result\nis an impossibility, showing that even for learnable submodular functions it is impossible to find\na non-trivial approximation to the minimizer with polynomially-many samples, drawn from any\ndistribution. In particular, this implies that minimizing a general submodular function learned from\ndata cannot yield desirable guarantees. In general, it seems that the intersection between learning and\noptimization is elusive, and a great deal still remains to be explored.\n\nReferences\n\n[Bal15] Maria-Florina Balcan. 
Learning submodular functions with applications to multi-agent systems. In\nProceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems,\nAAMAS 2015, Istanbul, Turkey, May 4-8, 2015, page 3, 2015.\n\n[BH11] Maria-Florina Balcan and Nicholas JA Harvey. Learning submodular functions. In Proceedings of\n\nthe forty-third annual ACM symposium on Theory of computing, pages 793\u2013802. ACM, 2011.\n\n[BRS16] Eric Balkanski, Aviad Rubinstein, and Yaron Singer. The power of optimization from samples. In\n\nAdvances in Neural Information Processing Systems, pages 4017\u20134025, 2016.\n\n[BRS17] Eric Balkanski, Aviad Rubinstein, and Yaron Singer. The limitations of optimization from samples.\n\nProceedings of the Forty-Ninth Annual ACM on Symposium on Theory of Computing, 2017.\n\n[BS17] Eric Balkanski and Yaron Singer. The sample complexity of optimizing a convex function. In\n\nCOLT, 2017.\n\n[BVW16] Maria-Florina Balcan, Ellen Vitercik, and Colin White. Learning combinatorial functions from\npairwise comparisons. In Proceedings of the 29th Conference on Learning Theory, COLT 2016,\nNew York, USA, June 23-26, 2016, pages 310\u2013335, 2016.\n\n[CLSW16] Deeparnab Chakrabarty, Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. Subquadratic\n\nsubmodular function minimization. arXiv preprint arXiv:1610.09800, 2016.\n\n[DTK16] Josip Djolonga, Sebastian Tschiatschek, and Andreas Krause. Variational inference in mixed\nprobabilistic submodular models. In Advances in Neural Information Processing Systems 29: Annual\nConference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona,\nSpain, pages 1759\u20131767, 2016.\n\n[EN15] Alina Ene and Huy L. Nguyen. Random coordinate descent methods for minimizing decomposable\nsubmodular functions. In Proceedings of the 32nd International Conference on Machine Learning,\nICML 2015, Lille, France, 6-11 July 2015, pages 787\u2013795, 2015.\n\n[FK14] Vitaly Feldman and Pravesh Kothari. 
Learning coverage functions and private release of marginals.\n\nIn COLT, pages 679\u2013702, 2014.\n\n[FKV13] Vitaly Feldman, Pravesh Kothari, and Jan Vondr\u00e1k. Representation, approximation and learning of\n\nsubmodular functions using low-rank decision trees. In COLT, pages 711\u2013740, 2013.\n\n[FMV11] Uriel Feige, Vahab S. Mirrokni, and Jan Vondr\u00e1k. Maximizing non-monotone submodular functions.\n\nSIAM J. Comput., 40(4):1133\u20131153, 2011.\n\n[GLS81] Martin Gr\u00f6tschel, L\u00e1szl\u00f3 Lov\u00e1sz, and Alexander Schrijver. The ellipsoid method and its consequences\n\nin combinatorial optimization. Combinatorica, 1(2):169\u2013197, 1981.\n\n[IFF01] Satoru Iwata, Lisa Fleischer, and Satoru Fujishige. A combinatorial strongly polynomial algorithm\n\nfor minimizing submodular functions. J. ACM, 48(4):761\u2013777, 2001.\n\n[IJB13] Rishabh K. Iyer, Stefanie Jegelka, and Jeff A. Bilmes. Curvature and optimal algorithms for learning\nand minimizing submodular functions. In Advances in Neural Information Processing Systems 26:\n27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting\nheld December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2742\u20132750, 2013.\n\n[JB11] Stefanie Jegelka and Jeff Bilmes. Submodularity beyond submodular energies: coupling edges in\ngraph cuts. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages\n1897\u20131904. IEEE, 2011.\n\n[JLB11] Stefanie Jegelka, Hui Lin, and Jeff A Bilmes. On fast approximate submodular minimization. In\n\nAdvances in Neural Information Processing Systems, pages 460\u2013468, 2011.\n\n[NB12] Mukund Narasimhan and Jeff A Bilmes. A submodular-supermodular procedure with applications\n\nto discriminative structure learning. arXiv preprint arXiv:1207.1404, 2012.\n\n[Que98] Maurice Queyranne. Minimizing symmetric submodular functions. Mathematical Programming,\n\n82(1-2):3\u201312, 1998.\n\n[SK10] Peter Stobbe and Andreas Krause. 
Efficient minimization of decomposable submodular functions.\n\nIn Advances in Neural Information Processing Systems, pages 2208\u20132216, 2010.\n\n[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to\n\nalgorithms. Cambridge University Press, 2014.", "award": [], "sourceid": 547, "authors": [{"given_name": "Eric", "family_name": "Balkanski", "institution": "Harvard University"}, {"given_name": "Yaron", "family_name": "Singer", "institution": "Harvard University"}]}