{"title": "Learning convex polytopes with margin", "book": "Advances in Neural Information Processing Systems", "page_first": 5706, "page_last": 5716, "abstract": "We present improved algorithm for properly learning convex polytopes in the\nrealizable PAC setting from data with a margin. Our learning algorithm constructs\na consistent polytope as an intersection of about t log t halfspaces with margins\nin time polynomial in t (where t is the number of halfspaces forming an optimal\npolytope).\nWe also identify distinct generalizations of the notion of margin from hyperplanes\nto polytopes and investigate how they relate geometrically; this result may be of\ninterest beyond the learning setting.", "full_text": "Learning convex polytopes with margin\n\nLee-Ad Gottlieb\nAriel University\n\nEran Kaufman\nAriel University\n\nleead@ariel.ac.il\n\nerankfmn@gmail.com\n\nAryeh Kontorovich\nBen-Gurion University\nkaryeh@bgu.sc.il\n\nGabriel Nivasch\nAriel University\n\ngabrieln@ariel.ac.il\n\nAbstract\n\nWe present an improved algorithm for properly learning convex polytopes in the\nrealizable PAC setting from data with a margin. Our learning algorithm constructs\na consistent polytope as an intersection of about t log t halfspaces with margins\nin time polynomial in t (where t is the number of halfspaces forming an optimal\npolytope).\nWe also identify distinct generalizations of the notion of margin from hyperplanes\nto polytopes and investigate how they relate geometrically; this result may be of\ninterest beyond the learning setting.\n\nIntroduction\n\n1\nIn the theoretical PAC learning setting [Valiant, 1984], one considers an abstract instance space X \u2014\nwhich, most commonly, is either the Boolean cube {0, 1}d or the Euclidean space Rd. For the former\nsetting, an extensive literature has explored the statistical and computational aspects of learning\nBoolean functions [Angluin, 1992, Hellerstein and Servedio, 2007]. 
Yet for the Euclidean setting, a\ncorresponding theory of learning geometric concepts is still being actively developed [Kwek and Pitt,\n1998, Jain and Kinber, 2003, Anderson et al., 2013, Kane et al., 2013]. The focus of this paper is the\nlatter setting.\nThe simplest nontrivial geometric concept is perhaps the halfspace. These concepts are well-known\nto be hard to agnostically learn [H\u00f6ffgen et al., 1995] or even approximate [Amaldi and Kann, 1995,\n1998, Ben-David et al., 2003]. Even the realizable case, while commonly described as \u201csolved\u201d\nvia the Perceptron algorithm or linear programming (LP), is not straightforward: The Perceptron\u2019s\nruntime is quadratic in the inverse-margin, while solving the consistent hyperplane problem in\nstrongly polynomial time is equivalent to solving the general LP problem in strongly polynomial\ntime [Nikolov, 2018, Chv\u00e1tal], a question that has been open for decades [B\u00e1r\u00e1sz and Vempala,\n2010]. Thus, an unconditional (i.e., in\ufb01nite-precision and independent of data con\ufb01guration in space)\npolynomial-time solution for the consistent hyperplane problem hinges on the strongly polynomial\nLP conjecture.\nIf we consider not a single halfspace, but polytopes de\ufb01ned by the intersection of multiple halfspaces,\nthe computational and generalization bounds rapidly become more pessimistic. Megiddo [1988]\nshowed that the problem of deciding whether two sets of points in general space can be separated\nby the intersection of two hyperplanes is NP-complete, and Khot and Saket [2011] showed that\n\u201cunless NP = RP, it is hard to (even) weakly PAC-learn intersection of two halfspaces\u201d, even when\nallowed the richer class of O(1) intersecting halfspaces. 
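As an aside, the margin-dependent Perceptron guarantee mentioned above is easy to exercise concretely. The following minimal sketch is our illustration, not part of the paper's development; the sample data is a hand-picked, linearly separable toy example. On \u03b3-margin-separable data in the unit ball the number of updates is O(1/\u03b3^2).

```python
def perceptron(points, labels, max_passes=1000):
    """Perceptron with an offset (bias) term: returns (w, b) with
    y * (w . x + b) > 0 for every labelled point, or None if the
    pass budget is exhausted."""
    d = len(points[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(max_passes):
        clean = True
        for x, y in zip(points, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # Mistake: rotate/shift the hyperplane toward the point.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                clean = False
        if clean:
            return w, b
    return None

# Hypothetical separable sample (two labels in the plane).
pts = [(1.0, 1.0), (2.0, 1.0), (-1.0, -1.0), (-2.0, -1.0)]
ys = [1, 1, -1, -1]
model = perceptron(pts, ys)  # a consistent (w, b) for this sample
```

Note that the returned hyperplane is merely consistent; it need not maximize the margin, which is exactly why the runtime (rather than the output quality) carries the inverse-margin dependence.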
Under cryptographic assumptions, Klivans\nand Sherstov [2009] showed that learning an intersection of n^\u03b5 halfspaces is intractable regardless of\nhypothesis representation.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nSince the margin assumption is what allows one to find a consistent hyperplane in provably strongly\npolynomial time, it is natural to seek to generalize this scheme to intersections of t halfspaces each\nwith margin \u03b3; we call this the \u03b3-margin of a t-polytope. This problem was considered by Arriaga\nand Vempala [2006], who showed that such a polytope can be learned (in dimension d) in time\n\nO(dmt) + (t log t)^{O((t/\u03b3^2) log(t/\u03b3))}\n\nwith sample complexity m = O((t/\u03b3^2) log(t) log(t/\u03b3)) (where we have taken the PAC-learning\nparameters \u03b5, \u03b4 to be constants). In fact, they actually construct a candidate t-polytope as their learner;\nas such, their approach is an example of proper learning, where the hypothesis is chosen from the\nsame concept class as the true concept. In contrast, Klivans and Servedio [2008] showed that a\n\u03b3-margin t-polytope can be learned by constructing a function that approximates the polytope\u2019s\nbehavior, without actually constructing a t-polytope. This is an example of improper learning, where\nthe hypothesis is selected from a broader class than that of the true concept. They achieved a runtime\nof\n\nmin{ d(t/\u03b3)^{O(t log t log(1/\u03b3))}, d((log t)/\u03b3)^{O(\u221a(1/\u03b3) log t)} }\n\nand sample complexity m = O((1/\u03b3)^{t log t log(1/\u03b3)}). Very recently, Goel and Klivans [2018] improved on this latter result, constructing a function hypothesis in time poly(d, t^{O(1/\u03b3)}), with sample\ncomplexity exponential in \u03b3^{-1/2}.\n\nOur results. 
The central contribution of the paper is improved algorithmic runtimes and sample\ncomplexity for computing separating polytopes (Theorem 7). In contrast to the algorithm of Arriaga\nand Vempala [2006], whose runtime is exponential in t/\u03b3^2, and to that of [Goel and Klivans, 2018],\nwhose sample complexity is exponential in \u03b3^{-1/2}, we give an algorithm with polynomial sample\ncomplexity m = \u02dcO(t/\u03b3^2) and runtime only m^{\u02dcO(1/\u03b3^2)}. We accomplish this by constructing an\nO(t log m)-polytope that correctly separates the data. This means that our hypothesis is drawn from\na broader class than the t-polytopes of Arriaga and Vempala [2006] (allowing faster runtime), but\nfrom a much narrower class than the functions of Klivans and Servedio [2008], Goel and Klivans\n[2018] (allowing for improved sample complexity).\nComplementing our algorithm, we provide the first nearly matching hardness-of-approximation\nbounds, which demonstrate that an exponential dependence on t\u03b3^{-2} is unavoidable for the\ncomputation of separating t-polytopes, under standard complexity-theoretic assumptions (Theorem 6). This\nmotivates our consideration of O(t log m)-polytopes instead.\nOur final contribution is in introducing a new and intuitive notion of polytope margin: This is the\n\u03b3-envelope of a convex polytope, defined as all points within distance \u03b3 of the polytope boundary,\nas opposed to the above \u03b3-margin of the polytope, defined as the intersection of the \u03b3-margins of\nthe hyperplanes forming the polytope. (See Figure 2 for an illustration, and Section 2 for precise\ndefinitions.) Note that these two objects may exhibit vastly different behaviors, particularly at a sharp\nintersection of two or more hyperplanes. 
It seems to us that the envelope of a polytope is a more\nnatural structure than its margin, yet we find the margin more amenable to the derivation of both\nVC-bounds (Lemma 1) and algorithms (Theorem 7). We demonstrate that results derived for margins\ncan be adapted to apply to envelopes as well. We prove that when confined to the unit ball, the\n\u03b3-envelope fully contains within it the (\u03b3^2/2)-margin (Theorem 10), and this implies that statistical\nand algorithmic results for the latter hold for the former as well.\n\nRelated work. When general convex bodies are considered under the uniform distribution^1 (over\nthe unit ball or cube), exponential (in dimension and accuracy) sample-complexity bounds were\nobtained by Rademacher and Goyal [2009]. This may motivate the consideration of convex polytopes,\nand indeed a number of works have studied the problem of learning convex polytopes, including\nHeged\u00fcs [1994], Kwek and Pitt [1998], Anderson et al. [2013], Kane et al. [2013], Kantchelian\net al. [2014]. Heged\u00fcs [1994] examines query-based exact identification of convex polytopes with\ninteger vertices, with runtime polynomial in the number of vertices (note that the number of vertices\n\n1Since the concept class of convex sets has infinite VC-dimension, without distribution assumptions, an\nadversarial distribution can require an arbitrarily large sample size, even in 2 dimensions [Kearns and Vazirani,\n1997].\n\ncan be exponential in the number of facets [Matou\u0161ek, 2002]). Kwek and Pitt [1998] also rely on\nmembership queries (see also references therein regarding prior results, as well as strong positive\nresults in 2 dimensions). Anderson et al. [2013] efficiently approximately recover an unknown\nsimplex from uniform samples inside it. Kane et al. [2013] learn halfspaces under the log-concave\ndistributional assumption.\nThe recent work of Kantchelian et al. 
[2014] bears a superficial resemblance to ours, but the two are\nactually not directly comparable. What they term worst case margin will indeed correspond to our\nmargin. However, their optimization problem is non-convex, and the solution relies on heuristics\nwithout rigorous run-time guarantees. Their generalization bounds exhibit a better dependence on\nthe number t of halfspaces than our Lemma 3 (O(\u221at) vs. our O(t log t)). However, the hinge loss\nappearing in their Rademacher-based bound could be significantly worse than the 0-1 error appearing\nin our VC-based bound. We stress, however, that the main contribution of our paper is algorithmic\nrather than statistical.\n\n2 Preliminaries\n\nNotation. For x \u2208 Rd, we denote its Euclidean norm ||x||_2 := \u221a(\u2211_{i=1}^d x(i)^2) by ||x||, and\nfor n \u2208 N, we write [n] := {1, . . . , n}. Our instance space X is the unit ball in Rd: X =\n{x \u2208 Rd : ||x|| \u2264 1}. We assume familiarity with the notion of VC-dimension as well as with basic\nPAC definitions such as generalization error (see, e.g., Kearns and Vazirani [1997]).\nPolytopes. A (convex) polytope P \u2282 Rd is the convex hull of finitely many points: P =\nconv({x1, . . . , xn}). Alternatively, it can be defined by t hyperplanes (wi, bi) \u2208 Rd \u00d7 R where\n||wi|| = 1 for each i:\n\nP = {x \u2208 Rd : min_{i\u2208[t]} (wi \u00b7 x + bi) \u2265 0}.    (1)\n\nA hyperplane (w, b) is said to classify a point x as positive (resp., negative) with margin \u03b3 if\nw \u00b7 x + b \u2265 \u03b3 (resp., \u2264 \u2212\u03b3). Since ||w|| = 1, this means that x is \u03b3-far from the hyperplane\n{x' \u2208 Rd : w \u00b7 x' + b = 0}, in \u21132 distance.\n\nMargins and envelopes. We consider two natural ways of extending this notion to polytopes: the\n\u03b3-margin and the \u03b3-envelope. 
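Before turning to those definitions, note that the halfspace representation (1) can be evaluated directly. The following sketch is our illustration (the unit-square instance is an assumed example, not from the paper): membership reduces to checking min_i (wi \u00b7 x + bi) \u2265 0.

```python
def polytope_value(halfspaces, x):
    """min_i (w_i . x + b_i) over the halfspaces (w_i, b_i) of eq. (1);
    the value is >= 0 exactly when x lies in the polytope."""
    return min(sum(wi * xi for wi, xi in zip(w, x)) + b
               for w, b in halfspaces)

# Assumed running example: the unit square [0, 1]^2 as four
# unit-normal halfspaces (w_i, b_i).
square = [((1.0, 0.0), 0.0),    # x1 >= 0
          ((-1.0, 0.0), 1.0),   # x1 <= 1
          ((0.0, 1.0), 0.0),    # x2 >= 0
          ((0.0, -1.0), 1.0)]   # x2 <= 1

inside = polytope_value(square, (0.25, 0.5))   # 0.25 >= 0: inside
outside = polytope_value(square, (1.2, 0.5))   # negative: outside
```

The same min-value also measures how deep inside (or how far past a facet hyperplane) a point sits, which is the quantity the margin definitions below are built on.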
For a polytope defined by t hyperplanes as in (1), we say that x is in\nthe inner \u03b3-margin of P if\n\n0 \u2264 min_{i\u2208[t]} (wi \u00b7 x + bi) \u2264 \u03b3\n\nand that x is in the outer \u03b3-margin of P if\n\n0 \u2265 min_{i\u2208[t]} (wi \u00b7 x + bi) \u2265 \u2212\u03b3.\n\nSimilarly, we say that x is in the outer \u03b3-envelope of P if x \u2209 P and inf_{p\u2208P} ||x \u2212 p|| \u2264 \u03b3 and that\nx is in the inner \u03b3-envelope of P if x \u2208 P and inf_{p\u2209P} ||x \u2212 p|| \u2264 \u03b3.\nWe call the union of the inner and the outer \u03b3-margins the \u03b3-margin, and we denote it by \u2202P[\u03b3].\nSimilarly, we call the union of the inner and the outer \u03b3-envelopes the \u03b3-envelope, and we denote it\nby \u2202P(\u03b3).\nThe two notions are illustrated in Figure 2. As we show in Section 4 below, the inner envelope\ncoincides with the inner margin, but this is not the case for the outer objects: The outer margin always\ncontains the outer envelope, and could be of arbitrarily larger volume.\n\nFat hyperplanes and polytopes. Binary classification requires a collection of concepts mapping\nthe instance space (in our case, the unit ball in Rd) to {\u22121, 1}. However, given a hyperplane (w, b)\nand a margin \u03b3, the function fw,b : Rd \u2192 R given by fw,b(x) = w \u00b7 x + b partitions Rd into three\nregions: positive {x \u2208 Rd : fw,b(x) \u2265 \u03b3}, negative {x \u2208 Rd : fw,b(x) \u2264 \u2212\u03b3}, and ambiguous\n{x \u2208 Rd : |fw,b(x)| < \u03b3}. 
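The divergence of the outer margin and outer envelope at a sharp corner can be checked numerically. This sketch is our illustration (the quadrant {x : x1 \u2264 0, x2 \u2264 0} is an assumed example): the diagonal point lies in the outer \u03b3-margin of both halfspaces, yet its Euclidean distance to the polytope exceeds \u03b3, so it is outside the \u03b3-envelope.

```python
import math

# Assumed example: the quadrant P = {x : x1 <= 0, x2 <= 0}, given by
# two unit-normal halfspaces as in eq. (1).
quadrant = [((-1.0, 0.0), 0.0), ((0.0, -1.0), 0.0)]

def margin_value(halfspaces, x):
    # min_i (w_i . x + b_i); a value in [-gamma, 0] puts x in the
    # outer gamma-margin.
    return min(w[0] * x[0] + w[1] * x[1] + b for w, b in halfspaces)

def dist_to_quadrant(x):
    # Euclidean distance from x to P: clip positive coordinates to 0.
    return math.hypot(max(x[0], 0.0), max(x[1], 0.0))

gamma = 0.3
p = (0.3, 0.3)  # sits diagonally off the corner of P
in_outer_margin = -gamma <= margin_value(quadrant, p) <= 0    # True
in_outer_envelope = 0.0 < dist_to_quadrant(p) <= gamma        # False
```

Here the distance to P is 0.3\u00b7sqrt(2) > \u03b3, a small instance of the general fact that the outer margin strictly contains the outer envelope near sharp intersections.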
We use a standard device (see, e.g., Hanneke and Kontorovich [2017, Section 4]) of defining an auxiliary instance space X' = X \u00d7 {\u22121, 1} together with the concept class\n\nH\u03b3 = {hw,b : w \u2208 Rd, b \u2208 R, ||w|| = 1/\u03b3}, where, for all (x, y) \u2208 X',\n\nhw,b(x, y) = sign(y(w \u00b7 x + b)) if |w \u00b7 x + b| \u2265 \u03b3, and hw,b(x, y) = \u22121 otherwise.\n\nIt is shown in [Hanneke and Kontorovich, 2017, Lemma 6] that^2\nLemma 1. The VC-dimension of H\u03b3 is at most (2/\u03b3 + 1)^2.\nAnalogously, we define the concept class Pt,\u03b3 of \u03b3-fat t-polytopes as follows. Each hP \u2208 Pt,\u03b3 is\ninduced by some t-halfspace intersection P as in (1). The label of a pair (x, y) \u2208 X' is determined\nas follows: If x is in the \u03b3-margin of P, then the pair is labeled \u22121 irrespective of y. Otherwise, if\nx \u2208 P and y = 1, or else x \u2209 P and y = \u22121, then the pair is labeled 1. Otherwise, the pair is labeled\n\u22121.\nLemma 2. The VC-dimension of Pt,\u03b3 in d dimensions is at most\n\nmin{2(d + 1)t log(3t), 2vt log(3t)},\n\nwhere v = (2/\u03b3 + 1)^2.\n\nProof. The VC-dimension of the family of intersections of t concept classes, each of VC-dimension at most v, is bounded by\n2vt log(3t) [Blumer et al., 1989, Lemma 3.2.3]. Since the class of d-dimensional hyperplanes has\nVC-dimension d + 1 [Long and Warmuth, 1994], the family of polytopes has VC-dimension at most\n2(d + 1)t log(3t). The second part of the bound is obtained by applying Blumer et al. [1989, Lemma\n3.2.3] to the VC bound in Lemma 1.\n\nGeneralization bounds. The following VC-based generalization bounds are well-known; the first\none may be found in, e.g., Cristianini and Shawe-Taylor [2000], and the second in Anthony\nand Bartlett [1999].\nLemma 3. Let H be a class of learners with VC-dimension dVC. 
If a learner h \u2208 H is consistent on\na random sample S of size m, then with probability at least 1 \u2212 \u03b4 its generalization error is\n\nerr(h) \u2264 (2/m)(dVC log(2em/dVC) + log(2/\u03b4)).\n\nDimension reduction. The Johnson-Lindenstrauss (JL) transform [Johnson and Lindenstrauss,\n1982] takes a set S of m vectors in Rd and projects them into k = O(\u03b5^{-2} log m) dimensions, while\npreserving all inter-point distances and vector norms up to 1 + \u03b5 distortion. That is, if f : Rd \u2192 Rk is\na linear embedding realizing the guarantees of the JL transform on S, then for every x \u2208 S we have\n\n(1 \u2212 \u03b5)||x|| \u2264 ||f(x)|| \u2264 (1 + \u03b5)||x||,\n\nand for every x, y \u2208 S we have\n\n(1 \u2212 \u03b5)||x \u2212 y|| \u2264 ||f(x \u2212 y)|| \u2264 (1 + \u03b5)||x \u2212 y||.\n\nThe JL transform can be realized with probability 1 \u2212 n^{-c} for any constant c \u2265 1 by a randomized\nlinear embedding, for example a projection matrix with entries drawn from a normal distribution\n[Achlioptas, 2003]. This embedding is oblivious, in the sense that the matrix can be chosen without\nknowledge of the set S.\nIt is an easy matter to show that the JL transform can also be used to approximately preserve distances\nto hyperplanes, as in the following lemma.\n\n2Such estimates may be found in the literature for homogeneous (i.e., b = 0) hyperplanes (see, e.g., Bartlett\nand Shawe-Taylor [1999, Theorem 4.6]), but dealing with polytopes, it is important for us to allow offsets. As\ndiscussed in Hanneke and Kontorovich [2017], the standard non-homogeneous to homogeneous conversion can\ndegrade the margin by an arbitrarily large amount, and hence the non-homogeneous case warrants an independent\nanalysis.\n\nLemma 4. 
Let S be a set of d-dimensional vectors in the unit ball, T be a set of normalized vectors,\nand f : Rd \u2192 Rk a linear embedding realizing the guarantees of the JL transform. Then for any\n0 < \u03b5 < 1 and some k = O((log |S \u222a T|)/\u03b5^2), with probability 1 \u2212 |S \u222a T|^{-c} (for any constant\nc > 1) we have for all x \u2208 S and t \u2208 T that\n\nf(t) \u00b7 f(x) \u2208 t \u00b7 x \u00b1 \u03b5.\n\nProof. Let the constant in k be chosen so that the JL transform preserves distances and norms among\nS \u222a T within a factor 1 + \u03b5' for \u03b5' = \u03b5/5. By the guarantees of the JL transform for the chosen value\nof k, we have that\n\nf(t) \u00b7 f(x) = (1/2)[||f(t)||^2 + ||f(x)||^2 \u2212 ||f(t) \u2212 f(x)||^2]\n\u2264 (1/2)[(1 + \u03b5')^2(||t||^2 + ||x||^2) \u2212 (1 \u2212 \u03b5')^2 ||t \u2212 x||^2]\n< (1/2)[(1 + 3\u03b5')(||t||^2 + ||x||^2) \u2212 (1 \u2212 2\u03b5')||t \u2212 x||^2]\n< (1/2)[5\u03b5'(||t||^2 + ||x||^2)] + t \u00b7 x\n\u2264 5\u03b5' + t \u00b7 x = \u03b5 + t \u00b7 x.\n\nA similar argument gives that f(t) \u00b7 f(x) > \u2212\u03b5 + t \u00b7 x.\n\n3 Computing and learning separating polytopes\n\nIn this section, we present algorithms to compute and learn \u03b3-fat t-polytopes. We begin with\nhardness results for this problem, and show that these hardness results justify algorithms with\nruntime exponential in the dimension or the square of the reciprocal of the margin. We then present our\nalgorithms.\n\n3.1 Hardness\n\nWe show that computing separating polytopes is NP-hard, and even hard to approximate. We begin\nwith the case of a single hyperplane. The following preliminary lemma builds upon Amaldi and Kann\n[1995, Theorem 10].\nLemma 5. 
Given a labelled point set S (n = |S|) with p negative points, let h* be a hyperplane\nthat places all positive points of S on its positive side, and maximizes the number of negative points\non its negative side \u2014 let opt be the number of these negative points. Then it is NP-hard to find\na hyperplane \u02dch consistent with all positive points, and which places at least opt/p^{1\u2212o(1)} negative\npoints on the negative side of \u02dch. This holds even when the optimal hyperplane correctly classifying\nopt points has margin \u03b3 \u2265 1/(4\u221aopt).\n\nProof. We reduce from maximum independent set, which for p vertices is hard to approximate to\nwithin p^{1\u2212o(1)} [Zuckerman, 2007]. Given a graph G = (V, E), for each vertex vi \u2208 V place a\nnegative point on the basis vector ei. Now place a positive point at the origin, and for each edge\n(vi, vj) \u2208 E, place a positive point at (ei + ej)/2.\nConsider a hyperplane consistent with the positive points and placing opt negative points on the\nnegative side: These negative points must represent an independent set in G, for if (vi, vj) \u2208 E, then\nby construction the midpoint of ei, ej is positive, and so both ei, ej cannot lie on the negative side of\nthe hyperplane.\nLikewise, if G contained an independent set V' \u2282 V of size opt, then we consider the hyperplane\ndefined by the equation w \u00b7 x + 3/(4\u221aopt) = 0, where coordinate w(j) = \u22121/\u221aopt if vj \u2208 V' and\nw(j) = 0 otherwise. It is easily verified that the signed distance from the hyperplane to a negative point (i.e.,\na basis vector ej with vj \u2208 V') is \u22121/\u221aopt + 3/(4\u221aopt) = \u22121/(4\u221aopt), to the origin it is 3/(4\u221aopt), and to all other positive points\nit is at least \u22121/(2\u221aopt) + 3/(4\u221aopt) = 1/(4\u221aopt).\n\nWe can now extend the above result for a hyperplane to similar ones for polytopes:\nTheorem 6. 
Given a labelled point set S (n = |S|) with p negative points, let H* be a collection of\nt halfspaces whose intersection partitions S into positive and negative sets. Then it is NP-hard to\nfind a collection \u02dcH of size less than tp^{1\u2212o(1)} whose intersection also partitions S into positive and\nnegative sets. This holds even when all hyperplanes have margin \u03b3 \u2265 1/(4\u221a(p/t)).\n\nProof. The reduction is from minimum coloring, which is hard to approximate within a factor of\nn^{1\u2212o(1)} [Zuckerman, 2007]. The construction is identical to that of the proof of Lemma 5. In\nparticular, a set of vertices in G assigned the same color necessarily forms an independent set, and so\ntheir corresponding negative points in S can be separated from all positive points by some halfspace,\nand vice-versa.\nThe only difficulty in the reduction is our insistence that the margin must be of size at least 1/(4\u221a(p/t)); as\nin Lemma 5, this holds only when the halfspaces are restricted to separate at most opt = p/t points.\nBut there is no guarantee that the optimal coloring satisfies this requirement, that is, if the optimal\ncoloring possesses t colors, that each color represents only p/t vertices. To this end, if a color in\nthe optimal t-coloring of G covers more than p/t vertices, we partition it into a set of colors, each\ncoloring no more than p/t vertices. This increases the total number of colors to at most 2t, which\ndoes not affect the hardness-of-approximation result.\n\nThe Exponential Time Hypothesis (ETH) posits that maximum independent set and minimum coloring\ncannot be solved in less than c^n operations (for some constant c).^3 As Lemma 5 asserts that the\nseparating hyperplane problem remains hard for margin \u03b3 \u2265 1/(4\u221aopt) \u2265 1/(4\u221ap), we cannot hope to find an\noptimal solution in time less than c^p \u2265 c^{1/(16\u03b3^2)}. 
Likewise, as Theorem 6 asserts that the separating t-polytope problem remains hard for margin \u03b3 \u2265 1/(4\u221a(p/t)), we cannot hope to find a consistent t-polytope\nin time less than c^p \u2265 c^{t/(16\u03b3^2)}. This justifies the exponential dependence on t\u03b3^{-2} in the algorithm\nof Arriaga and Vempala [2006], and implies that to avoid an exponential dependence on t in the\nruntime, we should consider a broader hypothesis class, for example O(t log m)-polytopes.\n\n3.2 Algorithms\n\nHere we present algorithms for computing polytopes, and use them to give an efficient algorithm for\nlearning polytopes.\nIn what follows, we give two algorithms inspired by the work of Arriaga and Vempala [2006].\nBoth have a faster runtime than the algorithm of Arriaga and Vempala [2006], and the second is only\npolynomial in t.\nTheorem 7. Given a labelled point set S (n = |S|) for which some \u03b3-fat t-polytope correctly\nseparates the positive and negative points (i.e., the polytope is consistent), we can compute the\nfollowing with high probability:\n\n1. A consistent (\u03b3/4)-fat t-polytope in time n^{O(t\u03b3^{-2} log(1/\u03b3))}.\n\n2. A consistent (\u03b3/4)-fat O(t log n)-polytope in time n^{O(\u03b3^{-2} log(1/\u03b3))}.\n\nBefore proving Theorem 7, we will need a preliminary lemma:\nLemma 8. Given any 0 < \u03b4 < 1, there exists a set V of unit vectors of size |V| = \u03b4^{-O(d)} with the\nfollowing property: For any unit vector w, there exists a v \u2208 V that satisfies v \u00b7 x \u2208 w \u00b7 x \u00b1 \u03b4 for\nall vectors x with ||x|| \u2264 1. 
The set V can be constructed in time \u03b4^{-O(d)} with high probability.\n\nThis implies that if a set S admits a hyperplane (w, b) with margin \u03b3, then S admits a hyperplane\n(v, b) (for v \u2208 V) with margin at least \u03b3 \u2212 \u03b4.\n\n3This does not necessarily imply that approximating these problems requires c^n operations: As hardness-of-approximation results utilize polynomial-time reductions, ETH implies only that the runtime is exponential in\nsome polynomial in n.\n\nProof. We take V to be a \u03b4-net of the unit ball, a set satisfying that every point on the ball is within\ndistance \u03b4 of some point in V. Then |V| \u2264 (1 + 2/\u03b4)^d [Vershynin, 2010, Lemma 5.2]. For any unit\nvector w we have for some v \u2208 V that ||w \u2212 v|| \u2264 \u03b4, and so for any vector x satisfying ||x|| \u2264 1 we\nhave\n\n|w \u00b7 x \u2212 v \u00b7 x| = |(w \u2212 v) \u00b7 x| \u2264 ||w \u2212 v|| \u2264 \u03b4.\n\nThe net can be constructed by a randomized greedy algorithm. By a coupon-collector analysis, it\nsuffices to sample O(|V| log |V|) random unit vectors. For example, each can be chosen by sampling\nits coordinates from N(0, 1) (the standard normal distribution), and then normalizing the vector. The\nresulting set contains within it a \u03b4-net.\n\nProof of Theorem 7. We first apply the Johnson-Lindenstrauss transform to reduce the dimension of the\npoints in S to k = O(\u03b3^{-2} log(n + t)) = O(\u03b3^{-2} log n) while achieving the guarantees of Lemma\n4 for the points of S and the t halfspaces forming the optimal \u03b3-fat t-polytope, with parameter\n\u03b5 = \u03b3/12. In the embedded space, we extract a \u03b4-net V of Lemma 8 with parameter \u03b4 = \u03b3/12, and\nwe have |V| = \u03b4^{-O(k)}. Now define the set B consisting of all values of the form \u03b3i/12 for integer\ni \u2208 {0, 1, . . . , \u230a12/\u03b3\u230b}. 
It follows that for each d-dimensional halfspace (w, b) forming the original\n\u03b3-fat t-polytope, there is a k-dimensional halfspace (v, b') with v \u2208 V and b' \u2208 B satisfying\nv \u00b7 f(x) + b' \u2208 w \u00b7 x + b \u00b1 \u03b3/4 for every x \u2208 S. Given (v, b'), we can recover an approximation\nto (w, b) in the d-dimensional origin space thus: Let S' \u2282 S include only those points x \u2208 S for\nwhich |v \u00b7 f(x) + b'| \u2265 3\u03b3/4; it follows that for these points |w \u00b7 x + b| \u2265 3\u03b3/4 \u2212 \u03b3/4 = \u03b3/2. As S' is a separable\npoint set with margin \u0398(\u03b3), we can run the Perceptron algorithm on S' in time O(dn\u03b3^{-2}), and find\na d-dimensional halfspace w' consistent with w on all points at distance \u03b3/4 or more from w. We will\nrefer to w' as the d-dimensional mirror of v.\nWe compute the d-dimensional mirror of every vector in V for every candidate value in B. We then\nenumerate all possible t-polytopes by taking intersections of all combinations of t mirror halfspaces,\nin total time\n\n(1/\u03b3)^{O(kt)} = n^{O(t\u03b3^{-2} log(1/\u03b3))},\n\nand choose the best one consistent with S. The first part of the theorem follows.\nBetter, we may give a greedy algorithm with a much improved runtime: First note that as the\nintersection of t halfspaces correctly classifies all points, the best halfspace among them correctly\nclassifies at least a (1/t)-fraction of the negative points with margin \u03b3. Hence it suffices to find\nthe d-dimensional mirror which is consistent with all positive points and maximizes the number of\ncorrectly classified negative points, all with margin \u03b3/4. We choose this halfspace, remove from S the correctly\nclassified negative points, and iteratively search for the next best halfspace. 
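The round count of this greedy elimination is easy to sanity-check with a small simulation. The sketch below is our illustration (the instance sizes are assumed examples); it pessimistically removes only the guaranteed (1/t)-fraction of surviving negative points per round.

```python
import math

def greedy_rounds(n, t):
    """Rounds of the greedy scheme, assuming each chosen halfspace
    removes at least a (1/t)-fraction (here: exactly ceil(n/t)) of the
    surviving negative points."""
    rounds = 0
    while n >= 1:
        n -= math.ceil(n / t)  # remove at least a 1/t fraction
        rounds += 1
    return rounds

rounds_needed = greedy_rounds(10**6, 20)  # stays below 20 * ln(10^6), about 277
```

This numerically mirrors the bound used next in the text: after O(t log n) rounds, n(1 \u2212 1/t)^{O(t log n)} drops below one.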
After ct log n iterations\n(for an appropriate constant c), the number of remaining points is\n\nn(1 \u2212 \u03a9(1/t))^{ct log n} < ne^{\u2212ln n} = 1,\n\nand the algorithm terminates.\n\nHaving given an algorithm to compute \u03b3-fat t-polytopes, we can now give an efficient algorithm\nto learn \u03b3-fat t-polytopes. We sample m points, and use the second item of Theorem 7 to find a\n(\u03b3/4)-fat O(t log m)-polytope consistent with the sample. By Lemma 2, the class of such polytopes has\nVC-dimension O(\u03b3^{-2}t log m). The size of m is chosen according to Lemma 3, and we conclude:\nTheorem 9. There exists an algorithm that learns \u03b3-fat t-polytopes with sample complexity\n\nm = O((t/(\u03b5\u03b3^2)) log^2(t/(\u03b5\u03b3)) + log(1/\u03b4))\n\nin time m^{O((1/\u03b3^2) log(1/\u03b3))}, where \u03b5, \u03b4 are the desired accuracy and confidence levels.\n\n4 Polytope margin and envelope\n\nIn this section, we show that the notions of margin and envelope defined in Section 2 are, in general,\nquite distinct. Fortunately, when confined to the unit ball X, one can be used to approximate the\nother.\n\nFigure 1: Expansion and contraction of a polytope by \u03b3.\n\nFigure 2: The \u03b3-envelope \u2202P(\u03b3) (left) and \u03b3-margin \u2202P[\u03b3] (right) of a polytope P.\n\nGiven two sets S1, S2 \u2286 Rd, their Minkowski sum is given by S1 + S2 = {p + q : p \u2208 S1, q \u2208 S2},\nand their Minkowski difference is given by S1 \u2212 S2 = {p \u2208 Rd : {p} + S2 \u2286 S1}. 
Let B\u03b3 = {p \u2208 Rd : ||p|| \u2264 \u03b3} be the ball of radius \u03b3 centered at the origin.\nGiven a polytope P \u2282 Rd and a real number \u03b3 > 0, let\n\nP(+\u03b3) = P + B\u03b3,\nP(\u2212\u03b3) = P \u2212 B\u03b3.\n\nHence, P(+\u03b3) and P(\u2212\u03b3) are the results of expanding or contracting, in a certain sense, the polytope\nP.\nAlso, let P[+\u03b3] be the result of moving each halfspace defining a facet of P outwards by distance\n\u03b3, and similarly, let P[\u2212\u03b3] be the result of moving each such halfspace inwards by distance \u03b3. Put\ndifferently, we can think of the halfspaces defining the facets of P as moving outwards at unit speed,\nso P expands with time. Then P[\u00b1\u03b3] is P at time \u00b1\u03b3. See Figure 1.\nObservation 1. We have P(\u2212\u03b3) = P[\u2212\u03b3].\nProof. Each point in P[\u2212\u03b3] is at distance at least \u03b3 from each hyperplane containing a facet of P;\nhence, it is at distance at least \u03b3 from the boundary of P, so it is in P(\u2212\u03b3). Now, suppose for a\ncontradiction that there exists a point p \u2208 P(\u2212\u03b3) \\ P[\u2212\u03b3]. Then p is at distance less than \u03b3 from a\npoint q \u2208 \u2202h \\ f, where f is some facet of P and \u2202h is the hyperplane containing f. But then the\nsegment pq must intersect another facet of P, a contradiction.\n\nHowever, in the other direction we have P(+\u03b3) \u228a P[+\u03b3]. Furthermore, the Hausdorff distance\nbetween them could be arbitrarily large (see again Figure 1).\nThen the \u03b3-envelope of P is given by \u2202P(\u03b3) = P(+\u03b3) \\ P(\u2212\u03b3), and the \u03b3-margin of P is given by\n\u2202P[\u03b3] = P[+\u03b3] \\ P[\u2212\u03b3]. 
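Both expansion operators are easy to probe numerically for a concrete polytope. This sketch is our illustration (the unit square is an assumed example): the corner point witnesses the strict containment P(+\u03b3) \u228a P[+\u03b3], while a facet midpoint lies in both expansions.

```python
import math

gamma = 0.25

def in_shifted_halfspaces(x):
    # P[+gamma] for the unit square [0,1]^2: every facet halfspace
    # moved outwards by gamma, giving the square [-gamma, 1+gamma]^2.
    return (-gamma <= x[0] <= 1 + gamma) and (-gamma <= x[1] <= 1 + gamma)

def dist_to_square(x):
    # Euclidean distance from x to [0,1]^2, so that
    # P(+gamma) = {x : dist_to_square(x) <= gamma}.
    dx = max(-x[0], 0.0, x[0] - 1.0)
    dy = max(-x[1], 0.0, x[1] - 1.0)
    return math.hypot(dx, dy)

corner = (1 + gamma, 1 + gamma)
edge_point = (0.5, 1 + gamma)

in_halfspace_expansion = in_shifted_halfspaces(corner)    # True
in_minkowski_expansion = dist_to_square(corner) <= gamma  # False
edge_in_both = (in_shifted_halfspaces(edge_point)
                and dist_to_square(edge_point) <= gamma)  # True
```

The corner sits at distance sqrt(2)\u00b7\u03b3 from the square, which is exactly the kind of sharp-vertex discrepancy that Figure 1 depicts.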
See Figure 2.\nSince the \u03b3-margin of P is not contained in the \u03b3-envelope of P, we would like to find some sufficient\ncondition under which, for some \u03b3' < \u03b3, the \u03b3'-margin of P is contained in the \u03b3-envelope of P.\nOur solution to this problem is given in the following theorem. Recall that X is the unit ball in Rd.\nTheorem 10. Let P \u2282 Rd be a polytope, and let 0 < \u03b3 < 1. Suppose that P[\u2212\u03b3] \u2229 X \u2260 \u2205. Then,\nwithin X, the (\u03b3^2/2)-margin of P is contained in the \u03b3-envelope of P; meaning, \u2202P[\u03b3^2/2] \u2229 X \u2286 \u2202P(\u03b3).\n\nThe proof uses the following general observation:\nObservation 2. Let Q = Q(t) be an expanding polytope whose defining halfspaces move outwards\nwith time, each one at its own constant speed. Let p = p(t) be a point that moves in a straight line\nat constant speed. Suppose t1 < t2 < t3 are such that p(t1) \u2208 Q(t1) and p(t3) \u2208 Q(t3). Then\np(t2) \u2208 Q(t2) as well.\n\nProof. Otherwise, p exits one of the halfspaces and enters it again, which is impossible.\n\nProof of Theorem 10. By Observation 1, it suffices to show that P[+\u03b3^2/2] \u2229 X \u2286 P(+\u03b3). Hence, let\np \u2208 P[+\u03b3^2/2] \u2229 X and q \u2208 P[\u2212\u03b3] \u2229 X. Let s be the segment pq. Let r be the point in s that is at\ndistance \u03b3 from p. Suppose for a contradiction that p \u2209 P(+\u03b3). Then r \u2209 P. Consider P = P(t) as\na polytope that expands with time, as above. Let z = z(t) be a point that moves along s at constant\nspeed, such that z(\u2212\u03b3) = q and z(\u03b3^2/2) = p. 
Since ‖r − q‖ ≤ 2, the point z travels a distance of ‖p − q‖ = ‖r − q‖ + γ ≤ 2 + γ in time γ + γ²/2 = γ(2 + γ)/2, so its speed is at most 2/γ. Hence, between t = 0 and t = γ²/2, z moves a distance of at most γ, so z(0) already lies between r and p. In other words, z exits P and then reenters it, contradicting Observation 2.

It follows immediately from Theorem 10 and Lemma 2 that the VC-dimension of the class of t-polytopes with envelope γ is at most

min{2(d + 1)t log(3t), 2vt log(3t)},

where v = (4/γ² + 1)². Likewise, we can approximate the optimal t-polytope with envelope γ by the algorithms of Theorem 7 (with parameter γ′ = γ²/2).

Acknowledgments

We thank Sasho Nikolov, Bernd Gärtner and David Eppstein for helpful discussions. L. Gottlieb and A. Kontorovich were supported in part by the Israel Science Foundation (grant No. 755/15).

References

Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, 2003. doi: 10.1016/S0022-0000(03)00025-4. URL https://doi.org/10.1016/S0022-0000(03)00025-4.

Edoardo Amaldi and Viggo Kann. The complexity and approximability of finding maximum feasible subsystems of linear relations. Theoretical Computer Science, 147(1):181–210, 1995. ISSN 0304-3975. doi: 10.1016/0304-3975(94)00254-G. URL http://www.sciencedirect.com/science/article/pii/030439759400254G.

Edoardo Amaldi and Viggo Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209(1):237–260, 1998. ISSN 0304-3975. doi: 10.1016/S0304-3975(97)00115-1. URL http://www.sciencedirect.com/science/article/pii/S0304397597001151.

Joseph Anderson, Navin Goyal, and Luis Rademacher. Efficient learning of simplices. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, pages 1020–1045, 2013.
URL http://jmlr.org/proceedings/papers/v30/Anderson13.html.

Dana Angluin. Computational learning theory: Survey and selected bibliography. In Proceedings of the 24th Annual ACM Symposium on Theory of Computing, May 4-6, 1992, Victoria, British Columbia, Canada, pages 351–369, 1992. doi: 10.1145/129712.129746. URL http://doi.acm.org/10.1145/129712.129746.

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999. ISBN 0-521-57353-X. doi: 10.1017/CBO9780511624216. URL http://dx.doi.org/10.1017/CBO9780511624216.

Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63(2):161–182, 2006. doi: 10.1007/s10994-006-6265-7. URL https://doi.org/10.1007/s10994-006-6265-7.

Mihály Bárász and Santosh Vempala. A new approach to strongly polynomial linear programming. In Innovations in Computer Science - ICS 2010, Tsinghua University, Beijing, China, January 5-7, 2010. Proceedings, pages 42–48, 2010. URL http://conference.itcs.tsinghua.edu.cn/ICS2010/content/papers/4.html.

Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers, pages 43–54. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-19416-3.

Shai Ben-David, Nadav Eiron, and Philip M. Long. On the difficulty of approximately maximizing agreements. J. Comput. Syst. Sci., 66(3):496–514, 2003. doi: 10.1016/S0022-0000(03)00038-2. URL https://doi.org/10.1016/S0022-0000(03)00038-2.

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929–965, 1989. ISSN 0004-5411.

Vašek Chvátal. Notes on the Khachiyan-Kalantari algorithm.
URL https://users.encs.concordia.ca/~chvatal/notes/khakal.pdf.

Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000. ISBN 0521780195.

Surbhi Goel and Adam Klivans. Learning neural networks with two nonlinear layers in polynomial time (arxiv:1709.06010v4). 2018.

Steve Hanneke and Aryeh Kontorovich. Optimality of SVM: Novel proofs and tighter bounds. 2017. URL https://www.cs.bgu.ac.il/~karyeh/opt-svm.pdf.

Tibor Hegedüs. Geometrical concept learning and convex polytopes. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, COLT 1994, New Brunswick, NJ, USA, July 12-15, 1994, pages 228–236, 1994. doi: 10.1145/180139.181124. URL http://doi.acm.org/10.1145/180139.181124.

Lisa Hellerstein and Rocco A. Servedio. On PAC learning algorithms for rich boolean function classes. Theor. Comput. Sci., 384(1):66–76, 2007. doi: 10.1016/j.tcs.2007.05.018. URL https://doi.org/10.1016/j.tcs.2007.05.018.

Klaus-Uwe Höffgen, Hans Ulrich Simon, and Kevin S. Van Horn. Robust trainability of single neurons. J. Comput. Syst. Sci., 50(1):114–125, 1995. doi: 10.1006/jcss.1995.1011. URL https://doi.org/10.1006/jcss.1995.1011.

Sanjay Jain and Efim B. Kinber. Intrinsic complexity of learning geometrical concepts from positive data. J. Comput. Syst. Sci., 67(3):546–607, 2003. doi: 10.1016/S0022-0000(03)00067-9. URL https://doi.org/10.1016/S0022-0000(03)00067-9.

William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn.), Contemp.
Math., 26, Amer. Math. Soc., Providence, pages 189–206, 1982.

Daniel M. Kane, Adam R. Klivans, and Raghu Meka. Learning halfspaces under log-concave densities: Polynomial approximations and moment matching. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, pages 522–545, 2013. URL http://jmlr.org/proceedings/papers/v30/Kane13.html.

Alex Kantchelian, Michael Carl Tschantz, Ling Huang, Peter L. Bartlett, Anthony D. Joseph, and J. Doug Tygar. Large-margin convex polytope machine. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3248–3256, 2014. URL http://papers.nips.cc/paper/5511-large-margin-convex-polytope-machine.

Michael Kearns and Umesh Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1997.

Subhash Khot and Rishi Saket. On the hardness of learning intersections of two halfspaces. J. Comput. Syst. Sci., 77(1):129–141, 2011. doi: 10.1016/j.jcss.2010.06.010. URL https://doi.org/10.1016/j.jcss.2010.06.010.

Adam R. Klivans and Rocco A. Servedio. Learning intersections of halfspaces with a margin. J. Comput. Syst. Sci., 74(1):35–48, 2008. doi: 10.1016/j.jcss.2007.04.012. URL https://doi.org/10.1016/j.jcss.2007.04.012.

Adam R. Klivans and Alexander A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. J. Comput. Syst. Sci., 75(1):2–12, 2009. doi: 10.1016/j.jcss.2008.07.008. URL https://doi.org/10.1016/j.jcss.2008.07.008.

Stephen Kwek and Leonard Pitt. PAC learning intersections of halfspaces with membership queries. Algorithmica, 22(1/2):53–75, 1998. doi: 10.1007/PL00013834. URL https://doi.org/10.1007/PL00013834.

Philip M. Long and Manfred K. Warmuth. Composite geometric concepts and polynomial predictability. Inf. Comput., 113(2):230–252, 1994.
doi: 10.1006/inco.1994.1071. URL https://doi.org/10.1006/inco.1994.1071.

Jiří Matoušek. Lectures on discrete geometry, volume 212 of Graduate Texts in Mathematics. Springer-Verlag, New York, 2002. ISBN 0-387-95373-6. doi: 10.1007/978-1-4613-0039-7. URL https://doi.org/10.1007/978-1-4613-0039-7.

Nimrod Megiddo. On the complexity of polyhedral separability. Discrete & Computational Geometry, 3(4):325–337, Dec 1988. ISSN 1432-0444. doi: 10.1007/BF02187916. URL https://doi.org/10.1007/BF02187916.

Aleksandar Nikolov. Complexity of finding a consistent hyperplane. Theoretical Computer Science Stack Exchange, 2018. URL https://cstheory.stackexchange.com/q/40554.

Luis Rademacher and Navin Goyal. Learning convex bodies is hard. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, 2009. URL http://www.cs.mcgill.ca/~colt2009/papers/030.pdf#page=1.

Leslie G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, 1984.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. CoRR, abs/1011.3027, 2010. URL http://arxiv.org/abs/1011.3027.

David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory of Computing, 3(6):103–128, 2007. doi: 10.4086/toc.2007.v003a006. URL http://www.theoryofcomputing.org/articles/v003a006.