{"title": "Distribution-Independent PAC Learning of Halfspaces with Massart Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 4749, "page_last": 4760, "abstract": "We study the problem of distribution-independent PAC learning of halfspaces in the presence of Massart noise. Specifically, we are given a set of labeled examples (x, y) drawn from a distribution D on R^{d+1} such that the marginal distribution on the unlabeled points x is arbitrary and the labels y are generated by an unknown halfspace corrupted with Massart noise at noise rate η < 1/2. The goal is to find a hypothesis h that minimizes the misclassification error Pr_{(x,y)∼D}[h(x) ≠ y]. We give a poly(d, 1/ε) time algorithm for this problem with misclassification error η + ε. We also provide evidence that improving on the error guarantee of our algorithm might be computationally hard. Prior to our work, no efficient weak (distribution-independent) learner was known in this model, even for the class of disjunctions. The existence of such an algorithm for halfspaces (or even disjunctions) has been posed as an open question in various works, starting with Sloan (1988), Cohen (1997), and was most recently highlighted in Avrim Blum's FOCS 2003 tutorial.", "full_text": "

Distribution-Independent PAC Learning of Halfspaces with Massart Noise

Ilias Diakonikolas
University of Wisconsin-Madison
ilias@cs.wisc.edu

Themis Gouleakis
Max Planck Institute for Informatics
tgouleak@mpi-inf.mpg.de

Christos Tzamos
University of Wisconsin-Madison
tzamos@wisc.edu

Abstract

We study the problem of distribution-independent PAC learning of halfspaces in the presence of Massart noise.
Specifically, we are given a set of labeled examples (x, y) drawn from a distribution D on R^{d+1} such that the marginal distribution on the unlabeled points x is arbitrary and the labels y are generated by an unknown halfspace corrupted with Massart noise at noise rate η < 1/2. The goal is to find a hypothesis h that minimizes the misclassification error Pr_{(x,y)∼D}[h(x) ≠ y]. We give a poly(d, 1/ε) time algorithm for this problem with misclassification error η + ε. We also provide evidence that improving on the error guarantee of our algorithm might be computationally hard. Prior to our work, no efficient weak (distribution-independent) learner was known in this model, even for the class of disjunctions. The existence of such an algorithm for halfspaces (or even disjunctions) has been posed as an open question in various works, starting with Sloan (1988), Cohen (1997), and was most recently highlighted in Avrim Blum's FOCS 2003 tutorial.

1 Introduction

Halfspaces, or Linear Threshold Functions (henceforth LTFs), are Boolean functions f : R^d → {±1} of the form f(x) = sign(⟨w, x⟩ − θ), where w ∈ R^d is the weight vector and θ ∈ R is the threshold. (The function sign : R → {±1} is defined as sign(u) = 1 if u ≥ 0 and sign(u) = −1 otherwise.) The problem of learning an unknown halfspace is as old as the field of machine learning — starting with Rosenblatt's Perceptron algorithm [Ros58] — and has arguably been the most influential problem in the development of the field. In the realizable setting, LTFs are known to be efficiently learnable in Valiant's distribution-independent PAC model [Val84] via Linear Programming [MT94]. In the presence of corrupted data, the situation is more subtle and crucially depends on the underlying noise model.
In the agnostic model [Hau92, KSS94] – where an adversary is allowed to arbitrarily corrupt an arbitrary η < 1/2 fraction of the labels – even weak learning is known to be computationally intractable [GR06, FGKP06, Dan16]. On the other hand, in the presence of Random Classification Noise (RCN) [AL88] – where each label is flipped independently with probability exactly η < 1/2 – a polynomial time algorithm is known [BFKV96, BFKV97].

In this work, we focus on learning halfspaces with Massart noise [MN06]:

Definition 1.1 (Massart Noise Model). Let C be a class of Boolean functions over X = R^d, D_x be an arbitrary distribution over X, and 0 ≤ η < 1/2. Let f be an unknown target function in C. A noisy example oracle, EX^Mas(f, D_x, η), works as follows: Each time EX^Mas(f, D_x, η) is invoked, it returns a labeled example (x, y), where x ∼ D_x, y = f(x) with probability 1 − η(x) and y = −f(x) with probability η(x), for an unknown parameter η(x) ≤ η. Let D denote the joint distribution on (x, y) generated by the above oracle. A learning algorithm is given i.i.d. samples from D and its goal is to output a hypothesis h such that with high probability the error Pr_{(x,y)∼D}[h(x) ≠ y] is small.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

An equivalent formulation of the Massart model [Slo88, Slo92] is the following: With probability 1 − η, we have that y = f(x), and with probability η the label y is controlled by an adversary. Hence, the Massart model lies in between the RCN and the agnostic models. (Note that the RCN model corresponds to the special case that η(x) = η for all x ∈ X.)
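To make Definition 1.1 concrete, the following is a minimal simulation of the oracle EX^Mas(f, D_x, η). The helper names (massart_oracle, sample_x, eta_fn) and the specific flip-rate function are our own hypothetical illustration, not part of the paper:

```python
import numpy as np

def massart_oracle(f, sample_x, eta_fn, rng):
    """One call to the noisy example oracle EX^Mas(f, D_x, eta):
    draw x ~ D_x, then flip the clean label f(x) with probability
    eta(x), where eta(x) <= eta < 1/2."""
    x = sample_x()
    y = f(x)
    if rng.random() < eta_fn(x):  # flip with the unknown per-point rate eta(x)
        y = -y
    return x, y

# Hypothetical instantiation: a 2-d halfspace with w* = (1, 0) on the unit
# square, and a per-point flip rate eta(x) that never exceeds eta = 0.1.
rng = np.random.default_rng(0)
w_star = np.array([1.0, 0.0])
f = lambda x: 1 if np.dot(w_star, x) >= 0 else -1
sample_x = lambda: rng.uniform(-1.0, 1.0, size=2)
eta_fn = lambda x: 0.1 * abs(x[1])
x, y = massart_oracle(f, sample_x, eta_fn, rng)
```

Note that, unlike RCN, the learner sees only the pair (x, y); the function η(·) used inside the oracle is never revealed.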
It is well-known (see, e.g., [MN06]) that poly(d, 1/ε) samples information-theoretically suffice to compute a hypothesis with misclassification error OPT + ε, where OPT is the misclassification error of the optimal halfspace. Also note that OPT ≤ η by definition. The question is whether a polynomial time algorithm exists.

The existence of an efficient distribution-independent learning algorithm for halfspaces (or even disjunctions) in the Massart model has been posed as an open question in a number of works. In the first COLT conference [Slo88] (see also [Slo92]), Sloan defined the malicious misclassification noise model (an equivalent formulation of Massart noise, described above) and asked whether there exists an efficient learning algorithm for disjunctions in this model. About a decade later, Cohen [Coh97] asked the same question for the more general class of all LTFs. The question remained open — even for weak learning of disjunctions! — and was highlighted in Avrim Blum's FOCS 2003 tutorial [Blu03]. Specifically, prior to this work, even the following very basic special case remained open:

Given labeled examples from an unknown disjunction, corrupted with 1% Massart noise, can we efficiently find a hypothesis that achieves misclassification error 49%?

The reader is referred to slides 39-40 of Avrim Blum's FOCS'03 tutorial [Blu03], where it is suggested that the above problem might be easier than agnostically learning disjunctions. As a corollary of our main result (Theorem 1.2), we answer this question in the affirmative. In particular, we obtain an efficient algorithm that achieves misclassification error arbitrarily close to η for all LTFs.

1.1 Our Results

The main result of this paper is the following:

Theorem 1.2 (Main Result).
There is an algorithm that, for all 0 < η < 1/2, on input a set of i.i.d. examples from a distribution D = EX^Mas(f, D_x, η) on R^{d+1}, where f is an unknown halfspace on R^d, runs in poly(d, b, 1/ε) time, where b is an upper bound on the bit complexity of the examples, and outputs a hypothesis h that with high probability satisfies Pr_{(x,y)∼D}[h(x) ≠ y] ≤ η + ε.

See Theorem 2.9 for a more detailed formal statement. For large-margin halfspaces, we obtain a slightly better error guarantee; see Theorem 2.2 and Remark 2.6.

Discussion. We note that our algorithm is non-proper, i.e., the hypothesis h itself is not a halfspace. The polynomial dependence on b in the runtime cannot be removed, even in the noiseless case, unless one obtains strongly-polynomial algorithms for linear programming. Finally, we note that the misclassification error of η translates to error 2η + ε with respect to the target LTF.

Our algorithm gives error η + ε, instead of the information-theoretic optimum of OPT + ε. To complement our positive result, we provide some evidence that improving on our (η + ε) error guarantee may be challenging. Roughly speaking, we show (see Theorems B.1 and B.2 in the supplementary material) that natural approaches — involving convex surrogates and refinements thereof — inherently fail, even under margin assumptions. (See Section 1.2 for a discussion.)

Broader Context. This work is part of the broader agenda of designing robust estimators in the distribution-independent setting with respect to natural noise models. A recent line of work [KLS09, ABL17, DKK+16, LRV16, DKK+17, DKK+18, DKS18, KKM18, DKS19, DKK+19] has given efficient robust estimators for a range of learning tasks (both supervised and unsupervised) in the presence of a small constant fraction of adversarial corruptions.
A limitation of these results is the assumption that the good data comes from a "tame" distribution, e.g., a Gaussian or isotropic log-concave distribution. On the other hand, if no assumption is made on the good data and the noise remains fully adversarial, these problems become computationally intractable [Ber06, GR06, Dan16]. This suggests the following general question: Are there realistic noise models that allow for efficient algorithms without imposing (strong) assumptions on the good data? Conceptually, the algorithmic results of this paper could be viewed as an affirmative answer to this question for the problem of learning halfspaces.

1.2 Technical Overview

In this section, we provide an outline of our approach and a comparison to previous techniques. Since the distribution on the unlabeled data is arbitrary, we can assume w.l.o.g. that the threshold θ = 0.

Massart Noise versus RCN. Random Classification Noise (RCN) [AL88] is the special case of Massart noise where each label is flipped with probability exactly η < 1/2. At first glance, it might seem that Massart noise is easier to deal with computationally than RCN. After all, in the Massart model we add at most as much noise as in the RCN model. It turns out that this intuition is fundamentally flawed. Roughly speaking, the ability of the Massart adversary to choose whether to perturb a given label and, if so, with what probability (which is unknown to the learner), makes the design of efficient algorithms in this model challenging. In particular, the well-known connection between learning with RCN and the Statistical Query (SQ) model [Kea93, Kea98] no longer holds, i.e., the property of being an SQ algorithm does not automatically suffice for noise-tolerant learning with Massart noise.
We note that this connection with the SQ model is leveraged in [BFKV96, BFKV97] to obtain their polynomial time algorithm for learning halfspaces with RCN.

Large Margin Halfspaces. To illustrate our approach, we start by describing our learning algorithm for γ-margin halfspaces on the unit ball. That is, we assume |⟨w*, x⟩| ≥ γ for every x in the support, where w* ∈ R^d with ‖w*‖_2 = 1 defines the target halfspace h_{w*}(x) = sign(⟨w*, x⟩). Our goal is to design a poly(d, 1/ε, 1/γ) time learning algorithm in the presence of Massart noise.

In the RCN model, the large margin case is easy because the learning problem is essentially convex. That is, there is a convex surrogate that allows us to formulate the problem as a convex program. We can use SGD to find a near-optimal solution to this convex program, which automatically gives a strong proper learner. This simple fact does not appear explicitly in the literature, but follows easily from standard tools. [Byl94] showed that a variant of the Perceptron algorithm (which can be viewed as gradient descent on a particular convex objective) learns γ-margin halfspaces in poly(d, 1/ε, 1/γ) time. The algorithm in [Byl94] requires an additional anti-concentration condition on the distribution, which is easy to remove. In Appendix C, we show that a "smoothed" version of Bylander's objective suffices as a convex surrogate under only the margin assumption.

Roughly speaking, the reason that a convex surrogate works for RCN is that the expected effect of the noise on each label is known a priori. Unfortunately, this is not the case for Massart noise. We show (Theorem B.1 in Appendix B) that no convex surrogate can lead to a weak learner, even under a margin assumption.
That is, if ŵ is the minimizer of G(w) = E_{(x,y)∼D}[φ(y⟨w, x⟩)], where φ can be any convex function, then the hypothesis sign(⟨ŵ, x⟩) is not even a weak learner. So, in sharp contrast with the RCN case, the problem is non-convex in this sense.

Our Massart learning algorithm for large margin halfspaces still uses a convex surrogate, but in a qualitatively different way. Instead of attempting to solve the problem in one shot, our algorithm adaptively applies a sequence of convex optimization problems to obtain an accurate solution in disjoint subsets of the space. Our iterative approach is motivated by a new structural lemma (Lemma 2.5) establishing the following: Even though minimizing a convex proxy does not lead to small misclassification error over the entire space, there exists a region with non-trivial probability mass where it does. Moreover, this region is efficiently identifiable by a simple thresholding rule. Specifically, we show that there exists a threshold T > 0 (which can be found algorithmically) such that the hypothesis sign(⟨ŵ, x⟩) has error bounded by η + ε in the region R_T = {x : |⟨ŵ, x⟩| ≥ T}. Here ŵ is any near-optimal solution to an appropriate convex optimization problem, defined via a convex surrogate objective similar to the one used in [Byl94]. We note that Lemma 2.5 is the main technical novelty of this paper and motivates our algorithm. Given Lemma 2.5, in any iteration i we can find the best threshold T^(i) using samples, and obtain a learner with misclassification error η + ε in the corresponding region.
Since each region has non-trivial mass, iterating this scheme a small number of times allows us to find a non-proper hypothesis (a decision list of halfspaces) with misclassification error at most η + ε in the entire space.

The idea of iteratively optimizing a convex surrogate was used in [BFKV96] to learn halfspaces with RCN without a margin. Despite this similarity, we note that the algorithm of [BFKV96] fails to even obtain a weak learner in the Massart model. We point out two crucial technical differences: First, the iterative approach in [BFKV96] was needed to achieve polynomial running time. As mentioned already, a convex proxy is guaranteed to converge to the true solution with RCN, but the convergence may be too slow (when the margin is tiny). In contrast, with Massart noise (even under a margin condition) convex surrogates cannot even give weak learning in the entire domain. Second, the algorithm of [BFKV96] used a fixed threshold in each iteration, equal to the margin parameter obtained after an appropriate pre-processing of the data (which is needed in order to ensure a weak margin property). In contrast, in our setting, we need to find an appropriate threshold T^(i) in each iteration i, according to the criterion specified by our Lemma 2.5.

General Case. Our algorithm for the general case (in the absence of a margin) is qualitatively similar to our algorithm for the large margin case, but the details are more elaborate.
We borrow an idea from [BFKV96] that in some sense allows us to "reduce" the general case to the large margin case. Specifically, [BFKV96] (see also [DV04a]) developed a pre-processing routine that slightly modifies the distribution on the unlabeled points and guarantees the following weak margin property: After pre-processing, there exists an explicit margin parameter σ = Ω(1/poly(d, b)) such that any hyperplane through the origin has at least a non-trivial mass of the distribution at distance at least σ from it. Using this pre-processing step, we are able to adapt our algorithm from the previous subsection to work without margin assumptions in poly(d, b, 1/ε) time. While our analysis is similar in spirit to the case of large margin, we note that the margin property obtained via the [BFKV96, DV04a] pre-processing step is (necessarily) weaker, hence additional careful analysis is required.

Lower Bounds Against Natural Approaches. We have already explained our Theorem B.1, which shows that using a convex surrogate over the entire space cannot give a weak learner. Our algorithm, however, can achieve error η + ε by iteratively optimizing a specific convex surrogate in disjoint subsets of the domain. A natural question is whether one can obtain qualitatively better accuracy, e.g., f(OPT) + ε, by using a different convex objective function in our iterative thresholding approach. We show (Theorem B.2) that such an improvement is not possible: Using a different convex proxy cannot lead to error better than (1 − o(1)) · η. It is a plausible conjecture that improving on the error guarantee of our algorithm is computationally hard. We leave this as an intriguing open problem for future work.

1.3 Prior and Related Work

Bylander [Byl94] gave a polynomial time algorithm to learn large margin halfspaces with RCN (under an additional anti-concentration assumption).
The work of Blum et al. [BFKV96, BFKV97] gave the first polynomial time algorithm for distribution-independent learning of halfspaces with RCN without any margin assumptions. Soon thereafter, [Coh97] gave a polynomial-time proper learning algorithm for the problem. Subsequently, Dunagan and Vempala [DV04b] gave a rescaled perceptron algorithm for solving linear programs, which translates to a significantly simpler and faster proper learning algorithm.

The term "Massart noise" was coined after [MN06]. An equivalent version of the model was previously studied by Rivest and Sloan [Slo88, Slo92, RS94, Slo96], and a very similar asymmetric random noise model goes back to Vapnik [Vap82]. Prior to this work, essentially no efficient algorithms with non-trivial error guarantees were known in the distribution-free Massart noise model. It should be noted that polynomial time algorithms with error OPT + ε are known [ABHU15, ZLC17, YZ17] when the marginal distribution on the unlabeled data is uniform on the unit sphere. For the case that the unlabeled data comes from an isotropic log-concave distribution, [ABHZ16] give a d^{2^{poly(1/(1−2η))}}/poly(ε) sample and time algorithm.

1.4 Preliminaries

For n ∈ Z_+, we denote [n] := {1, ..., n}. We will use small boldface characters for vectors and we let e_i denote the i-th vector of an orthonormal basis. For x ∈ R^d and i ∈ [d], x_i denotes the i-th coordinate of x, and ‖x‖_2 := (Σ_{i=1}^d x_i^2)^{1/2} denotes the ℓ_2-norm of x. We will use ⟨x, y⟩ for the inner product between x, y ∈ R^d. We will use E[X] for the expectation of random variable X and Pr[E] for the probability of event E.

An origin-centered halfspace is a Boolean-valued function h_w : R^d → {±1} of the form h_w(x) = sign(⟨w, x⟩), where w ∈ R^d. (Note that we may assume w.l.o.g.
that ‖w‖_2 = 1.) We denote by H_d the class of all origin-centered halfspaces on R^d.

We consider a classification problem where labeled examples (x, y) are drawn i.i.d. from a distribution D. We denote by D_x the marginal of D on x, and for any x we denote by D_y(x) the distribution of y conditional on x. Our goal is to find a hypothesis h with low misclassification error. We will denote the misclassification error of a hypothesis h with respect to D by err_{0-1}^D(h) = Pr_{(x,y)∼D}[h(x) ≠ y]. Let OPT = min_{h∈H_d} err_{0-1}^D(h) denote the optimal misclassification error of any halfspace, and let w* be the normal vector to a halfspace h_{w*} that achieves this.

2 Algorithm for Learning Halfspaces with Massart Noise

In this section, we present the main result of this paper, which is an efficient algorithm that achieves η + ε misclassification error for distribution-independent learning of halfspaces with Massart noise η.

Our algorithm uses (stochastic) gradient descent on a convex proxy function L(w) for the misclassification error to identify a region with small misclassification error. The loss function penalizes the points which are misclassified by the threshold function h_w, proportionally to their distance from the corresponding hyperplane, while rewarding the correctly classified points at a smaller rate. Directly optimizing this convex objective does not lead to a separator with low error, but it guarantees that for a non-negligible fraction of the mass away from the separating hyperplane the misclassification error will be at most η + ε.
Classifying points in this region according to the hyperplane and recursively working on the remaining points, we obtain an improper learning algorithm that achieves η + ε error overall.

We now develop some necessary notation before proceeding with the description and analysis of our algorithm.

Our algorithm considers the following convex proxy for the misclassification error as a function of the weight vector w:

L(w) = E_{(x,y)∼D}[LeakyRelu_λ(−y⟨w, x⟩)],

under the constraint ‖w‖_2 ≤ 1, where LeakyRelu_λ(z) = (1 − λ)z if z ≥ 0 and λz if z < 0, and λ is the leakage parameter, which we will set to be λ ≈ η.

We define the per-point misclassification error and the per-point error of the proxy function as err(w, x) = Pr_{y∼D_y(x)}[h_w(x) ≠ y] and ℓ(w, x) = E_{y∼D_y(x)}[LeakyRelu_λ(−y⟨w, x⟩)], respectively.

Notice that err_{0-1}^D(h_w) = E_{x∼D_x}[err(w, x)] and L(w) = E_{x∼D_x}[ℓ(w, x)]. Moreover, OPT = E_{x∼D_x}[err(w*, x)] = E_{x∼D_x}[η(x)].

Relationship between proxy loss and misclassification error. We first relate the proxy loss and the misclassification error.

Claim 2.1. For any w, x, we have that ℓ(w, x) = (err(w, x) − λ)|⟨w, x⟩|.

Proof.
We consider two cases:

• Case sign(⟨w, x⟩) = sign(⟨w*, x⟩): In this case, we have that err(w, x) = η(x), while

ℓ(w, x) = η(x)(1 − λ)|⟨w, x⟩| − (1 − η(x))λ|⟨w, x⟩| = (η(x) − λ)|⟨w, x⟩|.

• Case sign(⟨w, x⟩) ≠ sign(⟨w*, x⟩): In this case, we have that err(w, x) = 1 − η(x), while

ℓ(w, x) = (1 − η(x))(1 − λ)|⟨w, x⟩| − η(x)λ|⟨w, x⟩| = (1 − η(x) − λ)|⟨w, x⟩|.

This completes the proof of Claim 2.1.

Claim 2.1 shows that minimizing E_{x∼D_x}[ℓ(w, x)/|⟨w, x⟩|] is equivalent to minimizing the misclassification error. Unfortunately, this objective is hard to minimize as it is non-convex, but one would hope that minimizing L(w) instead may have a similar effect. As we show, this is not true because |⟨w, x⟩| might vary significantly across points, and in fact it is not possible to use a convex proxy that achieves bounded misclassification error directly.

Our algorithm circumvents this difficulty by approaching the problem indirectly to find a non-proper classifier. Specifically, our algorithm works in multiple rounds, where within each round only points with a high value of |⟨w, x⟩| are considered.
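As a sanity check, Claim 2.1 can be verified numerically. The sketch below (our own helper names, with hypothetical parameter values) computes ℓ(w, x) directly from the definition of LeakyRelu_λ and compares it to (err(w, x) − λ)|⟨w, x⟩| in both cases of the proof:

```python
import numpy as np

def leaky_relu(z, lam):
    # LeakyRelu_lambda(z) = (1 - lambda) z for z >= 0 and lambda z for z < 0
    return (1 - lam) * z if z >= 0 else lam * z

def proxy_loss_point(w, x, f_x, eta_x, lam):
    """l(w, x) = E_{y ~ D_y(x)}[LeakyRelu_lam(-y <w, x>)], where y = f(x)
    with probability 1 - eta(x) and y = -f(x) with probability eta(x)."""
    m = np.dot(w, x)
    return (1 - eta_x) * leaky_relu(-f_x * m, lam) + eta_x * leaky_relu(f_x * m, lam)

# Hypothetical values; f_x = +1 and f_x = -1 exercise the two agreement cases.
w = np.array([0.6, 0.8]); x = np.array([0.5, -0.25])
lam, eta_x = 0.2, 0.1
for f_x in (1, -1):
    err = eta_x if np.sign(np.dot(w, x)) == f_x else 1 - eta_x
    assert np.isclose(proxy_loss_point(w, x, f_x, eta_x, lam),
                      (err - lam) * abs(np.dot(w, x)))
```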
The intuition is based on the fact that the approximation of the convex proxy to the misclassification error is more accurate for those points that have comparable distance to the halfspace. In Section 2.1, we handle the large margin case, and in Section 2.2 we handle the general case.

2.1 Warm-up: Learning Large Margin Halfspaces

We consider the case that there is no probability mass within distance γ from the separating hyperplane ⟨w*, x⟩ = 0, ‖w*‖_2 = 1. Formally, we assume that for every x ∼ D_x, ‖x‖_2 ≤ 1 and |⟨w*, x⟩| ≥ γ. The pseudo-code of our algorithm is given in Algorithm 1. Our algorithm returns a decision list [(w^(1), T^(1)), (w^(2), T^(2)), ···] as output. To classify a point x given the decision list, the first i is identified such that |⟨w^(i), x⟩| ≥ T^(i), and sign(⟨w^(i), x⟩) is returned. If no such i exists, an arbitrary prediction is returned.

Algorithm 1 Main Algorithm (with margin)
1: Set S^(1) = R^d, λ = η + ε, m = Õ(1/(γ^2 ε^4)).
2: Set i ← 1.
3: Draw O((1/ε^2) log(1/(εγ))) samples from D_x to form an empirical distribution D̃_x.
4: while Pr_{x∼D̃_x}[x ∈ S^(i)] ≥ ε do
5:   Set D^(i) = D|S^(i), the distribution conditional on the unclassified points.
6:   Let L^(i)(w) = E_{(x,y)∼D^(i)}[LeakyRelu_λ(−y⟨w, x⟩)].
7:   Run SGD on L^(i)(w) for Õ(1/(γ^2 ε^2)) iterations to get w^(i) with ‖w^(i)‖_2 = 1 such that L^(i)(w^(i)) ≤ min_{w:‖w‖_2≤1} L^(i)(w) + γε/2.
8:   Draw m samples from D^(i) to form an empirical distribution D_m^(i).
9:   Find a threshold T^(i) such that Pr_{(x,y)∼D_m^(i)}[|⟨w^(i), x⟩| ≥ T^(i)] ≥ γε and the empirical misclassification error, Pr_{(x,y)∼D_m^(i)}[h_{w^(i)}(x) ≠ y | |⟨w^(i), x⟩| ≥ T^(i)], is minimized.
10:  Update the unclassified region S^(i+1) ← S^(i) \ {x : |⟨w^(i), x⟩| ≥ T^(i)} and set i ← i + 1.
11: Return the classifier [(w^(1), T^(1)), (w^(2), T^(2)), ···].

The main result of this section is the following:

Theorem 2.2. Let D be a distribution on B_d × {±1} such that D_x satisfies the γ-margin property with respect to w* and y is generated by sign(⟨w*, x⟩) corrupted with Massart noise at rate η < 1/2. Algorithm 1 uses Õ(1/(γ^3 ε^5)) samples from D, runs in poly(d, 1/ε, 1/γ) time, and returns, with probability 2/3, a classifier h with misclassification error err_{0-1}^D(h) ≤ η + ε.

Our analysis focuses on a single iteration of Algorithm 1. We will show that a large fraction of the points is classified within error η + ε at every iteration. To achieve this, we analyze the convex objective L. We start by showing that the optimal classifier w* obtains a significantly negative objective value.

Lemma 2.3. If λ ≥ η, then L(w*) ≤ −γ(λ − OPT).

Proof. For any fixed x, using Claim 2.1, we have that

ℓ(w*, x) = (err(w*, x) − λ)|⟨w*, x⟩| = (η(x) − λ)|⟨w*, x⟩| ≤ −γ(λ − η(x)),

since |⟨w*, x⟩| ≥ γ and η(x) − λ ≤ 0. Taking expectation over x ∼ D_x, the statement follows.

Lemma 2.3 is the only place where the Massart noise assumption is used in our approach; it establishes that weight vectors with sufficiently negative objective value exist.
As we will show, any weight vector w with this property can be found with few samples and must accurately classify some region of non-negligible mass away from it (Lemma 2.5).

We now argue that we can use stochastic gradient descent (SGD) to efficiently identify a point w that achieves objective value comparably small to the guarantee of Lemma 2.3. We use the following standard property of SGD:

Lemma 2.4 (see, e.g., Theorem 3.4.11 in [Duc16]). Let L be any convex function. Consider the (projected) SGD iteration that is initialized at w^(0) = 0 and for every step computes

w^(t+1/2) = w^(t) − ρ v^(t) and w^(t+1) = argmin_{w:‖w‖_2≤1} ‖w − w^(t+1/2)‖_2,

where v^(t) is a stochastic gradient such that for all steps E[v^(t) | w^(t)] ∈ ∂L(w^(t)) and ‖v^(t)‖_2 ≤ 1. Assume that SGD is run for T iterations with step size ρ = 1/√T, and let w̄ = (1/T) Σ_{t=1}^T w^(t). Then, for any ε, δ > 0, after T = Ω(log(1/δ)/ε^2) iterations, with probability at least 1 − δ we have that L(w̄) ≤ min_{w:‖w‖_2≤1} L(w) + ε.

By Lemma 2.3, we know that min_{w:‖w‖_2≤1} L(w) ≤ −γ(λ − OPT). By Lemma 2.4, it follows that by running SGD on L(w) with projection to the unit ℓ_2-ball for O(log(1/δ)/(γ^2 (λ − OPT)^2)) steps, we find a w such that L(w) ≤ −γ(λ − OPT)/2 with probability at least 1 − δ.

Note that we can assume without loss of generality that ‖w‖_2 = 1, as increasing the magnitude of w only decreases the objective value.

We now consider the misclassification error of the halfspace h_w conditional on the points that are further than some distance T from the separating hyperplane.
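The projected SGD iteration of Lemma 2.4 is standard; a minimal sketch follows, with a deterministic toy objective L(w) = ‖w − c‖_2^2 / 4 standing in for the paper's loss (the function names and the toy objective are our own hypothetical choices):

```python
import numpy as np

def projected_sgd(stoch_grad, dim, T):
    """Projected SGD from Lemma 2.4: w^(0) = 0, step size rho = 1/sqrt(T),
    Euclidean projection onto the unit l2-ball after each step, and the
    average iterate is returned."""
    w = np.zeros(dim)
    rho = 1.0 / np.sqrt(T)
    avg = np.zeros(dim)
    for _ in range(T):
        w = w - rho * stoch_grad(w)      # (stochastic) gradient step, ||v|| <= 1
        norm = np.linalg.norm(w)
        if norm > 1.0:                   # project back onto {w : ||w||_2 <= 1}
            w = w / norm
        avg += w / T                     # running average of the iterates
    return avg

# Toy usage: gradient of ||w - c||_2^2 / 4 is (w - c)/2, which has norm <= 1
# on the unit ball, so the average iterate converges to c.
c = np.array([0.3, -0.4])
w_bar = projected_sgd(lambda w: (w - c) / 2.0, dim=2, T=5000)
```

A deterministic gradient is a valid special case of a stochastic one, so the lemma's guarantee applies directly to this toy run.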
We claim that there exists a threshold T > 0 where the restriction has non-trivial mass and the conditional misclassification error is small:

Lemma 2.5. Consider a vector w with L(w) < 0. There exists a threshold T ≥ 0 such that (i) Pr_{(x,y)∼D}[|⟨w, x⟩| ≥ T] ≥ |L(w)|/(2λ), and (ii) Pr_{(x,y)∼D}[h_w(x) ≠ y | |⟨w, x⟩| ≥ T] ≤ λ − |L(w)|/2.

Proof. We will show there is a T ≥ 0 such that Pr_{(x,y)∼D}[h_w(x) ≠ y | |⟨w, x⟩| ≥ T] ≤ λ − ζ, where ζ := |L(w)|/2, or equivalently, E_{x∼D_x}[(err(w, x) − λ + ζ) 1_{|⟨w,x⟩|≥T}] ≤ 0.

For a T drawn uniformly at random in [0, 1], we have that:

∫_0^1 E_{x∼D_x}[(err(w, x) − λ + ζ) 1_{|⟨w,x⟩|≥T}] dT = E_{x∼D_x}[(err(w, x) − λ)|⟨w, x⟩|] + ζ E_{x∼D_x}[|⟨w, x⟩|]
≤ E_{x∼D_x}[ℓ(w, x)] + ζ = L(w) + ζ = L(w)/2 < 0.

Thus, there exists a T̄ such that E_{x∼D_x}[(err(w, x) − λ + ζ) 1_{|⟨w,x⟩|≥T̄}] ≤ 0. Consider the minimum such T̄. Then, since err(w, x) − λ + ζ ≥ −λ, we have

∫_{T̄}^1 E_{x∼D_x}[(err(w, x) − λ + ζ) 1_{|⟨w,x⟩|≥T}] dT ≥ −λ · Pr_{(x,y)∼D}[|⟨w, x⟩| ≥ T̄].

By definition of T̄, it must be the case that ∫_0^{T̄} E_{x∼D_x}[(err(w, x) − λ + ζ) 1_{|⟨w,x⟩|≥T}] dT ≥ 0. Therefore,

L(w)/2 ≥ ∫_{T̄}^1 E_{x∼D_x}[(err(w, x) − λ + ζ) 1_{|⟨w,x⟩|≥T}] dT ≥ −λ · Pr_{(x,y)∼D}[|⟨w, x⟩| ≥ T̄],

which implies that Pr_{(x,y)∼D}[|⟨w, x⟩| ≥ T̄] ≥ |L(w)|/(2λ).
This completes the proof of Lemma 2.5.

Even though minimizing the convex proxy L does not lead to low misclassification error overall, Lemma 2.5 shows that there exists a region of non-trivial mass where it does. This region is identifiable by a simple threshold rule. We are now ready to prove Theorem 2.2.

Proof of Theorem 2.2. We consider the steps of Algorithm 1 in each iteration of the while loop. At iteration i, we consider a distribution D^(i) consisting only of points not handled in previous iterations.

We start by noting that with high probability the total number of iterations is Õ(1/(γε)). This can be seen as follows: The empirical probability mass under D^(i)_m of the region {x : |⟨w^(i), x⟩| ≥ T^(i)} removed from S^(i) to obtain S^(i+1) is at least γε (Step 9). Since m = Õ(1/(γ²ε⁴)), the DKW inequality [DKW56] implies that the true probability mass of this region is at least γε/2 with high probability. By a union bound over i ≤ K = Θ(log(1/ε)/(εγ)), it follows that with high probability we have that Pr_{D_x}[S^(i+1)] ≤ (1 − γε/2)^i for all i ∈ [K]. After K iterations, we will have that Pr_{D_x}[S^(i+1)] ≤ ε/3. Step 3 guarantees that the mass of S^(i) under D̃_x is within an additive ε/3 of its mass under D_x, for i ∈ [K]. This implies that the loop terminates after at most K iterations.

By Lemma 2.3 and the fact that every D^(i) has margin γ, it follows that the minimizer of the loss L^(i) has value less than −γ(λ − OPT^(i)) ≤ −γε, as OPT^(i) ≤ η and λ = η + ε. By the guarantees of Lemma 2.4, running SGD in line 7 on L^(i)(·) with projection to the unit ℓ₂-ball for O(log(1/δ)/(γ²ε²)) steps, we obtain a w^(i) such that, with probability at least 1 − δ, it holds that L^(i)(w^(i)) ≤ −γε/2 and ‖w^(i)‖₂ = 1. Here δ > 0 is a parameter that is selected so that the following claim holds: With probability at least 9/10, for all iterations i of the while loop we have that L^(i)(w^(i)) ≤ −γε/2. Since the total number of iterations is Õ(1/(γε)), setting δ to Θ̃(εγ) and applying a union bound over all iterations gives the previous claim. Therefore, the total number of SGD steps per iteration is Õ(1/(γ²ε²)). For a given iteration of the while loop, running SGD requires Õ(1/(γ²ε²)) samples from D^(i), which translate to at most Õ(1/(γ²ε³)) samples from D, as Pr_{x∼D_x}[x ∈ S^(i)] ≥ 2ε/3.

Lemma 2.5 implies that there exists T ≥ 0 such that: (a) Pr_{(x,y)∼D^(i)}[|⟨w, x⟩| ≥ T] ≥ γε, and (b) Pr_{(x,y)∼D^(i)}[h_w(x) ≠ y | |⟨w, x⟩| ≥ T] ≤ η + ε. Line 9 of Algorithm 1 estimates the threshold using samples. By the DKW inequality [DKW56], we know that with m = Õ(1/(γ²ε⁴)) samples we can estimate the CDF within error γε² with probability 1 − poly(ε, γ). This suffices to estimate the probability mass of the region within additive γε² and the misclassification error within ε/3. This is satisfied for all iterations with constant probability.

In summary, with high constant success probability, Algorithm 1 runs for Õ(1/(γε)) iterations and draws Õ(1/(γ²ε⁴)) samples per round for a total of Õ(1/(γ³ε⁵)) samples.
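For intuition, the overall structure of Algorithm 1 in the large-margin case (SGD on the LeakyRelu proxy, the threshold rule of Lemma 2.5, and the resulting decision list) can be sketched as follows. This is an illustrative simplification operating on one fixed sample rather than fresh draws per round; the function names, the mass floor 0.05, the step count, and the parameterization LeakyRelu_λ(u) = (1 − λ)u for u ≥ 0 and λu for u < 0 are our assumptions for illustration, not quoted from the paper:

```python
import numpy as np

def sgd_minimize(X, y, lam, steps, rng):
    # Projected SGD (cf. Lemma 2.4) on the proxy loss
    # L(w) = E[LeakyRelu_lam(-y<w, x>)], with the assumed
    # parameterization LeakyRelu_lam(u) = (1-lam)*u if u >= 0 else lam*u.
    w = np.zeros(X.shape[1])
    rho = 1.0 / np.sqrt(steps)
    avg = np.zeros_like(w)
    for _ in range(steps):
        i = rng.integers(len(y))
        u = -y[i] * (X[i] @ w)
        slope = (1.0 - lam) if u >= 0 else lam
        w = w - rho * (-slope * y[i] * X[i])   # stochastic subgradient step
        nrm = np.linalg.norm(w)
        if nrm > 1.0:                          # project onto the unit l2-ball
            w = w / nrm
        avg += w / steps
    nrm = np.linalg.norm(avg)
    return avg / nrm if nrm > 0 else avg       # WLOG rescale to ||w||_2 = 1

def learn_decision_list(X, y, eta, eps, steps=20000, seed=0):
    """End-to-end sketch of Algorithm 1 (large-margin case) on a fixed
    sample: fit w by SGD, find a high-margin region where h_w has small
    empirical error, add (w, T) to the decision list, remove the region,
    and recurse on what remains."""
    rng = np.random.default_rng(seed)
    lam = eta + eps
    active = np.ones(len(y), dtype=bool)
    rules = []
    while active.mean() > eps and active.sum() > 1:
        Xa, ya = X[active], y[active]
        w = sgd_minimize(Xa, ya, lam, steps, rng)
        margins = np.abs(Xa @ w)
        preds = np.sign(Xa @ w)
        preds[preds == 0] = 1
        # Threshold rule of Lemma 2.5: keep non-trivial mass, small error.
        best_T, best_err = None, np.inf
        for T in np.unique(margins):
            keep = margins >= T
            if keep.mean() < 0.05:             # assumed mass floor
                continue
            err = np.mean(preds[keep] != ya[keep])
            if err < best_err:
                best_T, best_err = T, err
        rules.append((w, best_T))
        idx = np.where(active)[0]
        active[idx[np.abs(X[idx] @ w) >= best_T]] = False
    return rules

def predict(rules, x):
    # Scan the decision list; default to +1 if no rule covers x.
    for w, T in rules:
        if abs(w @ x) >= T:
            return 1 if w @ x >= 0 else -1
    return 1
```

On clean large-margin data this loop typically peels off nearly everything in the first round; the point of the sketch is only the structure of the argument above: fit, threshold, remove a region of non-trivial mass, recurse.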
As each iteration\nruns in polynomial time, the total running time follows.\n\nWhen the while loop terminates, we have that Prx\u223cDx [x \u2208 S(i)] \u2264 4\u0001/3, i.e., we will have\naccounted for at least a (1\u2212 4\u0001/3)-fraction of the total probability mass. Since our algorithm achieves\nmisclassi\ufb01cation error at most \u03b7 + 4\u0001/3 in all the regions we accounted for, its total misclassi\ufb01cation\nerror is at most \u03b7 + 8\u0001/3. Rescaling \u0001 by a constant factor gives Theorem 2.2.\nRemark 2.6. If the value of OPT is smaller than \u03b7 \u2212 \u03be for some value \u03be > 0, Algorithm 1 gets\nmisclassi\ufb01cation error less than \u03b7 \u2212 \u2126(\u03b32\u03be2) when run for \u0001 = O(\u03b32\u03be2). This is because, in the \ufb01rst\niteration, L(1)(w(1)) \u2264 \u2212\u03b3(\u03bb\u2212 OPT)/2 \u2264 \u2212\u03b3\u03be/2, which implies, by Lemma 2.5, that the obtained\nerror in S(1) is at most \u03bb \u2212 \u03b3\u03be/4. The misclassi\ufb01cation error in the remaining regions is at most\n\u03bb + \u0001, and region S(1) has probability mass at least \u03b3\u03be/4. Thus, the total misclassi\ufb01cation error is at\nmost \u03bb + \u0001 \u2212 \u03b32\u03be2/16 = \u03b7 \u2212 \u2126(\u03b32\u03be2), when run for \u0001 = O(\u03b32\u03be2).\n\n2.2 The General Case\nIn the general case, we assume that Dx is an arbitrary distribution supported on b-bit integers. While\nsuch a distribution might have exponentially small margin in the dimension d (or even 0), we will\npreprocess the distribution to ensure a margin condition by removing outliers.\n\nWe will require the following notion of an outlier:\n\nDe\ufb01nition 2.7 ([DV04a]). 
We call a point x in the support of a distribution D_x a β-outlier, if there exists a vector w ∈ R^d such that ⟨w, x⟩² ≥ β E_{x∼D_x}[⟨w, x⟩²].

We will use Theorem 3 of [DV04a], which shows that any distribution supported on b-bit integers can be efficiently preprocessed using samples so that no large outliers exist.

Lemma 2.8 (Rephrasing of Theorem 3 of [DV04a]). Using m = Õ(d²b) samples from D_x, one can identify with high probability an ellipsoid E such that Pr_{x∼D_x}[x ∈ E] ≥ 1/2 and D_x|_E has no Γ⁻¹ = Õ(db)-outliers.

Given this lemma, we can adapt Algorithm 1 for the large margin case to work in general. The pseudo-code is given in Algorithm 2. It similarly returns a decision list [(w^(1), T^(1), E^(1)), (w^(2), T^(2), E^(2)), ···] as output.

Algorithm 2 Main Algorithm (general case)
1: Set S^(1) = R^d, λ = η + ε, Γ⁻¹ = Õ(db), m = Õ(1/(Γ²ε⁴)).
2: Set i ← 1.
3: Draw O((1/ε²) log(1/(εΓ))) samples from D_x to form an empirical distribution D̃_x.
4: while Pr_{x∼D̃_x}[x ∈ S^(i)] ≥ ε do
5:   Run the algorithm of Lemma 2.8 to remove Γ⁻¹-outliers from the distribution D_{S^(i)} by filtering points outside the ellipsoid E^(i).
6:   Let Σ^(i) = E_{(x,y)∼D|_{S^(i)∩E^(i)}}[xxᵀ] and set D^(i) = Γ (Σ^(i))^{−1/2} · D|_{S^(i)∩E^(i)}, i.e., the distribution D|_{S^(i)∩E^(i)} brought into isotropic position and rescaled by Γ so that all vectors have ℓ₂-norm at most 1.
7:   Let L^(i)(w) = E_{(x,y)∼D^(i)}[LeakyRelu_λ(−y⟨w, x⟩)].
8:   Run SGD on L^(i)(w) for Õ(1/(Γ²ε²)) iterations, to get w^(i) with ‖w^(i)‖₂ = 1 such that L^(i)(w^(i)) ≤ min_{w : ‖w‖₂ ≤ 1} L^(i)(w) + Γε/2.
9:   Draw m samples from D^(i) to form an empirical distribution D^(i)_m.
10:  Find a threshold T^(i) such that Pr_{(x,y)∼D^(i)_m}[|⟨w^(i), x⟩| ≥ T^(i)] ≥ Γε and the empirical misclassification error, Pr_{(x,y)∼D^(i)_m}[h_w(x) ≠ y | |⟨w^(i), x⟩| ≥ T^(i)], is minimized.
11:  Revert the linear transformation by setting w^(i) ← Γ (Σ^(i))^{−1/2} · w^(i).
12:  Update the unclassified region S^(i+1) ← S^(i) \ {x : x ∈ E^(i) ∧ |⟨w^(i), x⟩| ≥ T^(i)} and set i ← i + 1.
13: Return the classifier [(w^(1), T^(1), E^(1)), (w^(2), T^(2), E^(2)), ···]

Our main result is the following theorem:

Theorem 2.9. Let D be a distribution over (d + 1)-dimensional labeled examples with bit-complexity b, generated by an unknown halfspace corrupted by Massart noise at rate η < 1/2. Algorithm 2 uses Õ(d³b³/ε⁵) samples, runs in poly(d, 1/ε, b) time, and returns, with probability 2/3, a classifier h with misclassification error err^D_{0−1}(h) ≤ η + ε.

3 Conclusions

The main contribution of this paper is the first non-trivial learning algorithm for the class of halfspaces (or even disjunctions) in the distribution-free PAC model with Massart noise. Our algorithm achieves misclassification error η + ε in time poly(d, 1/ε), where η < 1/2 is an upper bound on the Massart noise rate. The most obvious open problem is whether this error guarantee can be improved to f(OPT) + ε (for some function f : R → R such that lim_{x→0} f(x) = 0) or, ideally, to OPT + ε. It follows from our lower bound constructions that such an improvement would require new algorithmic ideas. It is a plausible conjecture that obtaining better error guarantees is computationally intractable. This is left as an interesting open problem for future work.
Another open question is whether there is an efficient proper learner matching the error guarantees of our algorithm. We believe that this is possible, building on the ideas in [DV04b], but we did not pursue this direction. More broadly, what other concept classes admit non-trivial algorithms in the Massart noise model? Can one establish non-trivial reductions between the Massart noise model and the agnostic model? And are there other natural semi-random input models that allow for efficient PAC learning algorithms in the distribution-free setting?

Acknowledgments

Part of this work was performed while Ilias Diakonikolas was at the Simons Institute for the Theory of Computing during the program on Foundations of Data Science. Ilias Diakonikolas is supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship. This research was performed while Themis Gouleakis was a postdoctoral researcher at USC.

References

[ABHU15] P. Awasthi, M. F. Balcan, N. Haghtalab, and R. Urner. Efficient learning of linear separators under bounded noise. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, pages 167–190, 2015.

[ABHZ16] P. Awasthi, M. F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, pages 152–192, 2016.

[ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1–50:27, 2017.

[AL88] D. Angluin and P. Laird. Learning from noisy examples. Mach. Learn., 2(4):343–370, 1988.

[Ber06] T. Bernholt. Robust estimators are hard to compute. Technical report, University of Dortmund, Germany, 2006.

[BFKV96] A. Blum, A. M. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions.
In 37th Annual Symposium on Foundations of\nComputer Science, FOCS \u201996, pages 330\u2013338, 1996.\n\n[BFKV97] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial time algorithm for\n\nlearning noisy linear threshold functions. Algorithmica, 22(1/2):35\u201352, 1997.\n\n[Blu03] A. Blum. Machine learning: My favorite results, directions, and open problems. In 44th\n\nSymposium on Foundations of Computer Science (FOCS 2003), pages 11\u201314, 2003.\n\n[Byl94] T. Bylander. Learning linear threshold functions in the presence of classi\ufb01cation noise.\nIn Proceedings of the Seventh Annual ACM Conference on Computational Learning\nTheory, COLT 1994, pages 340\u2013347, 1994.\n\n[Coh97] E. Cohen. Learning noisy perceptrons by a perceptron in polynomial time. In Proceedings\nof the Thirty-Eighth Symposium on Foundations of Computer Science, pages 514\u2013521,\n1997.\n\n[Dan16] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings\nof the 48th Annual Symposium on Theory of Computing, STOC 2016, pages 105\u2013117,\n2016.\n\n[DKK+16] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robust\nestimators in high dimensions without the computational intractability. In Proceedings\nof FOCS\u201916, pages 655\u2013664, 2016.\n\n[DKK+17] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being\nrobust (in high dimensions) can be practical. In Proceedings of the 34th International\nConference on Machine Learning, ICML 2017, pages 999\u20131008, 2017.\n\n[DKK+18] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly\nlearning a gaussian: Getting optimal error, ef\ufb01ciently. In Proceedings of the Twenty-Ninth\nAnnual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, pages 2683\u20132702,\n2018.\n\n[DKK+19] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and Alistair Stewart. Sever: A\nrobust meta-algorithm for stochastic optimization. 
In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, pages 1596–1606, 2019.

[DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061–1073, 2018.

[DKS19] I. Diakonikolas, W. Kong, and A. Stewart. Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, pages 2745–2754, 2019.

[DKW56] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Mathematical Statistics, 27(3):642–669, 1956.

[Duc16] J. C. Duchi. Introductory lectures on stochastic convex optimization. Park City Mathematics Institute, Graduate Summer School Lectures, 2016.

[DV04a] J. Dunagan and S. Vempala. Optimal outlier removal in high-dimensional spaces. J. Computer & System Sciences, 68(2):335–373, 2004.

[DV04b] J. Dunagan and S. Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 315–320, 2004.

[FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In Proc. FOCS, pages 563–576, 2006.

[GR06] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS), pages 543–552. IEEE Computer Society, 2006.

[Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.

[Kea93] M. J. Kearns. Efficient noise-tolerant learning from statistical queries.
In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pages 392–401, 1993.

[Kea98] M. J. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.

[KKM18] A. R. Klivans, P. K. Kothari, and R. Meka. Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, COLT 2018, pages 1420–1430, 2018.

[KLS09] A. Klivans, P. Long, and R. Servedio. Learning halfspaces with malicious noise. To appear in Proc. 17th Internat. Colloq. on Automata, Languages and Programming (ICALP), 2009.

[KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward Efficient Agnostic Learning. Machine Learning, 17(2/3):115–141, 1994.

[LRV16] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In Proceedings of FOCS'16, 2016.

[LS10] P. M. Long and R. A. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287–304, 2010.

[MN06] P. Massart and E. Nedelec. Risk bounds for statistical learning. Ann. Statist., 34(5):2326–2366, 2006.

[MT94] W. Maass and G. Turan. How fast can a threshold gate learn? In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems, pages 381–414. MIT Press, 1994.

[Ros58] F. Rosenblatt. The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.

[RS94] R. Rivest and R. Sloan. A formal model of hierarchical concept learning. Information and Computation, 114(1):88–114, 1994.

[Slo88] R. H. Sloan. Types of noise in data for concept learning. In Proceedings of the First Annual Workshop on Computational Learning Theory, COLT '88, pages 91–96, San Francisco, CA, USA, 1988. Morgan Kaufmann Publishers Inc.

[Slo92] R. H. Sloan.
Corrigendum to types of noise in data for concept learning. In Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory, COLT 1992, page 450, 1992.

[Slo96] R. H. Sloan. PAC Learning, Noise, and Geometry, pages 21–41. Birkhäuser Boston, Boston, MA, 1996.

[Val84] L. G. Valiant. A theory of the learnable. In Proc. 16th Annual ACM Symposium on Theory of Computing (STOC), pages 436–445. ACM Press, 1984.

[Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data: Springer Series in Statistics. Springer-Verlag, Berlin, Heidelberg, 1982.

[YZ17] S. Yan and C. Zhang. Revisiting perceptron: Efficient and label-optimal learning of halfspaces. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 1056–1066, 2017.

[ZLC17] Y. Zhang, P. Liang, and M. Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 1980–2022, 2017.