NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:2659
Title:Distribution-Independent PAC Learning of Halfspaces with Massart Noise

Reviewer 1

The paper gives a PAC learning algorithm for the basic problem of halfspaces in a model of learning with noise. The algorithm uses ideas from previous related results in the simpler model of random classification noise, with important new ideas. Learning with noise is a basic topic in learning theory. It can be argued that the most studied models (random misclassification noise and malicious noise) are unrealistically benign (even though the related SQ model is very important) or malicious, and there is a great need for the study of more realistic models. The Massart noise model is a candidate for such a model. As positive learnability results in the general PAC model were not known for this kind of noise, the result of the present paper is quite significant. The algorithm is non-proper, with a kind of decision list as hypothesis. This is especially interesting, as this class of decision lists is a natural class, which have already been studied (called neural decision lists, linear decision lists and threshold decision lists, going back to the work of Marchand, Golea and Rujan 30 years ago). It would be useful to comment on the possibility of proper learning halfspaces in this model. Comments: It is mentioned that an equivalent notion called ``malicious misclassification noise'' was studied (also 30 years ago) by Sloan. An explanation (or at least a reference) should be given for the equivalence of the two definitions. Malicious misclassification noise seems to be an appropriate term fitting the general terminology, and it seems that instead of using two names for the equivalent notions, one should just use malicious misclassification noise, noting that the other one is an equivalent definition. Massart noise is also unjustified as the cited paper is due to Massart and Nedelec. A small additional point is that presumably there is some ``tameness'' assumption for the noise probability function $\eta(x)$ (or not? does this matter for the equivalence proof?). A related comment: the literature review does not distinguish between the real-valued and Boolean domains (for example, Daniely's negative result already holds for the Boolean case); some comments on that should be added. One more terminological remark: the basic PAC model is distribution-independent, and for ``PAC learning under a fixed distribution'' is used when the underlying distribution is fixed. Thus the term ``distribution-independent'' in the title seems redundant (of course it is important to emphasize this feature as opposed to results on ``tame'' distributions in the text).

Reviewer 2

This paper studies the problem of learning halfspaces under arbitrary data distributions and when label noise is present in the data. This problem has a rich history and the celebrated result of [BFKV'97] showed that there exists a polynomial time learning algorithm when the label noise is i.i.d., i.e., when each label is flipped independently with probability eta < 1/2. Essentially this is the only noise model for which we know distribution independent learning results. At the other extreme we have the agnostic learning model where we know that learning halfspaces under the uniform/log-concave distributions is easy and there is also evidence that agnostic learning under arbitrary distributions is hard. An intermediate noise model is the Massart/bounded noise model, where the label of each example x is flipped independently with probability p_x < eta < 1/2. Even for this model it has been a longstanding open problem as to whether one can design a learning algorithm that for any eps>0, achieves error OPT + eps, where OPT is the error of the best halfspace. Before this paper it was known how to achieve this only for uniform/log-concave distributions. The main result of the paper is that for learning halfspaces under arbitrary distributions under Massart noise, one can achieve error eta+ eps in polynomial time. This is a significant advance over the state of the art. The paper shows that one can construct a decision list of halfspaces to achieve this bound. The main insight in achieving this is Lemma 2.5 that states that under Massart noise, if data distribuition has some non-trivial margin, then by minimizing a convex proxy one will end up with a halfspce w that does well on a non-trivial amount of the data distribution. Furthermore, this space can be identified by simply thresholding |w.x| at a certain value T. Once this is proved, one immediately obtains learning algorithm for large margin distributions by simply repeating the process on the distribution that does not fall within the threshold. For the general case, one can use the idea of [BFKV] to preprocess the data so that a large fraction satisfies good margin and then apply lemma 2.5. I very much enjoyed reading the paper, it makes progress on a long standing open problem and will lead to further theoretical work in the area. This is a very strong submission and I absolutely recommend acceptance.

Reviewer 3

This paper provides an efficient algorithm for distribution-free PAC learning of halfspaces under Massart noise. This resolves a long-standing open question, at least for representations with a bounded bit-complexity. The paper is very well-written; the techniques are interesting and well explained, and the organization is good. In several points, there are sloppy statements which are incorrect as stated, though they all seem fixable. The detailed comments below list these issues. I request that the authors address these issues in their response, as well as fix the final version of this submission. The main observation exploited in this work is that while optimizing a convex surrogate cannot achieve the required error in this case (which the authors also prove), there does exist a convex surrogate that achieves a small error on some non-negligible region of the space. Repeatedly minimizing this convex surrogate on the remaining part of the space obtains a low error on the entire space, using improper learning with a decision list of half-spaces. The paper presents the solution to learning half-spaces with a margin and Massart noise, and later generalizes the solution to half-spaces without a margin, using the finite bit-complexity and an additional preprocessing step. The proofs in the body of the paper seem correct, except for some fixable issues listed below. This paper addresses an important and interesting question, and resolves it with an elegant and well-explained technique. Assuming all the small issues are fixed, I strongly recommend that this paper be accepted. Detailed comments: ~~~~~~~~~~~~~~~~~~ 1. The dependence of the solution for the zero-margin case in the bit-complexity of the representation is not revealed until section 1.1. Since the paper claims to resolve an open problem, it is important to discuss the relationship between the original statement of the open problem and the actual solution. 2. In Alg 1, there seems to be an assumption that the marginal distribution is completely available (line 3), otherwise an estimation process is needed here. Please explain. 3. Page 6, lines 249-252 ignore the fact that the guarantee of lemma 2.4 is on expectation only. This is then addressed in page 7, lines 276-279 but it is too late, as the paragraph on page 6 is incorrect as stated. Please unite these two paragraphs and put these on page 6. 4. Proof of lemma 2.5. page 7: the last display equation is incorrect. I believe there are several typos there. The conclusion in line 261 is correct though. Please fix this and explain your correction. 5. Page 7, lines 276-279: It is proposed to use Markov's inequality. Please comment on the boundedness conditions that allow you to do that. Also, Markov's inequality cannot actually get the same guarantee as that of the expectation, there will be some constant factor. Finally, when the procedure is repeated, the right w(i) needs to be selected by estimating L(w(i)) from samples. Please address this. 6. Page 7, line 280: (i) seems to require a small enough lambda. 7. Alg 1, line 7: unclear what is meant by "uniform over the samples". Please rephrase. 8. Page 3, line 101, it is not clear what "on the unit ball" refers to, though I assume this is an assumption on x. Please rephrase. 9. Page 4, line 185: theta is not used.