NeurIPS 2020

Early-Learning Regularization Prevents Memorization of Noisy Labels

Meta Review

The paper studies the following interesting phenomenon (observed in the previous literature): when trained on a dataset with incorrectly labeled points (i.e. "label noise"), DNNs first learn the benign ("correctly labeled") points, and only once this is done do they start "memorizing" the noisy points. It was previously shown empirically in the literature that the second, "memorization" phase hurts generalization. The authors make two contributions. (Contribution 1) They demonstrate, empirically and theoretically, that a similar phenomenon can be observed in the simpler setting of over-parametrized (dimensionality ~ number of points) linear two-class logistic regression, where the class distributions are isotropic Gaussians with fixed means $\pm\mu$ and vanishing variance (see Theorem 1 and Figure A.1). (Contribution 2) Motivated by the theory of Contribution 1, the authors propose a novel regularizer (see Eq. 6). When used in vanilla DNN training with the cross-entropy loss, this regularizer successfully prevents the network from falling into the "memorization" phase (as evidenced by Figure 1).

All the reviewers agree that the topic and focus of this paper are very timely. Questions related to DNNs and "memorization" have surfaced recently in many works, both theoretical and applied. Apart from providing theoretical insight into the empirical phenomenon (Contribution 1), this paper is, to the best of our knowledge, one of the first works that proposes to exploit this phenomenon explicitly during DNN training. The proposed method is a very reasonable first attempt in this direction ("the DNN makes good decisions at the beginning of training, before it starts memorizing, so let's anchor its decisions in the later stages to the ones it made earlier"). While the proposed strategy has many moving parts (when exactly does the DNN switch from the first phase to the second? how exactly should we bias the later stages, i.e. how should we choose the "targets" in Eq. 6?), the authors present several particular design choices that empirically lead to strong results. Even though the proposed method does not outperform SOTA results across the board, it is competitive while requiring far fewer tricks than other methods. Finally, even though the theory (Theorem 1) holds only for the particular case of linear two-class logistic regression, it motivates the choice of a regularizer that, in turn, performs well in realistic practical settings. This shows that the paper contains insights that will likely be useful for future research in this direction.

However, the reviewers had several concerns. The following three are among the most important ones to address: (1) Rev#2 points out that Theorem 1 essentially assumes that the classes are supported on tiny compact subsets very well separated from each other. In the rebuttal, the authors claim that "when $p$ is large, $\sigma \sqrt{p} \gg 2$, so the spheres are huge". Some of the reviewers (myself included) were not convinced by this argument. The text of the paper never makes explicit what the assumptions mean exactly when they say "variance is sufficiently small". It may be that a relation between the dimensionality $p$ and the variance $\sigma$ is implicitly assumed in the proofs that forbids $\sigma \sqrt{p}$ from becoming large (as in the argument from the rebuttal). (2) For instance, in Eq. 11 of the supplementary we see that $\sigma \|\theta_0 - \theta_t\|$ is required to be close to zero, yet the authors do not prove formally that this is possible. This step should be made precise! One naive way of doing so that comes to my mind is to show that $\|\theta_0 - \theta_t\| \sim O(\sqrt{p})$ and set $\sigma \sim o(1/\sqrt{p})$; however, this would contradict the $\sigma \sqrt{p} \gg 2$ claim from the rebuttal. (3) More discussion of the hyperparameter selection for ELR and ELR+ is required.
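The high-level recipe ("anchor later-stage decisions to the model's own early-stage predictions") can be sketched as follows. This is a minimal illustration in the spirit of the targets in Eq. 6, not the authors' exact formulation: the running-average momentum `beta`, the strength `lam`, and the penalty form $\log(1 - \langle t, p\rangle)$ are illustrative assumptions here.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class EarlyLearningRegularizer:
    """Running-average targets t_i accumulate the model's (mostly correct)
    early predictions; the penalty rewards agreement of the current
    prediction p_i with t_i, discouraging later drift toward noisy labels."""

    def __init__(self, n_samples, n_classes, beta=0.7, lam=3.0):
        # beta: momentum of the temporal target average (illustrative value)
        # lam:  regularization strength (illustrative value)
        self.targets = np.full((n_samples, n_classes), 1.0 / n_classes)
        self.beta, self.lam = beta, lam

    def penalty(self, idx, logits):
        p = softmax(logits)
        # temporal ensembling: update the per-sample targets first
        self.targets[idx] = self.beta * self.targets[idx] + (1 - self.beta) * p
        inner = np.clip((self.targets[idx] * p).sum(axis=1), 0.0, 1.0 - 1e-8)
        # log(1 - <t, p>) decreases as predictions align with the accumulated
        # early targets, so adding this term to the cross-entropy pulls the
        # model back toward its own early-stage decisions
        return self.lam * np.log(1.0 - inner).mean()
```

A training loop would add `penalty(batch_indices, batch_logits)` to the cross-entropy loss for each minibatch; the choice of `beta` and `lam` is exactly the kind of hyperparameter selection that Concern 3 asks the authors to discuss.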
Concerns 1 and 2 sketched above are not critical, in the sense that they do not directly affect the proposed method (Contribution 2). However, they need to be carefully addressed in order to salvage Contribution 1. In particular, the authors are required to state the *exact* assumptions on $\sigma$ explicitly in the text of Theorem 1. Most likely (if my guess from Concern 2 above is correct), the authors will have to write something like $\sigma \sqrt{p} \sim o(1)$, in which case Rev#2's original intuition will be correct and the result indeed holds only if the classes are far away from each other and consist of almost identical points. I think this fact will not diminish Contribution 1: after all, (i) it still provides a simple setting where the phenomenon can be observed, and (ii) it still provides a reasonable motivation for the proposed regularizer. I am tending towards acceptance, provided the authors address all the concerns summarized in this meta-review (as well as the other minor concerns mentioned by the reviewers).
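For concreteness, the linear setting of Contribution 1 can be probed with a minimal simulation. The constants below (dimension, label-noise fraction, $\sigma$) are illustrative choices rather than the regime of Theorem 1 (where $\sigma$ is taken vanishingly small); the point is only that plain gradient descent on over-parametrized logistic regression first fits the correctly labeled points and only later memorizes the mislabeled ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants: over-parametrized regime, p > n
n, p, sigma, noise_frac = 100, 200, 0.1, 0.2

mu = np.zeros(p)
mu[0] = 1.0                                  # class means are +/- mu

y_true = rng.integers(0, 2, n) * 2 - 1       # true labels in {-1, +1}
X = y_true[:, None] * mu + sigma * rng.standard_normal((n, p))
flipped = rng.random(n) < noise_frac         # mislabel a fraction of points
y = np.where(flipped, -y_true, y_true)       # observed (noisy) labels

theta = np.zeros(p)
lr = 0.5
fit_clean, fit_noisy = [], []
for step in range(3000):
    margins = y * (X @ theta)
    # gradient of the logistic loss (1/n) sum_i log(1 + exp(-margin_i));
    # margins are clipped only to avoid overflow in exp
    sig = 1.0 / (1.0 + np.exp(np.clip(margins, -60.0, 60.0)))
    grad = -(X * (y * sig)[:, None]).mean(axis=0)
    theta -= lr * grad
    fits = margins > 0                       # does the model fit the GIVEN label?
    fit_clean.append(fits[~flipped].mean())
    fit_noisy.append(fits[flipped].mean())

# Early on, the clean points are fit while the mislabeled ones are not;
# as training continues, the noisy labels are gradually memorized.
```

Varying `sigma` relative to $1/\sqrt{p}$ in such a simulation is also a cheap way to probe exactly the scaling question raised in Concerns 1 and 2.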