NeurIPS 2020

### Review 1

Summary and Contributions: The paper studies the robust training of neural networks in the presence of label noise. Since overparameterized neural networks can fit the noisy labels, naively training on corrupted data can lead to poor generalization. This paper exploits the bimodal nature of the Jacobian matrix of neural networks where the Jacobian matrix is approximately low rank with a few large singular values. Furthermore, the residual associated with the clean labels mostly lies in the subspace spanned by the singular vectors associated with the large singular values of the Jacobian matrix; and the residual corresponding to noisy labels mostly resides in the complement subspace. The paper proposes to sample a subset of training points in every iteration such that the Jacobian associated with the sampled points forms a low-rank approximation of the original Jacobian matrix. To mitigate the effect of the points with noisy labels selected in the subset, the paper proposes to utilize the mixup technique. The proposed method is validated on CIFAR-10 and CIFAR-100 with synthetically generated noisy labels and Webvision dataset with naturally occurring noisy labels. ############### Post author response phase ############### Thank you for addressing my comments, especially regarding the comparison with the prior work [19]. I have updated the score accordingly.

Strengths: There is significant scope for improvement in the presentation of the paper. There are numerous typos and many of the notations in the paper are not well defined. Please see the feedback section below. Despite the claim in Section 2 that the techniques in prior work are limited, the proofs in the paper heavily rely on the techniques developed in [19]. A more thorough discussion on the similarity and difference of the technical results and tools of [19] and this paper in the main text would help put the technical contribution of this paper in the right context. How would the proposed method behave in the settings with a very large number of labels (such as extreme classification) where class margin might not hold. Would utilizing square loss give good performance in such settings?

Weaknesses: There is significant scope for improvement in the presentation of the paper. There are are numerous typos and many of the notations in the paper are not well defined. Please see the feedback section below. Despite the claim in Section 2 that the techniques in prior work are limited, the proofs in the paper heavily rely on the techniques developed in [19]. A more thorough discussion on the similarity and difference of the technical results and tools of [19] and this paper in main text would help put the technical contribution of this paper in the right context. How would the propose method behave in the settings with very large number of labels (such as extreme classification) where class margin might not hold. Would utilizing square loss give good performance in such settings?

Correctness: As far as the reviewer can tell, the claims and methods are correct.

Clarity: The paper does a reasonable job of conveying the main ideas of the proposed methods. However, there is quite a bit of room for improvement in the presentation of the paper.

Relation to Prior Work: The paper presents a detailed account of the prior work on training in the presence of label noise. As mentioned in the weaknesses section above, there is some need for elaboration on the comparison with [19].

Reproducibility: Yes

Additional Feedback: Notation related issues and typos (not comprehensive). 1. It's not clear what the dimension of J(W, X_S) is. Eq. (5) gives the impression that it should be n x m, because of the term || J(W, X) - J(W, X_S)||. However, line 160 and Proof of Corollary B.3 make sense when it is k x m. 2. \mathcal{S}_+ and \mathcal{S}_{-} are not defined. Are these subspaces? 3. Notation \sigma_i(J(..), \mathcal{S}_{+}) is not defined. 4. Both W^{\tau} (superscript) and W_{\tau} (subscript) are used as iterates. Please consistently use the same notation. 5. In Algorithm 1 (step 6), $k/n_c$-medoids --> $kn_c$ medoids? Accordingly, fix Section 4.3. 6. In Broader Impact sections, please fix lines 325--330. 7. In line 479--480, should $f$ be replaced with $\mathcal{L}$? Other minor comments/questions: 1. In Table 2, what is the difference between 'coreset w/ label' and 'coreset w/ pred.'? 2. What is the utility of line 490 - 491 and Eq. (16). Similar to this many sections of the appendix lack coherent structure. 3. In practice, one typically employs SGD with much smaller mini-batches compared to the full dataset. Could the authors comment on the relevance of their method in such settings?

### Review 2

Summary and Contributions: This paper proposes CRUST, a method to train neural networks with noisy labels. This improves on previous methods by being computationally simple and having the added benefit of theoretical performance guarantees. The main idea is to approximate the Jacobian of the paramaters wrt data with a low rank approximation by solving a k-medoids problem. This leads to several improvements over state of the art on symmetric/asymmetric synthetic noise, as well as a real-world noisy dataset (WebVision)

Strengths: Relevance/Significance: This paper provides theoretical guarantees in an area where recent progress has been largely empirical. Additionally, the method is simple and quite efficient, for example not requiring an auxiliary model Novelty: builds on previous work [26] in a novel way for a distinct application area Experiments: Ablation on most components of the algorithm

Weaknesses: Experiments: In addition to ablation on mixup and coresets with label vs prediction, it would be good to compare the method with a standard low rank approximation in place of the coverage objective (e.g. QR decomposition or the references below)

Correctness: Proofs appear to be correct

Clarity: The paper flows well and is generally a pleasure to read. However, the authors should clarify they propose a theoretical argument (not a rigorous proof such as Theorem 4.1) that mixup improves coreset quality and therefore leads to better generalization Typo: "point to its closes medoid" should be "closest medoid"

Relation to Prior Work: This paper has extensive discussion of training neural networks subject to label noise, but there is less mention of work on low rank matrix approximation/column subset selection e.g. [1-3]. Additionally, the authors should discuss similarity to [4] which in some settings (CIFAR-10, WebVisioin) achieves similar error reductions without mixup while meeting several of the desiderata proposed here. [1] Cohen et al. Input Sparsity Time Low-Rank Approximation via Ridge Leverage Score Sampling. SODA 2017 [2] Altschuler et al. Greedy Column Subset Selection: New Bounds and Distributed Algorithms. ICML 2016 [3] Khanna et al. On Approximation Guarantees for Greedy Low Rank Optimization. ICML 2017 [4] Pleiss et al. Identifying Mislabeled Data using the Area Under the Margin Ranking. https://arxiv.org/abs/2001.10528

Reproducibility: Yes

Additional Feedback: I think this is paper presents a good, novel idea in a relevant area. However, there are several questions wrt experimental setup and related work which should be resolved Questions: - Why not minimize the rank directly, or compare to other approximate low rank objectives mentioned above? - Does the proposed method scale to TinyImageNet or the full WebvVision/ImageNet datasets? - How does solving (5) compare to using the best low-rank subspace or best subset of columns? - If k is too large, will CRUST overfit to noisy labels? - How do the running times of proposed algorithms compare, and is there significant speedup from training on a subset of the data? - Do the 30-70% of data removed during selection correspond to data corrupted by noisy labels? It would be good to analyze how clean the coreset is EDIT: After reading the author feedback I am keeping my review the same

### Review 3

Summary and Contributions: This paper addresses the problem of learning with label noise. The authors propose to iteratively select clean data points that provide an approximately low-rank Jacobian matrix, which avoids overfitting caused by label noise. The proposed method is strongly theoretically grounded and makes a significant contribution to label noise learning community. Experimental results on multiple benchmark datasets show that CRUST outperforms the state-of-the-art methods. **After reading author response** I have read the rebuttal carefully, and I will not change my score.

Strengths: 1. The idea is well motivated and clearly presented. 2. The approach has detailed and strong theoretical guarantees for coping with label noise. 3. The results on synthetic and real-world datasets demonstrate a clear improvement on the SoA. The implementation details of the proposed method are also provided, which will be very helpful in reproducing the reported results.

Weaknesses: 1. Compared with strong theoretical analysis, the experimental analysis is lacking slightly. I hope that the authors can add some baselines ([1], [2], [3]) to further verify the effectiveness of the method experimentally. 2. Noise level is low. I wonder what the performance would be in extreme cases with even more label noise. [1] Yilun Xu et al. L_DMI: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise. NeurIPS 2019. [2] Xiaobo Xia et al. Are Anchor Points Really Indispensable in Label-Noise Learning. NeurIPS 2019. [3] Daiki Tanaka et al. Joint optimization framework for learning with noisy labels. CVPR 2018.

Correctness: Yes, the claims, method and empirical methodology are correct.

Clarity: Yes, the paper is well written.

Relation to Prior Work: Yes, the paper is clearly discussed how this work differs from previous contributions.

Reproducibility: Yes

### Review 4

Summary and Contributions: The authors proposed a novel approach with strong theoretical guarantees for robust training of neural networks trained with noisy labels. The authors focused on selecting subsets of clean data that provide an approximately low-rank Jacobian matrix. The authors proved that gradient descent applied to the subsets cannot overfit the noisy labels, without regularization or early stopping.

Strengths: - The idea to leverage the low-rank Jacobian matrix to alleviate label noise seems novel. - The authors provided theoretical proof of the advantages of the proposed method. - The authors provided an ablation study to demonstrate the effectiveness of each component and achieved state-of-the-art performance with the final method. - The proposed method, CRUST, improves the image classification performance noticeably.

Weaknesses: - It would be better to evaluate the proposed model on other real-world datasets with noisy labels such as the Clothing1M dataset [1]. [1] Tong Xiao et al., Learning from massive noisy labeled data for image classification. CVPR 2015

Correctness: It seems the claims, method, and empirical methodology are correct.

Clarity: The paper is well written and the motivation for the proposed framework is clear to understand.

Relation to Prior Work: It seems the paper clearly referred and discussed the difference from previous methods.

Reproducibility: Yes

Additional Feedback: I have no other comments to say. ######################### Comments after rebuttal After reading the author's responses and the other reviewers' comments, I decided to keep my decision for accept.