{"title": "Combinatorial Inference against Label Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 1173, "page_last": 1183, "abstract": "Label noise is one of the critical sources that degrade generalization performance of deep neural networks significantly. To handle the label noise issue in a principled way, we propose a unique classification framework of constructing multiple models in heterogeneous coarse-grained meta-class spaces and making joint inference of the trained models for the final predictions in the original (base) class space. Our approach reduces noise level by simply constructing meta-classes and improves accuracy via combinatorial inferences over multiple constituent classifiers. Since the proposed framework has distinct and complementary properties for the given problem, we can even incorporate additional off-the-shelf learning algorithms to improve accuracy further. We also introduce techniques to organize multiple heterogeneous meta-class sets using $k$-means clustering and identify a desirable subset leading to learn compact models. Our extensive experiments demonstrate outstanding performance in terms of accuracy and efficiency compared to the state-of-the-art methods under various synthetic noise configurations and in a real-world noisy dataset.", "full_text": "Combinatorial Inference against Label Noise\n\nPaul Hongsuck Seo\u2020\u2021\n\u2020Computer Vision Lab.\n\nPOSTECH\n\nhsseo@postech.ac.kr\n\nGeeho Kim\u2021\n\nBohyung Han\u2021\n\n\u2021Computer Vision Lab. & ASRI\n\nSeoul National University\n\n{snow1234, bhhan}@snu.ac.kr\n\nAbstract\n\nLabel noise is one of the critical sources that degrade generalization performance of\ndeep neural networks signi\ufb01cantly. 
To handle the label noise issue in a principled\nway, we propose a unique classi\ufb01cation framework of constructing multiple models\nin heterogeneous coarse-grained meta-class spaces and making joint inference of\nthe trained models for the \ufb01nal predictions in the original (base) class space. Our\napproach reduces noise level by simply constructing meta-classes and improves\naccuracy via combinatorial inferences over multiple constituent classi\ufb01ers. Since\nthe proposed framework has distinct and complementary properties for the given\nproblem, we can even incorporate additional off-the-shelf learning algorithms\nto improve accuracy further. We also introduce techniques to organize multiple\nheterogeneous meta-class sets using k-means clustering and identify a desirable\nsubset leading to learn compact models. Our extensive experiments demonstrate\noutstanding performance in terms of accuracy and ef\ufb01ciency compared to the state-\nof-the-art methods under various synthetic noise con\ufb01gurations and in a real-world\nnoisy dataset.\n\n1\n\nIntroduction\n\nConstruction of a large-scale dataset is labor-intensive and time-consuming, which makes it inevitable\nto introduce a substantial level of label noise and inconsistency. This issue is aggravated if the\ndata collection relies on crowd-sourcing [1, 2] or internet search engines [3, 4] without proper\ncuration. More importantly, many real-world problems inherently involve a signi\ufb01cant amount of\nnoise and it is extremely important to train machine learning models that can handle such a challenge\neffectively. Figure 1 presents several noisy examples in WebVision benchmark [4], where training\nexamples are collected from Google and Flickr by feeding queries corresponding to the ImageNet\nclass labels. Although a moderate level of noise is sometimes useful for regularization, label noise is\na critical source of under\ufb01tting or over\ufb01tting. 
In the case of deep neural networks, models can easily\nmemorize a large number of noisy labels and, consequently, are prone to degrade their generalization\nperformance [5].\nTo tackle various kinds of label noise, existing mainstream approaches either attempt to \ufb01lter noisy\nexamples out [6\u201311] or correct noisy labels based on the network predictions [12\u201315]. However,\nthese methods are similar to solving the chicken-and-egg problem and their unstable on-the-\ufb02y noise\nrecognition process may result in poor generalization performance. On the other hand, [16\u201321]\nassume the correct labels to be latent and learn the networks inferring the latent correct labels by\nestimating noise transition matrices. Although these methods estimate the noise injection process\ndirectly, they are also suboptimal when the network is capable of adapting to label noise as discussed\nin [10].\nWhile the prior approaches typically focus on developing noise-resistant training algorithms given\nnoise levels, our algorithm takes a totally different perspective of reducing noise level and learning a\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Examples of 259-pomeranian class in WebVision [4]. In addition to clean samples, the\ndataset contains closed- and open-set noises, where the examples with closed-set noise are mislabeled\nwith known classes while the ones with open-set noise are associated with unknown class labels.\n\nrepresentation robust to label noise. In our framework, we automatically generate multiple coarse-\ngrained meta-class sets, each of which constructs a heterogeneous partition of the original class set.\nThen, we train classi\ufb01ers on individual meta-class sets and make the \ufb01nal prediction using an output\ncombination of the classi\ufb01ers. 
Note that the combination process allows us to uniquely identify the original classes despite the coarse representation learning on meta-class spaces. Learning on meta-class spaces actually reduces the level of label noise because multiple classes in an original class space collapse to a single meta-class, and the label noise within the same meta-class becomes invisible on the meta-class space.
The contribution of this paper is three-fold: (1) we successfully reduce the amount of label noise by constructing meta-classes of multiple base classes; (2) we propose a novel combinatorial classification framework, where inference on the original class space is given by combining the predictions on multiple meta-class spaces; (3) we demonstrate the robustness of the proposed method through extensive controlled experiments as well as the evaluation on a real-world dataset with label noise.
The rest of this paper is organized as follows. Section 2 reviews previous approaches against datasets with label noise and other related techniques. Then, we formally describe the proposed combinatorial classification method and demonstrate its effectiveness in Sections 3 and 4, respectively. Finally, we conclude our paper in Section 5.

2 Related Work

One common approach to learning with noisy data is to correct or filter out noisy examples during training [6-15, 22-25]. Existing methods adopt their own criteria to identify the noisy samples. Several techniques employ the confidence scores of models as the signal of noise [11-14], while [8] incorporates a contrastive loss term to iteratively identify noisy samples. Deep bilevel learning [9] attempts to find reliable mini-batches based on the distances between the mini-batches in the training and validation datasets. Multiple networks have often been adopted to identify noisy examples. 
For example, two networks with an identical architecture are jointly trained to identify noisy samples in each batch [6, 11], whereas a separate teacher network is employed to select samples for training a student network. Contrary to the approaches making hard decisions on noisy sample selection, there are a handful of algorithms relying on the soft penalization of potentially noisy examples by designing noise-robust loss functions [10, 23], using knowledge distillation [24], and adding regularizers [22]. Although these methods are often motivated by an intuitive understanding of classification models, their ad-hoc procedures often lack theoretical support and hamper reproducibility.
Another line of methods estimates a noise transition matrix capturing transition probabilities from correct labels to corrupted ones [16-21]. Some of them [16-18] adopt the standard backpropagation to estimate the transition matrix and train the network simultaneously, while a pretrained network is often used for the transition matrix estimation [19]. To improve the quality of the estimated transition matrices, additional clean data [21] or manually defined constraints [20] are sometimes integrated during the matrix estimation process.
Although all these existing approaches cover various aspects of training with noisy data, they typically assume that the noise level of a dataset is irrevocable and therefore focus on developing algorithms that

Figure 2: Motivation and concept of combinatorial classification. (left) Empirical noise-level reduction by use of meta-class labels on CUB-200 with closed-set noise; the noise rates in the meta-class level show the average of all meta-class sets. (right) Illustration of combinatorial classification with two binary meta-class sets on four original classes. 
By combining the coarse-grained meta-classes, it is possible to predict fine-grained original class labels.

avoid overfitting to a noisy training set by identifying noisy examples or modeling the noise distribution. In contrast, we propose a novel output representation method that directly reduces the noise level of a given dataset, and a model that predicts class labels based on the proposed representations.
The proposed combinatorial classification solves the target problem by combining solutions of multiple subproblems encoded by class codewords, and there are several related methods in this aspect. Product quantization [27, 28] measures distances in multiple quantized subspaces and combines them to perform the approximate nearest neighbor search in the original space. A recently proposed image geolocalization technique achieves fine-grained quantization by combining multiple coarse-grained classifications [29], while a similar approach is proposed for metric learning in retrieval tasks [30]. Unlike these works targeting regression or retrieval tasks on continuous spaces, our approach deals with a classification problem on a discrete output space. Ensemble methods [31-37] also share a similar concept with our algorithm but are different in the sense that their constituent models share a common output space. One of the most closely related works is classification by error-correcting output codes [26]. This technique combines the results of binary classifiers to solve multi-class classification problems and proposes deterministic processes to generate and predict the binarized codewords based on Hamming distance. 
In contrast, we generate codewords by exploiting the semantics of the original classes and combine the predicted scores to construct a compositional classifier robust to label noise.

3 Combinatorial Classification

3.1 Class Codewords

As in ordinary classification, our goal is to predict a class label y ∈ C given an input x, where C = {c_1, ..., c_K} is a set of K disjoint classes. Unlike conventional classification approaches that directly predict the output class y, our model estimates y by predicting its corresponding unique codeword. To construct the class codewords, we define M meta-class sets, each of which is given by a unique partitioning of C. Specifically, each meta-class set denoted by C_m (m = 1, ..., M) has K′ (≪ K) meta-classes, i.e., C_m = {c^m_1, ..., c^m_{K′}}, where multiple original classes are merged into a single meta-class, which results in a coarse-grained class definition. Then, each class c_i is represented by an M-ary codeword c^1_{i1} c^2_{i2} ... c^M_{iM}, where c^m_{im} corresponds to the meta-class to which c_i belongs in the meta-class set C_m.
When training data have label noise, classification on a coarse-grained meta-class set naturally reduces the noise level of the dataset. Formally, let η(D̂) be the noise level of a dataset D̂ = {(x_i, ŷ_i)}^N_{i=1}, which is given by

  η(D̂) = E_D̂[1(y ≠ ŷ)] = (1/N) Σ^N_{i=1} 1(y_i ≠ ŷ_i),   (1)

where ŷ_i means a label potentially corrupted from a clean label y_i and 1 is an indicator function. Suppose two examples x_i and x_j belong to the same class but the label of x_j is corrupted from y_j (= y_i) to ŷ_j; the two classes corresponding to y_i and ŷ_j can be merged into the same meta-class, which removes the label noise at the meta-class level. 
Consequently, the noise level with the meta-class representations is lower than that with the original class space, i.e., η(D̂_m) ≤ η(D̂), where D̂_m = {(x_i, ŷ^m_i)}^N_{i=1} is the dataset associated with meta-class labels in C_m. In Figure 2(left), we make empirical observations of the noise-level reduction on CUB-200 with two different noise injection schemes¹. The noise levels are significantly reduced regardless of noise type by converting the original class spaces into meta-class representations.
Although a coarse-grained meta-class representation reduces noise level, it is not capable of distinguishing the base classes in the original class space C. We resolve this limitation by introducing multiple heterogeneous meta-class sets and exploiting their compositions. Even if multiple classes are collapsed to a single meta-class within a meta-class set, it is possible to provide a unique class codeword to each of the original classes by using a sufficiently large number of meta-class sets. In this way, we convert noisy labels in D̂ to partially noisy codewords.

3.2 Classification with Class Codewords

Given the noise-robust class codewords defined above, we now discuss the classification method that predicts the class codewords to identify the class label y. Unlike ordinary classifiers directly predicting the class label on C, we construct M constituent classifiers, each of which estimates a distribution on a meta-class set C_m (m = 1, ..., M), and combine their predictions to obtain class labels on C. This process is referred to as combinatorial classification and is illustrated in Figure 2(right).

Inference   A constituent classifier estimates the conditional distribution P(c^m_k | x) on a meta-class set C_m. 
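The noise-level reduction described in Section 3.1 can be sketched in a few lines. This is a hypothetical toy example (the labels and the partition are made up, not taken from the paper's experiments):

```python
def noise_level(clean, noisy):
    """Eq. (1): fraction of examples whose observed label differs from the clean one."""
    return sum(y != y_hat for y, y_hat in zip(clean, noisy)) / len(clean)

def to_meta(labels, partition):
    """Map base-class labels to meta-class labels; `partition` maps class -> meta-class."""
    return [partition[y] for y in labels]

# Hypothetical toy setup: K = 4 base classes collapsed into K' = 2 meta-classes.
partition = {0: 0, 1: 0, 2: 1, 3: 1}
clean = [0, 1, 2, 3, 0, 1, 2, 3]
noisy = [1, 1, 3, 3, 0, 2, 2, 0]  # half of the labels are corrupted

base_noise = noise_level(clean, noisy)                # 0.5 in the base class space
meta_noise = noise_level(to_meta(clean, partition),
                         to_meta(noisy, partition))   # 0.25 in the meta-class space
assert meta_noise <= base_noise  # corruption inside a meta-class becomes invisible
```

A flip such as 0 → 1 stays inside meta-class 0 and therefore no longer counts as noise, which is exactly the effect measured in Figure 2(left).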
Given the M constituent classifiers, we obtain the conditional probability of c_k ∈ C by combining the predictions of the constituent classifiers as follows:

  P(c_k | x) = Π^M_{m=1} P(meta(c_k; m) | x) / Σ^K_{j=1} Π^M_{m=1} P(meta(c_j; m) | x),   (2)

where meta(c_k; m) returns the meta-class label containing the base class c_k in the m-th meta-class set. The denominator in Eq. (2) is the normalization term ensuring Σ^K_{k=1} P(c_k | x) = 1.

Training   We train our model by minimizing the sum of negative log-likelihoods with respect to the ground-truth meta-class labels meta(y; m) that contain the ground-truth label y ∈ C, i.e.,

  −Σ^M_{m=1} log P(meta(y; m) | x).   (3)

This objective encourages the constituent classifiers to maximize the prediction scores of the true meta-classes. Our algorithm employs the objective in Eq. (3) even though the following objective, minimizing the negative log-likelihood of the ground-truth class label y, is also a reasonable option:

  −log P(y | x) = −Σ^M_{m=1} log P(meta(y; m) | x) + log Σ^K_{k=1} Π^M_{m=1} P(meta(c_k; m) | x).   (4)

Although this objective is directly related to the inference procedure in Eq. (2), it turns out to be ineffective. Note that the second term on the right-hand side of this equation corresponds to the denominator in Eq. 
(2), and penalizes the scores of the classes other than the true one, i.e., C \ {y}. Since a ground-truth meta-class may contain non-ground-truth original class labels, the penalty given to these non-ground-truth class labels can be propagated to the ground-truth meta-classes. Consequently, the optimization of each constituent classifier becomes more challenging.

Deep combinatorial classifier   We implement our model using a deep neural network with a shared feature extractor and M parallel branches corresponding to the individual constituent classifiers. Since the shared feature extractor receives the supervisory signal from all the M classifiers, we scale down the gradients of the shared feature extractor by a factor of M for backpropagation. Note that our approach uses exactly the same number of parameters as a flat classifier in the feature extractor, and its total model size is rather smaller even with multiple network branches. This is mainly because the number of meta-classes is much smaller than the number of base classes (K′ ≪ K) and, consequently, each classifier requires fewer parameters.

¹Refer to Section 4.2 for noise injection configurations.

3.3 Configuring Meta-class Sets

To implement our method, one needs to define heterogeneous meta-class sets. A naive approach for determining a meta-class set configuration is to randomly assign each class to one of the meta-classes in a meta-class set. However, this method may result in large intra-class variations by grouping base classes without common properties. We instead sample M meta-class sets by running the k-means clustering algorithm with random seeds. Since the clustering algorithm often results in redundant meta-class sets despite the random seeds, we diversify the clustering results by randomly sampling Q-dimensional axis-aligned subspaces of the class representation vectors. 
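This subspace-sampling-plus-clustering step can be sketched as follows. The embeddings here are hypothetical random vectors and the sizes (K = 20, M = 8, K′ = 2, Q = 8) are made up for illustration; a plain Lloyd-iteration k-means stands in for whatever implementation the authors used:

```python
import numpy as np

def make_meta_class_set(class_emb, n_meta, q_dims, rng, n_iter=20):
    """Cluster base classes into K' meta-classes on a random Q-dimensional
    axis-aligned subspace of the class embeddings (a sketch of Sec. 3.3)."""
    K, D = class_emb.shape
    dims = rng.choice(D, size=q_dims, replace=False)       # random axis-aligned subspace
    X = class_emb[:, dims]
    centers = X[rng.choice(K, size=n_meta, replace=False)] # random seeds
    for _ in range(n_iter):                                # plain Lloyd iterations
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_meta):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(0)
    return assign                                          # meta-class id per base class

rng = np.random.default_rng(0)
class_emb = rng.normal(size=(20, 16))   # hypothetical embeddings for K = 20 classes
partitions = [make_meta_class_set(class_emb, n_meta=2, q_dims=8, rng=rng)
              for _ in range(8)]        # M = 8 heterogeneous binary meta-class sets

# Each base class gets an M-ary codeword: its meta-class id in every set.
codewords = {tuple(p[c] for p in partitions) for c in range(20)}
```

Because each call draws a different subspace and different seeds, the resulting partitions are heterogeneous, which is what makes the codewords informative.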
We obtain the class embeddings from the weights of the classification layer in a convolutional neural network, which is fine-tuned using noisy labels in the original class space.
While the clustering-based method is sufficiently good at mining reasonable meta-class sets, we can further optimize the configurations of meta-class sets by searching for their combinations. To achieve this, we oversample candidate meta-class sets and search for the optimal subset using a search agent trained by reinforcement learning. Given the set of all candidate meta-class sets P = {C_m}^M_{m=1}, the search agent predicts a probability distribution over a binary selection variable u_m for each candidate meta-class set. We train the agent by a policy gradient method, specifically the REINFORCE rule [38], and iteratively update the parameters by the empirical approximation of the expected policy gradient, which is given by

  ∇_θ J(θ) = (1/S) Σ^S_{s=1} Σ^M_{m=1} ∇_θ log P(u^{(s)}_m; θ) (R^{(s)} − B),   (5)

where S is the number of sampled meta-class set combinations², θ is a model parameter of the search agent, R^{(s)} is the reward obtained by the s-th sample, and B is the baseline injected to reduce the variance.
Our main goal is to select the optimal collection of meta-class sets in terms of accuracy on the validation dataset, but we employ in-batch validation accuracy as the primary reward for training efficiency. In addition, we encourage the number of selected meta-class sets to be small in each combination by providing a negative reward in proportion to its size. Then, the total reward for the selection is given by

  R = R_acc − α Σ^M_{i=1} u_i,   (6)

where R_acc is the in-batch validation accuracy and α is a hyper-parameter balancing the two terms. The baseline is set to the average reward of the batch samples, i.e., B = (1/S) Σ_s R^{(s)}. 
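A minimal numerical sketch of the REINFORCE update in Eqs. (5) and (6) follows. The reward function is a made-up stand-in (the real one is the in-batch validation accuracy minus the size penalty), and the policy is a simple independent-Bernoulli policy rather than the two-layer perceptron agent:

```python
import numpy as np

rng = np.random.default_rng(0)
M, S, alpha, lr = 6, 100, 0.01, 0.5
theta = np.zeros(M)  # logits of the selection policy P(u_m; theta), one per candidate set

def reward(u):
    """Hypothetical stand-in for Eq. (6): an accuracy term minus the size penalty.
    Here the first three meta-class sets are 'useful': selecting them adds accuracy."""
    acc = 0.2 * u[:3].sum() / 3.0
    return acc - alpha * u.sum()

for _ in range(200):  # REINFORCE updates approximating Eq. (5)
    p = 1.0 / (1.0 + np.exp(-theta))              # selection probabilities
    u = (rng.random((S, M)) < p).astype(float)    # S sampled combinations u^(s)
    R = np.array([reward(u_s) for u_s in u])
    B = R.mean()                                  # average-reward baseline
    grad = ((R - B)[:, None] * (u - p)).mean(0)   # d/dtheta log P(u; theta) = u - p
    theta += lr * grad                            # gradient ascent on J(theta)

p = 1.0 / (1.0 + np.exp(-theta))
# The policy learns to keep the useful sets and drop the penalized, useless ones.
assert p[:3].min() > p[3:].max()
```

The baseline subtraction only recenters the reward; it leaves the gradient estimate unbiased while reducing its variance, which is why training remains stable with the small in-batch reward.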
Every T epochs during training, we evaluate all the S samples on the entire validation set and store the meta-class set combination with the highest accuracy. At the end of the training, the agent returns the best meta-class set combination among the stored ones. According to our empirical observations, the search cost by RL is just as much as the cost for training a classifier. Note that we employ a simple two-layer perceptron as the search agent, which is also helpful to reduce the computational complexity together with the in-batch validation strategy described above.

3.4 Discussions

In addition to the benefit of noise-level reduction, the coarse-grained meta-class representation also brings several desirable characteristics. The meta-classes naturally introduce inter-class relationships to the model and lead to better generalization performance by grouping multiple classes that potentially share information. Moreover, representation learning based on meta-classes makes the trained model more robust to data deficiency since each coarse-grained meta-class obviously contains more training examples than the original classes. As multiple meta-class sets construct a large number of class codewords by their Cartesian product, a small number of constituent classifiers is sufficient to recover the original class set and the proposed method can reduce the number of parameters. 
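The codeword-based recovery of base classes via Eq. (2) can be made concrete with a small sketch. The meta-class probabilities below are hypothetical numbers, and the helper names are ours, not the paper's:

```python
import numpy as np

def combinatorial_inference(meta_probs, meta_of):
    """Eq. (2): combine M constituent predictions into base-class posteriors.
    meta_probs: list of M arrays, each a distribution over one meta-class set.
    meta_of[m][k]: meta-class containing base class k in the m-th set."""
    K = len(meta_of[0])
    scores = np.ones(K)
    for m, probs in enumerate(meta_probs):
        scores *= probs[meta_of[m]]   # product over constituent classifiers
    return scores / scores.sum()      # normalize over base classes

# Hypothetical example: K = 4 base classes, two binary meta-class sets whose
# codewords (00, 01, 10, 11) uniquely identify each base class.
meta_of = [np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])]
meta_probs = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]

posterior = combinatorial_inference(meta_probs, meta_of)
# Class 1 has codeword (0, 1), matching both confident meta-predictions.
assert posterior.argmax() == 1
```

With two binary sets the Cartesian product already yields 2 × 2 = 4 codewords, which is why four base classes are recoverable from two coarse classifiers.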
Finally, since the proposed method utilizes multiple constituent classifiers, it brings some ensemble effects and leads to accuracy improvement.

²Each combination is composed of the meta-class sets, and each of the meta-class sets is selected by u_m.

(a) Standard classifier   (b) Combinatorial classifiers with one, two and three meta-class sets

Figure 3: Sample results from (a) a standard classifier and (b) combinatorial classifiers on the examples from a 2D Gaussian mixture model with five components. The accuracy is shown at the top-right corner in each case. For the combinatorial classifiers, we gradually add meta-classifiers one-by-one. Gray dots correspond to noisy samples with random labels while purple dashed and black solid lines represent decision boundaries on clean and noisy datasets, respectively.

(a) Open-set: uniform   (b) Open-set: nearest   (c) Closed-set: uniform   (d) Closed-set: nearest

Figure 4: Example noise transition matrices with eight output classes. For open-set noise, four classes are used as target classes while the remaining four classes are reserved for noisy labels.

4 Experiments

4.1 Experiments on a Toy Set

To illustrate and visualize the effectiveness of the proposed method, we build a toy set that contains 150 clean samples drawn from a two-dimensional Gaussian mixture model with five components. To simulate a dataset with significant noise, we generate 300 noisy examples from a random Gaussian distribution with a larger variance and assign noisy labels selected from a uniform distribution. Figure 3 demonstrates the decision boundaries of a standard method and the proposed combinatorial classifiers with their accuracies. Our model is based on logistic regressors on three binary meta-class sets, which are gradually added one by one as shown in Figure 3b. With these results, we put emphasis on the following three observations. 
First, the combinatorial classifier becomes capable of identifying all original classes as we add more meta-class sets. Second, the decision boundaries and accuracies illustrate the noise-robustness of our method. Finally, the proposed technique requires fewer parameters (three weight vectors in the logistic regressors) than the standard classifier (five weight vectors, one for each class).

4.2 Evaluation on CUB-200

Experimental settings   We conduct a set of experiments on the Caltech-UCSD Birds-200-2011 (CUB-200) dataset [39] with various noise settings. CUB-200 is a fine-grained classification benchmark with 200 bird species and contains ∼30 images per class in the training and validation sets. Note that CUB-200 is more natural and realistic compared to the simple datasets (MNIST and CIFAR) used for the evaluation of many previous methods.
We consider both open- and closed-set noise artificially injected into training examples. The open-set noise is created by giving one of the target labels to images sampled from unseen categories. To simulate open-set noise, we use 100 (out of 200) classes as the target labels and the remaining 100 classes are assumed to be unknown. The noise level η controls the ratio between clean and noisy examples. On the other hand, examples with closed-set noise have wrong labels within the target classes, and we use all 200 classes in CUB-200 as the target labels. For both types of noise, we use two label corruption schemes: uniform transition and nearest label transfer. 
The uniform transition injects label noise by selecting a wrong label uniformly, while the nearest label transfer determines the label of a noisy example using its nearest example with a different label to simulate confusions between visually similar classes. For the nearest neighbor search, we employ the features of examples extracted from a network pretrained on the clean dataset. For both noise types, we test moderate and high noise levels (η = 0.25 and η = 0.50). 

Table 1: Accuracies [%] on CUB-200 with different levels of open-set noise.

                         Clean dataset  Moderate noise level          High noise level
Methods                  (η = 0)        Uniform       Nearest        Uniform       Nearest
Standard                 80.57 ± 0.37   73.37 ± 0.34  77.14 ± 0.27   70.04 ± 0.71  75.45 ± 0.50
Decoupling [6]           79.32 ± 0.83   71.42 ± 0.70  76.07 ± 0.40   66.79 ± 0.44  74.80 ± 0.46
F-correction [19]        80.66 ± 0.60   73.55 ± 0.70  77.03 ± 0.29   69.76 ± 0.59  75.52 ± 0.32
S-model [18]             80.75 ± 0.37   73.52 ± 0.47  77.13 ± 0.97   70.06 ± 0.65  75.59 ± 0.33
MentorNet [7]            80.39 ± 0.36   73.53 ± 0.56  77.27 ± 0.49   70.34 ± 0.42  75.75 ± 0.62
q-loss (q = 0.3) [10]    81.55 ± 0.52   75.04 ± 0.46  78.09 ± 0.40   71.84 ± 0.95  76.32 ± 0.65
q-loss (q = 0.5) [10]    82.19 ± 0.58   77.51 ± 0.53  78.58 ± 0.46   75.40 ± 0.66  76.47 ± 0.34
q-loss (q = 0.8) [10]    75.15 ± 1.96   71.02 ± 1.68  47.40 ± 1.16   67.16 ± 1.29  32.60 ± 2.38
Co-teaching [11]         80.90 ± 0.13   75.37 ± 0.54  77.41 ± 0.36   74.02 ± 0.29  75.57 ± 0.22
CombCls                  82.80 ± 0.36   79.38 ± 0.83  79.28 ± 0.52   79.19 ± 0.29  77.95 ± 0.53
CombCls+Co-teaching      82.86 ± 0.25   79.93 ± 0.44  80.22 ± 0.31   80.50 ± 0.61  78.26 ± 0.43

Table 2: Accuracies [%] on CUB-200 with different levels of closed-set noise.

                         Clean dataset  Moderate noise level          High noise level
Methods                  (η = 0)        Uniform       Nearest        Uniform       Nearest
Standard                 79.58 ± 0.18   63.65 ± 0.26  65.21 ± 0.42   42.35 ± 0.50  47.70 ± 0.41
Decoupling [6]           77.79 ± 0.23   62.52 ± 0.23  66.24 ± 0.53   43.91 ± 0.52  51.92 ± 0.18
F-correction [19]        80.01 ± 0.42   63.81 ± 0.16  64.69 ± 0.21   42.23 ± 0.54  48.00 ± 0.46
S-model [18]             79.42 ± 0.27   63.08 ± 0.74  64.90 ± 0.29   42.17 ± 0.70  48.01 ± 0.47
MentorNet [7]            79.78 ± 0.20   68.03 ± 0.32  65.49 ± 0.14   47.74 ± 1.64  48.25 ± 0.39
q-loss (q = 0.3) [10]    80.41 ± 0.36   68.52 ± 0.51  66.34 ± 0.25   53.18 ± 0.49  49.30 ± 0.35
q-loss (q = 0.5) [10]    80.76 ± 0.38   75.24 ± 0.31  67.49 ± 0.56   60.89 ± 0.32  49.28 ± 0.57
q-loss (q = 0.8) [10]    40.70 ± 2.25   29.31 ± 1.14  24.98 ± 1.61   17.67 ± 1.06  15.95 ± 0.65
Co-teaching [11]         79.74 ± 0.14   68.21 ± 0.35  66.24 ± 0.30   52.72 ± 0.56  49.81 ± 0.19
CombCls                  81.36 ± 0.23   71.75 ± 0.24  68.35 ± 0.35   51.90 ± 0.35  52.00 ± 0.22
CombCls+Co-teaching      81.52 ± 0.47   75.30 ± 0.10  70.46 ± 0.31   62.77 ± 0.66  52.49 ± 0.79
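The two label-corruption schemes described above can be sketched as follows. This is a simplified sketch with made-up labels and features; the real nearest-neighbor search uses features from a network pretrained on the clean dataset:

```python
import numpy as np

def inject_uniform(labels, eta, n_classes, rng):
    """Uniform transition: with probability eta, replace a label by a different
    class drawn uniformly at random."""
    out = labels.copy()
    for i in np.where(rng.random(len(labels)) < eta)[0]:
        out[i] = rng.choice([c for c in range(n_classes) if c != labels[i]])
    return out

def inject_nearest(labels, feats, eta, rng):
    """Nearest label transfer: a corrupted example takes the label of its nearest
    example (in feature space) that carries a different label."""
    out = labels.copy()
    d = ((feats[:, None] - feats[None]) ** 2).sum(-1)   # pairwise squared distances
    for i in np.where(rng.random(len(labels)) < eta)[0]:
        others = np.where(labels != labels[i])[0]       # candidates with other labels
        out[i] = labels[others[np.argmin(d[i, others])]]
    return out

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])
feats = np.array([[0.0], [1.0], [1.1], [5.0]])
# With eta = 1.0 every label flips; each example adopts its nearest other-class label.
assert inject_nearest(labels, feats, eta=1.0, rng=rng).tolist() == [1, 1, 0, 0]
```

The nearest-label scheme concentrates corruptions between visually similar classes, which is why its transition matrices (Figure 4) are far from uniform.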
Figure 4 presents sample noise transition matrices for all cases.
We use ResNet-50 as the backbone network for all the methods and initialize the parameters of the feature extractor using the weights pretrained on ImageNet [40], while the classification layer(s) are initialized randomly. The entire network is fine-tuned for 40 epochs by mini-batch stochastic gradient descent with a batch size of 32, momentum of 0.9, and a weight decay factor of 5 × 10⁻⁴. The initial learning rate is 0.01 and is decayed by a factor of 0.1 at epochs 20 and 30. For combinatorial classification, we use 100 binary meta-class sets (K′ = 2) generated by performing k-means clustering with Q = 50. All models are evaluated on clean test sets with five independent runs. We report the best test accuracy across epochs for all models, since the learning curves of individual methods may differ and reporting accuracies at a particular epoch may be unfair. However, we also note that our approach still outperforms the others even when fixing the number of epochs in a wide range.

Results   The proposed combinatorial classification (CombCls) is compared with the following state-of-the-art methods: Decoupling [6], F-correction [19], S-model [18], MentorNet [7], q-loss [10] and Co-teaching [11], in addition to an ordinary flat classifier (Standard).
Tables 1 and 2 present results on the CUB-200 dataset in the presence of open- and closed-set noise, respectively. We first observe that CombCls outperforms Standard in the noise-free setting (η = 0) in both cases. This is partly because our combinatorial classifier learns useful information for recognition by modeling inter-class relationships and exploits ensemble effects during inference, as discussed in Section 3.4. These results imply that our method is useful regardless of noise level and achieves outstanding classification accuracy. 
Table 3: Results of ablation studies. Accuracies [%] of (left) the combinatorial classifier trained on highly noisy datasets with meta-class sets generated from datasets with different levels of uniform noise, and (right) a standard classifier with the feature extractor of CombCls in various noise configurations.

(left)
Dataset used for meta-class set generation   Open-set   Closed-set
Clean                                        77.82      45.40
Moderate noise level                         77.58      50.22
High noise level                             79.19      51.90

(right)
                        Standard   Standard+CombFeat
Open-set    Uniform     70.04      78.48
            Nearest     75.45      77.40
Closed-set  Uniform     42.35      53.83
            Nearest     47.70      52.52

Table 4: Results of combinatorial classification using different meta-class set configurations on the datasets with high noise level. Acc. means accuracy [%] and Param. is the ratio of model parameters in each method with respect to that of Standard.

                              Open-set noise                Closed-set noise
                              Uniform       Nearest         Uniform       Nearest
Methods   Meta-class set      Acc.   Param. Acc.   Param.   Acc.   Param. Acc.   Param.
Standard  N/A                 70.04  1.00   75.45  1.00     42.35  1.00   42.84  1.00
CombCls   Random              78.66  1.00   76.75  1.00     46.98  0.50   48.55  0.50
CombCls   Clustering          79.19  1.00   77.95  1.00     51.90  0.50   52.00  0.50
CombCls   Clustering+Search   79.98  0.42   78.35  0.44     54.52  0.41   52.43  0.29

Moreover, the proposed algorithm identifies a compact model compared to the other methods. The baseline models use K weight vectors in their classification layers, where each vector corresponds to a base class; the baselines have 100 and 200 weight vectors for the open- and closed-set noise cases, respectively. Note that CombCls consists of M (= 100) binary classifiers in both cases and saves memory substantially.
Our algorithm outperforms the other methods in the presence of noise, and the accuracy gain is even larger than in the noise-free case. It achieves the state-of-the-art accuracy in most settings. 
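One way to make the memory comparison above concrete is to count classification-head weight vectors. This assumes each binary constituent classifier is a one-logit logistic regressor (as in the toy example of Section 4.1), which is consistent with the Param. ratios reported for CombCls in Table 4:

```python
# Classification-head size, counted in weight vectors (the shared ResNet-50
# feature extractor is identical for all methods and is left out).
M = 100                      # binary meta-class sets (K' = 2)
head_standard_open = 100     # one weight vector per base class, open-set setting
head_standard_closed = 200   # one weight vector per base class, closed-set setting

# Assumption: each binary constituent classifier needs a single weight vector
# (a one-logit logistic regressor), so CombCls needs M vectors in total.
head_combcls = M * 1

assert head_combcls / head_standard_open == 1.00    # Table 4, open-set Param.
assert head_combcls / head_standard_closed == 0.50  # Table 4, closed-set Param.
```

Under this reading, the searched subsets (Clustering+Search) shrink the head further by selecting fewer than M meta-class sets, giving ratios well below one.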
Note that q-loss with the optimal q value is better than our method when uniform closed-set noise is injected. This is mainly because such problem configurations align well with the assumption behind the q-loss algorithm [10]. However, real noise distributions are unlikely to be uniform. For instance, we observe significantly more open-set noise than closed-set noise in a real-world noisy dataset, as shown in Figure 1. Moreover, the performance of q-loss highly depends on the choice of q, and an inappropriate choice degrades performance significantly because the theoretical noise-robustness and training stability vary with q; the optimal q value also differs across datasets, e.g., q = 0.5 for CUB-200 and q = 0.3 for WebVision (shown in Section 4.3). In contrast, our algorithm reduces the level of noise effectively regardless of hyper-parameters by introducing coarse-grained meta-class sets. Another observation is that our approach is unique and complementary to other methods. As a result, it is straightforward to further improve accuracy by combining our method with Co-teaching [11] (CombCls+Co-teaching). Note that Co-teaching trains two networks that filter out noisy examples by cross-referencing each other, which is a completely different approach from ours to tackle label noise. Notably, CombCls+Co-teaching achieves the best accuracy through this collaboration in all noise configurations and almost recovers the accuracy of Standard on the clean dataset in Table 1.
For further understanding, we train the proposed classifier on the datasets with high noise level while pretraining the network used for clustering-based meta-class set generation on datasets with different noise levels. Interestingly, using the dataset with high noise level for meta-class set generation gives higher accuracy than using cleaner datasets, as shown in Table 3 (left).
This implies that the meta-class sets generated from a noisy dataset reflect the noise distribution and help the combinatorial classifiers generalize better on the noisy dataset. We observe similar tendencies with other combinations of noise levels, where the meta-class sets generated under the same noise distribution result in higher accuracies. Also, we construct a Standard network and initialize its feature extractor using a trained CombCls model. Then, we fine-tune its classification layer while the weights of the feature extractor are fixed. This network (Standard+CombFeat) is compared to Standard in Table 3 (right). Using the feature extractor of the combinatorial classifier, the accuracy of Standard is improved by a significant margin in all noise settings. This signifies that the proposed model learns a noise-robust feature extractor.

Table 5: Results on WebVision.

  Methods                  Acc. [%]
  Standard                   79.82
  Decoupling [6]             79.38
  F-correction [19]          80.96
  S-model [18]               81.36
  MentorNet [7]              80.46
  q-loss (q = 0.3) [10]      82.18
  Co-teaching [11]           83.06
  CombCls                    83.14
  CombCls+Search             83.26
  CombCls+Co-teaching        84.14

Figure 5: Accuracy [%] of combinatorial classifiers with open-set uniform noise (η = 0.5) by varying K′ and M.

Next, we evaluate the proposed method with different meta-class set configurations and show the results in Table 4. We compare randomly configured meta-class sets (Random) with the proposed clustering-based ones (Clustering). In addition, we also evaluate the performance of the meta-class set search method (Clustering+Search) proposed in Section 3.3. For the meta-class set search, we reserve half of the training set as the validation set to train the search agent, compute the in-batch accuracy with 32 images, and sample 100 meta-class set combinations per batch. The number of candidate meta-class sets is 600 and α in Eq. (6) is set to 3 × 10^-4.
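One way to realize this search procedure is with REINFORCE [38], as sketched below: each candidate meta-class set gets an independent Bernoulli selection variable, and the reward is an accuracy estimate penalized by α times the number of selected sets, loosely mirroring Eq. (6). The Bernoulli parameterization, the moving-average baseline, and the toy hyper-parameters are illustrative assumptions, not the paper's exact agent.

```python
import math, random

def search_meta_class_sets(num_candidates, reward_fn, alpha=3e-4,
                           steps=200, samples=100, lr=0.5, seed=0):
    """REINFORCE over a Bernoulli selection mask of candidate meta-class sets.

    reward_fn(mask) should return a (possibly noisy) accuracy estimate
    for the subset of candidate sets where mask[i] == 1.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_candidates       # selection log-odds per candidate
    baseline = 0.0                        # moving-average reward baseline
    for _ in range(steps):
        grads = [0.0] * num_candidates
        for _ in range(samples):
            p = [1.0 / (1.0 + math.exp(-l)) for l in logits]
            mask = [1 if rng.random() < pi else 0 for pi in p]
            # reward: accuracy minus a cost on subset size (cf. Eq. (6))
            r = reward_fn(mask) - alpha * sum(mask)
            adv = r - baseline
            baseline += 0.01 * (r - baseline)
            for i in range(num_candidates):
                # d log Bernoulli(mask_i | p_i) / d logit_i = mask_i - p_i
                grads[i] += adv * (mask[i] - p[i])
        # gradient ascent on the expected penalized reward
        logits = [l + lr * g / samples for l, g in zip(logits, grads)]
    return [1 if l > 0 else 0 for l in logits]  # final hard selection
```

Because the reward only requires an accuracy estimate on held-out data, nothing in this loop needs clean labels; the size penalty is what drives the agent toward the smaller subsets reflected in the Param. column of Table 4.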
Note that the search process does not require any clean data since the agent is trained on the noisy held-out data extracted from the training set. After the search process, we retrain the combinatorial classifier on the entire training set including the validation data. As shown in Table 4, the combinatorial classifier outperforms the baseline (Standard) even with randomly constructed meta-class sets. Our clustering-based algorithm brings additional improvement, and employing the meta-class set search technique boosts the accuracy even further. Note that the meta-class sets configured by the search agent not only improve the accuracy of CombCls but also reduce the number of parameters, since the search is optimized to maximize accuracy while minimizing the number of meta-class sets.
Finally, we investigate the effects of K′ and M in our method. Figure 5 shows the results on CUB-200 with open-set uniform noise (η = 0.5). Our models outperform the baseline even with a fairly small number of meta-class sets, regardless of K′. In particular, the model with K′ = 2 reduces the noise level most effectively and achieves the best accuracy among the models with the same number of parameters, which are depicted by diamond markers in the plot. We observe the same tendency in the experiments with other noise settings.

4.3 Experiments on WebVision

We also conduct experiments on a real-world noisy benchmark, WebVision [4]. This dataset is constructed by collecting 2.4 million web images retrieved from Flickr and Google using manually defined queries related to 1,000 ImageNet classes. While the training set includes a significant amount of noise, the benchmark provides a clean validation set for evaluation. We use a subset of the WebVision dataset for our experiment, which contains all images from 100 randomly sampled classes.
The experimental settings are identical to the ones described in the previous section except for the optimization parameters; we adopt the parameters of the ImageNet training setting in [41].
Table 5 presents the accuracies of all compared methods, and the proposed model shows competitive performance. As in the experiments on CUB-200, our method benefits from the combination with Co-teaching and achieves the best accuracy. We also find that applying the meta-class set search yields an additional accuracy gain. Moreover, it reduces the model complexity, using only 64% of the parameters in the classification layers compared to the baselines and the proposed model without the meta-class set search.

5 Conclusion

We proposed a novel classification framework, which constructs multiple classifiers over heterogeneous coarse-grained meta-class sets and performs combinatorial inference using their predictions to identify a target label in the original class space. Our method is particularly beneficial when the dataset contains label noise, since the use of coarse-grained meta-class representations reduces the noise level naturally. We also introduced meta-class set search techniques based on clustering and reinforcement learning. The extensive experiments on datasets with artificial and real-world noise demonstrated the effectiveness of the proposed method in terms of accuracy and efficiency.

Acknowledgments

This work is partly supported by the Google AI Focused Research Award and Korean ICT R&D programs of the MSIP/IITP grant [2017-0-01778, 2017-0-01780].

References

[1] Yan, Y., Rosales, R., Fung, G., Subramanian, R., Dy, J.: Learning from multiple annotators with varying expertise.
Machine Learning 95(3) (2014) 291–327
[2] Yu, X., Liu, T., Gong, M., Tao, D.: Learning with biased complementary labels. In ECCV. (2018)
[3] Blum, A., Kalai, A., Wasserman, H.: Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM (JACM) 50(4) (2003) 506–519
[4] Li, W., Wang, L., Li, W., Agustsson, E., Van Gool, L.: WebVision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862 (2017)
[5] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In ICLR. (2017)
[6] Malach, E., Shalev-Shwartz, S.: Decoupling "when to update" from "how to update". In NIPS. (2017)
[7] Jiang, L., Zhou, Z., Leung, T., Li, L.J., Fei-Fei, L.: MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML. (2018)
[8] Wang, Y., Liu, W., Ma, X., Bailey, J., Zha, H., Song, L., Xia, S.T.: Iterative learning with open-set noisy labels. In CVPR. (2018)
[9] Jenni, S., Favaro, P.: Deep bilevel learning. In ECCV. (2018)
[10] Zhang, Z., Sabuncu, M.: Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS. (2018)
[11] Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., Sugiyama, M.: Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS. (2018)
[12] Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., Rabinovich, A.: Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596 (2014)
[13] Northcutt, C.G., Wu, T., Chuang, I.L.: Learning with confident examples: Rank pruning for robust classification with noisy labels. arXiv preprint arXiv:1705.01936 (2017)
[14] Veit, A., Alldrin, N., Chechik, G., Krasin, I., Gupta, A., Belongie, S.: Learning from noisy large-scale datasets with minimal supervision.
In CVPR. (2017)
[15] Tanaka, D., Ikami, D., Yamasaki, T., Aizawa, K.: Joint optimization framework for learning with noisy labels. In CVPR. (2018)
[16] Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L., Fergus, R.: Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080 (2014)
[17] Jindal, I., Nokleby, M., Chen, X.: Learning deep networks from noisy labels with dropout regularization. In ICDM. (2016)
[18] Goldberger, J., Ben-Reuven, E.: Training deep neural-networks using a noise adaptation layer. In ICLR. (2017)
[19] Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., Qu, L.: Making deep neural networks robust to label noise: A loss correction approach. In CVPR. (2017)
[20] Han, B., Yao, J., Niu, G., Zhou, M., Tsang, I., Zhang, Y., Sugiyama, M.: Masking: A new perspective of noisy supervision. In NeurIPS. (2018)
[21] Hendrycks, D., Mazeika, M., Wilson, D., Gimpel, K.: Using trusted data to train deep networks on labels corrupted by severe noise. In NIPS. (2018)
[22] Azadi, S., Feng, J., Jegelka, S., Darrell, T.: Auxiliary image regularization for deep CNNs with noisy labels. arXiv preprint arXiv:1511.07069 (2015)
[23] Ghosh, A., Kumar, H., Sastry, P.: Robust loss functions under label noise for deep neural networks. In AAAI. (2017)
[24] Li, Y., Yang, J., Song, Y., Cao, L., Luo, J., Li, L.J.: Learning from noisy labels with distillation. In ICCV. (2017)
[25] Yu, X., Han, B., Yao, J., Niu, G., Tsang, I., Sugiyama, M.: How does disagreement help generalization against label corruption? In ICML. (2019)
[26] Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2 (1994) 263–286
[27] Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search.
IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1) (2011) 117–128
[28] Ge, T., He, K., Ke, Q., Sun, J.: Optimized product quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(4) (2014) 744–755
[29] Seo, P.H., Weyand, T., Sim, J., Han, B.: CPlaNet: Enhancing image geolocalization by combinatorial partitioning of maps. In ECCV. (2018)
[30] Xuan, H., Souvenir, R., Pless, R.: Deep randomized ensembles for metric learning. In ECCV. (2018)
[31] Breiman, L.: Bagging predictors. Machine Learning 24(2) (1996) 123–140
[32] Freund, Y., Schapire, R.E., et al.: Experiments with a new boosting algorithm. In ICML. (1996)
[33] Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts. Neural Computation 3(1) (1991) 79–87
[34] Opitz, D., Maclin, R.: Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research 11 (1999) 169–198
[35] Dietterich, T.G.: Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, Springer (2000) 1–15
[36] Domingos, P.: Bayesian averaging of classifiers and the overfitting problem. In ICML. (2000)
[37] Rokach, L.: Ensemble-based classifiers. Artificial Intelligence Review 33(1-2) (2010) 1–39
[38] Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4) (1992) 229–256
[39] Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology (2010)
[40] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge.
International Journal of Computer Vision (IJCV) 115(3) (2015) 211–252
[41] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In CVPR. (2017)