{"title": "Learning with Multiple Labels", "book": "Advances in Neural Information Processing Systems", "page_first": 921, "page_last": 928, "abstract": null, "full_text": "Learning with Multiple Labels \n\nRong Jin* \n\n*School of Computer Science \nCarnegie Mellon University \nPittsburgh, PA 15213, USA \n\nrong@es.emu.edu \n\nZoubin Ghahramanit* \n\ntGatsby Computational Neuroscience Unit \n\nUniversity College London \nLondon WCIN 3AR, UK \nzoubin@gatsby.ucl.ae.uk \n\nAbstract \n\nIn this paper, we study a special kind of learning problem in which \neach training instance is given a set of (or distribution over) \ncandidate class labels and only one of the candidate labels is the \ncorrect one. Such a problem can occur, e.g., in an information \nretrieval setting where a set of words is associated with an image, \nor if classes labels are organized hierarchically. We propose a \nnovel discriminative approach for handling the ambiguity of class \nlabels in the training examples. The experiments with the proposed \napproach over five different UCI datasets show that our approach is \nable to find the correct label among the set of candidate labels and \nactually achieve performance close to the case when each training \ninstance is given a single correct label. In contrast, naIve methods \ndegrade rapidly as more ambiguity is introduced into the labels. \n\n1 Introduction \n\nSupervised and unsupervised learning problems have been extensively studied in the \nmachine learning literature. In supervised classification each training instance is \nassociated with a single class label, while in unsupervised classification (i.e. \nclustering) the class labels are not known. There has recently been a great deal of \ninterest in partially- or semi-supervised learning problems, where the training data is \na mixture of both labeled and unlabelled cases. Here we study a new type of semi(cid:173)\nsupervised learning problem. 
\n\nWe generalize the notion of supervision by considering learning problems where multiple candidate class labels are associated with each training instance, under the assumption that only one of the candidates is the correct label. In a supervised classification problem, the set of candidate class labels for every training instance contains exactly one label, while in an unsupervised learning problem, the set of candidate class labels for each training instance contains all possible class labels. For a learning problem with a mixture of labeled and unlabeled training data, the number of candidate class labels for every training instance is either one or the total number of different classes. \n\nHere we study the general setup, i.e. a learning problem in which each training instance is assigned a subset of all the class labels (later, we further generalize this to include arbitrary distributions over the class labels). For example, there may be 10 different classes, and each training instance is given two candidate class labels, one of which is correct. This learning problem is more difficult than supervised classification because for each training example we don't know which class among the given set of candidate classes is actually the target. For easy reference, we call this class of learning problems 'multiple-label' problems. \n\nIn practice, many real problems can be formalized as 'multiple-label' problems. For example, having several different class labels for a single training example can be caused by disagreement between several assessors.1 Consider the scenario where two assessors are hired to label the training data and sometimes give different class labels to the same training example. In this case, we will have two class labels for a single training instance and don't know which, if any, is actually correct. 
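The varying degrees of supervision described above can all be represented with candidate label sets. The following sketch is purely illustrative (the data and names are ours, not from the paper): a singleton set corresponds to supervised learning, the full label set to unsupervised learning, and anything in between is a 'multiple-label' problem.

```python
# Hypothetical encoding of supervision as candidate label sets.
ALL_LABELS = {0, 1, 2, 3}  # assume 4 classes for illustration

training_data = [
    ("x1", {2}),             # supervised: exactly one candidate label
    ("x2", {0, 3}),          # multiple-label: one of the two is correct
    ("x3", ALL_LABELS),      # unlabeled: any class could be the target
]

def ambiguity(candidates, num_classes=len(ALL_LABELS)):
    """Number of candidate labels, from 1 (supervised) to num_classes (unlabeled)."""
    assert 1 <= len(candidates) <= num_classes
    return len(candidates)

for x, s in training_data:
    print(x, ambiguity(s))
```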
Another scenario that can cause multiple class labels to be assigned to a single training example is when there is a hierarchical structure over the class labels and some of the training data are given the labels of internal nodes in the hierarchy (i.e. superclasses) instead of the labels of the leaf nodes (subclasses). Such hierarchies occur, for example, in bioinformatics, where proteins are regularly classified into superfamilies and families. For such hierarchical labels, we can treat the label of an internal node as the set of labels of the leaf nodes beneath it. \n\n2 Related Work \n\nFirst of all, we need to distinguish the 'multiple-label' problem from the problem where the classes are not mutually exclusive and therefore each training example is allowed several class labels [4]. There, even though each training example can have multiple class labels, all the assigned class labels are actually correct, while in 'multiple-label' problems only one of the assigned labels is the target label for the training instance. \n\nThe essential difficulty of 'multiple-label' problems comes from the ambiguity in the class labels of the training data: among the several labels assigned to a training instance only one is presumed to be correct, and unfortunately we are not told which one. A similar difficulty appears in the problem of classification from labeled and unlabeled training data. The difference between the 'multiple-label' problem and the labeled/unlabeled classification problem is that in the former only a subset of the class labels can be candidates for the target label, while in the latter any class label can be a candidate. As will be shown later, this constraint makes it possible for us to build a purely discriminative approach, whereas for learning problems using unlabeled data people usually take a generative approach and model properties of the input distribution. \n\nIn contrast to the 'multiple-label' problem, there is a set of problems named 'multiple-instance' problems [3], where instances are organized into 'bags' of several instances and a class label is attached to every bag. In the 'multiple-instance' problem, at least one of the instances within each bag corresponds to the label of the bag, and all other instances within the bag are just noise. The difference between 'multiple-label' problems and 'multiple-instance' problems is that for 'multiple-label' problems the ambiguity lies on the side of the class labels, while for 'multiple-instance' problems the ambiguity comes from the instances within the bag. \n\n1 Observer disagreement has been modeled using the EM algorithm [1]. Our multiple-label framework differs in that we don't know which observer assigned which label to each case. This would be an interesting direction in which to extend our framework. \n\nThe work most related to this paper is [6], where a similar problem is studied using the logistic regression method. Our framework is completely general for any discriminative model and incorporates a non-uniform 'prior' on the labels. \n\n3 Formal Description of the 'Multiple-label' Problem \n\nAs described in the introduction, in a 'multiple-label' problem each training instance is associated with a set of candidate class labels, only one of which is the target label for that instance. Let x_i be the input for the i-th training example, and S_i the set of candidate class labels for the i-th training example. Our goal is to find the model parameters θ ∈ Θ of some class of models M, i.e. a parameterized classifier which maps inputs to labels, such that the predicted class label y for the i-th training example has a high probability of being a member of the set S_i. More formally, using the maximum likelihood criterion and the assumption of i.i.d. 
assignments, this goal can be simply stated as \n\nθ* = argmax_θ Σ_i log Σ_{y∈S_i} p(y|x_i,θ)    (1) \n\n4 Description of the Discriminative Model for the 'Multiple-label' Problem \n\nBefore discussing the discriminative model for the 'multiple-label' problem, let us look at the standard discriminative model for supervised classification. Let p̂(y|x_i) stand for some given conditional distribution of class labels for the training instance x_i, and let p(y|x_i,θ) be the model-based conditional probability for the training instance x_i to have the class label y. A common and sensible criterion for finding the model parameters θ* is to minimize the KL divergence between the given conditional distributions and the model-based distributions, i.e. \n\nθ* = argmin_θ Σ_i Σ_y p̂(y|x_i) log [ p̂(y|x_i) / p(y|x_i,θ) ]    (2) \n\nFor supervised learning problems, the class label for every training instance is known. Therefore, the given conditional distribution of the class label for every training instance is a delta function, p̂(y|x_i) = δ(y, y_i), where y_i is the given class label for the i-th instance. With this, it can easily be shown that Eqn. (2) simplifies to the maximum likelihood criterion. For the 'multiple-label' problem, each training instance x_i is assigned a set of candidate class labels S_i, and therefore Eqn. (2) can be rewritten as: \n\nθ* = argmin_θ Σ_i Σ_{y∈S_i} p̂(y|x_i) log [ p̂(y|x_i) / p(y|x_i,θ) ]    (3) \n\nwith the constraints ∀i: Σ_{y∈S_i} p̂(y|x_i) = 1.    (4) \n\nIn the 'multiple-label' problem the distribution of class labels p̂(y|x_i) is unknown except for the constraint that the target class label for every training example is a member of the corresponding set of candidate class labels. A simple solution to the problem of the unknown label distribution is to assume it is uniform, i.e. p̂(y|x_i) = p̂(y'|x_i) for any y, y' ∈ S_i. Then, Eqn. 
(3) can be simplified to: \n\nθ* = argmin_θ Σ_i (1/|S_i|) Σ_{y∈S_i} log [ (1/|S_i|) / p(y|x_i,θ) ] = argmax_θ Σ_i (1/|S_i|) Σ_{y∈S_i} log p(y|x_i,θ),    (5) \n\nwhich corresponds to minimizing the KL divergence (2) to a uniform distribution over S_i. For the case of multiple assessors giving differing labels to the data, discussed in the introduction, this corresponds to concatenating the labeled data sets. Standard learning algorithms can be applied to learn the conditional model p(y|x,θ). For later reference, we call this simple idea the 'Naive Model'. \n\nA better solution than the 'Naive Model' is to disambiguate the label association, i.e. to find which label among the given set is more appropriate than the others and use that label for training. It turns out that it is possible to apply the EM algorithm [2] to accomplish this goal, resulting in a procedure which iterates between disambiguating and classifying. Starting with the assumption that every class label within the set is equally likely, we train a conditional model p(y|x,θ). Then, with the help of this conditional model, we estimate the label distribution p̂(y|x_i) for each data point. With these label distributions, we refit the conditional model p(y|x,θ), and so on. More formally, this idea can be expressed as follows: \n\nFirst, we estimate the conditional model based on the assumed or estimated label distribution according to Eqn. (3). This step corresponds to the M-step of the EM algorithm. Then, in the E-step, new label distributions are estimated by minimizing Eqn. (3) w.r.t. p̂(y|x_i) under the constraints (4), resulting in: \n\np̂(y|x_i) = p(y|x_i,θ) / Σ_{y'∈S_i} p(y'|x_i,θ) for y ∈ S_i, and p̂(y|x_i) = 0 otherwise.    (6) \n\nImportantly, this procedure optimizes the objective function in Eqn. (1), by the usual EM proof. The negative of the KL divergence in Eqn. 
(3) is a lower bound on the log likelihood (1) by Jensen's inequality; substituting Eqn. (6) for p̂(y|x_i) into (3) gives equality. For easy reference, we call this model the 'EM Model'. \n\nIn some 'multiple-label' problems, information on which class label within the set S_i is more likely to be the correct one can be obtained. For example, if three assessors manually label the training data, in some cases two assessors will agree on the class label while the third disagrees. We should give more weight to labels agreed upon by two assessors and less weight to labels chosen by only one. To accommodate prior information on the class labels, we generalize the previous framework so that the estimated label distribution p̂(y|x_i) has low relative entropy to the prior on the class labels. Therefore, the objective function (1) and its EM bound (3) can be modified to \n\nθ* = argmin_θ { Σ_i Σ_{y∈S_i} p̂(y|x_i) log [ p̂(y|x_i) / π_{i,y} ] − Σ_i Σ_{y∈S_i} p̂(y|x_i) log p(y|x_i,θ) }    (7) \n\nwhere π_{i,y} is the prior probability for the i-th training example to have class label y. The first term in the objective function (7) encourages the estimated label distribution to be consistent with the prior distribution of class labels, and the second term encourages the prediction of the model to be consistent with the estimated label distribution. The objective (7) is an upper bound on −Σ_i log Σ_{y∈S_i} π_{i,y} p(y|x_i,θ). \n\nWhen there is no prior information about which class label within the given set is preferable, we can set π_{i,y} = 1/|S_i| and Eqn. (7) becomes \n\nθ* = argmin_θ { Σ_i Σ_{y∈S_i} p̂(y|x_i) log [ p̂(y|x_i) / (1/|S_i|) ] − Σ_i Σ_{y∈S_i} p̂(y|x_i) log p(y|x_i,θ) } \n= argmin_θ { Σ_i Σ_{y∈S_i} p̂(y|x_i) log [ p̂(y|x_i) / p(y|x_i,θ) ] + Σ_i log |S_i| } \n= argmin_θ Σ_i Σ_{y∈S_i} p̂(y|x_i) log [ p̂(y|x_i) / p(y|x_i,θ) ]    (7') \n\nEqn. (7') is identical to Eqn. 
(3), which shows that when there is no prior knowledge on the class label distribution, we revert to the 'EM Model'. \n\nAgain we can optimize Eqn. (7) using the EM algorithm, estimating the label distribution p̂(y|x_i) in the E-step and fitting any standard discriminative model for p(y|x,θ) in the M-step. The label distribution that optimizes (7) in the E-step is: \n\np̂(y|x_i) = π_{i,y} p(y|x_i,θ) / Σ_{y'∈S_i} π_{i,y'} p(y'|x_i,θ) for y ∈ S_i, and 0 otherwise. \n\nAs we would expect, the label distribution p̂(y|x_i) trades off both the prior π_{i,y} and the model-based prediction p(y|x_i,θ). We will call this model the 'EM+Prior Model'. \n\nThe 'EM+Prior Model' can also be interpreted from the viewpoint of a graphical model. The basic idea is illustrated in Figure 1, where the random variable t_i represents the event that the true label y_i belongs to the label set S_i. For the 'EM+Prior' model, π_{i,y} actually plays the role of a likelihood or noise model, where p(y ∈ S_i | x_i, θ) in Eqn. (1) is replaced as in Eqn. (8). From this point of view, generalizing to Bayesian learning and regression is easy. \n\n[Figure 1: Diagram for the graphical model interpretation of the 'EM+Prior' model] \n\nP(t_i = 1 | x_i, θ) = Σ_{y∈S_i} P(t_i = 1 | y) p(y|x_i,θ) = Σ_{y∈S_i} π_{i,y} p(y|x_i,θ)    (8) \n\n5 Experiments \n\nThe goal of our experiments is to answer the following questions: \n\n1. Is the 'EM Model' better than the 'Naive Model'? The difference between the 'EM Model' and the 'Naive Model' for 'multiple-label' problems is that the 'Naive Model' makes no effort to find the correct label within the given label set, while the 'EM Model' applies the EM algorithm to resolve the ambiguity in the class label. Therefore, in this experiment, we need to verify empirically whether the effort of disambiguating class labels is effective. \n\n2. Will prior knowledge help the model? 
The difference between the 'EM Model' and the 'EM+Prior Model' is that the 'EM+Prior Model' takes advantage of prior knowledge of the distribution of class labels for instances. However, since the prior knowledge of the class labels can sometimes be misleading, we need to test the robustness of the 'EM+Prior Model' to such noisy prior knowledge. \n\n5.1 Experimental Data \n\nSince there are no standard data sets with training instances assigned multiple class labels, we created several data sets with multiple class labels from the UCI classification datasets. To make our experiments more realistic, we tried two different methods of creating datasets with multiple class labels: \n\n• Random Distractors. For every training instance, in addition to the originally assigned label, several randomly selected labels are added to the label candidate set. We varied the number of added classes to test the reliability of our algorithm. \n\n• Naive Bayes Distractors. In the previous method, the added class labels are randomly selected and therefore independent of the original class label. However, we usually expect that distractors in the candidate set are correlated with the original label. To simulate this realistic situation, we use the output of a Naive Bayes (NB) classifier as an additional member of the class label candidate set.1 First, a Naive Bayes classifier using Gaussian generative models is trained on the dataset. Then, the trained NB classifier is asked to predict the class labels of the training data. When the output of the NB classifier differs from the original label, it is added as a candidate label. Otherwise, a randomly selected label is added to the candidate set. Since the NB classifier's errors are not completely random, they should have some correlation with the originally assigned labels. 
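The two construction schemes above can be sketched as follows. This is a hypothetical illustration (function names are ours; labels are integers, and the NB prediction is passed in as a plain value rather than produced by a real classifier):

```python
import random

def random_distractors(true_label, num_classes, k=1, rng=random):
    """Random distractors: the true label plus k distinct randomly chosen other labels."""
    others = [c for c in range(num_classes) if c != true_label]
    return {true_label, *rng.sample(others, k)}

def nb_distractor(true_label, nb_prediction, num_classes, rng=random):
    """NB distractor: add the Naive Bayes output when it disagrees with the
    true label; otherwise fall back to one random other label."""
    if nb_prediction != true_label:
        return {true_label, nb_prediction}
    other = rng.choice([c for c in range(num_classes) if c != true_label])
    return {true_label, other}
```

With k=1, each candidate set contains exactly two labels, matching the two-candidate setup that is the only possibility for the 3-class 'wine' and 'iris' datasets.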
\n\nIn these experiments we chose a simple maximum entropy (ME) model [5] as the basic discriminative model, which expresses the conditional probability p(y|x,θ) in exponential form, i.e. p(y|x,θ) = exp(θ_y · x) / Z(x), where x is the input feature vector and Z(x) is the normalization constant which ensures that the conditional probabilities over all classes y sum to 1. \n\nTable 1: Information about the five UCI datasets that are used in the experiments \n\nDataset | ecoli | wine | pendigit | iris | glass \nNumber of Instances | 327 | 178 | 2000 | 154 | 204 \nNumber of Classes | 5 | 3 | 10 | 3 | 5 \nNumber of Features | 7 | 13 | 16 | 14 | 10 \n% NB Output ≠ Assigned Label | 15% | 8% | 22.3% | 13.3% | 16.6% \nError Rate for ME on clean data (10-fold cross validation) | 12.6% | 3.7% | 9% | 5.7% | 9.7% \n\nFive different UCI datasets were selected as the testbed for the experiments. Information about these datasets is listed in Table 1. For each dataset, the 10-fold cross validation results for the ME model, together with the percentage of the time the NB output differs from the originally assigned label, are also listed in Table 1. \n\n5.2 Experiment Results (I): 'Naive Model' vs. 'EM Model' \n\nTable 2 lists the results for the 'Naive Model' and 'EM Model' over a varied number of additional class labels created by the random distractor and the Naive Bayes distractor. Since the 'wine' and 'iris' datasets only have 3 different classes, the maximum number of additional class labels for these two data sets is 1. Therefore, there are no experimental results for the case of 2 or 3 distractor class labels for 'wine' and 'iris'. \n\nAs shown in Table 2, for the random distractor, the 'EM Model' substantially outperforms the 'Naive Model' in all cases. 
Particularly, for the 'wine' and 'iris' datasets, introducing an additional class label for every training instance leaves only one class label out of the candidate set, and yet the performance of the 'EM Model' is still close to the case with no additional class labels. \n\n1 The Naive Bayes distractor should not be confused with the multiple-label 'Naive Model'. \n\nMeanwhile, the 'Naive Model' degrades significantly in both cases, i.e. from 3.7% to 10.0% for 'wine' and from 5.7% to 18.5% for 'iris'. Therefore, we can conclude that the 'EM Model' is able to reduce the noise caused by randomly added class labels. \n\nTable 2: Average 10-fold cross validation error rates for both the 'Naive Model' and the 'EM Model' \n\nDataset | 1 extra label (random): Naive, EM | 2 extra labels (random): Naive, EM | 3 extra labels (random): Naive, EM | 1 extra label (NB): Naive, EM \necoli | 17.3%, 13.6% | 20.7%, 14.9% | 25.8%, 18.3% | 22.4%, 14.6% \nwine | 10%, 4.4% | - | - | 15.7%, 6.8% \npendigit | 14.2%, 8.9% | 15.4%, 9.4% | 17.6%, 11.7% | 17.2%, 15.4% \niris | 18.5%, 5.2% | - | - | 18.5%, 6.7% \nglass | 24.9%, 12.9% | 44.9%, 12% | 34.6%, 33.5% | 27.7%, 20.6% \n\nSecondly, we compare the performance of the two models in a more realistic setup for the 'multiple-label' problem, where the distractor identity is correlated with the true label (simulated by using the NB distractor). Table 1 gives the percentage of times the trained Naive Bayes classifier disagreed with the 'true' labels, which is also the percentage of the additional class labels created by the Naive Bayes distractor. 
The last column of Table 2 shows the performance of the two models when the additional class labels are introduced by the NB distractor. Again, the 'EM Model' is significantly better than the 'Naive Model'. For the 'ecoli', 'wine' and 'iris' datasets, the average error rates of the 'EM Model' are very close to the cases with no distractor class labels. Therefore, we can conclude that the 'EM Model' is able to reduce the noise caused not only by random label ambiguity but also by some systematic label ambiguity. \n\n5.3 Experiment Results (II): 'EM Model' vs. 'EM+Prior Model' \n\nTable 3: Average 10-fold cross validation error rates for the 'EM+Prior Model' over five UCI datasets \n\nDataset | 1 extra label (random): Perfect, Noisy | 2 extra labels (random): Perfect, Noisy | 3 extra labels (random): Perfect, Noisy | 1 extra label (NB): Perfect, Noisy \necoli | 13.3%, 13.3% | 13.6%, 13.9% | 12.6%, 13.9% | 13.9%, 15.3% \nwine | 3.7%, 3.2% | - | - | 5.0%, 6.2% \npendigit | 8.7%, 9.0% | 9.0%, 9.4% | 10.0%, 11.0% | 13.4%, 14.2% \niris | 5.2%, 18.5% | - | - | 5.2%, 6.7% \nglass | 12.4%, 12.9% | 12.5%, 13.6% | 12.4%, 16.8% | 16.7%, 19.0% \n\nIn this subsection, we focus on whether information from a prior distribution on class labels can improve performance. In this experiment, we study two cases: \n\n• 'Perfect Case'. Here the guidance of the prior distribution on class labels is always correct. In our experiments, for every training instance x_i we set the prior probability π_{i,y_i} of the correct label y_i to be twice as large as the other π_{i,y}.
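As a minimal numerical sketch of the 'Perfect Case' prior and the 'EM+Prior' E-step described in Section 4 (our own illustration, with a fixed model distribution standing in for a trained ME classifier):

```python
import numpy as np

def make_prior(candidates, presumed_correct, num_classes):
    # 'Perfect Case' prior: the presumed-correct label gets twice the weight
    # of every other candidate label; labels outside the set get zero weight.
    pi = np.zeros(num_classes)
    for y in candidates:
        pi[y] = 2.0 if y == presumed_correct else 1.0
    return pi / pi.sum()

def em_prior_e_step(p_model, candidates, pi):
    # 'EM+Prior' E-step: p_hat(y|x_i) is proportional to pi_{i,y} * p(y|x_i,theta)
    # on the candidate set S_i, and 0 otherwise. With a uniform prior over the
    # set, this reduces to the plain 'EM Model' E-step of Eqn. (6).
    p_hat = np.zeros_like(p_model)
    for y in candidates:
        p_hat[y] = pi[y] * p_model[y]
    return p_hat / p_hat.sum()

# Example: 3 classes, candidate set {0, 1}, label 0 presumed correct.
pi = make_prior({0, 1}, 0, 3)
p_model = np.array([0.5, 0.3, 0.2])   # current model-based predictions
p_hat = em_prior_e_step(p_model, {0, 1}, pi)
```

In a full training loop this E-step would alternate with an M-step that refits p(y|x,θ) to the estimated p̂(y|x_i), as described in Section 4.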