{"title": "A Conditional Multinomial Mixture Model for Superset Label Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 548, "page_last": 556, "abstract": null, "full_text": "A Conditional Multinomial Mixture Model for\n\nSuperset Label Learning\n\nLi-Ping Liu\n\nliuli@eecs.oregonstate.edu\n\nEECS, Oregon State University\n\nCorvallis, OR 97331\n\nThomas G. Dietterich\n\nEECS, Oregon State University\n\nCorvallis, OR 97331\ntgd@cs.orst.edu\n\nAbstract\n\nIn the superset label learning problem (SLL), each training instance provides a set\nof candidate labels of which one is the true label of the instance. As in ordinary\nregression, the candidate label set is a noisy version of the true label.\nIn this\nwork, we solve the problem by maximizing the likelihood of the candidate label\nsets of training instances. We propose a probabilistic model, the Logistic Stick-\nBreaking Conditional Multinomial Model (LSB-CMM), to do the job. The LSB-\nCMM is derived from the logistic stick-breaking process. It \ufb01rst maps data points\nto mixture components and then assigns to each mixture component a label drawn\nfrom a component-speci\ufb01c multinomial distribution. The mixture components can\ncapture underlying structure in the data, which is very useful when the model is\nweakly supervised. This advantage comes at little cost, since the model introduces\nfew additional parameters. Experimental tests on several real-world problems with\nsuperset labels show results that are competitive or superior to the state of the art.\nThe discovered underlying structures also provide improved explanations of the\nclassi\ufb01cation predictions.\n\n1\n\nIntroduction\n\nIn supervised classi\ufb01cation, the goal is to learn a classi\ufb01er from a collection of training instances,\nwhere each instance has a unique class label. However, in many settings, it is dif\ufb01cult to obtain such\nprecisely-labeled data. 
Fortunately, it is often possible to obtain a set of labels for each instance, where the correct label is one of the elements of the set.\nFor example, captions on pictures (in newspapers, Facebook, etc.) typically identify all of the people in the picture but do not necessarily indicate which face belongs to each person. Imprecisely-labeled training examples can be created by detecting each face in the image and defining a label set containing all of the names mentioned in the caption. A similar case arises in bird song classification [2]. In this task, a field recording of multiple birds singing is divided into 10-second segments, and experts identify the species of all of the birds singing in each segment without localizing each species to a specific part of the spectrogram. These examples show that superset-labeled data are typically much cheaper to acquire than standard single-labeled data. If effective learning algorithms can be devised for superset-labeled data, then they will have wide application.\nThe superset label learning problem has been studied under two main formulations. In the multi-instance multi-label (MIML) formulation [15], the training data consist of pairs (B_i, Y_i), where B_i = {x_{i,1}, . . . , x_{i,n_i}} is a set of instances and Y_i is a set of labels. The assumption is that for every instance x_{i,j} ∈ B_i, its true label y_{i,j} ∈ Y_i. Jie et al. [9] and Briggs et al. [2] learn classifiers from such set-labeled bags.\nIn the superset label formulation (which has sometimes been confusingly called the “partial label” problem) [7, 10, 8, 12, 4, 5], each instance x_n has a candidate label set Y_n that contains the unknown\n\n1\n\n\ftrue label y_n. This formulation ignores any bag structure and views each instance independently. It is more general than the MIML formulation, since any MIML problem can be converted to a superset label problem (with loss of the bag information). 
Furthermore, the superset label formulation is natural in many applications that do not involve bags of instances. For example, in some applications, annotators may be unsure of the correct label, so permitting them to provide a superset of the correct label avoids the risk of mislabeling. In this paper, we employ the superset label formulation. Other relevant work includes that of Nguyen et al. [12] and Cour et al. [5], who extend SVMs to handle superset-labeled data.\nIn the superset label problem, the label set Y_n can be viewed as a corruption of the true label. The standard approach to learning with corrupted labels is to assume a generic noise process and incorporate it into the likelihood function. In standard supervised learning, it is common to assume that the observed label is sampled from a Bernoulli random variable whose most likely outcome is equal to the true label. In ordinary least-squares regression, the assumption is that the observed value is drawn from a Gaussian distribution whose mean is equal to the true value and whose variance is a constant σ². In the superset label problem, we will assume that the observed label set Y_n is drawn from a set-valued distribution p(Y_n|y_n) that depends only on the true label. When computing the likelihood, this will allow us to treat the true label as a latent variable that can be marginalized away.\nWhen the label information is imprecise, the learning algorithm has to depend more on underlying structure in the data. Indeed, many semi-supervised learning methods [16] model the cluster structure of the training data explicitly or implicitly. This suggests that the underlying structure of the data should also play an important role in the superset label problem.\nIn this paper, we propose the Logistic Stick-Breaking Conditional Multinomial Model (LSB-CMM) for the superset label learning problem. The model has two components: the mapping component and the coding component. 
Given an input x_n, the mapping component maps x_n to a region k. Then the coding component generates the label according to a multinomial distribution associated with k. The mapping component is implemented by the Logistic Stick-Breaking Process (LSBP) [13], whose Bernoulli probabilities come from discriminative functions. The mapping and coding components are optimized simultaneously with a variational EM algorithm.\nLSB-CMM addresses the superset label problem in several aspects. First, the mapping component models the cluster structure with a set of regions. The fact that instances in the same region often have the same label is important for inferring the true label from noisy candidate label sets. Second, the regions do not directly correspond to classes. Instead, the number of regions is automatically determined by the data, and it can be much larger than the number of classes. Third, the results of the LSB-CMM model can be more easily interpreted than the approaches based on SVMs [5, 2]. The regions provide information about how the data are organized in the classification problem.\n\n2 The Logistic Stick-Breaking Conditional Multinomial Model\n\nThe superset label learning problem seeks to train a classifier f : R^d → {1, · · · , L} on a given dataset (x, Y) = {(x_n, Y_n)}_{n=1}^N, where each instance x_n ∈ R^d has a candidate label set Y_n ⊂ {1, · · · , L}. The true labels y = {y_n}_{n=1}^N are not directly observed. The only information is that the true label y_n of instance x_n is in the candidate set Y_n. The extra labels {l | l ≠ y_n, l ∈ Y_n}, which cause the ambiguity, will be called the distractor labels. For any test instance (x_t, y_t) drawn from the same distribution as {(x_n, y_n)}_{n=1}^N, the trained classifier f should be able to map x_t to y_t with high probability. When |Y_n| = 1 for all n, the problem is a supervised classification problem. 
We require |Y_n| < L for all n; that is, every candidate label set must provide at least some information about the true label of the instance.\n\n2.1 The Model\n\nAs stated in the introduction, the candidate label set is a noisy version of the true label. To train a classifier, we first need a likelihood function p(Y_n|x_n). The key to our approach is to write this as p(Y_n|x_n) = Σ_{y_n=1}^L p(Y_n|y_n) p(y_n|x_n), where each term is the product of the underlying true classifier, p(y_n|x_n), and the noise model p(Y_n|y_n). We then make the following assumption about the noise distribution:\n\n2\n\n\fFigure 1: The LSB-CMM. Square nodes are discrete, circle nodes are continuous, and double-circle nodes are deterministic.\n\nAssumption: All labels in the candidate label set Y_n have the same probability of generating Y_n, but no label outside of Y_n can generate Y_n:\n\np(Y_n | y_n = l) = λ(Y_n) if l ∈ Y_n, and 0 if l ∉ Y_n. (1)\n\nThis assumption enforces three constraints. First, the set of labels Y_n is conditionally independent of the input x_n given y_n. Second, labels that do not appear in Y_n have probability 0 of generating Y_n. Third, all of the labels in Y_n have equal probability of generating Y_n (symmetry). Note that these constraints do not imply that the training data are correctly labeled. That is, suppose that the most likely label for a particular input x_n is y_n = l. Because p(y_n|x_n) is a multinomial distribution, a different label y_n = l′ might be assigned to x_n by the labeling process. Then this label is further corrupted by adding distractor labels to produce Y_n. Hence, it could be that l ∉ Y_n. In short, in this model, we have the usual “multinomial noise” in the labels, which is then further compounded by “superset noise”. 
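Under assumption (1), the likelihood of a candidate set reduces to the total classifier mass on its labels, because the λ(Y_n) factor is the same for every label in the set. A minimal sketch of this computation (the function name and the toy numbers are ours, for illustration only):

```python
import math

def candidate_set_log_likelihood(class_probs, candidate_set):
    # Under assumption (1), p(Y_n | x_n) = lambda(Y_n) * sum_{y in Y_n} p(y | x_n).
    # The lambda(Y_n) factor does not depend on the classifier, so it can be
    # dropped when optimizing, exactly as in objective (2).
    mass = sum(class_probs[l] for l in candidate_set)
    return math.log(mass)

# Toy 3-class posterior with candidate set {0, 2}: only the total mass on the
# candidate labels matters, not which of them is the true label.
probs = [0.6, 0.3, 0.1]
ll = candidate_set_log_likelihood(probs, {0, 2})  # log(0.7)
```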
The third constraint can be criticized for being simplistic; we believe it can be replaced with a learned noise model in future work.\nGiven (1), we can marginalize away y_n in the following optimization problem, which maximizes the likelihood of the observed candidate label sets:\n\nf* = arg max_f Σ_{n=1}^N log Σ_{y_n=1}^L p(y_n|x_n; f) p(Y_n|y_n)\n   = arg max_f Σ_{n=1}^N log Σ_{y_n ∈ Y_n} p(y_n|x_n; f) + Σ_{n=1}^N log(λ(Y_n)). (2)\n\nUnder the conditional independence and symmetry assumptions, the last term does not depend on f and so can be ignored in the optimization. This result is consistent with the formulation in [10].\nWe propose the Logistic Stick-Breaking Conditional Multinomial Model to instantiate f (see Figure 1). In LSB-CMM, we introduce a set of K regions (mixture components) {1, . . . , K}. LSB-CMM has two components. The mapping component maps each instance x_n to a region z_n, z_n ∈ {1, . . . , K}. Then the coding component draws a label y_n from the multinomial distribution indexed by z_n with parameter θ_{z_n}. We denote the region indexes of the training instances by z = (z_n)_{n=1}^N.\nIn the mapping component, we employ the Logistic Stick-Breaking Process (LSBP) [13] to model the instance-region relationship. LSBP is a modification of the Dirichlet Process (DP) [14]. In LSBP, the sequence of Bernoulli probabilities is the output of a sequence of logistic functions instead of being random draws from a Beta distribution as in the Dirichlet process. The input to the k-th logistic function is the dot product of x_n and a learned weight vector w_k ∈ R^{d+1}. (The added dimension corresponds to a zeroth feature fixed to be 1 to provide an intercept term.) To regularize these logistic functions, we posit that each w_k is drawn from a Gaussian distribution Normal(0, Σ), where Σ = diag(∞, σ², · · · , σ²). 
This regularizes all terms in w_k except the intercept. For each x_n, a sequence of probabilities {v_nk}_{k=1}^K is generated from the logistic functions, where v_nk = expit(w_k^T x_n) and expit(u) = 1/(1 + exp(−u)) is the logistic function. We truncate k at K by setting w_K = (+∞, 0, · · · , 0) and thus v_nK = 1. Let w denote the collection of all K w_k. Given the probabilities v_n1, . . . , v_nK computed from x_n, we choose the region z_n according to a stick-breaking procedure:\n\np(z_n = k) = φ_nk = v_nk ∏_{i=1}^{k−1} (1 − v_ni). (3)\n\nHere we stipulate that the product is 1 when k = 1. Let φ_n = (φ_n1, · · · , φ_nK) constitute the parameter of a multinomial distribution. Then z_n is drawn from this distribution.\nIn the coding component of LSB-CMM, we first draw K L-dimensional multinomial probabilities θ = {θ_k}_{k=1}^K from the prior Dirichlet distribution with parameter α. Then, for each instance x_n with mixture z_n, its label y_n is drawn from the multinomial distribution with θ_{z_n}. In the traditional multi-class problem, y_n is observed. However, in the SLL problem y_n is not observed and Y_n is generated from y_n.\nThe generative process of the whole model is summarized below:\n\nw_k ∼ Normal(0, Σ), 1 ≤ k ≤ K − 1, w_K = (+∞, 0, · · · , 0) (4)\nz_n ∼ Mult(φ_n), φ_nk = expit(w_k^T x_n) ∏_{i=1}^{k−1} (1 − expit(w_i^T x_n)) (5)\nθ_k ∼ Dirichlet(α) (6)\ny_n ∼ Mult(θ_{z_n}) (7)\nY_n ∼ Dist1(y_n) (Dist1 is some distribution satisfying (1)) (8)\n\nAs shown in (2), the model needs to maximize the likelihood that each y_n is in Y_n. 
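As a concrete illustration, the generative process above can be sampled directly. The sketch below is ours, not the authors' R implementation; the sizes, seed, and single-distractor corruption are illustrative assumptions, and we use a Dirichlet parameter of 0.5 (rather than a very sparse value like the paper's 0.05) just to keep the toy sample well-behaved:

```python
import numpy as np

rng = np.random.default_rng(0)

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def stick_breaking_probs(W, x):
    """phi_k = v_k * prod_{i<k}(1 - v_i), with the truncation v_K = 1."""
    v = np.append(expit(W @ x), 1.0)   # v_k = expit(w_k^T x); v_K = 1
    phi = np.empty_like(v)
    remaining = 1.0                    # length of stick left to break
    for k, vk in enumerate(v):
        phi[k] = vk * remaining
        remaining *= 1.0 - vk
    return phi

# Illustrative sizes: K regions, L labels, d features plus an intercept.
K, L, d, sigma2 = 4, 3, 2, 1.0
x = np.append(1.0, rng.normal(size=d))            # zeroth feature fixed to 1
# w_k ~ Normal(0, Sigma); the infinite-variance intercept prior amounts to
# leaving the intercept unregularized, so sampling it here is only illustrative.
W = rng.normal(0.0, np.sqrt(sigma2), size=(K - 1, d + 1))
phi = stick_breaking_probs(W, x)                  # mapping component
z = rng.choice(K, p=phi)                          # region of x
theta = rng.dirichlet(np.full(L, 0.5), size=K)    # coding component
y = int(rng.choice(L, p=theta[z]))                # true label
Y = {y} | (set(rng.choice(L, size=1).tolist()) - {y})  # y plus a possible distractor
```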
After incorporating the priors, we can write the penalized maximum-likelihood objective as\n\nmax LL = Σ_{n=1}^N log ( Σ_{y_n ∈ Y_n} p(y_n|x_n, w, α) ) + log(p(w|0, Σ)). (9)\n\nThis cannot be solved directly, so we apply variational EM [1].\n\n2.2 Variational EM\n\nThe hidden variables in the model are y, z, and θ. For these hidden variables, we introduce the variational distribution q(y, z, θ | φ̂, α̂), where φ̂ = {φ̂_n}_{n=1}^N and α̂ = {α̂_k}_{k=1}^K are the parameters. Then we factorize q as\n\nq(z, y, θ | φ̂, α̂) = ∏_{n=1}^N q(z_n, y_n | φ̂_n) ∏_{k=1}^K q(θ_k | α̂_k), (10)\n\nwhere φ̂_n is a K × L matrix and q(z_n, y_n | φ̂_n) is a multinomial distribution in which q(z_n = k, y_n = l) = φ̂_nkl. This distribution is constrained by the candidate label set: if a label l ∉ Y_n, then φ̂_nkl = 0 for any value of k. The distribution q(θ_k | α̂_k) is a Dirichlet distribution with parameter α̂_k.\nAfter we set the distribution q(z, y, θ), our variational EM follows standard methods. The detailed derivation can be found in the supplementary materials [11]. Here we only show the final updating steps with some analysis.\nIn the E step, the parameters of the variational distribution are updated as in (11) and (12):\n\nφ̂_nkl ∝ φ_nk exp( E_{q(θ_k|α̂_k)}[log(θ_kl)] ) if l ∈ Y_n, and φ̂_nkl = 0 if l ∉ Y_n, (11)\n\nα̂_kl = α + Σ_{n=1}^N φ̂_nkl. (12)\n\nThe update of φ̂_n in (11) indicates the key difference between the LSB-CMM model and traditional clustering models. 
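A vectorized sketch of these two E-step updates follows (our own illustration, with assumed array names; we use the closed-form Dirichlet expectation E[log θ_kl] = ψ(α̂_kl) − ψ(Σ_l α̂_kl), whereas the paper reports falling back to Monte Carlo for numerical reasons):

```python
import numpy as np
from scipy.special import digamma

def e_step(phi, alpha_hat, candidate_mask, alpha=0.05):
    """One E step of variational EM for LSB-CMM (illustrative sketch).

    phi:            (N, K) stick-breaking region probabilities phi_nk.
    alpha_hat:      (K, L) variational Dirichlet parameters.
    candidate_mask: (N, L) 0/1 matrix with 1 iff label l is in Y_n.
    """
    # E_q[log theta_kl] for Dirichlet(alpha_hat_k), in closed form.
    e_log_theta = digamma(alpha_hat) - digamma(alpha_hat.sum(axis=1, keepdims=True))
    # Update (11): phi_hat_nkl proportional to phi_nk * exp(E[log theta_kl]),
    # zeroed outside the candidate set, normalized over (k, l) per instance.
    phi_hat = phi[:, :, None] * np.exp(e_log_theta)[None, :, :]
    phi_hat *= candidate_mask[:, None, :]
    phi_hat /= phi_hat.sum(axis=(1, 2), keepdims=True)
    # Update (12): each instance votes for label l in region k with weight phi_hat_nkl.
    new_alpha_hat = alpha + phi_hat.sum(axis=0)
    return phi_hat, new_alpha_hat
```

Note how the candidate-set constraint enters only as a mask: labels outside Y_n simply receive zero variational mass.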
The formation of regions is directed by both instance similarities and class labels.\n\n4\n\n\fIf the instance x_n wants to join region k (i.e., Σ_l φ̂_nkl is large), then it must be similar to w_k as well as to instances in that region in order to make φ_nk large. Simultaneously, its candidate labels must fit the “label flavor” of region k, where the “label flavor” means that region k prefers labels having large values in α̂_k. The update of α̂ in (12) can be interpreted as having each instance x_n vote for the label l for region k with weight φ̂_nkl.\nIn the M step, we need to solve the maximization problem in (13) for each w_k, 1 ≤ k ≤ K − 1. Note that w_K is fixed. Each w_k can be optimized separately. The optimization problem is similar to the problem of logistic regression and is also a concave maximization problem, which can be solved by any gradient-based method, such as BFGS:\n\nmax_{w_k} − (1/2) w_k^T Σ^{−1} w_k + Σ_{n=1}^N [ φ̂_nk log(expit(w_k^T x_n)) + ψ̂_nk log(1 − expit(w_k^T x_n)) ], (13)\n\nwhere φ̂_nk = Σ_{l=1}^L φ̂_nkl and ψ̂_nk = Σ_{j=k+1}^K φ̂_nj. Intuitively, the variable φ̂_nk is the probability that instance x_n belongs to region k, and ψ̂_nk is the probability that x_n belongs to regions {k + 1, · · · , K}. Therefore, the optimal w_k discriminates instances in region k against instances in regions > k.\n\n2.3 Prediction\n\nFor a test instance x_t, we predict the label with the maximum posterior probability. The test instance can be mapped to a region with w, but the coding matrix θ is marginalized out in the EM. We use the variational distribution q(θ_k|α̂_k) as the prior of each θ_k and integrate out all θ_k-s. 
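The marginalized prediction just described — map x_t through the stick-breaking weights, then average the per-region label votes — can be sketched as follows (our own illustration; the function name, array shapes, and the tiny example are assumptions, not the authors' code):

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def predict_label(W, alpha_hat, x_t):
    """Posterior label probabilities for a test point, as in (14).

    W:         (K-1, d+1) logistic weights (w_K is the truncation).
    alpha_hat: (K, L) variational Dirichlet parameters from EM.
    x_t:       (d+1,) test point including the intercept feature.
    """
    v = np.append(expit(W @ x_t), 1.0)                 # v_tk, with v_tK = 1
    sticks = np.cumprod(np.append(1.0, 1.0 - v[:-1]))  # prod_{i<k}(1 - v_ti)
    phi_t = v * sticks                                 # region probabilities
    # Each region votes with its normalized Dirichlet parameters alpha_hat_k.
    label_probs = phi_t @ (alpha_hat / alpha_hat.sum(axis=1, keepdims=True))
    return int(np.argmax(label_probs)), label_probs

# Tiny example: K = 2 regions, L = 2 labels, uninformative weights so that
# the test point is split evenly between the two regions.
W = np.zeros((1, 2))                          # v_t1 = 0.5 for every x_t
alpha_hat = np.array([[3.0, 1.0], [1.0, 1.0]])
pred, probs = predict_label(W, alpha_hat, np.array([1.0, 2.0]))
```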
Given a test point x_t, the prediction is the label l that maximizes the probability p(y_t = l|x_t, w, α̂) calculated as in (14). The detailed derivation is also in the supplementary materials [11].\n\np(y_t = l | x_t, w, α̂) = Σ_{k=1}^K φ_tk ( α̂_kl / Σ_l α̂_kl ), (14)\n\nwhere φ_tk = expit(w_k^T x_t) ∏_{i=1}^{k−1} (1 − expit(w_i^T x_t)). The test instance goes to region k with probability φ_tk, and its label is decided by the votes (α̂_k) in that region.\n\n2.4 Complexity Analysis and Practical Issues\n\nIn the E step, for each region k, the algorithm iterates over all candidate labels of all instances, so the complexity is O(NKL). In the M step, the algorithm solves K − 1 separate optimization problems. Suppose each optimization problem takes O(VNd) time, where V is the number of BFGS iterations. Then the complexity is O(KVNd). Since V is usually larger than L, the overall complexity of one EM iteration is O(KVNd). Suppose the EM steps converge within m iterations, where m is usually less than 50. Then the overall complexity is O(mKVNd). The space complexity is O(NK), since we only store the N × K matrix with entries Σ_{l=1}^L φ̂_nkl and the matrix α̂.\nIn prediction, the mapping phase requires O(Kd) time to multiply w and the test instance. After the stick-breaking process, which takes O(K) calculations, the coding phase requires O(KL) calculations. Thus the overall time complexity is O(K max{d, L}). Hence, the prediction time is comparable to that of logistic regression.\nThere are several practical issues that affect the performance of the model. Initialization: From the model design, we can expect that instances in the same region have the same label. 
Therefore, it is reasonable to initialize α̂ to have each region prefer only one label; that is, each α̂_k has one element with a large value and all others with small values. We initialize φ to φ_nk = 1/K, so that all regions have equal probability of being chosen at the start. Initialization of these two variables is enough to begin the EM iterations. We find that such initialization works well for our model and is generally better than random initialization. Calculation of E_{q(θ_k|α̂_k)}[log(θ_kl)] in (11): Although it has a closed-form solution, we encountered numerical issues, so we calculate it via Monte Carlo sampling. This does not change the complexity analysis above, since the training is dominated by the M step. Priors: We found that using a non-informative prior for Dirichlet(α) worked best. From (12) and (14), we can see that when θ is marginalized, the distribution is non-informative when α is set to small values. We use α = 0.05 in our experiments.\n\n5\n\n\fFigure 2: Decision boundaries of LSB-CMM on a linearly-inseparable problem. Left: all data points have true labels. Right: labels of gray data points are corrupted.\n\n3 Experiments\n\nIn this section, we describe the results of several experiments we conducted to study the behavior of our proposed model. First, we experiment with a toy problem to show that our algorithm can solve problems with linearly-inseparable classes. Second, we perform controlled experiments on three synthetic datasets to study the robustness of LSB-CMM with respect to the degree of ambiguity of the label sets. Third, we experiment with three real-world datasets.\nLSB-CMM Model: The LSB-CMM model has three parameters: K, σ², and α. We find that the model is insensitive to K if it is sufficiently large. We set K = 10 for the toy problem and K = 5L for the other problems. α is set to 0.05 for all experiments. 
When the data is standardized, the regularization parameter σ² = 1 generally gives good results, so σ² is set to 1 in all superset label tasks.\nBaselines: We compared the LSB-CMM model with three state-of-the-art methods. Supervised SVM: the SVM is always trained with the true labels. Its performance can be viewed as an upper bound on the performance of any SLL algorithm. LIBSVM [3] with an RBF kernel was run to construct a multi-class classifier in one-vs-one mode. One third of the training data was used to tune the C parameter and the RBF kernel parameter γ. CLPL: CLPL [5] is a linear model that encourages large average scores of candidate labels. The model is insensitive to the C parameter, so we set the C value to 1000 (the default value in their code). SIM: SIM [2] minimizes the ranking loss of instances in a bag. In the controlled experiments and in one of the real-world problems, we could not compare SIM to LSB-CMM because of the lack of bag information. The λ parameter is set to 10^-8 based on the authors’ recommendation.\n\n3.1 A Toy Problem\n\nIn this experiment, we generate a linearly-inseparable SLL problem. The data has two dimensions and six clusters drawn from six normal distributions with means at the corners of a hexagon. We assign a label to each cluster so that the problem is linearly-inseparable (see Figure 2). In the first task, we give the model the true labels. In the second task, we add a distractor label for two thirds of all instances (the gray data points in the figure). The distractor label is randomly chosen from the two labels other than the true label. The decision boundaries found by LSB-CMM in both tasks are shown in Figure 2. We can see that LSB-CMM can successfully give nonlinear decision boundaries for this problem. After injecting distractor labels, LSB-CMM still recovers the boundaries between classes. 
There is a minor change of the boundary at the edge of each cluster, while the main part of each cluster is classified correctly.\n\n3.2 Controlled Experiments\n\nWe conducted controlled experiments on three UCI [6] datasets: segment (2310 instances, 7 classes), pendigits (10992 instances, 10 classes), and usps (9298 instances, 10 classes). Ten-fold cross validation is performed on all three datasets. For each training instance, we add distractor labels with controlled probability. As in [5], we use p, q, and ε to control the ambiguity level of candidate label sets. The roles and values of these three variables are as follows: p is the probability that an instance has distractor labels (p = 1 for all controlled experiments); q ∈ {1, 2, 3, 4} is the number of distractor labels; and ε ∈ {0.3, 0.7, 0.9, 0.95} is the maximum probability that a distractor label co-occurs with the true label [5], also called the ambiguity degree.\n\n6\n\n\fFigure 3: Three regions learned by the model on usps\n\nWe have two settings for these three variables. In the first setting, we hold q = 1 and vary ε; that is, for each label l, we choose a specific label l′ ≠ l as the (unique) distractor label with probability ε, or choose any other label with probability 1 − ε. 
In the extreme case when ε = 1, l′ and l always co-occur, and they cannot be distinguished by any classifier. In the second setting, we vary q and pick distractor labels randomly for each candidate label set.\nThe results are shown in Figure 4. Our LSB-CMM model significantly outperforms the CLPL approach. As the number of distractor labels increases, the performance of both methods goes down, but not by much. When the true label is combined with different distractor labels each time, disambiguation is easy; co-occurring distractor labels provide much less opportunity for disambiguation. This explains why a large ambiguity degree hurts the performance of both methods. The small dataset (segment) suffers even more from a large ambiguity degree, because there are fewer data points that can “break” the strong correlation between the true label and the distractors.\nTo explore why the LSB-CMM model has good performance, we investigated the regions learned by the model. Recall that φ_nk is the probability that x_n is sent to region k. In each region k, the representative instances have large values of φ_nk. We examined all φ_nk from the model trained on the usps dataset with 3 random distractor labels. For each region k, we selected the 9 most representative instances. Figure 3 shows the representative instances for three regions. These are all from class “2” but are written in different styles. This shows that the LSB-CMM model can discover the sub-classes in the data. In some applications, a whole class is not easy to discriminate from the other classes, but sometimes each sub-class can be easily identified. 
In such cases, LSB-CMM will be very useful and can improve performance.\nExplaining the results via regions can also give a better understanding of the learned classifier. In order to analyze the performance of a classifier learned from data with either superset labels or fully observed labels, one traditional method is to compute the confusion matrix. While the confusion matrix can only tell the relationships between classes, the mixture analysis can indicate precisely which subclasses of a class are confused with which subclasses of other classes. The regions can also help the user identify and define new classes as refinements of existing ones.\n\n3.3 Real-World Problems\n\nWe apply our model to three real-world problems. 1) BirdSong dataset [2]: This contains 548 10-second bird song recordings. Each recording contains 1-40 syllables. In total there are 4998 syllables. Each syllable is described by 38 features. The labels of each recording are the bird species that were singing during that 10-second period, and these species become the candidate label set of each syllable in the recording. 2) MSRCv2 dataset: This dataset contains 591 images with 23 classes. The ground-truth segmentations (regions with labels) are given. The labels of all segmentations in an image are treated as candidate labels for each segmentation. Each segmentation is described by 48-dimensional gradient and color histograms. 3) Lost dataset [5]: This dataset contains 1122 faces, and each face has the true label and a set of candidate labels. Each face is described by 108 PCA components. Since the bag information (i.e., which faces are in the same scene) is missing,\n\n7\n\n\fFigure 4: Classification performance on synthetic data (red: LSB-CMM; blue: CLPL). The dot-dash line is for different q values (number of distractor labels) as shown on the top x-axis. 
The dashed line is for different ε (ambiguity degree) values as shown on the bottom x-axis.\n\nTable 1: Classification Accuracies for Superset Label Problems\n\n          LSB-CMM        SIM            CLPL           SVM\nBirdSong  0.715(0.042)   0.589(0.035)   0.637(0.034)   0.790(0.027)\nMSRCv2    0.459(0.032)   0.454(0.043)   0.411(0.044)   0.673(0.043)\nLost      0.703(0.058)   -              0.710(0.045)   0.817(0.038)\n\nSIM is not compared to our model on this dataset. We run 10-fold cross validation on these three datasets. The BirdSong and MSRCv2 datasets are split by recordings/images, and the Lost dataset is split by faces.\nThe classification accuracies are shown in Table 1. The accuracies of the three superset label learning algorithms are compared using the paired t-test at the 95% confidence level. Values statistically indistinguishable from the best performance are shown in bold. Our LSB-CMM model outperforms the other two methods on the BirdSong dataset, and its performance is comparable to SIM on the MSRCv2 dataset and to CLPL on the Lost dataset. It should be noted that the input features are very coarse, which means that the cluster structure of the data is not well maintained. The relatively low performance of the SVM confirms this. If the instances were more precisely described by finer features, one would expect our model to perform better in those cases as well.\n\n4 Conclusions\n\nThis paper introduced the Logistic Stick-Breaking Conditional Multinomial Model to address the superset label learning problem. The mixture representation allows LSB-CMM to discover cluster structure that has predictive power for the superset labels in the training data. Hence, if two labels co-occur, LSB-CMM is not forced to choose one of them to assign to the training example but instead can create a region that maps to both of them. Nonetheless, each region does predict from a multinomial, so the model still ultimately seeks to predict a single label. 
Our experiments show that the performance of the model is either better than or comparable to that of state-of-the-art methods.\n\nAcknowledgment\n\nThis material is based upon work supported by the National Science Foundation under Grant No. 1125228. The code, as an R package, is available at: http://web.engr.oregonstate.edu/~liuli/files/LSB-CMM_1.0.tar.gz.\n\n8\n\n\fReferences\n\n[1] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.\n[2] F. Briggs & X. F. Fern & R. Raich. Rank-Loss Support Instance Machines for MIML Instance Annotation. In Proc. KDD, 2012.\n[3] C.-C. Chang & C.-J. Lin. LIBSVM: A Library for Support Vector Machines. ACM Trans. on Intelligent Systems and Technology, 2(3):1-27, 2011.\n[4] T. Cour & B. Sapp & C. Jordan & B. Taskar. Learning From Ambiguously Labeled Images. In Proc. CVPR, 2009.\n[5] T. Cour & B. Sapp & B. Taskar. Learning from Partial Labels. Journal of Machine Learning Research, 12:1225-1261, 2011.\n[6] A. Frank & A. Asuncion. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].\n[7] Y. Grandvalet. Logistic Regression for Partial Labels. In Proc. IPMU, 2002.\n[8] E. Hullermeier & J. Beringer. Learning from Ambiguously Labeled Examples. In Proc. IDA-05, 6th International Symposium on Intelligent Data Analysis, Madrid, 2005.\n[9] L. Jie & F. Orabona. Learning from Candidate Labeling Sets. In Proc. NIPS, 2010.\n[10] R. Jin & Z. Ghahramani. Learning with Multiple Labels. In Proc. NIPS, 2002.\n[11] L-P. Liu & T. Dietterich. 
A Conditional Multinomial Mixture Model for Superset Label Learning (Supplementary Materials), http://web.engr.oregonstate.edu/~liuli/pdf/lsb_cmm_supp.pdf.\n[12] N. Nguyen & R. Caruana. Classification with Partial Labels. In Proc. KDD, 2008.\n[13] L. Ren & L. Du & L. Carin & D. B. Dunson. Logistic Stick-Breaking Process. Journal of Machine Learning Research, 12:203-239, 2011.\n[14] Y. W. Teh. Dirichlet Processes. In Encyclopedia of Machine Learning. Springer, to appear.\n[15] Z.-H. Zhou & M.-L. Zhang. Multi-Instance Multi-Label Learning with Application to Scene Classification. In Advances in Neural Information Processing Systems, 19, 2007.\n[16] X. Zhu & A. B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1-130, 2009.\n\n9\n\n\f", "award": [], "sourceid": 4597, "authors": [{"given_name": "Liping", "family_name": "Liu", "institution": null}, {"given_name": "Thomas", "family_name": "Dietterich", "institution": null}]}