{"title": "Learning from Candidate Labeling Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 1504, "page_last": 1512, "abstract": "In many real world applications we do not have access to fully-labeled training data, but only to a list of possible labels. This is the case, e.g., when learning visual classifiers from images downloaded from the web, using just their text captions or tags as learning oracles. In general, these problems can be very difficult. However most of the time there exist different implicit sources of information, coming from the relations between instances and labels, which are usually dismissed. In this paper, we propose a semi-supervised framework to model this kind of problems. Each training sample is a bag containing multi-instances, associated with a set of candidate labeling vectors. Each labeling vector encodes the possible labels for the instances in the bag, with only one being fully correct. The use of the labeling vectors provides a principled way not to exclude any information. We propose a large margin discriminative formulation, and an efficient algorithm to solve it. Experiments conducted on artificial datasets and a real-world images and captions dataset show that our approach achieves performance comparable to SVM trained with the ground-truth labels, and outperforms other baselines.", "full_text": "Learning from Candidate Labeling Sets\n\nIdiap Research Institute and EPF Lausanne\n\nLuo Jie\n\njluo@idiap.ch\n\nFrancesco Orabona\n\nDSI, Universit`a degli Studi di Milano\n\norabona@dsi.unimi.it\n\nAbstract\n\nIn many real world applications we do not have access to fully-labeled training\ndata, but only to a list of possible labels. This is the case, e.g., when learning visual\nclassi\ufb01ers from images downloaded from the web, using just their text captions or\ntags as learning oracles. In general, these problems can be very dif\ufb01cult. 
However, most of the time there exist different implicit sources of information, coming from the relations between instances and labels, which are usually dismissed. In this paper, we propose a semi-supervised framework to model this kind of problem. Each training sample is a bag of multiple instances, associated with a set of candidate labeling vectors. Each labeling vector encodes the possible labels for the instances in the bag, with only one being fully correct. The use of the labeling vectors provides a principled way not to exclude any information. We propose a large-margin discriminative formulation, and an efficient algorithm to solve it. Experiments conducted on artificial datasets and a real-world dataset of images and captions show that our approach achieves performance comparable to an SVM trained with the ground-truth labels, and outperforms other baselines.

1 Introduction

In standard supervised learning, each training sample is associated with a label, and the classifier is usually trained through the minimization of the empirical risk on the training set. However, in many real-world problems we are not always so lucky. Partial data, noise, missing labels and other similar common issues can make us deviate from this ideal situation, moving the learning scenario from supervised learning to semi-supervised learning [7, 26].

In this paper, we investigate a special kind of semi-supervised learning which considers ambiguous labels. In particular, each training example is associated with several possible labels, among which only one is correct. Intuitively, this problem can be arbitrarily hard in the worst case. Consider the case when a noisy label consistently appears together with the true label: in this situation we could not tell them apart. Despite that, learning is still possible in many typical real-world scenarios. 
Moreover, in real problems samples are often gathered in groups, and the intrinsic nature of the problem can be used to constrain the possible labels for the samples from the same group. For example, we might know that two labels cannot appear together in the same group, or that a label can appear only once in each group, as is the case for a specific face in an image.

Inspired by these scenarios, we focus on the general case where we have bags of instances, with each bag associated with a set of several possible labeling vectors, among which only one is fully correct. Each labeling vector consists of one label for each corresponding instance in the bag. For easy reference, we call this type of learning problem a Candidate Labeling Set (CLS) problem.

As labeled data is usually expensive and hard to obtain, CLS problems naturally arise in many real-world tasks. For example, in the computer vision and information retrieval domains, photograph collections with tags have motivated studies on learning from weakly annotated images [2], as each image (bag) can be naturally partitioned into several patches (instances), and one could assume that each tag should be associated with at least one patch. High-level knowledge, such as spatial correlations (e.g., "sun in sky" and "car on street"), has been explored to prune down the labeling possibilities [14]. Another similar task is to learn a face recognition system from images gathered from news websites or videos, using the associated text captions and video scripts [3, 8, 16, 13]. These works use different approaches to integrate the constraints, such as the fact that two faces in one image cannot be associated with the same name [3], mouth motion and the gender of the person [8], or modeling both names and action verbs jointly [16]. Another problem is the multiple-annotators scenario, where each data point is associated with the labels given by independently hired annotators. 
The annotators can disagree on the data, and the aim is to recover the true label of each sample. All these problems can be naturally cast into the CLS framework.

The contribution of this paper is a new formal way to cast the CLS setup into a learning problem. We also propose a large-margin formulation and an efficient algorithm to solve it. The proposed Maximum Margin Set learning (MMS) algorithm can scale to datasets of the order of 10^5 instances, reaching performance comparable to fully-supervised learning algorithms.

Related works. This type of learning problem dates back to the work of Grandvalet [12]. Later, Jin and Ghahramani [17] formalized it and proposed a general framework for discriminative models. Our work is also closely related to the ambiguous labeling problem presented in [8, 15]. Our framework generalizes them to the case where instances and possible labels come in the form of bags. This particular generalization gives us a principled way to use different kinds of prior knowledge on the correlation between instances and labels, without hacking the learning algorithm. More specifically, prior knowledge, such as pairwise constraints [21] and the mutual exclusiveness of some labels, can be easily encoded in the labeling vectors. Although several works have focused on integrating this kind of weakly labeled information, complementary to the labeled or unlabeled training data, into existing algorithms, these approaches are usually computationally expensive. In our framework we have the opposite behavior: the more prior knowledge we exploit to construct the candidate set, the better the performance and the faster the algorithm will be.

Other lines of research related to this paper are multiple-instance learning (MIL) problems [1, 5, 10], and multi-instance multi-label learning (MIML) problems [24, 25], which extend the binary MIL setup to the multi-label scenario. 
In both setups, several instances are grouped into bags, and their labels are not individually given but assigned to the bags directly. However, contrary to our framework, in MIML noisy labeling is not allowed. In other words, all the labels assigned to the bags are assumed to be true. Moreover, current MIL and MIML algorithms usually rely on a 'key' instance in the bag [1] or transform each bag into a single-instance representation [25], while our algorithm makes an explicit effort to label every instance in a bag and to consider all of them during learning. Hence, it has a clear advantage in problems where the bags are dense in labeled instances and the instances in the same bag are independent, as opposed to the cases when several instances jointly represent a label. Our algorithm is also related to Latent Structural SVMs [22], where the correct labels could be considered as latent variables.

2 Learning from Candidate Labeling Sets

Preliminaries. In this section, we formalize the CLS setting, which is a generalization of the ambiguous labeling problem described in [17] from single instances to bags of instances.

In the following we denote vectors by bold letters, e.g., w, y, and use calligraphic font for sets, e.g., X. In the CLS setting, the N training data are provided in the form {X_i, Z_i}_{i=1}^N, where X_i is a bag of M_i instances, X_i = {x_{i,m}}_{m=1}^{M_i}, with x_{i,m} ∈ R^d, ∀ i = 1, . . . , N, m = 1, . . . , M_i. The associated set of L_i candidate labeling vectors is Z_i = {z_{i,l}}_{l=1}^{L_i}, where z_{i,l} ∈ Y^{M_i} and Y = {1, . . . , C}. In other words, there are L_i different combinations of M_i labels for the M_i instances in the i-th bag. We assume that the correct labeling vector for X_i is present in Z_i, while the other labeling vectors may be partially correct or even completely wrong. It is important to point out that this assumption is not equivalent to just associating L_i candidate labels to each instance. In fact, in this way we also explicitly encode the correlations between the instances and their labels in a bag. For example, consider a two-instance bag {x_{i,1}, x_{i,2}}: if it is known that they can only come from classes 1 and 2, and that they cannot share the same label, then z_{i,1} = [1, 2], z_{i,2} = [2, 1] will be the candidate labeling vectors for this bag, while the other possibilities are excluded from the labeling set. In the following we will assume that the labeling set Z_i is given with the training set. In Section 4.2 we will give a practical example of how to construct this set using prior knowledge on the task.

Given the training data {X_i, Z_i}_{i=1}^N, we want to learn a function f(x) to correctly predict the class of each single instance x coming from the same distribution. The problem would become standard multiclass supervised learning if there were only one labeling vector in every labeling set Z_i, i.e., L_i = 1. On the other hand, given a set of C labels and no prior knowledge, a bag of M_i instances could have a maximum of C^{M_i} labeling vectors, which becomes a clustering problem. However, we are more interested in situations when L_i ≪ C^{M_i}.

2.1 Large-margin formulation

We introduce here a large-margin formulation to solve the CLS problem. It is helpful to first denote by X the generic bag of M instances {x_1, . . . , x_M}, by Z = {z_1, . . . , z_L} the generic set of candidate labeling vectors, and by y = {y_1, . . . , y_M}, z = {z_1, . . . , z_M} ∈ Y^M two labeling vectors. We start by introducing the loss function that assumes the true label y_m of each instance x_m is known:

ℓ_Δ(z, y) = Σ_{m=1}^M Δ(z_m, y_m) ,   (1)

where Δ(z_m, y_m) is a non-negative loss function measuring how much we pay for having predicted z_m instead of y_m. For example, Δ(z_m, y_m) can be defined as 1(z_m ≠ y_m), where 1 is the indicator function. 
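The two-instance example above can be sketched in code. This is a minimal illustration, not the paper's implementation; `candidate_vectors` and `bag_loss` are hypothetical names, and Δ is taken to be the 0/1 indicator of eq. (1):

```python
from itertools import product

def candidate_vectors(n_instances, classes, constraints):
    """Enumerate all labeling vectors for a bag, keeping those that satisfy every constraint."""
    return [z for z in product(classes, repeat=n_instances)
            if all(ok(z) for ok in constraints)]

def bag_loss(z, y):
    """Eq. (1) with the 0/1 indicator: number of instances where z disagrees with y."""
    return sum(zm != ym for zm, ym in zip(z, y))

# Two-instance bag, classes {1, 2}, instances known not to share a label.
Z = candidate_vectors(2, [1, 2], [lambda z: z[0] != z[1]])
```

With this constraint only z = (1, 2) and z = (2, 1) survive, matching the example in the text; the loss between these two vectors is 2, since they disagree on both instances.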
Hence, if the vector z is the predicted label for the bag, ℓ_Δ(z, y) simply counts the number of misclassified instances in the bag.

However, the true labels are unknown, and we only have access to the set Z, knowing that the true labeling vector is in Z. So we use a proxy of this loss function, and propose the ambiguous version of this loss:

ℓ_Δ^A(z, Z) = min_{z′∈Z} ℓ_Δ(z, z′) .

We also define, with a small abuse of notation, ℓ_Δ^A(X, Z; f) = ℓ_Δ^A(f(X), Z), where f(X) returns a labeling vector consisting of one label for each instance in the bag X. It is obvious that this loss underestimates the true loss. Nevertheless, we can easily extend [8, Propositions 3.1 to 3.3] to the bag case, and prove that ℓ_Δ^A/(1 − η) is an upper bound to ℓ_Δ in expectation, where η is a factor between 0 and 1 whose value depends on the hardness of the problem. Like the definition in [8], η corresponds to the maximum probability of an extra label co-occurring with the true label, over all labels and instances. Hence, by minimizing the ambiguous loss we are actually minimizing an upper bound of the true loss. It is a known problem that direct minimization of this loss is hard, so in the following we introduce another loss that upper bounds ℓ_Δ^A and can be minimized efficiently.

We assume that the prediction function f(x) we are searching for is equal to arg max_{y∈Y} F(x, y). In this framework we can interpret the value of F(x, y) as the confidence of the classifier in assigning x to the class y. We also assume the standard linear model used in supervised multiclass learning [9]. In particular, the function F(x, y) is set to be w · φ(x) ⊗ ψ(y), where φ and ψ are the feature and label space mappings [20], and ⊗ is the Kronecker product¹. We can now define F(X, y; w) = Σ_{m=1}^M F(x_m, y_m), which intuitively gathers from each instance in X the confidence on the labels in y. With the definitions above, we can rewrite the function F as

F(X, y; w) = Σ_{m=1}^M F(x_m, y_m) = Σ_{m=1}^M w · φ(x_m) ⊗ ψ(y_m) = w · Φ(X, y) ,   (2)

where we defined Φ(X, y) = Σ_{m=1}^M φ(x_m) ⊗ ψ(y_m). Hence the function F can be defined as the scalar product between w and a joint feature map between the bag X and the labeling vector y.

Remark. If the prior probabilities of the candidate labeling vectors z_l ∈ Z are also available, they could be incorporated by slightly modifying the feature mapping scheme in (2).

We can now introduce the following loss function

ℓ_max(X, Z; w) = | max_{z̄∉Z} ( ℓ_Δ^A(z̄, Z) + F(X, z̄; w) ) − max_{z∈Z} F(X, z; w) |_+ ,   (3)

where |x|_+ = max(0, x). The following proposition shows that ℓ_max upper bounds ℓ_Δ^A.

¹For simplicity we omit the bias term here; it can be easily added by modifying the feature mapping.

Proposition. ℓ_max(X, Z; w) ≥ ℓ_Δ^A(X, Z; w) .

Proof. Define ẑ = arg max_{z∈Y^M} F(X, z; w). If ẑ ∈ Z, then ℓ_Δ^A(X, Z; w) = 0 ≤ ℓ_max(X, Z; w). We now consider the case in which ẑ ∉ Z. We have that

ℓ_Δ^A(X, Z; w) ≤ ℓ_Δ^A(ẑ, Z) + F(X, ẑ; w) − max_{z∈Z} F(X, z; w)
≤ max_{z̄∉Z} ( ℓ_Δ^A(z̄, Z) + F(X, z̄; w) ) − max_{z∈Z} F(X, z; w) ≤ ℓ_max(X, Z; w) . □

The loss ℓ_max is non-convex, due to the second max(·) function inside, but in Section 3 we will introduce an algorithm to minimize it efficiently.

2.2 A probabilistic interpretation

It is possible to gain additional intuition on the proposed loss function ℓ_max through a probabilistic interpretation of the problem. It is helpful to look first at the discriminative model for supervised learning, where the goal is to learn the model parameters θ for the function P(y|x; θ), from a pre-defined modeling class Θ. Instead of directly maximizing the log-likelihood of the training data, an alternative is to maximize the log-likelihood ratio between the correct label and the most likely incorrect one [9]. On the other hand, in the CLS setting the correct labeling vector for X is unknown, but it is known to be a member of the candidate set Z. Hence we could maximize the log-likelihood ratio between P(Z|X; θ) and the most likely incorrect labeling vector which is not a member of Z (denoted as z̄). However, the correlations between different vectors in Z are not known, so the inference could be arbitrarily hard. Instead, we could approximate the problem by considering just the most likely correct member of Z. It can be easily verified that max_{z∈Z} P(z|X; θ) is a lower bound of P(Z|X; θ). 
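The inequality max_{z∈Z} P(z|X; θ) ≤ P(Z|X; θ) can be checked numerically, since under instance independence P(Z|X; θ) is the sum of the probabilities of the (disjoint) candidate vectors. Below is a small illustrative sketch, not the paper's model: the per-instance softmax posteriors and all scores are made up.

```python
import math

def softmax(scores):
    """Per-instance class posteriors from raw scores (illustrative model)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def vector_prob(z, posteriors):
    """P(z | X; theta) under instance independence: product of per-instance posteriors."""
    p = 1.0
    for m, label in enumerate(z):
        p *= posteriors[m][label]
    return p

# Toy bag: two instances, three classes, arbitrary scores.
posteriors = [softmax([0.5, 1.0, -0.2]), softmax([2.0, 0.1, 0.3])]
Z = [(0, 1), (1, 0)]                                  # candidate labeling vectors
p_max = max(vector_prob(z, posteriors) for z in Z)    # the lower bound used in the text
p_set = sum(vector_prob(z, posteriors) for z in Z)    # P(Z | X): sum over disjoint events
```

Since each term is non-negative, the maximum over Z can never exceed the sum over Z, which is exactly the lower-bound claim above.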
The learning problem becomes to minimize the ratio for the bag:

− log [ P(Z|X; θ) / max_{z̄∉Z} P(z̄|X; θ) ] ≈ − log [ max_{z∈Z} P(z|X; θ) / max_{z̄∉Z} P(z̄|X; θ) ] .   (4)

If we assume independence between the instances in the bag, (4) can be factorized as:

− log [ max_{z∈Z} Π_m P(z_m|x_m; θ) / max_{z̄∉Z} Π_m P(z̄_m|x_m; θ) ] = max_{z̄∉Z} Σ_m log P(z̄_m|x_m; θ) − max_{z∈Z} Σ_m log P(z_m|x_m; θ) .

If we take the margin into account, and assume a linear model for the log-posterior-likelihood, we obtain the loss function in (3).

3 MMS: The Maximum Margin Set Learning Algorithm

Using the squared norm regularizer as in the SVM and the loss function in (3), we obtain the following optimization problem for the CLS learning problem:

min_w (λ/2) ‖w‖²₂ + (1/N) Σ_{i=1}^N ℓ_max(X_i, Z_i; w)   (5)

The optimization problem (5) is non-convex due to the non-convex loss function (3). To convexify this problem, one could approximate the second max(·) in (3) with the average over all the labeling vectors in Z_i. Similar strategies have been used in several analogous problems [8, 24]. However, the approximation could be very loose if the number of labeling vectors is large. Fortunately, although the loss function is not convex, it can be decomposed into a convex and a concave part. Thus the problem can be solved using the constrained concave-convex procedure (CCCP) [19, 23].

3.1 Optimization using the CCCP algorithm

The CCCP solves the optimization problem through an iterative minimization process. At each round r, given an initial w^{(r)}, the CCCP replaces the concave part of the objective function with its first-order Taylor expansion at w^{(r)}, and then sets w^{(r+1)} to the solution of the relaxed optimization problem. When this function is non-smooth, as max_{z∈Z_i} F(X_i, z; w) is in our formulation, the gradient in the Taylor expansion must be replaced by the subgradient². Thus, at the r-th round, the CCCP replaces max_{z∈Z_i} F(X_i, z; w) in the loss function by

max_{z∈Z_i} F(X_i, z; w^{(r)}) + (w − w^{(r)}) · ∂( max_{z∈Z_i} F(X_i, z; w) ) .   (6)

²Given a function g, its subgradient ∂g(x) at x satisfies: ∀u, g(u) − g(x) ≥ ∂g(x) · (u − x). The set of all subgradients of g at x is called the subdifferential of g at x.

The subgradient of a point-wise maximum function g(x) = max_i g_i(x) is the convex hull of the union of the subdifferentials of the subset of the functions g_i(x) which equal g(x) [4]. Defining C_i^{(r)} = {z ∈ Z_i : F(X_i, z; w^{(r)}) = max_{z′∈Z_i} F(X_i, z′; w^{(r)})}, the subgradient of the function max_{z∈Z_i} F(X_i, z; w) equals Σ_l α_{i,l}^{(r)} ∂F(X_i, z_{i,l}; w) = Σ_l α_{i,l}^{(r)} Φ(X_i, z_{i,l}), with Σ_l α_{i,l}^{(r)} = 1, α_{i,l}^{(r)} ≥ 0 if z_{i,l} ∈ C_i^{(r)}, and α_{i,l}^{(r)} = 0 otherwise. Hence we have

Σ_l α_{i,l}^{(r)} w^{(r)} · Φ(X_i, z_{i,l}) = max_{z∈Z_i} ( w^{(r)} · Φ(X_i, z) ) Σ_{l: z_{i,l}∈C_i^{(r)}} α_{i,l}^{(r)} = max_{z∈Z_i} ( w^{(r)} · Φ(X_i, z) ) .

We are free to choose the values of the α_{i,l}^{(r)} in the convex hull; here we choose to set α_{i,l}^{(r)} = 1/|C_i^{(r)}| for all z_{i,l} ∈ C_i^{(r)}. Using (6), the new loss function becomes

ℓ_cccp^{(r)}(X_i, Z_i; w) = | max_{z̄∉Z_i} ( ℓ_Δ^A(z̄, Z_i) + w · Φ(X_i, z̄) ) − w · (1/|C_i^{(r)}|) Σ_{z∈C_i^{(r)}} Φ(X_i, z) |_+ .   (7)

Replacing the non-convex loss ℓ_max in (5) with (7), the relaxed convex optimization program at the r-th round of the CCCP is

min_w (λ/2) ‖w‖²₂ + (1/N) Σ_{i=1}^N ℓ_cccp^{(r)}(X_i, Z_i; w)   (8)

With our choice of α_{i,l}^{(r)}, in the first round of the CCCP, when w is initialized at 0, the second max(·) in (3) is approximated by the average over all the labeling vectors. The CCCP algorithm is guaranteed to decrease the objective function and it converges to a local minimum of (5) [23].

3.2 Solving the convex optimization problem using the Pegasos framework

In order to solve the relaxed convex optimization problem (8) efficiently at each round of the CCCP, we have designed a stochastic subgradient descent algorithm using the Pegasos framework developed in [18]. At each step the algorithm takes K random samples from the training set and calculates an estimate of the subgradient of the objective function using these samples. It then performs a subgradient descent step with decreasing learning rate, followed by a projection of the solution onto the space where the optimal solution lives. An upper bound on the radius of the ball in which the optimal hyperplane lives can be calculated by considering that

(λ/2) ‖w*‖²₂ ≤ min_w (λ/2) ‖w‖²₂ + (1/N) Σ_{i=1}^N ℓ_cccp^{(r)}(X_i, Z_i; w) ≤ B ,

where w* is the optimal solution of (8), and B = max_i ℓ_cccp^{(r)}(X_i, Z_i; 0). 
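The bound above, ‖w*‖ ≤ √(2B/λ), is what justifies the projection step of the Pegasos-style solver. A minimal sketch of that projection follows; the helper name is ours, not from the paper's code:

```python
import math

def project_to_ball(w, B, lam):
    """Project w onto the ball of radius sqrt(2B/lam), where the optimum of (8) must lie."""
    radius = math.sqrt(2.0 * B / lam)
    norm = math.sqrt(sum(wi * wi for wi in w))
    scale = min(1.0, radius / norm) if norm > 0 else 1.0
    return [scale * wi for wi in w]

# With B = 2 (e.g., two instances per bag under the 0/1 loss) and lam = 0.5,
# the radius is sqrt(8) ~ 2.83.
w_inside = project_to_ball([1.0, 1.0], B=2.0, lam=0.5)    # already inside: unchanged
w_clipped = project_to_ball([10.0, 0.0], B=2.0, lam=0.5)  # outside: rescaled to the radius
```

Vectors already inside the ball are left untouched; vectors outside are rescaled onto its surface, which keeps every iterate in the region where the optimum is known to lie.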
If we use Δ(z_m, y_m) = 1(z_m ≠ y_m) in (7), B equals the maximum number of instances in the bags. The details of the Pegasos algorithm for solving (8) are given in Algorithm 2. Using the theorems in [18], it is easy to show that after Õ(1/(λε)) iterations Algorithm 2 converges in expectation to a solution of accuracy ε.

Efficient implementation. Note that even if we solve the problem in the primal, we can still use nonlinear kernels without computing the nonlinear mapping φ(x) explicitly. Since the implementation method is similar to the one described in [18, Section 4], for lack of space we omit the details. Greedily searching for the most violating labeling vector ẑ_k in line 4 of Algorithm 2 can be computationally expensive. Dynamic programming can be used to reduce the computational cost, since the contribution of each instance is additive over the different labels. Moreover, by looking into the structure of Z_i, the computational time can be further reduced. In the general situation, the worst-case complexity of searching for the maximum over z̄ ∉ Z_i is O(Π_{m=1}^{M_i} C_{i,m}), where C_{i,m} is the number of unique possible labels for x_{i,m} in Z_i (usually C_{i,m} ≪ L_i). This complexity can be greatly reduced when there are special structures, such as graphs and trees, in the labeling set. See, for example, [20, Section 4] for a discussion of some specific problems and special cases.

Algorithm 1 The CCCP algorithm for solving MMS
1: initialize: w^{(1)} = 0
2: repeat
3:   Set C_i^{(r)} = {z ∈ Z_i : F(X_i, z; w^{(r)}) = max_{z′∈Z_i} F(X_i, z′; w^{(r)})}
4:   Set w^{(r+1)} as the solution of the convex optimization problem (8)
5: until convergence to a local minimum
6: output: w^{(r+1)}

Algorithm 2 Pegasos algorithm for solving relaxed MMS (8)
1: Input: w_0, {X_i, Z_i, C_i^{(r)}}_{i=1}^N, λ, T, K, B
2: for t = 1, 2, . . . , T do
3:   Draw at random A_t ⊆ {1, . . . , N}, with |A_t| = K
4:   Compute ẑ_k = arg max_{z̄∉Z_k} ( ℓ_Δ^A(z̄, Z_k) + w_t · Φ(X_k, z̄) )   ∀ k ∈ A_t
5:   Set A_t^+ = {k ∈ A_t : ℓ_cccp^{(r)}(X_k, Z_k; w_t) > 0}
6:   Set w_{t+1/2} = (1 − 1/t) w_t + 1/(λKt) Σ_{k∈A_t^+} ( Σ_{z∈C_k^{(r)}} Φ(X_k, z)/|C_k^{(r)}| − Φ(X_k, ẑ_k) )
7:   Set w_{t+1} = min( 1, √(2B/λ) / ‖w_{t+1/2}‖ ) w_{t+1/2}
8: end for
9: Output: w_{T+1}

4 Experiments

In order to evaluate the proposed algorithm, we first perform experiments on several artificial datasets created from standard machine learning databases. Finally, we test our algorithm on one of the examples motivating our study: learning a face recognition system from news images weakly annotated by their associated captions. We benchmark MMS against the following baselines:

• SVM: we train a fully-supervised SVM classifier using the ground-truth labels, considering every instance separately while ignoring the other candidate labels. Its performance can be considered as an upper bound on the performance achievable using candidate labels. In all our experiments, we use the LIBLINEAR [11] package and test two different multiclass extensions, the 1-vs-All method using the L1-loss (1vA-SVM) and the method by Crammer and Singer [9] (MC-SVM).

• CL-SVM: the Candidate Labeling SVM (CL-SVM) is a naive approach which transforms the ambiguously labeled data into a standard supervised representation by treating all possible labels of each instance as true labels. It then learns 1-vs-All SVM classifiers from the resulting dataset, where the negative examples are the instances which do not have the corresponding label in their candidate labeling set. A similar baseline has been used in the binary MIL literature [5].

• MIML: we also compare with two SVM-based MIML algorithms³: MIMLSVM [25] and M3MIML [24]. 
We train the MIML algorithms by treating the labels in Z_i as labels for the bag. During the test phase, we consider each instance separately and predict its label as y = arg max_{y∈Y} F_miml(x, y), where F_miml is the obtained classifier, and F_miml(x, y) can be interpreted as the confidence of the classifier in assigning the instance x to the class y. We would like to underline that, although some of the experimental setups may favor our algorithm, we include the comparison between MMS and the MIML algorithms because, to the best of our knowledge, MIML is the only existing principled framework for modeling instance bags with multiple labels. MIML algorithms may still have their own advantages in scenarios where no prior knowledge is available about the instances within a bag.

³We used the original implementations at http://lamda.nju.edu.cn/data.ashx#code. We did not compare against MIMLBOOST [25], because it does not scale to all the experiments we conducted. Besides, MIMLSVM [25] does not scale to data with high-dimensional feature vectors (e.g., news20, which has 62,061-dimensional features). Running the MATLAB implementation of M3MIML [24] on problems with more than a few thousand samples is computationally infeasible. Thus, we only report results using these two baseline methods on small-size problems, where they can be finished in a reasonable amount of time.

[Figure 1: (Best seen in colors) Classification performance of the different algorithms on the artificial datasets. Four panels, one per dataset: usps (B=5, N=1,459), letter (B=8, N=1,875), news20 (B=5, N=3,187) and covtype (B=4, N=43,575); each panel plots the classification rate against the number of candidate labeling vectors L.]

We implemented our MMS algorithm in MATLAB⁴, and used the value 1/N for the regularization parameter λ in all our experiments. In (1) we used Δ(z_m, y_m) = 1(z_m ≠ y_m). For a fair comparison, we used a linear kernel for all the methods. The cost parameter of the SVM algorithms is selected from the range C ∈ {0.1, 1, 10, 100, 1000}. The bias term is used in all the algorithms.

4.1 Experiments on artificial data

We create several artificial datasets using four widely used multi-class datasets (usps, letter, news20 and covtype) from the LIBSVM [6] website. The artificial training sets are created as follows: we first set at random pairs of classes as "correlated classes" and as "ambiguous classes", where the ambiguous classes can be different from the correlated classes. Following that, instances are grouped randomly into bags of fixed size B, with probability at least Pc that two instances from correlated classes will appear in the same bag. Then L ambiguous labeling vectors are created for each bag, by modifying a few elements of the correct labeling vector. The number of modified elements is randomly chosen from {1, . . .
, B}, and the new labels are chosen among a predefined ambiguous set. The ambiguous set is composed of the other correct labels from the same bag (except the true one) and a subset of the ambiguous pairs of all the correct labels from the bag. The probability that the ambiguous pair of a label is present equals Pa. For testing, we use the original test set, and each instance is considered separately.

Varying Pc, Pa, and L, we generate different dataset difficulty levels to evaluate the behaviour of the algorithms. For example, when Pa > 0, noisy labels are likely to be present in the labeling set. Meanwhile, Pc controls the ambiguity within the same bags. If Pc is large, instances from two correlated classes are likely to be grouped into the same bag, and thus it becomes more difficult to distinguish between these two classes. The parameters Pc and Pa are chosen from {0, 0.25, 0.5}. For each difficulty level, we run three different training/test splits.

In Figure 1, we plot the average classification accuracy. Several observations can be made. First, MMS achieves results close to the supervised SVM methods, and better than all the other baselines. As MMS uses a multi-class loss similar to MC-SVM, it even outperforms 1vA-SVM when this loss has an advantage (e.g., on the 'letter' dataset). For the 'covtype' dataset, the performance gap between MMS and SVM is more visible. This may be because 'covtype' has a class imbalance, where the two largest classes (among seven) dominate the whole dataset (more than 85% of the total number of samples). Second, the change in performance of MMS is small when the size of the candidate labeling set grows. Moreover, when correlated instances and extra noisy labels are present in the dataset, the baseline methods' performance drops by several percentage points, while MMS is less affected. 
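The corruption procedure used to build the artificial candidate sets can be sketched as follows. This is a loose reimplementation for illustration only: the function and parameter names are invented, and the correlated-class grouping and the Pa/Pc logic are omitted.

```python
import random

def make_candidate_set(y_true, n_candidates, ambiguous_labels, rng):
    """Return a candidate set containing the true labeling vector plus corrupted
    copies, obtained by relabeling a random number of positions (as in Sec. 4.1)."""
    Z = {tuple(y_true)}
    while len(Z) < n_candidates:
        z = list(y_true)
        # pick a random number of positions to modify, then relabel them
        positions = rng.sample(range(len(y_true)), rng.randint(1, len(y_true)))
        for m in positions:
            z[m] = rng.choice(ambiguous_labels)
        Z.add(tuple(z))
    return sorted(Z)

rng = random.Random(0)
Z = make_candidate_set([0, 1, 2, 3], n_candidates=5, ambiguous_labels=[0, 1, 2, 3, 4], rng=rng)
```

By construction the true vector is always a member of the returned set, which is the key assumption of the CLS setting.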
The CCCP algorithm usually converges in 3–5 rounds, and the final performance is about 5%–40% higher than the results obtained after the first round, especially when L is large. This behavior also confirms that approximating the second max(·) function in the loss function (3) with the average over all the possible labeling vectors can lead to poor performance.

4.2 Applications to learning from images & captions

A huge amount of images with accompanying text captions is available on the web. This cheap source of information has been used, e.g., to name faces in images using their captions [3, 13]. Thanks to recent developments in the computer vision and natural language processing fields, faces in the images can be detected by a face detector, and names in the captions can be identified using a language parser. The gathered data can then be used to train visual classifiers without human effort in labeling the data. This task is difficult due to the so-called “correspondence ambiguity” problem: there can be more than one face and name appearing in an image-caption pair, not all the names in the caption appear in the image, and vice versa. Nevertheless, this problem can be naturally formulated as a CLS problem. Since the names of the key persons in the image typically appear in the captions, combined with other common assumptions [3, 13], we can easily generate the candidate labeling sets (see Figure 2 for a practical example).

4Code available at http://dogma.sourceforge.net/

[Figure 2 about here. Example caption: “President Barack Obama and first lady Michelle Obama wave from the steps of Air Force One as they arrive in Prague, Czech Republic.” The right panel shows the candidate labeling set Z = {z1, . . . , z6}, each vector assigning facea and faceb to na, nb, or ◦.]

Figure 2: (Left): An example image and its associated caption. There are two detected faces, facea and faceb, and two names, Barack Obama (na) and Michelle Obama (nb), in the caption. (Right): The candidate labeling set for this image-caption pair. The labeling vectors are generated using the following constraints: i) a face in the image can either be assigned a name from its caption, or it possibly corresponds to none of them (a NULL class, denoted as ◦); ii) a face can be assigned at most one name; iii) a name can be assigned to at most one face. Differently from previous methods, we do not allow the labeling vector with all the faces assigned to the NULL class, because it would lead to the trivial solution with 0 loss obtained by classifying every instance as NULL.

We conducted experiments on the Labeled Yahoo! News dataset5 [3, 13]. The dataset is fully annotated for the association of faces in the images with names in the captions; precomputed facial features are also available with the dataset. After preprocessing, the dataset contains 20071 images and 31147 faces. There are more than 10000 different names in the captions. We retain the 214 most frequent ones, which occur at least 20 times, and treat the other names as NULL. The experiments are performed over 5 different permutations, sampling 80% of the images and captions as training set and using the rest for testing. During splitting we also maintain the ratio between the number of samples from each class in the training and test sets. For all algorithms, NULL names are considered as an additional class, except for the MIML algorithms, where unknown faces can be automatically considered as negative instances.

Table 1: Overall face recognition accuracy

Dataset   1vA-SVM       MC-SVM        CL-SVM        MIMLSVM       MMS
Yahoo!    81.6% ± 0.6   87.2% ± 0.3   76.9% ± 0.2   74.7% ± 0.9   85.7% ± 0.5
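The constraint-based generation of candidate labeling vectors illustrated in Figure 2 can be sketched as a simple enumeration. This is our own illustrative code, not the authors' implementation; the names `labeling_vectors` and `NULL` are assumptions:

```python
from itertools import product

NULL = None  # the NULL class, denoted as a circle in Figure 2

def labeling_vectors(num_faces, names):
    """Enumerate candidate labeling vectors for one image-caption pair:
    each face is assigned a caption name or NULL, each name is used by
    at most one face, and the all-NULL vector is excluded because it
    would give a trivial 0-loss solution."""
    vectors = []
    for assign in product(list(names) + [NULL], repeat=num_faces):
        used = [n for n in assign if n is not NULL]
        if len(used) != len(set(used)):  # same name assigned to two faces
            continue
        if not used:                     # every face assigned to NULL
            continue
        vectors.append(assign)
    return vectors
```

For the example of Figure 2 (two faces and the two names na and nb), this enumeration yields exactly the six vectors z1, . . . , z6.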
The performance of the algorithms is measured by how many faces in the test set are correctly labeled with their name. Table 1 summarizes the results. Similar observations can be made here: MMS achieves performance comparable to the fully-supervised SVM algorithms (4.1% higher than 1vA-SVM on the Yahoo! data), while outperforming the other baselines for ambiguously labeled data.

5 Conclusion
In this paper, we introduce the “Candidate Labeling Set” problem, where training samples contain multiple instances and a set of possible labeling vectors. We also propose a large margin formulation of the learning problem and an efficient algorithm for solving it. Although there are other similar frameworks, such as MIML, which also investigate learning from instance bags with multiple labels, our framework is different, since it makes an explicit effort to label and to consider each instance in the bag during the learning process, and it allows noisy labels in the training data. In particular, our framework provides a principled way to encode prior knowledge about the relationships between instances and labels, and these constraints are explicitly taken into account in the loss function by the algorithm. The use of this framework does not have to be limited to data which is naturally grouped into multi-instance bags. It would also be possible to group separate instances into bags and solve the learning problem using MMS when there are labeling constraints between these instances (e.g., a clustering problem with linkage constraints).

Acknowledgments We thank the anonymous reviewers for their helpful comments. The Labeled Yahoo! News dataset was kindly provided by Matthieu Guillaumin and Jakob Verbeek. LJ was sponsored by the EU project DIRAC IST-027787 and FO was sponsored by the PASCAL2 NoE under EC grant no. 216886.
LJ also acknowledges the PASCAL2 Internal Visiting Programme for supporting his travel expenses.

5Dataset available at http://lear.inrialpes.fr/data/

References

[1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Proc. NIPS, 2003.
[2] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. Blei, and M. Jordan. Matching words and pictures. JMLR, 3:1107–1135, 2003.
[3] T. Berg, A. Berg, J. Edwards, and D. Forsyth. Who’s in the picture? In Proc. NIPS, 2004.
[4] D. P. Bertsekas. Convex Analysis and Optimization. Athena Scientific, 2003.
[5] R. C. Bunescu and R. J. Mooney. Multiple instance learning for sparse positive bags. In Proc. ICML, 2007.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[7] O. Chapelle, B. Schölkopf, and A. Zien (Eds.). Semi-Supervised Learning. MIT Press, 2006.
[8] T. Cour, B. Sapp, C. Jordan, and B. Taskar. Learning from ambiguously labeled images. In Proc. CVPR, 2009.
[9] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2001.
[10] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31–71, 1997.
[11] R.-E. Fan, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[12] Y. Grandvalet. Logistic regression for partial labels. In Proc. IPMU, 2002.
[13] M. Guillaumin, J. Verbeek, and C. Schmid. Multiple instance metric learning from automatically labeled bags of faces. In Proc. ECCV, 2010.
[14] A. Gupta and L. Davis.
Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In Proc. ECCV, 2008.
[15] E. Hüllermeier and J. Beringer. Learning from ambiguously labeled examples. Intelligent Data Analysis, 10:419–439, 2006.
[16] L. Jie, B. Caputo, and V. Ferrari. Who’s doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In Proc. NIPS, 2009.
[17] R. Jin and Z. Ghahramani. Learning with multiple labels. In Proc. NIPS, 2002.
[18] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proc. ICML, 2007.
[19] A. J. Smola, S. V. N. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In Proc. AISTATS, 2005.
[20] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453–1484, 2005.
[21] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In Proc. NIPS, 2002.
[22] C.-N. Yu and T. Joachims. Learning structural SVMs with latent variables. In Proc. ICML, 2009.
[23] A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.
[24] M.-L. Zhang and Z.-H. Zhou. M3MIML: A maximum margin method for multi-instance multi-label learning. In Proc. ICDM, 2008.
[25] Z.-H. Zhou and M.-L. Zhang. Multi-instance multi-label learning with application to scene classification. In Proc. NIPS, 2006.
[26] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.