{"title": "Convex Multiple-Instance Learning by Estimating Likelihood Ratio", "book": "Advances in Neural Information Processing Systems", "page_first": 1360, "page_last": 1368, "abstract": "Multiple-instance learning has long been known as a hard non-convex problem.\n In this work, we propose an approach that recasts it as a convex likelihood ratio\n estimation problem. First, the constraint in multiple-instance learning is reformulated\n into a convex constraint on the likelihood ratio. Then we show that a joint\n estimation of a likelihood ratio function and the likelihood on training instances\n can be learned convexly. Theoretically, we prove a quantitative relationship between\n the risk estimated under the 0-1 classification loss, and under a loss function\n for likelihood ratio estimation. It is shown that our likelihood ratio estimation is\n generally a good surrogate for the 0-1 loss, and separates positive and negative\n instances well. However, with the joint estimation it tends to underestimate the\n likelihood of an example being positive. We propose to use these likelihood ratio\n estimates as features, and learn a linear combination on them to classify the bags.\n Experiments on synthetic and real datasets show the superiority of the approach.", "full_text": "Convex Multiple-Instance Learning by Estimating Likelihood Ratio\n\nFuxin Li and Cristian Sminchisescu\n\nInstitute for Numerical Simulation, University of Bonn\n\n{fuxin.li,cristian.sminchisescu}@ins.uni-bonn.de\n\nAbstract\n\nWe propose an approach to multiple-instance learning that reformulates the problem as a convex optimization on the likelihood ratio between the positive and the negative class for each training instance. 
This is cast as a joint estimation of both a likelihood ratio predictor and the target (likelihood ratio variable) for instances. Theoretically, we prove a quantitative relationship between the risk estimated under the 0-1 classification loss, and under a loss function for likelihood ratio estimation. It is shown that likelihood ratio estimation is generally a good surrogate for the 0-1 loss, and separates positive and negative instances well. The likelihood ratio estimates provide a ranking of instances within a bag and are used as input features to learn a linear classifier on bags of instances. Instance-level classification is achieved from the bag-level predictions and the individual likelihood ratios. Experiments on synthetic and real datasets demonstrate the competitiveness of the approach.\n\n1 Introduction\n\nMultiple Instance Learning (MIL) was proposed over 10 years ago as a methodology for learning models under weak labeling constraints [1]. Unlike traditional binary classification problems, the positive items are represented as bags, which are sets of instances. A feature vector is used to represent each instance in the bag. There is an OR relationship within a bag: if one of the feature vectors is classified as positive, the entire bag is considered positive. A simple intuition is: one has a number of keys and faces a locked door. To open the door, only one matching key is needed.\n\nMIL is a natural weak labeling formulation for text categorization [2] and computer vision problems [3]. In document classification, one is given files made of many sentences, and often only a few are useful. In computer vision, an image can be decomposed into different regions, and only some delineate objects. Therefore, MIL can be used in sophisticated tasks, such as identifying the location of object parts from bounding box information in images [4]. 
Although efforts have been made to provide datasets with increasingly more detailed supervisory information [5], without automation such a fine level of detail becomes prohibitive for large datasets, or for more complicated data like video [6, 7]. In this case, one necessarily needs to resort to multiple-instance learning.\n\nMIL is interesting mainly because of its potential to provide instance-level labels from weak supervisory information. However, the state of the art in MIL is often obtained by simply using a weighted sum of kernel values between all instance pairs within the bags, while ignoring the prediction of instance labels [8, 9, 10]. It is intriguing why MIL algorithms that exploit instance-level information cannot achieve better performance, as constraints at the instance level seem abundant – none of the negative instances is positive. This should provide additional constraints for defining the region of positive instances and should help classification in input space.\n\nA major challenge is the non-convexity of many instance-level MIL algorithms [2, 11, 12, 13, 14]. Most of these algorithms perform alternating minimization on the classifier and the instance weights. This procedure usually gives only a local optimum since the objective is non-convex. The benchmark performance of MIL methods is overall quite similar, although the techniques differ significantly: some assign binary weights to instances [2], some assign real weights [12, 13], yet others use probabilistic formulations [14]; some optimize using conventional alternating minimization, others use convex-concave procedures [11].\n\nGehler and Chapelle [15] have recently performed an interesting analysis of the MIL costs, where deterministic annealing (DA) was used to compute better local optima for several formulations. In the case of a previous mi-SVM formulation [2], annealing methods did not improve the performance significantly. 
A newly proposed algorithm, ALP-SVM, was also introduced, which uses a preset parameter defining the fixed ratio of witnesses – the true positive instances in a positive bag. Excellent results were obtained with this witness rate parameter set to the correct value. However, in practice it is unclear whether this value can be known beforehand and whether it is stationary across different bags. In principle, the witness rate should also be estimated, and this learning stage partially accounts for the non-convexity of the MIL problem. It remains unclear, however, whether the observed performance variations are caused by non-convexity or by other modeling aspects.\n\nAlthough performance considerations have hindered the application of MIL to practical problems, the methodology has started to gain momentum recently [4, 16]. The success of the Latent SVM for person detection [4] shows that a standard MIL procedure (the reformulation of the alternating minimization MI-SVM algorithm in [2]) can achieve good results if properly initialized. However, proper initialization of MIL remains elusive in general, as it often requires engineering experience with the individual problem structure. Therefore, it is still of broad interest to develop an initialization-independent formulation for MIL. Recently, Li et al. [17] proposed a convex instance-level MIL algorithm based on multiple kernel learning, where one kernel is used for each possible combination of instances. This creates an exponential number of constraints and requires a cutting-plane solver. Although the formulation is convex, its scalability drops significantly for bags with many instances.\n\nIn this paper we make an alternative attempt towards a convex formulation: we establish that the non-convex MIL constraints can be recast reliably into convex constraints on the likelihood ratio between the positive and negative classes for each instance. 
We transform the multiple-instance learning problem into a convex joint estimation of the likelihood ratio function and the likelihood ratio values on training instances. The choice of jointly convex loss functions is rich, including at least a family of f-divergences. Theoretically, we prove consistency results for likelihood ratio estimation, showing that f-divergence loss functions upper bound the 0-1 classification loss tightly, unless the likelihood is very large.\n\nA support vector regression scheme is implemented to estimate the likelihood ratio, and it is shown to separate positive and negative instances well. However, determining the correct threshold for instance classification from the training set remains non-trivial. To address this problem, we propose a post-processing step based on a bag classifier computed as a linear combination of likelihood ratios. While this is shown to be suboptimal in synthetic experiments, it still achieves state-of-the-art results on practical datasets, demonstrating the potential of the proposed approach.\n\n2 Convex Reformulation of the Multiple Instance Constraint\n\nLet us consider a learning problem with n training instances in total, n^+ positive and n^- negative. In negative bags, every instance is negative, hence we do not separately define such bags – instead we directly work with the instances. Let B = {B_1, B_2, . . . , B_k} be the positive bags and X^+ = {x^+_1, x^+_2, . . . , x^+_{n^+}}, X^- = {x^-_1, x^-_2, . . . , x^-_{n^-}} be the training input, where each x^+_i belongs to a positive bag B_j and each x^-_i is a negative instance. The goal of multiple instance learning is, given {X^+, X^-, B}, to learn a decision rule, sign(f(x)), that predicts the label {+1, -1} for a test instance x.\n\nThe MIL problem can be characterized by two properties. 
1) negative-exclusion: if none of the instances in a bag is positive, the bag is not positive. 2) positive-identifiability: if one of the instances in the bag is positive, the bag is positive. These properties are equivalent to a constraint max_{x_i ∈ B_j} f(x_i) ≥ 0 on positive bags. This constraint is not convex since the negative max function is concave. Reformulation into a sum constraint such as Σ_{x ∈ B_j} f(x) ≥ 0 would be convex when f(x) = w^T x is linear [6]. However, this hardly retains positive-identifiability, since if there is only one x_i with f(x_i) > 0, it can be superseded by other instances with f(x_i) < 0. Apparently, the distinction between the sum and the max operations is significant and difficult to ignore in this context.\n\nHowever, in this paper we show that if the MIL conditions are formulated as constraints on the likelihood ratio, convexity can be achieved. For example, the constraint:\n\nΣ_{x_i ∈ B_j} Pr(y = 1|x_i) / Pr(y = -1|x_i) > |B_j|   (1)\n\ncan ensure both of the MIL properties. Positive-identifiability is satisfied when Pr(y = 1|x_i) ≥ |B_j| / (|B_j| + 1), or equivalently, when the positive examples all have a very large margin.\n\nWhen the size of the bag is large, the assumption Pr(y = 1|x_i) > |B_j| / (|B_j| + 1) can be too strong. Therefore, we exploit large deviation bounds to reduce the quantity |B_j|, so that Pr(y = 1|x_i) does not have to be very large to satisfy the constraint. Intuitively, if the examples are not very ambiguous, i.e., Pr(y = 1|x_i) is not close to 1/2, then the likelihood ratio sums on negative examples become much smaller, hence we can adopt a significantly lower threshold at some degree of violation of the negative-exclusion property. 
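As a small numeric sketch of how the convex constraint (1) behaves, consider made-up posteriors for a bag of five instances: a single confident witness makes the likelihood ratio sum exceed |B_j|, while an all-negative bag falls far short. The posteriors below are invented for illustration only.

```python
import numpy as np

def likelihood_ratio(p):
    # likelihood ratio Pr(y = 1 | x) / Pr(y = -1 | x) for posterior p
    return p / (1.0 - p)

def satisfies_mil_constraint(posteriors):
    # convex surrogate of the MIL constraint, cf. (1):
    # the sum of likelihood ratios must exceed the bag size |B_j|
    return likelihood_ratio(posteriors).sum() > len(posteriors)

# Made-up posteriors for illustration only.
bag_with_witness = np.array([0.10, 0.20, 0.15, 0.05, 0.90])  # one confident witness
bag_all_negative = np.array([0.10, 0.20, 0.15, 0.05, 0.30])  # no witness

print(satisfies_mil_constraint(bag_with_witness))  # True: ratio sum ~ 9.6 > 5
print(satisfies_mil_constraint(bag_all_negative))  # False: ratio sum ~ 1.0 < 5
```

Note that the witness posterior 0.9 exceeds |B_j|/(|B_j| + 1) = 5/6, matching the positive-identifiability condition stated above.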
To this end, a common assumption is low label noise [18, 19]:\n\nM_β: ∃ c > 0, ∀ ε > 0, Pr(0 < |Pr(y = 1|x_i) - 1/2| ≤ ε) ≤ c ε^β.\n\nThis assumes that the posterior Pr(y = 1|x_i) is usually not very close to 1/2, meaning that most examples are not very ambiguous, which is usually reasonable. In [18, 19, 20], a number of results have been obtained implying that classifiers learned under this assumption converge to the Bayes error much faster than the conventional empirical process rate O(n^{-1/2}) of most standard classifiers, and can be as fast as O(n^{-1}). These theoretical results show that low label noise assumptions indeed support learning with fewer observations.\n\nAssuming M_β holds, we prove the following result, which allows us to relax the hard constraint (1):\n\nTheorem 1. ∀ δ > 0, for each x_i in a bag B_j, assume y_i is drawn i.i.d. from a distribution Pr_{B_j}(y_i|x_i) that satisfies M_β. If all instances x_i ∈ B_j are negative, then the probability that\n\nΣ_{x_i ∈ B_j} Pr(y = 1|x_i) / Pr(y = -1|x_i) ≥ (β + 4) / (2(β + 1)(β + 2)) · |B_j| + sqrt( (4β + 1) / (2(β + 1)²(2β + 3)) · |B_j| log(1/δ) ) + log(1/δ) / 3   (2)\n\nis at most δ.\n\nThe proof is given in an accompanying technical report [21]. From Theorem 1, we can weaken the constraint (1) to the constraint (2) and still ensure negative-exclusion with probability 1 - δ. When β is large, the reduction is significant. For example, for β = 2 and δ = 0.05, the right-hand side of (2) is approximately (1/4)|B_j| + sqrt((3/14)|B_j|) + 1, which is an important decrease over |B_j| whenever |B_j| ≥ 3. Note that the i.i.d. assumption in Theorem 1 applies within each bag. Different bags can have different label distributions. This is often a significantly weaker assumption than the ones based on global i.i.d. 
of labels [8].\n\n3 Likelihood Ratio Estimation\n\nTo estimate the likelihood ratio, one possibility would be to use kernel methods as nonparametric estimators over an RKHS. This approach was taken in [22], where predictions of the ratio provided a variational estimate of an f-divergence (or Ali-Silvey divergence) between two distributions. The formulation is powerful, yet not immediately applicable here. In our case, because of the uncertainty in the positive examples, Pr(y = 1|x) is not observed but has to be estimated. Therefore we need to optimize jointly, as min_{f, Pr(y=1|x)} D(f, Pr(y = 1|x)) + λ||f||², with loss function D(f, g). This optimization would not be convex if the framework in [22] were adopted.\n\nThe requirement to estimate two sets of variables simultaneously (e.g., f and Pr(y = 1|x) here) is one of the major difficulties in turning multiple-instance learning into a convex problem. Approaches based on classification-style loss functions lead to non-convex optimization [2, 13]. However, since we are outside a classification setting, we can optimize over divergence measures D_φ(f, g) that are convex w.r.t. both f and g. Such measures are common. For example, the f-divergence family, which includes many statistical distances, satisfies this property [23]:\n\nL1: D(x, y) = Σ_i |x_i - y_i|;  χ²: D(x, y) = Σ_i (x_i - y_i)² / x_i;\nKullback-Leibler: D(x, y) = Σ_i (x_i log x_i - x_i log y_i - x_i + y_i);\nSymmetric Kullback-Leibler: D(x, y) = Σ_i ((y_i - x_i) log y_i + (x_i - y_i) log x_i - x_i + y_i)   (3)\n\nIn principle, any of the measures given above can be used to estimate the likelihood ratio.\n\nAn important issue is the relationship between likelihood ratio estimation and our final goal: binary classification. 
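For concreteness, the divergences listed in (3) can be written down directly; the following is a minimal sketch (elementwise over positive vectors), useful for checking their values numerically, and not the solver used in the paper.

```python
import numpy as np

# The f-divergences listed in (3), written elementwise for vectors
# x (estimates) and y (targets); entries are assumed positive.
def d_l1(x, y):
    return np.sum(np.abs(x - y))

def d_chi2(x, y):
    return np.sum((x - y) ** 2 / x)

def d_kl(x, y):
    return np.sum(x * np.log(x) - x * np.log(y) - x + y)

def d_sym_kl(x, y):
    return np.sum((y - x) * np.log(y) + (x - y) * np.log(x) - x + y)

x = np.array([1.0, 2.0])
print(d_l1(x, x), d_kl(x, x))         # both 0 when the arguments agree
print(d_l1(x, np.array([2.0, 4.0])))  # 3.0
```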
In [20], the authors give necessary and sufficient conditions for Bayes-consistent learners obtained by minimizing the mean of a surrogate loss function of the data. In this paper we extend these results to loss functions for likelihood ratio estimation. Let R(f) = P(sign(y) ≠ sign(f(x) - 1)) be the 0-1 risk of a likelihood estimator f, with the classification rule given by sign(f(x) - 1). The Bayes risk is then R* = inf_f R(f).\n\nFor a generic loss function C(α, η), with η = Pr(y = 1|x), we define the C-risk as R_C(f) = E(C(f, η)) and R*_C = inf_f R_C(f). Our goal is to bound the excess 0-1 risk R(f) - R* by the excess-C risk R_C(f) - R*_C, so that minimizing the excess-C risk can be converted into minimizing the classification loss. Let us further define the optimal conditional risk as H(η) = inf_{α ∈ R} C(α, η), and H^-(η) = inf_{α: (α-1)(2η-1) ≤ 0} C(α, η). We say C(α, η) is classification-calibrated if, for any η ≠ 1/2, H^-(η) > H(η). Then we define the ψ-transform of C(α, η) as ψ(θ) = ψ̃**(θ), where ψ̃(θ) = H^-((1+θ)/2) - H((1+θ)/2), θ ∈ [-1, 1], and g** is the Fenchel-Legendre biconjugate of g, which is essentially the largest convex lower bound of g [20].\n\nThe difference between likelihood ratio estimation and the classification setting is the asymmetric scaling of the loss function for positive and negative examples. Let ψ^-(θ) = ψ(-θ), R^-(f) = Pr(y = -1, f(x) > 1), R^-_* = inf_f R^-(f), R^+(f) = Pr(y = 1, f(x) < 1) and R^+_* = inf_f R^+(f) be the risks and Bayes risks on negative and positive examples, respectively. 
It is easy to prove that R(f) - R* = R^-(f) - R^-_* + R^+(f) - R^+_*. We derived the following theorem:\n\nTheorem 2. a) For any nonnegative loss function C(α, η), any measurable f: X → R, and any probability distribution on X × {±1}, ψ^-(R^-(f) - R^-_*) + ψ(R^+(f) - R^+_*) ≤ R_C(f) - R*_C. b) The following conditions are equivalent: (1) C is classification-calibrated; (2) for any sequence (θ_i) in [0, 1], ψ(θ_i) → 0 if and only if θ_i → 0; (3) for every sequence of measurable functions f_i: X → R and every probability distribution on X × {±1}, R_C(f_i) → R*_C implies R(f_i) → R*.\n\nThe proof is given in an accompanying technical report [21]. This suggests that if ψ is well-behaved, minimizing R_C(f) still gives a reasonable surrogate for the classification risk. Compared with Theorem 3 in [20], which has the form ψ(R(f) - R*) ≤ R_C(f) - R*_C, the difference here stems from the different loss transforms used for the positive and the negative examples.\n\nWe consider an f-divergence of the likelihood as the loss function, i.e., C(α, η) = D(α, η/(1-η)), where η/(1-η) is the likelihood ratio when Pr(y = 1|x) = η. From convexity arguments, it can easily be seen that H(η) = C(η/(1-η), η) = 0 and H^-(η) = D(1, η/(1-η)), therefore ψ̃(θ) = D(1, (1+θ)/(1-θ)). The ψ for all the loss functions listed in (3) can be computed accordingly. In fig. 1 (a) we show ψ(θ) for the L1 and KL-divergence losses from (3) and compare them against the hinge loss (where ψ̃(θ) = |θ| [20]) used for SVM classification. 
It can be seen that our approximation of the classification loss is accurate when Pr(y_i = 1|x_i) is small. However, likelihood estimation severely penalizes misclassified positive examples with large Pr(y_i = 1|x_i). This suggests that in the joint estimation of f and Pr(y_i = 1|x_i), the optimizer will tend to make Pr(y_i = 1|x_i) smaller in order to avoid high penalties, as shown in fig. 1 (b). In fig. 1 (a) we plot the ψ functions for different losses. We prefer the L1 measure, as it is closer to the classification hinge loss, at least for the negative examples. In the end we solve the nonparametric function estimation in an RKHS using an epsilon-insensitive L1 loss, which can be reformulated as\n\n[Figure 1 plots]\n\nFigure 1: Loss functions and their influence on the estimation bias. (a) The function ψ appearing in the losses used for likelihood estimation (L1, KL-divergence) is similar to the hinge loss when θ > 0; however, it goes to infinity as θ approaches 1. This deviation essentially means the surrogate loss becomes extremely large if an example with very large Pr(y_i = 1|x_i) is misclassified. (b) Example estimated likelihood for a synthetic example. The estimated likelihood is biased towards smaller values. However, with a fully labeled training set, the threshold can still be obtained. 
(c) If we only know the labels of the negative examples (blue) and the maximal positive example (red), determining the optimal threshold becomes non-trivial.\n\nsupport vector regression on the conditional likelihood, with the additional MIL constraints in (2):\n\nmin_{f, η^+}  Σ_{x^+_j} max(|f(x^+_j) - η^+_j| - ε, 0) + Σ_{x^-_j} max(|f(x^-_j)| - ε, 0) + λ||f||²\ns.t.  Σ_{x^+_j ∈ B_i} η^+_j ≥ D_i,  η^+_j ≥ 0   (4)\n\nwhere ||f||² is the RKHS norm; D_i is a constant for each bag and can be determined from Theorem 1, with appropriately chosen values for the constants β and δ; η^+_j is an estimate of Pr(y = 1|x^+_j) / Pr(y = -1|x^+_j) on the training set. In this paper we use β = 2 and δ = 0.05, which gives the estimate of the bound for each bag as D_i = (1/4)|B_i| + sqrt((3/14)|B_i|) + 1 when |B_i| ≥ 3, and D_i = |B_i| when |B_i| < 3.\n\nIt can be proved that optimization problem (4) is jointly convex in both arguments. A standard representer theorem [24] converts it to an optimization over vectors, which we omit here. The problem can be solved by different methods. The easiest to implement is alternating minimization between solving the SVM and projecting on the constraint sets given by Σ_{x^+_j ∈ B_i} η^+_j ≥ D_i and η^+_j ≥ 0. As this can turn out to be slow for large datasets, approaches such as dual SMO or primal subgradient projection algorithms (in the case of a linear SVM) can be used. In this paper we implement the alternating minimization approach, which is provably convergent since the optimization problem (4) is convex. 
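The alternating scheme just described can be sketched compactly. The following is a toy sketch under stated substitutions, not the paper's implementation: kernel ridge regression stands in for the ε-insensitive SVR (with squared loss, the η-step given f is exactly the Euclidean projection of the predictions onto each bag's constraint set), and the projection onto {η ≥ 0, Σ η ≥ D_i} uses the standard sorting-based simplex projection.

```python
import numpy as np

def project_bag(v, D):
    """Euclidean projection of a bag's targets onto {eta >= 0, sum(eta) >= D}."""
    w = np.maximum(v, 0.0)
    if w.sum() >= D:
        return w  # clipping to the nonnegative orthant is already feasible
    # Otherwise the projection lies on the face {eta >= 0, sum(eta) = D};
    # use the standard sorting-based (scaled) simplex projection.
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - D
    rho = np.nonzero(u - css / (np.arange(len(u)) + 1.0) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def alternating_min(K, bags, neg_idx, lam=0.1, iters=50):
    """Toy alternating minimization in the spirit of problem (4).

    K: kernel matrix; bags: list of (instance indices, D_i) for positive bags;
    neg_idx: indices of negative instances (regression target 0).
    """
    n = K.shape[0]
    eta = np.zeros(n)
    A = K + lam * np.eye(n)  # ridge system; the fitted function is f = K @ alpha
    for _ in range(iters):
        # 1) fit f to the current targets (0 for negatives, eta for positives)
        t = eta.copy()
        t[neg_idx] = 0.0
        f = K @ np.linalg.solve(A, t)
        # 2) update targets: project predictions onto each bag's constraint set,
        #    with D_i taken from the Theorem 1 bound
        for idx, D in bags:
            eta[idx] = project_bag(f[idx], D)
    return f, eta
```

Since the last step is a projection, the returned targets satisfy the bag constraints exactly; the names `project_bag` and `alternating_min` are of course ours, not the paper's.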
In the accompanying technical report [21] we derive an SMO algorithm based on the dual of (4) and characterize the basic properties of the optimization problem.\n\n4 Bag and Instance Classification\n\nIf the likelihood ratio were obtained using an unbiased estimator, a decision rule based on sign(f(x) - 1) would give the optimal classifier. However, as previously argued, the joint estimation of f and η^+ introduces a bias which is not always easy to identify. In positive bags, it is unclear whether an instance should be labeled positive or negative, as long as it does not contribute significantly to the classification error of its bag (fig. 1 (b), (c)). In the synthetic experiments, we noticed that knowledge of the correct threshold would make the algorithm outperform competitors by a large margin (fig. 2). This means that based on the learned likelihood ratio, the positive examples are usually well separated from the negative ones. Developing a theory that would advance these aspects remains a promising avenue for future work. The main difficulty stems from the compound source of bias, which arises from both the estimation of η^+ and the loss minimization over η^+ and f.\n\nHere we propose a partial solution. Instead of directly estimating the threshold, we learn a linear combination of instance likelihood ratios to classify the bag. First, we sort the instance likelihood ratios for each bag into a vector of length max_i |B_i|. 
We append 0 to bags that do not have enough instances.\n\n[Figure 2 plots]\n\nFigure 2: Synthetic dataset (best viewed in color). (a) The true decision boundary. (b) Training points at 40% witness rate. (c) The learned regression function. (d) Bag misclassification rate of different algorithms (AL-SVM (T = 10C), mi-SVM, AW-SVM, BEST-THRESHOLD, SVR-SVM), averaged over 50 runs. (e) Instance misclassification rate of the same algorithms. (f) Estimated witness rate and true witness rate.\n\nUnder this representation, bag classification turns into a standard binary decision problem where a vector and a binary label are given for each bag, and a linear SVM is learned to solve the problem. If we were to classify only the likelihood ratio of the first instance, this procedure would reduce to simple thresholding. We instead leverage information from the entire bag, aiming to constrain the classifier to learn the correct threshold. In this linear SVM setting, regularization never helps in practice and we always fix C to very large values. 
Effectively no parameter tuning is needed.1\n\nTo classify instances, a threshold is still necessary. In the current system, we follow a simple approach and take the mean between two instances: the one with the highest likelihood among training bags that are predicted negative by the bag classifier, and the lowest-scored instance in positive bags with a score higher than the previous one. This approach is derived from the basic MIL assumption that all instances in a negative bag are negative.\n\nBased on the instance classification we can also estimate the witness rate of the dataset. This is computed as the ratio between the number of positively classified instances and the total number of instances in the positive bags of the training set. Since our algorithm automatically adjusts to different witness rates, this estimate offers quantitative insight as to whether MIL should be used. For instance, if the witness rate is 100%, it may be more effective to use a conventional learning approach.\n\n5 Experiments\n\n5.1 Synthetic Data\n\nWe start with an experiment on the synthetic dataset of [15], where the controlled setting helps in understanding the behavior of the proposed algorithm. This is a 2-D dataset with the actual decision boundary shown in fig. 2 (a). The positive bags have a fraction of points sampled uniformly from the white region and the rest sampled uniformly from the black region. An example of the sample at 40% witness rate is shown in fig. 2 (b). In this figure, the plotted instance labels are the ones of their bags – indeed, one can notice many positive (blue) instances in the negative (red) region.\n\n1We have also experimented with a uniform threshold based on probabilistic estimates, as well as with predicting an instance-level threshold. While the former tends to underfit, the latter overfits. 
Our bag-level classifier targets an intermediate level of granularity and turns out to be the most robust in our experiments.\n\nTable 1: Performance of various MIL algorithms on weak labeling benchmarks. The best result on each dataset is shown in bold. The second group of algorithms either does not provide instance labels (MI-Kernel and miGraph) or requires a parameter that can be difficult to tune (ALP-SVM). SVR-SVM appears to give consistent results among algorithms that provide instance labels. The row denoted "Est. WR" gives the estimated witness rates of our method.\n\nAlgorithm   | Musk-1     | Musk-2     | Elephant   | Tiger      | Fox\nCH-FD       | 88.8       | 85.7       | 82.4       | 82.2       | 60.4\nEMDD        | 84.9       | 84.8       | 78.3       | 72.1       | 56.1\nmi-SVM      | 87.4       | 83.6       | 82.2       | 78.4       | 58.2\nMI-SVM      | 77.9       | 84.3       | 81.4       | 84.0       | 57.8\nMICA        | 84.4       | 90.5       | 82.5       | 82.0       | 62.0\nAW-SVM      | 85.7       | 83.8       | 82.0       | 83.0       | 63.5\nIns-KI-SVM  | 84.0       | 84.4       | 83.5       | 82.9       | 63.4\nMI-Kernel   | 88.0 ± 3.1 | 89.3 ± 1.5 | 84.3 ± 1.6 | 84.2 ± 1.9 | 60.3 ± 1.0\nmiGraph     | 88.9 ± 3.3 | 90.3 ± 2.6 | 86.8 ± 0.7 | 86.0 ± 2.8 | 61.6 ± 1.6\nALP-SVM     | 86.3       | 86.2       | 83.5       | 86.0       | 66.0\nSVR-SVM     | 87.9 ± 1.7 | 85.4 ± 1.8 | 85.3 ± 2.8 | 79.8 ± 3.4 | 63.0 ± 3.5\nEst. WR     | 100 %      | 89.5 %     | 37.8 %     | 42.7 %     | 100 %\n\nIn order to test the effect of witness rates, 10 different types of datasets are created by varying the rates over the range 0.1, 0.2, . . . , 1. In this experiment we fix the hyperparameters C = 5 and use a Gaussian kernel with σ = 1. We show a trained likelihood ratio function in fig. 2 (c), estimated on the dataset shown in fig. 2 (b). Under the likelihood ratio, the positive examples are well separated from the negatives. 
This illustrates how our proposed approach converts multiple-instance learning into the problem of deciding a one-dimensional threshold.\n\nComplete results on datasets with different witness rates are shown in fig. 2 (d) and (e). We give both bag classification and instance classification results. Our approach is referred to as SVR-SVM. BEST-THRESHOLD refers to a method where the best threshold was chosen based on full knowledge of the training/test instance labels, i.e., the optimal performance our likelihood ratio estimator can achieve. Comparison is done with two other approaches, the mi-SVM of [2] and the AW-SVM of [15]. SVR-SVM generally works well when the witness rate is not very low. From the instance classification results, one can see that the original mi-SVM is only competitive when the witness rate is near 1 – a situation close to a supervised SVM. With the deterministic annealing approach of [15], AW-SVM and mi-SVM behave quite the opposite – competitive when the witness rate is small but degrading when it is large. Presumably this is because deterministic annealing is initialized with the a priori assumption that datasets are genuinely multiple-instance, i.e., have a small witness rate [15]. When the witness rate is large, annealing does not improve performance. On the contrary, the proposed SVR-SVM does not appear to be affected by the witness rate. With the same parameters used across all the experiments, the method self-adjusts to different witness rates. One can see the effect especially in fig. 2 (e): regardless of the witness rate, the instance error rate remains roughly the same. However, this is still inferior to our model based on the best threshold, which indicates that considerable room for improvement exists.\n\n5.2 MIL Datasets\n\nThe algorithm is evaluated on a number of popular MIL benchmarks. 
We use the common experimental setting, based on 10-fold cross-validation for parameter selection, and we report test results averaged over 10 trials. The results are shown in Table 1, together with other competitive methods from the literature [12, 15, 10] (for some of these methods standard deviation estimates are not available).\n\nIn our tests, the proposed SVR-SVM gives consistently good results among algorithms that provide instance-level labels. The only atypical case is Tiger, where the algorithm underperforms other methods. Overall, the performance of SVR-SVM is slightly worse than that of miGraph and ALP-SVM. But we note that the ALP-SVM results are obtained by tuning the witness rate to its optimal value, which may be difficult in practical settings. The slightly lower performance compared to miGraph suggests that we may be inferior in the bag classification step, which we already know is suboptimal.\n\nTable 2: Results on 20 Newsgroups. The best result on each dataset is shown in bold; pairwise t-tests are performed to determine whether the differences are statistically significant. 
miGraph dominates on 10 datasets, whereas SVR-SVM dominates on 14.

Dataset                  MI-Kernel    miGraph (paper) [10]  miGraph (web)  SVR-SVM      Est. WR
alt.atheism              60.2 ± 3.9   65.5 ± 4.0            82.0 ± 0.8     83.5 ± 1.7   1.83 %
comp.graphics            47.0 ± 3.3   77.8 ± 1.6            84.3 ± 0.4     85.2 ± 1.5   5.19 %
comp.windows.misc        51.0 ± 5.2   63.1 ± 1.5            70.1 ± 0.3     66.9 ± 2.6   2.23 %
comp.ibm.pc.hardware     46.9 ± 3.6   59.5 ± 2.7            79.4 ± 0.8     70.3 ± 2.8   2.42 %
comp.sys.mac.hardware    44.5 ± 3.2   61.7 ± 4.8            81.0 ± 0       78.0 ± 1.7   4.58 %
comp.window.x            50.8 ± 4.3   69.8 ± 2.1            79.4 ± 0.5     83.7 ± 2.0   5.36 %
misc.forsale             51.8 ± 2.5   55.2 ± 2.7            71.0 ± 0       72.3 ± 1.2   4.29 %
rec.autos                52.9 ± 3.3   72.0 ± 3.7            83.2 ± 0.6     78.1 ± 1.9   2.75 %
rec.motorcycles          50.6 ± 3.5   64.0 ± 2.8            70.9 ± 2.7     75.6 ± 0.9   2.86 %
rec.sport.baseball       51.7 ± 2.8   64.7 ± 3.1            75.0 ± 0.6     76.7 ± 1.4   4.31 %
rec.sport.hockey         51.3 ± 3.4   85.0 ± 2.5            92.0 ± 0       89.3 ± 1.6   6.52 %
sci.crypt                56.3 ± 3.6   69.6 ± 2.1            70.1 ± 0.8     69.7 ± 2.5   3.22 %
sci.electronics          50.6 ± 2.0   87.1 ± 1.7            94.0 ± 0       91.5 ± 1.0   4.29 %
sci.med                  50.6 ± 1.9   62.1 ± 3.9            72.1 ± 1.3     74.9 ± 1.9   5.23 %
sci.space                54.7 ± 2.5   75.7 ± 3.4            79.4 ± 0.8     83.2 ± 2.0   3.64 %
soc.religion.christian   49.2 ± 3.4   59.0 ± 4.7            75.4 ± 1.2     83.2 ± 2.7   3.30 %
talk.politics.guns       47.7 ± 3.8   58.5 ± 6.0            72.3 ± 1.0     73.7 ± 2.6   3.23 %
talk.politics.mideast    55.9 ± 2.8   73.6 ± 2.6            75.5 ± 1.0     80.5 ± 3.2   3.88 %
talk.politics.misc       51.5 ± 3.7   70.4 ± 3.6            72.9 ± 2.4     72.6 ± 1.4   2.82 %
talk.religion.misc       55.4 ± 4.3   63.3 ± 3.5            67.5 ± 1.0     71.9 ± 1.9   2.87 %

5.3 Text Categorization

The text datasets are taken from [10]. These data have the benefit of being designed to have a small witness rate, and thus serve as a better MIL benchmark than the previous ones. They are derived from the 20 Newsgroups corpus, with 50 positive and 50 negative bags for each of the 20 news categories. Each positive bag has a witness rate of around 3%. We run 10-fold cross-validation 10 times on each dataset and report the average accuracy and standard deviation; C is fixed to 100 and ε to 0.2. The authors of [10] report more recent results for these datasets on their website, which are vastly superior to the ones reported in the paper. Therefore, in Table 2 we include both sets of results, identified as miGraph (paper) and miGraph (web), respectively.

Our SVR-SVM performs significantly better than MI-Kernel and miGraph (paper). It is comparable with miGraph (web), and offers a marginal improvement. It is interesting that, even though we use a suboptimal second step, SVR-SVM fares well against the state of the art. This shows the potential of methods based on likelihood ratio estimators for multiple-instance learning.

6 Conclusion

We have proposed an approach to multiple-instance learning based on estimating the likelihood ratio between the positive and the negative classes on instances.
The MIL constraint is reformulated into a convex constraint on the likelihood ratio, and a joint estimation of both the ratio function and the target ratios on the training set is performed. Theoretically, we show that learning the likelihood ratio is Bayes-consistent and has desirable excess-loss transform properties. Although we are not able to find the optimal classification threshold on the estimated ratio function, our proposed bag classifier based on such ratios obtains state-of-the-art results on a number of difficult datasets. In future work, we plan to explore transductive learning techniques in order to leverage the information in the learned ratio function, and to identify better threshold estimation procedures.

Acknowledgements

This work is supported, in part, by the European Commission, under a Marie Curie Excellence Grant MCEXT-025481.

References

[1] Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence 89 (1997) 31–71
[2] Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: NIPS. (2003) 561–568
[3] Maron, O., Lozano-Pérez, T.: A framework for multiple-instance learning. In: NIPS. (1998) 570–576
[4] Felzenszwalb, P.F., McAllester, D.A., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR. (2008)
[5] Russell, B.C., Torralba, A., Murphy, K.P., Freeman, W.T.: LabelMe: A database and web-based tool for image annotation. IJCV 77(1-3) (2008) 157–173
[6] Cour, T., Sapp, B., Nagle, A., Taskar, B.: Talking pictures: Temporal grouping and dialog-supervised person recognition. In: CVPR. (2010)
[7] Zeisl, B., Leistner, C., Saffari, A., Bischof, H.: On-line semi-supervised multiple-instance boosting. In: CVPR.
(2010)
[8] Gärtner, T., Flach, P.A., Kowalczyk, A., Smola, A.J.: Multi-instance kernels. In: ICML. (2002)
[9] Tao, Q., Scott, S., Vinodchandran, N.V., Osugi, T.T.: SVM-based generalized multiple-instance learning via approximate box counting. In: ICML. (2004)
[10] Zhou, Z.H., Sun, Y.Y., Li, Y.F.: Multi-instance learning by treating instances as non-i.i.d. samples. In: ICML. (2009)
[11] Cheung, P.M., Kwok, J.T.: A regularization framework for multiple-instance learning. In: ICML. (2006) 193–200
[12] Fung, G., Dundar, M., Krishnapuram, B., Rao, R.B.: Multiple instance learning for computer aided diagnosis. In: NIPS. (2007)
[13] Mangasarian, O., Wild, E.: Multiple instance classification via successive linear programming. Journal of Optimization Theory and Applications 137 (2008) 555–568
[14] Zhang, Q., Goldman, S.A., Yu, W., Fritts, J.E.: Content-based image retrieval using multiple-instance learning. In: ICML. (2002) 682–689
[15] Gehler, P., Chapelle, O.: Deterministic annealing for multiple-instance learning. In: AISTATS. (2007)
[16] Dollár, P., Babenko, B., Belongie, S., Perona, P., Tu, Z.: Multiple component learning for object detection. In: ECCV. (2008)
[17] Li, Y.F., Kwok, J.T., Tsang, I.W., Zhou, Z.H.: A convex method for locating regions of interest with multi-instance learning. In: ECML. (2009)
[18] Mammen, E., Tsybakov, A.B.: Smooth discrimination analysis. Annals of Statistics 27 (1999) 1808–1829
[19] Tsybakov, A.B.: Optimal aggregation of classifiers in statistical learning. Annals of Statistics 32 (2004) 135–166
[20] Bartlett, P., Jordan, M.I., McAuliffe, J.: Convexity, classification and risk bounds.
Journal of the American Statistical Association 101 (2006) 138–156
[21] Li, F., Sminchisescu, C.: Convex multiple instance learning by estimating likelihood ratio. Technical report, Institute for Numerical Simulation, University of Bonn (November 2010)
[22] Nguyen, X., Wainwright, M., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In: NIPS. (2007)
[23] Liese, F., Vajda, I.: Convex Statistical Distances. Teubner VG (1987)
[24] Hofmann, T., Schölkopf, B., Smola, A.J.: Kernel methods in machine learning. The Annals of Statistics 36 (2008) 1171–1220