{"title": "Optimal ROC Curve for a Combination of Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 57, "page_last": 64, "abstract": "We present a new analysis for the combination of binary classifiers. We propose a theoretical framework based on the Neyman-Pearson lemma to analyze combinations of classifiers. In particular, we give a method for finding the optimal decision rule for a combination of classifiers and prove that it has the optimal ROC curve. We also show how our method generalizes and improves on previous work on combining classifiers and generating ROC curves.", "full_text": "Optimal ROC Curve for a Combination of Classi\ufb01ers\n\nMarco Barreno\n\nAlvaro A. C\u00e1rdenas\n\nJ. D. Tygar\n\nComputer Science Division\n\nUniversity of California at Berkeley\n\nBerkeley, California 94720\n\n{barreno,cardenas,tygar}@cs.berkeley.edu\n\nAbstract\n\nWe present a new analysis for the combination of binary classi\ufb01ers. Our analysis\nmakes use of the Neyman-Pearson lemma as a theoretical basis to analyze combi-\nnations of classi\ufb01ers. We give a method for \ufb01nding the optimal decision rule for a\ncombination of classi\ufb01ers and prove that it has the optimal ROC curve. We show\nhow our method generalizes and improves previous work on combining classi\ufb01ers\nand generating ROC curves.\n\n1 Introduction\n\nWe present an optimal way to combine binary classi\ufb01ers in the Neyman-Pearson sense: for a given\nupper bound on false alarms (false positives), we \ufb01nd the set of combination rules maximizing the\ndetection rate (true positives). This forms the optimal ROC curve of a combination of classi\ufb01ers.\n\nThis paper makes the following original contributions: (1) We present a new method for \ufb01nding\nthe meta-classi\ufb01er with the optimal ROC curve. 
(2) We show how our framework can be used to\ninterpret, generalize, and improve previous work by Provost and Fawcett [1] and Flach and Wu [2].\n(3) We present experimental results that show our method is practical and performs well, even when\nwe must estimate the distributions with insuf\ufb01cient data.\n\nIn addition, we prove the following results: (1) We show that the optimal ROC curve is composed\nin general of 2^n + 1 different decision rules and of the interpolation between these rules (over the\nspace of 2^(2^n) possible Boolean rules). (2) We prove that our method is optimal in this space. (3) We\nprove that the Boolean AND and OR rules are always part of the optimal set for the special case of\nindependent classi\ufb01ers (though in general we make no independence assumptions). (4) We prove a\nsuf\ufb01cient condition for Provost and Fawcett\u2019s method to be optimal.\n\n2 Background\n\nConsider classi\ufb01cation problems where examples from a space of inputs X are associated with\nbinary labels {0, 1} and there is a \ufb01xed but unknown probability distribution P(x, c) over examples\n(x, c) \u2208 X \u00d7 {0, 1}. H0 and H1 denote the events that c = 0 and c = 1, respectively.\nA binary classi\ufb01er is a function f : X \u2192 {0, 1} that predicts labels on new inputs. When we use\nthe term \u201cclassi\ufb01er\u201d in this paper we mean binary classi\ufb01er. We address the problem of combining\nresults from n base classi\ufb01ers f1, f2, . . . , fn. Let Yi = fi(X) be a random variable indicating the\noutput of classi\ufb01er fi and Y = (Y1, Y2, . . . , Yn) \u2208 {0, 1}^n. We can characterize the performance of\nclassi\ufb01er fi by its detection rate (also true positives, or power) PDi = Pr[Yi = 1|H1] and its false\nalarm rate (also false positives, or test size) PF i = Pr[Yi = 1|H0]. In this paper we are concerned\nwith proper classi\ufb01ers, that is, classi\ufb01ers where PDi > PF i. 
We sometimes omit the subscript i.\n\nThe Receiver Operating Characteristic (ROC) curve plots PF on the x-axis and PD on the y-axis\n(ROC space). The point (0, 0) represents always classifying as 0, the point (1, 1) represents always\nclassifying as 1, and the point (0, 1) represents perfect classi\ufb01cation. If one classi\ufb01er\u2019s curve has no\npoints below another, it weakly dominates the latter. If no points are below and at least one point\nis strictly above, it dominates it. The line y = x describes a classi\ufb01er that is no better than chance,\nand every proper classi\ufb01er dominates this line. When an ROC curve consists of a single point, we\nconnect it with straight lines to (0, 0) and (1, 1) in order to compare it with others (see Lemma 1).\nIn this paper, we focus on base classi\ufb01ers that occupy a single point in ROC space. Many classi\ufb01ers\nhave tunable parameters and can produce a continuous ROC curve; our analysis can apply to these\ncases by choosing representative points and treating each one as a separate classi\ufb01er.\n\n2.1 The ROC convex hull\n\nProvost and Fawcett [1] give a seminal result on the use of ROC curves for combining classi\ufb01ers.\nThey suggest taking the convex hull of all points of the ROC curves of the classi\ufb01ers. This ROC\nconvex hull (ROCCH) combination rule interpolates between base classi\ufb01ers f1, f2, . . . , fn, select-\ning (1) a single best classi\ufb01er or (2) a randomization between the decisions of two classi\ufb01ers for\nevery false alarm rate [1]. This approach, however, is not optimal: as pointed out in later work by\nFawcett, the Boolean AND and OR rules over classi\ufb01ers can perform better than the ROCCH [3].\n\nAND and OR are only 2 of the 2^(2^n) possible Boolean rules over the outputs of n base classi\ufb01ers (n\nclassi\ufb01ers \u21d2 2^n possible outcomes \u21d2 2^(2^n) rules over outcomes). 
We address \ufb01nding optimal rules.\n\n2.2 The Neyman-Pearson lemma\n\nIn this section we introduce Neyman-Pearson theory from the framework of statistical hypothesis\ntesting [4, 5], which forms the basis of our analysis.\nWe test a null hypothesis H0 against an alternative H1. Let the random variable Y have probability\ndistributions P (Y|H0) under H0 and P (Y|H1) under H1, and de\ufb01ne the likelihood ratio \u2113(Y) =\nP (Y|H1)/P (Y|H0). The Neyman-Pearson lemma states that the likelihood ratio test\n\nD(Y) = { 1 if \u2113(Y) > \u03c4 ; \u03b3 if \u2113(Y) = \u03c4 ; 0 if \u2113(Y) < \u03c4 },   (1)\n\nfor some \u03c4 \u2208 (0, \u221e) and \u03b3 \u2208 [0, 1], is a most powerful test for its size: no other test has higher\nPD = Pr[D(Y) = 1|H1] for the same bound on PF = Pr[D(Y) = 1|H0]. (When \u2113(Y) = \u03c4 ,\nD = 1 with probability \u03b3 and 0 otherwise.) Given a test size \u03b1, we maximize PD subject to PF \u2264 \u03b1\nby choosing \u03c4 and \u03b3 as follows. First we \ufb01nd the smallest value \u03c4* such that Pr[\u2113(Y) > \u03c4*|H0] \u2264\n\u03b1. To maximize PD, which is monotonically nondecreasing with PF , we choose the highest value\n\u03b3* that satis\ufb01es Pr[D(Y) = 1|H0] = Pr[\u2113(Y) > \u03c4*|H0] + \u03b3* Pr[\u2113(Y) = \u03c4*|H0] \u2264 \u03b1, \ufb01nding\n\u03b3* = (\u03b1 \u2212 Pr[\u2113(Y) > \u03c4*|H0])/ Pr[\u2113(Y) = \u03c4*|H0].\n\n3 The optimal ROC curve for a combination of classi\ufb01ers\n\nWe characterize the optimal ROC curve for a decision based on a combination of arbitrary\nclassi\ufb01ers\u2014for any given bound \u03b1 on PF , we maximize PD. We frame this problem as a Neyman-\nPearson hypothesis test parameterized by the choice of \u03b1. We assume nothing about the classi\ufb01ers\nexcept that each produces an output in {0, 1}. 
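(As a concrete aside to Section 2.2: for a finite outcome space, the choice of \u03c4* and \u03b3* can be sketched in a few lines of code. This is an illustrative sketch of the recipe above, not the authors' implementation; the function name and the dictionary representation of the distributions are our own assumptions.)

```python
def np_test_params(p0, p1, alpha):
    """Choose (tau, gamma) for a size-alpha likelihood ratio test over a
    finite outcome space. p0 and p1 map each outcome y to P(y|H0) and
    P(y|H1); assumes p0[y] > 0 for all y. Illustrative sketch only."""
    lr = {y: p1[y] / p0[y] for y in p0}             # likelihood ratio l(y)
    # Only the observed ratio values matter as candidate thresholds; take
    # the smallest tau with Pr[l(Y) > tau | H0] <= alpha.
    for tau in sorted(set(lr.values())):
        above = sum(p0[y] for y in p0 if lr[y] > tau)
        if above <= alpha:
            at_tau = sum(p0[y] for y in p0 if lr[y] == tau)
            gamma = (alpha - above) / at_tau        # randomize when l(Y) = tau
            return tau, gamma
```

Predicting 1 with probability `gamma` on the boundary outcomes then achieves size exactly `alpha`, as in Equation 1.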
In particular, we do not assume the classi\ufb01ers are\nindependent or related in any way.\n\nBefore introducing our method we analyze the one-classi\ufb01er case (n = 1).\n\nLemma 1 Let f1 be a classi\ufb01er with performance probabilities PD1 and PF 1. Its optimal ROC\ncurve is a piecewise linear function parameterized by a free parameter \u03b1 bounding PF : for \u03b1 <\nPF 1, PD(\u03b1) = (PD1/PF 1)\u03b1, and for \u03b1 > PF 1, PD(\u03b1) = [(1 \u2212 PD1)/(1 \u2212 PF 1)](\u03b1 \u2212 PF 1) + PD1.\n\nProof. When \u03b1 < PF 1, we can obtain a likelihood ratio test by setting \u03c4* = \u2113(1) and \u03b3* = \u03b1/PF 1,\nand for \u03b1 > PF 1, we set \u03c4* = \u2113(0) and \u03b3* = (\u03b1 \u2212 PF 1)/(1 \u2212 PF 1). \u25a1\n\nThe intuitive interpretation of this result is that to decrease or increase the false alarm rate of the\nclassi\ufb01er, we randomize between using its predictions and always choosing 1 or 0. In ROC space,\nthis forms lines interpolating between (PF 1, PD1) and (1, 1) or (0, 0), respectively.\nTo generalize this result for the combination of n classi\ufb01ers, we require the distributions P (Y|H0)\nand P (Y|H1). With this information we then compute and sort the likelihood ratios \u2113(y) for all\noutcomes y \u2208 {0, 1}^n. Let L be the list of likelihood ratios ranked from low to high.\n\nLemma 2 Given any 0 \u2264 \u03b1 \u2264 1, the ordering L determines parameters \u03c4* and \u03b3* for a likelihood\nratio test of size \u03b1.\n\nLemma 2 sets up a classi\ufb01cation rule for each interval between likelihoods in L and interpolates\nbetween them to create a test with size exactly \u03b1. Our meta-classi\ufb01er does this for any given bound\non its false positive rate, then makes predictions according to Equation 1. To \ufb01nd the ROC curve for\nour meta-classi\ufb01er, we plot PD against PF for all 0 \u2264 \u03b1 \u2264 1. 
In particular, for each y \u2208 {0, 1}^n\nwe can compute Pr[\u2113(Y) > \u2113(y)|H0], which gives us one value for \u03c4* and a point in ROC space\n(PF and PD follow directly from L and P ). Each \u03c4* will turn out to be the slope of a line segment\nbetween adjacent vertices, and varying \u03b3* interpolates between the vertices. We call the ROC curve\nobtained in this way the LR-ROC.\n\nTheorem 1 The LR-ROC weakly dominates the ROC curve of any possible combination of Boolean\nfunctions g : {0, 1}^n \u2192 {0, 1} over the outputs of n classi\ufb01ers.\n\nProof. Let \u03b1\u2032 be the probability of false alarm PF for g. Let \u03c4* and \u03b3* be chosen for a test of\nsize \u03b1\u2032. Then our meta-classi\ufb01er\u2019s decision rule is a likelihood ratio test. By the Neyman-Pearson\nlemma, no other test has higher power for any given size. Since ROC space plots power on the\ny-axis and size on the x-axis, this means that the PD for g at PF = \u03b1\u2032 cannot be higher than that of\nthe LR-ROC. Since this is true at any \u03b1\u2032, the LR-ROC weakly dominates the ROC curve for g. \u25a1\n\n3.1 Practical considerations\n\nTo compute all likelihood ratios for the classi\ufb01er outcomes we need to know the probability distri-\nbutions P (Y|H0) and P (Y|H1). In practice these distributions need to be estimated. The simplest\nmethod is to run the base classi\ufb01ers on a training set and count occurrences of each outcome. It is\nlikely that some outcomes will not occur in the training set, or will occur only a small number of times.\nOur initial approach to deal with small or zero counts when estimating was to use add-one smooth-\ning. In our experiments, however, simple special-case treatment of zero counts always produced\nbetter results than smoothing, both on the training set and on the test set. 
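(As a concrete aside, the LR-ROC vertices can be computed directly from the two outcome distributions, whether exact or estimated. The sketch below is illustrative only, under assumptions of our own: dictionary-valued distributions, strictly positive P(y|H0), and distinct likelihood ratios; tied ratios would need to be merged into a single segment.)

```python
def lr_roc_vertices(p0, p1):
    """Vertices of the LR-ROC: accumulate outcome probabilities in order of
    decreasing likelihood ratio l(y) = p1[y]/p0[y]. Assumes p0[y] > 0 and
    distinct ratios. Illustrative sketch only."""
    order = sorted(p0, key=lambda y: p1[y] / p0[y], reverse=True)
    vertices, pf, pd = [(0.0, 0.0)], 0.0, 0.0
    for y in order:        # lowering tau past l(y) adds y to the "predict 1" region
        pf += p0[y]        # running Pr[l(Y) >= l(y) | H0]
        pd += p1[y]        # running Pr[l(Y) >= l(y) | H1]
        vertices.append((pf, pd))
    return vertices        # ends at (1, 1); varying gamma interpolates between vertices
```

On the distribution of Table 1b, for example, this yields the vertices (0.1, 0.5), (0.2, 0.7), and (0.7, 0.9) between (0, 0) and (1, 1), matching the three rules discussed in Section 4.2.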
See Section 5 for details.\n\nFurthermore, the optimal ROC curve may have a different likelihood ratio for each possible outcome\nfrom the n classi\ufb01ers, and therefore a different point in ROC space, so optimal ROC curves in general\nhave up to 2^n points. This implies an exponential (in the number of classi\ufb01ers) lower bound on the\nrunning time of any algorithm to compute the optimal ROC curve for a combination of classi\ufb01ers.\nFor a handful of classi\ufb01ers, such a bound is not problematic, but it is impractical to compute the\noptimal ROC curve for dozens or hundreds of classi\ufb01ers. (However, by computing and sorting the\nlikelihood ratios we avoid a 2^(2^n)-time search over all possible classi\ufb01cation functions.)\n\n4 Analysis\n\n4.1 The independent case\n\nIn this section we take an in-depth look at the case of two binary classi\ufb01ers f1 and f2 that are\nconditionally independent given the input\u2019s class, so that P (Y1, Y2|Hc) = P (Y1|Hc)P (Y2|Hc) for\nc \u2208 {0, 1} (this section is the only part of the paper in which we make any independence assump-\ntions). Since Y1 and Y2 are conditionally independent, we do not need the full joint distribution; we\nneed only the probabilities PD1, PF 1, PD2, and PF 2 to \ufb01nd the combined PD and PF . For example,\n\u2113(01) = ((1 \u2212 PD1)PD2)/((1 \u2212 PF 1)PF 2).\nThe assumption that f1 and f2 are conditionally independent and proper de\ufb01nes a partial ordering\non the likelihood ratio: \u2113(00) < \u2113(10) < \u2113(11) and \u2113(00) < \u2113(01) < \u2113(11). Without loss of\ngenerality, we assume \u2113(00) < \u2113(01) < \u2113(10) < \u2113(11).\n\nTable 1: Two probability distributions (outcomes written as Y1Y2).\n\n(a) P (Y1, Y2|H1): Pr[00] = 0.2, Pr[10] = 0.375, Pr[01] = 0.1, Pr[11] = 0.325;\n    P (Y1, Y2|H0): Pr[00] = 0.5, Pr[10] = 0.1, Pr[01] = 0.3, Pr[11] = 0.1.\n(b) P (Y1, Y2|H1): Pr[00] = 0.2, Pr[10] = 0.1, Pr[01] = 0.2, Pr[11] = 0.5;\n    P (Y1, Y2|H0): Pr[00] = 0.1, Pr[10] = 0.3, Pr[01] = 0.5, Pr[11] = 0.1. 
This ordering breaks the likelihood ratio\u2019s\nrange (0, \u221e) into \ufb01ve regions; choosing \u03c4 in each region de\ufb01nes a different decision rule.\nThe trivial cases 0 \u2264 \u03c4 < \u2113(00) and \u2113(11) < \u03c4 < \u221e correspond to always classifying as\n1 and 0, respectively; in the \ufb01rst case PD = PF = 1, and in the second PD = PF = 0.\nFor the case \u2113(00) \u2264 \u03c4 < \u2113(01), Pr [\u2113(Y) > \u03c4 ] = Pr [Y = 01 \u2228 Y = 10 \u2228 Y = 11] =\nPr [Y1 = 1 \u2228 Y2 = 1] . Thresholds in this range de\ufb01ne an OR rule for the classi\ufb01ers, with PD =\nPD1 + PD2 \u2212 PD1PD2 and PF = PF 1 + PF 2 \u2212 PF 1PF 2. For the case \u2113(01) \u2264 \u03c4 < \u2113(10), we\nhave Pr [\u2113(Y) > \u03c4 ] = Pr [Y = 10 \u2228 Y = 11] = Pr [Y1 = 1] . Therefore the performance proba-\nbilities are simply PD = PD1 and PF = PF 1. Finally, the case \u2113(10) \u2264 \u03c4 < \u2113(11) implies that\nPr [\u2113(Y) > \u03c4 ] = Pr [Y = 11] , and therefore thresholds in this range de\ufb01ne an AND rule, with\nPD = PD1PD2 and PF = PF 1PF 2. Figure 1a illustrates this analysis with an example.\nThe assumption of conditional independence is a suf\ufb01cient condition for ensuring that the AND and\nOR rules improve on the ROCCH for n classi\ufb01ers, as the following result shows.\n\nTheorem 2 If the distributions of the outputs of n proper binary classi\ufb01ers Y1, Y2, . . . , Yn are con-\nditionally independent given the instance class, then the points in ROC space for the rules AND\n(Y1 \u2227 Y2 \u2227 \u00b7 \u00b7 \u00b7 \u2227 Yn) and OR (Y1 \u2228 Y2 \u2228 \u00b7 \u00b7 \u00b7 \u2228 Yn) are strictly above the convex hull of the ROC\ncurves of the base classi\ufb01ers f1, . . . , fn. Furthermore, these Boolean rules belong to the LR-ROC.\n\nProof.\nThe likelihood ratio of the case when AND outputs 1 is given by \u2113(11 \u00b7 \u00b7 \u00b7 1) =\n(PD1PD2 \u00b7 \u00b7 \u00b7 PDn)/(PF 1PF 2 \u00b7 \u00b7 \u00b7 PF n). 
The likelihood ratio of the case when OR does not output 1\nis given by \u2113(00 \u00b7 \u00b7 \u00b7 0) = [(1 \u2212 PD1)(1 \u2212 PD2) \u00b7 \u00b7 \u00b7 (1 \u2212 PDn)]/[(1 \u2212 PF 1)(1 \u2212 PF 2) \u00b7 \u00b7 \u00b7 (1 \u2212 PF n)].\nNow recall that for proper classi\ufb01ers fi, PDi > PF i and thus (1 \u2212 PDi)/(1 \u2212 PF i) < 1 < PDi/PF i.\nIt is now clear that \u2113(00 \u00b7 \u00b7 \u00b7 0) is the smallest likelihood ratio and \u2113(11 \u00b7 \u00b7 \u00b7 1) is the largest likelihood\nratio, since any other \u2113(y) is obtained by replacing some factors PDi/PF i with (1 \u2212 PDi)/(1 \u2212 PF i),\nand therefore the OR\nand AND rules will always be part of the optimal set of decisions for conditionally independent clas-\nsi\ufb01ers. These rules are strictly above the ROCCH: because \u2113(11 \u00b7 \u00b7 \u00b7 1) > PD1/PF 1, and PD1/PF 1\nis the slope of the line from (0, 0) to the \ufb01rst point in the ROCCH (f1), the AND point must be\nabove the ROCCH. A similar argument holds for OR since \u2113(00 \u00b7 \u00b7 \u00b7 0) < (1 \u2212 PDn)/(1 \u2212 PF n). \u25a1\n\n4.2 Two examples\n\nWe return now to the general case with no independence assumptions. We present two example\ndistributions for the two-classi\ufb01er case that demonstrate interesting results.\n\nThe \ufb01rst distribution appears in Table 1a. The likelihood ratio values are \u2113(00) = 0.4, \u2113(10) = 3.75,\n\u2113(01) = 1/3, and \u2113(11) = 3.25, giving us \u2113(01) < \u2113(00) < \u2113(11) < \u2113(10). The three non-trivial\nrules correspond to the Boolean functions Y1 \u2228 \u00acY2, Y1, and Y1 \u2227 \u00acY2. Note that Y2 appears only\nnegatively despite being a proper classi\ufb01er, and both the AND and OR rules are sub-optimal.\n\nThe distribution for the second example appears in Table 1b. 
The likelihood ratios of the outcomes\nare \u2113(00) = 2.0, \u2113(10) = 1/3, \u2113(01) = 0.4, and \u2113(11) = 5, so \u2113(10) < \u2113(01) < \u2113(00) < \u2113(11)\nand the three points de\ufb01ning the optimal ROC curve are \u00acY1 \u2228 Y2, \u00ac(Y1 \u2295 Y2), and Y1 \u2227 Y2 (see\nFigure 1b). In this case, an XOR rule emerges from the likelihood ratio analysis.\n\nThese examples show that for true optimal results it is not suf\ufb01cient to use weighted voting rules\nw1Y1 + w2Y2 + \u00b7 \u00b7 \u00b7 + wnYn \u2265 \u03c4 , where w \u2208 (0, \u221e) (like some ensemble methods). Weighted\nvoting always has AND and OR rules in its ROC curve, so it cannot always express optimal rules.\n\nFigure 1: (a) ROC for two conditionally independent classi\ufb01ers. (b) ROC curve for the distributions\nin Table 1b. (c) Original ROC curve and optimal ROC curve for example in Section 4.4.\n\n4.3 Optimality of the ROCCH\n\nWe have seen that in some cases, rules exist with points strictly above the ROCCH. As the following\nresult shows, however, there are conditions under which the ROCCH is optimal.\n\nTheorem 3 Consider n classi\ufb01ers f1, . . . , fn. 
The convex hull of points (PF i, PDi) with (0, 0) and\n(1, 1) (the ROCCH) is an optimal ROC curve for the combination if (Yi = 1) \u21d2 (Yj = 1) for i < j\nand the following ordering holds: \u2113(00 \u00b7 \u00b7 \u00b7 0) < \u2113(00 \u00b7 \u00b7 \u00b7 01) < \u2113(00 \u00b7 \u00b7 \u00b7 011) < \u00b7 \u00b7 \u00b7 < \u2113(1 \u00b7 \u00b7 \u00b7 1).\n\nProof. The condition (Yi = 1) \u21d2 (Yj = 1) for i < j implies that we only need to consider n + 2\npoints in the ROC space (the two extra points are (0, 0) and (1, 1)) rather than 2^n. It also implies the\nfollowing conditions on the joint distribution: Pr[Y1 = 0 \u2227 \u00b7 \u00b7 \u00b7 \u2227 Yi = 0 \u2227 Yi+1 = 1 \u2227 \u00b7 \u00b7 \u00b7 \u2227 Yn =\n1|H0] = PF i+1 \u2212 PF i, and Pr[Y1 = 1 \u2227 \u00b7 \u00b7 \u00b7 \u2227 Yn = 1|H0] = PF 1. With these conditions\nand the ordering condition on the likelihood ratios, we have Pr[\u2113(Y) > \u2113(1 \u00b7 \u00b7 \u00b7 1)|H0] = 0 and\nPr[\u2113(Y) > \u2113(0 \u00b7 \u00b7 \u00b7 0 1 \u00b7 \u00b7 \u00b7 1)|H0] = PF i, where 0 \u00b7 \u00b7 \u00b7 0 1 \u00b7 \u00b7 \u00b7 1 denotes the outcome with i zeros\nfollowed by n \u2212 i ones. Therefore, \ufb01nding the optimal threshold of the likelihood ratio test for\nPF i\u22121 \u2264 \u03b1 < PF i, we get \u03c4* = \u2113(0 \u00b7 \u00b7 \u00b7 0 1 \u00b7 \u00b7 \u00b7 1) with i \u2212 1 zeros, and for PF i \u2264 \u03b1 < PF i+1,\nwe get \u03c4* = \u2113(0 \u00b7 \u00b7 \u00b7 0 1 \u00b7 \u00b7 \u00b7 1) with i zeros. This change in \u03c4* implies that the point PF i is part of\nthe LR-ROC. Setting \u03b1 = PF i (thus \u03c4* = \u2113(0 \u00b7 \u00b7 \u00b7 0 1 \u00b7 \u00b7 \u00b7 1) with i zeros and \u03b3* = 0) implies\nPr[\u2113(Y) > \u03c4*|H1] = PDi. \u25a1\n\nThe condition Yi = 1 \u21d2 Yj = 1 for i < j is the same inclusion condition Flach and Wu use\nfor repairing an ROC curve [2]. It intuitively represents the performance in ROC space of a single\nclassi\ufb01er with different operating points. 
The next section explores this relationship further.\n\n4.4 Repairing an ROC curve\n\nFlach and Wu give a voting technique to repair concavities in an ROC curve that generates operating\npoints above the ROCCH [2]. Their intuition is that points underneath the convex hull can be\nmirrored to appear above the convex hull in much the same way as an improper classi\ufb01er can be\nnegated to obtain a proper classi\ufb01er. Although their algorithm produces better ROC curves, their\nsolution will often yield curves with new concavities (see for example Flach and Wu\u2019s Figure 4 [2]).\nTheir algorithm has a similar purpose to ours, but theirs is a local greedy optimization technique,\nwhile our method performs a global search in order to \ufb01nd the best ROC curve.\n\nFigure 1c shows an example comparing their method to ours. Consider the following probabil-\nity distribution on a random variable Y \u2208 {0, 1}2: P ((00, 10, 01, 11)|H1) = (0.1, 0.3, 0.0, 0.6),\nP ((00, 10, 01, 11)|H0) = (0.5, 0.001, 0.4, 0.099). Flach and Wu\u2019s method assumes the original\nROC curve to be repaired has three models, or operating points: f1 predicts 1 when Y \u2208 {11}, f2\npredicts 1 when Y \u2208 {11, 01}, and f3 predicts 1 when Y \u2208 {11, 01, 10}. 
If we apply Flach and\nWu\u2019s repair algorithm, the point f2 is corrected to the point f\u20322; however, the operating points of f1\nand f3 remain the same.\n\nOur method improves on this result by ordering the likelihood ratios \u2113(01) < \u2113(00) < \u2113(11) < \u2113(10)\nand using that ordering to make three different rules: f\u20321 predicts 1 when Y \u2208 {10}, f\u20322 predicts 1\nwhen Y \u2208 {10, 11}, and f\u20323 predicts 1 when Y \u2208 {10, 11, 00}.\n\nFigure 2: Empirical ROC curves for experimental results on four UCI datasets: (a) adult, (b)\nhypothyroid, (c) sick-euthyroid, (d) sick.\n\n5 Experiments\n\nWe ran experiments to test the performance of our combining method on the adult, hypothyroid,\nsick-euthyroid, and sick datasets from the UCI machine learning repository [6]. We chose \ufb01ve base\nclassi\ufb01ers from the YALE machine learning platform [7]: PART (a decision list algorithm), SMO\n(Sequential Minimal Optimization), SimpleLogistic, VotedPerceptron, and Y-NaiveBayes. We used\ndefault settings for all classi\ufb01ers. 
The adult dataset has around 30,000 training points and 15,000\ntest points and the sick dataset has around 2000 training points and 700 test points. The others each\nhave around 2000 points that we split randomly into 1000 training and 1000 test.\n\nFor each experiment, we estimate the joint distribution by training the base classi\ufb01ers on a training\nset and counting the outcomes. We compute likelihood ratios for all outcomes and order them. When\noutcomes have no examples, we treat \u00b7/0 as near-in\ufb01nite and 0/\u00b7 as near-zero and de\ufb01ne 0/0 = 1.\n\nWe derive a sequence of decision rules from the likelihood ratios computed on the training set. We\ncan compute an optimal ROC curve for the combination by counting the number of true positives\nand false positives each rule achieves. In the test set we use the rules learned on the training set.\n\n5.1 Results\n\nThe ROC graphs for our four experiments appear in Figure 2. The ROC curves in these experiments\nall rise very quickly and then \ufb02atten out, so we show only the range of PF for which the values\nare interesting. We can draw some general conclusions from these graphs. First, PART clearly\noutperforms the other base classi\ufb01ers in three out of four experiments, though it seems to over\ufb01t\non the hypothyroid dataset. The LR-ROC dominates the ROC curves of the base classi\ufb01ers on both\ntraining and test sets. The ROC curves for the base classi\ufb01ers are all strictly below the LR-ROC\nin results on the test sets. The results on training sets seem to imply that the LR-ROC is primarily\nclassifying like PART with a small boost from the other classi\ufb01ers; however, the test set results (in\nparticular, Figure 2b) demonstrate that the LR-ROC generalizes better than the base classi\ufb01ers.\n\nThe robustness of our method to estimation errors is uncertain. 
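(As a concrete aside, the counting estimate just described, including the special-case treatment of zero counts, can be sketched as follows. This is illustrative only: the function name is ours, and `float('inf')` and `0.0` stand in for the near-infinite and near-zero values.)

```python
from collections import Counter

def estimate_likelihood_ratios(outcomes, labels, n):
    """Estimate l(y) for every outcome y in {0,1}^n by counting outcome
    frequencies per class on a training set; zero counts are handled as
    x/0 -> near-infinite, 0/x -> near-zero, 0/0 -> 1. Sketch only;
    assumes both classes appear in the labels."""
    c0 = Counter(y for y, c in zip(outcomes, labels) if c == 0)
    c1 = Counter(y for y, c in zip(outcomes, labels) if c == 1)
    n0, n1 = sum(c0.values()), sum(c1.values())
    ratios = {}
    for i in range(2 ** n):
        y = format(i, '0%db' % n)              # e.g. '01' for Y1 = 0, Y2 = 1
        num, den = c1[y] / n1, c0[y] / n0      # estimates of P(y|H1), P(y|H0)
        if num == 0 and den == 0:
            ratios[y] = 1.0                    # 0/0 defined as 1
        elif den == 0:
            ratios[y] = float('inf')           # stand-in for "near-infinite"
        else:
            ratios[y] = num / den              # includes 0/x -> 0.0 ("near-zero")
    return ratios
```

Sorting these estimated ratios then yields the sequence of decision rules, exactly as on exact distributions.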
In our experiments we found that\nsmoothing did not improve generalization, but undoubtedly our method would bene\ufb01t from better\nestimation of the outcome distribution and increased robustness.\n\nWe ran separate experiments to test how many classi\ufb01ers our method could support in practice.\nEstimation of the joint distribution and computation of the ROC curve \ufb01nished within one minute\nfor 20 classi\ufb01ers (not including time to train the individual classi\ufb01ers). Unfortunately, the inherent\nexponential structure of the optimal ROC curve means we cannot expect to do signi\ufb01cantly better\n(at the same rate, 30 classi\ufb01ers would take over 12 hours and 40 classi\ufb01ers almost a year and a half).\n\n6 Related work\n\nOur work is loosely related to ensemble methods such as bagging [8] and boosting [9] because\nit \ufb01nds meta-classi\ufb01cation rules over a set of base classi\ufb01ers. However, bagging and boosting each\ntake one base classi\ufb01er and train many times, resampling or reweighting the training data to generate\nclassi\ufb01er diversity [10] or increase the classi\ufb01cation margin [11]. The decision rules applied to\nthe generated classi\ufb01ers are (weighted) majority voting. In contrast, our method takes any binary\nclassi\ufb01ers and \ufb01nds optimal combination rules from the more general space of all binary functions.\n\nRanking algorithms, such as RankBoost [12], approach the problem of ranking points by score or\npreference. Although we present our methods in a different light, our decision rule can be interpreted\nas a ranking algorithm. RankBoost, however, is a boosting algorithm and therefore fundamentally\ndifferent from our approach. Ranking can be used for classi\ufb01cation by choosing a cutoff or threshold,\nand in fact ranking algorithms tend to optimize the common Area Under the ROC Curve (AUC)\nmetric. 
Although our method may have the side effect of maximizing the AUC, its formulation is\ndifferent in that instead of optimizing a single global metric, it is a constrained optimization problem,\nmaximizing PD for each PF .\nAnother more similar method for combining classi\ufb01ers is stacking [13]. Stacking trains a meta-\nlearner to combine the predictions of several base classi\ufb01ers; in fact, our method might be consid-\nered a stacking method with a particular meta-classi\ufb01er. It can be dif\ufb01cult to show the improvement\nof stacking in general over selecting the best base-level classi\ufb01er [14]; however, stacking has a use-\nful interpretation as generalized cross-validation that makes it practical. Our analysis shows that our\ncombination method is the optimal meta-learner in the Neyman-Pearson sense, but incorporating the\nmodel validation aspect of stacking would make an interesting extension to our work.\n\n7 Conclusion\n\nIn this paper we introduce a new way to analyze a combination of classi\ufb01ers and their ROC curves.\nWe give a method for combining classi\ufb01ers and prove that it is optimal in the Neyman-Pearson\nsense. This work raises several interesting questions.\n\nAlthough the algorithm presented in this paper avoids checking the whole doubly exponential number\nof rules, the exponential factor in running time limits the number of classi\ufb01ers that can be\ncombined in practice. Can a good approximation algorithm approach optimality while having lower\ntime complexity? 
Though in general we make no assumptions about independence, Theorem 2\nshows that certain simple rules are optimal when we do know that the classi\ufb01ers are independent.\nTheorem 3 proves that the ROCCH can be optimal when only n output combinations are possible.\nPerhaps other restrictions on the distribution of outcomes can lead to useful special cases.\n\nAcknowledgments\n\nThis work was supported in part by TRUST (Team for Research in Ubiquitous Secure Technology),\nwhich receives support from the National Science Foundation (NSF award number CCF-0424422)\nand the following organizations: AFOSR (#FA9550-06-1-0244), Cisco, British Telecom, ESCHER,\nHP, IBM, iCAST, Intel, Microsoft, ORNL, Pirelli, Qualcomm, Sun, Symantec, Telecom Italia, and\nUnited Technologies; and in part by the UC Berkeley-Taiwan International Collaboration in Ad-\nvanced Security Technologies (iCAST) program. The opinions expressed in this paper are solely\nthose of the authors and do not necessarily re\ufb02ect the opinions of any funding agency or the U.S. or\nTaiwanese governments.\n\nReferences\n\n[1] Foster Provost and Tom Fawcett. Robust classi\ufb01cation for imprecise environments. Machine Learning\nJournal, 42(3):203\u2013231, March 2001.\n\n[2] Peter A. Flach and Shaomin Wu. Repairing concavities in ROC curves. In Proceedings of the 19th\nInternational Joint Conference on Arti\ufb01cial Intelligence (IJCAI\u201905), pages 702\u2013707, August 2005.\n\n[3] Tom Fawcett. ROC graphs: Notes and practical considerations for data mining researchers. Technical\nReport HPL-2003-4, HP Laboratories, Palo Alto, CA, January 2003. Updated March 2004.\n\n[4] J. Neyman and E. S. Pearson. On the problem of the most ef\ufb01cient tests of statistical hypotheses. Philo-\nsophical Transactions of the Royal Society of London, Series A, Containing Papers of a Mathematical or\nPhysical Character, 231:289\u2013337, 1933.\n\n[5] H. Vincent Poor. 
An Introduction to Signal Detection and Estimation. Springer-Verlag, second edition, 1988.\n\n[6] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/\u223cmlearn/MLRepository.html.\n\n[7] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler. YALE: Rapid prototyping for complex data mining tasks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006.\n\n[8] L. Breiman. Bagging predictors. Machine Learning, 24(2):123\u2013140, 1996.\n\n[9] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Thirteenth International Conference on Machine Learning, pages 148\u2013156, Bari, Italy, 1996. Morgan Kaufmann.\n\n[10] Thomas G. Dietterich. Ensemble methods in machine learning. Lecture Notes in Computer Science, 1857:1\u201315, 2000.\n\n[11] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651\u20131686, October 1998.\n\n[12] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An ef\ufb01cient boosting algorithm for combining preferences. Journal of Machine Learning Research (JMLR), 4:933\u2013969, 2003.\n\n[13] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241\u2013259, 1992.\n\n[14] Saso D\u017eeroski and Bernard \u017denko. Is combining classi\ufb01ers with stacking better than selecting the best one? Machine Learning, 54:255\u2013273, 2004.\n", "award": [], "sourceid": 429, "authors": [{"given_name": "Marco", "family_name": "Barreno", "institution": null}, {"given_name": "Alvaro", "family_name": "Cardenas", "institution": null}, {"given_name": "J. D.", "family_name": "Tygar", "institution": null}]}