{"title": "On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 2913, "page_last": 2921, "abstract": "We investigate the relationship between three fundamental problems in machine learning: binary classification, bipartite ranking, and binary class probability estimation (CPE). It is known that a good binary CPE model can be used to obtain a good binary classification model (by thresholding at 0.5), and also to obtain a good bipartite ranking model (by using the CPE model directly as a ranking model); it is also known that a binary classification model does not necessarily yield a CPE model. However, not much is known about other directions. Formally, these relationships involve regret transfer bounds. In this paper, we introduce the notion of weak regret transfer bounds, where the mapping needed to transform a model from one problem to another depends on the underlying probability distribution (and in practice, must be estimated from data). We then show that, in this weaker sense, a good bipartite ranking model can be used to construct a good classification model (by thresholding at a suitable point), and more surprisingly, also to construct a good binary CPE model (by calibrating the scores of the ranking model).", "full_text": "On the Relationship Between Binary Classi\ufb01cation,\nBipartite Ranking, and Binary Class Probability\n\nEstimation\n\nShivani Agarwal\nHarikrishna Narasimhan\nDepartment of Computer Science and Automation\nIndian Institute of Science, Bangalore 560012, India\n\n{harikrishna,shivani}@csa.iisc.ernet.in\n\nAbstract\n\nWe investigate the relationship between three fundamental problems in machine\nlearning: binary classi\ufb01cation, bipartite ranking, and binary class probability esti-\nmation (CPE). 
It is known that a good binary CPE model can be used to obtain a\ngood binary classi\ufb01cation model (by thresholding at 0.5), and also to obtain a good\nbipartite ranking model (by using the CPE model directly as a ranking model); it\nis also known that a binary classi\ufb01cation model does not necessarily yield a CPE\nmodel. However, not much is known about other directions. Formally, these rela-\ntionships involve regret transfer bounds. In this paper, we introduce the notion of\nweak regret transfer bounds, where the mapping needed to transform a model from\none problem to another depends on the underlying probability distribution (and in\npractice, must be estimated from data). We then show that, in this weaker sense, a\ngood bipartite ranking model can be used to construct a good classi\ufb01cation model\n(by thresholding at a suitable point), and more surprisingly, also to construct a\ngood binary CPE model (by calibrating the scores of the ranking model).\n\nIntroduction\n\n1\nLearning problems with binary labels, where one is given training examples consisting of objects\nwith binary labels (such as emails labeled spam/non-spam or documents labeled relevant/irrelevant),\nare widespread in machine learning. These include for example the three fundamental problems of\nbinary classi\ufb01cation, where the goal is to learn a classi\ufb01cation model which, when given a new\nobject, can predict its label; bipartite ranking, where the goal is to learn a ranking model that can\nrank new objects such that those in one category are ranked higher than those in the other; and\nbinary class probability estimation (CPE), where the goal is to learn a CPE model which, when\ngiven a new object, can estimate the probability of its belonging to each of the two classes. 
Of these, binary classification is classical, although several fundamental questions related to binary classification have been understood only relatively recently [1-4]; bipartite ranking is more recent and has received much attention in recent years [5-8]; and binary CPE, while a classical problem, also continues to be actively investigated [9, 10]. All three problems abound in applications, ranging from email classification to document retrieval and computer vision to medical diagnosis.

It is well known that a good binary CPE model can be used to obtain a good binary classification model (in a formal sense that we will detail below; specifically, in terms of regret transfer bounds) [4, 11]; more recently, it was shown that a good binary CPE model can also be used to obtain a good bipartite ranking model (again, in terms of regret transfer bounds, to be detailed below) [12]. It is also known that a binary classification model cannot necessarily be converted to a CPE model.^1 However, beyond this, not much is understood about the exact relationship between these problems.^2

^1 Note that we start from a single classification model, which rules out the probing reduction of [13].
^2 There are some results suggesting equivalence between specific boosting-style classification and ranking algorithms [14, 15], but this does not say anything about relationships between the problems per se.

Figure 1: (a) Current state of knowledge; (b) State of knowledge after the results of this paper. Here 'S' denotes a 'strong' regret transfer relationship; 'W' denotes a 'weak' regret transfer relationship.

In this paper, we introduce the notion of weak regret transfer bounds, where the mapping needed to transform a model from one problem to another depends on the underlying probability distribution (and in practice, must be estimated from data). 
We then show such weak regret transfer bounds\n(under mild technical conditions) from bipartite ranking to binary classi\ufb01cation, and from bipartite\nranking to binary CPE. Speci\ufb01cally, we show that, given a good bipartite ranking model and access\nto either the distribution or a sample from it, one can estimate a suitable threshold and convert the\nranking model into a good binary classi\ufb01cation model; similarly, given a good bipartite ranking\nmodel and access to the distribution or a sample, one can \u2018calibrate\u2019 the ranking model to construct\na good binary CPE model. Though weak, the regret bounds are non-trivial in the sense that the\nsample size required for constructing a good classi\ufb01cation or CPE model from an existing ranking\nmodel is smaller than what might be required to learn such models from scratch.\nThe main idea in transforming a ranking model to a classi\ufb01er is to \ufb01nd a threshold that minimizes the\nexpected classi\ufb01cation error on the distribution, or the empirical classi\ufb01cation error on the sample.\nWe derive these results for cost-sensitive classi\ufb01cation with any cost parameter c. The main idea\nin transforming a ranking model to a CPE model is to \ufb01nd a monotonically increasing function\nfrom R to [0, 1] which, when applied to the ranking model, minimizes the expected CPE error on\nthe distribution, or the empirical CPE error on the sample; this is similar to the idea of isotonic\nregression [16\u201319]. The proof here makes use of a recent result of [20] which relates the squared\nerror of a calibrated CPE model to classi\ufb01cation errors over uniformly drawn costs, and a result on\nthe Rademacher complexity of a class of bounded monotonically increasing functions on R [21]. 
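The threshold-finding step described above (sort by score, then pick the cut that minimizes the empirical cost-sensitive classification error on the sample) can be sketched as follows. This is an illustrative Python reconstruction, not the authors' code; `optimal_threshold` is a hypothetical helper name. A single sort followed by a prefix-sum sweep evaluates every candidate cut between adjacent sorted scores.

```python
import numpy as np

def optimal_threshold(scores, labels, c=0.5):
    """Empirical-error-minimizing threshold on ranking scores.

    labels are in {-1, +1}; a false positive costs c and a false
    negative costs (1 - c). Sorting once and sweeping prefix sums
    evaluates all n + 1 candidate cuts in O(n log n) total time.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)
    s, y = scores[order], labels[order]
    n = len(s)
    # pos_left[k] / neg_left[k]: positives / negatives among the k lowest scores.
    pos_left = np.concatenate(([0], np.cumsum(y == 1)))
    neg_left = np.concatenate(([0], np.cumsum(y == -1)))
    n_neg = neg_left[-1]
    # Cutting after position k predicts -1 on the k lowest scores, +1 on the rest:
    # false negatives = pos_left[k], false positives = n_neg - neg_left[k].
    errs = ((1 - c) * pos_left + c * (n_neg - neg_left)) / n
    k = int(np.argmin(errs))
    if k == 0:
        return s[0] - 1.0            # predict +1 everywhere
    if k == n:
        return s[-1] + 1.0           # predict -1 everywhere
    return 0.5 * (s[k - 1] + s[k])   # midpoint between adjacent sorted scores
```

Classification is then h(x) = sign(f(x) - t); re-running the sweep with a different cost parameter c re-purposes the same ranking model for new costs, which is the point of the weak transfer.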
As\na by-product of our analysis, we also obtain a weak regret transfer bound from bipartite ranking to\nproblems involving the area under the cost curve [22] as a performance measure.\nThe relationships between the three problems \u2013 both those previously known and those established\nin this paper \u2013 are summarized in Figure 1. As noted above, in a weak regret transfer relationship,\ngiven a model for one type of problem, one needs access to a data sample in order to transform this\nto a model for another problem. This is in contrast to the previous \u2018strong\u2019 relationships, where a\nbinary CPE model can simply be thresholded at 0.5 (or cost c) to yield a classi\ufb01cation model, or can\nsimply be used directly as a ranking model. Nevertheless, even with the weak relationships, one still\ngets that a statistically consistent algorithm for bipartite ranking can be converted into a statistically\nconsistent algorithm for binary classi\ufb01cation or for binary CPE. Moreover, as we demonstrate in our\nexperiments, if one has access to a good ranking model and only a small additional sample, then\none is better off using this sample to transform the ranking model into a classi\ufb01cation or CPE model\nrather than using the limited sample to learn a classi\ufb01cation or CPE model from scratch.\nThe paper is structured as follows. We start with some preliminaries and background in Section 2.\nSections 3 and 4 give our main results, namely weak regret transfer bounds from bipartite ranking\nto binary classi\ufb01cation, and from bipartite ranking to binary CPE, respectively. Section 5 gives\nexperimental results on both synthetic and real data. All proofs are included in the appendix.\n\n2 Preliminaries and Background\nLet X be an instance space and let D be a probability distribution on X \u00d7 {\u00b11}. For (x, y) \u223c D,\nwe denote \u03b7(x) = P(y = 1| x) and p = P(y = 1). 
In the settings we are interested in, given a training sample S = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × {±1})^n with examples drawn iid from D, the goal is to learn a binary classification model, a bipartite ranking model, or a binary CPE model. In what follows, for u ∈ [−∞, ∞], we will denote sign(u) = 1 if u > 0 and −1 otherwise, and \overline{sign}(u) = 1 if u ≥ 0 and −1 otherwise.

(Cost-Sensitive) Binary Classification. Here the goal is to learn a model h : X → {±1}. Typically, one is interested in models h with small expected 0-1 classification error:
\[ \mathrm{er}^{0\text{-}1}_D[h] = \mathbf{E}_{(x,y)\sim D}\big[\mathbf{1}(h(x) \neq y)\big], \]
where 1(·) is 1 if its argument is true and 0 otherwise; this is simply the probability that h misclassifies an instance drawn randomly from D. The optimal 0-1 error (Bayes error) is
\[ \mathrm{er}^{0\text{-}1,*}_D = \inf_{h : X \to \{\pm 1\}} \mathrm{er}^{0\text{-}1}_D[h] = \mathbf{E}_x\big[\min\big(\eta(x),\, 1 - \eta(x)\big)\big]; \]
this is achieved by the Bayes classifier h*(x) = sign(η(x) − 1/2). The 0-1 classification regret of a classifier h is then regret^{0-1}_D[h] = er^{0-1}_D[h] − er^{0-1,*}_D. More generally, in a cost-sensitive binary classification problem with cost parameter c ∈ (0, 1), where the cost of a false positive is c and that of a false negative is (1 − c), one is interested in models h with small cost-sensitive 0-1 error:
\[ \mathrm{er}^{0\text{-}1,c}_D[h] = \mathbf{E}_{(x,y)\sim D}\big[(1-c)\,\mathbf{1}(y = 1, h(x) = -1) + c\,\mathbf{1}(y = -1, h(x) = 1)\big]. \]
Note that for c = 1/2, we get er^{0-1,1/2}_D[h] = (1/2) er^{0-1}_D[h]. The optimal cost-sensitive 0-1 error for cost parameter c can then be seen to be
\[ \mathrm{er}^{0\text{-}1,c,*}_D = \inf_{h : X \to \{\pm 1\}} \mathrm{er}^{0\text{-}1,c}_D[h] = \mathbf{E}_x\big[\min\big((1-c)\eta(x),\, c(1 - \eta(x))\big)\big]; \]
this is achieved by the classifier h*_c(x) = sign(η(x) − c). The c-cost-sensitive regret of a classifier h is then regret^{0-1,c}_D[h] = er^{0-1,c}_D[h] − er^{0-1,c,*}_D.

Bipartite Ranking. Here one wants to learn a ranking model f : X → R that assigns higher scores to positive instances than to negative ones. Specifically, the goal is to learn a ranking function f with small bipartite ranking error:
\[ \mathrm{er}^{\mathrm{rank}}_D[f] = \mathbf{E}\Big[\mathbf{1}\big((y - y')(f(x) - f(x')) < 0\big) + \tfrac{1}{2}\,\mathbf{1}\big(f(x) = f(x')\big) \,\Big|\, y \neq y'\Big], \]
where (x, y), (x', y') are assumed to be drawn iid from D; this is the probability that a randomly drawn pair of instances with different labels is mis-ranked by f, with ties broken uniformly at random. It is known that the ranking error of f is equivalent to one minus the area under the ROC curve (AUC) of f [5-7]. The optimal ranking error can be seen to be
\[ \mathrm{er}^{\mathrm{rank},*}_D = \inf_{f : X \to \mathbf{R}} \mathrm{er}^{\mathrm{rank}}_D[f] = \frac{1}{2p(1-p)}\,\mathbf{E}_{x,x'}\Big[\min\big(\eta(x)(1 - \eta(x')),\, \eta(x')(1 - \eta(x))\big)\Big]; \]
this is achieved by any function f* that is a strictly monotonically increasing transformation of η. The ranking regret of a ranking function f is given by regret^{rank}_D[f] = er^{rank}_D[f] − er^{rank,*}_D.

Binary Class Probability Estimation (CPE). The goal here is to learn a class probability estimator or CPE model η̂ : X → [0, 1] with small squared error (relative to labels converted to {0, 1}):
\[ \mathrm{er}^{\mathrm{sq}}_D[\hat\eta] = \mathbf{E}_{(x,y)\sim D}\Big[\big(\hat\eta(x) - \tfrac{y+1}{2}\big)^2\Big]. \]
The optimal squared error can be seen to be
\[ \mathrm{er}^{\mathrm{sq},*}_D = \inf_{\hat\eta : X \to [0,1]} \mathrm{er}^{\mathrm{sq}}_D[\hat\eta] = \mathrm{er}^{\mathrm{sq}}_D[\eta] = \mathbf{E}_x\big[\eta(x)(1 - \eta(x))\big]. \]
The squared-error regret of a CPE model η̂ can be seen to be
\[ \mathrm{regret}^{\mathrm{sq}}_D[\hat\eta] = \mathrm{er}^{\mathrm{sq}}_D[\hat\eta] - \mathrm{er}^{\mathrm{sq},*}_D = \mathbf{E}_x\big[\big(\hat\eta(x) - \eta(x)\big)^2\big]. \]

Regret Transfer Bounds. The following (strong) regret transfer results from binary CPE to binary classification and from binary CPE to bipartite ranking are known:

Theorem 1 ([4, 11]). Let η̂ : X → [0, 1]. Let c ∈ (0, 1). Then the classifier h(x) = sign(η̂(x) − c) obtained by thresholding η̂ at c satisfies
\[ \mathrm{regret}^{0\text{-}1,c}_D\big[\mathrm{sign} \circ (\hat\eta - c)\big] \le \mathbf{E}_x\big[|\hat\eta(x) - \eta(x)|\big] \le \sqrt{\mathrm{regret}^{\mathrm{sq}}_D[\hat\eta]}\,. \]

Theorem 2 ([12]). Let η̂ : X → [0, 1]. Then using η̂ as a ranking model yields
\[ \mathrm{regret}^{\mathrm{rank}}_D[\hat\eta] \le \frac{1}{p(1-p)}\,\mathbf{E}_x\big[|\hat\eta(x) - \eta(x)|\big] \le \frac{1}{p(1-p)}\sqrt{\mathrm{regret}^{\mathrm{sq}}_D[\hat\eta]}\,. \]

Note that as a consequence of these results, one gets that any learning algorithm that is statistically consistent for binary CPE, i.e. 
whose squared-error regret converges in probability to zero as the training sample size n → ∞, can easily be converted into an algorithm that is statistically consistent for binary classification (with any cost parameter c, by thresholding the CPE models learned by the algorithm at c), or into an algorithm that is statistically consistent for bipartite ranking (by using the learned CPE models directly for ranking).

3 Regret Transfer Bounds from Bipartite Ranking to Binary Classification
In this section, we derive weak regret transfer bounds from bipartite ranking to binary classification. We derive two bounds. The first holds in an idealized setting where one is given a ranking model f as well as access to the distribution D for finding a suitable threshold to construct the classifier. The second bound holds in a setting where one is given a ranking model f and a data sample S drawn iid from D for finding a suitable threshold; this bound holds with high probability over the draw of S. Our results will require the following assumption on the distribution D and ranking model f:

Assumption A. Let D be a probability distribution on X × {±1} with marginal distribution µ on X. Let f : X → R be a ranking model, and let µ_f denote the induced distribution of scores f(x) ∈ R when x ∼ µ. We say (D, f) satisfies Assumption A if µ_f is either discrete, continuous, or mixed with at most finitely many point masses.

We will find it convenient to define the following set of all increasing functions from R to {±1}:
\[ \mathcal{T}_{\mathrm{inc}} = \big\{ \theta : \mathbf{R} \to \{\pm 1\} \,:\, \theta(u) = \mathrm{sign}(u - t) \text{ or } \theta(u) = \overline{\mathrm{sign}}(u - t) \text{ for some } t \in [-\infty, \infty] \big\}. \]

Definition 3 (Optimal classification transform). For any ranking model f : X → R, cost parameter c ∈ (0, 1), and probability distribution D over X × {±1} such that (D, f) satisfies Assumption A, define an optimal classification transform Thresh_{D,f,c} as any increasing function from R to {±1} such that the classifier h(x) = Thresh_{D,f,c}(f(x)) resulting from composing f with Thresh_{D,f,c} yields minimum cost-sensitive 0-1 error on D:
\[ \mathrm{Thresh}_{D,f,c} \in \operatorname{argmin}_{\theta \in \mathcal{T}_{\mathrm{inc}}} \big\{ \mathrm{er}^{0\text{-}1,c}_D[\theta \circ f] \big\}. \quad \text{(OP1)} \]
We note that when f is the class probability function η, we have Thresh_{D,η,c}(u) = sign(u − c).

Theorem 4 (Idealized weak regret transfer bound from bipartite ranking to binary classification based on distribution). Let (D, f) satisfy Assumption A. Let c ∈ (0, 1). Then the classifier h(x) = Thresh_{D,f,c}(f(x)) satisfies
\[ \mathrm{regret}^{0\text{-}1,c}_D\big[\mathrm{Thresh}_{D,f,c} \circ f\big] \le \sqrt{2p(1-p)\,\mathrm{regret}^{\mathrm{rank}}_D[f]}\,. \]

In practice, one does not have access to the distribution D, and the optimal threshold must be estimated from a data sample. To this end, we define the following:

Definition 5 (Optimal sample-based threshold). For any ranking model f : X → R, cost parameter c ∈ (0, 1), and sample S ∈ ∪_{n=1}^∞ (X × {±1})^n, define an optimal sample-based threshold t̂_{S,f,c} as any threshold on f such that the resulting classifier h(x) = sign(f(x) − t̂_{S,f,c}) yields minimum cost-sensitive 0-1 error on S:
\[ \hat{t}_{S,f,c} \in \operatorname{argmin}_{t \in \mathbf{R}} \big\{ \mathrm{er}^{0\text{-}1,c}_S\big[\mathrm{sign} \circ (f - t)\big] \big\}, \quad \text{(OP2)} \]
where er^{0-1,c}_S[h] denotes the c-cost-sensitive 0-1 error of a classifier h on the empirical distribution associated with S (i.e. the uniform distribution over examples in S).

Note that given a ranking function f, cost parameter c, and a sample S of size n, the optimal sample-based threshold t̂_{S,f,c} can be computed in O(n ln n) time by sorting the examples (x_i, y_i) in S based on the scores f(x_i) and evaluating at most n + 1 distinct thresholds lying between adjacent score values (and above/below all score values) in this sorted order.

Theorem 6 (Sample-based weak regret transfer bound from bipartite ranking to binary classification). Let D be any probability distribution on X × {±1} and f : X → R be any fixed ranking model such that (D, f) satisfies Assumption A. Let S ∈ (X × {±1})^n be drawn randomly according to D^n. Let c ∈ (0, 1). Let 0 < δ ≤ 1. Then with probability at least 1 − δ (over the draw of S ∼ D^n), the classifier h(x) = sign(f(x) − t̂_{S,f,c}) obtained by thresholding f at t̂_{S,f,c} satisfies
\[ \mathrm{regret}^{0\text{-}1,c}_D\big[\mathrm{sign} \circ (f - \hat{t}_{S,f,c})\big] \le \sqrt{2p(1-p)\,\mathrm{regret}^{\mathrm{rank}}_D[f]} + \sqrt{\frac{32\big(2(\ln(2n) + 1) + \ln(4/\delta)\big)}{n}}\,. \]

The proof of Theorem 6 involves an application of the result in Theorem 4 together with a standard VC-dimension based uniform convergence result; specifically, the proof makes use of the fact that selecting the sample-based threshold in (OP2) is equivalent to empirical risk minimization over T_inc. Note in particular that the above regret transfer bound, though 'weak', is non-trivial in that it suggests a good classifier can be constructed from a good ranking model using far fewer examples than might be required for learning a classifier from scratch based on standard VC-dimension bounds.

Remark 7. 
We note that, as a consequence of Theorem 6, one can use any learning algorithm that is statistically consistent for bipartite ranking to construct an algorithm that is consistent for (cost-sensitive) binary classification as follows: divide the training data into two (say equal) parts, use one part for learning a ranking model using the consistent ranking algorithm, and the other part for selecting a threshold on the learned ranking model; both terms in Theorem 6 will then go to zero as the training sample size increases, yielding consistency for (cost-sensitive) binary classification.

Remark 8. Another implication of the above result is a justification for the use of the AUC as a surrogate performance measure when learning in cost-sensitive classification settings where the misclassification costs are unknown during training time [23]. Here, instead of learning a classifier that minimizes the cost-sensitive classification error for a fixed cost parameter that may turn out to be incorrect, one can learn a ranking function with good ranking performance (in terms of AUC), and then later use a small additional sample to select a suitable threshold once the misclassification costs are known; the above result provides guarantees on the resulting classification performance in terms of the ranking (AUC) performance of the learned model.

4 Regret Transfer Bounds from Bipartite Ranking to Binary CPE
We now derive weak regret transfer bounds from bipartite ranking to binary CPE. Again, we derive two bounds: the first holds in an idealized setting where one is given a ranking model f as well as access to the distribution D for finding a suitable conversion to a CPE model; the second, which is a high-probability bound, holds in a setting where one is given a ranking model f and a data sample S drawn iid from D for finding a suitable conversion. We will need the following definition:

Definition 9 (Calibrated CPE model). A binary CPE model η̂ : X → [0, 1] is said to be calibrated w.r.t. a probability distribution D on X × {±1} if
\[ \mathbf{P}\big(y = 1 \,\big|\, \hat\eta(x) = u\big) = u \quad \forall u \in \mathrm{range}(\hat\eta), \]
where range(η̂) denotes the range of η̂.

We will make use of the following result, which follows from results in [20] and shows that the squared error of a calibrated CPE model is related to the expected cost-sensitive error of a classifier constructed using the optimal threshold in Definition 3, over uniform costs in (0, 1):

Theorem 10 ([20]). Let η̂ : X → [0, 1] be a binary CPE model that is calibrated w.r.t. D. Then
\[ \mathrm{er}^{\mathrm{sq}}_D[\hat\eta] = 2\,\mathbf{E}_{c \sim U(0,1)}\big[ \mathrm{er}^{0\text{-}1,c}_D[\mathrm{Thresh}_{D,\hat\eta,c} \circ \hat\eta] \big], \]
where U(0, 1) is the uniform distribution over (0, 1) and Thresh_{D,η̂,c} is as defined in Definition 3.

The proof of Theorem 10 follows from the fact that for any CPE model η̂ that is calibrated w.r.t. D, the optimal classification transform is given by Thresh_{D,η̂,c}(u) = sign(u − c), thus generalizing a similar result noted earlier for the true class probability function η.

We then have the following result, which shows that for a calibrated CPE model η̂ : X → [0, 1], one can upper bound the squared-error regret in terms of the bipartite ranking regret; this result follows directly from Theorem 10 and Theorem 4:

Lemma 11 (Regret transfer bound for calibrated CPE models). Let η̂ : X → [0, 1] be a binary CPE model that is calibrated w.r.t. D. Then
\[ \mathrm{regret}^{\mathrm{sq}}_D[\hat\eta] \le \sqrt{8p(1-p)\,\mathrm{regret}^{\mathrm{rank}}_D[\hat\eta]}\,. \]

We are now ready to describe the construction of the optimal CPE transform in the idealized setting. We will find it convenient to define the following set:
\[ \mathcal{G}_{\mathrm{inc}} = \big\{ g : \mathbf{R} \to [0,1] \,:\, g \text{ is a monotonically increasing function} \big\}. \]

Definition 12 (Optimal CPE transform). Let f : X → [a, b] (where a, b ∈ R, a < b) be any bounded-range ranking model and D be any probability distribution over X × {±1} such that (D, f) satisfies Assumption A. Moreover assume that µ_f (see Assumption A), if mixed, does not have a point mass at the end-points a, b, and that the function η_f : [a, b] → [0, 1] defined as η_f(t) = P(y = 1 | f(x) = t) is square-integrable w.r.t. the density of the continuous part of µ_f. Define an optimal CPE transform Cal_{D,f} as any monotonically increasing function from R to [0, 1] such that the CPE model η̂(x) = Cal_{D,f}(f(x)) resulting from composing f with Cal_{D,f} yields minimum squared error on D (see appendix for existence of Cal_{D,f} under these conditions):
\[ \mathrm{Cal}_{D,f} \in \operatorname{argmin}_{g \in \mathcal{G}_{\mathrm{inc}}} \big\{ \mathrm{er}^{\mathrm{sq}}_D[g \circ f] \big\}. \quad \text{(OP3)} \]

Lemma 13 (Properties of Cal_{D,f}). Let (D, f) satisfy the conditions of Definition 12. Then
1. (Cal_{D,f} ∘ f) is calibrated w.r.t. D.
2. er^{rank}_D[Cal_{D,f} ∘ f] ≤ er^{rank}_D[f].

The proof of Lemma 13 is based on equivalent results for the minimizer of a sample version of (OP3) [24, 25]. Combining this with Lemma 11 immediately gives the following result:

Theorem 14 (Idealized weak regret transfer bound from bipartite ranking to binary CPE based on distribution). Let (D, f) satisfy the conditions of Definition 12. 
Then the CPE model η̂(x) = Cal_{D,f}(f(x)) obtained by composing f with Cal_{D,f} satisfies
\[ \mathrm{regret}^{\mathrm{sq}}_D\big[\mathrm{Cal}_{D,f} \circ f\big] \le \sqrt{8p(1-p)\,\mathrm{regret}^{\mathrm{rank}}_D[f]}\,. \]

We now derive a sample version of the above result.

Definition 15 (Optimal sample-based CPE transform). For any ranking model f : X → R and sample S ∈ ∪_{n=1}^∞ (X × {±1})^n, define an optimal sample-based transform Ĉal_{S,f} as any monotonically increasing function from R to [0, 1] such that the CPE model η̂(x) = Ĉal_{S,f}(f(x)) resulting from composing f with Ĉal_{S,f} yields minimum squared error on S:
\[ \widehat{\mathrm{Cal}}_{S,f} \in \operatorname{argmin}_{g \in \mathcal{G}_{\mathrm{inc}}} \big\{ \mathrm{er}^{\mathrm{sq}}_S[g \circ f] \big\}, \quad \text{(OP4)} \]
where er^{sq}_S[η̂] denotes the squared error of a CPE model η̂ on the empirical distribution associated with S (i.e. the uniform distribution over examples in S).

The above optimization problem corresponds to the well-known isotonic regression problem and can be solved in O(n ln n) time using the pool adjacent violators (PAV) algorithm [16] (the PAV algorithm outputs a score in [0, 1] for each instance in S such that these scores preserve the ordering of f; a straightforward interpolation of the scores then yields a monotonically increasing function of f). We then have the following sample-based weak regret transfer result:

Theorem 16 (Sample-based weak regret transfer bound from bipartite ranking to binary CPE). Let D be any probability distribution on X × {±1} and f : X → [a, b] be any fixed ranking model such that (D, f) satisfies the conditions of Definition 12. Let S ∈ (X × {±1})^n be drawn randomly according to D^n. Let 0 < δ ≤ 1. Then with probability at least 1 − δ (over the draw of S ∼ D^n), the CPE model η̂(x) = Ĉal_{S,f}(f(x)) obtained by composing f with Ĉal_{S,f} satisfies
\[ \mathrm{regret}^{\mathrm{sq}}_D\big[\widehat{\mathrm{Cal}}_{S,f} \circ f\big] \le \sqrt{8p(1-p)\,\mathrm{regret}^{\mathrm{rank}}_D[f]} + 96\sqrt{\frac{2\ln(n)}{n}} + 2\sqrt{\frac{2\ln(8/\delta)}{n}}\,. \]

The proof of Theorem 16 involves an application of the idealized result in Theorem 14, together with a standard uniform convergence argument based on Rademacher averages applied to the function class G_inc; for this, we make use of a result on the Rademacher complexity of this class [21].

Remark 17. As in the case of binary classification, we note that, as a consequence of Theorem 16, one can use any learning algorithm that is statistically consistent for bipartite ranking to construct an algorithm that is consistent for binary CPE as follows: divide the training data into two (say equal) parts, use one part for learning a ranking model using the consistent ranking algorithm, and the other part for selecting a CPE transform on the learned ranking model; both terms in Theorem 16 will then go to zero as the training sample size increases, yielding consistency for binary CPE.

Remark 18. We note a recent result in [19] giving a bound on the empirical squared error of a CPE model constructed from a ranking model using isotonic regression in terms of the empirical ranking error of the ranking model. However, this does not amount to a regret transfer bound.

Remark 19. Finally, we note that the quantity \( \mathbf{E}_{c \sim U(0,1)}\big[\mathrm{er}^{0\text{-}1,c}_D[\mathrm{Thresh}_{D,\hat\eta,c} \circ \hat\eta]\big] \) that appears in Theorem 10 is also the area under the cost curve [20, 22]; since this quantity is upper bounded in terms of regret^{rank}_D[f] by virtue of Theorem 4, we also get a weak regret transfer bound from bipartite ranking to problems where the area under the cost curve is a performance measure of interest. In particular, this implies that algorithms that are statistically consistent with respect to AUC can also be used to construct algorithms that are statistically consistent w.r.t. the area under the cost curve.

Figure 2: Results on synthetic data. A ranking model was learned using a pairwise linear logistic regression ranking algorithm (which is a consistent ranking algorithm for the distribution used in these experiments); this was followed by an optimal choice of classification threshold (with c = 1/2) or optimal CPE transform based on the distribution as outlined in Sections 3 and 4. The plots show (a) the 0-1 classification regret of the resulting classification model together with the corresponding upper bound from Theorem 4; and (b) the squared-error regret of the resulting CPE model together with the corresponding upper bound from Theorem 14. As can be seen, in both cases, the classification/CPE regret converges to zero as the training sample size increases.

5 Experiments
We conducted two types of experiments to evaluate the results described in this paper: the first involved synthetic data drawn from a known distribution for which the classification and ranking regrets could be calculated exactly; the second involved real data from the UCI Machine Learning Repository. In the first experiment, we learned ranking models using a consistent ranking algorithm on increasing training sample sizes, converted the learned models using the optimal threshold or CPE transforms described in Sections 3 and 4 based on the distribution, and verified that this yielded classification and CPE models with 0-1 classification regret and squared-error regret converging to zero. 
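The sample-based CPE transform in (OP4) is an isotonic regression fit, solvable by the pool adjacent violators algorithm. A minimal illustrative sketch in Python (a hypothetical helper, not the authors' implementation):

```python
import numpy as np

def pav_calibrate(scores, labels):
    """Pool Adjacent Violators: fit monotonically increasing values in [0, 1]
    to binary outcomes (labels in {-1, +1} mapped to {0, 1}), ordered by score.

    Returns the sorted scores and the calibrated value for each; a monotone
    function of the ranking model is then obtained by interpolating between
    these (score, value) pairs.
    """
    order = np.argsort(scores)
    s = np.asarray(scores, dtype=float)[order]
    targets = (np.asarray(labels)[order] == 1).astype(float)
    # Each block holds [sum of targets, count]; merge neighbors while the
    # running means violate monotonicity.
    blocks = []
    for t in targets:
        blocks.append([t, 1])
        while len(blocks) > 1 and \
                blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]:
            total, cnt = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += cnt
    fitted = np.concatenate([np.full(cnt, total / cnt) for total, cnt in blocks])
    return s, fitted
```

Because the fitted values are constant within each merged block and increasing across blocks, the result is a valid element of G_inc restricted to the observed scores.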
In the second experiment, we simulated a setting where a ranking model has been learned from some data, the original training data is no longer available, and a classification/CPE model is needed; we investigated whether in such a setting the ranking model could be used in conjunction with a small additional data sample to produce a useful classification or CPE model.

5.1 Synthetic Data
Our first goal was to verify that using ranking models learned by a statistically consistent ranking algorithm and applying the distribution-based transformations described in Sections 3 and 4 yields classification/CPE models with classification/CPE regret converging to zero. For these experiments, we generated examples in (X = R^d) × {±1} (with d = 100) as follows: each example was assigned a positive/negative label with equal probability, with the positive instances drawn from a multivariate Gaussian distribution with mean µ ∈ R^d and covariance matrix Σ ∈ R^{d×d}, and negative instances drawn from a multivariate Gaussian distribution with mean −µ and the same covariance matrix Σ; here µ was drawn uniformly at random from {−1, 1}^d, and Σ was drawn from a Wishart distribution with 200 degrees of freedom and a randomly drawn invertible PSD scale matrix. For this distribution, the optimal ranking and classification models are linear. Training samples of various sizes n were generated from this distribution; in each case, a linear ranking model was learned using a pairwise linear logistic regression algorithm (with regularization parameter set to 1/√n), and an optimal threshold (with c = 1/2) or CPE transform was then applied to construct a binary classification or CPE model. 
In this case the ranking regret and 0-1 classification regret of a linear model can be computed exactly; the squared-error regret for the CPE model was computed approximately by sampling instances from the distribution. The results are shown in Figure 2. As can be seen, the classification and squared-error regrets of the constructed classification and CPE models both satisfy the bounds from Theorems 4 and 14, and converge to zero as the bounds suggest.

Figure 3: Results on real data from the UCI repository: (a) Spambase (0-1), (b) Internet-ads (0-1), (c) Spambase (CPE), (d) Internet-ads (CPE). A ranking model was learned using a pairwise linear logistic regression ranking algorithm from a part of the data set that was then discarded. The remaining data was divided into training and test sets. The training data was then used to estimate an empirical (sample-based) classification threshold and CPE transform (calibration) for this ranking model as outlined in Sections 3 and 4. Using the same training data, a binary classifier and CPE model were also learned from scratch using a standard linear logistic regression algorithm. The plots show the resulting test error for both approaches; the ranking error of the fixed ranking model was 0.0317 on Spambase and 0.0179 on Internet Ads. As can be seen, if only a small amount of additional data is available, then using this data to convert an existing ranking model into a classification/CPE model is more beneficial than learning a classification/CPE model from scratch.

5.2 Real Data
Our second goal was to investigate whether good classification and CPE models can be constructed in practice by applying the data-based transformations described in Sections 3 and 4 to an existing ranking model. For this purpose, we conducted experiments on several data sets drawn from the UCI Machine Learning Repository³. We present representative results on two data sets: Spambase (4601 instances, 57 features) and Internet Ads (3279 instances, 1554 features⁴). Here we divided each data set into three equal parts. One part was used to learn a ranking model using a pairwise linear logistic regression algorithm, and was then discarded. This allowed us to simulate a situation where a (reasonably good) ranking model is available, but the original training data used to learn the model is no longer accessible. Various subsets of the second part of the data (of increasing size) were then used to estimate a data-based threshold or CPE transform on this ranking model using the optimal sample-based methods described in Sections 3 and 4. The performance of the constructed classification and CPE models on the third part of the data, which was held out for testing purposes, is shown in Figure 3. For comparison, we also show the performance of binary classification and CPE models learned directly from the same subsets of the second part of the data using a standard linear logistic regression algorithm. In each case, the regularization parameter for both standard logistic regression and pairwise logistic regression was chosen from {10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1, 10, 10²} using 5-fold cross validation on the corresponding training data. As can be seen, when one has access to a previously learned (or otherwise available) ranking model with good ranking performance, and only a small amount of additional data, then one is better off using this data to estimate a threshold/CPE transform and converting the ranking model into a classification/CPE model, than learning a classification/CPE model from this data from scratch.

³ http://archive.ics.uci.edu/ml/
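The sample-based threshold choice and score calibration just described can be sketched as follows. This is a simple stand-in for the estimators of Sections 3 and 4, not the paper's exact procedure; `empirical_threshold` and `calibrate_scores` are our hypothetical names, and the isotonic fit uses scikit-learn.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def empirical_threshold(scores, y, c=0.5):
    """Pick a threshold on ranking scores minimizing the empirical
    cost-sensitive error on a labeled sample (c = 1/2 recovers plain
    0-1 error up to scaling); a simple stand-in for the paper's
    sample-based threshold estimator."""
    candidates = np.append(np.unique(scores), np.inf)  # inf = predict all -1
    best_t, best_err = candidates[0], np.inf
    for t in candidates:
        pred = np.where(scores >= t, 1, -1)
        # cost c for false negatives, 1 - c for false positives
        err = np.mean(np.where(y == 1, c * (pred == -1), (1 - c) * (pred == 1)))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def calibrate_scores(scores, y):
    """Convert ranking scores into class probability estimates via a
    monotone (isotonic regression) fit of P(y = +1) against the score."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores, (y == 1).astype(float))
    return iso  # iso.predict(new_scores) yields calibrated probabilities
```

Given a ranking model f and a small labeled sample (X, y), `empirical_threshold(f(X), y, c)` yields a classifier x ↦ sign(f(x) − t), and `calibrate_scores(f(X), y)` yields a CPE model, mirroring the two conversions evaluated in Figure 3.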
However, as can also be seen, the eventual performance of the classification/CPE model thus constructed is limited by the ranking performance of the original ranking model; therefore, once there is sufficient additional data available, it is advisable to use this data to learn a new model from scratch.

6 Conclusion
We have investigated the relationship between three fundamental problems in machine learning: binary classification, bipartite ranking, and binary class probability estimation (CPE). While formal regret transfer bounds from binary CPE to binary classification and to bipartite ranking are known, little has been known about other directions. We have introduced the notion of weak regret transfer bounds that require access to a distribution or data sample, and have established the existence of such bounds from bipartite ranking to binary classification and to binary CPE. The latter result makes use of ideas related to calibration and isotonic regression; while these ideas have been used to calibrate scores from real-valued classifiers to construct probability estimates in practice, to our knowledge, this is the first use of such ideas in deriving formal regret bounds in relation to ranking. Our experimental results demonstrate possible uses of the theory developed here.

Acknowledgments
Thanks to Karthik Sridharan for pointing us to a result on monotonically increasing functions. Thanks to the anonymous reviewers for many helpful suggestions. HN gratefully acknowledges support from a Google India PhD Fellowship. SA thanks the Department of Science & Technology (DST), the Indo-US Science & Technology Forum (IUSSTF), and Yahoo! for their support.

⁴ The original data set contains 1558 features; we discarded 4 features with missing entries.