{"title": "A Generalized Bradley-Terry Model: From Group Competition to Individual Skill", "book": "Advances in Neural Information Processing Systems", "page_first": 601, "page_last": 608, "abstract": null, "full_text": "A Generalized Bradley-Terry Model: From Group Competition to Individual Skill\n\nTzu-Kuo Huang\nChih-Jen Lin\nDepartment of Computer Science\nNational Taiwan University\nTaipei 106, Taiwan\n\nRuby C. Weng\nDepartment of Statistics\nNational Chengchi University\nTaipei 116, Taiwan\n\nAbstract\n\nThe Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated.\n\n1 Introduction\n\nThe Bradley-Terry model [2] for paired comparisons has been broadly applied in many areas such as statistics, sports, and machine learning. It considers the model\n\n$$P(\\text{individual } i \\text{ beats individual } j) = \\frac{\\pi_i}{\\pi_i + \\pi_j}, \\tag{1}$$\n\nwhere $\\pi_i$ is the overall skill of the $i$th individual. Given $k$ individuals and $r_{ij}$ as the number of times that $i$ beats $j$, an approximate skill $p_i$ can be found by minimizing the negative log likelihood of the model (1):\n\n$$\\min_{p} \\quad l(p) = -\\sum_{i<j} \\left( r_{ij} \\log \\frac{p_i}{p_i + p_j} + r_{ji} \\log \\frac{p_j}{p_i + p_j} \\right) \\qquad \\text{subject to} \\quad 0 \\le p_i, \\; i = 1, \\ldots, k, \\quad \\sum_{i=1}^{k} p_i = 1. \\tag{2}$$\n\nThus, from paired comparisons, we can obtain individual performance. This model dates back to [14] and has been extended to more general settings. Some reviews are, for example, [5, 6]. Problem (2) can be solved by a simple iterative procedure:\n\nAlgorithm 1\n\n1. Start with any initial $p_j^0 > 0$, $j = 1, \\ldots, k$.\n2. Repeat ($t = 0, 1, \\ldots$)\n\na. Let $s = (t \\bmod k) + 1$. For $j = 1, \\ldots, k$, define\n\n$$p_j^{t,n} \\equiv \\begin{cases} \\dfrac{\\sum_{i: i \\ne s} r_{si}}{\\sum_{i: i \\ne s} \\dfrac{r_{si} + r_{is}}{p_s^t + p_i^t}} & \\text{if } j = s, \\\\ p_j^t & \\text{if } j \\ne s. \\end{cases} \\tag{3}$$\n\nb. Normalize $p^{t,n}$ to be $p^{t+1}$.\n\nuntil $\\partial l(p^t)/\\partial p_j = 0$, $j = 1, \\ldots, k$ are satisfied.\n\nThis algorithm is so simple that there is no need to use sophisticated optimization techniques. If $r_{ij} > 0, \\forall i, j$, Algorithm 1 globally converges to the unique minimum of (2). A systematic study of the convergence is in [9].\n\nSeveral machine learning studies have used the Bradley-Terry model; one application is to obtain multi-class probability estimates from pairwise coupling [8]. For any data instance $x$, if $n_{ij}$ is the number of training data in the $i$th or $j$th class, and\n\n$$r_{ij} \\approx n_{ij} P(x \\text{ in class } i \\mid x \\text{ in class } i \\text{ or } j)$$\n\nis available, solving (2) obtains the estimate of $P(x \\text{ in class } i)$, $i = 1, \\ldots, k$. [13] tried to extend this algorithm to other multi-class settings such as \u201cone-against-the rest\u201d or \u201cerror-correcting coding,\u201d but did not provide a convergence proof. In Section 5.2 we show that the algorithm proposed in [13] indeed has some convergence problems.\n\nIn this paper, we propose a generalized Bradley-Terry model where each comparison is between two disjoint subsets of subjects. Then from the results of team competitions, we can approximate the skill of each individual. This model has many potential applications. For example, from records of tennis or badminton doubles (or singles and doubles combined), we may obtain the rank of all individuals. A useful application in machine learning is multi-class probability estimates using error-correcting codes. We then introduce a simple iterative method to solve the generalized model with a convergence proof. Experiments on multi-class probability estimates demonstrate the viability of the proposed model and algorithm. 
Due to space limitations, we omit all proofs in this paper.\n\n2 Generalized Bradley-Terry Model\n\nWe propose a generalized Bradley-Terry model where, using team competition results, we can approximate individual skill levels. Consider a group of $k$ individuals: $\\{1, \\ldots, k\\}$. Two disjoint subsets $I_i^+$ and $I_i^-$ form teams for games, and $r_i \\ge 0$ ($r_i' \\ge 0$) is the number of times that $I_i^+$ beats $I_i^-$ ($I_i^-$ beats $I_i^+$). Thus, we have $I_i \\subset \\{1, \\ldots, k\\}$, $i = 1, \\ldots, m$ so that\n\n$$I_i = I_i^+ \\cup I_i^-, \\quad I_i^+ \\ne \\emptyset, \\quad I_i^- \\ne \\emptyset, \\quad \\text{and} \\quad I_i^+ \\cap I_i^- = \\emptyset.$$\n\nUnder the model that\n\n$$P(I_i^+ \\text{ beats } I_i^-) = \\frac{\\sum_{j \\in I_i^+} \\pi_j}{\\sum_{j \\in I_i^+} \\pi_j + \\sum_{j \\in I_i^-} \\pi_j} = \\frac{\\sum_{j \\in I_i^+} \\pi_j}{\\sum_{j \\in I_i} \\pi_j},$$\n\nwe can define\n\n$$q_i^+ \\equiv \\sum_{j \\in I_i^+} p_j, \\quad q_i^- \\equiv \\sum_{j \\in I_i^-} p_j, \\quad q_i \\equiv \\sum_{j \\in I_i} p_j,$$\n\nand minimize the negative log likelihood\n\n$$\\min_{p} \\quad l(p) = -\\sum_{i=1}^{m} \\left( r_i \\log(q_i^+/q_i) + r_i' \\log(q_i^-/q_i) \\right), \\tag{4}$$\n\nunder the same constraints of (2). If $I_i$, $i = 1, \\ldots, k(k-1)/2$ are as the following:\n\n$$\\begin{array}{cccc} I_i^+ & I_i^- & r_i & r_i' \\\\ \\{1\\} & \\{2\\} & r_{12} & r_{21} \\\\ \\vdots & \\vdots & \\vdots & \\vdots \\\\ \\{k-1\\} & \\{k\\} & r_{k-1,k} & r_{k,k-1} \\end{array}$$\n\nthen (4) goes back to (2). The difficulty of solving (4) over solving (2) is that now $l(p)$ is expressed in terms of $q_i^+$, $q_i^-$, $q_i$ but the real variable is $p$. The original Bradley-Terry model is a special case of other statistical models such as log-linear or generalized linear models, so methods other than Algorithm 1 (e.g., iterative scaling and iterative weighted least squares) can also be used. However, (4) is not in a form of such models and hence these methods cannot be applied. We propose the following algorithm to solve (4).\n\nAlgorithm 2\n\n1. Start with $p_j^0 > 0$, $j = 1, \\ldots, k$ and corresponding $q_i^{0,+}$, $q_i^{0,-}$, $q_i^0$, $i = 1, \\ldots, m$.\n2. Repeat ($t = 0, 1, \\ldots$)\n\na. Let $s = (t \\bmod k) + 1$. For $j = 1, \\ldots, k$, define\n\n$$p_j^{t,n} \\equiv \\begin{cases} \\dfrac{\\sum_{i: s \\in I_i^+} \\dfrac{r_i}{q_i^{t,+}} + \\sum_{i: s \\in I_i^-} \\dfrac{r_i'}{q_i^{t,-}}}{\\sum_{i: s \\in I_i} \\dfrac{r_i + r_i'}{q_i^t}} \\, p_j^t & \\text{if } j = s, \\\\ p_j^t & \\text{if } j \\ne s. \\end{cases} \\tag{5}$$\n\nb. Normalize $p^{t,n}$ to $p^{t+1}$.\nc. Update $q_i^{t,+}$, $q_i^{t,-}$, $q_i^t$ to $q_i^{t+1,+}$, $q_i^{t+1,-}$, $q_i^{t+1}$, $i = 1, \\ldots, m$.\n\nuntil $\\partial l(p^t)/\\partial p_j = 0$, $j = 1, \\ldots, k$ are satisfied.\n\nFor the multiplicative factor in (5) to be well defined (i.e., non-zero denominator), we need Assumption 1, which will be discussed in Section 3. Eq. (5) is a simple fixed-point type update; in each iteration, only one component (i.e., $p_s^t$) is modified while the others remain the same. It is motivated from using a descent direction to strictly decrease $l(p)$: If $\\partial l(p^t)/\\partial p_s \\ne 0$, then\n\n$$\\frac{\\partial l(p^t)}{\\partial p_s} \\cdot (p_s^{t,n} - p_s^t) = \\left( -\\left( \\frac{\\partial l(p^t)}{\\partial p_s} \\right)^2 p_s^t \\right) \\Big/ \\left( \\sum_{i: s \\in I_i} \\frac{r_i + r_i'}{q_i^t} \\right) < 0, \\tag{6}$$\n\nwhere\n\n$$\\frac{\\partial l(p)}{\\partial p_s} = -\\sum_{i: s \\in I_i^+} \\frac{r_i}{q_i^+} - \\sum_{i: s \\in I_i^-} \\frac{r_i'}{q_i^-} + \\sum_{i: s \\in I_i} \\frac{r_i + r_i'}{q_i}.$$\n\nThus, $p_s^{t,n} - p_s^t$ is a descent direction in optimization since a sufficiently small step along this direction guarantees the strict decrease of the function value. Since now we take the whole direction without searching for the step size, more effort is needed to prove the strict decrease in Lemma 1. 
However, (6) does hint that (5) is a reasonable update.\n\nLemma 1 If $p_s^t > 0$ is the component to be updated and $\\partial l(p^t)/\\partial p_s \\ne 0$, then $l(p^{t+1}) < l(p^t)$.\n\nIf we apply the update rule (5) on the pairwise model,\n\n$$\\frac{\\sum_{i: i \\ne s} \\dfrac{r_{si}}{p_s^t}}{\\sum_{i: i \\ne s} \\dfrac{r_{si}}{p_s^t + p_i^t} + \\sum_{i: i \\ne s} \\dfrac{r_{is}}{p_s^t + p_i^t}} \\, p_s^t = \\frac{\\sum_{i: i \\ne s} r_{si}}{\\sum_{i: i \\ne s} \\dfrac{r_{si} + r_{is}}{p_s^t + p_i^t}}$$\n\nand (5) goes back to (3).\n\n3 Convergence of Algorithm 2\n\nFor any point satisfying $\\partial l(p)/\\partial p_j = 0$, $j = 1, \\ldots, k$ and constraints of (4), it is a stationary point of (4) (see footnote 1). We will prove that Algorithm 2 converges to such a point. If it stops in a finite number of iterations, then $\\partial l(p)/\\partial p_j = 0$, $j = 1, \\ldots, k$, which means a stationary point of (4) is already obtained. Thus, we only need to handle the case where $\\{p^t\\}$ is an infinite sequence. As $\\{p^t\\}_{t=0}^{\\infty}$ is in a compact (i.e., closed and bounded) set $\\{p \\mid 0 \\le p_j \\le 1, \\sum_{j=1}^{k} p_j = 1\\}$, it has at least one convergent subsequence. Assume $p^*$ is one such convergent point. In the following we will prove that $\\partial l(p^*)/\\partial p_j = 0$, $j = 1, \\ldots, k$.\n\nFootnote 1: A stationary point means a Karush-Kuhn-Tucker (KKT) point for constrained optimization problems like (2) and (4). Note that here $\\partial l(p)/\\partial p_j = 0$ implies (and is more restricted than) the KKT condition.\n\nTo prove the convergence of a fixed-point type algorithm, we need that if $p_s^* > 0$ and $\\partial l(p^*)/\\partial p_s \\ne 0$, then from $p_s^*$ we can use (5) to update it to $p_s^{*,n} \\ne p_s^*$. We thus make the following assumption to ensure that $p_s^* > 0$ (see also Theorem 1).\n\nAssumption 1 For each $j \\in \\{1, \\ldots, k\\}$,\n\n$$\\bigcup_{i: i \\in A} I_i = \\{1, \\ldots, k\\}, \\quad \\text{where } A = \\{ i \\mid (I_i^+ = \\{j\\}, r_i > 0) \\text{ or } (I_i^- = \\{j\\}, r_i' > 0) \\}.$$\n\nThat is, each individual forms a winning (losing) team in some competitions which together involve all subjects.\n\nAn issue left in Section 2 is whether the multiplicative factor in (5) is well defined. With Assumption 1 and initial $p_j^0 > 0$, $j = 1, \\ldots, k$, one can show by induction that $p_j^t > 0, \\forall t$ and hence the denominator of (5) is never zero: If $p_j^t > 0$, Assumption 1 implies that $\\sum_{i: j \\in I_i^+} r_i/q_i^{t,+}$ or $\\sum_{i: j \\in I_i^-} r_i'/q_i^{t,-}$ is positive. Thus, both numerator and denominator in the multiplicative factor are positive, and so is $p_j^{t+1}$.\n\nIf $r_{ij} > 0$, the original Bradley-Terry model satisfies Assumption 1. Whether or not the model satisfies the assumption, an easy way to fulfill it is to add an additional term\n\n$$-\\mu \\sum_{s=1}^{k} \\log \\left( \\frac{p_s}{\\sum_{j=1}^{k} p_j} \\right) \\tag{7}$$\n\nto $l(p)$, where $\\mu$ is a small positive number. That is, for each $s$, we make an $I_i = \\{1, \\ldots, k\\}$ with $I_i^+ = \\{s\\}$, $r_i = \\mu$, and $r_i' = 0$. As $\\sum_{j=1}^{k} p_j = 1$ is one of the constraints, (7) reduces to $-\\mu \\sum_{s=1}^{k} \\log p_s$, which is a barrier term in optimization to ensure that $p_s$ does not go to zero. The property $p_s^* > 0$ and the convergence of Algorithm 2 are in Theorem 1:\n\nTheorem 1 Under Assumption 1, any convergent point $p^*$ of Algorithm 2 satisfies $p_s^* > 0$, $s = 1, \\ldots, k$ and is a stationary point of (4).\n\n4 Asymptotic Distribution of the Maximum Likelihood Estimator\n\nFor the standard Bradley-Terry model, asymptotic distribution of the MLE (i.e., $p$) has been discussed in [5]. In this section, we discuss the asymptotic distribution for the proposed estimator. To work on the real probability $\\pi$, we define\n\n$$\\bar{q}_i^+ \\equiv \\sum_{j \\in I_i^+} \\pi_j, \\quad \\bar{q}_i^- \\equiv \\sum_{j \\in I_i^-} \\pi_j, \\quad \\bar{q}_i \\equiv \\sum_{j \\in I_i} \\pi_j,$$\n\nand consider $n_i \\equiv r_i + r_i'$ as a constant. Note that $r_i \\sim \\mathrm{BIN}(n_i, \\bar{q}_i^+/\\bar{q}_i)$ is a random variable representing the number of times that $I_i^+$ beats $I_i^-$ in $n_i$ competitions. By defining, for $s, t = 1, \\ldots, k$,\n\n$$\\lambda_{ss} \\equiv \\mathrm{var}\\left[ \\frac{\\partial l(\\pi)}{\\partial p_s} \\right] = \\sum_{i: s \\in I_i^+} \\frac{n_i \\bar{q}_i^-}{\\bar{q}_i^+ \\bar{q}_i^2} + \\sum_{i: s \\in I_i^-} \\frac{n_i \\bar{q}_i^+}{\\bar{q}_i^- \\bar{q}_i^2},$$\n\n$$\\lambda_{st} \\equiv \\mathrm{cov}\\left[ \\frac{\\partial l(\\pi)}{\\partial p_s}, \\frac{\\partial l(\\pi)}{\\partial p_t} \\right] = \\sum_{i: s,t \\in I_i^+} \\frac{\\bar{q}_i^- n_i}{\\bar{q}_i^+ \\bar{q}_i^2} - \\sum_{i: (s,t) \\in I_i^+ \\times I_i^-} \\frac{n_i}{\\bar{q}_i^2} - \\sum_{i: (s,t) \\in I_i^- \\times I_i^+} \\frac{n_i}{\\bar{q}_i^2} + \\sum_{i: s,t \\in I_i^-} \\frac{\\bar{q}_i^+ n_i}{\\bar{q}_i^- \\bar{q}_i^2}, \\quad s \\ne t,$$\n\nwe have the following theorem:\n\nTheorem 2 Let $n$ be the total number of comparisons. If $r_i$ is independent of $r_j$, $\\forall i \\ne j$, then $\\sqrt{n}(p_1 - \\pi_1), \\ldots, \\sqrt{n}(p_{k-1} - \\pi_{k-1})$ have for large samples the multivariate normal distribution with zero means and dispersion matrix $[\\lambda_{st}']^{-1}$, where\n\n$$\\lambda_{st}' = \\lambda_{st} - \\lambda_{sk} - \\lambda_{tk} + \\lambda_{kk}, \\quad s, t = 1, \\ldots, k - 1.$$\n\n5 Application to Multi-class Probability Estimates\n\nMany classification methods are two-class based approaches and there are different ways to extend them for multi-class cases. Most existing studies focus on predicting class labels but not probability estimates. In this section, we discuss how the generalized Bradley-Terry model can be applied to multi-class probability estimates.\n\nError-correction coding [7, 1] is a general method to construct binary classifiers and combine them for multi-class prediction. It suggests some ways to construct $I_i^+$ and $I_i^-$; both are subsets of $\\{1, \\ldots, k\\}$. 
Then one trains a binary model using data from classes in $I_i^+$ ($I_i^-$) as positive (negative). Simple and commonly used methods such as \u201cone-against-one\u201d and \u201cone-against-the rest\u201d are its special cases. Given $n_i$, the number of training data with classes in $I_i = I_i^+ \\cup I_i^-$, we assume here that for any data $x$,\n\n$$r_i \\approx n_i P(x \\text{ in classes of } I_i^+ \\mid x \\text{ in classes of } I_i^+ \\text{ or } I_i^-) \\tag{8}$$\n\nis available, and the task is to approximate $P(x \\text{ in class } s)$, $s = 1, \\ldots, k$. In the rest of this section we discuss the special case \u201cone-against-the rest\u201d and the earlier results in [13].\n\n5.1 Properties of the \u201cOne-against-the rest\u201d Approach\n\nFor this approach, $I_i$, $i = 1, \\ldots, m$ are\n\n$$\\begin{array}{cccc} I_i^+ & I_i^- & r_i & r_i' \\\\ \\{1\\} & \\{2, \\ldots, k\\} & r_1 & 1 - r_1 \\\\ \\{2\\} & \\{1, 3, \\ldots, k\\} & r_2 & 1 - r_2 \\\\ \\vdots & \\vdots & \\vdots & \\vdots \\end{array}$$\n\nNow $n_1 = \\cdots = n_m$ = the total number of training data, so the solution to (4) is not affected by $n_i$. Thus, we remove it from (8), so $r_i + r_i' = 1$. As $\\partial l(p)/\\partial p_s = 0$ becomes\n\n$$\\frac{r_s}{p_s} + \\sum_{j: j \\ne s} \\frac{r_j'}{1 - p_j} = k,$$\n\nwe have\n\n$$\\frac{r_1}{p_1} - \\frac{1 - r_1}{1 - p_1} = \\cdots = \\frac{r_k}{p_k} - \\frac{1 - r_k}{1 - p_k} = k - \\sum_{j=1}^{k} \\frac{r_j'}{1 - p_j} = \\delta,$$\n\nwhere $\\delta$ is a constant. These equalities provide another way to solve $p$, and $p_s = ((1 + \\delta) - \\sqrt{(1 + \\delta)^2 - 4 r_s \\delta})/(2\\delta)$. Note that $((1 + \\delta) + \\sqrt{(1 + \\delta)^2 - 4 r_s \\delta})/(2\\delta)$ also satisfies the equalities, but it is negative when $\\delta < 0$, and greater than 1 when $\\delta > 0$. By solving $\\sum_{s=1}^{m} p_s = 1$, we obtain $\\delta$ and the optimal $p$.\n\nFrom the formula of $p_s$, if $\\delta > 0$, larger $p_s$ implies smaller $(1 + \\delta)^2 - 4 r_s \\delta$ and hence larger $r_s$. It is similar for $\\delta < 0$. Thus, the order of $p_1, \\ldots, p_k$ is the same as that of $r_1, \\ldots, r_k$:\n\nTheorem 3 If $r_s \\ge r_t$, then $p_s \\ge p_t$.\n\n5.2 The Approach in [13] for Error-Correcting Codes\n\n[13] was the first attempt to address the probability estimates using general error-correcting codes. By considering the same optimization problem (4), it proposes a heuristic update rule\n\n$$p_s^{t,n} \\equiv \\frac{\\sum_{i: s \\in I_i^+} r_i + \\sum_{i: s \\in I_i^-} r_i'}{\\sum_{i: s \\in I_i^+} \\dfrac{n_i q_i^{t,+}}{q_i^t} + \\sum_{i: s \\in I_i^-} \\dfrac{n_i q_i^{t,-}}{q_i^t}} \\, p_s^t, \\tag{9}$$\n\nbut does not provide a convergence proof. For a fixed-point update, we expect that at the optimum, the multiplicative factor in (9) is one. However, unlike (5), when the factor is one, (9) does not relate to $\\partial l(p)/\\partial p_s = 0$. In fact, a simple example shows that this algorithm may never converge. Taking the \u201cone-against-the rest\u201d approach, if we keep $\\sum_{i=1}^{k} p_i^t = 1$ and assume $n_i = 1$, then $r_i + r_i' = 1$ and the factor in the update rule (9) is\n\n$$\\frac{r_s + \\sum_{i: i \\ne s} r_i'}{p_s^t + \\sum_{i: i \\ne s} (1 - p_i^t)} = \\frac{k - 1 + 2 r_s - \\sum_{i=1}^{k} r_i}{k - 2 + 2 p_s^t}.$$\n\nIf the algorithm converges and the factor approaches one, then $p_s = (1 + 2 r_s - \\sum_{i=1}^{k} r_i)/2$, but these values may not satisfy $\\sum_{s=1}^{k} p_s = 1$. Therefore, if in the algorithm we keep $\\sum_{i=1}^{k} p_i^t = 1$ as [13] did, the factor may not approach one and the algorithm does not converge. More generally, if $I_i = \\{1, \\ldots, k\\}, \\forall i$, the algorithm may not converge. As $q_i^t = 1$, the condition that the factor equals one can be written as a linear equation of $p$. Together with $\\sum_{i=1}^{k} p_i = 1$, there is an over-determined linear system (i.e., $k + 1$ equations and $k$ variables).\n\n6 Experiments on Multi-class Probability Estimates\n\n6.1 Simulated Examples\n\nWe consider the same settings in [8, 12] by defining three possible class probabilities:\n(a) $p_1 = 1.5/k$, $p_j = (1 - p_1)/(k - 1)$, $j = 2, \\ldots, k$.\n(b) $k_1 = k/2$ if $k$ is even, and $(k + 1)/2$ if $k$ is odd; then $p_1 = 0.95 \\times 1.5/k_1$, $p_i = (0.95 - p_1)/(k_1 - 1)$ for $i = 2, \\ldots, k_1$, and $p_i = 0.05/(k - k_1)$ for $i = k_1 + 1, \\ldots, k$.\n(c) $p_1 = 0.95 \\times 1.5/2$, $p_2 = 0.95 - p_1$, and $p_i = 0.05/(k - 2)$, $i = 3, \\ldots, k$.\nClasses are competitive in case (a), but only two dominate in case (c). We then generate $r_i$ by adding some noise to $q_i^+/q_i$:\n\n$$r_i = \\min\\left( \\max\\left( \\epsilon, \\frac{q_i^+}{q_i} (1 + 0.1 N(0, 1)) \\right), 1 - \\epsilon \\right).$$\n\nThen $r_i' = 1 - r_i$. Here $\\epsilon = 10^{-7}$ is used so that all $r_i$, $r_i'$ are positive. We consider the four encodings used in [1] to generate $I_i$:\n1. \u201c1vs1\u201d: the pairwise approach (eq. (2)).\n2. \u201c1vsrest\u201d: the \u201cone-against-the rest\u201d approach in Section 5.1.\n3. \u201cdense\u201d: $I_i = \\{1, \\ldots, k\\}$ for all $i$. $I_i$ is randomly split to two equally-sized sets $I_i^+$ and $I_i^-$. $[10 \\log_2 k]$ such splits are generated (see footnote 2). Following [1], we repeat this procedure 100 times and select the one whose $[10 \\log_2 k]$ splits have the smallest distance.\n4. \u201csparse\u201d: $I_i^+$, $I_i^-$ are randomly drawn from $\\{1, \\ldots, k\\}$ with $E(|I_i^+|) = E(|I_i^-|) = k/4$. Then $[15 \\log_2 k]$ such splits are generated. Similar to \u201cdense,\u201d we repeat the procedure 100 times to find a good coding.\n\nFigure 1 shows averaged accuracy rates over 500 replicates for each of the four methods when $k = 2^2, 2^3, \\ldots, 2^6$. \u201c1vs1\u201d is good for (a) and (b), but suffers some losses in (c), where the class probabilities are highly unbalanced. [12] has observed this and proposed some remedies. \u201c1vsrest\u201d is quite competitive in all three scenarios. Furthermore, \u201cdense\u201d and \u201csparse\u201d are less competitive in cases (a) and (b) when $k$ is large. 
Due to the large $|I_i^+|$ and $|I_i^-|$, the model is unable to single out a clear winner when probabilities are more balanced. We also analyze the (relative) mean square error (MSE) in Figure 2:\n\n$$\\mathrm{MSE} = \\frac{1}{500} \\sum_{j=1}^{500} \\left( \\sum_{i=1}^{k} (\\hat{p}_i^j - p_i)^2 \\Big/ \\sum_{i=1}^{k} p_i^2 \\right), \\tag{10}$$\n\nwhere $\\hat{p}^j$ is the probability estimate obtained in the $j$th of the 500 replicates. Results of Figures 2(b) and 2(c) are consistent with those of the accuracy. Note that in Figure 2(a), as $p$ (and $\\hat{p}^j$) are balanced, $\\sum_{i=1}^{k} (\\hat{p}_i^j - p_i)^2$ is small. Hence, all approaches have small MSE though some have poor accuracy.\n\nFootnote 2: We use $[x]$ to denote the nearest integer value of $x$.\n\n[Figure 1, panels (a)-(c): test accuracy versus $\\log_2 k$.]\nFigure 1: Accuracy by the four encodings: \u201c1vs1\u201d (dashed line, square), \u201c1vsrest\u201d (solid line, cross), \u201cdense\u201d (dotted line, circle), \u201csparse\u201d (dashdot line, asterisk)\n\n[Figure 2, panels (a)-(c): MSE versus $\\log_2 k$.]\nFigure 2: MSE by the four encodings: legend the same as Figure 1\n\n6.2 Experiments on Real Data\n\nIn this section we present experimental results on some real-world multi-class problems. They have been used in [12], which provides more information about data preparation. 
Two problem sizes, 300/500 and 800/1,000 for training/testing, are used. 20 training/testing splits are generated and the testing error rates are averaged. All data used are available at http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/data. We use the same four ways in Section 6.1 to generate $I_i$. All of them have $|I_1| \\approx \\cdots \\approx |I_m|$. With the property that these multi-class problems are reasonably balanced, we set $n_i = 1$ in (8).\n\nSince there are no probability values available for these problems, we compare the accuracy by predicting the label with the largest probability estimate. The purpose here is to compare the four probability estimates but not to check the difference from existing multi-class classification techniques. We consider support vector machines (SVM) [4] with the RBF kernel as the binary classifier. An improved version [10] of [11] obtains $r_i$. Full SVM parameter selection is conducted before testing, although due to space limitation, we omit details here. The code is modified from LIBSVM [3], a library for support vector machines. The resulting accuracy is in Table 1 for smaller and larger training/testing sets. Except \u201c1vs1,\u201d the other three approaches are quite competitive. These results indicate that practical problems are more similar to the case of (c) in Section 6.1, where few classes dominate. This observation is consistent with the findings in [12]. 
Moreover, \u201c1vs1\u201d suffers some losses when $k$ is larger (e.g., letter), the same as in Figure 1(c); so for \u201c1vs1,\u201d [12] proposed using a quadratic model instead of the Bradley-Terry model.\n\nIn terms of the computational time, because the number of binary problems for \u201cdense\u201d and \u201csparse\u201d ($[10 \\log_2 k]$ and $[15 \\log_2 k]$, respectively) is larger than $k$, and each binary problem involves many classes of data (all and one half), their training time is longer than \u201c1vs1\u201d and \u201c1vsrest.\u201d \u201cDense\u201d is particularly time consuming. Note that though \u201c1vs1\u201d solves $k(k - 1)/2$ binaries, it is efficient as each binary problem involves only two classes of data.\n\nTable 1: Average of 20 test errors (in percentage) by four encodings (lowest boldfaced)\n\nProblem k\ndna 3\nwaveform 3\nsatimage 6\nsegment 7\nUSPS 10\nMNIST 10\nletter 26\n\n300 training and 500 testing\n1vs1 1vsrest dense\n10.45\n10.47\n15.01\n15.66\n14.22\n14.72\n6.62\n6.24\n10.81\n11.37\n13.0\n13.84\n33.86\n39.73\n\n10.33\n15.35\n15.08\n6.69\n10.89\n12.56\n35.17\n\nsparse\n10.19\n15.12\n14.8\n6.19\n11.14\n12.29\n33.88\n\n800 training and 1,000 testing\n1vs1\n6.21\n13.525\n11.54\n3.295\n7.78\n8.11\n21.11\n\n1vsrest\n6.45\n13.635\n11.74\n3.605\n7.49\n7.37\n19.685\n\ndense\n6.415\n13.76\n11.865\n3.52\n7.31\n7.59\n20.14\n\nsparse\n6.345\n13.99\n11.575\n3.25\n7.575\n7.535\n19.49\n\nIn summary, we propose a generalized Bradley-Terry model which gives individual skill from group competition results. A useful application to general multi-class probability estimate is demonstrated.\n\nReferences\n\n[1] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113\u2013141, 2001.\n\n[2] R. A. Bradley and M. Terry. The rank analysis of incomplete block designs: I. the method of paired comparisons. 
Biometrika, 39:324\u2013345, 1952.\n\n[3] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.\n\n[4] C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273\u2013297, 1995.\n\n[5] H. A. David. The method of paired comparisons. Oxford University Press, New York, second edition, 1988.\n\n[6] R. R. Davidson and P. H. Farquhar. A bibliography on the method of paired comparisons. Biometrics, 32:241\u2013252, 1976.\n\n[7] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263\u2013286, 1995.\n\n[8] T. Hastie and R. Tibshirani. Classification by pairwise coupling. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10. MIT Press, Cambridge, MA, 1998.\n\n[9] D. R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32:386\u2013408, 2004.\n\n[10] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt\u2019s probabilistic outputs for support vector machines. Technical report, Department of Computer Science, National Taiwan University, 2003.\n\n[11] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In A. Smola, P. Bartlett, B. Sch\u00f6lkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press.\n\n[12] T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. In S. Thrun, L. Saul, and B. Sch\u00f6lkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.\n\n[13] B. Zadrozny. Reducing multiclass to binary by coupling probability estimates. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1041\u20131048. MIT Press, Cambridge, MA, 2002.\n\n[14] E. Zermelo. Die berechnung der turnier-ergebnisse als ein maximumproblem der wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29:436\u2013460, 1929.", "award": [], "sourceid": 2705, "authors": [{"given_name": "Tzu-kuo", "family_name": "Huang", "institution": null}, {"given_name": "Chih-jen", "family_name": "Lin", "institution": null}, {"given_name": "Ruby", "family_name": "Weng", "institution": null}]}