{"title": "Accuracy at the Top", "book": "Advances in Neural Information Processing Systems", "page_first": 953, "page_last": 961, "abstract": "We introduce a new notion of classification accuracy based on the top $\\tau$-quantile values of a scoring function, a relevant criterion in a number of problems arising for search engines. We define an algorithm optimizing a convex surrogate of the corresponding loss, and show how its solution can be obtained by solving several convex optimization problems. We also present margin-based guarantees for this algorithm based on the $\\tau$-quantile of the functions in the hypothesis set. Finally, we report the results of several experiments evaluating the performance of our algorithm. In a comparison in a bipartite setting with several algorithms seeking high precision at the top, our algorithm achieves a better performance in precision at the top.", "full_text": "Accuracy at the Top\n\nStephen Boyd\n\nStanford University\n\nPackard 264\n\nStanford, CA 94305\nboyd@stanford.edu\n\nMehryar Mohri\n\nCourant Institute and Google\n\n251 Mercer Street\n\nNew York, NY 10012\nmohri@cims.nyu.edu\n\nCorinna Cortes\nGoogle Research\n76 Ninth Avenue\n\nNew York, NY 10011\ncorinna@google.com\n\nAna Radovanovic\nGoogle Research\n76 Ninth Avenue\n\nNew York, NY 10011\n\nanaradovanovic@google.com\n\nAbstract\n\nWe introduce a new notion of classi\ufb01cation accuracy based on the top \u2327-quantile\nvalues of a scoring function, a relevant criterion in a number of problems aris-\ning for search engines. We de\ufb01ne an algorithm optimizing a convex surrogate of\nthe corresponding loss, and discuss its solution in terms of a set of convex opti-\nmization problems. We also present margin-based guarantees for this algorithm\nbased on the top \u2327-quantile value of the scores of the functions in the hypothesis\nset. 
Finally, we report the results of several experiments in the bipartite setting evaluating the performance of our solution and comparing the results to several other algorithms seeking high precision at the top. In most examples, our solution achieves a better performance in precision at the top.\n\n1 Introduction\n\nThe accuracy of the items placed near the top is crucial for many information retrieval systems such as search engines or recommendation systems, since most users of these systems browse or consider only the first k items. Different criteria have been introduced in the past to measure this quality, including the precision at k (Precision@k), the normalized discounted cumulative gain (NDCG) and other variants of DCG, or the mean reciprocal rank (MRR) when the rank of the most relevant document is critical. A somewhat different but related criterion adopted by [1] is based on the position of the top irrelevant item.\n\nSeveral machine learning algorithms have recently been designed to optimize these and other related criteria [6, 12, 11, 21, 7, 14, 13]. A general algorithm by [15], inspired by the structured prediction technique SVMStruct [22], can be used to optimize a convex upper bound on the number of errors among the top k items. The algorithm seeks to solve a convex problem with exponentially many constraints via several rounds of optimization with a smaller number of constraints, augmenting the set of constraints at each round with the most violated one. Another algorithm, also based on structured prediction ideas, is proposed in an unpublished manuscript of [19] and covers several criteria, including Precision@k and NDCG. A regression-based solution is suggested by [10] for DCG in the case of large sample sizes. Other methods have also been proposed to optimize a smooth version of a non-convex cost function in this context [8]. 
[1] discusses an optimization solution for an algorithm seeking to minimize the position of the top irrelevant item.\n\nHowever, one obvious shortcoming of all these algorithms is that the notion of top k does not generalize to new data. For what k should one train if the test data is in some instances half the size and in other instances twice the size? In fact, no generalization guarantee is available for such Precision@k optimization algorithms.\n\nA more principled approach in all the applications already mentioned consists of designing algorithms that optimize accuracy in some top fraction of the scores returned by a real-valued hypothesis. This paper deals precisely with this problem. The desired objective is to learn a scoring function that is as accurate as possible for the items whose scores are above the top τ-quantile. To be more specific, when applied to a set of size n, the number of top items is k = τn for a τ-quantile, while for a different set of size n' ≠ n, this would correspond to k' = τn' ≠ k.\n\nThe implementation of the Precision@k algorithm in [15] indirectly acknowledges the problem that the notion of top k does not generalize, since the command-line flag requires k to be specified as a fraction of the positive samples. Nevertheless, the formulation of the problem as well as the solution are still in terms of the top k items of the training set. A study of various statistical questions related to the problem of accuracy at the top is presented by [9]. The authors also give generalization bounds for the specific case of empirical risk minimization (ERM) under some assumptions about the hypothesis set and the distribution. 
But, to our knowledge, no previous publication has given general learning guarantees for the problem of accuracy among the top τ-quantile scoring items or carefully addressed the corresponding algorithmic problem.\n\nWe discuss the formulation of this problem (Section 3.1) and define an algorithm optimizing a convex surrogate of the corresponding loss in the case of linear scoring functions. We discuss the solution of this problem in terms of several simple convex optimization problems and show that these problems can be extended to the case where positive semi-definite kernels are used (Section 3.2). In Section 4, we present a Rademacher complexity analysis of the problem and give margin-based guarantees for our algorithm based on the τ-quantile value of the functions in the hypothesis set. In Section 5, we report the results of several experiments evaluating the performance of our algorithm. In a comparison in a bipartite setting with several algorithms seeking high precision at the top, our algorithm achieves a better performance in precision at the top. We start with a presentation of notions and notation useful for the discussion in the following sections.\n\n2 Preliminaries\n\nLet X denote the input space and D a distribution over X × X. We interpret the presence of a pair (x, x') in the support of D as the preference of x' over x. We denote by S = ((x_1, x'_1), ..., (x_m, x'_m)) ∈ (X × X)^m a labeled sample of size m drawn i.i.d. according to D and denote by $\\hat{D}$ the corresponding empirical distribution. D induces a marginal distribution over X that we denote by D_0, which in the discrete case can be defined via\n\n$D_0(x) = \\frac{1}{2} \\sum_{x' \\in X} \\big[D(x, x') + D(x', x)\\big].$\n\nWe also denote by $\\hat{D}_0$ the empirical distribution associated to D_0 based on the sample S. The learning problems we are studying are defined in terms of the top τ-quantile of the values taken by a function h: X → 
R, that is a score q such that $\\Pr_{x \\sim D_0}[h(x) > q] = \\tau$ (see Figure 1(a)). In general, q is not unique and this equality may hold for all q in an interval [q_min, q_max]. We will be particularly interested in the properties of the set of points x whose scores are above a quantile, that is s_q = {x : h(x) > q}. Since for any (q, q') ∈ [q_min, q_max]^2, s_q and s_q' differ only by a set of measure zero, the particular choice of q in that interval has no significant consequence. Thus, in what follows, when it is not unique, we will choose the quantile value to be the maximum, q_max.\n\nFor any τ ∈ [0, 1], let ρ_τ denote the function defined by\n\n$\\forall u \\in \\mathbb{R}, \\quad \\rho_\\tau(u) = -\\tau (u)_- + (1 - \\tau)(u)_+,$\n\nwhere (u)_+ = max(u, 0) and (u)_- = min(u, 0) (see Figure 1(b)). ρ_τ is convex as a sum of two convex functions, since u ↦ (u)_+ is convex and u ↦ (u)_- is concave. We will denote by argMin_u f(u) the largest minimizer of a function f. It is known (see for example [17]) that the (maximum) τ-quantile value $\\hat{q}$ of a sample of real numbers X = (u_1, ..., u_n) ∈ R^n is given by $\\hat{q} = \\operatorname{argMin}_{u \\in \\mathbb{R}} F_\\tau(u)$, where F_τ is the convex function defined for all u ∈ R by $F_\\tau(u) = \\frac{1}{n} \\sum_{i=1}^n \\rho_\\tau(u_i - u)$.\n\nFigure 1: (a) Illustration of the τ-quantile (the top τ fraction of scores). (b) Graph of the function ρ_τ for τ = .25.\n\n3 Accuracy at the top (AATP)\n\n3.1 Problem formulation and algorithm\n\nThe learning problem we consider is that of accuracy at the top (AATP), which consists of achieving an ordering of all items so that items whose scores are among the top τ-quantile are as relevant as possible. 
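As a side note (not part of the paper), the characterization of the τ-quantile as the largest minimizer of F_τ can be checked numerically; the following is a minimal sketch, assuming NumPy is available and using hypothetical scores:

```python
import numpy as np

def rho(u, tau):
    # pinball-type loss: rho_tau(u) = -tau * (u)_- + (1 - tau) * (u)_+
    return -tau * np.minimum(u, 0.0) + (1.0 - tau) * np.maximum(u, 0.0)

def top_quantile(scores, tau):
    # F_tau(u) = (1/n) sum_i rho_tau(u_i - u) is piecewise linear and convex,
    # so it attains its minimum at one of the scores; return the largest
    # minimizer, matching the argMin (q_max) convention of the text.
    scores = np.asarray(scores, dtype=float)
    values = np.array([rho(scores - u, tau).mean() for u in scores])
    minimizers = scores[np.isclose(values, values.min())]
    return minimizers.max()

scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.75, 0.8, 0.9]
q = top_quantile(scores, tau=0.25)
# at most a tau fraction of the scores lies strictly above the returned value
assert np.mean(np.asarray(scores) > q) <= 0.25
```

For τ = .25 and these eight scores, both 0.75 and 0.80 minimize F_τ; the largest-minimizer convention returns 0.80, illustrating the choice of q_max over the interval of minimizers.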
Ideally, all preferred items are ranked above the quantile and non-preferred ones ranked below. Thus, the loss or generalization error of a hypothesis h: X → R with top τ-quantile value q_h is the average number of non-preferred elements that h ranks above q_h and preferred ones it ranks below:\n\n$R(h) = \\frac{1}{2}\\, \\mathbb{E}_{(x, x') \\sim D}\\big[1_{h(x) > q_h} + 1_{h(x') < q_h}\\big].$\n\nq_h can be defined as follows in terms of the distribution D_0: $q_h = \\operatorname{argMin}_{u \\in \\mathbb{R}} \\mathbb{E}_{x \\sim D_0}[\\rho_\\tau(h(x) - u)]$. The quantile value q_h depends on the true distribution D. To define the empirical error of h for a sample S = ((x_1, x'_1), ..., (x_m, x'_m)) ∈ (X × X)^m, we will instead use an empirical estimate $\\hat{q}_h$ of q_h: $\\hat{q}_h = \\operatorname{argMin}_{u \\in \\mathbb{R}} \\mathbb{E}_{x \\sim \\hat{D}_0}[\\rho_\\tau(h(x) - u)]$. Thus, we define the empirical error of h for a labeled sample as follows:\n\n$\\hat{R}(h) = \\frac{1}{2m} \\sum_{i=1}^m \\big[1_{h(x_i) > \\hat{q}_h} + 1_{h(x'_i) < \\hat{q}_h}\\big].$\n\nIn the following, we will assume that τ is a multiple of 1/(2m); otherwise it can be rounded to the nearest such value.\n\n3.2 Analysis of the optimization problem\n\nProblem (1) below is not a convex optimization problem since, while the objective function is convex, the equality constraint is not affine. Here, we further analyze the problem and discuss a solution. We first assume that X is a subset of R^N for some N ≥ 1 and consider a hypothesis set H of linear functions h: x ↦ w · x. We will use a surrogate empirical loss taking into consideration how much the score w · x_i of a non-preferred item x_i exceeds $\\hat{q}_h$, and similarly how much lower than $\\hat{q}_h$ the score w · x'_i of a preferred point x'_i is, and seek a solution w minimizing a trade-off of that surrogate loss and the squared norm $\\|w\\|^2$. 
This leads to the following optimization problem for AATP:\n\n$\\min_{w} \\; \\frac{1}{2}\\|w\\|^2 + C \\sum_{i=1}^m \\big[(w \\cdot x_i - \\hat{q}_w + 1)_+ + (\\hat{q}_w - w \\cdot x'_i + 1)_+\\big]$ (1)\n\nsubject to $\\hat{q}_w = \\operatorname{argMin}_{u \\in \\mathbb{R}} Q_\\tau(w, u)$, where C ≥ 0 is a regularization parameter and Q_τ the quantile function defined as follows for a sample S, for any w ∈ R^N and u ∈ R:\n\n$Q_\\tau(w, u) = \\frac{1}{2m} \\sum_{i=1}^m \\big[\\rho_\\tau(w \\cdot x_i - u) + \\rho_\\tau(w \\cdot x'_i - u)\\big].$\n\nThe equality constraint could be written as an infinite number of inequalities, $Q_\\tau(w, \\hat{q}_w) \\leq Q_\\tau(w, u)$ for all u ∈ R. Observe, however, that the quantile value $\\hat{q}_w$ must coincide with the score of one of the training points x_k or x'_k, that is w · x_k or w · x'_k. Thus, Problem (1) can be equivalently written with a finite number of constraints as follows:\n\n$\\min_{w} \\; \\frac{1}{2}\\|w\\|^2 + C \\sum_{i=1}^m \\big[(w \\cdot x_i - \\hat{q}_w + 1)_+ + (\\hat{q}_w - w \\cdot x'_i + 1)_+\\big]$\n\nsubject to $\\hat{q}_w \\in \\{w \\cdot x_k,\\, w \\cdot x'_k : k \\in [1, m]\\}$ and, for all k ∈ [1, m], $Q_\\tau(w, \\hat{q}_w) \\leq Q_\\tau(w, w \\cdot x_k)$ and $Q_\\tau(w, \\hat{q}_w) \\leq Q_\\tau(w, w \\cdot x'_k)$.\n\nThe inequality constraints do not correspond to non-positivity constraints on convex functions. Thus, the problem is not a standard convex optimization problem, but our analysis leads us to a simple approximate solution. For convenience, let (z_1, ..., z_{2m}) denote (x_1, ..., x_m, x'_1, ..., x'_m). Our method consists of solving the following convex quadratic programming (QP) problem for each value of k ∈ [1, 2m]:\n\n$\\min_{w} \\; \\frac{1}{2}\\|w\\|^2 + C \\sum_{i=1}^m \\big[(w \\cdot x_i - \\hat{q}_w + 1)_+ + (\\hat{q}_w - w \\cdot x'_i + 1)_+\\big]$ (2)\n\nsubject to $\\hat{q}_w = w \\cdot z_k$.\n\nLet w_k be the solution of Problem (2). For each k ∈ [1, 2m], we determine the τ-quantile value of the scores {w_k · z_i : i ∈ [1, 2m]}. This can be checked straightforwardly in time O(m log m) by sorting the scores. 
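For illustration only (the paper solves each subproblem exactly as a QP; see Section 5.1), substituting the constraint into the objective turns each subproblem into an unconstrained non-smooth convex problem in w, which even a plain subgradient method can approximately minimize. Everything below (toy data, step sizes, iteration counts) is hypothetical:

```python
import numpy as np

def solve_subproblem(X_neg, X_pos, k, C=1.0, steps=3000, lr=0.1):
    """Approximately solve the k-th subproblem by subgradient descent after
    substituting q_w = w . z_k into the objective (a sketch; the exact
    method solves each subproblem as a QP)."""
    Z = np.vstack([X_neg, X_pos])            # (z_1, ..., z_2m)
    zk = Z[k]
    w = np.zeros(Z.shape[1])
    for t in range(1, steps + 1):
        q = w @ zk
        g = w.copy()                          # subgradient of (1/2)||w||^2
        for xi in X_neg:                      # hinge active when w.xi > q - 1
            if w @ xi - q + 1.0 > 0.0:
                g += C * (xi - zk)
        for xp in X_pos:                      # hinge active when w.xp < q + 1
            if q - w @ xp + 1.0 > 0.0:
                g += C * (zk - xp)
        w -= (lr / np.sqrt(t)) * g            # diminishing step size
    return w

def aatp_sketch(X_neg, X_pos, tau, C=1.0):
    # Solve all 2m subproblems and keep the w_k whose anchor score w_k . z_k
    # is closest to the empirical top tau-quantile of its own scores.
    Z = np.vstack([X_neg, X_pos])
    n = len(Z)
    best_w, best_gap = None, np.inf
    for k in range(n):
        w = solve_subproblem(X_neg, X_pos, k, C)
        s = np.sort(Z @ w)[::-1]
        q = s[max(int(np.ceil(tau * n)) - 1, 0)]  # top tau-quantile by sorting
        gap = abs(w @ Z[k] - q)
        if gap < best_gap:
            best_w, best_gap = w, gap
    return best_w

# hypothetical toy data: preferred points should end up scored above the rest
rng = np.random.RandomState(0)
X_neg = rng.randn(5, 2) + np.array([-2.0, 0.0])
X_pos = rng.randn(5, 2) + np.array([+2.0, 0.0])
w_star = aatp_sketch(X_neg, X_pos, tau=0.3)
assert (X_pos @ w_star).mean() > (X_neg @ w_star).mean()
```

The driver mirrors the selection rule described in the text: among the 2m candidates, it keeps the one whose anchor score is closest to the τ-quantile of its own score distribution.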
Then, the solution w* we return is the w_k for which w_k · z_k is closest to the τ-quantile value, and, in the presence of ties, the one for which the objective function is the smallest. The method for determining w* is thus based on the solution of 2m simple QPs. Our solution naturally parallelizes, so that in a distributed computing environment the computational time for solving the problem can be reduced to roughly that of solving a single QP.\n\n3.3 Kernelized formulation\n\nFor any i ∈ [1, 2m], let y_i = -1 if i ≤ m and y_i = +1 otherwise. Then, Problem (2) admits the following equivalent dual optimization problem, similar to that of SVMs:\n\n$\\max_{\\alpha} \\; \\sum_{i=1}^{2m} \\alpha_i - \\frac{1}{2} \\sum_{i,j=1}^{2m} \\alpha_i \\alpha_j y_i y_j (z_i - z_k) \\cdot (z_j - z_k)$ (3)\n\nsubject to: ∀i ∈ [1, 2m], 0 ≤ α_i ≤ C, which depends only on inner products between points of the training set. The vector w can be obtained from the solution via $w = \\sum_{i=1}^{2m} \\alpha_i y_i (z_i - z_k)$. The algorithm can therefore be generalized by using any positive definite symmetric (PDS) kernel K: X × X → R instead of the inner product in the input space, thereby also extending it to the case of non-vectorial input spaces X. The corresponding hypothesis set H is that of linear functions h: x ↦ w · Φ(x), where Φ: X → H is a feature mapping to a Hilbert space H associated to K and w is an element of H. In view of (3), for any k ∈ [1, 2m], the dual problem of (2) can then be expressed as follows:\n\n$\\max_{\\alpha} \\; \\sum_{i=1}^{2m} \\alpha_i - \\frac{1}{2} \\sum_{i,j=1}^{2m} \\alpha_i \\alpha_j y_i y_j K_k(z_i, z_j)$ (4)\n\nsubject to: ∀i ∈ [1, 2m], 0 ≤ α_i ≤ C, where, for any k ∈ [1, 2m], K_k is the PDS kernel defined by $K_k(z, z') = K(z, z') - K(z, z_k) - K(z_k, z') + K(z_k, z_k)$. 
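Given a precomputed Gram matrix of the base kernel, the Gram matrix of K_k can be formed in closed form by centering at z_k; a minimal sketch assuming NumPy (not the paper's implementation):

```python
import numpy as np

def centered_gram(K, k):
    """Gram matrix of K_k(z_i, z_j) = K(z_i, z_j) - K(z_i, z_k)
    - K(z_k, z_j) + K(z_k, z_k), computed from the base kernel's
    Gram matrix K by broadcasting its k-th column and row."""
    return K - K[:, [k]] - K[[k], :] + K[k, k]

# with a linear base kernel, K_k(z_i, z_j) equals (z_i - z_k) . (z_j - z_k)
Z = np.random.RandomState(0).randn(6, 3)   # hypothetical points z_1..z_6
K = Z @ Z.T
Kk = centered_gram(K, 2)
D = Z - Z[2]
assert np.allclose(Kk, D @ D.T)
```

The check at the end confirms that, for the linear kernel, K_k reduces to the Gram matrix of the shifted points z_i - z_k, matching the feature-space interpretation in (3).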
Our solution can therefore also be found in the dual by solving the 2m QPs defined by (4).\n\n4 Theoretical guarantees\n\nWe here present margin-based generalization bounds for the AATP learning problem. For any ρ > 0, let Φ_ρ: R → [0, 1] be the function defined by $\\Phi_\\rho(x) = 1_{x \\leq 0} + (1 - x/\\rho)_+ 1_{x > 0}$. For any t ∈ R, we define the generalization error R(h, t) and the empirical margin loss $\\hat{R}_\\rho(h, t)$, both with respect to t, by\n\n$R(h, t) = \\frac{1}{2}\\, \\mathbb{E}_{(x, x') \\sim D}\\big[1_{h(x) > t} + 1_{h(x') < t}\\big], \\qquad \\hat{R}_\\rho(h, t) = \\frac{1}{2m} \\sum_{i=1}^m \\big[\\Phi_\\rho(t - h(x_i)) + \\Phi_\\rho(h(x'_i) - t)\\big].$\n\nIn particular, R(h, q_h) corresponds to the generalization error and $\\hat{R}_\\rho(h, q_h)$ to the empirical margin loss of a hypothesis h for AATP. For any t, the empirical margin loss $\\hat{R}_\\rho(h, t)$ is upper bounded by the average of the fraction of non-preferred elements x_i that h ranks above t or less than ρ below t, and the fraction of preferred ones x'_i it ranks below t or less than ρ above t:\n\n$\\hat{R}_\\rho(h, t) \\leq \\frac{1}{2m} \\sum_{i=1}^m \\big[1_{t - h(x_i) < \\rho} + 1_{h(x'_i) - t < \\rho}\\big].$ (5)\n\nWe denote by D_1 the marginal distribution of the first element of the pairs in X × X derived from D, and by D_2 the marginal distribution with respect to the second element. Similarly, S_1 is the sample derived from S by keeping only the first element of each pair, S_1 = (x_1, ..., x_m), and S_2 the one obtained by keeping only the second element, S_2 = (x'_1, ..., x'_m). 
We also denote by $\\mathfrak{R}_m^{D_1}(H)$ the Rademacher complexity of H with respect to the marginal distribution D_1, that is $\\mathfrak{R}_m^{D_1}(H) = \\mathbb{E}[\\widehat{\\mathfrak{R}}_{S_1}(H)]$, and similarly $\\mathfrak{R}_m^{D_2}(H) = \\mathbb{E}[\\widehat{\\mathfrak{R}}_{S_2}(H)]$.\n\nTheorem 1 Let H be a set of real-valued functions taking values in [-M, +M] for some M > 0. Fix τ ∈ [0, 1] and ρ > 0; then, for any δ > 0, with probability at least 1 - δ over the choice of a sample S of size m, each of the following inequalities holds for all h ∈ H and t ∈ [-M, +M]:\n\n$R(h, t) \\leq \\hat{R}_\\rho(h, t) + \\frac{1}{\\rho}\\Big(\\mathfrak{R}_m^{D_1}(H) + \\mathfrak{R}_m^{D_2}(H) + \\frac{2M}{\\sqrt{m}}\\Big) + \\sqrt{\\frac{\\log(1/\\delta)}{2m}}$\n\n$R(h, t) \\leq \\hat{R}_\\rho(h, t) + \\frac{1}{\\rho}\\Big(\\widehat{\\mathfrak{R}}_{S_1}(H) + \\widehat{\\mathfrak{R}}_{S_2}(H) + \\frac{2M}{\\sqrt{m}}\\Big) + 3\\sqrt{\\frac{\\log(2/\\delta)}{2m}}.$\n\nProof. Let $\\tilde{H}$ be the family of hypotheses mapping X × X to R defined by $\\tilde{H} = \\{z = (x, x') \\mapsto t - h(x) : h \\in H, t \\in [-M, +M]\\}$, and similarly $\\tilde{H}' = \\{z = (x, x') \\mapsto h(x') - t : h \\in H, t \\in [-M, +M]\\}$. Consider the two families of functions taking values in [0, 1] defined by $\\Phi_\\rho \\circ \\tilde{H} = \\{\\Phi_\\rho \\circ f : f \\in \\tilde{H}\\}$ and $\\Phi_\\rho \\circ \\tilde{H}' = \\{\\Phi_\\rho \\circ f : f \\in \\tilde{H}'\\}$. By the general Rademacher complexity bounds for functions taking values in [0, 1] [18, 3, 20], with probability at least 1 - δ,\n\n$\\frac{1}{2}\\, \\mathbb{E}\\big[\\Phi_\\rho(t - h(x)) + \\Phi_\\rho(h(x') - t)\\big] \\leq \\hat{R}_\\rho(h, t) + 2\\mathfrak{R}_m\\Big(\\tfrac{1}{2}(\\Phi_\\rho \\circ \\tilde{H} + \\Phi_\\rho \\circ \\tilde{H}')\\Big) + \\sqrt{\\tfrac{\\log(1/\\delta)}{2m}} \\leq \\hat{R}_\\rho(h, t) + \\mathfrak{R}_m(\\Phi_\\rho \\circ \\tilde{H}) + \\mathfrak{R}_m(\\Phi_\\rho \\circ \\tilde{H}') + \\sqrt{\\tfrac{\\log(1/\\delta)}{2m}},$\n\nfor all h ∈ H. Since $1_{u < 0} \\leq \\Phi_\\rho(u)$ for all u ∈ R, the generalization error R(h, t) is a lower bound on the left-hand side, $R(h, t) \\leq \\frac{1}{2}\\, \\mathbb{E}[\\Phi_\\rho(t - h(x)) + \\Phi_\\rho(h(x') - t)]$, and we obtain\n\n$R(h, t) \\leq \\hat{R}_\\rho(h, t) + \\mathfrak{R}_m(\\Phi_\\rho \\circ \\tilde{H}) + \\mathfrak{R}_m(\\Phi_\\rho \\circ \\tilde{H}') + \\sqrt{\\tfrac{\\log(1/\\delta)}{2m}}.$\n\nSince Φ_ρ is 1/ρ-Lipschitz, by Talagrand's contraction lemma, we have $\\mathfrak{R}_m(\\Phi_\\rho \\circ \\tilde{H}) \\leq (1/\\rho)\\, \\mathfrak{R}_m(\\tilde{H})$ and $\\mathfrak{R}_m(\\Phi_\\rho \\circ \\tilde{H}') \\leq (1/\\rho)\\, \\mathfrak{R}_m(\\tilde{H}')$. 
By de\ufb01nition of the Rademacher complexity,\nS\u21e0Dm,\" sup\nih(xi)#\nRm(eH)=\nh\n\nS,\"sup\ni(t  h(xi))#=\nmXi=1\nih(xi).\n\uf8ff sup\nii+\nmXi=1\n\nmXi=1\nmXi=1\n\nit + sup\nh2H\n\nt2[M,+M ]\n\nmXi=1\n\nh2H,t\n\n1\nm\n\n1\nm\n\n1\nm\n\n1\nm\n\nh2H\n\nsup\n\n2m\n\n=\n\nE\n\nE\n\nE\n\nE\n\nt\n\n.\n\nt\n\n5\n\n\fE\n\ni\n\nM\n\ni=1 i<0\n\nE\n\nsup\n\ni=1 i>0\n\nM\uf8fft\uf8ffM\n\nmXi=1\n\nmXi=1\n\nm XPm\ni2i 1\n\nm XPm\n\"\nmXi=1\n\nSince the random variables i and i follow the same distribution, the second term coincides with\nm (H). The \ufb01rst term can be rewritten and upper bounded as follows using Jensen\u2019s inequality:\nRD1\n\"\n1\nm\n\nit# = M\n\ni \n\nPr[]\n\nPr[]\n\n= M\nm\n\nii 1\n\n2 = M\nm\n\nh mXi=1\n\ni# \uf8ff\n(1/p2). Similarly, we can show that Rm(eH0) \uf8ff RD2\n\n2 = M\npm\nNote that, by the Kahane-Khintchine inequality, the last upper bound used is tight modulo a constant\nm (H)+M/pm. This proves the \ufb01rst inequality\nof the theorem; the second inequality can be derived from the \ufb01rst one using the standard bound\nrelating the empirical and true Rademacher complexity. 2\nSince the bounds of the theorem hold uniformly for all t 2 [M, +M], they hold in particular for\nany quantile value qh.\nCorollary 1 (Margin bounds for AATP) Let H be a set of real-valued functions taking values in\n[M, +M] for some M > 0. 
Fix τ ∈ [0, 1] and ρ > 0; then, for any δ > 0, with probability at least 1 - δ over the choice of a sample S of size m, for all h ∈ H it holds that:\n\n$R(h) \\leq \\hat{R}_\\rho(h, q_h) + \\frac{1}{\\rho}\\Big(\\mathfrak{R}_m^{D_1}(H) + \\mathfrak{R}_m^{D_2}(H) + \\frac{2M}{\\sqrt{m}}\\Big) + \\sqrt{\\frac{\\log(1/\\delta)}{2m}}$\n\n$R(h) \\leq \\hat{R}_\\rho(h, q_h) + \\frac{1}{\\rho}\\Big(\\widehat{\\mathfrak{R}}_{S_1}(H) + \\widehat{\\mathfrak{R}}_{S_2}(H) + \\frac{2M}{\\sqrt{m}}\\Big) + 3\\sqrt{\\frac{\\log(2/\\delta)}{2m}}.$\n\nA more explicit version of this corollary can be derived for kernel-based hypotheses (Appendix A). In the results of the previous theorem and corollary, the right-hand side of the generalization bounds is expressed in terms of the empirical margin loss with respect to the true quantile value q_h, which is upper bounded (see (5)) by half the fraction of non-preferred points in the sample whose score is above q_h - ρ and half the fraction of the preferred points whose score is less than q_h + ρ. These fractions are close to the same fractions with q_h replaced with $\\hat{q}_h$, since the probability that a score falls between q_h and $\\hat{q}_h$ can be shown to be uniformly bounded by a term in O(1/√m).¹ Altogether, this analysis provides strong support for our algorithm, which is precisely seeking to minimize the sum of an empirical margin loss based on the quantile and a term that depends on the complexity, as in the right-hand side of the learning guarantees above.\n\n5 Experiments\n\nThis section reports the results of experiments with our AATP algorithm on several datasets. To measure the effectiveness of our algorithm, we compare it to two other algorithms, the INFINITEPUSH algorithm [1] and the SVMPERF algorithm [15], which are both algorithms seeking to emphasize the accuracy near the top. Our experiments are carried out using three data sets from the UC Irvine Machine Learning Repository http://archive.ics.uci.edu/ml/datasets.html: Ionosphere, Housing, and Spambase. (Results for Spambase can be found in Appendix C). 
In addition, we use the TREC 2003 (LETOR 2.0) data set, which is available for download from the following Microsoft Research URL: http://research.microsoft.com/letor.\n\nAll the UC Irvine data sets we experiment with are two-group classification problems. From these we construct bipartite ranking problems where a preference pair consists of one positive and one negative example. To explicitly indicate the dependency on the quantile, we denote by q_τ the value of the top τ-th quantile of the score distribution of a hypothesis. We will use N to denote the number of instances in a particular data set, as well as s_i, i = 1, ..., N, to denote the particular score values. If n_+ denotes the number of positive examples in the data set and n_- denotes the number of negative examples, then N = n_+ + n_- and the number of preference pairs is m = n_+ n_-.\n\n¹Note that the Bahadur-Kiefer representation is known to provide a uniform convergence bound on the difference of the true and empirical quantiles when the distribution admits a density [2, 16], a stronger result than what is needed in our context.\n\nTable 1: Ionosphere data: for each top quantile τ and each evaluation metric, the three rows correspond to AATP (top), SVMPERF (middle) and INFINITEPUSH (bottom). 
For the INFINITEPUSH\nalgorithm we only report mean values over the folds.\n\n\u2327 (%)\n\nP@\u2327\n\nAP\n\nDCG@\u2327\n\nNDCG@\u2327 Positives@top\n\n27.83\n\n0.85\n\n0.87\n\n0.80\n\n0.80\n\n19\n\n14\n\n10.32\n\n12.1 \u00b1 12.5\n0.89 \u00b1 0.04 0.86 \u00b1 0.03 29.21 \u00b1 0.10 0.92 \u00b1 0.06\n6.00 \u00b1 11.1\n0.89 \u00b1 0.06 0.83 \u00b1 0.04 28.88 \u00b1 1.37 0.89 \u00b1 0.11\n0.91 \u00b1 0.05 0.84 \u00b1 0.03 28.15 \u00b1 0.95 0.91 \u00b1 0.07 13.31 \u00b1 12.5\n0.82 \u00b1 0.11 0.79 \u00b1 0.04 27.02 \u00b1 1.37 0.75 \u00b1 0.16\n4.10 \u00b1 11.1\n9.50 0.93 \u00b1 0.06 0.84 \u00b1 0.03 28.15 \u00b1 0.95 0.91 \u00b1 0.09 13.31 \u00b1 12.49\n0.77 \u00b1 0.18 0.79 \u00b1 0.04 27.02 \u00b1 1.35 0.70 \u00b1 0.21\n4.50 \u00b1 10.9\n0.91 \u00b1 0.14 0.84 \u00b1 0.03 28.15 \u00b1 0.95 0.89 \u00b1 0.15 13.31 \u00b1 12.49\n0.66 \u00b1 0.27 0.79 \u00b1 0.04 27.02 \u00b1 1.36 0.60 \u00b1 0.30\n4.60 \u00b1 11.0\n0.85 \u00b1 0.24 0.84 \u00b1 0.03 28.15 \u00b1 0.95 0.88 \u00b1 0.19 13.30 \u00b1 12.53\n0.35 \u00b1 0.41 0.79 \u00b1 0.04 27.02 \u00b1 1.36 0.34 \u00b1 0.41\n4.50 \u00b1 11.0\n\n27.90\n\n27.91\n\n27.90\n\n11.51\n\n11.51\n\n0.87\n\n0.89\n\n5\n\n1\n\n0.80\n\n0.81\n\n0.80\n\n0.85\n\n0.87\n\n0.86\n\n0.90\n\n0.86\n\n0.85\n\n27.91\n\n11.59\n\n11.50\n\n5.1 Implementation\n\nWe solved the convex optimization problems (2) using the CVX solver http://cvxr.com/. As\nalready noted, the AATP problem can be solved ef\ufb01ciently using a distributed computing envi-\nronment. The convex optimization problem of the INFINITEPUSH algorithm (see (3.9) of [1])\ncan also be solved using CVX. However, this optimization problem has as many variables as\nthe product of the numbers of positively and negatively labeled instances (n+n), which makes\nit prohibitive to solve for large data sets within a runtime of a few days. Thus, we experi-\nmented with the INFINITEPUSH algorithm only on the Ionosphere data set. 
Finally, for SVMPERF's training and score prediction we used the binary executables downloaded from the URL http://www.cs.cornell.edu/people/tj and used the SVMPERF settings that are the closest to our optimization formulation. Thus, we used the L1-norm for slack variables and allowed the constraint cache and the tolerance of the termination criterion to grow in order to control the algorithm's convergence, especially for larger values of the regularization constant.\n\n5.2 Evaluation measures\n\nTo evaluate and compare the AATP, INFINITEPUSH, and SVMPERF algorithms, we used a number of standard metrics: Precision at the top (P@τ), Average Precision (AP), Number of positives at the absolute top (Positives@top), Discounted Cumulative Gain (DCG@τ), and Normalized Discounted Cumulative Gain (NDCG@τ). Definitions are included in Appendix B.\n\n5.3 Ionosphere data\n\nThe data set's 351 instances represent radar signals collected from phased antennas, where 'good' signals (225 positively labeled instances) are those that reflect back toward the antennas and 'bad' signals (126 negatively labeled instances) are those that pass through the ionosphere. The data has 34 features. We split the data set into 10 independent sets of instances, say S_1, ..., S_10. Then, we ran 10 experiments, where we used 3 consecutive sets for learning and the rest (7 sets) for testing. We evaluated and compared the algorithms for 5 different top quantiles τ ∈ {19, 14, 9.5, 5, 1} (%), which correspond to the top 20, 15, 10, 5, and 1 items, respectively. For each τ, the regularization parameter C was selected based on the average value of P@τ. The performance of AATP is significantly better than that of the other algorithms, particularly for the smallest top quantiles. 
The two main criteria on which to evaluate the AATP algorithm are Precision at the top (P@τ) and Number of positives at the top (Positives@top). For τ = 5%, the AATP algorithm obtains a stellar 91% accuracy with an average of 13.3 positive elements at the top (Table 1).\n\nTable 2: Housing data: for each quantile value τ and each evaluation metric, there are two rows corresponding to AATP (top) and SVMPERF (bottom).\n\nτ (%) | P@τ | AP | DCG@τ | NDCG@τ | Positives@top\n6 | 0.14 ± 0.05 | 0.11 ± 0.03 | 4.64 ± 0.40 | 0.13 ± 0.08 | 0.20 ± 0.45\n  | 0.13 ± 0.05 | 0.10 ± 0.02 | 4.81 ± 0.46 | 0.16 ± 0.09 | 0.21 ± 0.45\n5 | 0.17 ± 0.07 | 0.10 ± 0.03 | 4.69 ± 0.26 | 0.16 ± 0.07 | 0.00 ± 0.00\n  | 0.12 ± 0.10 | 0.09 ± 0.03 | 4.76 ± 0.60 | 0.16 ± 0.14 | 0.20 ± 0.48\n4 | 0.19 ± 0.13 | 0.12 ± 0.03 | 4.83 ± 0.45 | 0.18 ± 0.15 | 0.00 ± 0.00\n  | 0.14 ± 0.05 | 0.10 ± 0.02 | 4.66 ± 0.25 | 0.13 ± 0.07 | 0.00 ± 0.00\n3 | 0.20 ± 0.12 | 0.10 ± 0.03 | 4.70 ± 0.26 | 0.18 ± 0.11 | 0.00 ± 0.00\n  | 0.17 ± 0.12 | 0.09 ± 0.02 | 4.65 ± 0.40 | 0.18 ± 0.13 | 0.00 ± 0.00\n2 | 0.23 ± 0.10 | 0.10 ± 0.03 | 4.69 ± 0.26 | 0.19 ± 0.11 | 0.00 ± 0.00\n  | 0.25 ± 0.17 | 0.10 ± 0.03 | 4.89 ± 0.48 | 0.27 ± 0.16 | 0.20 ± 0.46\n1 | 0.20 ± 0.27 | 0.12 ± 0.03 | 4.80 ± 0.45 | 0.17 ± 0.23 | 0.00 ± 0.00\n  | 0.30 ± 0.27 | 0.09 ± 0.02 | 4.74 ± 0.56 | 0.29 ± 0.27 | 0.20 ± 0.45\n\n5.4 Housing data\n\nThe Boston Housing data set has 506 examples, 35 positive and 471 negative, described by 13 features. We used feature 4 as the binary target value. Two thirds of the data instances were randomly selected and used for training, and the rest for testing. We created 10 experimental folds analogously to the case of the Ionosphere data. 
The Housing data is very unbalanced, with fewer than 7% positive examples. For this dataset we obtain results very comparable to SVMPERF for the very top quantiles; see Table 2. Naturally, the standard deviations are large as a result of the low percentage of positive examples, so the results are not always significant. For higher top quantiles, e.g., top 4%, the AATP algorithm significantly outperforms SVMPERF, obtaining 19% accuracy at the top (P@τ). For the highest top quantiles, the difference in performance between the two algorithms is not significant.\n\n5.5 LETOR 2.0\n\nThis data set corresponds to a relatively hard ranking problem, with an average of only 1% relevant query-URL pairs per query. It consists of 5 folds. Our Matlab implementation (with CVX) of the algorithms prevented us from trying our approach on larger data sets. Hence, from each training fold we randomly selected 500 items for training. For testing, we selected 1000 items at random from the test fold. Here, we only report results for P@1%. SVMPERF obtained an accuracy of 1.5% ± 1.5%, while the AATP algorithm obtained an accuracy of 4.6% ± 2.4%. This significantly better result indicates the power of the proposed algorithm.\n\n6 Conclusion\n\nWe presented a series of results for the problem of accuracy at the top quantile, including an AATP algorithm, a margin-based theoretical analysis in support of that algorithm, and a series of experiments with several data sets demonstrating the effectiveness of our algorithm. These results are of practical interest in applications where accuracy among the top quantile is sought. The analysis of problems based on other loss functions depending on the top τ-quantile scores is also likely to benefit from the theoretical and algorithmic results we presented. The optimization algorithm we discussed is highly parallelizable, since it is based on solving 2m independent QPs. 
Our initial experiments reported here were carried out using Matlab with CVX, which prevented us from evaluating our approach on larger data sets, such as the full LETOR 2.0 data set. However, we have now designed a solution for very large m based on the ADMM (Alternating Direction Method of Multipliers) framework [4]. We have implemented that solution and will present and discuss it in future work.\n\nReferences\n\n[1] S. Agarwal. The infinite push: A new support vector ranking algorithm that directly optimizes accuracy at the absolute top of the list. In Proceedings of the SIAM International Conference on Data Mining, 2011.\n\n[2] R. R. Bahadur. A note on quantiles in large samples. Annals of Mathematical Statistics, 37, 1966.\n\n[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.\n\n[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.\n\n[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[6] J. S. Breese, D. Heckerman, and C. M. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In UAI '98: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1998.\n\n[7] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 89–96, New York, NY, USA, 2005. ACM.\n\n[8] C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In NIPS, pages 193–200, 2006.\n\n[9] S. Clémençon and N. Vayatis. Ranking the best instances. 
Journal of Machine Learning Research, 8:2671–2699, 2007.\n\n[10] D. Cossock and T. Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11):5140–5154, 2008.\n\n[11] K. Crammer and Y. Singer. PRanking with ranking. In Neural Information Processing Systems (NIPS 2001). MIT Press, 2001.\n\n[12] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, December 2003.\n\n[13] R. Herbrich, K. Obermayer, and T. Graepel. Advances in Large Margin Classifiers, chapter Large Margin Rank Boundaries for Ordinal Regression. MIT Press, 2000.\n\n[14] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 133–142, New York, NY, USA, 2002. ACM.\n\n[15] T. Joachims. A support vector method for multivariate performance measures. In ICML, pages 377–384, 2005.\n\n[16] J. Kiefer. On Bahadur's representation of sample quantiles. Annals of Mathematical Statistics, 38, 1967.\n\n[17] R. Koenker. Quantile Regression. Cambridge University Press, 2005.\n\n[18] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.\n\n[19] Q. V. Le, A. Smola, O. Chapelle, and C. H. Teo. Optimization of ranking measures. Unpublished, 2009.\n\n[20] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.\n\n[21] C. Rudin, C. Cortes, M. Mohri, and R. E. Schapire. Margin-based ranking meets boosting in the middle. In COLT, pages 63–78, 2005.\n\n[22] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. 
Journal of Machine Learning Research, 6:1453–1484, 2005.\n", "award": [], "sourceid": 459, "authors": [{"given_name": "Stephen", "family_name": "Boyd", "institution": null}, {"given_name": "Corinna", "family_name": "Cortes", "institution": null}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": null}, {"given_name": "Ana", "family_name": "Radovanovic", "institution": null}]}