{"title": "Learning to Rank by Optimizing NDCG Measure", "book": "Advances in Neural Information Processing Systems", "page_first": 1883, "page_last": 1891, "abstract": "Learning to rank is a relatively new field of study, aiming to learn a ranking function from a set of training data with relevancy labels. The ranking algorithms are often evaluated using Information Retrieval measures, such as Normalized Discounted Cumulative Gain [1] and Mean Average Precision [2]. Until recently, most learning to rank algorithms did not use a loss function related to these evaluation measures. The main difficulty in direct optimization of these measures is that they depend on the ranks of documents, not the numerical values output by the ranking function. We propose a probabilistic framework that addresses this challenge by optimizing the expectation of NDCG over all the possible permutations of documents. A relaxation strategy is used to approximate the average of NDCG over the space of permutations, and a bound optimization approach is proposed to make the computation efficient. Extensive experiments show that the proposed algorithm outperforms state-of-the-art ranking algorithms on several benchmark data sets.", "full_text": "Learning to Rank by Optimizing NDCG Measure

Hamed Valizadegan, Rong Jin
Computer Science and Engineering
Michigan State University
East Lansing, MI 48824
{valizade,rongjin}@cse.msu.edu

Ruofei Zhang, Jianchang Mao
Advertising Sciences, Yahoo! Labs
4401 Great America Parkway,
Santa Clara, CA 95054
{rzhang,jmao}@yahoo-inc.com

Abstract

Learning to rank is a relatively new field of study, aiming to learn a ranking function from a set of training data with relevancy labels.
The ranking algorithms are often evaluated using information retrieval measures, such as Normalized Discounted Cumulative Gain (NDCG) [1] and Mean Average Precision (MAP) [2]. Until recently, most learning to rank algorithms did not use a loss function related to these evaluation measures. The main difficulty in direct optimization of these measures is that they depend on the ranks of documents, not the numerical values output by the ranking function. We propose a probabilistic framework that addresses this challenge by optimizing the expectation of NDCG over all the possible permutations of documents. A relaxation strategy is used to approximate the average of NDCG over the space of permutations, and a bound optimization approach is proposed to make the computation efficient. Extensive experiments show that the proposed algorithm outperforms state-of-the-art ranking algorithms on several benchmark data sets.

1 Introduction

Learning to rank has attracted the attention of many machine learning researchers in the last decade because of its growing application in areas such as information retrieval (IR) and recommender systems. In the simplest form, the so-called pointwise approaches, ranking is treated as classification or regression by learning the numeric rank value of documents as an absolute quantity [3, 4]. The second group of algorithms, the pairwise approaches, considers pairs of documents as independent variables and learns a classification (regression) model to correctly order the training pairs [5, 6, 7, 8, 9, 10, 11].
The main problem with these approaches is that their loss functions are related to individual documents, while most evaluation metrics of information retrieval measure the ranking quality for individual queries, not documents.

This mismatch has motivated the so-called listwise approaches to information ranking, which treat each ranking list of documents for a query as a training instance [2, 12, 13, 14, 15, 16, 17]. Unlike the pointwise or pairwise approaches, the listwise approaches aim to optimize the evaluation metrics such as NDCG and MAP. The main difficulty in optimizing these evaluation metrics is that they depend on the rank positions of documents induced by the ranking function, not the numerical values output by the ranking function. In past studies, this problem was addressed either by convex surrogates of the IR metrics or by heuristic optimization methods such as genetic algorithms. In this work, we address this challenge by a probabilistic framework that optimizes the expectation of NDCG over all the possible permutations of documents. To handle the computational difficulty, we present a relaxation strategy that approximates the expectation of NDCG in the space of permutations, and a bound optimization algorithm [18] for efficient optimization. Our experiments with several benchmark data sets show that our method performs better than several state-of-the-art ranking techniques.

The rest of this paper is organized as follows. Related work is presented in Section 2. The proposed framework and optimization strategy are presented in Section 3. We report our experimental study in Section 4 and conclude this work in Section 5.

2 Related Work
We focus on reviewing the listwise approaches that are closely related to the theme of this work. The listwise approaches can be classified into two categories. The first group of approaches directly optimizes the IR evaluation metrics.
Most IR evaluation metrics, however, depend on the sorted order of documents and are non-convex in the target ranking function. To avoid this computational difficulty, these approaches either approximate the metrics with convex functions or deploy methods for non-convex optimization (e.g., genetic algorithms [19]). In [13], the authors introduced LambdaRank, which addresses the difficulty of optimizing IR metrics by defining a virtual gradient on each document after sorting. While [13] provided a simple test to determine whether an implicit cost function exists for the virtual gradient, the theoretical justification for the relation between the implicit cost function and the IR evaluation metric is incomplete. This may partially explain why LambdaRank performs very poorly compared to MCRank [3], a simple adaptation of classification to ranking (a pointwise approach). The authors of the MCRank paper even claimed that a boosting model for regression produces better results than LambdaRank. Volkovs and Zemel [17] proposed optimizing the expectation of IR measures to overcome the sorting problem, similar to the approach taken in this paper. However, they use Monte Carlo sampling to address the intractable task of computing the expectation in the permutation space, which can be a poor approximation for queries with a large number of documents. AdaRank [20] uses boosting to optimize NDCG, similar to our optimization strategy. However, it deploys heuristics to embed the IR evaluation metrics in computing the weights of queries and the importance of weak rankers; i.e., it uses the NDCG value of each query at the current iteration as the weight for that query in constructing the weak ranker (all the documents of a query receive the same weight). This is unlike our approach, in which the contribution of each individual document to the final NDCG score is considered.
Moreover, unlike our method, the convergence of AdaRank is conditional and not guaranteed. Sun et al. [21] reduced ranking, as measured by NDCG, to pairwise classification and applied an alternating optimization strategy that addresses the sorting problem by fixing the rank positions when computing the derivative. SVM-MAP [2] relaxes the MAP metric by incorporating it into the constraints of an SVM. Since SVM-MAP is designed to optimize MAP, it only considers binary relevancy and cannot be applied to data sets with more than two levels of relevance judgments.

The second group of listwise algorithms defines a listwise loss function as an indirect way to optimize the IR evaluation metrics. RankCosine [12] uses the cosine similarity between the ranking list and the ground truth as a query-level loss function. ListNet [14] adopts the KL divergence as its loss function by defining a probabilistic distribution in the space of permutations for learning to rank. FRank [9] uses a new loss function called the fidelity loss on the probability framework introduced in ListNet. ListMLE [15] employs the likelihood loss as the surrogate for the IR evaluation metrics. The main problem with this group of approaches is that the connection between the listwise loss function and the targeted IR evaluation metric is unclear, and therefore optimizing the listwise loss function may not necessarily result in the optimization of the IR metrics.

3 Optimizing NDCG Measure
3.1 Notation
Assume that we have a collection of n queries for training, denoted by Q = {q_1, . . . , q_n}. For each query q_k, we have a collection of m_k documents D_k = {d^k_i, i = 1, . . . , m_k}, whose relevance to q_k is given by a vector r^k = (r^k_1, . . . , r^k_{m_k}) \in Z^{m_k}. We denote by F(d, q) the ranking function that takes a document-query pair (d, q) and outputs a real-valued score, and by j^k_i the rank of document d^k_i within the collection D_k for query q_k. The NDCG value for ranking function F(d, q) is then computed as follows:

    L(Q, F) = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} \frac{2^{r^k_i} - 1}{\log(1 + j^k_i)}    (1)

where Z_k is the normalization factor [1]. NDCG is usually truncated at a particular rank level (e.g., the first 10 retrieved documents) to emphasize the importance of the first retrieved documents.

3.2 A Probabilistic Framework
One of the main challenges in optimizing the NDCG metric defined in Equation (1) is that the dependence of the document ranks (i.e., j^k_i) on the ranking function F(d, q) is not explicitly expressed, which makes the optimization computationally challenging. To address this problem, we consider the expectation of L(Q, F) over all the possible rankings induced by the ranking function F(d, q), i.e.,

    \bar{L}(Q, F) = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} \left\langle \frac{2^{r^k_i} - 1}{\log(1 + j^k_i)} \right\rangle_F
                  = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{\pi_k \in S_{m_k}} \Pr(\pi_k | F, q_k) \sum_{i=1}^{m_k} \frac{2^{r^k_i} - 1}{\log(1 + \pi_k(i))}    (2)

where S_{m_k} stands for the group of permutations of m_k documents, and \pi_k is an instance of permutation (or ranking). The notation \pi_k(i) stands for the rank position of the ith document under \pi_k. To this end, we first utilize the result in the following lemma to approximate the expectation of 1/\log(1 + \pi_k(i)) by the expectation of \pi_k(i).

Lemma 1. For any distribution \Pr(\pi | F, q), the inequality \bar{L}(Q, F) \geq \bar{H}(Q, F) holds, where

    \bar{H}(Q, F) = \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} \frac{2^{r^k_i} - 1}{\log(1 + \langle \pi_k(i) \rangle_F)}    (3)

Proof.
The proof follows from the facts that (a) 1/x is a convex function for x > 0, and therefore \langle 1/\log(1+x) \rangle \geq 1/\langle \log(1+x) \rangle; and (b) \log(1+x) is a concave function, and therefore \langle \log(1+x) \rangle \leq \log(1 + \langle x \rangle). Combining these two facts, we have the result stated in the lemma.

Given that \bar{H}(Q, F) provides a lower bound for \bar{L}(Q, F), in order to maximize \bar{L}(Q, F) we can alternatively maximize \bar{H}(Q, F), which is substantially simpler than \bar{L}(Q, F). In the next step of simplification, we rewrite \pi_k(i) as

    \pi_k(i) = 1 + \sum_{j=1}^{m_k} I(\pi_k(i) > \pi_k(j))    (4)

where I(x) outputs 1 when x is true and zero otherwise. Hence, \langle \pi_k(i) \rangle is written as

    \langle \pi_k(i) \rangle = 1 + \sum_{j=1}^{m_k} \langle I(\pi_k(i) > \pi_k(j)) \rangle = 1 + \sum_{j=1}^{m_k} \Pr(\pi_k(i) > \pi_k(j))    (5)

As a result, to optimize \bar{H}(Q, F), we only need to define \Pr(\pi_k(i) > \pi_k(j)), i.e., the marginal probability that document d^k_j is ranked before document d^k_i. In the next section, we discuss how to define a probability model for \Pr(\pi_k | F, q_k), and derive the pairwise ranking probability \Pr(\pi_k(i) > \pi_k(j)) from the distribution \Pr(\pi_k | F, q_k).

3.3 Objective Function
We model \Pr(\pi_k | F, q_k) as follows:

    \Pr(\pi_k | F, q_k) = \frac{1}{Z(F, q_k)} \exp\left( \sum_{i=1}^{m_k} \sum_{j: \pi_k(j) > \pi_k(i)} \left( F(d^k_i, q_k) - F(d^k_j, q_k) \right) \right)
                        = \frac{1}{Z(F, q_k)} \exp\left( \sum_{i=1}^{m_k} (m_k - 2\pi_k(i) + 1) F(d^k_i, q_k) \right)    (6)

where Z(F, q_k) is the partition function that ensures the probabilities sum to one.
Equation (6) models each pair (d^k_i, d^k_j) of the ranking list \pi_k by the factor \exp(F(d^k_i, q_k) - F(d^k_j, q_k)) if d^k_i is ranked before d^k_j (i.e., \pi_k(d^k_i) < \pi_k(d^k_j)), and vice versa. This modeling choice is consistent with the idea of ranking the documents with the largest scores first; intuitively, the more documents in a permutation are in decreasing order of score, the larger the probability of the permutation. Using Equation (6) for \Pr(\pi_k | F, q_k), we have \bar{H}(Q, F) expressed in terms of the ranking function F. By maximizing \bar{H}(Q, F) over F, we can find the optimal ranking function F.

As indicated by Equation (5), we only need to compute the marginal distribution \Pr(\pi_k(i) > \pi_k(j)). To approximate \Pr(\pi_k(i) > \pi_k(j)), we divide the group of permutations S_{m_k} into two sets: G^k_a(i, j) = {\pi_k | \pi_k(i) > \pi_k(j)} and G^k_b(i, j) = {\pi_k | \pi_k(i) < \pi_k(j)}. Notice that there is a one-to-one mapping between these two sets; namely, for any ranking \pi_k \in G^k_a(i, j), we can create a corresponding ranking in G^k_b(i, j) by switching the positions of documents d^k_i and d^k_j, and vice versa. The following lemma allows us to bound the marginal distribution \Pr(\pi_k(i) > \pi_k(j)).

Lemma 2. If F(d^k_i, q_k) > F(d^k_j, q_k), we have

    \Pr(\pi_k(i) > \pi_k(j)) \leq \frac{1}{1 + \exp(2(F(d^k_i, q_k) - F(d^k_j, q_k)))}    (7)

Proof.

    1 = \sum_{\pi_k \in G^k_a(i,j)} \left( \Pr(\pi_k | F, q_k) + \Pr(\pi_k' | F, q_k) \right)
      = \sum_{\pi_k \in G^k_a(i,j)} \Pr(\pi_k | F, q_k) \left( 1 + \exp\left( 2(\pi_k(i) - \pi_k(j))(F(d^k_i, q_k) - F(d^k_j, q_k)) \right) \right)
      \geq \sum_{\pi_k \in G^k_a(i,j)} \Pr(\pi_k | F, q_k) \left( 1 + \exp\left( 2(F(d^k_i, q_k) - F(d^k_j, q_k)) \right) \right)
      = \left( 1 + \exp\left( 2(F(d^k_i, q_k) - F(d^k_j, q_k)) \right) \right) \Pr(\pi_k(i) > \pi_k(j))

In the first step of the proof we used the definition of \Pr(\pi_k | F, q_k) in Equation (6) to pair each \pi_k \in G^k_a(i, j) with its counterpart \pi_k' \in G^k_b(i, j). The inequality holds because \pi_k(i) - \pi_k(j) \geq 1, and the last step follows because \Pr(\pi_k | F, q_k) is the only term dependent on \pi_k.

This lemma indicates that we can approximate \Pr(\pi_k(i) > \pi_k(j)) by a simple logistic model.
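Lemma 2 can be sanity-checked numerically: for a small document set, the permutation space can be enumerated exactly under the model of Equation (6). The sketch below (hypothetical scores; brute-force enumeration, so only practical for small m) compares the exact marginal with the logistic upper bound:

```python
import itertools
import math

def perm_logweight(perm, scores):
    # Log of the unnormalized probability in Equation (6):
    # sum_i (m - 2*pi(i) + 1) * F(d_i), with pi(i) the 1-based rank of doc i.
    m = len(scores)
    rank = {doc: pos + 1 for pos, doc in enumerate(perm)}
    return sum((m - 2 * rank[i] + 1) * scores[i] for i in range(m))

def exact_marginal(scores, i, j):
    # Exact Pr(pi(i) > pi(j)), i.e. doc j ranked before doc i, by enumeration.
    num = den = 0.0
    for perm in itertools.permutations(range(len(scores))):
        w = math.exp(perm_logweight(perm, scores))
        den += w
        if perm.index(i) > perm.index(j):
            num += w
    return num / den

scores = [1.2, 0.7, 0.3, -0.5]   # hypothetical F(d, q) values; F_0 > F_1
exact = exact_marginal(scores, 0, 1)
bound = 1.0 / (1.0 + math.exp(2 * (scores[0] - scores[1])))
assert exact <= bound            # Lemma 2: the logistic upper bound holds
```

The assertion passes for any scores with F_0 > F_1, in line with the condition of the lemma.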
The idea of using a logistic model for \Pr(\pi_k(i) > \pi_k(j)) is not new in learning to rank [7, 9]; however, it has usually been taken for granted, and no justification has previously been provided for its use in learning to rank. Using the logistic model approximation introduced in Lemma 2, we now have \langle \pi_k(i) \rangle written as

    \langle \pi_k(i) \rangle \approx 1 + \sum_{j=1}^{m_k} \frac{1}{1 + \exp(2(F(d^k_i, q_k) - F(d^k_j, q_k)))}

To simplify our notation, we define F^k_i = 2F(d^k_i, q_k), and rewrite the above expression as

    \langle \pi_k(i) \rangle = 1 + \sum_{j=1}^{m_k} \Pr(\pi_k(i) > \pi_k(j)) \approx 1 + \sum_{j=1}^{m_k} \frac{1}{1 + \exp(F^k_i - F^k_j)}    (8)

Using the above approximation for \langle \pi_k(i) \rangle, we have \bar{H} in Equation (3) written as

    \bar{H}(Q, F) \approx \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} \frac{2^{r^k_i} - 1}{\log(2 + A^k_i)}    (9)

where

    A^k_i = \sum_{j=1}^{m_k} \frac{I(j \neq i)}{1 + \exp(F^k_i - F^k_j)}    (10)

We use the following proposition to further simplify the objective function:

Proposition 1.

    \frac{1}{\log(2 + A^k_i)} \geq \frac{1}{\log 2} - \frac{A^k_i}{2 [\log 2]^2}

The proof follows from the Taylor expansion of the convex function 1/\log(2 + x), x > -1, around x = 0, noting that A^k_i > 0 (the convexity of 1/\log(1 + x) follows from the facts used in Lemma 1). By plugging the result of this proposition into the objective function in Equation (9), the new objective is to minimize the following quantity:

    \bar{M}(Q, F) \approx \frac{1}{n} \sum_{k=1}^{n} \frac{1}{Z_k} \sum_{i=1}^{m_k} (2^{r^k_i} - 1) A^k_i    (11)

The objective function in Equation (11) is explicitly related to F via the term A^k_i. In the next section, we aim to derive an algorithm that learns an effective ranking function by efficiently minimizing \bar{M}.
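For concreteness, A^k_i and the per-query contribution to \bar{M} in Equations (10)-(11) can be computed directly from the scores. A minimal sketch, assuming hypothetical scores and relevance grades, and using the natural logarithm for the ideal-DCG normalizer Z_k:

```python
import math

def objective_M(scores, rel):
    # Per-query \bar{M} from Equation (11): sum_i (2^{r_i} - 1) * A_i / Z,
    # with A_i from Equation (10) and F_i = 2 * F(d_i, q).
    m = len(scores)
    F = [2.0 * s for s in scores]
    # Z: DCG of the ideal ordering (natural log, as in Equation (1)).
    Z = sum((2 ** r - 1) / math.log(1 + pos)
            for pos, r in enumerate(sorted(rel, reverse=True), start=1))
    M = 0.0
    for i in range(m):
        A_i = sum(1.0 / (1.0 + math.exp(F[i] - F[j]))
                  for j in range(m) if j != i)
        M += (2 ** rel[i] - 1) * A_i / Z
    return M

rel = [2, 1, 0]                            # hypothetical relevance grades
good = objective_M([0.9, 0.2, -0.4], rel)  # scores agree with relevance
bad = objective_M([-0.4, 0.2, 0.9], rel)   # scores reversed
assert good < bad                          # \bar{M} is smaller for the better ranking
```

As expected, \bar{M} shrinks when the score ordering agrees with the relevance ordering, since A^k_i is the (approximate) expected number of documents ranked above d^k_i.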
It is also important to note that although \bar{M} is no longer a rigorous lower bound for the original objective function \bar{L}, our empirical study shows that this approximation is very effective in identifying an appropriate ranking function from the training data.

3.4 Algorithm
To minimize \bar{M}(Q, F) in Equation (11), we employ the bound optimization strategy [18], which iteratively updates the solution for F. Let F^k_i denote the value obtained so far for document d^k_i. To improve NDCG, following the idea of AdaBoost, we restrict the new ranking value for document d^k_i, denoted by \tilde{F}^k_i, to the following form:

    \tilde{F}^k_i = F^k_i + \alpha f^k_i    (12)

where \alpha > 0 is the combination weight and f^k_i = f(d^k_i, q_k) \in {0, 1} is a binary value. Note that in the above, we assume the ranking function F(d, q) is updated iteratively by the addition of a binary classification function f(d, q), which leads to efficient computation as well as effective exploitation of existing algorithms for data classification. To construct an upper bound for \bar{M}(Q, F), we first handle the expression [1 + \exp(F^k_i - F^k_j)]^{-1}, as summarized by the following proposition.

Proposition 2.

    \frac{1}{1 + \exp(\tilde{F}^k_i - \tilde{F}^k_j)} \leq \frac{1}{1 + \exp(F^k_i - F^k_j)} + \gamma^k_{i,j} \left( \exp(\alpha(f^k_j - f^k_i)) - 1 \right)    (13)

where

    \gamma^k_{i,j} = \frac{\exp(F^k_i - F^k_j)}{\left( 1 + \exp(F^k_i - F^k_j) \right)^2}    (14)

The proof of this proposition can be found in Appendix A. This proposition separates the term related to F^k_i from the term related to \alpha f^k_i in Equation (11), and shows how the new weak ranker (i.e., the binary classification function f(d, q)) affects the current ranking function F(d, q).
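Proposition 2 holds for arbitrary \alpha and binary f; a quick numerical spot-check of Equations (13)-(14) on a few illustrative (hypothetical) values:

```python
import math

def gamma_ij(Fi, Fj):
    # gamma^k_{i,j} from Equation (14)
    e = math.exp(Fi - Fj)
    return e / (1.0 + e) ** 2

def prop2_holds(Fi, Fj, fi, fj, alpha):
    # Left and right sides of the bound in Equation (13).
    lhs = 1.0 / (1.0 + math.exp((Fi + alpha * fi) - (Fj + alpha * fj)))
    rhs = (1.0 / (1.0 + math.exp(Fi - Fj))
           + gamma_ij(Fi, Fj) * (math.exp(alpha * (fj - fi)) - 1.0))
    return lhs <= rhs + 1e-12   # small tolerance for the equality case

# Spot-check a few (F_i, F_j, f_i, f_j) combinations.
for Fi, Fj, fi, fj in [(0.5, -0.3, 1, 0), (0.5, -0.3, 0, 1), (-1.0, 2.0, 1, 1)]:
    assert prop2_holds(Fi, Fj, fi, fj, alpha=0.7)
```

When f^k_i = f^k_j the bound is met with equality, matching the tightness remark after Theorem 2.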
Using the above proposition, we can derive a closed-form solution for \alpha given the solution for f (Theorem 1), as well as an upper bound on \bar{M} (Theorem 2).

Theorem 1. Given the solution for the binary classifier f^k_i, the optimal \alpha that minimizes the objective function in Equation (11) is

    \alpha = \frac{1}{2} \log \left( \frac{ \sum_{k=1}^{n} \sum_{i,j=1}^{m_k} \frac{2^{r^k_i} - 1}{Z_k} \theta^k_{i,j} I(f^k_j < f^k_i) }{ \sum_{k=1}^{n} \sum_{i,j=1}^{m_k} \frac{2^{r^k_i} - 1}{Z_k} \theta^k_{i,j} I(f^k_j > f^k_i) } \right)    (15)

where \theta^k_{i,j} = \gamma^k_{i,j} I(j \neq i).

Theorem 2.

    \bar{M}(Q, \tilde{F}) \leq \bar{M}(Q, F) + \gamma(\alpha) + \frac{\exp(3\alpha) - 1}{3} \sum_{k=1}^{n} \sum_{i=1}^{m_k} f^k_i \left( \sum_{j=1}^{m_k} \frac{2^{r^k_j} - 2^{r^k_i}}{Z_k} \theta^k_{i,j} \right)

where \gamma(\alpha) is a function of \alpha alone, with \gamma(0) = 0. Since the last term equals -\frac{\exp(3\alpha) - 1}{3} \sum_{k,i} f^k_i w^k_i with w^k_i defined in Equation (16), minimizing this bound over f amounts to maximizing \sum_{k,i} f^k_i w^k_i.

The proofs of these theorems are provided in Appendix B and Appendix C, respectively. Note that the bound in Theorem 2 is tight: setting \alpha = 0 reduces the inequality to the equality \bar{M}(Q, \tilde{F}) = \bar{M}(Q, F). The importance of this theorem is that the optimal solution for f^k_i can be found without knowing the solution for \alpha.

Algorithm 1 summarizes the procedure for minimizing the objective function in Equation (11). First, it computes \theta^k_{i,j} for every pair of documents of query k. Then, it computes w^k_i, a weight for each document, which can be positive or negative. A positive weight w^k_i indicates that the NDCG score improves if d^k_i is moved up in the ranking induced by the current ranking function F, while a negative weight w^k_i indicates that d^k_i should be moved down.
Therefore, the sign of the weight w^k_i provides clear guidance for constructing the next weak ranker (the binary classifier in our case): documents with a positive w^k_i should be labeled +1 by the binary classifier, and those with a negative w^k_i should be labeled -1. The magnitude of w^k_i shows how severely the corresponding document is misplaced in the ranking; in other words, it shows the importance of correcting the rank position of document d^k_i for improving the value of NDCG. This leads to maximizing \eta given in Equation (17), which can be viewed as a weighted classification accuracy. We use a sampling strategy to maximize \eta, because most binary classifiers do not support weighted training sets; that is, we first sample the documents according to |w^k_i| and then construct a binary classifier on the sampled documents. It can be shown that the proposed algorithm reduces the objective function \bar{M} exponentially (the proof is omitted due to lack of space).

    w^k_i = \sum_{j=1}^{m_k} \frac{2^{r^k_i} - 2^{r^k_j}}{Z_k} \theta^k_{i,j}    (16)

    \eta = \sum_{k=1}^{n} \sum_{i=1}^{m_k} |w^k_i| f(d^k_i) y^k_i    (17)

Algorithm 1 NDCG Boost: A Boosting Algorithm for Maximizing NDCG (1)
1: Initialize F(d^k_i) = 0 for all documents
2: repeat
3:   Compute \theta^k_{i,j} = \gamma^k_{i,j} I(j \neq i) for all document pairs of each query, where \gamma^k_{i,j} is given in Eq. (14)
4:   Compute the weight w^k_i for each document as in Eq. (16)
5:   Assign each document the class label y^k_i = sign(w^k_i)
6:   Train a classifier f(x): R^d -> {0, 1} that maximizes \eta in Eq. (17)
7:   Predict f^k_i for all documents in D_k, k = 1, . . . , n
8:   Compute the combination weight \alpha as given in Equation (15)
9:   Update the ranking function as F^k_i <- F^k_i + \alpha f^k_i
10: until the maximum number of iterations is reached

(1) Note that we use F(d^k_i) instead of F(d^k_i, q_k) to simplify the notation in the algorithm.

4 Experiments

To study the performance of NDCG Boost, we use the latest version (version 3.0) of the LETOR package provided by Microsoft Research Asia [22]. The LETOR package includes several benchmark data sets, baselines, and evaluation tools for research on learning to rank.

4.1 LETOR Data Sets
There are seven data sets provided in the LETOR package: OHSUMED, Topic Distillation 2003 (TD2003), Topic Distillation 2004 (TD2004), Homepage Finding 2003 (HP2003), Homepage Finding 2004 (HP2004), Named Page Finding 2003 (NP2003), and Named Page Finding 2004 (NP2004) (2). There are 106 queries in the OHSUMED data set, with a number of documents for each query. The relevance of each document in the OHSUMED data set is scored 0 (irrelevant), 1 (possibly relevant), or 2 (definitely relevant). The total number of query-document relevance judgments provided in the OHSUMED data set is 16,140, and there are 45 features for each query-document pair. For TD2003, TD2004, HP2003, HP2004, and NP2003, there are 50, 75, 75, 75, and 150 queries, respectively, with about 1000 retrieved documents per query. This amounts to a total of 49,171, 74,170, 74,409, 73,834, and 147,606 query-document pairs for TD2003, TD2004, HP2003, HP2004, and NP2003, respectively. For these data sets, 63 features are extracted for each query-document pair, and a binary relevance judgment is provided for each pair.

For every data set in LETOR, five partitions are provided to conduct five-fold cross validation, each including training, test, and validation sets. The results of a number of state-of-the-art learning to rank algorithms are also provided in the LETOR package.
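The NDCG@n values used throughout the evaluation follow Equation (1); a minimal sketch of such an evaluator (natural log here; any fixed log base yields the same DCG/IDCG ratio):

```python
import math

def ndcg_at(rels_in_ranked_order, n):
    # NDCG truncated at rank n; the input lists relevance grades in the
    # order induced by the ranking function (Equation (1), natural log).
    def dcg(rels):
        return sum((2 ** r - 1) / math.log(1 + pos)
                   for pos, r in enumerate(rels[:n], start=1))
    ideal = dcg(sorted(rels_in_ranked_order, reverse=True))
    return dcg(rels_in_ranked_order) / ideal if ideal > 0 else 0.0

assert ndcg_at([2, 2, 1, 0], 3) == 1.0       # perfect ranking scores 1
assert 0.0 < ndcg_at([2, 0, 1, 2], 3) < 1.0  # imperfect ranking scores below 1
```

This is a sketch of the metric, not the LETOR evaluation script itself; the official LETOR tools should be used to reproduce the reported numbers.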
Since these baselines include some of the most well-known learning to rank algorithms from each category (pointwise, pairwise, and listwise), we use them to study the performance of NDCG Boost. Here is the list of these baselines (details can be found on the LETOR web page):

Regression: A simple linear regression, which is a basic pointwise approach and can be considered a reference point.

RankSVM: A pairwise approach using Support Vector Machines [5].

FRank: A pairwise approach. It uses a probability model similar to RankNet [7] for the relative rank position of two documents, with a novel loss function called the fidelity loss [9]. Tsai et al. [9] showed that FRank performs much better than RankNet.

ListNet: A listwise learning to rank algorithm [14]. It uses the cross-entropy loss as its listwise loss function.

AdaRank NDCG: A listwise boosting algorithm that incorporates NDCG in computing the sample and combination weights [20].

SVM MAP: A support vector machine with the MAP measure used in the constraints. It is a listwise approach [2].

While the validation set is used for finding the best parameters of the baselines in LETOR, it is not used for NDCG Boost in our experiments. For NDCG Boost, we set the maximum number of iterations to 100 and use a decision stump as the weak ranker.

(2) The experimental results for the last data set are not reported due to lack of space.

Figure 1: The experimental results in terms of NDCG for LETOR 3.0 data sets. Panels: (a) OHSUMED, (b) TD2003, (c) TD2004, (d) HP2003, (e) HP2004, (f) NP2003.

Figure 1 provides the average results over the five folds for the different learning to rank algorithms, in terms of NDCG at each of the first 10 truncation levels, on the LETOR data sets (3).
Notice that the performance of the algorithms in comparison varies from one data set to another; however, NDCG Boost almost always performs the best. We would like to point out a few statistics: on the OHSUMED data set, NDCG Boost achieves 0.50 at NDCG@3, a 4% improvement over FRank, the second best algorithm. On the TD2003 data set, NDCG Boost achieves 0.375, a 10% improvement over RankSVM (0.34), the second best method. On the HP2004 data set, NDCG Boost achieves 0.80 at NDCG@3, compared to 0.75 for SVM MAP, the second best method, an improvement of 6%. Moreover, among all the methods in comparison, NDCG Boost appears to be the most stable across all the data sets. For example, FRank, which performs well on the OHSUMED and TD2004 data sets, yields poor performance on TD2003, HP2003, and HP2004. Similarly, AdaRank NDCG achieves decent performance on the OHSUMED data set, but fails to deliver accurate ranking results on TD2003, HP2003, and NP2003. In fact, both AdaRank NDCG and FRank perform even worse than the simple Regression approach on TD2003, which further indicates their instability. As another example, ListNet and RankSVM, which perform well on TD2003, are not competitive with NDCG Boost on the OHSUMED and TD2004 data sets.

5 Conclusion

The listwise approach is a relatively new approach to learning to rank. It aims to use a query-level loss function to optimize a given IR measure. The difficulty in optimizing an IR measure lies in the sort operation inherent in the measure. We address this challenge by a probabilistic framework that optimizes the expectation of NDCG over all the possible permutations of documents. We present a relaxation strategy to effectively approximate the expectation of NDCG, and a bound optimization strategy for efficient optimization.
Our experiments on benchmark data sets show that our method is superior to state-of-the-art learning to rank algorithms in terms of both performance and stability.

(3) NDCG is commonly measured at the first few retrieved documents to emphasize their importance.

6 Acknowledgements
The work was supported in part by Yahoo! Labs (4) and the National Institutes of Health (1R01GM079688-01). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Yahoo! or NIH.

A Proof of Proposition 2

    \frac{1}{1 + \exp(\tilde{F}^k_i - \tilde{F}^k_j)}
      = \frac{1}{1 + \exp(F^k_i - F^k_j + \alpha(f^k_i - f^k_j))}
      = \frac{1}{1 + \exp(F^k_i - F^k_j)} \left( \frac{1}{1 + \exp(F^k_i - F^k_j)} + \frac{\exp(F^k_i - F^k_j)}{1 + \exp(F^k_i - F^k_j)} \exp(\alpha(f^k_i - f^k_j)) \right)^{-1}
      \leq \frac{1}{1 + \exp(F^k_i - F^k_j)} \left( \frac{1}{1 + \exp(F^k_i - F^k_j)} + \frac{\exp(F^k_i - F^k_j)}{1 + \exp(F^k_i - F^k_j)} \exp(\alpha(f^k_j - f^k_i)) \right)
      = \frac{1}{1 + \exp(F^k_i - F^k_j)} + \gamma^k_{i,j} \left( \exp(\alpha(f^k_j - f^k_i)) - 1 \right)

The first two steps are simple manipulations of the terms; the inequality follows from the convexity of the inverse function on R^+, applied to the convex combination of 1 and \exp(\alpha(f^k_i - f^k_j)) with weights 1/(1 + \exp(F^k_i - F^k_j)) and \exp(F^k_i - F^k_j)/(1 + \exp(F^k_i - F^k_j)).

B Proof of Theorem 1
In order to obtain the result of Theorem 1, we first
plug Equation (13) into Equation (11). This leads to minimizing

    \sum_{k=1}^{n} \sum_{i,j=1}^{m_k} \frac{2^{r^k_i} - 1}{Z_k} \theta^k_{i,j} \exp(\alpha(f^k_j - f^k_i))

the only term related to \alpha. Since f^k_i takes the binary values 0 and 1, we have the following:

    \sum_{k=1}^{n} \sum_{i,j=1}^{m_k} \frac{2^{r^k_i} - 1}{Z_k} \theta^k_{i,j} \exp(\alpha(f^k_j - f^k_i))
      = \sum_{k=1}^{n} \sum_{i,j=1}^{m_k} \frac{2^{r^k_i} - 1}{Z_k} \theta^k_{i,j} \left( \exp(\alpha) I(f^k_j > f^k_i) + \exp(-\alpha) I(f^k_j < f^k_i) + I(f^k_j = f^k_i) \right)

Setting the partial derivative of this expression with respect to \alpha to zero yields the theorem.

C Proof of Theorem 2
First, we provide the following proposition to handle \exp(\alpha(f^k_j - f^k_i)).

Proposition 3. If x, y \in [0, 1], we have

    \exp(\alpha(x - y)) \leq \frac{\exp(3\alpha) - 1}{3}(x - y) + \frac{\exp(3\alpha) + \exp(-3\alpha) + 1}{3}    (18)

Proof.
Due to the convexity of the exp function, we have:

    \exp(\alpha(x - y)) = \exp\left( \frac{x - y + 1}{3} \cdot 3\alpha + \frac{1}{3} \cdot (-3\alpha) + \frac{1 - x + y}{3} \cdot 0 \right)
      \leq \frac{x - y + 1}{3} \exp(3\alpha) + \frac{1}{3} \exp(-3\alpha) + \frac{1 - x + y}{3}
      = \frac{\exp(3\alpha) - 1}{3}(x - y) + \frac{\exp(3\alpha) + \exp(-3\alpha) + 1}{3}

Using the result of the above proposition, we can bound the last term in Equation (13) as follows:

    \theta^k_{i,j} \left( \exp(\alpha(f^k_j - f^k_i)) - 1 \right) \leq \theta^k_{i,j} \left( \frac{\exp(3\alpha) - 1}{3}(f^k_j - f^k_i) + \frac{\exp(3\alpha) + \exp(-3\alpha) - 2}{3} \right)    (19)

Using the results in Equations (19) and (13), we have \bar{M}(Q, \tilde{F}) in Equation (11) bounded as

    \bar{M}(Q, \tilde{F}) \leq \bar{M}(Q, F) + \gamma(\alpha) + \frac{\exp(3\alpha) - 1}{3} \sum_{k=1}^{n} \sum_{i=1}^{m_k} \frac{2^{r^k_i} - 1}{Z_k} \sum_{j=1}^{m_k} \theta^k_{i,j} (f^k_j - f^k_i)
      = \bar{M}(Q, F) + \gamma(\alpha) + \frac{\exp(3\alpha) - 1}{3} \sum_{k=1}^{n} \sum_{i=1}^{m_k} f^k_i \left( \sum_{j=1}^{m_k} \frac{2^{r^k_j} - 2^{r^k_i}}{Z_k} \theta^k_{i,j} \right)

where the last equality uses the symmetry \theta^k_{i,j} = \theta^k_{j,i} to collect the coefficient of each f^k_i.

(4) The first author was supported as a part-time intern at Yahoo!.

References
[1] Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In SIGIR 2000: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41-48, 2000.

[2] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 271-278, 2007.

[3] Ping Li, Christopher Burges, and Qiang Wu. McRank: Learning to rank using multiple classification and gradient boosting.
In Neural Information Processing Systems, 2007.

[4] Ramesh Nallapati. Discriminative models for information retrieval. In SIGIR '04: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 64–71, New York, NY, USA, 2004. ACM.

[5] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Support vector learning for ordinal regression. In International Conference on Artificial Neural Networks 1999, pages 97–102, 1999.

[6] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.

[7] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In International Conference on Machine Learning 2005, 2005.

[8] Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, and Hsiao-Wuen Hon. Adapting ranking SVM to document retrieval. In SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 186–193, 2006.

[9] Ming-Feng Tsai, Tie-Yan Liu, Tao Qin, Hsin-Hsi Chen, and Wei-Ying Ma. FRank: A ranking method with fidelity loss. In SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007.

[10] Rong Jin, Hamed Valizadegan, and Hang Li. Ranking refinement and its application to information retrieval. In WWW '08: Proceedings of the 17th International Conference on World Wide Web, 2008.

[11] Steven C.H. Hoi and Rong Jin. Semi-supervised ensemble ranking. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI 2008).

[12] Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, Xu-Dong Zhang, and Hang Li. Learning to search web pages with query-level loss functions.
Technical report, 2006.

[13] Christopher J. C. Burges, Robert Ragno, and Quoc V. Le. Learning to rank with nonsmooth cost functions. In Neural Information Processing Systems, 2006.

[14] Zhe Cao and Tie-Yan Liu. Learning to rank: From pairwise approach to listwise approach. In International Conference on Machine Learning 2007, pages 129–136, 2007.

[15] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In International Conference on Machine Learning 2008, pages 1192–1199, 2008.

[16] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. SoftRank: optimizing non-smooth rank metrics.

[17] Maksims N. Volkovs and Richard S. Zemel. BoltzRank: learning to maximize expected ranking gain. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 1089–1096, New York, NY, USA, 2009. ACM.

[18] Ruslan Salakhutdinov, Sam Roweis, and Zoubin Ghahramani. On the convergence of bound optimization algorithms. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI 2003).

[19] Jen-Yuan Yeh, Yung-Yi Lin, Hao-Ren Ke, and Wei-Pang Yang. Learning to rank for information retrieval using genetic programming. In SIGIR 2007 Workshop: Learning to Rank for Information Retrieval.

[20] Jun Xu and Hang Li. AdaRank: a boosting algorithm for information retrieval. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 391–398, 2007.

[21] Zhengya Sun, Tao Qin, Qing Tao, and Jue Wang. Robust sparse rank learning for non-smooth ranking measures. In SIGIR '09: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 259–266, New York, NY, USA, 2009. ACM.

[22] Tie-Yan Liu, Tao Qin, Jun Xu, Wenying Xiong, and Hang Li.
LETOR: Benchmark dataset for research on learning to rank for information retrieval.
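The convexity bound in Proposition 3 of Appendix C is easy to sanity-check numerically. The following Python snippet (not part of the original paper; the function names `lhs` and `rhs` and the sampling range for α are my own choices for illustration) samples random α and x, y ∈ [0, 1] and verifies that exp(α(x − y)) never exceeds the linear upper bound (exp(3α) − 1)/3 · (x − y) + (exp(3α) + exp(−3α) + 1)/3.

```python
import math
import random

def lhs(alpha, x, y):
    """Left-hand side of Proposition 3: exp(alpha * (x - y))."""
    return math.exp(alpha * (x - y))

def rhs(alpha, x, y):
    """Right-hand side of Proposition 3: the linear-in-(x - y) upper bound."""
    return ((math.exp(3 * alpha) - 1) / 3) * (x - y) \
        + (math.exp(3 * alpha) + math.exp(-3 * alpha) + 1) / 3

random.seed(0)
violations = 0
for _ in range(100_000):
    alpha = random.uniform(-2.0, 2.0)        # the bound holds for any real alpha
    x, y = random.random(), random.random()  # x, y in [0, 1], as in the proposition
    if lhs(alpha, x, y) > rhs(alpha, x, y) + 1e-9:
        violations += 1

print("violations:", violations)  # expected: 0
```

Since the bound comes from writing α(x − y) as a convex combination of 3α, −3α, and 0 and applying Jensen's inequality, no violations should be observed for any α, which the check above confirms empirically.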