{"title": "Global Ranking Using Continuous Conditional Random Fields", "book": "Advances in Neural Information Processing Systems", "page_first": 1281, "page_last": 1288, "abstract": "This paper studies global ranking problem by learning to rank methods. Conventional learning to rank methods are usually designed for `local ranking', in the sense that the ranking model is defined on a single object, for example, a document in information retrieval. For many applications, this is a very loose approximation. Relations always exist between objects and it is better to define the ranking model as a function on all the objects to be ranked (i.e., the relations are also included). This paper refers to the problem as global ranking and proposes employing a Continuous Conditional Random Fields (CRF) for conducting the learning task. The Continuous CRF model is defined as a conditional probability distribution over ranking scores of objects conditioned on the objects. It can naturally represent the content information of objects as well as the relation information between objects, necessary for global ranking. Taking two specific information retrieval tasks as examples, the paper shows how the Continuous CRF method can perform global ranking better than baselines.", "full_text": "Global Ranking Using Continuous Conditional\n\nRandom Fields\n\n1Tao Qin, 1Tie-Yan Liu, 2Xu-Dong Zhang, 2De-Sheng Wang, 1Hang Li\n\n1Microsoft Research Asia, 2Tsinghua University\n\n1{taoqin, tyliu, hangli}@microsoft.com\n2{zhangxd, wangdsh ee}@tsinghua.edu.cn\n\nAbstract\n\nThis paper studies global ranking problem by learning to rank methods. Con-\nventional learning to rank methods are usually designed for \u2018local ranking\u2019, in the\nsense that the ranking model is de\ufb01ned on a single object, for example, a document\nin information retrieval. 
For many applications, this is a very loose approximation. Relations always exist between objects, and it is better to define the ranking model as a function on all the objects to be ranked (i.e., the relations are also included). This paper refers to the problem as global ranking and proposes employing a Continuous Conditional Random Fields (CRF) model for conducting the learning task. The Continuous CRF model is defined as a conditional probability distribution over the ranking scores of objects, conditioned on the objects. It can naturally represent the content information of objects as well as the relation information between objects, both necessary for global ranking. Taking two specific information retrieval tasks as examples, the paper shows how the Continuous CRF method can perform global ranking better than baselines.\n\n1 Introduction\n\nLearning to rank aims at constructing a model for ordering objects by means of machine learning. It is useful in many areas, including information retrieval, data mining, natural language processing, bioinformatics, and speech recognition. In this paper, we take information retrieval as an example. Traditionally, learning to rank is restricted to 'local ranking', in which the ranking model is defined on a single object. In other words, the relations between the objects are not directly represented in the model. In many application tasks this is far from sufficient, however. For example, in Pseudo Relevance Feedback [17, 8], we want to rank documents on the basis of not only the relevance of documents to the query, but also the similarity between documents. Therefore, a model based solely on individual documents would not be sufficient. (Previously, heuristic methods were developed for Pseudo Relevance Feedback.) Similar issues arise in the tasks of Topic Distillation [12, 11] and Subtopic Retrieval [18]. 
Ideally, in information retrieval we would exploit a ranking model defined as a function on all the documents with respect to the query. In other words, ranking should be conducted on the basis of the contents of objects as well as the relations between objects. We refer to this setting as 'global ranking' and give a formal description of it, with information retrieval as an example.\n\nThe Conditional Random Fields (CRF) technique is a powerful tool for relational learning, because it allows the use of both relations between objects and contents of objects [16]. However, conventional CRF cannot be directly applied to global ranking, because it is a discrete model in the sense that its output variables are discrete [16]. In this work, we propose a Continuous CRF model (C-CRF) to deal with the problem. The C-CRF model is defined as a conditional probability distribution over the ranking scores of objects (documents) conditioned on the objects (documents). The specific probability distribution can be represented by an undirected graph, and the output variables (ranking scores) can be continuous. To our knowledge, this is the first time such a CRF model has been proposed.\n\nWe apply C-CRF to two global ranking tasks: Pseudo Relevance Feedback and Topic Distillation. Experimental results on benchmark data show that our method performs better than baseline methods.\n\n2 Global Ranking Problem\n\nDocument ranking in information retrieval is the following problem. When the user submits a query, the system retrieves all the documents containing at least one query term, calculates a ranking score for each of the documents using the ranking model, and sorts the documents according to the ranking scores. The scores can represent relevance, importance, and/or diversity of documents.\n\nLet q denote a query. 
Let x(q) = {x(q)_1, x(q)_2, ..., x(q)_{n(q)}} denote the documents retrieved with q, and y(q) = {y(q)_1, y(q)_2, ..., y(q)_{n(q)}} denote the ranking scores assigned to the documents. Here n(q) stands for the number of documents retrieved with q. Note that the numbers vary according to queries. We assume that y(q) is determined by a ranking model.\n\nWe call the ranking 'local ranking' if the ranking model is defined as\n\ny(q)_i = f(x(q)_i), i = 1, ..., n(q).    (1)\n\nFurthermore, we call the ranking 'global ranking' if the ranking model is defined as\n\ny(q) = F(x(q)).    (2)\n\nThe major difference between the two is that F takes all the documents together as its input, while f takes an individual document as its input. In other words, in global ranking, we use not only the content information of documents but also the relation information between documents. There are many specific application tasks that can be viewed as examples of global ranking. These
These\ninclude Pseudo Relevance Feedback, Topic Distillation, and Subtopic Retrieval.\n\n3 Continuous CRF for Global Ranking\n\ni\n\n3.1 Continuous CRF\nLet {hk(y(q)\n, x(q))}K1\nranking score y(q)\nfunctions de\ufb01ned on y(q)\nContinuous Conditional Random Fields is a conditional probability distribution with the following\ndensity function,\n\nk=1 be a set of real-valued feature functions de\ufb01ned on document set x(q) and\n(i = 1,\u00b7\u00b7\u00b7 , n(q)), and {gk(y(q)\nk=1 be a set of real-valued feature\n\n, x(q))}K2\n, and x(q) (i, j = 1,\u00b7\u00b7\u00b7 , n(q), i (cid:54)= j).\n\n, y(q)\n\n, y(q)\n\nj\n\nj\n\ni\n\ni\n\ni\n\n(cid:40)(cid:88)\n\nK1(cid:88)\n\n(cid:88)\n\nK2(cid:88)\n\n(cid:41)\n\n(cid:41)\n\nPr(y(q)|x(q)) =\n\n1\n\nZ(x(q))\n\nexp\n\n\u03b1khk(y(q)\n\ni\n\n, x(q)) +\n\n\u03b2kgk(y(q)\n\ni\n\n, y(q)\n\nj\n\n, x(q))\n\n,\n\n(3)\n\ni\n\nk=1\n\ni,j\n\nk=1\n\nwhere \u03b1 is a K1-dimensional parameter vector and \u03b2 is a K2-dimensional parameter vector, and\nZ(x(q)) is a normalization function,\n\n(cid:90)\n\n(cid:40)(cid:88)\n\nK1(cid:88)\n\n(cid:88)\n\nK2(cid:88)\n\nZ(x(q)) =\n\nexp\n\ny(q)\n\n\u03b1khk(y(q)\n\ni\n\n, x(q)) +\n\n\u03b2kgk(y(q)\n\ni\n\n, y(q)\n\nj\n\n, x(q))\n\ndy(q).\n\n(4)\n\ni\n\nk=1\n\ni,j\n\nk=1\n\nGiven a set of documents x(q) for a query, we select the ranking score vector y(q) with the maximum\nconditional probability Pr(y(q)|x(q)) as the output of our proposed global ranking model:\n\nF (x(q)) = arg max\ny(q)\n\nPr(y(q)|x(q)).\n\n(5)\n\n2\n\n\fC-CRF is a graphical model, as depicted in Figure 1. In the conditioned undirected graph, a white\nvertex represents a ranking score, a gray vertex represents a document, an edge between two white\nvertexes represents the dependency between ranking scores, and an edge between a gray vertex\nand a white vertex represents the dependency of a ranking score on its document (content). 
(In principle a ranking score can depend on all the documents of the query; here, for ease of presentation, we only consider the simple case in which it depends only on the corresponding document.) In C-CRF, feature function h_k represents the dependency between the ranking score of a document and its content, and feature function g_k represents a relation between the ranking scores of two documents. Different retrieval tasks may involve different relations (e.g., the similarity relation or the parent-child relation), as will be explained in Section 4. For ease of reference, we call the feature functions h_k vertex features and the feature functions g_k edge features.\n\nNote that in conventional CRF the output random variables are discrete, while in C-CRF the output variables are continuous. This makes the inference of C-CRF largely different from that of conventional CRF, as will be seen in Section 4.\n\nFigure 1: Continuous CRF Model\n\n3.2 Learning\n\nIn the inference of C-CRF, the parameters {α, β} are given, while in learning, they are to be estimated. Given training data {x(q), y(q)}_{q=1}^N, where each x(q) = {x(q)_1, x(q)_2, ..., x(q)_{n(q)}} is a set of documents of query q and each y(q) = {y(q)_1, y(q)_2, ..., y(q)_{n(q)}} is a set of ranking scores associated with the documents of query q, we employ Maximum Likelihood Estimation to estimate the parameters {α, β} of C-CRF. 
Specifically, we calculate the conditional log likelihood of the training data with respect to the C-CRF model,\n\nL(α, β) = Σ_{q=1}^N log Pr(y(q)|x(q); α, β).    (6)\n\nWe then use Gradient Ascent to maximize the log likelihood, and use the optimal parameters to rank the documents of a new query.\n\n4 Case Study\n\n4.1 Pseudo Relevance Feedback (PRF)\n\nPseudo Relevance Feedback (PRF) [17, 8] is an example of global ranking, in which similarity between documents is considered in the ranking process. Conceptually, in this task one first conducts a round of ranking, assuming that the top ranked documents are relevant; one then conducts another round of ranking, using similarity information between the top ranked documents and the other documents to boost some relevant documents dropped in the first round. The underlying assumption is that similar documents are likely to have similar ranking scores. Here we consider a method of using C-CRF for performing the task.\n\n4.1.1 Continuous CRF for Pseudo Relevance Feedback\n\nWe first introduce the vertex feature functions. The relevance of a document to the query depends on many factors, such as term frequency, page importance, and so on. For each factor we define a vertex feature function. Suppose that x(q)_{i,k} is the k-th relevance factor of document x_i with respect to query q, extracted by operator t_k: x(q)_{i,k} = t_k(x_i, q). 
We define the k-th feature function^1 h_k(y_i, x) as\n\nh_k(y_i, x) = −(y_i − x_{i,k})^2.    (7)\n\nNext, we introduce the edge feature function. Recall that there are two rounds in PRF: the first round scores each document, and the second round re-ranks the documents considering similarity between documents. Here the similarities between any two documents are supposed to be given. We incorporate them into the edge feature function:\n\ng(y_i, y_j, x) = −(1/2) S_{i,j}(y_i − y_j)^2,    (8)\n\nwhere S_{i,j} is the similarity between documents x_i and x_j, which can be extracted by some operator s from the raw content^2 of documents x_i and x_j: S_{i,j} = s(x_i, x_j). The larger S_{i,j} is, the more similar the two documents are. Since only the similarity relation is considered in this task, we have only one edge feature function (K2 = 1).\n\nThe C-CRF for Pseudo Relevance Feedback then becomes\n\nPr(y|x) = (1/Z(x)) exp{ Σ_i Σ_{k=1}^{K1} −α_k(y_i − x_{i,k})^2 + Σ_{i,j} −(β/2) S_{i,j}(y_i − y_j)^2 },    (9)\n\nwhere Z(x) is defined as\n\nZ(x) = ∫_y exp{ Σ_i Σ_{k=1}^{K1} −α_k(y_i − x_{i,k})^2 + Σ_{i,j} −(β/2) S_{i,j}(y_i − y_j)^2 } dy.    (10)\n\nTo guarantee that the exponential in Eq. (10) is integrable, we must have α_k > 0^3 and β > 0.\n\nThe term Σ_i Σ_{k=1}^{K1} −α_k(y_i − x_{i,k})^2 in Eq. (9) plays a role similar to the first round of PRF: the ranking score y_i is determined solely by the relevance factors of document x_i. The term Σ_{i,j} −(β/2) S_{i,j}(y_i − y_j)^2 in Eq. (9) plays a role similar to the second round of PRF: it makes sure that similar documents have similar ranking scores. 
We can see that C-CRF combines the two rounds of ranking of PRF into one.\n\nTo rank the documents of a query, we calculate the ranking scores of the documents with respect to this query in the following way:\n\nF(x) = arg max_y Pr(y|x; α, β) = (α^T e I + βD − βS)^{−1} Xα,    (11)\n\nwhere e is a K1-dimensional all-ones vector, I is an n × n identity matrix, S is a similarity matrix with S_{i,j} = s(x_i, x_j), D is an n × n diagonal matrix with D_{i,i} = Σ_j S_{i,j}, and X is a factor matrix with X_{i,k} = x_{i,k}. If we ignore the relations between documents and set β = 0, then the ranking model degenerates to F(x) = Xα, which is equivalent to a linear model used in conventional local ranking.\n\nFor n documents, the time complexity of straightforwardly computing the ranking model (11) is of order O(n^3), and thus the computation is expensive. The main cost comes from the matrix inversion. We employ a fast computation technique to quickly perform the task. First, we make S a sparse matrix, which has at most K non-zero values in each row and each column. We can do so by only considering the similarity between each document and its K/2 nearest neighbors. Next, we use the Gibbs-Poole-Stockmeyer algorithm [9] to convert S to a banded matrix. Finally, we solve the following system of linear equations and take the solution as the ranking scores:\n\n(α^T e I + βD − βS) F(x) = Xα.    (12)\n\nSince S is a banded matrix, the scores F(x) in Eq. (12) can be computed with time complexity O(n) when K ≪ n [5]. 
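The closed-form inference of Eq. (11)-(12) can be sketched in a few lines of NumPy. This is an illustrative dense solve (the function name is ours); the paper's fast variant instead sparsifies S and uses a banded solver.

```python
import numpy as np

def crf_prf_scores(X, alpha, beta, S):
    """C-CRF inference for PRF, Eq. (12): solve
    (alpha^T e I + beta D - beta S) y = X alpha rather than inverting.
    X: n x K1 relevance factors, alpha: K1 positive weights,
    beta: positive scalar, S: n x n symmetric similarity matrix."""
    n = X.shape[0]
    D = np.diag(S.sum(axis=1))                 # D_ii = sum_j S_ij
    A = alpha.sum() * np.eye(n) + beta * (D - S)
    b = X @ alpha                              # first-round scores X alpha
    return np.linalg.solve(A, b)               # ranking scores y

# toy example: documents 0 and 1 are similar, document 2 is isolated
X = np.array([[1.0], [0.2], [0.9]])
alpha = np.array([1.0])
S = np.array([[0.0, 0.8, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
scores_local = crf_prf_scores(X, alpha, beta=0.0, S=S)  # degenerates to X @ alpha
scores_crf = crf_prf_scores(X, alpha, beta=1.0, S=S)    # similar docs pulled together
```

Setting beta = 0 recovers the local linear model F(x) = Xα; with beta > 0 the similarity edge shrinks the gap between the scores of similar documents, mimicking the second PRF round.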
That is to say, the time complexity of testing a new query is comparable with those of existing local ranking methods.\n\n^1 We omit the superscript (q) in this section when there is no confusion.\n^2 Note that S_{i,j} is not computed from the ranking factors of documents x_i and x_j but from their raw terms. For more details, please refer to our technical report [13].\n^3 α_k > 0 means that the factor x_{i,k} is positively correlated with the ranking score y_i. Considering that some factors may be negatively correlated with y_i, in experiments we duplicate each factor x_{i,k} into two factors x_{i,k} and x_{i,k'} = −x_{i,k}. Then, if α_{k'} > α_k, the factor x_{i,k} is negatively correlated with the ranking score y_i.\n\nAlgorithm 1 Learning Algorithm of Continuous CRF for Pseudo Relevance Feedback\n\nInput: training data {(x(1), y(1)), ..., (x(N), y(N))}, number of iterations T, and learning rate η\nInitialize parameters log α_k and log β\nfor t = 1 to T do\n  for i = 1 to N do\n    Compute gradients ∇_{log α_k} and ∇_{log β} using Eq. (13) and (14) for a single query (x(i), y(i), S(i)).\n    Update log α_k = log α_k + η × ∇_{log α_k} and log β = log β + η × ∇_{log β}\n  end for\nend for\nOutput: parameters of the CRF model, α_k and β.\n\n4.1.2 Learning\n\nIn learning, we try to maximize the log likelihood. Note that maximization of L(α, β) in Eq. (6) is a constrained optimization problem, because we need to guarantee that α_k > 0 and β > 0. Gradient Ascent cannot be directly applied to such a constrained optimization problem. Here we adopt a technique similar to that in [3]. Specifically, we maximize L(α, β) with respect to log α_k and log β instead of α_k and β. 
As a result, the new optimization problem becomes unconstrained, and the Gradient Ascent method can be used. Algorithm 1 shows the learning algorithm, based on Stochastic Gradient Ascent^4, in which the gradients ∇_{log α_k} and ∇_{log β} are computed as follows^5:\n\n∇_{log α_k} = ∂L(α, β)/∂ log α_k = −α_k { Σ_i (y_i^2 − 2 y_i x_{i,k}) + 2 X_{,k}^T A^{−1} b − b^T A^{−1} A^{−1} b − (1/2) (A^{−T}):^T I: },    (13)\n\n∇_{log β} = ∂L(α, β)/∂ log β = −β { (1/2) Σ_{i,j} S_{i,j}(y_i − y_j)^2 − b^T A^{−1}(D − S) A^{−1} b − (1/2) (A^{−T}):^T (D − S): },    (14)\n\nwhere A = α^T e I + β(D − S), b = Xα, X: denotes the long column vector formed by concatenating the columns of matrix X, and X_{,k} denotes the k-th column of matrix X.\n\n^4 Stochastic Gradient means conducting gradient ascent from one query to another.\n^5 Details can be found in [13].\n\n4.2 Topic Distillation (TD)\n\nTopic Distillation [12] is another example of global ranking. In this task, one selects, from a web site, the page that can best represent the topic of the query, using the structure (relation) information of the site. If both a page and its parent page are concerned with the topic, then the parent page is preferred, i.e., ranked higher [12, 11]. Here we apply C-CRF to Topic Distillation.\n\n4.2.1 Continuous CRF for Topic Distillation\n\nWe define the vertex feature functions h_k(y_i, x) in the same way as in Eq. (7). Recall that in Topic Distillation, a page is preferred over its child page if both of them are relevant to a query. Here the parent-child relations between pages are supposed to be given. We incorporate them into the edge feature function. 
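The log-reparameterization trick behind Algorithm 1 can be isolated in a small sketch: optimize log α so that α = exp(log α) stays positive, using the chain rule ∂L/∂ log α = α · ∂L/∂α. The objective below is a toy stand-in, not the C-CRF likelihood, and the function name is ours.

```python
import numpy as np

def ascend_positive(grad_fn, alpha0, eta=0.1, steps=100):
    """Gradient ascent on log-parameters so alpha = exp(log_alpha) > 0.
    grad_fn(alpha) returns dL/d(alpha); by the chain rule,
    dL/d(log alpha) = alpha * dL/d(alpha)."""
    log_alpha = np.log(alpha0)
    for _ in range(steps):
        alpha = np.exp(log_alpha)
        log_alpha += eta * alpha * grad_fn(alpha)  # update in log-space
    return np.exp(log_alpha)

# toy concave objective L(a) = -(a - 2)^2, maximized at a = 2 > 0
alpha_hat = ascend_positive(lambda a: -2.0 * (a - 2.0), np.array([0.5]))
```

The iterate can never leave the positive orthant, so no projection or constraint handling is needed, which is exactly why the paper optimizes log α_k and log β.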
Specifically, we define the (single) edge feature function as\n\ng(y_i, y_j, x) = R_{i,j}(y_i − y_j),    (15)\n\nwhere R_{i,j} = r(x_i, x_j) denotes the parent-child relation: r(x_i, x_j) = 1 if document x_i is the parent of x_j, and r(x_i, x_j) = 0 otherwise.\n\nThe C-CRF for Topic Distillation then becomes\n\nPr(y|x) = (1/Z(x)) exp{ Σ_i Σ_{k=1}^{K1} −α_k(y_i − x_{i,k})^2 + Σ_{i,j} β R_{i,j}(y_i − y_j) },    (16)\n\nwhere Z(x) is defined as\n\nZ(x) = ∫_y exp{ Σ_i Σ_{k=1}^{K1} −α_k(y_i − x_{i,k})^2 + Σ_{i,j} β R_{i,j}(y_i − y_j) } dy.    (17)\n\nTo guarantee that the exponential in Eq. (17) is integrable, we must have α_k > 0.\n\nThe C-CRF can naturally model Topic Distillation: if the value of R_{i,j} is one, then the value of y_i is larger than that of y_j with high probability.\n\nTo rank the documents of a query, we calculate the ranking scores in the following way:\n\nF(x) = arg max_y Pr(y|x; α, β) = (1/(2 α^T e)) (2Xα + β(D_r − D_c)e),    (18)\n\nwhere D_r and D_c are two diagonal matrices with D_{r,ii} = Σ_j R_{i,j} and D_{c,ii} = Σ_j R_{j,i}. Similarly to Pseudo Relevance Feedback, if we ignore the relations between documents and set β = 0, the ranking model degenerates to a linear ranking model in conventional local ranking.\n\n4.2.2 Learning\n\nIn learning, we use Gradient Ascent to maximize the log likelihood. We use the same technique as that for PRF to guarantee α_k > 0. The gradients of L(α, β) with respect to log α_k and β can be found^6 in Eq. (19) and (20). 
Due to space limitations, we omit the details of the learning algorithm, which is similar to Algorithm 1.\n\n∇_{log α_k} = ∂L(α, β)/∂ log α_k = α_k { n/(2a) + (1/(4a^2)) b^T b − (1/a) b^T X_{,k} + Σ_i x_{i,k}^2 − Σ_i (y_i − x_{i,k})^2 },    (19)\n\n∇_β = ∂L(α, β)/∂β = −(1/(2a)) b^T (D_r − D_c)e + Σ_{i,j} R_{i,j}(y_i − y_j),    (20)\n\nwhere n denotes the number of documents for the query, a = α^T e, b = 2Xα + β(D_r − D_c)e, and X_{,k} denotes the k-th column of matrix X.\n\n4.3 Continuous CRF for Multiple Relations\n\nWe considered only one type of relation in each of the previous two cases. We can also conduct global ranking by utilizing multiple types of relation. C-CRF is a powerful tool for performing this task: it can easily incorporate various types of relation as edge feature functions. For example, we can combine the similarity relation and the parent-child relation by using the following C-CRF model:\n\nPr(y|x) = (1/Z(x)) exp{ Σ_i Σ_{k=1}^{K1} −α_k(y_i − x_{i,k})^2 + Σ_{i,j} ( β_1 R_{i,j}(y_i − y_j) − (β_2/2) S_{i,j}(y_i − y_j)^2 ) }.\n\nIn this case, the ranking scores of documents for a new query are calculated as follows:\n\nF(x) = arg max_y Pr(y|x; α, β) = (α^T e I + β_2 D − β_2 S)^{−1} ( Xα + (β_1/2)(D_r − D_c)e ).\n\n5 Experiments\n\nWe empirically tested the performance of C-CRF on both Pseudo Relevance Feedback and Topic Distillation^7. 
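The Topic Distillation scores of Eq. (18) reduce to a simple vector expression: each page gets a boost proportional to its number of children minus its number of parents. A toy sketch (function and variable names are ours):

```python
import numpy as np

def crf_td_scores(X, alpha, beta, R):
    """Topic Distillation ranking scores, Eq. (18):
    y = (2 X alpha + beta (Dr - Dc) e) / (2 alpha^T e).
    X: n x K1 relevance factors, alpha: positive weights, beta > 0,
    R: n x n 0/1 matrix with R[i, j] = 1 iff page i is the parent of j."""
    dr = R.sum(axis=1)   # Dr_ii: number of children of page i
    dc = R.sum(axis=0)   # Dc_ii: number of parents of page i
    return (2.0 * (X @ alpha) + beta * (dr - dc)) / (2.0 * alpha.sum())

# toy site: page 0 is the parent of pages 1 and 2
X = np.array([[0.5], [0.5], [0.4]])
alpha = np.array([1.0])
R = np.array([[0, 1, 1],
              [0, 0, 0],
              [0, 0, 0]])
scores = crf_td_scores(X, alpha, beta=0.2, R=R)
```

With beta = 0 the scores are just a rescaled Xα (the local linear model); with beta > 0 a parent page is ranked above an equally relevant child, which is the preference Topic Distillation encodes.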
As data, we used LETOR [10], a public dataset for learning to rank research.\n\n^6 Please refer to [13] for the derivation of the two equations.\n^7 Please refer to [13] for more details of the experiments.\n\nTable 1: Ranking Accuracy\n\nPRF on OHSUMED Data\nAlgorithms  ndcg1   ndcg2   ndcg5\nBM25        0.3994  0.3931  0.3972\nBM25-PRF    0.3962  0.4277  0.3981\nRankSVM     0.4952  0.4755  0.4579\nListNet     0.5231  0.4970  0.4662\nC-CRF       0.5443  0.4986  0.4808\n\nTD on TREC2004 Data\nAlgorithms  ndcg1   ndcg2   ndcg5\nBM25        0.3067  0.2933  0.2293\nST          0.3200  0.3133  0.3232\nSS          0.3200  0.3200  0.3227\nRankSVM     0.4400  0.4333  0.3935\nListNet     0.4400  0.4267  0.4209\nC-CRF       0.5200  0.4733  0.4428\n\nWe made use of OHSUMED in LETOR for Pseudo Relevance Feedback and TREC2004 in LETOR for Topic Distillation. As the evaluation measure, we utilized NDCG@n (Normalized Discounted Cumulative Gain) [6].\n\nAs baseline methods for the two tasks, we used several local ranking algorithms: BM25, RankSVM [7], and ListNet [2]. BM25 is a widely used non-learning ranking method. RankSVM is a state-of-the-art algorithm of the pairwise approach to learning to rank, and ListNet is a state-of-the-art algorithm of the listwise approach. For Pseudo Relevance Feedback, we also compared with a traditional feedback method based on BM25 (BM25-PRF for short). For Topic Distillation, we also compared with two traditional methods, sitemap based term propagation (ST) and sitemap based score propagation (SS) [11], which propagate relevance along the sitemap structure. These algorithms can be regarded as a kind of global ranking method, but they are not based on supervised learning. We conducted 5-fold cross validation for C-CRF and all the baseline methods, using the partition provided in LETOR.\n\nThe left part of Table 1 shows the ranking accuracies of BM25, BM25-PRF, RankSVM, ListNet, and C-CRF in terms of NDCG averaged over the five trials on OHSUMED data. 
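For reference, NDCG@n as used here can be sketched as follows, assuming the common (2^rel − 1)/log2(i + 1) gain/discount form; the official LETOR evaluation tool may differ in edge cases, so this is illustrative only.

```python
import numpy as np

def ndcg_at_n(relevances, n):
    """NDCG@n for graded relevance labels given in ranked order:
    DCG@n = sum_{i<=n} (2^rel_i - 1) / log2(i + 1), normalized by the
    DCG of the ideal (relevance-sorted) ordering."""
    rel = np.asarray(relevances, dtype=float)[:n]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(i + 1), i = 1..
    dcg = np.sum((2.0 ** rel - 1.0) / discounts)
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:n]
    idcg = np.sum((2.0 ** ideal - 1.0) / discounts)
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0 at every cutoff, which is why the values in Table 1 can be read as fractions of the ideal ordering's gain.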
C-CRF's performance is superior to those of RankSVM and ListNet. This is particularly true for NDCG@1: C-CRF achieves about 5 points higher accuracy than RankSVM and more than 2 points higher accuracy than ListNet. The results indicate that C-CRF based global ranking can indeed improve search relevance. C-CRF also outperforms BM25-PRF, the traditional method of using similarity information for ranking. The result suggests that it is better to employ a supervised learning approach for the task.\n\nThe right part of Table 1 shows the performances of BM25, SS, ST, RankSVM, ListNet, and C-CRF in terms of NDCG averaged over the 5 trials on TREC data. C-CRF outperforms RankSVM and ListNet at all NDCG positions. This is particularly true for NDCG@1: C-CRF achieves 8 points higher accuracy than RankSVM and ListNet, which is a more than 15% relative improvement. The result indicates that C-CRF based global ranking can achieve better results than local ranking for this task. C-CRF also outperforms SS and ST, the traditional methods of using parent-child information for Topic Distillation. The result again suggests that it is better to employ a learning based approach.\n\n6 Related Work\n\nMost existing work on using relation information in learning is for classification (e.g., [19, 1]) and clustering (e.g., [4, 15]). To the best of our knowledge, there has not been much work on using relations for ranking, except Relational Ranking SVM (RRSVM), proposed in [14], which is based on a motivation similar to ours.\n\nThere are large differences between RRSVM and C-CRF, however. For RRSVM, it is hard to combine the uses of multiple types of relation. In contrast, C-CRF can easily do so by incorporating the relations in different edge feature functions. There is a hyperparameter β in RRSVM representing the trade-off between content and relation information; it needs to be manually tuned. 
This is not necessary for C-CRF, because the trade-off between them is handled naturally by the feature weights in the model, which can be learned automatically. Furthermore, in some cases certain approximations must be made on the model in RRSVM (e.g., for Topic Distillation) in order to fit into the learning framework of SVM. Such approximations are unnecessary in C-CRF. Besides, C-CRF achieves better ranking accuracy than that reported for RRSVM [14] on the same benchmark dataset.\n\n7 Conclusions\n\nWe studied learning to rank methods for the global ranking problem, in which we use both the content information of objects and the relation information between objects for ranking. A Continuous CRF (C-CRF) model was proposed for performing the learning task. Taking Pseudo Relevance Feedback and Topic Distillation as examples, we showed how to use C-CRF in global ranking. Experimental results on benchmark data show that C-CRF improves upon the baseline methods in the global ranking tasks.\n\nThere are still issues which we need to investigate next. (1) We have studied the method of learning C-CRF with Maximum Likelihood Estimation. It is interesting to see how to apply Maximum A Posteriori Estimation to the problem. (2) We have assumed that absolute ranking scores are given in the training data. We will study how to train C-CRF with relative preference data. (3) We have studied two global ranking tasks: Pseudo Relevance Feedback and Topic Distillation. We plan to look at other tasks in the future.\n\nReferences\n\n[1] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399-2434, 2006.\n\n[2] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In ICML '07, pages 129-136, 2007.\n\n[3] W. Chu and Z. Ghahramani. 
Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019-1041, 2005.\n\n[4] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD '01, 2001.\n\n[5] G. H. Golub and C. F. V. Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, 1996.\n\n[6] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422-446, 2002.\n\n[7] T. Joachims. Optimizing search engines using clickthrough data. In KDD '02, pages 133-142, 2002.\n\n[8] K. L. Kwok. A document-document similarity measure based on cited titles and probability theory, and its application to relevance feedback retrieval. In SIGIR '84, pages 221-231, 1984.\n\n[9] J. G. Lewis. Algorithm 582: The Gibbs-Poole-Stockmeyer and Gibbs-King algorithms for reordering sparse matrices. ACM Trans. Math. Softw., 8(2):190-194, 1982.\n\n[10] T.-Y. Liu, J. Xu, T. Qin, W.-Y. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In SIGIR '07 Workshop, 2007.\n\n[11] T. Qin, T.-Y. Liu, X.-D. Zhang, Z. Chen, and W.-Y. Ma. A study of relevance propagation for web search. In SIGIR '05, pages 408-415, 2005.\n\n[12] T. Qin, T.-Y. Liu, X.-D. Zhang, G. Feng, D.-S. Wang, and W.-Y. Ma. Topic distillation via sub-site retrieval. Information Processing & Management, 43(2):445-460, 2007.\n\n[13] T. Qin, T.-Y. Liu, X.-D. Zhang, D.-S. Wang, and H. Li. Global ranking of documents using continuous conditional random fields. Technical Report MSR-TR-2008-156, Microsoft Corporation, 2008.\n\n[14] T. Qin, T.-Y. Liu, X.-D. Zhang, D.-S. Wang, W.-Y. Xiong, and H. Li. Learning to rank relational objects and its application to web search. In WWW '08, 2008.\n\n[15] J. Shi and J. Malik. Normalized cuts and image segmentation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.\n\n[16] C. Sutton and A. McCallum. An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning. MIT Press, 2006.\n\n[17] T. Tao and C. Zhai. Regularized estimation of mixture models for robust pseudo-relevance feedback. In SIGIR '06, pages 162-169, 2006.\n\n[18] C. X. Zhai, W. W. Cohen, and J. Lafferty. Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In SIGIR '03, pages 10-17, 2003.\n\n[19] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems, 2003.\n", "award": [], "sourceid": 518, "authors": [{"given_name": "Tao", "family_name": "Qin", "institution": null}, {"given_name": "Tie-yan", "family_name": "Liu", "institution": null}, {"given_name": "Xu-dong", "family_name": "Zhang", "institution": null}, {"given_name": "De-sheng", "family_name": "Wang", "institution": null}, {"given_name": "Hang", "family_name": "Li", "institution": null}]}