{"title": "Fast and Accurate Inference of Plackett\u2013Luce Models", "book": "Advances in Neural Information Processing Systems", "page_first": 172, "page_last": 180, "abstract": "We show that the maximum-likelihood (ML) estimate of models derived from Luce's choice axiom (e.g., the Plackett-Luce model) can be expressed as the stationary distribution of a Markov chain. This conveys insight into several recently proposed spectral inference algorithms. We take advantage of this perspective and formulate a new spectral algorithm that is significantly more accurate than previous ones for the Plackett--Luce model. With a simple adaptation, this algorithm can be used iteratively, producing a sequence of estimates that converges to the ML estimate. The ML version runs faster than competing approaches on a benchmark of five datasets. Our algorithms are easy to implement, making them relevant for practitioners at large.", "full_text": "Fast and Accurate Inference of Plackett\u2013Luce Models\n\nLucas Maystre\n\nEPFL\n\nlucas.maystre@epfl.ch\n\nmatthias.grossglauser@epfl.ch\n\nMatthias Grossglauser\n\nEPFL\n\nAbstract\n\nWe show that the maximum-likelihood (ML) estimate of models derived from\nLuce\u2019s choice axiom (e.g., the Plackett\u2013Luce model) can be expressed as the\nstationary distribution of a Markov chain. This conveys insight into several recently\nproposed spectral inference algorithms. We take advantage of this perspective and\nformulate a new spectral algorithm that is signi\ufb01cantly more accurate than previous\nones for the Plackett\u2013Luce model. With a simple adaptation, this algorithm can\nbe used iteratively, producing a sequence of estimates that converges to the ML\nestimate. The ML version runs faster than competing approaches on a benchmark\nof \ufb01ve datasets. 
Our algorithms are easy to implement, making them relevant for practitioners at large.

1 Introduction

Aggregating pairwise comparisons and partial rankings are important problems with applications in econometrics [1], psychometrics [2, 3], sports ranking [4, 5] and multiclass classification [6]. One possible approach to tackle these problems is to postulate a statistical model of discrete choice. In this spirit, Luce [7] stated the choice axiom in a foundational work published over fifty years ago. Denote p(i | A) the probability of choosing item i when faced with alternatives in the set A. Given two items i and j, and any two sets of alternatives A and B containing i and j, the axiom posits that

    p(i | A) / p(j | A) = p(i | B) / p(j | B).

In other words, the odds of choosing item i over item j are independent of the rest of the alternatives. This simple assumption directly leads to a unique parametric choice model, known as the Bradley–Terry model in the case of pairwise comparisons, and the Plackett–Luce model in the generalized case of k-way rankings. In this paper, we highlight a connection between the maximum-likelihood (ML) estimate under these models and the stationary distribution of a Markov chain parametrized by the observed choices. Markov chains were already used in recent work [8, 9, 10] to aggregate pairwise comparisons and rankings. These approaches reduce the problem to that of finding a stationary distribution. By formalizing the link between the likelihood of observations under the choice model and a certain Markov chain, we unify these algorithms and explicate them from an ML inference perspective. We will also take a detour, and use this link in the reverse direction to give an alternative proof to a recent result on the error rate of the ML estimate [11], by using spectral analysis techniques. Beyond this, we make two contributions to statistical inference for this model. 
First, we develop a simple, consistent and computationally efficient spectral algorithm that is applicable to a wide range of models derived from the choice axiom. The exact formulation of the Markov chain used in the algorithm is distinct from related work [9, 10] and achieves a significantly better statistical efficiency at no additional computational cost. Second, we observe that with a small adjustment, the algorithm can be used iteratively, and it then converges to the ML estimate. An evaluation on five real-world datasets reveals that it runs consistently faster than competing approaches and has a much more predictable performance that does not depend on the structure of the data. The key step, finding a stationary distribution, can be offloaded to commonly available linear-algebra primitives, which makes our algorithms scale well. Our algorithms are intuitively pleasing, simple to understand and implement, and they outperform the state of the art, hence we believe that they will be highly useful to practitioners.
The rest of the paper is organized as follows. We begin by introducing some notations and presenting a few useful facts about the choice model and about Markov chains. By necessity, our exposition is succinct, and the reader is encouraged to consult Luce [7] and Levin et al. [12] for a more thorough exposition. In Section 2, we discuss related work. In Section 3, we present our algorithms, and in Section 4 we evaluate them on synthetic and real-world data. We conclude in Section 5.

Discrete choice model. Denote by n the number of items. Luce's choice axiom implies that each item i ∈ {1, . . . , n} can be parametrized by a positive strength π_i ∈ R_{>0} such that

    p(i | A) = π_i / Σ_{j∈A} π_j

for any A containing i. The strengths π = [π_i] are defined up to a multiplicative factor; for identifiability, we let Σ_i π_i = 1. An alternative parametrization of the model is given by θ_i = log(π_i), in which case the model is sometimes referred to as conditional logit [1].

Markov chain theory. We represent a finite, stationary, continuous-time Markov chain by a directed graph G = (V, E), where V is the set of states and E is the set of transitions with positive rate. If G is strongly connected, the Markov chain is said to be ergodic and admits a unique stationary distribution π. The global balance equations relate the transition rates λ_ij to the stationary distribution as follows:

    Σ_{j≠i} π_i λ_ij = Σ_{j≠i} π_j λ_ji   ∀i.   (1)

The stationary distribution is therefore invariant to changes in the time scale, i.e., to a rescaling of the transition rates. In the supplementary file, we briefly discuss how to find π given [λ_ij].

2 Related work

Spectral methods applied to ranking and scoring items from noisy choices have a long-standing history. To the best of our knowledge, Saaty [13] is the first to suggest using the leading eigenvector of a matrix of inconsistent pairwise judgments to score alternatives. Two decades later, Page et al. [14] developed PageRank, an algorithm that ranks Web pages according to the stationary distribution of a random walk on the hyperlink graph. In the same vein, Dwork et al. [8] proposed several variants of Markov chains for aggregating heterogeneous rankings. The idea is to construct a random walk that is biased towards high-ranked items, and use the ranking induced by the stationary distribution. More recently, Negahban et al. [9] presented Rank Centrality, an algorithm for aggregating pairwise comparisons close in spirit to that of [8]. 
When the data is generated under the Bradley–Terry model, this algorithm asymptotically recovers model parameters with only ω(n log n) pairwise comparisons. For the more general case of rankings under the Plackett–Luce model, Azari Soufiani et al. [10] propose to break rankings into pairwise comparisons and to apply an algorithm similar to Rank Centrality. They show that the resulting estimator is statistically consistent. Interestingly, many of these spectral algorithms can be related to the method of moments, a broadly applicable alternative to maximum-likelihood estimation.
The history of algorithms for maximum-likelihood inference under Luce's model goes back even further. In the special case of pairwise comparisons, the same iterative algorithm was independently discovered by Zermelo [15], Ford [16] and Dykstra [17]. Much later, this algorithm was explained by Hunter [18] as an instance of a minorization-maximization (MM) algorithm, and extended to the more general choice model. Today, Hunter's MM algorithm is the de facto standard for ML inference in Luce's model. As the likelihood can be written as a concave function, off-the-shelf optimization procedures such as the Newton-Raphson method can also be used, although they have been reported to be slower and less practical [18]. Recently, Kumar et al. [19] looked at the problem of finding the transition matrix of a Markov chain, given its stationary distribution. The problem of inferring Luce's model parameters from data can be reformulated in their framework, and the ML estimate is the solution to the inversion of the stationary distribution. Their work stands out as the first to link ML inference to Markov chains, albeit very differently from the way presented in our paper. Beyond algorithms, properties of the maximum-likelihood estimator in this model were studied extensively. Hajek et al. 
[11] consider the Plackett\u2013Luce model for k-way rankings. They\ngive an upper bound to the estimation error and show that the ML estimator is minimax-optimal. In\nsummary, they show that only \u03c9(n/k log n) samples are enough to drive the mean-square error down\n\n2\n\n\fto zero, as n increases. Rajkumar and Agarwal [20] consider the Bradley\u2013Terry model for pairwise\ncomparisons. They show that the ML estimator is able to recover the correct ranking, even when\nthe data is generated as per another model, e.g., Thurstone\u2019s [2], as long as a so-called low-noise\ncondition is satis\ufb01ed. We also mention that as an alternative to likelihood maximization, Bayesian\ninference has also been proposed. Caron and Doucet [21] present a Gibbs sampler, and Guiver and\nSnelson [22] propose an approximate inference algorithm based on expectation propagation.\nIn this work, we provide a unifying perspective on recent advances in spectral algorithms [9, 10] from\na maximum-likelihood estimation perspective. It turns out that this perspective enables us to make\ncontributions on both sides: On the one hand, we develop an improved and more general spectral\nranking algorithm, and on the other hand, we propose a faster procedure for ML inference by using\nthis algorithm iteratively.\n\n3 Algorithms\n\nWe begin by expressing the ML estimate under the choice model as the stationary distribution\nof a Markov chain. We then take advantage of this formulation to propose novel algorithms for\nmodel inference. Although our derivation is made in the general choice model, we will also discuss\nimplications for the special cases of pairwise data in Section 3.3 and k-way ranking data in Section 3.4.\nSuppose that we collect d independent observations in the multiset D = {(c(cid:96), A(cid:96)) | (cid:96) = 1, . . . 
, d}. Each observation consists of a choice c_ℓ among a set of alternatives A_ℓ; we say that i wins over j and denote by i ≻ j whenever i, j ∈ A_ℓ and c_ℓ = i. We define the directed comparison graph as G_D = (V, E), with V = {1, . . . , n} and (j, i) ∈ E if i wins at least once over j in D. In order to ensure that the ML estimate is well-defined, we make the standard assumption that G_D is strongly connected [16, 18]. In practice, if this assumption does not hold, we can consider each strongly connected component separately.

3.1 ML estimate as a stationary distribution

For simplicity, we denote the model parameter associated with item c_ℓ by π_ℓ. The log-likelihood of parameters π given observations D is

    log L(π | D) = Σ_{ℓ=1}^{d} ( log π_ℓ − log Σ_{j∈A_ℓ} π_j ).   (2)

For each item, we define two sets of indices. Let W_i := {ℓ | i ∈ A_ℓ, c_ℓ = i} and L_i := {ℓ | i ∈ A_ℓ, c_ℓ ≠ i} be the indices of the observations where item i wins over and loses against the alternatives, respectively. The log-likelihood (2) is not concave in π (it can be made strictly concave using a simple reparametrization), but we briefly show in the supplementary material that it admits a unique stationary point, at the ML estimate π̂. The optimality condition ∇_π̂ log L = 0 implies

    ∂ log L / ∂π̂_i = Σ_{ℓ∈W_i} ( 1/π̂_i − 1/Σ_{j∈A_ℓ} π̂_j ) − Σ_{ℓ∈L_i} 1/Σ_{j∈A_ℓ} π̂_j = 0   ∀i   (3)

    ⟺ Σ_{j≠i} Σ_{ℓ∈W_i∩L_j} π̂_j / Σ_{t∈A_ℓ} π̂_t − Σ_{j≠i} Σ_{ℓ∈W_j∩L_i} π̂_i / Σ_{t∈A_ℓ} π̂_t = 0   ∀i.   (4)

In order to go from (3) to (4), we multiply by π̂_i and rearrange the terms. To simplify the notation, let us further introduce the function

    f(S, π) := Σ_{A∈S} 1 / Σ_{i∈A} π_i,

which takes observations S ⊆ D and an instance of model parameters π, and returns a non-negative real number. Let D_{i≻j} := {(c_ℓ, A_ℓ) ∈ D | ℓ ∈ W_i ∩ L_j}, i.e., the set of observations where i wins over j. Then (4) can be rewritten as

    Σ_{j≠i} π̂_i · f(D_{j≻i}, π̂) = Σ_{j≠i} π̂_j · f(D_{i≻j}, π̂)   ∀i.   (5)

Algorithm 1 Luce Spectral Ranking
Require: observations D
 1: λ ← 0_{n×n}
 2: for (i, A) ∈ D do
 3:   for j ∈ A \ {i} do
 4:     λ_ji ← λ_ji + n/|A|
 5:   end for
 6: end for
 7: π̄ ← stat. dist. of Markov chain λ
 8: return π̄

Algorithm 2 Iterative Luce Spectral Ranking
Require: observations D
 1: π ← [1/n, . . . 
, 1/n]⊤
 2: repeat
 3:   λ ← 0_{n×n}
 4:   for (i, A) ∈ D do
 5:     for j ∈ A \ {i} do
 6:       λ_ji ← λ_ji + 1 / Σ_{t∈A} π_t
 7:     end for
 8:   end for
 9:   π ← stat. dist. of Markov chain λ
10: until convergence

This formulation conveys a new viewpoint on the ML estimate. It is easy to recognize the global balance equations (1) of a Markov chain on n states (representing the items), with transition rates λ_ji = f(D_{i≻j}, π̂) and stationary distribution π̂. These transition rates have an interesting interpretation: f(D_{i≻j}, π) is the count of how many times i wins over j, weighted by the strength of the alternatives. At this point, it is useful to observe that for any parameters π, f(D_{i≻j}, π) ≥ 1 if (j, i) ∈ E, and 0 otherwise. Combined with the assumption that G_D is strongly connected, it follows that any π parametrizes the transition rates of an ergodic (homogeneous) Markov chain. The ergodicity of the inhomogeneous Markov chain, where the transition rates are constantly updated to reflect the current distribution over states, is shown by the following theorem.

Theorem 1. The Markov chain with inhomogeneous transition rates λ_ji = f(D_{i≻j}, π) converges to the maximum-likelihood estimate π̂, for any initial distribution in the open probability simplex.

Proof (sketch). By (5), π̂ is the unique invariant distribution of the Markov chain. In the supplementary file, we look at an equivalent uniformized discrete-time chain. Using the contraction mapping principle, one can show that this chain converges to the invariant distribution.

3.2 Approximate and exact ML inference

We approximate the Markov chain described in (5) by considering a priori that all alternatives have equal strength. That is, we set the transition rates λ_ji := f(D_{i≻j}, π) by fixing π to [1/n, . . . , 1/n]⊤. For i ≠ j, the contribution of i winning over j to the rate of transition λ_ji is n/|A|. In other words, for each observation, the winning item is rewarded by a fixed amount of incoming rate that is evenly split across the alternatives (the chunk allocated to itself is discarded.) We interpret the stationary distribution π̄ as an estimate of model parameters. Algorithm 1 summarizes this procedure, called Luce Spectral Ranking (LSR.) If we consider a growing number of observations, LSR converges to the true model parameters π*, even in the restrictive case where the sets of alternatives are fixed.

Theorem 2. Let A = {A_ℓ} be a collection of sets of alternatives such that for any partition of A into two non-empty sets S and T, (∪_{A∈S} A) ∩ (∪_{A∈T} A) ≠ ∅.¹ Let d_ℓ be the number of choices observed over alternatives A_ℓ. Then π̄ → π* as d_ℓ → ∞ ∀ℓ.

Proof (sketch). The condition on A ensures that asymptotically G_D is strongly connected. Let d → ∞ be a shorthand for d_ℓ → ∞ ∀ℓ. We can show that if items i and j are compared in at least one set of alternatives, the ratio of transition rates satisfies lim_{d→∞} λ_ij/λ_ji = π*_j / π*_i. It follows that in the limit of d → ∞, the stationary distribution is π*. A rigorous proof is given in the supplementary file.

Starting from the LSR estimate, we can iteratively refine the transition rates of the Markov chain and obtain a sequence of estimates. By (5), the only fixed point of this iteration is the ML estimate π̂. 
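To make the two procedures concrete, here is a minimal Python sketch of Algorithms 1 and 2 (numpy only). The function names are ours, and the stationary distribution is found here by a dense least-squares solve of the balance equations, one of several options; this is a sketch, not the authors' implementation.

```python
import numpy as np

def _stationary(rates):
    """Stationary distribution of a continuous-time Markov chain given
    its off-diagonal transition rates (global balance, eq. (1))."""
    n = len(rates)
    # Generator matrix: off-diagonal rates, rows summing to zero.
    Q = rates - np.diag(rates.sum(axis=1))
    # Solve pi Q = 0 together with the normalization sum(pi) = 1.
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def lsr(n, data):
    """Luce Spectral Ranking (Algorithm 1).

    data: iterable of (winner, alternatives) pairs, items in 0..n-1."""
    rates = np.zeros((n, n))
    for winner, alts in data:
        for j in alts:
            if j != winner:
                rates[j, winner] += n / len(alts)
    return _stationary(rates)

def ilsr(n, data, tol=1e-8, max_iter=100):
    """Iterative LSR (Algorithm 2); its fixed point is the ML estimate."""
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        rates = np.zeros((n, n))
        for winner, alts in data:
            # Each win contributes rate weighted by the alternatives' strength.
            w = 1.0 / sum(pi[t] for t in alts)
            for j in alts:
                if j != winner:
                    rates[j, winner] += w
        new = _stationary(rates)
        if np.linalg.norm(new - pi) < tol:
            return new
        pi = new
    return pi
```

For pairwise data where item 0 beats item 1 three times out of four comparisons, both functions recover the intuitive strength 3/4 for item 0.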
We call this procedure I-LSR and describe it in Algorithm 2.
LSR (or one iteration of I-LSR) entails (a) filling a matrix of (weighted) pairwise counts and (b) finding a stationary distribution. Let D := Σ_ℓ |A_ℓ|, and let S be the running time of finding the stationary distribution. Then LSR has running time O(D + S). As a comparison, one iteration of the MM algorithm [18] is O(D). Finding the stationary distribution can be implemented in different ways. For example, in a sparse regime where D ≪ n², the stationary distribution can be found with the power method in a few O(D) sparse matrix multiplications. In the supplementary file, we give more details about possible implementations. In practice, whether D or S turns out to be dominant in the running time is not a foregone conclusion.

¹ This is equivalent to stating that the hypergraph H = (V, A) is connected.

3.3 Aggregating pairwise comparisons

A widely-used special case of Luce's choice model occurs when all sets of alternatives contain exactly two items, i.e., when the data consists of pairwise comparisons. This model was proposed by Zermelo [15], and later by Bradley and Terry [3]. As the stationary distribution is invariant to changes in the time-scale, we can rescale the transition rates and set λ_ji := |D_{i≻j}| when using LSR on pairwise data. Let S be the set containing the pairs of items that have been compared at least once. In the case where each pair (i, j) ∈ S has been compared exactly p times, LSR is strictly equivalent to a continuous-time Markov-chain formulation of Rank Centrality [9]. In fact, our derivation justifies Rank Centrality as an approximate ML inference algorithm for the Bradley–Terry model. Furthermore, we provide a principled extension of Rank Centrality to the case where the number of comparisons observed is unbalanced. 
Rank Centrality considers transition rates proportional to the ratio of wins, whereas (5) justifies making transition rates proportional to the count of wins.
Negahban et al. [9] also provide an upper bound on the error rate of Rank Centrality, which essentially shows that it is minimax-optimal. Because the two estimators are equivalent in the setting of balanced pairwise comparisons, the bound also applies to LSR. More interestingly, the expression of the ML estimate as a stationary distribution enables us to reuse the same analytical techniques to bound the error of the ML estimate. In the supplementary file, we therefore provide an alternative proof of the recent result of Hajek et al. [11] on the minimax-optimality of the ML estimate.

3.4 Aggregating partial rankings

Another case of interest is when observations do not consist of only a single choice, but of a ranking over the alternatives. We now suppose m observations consisting of k-way rankings, 2 ≤ k ≤ n. For conciseness, we suppose that k is the same for all observations. Let one such observation be σ(1) ≻ . . . ≻ σ(k), where σ(p) is the item with p-th rank. Luce [7] and later Plackett [4] independently proposed a model of rankings where

    P(σ(1) ≻ . . . ≻ σ(k)) = Π_{r=1}^{k} π_σ(r) / Σ_{p=r}^{k} π_σ(p).

In this model, a ranking can be interpreted as a sequence of k − 1 independent choices: Choose the first item, then choose the second among the remaining alternatives, etc. With this point of view in mind, LSR and I-LSR can easily accommodate data consisting of k-way rankings, by decomposing the m observations into d = m(k − 1) choices.
Azari Soufiani et al. [10] provide a class of consistent estimators for the Plackett–Luce model, using the idea of breaking rankings into pairwise comparisons. 
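For illustration, the ranking probability and the sequential-choice decomposition described above can be sketched in plain Python (helper names are ours):

```python
def plackett_luce_prob(ranking, pi):
    """Probability of observing `ranking` (best item first) under a
    Plackett-Luce model with strengths `pi` (indexable by item)."""
    prob = 1.0
    for r in range(len(ranking)):
        # Choose the r-th item among the remaining alternatives.
        prob *= pi[ranking[r]] / sum(pi[i] for i in ranking[r:])
    return prob

def ranking_to_choices(ranking):
    """Decompose a k-way ranking into k - 1 choice observations
    (winner, alternatives), the decomposition used by LSR and I-LSR."""
    return [(ranking[r], tuple(ranking[r:])) for r in range(len(ranking) - 1)]
```

With strengths (2, 1, 1), the ranking 0 ≻ 1 ≻ 2 has probability (2/4) · (1/2) = 1/4.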
Although they explain their algorithms from a generalized-method-of-moments perspective, it is straightforward to reinterpret their estimators as stationary distributions of particular Markov chains. In fact, for k = 2, their algorithm GMM-F is identical to LSR. When k > 2 however, breaking a ranking into (k choose 2) pairwise comparisons implicitly makes the (incorrect) assumption that these comparisons are statistically independent. The Markov chain that LSR builds breaks rankings into pairwise rate contributions, but weights the contributions differently depending on the rank of the winning item. In Section 4, we show that this weighting turns out to be crucial. Our approach yields a significant improvement in statistical efficiency, yet keeps the same attractive computational cost and ease of use.

3.5 Applicability to other models

Several other variants and extensions of Luce's choice model have been proposed. For example, Rao and Kupper [23] extend the Bradley–Terry model to the case where a comparison between two items can result in a tie. In the supplementary file, we show that the ML estimate in the Rao–Kupper model can also be formulated as a stationary distribution, and we provide corresponding adaptations of LSR and I-LSR. We believe that our algorithms can be generalized to further models that are based on the choice axiom. However, this axiom is key, and other choice models (such as Thurstone's [2]) do not admit the stationary-distribution interpretation we derive here.

4 Experimental evaluation

In this section, we compare LSR and I-LSR to other inference algorithms in terms of (a) statistical efficiency, and (b) empirical performance. In order to understand the efficiency of the estimators, we generate synthetic data from a known ground truth. 
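Such synthetic data can be generated by exploiting the sequential-choice view of the model; a minimal sketch, with a helper name of our choosing:

```python
import random

def sample_ranking(items, pi, rng=random):
    """Sample a full ranking from a Plackett-Luce model with strengths
    `pi` by repeatedly choosing among the remaining alternatives."""
    remaining = list(items)
    ranking = []
    while remaining:
        # Choose the next item proportionally to its strength.
        pick = rng.choices(remaining, weights=[pi[i] for i in remaining])[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking
```

Partial k-way rankings can then be obtained by restricting a sampled full ranking to random subsets of items, as in the procedure described below.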
Then, we look at five real-world datasets and investigate the practical performance of the algorithms in terms of accuracy, running time and convergence rate.
Error metric. As the probability of i winning over j depends on the ratio of strengths π_i/π_j, the strengths are typically logarithmically spaced. In order to evaluate the accuracy of an estimate π with respect to ground-truth parameters π*, we therefore use a log transformation, reminiscent of the random-utility-theoretic formulation of the choice model [1, 11]. Define θ := [log π_i − t], with t chosen such that Σ_i θ_i = 0. We will consider the root-mean-square error (RMSE)

    ERMS = ‖θ − θ*‖₂ / √n.

4.1 Statistical efficiency

To assess the statistical efficiency of LSR and other algorithms, we follow the experimental procedure of Hajek et al. [11]. We consider n = 1024 items, and draw θ* uniformly at random in [−2, 2]^n. We generate d = 64 full rankings over the n items from a Plackett–Luce model parametrized with π* ∝ [e^{θ*_i}]. For a given k ∈ {2^1, . . . , 2^10}, we break down each of the full rankings as follows. First, we partition the items into n/k subsets of size k uniformly at random. Then, we store the k-way rankings induced by the full ranking on each of those subsets. As a result, we obtain m = dn/k statistically independent k-way partial rankings. For a given estimator, this data produces an estimate θ, for which we record the root-mean-square error to θ*. We consider four estimators. The first two (LSR and ML) work on the ranking data directly. The remaining two follow Azari Soufiani et al. [10], who suggest breaking down k-way rankings into (k choose 2) pairwise comparisons. These comparisons are then used by LSR, resulting in Azari Soufiani et al.'s GMM-F estimator, and by an ML estimator (ML-F.) 
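The error metric above can be computed as follows (a sketch; the function name is ours):

```python
import numpy as np

def rms_error(pi, pi_star):
    """RMSE between two strength vectors, after the log transform and
    mean-centering described in the error-metric paragraph."""
    theta = np.log(pi) - np.mean(np.log(pi))
    theta_star = np.log(pi_star) - np.mean(np.log(pi_star))
    return np.linalg.norm(theta - theta_star) / np.sqrt(len(pi))
```

Centering makes the metric invariant to the arbitrary multiplicative scaling of the strengths.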
In short, the four estimators vary according to (a) whether they use as-is rankings or derived comparisons, and (b) whether the model is fitted using an approximate spectral algorithm or using exact maximum likelihood. Figure 1 plots ERMS for increasing sizes of partial rankings, as well as a lower bound to the error of any estimator for the Plackett–Luce model (see Hajek et al. [11] for details.) We observe that breaking the rankings into pairwise comparisons (*-F estimators) incurs a significant efficiency loss over using the k-way rankings directly (LSR and ML.) We conclude that by correctly weighting pairwise rates in the Markov chain, LSR distinctly outperforms the rank-breaking approach as k increases. We also observe that the ML estimate is always more efficient. Spectral estimators such as LSR provide a quick, asymptotically consistent estimate of parameters, but this observation justifies calling them approximate inference algorithms.

4.2 Empirical performance

We investigate the performance of various inference algorithms on five real-world datasets. The NASCAR [18] and sushi [24] datasets contain multiway partial rankings. The YouTube, GIFGIF and chess datasets² contain pairwise comparisons. Among those, the chess dataset is particular in that it features 45% of ties; in this case we use the extension of the Bradley–Terry model proposed by Rao and Kupper [23]. We preprocess each dataset by discarding items that are not part of the largest strongly connected component in the comparison graph. The number of items n, the number of rankings m, as well as the size of the partial rankings k for each dataset are given in Table 1. Additional details on the experimental setup are given in the supplementary material. We first compare the estimates produced by three approximate ML inference algorithms, LSR, GMM-F and Rank Centrality (RC.) 
Note that RC applies only to pairwise comparisons, and that LSR is the only algorithm able to infer the parameters in the Rao–Kupper model. Also note that in the case of pairwise comparisons, GMM-F and LSR are strictly equivalent. In Table 1, we report the root-mean-square deviation to the ML estimate θ̂ and the running time T of the algorithm.

² See https://archive.ics.uci.edu/ml/machine-learning-databases/00223/, http://www.gif.gf/ and https://www.kaggle.com/c/chess.

Figure 1: Statistical efficiency of different estimators for increasing sizes of partial rankings. As k grows, breaking rankings into pairwise comparisons becomes increasingly inefficient. LSR remains efficient at no additional computational cost.

Table 1: Performance of approximate ML inference algorithms

                                          LSR             GMM-F           RC
Dataset   n       m          k        ERMS    T [s]    ERMS    T [s]    ERMS    T [s]
NASCAR    83      36         43       0.194   0.03     0.751   0.06     —       —
Sushi     100     5 000      10       0.034   0.22     0.130   0.19     —       —
YouTube   16 187  1 128 704  2        0.417   34.18    0.417   34.18    0.432   41.91
GIFGIF    5 503   95 281     2        1.286   1.90     1.286   1.90     1.295   2.84
Chess     6 174   63 421     2        0.420   2.90     —       —        —       —

The smallest value of ERMS is highlighted in bold for each dataset. We observe that in the case of multiway partial rankings, LSR is almost four times more accurate than GMM-F on the datasets considered. In the case of pairwise comparisons, RC is slightly worse than LSR and GMM-F, because the number of comparisons per pair is not homogeneous (see Section 3.3.) The running time of the three algorithms is comparable.
Next, we turn our attention to ML inference and consider three iterative algorithms: I-LSR, MM and Newton-Raphson. For Newton-Raphson, we use an off-the-shelf solver. Each algorithm is initialized with π(0) = [1/n, . . . 
, 1/n]⊤, and convergence is declared when ERMS < 0.01. In Table 2, we report the number of iterations I needed to reach convergence, as well as the total running time T of the algorithm.

Table 2: Performance of iterative ML inference algorithms.

                     I-LSR            MM                  Newton
Dataset   γD         I     T [s]      I       T [s]       I     T [s]
NASCAR    0.832      3     0.08       4       0.10        —     —
Sushi     0.890      2     0.42       4       1.09        3     10.45
YouTube   0.002      12    414.44     8 680   22 443.88   —     —
GIFGIF    0.408      10    22.31      119     55.61       5     72.38
Chess     0.007      15    43.69      181     49.37       3     109.62

The smallest total running time T is highlighted in bold for each dataset. We observe that Newton-Raphson does not always converge, despite the log-likelihood being strictly concave.³ I-LSR consistently outperforms MM and Newton-Raphson in running time. Even if the average running time per iteration is in general larger than that of MM, it needs considerably fewer iterations: For the YouTube dataset, I-LSR yields an increase in speed of over 50 times.

³ On the NASCAR dataset, this has also been noted by Hunter [18]. Computing the Newton step appears to be severely ill-conditioned for many real-world datasets. We believe that it can be addressed by a careful choice

The slow convergence of minorization-maximization algorithms is known [18], yet the scale of the issue and its apparent unpredictability is surprising. In Hunter's MM algorithm, updating a given π_i involves only parameters of items to which i has been compared. Therefore, we speculate that the convergence rate of MM is dependent on the expansion properties of the comparison graph G_D. As an illustration, we consider the sushi dataset. 
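One standard way to quantify such expansion properties is the spectral gap of a simple random walk on the graph, which can be sketched as follows. Note that the convention 1 − λ₂ used here is one of several, so the values it produces need not match the paper's γD column exactly:

```python
import numpy as np

def spectral_gap(adj):
    """Spectral gap 1 - lambda_2 of a simple random walk on an undirected
    graph with 0/1 adjacency matrix `adj`, where lambda_2 is the
    second-largest eigenvalue of the walk's transition matrix."""
    # Row-normalize the adjacency matrix to get transition probabilities.
    P = adj / adj.sum(axis=1, keepdims=True)
    eigvals = np.sort(np.real(np.linalg.eigvals(P)))[::-1]
    return 1.0 - eigvals[1]
```

On a complete graph (a good expander) the gap is large, while on a sparse path graph it is smaller, matching the intuition that denser comparison graphs mix faster.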
To quantify the expansion properties, we look at the spectral gap γD of a simple random walk on G_D; intuitively, the larger the spectral gap is, the better the expansion properties are [12]. The original comparison graph is almost complete, and γD = 0.890. By breaking each 10-way ranking into 5 independent pairwise comparisons, we effectively sparsify the comparison graph. As a result, the spectral gap decreases to 0.784. In Figure 2, we show the convergence rate of MM and I-LSR for the original (k = 10) and modified (k = 2) datasets. We observe that both algorithms display linear convergence; however, the rate at which MM converges appears to be sensitive to the structure of the comparison graph. In contrast, I-LSR is robust to changes in the structure. The spectral gap of each dataset is listed in Table 2.

Figure 2: Convergence rate of I-LSR and MM on the sushi dataset. When partial rankings (k = 10) are broken down into independent comparisons (k = 2), the comparison graph becomes sparser. I-LSR is robust to this change, whereas the convergence rate of MM significantly decreases.

5 Conclusion

In this paper, we develop a stationary-distribution perspective on the maximum-likelihood estimate of Luce's choice model. This perspective explains and unifies several recent spectral algorithms from an ML inference point of view. We present our own spectral algorithm that works on a wider range of data, and show that the resulting estimate significantly outperforms previous approaches in terms of accuracy. We also show that this simple algorithm, with a straightforward adaptation, can produce a sequence of estimates that converges to the ML estimate. On real-world datasets, our ML algorithm is always faster than the state of the art, at times by up to two orders of magnitude.
Beyond statistical and computational performance, we believe that a key strength of our algorithms is that they are simple to implement. 
As an example, our implementation of LSR fits in ten lines of Python code. The most complex operation, finding a stationary distribution, can be readily offloaded to commonly available and highly optimized linear-algebra primitives. As such, we believe that our work is very useful for practitioners.

Acknowledgments

We thank Holly Cogliati-Bauereis, Stratis Ioannidis, Ksenia Konyushkova and Brunella Spinelli for careful proofreading and comments on the text.

of starting point, step size, or by monitoring the numerical stability; however, these modifications are non-trivial and impose an additional burden on the practitioner.

References

[1] D. McFadden. Conditional logit analysis of qualitative choice behavior. In P. Zarembka, editor, Frontiers in Econometrics, pages 105–142. Academic Press, 1973.
[2] L. Thurstone. The method of paired comparisons for social values. Journal of Abnormal and Social Psychology, 21(4):384–400, 1927.
[3] R. A. Bradley and M. E. Terry. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4):324–345, 1952.
[4] R. L. Plackett. The Analysis of Permutations. Journal of the Royal Statistical Society, Series C (Applied Statistics), 24(2):193–202, 1975.
[5] A. Elo. The Rating of Chess Players, Past & Present. Arco, 1978.
[6] T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(2):451–471, 1998.
[7] R. D. Luce. Individual Choice Behavior: A Theoretical Analysis. Wiley, 1959.
[8] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank Aggregation Methods for the Web. In Proceedings of the 10th International Conference on World Wide Web (WWW 2001), Hong Kong, China, 2001.
[9] S. Negahban, S. Oh, and D. Shah.
Iterative Ranking from Pair-wise Comparisons. In Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, CA, 2012.
[10] H. Azari Soufiani, W. Z. Chen, D. C. Parkes, and L. Xia. Generalized Method-of-Moments for Rank Aggregation. In Advances in Neural Information Processing Systems 26 (NIPS 2013), Lake Tahoe, CA, 2013.
[11] B. Hajek, S. Oh, and J. Xu. Minimax-optimal Inference from Partial Rankings. In Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 2014.
[12] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2008.
[13] T. L. Saaty. The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation. McGraw-Hill, 1980.
[14] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford University, 1998.
[15] E. Zermelo. Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29(1):436–460, 1928.
[16] L. R. Ford, Jr. Solution of a Ranking Problem from Binary Comparisons. The American Mathematical Monthly, 64(8):28–33, 1957.
[17] O. Dykstra, Jr. Rank Analysis of Incomplete Block Designs: A Method of Paired Comparisons Employing Unequal Repetitions on Pairs. Biometrics, 16(2):176–188, 1960.
[18] D. R. Hunter. MM algorithms for generalized Bradley–Terry models. The Annals of Statistics, 32(1):384–406, 2004.
[19] R. Kumar, A. Tomkins, S. Vassilvitskii, and E. Vee. Inverting a Steady-State. In Proceedings of the 8th International Conference on Web Search and Data Mining (WSDM 2015), pages 359–368, 2015.
[20] A. Rajkumar and S. Agarwal. A Statistical Convergence Perspective of Algorithms for Rank Aggregation from Pairwise Data.
In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, 2014.
[21] F. Caron and A. Doucet. Efficient Bayesian Inference for Generalized Bradley–Terry Models. Journal of Computational and Graphical Statistics, 21(1):174–196, 2012.
[22] J. Guiver and E. Snelson. Bayesian inference for Plackett–Luce ranking models. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), Montreal, Canada, 2009.
[23] P. V. Rao and L. L. Kupper. Ties in Paired-Comparison Experiments: A Generalization of the Bradley–Terry Model. Journal of the American Statistical Association, 62(317):194–204, 1967.
[24] T. Kamishima and S. Akaho. Efficient Clustering for Orders. In Mining Complex Data, pages 261–279. Springer, 2009.