{"title": "Bayesian nonparametric models for ranked data", "book": "Advances in Neural Information Processing Systems", "page_first": 1520, "page_last": 1528, "abstract": "We develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. We then develop a time-varying extension of our model, and apply our model to the New York Times lists of weekly bestselling books.", "full_text": "Bayesian nonparametric models for ranked data\n\nFran\u00e7ois Caron\n\nINRIA\n\nIMB - University of Bordeaux\n\nTalence, France\n\nFrancois.Caron@inria.fr\n\nYee Whye Teh\n\nDepartment of Statistics\n\nUniversity of Oxford\n\nOxford, United Kingdom\n\ny.w.teh@stats.ox.ac.uk\n\nAbstract\n\nWe develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. We develop a time-varying extension of our model, and apply it to the New York Times lists of weekly bestselling books.\n\n1 Introduction\n\nData in the form of partial rankings, i.e. ordered lists of the top-m items, arise in many contexts. For example, in this paper we consider datasets consisting of the top 20 bestselling books as published each week by the New York Times. The Plackett-Luce model [1, 2] is a popular model for such partial rankings of a finite collection of M items. It has found many applications, including choice modeling [3], sport ranking [4], and voting [5]. [6, Chap. 
9] provides detailed discussions on the statistical foundations of this model.\n\nIn the Plackett-Luce model, each item $k \\in [M] = \\{1, \\ldots, M\\}$ is assigned a positive rating parameter $w_k$, which represents the desirability or rating of a product in the case of choice modeling, or the skill of a player in sport rankings. The Plackett-Luce model assumes the following generative story for a top-m list $\\rho = (\\rho_1, \\ldots, \\rho_m)$ of items $\\rho_i \\in [M]$: At each stage $i = 1, \\ldots, m$, an item is chosen to be the ith item in the list from among the items that have not yet appeared, with the probability that $\\rho_i$ is selected being proportional to its desirability $w_{\\rho_i}$. The overall probability of a given partial ranking $\\rho$ is then:\n\n$$P(\\rho) = \\prod_{i=1}^{m} \\frac{w_{\\rho_i}}{\\left(\\sum_{k=1}^{M} w_k\\right) - \\left(\\sum_{j=1}^{i-1} w_{\\rho_j}\\right)} \\qquad (1)$$\n\nwith the denominator in (1) being the sum over all items not yet selected at stage i.\n\nIn many situations the collection of available items can be very large and potentially unknown. In this case, a nonparametric approach can be sensible, where the pool of items is assumed to be infinite and the model allows for the possibility of items not observed in previous top-m lists to appear in new ones. In this paper we propose such a Bayesian nonparametric Plackett-Luce model. Our approach is built upon recent work on Bayesian inference for the (finite) Plackett-Luce model and its extensions [7, 8, 9]. Our model assumes the existence of an infinite pool of items $\\{X_k\\}_{k=1}^{\\infty}$, each with its own rating parameter, $\\{w_k\\}_{k=1}^{\\infty}$. The probability of a top-m list of items, say $(X_{\\rho_1}, \\ldots, X_{\\rho_m})$, is then a direct extension of the finite case (1):\n\n$$P(X_{\\rho_1}, \\ldots, X_{\\rho_m}) = \\prod_{i=1}^{m} \\frac{w_{\\rho_i}}{\\left(\\sum_{k=1}^{\\infty} w_k\\right) - \\left(\\sum_{j=1}^{i-1} w_{\\rho_j}\\right)} \\qquad (2)$$\n\nTo formalize the framework, a natural representation to encapsulate the pool of items along with their ratings is using an atomic measure:\n\n$$G = \\sum_{k=1}^{\\infty} w_k \\delta_{X_k} \\qquad (3)$$\n\nUsing this representation, note that the top item $X_{\\rho_1}$ in our list is simply a draw from the probability measure obtained by normalizing G, while subsequent items in the top-m list are draws from probability measures obtained by first removing from G the atoms corresponding to previously picked items and normalizing. Described this way, it is clear that the Plackett-Luce model is basically a partial size-biased permutation of the atoms in G [10], and the existing machinery of random measures and exchangeable random partitions [11] can be brought to bear on our problem.\n\nIn particular, in Section 2 we will use a gamma process as the prior over the atomic measure G. This is a completely random measure [12] with gamma marginals, such that the corresponding normalized probability measure is a Dirichlet process. We will show that with the introduction of a suitable set of auxiliary variables, we can characterize the posterior law of G given observations of top-m lists distributed according to (2). A simple Gibbs sampler can then be derived to simulate from the posterior distribution. In Section 3 we develop a time-varying extension of our model and derive a simple and effective Gibbs sampler for posterior simulation. 
In Section 4 we apply our time-varying Bayesian nonparametric Plackett-Luce model to the aforementioned New York Times bestsellers datasets, and conclude in Section 5.\n\n2 A Bayesian nonparametric model for partial ranking\n\nWe start this section by briefly describing a Bayesian approach to inference in finite Plackett-Luce models [9], and taking the infinite limit to arrive at the nonparametric model. This will give good intuitions for how the model operates, before we rederive the same nonparametric model more formally using gamma processes. Throughout this paper we will suppose that our data consists of L partial rankings, with $\\rho_\\ell = (\\rho_{\\ell 1}, \\ldots, \\rho_{\\ell m})$ for $\\ell \\in [L]$. For notational simplicity we assume that all the partial rankings have length m.\n\n2.1 Finite Plackett-Luce model with gamma prior\n\nSuppose we have M choice items, with item $k \\in [M]$ having a positive desirability parameter $w_k$. A partial ranking $\\rho_\\ell = (\\rho_{\\ell 1}, \\ldots, \\rho_{\\ell m})$ can be constructed generatively by picking the ith item $\\rho_{\\ell i}$ at the ith stage for $i = 1, \\ldots, m$, with probability proportional to $w_{\\rho_{\\ell i}}$ as in (1). An alternative Thurstonian interpretation, which will be important in the following, is as follows: For each item k let $z_{\\ell k} \\sim \\mathrm{Exp}(w_k)$ be exponentially distributed with rate $w_k$. Thinking of $z_{\\ell k}$ as the arrival time of item k in a race, let $\\rho_{\\ell i}$ be the index of the ith item to arrive (the ith smallest value among $(z_{\\ell k})_{k=1}^{M}$). The resulting probability of $\\rho_\\ell$ can then be shown to still be (1). In this interpretation $(z_{\\ell k})$ can be understood as latent variables, and the EM algorithm can be applied to derive an algorithm to find a ML parameter setting for $(w_k)_{k=1}^{M}$ given multiple partial rankings. 
Unfortunately the posterior distribution of $(z_{\\ell k})$ given $\\rho_\\ell$ is difficult to compute directly, so we instead consider an alternative parameterization: Let $Z_{\\ell i} = z_{\\rho_{\\ell i}} - z_{\\rho_{\\ell\\, i-1}}$ be the waiting time for the ith item to arrive after the (i-1)th item (with $z_{\\rho_{\\ell 0}}$ defined to be 0). Then it can be shown that the joint probability is:\n\n$$P\\left((\\rho_\\ell)_{\\ell=1}^{L}, (Z_{\\ell i})_{\\ell=1,i=1}^{L,m} \\mid (w_k)_{k=1}^{M}\\right) = \\prod_{\\ell=1}^{L} \\prod_{i=1}^{m} w_{\\rho_{\\ell i}} \\exp\\left(-Z_{\\ell i}\\left(\\sum_{k=1}^{M} w_k - \\sum_{j=1}^{i-1} w_{\\rho_{\\ell j}}\\right)\\right) \\qquad (4)$$\n\nNote that the posterior of $(Z_{\\ell i})_{i=1}^{m}$ is simply factorized with $Z_{\\ell i} \\mid \\rho, w \\sim \\mathrm{Exp}\\left(\\sum_{k=1}^{M} w_k - \\sum_{j=1}^{i-1} w_{\\rho_{\\ell j}}\\right)$, and the ML parameter setting can be easily derived as well. Taking a further step, we note that a factorized gamma prior over $(w_k)$ is conjugate to (4), say $w_k \\sim \\mathrm{Gamma}(\\alpha/M, \\tau)$ with hyperparameters $\\alpha, \\tau > 0$. Now Bayesian inference can be carried out either with a VB EM algorithm, or a Gibbs sampler. In this paper we shall consider only Gibbs sampling algorithms. In this case the parameter updates are of the form\n\n$$w_k \\mid (\\rho_\\ell), (Z_{\\ell i}), (w_{k'})_{k' \\neq k} \\sim \\mathrm{Gamma}\\left(\\frac{\\alpha}{M} + n_k,\\; \\tau + \\sum_{\\ell=1}^{L} \\sum_{i=1}^{m} \\delta_{\\ell i k} Z_{\\ell i}\\right) \\qquad (5)$$\n\nwhere $n_k$ is the number of occurrences of item k among the observed partial rankings, and $\\delta_{\\ell i k} = 0$ if there is a $j < i$ with $\\rho_{\\ell j} = k$ and 1 otherwise. These terms arise by regrouping those in the exponential in (4).\n\nA nonparametric Plackett-Luce model can now be easily derived by taking the limit as the number of choice items $M \\to \\infty$. 
For those items k that have appeared among the observed partial rankings, the limiting conditional distribution (5) is well defined since $n_k > 0$. For items that did not appear in the observations, (5) becomes degenerate at 0. Instead we can define $w_* = \\sum_{k: n_k = 0} w_k$ to be the total desirability among all infinitely many previously unobserved items, and show that\n\n$$w_* \\mid (\\rho_\\ell), (Z_{\\ell i}), (w_k)_{k: n_k > 0} \\sim \\mathrm{Gamma}\\left(\\alpha,\\; \\tau + \\sum_{\\ell=1}^{L} \\sum_{i=1}^{m} Z_{\\ell i}\\right) \\qquad (6)$$\n\nThe Gibbs sampler thus alternates between updating $(Z_{\\ell i})$, and updating the ratings of the observed items $(w_k)_{k: n_k > 0}$ and of the unobserved ones $w_*$. This nonparametric model allows us to estimate the probability of seeing new items appearing in future partial rankings in a consistent manner. While intuitive, this derivation is ad hoc in the sense that it arises as the infinite limit of the Gibbs sampler for finite models, and is unsatisfying as it did not directly capture the structure of the underlying infinite dimensional object, which we will show in the next subsection to be a gamma process.\n\n2.2 A Bayesian nonparametric Plackett-Luce model\n\nLet $\\mathbb{X}$ be a measurable space of choice items. A gamma process is a completely random measure over $\\mathbb{X}$ with gamma marginals. Specifically, it is a random atomic measure of the form (3), such that for each measurable subset A, the (random) mass G(A) is gamma distributed. 
Assuming that G has no fixed atoms (that is, for each element $x \\in \\mathbb{X}$ we have $G(\\{x\\}) = 0$ with probability one) and that the atom locations $\\{X_k\\}$ are independent of their masses $\\{w_k\\}$, it can be shown that such a random measure can be constructed as follows: each $X_k$ is iid according to a base distribution H (which we assume is non-atomic with density h(x)), while the set of masses $\\{w_k\\}$ is distributed according to a Poisson process over $\\mathbb{R}^+$ with intensity $\\lambda(w) = \\alpha w^{-1} e^{-w\\tau}$ where $\\alpha > 0$ is the concentration parameter and $\\tau > 0$ the inverse scale. We write this as $G \\sim \\Gamma(\\alpha, \\tau, H)$. Under this parametrization, we have that $G(A) \\sim \\mathrm{Gamma}(\\alpha H(A), \\tau)$.\n\nEach atom $X_k$ is a choice item, with its mass $w_k > 0$ corresponding to the desirability parameter. The Thurstonian view described in the finite model can be easily extended to the nonparametric one, where a partial ranking $(X_{\\rho_{\\ell 1}}, \\ldots, X_{\\rho_{\\ell m}})$ can be generated as the first m items to arrive in a race. In particular, for each atom $X_k$ let $z_{\\ell k} \\sim \\mathrm{Exp}(w_k)$ be the time of arrival of $X_k$ and $X_{\\rho_{\\ell i}}$ the ith item to arrive. The first m items to arrive $(X_{\\rho_{\\ell 1}}, \\ldots, X_{\\rho_{\\ell m}})$ then constitute our top-m list, with probability as given in (2). Again reparametrizing using inter-arrival durations, let $Z_{\\ell i} = z_{\\rho_{\\ell i}} - z_{\\rho_{\\ell\\, i-1}}$ for $i = 1, 2, \\ldots$ (with $z_{\\rho_{\\ell 0}} = 0$). Then the joint probability is:\n\n$$P\\left((X_{\\rho_{\\ell i}})_{i=1}^{m}, (Z_{\\ell i})_{i=1}^{m} \\mid G\\right) = P\\left((z_{\\rho_{\\ell 1}}, \\ldots, z_{\\rho_{\\ell m}}), \\text{ and } z_{\\ell k} > z_{\\rho_{\\ell m}} \\text{ for all } k \\notin \\{\\rho_{\\ell 1}, \\ldots, \\rho_{\\ell m}\\}\\right)$$\n\n$$= \\left(\\prod_{i=1}^{m} w_{\\rho_{\\ell i}} e^{-w_{\\rho_{\\ell i}} z_{\\rho_{\\ell i}}}\\right) \\left(\\prod_{k \\notin \\{\\rho_{\\ell i}\\}_{i=1}^{m}} e^{-w_k z_{\\rho_{\\ell m}}}\\right) = \\prod_{i=1}^{m} w_{\\rho_{\\ell i}} \\exp\\left(-Z_{\\ell i}\\left(\\sum_{k=1}^{\\infty} w_k - \\sum_{j=1}^{i-1} w_{\\rho_{\\ell j}}\\right)\\right) \\qquad (7)$$\n\nMarginalizing out $(Z_{\\ell i})_{i=1}^{m}$ gives the probability of $(X_{\\rho_{\\ell i}})_{i=1}^{m}$ in (2). Further, conditional on $\\rho_\\ell$ it is seen that the inter-arrival durations $Z_{\\ell 1}, \\ldots, Z_{\\ell m}$ are mutually independent and exponentially distributed:\n\n$$Z_{\\ell i} \\mid (X_{\\rho_{\\ell i}})_{i=1}^{m}, G \\sim \\mathrm{Exp}\\left(\\sum_{k=1}^{\\infty} w_k - \\sum_{j=1}^{i-1} w_{\\rho_{\\ell j}}\\right) \\qquad (8)$$\n\nThe above construction is depicted in Figure 1 (left). On the right we visualize some top-m lists generated from the model, with $\\tau = 1$ and different values of $\\alpha$.\n\nFigure 1: Bayesian nonparametric Plackett-Luce model. Left: G and $U = \\sum_k u_k \\delta_{X_k}$ where $u_k = -\\log(z_k)$. The top-3 ranking is $(\\rho_1, \\rho_2, \\rho_3)$. Right: Visualization of top-5 rankings with rows corresponding to different rankings and columns to items sorted by size biased order. A lighter shade corresponds to a higher rank. Each figure is for a different G, with $\\alpha = .1, 1, 3$.\n\n2.3 Posterior characterization\n\nConsider a number L of partial rankings, with the $\\ell$th list denoted $Y_\\ell = (Y_{\\ell 1}, \\ldots, Y_{\\ell m_\\ell})$, for $\\ell \\in [L]$. While previously our top-m list $(X_{\\rho_1}, \\ldots, X_{\\rho_m})$ consisted of an ordered list of the atoms in G, here G is unobserved and $(Y_{\\ell 1}, \\ldots, Y_{\\ell m_\\ell})$ is simply a list of observed choice items, which is why it is not expressed as an ordered list of atoms in G. The task here is then to characterize the posterior law of G under a gamma process prior and supposing that the observed partial rankings were drawn iid from the nonparametric Plackett-Luce model given G. Re-expressing the conditional distribution (2) of $Y_\\ell$ given G, we have:\n\n$$P(Y_\\ell \\mid G) = \\prod_{i=1}^{m_\\ell} \\frac{G(\\{Y_{\\ell i}\\})}{G(\\mathbb{X} \\setminus \\{Y_{\\ell 1}, \\ldots, Y_{\\ell\\, i-1}\\})} \\qquad (9)$$\n\nAs before, for each $\\ell$, we will also introduce a set of auxiliary variables $Z_\\ell = (Z_{\\ell 1}, \\ldots, Z_{\\ell m_\\ell})$ (the inter-arrival times) that are conditionally mutually independent given G and $Y_\\ell$, with:\n\n$$Z_{\\ell i} \\mid Y_\\ell, G \\sim \\mathrm{Exp}\\left(G(\\mathbb{X} \\setminus \\{Y_{\\ell 1}, \\ldots, Y_{\\ell\\, i-1}\\})\\right) \\qquad (10)$$\n\nThe joint probability of the item lists and auxiliary variables is then (cf. (7)):\n\n$$P\\left((Y_\\ell, Z_\\ell)_{\\ell=1}^{L} \\mid G\\right) = \\prod_{\\ell=1}^{L} \\prod_{i=1}^{m_\\ell} G(\\{Y_{\\ell i}\\}) \\exp\\left(-Z_{\\ell i}\\, G(\\mathbb{X} \\setminus \\{Y_{\\ell 1}, \\ldots, Y_{\\ell\\, i-1}\\})\\right) \\qquad (11)$$\n\nNote that under the generative process described in Section 2.2, there is positive probability that an item appearing in a list $Y_\\ell$ appears in another list $Y_{\\ell'}$ with $\\ell' \\neq \\ell$. Denote the unique items among all L lists by $X^*_1, \\ldots, X^*_K$, and for each $k = 1, \\ldots, K$ let $n_k$ be the number of occurrences of $X^*_k$ among the item lists. Finally define occurrence indicators\n\n$$\\delta_{\\ell i k} = \\begin{cases} 0 & \\text{if } \\exists j < i \\text{ with } Y_{\\ell j} = X^*_k; \\\\ 1 & \\text{otherwise,} \\end{cases} \\qquad (12)$$\n\ni.e. $\\delta_{\\ell i k}$ is the indicator of the occurrence that item $X^*_k$ does not appear at a rank lower than i in the $\\ell$th list. 
Then the joint probability under the nonparametric Plackett-Luce model is:\n\n$$P\\left((Y_\\ell, Z_\\ell)_{\\ell=1}^{L} \\mid G\\right) = \\prod_{k=1}^{K} G(\\{X^*_k\\})^{n_k} \\times \\prod_{\\ell=1}^{L} \\prod_{i=1}^{m_\\ell} \\exp\\left(-Z_{\\ell i}\\, G(\\mathbb{X} \\setminus \\{Y_{\\ell 1}, \\ldots, Y_{\\ell\\, i-1}\\})\\right)$$\n\n$$= \\exp\\left(-G(\\mathbb{X}) \\sum_{\\ell i} Z_{\\ell i}\\right) \\prod_{k=1}^{K} G(\\{X^*_k\\})^{n_k} \\exp\\left(-G(\\{X^*_k\\}) \\sum_{\\ell i} (\\delta_{\\ell i k} - 1) Z_{\\ell i}\\right) \\qquad (13)$$\n\nTaking expectation of (13) with respect to G using the Palm formula gives:\n\nTheorem 1 The marginal probability of the L partial rankings and auxiliary variables is:\n\n$$P\\left((Y_\\ell, Z_\\ell)_{\\ell=1}^{L}\\right) = e^{-\\psi\\left(\\sum_{\\ell i} Z_{\\ell i}\\right)} \\prod_{k=1}^{K} h(X^*_k)\\, \\kappa\\left(n_k, \\sum_{\\ell i} \\delta_{\\ell i k} Z_{\\ell i}\\right) \\qquad (14)$$\n\nwhere $\\psi(z)$ is the Laplace transform of $\\lambda$,\n\n$$\\psi(z) = -\\log E\\left[e^{-zG(\\mathbb{X})}\\right] = \\int_{\\mathbb{R}^+} \\lambda(w)(1 - e^{-zw})\\,dw = \\alpha \\log\\left(1 + \\frac{z}{\\tau}\\right) \\qquad (15)$$\n\nand $\\kappa(n, z)$ is the nth moment of the exponentially tilted L\u00e9vy intensity $\\lambda(w) e^{-zw}$:\n\n$$\\kappa(n, z) = \\int_{\\mathbb{R}^+} \\lambda(w)\\, w^n e^{-zw}\\,dw = \\frac{\\alpha}{(z + \\tau)^n} \\Gamma(n) \\qquad (16)$$\n\nDetails are given in the supplementary material. Another application of the Palm formula now allows us to derive a posterior characterisation of G:\n\nTheorem 2 Given the observations and associated auxiliary variables $(Y_\\ell, Z_\\ell)_{\\ell=1}^{L}$, the posterior law of G is also a gamma process, but with atoms with both fixed and random locations. 
Specifically,\n\n$$G \\mid (Y_\\ell, Z_\\ell)_{\\ell=1}^{L} = G^* + \\sum_{k=1}^{K} w^*_k \\delta_{X^*_k} \\qquad (17)$$\n\nwhere $G^*$ and $w^*_1, \\ldots, w^*_K$ are mutually independent. The law of $G^*$ is still a gamma process,\n\n$$G^* \\mid (Y_\\ell, Z_\\ell)_{\\ell=1}^{L} \\sim \\Gamma(\\alpha, \\tau^*, H), \\qquad \\tau^* = \\tau + \\sum_{\\ell i} Z_{\\ell i} \\qquad (18)$$\n\nwhile the masses have distributions,\n\n$$w^*_k \\mid (Y_\\ell, Z_\\ell)_{\\ell=1}^{L} \\sim \\mathrm{Gamma}\\left(n_k,\\; \\tau + \\sum_{\\ell i} \\delta_{\\ell i k} Z_{\\ell i}\\right) \\qquad (19)$$\n\n2.4 Gibbs sampling\n\nGiven the results of the previous section, a simple Gibbs sampler can now be derived, where all the conditionals are of known analytic form. In particular, we will integrate out all of $G^*$ except for its total mass $w^*_* = G^*(\\mathbb{X})$. This leaves the latent variables to consist of the masses $w^*_*$, $(w^*_k)$ and the auxiliary variables $(Z_{\\ell i})$. The update for $Z_{\\ell i}$ is given by (10), while those for the masses are given in Theorem 2:\n\nGibbs update for $Z_{\\ell i}$: $Z_{\\ell i} \\mid \\text{rest} \\sim \\mathrm{Exp}\\left(w^*_* + \\sum_k \\delta_{\\ell i k} w^*_k\\right)$ (20)\nGibbs update for $w^*_k$: $w^*_k \\mid \\text{rest} \\sim \\mathrm{Gamma}\\left(n_k,\\; \\tau + \\sum_{\\ell i} \\delta_{\\ell i k} Z_{\\ell i}\\right)$ (21)\nGibbs update for $w^*_*$: $w^*_* \\mid \\text{rest} \\sim \\mathrm{Gamma}\\left(\\alpha,\\; \\tau + \\sum_{\\ell i} Z_{\\ell i}\\right)$ (22)\n\nNote that the auxiliary variables are conditionally independent given the masses and vice versa. Updates for the hyperparameters of the gamma process can be derived from the joint distribution in Theorem 1. Since the marginal probability of the partial rankings is invariant to rescaling of the masses, it is sufficient to keep $\\tau$ fixed at 1. 
As for $\\alpha$, if a Gamma(a, b) prior is placed on it, its conditional distribution is still gamma:\n\nGibbs update for $\\alpha$: $\\alpha \\mid \\text{rest} \\sim \\mathrm{Gamma}\\left(a + K,\\; b + \\log\\left(1 + \\frac{\\sum_{\\ell i} Z_{\\ell i}}{\\tau}\\right)\\right)$ (23)\n\nNote that this update was derived with $w^*_*$ marginalized out, so after an update to $\\alpha$ it is necessary to immediately update $w^*_*$ via (22) before proceeding to update other variables.\n\n3 Dynamic Bayesian nonparametric ranking models\n\nIn this section we develop an extension of the Bayesian nonparametric Plackett-Luce model to model time-varying rankings, where the rating parameters of items may change smoothly over time and are reflected in a changing series of rankings. Given a series of times indexed by $t = 1, 2, \\ldots$, we may model the rankings at time t using a gamma process distributed random measure $G_t$ as in Section 2.2, with Markov dependence among the sequence of measures $(G_t)$ enabling dependence among the rankings over time.\n\n3.1 Pitt-Walker dependence model\n\nWe will construct a dependent sequence $(G_t)$ which marginally follows a gamma process $\\Gamma(\\alpha, \\tau, H)$ using the construction of [13]. Suppose $G_t \\sim \\Gamma(\\alpha, \\tau, H)$. Since $G_t$ is atomic, we can write it in the form:\n\n$$G_t = \\sum_{k=1}^{\\infty} w_{tk} \\delta_{X_{tk}} \\qquad (24)$$\n\nDefine a random measure $C_t$ with conditional law:\n\n$$C_t \\mid G_t = \\sum_{k=1}^{\\infty} c_{tk} \\delta_{X_{tk}}, \\qquad c_{tk} \\mid G_t \\sim \\mathrm{Poisson}(\\phi_t w_{tk}) \\qquad (25)$$\n\nwhere $\\phi_t > 0$ is a dependence parameter. Using the same method as in Section 2.3, we can show:\n\nProposition 3 Suppose the law of $G_t$ is $\\Gamma(\\alpha, \\tau, H)$. The conditional law of $G_t$ given $C_t$ is then:\n\n$$G_t = G^*_t + \\sum_{k=1}^{\\infty} w^*_{tk} \\delta_{X_{tk}} \\qquad (26)$$\n\nwhere $G^*_t$ and $(w^*_{tk})_{k=1}^{\\infty}$ are all mutually independent. 
The law of $G^*_t$ is given by a gamma process, while the masses are conditionally gamma,\n\n$$G^*_t \\mid C_t \\sim \\Gamma(\\alpha, \\tau + \\phi_t, H), \\qquad w^*_{tk} \\mid C_t \\sim \\mathrm{Gamma}(c_{tk}, \\tau + \\phi_t) \\qquad (27)$$\n\nThe idea of [13] is to define the conditional law of $G_{t+1}$ given $G_t$ and $C_t$ to coincide with the conditional law of $G_t$ given $C_t$ as in Proposition 3. In other words, define\n\n$$G_{t+1} = G^*_{t+1} + \\sum_{k=1}^{\\infty} w_{t+1,k} \\delta_{X_{tk}} \\qquad (28)$$\n\nwhere $G^*_{t+1} \\sim \\Gamma(\\alpha, \\tau + \\phi_t, H)$ and $w_{t+1,k} \\sim \\mathrm{Gamma}(c_{tk}, \\tau + \\phi_t)$ are mutually independent. If the prior law of $G_t$ is $\\Gamma(\\alpha, \\tau, H)$, the marginal law of $G_{t+1}$ will be $\\Gamma(\\alpha, \\tau, H)$ as well when both $G_t$ and $C_t$ are marginalized out, thus maintaining a form of stationarity. Further, although we have described the process in order of increasing t, the joint law of $G_t$, $C_t$, $G_{t+1}$ can equivalently be described in the reverse order with the same conditional laws as above. Note that if $c_{tk} = 0$, the conditional distribution of $w_{t+1,k}$ will be degenerate at 0. Hence $G_{t+1}$ has an atom at $X_{tk}$ if and only if $C_t$ has an atom at $X_{tk}$, that is, if $c_{tk} > 0$. In addition, it also has atoms (those in $G^*_{t+1}$) where $C_t$ does not (nor does $G_t$). Finally, the parameter $\\phi_t$ can be interpreted as controlling the strength of dependence between $G_{t+1}$ and $G_t$. Indeed it can be shown that\n\n$$E[G_{t+1} \\mid G_t] = \\frac{\\phi_t}{\\phi_t + \\tau} G_t + \\frac{\\tau}{\\phi_t + \\tau} H. \\qquad (29)$$\n\nAnother measure of dependence can be gleaned by examining the \u201clifetime\u201d of an atom. Suppose X is an atom in $G_1$ with mass $w > 0$. The probability that X is an atom in $C_2$ with positive mass is $1 - \\exp(-\\phi_1 w)$, in which case it has positive mass in $G_2$ as well. Conversely, once it is not an atom, it will never be an atom in the future since the base distribution H is non-atomic. 
The lifetime of the atom is then the smallest t such that it is no longer an atom. We can show by induction that (details in supplementary material):\n\nProposition 4 The probability that an atom X in $G_1$ with mass $w > 0$ is dead at time t is given by\n\n$$P(G_t(\\{X\\}) = 0 \\mid w) = \\exp(-y_{t|1} w)$$\n\nwhere $y_{t|1}$ can be obtained by the recurrence $y_{t|t-1} = \\phi_{t-1}$ and $y_{t|s-1} = \\frac{y_{t|s}\\, \\phi_{s-1}}{\\phi_{s-1} + \\tau + y_{t|s}}$.\n\n3.2 Posterior characterization and Gibbs sampling\n\nAssume for simplicity that at each time step $t = 1, \\ldots, T$ we observe one top-m list $Y_t = (Y_{t1}, \\ldots, Y_{tm})$ (it trivially extends to multiple partial rankings of differing sizes). We extend the results of the previous section in characterizing the posterior and developing a Gibbs sampler for the dynamical model.\n\nSince each observed item at time t has to be an atom in its corresponding random measure $G_t$, and atoms in $G_t$ can propagate to neighboring random measures via the Pitt-Walker dependence model, we conclude that the set of all observed items (through all times) has to include all fixed atoms in the posterior of $G_t$. Thus let $X^* = (X^*_k)$, $k = 1, \\ldots, K$ be the set of unique items observed in $Y_1, \\ldots, Y_T$, let $n_{tk} \\in \\{0, 1\\}$ be the number of times the item $X^*_k$ appears at time t, and let $\\rho_t$ be defined as $Y_t = (X^*_{\\rho_{t1}}, \\ldots, X^*_{\\rho_{tm}})$. We write the masses of the fixed atoms as $w_{tk} = G_t(\\{X^*_k\\})$, while the total mass of all other random atoms is denoted $w_{t*} = G_t(\\mathbb{X} \\setminus X^*)$. Note that $w_{tk}$ has to be positive on a random contiguous interval of time that includes all observations of $X^*_k$ (its lifetime) but is zero outside of the interval. We also write $c_{tk} = C_t(\\{X^*_k\\})$ and $c_{t*} = C_t(\\mathbb{X} \\setminus X^*)$. As before, we introduce, for $t = 1, \\ldots, T$ and $i = 1, \\ldots, m$, latent variables\n\n$$Z_{ti} \\sim \\mathrm{Exp}\\left(w_{t*} + \\sum_{k=1}^{K} w_{tk} - \\sum_{j=1}^{i-1} w_{t\\rho_{tj}}\\right) \\qquad (30)$$\n\nFigure 2: Sample path drawn from the Dawson-Watanabe superprocess. Each colour represents an atom, with height being its (varying) mass. Left shows $(G_t)$ and right $(G_t / G_t(\\mathbb{X}))$, a Fleming-Viot process.\n\nEach iteration of the Gibbs sampler then proceeds as follows (details in supplementary material). The latent variables $(Z_{ti})$ are updated as above. Conditioned on the latent variables $(Z_{ti})$, $(c_{tk})$ and $(c_{t*})$, we update the masses $(w_{tk})$, which are independent and gamma distributed since all likelihoods are of gamma form. Note that the total masses $(G_t(\\mathbb{X}))$ are not likelihood identifiable, so we introduce an extra step to improve mixing by sampling them from the prior (integrating out $(c_{tk})$, $(c_{t*})$), scaling all masses along with it. Directly after this step we update $(c_{tk})$, $(c_{t*})$. We update $\\alpha$ along with the random masses $(w_{t*})$ and $(c_{t*})$ efficiently using a forward-backward recursion. Finally, the dependence parameters $(\\phi_t)$ are updated.\n\n3.3 Continuous time formulation using superprocesses\n\nThe dynamic model described in the previous section is formulated for discrete time data. When the time interval between ranking observations is not constant, it is desirable to work with dynamic models evolving over continuous time instead, with the underlying random measures $(G_t)$ defined over all $t \\in \\mathbb{R}$, but with observations at a discrete set of times $t_1 < t_2 < \\cdots$. Here we propose a continuous-time model based on the Dawson-Watanabe superprocess [14, 15] (see also [16, 17, 18, 19]). This is a diffusion on the space of measures with the gamma process $\\Gamma(\\alpha, \\tau, H)$ as its equilibrium distribution. 
It is defined by a generator\n\n$$\\mathcal{L} = \\xi \\left( \\int G(dX) \\frac{\\partial^2}{\\partial G(X)^2} + \\alpha \\int H(dX) \\frac{\\partial}{\\partial G(X)} - \\tau \\int G(dX) \\frac{\\partial}{\\partial G(X)} \\right)$$\n\nwith $\\xi$ parametrizing the rate of evolution. Figure 2 gives a sample path, where we see that it is continuous but non-differentiable. For efficient inference, it is desirable to be able to integrate out all $G_t$'s except those $G_{t_1}, G_{t_2}, \\ldots$ at observation times. An advantage of using the Dawson-Watanabe superprocess is that the conditional distribution of $G_{t_s}$ given $G_{t_{s-1}}$ is remarkably simple [20]. In particular it is simply given by the discrete-time process of the previous section with dependence parameter $\\phi_{t_s | t_{s-1}} = \\frac{\\tau}{e^{\\tau \\xi (t_s - t_{s-1})} - 1}$. Thus the inference algorithm developed previously is directly applicable to the continuous-time model too.\n\n4 Experiments\n\nWe apply the discrete-time dynamic Plackett-Luce model to the New York Times bestsellers data. These consist of the weekly top-20 best-sellers list from June 2008 to April 2012 in various categories. We consider here the categories paperback nonfiction (PN) and hardcover fiction (HF), for which respectively 249 and 916 books appear at least once in the top-20 lists over the 200 weeks. We assume that the correlation parameter $\\phi_t = \\phi$ is constant over time, and assign flat improper priors $p(\\alpha) \\propto 1/\\alpha$ and $p(\\phi) \\propto 1/\\phi$. In order to take into account the publication date of a book, we do not consider books in the likelihood before their first appearance in a list. We run the Gibbs sampler with 10000 burn-in iterations followed by 10000 samples. 
Mean normalized weights for the more popular books in both categories are shown in Figure 3.\n\nThe model is able to estimate the weights associated to each book that appeared at least once, as well as the total weight associated to all other books, i.e. the probability that a new book enters at the first rank in the list, represented by the black curve. Moreover, the Bayesian approach enables us to have a measure of the uncertainty on the weights. The hardcover fiction category is characterized by rapid changes in successive lists, compared to the paperback nonfiction. This is quantified by the estimated values of the parameter $\\phi$, which are respectively 85 \u00b1 20 and 140 \u00b1 40 for PN and HF. The estimated values of the shape parameter $\\alpha$ are 7 \u00b1 1.5 and 2 \u00b1 1 respectively.\n\nFigure 3: Mean normalized weights for paperback nonfiction (left) and hardcover fiction (right). The black lines represent the weight associated to all the books that have not appeared in the top-20 lists.\n\n5 Discussion\n\nWe have proposed a Bayesian nonparametric Plackett-Luce model for ranked data. Our approach is based on the theory of atomic random measures, where we showed that the Plackett-Luce generative model corresponds exactly to a size-biased permutation of the atoms in the random measure. We characterized the posterior distribution, and derived a simple MCMC sampling algorithm for posterior simulation. Our approach can be seen as a multi-stage generalization of posterior inference in normalized random measures [21, 22, 23], and can be easily extended from gamma processes to general completely random measures.\n\nWe also proposed dynamical extensions of our model for both discrete and continuous time data, and applied it to modeling the bestsellers\u2019 lists on the New York Times. Our dynamic extension may be useful for modeling time varying densities or clusterings as well. 
In our experiments we\nfound that our model is insuf\ufb01cient to capture the empirical observation that bestsellers often start\noff high on the lists and tail off afterwards, since our model has continuous sample paths. We\nadjusted for this by simply not including books in the model prior to their publication date. It may\nbe possible to model this better using models with discontinuous sample paths, for example, the\nOrstein-Uhlenbeck approach of [24] where the process evolves via a series of discrete jump events\ninstead of continuously.\n\nAcknowledgements\n\nYWT thanks the Gatsby Charitable Foundation for generous funding.\n\n8\n\nNov2008Mar2009Aug2009Dec2009May2010Oct2010Feb2011Jul2011Nov2011Apr201200.050.10.150.20.250.3EAT, PRAY, LOVETHE AUDACITY OF HOPEMARLEY AND MEDREAMS FROM MY FATHERTHREE CUPS OF TEAI HOPE THEY SERVE BEER IN HELLGLENN BECK\u2019S \u2018COMMON SENSE\u2019THE BLIND SIDETHE LOST CITY OF ZA PATRIOT\u2019S HISTORY OF THE UNITED STATESCONSERVATIVE VICTORYMENNONITE IN A LITTLE BLACK DRESSINSIDE OF A DOGTHE VOWHEAVEN IS FOR REALNormalized weightsDate\fReferences\n[1] R.D. Luce. Individual choice behavior: A theoretical analysis. Wiley, 1959.\n[2] R. Plackett. The analysis of permutations. Applied Statistics, 24:193\u2013202, 1975.\n[3] R.D. Luce. The choice axiom after twenty years.\n\nJournal of Mathematical Psychology,\n\n15:215\u2013233, 1977.\n\n[4] D.R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics,\n\n32:384\u2013406, 2004.\n\n[5] I.C. Gormley and T.B. Murphy. Exploring voting blocs with the Irish electorate: a mixture\n\nmodeling approach. Journal of the American Statistical Association, 103:1014\u20131027, 2008.\n\n[6] P. Diaconis. Group representations in probability and statistics, IMS Lecture Notes, volume 11.\n\nInstitute of Mathematical Statistics, 1988.\n\n[7] I.C. Gormley and T.B. Murphy. A grade of membership model for rank data. Bayesian Analy-\n\nsis, 4:265\u2013296, 2009.\n\n[8] J. 
Guiver and E. Snelson. Bayesian inference for Plackett-Luce ranking models. In International Conference on Machine Learning, 2009.\n[9] F. Caron and A. Doucet. Efficient Bayesian inference for generalized Bradley-Terry models. Journal of Computational and Graphical Statistics, 21(1):174\u2013196, 2012.\n[10] G.P. Patil and C. Taillie. Diversity as a concept and its implications for random communities. Bulletin of the International Statistical Institute, 47:497\u2013515, 1977.\n[11] J. Pitman. Combinatorial stochastic processes. Ecole d\u2019\u00e9t\u00e9 de Probabilit\u00e9s de Saint-Flour XXXII - 2002, volume 1875 of Lecture Notes in Mathematics. Springer, 2006.\n[12] J.F.C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59\u201378, 1967.\n[13] M.K. Pitt and S.G. Walker. Constructing stationary time series models using auxiliary variables with applications. Journal of the American Statistical Association, 100(470):554\u2013564, 2005.\n[14] S. Watanabe. A limit theorem of branching processes and continuous state branching processes. Journal of Mathematics of Kyoto University, 8:141\u2013167, 1968.\n[15] D.A. Dawson. Stochastic evolution equations and related measure processes. Journal of Multivariate Analysis, 5:1\u201352, 1975.\n[16] S.N. Ethier and R.C. Griffiths. The transition function of a measure-valued branching diffusion with immigration. Stochastic Processes. A Festschrift in Honour of Gopinath Kallianpur (S. Cambanis, J. Ghosh, R.L. Karandikar and P.K. Sen, eds.), 71:79, 1993.\n[17] R.H. Mena and S.G. Walker. On a construction of Markov models in continuous time. Metron - International Journal of Statistics, 67(3):303\u2013323, 2009.\n[18] S. Feng. Poisson-Dirichlet Distribution and Related Topics. Springer, 2010.\n[19] J.C. Cox, J.E. Ingersoll Jr, and S.A. Ross.
A theory of the term structure of interest rates. Econometrica: Journal of the Econometric Society, pages 385\u2013407, 1985.\n[20] S.N. Ethier and R.C. Griffiths. The transition function of a measure-valued branching diffusion with immigration. Stochastic Processes, 1993.\n[21] L.F. James, A. Lijoi, and I. Pr\u00fcnster. Posterior analysis for normalized random measures with independent increments. Scandinavian Journal of Statistics, 36(1):76\u201397, 2009.\n[22] J.E. Griffin and S.G. Walker. Posterior simulation of normalized random measure mixtures. Journal of Computational and Graphical Statistics, 20(1):241\u2013259, 2011.\n[23] S. Favaro and Y.W. Teh. MCMC for normalized random measure mixture models. Technical report, University of Turin, 2012.\n[24] J.E. Griffin. The Ornstein-Uhlenbeck Dirichlet process and other time-varying processes for Bayesian nonparametric inference. Journal of Statistical Planning and Inference, 141:3648\u20133664, 2011.", "award": [], "sourceid": 724, "authors": [{"given_name": "Francois", "family_name": "Caron", "institution": null}, {"given_name": "Yee", "family_name": "Teh", "institution": null}]}