{"title": "Combinatorial Bandits with Relative Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 985, "page_last": 995, "abstract": "We consider combinatorial online learning with subset choices when only relative feedback information from subsets is available, instead of bandit or semi-bandit feedback which is absolute. Specifically, we study two regret minimisation problems over subsets of a finite ground set $[n]$, with subset-wise relative preference information feedback according to the Multinomial logit choice model. In the first setting, the learner can play subsets of size bounded by a maximum size and receives top-$m$ rank-ordered feedback, while in the second setting the learner can play subsets of a fixed size $k$ with a full subset ranking observed as feedback. For both settings, we devise instance-dependent and order-optimal regret algorithms with regret $O(\\frac{n}{m} \\ln T)$ and $O(\\frac{n}{k} \\ln T)$, respectively. We derive fundamental limits on the regret performance of online learning with subset-wise preferences, proving the tightness of our regret guarantees. Our results also show the value of eliciting more general top-$m$ rank-ordered feedback over single winner feedback ($m=1$). Our theoretical results are corroborated with empirical evaluations.", "full_text": "Combinatorial Bandits with Relative Feedback\n\nAadirupa Saha\n\nAditya Gopalan\n\nIndian Institute of Science, Bangalore\n\nIndian Institute of Science, Bangalore\n\naadirupa@iisc.ac.in\n\naditya@iisc.ac.in\n\nAbstract\n\nWe consider combinatorial online learning with subset choices when only relative\nfeedback information from subsets is available, instead of bandit or semi-bandit\nfeedback which is absolute. Speci\ufb01cally, we study two regret minimisation prob-\nlems over subsets of a \ufb01nite ground set [n], with subset-wise relative preference\ninformation feedback according to the Multinomial logit choice model. 
In the\n\ufb01rst setting, the learner can play subsets of size bounded by a maximum size and\nreceives top-m rank-ordered feedback, while in the second setting the learner can\nplay subsets of a \ufb01xed size k with a full subset ranking observed as feedback. For\nboth settings, we devise instance-dependent and order-optimal regret algorithms\nwith regret O( n\nk ln T ), respectively. We derive fundamental limits\non the regret performance of online learning with subset-wise preferences, proving\nthe tightness of our regret guarantees. Our results also show the value of eliciting\nmore general top-m rank-ordered feedback over single winner feedback (m = 1).\nOur theoretical results are corroborated with empirical evaluations.\n\nm ln T ) and O( n\n\nIntroduction\n\n1\nOnline learning over subsets with absolute or cardinal utility feedback is well-understood in terms of\nstatistically ef\ufb01cient algorithms for bandits or semi-bandits with large, combinatorial subset action\nspaces [15, 31]. In such settings the learner aims to \ufb01nd the subset with highest value, and upon\ntesting a subset observes either noisy rewards from its constituents or an aggregate reward. In many\nnatural settings, however, information obtained about the utilities of alternatives chosen is inherently\nrelative or ordinal, e.g., recommender systems [25, 34], crowdsourcing [16], multi-player game\nranking [22], market research and social surveys [9, 6, 24], and in other systems where humans are\noften more inclined to express comparative preferences.\nThe framework of dueling bandits [43, 46] represents a promising attempt to model online optimisa-\ntion with pairwise preference feedback. 
However, our understanding of the more general and realistic online learning setting of combinatorial subset choices and subset-wise feedback is relatively less developed than the case of observing absolute, subset-independent reward information.

In this work, we consider a generalisation of the dueling bandit problem where the learner, instead of choosing only two arms, selects a subset of (up to) k ≥ 2 arms in each round. The learner subsequently observes as feedback a rank-ordered list of m ≥ 1 items from the subset, generated probabilistically according to an underlying subset-wise preference model – in this work, the Plackett-Luce distribution on rankings based on the multinomial logit (MNL) choice model [7] – in which each arm has an unknown positive value. Simultaneously, the learner earns as reward the average value of the subset played in the round. The goal of the learner is to play subsets so as to minimise its cumulative regret with respect to the subset with highest value.

Achieving low regret with subset-wise preference feedback is relevant in settings where deviating from choosing an optimal subset of alternatives comes with a cost (driven by considerations like revenue) even during the learning phase, but where the feedback provides purely relative information. For instance, consider a beverage company that experimentally develops several variants of a drink (arms or alternatives), a best-selling subset of which it wants to learn to put up in the open market by trial and error.
Each time a subset of items is put up, in parallel the company elicits relative preference feedback about the subset from, say, a team of expert tasters or through crowdsourcing.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The value of a subset can be modelled as the average value of the items in it, which is, however, not directly observable, it being a function of the open-market response to the offered subset. The challenge thus lies in optimising the subset selection over time by observing only relative preferences (made precise by the notion of Top-k-regret, Section 2.2).

A challenging feature of this problem, with subset plays and relative feedback, is the combinatorially large action and feedback space, much like those in combinatorial bandits [14, 19]. The key question here is whether (and if so, how) structure in the subset choice model – defined compactly by only a few parameters (as many as the number of arms) – can be exploited to give algorithms whose regret does not explode combinatorially. The contributions of this paper are:

(1) We consider the problem of regret minimisation when subsets of items from {1, . . . , n} of size at most k can be played, top-m (m ≤ k) rank-ordered feedback is received according to the MNL model, and the value of a subset is the mean MNL-parameter value of the items in the subset. We propose an upper confidence bound (UCB)-based algorithm, with a new max-min subset-building rule and a lightweight space requirement of tracking $O(n^2)$ pairwise item estimates, and show that it enjoys instance-dependent regret in T rounds of $O(\frac{n}{m}\ln T)$. This is shown to be order-optimal by exhibiting a lower bound of $\Omega(\frac{n}{m}\ln T)$ on the regret of any No-regret algorithm. Our results imply that the optimal regret does not vary with the maximum subset size (k) that can be played, but improves multiplicatively with the length m of the rank-ordered feedback received per round (Sec. 3).

(2) We consider a related regret minimisation setting in which subsets of size exactly k must be played, after which a ranking of the k items is received as feedback, and where the zero-regret subset consists of the k items with the highest MNL-parameter values. In this case, our analysis reveals a fundamental lower bound on regret of $\Omega(\frac{n-k}{k\Delta^{(k)}}\ln T)$, where the problem complexity now depends on the parameter difference between the k-th and (k+1)-th best items of the MNL model. We follow this up with a subset-playing algorithm (Alg. 3) for this problem – a recursive variant of the earlier UCB-based algorithm – with a matching, optimal regret guarantee of $O(\frac{(n-k)\ln T}{k\Delta^{(k)}})$ (Sec. 4).

We also provide extensive numerical evaluations supporting our theoretical findings. Due to space constraints, a discussion of related work appears in the Appendix.

2 Preliminaries and Problem Statement

Notation. We denote by [n] the set {1, 2, . . . , n}. For any subset S ⊆ [n], we let |S| denote the cardinality of S. When there is no confusion about the context, we often represent an (unordered) subset S as a vector (or ordered subset) S of size |S| according to, say, a fixed global ordering of all the items [n]. In this case, S(i) denotes the item (member) at the i-th position in S. For any ordered set S, S(i : j) denotes the set of items from position i to j, i < j, for all i, j ∈ [|S|]. $\Sigma_S = \{\sigma \mid \sigma$ is a permutation over the items of $S\}$, where for any permutation σ ∈ Σ_S, σ(i) denotes the element at the i-th position in σ, i ∈ [|S|]. We also denote by $\Sigma^m_S$ the set of permutations of the m-subsets of S, for any m ∈ [k], i.e. $\Sigma^m_S := \{\Sigma_{S'} \mid S' \subseteq S, |S'| = m\}$. 1(φ) generically denotes an indicator variable that takes the value 1 if the predicate φ is true, and 0 otherwise. Pr(A) denotes the probability of event A, in a probability space that is clear from the context.

Definition 1 (Multinomial logit probability model). A Multinomial logit (MNL) probability model MNL(n, θ), specified by positive parameters (θ_1, . . . , θ_n), is a collection of probability distributions {Pr(·|S) : S ⊆ [n], S ≠ ∅}, where for each non-empty subset S ⊆ [n], $\Pr(i|S) = \frac{\theta_i \mathbf{1}(i \in S)}{\sum_{j \in S} \theta_j}$ for all 1 ≤ i ≤ n. The indices 1, . . . , n are referred to as 'items' or 'arms'.

(i) Best-Item: Given an MNL(n, θ) instance, we define the Best-Item a* ∈ [n] to be the item with the highest MNL parameter, when such a unique item exists: a* := arg max_{i∈[n]} θ_i.

(ii) Top-k Best-Items: Given an instance of MNL(n, θ), we define the Top-k Best-Items S^(k) ⊆ [n] to be the set of k distinct items with the highest MNL parameters, when such a unique set exists, i.e. |S^(k)| = k and θ_i > θ_j for every pair i ∈ S^(k), j ∈ [n] \ S^(k). For this problem, we assume θ_1 ≥ θ_2 ≥ . . . ≥ θ_k > θ_{k+1} ≥ . . . ≥ θ_n, implying S^(k) = [k]. We also denote Δ^(k) = θ_k − θ_{k+1}.

2.1 Feedback models

An online learning algorithm interacts with an MNL(n, θ) probability model over n items as follows. At each round t = 1, 2, . . ., the algorithm plays a subset S_t ⊆ [n] of (distinct) items, with |S_t| ≤ k, upon which it receives stochastic feedback defined as one of:

1. Winner Feedback: The environment returns a single item J drawn independently from the probability distribution Pr(·|S), i.e., $\Pr(J = j|S) = \frac{\theta_j}{\sum_{\ell \in S} \theta_\ell}$ for all j ∈ S.

2. Top-m-ranking Feedback (1 ≤ m ≤ k − 1): The environment returns an ordered list of m items sampled without replacement from the MNL(n, θ) probability model on S. More formally, the environment returns a partial ranking σ ∈ Σ^m_S, drawn from the probability distribution $\Pr(\sigma|S) = \prod_{i=1}^{m} \frac{\theta_{\sigma^{-1}(i)}}{\sum_{j \in S \setminus \sigma^{-1}(1:i-1)} \theta_j}$, σ ∈ Σ^m_S. This can also be seen as picking an item σ^{-1}(1) ∈ S according to Winner Feedback from S, then picking σ^{-1}(2) from S \ {σ^{-1}(1)}, and so on, for m successive draws. To incorporate sets with |S| < m, we set m = min(|S|, m). Clearly this model reduces to Winner Feedback for m = 1, and yields a full rank ordering of the set S when m = |S| − 1.

2.2 Decisions (Subsets) and Regret

We consider two regret minimisation problems, differing in their decision spaces and notions of regret:

(1) Winner-regret: This is motivated by learning to identify the Best-Item a*. At any round t, the learner can play sets of size 1, . . . , k, but is penalised for playing any item other than a*.
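Before turning to the regret definitions, the two feedback models of Section 2.1 are straightforward to simulate. The following is a minimal sketch (function names are ours, not the paper's pseudocode) that evaluates the MNL winner probability and the sequential Plackett-Luce product formula for Top-m-ranking Feedback, and draws a top-m ranking by m successive winner draws without replacement:

```python
import itertools
import random

def winner_prob(theta, S, j):
    """Winner Feedback: Pr(J = j | S) = theta_j / sum_{l in S} theta_l."""
    return theta[j] / sum(theta[l] for l in S)

def topm_prob(theta, S, sigma):
    """Top-m-ranking Feedback: Pr(sigma | S) for an ordered m-tuple sigma,
    via the sequential (Plackett-Luce) product formula."""
    remaining, p = list(S), 1.0
    for item in sigma:
        p *= theta[item] / sum(theta[l] for l in remaining)
        remaining.remove(item)  # sampling is without replacement
    return p

def sample_topm(theta, S, m, rng=random):
    """Draw a top-m ranking: m successive winner draws without replacement."""
    remaining, ranking = list(S), []
    for _ in range(min(m, len(S))):
        weights = [theta[l] for l in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking
```

As a sanity check, the winner probabilities over S sum to one, and the top-m probabilities over all ordered m-tuples of S sum to one, reflecting that the partial ranking is generated by a sequence of winner draws.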
Formally, we define the learner's instantaneous regret at round t as $r^1_t = \sum_{i \in S_t} \frac{(\theta_{a^*} - \theta_i)}{|S_t|}$, and its cumulative regret over T rounds as $R^1_T = \sum_{t=1}^{T} r^1_t = \sum_{t=1}^{T}\sum_{i \in S_t} \frac{(\theta_{a^*} - \theta_i)}{|S_t|}$.

The learner aims to play sets S_t so as to keep the regret as low as possible, i.e., to play only the singleton set S_t = {a*} over time, as that is the only set with 0 regret. The instantaneous Winner-regret can be interpreted as a shortfall in value of the played set S_t with respect to {a*}, where the value of a set S is simply the mean parameter value $\frac{\sum_{i \in S}\theta_i}{|S|}$ of its items.

Remark 1. Assuming θ_{a*} = 1 (we can do this without loss of generality since the MNL model is positive scale invariant, see Defn. 1), it is easy to note that for any item i ∈ [n] \ {a*}, $p_{a^*i} := \Pr(a^*|\{a^*, i\}) = \frac{\theta_{a^*}}{\theta_{a^*} + \theta_i} \ge \frac{1}{2} + \frac{\theta_{a^*} - \theta_i}{4}$ (as θ_i < θ_{a*} for all i). Consequently, the Winner-regret as defined above can be further bounded above (up to constant factors) by $\tilde{R}^1_T = \sum_{t=1}^{T}\sum_{i \in S_t} \frac{(p_{a^*i} - \frac{1}{2})}{|S_t|}$, which, for k = 2, is the standard dueling bandit regret [45, 47, 42].

Remark 2. An alternative notion of instantaneous regret is the shortfall in the preference probability of the best item a* in the selected set S_t, i.e., $\tilde{r}^1_t = \sum_{i \in S_t}\big(\Pr(a^*|S_t \cup \{a^*\}) - \Pr(i|S_t \cup \{a^*\})\big)$. However, if all the MNL parameters are bounded, i.e., θ_i ∈ [a, b] for all i ∈ [n], then $\frac{1}{b}\sum_{i \in S_t}\frac{(\theta_{a^*} - \theta_i)}{|S_t|+1} \le \tilde{r}^1_t \le \frac{1}{a}\sum_{i \in S_t}\frac{(\theta_{a^*} - \theta_i)}{|S_t|+1}$, implying that these two notions of regret, $r^1_t$ and $\tilde{r}^1_t$, are only constant factors apart.

(2) Top-k-regret: This setting is motivated by learning to identify the set of Top-k Best-Items S^(k) of the MNL(n, θ) model. Correspondingly, we assume that the learner plays sets of k distinct items at each round t ∈ [T]. The instantaneous regret of the learner in the t-th round is defined to be $r^k_t = \frac{1}{k}\big(\theta_{S^{(k)}} - \sum_{i \in S_t}\theta_i\big)$, where $\theta_{S^{(k)}} = \sum_{i \in S^{(k)}}\theta_i$. Consequently, the cumulative regret of the learner at the end of round T becomes $R^k_T = \sum_{t=1}^{T} r^k_t = \sum_{t=1}^{T}\frac{1}{k}\big(\theta_{S^{(k)}} - \sum_{i \in S_t}\theta_i\big)$. As with the Winner-regret, the Top-k-regret also admits a natural interpretation as the shortfall in value of the set S_t with respect to the set S^(k), with the value of a set being the mean θ parameter of its arms.

3 Minimising Winner-regret

We first consider the problem of minimising Winner-regret.
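For concreteness, the two regret notions of Section 2.2 can be sketched in a few lines; the names below are illustrative (not the paper's), and the benchmarks a* and S^(k) are computed from the true parameters θ:

```python
def winner_regret(theta, S_t, a_star):
    """Instantaneous Winner-regret r^1_t = sum_{i in S_t} (theta_{a*} - theta_i) / |S_t|."""
    return sum(theta[a_star] - theta[i] for i in S_t) / len(S_t)

def topk_regret(theta, S_t):
    """Instantaneous Top-k-regret r^k_t = (theta_{S^(k)} - sum_{i in S_t} theta_i) / k,
    where S^(k) collects the k largest parameters and k = |S_t|."""
    k = len(S_t)
    best = sum(sorted(theta, reverse=True)[:k])  # theta_{S^(k)}
    return (best - sum(theta[i] for i in S_t)) / k
```

Both quantities are zero exactly when the learner plays {a*} (respectively S^(k)), matching the value-shortfall interpretation above.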
We start by analysing a regret lower bound for the problem, followed by designing an optimal algorithm with a matching upper bound.

3.1 Fundamental lower bound on Winner-regret

Along the lines of [32], we define the following consistency property of any reasonable online learning algorithm, in order to state a fundamental lower bound on regret performance.

Definition 2 (No-regret algorithm). An online learning algorithm A is defined to be a No-regret algorithm for Winner-regret if, for each problem instance MNL(n, θ), the expected number of times A plays any suboptimal set S ⊆ [n] is sublinear in T, i.e., for all S ≠ {a*}: $\mathbf{E}_\theta[N_S(T)] = o(T^\alpha)$ for some α ∈ [0, 1] (potentially depending on θ), where $N_S(T) := \sum_{t=1}^{T}\mathbf{1}(S_t = S)$ is the number of plays of set S in T rounds, and E_θ[·] denotes expectation under the algorithm and the MNL(n, θ) model.

Theorem 3 (Winner-regret Lower Bound). For any No-regret learning algorithm A for Winner-regret that uses Winner Feedback, and for any problem instance MNL(n, θ) such that a* = arg max_{i∈[n]} θ_i, the expected regret incurred by A satisfies $\liminf_{T\to\infty}\mathbf{E}_\theta\big[\frac{R^1_T(A)}{\ln T}\big] \ge \theta_{a^*}\big(\min_{i\in[n]\setminus\{a^*\}}\frac{\theta_{a^*}}{\theta_i} - 1\big)^{-1}(n-1)$.

Note: This is a problem-dependent lower bound, with $\theta_{a^*}\big(\min_{i\in[n]\setminus\{a^*\}}\frac{\theta_{a^*}}{\theta_i} - 1\big)^{-1}$ denoting a complexity or hardness term ('gap') for regret performance under any 'reasonable' learning algorithm.

Remark 3. The result suggests that the regret rate with only Winner Feedback cannot improve with k, uniformly across all problem instances.
Rather strikingly, there is no reduction in hardness (measured in terms of regret rate) in learning the Best-Item using Winner Feedback from large (k-size) subsets as compared to using pairwise (dueling) feedback (k = 2). It could be tempting to expect an improved learning rate with subset-wise feedback, as the number of items being tested per iteration is larger (k ≥ 2), so information-theoretically one may expect to 'learn more' about the underlying model per subset query. On the contrary, it turns out that it is intuitively 'harder' for a good (i.e., near-optimal) item to prove its competitiveness in just a single winner draw against a large population of its k − 1 other competitors, as compared to winning over just a single competitor in the k = 2 case.

Proof sketch. The proof is based on the change-of-measure technique for bandit regret lower bounds presented by, e.g., Garivier et al. [21], which uses the information divergence between two nearby instances MNL(n, θ) (the original instance) and MNL(n, θ') (an alternative instance) to quantify the hardness of learning the best arm in either environment. In our case, each bandit instance corresponds to an instance of the MNL(n, θ) problem with the arm set containing all subsets of [n] of size up to k: A = {S ⊆ [n] | |S| ∈ [k]}. The key of the proof lies in carefully crafting a true instance, with optimal arm a* = 1, and a family of 'slightly perturbed' alternative instances {ν_a : a ≠ 1}, each with optimal arm a ≠ 1, chosen as: (1) True instance MNL(n, θ¹): $\theta^1_1 = \theta + \Lambda > \theta^1_2 = \cdots = \theta^1_n = \theta$ (for some θ ∈ ℝ_+, Λ > 0). (2) Altered instances: for each suboptimal item a ∈ [n] \ {1}, MNL(n, θ^a): $\theta^a_a = \theta^1_1 + \epsilon = \theta + (\Lambda + \epsilon)$; $\theta^a_i = \theta^1_i$ for all i ∈ [n] \ {a}, for some ε > 0. The result of Thm. 3 now follows by applying Lemma 13 on pairs of problem instances (ν, ν'(a)) with suitable upper bounds on the divergences. (Complete proof given in Appendix C.3.) □

Note: We also show an alternate version of the regret lower bound, of order $\Omega\big(\frac{\ln T}{\min_{i\in[n]\setminus\{a^*\}}(p_{a^*i} - 0.5)}\big)$, in terms of pairwise preference-based instance complexities (details are moved to Appendix C.4).

Improved regret lower bound with Top-m-ranking Feedback. In contrast to the situation with only winner feedback, the following (more general) result shows a reduced lower bound when Top-m-ranking Feedback is available in each play of a subset, opening up the possibility of improved learning (regret) performance when ranked-order feedback is available.

Theorem 4 (Regret Lower Bound: Winner-regret with Top-m-ranking Feedback). For any No-regret algorithm A for the Winner-regret problem with Top-m-ranking Feedback, there exists a problem instance MNL(n, θ) such that the expected Winner-regret incurred by A satisfies $\liminf_{T\to\infty}\mathbf{E}_\theta\big[\frac{R^1_T(A)}{\ln T}\big] \ge \theta_{a^*}\big(\min_{i\in[n]\setminus\{a^*\}}\frac{\theta_{a^*}}{\theta_i} - 1\big)^{-1}\frac{(n-1)}{m}$, where, as in Thm. 3, E_θ[·] denotes expectation under the algorithm and the MNL model MNL(n, θ), and recall a* := arg max_{i∈[n]} θ_i.

Proof sketch. The main observation made here is that the KL divergences for Top-m-ranking Feedback are m times those for Winner Feedback, which we show using the chain rule for KL divergences [20]: $KL(p^1_S, p^a_S) = KL\big(p^1_S(\sigma_1), p^a_S(\sigma_1)\big) + \sum_{i=2}^{m} KL\big(p^1_S(\sigma_i \mid \sigma(1:i-1)),\, p^a_S(\sigma_i \mid \sigma(1:i-1))\big)$, where σ_i = σ(i) and $KL(P(Y|X), Q(Y|X)) := \sum_x \Pr(X = x)\big[KL(P(Y|X=x), Q(Y|X=x))\big]$ denotes the conditional KL-divergence. Using this, along with the upper bound on the KL divergences for Winner Feedback (derived for Thm. 3), we show that $KL(p^1_S, p^a_S) \le \frac{m\Delta'^2_a}{\theta_S(\theta^1_1 + \epsilon)}$ for all a ∈ [n] \ {1} (where $\theta_S = \sum_{i\in S}\theta_i$ and $\Delta'_a = O(\theta_{a^*} - \theta_a)$), which precisely gives the 1/m-factor reduction in the lower bound compared to Winner Feedback. The bound can now be derived following a technique similar to that used for Thm. 3 (details in Appendix C.5). □

Remark 4. Thm. 4 shows an $\Omega\big(\frac{n\ln T}{m}\big)$ lower bound on regret, containing the instance-dependent constant term $\theta_{a^*}\big(\min_{i\in[n]\setminus\{a^*\}}\frac{\theta_{a^*}}{\theta_i} - 1\big)^{-1}$, which exposes the hardness of the regret minimisation problem in terms of the 'gap' between the best item a* and the second best: $(\theta_{a^*} - \max_{i\in[n]\setminus\{a^*\}}\theta_i)$. The 1/m-factor improvement in learning rate with Top-m-ranking Feedback can be intuitively interpreted as follows: revealing an m-ranking of a k-set is worth about $\ln\big(\binom{k}{m}m!\big) = O(m\ln k)$ bits of information, which is about m times as much as revealing a single winner.

3.2 An order-optimal algorithm for Winner-regret

We here show that the above fundamental lower bounds on Winner-regret are, in fact, achievable with carefully designed online learning algorithms.
We design an upper confidence bound (UCB)-based algorithm for Winner-regret with Top-m-ranking Feedback based on the following ideas:

(1) Playing sets of only two sizes: It is enough for the algorithm to play subsets of size either (m + 1) (to fully exploit the Top-m-ranking Feedback) or 1 (singleton sets), and not to play a singleton unless there is a high degree of confidence that the single item is the best item.

(2) Parameter estimation from pairwise preferences: It is possible to play the subset-wise game just by maintaining pairwise preference estimates of all n items of the MNL(n, θ) model, using the idea of Rank-Breaking: extracting pairwise comparisons from (partial) rankings and applying estimators on the obtained pairs, treating each comparison independently over the received subset-wise feedback. This is possible owing to the independence of irrelevant alternatives (IIA) property of the MNL model (Defn. 10).

(3) A new UCB-based max-min set-building rule for playing large sets (build_S): The main novelty of MaxMin-UCB lies in its underlying set-building subroutine (Alg. 2, Appendix C.1), which constructs S_t by applying a recursive max-min strategy on the UCB estimates of the empirical pairwise preferences.

Algorithm description. MaxMin-UCB maintains a pairwise preference matrix P̂ ∈ [0, 1]^{n×n}, whose (i, j)-th entry p̂_ij records the empirical probability of i having beaten j in a pairwise duel, and a corresponding upper confidence bound u_ij for each pair (i, j). At any round t, it plays a subset S_t ⊆ [n], |S_t| ∈ [k], using the max-min set-building rule build_S (see Alg. 2), receives Top-m-ranking Feedback σ_t ∈ Σ^m_{S_t} from S_t, and updates the p̂_ij entries of pairs in S_t by applying Rank-Breaking (Line 10).
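The Rank-Breaking update and the optimistic pairwise estimates can be sketched as follows. This is a minimal illustration, not the paper's exact rule: each item ranked by σ_t is credited with a win over every item it outranks in S_t, each extracted pair is treated as an independent duel, and the exploration bonus shown is a generic UCB of the same flavour (the exact confidence width in Alg. 1 depends on α as in Thm. 5):

```python
import math

def rank_break_update(wins, S_t, sigma):
    """Rank-Breaking: from top-m feedback sigma on played set S_t, each ranked
    item beats every item ranked after it and every unranked item of S_t;
    each extracted pair is counted as one independent pairwise duel."""
    ranked = list(sigma)
    losers = [j for j in S_t if j not in ranked]
    for pos, i in enumerate(ranked):
        for j in ranked[pos + 1:] + losers:
            wins[(i, j)] = wins.get((i, j), 0) + 1

def ucb(wins, i, j, t, alpha=2.0):
    """Optimistic estimate u_ij of p_ij = Pr(i beats j): the empirical mean
    p_hat_ij plus a bonus shrinking with the number n_ij of (i, j) duels.
    Unseen pairs default to the optimistic value 1."""
    n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
    if n_ij == 0:
        return 1.0
    p_hat = wins.get((i, j), 0) / n_ij
    return p_hat + math.sqrt(alpha * math.log(t) / n_ij)
```

For example, top-2 feedback σ = (2, 0) on the played set S_t = {0, 1, 2} breaks into the three duels 2 ≻ 0, 2 ≻ 1 and 0 ≻ 1, which is exactly the pairwise information the IIA property licenses us to reuse.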
The set-building rule build_S is at the heart of MaxMin-UCB: it builds the subset S_t from a set of potential Condorcet winners (C_t) of round t, by recursively picking the strongest opponents of the already selected items using a max-min selection strategy on the u_ij. The complete algorithm is presented in Alg. 1, Appendix C.1.

The following result establishes that MaxMin-UCB enjoys $O(\frac{n}{m}\ln T)$ regret with high probability.

Theorem 5 (MaxMin-UCB: High Probability Regret bound). Fix a time horizon T, δ ∈ (0, 1) and α > 1/2. With probability at least (1 − δ), the regret of MaxMin-UCB for Winner-regret with Top-m-ranking Feedback satisfies $R^1_T \le \Big(\big[\frac{2\alpha n^2}{(2\alpha-1)\delta}\big]^{\frac{1}{2\alpha-1}} + 2D\ln(2D)\Big)\hat{\Delta}_{\max} + \frac{\ln T}{m+1}\sum_{i=2}^{n}\big(D_{\max}\hat{\Delta}_i\big)$, where for all i ∈ [n] \ {a*}: $\hat{\Delta}_i = (\theta_{a^*} - \theta_i)$, $\Delta_i = \frac{\theta_{a^*} - \theta_i}{2(\theta_{a^*} + \theta_i)}$, $\hat{\Delta}_{\max} = \max_{i\in[n]\setminus\{a^*\}}\hat{\Delta}_i$, $D_{1i} = \frac{4\alpha}{\Delta_i^2}$, $D := \sum_{i<j} D_{ij}$, and $D_{\max} = \max_{i\in[n]\setminus\{a^*\}} D_{1i}$.

Proof sketch. The proof hinges on analysing the entire run of MaxMin-UCB by breaking it up into three phases: (1) Random-Exploration, (2) Progress, and (3) Saturation.

(1) Random-Exploration: This phase runs from time 1 to $f(\delta) = \big[\frac{2\alpha n^2}{(2\alpha-1)\delta}\big]^{\frac{1}{2\alpha-1}}$, for any δ ∈ (0, 1), such that for any t > f(δ) the upper confidence bounds u_ij are guaranteed to be correct for the true values p_ij for all pairs (i, j) ∈ [n] × [n] (i.e. p_ij ≤ u_ij), with high probability (1 − δ).

(2) Progress: After t > f(δ), the algorithm can be viewed as starting to explore the 'confusing items' appearing in C_t as potential candidates for the Best-Item a*, and trying to capture a* in a holding set B_t.
At any time, the set Bt is either empty or a singleton by construction, and once\n\n(2\u03b1\u22121)\u03b4\n\n(cid:104) 2\u03b1n2\n\n(cid:105) 1\n\n\u22062\ni\n\n5\n\n\fa\u2217 \u2208 Bt it stays their forever (with high probability). The Progress phase ensures that the algorithm\nexplores fast enough so that within a constant number of rounds, Bt captures {a\u2217}.\n(3). Saturation: This is the last phase from time T0(\u03b4)+1 to T . As the name suggests, MaxMin-UCB\nshows relatively stable behavior here, mostly playing St = {a\u2217} and incurring almost no regret.\nAlthough Thm. 5 shows a (1 \u2212 \u03b4)-high probability regret bound for MaxMin-UCB it is important\nto note that the algorithm itself does not require to take the probability of failure (\u03b4) as input. As a\nconsequence, by simply integrating the bound obtained in Thm. 5 over the entire range of \u03b4 \u2208 [0, 1],\nwe get an expected regret bound of MaxMin-UCB for Winner-regret with Top-m-ranking Feedback:\nTheorem 6. The expected regret of MaxMin-UCB for Winner-regret with Top-m-ranking Feedback\n\nis: E[R1\n\nT ] \u2264\n\n2\u03b1\u22121 2\u03b1\u22121\n\n\u03b1\u22121 + 2D ln 2D\n\n\u02c6\u2206max + ln T\nm+1\n\ni=2(Dmax \u02c6\u2206i), in T rounds.\n\n(cid:80)n\n\n2\n\n(cid:32)\n(cid:104) 2\u03b1n2\n(cid:32)\n(cid:104) 2\u03b1n2\n\n(2\u03b1\u22121)\n\n(cid:105) 1\n(cid:105) 1\n\n2\n\n(2\u03b1\u22121)\n\n(cid:17)\n\n(cid:16) n ln T\n(cid:17)\n(cid:16) n ln T\n\nRemark 5. This is an upper bound on expected regret of the same order as that in the lower bound\nof Thm. 3, which shows that the algorithm is essentially regret-optimal. From Thm. 
6, note that the\n\n\ufb01rst two terms\n\n2\u03b1\u22121 2\u03b1\u22121\n\n\u03b1\u22121 + 2D ln 2D\n\n\u02c6\u2206max of E[R1\n\nT ] are essentially instance speci\ufb01c\n\nk\u2206(k)\n\nwhich is in fact optimal in\n\nconstants, its only the third term which makes expected regret O\n\n(cid:16)\n\n64\u03b1(\u03b8a\u2217\u2212\u03b8i)(\u03b8a\u2217 )\n\nm\nlower bound of Thm. 4). More-\n\u2264\n, also brings out the inverse dependency on the\n\n(\u03b8a\u2217\u2212maxj\u2208[n]\\{a\u2217} \u03b8j )2\n\nterms of its dependencies on n and T (since it matches the \u2126\nover the problem dependent complexity terms (Dmax \u02c6\u2206i) = 16\u03b1(\u03b8a\u2217\u2212\u03b8i)(\u03b8a\u2217 +maxj\u2208[n]\\{a\u2217} \u03b8j )2\n(\u03b8a\u2217\u2212maxj\u2208[n]\\{a\u2217} \u03b8j )2 = O\n\u2018gap-term\u2019 (\u03b8a\u2217 \u2212 maxj\u2208[n]\\{a\u2217} \u03b8j) as discussed in Rem. 4.\n4 Minimising Top-k-regret\nIn this section, we study the problem of minimising Top-k-regret with Top-k-ranking Feedback.\n\nAs before, we \ufb01rst derive a regret lower bound, for this learning setting, of the form \u2126(cid:0) n\u2212k\n\n(\u03b8a\u2217\u2212maxj\u2208[n]\\{a\u2217} \u03b8j )\n\nln T(cid:1)\n\n(cid:17)\n\n\u03b8a\u2217\n\nm\n\nT\u2192\u221e E\u03b8\n\nT (A)\nln T\n\n(recall \u2206(k) from Sec. 2).We next propose an UCB based algorithm (Alg. 3) for the same, along with\na matching upper bound regret analysis (Thm. 8,9) showing optimality of our proposed algorithm.\n4.1 Regret lower bound for Top-k-regret with Top-k-ranking Feedback\nTheorem 7 (Regret Lower Bound: Top-k-regret with Top-m-ranking Feedback). 
For any No-regret\nlearning algorithm A for Top-k-regret that uses Top-k-ranking Feedback, and for any problem\ninstance MNL(n, \u03b8), the expected regret incurred by A when run on it satis\ufb01es lim inf\n\n(cid:105) \u2265\n\n(cid:104) Rk\n\nk\n\n\u03b81\u03b8k+1\n\n(n\u2212k)\n\n, where E\u03b8[\u00b7] denotes expectation under the algorithm and MNL(n, \u03b8) model.\n\n1 = \u03b81\n\nn = \u03b8 + \u0001; \u03b81\n\n2 = . . . = \u03b81\n\nk\u22121 = \u03b8 + 2\u0001; \u03b81\n\n\u2206(k)\nProof sketch. Similar to 4, the proof again relies on carefully constructing a true instance, with\noptimal set of Top-k Best-Items S(k) = [k], and a family of slightly perturbed alternative in-\nstances {\u03bda : a \u2208 [n] \\ S(k)}, for each suboptimal arm a \u2208 [n] \\ S(k)}, which we design as:\n(1). True Instance: MNL(n, \u03b81) : \u03b81\nk+1 =\nn\u22121 = \u03b8, for some \u03b8 \u2208 R+ and \u0001 > 0. Clearly Top-k Best-Items of MNL(n, \u03b81) is\n\u03b81\nk+2 = . . . \u03b81\nS(k)[1] = [k \u2212 1] \u222a {n}. (2). Altered Instances: For every n \u2212 k suboptimal items a /\u2208 S(k)[1],\nnow consider an altered instance Instance a, denoted by MNL(n, \u03b8a), such that \u03b8a\ni =\ni , \u2200i \u2208 [n]\\{a}. The result of Thm. 7 now can be obtained by following an exactly same procedure\n\u03b81\n(cid:3)\nas described for the proof of Thm. 4. The complete details is given in Appendix D.1.\nRemark 6. The regret lower bound of Thm. 7 is \u2126( (n\u2212k) ln T\n), with an instance-dependent term\n(\u03b8k\u2212\u03b8k+1) which shows for recovering the Top-k Best-Items, the problem complexity is governed by\nthe \u2018gap\u2019 between the kth and (k + 1)th best item \u2206(k) = (\u03b8k \u2212 \u03b8k+1), as consistent with intuition.\n4.2 An order-optimal algorithm with low Top-k-regret with Top-k-ranking Feedback\nMain idea: A recursive set-building rule: As with the MaxMin-UCB algorithm (Alg. 
1), we maintain pairwise UCB estimates $u_{ij}$ of the empirical pairwise preferences $\hat p_{ij}$ via Rank-Breaking. However, the main difference here lies in the set-building rule, as here the algorithm is required to play sets of size exactly $k$. The core idea is to recursively try to capture the set of Top-k Best-Items in an ordered set $B_t$ and, once the set is believed to be found with confidence (formally, $|B_t| = k$), to keep playing $B_t$ unless some other potentially good item emerges, which is then played in place of the weakest element ($B_t(k)$) of $B_t$. The algorithm is described in Alg. 3, Appendix D.2.

Theorem 8 (Rec-MaxMin-UCB: High Probability Regret bound). Given a fixed time horizon $T$ and $\delta \in (0,1)$, with high probability at least $(1-\delta)$, the regret incurred by Rec-MaxMin-UCB for Top-k-regret admits the bound
$$R^k_T \;\le\; \Big(\Big[\frac{2\alpha n^2}{(2\alpha-1)\delta}\Big]^{\frac{1}{2\alpha-1}} + 2\bar D^{(k)}\ln\big(2\bar D^{(k)}\big)\Big)\Delta'_{\max} \;+\; \frac{4\alpha\ln T}{k\hat D^2}\Big(\sum_{b=k+1}^{n}(\theta_k-\theta_b)\Big),$$
where $\bar D^{(k)}$ is an instance-dependent constant (see Lem. 26, Appendix), $\Delta'_{\max} = \frac{1}{k}\big(\sum_{i=1}^{k}\theta_i - \sum_{i=n-k+1}^{n}\theta_i\big)$, and $\hat D = \min_{g\in[k-1]}(p_{kg}-p_{bg})$.

Proof sketch. Similar to Thm. 5, we prove the above bound by dividing the entire run of algorithm Rec-MaxMin-UCB into three phases and applying a recursive argument:
(1). Random-Exploration: Same as Thm.
5, in this case also this phase runs from time $1$ to $f(\delta) = \big[\frac{2\alpha n^2}{(2\alpha-1)\delta}\big]^{\frac{1}{2\alpha-1}}$, for any $\delta \in (0,1)$, after which, for any $t > f(\delta)$, one can guarantee $p_{ij} \le u_{ij}$ for all pairs $(i,j) \in [n]\times[n]$, with high probability at least $(1-\delta)$ (Lem. 15).
(2). Progress: The analysis of this phase is quite different from that of Thm. 5. After $t > f(\delta)$, the algorithm starts exploring the items in the set of Top-k Best-Items in a recursive manner: it first tries to capture (one of) the Best-Items in $B_t(1)$. Once that slot is secured, it goes on to search for the second Best-Item from the remaining pool of items and tries to capture it in $B_t(2)$, and so on up to $B_t(k)$. By definition, the phase ends at, say, $t = T_0(\delta)$, when $B_t = S^{(k)}$. Moreover, the update rule of Rec-MaxMin-UCB (along with Lem. 15) ensures that $B_t = S^{(k)}$ for all $t > T_0(\delta)$. The novelty of our analysis lies in showing that $T_0(\delta)$ is bounded by an instance-dependent complexity term which does not scale with $t$ (Lem. 26), and hence the regret incurred in this phase is also constant.
(3). Saturation: In the last phase, from time $T_0(\delta)+1$ to $T$, Rec-MaxMin-UCB has already captured $S^{(k)}$ in $B_t$, and $B_t = S^{(k)}$ henceforth. Hence the algorithm mostly plays $S_t = S^{(k)}$ without incurring any regret. Only if some item outside $B_t$ enters the list of potential Top-k Best-Items does it take the very conservative approach of replacing the `weakest' element of $B_t$ by that item, so as to determine whether it indeed lies inside or outside $S^{(k)}$. However, we are able to show that any such suboptimal item $i \notin S^{(k)}$ cannot occur more than $O\big(\frac{\ln T}{\hat D^2}\big)$ times (Lem. 27); combining this over all $[n]\setminus[k]$ suboptimal items finally leads to the desired regret. The complete details are moved to Appendix D.3. □
From Theorem 8, we can also derive an expected regret bound for Rec-MaxMin-UCB in $T$ rounds:
Theorem 9.
The expected regret incurred by Rec-MaxMin-UCB for Top-k-regret satisfies
$$E[R^k_T] \;\le\; \Big(\Big[\frac{2\alpha n^2}{2\alpha-1}\Big]^{\frac{1}{2\alpha-1}}\frac{2\alpha-1}{\alpha-1} + 2\bar D^{(k)}\ln\big(2\bar D^{(k)}\big)\Big)\Delta'_{\max} \;+\; \frac{4\alpha\ln T}{k\hat D^2}\Big(\sum_{b=k+1}^{n}(\theta_k-\theta_b)\Big).$$

Remark 7. In Thm. 9, the first two terms of $E[R^k_T]$ (those scaled by $\Delta'_{\max}$) are just MNL$(n,\theta)$ model-dependent constants which do not contribute to the learning rate of Rec-MaxMin-UCB, and the third term is $O\big(\frac{(n-k)\ln T}{k\hat D^2}\big)$, which varies optimally in terms of its dependence on $n$, $k$ and $T$ (matching the $\Omega\big(\frac{(n-k)\ln T}{\Delta^{(k)}}\big)$ lower bound of Thm. 7). Also, Rem. 6 indicates an inverse dependency on the `gap-complexity' $(\theta_k-\theta_{k+1})$, which shows up in the above bound through the component $\frac{\theta_k-\theta_b}{\hat D^2}$: let $g^* \in [k-1]$ be the minimizer of $\hat D$; then $\frac{\theta_k-\theta_b}{\hat D^2} = \frac{(\theta_{g^*}+\theta_k)^2(\theta_b+\theta_{g^*})^2}{\theta_{g^*}^2(\theta_k-\theta_b)} \le \frac{16\,\theta_{g^*}^2}{\theta_k-\theta_{k+1}}$, where the upper bounding follows since $\theta_{g^*} \ge \theta_k > \theta_b$ for any $b \in [n]\setminus[k]$, and $\theta_b \le \theta_{k+1}$ for any such $b$.

5 Experiments

In this section we present the empirical evaluations of our proposed algorithm MaxMin-UCB (abbreviated as MM) on different synthetic datasets, and also compare it with other algorithms. All results are reported as averages across 50 runs, along with the standard deviations. For this we use 7 different MNL$(n,\theta)$ environments, as described below.

MNL(n, θ) Environments. 1. g1, 2. g4, 3. arith, 4.
geo, 5. har, all with $n = 16$; and two larger models, 6. arithb and 7. geob, with $n = 50$ items each. Details are moved to Appendix E.
We compare our proposed methods with the following two baselines, which apply most closely to our problem setup. Note that, as discussed in Sec. 1, none of the existing work exactly addresses our problem.
Algorithms. 1. BD: the Battling-Duel algorithm of [36] with the RUCB algorithm [47] as the dueling-bandit blackbox; 2. Sp-TS: the Self-Sparring algorithm of [39] with Thompson Sampling [1]; and 3. MM: our proposed method MaxMin-UCB for Winner-regret (Alg. 1).
Comparing Winner-regret with Top-m-ranking Feedback (Fig. 1): We first compare the regret performances for $k = 10$ and $m = 5$. From Fig. 1, it clearly follows that in all cases MaxMin-UCB uniformly outperforms the other two algorithms by taking advantage of Top-m-ranking Feedback, which the other two fail to make use of: they both allow repetitions in the played subsets and so cannot exploit the rank-ordered feedback to its full extent. Furthermore, the Thompson-sampling-based Sp-TS in general exhibits a much higher variance than the rest, owing to its Bayesian nature. Also, as expected, since g1 and g4 are comparatively easier instances, i.e. with larger `gap' $\hat\Delta_{\max}$ (see Thm. 3, 4, 5, 6 etc. for a formal justification), our algorithm converges much faster on these models.

Figure 1: Comparative performances on Winner-regret for k = 10, m = 5

Comparing Top-k-regret performances with Top-k-ranking Feedback (Fig. 2): We are not aware of any existing algorithm for the Top-k-regret objective with Top-k-ranking Feedback. We thus use a modified version of the Sp-TS algorithm [39] described above for this purpose: it simply draws $k$ items without repetition and uses Rank-Breaking updates to maintain the Beta posteriors.
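The Rank-Breaking primitive used here (and by our own algorithms to form the pairwise estimates $\hat p_{ij}$) is simple enough to sketch. The following minimal Python illustration is our own simplification, not the paper's implementation: the function names, the dictionary-of-counts representation, and the uniform Beta(1, 1) prior are all assumptions made for exposition.

```python
import random

def rank_break(ranking, wins):
    """Rank-Breaking: decompose an observed (top-m or full) ranking over a
    played subset into pairwise outcomes -- each item 'beats' every item
    ranked below it, and each such pairwise win is counted once."""
    for pos, i in enumerate(ranking):
        for j in ranking[pos + 1:]:
            wins[(i, j)] = wins.get((i, j), 0) + 1

def beta_sample(wins, i, j):
    """Sp-TS-style Thompson sample from the Beta(1 + wins of i over j,
    1 + wins of j over i) posterior on the preference of i over j."""
    a = 1 + wins.get((i, j), 0)
    b = 1 + wins.get((j, i), 0)
    return random.betavariate(a, b)

# One round of full-ranking feedback on a played 4-subset {0, 1, 2, 3},
# with observed ranking 2 > 0 > 3 > 1.
wins = {}
rank_break([2, 0, 3, 1], wins)
assert wins[(2, 1)] == 1 and (1, 2) not in wins  # 2 beats each lower item once
```

Drawing one `beta_sample` per pair and then playing the $k$ items that dominate the sampled pairwise matrix gives the flavour of the modified Sp-TS baseline described above.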
Here again, we see that our method Rec-MaxMin-UCB (Rec-MM) uniformly outperforms Sp-TS in all cases, and, as before, Sp-TS shows higher variability as well. Interestingly, our algorithm converges fastest on g4, it being the easiest model with the largest `gap' $\Delta^{(k)}$ between the $k$th and $(k+1)$th best items (see Thm. 7, 8, 9 etc.), and takes the longest on har, since it has the smallest $\Delta^{(k)}$.

Figure 2: Comparative performances on Top-k-regret for k = 10

Effect of varying m with fixed k (Fig. 3): We also studied our algorithm MaxMin-UCB with rank-ordered feedback of varying size ($m$), keeping the subset size ($k$) fixed, for both the Winner-regret and Top-k-regret objectives, on the larger models arithb and geob, which have $n = 50$ items. As expected, in both cases the regret scales down with increasing $m$ (justifying the bounds in Thm. 5, 6, 8, 9).

Figure 3: Regret with varying m for fixed k = 40 (on our proposed algorithm MaxMin-UCB)

6 Conclusion and Future Work

Although we have analysed low-regret algorithms for learning with subset-wise preferences, several avenues for investigation open up with these results. The case of learning with contextual subset-wise models is an important and practically relevant problem, as is the problem of considering mixed cardinal and ordinal feedback structures in online learning. Other directions of interest include studying the budgeted version, where there are costs associated with the amount of preference information that may be elicited in each round, or analysing the current problem under a variety of subset choice models, e.g. multinomial probit, Mallows, or even adversarial preference models.

Acknowledgements

The authors are grateful to the anonymous reviewers for valuable feedback. This work is supported by a Qualcomm Innovation Fellowship 2019, and the Indigenous 5G Test Bed project grant from the Dept. of Telecommunications, Government of India.
Aadirupa Saha thanks Arun Rajkumar for the valuable discussions, and the Tata Trusts and ACM-India/IARCS Travel Grants for travel support.

References

[1] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1–39.26, 2012.

[2] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. A near-optimal exploration-exploitation approach for assortment selection. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 599–600. ACM, 2016.

[3] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. Thompson sampling for the MNL-bandit. Machine Learning Research, 65:1–3, 2017.

[4] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. MNL-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453–1485, 2019.

[5] Nir Ailon, Zohar Shay Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In ICML, volume 32, pages 856–864, 2014.

[6] Duane F Alwin and Jon A Krosnick. The measurement of values in surveys: A comparison of ratings and rankings. Public Opinion Quarterly, 49(4):535–552, 1985.

[7] Hossein Azari, David Parkes, and Lirong Xia. Random utility theory for social choice. In Advances in Neural Information Processing Systems, pages 126–134, 2012.

[8] Gábor Bartók, Dávid Pál, and Csaba Szepesvári. Minimax regret of finite partial-monitoring games in stochastic environments. In Proceedings of the 24th Annual Conference on Learning Theory, pages 133–154, 2011.

[9] Moshe Ben-Akiva, Mark Bradley, Takayuki Morikawa, Julian Benjamin, Thomas Novak, Harmen Oppewal, and Vithala Rao. Combining revealed and stated preferences data. Marketing Letters, 5(4):335–349, 1994.

[10] Austin R Benson, Ravi Kumar, and Andrew Tomkins.
On the relevance of irrelevant alternatives. In Proceedings of the 25th International Conference on World Wide Web, pages 963–973. International World Wide Web Conferences Steering Committee, 2016.

[11] Brian Brost, Yevgeny Seldin, Ingemar J. Cox, and Christina Lioma. Multi-dueling bandits and their application to online ranker evaluation. CoRR, abs/1608.06253, 2016.

[12] Róbert Busa-Fekete and Eyke Hüllermeier. A survey of preference-based online learning with bandit algorithms. In International Conference on Algorithmic Learning Theory, pages 18–39. Springer, 2014.

[13] Róbert Busa-Fekete, Eyke Hüllermeier, and Balázs Szörényi. Preference-based rank elicitation using statistical models: The case of Mallows. In Proceedings of The 31st International Conference on Machine Learning, volume 32, 2014.

[14] Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

[15] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159, 2013.

[16] Xi Chen, Paul N Bennett, Kevyn Collins-Thompson, and Eric Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 193–202. ACM, 2013.

[17] Xi Chen, Yuanzhi Li, and Jieming Mao. A nearly instance optimal algorithm for top-k ranking under the multinomial logit model. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2504–2522. SIAM, 2018.

[18] Yuxin Chen and Changho Suh. Spectral MLE: Top-k rank aggregation from pairwise comparisons. In International Conference on Machine Learning, pages 371–380, 2015.

[19] Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al.
Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.

[20] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[21] Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377–399, 2018.

[22] Thore Graepel and Ralf Herbrich. Ranking and matchmaking. Game Developer Magazine, 25:34, 2006.

[23] Bruce Hajek, Sewoong Oh, and Jiaming Xu. Minimax-optimal inference from partial rankings. In Advances in Neural Information Processing Systems, pages 1475–1483, 2014.

[24] David A Hensher. Stated preference analysis of travel choices: the state of practice. Transportation, 21(2):107–133, 1994.

[25] Katja Hofmann. Fast and reliable online learning to rank for information retrieval. In SIGIR Forum, volume 47, page 140, 2013.

[26] Minje Jang, Sunghyun Kim, Changho Suh, and Sewoong Oh. Optimal sample complexity of m-wise data for top-k ranking. In Advances in Neural Information Processing Systems, pages 1685–1695, 2017.

[27] Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. DCM bandits: Learning to rank with multiple clicks. In International Conference on Machine Learning, pages 1215–1224, 2016.

[28] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.

[29] Ashish Khetan and Sewoong Oh. Data-driven rank breaking for efficient rank aggregation. Journal of Machine Learning Research, 17(193):1–54, 2016.

[30] Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem.
In COLT, pages 1141–1154, 2015.

[31] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535–543, 2015.

[32] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[33] Pantelimon G Popescu, Silvestru Dragomir, Emil I Slusanschi, and Octavian N Stanasila. Bounds for Kullback-Leibler divergence. Electronic Journal of Differential Equations, 2016, 2016.

[34] Filip Radlinski, Madhu Kurup, and Thorsten Joachims. How does clickthrough data reflect retrieval quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 43–52. ACM, 2008.

[35] Wenbo Ren, Jia Liu, and Ness B Shroff. PAC ranking from pairwise and listwise queries: Lower bounds and upper bounds. arXiv preprint arXiv:1806.02970, 2018.

[36] Aadirupa Saha and Aditya Gopalan. Battle of bandits. In Uncertainty in Artificial Intelligence, 2018.

[37] Aadirupa Saha and Aditya Gopalan. PAC Battling Bandits in the Plackett-Luce Model. In Algorithmic Learning Theory, pages 700–737, 2019.

[38] Hossein Azari Soufiani, David C Parkes, and Lirong Xia. Computing parametric ranking models via rank-breaking. In ICML, pages 360–368, 2014.

[39] Yanan Sui, Vincent Zhuang, Joel Burdick, and Yisong Yue. Multi-dueling bandits with dependent arms. In Conference on Uncertainty in Artificial Intelligence, UAI'17, 2017.

[40] Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, pages 604–612, 2015.

[41] Tanguy Urvoy, Fabrice Clerot, Raphael Féraud, and Sami Naamane.
Generic exploration and k-armed voting bandits. In International Conference on Machine Learning, pages 91–99, 2013.

[42] Huasen Wu and Xin Liu. Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.

[43] Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201–1208. ACM, 2009.

[44] Yisong Yue and Thorsten Joachims. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 241–248, 2011.

[45] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.

[46] Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. arXiv preprint arXiv:1312.3393, 2013.

[47] Masrour Zoghi, Shimon Whiteson, Remi Munos, Maarten de Rijke, et al. Relative upper confidence bound for the k-armed dueling bandit problem. In JMLR Workshop and Conference Proceedings, number 32, pages 10–18. JMLR, 2014.