{"title": "Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 810, "page_last": 818, "abstract": "Given a set $V$ of  $n$ elements we wish to linearly order them using pairwise preference labels  which may be non-transitive (due to irrationality or arbitrary noise).  The goal is to linearly order the elements while disagreeing with  as few pairwise preference labels as possible.  Our performance is measured by two parameters:  The number of disagreements (loss) and the query complexity (number of pairwise preference labels).  Our algorithm adaptively queries  at most $O(n\\poly(\\log n,\\eps^{-1}))$ preference labels for a regret of  $\\eps$ times the optimal loss.  This is strictly better, and often significantly better than what  non-adaptive sampling could achieve.  Our main result helps settle an open problem posed by   learning-to-rank (from pairwise information) theoreticians and practitioners:  What is a provably correct way to sample preference labels?", "full_text": "Active Learning Ranking from Pairwise Preferences\n\nwith Almost Optimal Query Complexity\n\nTechnion, Haifa, Israel nailon@cs.technion.ac.il\n\nNir Ailon\u2217\n\nAbstract\n\nGiven a set V of n elements we wish to linearly order them using pairwise\npreference labels which may be non-transitive (due to irrationality or arbitrary\nnoise). The goal is to linearly order the elements while disagreeing with as\nfew pairwise preference labels as possible. Our performance is measured by\ntwo parameters: The number of disagreements (loss) and the query complex-\nity (number of pairwise preference labels). Our algorithm adaptively queries at\nmost O(n poly(log n, \u03b5\u22121)) preference labels for a regret of \u03b5 times the optimal\nloss. This is strictly better, and often signi\ufb01cantly better than what non-adaptive\nsampling could achieve. 
Our main result helps settle an open problem posed\nby learning-to-rank (from pairwise information) theoreticians and practitioners:\nWhat is a provably correct way to sample preference labels?\n\n1 Introduction\n\nWe study the problem of learning to rank from pairwise preferences, and solve an open problem\nthat has led to the development of many heuristics but no provable results. The input is a set V of n\nelements from some universe, and we wish to linearly order them given pairwise preference labels,\ngiven as responses to the question \u201cwhich is preferred, u or v?\u201d for pairs u, v \u2208 V . The goal is to linearly order\nthe elements from the most preferred to the least preferred, while disagreeing with as few pairwise\npreference labels as possible. Our performance is measured by two parameters: the loss (number of\ndisagreements) and the query complexity (number of preference responses we need). This is a learning\nproblem, with a finite sample space of $\\binom{n}{2}$ possibilities only (hence a transductive learning problem).\nThe loss minimization problem given the entire n \u00d7 n preference matrix is a well known NP-hard\nproblem called MFAST (minimum feedback arc-set in tournaments) [5]. Recently, Kenyon and\nSchudy [23] have devised a PTAS for it, namely, a polynomial (in n) time algorithm computing a\nsolution with loss at most (1 + \u03b5) times the optimal, for any \u03b5 > 0 (the degree of the polynomial there\nmay depend on \u03b5). In our case each edge from the input graph is given for a unit cost, hence we seek\nquery efficiency. Our algorithm samples preference labels non-uniformly and adaptively, hence we\nobtain an active learning algorithm. Our output is not a solution to MFAST, but rather a reduction of\nthe original learning problem to a simpler one decomposed into small instances in which the optimal\nloss is high; consequently, uniform sampling of preferences can be shown to be sufficiently good.\nOur Setting vs. 
The Usual \u201cLearning to Rank\u201d Problem. Our setting differs from much of\nthe learning to rank (LTR) literature. Usually, the labels used in LTR problems are responses to\nindividual elements, and not to pairs of elements. A typical example is the 1..5 scale rating for\nrestaurants, or 0, 1 rating (irrelevant/relevant) for candidate documents retrieved for a query (known\nas the binary ranking problem). The preference graph induced from these labels is transitive, hence\nno combinatorial problems arise due to nontransitivity. We do not discuss this version of LTR. Some\nLTR literature does consider the pairwise preference label approach, and there is much justification\nfor it (see [11, 22] and references therein). Other works (e.g. [26]) discuss pairwise or higher order\n(listwise) approaches, but a close inspection reveals that they do not use pairwise (or listwise) labels,\nonly pairwise (or listwise) loss functions.\n\n\u2217Supported by a Marie Curie International Reintegration Grant PIRG07-GA-2010-268403\n\nUsing Kenyon and Schudy\u2019s PTAS as a starting point. As mentioned above, our main algorithm\nis derived from the PTAS of [23], but with a significant difference. We use their algorithm to obtain\na certain decomposition of the input. A key change to their algorithm, which is not query efficient,\ninvolves careful sampling followed by iterated sample refreshing steps.\nOur work can be studied in various contexts, aside from LTR. Machine Learning Reductions: Our\nmain algorithm reduces a given instance to smaller subproblems decomposing it. We mention other\nwork in this vein: [6, 3, 9]. Active Learning: An important field of statistical learning theory and\npractice ([8, 21, 15, 14, 24, 17, 13, 20, 16, 13]). In the most general setting, one wishes to improve\non standard statistical learning theoretical complexity bounds by actively choosing instances for\nlabels. 
Many heuristics have been developed, while algorithms with provable bounds (especially in\nthe agnostic case) are known for few problems, often toys. General bounds are difficult to use: [8]\nprovides general purpose active learning bounds which are quite difficult to use in actual specific\nproblems; the A^2 algorithm [7], analyzed in [21] using the disagreement coefficient, is not useful\nhere. It can be shown that the disagreement coefficient here is trivial (omitted due to lack of space).\nNoisy Sorting: There is much literature in theoretical computer science on sorting noisy data.\n[10] work in a Bayesian setting; in [19], the input preference graph is transitive, and labels are\nnondeterministic. In other work, elements from the set of alternatives are assumed to have a latent\nvalue. In this work the input is worst case and not Bayesian, query responses are deterministic and\nelements do not necessarily have a latent value.\nPaper Organization: Section 2 presents basic definitions and lemmata, and in particular defines\nwhat a good decomposition is and how it can be used in learning permutations from pairwise\npreferences. Section 3 presents our main active learning algorithm which is, in fact, an algorithm for\nproducing a good decomposition query efficiently. The main result is presented in Theorem 3.1.\nSection 4 discusses future work and followup work appearing in the full version of this paper.\n\n2 Notation and Basic Lemmata\n\nLet V denote a finite set of size n that we wish to rank.1 We assume a preference function\nW on pairs of elements in V , which is unknown to us. For any pair u, v \u2208 V , W (u, v) is 1 if u is\ndeemed preferred over v, and 0 otherwise. We enforce W (u, v) + W (v, u) = 1 (no abstention),\nhence (V, W ) is a tournament. We assume that W is agnostic: it is not necessarily transitive and\nmay contain errors and inconsistencies. 
For convenience, for any two real numbers a, b we will let\n[a, b] denote the interval {x : a \u2264 x \u2264 b} if a \u2264 b and {x : b \u2264 x \u2264 a} otherwise.\nWe wish to predict W using a hypothesis h from concept class H = \u03a0(V ), where \u03a0(V ) is the\nset of permutations \u03c0 over V viewed equivalently as binary functions over V \u00d7 V satisfying, for\nall u, v, w \u2208 V , \u03c0(u, v) = 1 \u2212 \u03c0(v, u) and \u03c0(u, w) = 1 whenever \u03c0(u, v) = \u03c0(v, w) = 1. For\n\u03c0 \u2208 \u03a0(V ) we also use notation: \u03c0(u, v) = 1 if and only if u \u227a\u03c0 v, namely, if u precedes v in \u03c0.\nAbusing notation, we also view permutations as injective functions from [n] to V , so that the element\n\u03c0(1) \u2208 V is in the first, most preferred position and \u03c0(n) is the least preferred one. We also define\nthe function \u03c1\u03c0 inverse to \u03c0 as the unique function satisfying \u03c0(\u03c1\u03c0(v)) = v for all v \u2208 V . Hence,\nu \u227a\u03c0 v is equivalent to \u03c1\u03c0(u) < \u03c1\u03c0(v). As in standard ERM, we define a risk function Cu,v\npenalizing the error of \u03c0 with respect to the pair u, v, namely, $C_{u,v}(\\pi, V, W) = \\mathbf{1}_{\\pi(u,v) \\neq W(u,v)}$.\nThe total loss, C(h, V, W ), is defined as Cu,v summed over all unordered pairs u, v \u2208 V . Our goal is to\ndevise an active learning algorithm for the purpose of minimizing this loss.\nIn this paper we show an improved, almost optimal statistical learning theoretical bound using recent\nimportant breakthroughs in combinatorial optimization of a related problem called minimum\nfeedback arc-set in tournaments (MFAST). 
The relation between this NP-Hard problem and our learning\nproblem has been noted before (e.g. [12]), when these breakthroughs were yet to be known.\nMFAST is more precisely defined as follows: V and W are given in entirety (we pay no price for\nreading W ), and we seek \u03c0 \u2208 \u03a0(V ) minimizing the MFAST cost C(\u03c0, V, W ). A PTAS has been\ndiscovered for this NP-Hard problem very recently in groundbreaking work by Kenyon and Schudy [23].\nThis PTAS is not useful however for the purpose of learning to rank from pairwise preferences\nbecause it is not query efficient. It may require reading all quadratically many entries in W . In this\nwork we fix this drawback, and use the PTAS to obtain a certain useful decomposition.\n\n1 In a more general setting we are given a sequence V^1, V^2, . . . of sets, but there is enough structure and\ninterest in the single set case, which we focus on in this work.\n\nDefinition 2.1. Given a set V of size n, an ordered decomposition is a list of pairwise disjoint\nsubsets V1, . . . , Vk \u2286 V such that $\\bigcup_{i=1}^{k} V_i = V$. We let W|Vi denote the restriction of W to Vi \u00d7 Vi\nfor i = 1, . . . , k. For a permutation \u03c0 \u2208 \u03a0(V ) we let \u03c0|Vi denote its restriction to the elements of Vi\n(hence, \u03c0|Vi \u2208 \u03a0(Vi)). We say that \u03c0 \u2208 \u03a0(V ) respects V1, . . . , Vk if for all u \u2208 Vi, v \u2208 Vj, i < j,\nu \u227a\u03c0 v. We denote the set of permutations \u03c0 \u2208 \u03a0(V ) respecting the decomposition V1, . . . , Vk by\n\u03a0(V1, . . . , Vk). We say that a subset U of V is small in V if |U| \u2264 log n/ log log n, otherwise we\nsay that U is big in V . A decomposition V1, . . . 
, Vk is \u03b5-good with respect to W if:2\n\nLocal Chaos:\n$$\\min_{\\pi \\in \\Pi(V)} \\sum_{i : V_i \\text{ big in } V} C(\\pi|_{V_i}, V_i, W|_{V_i}) \\geq \\varepsilon^2 \\sum_{i : V_i \\text{ big in } V} \\binom{n_i}{2} . \\tag{2.1}$$\n\nApproximate Optimality:\n$$\\min_{\\sigma \\in \\Pi(V_1, \\dots, V_k)} C(\\sigma, V, W) \\leq (1 + \\varepsilon) \\min_{\\pi \\in \\Pi(V)} C(\\pi, V, W) . \\tag{2.2}$$\n\n(Here ni = |Vi|.) We will show how to use an \u03b5-good decomposition, and how to obtain one query-efficiently.\nBasic (suboptimal) results from statistical learning theory: Viewing pairs of V -elements as data\npoints, the loss C(\u03c0, V, W ) is, up to normalization, an expected cost over a random draw of a data\npoint. A sample E of unordered pairs gives rise to a partial cost, CE, defined as:\n$$C_E(\\pi, V, W) = \\binom{n}{2} |E|^{-1} \\sum_{(u,v) \\in E,\\ u \\prec_\\pi v} W(v, u) .$$\n(We assume throughout that E is chosen with repetitions and is hence\na multiset; the accounting of parallel edges is clear.) CE(\u00b7,\u00b7,\u00b7) is an empirical unbiased estimator\nof C(\u03c0, V, W ) if E \u2286 $\\binom{V}{2}$ is chosen uniformly at random among all (multi)subsets of a given size.\nThe basic question in statistical learning theory is, how good is the minimizer \u03c0 of CE, in terms of\nC? The notion of VC dimension [25] gives us a nontrivial (albeit suboptimal - see below) bound.\nLemma 2.2. The VC dimension of the set of permutations on V , viewed as binary classifiers on\npairs of elements, is n \u2212 1.\nIt is easy to show that the VC dimension is at most O(n log n), which is the logarithm of the number\nof permutations. See [4] for a linear bound. The implications are:\nProposition 2.3. If E is chosen uniformly at random (with repetitions) as a sample of m elements\nfrom $\\binom{V}{2}$, where m > n, then with probability at least 1 \u2212 \u03b4 over the sample, all permutations \u03c0\nsatisfy:\n$$|C_E(\\pi, V, W) - C(\\pi, V, W)| = n^2 \\, O\\left( \\sqrt{\\frac{n \\log m + \\log(1/\\delta)}{m}} \\right) .$$\nHence, if we want to minimize C(\u03c0, V, W ) over \u03c0 to within an additive error of \u00b5n^2 with probability\nat least 1 \u2212 \u03b4, it suffices to choose a sample E of m = O(\u00b5^{-2}(n log n + log \u03b4^{-1})) elements from $\\binom{V}{2}$\nuniformly at random (with repetitions), and optimize CE(\u03c0, V, W ) instead.3 Assume \u03b4 \u2265 e^{-n}, so\nthat we get a more manageable sample bound of O(\u00b5^{-2} n log n). Is this bound at all interesting?\nFor two permutations \u03c0, \u03c3, the Kendall-Tau metric d\u03c4 (\u03c0, \u03c3) is defined as $d_\\tau(\\pi, \\sigma) = \\sum_{u \\neq v} \\mathbf{1}[(u \\prec_\\pi v) \\wedge (v \\prec_\\sigma u)]$.\nThe Spearman Footrule metric dfoot(\u03c0, \u03c3) is defined as $d_{\\mathrm{foot}}(\\pi, \\sigma) = \\sum_{u} |\\rho_\\pi(u) - \\rho_\\sigma(u)|$.\nThe following is well known [18]:\n$$d_\\tau(\\pi, \\sigma) \\leq d_{\\mathrm{foot}}(\\pi, \\sigma) \\leq 2 d_\\tau(\\pi, \\sigma) . \\tag{2.3}$$\nClearly C(\u00b7, V,\u00b7) extends d\u03c4 (\u00b7,\u00b7) to distances between permutations and binary tournaments, with\nthe triangle inequality d\u03c4 (\u03c0, \u03c3) \u2264 C(\u03c0, V, W ) + C(\u03c3, V, W ) satisfied for all W and \u03c0, \u03c3 \u2208 \u03a0(V ).\nAssume we use Proposition 2.3 to find \u03c0 \u2208 \u03a0(V ) with an additive regret of O(\u00b5n^2) with respect to\nan optimal solution \u03c0\u2217 for some \u00b5 > 0. The triangle inequality implies d\u03c4 (\u03c0, \u03c0\u2217) = \u2126(\u00b5n^2). By\n(2.3), hence, dfoot(\u03c0, \u03c0\u2217) = \u2126(\u00b5n^2). By definition of dfoot, this means that the average element v \u2208\nV is translated \u2126(\u00b5n) positions away from its position in \u03c0\u2217. In some applications (e.g. 
IR), one may\nwant elements to be at most a constant \u03b3 positions off. This translates to a sought regret of O(\u03b3n)\nfor constant \u03b3, and using our notation, to \u00b5 = \u03b3/n. Proposition 2.3 cannot guarantee less than a\nquadratic sample size for such a regret, which is tantamount to querying all of W . We can do better:\nFor any \u03b5 > 0 we achieve an additive regret of O(\u03b5C(\u03c0\u2217, V, W )) using O(n poly(log n, \u03b5^{-1}))\nW -queries, for arbitrarily small optimal loss C(\u03c0\u2217, V, W ). This is not achievable using Proposition 2.3.\nOne may argue that the VC bound may be too pessimistic, and other arguments may work for the\nuniform sample case. A simple extremal case (omitted from this abstract) shows that this is false.\n\n2 We will just say \u03b5-good if W is clear from the context.\n3 $\\binom{V}{2}$ denotes the set of unordered pairs of distinct elements in V .\n\nProposition 2.4. Let V1, . . . , Vk be an ordered decomposition of V . Let B denote the set of indices\ni \u2208 [k] such that Vi is big in V . Assume E is chosen uniformly at random (with repetitions) as a\nsample of m elements from $\\bigcup_{i \\in B} \\binom{V_i}{2}$, where m > n. For each i = 1, . . . , k, let $E_i = E \\cap \\binom{V_i}{2}$.\nDefine CE(\u03c0,{V1, . . . , Vk}, W ) to be\n$$C_E(\\pi, \\{V_1, \\dots, V_k\\}, W) = \\Big( \\sum_{i \\in B} \\binom{n_i}{2} \\Big) |E|^{-1} \\sum_{i \\in B} \\binom{n_i}{2}^{-1} |E_i| \\, C_{E_i}(\\pi|_{V_i}, V_i, W|_{V_i}) .$$\n(The normalization is defined so that the expression is an unbiased estimator of $\\sum_{i \\in B} C(\\pi|_{V_i}, V_i, W|_{V_i})$. If\n|Ei| = 0 for some i, formally define $\\binom{n_i}{2}^{-1} |E_i| \\, C_{E_i}(\\pi|_{V_i}, V_i, W|_{V_i}) = 0$.) Then with probability at\nleast 1 \u2212 e^{-n} over the sample, all permutations \u03c0 \u2208 \u03a0(V ) satisfy:\n$$\\Big| C_E(\\pi, \\{V_1, \\dots, V_k\\}, W) - \\sum_{i \\in B} C(\\pi|_{V_i}, V_i, W|_{V_i}) \\Big| = \\sum_{i \\in B} \\binom{n_i}{2} \\, O\\left( \\sqrt{\\frac{n \\log m + \\log(1/\\delta)}{m}} \\right) .$$\nThe proof (omitted from this abstract) uses simple VC dimension arithmetic. Now, why is\n\u03b5-goodness good?\nLemma 2.5. Fix \u03b5 > 0 and assume we have an \u03b5-good partition (Definition 2.1) V1, . . . , Vk\nof V . Let B denote the set of i \u2208 [k] such that Vi is big in V , and let \u00afB = [k] \\ B. Let\nni = |Vi| for i = 1, . . . , k, and let E denote a random sample of O(\u03b5^{-6} n log n) elements from\n$\\bigcup_{i \\in B} \\binom{V_i}{2}$, each element chosen uniformly at random with repetitions. Let Ei denote $E \\cap \\binom{V_i}{2}$. Let\nCE(\u03c0,{V1, . . . , Vk}, W ) be defined as in Proposition 2.4. For any \u03c0 \u2208 \u03a0(V1, . . . , Vk) define:\n$$\\tilde{C}(\\pi) := C_E(\\pi, \\{V_1, \\dots, V_k\\}, W) + \\sum_{i \\in \\bar{B}} C(\\pi|_{V_i}, V_i, W|_{V_i}) + \\sum_{1 \\leq i < j \\leq k} \\sum_{(u,v) \\in V_i \\times V_j} \\mathbf{1}_{v \\prec_\\pi u} . \\tag{2.4}$$\nThen the following event occurs with probability at least 1 \u2212 e^{-n}: For any minimizer \u03c3\u2217 of \u02dcC(\u00b7)\nover \u03a0(V1, . . . , Vk): C(\u03c3\u2217, V, W ) \u2264 (1 + 2\u03b5) min\u03c0\u2208\u03a0(V ) C(\u03c0, V, W ).\n(Proof omitted from abstract.) The consequence: Given an \u03b5-good decomposition V1, . . . , Vk,\noptimizing \u02dcC(\u03c3) over \u03c3 \u2208 \u03a0(V1, . . . , Vk) would give a solution with relative regret of 2\u03b5 w.r.t. the\noptimum. The first and last terms in the RHS of (2.4) require no more than O(\u03b5^{-6} n log n)\nW -queries to compute (by definition of E, and given the decomposition). 
The middle term runs over\nsmall Vi\u2019s, and can be computed from O(n log n/ log log n) W -queries. If we now assume that\na good decomposition can be efficiently computed using O(n polylog(n, \u03b5^{-1})) W -queries (as we\nindeed show), then we would beat the VC bound whenever the optimal loss is at most O(n^{2-\u03bd}) for\nsome \u03bd > 0.\n\n3 A Query Efficient Algorithm for \u03b5-Good Decompositions\n\nTheorem 3.1. Given a set V of size n, a preference oracle W and an error tolerance parameter\n0 < \u03b5 < 1, there exists a poly(n, \u03b5^{-1})-time algorithm returning, with constant probability, an\n\u03b5-good partition of V , querying at most O(\u03b5^{-6} n log^5 n) locations in W on expectation.\n\nBefore describing the algorithm and its analysis, we need some definitions:\nDefinition 3.2. Let \u03c0 denote a permutation over V . Let v \u2208 V and i \u2208 [n]. We define \u03c0v\u2192i to be\nthe permutation obtained by moving the rank of v to i in \u03c0, and leaving the rest of the elements in\nthe same order.4\nDefinition 3.3. Fix \u03c0 \u2208 \u03a0(V ), v \u2208 V and i \u2208 [n]. We define TestMove(\u03c0, V, W, v, i) :=\nC(\u03c0, V, W ) \u2212 C(\u03c0v\u2192i, V, W ) . Equivalently, if i \u2265 \u03c1\u03c0(v) then\n$$\\mathrm{TestMove}(\\pi, V, W, v, i) := \\sum_{u : \\rho_\\pi(u) \\in [\\rho_\\pi(v)+1,\\, i]} (W(u, v) - W(v, u)) .$$\nA similar expression can be written for i < \u03c1\u03c0(v). For a multiset E \u2286 $\\binom{V}{2}$, define TestMoveE(\u03c0, V, W, v, i), for i \u2265 \u03c1\u03c0(v), as\n$$\\mathrm{TestMove}_E(\\pi, V, W, v, i) := \\frac{|i - \\rho_\\pi(v)|}{|\\tilde{E}|} \\sum_{u : (u,v) \\in \\tilde{E}} (W(u, v) - W(v, u)) ,$$\nwhere the multiset \u02dcE is defined as {(u, v) \u2208 E : \u03c1\u03c0(u) \u2208 [\u03c1\u03c0(v) + 1, i]}. Similarly, for i < \u03c1\u03c0(v) we define\n$$\\mathrm{TestMove}_E(\\pi, V, W, v, i) := \\frac{|i - \\rho_\\pi(v)|}{|\\tilde{E}|} \\sum_{u : (u,v) \\in \\tilde{E}} (W(v, u) - W(u, v)) ,$$\nwhere \u02dcE is now {(u, v) \u2208 E : \u03c1\u03c0(u) \u2208 [i, \u03c1\u03c0(v) \u2212 1]}.\n\n4 For example, if V = {x, y, z} and (\u03c0(1), \u03c0(2), \u03c0(3)) = (x, y, z), then (\u03c0x\u21923(1), \u03c0x\u21923(2), \u03c0x\u21923(3)) = (y, z, x).\n\nLemma 3.4. Fix \u03c0 \u2208 \u03a0(V ), v \u2208 V , i \u2208 [n] and an integer N. Let E \u2286 $\\binom{V}{2}$ be a random (multi)set\nof size N with elements (v, u1), . . . , (v, uN ), drawn so that for each j \u2208 [N ] the element uj\nis chosen uniformly at random from among the elements lying between v (exclusive) and position i\n(inclusive) in \u03c0. Then E[TestMoveE(\u03c0, V, W, v, i)] = TestMove(\u03c0, V, W, v, i). Additionally, for\nany \u03b4 > 0, except with probability of failure \u03b4,\n$$|\\mathrm{TestMove}_E(\\pi, V, W, v, i) - \\mathrm{TestMove}(\\pi, V, W, v, i)| = O\\left( |i - \\rho_\\pi(v)| \\sqrt{\\frac{\\log \\delta^{-1}}{N}} \\right) .$$\nThe lemma is easily proven using Hoeffding tail bounds, using the fact that |W (u, v)| \u2264 1 for all\nu, v.\nOur decomposition algorithm SampleAndRank is detailed in Algorithm 1, with\nsubroutines in Algorithms 2 and 3. It is a query efficient improvement of the PTAS in [23] with\nthe following difference: here we are not interested in an approximation algorithm for MFAST,\nbut just in an \u03b5-good decomposition. Whenever we reach a small block (line 3) or a big block\nwith a probably approximately sufficiently high cost (line 8) in the recursion of Algorithm 2, we\nsimply output it as a block in our partition. Denote the resulting outputted partition by V1, . . . , Vk.\nDenote by \u02c6\u03c0 the minimizer of C(\u00b7, V, W ) over \u03a0(V1, . . . , Vk). 
We need to show that C(\u02c6\u03c0, V, W ) \u2264\n(1 + \u03b5) min\u03c0\u2208\u03a0(V ) C(\u03c0, V, W ), thus establishing (2.2). The analysis closely follows [23]. Due to\nspace limitations, we focus on the differences, and specifically on Procedure ApproxLocalImprove\n(Algorithm 3), replacing a greedy local improvement step in [23] which is not query efficient.\nSampleAndRank (Algorithm 1) takes the following arguments: the set V , the preference\nmatrix W and an accuracy argument \u03b5.\nIt is implicitly understood that the argument W passed to\nSampleAndRank is given as a query oracle, incurring a unit cost upon each access. The first warm\nstart step in SampleAndRank computes an expected constant factor approximation \u03c0 to MFAST\non V, W using QuickSort [2]. The query complexity of this step is O(n log n) on expectation (see\n[3]). Before continuing, we make the following assumption, which holds with constant probability\nusing Markov probability bounds.\nAssumption 3.5. The cost C(\u03c0, V, W ) of \u03c0 computed in line 2 of SampleAndRank is O(1) times\nthat of the optimal \u03c0\u2217, and the query cost incurred in the computation is O(n log n).\n\nNext, a recursive procedure SampleAndDecompose is called, running a divide-and-conquer\nalgorithm. Before branching, it executes the following: Lines 5 to 9 identify local chaos (2.1) (with\nhigh probability). Line 10 calls ApproxLocalImprove (Algorithm 3), responsible for performing\nquery-efficient approximate greedy steps, as we now explain.\nApproximate local improvement steps. ApproxLocalImprove takes a set V of size N, W , a\npermutation \u03c0 on V , two numbers C0, \u03b5 and an integer n.5 The number n is always the size of\nthe input in the root call to SampleAndDecompose, passed down in the recursion, and used for the\npurpose of controlling success probabilities. The goal of the procedure is to repeatedly identify, w.h.p., single vertex\nmoves that considerably decrease the cost. 
The procedure starts by creating a sample ensemble\nS = {Ev,i : v \u2208 V, i \u2208 [B, L]}, where B = log\u230a\u0398(\u03b5N/ log n)\u230b and L = \u2308log N\u2309. The size of each\nEv,i \u2208 S is \u0398(\u03b5^{-2} log^2 n), and each element (v, x) \u2208 Ev,i was added (with possible multiplicity)\nby uniformly at random selecting, with repetitions, an element x \u2208 V positioned at distance at most\n2^i from the position of v in \u03c0. Let D\u03c0 denote the distribution space from which S was drawn, and\nlet PrX\u223cD\u03c0 [X = S] denote the probability of obtaining a given sample ensemble S. S will enable\nus to approximate the improvement in cost obtained by moving a single element u to position j.\nDefinition 3.6. Fix u \u2208 V and j \u2208 [n], and assume log |j \u2212 \u03c1\u03c0(u)| \u2265 B. Let \u2113 = \u2308log |j \u2212\n\u03c1\u03c0(u)|\u2309. We say that S is successful at u, j if |{x : (u, x) \u2208 Eu,\u2113} \u2229 {x : \u03c1\u03c0(x) \u2208 [\u03c1\u03c0(u), j]}| =\n\u2126(\u03b5^{-2} log^2 n) .\n\n5 Notation abuse: V here is a subset of the original input.\n\nSuccess of S at u, j means that sufficiently many samples x \u2208 V such that \u03c1\u03c0(x) is between \u03c1\u03c0(u)\nand j are represented in Eu,\u2113. Conditioned on S being successful at u, j, note that the denominator\nfrom the definition of TestMoveE does not vanish, and we can thereby define:\nDefinition 3.7. S is a good approximation at u, j if (defining \u2113 as in Definition 3.6):\n$$\\left| \\mathrm{TestMove}_{E_{u,\\ell}}(\\pi, V, W, u, j) - \\mathrm{TestMove}(\\pi, V, W, u, j) \\right| \\leq \\tfrac{1}{2} \\varepsilon |j - \\rho_\\pi(u)| / \\log n .$$ 
S is a good\napproximation if it is successful and a good approximation at all u \u2208 V , j \u2208 [n] satisfying\n\u2308log |j \u2212 \u03c1\u03c0(u)|\u2309 \u2208 [B, L].\nUsing Chernoff to ensure success and Hoeffding to ensure good approximation, union bounding:\nLemma 3.8. Except with probability O(n^{-4}), S is a good approximation.\n\nAlgorithm 1 SampleAndRank(V, W, \u03b5)\n1: n \u2190 |V |\n2: \u03c0 \u2190 Expected O(1)-approx solution to MFAST using O(n log n) W -queries on expectation using QuickSort [2]\n3: return SampleAndDecompose(V, W, \u03b5, n, \u03c0)\n\nAlgorithm 2 SampleAndDecompose(V, W, \u03b5, n, \u03c0)\n1: N \u2190 |V |\n2: if N \u2264 log n/ log log n then\n3:   return trivial partition {V }\n4: end if\n5: E \u2190 random subset of O(\u03b5^{-4} log n) elements from $\\binom{V}{2}$ (with repetitions)\n6: C \u2190 CE(\u03c0, V, W )   (C is an additive O(\u03b5^2 N^2) approximation of C(\u03c0, V, W ) w.p. \u2265 1 \u2212 n^{-4})\n7: if C = \u2126(\u03b5^2 N^2) then\n8:   return trivial partition {V }\n9: end if\n10: \u03c01 \u2190 ApproxLocalImprove(V, W, \u03c0, \u03b5, n)\n11: k \u2190 random integer in the range [N/3, 2N/3]\n12: VL \u2190 {v \u2208 V : \u03c1\u03c0(v) \u2264 k}, \u03c0L \u2190 restriction of \u03c01 to VL\n13: VR \u2190 V \\ VL, \u03c0R \u2190 restriction of \u03c01 to VR\n14: return concatenation of SampleAndDecompose(VL, W, \u03b5, n, \u03c0L), SampleAndDecompose(VR, W, \u03b5, n, \u03c0R)\n\nMutating the Pair Sample To Reflect a Single Element Move. Line 16 in ApproxLocalImprove\nrequires elaboration. In lines 15-18 we sought (using S) an element u and position j, such that\nmoving u to j (giving rise to \u03c0u\u2192j) would considerably improve the cost w.h.p. If such an element\nu existed, we executed the exchange \u03c0 \u2190 \u03c0u\u2192j. Unfortunately the sample ensemble S becomes\nstale: even if S was a good approximation, it is no longer necessarily so w.r.t. 
the new value of \u03c0.\nWe refresh it in line 16 by applying a transformation \u03d5u\u2192j on S, resulting in a new sample ensemble\n\u03d5u\u2192j(S) approximately distributed by D\u03c0u\u2192j . More precisely, \u03d5 (defined below) is such that\n$$\\varphi_{u \\to j}(D_\\pi) = D_{\\pi_{u \\to j}} , \\tag{3.1}$$\nwhere the left hand side denotes the distribution obtained by drawing from D\u03c0 and applying \u03d5u\u2192j\nto the result. We now define \u03d5u\u2192j. Denoting \u03d5u\u2192j(S) = S\u2032 = {E\u2032v,i : v \u2208 V, i \u2208 [B, L]}, we\nneed to define each E\u2032v,i.\nDefinition 3.9. Ev,i is interesting in the context of \u03c0 and \u03c0u\u2192j if the two sets T1, T2 defined as\nT1 = {x \u2208 V : |\u03c1\u03c0(x) \u2212 \u03c1\u03c0(v)| \u2264 2^i}, T2 = {x \u2208 V : |\u03c1\u03c0u\u2192j (x) \u2212 \u03c1\u03c0u\u2192j (v)| \u2264 2^i} differ.\nWe set E\u2032v,i = Ev,i for all v, i for which Ev,i is not interesting. Fix one interesting choice v, i. Let\nT1, T2 be as in Definition 3.9. It can be easily shown that each of T1 and T2 contains O(1) elements\nthat are not contained in the other, and it can be assumed (using a simple clipping argument - omitted)\nthat this number is exactly 1, hence |T1| = |T2|. Let X1 = T1 \\ T2, and X2 = T2 \\ T1. Fix any\ninjection \u03b1 : X1 \u2192 X2, and extend \u03b1 : T1 \u2192 T2 so that \u03b1(x) = x for all x \u2208 T1 \u2229 T2. Finally,\ndefine E\u2032v,i = {(v, \u03b1(x)) : (v, x) \u2208 Ev,i}. For v = u we create E\u2032v,i from scratch by repeating the\nloop in line 7 for that v. It is easy to see that (3.1) holds. By Lemma 3.8, the total variation distance\nbetween (D\u03c0| good approximation) and D\u03c0u\u2192j is O(n^{-4}). Using a simple chain rule argument:\nLemma 3.10. Fix \u03c0^0 on V of size N, and fix u1, . . . , uk \u2208 V and j1, . . . , jk \u2208 [n]. Draw\nS^0 from D\u03c0^0, and define $S^1 = \\varphi_{u_1 \\to j_1}(S^0), S^2 = \\varphi_{u_2 \\to j_2}(S^1), \\dots, S^k = \\varphi_{u_k \\to j_k}(S^{k-1})$ and\n$\\pi^1 = \\pi^0_{u_1 \\to j_1}, \\pi^2 = \\pi^1_{u_2 \\to j_2}, \\dots, \\pi^k = \\pi^{k-1}_{u_k \\to j_k}$. Consider the random variable S^k conditioned on\nS^0, S^1, . . . , S^{k-1} being good approximations for \u03c0^0, . . . , \u03c0^{k-1}, respectively. Then the total\nvariation distance between the distribution of S^k and the distribution (D\u03c0^k|\u03c0^k) (corresponding to the\nprocess of obtaining \u03c0^k and drawing from D\u03c0^k \u201cfrom scratch\u201d) is at most O(kn^{-4}).\n\nAlgorithm 3 ApproxLocalImprove(V, W, \u03c0, \u03b5, n) (Note: \u03c0 used as both input and output)\n1: N \u2190 |V |, B \u2190 \u2308log(\u0398(\u03b5N/ log n))\u2309, L \u2190 \u2308log N\u2309\n2: if N = O(\u03b5^{-3} log^3 n) then\n3:   return\n4: end if\n5: for v \u2208 V do\n6:   r \u2190 \u03c1\u03c0(v)\n7:   for i = B . . . L do\n8:     Ev,i \u2190 \u2205\n9:     for m = 1..\u0398(\u03b5^{-2} log^2 n) do\n10:       j \u2190 integer uniformly at random chosen from [max{1, r \u2212 2^i}, min{n, r + 2^i}]\n11:       Ev,i \u2190 Ev,i \u222a {(v, \u03c0(j))}\n12:     end for\n13:   end for\n14: end for\n15: while \u2203u \u2208 V and j \u2208 [n] s.t. (setting \u2113 := \u2308log |j \u2212 \u03c1\u03c0(u)|\u2309): \u2113 \u2208 [B, L] and TestMoveEu,\u2113 (\u03c0, V, W, u, j) > \u03b5|j \u2212 \u03c1\u03c0(u)|/ log n do\n16:   For v \u2208 V , i \u2208 [B, L] refresh Ev,i w.r.t. the move u \u2192 j using \u03d5u\u2192j (Section 3)\n17:   \u03c0 \u2190 \u03c0u\u2192j\n18: end while\n\nThe difference between S and S\u2032, defined as $\\mathrm{dist}(S, S') := \\left| \\bigcup_{v,i} E_{v,i} \\,\\Delta\\, E'_{v,i} \\right|$, bounds the query\ncomplexity of computing mutations. The proof of the following has been omitted from this abstract.\nLemma 3.11. Assume S \u223c D\u03c0 for some \u03c0, and S\u2032 = \u03d5u\u2192j(S). Then E[dist(S,S\u2032)] = O(\u03b5^{-3} log^3 n).\nAnalysis of SampleAndDecompose. Various high probability events must occur in order for the\nalgorithm guarantees to hold. Let E1 denote the event that the first \u0398(n^4) sample ensembles S1,S2, . . .\nused in ApproxLocalImprove, created either in lines 5 to 14 or via mutations, are good approximations. By\nLemmas 3.8 and 3.10, using a union bound, with constant probability (say, 0.99) this happens. Let E2\ndenote the event that the cost approximations obtained in line 5 of SampleAndDecompose are\nsuccessful at all recursive calls. By Hoeffding tail bounds, this happens with probability 1 \u2212 O(n^{-4})\nfor each call, there are O(n log n) calls, hence we can lower bound the probability of success of all\nexecutions by 0.99. Concluding, the following holds with probability at least 0.97:\nAssumption 3.12. Events E1 and E2 hold true.\nWe condition what follows on this assumption.6 Let \u03c0\u2217 denote the optimal permutation for the\nroot call to SampleAndDecompose with V, W, \u03b5. The permutation \u03c0 is, by Assumption 3.5, a\nconstant factor approximation for \u03c0\u2217. By the triangle inequality, d\u03c4 (\u03c0, \u03c0\u2217) \u2264 C(\u03c0, V, W ) +\nC(\u03c0\u2217, V, W ), hence, E[d\u03c4 (\u03c0, \u03c0\u2217)] = O(C(\u03c0\u2217, V, W )) . From this, using (2.3), E[dfoot(\u03c0, \u03c0\u2217)] =\nO(C(\u03c0\u2217, V, W )). Now consider the recursion tree T of SampleAndDecompose. 
Denote by I\nthe set of internal nodes, and by L the set of leaves (i.e.\nexecutions exiting from line 8).\nFor a call SampleAndDecompose corresponding to a node X, denote the input arguments by\n(V_X, W, \u03b5, n, \u03c0_X). Let L[X], R[X] denote the left and right children of X respectively. Let k_X\ndenote the integer k in line 11 in the context of X \u2208 I. Hence, by our definitions, V_{L[X]}, V_{R[X]}, \u03c0_{L[X]}\nand \u03c0_{R[X]} are precisely V_L, V_R, \u03c0_L, \u03c0_R from lines 12-13 in the context of node X. Take, as\nin line 1, N_X = |V_X|. Let \u03c0^\u2217_X denote the optimal MFAST solution for instance (V_X, W|_{V_X}).\nBy E_1 we conclude that the cost of (\u03c0_X)_{u\u2192j} is always an actual improvement compared to \u03c0_X\n(for the current value of \u03c0_X, u and j in the iteration), and the improvement in cost is of magnitude\nat least \u03a9(\u03b5|\u03c1_{\u03c0_X}(u) \u2212 j|/ log n), which is \u03a9(\u03b5^2 N_X / log^2 n) due to the use of B defined in\nline 1.^7 But then the number of iterations of the while loop in line 15 of ApproxLocalImprove\nis O(\u03b5^{\u22122} C(\u03c0_X, V_X, W|_{V_X}) log^2 n / N_X) (otherwise the true cost of the running solution would\ngo below 0). Since C(\u03c0_X, V_X, W|_{V_X}) \u2264 (N_X choose 2), the number of iterations is hence at most\nO(\u03b5^{\u22122} N_X log^2 n). By Lemma 3.11 the expected query complexity incurred by the call to\nApproxLocalImprove is therefore O(\u03b5^{\u22125} N_X log^5 n). 
Summing over the recursion tree, the\ntotal query complexity incurred by calls to ApproxLocalImprove is, in expectation, at most\nO(\u03b5^{\u22125} n log^6 n). Now consider the moment at which the while loop of ApproxLocalImprove\nterminates. Let \u03c0_{1X} denote the permutation obtained at that point, returned to SampleAndDecompose\nin line 10. We classify the elements u \u2208 V_X into two families: V^{short}_X denotes all u \u2208 V_X s.t.\n|\u03c1_{\u03c0_{1X}}(u) \u2212 \u03c1_{\u03c0^\u2217_X}(u)| = O(\u03b5N_X / log n), and V^{long}_X denotes V_X \\ V^{short}_X. We know by assumption\nthat the last sample ensemble S used in ApproxLocalImprove was a good approximation, hence\nfor all u \u2208 V^{long}_X: (*) TestMove(\u03c0_{1X}, V_X, W|_{V_X}, u, \u03c1_{\u03c0^\u2217_X}(u)) = O(\u03b5|\u03c1_{\u03c0_{1X}}(u) \u2212 \u03c1_{\u03c0^\u2217_X}(u)|/ log n).\nFollowing [23], we say for u \u2208 V_X that u crosses k_X if [\u03c1_{\u03c0_{1X}}(u), \u03c1_{\u03c0^\u2217_X}(u)] contains k_X. Let\nV^{cross}_X denote the (random) set of elements u \u2208 V_X that cross k_X. We define a key quantity\nas in [23]: T_X := \u2211_{u\u2208V^{cross}_X} TestMove(\u03c0_{1X}, V_X, W|_{V_X}, u, \u03c1_{\u03c0^\u2217_X}(u)). Following (*), the\nelements u \u2208 V^{long}_X can contribute at most O(\u03b5 \u2211_{u\u2208V^{long}_X} |\u03c1_{\u03c0_{1X}}(u) \u2212 \u03c1_{\u03c0^\u2217_X}(u)| / log n) to T_X.\nThis latter bound is, by definition, O(\u03b5 d_{foot}(\u03c0_{1X}, \u03c0^\u2217_X)/ log n), which is, using (2.3), at most\nO(\u03b5 d_\u03c4(\u03c0_{1X}, \u03c0^\u2217_X)/ log n). By the triangle inequality and the definition of \u03c0^\u2217_X, the last expression\nis O(\u03b5C(\u03c0_{1X}, V_X, W|_{V_X})/ log n). How much can elements in V^{short}_X contribute to T_X?\nThe probability of each such element to cross k_X is O(|\u03c1_{\u03c0_{1X}}(u) \u2212 \u03c1_{\u03c0^\u2217_X}(u)|/N_X). Hence, the\ntotal expected contribution is O(\u2211_{u\u2208V^{short}_X} |\u03c1_{\u03c0_{1X}}(u) \u2212 \u03c1_{\u03c0^\u2217_X}(u)|^2 / N_X). Under the constraints\n\u2211_{u\u2208V^{short}_X} |\u03c1_{\u03c0_{1X}}(u) \u2212 \u03c1_{\u03c0^\u2217_X}(u)| \u2264 d_{foot}(\u03c0_{1X}, \u03c0^\u2217_X) and |\u03c1_{\u03c0_{1X}}(u) \u2212 \u03c1_{\u03c0^\u2217_X}(u)| = O(\u03b5N_X / log n),\nthis is O(d_{foot}(\u03c0_{1X}, \u03c0^\u2217_X)\u03b5N_X /(N_X log n)) = O(d_{foot}(\u03c0_{1X}, \u03c0^\u2217_X)\u03b5/ log n). Again using (2.3) and\nthe triangle inequality, the bound becomes O(\u03b5C(\u03c0_{1X}, V_X, W|_{V_X})/ log n). Combining for V^{long}\nand V^{short}, we conclude: (**) E_{k_X}[T_X] = O(\u03b5C(\u03c0^\u2217_X, V_X, W|_{V_X})/ log n) (the expectation is over\nthe choice of k_X). The bound (**) is the main improvement over [23], and should be compared with\nLemma 3.2 there, stating (in our notation) T_X = O(\u03b5C^\u2217 N_X /(4n log n)). The latter bound is more\nrestrictive than ours in certain cases, and obtaining it relies on a procedure that cannot be performed\nwithout having access to W in its entirety. (**), however, can be achieved using efficient querying of\nW, as we have shown. The remainder of the arguments leading to the proof of Theorem 3.1 closely\nfollows those in Section 4 of [23]. 
The details have been omitted from this abstract.\n\n^6 This may bias some expectation upper bounds derived earlier and in what follows. This bias can multiply\nthe estimates by at most 1/0.97, which can be absorbed in our O-notations.\n\n4 Future Work\nWe presented a statistical learning theoretical active learning result for pairwise ranking. The main\nvehicle was a query (and time) efficient decomposition procedure, reducing the problem to smaller\nones in which the optimal loss is high and uniform sampling suffices. The main drawback of our\nresult is the inability to use it in order to search in a limited subspace of permutations. A typical\nexample of such a subspace is the case in which each element v \u2208 V has a corresponding feature\nvector in a real vector space, and we only seek permutations induced by linear score functions. 
In\nfollowup work, Ailon, Begleiter and Ezra [1] show a novel technique achieving a slightly better\nquery complexity than here with a simpler proof, while also admitting search in restricted spaces.\nAcknowledgements The author gratefully acknowledges the help of Warren Schudy with derivation\nof some of the bounds in this work. Special thanks to Ron Begleiter for helpful comments. The author\napologizes for omitting references to much relevant work that could not fit in this version\u2019s bibliography.\n\n^7 This also bounds the number of times a sample ensemble is created by O(n^4), as required by E_1.\n\nReferences\n[1] Nir Ailon, Ron Begleiter, and Esther Ezra, A new active learning scheme with applications to learning to rank from pairwise preferences, arxiv.org/abs/1110.2136 (2011).\n[2] Nir Ailon, Moses Charikar, and Alantha Newman, Aggregating inconsistent information: Ranking and clustering, J. ACM 55 (2008), no. 5.\n[3] Nir Ailon and Mehryar Mohri, Preference based learning to rank, vol. 80, 2010, pp. 189\u2013212.\n[4] Nir Ailon and Kira Radinsky, Ranking from pairs and triplets: Information quality, evaluation methods and query complexity, WSDM, 2011.\n[5] Noga Alon, Ranking tournaments, SIAM J. Discret. Math. 20 (2006), no. 1, 137\u2013142.\n[6] M. F. Balcan, N. Bansal, A. Beygelzimer, D. Coppersmith, J. Langford, and G. B. Sorkin, Robust reductions from ranking to classification, Machine Learning 72 (2008), no. 1-2, 139\u2013153.\n[7] Maria-Florina Balcan, Alina Beygelzimer, and John Langford, Agnostic active learning, J. Comput. Syst. Sci. 75 (2009), no. 1, 78\u201389.\n[8] Maria-Florina Balcan, Steve Hanneke, and Jennifer Vaughan, The true sample complexity of active learning, Machine Learning 80 (2010), 111\u2013139.\n[9] A. Beygelzimer, J. Langford, and P. Ravikumar, Error-correcting tournaments, ALT, 2009, pp. 247\u2013262.\n[10] M. Braverman and E. 
Mossel, Noisy sorting without resampling, SODA: Proceedings of the 19th annual ACM-SIAM symposium on Discrete algorithms, 2008, pp. 268\u2013276.\n[11] B. Carterette, P. N. Bennett, D. Maxwell Chickering, and S. T. Dumais, Here or there: Preference judgments for relevance, ECIR, 2008.\n[12] William W. Cohen, Robert E. Schapire, and Yoram Singer, Learning to order things, NIPS \u201997, 1998, pp. 451\u2013457.\n[13] D. Cohn, L. Atlas, and R. Ladner, Improving generalization with active learning, Machine Learning 15 (1994), no. 2, 201\u2013221.\n[14] A. Culotta and A. McCallum, Reducing labeling effort for structured prediction tasks, AAAI: Proceedings of the 20th national conference on Artificial intelligence, 2005, pp. 746\u2013751.\n[15] S. Dasgupta, Coarse sample complexity bounds for active learning, Advances in Neural Information Processing Systems 18, 2005, pp. 235\u2013242.\n[16] S. Dasgupta, A. Tauman Kalai, and C. Monteleoni, Analysis of perceptron-based active learning, Journal of Machine Learning Research 10 (2009), 281\u2013299.\n[17] Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni, A general agnostic active learning algorithm, NIPS, 2007.\n[18] Persi Diaconis and R. L. Graham, Spearman\u2019s footrule as a measure of disarray, Journal of the Royal Statistical Society. Series B (Methodological) 39 (1977), no. 2, pp. 262\u2013268.\n[19] U. Feige, D. Peleg, P. Raghavan, and E. Upfal, Computing with unreliable information, STOC: Proceedings of the 22nd annual ACM symposium on Theory of computing, 1990, pp. 128\u2013137.\n[20] Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby, Selective sampling using the query by committee algorithm, Mach. Learn. 28 (1997), no. 2-3, 133\u2013168.\n[21] Steve Hanneke, A bound on the label complexity of agnostic active learning, ICML, 2007, pp. 
353\u2013360.\n[22] Eyke H\u00fcllermeier, Johannes F\u00fcrnkranz, Weiwei Cheng, and Klaus Brinker, Label ranking by learning pairwise preferences, Artif. Intell. 172 (2008), no. 16-17, 1897\u20131916.\n[23] Claire Kenyon-Mathieu and Warren Schudy, How to rank with few errors, STOC, 2007, pp. 95\u2013103.\n[24] Dan Roth and Kevin Small, Margin-based active learning for structured output spaces, 2006.\n[25] V. N. Vapnik and A. Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory of Prob. and its Applications 16 (1971), no. 2, 264\u2013280.\n[26] F. Xia, T-Y Liu, J. Wang, W. Zhang, and H. Li, Listwise approach to learning to rank: theory and algorithm, ICML \u201908, 2008, pp. 1192\u20131199.\n", "award": [], "sourceid": 543, "authors": [{"given_name": "Nir", "family_name": "Ailon", "institution": null}]}