{"title": "The Pareto Regret Frontier for Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 208, "page_last": 216, "abstract": "Given a multi-armed bandit problem it may be desirable to achieve a smaller-than-usual worst-case regret for some special actions. I show that the price for such unbalanced worst-case regret guarantees is rather high. Specifically, if an algorithm enjoys a worst-case regret of B with respect to some action, then there must exist another action for which the worst-case regret is at least \u03a9(nK/B), where n is the horizon and K the number of actions. I also give upper bounds in both the stochastic and adversarial settings showing that this result cannot be improved. For the stochastic case the pareto regret frontier is characterised exactly up to constant factors.", "full_text": "The Pareto Regret Frontier for Bandits\n\nTor Lattimore\n\ntor.lattimore@gmail.com\n\nDepartment of Computing Science\n\nUniversity of Alberta, Canada\n\nAbstract\n\nGiven a multi-armed bandit problem it may be desirable to achieve a smaller-\nthan-usual worst-case regret for some special actions. I show that the price for\nsuch unbalanced worst-case regret guarantees is rather high. Speci\ufb01cally, if an\nalgorithm enjoys a worst-case regret of B with respect to some action, then there\nmust exist another action for which the worst-case regret is at least \u2126(nK/B),\nwhere n is the horizon and K the number of actions. I also give upper bounds\nin both the stochastic and adversarial settings showing that this result cannot be\nimproved. For the stochastic case the pareto regret frontier is characterised exactly\nup to constant factors.\n\n1\n\nIntroduction\n\nThe multi-armed bandit is the simplest class of problems that exhibit the exploration/exploitation\ndilemma. In each time step the learner chooses one of K actions and receives a noisy reward signal\nfor the chosen action. 
A learner\u2019s performance is measured in terms of the regret, which is the\n(expected) difference between the rewards it actually received and those it would have received (in\nexpectation) by choosing the optimal action.\nPrior work on the regret criterion for finite-armed bandits has treated all actions uniformly and has\naimed for bounds on the regret that do not depend on which action turned out to be optimal. I\ntake a different approach and ask what can be achieved if some actions are given special treatment.\nFocussing on worst-case bounds, I ask whether or not it is possible to achieve improved worst-case\nregret for some actions, and what is the cost in terms of the regret for the remaining actions. Such\nresults may be useful in a variety of cases. For example, a company that is exploring some new\nstrategies might expect an especially small regret if its existing strategy turns out to be (nearly)\noptimal.\nThis problem has previously been considered in the experts setting, where the learner is allowed\nto observe the reward for all actions in every round, not only for the action actually chosen. The\nearliest work seems to be by Hutter and Poland [2005], where it is shown that the learner can assign\na prior weight to each action and pays a worst-case regret of O(\u221a(\u2212n log \u03c1_i)) for expert i, where \u03c1_i\nis the prior belief in expert i and n is the horizon. The uniform regret is obtained by choosing \u03c1_i =\n1/K, which leads to the well-known O(\u221a(n log K)) bound achieved by the exponential weighting\nalgorithm [Cesa-Bianchi, 2006]. The consequence of this is that an algorithm can enjoy a constant\nregret with respect to a single action while suffering minimally on the remainder. 
The problem was\nstudied in more detail by Koolen [2013], where (remarkably) the author was able to exactly describe\nthe pareto regret frontier when K = 2.\nOther related work (also in the experts setting) considers the objective of obtaining an improved regret\nagainst a mixture of the available experts/actions [Even-Dar et al., 2008, Kapralov and Panigrahy, 2011].\nIn a similar vein, Sani et al. [2014] showed that algorithms for prediction with expert advice can be\ncombined with minimal cost to obtain the best of both worlds. In the bandit setting I am only aware\nof the work by Liu and Li [2015], who study the effect of the prior on the regret of Thompson\nsampling in a special case. In contrast, the lower bound given here applies to all algorithms in a\nrelatively standard setting.\nThe main contribution of this work is a characterisation of the pareto regret frontier (the set of\nachievable worst-case regret bounds) for stochastic bandits.\nLet \u03bc_i \u2208 R be the unknown mean of the ith arm and assume that sup_{i,j} \u03bc_i \u2212 \u03bc_j \u2264 1. In each time\nstep the learner chooses an action I_t \u2208 {1, . . . , K} and receives reward g_{I_t,t} = \u03bc_{I_t} + \u03b7_t, where \u03b7_t\nis a noise term that I assume to be sampled independently from a 1-subgaussian distribution that\nmay depend on I_t. This model subsumes both Gaussian and Bernoulli (or bounded) rewards. Let\n\u03c0 be a bandit strategy, which is a function from histories of observations to an action I_t. Then the\nn-step expected pseudo-regret with respect to the ith arm is\n\nR^\u03c0_{\u03bc,i} = n\u03bc_i \u2212 E \u2211_{t=1}^n \u03bc_{I_t} ,\n\nwhere the expectation is taken with respect to the randomness in the noise and the actions of the\npolicy. Throughout this work n will be fixed, so it is omitted from the notation. 
The worst-case\nexpected pseudo-regret with respect to arm i is\n\nR^\u03c0_i = sup_\u03bc R^\u03c0_{\u03bc,i} .   (1)\n\nThis means that R^\u03c0 \u2208 R^K is a vector of worst-case pseudo-regrets with respect to each of the arms.\nLet B \u2282 R^K be the set defined by\n\nB = { B \u2208 [0, n]^K : B_i \u2265 min{ n, \u2211_{j\u2260i} n/B_j } for all i } .   (2)\n\nThe boundary of B is denoted by \u03b4B. The following theorem shows that \u03b4B describes the pareto\nregret frontier up to constant factors.\n\nTheorem. There exist universal constants c_1 = 8 and c_2 = 252 such that:\nLower bound: for \u03b7_t \u223c N(0, 1) and all strategies \u03c0 we have c_1(R^\u03c0 + K) \u2208 B.\nUpper bound: for all B \u2208 B there exists a strategy \u03c0 such that R^\u03c0_i \u2264 c_2 B_i for all i.\n\nObserve that the lower bound relies on the assumption that the noise term be Gaussian while the\nupper bound holds for subgaussian noise. The lower bound may be generalised to other noise models\nsuch as Bernoulli, but does not hold for all subgaussian noise models. For example, it does not hold\nif there is no noise (\u03b7_t = 0 almost surely).\nThe lower bound also applies to the adversarial framework where the rewards may be chosen\narbitrarily. Although I was not able to derive a matching upper bound in this case, a simple modification\nof the Exp-\u03b3 algorithm [Bubeck and Cesa-Bianchi, 2012] leads to an algorithm with\n\nR^\u03c0_1 \u2264 B_1 and R^\u03c0_k \u2272 (nK/B_1) log(nK/B_1\u00b2) for all k \u2265 2 ,\n\nwhere the regret is the adversarial version of the expected regret. Details are in the supplementary\nmaterial.\nThe new results seem elegant, but disappointing. In the experts setting we have seen that the learner\ncan distribute a prior amongst the actions and obtain a bound on the regret depending in a natural\nway on the prior weight of the optimal action. In contrast, in the bandit setting the learner pays\nan enormously higher price to obtain a small regret with respect to even a single arm. In fact,\nthe learner must essentially choose a single arm to favour, after which the regret for the remaining\narms has very limited flexibility. Unlike in the experts setting, if even a single arm enjoys constant\nworst-case regret, then the worst-case regret with respect to all other arms is necessarily linear.\n\n2 Preliminaries\n\nI use the same notation as Bubeck and Cesa-Bianchi [2012]. Define T_i(t) to be the number of times\naction i has been chosen after time step t and \u03bc\u0302_{i,s} to be the empirical estimate of \u03bc_i from the first s\ntimes action i was sampled. This means that \u03bc\u0302_{i,T_i(t\u22121)} is the empirical estimate of \u03bc_i at the start of\nthe tth round. I use the convention that \u03bc\u0302_{i,0} = 0. Since the noise model is 1-subgaussian we have\n\n\u2200\u03b5 > 0   P{\u2203s \u2264 t : \u03bc\u0302_{i,s} \u2212 \u03bc_i \u2265 \u03b5/s} \u2264 exp(\u2212\u03b5\u00b2/(2t)) .   (3)\n\nThis result is presumably well known, but a proof is included in the supplementary material for\nconvenience. The optimal arm is i* = arg max_i \u03bc_i with ties broken in some arbitrary way. The\noptimal reward is \u03bc* = max_i \u03bc_i. The gap between the mean rewards of the jth arm and the optimal\narm is \u0394_j = \u03bc* \u2212 \u03bc_j, and \u0394_{ji} = \u03bc_i \u2212 \u03bc_j. The vector of worst-case regrets is R^\u03c0 \u2208 R^K and has\nbeen defined already in Eq. (1). I write R^\u03c0 \u2264 B \u2208 R^K if R^\u03c0_i \u2264 B_i for all i \u2208 {1, . . . , K}. 
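As a quick sanity check, the membership condition in Eq. (2) is easy to evaluate numerically. The sketch below is illustrative only (the helper name `in_frontier` is mine, not from the paper): it confirms that a uniform choice slightly above \u221a(n(K \u2212 1)) satisfies the condition, while a uniformly tiny vector of worst-case regrets does not.

```python
import math

def in_frontier(B, n):
    # Membership test for the set in Eq. (2):
    # B_i >= min(n, sum_{j != i} n / B_j) must hold for every arm i.
    K = len(B)
    return all(
        B[i] >= min(n, sum(n / B[j] for j in range(K) if j != i))
        for i in range(K)
    )

n, K = 5000, 10
balanced = [1.01 * math.sqrt(n * (K - 1))] * K  # slightly above sqrt(n(K-1))
too_small = [10.0] * K                          # uniformly tiny worst-case regrets
print(in_frontier(balanced, n))                 # True
print(in_frontier(too_small, n))                # False
```

Shrinking one entry forces the condition to fail unless other entries grow, which is exactly the trade-off quantified by the main theorem.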
For a\nvector R^\u03c0 and x \u2208 R we have (R^\u03c0 + x)_i = R^\u03c0_i + x.\n\n3 Understanding the Frontier\n\nBefore proving the main theorem I briefly describe the features of the regret frontier. First notice\nthat if B_i = \u221a(n(K \u2212 1)) for all i, then\n\nB_i = \u221a(n(K \u2212 1)) = \u2211_{j\u2260i} \u221a(n/(K \u2212 1)) = \u2211_{j\u2260i} n/B_j .\n\nThus B \u2208 B as expected. This particular B is witnessed up to constant factors by MOSS [Audibert\nand Bubeck, 2009] and OC-UCB [Lattimore, 2015], but not by UCB [Auer et al., 2002], which suffers\nR^ucb_i \u2208 \u03a9(\u221a(nK log n)).\nOf course the uniform choice of B is not the only option. Suppose the first arm is special, so that B_1\nshould be chosen especially small. Assume without loss of generality that B_1 \u2264 B_2 \u2264 . . . \u2264 B_K \u2264\nn. Then by the main theorem we have\n\nB_1 \u2265 \u2211_{i=2}^K n/B_i \u2265 \u2211_{i=2}^k n/B_i \u2265 (k \u2212 1)n/B_k .\n\nTherefore\n\nB_k \u2265 (k \u2212 1)n/B_1 .   (4)\n\nThis also proves the claim in the abstract, since it implies that B_K \u2265 (K \u2212 1)n/B_1. If B_1 is fixed,\nthen choosing B_k = (k \u2212 1)n/B_1 does not lie on the frontier because\n\n\u2211_{k=2}^K n/B_k = \u2211_{k=2}^K B_1/(k \u2212 1) \u2208 \u03a9(B_1 log K) .\n\nHowever, if H = \u2211_{k=2}^K 1/(k \u2212 1) \u2208 \u0398(log K), then choosing B_k = (k \u2212 1)nH/B_1 does lie on\nthe frontier and is a factor of log K away from the lower bound given in Eq. (4). Therefore, up to\na log K factor, points on the regret frontier are characterised entirely by a permutation determining\nthe order of the worst-case regrets and by the smallest worst-case regret.\nPerhaps the most natural choice of B (assuming again that B_1 \u2264 . . . \u2264 B_K) is\n\nB_1 = n^p and B_k = (k \u2212 1)n^{1\u2212p}H for k > 1 .\n\nFor p = 1/2 this leads to a bound that is at most \u221aK log K worse than that obtained by MOSS and\nOC-UCB while being a factor of \u221aK better for a select few.\n\nAssumptions\n\nThe assumption that \u0394_i \u2208 [0, 1] is used to avoid annoying boundary problems caused by the fact that\ntime is discrete. Without it, if some \u0394_i were extremely large, then even a single sample from that arm could\ncause a big regret. This assumption is already quite common; for example, a worst-case regret\nof \u03a9(\u221aKn) clearly does not hold if the gaps are permitted to be unbounded. Unfortunately there is\nno perfect resolution to this annoyance. Most elegant would be to allow time to be continuous with\nactions taken up to stopping times. Otherwise one has to deal with the discretisation/boundary\nproblem with special cases, or make assumptions as I have done here.\n\n4 Lower Bounds\n\nTheorem 1. Assume \u03b7_t \u223c N(0, 1) is sampled from a standard Gaussian. Let \u03c0 be an arbitrary\nstrategy. Then 8(R^\u03c0 + K) \u2208 B.\nProof. Assume without loss of generality that R^\u03c0_1 = min_i R^\u03c0_i (if this is not the case, then simply\nre-order the actions). If R^\u03c0_1 > n/8, then the result is trivial. From now on assume R^\u03c0_1 \u2264 n/8. Let\nc = 4 and define\n\n\u03b5_k = min{ cR^\u03c0_k/n , 1/2 } .\n\nDefine K vectors \u03bc_1, . . . , \u03bc_K \u2208 R^K by\n\n(\u03bc_k)_j = 0 if j = 1 ;  \u03b5_k if j = k \u2260 1 ;  \u2212\u03b5_j otherwise .\n\nTherefore the optimal action for the bandit with means \u03bc_k is k. Let A = {k : R^\u03c0_k \u2264 n/8} and\nA\u2032 = {k : k \u2209 A} and assume k \u2208 A. 
Then\n\nR^\u03c0_k \u2265(a) R^\u03c0_{\u03bc_k,k} \u2265(b) \u03b5_k E^\u03c0_{\u03bc_k} \u2211_{j\u2260k} T_j(n) =(c) \u03b5_k (n \u2212 E^\u03c0_{\u03bc_k} T_k(n)) ,\n\nwhere (a) follows since R^\u03c0_k is the worst-case regret with respect to arm k, (b) since the gap between\nthe means of the kth arm and any other arm is at least \u03b5_k (note that this is also true for k = 1\nsince \u03b5_1 = min_k \u03b5_k), and (c) from the fact that \u2211_i T_i(n) = n. Rearranging and using the definition\nof \u03b5_k gives (d):\n\nE^\u03c0_{\u03bc_k} T_k(n) \u2265 n(1 \u2212 1/c) .   (5)\n\nTherefore for k \u2260 1 with k \u2208 A we have\n\nn(1 \u2212 1/c) \u2264 E^\u03c0_{\u03bc_k} T_k(n) \u2264(a) E^\u03c0_{\u03bc_1} T_k(n) + n\u03b5_k \u221a(E^\u03c0_{\u03bc_1} T_k(n)) \u2264(b) n \u2212 E^\u03c0_{\u03bc_1} T_1(n) + n\u03b5_k \u221a(E^\u03c0_{\u03bc_1} T_k(n)) \u2264(c) n/c + n\u03b5_k \u221a(E^\u03c0_{\u03bc_1} T_k(n)) ,\n\nwhere (a) follows from standard entropy inequalities and a similar argument as used by Auer et al.\n[1995] (details in the supplementary material), (b) since k \u2260 1 and E^\u03c0_{\u03bc_1} T_1(n) + E^\u03c0_{\u03bc_1} T_k(n) \u2264 n, and (c)\nby Eq. (5). Therefore\n\nE^\u03c0_{\u03bc_1} T_k(n) \u2265 (1 \u2212 2/c)/\u03b5_k\u00b2 ,\n\nwhich implies that\n\nR^\u03c0_1 \u2265 R^\u03c0_{\u03bc_1,1} = \u2211_{k=2}^K \u03b5_k E^\u03c0_{\u03bc_1} T_k(n) \u2265 \u2211_{k\u2208A\u2212{1}} (1 \u2212 2/c)/\u03b5_k = (1/8) \u2211_{k\u2208A\u2212{1}} n/R^\u03c0_k .\n\nTherefore for all i \u2208 A we have\n\n8R^\u03c0_i = 8R^\u03c0_1 \u00b7 (R^\u03c0_i/R^\u03c0_1) \u2265 \u2211_{k\u2208A\u2212{1}} (n/R^\u03c0_k) \u00b7 (R^\u03c0_i/R^\u03c0_1) \u2265 \u2211_{k\u2208A\u2212{i}} n/R^\u03c0_k .\n\nTherefore\n\n8R^\u03c0_i + 8K \u2265 \u2211_{k\u2208A\u2212{i}} n/R^\u03c0_k + \u2211_{k\u2208A\u2032\u2212{i}} n/R^\u03c0_k \u2265 \u2211_{k\u2260i} n/R^\u03c0_k ,\n\nwhere the middle step uses that n/R^\u03c0_k < 8 for each of the at most K arms k \u2208 A\u2032. This implies\nthat 8(R^\u03c0 + K) \u2208 B as required.\n\n5 Upper Bounds\n\nI now show that the lower bound derived in the previous section is tight up to constant factors. The\nalgorithm is a generalisation of MOSS [Audibert and Bubeck, 2009] with two modifications. First, the\nwidths of the confidence bounds are biased in a non-uniform way, and second, the upper confidence\nbounds are shifted. The new algorithm is functionally identical to MOSS in the special case that B_i\nis uniform. Define log+(x) = max{0, log(x)}.\n\n1: Input: n and B_1, . . . , B_K\n2: n_i = n\u00b2/B_i\u00b2 for all i\n3: for t \u2208 1, . . . , n do\n4:   I_t = arg max_i \u03bc\u0302_{i,T_i(t\u22121)} + \u221a( (4/T_i(t\u22121)) log+( n_i/T_i(t\u22121) ) ) \u2212 \u221a(1/n_i)\n5: end for\n\nAlgorithm 1: Unbalanced MOSS\n\nTheorem 2. Let B \u2208 B. Then the strategy \u03c0 given in Algorithm 1 satisfies R^\u03c0 \u2264 252B.\nCorollary 3. For all \u03bc the following hold:\n\n1. R^\u03c0_{\u03bc,i*} \u2264 252B_{i*} .\n2. R^\u03c0_{\u03bc,i*} \u2264 min_i (n\u0394_i + 252B_i) .\n\nThe second part of the corollary is useful when B_{i*} is large, but there exists an arm for which n\u0394_i\nand B_i are both small. The proof of Theorem 2 requires a few lemmas. The first is a somewhat\nstandard concentration inequality that follows from a combination of the peeling argument and Doob\u2019s\nmaximal inequality. The proof may be found in the supplementary material.\n\nLemma 4. Let Z_i = max_{1\u2264s\u2264n} ( \u03bc_i \u2212 \u03bc\u0302_{i,s} \u2212 \u221a( (4/s) log+(n_i/s) ) ). Then P{Z_i \u2265 \u0394} \u2264 20/(n_i \u0394\u00b2) for all \u0394 > 0.\n\nIn the analysis of traditional bandit algorithms the gap \u0394_{ji} measures how quickly the algorithm can\ndetect the difference between arms i and j. By design, however, Algorithm 1 negatively biases\nits estimate of the empirical mean of arm i by \u221a(1/n_i). This has the effect of shifting the gaps, which\nI denote by \u0394\u0304_{ji} and define to be\n\n\u0394\u0304_{ji} = \u0394_{ji} + \u221a(1/n_j) \u2212 \u221a(1/n_i) = \u03bc_i \u2212 \u03bc_j + \u221a(1/n_j) \u2212 \u221a(1/n_i) .\n\nLemma 5. Define the stopping time \u03c4_{ji} by\n\n\u03c4_{ji} = min{ s : \u03bc\u0302_{j,s} + \u221a( (4/s) log+(n_j/s) ) \u2264 \u03bc_j + \u0394\u0304_{ji}/2 } .\n\nIf Z_i < \u0394\u0304_{ji}/2, then T_j(n) \u2264 \u03c4_{ji}.\n\n
Then\n\n(cid:115)\n\n\u02c6\u00b5j,Tj (t\u22121)+\n\n(cid:18)\n\n(cid:19)\n\n(cid:112)\n\n4\n\nnj\n\nTj(t \u2212 1)\n\nlog+\n\n(cid:112)\nTj(t \u2212 1)\n= \u00b5j + \u00af\u2206ji \u2212 \u00af\u2206ji/2 \u2212\n1/ni \u2212 \u00af\u2206ji/2\n= \u00b5i \u2212\n4\n\n(cid:112)\n\n(cid:115)\n\n< \u02c6\u00b5i,Ti(t\u22121) +\n\nTi(t \u2212 1)\n\n1/nj\n\n(cid:18)\n\nlog+\n\n\u2212\n\n1/nj \u2264 \u00b5j + \u00af\u2206ji/2 \u2212\n\n(cid:19)\n\n(cid:112)\n\n\u2212\n\n1/ni ,\n\nni\n\nTi(t \u2212 1)\n\n(cid:112)\n\n1/nj\n\nwhich implies that arm j will not be chosen at time step t and so also not for any subsequent time\n(cid:32)\nsteps by the same argument and induction. Therefore Tj(n) \u2264 \u03c4ji.\nLemma 6. If \u00af\u2206ji > 0, then E\u03c4ji \u2264\n\nnj \u00af\u22062\nji\n64\n\nProductLog\n\n(cid:33)\n\n+\n\n.\n\n(cid:115)\n\n4\ns0\n\n=\u21d2\n\n40\n\u00af\u22062\nji\n\n(cid:32)\n\n64\n\u00af\u22062\nji\n\n(cid:33)(cid:39)\n(cid:40)\n\nnj \u00af\u22062\nji\n64\n\nn\u22121(cid:88)\n\ns=1\n\nP\n\nProof. Let s0 be de\ufb01ned by\n\n(cid:38)\n\ns0 =\n\nTherefore\n\nE\u03c4ji =\n\nn(cid:88)\n\ns=1\n\n64\n\u00af\u22062\nji\n\nProductLog\n\nP{\u03c4ji \u2265 s} \u2264 1 +\n(cid:26)\n\nn\u22121(cid:88)\n\nP\n\n\u02c6\u00b5i,s \u2212 \u00b5i,s \u2265\n\n\u00af\u2206ji\n2 \u2212\n\n(cid:27)\n\n\u00af\u2206ji\n4\n\n(cid:32)\n\nnj \u00af\u22062\nji\n64\n\n\u2264 1 + s0 +\n\ns=s0+1\n\n\u02c6\u00b5i,s \u2212 \u00b5i,s \u2265\n\n\u2264 1 + s0 +\n\n\u2264 1 + s0 +\n\n32\nji \u2264\n\u00af\u22062\n\n40\n\u00af\u22062\nji\n\n+\n\n64\n\u00af\u22062\nji\n\nProductLog\n\n.\n\n\u2264\n\n\u00af\u2206ji\n4\n\n(cid:19)\n(cid:17)(cid:41)\n(cid:16) nj\n(cid:32)\n\ns\n\nexp\n\n\u2212\n\ns \u00af\u22062\nji\n32\n\n(cid:33)\n\n(cid:18) nj\n\ns0\n\nlog+\n\n(cid:114)\n\nlog+\n\n4\ns\n\n\u221e(cid:88)\n\n(cid:33)\n\ns=s0+1\n\n,\n\nwhere the last inequality follows since \u00af\u2206ji \u2264 2.\nProof of Theorem 2. Let \u2206 = 2/\u221ani and A = {j : \u2206ji > \u2206}. 
Then for j \u2208 A we have \u2206ji \u2264\n2 \u00af\u2206ji and \u00af\u2206ji \u2265\n\u00b5,i = E\nR\u03c0\n\n(cid:112)\n\uf8ee\uf8f0 K(cid:88)\n1/ni + \u221a1/nj. Letting \u2206(cid:48) =\n\n1/ni we have\n\n\u2206jiTj(n)\n\n(cid:112)\n\n\uf8f9\uf8fb\n\nj=1\n\nj\u2208A\n\n\u2264 n\u2206 + E\n\n\uf8ee\uf8f0(cid:88)\n\uf8ee\uf8f0(cid:88)\n\u2264 2Bi + E\n(cid:32)\n(cid:88)\n(cid:88)\n\n\u2264 2Bi +\n\nj\u2208A\n\nj\u2208A\n\n(a)\n\n(c)\n\n(b)\n\n\u2264 2Bi +\n\nj\u2208A\n\n\uf8f9\uf8fb\n\n\u2206jiTj(n)\n\n(cid:8)\u2206ji : Zi \u2265 \u00af\u2206ji/2(cid:9)\uf8f9\uf8fb\n(cid:33)(cid:33)\n(cid:32)\n\n\u2206ji\u03c4ji + n max\nj\u2208A\n\n+\n\n128\n\u00af\u2206ji\n\nProductLog\n\n80\n\u00af\u2206ji\n90\u221anj + 4nE[Zi1{Zi \u2265 \u2206(cid:48)}] ,\n\nnj \u00af\u22062\nji\n64\n\n+ 4nE[Zi1{Zi \u2265 \u2206(cid:48)}]\n\nwhere (a) follows by using Lemma 5 to bound Tj(n) \u2264 \u03c4ji when Zi < \u00af\u2206ji. On the other hand,\nthe total number of pulls for arms j for which Zi \u2265 \u00af\u2206ji/2 is at most n. (b) follows by bounding\n\n6\n\n\f(cid:112)\n\n(cid:90) \u221e\n\n\u2206(cid:48)\n\n1/ni. All that remains is to bound the expectation.\n\n\u03c4ji in expectation using Lemma 6. (c) follows from basic calculus and because for j \u2208 A we have\n\u00af\u2206ji \u2265\n160n\n4nE[Zi1{Zi \u2265 \u2206(cid:48)}] \u2264 4n\u2206(cid:48)P{Zi \u2265 \u2206(cid:48)} + 4n\n\u2206(cid:48)ni\n(cid:88)\nwhere I have used Lemma 4 and simple identities. 
Putting it together we obtain\n\nR^\u03c0_{\u03bc,i} \u2264 2B_i + \u2211_{j\u2208A} 90\u221an_j + 160B_i \u2264 252B_i ,\n\nwhere I applied the assumption B \u2208 B, and so \u2211_{j\u2260i} \u221an_j = \u2211_{j\u2260i} n/B_j \u2264 B_i.\n\nThe above proof may be simplified in the special case that B is uniform, where we recover the\nminimax regret of MOSS, but with perhaps a simpler proof than was given originally by Audibert\nand Bubeck [2009].\n\nOn Logarithmic Regret\n\nIn a recent technical report I demonstrated empirically that MOSS suffers sub-optimal problem-\ndependent regret in terms of the minimum gap [Lattimore, 2015]. Specifically, it can happen that\n\nR^moss_{\u03bc,i*} \u2208 \u03a9( (K/\u0394_min) log n ) ,   (6)\n\nwhere \u0394_min = min_{i:\u0394_i>0} \u0394_i. On the other hand, the order-optimal asymptotic regret can be\nsignificantly smaller. Specifically, UCB by Auer et al. [2002] satisfies\n\nR^ucb_{\u03bc,i*} \u2208 O( \u2211_{i:\u0394_i>0} (1/\u0394_i) log n ) ,   (7)\n\nwhich for unequal gaps can be much smaller than Eq. (6) and is asymptotically order-optimal [Lai\nand Robbins, 1985]. The problem is that MOSS explores only enough to obtain minimax regret, but\nsometimes obtains minimax regret even when a more conservative algorithm would do better. It is\nworth remarking that this effect is harder to observe than one might think. The example given in the\naforementioned technical report is carefully tuned to exploit this failing, but still requires n = 10^9\nand K = 10^3 before significant problems arise. In all other experiments MOSS was performing\nadmirably in comparison to UCB.\nAll these problems can be avoided by modifying UCB rather than MOSS. The cost is a factor of\nO(\u221alog n). 
The algorithm is similar to Algorithm 1, but chooses the action that maximises the\nfollowing index:\n\nI_t = arg max_i \u03bc\u0302_{i,T_i(t\u22121)} + \u221a( (2 + \u03b5) log t / T_i(t\u22121) ) \u2212 \u221a( log n / n_i ) ,\n\nwhere \u03b5 > 0 is a fixed arbitrary constant.\nTheorem 7. If \u03c0 is the strategy of unbalanced UCB with n_i = n\u00b2/B_i\u00b2 and B \u2208 B, then the regret\nof the unbalanced UCB satisfies:\n\n1. (problem-independent regret) R^\u03c0_{\u03bc,i*} \u2208 O( B_{i*} \u221alog n ).\n2. (problem-dependent regret) Let A = { i : \u0394_i \u2265 2\u221a( (1/n_{i*}) log n ) }. Then\n\nR^\u03c0_{\u03bc,i*} \u2208 O( B_{i*} \u221alog n \u00b7 1{A \u2260 \u2205} + \u2211_{i\u2208A} (1/\u0394_i) log n ) .\n\nThe proof is deferred to the supplementary material. The indicator function in the problem-\ndependent bound vanishes for sufficiently large n provided n_{i*} \u2208 \u03c9(log(n)), which is equivalent to\nB_{i*} \u2208 o(n/\u221alog n). Thus for reasonable choices of B_1, . . . , B_K the algorithm is going to enjoy the\nsame asymptotic performance as UCB. Theorem 7 may be proven for any index-based algorithm for\nwhich it can be shown that\n\nE T_i(n) \u2208 O( (1/\u0394_i\u00b2) log n ) ,\n\nwhich includes (for example) KL-UCB [Capp\u00e9 et al., 2013] and Thompson sampling (see the\nanalysis by Agrawal and Goyal [2012a,b] and the original paper by Thompson [1933]), but not OC-UCB\n[Lattimore, 2015] or MOSS [Audibert and Bubeck, 2009].\n\nExperimental Results\n\nI compare MOSS and unbalanced MOSS in two simple simulated examples, both with horizon\nn = 5000. Each data point is an empirical average of \u223c10^4 i.i.d. samples, so error bars are too small\nto see. Code/data is available in the supplementary material. The first experiment has K = 2 arms\nand B_1 = n^{1/3} and B_2 = n^{2/3}. 
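The experiments compare MOSS against Algorithm 1 (Unbalanced MOSS). As a companion, here is a minimal Python sketch of Algorithm 1's index rule; it is a simplified illustration written for this text, not the code from the supplementary material, and the `reward` callback stands in for the noisy arm draws.

```python
import math

def log_plus(x):
    return max(0.0, math.log(x))

def unbalanced_moss(n, B, reward):
    # Algorithm 1: n_i = n^2 / B_i^2, and at each round play the arm maximising
    #   mu_hat_i + sqrt((4 / T_i) * log+(n_i / T_i)) - sqrt(1 / n_i).
    K = len(B)
    ni = [n * n / (b * b) for b in B]
    counts = [0] * K     # T_i: number of pulls of arm i so far
    sums = [0.0] * K     # running reward totals, so mu_hat_i = sums[i] / counts[i]

    def index(i):
        if counts[i] == 0:
            return float("inf")  # an unplayed arm has an unbounded bonus
        mu_hat = sums[i] / counts[i]
        bonus = math.sqrt(4.0 / counts[i] * log_plus(ni[i] / counts[i]))
        return mu_hat + bonus - math.sqrt(1.0 / ni[i])

    for _ in range(n):
        arm = max(range(K), key=index)
        sums[arm] += reward(arm)
        counts[arm] += 1
    return counts

# Noise-free illustration: arm 0 pays 1, arm 1 pays 0, so arm 0 should dominate.
counts = unbalanced_moss(200, [10.0, 10.0], lambda a: 1.0 if a == 0 else 0.0)
print(counts[0] > counts[1])  # True
```

With uniform B this reduces to the usual MOSS index up to the constant shift, which is why the two algorithms coincide in that special case.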
I plotted the results for \u03bc = (0, \u2212\u0394) for varying \u0394. As predicted,\nthe new algorithm performs significantly better than MOSS for positive \u0394, and significantly worse\notherwise (Fig. 1). The second experiment has K = 10 arms. This time B_1 = \u221an and B_k =\n(k \u2212 1)H\u221an with H = \u2211_{k=1}^9 1/k. Results are shown for \u03bc_k = \u0394\u00b71{k = i*} for \u0394 \u2208 [0, 1/2] and\ni* \u2208 {1, . . . , 10}. Again, the results agree with the theory. The unbalanced algorithm is superior to\nMOSS for i* \u2208 {1, 2} and inferior otherwise (Fig. 2).\n\nFigure 1\n\nFigure 2: \u03b8 = \u0394 + (i* \u2212 1)/2\n\nSadly the experiments serve only to highlight the plight of the biased learner, which suffers\nsignificantly worse results than its unbiased counterpart for most actions.\n\n6 Discussion\n\nI have shown that the cost of favouritism for multi-armed bandit algorithms is rather serious. If\nan algorithm exhibits a small worst-case regret for a specific action, then the worst-case regret of\nthe remaining actions is necessarily significantly larger than the well-known uniform worst-case\nbound of \u03a9(\u221aKn). This unfortunate result is in stark contrast to the experts setting, for which there\nexist algorithms that suffer constant regret with respect to a single expert at almost no cost for the\nremainder. Surprisingly, the best achievable (non-uniform) worst-case bounds are determined up to\na permutation almost entirely by the value of the smallest worst-case regret.\nThere are some interesting open questions. Most notably, in the adversarial setting I am not sure if\nthe upper or lower bound is tight (or neither). It would also be nice to know whether the constant factors\ncan be determined exactly asymptotically, but so far this has not been done even in the uniform\ncase. 
For the stochastic setting it is natural to ask if the OC-UCB algorithm can also be modified.\nIntuitively one would expect this to be possible, but it would require re-working the very long proof.\n\nAcknowledgements\n\nI am indebted to the very careful reviewers who made many suggestions for improving this paper.\nThank you!\n\nReferences\nShipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Pro-\nceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2012a.\nShipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit prob-\nlem. In Proceedings of Conference on Learning Theory (COLT), 2012b.\nJean-Yves Audibert and S\u00e9bastien Bubeck. Minimax policies for adversarial and stochastic bandits.\nIn COLT, pages 217\u2013226, 2009.\nPeter Auer, Nicol\u00f2 Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged\ncasino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995.\nProceedings., 36th Annual Symposium on, pages 322\u2013331. IEEE, 1995.\nPeter Auer, Nicol\u00f2 Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit\nproblem. Machine Learning, 47:235\u2013256, 2002.\nS\u00e9bastien Bubeck and Nicol\u00f2 Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-\narmed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorpo-\nrated, 2012. ISBN 9781601986269.\nOlivier Capp\u00e9, Aur\u00e9lien Garivier, Odalric-Ambrym Maillard, R\u00e9mi Munos, and Gilles Stoltz.\nKullback\u2013Leibler upper confidence bounds for optimal sequential allocation. The Annals of\nStatistics, 41(3):1516\u20131541, 2013.\nNicol\u00f2 Cesa-Bianchi. Prediction, learning, and games. 
Cambridge University Press, 2006.\nEyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs.\nregret to the average. Machine Learning, 72(1-2):21\u201337, 2008.\nMarcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. The\nJournal of Machine Learning Research, 6:639\u2013660, 2005.\nMichael Kapralov and Rina Panigrahy. Prediction strategies without loss. In Advances in Neural\nInformation Processing Systems, pages 828\u2013836, 2011.\nWouter M Koolen. The pareto regret frontier. In Advances in Neural Information Processing Sys-\ntems, pages 863\u2013871, 2013.\nTze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances\nin Applied Mathematics, 6(1):4\u201322, 1985.\nTor Lattimore. Optimally confident UCB: Improved regret for finite-armed bandits. Technical\nreport, 2015. URL http://arxiv.org/abs/1507.07880.\nChe-Yu Liu and Lihong Li. On the prior sensitivity of Thompson sampling. arXiv preprint\narXiv:1506.03378, 2015.\nAmir Sani, Gergely Neu, and Alessandro Lazaric. Exploiting easy data in online optimization. In\nAdvances in Neural Information Processing Systems, pages 810\u2013818, 2014.\nWilliam Thompson. On the likelihood that one unknown probability exceeds another in view of the\nevidence of two samples. Biometrika, 25(3/4):285\u2013294, 1933.\n", "award": [], "sourceid": 103, "authors": [{"given_name": "Tor", "family_name": "Lattimore", "institution": "University of Alberta"}]}