{"title": "On Explore-Then-Commit strategies", "book": "Advances in Neural Information Processing Systems", "page_first": 784, "page_last": 792, "abstract": "We study the problem of minimising regret in two-armed bandit problems with Gaussian rewards. Our objective is to use this simple setting to illustrate that strategies based on an exploration phase (up to a stopping time) followed by exploitation are necessarily suboptimal. The results hold regardless of whether or not the difference in means between the two arms is known. Besides the main message, we also refine existing deviation inequalities, which allow us to design fully sequential strategies with finite-time regret guarantees that are (a) asymptotically optimal as the horizon grows and (b) order-optimal in the minimax sense. Furthermore we provide empirical evidence that the theory also holds in practice and discuss extensions to non-gaussian and multiple-armed case.", "full_text": "On Explore-Then-Commit Strategies\n\nAur\u00e9lien Garivier\u2217\n\nInstitut de Math\u00e9matiques de Toulouse; UMR5219\n\nUniversit\u00e9 de Toulouse; CNRS\n\nUPS IMT, F-31062 Toulouse Cedex 9, France\n\naurelien.garivier@math.univ-toulouse.fr\n\nEmilie Kaufmann\n\nUniv. Lille, CNRS, Centrale Lille, Inria SequeL\n\nUMR 9189, CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille\n\nF-59000 Lille, France\n\nemilie.kaufmann@univ-lille1.fr\n\nTor Lattimore\n\nUniversity of Alberta\n\n116 St & 85 Ave, Edmonton, AB T6G 2R3, Canada\n\ntor.lattimore@gmail.com\n\nAbstract\n\nWe study the problem of minimising regret in two-armed bandit problems with\nGaussian rewards. Our objective is to use this simple setting to illustrate that\nstrategies based on an exploration phase (up to a stopping time) followed by\nexploitation are necessarily suboptimal. The results hold regardless of whether\nor not the difference in means between the two arms is known. 
Besides the main message, we also refine existing deviation inequalities, which allow us to design fully sequential strategies with finite-time regret guarantees that are (a) asymptotically optimal as the horizon grows and (b) order-optimal in the minimax sense. Furthermore, we provide empirical evidence that the theory also holds in practice and discuss extensions to the non-Gaussian and multiple-armed cases.

1 Introduction

It is now a very frequent issue for companies to optimise their daily profits by choosing between one of two possible website layouts. A natural approach is to start with a period of A/B Testing (exploration) during which the two versions are uniformly presented to users. Once the testing is complete, the company displays the version believed to generate the most profit for the rest of the month (exploitation). The time spent exploring may be chosen adaptively based on past observations, but could also be fixed in advance. Our contribution is to show that strategies of this form are much worse than if the company is allowed to dynamically select which website to display without restrictions for the whole month.
Our analysis focusses on a simple sequential decision problem played over T time-steps. In time-step t ∈ {1, 2, . . . , T} the agent chooses an action A_t ∈ {1, 2} and receives a normally distributed reward ∗This work was partially supported by the CIMI (Centre International de Mathématiques et d'Informatique) Excellence program while Emilie Kaufmann visited Toulouse in November 2015. 
The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under grants ANR-13-BS01-0005 (project SPADRO) and ANR-13-CORD-0020 (project ALICIA).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Z_t ∼ N(μ_{A_t}, 1), where μ1, μ2 ∈ R are the unknown mean rewards for actions 1 and 2 respectively. The goal is to find a strategy π (a way of choosing each action A_t based on past observations) that maximises the cumulative reward over T steps in expectation, or equivalently minimises the regret

R^π_μ(T) = T max{μ1, μ2} − E_μ[ Σ_{t=1}^T μ_{A_t} ] .   (1)

This framework is known as the multi-armed bandit problem, which has many applications and has been studied for almost a century [Thompson, 1933]. Although this setting is now quite well understood, the purpose of this article is to show that strategies based on distinct phases of exploration and exploitation are necessarily suboptimal. This is an important message because exploration followed by exploitation is the most natural approach and is often implemented in applications (including the website optimisation problem described above). Moreover, strategies of this kind have been proposed in the literature for more complicated settings [Auer and Ortner, 2010, Perchet and Rigollet, 2013, Perchet et al., 2015]. Recent progress on optimal exploration policies (e.g., by Garivier and Kaufmann [2016]) could have suggested that well-tuned variants of two-phase strategies might be near-optimal. We show, on the contrary, that optimal strategies for multi-armed bandit problems must be fully sequential, and in particular should mix exploration and exploitation. It has been known since the work of Wald [1945] on simple hypothesis testing that sequential procedures can lead to significant gains. 
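For concreteness, the regret (1) can be estimated numerically. The following Monte-Carlo sketch is ours (the policy interface and function names are not from the paper) for a two-armed, unit-variance Gaussian bandit:

```python
import random

def estimate_regret(policy, mu, T, n_runs=1000, seed=0):
    """Monte-Carlo estimate of the regret (1): T * max(mu) minus the
    expected sum of the means of the arms actually played."""
    rng = random.Random(seed)
    gap_sum = 0.0
    for _ in range(n_runs):
        history = []  # list of (arm, reward) pairs observed so far
        for t in range(1, T + 1):
            a = policy(t, history)            # arm index in {1, 2}
            r = rng.gauss(mu[a - 1], 1.0)     # unit-variance Gaussian reward
            history.append((a, r))
            gap_sum += max(mu) - mu[a - 1]    # regret increment at step t
    return gap_sum / n_runs

# Round-robin baseline: pull arm 1 on odd steps, arm 2 on even steps.
uniform = lambda t, history: 1 if t % 2 == 1 else 2
```

Any of the strategies discussed below can be plugged in as `policy`; the round-robin baseline incurs regret TΔ/2 by construction.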
Here, the superiority of fully sequential procedures is consistent with intuition: if one arm first appears to be better, but if subsequent observations are disappointing, the obligation to commit at some point can be restrictive. In this paper, we give a crisp and precise description of how restrictive it is: it leads to regret asymptotically twice as large on average. The proof of this result combines some classical techniques of sequential analysis and of the bandit literature.
We study two settings, one when the gap Δ = |μ1 − μ2| is known and the other when it is not. The most straightforward strategy in the former case is to explore each action a fixed number of times n and subsequently exploit by choosing the action that appeared best while exploring. It is easy to calculate the optimal n and consequently show that this strategy suffers a regret of R^π_μ(T) ∼ 4 log(T)/Δ. A more general approach is to use a so-called Explore-Then-Commit (ETC) strategy, following a nomenclature introduced by Perchet et al. [2015]. An ETC strategy explores each action alternately until some data-dependent stopping time and subsequently commits to a single action for the remaining time-steps. We show in Theorem 2 that by using a sequential probability ratio test (SPRT) it is possible to design an ETC strategy for which R^π_μ(T) ∼ log(T)/Δ, which improves on the above result by a factor of 4. We also prove a lower bound showing that no ETC strategy can improve on this result. Surprisingly, it is possible to do even better by using a fully sequential strategy inspired by the UCB algorithm for multi-armed bandits [Katehakis and Robbins, 1995]. We design a new strategy for which R^π_μ(T) ∼ log(T)/(2Δ), which improves on the fixed-design strategy by a factor of 8 and on the SPRT by a factor of 2. 
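The optimal exploration budget of the fixed-design strategy (Algorithm 1 below uses n = ⌈2W(T²Δ⁴/(32π))/Δ²⌉, with W the Lambert function) is simple to compute. A sketch, with our own Newton-iteration helper for W rather than any library routine:

```python
import math

def lambert_w(y, iters=50):
    """Solve w * exp(w) = y for w >= 0 by Newton's method (requires y > 0)."""
    w = math.log(1.0 + y)  # reasonable starting point on the principal branch
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - y) / (ew * (1.0 + w))
    return w

def fb_etc_budget(T, delta):
    """Exploration budget n of the fixed-budget ETC rule (Algorithm 1)."""
    return math.ceil(2.0 * lambert_w(T**2 * delta**4 / (32 * math.pi)) / delta**2)

def fb_etc_action(t, n, hat_mu):
    """Arm played at (1-indexed) time-step t: alternate for 2n steps,
    then commit to the empirically best arm (hat_mu maps arm -> mean)."""
    if t <= 2 * n:
        return 1 if t % 2 == 1 else 2
    return 1 if hat_mu[1] >= hat_mu[2] else 2
```

Since W(y) ≈ log y − log log y for large y, the budget grows like (4/Δ²) log(TΔ²), matching the 4 log(T)/Δ regret rate quoted above.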
Again we prove a lower bound showing that no strategy can improve on this result.
For the case where Δ is unknown, fixed-design strategies are hopeless because there is no reasonable tuning for the exploration budget n. However, it is possible to design an ETC strategy for unknown gaps. Our approach uses a modified fixed-budget best arm identification (BAI) algorithm in its exploration phase (see e.g., Even-Dar et al. [2006], Garivier and Kaufmann [2016]) and chooses the recommended arm for the remaining time-steps. In Theorem 5 we show that a strategy based on this idea satisfies R^π_μ(T) ∼ 4 log(T)/Δ, which again we show is optimal within the class of ETC strategies. As before, strategies based on ETC are suboptimal by a factor of 2 relative to the optimal rates achieved by fully sequential strategies such as UCB, which satisfies R^π_μ(T) ∼ 2 log(T)/Δ [Katehakis and Robbins, 1995].
In a nutshell, strategies based on fixed-design or ETC are necessarily suboptimal. That this failure occurs even in the simple setting considered here is a strong indicator that they are suboptimal in more complicated settings. Our main contribution, presented in more detail in Section 2, is to fully characterise the achievable asymptotic regret when Δ is either known or unknown and the strategies are either fixed-design, ETC or fully sequential. All upper bounds have explicit finite-time forms, which allow us to derive optimal minimax guarantees. For the lower bounds we give a novel and generic proof of all results. All proofs contain new, original ideas that we believe are fundamental to the understanding of sequential analysis.

2 Notation and Summary of Results

We assume that the horizon T is known to the agent. 
The optimal action is a* = arg max(μ1, μ2), its mean reward is μ* = μ_{a*}, and the gap between the means is Δ = |μ1 − μ2|. Let H = R² be the set of all possible pairs of means, and H_Δ = {μ ∈ R² : |μ1 − μ2| = Δ}. For i ∈ {1, 2} and n ∈ N let ˆμ_{i,n} be the empirical mean of the ith action based on the first n samples. Let A_t be the action chosen in time-step t and N_i(t) = Σ_{s=1}^t 1{A_s = i} be the number of times the ith action has been chosen after time-step t. We denote by ˆμ_i(t) = ˆμ_{i,N_i(t)} the empirical mean of the ith arm after time-step t.
A strategy is denoted by π, which is a function from past actions/rewards to a distribution over the next actions. An ETC strategy is governed by a sampling rule (which determines which arm to sample at each step), a stopping rule (which specifies when to stop the exploration phase) and a decision rule indicating which arm is chosen in the exploitation phase. As we consider two-armed Gaussian bandits with equal variances, we focus here on uniform sampling rules, which have been shown in Kaufmann et al. [2014] to be optimal in that setting. For this reason, we define an ETC strategy as a pair (τ, â), where τ is an even stopping time with respect to the filtration (F_t = σ(Z1, . . . , Z_t))_t and â ∈ {1, 2} is F_τ-measurable. In all the ETC strategies presented in this paper, the stopping time τ depends on the horizon T (although this is not reflected in the notation). At time t, the action picked by the ETC strategy is

A_t = 1 if t ≤ τ and t is odd ;  2 if t ≤ τ and t is even ;  â otherwise .

The regret for strategy π, given in Eq. (1), depends on T and μ. 
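The pair (τ, â) translates directly into code. A hedged sketch of the generic ETC loop (the plug-in interface below is our own, not the paper's):

```python
def etc_action(t, tau_reached, a_hat):
    """Action rule of a generic ETC strategy: alternate arms 1, 2 until
    the stopping rule fires, then play the chosen arm a_hat."""
    if not tau_reached:
        return 1 if t % 2 == 1 else 2
    return a_hat

def run_etc(T, rewards, stop, decide):
    """Generic ETC loop. `stop(n, m1, m2)` is the stopping rule after n
    samples of each arm (m1, m2 are the empirical means); `decide(m1, m2)`
    is the decision rule; `rewards[a]` is an iterator of rewards for arm a.
    Returns the sequence of actions played."""
    s = {1: 0.0, 2: 0.0}
    a_hat, actions = None, []
    for t in range(1, T + 1):
        a = etc_action(t, a_hat is not None, a_hat)
        actions.append(a)
        s[a] += next(rewards[a])
        if a_hat is None and t % 2 == 0:   # tau is an even stopping time
            n = t // 2
            if stop(n, s[1] / n, s[2] / n):
                a_hat = decide(s[1] / n, s[2] / n)
    return actions
```

A fixed-design strategy corresponds to `stop = lambda n, m1, m2: n >= n0`; the SPRT and BAI stopping rules of Section 3 slot into the same interface.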
Assuming, for example, that μ1 = μ2 + Δ, an ETC strategy π chooses the suboptimal arm N2(T) = (τ ∧ T)/2 + (T − τ)+ · 1{â = 2} times, and the regret R^π_μ(T) = Δ E_μ[N2(T)] thus satisfies

Δ E_μ[(τ ∧ T)/2] ≤ R^π_μ(T) ≤ (Δ/2) E_μ[τ ∧ T] + Δ T P_μ(τ ≤ T, â ≠ a*) .   (2)

We denote the set of all ETC strategies by Π_ETC. A fixed-design strategy is an ETC strategy for which there exists an integer n such that τ = 2n almost surely, and the set of all such strategies is denoted by Π_DETC. The set of all strategies is denoted by Π_ALL. For S ∈ {H, H_Δ}, we are interested in strategies π that are uniformly efficient on S, in the sense that

∀μ ∈ S, ∀α > 0, R^π_μ(T) = o(T^α) .   (3)

We show in this paper that any uniformly efficient strategy in Π has a regret at least equal to C^Π_S log(T)/|μ1 − μ2| (1 − o_T(1)) for every parameter μ ∈ S, where C^Π_S is given in the following table:

      Π_ALL   Π_ETC   Π_DETC
H     2       4       NA
H_Δ   1/2     1       4

Furthermore, we prove that these results are tight. In each case, we propose a uniformly efficient strategy matching this bound. In addition, we prove a tight and non-asymptotic regret bound which also implies, in particular, minimax rate-optimality.
The paper is organised as follows. First we consider ETC and fixed-design strategies when Δ is known and unknown (Section 3). We then analyse fully sequential strategies that interleave exploration and exploitation in an optimal way (Section 4). For known Δ we present a novel algorithm that exploits the additional information to improve the regret. 
For unknown \u2206 we brie\ufb02y recall the well-known\nresults, but also propose a new regret analysis of the UCB* algorithm, a variant of UCB that can\nbe traced back to Lai [1987], for which we also obtain order-optimal minimax regret. Numerical\nexperiments illustrate and empirically support our results in Section 5. We conclude with a short\ndiscussion on non-uniform exploration, and on models with more than 2 arms, possibly non Gaussian.\nAll the proofs are given in the supplementary material. In particular, our simple, uni\ufb01ed proof for all\nthe lower bounds is given in Appendix A.\n\n3 Explore-Then-Commit Strategies\n\nFixed Design Strategies for Known Gaps. As a warm-up we start with the \ufb01xed-design ETC\nsetting where \u2206 is known and where the agent chooses each action n times before committing for the\nremainder.\n\n3\n\n\f2W(cid:0)T 2\u22064/(32\u03c0)(cid:1)/\u22062(cid:109)\n(cid:108)\n\ninput: T and \u2206\nn :=\nfor k \u2208 {1, . . . , n} do\n\nchoose A2k\u22121 = 1 and A2k = 2\n\nend for\n\u02c6a := arg maxi \u02c6\u00b5i,n\nfor t \u2208 {2n + 1, . . . , T} do\n\nchoose At = \u02c6a\n\nend for\nAlgorithm 1: FB-ETC algorithm\n\n(cid:18) T \u22062\n\n(cid:19)\n\n(cid:18) T \u22062\n\n(cid:19)\n2\u03c0\n\u00b5(T ) \u2264 2.04\n\n\u221a\n4\n\n+ \u2206\n\u221a\n\nT + \u2206.\n\nThe optimal decision rule is obviously \u02c6a =\narg maxi \u02c6\u00b5i,n with ties broken arbitrarily. The formal\ndescription of the strategy is given in Algorithm 1,\nwhere W denotes the Lambert function implicitly de-\n\ufb01ned for y > 0 by W (y) exp(W (y)) = y. We denote\nthe regret associated to the choice of n by Rn\n\u00b5(T ). The\nfollowing theorem is not especially remarkable except\nthat the bound is suf\ufb01ciently re\ufb01ned to show certain\nnegative lower-order terms that would otherwise not\nbe apparent.\nTheorem 1. 
Let \u00b5 \u2208 H\u2206, and let\n\n(cid:24) 2\n\n(cid:18) T 2\u22064\n\n(cid:19)(cid:25)\n\n.\n\nlog\n\nn =\n\n\u22062 W\n\n2\u03c0e, and Rn\n\n32\u03c0\n\u221a\nwhenever T \u22062 > 4\nFurthermore, for all \u03b5 > 0, T \u2265 1 and n \u2264 4(1 \u2212 \u03b5) log(T )/\u22062,\n\nThen Rn\nlog log\n\u00b5(T ) \u2264 T \u2206/2 + \u2206 otherwise. In all cases, Rn\n(cid:18)\n\u00b5(T ) \u2265 n\u2206, this entails that\n\n\u00b5(T ) \u2264 4\n\u2206\n(cid:19)\n(cid:19)(cid:18)\n\u00b5(T ) \u223c 4 log(T )/\u2206.\nRn\n\n2(cid:112)\u03c0 log(T )\n\n1 \u2212 8 log(T )\n\u22062T\n\n1 \u2212 2\nn\u22062\n\n\u00b5(T ) \u2265\nRn\n\n\u2212 2\n\u2206\n\nAs Rn\n\n\u2206T \u03b5\n\n4.46\n\n.\n\ninf\n\n1\u2264n\u2264T\n\nThe proof of Theorem 1 is in Appendix B. Note that the \"asymptotic lower bound\" 4 log(T )/\u2206 is\n\u00b5(T ) \u2212 4 log(T )/\u2206 \u2192 \u2212\u221e when\nactually not a lower bound, even up to an additive constant: Rn\nT \u2192 \u221e. Actually, the same phenomenon applies many other cases, and it should be no surprise that,\nin numerical experiments, some algorithm reach a regret smaller than Lai and Robbins asymptotic\nlower bound, as was already observed in several articles (see e.g. Garivier et al. [2016]). Also note\nthat the term \u2206 at the end of the upper bound is necessary: if \u2206 is large, the problem is statistically\nso simple that one single observation is suf\ufb01cient to identify the best arm; but that observation cannot\nbe avoided.\n\nExplore-Then-Commit Strategies for Known Gaps. We now show the existence of ETC strategies\nthat improve on the optimal \ufb01xed-design strategy. Surprisingly, the gain is signi\ufb01cant. We describe\nan algorithm inspired by ideas from hypothesis testing and prove an upper bound on its regret that is\nminimax optimal and that asymptotically matches our lower bound.\nLet P be the law of X \u2212 Y , where X (resp. Y ) is a reward from arm 1 (resp. arm 2). 
As Δ is known, the exploration phase of an ETC algorithm can be viewed as a statistical test of the hypothesis H1 : (P = N(Δ, 2)) against H2 : (P = N(−Δ, 2)). The work of Wald [1945] shows that a significant gain in terms of expected number of samples can be obtained by using a sequential rather than a batch test. Indeed, for a batch test, a sample size of n ∼ (4/Δ²) log(1/δ) is necessary to guarantee that both type I and type II errors are upper bounded by δ. In contrast, when a random number of samples is permitted, there exists a sequential probability ratio test (SPRT) with the same guarantees that stops after a random number N of samples with expectation E[N] ∼ log(1/δ)/Δ² under both H1 and H2. The SPRT stops when the absolute value of the log-likelihood ratio between H1 and H2 exceeds some threshold. Asymptotic upper bounds on the expected number of samples used by a SPRT, as well as the (asymptotic) optimality of such procedures among the class of all sequential tests, can be found in [Wald, 1945, Siegmund, 1985].
Algorithm 2 is an ETC strategy that explores each action alternately, halting when sufficient confidence is reached according to a SPRT. The threshold depends on the gap Δ and the horizon T, corresponding to a risk of δ = 1/(TΔ²). The exploration phase ends at the stopping time

τ = inf{ t = 2n : |ˆμ_{1,n} − ˆμ_{2,n}| ≥ log(TΔ²)/(nΔ) } .

input: T and Δ
A1 = 1, A2 = 2, t := 2
while (t/2) Δ |ˆμ1(t) − ˆμ2(t)| < log(TΔ²) do
  choose A_{t+1} = 1 and A_{t+2} = 2,
  t := t + 2
end while
â := arg max_i ˆμ_i(t)
while t ≤ T do
  choose A_t = â,
  t := t + 1
end while
Algorithm 2: SPRT ETC algorithm

If τ < T then the empirical best arm â at time τ is played until time T. 
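The stopping rule of Algorithm 2 is easy to state in code. A sketch (function names are ours) computing τ and the committed arm from two per-arm reward sequences:

```python
import math

def sprt_stop(n, m1, m2, T, delta):
    """SPRT stopping condition of Algorithm 2 after n samples of each arm:
    the empirical gap |m1 - m2| crosses log(T * delta^2) / (n * delta)."""
    return abs(m1 - m2) >= math.log(T * delta**2) / (n * delta)

def sprt_etc_tau(rewards1, rewards2, T, delta):
    """Exploration length tau (an even number of pulls) of SPRT-ETC, given
    per-arm reward sequences; returns (tau, committed_arm)."""
    s1 = s2 = 0.0
    for n, (x, y) in enumerate(zip(rewards1, rewards2), start=1):
        s1, s2 = s1 + x, s2 + y
        if sprt_stop(n, s1 / n, s2 / n, T, delta):
            return 2 * n, (1 if s1 >= s2 else 2)
    return 2 * len(rewards1), (1 if s1 >= s2 else 2)  # budget exhausted
```

Note that the threshold shrinks like 1/n, so a lucky early gap stops exploration almost immediately, while an ambiguous sample path keeps exploring.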
If T \u22062 \u2264 1, then \u03c4 = 1\n\nend while\n\u02c6a := arg maxi \u02c6\u00b5i(t)\nwhile t \u2264 T do\n\nchoose At = \u02c6a,\nt := t + 1\n\nend while\n\nAlgorithm 2: SPRT ETC algorithm\n\n4\n\n\f(one could even de\ufb01ne \u03c4 = 0 and pick a random arm). The following theorem gives a non-asymptotic\nupper bound on the regret of the algorithm. The results rely on non-asymptotic upper bounds on the\nexpectation of \u03c4, which are interesting in their own right.\nTheorem 2. If T \u22062 \u2265 1, then the regret of the SPRT-ETC algorithm is upper-bounded as\n\nOtherwise it is upper bounded by T \u2206/2+\u2206, and for all T and \u2206 the regret is less than 10(cid:112)T /e+\u2206.\n\n(T ) \u2264 log(eT \u22062)\n\nRSPRT-ETC\n\n+ \u2206 .\n\n\u2206\n\n\u2206\n\n+\n\n\u00b5\n\n4(cid:112)log(T \u22062) + 4\n\nThe proof of Theorem 2 is given in Appendix C. The following lower bound shows that no uniformly\nef\ufb01cient ETC strategy can improve on the asymptotic regret of Algorithm 2. The proof is given in\nSection A together with the other lower bounds.\nTheorem 3. Let \u03c0 be an ETC strategy that is uniformly ef\ufb01cient on H\u2206. Then for all \u00b5 \u2208 H\u2206,\n\nlim inf\nT\u2192\u221e\n\nR\u03c0\n\u00b5(T )\nlog(T )\n\n\u2265 1\n\u2206\n\n.\n\nExplore-Then-Commit Strategies for Unknown Gaps. When the gap is unknown it is not possible\nto tune a \ufb01xed-design strategy that achieves logarithmic regret. ETC strategies can enjoy logarithmic\nregret and these are now analysed. We start with the asymptotic lower bound.\nTheorem 4. Let \u03c0 be a uniformly ef\ufb01cient ETC strategy on H. For all \u00b5 \u2208 H, if \u2206 = |\u00b51 \u2212 \u00b52| then\n\nlim inf\nT\u2192\u221e\n\nR\u03c0\n\u00b5(T )\nlog(T )\n\n\u2265 4\n\u2206\n\n.\n\nA simple idea for constructing an algorithm that matches the lower bound is to use a (\ufb01xed-con\ufb01dence)\nbest arm identi\ufb01cation algorithm for the exploration phase. 
Given a risk parameter δ, a δ-PAC BAI algorithm consists of a sampling rule (A_t), a stopping rule τ and a recommendation rule â which is F_τ-measurable and satisfies, for all μ ∈ H such that μ1 ≠ μ2, P_μ(â = a*) ≥ 1 − δ. In a bandit model with two Gaussian arms, Kaufmann et al. [2014] propose a δ-PAC algorithm using a uniform sampling rule and a stopping rule τ_δ that asymptotically attains the minimal sample complexity E_μ[τ_δ] ∼ (8/Δ²) log(1/δ). Using the regret decomposition (2), it is easy to show that the ETC algorithm using the stopping rule τ_δ for δ = 1/T matches the lower bound of Theorem 4.
Algorithm 3 is a slight variant of this optimal BAI algorithm, based on the stopping time

τ = inf{ t = 2n : |ˆμ_{1,n} − ˆμ_{2,n}| > √( 4 log(T/(2n)) / n ) } .

input: T (≥ 3)
A1 = 1, A2 = 2, t := 2
while |ˆμ1(t) − ˆμ2(t)| < √( 8 log(T/t) / t ) do
  choose A_{t+1} = 1 and A_{t+2} = 2
  t := t + 2
end while
â := arg max_i ˆμ_i(t)
while t ≤ T do
  choose A_t = â
  t := t + 1
end while
Algorithm 3: BAI-ETC algorithm

The motivation for the difference (which comes from a more carefully tuned threshold featuring log(T/(2n)) in place of log(T)) is that the confidence level should depend on the unknown gap Δ, which determines the regret when a mis-identification occurs. The improvement only appears in the non-asymptotic regime, where we are able to prove both asymptotic optimality and order-optimal minimax regret. The latter would not be possible using a fixed-confidence BAI strategy. The proof of this result can be found in Appendix D. The main difficulty is developing a sufficiently strong deviation bound, which we do in Appendix G, and that may be of independent interest. 
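For concreteness, the stopping condition of Algorithm 3 (in its n-samples-per-arm form, with t = 2n total pulls) can be sketched as follows; the guard for 2n ≥ T is our own addition to keep the expression well defined:

```python
import math

def bai_stop(n, m1, m2, T):
    """Stopping condition of Algorithm 3 after n samples of each arm:
    |m1 - m2| exceeds sqrt(4 * log(T / (2n)) / n), which is the same as
    sqrt(8 * log(T / t) / t) with t = 2n total pulls."""
    if 2 * n >= T:
        return True  # exploration cannot exceed the horizon
    return abs(m1 - m2) > math.sqrt(4.0 * math.log(T / (2.0 * n)) / n)
```

The threshold decreases in n (roughly like √(log(T/n)/n)), so unlike the SPRT rule it needs no knowledge of Δ: the rule stops once the empirical gap, whatever its size, is statistically significant at the current sample size.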
Note\nthat a similar strategy was proposed and analysed by Lai et al. [1983], but in the continuous time\nframework and with asymptotic analysis only.\nTheorem 5. If T \u22062 > 4e2, the regret of the BAI-ETC algorithm is upper bounded as\n\nend while\nAlgorithm 3: BAI-ETC algorithm\n\nchoose At = \u02c6a\nt := t + 1\n\n334\n+\n\u221a\nIt is upper bounded by T \u2206 otherwise, and by 32\n\n(T ) \u2264 4 log\n\nRBAI-ETC\n\n\u2206\n\n\u00b5\n\n4\n\nT + 2\u2206 in any case.\n\n+\n\n178\n\u2206\n\n+ 2\u2206.\n\n(cid:16) T \u22062\n\n(cid:17)\n\n(cid:113)\n\nlog(cid:0) T \u22062\n\n4\n\n(cid:1)\n\n\u2206\n\n5\n\n\f4 Fully Sequential Strategies for Known and Unknown Gaps\n\nIn the previous section we saw that allowing a random stopping time leads to a factor of 4 improvement\nin terms of the asymptotic regret relative to the naive \ufb01xed-design strategy. We now turn our attention\nto fully sequential strategies when \u2206 is known and unknown. The latter case is the classic 2-armed\nbandit problem and is now quite well understood. Our modest contribution in that case is the \ufb01rst\nalgorithm that is simultaneously asymptotically optimal and order optimal in the minimax sense. For\nthe former case, we are not aware of any previous research where the gap is known except the line of\nwork by Bubeck et al. [2013], Bubeck and Liu [2013], where different questions are treated. In both\ncases we see that fully sequential strategies improve on the best ETC strategies by a factor of 2.\n\nKnown Gaps. We start by stating the lower bound (proved in Section A), which is a straightforward\ngeneralisation of Lai and Robbins\u2019 lower bound.\nTheorem 6. Let \u03c0 be a strategy that is uniformly ef\ufb01cient on H\u2206. 
Then for all \u00b5 \u2208 H\u2206,\n\nlim inf\nT\u2192\u221e\n\nR\u03c0\n\u00b5(T )\nlog T\n\n\u2265 1\n2\u2206\n\nWe are not aware of any existing algorithm matching this lower bound, which motivates us to\nintroduce a new strategy called \u2206-UCB that exploits the knowledge of \u2206 to improve the performance\nof UCB. In each round the algorithm chooses the arm that has been played most often so far unless\nthe other arm has an upper con\ufb01dence bound that is close to \u2206 larger than the empirical estimate of\nthe most played arm. Like ETC strategies, \u2206-UCB is not anytime in the sense that it requires the\nknowledge of both the horizon T and the gap \u2206.\n\n1: input: T and \u2206\n\u2212 1\n2: \u03b5T = \u2206 log\n3: for t \u2208 {1, . . . , T} do\n4:\ni\u22081,2\n\nlet At,min := arg min\n\n8 (e + T \u22062)/4\n\n5:\n\nif \u02c6\u00b5At,min (t \u2212 1) +\n\nchoose At = At,min\n\nchoose At = At,max\n\nelse\n\n6:\n7:\n8:\n9:\n10: end for\n\nend if\n\nNi(t \u2212 1) and At,max = 3 \u2212 At,min\n\n(cid:118)(cid:117)(cid:117)(cid:116) 2 log\n\n(cid:16)\nNAt,min (t\u22121)\nNAt,min(t \u2212 1)\n\nT\n\n(cid:17)\n\n\u2265 \u02c6\u00b5At,max(t \u2212 1) + \u2206 \u2212 2\u03b5T then\n\nAlgorithm 4: \u2206-UCB\n\nTheorem 7. If T (2\u2206 \u2212 3\u03b5T )2 \u2265 2 and T \u03b52\nbounded as\n\nT \u2265 e2, the regret of the \u2206-UCB algorithm is upper\n\nR\u2206-UCB\n\n\u00b5\n\n(T ) \u2264\n\nlog(cid:0)2T \u22062(cid:1)\n(cid:34)\n30e(cid:112)log(\u03b52\n\n2\u2206(1 \u2212 3\u03b5T /(2\u2206))2 +\nT T )\n\n(cid:112)\u03c0 log (2T \u22062)\n\n2\u2206(1 \u2212 3\u03b5T /\u2206)2\n2\n\n(cid:35)\n\n+ \u2206\n\n(2\u2206 \u2212 3\u03b5T )2\n(T )/ log(T ) \u2264 (2\u2206)\u22121 and \u2200\u00b5 \u2208 H\u2206, R\u2206-UCB\n\n\u03b52\nT\n\n+\n\n+\n\n+ 5\u2206.\n\n\u221a\n\n(T ) \u2264 328\n\nT + 5\u2206.\n\n\u00b5\n\n80\n\u03b52\nT\n\nMoreover lim supT\u2192\u221e R\u2206-UCB\nThe proof may be found in Appendix E.\n\n\u00b5\n\nUnknown Gaps. 
In the classical bandit setting where Δ is unknown, UCB by Katehakis and Robbins [1995] is known to be asymptotically optimal: R^UCB_μ(T) ∼ 2 log(T)/Δ, which matches the lower bound of Lai and Robbins [1985]. Non-asymptotic regret bounds are given for example by Auer et al. [2002], Cappé et al. [2013]. Unfortunately, UCB is not optimal in the minimax sense, which is so far only achieved by algorithms that are not asymptotically optimal [Audibert and Bubeck, 2009, Lattimore, 2015]. Here, with only two arms, we are able to show that Algorithm 5 below is simultaneously minimax order-optimal and asymptotically optimal. The strategy is essentially the same as suggested by Lai [1987], but with a fractionally smaller confidence bound. The proof of Theorem 8 is given in Appendix F. Empirically, the smaller confidence bonus used by UCB* leads to a significant improvement relative to UCB.

1: input: T
2: for t ∈ {1, . . . , T} do
3:   A_t = arg max_{i ∈ {1,2}} ˆμ_i(t − 1) + √( (2/N_i(t − 1)) log( T/N_i(t − 1) ) )
4: end for
Algorithm 5: UCB*

Theorem 8. 
For all \u03b5 \u2208 (0, \u2206), if T (\u2206 \u2212 \u03b5)2 \u2265 2 and T \u03b52 \u2265 e2, the regret of the UCB\u2217 strategy is\nupper bounded as\n\n(cid:113)\n\u03c0 log(cid:0) T \u22062\n\u2206(cid:0)1 \u2212 \u03b5\n\n(cid:17)\n(cid:16) T \u22062\n\u2206(cid:0)1 \u2212 \u03b5\n(cid:1)2 +\n\u00b5(T )/ log(T ) = 2/\u2206 and for all \u00b5 \u2208 H, R\u03c0\n\n(cid:1)\n(cid:1)2 + \u2206\n\n30e(cid:112)log(\u03b52T ) + 16e\n\n(cid:32)\n\n2\n\n\u2206\n\n\u03b52\n\n2\n\n2\n\n\u2206\n\n(cid:33)\n\n\u00b5(T ) \u2264 33\n\n\u2206(cid:0)1 \u2212 \u03b5\n\n2\n\n\u2206\n\n(cid:1)2 + \u2206.\n\n+\n\n\u221a\n\nRUCB\u2217\n\n\u00b5\n\n(T ) \u2264 2 log\n\nMoreover, lim supT\u2192\u221e R\u03c0\n\na minimax regret of \u2126((cid:112)T K log(K)), which is a factor of(cid:112)log(K) suboptimal.\n\nNote that if there are K > 2 arms, then the strategy above is still asymptotically optimal, but suffers\n\nT + \u2206.\n\n5 Numerical Experiments\n\nWe represent here the regret of the \ufb01ve strategies presented in this article on a bandit problem\nwith \u2206 = 1/5, for different values of the horizon. The regret is estimated by 4.105 Monte-Carlo\nreplications. In the legend, the estimated slopes of \u2206R\u03c0(T ) (in logarithmic scale) are indicated after\nthe policy names.\n\nThe experimental behavior of the algorithms re\ufb02ects the theoretical results presented above: the\nregret asymptotically grows as the logarithm of the horizon, the experimental coef\ufb01cients correspond\napproximately to theory, and the relative ordering of the policies is respected. However, it should\nbe noted that for short horizons the hierarchy is not quite the same, and the growth rate is not\nlogarithmic; this question is raised in Garivier et al. [2016]. 
In particular, on short horizons the Best-Arm Identification procedure performs very well with respect to the others, and starts to be beaten (even by the gap-aware strategies) only when TΔ² is much larger than 10.

[Figure: estimated regret of the five strategies as a function of the horizon T (log scale), Δ = 1/5; estimated slopes: FB-ETC 3.65, BAI-ETC 2.98, UCB 1.59, SPRT-ETC 1.03, Δ-UCB 0.77]

6 Conclusion: Beyond Uniform Exploration, Two Arms and Gaussian Distributions

It is worth emphasising the impossibility of non-trivial lower bounds on the regret of ETC strategies using any possible (non-uniform) sampling rule. Indeed, using UCB as a sampling rule together with an a.s. infinite stopping rule defines an artificial but formally valid ETC strategy that achieves the best possible rate for general strategies. This strategy is not a faithful counter-example to our claim that ETC strategies are sub-optimal, because UCB is not a satisfying exploration rule. If exploration is the objective, then uniform sampling is known to be optimal in the two-armed Gaussian case [Kaufmann et al., 2014], which justifies the uniform sampling assumption.
The use of ETC strategies for regret minimisation (e.g., as presented by Perchet and Rigollet [2013]) is certainly not limited to bandit models with 2 arms. The extension to multiple arms is based on the successive elimination idea, in which a set of active arms is maintained with arms chosen according to a round robin within the active set. Arms are eliminated from the active set once their optimality becomes implausible, and the exploration phase terminates when the active set contains only a single arm (an example is given by Auer and Ortner [2010]). The Successive Elimination algorithm was introduced by Even-Dar et al. [2006] for best-arm identification in the fixed-confidence setting. It was shown to be rate-optimal, and thus a good compromise for both minimizing regret and finding the best arm. 
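The round-robin-with-elimination scheme just described can be sketched in a few lines. This is our own illustration, not the paper's code: the confidence radius √(2 log(4n²K/δ)/n) is one standard choice for unit-variance Gaussian arms, not the exact constant of Even-Dar et al. [2006], and the sampler is injected so the sketch stays deterministic under test:

```python
import math

def successive_elimination(sample, K, T, delta=0.05):
    """Sketch of Successive Elimination: sample the active arms in a round
    robin and drop an arm once its empirical mean falls more than twice a
    confidence radius below the best empirical mean. `sample(a)` draws one
    reward from arm a in {0, ..., K-1}. Returns the surviving arm (or the
    empirically best active arm if the budget T runs out first)."""
    active = list(range(K))
    sums = [0.0] * K
    n = t = 0  # samples per active arm, total pulls
    while len(active) > 1 and t + len(active) <= T:
        for a in active:
            sums[a] += sample(a)
            t += 1
        n += 1
        radius = math.sqrt(2.0 * math.log(4.0 * n * n * K / delta) / n)
        best = max(sums[a] / n for a in active)
        active = [a for a in active if sums[a] / n >= best - 2.0 * radius]
    # all active arms have the same sample count, so comparing sums is
    # the same as comparing empirical means
    return max(active, key=lambda a: sums[a])
```

In the regret-minimisation variant discussed above, the surviving arm would then be played for the remaining time-steps, exactly as in the two-armed ETC strategies of Section 3.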
If one looks more precisely at mutliplicative constants, however, Garivier and Kaufmann\n[2016] showed that it is suboptimal for the best arm identi\ufb01cation task in almost all settings except\ntwo-armed Gaussian bandits. Regarding regret minimization, the present paper shows that it is\nsub-optimal by a factor 2 on every two-armed Gaussian problem.\nIt is therefore interesting to investigate the performance in terms of regret of an ETC algorithm\nusing an optimal BAI algorithm. This is actually possible not only for Gaussian distributions, but\nmore generally for one-parameter exponential families, for which Garivier and Kaufmann [2016]\npropose the asymptotically optimal Track-and-Stop strategy. Denoting d(\u00b5, \u00b5(cid:48)) = KL(\u03bd\u00b5, \u03bd\u00b5(cid:48)) the\nKullback-Leibler divergence between two distributions parameterised by \u00b5 and \u00b5(cid:48), they provide\nresults which can be adapted to obtain the following bound.\nProposition 1. For \u00b5 such that \u00b51 > maxa(cid:54)=1 \u00b5a, the regret of the ETC strategy using Track-and-Stop\nexploration with risk 1/T satis\ufb01es\n\nlim sup\nT\u2192\u221e\n\nRTaS\n\u00b5 (T )\nlog T\n\n\u2264 T \u2217(\u00b5)\n\n(cid:20)\n\n(cid:18)\n\nmax\nw\u2208\u03a3K\n\ninf\na(cid:54)=1\n\nw1d\n\n\u00b51,\n\nw1\u00b51 + wa\u00b5a\n\nw1 + wa\n\n(cid:32) K(cid:88)\n(cid:19)\n\na=2\n\n(cid:33)\na(\u00b5)(\u00b51 \u2212 \u00b5a)\nw\u2217\n(cid:18)\n\n,\n\n+ wad\n\n\u00b5a,\n\nwa\u00b51 + wa\u00b5a\n\nw1 + wa\n\n(cid:19)(cid:21)\n\n,\n\nwhere T \u2217(\u00b5) (resp. w\u2217(\u00b5)) is the the maximum (resp. maximiser) of the optimisation problem\n\nwhere \u03a3K is the set of probability distributions on {1, . . . 
, K}.

In general, it is not easy to quantify the difference to the lower bound of Lai and Robbins:
$$\liminf_{T\to\infty} \frac{R^\pi_\mu(T)}{\log T} \;\geq\; \sum_{a=2}^{K} \frac{\mu_1 - \mu_a}{d(\mu_a, \mu_1)}.$$
Even for Gaussian distributions, there is no general closed-form formula for $T^*(\mu)$ and $w^*(\mu)$ except when $K = 2$. However, we conjecture that the worst case is when $\mu_1$ and $\mu_2$ are much larger than the other means: then, the regret is almost the same as in the 2-arm case, and ETC strategies are suboptimal by a factor 2. On the other hand, the most favourable case (in terms of relative efficiency) seems to be when $\mu_2 = \cdots = \mu_K$: then
$$w^*_1(\mu) = \frac{\sqrt{K-1}}{K-1+\sqrt{K-1}}, \qquad w^*_2(\mu) = \cdots = w^*_K(\mu) = \frac{1}{K-1+\sqrt{K-1}},$$
and $T^* = 2(\sqrt{K-1}+1)^2/\Delta^2$, leading to
$$\limsup_{T\to\infty} \frac{R^{\mathrm{TaS}}_\mu(T)}{\log(T)} \;\leq\; \left(1 + \frac{1}{\sqrt{K-1}}\right)\frac{2(K-1)}{\Delta},$$
while Lai and Robbins' lower bound yields $2(K-1)/\Delta$. Thus, the difference grows with $K$ as $2\sqrt{K-1}\,\log(T)/\Delta$, but the relative difference decreases.

References

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the Conference on Learning Theory (COLT), pages 217–226, 2009.

Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.

Sébastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson sampling.
In Advances in Neural Information Processing Systems, pages 638–646, 2013.

Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Conference on Learning Theory, pages 122–134, 2013.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference on Learning Theory (to appear), 2016.

Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. arXiv preprint arXiv:1602.07182, 2016.

Abdolhossein Hoorfar and Mehdi Hassani. Inequalities on the Lambert W function and hyperpower function. Journal of Inequalities in Pure and Applied Mathematics, 9(2):5–9, 2008.

Michael N. Katehakis and Herbert Robbins. Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584, 1995.

Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of A/B testing. In Proceedings of the 27th Conference on Learning Theory, 2014.

Tze Leung Lai. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091–1114, 1987.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules.
Advances in Applied Mathematics, 6(1):4–22, 1985.

Tze Leung Lai, Herbert Robbins, and David Siegmund. Sequential design of comparative clinical trials. In M. Haseeb Rizvi, Jagdish Rustagi, and David Siegmund, editors, Recent Advances in Statistics: Papers in Honor of Herman Chernoff on his Sixtieth Birthday, pages 51–68. Academic Press, 1983.

Tor Lattimore. Optimally confident UCB: Improved regret for finite-armed bandits. Technical report, 2015. URL http://arxiv.org/abs/1507.07880.

Peter Mörters and Yuval Peres. Brownian Motion, volume 30. Cambridge University Press, 2010.

Vianney Perchet and Philippe Rigollet. The multi-armed bandit with covariates. The Annals of Statistics, 2013.

Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Eric Snowberg. Batched bandit problems. In Proceedings of the 28th Conference on Learning Theory, 2015.

David Siegmund. Sequential Analysis. Springer-Verlag, 1985.

William Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Abraham Wald. Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16(2):117–186, 1945.