{"title": "Sequential Test for the Lowest Mean: From Thompson to Murphy Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 6332, "page_last": 6342, "abstract": "Learning the minimum/maximum mean among a finite set of distributions is a fundamental sub-problem in planning, game tree search and reinforcement learning. We formalize this learning task as the problem of sequentially testing how the minimum mean among a finite set of distributions compares to a given threshold. We develop refined non-asymptotic lower bounds, which show that optimality mandates very different sampling behavior for a low vs high true minimum. We show that Thompson Sampling and the intuitive Lower Confidence Bounds policy each nail only one of these cases. We develop a novel approach that we call Murphy Sampling. Even though it entertains exclusively low true minima, we prove that MS is optimal for both possibilities. We then design advanced self-normalized deviation inequalities, fueling more aggressive stopping rules. We complement our theoretical guarantees by experiments showing that MS works best in practice.", "full_text": "Sequential Test for the Lowest Mean:\nFrom Thompson to Murphy Sampling\n\nEmilie Kaufmann1 Wouter M. Koolen2 Aur\u00e9lien Garivier3\n\n1 CNRS & U. 
Lille, CRIStAL / SequeL Inria Lille, emilie.kaufmann@univ-lille.fr\n\n2 Centrum Wiskunde & Informatica, Amsterdam, wmkoolen@cwi.nl\n\n3 UMPA, École normale supérieure de Lyon, aurelien.garivier@ens-lyon.fr\n\nAbstract\n\nLearning the minimum/maximum mean among a finite set of distributions is a fundamental sub-task in planning, game tree search and reinforcement learning. We formalize this learning task as the problem of sequentially testing how the minimum mean among a finite set of distributions compares to a given threshold. We develop refined non-asymptotic lower bounds, which show that optimality mandates very different sampling behavior for a low vs. high true minimum. We show that Thompson Sampling and the intuitive Lower Confidence Bounds policy each nail only one of these cases. We develop a novel approach that we call Murphy Sampling. Even though it entertains exclusively low true minima, we prove that MS is optimal for both possibilities. We then design advanced self-normalized deviation inequalities, fueling more aggressive stopping rules. We complement our theoretical guarantees by experiments showing that MS works best in practice.\n\n1 Introduction\n\nWe consider a collection of core problems related to minima of means. For a given finite collection of probability distributions parameterized by their means µ1, . . . , µK, we are interested in learning about µ* = min_a µ_a from adaptive samples X_t ∼ µ_{A_t}, where A_t indicates the distribution sampled at time t. We shall refer to these distributions as arms in reference to a multi-armed bandit model [28, 26]. Knowing about minima/maxima is crucial in reinforcement learning or game playing, where the value of a state for an agent is the maximum over actions of the (expected) successor state value, or the minimum over adversary moves of the next state value.\n\nThe problem of estimating µ* = min_a µ_a was studied in [34] and subsequently [7, 31, 8]. It is known that no unbiased estimator exists for µ*, and that estimators face an intricate bias-variance trade-off. Beyond estimation, the problem of constructing confidence intervals on minima/maxima naturally arises in (Monte Carlo) planning in Markov Decision Processes [15] and games [25]. Such confidence intervals are used hierarchically for Monte Carlo Tree Search (MCTS) in [32, 11, 17, 20]. The open problem of designing asymptotically optimal algorithms for MCTS led us to isolate one core difficulty that we study here, namely the construction of confidence intervals and associated sampling/stopping rules for learning minima (and, by symmetry, maxima).\n\nConfidence intervals (that are uniform over time) can be naturally obtained from a (sequential) test of {µ* < γ} versus {µ* > γ}, given a threshold γ. The main focus of the paper goes even further and investigates the minimum number of samples required for adaptively testing whether {µ* < γ} or {µ* > γ}, that is, sequentially sampling the arms in order to decide for one hypothesis as quickly as possible. Such a problem is interesting in its own right, as it naturally arises in several statistical certification applications. As an example, we may consider quality control testing in manufacturing, where we want to certify that in a batch of machines each has a guaranteed probability of successfully producing a widget. In e-learning, we may want to certify that a given student has sufficient understanding of a range of subjects, asking as few questions as possible about the different subjects. In anomaly detection, we may want to flag the presence of an anomaly faster the more anomalies are present. Finally, in a crowdsourcing system, we may need to establish as quickly as possible whether a cohort of workers contains at least one unacceptably careless worker. Our own motivation for studying this problem is that it corresponds to an especially simple instance of the depth-two game tree search problem, as illustrated in Figure 1.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: Game tree search problem of "depth 1 1/2". We consider the scenario where it has been established that the right subtree (grey) of the root has value γ. Learning the optimal action at the root (orange) is equivalent to determining how the minimum (blue) of the leaf means (green) compares to γ.\n\nWe thus study a particular example of the sequential adaptive hypothesis testing problem, as introduced by Chernoff [5], in which multiple experiments (sampling from one arm) are available to the experimenter, each of which allows one to gain different information about the hypotheses. The experimenter sequentially selects which experiment to perform, when to stop, and then which hypothesis to recommend. Several recent works from the bandit literature fit into this framework, with the twist that they consider continuous, composite hypotheses and aim for δ-correct testing: the probability of guessing a wrong hypothesis has to be smaller than δ, while performing as few experiments as possible. The fixed-confidence Best Arm Identification problem (concerned with finding the arm with largest mean) is one such example [9, 23], of which several variants have been studied [19, 17, 12]. For example, the Thresholding Bandit Problem [27] aims at finding the set of arms above a threshold, which is strictly harder than our testing problem. 
In the Ranking and Selection literature (see, e.g., [14] for a survey), the related problem of finding systems whose expected performance is smaller than a known standard has been studied by [24]; but if such systems exist, the goal there was additionally to identify the one with smallest expectation, which is strictly harder than our problem.\n\nA full characterization of the asymptotic complexity of the BAI problem was recently given in [11], highlighting the existence of an optimal allocation of samples across arms. The lower bound technique introduced therein can be generalized to virtually any testing problem in a bandit model (see, e.g., [20, 12]). Such an optimal allocation is also presented by [4] in the GENERAL-SAMP framework, which is quite generic and in particular encompasses testing on which side of γ the minimum falls. The proposed LPSample algorithm is thus a candidate to be applied to our testing problem. However, this algorithm is only proved to be order-optimal, that is, to attain the minimal sample complexity up to a (large) multiplicative constant. Moreover, like other algorithms for special cases (e.g., Track-and-Stop for BAI [11]), it relies on forced exploration, which may be harmful in practice and leads to unavoidably asymptotic analysis.\n\nOur first contribution is a tight lower bound on the sample complexity that provides an oracle sample allocation, but also aims at reflecting the moderate-risk behavior of a δ-correct algorithm. Our second contribution is a new sampling rule for the minimum testing problem, under which the empirical fraction of selections converges to the optimal allocation without forced exploration. The algorithm is a variant of Thompson Sampling [33, 1] that conditions on the "worst" outcome µ* < γ, hence the name Murphy Sampling. This conditioning is inspired by the Top Two Thompson Sampling recently proposed by [29] for Best Arm Identification. As we shall see, the optimal allocation is very different depending on whether µ* < γ or µ* > γ, and yet Murphy Sampling automatically adopts the right behavior in each case. Our third contribution is a new stopping rule that, by aggregating samples from several arms that look small, may lead to early stopping whenever µ* < γ. This stopping rule is based on a new self-normalized deviation inequality for exponential families (Theorem 7) of independent interest. It generalizes results obtained by [18, 23] in the Gaussian case and by [3] without the uniformity in time, and also handles subsets of arms.\n\nThe rest of the paper is structured as follows. In Section 2 we introduce our notation and formally define our objective. In Section 3 we present lower bounds on the sample complexity of sequential tests for minima. In particular, we compute the optimal allocations for this problem and discuss the limitations of naive benchmarks in attaining them. In Section 4 we introduce Murphy Sampling, and prove its optimality in conjunction with a simple stopping rule. Improved stopping rules (and confidence intervals) are presented in Section 5. Finally, numerical experiments reported in Section 6 demonstrate the efficiency of Murphy Sampling paired with our new stopping rule.\n\n2 Setup\n\nWe consider a family of K probability distributions that belong to a one-parameter canonical exponential family, which we shall call arms in reference to a multi-armed bandit model; in this paper, we are interested in the smallest mean (and the arm where it is attained). Such exponential families include Gaussian with known variance, Bernoulli and Poisson; see [3] for details. For natural parameter ν, the density of the distribution w.r.t. 
carrier measure ρ on R is given by e^{xν − b(ν)} ρ(dx), where the cumulant generating function b(ν) = ln E_ρ[e^{Xν}] induces a bijection ν ↦ ḃ(ν) to the mean parameterization. We write KL(ν, λ) and d(µ, θ) for the Kullback-Leibler divergence from natural parameters ν to λ and from mean parameters µ to θ. Specifically, with convex conjugate b*,\n\nKL(ν, λ) = b(λ) − b(ν) + (ν − λ) ḃ(ν)   and   d(µ, θ) = b*(µ) − b*(θ) − (µ − θ) ḃ*(θ).\n\nWe denote by µ = (µ1, . . . , µK) ∈ I^K the vector of arm means, which fully characterizes the model. The quantities of interest are thus\n\nµ* = min_a µ_a   and   a* = a*(µ) = argmin_a µ_a.\n\nGiven a threshold γ ∈ I, our goal is to decide whether µ* < γ or µ* > γ. We introduce the hypotheses\n\nH< = {µ ∈ I^K : µ* < γ}   and   H> = {µ ∈ I^K : µ* > γ},\n\nand their union H = H< ∪ H>.\n\nWe want to propose a sequential and adaptive testing procedure, which consists of a sampling rule A_t, a stopping rule τ and a decision rule m̂ ∈ {<, >}. The algorithm samples X_t ∼ µ_{A_t} while t ≤ τ, and then outputs a decision m̂. We denote the information available after t rounds by F_t = σ(A1, X1, . . . , A_t, X_t). A_t is measurable with respect to F_{t−1} and possibly some exogenous random variable, τ is a stopping time with respect to this filtration, and m̂ is F_τ-measurable.\n\nGiven a risk parameter δ ∈ (0, 1], we aim for a δ-correct algorithm, which satisfies P_µ(µ ∈ H_m̂) ≥ 1 − δ for all µ ∈ H. Our goal is to build δ-correct algorithms that use a small number of samples τ_δ in order to reach a decision. In particular, we want the sample complexity E_µ[τ] to be small.\n\nNotation We let N_a(t) = Σ_{s=1}^t 1(A_s = a) be the number of selections of arm a up to round t, S_a(t) = Σ_{s=1}^t X_s 1(A_s = a) be the sum of the gathered observations from that arm, and µ̂_a(t) = S_a(t)/N_a(t) their empirical mean.\n\n3 Lower Bounds\n\nIn this section we study information-theoretic sample complexity lower bounds, in particular to find out what the problem tells us about the behavior of oracle algorithms. [10] prove that for any δ-correct algorithm,\n\nE_µ[τ] ≥ T*(µ) kl(δ, 1 − δ),   where   T*(µ)^{−1} = max_{w ∈ △} min_{λ ∈ Alt(µ)} Σ_a w_a d(µ_a, λ_a),   (1)\n\nkl(x, y) = x ln(x/y) + (1 − x) ln((1 − x)/(1 − y)), and Alt(µ) is the set of bandit models where the correct recommendation differs from that on µ. The following result specializes the above to the case of testing H< vs. H>, and gives explicit expressions for the characteristic time T*(µ) and oracle weights w*(µ).\n\nLemma 1. Any δ-correct strategy satisfies (1) with\n\nT*(µ) = 1/d(µ*, γ) if µ* < γ,   and   T*(µ) = Σ_a 1/d(µ_a, γ) if µ* > γ,\n\nand\n\nw*_a(µ) = 1(a = a*) if µ* < γ,   and   w*_a(µ) = (1/d(µ_a, γ)) / (Σ_j 1/d(µ_j, γ)) if µ* > γ.\n\nLemma 1 is proved in Appendix B. As explained by [10], the oracle weights correspond to the fraction of samples that should be allocated to each arm under a strategy matching the lower bound. 
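The characteristic time and oracle weights of Lemma 1 are easy to evaluate once the divergence d is fixed. Below is a minimal illustrative sketch, assuming Gaussian arms with unit variance (so d(µ, γ) = (µ − γ)²/2) and assuming no mean equals γ; the function names are ours, not the paper's.

```python
import math

def d(mu, gamma, sigma2=1.0):
    # KL divergence between unit-variance Gaussians with means mu and gamma
    return (mu - gamma) ** 2 / (2 * sigma2)

def oracle(mu, gamma):
    """Characteristic time T*(mu) and oracle weights w*(mu) as in Lemma 1.
    Assumes mu_a != gamma for all a (otherwise d vanishes)."""
    mu_star = min(mu)
    if mu_star < gamma:
        # low minimum: sample (only) the lowest arm(s)
        T = 1.0 / d(mu_star, gamma)
        w = [1.0 if m == mu_star else 0.0 for m in mu]
        total = sum(w)
        w = [x / total for x in w]
    else:
        # high minimum: sample arm a in proportion to 1/d(mu_a, gamma)
        inv = [1.0 / d(m, gamma) for m in mu]
        T = sum(inv)
        w = [x / T for x in inv]
    return T, w
```

For instance, with two arms of means 1 and 2 and threshold γ = 0, the oracle splits samples 80/20, reflecting that the arm closest to the threshold is the hardest to certify as being above it.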
The interesting feature here is that the lower bound indicates that an oracle algorithm should have very different behavior on H< and on H>. On H< it should sample a* (or all lowest means, if there are several) exclusively, while on H> it should sample all arms with certain specific proportions.\n\n3.1 Boosting the Lower Bounds\n\nFollowing [13] (see also [30] and references therein), Lemma 1 can be improved under very mild assumptions on the strategies. We call a test symmetric if its sampling and stopping rules are invariant under conjugation by the action of the group of permutations on the arms. In that case, if all the arms are equal, then their expected numbers of draws are equal. For simplicity we assume µ1 ≤ . . . ≤ µK.\n\nProposition 2. Let k = max_a d(µ_a, γ) = max{d(µ1, γ), d(µK, γ)}. For any symmetric, δ-correct test, for all arms a ∈ {1, . . . , K}, the expected number of selections of arm a satisfies\n\nE_µ[N_a(τ)] ≥ 2(1 − 2δK³) / (27 K² k).\n\nProposition 2 is proved in Appendix B. It is an open question to improve the dependency on K in this bound; moreover, one may expect a bound decreasing with δ, maybe in ln(ln(1/δ)) (but certainly not in ln(1/δ)). This result already has two important consequences. First, it shows that even an optimal algorithm needs to draw all the arms a certain number of times, even on H<, where Lemma 1 may suggest otherwise. Second, this lower bound on the number of draws of each arm can be used to "boost" the lower bound on E_µ[τ]; the following result is also proved in Appendix B.\n\nTheorem 3. When µ* < γ, for any symmetric, δ-correct strategy,\n\nE_µ[τ] ≥ kl(δ, 1 − δ) / d(µ1, γ) + (2(1 − 2δK³) / (27 K² k)) Σ_a [1 − d(µ_a, γ) 1(µ_a ≤ γ) / d(µ1, γ)].\n\n3.2 Lower Bound Inspired Matching Algorithms\n\nIn light of the lower bound in Lemma 1, we now investigate the design of optimal learning algorithms (sampling rule A_t and stopping rule τ). We start with the stopping rule. The first stopping rule that comes to mind consists in comparing each arm separately to the threshold, and stopping when either one arm looks significantly below the threshold or all arms look significantly above it. Introducing d⁺(u, v) = d(u, v) 1(u ≤ v) and d⁻(u, v) = d(u, v) 1(u ≥ v), we let\n\nτ< = inf{t ∈ N* : ∃a, N_a(t) d⁺(µ̂_a(t), γ) ≥ C<(δ, N_a(t))},\nτ> = inf{t ∈ N* : ∀a, N_a(t) d⁻(µ̂_a(t), γ) ≥ C>(δ, N_a(t))},   τBox = τ< ∧ τ>,   (2)\n\nwhere C<(δ, r) and C>(δ, r) are two threshold functions to be specified. Box refers to the fact that the decision to stop relies on individual "box" confidence intervals for each arm, whose endpoints are\n\nU_a(t) = max{q : N_a(t) d⁺(µ̂_a(t), q) ≤ C<(δ, N_a(t))},\nL_a(t) = min{q : N_a(t) d⁻(µ̂_a(t), q) ≤ C>(δ, N_a(t))}.\n\nIndeed, τBox = inf{t ∈ N* : min_a U_a(t) ≤ γ or min_a L_a(t) ≥ γ}. In particular, if for all a and all t ∈ N*, µ_a ∈ [L_a(t), U_a(t)], then any algorithm that stops using τBox is guaranteed to output a correct decision. In the Gaussian case, existing work [18, 23] permits exhibiting thresholds of the form C≶(δ, r) = ln(1/δ) + a ln ln(1/δ) + b ln(1 + ln(r)) for which this sufficient correctness condition is satisfied with probability larger than 1 − δ. Theorem 7 below generalizes this to exponential families. 
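The box rule (2) can be sketched as follows, again for unit-variance Gaussian arms. The threshold C(δ, r) = ln(1/δ) + ln(1 + ln r) is an illustrative stand-in with the shape described above, not the exact function proved δ-correct in the paper; the helper names are ours.

```python
import math

def d_plus(u, v):
    # d(u, v) restricted to u <= v (unit-variance Gaussian divergence)
    return (u - v) ** 2 / 2 if u <= v else 0.0

def d_minus(u, v):
    # d(u, v) restricted to u >= v
    return (u - v) ** 2 / 2 if u >= v else 0.0

def box_decision(means, counts, gamma, delta):
    """Return '<', '>' or None (keep sampling), following the tau_Box template:
    stop for '<' if some arm is significantly below gamma, for '>' if every
    arm is significantly above gamma."""
    def C(r):  # illustrative threshold C(delta, r), not the calibrated one
        return math.log(1 / delta) + math.log(1 + math.log(r))
    if any(n * d_plus(m, gamma) >= C(n) for m, n in zip(means, counts)):
        return '<'
    if all(n * d_minus(m, gamma) >= C(n) for m, n in zip(means, counts)):
        return '>'
    return None
```

With empirical means well separated from γ and enough samples, the rule fires; with few samples it returns None and sampling continues.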
Given that τBox can be proved to be δ-correct whatever the sampling rule, the next step is to propose sampling rules that, coupled with τBox, attain the lower bound presented in Section 3. We now show that a simple algorithm, called LCB, can do that for all µ ∈ H>. LCB selects at each round the arm with smallest Lower Confidence Bound:\n\nLCB: Play A_t = argmin_a L_a(t),   (3)\n\nwhich is intuitively designed to attain the stopping condition min_a L_a(t) ≥ γ faster. In Appendix E we prove (Proposition 15) that LCB is optimal for µ ∈ H>; however, we show (Proposition 16) that on instances of H< it draws all arms a ≠ a* too much and cannot match our lower bound.\n\nFor µ ∈ H<, the lower bound of Lemma 1 can actually be a good guideline for designing a matching algorithm: under such an algorithm, the empirical proportion of draws of the arm a* with smallest mean should converge to 1. The literature on regret minimization in bandit models (see [2] for a survey) provides candidate algorithms that have this type of behavior, and we propose to use the Thompson Sampling (TS) algorithm [1, 22]. Given independent prior distributions on the mean of each arm, this Bayesian algorithm selects an arm at random according to its posterior probability of being optimal (in our case, the arm with smallest mean). Letting π^t_a refer to the posterior distribution of µ_a after t samples, this can be implemented as\n\nTS: Sample θ_a(t) ∼ π^{t−1}_a for all a ∈ {1, . . . , K}, then play A_t = argmin_{a ∈ {1,...,K}} θ_a(t).\n\nIt follows from Theorem 12 in Appendix D that if Thompson Sampling is run without stopping, N_{a*}(t)/t converges almost surely to 1, for every µ. As TS is an anytime sampling strategy (i.e., one that does not depend on δ), Lemma 4 below permits to justify that on every instance of H< with a unique optimal arm, under this algorithm τBox ≈ (1/d(µ1, γ)) ln(1/δ). However, TS cannot be optimal for µ ∈ H>, as the empirical proportions of draws cannot converge to w*(µ) ≠ 1_{a*}.\n\nTo summarize, we presented a simple stopping rule, τBox, which can be asymptotically optimal for every µ ∈ H< if it is used in combination with Thompson Sampling, and for µ ∈ H> if it is used in combination with LCB. But neither of these two sampling rules is good for the other type of instance, which is a big limitation for practical use of either of them. In the next section, we propose a new Thompson Sampling-like algorithm that ensures the right exploration under both H< and H>. In Section 5, we further present an improved stopping rule that may stop significantly earlier than τBox on instances of H<, by aggregating samples from multiple arms that look small.\n\nWe now argue that ensuring the sampling proportions converge to w* is sufficient for reaching the optimal sample complexity, at least in an asymptotic sense. The proof can be found in Appendix C.\n\nLemma 4. Fix µ ∈ H. Fix an anytime sampling strategy (A_t) ensuring N_t/t → w*(µ). 
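A single round of the TS rule above can be sketched in a few lines, assuming unit-variance Gaussian arms with an improper flat prior, so that the posterior of arm a after N_a samples summing to S_a is N(S_a/N_a, 1/N_a). This is a hypothetical helper under those assumptions, not the authors' code.

```python
import random

def thompson_step(sums, counts, rng=random):
    """One Thompson Sampling round for the *lowest*-mean problem:
    draw theta_a from each arm's Gaussian posterior N(S_a/N_a, 1/N_a),
    then play the argmin of the draws."""
    thetas = [rng.gauss(s / n, (1.0 / n) ** 0.5) for s, n in zip(sums, counts)]
    return min(range(len(thetas)), key=thetas.__getitem__)
```

When the posteriors are well separated, the draw concentrates on the arm with the lowest posterior mean, which is exactly the behavior that makes TS match the oracle allocation on H< (and fail to match it on H>).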
Let τ_δ be a stopping rule such that τ_δ ≤ τBox, for a Box stopping rule (2) whose threshold functions C≶ satisfy the following: they are non-decreasing in r, and there exists a function f such that, for all r ≥ r0, C≶(δ, r) ≤ f(δ) + ln r, where f(δ) = ln(1/δ) + o(ln(1/δ)). Then lim sup_{δ→0} τ_δ / ln(1/δ) ≤ T*(µ) almost surely.\n\n4 Murphy Sampling\n\nIn this section we denote by Π_n = P(· | F_n) the posterior distribution of the mean parameters after n rounds. We introduce a new (randomised) sampling rule called Murphy Sampling, after Murphy's Law, as it performs some conditioning on the "worst event" (µ ∈ H<):\n\nMS: Sample θ_t ∼ Π_{t−1}(· | H<), then play A_t = a*(θ_t).   (4)\n\nAs we will argue below, the subtle difference of sampling from Π_{n−1}(· | H<) instead of Π_{n−1} (regular Thompson Sampling) ensures the required split-personality behavior (see Lemma 1). Note that MS always conditions on H< (and never on H>) regardless of the position of µ w.r.t. γ. This is different from the symmetric Top Two Thompson Sampling [29], which essentially conditions on a*(θ) ≠ a*(µ) a fixed fraction 1 − β of the time, where β is a parameter that needs to be tuned with knowledge of µ. MS, on the other hand, needs no parameters. Also note that MS is an anytime sampling algorithm, being independent of the confidence level 1 − δ. The confidence will manifest only in the stopping rule.\n\nMS is technically an instance of Thompson Sampling with a joint prior Π supported only on H<. This viewpoint is conceptually funky, as we will apply MS identically to H< and H>. To implement MS, we use that independent conjugate per-arm priors induce likewise posteriors, admitting efficient (unconditioned) posterior sampling. 
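The conditioned draw in the MS rule can be implemented by rejection sampling from the unconditioned posterior: keep drawing until the sampled vector lands in H<. A minimal sketch under the same Gaussian/flat-prior assumptions as before; the names and the cap on retries are ours.

```python
import random

def murphy_step(sums, counts, gamma, rng=None, max_tries=100000):
    """One Murphy Sampling round: draw theta from the unconditioned
    posterior and accept only when min_a theta_a < gamma (i.e., condition
    on H_<); then play a*(theta), the argmin of the accepted draw."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        theta = [rng.gauss(s / n, (1.0 / n) ** 0.5) for s, n in zip(sums, counts)]
        if min(theta) < gamma:  # rejection step: keep only H_< samples
            return min(range(len(theta)), key=theta.__getitem__)
    raise RuntimeError("acceptance probability too low")
```

On H< the acceptance probability tends to one, so MS behaves like plain TS; on H> acceptances are rare events in which exactly one arm dips below γ, which is what induces the 1/d(µ_a, γ)-proportional allocation.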
Rejection sampling then achieves the required conditioning. Its computational cost is limited: the acceptance probability cannot be much smaller than the risk δ provided to the algorithm. Indeed, the fact that the stopping rule (see Section 5) has not yet fired, combined with the posterior concentration (Proposition 6) and the convergence of the sampling efforts to the oracle proportions (Theorem 5), reveals that the MS rejection sampling step accepts with probability at least of order δ/(ln t)³. So for reasonable values of δ, this can be small and require a few thousand draws (not a big deal for today's computers), but it cannot be prohibitively small.\n\nThe rest of this section is dedicated to the analysis of MS. First, we argue that the MS sampling proportions converge to the oracle weights of Lemma 1.\n\nAssumption For the purpose of analysis, we need to assume that the parameter space Θ ∋ µ (or the support of the prior) is the interior of a bounded subset of R^K. This ensures that sup_{µ,θ∈Θ} d(µ, θ) < ∞ and sup_{µ,θ∈Θ} ‖µ − θ‖ < ∞. This assumption is common [16, Section 7.1], [29, Assumption 1]. We also assume that the prior Π has a density π with bounded ratio sup_{µ,θ∈Θ} π(µ)/π(θ) < ∞.\n\nTheorem 5. Under the above assumption, MS ensures N_t/t → w*(µ) a.s. for any µ ∈ H.\n\nWe give a sketch of the proof below; the detailed argument can be found in Appendix D, Theorems 12 and 13. Given the convergence of the weights, the asymptotic optimality in terms of sample complexity follows by Lemma 4, if MS is used with an appropriate stopping rule (Box (2) or the improved Aggregate stopping rule discussed in Section 5).\n\nProof Sketch First, consider µ ∈ H<. In this case the conditioning in MS is asymptotically immaterial, as Π_n(H<) → 1, and the algorithm behaves like regular Thompson Sampling. As Thompson Sampling has sublinear pseudo-regret [1], we must have E[N1(t)]/t → 1. The crux of the proof in the appendix is to show that the convergence occurs almost surely.\n\nNext, consider µ ∈ H>. Following [29], we denote the sampling probabilities in round n by ψ_a(n) = Π_{n−1}(a = argmin_j θ_j | H<), and abbreviate Ψ_a(n) = Σ_{t=1}^n ψ_a(t) and ψ̄_a(n) = Ψ_a(n)/n. The main intuition is provided by\n\nProposition 6 ([29, Proposition 4]). For any open subset Θ̃ ⊆ Θ, the posterior concentrates at rate Π_n(Θ̃) ≐ exp(−n min_{λ∈Θ̃} Σ_a ψ̄_a(n) d(µ_a, λ_a)) a.s., where a_n ≐ b_n means (1/n) ln(a_n/b_n) → 0.\n\nLet us use this to analyze ψ_a(n). As we are on H>, the posterior mass Π_n(H<) → 0 vanishes. Moreover, Π_n(a = argmin_j θ_j, H<) ∼ Π_n(θ_a < γ), as the probability that multiple arms fall below γ is negligible. Hence\n\nψ_a(n+1) ∼ Π_n(µ_a < γ) / Σ_j Π_n(µ_j < γ) ≐ exp(−n ψ̄_a(n) d(µ_a, γ)) / Σ_j exp(−n ψ̄_j(n) d(µ_j, γ)).\n\nTo get a good sense for what this means, let us analyse the version with equality. Using that w*_a d(µ_a, γ) is constant in a (Lemma 1), we see\n\nψ_a(n+1) ≤ e^{−n (ψ̄_a(n) − w*_a) d(µ_a, γ)}.\n\nNow this means that whenever ψ̄_a(n) ≥ w*_a + ε, we find that ψ_a(n+1) ≤ e^{−nεd_a} ≈ 0 is exponentially small, and hence ψ̄_a(n+1) ≈ (n/(n+1)) ψ̄_a(n) decays hyperbolically (i.e., without lower bound). Hence lim sup_{n→∞} ψ̄_a(n) ≤ w*_a + ε. As this holds for all arms a and all ε > 0, we must have lim_n ψ̄_a(n) = w*_a.\n\n5 Improved Stopping Rule and Confidence Intervals\n\nTheorem 7 below provides a new self-normalized deviation inequality that, given a subset of arms, controls uniformly over time how much the aggregated mean of the samples obtained from those arms can deviate from the smallest (resp. largest) mean in the subset. More formally, for S ⊆ [K] we introduce\n\nN_S(t) = Σ_{a∈S} N_a(t)   and   µ̂_S(t) = Σ_{a∈S} N_a(t) µ̂_a(t) / N_S(t),\n\nand recall d⁺(u, v) = d(u, v) 1(u ≤ v) and d⁻(u, v) = d(u, v) 1(u ≥ v). We prove the following for one-parameter exponential families.\n\nTheorem 7. Let T : R+ → R+ be the function defined by\n\nT(x) = 2 h^{−1}((1 + h^{−1}(1 + x) + ln ζ(2)) / 2),   (5)\n\nwhere h(u) = u − ln(u) for u ≥ 1 and ζ(s) = Σ_{n=1}^∞ n^{−s}. For every subset S of arms and every x ≥ 0.04,\n\nP(∃t ∈ N : N_S(t) d⁺(µ̂_S(t), min_{a∈S} µ_a) ≥ 3 ln(1 + ln(N_S(t))) + T(x)) ≤ e^{−x},   (6)\nP(∃t ∈ N : N_S(t) d⁻(µ̂_S(t), max_{a∈S} µ_a) ≥ 3 ln(1 + ln(N_S(t))) + T(x)) ≤ e^{−x}.   (7)\n\n5.1 An Improved Stopping Rule\n\nThe proof of this theorem can be found in Section F and is sketched below. It generalizes in several directions the type of results obtained by [18, 23] for Gaussian distributions and |S| = 1. 
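For unit-variance Gaussian arms, the aggregated statistic of Theorem 7 and the threshold function T can be evaluated numerically: h is invertible on [1, ∞) by bisection, and the maximization over subsets can be restricted to nested prefixes of arms sorted by increasing empirical mean (as exploited in Section 5.1 below). The reading of display (5) encoded in `threshold_T` is our reconstruction of a garbled formula, so treat it as indicative only; it does behave like x + 3 ln x.

```python
import math

def h(u):
    # h(u) = u - ln(u), increasing for u >= 1 (Theorem 7)
    return u - math.log(u)

def h_inv(y):
    """Inverse of h on [1, oo) by bisection; requires y >= 1."""
    lo, hi = 1.0, 2.0
    while h(hi) < y:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if h(mid) < y else (lo, mid)
    return (lo + hi) / 2

def threshold_T(x):
    # Our reading of (5), with zeta(2) = pi^2 / 6; grows like x + 3 ln x.
    return 2 * h_inv((1 + h_inv(1 + x) + math.log(math.pi ** 2 / 6)) / 2)

def aggregate_stat(means, counts, gamma):
    """Largest aggregated deviation N_S * d+(mu_S, gamma) over nested
    prefixes of arms sorted by increasing empirical mean (Gaussian d)."""
    order = sorted(range(len(means)), key=lambda a: means[a])
    best, N, S = 0.0, 0, 0.0
    for a in order:
        N += counts[a]
        S += counts[a] * means[a]
        mu_S = S / N
        if mu_S <= gamma:
            best = max(best, N * (mu_S - gamma) ** 2 / 2)
    return best
```

Pooling two arms that are both below γ doubles the effective sample size behind the deviation statistic, which is exactly why aggregating can trigger the stopping rule of Section 5.1 earlier than per-arm boxes.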
Going beyond subsets of size 1 will be crucial here to obtain better confidence intervals on minima, or to stop earlier in tests. Note that the threshold function T introduced in (5) does not depend on the cardinality of the subset S to which the deviation inequality is applied. Tight upper bounds on T can be given using Lemma 21 in Appendix F.3, which supports the approximation T(x) ≈ x + 3 ln(x).\n\nFix a subset prior π : ℘({1, . . . , K}) → R+ such that Σ_{S⊆{1,...,K}} π(S) = 1, and let T be the threshold function defined in Theorem 7. We define the stopping rule τ^π := τ> ∧ τ^π_<, where\n\nτ> = inf{t ∈ N* : ∀a ∈ {1, . . . , K}, N_a(t) d⁻(µ̂_a(t), γ) ≥ 3 ln(1 + ln(N_a(t))) + T(ln(1/δ))},\nτ^π_< = inf{t ∈ N* : ∃S, N_S(t) d⁺(µ̂_S(t), γ) ≥ 3 ln(1 + ln(N_S(t))) + T(ln(1/(δπ(S))))}.\n\nThe associated recommendation rule selects H> if τ^π = τ> and H< if τ^π = τ^π_<. For the practical computation of τ^π_<, the search over subsets can be reduced to nested subsets containing the arms sorted by increasing empirical mean and smaller than γ.\n\nLemma 8. Any algorithm using the stopping rule τ^π and selecting m̂ = > iff τ^π = τ> is δ-correct.\n\nFrom Lemma 8, proved in Appendix G, the prior π does not impact the correctness of the algorithm. However, it may impact its sample complexity significantly. First, it can be observed that picking a π that is uniform over subsets of size 1, i.e. π(S) = K^{−1} 1(|S| = 1), one obtains a δ-correct τBox stopping rule with threshold functions satisfying the assumptions of Lemma 4. However, in practice (especially for more moderate δ), it may be more interesting to include in the support of π subsets of larger sizes, for which N_S(t) d⁺(µ̂_S(t), γ) may be larger. We advocate the use of π(S) = K^{−1} (K choose |S|)^{−1}, which puts the same weight on the set of subsets of each possible size.\n\nLinks with Generalized Likelihood Ratio Tests (GLRT). Assume we want to test H0 against H1 for composite hypotheses. A GLRT test based on t observations whose distribution depends on some parameter x rejects H0 if the test statistic max_{x∈H1} ℓ(X1, . . . , Xt; x) / max_{x∈H0∪H1} ℓ(X1, . . . , Xt; x) takes large values (where ℓ(·; x) denotes the likelihood of the observations under the model parameterized by x). In our testing problem, the GLRT statistic for rejecting H< is min_a N_a(t) d⁻(µ̂_a(t), γ), hence τ> is very close to a sequential GLRT test. However, the GLRT statistic for rejecting H> is Σ_{a=1}^K N_a(t) d⁺(µ̂_a(t), γ), which is quite different from the stopping statistic used by τ^π_<. Rather than aggregating samples from arms, the GLRT statistic is summing evidence for exceeding the threshold. Using similar martingale techniques as for proving Theorem 7, one can show that replacing τ^π_< by\n\nτ^GLRT_< = inf{t ∈ N* : Σ_{a : µ̂_a(t) ≤ γ} [N_a(t) d⁺(µ̂_a(t), γ) − 3 ln(1 + ln(N_a(t)))]₊ ≥ K T(ln(1/δ)/K)}\n\nalso yields a δ-correct algorithm (see [21])¹. At first sight, τ^π_< and τ^GLRT_< are hard to compare: the stopping statistic used by the latter can be larger than that used by the former, but it is compared to a smaller threshold. In Section 6 we will provide empirical evidence in favor of aggregating samples.\n\n5.2 A Confidence Intervals Interpretation\n\nInequality (6) (and a union bound over subsets) also permits building a tight upper confidence bound on the minimum µ*. 
5.2 A Confidence Intervals Interpretation

Inequality (6) (and a union bound over subsets) also permits building a tight upper confidence bound on the minimum μ*. Indeed, defining

U^π_min(t) := max{ q : max_{S⊆{1,…,K}} [ N_S(t) d^+(μ̂_S(t), q) − 3 ln(1 + ln N_S(t)) − T(ln(1/(δπ(S)))) ] ≤ 0 },

it is easy to show that P(∀t ∈ N, μ* ≤ U^π_min(t)) ≥ 1 − δ. For general choices of π, this upper confidence bound may be much smaller than the naive bound min_a U_a(t), which corresponds to choosing π uniform over subsets of size 1. We provide an illustration supporting this claim in Figure 2 below. The two types of upper confidence bounds (Aggregate, corresponding to π(S) = K^{-1} (K choose |S|)^{-1}, and Box, corresponding to π(S) = K^{-1} 1(|S| = 1)) are compared under uniform sampling in a Bernoulli bandit model that has k arms with mean 0.1 plus 4 arms with means [0.2 0.3 0.4 0.5]. The larger the number of arms close to the minimum (here equal to it), the more UCB Aggregate beats UCB Box. Observe that using inequality (7) in Theorem 7 similarly allows to derive tighter lower confidence bounds on the maximum of several means.

Figure 2: Illustration of the Box versus Aggregate Upper Confidence Bounds as a function of time on a Bernoulli instance for k = 1 (left), k = 3 (middle) and k = 10 (right) minimal arms.
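Since N_S(t) d^+(μ̂_S(t), q) is nondecreasing in q, the bound U^π_min(t) can be computed by bisection over q. The sketch below uses unit-variance Gaussian divergences for simplicity (the experiment of Figure 2 uses Bernoulli arms, for which d would be the Bernoulli KL divergence); the function names and the sample instance are illustrative assumptions of ours.

```python
import math

def d_plus(x, q):  # unit-variance Gaussian divergence, active when x <= q
    return 0.5 * (x - q) ** 2 if x <= q else 0.0

def T(x):  # approximation T(x) ~ x + 3 ln(x)
    return x + 3 * math.log(max(x, 1.0))

def ucb_min(counts, means, delta, pi, lo=-10.0, hi=10.0, iters=60):
    """Bisection for U^pi_min(t): the largest q not excluded by any subset.
    pi(k) is the prior weight of one subset of size k; only nested subsets
    of the empirically lowest arms need to be scanned."""
    order = sorted(range(len(counts)), key=lambda a: means[a])

    def excluded(q):  # does some subset certify mu_* < q at level delta*pi(S)?
        n_s, sum_s = 0, 0.0
        for k, a in enumerate(order, start=1):
            n_s += counts[a]
            sum_s += counts[a] * means[a]
            w = pi(k)
            if w > 0 and n_s * d_plus(sum_s / n_s, q) > \
                    3 * math.log(1 + math.log(n_s)) + T(math.log(1 / (delta * w))):
                return True
        return False

    for _ in range(iters):  # excluded(q) is monotone in q, so bisection applies
        mid = (lo + hi) / 2
        if excluded(mid):
            hi = mid
        else:
            lo = mid
    return lo

# Illustrative instance: three arms at the minimum 0.1, uniform allocation
counts = [200] * 6
means = [0.1, 0.1, 0.1, 0.3, 0.4, 0.5]
u_box = ucb_min(counts, means, 0.1, lambda k: 1 / 6 if k == 1 else 0.0)
u_agg = ucb_min(counts, means, 0.1, lambda k: 1 / (6 * math.comb(6, k)))
```

On this instance the Aggregate prior yields a visibly smaller bound than the Box prior, in line with the behavior shown in Figure 2.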
5.3 Sketch of the Proof of Theorem 7

Fix η ∈ [0, 1+e[. Introducing X_η(t) = [N_S(t) d^+(μ̂_S(t), min_{a∈S} μ_a) − 2(1+η) ln(1 + ln N_S(t))], the cornerstone of the proof (Lemma 17) consists in proving that for all λ ∈ [0, (1+η)^{-1}[, there exists a martingale M^λ_t that "almost" upper bounds e^{λX_η(t)}: there exists a function g_η such that

E[M^λ_0] = 1 and ∀t ∈ N*, M^λ_t ≥ e^{λX_η(t) − g_η(λ)}.   (8)

From there, the proof easily follows from a combination of the Chernoff method and Doob's inequality:

P(∃t ∈ N* : X_η(t) > u) ≤ P(∃t ∈ N* : M^λ_t > e^{λu − g_η(λ)}) ≤ exp(−[λu − g_η(λ)]).

Inequality (6) is then obtained by optimizing over λ, carefully picking η and inverting the bound. The interesting part of the proof is to actually build a martingale satisfying (8). First, using the so-called method of mixtures [6] and some specific facts about exponential families already exploited by [3], we can prove that there exists a martingale W̃^x_t such that for some function f (see Equation (14))

{X_η(t) − f(η) ≥ x} ⊆ { W̃^x_t ≥ e^{x/(1+η)} }.

From there it follows that, for every λ and z > 1, {e^{λ(X_η(t)−f(η))} ≥ z} ⊆ { e^{−ln(z)/(λ(1+η))} W̃^{ln(z)/λ}_t ≥ 1 }, and the trick is to introduce another mixture martingale,

M̄^λ_t = 1 + ∫_1^∞ e^{−ln(z)/(λ(1+η))} W̃^{ln(z)/λ}_t dz,

that is proved to satisfy M̄^λ_t ≥ e^{λ[X_η(t)−f(η)]}. We let M^λ_t = M̄^λ_t / E[M̄^λ_1].
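The Chernoff-plus-Doob step can be sanity-checked on a toy process that is not the paper's construction: for a standard Gaussian random walk S_t, the exponential martingale M_t = exp(λS_t − tλ²/2) is explicit and nonnegative with E[M_0] = 1, so Ville/Doob's maximal inequality gives the time-uniform bound P(∃t : M_t ≥ 1/δ) ≤ δ. The parameters below are arbitrary choices of ours.

```python
import math
import random

def crossing_frequency(lam=0.5, delta=0.1, horizon=400, runs=10000, seed=1):
    """Empirical frequency of the event {exists t <= horizon: M_t >= 1/delta}
    for M_t = exp(lam*S_t - t*lam^2/2), S_t a standard Gaussian random walk.
    By Ville/Doob's maximal inequality this frequency should stay below delta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        s = 0.0
        for t in range(1, horizon + 1):
            s += rng.gauss(0.0, 1.0)
            # M_t >= 1/delta  <=>  lam*S_t - t*lam^2/2 >= ln(1/delta)
            if lam * s - t * lam * lam / 2 >= math.log(1 / delta):
                hits += 1
                break
    return hits / runs
```

Note that the bound holds simultaneously over all times, which is exactly what makes such martingale arguments suitable for stopping rules with a random stopping time.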
6 Experiments

We discuss the results of numerical experiments performed on Gaussian bandits with variance 1, using the threshold γ = 0. Thompson and Murphy sampling are run using a flat (improper) prior on R, which leads to a conjugate Gaussian posterior. The experiments demonstrate the flexibility of our MS sampling rule, which attains optimal performance on instances from both H_< and H_>. Moreover, they show the advantage of using a stopping rule aggregating samples from subsets of arms when μ ∈ H_<. This aggregating stopping rule, which we refer to as τ^Agg, is an instance of the τ^π stopping rule presented in Section 5 for π(S) = K^{-1} (K choose |S|)^{-1}. We investigate the combined use of three sampling rules, MS, LCB and Thompson Sampling, with three stopping rules, τ^Agg, τ^Box and τ^GLRT.

We first study an instance μ ∈ H_< with K = 10 arms that are linearly spaced between −1 and 1. We run the different algorithms (excluding the TS sampling rule, which essentially coincides with MS on H_<) for different values of δ and report the estimated sample complexity in Figure 3 (left). For each sampling rule, it appears that E[τ^Agg] ≤ E[τ^Box] ≤ E[τ^GLRT]. Moreover, for each stopping rule MS outperforms LCB, with a sample complexity of order T*(μ) ln(1/δ) + C. Then we study an instance μ ∈ H_> with K = 5 arms that are linearly spaced between 0.5 and 1, with τ^Agg as the stopping rule (which matters little, as the algorithm mostly stops because of τ_> on H_>). Results are reported in Figure 3 (right), in which we see that MS performs very similarly to LCB (which is also proved optimal on H_>), while vanilla TS fails dramatically. In those experiments, the empirical error was always zero, which shows that our theoretical thresholds are still quite conservative. More experimental results can be found in Appendix A: an illustration of the convergence properties of the MS sampling rule as well as a larger-scale comparison of stopping rules under H_<.

Figure 3: E[τ_δ] as a function of ln(1/δ) for several algorithms on an instance μ ∈ H_< (left) and μ ∈ H_> (right), estimated using N = 5000 (resp. 500) repetitions.

7 Discussion

We propose new sampling and stopping rules for sequentially testing the minimum of means. As our guiding principle, we first prove sample complexity lower bounds, characterize the emerging oracle sample allocation w*, and develop the Murphy Sampling strategy to match it asymptotically. We observe in the experiments that the asymptotic regime does not necessarily kick in at moderate confidence δ (Figure 4, left) and that there is an important lower-order term in the practical sample complexity (Figure 3). It is an intriguing open problem of theoretical and practical importance to characterize and match optimal behavior at moderate confidence. We make first contributions in both directions: we prove tighter sample complexity lower bounds for symmetric algorithms (Proposition 2, Theorem 3) and we design aggregating confidence intervals which are tighter in practice (Figure 2).

The importance of this perspective arises, as highlighted in the introduction, from the hierarchical application of maxima/minima in learning applications. A better understanding of the moderate confidence regime for learning minima will very likely translate into new insights and methods for learning about hierarchical structures, where the benefits accumulate with depth.

References

[1] S. Agrawal and N. Goyal. Analysis of Thompson Sampling for the multi-armed bandit problem. In Proceedings of the 25th Conference On Learning Theory, 2012.

[2] S. Bubeck and N. Cesa-Bianchi.
Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[3] O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516–1541, 2013.

[4] L. Chen, A. Gupta, J. Li, M. Qiao, and R. Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. In Proceedings of the 30th Conference on Learning Theory (COLT), 2017.

[5] H. Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.

[6] V. H. de la Peña, T. L. Lai, and Q.-M. Shao. Self-normalized Processes: Limit Theory and Statistical Applications. Springer, 2009.

[7] C. D'Eramo, M. Restelli, and A. Nuara. Estimating maximum expected value through Gaussian approximation. In International Conference on Machine Learning (ICML), 2016.

[8] C. D'Eramo, A. Nuara, M. Pirotta, and M. Restelli. Estimating the maximum expected value in continuous reinforcement learning problems. In AAAI, 2017.

[9] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

[10] A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference On Learning Theory (COLT), 2016.

[11] A. Garivier, E. Kaufmann, and W. M. Koolen.
Maximin action identification: A new bandit framework for games. In Proceedings of the 29th Conference On Learning Theory (COLT), 2016.

[12] A. Garivier, P. Ménard, and L. Rossi. Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv:1711.04454, 2017.

[13] A. Garivier, P. Ménard, and G. Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, Jun. 2018.

[14] D. Goldsman and B. L. Nelson. Comparing Systems via Simulation, Chapter 9 in the Handbook of Simulation. Wiley, 1998.

[15] J.-B. Grill, M. Valko, and R. Munos. Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning. In Advances in Neural Information Processing Systems (NIPS), pages 4680–4688, 2016.

[16] P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.

[17] R. Huang, M. M. Ajallooeian, C. Szepesvári, and M. Müller. Structured best arm identification with fixed confidence. In International Conference on Algorithmic Learning Theory (ALT), 2017.

[18] K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of the 27th Conference on Learning Theory, 2014.

[19] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning (ICML), 2012.

[20] E. Kaufmann and W. M. Koolen. Monte-Carlo tree search by best arm identification. In Advances in Neural Information Processing Systems (NIPS), 2017.

[21] E. Kaufmann and W. M. Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. Preprint, 2018. URL https://hal.archives-ouvertes.fr/hal-01886612v1/.

[22] E. Kaufmann, N. Korda, and R. Munos. Thompson Sampling: An asymptotically optimal finite-time analysis.
In Proceedings of the 23rd Conference on Algorithmic Learning Theory, 2012.

[23] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1):1–42, 2016.

[24] S.-H. Kim. Comparison with a standard via fully sequential procedures. ACM Transactions on Modeling and Computer Simulation (TOMACS), 15(2):155–174, 2005.

[25] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML'06, pages 282–293, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-45375-X, 978-3-540-45375-8.

[26] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[27] A. Locatelli, M. Gutzeit, and A. Carpentier. An optimal algorithm for the thresholding bandit problem. In International Conference on Machine Learning (ICML), 2016.

[28] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

[29] D. Russo. Simple Bayesian algorithms for best arm identification. CoRR, abs/1602.08448, 2016. URL http://arxiv.org/abs/1602.08448.

[30] M. Simchowitz, K. Jamieson, and B. Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Proceedings of the 30th Conference on Learning Theory (COLT), 2017.

[31] I. Takahisa and T. Kaneko. Estimating the maximum expected value through upper confidence bound of likelihood. In Conference on Technologies and Applications of Artificial Intelligence (TAAI), pages 202–207. IEEE, 2017.

[32] K. Teraoka, K. Hatano, and E. Takimoto. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions on Information and Systems, pages 392–398, 2014.

[33] W. R.
Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.

[34] H. van Hasselt. Estimating the maximum expected value: An analysis of (nested) cross validation and the maximum sample average. CoRR, abs/1302.7175, 2013. URL http://arxiv.org/abs/1302.7175.