{"title": "Multi-Bandit Best Arm Identification", "book": "Advances in Neural Information Processing Systems", "page_first": 2222, "page_last": 2230, "abstract": "We study the problem of identifying the best arm in each of the bandits in a multi-bandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.", "full_text": "Multi-Bandit Best Arm Identi\ufb01cation\n\nVictor Gabillon\n\nMohammad Ghavamzadeh\n\nINRIA Lille - Nord Europe, Team SequeL\n\nAlessandro Lazaric\n\n{victor.gabillon,mohammad.ghavamzadeh,alessandro.lazaric}@inria.fr\n\nDepartment of Operations Research and Financial Engineering, Princeton University\n\nS\u00b4ebastien Bubeck\n\nsbubeck@princeton.edu\n\nAbstract\n\nWe study the problem of identifying the best arm in each of the bandits in a multi-\nbandit multi-armed setting. We \ufb01rst propose an algorithm called Gap-based Ex-\nploration (GapE) that focuses on the arms whose mean is close to the mean of\nthe best arm in the same bandit (i.e., small gap). We then introduce an algorithm,\ncalled GapE-V, which takes into account the variance of the arms in addition to\ntheir gap. We prove an upper-bound on the probability of error for both algo-\nrithms. 
Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.\n\n1 Introduction\n\nConsider a clinical problem with M subpopulations, in which one should decide between K_m options for treating subjects from each subpopulation m. A subpopulation may correspond to patients with a particular gene biomarker (or other risk categories), and the treatment options are the available treatments for a disease. The main objective here is to construct a rule that recommends the best treatment for each of the subpopulations. These rules are usually built from clinical-trial data, and such trials are generally costly to run. It is therefore important to distribute the trial resources wisely, so that the devised rule performs well. Since it may take significantly more resources to find the best treatment for one subpopulation than for the others, the common strategy of enrolling patients as they arrive may not yield a good overall performance. Moreover, applying treatment options uniformly at random within a subpopulation would not only waste trial resources but might also end up recommending a bad treatment for that subpopulation. This problem can be formulated as best arm identification over M multi-armed bandits [1], which itself can be seen as the problem of pure exploration [4] over multiple bandits. 
In this formulation, each subpopulation is considered as a multi-armed bandit, each treatment as an arm, trying a medication on a patient as a pull, and we are asked to recommend an arm for each bandit after a given number of pulls (the budget). The evaluation can be based on 1) the average over the bandits of the reward of the recommended arms, 2) the average probability of error (i.e., of not selecting the best arm), or 3) the maximum probability of error. Note that this setting is different from the standard multi-armed bandit problem, in which the goal is to maximize the cumulative sum of rewards (see e.g., [13, 3]).\n\nThe pure exploration problem is about designing strategies that make the best use of a limited budget (e.g., the total number of patients that can be admitted to a clinical trial) in order to optimize the performance in a decision-making task. Audibert et al. [1] proposed two algorithms to address this problem: 1) a highly exploring strategy based on upper confidence bounds, called UCB-E, whose optimal parameter value depends on a measure of the complexity of the problem, and 2) a parameter-free method based on progressively rejecting the arms that seem suboptimal, called Successive Rejects. They showed that both algorithms are nearly optimal, since their probability of returning the wrong arm decreases exponentially with the budget at a nearly optimal rate. Racing algorithms (e.g., [10, 12]) and action-elimination algorithms [7] address this problem under a constraint on the accuracy in identifying the best arm, and minimize the budget needed to achieve that accuracy. However, UCB-E and Successive Rejects are designed for a single bandit problem and, as we will discuss later, cannot be easily extended to the multi-bandit case studied in this paper. Deng et al. have recently proposed an active learning algorithm for resource allocation over multiple bandits [5]. 
However, they do not provide any theoretical analysis for their algorithm and only evaluate its performance empirically. Moreover, the target of their algorithm is to minimize the maximum uncertainty in estimating the value of the arms of each bandit. Note that this is different from our target, which is to maximize the quality of the arms recommended for each bandit.\n\nIn this paper, we study the problem of best-arm identification in a multi-armed multi-bandit setting under a fixed budget constraint, and propose an algorithm, called Gap-based Exploration (GapE), to solve it. The allocation strategy implemented by GapE focuses on the gap of the arms, i.e., the difference between the mean of an arm and the mean of the best arm (in that bandit). The GapE-variance (GapE-V) algorithm extends this approach by also taking into account the variance of the arms. For both algorithms, we prove an upper-bound on the probability of error that decreases exponentially with the budget. Since both GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is rarely known in advance, we also introduce their adaptive versions. Finally, we evaluate the performance of these algorithms and compare them with the Uniform and Uniform+UCB-E strategies on a number of synthetic problems. Our empirical results indicate that 1) GapE and GapE-V perform better than Uniform and Uniform+UCB-E, and 2) the adaptive versions of these algorithms match the performance of their non-adaptive counterparts.\n\n2 Problem Setup\n\nIn this section, we introduce the notation used throughout the paper and formalize the multi-bandit best arm identification problem. Let M be the number of bandits and K be the number of arms for each bandit (we use indices m, p, q for the bandits and k, i, j for the arms). 
Each arm k of a bandit m is characterized by a distribution $\nu_{mk}$ bounded in [0, b] with mean $\mu_{mk}$ and variance $\sigma^2_{mk}$. In the following, we assume that each bandit has a unique best arm. We denote by $\mu^*_m$ and $k^*_m$ the mean and the index of the best arm of bandit m (i.e., $\mu^*_m = \max_{1 \le k \le K} \mu_{mk}$, $k^*_m = \arg\max_{1 \le k \le K} \mu_{mk}$). In each bandit m, we define the gap for each arm as $\Delta_{mk} = |\max_{j \ne k} \mu_{mj} - \mu_{mk}|$.\n\nThe clinical trial problem described in Sec. 1 can be formalized as a game between a stochastic multi-bandit environment and a forecaster, where the distributions $\{\nu_{mk}\}$ are unknown to the forecaster. At each round t = 1, ..., n, the forecaster pulls a bandit-arm pair I(t) = (m, k) and observes a sample drawn from the distribution $\nu_{I(t)}$, independent from the past. The forecaster estimates the expected value of each arm by computing the average of the samples observed over time. Let $T_{mk}(t)$ be the number of times that arm k of bandit m has been pulled by the end of round t; then the mean of this arm is estimated as $\hat\mu_{mk}(t) = \frac{1}{T_{mk}(t)} \sum_{s=1}^{T_{mk}(t)} X_{mk}(s)$, where $X_{mk}(s)$ is the s-th sample observed from $\nu_{mk}$. Given the previous definitions, we define the estimated gaps as $\hat\Delta_{mk}(t) = |\max_{j \ne k} \hat\mu_{mj}(t) - \hat\mu_{mk}(t)|$. At the end of round n, the forecaster returns for each bandit m the arm with the highest estimated mean, i.e., $J_m(n) = \arg\max_k \hat\mu_{mk}(n)$, and incurs a regret\n\n$r(n) = \frac{1}{M} \sum_{m=1}^M r_m(n) = \frac{1}{M} \sum_{m=1}^M \big(\mu^*_m - \mu_{m J_m(n)}\big)$.\n\nAs discussed in the introduction, other performance measures can be defined for this problem. 
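To make these definitions concrete, here is a small Python sketch (ours, not part of the paper; the helper names `gaps` and `recommend` are hypothetical) computing the true gaps $\Delta_{mk}$ and the recommendation rule $J_m(n)$ on a toy two-bandit problem.

```python
# Illustrative sketch (ours): the gaps Delta_mk = |max_{j != k} mu_mj - mu_mk|
# and the recommendation rule J_m(n) = argmax_k muhat_mk(n).

def gaps(means):
    """Gap of every arm of one bandit, given its vector of means."""
    out = []
    for k, mu_k in enumerate(means):
        best_other = max(mu_j for j, mu_j in enumerate(means) if j != k)
        out.append(abs(best_other - mu_k))
    return out

def recommend(mu_hat):
    """J_m(n): index of the empirically best arm of each bandit."""
    return [max(range(len(row)), key=row.__getitem__) for row in mu_hat]

# Hypothetical means: M = 2 bandits, K = 3 arms each.
bandit_means = [[0.5, 0.45, 0.3], [0.9, 0.4, 0.2]]
all_gaps = [gaps(mu) for mu in bandit_means]
# By construction, the best and the second-best arm of a bandit share the same gap.
```

Note that, by the definition of the gap, the best and the second-best arm of each bandit always share the same gap; this fact becomes important in Sec. 3.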
In some applications, returning the wrong arm is considered an error independently of its regret, and thus the objective is to minimize the average probability of error\n\n$e(n) = \frac{1}{M} \sum_{m=1}^M e_m(n) = \frac{1}{M} \sum_{m=1}^M \mathbb{P}\big(J_m(n) \ne k^*_m\big)$.\n\nFinally, in problems similar to the clinical trial, a reasonable objective is to return the right treatment for all the genetic profiles, and not just to have a small average probability of error. In this case, the global performance of the forecaster can be measured as\n\n$\ell(n) = \max_m \ell_m(n) = \max_m \mathbb{P}\big(J_m(n) \ne k^*_m\big)$.\n\nIt is interesting to note the relationship between these three performance measures: $\min_m \Delta_m \times e(n) \le \mathbb{E}\,r(n) \le b \times e(n) \le b \times \ell(n)$, where the expectation in the regret is w.r.t. the random samples. As a result, any algorithm minimizing the worst-case probability of error, $\ell(n)$, also controls the average probability of error, $e(n)$, and the simple regret $\mathbb{E}\,r(n)$. Note that the algorithms introduced in this paper directly target the problem of minimizing $\ell(n)$.\n\nParameters: number of rounds n, exploration parameter a, maximum range b\nInitialize: $T_{mk}(0) = 0$, $\hat\Delta_{mk}(0) = 0$ for all bandit-arm pairs (m, k)\nfor t = 1, 2, ..., n do\n  Compute $B_{mk}(t) = -\hat\Delta_{mk}(t-1) + b\sqrt{a / T_{mk}(t-1)}$ for all bandit-arm pairs (m, k)\n  Draw $I(t) \in \arg\max_{m,k} B_{mk}(t)$\n  Observe $X_{I(t)}\big(T_{I(t)}(t-1) + 1\big) \sim \nu_{I(t)}$\n  Update $T_{I(t)}(t) = T_{I(t)}(t-1) + 1$ and $\hat\Delta_{mk}(t)$ for all k of the selected bandit\nend for\nReturn $J_m(n) \in \arg\max_{k \in \{1,...,K\}} \hat\mu_{mk}(n)$, $\forall m \in \{1, ..., M\}$\n\nFigure 1: The pseudo-code of the Gap-based Exploration (GapE) algorithm.\n\n3 The Gap-based Exploration Algorithm\nFig. 
1 contains the pseudo-code of the Gap-based Exploration (GapE) algorithm. GapE flattens the bandit-arm structure and reduces it to a single-bandit problem with MK arms. At each time step t, the algorithm relies on the observations up to time t-1 to build an index $B_{mk}(t)$ for each bandit-arm pair, and then selects the pair I(t) with the highest index. The index $B_{mk}$ consists of two terms. The first term is the negative of the estimated gap for arm k in bandit m. Similar to other upper-confidence bound (UCB) methods [3], the second term is an exploration term that forces the algorithm to pull arms that have been less explored. As a result, the algorithm tends to pull arms with a small estimated gap and a small number of pulls. The exploration parameter a tunes the level of exploration of the algorithm. As shown by the theoretical analysis of Sec. 3.1, if the time horizon n is known, a should be set to $a = \frac{4}{9}\frac{n-K}{H}$, where $H = \sum_{m,k} b^2/\Delta^2_{mk}$ is the complexity of the problem (see Sec. 3.1 for further discussion). Note that GapE differs from most standard bandit strategies in the sense that the B-index of an arm depends explicitly on the statistics of the other arms. This feature makes the analysis of this algorithm much more involved.\n\nAs we may notice from Fig. 1, GapE resembles the UCB-E algorithm [1] designed to solve the pure exploration problem in the single-bandit setting. Nonetheless, the use of the negative estimated gap ($-\hat\Delta_{mk}$) instead of the estimated mean ($\hat\mu_{mk}$, used by UCB-E) is crucial in the multi-bandit setting. In the single-bandit problem, since the best and second best arms have the same gap ($\Delta_{mk^*_m} = \min_{k \ne k^*_m} \Delta_{mk}$), GapE considers them equivalent and tends to pull them the same amount of time, while UCB-E tends to pull the best arm more often than the second best one. 
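The allocation loop of Fig. 1 can be sketched in a few dozen lines. The following is our illustrative Python implementation under simplifying assumptions (Bernoulli rewards, a fixed exploration parameter `a`, each arm pulled once before the index is used); the function name `gape` is ours, and this is a sketch of the strategy, not the authors' code.

```python
# A minimal, self-contained sketch of the GapE allocation loop of Fig. 1,
# assuming Bernoulli reward distributions bounded in [0, 1].
import math
import random

def gape(means, n, a, b=1.0, seed=0):
    rng = random.Random(seed)
    M, K = len(means), len(means[0])
    T = [[0] * K for _ in range(M)]      # numbers of pulls T_mk
    S = [[0.0] * K for _ in range(M)]    # sums of observed rewards

    def mu_hat(m, k):
        # Unpulled arms default to 0; they are handled by the infinite index below.
        return S[m][k] / T[m][k] if T[m][k] else 0.0

    def gap_hat(m, k):
        other = max(mu_hat(m, j) for j in range(K) if j != k)
        return abs(other - mu_hat(m, k))

    for _ in range(n):
        # B_mk(t) = -gaphat_mk(t-1) + b * sqrt(a / T_mk(t-1)); unpulled arms first.
        best, best_b = None, -float("inf")
        for m in range(M):
            for k in range(K):
                bmk = float("inf") if T[m][k] == 0 else \
                    -gap_hat(m, k) + b * math.sqrt(a / T[m][k])
                if bmk > best_b:
                    best, best_b = (m, k), bmk
        m, k = best
        T[m][k] += 1
        S[m][k] += 1.0 if rng.random() < means[m][k] else 0.0

    # J_m(n): recommend the empirically best arm of each bandit.
    return [max(range(K), key=lambda k: mu_hat(m, k)) for m in range(M)]
```

For instance, on an easy two-bandit problem with means (0.9, 0.1) and (0.8, 0.2), a budget of a few hundred pulls suffices for the returned recommendations to be the true best arms with overwhelming probability.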
Despite this difference, the performance of both algorithms in predicting the best arm after n pulls would be the same. This is due to the fact that the probability of error depends on the capability of the algorithm to distinguish optimal and suboptimal arms, and this is not affected by a different allocation over the best and second best arms, as long as the number of pulls allocated to that pair is large enough w.r.t. their gap. Despite this similarity, the two approaches become completely different in the multi-bandit case. In this case, if we run UCB-E on all the MK arms, it tends to pull more the arm with the highest mean over all the bandits, i.e., $k^* = \arg\max_{m,k} \mu_{mk}$. As a result, it would be accurate in predicting the best arm $k^*$ over bandits, but may have an arbitrarily bad performance in predicting the best arm of each bandit, and thus may incur a large error $\ell(n)$. On the other hand, GapE focuses on the arms with the smallest gaps. This way, it assigns more pulls to bandits whose optimal arms are difficult to identify (i.e., bandits with arms with small gaps), and as shown in the next section, it achieves a high probability of identifying the best arm in each bandit.\n\n3.1 Theoretical Analysis\nIn this section, we derive an upper-bound on the probability of error $\ell(n)$ of the GapE algorithm.\n\nTheorem 1. If we run GapE with parameter $0 < a \le \frac{4}{9}\frac{n - MK}{H}$, then its probability of error satisfies\n\n$\ell(n) \le \mathbb{P}\big(\exists m : J_m(n) \ne k^*_m\big) \le 2MKn \exp\big(-\frac{a}{64}\big)$,\n\nin particular for $a = \frac{4}{9}\frac{n - MK}{H}$, we have $\ell(n) \le 2MKn \exp\big(-\frac{1}{144}\frac{n - MK}{H}\big)$.\n\nRemark 1 (Analysis of the bound). If the time horizon n is known in advance, it is possible to set the exploration parameter a as a linear function of n, and as a result, the probability of error of GapE decreases exponentially with the time horizon. 
The other interesting aspect of the bound is the complexity term H appearing in the optimal value of the exploration parameter a (i.e., $a = \frac{4}{9}\frac{n-K}{H}$). If we denote by $H_{mk} = b^2/\Delta^2_{mk}$ the complexity of arm k in bandit m, it is clear from the definition of H that each arm has an additive impact on the overall complexity of the multi-bandit problem. Moreover, if we define the complexity of each bandit m as $H_m = \sum_k b^2/\Delta^2_{mk}$ (similar to the definition of complexity for UCB-E in [1]), the GapE complexity may be rewritten as $H = \sum_m H_m$. This means that the complexity of GapE is simply the sum of the complexities of all the bandits.\n\nRemark 2 (Comparison with the static allocation strategy). The main objective of GapE is to trade off between allocating pulls according to the gaps (more precisely, according to the complexities $H_{mk}$) and the exploration needed to improve the accuracy of their estimates. If the gaps were known in advance, a nearly-optimal static allocation strategy assigns to each bandit-arm pair a number of pulls proportional to its complexity. Let us consider a strategy that pulls each arm a fixed number of times over the horizon n. The probability of error of this strategy may be bounded as\n\n$\ell_{Static}(n) \le \mathbb{P}\big(\exists m : J_m(n) \ne k^*_m\big) \le \sum_{m=1}^M \mathbb{P}\big(J_m(n) \ne k^*_m\big) \le \sum_{m=1}^M \sum_{k \ne k^*_m} \mathbb{P}\big(\hat\mu_{mk^*_m}(n) \le \hat\mu_{mk}(n)\big) \le \sum_{m=1}^M \sum_{k \ne k^*_m} \exp\big(-T_{mk}(n) \frac{\Delta^2_{mk}}{b^2}\big) = \sum_{m=1}^M \sum_{k \ne k^*_m} \exp\big(-T_{mk}(n) H^{-1}_{mk}\big).$   (1)\n\nGiven the constraint $\sum_{m,k} T_{mk}(n) = n$, the allocation minimizing the last term in Eq. 1 is $T^*_{mk}(n) = n H_{mk}/H$. We refer to this fixed strategy as StaticGap. 
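As an illustration (our sketch; `static_allocation` is a hypothetical helper, not from the paper), the StaticGap allocation $T^*_{mk}(n) = n H_{mk}/H$ can be computed directly from the means; on Problem 1 of Sec. 4 it sends the bulk of the budget to the first, more complex bandit.

```python
# Our sketch of the StaticGap allocation T*_mk(n) = n * H_mk / H, assuming the
# gaps (hence the complexities H_mk = b^2 / Delta_mk^2) are known in advance.
def static_allocation(means, n, b=1.0):
    Hmk = []
    for mu in means:                              # one bandit per row
        row = []
        for k, mu_k in enumerate(mu):
            delta = abs(max(x for j, x in enumerate(mu) if j != k) - mu_k)
            row.append(b * b / delta ** 2)        # H_mk
        Hmk.append(row)
    H = sum(sum(row) for row in Hmk)              # H = sum_m H_m
    return [[n * h / H for h in row] for row in Hmk]

# Problem 1 of Sec. 4: H_1 ~ 925 and H_2 ~ 67, so bandit 1 gets ~93% of the budget.
alloc = static_allocation([[0.5, 0.45, 0.4, 0.3], [0.5, 0.3, 0.2, 0.1]], n=700)
```

The allocation sums to the budget n by construction, and within each bandit the hardest (smallest-gap) arms receive the most pulls.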
Although this is not necessarily the optimal static strategy ($T^*_{mk}(n)$ minimizes an upper-bound), this allocation guarantees a probability of error smaller than $MK \exp(-n/H)$. Theorem 1 shows that, for n large enough, GapE achieves the same performance as the static allocation StaticGap.\n\nRemark 3 (Comparison with other allocation strategies). At the beginning of Sec. 3, we discussed the difference between GapE and UCB-E. Here we compare the bound reported in Theorem 1 with the performance of the Uniform and combined Uniform+UCB-E allocation strategies. In the uniform allocation strategy, the total budget n is uniformly split over all the bandits and arms. As a result, each bandit-arm pair is pulled $T_{mk}(n) = n/(MK)$ times. Using the same derivation as in Remark 2, the probability of error $\ell(n)$ of this strategy may be bounded as\n\n$\ell_{Unif}(n) \le \sum_{m=1}^M \sum_{k \ne k^*_m} \exp\big(-\frac{n}{MK} \frac{\Delta^2_{mk}}{b^2}\big) \le MK \exp\big(-\frac{n}{MK \max_{m,k} H_{mk}}\big)$.\n\nIn the Uniform+UCB-E allocation strategy, i.e., a two-level algorithm that first selects a bandit uniformly and then pulls arms within each bandit using UCB-E, the total number of pulls for each bandit m is $\sum_k T_{mk}(n) = n/M$, while the number of pulls $T_{mk}(n)$ over the arms in bandit m is determined by UCB-E. Thus, the probability of error of this strategy may be bounded as\n\n$\ell_{Unif+UCB\text{-}E}(n) \le \sum_{m=1}^M 2nK \exp\big(-\frac{n/M - K}{18 H_m}\big) \le 2nMK \exp\big(-\frac{n/M - K}{18 \max_m H_m}\big)$,\n\nwhere the first inequality follows from Theorem 1 in [1] (recall that $H_m = \sum_k b^2/\Delta^2_{mk}$). 
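The relative size of the complexity terms entering these bounds can be checked numerically. The sketch below (ours; `arm_complexities` is a hypothetical helper) evaluates $MK \max_{m,k} H_{mk}$, $M \max_m H_m$, and $\sum_{m,k} H_{mk}$ on Problem 1 of Sec. 4.

```python
# Our numeric check that M*K*max_mk H_mk >= M*max_m H_m >= sum_mk H_mk,
# i.e., that GapE's exponent denominator is the smallest of the three.
def arm_complexities(means, b=1.0):
    out = []
    for mu in means:
        row = []
        for k, mu_k in enumerate(mu):
            delta = abs(max(x for j, x in enumerate(mu) if j != k) - mu_k)
            row.append(b * b / delta ** 2)        # H_mk = b^2 / Delta_mk^2
        out.append(row)
    return out

Hmk = arm_complexities([[0.5, 0.45, 0.4, 0.3], [0.5, 0.3, 0.2, 0.1]])
M, K = len(Hmk), len(Hmk[0])
unif_term = M * K * max(max(row) for row in Hmk)   # Uniform:      ~3200
ucbe_term = M * max(sum(row) for row in Hmk)       # Unif+UCB-E:   ~1850
gape_term = sum(sum(row) for row in Hmk)           # GapE:          ~992
assert unif_term >= ucbe_term >= gape_term
```

On this problem, the denominator governing GapE's exponent is roughly three times smaller than Uniform's, so GapE's bound decays roughly three times faster in n.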
Let b = 1 (i.e., all the arms have distributions bounded in [0, 1]); then, up to constants and multiplicative factors in front of the exponentials, and if n is large enough compared to M and K (so as to approximate n/M - K and n - K by n), the probabilities of error of the three algorithms may be bounded as\n\n$\ell_{Unif}(n) \le \exp\big(O\big(\frac{-n/MK}{\max_{m,k} H_{mk}}\big)\big)$, $\ell_{U+UCBE}(n) \le \exp\big(O\big(\frac{-n/M}{\max_m H_m}\big)\big)$, $\ell_{GapE}(n) \le \exp\big(O\big(\frac{-n}{\sum_{m,k} H_{mk}}\big)\big)$.\n\nBy comparing the arguments of the exponential terms, we have the trivial sequence of inequalities $MK \max_{m,k} H_{mk} \ge M \max_m \sum_k H_{mk} \ge \sum_{m,k} H_{mk}$, which implies that the upper bound on the probability of error of GapE is usually significantly smaller. This relationship, which is confirmed by the experiments reported in Sec. 4, shows that GapE is able to adapt to the complexity H of the overall multi-bandit problem better than the other two allocation strategies. In fact, while the performance of the Uniform strategy depends on the most complex arm over the bandits and the strategy Unif+UCB-E is affected by the most complex bandit, the performance of GapE depends on the sum of the complexities of all the arms involved in the pure exploration problem.\n\nProof of Theorem 1. Step 1. Let us consider the following event:\n\n$E = \big\{\forall m \in \{1,...,M\}, \forall k \in \{1,...,K\}, \forall t \in \{1,...,n\}: |\hat\mu_{mk}(t) - \mu_{mk}| < bc\sqrt{a/T_{mk}(t)}\big\}$.\n\nFrom Chernoff-Hoeffding's inequality and a union bound, we have $\mathbb{P}(E) \ge 1 - 2MKn \exp(-2ac^2)$. We now prove that on the event E we find the best arm in all the bandits, i.e., $J_m(n) = k^*_m$, $\forall m \in \{1,...,M\}$. Since $J_m(n)$ is the empirical best arm of bandit m, we should prove that for any $k \in \{1, \ldots
, K\}$, $\hat\mu_{mk}(n) \le \hat\mu_{mk^*_m}(n)$. By upper-bounding the LHS and lower-bounding the RHS of this inequality, we note that it is enough to prove $bc\sqrt{a/T_{mk}(n)} \le \Delta_{mk}/2$ on the event E, or equivalently, that for any bandit-arm pair (m, k) we have $T_{mk}(n) \ge 4ab^2c^2/\Delta^2_{mk}$.\n\nStep 2. In this step, we show that in GapE, for any bandits (m, q) and arms (k, j), and for any $t \ge MK$, the following dependence between the numbers of pulls of the arms holds:\n\n$-\Delta_{mk} + (1+d)\,b\sqrt{\frac{a}{\max(T_{mk}(t)-1,\,1)}} \ge -\Delta_{qj} + (1-d)\,b\sqrt{\frac{a}{T_{qj}(t)}}$,   (2)\n\nwhere $d \in [0, 1]$. We prove this inequality by induction.\n\nBase step. After the first MK rounds of the GapE algorithm, all the arms have been pulled once, i.e., $T_{mk}(t) = 1$, $\forall m, k$; thus if $a \ge 1/4d^2$, inequality (2) holds for t = MK.\n\nInductive step. Let us assume that (2) holds at time t-1 and that we pull arm i of bandit p at time t, i.e., I(t) = (p, i). At time t, inequality (2) then trivially holds for every choice of m, q, k, and j, except when (m, k) = (p, i). As a result, in the inductive step we only need to prove that for any $q \in \{1,...,M\}$ and $j \in \{1,...,K\}$\n\n$-\Delta_{pi} + (1+d)\,b\sqrt{\frac{a}{\max(T_{pi}(t)-1,\,1)}} \ge -\Delta_{qj} + (1-d)\,b\sqrt{\frac{a}{T_{qj}(t)}}$.   (3)\n\nSince arm i of bandit p has been pulled at time t, we have for any bandit-arm pair (q, j)\n\n$-\hat\Delta_{pi}(t-1) + b\sqrt{\frac{a}{T_{pi}(t-1)}} \ge -\hat\Delta_{qj}(t-1) + b\sqrt{\frac{a}{T_{qj}(t-1)}}$.   (4)\n\nTo prove (3), we first prove an upper-bound on $-\hat\Delta_{pi}(t-1)$ and a lower-bound on $-\hat\Delta_{qj}(t-1)$:\n\n$-\hat\Delta_{pi}(t-1) \le -\Delta_{pi} + \frac{2bc}{1-c}\sqrt{\frac{a}{T_{pi}(t-1)}}$ and $-\hat\Delta_{qj}(t-1) \ge -\Delta_{qj} - \frac{2\sqrt{2}bc}{1-d}\sqrt{\frac{a}{T_{qj}(t)}}$.   (5)\n\nWe report the proofs of the inequalities in (5) in App. B of [8]. Inequality (3), and as a result the inductive step, is proved by replacing $-\hat\Delta_{pi}(t-1)$ and $-\hat\Delta_{qj}(t-1)$ in (4) by (5), under the conditions that $d \ge \frac{2c}{1-c}$ and $d \ge \frac{2\sqrt{2}c}{1-d}$. 
These conditions are satis\ufb01ed by d = 1/2 and c = \u221a2/16.\nconditions that d \u2265 2c\nStep 3. In order to prove the condition of Tmk(n) in step 1, we need to \ufb01nd a lower-bound on the\nnumber of pulls of all the arms at time t = n (at the end). Let us assume that arm k of bandit m has\nbeen pulled less than ab2(1\u2212d)2\nTmk(n) > 0. From this\n+ 1\n\nresult and (2), we have \u2212\u2206qj + (1 + d)bq a\nfor any pair (q, j). We also know thatPq,j Tqj(n) = n. From these, we deduce that n \u2212 M K <\nab2(1 + d)2Pq,j\nthe \ufb01rst assumption that Tmk(n) < ab2(1\u2212d)2\nfor any pair\n(m, k), when 1 \u2212 d \u2265 2c. This concludes the proof. The condition for a in the statement of the\ntheorem comes from our choice of a in this step and the values of c and d from the inductive step.\n\n. So, if we select a such that n\u2212 M K \u2265 ab2(1 + d)2Pq,j\n\n, which indicates that \u2212\u2206mk + (1 \u2212 d)bq a\n\nTqj (n)\u22121 > 0, or equivalently Tqj(n) < ab2(1+d)2\n\n\u22062\nqj\n\n, which means that Tmk(n) \u2265 4ab2c2\n\n\u22062\n\n, we contradict\n\n1\n\u22062\nqj\n\n1\n\u22062\nqj\n\n\u22062\n\nmk\n\n\u22062\n\nmk\n\nmk\n\n3.2 Extensions\nIn this section we propose two variants on the GapE algorithm with the objective of extending its\napplicability and improving its performance.\n\n5\n\na\n\nmax(cid:0)Tpi(t) \u2212 1, 1(cid:1) \u2265 \u2212\u2206qj + (1 \u2212 d)br a\nTpi(t \u2212 1) \u2265 \u2212b\u2206qj(t \u2212 1) + br a\n\nTqj(t \u2212 1)\n\n.\n\n.\n\nTqj(t)\n\n(3)\n\n(4)\n\n\u2212\u2206pi + (1 + d)br\n\u2212b\u2206pi(t \u2212 1) + br a\n1 \u2212 cr a\n\n2bc\n\n\fGapE with variance (GapE-V). The allocation strategy implemented by GapE focuses only on the\narms with small gap and does not take into consideration their variance. However, it is clear that the\narms with small variance, even if their gap is small, just need a few pulls to be correctly estimated. 
In order to take into account both the gaps and the variances of the arms, we introduce the GapE-variance (GapE-V) algorithm. Let $\hat\sigma^2_{mk}(t) = \frac{1}{T_{mk}(t)-1} \sum_{s=1}^{T_{mk}(t)} X^2_{mk}(s) - \hat\mu^2_{mk}(t)$ be the estimated variance of arm k of bandit m at the end of round t. GapE-V uses the following B-index for each arm:\n\n$B_{mk}(t) = -\hat\Delta_{mk}(t-1) + \sqrt{\frac{2a\hat\sigma^2_{mk}(t-1)}{T_{mk}(t-1)}} + \frac{7ab}{3\big(T_{mk}(t-1)-1\big)}$.\n\nNote that the exploration term in the B-index now has two components: the first one depends on the empirical variance, and the second one decreases as $O(1/T_{mk})$. As a result, arms with low variance will be explored much less than in the GapE algorithm. Similar to the difference between UCB [3] and UCB-V [2], while the B-index in GapE is motivated by Hoeffding's inequality, the one for GapE-V is obtained using an empirical Bernstein's inequality [11, 2]. The following performance bound can be proved for the GapE-V algorithm; we report the proof of Theorem 2 in App. C of [8].\n\nTheorem 2. If GapE-V is run with parameter $0 < a \le \frac{8}{9}\frac{n - 2MK}{H^\sigma}$, then it satisfies\n\n$\ell(n) \le \mathbb{P}\big(\exists m : J_m(n) \ne k^*_m\big) \le 6nMK \exp\big(-\frac{9a}{64 \times 64}\big)$,\n\nin particular for $a = \frac{8}{9}\frac{n - 2MK}{H^\sigma}$, we have $\ell(n) \le 6nMK \exp\big(-\frac{1}{64 \times 8}\frac{n - 2MK}{H^\sigma}\big)$.\n\nIn Theorem 2, $H^\sigma$ is the complexity of the GapE-V algorithm and is defined as\n\n$H^\sigma = \sum_{m=1}^M \sum_{k=1}^K \frac{\big(\sigma_{mk} + \sqrt{\sigma^2_{mk} + (16/3)\,b\,\Delta_{mk}}\big)^2}{\Delta^2_{mk}}$.\n\nAlthough the variance-complexity $H^\sigma$ could be larger than the complexity H used in GapE, whenever the variances of the arms are small compared to the range b of the distributions, we expect $H^\sigma$ to be smaller than H. 
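The GapE-V index above can be sketched as follows (our illustration; `gape_v_index` is a hypothetical helper). The point of the empirical-Bernstein exploration term is that, for the same gap and the same number of pulls, a low-variance arm receives a much smaller exploration bonus than a high-variance one.

```python
# Our sketch of the GapE-V index:
# B_mk(t) = -gaphat + sqrt(2*a*varhat / T_mk) + 7*a*b / (3*(T_mk - 1)).
import math

def gape_v_index(gap_hat, var_hat, t_mk, a, b=1.0):
    """Empirical-Bernstein B-index for one bandit-arm pair."""
    if t_mk < 2:
        return float("inf")      # pull every arm at least twice first
    return (-gap_hat
            + math.sqrt(2 * a * var_hat / t_mk)
            + 7 * a * b / (3 * (t_mk - 1)))

# Same gap, same number of pulls, but very different empirical variances:
low = gape_v_index(gap_hat=0.05, var_hat=0.01, t_mk=50, a=2.0)
high = gape_v_index(gap_hat=0.05, var_hat=0.25, t_mk=50, a=2.0)
# The high-variance arm gets the larger index, hence more exploration.
```

This is exactly the mechanism by which GapE-V spends fewer pulls on arms whose values are already well estimated.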
Furthermore, if the arms have very different variances, then GapE-V is expected to better capture the complexity of each arm and to allocate the pulls accordingly. For instance, in the case where all the gaps are the same, GapE tends to allocate pulls proportionally to the complexity $H_{mk}$, and it would perform an almost uniform allocation over bandits and arms. On the other hand, the variances of the arms could be very heterogeneous, and GapE-V would adapt the allocation strategy by pulling more often the arms whose values are more uncertain.\n\nAdaptive GapE and GapE-V. A drawback of GapE and GapE-V is that the exploration parameter a should be tuned according to the complexities H and $H^\sigma$ of the multi-bandit problem, which are rarely known in advance. A straightforward solution to this issue is to move to an adaptive version of these algorithms by substituting H and $H^\sigma$ with suitable estimates $\hat H$ and $\hat H^\sigma$. At each step t of the adaptive GapE and GapE-V algorithms, we estimate these complexities as\n\n$\hat H(t) = \sum_{m,k} \frac{b^2}{UCB^{\Delta}_i(t)^2}$ and $\hat H^\sigma(t) = \sum_{m,k} \frac{\big(LCB^{\sigma}_i(t) + \sqrt{LCB^{\sigma}_i(t)^2 + (16/3)\,b \times UCB^{\Delta}_i(t)}\big)^2}{UCB^{\Delta}_i(t)^2}$, where\n\n$UCB^{\Delta}_i(t) = \hat\Delta_i(t-1) + \sqrt{\frac{1}{2T_i(t-1)}}$ and $LCB^{\sigma}_i(t) = \max\big(0,\, \hat\sigma_i(t-1) - \sqrt{\frac{2}{T_i(t-1)-1}}\big)$,\n\nand i denotes the bandit-arm pair (m, k). Similar to the adaptive version of UCB-E in [1], $\hat H$ and $\hat H^\sigma$ are lower-confidence bounds on the true complexities H and $H^\sigma$. Note that the GapE and GapE-V bounds written for the optimal value of a indicate an inverse relation between the complexity and the exploration. By using a lower-bound on the true H and $H^\sigma$, the algorithms tend to explore arms more uniformly, and this allows them to increase the accuracy of their estimated complexities. Although we do not analyze these algorithms, we empirically show in Sec. 
4 that they are in fact able to match the performance of the GapE and GapE-V algorithms.\n\n4 Numerical Simulations\nIn this section, we report numerical simulations of the gap-based algorithms presented in this paper, GapE and GapE-V, and their adaptive versions A-GapE and A-GapE-V, and compare them with the Unif and Unif+UCB-E algorithms introduced in Sec. 3.1.\n\nFigure 2: (left) Problem 1: Comparison between GapE, adaptive GapE, and the uniform strategies. (right) Problem 2: Comparison between GapE, GapE-V, and adaptive GapE-V algorithms. [Both panels plot the maximum probability of error against the parameter $\eta$.]\n\nFigure 3: Performance of the algorithms (Unif+UCBE, Unif+A-UCBE, Unif+UCBE-V, Unif+A-UCBE-V, GapE, A-GapE, GapE-V, A-GapE-V) in Problem 3, plotting the maximum probability of error against the parameter $\eta$.\n\nThe results of our experiments, both those in the paper and those in App. 
A of [8], indicate that 1) GapE successfully adapts its allocation strategy to the complexity of each bandit and outperforms the uniform allocation strategies, 2) the use of the empirical variance in GapE-V can significantly improve the performance over GapE, and 3) the adaptive versions of GapE and GapE-V, which estimate the complexities H and $H^\sigma$ online, attain the same performance as the basic algorithms, which receive H and $H^\sigma$ as an input.\n\nExperimental setting. We use the following three problems in our experiments. Note that b = 1 and that a Rademacher distribution with parameters (x, y) takes value x or y with probability 1/2.\n\n\u2022 Problem 1. n = 700, M = 2, K = 4. The arms have Bernoulli distributions with parameters: bandit 1 = (0.5, 0.45, 0.4, 0.3), bandit 2 = (0.5, 0.3, 0.2, 0.1).\n\u2022 Problem 2. n = 1000, M = 2, K = 4. The arms have Rademacher distributions with parameters (x, y): bandit 1 = {(0, 1.0), (0.45, 0.45), (0.25, 0.65), (0, 0.9)} and bandit 2 = {(0.4, 0.6), (0.45, 0.45), (0.35, 0.55), (0.25, 0.65)}.\n\u2022 Problem 3. n = 1400, M = 4, K = 4. The arms have Rademacher distributions with parameters (x, y): bandit 1 = {(0, 1.0), (0.45, 0.45), (0.25, 0.65), (0, 0.9)}, bandit 2 = {(0.4, 0.6), (0.45, 0.45), (0.35, 0.55), (0.25, 0.65)}, bandit 3 = {(0, 1.0), (0.45, 0.45), (0.25, 0.65), (0, 0.9)}, and bandit 4 = {(0.4, 0.6), (0.45, 0.45), (0.35, 0.55), (0.25, 0.65)}.\n\nAll the algorithms, except the uniform allocation, have an exploration parameter a. The theoretical analysis suggests that a should be proportional to n/H. Although a could be optimized according to the bound, since the constants in the analysis are not accurate, we run the algorithms with $a = \eta\,n/H$, where $\eta$ is a parameter that is empirically tuned (in the experiments we report four different values of $\eta$). 
If H correctly defines the complexity of the exploration problem (i.e., the number of samples needed to find the best arms with high probability), $\eta$ should simply correct the inaccuracy of the constants in the analysis, and thus the range of its nearly-optimal values should be constant across different problems. In Unif+UCB-E, UCB-E is run with a budget of n/M and the same parameter $\eta$ for all the bandits. Finally, we set $n \simeq H^\sigma$, since we expect $H^\sigma$ to roughly capture the number of pulls necessary to solve the pure exploration problem with high probability. In Figs. 2 and 3, we report the performance $\ell(n)$, i.e., the probability of not identifying the best arm in all the bandits after n rounds, of the gap-based algorithms as well as the Unif and Unif+UCB-E strategies. The results are averaged 
As a result, GapE has a probability of error of 15.7%, which represents a significant improvement over Unif+UCB-E.

The right panel of Fig. 2 compares the performance of GapE, GapE-V, and A-GapE-V in Problem 2. In this problem, all the gaps are equal (Δmk = 0.05), thus all the arms (and bandits) have the same complexity Hmk = 400. As a result, GapE tends to implement a nearly uniform allocation, which results in a small difference between Unif and GapE (28% and 25% accuracy, respectively). The reason why GapE is still able to improve over Unif may be explained by the difference between static and dynamic allocation strategies, and it is further investigated in App. A of [8]. Unlike the gaps, the variances of the arms are extremely heterogeneous. In fact, the variances of the arms in bandit 1 are larger than those in bandit 2, making that bandit harder to solve. This difference is captured by the definition of H^σ (H^σ_1 ≃ 1400 > H^σ_2 ≃ 600). Note also that H^σ ≤ H. As discussed in Sec. 3.2, since GapE-V takes into account the empirical variance of the arms, it is able to adapt to the complexity H^σ_mk of each bandit-arm pair and to focus more on uncertain arms. GapE-V improves the final accuracy by almost 10% w.r.t. GapE. From both panels of Fig. 2, we also notice that the adaptive algorithms achieve performance similar to their non-adaptive counterparts. Finally, we notice that a good choice of the parameter η for GapE-V is always close to 2 and 4 (see also [8] for additional experiments), while GapE needs η to be tuned more carefully, particularly in Problem 2, where large values of η try to compensate for the fact that H does not successfully capture the real complexity of the problem. This further strengthens the intuition that H^σ is a more accurate measure of the complexity of the multi-bandit pure exploration problem.

While Problems 1 and 2 are relatively simple, we report the results of the more complicated Problem 3 in Fig. 3. The experiment is designed so that the complexity w.r.t. the variance is strongly heterogeneous both across bandits and within each bandit. In this experiment, we also introduce UCBE-V, which extends UCB-E by taking into account the empirical variance, similarly to GapE-V. The results confirm the previous findings and show the improvement achieved by introducing empirical estimates of the variance and by allocating non-uniformly over the bandits.

5 Conclusion
In this paper, we studied the problem of best arm identification in a multi-bandit multi-armed setting. We introduced a gap-based exploration algorithm, called GapE, and proved an upper-bound on its probability of error. We then extended the basic algorithm to also take into account the variance of the arms and proved an upper-bound on the probability of error of the resulting algorithm, GapE-V. We also introduced adaptive versions of these algorithms that estimate the complexity of the problem online. The numerical simulations confirmed the theoretical findings that GapE and GapE-V outperform other allocation strategies, and that their adaptive counterparts are able to estimate the complexity without worsening the global performance.

Although GapE does not know the gaps, the experimental results reported in [8] indicate that it might outperform a static allocation strategy that knows the gaps in advance, thus suggesting that an adaptive strategy could perform better than a static one. This observation calls for further investigation.
Moreover, we plan to apply the algorithms introduced in this paper to the problem of rollout allocation for classification-based policy iteration in reinforcement learning [9, 6], where the goal is to identify the greedy action (arm) in each of the states (bandits) in a training set.

Acknowledgments Experiments presented in this paper were carried out using the Grid'5000 experimental testbed (https://www.grid5000.fr). This work was supported by the Ministry of Higher Education and Research, the Nord-Pas de Calais Regional Council and FEDER through the "contrat de projets état région 2007–2013", the French National Research Agency (ANR) under project LAMPADA n◦ ANR-09-EMER-007, the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n◦ 231495, and the PASCAL2 European Network of Excellence.

References

[1] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Proceedings of the Twenty-Third Annual Conference on Learning Theory, pages 41–53, 2010.

[2] J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory, volume 4754 of Lecture Notes in Computer Science, pages 150–165. Springer, 2007.

[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256, 2002.

[4] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandit problems. In Proceedings of the Twentieth International Conference on Algorithmic Learning Theory, pages 23–37, 2009.

[5] K. Deng, J. Pineau, and S. Murphy. Active learning for personalizing treatment. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2011.

[6] C. Dimitrakakis and M. Lagoudakis.
Rollout sampling approximate policy iteration. Machine Learning, 72(3):157–171, 2008.

[7] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

[8] V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck. Multi-bandit best arm identification. Technical Report 00632523, INRIA, 2011.

[9] M. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the Twentieth International Conference on Machine Learning, pages 424–431, 2003.

[10] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Proceedings of Advances in Neural Information Processing Systems 6, 1993.

[11] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In Proceedings of the Twenty-Second Annual Conference on Learning Theory, 2009.

[12] V. Mnih, Cs. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pages 672–679, 2008.

[13] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.