{"title": "Are sample means in multi-armed bandits positively or negatively biased?", "book": "Advances in Neural Information Processing Systems", "page_first": 7102, "page_last": 7111, "abstract": "It is well known that in stochastic multi-armed bandits (MAB), the sample mean of an arm is typically not an unbiased estimator of its true mean. In this paper, we decouple three different sources of this selection bias: adaptive \\emph{sampling} of arms, adaptive \\emph{stopping} of the experiment, and adaptively \\emph{choosing} which arm to study. Through a new notion called ``optimism'' that captures certain natural monotonic behaviors of algorithms, we provide a clean and unified analysis of how optimistic rules affect the sign of the bias. The main takeaway message is that optimistic sampling induces a negative bias, but optimistic stopping and optimistic choosing both induce a positive bias. These results are derived in a general stochastic MAB setup that is entirely agnostic to the final aim of the experiment (regret minimization or best-arm identification or anything else). We provide examples of optimistic rules of each type, demonstrate that simulations confirm our theoretical predictions, and pose some natural but hard open problems.", "full_text": "Are sample means in multi-armed bandits\n\npositively or negatively biased?\n\nJaehyeok Shin1, Aaditya Ramdas1,2 and Alessandro Rinaldo1\n\nDepartment of Statistics and Data Science1\n\nMachine Learning Department2\n\nCarnegie Mellon University\n\n{shinjaehyeok, aramdas, arinaldo}@cmu.edu\n\nAbstract\n\nIt is well known that in stochastic multi-armed bandits (MAB), the sample mean of\nan arm is typically not an unbiased estimator of its true mean. 
In this paper, we\ndecouple three different sources of this selection bias: adaptive sampling of arms,\nadaptive stopping of the experiment, and adaptively choosing which arm to study.\nThrough a new notion called \u201coptimism\u201d that captures certain natural monotonic\nbehaviors of algorithms, we provide a clean and uni\ufb01ed analysis of how optimistic\nrules affect the sign of the bias. The main takeaway message is that optimistic\nsampling induces a negative bias, but optimistic stopping and optimistic choosing\nboth induce a positive bias. These results are derived in a general stochastic MAB\nsetup that is entirely agnostic to the \ufb01nal aim of the experiment (regret minimization\nor best-arm identi\ufb01cation or anything else). We provide examples of optimistic\nrules of each type, demonstrate that simulations con\ufb01rm our theoretical predictions,\nand pose some natural but hard open problems.\n\n1\n\nIntroduction\n\nMean estimation is one of the most fundamental problems in statistics. In the classic nonadaptive\nsetting, we observe a \ufb01xed number of samples drawn i.i.d. from a \ufb01xed distribution with an unknown\nmean \u00b5. In this case, we know that the sample mean is an unbiased estimator of \u00b5.\nHowever, in many cases the data are collected and analyzed in an adaptive manner, a prototypical\nexample being the stochastic multi-armed bandits (MAB) framework [Robbins, 1952]. During the\ndata collection stage, in each round an analyst can draw a sample from one among a \ufb01nite set of\navailable distributions (arms) based on the previously observed data (adaptive sampling). The data\ncollecting procedure can also be terminated based on a data-driven stopping rule rather than at a \ufb01xed\ntime (adaptive stopping). Further, the analyst can choose a speci\ufb01c target arm based on the collected\ndata (adaptive choosing), for example choosing to focus on the arm with the largest empirical mean\nat the stopping time. 
In this setting, the sample mean is no longer unbiased, due to the selection bias\nintroduced by all three kinds of adaptivity. In this paper, we provide a comprehensive understanding\nof the sign of the bias, decoupling the effects of these three sources of adaptivity.\nIn a general and uni\ufb01ed MAB framework, we \ufb01rst de\ufb01ne natural notions of monotonicity (a special\ncase of which we call \u201coptimism\u201d) of sampling, stopping and choosing rules. Under no assumptions\non the distributions beyond assuming that their means exist, we show that optimistic sampling\nprovably results in a negative bias, but optimistic stopping and optimistic choosing both provably\nresult in a positive bias. Thus, the net bias can be positive or negative in general. This message is in\ncontrast to a recent thought-provoking work by Nie et al. [2018] titled \u201cWhy adaptively collected\ndata has a negative bias...\u201d that is unfortunately misleading for practitioners, since it only analyzed\nthe bias of adaptive sampling for a \ufb01xed arm at a \ufb01xed time.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fAs a concrete example, consider an of\ufb02ine analysis of data that was collected by an MAB algorithm\n(with any aim). Suppose that a practitioner wants to estimate the mean reward of some of the better\narms that were picked more frequently by the algorithm. Nie et al. [2018] proved that the sample\nmean of each arm is negatively biased under fairly common adaptive sampling rules. Although\nthis result is applicable only to a \ufb01xed arm at a \ufb01xed time, it could instill a possibly false sense of\ncomfort with sample mean estimates since the practitioner might possibly think that sample means\nare underestimating the effect size. However, we prove that if the algorithm was adaptively stopped\nand the arm index was adaptively picked, then the net bias can actually be positive. 
Indeed, we prove\nthat this is the case for the lil\u2019UCB algorithm (Corollary 8), but it is likely true more generally as\ncaptured by our main theorem. Thus, the sample mean may actually overestimate the effect size. This\nis an important and general phenomenon for both theoreticians (to study further and quantify) and for\npractitioners (to pay heed to) because if a particular arm is later deployed in practice, it may yield a\nlower reward than was possibly expected from the of\ufb02ine analysis.\n\nRelated work and our contributions. Adaptive mean estimation, in each of the three senses\ndescribed above, has received much attention in both recent and past literature. Below, we discuss\nhow our work relates to past work, proceeding one notion at a time in approximate historical order.\nWe begin by noting that a single-armed bandit is simply a random walk, where adaptive stopping has\nbeen extensively studied. The book by Gut [2009] on stopped random walks is an excellent reference,\nsummarizing almost 60 years of advances in sequential analysis. Most of these extensive results\non random walks have not been extended to the MAB setting, which naturally involves adaptive\nsampling and choosing. Of particular relevance is the paper by Starr and Woodroofe [1968] on the\nsign of the bias under adaptive stopping, whose work is subsumed by ours in two ways: we not only\nextend their insights to the MAB setting, but even for the one-armed setting, our results generalize\ntheirs.\nCharacterizing the sign of the bias of the sample mean under adaptive sampling has been a recent\ntopic of interest due to a surge in practical applications. While estimating MAB ad revenues, Xu\net al. [2013] gave an informal argument of why the sample mean is negatively biased for \u201coptimistic\u201d\nalgorithms. Later, Villar et al. [2015] encountered this negative bias in a simulation study motivated\nby using MAB for clinical trials. 
Most recently, Bowden and Trippa [2017] derived an exact formula\nfor the bias and Nie et al. [2018] formally provided conditions under which the bias is negative. Our\nresults on \u201coptimistic\u201d sampling inducing a negative bias generalize the corresponding results in\nthese past works.\nMost importantly, however, these past results hold only at a predetermined time and for a \ufb01xed arm.\nHere, we put forth a complementary viewpoint that \u201coptimistic\u201d stopping and choosing induces a\npositive bias. Indeed, one of our central conceptual contributions is an appropriate and crisp de\ufb01nition\nof \u201cmonotonicity\u201d and \u201coptimism\u201d (De\ufb01nition 1), that enables a clean and general analysis.\nOur main theoretical result, Theorem 7, allows the determination of the sign of the bias in several\ninteresting settings. Importantly, the bias may be of any sign when optimistic sampling, stopping and\nchoosing are all employed together. We demonstrate the practical validity of our theory using some\nsimulations that yield interesting insights in their own right.\nThe rest of this paper is organized as follows. In Section 2, we brie\ufb02y formalize the three notions\nof adaptivity by introducing a stochastic MAB framework. Section 3 derives results on when the\nbias can be positive or negative. In Section 4, we demonstrate the correctness of our theoretical\npredictions through simulations in a variety of practical situations. We end with a brief summary in\nSection 5, and for reasons of space, we defer all proofs to the Appendix.\n\n2 The stochastic MAB framework\nLet P1, . . . 
, PK be K distributions of interest (also called arms) with finite means µk = E_{Y∼Pk}[Y]. Every inequality and equality between two random variables is understood in the almost sure sense.

2.1 Formalizing the three notions of adaptivity

For those not familiar with MAB algorithms, Lattimore and Szepesvári [2019] is a good reference. The following general problem setup is critical in the rest of the paper:

• Let W−1 denote all external sources of randomness that are independent of everything else. Draw an initial random seed W0 ∼ U[0, 1], and set t = 1.

• At time t, let Dt−1 be the data we have so far, which is given by Dt−1 := {A1, Y1, . . . , At−1, Yt−1}, where As is the (random) index of the arm sampled at time s and Ys is the observation from arm As. Based on the previous data (and possibly an external source of randomness), let νt(k | Dt−1) ∈ [0, 1] be the conditional probability of sampling the k-th arm, for all k ∈ [K] := {1, . . . , K}, with Σ_{k=1}^K νt(k | Dt−1) = 1. Different choices for νt capture commonly used methods such as random allocation, ε-greedy [Sutton and Barto, 1998], upper confidence bound algorithms [Auer et al., 2002, Audibert and Bubeck, 2009, Garivier and Cappé, 2011, Kalyanakrishnan et al., 2012, Jamieson et al., 2014] and Thompson sampling [Thompson, 1933, Agrawal and Goyal, 2012, Kaufmann et al., 2012].

• If Wt−1 ∈ ( Σ_{j=1}^{k−1} νt(j | Dt−1), Σ_{j=1}^{k} νt(j | Dt−1) ) for some k ∈ [K], then set At = k, which is equivalent to sampling At from a multinomial distribution with probabilities {νt(k | Dt−1)}_{k=1}^K. Let Yt be a fresh independent draw from distribution Pk. This yields a natural filtration {Ft}, defined, starting with F0 = σ(W−1, W0), as Ft := σ(W−1, W0, Y1, W1, . . . , Yt, Wt), ∀t ≥ 1. Then {Yt} is adapted to {Ft}, and {At}, {νt} are predictable with respect to {Ft}.

• For each k ∈ [K] and t ≥ 1, define the running sum and number of draws for arm k as Sk(t) := Σ_{s=1}^t 1(As = k)Ys and Nk(t) := Σ_{s=1}^t 1(As = k). Assuming that arm k is sampled at least once, we define the sample mean for arm k as µ̂k(t) := Sk(t)/Nk(t). Then {Sk(t)}, {µ̂k(t)} are adapted to {Ft} and {Nk(t)} is predictable with respect to {Ft}.

• Let T be a stopping time with respect to {Ft}. If T is nonadaptively chosen, it is denoted T. If t < T, draw a random seed Wt ∼ U[0, 1] for the next round, and increment t. Else, return the collected data DT = {A1, Y1, . . . , AT, YT} ∈ FT.

• After stopping, choose a data-dependent arm based on a possibly randomized rule κ : DT ∪ {W−1} ↦ [K]; we denote the index κ(DT ∪ {W−1}) as just κ for short, so that the target of estimation is µκ. Note that κ ∈ FT, but when κ is nonadaptively chosen (independent of FT), we call it a fixed arm and denote it as k.

The phrase "fully adaptive setting" refers to the scenario of running an adaptive sampling algorithm until an adaptive stopping time T, and asking about the sample mean of an adaptively chosen arm κ. When we are not in the fully adaptive setting, we explicitly mention which aspects are adaptive.

2.2 The tabular perspective on stochastic MABs

It will be useful to imagine the above fully adaptive MAB experiment using an N×K table X∗_∞, whose rows index time and columns index arms. Here, we put an asterisk to clarify that it is counterfactual and not necessarily observable.
We imagine this entire table to be populated even before the MAB experiment starts, where for every i ∈ N, k ∈ [K], the (i, k)-th entry of the table contains an independent draw from Pk called X∗_{i,k}. At each step, our observation Yt corresponds to the element X∗_{N_{At}(t), At}. Finally, we denote D∗_∞ = X∗_∞ ∪ {W−1, W0, . . . , Wt, . . .}.

Given the above tabular MAB setup (which is statistically indistinguishable from the setup described in the previous subsection), one may then find deterministic functions f_{t,k} and f∗_k such that

Nk(T) = Σ_{t≥1} 1(At = k) 1(T ≥ t) = Σ_{t≥1} f_{t,k}(Dt−1) ≡ f∗_k(D∗_∞),   (1)

where each summand f_{t,k}(Dt−1) is Ft−1-measurable. Specifically, the function f_{t,k}(·) evaluates to one if and only if we do not stop at time t − 1, and pull arm k at time t. Indeed, given D∗_∞, the stopping time T is deterministic and so is the number of times Nk(T) that a fixed arm k is pulled, and this is what f∗_k captures. Along the same lines, the number of draws from a chosen arm κ at stopping time T can be written in terms of the tabular data as

Nκ(T) = Σ_{k=1}^K 1(κ = k) Nk(T) ≡ Σ_{k=1}^K g∗_k(D∗_∞) f∗_k(D∗_∞)   (2)

for some deterministic set of functions {g∗_k}. Indeed, g∗_k evaluates to one if and only if, after stopping, we choose arm k, which is a fully deterministic choice given D∗_∞.

3 The sign of the bias under adaptive sampling, stopping and choosing

3.1 Examples of positive bias due to "optimistic" stopping or choosing

In MAB problems, collecting higher rewards is a common objective of adaptive sampling strategies, and hence they are often designed to sample more frequently from a distribution which has a larger sample mean than the others. Nie et al. [2018] proved that the bias of the sample mean for any fixed arm and at any fixed time is negative when the sampling strategy satisfies two conditions called "Exploit" and "Independence of Irrelevant Options" (IIO). However, the emphasis on fixed is important: their conditions are not enough to determine the sign of the bias under adaptive stopping or choosing, even in the simple nonadaptive sampling setting. Before formally defining our crucial notions of "optimism" in the next subsection, it is instructive to look at some examples.

Example 1. Suppose we continuously alternate between drawing a sample from each of two Bernoulli distributions with mean parameters µ1, µ2 ∈ (0, 1). This sampling strategy is fully deterministic, and thus it satisfies the Exploit and IIO conditions in Nie et al. [2018]. For any fixed time t, the bias equals zero for both sample means. Define a stopping time T as the first time we observe +1 from the first arm.
Then the sample size of the first arm, N1(T), follows a geometric distribution with parameter µ1, which implies that the bias of µ̂1(T) is

E[µ̂1(T) − µ1] = E[1/N1(T)] − µ1 = µ1 log(1/µ1)/(1 − µ1) − µ1,

which is positive for all µ1 ∈ (0, 1).

This example shows that for nonadaptive sampling, adaptive stopping can induce a positive bias. In fact, this example is not atypical, but is an instance of a more general phenomenon explored in the one-armed setting in sequential analysis. For example, Siegmund [1978, Ch. 3] contains the following classical result for a Brownian motion W(t) with positive drift µ > 0.

Example 2. If we define a stopping time as the first time W(t) exceeds a line with slope η and intercept b > 0, that is, TB := inf{t ≥ 0 : W(t) ≥ ηt + b}, then for any slope η ≤ µ, we have E[W(TB)/TB − µ] = 1/b. Note that a sum of Gaussians with mean µ behaves like a time-discretization of a Brownian motion with drift µ; since EW(t) = tµ, we may interpret W(TB)/TB as a stopped sample mean, and the last equation implies that its bias is 1/b, which is positive.

Generalizing further, Starr and Woodroofe [1968] proved the following remarkable result.

Example 3. If we stop when the sample mean crosses any predetermined upper boundary, the stopped sample mean is always positively biased (whenever the stopping time is a.s. finite). Explicitly, choosing any arbitrary sequence of real-valued constants {ct}, define Tc := inf{t : µ̂1(t) > ct}; then as long as the observations Xi have a finite mean and Tc is a.s. finite, we have E[µ̂1(Tc)] − µ1 > 0.

Surprisingly, we will generalize the above strong result even further.
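Example 1 is easy to check numerically. The following minimal sketch (ours, not from the paper; arm 2 is irrelevant to arm 1's sample mean and is omitted) estimates the bias of µ̂1(T) by Monte Carlo and compares it with the closed form µ1 log(1/µ1)/(1 − µ1) − µ1.

```python
import math
import random

def example1_bias(mu1=0.3, trials=100_000, seed=0):
    """Example 1: stop at the first +1 from arm 1, so S1(T) = 1 and
    N1(T) ~ Geometric(mu1), giving mu1_hat(T) = 1/N1(T)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        n1 = 1
        while rng.random() >= mu1:  # keep drawing from arm 1 until a success
            n1 += 1
        total += 1.0 / n1           # stopped sample mean of arm 1
    return total / trials - mu1     # Monte Carlo estimate of the bias

# Closed form from Example 1, positive for all mu1 in (0, 1):
closed_form = 0.3 * math.log(1 / 0.3) / (1 - 0.3) - 0.3
```

For µ1 = 0.3 the closed form is about 0.216, and the Monte Carlo estimate agrees with it to within Monte Carlo error, confirming the positive bias induced by this optimistic stopping rule.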
Additionally, stopping times in the MAB literature can be thought of as extensions of Tc and TB to a setting with multiple arms, and we will prove that the bias they induce will indeed still be positive. We end with an example of the positive bias induced by "optimistic" choosing:

Example 4. Given K standard normals {Zk} (to be thought of as one sample from each of K arms), let κ = argmax_k Zk; that is, we choose the arm with the largest observation. It is well known that E[Zκ] = E[max_{k∈[K]} Zk] ≍ √(2 log K). Since EZk = 0 for all k, but EZκ > 0, the "optimistic" choice κ induces a positive bias.

In many typical MAB settings, we should expect sample means to have two contradictory sources of bias: negative bias from "optimistic sampling" and positive bias from "optimistic stopping/choosing".

3.2 Positive or negative bias under monotonic sampling, stopping and choosing

Based on the expression (2), we formally state a characteristic of data collecting strategies which fully determines the sign of the bias as follows.

Definition 1. A data collecting strategy is "monotonically increasing (or decreasing)" if for any i ∈ N and k ∈ [K], the function D∗_∞ ↦ g∗_k(D∗_∞)/f∗_k(D∗_∞) ≡ 1(κ = k)/Nk(T) is an increasing (or decreasing) function of X∗_{i,k} while keeping all other entries in D∗_∞ fixed. Further, we say that

• a data collecting strategy has an optimistic sampling rule if the function D∗_∞ ↦ Nk(t) is an increasing function of X∗_{i,k} while keeping all other entries in D∗_∞ fixed, for any fixed i ∈ N, t ≥ 1 and k ∈ [K];

• a data collecting strategy has an optimistic stopping rule if D∗_∞ ↦ T is a decreasing function of X∗_{i,k} while keeping all other entries in D∗_∞ fixed, for any fixed i ∈ N and k ∈ [K];

• a data collecting strategy has an optimistic choosing rule if D∗_∞ ↦ 1(κ = k) is an increasing function of X∗_{i,k} while keeping all other entries in D∗_∞ fixed, for any fixed i ∈ N and k ∈ [K].

Note that if a data collecting strategy has an optimistic sampling (or stopping, or choosing) rule, with the other components being nonadaptive, then the strategy is monotonically decreasing (respectively, increasing, increasing). We remark that nonadaptive just means independent of the entries X∗_{i,k}, but it is not necessarily deterministic¹. The above definition warrants some discussion to provide intuition.

Roughly speaking, under optimistic stopping, if a sample from the k-th distribution were increased while keeping all other values fixed, the algorithm would reach its termination criterion sooner. For instance, TB from Example 2 and the criterion in Example 1 are both optimistic stopping rules. Most importantly, boundary-crossing is optimistic:

Fact 1.
The general boundary-crossing stopping rule of Starr and Woodroofe [1968], denoted Tc in Example 3, is an optimistic stopping rule (and hence optimistic stopping is a weaker condition).

Optimistic stopping rules do not need to be based on the sample mean; for example, if {ct} is an arbitrary sequence, then T := inf{t ≥ 3 : Xt + Xt−2 ≥ ct} is an optimistic stopping rule. In fact, Tℓ := inf{t ≥ 3 : ℓt(X1, . . . , Xt) ≥ ct} is optimistic, as long as each ℓt is coordinatewise nondecreasing.

For optimistic choosing, the previously discussed argmax rule (Example 4) is optimistic. More generally, it is easy to verify the following:

Fact 2. For any probabilities p1 ≥ p2 ≥ · · · ≥ pK that sum to one, a rule that chooses the arm with the k-th largest empirical mean with probability pk is an optimistic choosing rule.

Turning to the intuition for optimistic sampling: if a sample from the k-th distribution were increased while keeping all other values fixed, the algorithm would sample the k-th arm more often. We claim that optimistic sampling is a weaker condition than the Exploit and IIO conditions employed by Nie et al. [2018].

Fact 3. The "Exploit" and "IIO" conditions in Nie et al. [2018] together imply that the sampling strategy is optimistic (and hence optimistic sampling is a weaker condition). Further, as summarized in Appendix A, ε-greedy, UCB and Thompson sampling (Gaussian-Gaussian and Beta-Bernoulli, for instance) are all optimistic sampling methods.

For completeness, we prove the first part formally in Appendix A.2, which builds heavily on observations already made in the proof of Theorem 1 in Nie et al. [2018]. Beyond the instances mentioned above, Corollary 10 in the supplement captures a sufficient condition for Thompson sampling with one-dimensional exponential families and conjugate priors to be optimistic.
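To build intuition for the negative bias of optimistic sampling at a fixed time, consider the following minimal sketch of our own (not from the paper): two standard normal arms are pulled once each, a single third pull goes greedily to the arm with the larger observation, and we record arm 1's sample mean at the fixed time t = 3. A short direct computation gives the exact bias −1/(4√π) ≈ −0.141.

```python
import math
import random

def greedy_fixed_time_bias(trials=200_000, seed=1):
    """Two N(0,1) arms; pull each once, give one extra greedy pull to the
    arm with the larger first observation, then report arm 1's sample mean
    at the fixed time t = 3 (the true mean is 0, so the average IS the bias)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        if z1 > z2:
            total += (z1 + rng.gauss(0, 1)) / 2  # greedy pull went to arm 1
        else:
            total += z1                          # arm 1 keeps its single draw
    return total / trials
```

The Monte Carlo estimate concentrates near −1/(4√π), illustrating how even a single greedy (optimistic) allocation decision pushes the fixed-time sample mean downward.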
We now provide an expression for the bias that holds at any stopping time and for any sampling algorithm.

¹An example of a random but nonadaptive stopping rule: flip a (potentially biased) coin at each step to decide whether to stop. An example of a random but nonadaptive sampling rule: with probability half pick a uniformly random arm, and with probability half pick the arm that has been sampled most often thus far.

Proposition 5. Let T be a stopping time with respect to the natural filtration {Ft}. For each fixed k ∈ [K] such that 0 < ENk(T) < ∞, the bias of µ̂k(T) is given as

E[µ̂k(T) − µk] = −Cov(µ̂k(T), Nk(T)) / E[Nk(T)].   (3)

The proof may be found in Appendix B.3. A similar expression was derived in Bowden and Trippa [2017], but only for a fixed time T. In order to extend it to stopping times (that are allowed to be infinite, as long as ENk(T) < ∞), we derive a simple generalization of Wald's first identity to the MAB setting. Specifically, recalling that Sk(t) = µ̂k(t)Nk(t), we show the following:

Lemma 6. Let T be a stopping time with respect to the natural filtration {Ft}. For each fixed k ∈ [K] such that ENk(T) < ∞, we have E[Sk(T)] = µk E[Nk(T)].

This lemma is also proved in Appendix B.3. Proposition 5 provides a simple, and somewhat intuitive, expression of the bias for each arm. It implies that if the covariance of the sample mean of an arm and the number of times it was sampled is positive (negative), then the bias is negative (positive). We now formalize this intuition below, including for adaptively chosen arms. The following theorem shows that if the adaptive sampling, stopping and choosing rules are monotonically increasing (or decreasing), then the sample mean is positively (or negatively) biased.

Theorem 7.
Let T be a stopping time with respect to the natural filtration {Ft} and let κ : DT ↦ [K] be a choosing rule. Suppose each arm has finite expectation and, for all k with P(κ = k) > 0, we have E[Nk(T)] < ∞ and Nk(T) ≥ 1. If the data collecting strategy is monotonically decreasing, for example under optimistic sampling with nonadaptive stopping and choosing, then we have

E[µ̂κ(T) | κ = k] ≤ µk, ∀k : P(κ = k) > 0,   (4)

which also implies that

E[µ̂κ(T) − µκ] ≤ 0.   (5)

Similarly, if the data collecting strategy is monotonically increasing, for example under optimistic stopping with nonadaptive sampling and choosing, or under optimistic choosing with nonadaptive sampling and stopping, then we have

E[µ̂κ(T) | κ = k] ≥ µk, ∀k : P(κ = k) > 0,   (6)

which also implies that

E[µ̂κ(T) − µκ] ≥ 0.   (7)

If each arm has a bounded distribution, then the condition E[Nk(T)] < ∞ can be dropped.

Remark 1. In fact, if each arm has a finite p-th moment for a fixed p > 2, then the condition E[Nk(T)] < ∞ can be dropped.

The proofs of Theorem 7 and Remark 1 can be found in Appendix B.1 and are based on martingale arguments that are quite different from the ones used in Nie et al. [2018]. See also Appendix A.4 for an intuitive explanation of the sign of the bias under optimistic sampling, stopping or choosing rules. The expression (3) intuitively suggests situations in which the sample mean estimator µ̂k(T) is biased, while the inequalities in (4) and (6) determine the direction of the bias under the monotonic or optimistic conditions. Due to Facts 1, 2 and 3, several existing results are immediately subsumed and generalized by Theorem 7.
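As a numerical sanity check on Proposition 5 and Lemma 6, the following sketch of our own (greedy sampling over two Bernoulli arms at a fixed horizon, an instance of optimistic sampling; the setup and parameter values are illustrative) compares the empirical bias of arm 1's sample mean against the covariance expression on the right-hand side of formula (3), estimating both from the same simulated runs.

```python
import random

def check_bias_formula(mu=(0.6, 0.4), horizon=20, trials=100_000, seed=2):
    """Greedy sampling over two Bernoulli arms (one forced initial pull
    each) at a fixed horizon T. Returns arm 1's empirical bias and an
    empirical estimate of -Cov(mu1_hat(T), N1(T)) / E[N1(T)]."""
    rng = random.Random(seed)
    mu_hats, n1s = [], []
    for _ in range(trials):
        s, n = [0, 0], [0, 0]
        for t in range(horizon):
            if t < 2:
                a = t  # forced initial pull of each arm
            else:      # greedy: pull the arm with the larger sample mean
                a = 0 if s[0] / n[0] >= s[1] / n[1] else 1
            s[a] += rng.random() < mu[a]  # Bernoulli draw (True counts as 1)
            n[a] += 1
        mu_hats.append(s[0] / n[0])
        n1s.append(n[0])
    m_mu = sum(mu_hats) / trials
    m_n = sum(n1s) / trials
    cov = sum(x * y for x, y in zip(mu_hats, n1s)) / trials - m_mu * m_n
    return m_mu - mu[0], -cov / m_n
```

The two returned quantities agree up to Monte Carlo error, and both are negative, as Theorem 7 predicts for optimistic sampling with nonadaptive stopping and choosing.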
Further, the following corollary is a particularly interesting special case, dealing with the lil'UCB algorithm by Jamieson et al. [2014], which uses adaptive sampling, stopping and choosing, as summarized in Section 4.3.

Corollary 8. The lil'UCB algorithm is a monotonically increasing strategy, and thus the sample mean of the reported arm when lil'UCB stops is always positively biased.

The proof is described in Appendix B.2. The above result is interesting for the following reasons: (a) when viewed separately, the sampling, stopping and choosing rules of the lil'UCB algorithm all seem to be optimistic (however, they are not optimistic in the sense of Definition 1, because our definition requires two out of the three components to be nonadaptive); hence it is a priori unclear which rule dominates and whether the net bias should be positive or negative; (b) we did not have to alter anything about the algorithm in order to prove that it is a monotonically increasing strategy (for any distribution over arms, for any number of arms). The generality of the above result showcases the practical utility of our theorem, whose message is in sharp contrast to the title of the paper by Nie et al. [2018].

Next, we provide simulation results verifying that our monotonic and optimistic conditions accurately capture the sign of the bias of the sample mean.

4 Numerical experiments

4.1 Negative bias from optimistic sampling rules in multi-armed bandits

Recall Fact 3, which stated that common MAB adaptive sampling strategies like greedy (or ε-greedy), upper confidence bound (UCB) and Thompson sampling are optimistic. Thus, for a deterministic stopping time, Theorem 7 implies that the sample mean of each arm is always negatively biased. To demonstrate this, we conduct a simulation study in which we have three unit-variance Gaussian arms with µ1 = 1, µ2 = 2 and µ3 = 3.
After sampling once from each arm, greedy, UCB and Thompson sampling are used to continue sampling until T = 200. We repeat the whole process from scratch 10⁴ times for each algorithm to get an accurate estimate of the bias.² Due to limited space, we present results from UCB and Thompson sampling only, but detailed configurations of the algorithms and a similar result for the greedy algorithm can be found in Appendix C.1. Figure 1 shows the distribution of observed differences between sample means and the true mean for each arm. Vertical lines correspond to biases. The example demonstrates that the sample mean is negatively biased under optimistic sampling rules.

Remark 2. The main goal of our simulations is to visualize and corroborate our theoretical results about the sign of the bias. As a result, we make no attempt to optimize the parameters of UCB or Thompson sampling for the purpose of minimizing regret, since the latter is not this paper's aim. However, investigating the relationship between the performance of MAB algorithms and the bias at the time horizon would be an interesting future direction of research.

Figure 1: Data is collected by the UCB (left) and Thompson sampling (right) algorithms from three unit-variance Gaussian arms with µ1 = 1, µ2 = 2 and µ3 = 3. For all three arms, the sample means are negatively biased (at fixed times). A similar result for the greedy algorithm can be found in Appendix C.1.

4.2 Bias from stopping a one-sided sequential likelihood ratio test

Suppose we have two independent sub-Gaussian arms with common and known parameter σ² but unknown means µ1 and µ2. Consider the following testing problem:

H0 : µ1 ≤ µ2 vs. H1 : µ1 > µ2.

To test this hypothesis, suppose we draw a sample from arm 1 at every odd time and from arm 2 at every even time.
Instead of conducting a test at a fixed time, we can use the following one-sided sequential likelihood ratio test [Robbins, 1970, Howard et al., 2018]: for any fixed w > 0 and α ∈ (0, 1), define a stopping time T_w as

T_w := inf{ t ∈ N_even : µ̂1(t) − µ̂2(t) ≥ (2σ/t) √( (t + 2w) log( (1/(2α)) √((t + 2w)/(2w)) + 1 ) ) },   (8)

where N_even := {2n : n ∈ N}. For a given fixed maximum even time M ≥ 2, we stop sampling at time T_w^M := min{T_w, M}. Then, we reject the null H0 if T_w^M < M. It can be checked [Howard et al., 2018, Section 8] that, for any fixed w > 0, this test controls the type-1 error at level α, and the power goes to 1 as M goes to infinity.

²In all experiments, the sizes of the reported biases are larger than 3 times the Monte Carlo standard error.

For arms 1 and 2, these are special cases of optimistic and pessimistic stopping rules, respectively. From Theorem 7, we have that µ1 ≤ Eµ̂1(T_w^M) and µ2 ≥ Eµ̂2(T_w^M). To demonstrate this, we conduct two simulation studies with unit-variance Gaussian errors: one under the null hypothesis (µ1, µ2) = (0, 0), and one under the alternative hypothesis (µ1, µ2) = (1, 0). We choose M = 200, w = 10 and α = 0.1. As before, we repeat each experiment 10⁴ times for each setting. Figure 2 shows the distribution of observed differences between sample means and the true mean for each arm under the null and alternative hypothesis cases.
Vertical lines correspond to biases. The simulation study demonstrates that the sample mean for arm 1 is positively biased and the sample mean for arm 2 is negatively biased, as predicted.

Figure 2: Data is collected from the one-sided sequential likelihood ratio test procedure described in Section 4.2. The sample mean for arm 1 is positively biased and the sample mean for arm 2 is negatively biased under both the null and alternative hypothesis cases. Note that the size of the bias under the null hypothesis is smaller than the one under the alternative hypothesis, since the number of collected samples is larger under the null hypothesis.

4.3 Positive bias of the lil'UCB algorithm in best-arm identification

Suppose we have K sub-Gaussian arms with means μ1, . . . , μK and known parameter σ. In the best-arm identification problem, our target of inference is the arm with the largest mean. There exist many algorithms for this task, including lil'UCB [Jamieson et al., 2014], Top-Two Thompson Sampling [Russo, 2016] and Track-and-Stop [Garivier and Kaufmann, 2016].

In Corollary 8, we showed that the lil'UCB algorithm is monotonically increasing, and thus the sample mean of the chosen arm is positively biased. In this subsection, we verify this with a simulation. It is an interesting open question whether different types of best-arm identification algorithms also yield positively biased sample means.

The lil'UCB algorithm consists of the following optimistic sampling, stopping and choosing rules:

• Sampling: For any k ∈ [K] and t = 1, . . . , K, define νt(k) = 1(t = k).
For t > K,

νt(k) = 1 if k = argmax_{j ∈ [K]} { μ̂j(t − 1) + u^lil_t(Nj(t − 1)) }, and νt(k) = 0 otherwise,

where δ, ε, λ and β are algorithm parameters and

u^lil_t(n) := (1 + β)(1 + √ε) √( 2σ²(1 + ε) log( log((1 + ε)n)/δ ) / n ).

• Stopping: T = inf{ t > K : Nk(t) ≥ 1 + λ Σ_{j≠k} Nj(t) for some k ∈ [K] }.

• Choosing: κ = argmax_{k ∈ [K]} Nk(T).

Once we stop sampling at time T, the lil'UCB algorithm guarantees that κ is the index of the arm with the largest mean with some probability depending on the input parameters. Based on this, we can also estimate the largest mean by the chosen stopped sample mean μ̂κ(T). The performance of this sequential procedure can vary based on the underlying distribution of the arms and the choice of parameters. However, we can check that these optimistic sampling and optimistic stopping/choosing rules, which would yield negative and positive biases respectively, are monotonically increasing, and thus the chosen stopped sample mean μ̂κ(T) is always positively biased for any choice of parameters.

To verify this with a simulation, we set 3 unit-variance Gaussian arms with means (μ1, μ2, μ3) = (g, 0, −g) for each gap parameter g = 1, 3, 5. We conduct 10⁴ trials of the lil'UCB algorithm with a valid choice of parameters described in Jamieson et al. [2014, Section 5]. Figure 3 shows the distribution of observed differences between the chosen sample means and the corresponding true mean for each g. Vertical lines correspond to biases. The simulation study demonstrates that, in all configurations, the chosen stopped sample mean μ̂κ(T) is always positively biased (see
(see\n\nAppendix B.2 for a formal proof.)\n\nFigure 3: Data is collected by the lil\u2019UCB algorithm run on three unit-variance Gaussian arms with\n\u00b51 = g, \u00b52 = 0 and \u00b53 = \u2212g for each gap parameter g = 1, 3, 5. For all cases, chosen sample\nmeans are positively biased. The bias is larger for a larger gap since the number of collected samples\nis smaller on an easier task.\n\n5 Summary\n\nThis paper provides a general and comprehensive characterization of the sign of the bias of the sample\nmean in multi-armed bandits. Our main conceptual innovation was to de\ufb01ne new weaker conditions\n(monotonicity and optimism) that capture a wide variety of practical settings in both the random\nwalk (one-armed bandit) setting and the MAB setting. Using this, our main theoretical contribution,\nTheorem 7, signi\ufb01cantly generalizes the kinds of algorithms or rules for which we can mathematically\ndetermine the sign of the bias for any problem instance. Our simulations con\ufb01rm the accuracy of\nour theoretical predictions for a variety of practical situations for which such sign characterizations\nwere previously unknown. There are several natural followup directions: (a) extending results like\nCorollary 8 to other bandit algorithms, (b) extending all our results to hold for other functionals of the\ndata like the sample variance, (c) characterizing the magnitude of the bias. We have recently made\nsigni\ufb01cant progress on the last question [Shin et al., 2019], but the other two remain open.\n\nReferences\nShipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem.\n\nIn Conference on Learning Theory, pages 39\u20131, 2012.\n\n9\n\n\fJean-Yves Audibert and S\u00e9bastien Bubeck. Minimax policies for adversarial and stochastic bandits.\n\nIn COLT, pages 217\u2013226, 2009.\n\nPeter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit\n\nproblem. 
Machine Learning, 47(2-3):235–256, 2002.

Jack Bowden and Lorenzo Trippa. Unbiased estimation for response adaptive clinical trials. Statistical Methods in Medical Research, 26(5):2376–2388, 2017.

Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376, 2011.

Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998–1027, 2016.

Allan Gut. Stopped random walks. Springer, 2009.

Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.

Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits. In Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 423–439, 2014.

Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, volume 12, pages 655–662, 2012.

Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer, 2012.

Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2019.

Xinkun Nie, Xiaoying Tian, Jonathan Taylor, and James Zou. Why adaptively collected data have negative bias and how to correct for it. In International Conference on Artificial Intelligence and Statistics, pages 1261–1269, 2018.

Herbert Robbins. Some aspects of the sequential design of experiments.
Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Herbert Robbins. Statistical methods related to the law of the iterated logarithm. The Annals of Mathematical Statistics, 41(5):1397–1409, 1970.

Daniel Russo. Simple Bayesian algorithms for best arm identification. In Conference on Learning Theory, pages 1417–1418, 2016.

Jaehyeok Shin, Aaditya Ramdas, and Alessandro Rinaldo. On the bias, risk and consistency of sample means in multi-armed bandits. arXiv preprint arXiv:1902.00746, 2019.

David Siegmund. Estimation following sequential tests. Biometrika, 65(2):341–349, 1978.

Norman Starr and Michael B Woodroofe. Remarks on a stopping time. Proceedings of the National Academy of Sciences of the United States of America, 61(4):1215, 1968.

Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning. MIT Press, Cambridge, 1998.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015.

Min Xu, Tao Qin, and Tie-Yan Liu. Estimation bias in multi-armed bandit algorithms for search advertising. In Advances in Neural Information Processing Systems, pages 2400–2408, 2013.