Reviews: Are sample means in multi-armed bandits positively or negatively biased?

===== Summary ===== This paper studies the bias in the sample mean estimates of actions in a multi-armed bandit setting. More specifically, the authors consider three sources of bias: the adaptive sampling of the arms, the adaptive stopping time, and the adaptive best-arm identification (choosing) strategy. It generalizes and complements the results of Nie et al. (AISTATS 2018). ===== Strengths ===== [Originality] * The authors provide a different view (to my knowledge) of the bandit setting (the tabular setting). * The contributions build on previous work (Nie et al. 2018), but are more general and do bring something new. [Quality] * The authors provide several examples and numerical experiments to illustrate/support their claims. * I didn't review appendix proofs in detail, but they seemed sound from an overview. [Clarity] * The paper is well organized and reasonably easy to follow given the sometimes uncommon notation. [Significance] * If bandits (and RL algorithms in general) are to be deployed more and more (e.g. in adaptive trials, A/B testing), I think that it is essential to understand better the potential bias in the resulting action estimates. ===== Weaknesses ===== [Clarity] * The writing could be polished. [Significance] * The contributions could have a broader impact if they were better linked to examples of applications where it is relevant to know the sign of bias. At first sight, it is not clear how learning about the sign of bias, but not the amplitude of bias, can be useful. Questions: * Is the choosing strategy always adaptive? If not, can you provide an example of non-adaptive choosing? Minor details: * Some results (e.g. Fig. 2-left and Fig. 3) are difficult to see. Maybe these could be zoomed to really see on which side fall the estimates? * The term "optimism" is already well-known and used in bandits for something else -- it may confuse the reader. * Bad cross-references need to be fixed in the introduction. * Sec. 2.1: Explicitly define A_t and Y_t. * The title could be improved. -------------------------------------------------------------------- I have read the rebuttal -- this is a good paper. Now that you mention it, the title is a nice reply to Nie et al. Linking explicitly the term "optimism" to the "optimism in front of uncertainty" principle should indeed help the reader, in addition to situating the paper w.r.t. the bandits literature (as pointed out by reviewer 2). Non-adaptive choosing strategy: I agree that they may exist in theory, but they are counterintuitive when speaking about bandits, which typically arise from adaptive experimental designs, therefore aiming at being adaptive by definition. It may be worth mentioning.

TL;DR: I learnt something by reading this paper and I think a lot of people working on bandits and adaptive algorithms should be aware of those results. I am also happy with how they corrected/completed the Nie et al. 2018 paper that was quite unclear to me. Originality : good. it is an unusual read in this domain (especially when one hasn't read the paper by Nie et al. 2018) Quality : Very good. It is a technically sound paper that defines and solves a precise result; Clarity : good. It is well-written but I would highly recommend to add references to the bandit literature as it is this is the targeted community; I must admit I was a bit lost at the beginning, I wasn't sure we were talking about the same thing. Significance: Okay. I think this is possibly the weakest point because in the end, I don't know exactly which issue(s) this result will solve... Major comments: M1. This paper could be of interest to a lot of people in the bandit community which is large and diverse... but for them to know about it, you have to cite them ! This paper has a total of 10 citations, including the paper they correct and a bunch of math papers that look awesome... but nearly no Bandit paper ! As far as I'm concerned, it made the paper pretty abstract as I felt it was not solving any of the problems I care about. And I still think it doesn't really, but it does bring some understanding about all those problems. I would definitely recommend expliciting further the links with the literature and I suggest at the end of this review a list of possible references to add. Typically, for the average bandit person, it feels weird to see UCB / TS / lil'UCB all mixed together and compared because they target really different kinds of problems, namely Best Arm Identification versus Cumulative Regret Minimization. I understand the scope of the paper is wider than those distinctions but that could be made clear somewhere. M2. So regarding the question of which problem is solved by this paper, I am a bit puzzled. I get that in Nie et al 2018, the message is that there analysis of the bias would allow a practitioner to correct the sample means obtained at the stopping time so that they return a less biased value. But it is not made very clear in your paper that this could be a use case of Theorem 7. Do you think you could make the goal of this analysis a bit more explicit ? M3. Numerical experiments. To be a fair comparison between UCB and TS, you should use the anytime confidence bonus \sqrt{\frac{2\log(t)}{N_k(t)}} cf KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints (https://arxiv.org/abs/1805.05071) and the MOSS paper. I mean, otherwise it's hard to explain why you get biases an order of magnitude bigger for UCB than TS. I am not sure if this "fair comparison" would bring more answers or more questions. A general interrogation I have by reading your paper is: Could we get a relationship between the performance of a bandit algorithm (i.e it's regret in general) and the bias at some stopping time ? In the regret minimization setting, a reasonable possible relationship would be: The smaller the regret, the bigger the bias at horizon T, because smaller regret means less exploration and so less data. Similar analysis could be done on the BAI problem. Those are just suggestions for further developments rather than real critics. minor comments : m1: Section refs on p2 broken Refs : The Bandit Algorithms book by Lattimore and Szepesvari: contains literally everyting about the bandit literature (if only one, at least this one) @book{lattimore2019book, title = {Bandit Algorithms}, author = {Lattimore, Tor and Szepesv\'{a}ri, Csaba}, year = {2019}, publisher = {Cambridge University Press (preprint)}, } Optimal best arm identification with fixed confidence: a different type of Best Arm Identification @inproceedings{garivier2016optimal, title={Optimal best arm identification with fixed confidence}, author={Garivier, Aur{\'e}lien and Kaufmann, Emilie}, booktitle={Conference on Learning Theory}, pages={998--1027}, year={2016} } The original Auer 2002 paper that coined the term "optimism principle" as far as I know: @article{auer2002using, title={Using confidence bounds for exploitation-exploration trade-offs}, author={Auer, Peter}, journal={Journal of Machine Learning Research}, volume={3}, number={Nov}, pages={397--422}, year={2002} } A very similar effect noticed in management science and decision analysis : @article{smith2006optimizer, title={The optimizer’s curse: Skepticism and postdecision surprise in decision analysis}, author={Smith, James E and Winkler, Robert L}, journal={Management Science}, volume={52}, number={3}, pages={311--322}, year={2006}, publisher={INFORMS} } ---------------------------------------------------------------------------------------- Edit after rebuttal : We thank the authors for their answers to my and other reviewers' questions. We hope the NeurIPS community will get the occasion to learn as much as we had and we look forward to future developments around this idea.

Paper ID:	3833
Title:	Are sample means in multi-armed bandits positively or negatively biased?

Reviewer 1

Reviewer 2

Reviewer 3