Reviews: Epsilon-Best-Arm Identification in Pay-Per-Reward Multi-Armed Bandits

This paper introduces a new model for the multi-armed bandit problem, where the cost of pulling any given arm equals the expected utility of that arm (called the pay-per-reward model). The goal is to find an arm whose expected utility is within epsilon of the best arm. An algorithm is given, and it is proved that its regret is bounded by a polylogarithmic factor times the total utilities of all arms. My main objection to this paper is the motivation of this model, which seems somewhat contrived to me. A possible application to ad placement on web is presented, but it's not convincing: it is said that the best arm corresponds to the best expression for an ad, and pulling arms is equivalent to observing the number of clicks when the ad is linked to that expression. But then, the player cannot actively "pull an arm" in any rounds she wants. Also, why is the cost of pulling an arm (which means observing something) equals the reward of the arm? I think of this model as a model that's nice from a mathematical point of view, the result of this paper is non-trivial and the techniques used in the paper to prove the regret bounds are deep, however, I am not sure if it is of broad interest to the NeurIPS community. Also, the regret of this paper is O(W log(1/delta)/eps^2), while in Line 117 they say the regret from previous work [14] is O(W log(K/delta)/eps^2), so I don't call this a "significant" improvement. K is the number of arms. Other comments for authors: - In Line 2 and afterwards, "expected future reward" is ambiguous. Just use "expected reward." - Line 44: the sentence "it is necessary to allow a simultaneous pull of an entire set of arms from the joint distribution of rewards of all arms" is very confusing at this point in the paper. - Line 85: why do you call these "single" arm-pulls, while these are batched arm pulls? - Line 130: step -> line == After reading the response: Thanks for the responses. I now think there may be potential applications for this model, but it needs to be described more clearly in the paper. For the ad placement example, in your model the company is penalized for choosing a good text for the ad, while intuitively it should not be penalized, because choosing a good text means also means more buyers. So, from a game-theoretic point of view, it is still not clear why the company should be penazlied for choosing a good ad. I agree with authors that removing the dependence on the number of arms is significant, and thus I have improved my score.

Reviewer 2

This paper considers the problem where we want to identify an \eps-best-arm in the fixed-confidence multi-armed bandits setting. Unlike the traditional setting, it assumes each arm pull incurs a non-unit cost which equals to the expected future reward of that arm. This setup has potential applications in ad placement: a retailer may want to do some experiments before deciding under which search queries the ad is shown. If a placement incurs a click, the retailer may need to pay for this event. The cost per placement paid to the advertiser is related to the click rate. So the retailer may want to use the minimum cost to identify an \eps-best arm. To achieve this goal, this paper introduces an elimination-based algorithm which is similar to that in [14], but it saves a log(K) factor compared to the uniform sampling or the algorithm straightforwardly derived from [14]. The high-level idea of the proposed algorithm is that instead of removing half arms in each round, it removes arms such that the total weight is decreased by a constant factor with high probability. Because of this the estimation scheme as well as the analysis becomes more involved. Theoretically, this setting is very interesting and the proofs look solid. The paper is well-written. However, I have some concerns about the practical usage of the algorithm. 0) About the motivation that the cost of arm pull is equal to the expected future reward of that arm, I am not sure this is actually the case used in ad placement. I would like to see some concrete real world examples. 1) The authors did not show explicitly the constants in the analysis, which may be very large since the algorithm needs to invoke Median-of-Means procedure repeatedly. While I understand that this is for the convenience of the analysis, large constants may be unaccepted in real world applications despite the fact that the algorithm saves a log(K) factor. 2) I think the fixed-budget setting makes more sense for this problem, since a retailer usually offers a limited budget to do some survey. 3) It would be better if the authors can add some experiments to convince the readers the superior performance of the proposed algorithm. 4) The assumption that "pull all arms in finite time" for manipulating an unlimited number of arms is not practical. Minors: -- It seems that \theta was not used in the proof of Theorem 3.1. However, \theta was specified as a parameter in the main algorithm. Is \theta = \phi_{M}? Please clarify. -- Line 17, “... observes their instantaneous reward”: reward -> rewards -- Line 21. “... arm pull ...”: usage of “arm-pull” should be consistent; there are some other places that this issue exists -- Line 54, “Related work” should be put together with line 55 -- Line 73, “Let K be total the number ...” -> “Let K be the total number ...” -- Line 157, “... in more detail in as Theorem 5.6” -> “ … in more details in Theorem 5.6” -- Line 178, “... is also used in for the median-elimination algorithm ...”: “in” is redundant -- Line 219: "\eps-good arm” -> \eps-best arm -- Appendix A: no “2” factor in the first equality and the first inequality Lemma B.1: p should be at most ½ otherwise the last sentence “... more than 1 - p fraction of the empirical means satisfy \omega - \hat{\omega}^j \leq \sqrt{a \sigma^2 / \ell}, hence the p-of-means estimate also satisfies this inequality” does not hold ------------------------------- [After response] The authors have tried to address the motivation issue. However, I still feel that this problem is a bit artificial, and it is still not clear what real applications this problem will have. Other than this, the technical details of this paper look interesting and the paper is a pleasure to read.

Reviewer 3

Originality: The model, algorithm and analysis seems to be novel, but as a standard generalisation of previous results. Relevant literature seems to be covered. The proofs have not been thoroughly checked, but the arguments and approaches seems to be sound. The heuristic for adapting prior work to the new generalised setting is well founded. Clarity: The paper is generally well written and with a clear focus, comprising of a single result. The non-technical sections are well structured and care seems to be put in conveying the contend. There is however a glaring problem with the technical sections, including the algorithm itself, where more than 10 variables and constants are introduced with little to no explanation, but merely mathematical definition. While this might amount to a functional and correct algorithm, it makes understanding it quite difficult. Unsurprisingly the resulting confusion carries over to the analysis. With this complicated an algorithm, naming quantities so there qualitative function is well motivated would help the clarity tremendously. Significance: The result seems novel and well motivated both in application and literature. It is however incremental, and the scope of the paper is rather small. While keeping the scope well focused is definitely a plus and makes it easy to get an overview of the paper, additional contributions as alluded to in the conclusion might be beneficial. ==== In the rebuttal the authors have promised to address the concerns I raised. I have raised my score slightly due to this.

Paper ID:	1663
Title:	Epsilon-Best-Arm Identification in Pay-Per-Reward Multi-Armed Bandits

Reviewer 1

Reviewer 2

Reviewer 3