NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 7889
Title: Recovering Bandits

Reviewer 1

The major point that needs to be clarified for me is the distinction between single-play regret and multiple-play regret.
- First, regarding the d-lookahead regret, it seems E[R_T^{(d)}] is only defined for policies of a very particular form, which would select a leaf of the d-lookahead tree every d time steps. Am I right? The UCB-Z algorithm is not of this type, for example. It is a bit odd to have a regret notion that applies only to very specific algorithms.
- Second, regarding the distinction between single- and multiple-play regret, am I right that E[R_T^{(d,m)}] = E[R_T^{(d)}]? That is, in principle one would care about the multiple-play lookahead regret, since one should be allowed to play an arm multiple times during the d time steps over which we optimize.
- Then, if I understood correctly, I just don't see the point of single-play regret. From my understanding, it would be defined only for strategies that select each arm at most once during d steps (and therefore d cannot be larger than K in this case, right?). How do you motivate such a restriction? Currently, the motivation in the paper is mostly "easier to analyze", and I think a practical motivation is also needed.

When thinking of concrete applications of the proposed algorithms, for example in a recommendation setting, some questions arise. First, which kernel would be chosen? For each arm, the recovery function is defined on the discrete space [0, ..., z_max]. Typical kernels k(x, x') depend on the Euclidean distance between x and x'; in that case, does k(z, z') simply depend on the integer |z - z'|? More importantly, z_max seems to represent a "characteristic time" after which the mean of an arm no longer evolves. Does it make sense to assume the maximum characteristic time over all products is known? In practice, what type of time scales / values of z_max would be chosen in a recommendation context?
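To make the kernel question concrete, here is a minimal numpy sketch (with a hypothetical squared-exponential kernel and lengthscale, not necessarily what the paper uses) of how k(z, z') would depend only on the integer gap |z - z'| when recovery times live on the discrete grid {0, ..., z_max}:

```python
import numpy as np

# Hypothetical squared-exponential kernel on integer recovery times.
def se_kernel(z, z_prime, lengthscale=5.0):
    """k(z, z') depends only on the gap |z - z'|."""
    return np.exp(-((z - z_prime) ** 2) / (2.0 * lengthscale ** 2))

z_max = 30
Z = np.arange(z_max + 1)
# Gram matrix over all recovery times; entry (i, j) depends only on |i - j|,
# so the matrix is Toeplitz.
K = se_kernel(Z[:, None], Z[None, :])
assert np.allclose(K[0, 3], K[10, 13])  # same gap |z - z'| = 3
```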
Some other remarks:
- Full-horizon regret: the comparison is with respect to an oracle that would act optimally for maximizing rewards up to horizon T in an MDP. This oracle is called "stationary deterministic", but I don't agree that it is stationary. The optimal solution of a finite-horizon MDP solves dynamic programming equations, which in this case reduce to a finite backwards induction. As a consequence, the optimal policy depends on the remaining horizon and is therefore not stationary.
- The analysis is in terms of Bayesian regret, assuming the recovery functions are drawn from the GP used in the algorithms. It seems the experimental protocol leading to Figure 4 is different: the recovery functions have a parametric form and the parameters are drawn uniformly. Is this equivalent to sampling the functions from some GP? What would the kernel be? If not, it is worth mentioning the discrepancy between the kernel used by the algorithms and the "kernel" from which the functions are drawn.
- Figure 2 and Table 1 are a bit far from where they are needed (in Section 7): would it be possible to regroup all figures on the same page?
- The bibliography contains several references to arXiv preprints that have since been published, e.g. [10, 26, 28]. Could you please correct that?
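The non-stationarity point in the first remark can be illustrated with a toy backward induction (all MDP numbers here are hypothetical, just to show that the greedy action depends on the remaining horizon):

```python
import numpy as np

# Toy 2-state, 2-action finite-horizon MDP. Backward induction produces
# one greedy policy per time step, so the optimal policy is a function of
# the remaining horizon and is generally NOT stationary.
n_states, n_actions, T = 2, 2, 5
rng = np.random.default_rng(0)
R = rng.uniform(size=(n_states, n_actions))                       # r(s, a)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s'|s,a)

V = np.zeros(n_states)            # V_T = 0 at the end of the horizon
policies = []
for t in reversed(range(T)):      # backward induction, t = T-1, ..., 0
    Q = R + P @ V                 # Q_t(s, a) = r(s, a) + E[V_{t+1}(s')]
    policies.append(Q.argmax(axis=1))  # optimal action per state at time t
    V = Q.max(axis=1)
# policies[t] varies with t in general: the oracle is non-stationary.
```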

Reviewer 2

In this paper, the authors study the recovering bandit problem, where the expected reward of an arm varies according to some unknown function of the elapsed time since it was last played. This is a significant problem which occurs in practice in recommendation or advertising, where varying the proposed item or ad over time is relevant. The problem is very challenging from an algorithmic point of view since, in addition to handling a multi-armed bandit problem, one must learn the recovery functions. Moreover, as the rewards of arms change over time, maximizing the instantaneous reward is not optimal. Here, the authors propose to maximize the reward of the best sequence of actions. Three definitions of regret are proposed: the instantaneous regret, the full-horizon regret, and the d-step lookahead regret. Instead of the full-horizon regret, which competes against the best deterministic policy that knows the recovery functions, the authors focus on the d-step lookahead regret, which leads to tractable algorithms. Proposition 1 states that any algorithm with low d-step lookahead regret also has low full-horizon regret when d is not too small. For estimating the recovery functions, the authors use Gaussian Processes. The authors propose two algorithms: a fully Bayesian approach which uses Thompson Sampling to sample a leaf of the lookahead tree (a sequence of d arms equipped with d recovery functions), and a hybrid approach which uses a UCB-like algorithm to select the best leaf. The main drawback of the d-step lookahead strategy is that it does not scale with d. To improve computational efficiency, the authors propose to use optimistic planning in Section 6. The algorithms are analyzed in the case of single-play lookahead, where an arm can be played only once during d steps, and multiple-play lookahead.
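As I understand the lookahead planning step, it amounts to something like the following brute-force sketch (hypothetical sampled recovery values standing in for a GP posterior sample; all names and the state-update convention are mine, not the paper's), which also makes the K^d scaling issue visible:

```python
import itertools
import numpy as np

# Brute-force d-step lookahead: each leaf of the tree is a sequence of
# d arms; we simulate how the recovery states z evolve along the sequence
# and return the leaf with the largest total sampled reward.
def best_leaf(sampled_f, z, d, z_max):
    """sampled_f: array (K, z_max+1) of sampled recovery values f_j(z)."""
    K = sampled_f.shape[0]
    best, best_seq = -np.inf, None
    for seq in itertools.product(range(K), repeat=d):  # K^d leaves
        zz, total = z.copy(), 0.0
        for j in seq:
            total += sampled_f[j, zz[j]]   # reward of arm j at state z_j
            zz[j] = 0                      # playing j resets its clock
            zz = np.minimum(zz + 1, z_max) # one step passes for every arm
        if total > best:
            best, best_seq = total, seq
    return best_seq

rng = np.random.default_rng(1)
K, z_max, d = 3, 10, 2
f = rng.uniform(size=(K, z_max + 1))        # hypothetical posterior sample
seq = best_leaf(f, np.full(K, z_max), d, z_max)
print(seq)  # a tuple of d arm indices
```

A Thompson Sampling variant would redraw `f` from the GP posterior before each call; a UCB variant would replace `f` with upper confidence values.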
In the case of single-play lookahead, the regret upper bounds are in O(\sqrt{KT \gamma_T \log(TK|Z|)}), which is surprising since they do not depend on the depth d of the lookahead. In fact, \gamma_T has a strong dependence on d: for instance, for squared-exponential kernels, \gamma_T = O((\log T)^{d+1}). The reviewer thinks that the dependence of \gamma_T on d should be made explicit. Moreover, the upper bound on the instantaneous regret is the same as the upper bound on the d-step lookahead regret (Theorems 2, 3 vs Corollary 7). This suggests that the bounds in Corollary 7 are not tight, or that the dependence on d is not the same. Could the authors clarify this point? The experiments done on synthetic problems show the efficiency of the approach. Overall, this is an interesting paper on a significant and challenging problem. There are some shortcomings or ways of improvement:
- The setting is so general that it would be interesting to position it with respect to reinforcement learning with discounted rewards.
- The dependence of the regret bounds on d should be clarified.
____________________________________________________________________
I read the rebuttal. Thanks for answering my concern. I raised my score. I recommend acceptance.
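For concreteness, substituting the squared-exponential value of \gamma_T into the stated bound would make the hidden d-dependence explicit (a sketch combining the two statements in my review, not a claim from the paper):

```latex
% With gamma_T = O((\log T)^{d+1}) for the squared-exponential kernel:
R_T^{(d)} = O\!\left(\sqrt{K T \, \gamma_T \, \log(TK|Z|)}\right)
          = O\!\left(\sqrt{K T \, (\log T)^{d+1} \, \log(TK|Z|)}\right).
```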

Reviewer 3

Summary and Contributions: From a practical perspective, the problem studied here is interesting. The authors tackle the problem of recommendations for a special kind of item associated with phases in a user's purchasing behaviour (i.e. items that users buy only rarely, like a TV, a new phone, furniture, movies, etc.). Here, a play of an arm triggers a state change for a period of time (users most likely do not want to watch the same movie again immediately). Mapping this to a bandit setting, the authors aim to solve a MAB problem where, after each play of an arm j, the expected reward of j starts changing over time according to an unknown function f_j drawn from a GP of known mean and kernel, until a maximum time z_max, after which the expected reward remains constant. The authors propose using d-lookahead regret as a proxy for the classical measure of regret (which is prohibitively expensive to compute in this setting) and show that it is indeed a good approximation (arbitrarily close as T \to \infty). Additionally, the paper introduces d-lookahead UCB and Thompson Sampling variants and analyses their regret, proving scaling of the order O(\sqrt{KT \gamma \log(T)}), where \gamma depends on the GP kernel and takes values from O(\log(T)) to O(T^{1/(1+\nu)} \log(T)) for Matern(\nu) kernels. Overall, I find this paper interesting but wonder whether some of the assumptions hurt the practical significance of the work. The GP kernel will not be known in most real-life scenarios: would an uninformative prior be sufficient? z_max being the same for all arms seems a bit unrealistic; would it be possible to relax this assumption? Another questionable assumption, which I think also needlessly complicates the problem, is that the decision maker is forced to play an arm each round (normally, web services are not forced to display ads, for example).
Having the ability to pass on playing in some round would allow for easier planning ahead, since there would then be more independence between arms. While I have not checked the proofs, in my opinion the paper is very well written, and I particularly like the result on d-lookahead regret being a good approximation of the full-horizon regret. I like this metric and think it is a great proxy for this setting. I appreciate the authors comparing their algorithm's performance to several baselines, including the common-sense UCB-Z algorithm. The authors also provide a section on reducing the computational complexity of their algorithm. Despite some questionable assumptions, I can see this paper representing a solid starting point for further work on this setting; I therefore recommend accepting this paper.
=========== Post Rebuttal ===========
I have read the authors' reply and am satisfied with the clarifications.