NeurIPS 2020

Stage-wise Conservative Linear Bandits


Review 1

Summary and Contributions: This paper presents linear bandit algorithms that ensure that a "safety" constraint (namely, a lower bound on the instantaneous expected reward) is respected, while still trying to minimize the usual cumulative (pseudo-)regret. It offers the same regret bound (O(sqrt(T) log(T))) as the state-of-the-art, while playing the non-optimal (safe) baseline actions at most O(log(T)) times. The algorithms can easily be extended to other types of safety constraints, in particular to the case of unknown baseline rewards.
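For concreteness, the stage-wise constraint, as I read it (with x_{b_t} the baseline action at round t and alpha the allowed fraction of reward that may be sacrificed), is:

```latex
\[
  \langle x_t, \theta_* \rangle \;\ge\; (1-\alpha)\,\langle x_{b_t}, \theta_* \rangle
  \qquad \text{for all } t = 1, \dots, T,
\]
% i.e. the constraint must hold at every round, not merely cumulatively over the horizon.
```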

Strengths:
* This work has good theoretical grounding and, to the best of my knowledge, is quite novel in the way it addresses an important conservativeness/safety constraint expressed at the "stage" level (i.e., for every time t, and not cumulated over time).
* The approach is elegant: it first estimates the safe action set and then samples an action from this safe set using a standard Thompson sampling strategy (when the safe set is non-empty and an extra condition on the minimum eigenvalue of the Gram matrix is satisfied), or plays a perturbed - but safe - version of the baseline when this double condition is not met.
* It can handle multiple types of safety constraints.
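To make the second point concrete, here is a minimal sketch of the control flow I have in mind (my own simplification, not the authors' exact algorithm; all constants, thresholds and the finite action grid are illustrative placeholders):

```python
import numpy as np

# Non-authoritative sketch: estimate a "safe" set from pessimistic reward estimates,
# run Thompson sampling inside it when the Gram matrix is well conditioned, and
# otherwise play a (safe) perturbation of the baseline action.
rng = np.random.default_rng(0)
d, T, alpha, lam, sigma = 2, 2000, 0.2, 1.0, 0.1
theta_star = np.array([0.6, 0.3])                 # unknown in reality; used to simulate rewards
actions = rng.uniform(-1.0, 1.0, size=(50, d))    # finite surrogate for the compact action set
x_b = np.array([0.5, 0.0])                        # baseline action (its reward assumed known here)
V, b = lam * np.eye(d), np.zeros(d)               # regularized Gram matrix and response sum

for t in range(1, T + 1):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    beta = 0.5 + 0.5 * np.sqrt(np.log(t + 1))     # stand-in for the confidence radius beta_t(delta')
    width = beta * np.sqrt(np.einsum('ij,jk,ik->i', actions, V_inv, actions))
    # "Safe" = lower confidence bound on the reward beats (1 - alpha) times the baseline reward.
    safe = (actions @ theta_hat - width) >= (1.0 - alpha) * (x_b @ theta_star)
    if safe.any() and np.linalg.eigvalsh(V).min() >= 2.0:   # illustrative eigenvalue condition
        theta_tilde = rng.multivariate_normal(theta_hat, beta**2 * V_inv)   # Thompson sample
        x_t = actions[safe][np.argmax(actions[safe] @ theta_tilde)]
    else:
        zeta = rng.standard_normal(d)
        zeta /= np.linalg.norm(zeta)              # random direction on the unit sphere
        x_t = (1.0 - alpha) * x_b + alpha * zeta  # perturbed baseline; safe in expectation over zeta
    y_t = x_t @ theta_star + sigma * rng.standard_normal()
    V += np.outer(x_t, x_t)
    b += y_t * x_t
```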

Weaknesses:
* Relevant to a minority of the NeurIPS community.
* The algorithm needs a lot of assumptions (many bounds are assumed to be known: S, L, R, r_l, r_h, \kappa_l, \kappa_h): nothing is said about how to set them, or to what extent the algorithm is robust to mis-specified bounds. I sometimes doubt that these assumptions are valid in practice.
* The same can be said about the hyper-parameters: there are several (\delta, \lambda, \rho, \alpha) and nothing is said about how to choose/tune them.
* The experimental section is a bit disappointing. It is limited to synthetic data and to a simulation process that strictly follows the model assumptions. Only one setting/scenario is considered, and the authors compare results (cumulative regret ... but unfortunately not the number of times the safety constraint is violated) with only one competitor (SEGE [3]). It would have been interesting to compare SCLTS / SCLUCB with the method of [16], using the same setting as [16].

EDIT AFTER REBUTTAL: @authors: thanks for your clear answers in the rebuttal. It would be great if you could include these answers (or elements of them) in the paper as well.

Correctness: Claims, theorems and proofs seem to be correct (to the best of my understanding and knowledge).

Clarity: The paper is well written: motivations are clearly exposed, ideas and methods are well structured, and the paper is, globally, easy to read, except for some inconsistencies in the notation (see Additional Feedback) and the constant need to refer to the Supplementary Material to understand important concepts.

Relation to Prior Work: To the best of my knowledge, the paper explains clearly the landscape of the current state-of-the-art and how the proposed approach differs from it, especially in terms of bounds and differences in the safety/conservativeness criteria.

Reproducibility: Yes

Additional Feedback:
* Typo: psuedo-regret --> pseudo-regret.
* In the footnote on page 2, $q_{b_t}$ is not defined; we have to wait until page 7 to learn what it is.
* S and L are actually linked through the constraint $<x,\theta_*> \leq 1$. Please clarify.
* Make clear in the assumptions that the constants (R, S, L, \kappa_h, \kappa_l, ...) need to be known by the algorithms, not just that the quantities are bounded.
* Please define $\delta'$ somewhere between lines 170 and 174.
* In Algorithm 1 [line 6] and line 182, please make explicit the dependency of \beta on \delta' --> $\beta_t(\delta')$.
* What is \rho, given as input to Algorithm 1? It does not seem to be used.


Review 2

Summary and Contributions: The paper proposes variants of Thompson Sampling and UCB for the linear bandit setting, where the learner has to satisfy a safety constraint on the instantaneous reward in each round. Regret bounds (and constraint/safety guarantees) are provided, and the rate matches the minimax rate achievable in linear bandits without constraints up to logarithmic factors. Several extensions of the setting are analyzed as well, and some empirical evaluation is presented.

Strengths: This setting of safe linear bandits has been studied before, and various results are known. The setting is important for many practical applications. The proposed algorithm and theory significantly improve upon previously known results in generality, and the assumptions used are much more standard than in previous work (in particular, a general compact action set is considered). The authors modify existing schemes (TS/UCB) in a simple way, which allows them to build on existing results for the unconstrained setting. Several interesting and relevant extensions are also worked out. The proofs of the regret bounds are clearly presented and follow known techniques, modulo several modifications necessary to account for the constraints.

Weaknesses: Assumption 4 is specific to the constrained setting. The authors should clarify to what extent these assumptions are necessary and how restrictive they are. In particular, is it necessary that the baseline policy achieves positive reward? Also, what is the benefit of introducing \kappa_l if it is claimed that one can set \kappa_l = 0 (line 4)? This seems to contradict Remark 3.1, where \kappa_l^{-1} appears in the sample complexity; however, this is not evident from Theorem 3.3, and eq. (49) also suggests otherwise. Further, Assumption 3, that X contains the unit ball, could possibly hide some geometric complexity of the problem. For example, it does not allow a very "thin" action set, which would prevent the exploration method (11). Could the authors please comment on this? Simply re-scaling would affect other assumptions such as <x, theta> < 1, so perhaps one should linearly transform the action set in this case (similar to John's position)?
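To spell out the last suggestion, the reparameterization I have in mind (a sketch in my own notation, not the authors') is:

```latex
\[
  \langle x', \theta_*' \rangle
  = (A x)^{\top} (A^{-\top} \theta_*)
  = x^{\top} A^{\top} A^{-\top} \theta_*
  = \langle x, \theta_* \rangle,
  \qquad x' = A x, \quad \theta_*' = A^{-\top} \theta_*,
\]
% so for any invertible A the rewards are unchanged. By John's theorem, A (and a centering)
% can be chosen so that the transformed action set contains the unit ball and lies inside a
% ball of radius d (sqrt(d) in the symmetric case), at the cost of rescaling the bounds S and L.
```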

Correctness: The technical results appear correct to me.

Clarity: The paper is well written and structured; for someone who is familiar with the related literature, the ideas are easy to follow.

Relation to Prior Work: Yes, prior work is adequately discussed. One remark: the authors describe related methods for Gaussian processes as "non-linear bandit optimization with nonlinear constraints" (line 118). While it is true that kernelized bandits model non-linear functions in terms of the indexing domain, these methods heavily rely on linearity in the kernel space, and the analysis is typically not much different from the pure linear setting.
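To be explicit about the linearity I mean (standard RKHS notation, not specific to this paper):

```latex
\[
  f(x) = \langle \phi(x), \theta \rangle_{\mathcal{H}}, \qquad \theta \in \mathcal{H},
\]
% so kernelized UCB/TS arguments are linear-bandit arguments in the feature space induced by
% \phi, even though f is non-linear as a function of x itself.
```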

Reproducibility: Yes

Additional Feedback:
Update: I have read the authors' response and I will keep my score.
One comment is that the term "LUCB" (which here is used for "Linear UCB" or "LinUCB") has been coined before for a different algorithm [1].
Around line 59 it could be useful to mention why the knowledge of x_{b_t} is needed; at first sight it seems that (2) could be replaced by a generic lower bound.
Algorithm 1, line 14: it would be useful to repeat the definition of x_t^{cb}.
[1] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning (ICML), 2012.
Minor: line 55: X \in R -> X \subset R; line 68: leaner -> learner; lines 72 & 74: psuedo -> pseudo.


Review 3

Summary and Contributions: This paper studies the problem of stage-wise conservative linear bandits. Here, in addition to minimizing regret, the learner has to ensure that, at every round, the expected reward does not fall below a baseline-dependent threshold. The authors propose and analyze two algorithms.

Strengths: The paper is certainly relevant to the community. The algorithm improves upon existing algorithms by playing the baseline arms only O(log T) times, as compared to O(sqrt{T}) times. The paper is well written and clear to understand. The authors compare their work to [3], which originally formulated this problem, and highlight the key differences in Sec. 1.2. They also conduct experiments to compare the empirical performance of their algorithm to existing algorithms.

Weaknesses: There are several ways the paper can be improved. The fact that the baseline arm is played only O(log T) times is a key property; it would be good to see this verified through experiments. I would also have liked to see a summary of the main points of the proof in the main paper. Finally, it would be good if the authors could make the code public for reproducibility.
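For the first suggestion, a minimal way to instrument the simulation is sketched below (the names `played_actions` and `x_baseline` are placeholders for whatever the authors' code records):

```python
import numpy as np

# Count how often the baseline action is played and compare the growth against log(T).
# `played_actions` and `x_baseline` are placeholders for quantities the simulation records.
def baseline_play_curve(played_actions, x_baseline, atol=1e-9):
    hits = [np.allclose(x_t, x_baseline, atol=atol) for x_t in played_actions]
    return np.cumsum(hits)   # cumulative number of baseline plays after each round

# Example check (commented out): the ratio counts[t] / log(t) should stay bounded
# if the O(log T) claim holds in practice.
# counts = baseline_play_curve(actions_played, x_b)
# ratio = counts / np.log(np.arange(2, len(counts) + 2))
```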

Correctness: Yes

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: