NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper extends Russo and Van Roy (JMRL2016) work to provide information-theoretical analysis of Thompson sampling and UCB-like algorithms in more general setting. The three reviewers acknowledge the contributions, and the potential impact of connecting information-theoretical concepts to the design of algorithms. Reviewers have suggested ways to improve the manuscript. The authors should follow these directions, and in particular fix notations, include simulation results, and provide explanations about proofs when necessary. The contributions in this paper are “methodological”, i.e., it proposes a framework to analyze the regret of certain algorithms. When applying the method to examples, the authors do not discuss existing results (what is known about the regret in linear bandits, tabular MDPs, …?). We encourage the authors to include a more detailed related work. On the examples presented in the paper, the proposed method does lead to regret upper bounds matching those of state-of-the-art algorithms (e.g. the MDP tabular case – unless the conjecture mentioned by the authors is true). It seems important to mention this gap.