NeurIPS 2020

Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting

Meta Review

After discussing with the reviewers, we have decided to propose acceptance of the paper. Nonetheless, I would like to stress that the current submission has a number of critical aspects that need to be addressed by the authors to make the paper more solid. I strongly encourage the authors to read the reviewers' comments and focus on improving along the following directions:

- Clarity: The reviewers all agree that the writing could be improved to make the paper more accessible to an audience that is not strictly familiar with the factored MDP formalism and/or the technicalities behind UCRL proofs.
- The authors should further clarify the empirical results. In particular, it is unclear how the parameter c has been chosen and why it takes significantly different values for different algorithms. It would be helpful to see how performance varies as c changes.
- The way optimism is obtained is probably not very tight, and it may cause over-exploration for a long time. This point should be discussed in much more detail.
- The lower bound is an interesting novel result, but it too may require more discussion, in particular with respect to the non-factored case and why the span appears here but not in the non-factored case.