NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 3007 Almost Horizon-Free Structure-Aware Best Policy Identification with a Generative Model

Reviewer 1

Originality: The method is new relative to previous bounds in that the paper uses a non-uniform sampling scheme.

Quality: The paper provides a detailed theoretical analysis of its method. However, no empirical experiments are provided. Even a toy example showing a significant improvement in sample complexity over previous methods would strengthen the paper.

Clarity: I struggled somewhat with the writing. The paper is notation-heavy, which makes it hard to check all the proofs. I would suggest emphasizing the notation $n_{sa}^k$, which I think is the key difference from previous methods, and spending more space explaining it. I did not understand the intuition for why problem-dependent structure can remove the dependence on the horizon for suboptimal actions. Another issue is the term 'horizon', which usually refers to the maximum number of time steps in the MDP. In this paper it refers to the factor $(1-\gamma)^{-1}$, which is not fully explained.

Significance: To be honest, I am not familiar with the field of PAC-RL, but I believe this is an important result that can inspire policy optimization algorithms based on your algorithm.

Overall: I believe the paper could be written much better by emphasizing its main idea and explaining the intuition behind it more clearly. Still, I think this paper is above borderline because of its theoretical quality and because the idea of non-uniform sampling could inspire other policy optimization algorithms.