Part of Advances in Neural Information Processing Systems 31 (NeurIPS 2018)

*Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, Michael I. Jordan*

Model-free reinforcement learning (RL) algorithms directly parameterize and update value functions or policies, bypassing the modeling of the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that they require large numbers of samples to learn. The theoretical question of whether not model-free algorithms are in fact \emph{sample efficient} is one of the most fundamental questions in RL. The problem is unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tlO(\sqrt{H^3 SAT})$ where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps. Our regret matches the optimal regret up to a single $\sqrt{H}$ factor. Thus we establish the sample efficiency of a classical model-free approach. Moreover, to the best of our knowledge, this is the first model-free analysis to establish $\sqrt{T}$ regret \emph{without} requiring access to a ``simulator.''

Do not remove: This comment is monitored to verify that the site is working properly