
The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition

Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)


Authors

Tiancheng Jin, Longbo Huang, Haipeng Luo

Abstract

We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes, with the goal of achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when the losses are adversarial and simultaneously $\mathcal{O}(\log T)$ regret when the losses are (almost) stochastic. Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known, and leaves the case of unknown transition as a major open question. In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques. Specifically, we first propose a loss-shifting trick in the FTRL analysis, which greatly simplifies the approach of [Jin and Luo, 2020] and already improves their results for the known transition case. Then, we extend this idea to the unknown transition case and develop a novel analysis which upper bounds the transition estimation error by the regret itself in the stochastic setting, a key property to ensure $\mathcal{O}(\log T)$ regret.
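
As an illustration of the FTRL framework and the loss-shifting trick referenced above, the following minimal sketch uses generic notation of our own (the decision set $\Omega$ of occupancy measures, loss estimators $\widehat{\ell}_s$, regularizer $\phi_t$, and shift terms $g_s$ are illustrative assumptions, not notation taken from the paper). FTRL plays

$$\widehat{q}_t = \operatorname*{arg\,min}_{q \in \Omega} \; \Big\langle q, \textstyle\sum_{s<t} \widehat{\ell}_s \Big\rangle + \phi_t(q),$$

and a loss-shifting argument exploits the fact that adding any shift $g_s$ whose inner product $\langle q, g_s \rangle$ is constant over $\Omega$ leaves the FTRL iterate unchanged:

$$\operatorname*{arg\,min}_{q \in \Omega} \; \Big\langle q, \textstyle\sum_{s<t} \big(\widehat{\ell}_s + g_s\big) \Big\rangle + \phi_t(q) = \widehat{q}_t,$$

so the same iterates can be analyzed against shifted losses, which may be chosen to have more favorable properties (e.g., smaller variance) in the stochastic setting.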