Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
The paper introduces a double actor-critic architecture for learning options. The authors define two augmented MDPs: one for learning the option-selection policy and one for learning the option policies themselves. Under this formulation, off-the-shelf policy optimization algorithms can be applied at both levels, which was not possible with previous algorithms; here, both hierarchy levels were optimized with PPO. The reviews for this paper are borderline. Most reviewers appreciated the intuitive idea and the promising results reported in the paper. The biggest concern, raised by R3, was the novelty of the approach, since similar augmented Markov chains have already been used in Levy and Shimkin (2011), Daniel et al. (2016), and Bacon et al. (2017). However, after reading the paper, I agree with the authors that these approaches either do not explicitly model options as two MDPs (preventing the use of off-the-shelf policy learning algorithms at both levels) or introduce an extra variable (the termination event), which hurts the sample efficiency of the algorithm. While these relations were clarified in the rebuttal, they are not clearly stated in the paper. Nevertheless, I trust the authors can properly position their work within the existing literature for the final version, and I recommend acceptance, as I found the idea intuitive and the experiments convincing.