NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 1187 — "DAC: The Double Actor-Critic Architecture for Learning Options"

### Reviewer 2

- "Equation 24 in Levy and Shimkin (2011))." → it would be better if the paper were self-contained.
- Fig. 2 → why does DAC not perform well on the first task in most cases?
- "4.3 Option Structure" → I am not entirely sure what we learn from this section, as it is rather descriptive and there is no conclusion to be drawn (and no baseline to compare with, nor any way to quantify the performance).
- Fig. 1 → why does DAC outperform the baselines on only two tasks out of four?
- Rows 227-228, "The main advantage of DAC over AHP is its data efficiency in learning the master policy." → it is not clear how this statement is supported by the experimental results.
- Rows 249-250, "DAC+PPO outperforms all other baselines by a large margin in 3 out of 6 tasks and is among the best algorithms in the other 3 tasks." → this does not support the statement made in the abstract. It would also be nice to define what counts as a large margin.
- Rows 253-254, "We conjecture that 2 options are not enough for transferring the knowledge from the first task to the second." → what can we learn from this?

### Reviewer 3

I believe that the result presented in this paper follows from Levy and Shimkin (2011) and Bacon et al. (2017), up to a change of notation (and terminology). It suffices to see that the call-and-return model with Markov options leads to a chain where the probability of choosing the next option, conditioned on the augmented state, is:

$$P(O_t \mid S_t, O_{t-1}) = \big(1 - \beta_{O_{t-1}}(S_t)\big)\,\mathbb{1}_{O_{t-1}=O_t} + \beta_{O_{t-1}}(S_t)\,\mu(O_t \mid S_t).$$

Note that $P(O_t \mid S_t, O_{t-1})$ can be seen as a policy over options, one which also happens to contain the termination functions. The form of this chain is the same one behind the intra-option methods, presented there in the value-based context, but it is independent of any algorithm.

Notation: writing $\pi_{O_t}(S_t, \cdot)$ suggests a joint distribution over states and actions, while $\pi_{O_t}$ really is a conditional distribution over actions. $\pi_{O_t}(\cdot \mid S_t)$ is less ambiguous.

> Unfortunately, to the best of our knowledge, there are no policy-based intra-option algorithms for learning a master policy.

See Daniel et al. (2016) and Bacon's thesis (2018), where the gradient for the policy over options is shown, as well as Levy and Shimkin (2011), which boils down to the same chain structure (albeit with different notation).

[Update:] I maintain the same overall score. The paper could eventually help other researchers gain more clarity on this topic. It would, however, need to be properly positioned in the lineage of Levy and Shimkin (2011), Daniel et al. (2016), and Bacon et al. (2017), all of which share the same underlying Markov chain. The "double actor-critic" point of view belongs in that same family. All these papers are also based on the "intra-option" point of view: the one you get when assuming Markov options.
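As a minimal sketch of the chain the reviewer writes out, the transition $P(O_t \mid S_t, O_{t-1})$ can be computed directly from the termination functions $\beta$ and the high-level policy $\mu$. All function and variable names below are illustrative stand-ins, not anything from the paper under review:

```python
import numpy as np

def master_policy(beta, mu, s, o_prev):
    """Distribution over the next option O_t given the augmented state
    (S_t, O_{t-1}) under call-and-return execution:
        P(O_t | S_t, O_{t-1}) = (1 - beta[o_prev](s)) * 1{O_t == o_prev}
                                + beta[o_prev](s) * mu(s)[O_t]
    `beta` maps each option to its termination probability at s;
    `mu(s)` is the high-level policy's distribution over options at s."""
    b = beta[o_prev](s)     # probability that option o_prev terminates in s
    p = b * mu(s)           # termination branch: re-select an option from mu
    p[o_prev] += 1.0 - b    # continuation branch: keep running o_prev
    return p

# Toy example with 2 options (arbitrary numbers, state ignored).
beta = [lambda s: 0.2, lambda s: 0.9]   # termination functions
mu = lambda s: np.array([0.5, 0.5])     # policy over options
p = master_policy(beta, mu, s=0, o_prev=0)
print(p)  # [0.9 0.1]: 0.8 continue + 0.2*0.5 reselect option 0; 0.2*0.5 for option 1
```

This makes the reviewer's point concrete: the object on the left-hand side is itself a valid policy over options (it sums to one), with the termination functions folded in.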