NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center

### Reviewer 1

The paper describes a new learning model that discovers 'intentions' from expert policies within an imitation learning framework. The idea is mainly based on GAIL, which learns a policy by imitation using a GAN approach. The main difference is that the learned policy is a mixture of sub-policies, each sub-policy aiming to automatically match a particular intention in the expert behavior. The GAIL algorithm is thus derived for this mixture, resulting in an effective learning technique. A second approach is also proposed, in which the intention is captured through a latent vector by adapting the InfoGAN algorithm to this particular case. Experiments are made on 4 different settings and show that the model is able to discover the underlying intentions contained in the demonstration trajectories.

Comments: First of all, the task addressed in the paper is interesting. Discovering 'intentions' and learning sub-policies is a key RL problem that has been the subject of many old and new papers. The proposed models are simple but effective extensions of existing models.

That said, I do not really understand the distinction the authors make between 'intentions' and the 'options' of the literature. It seems to me that intentions are more restricted than options, since there is no natural switching mechanism in the proposed approach, while options models are able to choose which option to use at any time. Moreover, w.r.t. the options literature, the main originality of the paper lies both in how the intentions are discovered (using GAIL) and in the fact that this is done by imitation, while many option-learning models are based on a classical RL setting. A discussion of this point is important, both in the 'related work' part and in the experimental part (I will come back to this point later in the review).
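As an illustration of the setup summarized above, the following is a minimal sketch of an intention-conditioned policy $\pi(a|s,i)$ in which the categorical intention is drawn once from a prior $p(i)$ and then held fixed for the whole rollout, i.e. with no switching mechanism. Everything here is an assumption made for illustration (the linear-softmax policy, the random placeholder dynamics, and the names `sample_intention`, `policy_probs`, and `rollout`); it is not the architecture or code used in the paper.

```python
import math
import random

def sample_intention(prior, rng):
    """Draw a categorical intention index i ~ p(i)."""
    return rng.choices(range(len(prior)), weights=prior, k=1)[0]

def policy_probs(weights, state, intention, n_intentions):
    """pi(a | s, i): softmax over a linear score of [state, one_hot(i)]."""
    one_hot = [1.0 if k == intention else 0.0 for k in range(n_intentions)]
    features = list(state) + one_hot
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rollout(weights, prior, horizon, state_dim, rng):
    """Sample the intention once, then keep it fixed at every timestep."""
    n_intentions = len(prior)
    intention = sample_intention(prior, rng)
    state = [rng.gauss(0.0, 1.0) for _ in range(state_dim)]
    trajectory = []
    for _ in range(horizon):
        probs = policy_probs(weights, state, intention, n_intentions)
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        # Each step stores (s_t, i_t, a_t); here i_t is the same for all t.
        trajectory.append((state, intention, action))
        # Placeholder dynamics (assumption): the next state is random noise.
        state = [rng.gauss(0.0, 1.0) for _ in range(state_dim)]
    return trajectory

rng = random.Random(0)
prior = [0.5, 0.3, 0.2]  # hypothetical p(i) over 3 intentions
# Hypothetical policy weights: 4 actions, 2 state features + 3 one-hot entries.
weights = [[rng.gauss(0.0, 1.0) for _ in range(2 + 3)] for _ in range(4)]
traj = rollout(weights, prior, horizon=5, state_dim=2, rng=rng)
```

An options-style model would differ mainly in the loop body: it would also decide, at each step, whether to terminate the current option and select a new one, whereas here the intention only changes between rollouts.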
In the end, it seems that the model discovers sub-policies but is still unable to assemble them in order to solve complex problems. So the proposed model is more a first step toward learning an efficient agent than a complete solution. (This is illustrated, for example, in Section 6: "[...]the categorical intention variable is manually changed [...]".) The question is thus: what can be done with the learned policy? How will the result of the algorithm be used?

Concerning the presentation of the model, the notation is not clear. Mainly, $\pi^i$ gives the impression that the policy is indexed by the intention $i$ (which is not the case since, as far as I understand, the indexing by the intention is in fact contained in the notation $\pi(a|s,i)$). Moreover, the section concerning the definition of the reward is unclear: as far as I understand, in your setting, trajectories are augmented with the intention value at each timestep. This intention value $i_t$ is not used during learning, but is used for evaluating the reward value. I think that this has to be made clearer in the paper if accepted.

Concerning the experimental sections:

* First, the environments are quite simple and restricted to fully observable MDPs. The extension of the proposed model to POMDPs could be discussed in the paper (since I suppose it is not trivial).
* Second, there is no comparison of the proposed approach with option discovery or hierarchical techniques such as "Imitation Learning with Hierarchical Actions" (Abram L. Friesen and Rajesh P. N. Rao) or "Active Imitation Learning of Hierarchical Policies" (Mandana Hamidi, Prasad Tadepalli, Robby Goetschalckx, Alan Fern). This is a clear weak point of the paper.
* I do not really understand how the quantitative evaluation (reward) is made. In order to evaluate the quality of each sub-policy w.r.t. each intention, I suppose that a manual matching is made between the value of $i$ and the 'real' intention. Could you please explain that point better?
* The difference between the categorical and continuous intentions is not well discussed and evaluated. In particular, in the continuous case, the paper would gain if the interpolation between sub-policies were evaluated, as is suggested in the article.

Conclusion: An interesting extension of existing models, but with some unclear aspects, and with an experimental section that could be strengthened.
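The matching between latent values of $i$ and 'real' intentions that the evaluation question above refers to could, in principle, be done automatically rather than by hand. The sketch below is one hypothetical way to do it and is not the authors' procedure: given a table of average returns, it searches all one-to-one assignments and keeps the one with the highest total. The function name `match_intentions` and the numbers in `avg_reward` are made up for illustration and are not results from the paper.

```python
from itertools import permutations

def match_intentions(avg_reward):
    """Match each latent intention k to one ground-truth task.

    avg_reward[k][j] is the mean return obtained when the policy is run
    with latent intention k and scored with the reward of task j.
    Returns a list `assignment` where assignment[k] is the task matched
    to latent intention k, chosen to maximize the total matched return.
    Brute force over permutations, so only suitable for a few intentions.
    """
    n = len(avg_reward)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(avg_reward[k][perm[k]] for k in range(n)),
    )
    return list(best)

# Toy numbers (assumption): 3 latent intentions scored on 3 tasks.
avg_reward = [
    [0.1, 0.9, 0.2],
    [0.8, 0.2, 0.1],
    [0.3, 0.1, 0.7],
]
assignment = match_intentions(avg_reward)
print(assignment)  # [1, 0, 2]
```

Reporting the full `avg_reward` table alongside the matched entries would also show how specialized each sub-policy is, since a well-separated solution has one clearly dominant entry per row.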

### Reviewer 2

This paper proposes to learn skills from demonstrations without any a priori knowledge. Using a GAN, it generates trajectories from a mixture of policies and imitates the demonstrations. The idea is very simple and easy to understand, and I accept that it could work. There are, however, some issues in the details. The objective in Eq. 7 does not show p(i) being optimized, although it surely should be. In Eq. 7, should the last term, E log(p(i)), not be expressed as H(p(i)) instead of H(i)? And the former is not a constant. Figure 5 is too small and too crowded to read clearly.
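The point about the last term can be made explicit with a short identity. This assumes, as the comment above suggests, that the term is an expectation of $\log p(i)$ under $i \sim p(i)$ for a categorical intention with $N$ values; the exact form of Eq. 7 in the paper is not reproduced here.

$$
\mathbb{E}_{i \sim p}\left[\log p(i)\right] \;=\; \sum_{i=1}^{N} p(i)\,\log p(i) \;=\; -H(p)
$$

If $p(i)$ is fixed in advance, for instance uniform, this equals $-\log N$ and can be dropped from the optimization as a constant. If $p(i)$ is itself learned, the term depends on the parameters of $p$ and cannot be treated as a constant, which is consistent with the concern raised in the review.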