NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4289 SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies

### Reviewer 1

The paper introduces a scalable approach to meta inverse RL based on maximum entropy IRL. The baseline is a meta-learning method based on behavioral cloning, over which a significant performance improvement is obtained.

Pro: The approach seems technically sound, building on the theory of AIRL/GAIL. Also, implementing the equations in a practical and efficient way is a non-trivial contribution. Furthermore, the paper is clearly written. The motivation for IRL versus BC and the advantages that IL can have over RL are clearly explained.

Con: The evaluation of the method could be more extensive. Specifically, a more detailed comparison with meta-RL methods and existing meta-IRL methods would help in better evaluating the strengths of the method. Currently, only an empirical comparison is made against a meta-BC baseline. Comparing against meta-RL methods is relevant, because the paper motivates the use of IL by the fact that it can be used on complex tasks where RL methods fail. While this is true in principle, the domains used in the experiments are all domains where standard RL methods can obtain good performance. Comparing against meta-IRL methods is also relevant, in particular with the method from citation [33]. The main advantage of SMILe over this method appears to be a computational one. However, it remains unclear how large this computational advantage is for the domains considered in this paper. Furthermore, it would be interesting to also compare the sample efficiency of the two methods. This would give insight into the cost (in terms of sample efficiency) that SMILe pays for the gain (in terms of computation).

Overall: Interesting, new meta-IRL method with limited empirical evaluation.

Minor: line 39/44: what does "intractable" mean in this context? If it just refers to general function approximation, I would drop this term.

AFTER AUTHOR REBUTTAL: The additional experiments from the author response have (partly) addressed my concerns. I've increased my score by 1, on the condition that these experiments are added to the final paper.

### Reviewer 2

The submission is well-written, with sufficient experiments supporting the argument. The proposed method is clear and convincing. The reviewer is only concerned with the significance of the submission. It seems that the biggest difference between the proposed method and the AIRL framework lies in the off-policy training, which however is borrowed from [20] and is also SEEMINGLY related to Rakelly, Kate, et al. (2019). Therefore, it would be great if the authors could further emphasize the differences between the proposed method and all of these methods in the rebuttal.

Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." arXiv preprint arXiv:1903.08254 (2019).

AFTER AUTHOR REBUTTAL: The authors' rebuttal is clear and convincing. Therefore, I increase my score to 7.

### Reviewer 3

The quality of the submission is fairly high; the writing is good and the paper is easy to follow and feels complete. The originality of the work is just OK; it is a fairly straightforward idea and can be boiled down to "AIRL with conditioning". Significance is reasonably high given the current popularity of meta-learning and IRL. No theory is provided, but the experiments have good coverage, testing a number of different aspects of the proposed method, and the methodology seems sound. One exception to this is the rather weak behavior cloning baseline.

One way in which the exposition could be improved is by being more explicit about how the different functions are parameterized, and about how the discriminator is optimized. At this point in the history of AI I guess it's safe to assume you performed gradient ascent on the objective in equation 4, but it's probably still worth being explicit about that.

Line 87: shouldn't there be an entropy term in the reward function?

In Figure 1, it feels like there should be an arrow from E to C, especially since for all tasks considered in this paper, C takes the form of expert trajectories, which are of course heavily dependent on E (e.g. beginning of section 3.3). E and C are definitely not independent of one another given T.

Line 250: did you mean \exp_base instead of exp_base? Or maybe \pi_base? exp_base seems out of line with the rest of the notation.

The subsampling of trajectories, discussed briefly in the caption of Table 1, is not really addressed in the main text, but probably should be, as its role was not clear to me.

I would be interested in seeing results for non-trivial generalization (e.g. extrapolative rather than interpolative generalization), and in more difficult stochastic domains.

The description of the classification experiment in Section 4.4 should be made clearer. I'm confused about what the classification decision is supposed to be based on, given that the agent is provided 2 points, one on either side of the hyperplane for the task. Is it based on the order of the two points (i.e. if the first point is on side A of the hyperplane, go to location 1; if it's on side B of the hyperplane, go to location 2)? I'm also confused about how you are able to use the experts from Ant 2D goal as the experts in this task, since the state space is now augmented with the two 4-dimensional vectors to be classified. I guess you just choose which expert to use based on which target the expert is supposed to go to, and have the chosen expert ignore the 4-dimensional vectors? To provide context for the level of performance achieved in this task, it might be useful to see the performance of a linear binary classifier trained under the same conditions (e.g. seeing only 1 to 8 pairs of training points per task).
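To make the suggested linear baseline concrete, here is a minimal sketch of what such a comparison could look like. It is not the paper's setup: only the 4-dimensional inputs and the 1 to 8 training pairs come from the discussion above. The hyperplanes through the origin, the Gaussian sampling of points, the perceptron as the linear classifier, and all function names are assumptions made purely for illustration.

```python
import random

DIM = 4  # 4-dimensional inputs, as in the review; everything else below is assumed


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def side(normal, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if dot(normal, x) >= 0 else -1


def sample_task(rng):
    """Hypothetical task: a random hyperplane through the origin (its normal)."""
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def sample_point(rng, normal, label):
    """Rejection-sample a standard Gaussian point on the requested side."""
    while True:
        x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        if side(normal, x) == label:
            return x


def sample_pairs(rng, normal, n_pairs):
    """Each pair contains one point from either side of the hyperplane."""
    data = []
    for _ in range(n_pairs):
        data.append((sample_point(rng, normal, 1), 1))
        data.append((sample_point(rng, normal, -1), -1))
    return data


def train_perceptron(data, epochs=50):
    """Plain perceptron without a bias term (hyperplanes pass through the origin)."""
    w = [0.0] * DIM
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # training set is fit perfectly
            break
    return w


def baseline_accuracy(n_pairs, n_tasks=200, n_test=100, seed=0):
    """Average test accuracy over many sampled tasks for a given number of pairs."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_tasks):
        normal = sample_task(rng)
        w = train_perceptron(sample_pairs(rng, normal, n_pairs))
        for _ in range(n_test):
            x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
            correct += side(w, x) == side(normal, x)
    return correct / (n_tasks * n_test)


if __name__ == "__main__":
    for k in range(1, 9):
        print(k, round(baseline_accuracy(k), 3))
```

Reporting this accuracy for each number of pairs next to the meta-learned policy's classification accuracy would show how much of the achieved performance a simple per-task linear learner already explains.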