Review for NeurIPS paper: f-GAIL: Learning f-Divergence for Generative Adversarial Imitation Learning

NeurIPS 2020

f-GAIL: Learning f-Divergence for Generative Adversarial Imitation Learning

Review 1

Summary and Contributions: The paper proposes an imitation learning algorithm that, unlike existing variants of generative adversarial imitation learning (GAIL), learns an f-divergence instead of using a pre-defined one such as JS, KL, or RKL. The authors model the convex conjugate of f with a neural network that satisfies the constraints an f-divergence requires. The algorithm closely follows GAIL, with the difference that an f-divergence is learned along with the reward function.
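
For readers less familiar with the setup summarized here, the following is a sketch of the standard variational form of an f-divergence and of the two constraints on a learned conjugate that the reviews refer to (convexity and zero gap). The symbols \phi and \omega for the parameters of the conjugate and the critic are notation assumed for this note, and the last line is only a schematic saddle-point objective with any regularization terms omitted; the exact objective is Eq. (5) of the paper.

```latex
\begin{align}
  % f-divergence for convex f with f(1) = 0, densities p and q
  D_f(P \,\|\, Q) &= \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx \\
  % variational lower bound with the convex conjugate f^*(t) = \sup_u \{ ut - f(u) \}
  D_f(P \,\|\, Q) &\ge \sup_{T}\; \mathbb{E}_{x \sim P}\big[T(x)\big]
                     - \mathbb{E}_{x \sim Q}\big[f^*(T(x))\big] \\
  % constraints on a learned conjugate: convexity and zero gap (equivalent to f(1) = 0)
  f^*_\phi \text{ convex}, &\qquad
  \inf_{u \in \operatorname{dom} f^*_\phi} \big( f^*_\phi(u) - u \big) = 0 \\
  % schematic imitation objective (expert policy \pi_E, learner policy \pi)
  \min_{\pi}\; \max_{f^*_\phi,\, T_\omega}\;
    &\mathbb{E}_{(s,a) \sim \pi_E}\big[T_\omega(s,a)\big]
     - \mathbb{E}_{(s,a) \sim \pi}\big[f^*_\phi(T_\omega(s,a))\big]
\end{align}
```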

Strengths: The paper is well-written and easy to follow. It addresses the important problem of imitation learning, which is very relevant to the NeurIPS community. It tackles an issue in recent state-of-the-art adversarial imitation learning algorithms, which each use a specific f-divergence function. The proposed method unifies these algorithms in a simple, general algorithm that learns the divergence function instead. Each of the pre-defined f-divergence functions has its own limitations; for instance, one is mode-covering and another is mode-seeking. Learning the f-divergence function helps the algorithm find the function that is suitable for the specific problem. The results also show that similar functions are learned for similar problems. The algorithm is compared against several pre-defined f-divergence functions on MuJoCo tasks, and the results show improvement over the baselines in all cases.
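
The mode-covering versus mode-seeking distinction mentioned above can be illustrated with a small numerical sketch. It is not taken from the paper: the bimodal target, the single-Gaussian model family, and the grid search are all assumptions chosen for illustration. Fitting a unimodal q to a bimodal p by minimizing the forward KL(p || q) tends to place q between the modes, while minimizing the reverse KL(q || p) tends to lock onto one mode.

```python
import numpy as np

def normalize(w):
    """Scale non-negative weights so they sum to 1."""
    return w / w.sum()

def gaussian_pmf(x, mean, std):
    """Discretized Gaussian on the grid x, normalized to sum to 1."""
    return normalize(np.exp(-0.5 * ((x - mean) / std) ** 2))

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

x = np.linspace(-8.0, 8.0, 801)

# Hypothetical bimodal "expert" distribution with modes at -2 and +2.
p = normalize(gaussian_pmf(x, -2.0, 0.5) + gaussian_pmf(x, 2.0, 0.5))

# Model family: a single Gaussian with fixed std; only the mean is searched.
candidate_means = np.linspace(-4.0, 4.0, 161)
forward = [kl(p, gaussian_pmf(x, m, 1.0)) for m in candidate_means]  # KL(p || q)
reverse = [kl(gaussian_pmf(x, m, 1.0), p) for m in candidate_means]  # KL(q || p)

mean_forward = candidate_means[int(np.argmin(forward))]
mean_reverse = candidate_means[int(np.argmin(reverse))]

print("forward-KL mean (mode-covering):", mean_forward)
print("reverse-KL mean (mode-seeking): ", mean_reverse)
```

Which of the two behaviours is preferable depends on the task, which is part of the motivation for learning the divergence rather than fixing it.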

Weaknesses: In terms of novelty, the algorithm is a small modification of previous adversarial imitation learning algorithms. Testing on more complex domains would have given a better evaluation of the algorithm's capabilities.

Correctness: The method seems to be sound and the empirical method is correct.

Clarity: The paper is well-written and it's easy to follow.

Relation to Prior Work: The paper has properly cited and clearly discussed the relation to previous work.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper proposes to generalise GAIL to the f-divergence formulation, with a fully learnable discriminator that can approximate an arbitrary f-divergence. Extra care is taken to ensure that the numerical model satisfies the constraints of an f-divergence. Empirical results also support the proposed method.

Strengths: The proposed method is clean and sound, and implements the necessary care to ensure that the model approximates f-divergence. The empirical results are thorough and detailed, demonstrating benefits of the proposed method.

Weaknesses: I have several concerns about the proposed method: 1) While it may be implicit from the main text, a rigorous proof that the zero-gap and convexity constraints are both necessary and sufficient conditions for learning an f-divergence should be presented. 2) While the empirical section covers many important details about the actual implementation, implementations of adversarial imitation learning commonly contain subtle details that impact the evaluation. For instance, how many random seeds were used to obtain the results? How were the standard deviations obtained for each task? 3) It seems that all baselines are implemented under the model f*(T(s, a)). For known divergences, a primary advantage would be the ability to simplify the objective function (e.g. the original GAIL objective). Comparison against these optimised versions of the baseline methods is necessary for fairness.
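
As a concrete reference for the zero-gap and convexity constraints in point 1, the sketch below numerically checks both properties for the conjugates of three known divergences. The closed forms are the standard textbook ones for f(u) = u log u (KL), f(u) = -log u (RKL), and the GAN-style scaling of JS; the function names, grid ranges, and tolerances are assumptions made for this illustration, and a grid check is of course not a substitute for the proof the reviewer asks for.

```python
import numpy as np

# Conjugate f*(t) and a grid range inside its domain for each divergence.
# RKL requires t < 0 and JS requires t < log(2).
CONJUGATES = {
    "KL": (lambda t: np.exp(t - 1.0), (-3.0, 3.0)),
    "RKL": (lambda t: -1.0 - np.log(-t), (-5.0, -0.05)),
    "JS": (lambda t: -np.log(2.0 - np.exp(t)), (-3.0, 0.6)),
}

def zero_gap(f_star, lo, hi, n=20001):
    """Grid approximation of inf_t (f*(t) - t); it is 0 exactly when f(1) = 0."""
    t = np.linspace(lo, hi, n)
    return float(np.min(f_star(t) - t))

def is_convex_on_grid(f_star, lo, hi, n=2001):
    """Check that second finite differences are non-negative on a uniform grid."""
    t = np.linspace(lo, hi, n)
    return bool(np.all(np.diff(f_star(t), 2) >= -1e-9))

for name, (f_star, (lo, hi)) in CONJUGATES.items():
    print(name, "gap:", zero_gap(f_star, lo, hi),
          "convex:", is_convex_on_grid(f_star, lo, hi))
```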

Correctness: I have several concerns about the two criteria proposed in Section 4.1. 1) How are the input distributions for each method computed (e.g. from (s,a) sampled from the learned policies, or from expert (s,a))? 2) It is unclear whether the two criteria can be used to assess the quality of the learned policy. The current claim seems to rely on the assumption that f*(T(s, a)) is a good discriminator between the expert and learner, which may not necessarily be true. Perhaps the authors should consider using the expert's (s,a) to first show that f*(T(s, a)) can achieve a divergence value of near zero, before analysing the learner policy.
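
The sanity check suggested in point 2 can be sketched on a toy problem. The example below is an illustration under strong assumptions and not the paper's evaluation protocol: it uses one-dimensional Gaussian samples in place of (s,a) pairs, the KL conjugate, and the analytically optimal critic T = 1 + log(p/q), which is available here only because both densities are known. In adversarial imitation learning the critic is learned, so the estimate is only as good as the critic.

```python
import numpy as np

def kl_conjugate(t):
    """Convex conjugate of f(u) = u log u."""
    return np.exp(t - 1.0)

def variational_estimate(expert_x, policy_x, critic, f_star):
    """Sample estimate of E_expert[T(x)] - E_policy[f*(T(x))]."""
    return float(np.mean(critic(expert_x)) - np.mean(f_star(critic(policy_x))))

def make_optimal_kl_critic(mu):
    """Optimal critic for P = N(mu, 1), Q = N(0, 1): T(x) = 1 + log p(x)/q(x)."""
    return lambda x: 1.0 + mu * x - 0.5 * mu ** 2

rng = np.random.default_rng(0)
n = 200_000

# Both sample sets drawn from the same distribution: the estimate should be near zero.
same = variational_estimate(
    rng.normal(0.0, 1.0, n), rng.normal(0.0, 1.0, n),
    make_optimal_kl_critic(0.0), kl_conjugate)

# "Expert" shifted by one unit: the true KL(N(1,1) || N(0,1)) is 0.5.
shifted = variational_estimate(
    rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n),
    make_optimal_kl_critic(1.0), kl_conjugate)

print("same distribution:", same)
print("shifted expert:   ", shifted)
```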

Clarity: The paper is clearly written and easy to follow.

Relation to Prior Work: There are sufficient details about previous work and how it relates to the proposed method.

Reproducibility: No

Additional Feedback: After rebuttal: I thank the authors for their responses. My concerns about the sufficient and necessary conditions and the divergence evaluations are addressed. My concerns about the implementation details and the performance comparison against baselines remain relatively open. For instance, the baseline performances seem low compared to the results reported in the original papers (e.g. walker, hopper on GAIL). I agree with the other reviewers' assessment that more evaluation could further support the method. As such, I will keep my current score.

Review 3

Summary and Contributions: The authors propose a novel imitation learning approach. The approach tries to find the best f-divergence to minimize and then minimizes it with an adversarial objective. To do so, the authors learn an explicit representation of the Fenchel conjugate, using a special network architecture as well as a projection approach to satisfy the necessary constraints.
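
To make the idea of a projection onto the constraint set concrete, here is a minimal sketch of one way such a step could work; it is not claimed to be the authors' implementation. It assumes a candidate function g that is convex by construction (a hypothetical max-of-affine function stands in for the special network architecture) and shifts it by its own gap, estimated on a finite grid, so that the minimum of f*(u) - u over the grid becomes zero. A constant shift does not affect convexity.

```python
import numpy as np

def max_affine(u, slopes, intercepts):
    """Convex by construction: pointwise maximum of affine functions of u."""
    u = np.asarray(u, dtype=float)
    return np.max(slopes[:, None] * u[None, :] + intercepts[:, None], axis=0)

def project_zero_gap(g, grid):
    """Return f*(u) = g(u) - min over the grid of (g(u') - u')."""
    gap = float(np.min(g(grid) - grid))
    return lambda u: g(u) - gap

# Hypothetical parameters standing in for learned weights.
slopes = np.array([-0.5, 0.5, 2.0])
intercepts = np.array([0.3, 0.1, -1.0])

def candidate(u):
    return max_affine(u, slopes, intercepts)

grid = np.linspace(-4.0, 4.0, 801)
f_star = project_zero_gap(candidate, grid)

print("gap before projection:", float(np.min(candidate(grid) - grid)))
print("gap after projection: ", float(np.min(f_star(grid) - grid)))
```

In this sketch the gap is taken over a fixed grid; estimating it over the critic outputs observed in a batch would be another option, with the caveat that the constraint then only holds on the sampled points.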

Strengths: The idea of finding the best divergence to match the expert’s occupancy measure is novel and could lead to improved performance in imitation learning algorithms.

Weaknesses: The proposed formulation is not completely well-defined and the evaluation does not show what the authors claim.

Correctness: One main concern is the choice of experiments, which is highly problematic when evaluating the algorithm's capability to match the distribution of the expert. In all domains except Reacher, it is sufficient to match a specific and mostly constant feature. For the walkers, this feature is the forward velocity, while for CartPole it is the angle. The Reacher domain is only slightly more complex, as the agent has to match the difference between the given target and the x-y position; however, here too the agent only has to match a constant value rather than a distribution. As the paper is about divergences, it is problematic that the experiments do not include a task where the agent has to match a specific distribution rather than reproduce a constant value to achieve a high score. Another concern is the metric used in the main evaluation. The authors compare the sample efficiency of different approaches and use this to claim superiority of the new approach. At no point in the paper do the authors elaborate on how different divergences may affect the sample complexity of imitation learning. Furthermore, the authors compare against methods that estimate the divergence in different ways. It is very plausible that the differences are better explained by the ability to accurately estimate any divergence from a limited number of samples than by the divergence itself.

Clarity: Overall, the paper is well written and easy to understand.

Relation to Prior Work: The authors cover the most relevant related works and the idea is novel.

Reproducibility: Yes

Additional Feedback: My other main concern is that the objective in Eq. (5) is badly motivated and its implications are underexplored. The imitation learning objective is notoriously ill-defined, and a large part of the literature focuses on introducing objectives that produce good behavior. The notion of finding the “best” f-divergence therefore requires us to state what we are optimizing for, which the authors don’t do very explicitly. On line 38, the authors mention that an imitation learning method that uses a fixed divergence is likely to learn a sub-optimal policy, but the notion of optimality does not exist without a given divergence. For example, whether mode-seeking or mode-covering behavior is better depends entirely on context that the agent does not have. Either solution could be better. The authors mention that the best divergence should be the largest one, but this notion is not elaborated on and the implications are unclear. Figure 4 suggests that the agent does converge to a specific f-divergence. It would be good if the meaning of the divergence could be explored further. This is especially true for the divergences shown for Hopper and Walker2d, whose conjugates look very different from the RKL and JS conjugates. Furthermore, it would be good to show the variance of the learned conjugates, i.e. does the algorithm always converge to a similar divergence? Another point: while it is correct that BC minimizes the KL divergence on the policy, it is misleading to group it with the other methods, as the other methods minimize the divergence on the state-action joint distributions. The latter requires an understanding of the environment dynamics and can therefore be much more effective. A minor point: since states are generally continuous in this work, P should map to [0, inf) on line 62.

Review 4

Summary and Contributions: The paper presents an approach for learning an f-divergence measure as part of the discriminator training process within a GAIL framework for imitation learning. In particular, the discriminator function is divided into a reward signal function and an f-divergence function. Several necessary conditions are imposed onto the f-divergence function architecture to satisfy the requirements for it to be an admissible f-divergence measure. The f-divergence is trained jointly with the reward signal function as a part of discriminator training. The learned f-divergence is shown to improve performance over pre-defined divergence measures, such as JS, KL and RKL divergences, in multiple OpenAI Gym tasks.

Strengths:
- Imposing additional structure on the discriminator while at the same time keeping it differentiable is a very interesting way to add inductive bias and improve performance.
- The proposed method is shown to significantly improve performance on the evaluated tasks over pre-defined divergence measures.

Weaknesses: In the presented approach, training of the f-divergence function is based on the discriminator loss and is not directly influenced by the performance of the learned policy. Although we can see improvements in the final policy performance, it would be interesting to see a discussion of whether optimizing the discriminator loss always leads to increased policy performance, and whether using the policy performance as an additional training objective for the f-measure would further improve performance (e.g. in a meta-learning setting).

Correctness: Claims and mathematical derivations in the paper are coherent and empirical evaluation correctly shows performance improvements over the baselines.

Clarity: The paper is well-written and easy to understand and follow.

Relation to Prior Work: The paper clearly establishes connection to prior works and uses them as baselines for the experiments.

Reproducibility: Yes

Additional Feedback: Post-rebuttal comments: Thanks for addressing the reviewers' comments and providing new details in the rebuttal. As mentioned by other reviewers, the notion of optimality of the f-divergence is somewhat ambiguous and currently supported mostly by empirical evidence, so it would be great to see more clarification on this in the final version of the paper.