Review for NeurIPS paper: Toward the Fundamental Limits of Imitation Learning

NeurIPS 2020

Toward the Fundamental Limits of Imitation Learning

Review 1

Summary and Contributions: This paper presents a number of theoretical results for (variants of) behavioural cloning approaches to imitation learning. These results indicate that the strategy of directly mimicking the expert is condemned in the worst-case to make an error that scales quadratically with the horizon. The same applies to a variant involving querying the expert, e.g. Dagger. It seems that presented upper bounds on these methods are similar to those known in the literature, e.g. RGB11. Another result of the paper is that an improvement to linear dependency in horizon can be obtained by making a transition function available to the learner. I believe this result is meaningful as a number of practical approaches are based on the use of simulator to overcome the expert-learner distribution shift. ----------- Rebuttal ----------- I have read the author feedback. * I believe that the problem characterization in terms of number of states |S| is weakly informative. In high-dimensional spaces, it should be expected to find an error in close proximity of the expert trajectory. I would be interested in worst-case results over distribution shifts that are likely to occur in practice; e.g., in the robust MDP framework one solves a minimax problem over a *bounded* amount of noise (see e.g. G. Iyengar "Robust dynamic programming"). * In the light of previous comment, it is not clear to me that this is a limitation of the imitation learning. * Given that quadratic upper bounds for supervised learning algorithm are tight for deterministic expert policies, the proposed lower bound (Theorem 4) is not surprising. * The real issue is how to recover from the unseen situations. This paper proposes the use of transition / simulator that seems to me much more informative result, unfortunately, only limited to one section. Due to the above considerations, I do not revise my score.

Strengths: I find the analysis with the known transition matrix informative on how to tackle the distribution mismatch between the learner and the expert in behavioral cloning. It seems that adjusting for covariate shift using a simulator / transition function is a promising approach. Practical applications would be benefit from potential extension of these results to a setting with approximate transition function.

Weaknesses: The simple behaviour cloning as well as interactive approaches are analysed in the literature using reduction to supervised learning. Thus, somewhat unsurprisingly, they suffer from the same caveats in the presence of distribution shift. When the state space is large, it will always be possible to find a state that wasn't visited by the expert trajectory. Thus, I am not fully convinced of the implications of the worst-case results presented in the paper. In addition, some results only apply to deterministic expert policy that I find restrictive in a practical setting.

Correctness: It seems that the upper bounds on behavioral cloning (simple and interactive) are in-line with the previous work. The lower bound results intuitively correspond to the supervised learning reduction. For the known-transitions setting, the theorem statements make sense, although I did not check the details of the proofs.

Clarity: The paper is well-structured by first presenting the overview of results and then describing the details. It would be helpful for the reader to visually illustrate the Mimic-MD algorithm as its mathematical notation is quite dense.

Relation to Prior Work: It would be helpful to include in Table 1 the comparison of the obtained bounds to the prior work. I feel that this work would benefit from deeper relating to the previously proposed practical algorithms and their theoretical analysis, e.g. SVBB19.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: In this paper, the authors study theoretical bounds for imitation learning algorithms, with a focus on behavior cloning, in various settings including no-interactions, interactive and known dynamics. A major result is a quadratic term in time horizon is a fundamental gap in imitation learning, whatever the underlying algorithm is. In settings where dynamics are known, the quadratic barrier can be broken via the proposed MIMC-MD algorithm. Finally, lower bounds are given are these settings as well. ====== Post-rebuttal: thank you for the rebuttal. I still hope there were more discussions on DAgger vs BC. One potential improvement is to provide concrete assumptions on MDP where one could separate DAgger and BC on sample complexity. As a result, I decided to keep my original score.

Strengths: The presentation of theoretical results is very well done, especially the transition between each setting and the connections between them. The results in Section 4 on behavior cloning, while not groundbreaking, help unify known results. Together with the lower bounds in Section 6, they establish the optimality of behavior cloning in terms of the minimax expected error, which is a somewhat surprising result. The result in Section 5, where the dynamics are assumed known, is significant in achieving the first bound to break the quadratic compounding errors. These results are very relevant to the NeurIPS community as they provide theoretical insights to general imitation learning algorithms.

Weaknesses: I found the study on the active setting lacking. In Table 1, it is stated an upper bound is provided for active setting in Theorem 1 but the actual Theorem 1 only considers the no-interaction setting. While it is not entirely surprising that the active setting does not improve upon behavior cloning in the worst case, some discussion on their empirical success, as demonstrated by algorithms such as DAgger, will help bridge the gap between no-interaction and active settings.

Correctness: The claims seem correct though I did not check the proofs in the Appendix.

Clarity: This paper is mostly clear. In equation (9), the probability terms are not defined, though one can mostly understand what they mean with the context.

Relation to Prior Work: The differences from previous works are stated clearly.

Reproducibility: Yes

Additional Feedback: In Section 5, Remark 2, the authors mention various distribution matching approaches in imitation learning. Algorithm 2 can be also viewed as a version of distribution matching (equation (8)). Can the authors provide some intuition on what distributions are being matched? And what is the basis for the conjecture that existing distribution matching approaches are unlikely to achieve the bound in Theorem 3?

Review 3

Summary and Contributions: The paper studies the statistical sub-optimality bounds (w.r.t expert's value) of imitation learning algorithms in three settings of episodic tabular MDPs: (a) no-interaction with MDP allowed but a dataset of N expert trajectories is provided, (b) known-transition: MDP transition function is provided in additional to (a), and (c) active: no dataset provided but MDP interactions allowed and expert queries allowed at any state. Upper and lower bounds are presented (mostly for behavior cloning) for the two cases of deterministic and non-deterministic experts, out of which the the upper bound on sub-optimality for non-deterministic experts by an empirical risk minimizer under log-loss is a novel contribution as it does not depend on the cardinality of the action space (Theorem 3 in paper). Further, in the known-transition setting, a lower bound is provided which depends linearly on episode length H (i.e. |S| H / N) and a novel algorithm MIMIC-MD is proposed that achieves a (approx.) |S| H^{3/2} / N upper bound on sub-optimality, improving upon that of max likelihood estimation (MLE) and the traditional |H|^2 dependence of sub-optimality due to compounding errors in the other unknown-transition settings. The intuition of the algorithm is that even if a mistake is made by the learner that leads it to states not present in the expert's demonstrations, it can correct itself since it knows the transition function. Such corrections are approximately computed by dividing the expert dataset into half, the first half being used for imitating the expert and the other for approximating the probability of correction. ==================================== Post-rebuttal update: After carefully going through the other reviews (especially the comments by R1 and R4) and the author's response, it is clear that I had missed several drawbacks of the paper in my original assessment (because of my unfamiliarity with relevant prior work), particularly the lack of significance of the lower bounds (in Theorem 4), computational inefficiency of the algorithm (while also acknowledging the contribution, which will help future work in reducing sample complexity) and the lack of discussion about recovery from unseen situations (R1). Out of these, I have considered (to the best of my knowledge) the first to be important while the second and third drawback less relevant to the main contribution (and assumptions) of the paper. However, I do not feel that these drawbacks are sufficient grounds for decreasing my rating, so I will maintain it and recommend acceptance. In order to reflect the updated confidence of my assessment, I have additionally reduced my confidence score from 3 to 2.

Strengths: The paper's strengths are in the novel algorithms or bounds proposed in the imitation learning settings: - In the no-interaction setting (dataset of expert demonstrations provided), behavior cloning is shown to be optimal for deterministic experts and empirical risk minimization under log-loss is optimal for stochastic experts. The latter is the suggested to be the first such bound without dependence on the cardinality of the action space. - Due to the optimality of behavior cloning and the lower bound in Theorem 1, it is shown that methods in the active setting (can query expert and interact with MDP) cannot improve over behavior cloning in the no-interaction setting (for e.g.: for active setting algorithms such as DAGGER, Aggravate). - The MIMIC-MD algorithm is proposed that uses the intuition that corrections can be made in the known-transition setting to get back to states in the expert's demonstrations after making an error. This algorithm is shown to have a sub-optimality upper bound with |H| raised to the power 3/2, which is shown to be just a square root of |H| times away from the worst case lower bound of |H|.

Weaknesses: Nothing notable to the best of my knowledge.

Correctness: The proof sketches throughout the paper are quite accessible and seem reasonably correct. However, I could not verify the correctness of all the proofs in the appendix.

Clarity: The paper is well written, with a good slow build up of informal theorems highlighting the main contributions to the final core results.

Relation to Prior Work: Yes, the paper is appropriately discussed with respect to prior work.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This work provides a thorough and clear summary of lower and upper bounds in three related problems in imitation learning over finite MDPs with bounded rewards: no-interaction (which is classical), active (which is similar to that in DAgger [RGB11]), and known transition (a novel setting). Though some results are similar to prior works ([RB10]), or perhaps not very surprising for some, the thoroughness is remarkable. The specific novel contributions include: 1. Making clear that without further assumptions (like that in DAgger), the lower bound of active setting matches the upper bound of behavior cloning in the no-interaction setting, which are both O(H^2). 2. Pointing out that the reduction to supervised learning approach has limitations when the expert policy is stochastic (in the form of a dependency on A and a slow statistical convergence of √N). 3. Proposing the known transition setting and an algorithm MIMIC-MD (not based on reduction) which has an expected error of O(H^(3/2)).

Strengths: 1. Answer many basic questions regarding the relations of two popular settings in IL (See Summary1). 2. Propose a new setting (known transition) that captures intuition about the learner’s ability of correcting mistakes. Moreover, a near-optimal algorithm is proposed. 3. The writing of the main article body is easy to follow with substantive intuitions.

Weaknesses: 1. A discussion of the computational complexity of MIMIC-MD would be a nice addition. What is it? Is there a nice procedure for the optimization problem (8)?

Correctness: The article is theoretical and the results seem correct upon high-level checks.

Clarity: The presentation is excellent. I enjoyed reading the article and its supplement.

Relation to Prior Work: The rigor and clarity of this work contrasts favorably to other works in this domain in my opinion. 1. [RB10] contains in its supplement a lower bound example for the no-interaction setting which yields O(H^2). How would you compare your lower bound example (Fig 1(a)) with theirs? In particular, does the example of [RB10] undermines your claim on L84?

Reproducibility: Yes

Additional Feedback: 1. Could you comment on the obtained lower/upper bounds of the known transition setting? Do you think either the example is weak or the analysis of MIMIC-MD is loose? It may help to provide some numerical results. 2. (Minor) (54) in Appendix B.1.1 should be an equality, right? And the constant in (55) can be made 1/e (which is smaller than 4/9 and nicer), c.f., (135). Several results seem to depend essentially on Lemma 1. Perhaps it is worth mentioning in the main text? 3. (Minor) L565 typo. T^\pi_{t, s}(i) —> T^*_{t, s}(i). 4. Inhomogeneous (or episodic) MDPs are common in IL literature but their results often lack finer instance-dependency on the Markov structure unlike RL results on homogenous MDPs (where transitions are time-invariant). Given that your lower bound examples are homogeneous MDPs, could you comment on whether some results would be different under homogeneous MDPs (if space permits). ## Post-rebuttal Thank you for replying and commenting. I slightly lowered my score (9 -> 8) to reflect the new information on the computational inefficiency of MIMIC-MD which also explains the absence of numerical confirmation. I will suggest making this fact explicit in any future version as "algorithm" without further qualifications is usually understood as computationally efficient. As R1 has touched on, a discussion/illustration on recovery (under a good mimic policy according to (8)) may provide readers insights to a computationally efficient approximation (for future studies).