NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5045
Title:Non-Cooperative Inverse Reinforcement Learning

Reviewer 1

This is an application paper that motivates well the design choices (i.e. model and framework) within the realm of cyber security applications. The paper’s main contribution is a result that states how the problem being studied can be decomposed and restricting to a type of strategy profiles is enough. This result can impact on the design of tractable algorithms for this problem. With this in mind, the relevance of the work within the AI community touches those interested specifically in cyber security, and more generally, to those interested in adversarial decision making. Unfortunately, the authors make some design choices (e.g. two phases, one-stage strategies) that makes the results hard to transfer to more general adversarial settings. What worries me most is the lack of formal analysis of the two phase assumption (i.e. that the learning problem is divided into a learning and deployment phases). Specially, the problem being tackled involves non-stationarity (i.e. the unknown parameter to the defender, theta, changes in time), rendering any deployment strategy (i.e. one that does not involve learning and therefore remains fixed) moot for such problem. A more careful analysis of this situation and a more thorough experimental section could go a long way in clarifying this point. However, I would still feel uneasy on this algorithmic approach for non-stationary environments.

Reviewer 2

Comments after rebuttal and discussion ====================================== Thank you for clarifying my misunderstandings in the rebuttal. I no longer have any major technical concerns and I have adjusted my score to reflect this. It still strikes me as slightly odd that the proposed algorithm does not make use of any data of play, i.e., it isn't really inverse reinforcement learning. Original Review =============== Overall, I enjoyed reading this paper. It is fairly clear and well-written. I have some concerns about the fundamental assumptions and simplifications that lead me to believe it is not ready for acceptance, though. Hopefully I have a misunderstanding that can be corrected by the authors. My understanding of Rosenberg and Vieille 2000 is that the information state reduction (where the players need only consider the distribution over the utility parameters instead of the entire history) is only valid when the attacker must disclose its behavioral strategy prior to the defender responding. At the very least, they only consider the maxmin case and discuss that in general such games need not have a value (i.e., the minmax case, where the defender acts first, results in different value). They state in the concluding remarks, "Remark 3: Unfortunately, we say nothing of the minmax [case]" implying that their analysis somehow breaks otherwise. This is troublesome for a number of reasons: First, typically when modelling security games as Stackelberg games, the defender acts first. We are interested in the defender's strategy, which when acting second is a function of the attacker's strategy. Realistically, we will only have access to the attacker's strategy through samples of play. There is no analysis of how the quality of the defender's play degrades due to approximation error in the attacker's strategy. This seems to be a concern even should the information state reduction hold in the minmax case, as the posterior update (eqn 4) requires access to the attacker's one-stage strategy. Second, if we wish to consider the defender acting first, which seems more natural, we may have to consider the full strategy space, i.e., behavioral strategies that depend on the complete history of states and actions. This is quickly becomes intractable. At the very least, it has not been shown that the reduction is lossless. Third, it implies that N-CIRL cannot be "solved" in general. i.e., the strategies resulting from convergence of the contraction mapping may not form a Nash equilibrium of the game defined in 2.1. Furthermore, there may be no efficient dynamics that converge to an equilibrium of the N-CIRL game. At a more high level, does it make sense to call this "inverse reinforcement learning"? It seems related only in that the attacker's intent is unknown, but there are no observations of an actual agent anywhere. Once the value function has converged, the strategy profile is fixed. Presumably if the attacker plays anything other than the strategy derived from that particular value function then the defender must change its response. Minor comments: 16: This paragraph comes across as value alignment is necessary. I think it is more correct to say that it is a desirable property. Disaster is not guaranteed to happen if it does not hold, and disaster can still happen if it does hold. 24: IRL is "recent"? 27: Observer in IRL isn't really an agent as it has no actions or reward. 108: The attacker can observe the reward, since they know their intent. 149: \Theta is required to be finite on line 95, but here it is [0, 1] 201: \sigma^D = (\theta_0^D, \theta_t^D, ...), typo \theta_t^D => \theta_1^D 206: It would be helpful to make it explicit that the one-stage strategies depend on distribution of intent parameters. 423: if => \text{if}

Reviewer 3

Updated review: - The author feedback on my question of the novelty of the contribution compared to [Rosenberg, 2000] has satisfied my concern and helped me position the paper's contribution. - I requested computational complexity analysis, and the author feedback says they are deriving results to show this which is encouraging, but there's no guarantee this will be ready in time for a final version. - I also requested experiments on more challenging environments, and the author feedback suggests they will include a larger less symmetric setting which I would be excited to see. - My comments about multiple data points for MA-IRL have been addressed in the author feedback and a new experiment has been offered, which I would be keen to see. Overall I am inclined to keep my original high rating for this paper. ------ Original review: Originality To the best of my knowledge, no-one has considered the non-cooperative IRL setting addressed in this paper. However, the main theoretical result of the paper rests heavily on a result of (Rosenberg 2000), that a game with one-sided information allows a decomposition into a sequence of single-stage games. The crux of the theoretical contribution then seems to be noticing that non-cooperative IRL is a game with one-sided information, and applying the Rosenberg result. This does seem to be significant, but the fact that it’s just an application of the Rosenberg result reduces the originality somewhat, compared to if Rosenberg hadn’t already shown this general result. This theorem is then applied to design an algorithm for computing the equilibrium strategies, resulting in a linear program. Nonetheless, it still scores high originality from me for the motivation and framing as an attack-defense model in cybersecurity. Quality Overall this paper has high quality, with clear lemma’s, theorems and proofs. To the best of my knowledge the proofs seem sound. However, I think the proof of Lemma 2 is not sufficiently explicit. Why does the expansion from line 1 to line 2 hold? How do we go from line 2 to line 3? Line 3 to line 4 is fine, as this is just cancelling factors top and bottom. Some comments are needed to explain lines 1 -> 2, 2 ->3. Perhaps this proof could be moved and expanded in the supplementary material. Clarity The paper is well written and clear overall, and well organised. The related work being in the supplementary material is a bit unfortunate I think stylistically. Significance In the experiment on intrusion response, it’s not that obvious that N-CIRL does much better than MA-IRL. MA-IRL seems to have a high variance, but there is a high density patch of lower attacker rewards for MA-IRL than the proposed N-CIRL method. Maybe I missed something, but why are there many MA-IRL datapoints, but only a single one for N-CIRL? Is there a stochasticity in MA-IRL, but not in N-CIRL? I think this should be explicitly explained, in order to evaluate the significance of the result. There’s no analysis of the computational complexity of the proposed algorithm, which would also affect the significance of the work. The abstract claims it is efficient, though I didn’t see this discussed in the work. I was under the impression that solving a CIRL game is computationally difficult, and I would imagine solving N-CIRL is even harder, so I would like to know the computational complexity, and how this is addressed.