NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5118
Title: Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Reviewer 1

Originality: The main idea of the paper - avoiding the long-horizon problem by computing IS over state distributions rather than trajectories - was already introduced in (Liu et al. 2018), which the authors cite sufficiently often in the text. However, the approach the authors take to leveraging this idea is original. Additionally, there is not yet enough published work leveraging this potentially important idea (IS over state distributions), and therefore even being the second paper in this direction is still charting new territory.

Quality: To the extent I looked at it, the theoretical work is solid. I did not check every equality in the proofs for algebraic errors, but I did go through every step of the proofs in the appendix. I have some comments on the exposition of the proofs, but I leave those to the "Clarity" section. The experimental section is passable (apart from a potential problem which I will detail later), but somewhat below what I'd expect of a NeurIPS publication. Treating the ModelWin/ModelFail domains as toy examples for demonstration, the only actual domain is Mountain Car, which is a fairly simple example. I think the paper should definitely try to include more domains, even other classical control tasks such as Acrobot, CartPole, etc.; I expect these shouldn't be hard to implement if the authors already have the Mountain Car implementation. Specific baselines I would really like to see the authors add are PDIS (per-decision IS) and CWPDIS (consistent weighted per-decision IS). The reason I would like to see these specific estimators is that I think they would make for an interesting comparison in the ModelWin/ModelFail domains, in which every two steps (or one, depending on whether you count the actionless transition back to s_1 as a step) effectively start a new trajectory. I am guessing that is why MIS, which needs to observe many transitions rather than trajectories, does so well in the increasing-horizon experiment. The PDIS estimator might be able to benefit somewhat from this property, albeit not as much, since it does not suffer the problem of non-per-decision methods, for which decisions that differ from the evaluation policy late in the trajectory diminish the utility of "good" observed transitions early in the trajectory. I think the most interesting comparison in the experimental section is with the SSD-IS estimator, since both methods share the same general philosophy, and I think the authors should discuss it more thoroughly. Why does it achieve EXACTLY the same results as DM on ModelFail? Why does it perform as well as MIS on Mountain Car but eventually stop improving? (Is that a limitation of the function class used as discriminator functions in the implementation of SSD-IS?) Lastly, there is one critical question I have regarding the time-varying MDP example, although that may just be a misunderstanding on my part: if p_t is sampled uniformly at each time step, isn't the probability of transition (for example from s_1 to s_2 given a_1) obtained by marginalizing p(p_t)*p_t over p_t, making that setting equivalent to a time-invariant MDP with p = 3.5? (See the worked formulas after this review.)

Clarity: Overall the paper is clear but could be improved. The motivation and background are clear. Section 4 (theory) is clear, but Section 4.1, which attempts to sketch the proof, is confusing. The proof as written in the appendix is well written, but the authors should do a better job of sketching it in the main text. Sketching proofs is always hard, but I'd prefer the authors to sketch fewer steps and limit themselves to the bare basics, stated clearly in the main text with proper references to the (well-written) appendix. Apart from the confusion regarding the time-varying MDP case, which I hope the authors will clarify, the experimental section is well written, but as I mentioned it should include a better discussion of the comparison with SSD-IS. Less importantly, the description of the ModelWin/ModelFail case is needlessly confusing and could be stripped down to an easier-to-grasp explanation if, in the ModelWin case, the agent also received the reward upon the transition from s_{2 or 3} to s_1. This would not change any of the dynamics but would reduce the difference between the models to full vs. partial observability, without introducing the needless difference in reward timing. It should also be stated that the agent needs to choose an action at states s_{2 or 3}, but that it doesn't matter which one it chooses.

Significance: I think the paper is fairly significant.

Post-review update: The authors addressed some of my concerns regarding the clarity of the paper, and I have revised my score accordingly. Regarding the authors' explanation of the comparison with SSD-IS: I find the explanation plausible, but not entirely convincing. Although SSD-IS assumes that the state distribution is stationary, I have encountered cases where SSD-IS performs well in practice even when that assumption is broken. If the authors are correct, I think they should be able to demonstrate their claims empirically. In addition to including in the main text the explanation the authors gave, I think the paper would greatly benefit from showing in the appendix the state distributions at different times and correlating them with the errors of both SSD-IS and MIS. By showing such plots, the authors should be able to convincingly demonstrate their claim that the difference in performance is a result of SSD-IS not working with a truly stationary state distribution.
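[Editor's note] For reference, the standard undiscounted finite-horizon estimators the review asks to add as baselines, with $\rho^i_t = \pi(a^i_t \mid s^i_t)/\mu(a^i_t \mid s^i_t)$ the per-step action ratio of trajectory i (this is the textbook form, not taken from the paper under review):

\[
\hat{V}_{\mathrm{IS}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big(\prod_{t=0}^{H-1}\rho^i_{t}\Big)\sum_{t=0}^{H-1} r^i_t,
\qquad
\hat{V}_{\mathrm{PDIS}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{H-1}\Big(\prod_{t'=0}^{t}\rho^i_{t'}\Big)\, r^i_t .
\]

In PDIS the reward at step t is weighted only by the ratios up to step t, which is why off-policy actions late in a trajectory cannot diminish the contribution of early transitions, the property the review appeals to.

As for the time-varying MDP question, a minimal formalization under the reviewer's reading (p_t drawn i.i.d. at every step from a distribution with density q, independently of the history, which may or may not match the paper's construction):

\[
\Pr(s_{t+1}=s_2 \mid s_t=s_1,\, a_t=a_1)
\;=\; \int \Pr(s_{t+1}=s_2 \mid s_t=s_1,\, a_t=a_1,\, p_t)\, q(p_t)\, dp_t
\;=\; \int p_t\, q(p_t)\, dp_t
\;=\; \mathbb{E}[p_t].
\]

Under that reading the process is indistinguishable from a time-invariant MDP with transition parameter $\mathbb{E}[p_t]$; if the paper instead fixes p_t per episode or varies it on a deterministic schedule, the equivalence breaks, which is exactly what the reviewer is asking the authors to clarify.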

Reviewer 2

The paper presents an algorithm for finite-horizon off-policy policy evaluation based on a new way of estimating marginal state-distribution correction ratios, called Marginalized Importance Sampling. The paper derives Marginalized Importance Sampling, gives a theoretical analysis of the algorithm's sample complexity (showing it possesses an optimal dependence on the horizon), and presents strong results on simple MDPs, time-varying MDPs, and the Mountain Car domain. I recommend accepting the paper for publication. To my knowledge the method is original, the quality of the analysis seems good, the writing is clear enough, and the work seems significant and likely to be extended. EDIT: I have read the authors' feedback.
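[Editor's note] For concreteness, here is a rough tabular sketch (in Python) of the marginalized-IS idea as described in the summary above: estimate per-step marginal state distributions under the target policy from behavior data via importance-weighted empirical transition and reward models, then average rewards under those marginals. The function name, data layout, and exact form of the estimator are assumptions for illustration, not necessarily the authors' implementation.

import numpy as np

def mis_estimate(trajectories, pi, mu, n_states, horizon):
    """Rough tabular sketch of a marginalized importance sampling (MIS) OPE estimate.

    trajectories: list of per-episode lists [(s_0, a_0, r_0), ..., (s_{H-1}, a_{H-1}, r_{H-1})]
    pi, mu:       arrays of shape (horizon, n_states, n_actions) with the target and
                  behavior action probabilities (assumed known).
    Returns an estimate of the target policy's expected return over `horizon` steps.
    """
    n = len(trajectories)

    # Empirical initial state distribution (shared by the target and behavior policies).
    d = np.zeros(n_states)
    for traj in trajectories:
        d[traj[0][0]] += 1.0 / n

    value = 0.0
    for t in range(horizon):
        # Importance-weighted empirical transition and reward models at step t.
        trans = np.zeros((n_states, n_states))   # trans[s, s'] ~ P_t^pi(s' | s)
        rew = np.zeros(n_states)                 # rew[s]      ~ E_pi[r_t | s_t = s]
        counts = np.zeros(n_states)
        for traj in trajectories:
            s, a, r = traj[t]
            w = pi[t, s, a] / mu[t, s, a]         # per-step action ratio, no product over time
            counts[s] += 1.0
            rew[s] += w * r
            if t + 1 < horizon:
                s_next = traj[t + 1][0]
                trans[s, s_next] += w
        visited = counts > 0
        rew[visited] /= counts[visited]
        trans[visited] /= counts[visited, None]

        # Accumulate reward under the estimated marginal state distribution at step t,
        # then propagate the marginal one step forward under the estimated dynamics.
        value += d @ rew
        d = d @ trans
    return value

In contrast to trajectory-level IS, the only importance ratios that appear are the per-step action ratios pi/mu used to reweight the empirical models, which is what avoids products of ratios over the horizon.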

Reviewer 3

Originality: To my knowledge the method is new. The authors point out that they borrow the related idea of marginalizing the distribution over states from [Liu et al., 2018a], but the approach is different (Liu's paper mainly focuses on the continuous setting and uses function approximation to learn a weight function, while this paper mainly focuses on the tabular setting and approximates using empirical statistics).

Quality: The theory part is sound, with a clear explanation and detailed discussion. For the experimental part, since no code is provided and I couldn't check the details, I will ask the authors several questions about the comparison with the other baselines, especially the SSD-IS method. My main concern is why MIS performs better even in the time-invariant environment, where SSD-IS should have more data to use when estimating the density ratio. To be more concrete, here are two questions I would like to ask: 1. In Figures 2 and 3, why do the DM and SSD-IS methods work well on ModelWin but perform very badly on ModelFail? It is surprising to me that in a time-invariant environment the SSD-IS method performs worse than the MIS method. Any explanation? 2. In Figure 3 (b) and (d), why are the curves not smooth even after 128 repetitions?

Clarity: The paper is well written and easy to read.

Significance: I like the idea of the paper, and the OPE problem is becoming more and more important as a research area.

Overall: I think the paper is a good paper, but I will remain below borderline until the authors answer my experimental questions.