__ Summary and Contributions__: The rebuttal and discussion confirms my positive impression of this article.
---
The title accurately summarizes the article. It addresses a zero-sum Markov game setting in which a generative model is available, and establishes sample complexity bounds for model-based RL to find epsilon-Nash Equilibrium solutions. Additional derivations of corresponding lower bounds establish the near optimality of the achieved sample complexity.

__ Strengths__: The obtained results are derived from general assumptions, and thus widely applicable. The work is fundamental and theoretic in nature, contributing to the frontier of theoretical guarantees in multi-agent reinforcement learning, which is highly relevant to the NeurIPS community. The article conveys an astounding technical depth while maintaining a coherent organisation and surprisingly good readability.

__ Weaknesses__: The very fundamental nature is a limitation of this work in terms of the audience that it may find. The broader impact section could be used to reflect on the impact of complexity bounds on applications. As-is it rather merely re-iterates the contributions.

__ Correctness__: The arguments in the main text are delivered meticulously and appear consistent. To the degree that I sampled from the appendix it seems correct.

__ Clarity__: The presentation is outstanding. Clean use of technical concepts, and reference to their mathematical nature (such as the "saddle-point problem"), ease the reading of this technically dense material. The article makes a good compromise of presenting a roadmap of proofs in the main body, and deferring technical detail to the appendix.

__ Relation to Prior Work__: Demarcation to prior work is detailed and precise.

__ Reproducibility__: Yes

__ Additional Feedback__: The total variation distance could be mentioned in the appendix for completeness sake.
- "(3.3) is built upon Theorem 3.2" could better read "Corrolary 3.3 ..."

__ Summary and Contributions__: The paper provides an approach to learn a zero-sum Markov game from a sampling model and shows how much sampling is needed to achieve eps-optimal values and policies.

__ Strengths__: It builds on recent results in the single agent setting and contributes careful formal comparisons and analyses to the question.

__ Weaknesses__: The primary conceptual idea here is the smooth planning oracle. I think the paper would be improved if this idea were applied to/demonstrated in some simple games to build the reader's intuition.

__ Correctness__: The main proofs are in the supplementary material, so it's hard to tell from the paper if the results are correct. Nevertheless, the results are presented carefully and connected to closely related results and I don't see any obvious flaws.

__ Clarity__: Clarity is well above the bar.

__ Relation to Prior Work__: The extensive bibliography helps situate the work in terms of earlier efforts in this area and important recent developments.

__ Reproducibility__: Yes

__ Additional Feedback__: NOTE (post rebuttal): I didn't change my review, as a matter of expediency. But, I appreciate the authors' acknowledgement of my comments and support the plan for addressing them. (By the way, I also think "generative model" is a pretty standard setting in this community. I guess a brief clarification wouldn't hurt, but I wouldn't suggest using space to delve into the setting more deeply.)
"corner stones" -> "cornerstones"
"the sample complexity of model based MARL algorithms has rarely been investigated": The Rmax paper was one of the first papers to study RL sample complexity AND dealt with MARL. Maybe it hasn't been "recently" investigated? Or revisited in light of newer approaches?
"especially model-free ones, as the condition for guaranteeing convergence in the latter fails to hold in MARL": Convergence of model-free Q-learning variants in games has been known since the 90s. Maybe you are specifically talking about the setting where one agent can't see the action of the other? (No, that's definitely not the setting discussed. Although I'm still at a loss to understand the context of the comment.)
"See a detailed description as follows." -> "A detailed description follows."?
"Contribution": Are you following the NeurIPS style? I think subsections are supposed to be numbered.
"[51] also focused": I think it's bad style to use the references as nouns. (It's definitely hard for me to read...)
"has relatively limited literature" -> "has a relatively limited literature"?
"to find eps-optimal policy" -> "to find an eps-optimal policy"
"the state-action/Q-value function under (mu, nu) are defined" -> "the state-action/Q-value function under (mu, nu) is defined"
"for the single-agent RL" -> "for single-agent RL"?
"which are efficient in finding the NE of Ghat": They really only find eps-NEs, right? (I think this concept is described more clearly later, but it still might be best to be a bit more careful here.)
"note that different from the single-agent setting," -> "note that, different from the single-agent setting," or "note that, in contrast to the single-agent setting,"
"smooth PLANNING ORACLE": Why the bold every time? I think bold (or, better, italics) the first time makes sense as it calls attention to defining a new term. After that, normal text would be appropriate.
"an instance of the smooth Planning Oracle" -> "an instance of a smooth Planning Oracle"?
"from the single-agent setting, to zero-sum MGs" -> "from the single-agent setting to zero-sum MGs"
Some sloppiness in the bibliography:
- "Princeton university press" -> "Princeton University Press"
- "Pac bounds" -> "PAC bounds"
- "markov games" -> "Markov games"

__ Summary and Contributions__: This paper studies the sample complexity of model-based MARL algorithms in two-player discounted zero-sum Markov games.
Based on an auxiliary Markov game, the authors present the complexities in finding epsilon-approximate NE value and epsilon-approximate NE policy.
Moreover, the authors introduced Smooth Planning Oracle to to improve the sample complexity of obtaining the NE policies, which can be can be near-minimax optimal.

__ Strengths__: The paper gives the first near-minimax optimal sample complexity for model-based MARL in zero-sum Markov games with detailed proof.
More specifically, the authors established the first near-minimax optimal sample complexity of both achieving the Nash equilibrium value and policy.

__ Weaknesses__: The proof of main lemma 3.1 is mainly adopted from lower bound for MDPs proof in [4, 17], more clarification to highlight the difference would be good.

__ Correctness__: The cliams and proofs of the paper appear to be correct.

__ Clarity__: The paper is well-written and structured clearly. The contributions and proof roadmap are well summaried.

__ Relation to Prior Work__: The aurthors has disccussed the connections and gaps to model-free, turned based and single agent optimality analysis. Also mentioned the difference between existing model-based MARL complexity analysis.

__ Reproducibility__: Yes

__ Additional Feedback__: It is a bit confused when first see the generative model in abstract and introduction, the formal definition has been given in line 129. Therefore, more specific description about the generative model in abstract or introduction would be good to quickly capture the point, maybe the generative model of state transition?
Just curious, can policy guided sampling reduce the sampling complexity of model-based learning?

__ Summary and Contributions__: This paper provides theoretical results for zero-sum Markov games when a generative model is available. The authors show that a simple model-based algorithm, which builds an empirical MDP and applies a planning oracle, can obtain near minimax sample complexity.

__ Strengths__: I recommend an acceptance for this paper.
This paper provides strong theoretical results for the zero-sum Markov games. The authors show with a plug-in empirical MDP, (1) any planning oracle can find near optimal NE value, and (2) a soft planning oracle can find near optimal NE policy. Although some analysis techniques directly borrow from recent work on model-based RL [Agarwal et al., 2020], I believe the theoretical contributions are solid and nontrivial, as they need to carefully handle the minimax game setting.

__ Weaknesses__: It might be helpful to give some discussion on why introducing the soft planning oracle and how it can improve over planning oracles in general.

__ Correctness__: I briefly check the proofs in Appendix A and it looks correct to me.

__ Clarity__: The paper is well-written and easy-to-follow. The authors also give enough helpful discussions. I particularly enjoy the discussion of lower bound in Appendix B.

__ Relation to Prior Work__: The authors clearly discuss related works.

__ Reproducibility__: Yes

__ Additional Feedback__: