__ Summary and Contributions__: This paper presents the analysis of policy gradient descent algorithms for zero-sum stochastic games. A key feature of this algorithm is that the algorithm is decoupled (i.e. independent for each player). Main theorem (Theorem 1) claims that when learning rates of both players are small and satisfy some pre-specified relations, the algorithm will converges to Nash solution in polynomial episodes.
Despite a few aspects that can be further improved, the overall result and contribution look good. See strengths and weakness sections.

__ Strengths__: The paper provides the first analysis of policy gradient descent (both players do PG independently) in the setting of zero-sum stochastic games. The discovery of two-sided gradient dominance condition (Lemma 1) looks interesting, and the use of that to establish convergence is new.

__ Weaknesses__: I am not convinced by the main motivation of this paper for decoupled or independent learning. Specifically, from the communication perspective, once agents can also communicate the actions each other took per round, then each agent can also simulate any coupled algorithm locally (or only coupled online algorithm if has storage limitation). Since agents have to communicate with the oracle or environment in each round anyway, I don't see in practice why communicate the actions in the learning process is that problematic. Second, this paper says that the independent learning is important because it allows the algorithm "being versatile, being applicable even in uncertain environments where the type of interaction and number of other agents are not known to the agent. " I feel this description does not fit the algorithm studied in this paper, thus a bit misleading. The hyperparameters of two agents must be set in a coordinated dependent way. The learning strategy of each agent will only work if assuming the other agent is also doing PG as coordinated.
The technical result of this paper is also a little bit weak comparing to several recent results with coupled/dependent algorithms, e.g. [5] and "Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium". These two prior works do not need to assume minimax mismatch coefficient (Eq(5)), and directly address the exploration problem. Both papers also have relatively sharp dependency on S, A, B, epsilon comparing to the result in this paper which is polynomial in everything. Recent paper "Optimistic Policy Optimization with Bandit Feedback" already addressed both issues in the setting of single-agent MDP. This might be fine for the first paper for PG in stochastic games and independent setting. But it would also be helpful if the authors can cite these related papers, and comment out these technical points other than just saying they are coordinated thus not relevant in the current related work.

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Miss some related work, see comments in the weakness section.

__ Reproducibility__: Yes

__ Additional Feedback__: What is the main difference in proof machinery between this paper and [74]? is there a typo in line 212-213?

__ Summary and Contributions__: The authors of this paper study the problem of policy optimization in episodic two-player zero-sum stochastic games (or Markov games) with the underlying model is unknown. This problem is an interesting and important problem in the field of RL. The authors study this problem from the plain lense of optimization and propose a method to simultaneously optimize the policy of maximizing and minimizing players. The proposed approach suggests that each player follow plain gradient descent, but with different learning rates. One with much much smaller learning rate than another one. This way, the slower player follow instant optimum the other player.
The proposed approach, along with the analyses, are heavily built on the recent work of [38] and [2]. [38] proposes the two-time scale gradient-based approach for a class of two-player optimization, and [2] show the gradient dominance in MDPs. The author put together the ideas in these works to directly derive theirs. From this sense, the contribution of this paper is minor, but it does not mean it should not be accepted. Of course, the results sound straightforward given [38], and [2], but the authors put the effort, and nicely put these results together, and with their further development provided the final results.

__ Strengths__: The strength of the work is that it is basically proposed the first algorithm (at least to my knowledge) for policy optimization in zero-sum Markov games. The algorithm is simple despite suffering from some slowness (small learning rates).
The example in proposition2 is interesting (while the fact was known I guess, sorry I do not have any paper on that on the top of my head)

__ Weaknesses__: There are multiple limitations to this study.
1) is the novelty. Given prior work, the technical contribution of this work is a bit minor, but again this should not be an obstacle to accept this paper. The paper is strong, and I enjoyed reading it.
2) The algorithm is a bit slow, as a two-time scale analysis imposes this slowness. This might be seen as a problem with the two-time scale approach [38] anyway.
3) I wish the authors have provided the orders in their bounds, instead of referring to them in the poly term. I would encourage them to add this. I hope they add it.
4) The whole paper I was waiting for the function approximation part, but I read at the end of the main text that the authors left it for future work.
5) I might understand the authors that eight pages are not enough, but they could have played a better role in explaining and managing the flow in the paper.

__ Correctness__: I did not check the correctness of the claims since the main body of the theoretical results is borrowed from [2] and [38] along with a few others, and I assume the authors carefully deployed these results.

__ Clarity__: Fairly well written. It read well, not much objection. The authors could abstract out the flow and contribution better, but it is alright,

__ Relation to Prior Work__: yes, fairly.

__ Reproducibility__: Yes

__ Additional Feedback__: My evaluation of this paper is not lack anything. The contribution of this paper is important; the problem set up is important. And I would like to see this paper accepted. But since the main body of the contribution comes from the prior work, I was expecting the author to have the complete study of this work, including linear models and function approximation setting. I can totally imagine those extensions could be hard. For that reason, I vote for acceptance, but I might not be able to say the paper is among the top 10%.

__ Summary and Contributions__: This paper suggests and analyzes a method to solve zero-sum reinforcement learning games using independent policy gradients algorithms for each of the players. The authors prove that it is possible to reach Nash equilibrium by using learning rates on two different time scales for both players. This follows the reasoning of previous works which show Nash equilibrium guarantees in the case where one player performs policy iteration, where the other computes the best response at each iteration.
REBUTTAL:
I have read your comments and still believe this work is marginal. My main issue is that the proposed approach is a bit impractical, and the method is not truly independent as suggested in the paper. Also, the analysis is not very tight. However, it is a solid work, on an important problem for which the results to this date are scarce. As such, this is a somewhat novel contribution that can help further researcher in the field. My score remains the same.

__ Strengths__: This paper extends results in zero-sum games to the non-convex non-concave case of multi agent reinforcement learning competitive games. Understanding convergence of MARL algorithm in competitive games can help in wide variety of tasks, especially in the independent decentralized scenario presented in this paper. The result here is somewhat novel as previous works on zero-sum RL games focused on different scenarios such as of linear-quadratic games or as described above - where one player performs policy iteration and the other computes the best response. Policy gradient technique are very common in RL, and extending them further to more complex scenario such as zero-sum games can be beneficial, especially from a practical point of view.

__ Weaknesses__: The main issue that mitigates my enthusiasm is the asymmetry in the players' algorithms which also leads to an asymmetrical Nash equality convergence. The theoretical result is indeed interesting, yet it does not seem to serve a practical algorithm, as the two players need to agree upon learning rates, in contrast to the independence requirement. Moreover, such agreement is not very natural, especially when one player benefits more from this choice.
Another caveat of this approach is the dependence on concentrability coefficients - indeed this is a common assumption in the policy gradient literature. Yet, recently, several works proposed to omit this assumption and apply online convex optimization with optimism to solve the policy optimization question.
Finally, the number of episodes required for reaching the approximate Nash equilibrium is epsilon^{-12.5}, which is unreasonably slow, which makes me question the practicality of this approach.

__ Correctness__: The results seem reasonable, but I haven't went through all proofs details.

__ Clarity__: The paper is well written. However, I believe the contributions of the paper are not highlighted enough in the exposition. Moreover, the I believe the discussion on last-iterate convergence should be further explained in light of the main theorem of the paper.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: First, I liked reading this work. This is indeed an important and interesting topic!
I have several questions regarding this work:
1) One issue is that the dependence in epsilon and in the action-space A and B do not seem very natural to me. Do you have any comments regarding this dependence or lower bounds that can justify this dependence? Is it real, or an artifact of the analysis?
2) Can you further explain why a two time-scale approach is necessary? As you also discuss in your paper, it requires the opponents to cooperate on this matter - any ideas how to omit this requirement?
3) Regarding the concentrabiilty coefficient assumption. Recently, several works shown that this assumption can be reduced by using UCB based optimistic evaluation in policy gradient methods. I wonder whether this be "added" to the analysis for better results.