NeurIPS 2020

An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods


Review 1

Summary and Contributions: This paper proposes a variance-reduced natural policy gradient method, SRVR-NPG. Moreover, it studies the global convergence properties of the existing PG, NPG, and SRVR-PG methods as well as the proposed SRVR-NPG method.

Strengths: Although the paper studies the global convergence properties of the existing PG, NPG, and SRVR-PG methods and the proposed SRVR-NPG method, the theoretical analysis mainly comes from the existing work (Agarwal et al., 2019). At the same time, the proposed SRVR-NPG method is also based on the existing SRVR-PG method (Xu et al., 2020) and the existing NPG method (Agarwal et al., 2019). In addition, the paper does not provide any experimental results to demonstrate the efficiency of the proposed methods.

Weaknesses: This paper does not provide any experimental results to demonstrate the efficiency of the proposed methods.

Correctness: The methods in this paper are basically correct.

Clarity: This paper is basically well written.

Relation to Prior Work: In fact, the theoretical analysis in the paper mainly comes from the existing work (Agarwal et al., 2019), so the paper should point out the differences in the proofs.

Reproducibility: No

Additional Feedback: Some comments are given as follows:
C1: The proposed SRVR-NPG method uses the Fisher information. Why does it only reach the same complexity as the existing SRVR-PG, which does not use the Fisher information?
C2: Assumption 4.3 is a strict assumption. If you must use this assumption, you should point out its disadvantages.
C3: To demonstrate the effectiveness of the methods, you should provide some experimental results.
-------------------------------------------------------------------------------------------------------------
Thanks for your responses. I have decided to increase my score. I also hope you will add some more related references:
[1] A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning, AISTATS 2020, https://arxiv.org/pdf/2003.00430.pdf
[2] Momentum-Based Policy Gradient Methods, ICML 2020, https://arxiv.org/pdf/2007.06680.pdf


Review 2

Summary and Contributions: The paper introduces a framework for analyzing the global convergence of PG methods and their variance-reduced variants (the state-of-the-art SRVR-PG and the newly introduced SRVR-NPG) under general assumptions. The global convergence rate of NPG is improved from O(eps^{-4}) to O(eps^{-3}), up to inherent errors due to the policy parametrization.

Strengths: The paper seems theoretically solid and pushes the state-of-the-art understanding forward. A framework for global convergence, as described above, is new and was a missing piece.

Weaknesses: The paper may wish to discuss other policy gradient estimators and whether the analysis applies to them, or leave this as an open problem.

Correctness: The results seem correct.

Clarity: The paper is well-written and necessarily contains a lot of details.

Relation to Prior Work: Related work seems clearly discussed.

Reproducibility: Yes

Additional Feedback: Line 99: Q^pi


Review 3

Summary and Contributions: The paper analyzes the global convergence of stochastic policy gradient and natural policy gradient methods, as well as their variance-reduced versions. The variance-reduced version of natural policy gradient is novel, as are the global convergence rates for the variance-reduced algorithms. Moreover, the global convergence rate for natural policy gradient improves upon existing state-of-the-art results. The global convergence rate analysis leverages the connection between the natural gradient direction and compatible function approximation as in Agarwal et al (2019), resulting in a bound composed of an approximation error term, a stationary convergence term, and a term relating the direction chosen by the algorithm to the true natural gradient. This general framework allows previous stationary and global convergence analyses to be combined.
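For concreteness, the decomposition described above has roughly the following shape (a schematic with placeholder notation only, not the paper's exact Proposition 4.5):

J(\pi^*) - \max_{k \le K} J(\theta_k) \;\lesssim\; \sqrt{\epsilon_{\mathrm{approx}}} \;+\; \frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\,\|\nabla_\theta J(\theta_k)\| \;+\; \frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\,\|w_k - w^k_*\|,

where the three terms correspond to the approximation error of the policy class, the convergence to a stationary point, and the gap between the update direction used by the algorithm and the exact natural gradient (equivalently, the minimizer w^k_* of the compatible function approximation error), respectively.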

Strengths: The analysis of global convergence in the reinforcement learning setting is very interesting given the non-convex and stochastic nature of the problem. The authors derive a general global convergence bound (Proposition 4.5) that separates the analysis into intuitive components that can be analyzed separately for a given algorithm. A theoretical understanding of the global convergence properties of policy gradient algorithms is important, as it can help to explain the observed empirical performance and provide insight into potential ways to improve existing methods.

Weaknesses: The paper’s main contributions are theoretical, but all of the theoretical analysis is placed in the supplementary material. It would be beneficial to include additional detail in the main paper, particularly on the general framework of Proposition 4.5 and the improved convergence rate of Theorem 4.8.

Some of the parameter choices used in the global convergence proofs require values that depend on the unknown optimal policy, such as the KL divergence between the initial and optimal policies and the difference between the initial and optimal policy values. This results in implicit bounds, rather than upper bounding these quantities to provide explicit bounds. In addition, the dependence on the KL divergence between the initial and optimal policies can lead to a vacuous bound if these policies do not share the same support (although for commonly chosen policy parametrizations this is not an issue; see the illustration below).

While global convergence is a nice claim, it critically depends on the policy parametrization being \epsilon_{approx}-close to the optimal one (Assumption 4.4). If that quantity is large, we essentially have no global guarantees. From this perspective, the results are not too different in practice from earlier first-order guarantees (which make no assumptions on the expressiveness of the parametrization).
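To make the support issue concrete (a generic illustration, not the paper's notation): if there is a state s and an action a with \pi^*(a|s) > 0 but \pi_{\theta_0}(a|s) = 0, then

\mathrm{KL}\big(\pi^*(\cdot|s) \,\|\, \pi_{\theta_0}(\cdot|s)\big) = \sum_a \pi^*(a|s) \log \frac{\pi^*(a|s)}{\pi_{\theta_0}(a|s)} = +\infty,

and any bound that scales with this quantity is vacuous. Softmax and Gaussian parametrizations assign positive probability (density) to every action, which is why the issue does not arise for them.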

Correctness: There do not appear to be any obvious errors. Proofs in the supplementary material were reviewed, but not thoroughly checked for correctness.

Clarity: The paper is well-organized, and the authors clearly state the motivation for their work as well as their main contributions and results. The problem setup is explained thoroughly, and some intuition behind the structure of the theoretical contributions is given. The introduction and conclusion mention that the theoretical analysis integrates the advantages of previous work on policy gradient and natural policy gradient convergence in a way that allows them to improve upon each other. This is a very interesting claim, and it appears that the authors are referencing the form of Proposition 4.5. However, the connection between this claim and the theoretical results is not clearly explained in the paper.

Relation to Prior Work: The authors do a very nice job of summarizing prior work on convergence rates for policy gradient methods and variance-reduction techniques. The paper’s main contributions relative to prior work are clearly discussed. The authors note that parts of their analysis closely follow the work of Agarwal et al (2019). It appears that the form of Proposition 4.5 is the novel contribution relative to this prior work, which allows for the global convergence analysis of variance-reduced methods and the improved convergence rate of natural policy gradient. Additional commentary on their contributions relative to Agarwal et al (2019) would add clarity.

Reproducibility: Yes

Additional Feedback: The paper provides a very interesting theoretical contribution on the global convergence rates of policy gradient methods in reinforcement learning. Prior work is well summarized and results are clearly stated. As mentioned previously, it would be nice to see some additional theoretical analysis in the main paper rather than deferring all proofs to the supplementary material. The extensive Preliminaries section could be shortened to provide adequate space to include this analysis.

*********** Comments after the rebuttal **********
I have read the rebuttal and it reinforces my positive review.


Review 4

Summary and Contributions: The paper theoretically establishes the sample complexity of PG, SRVR-PG, and NPG for global convergence, under an assumption of a positive definite Fisher information matrix. A new algorithm, SRVR-NPG, is proposed.
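For reference, such an assumption is typically stated as a uniform lower bound on the Fisher information matrix (a generic form with a placeholder constant \mu_F > 0; the paper's exact statement may differ):

F(\theta) := \mathbb{E}_{(s,a) \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^\top\big] \succeq \mu_F I_d \quad \text{for all } \theta,

i.e., the smallest eigenvalue of F(\theta) is bounded below by \mu_F.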

Strengths: The paper generalizes the work on the global convergence of NPG to PG and SRVR-PG, and improves the result on NPG. Compared to the previous results on convergence to a first-order stationary point, global convergence is stronger and is desired by practitioners. This work can be useful for the theoretical reinforcement learning community.

Weaknesses: From the theoretical point of view, is the proved sample complexity tight? Is there a way to measure the error due to the policy parameterization? The paper proposes a new algorithm, SRVR-NPG, but there is no experiment to demonstrate its empirical performance. Can the proved sample complexity be validated by simulation, even in simplified cases such as the Gaussian policy? In practice, policy gradient methods often exhibit high variance in results and sensitivity to random initializations. Is this fact empirically inconsistent with the global convergence claim made in the paper?

--- Post-rebuttal comment: I will maintain my previous review.

Correctness: As far as the reviewer can tell, it is correct. However, the reviewer did not go over all technical details of the proofs in the appendix.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:


Review 5

Summary and Contributions: This is a theoretical paper focusing on an improved analysis of the sample complexity of policy gradient algorithms, including both natural gradient and variance-reduction methods.

POST-REBUTTAL: I appreciate the authors' detailed response to the technical points raised in my review. Further (theoretical and/or empirical) analysis of the tightness of the derived bounds would strengthen the paper further, but I'm happy to raise my score to a 7 based on the rebuttal.

Strengths: The paper is clearly written, and these results ought to be of interest to the theoretical RL community at NeurIPS. I also found that the proofs in the supplementary material were generally well written and easy to follow. As far as I am aware, the analysis is novel and leads to new, strengthened results relative to prior work, although there is some overlap in the techniques employed.

Weaknesses: At present there are a few technical aspects of the paper that I am unsure about, and so I would appreciate clarification from the authors on these; they are described in detail below. My current recommendation for the paper reflects this uncertainty, and I will update it after seeing the authors' response.

Beyond this, I felt that although the paper does a good job of citing earlier related works and associated results, it would be improved with more discussion of the assumptions made in this paper, how these compare to earlier works, and also with more discussion of how the results stated in these earlier works can be translated into the form cited in e.g. Table 1.

Finally, it would have been interesting to see some empirical work to complement the theoretical results here. I don't think this should be necessary for acceptance, and I think perhaps the most interesting experiments here would not be large-scale applications of these algorithms, but simple experiments on small environments to understand how tight the sample complexity bounds are, and in particular whether the SRVR-NPG analysis is likely to be sub-optimal (since it matches the complexity of NPG in Table 1).

Correctness: I have reviewed the proofs of Proposition 4.5 and Theorem 4.6 (as well as the results in the appendix they depend on) line-by-line, and believe these to be broadly correct, although there are some suspected typos, listed below in more detail, including in the theorem statement. I also had a query about the choice of H in the proof, which I think may require the statement to be modified slightly. I was not able to review more proofs at this level of detail due to time constraints and the length of the paper; however, I have made high-level checks of the remaining results in the main paper. The assumptions made in the theoretical results generally seem reasonable (relative to necessary assumptions in other work), although I have a few specific comments on these below that I would appreciate clarification on from the authors. I will revise my recommendation after seeing the authors' rebuttal in response to these points.

Clarity: Overall I found the paper to be written clearly, though there were several places where I thought more discussion was warranted. One example is the sample complexities from prior works in Table 1: these take a bit of work to derive from the cited papers, so an appendix section showing how they are derived would be useful. I would also like to have seen more comparison of the assumptions made in this work with those in cited works; for example, Assumption 2.1, and in particular a comparison of this assumption to e.g. Assumption 6.2 in [1].

Relation to Prior Work: The authors do a good job of putting their work in context against earlier work on policy gradient methods, and variance reduction methods for optimisation more generally.

Reproducibility: Yes

Additional Feedback: Detailed comments on a section-by-section basis are given below.

Lines 10-13 of abstract: I found this quite vaguely worded, and didn't understand which contributions in the main paper it is referring to.

Section 1
- "Policy gradients are usually estimated via Monte-Carlo rollouts". To me, this would suggest that critics are not used, or are used at most as baselines. However, as far as I am aware, most of the successful methods cited in the previous paragraph do use bootstrapping, and so are not pure Monte Carlo in the sense that the term is used in RL.
- Table 1 caption: "the expected number of trajectories to reach eps-optimality". Should this be "the number of trajectories to reach an eps-optimal policy in expectation"?

Section 2
- Eqn (2.10): clarify that the squaring happens inside the expectation - it is a little ambiguous at the moment, and actually reads more like the expectation itself is squared.

Section 3
- Similarly to Eqn (2.10), can the authors clarify that the squaring that appears in the objective of Eqn (3.3) should be inside the expectation?

Section 4
- Assumption 4.1: This could perhaps be stated a little more precisely: "variance" is a little ambiguous for a vector, and what is presumably meant is the expectation of the squared L^2 norm of the random variable minus its mean (see the sketch after these comments).
- Assumption 4.3: This seems like a very strong condition, and it seems as though it will not be satisfied by the family of Gaussian policies, nor by softmax parametrisations of the simplex.
- Proposition 4.5 proof: Why is the final line of (I.1) preceded by an inequality? Doesn't w^k_* attain the minimum of the compatible function approximation error, meaning that we can have an equality here?
- Theorem 4.6: The quantity L_J is not defined in the main paper as far as I can tell, but in Appendix A. Please consider moving the definition to the main paper, or at least adding a pointer to the section of the appendix where it is defined.
- Theorem 4.6: If we require K = O((1-gamma)^{-4} eps^{-2}) and N = O(sigma^2 (1-gamma)^{-2} eps^{-2}), why isn't the overall number of trajectories equal to O(sigma^2 (1-gamma)^{-6} eps^{-4})? Judging from the proof in the appendix, I expect there are some typos in the theorem statement in the main paper as to these trajectory complexities - please can the authors clarify?
- Remark 4.9: As far as I can tell, the result in [1] also uses constant stepsizes, which doesn't seem to align with the comment here; can the authors comment on this?
- Lemma A.1: In the penultimate calculation line, the expression GR \sum_{h=H}^\infty h \gamma^h shouldn't have \| \| around it. I also think there is a missing factor of \gamma in the first term of the bound GR(H/(1-\gamma) + \gamma/(1-\gamma)^2)\gamma^H, which should multiply the "H". This doesn't affect the validity of the upper bound derived in this lemma, and so should not affect the validity of the analysis that depends on Lemma A.1, but it is not as tight as it could be.

Theorem 4.6 proof
- I think there is an additional factor of 2 in the final line of (J.4) which is not required (since 2 premultiplies (1+1/\mu_F)^2 but also appears in the next bracket). I think this only makes the bound looser and doesn't make it invalid, but clarification on this would be appreciated.
- Line 685: L_j -> L_J.
- In (J.5), it is not clear to me that such a choice of H is possible, since the right-hand side is bounded above as a function of H. In particular, when G is small relative to epsilon, there may exist no valid choice of H. This may be a misunderstanding on my part, so could the authors clarify the choice of H here, and whether some condition on epsilon being sufficiently small relative to other parameters such as G is required?

References
- [1] has recently been updated on arXiv (with a new title and rearranged content). Can the authors ensure that future versions of their paper refer to the new version of [1], or perhaps include the arXiv version number in the reference if they want to continue referring to the currently cited version.
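On the Assumption 4.1 point above, the standard way to state a bounded-variance assumption for a vector-valued gradient estimator is the following (a generic sketch; \hat{g} and \sigma^2 are placeholder symbols, not necessarily the paper's notation):

\mathbb{E}\big[\|\hat{g}(\tau \mid \theta) - \mathbb{E}[\hat{g}(\tau \mid \theta)]\|_2^2\big] \le \sigma^2 \quad \text{for all } \theta,

i.e., the expected squared L^2 distance of the estimator from its mean is uniformly bounded.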