NeurIPS 2020

An operator view of policy gradient methods


Meta Review

Three referees advocate accepting the paper for its novelty and its theoretical contribution to understanding policy gradient methods. One referee (R3) has concerns about ambiguity in the introduction and overstatement of the results. I agree that the writing and the ambiguity of some statements need to be improved. The rebuttal addressed a few of these concerns but could not fully convince R3. However, since the theoretical contributions of this paper are significant and the paper establishes interesting connections between well-known algorithms, I still advocate acceptance. The authors must nevertheless take R3's comments into account to improve clarity:

- Clearly specify which algorithms you are referring to (e.g., for value-based algorithms, most of your statements hold for Q-learning but not for Sarsa).

- In Section 4.1 you show that using an exponential transform results in a fixed point of the operator and claim that this directly relates to MPO and PPO. However, the operators for MPO and PPO are slightly different, and it is unclear whether the result also holds for them. Please discuss this in more detail.

- Mention that the entropy regularization of PPO is a variant of the algorithm that is not used in all papers; for example, the original paper uses no entropy regularization for the MuJoCo experiments. Please also discuss what the operator looks like without entropy regularization.