NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center

### Reviewer 1

#### Originality

The paper extends the previous analysis of the population version of the policy gradient algorithm for LQR to the actor-critic algorithm with a GTD critic. In addition, the previous analysis only considered deterministic dynamics and deterministic policies, while the current one studies an ergodic system with Gaussian noise and Gaussian policies.

#### Quality

The paper seems theoretically sound. The proofs in the appendix are very detailed, although I have not checked them carefully. The authors put their results in context with previous analyses, such as the nonasymptotic analysis of GTD due to [17]. It is not clear how Assumption B.1 can be ensured in a model-free setting, with no knowledge of A, B, Q or R. Numerical experiments illustrating the analysis, even for simple MDPs, would have strengthened the paper.

#### Clarity

The paper is generally clear and well organised. Some notation is not properly introduced on a few occasions.

#### Significance

The paper considers LQR with Gaussian policy and dynamics, a model that has proved successful in several robotics applications. The contributed analysis can potentially shed some light on whether a model-free actor-critic algorithm is the right approach as opposed to learning the model dynamics. The techniques used for the analysis might be useful for other algorithms with similar linear updates. On the other hand, the authors claim that their analysis may serve as a preliminary step towards a complete theoretical understanding of bilevel optimization with nonconvex subproblems. However, the proposed techniques seem quite specific to the linear update setting, and I haven't found clear evidence in the text that supports this claim.

### Reviewer 2

This is a purely theoretical paper. Although I'm not very familiar with this type of theory, the results seem sound from my review. The paper is quite well written, with just a few minor suggestions below.

My main criticism of this paper is that it is quite hard to read and follow if one is not very familiar with bilevel optimization for LQR (like myself). After reading the paper it is not at all clear to me why any of these results would carry over to anything outside of LQR; and with that in mind, it seems like there's not much we can learn from these results outside of the LQR setting. I think there needs to be more high-level discussion in the paper to give readers not familiar with this type of theory a better idea of what is being proved, why it matters, and what we can learn from it. In particular, the paragraph starting at line 298 should be in its own "Discussion" section and expanded.

Some questions:

- In line 132, you say "Specifically, it is shown that policy iteration...", do you mean that it's shown in this paper or that it _has_ been shown previously?
- In equation 3.1, shouldn't the subscript of I be k?
- In line 152: what is \rho? Is it the stationary distribution? Relative to which policy?

Minor suggestions:

- line 114: "Viewing LQR from the lens of *an* MDP, the state..."
- line 115: remove the "Besides, "
- In equation 2.6, better to use a letter other than Q to avoid confusion with the Q-value function.
- line 120: "... it is known that the optimal action*s* are linear in..."
- line 150: "the state dynamics *are* given by a linear..."
- line 192: "denote svec(...) by $\theta...$" (remove the first "by")
- line 238: "be updated *at* a faster pace."
- line 240: "... ensures that the critic updates *at* a faster timescale."

### Reviewer 3

This is a nice paper that establishes convergence of the actor-critic algorithm in the special case of LQR. The paper is entirely devoted to theoretical analysis, and continues the line of analysis of RL for LQR. While certain pieces of the analysis build on a recent paper on policy gradient methods for LQR (Fazel et al., ICML 2018), the analysis of actor-critic style algorithms seems more challenging due to the interaction between actor and critic during learning. The analysis techniques depart from the traditional ODE-based approach for actor-critic algorithms. The paper is well written for the most part. Overall, I believe this would make a nice addition to NeurIPS.

I do think that the bilevel optimization perspective, while interesting, does not really contribute much to the exposition of either the algorithm or the analysis. It appears that the paper did not leverage much from the bilevel optimization literature. Going the other direction, i.e., using what is presented here as a contribution to bilevel optimization, seems unclear to me.

Equation 3.18: it is stated (line 215) that the objective in equation 3.18 can be estimated unbiasedly using two consecutive state-action pairs. I’m not sure I understand why that is the case. Would you come up against the double sampling problem for policy evaluation here?

Some minor comments:

- Equation (3.16): the symbols are used before being introduced in line 210.
- It looks like there is a missing constant on line 294 ("for some constant C" is what you meant).
- The analysis in the paper is for the on-policy version of actor-critic, so Algorithm 3 does not really come with the same guarantees. Perhaps the main body of the paper should make this clear.
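
To make the double sampling question above concrete, here is a minimal sketch of the generic issue. It does not reproduce the paper's equation 3.18. It assumes a hypothetical discrete distribution for a TD error `delta` at one fixed state-action pair (the names `outcomes` and `probs` are illustrative only), and uses exact arithmetic to compare the squared expected error with what one-sample and two-sample estimators recover in expectation.

```python
from fractions import Fraction

# Hypothetical TD-error distribution at a single fixed state-action pair.
# These values are assumptions for illustration, not taken from the paper.
outcomes = [Fraction(-1), Fraction(0), Fraction(2)]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]


def expectation(f):
    """Exact expectation of f(delta) under the distribution above."""
    return sum(p * f(x) for x, p in zip(outcomes, probs))


# Quantity of interest: the squared *expected* TD error, (E[delta])^2.
mean = expectation(lambda x: x)
target = mean ** 2

# One-sample estimator delta^2: its expectation is E[delta^2], not (E[delta])^2.
single_sample = expectation(lambda x: x * x)

# Two-sample estimator delta_1 * delta_2 with independent draws:
# its expectation factorises into E[delta_1] * E[delta_2] = (E[delta])^2.
double_sample = sum(
    p1 * p2 * x1 * x2
    for x1, p1 in zip(outcomes, probs)
    for x2, p2 in zip(outcomes, probs)
)

# The bias of the one-sample estimator equals the variance of delta.
variance = expectation(lambda x: (x - mean) ** 2)
bias_single = single_sample - target

print("target (E[delta])^2   :", target)
print("E[delta^2]            :", single_sample)
print("E[delta_1 * delta_2]  :", double_sample)
print("one-sample bias = Var :", bias_single, "=", variance)
```

In this sketch the one-sample estimator overshoots the target by exactly the variance of `delta`, while the product of two independent draws is unbiased. Whether two consecutive state-action pairs from a single trajectory avoid this bias for the objective in equation 3.18 is the point the question asks the authors to clarify.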