Reviews: Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator

*** Thanks. I've read the responses, which answered my questions. *** This paper is clearly written and the results are insightful. The theoretical results rely on non-trivial combinations of several recent techniques. I enjoy reading the paper and appreciate the contribution. However, I must admit that I am not particularly familiar with the field on RL+LQR, so I cannot comment on the details of the derivations. Below are some questions I had after reading the paper. 1. Projection step Algorithm 1 and 2 both use a Proj_\mu to project the Q matrix onto pd matrices. I am wondering if this is truly necessary or could potentially hurt performance. It seems from (2.10), we do not need Q to be pd. Though the optimal Q is pd, how do we estimate \mu in practice? 2. Adaptive algorithm First, LSPlv2 is not clearly defined in the main text, though it can be understood from the appendix. My concern is that, it is unclear whether K^{i+1} will remain to be stable in this online setup. If not, wouldn't it violate the stable behavior policy assumption? The proof in appendix appear to rely on Assumption 1, which seems pretty strong, because wouldn't that imply the optimal policy is almost solved in one online iteration? 3. Practicality Though this paper is theoretically focused, but from a practical perspective, I am wondering if the model-free setting is a reasonable approach to LQR. This is not a criticism, but as the paper states, it has been shown model-based is superb both theoretically and empirically. I am guessing, when would model-free approaches start to shine. For example, would model-free approaches start to make more sense, when the state is high dimensional and without low-rank structure (or even infinite dimensional)? while the action space is low-dimensional. The experiments considered are fairly simple. I think having some motivating problems would help readers to better appreciate technical contributions here. Minor: In line 73, K_eval is not defined, though it can be understood as the feedback gain from the later context.

SUMMARY: The paper considers the Linear Quadratic Regulator (LQR) design problem when the model is not known, and we only have access to data. It is a simple form of an RL problem with a particular dynamics and reward (or cost) function. The paper considers the sample complexity of using a Least-Squares Policy Improvement (LSPI) algorithm to solve the LQR problem. It provides a finite-sample guarantee on the quality of the learned value function for the policy evaluation part of LSPI (i.e., LSTD-Q) in Theorem 2.1. It then considers the LSPI procedure and provides a guarantee on the quality of the learned policy (which is the gain of a linear controller in this case) in Theorem 2.2. These results are for the case when the data is generated by a fixed policy (specified by K_play). The paper considers the adaptive version of LQR and proves a regret bound (Theorem 2.3). Finally, some empirical results are reported in which a model-based approach, which estimates the model and uses the learned model to design the controller, generally outperform the model-free methods, including the LSPI approach. EVALUATION: [BRIEF] Originality: The algorithm and type of analysis are not particularly novel, but the theoretical result is. Quality: The claims seems to be rigorously proven. I have some questions, including whether Theorem 2.2 upper bounds the right quantity ( || K_N - K_* || instead of the error in the value functions). Clarity: The paper is written clearly. Significance: The theoretical result is important for better understanding a class of RL methods for solving a particular class of RL algorithms. It might provide insights for more general class of RL problems too. This is a well-written paper that closely studies LSPI in the context of LQR problem. The LQR problem is interesting on its own, and hopefully the results from analyzing this simpler problem can be insightful for more general RL problems. I skimmed through the proofs, but I did not verify them closely. As far as I can say, especially from sampling a few points of the argument here and there, the authors seem to be meticulous in proving their results. I have some questions and comments. I use [*], [**], etc. in order to show the importance of my comments, in case the authors have to prioritize their answers. - [**] Is it correct that the control signal u_t is noisy for both K_play and K_eval? If so, does it mean that q in Theorem 2.1 (specifically in Eq. 2.9) reflects the value of the stochastic policy and not the deterministic policy that is defined by K_eval? If it is the case, why do we need noise in K_eval? I understand that K_play may need noise for exploration (even though it is not obvious that it is indeed required, as the dynamics is already noisy and might provide the necessary persistent excitation). It is not clear, however, why it is needed for K_eval? My guess is that the value of the stochastic policy is in fact worse than the value of the deterministic policy. - [**] The vector q and \hat{q} in Theorem 2.1 describe the parameters of the linear function approximator with the feature map phi. For the LQR problem, the value function is quadratic in state and actions (as specified at L79). My question is whether phi is chosen to be of quadratic form for the result of Theorem 2.1 to hold? Or is it true for any phi? A related question: If phi is not in quadratic form (so Q(x,u) = phi(x,u) q may not be able to represent the true value function), what does q in Theorem 2.1 refer to? Is it the project of the true value function onto the span of phi? - [**] If we let sigma_w go to zero, the upper of Eq. 2.9 approaches infinity. Is this what we expect? Does this mean that learning to control a (close to) deterministic system is very difficult? - [***] The result of Theorem 2.2 is about the closeness of K_N to K_* (with K being the gain of the controller). How does this translate to the difference between the value functions? (I understand that the value function is a quadratic function of state, so it is unbounded, but we can still talk about the difference in the value functions, for example, within a unit ball; or upper bound the error in P, as is done in Theorem 4.3 of [36] Tu and Recht, 2018). - [*] A related issue is that comparing the result of Theorem 2.2 with a result such as the upper bound of L160 is not completely fair, not only because one is about discounted setting and the other is for undiscounted one (as correctly claimed), but also because one provides a guarantee on the value function and the other (this paper) only on the policy. - [**] Another point of difference is the aforementioned result is an L_inft bound. There are results in the L_p too. Some examples: - Munos, “Error Bounds for Approximate Policy Iteration,” ICML, 2003. - Farahmand, Munos, Szepesvari, “Error Propagation for Approximate Policy and Value Iteration,” NeurIPS, 2010. - Scherrer, Ghavamzadeh, Gabillon, Geist, “Approximate Modified Policy Iteration,” ICML, 2012. As well as the followings cited ones: [6] Antos et. al. [17] Farahmand et al. [23] Lazaric et. al - [*] As a side note, the theoretical analysis of Farahmand et al. [17] is not specialized to RKHS; it is for general nonparametric function spaces. The closed-form formulae, however, are. - [*] What is \tilde{x} in L377? Is it the next state? - [*] L415: Here we have E[||phi_t||^2] = E[||z_t||^4]. Is this because the choice of feature map is a quadratic function? - [*] L762: Where is Avg(rho,1) defined? Is it simply the average of 1 and rho? - [*] The term “state-value”, instead of “action-value”, is used to refer to Q at several places, including on L.532, 544, 545. - [**] By now we have several papers studying the finite-sample properties of solving an LQR, either in the model-based or model-free setting. What are the insights that we have gained from solving this simpler problem that might be useful for more general RL problems? A discussion on this would be very helpful for the paper. - [**] (This might be a bit subjective) It is claimed in LL25-28 that “On the other hand [comparison is in contrast with the work on LQR], for classic model-free RL algorithms such as Q-learning, SARSA, approximate policy iteration (PI), our understanding is much less complete despite the fact that these algorithms are well understood in the tabular setting”. Even though there is much remained to be understood about these algorithms, especially for the large state and action spaces, I do not believe that they are significantly less understood compared to the LQR setting (especially in terms of the finite-sample type of analysis, and not how to solve them as solving LQR is relatively easy). This has paper already cited a few related work such as [6] Antos et. al., [17] Farahmand et al., [23] Lazaric et. al., and there are several others too. It is true that those results do not generally provide sharp constants. But then, this work does not either. For example, Theorem 2.2 of this work has a problem-dependent constant that is raised to the power of 42! So the bound is technically very loose. I understand that large constants might appear in the bounds and if the subject of study is the behaviour of other terms (say, the dependence on eps or dimensions), it is often acceptable to have loose dependence on other terms.

Paper ID:	4597
Title:	Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator

Reviewer 1

Reviewer 2

Reviewer 3