__ Summary and Contributions__: The authors provide a theoretical analysis of Double Q-learning, specifically of its asymptotic mean squared error (MSE) under linear function approximation.
Their analysis suggests a way to select the step size and the final output so as to obtain faster initial convergence while maintaining the same asymptotic result, which they then verify empirically in several experiments.
The asymptotic analysis relies on a stochastic approximation result, which they apply in a regime where the policy used to select the action for the next-step Q evaluation is already optimal (see Eqs. 4-5). They can then rewrite the Q-learning and Double Q-learning updates in a common form (Eqs. 10 and 12), which allows the result to apply and the asymptotic MSE of the two approaches to be compared.
The main concern here is that by assuming the max action is fixed to the optimal policy, we are already in a regime where overestimation bias isn't present in Q-learning for the usual reason (taking the expectation after the max), and so Double Q-learning has little to offer and all to lose (since it splits its updates and parameters). The asymptotic analysis doesn't seem like the right way to understand the potential benefits of Double Q-learning in this respect. The authors do point out that by averaging the final parameters of Double Q-learning and doubling the step size, Double Q-learning's AMSE matches Q-learning's, which opens the possibility of Double Q-learning helping in the early stages while provably not hurting in the final stages of learning. The potential benefits of Double Q-learning early on are not elucidated theoretically in this paper, but the authors do provide some empirical evidence that this story holds.
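For concreteness, the doubled-step-size-plus-averaging recipe is easy to state in the tabular case. The sketch below is my own illustration of what the paper suggests, not the authors' code; the update rule is the standard tabular Double Q-learning update, and the table sizes and names are made up:

```python
import numpy as np

def double_q_update(qa, qb, s, a, r, s2, alpha, gamma, rng):
    # Standard tabular Double Q-learning update: one of the two tables
    # is chosen uniformly at random; the chosen table selects the argmax
    # action at s2, and the OTHER table evaluates it. Per the paper's
    # suggestion, alpha here would be twice Q-learning's step size.
    if rng.random() < 0.5:
        a_star = np.argmax(qa[s2])
        qa[s, a] += alpha * (r + gamma * qb[s2, a_star] - qa[s, a])
    else:
        b_star = np.argmax(qb[s2])
        qb[s, a] += alpha * (r + gamma * qa[s2, b_star] - qb[s, a])

def final_estimate(qa, qb):
    # The paper's proposed final output: the average of the two tables.
    return 0.5 * (qa + qb)
```

Since only one table is updated per step, doubling the step size keeps the per-table effective update rate comparable to Q-learning's, which is the intuition behind the matching AMSE.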
The empirical evidence consists of experiments in 3 domains: Baird's example, a gridworld, and cartpole. In each domain, the results support the analysis and the doubled-step-size-plus-averaging suggestion above. What I thought was missing from the experiments, coming back to the issue of estimation bias: these domains may not induce estimation bias for Q-learning (at least it isn't reported here), in which case we would not expect Double Q-learning to have a large advantage in the early stages of learning. It would have been nice to see a domain with clear estimation bias (e.g. Figure 6.5 in Sutton & Barto's RL book, 2nd edition) and to see how that relates to the final results for these different methods.
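The kind of estimation bias I have in mind can even be reproduced outside any MDP: take several actions with identical true value and compare the max over one set of sample means (the single-estimator construction underlying Q-learning's target) with the double-estimator construction (select with one set, evaluate with the other). The following is my own sketch with arbitrary sample sizes, not something from the paper; the N(-0.1, 1) rewards mirror the Sutton & Barto example:

```python
import numpy as np

def max_bias_demo(n_actions=10, n_samples=100, n_trials=2000, seed=0):
    """Estimate the bias of the single- and double-estimator max when
    every action has the same true mean value of -0.1."""
    rng = np.random.default_rng(seed)
    true_max = -0.1
    single, double = [], []
    for _ in range(n_trials):
        # Two independent sample-mean estimates per action.
        m1 = rng.normal(-0.1, 1.0, (n_actions, n_samples)).mean(axis=1)
        m2 = rng.normal(-0.1, 1.0, (n_actions, n_samples)).mean(axis=1)
        single.append(m1.max())           # max of noisy estimates: biased up
        double.append(m2[np.argmax(m1)])  # select with m1, evaluate with m2
    return np.mean(single) - true_max, np.mean(double) - true_max
```

The single estimator overestimates (the max of noisy estimates exceeds the true max in expectation), while the double estimator is unbiased here because selection and evaluation use independent samples. A domain where this effect dominates early learning would be the natural stress test for the paper's story.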
Despite my concerns above, the paper is clearly written and the analysis is novel, with some interesting empirical results to support it. I will wait for the authors' response to decide on a final score.
-------- post author response ---------
Thank you for the additional experiments and clarifications. I have updated my score, though I think the paper stands between a 6 and a 7 given the limitations of the theoretical analysis. If the paper is accepted, I would appreciate it if the authors clarified these limitations in the paper for the benefit of future readers.

__ Strengths__: See above.

__ Weaknesses__: See above.

__ Correctness__: As far as I could tell, the analysis is sound and the results correct.

__ Clarity__: The paper is well-written with clear notation which makes the math easy to follow.

__ Relation to Prior Work__: Relevant past work is well covered as far as I could tell.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: A theoretical comparison between double Q-learning and Q-learning is performed, for the tabular/linear RL setting. In particular, the paper studies the conditions under which the asymptotic mean squared error is equal for the two methods. In addition, experiments are performed that study the speed of learning for several settings of these methods.
---response to rebuttal-----
I appreciate the additional experiments. However, I'm keeping my current score for two reasons: 1) the improvement shown in the deep RL setting is really very minor, 2) the deep RL results are on a very simple task and do not appear to use an advanced optimizer; I'm skeptical whether any improvement can be shown at all in more complex domains.

__ Strengths__: 1. The authors provide theoretical insights as to why double Q-learning should use a higher step-size than Q-learning, in case of linear/tabular RL.
2. A minor improvement of double Q-learning is proposed (averaging the estimators), for tabular/linear RL.

__ Weaknesses__: The main weakness is that the contribution of the paper is small for a number of reasons:
1. Only tabular/linear RL is considered. However, the main empirical advantage (and usage) of double Q-learning occurs in the deep RL setting. In the deep RL setting, double DQN can result in more stable learning and higher asymptotic performance; with tabular/linear RL convergence is less of an issue and double Q-learning is hardly ever used. Furthermore, it is unclear how easily these results would carry over to deep RL, when modern optimizers such as Adam are used.
2. The main conclusion is that the step-size should be twice as high for double Q-learning compared to regular Q-learning. However, as the step-size is one of the main hyperparameters of a tabular/linear RL method, it is typically tuned per method anyway. So an empirical researcher using double Q-learning will likely quickly realize that it requires a larger step-size than Q-learning.
3. The assumption for the theory that the optimal policy should be unique is quite a strong assumption that limits the impact of this work further.

__ Correctness__: As far as I can tell, the claims appear to be correct.

__ Clarity__: The paper is very well written.

__ Relation to Prior Work__: Several improvements to double Q-learning have been proposed in the literature. It would be good to at least acknowledge these and ideally motivate why they are not considered. For example:
> "Weighted Double Q-learning", Zhang et al.
> "Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning", Anschel et al.

__ Reproducibility__: Yes

__ Additional Feedback__: My main suggestion for improvement of this work is to add deep RL experiments. Even if the theoretical results don't naturally carry over, it would be interesting to see if the suggestions based on the theory for the tabular/linear (larger step-size, averaging estimators) setting carry over to the deep RL setting. This is especially important because double Q-learning is not much used in the tabular/linear setting, for reasons mentioned above.
For tabular RL, I would explore & discuss the effects of the Q-value initialization on the advantage of double Q-learning over Q-learning. I suspect that in tasks where optimistic initialization is effective, the bias in the update of Q-learning might actually be an advantage.

__ Summary and Contributions__: This paper investigates the convergence speed of Double Q-learning compared to Q-learning as a function of the learning rate.
It also proposes an adaptation of Double Q-learning which uses the average of the two estimators as the final estimate.
The main finding is that double Q-learning obtains the same mean squared error as Q-learning assuming that it uses twice the learning rate and that the two estimators are averaged for the final answer.

__ Strengths__: Double Q-learning is a commonly used algorithm for solving RL problems, so understanding the convergence properties is highly relevant.
The paper undertakes extensive theoretical analysis and also makes some algorithmic contributions (averaging the two networks), which get validated with simple experiments.

__ Weaknesses__: The proofs in this paper are limited to the tabular and linear function approximation settings, making them less applicable to deep RL, which probably accounts for the majority of current applications.
It would be great to at least show some results on harder environments to validate the claims.
Lastly, I would be interested to understand how the proof would have to be adapted to account for non-unique optimal policies. If this is intractable, an experimental exploration would have been interesting.
The experimental section also contains some relatively arbitrary hyperparameters, e.g. the exact equation for the step size. It would have been good to confirm that the results are robust to different choices of these parameters.
The experimental results also omit the standard error of the mean, e.g. in the cartpole experiments.

__ Correctness__: I did not find any obvious issues with the paper, but also didn't check the proofs in detail.

__ Clarity__: Yes, the paper is overall well written. I would have hoped for a bit more explanation around the proofs, providing the reader with a more intuitive understanding of what is happening.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: Update: I have read the author response and still believe that this paper can provide value to the NeurIPS community.

__ Summary and Contributions__: This paper sheds light on special properties of the asymptotic mean square error (AMSE) of Double Q-learning and standard Q-learning, under a linear approximation of the Q-function.
Using results from Linear Stochastic Approximation the authors show that the asymptotic covariance of each estimator of the parameter vector satisfies a specific Lyapunov equation.
The structures of the Lyapunov equations and asymptotic covariance matrices imply bounds on the AMSE of the different estimators.
In particular it is shown that the average of the two estimators given by Double Q-learning with twice the learning rate of standard Q-learning has the same AMSE as the estimator of standard Q-learning.
These theoretical results are supported in the experimental section and provide practical guidelines on how to combine the estimators of standard Q-learning and the average of the estimators of Double Q-learning.

__ Strengths__: The paper is well written and easy to read.
The exposition of the results the authors borrow is clear and honest.
The proofs seem sound, simple and clean.
The theoretical findings are indeed supported by well described experiments.
Some code is provided in the supplementary material (I managed to run cartpole.py).
The practical guidelines seem useful in practice.

__ Weaknesses__: The methodology presented in the paper applies only when considering a linear approximation of the Q-function.
The code provided as supplementary material is written partly in Matlab and partly in Python, which is not convenient.

__ Correctness__: The proofs seem sound and not complicated given the exposition of the results the authors borrow.

__ Clarity__: Yes, it is indeed a strength of the paper.

__ Relation to Prior Work__: Relation to prior work and tools borrowed from Linear Stochastic Approximation are clearly exposed in the introduction, Theorem 1 and proof of Theorem 2.

__ Reproducibility__: Yes

__ Additional Feedback__: l 19 vis-à-vis
l 27 TD(0) notation is not defined
l 88 "This function estimates the value function" -> the Q-function ?
l 151 Section 2.3 not clickable
l 160 - (12) U_n notation is not defined
l 200 spacing Fig.__1a
l 368 adA?pted
**EDIT**
I am satisfied with the rebuttal and discussion, my overall score remains the same.