NeurIPS 2020

### Review 1

Summary and Contributions: Double Q-learning algorithm empirically outperforms the classical Q-learning yet there are very few theoretical garantees for double Q-learning due to the fact that it is based on two interconnected stochastic processes which prevents direct application of the techniques developped to analyse classical Q-learning. So far there are only asymptotic convergence results and this paper provides first finite-time analysis of double Q-learning.

Strengths: This is a result of high relevance for the NeurIPS comunity, it provides first elements for better understanding of the empirical success of double Q-learning over classical Q-learning. The paper provides new approach to analysing RL algorithms that are based on interconnected stochastic processes, which may inspire new theoretical results for other algorithms than double Q-learning.

Weaknesses: The results only hold for the tabular double Q-learning and it is definitely not straightforward to extend the analysis to the function approximation setting, actually used in practice. So there is still a huge effort before fully understanding empirical success of deep double Q-learning.

Correctness: The paper is technically sound. However it is definitelly not stand-alone, a detailed knowledge of Even-Dar and Mansour (2003) is assumed. This is in particular the case of Lemma 1. It would be useful to provide proof of Lemma 1 in the supplementary material. What is Rmax, a maximum value of the reward function R?

Clarity: The ideas are clear, but there are some presentation issues. For example, Algorithm 1 is supposed to be the synchronous implementation, yet in lines (8) and (11) only one state action pair is updated, which is rather asynchronous implementation. If you prefer to keep asynchronous version, please specify how do you "choose action a" (i.e. what are the assumptions on the exploration policy; for example you may want to provide a forward reference to Assumption 1 and an example of an exploration policy that verifies the assumptions). Also giving the formal definition of "converge" in Algorithm 1 would increase the readability (epsilon distance in sup metric?)

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Could you provide a discussion on the main difficulties of analysing double Q-learning in the function approximation setting? For example in the special case of linear function approximation. The discussion on the exploration policy in the asynchronous version is confusing: Assumption 1 (existance of the covering number L) suggests a deterministic exploration policy, yet there is a claim on line 262 that L >> |S||A|; why not = ? Can Assumption 1 is replaced by its "with high probability" equivalent? In Step I of Part I of the proof of Theorem 1 it would really help if you can provide an intuition on why z_t still a martingale difference sequence although you are dealing with interconnected stochastic processes. Minor comments: - R_t is defined in line 121 and then again in line 136; you may want to merge the two definitions and state in line 121 that t denotes time. - Line 127, please mention the assumptions for the existence of the optimizer (such as unichain property) -Line 131: since both state and action space are finite, max notation may be more intuitive than sup =================== Post Rebuttal ======================== I'm satisfied with the author's response, and I decided to keep my score unchanged.

### Review 2

Summary and Contributions: This paper develops a finite time analysis for the double Q-learning scheme with both synchronous and asynchronous implementations. It characterizes the convergence speed of double Q-learning to an epsilon accurate neighborhood, and bounds the difference between two interconnected stochastic process.

Strengths: The problem of reducing overestimation in RL is important and is timely. The paper is in general well written. The ideas are clearly explained.

Weaknesses: - As the paper states in the intro, double Q-learning was developed to address the overestimation problem of Q-learning. However, this cannot really be seen directly from the results in the paper. The explanation given in the paper suggests that double Q learning resolves the overestimation problem by achieving a fast convergence rate. While this is certainly related, is it the only key factor for this? It will be useful if the authors could provide more discussion on this. - While the reviewer understands that this is a theoretical work, it might be useful to provide numerical results to demonstrate the convergence and the block-wise bound. - tau^hat_1 appears to play an important role in the results. It would be useful to provide some intuition about this.

Correctness: Appears to be correct.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: ===post rebuttal=== The reviewer thinks that the rebuttal is fine, and hopes to see the numerical results in the revision.

### Review 3

Summary and Contributions: In this paper, the authors propose the first finite time analysis for both synchronous and asynchronous double Q-learning. They first analyzes the stochastic error propagation between two Q-estimators, then bound the error between one Q-estimator and the optimal Q function by conditioning on the error event associate with the previous analysis, and eventually bound the unconditional error between one Q-estimator and the optimum. The theoretical results are in line with the goal of double Q-learning, namely, reduce the number of aggressive move so as to avoid overestimation.

Strengths: 1. This is the first finite time analysis for double Q-learning, the theoretical result is exactly what we expect with when we are in the high accuracy regime (namely, \epsilon \ll 1-\gamma). 2. The technical contribution of the paper is sound, the authors introduce novel techniques to handle the inter-connected two random paths in double Q-learning.

Weaknesses: 1. In both Theorem 1 and Theorem 2, the discount factor \gamma are required to be greater than 1/3. I'm curious about why such the lower bound is imposed. 2. The primary focus of the paper is about the polynomial rate, can the current technique be applied to the case of linear rate?

Correctness: Although I didn't fully check the proof details in the Appendix (for example, I skip the proofs of Proposition 3 & 4), the main idea explained in the sketch proof is promising.

Clarity: The paper is reasonable well written, I appreciate the authors outline the sketch proof in the main paper with clear explanation of the high level idea.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: =================== Post Rebuttal ======================== I'm satisfied with the author's response, and I decided to keep my score unchanged.

### Review 4

Summary and Contributions: This submission investigated the theoretical properties of the double Q-learning, a popular algorithm in reinforcement learning. The authors derived the number of iterations needed for convergence with high probability, for both synchronous and asynchronous variants.

Strengths: The authors provided finite-time convergence analysis for the double Q-learning, which looks novel and interesting.

Weaknesses: The main results look similar to the ones in Even-Dar and Mansour (2003). The authors may need to provide more explanation on the connection and difference with this work, e.g., any new proof techniques used.

Correctness: This is a purely theoretical paper, and no experiments have been provided. I took a high-level check of the main results and they look to hold.

Clarity: This paper is in general well written but sometimes difficult to follow, as it's a pure theory paper and I am not an expert on the theoretical analysis of the relevant research.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: I am a bit confused about the definition between synchronous'' vs. asychronous'' in Section 2.2: The authors mentioned in L152-L153 that For synchronous double Q-learning (as shown in Algorithm 1), all the state-action pairs of the chosen Q-estimator are visited simultaneously at each iteration.'' However, in each iteration of Algorithm 1, only one state-action pair (s-a) was updated. Could you clarify the difference? === After rebuttal === Thank you for the response. Please add the clarification and connection with previous work into the revision.