__ Summary and Contributions__: This paper studies the episodic MDP under a slightly more general assumption on the reward function (assumption 3.2) and compares the sample complexity in the number of episodes instead of the number of timestep.
Previous results all have poly in H dependence in the number of episodes, and this paper is the first one that proposes an algorithm to achieve polylog H dependence on the number of episodes under this general assumption. Hence, this paper provides a negative answer to the COLT 2018 open problem by Jiang and Agarwal.
This paper also proposes the epsilon-net concept and utilizes online trajectory synthesis, both of which are interesting themselves.
I also believe this paper will motivate further future research questions.

__ Strengths__: The idea of epsilon net is novel and new to me.
This paper uses the online trajectory synthesis which only has low sample complexity.
This paper proposes the first algorithm that has polylog H dependence on the number of episodes under more general assumption on reward functions, and therefore solves a COLT 2018 open problem.

__ Weaknesses__: The proposed algorithm is not computationally efficient.
The result seems to have a lot of room to improve, w.r.t. |S|, |A|, and 1/epsilon.

__ Correctness__: I checked the proof in detail and the proof looks rigorous to me.

__ Clarity__: This paper is clearly written and easy to follow. The main idea and concept are conveyed in the main text and the proof is deferred to the appendix. I think this paper is generally easy to read for the RL community.

__ Relation to Prior Work__: The comparison between this work and the previous tabular PAC result are discussed. The difference in the assumptions and when and why this more general setting is interesting is well illustrated. Some related work about the trajectory synthesis method is also mentioned in the paper.

__ Reproducibility__: Yes

__ Additional Feedback__: After rebuttal:
I read the author response and other reviews, and would like to keep my original score.
After the rebuttal, it's still unclear why the authors would like to clear the dataset and hope to see more discussions on that in a later version. Besides, as other reviewer pointed out, it would be clearer if the authors could discuss the regime (epsilon >> 1/H) that they advance the prior result in the intro or abstract. Beyond that I think achieving polylog(H) result in the setting epsilon >> 1/H is still interesting and the analysis is novel. Regarding the epsilon-net, I think the idea of discretizing MDP is new to me. However, it's not surprising to see the result thereafter. So the authors may consider moving some of those parts to the appendix.
================================================================================================================================================================================================================================================
Overall, I think this paper has enough contribution for the conference. The result is solid and will motivate further research questions (e.g. generalization to the rich observation setting, the improvement on the computation side, and deeper understanding about the dependence on the horizon) to the community. The paper is easy to follow, and the main concepts and algorithms are well exposed.
As mentioned above, this paper studies the episodic MDP under a slightly more general assumption on the reward function (assumption 3.2). This paper compares the sample complexity in the number of episodes instead of the number of timestep. This would be a fairer comparison for the long horizon problem since we expect the long horizon problem would pay more timestep due to larger H, while comparing the number of episodes can avoid such issue. By proving a bound that only has polylog dependence on H, this paper solves the COLT 2018 open problem by Jiang and Agrawal.
The main idea of the algorithm is to first construct a set of MDPs by discretizing the transition function and the reward function, and further obtain a set of candidate policies (all the optimal policies for these different MDPs). Then the algorithm collects a dataset that is possible to approximate evaluate all these candidate policies. Finally, the algorithm evaluates all these policies and pick the best one. Since the true MDP is close to at least one of these MDPs (say M0), the optimal policy under M0 (say pi_0) must be near optimal. In addition, the algorithm is guaranteed to pick one that has close performance as pi_0, thus eventually gets a near-optimal policy.
The main technical question is how to construct such dataset. A naive way to interact with the environment to sample according to all candidate policies will immediately fail. The authors design a clever way to only sample when necessary and use a potential function to analyze the number of samples.
I only have some minor questions as below.
1. In line 13 of the algorithm, the dataset is cleared. I think the reason is to get independence in the analysis, however, it seems a little bit inefficient. Is it possible to avoid that and maybe with different analysis?
2. The authors mentioned on line 294-297 that some previous algorithm can avoid the poly H dependence in deterministic environment. Can authors comment why that would happen or maybe some intuition?

__ Summary and Contributions__: This paper refutes the conjecture that poly(H) lower bound is needed for episodic RL, while the existing lower bounds don't depend on H. In this paper, they show that the sample complexity depending only on poly(log(H)) is possible by designing a sampling scheme to reduce the sample complexity and sacrificing computation complexity. In general speaking, this paper contributes a new approach to resolve the open problem in COLT 2018 for episodic MDP, which surprisingly breaks our intuition that long-horizon MDP is more difficult than short-horizon MDP.

__ Strengths__: I have checked the proof of theoretical results and the results and solid. I reckon that this paper breaks the traditional intuition that long-horizon MDP is harder than the short-horizon MDP, though the computation complexity scales exponentially in state space and action space. The most interesting idea is reusing exploration samples to reduce sample complexity instead of solving dynamic programming backward as prior work does. I reckon this paper provides a new perspective for RL communities to design new efficient learning algorithms.

__ Weaknesses__: I understand this paper solves the problem with some unwanted sacrifice(computation efficiency). In terms of solving sample complexity itself, I reckon this paper does well. There are two aspects in terms of computation inefficiency. 1. The number of exploration/simulation policies is exponential in state/action space. 2. For each policy, the parallel number F is poly(SA) and there is no sublinear acceleration(indeed linear in F) in sample complexity.

__ Correctness__: 1. Line 477/478 in Supplementary, I guess 'ROLLOUT' means 'SIMULATE'
2. Algorithm 3, Line3: SimAll is Algorithm1 not 2
Besides, I have one question about this paper.
This paper assumes the transition probability is independent of horizons (P_h are all the same for h=1,...,H). I'm wondering whether the sample complexity still depends on poly(log H) if transition probability varies as horzion changes (P_h are not the same for h=1,...,H).

__ Clarity__: yes

__ Relation to Prior Work__: yes

__ Reproducibility__: Yes

__ Additional Feedback__: Refer to weakness for details
====================
I have read the authors' response and I decide to keep my score

__ Summary and Contributions__: The authors propose an algorithm for PAC learning in tabular finite horizon MDPs and present PAC bounds. Their work is motivated by a recent conjecture from Jian and Agarwal 2018 which asked whether a lower bound is polynomial in H and in particular (SAH / epsilon^2) would be the right scaling. In this paper the authors provide an algorithm that has a PAC bound that scales as (S^5 A^4/epsilon^3) with only a log dependence on H.

__ Strengths__:
The algorithm itself is quite interesting. It involves simulating each policy using prior samples and only running in the real environment if it the number of times it was not possible to use prior data to create a simulated trajectory exceeds a certain threshold.
This work helps bring us closer to answering the hardness of acting in long horizon vs short horizon settings.

__ Weaknesses__: The approach and results are interesting but the resulting conclusion overclaims.
In particular, in terms of H and epsilon, these results seem stronger than prior results of Zanette and Brunskill 2019 in the regime where 1/epsilon < H and worse when those other results 1/epsilon > H. When 1/epsilon ~= H, their result would yield SAH/ epsilon^2 vs this prior work would have SA / epsilon^2 and SAH / epsilon, and therefore be strictly worse. This is ignoring their dependence on S and A (which is roughly S^5 A^4) which is substantially worse than prior results (which are S*A).
This isn’t to say this result is not valuable, but it adds benefit under particular assumptions of the relation to H and epsilon, and it’s important to clarify these differences which also make the landscape of contributions more nuanced.
The computational complexity is exponential in the S & A space
The key aspects of the algorithm that enable this result would be very helpful to highlight in the main paper. For example, defining the failure probability (and how this changes over time as more policies are evaluated) is critical to the resulting bounds, but this is not explored in the main text. Similarly, it would be great to see more details in the main text about why a new set of trajectories from all prior policies are needed every time the rollout is executed. Section 5.1 could be kept in the appendix since it’s fairly straightforward to see why this many MDPs would enable a near optimal policy for any possible MDP, and the novelty lies more in these other aspects.

__ Correctness__: I have not went through the full proofs in the appendix.
---
Post rebuttal and reading others' reviews and discussion
Thanks for your feedback. I think your proposed changes will strengthen the work including your promise to explicitly discuss the regime of epsilon under which your work improves over prior work. I agree that it's useful to have work that is polynomial in the S, A, and log in H and this makes a helpful contribution.

__ Clarity__: See weaknesses.

__ Relation to Prior Work__: Somewhat. I think a key limitation of the work is that upon a quick read it seems to suggest that it definitively answers a prior conjecture, but the situation is more nuanced. It outperforms in a particular regime of epsilon and H whereas some prior work outperforms it in other regimes.

__ Reproducibility__: No

__ Additional Feedback__: Ultimately the direction is interesting but the contribution is more limited than the introduction suggests, providing value over a different regime of epsilon and H than recent related results, and the algorithm is not computationally tractable.