NeurIPS 2020

### Review 1

Summary and Contributions: The authors in this paper study the problem of learning in reward free fashion, to construct a good estimation of the model that can be further used for planing for a class of possible reward. This problem has been studied in tabular settings in prior works, and this work extends them to the value-based setting where the transitions and rewards are linear in some feature representation, resulting in Q function to be linear for any policy. The authors focus on fixed horizon MDPs The contribution of this paper is mainly two-fold, 1) In the linear model case, they show that an optimism based approach gives rise to a sample efficient algorithm. In another word, the algorithm runs in a polynomial in dimension and 1/epsilon number of episodes, and then, given any reward function satisfying linear MDP, it outputs epsilon optimal policy for that reward. 2) The second contribution of the paper is on the lower bound. The authors show that, if the Q function of just optimal policy is linear, the sample complexity would be exponential in dimensions.

Strengths: This work's strength is in its theoretical development and proposing an algorithm that learns an explicit model useful for later planning. Also, the lower bound. The authors well motivate the problem, and the approach is sound.

Weaknesses: Maybe not a limitation, but the results in this paper are mainly a direct extension of Jin 2020, which is on tabular MDP and Jin 2019, which is one of the later studies on MDPs with linear Q functions. It makes the novelty of the techniques and results in this paper limited, but it should not stop the paper from being accepted. It is absolutely alright to have a paper like this, which states a series of strong results but directly built on the top of prior works. The biggest concern of mine about this paper is the lack of rigor. The authors did not define S, what is A, what are possible admittable Q function, what rewards are allowed, and what transitions are permitted. It is quite disappointing that authors did not specify non of these spaces, and whether Q even exists, let alone existence of an optimal policy or that the "max" even applies here.

Correctness: Since the analyses are directly borrowed from prior works, I believe the author did a great job in transferring the results to their study, and also the advancement they made are sound. I was not fully convinced in the lower bound part. I am not fully convinced that if the Q^* is linear no only on the optimal action, but also on the suboptimal action, then the lower bound is exponential. I can see it when Q^* is linear for a^*, but not for suboptimal one. So I could not verify that part. I did not check the appendix for that but would have been great if the authors provide more details on the lower bound in the main text.

Clarity: The paper is clear and I enjoyed reading it.

Relation to Prior Work: The authors did a good job of referring to prior works. Maybe the authors find it useful to discuss also this paper: Active Model Estimation in Markov Decision Processes, it seems relevant.

Reproducibility: Yes

### Review 2

Summary and Contributions: The authors develop algorithms and theoretical analysis for reward-free RL in a setting with linear MDP's and linear function approximation extending previous work on reward-free RL which was in the tabular setting. They provide an algorithm for RL which has polynomial trajectory sample complexity in the dimension of the feature space and the horizon of the episode. Interestingly, there is no dependence on the size of the action or state space. They also prove a lower bound result by explicitly showing a class of deterministic MDPs such that the sample complexity for any RL algorithm has to be exponential.

Strengths: There are a number of significant achievements in this work. First, their algorithm goes beyond the tabular setting. Second it gives a sample complexity that is independent of the number of states and actions and polynomial in the other parameters. This is significant because a polynomial dependence in the number of states is fatal and that is the best one can do in the tabular case. A very interesting result is the lower bound result which they are able to exhibit even with just deterministic MDPs. The proofs are rigourous and the formulation of the framework and the background is very good.

Weaknesses: The restriction to the linear MDP setting is a weakness. I think that the extension to more general results is not too far off, but one cannot tell unless one tries. The proofs rely strongly on this restriction.

Correctness: Yes, this is a very competent theoretical analysis. There was no experimental backing.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

### Review 3

Summary and Contributions: The paper extends the reward-free approach recently introduced by Jin et al '20 to RL exploration in low-rank / linear MDPs, and provides lower bounds as well. However, in terms of techniques I feel the paper does not overall make a solid leap forward, but rather extends prior ideas. ======= Unchanged after rebuttal

Strengths: - The extension to the linear MDP setting of reward free exploration. - The lower bound (which I find more interesting) shows the separation between generative and online RL with this objective and with respect to the single reward-function.

Weaknesses: I find the extension of reward-free exploration to the linear MDP setting to be fairly straightforward, reusing most of the available techniques e.g. in Jin '20 for exploration on linear MDPs. Although reward free exploration has been only recently proposed, in general there are ways to create reward-free versions of the algorithms fairly easily (as it happens in this paper). While there has been a growing number of papers focussing on linear MDPs, it is the reviewer's opinion that we should not overfit'' to this setting by replicating / extending all the available bandit / tabular results here. The lower bound again seems to be a small variations of the ideas of Du '20 (but this is acknowledged by the authors)

Correctness: The paper looks correct to my understanding

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: I would suggest to put more emphasis on the lower bound as opposed to the upper bound. The upper bound looks fairly straightforward to obtain. Why do you need optimism in the planning phase?

### Review 4

Summary and Contributions: The paper extends recent work in reward-free RL (where an agent must explore its environment without reward signal, and then later try planning to solve a task with a given reward function) to the linear function approximation case. The paper proposes an algorithm for solving this problem that is provably efficient under certain assumptions. From my understanding, the algorithm essentially takes different ideas from existing work [Jin et al, 2019, 2020] and combines them. In particular, it is shown that for linear MDPs, the sample complexity of the resulting algorithm is polynomial, whereas with less restrictive assumptions it is exponential. This has implications for the “hardness” of reward-free RL under certain assumptions (i.e. that linear optimal value functions are a strictly weaker assumption than linear MDPs), as well as potentially the difference between normal RL and reward-free RL.

Strengths: The paper combines two existing pieces of work to develop an efficient algorithm to tackle the reward-free RL problem. Extended this beyond the tabular case is important, since now we can start thinking beyond small gridworlds and tackle continuous state MDPs. I thought the paper was very well written, and the insights from the theory were well presented, and would probably be of interest to the community. The fact that the assumption about the linearity of the optimal value function is weaker than the linearity of the MDP itself is unsurprising but good to know.

Weaknesses: I realise that the paper and contribution here are 100% theoretical in nature. However, it would have been nice to see even a synthetic toy empirical experiment which could showcase the algorithms in practice. Perhaps an illustration of Theorem 3.1, where delta, epsilon, H are varied and the corresponding number of samples increases?

Correctness: The theory all appears sound to me, but I was unable to verify all of the maths.

Clarity: The paper was generally well written. There were some minor typos and missing words, but nothing too major. I particularly liked the contributions and insights section at the end of the introduction. I also liked the sketch explanations of the proofs for the various lemmas and theorems. I found page 8 very hard to read since it was so dense, but I’m not sure if there’s any solution to that. It may be helpful If it could be explained roughly the intuition behind the “hard” transition distributions. I also found the use of the word “levels” a bit confusing. I understand it in the context of line 263, but is it not more easily understood as just timestep (particularly in Section 2.1). This would just be worth clarifying.

Relation to Prior Work: I am not well-versed in this exact field, but the related work seems comprehensive. The authors may be interested in some *very* recent work which slightly modifies the problem and reduces Jin et al’s dependency from O(S^2) to O(S) [1]. You could also include [2] on line 96. [1] Task-agnostic Exploration in Reinforcement Learning. Zhang et al, arxiv [2] Chentanez, Nuttapong, Andrew G. Barto, and Satinder P. Singh. "Intrinsically motivated reinforcement learning." Advances in neural information processing systems. 2005.

Reproducibility: Yes

Additional Feedback: 1. I would just like to confirm my understanding of the algorithmic contributions of this work. As far as I understand, Jin et al [2019] propose a learning algorithm for the standard RL case with linear function approximation in linear MDPs. Then Jin et al [2020] propose a method for efficient exploration in the reward-free RL case. This is for normal MDPs but in the tabular setting. In that work, exploration is achieved by constructing a reward function where the reward is 1 for states that are “significant”, and 0 otherwise, and then solving the resulting task with an efficient learning algorithm. In this paper, the idea is to use the learning algorithm from Jin et al [2019]. However, that was in the normal RL case with rewards, and here there are no rewards. So a reward function is constructed similar to Jin et al [2020], but instead of 1/0, it’s a UCB based score that captures uncertainty. And then finally that “task” is solved, which is easier because of the linearity in the MDP. 2. Looking beyond this, in the planning phase, the agent is given the explicit reward function. In practice, however, it’s probably the case that the agent just gets samples from the reward function. In that setting, what would the impact on the framework be? Presumably it would still be pretty efficient because of the efficient exploration, and it would just require observing samples to estimate the underlying reward function? It would also be interesting to find out what happens when the linearity assumption is assumed to hold, but doesn’t (or only approximately holds), but I certainly don’t think the paper needs to deal with that question. 3. On line 182-183, this is the number of episodes for *any* reward function (e.g. the most adversarial one), correct? 4. What is the effect of increasing/decreasing c_\Beta? 5. Minor typos: a. 22: algorithm -> algorithms b. 37: deal WITH both… c. 55: both A polynomial… d. 115: use citep e. 290: reward functionS ========== POST-REBUTTAL ========== Thanks to the authors for their response. Although the work is somewhat incremental with respect to Jin et al [2020], I still think it makes good progress on the "reward free RL" problem. It would have been nice to see at least a small scale empirical result, but I think because this is so theoretically-focused, it's not necessary. The linearity assumption is quite strong, so it would be useful to have some thoughts in the final version as to how we can go about loosening this assumption going forward.