NeurIPS 2020

### Review 1

Summary and Contributions: This paper studies MCTS for continuous action settings. Typically MCTS is just useful for discrete action settings and this paper studies the extension to continuous actions with the aim of theoretically justifying the approach taken. The approach is relevant to people interested in planning or people interested in continuous action control (e.g., robotics). The paper first extends an existing UCB-like algorithm for continuous-armed bandits, HOO, by using a polynomial exploration bonus instead of a logarithmic one. This approach is justified by a similar approach in the influential AlphaGo paper and prior work that justifies the approach theoretically for non-stationiary bandit problems. The paper then integrates this enhanced HOO into MCTS and calls the resulting algorithm Poly-HOOT. Theoretical results are given for convergence of approach to optimal action and empirical results show the method out-performs baselines. Overall, I liked the paper and think it clears the acceptance bar.

Strengths: As far as I can tell the approach is novel and I also like that the approach is grounded with theoretical analysis. Empirical evaluation on four toy domains is sound and shows Poly-HOOT performs as well or better than baselines. The performance gain is marginal on most tasks except for LunarLander. However, I think one of the main strengths of this paper is it introduces a theoretically well grounded method while simultaneously improving empirical performance. The paper is of high relevance to the NeurIPS community particularly people working in RL.

Weaknesses: Some empirical comparison is made to prior work (e.g., PUCT and discretized UCT) though other existing methods (discussed in related work) are not compared with. It's unclear why the baselines chosen were chosen. The proposed approach is based on discretization and so it may require many roll-out in higher dimensional problems. It would be nice for the authors to commment on this limitation and how it might compare to other related works.

Correctness: I have not checked all steps of the proofs but believe they're sound. The empirical methodology is correct and reproducible. It would be interesting to add some discussion of wall clock time and time per decision in the experiments.

Clarity: The paper is very well written.

Relation to Prior Work: Yes, the authors include a related works section and cite other approaches for extending MCTS to continuous actions.

Reproducibility: No

Additional Feedback: I read the author's response. Could you comment on how baselines chosen for experiments? Could you comment on the time per decision of methods in experiments? Is it feasible to run Poly-HOOT in a continuous control application where many decisions must be made per second? HOO / HOOT / Poly-HOOT are all based on a method of discretizing the action space. Could you discuss the limitations of this for higher dimensional problems?

### Review 2

Summary and Contributions: This paper proposes a Monte-Carlo Tree Search (MCTS) method for continuous action domains by extending Hierarchical Optimistic Optimization (HOO). The proposed method, Poly-HOOT, uses a polynomial term rather than a logarithmic term as the bonus term (bias term) in the UCB1-like formula. Poly-HOOT is proved to converge under some assumptions, such as the bounded depth.

Strengths: This paper provides a Monte-Carlo Tree Search method with proof of convergence and an analysis of the convergence rate, based on a polynomial term. An intuitive question of the researchers in this domain was that how to replace cumulative regret based formula (namely UCB1) with a simple regret based one. One example is the following paper, which uses a polynomial term based regret. J.Y. Audibert, S. Bubeck and R. Munos, Best Arm Identification in Multi-Armed Bandits, COLT-2010. Also, it was empirically known in the computer Go community that using a polynomial term rather than logarithmic term improves the strength. However, there was no tree search algorithm (based on a polynomial term (edited)) which has a proof of convergence. Therefore, the contribution of this paper is important.

Weaknesses: However, there are mainly two concerns. One is the depth bound. I am not entirely convinced that assuming maximum depth is meaningful. The depth bound is not only necessary for the proof but also a hyper-parameter. Another is the experimental settings. Table 1 shows that Poly-HOOT outperformed the counterparts in LunarLander (LunarLanderContinuous-v2). However, the score is averaged over only ten runs, and the average (and the variance or max/min.) is not shown.

Correctness: I couldn't find problems in the method or proof, though I am not confident about the details.

Clarity: In general, it is well written. However, it was difficult to follow the explanation and the proof steps because there was little information that helps the readers. Some examples are, - the intuitive meaning of \alpha, \xi, and \eta is not explained. - I think $n$ means round throughout this paper, which is defined at line 146. However, when $V_n$ or $f_n$ appeared in the equations, I couldn't remember what $n$ meant. It will help the readers if there is an explanation that reminds the readers about the definition and the purpose of the variables.

Relation to Prior Work: This paper provides adequate amount of related work and discusses the novelty of the paper. It focuses on the comparison with HOO and lacks the comparison with simple regret MAB using polynomial term, but in my idea, that is not crucial.

Reproducibility: Yes

Additional Feedback: It would have been more convincing if there had been a discussion of how to determine the parameters, especially the maximum depth, before solving problems. (comment after rebuttal) The feedback adequately answers the questions posed by the reviewers. The comparison with VOOT and additional analysis (including $\bar{H}$) would be valuable.

### Review 3

Summary and Contributions: The paper presents an MTCS variant for continuous domains. in each node of the search tree, the continuous action space is explored by a modified variant of the continuous bandit algorithm HOO. In HOO the logarithmic bonus term is replaced by a polynomial one. The main result of the paper is the convergence proof of the POLY-HOOT algorithm. A few empirical evaluation provide further indication of the effectiveness of the algorithm.

Strengths: The convergence proof.

Weaknesses: The experiments could have used more baselines.

Correctness: The paper seems sound.

Clarity: The paper is well written.

Relation to Prior Work: Related literature is discussed.

Reproducibility: Yes

Additional Feedback: POLY-HOOT seems a natural algorithm. MCTS is well established, HOO has been around for a while, and polynomial bonus terms are also popular. Therefore the main contributions is the convergence proof. The algorithm is defined in continuous state space, but given that the transition is assumed deterministic, I do not think that it has any relevance at all (compared to a discrete, but large enough state space). In general, I have reservations about the practicality of MCTS in continuous state/action spaces without a function approximator. VOOT of [Kim et al, 2020] seems to report reasonable practical performance, even compared to some RL algorithms. It would have been useful to include at least VOOT as baseline, and even better some RL methods with function approximation. ----------------------------------- I have read the autors' feedback. I hope that there will be enough time to include some of the additional baseline suggested.

### Review 4

Summary and Contributions: This work introduces Polynomial Hierarchical Optimistic Optimization (POLY-HOOT), a Monte0Carlo planning algorithm for solving continuous space MDPs. POLY-HOOT constitutes a modification of the Hierarchical Optimistic Optimization applied to Trees (HOOT) algorithm that integrates MCTS with continuous action bandit algorithm strategy HOO. To be more specific, an HOO is applied at each state and depth in the rollout tree. In contrast to HOOT that uses a logarithmic bonus term, POLY-HOOT uses a polynomial term for bandit exploration. Theoretical results show the POLY-HOOT converges to an arbitrarily small neighborhood of the optimum at a polynomial rate. Experiments have been conducted on four classical rl environments.

Strengths: The main novelty of this work is the replacement of the logarithmic bonus term that used by HOOT algorithm for bandit exploration with a polynomial term. In contrast to HOOT, a theoretical analysis is also provided that proves the non-asymptotic convergence rate of POLY-HOOT algorithm. Moreover, empirical analysis has been conducted on three classical control tasks (CartPole, Inverted pendulum swing up, and LunarLander), showing that POLY-HOOT achieves the same or slightly better performance compared to HOOT. The source code of the experiments is also provided. Finally it should be mentioned

Weaknesses: The general concept of this work is not quite novel. Actually, the proposed POLY-HOOT algorithm is a slightly modification of the standard HOOT. In addition to that I have found the empirical analysis of this work limited. To be more precise, experiments have been conducted only on three classical rl tasks. It could be interesting to check the performance of POLY-HOOT algorithm on some continuous control tasks (Mujoco, roboschool). Additionally, the computational complexity of the POLY-HOOT algorithm should be given.

Correctness: In general, the method and authors claims are valid. Additionally, the empirical evaluation seems to be correct and the source code is provided by the authors.

Clarity: The paper is fairly written and the general idea of the work is presented in a clear way. The only part that should be discussed is the presentation of the empirical results. Also, the authors' justification about the fact that POLY-HOOT outperforms HOOT on the LunarLander environment is not clear. I think that authors should elaborate more about this. Also, authors should describe how they are setting the maximum HOO depth and how does it affect the performance of the POLY-HOOT.

Relation to Prior Work: Generally speaking, the related work and the differences between POLY-HOOT and HOOT are presented adequately by the authors

Reproducibility: Yes

Additional Feedback: Please check my comments on the previous sections. I would like to thank the authors for addressing most of my review points. Therefore, I am going to change my original evaluation by increasing the score.