NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal

### Reviewer 1

This paper proposes using random search algorithms for solving RL tasks. The authors show that with a few modifications of random search, and using linear policy approximators, they achieve results competitive with policy gradient algorithms and evolution strategies, and state-of-the-art results on many tasks. Arguably, the main contribution of the paper is the results, which are interesting. Algorithm-wise, they use basic random search augmented by three tricks: i) state normalization, ii) step-length normalization by the standard deviation of the returns, and iii) updating based only on the highest-rewarded directions. The results of this paper are definitely valuable and I am in favor of these findings; however, I have a few questions/concerns that I would suggest addressing in the rebuttal/next version.

1) How does a policy learned for one random seed generalize to other seeds?

2) Does ARS still show superior performance with other policy approximators? For example, does it still work with a neural network with 1-2 hidden layers?

3) As far as I know, ACKTR surpasses TRPO on MuJoCo tasks, so in order to claim state-of-the-art results, ARS should be compared with ACKTR.

4) Although I have not seen a published paper, the following library https://github.com/hardmaru/estool shows that CMA-ES outperforms ES. One might be interested to see a comparison between ARS and CMA-ES.

5) CMA-ES uses a covariance matrix to choose exploration directions in policy space, while you use one to normalize the input. Is there any relation between these two? It seems to me that the two ideas are related, both amounting to an affine transformation or change of coordinates. Please elaborate.

6) I am confused about the justification for dividing by $\sigma_R$ in general. To illustrate what I mean, consider the following example: suppose that at a given iterate all episode returns are close to each other; then $\sigma_R$ is close to zero.
In this case, the update rule 7 in Algorithm 1 will send $M_{j+1}$ to a very distant point. Theoretically speaking, $\nu$, not $\sigma_R$, should appear in the denominator (see Nesterov & Yu, 2011).

Minor: footnote 1 is on the wrong page. The references to the Appendix inside the text should be "B", not "A"; for example, L200. L133: "and" -> "an". How do you initialize the network weights?

=============

After reading the rebuttal, I got the answers to all my questions. I have almost no doubt that this paper will have a high impact on the community and will be a motivation for finding harder baselines.
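For concreteness, the division by $\sigma_R$ that my question 6) concerns can be illustrated with a minimal NumPy sketch. This is not the authors' code; the function name, argument shapes, and default step size `alpha` are my own illustrative assumptions.

```python
import numpy as np

def ars_update(M, deltas, r_plus, r_minus, alpha=0.02, b=None):
    """One ARS-style update (a sketch in the spirit of update rule 7
    in Algorithm 1; names and defaults are illustrative only).

    M       : (p, n) linear policy matrix
    deltas  : (N, p, n) random search directions
    r_plus  : (N,) rollout returns for M + nu * deltas[k]
    r_minus : (N,) rollout returns for M - nu * deltas[k]
    b       : number of top-performing directions kept (trick iii)
    """
    b = len(deltas) if b is None else b
    # keep the b directions with the largest max(r+, r-)
    order = np.argsort(-np.maximum(r_plus, r_minus))[:b]
    # sigma_R: std of the 2b returns used in the update (trick ii)
    sigma_R = np.std(np.concatenate([r_plus[order], r_minus[order]]))
    step = sum((r_plus[k] - r_minus[k]) * deltas[k] for k in order)
    # the division questioned above: unbounded as sigma_R -> 0
    return M + alpha / (b * sigma_R) * step
```

In this toy sketch, identical returns give $\sigma_R = 0$ and the update is undefined (NaN), while nearly identical returns still move $M$ by a step on the order of `alpha`, even though the return differences carry almost no signal; this is the degenerate behavior described above.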