NeurIPS 2020

PlanGAN: Model-based Planning With Sparse Rewards and Multiple Goals

Meta Review

This paper proposes using an ensemble of GANs to learn a goal-conditioned forward model of trajectories for use in planning. The model is trained using a variant of hindsight experience replay, resulting in an agent that can succeed at sparse-reward, goal-conditioned tasks with much better data efficiency than model-free approaches. All reviewers highlighted the impressive experimental results, with R1 and R2 finding the approach very interesting, and R3 and R4 noting the potential impact of and interest in this work. I agree that this paper will likely be of broad interest to the RL community at NeurIPS and therefore recommend acceptance. However, several reviewers also noted the lack of comparison to other model-based approaches. While PlanGAN is an interesting idea worth publishing on its own, I agree the paper would be greatly strengthened by a comparison to an existing non-goal-conditioned planning method, and I would like to see this addition in the camera-ready. Additionally, it’s not entirely clear to me what the tradeoffs are of modeling trajectories via a GAN rather than combining a goal-conditioned policy with a non-goal-conditioned state transition model (i.e., s_{t+1} = f(s_t, a_t); a_t = \pi(s_t, g)). For example, is it that the GAN is more robust to compounding model errors? Or is it that it’s easier to jointly model states and actions than to model the two separately? Indeed, R2 was also concerned that some of the choices made in the paper seem arbitrary or unjustified, so I would like to see more discussion of these considerations included.
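
To make the alternative decomposition mentioned above concrete, the following is a minimal sketch of a rollout that alternates a goal-conditioned policy, a_t = \pi(s_t, g), with a non-goal-conditioned transition model, s_{t+1} = f(s_t, a_t). It is an illustration only and is not taken from the paper: the names `rollout`, `toy_policy`, and `toy_model` are hypothetical, and the toy dynamics are assumed purely so the example runs.

```python
from typing import Callable, List, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def rollout(
    s0: State,
    goal: State,
    policy: Callable[[State, State], Action],
    model: Callable[[State, Action], State],
    horizon: int,
) -> Tuple[List[State], List[Action]]:
    """Imagine a trajectory by alternating a_t = pi(s_t, g) and s_{t+1} = f(s_t, a_t)."""
    states: List[State] = [s0]
    actions: List[Action] = []
    s = s0
    for _ in range(horizon):
        a = policy(s, goal)   # goal-conditioned policy chooses the action
        s = model(s, a)       # one-step model predicts the next state
        actions.append(a)
        states.append(s)      # each prediction is fed back in, so model errors can compound
    return states, actions


# Hypothetical stand-ins for learned components (assumptions for illustration only).
def toy_policy(s: State, g: State) -> Action:
    # Move halfway towards the goal in every dimension.
    return tuple(0.5 * (gi - si) for si, gi in zip(s, g))


def toy_model(s: State, a: Action) -> State:
    # Simple additive dynamics: next state = state + action.
    return tuple(si + ai for si, ai in zip(s, a))


states, actions = rollout((0.0,), (1.0,), toy_policy, toy_model, horizon=3)
```

In this decomposition the one-step model is applied recursively to its own predictions, which is where compounding model error would enter. A model that generates states and actions jointly, conditioned on the goal, would replace the two separate callables with a single generator.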