Paper ID: 1631
Title: Action-Conditional Video Prediction using Deep Networks in Atari Games
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper addresses the problem of learning a model of Atari 2600 games (a popular testbed for reinforcement learning algorithms), in other words predicting future frames conditioned on action input.

This is a challenging problem and its solution is a useful tool to build better controllers.

The paper is clear and well-structured, and has convincing experiments (and videos).

The model is a CNN (with a fully-connected layer) followed by multiplicative interactions with an action vector, followed by convolution decoding layers. The recurrent version has an LSTM layer added after the CNN.
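
For readers who want a concrete picture of the pipeline described above, here is a minimal sketch in PyTorch. The layer sizes, the 84x84 grayscale input, and all names are illustrative assumptions, not the authors' exact architecture; the point is only the shape of the computation: conv encoder, factored multiplicative interaction with a one-hot action, and up-convolutional decoder.

```python
# Hedged sketch of the feedforward variant described above (illustrative sizes,
# not the paper's exact architecture). Assumes 84x84 grayscale frames with 4
# stacked history frames.
import torch
import torch.nn as nn

class ActionConditionalPredictor(nn.Module):
    def __init__(self, n_actions: int, n_hidden: int = 1024, n_factors: int = 2048):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, kernel_size=6, stride=2), nn.ReLU(),              # 84 -> 40
            nn.Conv2d(64, 64, kernel_size=6, stride=2, padding=2), nn.ReLU(),  # 40 -> 20
            nn.Conv2d(64, 64, kernel_size=6, stride=2, padding=2), nn.ReLU(),  # 20 -> 10
            nn.Flatten(),
            nn.Linear(64 * 10 * 10, n_hidden), nn.ReLU(),
        )
        # Factored multiplicative interaction: three small matrices instead of
        # a separate (n_hidden x n_hidden) transformation matrix per action.
        self.W_enc = nn.Linear(n_hidden, n_factors, bias=False)
        self.W_act = nn.Linear(n_actions, n_factors, bias=False)
        self.W_dec = nn.Linear(n_factors, n_hidden)
        self.decoder = nn.Sequential(
            nn.Linear(n_hidden, 64 * 10 * 10), nn.ReLU(),
            nn.Unflatten(1, (64, 10, 10)),
            nn.ConvTranspose2d(64, 64, kernel_size=6, stride=2, padding=2), nn.ReLU(),  # 10 -> 20
            nn.ConvTranspose2d(64, 64, kernel_size=6, stride=2, padding=2), nn.ReLU(),  # 20 -> 40
            nn.ConvTranspose2d(64, 1, kernel_size=6, stride=2),                         # 40 -> 84
        )

    def forward(self, frames: torch.Tensor, action_onehot: torch.Tensor) -> torch.Tensor:
        h = self.encoder(frames)                                   # (B, n_hidden)
        h = self.W_dec(self.W_enc(h) * self.W_act(action_onehot))  # action-conditional transform
        return self.decoder(h)                                     # (B, 1, 84, 84) predicted frame

# Example: predict the next frame for a batch of 2 random inputs.
model = ActionConditionalPredictor(n_actions=18)
frames = torch.randn(2, 4, 84, 84)
actions = torch.eye(18)[torch.tensor([3, 7])]
next_frame = model(frames, actions)   # shape (2, 1, 84, 84)
```

The recurrent variant the reviewer mentions would insert an LSTM between the encoder output and the multiplicative layer, so that a single frame per step (plus the LSTM state) replaces the stack of history frames.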

The authors evaluate their models both on pixel accuracy (traditional way of evaluating such models) and on usefulness for control (what we really care about).

It would be desirable to include more experimental details about 1) the network architecture (especially the deconvolution part) and 2) the network training procedure. Ideally, code would be made available, but more details in the main text or supplementary would also be fine.

Some comments:

- It is a bit unfortunate that the authors did not try to predict rewards in addition to the next frames, that would have opened the door to using the model for planning (e.g., using UCT), instead of using a trained model-free controller to test the usefulness for control - which is a bit harder to interpret.

- Baselines against which the models are compared are a bit weak, but this is fair enough since there are no obvious candidates to compare against (afaik).

- About the exploration section, is the predictive model learned online to help with exploration? Or is it learned using data from a regular DQN (uninformed exploration) first, and then used to direct the exploration of a new controller? If it's the latter, then it's not clear what this is achieving - since exploration has already been done to obtain the model. In any case, it is still a bit surprising that this helps in some games.

- The controlled vs uncontrolled dynamics section at the end is interesting.

- The authors might want to take a look at this relevant recent work "DeepMPC: Learning Deep Latent Features for Model Predictive Control" on learning deep predictive models (using also multiplicative interactions with the actions) for control, although this isn't in the visual domain.

Minor things/typos:

- "In Seaquest, new objects appear from the left side or right side randomly, and these are hard to predict." I'm not sure this is completely true, but it certainly looks random.

- line 430: predicing

[Updated score after rebuttal. Other recent papers which learn deep dynamical model from images, though not for the Atari game: -From Pixels to Torques: Policy Learning with Deep Dynamical Models -Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images ]
Q2: Please summarize your review in 1-2 sentences
A neat paper on learning the dynamics of Atari Games from data. The paper is well-written and has some convincing experiments.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a system for action-conditional prediction of video frames using deep convolutional and recurrent neural networks. The authors propose two deep models for encoding the video data using feedforward CNNs or a recurrent neural net, transforming the frame based on the applied action, and decoding it back using up-convolution. The system is trained on four video games using the Atari emulator and tested for video prediction and efficacy for control using the DQN algorithm.

Quality & Originality: ------------------------ The paper is sound and logical. The proposed deep architectures are similar to previous work, with the added action-dependent transformation. The main novelty comes from the new multiplicative formulation of this transformation as compared to an additive fully connected layer. Most other parts, including the training, have been proposed in the literature. The system does achieve good results, being able to predict multiple frames into the future.

The experiments compare and contrast the two variants against two other naive baselines which do not consider the actions. A more informative baseline would have been the additive fully connected action transformation layer as this takes into account the action's effect. The authors should try to compare against this.

Other experiments confirm the efficacy of the system for control, based on the DQN algorithm. The informed sampling for the DQN also improves on state-of-the-art results for the DQN, but that is to be expected given a decent enough generative model of the emulator. A good sanity check there would be to use the emulator as the generative model and compare the scores w.r.t. the results from the learned model. Additionally, the actions separate well across the learned representations.

Clarity: --------- The paper is well written and clear. The figures need to be improved (especially Fig. 5a, where it is hard to see any difference between the two methods), and caption text should be added below the figures to make it easier for the reader to gain context.

Significance: --------------- The paper proposes a new method for learning representations for video prediction conditioned on the actions. The paper builds on the use of deep auto-encoders for learning representations for control and is relevant to the NIPS community. There has been some prior work on action-based image data prediction in robotics that the authors may wish to cite: Boots et al., Learning Predictive Models of a Depth Camera & Manipulator from Raw Execution Traces, ICRA 2014. This paper also does action-dependent prediction of Kinect data (640 x 480), but over much shorter timeframes and in simpler environments.

Overall, the paper is concise and the ideas are well presented. I feel that the work is a bit incremental, without a significant theoretical contribution. Adding more experimental results with better baselines should improve the paper as a whole. I vote for a borderline accept.

Q2: Please summarize your review in 1-2 sentences
This paper presents two deep-architectures for predicting sequences of video frames of the Atari emulator conditioned on the agent's actions and previous frames. The system is tested on a few Atari games and is used with the DQN algorithm to evaluate its efficacy for control. The paper is well written with a few experimental results, but comes across as incremental work. I vote for a borderline accept.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper is easy to read, and the ideas and the message it conveys are easy to understand in broad strokes. However, I found that the authors sometimes (perhaps to improve readability) skip certain details. It is not obvious how important these details are, but their omission makes quickly reproducing the results hard.

I will point out a few things that I would like the authors to be more clear about:

* Specify the model in more detail! It is not clear how the decoder is symmetric to the encoder. How do you upsample as you go through the decoder?

* What learning procedure was used to train the generative model (SGD, SGD with momentum, RMSprop, Adam, Adadelta, Adagrad, to name just a few of the more popular approaches)?

* For the LSTM, was any gradient clipping used (most LSTM models out there rely on gradient clipping for stability)?

* What learning rate, momentum, etc. were used? Were these chosen based on intuition, or did the authors run a grid search (or random sampling of the hyper-parameters)?

* How exactly is the data constructed? Was any pruning used to make sure that you get enough frames from each stage of a game (the image statistics early on are potentially very different from later on, e.g., in Pac-Man, and might require balancing the dataset)? What is the average episode length (compared to the average length of the movie you can generate)?

* Is the curriculum learning (going from 1 to 3 to 5 frames into the future) necessary? Does it result in more stable generation? Was this validated experimentally?

* The authors claim that the action should have a multiplicative interaction, which I think is a reasonable assumption. Has this fact been verified experimentally, though? Given that the task is novel, it is hard to judge how important these bits are.
Q2: Please summarize your review in 1-2 sentences
The paper is well written, though it occasionally lacks details of the exact experimental procedure, which most probably makes the results hard to reproduce. The results are impressive, and the authors make some interesting observations and analysis about the behavior of their model.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors present a system for video frame prediction in ATARI games taking into account the game actions. The main novelty is a multiplicative transformation layer that selects weights based on a given action vector. To keep the system scalable, a factorization of the weight tensor is used. Encoding and decoding of images is done by a well-established convolutional network architecture.
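
To make the factorization mentioned above concrete, such action-conditional layers are often written in the following factored form (the notation is ours and is an assumption, not necessarily the paper's):

```latex
h^{\mathrm{dec}} \;=\; W^{\mathrm{dec}}\!\left((W^{\mathrm{enc}} h^{\mathrm{enc}}) \odot (W^{a} a)\right) + b
```

Here $a$ is a one-hot action vector and $\odot$ is the elementwise product; the three matrices $W^{\mathrm{enc}}$, $W^{a}$, $W^{\mathrm{dec}}$ stand in for a full three-way weight tensor, i.e., they avoid storing a separate transformation matrix for every action.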

They propose two different architectures: a feed-forward network that gets a fixed number of previous frames as input and a recurrent network that gets only one frame but has an LSTM layer before the transformation layer. (I think at least once in the paper they should say what's an LSTM and what the abbreviation stands for.)

The qualitative assessment of the generated frames as shown by the supplementary videos is convincing. A quantitative evaluation of mean squared pixel error is given in a clear way but it is not surprising that their system beats a linear and a nonlinear predictor that don't take into account the game actions.

Another experiment in which the generated frames are used instead of the real frames as input for a DQN agent also shows results as expected. The performance is worse than for real input frames, better than random play, and their system beats the no-action predictors.

An interesting application of this system is to use it during training of an agent to improve exploration. They show that if a DQN agent takes an action that leads to a predicted frame that has least similarity with previously seen frames (rather than a random action), the final performance is significantly improved on some games.
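
As a rough illustration of that exploration rule, the following sketch (assumed names and hyper-parameters, not the paper's implementation) replaces the random action of an epsilon-greedy policy with the action whose predicted next frame is least similar to recently seen frames:

```python
import numpy as np

def informed_exploration_action(predict_next_frame, recent_frames, actions,
                                rng=None, eps=0.05):
    """With probability eps, pick the action whose model-predicted next frame
    is most dissimilar (mean squared pixel distance) to the recent frames;
    otherwise signal the caller to use the greedy DQN action.
    `predict_next_frame(a)` is an assumed hook into the learned video model."""
    rng = rng or np.random.default_rng()
    if rng.random() >= eps:
        return None  # caller falls back to the greedy action

    def dissimilarity(a):
        pred = predict_next_frame(a)
        return float(np.mean([np.mean((pred - f) ** 2) for f in recent_frames]))

    return max(actions, key=dissimilarity)
```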

Finally it is shown that the system can be used to automatically analyze some of the game dynamics. Game actions that have similar effects can be identified from similarities in the transformation weight matrix. Furthermore the system estimates which pixels of the image are controlled directly by actions and which pixels are uncontrollable game dynamics.

The authors write that "To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs." This may be true, because to my knowledge previous such systems did not have LSTM. I know, however, that there is an old paper by Schmidhuber and Huber from 1991 (International Journal of Neural Systems) where a similar neural predictor learns to predict the next visual input frame of a fovea, given previous input and action. The predictor is embedded in a reinforcement learning system that learns sequences of saccades that lead to desired visual objects defined in a separate goal input. Schmidhuber also had papers at IJCNN 1990 and NIPS 1991 where both the predictor and the action generator were recurrent (but no LSTM yet). Nevertheless, it would be good if the authors could point out what's different in their system, which for example also uses a different RL method.

The results are clear, the paper is well structured, and it is suggested for acceptance after minor revisions as indicated above.

Q2: Please summarize your review in 1-2 sentences
A novel architecture for video frame prediction in ATARI games based on game actions is presented. There are promising results on improved exploration during agent training, as well as a simple analysis of game dynamics.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper is nice to read.

The idea is clear, the implementation looks sound, and the results are good.

I also appreciate that the authors take a moment to highlight limitations (Section 4.1), and that they give both quantitative and qualitative analyses, as well as suggesting hypotheses.

Clearly a lot of work went into this paper, and the analysis seems good and fair.

The authors do not just consider how to do something (generative model), but also why (exploration), and then test this.

The main drawback is that the results on usefulness are a bit inconclusive, and the ultimate general impact of this line of work is a bit unclear.
Q2: Please summarize your review in 1-2 sentences
This is a nice empirical paper about using generative models to hallucinate futures in Atari games. The (empirical, quantitative and qualitative) analyses of the results are nice and comprehensive.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Thank you for all your comments. We will revise the paper accordingly.

R1, R5: Regarding details of training and architecture:
To make the results reproducible, we will include more details of training and release our code.


R1, R2, R3: Regarding baselines:
We trained an MLP with 2 hidden layers where actions are concatenated to the first hidden layer, which uses a similar number of parameters as our models. This model performs significantly worse than ours and cannot handle different actions. We believe that this serves as a stronger baseline and thus provides insight into the effectiveness of our architecture.


R2, R5: Regarding multiplicative transformation:
We confirmed empirically that our model consistently performs better than a model that is the same except that actions are concatenated into the fully-connected layer of the encoder. When the decoding layer is simply a fully-connected layer, the concatenation model performs significantly worse than our model and fails to differentiate different actions, whereas our multiplicative model handles different actions well. This implies that the multiplicative interaction is better at conditioning the transformation, while concatenation requires additional highly non-linear layers to successfully condition the transformation on action variables.


R1:
Regarding planning:
We agree that building a reward prediction model for planning is a natural future direction and we have discussed this in the conclusion.

Regarding online training:
As described in Sec. 4, our predictive model is trained offline. We chose offline training to carefully evaluate and compare the two proposed architectures by constructing a fixed set of training/testing data. However, our model is not limited to offline training; it is possible to train our model jointly with DQN by sharing the replay memory. We agree that online training is an interesting next step, but it is beyond the scope of this paper. Instead, we focused on showing that our method can achieve asymptotically higher game scores on some game domains.

We will cite the DeepMPC paper. This work is contemporaneous and proposes a multiplicative interaction with actions. However, the main difference is that this paper deals with 3-dimensional observations, while our model deals with far higher-dimensional observations.


R2:
Regarding the comments on novelty:
To our knowledge, there have been no deep convolutional architectures that take "actions" into account for high-resolution video modelling, as suggested by R1 and R3. We also showed a novel application of deep networks in vision-based RL domains.

Regarding informed exploration using the emulator:
We agree that the use of the game emulator itself for informed exploration is a useful baseline. We in fact already implemented such a baseline and included the results in the supp. materials (Table 1). To sum up, exploration using our predictive model achieves almost the same performance as using the emulator except for Ms Pacman. We will make this comparison more salient in the text.

We will cite the paper by Boots et al. as related work. The main difference is that our model is based on deep neural networks, whereas their work is based on a non-parametric variant of predictive state representations. Furthermore, our model deals with more complex spatio-temporal data that involves many local transformations/interactions as well as long-term dependencies.


R3:
Thanks for pointing out the work by Schmidhuber and Huber, which we will cite as related work. In addition to the differences you mentioned, our work tackles the "global" video prediction problem (instead of predicting patches in the foveated regions) with deep convolutional/recurrent network architectures and a new way to combine actions (multiplicative interactions). While their model is embedded into the controller, our model is used for better exploration in an RL context.

R4:
As stated by R3, our main contribution is to propose a new deep architecture using control variables (with multiplicative interaction) for long-term video prediction. To our knowledge, our paper is the first work that shows very long-term predictions (up to several hundred frames) in high-dimensional videos. In addition, we propose a novel application of deep networks in an RL domain, which improves the state-of-the-art controller. Our paper shows detailed qualitative (video) and quantitative (squared loss and game scores) results, as R3 and R6 mentioned.


R5:
Regarding incremental (curriculum) training:
We confirmed that the incremental training stabilizes learning. Our models (especially the recurrent model) perform much worse in terms of squared loss if the model is trained on the 5-step prediction objective from scratch.


R6:
We agree that the predictive models in a control setting are just a beginning, but our work can be potentially useful for improving short-term or long-term planning in RL.