NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1125 Value Iteration Networks

### Reviewer 1

#### Summary

The paper presents a view of value iteration that links it to a convolutional layer in a neural network. This observation is exploited in a NN architecture that has multiple layers that effectively perform depth-limited VI, the parameters of which specify the MDP on which the planning will occur. In this way, the network learns to construct MDPs for short-term planning based on the current state of the real system. Experiments demonstrate that this approach allows the agent to generalize more effectively over many instances of navigation problems.

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 2

#### Summary

It is well known that a program can be emulated by a neural net. The paper proposes a neural architecture for reinforcement learning based on two interacting neural nets. The first, akin to DQN, performs action selection; the difference is that the value is provided by the second network, which emulates a (discretized) approximation of the target MDP and performs value iteration. The paper is very interesting and I am inclined to accept it, all the more so as the authors plan to make the code publicly available.

Here are some points where I did not fully understand the approach and where more details and/or complementary experiments are needed. Firstly, it is essential to understand the limitations of the approach. Are they related to the size of the domain? Table 1 does a good job of showing the graceful degradation as the gridworld size increases. Are they related to the architecture of the auxiliary MDP? Complementary experiments in Section 4.3 (continuous control) could be used to study the sensitivity with respect to the size. Here, two different discretizations can be considered; a finer discretization requires a larger attention zone; what happens if the attention zone is too small; etc. By the way, I did not follow "the agent has to control the particle". Secondly, it might be interesting to see what is learned by the NN. See "Graying the black box" (ICML 2016), which visualizes the states in the Playing Atari approach. In the gridworld, can you show the f_R landscape? Thirdly, the initialization is not entirely clear: starting from expert trajectories is one thing, but what if they are inaccurate? What happens if you start from random trajectories: in that case, would you need as many trajectories as for the reactive Playing Atari approach?

#### Qualitative Assessment

See summary. I particularly appreciated the diversity of the experiments.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

The paper proposes a formulation of the value iteration algorithm as a differentiable neural network component. It provides an elegant and novel approach to combine planning and learning in a single end-to-end neural network system. The authors show that the system can learn to plan and helps to generalize over tasks.

#### Qualitative Assessment

Overall the paper is clearly written and easy to follow, and it explains the proposed concepts clearly. I did feel the paper could use more detail at some points (possibly at the cost of moving other parts to the supplementary material). Especially during the explanation of the different components of the VIN block, a concrete example would be helpful (this is provided in the experiments but comes a bit late in my opinion).

One limitation of the proposed approach seems to be the amount of background knowledge required to design a suitable VIN representation, i.e. state space, goal state, suitable f_R and f_P forms, suitable attention model, and planning horizon. This goes well beyond the information assumed to be available in typical RL/IL settings and would likely be sufficient to design an approximate model that can be solved using traditional planning (though I feel this does not detract from the contribution). Another possible limitation is that the use of a CNN representation to perform planning imposes certain assumptions on the MDP structure, mainly that transitions are local, sparse, and state independent. While the authors do demonstrate in the WebNav experiment that potential applications go beyond 2D navigation tasks, the class of possible problems still seems rather restricted.

The approach is evaluated on a number of different problem domains. I was somewhat disappointed that the experiments give only raw performance scores and provide little insight into the workings of the algorithm or into what is learned by the different components. This is alleviated by additional results in the supplementary material, but I would have preferred to see some of these results in the main paper (e.g. the untying of the weights). Additionally, it would be interesting to see what f_R functions are learned, or how useful the planning values are on their own, i.e. how close to an optimal/successful policy one gets using just the greedy policy with respect to the values returned by the planning component.

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 4

#### Summary

The paper tries to enhance the performance of reinforcement learning on tasks that require planning by integrating a planning module into the generated policy. To maintain end-to-end differentiability, the planning module is also a CNN, which seems roughly to perform a few iterations of value iteration for planning. It generates attention values to use while learning the reactive policy. A comprehensive evaluation has been done on both discrete (grid-world, WebNav) and continuous (control) domains to evaluate the Value Iteration Network's generalization to unseen instances of the planning problem.

#### Qualitative Assessment

Most of the paper is well written, and an appropriate literature review has been provided. Most of the details required for reproducibility are given either inline or in the appendices. The experimental evaluation seems fine and covers both discrete and continuous domain MDPs, but the reviewer has the following concerns.

The idea of embedding (1) end-to-end differentiable planners and (2) attention models in the reinforcement learning algorithm itself is creditworthy, but the implementation can definitely be improved and/or justified better. For instance, it was never explained why the functions f_R and f_P in Section 3 are only functions of the observation phi(s). In the reviewer's understanding, both of them could also vary with the action to be taken by the agent and not just the observation. A proper justification for why these functions rely only on phi(s) needs to be provided.

Also, the description of the VI module in Section 3.1 is not clear to the reviewer. In particular, it seems as if the authors want the CNN to mimic several iterations of value iteration, which would have to be of the form V_{n+1}(s) = max_a {R(s,a) + gamma * sum_{s'} P(s'|s,a) * V_n(s')} (see Equation 1 of the paper). But the CNN seems to be performing a weighted summation over the reward map rather than over V-values, which is not the same as value iteration. Moreover, are the learnt weights W (in Section 3.1) the same as the state transition probabilities \bar{P}, which are needed for value iteration? This subsection needs a clearer presentation of the VI network, without which the paper's key idea is not comprehensible.

One last thing required is to thoroughly demonstrate that the VI network is doing real planning. Though most of the authors' experiments show better results than just using a simple CNN, nowhere has it been concretely demonstrated that the advantages gained actually come from the VI network learning to plan. It seems to be a conjecture that the VI network is "learning to plan". The reviewer is not fully convinced that the better results necessarily come from planning, rather than from the learning algorithm using the VIN as just another very deep stack of tunable CNN layers. More visualization of the VIN's output is required to make a convincing argument that the VIN is indeed learning to plan.

EDIT AFTER REBUTTAL: Thanks to the authors for addressing my queries regarding the VIN description, more visualization, and the scalability of the method in the rebuttal. Since they provided appropriate clarifications, I am improving my ratings for the paper in Technical Quality, Novelty, and Potential Impact.
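For concreteness, here is a minimal tabular sketch of the update the reviewer is referring to (the form in Equation 1 of the paper); the function name and tensor layout are illustrative, not taken from the paper's code:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iters=50):
    """Tabular VI: V_{n+1}(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V_n(s') ].

    P: transition tensor of shape (A, S, S); R: reward matrix of shape (S, A).
    """
    V = np.zeros(R.shape[0])
    for _ in range(n_iters):
        # Q(s,a) = R(s,a) + gamma * E_{s'}[V_n(s')]
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V = Q.max(axis=1)  # greedy backup over actions
    return V
```

The reviewer's point is that a VIN layer should be performing this kind of backup over value estimates, whereas a plain convolution over the reward map alone would only propagate rewards, not values.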

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 5

#### Summary

This paper suggests a new kind of policy representation for reinforcement learning with a neural network. Unlike previous works, which construct neural network policies mapping the current observation (or its feature value) of the environment to the action value, this paper suggests including a planning computation in the policy, called a "Value Iteration Network". The paper uses the fact that a one-step Bellman backup can be seen as a computation of a convolutional neural network (with several constraints, though), and by recurrently calculating the values, the VIN achieves planning based on its estimated reward and transition functions. The calculated values pass through an attention module and are used as features of the policy. The paper tests the representational power of this policy network on several domains, using existing RL and IL algorithms to train the VIN policy, and compares the results with previous methods (reactive policies which just use the currently observed features in the policy).

#### Qualitative Assessment

This paper is nice to read and may have good potential for application in other reinforcement learning algorithms. The main contribution of this paper is to view the Bellman equation as a form of CNN, and because CNNs are widely researched in the deep learning field, this may be useful for other reinforcement learning researchers who want to draw on deep learning theory. However, as this is the first paper to make a connection between the Bellman equation and neural networks, there are several limitations which prevent wide usage on general domains. As far as I understand, applying transitions in the Bellman equation is done by a convolution operation (grid world) or masking (WebNav), but it is not clear how this works for large domains which cannot be discretized to a grid, such as the ALE domain in the Deep Q-Network paper (V. Mnih, 2015), or a real robotic environment. It may also be a problem that a person who wants to adapt VINs to their research must specify each ingredient of a VIN design by hand, including the reward, transition, and attention functions. For domains more complicated than grid worlds, it may be difficult to define those functions.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 6

#### Summary

This paper proposes to embed the computation usually carried out in value iteration within the structure of a convolutional neural network (CNN). The computation performed by the Bellman optimality operator is implemented layer-wise in the network; $k$ steps of value iteration then correspond to forward propagation through $k$ layers. To speed up such backups, the expectation over next states is implemented as a convolution and the maximization as a "max-pooling" operation: two components of a standard CNN architecture. The proposed architecture is shown to generalize across tasks in the experiments.
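The conv-plus-max structure described above can be sketched in a few lines of NumPy. This is an illustrative reading of the architecture, not the authors' implementation: local 3x3 transition kernels per action play the role of the convolution filters, and the max over the action channel is the "max-pooling" step.

```python
import numpy as np

def vin_backup(r_map, kernels, gamma=0.99, K=10):
    """K depth-limited Bellman backups on an (H, W) reward map.

    Each action's expectation over next states is a 3x3 convolution
    (local transitions); the max over actions is a channel-wise max.
    kernels: array of shape (A, 3, 3), each acting as a local transition kernel.
    """
    H, W = r_map.shape
    V = np.zeros((H, W))
    for _ in range(K):
        padded = np.pad(V, 1)  # zero-pad so each backup keeps the H x W shape
        # Q[a] = r + gamma * (kernel_a convolved with V) at every grid cell
        Q = np.stack([
            r_map + gamma * sum(
                k[i, j] * padded[i:i + H, j:j + W]
                for i in range(3) for j in range(3))
            for k in kernels
        ])
        V = Q.max(axis=0)  # max over the action channel
    return V
```

With $K$ such layers stacked, forward propagation performs $K$ steps of depth-limited value iteration, which is exactly the correspondence the paper exploits.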

#### Qualitative Assessment

In its current form, the paper considers many ideas simultaneously. It might be a good idea to unpack some of them in the text or as separate publications. I think the paper could easily be dismissed due to its restriction to local transition structure. However, this is only one aspect of a broader set of ideas, namely: 1. The idea that the iterates of value iteration can be embedded in the layers of a neural net. 2. For a subset of MDPs with a particular transition structure, the backups of value iteration can be implemented as a convolution (which is efficient on the GPU). 2.1 Computation of the expectation over next states can be further sped up by learning an "attention mechanism", that is, a restriction on the set of next states to consider. 3. In learning models for planning, "back-propagating" through the return (via $V^\star$) seems to be better than only minimizing the next-state or reward prediction error.

Idea (1) really is the realization that many iterative methods based on contraction properties (and generally studied through the Banach fixed point theorem) could be implemented in the proposed recurrent neural net structure. It would be interesting to know whether contraction can be maintained through the "model updates" of backprop. Can the number of backups be determined dynamically using the usual stopping criteria of DP? It would also be useful to study the proposed approach from an angle akin to IRL. Here, the reward function is specified but the transition dynamics are somehow recovered through a nested optimization procedure: a fixed point problem nested within an L2 loss minimization at the "outer" level. In that sense, it resembles Neu & Szepesvari (2007) or Rust (1987, 1988). In general, I would appreciate it if the proposed ideas could be connected more formally to existing RL or control ideas.

The "train"/"test"/"label" terminology feels odd. However, I can imagine that writing in terms common to both the RL and deep/supervised learning communities could be challenging; I'm not sure there is a satisfactory solution to this problem.

l. 121-124: I found this paragraph confusing. At first glance, it sounded like $f_P$ and $f_R$ are mappings from the original MDP $M$ to the target MDP $\bar M$. On a second pass over the paragraph, it is clearer that the sentence on line 123 simply says that the reward and transition functions are parametrized and depend on the observation $\phi(s)$. It might be useful to show $f_P$ and $f_R$ in Figure 2. Why isn't $f_R$ state *and action* dependent?

l. 224: I was confused here. I had to re-read the paragraph to verify that you were working in the IL setting. In the RL setting, I would have a hard time understanding how you could optimize for the stochastic policy without policy gradient methods. It might be useful to mention in the main text, rather than the appendix, that you use the logistic loss.

l. 252: How do you explain this effect? Is it only a matter of sample size?

l. 268: This is an interesting comparison.

l. 302: The meaning of "transition weights" is unclear. I guess you chose "weights" since the VIN block might not learn probabilities.

l. 292: How was the reward function defined? Zero everywhere, +1 at the goal?

Section 4.4: Since convolutions can be generalized to graphs with the graph Laplacian, it would be interesting to consider a graph-CNN approach in this experiment.
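On the question of determining the number of backups dynamically, the usual DP criterion is a sup-norm test on successive iterates. A hypothetical sketch (names and structure illustrative; the paper itself uses a fixed $K$):

```python
import numpy as np

def vi_until_converged(backup, V0, eps=1e-6, max_iters=10_000):
    """Run Bellman backups until ||V_{n+1} - V_n||_inf < eps.

    If `backup` is a gamma-contraction, this bounds the distance to the
    fixed point: ||V_{n+1} - V*||_inf <= eps * gamma / (1 - gamma).
    `backup` is any function implementing one Bellman backup.
    """
    V = V0
    for n in range(max_iters):
        V_next = backup(V)
        if np.max(np.abs(V_next - V)) < eps:
            return V_next, n + 1  # converged; report iteration count
        V = V_next
    return V, max_iters
```

The open issue the reviewer raises is whether the learned convolutional backup remains a contraction after gradient updates, since the stopping guarantee relies on that property.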

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)