NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1125 Value Iteration Networks

### Reviewer 1

#### Summary

The paper presents a view of value iteration that links it to a convolutional layer in a neural network. This observation is exploited in a NN architecture that has multiple layers that effectively perform depth-limited VI, the parameters of which specify the MDP on which the planning will occur. In this way, the network learns to construct MDPs for short-term planning based on the current state of the real system. Experiments demonstrate that this approach allows the agent to generalize more effectively over many instances of navigation problems.

#### Qualitative Assessment

***After Author Response*** I was literally just complaining about reviewers that say "Add the kitchen sink" without regard for the research cost or for the impact on clarity in the paper, and here I am being that reviewer! The authors are right to call me on that. I agree that the paper has plenty going on as it is (and said as much in my original review). At the time I was erroneously thinking one could simply take the same architecture as the VIN, but train the model using prediction error rather than the policy gradient. I see now that I hadn't thought it all the way through -- the VIN model does not have the same state space as the world, so it's not clear what the error signal would even be. Rather than suggesting an additional impractical empirical comparison, I should have instead suggested that the paper back off a bit from its dismissal of model-based RL. In general, I agree that MBRL is a challenge, but it may be that the domains that are ideal for VIN are also viable for MBRL. I think that's an open question, and could be presented as such. I have to admit that I did not follow the "constant domain observation" explanation. I might have missed something. In any case, I think runtime bears at least a little bit of discussion. If there's really no room for it, maybe it can go in the appendix. I'm genuinely surprised that VIN helps early in learning, but that's really cool, and worth reporting. ---Technical Quality--- The claims of the paper seemed to be technically sound and the experiments were well-designed. I will say that I am not sure I am totally on board with the phrase "learns to plan." I think one could argue that the actual planning process is still fixed (it is the VI-like process, defined by the K convolution+max layers). What the network learns are model parameters for which the planning process yields useful results. "Learns to plan" almost implies that it has to learn about value iteration itself, which is not the case. There was also a claim that I felt was a little under-supported. The paper waves off model-based learning because it can be hard to learn accurate dynamics. I think this is fair to say. *However*, in the particular domains the paper considers, I'm not convinced that a model would be harder to learn than the VIN. The navigation tasks all have the same flavor where the local dynamics are very simple. Presumably this is why the VIN gets any traction at all. I would imagine that a conv-net would be ideally suited to learning in these problems. I recognize that space is limited and I think there is plenty in this paper to chew on. Still, I did wish that this approach had been compared to the clear alternative approach of simply training a convolutional dynamics model and performing value iteration on that to obtain a policy. This would answer the question of whether a VIN is actually easier to learn than a dynamics model. If this approach could be demonstrated to perform robustly in domains where the more traditional model-based approach fails, that would significantly strengthen this case. I have one other small technical quibble, and it only seems to be directly relevant to the supplementary material. When describing the grid world experiment, the text seems to suggest that the VIN statespace \bar{S} and the real statespace S are similar. The implication seems to be that S is a 2D grid (i.e. only the agent's position). Actually, of course, a state in S contains *both* the agent's position *and* the arrangement of the obstacles. \bar{S} is actually far simpler than S, which is important for the success of the VIN, I suspect. ---Novelty--- I found the idea of the VIN to be highly novel and interesting. The idea that a planner might generate a temporary, locally useful model is not in itself new, but I don't believe I have seen an approach that learns *how* to construct such a model from experience. I also found the link between convolution and value iteration on a locally connected MDP to be a fresh view of familiar constructs. The paper did connect to some earlier work that the authors might consider mentioning. Both pieces of work I'm thinking of use policy gradient to directly adapt model parameters rather than trying to learn the most accurate model. Sorg et al. (NIPS 2010) does this with the reward model and Joseph et al. (ICRA 2013) does it with the dynamics model. There's been follow-up at least to the Sorg work (Guo et al., IJCAI 2016). Like this work, they treat the model-based architecture as just a complicated parameterization of the policy and adapt the whole thing with model-free learning. This work is definitely distinct in several ways, but I think there is a kinship with these existing ideas. ---Impact--- I think this paper has substantial potential for impact. I can easily see this particular algorithm being further explored/applied in other domains. I also believe that the core idea of *learning* to map complicated observations into simple models for the sake of short-horizon planning is potentially very insightful, and one that seems likely to inspire follow-up work. I will say that I wish the paper had a little more room to discuss the weaknesses/challenges of this approach. For instance... - The paper does not discuss runtime, but I assume that the VIN module adds a *lot* of computational expense. - Though f_R and f_P can be adapted over time, the experiments performed here did incorporate a great deal of domain knowledge into their structure. A less informed f_R/f_P might require an impractical amount of data to learn. - The results are only reported after a bunch of training has occurred, but in RL we are often also interested in how the agent behaves *while* learning. I presume that early in training the model parameters are essentially garbage and the planning component of the network might actually *hurt* more than it helps. This is pure speculation, but I wonder if the CNN is able to perform reasonably well with less data. - I wonder whether more could be said about when this approach is likely to be most effective. The navigation domains all have a similar property where the *dynamics* follow relatively simple, locally comprehensible rules, and the state is only complicated due to the combinatorial number of arrangements of those local dynamics. WebNav is less clear, but then the benefit of this approach is also more modest. In what kinds of problems would this approach be inappropriate to apply? ---Clarity--- I found the paper to be clear and highly readable. I thought it did a good job of motivating the approach and also clearly explaining the work at both a high level and a technical level. I thought the results presented in the main text were sufficient to make the paper's case, and the additional details and results presented in the supplementary materials were a good compliment. This is a small point, but as a reader I personally don't like the supplementary appendix to be an entire long version of the paper; it makes it harder to simply flip to the information I want to look up. I would suggest simply taking the appendices from that document and putting them up on their own. ---Summary of Review--- I think this paper presents a clever, thought-provoking idea that has the potential for practical impact. I think it would be of significant interest to a substantial portion of the NIPS audience and I recommend that it be accepted.

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 2

#### Summary

It is well known that a program can be emulated by a neural net. The paper proposes a neural architecture achieving reinforcement learning, based on two interacting neural nets. The first one, akin DQN, achieves action selection; the difference is that the value is provided by the second network, emulating a (discretized) approximation of the target MDP, and achieving value iteration. The paper is very interesting and I incline to accept it; all the more so as the authors plan to make the code publicly available. Here are some points where I did not fully understand the approach and more details and/or complementary experiments are needed: Firstly, it is mandatory to understand the limitations of the approach. Are they related to the size of the domain ? Table 1 does a good job in showing the graceful degradation as the gridworld size increases. Are they related to the architecture of the auxiliary MDP ? Complementary experiments in 4.3 (continuous control) could be used to study the sensitivity wrt the size. Here, two different discretizations can be considered; a finer discretization requires a larger attention zone; what happens if the attention zone is too small; etc. Btw, I did not follow "the agent has to control the particle". Secondly, it might be interesting to see what is learned by the NN. Look at "Graying the black box", in ICML 16, which visualize the states in the Playing Atari approach. In the gridworld, can you show the f_R landscape ? Thirdly, the initialization is not entirely clear; starting from expert trajectories is one thing; what if they are inaccurate ? What happens if you start from random trajectories: in this case, would you need as many trajectories as for the reactive Playing Atari approach ?

#### Qualitative Assessment

See summary. I particularly appreciated the diversity of the experiments.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

The paper proposes a formulation of the value iteration algorithm as a differentiable neural network component. It provides an elegant and novel approach to combine planning and learning in a single end-to-end neural network system. The authors show that the system can learn to plan and helps to generalize over tasks.

#### Qualitative Assessment

Overall the paper the paper is clearly written and easy to follow. It explains the proposed concepts clearly. I did feel the paper could use more detail at some points (possibly at the cost of moving other parts to the supplementary material). Especially during the explanation of the different components of the VIN block a concrete example would be helpful (this is provided in the experiments but comes a bit late in my opinion). One limitation of the proposed approach seems to be the amount of background knowledge required to design a suitable VIN representation, i.e. state space, goal state, suitable f_R and f_P forms, suitable attention model, planning horizon. This goes well beyond the information assumed to be available in typical RL/IL settings and would likely be sufficient to design an approximate model that can be solved using traditional planning (though I feel this does not detract from the contribution). Another possible limitation is that the use of a CNN representation to perform planning imposes certain assumptions on the MDP structure, mainly that transitions are local, sparse and state independent. While the authors do demonstrate in the WebNav experiment that potential applications go beyond 2D navigation tasks, the class of possible problem still seems rather restricted. The approach is evaluated using a number of different problem domains. I was somewhat disappointed that the experiments give only raw performance scores and provide little insight into the working of the algorithm or as to what is learnt by the different components. This is alleviated by additional results in the supplementary material, but I would have preferred to see some of these results in the main paper ( e.g. the untying of the weights). Additionally, it would be interesting to see what f_R functions are learned, or how useful the planning values are on their own, i.e. how close to an optimal/successful policy is obtained using just the greedy policy wrt the values returned by the planning component.

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 4

#### Summary

The paper tries to enhance performance of Reinforcement learning on tasks that require planning by attempting to integrate a planning module in the generated policy. To maintain end-to-end differentiability, the planning module is also a CNN which seems to be roughly performing a few iterations of Value Iteration for planning. It generates attention values to use while learning the reactive policy. A comprehensive evaluation has been done on both discrete (Grid-world, WebNav) and continuous (control) domains to evaluate the Value Iteration network generalization to unseen instances of the planning problem.

#### Qualitative Assessment

Most of the paper is well-written and an appropriate literature review has been provided. Most of the details required for reproducibility have been provided either inline or in the appendices. The experimental evaluation seems to be fine too and covers both discrete and continuous domain MDPs, but the reviewer has the following concerns: The idea of embedding (1) end-to-end differentiable planners and (2) attention models in the reinforcement learning algorithm itself is creditworthy, but the implementation can definitely be improved and/or justified better. For instance, it was never explained why the functions f_R and f_P in section 3 are only functions of the observation phi(s). In the reviewer's understanding, both of them could also vary with the action to be taken by the agent and not just the observation. A proper justification for why these functions only rely on phi(s) needs to be provided. Also, the description of the VI module in section 3.1 is not clear to the reviewer. In particular, it seems as if the authors want the CNN to mimic several iterations of value iteration, which would have to be of the form V_{n+1}(s) = max_a {R(s,a) + gamma * sum_{s'} P(s'|s,a)*V_n(s')} (see equation 1 of paper). But the CNN seems to be performing a weighted summation over the reward map rather than V-values. That is not the same as value iteration. Moreover, are the learnt weights W (in section 3.1), the same as state transition probabilities \bar{P}, which are needed for value iteration? This subsection needs a clearer presentation of the VI network, without which the paper's key idea is not comprehensible. One last thing required is to thoroughly prove that the VI network is doing some real planning. Though most of the authors' experiments are showing better results than just using a simple CNN, nowhere has it been concretely demonstrated that the advantages gained are actually due to the VI network learning to plan. It seems to be a conjencture that the VI network is "learning to plan". The reviewer is not fully convinced that the better results are necessarily coming from planning, and not from the learning algorithm using the VIN as just another set of very deeply layered tunable CNN. More visualization of the VIN's output is required to make a convincing argument that the VIN is indeed learning to plan. EDIT AFTER REBUTTAL: Thanks to the authors for addressing my queries regarding VIN description, more visualization and the scalability of the method in the rebuttal. Since they provided appropriate clarifications, I am improving my ratings for the paper in Technical Quality, Novelty and Potential Impact.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 5

#### Summary

This paper suggests a new kind of representation method of policies in reinforcement learning with a neural network. Unlike the previous works which constructs neural network policies mapping the current observation (or its feature value) of the environment to the action value, this paper suggests to include a planning computation to the policy learning, called “Value Iteration Network”. The paper uses the fact that one step Bellman equation can be seen as a computation of convolutional neural network (with several constraints, though) and by recurrently calculating the values, VIN archives a planning based on its estimated reward and transition. The calculated values pass some attention module and be used as features of the policy. The paper tests the representation power of this policy network on several domains, by using existing RL and IL algorithm to train the VIN policy, and compare the results with the previous methods (reactive policies which just uses the current observed features on the policy).

#### Qualitative Assessment

This paper is nice to read, and may have a good potential to apply other algorithms of reinforcement learning. The main contribution of this paper is to view the Bellman equation as a form of CNN, and because CNN is widely researched on deep learning field, it may be useful for other reinforcement learning researchers who wants to adapt deep learning theories. However, as this is the first paper to make a connection between Bellman equation and the neural networks, there are several limitations which prevents the wide usage on general domains. As far as my understanding, applying transitions on the Bellman equation is done by convolution operation (grid world), or masking (WebNav), but it is not sure for large domains which cannot be discretized to the grid, such as ALE domain in Deep Q Network paper (V. Minh, 2015), or a real robotic environment. Also it may be a problem that a person who wants to adapt VIN to their research must specify each ingredients of a VIN design, including reward, transition and attention function by hand. It is possible that domain which is more complicate than grid worlds may difficult to define those functions.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 6

#### Summary

This paper proposes to embed the computation usually carried in value iteration within the the structure of a convolutional neural network (CNN). The computation performed by the Bellman optimality operator is implemented layer-wise in the network. $k$-steps of value iteration then correspond to the forward propagation through $k$ layers in the network. To speed up computation of such backups, the expectation over next states is implement as a convolution and the maximization as a "max-pooling" operation: two components of a standard CNN architecture. The proposed architecture is shown to generalize across tasks in the experiments.

#### Qualitative Assessment

In its current form, the paper considers many ideas simultaneously. It might be a good idea to unpack some of them in the text or as separate publications. I think that the paper could easily be dismissed due to its restriction on the local transition structure. However, this is only one aspect of a broader set of ideas, namely: 1. The idea that the iterates of value iteration can be embedded in the layers of a neural net. 2. For a subset of MDPs with a particular transition structure, the backups of value iteration can be implemented as a convolution (which is efficient on the GPU). 2.1 Computation of the expectation over the next states can be further sped up by learning an "attention mechanism", that is, a restriction on the set of next states to consider. 3. In learning models for planning, "back-propagating" through the return (via $V^\star$) seems to be better than only minimizing for the next state or reward prediction error. Idea (1) really is the realization that many iterative methods based on contraction properties (and generally studied through Banach fixed point theorem) could be implemented in the proposed recurrent neural net structure. It would also be interesting to know if contraction can be maintained through the "model updates" of backprop. Can the number of backups be determined dynamically using the usual stopping criteria of DP ? It would also be useful to study the proposed approach from an angle akin to IRL. Here, the reward function is specified but the transitions dynamics are somehow recovered through a nested optimization procedure : a fixed point problem nested within an outer L2 loss minimization at the "outer" level. In that sense, it resembles Neu & Szepesvari (2007) or Rust (1987, 1988). In general, I would appreciate if the proposed ideas could be connected more formally to existing RL or control ideas. The "train"/"test"/"label" terminology feels odd. However, I can imagine that writing in common terms for both the RL and deep/supervised learning could be challenging. I'm not sure that there is a satisfactory solution to this problem. l. 121-124: I found this paragraph confusing. At first glance, it sounded like $f_P$ and $f_R$ are mappings from the original MDP $M$ to the target MDP $\bar M$. In a second pass over the paragraph, it is clearer that the sentence on line 123 simply says that the reward and transition functions are parametrized and depend on the observation $\phi(s)$. It might be useful to show $f_p$ and $f_r$ in figure 2. Why isn't $f_R$ state and *action* dependent ? l. 224: I was confused here. I had to re-read the paragraph to verify that you were working under the IL setting. Under the RL setting, I would have a hard time understanding how you could optimize for the stochastic policy without policy gradient methods. It might be useful to mention that you use the logistic loss in the main text rather than the appendix. l. 252: How do you explain this effect ? Is is only a matter of sample sizes ? l. 268: This is an interesting comparison. l. 302: The meaning of "transition weights" is unclear. I guess that you chose to use "weight" since the VIN block might not learn probabilities. l. 292: How was the reward function defined ? Zero everywhere, +1 at the goal ? Section 4.4 : Since convolutions can be generalized to graphs with the graph Laplacian, it would be interesting to consider a graph-CNN approach in this experiment.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)