NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID:632
Title:VIME: Variational Information Maximizing Exploration

Reviewer 1

Summary

The paper presents a new exploration technique for reinforcement learning methods. The approach is based on computing the information gain for the posterior distribution of a learned dynamics model. The dynamics model is represented by a possibly deep neural network. Actions get higher rewards if the posterior distribution over the parameters of the learned dynamics model is likely to change (in a KL sense, which is equivalent to the information gain). As the posterior distribution cannot be represented exactly, the authors use variational Bayes to approximate it by a fully factorized Gaussian distribution. The paper includes efficient update equations for the variational distribution, which have to be computed for each experienced state-action sample. The authors evaluate their exploration strategy with different reinforcement learning algorithms on a couple of challenging continuous control problems.
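For concreteness, a sketch of the quantities described above, written in the notation used throughout the reviews (\xi_t for the history, \theta for the dynamics-model parameters); the paper's own equations may differ in details. The information gain of an action is the expected KL divergence between the updated and the current posterior,

I(s_{t+1}; \Theta \mid \xi_t, a_t) = \mathbb{E}_{s_{t+1}} \left[ D_{\mathrm{KL}}\!\left( p(\theta \mid \xi_t, a_t, s_{t+1}) \,\|\, p(\theta \mid \xi_t) \right) \right],

and for a fully factorized Gaussian approximation q(\theta;\phi) = \prod_i \mathcal{N}(\theta_i; \mu_i, \sigma_i^2) the KL between updated parameters \phi' and old parameters \phi has the closed form

D_{\mathrm{KL}}\!\left( q(\theta;\phi') \,\|\, q(\theta;\phi) \right) = \frac{1}{2} \sum_i \left[ \log \frac{\sigma_i^2}{\sigma_i'^2} + \frac{\sigma_i'^2 + (\mu_i' - \mu_i)^2}{\sigma_i^2} - 1 \right].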

Qualitative Assessment

Positive points:
- I like the idea of using the information gain for exploration. This paper is an important step towards scaling such exploration to complex systems.
- The paper is technically very strong, presenting a clever idea to drive exploration and efficient algorithms to implement this idea.
- Exploration in continuous-action control problems is an unsolved problem and the algorithm seems to be very effective.
- The evaluations are convincing, including several easy but also some challenging control tasks. The exploration strategy is tested with different RL algorithms.

Negative points:
- Not many... maybe clarity could be slightly improved, but in general the paper reads well.

Minor comments:
- Equation 7 would be easier to understand if the authors indicated the dependency of \phi_{t+1} on s_t and a_t. The same applies to the algorithm box.
- Why do the system dynamics in line 3 of the algorithm box depend on the history and not on the state s_t?
- I do not understand how to compute p(s_t|\theta) in Eq. 13. Should it not be p(s_t|\theta, s_{t-1}, a_{t-1})?
- There are typos in the in-text equation before Eq. 17: the last l should be inside the brackets and the first nabla operator is missing the l.
- Even if it is in the supplement, it would be good to briefly describe what models are used for the learned system dynamics.
- Citation [11] is not published in ICML; it is only available on arXiv.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 2

Summary

This paper looks at the notion of curiosity in deep reinforcement learning. It returns to the idea of equating exploratory behavior with maximizing information. Key contributions are the formulation of a variational perspective on information gain, an algorithm based on Blundell et al.'s Bayesian neural networks, and an exposition of the relationship between their approach and Schmidhuber's compression improvement. Results on simple domains are given.

Qualitative Assessment

The paper shows a pleasant breadth of understanding of the literature. It provides a number of insights into curiosity for RL with neural networks. I think it could be improved by focusing on the development of the variational approach and the immediately resulting algorithm. As is, there are a number of asides that detract from the main contribution.

My main concern is that the proposed algorithm seems relatively brittle. In the case of Eqn 17, computing the Hessian might only be a good idea in the diagonal case. Dividing by the median suggests an underlying instability.

Questions:
* The median trick bothers me (a sketch of this normalization step follows below). Suppose the model and KL have converged. Then at best the intrinsic reward is 1 everywhere and this does not change the value function. In the worst case the KL is close to 0 and you end up with high variance in your intrinsic rewards. Why isn't this an issue here?
* Eqn 13: updating the posterior at every step is different from updating the posterior given all data from the prior. Do you think there are issues with the resulting "posterior chaining"?
* How good are the learned transition models?
* Can you explain line 230: "very low eta values should reduce VIME to traditional Gaussian control noise"?
* Why do you propose two intrinsic rewards on line 5 of Algorithm 1? I'd like to see a clear position.

Suggestions:
* Eqn 2: P(. | s_t, a_t)
* Line 116: For another connection, you may want to look at Lopes et al. (2012), "Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress".
* Line 119: You should specify which description length you mean; the statement is possibly imprecise/incorrect as is.
* Line 128: In expectation propagation (which I know from Bishop (2006)) the KLs end up getting reversed, too; is there a relationship?
* Eqn 13: log p(s_t | theta) should be log p(s_t | s_{t-1}, a_{t-1}, theta), no?
* It would be good to give empirical evidence showing why the median is needed.
* The graphs are unreadable.
* It might be good to cite more than just [20] as a reference on Bayesian neural networks. Again, Bishop (2006), Section 5.7, provides a nice list.
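A minimal sketch of the normalization step the median question refers to, assuming the median is taken over KL values from earlier batches; the function name, variable names, and the epsilon floor are illustrative, not taken from the paper.

```python
import numpy as np

def normalized_intrinsic_rewards(kl_values, kl_history, eps=1e-8):
    """Divide each per-step KL by the median of previously observed KLs.

    kl_values: KL divergences for the current batch of transitions.
    kl_history: KL values collected from earlier batches (assumed).
    """
    median_kl = np.median(kl_history) if len(kl_history) > 0 else 1.0
    # Once the dynamics model has converged, median_kl approaches 0 and the
    # normalized rewards can become large and high-variance, which is the
    # failure mode the question above points at.
    return np.asarray(kl_values) / max(median_kl, eps)
```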

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

The paper describes a curiosity-driven RL technique for continuous state and action spaces. It selects actions to maximize the information gain. Since Bayesian learning is intractable most of the time, a variational Bayes approximation is described. The approach is applied to domains where the transition dynamics are represented by Bayesian neural networks.
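For reference, the variational Bayes step described above typically amounts to fitting an approximation q(\theta;\phi) to the intractable posterior by minimizing D_{\mathrm{KL}}(q(\theta;\phi) \,\|\, p(\theta \mid \mathcal{D})), which is equivalent to maximizing the evidence lower bound

\mathcal{L}(\phi) = \mathbb{E}_{q(\theta;\phi)}\left[ \log p(\mathcal{D} \mid \theta) \right] - D_{\mathrm{KL}}\!\left( q(\theta;\phi) \,\|\, p(\theta) \right),

written here for a generic dataset \mathcal{D} of observed transitions; the paper's exact objective may differ in details.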

Qualitative Assessment

The paper is well written. The ideas are clearly described. The approach advances the state of the art in curiosity-driven RL by using Bayesian neural networks and showing how to maximize information gain in that context. This is good work.

I have one high-level comment regarding the reason for focusing on curiosity-driven RL. Why maximize information gain instead of expected rewards? Bayesian RL induces a distribution over rewards. If we maximize expected rewards, exploration naturally occurs in Bayesian RL and the exploration/exploitation tradeoff is optimally balanced. Curiosity-driven RL based on information gain makes sense when there are no rewards. However, if there are rewards, adding an information gain term might yield unnecessary exploration. For instance, suppose that one action yields a reward of exactly 10, while a second action yields uncertain rewards of at most 9. Curiosity-driven RL would explore the second action in order to resolve the uncertainty even though this action will never be optimal. Bayesian RL that optimizes rewards only will explore the environment systematically, but will explore only what is necessary.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 4

Summary

This paper proposes Variational Information Maximizing Exploration (VIME) for RL, which is heavily based on Schmidhuber's work on curiosity-driven learning and his more recent work on utilizing Kolmogorov complexity.

Qualitative Assessment

Experiments: The experiments are nice, but unfortunately not a fair comparison. It would be more useful to compare how your information-theoretic reward fares against a simple maximization of the entropy, which can also function as an exploration term, as in e.g. "Information-Theoretic Neuro-Correlates Boost Evolution of Cognitive Systems" by Jory Schossau, Christoph Adami and Arend Hintze, and "Linear combination of one-step predictive information with an external reward in an episodic policy gradient setting: a critical analysis" by Keyan Zahedi, Georg Martius and Nihat Ay.

Minor remarks:
- Equation 2: This equation results from Eq. (1), in which H(\theta|\xi_t, a_t) is compared with H(\theta|s_{t+1}, \xi_t, a_t). It seems that a_t is missing in p(\theta|\xi_t); if not, please mention why it can be omitted.
- Equation 4: Bayes' rule is formulated as p(a|b) = p(b|a) p(a) / p(b) (which I am sure the authors know). Unfortunately, I cannot see how Eq. 4 fits into this formulation; it seems that there are terms omitted. I would see it if it were, e.g., p(\theta|\xi_t, s_{t+1}) p(s_{t+1}|\xi_t, a_t; \theta); see the rendering below.
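For reference, the standard Bayes-rule form of the posterior update that the Eq. 4 remark appears to expect, written in the notation used above (the paper's own Eq. 4 may of course differ):

p(\theta \mid \xi_t, a_t, s_{t+1}) = \frac{ p(s_{t+1} \mid \xi_t, a_t; \theta) \; p(\theta \mid \xi_t) }{ p(s_{t+1} \mid \xi_t, a_t) },

where a_t can be dropped from the prior term p(\theta \mid \xi_t) if \theta is assumed independent of the chosen action before the outcome s_{t+1} is observed.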

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 5

Summary

This paper proposes a new method to compute a curiosity-based intrinsic reward for reinforcement learning in order to promote exploration. The learning agent maintains an explicit dynamics model, which is used to compute the KL divergence between the approximate distribution over the dynamics with the current parameters and that with the old parameters. The computed KL divergence is used as an intrinsic reward, and the augmented reward values can be used with model-free policy search methods such as TRPO, REINFORCE and ERWR. The authors also show the relation between the proposed idea and compression improvement, which was developed by Schmidhuber and his colleagues. The proposed method is evaluated on several control tasks, some of which have high-dimensional state-action spaces.
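A minimal sketch of the reward augmentation described above; the function name and the value of eta are illustrative, and the actual scheme in the paper additionally normalizes the KL term (see the comments on the median of previous KL divergences below).

```python
def augment_rewards(external_rewards, kl_divergences, eta=0.1):
    """Add the KL-based intrinsic bonus to the external task reward,
    r'_t = r_t + eta * D_KL(updated posterior || old posterior).

    The augmented rewards can then be handed unchanged to a model-free
    policy search method such as TRPO, REINFORCE or ERWR.
    """
    return [r + eta * kl for r, kl in zip(external_rewards, kl_divergences)]
```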

Qualitative Assessment

The agent's dynamics model is estimated explicitly by Bayesian neural networks (BNNs) and used to compute the intrinsic reward. My major concern is that the estimated model is not used to find an optimal policy; in other words, model-free reinforcement learning methods such as TRPO, REINFORCE and ERWR are used. Since the model is estimated, model-based reinforcement learning seems more appropriate for this setting. Please discuss this point in more detail.

The second concern is the performance on the cart-pole swing-up task. Figure 2 shows that REINFORCE and ERWR obtained sub-optimal policies as compared with the result of TRPO. In this task, there was no significant difference between the reinforcement learning agents with and without VIME exploration. Please discuss in more detail why the proposed method did not improve the learning performance in the cart-pole swing-up task. Is it related to the properties of the dynamics itself?

Lines 170-174 claim that the KL divergence is divided by the median of previous KL divergences. What happens if you directly use the original KL divergence as the intrinsic reward? Does it make the learning process unstable? In addition, how many previous trajectories are needed?

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 6

Summary

The authors introduce a method, VIME, for trying to optimally trade off exploration and exploitation in reinforcement learning tasks. The idea is to add an auxiliary cost that maximizes the information gain about the model of the environment, encouraging exploration into regions with larger uncertainty. This requires a model that has a notion of uncertainty, for which the authors use a Bayesian neural network in the style of Blundell et al. Experiments demonstrate the utility of the method, especially in instances which require some amount of exploration before rewards are obtained.
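A minimal sketch of the kind of factorized-Gaussian weight layer referred to above, in the spirit of Blundell et al.'s Bayes by Backprop; the class and variable names, initialization values, and plain-NumPy setting are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class BayesianLinear:
    """A single linear layer with a fully factorized Gaussian over its weights.

    sigma is parameterized as softplus(rho) to keep it positive, as in
    Blundell et al.; everything else here is illustrative.
    """

    def __init__(self, n_in, n_out):
        self.mu = rng.normal(0.0, 0.1, size=(n_in, n_out))
        self.rho = np.full((n_in, n_out), -3.0)  # sigma = softplus(rho)

    def sample_weights(self):
        sigma = np.log1p(np.exp(self.rho))       # softplus keeps sigma > 0
        eps = rng.standard_normal(self.mu.shape)
        return self.mu + sigma * eps             # reparameterization trick

    def forward(self, x):
        # Each forward pass draws a fresh weight sample, so repeated calls
        # give a distribution over outputs, i.e. a notion of model uncertainty.
        return x @ self.sample_weights()
```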

Qualitative Assessment

I enjoyed the paper. Demonstrations and comparisons seem great, and the improvements are hard to ignore. The method is novel and appears to be generally useful. The figure legends are difficult to read and should be made larger, especially in Figures 1 and 2.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)