NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID:1826
Title:Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation

Reviewer 1

Summary

This paper considers a hierarchical approach to reinforcement learning, where a top-level controller selects subgoals for a low-level controller. The low-level controller is trained by DQN to select actions that maximise attainment of the specified subgoal. The high-level controller is trained by DQN to select subgoals as actions that maximise true reward; each selected subgoal is then "executed" until completion. Promising results are demonstrated in the challenging Atari game Montezuma's Revenge, using a mostly handcrafted set of subgoals.

Qualitative Assessment

The ideas in this paper are interesting and worth pursuing. It's a very clear and sensible example of combining hierarchy with deep RL, a combination of high current interest. The initial experiment on the "six state" MDP is so trivial it is uninteresting. The Montezuma's Revenge example is much nicer, demonstrating impact (albeit with a little bit of handcrafting) on a problem known to be challenging for the current state of the art, and would be worth seeing at NIPS. The paper is technically a little sloppy in places. As a result I have mixed feelings about the paper.

ISSUES

1. I'm afraid that the first experimental example is really quite lame. There's nothing wrong with small examples if that small size is used to provide insight, but comparing learning curves as if it is a meaningful domain does not provide us with very much. Also, it would be clearer to describe this example as an 11-state MDP with bidirectional connections between states 1-5 and 7-11, but only one-way connections 5->6 and 6->7. Dressing it up as an MDP with "history" makes it seem more complex than it is.

2. There are issues with reward accumulation and discounting in the paper. I believe that equation 2 is missing discounts, both on the sum over extrinsic rewards f, which should be discounted, and on the max Q term, which should be discounted by gamma^N, not just gamma. The stored experience for Q2 should be the discounted sum of extrinsic rewards f, not just the instantaneous reward f_t. I'm concerned that the loss function L2 used in the experiments - which is not actually given in the text - is similarly based on incorrect discounting and reward accumulation.

3. The literature review is not very comprehensive and misses many related papers, in particular other RL papers that also have a high-level controller picking subgoals for low-level controllers, e.g. Dayan & Hinton's Feudal RL, Bakker & Schmidhuber's HASSLE, etc. This is not a new idea by any means - the novelty comes rather in the combination with deep learning.

4. The use of the term "intrinsic motivation" is non-standard - typically the usage in this paper would be considered more about subgoal selection than about modifying the reward function used by the agent, although I can understand the authors' perspective on this too.
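To make point 2 concrete, one consistent way to write the meta-controller target and loss would be the following (this is my sketch in the paper's rough notation, not the paper's own equation; N is the number of steps the controller runs before the subgoal terminates, and \theta_2^- denotes a DQN-style target network):

    y_{2,t} = \sum_{k=0}^{N-1} \gamma^{k} f_{t+k} + \gamma^{N} \max_{g'} Q_2(s_{t+N}, g'; \theta_2^-)

    L_2(\theta_2) = \mathbb{E}\big[ ( y_{2,t} - Q_2(s_t, g_t; \theta_2) )^2 \big]

with the stored transition for Q2 being (s_t, g_t, \sum_{k=0}^{N-1} \gamma^{k} f_{t+k}, s_{t+N}) rather than just the instantaneous f_t.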

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 2

Summary

The paper addresses deep hierarchical RL with a two-level approach. Low-level controllers are used to reach goals, and a higher-level controller chooses the current goal. Goal specification is not clearly described. The model is tested first on a toy problem and then on Montezuma's Revenge, a hard Atari game. Results are preliminary, but quite promising.

Qualitative Assessment

The paper addresses the timely question of building a hierarchy of behaviours out of recent deep reinforcement learning methods. This is one of the hot challenges at the moment and several teams are working on it (the arXiv version of the paper is already cited 10 times). The paper proposes a promising contribution to the domain, but my general feeling is that the paper and the corresponding work are not mature enough for publication at NIPS: several points need to be clarified and the empirical evaluation is too preliminary. Globally, the major issue in the field is the automatic extraction of a hierarchy from data. Here, a good deal of the hierarchy-building process is left to the engineer through the identification of a set of goals. In itself, this is not a major flaw of the paper, but the limitations of the work should be discussed in this light.

About the literature review section, two points:

1) "Our approach does not use separate Q-functions for each option, but instead treats the option as part of the input, similar to [18]. This has two advantages: (1) there is shared learning between different options, and (2) the model is potentially scalable to a large number of options." => The statement about these two advantages should be supported either by a reference or by an empirical study in which the authors compare both approaches.

2) In the "Intrinsically motivated RL" and the "Cognitive Science and Neuroscience" subsections, nothing is said to relate the described works to the submitted paper. In particular, for the latter, the content looks quite far away. Also, the "Deep Reinforcement Learning" subsection does not bring much. So my general feeling is that this section should be targeted more at recent deep hierarchical RL work, and should save space for additional studies such as the one described in point 1 or for improving the model description (see below). By the way, given that the arXiv version of the paper was published a while before NIPS, the authors should refresh their literature review with the many recent works in the domain, e.g. model-free episodic control, several works about macro-actions, using A3C instead of DQN (fine if that's left for future work ;)), etc.

About the model description: the model section could be much clearer. Although the right-hand side of Fig. 1 shows it, it is not obvious from the notations in the text that the controller network takes s and g as input and outputs a, and that the meta-controller takes s as input and outputs g. It would be clearer to write Q1(a|s,g) and Q2(g|s), for instance. Also, there should be a module (inside the meta-controller?) checking that a goal is reached; this is not described clearly. The language used to specify goals is not described either (despite the connection to Diuk's work on object-oriented MDPs). The replay buffer initialization is not described. The exploration strategy deserves its own subsection: using a separate epsilon for each goal should be discussed, and the adaptive annealing method for \epsilon_{1,g} is not described, even though it is not standard. It should be specified that a set of goals is given as input to the algorithm, and the method for checking that a goal is reached should appear in the algorithm. This part is poorly specified and looks quite ad hoc (see below about Experiment 2).

About the study and results: in Experiment 1, the structures of the controller and meta-controller are not described (and should be); it is only specified that they are not deep networks. In Experiment 2, the lack of a clear specification of how the goals are managed is a major issue. How many entities do you have in Montezuma's Revenge? If you have two instances of the same entity, do they correspond to two different goals? Is reaching an entity with the agent the only kind of goal? What about avoiding monsters in Pac-Man, for instance? Even if what the authors did is completely ad hoc to a specific game, it should at least be described accurately. The parametrization of the architecture and learning process looks quite arbitrary and should be discussed. Finally, the small paragraph about missing ingredients convinces the reader that this work, although an interesting starting point, is not mature enough for publication in a major conference, particularly one that is biased towards clean, theoretically grounded work.
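For concreteness, here is a minimal sketch of the interface I am asking the authors to make explicit, assuming the two-level loop suggested by Fig. 1: the meta-controller Q2 scores goals given the state, the controller Q1 scores primitive actions given the state and the current goal, and a goal-reached check terminates the option. All names, the per-goal epsilon, and the step budget are illustrative assumptions on my part, not the paper's actual algorithm:

    import random

    def run_episode(env, Q1, Q2, goals, actions, eps1, eps2, goal_reached,
                    max_goal_steps=500):
        # Illustrative two-level loop: the meta-controller Q2(s, g) scores
        # goals given the state; the controller Q1(s, g, a) scores primitive
        # actions given the state and the current goal.
        s = env.reset()
        done = False
        while not done:
            # Meta-controller step: epsilon-greedy goal selection.
            if random.random() < eps2:
                g = random.choice(goals)
            else:
                g = max(goals, key=lambda g_: Q2(s, g_))

            steps = 0
            # Controller steps: per-goal epsilon, run until the goal is judged
            # reached, the episode ends, or a step budget is exhausted.
            while not (done or goal_reached(s, g)) and steps < max_goal_steps:
                if random.random() < eps1[g]:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda a_: Q1(s, g, a_))
                # The extrinsic reward would accumulate for the meta-controller
                # update (not shown here).
                s, extrinsic_reward, done = env.step(a)
                steps += 1

The point of the sketch is simply that the goal-reached module, the per-goal exploration rates, and the termination condition all need to be specified somewhere in the algorithm.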

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

This paper proposes a setup for doing hierarchical deep reinforcement learning when task-specific subgoals are given. The architecture is akin to the one proposed by Lin in 1993, except for the deeper networks. The paper is well-written and the experiments are interesting but rather preliminary.

Qualitative Assessment

(1) The paper is severely handicapped by its lack of historical perspective. In fact, closely related HRL approaches (without the “deep”) have been proposed and investigated for over 20 years, starting with Long-Ji Lin’s pioneering work (1993) as well as variations like W-learning by Humphrys (1995) and HQ-learning by Wiering (1997)...

(2) The issue of option termination/interruption is not explained clearly: it sounds as if an option is executed until the corresponding subgoal is reached -- which may take an extremely long time initially (when the subgoal policy is not competent enough) or even be infinite (e.g. if the key subgoal is invoked but there no longer is a key)?

(3) I find the terminology of “intrinsic” motivation highly misleading -- this is not what is usually meant by this term (e.g. drives, novelty, curiosity, empowerment, etc.). The terms (sub)goals or sub-tasks would be more appropriate for the proposed approach.

(4) The first experiment is mis-specified, because the authors confuse the notion of (Markovian) state and (partial) observation: s2 after having been in s6 is not the same state as s2 without having been in s6. Please clear this up. Also, the results (0.13) look far from the achievable optimal policy for this task?

(5) There is a lot of detail missing for the experiments to be reproducible (generally, the paper feels a bit rushed), e.g. how long is the “first phase” (L228)? How long is an epoch (L187)? What is a “step” (Fig 3, left)? How are the Atari subgoals computed, from pixels or RAM state? Etc.

(6) On one of the main results (Fig 4, bottom right), it is unclear to me how this distribution of goals is “appropriate”. Why would it not invoke the ladder goals more often than the key? Also, why would it initially pick goals uniformly even though the doors are useless as long as the key has not yet been picked up?

(7) To end on a positive note, I’m intrigued by and hopeful for the authors’ proposed exploration heuristic that takes into account the empirical success rate per subgoal -- I’d love to see this properly defined and its effect studied empirically.
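On point (7), one simple form such a heuristic could take - purely an illustrative assumption on my part, since the paper does not define it, which is exactly the complaint - is to anneal each subgoal's exploration rate toward a floor as the controller's empirical success rate for that subgoal rises:

    def adaptive_epsilon(successes, attempts, eps_min=0.1):
        # Hypothetical per-subgoal exploration rate: explore fully while the
        # controller has never reached the subgoal, and anneal toward eps_min
        # as the empirical success rate for that subgoal improves.
        if attempts == 0:
            return 1.0
        success_rate = successes / attempts
        return max(eps_min, 1.0 - success_rate)

Whether the authors do something like this, and how sensitive the results are to it, is exactly what should be defined and studied.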

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 4

Summary

The paper proposes a hierarchical architecture for Deep Q-Learning to aid exploration via intrinsic motivation and also to learn goal-oriented options via temporal abstraction. It includes a meta-controller that selects which goal the agent should aim to achieve and rewards the low-level controller if it manages to achieve it (intrinsic motivation). The low-level controller uses this goal to learn policies over states and goals. The paper evaluates this approach on a stochastic decision process and on a game from the Arcade Learning Environment (Montezuma's Revenge), notorious for not being efficiently explorable using dithering techniques. The authors do not address how the goals chosen by the meta-controller are generated, leaving that for future work.
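For reference, a minimal sketch of the reward split described above, under the assumption that the internal critic gives the controller a binary reward when the chosen goal is reached; the function names and the goal_reached predicate are illustrative, not the authors' code:

    def intrinsic_reward(state, goal, goal_reached):
        # Internal-critic sketch: reward the controller only when the
        # currently selected goal is judged achieved in this state.
        return 1.0 if goal_reached(state, goal) else 0.0

    def controller_target(r_int, next_state, goal, terminal, Q1, actions, gamma=0.99):
        # One-step DQN target for the controller, driven purely by the
        # intrinsic reward; the extrinsic game reward is left to the
        # meta-controller's update.
        if terminal:
            return r_int
        return r_int + gamma * max(Q1(next_state, goal, a) for a in actions)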

Qualitative Assessment

The authors have chosen to tackle a very important problem in Deep RL: how to induce exploration and also how to effectively chunk policies to learn more efficiently. The scope and goals are outlined clearly and the experiments are designed appropriately to showcase this. While the paper does not solve all the possible problems that come up in a standalone system using Hierarchical RL, it is definitely a step forward in building hierarchical models that are scalable. The experiments chosen are mostly sufficient to show that the hierarchical model works for object-based Q-learning; it would be interesting to see its application to some of the problems DQN is good at, to get a more complete understanding of this model's capabilities. For example, comparing on a game like Pong or Breakout that DQN learns trivially would give readers an idea of the speed of learning compared to vanilla DQN. Another point of discussion is what would happen when the agent enters an environment that has new or different objects and relations. With a new space of objects to learn over, do the meta-controller, and therefore the controller, transfer any previous knowledge, or do they need to relearn over the space of the new objects all over again? While outside the scope of the paper, this is a potential drawback that I feel should be discussed. There are many more open questions on how the meta-controller would handle different kinds of objects, and on the issue of how to learn what objects are and how to utilize them to maximize reward. It seems to me a lot of details are yet to be worked out in this work. Considering the author rebuttal and other reviews, I am revising my opinion of this submission. It is an important step towards hierarchical models, but the paper as it stands is not a complete solution, and that takes away from its overall value.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 5

Summary

While the paper overstates some claims and does not reference some of the older literature on the topic, it is mostly well written and offers an interesting perspective, worth publishing, on a hard problem in Reinforcement Learning.

Qualitative Assessment

The idea of employing sub-goal learning for solving RL problems with sparse rewards is a good one, as the results presented in the paper show. Additionally, it is well written and easy to follow. However, some aspects, mostly in presentation, could be improved.

Specific positive aspects:
- I specifically like the small MDP shown in section 3 that illustrates the usefulness of subgoals on a toy MDP. Since this domain is so simple, it allows for a very intuitive and non-technical understanding of the idea presented.

Specific negative aspects, in decreasing relevance:
- Every mention and discussion of 'goals' seems purposefully opaque. Instead of admitting openly that goals could have been manually defined, without any learning (as e.g. for the smaller MDP example, where they are manually defined as "visit a new state if you can"), the authors try to make the goals seem learned by some kind of object-detection approach. Considering that the relational structure they then develop is manually defined as well, not going into the detail of how the objects are determined is superfluous and misleading. Since the hierarchy over goals is still learned, this admission would not diminish the quality of the paper.
- The paper repeatedly claims that this approach is a 'temporal decomposition', which I find misleading. In my understanding, temporal decomposition consists of learning partial policies by analyzing the structure of the (pure) MDP. This approach instead enriches the MDP representation by adding (useful) structure to the reward signal.
- The idea of defining sub-goals in a way very similar to the one introduced here was proposed 24 years ago, but is not discussed. See e.g. Connell and Mahadevan: 'Automatic programming of behavior-based robots using reinforcement learning', 1992, where sub-goals are used to avoid dead-end states.
- Defining sub-goals as done here additionally relates to some forms of learning abstractions for planning, which could have been discussed. See Konidaris, Pack Kaelbling, Lozano-Perez: 'Symbol Acquisition for Probabilistic High-Level Planning', 2015.
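To underline how little machinery such a manually defined goal requires, here is a hypothetical hand-written predicate for the toy MDP, following the "visit a new state if you can" paraphrase above (the name and the episode-level bookkeeping are illustrative assumptions, not the authors' code):

    def novelty_goal_reached(state, visited_states):
        # Hand-coded goal for the toy MDP: the goal is achieved whenever the
        # agent enters a state it has not visited earlier in the episode.
        return state not in visited_states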

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 6

Summary

The paper proposes an approach based on hierarchical reinforcement learning and intrinsically motivated goals to address the known shortcomings of many current RL algorithms when applied in a sparse-reward setting. The approach is based on a two-level hierarchy where a controller learns a policy for achieving a set of goals, and a meta-controller learns to set an appropriate sequence of goals. The algorithm is applied to two sample domains, including the notoriously difficult Montezuma's Revenge game.

Qualitative Assessment

The paper is clear and addresses an important issue: how to learn in sparse-reward settings. One limitation of the paper is that the empirical validation is limited: Montezuma's Revenge is certainly a challenging benchmark, but it remains unclear how well the method generalizes to other domains. Experiments on a wider range of problems would have made the results more compelling. Another limitation is that more details should be given on the experiments (hyperparameters, the custom object detector, etc.). Overall, I believe that the topics addressed are very important for RL and the proposed solution is novel and deserves to be known, so I would recommend acceptance.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)