NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1122
Title: Mapping State Space using Landmarks for Universal Goal Reaching

Reviewer 1

The paper presents a semi-parametric model for long-term planning in a general space of problems. It works by training parametric goal-conditioned policies that are accurate only on local distances (i.e., when the current state and goal are within some distance threshold) and leveraging the replay buffer to non-parametrically sample a graph of landmarks between which the local goal-conditioned policy can accurately produce paths. Moving to any goal state is then accomplished by (1) moving to the closest landmark using the goal-conditioned policy, (2) planning a path to the landmark closest to the goal using value iteration on the (low-dimensional) graph of landmarks, and (3) using the goal-conditioned policy to reach the goal state from the closest landmark. The paper essentially tackles the problem that goal-conditioned policies, or Universal Value Function Approximators (UVFA), degrade substantially in performance as the planning horizon increases. By leveraging the replay buffer to provide way-points for the algorithm to plan locally along, accuracy over longer ranges is maintained. The method is close to a heuristic in the sense that there is no learning in this higher-level graph structure; it largely relies on agent-designer prior knowledge to get UVFAs working at longer ranges. Despite this, I still think the method can be impactful as a first step toward solving this challenging long-range planning problem. The evaluation settings demonstrate that the method is viable and improves upon baselines for a variety of continuous control experiments. Unfortunately, most of the improvements are in navigation domains, and in this case I believe a very similar method for navigation has already been done (Semi-parametric Topological Memory [1], which I believe should be cited given its close similarity to this method). Therefore a more informative evaluation setting would be in tasks further from navigation and closer to general MDPs, which is where I think this paper's contribution lies.
For example, seeing this implemented on Atari games would be a far more interesting result. It would also be great to see this method used to drive frontier exploration (doing random exploration near states which have been visited the least number of times). In conclusion, I believe the paper's idea of leveraging the replay buffer to build a set of non-parametric way-points is an impactful contribution. Unfortunately, a relatively uninformative evaluation setting with too much focus on navigation (bringing it a bit too close to previous work [1]) makes me suggest a score of 6.

Some other points:
- Figure text is extremely small and difficult to read.
- It is hard to contextualize Figure 4(b) without knowing the lower bound on average steps to goal. The graph would be easier to read with the lower bound added as a dotted line.
- How are the curves in Figure 3 generated? Are these the same HER model, evaluated once with planning and once with the HER policy directly? Or does the red curve also use the planner at training time?
- Also, which replay buffer do you use at each point along the curves? The one encompassing all states up to that step, or the replay buffer at the end of training?
- I believe the word "unanimous" on line 22 means things are in agreement, which doesn't seem to fit when describing tasks.

[1] Savinov, Nikolay, Alexey Dosovitskiy, and Vladlen Koltun. "Semi-parametric topological memory for navigation." arXiv preprint arXiv:1803.00653 (2018).
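The three-step procedure summarized in this review (local policy to a landmark, value iteration over the landmark graph, local policy to the goal) can be sketched on a toy graph. This is a minimal illustration and not the authors' implementation: the edge costs below are hand-made stand-ins for UVFA distance estimates, and all function names (`value_iteration`, `plan_through_landmarks`, etc.) are mine.

```python
import math

def value_iteration(landmarks, edges, goal_lm, iters=50):
    """Cost-to-go toward goal_lm over the landmark graph (step 2 of the pipeline).
    edges maps an undirected pair (u, v) to a traversal cost."""
    V = {lm: math.inf for lm in landmarks}
    V[goal_lm] = 0.0
    for _ in range(iters):
        for (u, v), c in edges.items():
            V[u] = min(V[u], c + V[v])
            V[v] = min(V[v], c + V[u])
    return V

def neighbors(u, edges):
    """Neighbors of u with their edge costs, treating edges as undirected."""
    out = [(v, c) for (a, v), c in edges.items() if a == u]
    out += [(a, c) for (a, v), c in edges.items() if v == u]
    return out

def plan_through_landmarks(start_lm, goal_lm, landmarks, edges):
    """Greedy descent on the cost-to-go yields the way-point sequence;
    a local goal-conditioned policy would then traverse each edge."""
    V = value_iteration(landmarks, edges, goal_lm)
    path, cur = [start_lm], start_lm
    while cur != goal_lm:
        cur = min(neighbors(cur, edges), key=lambda nc: nc[1] + V[nc[0]])[0]
        path.append(cur)
    return path

# Toy graph: the direct A-D edge is expensive, so planning routes via B and C.
landmarks = ["A", "B", "C", "D"]
edges = {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "D"): 1.0, ("A", "D"): 5.0}
print(plan_through_landmarks("A", "D", landmarks, edges))  # ['A', 'B', 'C', 'D']
```

In the paper's setting the edge costs would come from the learned value function rather than a hand-built table, but the planning step itself is this kind of shortest-path computation over a small graph.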

Reviewer 2

From what I understand, the goal g is typically a projection of the agent's state s. The agent could be in two different states s_1 and s_2 that map to the same goal g, yet s_1 and s_2 might not be easily reachable from one another. For example, g could be a location on a manifold (e.g., a city map), and s_1 and s_2 could be two different states (same position but opposite orientations) of a mobile agent (e.g., a car). Going from s_1 to s_2 is not trivial, as it requires changing the orientation of the car (it might imply driving around a block). If the information used for planning corresponds only to pairs (s, g), wouldn't the difference between states at goal locations be lost? In that case, can we still write that the cumulative reward for a complex task d(s, g) is the sum of d(l_k, l_{k+1}) over consecutive landmarks? Or does it simply mean that g should not be only a position but also an orientation, or alternatively, that the planner should work on pairs of states rather than pairs of state-goals?

Minor comments:
- In section 5.1, explain why the cardinality of the state-goal pairs is O(R^{2d}).
- The color code in Figure 4 could be consistent between (a) and (b)/(c).

Reviewer 3

This paper proposes to adopt a state abstraction method to improve the sample efficiency of learning a universal value function approximator (UVFA) for long-range goals. The key idea is to build a high-level graph of the given MDP by sampling landmark states uniformly from the state space and estimating the connectivity (edges) between them. At the high level, transitions between landmark states can be handled by planning on the graph, while local transitions (i.e., from the current state to a nearby landmark, or between neighboring landmarks) are handled by the UVFA. The proposed method has been evaluated on MuJoCo domains, including navigation tasks, and significantly outperformed the baseline method, DDPG+HER. Overall, the paper is easy to follow. However, the main limitation of this paper is the lack of contribution. The proposed method can be seen as a relatively straightforward combination of existing works: UVFA + HER + state abstraction and planning [1], which are missing in the related work section. The idea of building a high-level graph from sampled landmark states and performing planning (i.e., value iteration) on the graph to find the shortest path to the goal has already been proposed in [1]. The salient differences from existing work are the landmark sampling and the use of the value function as the distance metric between landmarks, which are relatively minor technical details. The following related works could be included in the paper:

[1] Zhang, Amy, et al. "Composable Planning with Attributes." International Conference on Machine Learning. 2018.
[2] Folqué, David, et al. "Planning with Arithmetic and Geometric Attributes." arXiv 2018.
[3] Liu, Evan Zheran, et al. "Learning Abstract Models for Long-Horizon Exploration." 2018.

Regarding the landmark sampling step, the rebuttal addressed this issue quite well. I suggest the authors make this point clearer in the paper.
I'm increasing my score to 5. Lastly, the experiment section is relatively light. The authors might want to include other baselines such as HIRO [28] (which is already cited in the paper).

Minor comments:
- The text in the graphs is too small when printed out.
- In section 6.2.1, “planning” paragraph, what is the actor network in DQN?
- In section 6.3, a line break is missing before the “Choice of clip range and landmarks” paragraph.
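One concrete reading of "using the value function as the distance metric," which both this review and the clip-range comment touch on: under the common goal-reaching convention of a -1 per-step reward, the UVFA value directly estimates negative distance, and a clip threshold drops edges the local policy cannot be trusted to traverse. The sketch below is hedged and illustrative; the function name `edge_weight` and the threshold value are mine, not the paper's.

```python
import math

def edge_weight(value, clip=10.0):
    """Distance estimate derived from a goal-conditioned value, assuming a
    -1-per-step undiscounted reward so that d(s, g) ~= -V(s, g). Edges whose
    estimated distance exceeds `clip` are dropped (returned as infinite),
    since the local UVFA policy is only reliable at short range."""
    d = -value
    return d if d <= clip else math.inf

print(edge_weight(-3.0))   # 3.0 -> a usable, short edge
print(edge_weight(-25.0))  # inf -> edge pruned from the landmark graph
```

This makes the "clip range" a graph-sparsification knob: a smaller clip keeps only edges the local policy handles well, at the cost of a denser landmark set being needed for connectivity.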