NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 587
Title: Safe Model-based Reinforcement Learning with Stability Guarantees

Reviewer 1

My understanding of the paper: This paper describes a novel algorithm for safe model-based control of an unknown system. This is an important problem space, and I am happy to see new contributions. The proposed approach uses a learnt model of the system and constrains the policy to avoid actions that could bring the system to unsafe states. Both policy actions and exploratory actions are subject to this safety constraint. The algorithm rests on reasonable assumptions of Lipschitz continuity of the system dynamics, as well as the existence of a Lyapunov function that quantifies the risk-related cost of a state. One 'trick' the paper uses is to conflate the Lyapunov 'risk' function with the reward function, thus assuming that cost and 'risk' are antipodal (see the illustrative sketch at the end of my review for how I picture the resulting constraint). To obtain a tractable policy a discretization step is necessary, but the experimental section at the end of the article shows that the final algorithm is indeed able to perform in a toy environment.

My comments: I find the chosen hypotheses slightly restrictive relative to many real-world scenarios, but acceptable for a more theoretical paper. More specifically, in the case of the Lyapunov function, many real-world systems have discrete safety functions over their state space ("everything is fine as long as you don't push /this/ button"). Additionally, as I stated above, if I understand correctly, the cost function and the Lyapunov function are the same, which makes scenarios where high-reward states lie very close to high-risk states difficult to describe in this framework. I would be curious to hear your response on this point in particular.

Other comments/questions:
- Can you detail what the discretization step in your theoretical analysis does to the guarantees?
- What is your mixture between explorative and exploitative actions? Algorithm 1 seems to only ever choose explorative actions.
- What is an example of a real-world system you believe this would be useful on? Especially because safety is a concern brought about mainly by applications of RL, it would have been interesting to see your algorithm applied to a more concrete scenario.

Overall opinion: I think the analysis and the algorithm are interesting, and as I stated previously, I am very happy to see safety-related literature. I do not believe this algorithm is all that useful in application exactly as presented, but it provides a good framework for thinking about safety and a nice first attempt at solving the problem in a less ad hoc manner.
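To illustrate how I picture the safety constraint described above: the sketch below is purely my own illustration (the quadratic Lyapunov function, the confidence multiplier beta, and all function names are my assumptions, not the authors' code), showing a minimal action filter under Gaussian-process confidence bounds.

    import numpy as np

    def lyapunov(x):
        # Illustrative quadratic Lyapunov/"risk" function (x is an np.ndarray),
        # standing in for the paper's v(x); in my reading it also plays the
        # role of the negated reward.
        return float(x @ x)

    def is_action_safe(x, u, gp_mean, gp_std, beta=2.0, margin=0.0):
        # Accept action u in state x only if the Lyapunov value is certified
        # to decrease for every next state inside the GP confidence box.
        # gp_mean(x, u) -> predicted next state; gp_std(x, u) -> per-dim std.
        mu, sigma = gp_mean(x, u), gp_std(x, u)
        # Corner of the confidence box that maximises the quadratic Lyapunov value.
        worst_next = np.where(mu >= 0, mu + beta * sigma, mu - beta * sigma)
        return lyapunov(worst_next) <= lyapunov(x) - margin

    def filter_actions(x, candidate_actions, gp_mean, gp_std):
        # Keep only candidates the confidence bounds certify as safe; both the
        # policy's action and any exploratory action would pass through this.
        return [u for u in candidate_actions
                if is_action_safe(x, np.atleast_1d(u), gp_mean, gp_std)]

The conflation I commented on is visible here: the same function lyapunov() serves as both the safety certificate and the (negated) reward.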

Reviewer 2

Safe Model-based Reinforcement Learning with Stability Guarantees
==================================================================

This paper presents SafeLyapunovLearning, an algorithm for "safe" reinforcement learning. The algorithm starts from an initial safe policy and dynamics model and then successively gathers data within a "safe" region (determined by Lyapunov stability verification). The authors establish theoretical results showing that this method is provably efficient with high probability under certain assumptions, and these results are supported by some toy simulation experiments.

There are several things to like about this paper:
- The problem of safe RL is very important, of great interest to the community, and without too much in the way of high-quality solutions.
- The authors make good use of the tools developed in model-based control and provide some bridge between developments across sub-fields.
- The simulations support the insight from the main theoretical analysis, and the algorithm seems to outperform its baseline.

However, I found several shortcomings:
- I found the paper as a whole a little hard to follow and in places poorly written. For a specific example, see the paragraph beginning at line 197.
- The treatment of prior work, and especially of the exploration/exploitation problem, is inadequate and seems to be an afterthought: but of course it is totally central to the problem! Prior work such as [34] deserves a much more detailed discussion and comparison so that the reader can understand how and why this method is different.
- Something is confusing (or perhaps even wrong) about the way Figure 1 is presented. In an RL problem you cannot simply "sample" state-actions; instead you may need to plan ahead over multiple timesteps for efficient exploration (a paraphrase of the sampling loop I have in mind follows my review).
- The main theorems are hard to internalize in any practical way; would something like a regret bound be possible instead? I am not sure these types of guarantees are that useful.
- The experiments are on quite a simple toy domain that did not really enthuse me.

Overall, I think there may be a good paper in here, but in its current form it is not up to the high standard of NIPS, so I would not recommend acceptance. If I had to pinpoint my biggest concern, it is that the paper does not place itself properly in the context of related literature, particularly the line of work from Andreas Krause et al.
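For reference, this is roughly the data-gathering loop I have in mind when I say the method "samples" state-actions; the skeleton is my own paraphrase (the model interface, the observation callback, and the safe-set recomputation are placeholders I invented, not the authors' pseudocode).

    import numpy as np

    def safe_exploration_loop(gp, safe_set, observe, recompute_safe_set, n_iters=50):
        # gp: model with predict_std(x, u) -> per-dim std, and update(x, u, x_next)
        # safe_set: iterable of (state, action) pairs currently certified safe
        # observe(x, u): one step of the real system
        # recompute_safe_set(gp): the Lyapunov-verification step, left abstract here
        for _ in range(n_iters):
            pairs = list(safe_set)
            if not pairs:
                break
            # Pick the most uncertain state-action pair inside the certified safe set.
            stds = [gp.predict_std(x, u).sum() for x, u in pairs]
            x, u = pairs[int(np.argmax(stds))]
            x_next = observe(x, u)
            gp.update(x, u, x_next)
            safe_set = recompute_safe_set(gp)
        return gp, safe_set

My concern is the line that selects (x, u) directly: actually reaching an arbitrary certified pair may itself require planning over multiple timesteps, which a loop of this kind glosses over.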

Reviewer 3

This paper addresses the problem of safe exploration for reinforcement learning. Safety is defined as the learner never entering a state from which it cannot return to a low-cost part of the state space. The proposed method learns a Gaussian Process model of the system dynamics and uses Lyapunov functions to determine whether a state-action pair can be recovered from. Theoretical results are given for safety and exploration quality under idealized conditions. A practical method is introduced, and an experiment shows it explores much of the recoverable state space without entering an unrecoverable state.

Overall, this paper addresses an important problem for real-world RL. Being able to explore without risking entry into a hazardous state would greatly enhance the applicability of RL. This problem is challenging, and this work makes a nice step towards a solution. I have several comments for improvement, but overall I think the work will have a good impact on the RL community and is a good fit for the NIPS conference.

One weakness of the paper is that empirical validation is confined to a single domain. It is nice to see that the method works well on this domain, but it would have been good to try it on a more challenging domain. I suspect there are scalability issues, and a clear discussion of what is preventing application to larger problems would benefit people wishing to build on this work. For example, the authors somewhat brush the curse of dimensionality under the rug by saying the policy can always be computed offline. But the runtime could be exponential in the state dimension, and it would be preferable to simply acknowledge the difficulty (see the rough calculation sketched after my comments below).

The authors should also give a clear definition of safety and of the goal of the work early in the text. The definition is somewhat buried in lines 91-92, and it was not until I went looking for it that I found it. Emphasizing this would make the goal of the work much clearer. Similarly, the paper could really benefit from paragraphs that clearly define notation. As it stands, it takes a lot of searching backwards in the text to make sense of the theoretical results; new notation is often buried in paragraphs, which makes it hard to access.

Updated post author feedback: While I still think this is a good paper and recommend acceptance, after reflecting on the lack of clarity in the theoretical results I have revised my recommendation to just "accept". Theorem 4 seems of critical importance, but it is hard to see how the result maps to the description given after it. Clarifying the theoretical results would make this a much better paper than it already is.

Minor comments:
- How is the full region of attraction computed in Figure 2A?
- Line 89: What do you mean by state divergence?
- Line 148: This section title could name what the theoretical results will be, e.g., "Theoretical results on exploration and policy improvement".
- Line 308: Could footnote this statement.
- Line 320: "as to" -> to
- Theorem 4: How is n* used?
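To spell out the scaling concern above (this is my paraphrase, not the authors' exact statement, and I may be off on the constants and notation): the Lyapunov decrease condition has to be certified at every point of a state-space discretization, something of the form

    u_n\bigl(x, \pi(x)\bigr) - v(x) \le -L_{\Delta v}\,\tau \qquad \text{for all } x \in \mathcal{X}_\tau,

where u_n is an upper confidence bound on the Lyapunov value at the next state, \mathcal{X}_\tau is a grid of spacing \tau, and L_{\Delta v} is a Lipschitz constant. Such a grid has on the order of (1/\tau)^d points in state dimension d, which is the exponential dependence behind my scalability worry.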