__ Summary and Contributions__: This paper introduces a new framework for safe RL via shielding for continuous state and action spaces that trains an agent to develop formally verified policies that fulfil safety constrains both during training and test. The authors design a new methodology whose novelty relies on updating during training the shield that prevents the agent from selecting unsafe actions. The authors claim that this update mechanism helps their framework to outperform previous safe-aware approaches.

__ Strengths__: The paper generally provides a detailed theoretical grounding of its claims. It's a novel approach, and to the best of my knowledge, it would be the first work introducing a computationally affordable framework that provides safety warranties during the whole training process while achieving a good performance in well-known complex benchmarks

__ Weaknesses__: The proposed approach relies on a fixed worst-case model of the environment. It requires of an oracle for formal verification of policies and requires a manually constructed shield to start the learning process. Authors claim that their approach is computationally good, however, no empirical evidence is proved

__ Correctness__: The technical claims of this paper seems to be generally correct. Still it misses a comparison which is significant for the evaluation of this contribution. I believe this work should include the computing time required by the proposed approach compared with the three baselines, specially since the authors claim several times the feasibility of their approach compared to previous ones based on Formal Methods. This will help to understand the trade-off between having the shield vs not having it and how much computing impact has updating the shield. I could raise my score for this submission if this comparison is added in the final version.

__ Clarity__: The paper is well written. However, some parts required several reads to understand and I missed some running examples to illustrate the whole learning process. I think it would help potential readers to quickly grasp the key concepts of this paper.

__ Relation to Prior Work__: Prior work seems to be correctly contrasted and referred, which helped to understand the contributions of this work. Well done!

__ Reproducibility__: Yes

__ Additional Feedback__: I think this is a good paper that should be accepted, the authors did a good submission. It just misses the additions I mentioned above to support the author claims about computational feasibility.
Some minor comments:
* In line 17, reference 6 is cited twice
* Graphs in figures 1 and 2 have a horizontal red line that is not referred in the legend
-----
Post-Rebuttal comments
I appreciate the author efforts to answer my concerns. I still suggest authors to include not only the total computing time of their approach, but rather a comparison between the computing time required by a static shield against the proposed approach.

__ Summary and Contributions__: This paper consider a safe reinforcement learning problem by leveraging the recently developed neurosymbolic method. Particularly, in the exploration, a monitor P\sharp(s,f(s) ) \subset \phi observes the agent. If the policy f(s) is safe, then the agent just follows it .Otherwise it applies the policy in the shield. In the learning step, the agent can update the policy using any policy gradient method. Then it projects its updated policy on the space of the symbolic policy and get corresponding invariant \phi (some safe state) using algorithm 2. The author tests this algorithm in the several environment and shows its superiority over DDPG and CPO in the criteria of average cost and safety violation.

__ Strengths__: It is an interesting work to combine the neural symbolic reinforcement learning with formal verification to solve the safe RL problem. I am not familiar with formal verification but the empirical result looks good.
#######After response
Thanks for the author's explanation, which resolves most of my concerns.

__ Weaknesses__: 1. Some part of this paper is not clearly explained. For example, I guess people (at least for me ) in RL community are not familiar with formal verification. Can you give some simple example to intuitively explain how and why it works?
2. In section 3.1. the shield has the form of the piecewise linear policy. Does it work for complicated scenarios where the dimension of the state space is high? In that case the number of region is quite large.
3. In section 2, one assumption is that P\sharp is available in a closed form to the learner. Can you explain why it is reasonable? or give an example of P\sharp in your experiment.

__ Correctness__: The algorithm and experiment result look correct and reasonable.

__ Clarity__: The section of formal verification need to be further explained. See my comments on the weakness of this paper.

__ Relation to Prior Work__: It clearly states the contribution of this paper and discuss the difference from previous literatures.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper proposes REVEL, a RL framework with verifiable safety guarantees. Neural networks are hard to verify, while symbolic policies are hard to directly optimize. To address this dilemma, it first constructs safe symbolic policies as shields. To facilitate learning, it lifts those policies into the neural-symbolic space and doing updates on them. In order to verify the new updated policies, it projects the policies back into the symbolic space. Theoretical results show that on expectation, by doing so, we will achieve lower and lower regrets. Empirical results claim that their method achieves competitive results against existing alternatives.

__ Strengths__: (+) One of the reasons the prevents RL from being largely adopted in many real-world critical tasks is that it could lead to catastrophic failures. Considering this, this paper could be highly appealing to people that have ambitions in those domains.
(+) The theoretical part is interesting. The results are quite clear and clean. Furthermore, they fit our intuition and well corroborate this paperâ€™s claim. Yet, some analysisâ€™s details can be more polished, which I will elaborate on in the next weakness box.
(+) The paper is generally well organized. The supplementary also provide interesting details.

__ Weaknesses__: (-) I have two majors concerns, one regarding the (theoretical) analysis and the other empirical evaluations. Speaking on the first point, it seems like all the safety guarantees boil down to the fact that the initial shields are safe and verifiable. However, when it gets transformed into the neural space, we use imitation learning and based my understanding, there is no guarantee that by imitation learning, the neural network would exactly reproduce what would happen in the shield. Granted, the initial symbolic-form shield is safe. Yet this transformation step seems to raise the possibility of unsafety.
(-) On the empirical side. Based on my understanding, one major advantage of the proposed method is that the shield part can grow, which means the set of known safe policies are not a static concept. However, the current empirical evaluations seem not enough to justify this point. I believe a dedicated evaluation to measure what is the percentage that the f policy is executed could be an interesting thing to know.
(-) It seems on Mountain-Car, the static version completely outperforms the proposed dynamic method, which is very much against the intuition and the theorem in this paper. I suppose the static and the proposed method receive the same initial shield. Then if the proposed method can keep growing the space of shield, then at least the proposed method should perform comparably to the static version. I understanding asking a method to achieve the best performance on every task is ridiculous. Yet, I believe proving analysis on failure case could shed very important light into better understanding what is the limitation of the method and why this method really works in practice.
(-) This point is related to the first comment. Given my concern about the unsafety of imitation learning, I believe it could also be useful to show the number of safety violations of the proposed method. I suppose it should be all zeros.
(-) This is a minor point. Assuming P# is available seems very unrealistic on real-world problems. I understand this is a necessary assumption to do analysis and give safety guarantees. Yet, what is a practical workaround considering when we are tackling a completely unknown domain.

__ Correctness__: The theoreical analysis should be correct.

__ Clarity__: This paper is in general well written and easy to follow.

__ Relation to Prior Work__: The related work is well discussed.

__ Reproducibility__: Yes

__ Additional Feedback__: **** after rebuttal ***
After reading the rebuttal, and more importantly, discussing with other reviewers, my concerns regarding the safety are fully resolved.
I agree with other reviewers that an experiment about the computation time would be interesting.

__ Summary and Contributions__: The paper proposes a new algorithm for safe exploration in reinforcement learning. The paper proposes a policy representation that is easily verifiable for safety, and a way to update such policies. The paper studies the convergence of the algorithm under some special conditions, and shows promising experimental results in a range of continuous control problems.

__ Strengths__: (i) I think the algorithm involves some interesting ideas, such as the policy representation and update method, that may be interesting to the safe RL community.
- The experimental results look promising.

__ Weaknesses__: (i) Parts of the algorithm remain unclear to me:
- It is unclear to me how to specify inputs to the algorithm. How to initialize g? How to initially come up with a shield that is safe? How to specify P#, for example, in your computational experiments?
- How is \phi computed? How expensive is it to compute \phi?
(ii) There is no discussion about the algorithm's computational complexity.
(iii) The piecewise linear policy class seems like a good example, but feels too limited as the only example. In the introduction, the authors claim that their approach uses deep policy representations. It would be great to have an example of a neural network policy class.

__ Correctness__: I have not checked the appendix carefully, but the claims seem okay.

__ Clarity__: I think the paper is pretty well written in general. A few minor suggestions:
- Please label the axes of your figures.
- Line 85, (ii), \phi is closed with respect to what policy?
- Line 166, what policy is the expectation with respect to?

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: The setting considered seems equivalent to minimizing expected cost if the cost of being in an unsafe state is set to high enough. I wonder whether there is a specific benefit of the proposed algorithm compared to algorithms that are designed to minimize expected cost. In particular, I wonder if algorithms like MuZero, which plans many steps ahead, would be a good option here.
***
The rebuttal addresses my concerns.