Summary and Contributions: The authors have addressed my main concerns and confusions and I think the changes noted in the rebuttal will make this a stronger paper. The fixes to the confusing notation and the clarification of the algorithmic details are appreciated and will make the paper easier to understand. --- This paper seeks to address the problem of side effects: where an agent changes an environment in unnecessary or undesirable ways when seeking to accomplish a task. The authors propose that side effects can be avoided if an agent thinks not only about its current task but also future tasks that it may later need to accomplish. The authors also introduce a baseline policy to provide an incentive to not interfere.
Strengths: This paper proposes a nice general way to address side effects. The idea of thinking about the future seems like a nice elegant way to address this issue. Empirical experiments show that agents whose policies are optimized via the proposed framework have fewer side effects than other approaches.
Weaknesses: The beginning of the paper was well written and easily accessible. However, the notation quickly becomes overloaded and it is difficult to understand how section 4 connects to the previous sections which are better explained. In particular, there is not a clear algorithm that is defined and it is difficult to determine exactly what the proposed method is. Section 2 defines the future task framework nicely, but then Section 3 shows that there can be problems with interference. Section 4 purports to solve these problems but it is never made clear exactly what needs to change from section 2. The proposed approach also only works with binary goal-based rewards for deterministic environments. Furthermore, the proposed approach requires an optimal reference policy for any possible subgoal. This seems highly restrictive since the goal seems to be to assist people in specifying reward functions, but if the optimal policy for any subgoal is already known, then why do we care about reward specification? If we can assume to have optimal policies for all goals, we probably also have an optimal policy that avoids side effects, etc.
Correctness: Definition 3 references a baseline \pi' but does not include it in the equation; this seems incorrect. The experimental methodology seems sound.
Clarity: The notation is confusing with respect to V. The subscripts and superscripts are overloaded and sometimes seem inconsistent. For example, based on previous definitions, V_i^*(s) is the optimal value function for task i starting at state s, but in Example 1 we have V_0^*(s_1) = \gamma, which seems to make the subscript the start state and the argument the goal state. Finally, as noted earlier, Section 4 presents nice theory, but it is unclear exactly how this section should be combined with Section 2 for the full approach. A summary or algorithm box would be very helpful to make the approach clear and enable better reproducibility.
Relation to Prior Work: The related work section does a good job of relating the work to existing research.
Additional Feedback: How are goal states sampled for the experiments? Do you enforce that all goal states are actually reachable? Consider the egg example: it seems like the agent will never use the eggs to make breakfast if it thinks it might need the eggs to make a cake later. There is always a possibility of needing a limited resource, so will the agent reach analysis paralysis and be unable to do anything for fear of missing out on something in the future? Line 115: the last sentence is ambiguous about what "this" refers to. How do you pick the baseline? What is a good baseline? Doing nothing as an autonomous car seems bad, since you won't get anywhere; in many settings doing nothing could be pessimal.
Summary and Contributions: This paper searches for good rewards by punishing undesirable side effects, such as breaking a vase, while reaching a goal state, such as reaching the kitchen from the hallway. Rather than defining an undesirable side effect as an irreversible action, as proposed in the literature, the authors propose an auxiliary reward that incorporates the possibility of performing a next task, such as filling the vase with flowers. This auxiliary reward takes the form of the value estimate, at the terminal state of the current task, of the discounted future return for the potential next task.
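As I understand the proposal, the shaping could be sketched roughly as follows (a minimal illustration of my reading, assuming a tabular value estimate `V_future` for a sampled next task and a hypothetical `shaped_reward` helper; this is not the authors' actual implementation):

```python
def shaped_reward(env_reward, state, done, V_future, aux_weight=1.0):
    """Add the future-task auxiliary term at the end of the current task.

    V_future[s] is a value estimate for the (sampled) potential next task,
    evaluated at the terminal state of the current task. Before termination,
    the environment reward passes through unchanged.
    """
    if not done:
        return env_reward
    # Bonus: how well could the agent still perform the next task from here?
    return env_reward + aux_weight * V_future[state]
```

Under this sketch, a terminal state in which the vase is broken carries a lower `V_future` (the flower-filling task is no longer achievable), so the agent is penalized for the side effect without it ever being named explicitly.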
Strengths: Relevant problem, with elegant solution. Thorough derivation. Proper description of the problem. Novel method.
Weaknesses: What about more complex rewards? Does the proof still hold? The proof accounts for a very simplistic reward: 1 at the goal state and 0 otherwise. Is a grid-world evaluation common for this type of research, or should there be more complex environments? What is the impact of the bias/variance trade-off: does the extra variance introduced potentially make training more difficult? Benchmarks would be welcome. The irreversibility auxiliary reward is discarded on the basis of one simple negative result. However, I can think of a grid-world scenario with many side effects that are not all covered by future goals, and therefore not included in this future-task auxiliary reward. For example, walking in a museum would require a future task for each statue in order to train the robot to avoid them. This unknown set of potential outcomes is more easily covered by irreversibility.
Correctness: The limitations of the proposed method in comparison to the related work are probably not stated clearly enough, leading to an oversell of the method.
Clarity: Well written. Clear explanation of the method.
Relation to Prior Work: The prior work section appears complete, and the proposed new method is well compared against it. The paper proposes a more thorough definition of side effects. In the section on safe exploration, it is argued that safe exploration is insufficient. However, the same could be said about the side-effect problem: a damaged agent might be able to perform future tasks but is still a damaged agent. Risk aversion is implied indirectly in the side-effect problem; however, this might be insufficient for riskier tasks.
Additional Feedback: The paper starts with 'designing rewards is difficult' for the designer; however, the paper does not really mention any reward-shaping solutions. This sets the reader on the wrong track. I would recommend something more along these lines: avoiding undesired effects on the environment from a goal-reaching agent is challenging, as side effects are hard to incorporate into a reward without giving the wrong incentive. A big challenge comes from the difficulty of anticipating unforeseen side effects, which is more broadly countered by the reversibility auxiliary reward. Related to the conclusions on future work: by incorporating one future task in the value estimate, the same idea could be explored further in a hierarchical RL setting where likelihoods over tasks are taken into account, which is something we, as humans, do constantly: I'm not cleaning the cutting board just yet, as I might use it in its current state for cutting my next vegetables.
Summary and Contributions: We are typically unable to specify exactly what we would like in a handcoded reward function for reasonably complex tasks. In particular, it is hard to specify all the side effects -- all of the things that the agent should *not* change about the environment. So, we would like a generic method that can penalize side effects in arbitrary environments for an arbitrary reward function. A simple approach is to have our agent preserve its options, which the authors formalize by having the agent maintain its ability to pursue a distribution of future tasks. However, this leads to interference incentives -- if something were going to restrict the agent’s option value, such as a human irreversibly eating some food, the agent would be incentivized to interfere with that process in order to keep its option value for the future. The authors provide a formal definition of this incentive. To fix this problem, the authors introduce a baseline policy (which could be set to e.g. noop actions), and propose a future task reward that only provides reward if the baseline policy would have been able to complete the future task. Thus, the agent is only incentivized to preserve options that “would have been available”. The authors show via experiments with simple gridworlds that the future task approach with the baseline allows us to avoid side effects, while also not having intervention incentives.
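The baseline-gated future-task reward described above can be illustrated with a small sketch (my reading of the fix, with hypothetical state names and a hypothetical `can_complete` predicate; the paper's actual formulation is in terms of value functions, not reachability booleans):

```python
def future_task_bonus(agent_terminal_state, baseline_terminal_state,
                      can_complete, future_tasks, weights, gamma=0.99):
    """Baseline-conditioned future-task bonus.

    A future task contributes reward only if the baseline policy (e.g. noop)
    would also have been able to complete it from its own terminal state.
    This removes the incentive to interfere with environment processes
    (such as a human irreversibly eating food) just to preserve option
    value the agent would not have had under the baseline.
    """
    bonus = 0.0
    for task, w in zip(future_tasks, weights):
        if can_complete(baseline_terminal_state, task):  # baseline gate
            bonus += w * (gamma if can_complete(agent_terminal_state, task) else 0.0)
    return bonus
```

In the food example, the `eat_food` future task is unreachable under the baseline (the human already ate the food), so preserving it earns the agent nothing, and the interference incentive disappears.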
Strengths: The problem is important: reward specification is hard to get right, and techniques to ameliorate the difficulty are important and understudied. I particularly like the conceptual analysis of what is necessary for avoiding both side effects and intervention incentives: the reasoning is compelling, and the conclusions are borne out by the empirical results.
Weaknesses: The experiments are done on relatively simple gridworlds, so it is not easy to say whether the method will work in more complex environments. Nonetheless, I see the contribution as the conceptual analysis, so this is not a major point. See also the relation to prior work section below.
Correctness: To my knowledge, yes.
Relation to Prior Work: Previous work on relative reachability (that the authors cite) has introduced the idea of a baseline policy and intervention incentives, including a stepwise inaction baseline that cannot be represented in the formalism of this paper because it depends on the agent’s policy (whereas the future tasks framework requires a baseline policy that is independent of the agent’s policy). This is discussed in Appendix A.2 -- I might recommend that the authors move some of this discussion into the main paper.
Additional Feedback: After reading the author response and the other reviews, I am keeping my score. I did not raise substantial critiques, and I found the authors’ response convincing at rebutting the critiques of the other reviewers.