NeurIPS 2020

Planning with General Objective Functions: Going Beyond Total Rewards


Meta Review

This paper aims to extend the theory of RL to objective functions more general than the cumulative reward. For example, one may wish to maximize the minimum reward received along a trajectory (a max-min objective). The paper specifically studies _deterministic_ MDPs, which is a bit of a departure from standard RL setups but seems reasonable for certain problems. Three conditions are established such that, when a generic objective satisfies all of them, there exists an algorithm that computes an \epsilon-optimal policy. Two learning algorithms are proposed: one specialized to symmetric norms, and one for generic objectives satisfying the three conditions.

The reviews were overall very positive about the work. Supporting more general objective functions is a worthy, important problem to study, and the results are compelling. The primary criticism concerned the focus on deterministic MDPs; nonetheless, the reviewers ultimately weren't too bothered by this assumption, and it will be interesting to see whether the authors can extend this work to the more traditional stochastic setting. Another criticism is that the proposed algorithms are not validated empirically. I think experiments would have taken this paper to another level.

There was some confusion over the fact that the algorithm's time complexity scales with the number of _multisets_ of cardinality k, not the number of length-k sequences. The reviewers and I are satisfied by the authors' response and thank them for clearing that up. While most reviewers agreed that the presentation is clear enough, there is some room for improvement, and I encourage the authors to incorporate the reviewers' detailed feedback when revising.
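As a concrete illustration of the two technical points above (the notation here is mine, not necessarily the paper's): given per-step rewards r_1, \dots, r_H along a trajectory, the standard objective is the total reward \sum_{h=1}^{H} r_h, a max-min objective is \min_{1 \le h \le H} r_h, and a symmetric-norm objective is \|(r_1, \dots, r_H)\| for a norm that is invariant under permuting its coordinates. On the complexity point: if rewards take n distinct values, there are only \binom{n+k-1}{k} multisets of cardinality k, as opposed to n^k length-k sequences.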