Paper ID: 2438

Title: A Geometric Perspective on Optimal Representations for Reinforcement Learning

This paper is very intriguing. Although auxiliary tasks have proven empirically useful, their design and justification remain on the whole ad hoc. This paper describes a new method, based on geometric properties of the space of value functions, for representation learning. The results show that predicting adversarial value functions as auxiliary tasks leads to rich representations. Overall, this innovative perspective on representation learning helps us understand the learning process, and the literature review shows that the authors are knowledgeable in this field. As the authors state, their work "opens up the possibility of automatically generating auxiliary tasks in deep reinforcement learning".

Here is my major concern: the authors begin describing their representation in Section 2. A description of prior approaches to representation learning would help the reader form a general picture before a new representation for RL is introduced. I would suggest placing "Related Work" in Section 2 instead of Section 5.

This paper studies the problem of learning useful representations for reinforcement learning through the lens of an adversarial framework. In particular, a good representation is identified as one that yields low linear value-function estimation error even when an adversary chooses the value function (induced by a policy). The paper first shows that only deterministic policies need to be considered, and then identifies a narrower set of adversarial value functions, though their number is still exponential. I really liked the theoretical insights of this paper, and because of this I tend to vote for acceptance, though I find the experiments too preliminary. Some more comments below:

1. In (1), highlight more clearly that \phi is the only optimization knob.
2. In terms of readability, it is unclear why Lemma 1 is useful until after I read the proof of Theorem 1 in the Appendix. Consider explaining why this Lemma is useful, or reordering the material.
3. Isn't the first half of Lemma 1 (the solution lies in the set of extreme points) a very well-known result in linear programming? If so, be clearer that this part is not new.
4. This adversarial framework reminds me a lot of the use of the Wasserstein distance in model-based RL, whereby a good model is defined as one that yields low error under an adversarial choice of (Lipschitz) value functions. Do you see a synergy here? Is there any deep connection? Also, can you clarify why you used model-based algorithms in the experiments? There is no mention of model-based methods until the experiments, so I am wondering if there is a connection.
5. For the proof of Theorem 2 in the appendix, consider defining idempotent matrices and their properties. I checked the proofs of the first two theorems, and otherwise they seem sound and clear.
6. The part where the paper falls short is the experiments. It could still be acceptable if the authors showed a clear path toward extending the idea to function approximation, but this is lacking. Moreover, the method cannot really beat the baseline even in the toy domain. Any comment on the challenges of moving to function approximation?

Post-rebuttal: I am happy to see that the authors are willing to add a section that more seriously tackles, or at least starts to think about, the challenges of moving to arbitrary function approximators in practice. As for the point about a potential model-based RL result, Farahmand et al. was indeed the paper I had in mind. Also, because of the focus on linearity, Parr et al. (2008) on linear models shows a deeper connection/equivalence, and so could be useful. It would be very neat if a deeper connection existed; if one cannot be shown in this paper, can a conjecture still be made?
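To make the adversarial criterion discussed above concrete, here is a toy sketch of how one might score a fixed representation: given a feature matrix and a finite set of candidate value vectors (e.g., values of deterministic policies), the score is the worst-case error of the best linear fit. All names and numbers here are hypothetical illustrations of the idea as the review describes it, not the paper's implementation.

```python
import numpy as np

def worst_case_error(Phi, candidate_values):
    """Worst-case linear approximation error over a finite set of
    candidate value vectors. For each v, the best linear fit Phi @ theta
    is the least-squares projection of v onto the column span of Phi."""
    proj = Phi @ np.linalg.pinv(Phi)  # orthogonal projector onto span(Phi)
    return max(np.linalg.norm(v - proj @ v) for v in candidate_values)

# Hypothetical toy instance: 3 states, 1 feature, two candidate values.
Phi = np.array([[1.0], [2.0], [3.0]])
vs = [np.array([1.0, 2.0, 3.0]),   # lies in span(Phi): zero error
      np.array([1.0, 0.0, 0.0])]   # outside span(Phi): positive error
print(worst_case_error(Phi, vs))   # → ~0.9636
```

The review's point 6 is then the question of what replaces the explicit projector when `Phi` is a learned nonlinear feature map rather than a fixed matrix.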

The paper claims that the best representation should minimize the maximum error in approximating all possible value functions, rather than the single one pertaining to a given policy. To that end, the authors establish further results that reduce the maximization over all policies to a maximization over the finite set of extremal vertices, which correspond to (distinct in Lebesgue measure) deterministic policies. These value functions are termed adversarial (although they are adversarial only from an approximation point of view). I enjoyed reading the paper. Here are some comments that may help improve it.

- The presented theory is in fact linear, but it is couched as generic through the introduction of the representation mapping \phi and the so-called "two-part approximation". The distinction is that in the generic case \phi enters the computation through its gradient and the like, which is not the case here. I personally do not favour the presented perspective, yet it is not a real issue to argue against. Note that in all classic sources (e.g., the numerous books and papers of Bertsekas and Tsitsiklis), V = \Phi \theta is the definition of linear ADP; there is no need to say otherwise.
- Figure 1 (Right): I would add the axes; the current figure is hard to understand. Is it the value space for a two-dimensional state space, i.e., V^{\pi}(x(1)) vs. V^{\pi}(x(2))?
- L94, definition of \mathcal{V}: it looks like it should be "for all" rather than "for some". Additionally, mathematical statements should be succinct; if "for all" is meant, I would remove the qualifier.
- Supplementary material, first equation after Equation 6: the subscripts of the v's should be superscripts.
- L107: \phi should be in R^d, not \mathcal{R} = R^{n x d}; in other words, \phi is a row of \Phi.
- \delta is called an interest function in Section 3.1, yet it was called a direction in Lemma 1.
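The min-max criterion paraphrased in this review can be written compactly as follows (notation reconstructed from the review's description, not quoted from the paper):

```latex
\min_{\Phi \in \mathbb{R}^{n \times d}} \; \max_{\pi \in \Pi} \; \min_{\theta \in \mathbb{R}^{d}} \bigl\| V^{\pi} - \Phi \theta \bigr\|
```

The reviews' summary of the paper's reduction is that the inner maximization over all policies \(\Pi\) can be restricted to the finite (though exponentially large) set of deterministic policies corresponding to extremal vertices of the space of value functions.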