NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID:2127
Title:Learning values across many orders of magnitude

Reviewer 1

Summary

The paper is concerned with the adaptive normalization of the targets of a neural network, focusing on deep RL. The calibration of a learning method, including the adjustment of the hyper-parameters, is a key bottleneck for the practitioner. The novelty in this paper is to tackle the online case, across several settings where the order of magnitude of the rewards is unknown. The calibration issue is tackled using an adaptive normalization.

There are several issues which are unclear to me. Firstly, what primarily deserves to be normalized, particularly so in the non-stationary case, is the input of the neural net. However, the impact of large input values can be compensated for by adjusting the learning step size. My main reservation is that, in my opinion, the proposed approach is generalized by "Learning to learn by gradient descent" (Nando de Freitas et al., 2016), where the gradient is estimated online together with the step size, thereby sidestepping the calibration issue. There is also work on the online adjustment of learning hyper-parameters by Jelena Luketina et al. (ICML 2016). The discussion of the fact that calibrating the rewards potentially changes the game is very interesting. I wonder whether replacing the reward by its quantile would address this game-changing effect.

Qualitative Assessment

See summary.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 2

Summary

The authors propose a method for adaptively normalizing the distribution of the regression targets that a neural network is trained to produce. The idea is to deal elegantly with output targets of different scales, especially when the normalization of the targets cannot be computed a priori, as is generally the case when learning action-value functions in reinforcement learning. The authors show on a toy task that their method is able to deal with low-frequency targets that are much larger than the average targets. They show on the Atari 2600 reinforcement learning task that their system performs better on average than systems that use a clipping heuristic. Their model also seems to learn qualitatively different behavior because it is able to distinguish between rewards of different magnitudes.
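
For concreteness, here is a minimal sketch of the output-preserving rescale as I understand it (the variable names are mine, and this illustrates the idea rather than the authors' exact algorithm): when the target shift and scale are updated, the last linear layer is rescaled so that the unnormalized output is unchanged.

```python
import numpy as np

# Sketch: rescale the last linear layer so that updating the target
# normalization (shift mu, scale sigma) leaves the unnormalized output
# sigma * (W @ h + b) + mu unchanged for every hidden representation h.
# Names (W, b, mu, sigma) are mine, not the paper's notation.

def preserve_outputs(W, b, mu_old, sigma_old, mu_new, sigma_new):
    """Return (W', b') with sigma_new*(W'h + b') + mu_new == sigma_old*(Wh + b) + mu_old."""
    W_new = (sigma_old / sigma_new) * W
    b_new = (sigma_old * b + mu_old - mu_new) / sigma_new
    return W_new, b_new

# Quick check on random values
rng = np.random.default_rng(0)
W, b = rng.normal(size=(1, 4)), rng.normal(size=1)
h = rng.normal(size=4)
mu_old, sigma_old, mu_new, sigma_new = 0.0, 1.0, 3.0, 10.0
W2, b2 = preserve_outputs(W, b, mu_old, sigma_old, mu_new, sigma_new)
assert np.allclose(sigma_old * (W @ h + b) + mu_old,
                   sigma_new * (W2 @ h + b2) + mu_new)
```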

Qualitative Assessment

I think the problem that the authors are trying to solve is a very important one that shows up in many gradient-based learning situations. The ideas are straightforward but (to my knowledge) new and apparently effective. For this reason I can see them becoming widely used. The paper is well written and clear in most places. The related work section seems to contain enough relevant references, but it would be nice if some of the most related works were discussed in a bit more detail. The work by Wang et al. (2016), for example, seems related to me in the sense that it also appears to make the DQN in that paper less sensitive to the mean of the values.

I like the toy example. While it seems to me that a linear model should be able to solve this task (one solution would be to set the input weights equal to powers of two; see the sketch below), the gap between the presented data and the infrequent high target makes this impossible to learn from the data, and representative of a problem that may occur even when the output targets are normalized but still follow a fat-tailed distribution.

I found it a bit difficult to judge the Atari 2600 results because they represent such a high-dimensional comparison of different types of scores. The movies in the supplementary materials (especially Centipede) look convincing, and the method sometimes performs much better than the baseline, but it also performs much worse on some games. On average, the method performs better, and I agree with the authors that being able to learn most of the games without clipping is already a step forward. It would be interesting to see how well the method would do with at least a bit of hyperparameter tuning, especially of the discount factor. The authors also point out that the success of models that use clipping might be somewhat specific to this task. For this reason, it would have been interesting to see a comparison on another reinforcement learning task as well.

Some small things:
- Line 55: The word 'most' is very imprecise for statements of this kind. Does this mean that one of the natural gradient methods is actually invariant to rescaling the targets?
- Line 66: I think a more correct interval for the step size would be (0, inf) instead of [0, 1]. You clearly don't want to include 0, and I don't see why a step size couldn't be greater than 1.
- Line 88: The actual 'objective' mentioned in this line is never discussed. The rescaling updates seem to be more about satisfying certain constraints, like the ones presented in Equation 3 and the bound on the normalized targets.
- Algorithm 2: The update of W seems to contain a typo where 'g' should probably be 'h'.
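
The sketch mentioned above, illustrating the powers-of-two remark. This assumes my reading of the toy task (inputs are the bits of an integer and the target is that integer's value); the setup below is hypothetical and only meant to show that an exact linear solution exists, not that SGD would find it.

```python
import numpy as np

# Hypothetical version of the toy setup: each input is the 16-bit binary
# representation of an integer and the target is the integer itself.
# A linear model with weights 2^0, 2^1, ..., 2^15 fits every target exactly,
# including the rare very large ones.
n_bits = 16
w = 2.0 ** np.arange(n_bits)        # [1, 2, 4, ..., 32768]

def target(bits):
    return float(bits @ w)          # exact, regardless of magnitude

x_small = np.zeros(n_bits); x_small[:4] = 1   # -> 15
x_rare = np.zeros(n_bits); x_rare[-1] = 1     # -> 32768 (the infrequent large target)
print(target(x_small), target(x_rare))
```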

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 3

Summary

This paper addresses the problem of clipping reward functions in reinforcement learning, which is common practice but undesirable in that it introduces domain knowledge and can result in the agent optimising for reward frequency rather than amount. This work aims to remove this dependence by proposing a method that learns the reward scale (a normalisation) during learning. The proposed approach, Pop-Art, involves alternating between updates to the regression target and updates to the normalisation. The method is demonstrated on a toy domain and the standard Atari environment, where it is shown to have a large effect on the performance of reinforcement learning algorithms.

Qualitative Assessment

The paper is very well written and easy to follow. On the whole, I think the idea presented is clever and novel. It was also interesting to see how the Atari results differ using this form of normalisation. I do, however, have three major issues.

The first concerns the supplementary appendix, which the text says contains proofs, discussions of alternative normalisation updates, raw performance scores, and videos. Very unfortunately, I was unable to open this, either on Windows or Ubuntu, and so was unable to verify any of it.

Secondly, although the proposed scheme was shown to perform well, I would have liked to see some measure indicating that something close to the true scale is recovered. This would involve two demonstrations:
1) The results should have been compared to the results of learning using the same networks with 'oracle' knowledge of the true scale and shift parameters. This should outperform the proposed method, but it would provide an indication that the method is acting as intended, and act as a form of baseline towards which the performance of Pop-Art should tend.
2) The learned scale and shift should be compared to their true values, which would demonstrate recovery of the true normalisation scheme, and that the algorithm was indeed able to separate learning the target from learning the normalisation.

Finally, the assumption of this particular scheme (scale and shift) is a strong one. How would one know this is a reasonable assumption? It seems that this could in fact lead to the same problem again, where an incorrect normalisation merely results in the learning algorithm again optimising for some distorted reward function. To this end, it would have been useful to show that arbitrary (presumably incorrect) normalisation schemes lead to reduced learning performance.

Minor issues:
- The paragraph in ln 130-134 and Algorithm 1 seem slightly out of place: ART has not yet been defined, and neither have J and \delta.
- Ln 234: is useful as -> as useful as
- Ln 250: shows l^2 norm -> shows the l^2 norm

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

This paper introduces an approach that adaptively normalizes targets to address the wide variation in target magnitudes. In particular, this work adaptively normalizes the targets while simultaneously preserving the outputs of the function. The experiments in the paper show that this approach works for binary regression and DQN.

Qualitative Assessment

=============== Technical quality ===============
The paper includes many justifications for the approach, including proofs as well as empirical evidence. The results showed that POP-ART can drastically improve performance. However, it appears that performance is also reduced in many of the domains. More analysis should be done as to why this may be. Additionally, a comparison against other adaptive normalization techniques would have been useful for both experiments (including Adam and AdaGrad).

=============== Novelty ===============
The idea of normalizing the targets while preserving the outputs is quite interesting. However, the paper needs more discussion as to why this approach might be better than the adaptive normalization approaches mentioned. For example, the paper notes that it is difficult to choose step sizes when nothing is known about the magnitude of the loss. However, adaptive step sizes would address this problem.

=============== Impact ===============
It appears that this approach could potentially allow researchers to use domains with DQN without needing to clip rewards. However, many of the DQN results were worse with POP-ART. It does not seem like this approach would allow one to completely ignore the scale of the rewards.

=============== Clarity ===============
The paper was well written and the approach was explained well.

=============== Other points ===============
- Line 234: is -> as
- Line 239: remove by
- Line 278: A discount factor of .99 is high. This does not seem like a myopic discount factor.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 5

Summary

This paper describes a method, Pop-Art, which adaptively normalizes the targets used for learning, especially in reinforcement learning and sensor learning, where targets can vary over several orders of magnitude, be non-stationary, and occur with varying frequencies; the method avoids hard clipping of the targets. To prevent changing the outputs of the unnormalized function, the method scales and shifts the normalized function simultaneously. Additionally, the authors introduce a variant of stochastic gradient descent which computes an unscaled error but a scaled update for the parameters.

The method was tested in two experiments: a binary regression experiment and various Atari 2600 games using Double DQN plus Pop-Art. For the binary regression experiment, the authors show better performance for their method compared to standard unnormalized SGD. For the Atari 2600 games, the l2 norms of the gradients of unclipped DQN, clipped DQN, and unclipped DQN plus Pop-Art show an advantageously lower variance for the proposed method. Unclipped DQN plus Pop-Art performs slightly better than clipped DQN on 57 Atari games; however, the authors provide a reasonable explanation for the weaker scores on some games. In the appendix, the authors propose a percentile normalization variant, an online normalization with minibatches, and normalizations for the lower layers. A scaling parameter s is suggested which influences the magnitude of the normalized targets without influencing their distribution. Finally, proofs are given for the main propositions.
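
As a rough illustration of the adaptive normalization part, a sketch of tracking the shift and scale online from the stream of targets via running first and second moments. This is an exponential-moving-average form written for illustration only; the step size beta, the clipping floor, and the class name are mine and need not match the paper's exact update.

```python
import numpy as np

class RunningNormalizer:
    """Sketch: online estimates of target shift (mean) and scale (sigma)."""

    def __init__(self, beta=1e-3):
        self.beta = beta       # illustrative step size, not the paper's choice
        self.mean = 0.0        # running estimate of E[Y]
        self.mean_sq = 1.0     # running estimate of E[Y^2]

    def update(self, y):
        # Exponential moving averages of the first and second moments.
        self.mean = (1 - self.beta) * self.mean + self.beta * y
        self.mean_sq = (1 - self.beta) * self.mean_sq + self.beta * y * y
        return self.mean, self.sigma

    @property
    def sigma(self):
        # Guard against a degenerate scale early in learning.
        return max(np.sqrt(self.mean_sq - self.mean ** 2), 1e-4)

    def normalize(self, y):
        return (y - self.mean) / self.sigma
```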

Qualitative Assessment

It is very surprising that target normalization has not, as far as I know, been evaluated more extensively in the RL community until now. That aside, I think overall the method looks promising and could be of great interest to the NIPS community, especially to the reinforcement learning and sensor learning communities.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 6

Summary

A method to normalize the outputs of a neural network is presented. The authors show that using their method they can remove domain knowledge such as reward clipping when training DQNs.

Qualitative Assessment

Normalization of neural networks in general, and of DRL in particular, is an important issue. Indeed, the deep learning community has enjoyed a line of work focused on input normalization. This work suggests an output normalization algorithm and shows that it can help reduce the need for domain knowledge in DRL, such as reward clipping. I find the work well motivated and well written. In particular, I find the derivation of the Pop-Art algorithm easy to follow. The analysis in Section 2.3, and in particular Proposition 3, helps the reader gain intuition about the suggested algorithm and increases the significance of the paper. While the main DRL application in the paper is the DQN, it would be interesting to see applications in other DRL algorithms such as policy networks. In particular, policies that output continuous but bounded actions, e.g., HFO (a RoboCup domain), might benefit from this approach.

My main criticism is about the experiments. First, DRL papers that suggest a new DQN variant typically compare their algorithm with the DQN baseline. In this paper, the Pop-Art algorithm is demonstrated only for DDQN, although Figure 1b suggests that experiments were also performed with DQN. Why are the results for Pop-Art with DQN not shown here? Does the algorithm work well only for DDQN? Second, while the algorithm is a nice contribution to the DRL community in itself, the focus of the paper is on removing the domain knowledge of clipping rewards. While reward clipping and normalizing targets are related, I don't see a direct connection between them, and the experiments presented for DDQN are not conclusive. Are these methods orthogonal, or might Pop-Art perform even better with reward clipping? This is definitely an important experiment to show here. Last, while the authors mention a variety of other normalization methods for deep learning, none is evaluated in practice.

Smaller remarks:
- Line 292: I think the conclusion that Pop-Art successfully replaced reward clipping is not warranted; definitely more experiments should have been done before making such a claim.
- Line 295: The citation in line 295 and the relation to exploration seem unrelated to the paper.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)