These video pairs illustrate the qualitatively different behaviors learned in two Atari 2600 games by Double DQN agents trained either with clipped rewards or with unclipped rewards using Pop-Art.

Game 1: Centipede

The agent controls the colored shooting square that moves around the screen. It needs to shoot down the snake pieces moving down the screen, but it also has the option of shooting other enemies such as the spiders (for a large reward). When dying (for example as a result of touching the spider), the agent collects a number of small rewards. Because clipping hides the difference in magnitude between small and large rewards, the clipped agent is enticed to kill itself to collect many small rewards (given a discounted horizon). The unclipped agent with Pop-Art correctly avoids this harmful behavior and kills the spiders for more reward.
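The effect of clipping on the agent's preferences can be sketched with a small numerical example. The reward values below are hypothetical and are not actual Centipede scores, and the clipping range of [-1, 1] is an assumption based on the common DQN convention; the helper names `discounted_return` and `clip` are introduced here for illustration only.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards discounted by gamma per time step."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def clip(rewards, low=-1.0, high=1.0):
    """Clip each reward to [low, high] (assumed range, see lead-in)."""
    return [max(low, min(high, r)) for r in rewards]


# Hypothetical reward sequences over ten steps (illustrative values only).
many_small = [5.0] * 10           # many small rewards, e.g. collected when dying
one_large = [0.0] * 9 + [600.0]   # one large reward, e.g. for shooting a spider

# With clipping, every positive reward counts as 1, so many small rewards win.
clipped_prefers_small = (
    discounted_return(clip(many_small)) > discounted_return(clip(one_large))
)

# Without clipping, the single large reward dominates.
unclipped_prefers_large = (
    discounted_return(one_large) > discounted_return(many_small)
)

print(clipped_prefers_small, unclipped_prefers_large)
```

Under these assumed numbers the ordering of the two options flips once rewards are clipped, which is the kind of incentive change described above.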

Game 2: Time Pilot

The agent controls the blue airplane in the center of the screen. The agent can shoot down enemy helicopters for small rewards, or hunt down the mothership (a blimp in these videos) for a bigger reward and advance to the next level. Here the Double DQN agent with clipped rewards prefers to shoot enemies and ignores the blimp. In contrast, the unclipped Pop-Art agent virtually ignores the normal enemy ships to focus on the mothership. While this is the behavior the reward function suggests, it leads the agent more quickly to the harder levels and does not incentivize the agent to learn to shoot all other planes (a useful skill throughout the game), which leads to an overall worse score. This is a case where seeing the true reward function can make the game harder to play.




