{"title": "Hybrid Reward Architecture for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5392, "page_last": 5402, "abstract": "One of the main challenges in reinforcement learning (RL) is generalisation. In typical deep RL methods this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network.  While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable. This paper contributes towards tackling such challenging domains, by proposing a new method, called Hybrid Reward Architecture (HRA). HRA takes as input a decomposed reward function and learns a separate value function for each component reward function. Because each component typically only depends on a subset of all features, the corresponding value function can be approximated more easily by a low-dimensional representation, enabling more effective learning. We demonstrate HRA on a toy-problem and the Atari game Ms. Pac-Man, where HRA achieves above-human performance.", "full_text": "Hybrid Reward Architecture for\n\nReinforcement Learning\n\nHarm van Seijen1\n\nharm.vanseijen@microsoft.com\n\nMehdi Fatemi1\n\nmehdi.fatemi@microsoft.com\n\nJoshua Romoff12\n\njoshua.romoff@mail.mcgill.ca\n\nRomain Laroche1\n\nromain.laroche@microsoft.com\n\nTavian Barnes1\n\ntavian.barnes@microsoft.com\n\nJeffrey Tsang1\n\ntsang.jeffrey@microsoft.com\n\n1Microsoft Maluuba, Montreal, Canada\n2McGill University, Montreal, Canada\n\nAbstract\n\nOne of the main challenges in reinforcement learning (RL) is generalisation. In\ntypical deep RL methods this is achieved by approximating the optimal value\nfunction with a low-dimensional representation using a deep network. 
While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable. This paper contributes towards tackling such challenging domains by proposing a new method, called Hybrid Reward Architecture (HRA). HRA takes as input a decomposed reward function and learns a separate value function for each component reward function. Because each component typically only depends on a subset of all features, the corresponding value function can be approximated more easily by a low-dimensional representation, enabling more effective learning. We demonstrate HRA on a toy problem and the Atari game Ms. Pac-Man, where HRA achieves above-human performance.

1 Introduction

In reinforcement learning (RL) (Sutton & Barto, 1998; Szepesvári, 2009), the goal is to find a behaviour policy that maximises the return, the discounted sum of rewards received over time, in a data-driven way. One of the main challenges of RL is to scale methods such that they can be applied to large, real-world problems. Because the state-space of such problems is typically massive, strong generalisation is required to learn a good policy efficiently.

Mnih et al. (2015) achieved a big breakthrough in this area: by combining standard RL techniques with deep neural networks, they achieved above-human performance on a large number of Atari 2600 games by learning a policy from pixels. The generalisation properties of their Deep Q-Networks (DQN) method are achieved by approximating the optimal value function. A value function plays an important role in RL because it predicts the expected return, conditioned on a state or state-action pair. Once the optimal value function is known, an optimal policy can be derived by acting greedily with respect to it.
By modelling the current estimate of the optimal value function with a deep neural\nnetwork, DQN carries out a strong generalisation on the value function, and hence on the policy.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fThe generalisation behaviour of DQN is achieved by regularisation on the model for the optimal\nvalue function. However, if the optimal value function is very complex, then learning an accurate\nlow-dimensional representation can be challenging or even impossible. Therefore, when the optimal\nvalue function cannot easily be reduced to a low-dimensional representation, we argue to apply a\ncomplementary form of regularisation on the target side. Speci\ufb01cally, we propose to replace the\noptimal value function as target for training with an alternative value function that is easier to learn,\nbut still yields a reasonable\u2014but generally not optimal\u2014policy, when acting greedily with respect to\nit.\nThe key observation behind regularisation on the target function is that two very different value\nfunctions can result in the same policy when an agent acts greedily with respect to them. At the\nsame time, some value functions are much easier to learn than others. Intrinsic motivation (Stout\net al., 2005; Schmidhuber, 2010) uses this observation to improve learning in sparse-reward domains,\nby adding a domain-speci\ufb01c intrinsic reward signal to the reward coming from the environment.\nWhen the intrinsic reward function is potential-based, optimality of the resulting policy is maintained\n(Ng et al., 1999). In our case, we aim for simpler value functions that are easier to represent with a\nlow-dimensional representation.\nOur main strategy for constructing an easy-to-learn value function is to decompose the reward\nfunction of the environment into n different reward functions. Each of them is assigned a separate\nreinforcement-learning agent. 
Similar to the Horde architecture (Sutton et al., 2011), all these agents\ncan learn in parallel on the same sample sequence by using off-policy learning. Each agent gives its\naction-values of the current state to an aggregator, which combines them into a single value for each\naction. The current action is selected based on these aggregated values.\nWe test our approach on two domains: a toy-problem, where an agent has to eat 5 randomly located\nfruits, and Ms. Pac-Man, one of the hard games from the ALE benchmark set (Bellemare et al.,\n2013).\n\n2 Related Work\n\nOur HRA method builds upon the Horde architecture (Sutton et al., 2011). The Horde architecture\nconsists of a large number of \u2018demons\u2019 that learn in parallel via off-policy learning. Each demon\ntrains a separate general value function (GVF) based on its own policy and pseudo-reward function.\nA pseudo-reward can be any feature-based signal that encodes useful information. The Horde\narchitecture is focused on building up general knowledge about the world, encoded via a large number\nof GVFs. HRA focusses on training separate components of the environment-reward function, in\norder to more ef\ufb01ciently learn a control policy. UVFA (Schaul et al., 2015) builds on Horde as well,\nbut extends it along a different direction. UVFA enables generalization across different tasks/goals. It\ndoes not address how to solve a single, complex task, which is the focus of HRA.\nLearning with respect to multiple reward functions is also a topic of multi-objective learning (Roijers\net al., 2013). So alternatively, HRA can be viewed as applying multi-objective learning in order to\nmore ef\ufb01ciently learn a policy for a single reward function.\nReward function decomposition has been studied among others by Russell & Zimdar (2003) and\nSprague & Ballard (2003). 
This earlier work focusses on strategies that achieve optimal behavior.\nOur work is aimed at improving learning-ef\ufb01ciency by using simpler value functions and relaxing\noptimality requirements.\nThere are also similarities between HRA and UNREAL (Jaderberg et al., 2017). Notably, both solve\nmultiple smaller problems in order to tackle one hard problem. However, the two architectures are\ndifferent in their workings, as well as the type of challenge they address. UNREAL is a technique that\nboosts representation learning in dif\ufb01cult scenarios. It does so by using auxiliary tasks to help train\nthe lower-level layers of a deep neural network. An example of such a challenging representation-\nlearning scenario is learning to navigate in the 3D Labyrinth domain. On Atari games, the reported\nperformance gain of UNREAL is minimal, suggesting that the standard deep RL architecture is\nsuf\ufb01ciently powerful to extract the relevant representation. By contrast, the HRA architecture breaks\ndown a task into smaller pieces. HRA\u2019s multiple smaller tasks are not unsupervised; they are tasks\nthat are directly relevant to the main task. Furthermore, whereas UNREAL is inherently a deep RL\ntechnique, HRA is agnostic to the type of function approximation used. It can be combined with deep\n\n2\n\n\fneural networks, but it also works with exact, tabular representations. HRA is useful for domains\nwhere having a high-quality representation is not suf\ufb01cient to solve the task ef\ufb01ciently.\nDiuk\u2019s object-oriented approach (Diuk et al., 2008) was one of the \ufb01rst methods to show ef\ufb01cient\nlearning in video games. This approach exploits domain knowledge related to the transition dynamic\nto ef\ufb01ciently learn a compact transition model, which can then be used to \ufb01nd a solution using\ndynamic-programming techniques. 
This inherently model-based approach has the drawback that while it efficiently learns a very compact model of the transition dynamics, it does not reduce the state-space of the problem. Hence, it does not address the main challenge of Ms. Pac-Man: its huge state-space, which is intractable even for DP methods (Diuk applied his method to an Atari game with only 6 objects, whereas Ms. Pac-Man has over 150 objects).

Finally, HRA relates to options (Sutton et al., 1999; Bacon et al., 2017), and more generally to hierarchical learning (Barto & Mahadevan, 2003; Kulkarni et al., 2016). Options are temporally-extended actions that, like HRA's heads, can be trained in parallel based on their own (intrinsic) reward functions. However, once an option has been trained, the role of its intrinsic reward function is over. A higher-level agent that uses an option sees it as just another action and evaluates it using its own reward function. This can yield great speed-ups in learning and help substantially with exploration, but options do not directly make the value function of the higher-level agent less complex. The heads of HRA represent values, trained with components of the environment reward. Even after training, these values stay relevant, because the aggregator uses them to select its action.

3 Model

Consider a Markov Decision Process $\langle S, A, P, R_{env}, \gamma \rangle$, which models an agent interacting with an environment at discrete time steps $t$. It has a state set $S$, action set $A$, environment reward function $R_{env}: S \times A \times S \to \mathbb{R}$, and transition probability function $P: S \times A \times S \to [0, 1]$. At time step $t$, the agent observes state $s_t \in S$ and takes action $a_t \in A$. The agent observes the next state $s_{t+1}$, drawn from the transition probability distribution $P(s_t, a_t, \cdot)$, and a reward $r_t = R_{env}(s_t, a_t, s_{t+1})$.
The behaviour is defined by a policy $\pi: S \times A \to [0, 1]$, which represents the selection probabilities over actions. The goal of an agent is to find a policy that maximises the expectation of the return, which is the discounted sum of rewards: $G_t := \sum_{i=0}^{\infty} \gamma^i r_{t+i}$, where the discount factor $\gamma \in [0, 1]$ controls the importance of immediate rewards versus future rewards. Each policy $\pi$ has a corresponding action-value function that gives the expected return conditioned on the state and action, when acting according to that policy:

$$Q^{\pi}(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a, \pi] \quad (1)$$

The optimal policy $\pi^*$ can be found by iteratively improving an estimate of the optimal action-value function $Q^*(s, a) := \max_{\pi} Q^{\pi}(s, a)$, using sample-based updates. Once $Q^*$ is sufficiently accurately approximated, acting greedily with respect to it yields the optimal policy.

3.1 Hybrid Reward Architecture

The Q-value function is commonly estimated using a function approximator with weight vector $\theta$: $Q(s, a; \theta)$. DQN uses a deep neural network as function approximator and iteratively improves an estimate of $Q^*$ by minimising the sequence of loss functions:

$$L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[(y_i^{DQN} - Q(s, a; \theta_i))^2\right], \quad \text{with} \quad (2)$$
$$y_i^{DQN} = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}). \quad (3)$$

The weight vector from the previous iteration, $\theta_{i-1}$, is encoded using a separate target network. We refer to the Q-value function that minimises the loss function(s) as the training target. We will call a training target consistent if acting greedily with respect to it results in a policy that is optimal under the reward function of the environment; we call a training target semi-consistent if acting greedily with respect to it results in a good policy, but not an optimal one, under the reward function of the environment.
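The target computation in (2)-(3) can be sketched with a tabular stand-in for the Q-network; the toy state space, actions, and step-size below are illustrative, not from the paper:

```python
# Tabular stand-in for Q(s, a; theta): a dict mapping (state, action) -> value.
gamma = 0.95
actions = ["up", "down", "left", "right"]
q_online = {(s, a): 0.0 for s in range(4) for a in actions}
q_target = dict(q_online)  # frozen copy, playing the role of theta_{i-1}

def dqn_target(r, s_next):
    # y_i = r + gamma * max_a' Q(s', a'; theta_{i-1})  -- Equation (3)
    return r + gamma * max(q_target[(s_next, a)] for a in actions)

def q_learning_step(s, a, r, s_next, step_size=0.1):
    # One sample-based step on the squared loss (2): move Q(s, a) toward y_i.
    y = dqn_target(r, s_next)
    q_online[(s, a)] += step_size * (y - q_online[(s, a)])

q_learning_step(s=0, a="right", r=1.0, s_next=1)
print(q_online[(0, "right")])  # 0.1 with zero-initialised tables
```

In a deep version, the dict lookups become forward passes through the online and target networks, and the pointwise update becomes a gradient step on the squared error.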
For (2), the training target is $Q^*_{env}$, the optimal action-value function under $R_{env}$, which is the default consistent training target.

That a training target is consistent says nothing about how easy it is to learn that target. For example, if $R_{env}$ is sparse, the default learning objective can be very hard to learn. In this case, adding a potential-based additional reward signal to $R_{env}$ can yield an alternative consistent learning objective that is easier to learn. But a sparse environment reward is not the only reason a training target can be hard to learn. We aim to find an alternative training target for domains where the default training target $Q^*_{env}$ is hard to learn, due to the function being high-dimensional and hard to generalise for. Our approach is based on a decomposition of the reward function.

We propose to decompose the reward function $R_{env}$ into $n$ reward functions:

$$R_{env}(s, a, s') = \sum_{k=1}^{n} R_k(s, a, s'), \quad \text{for all } s, a, s', \quad (4)$$

and to train a separate reinforcement-learning agent on each of these reward functions. There are infinitely many different decompositions of a reward function possible, but to achieve value functions that are easy to learn, the decomposition should be such that each reward function is mainly affected by only a small number of state variables.

Because each agent $k$ has its own reward function, it also has its own Q-value function, $Q_k$. In general, different agents can share multiple lower-level layers of a deep Q-network. Hence, we will use a single vector $\theta$ to describe the combined weights of the agents. We refer to the combined network that represents all Q-value functions as the Hybrid Reward Architecture (HRA) (see Figure 1).
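The decomposition constraint in (4) is straightforward to enforce in code; a minimal sketch with a hypothetical two-component decomposition (a pellet reward and a ghost penalty, names illustrative only):

```python
# Hypothetical component reward functions; each depends on only part of the
# transition, which is what makes the per-component value functions simple.
def r_pellet(s, a, s_next):
    return 1.0 if s_next == "pellet" else 0.0

def r_ghost(s, a, s_next):
    return -10.0 if s_next == "ghost" else 0.0

components = [r_pellet, r_ghost]

def r_env(s, a, s_next):
    # Equation (4): the component rewards sum to the environment reward.
    return sum(r(s, a, s_next) for r in components)

# The decomposition reproduces the environment reward on every transition.
assert r_env(None, None, "pellet") == 1.0
assert r_env(None, None, "ghost") == -10.0
assert r_env(None, None, "empty") == 0.0
```

Each component function here is sensitive to a single feature of the next state, so a separate agent trained on it only needs that feature in its representation.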
Action selection for HRA is based on the sum of the agents' Q-value functions, which we call $Q_{HRA}$:

$$Q_{HRA}(s, a; \theta) := \sum_{k=1}^{n} Q_k(s, a; \theta), \quad \text{for all } s, a. \quad (5)$$

The collection of agents can be viewed alternatively as a single agent with multiple heads, with each head producing the action-values of the current state under a different reward function.

Figure 1: Illustration of Hybrid Reward Architecture (single-head architecture versus HRA).

The sequence of loss functions associated with HRA is:

$$L_i(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\sum_{k=1}^{n} (y_{k,i} - Q_k(s, a; \theta_i))^2\right], \quad \text{with} \quad (6)$$
$$y_{k,i} = R_k(s, a, s') + \gamma \max_{a'} Q_k(s', a'; \theta_{i-1}). \quad (7)$$

By minimising these loss functions, the different heads of HRA approximate the optimal action-value functions under the different reward functions: $Q^*_1, \ldots, Q^*_n$. Furthermore, $Q_{HRA}$ approximates $Q^*_{HRA}$, defined as:

$$Q^*_{HRA}(s, a) := \sum_{k=1}^{n} Q^*_k(s, a) \quad \text{for all } s, a.$$

Note that $Q^*_{HRA}$ is different from $Q^*_{env}$ and generally not consistent.

An alternative training target is one that results from evaluating the uniformly random policy $\upsilon$ under each component reward function: $Q^{\upsilon}_{HRA}(s, a) := \sum_{k=1}^{n} Q^{\upsilon}_k(s, a)$. $Q^{\upsilon}_{HRA}$ is equal to $Q^{\upsilon}_{env}$, the Q-values of the random policy under $R_{env}$, as shown below:

$$Q^{\upsilon}_{env}(s, a) = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i R_{env}(s_{t+i}, a_{t+i}, s_{t+1+i}) \,\middle|\, s_t = s, a_t = a, \upsilon\right]$$
$$= \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i \sum_{k=1}^{n} R_k(s_{t+i}, a_{t+i}, s_{t+1+i}) \,\middle|\, s_t = s, a_t = a, \upsilon\right]$$
$$= \sum_{k=1}^{n} \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i R_k(s_{t+i}, a_{t+i}, s_{t+1+i}) \,\middle|\, s_t = s, a_t = a, \upsilon\right]$$
$$= \sum_{k=1}^{n} Q^{\upsilon}_k(s, a) := Q^{\upsilon}_{HRA}(s, a).$$

This training target can be
learned using the expected Sarsa update rule (van Seijen et al., 2009), by replacing (7) with

$$y_{k,i} = R_k(s, a, s') + \gamma \frac{1}{|A|} \sum_{a' \in A} Q_k(s', a'; \theta_{i-1}). \quad (8)$$

Acting greedily with respect to the Q-values of a random policy might appear to yield a policy that is just slightly better than random, but, surprisingly, we found that for many navigation-based domains $Q^{\upsilon}_{HRA}$ acts as a semi-consistent training target.

3.2 Improving performance further by using high-level domain knowledge

In its basic setting, the only domain knowledge applied to HRA is in the form of the decomposed reward function. However, one of the strengths of HRA is that it can easily exploit more domain knowledge, if available. Domain knowledge can be exploited in one of the following ways:

1. Removing irrelevant features. Features that do not affect the received reward in any way (directly or indirectly) only add noise to the learning process and can be removed.

2. Identifying terminal states. Terminal states are states from which no further reward can be received; they have by definition a value of 0. Using this knowledge, HRA can refrain from approximating this value by the value network, such that the weights can be fully used to represent the non-terminal states.

3. Using pseudo-reward functions. Instead of updating a head of HRA using a component of the environment reward, it can be updated using a pseudo-reward. In this scenario, a set of GVFs is trained in parallel using pseudo-rewards.

While these approaches are not specific to HRA, HRA can exploit domain knowledge to a much greater extent, because it can apply these approaches to each head individually. We show this empirically in Section 4.1.

4 Experiments

4.1 Fruit Collection task

In our first domain, we consider an agent that has to collect fruits as quickly as possible in a 10 × 10 grid.
There are 10 possible fruit locations, spread out across the grid. For each episode, a fruit is randomly placed on 5 of those 10 locations. The agent starts at a random position. The reward is +1 if a fruit gets eaten and 0 otherwise. An episode ends after all 5 fruits have been eaten or after 300 steps, whichever comes first.

We compare the performance of DQN with HRA using the same network. For HRA, we decompose the reward function into 10 different reward functions, one per possible fruit location. The network consists of a binary input layer of length 110, encoding the agent's position and whether there is a fruit on each location. This is followed by a fully connected hidden layer of length 250. This layer is connected to 10 heads consisting of 4 linear nodes each, representing the action-values of the 4 actions under the different reward functions. Finally, the mean of all nodes across heads is computed using a final linear layer of length 4 that connects the output of corresponding nodes in each head. This layer has fixed weights with value 1 (i.e., it implements Equation 5). The difference between HRA and DQN is that DQN updates the network from the fourth layer using loss function (2), whereas HRA updates the network from the third layer using loss function (6).

Figure 2: The different network architectures used.

Besides the full network, we test using different levels of domain knowledge, as outlined in Section 3.2: 1) removing the irrelevant features for each head (providing only the position of the agent plus the corresponding fruit feature); 2) the above plus identifying terminal states; 3) the above plus using pseudo-rewards for learning GVFs to go to each of the 10 locations (instead of learning a value function associated to the fruit at each location). The advantage is that these GVFs can be trained even if there is no fruit at a location.
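The fixed final layer described above simply sums corresponding action-nodes across the 10 heads (Equation 5); a minimal sketch of that aggregation step, with random head outputs standing in for the trained network:

```python
import random

random.seed(0)
n_heads, n_actions = 10, 4  # one head per fruit location, 4 move actions

# Stand-in for the heads' outputs for the current state: in the real network
# these come from the shared hidden layer; here they are just random numbers.
head_q = [[random.uniform(0.0, 1.0) for _ in range(n_actions)]
          for _ in range(n_heads)]

# Fixed-weight final layer: Q_HRA(s, a) = sum_k Q_k(s, a)   (Equation 5)
q_hra = [sum(head_q[k][a] for k in range(n_heads)) for a in range(n_actions)]

# The aggregated values drive action selection.
greedy_action = max(range(n_actions), key=lambda a: q_hra[a])
print(greedy_action)
```

Because the summing layer has fixed weights of 1, no gradient needs to flow through it; HRA's loss (6) is applied directly to the heads beneath it.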
The head for a particular location copies the Q-values of the corresponding GVF if the location currently contains a fruit, or outputs 0s otherwise. We refer to these as HRA+1, HRA+2 and HRA+3, respectively. For DQN, we also tested a version that was applied to the same network as HRA+1; we refer to this version as DQN+1.

Training samples are generated by a random policy; the training process is tracked by evaluating the greedy policy with respect to the learned value function after every episode. For HRA, we performed experiments with $Q^*_{HRA}$ as training target (using Equation 7), as well as $Q^{\upsilon}_{HRA}$ (using Equation 8). Similarly, for DQN we used the default training target, $Q^*_{env}$, as well as $Q^{\upsilon}_{env}$. We optimised the step-size and the discount factor for each method separately.

The results are shown in Figure 3 for the best settings of each method. For DQN, using $Q^*_{env}$ as training target resulted in the best performance, while for HRA, using $Q^{\upsilon}_{HRA}$ resulted in the best performance. Overall, HRA shows a clear performance boost over DQN, even though the network is identical. Furthermore, adding different forms of domain knowledge causes further large improvements. Whereas using a network structure enhanced by domain knowledge improves the performance of HRA, using that same network for DQN results in a decrease in performance. The big boost in performance that occurs when the terminal states are identified is due to the representation becoming a one-hot vector.
Hence, we removed the hidden layer and directly fed this one-hot vector into the different heads. Because the heads are linear, this representation reduces to an exact, tabular representation. For the tabular representation, we used the same step-size as the optimal step-size for the deep network version.

Figure 3: Results on the fruit collection domain, in which an agent has to eat 5 randomly placed fruits. An episode ends after all 5 fruits are eaten or after 300 steps, whichever comes first. (Curves shown: DQN, HRA, and HRA with pseudo-rewards.)

4.2 ATARI game: Ms. Pac-Man

Our second domain is the Atari 2600 game Ms. Pac-Man (see Figure 4). Points are obtained by eating pellets, while avoiding ghosts (contact with one causes Ms. Pac-Man to lose a life). Eating one of the special power pellets turns the ghosts blue for a short duration, allowing them to be eaten for extra points. Bonus fruits can be eaten for further points, twice per level. When all pellets have been eaten, a new level is started. There are a total of 4 different maps and 7 different fruit types, each with a different point value. We provide full details on the domain in the supplementary material.

Figure 4: The game Ms. Pac-Man.

Baselines. While our version of Ms. Pac-Man is the same as used in literature, we use different preprocessing.
Hence, to test the effect of our preprocessing, we implement the A3C method (Mnih\net al., 2016) and run it with our preprocessing. We refer to the version with our preprocessing as\n\u2018A3C(channels)\u2019, the version with the standard preprocessing \u2018A3C(pixels)\u2019, and A3C\u2019s score reported\nin literature \u2018A3C(reported)\u2019.\n\nPreprocessing. Each frame from ALE is 210 \u00d7 160 pixels. We cut the bottom part and the top part\nof the screen to end up with 160 \u00d7 160 pixels. From this, we extract the position of the different\nobjects and create for each object a separate input channel, encoding its location with an accuracy of\n4 pixels. This results in 11 binary channels of size 40 \u00d7 40. Speci\ufb01cally, there is a channel for Ms.\nPac-Man, each of the four ghosts, each of the four blue ghosts (these are treated as different objects),\nthe fruit plus one channel with all the pellets (including power pellets). For A3C, we combine the 4\nchannels of the ghosts into a single channel, to allow it to generalise better across ghosts. We do the\nsame with the 4 channels of the blue ghosts. Instead of giving the history of the last 4 frames as done\nin literature, we give the orientation of Ms. Pac-Man as a 1-hot vector of length 4 (representing the 4\ncompass directions).\n\nHRA architecture. The environment reward signal corresponds with the points scored in the game.\nBefore decomposing the reward function, we perform reward shaping by adding a negative reward of\n-1000 for contact with a ghost (which causes Ms. Pac-Man to lose a life). After this, the reward is\ndecomposed in a way that each object in the game (pellet/fruit/ghost/blue ghost) has its own reward\nfunction. Hence, there is a separate RL agent associated with each object in the game that estimates a\nQ-value function of its corresponding reward function.\nTo estimate each component reward function, we use the three forms of domain knowledge discussed\nin Section 3.2. 
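The preprocessing described above maps object positions to binary channels at 4-pixel accuracy; a minimal sketch (the object position below is hypothetical):

```python
# Build binary input channels from object positions, one channel per object
# type, at 4-pixel accuracy: a 160x160 crop becomes a 40x40 grid of cells.
CELL = 4
SIZE = 40

def empty_channel():
    return [[0] * SIZE for _ in range(SIZE)]

def encode(positions):
    """positions: list of (x, y) pixel coordinates for one object type."""
    ch = empty_channel()
    for x, y in positions:
        ch[y // CELL][x // CELL] = 1  # quantise to the 4-pixel cell
    return ch

# Hypothetical example: Ms. Pac-Man at pixel (83, 57) lands in cell (20, 14).
pacman_channel = encode([(83, 57)])
print(pacman_channel[57 // CELL][83 // CELL])  # 1
```

Repeating this for each tracked object type yields the 11 binary 40x40 channels that replace the raw pixel input.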
HRA uses GVFs that learn pseudo Q-values (with values in the range [0, 1]) for getting to a particular location on the map (separate GVFs are learnt for each of the four maps). In contrast to the fruit collection task (Section 4.1), HRA learns part of its representation during training: it starts off with 0 GVFs and 0 heads for the pellets. By wandering around the maze, it discovers new map locations it can reach, resulting in new GVFs being created. Whenever the agent finds a pellet at a new location, it creates a new head corresponding to the pellet.

The Q-values for an object (pellet/fruit/ghost/blue ghost) are set to the pseudo Q-values of the GVF corresponding with the object's location (i.e., moving objects use a different GVF each time), multiplied by a weight that is set equal to the reward received when the object is eaten. If an object is not on the screen, all its Q-values are 0.

We test two aggregator types. The first is a linear one that sums the Q-values of all heads (see Equation 5). For the second, we take the sum of all the heads that produce points and normalise the resulting Q-values; then, we add the sum of the Q-values of the heads of the regular ghosts, multiplied by a weight vector.

For exploration, we test two complementary types of exploration. Each type adds an extra exploration head to the architecture. The first type, which we call diversification, produces random Q-values, drawn from a uniform distribution over [0, 20]. We find that it is only necessary during the first 50 steps, to ensure starting each episode randomly. The second type, which we call count-based, adds a bonus for state-action pairs that have not been explored a lot. It is inspired by upper confidence bounds (Auer et al., 2002).
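A count-based head of this kind can be approximated with a generic UCB-style bonus; the exact form used here is given in the supplementary material, so the scaling below is a hypothetical sketch in the spirit of Auer et al. (2002):

```python
import math
from collections import defaultdict

counts = defaultdict(int)  # visit counts for (state, action) pairs
total = 0                  # total number of recorded visits

def exploration_bonus(s, a, scale=1.0):
    # UCB-inspired bonus: large for rarely tried pairs, shrinking with visits.
    return scale * math.sqrt(math.log(total + 1) / (counts[(s, a)] + 1))

def record_visit(s, a):
    global total
    counts[(s, a)] += 1
    total += 1

record_visit("s0", "up")
# The untried pair now receives a larger bonus than the visited one.
print(exploration_bonus("s0", "left") > exploration_bonus("s0", "up"))  # True
```

Exposing the bonus as just another head lets the aggregator trade exploration off against the reward-driven heads without changing their training.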
Full details can be found in the supplementary material.

For our final experiment, we implement a special head inspired by the executive-memory literature (Fuster, 2003; Gluck et al., 2013). When human game players reach the limit of their cognitive and physical abilities, they start to look for favourable situations or even glitches and memorise them. This cognitive process amounts to memorising a sequence of actions (also called a habit), and is not necessarily optimal. Our executive-memory head records every sequence of actions that led to passing a level without losing a life. Then, when facing the same level, the head gives a very high value to the recorded action, in order to force the aggregator's selection. Note that our simplified version of executive memory does not generalise.

Evaluation metrics. There are two different evaluation methods used across the literature, which result in very different scores. Because ALE is ultimately a deterministic environment (it implements pseudo-randomness using a random number generator that always starts with the same seed), both evaluation metrics aim to create randomness in the evaluation, in order to rate methods with more generalising behaviour higher. The first metric introduces a mild form of randomness by taking a random number of no-op actions before control is handed over to the learning algorithm. In the case of Ms. Pac-Man, however, the game starts with an inactive period that exceeds the maximum number of no-op steps, so the game has a fixed start after all. The second metric selects random starting points along a human trajectory; this produces much stronger randomness and does result in the intended random-start evaluation.
We refer to these metrics as 'fixed start' and 'random start'.

Table 1: Final scores.

method          | fixed start | random start
best reported   |       6,673 |        2,251
human           |      15,693 |       15,375
A3C (reported)  |           — |          654
A3C (pixels)    |       2,168 |          626
A3C (channels)  |       2,423 |          589
HRA             |      25,304 |       23,770

Results. Figure 5 shows the training curves; Table 1 shows the final score after training. The best reported fixed start score comes from STRAW (Vezhnevets et al., 2016); the best reported random start score comes from the Dueling network architecture (Wang et al., 2016). The human fixed start score comes from Mnih et al. (2015); the human random start score comes from Nair et al. (2015). We train A3C for 800 million frames. Because HRA learns fast, we train it only for 5,000 episodes, corresponding to about 150 million frames (note that better policies result in more frames per episode). We tried a few different settings for HRA: with/without normalisation and with/without each type of exploration. The score shown for HRA uses the best combination: with normalisation and with both exploration types. All combinations achieved over 10,000 points in training, except the combination with no exploration at all, which, not surprisingly, performed very poorly. With the best combination, HRA not only outperforms the state-of-the-art on both metrics, it also significantly outperforms the human score, convincingly demonstrating the strength of HRA.

Comparing A3C (pixels) and A3C (channels) in Table 1 reveals a surprising result: while we use advanced preprocessing by separating the screen image into relevant object channels, this did not significantly change the performance of A3C.

In our final experiment, we test how well HRA does if it exploits the weakness of the fixed-start evaluation metric by using a simplified version of executive memory.
Using this version, we not only surpass the human high-score of 266,330 points,¹ we achieve the maximum possible score of 999,990 points in less than 3,000 episodes. Progress is slow in the first stages, because the model still has to be trained, but even though the later levels get more and more difficult, level completion speeds up as the agent takes advantage of already knowing the maps. Obtaining more points is impossible, not because the game ends, but because the score overflows to 0 when reaching a million points.²

¹See highscore.com: 'Ms. Pac-Man (Atari 2600 Emulated)'.
²For a video of HRA's final trajectory reaching this point, see: https://youtu.be/VeXNw0Owf0Y

Figure 5: Training smoothed over 100 episodes. Figure 6: Training with trajectory memorisation.

5 Discussion

One of the strengths of HRA is that it can exploit domain knowledge to a much greater extent than single-head methods. This is clearly shown by the fruit collection task: while removing irrelevant features improved the performance of HRA, the performance of DQN decreased when provided with the same network architecture. Furthermore, separating the pixel image into multiple binary channels only makes a small improvement in the performance of A3C over learning directly from pixels. This demonstrates that the reason that modern deep RL methods struggle with Ms. Pac-Man is not related to learning from pixels; the underlying issue is that the optimal value function for Ms. Pac-Man cannot easily be mapped to a low-dimensional representation.

HRA solves Ms. Pac-Man by learning close to 1,800 general value functions. This results in an exponential breakdown of the problem size: whereas the input state-space corresponding with the binary channels is in the order of 10^77 states, each GVF has a state-space in the order of 10^3 states, small enough to be represented without any function approximation.
While we could have used a deep network for representing each GVF, using a deep network for such small problems hurts more than it helps, as evidenced by the experiments on the fruit collection domain.

We argue that many real-world tasks allow for reward decomposition. Even if the reward function can only be decomposed into two or three components, this can already help a lot, due to the exponential decrease in problem size that decomposition can cause.

References

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Bacon, P., Harb, J., and Precup, D. The option-critic architecture. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), 2017.

Barto, A. G. and Mahadevan, S. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Diuk, C., Cohen, A., and Littman, M. L. An object-oriented representation for efficient reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, 2008.

Fuster, J. M. Cortex and Mind: Unifying Cognition. Oxford University Press, 2003.

Gluck, M. A., Mercado, E., and Myers, C. E.
Learning and Memory: From Brain to Behavior. Palgrave Macmillan, 2013.

Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2017.

Kulkarni, T. D., Narasimhan, K. R., Saeedi, A., and Tenenbaum, J. B. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems 29, 2016.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1928–1937, 2016.

Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.

Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, 1999.

Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 2013.

Russell, S. and Zimdars, A. L. Q-decomposition for reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning, 2003.

Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

Schmidhuber, J. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.

Sprague, N. and Ballard, D. Multiple-goal reinforcement learning with modular sarsa(0). In International Joint Conference on Artificial Intelligence, 2003.

Stout, A., Konidaris, G., and Barto, A. G. Intrinsically motivated reinforcement learning: A promising framework for developmental robotics. In The AAAI Spring Symposium on Developmental Robotics, 2005.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2011.

Sutton, R. S., Precup, D., and Singh, S. P. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

Szepesvári, C. Algorithms for Reinforcement Learning. Morgan and Claypool, 2009.

van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. A theoretical and empirical analysis of expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 177–184, 2009.

Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., and Kavukcuoglu, K. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems 29, 2016.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.