{"title": "Discovery of Useful Questions as Auxiliary Tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 9310, "page_last": 9321, "abstract": "Arguably, intelligent agents ought to be able to discover their own questions so that in learning answers for them they learn unanticipated useful knowledge and skills; this departs from the focus in much of machine learning on agents learning answers to externally defined questions.  We present a novel method for a reinforcement learning (RL) agent to discover questions formulated as general value functions or GVFs, a fairly rich form of knowledge representation.  Specifically, our method uses non-myopic meta-gradients to learn GVF-questions such that learning answers to them, as an auxiliary task, induces useful representations for the main task faced by the RL agent.  We demonstrate that auxiliary tasks based on the discovered GVFs are sufficient, on their own, to build representations that support main task learning, and that they do so better than popular hand-designed auxiliary tasks from the literature.  Furthermore, we show, in the context of Atari2600 videogames, how such auxiliary tasks, meta-learned alongside the main task, can improve the data efficiency of an actor-critic agent.", "full_text": "Discovery of Useful Questions as Auxiliary Tasks\n\nVivek Veeriah1\n\nMatteo Hessel2\n\nZhongwen Xu2\n\nRichard Lewis1\n\nJanarthanan Rajendran1\n\nJunhyuk Oh2\n\nHado van Hasselt2\n\nDavid Silver2\n\nSatinder Singh1,2\n\nAbstract\n\nArguably, intelligent agents ought to be able to discover their own questions so\nthat in learning answers for them they learn unanticipated useful knowledge and\nskills; this departs from the focus in much of machine learning on agents learning\nanswers to externally de\ufb01ned questions. 
We present a novel method for a reinforcement\nlearning (RL) agent to discover questions formulated as general value\nfunctions or GVFs, a fairly rich form of knowledge representation. Specifically,\nour method uses non-myopic meta-gradients to learn GVF-questions such that\nlearning answers to them, as an auxiliary task, induces useful representations for\nthe main task faced by the RL agent. We demonstrate that auxiliary tasks based\non the discovered GVFs are sufficient, on their own, to build representations that\nsupport main task learning, and that they do so better than popular hand-designed\nauxiliary tasks from the literature. Furthermore, we show, in the context of Atari\n2600 videogames, how such auxiliary tasks, meta-learned alongside the main task,\ncan improve the data efficiency of an actor-critic agent.\n\nAn increasingly important component of recent approaches to developing flexible, autonomous\nagents is posing useful questions about the future for the agent to learn to answer from experience.\nThe questions can take many forms and serve many purposes. The answers to prediction or control\nquestions about suitable features of states may directly form useful representations of state (Singh\net al., 2004). Alternatively, prediction and control questions may define auxiliary tasks that drive\nrepresentation learning in aid of a main task (Jaderberg et al., 2017). Goal-conditional questions\nmay also drive the acquisition of a diverse set of skills, even before the main task is known, forming\na basis for policy composition or exploration (Andrychowicz et al., 2016; Veeriah et al., 2018;\nEysenbach et al., 2018; Florensa et al., 2018; Mankowitz et al., 2018; Riedmiller et al., 2018).\nIn this paper, we consider questions in the form of general value functions (GVFs, Sutton et al.,\n2011), with the purpose of using the discovered GVFs as auxiliary tasks to aid the learning of a main\nreinforcement learning (RL) task. 
We chose the GVF formulation for its flexibility: according to the\nreward hypothesis (Sutton & Barto, 2018), any goal might be formulated in terms of a scalar signal,\nor cumulant (White, 2015), whose discounted sum must be maximized. Additionally, GVF-based\nauxiliary tasks have been shown in previous work to improve the sample efficiency of reinforcement\nlearning agents engaged in learning complex tasks (Mirowski et al., 2017; Jaderberg et al., 2017).\nIn the literature, GVF-based auxiliary tasks typically required an agent to estimate discounted sums\nof suitable handcrafted functions of state, cumulants in the GVF terminology, under handcrafted discount\nfactors. It was then shown that by combining gradients from learning the auxiliary GVFs with\nthe updates from the main task, it was possible to accelerate representation learning and improve\nperformance. It fell, however, to the algorithm designer to design questions that were useful for\nthe specific task. This is a limitation because not all questions are equally well aligned with the main\ntask (Bellemare et al., 2019), and whether this is the case may be hard to predict in advance.\n\n1 University of Michigan, Ann Arbor. Corresponding author: Vivek Veeriah <vveeriah@umich.edu>\n2 DeepMind, London.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe paper makes three contributions. First, we propose a principled general method for the automated\ndiscovery of questions in the form of GVFs, for use as auxiliary tasks. The main idea is to use\nmeta-gradient RL to discover the questions so that answering them maximises the usefulness of the\ninduced representation on the main task. This removes the need to hand-design auxiliary tasks that\nare matched to the environment or agent. 
Our second contribution is to empirically demonstrate the\nsuccess of non-myopic meta-gradient RL in large, challenging domains as opposed to the approximate\nand myopic meta-gradient methods from previous work (Xu et al., 2018; Zheng et al., 2018);\nthe non-myopic calculation of the meta-gradient proved essential to successfully learn useful questions\nand should be applicable more broadly to other applications of meta-gradients. Finally, we demonstrate\nin the context of Atari 2600 videogames that such discovery of auxiliary tasks can improve\nthe data efficiency of an actor-critic agent, when these are meta-learned alongside the main task.\n\n1 Background\n\nBrief background on GVFs: Standard value functions in RL define a question and its answer; the\nquestion is \u201cwhat is the discounted sum of future rewards under some policy?\u201d and the answer is the\napproximate value function. General value functions, or GVFs, generalize the standard value\nfunction to allow for arbitrary cumulant functions of states in place of rewards, and are specified\nby the combination of such a cumulant function with a discount factor and a policy. This generalization\nof standard value functions allows GVFs to express quite general predictive knowledge and,\nnotably, temporal-difference (TD) methods for learning value functions can be extended to learn the\npredictions/answers of GVFs. We refer to Sutton et al. (2011) for additional details.\nPrior work on auxiliary tasks in RL: Jaderberg et al. (2017) explored extensively the potential,\nfor RL agents, of jointly learning the representation used for solving the main task and a number\nof GVF-based auxiliary tasks, such as pixel-control and feature-control tasks based on controlling\nchanges in pixel intensities and feature activations; this class of auxiliary tasks was also used in\nthe multi-task setting by Hessel et al. (2019a). 
Other recent examples of auxiliary tasks include\ndepth and loop closure classification (Mirowski et al., 2017), observation reconstruction, reward\nprediction, inverse dynamics prediction (Shelhamer et al., 2017), and many-goals learning (Veeriah\net al., 2018). A geometrical perspective on auxiliary tasks was introduced by Bellemare et al. (2019).\nPrior work on meta-learning: Recently, there has been a lot of interest in exploring meta-learning\nor learning to learn. A meta-learner progressively improves the learning process of a\nlearner (Schmidhuber et al., 1996; Thrun & Pratt, 1998) that is attempting to solve some task.\nRecent work on meta-learning includes learning good policy initializations that can be quickly\nadapted to new tasks (Finn et al., 2017; Al-Shedivat et al., 2018), improving few-shot learning\nperformance (Mishra et al., 2018; Duan et al., 2017; Snell et al., 2017), learning to explore (Stadie\net al., 2018), unsupervised learning (Gupta et al., 2018; Hsu et al., 2018), few-shot model adaptation\n(Nagabandi et al., 2018), and improving the optimizers (Andrychowicz et al., 2016; Li & Malik,\n2017; Ravi & Larochelle, 2017; Wichrowska et al., 2017; Chen et al., 2016; Gupta et al., 2018).\nPrior work on meta-gradients: Xu et al. (2018) formalized meta-gradients, a form of meta-learning\nwhere the meta-learner is trained via gradients through the effect of the meta-parameters\non a learner also trained via gradients. In contrast to much work in meta-learning that focuses on\nmulti-task learning, Xu et al. (2018) formalized the use of meta-gradients in a way that is applicable\nalso to the single task setting, although not limited to it. They illustrated their approach by using\nmeta-gradients to adapt both the discount factor \u03b3 and the bootstrapping factor \u03bb of a reinforcement\nlearning agent, substantially improving performance of an actor-critic agent on many Atari games.\nConcurrently, Zheng et al. 
(2018) used meta-gradients to learn intrinsic rewards, demonstrating that\nmaximizing a sum of extrinsic and intrinsic rewards could improve an agent\u2019s performance on a\nnumber of Atari games and MuJoCo tasks. Xu et al. (2018) discussed the possibility of computing\nmeta-gradients in a non-myopic manner, but their proposed algorithm, as that of Zheng et al. (2018),\nintroduced a severe approximation and only measured the immediate consequences of an update.\n\nFigure 1: An architecture for discovery: On the left, the main task and answer network with parameters \u03b8; it\ntakes past observations as input and parameterises (directly or indirectly) a policy \u03c0 as well as the answers to\nthe GVF questions. On the right, the question network with parameters \u03b7; it takes future observations as input\nand parameterises the cumulants and discounts that specify the GVFs.\n\n2 The discovery of useful questions\n\nIn this section we present a neural network architecture and a principled meta-gradient algorithm for\nthe discovery of GVF-based questions for use as auxiliary tasks in the context of deep RL agents.\n\n2.1 A neural network architecture for discovery\n\nThe neural network architecture we consider features two networks: the first, on the left in Figure 1,\ntakes the last i observations o_{t\u2212i+1:t} as inputs, and parameterises (directly or indirectly) a policy \u03c0\nfor the main reinforcement learning task, together with GVF-predictions for a number of discovered\ncumulants and discounts. We use \u03b8 to denote the parameters of this first network. The second\nnetwork, referred to as the question network, is depicted on the right in Figure 1. 
It takes as inputs\nj future observations o_{t+1:t+j} and, through the meta-parameters \u03b7, computes the values of a set of\ncumulants u_t and their corresponding discounts \u03b3_t (both u_t and \u03b3_t are therefore vectors).\nThe use of future observations o_{t+1:t+j} as inputs to the question network requires us to wait for j steps\nto unfold before computing the cumulants and discounts; this is acceptable because the question\nand answer networks are only used during training, and neither is needed for action selection. As\ndiscussed in Section 1, a GVF-question is specified by a cumulant function, a discount function and\na policy. In our method, the question network only explicitly parameterises discounts and cumulants\nbecause we consider on-policy GVFs, and therefore the policy will always be, implicitly, the latest\nmain-task policy \u03c0. Note, however, that since each cumulant is a function of future observations,\nwhich are influenced by the actions chosen by the main task policy, the cumulant and discount\nfunctions are non-stationary, not just because we are learning the question network parameters, but\nalso because the main-task policy itself is changing as learning progresses.\nPrevious work on auxiliary tasks in reinforcement learning may be interpreted as just using the network\non the left, as the cumulant functions were handcrafted and did not have any (meta-)learnable\nparameters; the availability of a separate \u201cquestion network\u201d is a critical component of our approach\nto discovery, as it enables the agent to discover from experience the most suitable questions about\nthe future to be used as auxiliary tasks. The terminology of question and answer networks is derived\nfrom work on TD networks (Sutton & Tanner, 2005). See Makino & Takagi (2008) and Schlegel\net al. 
(2018) for related work on incremental discovery of the structure of TD-networks and GVF-networks\n(work that does not, however, use meta-gradients and was applied only to relatively simple\ndomains).\n\n2.2 Multi-step meta-gradients\n\nIn their most abstract form, reinforcement learning algorithms can be described by an update procedure\n\u2206\u03b8_t that modifies, on each step t, the agent\u2019s parameters \u03b8_t. The central idea of meta-gradient\nRL is to parameterise the update \u2206\u03b8_t(\u03b7) by meta-parameters \u03b7. We may then consider the consequences\nof changing \u03b7 on the \u03b7-parameterised update rule by measuring the subsequent performance\nof the agent, in terms of a \u201cmeta-loss\u201d function m(\u03b8_{t+k}). Such a meta-loss may be evaluated after\none update (myopic) or k > 1 updates (non-myopic). The meta-gradient is then, by the chain rule,\n\n\u2202m(\u03b8_{t+k})/\u2202\u03b7 = (\u2202m(\u03b8_{t+k})/\u2202\u03b8_{t+k}) (\u2202\u03b8_{t+k}/\u2202\u03b7).    (1)\n\nAlgorithm 1 Multi-Step Meta-Gradient Discovery of Questions for Auxiliary Tasks\n\nInitialize parameters \u03b8, \u03b7\nfor t = 1, 2, ..., N do\n    \u03b8_{t,0} \u2190 \u03b8_t\n    for k = 1, 2, ..., L do\n        Generate experience using parameters \u03b8_{t,k\u22121}\n        \u03b8_{t,k} \u2190 \u03b8_{t,k\u22121} \u2212 \u03b1'\u2207_{\u03b8_{t,k\u22121}} L^{RL}(\u03b8_{t,k\u22121}) \u2212 \u03b1'\u2207_{\u03b8_{t,k\u22121}} L^{ans}(\u03b8_{t,k\u22121})\n    end for\n    \u03b7_{t+1} \u2190 \u03b7_t \u2212 \u03b1\u2207_\u03b7 \u2211_{k=1}^{L} L^{RL}(\u03b8_{t,k})\n    \u03b8_{t+1} \u2190 \u03b8_{t,L}\nend for\n\nImplicit in Equation 1 is that changing the meta-parameters \u03b7 at one time step affects not just the\nimmediate update to \u03b8 on the next time step, but all future updates. This makes the meta-gradient\nchallenging to compute. 
A straightforward but effective way to capture the multi-step effects of\nchanging \u03b7 is to build a computational graph which consists of a sequence of updates made to the\nparameters \u03b8, \u03b8_t \u2192 ... \u2192 \u03b8_{t+k} with \u03b7 held fixed, ending with a meta-loss evaluation m(\u03b8_{t+k}). The\nmeta-gradient \u2202m(\u03b8_{t+k})/\u2202\u03b7 may be efficiently computed from this graph through backward-mode\nautodifferentiation; this has a computational cost similar to that of the forward computation (Griewank\n& Walther, 2008), but it requires storage of k copies of the parameters \u03b8_{t:t+k}, thus increasing the\nmemory footprint. We emphasize that this approach is in contrast to the myopic meta-gradient used\nin previous work, which either ignores effects past the first time step, or makes severe approximations.\n\n2.3 A multi-step meta-gradient algorithm for discovery\n\nWe apply the meta-gradient algorithm, as presented in Section 2.2, to the discovery of GVF-based\nauxiliary tasks represented as in the neural network architecture from Section 2.1. The complete\npseudocode for the proposed approach to discovery is outlined in Algorithm 1.\nOn each iteration t of the algorithm, in an inner loop we apply L updates to the agent parameters \u03b8,\nwhich parameterise the main-task policy and the GVF answers, using separate samples of experience\nin an environment. 
Then, in the outer loop, we apply a single update to the meta-parameters \u03b7 (the\nquestion network that parameterises cumulant and discount functions that define the GVFs), based\non the effect of the updates to \u03b8 on the meta-loss; next, we make each of these steps explicit.\nThe inner update includes two components: the first is a canonical deep reinforcement learning\nupdate using a loss denoted L^{RL} for optimizing the main-task policy \u03c0_t, either directly (as in policy-based\nalgorithms, e.g., Williams (1992)) or indirectly (as in value-based algorithms, e.g., Watkins\n(1989)). The second component is an update rule for estimating the answers to GVF-based questions.\nWith slight abuse of notation, we can then denote each inner-loop update as the following\ngradient descent step on the pseudo losses denoted with L^{RL} and L^{ans}:\n\n\u03b8_{t,k} \u2190 \u03b8_{t,k\u22121} \u2212 \u03b1'\u2207_{\u03b8_{t,k\u22121}} L^{RL}(\u03b8_{t,k\u22121}) \u2212 \u03b1'\u2207_{\u03b8_{t,k\u22121}} L^{ans}(\u03b8_{t,k\u22121}).    (2)\n\nThe meta-loss m is the sum of the RL pseudo losses associated with the main task updates, as\ncomputed on the batches generated in the inner loop; it is a function of meta-parameters \u03b7 through\nthe updates to the answers. We can therefore compute the update to the meta-parameters\n\n\u03b7_{t+1} \u2190 \u03b7_t \u2212 \u03b1\u2207_\u03b7 \u2211_{k=1}^{L} L^{RL}(\u03b8_{t,k}).    (3)\n\nThis meta-gradient procedure optimizes the area under the curve over the temporal span defined by\nthe inner unroll length L. Alternatively, the meta-loss may be evaluated on the last batch alone, to\noptimize for final performance. Unless we specify otherwise, we use the area under the curve.\n\n2.4 An actor critic agent with discovery of questions for auxiliary tasks\n\nIn this section we describe a concrete instantiation of the algorithm in the context of an actor-critic\nreinforcement learning agent. 
The network on the left of Figure 1 is composed of three modules:\n1) an encoder network that takes the last i observations o_{t\u2212i+1:t} as inputs, and outputs a state\nrepresentation x_t; 2) a main task network that, given the state x_t, estimates both the policy \u03c0 and a\nstate value function v (Sutton, 1988); 3) an answer network that, given the state x_t, approximates the\nGVF answers. In this paper, functions \u03c0, v and y will be linear functions of state x_t.\nThe main-task network parameters {\u03b8^{main}} are only affected by the RL component of the update defined\nin Equation 2. In an actor-critic agent, \u03b8^{main} is the union of the parameters \u03b8^v of the state\nvalues v and the parameters \u03b8^\u03c0 of the softmax policy \u03c0. Therefore the update \u2212\u03b1\u2207_{\u03b8^{main}} L^{RL} is the\nsum of a value update \u2212\u03b1\u2207_{\u03b8^v} L^{RL} = \u03b1(G^v_t \u2212 v(x_t)) \u2202v(x_t)/\u2202\u03b8^v and a policy update\n\u2212\u03b1\u2207_{\u03b8^\u03c0} L^{RL} = \u03b1(G^v_t \u2212 v(x_t)) \u2202 log \u03c0(a_t|x_t)/\u2202\u03b8^\u03c0, where G^v_t = (\u2211_{j=0}^{W} \u03b3^j R_{t+j+1}) + \u03b3^{W+1} v(x_{t+W+1}) is a multi-step\ntruncated return, using the agent\u2019s estimates v of the state values for bootstrapping after W steps.\nThe answer network parameters {\u03b8^y}, instead, are only affected by the second term of the update in\nEquation 2. Since the answers estimate on-policy, under \u03c0, an expected cumulative discounted sum\nof cumulants, we may use a generalized temporal difference learning algorithm to update \u03b8^y. In our\nagents, the vector y is a linear function of state, and therefore each GVF prediction y_i is separately\nparameterised by \u03b8^{y_i} \u2286 \u03b8^y. 
The update \u2212\u03b1\u2207_{\u03b8^{y_i}} L^{ans} for parameters \u03b8^y may then be written as\n\u03b1(G^{y_i}_t \u2212 y_i(x_t)) \u2207_{\u03b8^{y_i}} y_i(x_t), where G^{y_i}_t is the multi-step, truncated, \u03b3_i-discounted sum of cumulants\nu_i from time t onwards. As in the main task updates, the notation G^{y_i}_t highlights that we use the\nanswer network\u2019s own estimates y_i(x_t) = x_t^T \u03b8^{y_i} to bootstrap after a fixed number of steps.\nThe main-task and answer-network pseudo losses L^{RL}, L^{ans} used in the updates above can also\nbe straightforwardly used to instantiate Equation 2 for the parameters \u03b8^{enc} of the encoder network,\nand to instantiate Equation 3 for the parameters \u03b7 of the question network. For the shared state\nrepresentation, \u03b8^{enc}, we explore two updates: (1) using the gradients from both the main task and the\nanswer network, i.e., \u2212\u03b1'\u2207_{\u03b8_{k\u22121}} L^{RL}(\u03b8_{k\u22121}) \u2212 \u03b1'\u2207_{\u03b8_{k\u22121}} L^{ans}(\u03b8_{k\u22121}), and (2) using only the gradients\nfrom the answer network, \u2212\u03b1'\u2207_{\u03b8^{enc}_{k\u22121}} L^{ans}(\u03b8_{k\u22121}). Using both the main-task and the answer network\ncomponents is more consistent with the existing literature on auxiliary tasks, but ignoring the main-task\nupdates provides a more stringent test of whether the algorithm is capable of meta-learning\nquestions that can drive, even on their own, the learning of adequate state representations.\n\n3 Experimental setup\n\nIn this section we outline the experimental setup, including the environments we used as test-beds\nand the high level agent and neural network architectures. We refer to the Appendix for more details.\n\n3.1 Domains\n\nPuddleworld domain: a continuous state gridworld domain (Degris et al., 2012), where the state\nspace is a 2-dimensional position in [0, 1]^2. The agent has 5 actions, where four of these actions\nmove the agent in one of the four cardinal directions by a mean offset of 0.05 and the last action\nhas an offset of 0. 
The actions have a stochastic effect on the environment because, on each step,\nuniform noise sampled in the range [\u22120.025, 0.025] is added to each action component. We refer to\nDegris et al. (2012) for further details about this environment.\nCollect-objects domain: a four-room gridworld, where the agent is rewarded for collecting two\nobjects in the right order. The agent moves deterministically in one of four cardinal directions. For\neach episode the starting position is chosen randomly. The locations of the two objects are the same\nacross episodes. The agent receives a reward of 1 for picking up the first object and a reward of 2\nfor picking up the second object after the first one. The maximum length of each episode is 40.\nAtari domain: the Atari games were designed to be challenging and fun for human players, and\nwere packaged up into a canonical benchmark for RL agents: the Arcade Learning Environment\n(Bellemare et al., 2013; Mnih et al., 2015, 2016; Schulman et al., 2015, 2017; Hessel et al., 2018).\nWhen summarizing results on this benchmark, we follow the common approach of first normalizing\nscores on each game using the scores of random and human agents (van Hasselt et al., 2016).\n\n3.2 Our agents\n\nFor the gridworld experiments, we implemented meta-gradients on top of a 5-step actor-critic agent\nwith 16 parallel actor threads (Mnih et al., 2016). For the Atari experiments, we used a 20-step\nIMPALA (Espeholt et al., 2018) agent with 200 distributed actors. In the non-visual domain of\nPuddleworld, the encoder is a simple MLP with two fully-connected layers. In other domains the\nencoder is a convolutional neural network. The main-task value and policy, and the answer network,\nare all linear functions of the state x_t. In the gridworlds the question network outputs a set of cumulants,\nand the discount factor that jointly defines the GVFs is hand-tuned. 
In our Atari experiments\nthe question network outputs both the cumulants and the corresponding discounts. In all experiments\nwe report scores and curves averaging results from 3 independent runs of each agent, task or\nhyperparameter configuration. In Atari we use a single set of hyper-parameters across all games.\n\n3.3 Baselines: handcrafted questions as auxiliary tasks\n\nIn our experiments we consider the following baseline auxiliary tasks from the literature.\nReward prediction: This baseline agent has no question network. Instead it uses the scalar reward\nobtained at the next time step as the target for the answer network. The auxiliary task loss function\nfor the reward prediction baseline is L^{ans} = [y_t(x_t) \u2212 r_{t+1}]^2.\nPixel control: This baseline also has no question network. The auxiliary task is to learn to optimally\ncontrol changes in pixel intensities. Specifically, the answer network must estimate optimal\naction values for cumulants c_i corresponding to the average absolute change in pixel intensities,\nbetween consecutive (in time) observations, for each cell i in an n \u00d7 n non-overlapping grid overlaid\nonto the observation. The auxiliary loss function for the action values of the ith cell is:\nL^{ans}_i = (1/2) E_{s,a,s'\u223cD} ||G^{c_i} + \u03b3 max_{a'} q^\u2212_i(s', a') \u2212 q_i(s, a)||^2, where G^{c_i} refers to the discounted sum of\npseudo-rewards for the ith cell. 
The auxiliary loss is summed over the entire grid: L^{ans} = \u2211_i L^{ans}_i.\nRandom questions: This baseline agent is the same as our meta-gradient based agent except that the\nquestion network is kept fixed at its randomly initialized parameters through training. The answer\nnetwork is still trained to predict values for the cumulants defined by the fixed question network.\n\n4 Empirical findings\n\nIn this section, we empirically investigate the performance of the proposed algorithm for discovery,\nas instantiated in Section 2.4. We refer to our meta-learning agent as the \u201cDiscovered GVFs\u201d agent.\nOur experiments address the following questions:\n\n1. Can meta-gradients discover GVF-questions such that learning the answers to them is sufficient,\non its own, to build representations good enough for solving complex RL tasks? We\nrefer to these as the \u201crepresentation learning\u201d experiments.\n\n2. Can meta-gradients discover GVF-questions such that learning to answer these alongside\nthe main task improves the data efficiency of an RL agent? In these experiments the representation\nis shaped by both the updates based on the discovered GVFs as well as the main\ntask updates; we will thus refer to these as the \u201cjoint learning\u201d experiments.\n\n3. In both settings, how do auxiliary tasks discovered via meta-gradients compare to handcrafted\ntasks from the literature? Also, how is performance affected by design decisions\nsuch as the number of questions, the number of inner steps used to compute meta-gradients,\nand the choice between area under the curve versus final loss as meta-objective?\n\nWe note that the \u201crepresentation learning\u201d experiments are a more stringent test of our meta-learning\nalgorithm for discovery, compared to the \u201cjoint learning\u201d experiments. However, the latter is consistent\nwith the literature on auxiliary tasks and can be more useful in practice.\n\n4.1 Representation learning experiments\n\nIn these experiments, the parameters of the encoder network are unaffected by gradients from the\nmain-task updates. 
Figures 2 and 3 compare the performance of our meta-gradient agents to the\nbaseline agents that train the state representation using the hand-crafted auxiliary tasks described\nin Section 3.3. We always include a reference curve (in black) corresponding to the baseline actor-critic\nagent with no answer or question networks, where the representation is trained directly using\nthe main-task updates. We report results for the Collect-objects domain, Puddleworld, and three\nAtari games (more are reported in the Appendix). From the experiments we highlight the following:\n\nFigure 2: Mean return on Collect-Objects (Left) and Puddleworld (Right) for the \u201cDiscovered GVFs\u201d agent\n(red), alongside the \u201cRandom GVFs\u201d (blue) and \u201cReward Prediction\u201d (purple) baselines. The dashed (black)\nline is the final performance of an actor-critic whose representation is trained using the main task updates.\n\nFigure 3: Mean episode return on 3 Atari domains for the \u201cDiscovered GVFs\u201d agent (red), alongside the\n\u201cRandom GVFs\u201d (blue), \u201cReward Prediction\u201d (purple) and \u201cPixel Control\u201d (green) baselines. The dashed\n(black) line is the final performance of an actor-critic whose representation is trained with the main task updates.\n\nFigure 4: Mean episode return on 3 Atari domains for two \u201cDiscovered GVFs\u201d agents optimizing the \u201cSummed\nMeta-Loss\u201d (red) and the \u201cEnd Meta-Loss\u201d (Orange), respectively. The dashed (black) line is the final performance\nof an actor-critic whose representation is trained with the main task updates.\n\nFigure 5: Parameter studies, on Collect-Objects, for \u201cDiscovered GVFs\u201d agent, as a function of the number of\nquestions used as auxiliary tasks (on the left) and the number of steps unrolled to compute the meta-gradient\n(on the right). 
The dashed and solid red lines correspond to the final and average episode return, respectively.\n\nDiscovery: in all the domains, we found evidence that the state representation learned solely through\nlearning the GVF-answers to the discovered questions was sufficient to support learning good policies.\nSpecifically, in the two gridworld domains the resulting policies were optimal (see Figure 2);\nin the Atari domains the resulting policies were comparable to those achieved by the state-of-the-art\nIMPALA agent after training for 200M frames (see Figure 3). This is one of our main results, as\nit confirms that non-myopic meta-gradients can discover questions, in the form of cumulants and\ndiscounts, useful to capture rich enough knowledge of the world to support the learning of state\nrepresentations that yield good policies even in complex RL tasks.\nBaselines: we also found that learning the answers to questions discovered using meta-gradients\nresulted in state representations that supported better performance, on the main task, compared to\nthe representations resulting from learning the answers to popular hand-crafted questions in the literature.\nIn the gridworld experiments in Figure 2, learning the representation using \u201cReward\nPrediction\u201d (purple) or \u201cRandom GVFs\u201d (blue) resulted in notably worse policies than those learned\nby the agent with \u201cDiscovered GVFs\u201d. Similarly, in Atari (shown in Figure 3) the handcrafted\nauxiliary tasks, now including a \u201cPixel Control\u201d baseline (green), resulted in almost no learning.\nMain-Task driven representations: Note that the actor-critic agent that trained the state representation\nusing the main-task updates directly learned faster than the agents where the representation was\nexclusively trained using auxiliary tasks. The baseline required only 3M steps on the gridworlds and\n200M frames on Atari to reach the final performance. 
This is expected and it is true both for our\nmeta-gradient solution and for the auxiliary tasks from the literature.\nWe used the representation learning setting to investigate a number of design choices. First, we\ncompare optimizing the area under the curve over the length of the unrolled meta-gradient computation\n(or \u201cSummed Meta-Loss\u201d) to computing the meta-gradient on the last batch alone (\u201cEnd\nMeta-Loss\u201d). As shown in Figure 4, both approaches can be effective, but we found optimizing the\narea under the curve to be more stable. Next we examined the role of the number of GVF questions,\nand the effect of varying the number of steps unrolled in the meta-gradient calculation. For\nthis purpose, we used the less compute-intensive gridworlds: Collect-Objects (reported here) and\nPuddleworld (in the Appendix). On the left in Figure 5, we report a parameter study, plotting the\nperformance of the agent with meta-learned auxiliary tasks as a function of the number of questions\nd. The dashed black line corresponds to the optimal (final) performance. Too few questions (d = 2)\ndid not provide enough signal to learn good representations: the dashed red line is thus far from optimal\nfor d = 2. Other values of d all led to learning of a good representation capable of supporting\nan optimal policy. However, too many questions (e.g. d = 128) made learning slower, as shown by\nthe average performance dropping. The number of questions is therefore an important hyperparameter\nof the algorithm. On the right in Figure 5, we report the effect on performance of the number k\nof unrolled steps used for the meta-gradient computation. Using k = 1 corresponds to the myopic\nmeta-gradient: in contrast to previous work (Xu et al. (2018); Zheng et al. 
(2018)), the representation\nlearned with k = 1 and k = 2 was insufficient for the final policy to do anything meaningful.\nPerformance generally got better as we increased the unroll length (although the computational cost\nof meta-gradients also increased). Again the trend was not fully monotonic, with the largest unroll\nlength k = 50 performing worse than k = 25 in terms of both final and average performance. We\nconjecture this may be due to the increased variance of the meta-gradient estimates as the unroll\nlength increases. The number of unrolled steps k is therefore also a sensitive hyperparameter. Note\nthat neither d nor k was tuned in other experiments, with all other results using the same fixed\nsettings of d = 128 and k = 10.\n\n4.2 Joint learning Experiments\n\nThe next set of experiments uses the most common setting in the literature on auxiliary tasks, where\nthe representation is learned jointly using the auxiliary task updates and the main task updates. To\naccelerate the learning of useful questions, we provided the encoded state representation as input to\nthe question network instead of learning a separate encoding; this differs from the previous experiments,\nwhere the question network was a completely independent network (consistent with the\nobjective of a more stringent evaluation of our algorithm). We used a benchmark consisting of 57\ndistinct Atari games to evaluate the \u201cDiscovered GVFs\u201d agent together with an actor-critic baseline\n(\u201cIMPALA\u201d) and two auxiliary tasks from the literature: \u201cReward Prediction\u201d and \u201cPixel Control\u201d.\n\nFigure 6: On the left, relative performance improvements of a \u201cDiscovered GVFs\u201d agent over plain IMPALA.\nThe 10 games are those where a \u201cPixel Control\u201d baseline showed the largest gains over IMPALA. 
On the right,\nwe plot median normalized scores of all agents for different subsets of the 57 Atari games (N=5, 10, 20, 40, 57).\nThe order of inclusion of the games is again determined according to the performance gains of pixel-control.\n\nNone of the auxiliary tasks outperformed IMPALA on each and every one of the 57 games. To analyse\nthe results, we ranked games according to the performance of the agent with pixel-control questions,\nto identify the games more conducive to improving performance through the use of auxiliary tasks.\nOn the left of Figure 6, we report the relative gains of the \u201cDiscovered GVFs\u201d agent over IMPALA,\non the top-10 games for the \u201cPixel Control\u201d baseline: we observed large gains in 6 out of 10 games,\nsmall gains in 2, and losses in 2. On the right in Figure 6, we provide a more comprehensive view\nof the performance of the agents. For each number N on the x-axis (N = 5, 10, 20, 40, 57) we\npresent the median human-normalized score achieved by each method on the top-N games, again\nselected according to the \u201cPixel Control\u201d baseline. It is visually clear that discovering questions via\nmeta-learning is fast enough to compete with handcrafted questions, and that, in games well suited\nto auxiliary tasks, it greatly improved performance over all baselines. It was particularly impressive\nto find that the meta-gradient solution outperformed pixel control on these games despite the ranking\nof games being biased in favour of pixel-control. The reward prediction baseline is interesting, in\ncomparison, because its profile was the closest to that of the actor-critic baseline, never improving\nperformance significantly, but not hurting either.\n\n5 Conclusions and Discussion\n\nThere are many forms of questions that an intelligent agent may want to discover. 
In this paper we\nintroduced a novel and efficient multi-step meta-gradient procedure for the discovery of questions\nin the form of on-policy GVFs. In a stringent test, our representation learning experiments demonstrated\nthat the meta-gradient approach is capable of discovering useful questions such that answering\nthem can drive, by itself, learning of state representations good enough to support the learning\nof a main reinforcement learning task. Furthermore, our auxiliary task experiments demonstrated\nthat the meta-learning-based discovery approach is data-efficient enough to compete well in terms of\nperformance with, and in many cases even outperform, handcrafted questions developed in prior work.\nMost prior work on auxiliary tasks relied on human ingenuity to define questions useful for shaping\nthe state representation used in a given task, but it is hard to create questions that are both useful and\ngeneral (i.e., that can be applied across many tasks). Bellemare et al. (2019) introduced a geometrical\nperspective to understand when auxiliary tasks give rise to good representations. Our solution differs\nfrom this line of work in that we side-step the question of how to design good auxiliary questions,\nby meta-learning them instead, directly optimizing for utility in the context of a given task. Our\napproach fits in a general trend of increasingly relying on data rather than human-designed inductive\nbiases to construct effective learning algorithms (Silver et al., 2017; Hessel et al., 2019b).\nA promising direction for future research is to investigate off-policy GVFs, where the policy under\nwhich we make the predictions differs from the main-task policy. We also note that our approach\nto discovery is quite general, and could be extended to meta-learning other kinds of questions that\ndo not fit the canonical GVF formulation; see van Hasselt et al. (2019) for one such class of predictive\nquestions. 
Finally, we emphasize that the unrolled multi-step meta-gradient algorithm is likely\nto benefit previous applications of myopic meta-gradients, and may also open up more\napplications, other than discovery, where the myopic approximation would fail.\n\nAcknowledgments\n\nWe thank John Holler and Zeyu Zheng for many useful comments and discussions. The work of the\nauthors at the University of Michigan was supported by a grant from DARPA\u2019s L2M program and by\nNSF grant IIS-1526059. Any opinions, findings, conclusions, or recommendations expressed here\nare those of the authors and do not necessarily reflect the views of the sponsors.\n\nReferences\n\nMaruan Al-Shedivat, Trapit Bansal, Yura Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel.\nContinuous adaptation via meta-learning in nonstationary and competitive environments. In 6th\nInternational Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada,\nApril 30 - May 3, 2018, Conference Track Proceedings, 2018.\n\nMarcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul,\nBrendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient\ndescent. In Advances in Neural Information Processing Systems, pp. 3981\u20133989, 2016.\n\nMarc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment:\nAn evaluation platform for general agents. J. Artif. Intell. Res., 47:253\u2013279, 2013.\n\nMarc G. Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Ta\u00efga, Pablo Samuel Castro, Nicolas\nLe Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. A geometric perspective on\noptimal representations for reinforcement learning. arXiv preprint arXiv:1901.11530, 2019.\n\nYutian Chen, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Timothy P Lillicrap,\nand Nando de Freitas. Learning to learn for global optimization of black box functions. 
arXiv\npreprint arXiv:1611.03824, 2016.\n\nThomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. In Proceedings of the\n29th International Conference on Machine Learning, pp. 179\u2013186.\nOmnipress, 2012.\n\nYan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya\nSutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances\nin neural information processing systems, pp. 1087\u20131098, 2017.\n\nLasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam\nDoron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance\nweighted actor-learner architectures. In International Conference on Machine Learning,\npp. 1406\u20131415, 2018.\n\nBenjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need:\nLearning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.\n\nChelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation\nof deep networks. In Proceedings of the 34th International Conference on Machine Learning-\nVolume 70, pp. 1126\u20131135. JMLR.org, 2017.\n\nCarlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic goal generation for\nreinforcement learning agents. In Proceedings of the 35th International Conference on Machine\nLearning, ICML 2018, pp. 1514\u20131523, 2018.\n\nAndreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic\ndifferentiation, volume 105. SIAM, 2008.\n\nAbhishek Gupta, Benjamin Eysenbach, Chelsea Finn, and Sergey Levine. Unsupervised meta-learning\nfor reinforcement learning. arXiv preprint arXiv:1806.04640, 2018.\n\nMatteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney,\nDan Horgan, Bilal Piot, Mohammad Gheshlaghi Azar, and David Silver. 
Rainbow: Combining\nimprovements in deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference\non Artificial Intelligence, pp. 3215\u20133222, 2018.\n\nMatteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van\nHasselt. Multi-task deep reinforcement learning with PopArt. Proceedings of the AAAI Conference\non Artificial Intelligence, 33(01):3796\u20133803, Jul. 2019a. doi: 10.1609/aaai.v33i01.33013796.\n\nMatteo Hessel, Hado van Hasselt, Joseph Modayil, and David Silver. On inductive biases in deep\nreinforcement learning. arXiv preprint arXiv:1907.02908, 2019b.\n\nKyle Hsu, Sergey Levine, and Chelsea Finn. Unsupervised learning via meta-learning. arXiv\npreprint arXiv:1810.02334, 2018.\n\nMax Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David\nSilver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In\n5th International Conference on Learning Representations, ICLR, 2017.\n\nKe Li and Jitendra Malik. Learning to optimize. In 5th International Conference on Learning\nRepresentations, ICLR, 2017.\n\nTakaki Makino and Toshihisa Takagi. On-line discovery of temporal-difference networks. In Proceedings\nof the 25th international conference on Machine learning, pp. 632\u2013639. ACM, 2008.\n\nDaniel J Mankowitz, Augustin \u017d\u00eddek, Andr\u00e9 Barreto, Dan Horgan, Matteo Hessel, John Quan,\nJunhyuk Oh, Hado van Hasselt, David Silver, and Tom Schaul. Unicorn: Continual learning with\na universal, off-policy agent. 
arXiv preprint arXiv:1802.08294, 2018.\n\nPiotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andy Ballard, Andrea Banino, Misha\nDenil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia Hadsell.\nLearning to navigate in complex environments. In 5th International Conference on Learning\nRepresentations, ICLR, 2017.\n\nNikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner.\nIn 6th International Conference on Learning Representations, ICLR 2018, Vancouver,\nBC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.\n\nVolodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,\nAlex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level\ncontrol through deep reinforcement learning. Nature, 518(7540):529, 2015.\n\nVolodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim\nHarley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement\nlearning. In International conference on machine learning, pp. 1928\u20131937, 2016.\n\nAnusha Nagabandi, Chelsea Finn, and Sergey Levine. Deep online learning via meta-learning:\nContinual adaptation for model-based RL. arXiv preprint arXiv:1812.07671, 2018.\n\nSachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In 5th International\nConference on Learning Representations, ICLR, 2017.\n\nMartin A. Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van\nde Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing \u2013 solving\nsparse reward tasks from scratch. In Proceedings of the 35th International Conference on Machine\nLearning, ICML, pp. 4341\u20134350, 2018.\n\nMatthew Schlegel, Andrew Patterson, Adam White, and Martha White. Discovery of predictive\nrepresentations with a network of general value functions, 2018. 
URL https://openreview.net/forum?id=ryZElGZ0Z.\n\nJuergen Schmidhuber, Jieyu Zhao, and MA Wiering. Simple principles of metalearning. Technical\nreport IDSIA, 69:1\u201323, 1996.\n\nJohn Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region\npolicy optimization. arXiv preprint arXiv:1502.05477, 2015.\n\nJohn Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy\noptimization algorithms. arXiv preprint arXiv:1707.06347, 2017.\n\nEvan Shelhamer, Parsa Mahmoudieh, Max Argus, and Trevor Darrell. Loss is its own reward:\nSelf-supervision for reinforcement learning. In 5th International Conference on Learning Representations,\nICLR, 2017.\n\nSilver, Hubert, Schrittwieser, Antonoglou, Lai, Guez, Lanctot, Sifre, Kumaran, Graepel, Lillicrap,\nSimonyan, and Hassabis. Mastering chess and shogi by self-play with a general reinforcement\nlearning algorithm. arXiv preprint arXiv:1712.01815, 2017.\n\nSatinder Singh, Michael R James, and Matthew R Rudary. Predictive state representations: A new\ntheory for modeling dynamical systems. In Proceedings of the 20th conference on Uncertainty in\nartificial intelligence, pp. 512\u2013519. AUAI Press, 2004.\n\nJake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In\nAdvances in Neural Information Processing Systems, pp. 4077\u20134087, 2017.\n\nBradly C Stadie, Ge Yang, Rein Houthooft, Xi Chen, Yan Duan, Yuhuai Wu, Pieter Abbeel, and Ilya\nSutskever. Some considerations on learning to explore via meta-reinforcement learning. arXiv\npreprint arXiv:1803.01118, 2018.\n\nRichard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3\n(1):9\u201344, 1988.\n\nRichard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. The MIT press,\nCambridge MA, 2018.\n\nRichard S Sutton and Brian Tanner. Temporal-difference networks. 
In Advances in neural information\nprocessing systems, pp. 1377\u20131384, 2005.\n\nRichard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White,\nand Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised\nsensorimotor interaction. In The 10th International Conference on Autonomous Agents and\nMultiagent Systems-Volume 2, pp. 761\u2013768. International Foundation for Autonomous Agents\nand Multiagent Systems, 2011.\n\nSebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to\nlearn, pp. 3\u201317. Springer, 1998.\n\nHado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with Double Q-learning.\nAAAI, 2016.\n\nHado van Hasselt, John Quan, Matteo Hessel, Zhongwen Xu, Diana Borsa, and Andr\u00e9 Barreto.\nGeneral non-linear Bellman equations. arXiv preprint arXiv:1907.03687, 2019.\n\nVivek Veeriah, Junhyuk Oh, and Satinder Singh. Many-goals reinforcement learning. arXiv preprint\narXiv:1806.09605, 2018.\n\nC. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King\u2019s College, Cambridge,\nEngland, 1989.\n\nAdam White. Developing a predictive approach to knowledge. University of Alberta, 2015.\n\nOlga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo,\nMisha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and\ngeneralize. In Proceedings of the 34th International Conference on Machine Learning-Volume\n70, pp. 3751\u20133760. JMLR.org, 2017.\n\nRonald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement\nlearning. Mach. Learn., 8(3-4):229\u2013256, May 1992. ISSN 0885-6125. doi: 10.1007/BF00992696.\n\nZhongwen Xu, Hado P van Hasselt, and David Silver. Meta-gradient reinforcement learning. In\nAdvances in Neural Information Processing Systems, pp. 
2396\u20132407, 2018.\n\nZeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient\nmethods. In Advances in Neural Information Processing Systems, pp. 4644\u20134654, 2018.\n", "award": [], "sourceid": 4978, "authors": [{"given_name": "Vivek", "family_name": "Veeriah", "institution": "University of Michigan"}, {"given_name": "Matteo", "family_name": "Hessel", "institution": "Google DeepMind"}, {"given_name": "Zhongwen", "family_name": "Xu", "institution": "DeepMind"}, {"given_name": "Janarthanan", "family_name": "Rajendran", "institution": "University of Michigan"}, {"given_name": "Richard", "family_name": "Lewis", "institution": "University of Michigan"}, {"given_name": "Junhyuk", "family_name": "Oh", "institution": "DeepMind"}, {"given_name": "Hado", "family_name": "van Hasselt", "institution": "DeepMind"}, {"given_name": "David", "family_name": "Silver", "institution": "DeepMind"}, {"given_name": "Satinder", "family_name": "Singh", "institution": "University of Michigan"}]}