{"title": "RUDDER: Return Decomposition for Delayed Rewards", "book": "Advances in Neural Information Processing Systems", "page_first": 13566, "page_last": 13577, "abstract": "We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero. (i) Reward redistribution that leads to return-equivalent decision processes with the same optimal policies and, when optimal, zero expected future rewards. (ii) Return decomposition via contribution analysis which transforms the reinforcement learning task into a regression task at which deep learning excels. On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(\u03bb), and reward shaping approaches. At Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, which is most prominent at games with delayed rewards.", "full_text": "RUDDER: Return Decomposition for Delayed\n\nRewards\n\nJose A. 
Arjona-Medina∗   Michael Gillhofer∗   Michael Widrich∗   Thomas Unterthiner   Johannes Brandstetter   Sepp Hochreiter†

LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria
† also at Institute of Advanced Research in Artificial Intelligence (IARAI)

Abstract

We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero. (i) Reward redistribution that leads to return-equivalent decision processes with the same optimal policies and, when optimal, zero expected future rewards. (ii) Return decomposition via contribution analysis, which transforms the reinforcement learning task into a regression task at which deep learning excels. On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(λ), and reward shaping approaches. At Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, which is most prominent at games with delayed rewards. Source code is available at https://github.com/ml-jku/rudder and demonstration videos at https://goo.gl/EQerZV.

1 Introduction

Assigning credit for a received reward to past actions is central to reinforcement learning [47]. A great challenge is to learn long-term credit assignment for delayed rewards [23, 20, 18, 33].
Delayed rewards are often episodic or sparse and are common in real-world problems [30, 25]. For Markov decision processes (MDPs), the Q-value is equal to the expected immediate reward plus the expected future reward. For Q-value estimation, the expected future reward leads to biases in temporal difference (TD) learning and to high variance in Monte Carlo (MC) learning. For delayed rewards, TD requires exponentially many updates to correct the bias, where the number of updates is exponential in the number of delay steps. For MC learning, the number of states affected by a delayed reward can grow exponentially with the number of delay steps. (Both statements are proved after Supplements Theorems S8 and S10.) An MC estimate of the expected future reward has to average over all possible future trajectories if rewards, state transitions, or policies are probabilistic. Delayed rewards make an MC estimate much harder.

∗ authors contributed equally

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The main goal of our approach is to construct an MDP that has expected future rewards equal to zero. If this goal is achieved, Q-value estimation simplifies to computing the mean of the immediate rewards. To push the expected future rewards to zero, we require two new concepts. The first new concept is reward redistribution to create return-equivalent MDPs, which are characterized by having the same optimal policies. An optimal reward redistribution should transform a delayed reward MDP into a return-equivalent MDP with zero expected future rewards. However, expected future rewards equal to zero are in general not possible for MDPs. Therefore, we introduce sequence-Markov decision processes (SDPs), for which reward distributions need not be Markov.
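The MC variance problem described above can be illustrated with a small toy simulation (a hypothetical setup, not from the paper): a fixed delayed reward of 1.0 arrives at episode end, while each delay step adds unrelated stochastic reward. The MC target is 1.0 for every delay, but the variance of the estimator grows with the number of delay steps.

```python
import random

def rollout(delay):
    """One episode: a fixed delayed reward of 1.0 at episode end, plus
    unrelated stochastic intermediate rewards along the way."""
    noise = sum(random.gauss(0.0, 1.0) for _ in range(delay))
    return noise + 1.0  # the delayed reward arrives at episode end

def mc_estimate(delay, n_episodes=100):
    """Plain MC estimate of the expected return from sampled episodes."""
    returns = [rollout(delay) for _ in range(n_episodes)]
    return sum(returns) / n_episodes

random.seed(0)
# The expected return is 1.0 for every delay, but the estimator's variance
# grows linearly with the number of delay steps, since each step adds
# unrelated reward noise that the MC average has to integrate out.
for delay in (1, 10, 100):
    estimates = [mc_estimate(delay) for _ in range(50)]
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    print(delay, round(mean, 2), round(var, 3))
```

With 100 episodes per estimate, the variance of the estimate is roughly delay/100, so longer delays need proportionally more episodes for the same accuracy.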
We construct a reward redistribution that leads to a return-equivalent SDP with a second-order Markov reward distribution and expected future rewards that are equal to zero. For these return-equivalent SDPs, Q-value estimation simplifies to computing the mean. Nevertheless, the Q-values or advantage functions can be used for learning optimal policies. The second new concept is return decomposition and its realization via contribution analysis. This concept serves to efficiently construct a proper reward redistribution, as described in the next section. Return decomposition transforms a reinforcement learning task into a regression task, where the sequence-wide return must be predicted from the whole state-action sequence. The regression task identifies which state-action pairs contribute to the return prediction and, therefore, receive a redistributed reward. Learning the regression model uses only completed episodes as the training set and therefore avoids problems with unknown future state-action trajectories. Even for sub-optimal reward redistributions, we obtain an enormous speed-up of Q-value learning if relevant reward-causing state-action pairs are identified. We propose RUDDER (RetUrn Decomposition for DElayed Rewards) for learning with reward redistributions that are obtained via return decomposition.

To get an intuition for our approach, assume you repair pocket watches and then sell them. For a particular brand of watch you have to decide whether repairing pays off. The sales price is known, but you have unknown costs, i.e. negative rewards, caused by repair and delivery. The advantage function is the sales price minus the expected immediate repair costs minus the expected future delivery costs. Therefore, you want to know whether the advantage function is positive. — Why is zeroing the expected future costs beneficial?
\u2014 If the average delivery costs are known, then they can be added\nto the repair costs resulting in zero future costs. Now you can use your repairing experiences, and\nyou just have to average over the repair costs to know whether repairing pays off. \u2014 Why is return\ndecomposition so ef\ufb01cient? \u2014 Because of pattern recognition. For zero future costs, you have to\nestimate the expected brand-related delivery costs, which are e.g. packing costs. These brand-related\ncosts are superimposed by brand-independent general delivery costs for shipment (e.g. time spent for\ndelivery). Assume that general delivery costs are indicated by patterns, e.g. weather conditions, which\ndelay delivery. Using a training set of completed deliveries, supervised learning can identify these\npatterns and attribute costs to them. This is return decomposition. In this way, only brand-related\ndelivery costs remain and, therefore, can be estimated more ef\ufb01ciently than by MC.\nRelated Work. Our new learning algorithm is gradually changing the reward redistribution during\nlearning, which is known as shaping [44, 47]. In contrast to RUDDER, potential-based shaping like\nreward shaping [27], look-ahead advice, and look-back advice [50] use a \ufb01xed reward redistribution.\nMoreover, since these methods keep the original reward, the resulting reward redistribution is not\noptimal, as described in the next section, and learning can still be exponentially slow. A monotonic\npositive reward transformation [28] changes the reward distribution but is not assured to keep\noptimal policies. Sibling Rivalry [48] overcomes local optima of distance-to-goal-based reward\nshaping by changing the original reward, while still \ufb01nding an optimal policy for the original reward.\nDisentangled rewards [15] keep optimal policies but are neither environment nor policy speci\ufb01c,\ntherefore cannot have zero expected rewards. 
Successor features decouple environment and policy from rewards, but changing the reward changes the optimal policies [7, 6]. Temporal Value Transport (TVT) uses an attentional memory mechanism to learn a value function that serves as fictitious reward, but optimal policies are not guaranteed to be kept [21]. None of these methods ensures zero expected future rewards, which would speed up learning. Like RUDDER, previous methods have used supervised methods for reinforcement learning [34, 8, 38]. Separate backward models can be learned in a supervised manner to trace back from known goal states [14] or from high reward states [16]. "Hindsight Credit Assignment" (HCA) [17] and "Upside-Down Reinforcement Learning" [39, 45] use supervised learning via cross-entropy to select a backward model that predicts the action to take (output) to achieve a desired return (input 1) from the current state (input 2) in a certain number of steps (input 3). For HCA the desired return can be replaced by a desired future state, and the number of steps can be relaxed to being achieved until episode end. Instead of backward models, "backpropagation through a model" [26, 31, 32, 49, 37, 4, 5] uses forward models, which predict the return and generate update signals for a policy by backward analysis via sensitivity analysis. While RUDDER also uses a forward model, it (i) uses contribution analysis instead of sensitivity analysis for backward analysis, and (ii) uses the whole state-action sequence to predict the return.

2 Reward Redistribution and Novel Learning Algorithms

Reward redistribution is our main new concept to achieve expected future rewards equal to zero. We start by introducing MDPs, return-equivalent sequence-Markov decision processes (SDPs), and reward redistribution.
Furthermore, optimal reward redistribution is defined and novel learning algorithms based on reward redistribution are introduced.

MDP Definitions and Return-Equivalent Sequence-Markov Decision Processes (SDPs). A finite Markov decision process (MDP) P is a 5-tuple P = (S, A, R, p, γ) of finite sets S of states s (random variable St at time t), A of actions a (random variable At), and R of rewards r (random variable Rt+1). Furthermore, P has transition-reward distributions p(St+1 = s′, Rt+1 = r | St = s, At = a) conditioned on state-actions, and a discount factor γ ∈ [0, 1]. The marginals are p(r | s, a) = Σ_{s′} p(s′, r | s, a) and p(s′ | s, a) = Σ_r p(s′, r | s, a). The expected reward is r(s, a) = Σ_r r p(r | s, a). The return Gt is Gt = Σ_{k=0}^{∞} γ^k Rt+k+1. For finite horizon MDPs with sequence length T and γ = 1 the return is Gt = Σ_{k=0}^{T−t} Rt+k+1. A Markov policy is given as action distribution π(At = a | St = s) conditioned on states. We often equip an MDP P with a policy π without explicitly mentioning it. The action-value function qπ(s, a) for policy π is qπ(s, a) = Eπ[Gt | St = s, At = a]. The goal of learning is to maximize the expected return at time t = 0, that is, v0π = Eπ[G0]. The optimal policy π∗ is π∗ = argmax_π v0π. A sequence-Markov decision process (SDP) is defined as a decision process which is equipped with a Markov policy and has Markov transition probabilities but a reward that is not required to be Markov. Two SDPs P̃ and P are return-equivalent if (i) they differ only in their reward distributions and (ii) they have the same expected return at t = 0 for each policy π: ṽ0π = v0π. They are strictly return-equivalent if they have the same expected return for every episode and for each policy π. Strictly return-equivalent SDPs are return-equivalent. Return-equivalent SDPs have the same optimal policies. For more details see Supplements S2.2.

Reward Redistribution. Strictly return-equivalent SDPs P̃ and P can be constructed by reward redistributions. A reward redistribution given an SDP P̃ is a procedure that redistributes, for each sequence s0, a0, . . . , sT, aT, the realization of the sequence-associated return variable G̃0 = Σ_{t=0}^{T} R̃t+1 or its expectation along the sequence. Later we will introduce a reward redistribution that depends on the SDP P̃. The reward redistribution creates a new SDP P with the redistributed reward Rt+1 at time (t + 1) and the return variable G0 = Σ_{t=0}^{T} Rt+1. A reward redistribution is second order Markov if the redistributed reward Rt+1 depends only on (st−1, at−1, st, at). If the SDP P is obtained from the SDP P̃ by reward redistribution, then P̃ and P are strictly return-equivalent. The next theorem states that the optimal policies are still the same for P̃ and P (proof after Supplements Theorem S2).

Theorem 1. Both the SDP P̃ with delayed reward R̃t+1 and the SDP P with redistributed reward Rt+1 have the same optimal policies.

Optimal Reward Redistribution with Expected Future Rewards Equal to Zero. We move on to the main goal of this paper: to derive an SDP via reward redistribution that has expected future rewards equal to zero and, therefore, no delayed rewards. At time (t − 1) the immediate reward is Rt with expectation r(st−1, at−1). We define the expected future rewards κ(m, t − 1) at time (t − 1) as the expected sum of future rewards from Rt+1 to Rt+1+m.

Definition 1. For 1 ≤ t ≤ T and 0 ≤ m ≤ T − t, the expected sum of delayed rewards at time (t − 1) in the interval [t + 1, t + m + 1] is defined as κ(m, t − 1) = Eπ[Σ_{τ=0}^{m} Rt+1+τ | st−1, at−1].

For every time point t, the expected future rewards κ(T − t − 1, t) given (st, at) is the expected sum of future rewards until sequence end, that is, in the interval [t + 2, T + 1]. For MDPs, the Bellman equation for Q-values becomes qπ(st, at) = r(st, at) + κ(T − t − 1, t). We aim to derive an MDP with κ(T − t − 1, t) = 0, which yields qπ(st, at) = r(st, at). In this case, learning the Q-values simplifies to estimating the expected immediate reward r(st, at) = E[Rt+1 | st, at]. Hence, the reinforcement learning task reduces to computing the mean, e.g. the arithmetic mean, for each state-action pair (st, at). A reward redistribution is defined to be optimal if κ(T − t − 1, t) = 0 for 0 ≤ t ≤ T − 1. In general, an optimal reward redistribution violates the Markov assumptions and the Bellman equation does not hold (proof after Supplements Theorem S3). Therefore, we will consider SDPs in the following. The next theorem states that a delayed reward MDP P̃ with a particular policy π can be transformed into a return-equivalent SDP P with an optimal reward redistribution.

Theorem 2. We assume a delayed reward MDP P̃, where the accumulated reward is given at sequence end. A new SDP P is obtained by a second order Markov reward redistribution, which ensures that P is return-equivalent to P̃. For a specific π, the following two statements are equivalent:
(I) κ(T − t − 1, t) = 0, i.e. the reward redistribution is optimal,
(II) E[Rt+1 | st−1, at−1, st, at] = q̃π(st, at) − q̃π(st−1, at−1) .   (1)
An optimal reward redistribution fulfills for 1 ≤ t ≤ T and 0 ≤ m ≤ T − t: κ(m, t − 1) = 0.

The proof can be found after Supplements Theorem S4. The equation κ(T − t − 1, t) = 0 implies that the new SDP P has no delayed rewards, that is, Eπ[Rt+1+τ | st−1, at−1] = 0 for 0 ≤ τ ≤ T − t − 1 (Supplements Corollary S1). The SDP P has no delayed rewards since no state-action pair can increase or decrease the expectation of a future reward. Equation (1) shows that for an optimal reward redistribution the expected reward has to be the difference of consecutive Q-values of the original delayed reward. The optimal reward redistribution is second order Markov since the expectation of Rt+1 at time (t + 1) depends on (st−1, at−1, st, at).

The next theorem states the major advantage of an optimal reward redistribution: q̃π(st, at) can be estimated with an offset that depends only on st by estimating the expected immediate redistributed reward. Thus, Q-value estimation becomes trivial and the computation of the advantage function of the MDP P̃ is simplified.

Theorem 3. If the reward redistribution is optimal, then the Q-values of the SDP P are given by
qπ(st, at) = r(st, at) = q̃π(st, at) − E_{st−1, at−1}[q̃π(st−1, at−1) | st] = q̃π(st, at) − ψπ(st) .   (2)
The SDP P and the original MDP P̃ have the same advantage function. Using a behavior policy π̆ the expected immediate reward is
Eπ̆[Rt+1 | st, at] = q̃π(st, at) − ψπ,π̆(st) .   (3)

The proof can be found after Supplements Theorem S5.
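To make the Q-value-difference condition of Eq. (1) concrete, the following toy sketch (our illustration, not code from the paper) applies it to a chain whose delayed return at sequence end equals the number of "key" actions taken, under a fixed stochastic policy. The redistributed rewards telescope to the original return, and every step after the first receives a_t − p, which has expectation zero under the policy, so expected future rewards vanish.

```python
import random

T = 20        # episode length; the reward arrives only at sequence end
P_KEY = 0.3   # fixed policy: take the reward-causing "key" action w.p. 0.3

def q_tilde(count, t):
    """Q-value of the delayed-reward process after step t: key actions
    taken so far plus those expected in the remaining steps."""
    return count + P_KEY * (T - 1 - t)

def redistribute(actions):
    """Redistribution per Eq. (1): the redistributed reward at step t is the
    difference of consecutive Q-values of the delayed-reward process.
    Convention (our assumption): there is no predecessor Q-value before the
    first step, so the first reward carries the initial Q-value."""
    rewards, count, prev_q = [], 0, 0.0
    for t, a in enumerate(actions):
        count += a
        q = q_tilde(count, t)
        rewards.append(q - prev_q)
        prev_q = q
    return rewards

rng = random.Random(0)
actions = [1 if rng.random() < P_KEY else 0 for _ in range(T)]
rewards = redistribute(actions)
# Strict return equivalence: redistributed rewards sum to the delayed return.
assert abs(sum(rewards) - sum(actions)) < 1e-9
# Zero expected future reward: after step t, each remaining step contributes
# a_tau - P_KEY, which has expectation 0 under the policy.
print([round(r, 2) for r in rewards[:5]])
```

Each key action immediately receives 1 − P_KEY and each other action −P_KEY, so Q-value estimation reduces to averaging immediate rewards, as Theorem 3 states.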
If the reward redistribution is not optimal, then κ(T − t − 1, t) measures the deviation of the Q-value from r(st, at). This theorem justifies several learning methods based on reward redistribution presented in the next paragraph.

Novel Learning Algorithms Based on Reward Redistributions. We assume γ = 1 and a finite horizon or absorbing state original MDP P̃ with delayed rewards. For this setting we introduce new reinforcement learning algorithms. They gradually change the reward redistribution during learning and are based on the estimations in Theorem 3. These algorithms are also valid for non-optimal reward redistributions, since the optimal policies are kept (Theorem 1). Under standard assumptions, convergence of RUDDER learning can be proven by stochastic approximation for two time-scale update rules [12, 22]. Learning consists of two updates: a reward redistribution network update and a Q-value update. Convergence proofs to an optimal policy are difficult, since locally stable attractors may not correspond to optimal policies.

According to Theorem 1, reward redistribution keeps the optimal policies. Therefore, even non-optimal reward redistributions ensure correct learning. However, an optimal reward redistribution speeds up learning considerably. Reward redistribution can be combined with methods that use Q-value ranks or advantage functions. We consider (A) Q-value estimation, (B) policy gradients, and (C) Q-learning. Type (A) methods estimate Q-values and are divided into variants (i), (ii), and (iii). Variant (i) assumes an optimal reward redistribution and estimates q̃π(st, at) with an offset depending only on st. The estimates are based on Theorem 3, either by on-policy direct Q-value estimation according to Eq. (2) or by off-policy immediate reward estimation according to Eq. (3). Variant (ii) methods assume a non-optimal reward redistribution and correct Eq.
(2) by estimating κ. Variant (iii) methods use eligibility traces for the redistributed reward. RUDDER learning can be based on policies like "greedy in the limit with infinite exploration" (GLIE) or "restricted rank-based randomized" (RRR) [43]. GLIE policies change toward greediness with respect to the Q-values during learning. For more details on these learning approaches see Supplements S2.7.1.

Type (B) methods replace, in the expected policy gradient updates Eπ[∇θ log π(a | s; θ) qπ(s, a)], the value qπ(s, a) by an estimate of r(s, a) or by a sample of the redistributed reward. The offset ψπ(s) in Eq. (2) or ψπ,π̆(s) in Eq. (3) reduces the variance as baseline normalization does. These methods can be extended to Trust Region Policy Optimization (TRPO) [40] as used in Proximal Policy Optimization (PPO) [42]. The type (C) method is Q-learning with the redistributed reward. Here, Q-learning is justified if immediate and future reward are drawn together, as is typically done.

3 Constructing Reward Redistributions by Return Decomposition

We now propose methods to construct reward redistributions. Learning with non-optimal reward redistributions does work, since the optimal policies do not change according to Theorem 1. However, reward redistributions that are optimal considerably speed up learning, since future expected rewards introduce biases in TD methods and high variances in MC methods. The expected optimal redistributed reward is the difference of Q-values according to Eq. (1). The more a reward redistribution deviates from these differences, the larger the absolute κ-values become and, in turn, the less optimal the reward redistribution gets. Consequently, to construct a reward redistribution that is close to optimal, we aim at identifying the largest Q-value differences.

Reinforcement Learning as Pattern Recognition.
We want to transform the reinforcement learning problem into a pattern recognition task to exploit deep learning approaches. The sum of the Q-value differences gives the difference between the expected return at sequence begin and the expected return at sequence end (telescoping sum). Thus, Q-value differences allow prediction of the expected return of the whole state-action sequence. Identifying and redistributing the reward to the largest Q-value differences reduces the prediction error most. Q-value differences are assumed to be associated with patterns in state-action transitions. The largest Q-value differences are expected to be found more frequently in sequences with very large or very low return. The resulting task is to predict the expected return from the whole sequence and to identify which state-action transitions have contributed the most to the prediction. This pattern recognition task serves to construct a reward redistribution, where the redistributed reward corresponds to the different contributions. The next paragraph shows how the return is decomposed and redistributed along the state-action sequence.

Return Decomposition. The return decomposition idea is that a function g predicts the expectation of the return for a given state-action sequence (return for the whole sequence). The function g is neither a value nor an action-value function since it predicts the expected return when the whole sequence is given. With the help of g, either the predicted value or the realization of the return is redistributed over the sequence. A state-action pair receives as redistributed reward its contribution to the prediction, which is determined by contribution analysis. We use contribution analysis since sensitivity analysis has serious drawbacks, e.g. local minima, instabilities, exploding or vanishing gradients, and difficulties with proper exploration [19, 36].
The biggest drawback is that the relevance of actions is missed, since sensitivity analysis does not consider the contribution of actions to the output but only their effect on the output when slightly perturbing them. Contribution analysis determines how much a state-action pair contributes to the final prediction. We can use any contribution analysis method, but we specifically consider three methods: (A) differences of return predictions, (B) integrated gradients (IG) [46], and (C) layer-wise relevance propagation (LRP) [3]. For (A), g must try to predict the sequence-wide return at every time step. The redistributed reward is given by the difference of consecutive predictions. The function g can be decomposed into past, immediate, and future contributions to the return. Consecutive predictions share the same past and the same future contributions except for two immediate state-action pairs. Thus, in the difference of consecutive predictions, contributions cancel except for the two immediate state-action pairs. Even for imprecise predictions of future contributions to the return, contribution analysis is more precise, since prediction errors cancel out. Methods (B) and (C) rely on information later in the sequence for determining the contribution and thereby may introduce a non-Markov reward. The reward can be viewed as probabilistic but is prone to have high variance. Therefore, we prefer method (A).

Explaining Away Problem. We still have to tackle the problem that reward-causing actions might not receive redistributed rewards since they are explained away by later states. To describe the problem, assume an MDP P̃ with the only reward at sequence end. To ensure the Markov property, states in P̃ have to store the reward contributions of previous state-actions; e.g. sT has to store all previous contributions such that the expectation r̃(sT, aT) is Markov.
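The difference-of-predictions idea of method (A) can be sketched in a few lines (a mock illustration with a hand-made prediction trace, not the paper's LSTM implementation): given sequence-wide return predictions g_0, ..., g_T, the redistributed reward at step t is g_t − g_{t−1}, so the rewards telescope to the final prediction and concentrate at the steps where the prediction jumps.

```python
def redistribute_from_predictions(preds):
    """Contribution analysis (A): the redistributed reward at step t is the
    difference of consecutive sequence-wide return predictions, so rewards
    concentrate where the prediction jumps (a reward-causing event)."""
    prev = 0.0
    rewards = []
    for g_t in preds:
        rewards.append(g_t - prev)
        prev = g_t
    return rewards

# Mock prediction trace of a return predictor that recognizes a key event
# at step 3: the predicted sequence-wide return jumps from 0 to 10 there
# and stays flat afterwards (delayed reward of 10 at episode end).
preds = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
rewards = redistribute_from_predictions(preds)
print(rewards)  # -> [0.0, 0.0, 0.0, 10.0, 0.0, 0.0]
# Telescoping: the redistributed rewards sum to the final prediction.
assert sum(rewards) == preds[-1]
```

Shared prediction errors in consecutive steps cancel in the difference, which is why this variant tolerates an imperfect predictor better than attributing the raw predictions themselves.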
The explaining away problem is that later states are used for return prediction, while reward-causing earlier actions are missed. To avoid explaining away, we define a difference function Δ(st−1, at−1, st, at) between a state-action pair (st, at) and its predecessor (st−1, at−1). That Δ is a function of (st, at, st−1, at−1) is justified by Eq. (1), which ensures that such Δs allow an optimal reward redistribution. The sequence of differences is Δ0:T := (Δ(s−1, a−1, s0, a0), . . . , Δ(sT−1, aT−1, sT, aT)). The components Δ are assumed to be statistically independent from each other, therefore Δ cannot store reward contributions of previous Δ. The function g should predict the return by g(Δ0:T) = r̃(sT, aT) and can be decomposed into g(Δ0:T) = Σ_{t=0}^{T} ht. The contributions are ht = h(Δ(st−1, at−1, st, at)) for 0 ≤ t ≤ T. For the redistributed rewards Rt+1, we ensure E[Rt+1 | st−1, at−1, st, at] = ht. The reward R̃T+1 of P̃ is probabilistic and the function g might not be perfect, therefore neither g(Δ0:T) = r̃T+1 holds for the return realization r̃T+1 nor g(Δ0:T) = r̃(sT, aT) for the expected return. Therefore, we need to introduce the compensation r̃T+1 − Σ_{τ=0}^{T} h(Δ(sτ−1, aτ−1, sτ, aτ)) as an extra reward RT+2 at time T + 2 to ensure strictly return-equivalent SDPs. If g was perfect, then it would predict the expected return, which could be redistributed. The new redistributed rewards Rt+1 are based on the return decomposition, since they must have the contributions ht as mean:

E[R1 | s0, a0] = h0 ,   RT+2 = R̃T+1 − Σ_{t=0}^{T} ht ,   (4)
E[Rt+1 | st−1, at−1, st, at] = ht ,   0 < t ≤ T ,   (5)

where the realization r̃T+1 is replaced by its random variable R̃T+1. If the prediction of g is perfect, then we can redistribute the expected return via the prediction. Theorem 2 holds also for the correction RT+2 (see Supplements Theorem S6). A g with zero prediction errors results in an optimal reward redistribution. Small prediction errors lead to reward redistributions close to an optimal one.

RUDDER: Return Decomposition using LSTM. RUDDER uses a Long Short-Term Memory (LSTM) network for return decomposition and the resulting reward redistribution. RUDDER consists of three phases. (I) Safe exploration. Exploration sequences should generate LSTM training samples with delayed rewards by avoiding low Q-values during a particular time interval. Low Q-values hint at states where the agent gets stuck. Parameters comprise starting time, length, and Q-value threshold. (II) Lessons replay buffer for training the LSTM. If RUDDER's safe exploration discovers an episode with unseen delayed rewards, it is secured in a lessons replay buffer [24]. Unexpected rewards are indicated by a large prediction error of the LSTM. For LSTM training, episodes with larger errors are sampled more often from the buffer, similar to prioritized experience replay [35]. (III) LSTM and return decomposition. An LSTM learns to predict the sequence-wide return at every time step and, thereafter, return decomposition uses differences of return predictions (contribution analysis method (A)) to construct a reward redistribution. For more details see [1].

Feedforward Neural Networks (FNNs) vs. LSTMs. In contrast to LSTMs, FNNs are not suited for processing sequences.
Nevertheless, FNNs can learn an action-value function, which enables contribution analysis by differences of predictions. However, this leads to serious problems through spurious contributions that hinder learning. For example, any contributions would be incorrect if the true expectation of the return did not change. Therefore, prediction errors might falsely cause contributions, leading to spurious rewards. FNNs are prone to such prediction errors since they have to predict the expected return again and again from each different state-action pair and cannot use stored information. In contrast, the LSTM is less prone to produce spurious rewards: (i) The LSTM will only learn to store information if a state-action pair has strong evidence for a change in the expected return. In this way, key events can be stored. If information is stored, then internal states and, therefore, also the predictions change; otherwise the predictions stay unchanged. Hence, storing events receives a contribution and a corresponding reward, while by default nothing is stored and no contribution is given. (ii) The LSTM tends to have smaller prediction errors since it can reuse past information for predicting the expected return. (iii) Prediction errors of LSTMs are much more likely to cancel via prediction differences than those of FNNs. Since consecutive predictions of LSTMs rely on the same internal states, they usually have highly correlated errors.

Human Expert Episodes. They are an alternative to exploration and can serve to fill the lessons replay buffer. Learning can be sped up considerably when the LSTM identifies human key actions. RUDDER will reward human key actions even for episodes with low return, since other actions that thwart high returns receive negative reward. Using human demonstrations in reinforcement learning led to a huge improvement on some Atari games like Montezuma's Revenge [29, 2].

Limitations.
RUDDER might not be effective for tasks without delayed rewards, since LSTM learning takes extra time and struggles with extremely long sequences. Moreover, reward redistribution may introduce disturbing spurious reward signals.

4 Experiments

RUDDER is evaluated on three artificial tasks with delayed rewards. These tasks are designed to expose problems of TD, MC, and potential-based reward shaping; RUDDER overcomes these problems. Next, we demonstrate that RUDDER also works for more complex tasks with delayed rewards. To this end, we compare RUDDER with a Proximal Policy Optimization (PPO) baseline on 52 Atari games. All experiments use MDPs with finite time horizon or absorbing states, γ = 1, and reward at episode end. For more information see Supplements S4.1.2.

Artificial Tasks (I)–(III). Task (I) shows that TD methods have problems with vanishing information for delayed rewards. The goal is to learn that a delayed reward is larger than a distracting immediate reward. Therefore, the correct expected future reward must be assigned to many state-action pairs. Task (II) is a variation of the introductory pocket watch example with delayed rewards. It shows that MC methods have problems with the high variance of future unrelated rewards. The expected future reward that is caused by the first action has to be estimated. Large future rewards that are not associated with the first action impede MC estimations. Task (III) shows that potential-based reward shaping methods have problems with delayed rewards. For this task, only the first two actions are relevant, and the delayed reward has to be propagated back to them.

The tasks have different delays, are tabular (Q-table), and use an ε-greedy policy with ε = 0.2. We compare RUDDER, MC, and TD(λ) on all tasks, and Monte Carlo Tree Search (MCTS) on task (I). Additionally, on task (III), SARSA(λ) and reward shaping are compared. 
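The ε-greedy policy used in the tabular experiments (ε = 0.2) can be sketched as follows; the Q-table layout as a dictionary keyed by state-action pairs is an illustrative assumption:

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon=0.2):
    """epsilon-greedy action selection on a tabular Q-function:
    explore uniformly with probability epsilon, otherwise act greedily
    with respect to the current Q-table (unseen pairs default to 0)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```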
We use λ = 0.9 as suggested previously [47]. Reward shaping is either the original method, look-forward advice, or look-back advice, each with three different potential functions. RUDDER uses an LSTM network without output and forget gates, no lessons buffer, and no safe exploration. Contribution analysis is performed with differences of return predictions. A Q-table is learned by an exponential moving average of the redistributed reward (RUDDER's Q-value estimation) or by Q-learning. Performance is measured by the learning time needed to achieve 90% of the maximal expected return. A Wilcoxon signed-rank test indicates whether the performance difference between RUDDER and the other methods is significant.

(I) Grid World shows the problems of TD methods with delayed rewards. The task models a time bomb that explodes at episode end. The agent has to defuse the bomb and then run away as far as possible, since defusing fails with a certain probability. Alternatively, the agent can immediately run away, which, however, yields less reward on average. The Grid World is a 31 × 31 grid with the bomb at coordinate [30, 15] and the start at [30 − d, 15], where d is the delay of the task. The agent can move up, down, left, and right as long as it stays on the grid. At the end of the episode, after ⌊1.5d⌋ steps, the agent receives a reward of 1000 with probability 0.5 if it has visited the bomb. At each time step, the agent receives an immediate reward of c · t · h, where c depends on the chosen action, t is the current time step, and h is the Hamming distance to the bomb. Each move toward the bomb is immediately penalized with c = −0.09; each move away from the bomb is immediately rewarded with c = 0.1. The agent must learn the Q-values precisely to recognize that directly running away is not optimal. Figure 1(I) shows the learning times to solve the task vs. the delay of the reward, averaged over 100 trials. 
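The Grid World's immediate reward c · t · h can be sketched as below; judging "toward" vs. "away" by the change in Manhattan distance to the bomb is our assumption, as are all names:

```python
def immediate_reward(prev_pos, pos, t, bomb=(30, 15)):
    """Immediate reward c * t * h of the Grid World task: c = 0.1 for a
    move away from the bomb, c = -0.09 for a move toward it, and h is
    the Hamming distance between the agent's position and the bomb."""
    def hamming(p):
        # number of coordinates in which p differs from the bomb
        return sum(pi != bi for pi, bi in zip(p, bomb))
    def manhattan(p):
        return abs(p[0] - bomb[0]) + abs(p[1] - bomb[1])
    # Assumption: toward/away is decided by the change in Manhattan
    # distance to the bomb caused by the move.
    c = 0.1 if manhattan(pos) > manhattan(prev_pos) else -0.09
    return c * t * hamming(pos)
```

The immediate rewards grow with t, so the distraction away from the bomb becomes stronger over time, which is what forces precise Q-value estimates.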
For all delays, RUDDER is significantly faster than all other methods, with p-values < 10^-12. The speed-ups over MC and MCTS appear to grow exponentially with the delay. RUDDER is exponentially faster with increasing delay than Q(λ), supporting Supplements Theorem S8. RUDDER significantly outperforms all other methods.

Figure 1: Comparison of RUDDER and other methods on artificial tasks with respect to the learning time in episodes (median of 100 trials) vs. the delay of the reward. The shadow bands indicate the 40% and 60% quantiles. In (II), the y-axis of the inset is scaled by 10^5. In (III), reward shaping (RS), look-ahead advice (look-ahead), and look-back advice (look-back) use three different potential functions. In (III), the dashed blue line represents RUDDER with Q(λ), in contrast to RUDDER with Q-value estimation. In all tasks, RUDDER significantly outperforms all other methods.

(II) The Choice shows the problems of MC methods with delayed rewards. This task has probabilistic state transitions, which can be represented as a tree with states as nodes. The agent traverses the tree from the root (initial state) to the leaves (final states). At the root, the agent has to choose between the left and the right subtree, where one subtree has the higher expected reward. Thereafter, it traverses the tree randomly according to the transition probabilities. Each visited node adds its fixed share to the final reward. 
The delayed reward is given as the accumulated shares at a leaf. The task is solved when the agent always chooses the subtree with the higher expected reward. Figure 1(II) shows the learning times to solve the task vs. the delay of the reward, averaged over 100 trials. For all delays, RUDDER is significantly faster than all other methods, with p-values < 10^-8. The speed-up over MC appears to grow exponentially with the delay. RUDDER is exponentially faster with increasing delay than Q(λ), supporting Supplements Theorem S8. RUDDER significantly outperforms all other methods.

(III) Trace-Back shows the problems of potential-based reward shaping methods with delayed rewards. We investigate how fast information about delayed rewards is propagated back by RUDDER, Q(λ), SARSA(λ), and potential-based reward shaping. MC is skipped since it does not transfer back information. The agent can move in a 15 × 15 grid to the 4 adjacent positions as long as it remains on the grid. Starting at (7, 7), the number of moves per episode is T = 20. The optimal policy moves the agent up at t = 1 and right at t = 2, which gives an immediate reward of −50 at t = 2 and a delayed reward of 150 at the end, t = 20 = T. Therefore, the optimal return is 100. For any other policy, the agent receives only an immediate reward of 50 at t = 2. For t ≤ 2, state transitions are deterministic, while for t > 2 they are uniformly distributed and independent of the actions. Thus, the return does not depend on actions at t > 2. We compare RUDDER, original reward shaping, look-ahead advice, and look-back advice. As suggested by the authors, we use SARSA instead of Q-learning for look-back advice. We use three different potential functions for reward shaping, all based on the reward redistribution (see Supplements). At t = 2, there is a distraction since the immediate reward is −50 for the optimal action and 50 for the other actions. 
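The Trace-Back return structure described above can be condensed into a small sketch (action names are illustrative):

```python
def trace_back_return(a1, a2):
    """Return of one Trace-Back episode: only the first two actions matter.
    The optimal pair (up at t=1, right at t=2) earns -50 at t=2 plus a
    delayed 150 at t=T=20, i.e. return 100; every other pair earns +50."""
    if (a1, a2) == ("up", "right"):
        return -50 + 150  # immediate penalty plus delayed reward at episode end
    return 50
```

The sketch makes the distraction explicit: at t = 2 the optimal pair looks 100 reward units worse than the alternatives, and only the delayed reward at t = 20 reverses the ranking.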
RUDDER is significantly faster than all other methods, with p-values < 10^-17. Figure 1(III) shows the learning times averaged over 100 trials. RUDDER is exponentially faster than all other methods and significantly outperforms them.

Atari Games. RUDDER is evaluated with respect to its learning time and achieved scores on Atari games of the Arcade Learning Environment (ALE) [9] and OpenAI Gym [13]. RUDDER is used on top of the TRPO-based [40] policy gradient method PPO, which uses GAE [41]. Our PPO baseline differs from the original PPO baseline [42] in two aspects. (i) Instead of using the sign function of the rewards, rewards are scaled by their current maximum. In this way, the ratio between different rewards remains unchanged and the characteristics of games with large delayed rewards can be recognized. (ii) The safe exploration strategy of RUDDER is used. The entropy coefficient is replaced by proportional control [11, 10]. A coarse hyperparameter optimization is performed for the PPO baseline. For all 52 Atari games, RUDDER uses the same architectures, losses, and hyperparameters, which were optimized for the baseline.

            RUDDER   baseline   delay   delay-event
Bowling        192         56     200   strike pins
Solaris      1,827        616     122   navigate map
Venture      1,350        820     150   find treasure
Seaquest     4,770      1,616     272   collect divers

Table 1: Average scores over 3 random seeds with 10 trials each for delayed-reward Atari games. "delay": frames between reward and first related action. RUDDER considerably improves the PPO baseline on delayed-reward games.

The only difference to the PPO baseline is that the policy network predicts the value function of the redistributed reward, to utilize reward redistribution for PPO. Contribution analysis uses an LSTM network with differences of return predictions. Here ∆ is the pixel-wise difference of two consecutive frames, augmented with the current frame. 
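The ∆ input just described can be sketched as follows, modeling frames as 2-D lists of grayscale values (a simplifying assumption, as are the names):

```python
def lstm_input(prev_frame, frame):
    """Build the Delta input: the pixel-wise difference of two consecutive
    frames, augmented with (stacked on) the current frame. Frames are
    modeled here as 2-D lists of grayscale values."""
    delta = [[f - p for f, p in zip(f_row, p_row)]
             for f_row, p_row in zip(frame, prev_frame)]
    return [delta, frame]  # two input channels: difference and raw frame
```

The difference channel highlights motion (e.g. the rolling ball in Bowling), while the raw frame preserves static context such as the remaining pins.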
LSTM training and reward redistribution are restricted to sequence chunks of 500 frames. Policies are trained with a no-op starting condition for 200M game frames, using every 4th frame. Training episodes end with the loss of a life or after at most 108K frames. All scores are averaged over 3 different random seeds for the network and the ALE initialization. We assess the performance by the learning time and the achieved scores. First, we compare RUDDER to the baseline by the average scores per game throughout training, to assess learning speed [42]. For 32 (20) games, RUDDER (the baseline) learns faster on average. Next, we compare the average scores of the last 10 training games. For 29 (23) games, RUDDER (the baseline) has higher average scores. In the majority of games, RUDDER improves the scores of the PPO baseline. To compare RUDDER and the baseline on Atari games that are characterized by delayed rewards, we select the games Bowling, Solaris, Venture, and Seaquest. In these games, high scores are achieved by learning the delayed rewards, while learning the immediate rewards and extensive exploration (as for Montezuma's Revenge) are less important. The results are presented in Table 1. For more details and further results see Supplements S4.2. Figure 2 displays how RUDDER redistributes rewards to key events in Bowling. At delayed-reward Atari games, RUDDER considerably increases the scores compared to the PPO baseline.

Figure 2: RUDDER redistributes rewards to key events in the Atari game Bowling. Originally, rewards are delayed and only given at episode end. The first 120 out of 200 frames of the episode are shown. RUDDER identifies key actions that steer the ball to hit all pins.

Conclusion. We have introduced RUDDER, a novel reinforcement learning algorithm based on the new concepts of reward redistribution and return decomposition via contribution analysis. 
On artificial tasks, RUDDER significantly outperforms TD(λ), MC, MCTS, and reward shaping methods. On Atari games, RUDDER on average improves a PPO baseline, with the most prominent improvement on games with long reward delays.

Acknowledgments

This work was supported by NVIDIA Corporation, Merck KGaA, Audi.JKU Deep Learning Center, Audi Electronic Venture GmbH, Janssen Pharmaceutica (madeSMART), TGW Logistics Group, ZF Friedrichshafen AG, UCB S.A., FFG grant 871302, LIT grant DeepToxGen, and AI-SNN.

5 References

[1] L. Arras, J. Arjona-Medina, M. Widrich, G. Montavon, M. Gillhofer, K. Müller, S. Hochreiter, and W. Samek. Explaining and Interpreting LSTMs, pages 211–238. Springer International Publishing, 2019.

[2] Y. Aytar, T. Pfaff, D. Budden, T. Le Paine, Z. Wang, and N. de Freitas. Playing hard exploration games by watching YouTube. ArXiv, 1805.11592, 2018.

[3] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.

[4] B. Bakker. Reinforcement learning with long short-term memory. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1475–1482. MIT Press, 2002.

[5] B. Bakker. Reinforcement learning by backpropagation through an LSTM model/critic. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 127–134, 2007.

[6] A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. Zídek, and R. Munos. Transfer in deep reinforcement learning using successor features and generalised policy improvement. 
In 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 501–510, 2018. ArXiv, 1901.10964.

[7] A. Barreto, W. Dabney, R. Munos, J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems 30, pages 4055–4065, 2017. ArXiv, 1606.05312.

[8] A. G. Barto and T. G. Dietterich. Handbook of Learning and Approximate Dynamic Programming, chapter Reinforcement Learning and Its Relationship to Supervised Learning, pages 45–63. IEEE Press, John Wiley & Sons, 2015.

[9] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[10] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. ArXiv, 1703.10717, 2017.

[11] W. Bolton. Instrumentation and Control Systems, chapter 5 - Process Controllers, pages 99–121. Newnes, 2nd edition, 2015.

[12] V. S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.

[13] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. ArXiv, 1606.01540, 2016.

[14] A. D. Edwards, L. Downs, and J. C. Davidson. Forward-backward reinforcement learning. ArXiv, 1803.10227, 2018.

[15] J. Fu, K. Luo, and S. Levine. Learning robust rewards with adversarial inverse reinforcement learning. ArXiv, 1710.11248, 2018. Sixth International Conference on Learning Representations (ICLR).

[16] A. Goyal, P. Brakel, W. Fedus, T. Lillicrap, S. Levine, H. Larochelle, and Y. Bengio. Recall traces: Backtracking models for efficient reinforcement learning. ArXiv, 1804.00379, 2018.

[17] A. Harutyunyan, W. Dabney, T. Mesnard, M. G. Azar, B. 
Piot, N. Heess, H. P. van Hasselt, G. Wayne, S. Singh, D. Precup, and R. Munos. Hindsight credit assignment. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 12467–12476. Curran Associates, Inc., 2019.

[18] P. Hernandez-Leal, B. Kartal, and M. E. Taylor. Is multiagent deep reinforcement learning the answer or the question? A brief survey. ArXiv, 1810.05587, 2018.

[19] S. Hochreiter. Implementierung und Anwendung eines 'neuronalen' Echtzeit-Lernalgorithmus für reaktive Umgebungen. Practical work, Supervisor: J. Schmidhuber, Institut für Informatik, Technische Universität München, 1990.

[20] C. Hung, T. Lillicrap, J. Abramson, Y. Wu, M. Mirza, F. Carnevale, A. Ahuja, and G. Wayne. Optimizing agent behavior over long time scales by transporting value. ArXiv, 1810.06721, 2018.

[21] C.-C. Hung, T. Lillicrap, J. Abramson, Y. Wu, M. Mirza, F. Carnevale, A. Ahuja, and G. Wayne. Optimizing agent behavior over long time scales by transporting value. Nature Communications, 10(5223), 2019.

[22] P. Karmakar and S. Bhatnagar. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research, 2017.

[23] N. Ke, A. Goyal, O. Bilaniuk, J. Binas, M. Mozer, C. Pal, and Y. Bengio. Sparse attentive backtracking: Temporal credit assignment through reminding. In Advances in Neural Information Processing Systems 31, pages 7640–7651, 2018.

[24] L. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, Pittsburgh, 1993.

[25] J. Luoma, S. Ruutu, A. W. King, and H. Tikkanen. Time delays, competitive interdependence, and firm performance. Strategic Management Journal, 38(3):506–525, 2017.

[26] P. W. Munro. 
A dual back-propagation scheme for scalar reinforcement learning. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165–176, 1987.

[27] A. Y. Ng, D. Harada, and S. J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML'99), pages 278–287, 1999.

[28] J. Peters and S. Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pages 745–750, 2007.

[29] T. Pohlen, B. Piot, T. Hester, M. G. Azar, D. Horgan, D. Budden, G. Barth-Maron, H. van Hasselt, J. Quan, M. Večerík, M. Hessel, R. Munos, and O. Pietquin. Observe and look further: Achieving consistent performance on Atari. ArXiv, 1805.11593, 2018.

[30] H. Rahmandad, N. Repenning, and J. Sterman. Effects of feedback delay on learning. System Dynamics Review, 25(4):309–338, 2009.

[31] A. J. Robinson. Dynamic Error Propagation Networks. PhD thesis, Trinity Hall and Cambridge University Engineering Department, 1989.

[32] T. Robinson and F. Fallside. Dynamic reinforcement driven error propagation networks with application to game playing. In Proceedings of the 11th Conference of the Cognitive Science Society, Ann Arbor, pages 836–843, 1989.

[33] H. Sahni. Reinforcement learning never worked, and 'deep' only helped a bit. himanshusahni.github.io/2018/02/23/reinforcement-learning-never-worked.html, 2018.

[34] S. Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.

[35] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. ArXiv, 1511.05952, 2015.

[36] J. Schmidhuber. 
Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90 (revised), Institut für Informatik, Technische Universität München, 1990. Experiments by Sepp Hochreiter.

[37] J. Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 500–506. San Mateo, CA: Morgan Kaufmann, 1991. Pole balancing experiments by Sepp Hochreiter.

[38] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

[39] J. Schmidhuber. Reinforcement learning upside down: Don't predict rewards – just map them to actions. ArXiv, 1912.02875, 2019.

[40] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization. In 32nd International Conference on Machine Learning (ICML), volume 37 of Proceedings of Machine Learning Research, pages 1889–1897. PMLR, 2015.

[41] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. ArXiv, 1506.02438, 2015. Fourth International Conference on Learning Representations (ICLR'16).

[42] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. ArXiv, 1707.06347, 2018.

[43] S. Singh, T. Jaakkola, M. Littman, and C. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38:287–308, 2000.

[44] B. F. Skinner. Reinforcement today. American Psychologist, 13(3):94–99, 1958.

[45] R. K. Srivastava, P. Shyam, F. Mutz, W. Jaśkowski, and J. Schmidhuber. Training agents using upside-down reinforcement learning. 
ArXiv, 1912.02877, 2019.

[46] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. ArXiv, 1703.01365, 2017.

[47] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018.

[48] A. Trott, S. Zheng, C. Xiong, and R. Socher. Keeping your distance: Solving sparse reward tasks using self-balancing shaped rewards. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 10376–10386. Curran Associates, Inc., 2019.

[49] P. J. Werbos. A menu of designs for reinforcement learning over time. In W. T. Miller, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control, pages 67–95. MIT Press, Cambridge, MA, USA, 1990.

[50] E. Wiewiora, G. Cottrell, and C. Elkan. Principled methods for advising reinforcement learning agents. In Proceedings of the Twentieth International Conference on Machine Learning (ICML'03), pages 792–799, 2003.