{"title": "Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards", "book": "Advances in Neural Information Processing Systems", "page_first": 10376, "page_last": 10386, "abstract": "While using shaped rewards can be beneficial when solving sparse reward tasks, their successful application often requires careful engineering and is problem specific.  For instance, in tasks where the agent must achieve some goal state, simple distance-to-goal reward shaping often fails, as it renders learning vulnerable to local optima. We introduce a simple and effective model-free method to learn from shaped distance-to-goal rewards on tasks where success depends on reaching a goal state.  Our method introduces an auxiliary distance-based reward based on pairs of rollouts to encourage diverse exploration.  This approach effectively prevents learning dynamics from stabilizing around local optima induced by the naive distance-to-goal reward shaping and enables policies to efficiently solve sparse reward tasks.  Our augmented objective does not require any additional reward engineering or domain expertise to implement and converges to the original sparse objective as the agent learns to solve the task.  We demonstrate that our method successfully solves a variety of hard-exploration tasks (including maze navigation and 3D construction in a Minecraft environment), where naive distance-based reward shaping otherwise fails, and intrinsic curiosity and reward relabeling strategies exhibit poor performance.", "full_text": "Keeping Your Distance: Solving Sparse Reward Tasks\n\nUsing Self-Balancing Shaped Rewards\n\nAlexander Trott\n\nSalesforce Research\n\nStephan Zheng\n\nSalesforce Research\n\natrott@salesforce.com\n\nstephan.zheng@salesforce.com\n\nCaiming Xiong\n\nSalesforce Research\n\nRichard Socher\n\nSalesforce Research\n\ncxiong@salesforce.com\n\nrsocher@salesforce.com\n\nAbstract\n\nWhile using shaped rewards can be bene\ufb01cial when solving sparse reward tasks,\ntheir successful application often requires careful engineering and is problem\nspeci\ufb01c. For instance, in tasks where the agent must achieve some goal state,\nsimple distance-to-goal reward shaping often fails, as it renders learning vulnerable\nto local optima. We introduce a simple and effective model-free method to learn\nfrom shaped distance-to-goal rewards on tasks where success depends on reaching\na goal state. Our method introduces an auxiliary distance-based reward based\non pairs of rollouts to encourage diverse exploration. This approach effectively\nprevents learning dynamics from stabilizing around local optima induced by the\nnaive distance-to-goal reward shaping and enables policies to ef\ufb01ciently solve\nsparse reward tasks. Our augmented objective does not require any additional\nreward engineering or domain expertise to implement and converges to the original\nsparse objective as the agent learns to solve the task. We demonstrate that our\nmethod successfully solves a variety of hard-exploration tasks (including maze\nnavigation and 3D construction in a Minecraft environment), where naive distance-\nbased reward shaping otherwise fails, and intrinsic curiosity and reward relabeling\nstrategies exhibit poor performance.\n\n1\n\nIntroduction\n\nReinforcement Learning (RL) offers a powerful framework for teaching an agent to perform tasks\nusing only observations from its environment. 
Formally, the goal of RL is to learn a policy that will\nmaximize the reward received by the agent; for many real-world problems, this requires access to\nor engineering a reward function that aligns with the task at hand. Designing a well-suited sparse\nreward function simply requires de\ufb01ning the criteria for solving the task: reward is provided if the\ncriteria for completion are met and withheld otherwise. While designing a suitable sparse reward may\nbe straightforward, learning from it within a practical amount of time often is not, often requiring\nexploration heuristics to help an agent discover the sparse reward (Pathak et al., 2017; Burda et al.,\n2018b,a). Reward shaping (Mataric, 1994; Ng et al., 1999) is a technique to modify the reward signal,\nand, for instance, can be used to relabel and learn from failed rollouts, based on which ones made\nmore progress towards task completion. This may simplify some aspects of learning, but whether\nthe learned behavior improves task performance depends critically on careful design of the shaped\nreward (Clark & Amodei, 2016). As such, reward shaping requires domain-expertise and is often\nproblem-speci\ufb01c (Mataric, 1994).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fTasks with well-de\ufb01ned goals provide an interesting extension of the traditional RL framework\n(Kaelbling, 1993; Sutton et al., 2011; Schaul et al., 2015). Such tasks often require RL agents to deal\nwith goals that vary across episodes and de\ufb01ne success as achieving a state within some distance of\nthe episode\u2019s goal. Such a setting naturally de\ufb01nes a sparse reward that the agent receives when it\nachieves the goal. Intuitively, the same distance-to-goal measurement can be further used for reward\nshaping (without requiring additional domain-expertise), given that it measures progress towards\nsuccess during an episode. However, reward shaping often introduces new local optima that can\nprevent agents from learning the optimal behavior for the original task. In particular, the existence\nand distribution of local optima strongly depends on the environment and task de\ufb01nition.\n\nAs such, successfully implementing reward shaping quickly becomes problem speci\ufb01c. These\nlimitations have motivated the recent development of methods to enable learning from sparse rewards\n(Schulman et al., 2017; Liu et al., 2019), methods to learn latent representations that facilitate shaped\nreward (Ghosh et al., 2018; Nair et al., 2018; Warde-Farley et al., 2019), and learning objectives that\nencourage diverse behaviors (Haarnoja et al., 2017; Eysenbach et al., 2019).\n\nWe propose a simple and effective method to address the limitations of using distance-to-goal as a\nshaped reward. In particular, we extend the naive distance-based shaped reward to handle sibling\ntrajectories, pairs of independently sampled trajectories using the same policy, starting state, and goal.\nOur approach, which is simple to implement, can be interpreted as a type of self-balancing reward:\nwe encourage behaviors that make progress towards the goal and simultaneously use sibling rollouts\nto estimate the local optima and encourage behaviors that avoid these regions, effectively balancing\nexploration and exploitation. This objective helps to de-stabilize local optima without introducing\nnew stable optima, preserving the task de\ufb01nition given by the sparse reward. 
This additional objective\nalso relates to the entropy of the distribution of terminal states induced by the policy; however, unlike\nother methods to encourage exploration (Haarnoja et al., 2017), our method is \u201cself-scheduling\u201d such\nthat our proposed shaped reward converges to the sparse reward as the policy learns to reach the goal.\n\nOur method combines the learnability of shaped rewards with the generality of sparse rewards,\nwhich we demonstrate through its successful application on a variety of environments that support\ngoal-oriented tasks. In summary, our contributions are as follows:\n\n\u2022 We propose Sibling Rivalry, a method for model-free, dynamic reward shaping that preserves\n\noptimal policies on sparse-reward tasks.\n\n\u2022 We empirically show that Sibling Rivalry enables RL agents to solve hard-exploration\nsparse-reward tasks, where baselines often struggle to learn. We validate in four settings,\nincluding continuous navigation and discrete bit \ufb02ipping tasks as well as hierarchical control\nfor 3D navigation and 3D construction in a demanding Minecraft environment.\n\n2 Preliminaries\n\nConsider an agent that must learn to maximize some task reward through its interactions with its\nenvironment. At each time point t throughout an episode, the agent observes its state st \u2208 S and\nselects an action at \u2208 A based on its policy \u03c0(at|st), yielding a new state s\u2032\nt sampled according to\nthe environment\u2019s transition dynamics p(s\u2032\nt|st, at) and an associated reward rt governed by the task-\nspeci\ufb01c reward function r(st, at, s\u2032\nt=0 denote the trajectory of states,\nactions, next states, and rewards collected during an episode of length T , where T is determined by\neither the maximum episode length or some task-speci\ufb01c termination conditions. The objective of\nthe agent is to learn a policy that maximizes its expected cumulative reward: E\u03c4 \u223c\u03c0,p [\u03a3t\u03b3trt].\n\nt). Let \u03c4 = {(st, at, s\u2032\n\nt, rt)}T \u22121\n\nReinforcement Learning for Goal-oriented tasks. The basic RL framework can be extended to\na more general setting where the underlying association between states, actions, and reward can\nchange depending on the parameters of a given episode (Sutton et al., 2011). From this perspective,\nthe agent must learn to optimize a set of potential rewards, exploiting the shared structure of the\nindividual tasks they each represent. This is applicable to the case of learning a goal-conditioned\npolicy \u03c0(at|st, g). Such a policy must embed a suf\ufb01ciently generic understanding of its environment\nto choose whatever actions lead to a state consistent with the goal g (Schaul et al., 2015). This setting\nnaturally occurs whenever a task is de\ufb01ned by some set of goals G that an agent must learn to reach\nwhen instructed. Typically, each episode is structured around a speci\ufb01c goal g \u2208 G sampled from the\n\n2\n\n\ftask distribution. In this work, we make the following assumptions in our de\ufb01nition of \u201cgoal-oriented\ntask\u201d:\n\n1. The task de\ufb01nes a distribution over starting states and goals \u03c1(s0, g) that are sampled to\n\nstart each episode.\n\n2. Goals can be expressed in terms of states such that there exists a function m(s) : S \u2192 G\n\nthat maps state s to its equivalent goal.\n\n3. 
An episode is considered a success once the state is within some radius of the goal, such that d(s, g) ≤ δ, where d(x, y) : G × G → R+ is a distance function1 and δ ∈ R+ is the distance threshold. (Note: this definition is meant to imply that the distance function internally applies the mapping m to any states that are used as input, so that d may equivalently be treated as a map S × G → R+; we omit this from the notation for brevity.)

This generic task definition allows for an equally generic sparse reward function r(s, g):

r(s, g) = { 1 if d(s, g) ≤ δ; 0 otherwise }.    (1)

From this, we define r_t := r(s'_t, g) so that the reward at time t depends on the state reached after taking action a_t from state s_t. Let us assume for simplicity that an episode terminates when either the goal is reached or a maximum number of actions is taken. This allows us to define a single reward for an entire trajectory considering only the terminal state, giving r_τ := r(s_T, g), where s_T is the state of the environment when the episode terminates. The learning objective now becomes finding a goal-conditioned policy that maximizes E_{τ∼π,p; s_0,g∼ρ}[r_τ].

3 Approach

Distance-based shaped rewards and local optima. We begin with the observation that the distance function d (used to define goal completion and compute the sparse reward) may be exposed as a shaped reward without any additional domain knowledge:

r̃(s, g) = { 1 if d(s, g) ≤ δ; −d(s, g) otherwise },  with r̃_τ := r̃(s_T, g).    (2)

By definition, a state that globally optimizes r̃ also achieves the goal (and yields the sparse reward), meaning that r̃ preserves the global optimum of r. While we expect the distance function itself to have a single (global) optimum with respect to s and a fixed g, in practice we need to consider the possibility that other local optima exist because of the state space structure, transition dynamics, and other features of the environment. For example, the agent may need to increase its distance to the goal in order to eventually reach it. This is exactly the condition faced in the toy task depicted in Figure 1. We would like to gain some intuition for how the learning dynamics are influenced by such local optima and how this influence can be mitigated.

The "learning dynamics" refer to the interaction between (i) the distribution of terminal states ρ^π_g(s_T) induced by a policy π in pursuit of goal g and (ii) the optimization of the policy with respect to E_{ρ^π_g(s_T)}[r̃(s_T, g)]. A local optimum o_g ∈ S can be considered "stable" if, for all policies within some basin of attraction, continued optimization causes ρ^π_g(s_T) to converge to o_g. Figure 1 (middle) presents an example of this. The agent observes its 2D position along the track and takes an action to change its position; its reward is based on its terminal state (after 5 steps). Because of its starting position, maximizing the naive reward r̃(s, g) causes the policy to "get stuck" at the local optimum o_g, i.e., the final state distribution ρ^π_g(s_T) is peaked around o_g.

In this example, the location of the local optimum is obvious and we can easily engineer a reward bonus for avoiding it. 
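Before generalizing that bonus, it may help to see Eqs. 1 and 2 in code. The following is a minimal sketch, assuming an L2 distance over terminal states that have already been mapped into goal space; the function names are ours, not the paper's reference implementation:

```python
import numpy as np

def sparse_reward(s_T, g, delta):
    """Eq. 1: reward 1 if the terminal state lies within delta of the goal, else 0."""
    # Assumes s_T has already been mapped into goal space via m(s_T).
    d = np.linalg.norm(s_T - g)  # an L1 or L2 metric is often sufficient (footnote 1)
    return 1.0 if d <= delta else 0.0

def naive_shaped_reward(s_T, g, delta):
    """Eq. 2: reward 1 on success; otherwise the negative distance-to-goal."""
    d = np.linalg.norm(s_T - g)
    return 1.0 if d <= delta else -d
```

Note that both sketches share the same success condition, which is why the shaped variant keeps the sparse objective's global optimum.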
In its more general form, this augmented reward is:

r′(s, g, ḡ) = { 1 if d(s, g) ≤ δ; min[0, −d(s, g) + d(s, ḡ)] otherwise },  with r′_τ := r′(s_T, g, ḡ),    (3)

where ḡ ∈ G acts as an 'anti-goal' and specifies a state that the agent should avoid, e.g., the local optimum o_g in the case of the toy task in Figure 1. Indeed, using r′ and setting ḡ ← o_g (that is, using o_g as the 'anti-goal') prevents the policy from getting stuck at the local optimum and enables the agent to quickly learn to reach the goal location (Figure 1, right).

While this works well in this toy setting, the intuition for which state(s) should be used as the 'anti-goal' ḡ will vary depending on the environment, the goal g, and the learning algorithm. In addition, using a fixed ḡ may be self-defeating if the resulting shaped reward introduces its own new local optima. To make use of r′(s, g, ḡ) in practice, we require a method to dynamically estimate the local optima that frustrate learning without relying on domain expertise or hand-picked estimations.

1A straightforward metric, such as L1 or L2 distance, is often sufficient to express goal completion.

[Figure 1 panels: Toy Environment; r_τ = r̃(s_T, g); r_τ = r′(s_T, g, o_g). Axes show Reward and Training Progress against Terminal State (phase), from Initial to Final.]

Figure 1: Motivating example. (Left) The agent's task is to reach the goal (green X) by controlling its position along a warped circular track. A distance-to-goal reward (L2 distance) creates a local optimum o_g (black X). (Middle and Right) Terminal state distributions during learning. The middle figure shows optimization using a distance-to-goal shaped reward. For the right figure, the shaped reward is augmented to include a hand-engineered bonus for avoiding o_g (Eq. 3; ḡ ← o_g). The red overlay illustrates the reward at each phase of the track.

[Figure 2 panels: All Rollouts; Farther Sibling (τ^f); Closer Sibling (τ^c). Axes show Training Progress against Terminal State (phase), from Initial to Final.]

Figure 2: Learning with Sibling Rivalry. Terminal state distribution over training when using SR. Middle and right plots show the farther τ^f and closer τ^c trajectories, respectively. Red overlay illustrates the shape of the naive distance-to-goal reward r̃.

Self-balancing reward. We propose to estimate local optima directly from the behavior of the policy by using sibling rollouts. We define a pair of sibling rollouts as two independently sampled trajectories sharing the same starting state s_0 and goal g. We use the notation τ^f, τ^c ∼ π|g to denote a pair of trajectories from two sibling rollouts, where the superscript specifies that τ^c ended closer to the goal than τ^f, i.e., that r̃_{τ^c} ≥ r̃_{τ^f}. By definition, optimization should tend to bring τ^f closer towards τ^c during learning. That is, it should make τ^f less likely and τ^c more likely. 
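For reference, the augmented reward of Eq. 3 can be sketched in the same style as above (again a minimal illustration, assuming an L2 distance and goal-space inputs; the function name is ours):

```python
import numpy as np

def self_balancing_reward(s_T, g, anti_goal, delta):
    """Eq. 3: reward 1 on success; otherwise min(0, -d(s, g) + d(s, anti_goal))."""
    d_goal = np.linalg.norm(s_T - g)
    if d_goal <= delta:
        return 1.0
    d_anti = np.linalg.norm(s_T - anti_goal)
    # The anti-goal term pushes the agent away from a state it should avoid
    # (e.g. a local optimum), but is clipped at 0 so it can only penalize lingering
    # near the anti-goal and never outweighs progress toward the actual goal.
    return min(0.0, -d_goal + d_anti)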
In other words, the terminal state of the closer rollout s^c_T can be used to estimate the location of local optima created by the distance-to-goal shaped reward.

To demonstrate this, we revisit the toy example presented in Figure 1 but introduce paired sampling to produce sibling rollouts (Figure 2). As before, we optimize the policy using r′ but with two important modifications. First, we use the sibling rollouts for mutual relabeling using the augmented shaped reward r′ (Eq. 3), where each rollout treats its sibling's terminal state as its own anti-goal:

r′_{τ^f} = r′(s^f_T, g, s^c_T)   and   r′_{τ^c} = r′(s^c_T, g, s^f_T).    (4)

Second, we only include the closer-to-goal trajectory τ^c for computing policy updates if it reached the goal. As shown in the distribution of s^c_T over training (Figure 2, right), s^c_T remains closely concentrated around an optimum: the local optimum early in training and later the global optimum g. Our use of sibling rollouts creates a reward signal that intrinsically balances exploitation and exploration by encouraging the policy to minimize distance-to-goal while de-stabilizing local optima created by that objective. Importantly, as the policy converges towards the global optimum (i.e., learns to reach the goal), r′ converges to the original underlying sparse reward r.

Algorithm 1: Sibling Rivalry

Given:
• Environment and goal-reaching task with S, G, A, ρ(s_0, g), m(·), d(·, ·), δ, and max episode length
• Policy π : S × G × A → [0, 1] and Critic V : S × G × G → R with parameters θ
• On-policy learning algorithm A, e.g., REINFORCE, Actor-Critic, PPO
• Inclusion threshold ε

for iteration = 1...K do
    Initialize transition buffer D
    for episode = 1...M do
        Sample s_0, g ∼ ρ
        τ^a ← π_θ(...)|s_0,g   # Collect rollout
        τ^b ← π_θ(...)|s_0,g   # Collect sibling rollout
        Relabel τ^a reward using r′ and ḡ ← m(s^b_T)
        Relabel τ^b reward using r′ and ḡ ← m(s^a_T)
        if d(s^a_T, g) < d(s^b_T, g) then
            τ^c ← τ^a; τ^f ← τ^b
        else
            τ^c ← τ^b; τ^f ← τ^a
        if d(s^c_T, s^f_T) < ε or d(s^c_T, g) < δ then
            Add τ^f and τ^c to buffer D
        else
            Add τ^f to buffer D
    Apply on-policy algorithm A to update θ using examples in D

Sibling Rivalry. From this, we derive a more general method for learning from sibling rollouts: Sibling Rivalry (SR). Algorithm 1 describes the procedure for integrating SR into existing on-policy algorithms for learning in the settings we described above2. SR has several key features:

1. sampling sibling rollouts,
2. mutual reward relabeling based on our self-balancing reward r′,
3. selective exclusion of τ^c (the closer rollout) trajectories from gradient estimation, using hyperparameter ε ∈ R+ to control the inclusion/exclusion criterion.

Consistent with the intuition presented above, we find that ignoring τ^c during gradient estimation helps prevent the policy from converging to local optima. In practice, however, it can be beneficial to learn directly from τ^c. 
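As a rough sketch of how the inner loop of Algorithm 1 fits together, the snippet below reuses the self_balancing_reward sketch above; collect_rollout, the trajectory fields, and the buffer interface are placeholders of our own rather than the API of the reference implementation:

```python
import numpy as np

def sibling_rivalry_episode(rho, collect_rollout, delta, epsilon, buffer):
    """One pass of Algorithm 1's inner loop (illustrative sketch).

    collect_rollout(s0, g) is a placeholder returning a trajectory whose .terminal
    field is the terminal state already mapped into goal space via m.
    """
    s0, g = rho()                       # sample starting state and goal
    tau_a = collect_rollout(s0, g)      # rollout
    tau_b = collect_rollout(s0, g)      # sibling rollout (same s0 and g)

    # Mutual relabeling: each rollout's terminal reward treats its sibling's
    # terminal state as the anti-goal (Eq. 4).
    tau_a.reward = self_balancing_reward(tau_a.terminal, g, tau_b.terminal, delta)
    tau_b.reward = self_balancing_reward(tau_b.terminal, g, tau_a.terminal, delta)

    # Identify the closer (tau_c) and farther (tau_f) sibling.
    d_a = np.linalg.norm(tau_a.terminal - g)
    d_b = np.linalg.norm(tau_b.terminal - g)
    tau_c, tau_f = (tau_a, tau_b) if d_a < d_b else (tau_b, tau_a)

    # Always learn from the farther rollout; include the closer one only if the two
    # siblings ended within epsilon of each other or the closer one reached the goal.
    buffer.append(tau_f)
    if np.linalg.norm(tau_c.terminal - tau_f.terminal) < epsilon or min(d_a, d_b) < delta:
        buffer.append(tau_c)
```

A standard on-policy update (e.g., PPO) is then applied to whatever trajectories the buffer contains.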
The hyperparameter \u01eb serves as an inclusion threshold for controlling when \u03c4 c\nis included in gradient estimation, such that SR always uses \u03c4 f for gradient estimation and includes\n\u03c4 c only if it reaches the goal or if d(sf\n\nT , sc\n\nT ) \u2264 \u01eb.\n\n2Reference implementation available at https://github.com/salesforce/sibling-rivalry\n\n5\n\n\fFigure 3: (Left) Maze environments. Top row illustrates our 2D point maze; bottom row shows the\nU-shaped Ant Maze in a Mujoco environment. For the 2D maze, start location is sampled within the\nblue square; in the ant maze, the agent starts near its pictured location. For both, the goal is randomly\nsampled from within the red square region. (Middle) Learning progress. Lines show rate of goal\ncompletion averaged over 5 experiments (shaded area shows mean\u00b1SD, clipped to [0, 1]). Only our\nmethod (PPO+SR) allows the agent to discover the goal in all experiments. Conversely, PPO with\nthe naive distance-to-goal reward never succeeds. Methods to learn from sparse rewards (PPO+ICM\nand DDPG+HER) only rarely discover the goals. Episodes have a maximum duration of 50 and 500\nenvironment steps for the 2D Point Maze and Ant Maze, respectively. (Right) State distribution.\nColored points illustrate terminal states achieved by the policy after each of the \ufb01rst 15 evaluation\ncheckpoints. PPO+SR allows the agent to discover increasingly good optima without becoming stuck\nin them.\n\nThe toy example above (Figure 2) shows an instance of using SR where the base algorithm is A2C,\nthe environment only yields end-of-episode reward (\u03b3 = 1), and the closer rollout \u03c4 c is only used in\ngradient estimation when that rollout reaches the goal (\u01eb = 0). In our below experiments we mostly\nuse end-of-episode rewards, although SR does not place any restriction on this choice. It should\nbe noted, however, that our method does require that full-episode rollouts are sampled in between\nparameter updates (based on the choice of treating the terminal state of the sibling rollout as \u00afg) and\nthat experimental control over episode conditions (s0 and g) is available.3 Lastly, we point out that\nwe include the state st, episode goal g, and anti-goal \u00afg as inputs to the critic network V ; the policy \u03c0\nsees only st and g.\n\nIn the appendix, we present a more formal motivation of the technique (Section A), additional\nclarifying examples addressing the behavior of SR at different degrees of local optimum severity\n(Section B), and an empirical demonstration (Section C) showing how \u01eb can be used to tune the\nsystem towards exploration (\u2193 \u01eb) or exploitation (\u2191 \u01eb).\n\n4 Experiments\n\nTo demonstrate the effectiveness of our method, we apply it to a variety of goal-reaching tasks. We\nfocus on settings where local optima interfere with learning from naive distance-to-goal shaped\nrewards. We compare this baseline to results using our approach as well as to results using curiosity\nand reward-relabeling in order to learn from sparse rewards. The appendix (Section F) provides\ndetailed descriptions of the environments, tasks, and implementation choices.\n\n2D Point-Maze Navigation. How do different training methods handle the exploration challenge\nthat arises in the presence of numerous local optima? To answer this, we train an agent to navigate a\nfully-continuous 2D point-maze with the con\ufb01guration illustrated in Figure 3 (top left). 
At each point in time, the agent only receives its current coordinates and the goal coordinates. It outputs an action that controls its change in location; the actual change is affected by collisions with walls. When training using Proximal Policy Optimization (Schulman et al., 2017) and a shaped distance-to-goal reward, the agent consistently learns to exploit the corridor at the top of the maze but never reaches the goal. By incorporating Sibling Rivalry (PPO + SR), the agent avoids this optimum (and all others) and discovers the path to the goal location, solving the maze.

3Though we observe SR to work when s_0 is allowed to differ between sibling rollouts (appendix, Sec. D)

[Figure 3 plots, described in the caption above: Maze Environment, Success Rate over training episodes (legend: PPO, PPO + SR, PPO + ICM, DDPG + HER), and Terminal State Distribution (PPO + SR) across epochs 1–15, for the 2D Point Maze and U-Shape Ant Maze.]

[Figure 4 panels: Agent State (location, bitmap), Episode Goal, Agent Actions (movement and pixel I/O).]

Figure 4: 2D discrete pixel-grid environment. The agent begins in a random location on a 13x13 grid with all pixels off and must move and toggle pixels to produce the goal bitmap. The agent sees its current location (1-hot), the current bitmap, and the goal bitmap. The agent succeeds when the bitmap exactly matches the goal (0-distance). Lines show rate of goal completion averaged over 5 experiments (shaded area shows mean±SD, clipped to [0, 1]). Episodes have a maximum duration of 50 environment steps.

We also examine the behavior of algorithms designed to enable learning from sparse rewards without reward shaping. Hindsight Experience Replay (HER) applies off-policy learning to relabel trajectories based on achieved goals (Andrychowicz et al., 2017). In this setting, HER [using a DDPG backbone (Lillicrap et al., 2016)] only learns to reach the goal on 1 of the 5 experimental runs, suggesting a failure in exploration since the achieved goals do not generalize to the task goals. Curiosity-based intrinsic reward (ICM), which is shown to maintain a curriculum of exploration (Pathak et al., 2017; Burda et al., 2018a), fails to discover the sparse reward at the same rate. Using random network distillation (Burda et al., 2018b), a related intrinsic motivation method, the agent never finds the goal (not shown for visual clarity). Only the agent that learns with SR is able to consistently and efficiently solve the maze (Figure 3, top middle).

Ant-Maze Navigation using Hierarchical RL. SR integrates easily with hierarchical RL (HRL), which can help to solve more difficult problems such as navigation in a complex control environment (Nachum et al., 2018). We use HRL to solve a U-Maze task with a MuJoCo (Todorov et al., 2012) ant agent (Figure 3, bottom left), requiring a higher-level policy to propose subgoals based on the current state and the goal of the episode, as well as a low-level policy to control the ant agent towards the given subgoal. For fair comparison, we employ a standardized approach for training the low-level controller from subgoals using PPO but vary the approach for training the high-level controller. 
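A rough sketch of how this hierarchical decomposition might be organized is given below; the subgoal interval k, the function names, and the return values are illustrative assumptions rather than details taken from the paper:

```python
def hierarchical_episode(env, high_policy, low_policy, g, k=10, max_steps=500):
    """Illustrative high/low-level loop: the high-level policy proposes a subgoal every
    k steps given the current state and episode goal; the low-level policy (trained with
    PPO in these experiments) drives the ant toward whichever subgoal it is given."""
    s = env.reset()
    subgoal = None
    high_level_steps = []
    for t in range(max_steps):
        if t % k == 0:
            subgoal = high_policy.act(s, g)   # propose a subgoal
            high_level_steps.append((s, subgoal))
        a = low_policy.act(s, subgoal)        # low-level control toward the subgoal
        s, done = env.step(a)                 # placeholder env interface
        if done:
            break
    # The terminal state s is what SR relabels against when training the high level.
    return s, high_level_steps
```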
For this experiment,\nwe restrict the start and goal locations to the opposite ends of the maze (Figure 3, bottom left).\n\nThe results when learning to navigate the ant maze corroborate those in the toy environment: learning\nfrom the naive distance-to-goal shaped reward \u02dcr fails because the wall creates a local optimum that\npolicy gradient is unable to escape (PPO). As with the 2D Point Maze, SR can exploit the optimum\nwithout becoming stuck in it (PPO+SR). This is clearly visible in the terminal state patterns over\nearly training (Figure 3, bottom right). We again compare with methods to learn from sparse rewards,\nnamely HER and ICM. As before, ICM stochastically discovers a path to the goal but at a low rate\n(2 in 5 experiments). In this setting, HER struggles to generalize from its achieved goals to the task\ngoals, perhaps due in part to the dif\ufb01culties of off-policy HRL (Nachum et al., 2018). 3 of the 5 HER\nruns eventually discover the goal but do not reach a high level of performance.\n\nApplication to a Discrete Environment. Distance-based rewards can create local optima in less\nobvious settings as well. To examine such a setting and to show that our method can apply to\nenvironments with discrete action/state spaces, we experiment with learning to manipulate a 2D\nbitmap to produce a goal con\ufb01guration. The agent starts in a random location on a 13x13 grid and\nmay move to an adjacent location or toggle the state of its current position (Figure 4, left). We use L1\ndistance (that is, the sum of bitwise absolute differences). Interestingly, this task does not require the\n\n7\n\n\fAgent State\n\n(visual input) (structure)\n\nEpisode\n\nGoal\n\nAgent Actions\n\nMove\n\nTurn\n\nLook Construct\n\nBreak\n\nPlace Block\n\nFigure 5: 3D construction task in Minecraft. The agent must control its location/orientation and\nbreak/place blocks in order to produce the goal structure. The agent observes its \ufb01rst-person visual\ninput, the discrete 3D cuboid of the construction arena, and the corresponding cuboid of the goal\nstructure. An episode is counted as a success when the structure exactly matches the goal. The\nStructure Completion Metric is difference between correctly and incorrectly placed blocks divided by\nthe number of goal-structure blocks. In the illustrated example, the agent has nearly constructed the\ngoal, which speci\ufb01es a height-2 diamond structure near the top left of the construction arena. Goal\nstructures vary in height, dimensions, and material (4806 unique combinations). Episodes have a\nmaximum duration of 100 environment steps.\n\nagent to increase the distance to the goal in order to reach it (as, for example, with the Ant Maze), but\nnaive distance-to-goal reward shaping still creates \u2018local optima\u2018 by introducing pathological learning\ndynamics: early in training, when behavior is closer to random, toggling a bit from off to on tends to\nincrease distance-to-goal and the agent quickly learns to avoid taking the toggle action. Indeed, the\nagents trained with naive distance-to-goal reward shaping \u02dcr never make progress (PPO). As shown in\nFigure 4, we can prevent this outcome and allow the agent to learn the task through incorporating\nSibling Rivalry (PPO+SR).\n\nAs one might expect, off-policy methods that can accommodate forced exploration may avoid this\nissue; DQN (Mnih et al., 2015) gradually learns the task (note: this required densifying the reward\nrather than using only the terminal state). 
However, exploration alone is not suf\ufb01cient on a task\nlike this since simply achieving diverse states is unlikely to let the agent discover the task structure\nrelating states, goals, and rewards, as evidenced by the failure of ICM to enable learning in this\nsetting. HER aims to learn this task structure from failed rollouts and, as an off-policy method,\nhandles forced exploration, allowing it to quickly learn this task. Intuitively, using distance as a\nreward signal automatically exposes the task structure but often at the cost of unwanted local optima.\nSibling Rivalry avoids that tradeoff, allowing ef\ufb01cient on-policy learning4.\n\n3D Construction in Minecraft. Finally, to demonstrate that Sibling Rivalry can be applied to\nlearning in complex environments, we apply it to a custom 3D construction task in Minecraft using\nthe Malmo platform (Johnson et al., 2016). Owing to practical limitations, we use this setting to\nillustrate the scalability of SR rather than to provide a detailed comparison with other methods.\nSimilar to the pixel-grid task, here the agent must produce a discrete goal structure by placing\nand removing blocks (Figure 5). However, this task introduces the challenge of a \ufb01rst-person 3D\nenvironment, combining continuous and discrete inputs, and application of aggressively asynchronous\ntraining with distributed environments [making use of the IMPALA framework (Espeholt et al., 2018)].\nSince success requires exact-match between the goal and constructed cuboids, we use the number of\nblock-wise differences as our distance metric. Using this distance metric as a naive shaped reward\ncauses the agent to avoid ever placing blocks within roughly 1000 episodes (not shown for visual\nclarity). Simply by incorporating Sibling Rivalry the agent avoids this local optimum and learns to\nachieve a high degree of construction accuracy and rate of exact-match success (Figure 5, right).\n\n5 Related Work\n\nIntrinsic Motivation. Generally speaking, the dif\ufb01culty in learning from sparse rewards comes\nfrom the fact that they tend to provide prohibitively rare signal to a randomly initialized agent.\nIntrinsic motivation describes a form of task-agnostic reward shaping that encourages exploration\nby rewarding novel states. Count-based methods track how often each state is visited to reward\n\n4We \ufb01nd that including both sibling trajectories (\u01eb = Inf) works best in the discrete-distance settings\n\n8\n\n\freaching relatively unseen states (Bellemare et al., 2016; Tang et al., 2017). Curiosity-driven methods\nencourage actions that surprise a separate model of the network dynamics (Pathak et al., 2017;\nBurda et al., 2018a; Zhao & Tresp, 2018). Burda et al. (2018b) introduce a similar technique using\ndistillation of a random network. In addition to being more likely to discover sparse reward, policies\nthat produce diverse coverage of states provide a strong initialization for downstream tasks (Haarnoja\net al., 2017; Eysenbach et al., 2019). Intrinsic motivation requires that the statistics of the agent\u2019s\nexperience be directly tracked or captured in the training progress of some external module. In\ncontrast, we use the policy itself to estimate and encourage exploratory behavior.\n\nCurriculum Learning and Self-Play. Concepts from curriculum learning (Bengio et al., 2009)\nhave been applied to facilitate learning goal-directed tasks (Molchanov et al., 2018; Nair et al., 2018).\nFlorensa et al. 
(2018), for example, introduce a generative adversarial network approach for automatic\ngeneration of a goal curriculum. On competitive tasks, such as 2-player games, self-play has enabled\nremarkable success (Silver et al., 2018). Game dynamics yield balanced reward and force agents to\navoid over-committing to suboptimal strategies, providing both a natural curriculum and incentive\nfor exploration. Similar bene\ufb01ts have been gained through asymmetric self-play with goal-directed\ntasks (Sukhbaatar et al., 2018a,b). Our approach shares some inspiration with this line of work but\ncombines the asymmetric objectives into a single reward function.\n\nLearning via Generalization. Hindsight Experience Replay (Andrychowicz et al., 2017) combines\nreward relabeling and off-policy methods to allow learning from sparse reward even on failed rollouts,\nleveraging the generalization ability of neural networks as universal value approximators (Schaul\net al., 2015). Asymmetric competition has been used to improve this method, presumably by inducing\nan automatic exploration curriculum that helps relieve the generalization burden (Liu et al., 2019).\n\nLatent Reward Shaping. A separate approach within reward shaping involves using latent repre-\nsentations of goals and states. Ghosh et al. (2018) estimate distance between two states based on\nthe actions a pre-trained policy would take to reach them. Nair et al. (2018) introduce a method\nfor unsupervised learning of goal spaces that allows practicing reaching imagined goal states by\ncomputing distance in latent space [see also P\u00e9r\u00e9 et al. (2018)]. Warde-Farley et al. (2019) use\ndiscriminitive training to learn to estimate similarity to a goal state from raw observations. Learned\nmodels have also been applied to perform reward shaping to overcome challenges related to delayed\nrewards (Arjona-Medina et al., 2019).\n\n6 Conclusion\n\nWe introduce Sibling Rivalry, a simple and effective method for learning goal-reaching tasks from a\ngeneric class of distance-based shaped rewards. Sibling Rivalry makes use of sibling rollouts and\nself-balancing rewards to prevent the learning dynamics from stabilizing around local optima. By\nleveraging the distance metric used to de\ufb01ne the underlying sparse reward, our technique enables\nrobust learning from shaped rewards without relying on carefully-designed, problem-speci\ufb01c reward\nfunctions. We demonstrate the applicability of our method across a variety of goal-reaching tasks\nwhere naive distance-to-goal reward shaping consistently fails and techniques to learn from sparse\nrewards struggle to explore properly and/or generalize from failed rollouts. Our experiments show\nthat Sibling Rivalry can be readily applied to both continuous and discrete domains, incorporated\ninto hierarchical RL, and scaled to demanding environments.\n\nReferences\n\nMarcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob\nMcGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight Experience Replay. In\nNIPS, 2017.\n\nJose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brand-\nstetter, and Sepp Hochreiter. RUDDER: Return Decomposition for Delayed Rewards. In NeurIPS,\n2019.\n\nMarc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi\n\nMunos. Unifying Count-Based Exploration and Intrinsic Motivation. In NIPS, 2016.\n\n9\n\n\fYoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. 
Curriculum Learning. In\n\nICML, 2009.\n\nYuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A. Efros.\n\nLarge-Scale Study of Curiosity-Driven Learning. arXiv, 2018a.\n\nYuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by Random Network\n\nDistillation. arXiv, 2018b.\n\nPo-Wei Chou, Maturana Daniel, and Sebastian Scherer. Improving Stochastic Policy Gradients in\nContinuous Control with Deep Reinforcement Learning using the Beta Distribution. In ICML,\n2017.\n\nJack Clark and Dario Amodei. Faulty reward functions in the wild. https://openai.com/\n\nblog/faulty-reward-functions/, 2016.\n\nLasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam\nDoron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA:\nScalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. In ICML,\n2018.\n\nBenjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity Is All You Need:\n\nLearning Skills Without a Reward Function. In ICLR, 2019.\n\nCarlos Florensa, David Held, Xinyang Geng, and Pieter Abbeel. Automatic Goal Generation for\n\nReinforcement Learning Agents. In ICML, 2018.\n\nDibya Ghosh, Abhishek Gupta, and Sergey Levine. Learning Actionable Representations with\n\nGoal-Conditioned Policies. arXiv, 2018.\n\nTuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement Learning with\n\nDeep Energy-Based Policies. In ICML, 2017.\n\nMatthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The Malmo Platform for Arti\ufb01cial\n\nIntelligence Experimentation. IJCAI, 2016.\n\nLeslie Pack Kaelbling. Learning to Achieve Goals. In IJCAI, 1993.\n\nTimothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,\nDavid Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR,\n2016.\n\nHao Liu, Alexander Trott, Richard Socher, and Caiming Xiong. Competitive Experience Replay. In\n\nICLR, 2019.\n\nMaja J Mataric. Reward Functions for Accelerated Learning. In ICML, 1994.\n\nVolodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,\nAlex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles\nBeattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane\nLegg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature,\n518(7540):529\u201333, 2015. ISSN 1476-4687.\n\nArtem Molchanov, Karol Hausman, Stan Birch\ufb01eld, and Gaurav Sukhatme. Region Growing\n\nCurriculum Generation for Reinforcement Learning. arXiv, 2018.\n\nO\ufb01r Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-Ef\ufb01cient Hierarchical Reinforce-\n\nment Learning. arXiv, 2018.\n\nAshvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual\n\nReinforcement Learning with Imagined Goals. arXiv, 2018.\n\nAndrew Y Ng, Daishi Harada, and Stuart Russel. Policy invariance under reward transformations:\n\ntheory and application to reward shaping. In ICML, 1999.\n\n10\n\n\fDeepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven Exploration by\n\nSelf-supervised Prediction. In ICML, 2017.\n\nAlexandre P\u00e9r\u00e9, S\u00e9bastien Forestier, Olivier Sigaud, and Pierre-Yves Oudeyer. Unsupervised Learning\n\nof Goal Spaces for Intrinsically Motivated Goal Exploration. 
In ICLR, 2018.\n\nNikolay Savinov, Anton Raichuk, Rapha\u00ebl Marinier, Damien Vincent, Marc Pollefeys, Timothy\n\nLillicrap, and Sylvain Gelly. Episodic Curiosity through Reachability. In ICLR, 2019.\n\nTom Schaul, Dan Horgan, Karol Gregor, and David Silver. Universal Value Function Approximators.\n\nIn ICML, 2015.\n\nJohn Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy\n\nOptimization Algorithms. arXiv, 2017.\n\nDavid Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur\nGuez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen\nSimonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess,\nshogi, and Go through self-play. Science, 362(6419):1140\u20131144, 2018. ISSN 0036-8075.\n\nSainbayar Sukhbaatar, Emily Denton, Arthur Szlam, and Rob Fergus. Learning Goal Embeddings\n\nvia Self-Play for Hierarchical Reinforcement Learning. arXiv, 2018a.\n\nSainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus.\n\nIntrinsic Motivation and Automatic Curricula via Asymmetric Self-Play. In ICLR, 2018b.\n\nRichard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, and Adam\nWhite. Horde : A Scalable Real-time Architecture for Learning Knowledge from Unsupervised\nSensorimotor Interaction Categories and Subject Descriptors. In AAMAS, 2011.\n\nHaoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman,\nFilip De Turck, and Pieter Abbeel. #Exploration: A Study of Count-Based Exploration for Deep\nReinforcement Learning. In NIPS, 2017.\n\nEmanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control.\n\nIEEE International Conference on Intelligent Robots and Systems, 2012. ISSN 21530858.\n\nDavid Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and\nVolodymyr Mnih. Unsupervised Control Through Non-Parametric Discriminative Rewards. In\nICLR, 2019.\n\nRui Zhao and Volker Tresp. Curiosity-Driven Experience Prioritization via Density Estimation. In\n\nNeurIPS Deep RL Workshop, 2018.\n\n11\n\n\f", "award": [], "sourceid": 5472, "authors": [{"given_name": "Alexander", "family_name": "Trott", "institution": "Salesforce Research"}, {"given_name": "Stephan", "family_name": "Zheng", "institution": "Salesforce"}, {"given_name": "Caiming", "family_name": "Xiong", "institution": "Salesforce"}, {"given_name": "Richard", "family_name": "Socher", "institution": "Salesforce"}]}