{"title": "Deep Reinforcement Learning from Human Preferences", "book": "Advances in Neural Information Processing Systems", "page_first": 4299, "page_last": 4307, "abstract": "For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. Our approach separates learning the goal from learning the behavior to achieve it. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on about 0.1% of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any which have been previously learned from human feedback.", "full_text": "Deep Reinforcement Learning\n\nfrom Human Preferences\n\nPaul F Christiano\n\nOpenAI\n\npaul@openai.com\n\nJan Leike\nDeepMind\n\nleike@google.com\n\nTom B Brown\nGoogle Brain\u21e4\n\ntombbrown@google.com\n\nMiljan Martic\n\nDeepMind\n\nmiljanm@google.com\n\nShane Legg\nDeepMind\n\nlegg@google.com\n\nAbstract\n\nDario Amodei\n\nOpenAI\n\ndamodei@openai.com\n\nFor sophisticated reinforcement learning (RL) systems to interact usefully with\nreal-world environments, we need to communicate complex goals to these systems.\nIn this work, we explore goals de\ufb01ned in terms of (non-expert) human preferences\nbetween pairs of trajectory segments. We show that this approach can effectively\nsolve complex RL tasks without access to the reward function, including Atari\ngames and simulated robot locomotion, while providing feedback on less than\n1% of our agent\u2019s interactions with the environment. This reduces the cost of\nhuman oversight far enough that it can be practically applied to state-of-the-art\nRL systems. To demonstrate the \ufb02exibility of our approach, we show that we can\nsuccessfully train complex novel behaviors with about an hour of human time.\nThese behaviors and environments are considerably more complex than any which\nhave been previously learned from human feedback.\n\n1\n\nIntroduction\n\nRecent success in scaling reinforcement learning (RL) to large problems has been driven in domains\nthat have a well-speci\ufb01ed reward function (Mnih et al., 2015, 2016; Silver et al., 2016). Unfortunately,\nmany tasks involve goals that are complex, poorly-de\ufb01ned, or hard to specify. Overcoming this\nlimitation would greatly expand the possible impact of deep RL and could increase the reach of\nmachine learning more broadly.\nFor example, suppose that we wanted to use reinforcement learning to train a robot to clean a table or\nscramble an egg. It\u2019s not clear how to construct a suitable reward function, which will need to be a\nfunction of the robot\u2019s sensors. We could try to design a simple reward function that approximately\ncaptures the intended behavior, but this will often result in behavior that optimizes our reward\nfunction without actually satisfying our preferences. 
This dif\ufb01culty underlies recent concerns about\nmisalignment between our values and the objectives of our RL systems (Bostrom, 2014; Russell,\n2016; Amodei et al., 2016). If we could successfully communicate our actual objectives to our agents,\nit would be a signi\ufb01cant step towards addressing these concerns.\nIf we have demonstrations of the desired task, we can use inverse reinforcement learning (Ng and\nRussell, 2000) or imitation learning to copy the demonstrated behavior. But these approaches are not\ndirectly applicable to behaviors that are dif\ufb01cult for humans to demonstrate (such as controlling a\nrobot with many degrees of freedom but non-human morphology).\n\n\u21e4Work done while at OpenAI.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fAn alternative approach is to allow a human to provide feedback on our system\u2019s current behavior\nand to use this feedback to de\ufb01ne the task. In principle this \ufb01ts within the paradigm of reinforcement\nlearning, but using human feedback directly as a reward function is prohibitively expensive for RL\nsystems that require hundreds or thousands of hours of experience. In order to practically train deep\nRL systems with human feedback, we need to decrease the amount of feedback required by several\norders of magnitude.\nWe overcome this dif\ufb01culty by asking humans to compare possible trajectories of the agent, using\nthat data to learn a reward function, and optimizing the learned reward function with RL.\nThis basic approach has been explored in the past, but we confront the challenges involved in scaling\nit up to modern deep RL and demonstrate by far the most complex behaviors yet learned from human\nfeedback.\nOur experiments take place in two domains: Atari games in the Arcade Learning Environment (Belle-\nmare et al., 2013), and robotics tasks in the physics simulator MuJoCo (Todorov et al., 2012). We\nshow that a small amount of feedback from a non-expert human, ranging from \ufb01fteen minutes to \ufb01ve\nhours, suf\ufb01ce to learn both standard RL tasks and novel hard-to-specify behaviors such as performing\na back\ufb02ip or driving with the \ufb02ow of traf\ufb01c.\n\n1.1 Related Work\n\nA long line of work studies reinforcement learning from human ratings or rankings, including Akrour\net al. (2011), Pilarski et al. (2011), Akrour et al. (2012), Wilson et al. (2012), Sugiyama et al. (2012),\nWirth and F\u00fcrnkranz (2013), Daniel et al. (2015), El Asri et al. (2016), Wang et al. (2016), and\nWirth et al. (2016). Other lines of research consider the general problem of reinforcement learning\nfrom preferences rather than absolute reward values (F\u00fcrnkranz et al., 2012; Akrour et al., 2014;\nWirth et al., 2016), and optimizing using human preferences in settings other than reinforcement\nlearning (Machwe and Parmee, 2006; Secretan et al., 2008; Brochu et al., 2010; S\u00f8rensen et al.,\n2016).\nOur algorithm follows the same basic approach as Akrour et al. (2012) and Akrour et al. (2014), but\nconsiders much more complex domains and behaviors. The complexity of our environments force us\nto use different RL algorithms, reward models, and training strategies. One notable difference is that\nAkrour et al. (2012) and Akrour et al. (2014) elicit preferences over whole trajectories rather than\nshort clips, and so would require about an order of magnitude more human time per data point. Our\napproach to feedback elicitation closely follows Wilson et al. 
(2012). However, Wilson et al. (2012)\nassumes that the reward function is the distance to some unknown (linear) \u201ctarget\u201d policy, and is\nnever tested with real human feedback.\nTAMER (Knox, 2012; Knox and Stone, 2013) also learns a reward function from human feedback,\nbut learns from ratings rather than comparisons, has the human observe the agent as it behaves,\nand has been applied to settings where the desired policy can be learned orders of magnitude more\nquickly.\nCompared to all prior work, our key contribution is to scale human feedback up to deep reinforcement\nlearning and to learn much more complex behaviors. This \ufb01ts into a recent trend of scaling reward\nlearning methods to large deep learning systems, for example inverse RL (Finn et al., 2016), imitation\nlearning (Ho and Ermon, 2016; Stadie et al., 2017), semi-supervised skill generalization (Finn et al.,\n2017), and bootstrapping RL from demonstrations (Silver et al., 2016; Hester et al., 2017).\n\n2 Preliminaries and Method\n\n2.1 Setting and Goal\n\nWe consider an agent interacting with an environment over a sequence of steps; at each time t the\nagent receives an observation o_t \u2208 O from the environment and then sends an action a_t \u2208 A to the\nenvironment.\nIn traditional reinforcement learning, the environment would also supply a reward r_t \u2208 R and the\nagent\u2019s goal would be to maximize the discounted sum of rewards. Instead of assuming that the\nenvironment produces a reward signal, we assume that there is a human overseer who can express\npreferences between trajectory segments. A trajectory segment is a sequence of observations and\nactions, \u03c3 = ((o_0, a_0), (o_1, a_1), . . . , (o_{k-1}, a_{k-1})) \u2208 (O \u00d7 A)^k. Write \u03c3^1 \u227b \u03c3^2 to indicate that the\nhuman preferred trajectory segment \u03c3^1 to trajectory segment \u03c3^2. Informally, the goal of the agent is\nto produce trajectories which are preferred by the human, while making as few queries as possible to\nthe human.\nMore precisely, we will evaluate our algorithms\u2019 behavior in two ways:\nQuantitative: We say that preferences \u227b are generated by a reward function2 r : O \u00d7 A \u2192 R if\n\n((o^1_0, a^1_0), . . . , (o^1_{k-1}, a^1_{k-1})) \u227b ((o^2_0, a^2_0), . . . , (o^2_{k-1}, a^2_{k-1}))\n\nwhenever\n\nr(o^1_0, a^1_0) + \u00b7\u00b7\u00b7 + r(o^1_{k-1}, a^1_{k-1}) > r(o^2_0, a^2_0) + \u00b7\u00b7\u00b7 + r(o^2_{k-1}, a^2_{k-1}).\n\nIf the human\u2019s preferences are generated by a reward function r, then our agent ought to\nreceive a high total reward according to r. So if we know the reward function r, we can\nevaluate the agent quantitatively. Ideally the agent will achieve reward nearly as high as if it\nhad been using RL to optimize r.\n\nQualitative: Sometimes we have no reward function by which we can quantitatively evaluate\nbehavior (this is the situation where our approach would be practically useful). In these\ncases, all we can do is qualitatively evaluate how well the agent satis\ufb01es the human\u2019s\npreferences. In this paper, we will start from a goal expressed in natural language, ask a\nhuman to evaluate the agent\u2019s behavior based on how well it ful\ufb01lls that goal, and then\npresent videos of agents attempting to ful\ufb01ll that goal.\n\nOur model based on trajectory segment comparisons is very similar to the trajectory preference\nqueries used in Wilson et al. (2012), except that we don\u2019t assume that we can reset the system to\nan arbitrary state3 and so our segments generally begin from different states. This complicates the\ninterpretation of human comparisons, but we show that our algorithm overcomes this dif\ufb01culty even\nwhen the human raters have no understanding of our algorithm.
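To make this setting concrete, the sketch below (Python, illustrative only and not the code used in our experiments) represents a trajectory segment as a list of (observation, action) pairs and implements a preference oracle that is generated by a reward function r in the sense above; the concrete observation and action types and the tie-handling convention are assumptions made purely for illustration. The same construction is what the synthetic oracle in Section 3.1 amounts to.\n\nfrom typing import Callable, List, Tuple\n\nObservation = Tuple[float, ...]  # placeholder types, purely for illustration\nAction = Tuple[float, ...]\nSegment = List[Tuple[Observation, Action]]\n\ndef segment_return(r: Callable[[Observation, Action], float], segment: Segment) -> float:\n    # r(o_0, a_0) + ... + r(o_{k-1}, a_{k-1})\n    return sum(r(o, a) for o, a in segment)\n\ndef oracle_preference(r, sigma1: Segment, sigma2: Segment) -> Tuple[float, float]:\n    # Distribution mu over {1, 2}: all mass on the segment with the larger total\n    # reward, or uniform when the totals are equal (a tie, as with two zero-reward\n    # Atari clips).\n    r1, r2 = segment_return(r, sigma1), segment_return(r, sigma2)\n    if r1 > r2:\n        return (1.0, 0.0)\n    if r2 > r1:\n        return (0.0, 1.0)\n    return (0.5, 0.5)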
2.2 Our Method\nAt each point in time our method maintains a policy \u03c0 : O \u2192 A and a reward function estimate\n\u02c6r : O \u00d7 A \u2192 R, each parametrized by deep neural networks.\nThese networks are updated by three processes:\n\n1. The policy \u03c0 interacts with the environment to produce a set of trajectories {\u03c4^1, . . . , \u03c4^i}.\nThe parameters of \u03c0 are updated by a traditional reinforcement learning algorithm, in order\nto maximize the sum of the predicted rewards r_t = \u02c6r(o_t, a_t).\n\n2. We select pairs of segments (\u03c3^1, \u03c3^2) from the trajectories {\u03c4^1, . . . , \u03c4^i} produced in step 1,\nand send them to a human for comparison.\n\n3. The parameters of the mapping \u02c6r are optimized via supervised learning to \ufb01t the comparisons\ncollected from the human so far.\n\nThese processes run asynchronously, with trajectories \ufb02owing from process (1) to process (2), human\ncomparisons \ufb02owing from process (2) to process (3), and parameters for \u02c6r \ufb02owing from process (3)\nto process (1). The following subsections provide details on each of these processes.\n\n2Here we assume that the reward is a function of the observation and action. In our experiments in\nAtari environments, we instead assume the reward is a function of the preceding 4 observations. In a general\npartially observable environment, we could instead consider reward functions that depend on the whole sequence\nof observations, and model this reward function with a recurrent neural network.\n\n3Wilson et al. (2012) also assumes the ability to sample reasonable initial states. But we work with high\ndimensional state spaces for which random states will not be reachable and the intended policy inhabits a\nlow-dimensional manifold.\n\n2.2.1 Optimizing the Policy\nAfter using \u02c6r to compute rewards, we are left with a traditional reinforcement learning problem. We\ncan solve this problem using any RL algorithm that is appropriate for the domain. One subtlety is\nthat the reward function \u02c6r may be non-stationary, which leads us to prefer methods which are robust\nto changes in the reward function. This led us to focus on policy gradient methods, which have been\napplied successfully for such problems (Ho and Ermon, 2016).\nIn this paper, we use advantage actor-critic (A2C; Mnih et al., 2016) to play Atari games, and trust\nregion policy optimization (TRPO; Schulman et al., 2015) to perform simulated robotics tasks. In\neach case, we used parameter settings which have been found to work well for traditional RL tasks.\nThe only hyperparameter which we adjusted was the entropy bonus for TRPO. This is because TRPO\nrelies on the trust region to ensure adequate exploration, which can lead to inadequate exploration if\nthe reward function is changing.\nWe normalized the rewards produced by \u02c6r to have zero mean and constant standard deviation. This is\na typical preprocessing step which is particularly appropriate here since the position of the rewards is\nunderdetermined by our learning problem.
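As a concrete illustration of this normalization step, the following minimal sketch (Python with NumPy; the use of per-batch statistics and a target standard deviation of 1 are assumptions for the example, and an implementation could equally maintain running statistics) rescales the predicted rewards before they are passed to the RL algorithm.\n\nimport numpy as np\n\ndef normalize_rewards(predicted_rewards, target_std=1.0, eps=1e-8):\n    # Shift and rescale the reward model outputs so that, over this batch, they\n    # have zero mean and a fixed standard deviation. Only relative values matter,\n    # since the position and scale of the learned reward are underdetermined by\n    # preference comparisons.\n    r = np.asarray(predicted_rewards, dtype=np.float64)\n    return (r - r.mean()) / (r.std() + eps) * target_std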
2.2.2 Preference Elicitation\nThe human overseer is given a visualization of two trajectory segments, in the form of short movie\nclips. In all of our experiments, these clips are between 1 and 2 seconds long.\nThe human then indicates which segment they prefer, that the two segments are equally good, or that\nthey are unable to compare the two segments.\n\nThe human judgments are recorded in a database D of triples (\u03c3^1, \u03c3^2, \u00b5), where \u03c3^1 and \u03c3^2 are the\ntwo segments and \u00b5 is a distribution over {1, 2} indicating which segment the user preferred. If the\nhuman selects one segment as preferable, then \u00b5 puts all of its mass on that choice. If the human\nmarks the segments as equally preferable, then \u00b5 is uniform. Finally, if the human marks the segments\nas incomparable, then the comparison is not included in the database.\n\n2.2.3 Fitting the Reward Function\nWe can interpret a reward function estimate \u02c6r as a preference-predictor if we view \u02c6r as a latent factor\nexplaining the human\u2019s judgments and assume that the human\u2019s probability of preferring a segment\n\u03c3^i depends exponentially on the value of the latent reward summed over the length of the clip:4\n\n\u02c6P[\u03c3^1 \u227b \u03c3^2] = exp \u03a3_t \u02c6r(o^1_t, a^1_t) / ( exp \u03a3_t \u02c6r(o^1_t, a^1_t) + exp \u03a3_t \u02c6r(o^2_t, a^2_t) ).   (1)\n\nWe choose \u02c6r to minimize the cross-entropy loss between these predictions and the actual human\nlabels:\n\nloss(\u02c6r) = - \u03a3_{(\u03c3^1, \u03c3^2, \u00b5) \u2208 D} ( \u00b5(1) log \u02c6P[\u03c3^1 \u227b \u03c3^2] + \u00b5(2) log \u02c6P[\u03c3^2 \u227b \u03c3^1] ).\n\nThis follows the Bradley-Terry model (Bradley and Terry, 1952) for estimating score functions from\npairwise preferences, and is the specialization of the Luce-Shephard choice rule (Luce, 2005; Shepard,\n1957) to preferences over trajectory segments.\nOur actual algorithm incorporates a number of modi\ufb01cations to this basic approach, which early\nexperiments discovered to be helpful and which are analyzed in Section 3.3:\n\n\u2022 We \ufb01t an ensemble of predictors, each trained on |D| triples sampled from D with replace-\nment. The estimate \u02c6r is de\ufb01ned by independently normalizing each of these predictors and\nthen averaging the results.\n\n\u2022 A fraction of 1/e of the data is held out to be used as a validation set for each predictor.\nWe use `2 regularization and adjust the regularization coef\ufb01cient to keep the validation loss\nbetween 1.1 and 1.5 times the training loss. In some domains we also apply dropout for\nregularization.\n\n\u2022 Rather than applying a softmax directly as described in Equation 1, we assume there is a\n10% chance that the human responds uniformly at random. Conceptually this adjustment is\nneeded because human raters have a constant probability of making an error, which doesn\u2019t\ndecay to 0 as the difference in reward becomes extreme.\n\n4Equation 1 does not use discounting, which could be interpreted as modeling the human to be indifferent\nabout when things happen in the trajectory segment. Using explicit discounting or inferring the human\u2019s discount\nfunction would also be reasonable choices.
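For concreteness, here is a minimal sketch of the preference-predictor loss above, namely Equation 1 with the 10% uniform-response adjustment (assuming PyTorch; the ensemble, the per-predictor normalization, and the regularization described above are omitted, and the per-segment reward sums are assumed to have been produced by the reward network for a batch of comparisons). It is an illustration, not the exact code used in our experiments.\n\nimport torch\n\ndef preference_loss(rhat_sum_1, rhat_sum_2, mu, eps_random=0.1):\n    # rhat_sum_i: tensor of shape (batch,), the predicted reward \u02c6r summed over\n    # the frames of segment i for each comparison in the batch.\n    # mu: tensor of shape (batch, 2), the human label distribution over {1, 2}.\n    # Softmax over the two summed rewards gives \u02c6P[\u03c3^1 \u227b \u03c3^2] as in Equation 1.\n    logits = torch.stack([rhat_sum_1, rhat_sum_2], dim=1)\n    p = torch.softmax(logits, dim=1)\n    # Assume a 10% chance that the rater responds uniformly at random, so the\n    # predicted probability never saturates at exactly 0 or 1.\n    p = (1.0 - eps_random) * p + eps_random * 0.5\n    # Cross-entropy between predicted preferences and human labels, averaged\n    # over the batch of comparisons.\n    return -(mu * torch.log(p)).sum(dim=1).mean()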
2.2.4 Selecting Queries\nWe decide how to query preferences based on an approximation to the uncertainty in the reward\nfunction estimator, similar to Daniel et al. (2014): we sample a large number of pairs of trajectory\nsegments of length k from the latest agent-environment interactions, use each reward predictor\nin our ensemble to predict which segment will be preferred from each pair, and then select those\ntrajectories for which the predictions have the highest variance across ensemble members.5 This is a\ncrude approximation and the ablation experiments in Section 3 show that in some tasks it actually\nimpairs performance. Ideally, we would want to query based on the expected value of information of\nthe query (Akrour et al., 2012; Krueger et al., 2016), but we leave it to future work to explore this\ndirection further.
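To illustrate the procedure just described, the following sketch (Python with NumPy, illustrative only; the ensemble interface, the candidate pool, and the 0/1 voting are assumptions for the example) ranks candidate pairs by how much the ensemble members disagree about which segment will be preferred.\n\nimport numpy as np\n\ndef select_queries(candidate_pairs, ensemble, num_queries):\n    # candidate_pairs: list of (segment_1, segment_2) pairs sampled from recent\n    # agent-environment interactions.\n    # ensemble: list of reward predictors; each maps a segment to a predicted\n    # total reward for that clip.\n    # Returns the pairs whose predicted preference varies most across the ensemble.\n    disagreement = []\n    for seg1, seg2 in candidate_pairs:\n        # For each ensemble member, a 0/1 vote on whether segment 1 is preferred.\n        votes = [float(member(seg1) > member(seg2)) for member in ensemble]\n        disagreement.append(np.var(votes))\n    ranked = np.argsort(disagreement)[::-1][:num_queries]\n    return [candidate_pairs[i] for i in ranked]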
3 Experimental Results\n\nWe implemented our algorithm in TensorFlow (Abadi et al., 2016). We interface with Mu-\nJoCo (Todorov et al., 2012) and the Arcade Learning Environment (Bellemare et al., 2013) through\nthe OpenAI Gym (Brockman et al., 2016).\n\n3.1 Reinforcement Learning Tasks with Unobserved Rewards\n\nIn our \ufb01rst set of experiments, we attempt to solve a range of benchmark tasks for deep RL without\nobserving the true reward. Instead, the agent learns about the goal of the task only by asking a human\nwhich of two trajectory segments is better. Our goal is to solve the task in a reasonable amount of\ntime using as few queries as possible.\nIn our experiments, feedback is provided by contractors who are given a 1-2 sentence description\nof each task before being asked to compare several hundred to several thousand pairs of trajectory\nsegments for that task (see Appendix B for the exact instructions given to contractors). Each trajectory\nsegment is between 1 and 2 seconds long. Contractors responded to the average query in 3-5 seconds,\nand so the experiments involving real human feedback required between 30 minutes and 5 hours of\nhuman time.\nFor comparison, we also run experiments using a synthetic oracle whose preferences are generated\n(in the sense of Section 2.1) by the real reward6. We also compare to the baseline of RL training\nusing the real reward. Our aim here is not to outperform but rather to do nearly as well as RL without\naccess to reward information and instead relying on much scarcer feedback. Nevertheless, note that\nfeedback from real humans does have the potential to outperform RL (and as shown below it actually\ndoes so on some tasks), because the human feedback might provide a better-shaped reward.\nWe describe the details of our experiments in Appendix A, including model architectures, modi\ufb01ca-\ntions to the environment, and the RL algorithms used to optimize the policy.\n\n3.1.1 Simulated Robotics\nThe \ufb01rst tasks we consider are eight simulated robotics tasks, implemented in MuJoCo (Todorov\net al., 2012), and included in OpenAI Gym (Brockman et al., 2016). We made small modi\ufb01cations\nto these tasks in order to avoid encoding information about the task in the environment itself (the\nmodi\ufb01cations are described in detail in Appendix A). The reward functions in these tasks are quadratic\nfunctions of distances, positions and velocities, and most are linear. We included a simple cartpole\ntask (\u201cpendulum\u201d) for comparison, since this is representative of the complexity of tasks studied in\nprior work.\n\n5Note that trajectory segments almost never start from the same state.\n6In the case of Atari games with sparse rewards, it is relatively common for two clips to both have zero\nreward in which case the oracle outputs indifference. Because we considered clips rather than individual states,\nsuch ties never made up a large majority of our data. Moreover, ties still provide signi\ufb01cant information to the\nreward predictor as long as they are not too common.\n\nFigure 1: Results on MuJoCo simulated robotics as measured on the tasks\u2019 true reward. We compare\nour method using real human feedback (purple), our method using synthetic feedback provided by\nan oracle (shades of blue), and reinforcement learning using the true reward function (orange). All\ncurves are the average of 5 runs, except for the real human feedback, which is a single run, and\neach point is the average reward over \ufb01ve consecutive batches. For Reacher and Cheetah feedback\nwas provided by an author due to time constraints. For all other tasks, feedback was provided by\ncontractors unfamiliar with the environments and with our algorithm. The irregular progress on\nHopper is due to one contractor deviating from the typical labeling schedule.\n\nFigure 1 shows the results of training our agent with 700 queries to a human rater, compared to\nlearning from 350, 700, or 1400 synthetic queries, as well as to RL learning from the real reward.\nWith 700 labels we are able to nearly match reinforcement learning on all of these tasks. Training\nwith learned reward functions tends to be less stable and higher variance, while having a comparable\nmean performance.\nSurprisingly, by 1400 labels our algorithm performs slightly better than if it had simply been given\nthe true reward, perhaps because the learned reward function is slightly better shaped\u2014the reward\nlearning procedure assigns positive rewards to all behaviors that are typically followed by high reward.\nThe difference may also be due to subtle changes in the relative scale of rewards or our use of entropy\nregularization.\nReal human feedback is typically only slightly less effective than the synthetic feedback; depending\non the task human feedback ranged from being half as ef\ufb01cient as ground truth feedback to being\nequally ef\ufb01cient. On the Ant task the human feedback signi\ufb01cantly outperformed the synthetic\nfeedback, apparently because we asked humans to prefer trajectories where the robot was \u201cstanding\nupright,\u201d which proved to be useful reward shaping. (There was a similar bonus in the RL reward\nfunction to encourage the robot to remain upright, but the simple hand-crafted bonus was not as\nuseful.)\n\n3.1.2 Atari\n\nThe second set of tasks we consider is a set of seven Atari games in the Arcade Learning Environ-\nment (Bellemare et al., 2013), the same games presented in Mnih et al. (2013).\n\nFigure 2: Results on Atari games as measured on the tasks\u2019 true reward. We compare our method using\nreal human feedback (purple), our method using synthetic feedback provided by an oracle (shades of\nblue), and reinforcement learning using the true reward function (orange). All curves are the average\nof 3 runs, except for the real human feedback which is a single run, and each point is the average\nreward over about 150,000 consecutive frames.\n\nFigure 2 shows the results of training our agent with 5,500 queries to a human rater, compared to\nlearning from 3,300, 5,500, or 10,000 synthetic queries, as well as to RL learning from the real reward.\nOur method has more dif\ufb01culty matching RL in these challenging environments, but nevertheless it\ndisplays substantial learning on most of them and matches or even exceeds RL on some. Speci\ufb01cally,
on BeamRider and Pong, synthetic labels match or come close to RL even with only 3,300 such\nlabels. On Seaquest and Qbert synthetic feedback eventually performs near the level of RL but learns\nmore slowly. On SpaceInvaders and Breakout synthetic feedback never matches RL, but nevertheless\nthe agent improves substantially, often passing the \ufb01rst level in SpaceInvaders and reaching a score of\n20 on Breakout, or 50 with enough labels.\nOn most of the games real human feedback performs similar to or slightly worse than synthetic\nfeedback with the same number of labels, and often comparably to synthetic feedback that has 40%\nfewer labels. On Qbert, our method fails to learn to beat the \ufb01rst level with real human feedback;\nthis may be because short clips in Qbert can be confusing and dif\ufb01cult to evaluate. Finally, Enduro\nis dif\ufb01cult for A3C to learn due to the dif\ufb01culty of successfully passing other cars through random\nexploration, and is correspondingly dif\ufb01cult to learn with synthetic labels, but human labelers tend to\nreward any progress towards passing cars, essentially shaping the reward and thus outperforming\nA3C in this game (the results are comparable to those achieved with DQN).\n\n3.2 Novel behaviors\n\nExperiments with traditional RL tasks help us understand whether our method is effective, but the\nultimate purpose of human interaction is to solve tasks for which no reward function is available.\nUsing the same parameters as in the previous experiments, we show that our algorithm can learn\nnovel complex behaviors. We demonstrate:\n\n1. The Hopper robot performing a sequence of back\ufb02ips (see Figure 4). This behavior was\ntrained using 900 queries in less than an hour. The agent learns to consistently perform a\nback\ufb02ip, land upright, and repeat.\n\n2. The Half-Cheetah robot moving forward while standing on one leg. This behavior was\ntrained using 800 queries in under an hour.\n\n3. Keeping alongside other cars in Enduro. This was trained with roughly 1,300 queries\nand 4 million frames of interaction with the environment; the agent learns to stay almost\nexactly even with other moving cars for a substantial fraction of the episode, although it gets\nconfused by changes in background.\n\nFigure 3: Performance of our algorithm on MuJoCo tasks after removing various components, as\ndescribed in Section 3.3. All graphs are averaged over 5 runs, using 700 synthetic labels\neach.\n\nVideos of these behaviors can be found at https://goo.gl/MhgvIU. These behaviors were trained\nusing feedback from the authors.\n\n3.3 Ablation Studies\n\nIn order to better understand the performance of our algorithm, we consider a range of modi\ufb01cations:\n\n1. We pick queries uniformly at random rather than prioritizing queries for which there is\ndisagreement (random queries).\n\n2. We train only one predictor rather than an ensemble (no ensemble). In this setting, we also\nchoose queries at random, since there is no longer an ensemble that we could use to estimate\ndisagreement.\n\n3. We train on queries only gathered at the beginning of training, rather than gathered through-\nout training (no online queries).\n\n4. We remove the `2 regularization and use only dropout (no regularization).\n5. 
On the robotics tasks only, we use trajectory segments of length 1 (no segments).\n6. Rather than \ufb01tting \u02c6r using comparisons, we consider an oracle which provides the true\ntotal reward over a trajectory segment, and \ufb01t \u02c6r to these total rewards using mean squared\nerror (target).\n\nThe results are presented in Figure 3 for MuJoCo and Figure 4 for Atari.\nTraining the reward predictor of\ufb02ine can lead to bizarre behavior that is undesirable as measured by\nthe true reward (Amodei et al., 2016). For instance, on Pong of\ufb02ine training sometimes leads our\nagent to avoid losing points but not to score points; this can result in extremely long volleys (videos\nat https://goo.gl/L5eAbk). This type of behavior demonstrates that in general human feedback\nneeds to be intertwined with RL rather than provided statically.\nOur main motivation for eliciting comparisons rather than absolute scores was that we found it much\neasier for humans to provide consistent comparisons than consistent absolute scores, especially on the\ncontinuous control tasks and on the qualitative tasks in Section 3.2; nevertheless it seems important\nto understand how using comparisons affects performance. For continuous control tasks we found\nthat predicting comparisons worked much better than predicting scores. This is likely because the\nscale of rewards varies substantially and this complicates the regression problem, which is smoothed\nsigni\ufb01cantly when we only need to predict comparisons. In the Atari tasks we clipped rewards\n\n8\n\n\fFigure 4: Performance of our algorithm on Atari tasks after removing various components, as\ndescribed in Section 3.3. All curves are an average of 3 runs using 5,500 synthetic labels (see minor\nexceptions in Section A.2).\n\nand effectively only predicted the sign, avoiding these dif\ufb01culties (this is not a suitable solution\nfor the continuous control tasks because the magnitude of the reward is important to learning). In\nthese tasks comparisons and targets had signi\ufb01cantly different performance, but neither consistently\noutperformed the other.\nWe also observed large performance differences when using single frames rather than clips.7 In order\nto obtain the same results using single frames we would need to have collected signi\ufb01cantly more\ncomparisons. In general we discovered that asking humans to compare longer clips was signi\ufb01cantly\nmore helpful per clip, and signi\ufb01cantly less helpful per frame. Shrinking the clip length below 1-2\nseconds did not signi\ufb01cantly decrease the human time required to label each clip in early experiments,\nand so seems less ef\ufb01cient per second of human time. In the Atari environments we also found that it\nwas often easier to compare longer clips because they provide more context than single frames.\n\n4 Discussion and Conclusions\n\nAgent-environment interactions are often radically cheaper than human interaction. We show that by\nlearning a separate reward model using supervised learning, it is possible to reduce the interaction\ncomplexity by roughly 3 orders of magnitude.\nAlthough there is a large literature on preference elicitation and reinforcement learning from unknown\nreward functions, we provide the \ufb01rst evidence that these techniques can be economically scaled up to\nstate-of-the-art reinforcement learning systems. 
This represents a step towards practical applications\nof deep RL to complex real-world tasks.\nIn the long run it would be desirable to make learning a task from human preferences no more dif\ufb01cult\nthan learning it from a programmatic reward signal, ensuring that powerful RL systems can be applied\nin the service of complex human values rather than low-complexity goals.\n\nAcknowledgments\nWe thank Olivier Pietquin, Bilal Piot, Laurent Orseau, Pedro Ortega, Victoria Krakovna, Owain\nEvans, Andrej Karpathy, Igor Mordatch, and Jack Clark for reading drafts of the paper. We thank\nTyler Adkisson, Mandy Beri, Jessica Richards, Heather Tran, and other contractors for providing the\n\n7We only ran these tests on continuous control tasks because our Atari reward model depends on a sequence\n\nof consecutive frames rather than a single frame, as described in Section A.2\n\n9\n\n\f", "award": [], "sourceid": 2251, "authors": [{"given_name": "Paul", "family_name": "Christiano", "institution": "OpenAI"}, {"given_name": "Jan", "family_name": "Leike", "institution": "DeepMind"}, {"given_name": "Tom", "family_name": "Brown", "institution": "Google Brain"}, {"given_name": "Miljan", "family_name": "Martic", "institution": "DeepMind"}, {"given_name": "Shane", "family_name": "Legg", "institution": "DeepMind"}, {"given_name": "Dario", "family_name": "Amodei", "institution": "OpenAI"}]}