{"title": "Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior", "book": "Advances in Neural Information Processing Systems", "page_first": 1454, "page_last": 1465, "abstract": "Inferring intent from observed behavior has been studied extensively within the frameworks of Bayesian inverse planning and inverse reinforcement learning. These methods infer a goal or reward function that best explains the actions of the observed agent, typically a human demonstrator. Another agent can use this inferred intent to predict, imitate, or assist the human user. However, a central assumption in inverse reinforcement learning is that the demonstrator is close to optimal. While models of suboptimal behavior exist, they typically assume that suboptimal actions are the result of some type of random noise or a known cognitive bias, like temporal inconsistency. In this paper, we take an alternative approach, and model suboptimal behavior as the result of internal model misspecification: the reason that user actions might deviate from near-optimal actions is that the user has an incorrect set of beliefs about the rules -- the dynamics -- governing how actions affect the environment. Our insight is that while demonstrated actions may be suboptimal in the real world, they may actually be near-optimal with respect to the user's internal model of the dynamics. By estimating these internal beliefs from observed behavior, we arrive at a new method for inferring intent. We demonstrate in simulation and in a user study with 12 participants that this approach enables us to more accurately model human intent, and can be used in a variety of applications, including offering assistance in a shared autonomy framework and inferring human preferences.", "full_text": "Where Do You Think You\u2019re Going?:\n\nInferring Beliefs about Dynamics from Behavior\n\nSiddharth Reddy, Anca D. 
Dragan, Sergey Levine\n\nDepartment of Electrical Engineering and Computer Science\n\nUniversity of California, Berkeley\n\n{sgr,anca,svlevine}@berkeley.edu\n\nAbstract\n\nInferring intent from observed behavior has been studied extensively within the\nframeworks of Bayesian inverse planning and inverse reinforcement learning.\nThese methods infer a goal or reward function that best explains the actions of\nthe observed agent, typically a human demonstrator. Another agent can use this\ninferred intent to predict, imitate, or assist the human user. However, a central\nassumption in inverse reinforcement learning is that the demonstrator is close to\noptimal. While models of suboptimal behavior exist, they typically assume that\nsuboptimal actions are the result of some type of random noise or a known cognitive\nbias, like temporal inconsistency. In this paper, we take an alternative approach,\nand model suboptimal behavior as the result of internal model misspeci\ufb01cation: the\nreason that user actions might deviate from near-optimal actions is that the user has\nan incorrect set of beliefs about the rules \u2013 the dynamics \u2013 governing how actions\naffect the environment. Our insight is that while demonstrated actions may be\nsuboptimal in the real world, they may actually be near-optimal with respect to the\nuser\u2019s internal model of the dynamics. By estimating these internal beliefs from\nobserved behavior, we arrive at a new method for inferring intent. 
We demonstrate\nin simulation and in a user study with 12 participants that this approach enables us\nto more accurately model human intent, and can be used in a variety of applications,\nincluding offering assistance in a shared autonomy framework and inferring human\npreferences.\n\n1\n\nIntroduction\n\nCharacterizing the drive behind human actions in the form of a goal or reward function is broadly\nuseful for predicting future behavior, imitating human actions in new situations, and augmenting\nhuman control with automated assistance \u2013 critical functions in a wide variety of applications, in-\ncluding pedestrian motion prediction [57], virtual character animation [38], and robotic teleoperation\n[35]. For example, remotely operating a robotic arm to grasp objects can be challenging for a human\nuser due to unfamiliar or unintuitive dynamics of the physical system and control interface. Existing\nframeworks for assistive teleoperation and shared autonomy aim to help users perform such tasks\n[35, 29, 46, 8, 45]. These frameworks typically rely on existing methods for intent inference in the\nsequential decision-making context, which use Bayesian inverse planning or inverse reinforcement\nlearning to learn the user\u2019s goal or reward function from observed control demonstrations. 
These\nmethods typically assume that user actions are near-optimal, and deviate from optimality due to\nrandom noise [56], speci\ufb01c cognitive biases in planning [16, 15, 4], or risk sensitivity [33].\n\nSee https://sites.google.com/view/inferring-internal-dynamics for supplementary materi-\n\nals, including videos and code.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThe key insight in this paper is that suboptimal behavior can also arise from a mismatch between the\ndynamics of the real world and the user\u2019s internal beliefs of the dynamics, and that a user policy that\nappears suboptimal in the real world may actually be near-optimal with respect to the user\u2019s internal\ndynamics model. As resource-bounded agents living in an environment of dazzling complexity,\nhumans rely on intuitive theories of the world to guide reasoning and planning [21, 26]. Humans\nleverage internal models of the world for motor control [53, 30, 14, 34, 49], goal-directed decision\nmaking [7], and representing the mental states of other agents [39]. Simpli\ufb01ed internal models can\nsystematically deviate from the real world, leading to suboptimal behaviors that have unintended\nconsequences, like hitting a tennis ball into the net or skidding on an icy road. For example, a classic\nstudy in cognitive science shows that human judgments about the physics of projectile motion are\ncloser to Aristotelian impetus theory than to true Newtonian dynamics \u2013 in other words, people tend\nto ignore or underestimate the effects of inertia [11]. 
Characterizing the gap between internal models\nand reality by modeling a user\u2019s internal predictions of the effects of their actions allows us to better\nexplain observed user actions and infer their intent.\nThe main contribution of this paper is a new algorithm for intent inference that \ufb01rst estimates a\nuser\u2019s internal beliefs of the dynamics of the world using observations of how they act to perform\nknown tasks, then leverages the learned internal dynamics model to infer intent on unknown tasks.\nIn contrast to the closest prior work [28, 22], our method scales to problems with high-dimensional,\ncontinuous state spaces and nonlinear dynamics. Our internal dynamics model estimation algorithm\nassumes the user takes actions with probability proportional to their exponentiated soft Q-values. We\n\ufb01t the parameters of the internal dynamics model to maximize the likelihood of observed user actions\non a set of tasks with known reward functions, by tying the internal dynamics to the soft Q function\nvia the soft Bellman equation. At test time, we use the learned internal dynamics model to predict the\nuser\u2019s desired next state given their current state and action input.\nWe run experiments \ufb01rst with simulated users, testing that we can recover the internal dynamics, even\nin MDPs with a continuous state space that would otherwise be intractable for prior methods. We\nthen run a user study with 12 participants in which humans play the Lunar Lander game (screenshot\nin Figure 1). We recover a dynamics model that explains user actions better than the real dynamics,\nwhich in turn enables us to assist users in playing the game by transferring their control policy from\nthe recovered internal dynamics to the real dynamics.\n\n2 Background\n\nInferring intent in sequential decision-making problems has been heavily studied under the framework\nof inverse reinforcement learning (IRL), which we build on in this work. 
The aim of IRL is to learn a user's reward function from observed control demonstrations. IRL algorithms are not directly applicable to our problem of learning a user's beliefs about the dynamics of the environment, but they provide a helpful starting point for thinking about how to extract hidden properties of a user from observations of how they behave.

In our work, we build on the maximum causal entropy (MaxCausalEnt) IRL framework [55, 6, 44, 36, 28]. In an MDP with a discrete action space A, the human demonstrator is assumed to follow a policy \pi that maximizes an entropy-regularized reward R(s, a, s') under dynamics T(s' | s, a). Equivalently,

\pi(a \mid s) \triangleq \frac{\exp(Q(s, a))}{\sum_{a' \in A} \exp(Q(s, a'))},    (1)

where Q is the soft Q function, which satisfies the soft Bellman equation [55],

Q(s, a) = \mathbb{E}_{s' \sim T(\cdot \mid s, a)} \left[ R(s, a, s') + \gamma V(s') \right],    (2)

with V the soft value function,

V(s) \triangleq \log \left( \sum_{a \in A} \exp(Q(s, a)) \right).    (3)

Prior work assumes T is the true dynamics of the real world, and fits a model of the reward R that maximizes the likelihood (given by Equation 1) of some observed demonstrations of the user acting in the real world. In our work, we assume access to a set of training tasks for which the rewards R are known, fit a model of the internal dynamics T that is allowed to deviate from the real dynamics, then use the recovered dynamics to infer intent (e.g., rewards) in new tasks.

3 Internal Dynamics Model Estimation

We split up the problem of intent inference into two parts: learning the internal dynamics model from user demonstrations on known tasks (the topic of this section), and using the learned internal model to infer intent on unknown tasks (discussed later in Section 4).
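For concreteness, the soft Bellman machinery of Equations 1-3 can be sketched in a tabular setting. This is a minimal illustration, not the paper's implementation; the function and variable names are ours.

```python
import numpy as np

def soft_value_iteration(T, R, gamma=0.9, n_iters=200):
    """Tabular soft value iteration for Equations 1-3.

    T: (S, A, S) array, T[s, a, s2] = T(s2 | s, a).
    R: (S, A, S) array, R[s, a, s2] = R(s, a, s2).
    Returns the soft Q function and the induced policy.
    """
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = np.log(np.exp(Q).sum(axis=1))                     # Eq. 3: soft value
        Q = (T * (R + gamma * V[None, None, :])).sum(axis=2)  # Eq. 2: soft Bellman backup
    # Eq. 1: actions are taken with probability proportional to exp(Q)
    pi = np.exp(Q - np.log(np.exp(Q).sum(axis=1, keepdims=True)))
    return Q, pi
```

Under this model the demonstrator is "noisily rational": higher-value actions are exponentially more likely, but every action has nonzero probability.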
We assume that the user's internal dynamics model is stationary, which is reasonable for problems like robotic teleoperation when the user has some experience practicing with the system but still finds it unintuitive or difficult to control. We also assume that the real dynamics are known ex ante or learned separately.

Our aim is to recover a user's implicit beliefs about the dynamics of the world from observations of how they act to perform a set of tasks. The key idea is that, when their internal dynamics model deviates from the real dynamics, we can no longer simply fit a dynamics model to observed state transitions. Standard dynamics learning algorithms typically assume access to (s, a, s') examples, with (s, a) features and s' labels, that can be used to train a classification or regression model p(s' | s, a) using supervised learning. In our setting, we instead have (s, a) pairs that indirectly encode the state transitions that the user expected to happen but that did not necessarily occur, because the user's internal model predicted different outcomes s' than those that actually occurred in the real world.

Our core assumption is that the user's policy is near-optimal with respect to the unknown internal dynamics model. To this end, we propose a new algorithm for learning the internal dynamics from action demonstrations: inverse soft Q-learning.

3.1 Inverse Soft Q-Learning

The key idea behind our algorithm is that we can fit a parametric model of the internal dynamics T that maximizes the likelihood of observed action demonstrations on a set of training tasks with known rewards by using the soft Q function as an intermediary.1 We tie the internal dynamics T to the soft Q function via the soft Bellman equation (Equation 2), which ensures that the soft Q function is induced by the internal dynamics T.
We tie the soft Q function to action likelihoods using Equation 1, which encourages the soft Q function to explain observed actions. We accomplish this by solving a constrained optimization problem in which the demonstration likelihoods appear in the objective and the soft Bellman equation appears in the constraints.

Formulating the optimization problem. Assume the action space A is discrete.2 Let i \in \{1, 2, ..., n\} denote the training task, R_i(s, a, s') denote the known reward function for task i, T denote the unknown internal dynamics, and Q_i denote the unknown soft Q function for task i. We represent Q_i using a function approximator Q_{\theta_i} with parameters \theta_i, and the internal dynamics using a function approximator T_\phi parameterized by \phi. Note that, while each task merits a separate soft Q function since each task has different rewards, all tasks share the same internal dynamics.

Recall the soft Bellman equation (Equation 2), which constrains Q_i to be the soft Q function for rewards R_i and internal dynamics T. An equivalent way to express this condition is that Q_i satisfies \delta_i(s, a) = 0 \;\forall s, a, where \delta_i is the soft Bellman error:

\delta_i(s, a) \triangleq Q_i(s, a) - \int_{s' \in S} T(s' \mid s, a) \left( R_i(s, a, s') + \gamma V_i(s') \right) ds'.    (4)

We impose the same condition on Q_{\theta_i} and T_\phi, i.e., \delta_{\theta_i, \phi}(s, a) = 0 \;\forall s, a. We assume our representations are expressive enough that there exist values of \theta_i and \phi that satisfy the condition. We fit parameters \theta_i and \phi to maximize the likelihood of the observed demonstrations while respecting the soft Bellman equation by solving the constrained optimization problem

\underset{\{\theta_i\}_{i=1}^n, \phi}{\text{minimize}} \quad \sum_{i=1}^n \sum_{(s, a) \in D_i^{\text{demo}}} -\log \pi_{\theta_i}(a \mid s)
\text{subject to} \quad \delta_{\theta_i, \phi}(s, a) = 0 \;\; \forall i \in \{1, 2, ..., n\},\; s \in S,\; a \in A,    (5)

where D_i^{\text{demo}} are the demonstrations for task i, and \pi_{\theta_i} denotes the action likelihood given by Q_{\theta_i} and Equation 1.

Solving the optimization problem. We use the penalty method [5] to approximately solve the constrained optimization problem described in Equation 5, which recasts the problem as unconstrained optimization of the cost function

c(\theta, \phi) \triangleq \sum_{i=1}^n \sum_{(s, a) \in D_i^{\text{demo}}} -\log \pi_{\theta_i}(a \mid s) + \frac{\rho}{2} \sum_{i=1}^n \int_{s \in S} \sum_{a \in A} \left( \delta_{\theta_i, \phi}(s, a) \right)^2 ds,    (6)

where \rho is a constant hyperparameter, \pi_{\theta_i} denotes the action likelihood given by Q_{\theta_i} and Equation 1, and \delta_{\theta_i, \phi} denotes the soft Bellman error, which relates Q_{\theta_i} to T_\phi through Equation 4.

For MDPs with a discrete state space S, we minimize the cost as is.

1 Our algorithm can in principle learn from demonstrations even when the rewards are unknown, but in practice we find that this relaxation usually makes learning the correct internal dynamics too difficult.

2 We assume a discrete action space to simplify our exposition and experiments. Our algorithm can be extended to handle MDPs with a continuous action space using existing sampling methods [25].
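A tabular sketch of Equations 4 and 6 may help fix ideas. The toy setup and names below are ours; the paper minimizes this objective over the parameters of Q and T with a gradient-based optimizer, which we omit and instead only evaluate the cost for fixed tables.

```python
import numpy as np

def soft_bellman_error(Q, T, R, gamma):
    """Eq. 4: delta(s, a) = Q(s, a) - E_{s' ~ T}[R(s, a, s') + gamma * V(s')]."""
    V = np.log(np.exp(Q).sum(axis=1))
    return Q - (T * (R + gamma * V[None, None, :])).sum(axis=2)

def penalty_objective(Qs, T, Rs, demos, gamma=0.9, rho=10.0):
    """Eq. 6: demonstration negative log-likelihood plus rho/2 times the
    squared soft Bellman errors, summed over tasks.

    Qs, Rs: per-task soft Q tables and reward tables.
    demos: per-task lists of observed (state, action) pairs.
    """
    total = 0.0
    for Q, R, demo in zip(Qs, Rs, demos):
        log_pi = Q - np.log(np.exp(Q).sum(axis=1, keepdims=True))  # Eq. 1
        total += -sum(log_pi[s, a] for s, a in demo)               # likelihood term
        total += 0.5 * rho * (soft_bellman_error(Q, T, R, gamma) ** 2).sum()
    return total
```

When Q is exactly the soft Q function induced by T, the penalty term vanishes and the objective reduces to the demonstration negative log-likelihood, which is the sense in which the constraint ties T to the observed actions.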
MDPs with a continuous state space present two challenges: (1) an intractable integral over states in the sum over penalty terms, and (2) integrals over states in the expectation terms of the soft Bellman errors \delta (recall Equation 4). To tackle (1), we resort to constraint sampling [10]; specifically, randomly sampling a subset of state-action pairs D_i^{\text{samp}} from rollouts of a random policy in the real world. To tackle (2), we choose a deterministic model of the internal dynamics T_\phi, which simplifies the integral over next states in Equation 4 to a single term.3

In our experiments, we minimize the objective in Equation 6 using Adam [31]. We use a mix of tabular representations, structured linear models, and relatively shallow multi-layer perceptrons to model Q_{\theta_i} and T_\phi. In the tabular setting, \theta_i is a table of numbers with a separate entry for each state-action pair, and \phi can be a table with an entry between 0 and 1 for each state-action-state triple. For linear and neural network representations, \theta_i and \phi are sets of weights.

3.2 Regularizing the Internal Dynamics Model

One issue with our approach to estimating the internal dynamics is that there tend to be multiple feasible internal dynamics models that explain the demonstration data equally well, which makes the correct internal dynamics model difficult to identify. We propose two different solutions to this problem: collecting demonstrations on multiple training tasks, and imposing a prior on the learned internal dynamics that encourages it to be similar to the real dynamics.

Multiple training tasks. If we only collect demonstrations on n = 1 training task, then at any given state s and action a, the recovered internal dynamics may simply assign a likelihood of one to the next state s' that maximizes the reward function R_1(s, a, s') of the single training task.
Intuitively, if our algorithm is given user demonstrations on only one task, then the user's actions can be explained by an internal dynamics model that always predicts the best possible next state for that one task (e.g., the target in a navigation task), no matter the current state or user action. We can mitigate this problem by collecting demonstrations on n > 1 training tasks, which prevents degenerate solutions by forcing the internal dynamics to be consistent with a diverse set of user policies.

Action intent prior. In our experiments, we also explore another way to regularize the learned internal dynamics: imposing the prior that the learned internal dynamics T_\phi should be similar to the known real dynamics T^{\text{real}} by restricting the support of T_\phi(\cdot \mid s, a) to states s' that are reachable in the real dynamics. Formally,

T_\phi(s' \mid s, a) \triangleq \sum_{a^{\text{int}} \in A} T^{\text{real}}(s' \mid s, a^{\text{int}}) f_\phi(a^{\text{int}} \mid s, a),    (7)

where a is the user's action, a^{\text{int}} is the user's intended action, and f_\phi : S \times A^2 \to [0, 1] captures the user's 'action intent' – the action they would have taken if they had perfect knowledge of the real dynamics. This prior changes the structure of our internal dynamics model to predict the user's intended action with respect to the real dynamics, rather than directly predicting their intended next state. Note that, when we use this action intent prior, T_\phi is no longer directly modeled.
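Concretely, Equation 7 mixes the real next-state distributions of each possible intended action, weighted by the action-intent model. A tabular sketch (names ours):

```python
import numpy as np

def internal_dynamics_from_intent(T_real, f):
    """Eq. 7: T_phi(s' | s, a) = sum over a_int of T_real(s' | s, a_int) * f_phi(a_int | s, a).

    T_real: (S, A, S) real dynamics.
    f: (S, A, A) action-intent model, f[s, a, a_int] = f_phi(a_int | s, a).
    """
    # Sum over the intended action a_int (index 'b' below).
    return np.einsum('sbt,sab->sat', T_real, f)
```

Because each row of the result is a convex combination of rows of T_real, the learned internal dynamics can only place probability on next states that are reachable under the real dynamics, which is exactly the restriction the prior imposes.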
Instead, we model f_\phi and use Equation 7 to compute T_\phi.

In our experiments, we examine the effects of employing multiple training tasks and imposing the action intent prior, together and in isolation.

3 Another potential solution is sampling states to compute a Monte Carlo estimate of the integral.

Figure 1: A high-level schematic of our internal-to-real dynamics transfer algorithm for shared autonomy, which uses the internal dynamics model learned by our method to assist the user with an unknown control task; in this case, landing the lunar lander between the flags. The user's actions are assumed to be consistent with their internal beliefs about the dynamics T_\phi, which differ from the real dynamics T^{\text{real}}. Our system models the internal dynamics to determine where the user is trying to go next, then acts to get there.

4 Using Learned Internal Dynamics Models

The ability to learn internal dynamics models from demonstrations is broadly useful for intent inference. In our experiments, we explore two applications: (1) shared autonomy, in which a human and robot collaborate to solve a challenging real-time control task, and (2) learning the reward function of a user who generates suboptimal demonstrations due to internal model misspecification. In (1), intent is formalized as the user's desired next state, while in (2), the user's intent is represented by their reward function.

4.1 Shared Autonomy via Internal-to-Real Dynamics Transfer

Many control problems involving human users are challenging for autonomous agents due to partial observability and imprecise task specifications, and are also challenging for humans due to constraints such as bounded rationality [48] and physical reaction time.
Shared autonomy combines human and machine intelligence to perform control tasks that neither can perform on its own, but existing methods have the basic requirement that the machine either needs a description of the task or feedback from the user, e.g., in the form of rewards [29, 8, 45]. We propose an alternative algorithm that assists the user without knowing their reward function by leveraging the internal dynamics model learned by our method. The key idea is formalizing the user's intent as their desired next state. We use the learned internal dynamics model to infer the user's desired next state given their current state and control input, then execute an action that will take the user to the desired state under the real dynamics; essentially, we transfer the user's policy from the internal dynamics to the real dynamics, akin to simulation-to-real transfer for robotic control [13]. See Figure 1 for a high-level schematic of this process.

Equipped with the learned internal dynamics model T_\phi, we perform internal-to-real dynamics transfer by observing the user's action input, computing the induced distribution over next states using the internal dynamics, and executing an action that induces a similar distribution over next states in the real dynamics. Formally, for user control input a^h_t and state s_t, we execute action a_t, where

a_t \triangleq \arg\min_{a \in A} D_{\text{KL}}\left( T_\phi(s_{t+1} \mid s_t, a^h_t) \,\|\, T^{\text{real}}(s_{t+1} \mid s_t, a) \right).    (8)

4.2 Learning Rewards from Misguided User Demonstrations

Most existing inverse reinforcement learning algorithms assume that the user's internal dynamics are equivalent to the real dynamics, and learn their reward function from near-optimal demonstrations. We explore a more realistic setting in which the user's demonstrations are suboptimal due to a mismatch between their internal dynamics and the real dynamics.
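For tabular dynamics, the transfer rule of Equation 8 reduces to a small search over actions. The sketch below is our own illustration under that assumption:

```python
import numpy as np

def transfer_action(T_phi, T_real, s, a_h, eps=1e-12):
    """Eq. 8: pick the real-world action whose next-state distribution is
    closest (in KL divergence) to the one the user expects under T_phi.

    T_phi, T_real: (S, A, S) internal and real dynamics tables.
    s: current state index; a_h: the user's control input.
    """
    p = T_phi[s, a_h]  # next-state distribution the user intends to induce
    kls = [(p * (np.log(p + eps) - np.log(T_real[s, a] + eps))).sum()
           for a in range(T_real.shape[1])]
    return int(np.argmin(kls))
```

If the user's internal model has, say, two thrusters swapped relative to reality, this rule un-swaps them: it executes whichever real action actually produces the outcome the user was trying to achieve.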
Users are 'misguided' in that their behavior is suboptimal in the real world, but near-optimal with respect to their internal dynamics. In this setting, standard IRL algorithms that do not distinguish between the internal and the real dynamics learn incorrect reward functions. Our method can be used to learn the internal dynamics, then explicitly incorporate the learned internal dynamics into an IRL algorithm's behavioral model of the user.

Figure 2: Left, Center: Error bars show standard error on ten random seeds. Our method learns accurate internal dynamics models, the regularization methods in Section 3.2 increase accuracy, and the approximations for continuous-state MDPs in Section 3.1 do not compromise accuracy. Right: Error regions show standard error on ten random tasks and ten random seeds each. Our method learns an internal dynamics model that enables MaxCausalEnt IRL to learn rewards from misguided user demonstrations.

In our experiments, we instantiate prior work with MaxCausalEnt IRL [55], which inverts the behavioral model from Equation 1 to infer rewards from demonstrations. We adapt it to our setting, in which the real dynamics are known and the internal dynamics are either learned (separately by our algorithm) or assumed to be the same as the known real dynamics.
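The adaptation amounts to evaluating demonstration likelihoods under the learned internal dynamics rather than the real ones. The toy sketch below makes this concrete with a grid search over candidate reward tables; the paper's actual IRL procedure is gradient-based, and all names here are ours.

```python
import numpy as np

def soft_q(T, R, gamma=0.9, n_iters=200):
    """Soft value iteration (Eqs. 2-3) under dynamics T."""
    Q = np.zeros(T.shape[:2])
    for _ in range(n_iters):
        V = np.log(np.exp(Q).sum(axis=1))
        Q = (T * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q

def pick_reward(T_internal, candidate_Rs, demo, gamma=0.9):
    """Score each candidate reward by the likelihood it assigns to the
    demonstrated actions under the *learned internal* dynamics, and
    return the index of the best-scoring candidate."""
    scores = []
    for R in candidate_Rs:
        Q = soft_q(T_internal, R, gamma)
        log_pi = Q - np.log(np.exp(Q).sum(axis=1, keepdims=True))  # Eq. 1
        scores.append(sum(log_pi[s, a] for s, a in demo))
    return int(np.argmax(scores))
```

Passing the real dynamics as T_internal recovers the standard assumption; passing the learned internal model is what lets the likelihood reward intentions rather than outcomes.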
MaxCausalEnt IRL cannot learn\nthe user\u2019s reward function from misguided demonstrations when it makes the standard assumption\nthat the internal dynamics are equal to the real dynamics, but can learn accurate rewards when it\ninstead uses the learned internal dynamics model produced by our algorithm.\n\n5 User Study and Simulation Experiments\n\nThe purpose of our experiments is two-fold: (1) to test the correctness of our algorithm, and (2)\nto test our core assumption that a human user\u2019s internal dynamics can be different from the real\ndynamics, and that our algorithm can learn an internal dynamics model that is useful for assisting\nthe user through internal-to-real dynamics transfer. To accomplish (1), we perform three simulated\nexperiments that apply our method to shared autonomy (see Section 4.1) and to learning rewards\nfrom misguided user demonstrations (see Section 4.2). In the shared autonomy experiments, we\n\ufb01rst use a tabular grid world navigation task to sanity-check our algorithm and analyze the effects of\ndifferent regularization choices from Section 3.2. We then use a continuous-state 2D navigation task\nto test our method\u2019s ability to handle continuous observations using the approximations described in\nSection 3.1. In the reward learning experiment, we use the grid world environment to compare the\nperformance of MaxCausalEnt IRL [55] when it assumes the internal dynamics are the same as the\nreal dynamics to when it uses the internal dynamics learned by our algorithm. To accomplish (2), we\nconduct a user study in which 12 participants play the Lunar Lander game (see Figure 1) with and\nwithout internal-to-real dynamics transfer assistance. We summarize these experiments in Sections\n5.1 and 5.2. Further details are provided in Section 9.1 of the appendix.\n\n5.1 Simulation Experiments\n\nShared autonomy. 
The grid world provides us with a domain where exact solutions are tractable,\nwhich enables us to verify the correctness of our method and compare the quality of the approximation\nin Section 3.1 with an exact solution to the learning problem. The continuous task provides a more\nchallenging domain where exact solutions via dynamic programming are intractable. In each setting,\nwe simulate a user with an internal dynamics model that is severely biased away from the real\ndynamics of the simulated environment. The simulated user\u2019s policy is near-optimal with respect to\ntheir internal dynamics, but suboptimal with respect to the real dynamics. Figure 2 (left and center\nplots) provides overall support for the hypothesis that our method can effectively learn tabular and\ncontinuous representations of the internal dynamics for MDPs with discrete and continuous state\nspaces. The learned internal dynamics models are accurate with respect to the ground truth internal\ndynamics, and internal-to-real dynamics transfer successfully assists the simulated users. The learned\ninternal dynamics model becomes more accurate as we increase the number of training tasks, and the\naction intent prior (see Section 3.2) increases accuracy when the internal dynamics are similar to the\nreal dynamics. 
These results confirm that our approximate algorithm is correct and yields solutions that do not significantly deviate from those of an exact algorithm. Further results and experimental details are discussed in Sections 9.1.1 and 9.1.2 of the appendix.

[Figure 2 plots: internal dynamics accuracy vs. number of training tasks (grid world navigation); internal dynamics L2 error vs. number of gradient steps (2D continuous-state navigation); true reward vs. number of gradient steps for IRL + our method, SERD (baseline), MaxCausalEnt IRL (baseline), and a random policy (learning rewards from misguided demos).]

Figure 3: Human users find the default game environment – the real dynamics – to be difficult and unintuitive, as indicated by their poor performance in the unassisted condition (top center and right plots) and their subjective evaluations (in Table 1). Our method observes suboptimal human play in the default environment, learns a setting of the game physics under which the observed human play would have been closer to optimal, then performs internal-to-real dynamics transfer to assist human users in achieving higher success rates and lower crash rates (top center and right plots). The learned internal dynamics has a slower game speed than the real dynamics (bottom left plot). The bottom center and right plots show successful (green) and failed (red) trajectories in the unassisted and assisted conditions.

Learning rewards from misguided user demonstrations. Standard IRL algorithms, such as MaxCausalEnt IRL [55], can fail to learn rewards from user demonstrations that are 'misguided', i.e., systematically suboptimal in the real world but near-optimal with respect to the user's internal dynamics.
Our algorithm can learn the internal dynamics model, and we can then explicitly incorporate\nthe learned internal dynamics into the MaxCausalEnt IRL algorithm to learn accurate rewards from\nmisguided demonstrations. We assess this method on a simulated grid world navigation task. Figure 2\n(right plot) supports our claim that standard IRL is ineffective at learning rewards from misguided user\ndemonstrations. After using our algorithm to learn the internal dynamics and explicitly incorporating\nthe learned internal dynamics into an IRL algorithm\u2019s model of the user, we see that it\u2019s possible\nto recover accurate rewards from these misguided demonstrations. Additional information on our\nexperimental setup is available in Section 9.1.3 of the appendix.\nIn addition to comparing to the standard MaxCausalEnt IRL baseline, we also conducted a comparison\n(shown in Figure 2) with a variant of the Simultaneous Estimation of Rewards and Dynamics (SERD)\nalgorithm [28] that simultaneously learns rewards and the internal dynamics instead of assuming\nthat the internal dynamics are equivalent to the real dynamics. This baseline performs better than\nrandom, but still much worse than our method. This result is supported by the theoretical analysis\nin Armstrong et al. [2], which characterizes the dif\ufb01culty of simultaneously deducing a human\u2019s\nrationality \u2013 in our case, their internal dynamics model \u2013 and their rewards from demonstrations.\n\n5.2 User Study on the Lunar Lander Game\n\nOur previous experiments were conducted with simulated expert behavior, which allowed us to\ncontrol the corruption of the internal dynamics. However, it remains to be seen whether this model of\nsuboptimality effectively re\ufb02ects real human behavior. 
We test this hypothesis in the next experiment, which evaluates whether our method can learn the internal dynamics accurately enough to assist real users through internal-to-real dynamics transfer.

[Figure 3 plots: crash rate and success rate, unassisted vs. assisted (Lunar Lander user study, 12 users); per-user assisted vs. unassisted success rate; likelihood vs. game speed for the real dynamics and the learned internal dynamics.]

Table 1: Subjective evaluations of the Lunar Lander user study from 12 participants. Means reported below for responses on a 7-point Likert scale, where 1 = Strongly Disagree, 4 = Neither Disagree nor Agree, and 7 = Strongly Agree. p-values from a one-way repeated measures ANOVA with the presence of assistance as a factor influencing responses.

Statement                                          Unassisted  Assisted  p-value
I enjoyed playing the game                         3.92        5.92      < .001
I improved over time                               3.08        5.83      < .0001
I didn't crash                                     1.17        3.00      < .001
I didn't fly out of bounds                         1.67        3.08      < .05
I didn't run out of time                           5.17        6.17      > .05
I landed between the flags                         1.92        4.00      < .001
I understood how to complete the task              6.42        6.75      < .05
I intuitively understood the physics of the game   4.58        6.00      < .01
My actions were carried out                        4.83        5.50      > .05
My intended actions were carried out               2.75        5.25      < .01

Task description. We use the Lunar Lander game from OpenAI Gym [9] (screenshot in Figure 1) to evaluate our algorithm with human users. The objective of the game is to land on the ground, without crashing or flying out of bounds, using two lateral thrusters and a main engine. The action space A consists of six discrete actions.
The state s \u2208 R9 encodes position, velocity, orientation, and\nthe location of the landing site, which is one of nine values corresponding to n = 9 distinct tasks.\nThe physics of the game are forward-simulated by a black-box function that takes as input seven\nhyperparameters, which include engine power and game speed. We manipulate whether or not the\nuser receives internal-to-real dynamics transfer assistance using an internal dynamics model trained\non their unassisted demonstrations. The dependent measures are the success and crash rates in each\ncondition. The task and evaluation protocol are discussed further in Section 9.2 of the appendix.\nAnalysis. In the default environment, users appear to play as though they underestimate the strength\nof gravity, which causes them to crash into the ground frequently (see the supplementary videos).\nFigure 3 (bottom left plot) shows that our algorithm learns an internal dynamics model characterized\nby a slower game speed than the real dynamics, which makes sense since a slower game speed\ninduces smaller forces and slower motion \u2013 conditions under which the users\u2019 action demonstrations\nwould have been closer to optimal. These results support our claim that our algorithm can learn an\ninternal dynamics model that explains user actions better than the real dynamics.\nWhen unassisted, users often crash or \ufb02y out of bounds due to the unintuitive nature of the thruster\ncontrols and the relatively fast pace of the game. Figure 3 (top center and right plots) shows that users\nsucceed signi\ufb01cantly more often and crash signi\ufb01cantly less often when assisted by internal-to-real\ndynamics transfer (see Section 9.2 of the appendix for hypothesis tests). 
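The assistance mechanism can be sketched with a simple one-step picture: execute the real action whose outcome under the real dynamics is closest to the outcome the user expects under their learned internal dynamics. This is an illustrative simplification, not the paper's exact implementation (which plans with a soft Q function learned under the internal dynamics); the helper name `transfer_action` and the nearest-next-state matching rule are our own assumptions.

```python
import numpy as np

def transfer_action(state, user_action, internal_step, real_step, actions):
    """Map the user's action to the real action whose next state (under the
    real dynamics) is closest to the next state the user expects (under
    their internal dynamics). Simplified one-step sketch."""
    # Next state the user presumably intends, under their internal model.
    intended_next = internal_step(state, user_action)
    # Execute whichever real action lands closest to that intended state.
    dists = [np.linalg.norm(real_step(state, a) - intended_next) for a in actions]
    return actions[int(np.argmin(dists))]

# Toy example: the user believes thrust is half as strong as it really is,
# so their commands overshoot; transfer scales them back down.
internal = lambda s, a: s + 0.5 * a   # user's (incorrect) internal dynamics
real = lambda s, a: s + 1.0 * a       # true dynamics
actions = [-1.0, -0.5, 0.0, 0.5, 1.0]
assisted = transfer_action(np.array([0.0]), 1.0, internal, real, actions)  # -> 0.5
```

In the Lunar Lander study, the learned internal dynamics model would play the role of `internal_step`, with the game's physics engine as `real_step`.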
The assistance makes the system feel easier to control (see the subjective evaluations in Table 1), makes it less likely to tip over (see the supplementary videos), and makes it move more slowly in response to user actions (assistance led to a 30% decrease in average speed). One of the key advantages of assistance was its positive effect on the rate at which users were able to switch between different actions: on average, unassisted users performed 18 actions per minute (APM), while assisted users performed 84 APM. Quickly switching between firing various thrusters enabled assisted users to better stabilize flight. These results demonstrate that the learned internal dynamics can be used to effectively assist the user through internal-to-real dynamics transfer, which in turn gives us confidence in the accuracy of the learned internal dynamics. After all, we cannot measure the accuracy of the learned internal dynamics by comparing it to the ground-truth internal dynamics, which is unknown for human users.

6 Related Work

The closest prior work in intent inference and action understanding comes from inverse planning [3] and inverse reinforcement learning [37], which use observations of a user's actions to estimate the user's goal or reward function. We take a fundamentally different approach to intent inference: using action observations to estimate the user's beliefs about the world dynamics.
The simultaneous estimation of rewards and dynamics (SERD) instantiation of MaxCausalEnt IRL [28] aims to improve the sample efficiency of IRL by forcing the learned real dynamics model to explain observed state transitions as well as actions. The framework includes terms for the demonstrator's beliefs about the dynamics, but the overall algorithm and experiments of Herman et al. [28] constrain those beliefs to be the same as the real dynamics.
Our goal is to learn an internal\ndynamics model that may deviate from the real dynamics. To this end, we propose two new internal\ndynamics regularization techniques, multi-task training and the action intent prior (see Section 3.2),\nand demonstrate their utility for learning an internal dynamics model that differs from the real\ndynamics (see Section 5.1). We also conduct a user experiment that shows human actions in a game\nenvironment can be better explained by a learned internal dynamics model than by the real dynamics,\nand that augmenting user control with internal-to-real dynamics transfer results in improved game\nplay. Furthermore, the SERD algorithm is well-suited to MDPs with a discrete state space, but\nintractable for continuous state spaces. Our method can be applied to MDPs with a continuous state\nspace, as shown in Sections 5.1 and 5.2.\nGolub et al. [22] propose an internal model estimation (IME) framework for brain-machine interface\n(BMI) control that learns an internal dynamics model from control demonstrations on tasks with\nlinear-Gaussian dynamics and quadratic reward functions. Our work is (1) more general in that it\nplaces no restrictions on the functional form of the dynamics or the reward function, and (2) does not\nassume sensory feedback delay, which is the fundamental premise of using IME for BMI control.\nRafferty et al. [43, 41, 42] use an internal dynamics learning algorithm to infer a student\u2019s incorrect\nbeliefs in online learning settings like educational games, and leverage the inferred beliefs to generate\npersonalized hints and feedback. Our algorithm is more general in that it is capable of learning\ncontinuous parameters of the internal dynamics, whereas the cited work is only capable of identifying\nthe internal dynamics given a discrete set of candidate models.\nModeling human error has a rich history in the behavioral sciences. 
Procrastination and other time-inconsistent human behaviors have been characterized as rational with respect to a cost model that discounts the cost of future action relative to that of immediate action [1, 32]. Systematic errors in human predictions about the future have been partially explained by cognitive biases like the availability heuristic and regression to the mean [50]. Imperfect intuitive physics judgments have been characterized as approximate probabilistic inferences made by a resource-bounded observer [26]. We take an orthogonal approach in which we assume that suboptimal behavior is primarily caused by incorrect beliefs about the dynamics, rather than by uncertainty or biases in planning and judgment.
Humans are resource-bounded agents that must take into account the computational cost of their planning algorithm when selecting actions [24]. One way to trade off the ability to find high-value actions for lower computational cost is to plan using a simplified, low-dimensional model of the dynamics [27, 19]. Evidence from the cognitive science literature suggests that humans find it difficult to predict the motion of objects when multiple information dimensions are involved [40]. Thus, we arrive at an alternative explanation for why humans may behave near-optimally with respect to a dynamics model that differs from the real dynamics: even if users have perfect knowledge of the real dynamics, they may not have the computational resources to plan under the real dynamics, and instead choose to plan using a simplified model.

7 Discussion

Limitations. Although our algorithm models the soft Q function with arbitrary neural network parameterizations, the internal dynamics parameterizations we use are smaller, with at most seven parameters for continuous tasks. Increasing the number of dynamics parameters would require a better approach to regularization than those proposed in Section 3.2.
Summary.
We contribute an algorithm that learns a user's implicit beliefs about the dynamics of the environment from demonstrations of their suboptimal behavior in the real environment. Simulation experiments and a small-scale user study demonstrate the effectiveness of our method at recovering a dynamics model that explains human actions, as well as its utility for applications in shared autonomy and inverse reinforcement learning.
Future work. The ability to learn internal dynamics models from demonstrations opens the door to new directions of scientific inquiry, like estimating young children's intuitive theories of physics and psychology without eliciting verbal judgments [52, 18, 23]. It also enables applications that involve intent inference, including adaptive brain-computer interfaces for prosthetic limbs [12, 47] that help users perform control tasks that are difficult to fully specify.

8 Acknowledgements

We would like to thank Oleg Klimov for open-sourcing his implementation of the Lunar Lander game, which was originally developed by Atari in 1979, and inspired by the lunar modules built in the 1960s and 70s for the Apollo space program. We would also like to thank Eliezer Yudkowsky for the fanfiction novel, Harry Potter and the Methods of Rationality – Harry's misadventure with the rocket-assisted broomstick in chapter 59 inspired us to try to close the gap between intuitive physics and the real world. This work was supported in part by a Berkeley EECS Department Fellowship for first-year Ph.D. students, Berkeley DeepDrive, computational resource donations from Amazon, NSF IIS-1700696, and AFOSR FA9550-17-1-0308.

References

[1] George A Akerlof. Procrastination and obedience. The American Economic Review, 81(2):1–19, 1991.

[2] Stuart Armstrong and Sören Mindermann. Impossibility of deducing preferences and rationality from human policy.
arXiv preprint arXiv:1712.05812, 2017.\n\n[3] Chris L Baker, Rebecca Saxe, and Joshua B Tenenbaum. Action understanding as inverse planning.\n\nCognition, 113(3):329\u2013349, 2009.\n\n[4] Leon Bergen, Owain Evans, and Joshua Tenenbaum. Learning structured preferences. In Proceedings of\n\nthe Annual Meeting of the Cognitive Science Society, volume 32, 2010.\n\n[5] Dimitri P Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic press, 2014.\n[6] Michael Bloem and Nicholas Bambos. In\ufb01nite time horizon maximum causal entropy inverse reinforcement\nlearning. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pages 4911\u20134916.\nIEEE, 2014.\n\n[7] Matthew Botvinick and James An. Goal-directed decision making in prefrontal cortex: a computational\n\nframework. In Advances in neural information processing systems, pages 169\u2013176, 2009.\n\n[8] Alexander Broad, TD Murphey, and Brenna Argall. Learning models for shared control of human-machine\n\nsystems with unknown dynamics. Robotics: Science and Systems Proceedings, 2017.\n\n[9] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and\n\nWojciech Zaremba. Openai gym, 2016.\n\n[10] Giuseppe Cala\ufb01ore and Fabrizio Dabbene. Probabilistic and randomized methods for design under\n\nuncertainty. Springer, 2006.\n\n[11] Alfonso Caramazza, Michael McCloskey, and Bert Green. Naive beliefs in \u201csophisticated\u201d subjects:\n\nMisconceptions about trajectories of objects. Cognition, 9(2):117\u2013123, 1981.\n\n[12] Jose M Carmena. Advances in neuroprosthetic learning and control. PLoS biology, 11(5):e1001561, 2013.\n[13] Mark Cutler and Jonathan P How. Ef\ufb01cient reinforcement learning for robots using informative simulated\npriors. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 2605\u20132612.\nIEEE, 2015.\n\n[14] Michel Desmurget and Scott Grafton. 
Forward modeling allows feedback control for fast reaching movements. Trends in cognitive sciences, 4(11):423–431, 2000.

[15] Owain Evans and Noah D Goodman. Learning the preferences of bounded agents. In NIPS Workshop on Bounded Optimality, volume 6, 2015.

[16] Owain Evans, Andreas Stuhlmüller, and Noah D Goodman. Learning the preferences of ignorant, inconsistent agents. In AAAI, pages 323–329, 2016.

[17] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.

[18] Jerry A Fodor. A theory of the child's theory of mind. Cognition, 1992.

[19] David Fridovich-Keil, Sylvia L Herbert, Jaime F Fisac, Sampada Deglurkar, and Claire J Tomlin. Planning, fast and slow: A framework for adaptive real-time safe trajectory planning. arXiv preprint arXiv:1710.04731, 2017.

[20] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.

[21] Tobias Gerstenberg and Joshua B Tenenbaum. Intuitive theories. Oxford handbook of causal reasoning, pages 515–548, 2017.

[22] Matthew Golub, Steven Chase, and M Yu Byron. Learning an internal dynamics model from control demonstration. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 606–614, 2013.

[23] Alison Gopnik and Henry M Wellman. The theory theory. Mapping the mind: Domain specificity in cognition and culture, page 257, 1994.

[24] Thomas L Griffiths, Falk Lieder, and Noah D Goodman. Rational use of cognitive resources: Levels of analysis between the computational and the algorithmic. Topics in cognitive science, 7(2):217–229, 2015.

[25] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies.
arXiv preprint arXiv:1702.08165, 2017.

[26] Jessica Hamrick, Peter Battaglia, and Joshua B Tenenbaum. Internal physics models guide probabilistic judgments about object dynamics. In Proceedings of the 33rd annual conference of the cognitive science society, pages 1545–1550. Cognitive Science Society Austin, TX, 2011.

[27] Sylvia L Herbert, Mo Chen, SooJean Han, Somil Bansal, Jaime F Fisac, and Claire J Tomlin. Fastrack: a modular framework for fast and guaranteed safe motion planning. arXiv preprint arXiv:1703.07373, 2017.

[28] Michael Herman, Tobias Gindele, Jörg Wagner, Felix Schmitt, and Wolfram Burgard. Inverse reinforcement learning with simultaneous estimation of rewards and dynamics. In Artificial Intelligence and Statistics, pages 102–110, 2016.

[29] Shervin Javdani, Siddhartha S Srinivasa, and J Andrew Bagnell. Shared autonomy via hindsight optimization. arXiv preprint arXiv:1503.07619, 2015.

[30] Mitsuo Kawato. Internal models for motor control and trajectory planning. Current opinion in neurobiology, 9(6):718–727, 1999.

[31] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[32] Jon Kleinberg and Sigal Oren. Time-inconsistent planning: a computational problem in behavioral economics. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 547–564. ACM, 2014.

[33] Anirudha Majumdar, Sumeet Singh, Ajay Mandlekar, and Marco Pavone. Risk-sensitive inverse reinforcement learning via coherent risk models. In Robotics: Science and Systems, 2017.

[34] Biren Mehta and Stefan Schaal. Forward models in visuomotor control. Journal of Neurophysiology, 88(2):942–953, 2002.

[35] Katharina Muelling, Arun Venkatraman, Jean-Sebastien Valois, John E Downey, Jeffrey Weiss, Shervin Javdani, Martial Hebert, Andrew B Schwartz, Jennifer L Collinger, and J Andrew Bagnell.
Autonomy\ninfused teleoperation with application to brain computer interface controlled manipulation. Autonomous\nRobots, pages 1\u201322, 2017.\n\n[36] Gergely Neu and Csaba Szepesv\u00e1ri. Apprenticeship learning using inverse reinforcement learning and\n\ngradient methods. arXiv preprint arXiv:1206.5264, 2012.\n\n[37] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Icml, pages\n\n663\u2013670, 2000.\n\n[38] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided\n\ndeep reinforcement learning of physics-based character skills. arXiv preprint arXiv:1804.02717, 2018.\n\n[39] David Premack and Guy Woodruff. Does the chimpanzee have a theory of mind? Behavioral and brain\n\nsciences, 1(4):515\u2013526, 1978.\n\n[40] Dennis R Prof\ufb01tt and David L Gilden. Understanding natural dynamics. Journal of Experimental\n\nPsychology: Human Perception and Performance, 15(2):384, 1989.\n\n[41] Anna N Rafferty and Thomas L Grif\ufb01ths. Diagnosing algebra understanding via inverse planning.\n[42] Anna N Rafferty, Rachel Jansen, and Thomas L Grif\ufb01ths. Using inverse planning for personalized feedback.\n\nIn EDM, pages 472\u2013477, 2016.\n\n[43] Anna N Rafferty, Michelle M LaMar, and Thomas L Grif\ufb01ths. Inferring learners\u2019 knowledge from their\n\nactions. Cognitive Science, 39(3):584\u2013618, 2015.\n\n[44] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. Urbana, 51(61801):1\u20134,\n\n2007.\n\n[45] Siddharth Reddy, Sergey Levine, and Anca Dragan. Shared autonomy via deep reinforcement learning.\n\narXiv preprint arXiv:1802.01744, 2018.\n\n[46] Wilko Schwarting, Javier Alonso-Mora, Liam Pauli, Sertac Karaman, and Daniela Rus. Parallel autonomy\nin automated vehicles: Safe motion generation with minimal intervention. In Robotics and Automation\n(ICRA), 2017 IEEE International Conference on, pages 1928\u20131935. 
IEEE, 2017.

[47] Krishna V Shenoy and Jose M Carmena. Combining decoder design and neural adaptation in brain-machine interfaces. Neuron, 84(4):665–680, 2014.

[48] Herbert A Simon. Bounded rationality and organizational learning. Organization science, 2(1):125–134, 1991.

[49] Emanuel Todorov. Optimality principles in sensorimotor control. Nature neuroscience, 7(9):907, 2004.

[50] Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131, 1974.

[51] Eiji Uchibe. Model-free deep inverse reinforcement learning by logistic regression. Neural Processing Letters, pages 1–15, 2017.

[52] Friedrich Wilkening and Trix Cacchione. Children's intuitive physics. The Wiley-Blackwell Handbook of Childhood Cognitive Development, Second edition, pages 473–496, 2010.

[53] Daniel M Wolpert, Zoubin Ghahramani, and Michael I Jordan. An internal model for sensorimotor integration. Science, 269(5232):1880–1882, 1995.

[54] Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.

[55] Brian D Ziebart, J Andrew Bagnell, and Anind K Dey. Modeling interaction via the principle of maximum causal entropy. In Proceedings of the 27th International Conference on Machine Learning, pages 1255–1262. Omnipress, 2010.

[56] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.

[57] Brian D Ziebart, Nathan Ratliff, Garratt Gallagher, Christoph Mertz, Kevin Peterson, J Andrew Bagnell, Martial Hebert, Anind K Dey, and Siddhartha Srinivasa. Planning-based prediction for pedestrians. In Intelligent Robots and Systems, 2009. IROS 2009.
IEEE/RSJ International Conference on, pages 3931–3936. IEEE, 2009.