{"title": "Third-Person Visual Imitation Learning via Decoupled Hierarchical Controller", "book": "Advances in Neural Information Processing Systems", "page_first": 2597, "page_last": 2607, "abstract": "We study a generalized setup for learning from demonstration to build an agent that can manipulate novel objects in unseen scenarios by looking at only a single video of human demonstration from a third-person perspective. To accomplish this goal, our agent should not only learn to understand the intent of the demonstrated third-person video in its context but also perform the intended task in its environment configuration. Our central insight is to enforce this structure explicitly during learning by decoupling what to achieve (intended task) from how to perform it (controller). We propose a hierarchical setup where a high-level module learns to generate a series of first-person sub-goals conditioned on the third-person video demonstration, and a low-level controller predicts the actions to achieve those sub-goals. Our agent acts from raw image observations without any access to the full state information. We show results on a real robotic platform using Baxter for the manipulation tasks of pouring and placing objects in a box. Project video is available at https://pathak22.github.io/hierarchical-imitation/", "full_text": "Third-Person Visual Imitation Learning via\n\nDecoupled Hierarchical Controller\n\nPratyusha Sharma\n\nMIT\n\nDeepak Pathak\n\nFacebook AI Research\n\nAbhinav Gupta\n\nCMU\n\nAbstract\n\nWe study a generalized setup for learning from demonstration to build an agent that\ncan manipulate novel objects in unseen scenarios by looking at only a single video\nof human demonstration from a third-person perspective. To accomplish this goal,\nour agent should not only learn to understand the intent of the demonstrated third-\nperson video in its context but also perform the intended task in its environment\ncon\ufb01guration. 
Our central insight is to enforce this structure explicitly during\nlearning by decoupling what to achieve (intended task) from how to perform it\n(controller). We propose a hierarchical setup where a high-level module learns to\ngenerate a series of first-person sub-goals conditioned on the third-person video\ndemonstration, and a low-level controller predicts the actions to achieve those\nsub-goals. Our agent acts from raw image observations without any access to the\nfull state information. We show results on a real robotic platform using Baxter for\nthe manipulation tasks of pouring and placing objects in a box. Project video is\navailable at https://pathak22.github.io/hierarchical-imitation/.\n\n1 Introduction\n\nHumans have an extraordinary ability to perform complex operations by watching others. How do we\nachieve this? Imitation requires inferring the goal/intention of the other person one is trying to imitate,\ntranslating these goals into one\u2019s own context, mapping the third-person\u2019s actions to first-person\nactions, and then finally using these translated goals and mapped actions to perform low-level control.\nFor example, as shown in Figure 1, imitating the pouring task not only involves understanding how\nto change object states (tilting one glass over another), but also imagining how to adapt goals to\nnovel objects in the scene, followed by low-level control to accomplish the task.\nAs one can imagine, simultaneously learning these functions is extremely difficult. Therefore, most\nof the classical work in robotics has focused on a much-restricted version of the problem. One of the\nmost common setups is learning from demonstration (LfD) [2, 3, 14, 18, 22, 29], where demonstrations\nare collected either by manually actuating the robot, i.e., kinesthetic demonstrations, or by controlling\nit via teleoperation. 
LfD involves learning a policy from such demonstrations with the hope that it\nwould generalize to new locations/poses of the objects in unseen scenarios. Some recent works explore\na relatively general version where a robot learns to imitate a video of the demonstration collected\nfrom either the robot\u2019s viewpoint [17] or an only slightly different expert viewpoint [28].\nIn this paper, we tackle the generalized setting of learning from third-person demonstrations. Our\nagent first observes a video of a human demonstrating the task in front of it, and then it performs that\ntask by itself. We do not assume any access to the state-space information of the environment and\nlearn directly from raw camera images. To be successful, the robot needs to translate the observed\ngoal states to its own context (imagine the goals in its viewpoint) as well as map the third-person\nactions to its trajectory. One way to solve this would be to use classical vision methods that estimate\nthe location/pose of objects as well as the human expert and then map the keypoints to robot actions.\nHowever, hard-coding the correspondence from human keypoints to robot morphology is often non-trivial,\nand this overall multi-stage approach is difficult to generalize to unseen object/task categories.\nAnother way is to leverage modern deep learning algorithms to learn an end-to-end function that\ngoes from video frames of human demonstration to the series of joint angles required to\nperform the task. This function can be trained in a supervised manner with ground truth kinesthetic\ndemonstrations. Unfortunately, today\u2019s deep learning vision algorithms require millions of\nimages for training. While recent approaches [28] attempt to handle this challenge via meta-learning,\nthe models for each of the tasks are separately trained and difficult to generalize to new tasks.\n\nWork done at CMU and UC Berkeley. Correspondence to pratyuss@csail.mit.edu\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: We study the general setup of learning from demonstration with the goal of building an agent that is\ncapable of imitating a single video of human demonstration to perform the task with novel objects and tasks.\nThe figure shows an example of a third-person video demonstration on top and the robotic agent trying to imitate\nthe setup with objects in front. As shown on the right, our approach is to decouple the learning process into\na hierarchy of a what module (high-level) that translates the third-person video to first-person sub-goals and a how\nmodule (low-level) that achieves those sub-goals.\n\nWe propose an alternative approach by injecting hierarchical structure into the learning process,\nbetween inferring the high-level intention of the demonstrator and learning the low-level controller\nto perform the desired task. We decouple the end-to-end pipeline into two modules. First, a high-level\nmodule generates a goal conditioned on the human demonstration video (third-person view) and\nthe robot\u2019s current observation (first-person view). It predicts a visual sub-goal in the first-person\nview that roughly corresponds to an intermediate way-point in achieving the intended task described\nin the demonstration video. 
Generating a visual sub-goal is a difficult learning problem and, hence,\nwe employ a conditional variant of Generative Adversarial Networks (GANs) [9] to generate realistic\nrenderings [9, 11, 12, 16]. Second, a low-level controller module outputs a sequence of actions to\nachieve this visual sub-goal from its current observation. Both modules are trained in a supervised\nmanner using human videos and robot joint-angle trajectories, which are paired (with respect to\nobjects and tasks) but unaligned (with respect to time sequence). Our overall approach is summarized\nin Figure 2. The key advantage of this modular separation into a task-specific goal-generator and a\ntask-independent low-level controller is efficiency: the data-hungry low-level controller is shared\nacross all tasks, which makes it (a) sample-efficient in terms of data required per task and\n(b) robust and less prone to overfitting.\nWe show experiments on a real robotic platform using Baxter across two scenarios: pouring and\nplacing objects in a box. We first systematically evaluate the quality of both the high-level and\nlow-level modules individually given perfect information on held-out test examples of human video\nand robot trajectories. We then ablate the generalization properties of these modules across the same\ntask with different scenarios and different tasks with different scenarios. Finally, we deploy the\ncomplete system on the Baxter robot for performing tasks with novel objects and demonstrations.\n\n2 Problem Setup: Third Person Visual Imitation\n\nConsider a robotic agent with observation state st at time t. An action\nof the robot is a vector of joint angles, referred to as rt. Let IH be the sequence of images ht\n(i.e., video) of a human demonstrating the task as observed by the robot in third-person view, i.e.,\nIH \u2261 (h0, h1, . . . , hT ). 
Our goal is to train the agent such that, at inference, it can follow a\nnovel human demonstration video IH starting from its initial state s0 by predicting a sequence of\njoint angle configurations IR \u2261 (r0, r1, . . . , rT ).\n\nFigure 2: Decoupled Hierarchical Control for Third-Person Visual Imitation Learning: We introduce a\nhierarchical approach consisting of a goal generator that predicts a goal visual state which is then used by the\nlow-level controller as guidance to achieve a task. [Left] During training, the decoupled models are trained\nindependently. The goal generator takes as input the human video frames ht and ht+k along with the observed\nrobot state st to predict the visual goal state of the robot at t + k. The low-level controller is trained using\n(st, at, st+1) triplets. [Right] At inference, the models are executed one after the other in a loop. After reaching\nthe current goal, the goal generator uses the new observed state st+1 and the next images of the human video to\ngenerate a new goal for the low-level controller to attain.\n\nOur goal is to learn an agent that can imitate the action performed by the human expert in the third-person\nvideo. We want to imitate only from raw pixels without access to full-state information about\nthe environment. At training, we have access to a video of the human expert demonstration for an object\nmanipulation task IH \u2261 (h0, h1, . . . , hT ), the same demonstration performed kinesthetically\non the robot and recorded as joint angle states IR \u2261 (r0, r1, . . . , rT ), and the time series of the robot\u2019s\nfirst-person image observations \u03c4R \u2261 (s0, s1, . . . , sT ). 
We leverage a recently released dataset of human\ndemonstration videos and robot trajectories [25] where the demonstrations and trajectories are paired,\nbut not exactly aligned in time. We sub-sample the robot and human demonstration sequences, which\nhelps them roughly get aligned. In our setup, we have access to all the three time-series data at the\ntraining time, but only the time series data corresponding to the human demonstration image sequence\nat the test time. The other two time series would be predicted or generated by our algorithm.\n\n3 Hierarchical Controllers for Imitation\n\nAn end-to-end model that goes from human demonstration video and robot\u2019s current observation to\ndirectly predict the robot trajectories would require a lot of human demonstrations. Instead, we inject\nthe structure into the learning process by decoupling the imitation signal into what needs to be done\nfrom how it needs to be done. Decoupling makes our approach modular and more sample ef\ufb01cient\nthan end-to-end learning. It also enables the system to be more interpretable, as the goal inference is\nnow disentangled from the control task allowing us to visualize the intermediate sub-goals.\nOur approach consists of a two-level hierarchical modular approach. The high-level module is a goal\ngenerator that infers the goal in the pixel space from a human video demonstration and translates it\ninto what it means in the context of the robot\u2019s environment in the form of a pixel level representation.\nThe second step is an inverse controller, which follows up on the generated cues from the visual\ngoal inference model and generates an action for the robot to execute. 
These models are trained\nindependently, and at test time, they are executed alternately for the robot to accomplish the\nmulti-step manipulation task, as illustrated in Figure 2.\n\nFigure 3: (a) The Goal Generator: The high-level goal generator network \u03c0H (.) takes as input the frames of\nthe human demonstration video ht, ht+k and the current observed state of the robot st at time t. It is trained to\ngenerate the visual representation st+k of the robot at time t + k. Instead of tackling the complex goal image generation\nproblem, our setup reduces it to a simpler re-rendering problem, i.e., moving the pixels of the robot image in\na manner similar to the change in the human demonstration images. (b) Low-level Controller: The inputs to the low-level\ncontroller are the observed state of the robot st and goal state of the robot st+1. The model is trained to output\nthe action (at) that will cause it to transition to the goal state from st.\n\n3.1 High-Level Module: Goal Generator\n\nThe role of the high-level module is to translate the human demonstration images into sub-goal\nimages that are understandable to the robot. This high-level goal-generator could be learned\nby leveraging the paired examples of human demonstration video and the robot demonstration video\nfrom our training data. The most straightforward formulation is to express the goal-generator as image\ntranslation, i.e., translating human demonstration image to robot demonstration. Image translation is\na well-studied problem in computer vision and approaches like Pix2Pix [11] and CycleGAN [31] could\nbe directly deployed as-is. However, human and robot demonstration\nimages differ starkly in terms of viewpoint (third-person vs. 
first-person) and appearance (human arm vs. robotic\narm), which makes these models much harder to train and difficult to generalize, as shown in Section 6.\nWe propose to handle this issue by translating the change in the human demonstration image instead of\nthe image itself. In particular, we task the goal-generator to translate the current robot observation\nimage in the same manner as the corresponding human demonstration image is translated into the\nnext image in sequence. This forces the goal-generator to focus on how the pixels should move\n(re-rendering) instead of figuring out the much harder task of generating the entire pixel distribution in\nthe first place (generation). An illustration is shown in Figure 3. Further, in order to generate realistic-looking\nsub-goals, we represent the goal-generator via a conditioned version of generative adversarial\nnetworks with a U-Net [20] style architecture [9, 11, 12, 16].\nAt any particular instant t, the input to the goal generator model \u03c0H (.) is the visual state of the robot\nst as well as the visual states of the human demonstration ht and ht+n. This model is trained to\ngenerate the visual state of the robot at the (t + n)th step which can be represented as st+n. The\noverall optimization is as follows:\n\n$$\\min_{\\pi_H} \\max_{D} \\; \\mathbb{E}_{s \\in S}[\\log(D(s))] + \\mathbb{E}[\\log(1 - D(\\pi_H(h_t, h_{t+n}, s_t)))] + \\lambda \\lVert \\pi_H(h_t, h_{t+n}, s_t) - s_{t+n} \\rVert_1$$\n\nwhere D refers to the GAN discriminator classification network, state s is sampled from the set S of\nreal robot observations from the training data, and the triplets {ht, ht+n, st} are randomly sampled\nfrom the time series data of human demonstration and corresponding robot observations. In practice,\nwe resort to using a wider context around the human demonstration images, for instance, more frames\nsurrounding ht and ht+n, especially when the human and robot demonstrations are not aligned. 
The\nL1-loss ensures that the correct frame is generated while the adversarial discriminator loss ensures\nthe generated samples are realistic [16].\n\n3.2 Low-Level Module: Inverse Controller\n\nThe main purpose of the low-level inverse controller is to achieve the goals set by the goal generator.\nThe low-level inverse controller, \u03c0L(.), takes as input the present visual state of the robot demonstration\n(st) along with the predicted visual state of the robot demonstration for the next time step\n(\u02c6st+n = \u03c0H (ht, ht+n, st)) to predict the action that the robot should take to make the transition to its\nnext state (\u02c6st+n). Since the task we test on may be performed by the left or the right hand of the\nrobot depending on the human demonstration, we concatenate the seven joint angle states of the left\nas well as the right hand of the Baxter robot. In our case, the predicted action is a 14-dimensional tuple\nof the joint angles of the robot\u2019s arms. The inverse model uses spatial information from the images\nof the present visual state of the robot and the generated goal visual state to predict the action. The\nnetwork used is inspired by the ResNet-18 model [10] and is initialized with the weights obtained\nfrom pretraining the network on ImageNet. An illustration of our controller is shown in Figure 3.\nNote that an exciting aspect of decoupling goals from the controller is that the controller need not be\nspecific to a particular task. We can share the inverse controller across different types of tasks\nsuch as pouring, picking, and sliding. 
Further, another advantage of decoupling goal inference from the\ninverse model is the ability to utilize additional self-supervised data (rt, rt+1, st+1 tuples), so that training\ndoes not have to rely only on perfectly curated demonstrations. We leave the self-supervised\ntraining for future work.\n\n3.3 Inference: Third-person Imitation\n\nAt inference, we run our high-level goal-generator and low-level inverse model in an alternating\nmanner. Given the robot\u2019s current observation st and the human demonstration sequence IH, the\ngoal-generator \u03c0H (.) first generates a sub-goal \u02c6st+n. The low-level controller \u03c0L(.) then outputs the\nseries of robot joint angles to reach the state \u02c6st+n. This process is continued until the final image of\nthe human demonstration.\n\n4 Implementation Details and Baselines\n\nTraining Dataset We use the MIME dataset [25] of human demonstrations to train our decoupled\nhierarchical controllers. The dataset is collected using a Baxter robot and contains 8260 pairs of\nhuman and kinesthetic robot demonstrations spanning 20 tasks. For the pouring task, we train\non 230 demonstrations, validate on 29, and test on 30 demonstrations. For the models trained on\nmultiple tasks, 6632 demonstrations were used for training, 829 for validation, and 829 for test. In\nparticular, each example contains a triplet of the human demonstration image sequence, the robot\u2019s joint angle\nstates, and the robot demonstration images, i.e., {(h0, h1, . . . , hT ), (\u02c6r0, \u02c6r1, . . . , \u02c6rT ), (s0, s1, . . . , sT )}. We\nsub-sampled the trajectories (both images and joint angle states) to a fixed length of 200 time steps\nfor training our models. To train the low-level inverse model, we regress onto the robot\u2019s action\nat, which is a fourteen-dimensional joint angle state [\u03b81, \u03b82, \u03b83, . . . , \u03b814]. 
All the training and\nimplementation details related to our hierarchical controllers are provided in the supplementary.\n\nBaseline Comparisons We first perform ablations of our modules and compare them to different\npossible architectures, including CycleGAN [31], and L1, L2 loss based prediction models. We then\ncompare our joint approach to two different baselines: (a) End-to-end Baseline [25]: In this approach,\nboth inference and control are handled by a single network. The inputs to the network are\nconsecutive frames of the human demonstration around a time step t, along with the image of the\nrobot demonstration at the time step t. The network predicts the action that the robot must then take at\ntime step t to transition to its state at time step t+1. (b) DAML [28]: The second baseline we compare\nagainst is Domain Adaptive Meta-Learning (DAML [28]). The algorithm\naims to recover the best network parameters for a task via a single gradient update at test time\nusing meta-learning.\n\n5 Results: Generalization of Individual Hierarchical Modules\n\nThe modules of the hierarchy run alternately at test time, and hence, each model relies on the other\u2019s\nperformance at the previous step. Therefore, in this section, we evaluate the generalization abilities\nof each individual module of the hierarchy while assuming ground truth access to the other. We\nevaluate the top-level goal generator assuming the inverse model is perfect and evaluate the inverse\nmodel assuming access to a perfect goal-generator. We study generalization across three different\nscenarios: new locations, new objects, and new tasks.\n\n5.1 Generalization to new positions of the same object\n\nGoal Generator: The ability to condition inferred goals in the robot\u2019s own setting is a crucial aspect\nof our approach. 
The sensitivity analysis of the goal generator with respect to the position of the\nobjects can help us understand how well the goal generator generalizes in terms of object positions.\n\nFigure 4: (a) Goal Generator Comparison: The predictions of the outputs generated by the goal generator\nwhen optimized using different methods. Our model, which is trained to translate the robot\u2019s current image\ninstead of generating from scratch, generates the sharpest and most accurate results. (b) Sensitivity Analysis of\nthe Goal Generator: Given the input human demonstration of a task, we test the sensitivity of the goal generator\nwith respect to object locations. Our model can hallucinate accurate sub-goals in accordance with the object location. (c)\nGoal Generator Predictions: The images in the first row are the input observed robot states. The second row\ncontains goals generated by the goal generator from the input images. The predictions are at an interval of ten\nsteps (approx. 2 sec) ahead into the future. As shown, predicted sub-goals are consistent across the trajectory.\n\nIn Figure 4 (b), we show a scenario where the input of the human demonstration is fixed, but the\npositions of the objects are varied at test time. The predictions of the goal generator reveal that it\nresponds to changes in object positioning. A quantitative analysis of this positional\ngeneralization is performed jointly with the evaluation of generalization ability to new objects in\nTable 1.\nInverse model: To check the ability of the inverse model to generalize to new positions of the object\n(given a perfect goal-generator), we test the inverse model using ground truth images of the test set. 
This\nquantitative evaluation is performed jointly with the evaluation of generalization to novel objects in\nTable 2 and discussed in the next sub-section.\n\n5.2 Generalization to new objects\n\nWe now evaluate the ability of our models to generalize manipulation skills to unseen objects.\n\nTable 1: Goal-Generator generalization to novel objects and locations. Our goal generator outperforms\nthe other approaches, both qualitatively and quantitatively, across different loss metrics. The models are\nevaluated on the pouring test set.\n\nMethod             L2      L1       PSNR   SSIM\nL1 only            60.24   72.57    2.92   0.15\nL2 only            76.44   75.94    3.02   0.14\nCycle GAN [31]     99.15   118.67   2.37   0.11\nGoal-Gen (Ours)    39.98   52.37    3.95   0.18\n\nTable 2: Inverse model generalization to novel objects and locations. This table contains models trained on\nall tasks of the MIME dataset (all) and just the task of pouring (single). The models are evaluated on the\ncommon test set of pouring.\n\nMethod                      RMSE (mean)   RMSE (stderr)\nEnd to End (all) [25]       14.7          2.3\nEnd to End (single) [25]    8.9           1.7\nDAML (single) [28]          11.84         2.1\nOurs (all)                  14.4          2.2\nOurs (single)               8.1           1.6\n\nGoal Generator: Figure 4(a) shows the ability of the goal generator to generate meaningful sub-goals\ngiven a demonstration with novel objects. A quantitative evaluation is shown in Table 1 for\nthe goal generation ability when tested with novel objects in different configurations. Our approach\noutperforms the baselines on all four metrics and generalizes better to new objects both quantitatively\n(Table 1) and qualitatively (Figure 4(a)). 
In addition to the baselines shown in Table 1, we also tried\nan optical \ufb02ow baseline which did not perform well and was unable to account for in-plane rotations\nthat the task like pouring required. The performance is (L1: 127.28, SSIM:0.81) signi\ufb01cantly worse\nthan other methods.\nInverse model: A quantitative evaluation of generalization to new objects and locations is shown in\nTable 2. Our model outperforms all other baselines by a signi\ufb01cant margin. The generalization to\ndiverse positions of objects of the inverse model can be attributed to its training across many different\npositions of diverse objects.\nIn addition to the baselines in Table 2, we also compare against the two feature matching based\napproaches. First, we compute trajectory-based features of the frames of human demonstration and\nthen \ufb01nd the nearest neighbors from the other demonstrations in the training set. The joint angles\ncorresponding to the nearest demonstrations are then considered as the prediction. The trajectory-\nbased features were computed using state-of-the-art temporal deep visual features trained on video\naction datasets [4]. Using these features as keys to match the nearest neighbors resulted in a rMSE of\n22.20 with a stderr of 2.14. Secondly, we used a static feature-based model where we align human\ndemonstration frames with robot ones in SIFT feature space. This resulted in a rMSE value of 45.32\nwith a stderr of 6.12. Both the baselines perform signi\ufb01cantly worse than our results shown in Table 2.\nIn particular, SIFT features did not perform well in \ufb01nding correspondences between the human and\nrobot demonstrations because of the large domain gap.\n\n5.3 Generalization to new tasks\n\nSo far, we have tested generalization with respect to objects and their positions. We now evaluate the\nability of our approach to generalize across tasks.\nGoal Generator: The goal generator is not task-agnostic. 
We leave training a task-agnostic goal\ngenerator for future work. In principle, since both the goal generator and inverse model don\u2019t depend\non temporal information, it should potentially be possible to train a task-agnostic Goal Generator.\n\nTable 3: Generalization of the Inverse-Model to New Tasks. Our inverse model is trained on 15 tasks of the\nMIME dataset. It is evaluated on a held-out set from training tasks as well as 5 novel tasks where it significantly\noutperforms the baselines.\n\nMethod               Train (15 Tasks)      Test (5 Tasks)\n                     Mean     Stderr       Mean     Stderr\nEnd to End [25]      24.83    1.06         23.63    1.56\nDAML [28]            36.45    1.56         35.90    1.55\nInv. Model (Ours)    18.05    1.04         16.90    0.76\n\nInverse Model: The inverse model is not\ntrained to perform a particular task. No temporal\nknowledge of trajectories is used while\ntraining the module. This ensures that while\nthe model predicts every step of the trajectory\nit doesn\u2019t have any preconceived notion about\nwhat the entire trajectory will be. Hence, the role\nof the low-level controller (inverse model) is decoupled\nfrom the intent of the task (goal-generator),\nmaking it agnostic to the task. The ability of the\nmodel to generalize to new tasks is demonstrated\nin Table 3. We train on the first 15 tasks of the\nMIME dataset and test on held-out data from the\n15 training tasks as well as 5 novel tasks. Our model has a much lower error on both the trained tasks as well\nas the novel tasks than the baseline methods. We want to note that DAML [28] is a generic approach,\nnot mainly designed for task transfer in third person, and the results in the original paper have been\nshown in the context of single planar-manipulation tasks. It has not been shown to scale to training\non multiple task categories together. 
Hence, further changes might be required to scale DAML for\ntransfer across tasks.\n\n6 Results: Generalization and Evaluation of Joint Hierarchical Model\n\nThe final test of our approach is to evaluate how the decoupled models perform when run\ntogether. Robot demo videos are on the project website https://pathak22.github.io/hierarchical-imitation/.\nWe look at two tasks: pouring and placing in a box. In the task of pouring, the robot is required\nto start at a given location and then move to a goal location of the cup that needs to be poured into.\nThis task requires the model to correctly predict the different parts of the task, namely reaching\nthe goal cup and pouring into it. Since the controller of the robot is imperfect and the predictions\ncan be slightly noisy, we consider a reach to be successful if the robot reaches within 5cm of the\ncup. Similarly, we consider pouring to be successful if the robot reaches and does the pouring\naction within a 5cm radius of the cup. These evaluation metrics are similar to those used by Yu et al. [28].\nFor the task of placing in the box, we consider\nplacing to be successful if the robot is able\nto reach within 5cm of the box and is then able to\ndrop the object within 5cm of the box. Further,\nthe models are trained on the task of pouring\nalone and we evaluate how they generalize to\nthe task of placing.\nFor the high-level goal generator, it is crucial to\ngenerate good quality results over a long horizon\nto ensure the successful execution of the\ntask. Our approach of using a goal generator to\npredict high-level goals and an inverse model to\nfollow up on the generated goals in alternation\noutperforms the other approaches, as shown in Table 4. The test sets comprised demonstrations\nwith novel objects placed in random locations. 
The test required the individual models\nnot only to generalize well but also to work well in tandem, despite possibly imperfect predictions and\nactions from one another.\n\nTable 4: Joint evaluation of our hierarchical decoupled controllers. Our approach outperforms the other\nbaselines on the tasks of pouring and placing in a box by a significant margin; however, it is still far from\nperfect completion of the task.\n\nMethod               Pouring              Placing\n                     Reaches   Pours      Reaches   Drops\nEnd to End [25]      20%       10%        20%       8%\nDAML [28]            25%       10%        20%       15%\nHierarchy (Ours)     75%       60%        70%       50%\n\n7 Related Work\n\nInferring the intent of interaction from a human demonstration and successfully enabling a robot to\nreplicate the task in its own environment ties to several related areas discussed as follows.\nDomain Adaptation: Addressing the domain shift between the human demonstrator and robot\n(e.g., appearance, view-points) is one of the goals of our setup. There has been previous work on\ntransfer in visual space [11, 30] and on tackling domain shift from simulation environments to the\nreal world [5, 15]. Some of these approaches map data points from one domain to another [11, 30].\nOther approaches aid the transfer by finding domain-invariant representations [21, 26]. Along similar\nlines, Sermanet et al. [24] look at learning view-point invariant representations that are then used for\nthird-person imitation. Training such a system would require training data with videos collected from\nmultiple viewpoints. Moreover, learning task-invariant features alone might not be enough to aid the\ntransfer to the robot\u2019s setting because of the differences in the physical configurations. 
Our approach\nhandles these issues via modular controllers.\nLearning from Demonstrations (LfD): LfD generally uses demonstrations obtained from trajectories\ncollected by kinesthetic teaching, teleoperation, or using motion capture technology on the\nrobot arm [2, 3, 14, 18, 22, 29]. LfD has been successful in learning complex tasks from expert\nhuman trajectories, for instance, playing table-tennis [13], autonomous helicopter aerobatics, and\ndrone flying [1]. Most of these focus on learning a single task from a handful of expert demonstrations.\nOur goal is to start by using demonstration data collected across some objects and tasks\nbut enable the robot to imitate the task by just watching one video of a human demonstrating the task\nwith new objects.\nExplicitly Inferring Rewards: Other approaches explicitly infer the reward associated with performing\na task from the human demonstrations through techniques such as inverse reinforcement\nlearning [19, 23]. The rewards become representations of the sequence of goals of the task. After\nconstruction of the reward functions, the robot is trained using reinforcement learning by collecting\nsamples in its environment to maximize the reward. However, such systems end up needing significantly\nlarge amounts of real-world data and have to be re-trained for every new task from scratch,\nwhich makes them difficult to scale in the real world. In contrast, our supervised learning approach is\ntrained via maximum likelihood and is thus efficient enough to scale to real robots.\nVisual Foresight: Visual foresight has been popular for self-supervised robot manipulation [6\u20138, 27],\nbut it relies on task specification in the form of dots in the image space and on action-conditioned\npredictions in visual space. Our setting relies on no hand-specified goals. The goals in our setting are\nspecified from the human demonstration videos directly. 
This flexibility lets us specify harder tasks\nsuch as pouring, which would have been difficult to specify from dots on images alone.\n\n8 Discussions\n\nWe present decoupled hierarchical controllers for third-person imitation learning. Our approach is\ncapable of inferring the task from a single third-person human demonstration and executing it on a\nreal robot from a first-person perspective. Our approach works from raw pixel input and does not make\nany assumption about the problem setup. Our results demonstrate the advantage of using a decoupled\nmodel over an end-to-end approach and other baselines in terms of improved generalization to novel\nobjects in unseen configurations.\nFuture Directions: Our high-level and low-level modules currently operate at a per-time-step level\nand don\u2019t make use of temporal information, which results in the predicted trajectories being shaky.\nA naive inverse controller modeled via an LSTM could incorporate the temporal information, but it\neasily learns to cheat by memorizing the mean trajectory, making it hard to generalize to novel\ntasks. However, training on lots of tasks together could potentially alleviate this limitation. An\nadded advantage of the explicit decoupling of the models is the ability to utilize additional self-supervised\ndata to train the low-level controller and make it robust to failure and different types of\njoint configurations. We leave these directions for future work to explore.\n\nAcknowledgements\n\nWe would like to thank David Held, Aayush Bansal, members of the CMU visual learning lab and\nBerkeley AI Research lab for fruitful discussions. The work was carried out when PS was at CMU\nand DP was at UC Berkeley. This work was supported by ONR MURI N000141612007 and ONR\nYoung Investigator Award to AG. DP is supported by the Facebook graduate fellowship.\n\nReferences\n[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004. 
8\n\n[2] B. Akgun, M. Cakmak, J. W. Yoo, and A. L. Thomaz. Trajectories and keyframes for kinesthetic teaching: A human-robot interaction perspective. In HRI, March 2012. 1, 8\n\n[3] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. RAS, 2009. 1, 8\n\n[4] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017. 7\n\n[5] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In CoRL, 2017. 8\n\n[6] F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. CoRR, abs/1710.05268, 2017. 9\n\n[7] F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268, 2017.\n\n[8] C. Finn and S. Levine. Deep visual foresight for planning robot motion. CoRR, abs/1610.00696, 2016. 9\n\n[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014. 2, 4\n\n[10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630\u2013645. Springer, 2016. 5\n\n[11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017. 2, 4, 8\n\n[12] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014. 2, 4\n\n[13] K. M\u00fclling, J. Kober, O. Kroemer, and J. Peters. Learning to select and generalize striking movements in robot table tennis. Int. J. Rob. Res., 2013. 8\n\n[14] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In ICML, pages 663\u2013670, 2000. 1, 8\n\n[15] OpenAI. Learning dexterous in-hand manipulation. CoRR, abs/1808.00177, 2018. 8\n\n[16] D. Pathak, P. Krahenbuhl, J. Donahue, T. 
Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016. 2, 4\n\n[17] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell. Zero-shot visual imitation. In ICLR, 2018. 1\n\n[18] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In NIPS, 1989. 1, 8\n\n[19] N. Rhinehart and K. M. Kitani. First-person activity forecasting with online inverse reinforcement learning. In ICCV, Oct 2017. 9\n\n[20] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234\u2013241. Springer, 2015. 4\n\n[21] F. Sadeghi and S. Levine. CAD2RL: Real single-image flight without a single real image. CoRR, abs/1611.04201, 2016. 8\n\n[22] S. Schaal. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 1999. 1, 8\n\n[23] P. Sermanet, K. Xu, and S. Levine. Unsupervised perceptual rewards for imitation learning. In RSS, 2017. 9\n\n[24] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine. Time-contrastive networks: Self-supervised learning from video. In ICRA, 2018. 8\n\n[25] P. Sharma, L. Mohan, L. Pinto, and A. Gupta. Multiple interactions made easy (MIME): large scale demonstrations data for imitation. CoRL, 2018. 3, 5, 7, 8\n\n[26] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014. 8\n\n[27] M. Watter, J. T. Springenberg, J. Boedecker, and M. A. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. CoRR, abs/1506.07365, 2015. 9\n\n[28] T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. 
arXiv preprint arXiv:1802.01557, 2018. 1, 2, 5, 7, 8\n\n[29] T. Zhang, Z. McCarthy, O. Jow, D. Lee, K. Goldberg, and P. Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. CoRR, abs/1710.04615, 2017. 1, 8\n\n[30] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In CVPR, 2016. 8\n\n[31] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV, 2017. 4, 5, 7\n", "award": [], "sourceid": 1488, "authors": [{"given_name": "Pratyusha", "family_name": "Sharma", "institution": "Carnegie Mellon University/MIT"}, {"given_name": "Deepak", "family_name": "Pathak", "institution": "UC Berkeley, FAIR, CMU"}, {"given_name": "Abhinav", "family_name": "Gupta", "institution": "Facebook AI Research/CMU"}]}