Supplemental material for submission 12184 submitted to the Thirty-fourth Conference on Neural Information Processing Systems
Simon Stepputtis1, Joseph Campbell1, Mariano Phielipp2, Stefan Lee3, Chitta Baral1, Heni Ben Amor1
Imitation learning is a popular approach for teaching motor skills to robots. However, most approaches focus on extracting policy parameters from execution traces alone, i.e., motion trajectories and perceptual data. No adequate communication channel between the human expert and the robot exists in order to describe critical aspects of the task, such as the properties of the target object or the intended shape of the motion. Motivated by insights into the human teaching process, we introduce a method for incorporating unstructured natural language into imitation learning. At training time, the expert can provide demonstrations along with verbal descriptions in order to describe the underlying intent, e.g., ‘‘Go to the large green bowl’’. The training process, then, interrelates the different modalities to encode the correlations between language, perception, and motion. The resulting language-conditioned visuomotor policies can be conditioned at run time on new human commands and instructions, which allows for more fine-grained control over the trained policies while also reducing situational ambiguity. We demonstrate in a set of simulation experiments how the introduced approach can learn language-conditioned manipulation policies for a seven degree-of-freedom robot arm and compare the results to a variety of alternative methods.
We consider the problem of learning a policy $\mathbf{\pi}$ from a given set of demonstrations $\mathcal{D}=\{\mathbf{d}^0,\ldots,\mathbf{d}^m\}$, where each demonstration contains the desired trajectory given by robot states $\mathbf{R} \in \mathbb{R}^{T \times N}$ over $T$ time steps with $N$ control variables. (Subsequently, we assume, without loss of generality, a seven degree-of-freedom (DOF) robot, i.e., $N=7$.) We also assume that each demonstration contains perceptual data $\mathbf{I}$ of the agent's surroundings and a task description $v$ in natural language.
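For concreteness, the following is a minimal sketch of this data layout in Python. It assumes numpy arrays; the container and field names (`Demonstration`, `states`, `image`, `instruction`) are illustrative and not taken from the paper's implementation.

```python
# Hypothetical layout of one demonstration d in the dataset D.
from dataclasses import dataclass
import numpy as np

@dataclass
class Demonstration:
    states: np.ndarray      # R: (T, N) trajectory of robot states, N = 7 DOF
    image: np.ndarray       # I: (H, W, 3) RGB view of the environment
    instruction: str        # v: natural-language task description,
                            #    e.g. "Go to the large green bowl"

# D = [d^0, ..., d^m]: the full demonstration set
dataset: list[Demonstration] = []
```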
Given these data sources, our overall objective is to learn a policy $\mathbf{\pi}(v,\mathbf{I})$ which imitates the demonstrated behavior in $\mathcal{D}$, while considering the semantics of the natural language instructions and critical visual features of each demonstration.
After training, we provide the policy with a new state of the agent's environment, given as an image $\mathbf{I}$, and a new task description (instruction) $v$. In turn, the policy generates the control signals needed to achieve the objective described in the task description.
We do not assume any manual separation or segmentation into different tasks or behaviors. Instead, the model is expected to learn such distinctions independently from the provided natural language descriptions.
Fig. 1 shows an overview of our proposed method. At a high level, our model takes an image $\mathbf{I}$ and a task description $v$ as input to create a task embedding $\mathbf{e}$ in the Semantic model. Subsequently, this embedding is used in the Control model to generate robot actions at each time step in a closed-loop fashion.
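The PyTorch sketch below illustrates this two-stage structure only in outline: the module internals, layer sizes, and the bag-of-words instruction encoding are placeholder assumptions and do not reflect the actual Semantic and Control models described in the paper.

```python
# Hedged sketch of the two-stage architecture: Semantic model -> embedding e,
# Control model -> closed-loop actions. All internals are placeholders.
import torch
import torch.nn as nn

class SemanticModel(nn.Module):
    """Fuses image I and task description v into a task embedding e."""
    def __init__(self, vocab_size=1000, embed_dim=64):
        super().__init__()
        self.text = nn.EmbeddingBag(vocab_size, embed_dim)   # encodes v (mean pooling)
        self.vision = nn.Sequential(                         # encodes I
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, image, tokens):
        h = torch.cat([self.vision(image), self.text(tokens)], dim=-1)
        return self.fuse(h)                                  # task embedding e

class ControlModel(nn.Module):
    """Maps (task embedding e, current robot state) to the next action."""
    def __init__(self, embed_dim=64, n_dof=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + n_dof, 128), nn.ReLU(),
            nn.Linear(128, n_dof),
        )

    def forward(self, e, state):
        return self.mlp(torch.cat([e, state], dim=-1))

# Closed-loop rollout: e is computed once per task; actions are generated
# at every time step from the current robot state.
semantic, control = SemanticModel(), ControlModel()
image = torch.zeros(1, 3, 64, 64)           # placeholder image I
tokens = torch.randint(0, 1000, (1, 6))     # placeholder tokenized instruction v
state = torch.zeros(1, 7)                   # current joint state (7 DOF)
with torch.no_grad():
    e = semantic(image, tokens)
    for _ in range(10):                     # control loop
        action = control(e, state)
        state = state + 0.01 * action       # stand-in for a robot/simulator step
```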
The supplemental material is organized into the following chapters: