{"title": "Robust Imitation of Diverse Behaviors", "book": "Advances in Neural Information Processing Systems", "page_first": 5320, "page_last": 5329, "abstract": "Deep generative models have recently shown great promise in imitation learning for motor control. Given enough data, even supervised approaches can do one-shot imitation learning; however, they are vulnerable to cascading failures when the agent trajectory diverges from the demonstrations. Compared to purely supervised methods, Generative Adversarial Imitation Learning (GAIL) can learn more robust controllers from fewer demonstrations, but is inherently mode-seeking and more difficult to train. In this paper, we show how to combine the favourable aspects of these two approaches. The base of our model is a new type of variational autoencoder on demonstration trajectories that learns semantic policy embeddings. We show that these embeddings can be learned on a 9 DoF Jaco robot arm in reaching tasks, and then smoothly interpolated with a resulting smooth interpolation of reaching behavior. Leveraging these policy representations, we develop a new version of GAIL that (1) is much more robust than the purely-supervised controller, especially with few demonstrations, and (2) avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not. We demonstrate our approach on learning diverse gaits from demonstration on a 2D biped and a 62 DoF 3D humanoid in the MuJoCo physics environment.", "full_text": "Robust Imitation of Diverse Behaviors\n\nZiyu Wang\u21e4, Josh Merel\u21e4, Scott Reed, Greg Wayne, Nando de Freitas, Nicolas Heess\n\nDeepMind\n\nziyu,jsmerel,reedscot,gregwayne,nandodefreitas,heess@google.com\n\nAbstract\n\nDeep generative models have recently shown great promise in imitation learning\nfor motor control. 
Given enough data, even supervised approaches can do one-shot imitation learning; however, they are vulnerable to cascading failures when the agent trajectory diverges from the demonstrations. Compared to purely supervised methods, Generative Adversarial Imitation Learning (GAIL) can learn more robust controllers from fewer demonstrations, but is inherently mode-seeking and more difficult to train. In this paper, we show how to combine the favourable aspects of these two approaches. The base of our model is a new type of variational autoencoder on demonstration trajectories that learns semantic policy embeddings. We show that these embeddings can be learned on a 9 DoF Jaco robot arm in reaching tasks, and then smoothly interpolated, yielding a correspondingly smooth interpolation of reaching behavior. Leveraging these policy representations, we develop a new version of GAIL that (1) is much more robust than the purely-supervised controller, especially with few demonstrations, and (2) avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not. We demonstrate our approach on learning diverse gaits from demonstration on a 2D biped and a 62 DoF 3D humanoid in the MuJoCo physics environment.\n\n1 Introduction\n\nBuilding versatile embodied agents, both in the form of real robots and animated avatars, capable of a wide and diverse set of behaviors is one of the long-standing challenges of AI. State-of-the-art robots cannot compete with the effortless variety and adaptive flexibility of motor behaviors produced by toddlers. Towards addressing this challenge, in this work we combine several deep generative approaches to imitation learning in a way that accentuates their individual strengths and addresses their limitations.
The end product of this is a robust neural network policy that can imitate a large and diverse set of behaviors using few training demonstrations.\n\nWe first introduce a variational autoencoder (VAE) [15, 26] for supervised imitation, consisting of a bi-directional LSTM [13, 32, 9] encoder mapping demonstration sequences to embedding vectors, and two decoders. The first decoder is a multi-layer perceptron (MLP) policy mapping a trajectory embedding and the current state to a continuous action vector. The second is a dynamics model mapping the embedding and previous state to the present state, while modelling correlations among states with a WaveNet [39]. Experiments with a 9 DoF Jaco robot arm and a 9 DoF 2D biped walker, implemented in the MuJoCo physics engine [38], show that the VAE learns a structured semantic embedding space, which allows for smooth policy interpolation.\n\nWhile supervised policies that condition on demonstrations (such as our VAE or the recent approach of Duan et al. [6]) are powerful models for one-shot imitation, they require large training datasets in order to work for non-trivial tasks. They also tend to be brittle and fail when the agent diverges too much from the demonstration trajectories. These limitations of supervised learning for imitation, also known as behavioral cloning (BC) [24], are well known [28, 29].\n\n*Joint first authors.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nRecently, Ho and Ermon [12] showed a way to overcome the brittleness of supervised imitation using another type of deep generative model called Generative Adversarial Networks (GANs) [8]. Their technique, called Generative Adversarial Imitation Learning (GAIL), uses reinforcement learning, allowing the agent to interact with the environment during training.
GAIL allows one to learn more robust policies with fewer demonstrations, but adversarial training introduces another difficulty called mode collapse [7]. This refers to the tendency of adversarial generative models to cover only a subset of the modes of a probability distribution, resulting in a failure to produce adequately diverse samples. This will cause the learned policy to capture only a subset of control behaviors (which can be viewed as modes of a distribution), rather than allocating capacity to cover all modes.\n\nRoughly speaking, VAEs can model diverse behaviors without dropping modes, but do not learn robust policies, while GANs give us robust policies but insufficiently diverse behaviors. In section 3, we show how to engineer an objective function that takes advantage of both GANs and VAEs to obtain robust policies capturing diverse behaviors. In section 4, we show that our combined approach enables us to learn diverse behaviors for a 9 DoF 2D biped and a 62 DoF humanoid, where the VAE policy alone is brittle and GAIL alone does not capture all of the diverse behaviors.\n\n2 Background and Related Work\n\nWe begin our brief review with generative models. One canonical way of training generative models is to maximize the likelihood of the data: $\max_\theta \sum_i \log p_\theta(x_i)$. This is equivalent to minimizing the Kullback-Leibler divergence between the distribution of the data and the model: $D_{KL}(p_{\text{data}}(\cdot) \,\|\, p_\theta(\cdot))$. For highly-expressive generative models, however, optimizing the log-likelihood is often intractable.\n\nOne class of highly-expressive yet tractable models are the auto-regressive models, which decompose the log-likelihood as $\log p(x) = \sum_i \log p_\theta(x_i \mid x
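The equivalence between maximum likelihood and KL minimization stated in the background section follows from a short, standard expansion (added here for completeness; it uses the same symbols as the text):

```latex
% Expand the KL divergence between the data distribution and the model:
D_{KL}\big(p_{\text{data}} \,\|\, p_\theta\big)
  = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\text{data}}(x)\right]
  - \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_\theta(x)\right].
```

The first expectation is the negative entropy of the data and does not depend on $\theta$, so minimizing $D_{KL}$ over $\theta$ is equivalent to maximizing $\mathbb{E}_{p_{\text{data}}}[\log p_\theta(x)]$, whose empirical estimate on a dataset $\{x_i\}$ is, up to a constant factor, the log-likelihood $\sum_i \log p_\theta(x_i)$.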