# Robust Imitation of Diverse Behaviors

Ziyu Wang, Josh Merel, Scott Reed, Greg Wayne, Nando de Freitas, Nicolas Heess
DeepMind
{ziyu,jsmerel,reedscot,gregwayne,nandodefreitas,heess}@google.com
Joint first authors.

Abstract

Deep generative models have recently shown great promise in imitation learning for motor control. Given enough data, even supervised approaches can do one-shot imitation learning; however, they are vulnerable to cascading failures when the agent trajectory diverges from the demonstrations. Compared to purely supervised methods, Generative Adversarial Imitation Learning (GAIL) can learn more robust controllers from fewer demonstrations, but is inherently mode-seeking and more difficult to train. In this paper, we show how to combine the favourable aspects of these two approaches. The base of our model is a new type of variational autoencoder on demonstration trajectories that learns semantic policy embeddings. We show that these embeddings can be learned on a 9 DoF Jaco robot arm in reaching tasks, and then smoothly interpolated, with a correspondingly smooth interpolation of the reaching behavior. Leveraging these policy representations, we develop a new version of GAIL that (1) is much more robust than the purely-supervised controller, especially with few demonstrations, and (2) avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not. We demonstrate our approach on learning diverse gaits from demonstration on a 2D biped and a 62 DoF 3D humanoid in the MuJoCo physics environment.

1 Introduction

Building versatile embodied agents, both in the form of real robots and animated avatars, capable of a wide and diverse set of behaviors is one of the long-standing challenges of AI. State-of-the-art robots cannot compete with the effortless variety and adaptive flexibility of motor behaviors produced by toddlers. Towards addressing this challenge, in this work we combine several deep generative approaches to imitation learning in a way that accentuates their individual strengths and addresses their limitations. The end product is a robust neural network policy that can imitate a large and diverse set of behaviors using few training demonstrations.

We first introduce a variational autoencoder (VAE) [15, 26] for supervised imitation, consisting of a bi-directional LSTM [13, 32, 9] encoder mapping demonstration sequences to embedding vectors, and two decoders. The first decoder is a multi-layer perceptron (MLP) policy mapping a trajectory embedding and the current state to a continuous action vector. The second is a dynamics model mapping the embedding and previous state to the present state, while modelling correlations among states with a WaveNet [39]. (A schematic code sketch of this architecture is given at the end of this section.) Experiments with a 9 DoF Jaco robot arm and a 9 DoF 2D biped walker, implemented in the MuJoCo physics engine [38], show that the VAE learns a structured semantic embedding space, which allows for smooth policy interpolation.

While supervised policies that condition on demonstrations (such as our VAE or the recent approach of Duan et al. [6]) are powerful models for one-shot imitation, they require large training datasets in order to work for non-trivial tasks. They also tend to be brittle and fail when the agent diverges too much from the demonstration trajectories. These limitations of supervised learning for imitation, also known as behavioral cloning (BC) [24], are well known [28, 29].
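The following is a minimal, illustrative sketch of the trajectory VAE described above; it is not the authors' implementation. It assumes PyTorch, placeholder sizes `state_dim`, `action_dim`, and `z_dim`, averages the encoder outputs over time to form the embedding, and, for brevity, replaces the autoregressive WaveNet state decoder with a plain MLP.

```python
# Hypothetical sketch of the trajectory VAE described above (not the paper's code).
# Assumes PyTorch; the WaveNet-style state decoder is replaced by a plain MLP here.
import torch
import torch.nn as nn

class TrajectoryVAE(nn.Module):
    def __init__(self, state_dim, action_dim, z_dim, hidden=128):
        super().__init__()
        # Bi-directional LSTM encoder: demonstration (state, action) sequence -> embedding.
        self.encoder = nn.LSTM(state_dim + action_dim, hidden,
                               batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(2 * hidden, z_dim)
        self.to_logvar = nn.Linear(2 * hidden, z_dim)
        # Decoder 1: MLP policy mapping (embedding, current state) -> action.
        self.policy = nn.Sequential(
            nn.Linear(z_dim + state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim))
        # Decoder 2: dynamics model mapping (embedding, previous state) -> next state.
        # (The paper uses an autoregressive WaveNet-style model; an MLP stands in here.)
        self.dynamics = nn.Sequential(
            nn.Linear(z_dim + state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim))

    def encode(self, states, actions):
        h, _ = self.encoder(torch.cat([states, actions], dim=-1))
        h = h.mean(dim=1)  # average the bi-LSTM outputs over time steps
        return self.to_mu(h), self.to_logvar(h)

    def forward(self, states, actions):
        mu, logvar = self.encode(states, actions)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation trick
        z_t = z.unsqueeze(1).expand(-1, states.size(1), -1)   # broadcast z to every time step
        pred_actions = self.policy(torch.cat([z_t, states], dim=-1))
        pred_next_states = self.dynamics(torch.cat([z_t, states], dim=-1))
        return pred_actions, pred_next_states, mu, logvar
```

Conditioning both decoders on the same embedding z is what allows behaviors to be interpolated smoothly in the learned latent space.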
Recently, Ho and Ermon [12] showed a way to overcome the brittleness of supervised imitation using another type of deep generative model called Generative Adversarial Networks (GANs) [8]. Their technique, called Generative Adversarial Imitation Learning (GAIL), uses reinforcement learning, allowing the agent to interact with the environment during training (a minimal code sketch of this adversarial reward signal is given below). GAIL allows one to learn more robust policies with fewer demonstrations, but adversarial training introduces another difficulty called mode collapse [7]. This refers to the tendency of adversarial generative models to cover only a subset of the modes of a probability distribution, resulting in a failure to produce adequately diverse samples. This causes the learned policy to capture only a subset of control behaviors (which can be viewed as modes of a distribution), rather than allocating capacity to cover all modes.

Roughly speaking, VAEs can model diverse behaviors without dropping modes, but do not learn robust policies, while GANs give us robust policies but insufficiently diverse behaviors. In section 3, we show how to engineer an objective function that takes advantage of both GANs and VAEs to obtain robust policies capturing diverse behaviors. In section 4, we show that our combined approach enables us to learn diverse behaviors for a 9 DoF 2D biped and a 62 DoF humanoid, where the VAE policy alone is brittle and GAIL alone does not capture all of the diverse behaviors.

2 Background and Related Work

We begin our brief review with generative models. One canonical way of training generative models is to maximize the likelihood of the data: $\max_\theta \sum_i \log p_\theta(x_i)$. This is equivalent to minimizing the Kullback-Leibler divergence between the distribution of the data and the model, $D_{\mathrm{KL}}(p_{\mathrm{data}}(\cdot) \,\|\, p_\theta(\cdot))$. For highly expressive generative models, however, optimizing the log-likelihood is often intractable. One class of highly expressive yet tractable models is the class of auto-regressive models, which decompose the log-likelihood as $\log p_\theta(x) = \sum_i \log p_\theta(x_i \mid x_{<i})$.
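The equivalence between maximum-likelihood training and KL minimization asserted above follows from a standard identity; the short derivation below is included only for completeness and is not part of the original text.

```latex
% Maximum likelihood as KL minimization (standard identity; not from the paper).
\begin{align*}
D_{\mathrm{KL}}\!\left(p_{\mathrm{data}} \,\middle\|\, p_\theta\right)
  &= \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log p_{\mathrm{data}}(x)\right]
   - \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log p_\theta(x)\right].
\end{align*}
```

The first term does not depend on $\theta$, so minimizing the KL divergence over $\theta$ is the same as maximizing $\mathbb{E}_{x \sim p_{\mathrm{data}}}[\log p_\theta(x)]$, whose empirical estimate over a dataset $\{x_i\}$ is proportional to $\sum_i \log p_\theta(x_i)$.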
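For the GAIL approach referenced in the introduction, the following is a hypothetical sketch of a discriminator-based reward used in place of a hand-designed task reward; it is not the implementation used in the paper. It assumes PyTorch, placeholder dimensions `state_dim` and `action_dim`, and one common sign convention; in practice the policy is updated with an on-policy RL algorithm.

```python
# Hypothetical sketch of a GAIL-style adversarial reward (not the paper's code).
# Assumes PyTorch; state_dim / action_dim are placeholder sizes for illustration.
import torch
import torch.nn as nn

state_dim, action_dim = 10, 3   # example dimensions

# Discriminator D(s, a) in (0, 1): trained to score expert pairs high, policy pairs low.
discriminator = nn.Sequential(
    nn.Linear(state_dim + action_dim, 128), nn.Tanh(),
    nn.Linear(128, 1), nn.Sigmoid())

bce = nn.BCELoss()

def discriminator_loss(expert_s, expert_a, policy_s, policy_a):
    """Binary cross-entropy: label expert (s, a) pairs 1 and policy (s, a) pairs 0."""
    d_expert = discriminator(torch.cat([expert_s, expert_a], dim=-1))
    d_policy = discriminator(torch.cat([policy_s, policy_a], dim=-1))
    return bce(d_expert, torch.ones_like(d_expert)) + \
           bce(d_policy, torch.zeros_like(d_policy))

def gail_reward(states, actions):
    """Surrogate reward for the RL step: large when D mistakes policy data for expert data."""
    with torch.no_grad():
        d = discriminator(torch.cat([states, actions], dim=-1))
    return -torch.log(1.0 - d + 1e-8)   # one common convention; implementations vary

# Training alternates two steps (sketch):
# 1. Minimize discriminator_loss on batches of expert and policy (s, a) pairs.
# 2. Update the policy with an RL algorithm, using gail_reward in place of a task reward.
```

Because the reward comes from the discriminator rather than from matching demonstrated actions state-by-state, the agent can recover after drifting away from the demonstrations, which is the robustness property exploited in this paper.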