# IMITATING HUMAN BEHAVIOUR WITH DIFFUSION MODELS

Published as a conference paper at ICLR 2023

Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, Sam Devlin
Microsoft Research

ABSTRACT

Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments: designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment. Code: https://github.com/microsoft/Imitating-Human-Behaviour-w-Diffusion.

1 INTRODUCTION

To enable Human-AI collaboration, agents must learn to best respond to all plausible human behaviours (Dafoe et al., 2020; Mirsky et al., 2022). In simple environments, it suffices to generate all possible human behaviours (Strouse et al., 2021), but as the complexity of the environment grows this approach will struggle to scale. If we instead assume access to human behavioural data, collaborative agents can be improved by training with models of human behaviour (Carroll et al., 2019).

In principle, human behaviour can be modelled via imitation learning approaches, in which an agent is trained to mimic the actions of a demonstrator from an offline dataset of observation and action tuples. More specifically, Behaviour Cloning (BC), despite being theoretically limited (Ross et al., 2011), has been empirically effective in domains such as autonomous driving (Pomerleau, 1991), robotics (Florence et al., 2022) and game playing (Ye et al., 2020; Pearce and Zhu, 2022).

Popular approaches to BC restrict the types of distributions that can be modelled to make learning simpler. A common approach for continuous actions is to learn a point estimate, optimised via Mean Squared Error (MSE), which can be interpreted as an isotropic Gaussian of negligible variance. Another popular approach is to discretise the action space into a finite number of bins and frame it as a classification problem. Both suffer due to the approximations they make (illustrated in Figure 1), either encouraging the agent to learn an average policy or predicting action dimensions independently, resulting in uncoordinated behaviour (Ke et al., 2020).

Figure 1: Expressiveness of a variety of models for behaviour cloning in a single-step, arcade claw game with two simultaneous, continuous actions. Existing methods fail to model the full action distribution, p(a|o), whilst diffusion models excel at covering multimodal & complex distributions.
Such simplistic modelling choices can be successful when the demonstrating policy is itself of restricted expressiveness (e.g. when using trajectories from a single pre-trained policy represented by a simple model). However, for applications requiring cloning of human behaviour, which contains diverse trajectories and multimodality at decision points, simple models may not be expressive enough to capture the full range and fidelity of behaviours (Orsini et al., 2021).

For these reasons, we seek to model the full distribution of actions observed. In particular, this paper focuses on diffusion models, which currently lead in image, video and audio generation (Saharia et al., 2022; Harvey et al., 2022; Kong et al., 2020), and avoid the training instability of generative adversarial networks (Srivastava et al., 2017) or the sampling issues of energy-based models (Florence et al., 2022). By using diffusion models for BC we are able to: 1) more accurately model complex action distributions (as illustrated in Figure 1); 2) significantly outperform state-of-the-art methods (Shafiullah et al., 2022) on a simulated robotic benchmark; and 3) scale to modelling human gameplay in Counter-Strike: Global Offensive, a modern, 3D gaming environment recently proposed as a platform for imitation learning research (Pearce and Zhu, 2022).

To achieve this performance, we contribute several innovations to adapt diffusion models to sequential environments. Section 3.2 shows that good architecture design can significantly improve performance. Section 3.3 then shows that Classifier-Free Guidance (CFG), which is a core part of text-to-image models, surprisingly harms performance in observation-to-action models. Finally, Section 3.4 introduces novel, reliable sampling schemes for diffusion models. The appendices include related work, experimental details, as well as further results and explanations.

2 MODELLING CHOICES FOR BEHAVIOURAL CLONING

In this section we examine common modelling choices for BC. For illustration purposes, we created a simple environment to highlight their limitations. We simulated an arcade toy claw machine, as shown in Figures 1, 3 & 4. An agent observes a top-down image of toys (o) and chooses a point in the image, in a 2D continuous action space, $a \in \mathbb{R}^2$. If the chosen point is inside the boundaries of a valid toy, the agent successfully obtains the toy. To build a dataset of demonstrations, we synthetically generate images containing one or more toys, and uniformly at random pick a single demonstration action a that successfully grabs a toy (note this toy environment uses synthetic rather than human data). The resulting dataset is used to learn $\hat{p}(a|o)$. To make training quicker, we restrict the number of unique observations o to seven, though this could be generalised.

MSE. A popular choice for BC in continuous action spaces approximates p(a|o) by a point-estimate that is optimised via MSE. This makes a surprisingly strong baseline in the literature despite its simplicity. However, MSE suffers from two limitations that harm its applicability to our goal of modelling the full, complex distributions of human behaviour. 1) MSE outputs a point-estimate. This precludes it from capturing any variance or multimodality present in p(a|o). 2) Due to its optimisation objective, MSE learns the average of the distribution. This can bias the estimate towards more frequently occurring actions, or can even lead to out-of-distribution actions (e.g. picking the action between two modes). The first limitation can be partially mitigated by instead assuming a Gaussian distribution, predicting a variance for each action dimension and sampling from the resulting Gaussian. However, due to the MSE objective, the learnt mean is still the average of the observed action distribution. These limitations are visualised in Figure 1.
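As a quick numerical illustration of the averaging effect (synthetic data, not from the paper's experiments): fitting a single point estimate by MSE to actions drawn from two well-separated modes yields their midpoint, which lies in a region of near-zero data density.

```python
import numpy as np

# Synthetic bimodal action data: half the demonstrations pick a = -1, half pick a = +1.
rng = np.random.default_rng(0)
actions = np.concatenate([rng.normal(-1.0, 0.05, 500),
                          rng.normal(+1.0, 0.05, 500)])

# The minimiser of E[(a - pred)^2] is the sample mean.
mse_prediction = actions.mean()
print(mse_prediction)  # ~0.0: between the modes, far from any demonstrated action
```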
Discretised. A second popular choice is to discretise each continuous action dimension into B bins, and frame it as a classification task. This has two major limitations. 1) Quantisation errors arise since the model outputs a single value for each bin. 2) Since each action dimension is treated independently, the marginal rather than the joint distribution is learnt. This can lead to issues during sampling whereby dependencies between dimensions are ignored, leading to uncoordinated behaviour. This can be observed in Figure 1, where points outside of the true distribution have been sampled in the bottom-right corner. This can be remedied by modelling action dimensions autoregressively, but these models bring their own challenges and drawbacks (Lin et al., 2021).

K-Means. Another method that accounts for dependencies between action dimensions first clusters the actions across the dataset into K bins (rather than $B^{|a|}$) using K-Means. This discretises the joint-action distribution, rather than the marginal as in Discretised. Each action is then associated with its nearest cluster, and learning can again be framed as a classification task. This approach avoids enumerating all possible action combinations, by placing bins only where datapoints exist. However, two new limitations are introduced: 1) Quantisation errors can be more severe than for Discretised due to only K action options. 2) The choice of K can be critical to performance (Guss et al., 2021). If K is small, important actions may be unavailable to an agent; if K is large, learning becomes difficult. Additionally, since the bins are chosen with respect to the entire dataset, they can fall outside of the target distribution for a given observation. This is observed in Figure 1.

K-Means+Residual. Shafiullah et al. (2022) extended K-Means by learning an observation-dependent residual that is added to the bin's center and optimised via MSE. This increases the fidelity of the learnt distribution. However, it still carries through some issues of K-Means and MSE. 1) The distribution p(a|o) is still modelled by a finite number of point estimates (maximum of K). 2) The residual learns the average of the actions that fall within each bin. In addition, it still requires a careful choice of the hyperparameter K.
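The following sketch illustrates how targets for the K-Means and K-Means+Residual baselines can be constructed (a simplified illustration on placeholder data, not the paper's implementation; the value of K and the data here are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

K = 8
actions = np.random.rand(10_000, 2)  # placeholder joint actions, e.g. the 2D claw game

# Cluster the joint action space into K bins, placing bins only where data exist.
kmeans = KMeans(n_clusters=K, n_init=10).fit(actions)

# K-Means baseline: each action becomes a classification label over bin indices.
bin_targets = kmeans.predict(actions)

# K-Means+Residual: additionally regress the offset to the chosen bin's centre via MSE.
residual_targets = actions - kmeans.cluster_centers_[bin_targets]
```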
Diffusion models. The limitations of these existing modelling choices all arise from approximations made in the form of $\hat{p}(a|o)$, combined with the optimisation objective. But it is precisely these approximations that make training models straightforward (optimising a network via MSE or cross-entropy is trivial!). So how can we learn p(a|o) while avoiding approximations? We propose that recent progress in generative modelling with diffusion models provides an answer. Diffusion models are able to output expressive conditional joint distributions, and are powerful enough to scale to problems as complex as text-to-image or video generation. Thus, we will show they provide benefits when imitating human demonstrations, as they make no coarse approximations about the action distribution and avoid many of the limitations of the previously discussed choices. This can be observed in Figure 1, where only diffusion provides an accurate reconstruction of the true action distribution.

3 OBSERVATION-TO-ACTION DIFFUSION MODELS

Diffusion models have largely been developed for the image domain. This section first introduces the underlying principles of diffusion models, then studies several aspects that require consideration to apply them to sequential environments.

3.1 DIFFUSION MODEL OVERVIEW

Diffusion models are generative models that map Gaussian noise to some target distribution in an iterative fashion, optionally conditioned on some context (Dhariwal and Nichol, 2021). Beginning from $a_T \sim \mathcal{N}(0, I)$, a sequence $a_{T-1}, a_{T-2}, \dots, a_0$ is predicted, each a slightly denoised version of the previous, with $a_0$ a clean sample. Here, T is the total number of denoising steps (not the environment timestep as is common in sequential decision making). This paper uses denoising diffusion probabilistic models (Ho et al., 2020). During training, noisy inputs can be generated as $a_\tau = \sqrt{\bar{\alpha}_\tau}\, a + \sqrt{1 - \bar{\alpha}_\tau}\, z$, for some variance schedule $\bar{\alpha}_\tau$ and random noise $z \sim \mathcal{N}(0, I)$. A neural network, $\epsilon(\cdot)$, is trained to predict the noise that was added to an input, by minimising:

$$\mathcal{L}_{\mathrm{DDPM}} := \mathbb{E}_{o, a, \tau, z}\left[\, \lVert \epsilon(o, a_\tau, \tau) - z \rVert_2^2 \,\right], \qquad (1)$$

where the expectation is over all denoising timesteps, $\tau \sim U[1, T]$, and observations and actions are drawn from a demonstration dataset, $o, a \sim \mathcal{D}$. At sampling time, with further variance schedule parameters $\alpha_\tau$ (where $\bar{\alpha}_\tau = \prod_{s=1}^{\tau} \alpha_s$) and $\sigma_\tau$, inputs are iteratively denoised:

$$a_{\tau-1} = \frac{1}{\sqrt{\alpha_\tau}}\left(a_\tau - \frac{1 - \alpha_\tau}{\sqrt{1 - \bar{\alpha}_\tau}}\, \epsilon(o, a_\tau, \tau)\right) + \sigma_\tau z. \qquad (2)$$
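As a concrete illustration of Eqs. (1) and (2), here is a minimal PyTorch-style sketch (not the paper's released implementation). It assumes a noise-prediction network `eps_model(obs, noisy_action, tau)` and precomputed schedule tensors `alpha`, `alpha_bar` and `sigma` of length T+1; all of these names are placeholders.

```python
import torch

def ddpm_loss(eps_model, obs, action, alpha_bar, T):
    """Eq. (1): corrupt the action at a random denoising step and regress the noise."""
    tau = torch.randint(1, T + 1, (action.shape[0],))             # tau ~ U[1, T]
    z = torch.randn_like(action)                                   # z ~ N(0, I)
    ab = alpha_bar[tau].unsqueeze(-1)
    a_noisy = ab.sqrt() * action + (1 - ab).sqrt() * z             # forward noising
    return ((eps_model(obs, a_noisy, tau) - z) ** 2).mean()

@torch.no_grad()
def ddpm_sample(eps_model, obs, act_dim, alpha, alpha_bar, sigma, T):
    """Eq. (2): iteratively denoise pure Gaussian noise into an action."""
    a = torch.randn(obs.shape[0], act_dim)                         # a_T ~ N(0, I)
    for tau in range(T, 0, -1):
        z = torch.randn_like(a) if tau > 1 else torch.zeros_like(a)
        t = torch.full((obs.shape[0],), tau)
        eps = eps_model(obs, a, t)
        a = (a - (1 - alpha[tau]) / (1 - alpha_bar[tau]).sqrt() * eps) \
            / alpha[tau].sqrt() + sigma[tau] * z
    return a                                                        # a_0
```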
3.2 ARCHITECTURAL DESIGN

This section explores neural network architectures for observation-to-action diffusion models. Specifically, the requirements of the network are: Input: noisy action $a_{\tau-1} \in \mathbb{R}^{|a|}$, denoising timestep $\tau$, observation $o$ (possibly with a history); Output: predicted noise mask, $\hat{z} \in \mathbb{R}^{|a|}$.

Figure 2: Diffusion BC generates an action vector conditioned on an observation (which may be an image). By contrast, text-to-image diffusion models generate an image conditioned on a vector.

While U-Nets have become standard components of text-to-image diffusion models, their use only makes sense for large, spatial inputs and outputs, while we require generation of an action vector of modest dimensionality. Therefore, we now describe three architectures of varying complexity. Section 4 empirically assesses these three architectures, finding performance improvements in the order: Basic MLP < MLP Sieve < Transformer.

Basic MLP. This architecture directly concatenates all relevant inputs together, $[a_{\tau-1}, o, \tau]$. This input is fed into a multi-layer perceptron (MLP).

MLP Sieve. This uses three encoding networks to produce embeddings of the observation, denoising timestep, and action: $o^e, t^e, a^e_{\tau-1} \in \mathbb{R}^{\text{embed dim}}$. These are concatenated together as input to a denoising network, $[o^e, t^e, a^e_{\tau-1}]$. The denoising network is a fully-connected architecture, with residual skip connections, and with the raw denoising timestep $\tau$ and action $a_{\tau-1}$ repeatedly concatenated after each hidden layer. To include a longer observation history, previous observations are passed through the same embedding network, and embeddings are concatenated together.

Transformer. This creates embeddings as for MLP Sieve. A multi-headed attention architecture (Vaswani et al., 2017) (as found in modern transformer encoder networks) is then used as the denoising network. At least three tokens are used as input, $o^e, t^e, a^e_{\tau-1}$, and this can be extended to incorporate a longer history of observations (only the current $t^e, a^e_{\tau-1}$ are needed since the diffusion process is Markovian).

Sampling rate. The MLP Sieve and Transformer are carefully designed so the observation encoder is separate from the denoising network. At test time, this means only a single forward pass is required for the observation encoder, with multiple forward passes run through the lighter denoising network. This results in a manageable sampling time: in the experiment playing a video game from pixels (Section 4.2), we were able to roll out our diffusion models at 8 Hz on an average gaming GPU (NVIDIA GTX 1060 Mobile). Table 6 provides a detailed breakdown. Note that sampling time is a more severe issue in text-to-image diffusion models, where forward passes of the heavy U-Net architectures are required for all denoising timesteps.

3.3 WHY CLASSIFIER-FREE GUIDANCE FAILS

Classifier-Free Guidance (CFG) has become a core ingredient for text-to-image models, allowing one to trade off image typicality with diversity (Ho and Salimans, 2021). In CFG, a neural network is trained as both a conditional and unconditional generative model. During sampling, by introducing a guidance weight w, one places a higher weight (w > 0) on the prediction conditioned on some context (here, o), and a negative weight on the unconditional prediction,

$$\hat{z}_\tau = (1 + w)\, \epsilon_{\text{cond}}(a_{\tau-1}, o, \tau) - w\, \epsilon_{\text{uncond}}(a_{\tau-1}, \tau). \qquad (3)$$

One might anticipate that CFG would also be beneficial in the sequential setting, with larger w producing trajectories of higher likelihood, but at the cost of diversity. Surprisingly, we find that CFG can actually encourage less common trajectories, and degrade performance.

Figure 3: We vary the CFG weight parameter (w value in Eq. 3) during sampling in the Arcade Claw environment. CFG encourages selection of actions that were specific to an observation (maximising p(o|a)). This can lead to less common trajectories being sampled more often.

In Figure 3 we visualise $\hat{p}(a|o)$ for the claw machine game, under varying guidance strengths, w. An interpretation of CFG is that it encourages sampling of actions that would maximise an implicit classifier, p(o|a). Hence, CFG encourages selection of actions that were unique to a particular observation (Ho et al., 2022). Whilst this is useful for text-to-image models (generate images that are more specific to a prompt), in sequential environments this leads to an agent rejecting higher-likelihood actions in favour of less usual ones that were paired with some observation. In later experiments (Section 4.1) we demonstrate empirically that this can lead to less common trajectories being favoured, while degrading overall performance. Appendix E provides a didactic example of when CFG fails in this way.
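For concreteness, Eq. (3) amounts to one extra forward pass per denoising step. A hedged sketch is below; it assumes a single network `eps_model` trained with observation dropout, so that passing `obs=None` yields the unconditional prediction (an assumption made for illustration, not a description of the released code).

```python
import torch

@torch.no_grad()
def cfg_noise_prediction(eps_model, a_noisy, obs, tau, w):
    """Classifier-free guided noise estimate (Eq. 3); w = 0 recovers plain conditional sampling."""
    eps_cond = eps_model(a_noisy, obs, tau)      # conditioned on the observation
    eps_uncond = eps_model(a_noisy, None, tau)   # observation masked out / dropped
    return (1 + w) * eps_cond - w * eps_uncond
```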
3.4 RELIABLE SAMPLING SCHEMES

In text-to-image diffusion, several samples are typically generated in parallel, allowing the user to select their favourite and ignore any failures. However, when rolling out an observation-to-action diffusion model, such manual screening is not feasible. There remains a risk that a bad action could be selected during a roll-out, which may send an agent toward an out-of-distribution state. Hence, we propose Diffusion-X and Diffusion-KDE as variants of Diffusion BC that mirror this screening process by encouraging higher-likelihood actions during sampling. For both methods, the training procedure is unchanged (only the conditional version of the model is required). The algorithms for all sampling methods are given in Appendix D.

Diffusion-X. The sampling process runs as normal for T denoising timesteps. The denoising timestep is then fixed, τ = 1, and extra denoising iterations continue to run for M timesteps. The intuition behind this is that samples continue to be moved toward higher-likelihood regions.

Diffusion-KDE. Generate multiple action samples from the diffusion model as usual (these can be done in parallel). Fit a simple kernel-density estimator (KDE) over all samples, and score the likelihood of each. Select the action with the highest likelihood.

Figure 4: This figure shows the predictive distributions of diffusion models using various sampling schemes. When M = 0 this is Diffusion BC, and when M > 0 this is Diffusion-X. Diff. KDE refers to Diffusion-KDE.

The effect of these sampling modifications is demonstrated in Figure 4. While Diffusion BC generates a small number of actions that fall outside of the true p(a|o) region, Diffusion-X and Diffusion-KDE avoid these bad actions. Note that the two distinct modes in the figure are recovered by both sampling methods, suggesting that multimodality is not compromised, though the diversity within each mode is reduced. Both techniques are simple to implement, and experiments in Section 4 show their benefit.

4 EXPERIMENTS

This section empirically investigates the efficacy of diffusion models for BC. We assess our method in two complex sequential environments, which have large human datasets available, with the aim of answering several questions:

Q1) How do diffusion models compare to existing baseline methods for BC, in terms of matching the demonstration distribution? Section 4.1 compares to four popular modelling choices in BC, as well as recent state-of-the-art models. Section 4.2 provides further focused comparisons.

Q2) How is performance affected by the architectures designed in Section 3.2, CFG, and the sampling schemes in Section 3.4? We provide full ablations in Section 4.1, and targeted ablations over sampling schemes in Section 4.2.

Q3) Can diffusion models scale to complex environments efficiently? Section 4.2 tests on an environment where the observation is a high-resolution image, the action space is mixed continuous & discrete, and there are strict time constraints for sampling time.

Evaluation. To evaluate how closely methods imitate human demonstrations, we compare the behaviours of our models with those of the humans in the dataset. To do so, we compare both at a high level, analysing observable outcomes (e.g. task completions or game score), as well as at a low level, comparing the stationary distributions over states or actions. Both of these are important in evaluating how humanlike our models are, and provide complementary analyses. Appendix B provides details on our evaluation metrics, which are introduced less formally here.
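One of the low-level comparisons used in the experiments is a Wasserstein distance between agent and human distributions. As a rough, simplified illustration only (the paper's actual metrics are defined in Appendix B, and the data below are placeholders), a one-dimensional version can be computed directly with SciPy:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Placeholder samples standing in for one state dimension visited by humans vs. an agent.
human_states = np.random.randn(5_000)
agent_states = np.random.randn(5_000) + 0.1

# Lower distance => the agent's visitation distribution is closer to the human one.
print(wasserstein_distance(human_states, agent_states))
```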
Baselines. Where applicable we include Human to indicate the metrics achieved by samples drawn from the demonstration dataset itself. We then re-implement five baselines. Three correspond to popular BC modelling choices, namely MSE: a model trained via mean squared error; Discretised: each action dimension is discretised into 20 uniform bins, then trained independently via cross-entropy; and K-means: a set of K candidate actions are first produced by running K-means over all actions, actions are discretised into their closest bin, and the model is trained via cross-entropy. A further two baselines can be considered strong, more complex methods, namely K-means+Residual: as with K-means, but additionally learns a continuous residual on top of each bin prediction, trained via MSE, which was the core innovation of Behaviour Transformers (BeT) (Shafiullah et al., 2022); and EBM: a generative energy-based model trained with a contrastive loss, proposed in Florence et al. (2022); full details about the challenges of this method are given in Appendix B.4. One of our experiments uses the setup from Shafiullah et al. (2022), allowing us to compare to their reported results, including Behaviour Transformers (BeT): the K-means+residual combined with a large 6-layer transformer, and previous 10 observations as history; Implicit BC: the official implementation of energy-based models for BC (Florence et al., 2022); SPiRL: using a VAE, originally from Pertsch et al. (2021); and PARROT: a flow-based model, originally from Singh et al. (2020).

Architectures & Hyperparameters. We trial the three architecture options described in Section 3.2, which are identical across both the five re-implemented baseline methods and our diffusion variants (except where required for output dimensionality). Basic MLP and MLP Sieve both use networks with GELU activations and 3 hidden layers of 512 units each. Transformer uses four standard encoder blocks, each with 16 attention heads. Embedding dimension is 128 for MLP Sieve and Transformer. Appendix B gives full hyperparameter details.

4.1 LEARNING ROBOTIC CONTROL FROM HUMAN DEMONSTRATION

In this environment, an agent controls a robotic arm inside a simulated kitchen. It is able to perform seven tasks of interest, such as opening a microwave, or turning on a stove (Appendix B contains full details). The demonstration dataset contains 566 trajectories. These were collected using a virtual reality setup, with a human's movements translated to robot joint actuations (Gupta et al., 2020). Each demonstration trajectory performed four predetermined tasks. There are 25 different task sequences present in the dataset, of roughly equal proportion. This environment has become a popular offline RL benchmark for learning reward-maximising policies (Fu et al., 2020). However, as our goal is to learn the full distribution of demonstrations, we instead follow the setup introduced by Shafiullah et al. (2022), which ignores any goal conditioning and aims to train an agent that can recover the full set of demonstrating policies.

The kitchen environment's observation space is a 30-dimensional continuous vector containing information about the positions of the objects and robot joints. The action space is a 9-dimensional continuous vector of joint actuations. All models receive the previous two observations as input, allowing the agent to infer velocities. For diffusion models, we set T = 50. The kitchen environment is challenging for several reasons. 1) Strong (sometimes non-linear) correlations exist between action dimensions.
2) There is multimodality in p(a|o) at the point the agent selects which task to complete next, and also in how it completes it. We show that our diffusion model learns to represent both these properties in Figure 9, which visualises relationships between all action dimensions during one rollout.

4.1.1 MAIN RESULTS

Comparing the behaviour of our agents with that of the human demonstrations is a challenging research problem in its own right. Table 1 presents several metrics that provide insight into how closely models match the human demonstrations, from high-level analysis of the task sequences selected to low-level statistics of the observation trajectories generated by models. We briefly summarise these; more technical descriptions can be found in Appendix B. Training and sampling times of methods are given in Table 6.

Tasks 4. We first measure the proportion of rollouts for which models perform four valid tasks, which nearly all human demonstrations achieve.

Tasks Wasserstein. For each method we record how many times each different task sequence was completed during rollouts. The Wasserstein distance is then computed between the resulting histogram and the human distribution.

Time Wasserstein. As well as analysing which tasks were completed, we also analyse when they were completed. Figure 5 plots the distribution of the time taken to complete different numbers of tasks (normalised to exclude failures), for MSE and Diffusion-X. We then compute the Wasserstein distance between the human and agent distributions (see Appendix C for a full breakdown).

State Wasserstein. If we truly capture the full diversity and fidelity of human behaviour present in the dataset, this will be reflected in the state occupancy distribution. We compute the Wasserstein distance between agent and human distributions, with a lower value indicating they are more similar.

Density & Coverage. We use the Density and Coverage metrics from Naeem et al. (2020) that are used to evaluate GANs. Roughly speaking, Density corresponds to how many states from human trajectories are close to the agent's states, and Coverage corresponds to the proportion of human states that have an agent-generated state nearby.

In terms of task-completion rate, our Diffusion BC approaches outperform all baselines.

D SAMPLING ALGORITHMS

Algorithm 1 Sampling for Diffusion BC
1: $a_T \sim \mathcal{N}(0, I)$
2: for $\tau = T, \dots, 1$ do
3:   $z \sim \mathcal{N}(0, I)$ if $\tau > 1$, else $z = 0$
4:   $a_{\tau-1} = \frac{1}{\sqrt{\alpha_\tau}}\left(a_\tau - \frac{1-\alpha_\tau}{\sqrt{1-\bar{\alpha}_\tau}}\, \epsilon_\theta(a_\tau, \tau, o)\right) + \sigma_\tau z$
5: end for
6: return $a_0$

Algorithm 2 Sampling for Diffusion-X
1: $a_T \sim \mathcal{N}(0, I)$
2: for $i = T, \dots, 1 - M$ do
3:   $\tau = \max(i, 1)$
4:   $z \sim \mathcal{N}(0, I)$ if $\tau > 1$, else $z = 0$
5:   $a_{\tau-1} = \frac{1}{\sqrt{\alpha_\tau}}\left(a_\tau - \frac{1-\alpha_\tau}{\sqrt{1-\bar{\alpha}_\tau}}\, \epsilon_\theta(a_\tau, \tau, o)\right) + \sigma_\tau z$
6: end for
7: return $a_{-M}$

Algorithm 3 Sampling for Diffusion-KDE
1: $A \leftarrow [\,]$
2: for $i = 1, \dots, K$ do
3:   Use Algorithm 1 to sample action $a_0$
4:   $A.\mathrm{append}(a_0)$
5: end for
6: $\mathrm{KDEmodel.fit}(A)$
7: $\mathrm{Likelihoods} = \mathrm{KDEmodel.score}(A)$
8: $i = \arg\max_i(\mathrm{Likelihoods})$
9: return $A[i]$
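To make Algorithms 2 and 3 concrete, here is a hedged Python sketch (not the released implementation). It assumes a helper `denoise_step(a, tau, obs)` that applies one update of Eq. (2) at denoising timestep `tau`, and a sampler `sample_fn(obs)` returning a single action as in Algorithm 1; both names are placeholders.

```python
import numpy as np
import torch
from scipy.stats import gaussian_kde

@torch.no_grad()
def diffusion_x_sample(denoise_step, obs, act_dim, T, M):
    """Algorithm 2: run T normal denoising steps, then M extra steps with tau fixed at 1."""
    a = torch.randn(1, act_dim)                 # a_T ~ N(0, I)
    for i in range(T, -M, -1):                  # i = T, ..., 1 - M
        tau = max(i, 1)
        a = denoise_step(a, tau, obs)
    return a                                    # a_{-M}

@torch.no_grad()
def diffusion_kde_sample(sample_fn, obs, n_samples=32):
    """Algorithm 3: draw several actions, keep the one with highest KDE likelihood."""
    actions = torch.cat([sample_fn(obs) for _ in range(n_samples)])   # (N, act_dim)
    kde = gaussian_kde(actions.numpy().T)       # fit KDE over all drawn samples
    scores = kde(actions.numpy().T)             # likelihood of each sample
    return actions[int(np.argmax(scores))]
```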
E CLASSIFIER-FREE GUIDANCE ANALYSIS

Figure 10: Didactic example of when CFG can lead to generation of less usual trajectories.

CFG can be interpreted as guiding the denoising sampling procedure towards higher values of p(o|a) (Ho et al., 2022). Given this interpretation, we now provide a grid-world to show concretely why this can lead to sampling of less common trajectories in a sequential environment. Figure 10 shows an environment with four discrete states and a discrete action space. The action space allows three ego-centric options: turn left, turn right, or continue straight forward. Agents are always initialised at state 0, giving observation $o_0$, and rolled out for exactly two timesteps, visiting state 1 and ending in state 2 or 3. The figure shows the empirical action distributions in a demonstration dataset. In states 0, 2 & 3, the agent always selects straight. But in state 1, the agent makes a right turn with 0.1 probability. Let $o_1$, $o_2$ and $o_3$ denote the observations given by states 1, 2 and 3, respectively.

We can apply Bayes' rule to find p(o|a), which will provide an understanding of what behaviour CFG induces. We are interested in the learnt behaviour at the decision point $o_1$,

$$p(o_1|a) = \frac{p(a|o_1)\, p(o_1)}{p(a)}.$$

Given the agent starts in state 0 and is rolled out for two timesteps (so it sees three states), and $p(a = \text{Turn right}\,|\,o_1) = 0.1$, $p(a = \text{Straight}\,|\,o_1) = 0.9$, we find the marginal probability of each observation,

$$p(o = o_0) = 1/3 \qquad (5)$$
$$p(o = o_1) = 1/3 \qquad (6)$$
$$p(o = o_2) = 1/3 \cdot 0.1 \qquad (7)$$
$$p(o = o_3) = 1/3 \cdot 0.9. \qquad (8)$$

Hence, the marginal action distribution is,

$$p(a = \text{Turn right}) = \sum_{i=0}^{3} p(a = \text{Turn right}\,|\,o_i)\, p(o_i) \qquad (9)$$
$$= 1/3 \cdot 0.1, \qquad (10)$$
$$p(a = \text{Straight}) = 1/3 + 1/3 \cdot 0.9 + 1/3. \qquad (11)$$

We can now compute the quantities $p(o_1|a)$,

$$p(o_1|a = \text{Turn right}) = \frac{p(a = \text{Turn right}\,|\,o_1)\, p(o_1)}{p(a = \text{Turn right})} = \frac{0.1 \cdot 1/3}{0.1 \cdot 1/3} = 1 \qquad (12)$$

$$p(o_1|a = \text{Straight}) = \frac{p(a = \text{Straight}\,|\,o_1)\, p(o_1)}{p(a = \text{Straight})} = \frac{0.9 \cdot 1/3}{1/3 + 1/3 \cdot 0.9 + 1/3} = \frac{0.9}{2.9} = 0.31. \qquad (13)$$

As such, since CFG favours actions that maximise p(o|a), the CFG agent will select the less frequently visited right-hand path more often.
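The arithmetic in Eqs. (5)-(13) can be verified in a few lines of Python (a simple check of the worked example above, using the same probabilities):

```python
# Marginal observation probabilities (Eqs. 5-8) and per-observation policy for "Turn right".
p_obs = {"o0": 1 / 3, "o1": 1 / 3, "o2": 1 / 3 * 0.1, "o3": 1 / 3 * 0.9}
p_right_given_obs = {"o0": 0.0, "o1": 0.1, "o2": 0.0, "o3": 0.0}

# Marginal action distribution (Eqs. 9-11).
p_right = sum(p_right_given_obs[o] * p_obs[o] for o in p_obs)
p_straight = sum((1 - p_right_given_obs[o]) * p_obs[o] for o in p_obs)

# Implicit classifier p(o1 | a) that CFG pushes up (Eqs. 12-13).
print(p_right_given_obs["o1"] * p_obs["o1"] / p_right)            # 1.0
print((1 - p_right_given_obs["o1"]) * p_obs["o1"] / p_straight)   # ~0.31
```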