# BEHAVIOUR DISTILLATION

Published as a conference paper at ICLR 2024

Andrei Lupu (1,2), Chris Lu (1), Jarek Liesen (3), Robert Tjarko Lange (3) & Jakob Foerster (1)
(1) University of Oxford, (2) Meta AI, (3) Technical University Berlin
Correspondence: alupu@meta.com

ABSTRACT

Dataset distillation aims to condense large datasets into a small number of synthetic examples that can be used as drop-in replacements when training new models. It has applications to interpretability, neural architecture search, privacy, and continual learning. Despite strong successes in supervised domains, such methods have not yet been extended to reinforcement learning, where the lack of a fixed dataset renders most distillation methods unusable. Filling the gap, we formalize behaviour distillation, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to expert data. We then introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs which, under supervised learning, train agents to competitive performance levels in continuous control tasks. We show that these datasets generalize out of distribution to training policies with a wide range of architectures and hyperparameters. We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion. Beyond behaviour distillation, HaDES provides significant improvements in neuroevolution for RL over previous approaches and achieves SoTA results on one standard supervised dataset distillation task. Finally, we show that visualizing the synthetic datasets can provide human-interpretable task insights.

1 INTRODUCTION

Dataset distillation (Wang et al., 2018) is the task of synthesizing a small number of datapoints that can replace training on a large real dataset for downstream tasks. Far more than a scientific curiosity, distilled datasets have seen applications to core research endeavours such as interpretability, architecture search, privacy, and continual learning (Lei & Tao, 2023; Sachdeva & McAuley, 2023). Despite a series of successes on vision tasks, and more recently in graph learning (Jin et al., 2021) and recommender systems (Sachdeva et al., 2022), distillation methods have not yet been extended to reinforcement learning (RL). This is because they generally make strong assumptions about the prior availability of an expert (or ground-truth) dataset.

To address this gap in the literature, we introduce a new setting called behaviour distillation¹, which aims to discover and condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to an expert. Unlike dataset distillation, which simply replaces a hard supervised learning task with an easier one, behaviour distillation solves two challenges at once: the exploration problem (discovering trajectories with high expected return) and the representation learning problem (learning to represent a policy that produces those trajectories), both of which are fundamental to deep reinforcement learning. Behaviour distillation thus aims to produce a dataset that obviates the need for exploration, essentially pre-solving the environment. As such, a behaviour distillation dataset does not encode a summary of the full environment, but only a summary of an expert policy in that environment.
In other words, it reduces the joint problems of data collection (i.e. exploration) and sequential learning on a large amount of non-stationary data to one of supervised learning on a tiny amount of stationary, non-sequential synthetic data, such as the example datasets in Fig. 1.

¹The term was also recently used by Furuta et al. as a near-synonym for policy distillation.

Figure 1: Entire synthetic datasets required to train an optimal Cartpole policy (top) and an expert Hopper policy with behaviour cloning (bottom). The state-action pairs help interpret the learned policies. The red box contains observation features for Cartpole and action labels (torques) for Hopper.

Motivated by the challenge of behaviour distillation, we introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method based on a meta-evolutionary outer loop. Specifically, HaDES optimizes the synthetic dataset with a bi-level optimization structure, which uses evolution strategies (ES) to update the dataset in the outer loop and supervised learning (behaviour cloning) on the current dataset in the inner loop. The fitness function for ES is the performance of the policy at the end of the supervised learning step. We show that the generated datasets can be used to retrain policies with vastly different architectures and hyperparameters from those used to produce the datasets, achieving returns competitive with training directly on the original environment while doing behaviour cloning on less than 1/10th, or in some cases less than 1/100th, of a single episode's worth of data. We also demonstrate the applicability of these datasets to downstream tasks and open-source them in the hope of accelerating future research. Furthermore, while HaDES is tailored to behaviour distillation, we show it is also competitive when applied to popular computer vision dataset distillation benchmarks.

There is a recent resurgence of interest in evolution strategies (ES) for machine learning, fuelled by their generality and applicability to non-differentiable objectives, as well as to long-horizon tasks with delayed rewards (Lu et al., 2023; Salimans et al., 2017). However, current evolutionary optimization methods are limited in the number of parameters that can be evolved, since a large number of parameters combined with a large population size induces a large memory footprint, as we show in Section 5.1. This limits the use of ES for training large neural networks. To tackle this issue, we adapt HaDES into an alternative parametrization and training scheme for neuroevolution by not resampling the initial weights of the inner-loop policy. This parametrization has the benefit of scaling independently of the number of parameters in the evolved policy, thereby reducing the memory footprint and resulting in competitive performance across multiple environments when compared to vanilla ES.

Our main contributions are:
1. We formalize the setting of behaviour distillation, which extends the principles of dataset distillation to reinforcement learning (Section 4.1);
2. We introduce HaDES, the first method for behaviour distillation (Section 4);
3. We show that a minor change to our method provides a parametrization for neuroevolution through ES that reduces its memory footprint (Section 5.1);
4. We demonstrate empirically that HaDES can produce effective synthetic datasets for challenging discrete and continuous control environments, and that these datasets generalize to training policies with a large range of architectures and hyperparameters (Section 5.2);
5. We use the synthetic datasets for a downstream task: quickly training a multi-task agent from datasets produced for individual environments (Section 5.3);
6. We achieve SoTA on a common dataset distillation benchmark with HaDES (Section 5.4);
7. We open-source our code and synthetic datasets at https://github.com/FLAIROx/behaviour-distillation.

2 RELATED WORKS

2.1 DATASET DISTILLATION

Efforts to reduce the amount of training data required for machine learning can be traced back to reduced support vector machines (Lee & Mangasarian, 2001; Lee & Huang, 2007; Wang et al., 2005). In deep learning, Bachem et al. (2017) and Coleman et al. (2019) coined coreset selection for the problem of selecting a small number of representative examples from a dataset that suffice to train a model faster without degrading its final performance. Wang et al. forgo this restriction to real examples in favour of producing synthetic datasets, coining the task of dataset distillation.

Since then, numerous approaches have been proposed to distill supervised learning datasets (Lei & Tao, 2023). Most involve a bi-level optimization procedure and can be divided into four broad categories (Sachdeva & McAuley, 2023). Gradient matching methods (Zhao et al., 2020) aim to minimize the difference in the gradient updates that a model receives when training on the synthetic vs. the real dataset, while trajectory matching (Cazenavette et al., 2022) minimizes the distance between checkpoint parameters of models trained on either dataset. Unfortunately, neither of these techniques is applicable to reinforcement learning without prior access to an expert policy, its checkpoints, or at the very least a dataset of expert trajectories. More recently, Zhao & Bilen and Wang et al. directly align the synthetic dataset distribution to match the real one. While highly effective, such an approach is also inapplicable to reinforcement learning due to the non-stationary and policy-specific data distribution.

The oldest and most closely related approach to our method is meta-model matching (Wang et al., 2018; Nguyen et al., 2020; Loo et al., 2022), which involves fully training a model on the synthetic data in the inner loop, while updating the dataset in the outer loop to minimize the model's loss on the real data. These works either compute expensive meta-gradients using back-propagation through time (BPTT; Wang et al., 2018; Deng & Russakovsky, 2022), or use a neural tangent kernel rather than a finite-width neural network in the inner loop, so that the classifier loss on the target dataset can be computed in closed form (Nguyen et al., 2020). While these methods could be applied to RL by choosing an appropriate loss (e.g. REINFORCE (Williams, 1992)), we instead replace meta-gradients with an evolutionary approach in the outer loop, making the cost of the outer-loop updates independent of both the network size and the number of updates in the inner loop. This is important since in practice we can use hundreds of policy updates in the inner loop, making the use of BPTT prohibitively expensive.
A few other works have extended dataset distillation beyond image classification to graphs (Jin et al., 2021; 2022) and recommender systems (Sachdeva et al., 2022), but to the best of our knowledge no previous work has broken away from assuming access to a pre-existing target dataset. As such, our work is the first to break this data-centric paradigm, and it introduces the first general distillation method applicable to reinforcement learning.

2.2 NEUROEVOLUTION AND INDIRECT ENCODINGS

Neuroevolution (Schwefel, 1977) has been shown to perform comparably to reinforcement learning on several benchmarks (Such et al., 2017; Salimans et al., 2017). Part of our work can be viewed as a form of indirect encoding (Stanley et al., 2019) for neuroevolution, i.e. an alternative parameterization for evolving neural network weights. Rather than evolving the parameters of a neural network directly, indirect encoding evolves a genotype in an alternative (usually compressed) representation that then maps to the parameters. A well-known example is HyperNEAT (Stanley et al., 2009), a precursor to HyperNetworks (Ha et al., 2016), which evolves a smaller neural network to generate the weights of a larger one. Indirect encoding is desirable because evolution strategies can scale poorly in the number of parameters (Hansen, 2016). Our work, instead of evolving a neural network, evolves a small dataset on which we train a larger neural network with supervised learning.

Other related work has evolved other aspects of reinforcement learning training. For example, prior works have evolved RL policy objectives (Lu et al., 2022; Co-Reyes et al., 2021; Houthooft et al., 2018; Jackson et al., 2024) and environment features (Lu et al., 2023). Most related to our work is Synthetic Environments (Ferreira et al., 2022), which evolves neural networks to replace an environment's state dynamics and rewards in order to speed up training. Instead of evolving transition and reward functions and training with RL, our work evolves supervised data for behavioural cloning (BC). This greatly aids interpretability and simplifies the inner loop.

3 BACKGROUND

The goal of this paper is to discover a behavioural dataset which, in combination with supervised learning, solves a Markov Decision Process. We describe and formalize the conceptual basis of our paper below.

3.1 REINFORCEMENT LEARNING

A Markov Decision Process (MDP) is defined by a tuple $\langle S, A, P, R, \gamma \rangle$, in which $S$, $A$, $P$, $R$ and $\gamma$ are the state space, action space, transition probability function (which maps a state and action to a distribution over the next state), reward function (which maps a state and action to a scalar reward), and discount factor, respectively. At each step $t$, the agent observes a state $s_t$ and uses its policy $\pi_\theta$ (a function from states to actions, parametrized by $\theta$) to select an action $a_t$. The environment then samples a new state $s_{t+1}$ according to the transition function $P$ and a scalar reward $r_t$ according to the reward function $R$. The objective in reinforcement learning is to discover a policy that maximizes the expected discounted sum of rewards:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big]. \quad (1)$$

3.2 EVOLUTION STRATEGIES

Many reinforcement learning algorithms use the structure of the MDP to update the policy using gradient-based methods and techniques such as the Bellman equation (Bellman, 1966). An alternative approach is to treat the function $J(\theta)$ as a black-box function and directly optimize $\theta$. One popular approach to this is known as Evolution Strategies (ES; Salimans et al., 2017). Given an arbitrary function $F(\phi)$, ES optimizes the smoothed objective $\mathbb{E}_{\epsilon \sim N(0,I)}[F(\phi + \sigma\epsilon)]$, where $N(0, I)$ is the standard multivariate normal distribution and $\sigma$ is a scalar standard deviation. We estimate the gradient of this objective by sampling noise from $N(0, I)$ and evaluating $F$ at the resulting points. Specifically, the gradient is estimated by

$$\nabla_\phi \, \mathbb{E}_{\epsilon \sim N(0,I)}[F(\phi + \sigma\epsilon)] = \mathbb{E}_{\epsilon \sim N(0,I)}\Big[\frac{\epsilon}{\sigma} F(\phi + \sigma\epsilon)\Big].$$

We then apply this update to our parameters and repeat the process.
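To make the estimator above concrete, the following minimal NumPy sketch performs one ES update with antithetic (mirrored) sampling, the common variance-reduction trick also used in our outer loop. The quadratic `toy_objective` and all hyperparameter values are purely illustrative and are not taken from the paper.

```python
import numpy as np

def toy_objective(phi: np.ndarray) -> float:
    # Illustrative black-box fitness, maximized at phi = 3 in every dimension.
    return -float(np.sum((phi - 3.0) ** 2))

def es_step(phi, objective, rng, pop_size=64, sigma=0.1, lr=0.05):
    """One ES update on the smoothed objective E_eps[F(phi + sigma * eps)]."""
    # Antithetic sampling: evaluate each noise direction with + and - sign.
    half = rng.standard_normal((pop_size // 2, phi.size))
    eps = np.concatenate([half, -half], axis=0)
    fitness = np.array([objective(phi + sigma * e) for e in eps])
    # Monte Carlo estimate of grad_phi E[F(phi + sigma*eps)] = E[(eps / sigma) F(phi + sigma*eps)]
    grad = eps.T @ fitness / (len(eps) * sigma)
    return phi + lr * grad  # gradient ascent, since F is a fitness

rng = np.random.default_rng(0)
phi = np.zeros(5)
for _ in range(200):
    phi = es_step(phi, toy_objective, rng)
print(np.round(phi, 2))  # close to [3. 3. 3. 3. 3.]
```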
When applied to meta-optimization, ES allows us to optimize functions that would otherwise require taking meta-gradients through hundreds or thousands of update steps, which is often intractable (Metz et al., 2021).

3.3 DATASET DISTILLATION

In the context of supervised learning, dataset distillation is the task of generating or selecting a proxy dataset that allows training a model to a similar performance as training on the original dataset, often using several orders of magnitude fewer samples. Formally, we assume a dataset $D = \{x_i, y_i\}_{i=1}^N$, of which $D_{\text{train}} \subseteq D$ is the training (sub)set, and a training algorithm $\text{alg}$. Also, let $f_{\text{alg}(D)} : x_i \mapsto y_i$ be the classifier obtained by training on $D$ with $\text{alg}$. Then, the dataset distillation objective is to find a synthetic dataset $D_\phi$, with $|D_\phi| \ll |D_{\text{train}}|$, such that

$$\mathbb{E}_{x,y \sim D}\,\mathcal{L}\big(f_{\text{alg}(D_\phi)}(x), y\big) \approx \mathbb{E}_{x,y \sim D}\,\mathcal{L}\big(f_{\text{alg}(D_{\text{train}})}(x), y\big), \quad (2)$$

where $\phi$ indicates that the synthetic dataset can be parametrized and learned, rather than being sampled. In practice, $|D_\phi|$ is often set to a fixed number of examples, e.g. $|D_\phi| = n|Y|$, where $|Y|$ is the total number of discrete classes, or is determined by a fixed number of parameters, i.e. $|\phi| = n(|x| + |y|)$. The latter formulation, being more permissive, admits factorized representations of the data, e.g. by representing $D_\phi$ with a generative neural network. While this is a promising avenue for future work, we focus exclusively on non-factorized distillation, which allows for better interpretability and tractability of the synthetic dataset.

4 PROBLEM SETTING AND METHOD

4.1 BEHAVIOUR DISTILLATION

We introduce behaviour distillation as a parallel to dataset distillation. Rather than optimizing a proxy dataset to minimize a loss in supervised learning, behaviour distillation optimizes a proxy dataset that maximizes the discounted return in reinforcement learning, after supervised learning. More formally, the aim of behaviour distillation is to find a dataset $D_\phi \in (S \times A)^N$, where $N$ is the number of points in the dataset, that solves the following bi-level optimization problem:

$$\max_{D_\phi} \; J\big(\theta^*(D_\phi)\big) \quad (3)$$
$$\text{s.t.} \quad \theta^*(D_\phi) = \arg\min_\theta \mathcal{L}(\theta, D_\phi), \quad (4)$$

where $\theta$ are the parameters of the neural network used in the policy $\pi_\theta$, $J$ is the discounted sum of returns defined in Eq. (1), and $\mathcal{L}$ is any supervised learning loss.

Crucially, this setting does not assume access to an expert policy or dataset, for two reasons. Firstly, expert data may not always be available or may not be easily compressible, for instance if the expert is erratic or idiosyncratic. Secondly, standard imitation learning is plagued by cascading errors: if $\pi$ is an imitation learning policy that deviates from some expert $\pi_{\text{expert}}$ with probability $\epsilon$, then it is likely to end up off-distribution, further increasing its error rate. As such, it generally incurs a regret $J(\pi_{\text{expert}}) - J(\pi)$ that grows as $T^2\epsilon$, where $T$ is the episode horizon (Ross & Bagnell, 2010).
Given that dataset distillation is lossy, applying it naively would result in a large $\epsilon$ and therefore poor performance for $\pi$. Since we ultimately care about maximizing the expected discounted return rather than reproducing some specific expert behaviour, we formalize behaviour distillation directly in terms of return, as above.

4.2 HADES: HALLUCINATING DATASETS WITH EVOLUTION STRATEGIES

To tackle behaviour distillation, we introduce our method, Hallucinating Datasets with Evolution Strategies (HaDES). HaDES optimizes the inner-loop objective to obtain $\theta^*$ using gradient descent, and optimizes the outer-loop objective (the return of $\pi_{\theta^*}$) using ES (Salimans et al., 2017). In the inner loop, HaDES uses the cross-entropy loss for discrete action spaces and the negative log-likelihood loss for continuous actions, although other losses could be substituted. We provide pseudocode in Algorithm 1.

4.2.1 POLICY INITIALIZATION

We further specify two variants of our method, which have distinct use cases. They differ only in the way inner-loop policies are initialized, which leads to different inductive biases. The variants described here are visualized in Fig. 2.

The first variant is HaDES with fixed policy initialization, or HaDES-F. In this variant, we sample a single policy initialization $\theta_0$ at the very beginning of meta-training. The policy is re-trained in every inner loop, but always starting from this fixed $\theta_0$.

The second variant is HaDES with randomized policy initialization, or HaDES-R. In this variant, we use multiple ($k \geq 2$) policy initializations $(\theta_0^1, \ldots, \theta_0^k)_i$ in the inner loop, and we resample the initializations randomly at every generation $i$.

HaDES-F has a stronger inductive bias in that it optimizes $D_\phi$ for a single initialization only. We expect it to be able to overfit on that initialization and achieve higher policy returns, but at the cost of the synthetic dataset having poor generalization properties to other initializations. HaDES-F will therefore be stronger for neuroevolution (where only the final return of the specific policy matters), but weaker for behaviour distillation.

HaDES-R has a weaker inductive bias in that it optimizes $D_\phi$ for a range of initializations. We expect this to result in decreased policy returns, but the synthetic dataset to be a useful artifact of training that generalizes to unseen initializations or even unseen policy architectures. HaDES-R will therefore be a better choice for behaviour distillation, but a weaker one for neuroevolution. We confirm both of these intuitions empirically in Section 5.

Figure 2: Left: standard neuroevolution. Middle: HaDES-F. Right: HaDES-R. HaDES-F uses a single fixed policy initialization. HaDES-R samples $k \geq 2$ policy initializations every generation.
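As a concrete illustration of the inner loop (the fitness evaluation inside Algorithm 1), the sketch below behaviour-clones a policy on a candidate synthetic dataset and scores it by its episodic return; this return is the fitness passed to the ES outer loop. For brevity the policy is a linear softmax classifier and the `env_reset`/`env_step` callables stand in for a Gym-style discrete-action environment; both are assumptions made for this sketch, not the paper's implementation (which uses MLP policies and JAX environments).

```python
import numpy as np

def hades_inner_fitness(dataset_states, dataset_actions, env_reset, env_step,
                        n_actions, epochs=400, lr=0.05, max_steps=1000, seed=0):
    """Behaviour-clone a fresh policy on the synthetic (s, a) pairs,
    then return the episodic return of the cloned policy as its fitness."""
    rng = np.random.default_rng(seed)
    obs_dim = dataset_states.shape[1]
    W = 0.01 * rng.standard_normal((obs_dim, n_actions))  # fresh linear policy

    # --- Behaviour cloning: minimize cross-entropy on the synthetic dataset ---
    onehot = np.eye(n_actions)[dataset_actions]
    for _ in range(epochs):
        logits = dataset_states @ W
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        grad = dataset_states.T @ (probs - onehot) / len(dataset_states)
        W -= lr * grad

    # --- Evaluate the cloned policy in the environment ---
    obs, total_return, done, t = env_reset(), 0.0, False, 0
    while not done and t < max_steps:
        action = int(np.argmax(obs @ W))  # greedy action; sampling is also possible
        obs, reward, done = env_step(action)
        total_return += reward
        t += 1
    return total_return
```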
5 EXPERIMENTS

Next, we show that HaDES can be successfully applied to a broad range of settings. More specifically, we investigate the following four applications:
1. We demonstrate the effectiveness of HaDES as an indirect encoding for neuroevolution in RL tasks (Section 5.1), with an analysis of the distillation budget in Appendix C.1.
2. We show that synthetic datasets successfully generalize across a broad range of architectures and hyperparameters (Section 5.2).
3. We show that the synthetic datasets can be used to train multi-task models (Section 5.3).
4. We show that while our focus is on RL, our method is also competitive when applied to dataset distillation in supervised settings (Section 5.4).
We refer the reader to Appendix B.2 for experimental details, including runtime comparisons.

5.1 PERFORMANCE EVALUATION OF HADES-DISCOVERED RL DATASETS

We first test the effectiveness of HaDES as a way of training competitive policies in 8 continuous control environments from the Brax suite. Fig. 3a shows the performance of HaDES with a fixed policy initialization (HaDES-F), HaDES with two randomized initializations (HaDES-R), and direct neuroevolution through ES. HaDES-F achieves the highest return across the board, while HaDES-R also matches or beats the baseline in 6/8 environments. In Humanoid-Standup, our method discovers a glitch in the collision physics, propelling itself into the air to achieve extremely high returns.² In MinAtar, we find that HaDES-F outperforms the ES baseline in Breakout and matches it in SpaceInvaders and Freeway. We hypothesize that MinAtar is a harder setting for our method due to the symbolic (rather than continuous) nature of the environment. For both settings, we also plot the performance of PPO policies after $5 \times 10^7$ training steps. While there is still a performance gap between ES and RL, HaDES narrows it significantly.

We use policy networks of width 512 and a population size of 2048 for HaDES, but are forced to cut the network widths in half for the ES baseline on MinAtar due to memory constraints. Indeed, ES requires the entire population to be allocated on a single device at once when a new population is generated and when estimating the meta-gradient at each generation. While distributed approaches to the outer-loop computation are feasible, they involve an engineering overhead, which our method alleviates. As such, our method is drastically more memory efficient, enabling larger populations and network sizes with minimal code changes. We also run HaDES with width 256 for completeness.

²Video link.

Figure 3: HaDES trains competitive policies on (a) Brax, using 64 state-action pairs, and (b) MinAtar, using 16 state-action pairs. For each environment, we show the mean return of the population at each generation for HaDES-F, HaDES-R and direct neuroevolution through ES, as well as the final PPO performance after $5 \times 10^7$ steps. HaDES-F matches or outperforms direct ES on all Brax environments, outperforms ES in one out of four MinAtar environments and matches it in two others. We also observe a significant gap between HaDES-F and HaDES-R, as predicted in Section 4.2. (Brax panels: Hopper, Walker2d, Reacher, Inverted Double Pendulum, Ant, Halfcheetah, Humanoid, Humanoid-Standup; MinAtar panels: SpaceInvaders, Breakout, Asterix, Freeway; x-axis: generations, y-axis: mean fitness.)

5.2 HADES DATASETS GENERALIZE ACROSS ARCHITECTURES & HYPERPARAMETERS

We now turn our attention to the datasets themselves and use them to train new policies in a zero-shot fashion, i.e. without additional environment interactions. We take two synthetic datasets for Hopper, one generated with HaDES-R and another with HaDES-F, and use them to train new policies from scratch. For each dataset, and for each of 7 different policy network sizes, we train 50 policies, each using a different learning rate and number of training epochs. In particular, the learning rates span 3 orders of magnitude and the number of training epochs ranges uniformly between 100 and 500.
Figure 4: Hopper dataset transfer to other architectures and training parameters. We take a synthetic dataset of 64 state-action pairs evolved for policy networks of width 512 (highlighted) and use it to train policies with varying widths (32 to 2056) and 50 hyperparameter combinations per width, plotting the top 50% within each width group (x-axis: policy network width, y-axis: mean return). HaDES-F indicates that the dataset was evolved with a fixed $\pi_0$; the HaDES-R dataset was evolved with randomized $(\pi_0^1, \ldots, \pi_0^k)_i$ and generalizes much better across all architectures and training parameters. This holds generally across environments (see Appendix C.2).

For each dataset and width, we discard the worst 25 policies and plot the return distribution of the remaining 25 in Fig. 4. The datasets were evolved for policies of width 512, a fixed learning rate, and a fixed number of epochs, but readily generalize out of distribution to training with different settings and architectures. In particular, we see that the HaDES-R dataset is more robust to changes in both policy architecture and training parameters than the HaDES-F dataset, which incorporates a stronger inductive bias.

We hypothesize that the generalization properties of the synthetic datasets can be further improved by randomizing not only the policy initialization, as in HaDES-R, but also the architectures and training parameters, thereby reducing some of the inductive biases present in our implementation. How to best navigate the trade-off between generalization, dataset size and policy performance remains an interesting question for future work.

5.3 HADES DATASETS CAN BE APPLIED TO ZERO-SHOT MULTI-TASKING

Figure 5: We use the synthetic datasets to train multi-task agents without any additional environment interaction. We plot the normalized fitness of agents trained either on the correct dataset for their environment, the wrong dataset for their environment, or a combined dataset, merged through concatenation and zero-padding so that the observation sizes match (y-axis: normalized return %). Left: we train multi-task agents that achieve 50% normalized fitness on Halfcheetah and Hopper. Right: we train agents that achieve 100% normalized fitness on Humanoid and Humanoid-Standup. The multi-task policy architecture and training parameters were not optimized. We plot mean ± std. err. across 10 seeds. This shows that synthetic datasets can accelerate future research on RL foundation models.

We showcase an application of the distilled datasets by training multi-task Brax agents in a zero-shot fashion. To this end, we merge datasets for two different environments in the following way. Let $D_1 = (S_1, A_1)$ and $D_2 = (S_2, A_2)$ be the two datasets, each with 64 state-action pairs. We first zero-pad the states in $S_1$ on the left and the states in $S_2$ on the right, such that each padded state has the combined dimension $|s_1| + |s_2|$, where $S_1 \subset \mathbb{R}^{|s_1|}$ and $S_2 \subset \mathbb{R}^{|s_2|}$. We then do the same for the actions, such that each padded action has size $|a_1| + |a_2|$, where $A_1 \subset \mathbb{R}^{|a_1|}$ and $A_2 \subset \mathbb{R}^{|a_2|}$. Finally, we take the union of both padded datasets to build a merged dataset $D_{\text{merged}}$ with 128 state-action pairs.
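A minimal NumPy sketch of this merging procedure is given below. The padding layout (dataset 1 padded on the left, dataset 2 on the right) follows the description above; the example observation and action dimensions in the usage line are placeholders, not the actual Brax dimensions.

```python
import numpy as np

def merge_datasets(S1, A1, S2, A2):
    """Merge two behaviour distillation datasets into one multi-task dataset.

    States from dataset 1 are zero-padded on the left and states from dataset 2
    on the right, so both live in R^(|s1|+|s2|); actions are padded the same way.
    """
    n1, s1 = S1.shape
    n2, s2 = S2.shape
    a1, a2 = A1.shape[1], A2.shape[1]

    S1_pad = np.concatenate([np.zeros((n1, s2)), S1], axis=1)  # left-pad env-1 states
    S2_pad = np.concatenate([S2, np.zeros((n2, s1))], axis=1)  # right-pad env-2 states
    A1_pad = np.concatenate([np.zeros((n1, a2)), A1], axis=1)
    A2_pad = np.concatenate([A2, np.zeros((n2, a1))], axis=1)

    # Union of the two padded datasets: 64 + 64 = 128 state-action pairs.
    S_merged = np.concatenate([S1_pad, S2_pad], axis=0)
    A_merged = np.concatenate([A1_pad, A2_pad], axis=0)
    return S_merged, A_merged

# Placeholder shapes for illustration only (not the real Brax observation/action sizes):
S_m, A_m = merge_datasets(np.zeros((64, 11)), np.zeros((64, 3)),
                          np.zeros((64, 18)), np.zeros((64, 6)))
print(S_m.shape, A_m.shape)  # (128, 29) (128, 9)
```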
We then train agents on either $D_1$, $D_2$ or $D_{\text{merged}}$ using behaviour cloning and report the normalized performance on both environments in Fig. 5. We perform this experiment for two pairs of environments: (Halfcheetah, Hopper) and (Humanoid, Humanoid-Standup). Blue indicates the baseline performance of training on $D_i$ and evaluating on environment $i$. Red shows the performance of training on the dataset for the other environment and evaluating on environment $i$, and is a loose proxy for how much the data from one environment helps to learn a policy for the other. Finally, green shows the performance of policies trained on $D_{\text{merged}}$. We see that the multi-task agents achieve roughly 50% of the single-task performance in the first pair of environments, but suffer no loss in performance in the second pair.

This shows that the synthetic datasets evolved by HaDES can vastly accelerate future research on RL foundation models. Indeed, given those datasets, training new models takes only a few seconds, which makes it possible to experiment with architectures and multi-task representation learning at a fraction of the original computational cost. Furthermore, it allows studying the properties of cross-task parameter sharing and representation learning in isolation, separately from the exploration issues inherent to reinforcement learning.

5.4 HADES CAN BE APPLIED TO SUPERVISED DATASET DISTILLATION

While our focus is on behaviour distillation in RL, our method is readily applicable to the standard dataset distillation setting by replacing the environment return with a cross-entropy loss on some target dataset. We apply our method to dataset distillation with 1 image per class on MNIST and FashionMNIST, with results in Table 1. We use the cross-entropy loss on the training set as the fitness for HaDES and report the mean accuracy on the test set obtained by training classifiers on the final synthetic dataset. We run 3 different seeds and train 20 classifiers for each final dataset. Similar to Zhao et al. (2020) and Zhao & Bilen (2021), we report the mean and standard deviation across all 60 final classifiers and compare against the best method in these settings, namely RFAD (Loo et al., 2022). We find that HaDES performs competitively on 10-image MNIST and achieves state-of-the-art results on FashionMNIST. However, we were unable to scale our method to CIFAR-10, whose images have many more parameters due to being RGB rather than greyscale.

Table 1: Test set accuracy of classifiers trained on datasets composed of 1 image per class. We compare to RFAD, which is the SotA for non-factorized dataset distillation on these datasets (Sachdeva & McAuley, 2023). RFAD uses a ConvNet architecture for testing, while we use a smaller CNN. Despite being designed for RL, our method is also competitive for image-based dataset distillation and achieves state-of-the-art distillation for 10-image FashionMNIST.

|           | MNIST: RFAD | MNIST: Ours | FashionMNIST: RFAD | FashionMNIST: Ours |
|-----------|-------------|-------------|--------------------|--------------------|
| 1 img/cls | 94.4 ± 1.5  | 90.1 ± 0.3  | 78.6 ± 1.3         | 80.2 ± 0.4         |
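In this supervised setting only the fitness function changes: instead of a rollout return, a candidate synthetic dataset is scored by the (negative) cross-entropy of a classifier trained on it and evaluated on the real training set. Below is a minimal NumPy sketch with a linear softmax classifier standing in for the small CNN used in the paper; all details beyond that substitution are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def supervised_fitness(syn_x, syn_y, real_x, real_y, n_classes, epochs=400, lr=0.5):
    """Fitness of a synthetic dataset: negative cross-entropy on the real training
    set after fitting a classifier on the synthetic examples only."""
    W = np.zeros((syn_x.shape[1], n_classes))
    onehot = np.eye(n_classes)[syn_y]
    for _ in range(epochs):                      # train on the synthetic data
        p = softmax(syn_x @ W)
        W -= lr * syn_x.T @ (p - onehot) / len(syn_x)
    p_real = softmax(real_x @ W)                 # evaluate on the real data
    ce = -np.log(p_real[np.arange(len(real_y)), real_y] + 1e-12).mean()
    return -ce                                   # higher fitness = lower loss
```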
In tuning our method, we find that the two most important hyperparameters are the outer learning rate and the dataset initialization. We find that initializing the dataset at the class-wise mean of the data works best. This is similar to the findings of Zhao & Bilen, who also find that warm-starting the dataset performs better than initializing it from scratch.

5.5 HADES EXPLAINABILITY

A final benefit of the HaDES datasets is that the synthetic examples lend themselves to interpretability. Going back to Fig. 1, we see that the resulting datasets have intuitive properties. For instance, the two-state dataset for Cartpole captures that the policy should go left if the pole is leaning left and go right otherwise. Performing a full explainable-RL study lies beyond the scope of this paper; in a critical analysis, Atrey et al. highlight the importance of taking a hypothesis-driven approach to explaining deep RL policies: formulating possible explanations, then testing them rigorously with ablations and careful experiments. With that in mind, we argue that our synthetic datasets are an effective starting point for such hypothesis testing, for instance by applying transformations to the datasets and observing how they impact the trained policies.

6 DISCUSSION AND CONCLUSION

In this paper, we introduced a new parametrization for policy neuroevolution that evolves a small synthetic dataset and trains a policy on it through behavioural cloning. We showed that our method produces policies with competitive returns in continuous control and discrete tasks. Our method can be used for behaviour distillation, summarizing all relevant information about optimal behaviour in an environment into a small synthetic dataset. This "behaviour floppy disk" can quickly train new policies parametrized by a range of different architectures. We then demonstrated the utility of the distilled datasets by training multi-task models. We finished by showing that although our focus is on RL, our method also applies to vanilla dataset distillation in supervised learning, where we achieved state-of-the-art results in one setting.

The main limitation of this work is computational, since evolutionary methods require a large population to be effective. Furthermore, while our alternative parameterization enables us to evolve larger neural networks than standard neuroevolution, the number of evolved parameters still grows linearly with the number of datapoints, especially in pixel-based environments, which tend to be very high-dimensional. Tackling this issue, for instance by employing factorized distillation, is therefore a promising avenue for future work. Another downside of our work is the number of hyperparameters, since we need to tune both the ES parameters in the outer loop and the supervised learning ones in the inner loop. However, anecdotal evidence indicates that ES can adapt to the inner-loop parameters, for instance by increasing the magnitude of the dataset if the learning rate is low. Understanding the interplay between these parameters would allow for faster and better tuning; a related approach would be to evolve the inner-loop parameters along with the dataset. Finally, possible applications of the distilled datasets are ripe for investigation, for instance in continual or life-long learning, or in regularizing datasets to further promote interpretability.
REPRODUCIBILITY STATEMENT

To encourage reproducibility, we describe the method in detail in Section 4.2, include pseudocode (Algorithm 1) and provide hyperparameters in Appendix B.1, in a format that corresponds directly to the configs used by our code. We also open-source our code and our synthetic datasets at https://github.com/FLAIROx/behaviour-distillation.

ACKNOWLEDGMENTS

Andrei Lupu was partially funded by a Fonds de recherche du Québec doctoral training scholarship.

REFERENCES

Akanksha Atrey, Kaleigh Clary, and David Jensen. Exploratory not explanatory: Counterfactual analysis of saliency maps for deep reinforcement learning. arXiv preprint arXiv:1912.05743, 2019.

Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.

Richard Bellman. Dynamic programming. Science, 153(3731):34-37, 1966.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A Efros, and Jun-Yan Zhu. Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4750-4759, 2022.

John D Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Sergey Levine, Quoc V Le, Honglak Lee, and Aleksandra Faust. Evolving reinforcement learning algorithms. arXiv preprint arXiv:2101.03958, 2021.

Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. Selection via proxy: Efficient data selection for deep learning. arXiv preprint arXiv:1906.11829, 2019.

Zhiwei Deng and Olga Russakovsky. Remember the past: Distilling datasets into addressable memories for neural networks. Advances in Neural Information Processing Systems, 35:34391-34404, 2022.

Fabio Ferreira, Thomas Nierhoff, Andreas Saelinger, and Frank Hutter. Learning synthetic environments and reward networks for reinforcement learning. arXiv preprint arXiv:2202.02790, 2022.

C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - a differentiable physics engine for large scale rigid body simulation, 2021. URL http://github.com/google/brax.

Hiroki Furuta, Yusuke Iwasawa, Yutaka Matsuo, and Shixiang Shane Gu. A system for morphology-task generalization via unified representation and behavior distillation. arXiv preprint arXiv:2211.14296, 2022.

David Ha, Andrew M Dai, and Quoc V Le. Hypernetworks. In International Conference on Learning Representations, 2016.

Nikolaus Hansen. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.

Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. arXiv preprint arXiv:1802.04821, 2018.

Matthew Thomas Jackson, Chris Lu, Louis Kirsch, Robert Tjarko Lange, Shimon Whiteson, and Jakob Nicolaus Foerster. Discovering temporally-aware reinforcement learning algorithms. arXiv preprint arXiv:2402.05828, 2024.

Wei Jin, Lingxiao Zhao, Shichang Zhang, Yozen Liu, Jiliang Tang, and Neil Shah. Graph condensation for graph neural networks. arXiv preprint arXiv:2110.07580, 2021.
Wei Jin, Xianfeng Tang, Haoming Jiang, Zheng Li, Danqing Zhang, Jiliang Tang, and Bing Yin. Condensing graphs via one-step gradient matching. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 720-730, 2022.

Robert Tjarko Lange. gymnax: A JAX-based reinforcement learning environment library, 2022. URL http://github.com/RobertTLange/gymnax.

Robert Tjarko Lange. evosax: JAX-based evolution strategies. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, pp. 659-662, 2023.

Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Yuh-Jye Lee and Su-Yun Huang. Reduced support vector machines: A statistical theory. IEEE Transactions on Neural Networks, 18(1):1-13, 2007.

Yuh-Jye Lee and Olvi L Mangasarian. RSVM: Reduced support vector machines. In Proceedings of the 2001 SIAM International Conference on Data Mining, pp. 1-17. SIAM, 2001.

Shiye Lei and Dacheng Tao. A comprehensive survey to dataset distillation. arXiv preprint arXiv:2301.05603, 2023.

Noel Loo, Ramin Hasani, Alexander Amini, and Daniela Rus. Efficient dataset distillation using random feature approximation. Advances in Neural Information Processing Systems, 35:13877-13891, 2022.

Chris Lu, Jakub Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, and Jakob Foerster. Discovered policy optimisation. Advances in Neural Information Processing Systems, 35:16455-16468, 2022.

Chris Lu, Timon Willi, Alistair Letcher, and Jakob Nicolaus Foerster. Adversarial cheap talk. In International Conference on Machine Learning, pp. 22917-22941. PMLR, 2023.

Luke Metz, C Daniel Freeman, Samuel S Schoenholz, and Tal Kachman. Gradients are not all you need. arXiv preprint arXiv:2111.05803, 2021.

Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel ridge-regression. arXiv preprint arXiv:2011.00050, 2020.

Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 661-668, Chia Laguna Resort, Sardinia, Italy, 13-15 May 2010. PMLR. URL https://proceedings.mlr.press/v9/ross10a.html.

Noveen Sachdeva and Julian McAuley. Data distillation: A survey. arXiv preprint arXiv:2301.04272, 2023.

Noveen Sachdeva, Mehak Dhaliwal, Carole-Jean Wu, and Julian McAuley. Infinite recommendation networks: A data-centric approach. Advances in Neural Information Processing Systems, 35:31292-31305, 2022.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

Hans-Paul Schwefel. Evolutionsstrategien für die numerische Optimierung. In Numerische Optimierung von Computer-Modellen mittels der Evolutionsstrategie, pp. 123-176. Springer, 1977.

Kenneth O Stanley, David B D'Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185-212, 2009.

Kenneth O Stanley, Jeff Clune, Joel Lehman, and Risto Miikkulainen. Designing neural networks through neuroevolution. Nature Machine Intelligence, 1(1):24-35, 2019.
Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.

Jigang Wang, Predrag Neskovic, and Leon N Cooper. Training data selection for support vector machines. In International Conference on Natural Computation, pp. 554-564. Springer, 2005.

Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You. CAFE: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12196-12205, 2022.

Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.

Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949-980, 2014.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Kenny Young and Tian Tian. MinAtar: An Atari-inspired testbed for thorough and reproducible reinforcement learning experiments. arXiv preprint arXiv:1903.03176, 2019.

Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, pp. 12674-12685. PMLR, 2021.

Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 6514-6523, 2023.

Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. Dataset condensation with gradient matching. arXiv preprint arXiv:2006.05929, 2020.

A HADES ALGORITHM

Algorithm 1: HaDES
Require: dataset size n, environment Env
Require: number of meta steps T, population size P, learning rate α, std σ
 1: Initialize D_ϕ = {(s, a)_1, ..., (s, a)_n}            ▷ e.g. randomly or sampled
 2: for meta step = 0, ..., T do
 3:   Initialize ξ ∼ N(0, σ)
 4:   for i = 0, ..., P do
 5:     if i is even then
 6:       Perturb D_ϕ with noise ξ_i = ξ to get D_i       ▷ antithetic noise
 7:     else
 8:       Perturb D_ϕ with noise ξ_i = −ξ to get D_i
 9:       Update ξ ∼ N(0, σ)
10:     end if
11:     Initialize policy π_θ
12:     Train policy π_θ on D_i using BC
13:     Unroll π_θ and compute expected return J_i = J(π_θ, Env | D_i)
14:   end for
15:   Approximate ∇_ϕ J ≈ (1 / (P σ)) Σ_i J_i ξ_i
16:   Update D_ϕ = D_ϕ + α ∇_ϕ J
17: end for
18: Train policy π_θ on final D_ϕ
19: return (π_θ, D_ϕ)
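For readers who prefer code, the following NumPy sketch mirrors the outer loop of Algorithm 1. The inner loop (behaviour cloning plus rollout) is abstracted into a `fitness` callable of the kind sketched earlier in Section 4; the dataset is treated as a flat parameter vector containing observations and (continuous) action labels. This is a simplified illustration, not the open-sourced JAX implementation, and omits details such as fitness normalization, sigma/learning-rate schedules, and virtual batch normalization.

```python
import numpy as np

def hades_outer_loop(init_dataset, fitness, n_generations=2000,
                     pop_size=2048, sigma=0.03, lr=0.05, seed=0):
    """Outer loop of Algorithm 1: evolve a flat dataset vector with antithetic ES.

    `fitness(dataset_vector) -> float` is assumed to behaviour-clone a freshly
    initialized policy on the dataset and return its episodic return.
    """
    rng = np.random.default_rng(seed)
    d_phi = init_dataset.astype(np.float64).ravel()

    for _ in range(n_generations):
        # Antithetic noise: pop_size/2 directions, each used with + and - sign.
        half = rng.normal(0.0, sigma, size=(pop_size // 2, d_phi.size))
        noise = np.concatenate([half, -half], axis=0)
        returns = np.array([fitness(d_phi + xi) for xi in noise])

        # Gradient estimate and dataset update (steps 15-16 of Algorithm 1).
        grad = noise.T @ returns / (pop_size * sigma)
        d_phi = d_phi + lr * grad

    return d_phi.reshape(init_dataset.shape)
```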
B IMPLEMENTATION DETAILS

B.1 HYPERPARAMETERS

Table 2: Hyperparameters for HaDES in Brax. Top: inner-loop parameters. Bottom: outer-loop parameters.

| Parameter | Value |
|---|---|
| LR | 0.005 |
| NUM ENVS | 4 |
| NUM STEPS | 1024 |
| UPDATE EPOCHS | 400 |
| MAX GRAD NORM | 0.5 |
| ACTIVATION | tanh |
| WIDTH | 512 |
| ANNEAL LR | False |
| GREEDY ACT | False |
| CONST NORMALIZE OBS | False |
| NORMALIZE OBS | True |
| NORMALIZE REWARD | True |
| popsize | 2048 |
| dataset size | 64 |
| rollouts per candidate | 1 |
| n generations | 2000 |
| sigma init | 0.03 |
| sigma decay | 1.0 |
| lrate init | 0.05 |
| Evo. strategy | OpenES |

Table 3: Hyperparameters for HaDES in MinAtar. Top: inner-loop parameters. Bottom: outer-loop parameters.

| Parameter | Value |
|---|---|
| NET | mlp |
| LR | 0.03 |
| NUM ENVS | 8 |
| NUM STEPS | 1024 |
| UPDATE EPOCHS | 64 |
| MAX GRAD NORM | 0.5 |
| ACTIVATION | relu |
| WIDTH | 512 or 256 |
| ANNEAL LR | True |
| GREEDY ACT | False |
| CONST NORMALIZE OBS | True |
| NORMALIZE OBS | False |
| NORMALIZE REWARD | False |
| popsize | 2048 |
| dataset size | 16 |
| rollouts per candidate | 2 |
| n generations | 5000 |
| sigma init | 0.5 |
| sigma limit | 0.01 |
| sigma decay | 1.0 |
| lrate init | 0.05 |
| lrate decay | 1.0 |
| Evo. strategy | SNES |
| temperature | 20.0 |

B.2 EXPERIMENTAL DETAILS

For all RL tasks we use Brax (Freeman et al., 2021), a suite of continuous control environments, and MinAtar (Young & Tian, 2019), a set of Atari-like environments. For dataset distillation, we report results on two image classification tasks: MNIST (LeCun, 1998), which is composed of handwritten digits, and FashionMNIST (Xiao et al., 2017), which features different clothing items. For the evolutionary algorithm, we use OpenES (Salimans et al., 2017) for Brax and image classification, and SNES (Wierstra et al., 2014) for MinAtar. In the inner loop, we minimize either the cross-entropy loss (discrete cases) or the negative log-likelihood of the synthetic actions (continuous-action cases).

All of our runs use 8 Nvidia V100 GPUs and take between 1 and 17 seconds per outer-loop generation. Detailed generation times are reported in Table 4. These times include outer-loop operations (all methods), inner-loop policy training (HaDES only), and inner-loop policy evaluation (all methods). HaDES-R is slightly slower than HaDES-F since it trains two policies instead of one. Because we train policies from scratch every generation, the reported times are strict upper bounds on how long it takes to train a policy on the final distilled datasets.

In image classification and MinAtar, we assign labels (i.e. classes or discrete actions) uniformly, whereas in Brax we evolve the dataset labels alongside the observations, since the environments feature continuous actions. We implement our algorithm in JAX (Bradbury et al., 2018) using the PureJaxRL (Lu et al., 2022), gymnax (Lange, 2022) and evosax (Lange, 2023) libraries to enable parallel training on hardware accelerators. We also use virtual batch normalization (Salimans et al., 2016) to stabilize training, which was previously found to be crucial in stabilizing ES (Salimans et al., 2017).

Table 4: Runtime of the different neuroevolution methods in seconds per generation. Times are averaged over 3 seeds and rounded to the nearest tenth of a second. Standard deviations are omitted, but the difference between the fastest and slowest runs for any setting is usually smaller than 0.2 seconds.

| Environment | ES Neuroevolution | HaDES-F | HaDES-R |
|---|---|---|---|
| Hopper | 4.0 | 5.6 | 6.3 |
| Walker2d | 3.4 | 7.9 | 8.5 |
| Reacher | 2.7 | 6.7 | 7.2 |
| Inverted Double Pendulum | 2.6 | 4.4 | 4.6 |
| Ant | 6.6 | 11.1 | 12.1 |
| Halfcheetah | 10.7 | 14.8 | 16.9 |
| Humanoid | 7.8 | 13.9 | 15.1 |
| Humanoid Standup | 8.6 | 14.8 | 16.4 |
| SpaceInvaders-MinAtar | 1.6 | 1.5 | 1.6 |
| Breakout-MinAtar | 1.3 | 1.6 | 1.8 |
| Asterix-MinAtar | 1.9 | 2.1 | 2.4 |
| Freeway-MinAtar | 2.3 | 2.2 | 2.9 |

C ADDITIONAL RESULTS

C.1 IMPACT OF DISTILLATION BUDGET ON PERFORMANCE

Figure 6: Final return of HaDES policies as a function of distillation budget (i.e. dataset size, from 4 to 256 state-action pairs) on three environments, including Halfcheetah and Humanoid.
ES neuroevolution and PPO returns are also plotted for reference.

In Fig. 6, we investigate the impact of the distillation budget on the final performance of policies trained with HaDES in three different environments. We observe that dataset sizes that are too small degrade performance, likely because they cannot contain all the information required to train an expert policy. This is particularly noticeable in Humanoid, where for a dataset of 4 state-action pairs the score drops as low as 1013. However, a score of 1000 corresponds to a humanoid policy that keeps its balance and stays immobile, with lower scores indicating that the policy falls and causes an early termination. This indicates that for distillation budgets too low to capture expert behaviour, HaDES does not fail to learn, and will still optimize return within the constraints of the budget. On the opposite end of the spectrum, we also observe the return dropping for large dataset sizes (|D| = 256), despite the increased expressivity. This is possibly due to ES (and therefore HaDES) scaling poorly to a large number of parameters. This problem could be alleviated by relying on better ES methods, or by using a factorized approach to distillation.

C.2 DATASET GENERALIZATION ACROSS ARCHITECTURES AND HYPERPARAMETERS

Here we plot generalization results for additional environments. As expected, HaDES-R generalizes better than HaDES-F, both to new hyperparameters and to new architectures. In each figure, the policy network width ranges from 32 to 2056, and we plot the mean return obtained with the HaDES-R and HaDES-F datasets.

Figure 7: Hopper dataset transfer to other architectures and training parameters.
Figure 8: Walker2d dataset transfer to other architectures and training parameters.
Figure 9: Reacher dataset transfer to other architectures and training parameters.
Figure 10: Inverted double pendulum dataset transfer to other architectures and training parameters.
Figure 11: Ant dataset transfer to other architectures and training parameters.
Figure 12: Halfcheetah dataset transfer to other architectures and training parameters.
Figure 13: Humanoid dataset transfer to other architectures and training parameters.
Figure 14: Humanoidstandup dataset transfer to other architectures and training parameters.