# Generalization through Diversity: Improving Unsupervised Environment Design

Wenjun Li, Pradeep Varakantham, Dexun Li
Singapore Management University
{wjli.2020, pradeepv, dexunli.2019}@smu.edu.sg

Agent decision making using Reinforcement Learning (RL) heavily relies on either a model or simulator of the environment (e.g., moving in an 8x8 maze with three rooms, playing Chess on an 8x8 board). Due to this dependence, small changes in the environment (e.g., positions of obstacles in the maze, size of the board) can severely affect the effectiveness of the policy learned by the agent. To that end, existing work has proposed training RL agents on an adaptive curriculum of environments (generated automatically) to improve performance on out-of-distribution (OOD) test scenarios. Specifically, existing research has employed the potential for the agent to learn in an environment (captured using Generalized Advantage Estimation, GAE) as the key factor to select the next environment(s) to train the agent. However, such a mechanism can select similar environments (with a high potential to learn), thereby making agent training redundant on all but one of those environments. To that end, we provide a principled approach to adaptively identify diverse environments based on a novel distance measure relevant to environment design. We empirically demonstrate the versatility and effectiveness of our method in comparison to multiple leading approaches for unsupervised environment design on three distinct benchmark problems used in the literature.

## 1 Introduction

Deep Reinforcement Learning (DRL) has had many successes in challenging tasks (e.g., Atari [Mnih et al., 2015], AlphaGo [Silver et al., 2016], solving Rubik's cube [Akkaya et al., 2019], chip design [Mirhoseini et al., 2021]) in the last decade. However, DRL agents have been proven to be brittle and often fail to transfer well to environments only slightly different from those encountered during training [Zhang et al., 2018; Cobbe et al., 2019], e.g., a changed maze size, game background, or placement of obstacles. To achieve robust and generalizing policies, agents must be exposed to different types of environments that assist in learning different challenges [Cobbe et al., 2020]. To that end, existing research has proposed Unsupervised Environment Design (UED), which can create a curriculum (or distribution) of training scenarios that are adaptive with respect to the agent policy [Mehta et al., 2020; Portelas et al., 2020; Wang et al., 2019a; Dennis et al., 2020; Jiang et al., 2021b; Jiang et al., 2021a]. Following the terminology in existing works, we will refer to a particular environment configuration as a level, e.g., an arrangement of blocks in the maze, the shape/height of the obstacles in front of the agent, the geometric configuration of racing tracks, etc. In the rest of this paper, we will use levels and environments interchangeably.

### 1.1 Related Work on UED

Existing works on UED can be categorized along two threads. The first thread has focused on principled generation of levels and was pioneered by the Protagonist Antagonist Induced Regret Environment Design (PAIRED, [Dennis et al., 2020]) algorithm.
PAIRED introduced a multi-agent game between a level generator (teacher), an antagonist (expert student) and a protagonist (normal student), and utilized the performance difference between antagonist and protagonist as a reward signal to guide the generator to adaptively create challenging training levels. The second thread has abandoned the idea of principled level generation and instead championed: (a) randomly generating levels and (b) replaying previously considered levels to deal with catastrophic forgetting by the student agent. The algorithm is referred to as Prioritized Level Replay (PLR, [Jiang et al., 2021b]) and the levels are replayed based on learning potential, captured using Generalized Advantage Estimation (GAE, [Schulman et al., 2015]). PLR has been empirically shown to be more scalable and to achieve more robust and better out-of-distribution performance than PAIRED. In [Jiang et al., 2021a], the authors combined the randomized generator and the replay mechanism together and proposed PLR⊥. Empirically, PLR⊥ achieves the state-of-the-art in the literature (as a method that demands no human expertise), and thus we adopt PLR⊥ as our primary baseline in this paper. Another recent approach, ACCEL [Parker-Holder et al., 2022], builds on PLR⊥ and performs edits (i.e., modifications based on human expertise) on high-regret environments to learn better. Both these threads of research have significantly improved the state-of-the-art (domain randomization [Sadeghi and Levine, 2016; Tobin et al., 2017] and adversarial learning [Pinto et al., 2017]).

However, to generalize better, it is not sufficient to train the agent only on high-regret/GAE levels. There can be multiple high-regret levels that are very similar to each other, and the agent does not learn a lot from being trained on similar levels. Thus, levels also have to be sufficiently different, so that the agent can gain more perspective on the different challenges it can face. To that end, in this paper, we introduce a diversity metric in UED, defined based on the distance between the occupancy distributions associated with agent trajectories from different levels. We then provide a principled method, referred to as Diversity Induced Prioritized Level Replay (DIPLR), which selects levels using this diversity metric and provides better generalization performance than the state-of-the-art.

### 1.2 Related Work on Diversity in RL

There is existing literature on diversity measurement in the field of transfer RL, e.g., [Song et al., 2016; Agarwal et al., 2021a; Wang et al., 2019a]. However, these works conduct an exhaustive comparison of all possible states between two underlying Markov Decision Processes (MDPs), which does not take the current agent policy into account and may suffer from a large computational cost. In contrast, we use the collected trajectories induced by the policy to approximate the occupancy distributions of the different levels. Such a method avoids massive computation and more robustly indicates the difference between two levels as explored by the same agent policy. There are a few works that have explored the diversity between policies in DRL. 1) [Hong et al., 2018] adopted a distance on pairwise policies to modify the loss function and enforce that the agent attempts policies different from its prior policies.
As an improved version of [Hong et al., 2018], [Parker-Holder et al., 2020] computes diversity on a population of policies instead of a pairwise distance, so that it avoids cyclic training behaviors and policy redundancy. 2) [Lupu et al., 2021] proposed a trajectory diversity measurement based on the Jensen-Shannon Divergence (JSD) and included it in the loss function to maximize the difference between policies; the JSD computes the divergence based on the action distributions in trajectories. In contrast to these works, our method computes both pairwise and population-wise distances for the purpose of measuring the difference between levels. Furthermore, we provide a diversity-guided method to train a robust and well-generalizing agent in the UED framework.

### 1.3 Contributions

Overall, our contributions are three-fold. First, we highlight the benefits of diversity in UED tasks and formally present the definition of distance between levels. Second, we employ the Wasserstein distance to quantitatively measure this distance and introduce the DIPLR algorithm. Finally, we empirically demonstrate the versatility and effectiveness of DIPLR in comparison to other leading approaches on benchmark problems. Notably, we also investigate the relationship between diversity and regret (i.e., learning potential) and conduct an ablation study to reveal their individual effectiveness. Surprisingly, we find that diversity alone works better than regret in the model, and that they complement each other to achieve the best performance when combined.

## 2 Background

In this section, we describe the problem of UED and also the details of the leading approaches for solving UED.

### 2.1 Unsupervised Environment Design, UED

The goal of UED is to train a student agent that performs well across a large set of different environments. To achieve this goal, there is a teacher agent in UED that provides a curriculum of environment parameter values to train the student agent to generalize well to unseen levels. The UED problem is formally described using an Underspecified Partially Observable Markov Decision Process (UPOMDP), defined by the tuple

$$\langle S, A, \Theta, I, O, T, R, \gamma \rangle$$

$S$, $A$ and $O$ are the sets of states, actions and observations respectively. $R : S \rightarrow \mathbb{R}$ is the reward function and $\gamma$ is the discount factor. The most important element in the tuple is $\Theta$, the set of environment parameters. A particular parameter $\theta \in \Theta$ (which can be a vector or sequence of values) defines a level and can impact the reward model, transition dynamics and observation function, i.e., $R : S \times \Theta \rightarrow \mathbb{R}$, $T : S \times A \times \Theta \rightarrow S$ and $I : S \times \Theta \rightarrow O$. A UPOMDP is underspecified because we cannot train with all values of $\theta \in \Theta$, as $\Theta$ can be infinitely large. The goal of the student agent policy $\pi$ in a UPOMDP is to maximize its discounted expected rewards for any given $\theta \in \Theta$:

$$\max_\pi V^\theta(\pi) = \max_\pi \mathbb{E}_\pi\big[V^\theta(\tau)\big] = \max_\pi \mathbb{E}_\pi\Big[\sum_{t=0}^{H} r^\theta_t \gamma^t\Big]$$

where $r^\theta_t$ is the reward obtained by $\pi$ at time step $t$ in a level with environment parameter $\theta$. Thus, the student agent needs to be trained on a series of $\theta$ values that maximize its generalization ability over all possible levels from $\Theta$. To that end, we employ the teacher agent. The goal of the teacher agent policy $\Lambda \in \Pi$ is to generate a distribution over the next set of environment parameter values on which to train the student, i.e., to achieve good generalization performance, where $\Pi$ is the set of possible policies of the teacher.

### 2.2 Approaches for Solving UED

In all approaches for solving UED, the student always optimizes a policy that maximizes its value on the given $\theta$.
The different approaches for solving UED vary in the method adopted by the teacher agent. We now elaborate on the $\Lambda$ employed by different approaches. Two fundamental approaches to UED are Domain Randomization (DR) [Jakobi, 1997; Tobin et al., 2017] and Minimax [Morimoto and Doya, 2005; Pinto et al., 2017]. The teacher in DR simply randomizes the environment configurations regardless of the student's policy, i.e.,

$$\Lambda_{DR}(\pi) = U(\Theta)$$

where $U$ is the uniform distribution over all possible $\theta$ values. On the other hand, the teacher in Minimax adversarially generates challenging environments to minimize the rewards of the student's policy, i.e.,

$$\Lambda_{MM}(\pi) = \arg\min_\theta V^\theta(\pi)$$

The next set of approaches, related to PAIRED [Dennis et al., 2020; Jiang et al., 2021a], rely on regret, defined approximately as the difference between the maximum and the mean return of the student's policy, to generate new levels:

$$\text{reg}^\theta(\pi) \approx \max_{\tau \sim \pi} V^\theta(\tau) - \mathbb{E}_{\tau \sim \pi}\big[V^\theta(\tau)\big]$$

Given this regret, the teacher selects levels that maximize regret:

$$\Lambda_{MR}(\pi) = \arg\max_\theta \text{reg}^\theta(\pi)$$

From the original paper by [Dennis et al., 2020], there have been multiple major improvements [Jiang et al., 2021b; Parker-Holder et al., 2022]:

1. New levels are generated randomly instead of using an optimizer.
2. The level to be selected at a time step is decided based on an efficient approximation of regret, referred to as positive value loss (a customized form of Generalized Advantage Estimation, GAE):
$$\text{gae}^\theta(\pi) = \frac{1}{H}\sum_{t=0}^{H} \max\Big(\sum_{k=t}^{H} (\gamma\lambda)^{k-t}\,\delta_k,\; 0\Big) \tag{1}$$
where $\gamma$ and $\lambda$ are the MDP and GAE discount factors respectively, $H$ is the time horizon and $\delta_k$ is the TD-error at time step $k$.
3. Levels are replayed with a certain probability to ensure there is no catastrophic forgetting.
4. Generated levels are edited randomly through human-defined edits [Parker-Holder et al., 2022].

We will use learning potential, regret and GAE interchangeably to represent positive value loss in the rest of the paper.

### 2.3 Wasserstein Distance

We will propose a diversity measure that relies on the distance between occupancy distributions, for which we utilize the Wasserstein distance. The problem of optimal transport between two distributions was initially proposed by [Monge, 1781] and generalized by [Kantorovich, 1942]. The Wasserstein distance is preferred by the machine learning community among several commonly used divergence measures, e.g., the Kullback-Leibler (KL) divergence, because it is 1) non-zero and continuously differentiable when the supports of the two distributions are disjoint, and 2) symmetric, i.e., $W(P, Q)$ is equal to $W(Q, P)$. The Wasserstein distance between probability distributions defined on a metric space $M(d, C)$ with a cost metric $d : C \times C \rightarrow \mathbb{R}^+$ is:

$$W_p(P, Q) = \inf_{\psi \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \psi}\big[d(x, y)^p\big] \tag{2}$$

where $\Pi(P, Q)$ is the set of all possible joint probability distributions between $P$ and $Q$, and a joint distribution $\psi \in \Pi(P, Q)$ represents a density transport plan from point $x$ to $y$ so as to make $x$ follow the same probability distribution as $y$.

## 3 Approach: DIPLR

A key drawback of existing approaches for solving UED is the inherent assumption that levels with high regret (or GAE) all have high learning potential. However, if there are two levels that are very similar to each other, and they both have high regret, training the student agent on one of those levels makes training on the other unnecessary and irrelevant.
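For reference, the positive value loss in Equation (1), which the replay-based methods above use as their regret score, can be estimated from the TD errors of a single rollout. The following is a minimal numpy sketch (not the authors' implementation; the discount values shown are illustrative defaults):

```python
import numpy as np

def positive_value_loss(rewards, values, gamma=0.995, lam=0.95):
    """Average of the GAE clipped at zero over all steps of one rollout (Eq. 1).

    `rewards[t]` and `values[t]` are r_t and V(s_t) from rolling out the
    current student policy on the level; the terminal value is taken as 0.
    """
    H = len(rewards)
    values = np.append(np.asarray(values, dtype=float), 0.0)
    # TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = np.asarray(rewards, dtype=float) + gamma * values[1:] - values[:-1]
    # GAE at each step t: sum_{k>=t} (gamma * lam)^(k-t) * delta_k, via backward recursion
    gae = np.zeros(H)
    running = 0.0
    for t in reversed(range(H)):
        running = deltas[t] + gamma * lam * running
        gae[t] = running
    # Positive value loss: mean of max(GAE_t, 0) over the horizon
    return float(np.mean(np.maximum(gae, 0.0)))
```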
Given that our ultimate goal is to let the student agent policy transfer well to a variety of few-shot or zero-shot levels, we utilize diversity. Diversity has gained traction in Deep Reinforcement Learning as a way to improve the generalization performance of models, albeit in different problem settings than the one in this paper [Parker-Holder et al., 2020; Hong et al., 2018; Lupu et al., 2021]. More specifically, we provide:

1. A scalable mechanism for quantitatively estimating the similarity (or dissimilarity) between two given levels under the current student policy. This is then used indirectly to generate diverse levels.
2. A teacher agent for UED that generates diverse levels and trains the student agent in those diverse environments, so as to achieve strong generalization to out-of-distribution levels.

### 3.1 Estimating Similarity between Levels

The objective here is to estimate the similarity between levels given the current student agent policy. One potential option is to encode each level, using one of a variety of encoding methods (e.g., a Variational Autoencoder) or using the parameters associated with the level, and then take the distance between the encodings of levels. However, such an approach has multiple major issues:

- Encoding a level does not account for the sequential moves the student agent will make through the level, which is of critical importance as the similarity is with respect to the student policy.
- There can be stochasticity in the level that is not captured by the parameters of the level. For instance, Bipedal-Walker, a complex yet well-parameterized UPOMDP introduced by [Wang et al., 2019b; Parker-Holder et al., 2022], has multiple free parameters controlling the terrain. Because of the stochasticity in the level generation process, two levels can be very different while having a near-zero distance between their environment parameter vectors.
- Distance between environment parameters requires normalization in each parameter dimension and is domain-specific.

Since we already collect several trajectories on the current levels when approximating the regret value, we can naturally obtain the state-action distributions induced by the current policy.

Figure 1: An overview of the DIPLR algorithm.

Therefore, we propose to evaluate the similarity of different levels based on the distance between the occupancy distributions of the current student policy. The hypothesis is that if the levels are similar, then the trajectories traversed by the student agent within the levels will have similar state-action distributions. Equation (3) provides the expression for the state-action distribution given a policy $\pi$ and a level $l_{\theta_1}$ ($\theta_1$ represents the parameters corresponding to the level):

$$\rho^\pi_{l_{\theta_1}}(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr\big(s_t = s, a_t = a \,\big|\, s_0 \sim p_0(\cdot),\ s_t \sim p(\cdot \mid s_{t-1}, a_{t-1}, \theta_1),\ a_t \sim \pi(\cdot \mid s_t)\big) \tag{3}$$

Similarly, we can derive $\rho^\pi_{l_{\theta_2}}$ for a level $l_{\theta_2}$ with parameter $\theta_2$. Typically, the KL divergence is employed to measure the distance between state-action distributions:

$$D_{KL}\big(\rho^\pi_{l_{\theta_1}} \,\|\, \rho^\pi_{l_{\theta_2}}\big) = \mathbb{E}_{(s, a) \sim \rho^\pi_{l_{\theta_1}}}\Big[\log \frac{\rho^\pi_{l_{\theta_1}}(s, a)}{\rho^\pi_{l_{\theta_2}}(s, a)}\Big]$$

However, the KL divergence is not applicable in our setting: because of the lack of transition probabilities, the two different levels can result in occupancy distributions with disjoint supports. More importantly, the KL divergence cannot be computed without explicit estimates of the occupancy distributions.
Therefore, we employ the Wasserstein distance described in Equation (2), which can calculate the distance between two distributions from empirical samples. We can write the Wasserstein distance between two occupancy distributions as Equation (4):

$$W\big(\rho^\pi_{l_{\theta_1}}, \rho^\pi_{l_{\theta_2}}\big) = \inf_{\psi \in \Pi(\rho^\pi_{l_{\theta_1}}, \rho^\pi_{l_{\theta_2}})} \mathbb{E}_{(\phi_1, \phi_2) \sim \psi}\big[d(\phi_1, \phi_2)^p\big] \tag{4}$$

where $\phi \in (S, A)$ is a sample from the occupancy distribution. Using Equation (4), we can collect state-action samples from trajectories to compute the empirical Wasserstein distance between two levels, i.e., $D(l_{\theta_1}, l_{\theta_2}) := W\big(\rho^\pi_{l_{\theta_1}}, \rho^\pi_{l_{\theta_2}}\big)$ is our empirical estimate of the Wasserstein distance between the two levels.

### 3.2 Diversity Guided Training Agent

Now that we have a scalable way to compute the distance between levels, $D(\cdot, \cdot)$, we will utilize it in the teacher to generate a diverse curriculum for the student agent. We build on the PLR⊥ training agent [Jiang et al., 2021a], which only considered regret/GAE and staleness to prioritize levels for selection. Our training agent most crucially employs diversity to select levels. It maintains a buffer of high-potential levels for training. At each iteration, we either: (a) generate a new level to be added to the buffer; or (b) sample a mini-batch of levels for the student agent to train on.

#### Diversity When Generating a New Level

For a new level $l_\theta$, the distance from $l_\theta$ to all the levels $L = \{l_{\theta_1}, l_{\theta_2}, \dots\}$ in the buffer is given by:

$$D(l_\theta, L) = \min_k D(l_\theta, l_{\theta_k}) \tag{5}$$

To increase diversity, we add to the buffer the new levels that have the highest value of this distance (amongst all the randomly generated levels). We can also combine this distance measure with regret/GAE when deciding whether to include a level in the buffer, so that levels which are both different and challenging are added.

#### Diversity in Sampling Levels from the Buffer to Train

To decide which levels to train on, we maintain a priority (probability of being selected) for each level, computed based on its distance from the other levels in the buffer. The rank of a level $l_{\theta_i}$, in terms of its distance from the other levels in the buffer $L$, is given by $h(D_i, \{D_1, D_2, \dots\})$ where $D_i = D(l_{\theta_i}, L \setminus l_{\theta_i})$. We then convert the rank into a probability using Equation (6):

$$P_i = \frac{1}{h(D_i, \mathcal{D})^\beta} \tag{6}$$

where $h(\cdot, \cdot)$ is the rank transformation that finds the rank of $D_i$ in a set of values $\mathcal{D}$, and $\beta$ is a tunable parameter.

#### Combining Diversity with Regret/GAE

Instead of solely considering the diversity aspect of the level buffer, we can also take learning potential into account by using the regrets in Equation (1), so that the level buffer is filled not only with diverse training levels but also with challenging levels that continuously push the student. We assign different weights to diversity and regret by setting the replay probability to $P_{replay} = \rho \cdot P_D + (1 - \rho) \cdot P_R$, where $P_D$ and $P_R$ are the prioritizations based on diversity and regret respectively, and $\rho$ is a tuning parameter.

An overview of our proposed algorithm is shown in Figure 1. When the level generator creates a new level $l_{\theta_i}$, we collect trajectories $\tau_i$ on $l_{\theta_i}$ and compute its distance $D_i$ to the trajectory buffer using Equation (5). If the replay probability of $l_{\theta_i}$ is greater than that of any level in the level buffer, we insert $l_{\theta_i}$ into the level buffer and remove the level with the minimum replay probability. The complete procedure of DIPLR is presented in Algorithm 1. To accelerate the calculation of the Wasserstein distance, we adopt the popular empirical Wasserstein distance solver $D(\cdot, \cdot)$ from [Flamary et al., 2021].
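A minimal sketch of this machinery is given below, assuming state-action samples are available as arrays: the empirical Wasserstein distance of Equation (4) is computed with the exact optimal transport solver from the POT library [Flamary et al., 2021], and the buffer distance of Equation (5), the rank-based priorities of Equation (6) and the combined replay probabilities are layered on top. The function names, the Euclidean cost (i.e., $p = 1$) and the default $\beta$, $\rho$ values are illustrative; this is not the authors' implementation:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def level_distance(samples_a, samples_b):
    """Empirical Wasserstein distance between two (n, dim) arrays of
    state-action samples collected under the current student policy (Eq. 4)."""
    cost = ot.dist(samples_a, samples_b, metric="euclidean")   # pairwise costs d(phi1, phi2)
    w_a = np.full(len(samples_a), 1.0 / len(samples_a))        # uniform empirical weights
    w_b = np.full(len(samples_b), 1.0 / len(samples_b))
    return ot.emd2(w_a, w_b, cost)                              # optimal transport cost

def distance_to_buffer(new_samples, buffer_samples):
    """Eq. (5): minimum distance from a candidate level to all levels in the buffer."""
    return min(level_distance(new_samples, s) for s in buffer_samples)

def rank_priorities(scores, beta=0.3):
    """Eq. (6): priority proportional to 1 / rank(score)^beta, where the
    largest score gets rank 1; normalized to a probability distribution."""
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty(len(scores))
    ranks[np.argsort(-scores)] = np.arange(1, len(scores) + 1)
    p = 1.0 / ranks ** beta
    return p / p.sum()

def replay_probabilities(distances, regrets, rho=0.5, beta=0.3):
    """P_replay = rho * P_D + (1 - rho) * P_R over the levels in the buffer."""
    return rho * rank_priorities(distances, beta) + (1 - rho) * rank_priorities(regrets, beta)
```

In practice, the number of state-action samples kept per level trades off the accuracy of the distance estimate against the cost of each optimal transport solve.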
For simplicity, we use $\tau$ to represent the state-action samples from trajectory $\tau$ in the pseudocode.

Algorithm 1: DIPLR

Input: level buffer size $N$, level generator $G$
Initialize: student policy $\pi_\eta$, level buffer $L$, trajectory buffer $\Gamma$

1. Generate $N$ initial levels by $G$ to populate $L$
2. Collect trajectories on each replay level in $L$ and fill up $\Gamma$
3. While not converged:
   1. Sample a replay decision $\epsilon \sim U[0, 1]$
   2. If $\epsilon < 0.5$:
      - Generate a new level $l_{\theta_i}$ by $G$
      - Collect trajectories $\tau_i$ on $l_{\theta_i}$ with a stop-gradient on $\pi_\eta$ (the student policy is not updated)
      - Compute the regret, staleness and distance for $l_{\theta_i}$
   3. Else:
      - Sample a replay level $l_{\theta_j} \in L$ according to $P_{replay}$
      - Collect trajectories $\tau_j$ on $l_{\theta_j}$ and update $\pi_\eta$ with the rewards $R(\tau_j)$
      - Compute the regret, staleness and distance for $l_{\theta_j}$
   4. Flush $\Gamma$ and collect trajectories on all replay levels to fill up $\Gamma$
   5. Update the regret, staleness and distance for $l_{\theta_i}$ or $l_{\theta_j}$
   6. Update $L$ with the new level $l_{\theta_i}$ if its replay probability is greater than that of any level in $L$
   7. Update the replay probability $P_{replay}$

To better reveal the relationship between diversity and learning potential, we conduct an ablation study where we adopt only the diversity metric to pick levels to fill up the level buffer, i.e., we set $P_{replay} = 1 \cdot P_D + 0 \cdot P_R$.

## 4 Experiments

In this section, we compare DIPLR to the set of leading benchmarks for solving UED: Domain Randomization, Minimax, PAIRED and PLR⊥. We conduct experiments and empirically demonstrate the effectiveness and generality of DIPLR on three popular yet highly distinct UPOMDP domains: Minigrid, Bipedal-Walker and Car-Racing. Minigrid is a partially observable navigation problem under discrete control with sparse rewards, while Bipedal-Walker and Car-Racing are partially observable walking/driving problems under continuous control with dense rewards (we provide more details in the Appendix). In each domain, we train the teacher and student agents with Proximal Policy Optimization (PPO, [Schulman et al., 2017]), and we present the zero-shot out-of-distribution (OOD) test performance of all algorithms in each domain. To make the comparison more reliable and straightforward, we adopt the recently introduced standardized DRL evaluation metrics [Agarwal et al., 2021b], with which we show aggregate inter-quartile mean (IQM) and optimality gap plots. Specifically, IQM discards the bottom and top 25% of the runs and measures performance on the middle 50% of combined runs; it is thus robust to outlier scores and is a better indicator of overall performance. The optimality gap captures the amount by which the algorithm fails to meet the best performance, i.e., it assumes that a score (e.g., 1.0) is a desirable target beyond which improvements are not very important. We present the plots after normalizing the performance with a min-max range of solved-rates/returns for better visualization. We also provide a detailed ablation analysis to demonstrate the utility of diversity alone, by providing results with and without regret (or GAE). The version with diversity alone is referred to as DIPLR⊥ and the one which uses both diversity and regret is referred to as DIPLR.

Figure 2: Examples of training environments in the Minigrid domain.

Figure 3: Aggregate zero-shot OOD test performance in the Minigrid domain across 10 independent runs. Higher IQM scores and lower optimality gap are better.
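For clarity, the two aggregate metrics used throughout this section can be computed from min-max-normalized scores roughly as follows. This is a small sketch of the assumed formulas, in the spirit of [Agarwal et al., 2021b]; it is not the evaluation code used to produce the reported results:

```python
import numpy as np
from scipy import stats

def iqm(normalized_scores):
    """Inter-quartile mean: mean of the middle 50% of runs
    (drops the bottom and top 25% of scores)."""
    return stats.trim_mean(np.asarray(normalized_scores).ravel(), proportiontocut=0.25)

def optimality_gap(normalized_scores, target=1.0):
    """Average shortfall from the target score; scores above the target do not help."""
    scores = np.asarray(normalized_scores).ravel()
    return float(np.mean(np.maximum(target - scores, 0.0)))
```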
### 4.1 Minigrid Domain

Introduced by [Dennis et al., 2020], Minigrid is a good benchmark due to its easy interpretability and customizability. In our experimental settings, the teacher has a maximum budget of 60 steps to place small block cells and another two steps at the end to place the start position and goal position. The student agent is required to navigate to the goal cell within 256 steps; it receives a reward of 1 - (num-steps/256) upon success and zero reward otherwise. Figure 2 displays a few training examples from the domain. We train all the student agents for 30k PPO updates (approximately 250M steps) and evaluate their transfer capabilities on eight human-crafted zero-shot OOD environments used in the literature (see the first row in Figure 2). We summarize the empirical results of our proposed algorithm and all other benchmarks in Figure 3 and Figure 4. The key observations from both figures are:

- Figure 3: DIPLR surpasses all the benchmark approaches, and beats the best existing approach, PLR⊥, by more than 20% with regard to IQM and optimality gap.
- Figure 3: DIPLR⊥ performs on par with or better than PLR⊥.
- Figure 3: PAIRED performs poorly, as the long-horizon (60 steps), sparse-reward task is challenging for a guided level generator. While Domain Randomization performs decently, Minimax performs worse than PAIRED.
- Figure 4: With respect to zero-shot transfer results, DIPLR is clearly better than PLR⊥ in every one of the testing scenarios. This indicates the utility of diversity.

Figure 4: Zero-shot transfer performance in eight human-designed test environments. The plots are based on the median and interquartile range of solved rates across 10 independent experiments.

### 4.2 Bipedal-Walker Domain

Figure 5: Six examples of test environments in the Bipedal-Walker domain: (a) Bipedal Walker, (b) Hardcore, (c) Stair, (d) Pit Gap, (e) Stump, (f) Roughness. Note that these may not be the exact test environment terrains but only extreme cases for demonstration.

Figure 6: Aggregate test performance over 10 runs in the Bipedal-Walker domain. Higher IQM scores and lower optimality gap are better.

We also conduct a comprehensive comparison on the Bipedal-Walker domain, which was introduced in [Wang et al., 2019b; Parker-Holder et al., 2022] and is popular in the community: it is a well-parameterized environment with eight variables conditioning the terrain (including ground roughness, the number of stair steps, the min/max range of pit gap width, the min/max range of stump height, and the min/max range of stair height) and is thus straightforward to interpret. The teacher in Bipedal-Walker can specify the values of the eight environment parameters, but there is still stochasticity in the generation of a particular level. The student, i.e., the walker agent, must determine the torques applied to its joints and is constrained by partial observability: it only knows its horizontal speed, vertical speed, angular speed, joint positions, joint angular speeds, etc. Following the experiment settings in prior UED works, we train all the algorithms for 30k PPO updates (approximately 1B steps), and then evaluate their generalization ability on six distinct test instances in the Bipedal-Walker domain, i.e., Bipedal Walker, Hardcore, Stair, Pit Gap, Stump, and Roughness, as illustrated in Figure 5.
Among these, Bipedal Walker is the basic level that only evaluates whether the agent can walk on flat ground; Stair, Pit Gap, Stump and Roughness each challenge the walker with a particular kind of obstacle; and Hardcore contains a combination of all types of challenging obstacles. To understand the evolution of transfer performance, we evaluate the student every 100 student policy updates and plot the curves in Figure 7. As shown there, our proposed method DIPLR outperforms all other benchmarks in each test environment. The ablation model DIPLR⊥ also surpasses our primary benchmark PLR⊥ in five test environments, indicating that the diversity metric alone contributes more to ultimate generalization performance than regret in continuous domains. PAIRED suffers considerably from variance and cannot consistently create informative levels for the student agent in such a complex domain. This is because of the nature of the multi-agent framework adopted in PAIRED, where convergence highly depends on the ability of the expert student.

Figure 7: Performance on test environments during the training period over five independent experiments (mean and standard error).

After training for 30k PPO updates, we collect the trained models and conduct a more rigorous evaluation based on 10 test episodes in each test environment, and present the aggregate results after min-max normalization (with range [0, 300] on Bipedal Walker and [0, 150] on the other five test environments) in Figure 6. Notably, our method DIPLR dominates all the benchmarks in both IQM and optimality gap. Furthermore, DIPLR⊥ performs better than PLR⊥.

### 4.3 Car-Racing Domain

Figure 8: Examples in the Car-Racing domain: (a) a zoom-in snapshot of the car-racing track; (b) a track generated by domain randomization; (c) one of the test environments, the F1-Singapore track.

Figure 9: Zero-shot transfer performance in the Car-Racing domain.

Figure 10: Aggregate test performance over 10 independent runs in the Car-Racing domain.

To validate that our method is scalable and versatile, we further implement DIPLR on the Car-Racing environment, which was introduced by [Jiang et al., 2021a] and has been used in existing UED papers. In this domain, the teacher creates challenging racing tracks using Bézier curves (via 12 control points), and the student drives on the track under continuous control with dense rewards. A zoom-in illustration of the track is shown in Figure 8 (a). After training the student for around 3k PPO updates (approximately 5.5M steps), we evaluate the student agent in four test scenarios, among which three are challenging F1 tracks existing in the real world and the other is a basic vanilla track. Note that these tracks are entirely OOD because they cannot be defined with just 12 control points; for example, the F1-Singapore track in Figure 8 (c) has more than 20 control points. Figure 9 presents the complete training process of each algorithm on the four test environments, with an evaluation interval of 100 PPO updates. As this is a relatively simpler domain, our methods DIPLR and DIPLR⊥ quickly converge to the optimum, while DR and PLR⊥ achieve near-optimal generalization on the Vanilla and F1-Italy tracks. The aggregate performance of all methods after min-max normalization (with range [200, 800]) is summarized in Figure 10. Although both the PLR⊥ and DR agents can learn well on this comparatively simple domain, DIPLR outperforms them on both IQM and optimality gap.
## 5 Conclusion

We provided a novel method, DIPLR, for unsupervised environment design (UED), in which diversity in training environments is employed to improve the generalization ability of a student agent. By computing the distance between levels via the Wasserstein distance over occupancy distributions, we constantly expose the agent to challenging yet diverse scenarios and thus improve its generalization in zero-shot out-of-distribution test environments. In our experiments, we validated that DIPLR is capable of training robust and generalizing agents and significantly outperforms the best-performing baselines in three distinct UED domains. Moreover, we explored the relationship between diversity and learning potential, and we discovered that diversity alone benefits the algorithm more than learning potential, and that they complement each other to achieve the best performance when combined.

## Acknowledgments

This research/project is supported by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-RP2020-017).

## References

[Agarwal et al., 2021a] Rishabh Agarwal, Marlos C Machado, Pablo Samuel Castro, and Marc G Bellemare. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. arXiv preprint arXiv:2101.05265, 2021.
[Agarwal et al., 2021b] Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information Processing Systems, 34:29304-29320, 2021.
[Akkaya et al., 2019] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving Rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
[Cobbe et al., 2019] Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. In International Conference on Machine Learning, pages 1282-1289. PMLR, 2019.
[Cobbe et al., 2020] Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning, pages 2048-2056. PMLR, 2020.
[Dennis et al., 2020] Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, and Sergey Levine. Emergent complexity and zero-shot transfer via unsupervised environment design. Advances in Neural Information Processing Systems, 33:13049-13061, 2020.
[Flamary et al., 2021] Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, et al. POT: Python Optimal Transport. Journal of Machine Learning Research, 22(78):1-8, 2021.
[Hong et al., 2018] Zhang-Wei Hong, Tzu-Yun Shann, Shih-Yang Su, Yi-Hsiang Chang, Tsu-Jui Fu, and Chun-Yi Lee. Diversity-driven exploration strategy for deep reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.
[Jakobi, 1997] Nick Jakobi. Evolutionary robotics and the radical envelope-of-noise hypothesis. Adaptive Behavior, 6(2):325-368, 1997.
[Jiang et al., 2021a] Minqi Jiang, Michael Dennis, Jack Parker-Holder, Jakob Foerster, Edward Grefenstette, and Tim Rocktäschel. Replay-guided adversarial environment design.
Advances in Neural Information Processing Systems, 34:1884-1897, 2021.
[Jiang et al., 2021b] Minqi Jiang, Edward Grefenstette, and Tim Rocktäschel. Prioritized level replay. In International Conference on Machine Learning, pages 4940-4950. PMLR, 2021.
[Kantorovich, 1942] LV Kantorovich. On the transfer of masses. In Dokl. Akad. Nauk. SSSR, 1942.
[Lupu et al., 2021] Andrei Lupu, Brandon Cui, Hengyuan Hu, and Jakob Foerster. Trajectory diversity for zero-shot coordination. In International Conference on Machine Learning, pages 7204-7213. PMLR, 2021.
[Mehta et al., 2020] Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J Pal, and Liam Paull. Active domain randomization. In Conference on Robot Learning, pages 1162-1176. PMLR, 2020.
[Mirhoseini et al., 2021] Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nazi, et al. A graph placement methodology for fast chip design. Nature, 594(7862):207-212, 2021.
[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[Monge, 1781] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Mem. Math. Phys. Acad. Royale Sci., pages 666-704, 1781.
[Morimoto and Doya, 2005] Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation, 17(2):335-359, 2005.
[Parker-Holder et al., 2020] Jack Parker-Holder, Aldo Pacchiano, Krzysztof M Choromanski, and Stephen J Roberts. Effective diversity in population based reinforcement learning. Advances in Neural Information Processing Systems, 33:18050-18062, 2020.
[Parker-Holder et al., 2022] Jack Parker-Holder, Minqi Jiang, Michael Dennis, Mikayel Samvelyan, Jakob Foerster, Edward Grefenstette, and Tim Rocktäschel. Evolving curricula with regret-based environment design. arXiv preprint arXiv:2203.01302, 2022.
[Pinto et al., 2017] Lerrel Pinto, James Davidson, and Abhinav Gupta. Supervision via competition: Robot adversaries for learning tasks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1601-1608. IEEE, 2017.
[Portelas et al., 2020] Rémy Portelas, Cédric Colas, Katja Hofmann, and Pierre-Yves Oudeyer. Teacher algorithms for curriculum learning of deep RL in continuously parameterized environments. In Conference on Robot Learning, pages 835-853. PMLR, 2020.
[Sadeghi and Levine, 2016] Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201, 2016.
[Schulman et al., 2015] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[Silver et al., 2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.
[Song et al., 2016] Jinhua Song, Yang Gao, Hao Wang, and Bo An. Measuring the distance between finite Markov decision processes. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 468-476, 2016.
[Tobin et al., 2017] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23-30. IEEE, 2017.
[Wang et al., 2019a] Hao Wang, Shaokang Dong, and Ling Shao. Measuring structural similarities in finite MDPs. In IJCAI, pages 3684-3690, 2019.
[Wang et al., 2019b] Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O. Stanley. Paired open-ended trailblazer (POET): Endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753, 2019.
[Zhang et al., 2018] Amy Zhang, Nicolas Ballas, and Joelle Pineau. A dissection of overfitting and generalization in continuous reinforcement learning. arXiv preprint arXiv:1806.07937, 2018.