# Learning Diverse Risk Preferences in Population-Based Self-Play

Yuhua Jiang*, Qihan Liu*, Xiaoteng Ma, Chenghao Li, Yiqin Yang, Jun Yang, Bin Liang, Qianchuan Zhao
Department of Automation, Tsinghua University
{jiangyh22, lqh20, ma-xt17, lch18, yangyiqi19}@mails.tsinghua.edu.cn
{yangjun603, bliang, zhaoqc}@tsinghua.edu.cn

*These authors contributed equally. Corresponding authors.
Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Among the remarkable successes of Reinforcement Learning (RL), self-play algorithms have played a crucial role in solving competitive games. However, current self-play RL methods commonly optimize the agent to maximize the expected win rate against its current or historical copies, resulting in a limited strategy style and a tendency to get stuck in local optima. To address this limitation, it is important to improve the diversity of policies, allowing the agent to break stalemates and enhance its robustness when facing different opponents. In this paper, we present a novel perspective on promoting diversity: agents could have diverse risk preferences in the face of uncertainty. To achieve this, we introduce a novel reinforcement learning algorithm called Risk-sensitive Proximal Policy Optimization (RPPO), which smoothly interpolates between worst-case and best-case policy learning, enabling policy learning with desired risk preferences. Furthermore, by seamlessly integrating RPPO with population-based self-play, agents in the population optimize dynamic risk-sensitive objectives using experiences gained from playing against diverse opponents. Our empirical results demonstrate that our method achieves comparable or superior performance in competitive games and, importantly, leads to the emergence of diverse behavioral modes. Code is available at https://github.com/Jackory/RPBT.

Introduction

Reinforcement Learning (RL) has witnessed significant advancements in solving challenging decision problems, particularly in competitive games such as Go (Silver et al. 2016), Dota (Berner et al. 2019), and StarCraft (Vinyals et al. 2019). One of the key factors contributing to these successes is self-play, where an agent improves its policy by playing against itself or its previous policies. However, building expert-level AI solely through model-free self-play remains challenging, since the agent has no access to a dynamics model or to the prior knowledge of a human expert. One of the difficulties is that agents trained in self-play only compete against themselves, leading to policies that often become stuck in local optima and struggle to generalize to different opponents (Bansal et al. 2018).

In this study, we focus on the problem of training agents that are both robust and high-performing against any type of opponent. We argue that the key to addressing this problem lies in producing a diverse set of training opponents, or in other words, learning diverse strategies in self-play. A promising paradigm for this is population-based approaches (Vinyals et al. 2019; Jaderberg et al. 2019), where a collection of agents is gathered and trained against each other. However, the diversity achieved in such methods mainly stems from various hyperparameter setups and falls short. Recent research has increasingly focused on enhancing population diversity (Lupu et al. 2021; Zhao et al. 2022; Parker-Holder et al. 2020; Wu et al. 2023; Liu et al. 2021).
These methods introduce diversity as a learning objective and incorporate additional auxiliary losses on a population-wide scale. Nonetheless, this population-wide objective increases implementation complexity and necessitates careful hyperparameter tuning to strike a balance between diversity and performance.

Here, we propose a novel perspective on introducing diversity within the population: agents should possess diverse risk preferences. In this context, risk refers to the uncertainty arising from stochastic transitions within the environment, while risk preference encompasses the agent's sensitivity to such uncertainty, typically including risk-seeking, risk-neutral, and risk-averse tendencies. Taking inspiration from the fact that humans are risk-sensitive (Tversky and Kahneman 1992), we argue that the lack of diversity in self-play arises from agents solely optimizing their expected winning rate against opponents without considering optimization goals that go beyond the expectation, such as higher-order moments of the winning-rate distribution. Intuitively, the population of agents should be diverse, including conservative (risk-averse) agents that aim to improve their worst performance against others, as well as aggressive (risk-seeking) agents that focus on improving their best winning record.

To achieve risk-sensitive learning with minimal modifications, we introduce the expectile Bellman operator, which smoothly interpolates between the worst-case and best-case Bellman operators. Building upon this operator, we propose Risk-sensitive Proximal Policy Optimization (RPPO), which utilizes a risk-level hyperparameter to control the risk preference during policy learning. To strike a balance between bias and variance in policy learning, we extend the expectile Bellman operator to its multi-step form and adapt Generalized Advantage Estimation (GAE) (Schulman et al. 2018). It is important to highlight that this extension is non-trivial due to the nonlinearity of the expectile Bellman operator. RPPO offers a simple and practical approach that can be easily integrated into existing population-based self-play frameworks. We provide an implementation called RPBT, where a population of agents is trained using RPPO with diverse risk-preference settings, without introducing additional population-level optimization objectives. The risk preferences of the agents are automatically tuned using exploitation and exploration techniques drawn from Population-Based Training (PBT) (Jaderberg et al. 2017). By gathering the experience of agents competing against each other in the diverse population, RPBT effectively addresses the issue of overfitting to specific types of opponents.

Related Work

Self-Play RL
Training agents in multi-agent games requires instantiations of opponents in the environment. One solution is self-play RL, where an agent is trained by playing against its own policy or its past versions. Self-play variants have proven effective in many multi-agent games (Tesauro et al. 1995; Silver et al. 2016; Berner et al. 2019; Vinyals et al. 2019; Jaderberg et al. 2019). A key advantage of self-play is that the competitive multi-agent environment provides the agents with an appropriate curriculum (Bansal et al. 2018; Liu et al. 2019; Baker et al. 2020), which facilitates the emergence of complex and interesting behaviors.
However, real-world games often involve non-transitivity and strategic cycles (Czarnecki et al. 2020), and thus self-play alone cannot produce agents that generalize well to different types of opponents. Population-based self-play (Jaderberg et al. 2019; Garnelo et al. 2021; Yu et al. 2023; Strouse et al. 2021) improves on self-play by training a population of agents, all of whom compete against each other. While population-based self-play gathers a substantial amount of match experience and thereby alleviates overfitting to specific opponents, it still requires techniques to introduce diversity among agents to stabilize training and facilitate robustness (Jaderberg et al. 2019).

Population-Based Diversity
Population-based methods demonstrate a strong connection with evolutionary algorithms (Mouret and Clune 2015). These population-based approaches have effectively addressed black-box optimization challenges (Loshchilov and Hutter 2016). Their primary advantage is the ability to obtain high-performing hyperparameter schedules in a single training run, which leads to strong performance across various environments (Liu et al. 2019; Li et al. 2019; Espeholt et al. 2018). Recently, diversity across the population has drawn great interest (Shen et al. 2020; Khadka et al. 2019; Liu et al. 2021, 2022; Wu et al. 2023). Reward randomization (Tang et al. 2021; Yu et al. 2023) has been employed to discover diverse policies in multi-agent games. Other representative works formulate individual rewards and diversity among agents as a multi-objective optimization problem. More specifically, DvD (Parker-Holder et al. 2020) utilizes the determinant of the kernel matrix of action embeddings as a population diversity metric. TrajeDi (Lupu et al. 2021) introduces trajectory diversity by approximating the Jensen-Shannon divergence with an action-discounting kernel. MEP (Zhao et al. 2022) maximizes a lower bound of the average KL divergence within the population. Instead of introducing additional diversity-driven objectives across the entire population, our RPBT approach trains a population of agents with different risk preferences. Each individual agent maximizes its own returns based on its specific risk level, making our method both easy to implement and adaptable for integration into a population-based self-play framework.

Game Theory
Fictitious play (Brown 1951) and double oracle (McMahan, Gordon, and Blum 2003) have been studied to achieve approximate Nash equilibria in normal-form games. Fictitious self-play (FSP) (Heinrich, Lanctot, and Silver 2015) extends fictitious play to extensive-form games. Policy-space response oracle (PSRO) (Lanctot et al. 2017) is a natural generalization of double oracle, where the choices become policies in meta-games rather than actions in games. PSRO is a general framework for solving games that maintains a policy pool and continuously adds best responses; at this level, our method falls under the PSRO framework. However, in practice, PSRO necessitates the computation of the meta-payoff matrix between policies in the policy pool, which is computationally intensive (McAleer et al. 2020) in real-world games since the policy pool is large. Various improvements (Balduzzi et al. 2019; Perez-Nieves et al. 2021; Liu et al. 2021) have been made upon PSRO by using different metrics based on the meta-payoffs to promote diversity. However, most of these works have confined their experiments to normal-form games or meta-games.
Risk-Sensitive RL
Our risk-sensitive method draws from risk-sensitive and distributional RL, for which comprehensive surveys (Bellemare, Dabney, and Rowland 2023) are available. Key distributional RL studies (Bellemare, Dabney, and Munos 2017) highlight the value of learning return distributions over expected returns, enabling approximation of value functions under various risk measures such as Wang (Müller 1997) and CVaR (Rockafellar, Uryasev et al. 2000; Chow and Ghavamzadeh 2014; Qiu et al. 2021) for generating risk-averse or risk-seeking policies. However, these methods' reliance on discrete samples for risk and gradient estimation increases computational complexity. Sampling-free methods (Tang, Zhang, and Salakhutdinov 2019; Yang et al. 2021; Ying et al. 2022) have been explored for CVaR computation, but CVaR focuses solely on risk aversion, neglecting best-case scenarios, which may not align with competition objectives. A risk-sensitive RL algorithm (Delétang et al. 2021) balances risk aversion and risk seeking, but assumes Gaussian data generation. In contrast, our RPPO algorithm requires no data assumptions and only minimal code modifications on top of PPO to learn diverse risk-sensitive policies.

Preliminary

Problem Definition
We consider fully competitive games, which can be regarded as Markov games (Littman 1994). A Markov game for $N$ agents is a partially observable Markov decision process (POMDP) defined by: a set of states $\mathcal{S}$; a set of observations $\mathcal{O}_1, \dots, \mathcal{O}_N$, one per agent; a set of actions $\mathcal{A}_1, \dots, \mathcal{A}_N$, one per agent; a transition function $p(s' \mid s, a_1, \dots, a_N): \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \to \Delta(\mathcal{S})$ determining the distribution over next states; and a reward function for each agent $r_i: \mathcal{S} \times \mathcal{A}_i \times \mathcal{A}_{-i} \times \mathcal{S} \to \mathbb{R}$. Each agent chooses its actions based on a stochastic policy $\pi_{\theta_i}: \mathcal{O}_i \to \Delta(\mathcal{A}_i)$, where $\theta_i$ is the policy parameter of agent $i$.

In the self-play setting, we control a single agent known as the main agent, while the remaining agents act as opponents. These opponents are selected from the main agent's checkpoints. The main agent's primary objective is to maximize its expected return, i.e., the discounted cumulative reward $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r^i_t\right]$, where $\gamma$ is the discount factor and $T$ is the time horizon. By considering the opponents as part of the environmental dynamics, we can view the multi-agent environment as a single-agent stochastic environment from the main agent's perspective. Consequently, we can employ single-agent RL methods such as PPO to approximate the best response against all opponents. However, the opponents are sampled from the main agent, which itself lacks diversity because it is generated by single-agent RL methods. In this study, we aim to construct a strong and robust policy, and generating diverse policies serves this goal.

Risk-sensitive PPO

In this section, we present our novel RL algorithm, Risk-sensitive Proximal Policy Optimization (RPPO), which involves minimal modifications to PPO. The algorithm is shown in Algorithm 1. RPPO utilizes an expectile Bellman operator that interpolates between a worst-case Bellman operator and a best-case Bellman operator. This allows learning to occur with a specific risk preference, ranging from risk-averse to risk-seeking. Additionally, we extend the expectile Bellman operator into a multi-step form to balance bias and variance in RPPO.
The theoretical analysis of the expectile Bellman operator and its multi-step form, as well as a toy example demonstrating the potential of RPPO to learn risk preferences, are presented in the following subsections.

Expectile Bellman Operator
Given a policy π and a risk level hyperparameter τ ∈ (0, 1), we consider the expectile Bellman operator defined as follows:

$$\mathcal{T}^\pi_\tau V(s) := V(s) + 2\alpha\, \mathbb{E}_{a, s'}\left[ \tau[\delta]_+ + (1-\tau)[\delta]_- \right], \tag{1}$$

where α is the step size, which we set to $\frac{1}{2\max\{\tau, 1-\tau\}}$; $\delta = \delta(s, a, s') = r(s, a, s') + \gamma V(s') - V(s)$ is the one-step TD error; and $[\cdot]_+ = \max(\cdot, 0)$, $[\cdot]_- = \min(\cdot, 0)$. This operator draws inspiration from expectile statistics (Newey and Powell 1987; Rowland et al. 2019; Ma et al. 2022), hence the name. The standard Bellman operator is a special case of the expectile Bellman operator when τ = 1/2. The following theoretical properties guide the application of the expectile Bellman operator in practice; please refer to the Appendix for the proofs.

Proposition 1. For any τ ∈ (0, 1), $\mathcal{T}^\pi_\tau$ is a $\gamma_\tau$-contraction, where $\gamma_\tau = 1 - 2\alpha(1-\gamma)\min\{\tau, 1-\tau\}$.

Proposition 1 guarantees the convergence of the value function in the policy evaluation phase.

Proposition 2. Let $V^*_\tau$ denote the fixed point of $\mathcal{T}^\pi_\tau$. For any τ, τ′ ∈ (0, 1), if τ′ ≤ τ, we have $V^*_{\tau'}(s) \le V^*_\tau(s)$ for all s ∈ S.

Proposition 2 guarantees that the fixed point of the expectile Bellman operator is monotonic with respect to τ.

Proposition 3. Let $V^*_\tau$, $V_{\text{best}}$, and $V_{\text{worst}}$ respectively denote the fixed points of the expectile Bellman operator, the best-case Bellman operator, and the worst-case Bellman operator. We have

$$\lim_{\tau \to 0} V^*_\tau = V_{\text{worst}}, \qquad \lim_{\tau \to 1} V^*_\tau = V_{\text{best}}. \tag{2}$$

The worst-case and best-case Bellman operators are defined by

$$\mathcal{T}_{\text{best}} V(s) := \max_{a, s'}\left[ R(s, a, s') + \gamma V(s') \right], \qquad \mathcal{T}_{\text{worst}} V(s) := \min_{a, s'}\left[ R(s, a, s') + \gamma V(s') \right]. \tag{3}$$

It is important to highlight the difference between the best-case Bellman operator and the Bellman optimality operator ($\mathcal{T}^* V(s) := \max_a \sum_{s'} p(s' \mid s, a)\left[ R(s, a, s') + \gamma V(s') \right]$). The best-case Bellman operator takes stochastic transitions into account, which are the main source of risk in competitive games when confronting the unknown actions of different opponents. Proposition 3 guarantees that the expectile Bellman operator approaches the best-case Bellman operator as the risk level τ approaches 1, and approaches the worst-case Bellman operator as τ approaches 0.

Combining Propositions 2 and 3, we observe that the expectile Bellman operator can be used to design an optimization algorithm whose objective interpolates between the best-case and worst-case objectives. When τ = 1/2, the objective is equivalent to the expected return, which is the risk-neutral case; risk levels τ < 1/2 and τ > 1/2 represent the risk-averse and risk-seeking cases, respectively. As τ varies, the learning objective varies, and hence diverse risk preferences arise. This shows the potential of using the expectile Bellman operator for designing risk-sensitive RL algorithms. PPO, widely utilized in self-play, is favored for its ease of use, stable performance, and parallel scalability. By extending PPO with the expectile Bellman operator, we introduce the RPPO algorithm; moreover, generalizing this operator to other RL algorithms is not difficult. In practice, we define the advantage function as

$$A^\pi_\tau(s, a) := 2\alpha\, \mathbb{E}_{s'}\left[ \tau[\delta]_+ + (1-\tau)[\delta]_- \right]. \tag{4}$$
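To make the asymmetric update concrete, the following is a minimal NumPy sketch of one tabular application of Eq. (1). The array layout and the helper name `expectile_backup` are our own illustrative assumptions, not the released implementation.

```python
import numpy as np

def expectile_backup(V, P, R, pi, tau, gamma=0.99):
    """One application of the expectile Bellman operator (Eq. 1) on a tabular MDP.

    V   : (S,)      current value estimates
    P   : (S, A, S) transition probabilities p(s'|s,a)
    R   : (S, A, S) rewards r(s,a,s')
    pi  : (S, A)    policy probabilities pi(a|s)
    tau : risk level in (0, 1); tau = 0.5 recovers the standard Bellman operator
    """
    alpha = 1.0 / (2.0 * max(tau, 1.0 - tau))                  # step size used in the paper
    delta = R + gamma * V[None, None, :] - V[:, None, None]    # one-step TD errors delta(s,a,s')
    # Asymmetric (expectile) weighting of positive vs. negative TD errors.
    weighted = tau * np.maximum(delta, 0.0) + (1.0 - tau) * np.minimum(delta, 0.0)
    # Expectation over a ~ pi(.|s) and s' ~ p(.|s,a).
    expected = np.einsum("sa,sax,sax->s", pi, P, weighted)
    return V + 2.0 * alpha * expected
```

Iterating this backup converges by Proposition 1, and sweeping τ from 0 toward 1 moves the resulting fixed point from the worst-case value toward the best-case value, in line with Propositions 2 and 3.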
Algorithm 1: Risk-sensitive PPO (RPPO)
Input: initial network parameters θ, ϕ; horizon T; update epochs K; risk level τ.
1: for i = 1, 2, ... do
2:   Collect trajectories by running policy $\pi_{\theta_{\text{old}}}$ in the environment for T time steps.
3:   Compute λ-variant returns $\hat{V}_{\tau,\lambda}$ according to Eq. (10).
4:   Compute advantages $\hat{A}_{\tau,\lambda}$ according to Eq. (11).
5:   for k = 1, 2, ..., K do
6:     Update θ by maximizing the surrogate objective in Eq. (5) with a gradient algorithm.
7:     Update ϕ by minimizing the mean-squared error in Eq. (6) with a gradient algorithm.
8:   end for
9:   θ_old ← θ
10: end for

With a batch of data $\{(s_t, a_t, r_t, s_{t+1})\}_{t=0}^{T-1}$ collected by π, we train the policy with the clipped PPO loss,

$$L^{\text{CLIP}}_\pi(\theta) = \frac{1}{T} \sum_{t=0}^{T-1} \min\left( \omega_t(\theta)\, \hat{A}_\tau(s_t, a_t),\ \operatorname{clip}\left(\omega_t(\theta), 1-\varepsilon, 1+\varepsilon\right) \hat{A}_\tau(s_t, a_t) \right), \tag{5}$$

where $\omega_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi(a_t \mid s_t)}$ is the importance sampling ratio. Meanwhile, we train the value network with the mean-squared loss,

$$L_V(\phi) = \frac{1}{T} \sum_{t=0}^{T-1} \left( V_\phi(s_t) - \hat{V}_{\tau,t} \right)^2, \tag{6}$$

where ϕ is the parameter of the value network and $\hat{V}_{\tau,t} := V_\phi(s_t) + \hat{A}_\tau(s_t, a_t)$ is the target value.

Multi-Step Expectile Bellman Operator
While RPPO as described is sufficient for providing decent risk preferences, we still require techniques to balance bias and variance when estimating the advantage function. The original PPO uses Generalized Advantage Estimation (GAE) (Schulman et al. 2018) to reduce variance through an exponentially weighted estimator of the advantage function, at the cost of introducing some bias. In this section, we extend GAE to RPPO along this line of work. However, directly incorporating GAE into RPPO is non-trivial due to the nonlinearity of the expectile Bellman operator. To address this issue, we define the multi-step expectile Bellman operator as

$$\mathcal{T}^\pi_{\tau,\lambda} V(s) := (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} (\mathcal{T}^\pi_\tau)^n V(s). \tag{7}$$

The multi-step expectile Bellman operator retains the contraction and risk-sensitivity properties. Please refer to the Appendix for the proofs.

Proposition 4. For any τ ∈ (0, 1), $\mathcal{T}^\pi_{\tau,\lambda}$ is a $\gamma_{\tau,\lambda}$-contraction, where $\gamma_{\tau,\lambda} = \frac{(1-\lambda)\gamma_\tau}{1-\lambda\gamma_\tau}$.

Proposition 5. Let $V^*_{\tau,\lambda}$ denote the fixed point of $\mathcal{T}^\pi_{\tau,\lambda}$. For any τ, τ′ ∈ (0, 1), if τ′ ≤ τ, we have $V^*_{\tau',\lambda}(s) \le V^*_{\tau,\lambda}(s)$ for all s ∈ S.

Proposition 6. Let $V^*_{\tau,\lambda}$ denote the fixed point of $\mathcal{T}^\pi_{\tau,\lambda}$. We have

$$\lim_{\tau \to 0} V^*_{\tau,\lambda} = V_{\text{worst}}, \qquad \lim_{\tau \to 1} V^*_{\tau,\lambda} = V_{\text{best}}. \tag{8}$$

Although $\mathcal{T}^\pi_{\tau,\lambda}$ is a contraction, it is hard to estimate from trajectories. Hence, we introduce the sample-form operator

$$\hat{\mathcal{T}}^{\pi,\hat{V}}_\tau V(s) := V(s) + 2\alpha\left[ \tau[\hat{\delta}]_+ + (1-\tau)[\hat{\delta}]_- \right], \tag{9}$$

where $\hat{\delta} := r(s, a, s') + \gamma \hat{V}(s') - V(s)$ and $\hat{V}$ is an estimate of the target value. When we choose $\hat{V} = V$ and take the expectation, we recover the original operator: $\mathcal{T}^\pi_\tau V(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim p(\cdot \mid s, a)}\left[ \hat{\mathcal{T}}^{\pi,V}_\tau V(s) \right]$. Furthermore, the multi-step sample-form target is

$$\hat{V}^\pi_{\tau,\lambda}(s) = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, \hat{\mathcal{T}}^{\pi,\hat{V}_n}_\tau V(s), \tag{10}$$

where $\hat{V}_n(s) = \hat{\mathcal{T}}^{\pi,\hat{V}_{n-1}}_\tau V(s)$ and $\hat{V}_0(s) = V(s)$. Finally, we estimate the advantage as

$$\hat{A}^\pi_{\tau,\lambda}(s, a) = \hat{V}^\pi_{\tau,\lambda}(s) - V(s). \tag{11}$$

When τ = 1/2, we recover the original form of GAE. However, if τ ≠ 1/2, we introduce extra bias.¹ The details of computing the multi-step expectile Bellman operator in practice are provided in the Appendix. With the sample-form multi-step expectile Bellman operator in place, we obtain the full RPPO algorithm, as shown in Algorithm 1. The implementation of RPPO requires only a few lines of code modifications on top of PPO.

¹The fixed points of $\mathcal{T}^\pi_{\tau,\lambda}$ and $\mathcal{T}^\pi_\tau$ coincide. However, the sample-form operator introduces extra bias for value estimation, which means the fixed points of $\hat{\mathcal{T}}^\pi_{\tau,\lambda}$ and $\mathcal{T}^\pi_{\tau,\lambda}$ are not the same. This is caused by the nonlinearity of the operator with respect to the noise.
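For intuition on how the λ-variant targets of Eqs. (10) and (11) can be computed from a single rollout, here is one plausible backward-pass sketch. It is only our reading of the sample-form operator in Eq. (9) under a TD(λ)-style mixing assumption; names such as `rppo_advantages` and `last_value` are illustrative and this is not the authors' released code. Its main sanity check is that setting τ = 1/2 recovers the standard TD(λ)/GAE recursion.

```python
import numpy as np

def rppo_advantages(rewards, values, last_value, tau, gamma=0.99, lam=0.95):
    """Backward computation of lambda-variant value targets (cf. Eq. 10)
    and advantages (cf. Eq. 11) for one trajectory; a sketch under stated
    assumptions, not the authors' implementation.

    rewards    : (T,)   r_t collected by the current policy
    values     : (T,)   V_phi(s_t)
    last_value : scalar V_phi(s_T) used to bootstrap the final step
    """
    T = len(rewards)
    alpha = 1.0 / (2.0 * max(tau, 1.0 - tau))
    vhat = np.zeros(T)                       # lambda-variant value targets
    next_target, next_value = last_value, last_value
    for t in reversed(range(T)):
        # Mix the already-computed multi-step target with the bootstrapped value,
        # so that tau = 0.5 reduces to the usual TD(lambda)/GAE recursion.
        mixed_next = lam * next_target + (1.0 - lam) * next_value
        delta = rewards[t] + gamma * mixed_next - values[t]
        # Expectile weighting of the sample TD error, as in Eq. (9).
        vhat[t] = values[t] + 2.0 * alpha * (
            tau * max(delta, 0.0) + (1.0 - tau) * min(delta, 0.0)
        )
        next_target, next_value = vhat[t], values[t]
    advantages = vhat - values               # cf. Eq. (11)
    return vhat, advantages
```

In a PPO-style implementation, such targets would be computed per rollout segment and then plugged into the losses of Eqs. (5) and (6).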
Toy Example
We use a grid world (Delétang et al. 2021), shown in Fig. 1, to illustrate the ability of RPPO to learn diverse risk preferences with different risk levels τ. The grid world consists of an agent, a flag, water, and a strong wind. The agent's goal is to reach the flag position within 25 steps while avoiding stepping into the water. When the agent reaches the flag, it receives a +1 bonus and the episode ends, but each step spent in the water incurs a -1 penalty. In addition, the strong wind randomly pushes the agent in one direction with 50% probability. In this world, risk (uncertainty) comes from the strong wind. We train 9 RPPO agents with τ ∈ {0.1, 0.2, ..., 0.9} to investigate how the risk level τ affects the behaviors of the policies. We illustrate the state-visitation frequencies computed from 1,000 rollouts, as shown in Fig. 1. Three modes of policies emerge: risk-averse (τ ∈ {0.2, 0.3, 0.4}), taking the longest path away from the water; risk-neutral (τ ∈ {0.5, 0.6, 0.7}), taking the middle path; and risk-seeking (τ ∈ {0.8, 0.9}), taking the shortest route along the water. Interestingly, the agent with τ = 0.1 is too conservative and does not always reach the flag, which is consistent with a risk-averse behavior style.

Figure 1: Experiments of the toy example. (a) Illustration of the toy example. The task is to pick up the bonus located at the flag while avoiding the penalty of stepping into the water in a 4×4 grid world. A strong wind pushes the agent in a random direction 50% of the time. (b), (c), and (d) State-visitation frequencies for three policies with different risk preferences. The black arrow indicates the deterministic policy when there is no wind.

RPBT

In this section, we present our population-based self-play method based on RPPO, which we refer to as RPBT, as illustrated in Fig. 2. RPBT stabilizes the learning process by concurrently training a diverse population of agents who learn by playing against each other. Moreover, different agents in the population have varying risk preferences in order to promote learned policies that are diverse rather than homogeneous. In contrast to previous work (Parker-Holder et al. 2020; Lupu et al. 2021; Zhao et al. 2022), RPBT does not include population-level diversity objectives. Instead, each agent acts and learns to maximize its individual rewards under its specific risk-preference setting. This makes RPBT easier to implement and scale.

Figure 2: The framework of RPBT. We train a population of agents with different initializations of risk levels. During a training round, each agent spawns a number of subprocesses to collect data against randomly selected opponents from the policy pool. We use RPPO to update models under the specific risk level. The updated policies are added to the policy pool. If an agent in the population is under-performing, it will exploit (copy) the model parameters and the risk level τ of a better-performing agent, and it will explore a new risk level by adding a perturbation to the better-performing agent's risk level for the following training.

In RPBT, the value of the risk level τ ∈ (0, 1) controls the agent's risk preference. Since it is impossible to cover all possible risk levels within the population, we dynamically adjust the risk levels during training, particularly for agents with poor performance. We adopt the exploitation and exploration steps of PBT (Jaderberg et al. 2017) to achieve this auto-tuning. In the exploitation step, a poorly performing agent directly copies the model parameters and risk level τ from a well-performing agent to achieve equal performance. Then, in the exploration step, the newly copied risk level is randomly perturbed by noise to produce a new risk level. In practice, we introduce a simple technique:

Exploitation. We rank all agents in the population according to their ELO scores, which indicate performance. If an agent's ELO score falls below a certain threshold, it is considered an under-performing agent, and its model parameters and risk level are replaced by those of a superior agent.

Exploration. The risk level of the under-performing agent is further updated by adding a noise term varying between -0.2 and 0.2.

We find that this technique allows RPBT to explore almost all possible values of τ during training. At the later stage of training, the values of τ converge to an interval with better performance while maintaining diversity and avoiding harmful values.
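Below is a minimal sketch of the exploitation and exploration steps applied to the risk level, assuming each population member exposes an ELO score, a risk level τ, and its model parameters. The bottom-fraction cutoff standing in for the paper's unspecified ELO threshold, and names such as `exploit_and_explore`, are our assumptions; the ±0.2 perturbation range follows the text.

```python
import copy
import random

def exploit_and_explore(population, bottom_frac=0.2, noise=0.2):
    """PBT-style update of risk levels: under-performing agents copy a stronger
    agent's weights and tau, then perturb tau (a sketch, not the released RPBT code).

    population: list of objects with fields `elo` (float), `tau` (float in (0, 1)),
                and `state_dict` (model parameters).
    """
    ranked = sorted(population, key=lambda ag: ag.elo, reverse=True)
    n = max(1, int(len(ranked) * bottom_frac))
    top, bottom = ranked[:n], ranked[-n:]
    for loser in bottom:
        winner = random.choice(top)
        # Exploit: copy model parameters and risk level from a better-performing agent.
        loser.state_dict = copy.deepcopy(winner.state_dict)
        loser.tau = winner.tau
        # Explore: perturb the risk level with noise in [-0.2, 0.2],
        # keeping it inside the valid open interval (0, 1).
        loser.tau = min(0.99, max(0.01, loser.tau + random.uniform(-noise, noise)))
```

Calling such a routine once per training round, after the round's matches have refreshed the ELO scores, would reproduce the exploit/explore schedule described above.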
Additionally, we treat all agents of the population and their historical copies as a policy pool from which opponents are uniformly sampled for self-play. This approach to sampling opponents is commonly referred to as FSP (Vinyals et al. 2019); it ensures that an agent must be able to defeat random old versions of any other agent in the population in order to continue learning adaptively (Bansal et al. 2018). Since any opponent-sampling approach can be adapted to our method, we do not dwell on more complicated opponent-sampling techniques (Vinyals et al. 2019; Berner et al. 2019). Moreover, we utilize FSP for all our experiments to ensure fair comparisons.

Fig. 2 illustrates one training round of RPBT. Each training round consists of uniformly sampling opponents from the policy pool, collecting game experience, updating models using RPPO, and performing exploitation and exploration on the risk levels. After training ends, we select the agent with the highest ELO score from the population to serve as the evaluation agent.

Experiments

In our experiments, we aim to answer the following questions. Q1: Can RPPO generate policies with diverse risk preferences in competitive games? Q2: How does RPBT perform compared to other methods in competitive games? Some results of the ablation experiments on RPBT are presented in the Appendix. All experiments are conducted with one 64-core CPU and one GeForce RTX 3090 GPU.

Environment Setup
We consider two competitive multi-agent benchmarks: Slimevolley (Ha 2020) and Sumoants (Al-Shedivat et al. 2018). Slimevolley is a two-agent volleyball game with a discrete action space. The goal of each agent is to land the ball on the opponent's side of the field, causing the opponent to lose lives. Sumoants is a two-agent game based on MuJoCo with a continuous action space. Two ants compete in a square arena, aiming to knock the other agent to the ground or push it out of the ring. More details about the two benchmarks are given in the Appendix.

Figure 3: Illustration of diverse risk preferences in Slimevolley. The risk-seeking agent (left) stands farther from the fence and hits the ball at a lower angle, while the risk-averse agent (right) does the opposite.

Figure 4: Illustration of diverse risk preferences in Sumoants. The risk-seeking agent (red) constantly attacks, while the risk-averse agent (green) assumes a defensive stance.
Diverse Risk Preferences Illustration
We trained RPPO agents via self-play using various risk levels. For Slimevolley, we pit the risk-seeking agent with τ = 0.9 against the risk-averse agent with τ = 0.1, as shown in Fig. 3. We observe that the risk-seeking agent prefers to adjust its position to spike the ball so that the height of the ball is lower, while the risk-averse agent takes a more conservative strategy and just focuses on catching the ball. To further quantify this phenomenon, we pit the three agents with τ ∈ {0.1, 0.5, 0.9} against the τ = 0.5 agent and compute two metrics averaged over 200 rollouts: the distance to the fence (denoted d) and the hitting angle (denoted β). As τ increases, β shows a monotonically decreasing trend and d shows a monotonically increasing trend. Because the physics engine of the volleyball game is "a bit dodgy", the agent can hit the ball at a low height only when standing far away from the fence. For Sumoants, a typical play between a risk-seeking agent with τ = 0.7 and a risk-averse agent with τ = 0.3 is illustrated in Fig. 4. We observe that the risk-averse agent tends to maintain a defensive stance: four legs spread out, lowering its center of gravity and holding the floor tightly to keep itself as still as possible. In contrast, the risk-seeking agent frequently attempts to attack. We also provide multiple videos for further observation on our code website.

Table 1: RPBT playing against the basic baselines. Each entry is the average win rate of the row method against the column method.

(a) Slimevolley

|      | RPBT | PP  | SP  |
|------|------|-----|-----|
| RPBT | -    | 56% | 64% |
| PP   | 44%  | -   | 56% |
| SP   | 36%  | 44% | -   |

(b) Sumoants

|      | RPBT | PP  | SP  |
|------|------|-----|-----|
| RPBT | -    | 63% | 67% |
| PP   | 37%  | -   | 53% |
| SP   | 33%  | 47% | -   |

Table 2: RPBT playing against different methods. Each entry is the average win rate of the row method against the column method.

(a) Slimevolley

|             | RPBT | MEP | TrajeDi | DvD | RR  | PSRO |
|-------------|------|-----|---------|-----|-----|------|
| RPBT (ours) | -    | 64% | 59%     | 48% | 60% | 53%  |
| MEP         | 36%  | -   | 46%     | 32% | 39% | 32%  |
| TrajeDi     | 41%  | 53% | -       | 38% | 38% | 45%  |
| DvD         | 52%  | 63% | 61%     | -   | 58% | 63%  |
| RR          | 37%  | 52% | 63%     | 42% | -   | 54%  |
| PSRO        | 48%  | 68% | 55%     | 37% | 46% | -    |

(b) Sumoants

|         | RPBT | MEP | TrajeDi | DvD | RR  | PSRO |
|---------|------|-----|---------|-----|-----|------|
| RPBT    | -    | 58% | 63%     | 54% | 56% | 53%  |
| MEP     | 42%  | -   | 54%     | 53% | 51% | 50%  |
| TrajeDi | 37%  | 46% | -       | 44% | 48% | 37%  |
| DvD     | 46%  | 47% | 56%     | -   | 52% | 43%  |
| RR      | 44%  | 49% | 52%     | 48% | -   | 43%  |
| PSRO    | 47%  | 50% | 63%     | 57% | 57% | -    |

Figure 5: Comparing the diversity of different methods with a population size of 5 (one panel per method: RPBT, MEP, TrajeDi, DvD, RR; points colored by agent 1-5).

Results Compared With Other Methods
We trained RPBT with a population size of 5 and set the initial risk levels to {0.1, 0.4, 0.5, 0.6, 0.9} for all experiments. To ensure a fair comparison, all baselines were integrated with PPO and set to the same population size where applicable. See the Appendix for more implementation details. The baseline methods include:

- Basic baselines: self-play (SP), where agents learn solely through competing with themselves, and population-based self-play (PP), where a population of agents is co-trained through randomly competing against one another.
- Population-based methods: MEP (Zhao et al. 2022), TrajeDi (Lupu et al. 2021), DvD (Parker-Holder et al. 2020), and RR (Reward Randomization, similar in spirit to Tang et al. 2021).
- Game-theoretic method: PSRO (Lanctot et al. 2017).

For each method, we trained 3 runs using different random seeds and selected the one with the highest ELO score for evaluation. We then calculated the win-rate matrix between the methods, recording the average win rate over 200 plays. Tab. 1 shows the results for RPBT and the basic methods. We observe that RPBT exhibits superior performance compared to these baselines.
Therefore, it is crucial to integrate diversity-oriented methods like RPBT into self-play. Tab. 2 shows the results for RPBT, the population-based methods, and the game-theoretic method. We observe that RPBT outperforms MEP, TrajeDi, and RR. Moreover, RPBT achieves performance comparable to that of DvD in Slimevolley and better performance in Sumoants. Furthermore, RPBT performs slightly better than PSRO. However, PSRO is far more computationally intensive: while PSRO requires 34 hours for a single run, the other methods require about 12 hours.

To further compare the diversity of the population-based methods, we let each agent in each population play against itself. We then extract the first 100 states from the game trajectories and use t-SNE (Van der Maaten and Hinton 2008) to reduce the states to 2 dimensions for visualization. Fig. 5 shows the results. We observe that the trajectories of agents in the RPBT population are distinct from each other. In comparison, TrajeDi and RR exhibit a certain level of diversity, whereas MEP and DvD show no indication of diversity. Notably, these population-based methods require the introduction of an additional diversity objective at the population level, which increases the complexity of both implementation and hyperparameter tuning due to the trade-off between diversity and performance. Overall, RPBT is a favorable approach for both learning diverse behaviors and enhancing performance.

Conclusion

Observing that learning to be diverse is essential for addressing the overfitting problem in self-play, we propose that diversity can be introduced in terms of the risk preferences of different agents. Specifically, we propose a novel Risk-sensitive PPO (RPPO) approach to learn policies that align with desired risk preferences. Furthermore, we incorporate RPPO into an easy-to-implement population-based self-play framework, yielding the RPBT approach, in which a population of agents with diverse and dynamically adjusted risk preferences compete with one another. We showed that our method can generate diverse modes of policies and achieve comparable or superior performance over existing methods. To the best of our knowledge, this is the first work that uses a risk-oriented approach to train a diverse population of agents. Future directions include developing a meta-policy that can adaptively select the optimal risk level according to the opponent's strategy and exploring the applicability of RPPO to safe reinforcement learning scenarios.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 62192751, in part by the National Science and Technology Innovation 2030 Major Project (Grant No. 2022ZD0208804), in part by the Key R&D Project of China under Grant No. 2017YFC0704100, the 111 International Collaboration Program of China under Grant No. BP2018006, and in part by the BNRist Program under Grant No. BNR2019TD01009, the National Innovation Center of High Speed Train R&D project (CX/KJ-20200006), in part by the InnoHK Initiative, The Government of HKSAR, and in part by the Laboratory for AI-Powered Financial Technologies.

References

Al-Shedivat, M.; Bansal, T.; Burda, Y.; Sutskever, I.; Mordatch, I.; and Abbeel, P. 2018. Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments. In International Conference on Learning Representations.
Baker, B.; Kanitscheider, I.; Markov, T.; Wu, Y.; Powell, G.; McGrew, B.; and Mordatch, I. 2020. Emergent Tool Use From Multi-Agent Autocurricula. In International Conference on Learning Representations.
Balduzzi, D.; Garnelo, M.; Bachrach, Y.; Czarnecki, W.; Perolat, J.; Jaderberg, M.; and Graepel, T. 2019. Open-ended learning in symmetric zero-sum games. In International Conference on Machine Learning, 434-443. PMLR.
Bansal, T.; Pachocki, J.; Sidor, S.; Sutskever, I.; and Mordatch, I. 2018. Emergent Complexity via Multi-Agent Competition. In International Conference on Learning Representations.
Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, 449-458. PMLR.
Bellemare, M. G.; Dabney, W.; and Rowland, M. 2023. Distributional Reinforcement Learning. MIT Press.
Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; Józefowicz, R.; Gray, S.; Olsson, C.; Pachocki, J. W.; Petrov, M.; de Oliveira Pinto, H. P.; Raiman, J.; Salimans, T.; Schlatter, J.; Schneider, J.; Sidor, S.; Sutskever, I.; Tang, J.; Wolski, F.; and Zhang, S. 2019. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680.
Brown, G. W. 1951. Iterative Solution of Games by Fictitious Play. In Koopmans, T. C., ed., Activity Analysis of Production and Allocation. New York: Wiley.
Chow, Y.; and Ghavamzadeh, M. 2014. Algorithms for CVaR optimization in MDPs. Advances in Neural Information Processing Systems, 27.
Czarnecki, W. M.; Gidel, G.; Tracey, B.; Tuyls, K.; Omidshafiei, S.; Balduzzi, D.; and Jaderberg, M. 2020. Real world games look like spinning tops. Advances in Neural Information Processing Systems, 33: 17443-17454.
Delétang, G.; Grau-Moya, J.; Kunesch, M.; Genewein, T.; Brekelmans, R.; Legg, S.; and Ortega, P. A. 2021. Model-free risk-sensitive reinforcement learning. arXiv preprint arXiv:2111.02907.
Espeholt, L.; Soyer, H.; Munos, R.; Simonyan, K.; Mnih, V.; Ward, T.; Doron, Y.; Firoiu, V.; Harley, T.; Dunning, I.; et al. 2018. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, 1407-1416. PMLR.
Garnelo, M.; Czarnecki, W. M.; Liu, S.; Tirumala, D.; Oh, J.; Gidel, G.; van Hasselt, H.; and Balduzzi, D. 2021. Pick your battles: Interaction graphs as population-level objectives for strategic diversity. arXiv preprint arXiv:2110.04041.
Ha, D. 2020. Slime Volleyball Gym Environment.
Heinrich, J.; Lanctot, M.; and Silver, D. 2015. Fictitious Self-Play in Extensive-Form Games. In Bach, F.; and Blei, D., eds., Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, 805-813. Lille, France: PMLR.
Jaderberg, M.; Czarnecki, W. M.; Dunning, I.; Marris, L.; Lever, G.; Castañeda, A. G.; Beattie, C.; Rabinowitz, N. C.; Morcos, A. S.; Ruderman, A.; Sonnerat, N.; Green, T.; Deason, L.; Leibo, J. Z.; Silver, D.; Hassabis, D.; Kavukcuoglu, K.; and Graepel, T. 2019. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443): 859-865.
Jaderberg, M.; Dalibard, V.; Osindero, S.; Czarnecki, W. M.; Donahue, J.; Razavi, A.; Vinyals, O.; Green, T.; Dunning, I.; Simonyan, K.; et al. 2017. Population based training of neural networks. arXiv preprint arXiv:1711.09846.
Khadka, S.; Majumdar, S.; Nassar, T.; Dwiel, Z.; Tumer, E.; Miret, S.; Liu, Y.; and Tumer, K. 2019. Collaborative evolutionary reinforcement learning. In International Conference on Machine Learning, 3341-3350. PMLR.
Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Pérolat, J.; Silver, D.; and Graepel, T. 2017. A unified game-theoretic approach to multiagent reinforcement learning. Advances in Neural Information Processing Systems, 30.
Li, A.; Spyra, O.; Perel, S.; Dalibard, V.; Jaderberg, M.; Gu, C.; Budden, D.; Harley, T.; and Gupta, P. 2019. A generalized framework for population based training. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1791-1799.
Littman, M. L. 1994. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Machine Learning Proceedings 1994, 157-163. Elsevier. ISBN 978-1-55860-335-6.
Liu, S.; Lever, G.; Heess, N.; Merel, J.; Tunyasuvunakool, S.; and Graepel, T. 2019. Emergent Coordination Through Competition. In International Conference on Learning Representations.
Liu, X.; Jia, H.; Wen, Y.; Hu, Y.; Chen, Y.; Fan, C.; Hu, Z.; and Yang, Y. 2021. Towards unifying behavioral and response diversity for open-ended learning in zero-sum games. Advances in Neural Information Processing Systems, 34: 941-952.
Liu, Z.; Yu, C.; Yang, Y.; Wu, Z.; Li, Y.; et al. 2022. A Unified Diversity Measure for Multiagent Reinforcement Learning. Advances in Neural Information Processing Systems, 35: 10339-10352.
Loshchilov, I.; and Hutter, F. 2016. CMA-ES for hyperparameter optimization of deep neural networks. arXiv preprint arXiv:1604.07269.
Lupu, A.; Cui, B.; Hu, H.; and Foerster, J. 2021. Trajectory Diversity for Zero-Shot Coordination. In Proceedings of the 38th International Conference on Machine Learning, 7204-7213. PMLR.
Ma, X.; Yang, Y.; Hu, H.; Yang, J.; Zhang, C.; Zhao, Q.; Liang, B.; and Liu, Q. 2022. Offline Reinforcement Learning with Value-based Episodic Memory. In International Conference on Learning Representations.
McAleer, S.; Lanier, J. B.; Fox, R.; and Baldi, P. 2020. Pipeline PSRO: A scalable approach for finding approximate Nash equilibria in large games. Advances in Neural Information Processing Systems, 33: 20238-20248.
McMahan, H. B.; Gordon, G. J.; and Blum, A. 2003. Planning in the Presence of Cost Functions Controlled by an Adversary. In Proceedings of the Twentieth International Conference on Machine Learning, ICML'03, 536-543. AAAI Press. ISBN 1577351894.
Mouret, J.-B.; and Clune, J. 2015. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909.
Müller, A. 1997. Integral Probability Metrics and Their Generating Classes of Functions. Advances in Applied Probability, 29(2): 429-443.
Newey, W. K.; and Powell, J. L. 1987. Asymmetric least squares estimation and testing. Econometrica: Journal of the Econometric Society, 819-847.
Parker-Holder, J.; Pacchiano, A.; Choromanski, K. M.; and Roberts, S. J. 2020. Effective diversity in population based reinforcement learning. Advances in Neural Information Processing Systems, 33: 18050-18062.
Perez-Nieves, N.; Yang, Y.; Slumbers, O.; Mguni, D. H.; Wen, Y.; and Wang, J. 2021. Modelling behavioural diversity for learning in open-ended games. In International Conference on Machine Learning, 8514-8524. PMLR.
Qiu, W.; Wang, X.; Yu, R.; Wang, R.; He, X.; An, B.; Obraztsova, S.; and Rabinovich, Z. 2021. RMIX: Learning risk-sensitive policies for cooperative reinforcement learning agents. Advances in Neural Information Processing Systems, 34: 23049-23062.
Rockafellar, R. T.; Uryasev, S.; et al. 2000. Optimization of conditional value-at-risk. Journal of Risk, 2: 21-42.
Rowland, M.; Dadashi, R.; Kumar, S.; Munos, R.; Bellemare, M. G.; and Dabney, W. 2019. Statistics and Samples in Distributional Reinforcement Learning. In International Conference on Machine Learning, 5528-5536. PMLR.
Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2018. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv preprint arXiv:1506.02438.
Shen, R.; Zheng, Y.; Hao, J.; Meng, Z.; Chen, Y.; Fan, C.; and Liu, Y. 2020. Generating Behavior-Diverse Game AIs with Evolutionary Multi-Objective Deep Reinforcement Learning. In Twenty-Ninth International Joint Conference on Artificial Intelligence, volume 4, 3371-3377.
Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; and Lanctot, M. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587): 484-489.
Strouse, D.; McKee, K.; Botvinick, M.; Hughes, E.; and Everett, R. 2021. Collaborating with humans without human data. Advances in Neural Information Processing Systems, 34: 14502-14515.
Tang, Y. C.; Zhang, J.; and Salakhutdinov, R. 2019. Worst cases policy gradients. arXiv preprint arXiv:1911.03618.
Tang, Z.; Yu, C.; Chen, B.; Xu, H.; Wang, X.; Fang, F.; Du, S. S.; Wang, Y.; and Wu, Y. 2021. Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization. In International Conference on Learning Representations.
Tesauro, G.; et al. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3): 58-68.
Tversky, A.; and Kahneman, D. 1992. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5: 297-323.
Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).
Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D. H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782): 350-354.
Wu, S.; Yao, J.; Fu, H.; Tian, Y.; Qian, C.; Yang, Y.; Fu, Q.; and Wei, Y. 2023. Quality-Similar Diversity via Population Based Reinforcement Learning. In The Eleventh International Conference on Learning Representations.
Yang, Q.; Simão, T. D.; Tindemans, S. H.; and Spaan, M. T. 2021. WCSAC: Worst-case soft actor critic for safety-constrained reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 10639-10646.
Ying, C.; Zhou, X.; Su, H.; Yan, D.; Chen, N.; and Zhu, J. 2022. Towards safe reinforcement learning via constraining conditional value-at-risk. arXiv preprint arXiv:2206.04436.
Yu, C.; Gao, J.; Liu, W.; Xu, B.; Tang, H.; Yang, J.; Wang, Y.; and Wu, Y. 2023. Learning Zero-Shot Cooperation with Humans, Assuming Humans Are Biased. In The Eleventh International Conference on Learning Representations.
Zhao, R.; Song, J.; Yuan, Y.; Haifeng, H.; Gao, Y.; Wu, Y.; Sun, Z.; and Wei, Y. 2022. Maximum Entropy Population-Based Training for Zero-Shot Human-AI Coordination. arXiv preprint arXiv:2112.11701.