# Policy Space Diversity for Non-Transitive Games

Jian Yao1, Weiming Liu1, Haobo Fu1, Yaodong Yang2, Stephen McAleer3, Qiang Fu1, Wei Yang1
1Tencent AI Lab, Shenzhen, China 2Peking University, Beijing, China 3Carnegie Mellon University
Equal contribution. Correspondence to: Haobo Fu (haobofu@tencent.com).
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Policy-Space Response Oracles (PSRO) is an influential algorithmic framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have tried to promote policy diversity in PSRO. A major weakness of existing diversity metrics is that a more diverse (according to their diversity metrics) population does not necessarily mean (as we prove in the paper) a better approximation to a NE. To alleviate this problem, we propose a new diversity metric, the improvement of which guarantees a better approximation to a NE. Meanwhile, we develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best-response solving in PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on various games demonstrate that PSD-PSRO is more effective at producing significantly less exploitable policies than state-of-the-art PSRO variants. The experiment code is available at https://github.com/nigelyaoj/policy-space-diversity-psro.

1 Introduction

Most real-world games demonstrate strong non-transitivity [9], where the winning rule follows a cyclic pattern (e.g., the strategy cycle in Rock-Paper-Scissors) [6, 3]. A common objective in solving non-transitive games is to find a Nash Equilibrium (NE), which has the best worst-case performance in the whole policy space. Traditional algorithms, like simple self-play, fail to converge to a NE in games with strong non-transitivity [26]. Recently, many game-theoretic methods have been proposed to approximate a NE in such games. For example, Counterfactual Regret Minimization (CFR) [54] minimizes the so-called counterfactual regret. Neural Fictitious Self-Play [18, 19] extends the classical game-theoretic approach, Fictitious Play (FP) [4], to larger games using Reinforcement Learning (RL) to approximate a Best Response (BR). Another well-known algorithm is Policy-Space Response Oracles (PSRO) [26], which generalizes the double oracle approach [38] by adopting an RL subroutine to approximate a BR.

Improving the performance of PSRO on approximating a NE is an active research topic, and many PSRO variants have been proposed so far, which generally fall into three categories. The first category [36, 45, 12, 29] aims to improve the training efficiency at each iteration. For instance, Pipeline PSRO [36] trains multiple BRs in parallel at each iteration. Neural population learning [29] enables fast transfer learning across policies by representing a population of policies within a single conditional model. The second category [37, 53] incorporates no-regret learning into PSRO, which solves an unrestricted-restricted game with a no-regret learning method to guarantee the decrease of exploitability of the meta-strategy across iterations. The third category [2, 41, 32, 33]
promotes policy diversity in the population, which is usually implemented by incorporating a diversity regularization term into the BR solving in the original PSRO.

Despite achieving promising improvements over the original PSRO, the theoretical reason why the diversity metrics in existing diversity-enhancing PSRO variants [2, 41, 32, 33] help PSRO in terms of approximating a NE is unclear. More specifically, those diversity metrics are justified in the sense that adding the corresponding diversity-regularized BR strictly enlarges the gamescape. However, as we prove and demonstrate later in the paper, a population with a larger gamescape is neither a sufficient nor a necessary condition for a better approximation (we will define the precise meaning later) to a NE. One fundamental reason is that the gamescape is a concept that varies significantly with the choice of opponent policies. In contrast, exploitability (the distance to a NE) measures the worst-case performance, which is invariant to the choice of opponent policies.

In this paper, we seek a new and better justified diversity metric that improves the approximation of a NE in PSRO. To achieve this, we introduce a new concept, named Population Exploitability (PE), to quantify the strength of a population. The PE of a population is the optimal exploitability that can be achieved by selecting a policy from its Policy Hull (PH), which is simply the complete set of policies that are convex combinations of individual policies in the population. In addition, we show that a larger PH means a lower PE. Based on these insights, we make the following contributions:

- We point out a major and common weakness of existing diversity-enhancing PSRO variants: their goal of enlarging the gamescape of the population in PSRO is somewhat deceptive to the extent that it can lead to a weaker population in terms of PE. In other words, a more diverse (according to their diversity metrics) population ⇒ a larger gamescape ⇏ closer to a full game NE.
- We develop a new diversity metric that encourages the enlargement of a population's PH. In addition, we develop a practical and well-justified method to optimize our diversity metric using only state-action samples. We then incorporate our diversity metric (as a regularization term) into the BR solving in the original PSRO and obtain a new algorithm: Policy Space Diversity PSRO (PSD-PSRO). Our method PSD-PSRO establishes the causality: a more diverse (according to our diversity metric) population ⇒ a larger PH ⇒ a lower PE ⇒ closer to a full game NE.
- We prove that a full game NE is guaranteed once PSD-PSRO converges. In contrast, it is not clear, in other state-of-the-art diversity-enhancing PSRO variants [2, 41, 32, 33], whether a full game NE is found once they converge in terms of their optimization objectives. Notably, PSROrN [2] is not guaranteed to find a NE once converged [36].

2 Notations and Preliminary

2.1 Extensive-form Games, NE, and Exploitability

Extensive-form games are used to model sequential interaction involving multiple agents and can be defined by a tuple ⟨N, S, P, A, u⟩. N = {1, 2} denotes the set of players (we focus on two-player zero-sum games). S is a set of information states for decision-making. Each information state node s ∈ S includes a set of actions A(s) that lead to subsequent information states. The player function P : S → N ∪ {c}, with c denoting chance, determines which player takes an action at s.
We use s_i, S_i = {s ∈ S | P(s) = i}, and A_i = ∪_{s ∈ S_i} A(s) to denote player i's state, set of states, and set of actions, respectively. We consider games with perfect recall, where each player remembers the sequence of states leading to the current state. A player's behavioral strategy is denoted by π_i(s) ∈ Δ(A(s)), s ∈ S_i, and π_i(a|s) is the probability of player i taking action a at s. A strategy profile π = (π_1, π_2) is a pair of strategies, one for each player, and we use π_{-i} to refer to the strategy in π other than π_i. u_i(π) = u_i(π_i, π_{-i}) denotes the payoff for player i when both players follow π. The BR of player i to the opponent's strategy π_{-i} is denoted by BR(π_{-i}) = argmax_{π'_i} u_i(π'_i, π_{-i}). The exploitability of a strategy profile π is defined as:

E(π) = (1/|N|) Σ_{i ∈ N} [ max_{π'_i} u_i(π'_i, π_{-i}) − u_i(π_i, π_{-i}) ].   (1)

When E(π) = 0, π is a NE of the game.

2.2 Meta-Games, PH, and PSRO

Meta-games are introduced to represent games at a higher level. Denoting a population of mixed strategies for player i by Π_i := {π_i^1, π_i^2, ...}, the payoff matrix on the joint population Π = Π_i × Π_{-i} is denoted by M_{Π_i,Π_{-i}}, where M_{Π_i,Π_{-i}}[j, k] := u_i(π_i^j, π_{-i}^k). The meta-game on Π and M_{Π_i,Π_{-i}} is simply a normal-form game where selecting an action means choosing which π_i to play for player i. Accordingly, we use σ_i (σ_i is called a meta-strategy and could be, e.g., playing [π_i^1, π_i^2] with probabilities [0.5, 0.5]) to denote a mixed strategy over Π_i, i.e., σ_i ∈ Δ(Π_i). A meta-policy σ_i over Π_i can be viewed as a convex combination of policies in Π_i, and we define the PH of a population, H(Π_i), as the set of all convex combinations of the policies in Π_i. Meta-games are often open-ended in the sense that there exist infinitely many mixed strategies and that new policies will be successively added to Π_i and Π_{-i}, respectively. We give a summary of notations in Appendix A.

PSRO operates on meta-games and consists of two components: an oracle and a meta-policy solver. At each iteration t, PSRO maintains a population of policies, denoted by Π_i^t, for each player i. The joint meta-policy solver first computes a NE meta-policy σ^t on the restricted meta-game represented by M_{Π_i^t,Π_{-i}^t}. Afterwards, for each player i, the oracle computes an approximate BR (i.e., π_i^{t+1}) against the meta-policy σ_{-i}^t: π_i^{t+1} ≈ BR(σ_{-i}^t). The new policy π_i^{t+1} is then added to its population (Π_i^{t+1} = Π_i^t ∪ {π_i^{t+1}}), and the next iteration starts. In the end, PSRO outputs a meta-policy NE on the final joint population as an approximation to a full game NE.

2.3 Previous Diversity Metrics for PSRO

Effective diversity [2] measures the variety of effective strategies (strategies with support under a meta-policy NE) and uses a rectifier to focus on how these effective strategies beat each other. Let (σ_i*, σ_{-i}*) denote a meta-policy NE on M_{Π_i,Π_{-i}}. The effective diversity of Π_i is:

Div(Π_i) = σ_i*ᵀ ⌊M_{Π_i,Π_{-i}}⌋_+ σ_{-i}*,   (2)

where ⌊x⌋_+ := x if x ≥ 0 and 0 otherwise, applied elementwise.

Expected Cardinality [41], inspired by determinantal point processes [35], measures the diversity of a population Π_i as the expected cardinality of the random set Y sampled according to det(L_Π):

Div(Π_i) = E_{Y ∼ P_{L_Π}}[|Y|] = Tr(I − (L_Π + I)^{-1}),   (3)

where |Y| is the cardinality of Y, and L_Π = M_{Π_i,Π_{-i}} M_{Π_i,Π_{-i}}ᵀ.

Convex Hull Enlargement [32] builds on the idea of enlarging the convex hull of all row vectors in the payoff matrix:

Div(Π_i ∪ {π'_i}) = min_{1ᵀβ = 1, β ≥ 0} ||M_{Π_i,Π_{-i}}ᵀ β − m||,   (4)

where π'_i is the new strategy to add, and m is the payoff vector of π'_i against each opponent policy in Π_{-i}: m[j] = u_i(π'_i, π_{-i}^j).
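To make the payoff-based metrics above concrete, the following is a minimal NumPy/SciPy sketch (our illustration, not the authors' implementation) of Equations 3 and 4 for a small empirical payoff matrix; the function names and the use of a generic constrained optimizer are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def expected_cardinality(M):
    """Equation 3: Tr(I - (L + I)^-1), with the DPP kernel L = M M^T built from payoff rows."""
    L = M @ M.T
    n = L.shape[0]
    return np.trace(np.eye(n) - np.linalg.inv(L + np.eye(n)))

def convex_hull_enlargement(M, m):
    """Equation 4: distance from the new payoff vector m to the convex hull of the rows of M."""
    n = M.shape[0]
    res = minimize(
        lambda b: np.linalg.norm(M.T @ b - m),
        x0=np.ones(n) / n,                                     # start from the uniform mixture
        bounds=[(0.0, 1.0)] * n,
        constraints=({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},),
    )
    return res.fun

# Toy empirical payoff matrix of Pi_i (2 policies) against Pi_-i (3 policies).
M = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0]])
print(expected_cardinality(M))                                  # larger when payoff rows are more spread out
print(convex_hull_enlargement(M, np.array([-1.0, 1.0, 0.0])))   # > 0: this new payoff vector enlarges the hull
```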
Occupancy Measure Mismatching [32] considers the state-action distribution ρ_π(s, a) induced by a joint policy π. When considering adding a new policy π'_i, the corresponding diversity metric is:

Div(Π_i ∪ {π'_i}) = D_f( ρ_{(π'_i, σ_{-i}*)} || ρ_{(σ_i*, σ_{-i}*)} ),   (5)

where π'_i is the new policy to add, (σ_i*, σ_{-i}*) is a meta-policy NE on M_{Π_i,Π_{-i}}, and D_f is a general f-divergence between two distributions. It is worth noting that Equation 5 only considers the difference between two policies (π'_i and σ_i*) instead of between π'_i and Π_i. In practice, this diversity metric is used together with the convex hull enlargement in [32].

Unified Diversity Measure [33] offers a unified view on existing diversity metrics and is defined as:

Div(Π_i) = Σ_{m=1}^{|Π_i|} f(λ_m),   (6)

where f takes different forms for different existing diversity metrics, λ_m are the eigenvalues of the kernel matrix [K(φ_m, φ_n)]_{|Π_i| × |Π_i|}, K(·, ·) is a predefined kernel function, and φ_m is the strategy feature of the m-th policy in Π_i. It is worth mentioning that only payoff vectors in M_{Π_i,Π_{-i}} were investigated as the strategy feature for the new diversity metric proposed in [33].

3 A Common Weakness of Existing Diversity-Enhancing PSRO Variants

As shown in the last section, all previous diversity-enhancing PSRO variants [2, 41, 32, 33] try to enlarge the gamescape of Π_i, which is the convex hull of the rows in the empirical payoff matrix:

GS(Π_i | Π_{-i}) := { Σ_j α_j m_j : α ≥ 0, αᵀ1 = 1 },

where m_j is the j-th row vector of M_{Π_i,Π_{-i}}. However, the gamescape of a population depends on the choice of opponent policies, and two policies with the same payoff vector are not necessarily the same. Moreover, enlarging the gamescape without careful tuning would encourage the current player to deliberately lose to the opponent to obtain diverse payoffs. We suspect this might be the reason why the optimization of the gamescape is activated only later in the training procedure in [32]. More importantly, it is not theoretically clear from previous diversity-enhancing PSRO variants why enlarging the gamescape would help in approximating a full game NE in PSRO.

To rigorously answer the question whether a diversity metric is helpful in approximating a NE in PSRO, we need a performance measure to monitor the progress of PSRO across iterations in terms of finding a full game NE. In other words, we need to quantify the strength of a population of policies. Previously, the exploitability of a meta NE of the joint population was usually employed to monitor the progress of PSRO. Yet, as demonstrated in [37], this exploitability may increase after an iteration. Intuitively, a better alternative is the exploitability of the least exploitable mixed strategy supported by a population. We define this exploitability as the population exploitability:

Definition 3.1. For a joint population Π = Π_i × Π_{-i}, let (σ_i*, σ_{-i}*) be a meta NE on M_{Π_i,Π_{-i}}. The relative population performance of Π_i against Π_{-i} [2] is:

P_i(Π_i, Π_{-i}) = σ_i*ᵀ M_{Π_i,Π_{-i}} σ_{-i}*.   (7)

The population exploitability of the joint population Π is defined as:

PE(Π) = (1/2) Σ_{i=1,2} max_{Π'_i ⊆ Ω_i} P_i(Π'_i, Π_{-i}),   (8)

where Ω_i is the full set of all possible mixed strategies of player i.

We note that PE is equal to the sum of the negative population effectivity defined in [32]. Yet, we prefer PE as it is a more natural extension of the exploitability in Equation 1. Some properties of PE and its relation to the PH are presented in the following.

Proposition 3.2. Considering a joint population Π = Π_i × Π_{-i}, we have:

1. PE(Π) ≥ 0, ∀Π.
2. For another joint population Π', if H(Π_i) ⊆ H(Π'_i) and H(Π_{-i}) ⊆ H(Π'_{-i}), then PE(Π) ≥ PE(Π').
3. If Π_i = {π_i} and Π_{-i} = {π_{-i}}, then PE(Π) = E(π), where π = (π_i, π_{-i}).
4. ∃ π = (π_i, π_{-i}) ∈ H(Π_i) × H(Π_{-i}) s.t. E(π) = PE(Π) = min_{π' ∈ H(Π_i) × H(Π_{-i})} E(π').
5. PE(Π) = 0 if and only if there exists a full game NE (σ_i*, σ_{-i}*) such that (σ_i*, σ_{-i}*) ∈ H(Π_i) × H(Π_{-i}).

The proof is in Appendix B.2. Once we use PE to monitor the progress of PSRO, we have:

Proposition 3.3. The PE of the joint population Π^t at each iteration t in PSRO is monotonically decreasing and converges to 0 within a finite number of iterations for finite games. Once PE(Π^T) = 0, a meta NE on Π^T is a full game NE.

The proof is in Appendix B.3. From Propositions 3.2 and 3.3, we are convinced that PE is indeed an appropriate performance measure for populations of policies. Using PE, we can now formally present why enlarging the gamescape of the population in PSRO is somewhat deceptive:

Theorem 3.4. The enlargement of the gamescape is neither sufficient nor necessary for the decrease of PE. Considering two populations (Π_i^1 and Π_i^2) for player i and one population Π_{-i} for player −i, and denoting Π^j = Π_i^j × Π_{-i}, j = 1, 2, we have

GS(Π_i^1 | Π_{-i}) ⊆ GS(Π_i^2 | Π_{-i})  ⇏  PE(Π^1) ≥ PE(Π^2),
PE(Π^1) ≥ PE(Π^2)  ⇏  GS(Π_i^1 | Π_{-i}) ⊆ GS(Π_i^2 | Π_{-i}).   (9)

The proof of Theorem 3.4 is in Appendix B.5, where we provide concrete examples. In other words, enlarging the gamescape, in either the short term or the long term, does not necessarily lead to a better approximation to a full game NE.

4 Policy Space Diversity PSRO

In this section, we develop a new diversity-enhancing PSRO variant, i.e., PSD-PSRO. In contrast to methods that enlarge the gamescape, PSD-PSRO encourages the enlargement of the PH of a population, which helps reduce a population's PE (according to Proposition 3.2). In addition, we develop a well-justified state-action sampling method to optimize our diversity metric in practice. Finally, we present the convergence property of PSD-PSRO and discuss its relation to the original PSRO.

4.1 A New Diversity Regularization Term for PSRO

Our purpose in promoting diversity in PSRO is to facilitate the convergence to a full game NE. We follow the conventional scheme in previous diversity-enhancing PSRO variants [41, 32, 33], which introduces a diversity regularization term into the BR solving in PSRO. Nonetheless, our diversity regularization encourages the enlargement of the PH of the current population, in contrast to the enlargement of the gamescape in previous methods [2, 41, 32, 33]. We thus name our diversity metric policy space diversity. Intuitively, the larger the PH of a population is, the more likely it is to include a full game NE. More formally, a larger PH means a lower PE (Proposition 3.2), which means our diversity metric avoids the common weakness (Section 3) of existing ones.

Recall that the PH of a population is simply the complete set of policies that are convex combinations of individual policies in the population. To quantify the contribution of a new policy to the enlargement of the PH of the current population, a straightforward idea is to maximize a distance between the new policy and the PH. Such a distance should be 0 for any policy that belongs to the PH and greater than 0 otherwise. Without loss of generality, the distance between a policy and a PH can be defined as the minimal distance between the policy and any policy in the PH.
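To illustrate this last point, here is a small sketch (ours, not from the paper) of the distance between a candidate policy and the policy hull of a population, using a squared Euclidean distance over probability vectors as a stand-in; for single-state games, mixing such vectors linearly coincides with the policy-hull mixture, whereas for extensive-form games the paper uses the sequence-form representation and Bregman divergence introduced in Section 4.2.

```python
import numpy as np
from scipy.optimize import minimize

def distance_to_policy_hull(pi_new, population):
    """min over convex weights beta of || sum_k beta_k * pi_k - pi_new ||^2,
    which is 0 iff pi_new already lies in the policy hull H(population)."""
    P = np.asarray(population)                      # shape: (num_policies, policy_dim)
    n = P.shape[0]
    res = minimize(
        lambda beta: float(np.sum((beta @ P - pi_new) ** 2)),
        x0=np.ones(n) / n,
        bounds=[(0.0, 1.0)] * n,
        constraints=({'type': 'eq', 'fun': lambda b: b.sum() - 1.0},),
    )
    return res.fun

population = [np.array([1.0, 0.0, 0.0]),            # e.g., two pure strategies in a 3-action game
              np.array([0.0, 1.0, 0.0])]
print(distance_to_policy_hull(np.array([0.5, 0.5, 0.0]), population))   # ~0: inside the hull
print(distance_to_policy_hull(np.array([1/3, 1/3, 1/3]), population))   # >0: would enlarge the hull
```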
We can now write down the diversity-regularized BR solving objective in PSD-PSRO, where at each iteration t for player i we add a new policy π_i^{t+1} by solving:

π_i^{t+1} = argmax_{π_i} [ u(π_i, σ_{-i}^t) + λ · min_{π_i^k ∈ H(Π_i^t)} dist(π_i, π_i^k) ],   (10)

where σ_{-i}^t is the opponent's meta NE policy at the t-th iteration, λ is a Lagrange multiplier, and dist(·, ·) is a distance function (specified in the next subsection) between two policies.

4.2 Diversity Optimization in Practice

To be able to optimize our diversity metric (the second term in Equation 10) in practice, we need to encode a policy into some representation space and specify a distance function there. Such a representation should be a one-to-one mapping between a policy and its representation. Also, to ensure that enlarging the convex hull in the representation space results in the enlargement of the PH, we require the representation to satisfy a linearity property. Formally, we have the following definition:

Definition 4.1. A fine policy representation for our purpose is a function ρ : Π_i → R^{N_i} that satisfies the following two properties:

(bijection) For any representation ρ(π_i), there exists a unique behavioral policy π_i whose representation is ρ(π_i), and vice versa.

(linearity) For any two policies π_i^j and π_i^k and any α (0 ≤ α ≤ 1), the following holds: ρ(α π_i^j + (1−α) π_i^k) = α ρ(π_i^j) + (1−α) ρ(π_i^k), where α π_i^j + (1−α) π_i^k means playing π_i^j with probability α and π_i^k with probability 1−α.

Existing diversity metrics explicitly or implicitly define a policy representation [33]. For instance, the gamescape-based methods [2, 41, 32, 33] represent a policy using its payoff vector against the opponent's population. Yet, this representation is not a fine policy representation, as it is not a bijection (different policies can have the same payoff vector). The (joint) occupancy measure, which is a fine policy representation, is usually used to encode a policy in the RL community [48, 20, 32]. An f-divergence is then employed to measure the distance between two policies [32, 24, 16]. However, computing the f-divergence based on the occupancy measure is usually intractable and is often roughly approximated in practice using the predictions of neural networks [32, 24, 16].

Instead, we use another fine policy representation, i.e., the sequence-form representation [21, 11, 31, 27, 1], which was originally developed for representing a policy in multi-agent games. We then define the distance between two policies using the Bregman divergence, which can be further simplified to a tractable form and optimized using only state-action samples in practice.

The sequence-form representation of a policy records the realization probability of reaching a state-action pair. We follow the definition in [11, 31], where the sequence-form representation x_i ∈ X_i ⊆ [0, 1]^{|S_i × A_i|} of π_i is a vector:

x_i(s, a) = ∏_{(s̃_i, ã) ∈ τ(s,a)} π_i(ã | s̃_i),   (11)

where τ(s, a) is the trajectory from the beginning of the game to (s, a), and the product runs over player i's state-action pairs along it. By the perfect-recall assumption, there is a unique τ that leads to (s, a). The policy π_i can be recovered as π_i(a|s) = x_i(s, a) / ||x_i(s)||_1, where x_i(s) = (x_i(s, a_1), ..., x_i(s, a_n)) with a_1, ..., a_n ∈ A(s). Unlike the payoff-vector representation or the occupancy-measure representation, x_i is independent of the opponent's policy as well as the environmental dynamics. Therefore, it is more appropriate for representing a policy in diversity optimization.
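The sketch below (ours; the explicit own-sequence map used to encode the game is an assumption for illustration) computes the sequence-form vector of Equation 11 from a behavioral policy, relying on perfect recall so that each (s, a) has a unique sequence of player i's own state-action pairs leading to it.

```python
def sequence_form(policy, own_sequences):
    """
    policy: dict state -> dict action -> probability (behavioral strategy pi_i).
    own_sequences: dict (s, a) -> list of player i's (state, action) pairs on the unique
                   path to (s, a), excluding (s, a) itself (guaranteed by perfect recall).
    Returns x_i with x_i[(s, a)] = product of pi_i along the path, times pi_i(a|s)  (Equation 11).
    """
    x = {}
    for (s, a), prefix in own_sequences.items():
        prob = policy[s][a]
        for (s_prev, a_prev) in prefix:
            prob *= policy[s_prev][a_prev]
        x[(s, a)] = prob
    return x

# Tiny example: player i acts at s0 and, only after taking 'b', again at s1.
pi = {'s0': {'a': 0.3, 'b': 0.7}, 's1': {'c': 0.5, 'd': 0.5}}
seqs = {('s0', 'a'): [], ('s0', 'b'): [],
        ('s1', 'c'): [('s0', 'b')], ('s1', 'd'): [('s0', 'b')]}
print(sequence_form(pi, seqs))
# {('s0','a'): 0.3, ('s0','b'): 0.7, ('s1','c'): 0.35, ('s1','d'): 0.35}
```

One can check the linearity property of Definition 4.1 directly on this representation: the sequence form of "play π_i^j with probability α, else π_i^k" is α x_i^j + (1−α) x_i^k, which is exactly why enlarging the convex hull of sequence-form vectors enlarges the PH.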
Without loss of generality and following [21, 11, 31], we define the distance dist(π_i, π'_i) between two policies as the Bregman divergence on the sequence-form representation, B_d(x_i ∥ x'_i), which can be further written in terms of state-action pairs (the derivation is presented in Appendix B.1) as follows:

dist(π_i, π'_i) := B_d(x_i ∥ x'_i) = Σ_{(s,a) ∈ S_i × A_i} ( ∏_{(s̃_i, ã) ∈ τ(s,a)} π_i(ã | s̃_i) ) β_s B_{d_s}(π_i(s) ∥ π'_i(s)),   (12)

where B_{d_s}(π_i(s) ∥ π'_i(s)) is the Bregman divergence between π_i(s) and π'_i(s). In our experiments, we let B_{d_s}(π_i(s) ∥ π'_i(s)) = Σ_a π_i(a|s) log( π_i(a|s) / π'_i(a|s) ) = KL(π_i(s) ∥ π'_i(s)), i.e., the KL divergence. In previous work [21, 11, 31], the coefficient β_s usually declines monotonically as the length of the sequence increases. In our case, we make β_s depend on an opponent policy b_{-i}: β_s = ∏_{(s̃_{-i}, ã) ∈ τ(s,a)} b_{-i}(ã | s̃_{-i}). (We assume β_s > 0, i.e., b_{-i}(a | s_{-i}) > 0 for all (s_{-i}, a) ∈ S_{-i} × A_{-i}.) This weighting method allows us to estimate the distance using the sampled average KL divergence and avoids importance sampling:

dist(π_i, π'_i) = Σ_{(s,a) ∈ S_i × A_i} ( ∏_{(s̃_i, ã) ∈ τ(s,a)} π_i(ã | s̃_i) ) ( ∏_{(s̃_{-i}, ã) ∈ τ(s,a)} b_{-i}(ã | s̃_{-i}) ) B_{d_s}(π_i(s) ∥ π'_i(s)) = E_{s_i ∼ π_i, b_{-i}}[ KL(π_i(s_i) ∥ π'_i(s_i)) ],   (13)

where s_i ∼ π_i, b_{-i} means sampling player i's information states from the trajectories that are collected by playing π_i against b_{-i}. As a result, we can rewrite the diversity-regularized BR solving objective in PSD-PSRO as follows:

π_i^{t+1} = argmax_{π_i} [ u(π_i, σ_{-i}^t) + λ · min_{π_i^k ∈ H(Π_i^t)} E_{s_i ∼ π_i, b_{-i}}[ KL(π_i(s_i) ∥ π_i^k(s_i)) ] ].   (14)

Now we provide a practical way to optimize Equation 14. By regarding the opponent and chance as the environment, we can use the policy gradient method [47, 43, 44] in RL to train π_i^{t+1}. Denote the probability of generating τ by π_i(τ), and the payoff of player i for the trajectory τ by R(τ). Let π_i^min = argmin_{π_i^k ∈ H(Π_i^t)} E_{s_i ∼ π_i, b_{-i}}[ KL(π_i(s_i) ∥ π_i^k(s_i)) ] be the policy in H(Π_i^t) that minimizes the distance, and let R_KL(τ) = E_{s_i ∼ τ}[ KL(π_i(s_i) ∥ π_i^min(s_i)) ]. The gradient of the first term u(π_i, σ_{-i}^t) in Equation 14 with respect to π_i can be written as:

∇ u(π_i, σ_{-i}^t) = ∇ ∫ π_i(τ) R(τ) dτ = ∫ ∇π_i(τ) R(τ) dτ = ∫ π_i(τ) ∇log π_i(τ) R(τ) dτ = E_{τ ∼ π_i, σ_{-i}^t}[ ∇log π_i(τ) R(τ) ].   (15)

The gradient of the diversity term in Equation 14 with respect to π_i can be written as:

∇_{π_i} min_{π_i^k ∈ H(Π_i^t)} E_{s_i ∼ τ, τ ∼ π_i, b_{-i}}[ KL(π_i(s_i) ∥ π_i^k(s_i)) ]
= ∇ E_{s_i ∼ τ, τ ∼ π_i, b_{-i}}[ KL(π_i(s_i) ∥ π_i^min(s_i)) ]
= ∇ ∫ π_i(τ) R_KL(τ) dτ
= ∫ ∇π_i(τ) R_KL(τ) dτ + ∫ π_i(τ) E_{s_i ∼ τ}[ ∇KL(π_i(s_i) ∥ π_i^min(s_i)) ] dτ
= E_{τ ∼ π_i, b_{-i}}[ ∇log π_i(τ) R_KL(τ) ] + E_{s_i ∼ π_i, b_{-i}}[ ∇KL(π_i(s_i) ∥ π_i^min(s_i)) ],   (16)

where we use the property that ∇ E[ KL(π_i(s_i) ∥ π_i^min(s_i)) ] ∈ ∂ min_{π_i^k} E[ KL(π_i(s_i) ∥ π_i^k(s_i)) ], whose correctness can be shown from the following proposition:

Proposition 4.2. For any locally Lipschitz continuous function f(x, y), assume that for every x, min_y f(x, y) exists. Then ∂_x f(x, y)|_{y ∈ argmin_y f(x,y)} ⊆ ∂_x min_y f(x, y), where ∂ denotes the generalized gradient [7].

Proof. According to Theorem 2.1 (property (4)) in [7], the result is immediate, as ∂_x min_y f(x, y) is the convex hull of { ∂_x f(x, y) | y ∈ argmin_y f(x, y) }.

Combining the above two equations, we have

∇( u(π_i, σ_{-i}^t) + λ min_{π_i^k ∈ H(Π_i^t)} E_{s_i ∼ π_i, b_{-i}}[ KL(π_i(s_i) ∥ π_i^k(s_i)) ] )
= E_{τ ∼ π_i, σ_{-i}^t}[ ∇log π_i(τ) R(τ) ] + λ E_{τ ∼ π_i, b_{-i}}[ ∇log π_i(τ) R_KL(τ) ] + λ E_{s_i ∼ π_i, b_{-i}}[ ∇KL(π_i(s_i) ∥ π_i^min(s_i)) ].   (17)

According to Equation 17, we can see that optimizing π_i^{t+1} requires maximizing two types of rewards: R(τ) and R_KL(τ). This can be done by averaging the gradients using samples that are generated by playing π_i against σ_{-i}^t and b_{-i} separately. The last term in Equation 17 is easily estimated by sampling states from the trajectories collected by playing π_i against b_{-i}.
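To show how the gradient in Equation 17 can be turned into a loss for an automatic-differentiation framework, here is a PyTorch-style sketch (ours, not the authors' implementation; all tensor names and shapes are assumptions) that combines the two REINFORCE terms with the direct KL term.

```python
import torch

def psd_psro_surrogate_loss(logits_payoff, actions_payoff, returns,
                            logits_div, actions_div, kl_returns, kl_direct, lam):
    """
    logits_payoff, actions_payoff, returns : [B, T, A], [B, T], [B], from trajectories of
        pi_i vs. sigma^t_{-i}; returns[b] = R(tau_b), the game payoff.
    logits_div, actions_div, kl_returns    : same shapes, from trajectories of pi_i vs. b_{-i};
        kl_returns[b] = R_KL(tau_b), the trajectory-averaged KL to pi_i^min, used as a reward.
    kl_direct : differentiable scalar, the mean KL(pi_i(s) || pi_i^min(s)) over states sampled vs. b_{-i}.
    """
    logp_payoff = torch.distributions.Categorical(logits=logits_payoff) \
                       .log_prob(actions_payoff).sum(dim=-1)        # log pi_i(tau), per trajectory
    logp_div = torch.distributions.Categorical(logits=logits_div) \
                       .log_prob(actions_div).sum(dim=-1)
    # Minimizing this loss performs gradient ascent on Equation 17; the reward-like factors
    # are detached so that only log pi_i(tau) and the direct KL term carry gradients.
    return -(logp_payoff * returns.detach()).mean() \
           - lam * (logp_div * kl_returns.detach()).mean() \
           - lam * kl_direct
```

In practice one would add the usual variance-reduction and trust-region machinery (baselines, PPO-style clipping [44]) on top of both REINFORCE terms; that is orthogonal to the diversity regularization itself.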
For training efficiency, we simply set b_{-i} = σ_{-i}^t in our experiments, although other settings of b_{-i} are possible, e.g., the uniform random policy. We also want to emphasize that we optimize π_i^{t+1} at each iteration t, and σ_{-i}^t is fixed during an iteration and thus can be viewed as part of the environment. Finally, to estimate the distance between π_i and π_i^min, we should ideally compute π_i^min exactly by solving a convex optimization problem. In practice, we find that sampling policies in H(Π_i^t) and using the minimal sampled distance already gives satisfactory performance. The pseudo-code of PSD-PSRO is provided in Appendix C.

4.3 The Convergence Property of PSD-PSRO

We first show how the PH evolves in the original PSRO:

Proposition 4.3. Before the PE of the joint population reaches 0, adding a BR to the meta NE policy from the last iteration in PSRO strictly enlarges the PH of the current population.

The proof is in Appendix B.4. From Proposition 4.3, we can see that adding a BR serves as one (implicit) way of enlarging the PH and hence reducing the PE. In contrast, the optimization of our diversity term in Equation 14 aims to explicitly enlarge the PH. In other words, adding a diversity-regularized BR in PSD-PSRO serves as a mixed way of enlarging the PH. More formally, we have the following theorem:

Theorem 4.4. (1) In PSD-PSRO, before the PE of the joint population reaches 0, adding an optimal solution of Equation 14 strictly enlarges the PH and hence reduces the PE. (2) Once the PH cannot be enlarged by adding an optimal solution of Equation 14 (i.e., PSD-PSRO converges), the PE reaches 0, and PSD-PSRO finds a full game NE.

The proof is in Appendix B.6. In terms of convergence properties, one significant benefit of PSD-PSRO over other state-of-the-art diversity-enhancing PSRO variants [2, 41, 32, 33] is that the convergence of the PH guarantees a full game NE in PSD-PSRO. Yet, it is not clear in those papers whether a full game NE is found once the PHs of their populations converge. Notably, PSROrN [2] is not guaranteed to find a NE once converged [36]. In practice, we expect a significant performance improvement of PSD-PSRO over PSRO in approximating a NE. For different games, there might exist different optimal trade-offs between exploitation (adding a BR) and exploration (optimizing our diversity metric) in enlarging the PH and reducing the PE. In other words, PSD-PSRO generalizes PSRO (which is the PSD-PSRO instance with λ = 0) in how the PH is enlarged and the PE is reduced.

5 Related Work

Diversity has been widely studied in evolutionary computation [13], with a central focus on mimicking the natural evolution process. One of the ideas is novelty search [28], which searches for policies that lead to novel outcomes. By hybridizing novelty search with a fitness objective, quality-diversity [42] aims for diverse behaviors of good quality. Although these methods achieve good empirical results [8, 23], their diversity metrics are often hand-crafted for different tasks. Promoting diversity is also intensively studied in RL. By adding a distance regularization between the current policy and a previous policy, a diversity-driven approach has been proposed for better exploration [22]. Unsupervised learning of diverse policies [10] has been studied as an effective pretraining mechanism for downstream RL tasks. A diversity metric based on DPPs [39] has been proposed to improve exploration in population-based training.
Diverse behaviors have been learned to improve generalization to test environments that differ from training [25]. A diversity-regularized collaborative exploration strategy has been proposed in [40]. Reward randomization [49] has been employed to discover diverse strategies in multi-agent games. Trajectory diversity has been studied for better zero-shot coordination in a multi-agent environment [34]. Quality-similar diversity has been investigated in [51]. Diversity also plays a role in game-theoretic methods. Smooth FP [17] adds a policy entropy term when finding a BR. PSROrN [2] encourages effective diversity, which focuses on amplifying a policy's strengths over its weaknesses. DPP-PSRO [41] introduces a diversity metric based on DPPs and provides a geometric interpretation of behavioral diversity. BD&RD-PSRO [32] combines the occupancy-measure mismatch and the diversity of payoff vectors into a unified diversity metric. UDM [33] summarizes existing diversity metrics by providing a unified diversity framework. In both opponent modeling [15] and opponent-limited subgame solving [30], diversity has been shown to have a large impact on performance.

Figure 1: (a) Exploitability of the meta NE. (b) PE of the joint population.

Figure 2: Non-Transitive Mixture Game. Exploration trajectories during training. For each method, the final exploitability ×100 (Exp) is reported at the bottom.

6 Experiments

The main purpose of the experiments is to compare PSD-PSRO with existing state-of-the-art PSRO variants in terms of approximating a full game NE. The baseline methods include PSRO [26], Pipeline PSRO (P-PSRO) [36], PSROrN [2], DPP-PSRO [41], and BD&RD-PSRO [32]. The benchmarks consist of single-state games (AlphaStar888 and the non-transitive mixture game) and complex extensive-form games (Leduc poker and Goofspiel). For AlphaStar888 and Leduc poker, we report the exploitability of the meta NE and the PE of the joint population throughout the training process. For the non-transitive mixture game, we illustrate the diversity of the population and report the final exploitability. For Goofspiel, where the exact exploitability is intractable, we report the win rate between the final agents. In addition, we illustrate how the PH evolves for each method using the Disc game [2] in Appendix D.3, where PSD-PSRO is more effective at enlarging the PH and approximating a NE. An ablation study on λ of PSD-PSRO in Appendix D.2 reveals that, across benchmarks, an optimal trade-off between exploitation and exploration in enlarging the PH to approximate a NE usually occurs when λ is greater than zero. Error bars or standard deviations in the results are obtained from 5 independent runs. In Appendix D.3, we also investigate the time cost of calculating our policy space diversity. More details on the environments and hyper-parameters are given in Appendix E.

AlphaStar888 is an empirical game generated during the process of solving StarCraft II [50] and contains a payoff table for 888 RL policies. It can be viewed as a symmetric zero-sum two-player game with only one state s0. In s0, there are 888 legal actions. Any mixed strategy is a discrete probability distribution over the 888 actions. Hence, the distance function in Equation 14 for AlphaStar888 reduces to the KL divergence between two 888-dimensional discrete probability distributions. In Figure 1, we can see that PSD-PSRO is more effective at reducing both the exploitability and the PE than the other methods.
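For such single-state games, a policy is just a probability vector, and the sampled-hull approximation of Section 4.2 becomes very simple. The sketch below (ours; the Dirichlet sampling scheme and function name are assumptions) estimates the diversity term by sampling convex combinations of the population and taking the minimal KL divergence.

```python
import numpy as np

def min_kl_to_hull_samples(pi, population, n_samples=256, rng=None):
    """Single-state case (e.g., AlphaStar888): approximate
    min_{pi_k in H(Pi)} KL(pi || pi_k) by sampling convex combinations of the population."""
    rng = np.random.default_rng() if rng is None else rng
    pop = np.asarray(population)                        # shape: (num_policies, num_actions)
    weights = rng.dirichlet(np.ones(len(pop)), size=n_samples)
    mixtures = weights @ pop                            # sampled points of the policy hull
    kls = np.sum(pi * (np.log(pi + 1e-12) - np.log(mixtures + 1e-12)), axis=1)
    return float(kls.min())

# Example with a 3-action game: a candidate far from the hull gets a large diversity bonus.
print(min_kl_to_hull_samples(np.array([0.8, 0.1, 0.1]),
                             [np.array([0.1, 0.8, 0.1]), np.array([0.1, 0.1, 0.8])]))
```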
Non-Transitive Mixture Game consists of seven equally spaced Gaussian humps on the 2D plane. Each strategy can be represented by a point on the 2D plane, which is equivalent to the weights (the likelihood of that point under each Gaussian distribution) that each player puts on the humps. The optimal strategy is to stay close to the center of the Gaussian humps and explore all the distributions. In Figure 2, we show the exploration trajectories of the different methods during training, where PSRO and PSROrN get trapped and fail in this non-transitive game. In contrast, PSD-PSRO tends to find diverse strategies and explores all Gaussians. Also, the exploitability of the meta NE of the final population is significantly lower for PSD-PSRO than for the others.

Figure 3: (a) Exploitability of the meta NE. (b) PE of the joint population.

Leduc Poker is a simplified poker game [46], where the deck consists of two suits with three cards in each suit. Each player bets one chip as an ante, and a single private card is dealt to each player. Since DPP-PSRO cannot scale to the RL setting [32] and the code of BD&RD-PSRO for complex games is not available, we compare PSD-PSRO only to P-PSRO, PSROrN, and PSRO. As demonstrated in Figure 3, PSD-PSRO is more effective at reducing both the exploitability and the PE.

| | PSRO | PSROrN | P-PSRO | PSD-PSRO (ours) |
| --- | --- | --- | --- | --- |
| PSRO | - | 0.613 ± 0.019 | 0.469 ± 0.034 | 0.422 ± 0.025 |
| PSROrN | 0.387 ± 0.019 | - | 0.412 ± 0.030 | 0.358 ± 0.019 |
| P-PSRO | 0.531 ± 0.034 | 0.588 ± 0.030 | - | 0.370 ± 0.031 |
| PSD-PSRO (ours) | 0.578 ± 0.025 | 0.642 ± 0.019 | 0.630 ± 0.031 | - |

Table 1: The win rate of the row agents against the column agents on Goofspiel.

Goofspiel is a commonly used large-scale multi-stage simultaneous-move game. Goofspiel features strong non-transitivity, as every pure strategy can be exploited by a simple counter-strategy. We compare PSD-PSRO with PSRO, P-PSRO, and PSROrN on Goofspiel under the 5-point-card and 8-point-card settings. In the 5-point-card setting, due to the relatively small game size, we can calculate the exact exploitability. We report the results in Appendix D.1, where we see that PSD-PSRO reduces the exploitability more effectively than the other methods. In the 8-point-card setting, the game is too large to compute the exact exploitability at each iteration. In this setting, we compare the final solutions produced by the different methods. We report the win rate between each pair of methods in Table 1, where we can see that PSD-PSRO consistently beats the existing methods with a 62% win rate on average.

7 Conclusions and Limitations

In this paper, we point out a major and common weakness of existing diversity metrics in previous diversity-enhancing PSRO variants, namely that their goal of enlarging the gamescape does not necessarily result in a better approximation to a full game NE. Based on the insight that a larger PH means a lower PE (a better approximation to a NE), we develop a new diversity metric (policy space diversity) that explicitly encourages the enlargement of a population's PH. We then develop a practical method to optimize our diversity metric using only state-action samples, which is derived from the Bregman divergence on the sequence form of policies. We incorporate our diversity metric into the BR solving in PSRO to obtain PSD-PSRO. We present the convergence property of PSD-PSRO, and extensive experiments demonstrate that PSD-PSRO is significantly more effective in approximating a NE than state-of-the-art PSRO variants.
The diversity regularization weight λ in PSD-PSRO plays an important role in balancing exploitation and exploration in terms of enlarging the PH to approximate a NE. In this paper, other than showing that different problems have different optimal settings of λ, we have not discussed how to choose λ optimally. Although there is related work on how to adapt a diversity regularization term using online bandits [39], future work is still needed on how our policy space diversity can best benefit the approximation of a full game NE. Another interesting direction for future work is to extend PSD-PSRO to larger-scale games, such as poker [5], Mahjong [14], and dark chess [52].

References

[1] Yu Bai, Chi Jin, Song Mei, and Tiancheng Yu. Near-optimal learning of extensive-form games with imperfect information. In International Conference on Machine Learning, pages 1337–1382. PMLR, 2022.
[2] David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech Czarnecki, Julien Perolat, Max Jaderberg, and Thore Graepel. Open-ended learning in symmetric zero-sum games. In International Conference on Machine Learning, pages 434–443. PMLR, 2019.
[3] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pages 354–363. PMLR, 2018.
[4] George W Brown. Iterative solution of games by fictitious play. Act. Anal. Prod Allocation, 13(1):374, 1951.
[5] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
[6] Ozan Candogan, Ishai Menache, Asuman Ozdaglar, and Pablo A Parrilo. Flows and decompositions of games: Harmonic and potential games. Mathematics of Operations Research, 36(3):474–503, 2011.
[7] Frank H Clarke. Generalized gradients and applications. Transactions of the American Mathematical Society, 205:247–262, 1975.
[8] Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. Robots that can adapt like animals. Nature, 521(7553):503–507, 2015.
[9] Wojciech M Czarnecki, Gauthier Gidel, Brendan Tracey, Karl Tuyls, Shayegan Omidshafiei, David Balduzzi, and Max Jaderberg. Real world games look like spinning tops. In Advances in Neural Information Processing Systems, volume 33, pages 17443–17454, 2020.
[10] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2018.
[11] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Optimistic regret minimization for extensive-form games via dilated distance-generating functions. In Advances in Neural Information Processing Systems, volume 32, 2019.
[12] Xidong Feng, Oliver Slumbers, Ziyu Wan, Bo Liu, Stephen McAleer, Ying Wen, Jun Wang, and Yaodong Yang. Neural auto-curricula in two-player zero-sum games. In Advances in Neural Information Processing Systems, volume 34, pages 3504–3517, 2021.
[13] David B Fogel. Evolutionary computation: toward a new philosophy of machine intelligence. John Wiley & Sons, 2006.
[14] Haobo Fu, Weiming Liu, Shuang Wu, Yijia Wang, Tao Yang, Kai Li, Junliang Xing, Bin Li, Bo Ma, Qiang Fu, et al. Actor-critic policy optimization in a large-scale imperfect-information game. In International Conference on Learning Representations, 2021.
[15] Haobo Fu, Ye Tian, Hongxiang Yu, Weiming Liu, Shuang Wu, Jiechao Xiong, Ying Wen, Kai Li, Junliang Xing, Qiang Fu, et al.
Greedy when sure and conservative when uncertain about the opponents. In International Conference on Machine Learning, pages 6829–6848. PMLR, 2022.
[16] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
[17] Drew Fudenberg and David K Levine. Consistency and cautious fictitious play. Journal of Economic Dynamics and Control, 19(5-7):1065–1089, 1995.
[18] Johannes Heinrich, Marc Lanctot, and David Silver. Fictitious self-play in extensive-form games. In International Conference on Machine Learning, pages 805–813. PMLR, 2015.
[19] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121, 2016.
[20] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, volume 29, 2016.
[21] Samid Hoda, Andrew Gilpin, Javier Pena, and Tuomas Sandholm. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2):494–512, 2010.
[22] Zhang-Wei Hong, Tzu-Yun Shann, Shih-Yang Su, Yi-Hsiang Chang, Tsu-Jui Fu, and Chun-Yi Lee. Diversity-driven exploration strategy for deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 31, 2018.
[23] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
[24] Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. arXiv e-prints, arXiv:1911, 2019.
[25] Saurabh Kumar, Aviral Kumar, Sergey Levine, and Chelsea Finn. One solution is not all you need: Few-shot extrapolation via structured maxent RL. In Advances in Neural Information Processing Systems, volume 33, pages 8198–8210, 2020.
[26] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, volume 30, 2017.
[27] Chung-Wei Lee, Christian Kroer, and Haipeng Luo. Last-iterate convergence in extensive-form games. In Advances in Neural Information Processing Systems, volume 34, pages 14293–14305, 2021.
[28] Joel Lehman and Kenneth O Stanley. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223, 2011.
[29] Siqi Liu, Luke Marris, Daniel Hennes, Josh Merel, Nicolas Heess, and Thore Graepel. NeuPL: Neural population learning. arXiv preprint arXiv:2202.07415, 2022.
[30] Weiming Liu, Haobo Fu, Qiang Fu, and Yang Wei. Opponent-limited online search for imperfect information games. In International Conference on Machine Learning. PMLR, 2023.
[31] Weiming Liu, Huacong Jiang, Bin Li, and Houqiang Li. Equivalence analysis between counterfactual regret minimization and online mirror descent. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 13717–13745, 2022.
[32] Xiangyu Liu, Hangtian Jia, Ying Wen, Yujing Hu, Yingfeng Chen, Changjie Fan, Zhipeng Hu, and Yaodong Yang. Towards unifying behavioral and response diversity for open-ended learning in zero-sum games.
In Advances in Neural Information Processing Systems, volume 34, pages 941–952, 2021.
[33] Zongkai Liu, Chao Yu, Yaodong Yang, Zifan Wu, Yuan Li, et al. A unified diversity measure for multiagent reinforcement learning. In Advances in Neural Information Processing Systems, 2022.
[34] Andrei Lupu, Brandon Cui, Hengyuan Hu, and Jakob Foerster. Trajectory diversity for zero-shot coordination. In International Conference on Machine Learning, pages 7204–7213. PMLR, 2021.
[35] Odile Macchi. The fermion process - a model of stochastic point process with repulsive points. In Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians, pages 391–398. Springer, 1977.
[36] Stephen McAleer, John B Lanier, Roy Fox, and Pierre Baldi. Pipeline PSRO: A scalable approach for finding approximate Nash equilibria in large games. In Advances in Neural Information Processing Systems, volume 33, pages 20238–20248, 2020.
[37] Stephen McAleer, Kevin Wang, Marc Lanctot, John Lanier, Pierre Baldi, and Roy Fox. Anytime optimal PSRO for two-player zero-sum games. arXiv preprint arXiv:2201.07700, 2022.
[38] H Brendan McMahan, Geoffrey J Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 536–543, 2003.
[39] Jack Parker-Holder, Aldo Pacchiano, Krzysztof M Choromanski, and Stephen J Roberts. Effective diversity in population based reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, pages 18050–18062, 2020.
[40] Zhenghao Peng, Hao Sun, and Bolei Zhou. Non-local policy optimization via diversity-regularized collaborative exploration. arXiv preprint arXiv:2006.07781, 2020.
[41] Nicolas Perez-Nieves, Yaodong Yang, Oliver Slumbers, David H Mguni, Ying Wen, and Jun Wang. Modelling behavioural diversity for learning in open-ended games. In International Conference on Machine Learning, pages 8514–8524. PMLR, 2021.
[42] Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, page 40, 2016.
[43] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
[44] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[45] Max Smith, Thomas Anthony, and Michael Wellman. Iterative empirical game solving via single policy best response. In International Conference on Learning Representations, 2020.
[46] Finnegan Southey, Michael P Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes' bluff: Opponent modelling in poker. arXiv preprint arXiv:1207.1411, 2012.
[47] Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 12, 1999.
[48] Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, pages 1032–1039, 2008.
[49] Zhenggang Tang, Chao Yu, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Shaolei Du, Yu Wang, and Yi Wu.
Discovering diverse multi-agent strategic behavior via reward randomization. In International Conference on Learning Representations, 2020.
[50] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[51] Shuang Wu, Jian Yao, Haobo Fu, Ye Tian, Chao Qian, Yaodong Yang, Qiang Fu, and Yang Wei. Quality-similar diversity via population based reinforcement learning. In The Eleventh International Conference on Learning Representations, 2022.
[52] Brian Zhang and Tuomas Sandholm. Subgame solving without common knowledge. In Advances in Neural Information Processing Systems, volume 34, pages 23993–24004, 2021.
[53] Ming Zhou, Jingxiao Chen, Ying Wen, Weinan Zhang, Yaodong Yang, Yong Yu, and Jun Wang. Efficient policy space response oracles. arXiv preprint arXiv:2202.00633, 2022.
[54] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, volume 20, 2007.

A Notations

We provide the notation in Table 2.

Table 2: Notations

Extensive-Form Games
- N = {1, 2}: set of players
- S: set of information states
- s ∈ S: state node
- A(s): set of actions allowed to be performed at state s
- P: player function
- i / −i: player i / the player other than i
- s_i: player i's information state
- S_i = {s ∈ S | P(s) = i}: set of player i's information states
- A_i = ∪_{s ∈ S_i} A(s): set of player i's actions
- π_i: player i's behavioral strategy
- π = (π_i, π_{-i}): strategy profile
- u_i(π) = u_i(π_i, π_{-i}): payoff for player i
- BR(π_{-i}) := argmax_{π'_i} u_i(π'_i, π_{-i}): best response of player i against player −i
- E(π) = (1/|N|) Σ_{i ∈ N} [ max_{π'_i} u_i(π'_i, π_{-i}) − u_i(π_i, π_{-i}) ]: exploitability of the profile π

Meta-Games
- i / −i: i ∈ {1, 2}; player i / the player other than i
- π_i^j / π_i: j-th policy / any policy of player i
- π = (π_1, π_2): joint policy
- Π_i := {π_i^1, π_i^2, ...}: policy set (population) of player i
- Π = Π_1 × Π_2: joint policy set
- σ_i ∈ Δ(Π_i): meta-policy over Π_i
- σ = (σ_1, σ_2): joint meta-policy
- M_{Π_i,Π_{-i}}: player i's payoff table in the meta-game, with (M_{Π_i,Π_{-i}})_{jk} := u_i(π_i^j, π_{-i}^k)
- Ω_i: the full set of all possible mixed strategies of player i
- H(Π_i) = { Σ_j β_j π_i^j | β_j ≥ 0, Σ_{j=1}^t β_j = 1 }: Policy Hull of Π_i = {π_i^1, π_i^2, ..., π_i^t}
- P_i(Π_i, Π_{-i}) = σ_i*ᵀ M_{Π_i,Π_{-i}} σ_{-i}*: relative population performance for player i
- PE(Π) = (1/2) Σ_{i=1,2} max_{Π'_i ⊆ Ω_i} P_i(Π'_i, Π_{-i}): population exploitability
- GS(Π_i | Π_{-i}) = { Σ_j α_j m_j : α ≥ 0, αᵀ1 = 1, m_j = M_{Π_i,Π_{-i}}[j,:] }: gamescape of Π_i

PSRO / PSD-PSRO Training
- Π_i^t = {π_i^1, ..., π_i^t}: the policy set of player i at the t-th iteration of PSRO
- σ_i^t ∈ Δ(Π_i^t): the meta-policy of player i at the t-th iteration of PSRO
- ρ(·): policy representation
- x_i ∈ X_i ⊆ [0, 1]^{|S_i × A_i|}: sequence-form representation of a policy
- τ(s, a): trajectory from the initial state to (s, a)
- B_{d_s}(π_i(s) ∥ π'_i(s)): Bregman divergence between π_i(s) and π'_i(s)
- Div(·): diversity of a policy set, as specified by the different methods
- dist(·, ·): distance function
- λ: diversity weight
- KL: Kullback-Leibler divergence

B Theoretical Analysis

B.1 Derivation of the Bregman Divergence

We define the Bregman divergence on the sequence-form representation of the policies. Given a strictly convex function d defined on R^{|S_i × A_i|}, the Bregman divergence is defined as

B_d(x_i ∥ x'_i) = d(x_i) − d(x'_i) − ⟨∇d(x'_i), x_i − x'_i⟩,   (18)

for any x_i, x'_i ∈ X_i.
Intuitively, the Bregman divergence is the gap between d(x_i) and its first-order approximation at x'_i. It is non-negative (zero is achieved if and only if x_i = x'_i) and strictly convex in its first argument, so it can be used to measure the difference between x_i and x'_i. Examples of Bregman divergences include the squared Euclidean distance, which is generated by the l2 function, and the Kullback-Leibler (KL) divergence, which is generated by the negative entropy function.

For sequence-form policies, we use the dilated Distance-Generating Function (dilated DGF) [21, 11, 31]:

d(x_i) = Σ_{(s,a) ∈ S_i × A_i} x_i(s, a) β_s d_s( x_i(s) / ||x_i(s)||_1 ),   (19)

where β_s > 0 is a hyper-parameter and d_s is a strictly convex function defined on R^{|A(s)|}, for example, the negative entropy function. It is proven for the dilated DGF (Lemma D.2 in [31]) that the Bregman divergence satisfies

B_d(x_i ∥ x'_i) = Σ_{(s,a) ∈ S_i × A_i} x_i(s, a) β_s B_{d_s}( x_i(s) / ||x_i(s)||_1 ∥ x'_i(s) / ||x'_i(s)||_1 ).   (20)

This result allows us to compute the Bregman divergence using the behavioral policy instead of explicitly using the sequence-form representation, which would be prohibitive in large-scale games. Formally, we define the distance between two policies π_i, π'_i as

dist(π_i, π'_i) := B_d(x_i ∥ x'_i) = Σ_{(s,a) ∈ S_i × A_i} ( ∏_{(s̃_i, ã) ∈ τ(s,a)} π_i(ã | s̃_i) ) β_s B_{d_s}(π_i(s) ∥ π'_i(s)).   (21)

B.2 Proof of Proposition 3.2

Proof. Recall that the PE of the joint policy set Π in a two-player zero-sum game can be written as

PE(Π) = (1/2) Σ_{i=1,2} max_{Π'_i ⊆ Ω_i} P_i(Π'_i, Π_{-i}) = (1/2) ( max_{Π'_i ⊆ Ω_i} P_i(Π'_i, Π_{-i}) − min_{Π'_{-i} ⊆ Ω_{-i}} P_i(Π_i, Π'_{-i}) ),

since P_{-i}(Π'_{-i}, Π_i) = −P_i(Π_i, Π'_{-i}).

1. We can check this property directly, since

(1/2) ( max_{Π'_i ⊆ Ω_i} P_i(Π'_i, Π_{-i}) − min_{Π'_{-i} ⊆ Ω_{-i}} P_i(Π_i, Π'_{-i}) ) ≥ (1/2) ( P_i(Π_i, Π_{-i}) − P_i(Π_i, Π_{-i}) ) = 0.

2. We first note that the condition consists of H(Π_i) ⊆ H(Π'_i) and H(Π_{-i}) ⊆ H(Π'_{-i}). The condition H(Π_{-i}) ⊆ H(Π'_{-i}) implies that for any Φ_i ⊆ Ω_i, we have P_i(Φ_i, Π_{-i}) ≥ P_i(Φ_i, Π'_{-i}), which is a property of the relative population performance [2]. Taking the maximum on both sides, we have

max_{Φ_i ⊆ Ω_i} P_i(Φ_i, Π_{-i}) ≥ max_{Φ_i ⊆ Ω_i} P_i(Φ_i, Π'_{-i}).   (22)

Similarly, using the condition H(Π_i) ⊆ H(Π'_i), we have

min_{Φ_{-i} ⊆ Ω_{-i}} P_i(Π_i, Φ_{-i}) ≤ min_{Φ_{-i} ⊆ Ω_{-i}} P_i(Π'_i, Φ_{-i}).   (23)

Combining Equations 22 and 23, we obtain PE(Π) ≥ PE(Π').

3. By the monotonicity of PE, we have

PE(Π) = (1/2) ( P_i(Ω_i, Π_{-i}) − P_i(Π_i, Ω_{-i}) ).   (24)

Since Π_{-i} = {π_{-i}} contains only one policy, by the definition of the relative population performance, we have

P_i(Ω_i, {π_{-i}}) = u(BR(π_{-i}), π_{-i});   P_i({π_i}, Ω_{-i}) = u(π_i, BR(π_i)).   (25)

Substituting Equation 25 into Equation 24, we obtain

PE(Π) = E(π).   (26)

4. By Equation 24 and the definition of the relative population performance, there exist a NE (σ_i, π_{-i}) of the game M_{Ω_i,Π_{-i}} and a NE (π_i, σ_{-i}) of the game M_{Π_i,Ω_{-i}} such that PE(Π) = (1/2) ( u(σ_i, π_{-i}) − u(π_i, σ_{-i}) ). By the definition of a NE, we have

PE(Π) = (1/2) ( max_{π'_i ∈ Ω_i} u(π'_i, π_{-i}) − min_{π'_{-i} ∈ Ω_{-i}} u(π_i, π'_{-i}) ) = E(π),   (27)

where π = (π_i, π_{-i}) with π_i ∈ H(Π_i) and π_{-i} ∈ H(Π_{-i}). We now prove that (π_i, π_{-i}) is also the least exploitable profile in the joint policy hull. For any π' = (π'_i, π'_{-i}) ∈ H(Π_i) × H(Π_{-i}), we have

E(π) = PE(Π) ≤ PE({π'_i} × {π'_{-i}}) = E(π'),   (28)

where the inequality follows from the monotonicity of PE. Hence, (π_i, π_{-i}) is the least exploitable profile in the Policy Hull of the population.

5. Necessity: If (σ_i*, σ_{-i}*) is a NE of the full game and lies in H(Π_i) × H(Π_{-i}), we can regard {σ_i*} and {σ_{-i}*} as populations that each contain a single policy. Then, by property 2, we have

PE(Π) ≤ PE({σ_i*} × {σ_{-i}*}) = (1/2) ( P_i(Ω_i, {σ_{-i}*}) − P_i({σ_i*}, Ω_{-i}) ) = (1/2) ( u(σ_i*, σ_{-i}*) − u(σ_i*, σ_{-i}*) ) = 0.

Note that by property 1 we have PE(Π) ≥ 0, so PE(Π) = 0.
Sufficiency: Using the conclusion of property 4, it is easy to check the sufficiency. If PE(Π) = 0, there exists σ = (σ_i, σ_{-i}) ∈ H(Π_i) × H(Π_{-i}) such that E(σ) = 0, which means σ is a NE of the full game.

B.3 Proof of Proposition 3.3

Proof.
Recall that at each iteration t, PSRO computes a NE (σ_i^t, σ_{-i}^t) of the meta-game M_{Π_i^t,Π_{-i}^t} and then adds the new policies π^{t+1} = (BR(σ_{-i}^t), BR(σ_i^t)) to the populations. If PE(Π^t) > 0, then by the second and third properties in Proposition 3.2, the Policy Hull of the population does not contain a NE of the full game (otherwise, PE(Π^t) would be at most the exploitability of that NE, which is 0). Therefore, since (σ_i^t, σ_{-i}^t) ∈ H(Π_i^t) × H(Π_{-i}^t), it is not a NE of the full game, which means at least one of the following conditions holds: (1) BR(σ_{-i}^t) ∉ H(Π_i^t); (2) BR(σ_i^t) ∉ H(Π_{-i}^t). This implies the enlargement of the Policy Hull.

In finite games, since PSRO adds a pure BR at each iteration, this Policy Hull extension stops at some iteration, say T, in which case we have BR(σ_{-i}^T) ∈ H(Π_i^T) and BR(σ_i^T) ∈ H(Π_{-i}^T) (by the above analysis). Let (π_i^{T+1}, π_{-i}^{T+1}) := (BR(σ_{-i}^T), BR(σ_i^T)). We claim that (σ_i^T, σ_{-i}^T) is a NE of the whole game. To see this, note that (σ_i^T, σ_{-i}^T) is a NE of the game M_{Π_i^T,Π_{-i}^T} and π_i^{T+1} ∈ H(Π_i^T); hence we have

u(σ_i^T, σ_{-i}^T) ≥ u(π_i^{T+1}, σ_{-i}^T).   (29)

On the other hand, since π_i^{T+1} is the BR to σ_{-i}^T, we have

u(π_i^{T+1}, σ_{-i}^T) ≥ u(σ_i^T, σ_{-i}^T).   (30)

Combining Equations 29 and 30, we obtain u(π_i^{T+1}, σ_{-i}^T) = u(σ_i^T, σ_{-i}^T), which implies that σ_i^T is also a BR to σ_{-i}^T. Similarly, we can prove that σ_{-i}^T is a BR to σ_i^T, and hence (σ_i^T, σ_{-i}^T) is a NE of the full game. Finally, by the fourth property in Proposition 3.2, we have PE(Π^T) = 0.

B.4 Proof of Proposition 4.3

This is an intermediate result in the proof of Proposition 3.3 (Appendix B.3).

B.5 Proof of Theorem 3.4

Proof. We prove the theorem by providing concrete counterexamples. Let us first focus on the following two-player zero-sum game (payoffs are given for the row player i):

| | R | P | S |
| --- | --- | --- | --- |
| R | 0 | -1 | 1 |
| P | 1 | 0 | -1 |
| S | -1 | 1 | 0 |
| R' | 1 | -1 | 1 |
| P' | 1 | 1 | -1 |
| S' | -1 | 1 | 1 |

The game is inspired by Rock-Paper-Scissors: we add high-level strategies {R', P', S'} for player i to the original R-P-S game. A high-level strategy beats the low-level strategy of the same type (e.g., R' beats R) and otherwise behaves like it. Let Π_i = {S, R', P'} and Π_{-i} = {R, P, ½R + ½P}, where ½R + ½P means playing [R, P] with probabilities [½, ½]. The payoff matrix M_{Π_i,Π_{-i}} is

| | R | P | ½R + ½P |
| --- | --- | --- | --- |
| S | -1 | 1 | 0 |
| R' | 1 | -1 | 0 |
| P' | 1 | 1 | 1 |

It is player i's turn to choose a new strategy. Consider two candidates: π_i^1 = S' and π_i^2 = ½R + ½R' (there are many choices of π_i^2 that construct the counterexample; we just take one as an example). The payoff vector of π_i^1 against Π_{-i} is (−1, 1, 0), the same as the first row of M_{Π_i,Π_{-i}}. The payoff vector of π_i^2 is (½, −1, −¼), which lies outside the gamescape of Π_i. Denote Π_i^1 = Π_i ∪ {π_i^1} and Π_i^2 = Π_i ∪ {π_i^2}. By the analysis above, we have

GS(Π_i^1 | Π_{-i}) ⊆ GS(Π_i^2 | Π_{-i}).   (33)

Hence, to promote a diversity defined on the payoff matrix and aimed at enlarging the gamescape [2, 41], we should choose π_i^2 to add. However, in this case, we show that adding π_i^1 is more helpful for decreasing PE. Denote Π^1 = Π_i^1 × Π_{-i} and Π^2 = Π_i^2 × Π_{-i}. By Equation 24 and a few calculations, we have

PE(Π^1) − PE(Π^2) = −(1/2) P_i(Π_i^1, Ω_{-i}) + (1/2) P_i(Π_i^2, Ω_{-i}) ≈ −0.17 + 0.10 < 0,   (34)

i.e., PE(Π^1) < PE(Π^2). We can see that adding a new policy following the guidance of enlarging the gamescape may not be the best choice for decreasing PE. Returning to Theorem 3.4, since we have found an example with

GS(Π_i^1 | Π_{-i}) ⊆ GS(Π_i^2 | Π_{-i}) but PE(Π^1) < PE(Π^2),   (35)

we have

GS(Π_i^1 | Π_{-i}) ⊆ GS(Π_i^2 | Π_{-i}) ⇏ PE(Π^1) ≥ PE(Π^2).   (36)

Rewriting the example in Equation 35, we have

PE(Π^2) ≥ PE(Π^1) but GS(Π_i^2 | Π_{-i}) ⊈ GS(Π_i^1 | Π_{-i}).   (37)

Renaming the superscripts (exchanging the identities of 1 and 2), we obtain

PE(Π^1) ≥ PE(Π^2) ⇏ GS(Π_i^1 | Π_{-i}) ⊆ GS(Π_i^2 | Π_{-i}).   (38)

Combining Equations 36 and 38 completes the proof.
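The counterexample above can be verified numerically. The sketch below (ours, assuming the payoff matrix reconstructed above) computes the game values P_i(·, Ω_{-i}) that enter Equation 34 with a standard linear program for zero-sum matrix games.

```python
import numpy as np
from scipy.optimize import linprog

def game_value(A):
    """Value of the zero-sum matrix game max_x min_y x^T A y (row player maximizes)."""
    n_rows, n_cols = A.shape
    c = np.zeros(n_rows + 1)
    c[0] = -1.0                                        # maximize v  <=>  minimize -v
    A_ub = np.hstack([np.ones((n_cols, 1)), -A.T])     # v - (A^T x)_j <= 0 for every pure column j
    b_ub = np.zeros(n_cols)
    A_eq = np.hstack([[[0.0]], np.ones((1, n_rows))])  # sum_i x_i = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] + [(0.0, 1.0)] * n_rows)
    return res.x[0]

# Rows of the reconstructed game (payoffs against the pure columns R, P, S).
S, Rp, Pp = [-1, 1, 0], [1, -1, 1], [1, 1, -1]
pi1 = [-1, 1, 1]        # S': same payoff vector as S against Pi_{-i}, but a different policy
pi2 = [0.5, -1, 1]      # (R + R')/2: a new payoff vector that enlarges the gamescape
base = [S, Rp, Pp]
print(game_value(np.array(base + [pi1], dtype=float)))  # ~0.33 -> P_i(Pi_i^1, Omega_-i), i.e. ~0.17 after halving
print(game_value(np.array(base + [pi2], dtype=float)))  # ~0.20 -> P_i(Pi_i^2, Omega_-i), i.e. ~0.10 after halving
```

Despite the larger gamescape of Π_i^2, the population Π_i^1 achieves a higher worst-case value and therefore a lower PE, which is exactly the deceptiveness stated in Theorem 3.4.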
Remark. We offer another example in a symmetric game, showing that the theorem also holds for this type of game. Consider the symmetric zero-sum game with the following payoff matrix (payoffs are given for the row player):

| | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| A | 0 | -1 | -0.5 | -1 | -4 |
| B | 1 | 0 | 0.5 | -1 | -4 |
| C | 0.5 | -0.5 | 0 | 0 | 4 |
| D | 1 | 1 | 0 | 0 | -4 |
| E | 4 | 4 | -4 | 4 | 0 |

The current policy sets are Π_i = Π_{-i} = {A, B}, whose payoff matrix is

| | A | B |
| --- | --- | --- |
| A | 0 | -1 |
| B | 1 | 0 |

Consider π_i^1 = C and π_i^2 = D. The payoff vector of π_i^1 = C against Π_{-i} is (0.5, −0.5), which is contained in GS(Π_i | Π_{-i}), while the payoff vector of π_i^2 = D against Π_{-i} is (1, 1), which lies outside GS(Π_i | Π_{-i}). Denote Π_i^1 = Π_i ∪ {π_i^1}, Π_i^2 = Π_i ∪ {π_i^2}, Π^1 = Π_i^1 × Π_{-i}, and Π^2 = Π_i^2 × Π_{-i}. We then have GS(Π_i^1 | Π_{-i}) ⊆ GS(Π_i^2 | Π_{-i}). However,

PE(Π^1) − PE(Π^2) = −(1/2) P_i(Π_i^1, Ω_{-i}) + (1/2) P_i(Π_i^2, Ω_{-i}) ≈ 0.17 − 2 < 0.   (41)

Hence, similar to the above analysis, Theorem 3.4 also holds in symmetric games.

B.6 Proof of Theorem 4.4

Proof. Similar to Appendix B.3, we denote the meta-strategy at iteration t by (σ_i^t, σ_{-i}^t). At iteration t of PSRO, if PE(Π^t) > 0, then from Appendix B.3 we know that at least one of the following conditions holds: (1) BR(σ_{-i}^t) ∉ H(Π_i^t); (2) BR(σ_i^t) ∉ H(Π_{-i}^t). Without loss of generality, we assume

BR(σ_{-i}^t) ∉ H(Π_i^t).   (42)

Then, for any σ̂_i ∈ H(Π_i^t) and λ > 0,

u(σ̂_i, σ_{-i}^t) + λ min_{π_i^k ∈ H(Π_i^t)} E_{s_i ∼ σ̂_i, b_{-i}}[ KL(σ̂_i(s_i) ∥ π_i^k(s_i)) ] = u(σ̂_i, σ_{-i}^t) + 0