# neupl_neural_population_learning__1909e50e.pdf

Published as a conference paper at ICLR 2022

NEUPL: NEURAL POPULATION LEARNING

Siqi Liu (University College London, DeepMind) liusiqi@google.com
Luke Marris (University College London, DeepMind) marris@google.com
Daniel Hennes (DeepMind) hennes@google.com
Josh Merel (DeepMind) jsmerel@gmail.com
Nicolas Heess (DeepMind) heess@google.com
Thore Graepel (University College London) t.graepel@ucl.ac.uk

ABSTRACT

Learning in strategy games (e.g. StarCraft, poker) requires the discovery of diverse policies. This is often achieved by iteratively training new policies against existing ones, growing a policy population that is robust to exploit. This iterative approach suffers from two issues in real-world games: a) under a finite budget, approximate best-response operators at each iteration need truncating, resulting in under-trained good-responses entering the population; b) repeated learning of basic skills at each iteration is wasteful and becomes intractable in the presence of increasingly strong opponents. In this work, we propose Neural Population Learning (NeuPL) as a solution to both issues. NeuPL offers convergence guarantees to a population of best-responses under mild assumptions. By representing a population of policies within a single conditional model, NeuPL enables transfer learning across policies. Empirically, we show the generality, improved performance and efficiency of NeuPL across several test domains¹. Most interestingly, we show that novel strategies become more accessible, not less, as the neural population expands.

The need for learning not one, but a population of strategies is rooted in classical game theory. Consider the purely cyclical game of rock-paper-scissors: the performance of individual strategies is meaningless, as improving against one entails losing to another. By contrast, performance can be meaningfully examined between populations. A population consisting of the pure strategies {rock, paper} does well against the singleton population {scissors} because in the meta-game where both populations are revealed, a player picking strategies from the former can always beat a player choosing from the latter². This observation underpins the unifying population learning framework of Policy-Space Response Oracles (PSRO), where a new policy is trained to best-respond to a mixture over previous policies at each iteration, following a meta-strategy solver (Lanctot et al., 2017). Most impressively, Vinyals et al. (2019) explored the strategy game of StarCraft with a league of policies, using a practical variation of PSRO. The league counted close to a thousand sophisticated deep RL agents as the population collectively became robust to exploits.

Unfortunately, such empirical successes often come at considerable costs. Population learning algorithms with theoretical guarantees are traditionally studied in normal-form games (Brown, 1951; McMahan et al., 2003) where best-responses can be solved exactly. This is in stark contrast to real-world Games of Skill (Czarnecki et al., 2020): such games are often temporal in nature, where best-responses can only be approximated with computationally intensive methods (e.g. deep RL). This has two implications. First, for a given opponent, one cannot efficiently tell apart good-responses that have temporarily plateaued at local optima from globally optimal best-responses.
As a result, approximate best-response operators are often truncated prematurely, according to hand-crafted schedules (Lanctot et al., 2017; McAleer et al., 2020). Second, real-world games often afford strategy-agnostic transitive skills that are prerequisite to strategic reasoning. Learning such skills from scratch at each iteration, in the presence of ever more skillful opponents, quickly becomes intractable beyond a few iterations.

Currently at Reality Labs; work carried out while at DeepMind. Work carried out while at DeepMind.
¹See https://neupl.github.io/demo/ for supplementary illustrations.
²This is formally quantified by Relative Population Performance; see Definition A.1 (Balduzzi et al., 2019).

Figure 1: Popular population learning algorithms implemented as directed interaction graphs (bottom), or equivalently, a set of meta-game mixture strategies $\Sigma \in \mathbb{R}^{3 \times 3} := \{\sigma_i\}_{i=1}^{3}$ (top). A directed edge from $i$ to $j$ with weight $\sigma_{ij}$ indicates that policy $i$ optimizes against $j$ with probability $\sigma_{ij}$. Unless labeled, out-edges from each node are weighted equally and their weights sum to one. (Panels: Population Self-Play, Strategic Cycle, Fictitious Play.)

This iterative and isolated approach is fundamentally at odds with human learning. For humans, mastering diverse strategies often facilitates incremental strategic innovation, and learning about new strategies does not stop us from revisiting and improving upon known ones (Caruana, 1997; Krakauer et al., 2006). In this work, we make progress towards endowing artificial agents with a similar capability by extending population learning to real-world games. Specifically, we propose NeuPL, an efficient and general framework that learns and represents diverse policies in symmetric zero-sum games within a single conditional network, using the computational infrastructure of simple self-play (Section 1.2). Theoretically, we show that NeuPL converges to a sequence of iterative best-responses under certain conditions (Section 1.3). Empirically, we illustrate the generality of NeuPL by replicating known results of population learning algorithms on the classical domain of rock-paper-scissors as well as its partially-observed, spatiotemporal counterpart running-with-scissors (Vezhnevets et al., 2020) (Section 2.1). Most interestingly, we show that NeuPL enables transfer learning across policies, discovering exploiters to strong opponents that would have been inaccessible to comparable baselines (Section 2.2). Finally, we show the appeal of NeuPL in the challenge domain of MuJoCo Football (Liu et al., 2019), where players must continuously refine their movement skills in order to coordinate as a team. In this highly transitive game, NeuPL naturally represents a short sequence of best-responses without the need for a carefully chosen truncation criterion (Section 2.4).

Our method is designed with two desiderata in mind. First, at convergence, the resulting population of policies should represent a sequence of iterative best-responses under reasonable conditions. Second, transfer learning should be able to occur across policies throughout training. In this section, we define the problem setting of interest as well as the necessary terminology. We then describe NeuPL, our main conceptual algorithm, as well as its theoretical properties. To make it concrete, we further consider deep RL specifically and offer two practical implementations of NeuPL for real-world games.
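For concreteness, the interaction graphs in Figure 1 can be written down directly as row-stochastic matrices $\Sigma$, where row $i$ is the opponent mixture $\sigma_i$ that policy $i$ optimizes against. The sketch below is illustrative only: the entries follow the figure description (uniform out-edges unless labeled) and are assumptions rather than values taken from the paper's experiments; the fictitious-play panel corresponds to a lower-triangular matrix, a construction revisited in Section 1.2 below.

```python
import numpy as np

# Population self-play (N = 3): every policy optimizes against a
# uniform mixture over the whole population, so every entry is 1/3.
self_play = np.full((3, 3), 1.0 / 3.0)

# Strategic cycle: policy i optimizes only against policy (i + 1) mod N,
# i.e. Sigma is a cyclic permutation matrix.
strategic_cycle = np.roll(np.eye(3), shift=1, axis=1)

for name, sigma in [("self-play", self_play), ("cycle", strategic_cycle)]:
    # Each row must be a valid mixture over opponents.
    assert np.allclose(sigma.sum(axis=1), 1.0), name
```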
1.1 PRELIMINARIES

Approximate Best-Response (ABR) in Stochastic Games. We consider a symmetric zero-sum Stochastic Game (Shapley, 1953) defined by $(S, O, X, A, P, R, p_0)$, with $S$ the state space, $O$ the observation space and $X : S \to O \times O$ the observation function defining the (partial) views of the state for both players. Given joint actions $(a_t, a'_t) \in A \times A$, the state follows the transition distribution $P : S \times A \times A \to \Pr(S)$. The reward function $R : S \to \mathbb{R} \times \mathbb{R}$ defines the rewards for both players in state $s_t$, denoted $R(s_t) = (r_t, -r_t)$. The initial state of the environment follows the distribution $p_0$. In a given state $s_t$, players act according to policies $(\pi(\cdot \mid o_t), \pi'(\cdot \mid o'_t))$. Player $\pi$ achieves an expected return of $J(\pi, \pi') = \mathbb{E}_{\pi, \pi'}[\sum_t r_t]$ against $\pi'$. Policy $\pi^\star$ is a best response to $\pi'$ if $\forall \pi,\ J(\pi^\star, \pi') \ge J(\pi, \pi')$. We define $\hat{\pi} \leftarrow \mathrm{ABR}(\pi, \pi')$ with $J(\hat{\pi}, \pi') \ge J(\pi, \pi')$. In other words, an ABR operator yields a policy $\hat{\pi}$ that does no worse than $\pi$ in the presence of an opponent $\pi'$.

Meta-game Strategies in Population Learning. Given a symmetric zero-sum game and a set of $N$ policies $\Pi := \{\pi_i\}_{i=1}^{N}$, we define a normal-form meta-game where a player's $i$-th action corresponds to executing policy $\pi_i$ for one episode. A meta-game strategy $\sigma$ thus defines a probability assignment, or an action profile, over $\Pi$. Within $\Pi$, we define $U \in \mathbb{R}^{N \times N} \leftarrow \mathrm{EVAL}(\Pi)$ to be the expected payoffs between pure strategies of this meta-game, or equivalently, $U_{ij} := J(\pi_i, \pi_j)$ in the underlying game. We further extend the ABR operator of the underlying game to mixture policies represented by $\sigma$, such that $\hat{\pi} \leftarrow \mathrm{ABR}(\pi, \sigma, \Pi)$ with $\mathbb{E}_{\pi' \sim \sigma}[J(\hat{\pi}, \pi')] \ge \mathbb{E}_{\pi' \sim \sigma}[J(\pi, \pi')]$, where the opponent $\pi'$ is drawn from $\Pi$ according to $\sigma$. Finally, we define $f : \mathbb{R}^{|\Pi| \times |\Pi|} \to \mathbb{R}^{|\Pi|}$ to be a meta-strategy solver (MSS) with $\sigma \leftarrow f(U)$, and $F : \mathbb{R}^{N \times N} \to \mathbb{R}^{N \times N}$ a meta-graph solver (MGS) with $\Sigma \leftarrow F(U)$. The former formulation is designed for iterative optimization of approximate best-responses as in Lanctot et al. (2017), whereas the latter is motivated by concurrent optimization over a set of population-level objectives as in Garnelo et al. (2021). In particular, $\Sigma \in \mathbb{R}^{N \times N} := \{\sigma_i\}_{i=1}^{N}$ defines $N$ population-level objectives, with $\pi_i$ optimized against the mixture policy represented by $\sigma_i$ and $\Pi$. As such, $\Sigma$ corresponds to the adjacency matrix of an interaction graph. Figure 1 illustrates several commonly used population learning algorithms defined by $\Sigma$ or, equivalently, their interaction graphs.

1.2 NEURAL POPULATION LEARNING

We now present NeuPL and contrast it with Policy-Space Response Oracles (PSRO; Lanctot et al., 2017), which similarly focuses on population learning with approximate best-responses obtained by RL.

Algorithm 1: Neural Population Learning (Ours)
1: $\Pi_\theta(\cdot \mid s, \sigma)$  ▷ Conditional neural population net.
2: $\Sigma := \{\sigma_i\}_{i=1}^{N}$  ▷ Initial interaction graph.
3: $F : \mathbb{R}^{N \times N} \to \mathbb{R}^{N \times N}$  ▷ Meta-graph solver.
4: while true do
5:   $\Pi^{\Sigma}_{\theta} \leftarrow \{\Pi_\theta(\cdot \mid s, \sigma_i)\}_{i=1}^{N}$  ▷ Neural population.
6:   for $\sigma_i \in \mathrm{UNIQUE}(\Sigma)$ do
7:     $\Pi^{\sigma_i}_{\theta} \leftarrow \Pi_\theta(\cdot \mid s, \sigma_i)$
8:     $\Pi^{\sigma_i}_{\theta} \leftarrow \mathrm{ABR}(\Pi^{\sigma_i}_{\theta}, \sigma_i, \Pi^{\Sigma}_{\theta})$  ▷ Self-play.
9:   $U \leftarrow \mathrm{EVAL}(\Pi^{\Sigma}_{\theta})$  ▷ (Optional) if $F$ adaptive.
10:  $\Sigma \leftarrow F(U)$  ▷ (Optional) if $F$ adaptive.
11: return $\Pi_\theta, \Sigma$

Algorithm 2: PSRO (Lanctot et al., 2017)
1: $\Pi := \{\pi_0\}$  ▷ Initial policy population.
2: $\sigma \leftarrow \mathrm{UNIF}(\Pi)$  ▷ Initial meta-game strategy.
3: $f : \mathbb{R}^{|\Pi| \times |\Pi|} \to \mathbb{R}^{|\Pi|}$  ▷ Meta-strategy solver.
5: for $i \in [[N]]$ do  ▷ N-step ABR.
6:   Initialize $\pi_{\theta_i}$.
7:   $\pi_{\theta_i} \leftarrow \mathrm{ABR}(\pi_{\theta_i}, \sigma, \Pi)$
8:   $\Pi \leftarrow \Pi \cup \{\pi_{\theta_i}\}$
9:   $U \leftarrow \mathrm{EVAL}(\Pi)$  ▷ Empirical payoffs.
10:  $\sigma \leftarrow f(U)$
11: return $\Pi$
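To make the control flow of Algorithm 1 concrete, here is a minimal, framework-agnostic Python sketch of the NeuPL outer loop. The names `abr_update`, `evaluate_payoffs` and `meta_graph_solver` are placeholders standing in for the ABR operator, the EVAL step and the meta-graph solver $F$; they are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def abr_update(theta, sigma_i, sigmas):
    """Placeholder ABR step: one update of the conditional policy
    Pi_theta(.|s, sigma_i), trained against opponents drawn from the
    neural population according to the mixture sigma_i."""
    # A real implementation would run reinforcement learning here.
    return theta

def evaluate_payoffs(theta, sigmas):
    """Placeholder EVAL step: estimate the N x N empirical payoff
    matrix U between all policies in the neural population."""
    n = len(sigmas)
    return np.zeros((n, n))

def meta_graph_solver(payoffs):
    """Placeholder meta-graph solver F: map the payoff matrix U to a
    new row-stochastic interaction graph Sigma."""
    n = payoffs.shape[0]
    return np.full((n, n), 1.0 / n)

def neupl(theta, sigmas, num_iterations=10, adaptive=False):
    """Outer loop of Algorithm 1 (sketch only). `sigmas` is the N x N
    interaction graph Sigma; `theta` parameterizes the single
    conditional network shared by the whole population."""
    for _ in range(num_iterations):  # stands in for 'while true'
        for sigma_i in np.unique(sigmas, axis=0):  # UNIQUE(Sigma)
            # Concurrent, continued training of every unique policy.
            theta = abr_update(theta, sigma_i, sigmas)
        if adaptive:  # optional: only needed if F is adaptive
            payoffs = evaluate_payoffs(theta, sigmas)
            sigmas = meta_graph_solver(payoffs)
    return theta, sigmas
```

Calling, for example, `neupl(theta={}, sigmas=np.full((4, 4), 0.25))` runs the loop for a population of four policies that share a single set of parameters.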
NeuPL deviates from PSRO in two important ways. First, NeuPL trains all unique policies concurrently and continually, such that no good-response enters the population prematurely due to early truncation. Second, NeuPL represents an entire population of policies via a shared conditional network $\Pi_\theta(\cdot \mid s, \sigma)$, with each policy $\Pi_\theta(\cdot \mid s, \sigma_i)$ conditioned on and optimised against a meta-game mixture strategy $\sigma_i$, enabling transfer learning across policies. This representation also makes NeuPL general: it delegates the choice of effective population size $|\mathrm{UNIQUE}(\Sigma)| \le |\Sigma| = N$ to the meta-graph solver $F$, as $\sigma_i = \sigma_j$ implies $\Pi_\theta(\cdot \mid s, \sigma_i) \equiv \Pi_\theta(\cdot \mid s, \sigma_j)$ (cf. Section 2.1). Finally, NeuPL allows for cyclic interaction graphs, beyond the scope of PSRO. We discuss the generality of NeuPL in the context of prior works in further detail in Appendix D.

N-step Best-Responses via Lower-Triangular Graphs. A popular class of population learning algorithms seeks to converge to a sequence of $N$ iterative best-responses, where each policy $\pi_i$ is a best-response to an opponent meta-game strategy $\sigma_i$ with support over a subset of the policy population $\Pi$.
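One concrete instance of such a lower-triangular interaction graph is a fictitious-play-style construction in which policy $\pi_i$ best-responds to a uniform mixture over its predecessors $\{\pi_1, \dots, \pi_{i-1}\}$. The sketch below is an illustrative assumption of that construction (including the treatment of the first row, which has no predecessors), not the paper's meta-graph solver.

```python
import numpy as np

def lower_triangular_uniform(n: int) -> np.ndarray:
    """Build an n x n interaction graph Sigma where row i places
    uniform mass on the strictly earlier policies 0..i-1, so that
    policy i is trained as a best-response to a uniform mixture over
    its predecessors (an N-step, fictitious-play-like schedule)."""
    sigma = np.zeros((n, n))
    sigma[0, 0] = 1.0  # assumption: the first policy trains via self-play
    for i in range(1, n):
        sigma[i, :i] = 1.0 / i
    return sigma

sigma = lower_triangular_uniform(4)
assert np.allclose(sigma.sum(axis=1), 1.0)    # each row is a mixture
assert np.allclose(np.triu(sigma, k=1), 0.0)  # no edges to later policies
```

Because such a graph stays constant over training, it corresponds to a fixed (non-adaptive) meta-graph solver $F$, so the optional EVAL and graph-update steps in Algorithm 1 can be skipped.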