# Neural Auto-Curricula

Xidong Feng,1 Oliver Slumbers,1 Ziyu Wan,2 Bo Liu,3 Stephen McAleer,4 Ying Wen,2 Jun Wang,1 Yaodong Yang5

1University College London, 2Shanghai Jiao Tong University, 3Institute of Automation, CAS, 4University of California, Irvine, 5Institute for AI, Peking University

When solving two-player zero-sum games, multi-agent reinforcement learning (MARL) algorithms often create populations of agents where, at each iteration, a new agent is discovered as the best response to a mixture over the opponent population. Within such a process, the update rules of "who to compete with" (i.e., the opponent mixture) and "how to beat them" (i.e., finding best responses) are underpinned by manually developed game-theoretical principles such as fictitious play and Double Oracle. In this paper1, we introduce a novel framework, Neural Auto-Curricula (NAC), that leverages meta-gradient descent to automate the discovery of the learning update rule without explicit human design. Specifically, we parameterise the opponent-selection module by neural networks and the best-response module by optimisation subroutines, and update their parameters solely via interaction with the game engine, where both players aim to minimise their exploitability. Surprisingly, even without human design, the discovered MARL algorithms achieve competitive or even better performance than state-of-the-art population-based game solvers (e.g., PSRO) on Games of Skill, differentiable Lotto, non-transitive Mixture Games, Iterated Matching Pennies, and Kuhn Poker. Additionally, we show that NAC is able to generalise from small games to large games, for example training on Kuhn Poker and outperforming PSRO on Leduc Poker. Our work inspires a promising future direction to discover general MARL algorithms solely from data.

1 Equal contributions. Corresponding to . Code released at https://github.com/waterhorse1/NAC

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## 1 Introduction

Two-player zero-sum games have been a central interest of the recent development of multi-agent reinforcement learning (MARL) [60, 2, 53]. In solving such games, a MARL agent has a clear objective: to minimise the worst-case performance, its exploitability, against any potential opponent. When both agents achieve zero exploitability, they reach a Nash equilibrium (NE) [36], a classical solution concept from Game Theory [34, 10]. Even though this objective is straightforward, developing effective algorithms to optimise it often requires tremendous human effort. One effective approach is through iterative methods, in which players iteratively expand a population of agents (a.k.a. auto-curricula [26]): at each iteration, a new agent is trained and added to the player's strategy pool. However, within the auto-curricula process, it is often non-trivial to design effective update rules of "who to compete with" (i.e., opponent selection) and "how to beat them" (i.e., finding best responses). The problem becomes more challenging when one considers additional requirements such as generating agents that have behavioural diversity [58, 37, 3, 28] or generalisation ability [47, 38].

One effective approach to designing auto-curricula is to follow game-theoretical principles such as fictitious play [5] and Double Oracle (DO) methods [33, 11].
For example, in the DO dynamics, each player starts by playing a sub-game of the original game in which only a restricted set of strategies is available; then, at each iteration, each player adds to its strategy pool a best-response strategy, trained against an NE mixture of opponent models in the current sub-game. When an exact best response is hard to compute, approximation methods such as reinforcement learning (RL) are often applied [25]. One potential downside of this approach is that, with approximate best-response methods, one no longer maintains the theoretical properties of DO, and a worthwhile avenue of exploration is to find auto-curricula that are more conducive to these approximate solutions.

Apart from following game-theoretical principles, an appealing alternative is to automatically discover auto-curricula from data generated by playing games, which can be formulated as a meta-learning problem [13]. However, it remains an open challenge whether it is feasible to discover fundamental concepts (e.g., NE) entirely from data. If we can show that it is possible to discover fundamental concepts from scratch, it opens the avenue to discovering fundamentally new concepts at a potentially rapid pace. Although encouraging results have been reported in single-agent RL settings, showing that it is possible to meta-learn RL update rules, such as temporal-difference learning [38, 56, 57], and that the learned update rules can generalise to unseen tasks, we believe discovering auto-curricula in multi-agent cases is particularly hard. Two reasons for this are that the discovered auto-curriculum itself directly affects the development of the agent population, and that each agent involves an entire training process, which complicates the meta-learning process. Albeit challenging, we believe a method capable of discovering auto-curricula without explicit game-theoretic knowledge can potentially open up entirely new approaches to MARL.

As a result, this paper initiates the study of discovering general-purpose MARL algorithms for two-player zero-sum games. Specifically, our goal is to develop an algorithm that learns its own objective (i.e., the auto-curricula) solely from environment interaction, and we offer a meta-learning framework that achieves such a goal. Our solution framework, Neural Auto-Curricula (NAC), has two promising properties. Firstly, it does not rely on human-designed knowledge about game-theoretic principles, but instead lets the meta-learner decide what the meta-solution concept (i.e., who to compete with) should be during training. This property means that our meta-learner shares the ability of game-theoretic principles to accurately evaluate the policies in the population. Secondly, by taking the best-response computation into consideration, NAC can, end-to-end, offer more suitable auto-curricula in the approximate best-response scenario compared with previous approaches; this is particularly important since an exact best-response oracle is not always available in practice.

Our empirical results show that NAC can discover meaningful solution concepts akin to NE and, based on these, build effective auto-curricula for training agent populations. In multiple different environments, the discovered auto-curriculum achieves performance equal to or better than that of PSRO methods [25, 3].
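To make the structure concrete, the population-based loop that PSRO-style methods and NAC both instantiate can be sketched as follows. This is an illustrative sketch only: the helper names are placeholders rather than the exact implementation of [25] or of our method, with the two design choices, "who to compete with" and "how to beat them", exposed as swappable components.

```python
# A minimal sketch of the generic population-based auto-curriculum loop.
# `build_meta_game`, `meta_solver` and `best_response` are placeholder callables
# standing in for the components discussed in this paper.
def auto_curriculum(init_agent, build_meta_game, meta_solver, best_response, T=10):
    population = [init_agent]
    for t in range(T):
        M_t = build_meta_game(population)         # t x t payoff matrix (Sec. 3.1)
        pi_t = meta_solver(M_t)                   # "who to compete with": uniform (FP),
                                                  # sub-game NE (DO/PSRO), or learned (NAC)
        phi_br = best_response(pi_t, population)  # "how to beat them" (Sec. 3.3)
        population.append(phi_br)
    return population
```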
We additionally evaluate the ability of our discovered meta-solvers to generalise to unseen games of a similar type (e.g., training on Kuhn Poker and testing on Leduc Poker), and show that an auto-curriculum found on a simple environment is able to generalise to a more difficult one. To the best of our knowledge, this is the first work that demonstrates the possibility of discovering an entire auto-curriculum for solving two-player zero-sum games, and that the rule discovered on simple domains can be competitive with human-designed algorithms on challenging benchmarks.

## 2 Related Work

Whilst it is theoretically possible to solve for NE in two-player zero-sum games via linear programming (LP) [34] in polynomial time [51], this is a strictly limited approach. For example, when the action space becomes prohibitively large, or continuous, using LP becomes intractable; other approximation methods such as fictitious play (FP) [5], DO [33, 11] or PSRO [25, 32, 31] methods are required. These methods all follow iterative best-response dynamics in which, at each iteration, a best-response policy is found against a previously aggregated policy (e.g., DO/PSRO uses a sub-game NE, FP uses the time-average strategy). Under this general iterative framework, other solution concepts include Rectified NE [3], α-Rank [35, 59] or no-regret algorithms [11]. In this paper, instead of following any existing game-theoretic knowledge, we try to discover effective solution concepts solely through environment interactions via meta-learning techniques.

Figure 1: NAC. The top box illustrates the training process where we iteratively expand M_t with best responses to f_θ(M_t), and then calculate the exploitability (in red). The bottom box illustrates expanding the population by finding φ^BR. The blue arrows show how the gradients pass backwards, from the exploitability through the best-response trajectories. The gradient paths [1], [2] refer to the two gradient terms in Eq. (8) and the gradient paths [3], [4] refer to the two gradient paths in Eq. (7).

Meta-learning, also known as learning to learn, has recently gained increased attention for successfully improving sample efficiency in regression, classification and RL tasks. MAML [13] meta-learns model parameters by differentiating the learning process for fast adaptation on unseen tasks. ProMP [42] theoretically analyses the MAML-RL formulation and addresses the biased meta-gradient issue. ES-MAML [48] bypasses the Hessian-estimation problem by leveraging evolutionary strategies for gradient-free optimisation. RL2 [12] and L2RL [54] formulate the meta-learning problem as a second RL procedure but update the parameters at a "slow" pace. Recently, there has been a trend towards meta-learning the algorithmic components of RL algorithms, such as the discount factor [57], intrinsic rewards [65, 64], auxiliary tasks [52, 66], objective functions for value/policy networks [4, 56, 22], off-policy update targets [62], and the bootstrapping mechanism [38]. Their success is built on the meta-gradient technique, which applies gradient descent to the sequence of gradient-descent updates resulting from the choice of objective function.
The meta-learned RL algorithms demonstrate promising generalisation capability in solving different tasks. Apart from meta-gradients, evolutionary strategies (ES) have also been successfully applied [20]. Our paper, in comparison, has a parallel goal to these prior arts: to discover MARL algorithms that are effective for solving various two-player zero-sum games, and to demonstrate their generalisation ability across different games.

So far, there have been very few attempts to conduct meta-learning in multi-agent settings [60]. MNMPG [46] applied meta-gradients to training credit-assignment modules for better decomposing the value network [49, 61]. Meta-PG [1] was proposed as a meta-gradient solution to continuous adaptation in non-stationary environments. Building on the opponent-modelling technique LOLA [15, 14], Meta-MAPG [21] extends Meta-PG [1] to multi-agent settings by considering both the non-stationarity from the environment and from the opponent's learning process. Our work instead focuses on automatically discovering multi-agent auto-curricula through meta-learning, without explicit game-theoretic knowledge, leading to general MARL algorithms for solving two-player zero-sum games.

## 3 Multi-Agent Meta-Learning Framework

In this section, we discuss our meta-learning framework NAC for discovering multi-agent auto-curricula in two-player zero-sum games. We follow the road-map set out by Fig. (1), which illustrates the flow of NAC. The relevant sections for each part of Fig. (1) are marked in pink.

### 3.1 The Meta-game

Consider two-player zero-sum games G2 drawn from some distribution P(G), where players have the state-action space S × A. We start by introducing the notion of an agent, which is characterised by a policy φ, where a policy is a mapping φ : S × A → [0, 1] that can be described either in tabular form or as a neural network. The payoff between two agents is defined to be M(φ_i, φ_j) (i.e., the game engine), and represents the utility to agent i, or alternatively, the negative utility to agent j. Our population-based framework revolves around iterative updates on the meta-game M. At every iteration t ∈ {1, ..., T}, a Player is defined by a population of fixed agents Φ_t = Φ_0 ∪ {φ^BR_1, ..., φ^BR_t}, where Φ_0 is the initial random agent pool and the φ^BR_t are discussed further in Sec. (3.3). From here, for the sake of notational convenience, we will only consider the single-population case where Players share the same Φ_t. As such, the single population generates a meta-game M_t, a payoff matrix between all of the agents in the population, with individual entries M(φ_i, φ_j) ∀ φ_i, φ_j ∈ Φ_t.

2 G encapsulates all of the information for a Player to take part in a game (e.g., actions, reward functions).

### 3.2 The Meta-Solver

Based on M_t, the Player solves for the meta-distribution π_t, which is defined as an aggregated agent over the fixed agents in Φ_t. In recent work, these meta-solvers have been restricted to human-designed solution concepts such as a uniform distribution [25] or NE and its variants [3, 37, 28], whereas we introduce a method to actively learn these distributions solely through interacting with the game. Specifically, we parameterise our meta-solver via a neural network. This network f_θ, with parameters θ, is a mapping f_θ : M_t → [0, 1]^t which takes as input a meta-game M_t and outputs a meta-distribution π_t = f_θ(M_t). The output π_t is a probability assignment to each agent in the population Φ_t and, as we are in the single-population setting, we do not distinguish between different populations.
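Concretely, the meta-game M_t that f_θ consumes is just the pairwise evaluation matrix of Sec. (3.1). A minimal construction sketch is given below; `payoff_fn` is an assumed stand-in for the game engine M(φ_i, φ_j) and is not part of the released code.

```python
# A construction sketch for the single-population meta-game of Sec. 3.1.
import numpy as np

def build_meta_game(population, payoff_fn):
    """Return the t x t payoff matrix M_t with entries M(phi_i, phi_j)."""
    t = len(population)
    M = np.zeros((t, t))
    for i in range(t):
        M[i, i] = 0.0          # in a symmetric zero-sum game, self-play payoff is zero
        for j in range(i + 1, t):
            u = payoff_fn(population[i], population[j])
            M[i, j] = u
            M[j, i] = -u       # zero-sum: payoff to agent j is the negative of agent i's
    return M
```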
There are two characteristics that our network f_θ should have. Firstly, it should be able to handle variable-length inputs: the size of M_t is (t × t) and f_θ outputs a vector of size (t × 1), where t increments at every iteration. Secondly, f_θ should have both column-permutation invariance and row-permutation equivariance. Given an input M_t, f_θ outputs the distribution π_t = f_θ(M_t) for the row Player; it should be equivariant to any permutation of the row index, and invariant to any permutation of the column index, as neither of these actions should affect π_t. Much work has been done on how to maintain permutation invariance/equivariance in neural networks, such as DeepSets [63] for handling tasks defined on sets, or PointNet [39] for 3D point-cloud processing. Our first Multi-Layer Perceptron (MLP) based network is inspired by PointNet and consists of three components: (1) an MLP over each element M_t^{ij} for a non-linear representation, (2) column mean-pooling for column-permutation-invariant information aggregation, which is concatenated to each row, and (3) a row-wise MLP for the row-permutation-equivariant meta-distribution. We also offer two alternatives, inspired by a fully Conv1d-based model [29] and a GRU-based model [7]. We refer to Appendix A for detailed architectures of our network f_θ(·); a minimal code sketch of the MLP variant is given after Sec. (3.4).

### 3.3 Best Response Oracle

Once a meta-distribution π_t ∈ Δ^{|Φ_t|−1} is obtained, the goal is to solve for a best response against π_t so as to strengthen the population. Formally, we define

$$ M\big(\phi, \langle \pi_t, \Phi_t \rangle\big) := \sum_{k} \pi_t^{k}\, M(\phi, \phi_k) \qquad (1) $$

to represent the payoff for agent φ against the aggregated agent (aggregated by π_t) of population Φ_t. Consequently, the best response to an aggregated strategy is

$$ \phi^{BR}_{t+1} = \arg\max_{\phi} \sum_{k} \pi_t^{k}\, M(\phi, \phi_k), \qquad (2) $$

and this best response is appended to the population to form a new fixed population, aiming to make the population less exploitable. Depending on the sophistication of the game, one may choose an appropriate oracle; we consider different oracle implementations in Sec. (3.5).

### 3.4 The Learning Objective of Players

The goal of NAC is to find an auto-curriculum that, after T best-response iterations, returns a meta-strategy and population ⟨π_T, Φ_T⟩ that minimise the exploitability, written as

$$ \min_{\theta}\; \mathrm{Exp}\big(\pi_T(\theta), \Phi_T(\theta)\big), \quad \text{where } \mathrm{Exp} := \max_{\phi}\, M\big(\phi, \langle \pi_T, \Phi_T \rangle\big), \qquad (3) $$

$$ \pi_T = f_\theta(M_T), \qquad \Phi_T = \big\{\phi^{BR}_1(\theta), \ldots, \phi^{BR}_T(\theta)\big\}. \qquad (4) $$

Exp(·)3 represents exploitability [9], a measure of the incentive to deviate from the meta-strategy over ⟨π_T, Φ_T⟩. When exploitability reaches zero, one can no longer improve performance. φ^BR_t(θ) in Eq. (4) shows that each best response has a dependency on θ, since φ^BR_t is influenced explicitly by the curriculum at iteration t and implicitly by all previous curricula. We believe such an objective maximises the generality of our framework for solving different types of zero-sum games. NAC can use any oracle; however, different considerations must be taken depending on the choice. In the following sections we discuss the technicality of directly solving for the meta-gradients of θ with respect to a gradient-descent (GD) oracle and an RL-based oracle. Additionally, we provide an oracle-agnostic method based on zero-order gradients that allows us to ignore oracle trajectories via Evolutionary Strategies [43].

3 In the multi-population case, Exp expresses the notion of each Player having a different population and final meta-strategy; in the single-population case we only need to evaluate the deviation incentive for one population.
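Returning to the meta-solver of Sec. (3.2), the following is a minimal sketch of the MLP-based variant (per-element MLP, column mean-pooling concatenated to each row, and a shared row-wise MLP). The layer sizes are illustrative; the exact architectures we use are given in Appendix A.

```python
import torch
import torch.nn as nn

class MetaSolver(nn.Module):
    """Sketch of an MLP meta-solver f_theta: M_t (t x t) -> pi_t (simplex over t agents).
    Column mean-pooling makes each row feature invariant to column permutations, and
    sharing the row-wise MLP across rows makes the output equivariant to row
    permutations. This is an illustration of Sec. 3.2, not the released architecture."""

    def __init__(self, hidden=64):
        super().__init__()
        self.elem = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.row = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, M):                      # M: (t, t) payoff matrix
        e = self.elem(M.unsqueeze(-1))         # (t, t, h): embed every entry M_t^{ij}
        row_feat = e.mean(dim=1)               # (t, h): column mean-pooling per row
        global_feat = row_feat.mean(dim=0, keepdim=True).expand_as(row_feat)
        h = torch.cat([row_feat, global_feat], dim=-1)   # concatenate global info to each row
        logits = self.row(h).squeeze(-1)       # (t,): shared row-wise MLP
        return torch.softmax(logits, dim=0)    # meta-distribution pi_t over the population
```

Because the column mean-pooling and the shared row-wise head are independent of t, the same network handles meta-games of any size, which is what allows a meta-solver trained on small games to be reused on larger ones (Sec. 4, Question 5).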
Pseudo-code4 for NAC with the above oracle choices is shown in Alg. (1).

4 Pseudo-code including details of the best-response oracles is shown in Appendix C.

Algorithm 1: Neural Auto-Curricula (NAC). Statements in teal refer only to ES-NAC; statements in red only to standard NAC.

Require: game distribution p(G), learning rate α, time window T, number of perturbations n, precision σ.
1: Randomly initialise the policy pool Φ_0; initialise the parameters θ of the meta-solver f_θ.
2: for each training iteration do
3: Store the current model f_θ.
4: Sample ε_1, ..., ε_n ∼ N(0, I) and store the n perturbed models f_(θ+σε_i). ▷ If using ES, include this step
5: for each stored model f do
6: Sample games {G_k}_{k=1,...,K} from p(G).
7: for each game G_k do
8: for each iteration t do
9: Compute the meta-policy π_{t−1} = f(M_{t−1}).
10: Compute the best response φ^BR_t by Eq. (9) or Eq. (12).
11: Expand the population Φ_t = Φ_{t−1} ∪ {φ^BR_t}.
12: end for
13: Compute Exp_i(π_T, Φ_T) by Eq. (3).
14: end for
15: Compute the meta-gradient g_k via Eq. (6) or Eq. (13).
16: Update the meta-solver's parameters: θ' = θ − α · (1/K) Σ_k g_k.
17: end for
18: end for

### 3.5 Optimising the Meta-Solver through Meta-gradients

Based on the Player's learning objective in Eq. (3), we can optimise the meta-solver as follows:

$$ \min_{\theta} J(\theta), \quad \text{where } J(\theta) = \mathbb{E}_{G \sim P(G)}\big[\mathrm{Exp}\,(\pi, \Phi \mid \theta, G)\big]. \qquad (5) $$

Deriving the (meta-)gradient of θ is non-trivial; we show it in Remark (3.1) below.

Remark 3.1. For a given distribution of games p(G), denoting the exploitability at the final iteration, M(φ^BR_{T+1}, ⟨π_T, Φ_T⟩), as M_{T+1}, the meta-gradient for θ (see also Fig. 1) is

$$ \frac{\partial M_{T+1}}{\partial \theta} = \frac{\partial M_{T+1}}{\partial \pi_T}\frac{\partial \pi_T}{\partial \theta} + \frac{\partial M_{T+1}}{\partial \Phi_T}\frac{\partial \Phi_T}{\partial \theta}, \qquad (6) $$

where

$$ \frac{\partial \phi^{BR}_{t+1}}{\partial \theta} = \frac{\partial \phi^{BR}_{t+1}}{\partial \pi_t}\frac{\partial \pi_t}{\partial \theta} + \frac{\partial \phi^{BR}_{t+1}}{\partial \Phi_t}\frac{\partial \Phi_t}{\partial \theta}, \qquad (7) $$

$$ \frac{\partial \pi_T}{\partial \theta} = \frac{\partial f_\theta(M_T)}{\partial \theta} + \frac{\partial f_\theta(M_T)}{\partial M_T}\frac{\partial M_T}{\partial \Phi_T}\frac{\partial \Phi_T}{\partial \theta}, \qquad (8) $$

and Eq. (8) can be further decomposed by iteratively applying Eq. (7) from iteration T−1 to 0.

The full proof is in Appendix B. Intuitively, in the forward process the population adds T new agents; in the backward process the meta-gradient traverses the full T best-response iterations (each iteration may involve many gradient updates) and back-propagates through all trajectories. Therefore, the gradients ∂Φ_T/∂θ need to be collected from M_{T−1} to M_1. Whilst this is critical in ensuring that every agent is influential in optimising θ, it introduces computational trouble. Firstly, due to the long-trajectory dependency, computing meta-gradients becomes inefficient because of multiple Hessian-vector products. Secondly, the gradients are susceptible to exploding/vanishing in the same manner as in RNNs [19]. To alleviate these issues, we introduce a truncated version similar to [38], where we back-propagate only up to a smaller window size (i.e., n < T) of population updates. We study the effect of the window size later in Figure 4. Notably, the gradients ∂φ^BR_{t+1}/∂Φ_t and ∂φ^BR_{t+1}/∂π_t in Eq. (7) also depend on the type of best-response subroutine. In the next section, we demonstrate two types of oracle and show how the meta-gradients are derived accordingly.

#### 3.5.1 Gradient-Descent Best-Response Oracles

When the payoff function G is known and differentiable, one can approximate the best response through gradient descent (GD). A one-step GD oracle example is written as

$$ \phi^{BR}_{t+1} = \phi_0 + \alpha\, \frac{\partial M\big(\phi_0, \langle \pi_t, \Phi_t\rangle\big)}{\partial \phi_0}, \qquad (9) $$

where φ_0 and α denote the initial parameters and learning rate respectively. The backward gradients of one-step GD share similarities with MAML [13], and can be written as

$$ \frac{\partial \phi^{BR}_{t+1}}{\partial \pi_t} = \alpha\,\frac{\partial^2 M\big(\phi_0, \langle \pi_t, \Phi_t\rangle\big)}{\partial \phi_0\, \partial \pi_t}, \qquad \frac{\partial \phi^{BR}_{t+1}}{\partial \Phi_t} = \alpha\,\frac{\partial^2 M\big(\phi_0, \langle \pi_t, \Phi_t\rangle\big)}{\partial \phi_0\, \partial \Phi_t}. \qquad (10) $$

We refer to Appendix B.1 for the specification of Remark (3.1) for gradient-descent-based oracles.
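The sketch below illustrates Eqs. (9)-(10) on a small differentiable matrix game; the antisymmetric payoff table `A` and the softmax agent parameterisation are illustrative assumptions rather than the paper's exact setup. The key point is `create_graph=True`, which keeps the second-order terms of Eq. (10) in the graph so that gradients can later flow from the exploitability back through φ^BR to π_t (and hence to θ).

```python
import torch

def payoff(phi, psi, A):
    """M(phi, psi): expected payoff of mixed strategy softmax(phi) versus softmax(psi)."""
    return torch.softmax(phi, dim=0) @ A @ torch.softmax(psi, dim=0)

def one_step_gd_oracle(pi_t, population, A, phi_0, lr=1.0):
    """Eq. (9): phi_BR = phi_0 + lr * d/d(phi_0) M(phi_0, <pi_t, Phi_t>)."""
    agg_payoff = sum(p * payoff(phi_0, phi_k, A) for p, phi_k in zip(pi_t, population))
    grad = torch.autograd.grad(agg_payoff, phi_0, create_graph=True)[0]
    return phi_0 + lr * grad

# Tiny usage example on a 3-action rock-paper-scissors-like payoff table.
A = torch.tensor([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
population = [torch.randn(3) for _ in range(2)]            # two fixed agents
logits = torch.randn(2, requires_grad=True)                # stand-in for f_theta(M_t)
pi_t = torch.softmax(logits, dim=0)
phi_br = one_step_gd_oracle(pi_t, population, A, torch.zeros(3, requires_grad=True))
# Payoff of the new best response against the aggregated old mixture:
exp_t = sum(p * payoff(phi_br, phi_k, A) for p, phi_k in zip(pi_t, population))
exp_t.backward()        # meta-gradients reach `logits` through both pi_t and phi_br
print(logits.grad)
```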
Though Eq. (9) and Eq. (10) can easily be extended to the multi-step GD case, it quickly becomes intractable to take the gradient of a computational graph that includes hundreds of gradient updates [48, 40, 30]. To address this, we offer another solution for efficient back-propagation based on the implicit-gradient method [40], which does not need the full trajectory as a dependency. The idea is that, once we arrive at a best-response point such that ∂M(φ^BR_{t+1}, ⟨π_t, Φ_t⟩)/∂φ^BR_{t+1} = 0, we can apply the implicit function theorem and derive the gradient as

$$ \frac{\partial \phi^{BR}_{t+1}}{\partial \Phi_t} = -\left[\frac{\partial^2 M\big(\phi^{BR}_{t+1}, \langle \pi_t, \Phi_t\rangle\big)}{\partial \phi^{BR}_{t+1}\,\partial \phi^{BR}_{t+1}}\right]^{-1} \frac{\partial^2 M\big(\phi^{BR}_{t+1}, \langle \pi_t, \Phi_t\rangle\big)}{\partial \phi^{BR}_{t+1}\,\partial \Phi_t}. \qquad (11) $$

See the proof in Appendix B.2. Eq. (11) allows for efficient back-propagation by ignoring the dependency on trajectories. Note that the implicit-gradient method can in theory also be applied to compute the meta-gradient w.r.t. RL oracles (next section), but empirically we had little success with this.

#### 3.5.2 Reinforcement Learning Best-Response Oracles

The above GD-based oracles require the payoff function (i.e., the game engine) to be differentiable. Yet, for complex real-world games such as StarCraft [53], we have to rely on RL methods to approximate the best-response agent. Overall, the RL meta-gradient shares a similar structure to the above; the major difference is that we replace the GD terms with a policy-gradient estimator [55]. To obtain unbiased estimators of the first- and second-order meta-(policy-)gradients, we apply the Differentiable Monte Carlo Estimator (DICE) [14] in Eq. (9). DICE is an unbiased higher-order gradient estimator that is fully compatible with automatic differentiation. Thanks to DICE, for an RL-based oracle, by regarding the best-response agent φ^BR_t as φ_1 and the aggregated agent ⟨π_t, Φ_t⟩ as φ_2 respectively, we obtain

$$ \phi_1 = \phi_0 + \alpha\,\frac{\partial J^{\mathrm{DICE}}}{\partial \phi_0}, \quad \text{where } J^{\mathrm{DICE}} = \sum_{k=0}^{H} \left(\prod_{k'=0}^{k} \frac{\pi_{\phi_1}\!\big(a^1_{k'} \mid s_{k'}\big)\,\pi_{\phi_2}\!\big(a^2_{k'} \mid s_{k'}\big)}{\perp\!\big(\pi_{\phi_1}\!\big(a^1_{k'} \mid s_{k'}\big)\,\pi_{\phi_2}\!\big(a^2_{k'} \mid s_{k'}\big)\big)}\right) r^{1}_{k}, \qquad (12) $$

where ⊥ refers to the stop-gradient operator, r^1_k to the reward for agent 1, and H to the trajectory length. We refer to Appendix B.3 for how DICE provides the unbiased first- and second-order meta-gradients, and for the specification of Remark (3.1) for RL-based oracles. This RL-based formulation is limited in that it does not directly extend to SOTA RL techniques such as value-based methods [18]; we therefore next introduce a zero-order method that can tackle the case of a non-differentiable payoff function with any best-response oracle.
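Before moving on, the DICE surrogate of Eq. (12) can be sketched for a single sampled trajectory as follows; the log-probabilities and rewards are assumed to come from a rollout of the best-response agent against the aggregated opponent, and the function names are ours, not those of [14].

```python
import torch

def magic_box(x):
    """exp(x - stop_grad(x)): evaluates to 1 but carries the score-function gradient."""
    return torch.exp(x - x.detach())

def dice_objective(logp_1, logp_2, rewards_1):
    """logp_1, logp_2, rewards_1: tensors of shape (H,) for one trajectory.
    Differentiating this objective (repeatedly) yields unbiased first- and
    higher-order policy gradients of agent 1's return."""
    joint = logp_1 + logp_2                    # joint action log-prob at each step
    cumulative = torch.cumsum(joint, dim=0)    # sum over all steps k' <= k
    return torch.sum(magic_box(cumulative) * rewards_1)

# Usage: J = dice_objective(...); torch.autograd.grad(J, phi_1_params, create_graph=True)
# can then be plugged into the inner update of Eq. (12).
```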
### 3.6 Optimising the Meta-Solver through Evolution Strategies

Inspired by the generality of Evolution Strategies (ES) [43] in optimising black-box functions, and by ES-based MAML [48], we also propose an ES-based framework that can cope with any best-response oracle and underlying game engine. We name our method ES-NAC. The ES approach of [43] states that if we have an objective function F(θ) over a network parameterised by θ, we can apply Gaussian perturbations to the parameters so that a gradient estimate of the objective function is given by

$$ \nabla_\theta\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\, F(\theta + \sigma\epsilon) = \frac{1}{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[F(\theta + \sigma\epsilon)\,\epsilon\big]. $$

In our framework, if we set the objective function to F(θ) = Exp(π_T, Φ_T) with π_T = f_θ(M_T), we obtain an objective that is a function of θ and can be perturbed. This allows us to write the gradient of a surrogate objective of Eq. (5) as

$$ \nabla_\theta \hat{J}_\sigma(\theta) = \mathbb{E}_{G \sim P(G),\, \epsilon \sim \mathcal{N}(0, I)}\Big[\frac{1}{\sigma}\, \mathrm{Exp}\big(\pi_T, \Phi_T \mid \theta + \sigma\epsilon, G\big)\,\epsilon\Big]. \qquad (13) $$

Additionally, we make use of control variates [27] to reduce the variance of the estimator whilst remaining unbiased; for example, we apply forward finite differences [6], whereby the exploitability of the unperturbed meta-solver f_θ is subtracted from that of the perturbed meta-solver, that is

$$ \nabla_\theta\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\, F(\theta + \sigma\epsilon) = \frac{1}{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\big(F(\theta + \sigma\epsilon) - F(\theta)\big)\,\epsilon\big]. \qquad (14) $$

The key benefit of ES-NAC is that it is agnostic to the choice of best-response oracle, as only the final exploitability is required. Unlike the implicit formulation we provided in Sec. (3.5.1), it is not restricted by the fixed-point condition, which we note is difficult to attain for RL oracles, and it may therefore be more widely applicable. This is particularly useful in practice, since most games require hundreds of game simulations for each entry of M (e.g., StarCraft [53]), in which case we lose the applicability of either GD- or RL-based oracles. We note that Alg. (1) encapsulates ES-NAC when the number of perturbations n > 0, and that the lines in teal refer only to the ES formulation.
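A minimal sketch of the ES-NAC estimator of Eqs. (13)-(14) is given below; `run_inner_loop(theta, game)` is a placeholder for lines 8-13 of Alg. (1), i.e., unrolling T best-response iterations with meta-solver parameters θ and returning the final exploitability, and the hyper-parameter values are illustrative.

```python
import numpy as np

def es_meta_gradient(theta, games, run_inner_loop, n=16, sigma=0.1):
    """Zero-order estimate of the meta-gradient of the expected exploitability."""
    grad = np.zeros_like(theta)
    for game in games:
        baseline = run_inner_loop(theta, game)                    # F(theta), the control variate
        for _ in range(n):
            eps = np.random.randn(*theta.shape)
            perturbed = run_inner_loop(theta + sigma * eps, game) # F(theta + sigma * eps)
            grad += (perturbed - baseline) * eps / (n * sigma)    # forward finite differences
    return grad / len(games)

# theta <- theta - alpha * es_meta_gradient(...)   (descent, since Exp is minimised)
```

Because only the scalar exploitability is consumed, any best-response oracle (tabular, PPO, etc.) can sit inside `run_inner_loop`, which is exactly the oracle-agnosticism discussed above.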
## 4 Experiments

We validate the effectiveness of NAC on five types of zero-sum environments5 with different levels of complexity: Games of Skill (GoS) [8], differentiable Lotto [3], the non-transitive mixture game (2D-RPS) [37], iterated matching pennies (IMP) [15, 21] and Kuhn Poker [23]. All selected games are non-trivial to solve, as an effective solver has to consider both transitive and non-transitive dynamics in the policy space [8, 44]. The motivation behind our selections is to evaluate the performance of NAC under the different oracles proposed in Sec. (3.5) and Sec. (3.6). Specifically, we test the gradient-descent oracle (GD) in GoS, Lotto and 2D-RPS, and the RL oracle in IMP. For ES-NAC, we conduct experiments on Kuhn Poker [23] with two approximate tabular oracles (V1, V2), an exact tabular oracle6 and a PPO [45] oracle. We run the experiments on multiple random seeds for NAC; details of how we conduct meta-testing on the baseline algorithms and NAC are reported in Appendix E.3. More details of all of the applied oracles and their hyper-parameters are in Appendix F, and details of the baseline implementations are in Appendix E. We select as baselines vanilla self-play (i.e., best responding to the latest agent in the population) and the PSRO variants, including PSRO [25], PSRO-Uniform [3] (equivalent to Fictitious Play [5]) and PSRO-rN [3]. Their implementations can be found in OpenSpiel [24]. We believe these methods offer strong benchmarks for NAC since they are all underpinned by game-theoretic principles, whereas NAC tries to discover solution algorithms purely from data. Results are presented in the form of answers to five critical questions w.r.t. the effectiveness of NAC.

5 To stay self-contained, we provide a detailed description of each game in Appendix E.1.

6 Training performance for the exact tabular oracle is provided in Appendix D.1.

Question 1. How does NAC perform in terms of exploitability on different games?

Firstly, we are interested in whether NAC can learn an auto-curriculum that solves games effectively. To characterise performance, we measure the exploitability, Eq. (3), of NAC and compare it against the other baselines. Surprisingly, the results in Fig. (2) suggest that, by learning an effective meta-solver, NAC is able to solve games without explicit game-theoretic solution concepts. In particular, NAC performs at least as well as PSRO (only slightly worse than PSRO in IMP), and in multiple games outperforms PSRO. We notice that NAC performs better in the presence of approximate best responses. One explanation is that Double Oracle relies upon a true best-response oracle to guarantee convergence; when it comes to PSRO, where only approximate best responses are available, the principle of sub-game NE may no longer be the right fit. In contrast, NAC considers the outcomes of approximate best responses in an end-to-end fashion; therefore, the auto-curriculum for each player tends to be adaptive, leading to the development of a stronger population. Overall, we believe these results suggest a promising avenue of research for solving larger-scale games (e.g., StarCraft [53] or XLand [50]) where no exact best responses are available and no prior game-theoretic solution concepts are involved.

Figure 2: Exploitability results on five different environments with differing best-response oracles. NAC performs at least as well as the best baseline in all settings, and often outperforms the PSRO baselines. Settings of the adopted oracles in each game can be found in Appendix F.

Question 2. What do the learned curricula (i.e., f_θ(M_t)) look like?

Figure 3: Visualisations of curricula on 2D-RPS. Red points denote the meta-solver output, and darker refers to higher probability in π. The blue star denotes the latest best response. To achieve low exploitability, one needs to climb up and explore each Gaussian. PSRO fails to explore fully, whereas NAC creates an effective curriculum that explores all modes and assigns each of them high weight.

To address this question7, we visualise the auto-curricula of PSRO and NAC on the 2D-RPS game in Fig. (3). PSRO is successful in climbing up some Gaussians before iteration 7; however, it fails to offer an effective auto-curriculum that leads it to discover the other Gaussians. PSRO fails to select an auto-curriculum that takes into consideration whether the best-response oracle is capable of learning a suitable best response. This result is in line with [37], and we believe it is because the approximate best responses may lead to a local optimum in the policy space for PSRO-type methods. In contrast, NAC adaptively generates an auto-curriculum that is more suitable for approximate best responses, as evidenced by a wide spread of agents over the plane, and lower exploitability.

7 We also visualise the Kuhn Poker policy to illustrate how NAC works in Appendix D.2.

Figure 4: (a), (b) The effect of different neural architectures on the exploitability in the GoS and IMP games. (c) The effect of window size on the exploitability in the 2D-RPS game.

Figure 5: (a) Exploitability when trained on Kuhn Poker with an exact tabular BR oracle using ES-NAC and tested on Leduc Poker. (b) Same as (a) with the approximate tabular BR V2. (c) Exploitability when trained on GoS with a GD oracle and tested on the AlphaStar meta-game from [8]. (d) Final exploitability when trained on 200-dimensional GoS and tested on GoS of various dimension sizes.

Question 3. Does back-propagation through multiple best-response iterations help the training?

As shown by the blue lines in Fig. (1), the backward meta-gradient propagates through multiple iterations of the best-response process. To demonstrate its effect, in Fig. (4c) we conduct an ablation study on NAC by varying how many best-response iterations (i.e., the window size) we consider, i.e., by controlling how many agents are added to the population before computing the meta-gradient.
A window size of 0 refers to the setting where we completely detach the gradients of the best-response process. We can see that NAC achieves lower exploitability when considering multiple best-response iterations, which reflects the effectiveness of NAC in offering a more suitable curriculum to strengthen the whole population.

Question 4. How is NAC affected by the architecture and capacity of the meta-solver?

In Sec. (3.2), we provide several neural architectures for the meta-solver. To understand the effect of these different architectures, we conduct an ablation study on GoS and IMP, specifying six different models by varying both the architecture and the size. The results in Fig. (4a, 4b) show that, firstly, permutation invariance/equivariance is not a necessary property, as the GRU-based models achieve strong performance. Secondly, the effect of the meta-solver's architecture is heavily dependent on the game: the performance of all three models is comparable on GoS, while only the MLP- and GRU-based models work well on IMP. Different games may need different meta-solver architectures, and GRU-based meta-solvers tend to work better. In addition, increasing the network capacity does not necessarily translate into a performance improvement. We refer the reader to Appendix (F) for details of our model choices for different games.

Question 5. What is the generalisation ability of the neural meta-solver learned by NAC?

The most promising aspect of NAC is that the neural auto-curricula (i.e., meta-solvers) have the ability to generalise to out-of-distribution games. This is particularly impactful, as it allows for training on simpler games and then deploying the meta-solver on larger, more difficult games. We test the generalisation capability of NAC in two settings. First, we take our meta-solver trained on 200-dimensional GoS and test it on new, unseen GoS of varying dimension. We consider this the most direct way of ascertaining whether the neural meta-solvers are able to generalise to larger, more difficult games, and whether the in-task performance still holds out-of-task. Fig. (5d) plots the final exploitability after 20 PSRO iterations against the dimension of the GoS; noticeably, NAC still outperforms the PSRO baselines in all dimensions larger than the training dimension. Additionally, we test our trained meta-solver on the AlphaStar meta-game generated by [8]8 in Fig. (5c), which is also considered to be a form of GoS. Interestingly, our meta-solver is able to perform well on a GoS that is outside of the task distribution and therefore has a different type of underlying dynamics. Secondly, we give an example of our meta-solver scaling up to a different game: we train on the Kuhn Poker environment (2^12 pure strategies) and test on the Leduc Poker environment (3^936 pure strategies). As shown in Fig. (5a, 5b), the trained meta-solver is able to outperform the PSRO algorithms when used on Leduc Poker, which suggests that NAC enjoys effective generalisation for both an exact best-response oracle and an approximate best-response oracle. We hypothesise that, whilst Leduc Poker is different from Kuhn Poker, the "Poker" nature of both games means they encapsulate similar dynamics, allowing our meta-solver to perform favourably.

8 We provide the results on the other tens of meta-games from [8] in Appendix D.3.

## 5 Conclusion

We introduce a method for discovering auto-curricula in two-player zero-sum games based on meta-learning.
To the best of our knowledge, we are the first to show that it is entirely possible to perform as well as solutions underpinned by game-theoretic concepts designed through human insight, without any active design of the auto-curriculum itself. In particular, we show that our NAC method can learn in small games and generalise to larger, more difficult games that follow a similar underlying structure. We believe this initiates an exciting and promising research area in which large-scale, difficult games can be solved effectively by training on simplified versions of the game.

## Author Contributions

We summarise the main contributions from each of the authors as follows:

- Xidong Feng: Idea proposing, algorithm design, code implementation and experiment running (on 2D-RPS, 2D-RPS-Implicit and IMP), and paper writing.
- Oliver Slumbers: Algorithm design, code implementation and experiment running (on GoS, Blotto, Kuhn Poker), and paper writing.
- Ziyu Wan: Code implementation and experiment running for RL-based NAC in Appendix D.4.
- Bo Liu: Experiment running for Kuhn Poker.
- Stephen McAleer: Project discussion and paper writing.
- Ying Wen: Project discussion.
- Jun Wang: Project discussion and overall project supervision.
- Yaodong Yang: Project lead, idea proposing, experiment supervision, and whole-manuscript writing.

The work was done at King's College London.

## References

[1] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.

[2] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528, 2019.

[3] D Balduzzi, M Garnelo, Y Bachrach, W Czarnecki, J Pérolat, M Jaderberg, and T Graepel. Open-ended learning in symmetric zero-sum games. In ICML, volume 97, pages 434-443. PMLR, 2019.

[4] Sarah Bechtle, Artem Molchanov, Yevgen Chebotar, Edward Grefenstette, Ludovic Righetti, Gaurav Sukhatme, and Franziska Meier. Meta learning via learned loss. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 4161-4168. IEEE, 2021.

[5] George W Brown. Iterative solution of games by fictitious play. Activity analysis of production and allocation, 13(1):374-376, 1951.

[6] Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E. Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization, 2018.

[7] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[8] Wojciech Marian Czarnecki, Gauthier Gidel, Brendan Tracey, Karl Tuyls, Shayegan Omidshafiei, David Balduzzi, and Max Jaderberg. Real world games look like spinning tops. arXiv preprint arXiv:2004.09468, 2020.

[9] Trevor Davis, Neil Burch, and Michael Bowling. Using response functions to measure strategy strength. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.

[10] Xiaotie Deng, Yuhao Li, David Henry Mguni, Jun Wang, and Yaodong Yang. On the complexity of computing markov perfect equilibrium in general-sum stochastic games. arXiv preprint arXiv:2109.01795, 2021.
[11] Le Cong Dinh, Yaodong Yang, Zheng Tian, Nicolas Perez Nieves, Oliver Slumbers, David Henry Mguni, and Jun Wang. Online double oracle. arXiv preprint arXiv:2103.07780, 2021.

[12] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

[13] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126-1135. PMLR, 2017.

[14] Jakob Foerster, Gregory Farquhar, Maruan Al-Shedivat, Tim Rocktäschel, Eric Xing, and Shimon Whiteson. DiCE: The infinitely differentiable Monte Carlo estimator. In International Conference on Machine Learning, pages 1529-1538. PMLR, 2018.

[15] Jakob N Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017.

[16] Robert S Gibbons. Game theory for applied economists. Princeton University Press, 1992.

[17] Sergiu Hart. Discrete colonel blotto and general lotto games. International Journal of Game Theory, 36(3):441-460, 2008.

[18] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

[19] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735-1780, 1997.

[20] Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. arXiv preprint arXiv:1802.04821, 2018.

[21] Dong-Ki Kim, Miao Liu, Matthew Riemer, Chuangchuang Sun, Marwa Abdulhai, Golnaz Habibi, Sebastian Lopez-Cot, Gerald Tesauro, and Jonathan P How. A policy gradient algorithm for learning to learn in multiagent reinforcement learning. arXiv preprint arXiv:2011.00382, 2020.

[22] Louis Kirsch, Sjoerd van Steenkiste, and Juergen Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. In International Conference on Learning Representations, 2019.

[23] Harold W Kuhn. 9. A simplified two-person poker. In Contributions to the Theory of Games (AM-24), Volume I, pages 97-104. Princeton University Press, 2016.

[24] Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, et al. OpenSpiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019.

[25] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning, 2017.

[26] Joel Z Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv e-prints, arXiv:1903, 2019.

[27] Hao Liu, Richard Socher, and Caiming Xiong. Taming MAML: Efficient unbiased meta-reinforcement learning. In International Conference on Machine Learning, pages 4061-4071. PMLR, 2019.

[28] Xiangyu Liu, Hangtian Jia, Ying Wen, Yaodong Yang, Yujing Hu, Yingfeng Chen, Changjie Fan, and Zhipeng Hu.
Unifying behavioral and response diversity for open-ended learning in zero-sum games. arXiv preprint arXiv:2106.04958, 2021.

[29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation, 2015.

[30] Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pages 1540-1552. PMLR, 2020.

[31] Stephen McAleer, John Lanier, Pierre Baldi, and Roy Fox. XDO: A double oracle algorithm for extensive-form games. Reinforcement Learning in Games Workshop, AAAI, 2021.

[32] Stephen McAleer, John Lanier, Roy Fox, and Pierre Baldi. Pipeline PSRO: A scalable approach for finding approximate Nash equilibria in large games. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

[33] H Brendan McMahan, Geoffrey J Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 536-543, 2003.

[34] Oskar Morgenstern and John Von Neumann. Theory of games and economic behavior. Princeton University Press, 1953.

[35] Paul Muller, Shayegan Omidshafiei, Mark Rowland, Karl Tuyls, Julien Perolat, Siqi Liu, Daniel Hennes, Luke Marris, Marc Lanctot, Edward Hughes, et al. A generalized training approach for multiagent learning. In International Conference on Learning Representations, 2019.

[36] John F Nash et al. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48-49, 1950.

[37] Nicolas Perez Nieves, Yaodong Yang, Oliver Slumbers, David Henry Mguni, and Jun Wang. Modelling behavioural diversity for learning in open-ended games. arXiv preprint arXiv:2103.07927, 2021.

[38] Junhyuk Oh, Matteo Hessel, Wojciech M Czarnecki, Zhongwen Xu, Hado van Hasselt, Satinder Singh, and David Silver. Discovering reinforcement learning algorithms. arXiv preprint arXiv:2007.08794, 2020.

[39] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652-660, 2017.

[40] Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. Meta-learning with implicit gradients, 2019.

[41] Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pages 5331-5340. PMLR, 2019.

[42] Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy search. arXiv preprint arXiv:1810.06784, 2018.

[43] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning, 2017.

[44] Ricky Sanjaya, Jun Wang, and Yaodong Yang. Measuring the non-transitivity in chess, 2021.

[45] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[46] Jianzhun Shao, Hongchang Zhang, Yuhang Jiang, Shuncheng He, and Xiangyang Ji. Credit assignment with meta-policy gradient for multi-agent reinforcement learning. arXiv preprint arXiv:2102.12957, 2021.

[47] Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine.
Parrot: Data-driven behavioral priors for reinforcement learning. arXiv preprint arXiv:2011.10024, 2020.

[48] Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, and Yunhao Tang. ES-MAML: Simple Hessian-free meta learning. arXiv preprint arXiv:1910.01215.

[49] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.

[50] Open Ended Learning Team, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, et al. Open-ended learning leads to generally capable agents. arXiv preprint arXiv:2107.12808, 2021.

[51] Jan van den Brand. A deterministic linear program solver in current matrix multiplication time. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 259-278. SIAM, 2020.

[52] Vivek Veeriah, Matteo Hessel, Zhongwen Xu, Janarthanan Rajendran, Richard L Lewis, Junhyuk Oh, Hado van Hasselt, David Silver, and Satinder Singh. Discovery of useful questions as auxiliary tasks. In NeurIPS, 2019.

[53] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350-354, 2019.

[54] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

[55] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229-256, 1992.

[56] Zhongwen Xu, Hado van Hasselt, Matteo Hessel, Junhyuk Oh, Satinder Singh, and David Silver. Meta-gradient reinforcement learning with an objective discovered online. arXiv preprint arXiv:2007.08433, 2020.

[57] Zhongwen Xu, Hado van Hasselt, and David Silver. Meta-gradient reinforcement learning. arXiv preprint arXiv:1805.09801, 2018.

[58] Yaodong Yang, Jun Luo, Ying Wen, Oliver Slumbers, Daniel Graves, Haitham Bou Ammar, Jun Wang, and Matthew E Taylor. Diverse auto-curriculum is critical for successful real-world multiagent learning systems. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pages 51-56, 2021.

[59] Yaodong Yang, Rasul Tutunov, Phu Sakulwongtana, and Haitham Bou Ammar. α-Rank: Practically scaling α-rank through stochastic optimisation. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pages 1575-1583, 2020.

[60] Yaodong Yang and Jun Wang. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583, 2020.

[61] Yaodong Yang, Ying Wen, Jun Wang, Liheng Chen, Kun Shao, David Mguni, and Weinan Zhang. Multi-agent determinantal q-learning. In International Conference on Machine Learning, pages 10757-10766. PMLR, 2020.

[62] Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, and Satinder Singh. A self-tuning actor-critic algorithm. arXiv preprint arXiv:2002.12928, 2020.

[63] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola.
Deep sets. arXiv preprint arXiv:1703.06114, 2017.

[64] Zeyu Zheng, Junhyuk Oh, Matteo Hessel, Zhongwen Xu, Manuel Kroiss, Hado Van Hasselt, David Silver, and Satinder Singh. What can learned intrinsic rewards capture? In International Conference on Machine Learning, pages 11436-11446. PMLR, 2020.

[65] Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. Advances in Neural Information Processing Systems, 31:4644-4654, 2018.

[66] Wei Zhou, Yiying Li, Yongxin Yang, Huaimin Wang, and Timothy Hospedales. Online meta-critic learning for off-policy actor-critic methods. Advances in Neural Information Processing Systems, 33, 2020.