# MULTI-AGENT COOPERATION THROUGH LEARNING-AWARE POLICY GRADIENTS

Published as a conference paper at ICLR 2025

Alexander Meulemans¹·*, Seijin Kobayashi¹·*, Johannes von Oswald¹, Nino Scherrer¹, Eric Elmoznino¹·²·³, Blake A. Richards¹·²·³·⁴·⁵, Guillaume Lajoie¹·²·³·⁴·⁵, Blaise Agüera y Arcas¹, João Sacramento¹
¹Google, Paradigms of Intelligence Team, ²Mila - Quebec AI Institute, ³Université de Montréal, ⁴McGill University, ⁵CIFAR, *Equal contribution
ameulemans@google.com, seijink@google.com

Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning-aware agents who model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required. Finally, we derive from the iterated prisoner's dilemma a novel explanation for how and when cooperation arises among self-interested learning-aware agents.

1 INTRODUCTION

From self-driving vehicles to personalized assistants, there is a rising interest in developing agents that can learn to interact with humans (Collins et al., 2024; Gweon et al., 2023), and with each other (Park et al., 2023; Vezhnevets et al., 2023). However, multi-agent learning comes with significant challenges that are not present in more conventional single-agent paradigms. This is perhaps best seen through the study of social dilemmas: general-sum games which model the tension between cooperation and competition in abstract form (von Neumann & Morgenstern, 1947). Without further assumptions, letting agents independently optimize their individual objectives on such games results in poor outcomes and a lack of cooperation (Tan, 1993; Claus & Boutilier, 1998). First, for general-sum games, reaching an equilibrium point does not necessarily imply appropriate behavior, because there can be many sub-optimal equilibria (Fudenberg & Levine, 1998; Shoham & Leyton-Brown, 2008). Second, the control problem an agent faces is non-stationary from its own viewpoint, because other agents themselves simultaneously learn and adapt (Hernandez-Leal et al., 2017). Centralized training algorithms sidestep non-stationarity issues by sharing agent information (Sunehag et al., 2017), but this transformation into a global learning problem is usually prohibitively costly, and impossible to implement when agents must be developed separately (Zhang et al., 2021). The above two fundamental issues have hindered progress in multi-agent reinforcement learning, and have limited our understanding of how self-interested agents may reach high returns when faced with social dilemmas.
In this paper, we join a promising line of work on learning awareness that has been shown to improve cooperation (Foerster et al., 2018a). The key idea behind such approaches is to take into account the learning dynamics of other agents explicitly, rendering multi-agent learning into a meta-learning problem (Schmidhuber, 1987; Bengio et al., 1990; Hochreiter et al., 2001). The present paper contains two main novel results on learning awareness in general-sum games.

First, we introduce a new learning-aware reinforcement learning rule derived as a policy gradient estimator. Unlike existing methods (Foerster et al., 2018a;b; Xie et al., 2021; Balaguer et al., 2022; Lu et al., 2022; Willi et al., 2022; Cooijmans et al., 2023; Khan et al., 2024; Aghajohari et al., 2024), it has a number of desirable properties: (i) it does not require computing higher-order derivatives, (ii) it is provably unbiased, (iii) it can model minibatched learning algorithms, (iv) it is applicable to scalable architectures based on recurrent sequence policy models, and (v) it does not assume access to privileged information, such as the opponents' policies or learning rules. Our policy gradient rule significantly outperforms previous model-free methods in the general-sum game setting. In particular, we show that efficient learning-aware learning suffices to reach cooperation in a challenging sequential social dilemma involving temporally extended actions (Leibo et al., 2017) that we adapt from the Melting Pot suite (Agapiou et al., 2023).

Second, we analyze the iterated prisoner's dilemma (IPD), a canonical model for studying cooperation among self-interested agents (Rapoport, 1974; Axelrod & Hamilton, 1981). Our analysis uncovers a novel mechanism for the emergence of cooperation through learning awareness, and explains why the seminal learning with opponent-learning awareness algorithm due to Foerster et al. (2018a) led to cooperation in the IPD.

2 BACKGROUND AND PROBLEM SETUP

We consider partially observable stochastic games (POSGs; Kuhn, 1953) consisting of a tuple $(I, S, A, P_t, P_r, P_i, O, P_o, \gamma, T)$, with $I = \{1, \ldots, n\}$ a finite set of $n$ agents, $S$ the state space, $A = \times_{i \in I} A^i$ the joint action space, $P_t(S_{t+1} \mid S_t, A_t)$ the state transition distribution, $P_i(S_0)$ the initial state distribution, $P_r = \prod_{i \in I} P^i_r(R^i \mid S, A)$ the joint factorized reward distribution with $R = \{R^i\}_{i \in I}$ and bounded rewards $R^i$, $O = \times_{i \in I} O^i$ the joint observation space, $P_o(O_t \mid S_t, A_{t-1})$ the observation distribution, $\gamma$ the discount factor, $t$ the time step, and $T$ the horizon. We use superscript $i$ to indicate agent-specific actions, observations and rewards, $-i$ to indicate all agent indices except $i$, and we omit the superscript for joint actions, observations and rewards. As agents only receive partial state information, they benefit from conditioning their policies $\pi^i(a^i_t \mid x^i_t; \phi^i)$ on the observation history $x^i_t = \{o^i_k\}_{k=1}^{t}$ (Åström, 1965; Kaelbling et al., 1998), with $\phi^i$ the policy parameters. Note that the observations can contain the agent's actions at previous timesteps.

2.1 GENERAL-SUM GAMES AND THEIR CHALLENGES

We focus on general-sum games, where each agent has their own reward function, possibly different from those of other agents. Specifically, we consider mixed-motive general-sum games that are neither zero-sum nor fully-cooperative.
Analyzing and solving such general-sum games while letting every agent individually and independently maximize their rewards (a setting often referred to as fully-decentralized reinforcement learning; Albrecht et al., 2024) is a longstanding problem in the fields of machine learning and game theory, for the two primary reasons described below.

Non-stationarity of the environment. In a general-sum game, each agent aims to maximize its expected return $J^i(\phi^i, \phi^{-i}) = \mathbb{E}_{P_{\phi^i, \phi^{-i}}}\left[\sum_{t=1}^{T} \gamma^t R^i_t\right]$, with $P_{\phi^i, \phi^{-i}}$ the distribution over environment trajectories $x_T$ induced by the environment dynamics, the policy $\pi^i(a^i \mid x^i; \phi^i)$ of agent $i$, and the policies $\pi^{-i}$ of all other agents. Importantly, the expected return $J^i(\phi^i, \phi^{-i})$ does not only depend on the agent's own policy, but also on the current policies of the other agents. As other agents are updating their policies through learning, the environment, which includes the other agents, is effectively non-stationary from a single agent's perspective. Furthermore, the actions of an agent can influence this non-stationarity by changing the observation histories of other agents, on which they base their learning updates.

Equilibrium selection. It is not clear how to identify appropriate policies for a general-sum game. To see this, let us first briefly revisit the concept of a Nash equilibrium (Nash Jr., 1950). For a fixed set of co-player policies $\phi^{-i}$, one can compute a best response, which for agent $i$ is given by $\phi^{i*} = \arg\max_{\phi^i} J^i(\phi^i, \phi^{-i})$. When all current policies $\bar\phi$ are a best response against each other, we have reached a Nash equilibrium, where no agent is incentivized to change its policy anymore: $\forall i, \phi^i: J^i(\bar\phi^i, \bar\phi^{-i}) \geq J^i(\phi^i, \bar\phi^{-i})$. Various folk theorems show that for most POSGs of decent complexity, there exist infinitely many Nash equilibria (Fudenberg & Levine, 1998; Shoham & Leyton-Brown, 2008). This lies at the origin of the equilibrium selection problem in multi-agent reinforcement learning: it is not only important to let a multi-agent system converge to a Nash equilibrium, but also to target a good equilibrium, as Nash equilibria can be arbitrarily bad. Famously, unconditional mutual defection in the infinitely iterated prisoner's dilemma is a Nash equilibrium, with strictly lower expected returns for all agents compared to the mutual tit-for-tat Nash equilibrium (Axelrod & Hamilton, 1981).

2.2 CO-PLAYER LEARNING AWARENESS

We aim to address the above two major challenges of multi-agent learning in this paper. Our work builds upon recent efforts that add a meta level to the multi-agent POSG, where the higher-order variable represents the learning algorithm used by each agent (Lu et al., 2022; Balaguer et al., 2022; Khan et al., 2024). In this meta-problem, the environment includes the learning dynamics of other agents. At the meta-level, one episode now extends across multiple episodes of actual game play, allowing the ego agent, $i$, to observe how its co-players, $-i$, learn (see Fig. 1). The goal of this meta-agent may be intuitively understood as that of shaping co-player learning to its own advantage. Provided that co-player learning algorithms remain constant, the above reformulation yields a single-agent problem that is amenable to standard reinforcement learning techniques.
This setup is fundamentally asymmetric: while the meta agent (ego agent) is endowed with co-player learning awareness (i.e., it observes multiple episodes of game play), the remaining agents remain oblivious to the fact that the environment is non-stationary. We thus refer to them here as naive agents (see Fig. 1B). Despite this asymmetry, prior work has observed that introducing a learning-aware agent in a group of naive learners often leads to better learning outcomes for all agents involved, avoiding mutual defection equilibria (Lu et al., 2022; Balaguer et al., 2022; Khan et al., 2024). Moreover, Foerster et al. (2018a) have shown that certain forms of learning awareness can lead to the emergence of cooperation even in symmetric cases, a surprising finding that is not yet well understood. These observations motivate our study, leading us to derive novel efficient learning-aware reinforcement learning algorithms, and to investigate their efficacy in driving a group of agents (possibly composed of both meta and naive agents) towards more beneficial equilibria. Below, we proceed by first formalizing asymmetric co-player shaping problems, which we solve with a novel policy gradient algorithm (Section 3). In Section 4, we then return to the question of why and when co-player learning awareness can result in cooperation in multi-agent systems with equally capable agents.

Figure 1: A. Experience data terminology. Inner-episodes comprise $T$ steps of (inner) game play, played between agents $B$ times in parallel, forming a batch of inner-episodes. A given sequence of $M$ inner-episodes forms a meta-trajectory, thus comprising $MT$ steps of inner game play. The collection of $B$ meta-trajectories forms a meta-episode. B. During game play, a naive agent takes only the current episode context into account for decision making (intra-episode context). In contrast, a meta agent takes the full long context into account (intra- and inter-episode context). Seeing multiple episodes of game play endows a meta agent with learning awareness.

Co-player shaping. Following Lu et al. (2022), we first introduce a meta-game with a single meta-agent whose goal is to shape the learning of naive co-players to its advantage. This meta-game is defined formally as a single-agent partially observable Markov decision process (POMDP) $(\bar S, \bar A, \bar P_t, \bar P_r, \bar P_i, \bar O, \gamma, M)$. The meta-state consists of the policy parameters $\phi^{-i}$ of all co-players together with the agent's own parameters $\phi^i$. The meta-environment dynamics represent the fixed learning rules of the co-players, and the meta-reward distribution represents the expected return $J^i(\phi^i, \phi^{-i})$ collected by agent $i$ during an inner episode, with "inner" referring to the actual game being played. The initialization distribution $\bar P_i$ reflects the policy initializations of all players. Finally, we introduce a meta-policy $\mu(\phi^i_{m+1} \mid \phi^i_m, \phi^{-i}_m; \theta)$, parameterized by $\theta$, that decides the update to the parameters $\phi^i_{m+1}$ (the meta-action) so as to shape the co-player learning towards highly rewarding regions for agent $i$ over a horizon of $M$ meta steps. This leads to the co-player shaping problem

$$\max_\mu \; \mathbb{E}_{\bar P_i(\phi^i_0, \phi^{-i}_0)} \, \mathbb{E}_{\bar P_\mu}\left[\sum_{m=1}^{M} J^i(\phi^i_m, \phi^{-i}_m)\right], \tag{1}$$

with $\bar P_\mu$ the distribution over parameter trajectories induced by the meta-dynamics and meta-policy.
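To make the structure of this meta-game concrete, the following schematic rolls out one meta-episode of the objective in Eq. 1. All names (`rollout_batch`, `naive_update`, `mu`) are placeholders invented for this sketch and do not come from the paper's code; the sketch assumes a single naive co-player and omits observation handling.

```python
def shaping_meta_episode(mu, phi_i0, phi_j0, rollout_batch, naive_update, M):
    """Schematic rollout of the co-player shaping meta-game behind Eq. 1.

    `rollout_batch` plays one batch of B inner episodes and returns both agents'
    returns plus the naive co-player's experience, `naive_update` is the
    co-player's fixed learning rule (the meta-environment dynamics), and `mu`
    is the meta-policy choosing the ego agent's next parameters (the meta-action).
    """
    phi_i, phi_j = phi_i0, phi_j0
    total_meta_return = 0.0
    for m in range(M):
        # Inner play with the current meta-state (phi_i, phi_j).
        J_i, J_j, co_player_batch = rollout_batch(phi_i, phi_j)
        total_meta_return += J_i          # meta-reward collected at meta-step m
        # Meta-dynamics: the naive co-player learns from its minibatch.
        phi_j = naive_update(phi_j, co_player_batch)
        # Meta-action: the meta-policy proposes the ego agent's next parameters.
        phi_i = mu(phi_i, phi_j)
    return total_meta_return
```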
2.3 SINGLE-LEVEL CO-PLAYER SHAPING BY LEVERAGING SEQUENCE MODELS

In this paper, we combine both inner- and meta-policies in a single long-context policy, conditioning actions on long observation histories spanning multiple inner game episodes (see Fig. 1B). Instead of hand-designing the co-player learning algorithms, we let meta-learning discover the algorithms used by other agents. This way, we leverage the in-context learning and inference capabilities of modern neural sequence models (Brown et al., 2020; Rabinowitz, 2019; Akyürek et al., 2023; von Oswald et al., 2023; Li et al., 2023) to both simulate an inner policy in-context and strategically update it based on current estimates of co-player policies. This philosophy has been adopted by Khan et al. (2024), in which a flat policy is optimized using an evolutionary algorithm. We compare to this method in Section 3.1, after we derive our meta reinforcement learning algorithm.

To proceed with this approach, we must first reformulate the meta-game. In particular, we must deal with a difficulty that is not present in single-agent meta reinforcement learning (e.g., Wang et al., 2016; Duan et al., 2017), which stems from the fact that co-players generally update their own policies based on multiple inner episodes ("minibatches"), without which reinforcement learning cannot practically make progress. Here, we solve this by defining the environment dynamics over $B$ parallel trajectories, with $B$ the size of the minibatch of inner episode histories that co-players use to update their policies at each inner episode boundary (see Fig. 1A).

Batched co-player shaping POMDP. We define the batched co-player shaping POMDP $(\bar S, \bar A, \bar P_t, \bar P_r, \bar P_i, \bar O, \gamma, M, B)$, with hidden states consisting of the hidden environment states of the $B$ ongoing inner episodes, combined with the current parameters $\phi^{-i}_m$ of all co-players; environment dynamics $\bar P_t$ simulating $B$ environments in parallel, combined with updating the co-players' policy parameters $\phi^{-i}$ and resetting the environments at each inner episode boundary; an initial state distribution $\bar P_i$ that initializes the co-player policies and initializes the environments for the first inner episode batch; and finally, an ego-agent policy $\pi^i(\tilde a^i_l \mid \tilde h^i_l; \phi^i)$, parameterized by $\phi^i$, which determines a distribution over the batched action $\tilde a^i_l = \{a^{i,b}_l\}_{b=1}^{B}$ based on the batched long history $\tilde h^i_l = \{h^{i,b}_l\}_{b=1}^{B}$. We refer to each element of the latter as a long history $h^{i,b}_l$, with long time index $l$ running across multiple episodes, from $l = 1$ until $l = MT$. It should be contrasted to the inner episode history $x^i_t$, which runs from $t = 1$ to $t = T$ and thus only reflects the current (inner) game history.

The POMDP introduced above suggests using a sequence policy $\pi^i(\tilde a^i_l \mid \tilde h^i_l; \phi^i)$ that is aware of the full minibatch of long histories and which produces a joint distribution over all current actions in the minibatch. However, as we aim to use our agents not only to shape naive learners, but also to play against/with each other, we require a policy that can be used both in a batch setting with naive learners, and in a single-trajectory setting with other learning-aware agents. Within our single-level approach, we achieve this by factorizing the batch-aware policy $\pi^i(\tilde a^i_l \mid \tilde h^i_l; \phi^i)$ into $B$ independent policies with shared parameters $\phi^i$: $\pi^i(\tilde a^i_l \mid \tilde h^i_l; \phi^i) = \prod_{b=1}^{B} \pi^i(a^{i,b}_l \mid h^{i,b}_l; \phi^i)$.
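A toy sketch of this factorization follows: one set of shared parameters is applied independently to each of the $B$ parallel long histories. The mean-pooling "featurizer" is a made-up stand-in for the Hawk sequence model used later in the paper; names and shapes are illustrative only.

```python
import numpy as np

def factorized_batch_policy(phi, long_histories):
    """Factorized batch policy: shared parameters, one independent factor per trajectory.

    long_histories: list of B arrays, each of shape [l, obs_dim].
    Returns: array [B, num_actions] of per-trajectory action logits.
    """
    W, b = phi                                                # shared parameters phi^i
    logits = [h.mean(axis=0) @ W + b for h in long_histories]  # toy summary of h^{i,b}_l
    return np.stack(logits)

# The same policy runs with B = 1 against another learning-aware agent, and with
# B > 1 when shaping a naive learner that updates on minibatches of size B.
rng = np.random.default_rng(0)
phi = (rng.normal(size=(4, 2)), np.zeros(2))
histories = [rng.normal(size=(7, 4)) for _ in range(3)]       # B = 3 long histories
print(factorized_batch_policy(phi, histories).shape)          # (3, 2)
```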
Thanks to the batched POMDP, we can now pose co-player shaping as a standard (single-level, single-agent) expected return maximization problem:

$$\max_{\phi^i} \; \mathbb{E}_{P_{\phi^i}}\left[\frac{1}{B}\sum_{b=1}^{B}\sum_{l=0}^{MT} R^{i,b}_l\right]. \tag{2}$$

This formulation is the key for obtaining an efficient policy gradient co-player shaping algorithm.

3 CO-AGENT LEARNING-AWARE POLICY GRADIENTS

3.1 A POLICY GRADIENT FOR SHAPING NAIVE LEARNERS

We now provide a meta reinforcement learning algorithm for solving the co-player shaping problem stated in Eq. 2 efficiently. Under the POMDP introduced in the previous section, co-player shaping becomes a conventional expected return maximization problem. Applying the policy gradient theorem (Sutton et al., 1999) to Eq. 2, we arrive at COALA-PG (co-agent learning-aware policy gradients, c.f. Theorem 3.1): a policy-gradient method compatible with shaping other reinforcement learners that base their own policy updates on minibatches of experienced trajectories.

Theorem 3.1. Take the expected shaping return $J(\phi^i) = \mathbb{E}_{P_{\phi^i}}\left[\frac{1}{B}\sum_{b=1}^{B}\sum_{l=0}^{MT} R^{i,b}_l\right]$, with $P_{\phi^i}$ the distribution induced by the environment dynamics $\bar P_t$, initial state distribution $\bar P_i$ and policy $\phi^i$. Then the policy gradient of this expected return is equal to

$$\nabla_{\phi^i} J(\phi^i) = \mathbb{E}_{P_{\phi^i}}\left[\sum_{b=1}^{B}\sum_{l=1}^{MT} \nabla_{\phi^i} \log \pi^i(a^{i,b}_l \mid h^{i,b}_l)\left(\frac{1}{B}\sum_{l'=l}^{m_l T} r^{i,b}_{l'} + \frac{1}{B}\sum_{b'=1}^{B}\sum_{l'=m_l T + 1}^{MT} r^{i,b'}_{l'}\right)\right], \tag{3}$$

with $m_l$ the inner episode index corresponding to long time step $l$.

We provide a proof in Appendix D.

Figure 2: Policy update and credit assignment of naive and meta agents (left: a naive agent, which updates its policy after every inner-episode batch; right: a COALA-PG agent, which updates its policy after every meta-episode). For credit assignment of action $a^{i,b}_l$, a naive agent (left) takes only intra-episode context into account. A COALA agent (right) takes inter-episode context across the batch dimension into account. For policy updates, a naive agent aggregates policy gradients over the inner-batch dimension (dashed blocks) and updates its policy between episode boundaries. In contrast, a COALA agent updates its policy at a lower frequency, along the meta-episode dimension.

There are three important differences between COALA-PG and naively applying policy gradient methods to individual trajectories in a batch. (i) Each gradient term for an individual action $a^{i,b}_l$ takes into account the future inner episode returns averaged over the whole minibatch, instead of the future return along trajectory $b$ (see Fig. 2). This allows taking into account the influence of this action on the parameter update of the naive learner, which influences all trajectories in the minibatch after that update. (ii) Instead of averaging the policy gradients for each trajectory in the batch, COALA-PG accumulates (sums) them. This is important, as otherwise the learning signal would vanish in the limit of large minibatches. Intuitively, when a naive learner uses a large minibatch for its updates, the effect of a single action on the naive learner's update is small ($O(1/B)$), and this must be compensated by summing all such small effects. (iii) To ensure a correct balance between the return from the current inner episode $m_l$ and the return from future inner episodes, COALA-PG rescales the current episode return by $1/B$. Figure 12 in App. H shows empirically that COALA-PG correctly balances the policy gradient terms arising from the current inner episode return versus the future inner episode returns, whereas M-FOS (Lu et al., 2022) and a naive policy gradient that ignores the other parallel trajectories over-emphasize the current inner episode return, causing them to lose the co-player shaping learning signals.
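As a concrete illustration of the return targets entering Eq. 3, the NumPy sketch below computes, for every action in a meta-trajectory, the $1/B$-scaled current-episode return plus the batch-summed future-episode returns. It covers the undiscounted case; shapes and names are illustrative and not taken from the paper's code.

```python
import numpy as np

def coala_return_targets(rewards, T):
    """Per-action return targets of the COALA-PG estimator (c.f. Theorem 3.1).

    rewards: array [B, M*T] of ego-agent rewards along the B parallel trajectories
    of one meta-trajectory; T is the inner-episode length.
    """
    B, L = rewards.shape
    M = L // T
    # Batch-summed return of every inner episode (for the future-episode term).
    batch_ep_returns = rewards.reshape(B, M, T).sum(axis=(0, 2))         # [M]
    # Suffix sums over *strictly later* inner episodes.
    future_batch = np.concatenate(
        [np.cumsum(batch_ep_returns[::-1])[::-1][1:], [0.0]])            # [M]
    targets = np.zeros_like(rewards, dtype=float)
    for l in range(L):
        m = l // T                                    # current inner-episode index m_l
        own_rest = rewards[:, l:(m + 1) * T].sum(axis=1)                 # [B]
        # (1/B) * own current-episode return  +  (1/B) * batch-summed future returns
        targets[:, l] = (own_rest + future_batch[m]) / B
    return targets

# Surrogate loss for one meta-trajectory: note the *sum* over the inner batch,
# not the mean (difference (ii) above).
# loss = -(log_probs * targets).sum()    # log_probs: [B, M*T]
```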
We will later show experimentally in Section 5 that correct treatment of minibatches critically affects reinforcement learning performance. The expectation appearing in the policy gradient expression must be estimated. To reduce gradient estimation variance, we resort to standard practices, including generalized advantage estimation (Schulman et al., 2016) and sampling a meta-batch of $\bar B$ batched trajectories from $P_{\phi^i}$ (c.f. Appendix A).

Relationship to prior shaping methods. We now contrast our policy gradient algorithm to two closely related methods, M-FOS (Lu et al., 2022) and Shaper (Khan et al., 2024). Like COALA-PG, M-FOS is a model-free meta reinforcement learning method. Unlike the approach followed here, though, it aims to solve the bilevel co-player shaping problem of Eq. 1, treating meta- and inner-policy networks separately. Moreover, the M-FOS parameter update is not derived as the policy gradient on the batched co-player shaping POMDP introduced above, and current-episode returns are overemphasized compared to future-episode returns (see Appendix G). This leads to a biased parameter update, which results in learning inefficiencies. We comment on other existing bilevel shaping methods in Appendix F. Khan et al. (2024) adopt a single-level sequence policy for their Shaper algorithm, as we do here, but then resort to black-box evolution strategies (Rechenberg & Eigen, 1973) to learn the policy. Obtaining an efficient meta reinforcement learning algorithm from a POMDP applicable to such single-level policies is thus our key distinguishing contribution. The unbiased policy gradient property of our learning rule translates in practice into learning speed and stability gains, as we will see in the experiments reported in Section 5.

4 WHY IS LEARNING AWARENESS BENEFICIAL ON GENERAL-SUM GAMES?

We have established that co-player shaping can be cast as a single-agent reward maximization problem whenever there is a single learning-aware player amongst a group of learners that are otherwise naive. This allowed us to derive a policy gradient shaping method. However, such an asymmetric setup cannot in general be taken for granted. In our experimental analyses, we therefore consider the more realistic scenario where equally-capable, learning-aware agents try to shape each other. As reviewed in Section 2.2, prior work has shown that learning-awareness can result in better outcomes in general-sum games, but the origin and conditions for the occurrence of this phenomenon are not yet well understood. Here, we shed light on this question by analyzing the interactions of agents with varying degrees of learning-awareness in an analytically tractable matrix game setting. This leads us to uncover a novel explanation for the emergence of cooperation in general-sum games.

4.1 THE ITERATED PRISONER'S DILEMMA

Table 1: Single-round IPD rewards $(r^1, r^2)$.

|       | c       | d       |
|-------|---------|---------|
| **c** | (1, 1)  | (-1, 2) |
| **d** | (2, -1) | (0, 0)  |

We focus on the infinitely iterated prisoner's dilemma (IPD), the quintessential model for understanding the challenges of cooperation among self-interested agents (Rapoport, 1974; Axelrod & Hamilton, 1981). The game goes on for an indefinite number of rounds, where in each round two players ($i = 1, 2$) meet and choose between two actions, cooperate or defect, $a^i_t \in \{c, d\}$. The rewards collected as a function of the actions of both agents are shown in Table 1. These four rewards are set so as to create a social dilemma.
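The dilemma structure encoded in Table 1 can be verified mechanically; a minimal check with the payoff values copied from the table:

```python
import numpy as np

# Table 1 payoffs, indexed by (row action, column action) with c = 0, d = 1.
R1 = np.array([[1., -1.],     # row player's reward
               [2.,  0.]])
R2 = np.array([[1.,  2.],     # column player's reward
               [-1., 0.]])

# Defection strictly dominates cooperation for the row player...
assert np.all(R1[1, :] > R1[0, :])
# ...and, symmetrically, for the column player.
assert np.all(R2[:, 1] > R2[:, 0])
# (d, d) is therefore the unique one-shot Nash equilibrium, yet it is
# Pareto-dominated by mutual cooperation (reward 1 for each instead of 0).
assert R1[0, 0] > R1[1, 1] and R2[0, 0] > R2[1, 1]
```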
When the agents meet only once, mutual defection is the only Nash equilibrium; self-interested agents thus end up obtaining low reward. In the infinitely iterated variant of the game, there exist Nash equilibria involving cooperative behavior, but these are notoriously hard to converge to through self-interested reward maximization. We model each agent through a tabular policy $\pi^i(a^i_t \mid x^i_t; \phi^i)$ that depends only on the previous action of both agents, $x^i_t = (a^1_{t-1}, a^2_{t-1})$. Their behavior is thus fully specified by five parameters, which determine the probability of cooperating in response to the four possible previous action combinations, together with the initial cooperation probability. For this game, the discounted expected return $J^i(\phi^1, \phi^2)$ can be calculated analytically. We exploit this property and optimize policies by performing exact gradient ascent on the expected return (c.f. Appendix C for details).

4.2 EXPLAINING COOPERATION THROUGH LEARNING AWARENESS

Based on the experimental results reported in Fig. 3, we now identify three key findings that establish how learning awareness enables cooperation to be reached in the iterated prisoner's dilemma:

Finding 1: Learning-aware agents extort naive learners. We first pit naive against learning-aware agents. We find that the latter develop extortion policies which force naive learners into unfair cooperation, similar to the zero-determinant extortion strategies discovered by Press & Dyson (2012) (c.f. Appendix C.1). Even when a learning-aware agent is initialized at pure defection, maximizing the shaping objective of Eq. 2 lets it escape mutual defection (see Fig. 3A).

Finding 2: Extortion turns into cooperation when two learning-aware players face each other. After developing extortion policies against naive learners (grey shaded area in Fig. 3B), we then let two learning-aware agents (C1 and C2 in Fig. 3B) play against each other. We see that optimizing the co-player shaping objective turns extortion policies into cooperative policies. Intuitively, under independent learning, an extortion policy shapes the co-player to cooperate more. We remark that the same occurs if learning-aware agents play against themselves (self-play; data not shown). This analysis explains the success of the annealing procedure employed by Lu et al. (2022), according to which naive co-players transition to self-play throughout training.

Finding 3: Cooperation emerges within groups of naive and learning-aware agents. Findings 1 and 2 motivate studying learning in a group containing both naive and learning-aware agents, with every agent in the group trained against each other. This mixed group setting yields a sum of two distinct shaping objectives, which depend on whether the agent being shaped is learning-aware or naive. The gradients resulting from playing against naive learners pull away from mutual defection and towards extortion, while those resulting from playing against other learning-aware agents push away from extortion towards cooperation. Balancing these competing forces leads to robust cooperation, see Fig. 3C (left).

Figure 3: (A) Learning-aware agents learn to extort naive learners, even when initialized with a pure defection strategy. (B) An extortion policy developed against naive agents (shaded period) turns into a cooperative one when playing against another learning-aware agent (M1 & M2). (C) Cooperation emerges within mixed training pools of naive and learning-aware agents, but not in pools of learning-aware agents only. The shaded regions represent the interquartile range (25th to 75th percentiles) across 32 random seeds.
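Section 4.1 noted that for these memory-one tabular policies the discounted IPD return can be computed in closed form. A minimal NumPy sketch follows; the discount value and normalization conventions are illustrative choices, not necessarily those of Appendix C.

```python
import numpy as np

# Rewards per joint state (cc, cd, dc, dd), with the first letter player 1's action.
r1 = np.array([1., -1., 2., 0.])
r2 = np.array([1.,  2., -1., 0.])

def ipd_values(theta1, theta2, gamma=0.96):
    """Exact discounted IPD returns for two memory-one tabular policies.

    theta = (p_init, p_cc, p_cd, p_dc, p_dd): probability of cooperating in the
    first round and after each previous joint action, written from that player's
    own perspective (own previous action first).
    """
    p1_init, p1 = theta1[0], np.asarray(theta1[1:])
    p2_init, p2 = theta2[0], np.asarray(theta2[1:])
    p2 = p2[[0, 2, 1, 3]]   # re-index player 2's conditionals to player 1's state order
    # Distribution over joint states in the first round.
    d0 = np.array([p1_init * p2_init, p1_init * (1 - p2_init),
                   (1 - p1_init) * p2_init, (1 - p1_init) * (1 - p2_init)])
    # Markov transition matrix over joint states.
    M = np.stack([p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2)], axis=1)
    # Discounted state visitation: d0^T (I - gamma * M)^{-1}.
    d = np.linalg.solve((np.eye(4) - gamma * M).T, d0)
    return d @ r1, d @ r2

# Example: tit-for-tat against an unconditional defector.
print(ipd_values(np.array([1., 1., 0., 1., 0.]), np.array([0., 0., 0., 0., 0.])))
```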
Intriguingly, mutual unconditional defection is no longer a Nash equilibrium in this mixed group setting, and agents initialized with unconditional defection policies learn to cooperate (see Appendix C.1). By contrast, a pure group of learning-aware agents cannot escape mutual defection, see Fig. 3C (right). This can be explained by the fact that the agents can no longer observe others learn, and must again deal with a non-stationary problem. The resulting gradients therefore contain no information on the effects of unconditional defection on the future strategies of co-players, nor on the fact that policies in the vein of tit-for-tat can shape co-players towards more cooperation. Our analysis thus reveals a surprising path to cooperation through heterogeneity. The presence of short-sighted agents that greedily maximize immediate rewards turns out to be essential for full cooperation to be established among far-sighted, learning-aware agents.

4.3 EXPLAINING WHEN AND HOW COOPERATION ARISES WITH THE LOLA ALGORITHM

We next analyze the seminal Learning with Opponent-Learning Awareness (LOLA; Foerster et al., 2018a) algorithm. Briefly, LOLA assumes that co-players update their parameters with $M$ naive gradient steps, and estimates the total gradient through a look-ahead update (shown for a single look-ahead step):

$$\Delta^{\mathrm{LOLA}}_{\phi^i} = \frac{\mathrm{d}}{\mathrm{d}\phi^i} J^i(\phi^i, \phi^{-i} + \Delta\phi^{-i}) \quad \text{s.t.} \quad \Delta\phi^{-i} = \alpha \nabla_{\phi^{-i}} J^{-i}(\phi^i, \phi^{-i}), \tag{4}$$

with $\frac{\mathrm{d}}{\mathrm{d}\phi^i}$ the total derivative taking into account the effect of $\phi^i$ on the parameter updates $\Delta\phi^{-i}$, and $\nabla_{\phi^{-i}}$ the partial derivative. Note that Eq. 4 considers the LOLA-DICE update (Foerster et al., 2018b), an improved version of LOLA. In Appendix E, we show that Eq. 4 can be derived as a special case of COALA-PG. Note that LOLA-DICE estimates the policy gradient in Eq. 4 by explicitly backpropagating through the co-player's parameter update using higher-order derivatives, whereas COALA-PG leads to a novel higher-order-derivative-free estimator of Eq. 4 (see Appendix E).

Above, we showed that the two main ingredients for learning to cooperate under selfish objectives are (i) observing that one's actions influence the future behavior of others, providing shaping gradients pulling away from defection towards extortion, and (ii) also playing against other extortion agents immune to being shaped on the fast timescale, providing gradients pulling away from extortion towards cooperation. We then showed that both ingredients can be combined by training agents in a heterogeneous group containing both naive and learning-aware agents. We can explain the emergent cooperation in LOLA by observing that LOLA also combines both ingredients, albeit differently from the heterogeneous group setting. The look-ahead rule (Eq. 4) computes gradients that shape naive learners performing $M$ naive gradient steps. Unique to LOLA, however, is that these simulated naive learners are initialized with the parameters $\phi^{-i}$ of other LOLA agents. If the number of look-ahead steps is small, the updated naive learner parameters stay close to $\phi^{-i}$, mimicking playing against other extortion agents. This then results in emergent cooperation. Fig. 4A confirms that LOLA-DICE with ground-truth gradients and with few look-ahead steps leads to cooperation on the iterated prisoner's dilemma. However, as the number of look-ahead steps increases, the naive learner moves too far away from its $\phi^{-i}$ initialization, removing the second ingredient and thus leading to defection.
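To make the structure of the look-ahead objective concrete, here is a small Python sketch of a LOLA-DICE-style update in which the derivative through the simulated co-player steps is taken numerically. Finite differences merely stand in for the exact analytic or higher-order gradients, and all names are illustrative; the closed-form `ipd_values` above could be passed as `J_i` and `J_j`.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient; a stand-in for the exact analytic gradients."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def lola_lookahead_update(phi_i, phi_j, J_i, J_j, alpha=1.0, lookahead=1, lr=0.1):
    """One LOLA-DICE-style update for agent i, in the spirit of Eq. 4.

    The co-player j is simulated to take `lookahead` naive gradient-ascent steps
    on its own return J_j; agent i then ascends its return evaluated at the
    updated co-player parameters, so the gradient is a total derivative that
    includes the shaping effect of phi_i on the simulated update.
    """
    def shaped_return(phi_i_query):
        phi_j_sim = np.array(phi_j, dtype=float)
        for _ in range(lookahead):   # simulated naive co-player learning
            phi_j_sim = phi_j_sim + alpha * numerical_grad(
                lambda p: J_j(phi_i_query, p), phi_j_sim)
        return J_i(phi_i_query, phi_j_sim)

    return phi_i + lr * numerical_grad(shaped_return, phi_i)
```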
Figure 4: (A) Performance of two agents trained by LOLA-DICE on the iterated prisoner's dilemma with analytical gradients, for various numbers of look-ahead steps (only the performance of the first agent is shown). (B) Performance of a randomly initialized naive learner trained against the fixed LOLA policy with 20 look-ahead steps, taken from the end of training in (A). (C) Same setting as (A), but with the naive gradient $\lambda \nabla_{\phi^i} J^i(\phi^i, \phi^{-i})$ added to the LOLA-DICE update, with $\lambda$ a hyperparameter (c.f. Appendix C). Shaded regions indicate standard error computed over 64 seeds.

In Fig. 4B, we take the policy resulting from LOLA training with many look-ahead steps, and train a new randomly initialized naive learner against this fixed LOLA policy. The results show that the LOLA policy extorts the naive learner into unfair cooperation, confirming that with many look-ahead steps, only a shaping incentive is present in the LOLA update, resulting in extortion policies. Hence, the low reward in Fig. 4A for LOLA agents with many look-ahead steps does not result from unconditional defection, but instead from both LOLA policies trying to extort the other one. Finally, we can improve the performance of LOLA with many look-ahead steps by explicitly introducing ingredient (ii), adding the partial gradient $\lambda \nabla_{\phi^i} J^i(\phi^i, \phi^{-i})$ to Eq. 4, see Fig. 4C.

5 EXPERIMENTAL ANALYSIS OF POLICY GRADIENT IMPLEMENTATIONS

The results presented in the previous sections were obtained by performing gradient ascent on analytical expected returns. This assumes knowledge of co-player parameters, and it is only possible for a restricted number of games which admit closed-form value functions. We now move to the general reinforcement learning setting, aiming to understand (i) when meta-agents succeed in exploiting naive agents, and (ii) when cooperation is achieved among meta-agents.

5.1 AGENTS TRAINED WITH COALA-PG MASTER THE ITERATED PRISONER'S DILEMMA

Figure 5: Agents trained by COALA-PG play the iterated prisoner's dilemma. (A): When trained against naive agents only, COALA-PG-trained agents extort the latter and reach considerably higher reward than other baseline agents. The stars indicate overlapping curves of the corresponding color at that point. (B): When analyzing the behavior of the agents within one meta-episode, we observe COALA-PG-trained agents shaping naive co-players, leading to a low defection rate in the beginning, which is then exploited towards the end. M-FOS, on the other hand, defects from the beginning, achieving lower reward and thus failing to properly optimize the shaping problem. Batch-unaware COALA-PG performs identically to M-FOS and is therefore omitted. (C): Average performance of meta agents playing against other meta agents, when training a group of meta agents against a mixture of naive and other meta agents. Such agents trained with COALA-PG cooperate when playing against each other, but fail to do so when trained with baseline methods. When removing naive agents from the pool, meta agents also fail to cooperate, as predicted in Section 3. Shaded regions indicate standard deviation computed over 5 seeds.

We train a long-context sequence policy $\pi^i(a^{i,b}_l \mid h^{i,b}_l; \phi^i)$ with the COALA-PG rule to play the (finite) iterated prisoner's dilemma, see Appendix B. We choose a Hawk recurrent neural network as the policy backbone (De et al., 2024). Hawk models achieve transformer-level performance at scale, but with time and memory costs that grow only linearly with sequence length.
This allows efficiently processing the long history context $h^{i,b}_l$, which contains all actions played by the agents across episodes. Based on the results of the preceding section, we consider a mixed group setting, pitting COALA-PG-trained agents against naive learners as well as other equally capable learning-aware agents. Naive learners are equipped with the same policy architecture as the agents trained by COALA-PG, but their context is limited to the current inner game history $x^{i,b}_t$.

In Fig. 5, we see that COALA-PG reproduces the analytical game findings reported in the previous section: learning-aware agents cooperate with other learning-aware agents, and extort naive learners. Importantly, the identity of each agent is not revealed to a learning-aware agent, which must therefore infer in-context the strategy used by the player it faces. Likewise, we find that the LOLA-DICE (Eq. 4) estimator behaves as in the analytical game. This result complements previous experiments with LOLA on tabular policies (Foerster et al., 2018a;b), suggesting that there is a broad class of efficient learning-aware reinforcement learning rules that can reach cooperation with more complex context-dependent sequence models. We note that LOLA achieves this by explicitly differentiating through co-player updates, which requires access to their parameters and gives rise to higher-order derivatives. Our rule lifts these requirements, while maintaining learning efficiency.

By contrast, the whole group falls into defection when training the exact same sequence model with the M-FOS rule, which weighs future versus current episode returns disproportionately. We note that the experiments reported by Lu et al. (2022) were performed with a tabular policy and analytical inner game returns, for which cooperation could be achieved with the M-FOS rule. This shows how crucial the unbiased policy gradient property of COALA-PG is for co-player shaping by meta reinforcement learning to succeed in practice. The same failure to beat defection occurs when using a naive policy gradient ablation which does not take co-player batching into account. We refer to Appendix G for expressions for this baseline as well as the M-FOS rule. When training against M-FOS agents, our COALA-PG agents successfully shape M-FOS agents into cooperative behavior (c.f. Appendix H.3).

5.2 AGENTS TRAINED WITH COALA-PG COOPERATE ON A SEQUENTIAL SOCIAL DILEMMA

Figure 6: Agents trained by COALA-PG against naive agents only successfully shape them in Clean Up-lite. (A) COALA-PG-trained agents shape naive opponents better than baselines do, obtaining higher return. (B and C) Analyzing behavior within a single meta-episode after training reveals that COALA outperforms baselines and shapes naive agents, (i) exhibiting a lower cleaning discrepancy (absolute difference in average cleaning time between the two agents), and (ii) being less often zapped. Shaded regions indicate standard deviation computed over 5 seeds.

Finally, we consider Clean Up-lite, a simplified two-player version of the Clean Up game, which is part of the Melting Pot suite of multi-agent environments (Agapiou et al., 2023). We briefly describe the game here, and provide additional details in Appendix B. At a high level, Clean Up-lite is a two-player game that models the social dilemma known as the tragedy of the commons (Hardin, 1968). A player receives rewards by picking up apples.
Apples are spontaneously generated in an orchard, but the rate of generation is inversely proportional to the pollution level of a nearby river. Agents can spend time cleaning the river to reduce the pollution level, and thus increase the apple generation rate. In a single-player game, an agent would balance cleaning and harvesting to maximize the return. In a multi-player setting, however, this leaves room for a free rider who never cleans and always harvests, letting the opponent clean instead. At any point in time, agents can zap the opponent, which results in the opponent being frozen for a number of time steps, unable to harvest or clean. In contrast to matrix games, this game is a sequential social dilemma (Leibo et al., 2017), where cooperation involves orchestrating multiple actions.

As in the previous section, we model agent behavior through Hawk sequence policies, and compare COALA-PG to the same baseline methods as before. Naive agents here would learn too slowly if initialized from scratch, and are therefore handled differently, see Appendix B.2.2. In Figs. 6 and 7, we see that agents trained by COALA-PG reach significantly higher returns than previous model-free baselines, establishing a mutual cooperation protocol with other learning-aware agents, while exploiting naive ones. We describe below the qualitative behavior found in the simulations.

Exploitation of naive agents. COALA-PG-trained agents shape the behavior of naive ones to their advantage (c.f. Figure 6). Our behavioral analysis reveals two salient features. First, COALA-PG successfully shapes naive opponents to zap less often throughout the meta-episode. Less overall zapping means that agents can harvest more apples while the pollution level is low, thus increasing the overall reward. Second, COALA-PG successfully shapes naive co-players to clean significantly more often compared to the COALA-PG agent, resulting in a lower average pollution level and a higher average apple level (c.f. Figure 13 in App. H). Interestingly, the naive learners benefit from the shaping by COALA-PG agents, reaching a higher average reward compared to playing against other baselines (c.f. Figure 13).

Figure 7: Agents trained with COALA-PG against a mixture of naive and other meta agents learn to cooperate in Clean Up-lite. (A) COALA-PG-trained agents obtain higher average reward than baseline agents when playing against each other. (B and C): COALA-PG leads to a fairer division of cleaning efforts and lower zapping rates. Shaded regions indicate standard deviation computed over 5 seeds.

Learning-aware agents cooperate. We see similar trends when introducing other COALA-PG-trained agents in the game, see Fig. 7. Essentially, COALA-PG allows for higher apple production because of lower pollution, and a lower zapping rate. We see that over training time the zapping rate goes down, and COALA-PG agents have a fairer division of cleaning time compared to baselines. Interestingly, the zapping rates averaged over the meta-episode are lower than in the pure shaping setting (i.e., with naive co-players only), indicating that learning-aware agents mutually shape each other to zap less.

6 CONCLUSION

We have shown that learning awareness allows reaching high returns in challenging social dilemmas designed to make independent learning difficult. We identified two key conditions for this to occur.
First, we found it necessary to take into account the stochastic minibatched nature of the updates used by other agents. This is one distinguishing aspect of the COALA-PG learning rule proposed here, which translates into a significant performance advantage over prior methods. Second, learningaware agents had to be embedded in a heterogeneous group containing non-learning-aware agents. An important component of our result is the ability to leverage modern and scalable sequence models. Modern sequence models have scaled favorably and in a predictable manner, most notably in autoregressive language modeling (Kaplan et al., 2020), and our results suggest important gains could be made applying similar approaches to multi-agent learning. Our method shares key aspects with the current scalable machine learning approach: unbiased stochastic gradients, sequence model architectures that are amenable to gradient-based learning, and in-context learning/inference. Moreover, we focused on the setting of independent agent learning, which scales well in parallel by design. We thus see it as an exciting question to investigate the approach pursued here at larger scale and in a wider range of environments. The resulting self-organized behavior may display unique social properties that are absent from single-agent machine learning paradigms, and which may open new avenues towards artificial intelligence (Du e nez-Guzm an et al., 2023). Published as a conference paper at ICLR 2025 ACKNOWLEDGEMENTS We would like to thank Maximilian Schlegel, Yanick Schimpf, Rif A. Saurous, Joel Leibo, Alexander Sasha Vezhnevets, Aaron Courville, Juan Duque, Milad Aghajohari, Razvan Ciuca, Gauthier Gidel, James Evans and the Google Paradigms of Intelligence team for feedback and enlightening discussions. GL and BR acknowledge support from the CIFAR chair program. EE acknowledges support from a Vanier scholarship from the government of Canada. John P. Agapiou, Alexander Sasha Vezhnevets, Edgar A. Du e nez-Guzm an, Jayd Matyas, Yiran Mao, Peter Sunehag, Raphael K oster, Udari Madhushani, Kavya Kopparapu, Ramona Comanescu, D. J. Strouse, Michael B. Johanson, Sukhdeep Singh, Julia Haas, Igor Mordatch, Dean Mobbs, and Joel Z. Leibo. Melting Pot 2.0. ar Xiv preprint ar Xiv:2211.13746, 2023. Milad Aghajohari, Juan Agustin Duque, Tim Cooijmans, and Aaron Courville. Loqa: Learning with opponent q-learning awareness. ar Xiv preprint ar Xiv:2405.01035, 2024. Ekin Aky urek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. In International Conference on Learning Representations, 2023. Stefano V. Albrecht, Filippos Christianos, and Lukas Sch afer. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. MIT Press, 2024. Robert Axelrod and William D. Hamilton. The evolution of cooperation. Science, 211(4489):1390 1396, March 1981. 
Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Antoine Dedieu, Claudio Fantacci, Jonathan Godwin, Chris Jones, Ross Hemsley, Tom Hennigan, Matteo Hessel, Shaobo Hou, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Markus Kunesch, Lena Martens, Hamza Merzic, Vladimir Mikulik, Tamara Norman, George Papamakarios, John Quan, Roman Ring, Francisco Ruiz, Alvaro Sanchez, Laurent Sartran, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan Srinivasan, Miloˇs Stanojevi c, Wojciech Stokowiec, Luyu Wang, Guangyao Zhou, and Fabio Viola. The Deep Mind JAX Ecosystem, 2020. Jan Balaguer, Raphael Koster, Christopher Summerfield, and Andrea Tacchetti. The good shepherd: An oracle agent for mechanism design. ar Xiv preprint ar Xiv:2202.10135, 2022. Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Technical report, Universit e de Montr eal, D epartement d Informatique et de Recherche op erationnelle, 1990. James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 2020. Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, 1998(746-752):2, 1998. Katherine M. Collins, Ilia Sucholutsky, Umang Bhatt, Kartik Chandra, Lionel Wong, Mina Lee, Cedegao E. Zhang, Tan Zhi-Xuan, Mark Ho, Vikash Mansinghka, Adrian Weller, Joshua B. Tenenbaum, and Thomas L. Griffiths. Building machines that learn and think with people. ar Xiv preprint ar Xiv:2408.03943, 2024. Published as a conference paper at ICLR 2025 Tim Cooijmans, Milad Aghajohari, and Aaron Courville. Meta-value learning: a general framework for learning with learning awareness. ar Xiv preprint ar Xiv:2307.08863, 2023. Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, and Caglar Gulcehre. Griffin: mixing gated linear recurrences with local attention for efficient language models. ar Xiv preprint ar Xiv:2402.19427, 2024. Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. In International Conference on Learning Representations, 2017. Edgar A. Du e nez-Guzm an, Suzanne Sadedin, Jane X. Wang, Kevin R. Mc Kee, and Joel Z. Leibo. A social path to human-like artificial intelligence. Nature Machine Intelligence, 5(11):1181 1188, 2023. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017. 
Jakob Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In International Conference on Autonomous Agents and Multiagent Systems, 2018a. Jakob Foerster, Gregory Farquhar, Maruan Al-Shedivat, Tim Rockt aschel, Eric Xing, and Shimon Whiteson. Di CE: The infinitely differentiable Monte Carlo estimator. In International Conference on Machine Learning, 2018b. Drew Fudenberg and David K Levine. The theory of learning in games, volume 2. MIT press, 1998. Hyowon Gweon, Judith Fan, and Been Kim. Socially intelligent machines that learn from humans and help humans learn. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 381(2251):20220048, July 2023. Garrett Hardin. The tragedy of the commons. Science, 162(3859):1243 1248, 1968. Charles R. Harris, K. Jarrod Millman, St efan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fern andez del R ıo, Mark Wiebe, Pearu Peterson, Pierre G erard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with Num Py. Nature, 585(7825):357 362, 2020. Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2024. URL http://github.com/google/flax. Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz De Cote. A survey of learning in multiagent environments: Dealing with non-stationarity. ar Xiv preprint ar Xiv:1707.09183, 2017. Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, Lecture Notes in Computer Science. Springer, 2001. J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3): 90 95, 2007. Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99 134, 1998. Jared Kaplan, Sam Mc Candlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. ar Xiv preprint ar Xiv:2001.08361, 2020. Published as a conference paper at ICLR 2025 Akbir Khan, Timon Willi, Newton Kwan, Andrea Tacchetti, Chris Lu, Edward Grefenstette, Tim Rockt aschel, and Jakob N. Foerster. Scaling opponent shaping to high dimensional games. In International Conference on Autonomous Agents and Multiagent Systems, 2024. Dong Ki Kim, Miao Liu, Matthew D Riemer, Chuangchuang Sun, Marwa Abdulhai, Golnaz Habibi, Sebastian Lopez-Cot, Gerald Tesauro, and Jonathan How. A policy gradient algorithm for learning to learn in multiagent reinforcement learning. In International Conference on Machine Learning, 2021. H. W. Kuhn. Extensive games and the problem of information. Princeton University Press, 1953. Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, D. J. Strouse, Steven Hansen, Angelos Filos, Ethan Brooks, Maxime Gazeau, Himanshu Sahni, Satinder Singh, and Volodymyr Mnih. In-context reinforcement learning with algorithm distillation. ar Xiv preprint ar Xiv:2210.14215, 2022. Joel Z. 
Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multiagent reinforcement learning in sequential social dilemmas. In International Conference on Autonomous Agents and Multiagent Systems, 2017. Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: generalization and stability in in-context learning. In International Conference on Machine Learning, 2023. Christopher Lu, Timon Willi, Christian A Schroeder De Witt, and Jakob Foerster. Model-free opponent shaping. In International Conference on Machine Learning, 2022. Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016. John F. Nash Jr. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48 49, 1950. Joon Sung Park, Joseph O Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023. William H. Press and Freeman J. Dyson. Iterated Prisoner s Dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences, 109(26): 10409 10413, 2012. Neil C. Rabinowitz. Meta-learners learning dynamics are unlike learners . ar Xiv preprint ar Xiv:1905.01320, 2019. Anatol Rapoport. Prisoner s dilemma recollections and observations. In Game Theory as a Theory of a Conflict Resolution, pp. 17 34. Springer, 1974. Ingo Rechenberg and M. Eigen. Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog Verlag, 1973. J urgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. Diploma thesis, Institut f ur Informatik, Technische Universit at M unchen, 1987. John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. Highdimensional continuous control using generalized advantage estimation. ICLR, 2016. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Yoav Shoham and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press, 2008. Published as a conference paper at ICLR 2025 Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Valuedecomposition networks for cooperative multi-agent learning. ar Xiv preprint ar Xiv:1706.05296, 2017. Richard S. Sutton, David Mc Allester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999. Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In International Conference on Machine Learning, 1993. Alexander Sasha Vezhnevets, John P. Agapiou, Avia Aharon, Ron Ziv, Jayd Matyas, Edgar A. Du e nez-Guzm an, William A. Cunningham, Simon Osindero, Danny Karmon, and Joel Z. Leibo. Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia. 
ar Xiv preprint ar Xiv:2312.03664, 2023. John von Neumann and Oskar Morgenstern. Theory of games and economic behavior. Princeton University Press, 1947. Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Ag uera y Arcas, Max Vladymyrov, Razvan Pascanu, and Jo ao Sacramento. Uncovering mesa-optimization algorithms in Transformers. ar Xiv preprint ar Xiv:2309.05858, 2023. Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. ar Xiv preprint ar Xiv:1611.05763, 2016. Timon Willi, Alistair Hp Letcher, Johannes Treutlein, and Jakob Foerster. COLA: consistent learning with opponent-learning awareness. In International Conference on Machine Learning, 2022. Annie Xie, Dylan Losey, Ryan Tolsma, Chelsea Finn, and Dorsa Sadigh. Learning latent representations to influence multi-agent interaction. In Conference on Robot Learning, 2021. Kaiqing Zhang, Zhuoran Yang, and Tamer Bas ar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pp. 321 384, 2021. Stephen Zhao, Chris Lu, Roger B. Grosse, and Jakob Foerster. Proximal learning with opponentlearning awareness. Advances in Neural Information Processing Systems, 35, 2022. Karl Johan Astr om. Optimal control of Markov processes with incomplete state information I. Journal of Mathematical Analysis and Applications, 10:174 205, 1965. Published as a conference paper at ICLR 2025 A A2C AND PPO IMPLEMENTATIONS OF COALA-PG We use both Advantage Actor-Critic (A2C) (Mnih et al., 2016) and Proximal Policy Optimization (PPO) Schulman et al. (2017) for our COALA policy gradient estimate. We detail here how to merge these methods with our COALA-PG method. A.1 REINFORCE ESTIMATOR For the reader s convenience, we display the COALA policy gradient below. We remind for reference, that ml the inner episode index corresponding to the meta episode time step l. φi J(φi) = E P φi l=1 φi log πi(ai,b l | hi,b l ) k=l Ri,b k + 1 k=T ml+1 Ri,b The batch-unaware COALA policy gradient, which we use as a baseline method for shaping naive learners, is given by φi J(φi) = E P φi l=1 φi log πi(ai,b l | hi,b l ) k=l Ri,b k + k=T ml+1 Ri,b k Note that when we play against other meta agents instead of naive learners, all parallel POMDP trajectories in the batch are independent, and hence we can correctly use the batch-unaware COALA policy gradient for this setting. Finally, the M-FOS policy gradient (c.f. Appendix G) is given by φi J(φi) = E P φi l=1 φi log πi(ai,b l | hi,b l ) k=l Ri,b k + 1 k=T ml+1 Ri,b The difference between M-FOS and COALA-PG is the 1 B scaling factor for the current inner episode return. This scaling factor is crucial for a correct balance between gradient contributions arising from the current inner episode, and future inner episodes. Without this scaling factor, the contributions from future inner episodes required for learning to shape the co-players vanish for large inner batch sizes. We can construct REINFORCE estimators by sampling directly from the above expectations. However, this leads to policy gradients with prohibitively large variance. Hence, in the following sections we will derive improved advantage estimators to reduce the variance of the policy gradient estimates. 
A.2 VALUE FUNCTION ESTIMATION

One of the simplest ways to use value functions to reduce the variance of the policy gradient estimator is to subtract a baseline from the return estimator. For COALA-PG (Equation 5), the natural value function to learn is

V(h^{i,b}_l) = \mathbb{E}_{P_{\phi^i}(\cdot \mid h^{i,b}_l)}\left[ \frac{1}{B} \sum_{k=l}^{m_l T} R^{i,b}_k + \frac{1}{B} \sum_{b'=1}^{B} \sum_{k=m_l T+1}^{MT} R^{i,b'}_k \right]

As the environment is reset after each inner episode, the second term can be simplified by merging the expectations over the different parallel trajectories:

V(h^{i,b}_l) = \mathbb{E}_{P_{\phi^i}(\cdot \mid h^{i,b}_l)}\left[ \frac{1}{B} \sum_{k=l}^{m_l T} R^{i,b}_k + \sum_{k=m_l T+1}^{MT} R^{i,b}_k \right]

This target has an additional 1/B factor on the first term compared to the conventional value function that would need to be learned when playing, e.g., against another meta agent (Equation 6). This is undesirable for several reasons. One is that as B increases, the value target becomes increasingly insensitive to the current inner-episode return, which makes learning difficult. Another is that the target magnitudes when playing against a naive agent versus against a meta agent can differ significantly. Finally, for simplicity we want a single value function that can be used both when playing against naive learners and when playing against other meta agents. We can resolve these issues by instead learning the value function for the batch-unaware returns, and introducing a specialized reweighting when playing against naive learners, as described below:

\hat{V}(h^{i,b}_l) = \mathbb{E}_{P_{\phi^i}(\cdot \mid h^{i,b}_l)}\left[ \sum_{k=l}^{m_l T} R^{i,b}_k + \sum_{k=m_l T+1}^{MT} R^{i,b}_k \right]

As such, the same value function can be used whether playing against a naive or a meta agent. In practice, we trade off variance and bias when learning the value function by using TD(λ) targets. Algorithm 1 shows how to compute such targets with a general algorithm, which we can later also repurpose for Generalized Advantage Estimation (Schulman et al., 2016) and for M-FOS value functions. For computing the TD(λ) targets used to learn our value functions, we use normalize_current_episode=False and average_future_episodes=False when the given trajectory batch originates from playing against another meta agent.

Algorithm 1 Batch Lambda Returns
Input: r_t, discount, v_t, λ, average_future_episodes, normalize_current_episode, inner_episode_length
Output: returns
  seq_len ← r_t.shape[1]
  batch_size ← r_t.shape[0]
  if normalize_current_episode then normalization ← batch_size else normalization ← 1
  episode_end ← (range(seq_len) mod inner_episode_length) == (inner_episode_length − 1)
  acc ← v_t[:, −1]
  global_acc ← mean(v_t[:, −1])
  for t = seq_len − 1 down to 0 do
    if average_future_episodes and episode_end[t] then acc ← global_acc
    acc ← r_t[:, t] / normalization + discount · ((1 − λ) · v_t[:, t] + λ · acc)
    global_acc ← mean(r_t[:, t] + discount · ((1 − λ) · v_t[:, t] + λ · global_acc))
    returns[:, t] ← acc
  end for
  return returns

(A NumPy sketch of this routine is given below.)

A.3 GENERALIZED ADVANTAGE ESTIMATION

We now describe how the above value estimate can be used to update the policy following COALA-PG. Ultimately we want an unbiased estimate of the advantage function, as this allows the use of algorithms like PPO or A2C. The advantage of a state h^{i,b}_l and action a^{i,b}_l against a naive agent is

A(h^{i,b}_l, a^{i,b}_l) = \mathbb{E}_{P_{\phi^i}(\cdot \mid h^{i,b}_l, a^{i,b}_l)}\left[ \frac{1}{B} \sum_{k=l}^{m_l T} R^{i,b}_k + \frac{1}{B} \sum_{b'=1}^{B} \sum_{k=m_l T+1}^{MT} R^{i,b'}_k \right] - \mathbb{E}_{P_{\phi^i}(\cdot \mid h^{i,b}_l)}\left[ \frac{1}{B} \sum_{k=l}^{m_l T} R^{i,b}_k + \frac{1}{B} \sum_{b'=1}^{B} \sum_{k=m_l T+1}^{MT} R^{i,b'}_k \right]

We can reformulate this expression using \hat{V} as follows:

A(h^{i,b}_l, a^{i,b}_l) = \mathbb{E}_{P_{\phi^i}(\cdot \mid h^{i,b}_l, a^{i,b}_l)}\left[ \frac{1}{B} \sum_{k=l}^{m_l T} R^{i,b}_k + \frac{1}{B} \sum_{b'=1}^{B} \sum_{k=m_l T+1}^{MT} R^{i,b'}_k \right] - \frac{1}{B} \hat{V}(h^{i,b}_l) - \frac{1}{B} \mathbb{E}_{P_{\phi^i}(\cdot \mid h^{i,b}_l)}\left[ \sum_{b' \neq b} \hat{V}(h^{i,b'}_l) \right]

A simple advantage estimator would be the Monte-Carlo estimate of the above.
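Since both the TD(λ) value targets above and the Generalized Advantage Estimator below reuse Algorithm 1, we give a minimal NumPy sketch of the batched lambda-return computation here (our reading of the pseudocode, not the authors' code). We assume r_t and v_t are [batch, seq_len] float arrays and that v_t[:, t] holds the bootstrap value for step t+1, so that v_t[:, -1] bootstraps the final step; this alignment is an assumption of the sketch.

```python
import numpy as np

def batch_lambda_returns(rt, vt, discount, lam, inner_episode_length,
                         average_future_episodes, normalize_current_episode):
    batch_size, seq_len = rt.shape
    normalization = batch_size if normalize_current_episode else 1
    # episode_end[t] is True on the last step of each inner episode
    episode_end = (np.arange(seq_len) % inner_episode_length) == (inner_episode_length - 1)
    returns = np.zeros_like(rt)
    acc = vt[:, -1]                  # per-trajectory accumulator
    global_acc = vt[:, -1].mean()    # batch-averaged accumulator for future inner episodes
    for t in range(seq_len - 1, -1, -1):
        if average_future_episodes and episode_end[t]:
            acc = global_acc         # replace the per-trajectory future by its batch average
        acc = rt[:, t] / normalization + discount * ((1 - lam) * vt[:, t] + lam * acc)
        global_acc = np.mean(rt[:, t] + discount * ((1 - lam) * vt[:, t] + lam * global_acc))
        returns[:, t] = acc
    return returns
```

With both flags set to True, returns[:, t] accumulates the current inner-episode rewards scaled by 1/B plus the batch-averaged lambda return of the future inner episodes, matching the COALA value target derived above.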
However, we can trade off variance with bias by using the Generalized Advantage Estimator (Schulman et al., 2016). Using similar logic as for the equation above, we can compute the COALA version of the GAE by reusing the batched lambda-returns algorithm (c.f. Algorithm 1) as follows:
- Instead of the rewards of the trajectory, we provide the TD errors δ_t = r_t + γ \hat{V}_{t+1} − \hat{V}_t as input for r_t.
- We provide γλ as input for discount.
- We provide 1.0 as input for λ.
- We set average_future_episodes and normalize_current_episode both to True.
For computing the GAE for the batch-unaware COALA-PG baseline, we follow the same approach except that we set average_future_episodes and normalize_current_episode both to False. For computing the GAE for the M-FOS baseline (c.f. Appendix G), we follow the same approach except that we set average_future_episodes to True and normalize_current_episode to False.

A.4 A2C AND PPO IMPLEMENTATIONS

We can now use the above advantage estimates directly in A2C and PPO implementations. Below, we list a few tweaks of classical reinforcement learning tricks that we used in our implementation.

Advantage normalization: as is common in PPO implementations, we investigate the use of advantage normalization. Given a batched trajectory of advantage estimates over which the policy should be updated, the trick consists in centering the advantage estimates over the batched trajectory. Empirically, we found that when playing against a mixture of naive and meta learners, it was beneficial to apply the centering separately for the two types of meta-trajectories (playing against naive learners versus playing against other meta agents).

Reward rescaling: as another way to prevent issues stemming from large value targets, we investigate simply rescaling the rewards of an environment when appropriate. Effectively, the reward is rescaled for the value and policy gradient computations, but all metrics are reported with the scaling reverted, i.e. in the original reward scale.

B EXPERIMENTAL DETAILS

B.1 ENVIRONMENTS

B.1.1 ITERATED PRISONER'S DILEMMA (IPD)

We model the IPD environment as follows:
State: The environment has 5 states, which we label s_0, (c, c), (c, d), (d, c), (d, d).
Action: Each agent has 2 possible actions: cooperate (c) and defect (d).
Dynamics: Based on the actions taken by the agents in the previous time step, the state of the environment is set to (a_1, a_2), where a_1 and a_2 are the previous actions of the first and second player, respectively. The assignment of who is first and second is made arbitrarily and fixed.
Initial state: The initial state is always set to s_0.
Observation: The agents observe the state directly, modulo a permutation of the tuple to ensure symmetry of the observations. The 5 possible observations are then one-hot encoded.
Reward: At every timestep, each agent receives a reward following the reward matrix in Table 1.

B.1.2 CLEANUP-LITE

Cleanup-lite is a simplified two-player version of the Clean Up game, which is part of the Melting Pot suite of multi-agent environments (Agapiou et al., 2023). It is modelled as follows:
State: The world is a 2D grid of size 5 × 4. The right column is the river, and the left one the orchard. Cells in the river column can be occupied by dirt. Cells in the orchard column can be occupied by an apple.
The world state also contains the position of each agent and their respective zapped state.
Action: There are 6 actions: {move right, move left, move up, move down, zap, do nothing}.
Dynamics: the environment evolves at every timestep in the following order:
1. When there is at least one cell in the river column that is not occupied by dirt, a new patch of dirt is spawned with probability p_pollution = 0.35 and placed randomly on one of the free cells in the river column.
2. When there is at least one cell in the orchard column that is not occupied by an apple, a new apple is spawned with probability p_apple = 1 − min(1, P / P_threshold), where P_threshold = 3 and P is the total number of dirt cells in the environment. The spawned apple is placed randomly on one of the free cells in the orchard column.
3. When an agent that is not zapped visits a cell with an apple, it harvests the apple and gets a reward of 1. The apple is replaced by an empty cell.
4. When an agent that is not zapped visits a cell with a dirt patch, it cleans the dirt patch and replaces it with an empty cell.
5. An agent that zaps has a p_zap = 0.9 probability of successfully zapping the co-player if the co-player is at most 2 cells away from the agent. If the zapping is successful, the opponent is frozen for t_zap = 5 timesteps, during which it cannot be zapped again.
6. Agents can move around with the {move right, move left, move up, move down} actions.
Initial state: Agents are randomly placed on the grid and unzapped; there are no apples at initialization, and 3 dirt patches are randomly placed in the river column.
Observation: the observation contains full information about the environment. Each agent sees the position of each agent, encoded as a flattened one-hot grid indicating the position in the grid, the full grid as a flattened grid with one-hot objects (apple, dirt, empty), and the state of all agents (zapped or not zapped). The observation is symmetric.
Reward: An agent that picks up an apple receives a reward of r_apple = 1 in that timestep.

B.2 TRAINING DETAILS

Here, we describe the procedure that we use in our experiments to train meta agents against an arbitrary mixture of naive and other meta agents (who are themselves learning). A single parameter, p_naive, indicating the probability of encountering a naive agent, controls the heterogeneity of the pool that a meta agent trains against. If p_naive = 1, the meta agents are trained only against naive opponents, and the training thus corresponds to a pure shaping setting. If p_naive = 0, meta agents are only trained against other meta agents. Given a set of meta agent parameters {φ^i} and a set of naive agent parameters {ψ^i}, a training iteration updates each parameter as follows.

B.2.1 META AGENTS

The meta-agent parameters are updated simultaneously. For each parameter φ^i, the following update is applied:
1. First, a meta batch of opponents is sampled. Each opponent is sampled hierarchically by first determining whether it is a naive opponent (with probability p_naive), and then sampling uniformly from {φ^i} or {ψ^i} accordingly. The sampling is done with replacement, and sampling oneself is disallowed.
2. For each opponent, generate a batch of B trajectories of length TM, where M is the number of inner episodes and T the episode length of the environment.
Crucially, after every T steps the environment terminates and is reset, and, if the opponent is naive, the previous batch of length-T trajectories is used to update its parameters following an RL update rule of choice.
3. For each collected batch of trajectories, the policy gradient of the meta agent parameters is computed following the COALA-PG update rule (or one of the baselines, c.f. Appendix G) if the opponent is naive, and the standard policy gradient otherwise (i.e., the batch and meta-batch dimensions are flattened). Crucially, the done signals from the inner episodes are ignored. The gradient is then averaged, and the parameters are updated.

B.2.2 NAIVE AGENT

The naive agent parameters are used to initialize the naive opponents when training the meta agents, but the resulting trained parameters are discarded. The initialization itself may or may not be updated during training. In a more challenging environment, however, training from scratch until good performance is achieved within a single meta trajectory may require prohibitively many inner episodes. To avoid this, in some of our experiments we set each ψ^i equal to one of the φ^i at each training iteration. This ensures that naive agents are initialized as already capable agents, and is possible thanks to our choice of a common architecture for naive and meta agents (c.f. below). In that case, we say that the naive agents are dynamic. Otherwise, the naive agents are always initialized from a predefined static set of parameters.

B.3 ARCHITECTURE

We choose a Hawk recurrent neural network as the policy and value function backbone (De et al., 2024) for all methods, both for meta and naive agents. First, a linear layer projects the observation into an embedding space of dimension 32. Then a single residual Hawk recurrent block with LRU width 32, MLP expanded width 32 and 2 heads follows. Finally, an RMS normalization layer is applied, after which two linear readouts are applied, one for the value estimate and the other for the policy logits. All meta agent and naive agent parameters are initialized following the standard initialization scheme of Hawk. The final readout layers are, however, initialized to 0.

Table 2: Hyperparameters fixed for the IPD experiments
Hyperparameter              Pure Shaping    Mixed Pool
training iterations         3000            3000
meta batch size             128             128
batch size (B)              16              16
num inner episodes (M)      20              20
inner episode length (T)    10              10
p_naive                     1.0             0.75
population size (meta)      1               4
population size (naive)     10              10
dynamic naive agents        False           False

B.4 HYPERPARAMETERS FOR EACH EXPERIMENT

In all experiments, we first fix the environment hyperparameters. To find suitable hyperparameters for each method, we perform a sweep over the reinforcement learning hyperparameters for each of them, and select the best hyperparameters after averaging over 3 seeds. The final performance and metrics are then computed using 5 fresh seeds. In all our experiments, naive agents update their parameters using the Advantage Actor-Critic (A2C) algorithm, without value bootstrapping, on the batch of length-T trajectories. The naive agent hyperparameters for all experiments can be found in Table 8.

IPD, Figure 5. We perform 2 experiments in the IPD environment: (i) the pure shaping experiment with p_naive = 1, to investigate the shaping capabilities of meta agents, and (ii) the mixed pool setting with p_naive = 0.75, to investigate the collaboration capabilities of meta agents.
For both experimental settings, we show the environment hyperparameters in Table 2. All meta agents are trained with PPO and the Adam optimizer. For each method, we sweep hyperparameters over the ranges specified in Table 3. Table 4 shows the resulting hyperparameters for all methods.

Table 3: The range of values swept over in the hyperparameter search for each method in the IPD environment
RL Hyperparameter          Range
advantage normalization    {False, True}
value discount (γ)         {0.999, 1.0}
gae lambda (λ_gae)         {0.98, 1.0}
learning rate              {0.003, 0.001, 0.0003}

Cleanup, Figures 6 and 7. Likewise, we run the pure shaping (Figure 6) and mixed pool (Figure 7) experiments in the Cleanup-lite environment. For both experimental settings, we show the environment hyperparameters in Table 5. All meta agents are trained with PPO and the Adam optimizer in the pure shaping setting, and with A2C and SGD in the mixed pool setting. For each method, we sweep hyperparameters over the ranges specified in Table 6. Table 7 shows the resulting hyperparameters for all methods.

Table 4: Hyperparameters used for the IPD Shaping and Mixed Pool experiments. Despite the search, the hyperparameters chosen for each method were identical.
RL Hyperparameter          Pure Shaping    Mixed Pool
algorithm                  PPO             PPO
ppo nminibatches           2               2
ppo nepochs                4               4
ppo clipping epsilon       0.2             0.2
value coefficient          0.5             0.5
clip value                 True            True
entropy reg                0               0
advantage normalization    False           False
reward rescaling           0.05            0.05
γ                          1               1
λ_td                       1               1
λ_gae                      1               1
optimizer                  ADAM            ADAM
adam epsilon               0.00001         0.00001
learning rate              0.0003          0.0003
max grad norm              1               1

Table 5: Hyperparameters fixed for the Cleanup experiments
Hyperparameter              Pure Shaping    Mixed Pool
training iterations         3000            30000
meta batch size             512             512
batch size (B)              32              64
num inner episodes (M)      100             5
inner episode length (T)    64              64
p_naive                     1.0             0.75
population size (meta)      1               3
population size (naive)     10              3
dynamic naive agents        False           True

Table 6: The range of values swept over in the hyperparameter search for each method in the Cleanup environment
RL Hyperparameter          Pure Shaping              Mixed Pool
advantage normalization    {False, True}             {False, True}
value discount (γ)         {0.999, 1.0}              {1.0}
learning rate              {0.003, 0.001, 0.0003}    {0.03, 0.01, 0.5, 1.0}
optimizer                  {ADAM}                    {SGD}

Table 7: Hyperparameters used for the Cleanup Shaping and Cleanup Pool experiments.
                           Cleanup Shaping                              Cleanup Pool
RL Hyperparameter          Coala    Batch Unaware  M-FOS    LOLA        Coala    Batch Unaware  M-FOS    LOLA
algorithm                  PPO      PPO            PPO      -           A2C      A2C            A2C      -
ppo nminibatches           2        2              2        -           -        -              -        -
ppo nepochs                4        4              4        -           -        -              -        -
ppo clipping epsilon       0.2      0.2            0.2      -           -        -              -        -
value coefficient          0.5      0.5            0.5      -           -        -              -        -
clip value                 True     True           True     -           True     True           True     True
entropy regularization     0        0              0        0           0        0              0        0
advantage normalization    True     True           True     True        True     True           True     True
γ                          1        1              1        1           1        1              1        1
λ_td                       1        1              1        1           1        1              1        1
reward rescaling           0.1      1              1        1           0.1      0.1            0.1      0.1
λ_gae                      1        1              1        1           1        1              1        1
optimizer                  ADAM     ADAM           ADAM     SGD         SGD      SGD            SGD      SGD
adam epsilon               0.00001  0.00001        0.00001  -           -        -              -        -
learning rate              0.001    0.001          0.003    0.1         0.1      0.03           0.03     0.03
max grad norm              1        1              1        -           1        1              1        1

Table 8: Naive agent hyperparameters used across different settings
RL Hyperparameter          IPD Shaping    IPD Mixed    Cleanup Shaping    Cleanup Mixed
algorithm                  A2C            A2C          A2C                A2C
advantage normalization    True           True         True               True
reward rescaling           0.05           0.05         0.1                0.1
value discount (γ)         0.99           0.99         0.99               1
td lambda (λ_td)           1.0            1.0          1.0                1.0
gae lambda (λ_gae)         1.0            1.0          1.0                1.0
value coefficient          0.5            0.5          0.5                0.5
entropy reg                0.0            0.0          0.0                0.0
optimizer                  ADAM           ADAM         ADAM               SGD
adam epsilon               0.00001        0.00001      0.00001            -
learning rate              0.005          0.005        0.005              1.0
max grad norm              1.0            1.0          1.0                1.0

C THE ANALYTICAL ITERATED PRISONER'S DILEMMA

For the experiments in Sections 4 and 4.3, we analytically compute the discounted expected return of an infinitely iterated prisoner's dilemma, and its parameter gradients. Automatic differentiation then allows us to explicitly backpropagate through the learning trajectory of naive learners, to compute the ground-truth meta update. In the following, we provide details on this approach.

For both the naive learners and the learning-aware meta agents, we consider tabular policies φ^i that take into account the previous action of both agents: φ^i = [φ^i_0, φ^i_1, φ^i_2, φ^i_3, φ^i_4], with σ(φ^i_0) the probability of cooperating in the initial state (with sigmoid σ), and the next 4 parameters the logits of cooperating in states CC, CD, DC and DD, respectively (CD indicates that the first agent cooperated and the second agent defected). As we use a tabular policy for the meta agents, they cannot accurately infer the opponent's parameters from context, but their policy gradient updates still inform them about the learning behavior of naive learners. Hence the meta agent can learn to shape naive learners while using a tabular policy, for example through zero-determinant extortion strategies (Press & Dyson, 2012).

Using both policies, we can construct a Markov matrix providing the transition probabilities from one state to the next, ignoring the initial state:

M = \begin{bmatrix} \sigma(\phi^1_{1:4}) \odot \sigma(\phi^2_{1:4}) \\ \sigma(\phi^1_{1:4}) \odot (1 - \sigma(\phi^2_{1:4})) \\ (1 - \sigma(\phi^1_{1:4})) \odot \sigma(\phi^2_{1:4}) \\ (1 - \sigma(\phi^1_{1:4})) \odot (1 - \sigma(\phi^2_{1:4})) \end{bmatrix}

with \odot the element-wise product. Given the payoff vectors r^1 = [1, −1, 2, 0] and r^2 = [1, 2, −1, 0], and the initial state distribution

s_0 = [\sigma(\phi^1_0)\sigma(\phi^2_0),\; \sigma(\phi^1_0)(1 - \sigma(\phi^2_0)),\; (1 - \sigma(\phi^1_0))\sigma(\phi^2_0),\; (1 - \sigma(\phi^1_0))(1 - \sigma(\phi^2_0))],

we can write the expected discounted return of agent i as

J^i(\phi^1, \phi^2) = \left\langle r^i, \sum_{t=0}^{\infty} \gamma^t M^t s_0 \right\rangle.

This discounted infinite matrix sum is a Neumann series, which equals the inverse (I − γM)^{−1}, with I the identity matrix. This gives us

J^i(\phi^1, \phi^2) = \left\langle r^i, (I - \gamma M)^{-1} s_0 \right\rangle.   (16)

Both M and s_0 depend on the agents' policies, and we can compute the analytical gradients using automatic differentiation (we use JAX). We model naive learners φ^{−i} as taking gradient steps on J^{−i} with learning rate η_naive. The co-player shaping objective for meta agent i is then the cumulative return over the naive learner's learning trajectory, \sum_{m=1}^{M} J^i(\phi^i, \phi^{-i}_m), subject to the naive learning dynamics

\phi^{-i}_{m+1} = \phi^{-i}_m + \Delta\phi^{-i}_m, \quad \Delta\phi^{-i}_m = \eta_{\text{naive}} \nabla_{\phi^{-i}_m} J^{-i}(\phi^i, \phi^{-i}_m).
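The following self-contained JAX sketch (our illustration, not the authors' released code) implements Equation 16 and the shaping gradient obtained by differentiating through the naive learner's updates. The function names, the number of look-ahead steps, and the example hyperparameter values are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

r1 = jnp.array([1., -1., 2., 0.])  # agent-1 payoffs in states CC, CD, DC, DD
r2 = jnp.array([1., 2., -1., 0.])  # agent-2 payoffs
gamma = 0.95

def expected_return(phi1, phi2, r):
    # phi = [initial-state logit, logits for cooperating after CC, CD, DC, DD]
    p1, p2 = jax.nn.sigmoid(phi1), jax.nn.sigmoid(phi2)
    c1, c2 = p1[1:], p2[1:]
    # Markov matrix: rows index the next state (CC, CD, DC, DD), columns the current state
    M = jnp.stack([c1 * c2, c1 * (1 - c2), (1 - c1) * c2, (1 - c1) * (1 - c2)])
    s0 = jnp.array([p1[0] * p2[0], p1[0] * (1 - p2[0]),
                    (1 - p1[0]) * p2[0], (1 - p1[0]) * (1 - p2[0])])
    # <r, (I - gamma M)^{-1} s0>, i.e. Equation 16
    return r @ jnp.linalg.solve(jnp.eye(4) - gamma * M, s0)

def shaping_return(phi1, phi2_init, eta_naive=5.0, num_updates=20):
    # Unroll naive gradient-ascent updates of agent 2 on its own return and
    # accumulate agent 1's return after each update (episode bookkeeping simplified).
    def step(phi2, _):
        phi2_next = phi2 + eta_naive * jax.grad(expected_return, argnums=1)(phi1, phi2, r2)
        return phi2_next, expected_return(phi1, phi2_next, r1)
    _, returns = jax.lax.scan(step, phi2_init, xs=None, length=num_updates)
    return returns.sum()

# Shaping gradient for agent 1: backpropagates through the naive learner's updates.
shaping_grad = jax.grad(shaping_return)(jnp.zeros(5), jnp.zeros(5))
```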
When a learning-aware meta agent faces a naive learner, we compute the shaping gradient by explicitly backpropagating through this shaping objective using automatic differentiation. When a learning-aware meta agent faces another meta agent, we compute the policy gradient as the partial gradient of J^i(φ^i, φ^{−i}): with tabular policies, the meta agents deploy the same policy in each inner episode, and hence averaging over inner episodes is equivalent to playing a single episode of meta versus meta. For training the meta agents, we use a convex mixture of the gradients against naive learners and the gradients against the other meta agent, with mixing factor p_naive. For the gradients against naive learners, we use a batch of randomly initialized naive learners of size metabatch. We use the adamw optimizer from the Optax library to train the meta agents, with default hyperparameters and learning rate η_meta.

For the LOLA experiments of Section 4.3, we compute the ground-truth LOLA-DICE updates (Equation 4) by initializing a naive learner with the opponent's parameters, simulating M naive updates (look-aheads) following the partial derivatives of J^{−i}, and backpropagating through the final return J^i(φ^i, φ^{−i} + \sum_{q=1}^{M} \Delta\phi^{-i}_q), including backpropagating through the learning trajectory. For Fig. 4, we train two separate LOLA agents against each other and report the training curves of the first agent (the training curves of the second agent are similar; data not shown). Using self-play instead of other-play gave similar results with the same main conclusions (data not shown). For all experiments, we used the following hyperparameters: γ = 0.95, η_meta = 0.005, η_naive = 5 (except for 1 look-ahead, where we used η_naive = 10). For Figure 4C, we used a convex mixture of the LOLA-DICE gradient and the partial gradient of J^i(φ^i, φ^{−i}) with mixing factor p_naive. We used p_naive = 1, 1, 0.75, 0.6, 0.4 for look-aheads 1, 2, 3, 10 and 20, respectively.

Figure 8: Histogram of the regression losses after fitting the (χ, φ) parameters to the learned co-player shaping policies from Figure 3A for 64 random seeds, versus fitting the (χ, φ) parameters to 64 uniform random policies.

C.1 ADDITIONAL RESULTS ON THE ANALYTICAL IPD

Learning-aware agents extort naive learners following Zero-Determinant-like extortion strategies. Figure 3A shows that learning-aware agents trained against naive learners find a policy that extorts the naive learners into unfair cooperation. Here, we investigate the resulting extortion policies in more detail, and show that they are similar to the Zero-Determinant (ZD) extortion strategies discovered by Press & Dyson (2012). ZD extortion strategies are parameterized by χ and φ as follows (with (T, R, P, S) = (2, 1, 0, −1) the rewards of the prisoner's dilemma):

p_1 = 1 - \phi(\chi - 1)\frac{R - P}{P - S}, \quad p_2 = 1 - \phi\left(1 + \chi\frac{T - P}{P - S}\right), \quad p_3 = \phi\left(\chi + \frac{T - P}{P - S}\right), \quad p_4 = 0,   (18)

with χ ≥ 1 and 0 < φ ≤ \frac{P - S}{(P - S) + \chi(T - P)}. For χ = 1 and φ = \frac{P - S}{(P - S) + (T - P)} we recover the tit-for-tat strategy, representing the fair shaping strategy, whereas for higher values of χ the resulting policies extort the naive learner into unfair cooperation. Note that Press & Dyson (2012) did not consider a p_0 parameter, as their theory is independent of the choice of p_0. To investigate whether our learned co-player shaping policies are related to ZD extortion strategies, we take the converged policy σ(φ^i) after training with the pure shaping objective (c.f. Figure 3A) and fit the parameters (χ, φ) by minimizing the regression loss ‖σ(φ^i)[1:5] − p_ZD(χ, φ)‖², with p_ZD(χ, φ) the ZD extortion policy of Eq. 18.
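As a quick sanity check of this parameterization (our illustration; variable names are assumptions), the snippet below constructs the ZD extortion policy of Eq. 18 for the payoffs used here and verifies that χ = 1 with φ = (P − S)/((P − S) + (T − P)) yields the tit-for-tat policy [1, 0, 1, 0].

```python
import numpy as np

# Prisoner's dilemma payoffs used in this appendix: (T, R, P, S) = (2, 1, 0, -1)
T_, R_, P_, S_ = 2.0, 1.0, 0.0, -1.0

def zd_extortion_policy(chi, phi):
    # Cooperation probabilities after outcomes CC, CD, DC, DD (Eq. 18)
    p1 = 1 - phi * (chi - 1) * (R_ - P_) / (P_ - S_)
    p2 = 1 - phi * (1 + chi * (T_ - P_) / (P_ - S_))
    p3 = phi * (chi + (T_ - P_) / (P_ - S_))
    return np.array([p1, p2, p3, 0.0])

phi_tft = (P_ - S_) / ((P_ - S_) + (T_ - P_))   # = 1/3 for these payoffs
print(zd_extortion_policy(1.0, phi_tft))         # -> [1. 0. 1. 0.], i.e. tit-for-tat
```

Fitting (χ, φ) to a learned policy, as in Figure 8, then amounts to minimizing the squared distance between σ(φ^i)[1:5] and this vector, e.g. with a standard least-squares routine.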
Figure 8 shows that policies learned with the co-player shaping objective can be well approximated by ZD extortion policies, whereas random policies cannot. The ZD extortion policies of Eq. 18 consider undiscounted infinitely repeated matrix games, whereas we consider a discounted infinitely repeated prisoner's dilemma with discount γ = 0.999. Furthermore, our shaping objective considers the cumulative return over the whole learning trajectory of the naive learner, in contrast to ZD extortion strategies, which are optimized for maximizing the return of the last inner episode. Hence, we should not expect an exact match between the learned policies σ(φ^i) and the ZD extortion strategies.

Mutual unconditional defection is not a Nash equilibrium in the mixed group setting. First, we check numerically whether mutual unconditional defection results in a zero gradient, a necessary condition for being a Nash equilibrium. As a zero probability corresponds to infinite logits, we now parameterize the policy directly in probability space instead of logit space, and consider projected gradient ascent onto the probability simplex, i.e. clipping the updated parameters between 0 and 1. For non-zero mixing factors p_naive, this results in a projected gradient that is 0 everywhere, except for the parameter corresponding to the DC state. For shaping naive learners, it is beneficial to reward co-players that cooperate by also cooperating with non-zero probability afterwards. Hence, when an agent with a pure defection policy plays against naive learners, the resulting gradient will push it out of the pure defection policy.

Figure 9: (A) Average reward during training in a mixed group setting with both agents starting from an unconditional defection policy, when evaluating the learned policy versus naive agents (shaping reward) and versus the other learned policy (other-play reward). (B) Parameter trajectory in logit space of the first agent (the second agent has similar learning trajectories, data not reported). Shaded regions indicate 0.25 and 0.75 quantiles, and solid lines the median over 8 random seeds.

Figure 9A shows that when we train unconditional defection policies in the mixed group setting with the same hyperparameters as for Figure 3C, the agents escape mutual defection and learn to cooperate. Note that the agents quickly learn how to shape naive learners, and that it takes somewhat longer to learn full cooperation, as the shaping objective does not provide pressure to increase the cooperation probability in the starting state. However, as the shaping policies of the meta agents are no longer unconditional defection, playing against other meta agents provides a pressure to increase the cooperation probability, eventually leading to a phase transition towards cooperation. Figure 9B shows the parameter trajectory in logit space over training, confirming that the agents quickly adjust their parameters for shaping, and eventually also the initial-state cooperation probability, leading to cooperation against other meta agents. As these policies are parameterized in logit space, we initialize them at the logit of a cooperation probability of 0.01, instead of exactly zero, to avoid infinities.

Tit for Tat is not a Nash equilibrium in the mixed group setting.
Figure 10 repeats the same analysis, but now starting from tit-for-tat policies, showing that mutual tit for tat is not a Nash equilibrium in our mixed pool setting. As we show in Figure 10C, this is caused by the possibility of shaping naive learners faster by deviating from a strict tit-for-tat policy. Note that even though the resulting policies are not perfect tit for tat, they still fully cooperate when played against each other.

Figure 10: (A) Average reward during training in a mixed group setting with both agents starting from a tit-for-tat policy, when evaluating the learned policy versus naive agents (shaping reward) and versus the other learned policy (other-play reward). (B) Parameter trajectory in logit space of the first agent (the second agent has similar learning trajectories, data not reported). (C) The reward of a main agent when playing against a naive learner over a trajectory of 20 naive learning steps, showing that the policy learned after convergence in the mixed group setting shapes a naive learner faster than a tit-for-tat policy. Shaded regions indicate 0.25 and 0.75 quantiles, and solid lines the median over 8 random seeds.

Theorem D.1. Take the expected shaping return J(\phi^i) = \mathbb{E}_{P_{\phi^i}}\left[ \frac{1}{B} \sum_{b=1}^{B} \sum_{l=0}^{MT} R^{i,b}_l \right], with P_{\phi^i} the distribution induced by the environment dynamics P_t, the initial state distribution P_i and the policy φ^i. Then the policy gradient of this expected return is equal to

\nabla_{\phi^i} J(\phi^i) = \mathbb{E}_{P_{\phi^i}}\left[ \sum_{b=1}^{B} \sum_{l=1}^{MT} \nabla_{\phi^i} \log \pi^i(a^{i,b}_l \mid h^{i,b}_l) \left( \frac{1}{B} \sum_{l'=l}^{m_l T} r^{i,b}_{l'} + \frac{1}{B} \sum_{b'=1}^{B} \sum_{l'=m_l T+1}^{MT} r^{i,b'}_{l'} \right) \right]

Proof. In the co-player shaping batched POMDP there is only one agent that is relevant for the policy gradient, as all other agents are naive learners and subsumed in the environment dynamics. Hence, to avoid overloading notation, we drop the i superscript on the parameters, actions, policy and histories. Furthermore, we use the notation a_l = \{a^b_l\}_{b=1}^{B} and similarly for h_l.

We start by writing down the gradient of J(\phi) = \mathbb{E}_{P_{\phi}}\left[ \frac{1}{B} \sum_{b=1}^{B} \sum_{l=0}^{MT} R^{b}_l \right], making the summations in its expectation explicit:

J(\phi) = \sum_{l=0}^{MT} \sum_{h_l \in \mathcal{H}_l} \sum_{a_l \in \mathcal{A}} \sum_{r_l \in \mathcal{R}} P_{\phi}(h_l)\, \pi(a_l \mid h_l)\, P(r_l \mid h_l, a_l)\, \frac{1}{B} \sum_{b} r^{b}_l

with \mathcal{R} = \bigotimes_{b} \mathcal{R}_b the joint reward space, \mathcal{H}_l the joint space over possible batched histories up until timestep l, and \pi(a_l \mid h_l) = \prod_{b=1}^{B} \pi(a^b_l \mid h^b_l). Applying the chain rule leads to

\nabla_{\phi} J(\phi) = \sum_{l=0}^{MT} \sum_{h_l \in \mathcal{H}_l} \sum_{a_l \in \mathcal{A}} \sum_{r_l \in \mathcal{R}} \Big[ \nabla_{\phi} P_{\phi}(h_l)\, \pi(a_l \mid h_l) + P_{\phi}(h_l)\, \nabla_{\phi} \pi(a_l \mid h_l) \Big] P(r_l \mid h_l, a_l)\, \frac{1}{B} \sum_{b} r^{b}_l,

as the reward dynamics P(r_l \mid h_l, a_l) are independent of the policy parameterization φ. We first investigate the gradient of the marginal distribution \nabla_{\phi} P_{\phi}(h_l), by marginalizing over the joint trajectory distribution:

\nabla_{\phi} P_{\phi}(h_l) = \nabla_{\phi} \sum_{\{a_{l'} \in \mathcal{A},\, h_{l'} \in \mathcal{H}_{l'}\}_{l' < l}}