# Model-Free Opponent Shaping

Chris Lu¹, Timon Willi¹, Christian Schroeder de Witt¹, Jakob Foerster¹

¹Department of Engineering Sciences, University of Oxford, Oxford, United Kingdom. Correspondence to: Chris Lu, Timon Willi.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## Abstract

In general-sum games, the interaction of self-interested learning agents commonly leads to collectively worst-case outcomes, such as defect-defect in the iterated prisoner's dilemma (IPD). To overcome this, some methods, such as Learning with Opponent-Learning Awareness (LOLA), shape their opponents' learning process. However, these methods are myopic since only a small number of steps can be anticipated, are asymmetric since they treat other agents as naive learners, and require the use of higher-order derivatives, which are calculated through white-box access to an opponent's differentiable learning algorithm. To address these issues, we propose Model-Free Opponent Shaping (M-FOS). M-FOS learns in a meta-game in which each meta-step is an episode of the underlying ("inner") game. The meta-state consists of the inner policies, and the meta-policy produces a new inner policy to be used in the next episode. M-FOS then uses generic model-free optimisation methods to learn meta-policies that accomplish long-horizon opponent shaping. Empirically, M-FOS near-optimally exploits naive learners and other, more sophisticated algorithms from the literature. For example, to the best of our knowledge, it is the first method to learn the well-known Zero-Determinant (ZD) extortion strategy in the IPD. In the same settings, M-FOS leads to socially optimal outcomes under meta-self-play. Finally, we show that M-FOS can be scaled to high-dimensional settings.

## 1. Introduction

While much past work in multi-agent reinforcement learning (MARL) has focused on fully-cooperative learning in domains such as Dec-POMDPs (Oliehoek & Amato, 2016) or zero-sum games like StarCraft and Go (Silver et al., 2017; Vinyals et al., 2019), these settings only represent a fraction of potential real-world multi-agent environments. General-sum games, which can be neither fully-cooperative nor fully-competitive, describe many domains such as agent-based modeling, social dilemmas, and systems of interacting self-interested agents like self-driving cars.

Even simple social dilemmas commonly present unique challenges that are not present in single-agent learning (Foerster et al., 2018a). For example, in the IPD (Axelrod & Hamilton, 1981; Harper et al., 2017), learning agents that treat their opponents as static parts of the environment typically converge on unconditional mutual defection, which is the globally worst outcome. To avoid such catastrophic outcomes, Foerster et al. (2018a) introduce LOLA, which takes into account the opponent's learning step in order to shape their policy. In the self-play setting, LOLA was one of the first methods to discover the reciprocating tit-for-tat (TFT) strategy in the IPD.

However, LOLA and related algorithms, such as SOS (Letcher et al., 2019b) and Meta Multi-Agent Policy Gradient (Kim et al., 2021, Meta-MAPG), assume that the opponent is a naive learning (NL) agent, which is often incorrect, e.g. in self-play. Furthermore, to shape their opponents, these methods use second-order derivatives, which are typically high-variance, making learning unstable (Foerster et al., 2018a).
Lastly, they are also myopic: they only shape the opponent's next few learning steps, not their long-term development.

To resolve all of these issues, we introduce Model-Free Opponent Shaping (M-FOS). M-FOS is a general meta-learning algorithm that learns over multiple opponent-learning steps without requiring a model of its opponent's underlying learning algorithm. The core of M-FOS is a meta-game in which each meta-step is an episode of the underlying ("inner") game. The meta-state consists of the inner policies, and the meta-policy produces a new inner policy to be used in the next episode. M-FOS then uses generic model-free optimisation methods, rather than approaches that require higher-order derivatives, to learn meta-policies that accomplish long-horizon opponent shaping. Furthermore, training M-FOS in meta-self-play allows mutual opponent shaping without causing the kind of infinite regress typically caused by ever higher-order learning awareness (Foerster et al., 2018a).

However, since M-FOS is naively model-free, the meta-self-play setting reduces to independent learning, which is highly initialisation-dependent and unstable in general-sum settings. To mitigate this, we introduce a training schedule inspired by Cognitive Hierarchies (CH) (Camerer et al., 2003). With this schedule, M-FOS learns to reciprocate with itself in the meta-game, even achieving higher scores than LOLA in self-play.

For low-dimensional games, M-FOS directly learns policy updates by taking policies as input and outputting the next policy as an action. However, directly inputting and outputting policies does not scale to higher-dimensional games. We introduce a variant of M-FOS that takes past trajectories as inputs to meta-learn across its opponent's learning steps. We then demonstrate that, even in social dilemmas with temporally-extended transition dynamics, M-FOS still manages to shape naive learners and find mutually beneficial solutions in meta-self-play.

In the experiment section, we show that M-FOS can exploit naive learners much better than a set of widely used general-sum learning algorithms (Foerster et al., 2018a; Kim et al., 2021). In the IPD, M-FOS discovers a famous strategy known as ZD extortion (Press & Dyson, 2012) when playing against NL agents. Notably, unlike other algorithms, it does so without access to the opponent's underlying learning algorithm. M-FOS even learns to exploit other general-sum algorithms, such as LOLA.

## 2. Related Work

Opponent Shaping: Several methods recognise that their current actions influence the future policies of learning opponents and take advantage of this to shape an opponent's policy to desirable values. Most of these works assume white-box access to an opponent's learning algorithm and reward in order to take higher-order derivatives through an opponent's update (Foerster et al., 2018a; Letcher et al., 2019a; Kim et al., 2021; Willi et al., 2022). Such updates are also myopic, since anticipating many steps is intractable. In self-play, these methods inconsistently assume that their opponent is a naive learner. M-FOS does not assume white-box access to an opponent's underlying learning algorithm or reward, does not require higher-order derivatives (which are often high-variance), can shape opponents across a large number of updates, and is consistent in self-play.
Opponent Modeling: Much work in MARL has focused on the idea of opponent modeling, in which an agent attempts to model some aspect of the policy of other agents in the environment. This includes explicitly modeling opponent policies (Mealing & Shapiro, 2017), modeling opponent intentions (Raileanu et al., 2018), classifying opponent strategies (Weber & Mateas, 2009; Synnaeve & Bessière, 2011), and modeling an opponent's nested beliefs (Wen et al., 2019). LILI (Xie et al., 2020) models an opponent's high-level latent strategy from local observations with a latent dynamics model rather than explicitly modeling the opponent's policy. Other work (Chakraborty & Stone, 2014) has also considered learning effective policies in the presence of opponents that have memory. However, these methods are not capable of actively shaping their opponents' learning dynamics, and thus they do not address the issue that we address in this paper.

Multi-Agent Meta-Learning: M-FOS is a form of multi-agent meta-learning where the meta-policy is parameterized by a neural network. Existing multi-agent meta-learning methods, such as Meta-Policy Gradient (Meta-PG) (Al-Shedivat et al., 2018), Meta-MAPG (Kim et al., 2021), and Learning to Exploit (L2E) (Wu et al., 2021), instead parameterize the meta-policy using a method similar to that of Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017), in which they learn initial parameters and meta-learn across their own gradient updates. While this type of meta-learning can adapt to any task at test time in single-agent settings (Xiong et al., 2021), in multi-agent settings the calculated gradient may not correspond to a direction of improvement, because the updates of other agents change the underlying dynamics. Rather than being restricted to a gradient update within the episode, M-FOS allows for arbitrary meta-policies that can carry out long-horizon opponent shaping.

## 3. Background

A partially observable stochastic game (Kuhn, 1953, POSG) consists of a tuple $\mathcal{M}^n = \langle I, S, A, \Omega, O, P, R, \gamma \rangle$, where $I = \{1, \dots, n\}$ denotes a set of $n$ agents, $S$ denotes the state space, $A = \times_{i \in I} A^i$ represents the joint action space, $\Omega = \times_{i \in I} \Omega^i$ the joint observation space, $P : S \times A \to \Delta(S)$ denotes the transition probability function, $O : S \times A \times \Omega \to [0, 1]$ is the observation function, $R = \{R^i\}_{i \in I}$ represents the set of reward functions of all agents, and $\gamma \in [0, 1)$ denotes the discount factor. At each timestep $t$, every agent samples an action from its stochastic policy, $a^i_t \sim \pi^i(\cdot \mid o^i_t; \phi^i)$, where the joint action at timestep $t$ is $a_t = (a^i_t, a^{-i}_t)$ and $-i$ stands for all agents except $i$. The policy is parameterized by $\phi^i$. Given the joint action and the current state, each agent receives its respective reward $r^i_t = R^i(s_t, a_t)$. Finally, a new state is sampled, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. Popular special cases of POSGs are fully observable stochastic games, where all agents observe the full state at each time step; single-player (i.e. $I = \{1\}$) partially observable Markov decision processes (POMDPs); and MDPs, where the single player observes the full state at each time step.

Algorithm 1: General M-FOS

1. Initialize M-FOS parameters $\theta$.
2. while true do
3. Initialize agents' parameters $\phi^i_0, \phi^{-i}_0$.
4. for $t = 0$ to $T$ do
5. Reset environment.
6. Gather trajectories $\tau_\phi$ given $\phi^i_t, \phi^{-i}_t$.
7. Update $\phi^{-i}_{t+1}$ according to the respective learning algorithms.
8. Update $\phi^i_{t+1}$ according to the meta-policy $\pi_\theta$.
9. end for
10. Update $\theta$.
11. end while
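For concreteness, the following is a minimal Python sketch of Algorithm 1. All function and object names (`inner_game`, `opponent_update`, `update_meta_policy`, and so on) are illustrative placeholders rather than part of any released implementation, and the observation convention (observing the inner policies played in the previous episode) follows the matrix-game case described later in Section 5.3.

```python
# Minimal sketch of Algorithm 1 (General M-FOS). All names are illustrative
# placeholders, not part of any released implementation.

def train_mfos(inner_game, meta_policy, update_meta_policy, opponent_update,
               init_inner_params, T=100, num_meta_episodes=1000):
    """inner_game.rollout(phi_i, phi_j) is assumed to play one inner episode
    and return (trajectory, return_i, return_j) for the two inner policies."""
    for _ in range(num_meta_episodes):                   # "while true" in Algorithm 1
        phi_i, phi_j = init_inner_params()               # line 3: fresh inner policies
        meta_trajectory = []
        for t in range(T):                               # line 4: one meta-step per inner episode
            obs = (phi_i, phi_j)                         # meta-observation: policies played this episode
            traj, ret_i, ret_j = inner_game.rollout(phi_i, phi_j)   # lines 5-6
            meta_trajectory.append((obs, phi_i, ret_i))  # meta-reward = inner return
            phi_j = opponent_update(phi_j, phi_i, traj)  # line 7: opponent's own learner
            phi_i = meta_policy(obs)                     # line 8: meta-policy picks next inner policy
        update_meta_policy(meta_policy, meta_trajectory) # line 10: e.g. PPO or a genetic algorithm
```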
## 4. Model-Free Opponent Shaping

Typically, opponent-shaping methods are based on MAML-like approaches (Foerster et al., 2018a; Letcher et al., 2019b; Kim et al., 2021) and use higher-order derivatives to directly shape the opponent's parameter update, which requires white-box access to their differentiable learning algorithm. Furthermore, opponent shaping typically creates a conceptual problem: to shape an opponent, an algorithm needs to specify the learning behaviour of the other agents in the environment, e.g. by treating them as naive learners, as is done in LOLA (Foerster et al., 2018a). This leads to a fundamental inconsistency in self-play when two of these agents are training together. Even though they are both opponent shaping, they treat each other as naive learners, which can lead to undesired outcomes (Letcher et al., 2019a). Lastly, most opponent-shaping methods only shape the next few learning steps instead of considering longer horizons.

Opponent shaping can be formulated as a meta-game, in which the meta-state consists of the policies of all agents, a meta-step is an inner episode, the reward is the inner return, and the meta-action is choosing the next inner policy, where "inner" refers to the underlying game. The key insight underlying Model-Free Opponent Shaping (M-FOS) is that we can resolve all of the issues above by directly training meta-policies using model-free optimisation methods that are appropriate for sequential settings, rather than relying on MAML-like approaches.

We formally construct the meta-game as a POMDP $\langle S, A, \Omega, O, P, R, \gamma \rangle$ over an underlying POSG $\mathcal{M}^n$. The meta-game is partially observable because we do not assume full access to the opponents' parameters. The M-FOS meta-agent controls agent $i \in I$ in the underlying POSG $\mathcal{M}^n$. The state space $S$ of the meta-game consists of the policy parameters of the agents in the underlying POSG, $s_t = (\phi^i_{t-1}, \phi^{-i}_{t-1}) \in S$. The meta-agent's action space consists of agent $i$'s policy, for example outputting a conditioning vector or setting agent $i$'s policy parameters directly, $a_t = \phi^i_t \sim \pi_\theta(\cdot \mid o_t)$. Here the meta-policy is parameterized by $\theta$. The meta-agent receives observation $o_t \in \Omega$ with probability $O(o_t \mid s_t, a_t)$. After each inner episode (i.e. at each meta-step), the scalar reward is $r_t = \sum_{k=0}^{K} r^i_k(\phi^i_t, \phi^{-i}_t)$, where $K$ is the length of the inner episode; in other words, the reward in the meta-game at each step is the inner return. Finally, a new meta-state is sampled from a stochastic transition probability function, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. $P(\cdot \mid s_t, a_t)$ is stochastic since, in general, the update function of any agent can be stochastic, $\phi^j_{t+1} \sim h(\cdot \mid \phi^j_t)$, for example when agent $j$ updates its parameters with policy gradients. Consequently, the meta-trajectory is denoted as $\tau_\theta := (o_0, a_0, r_0, \dots, r_T)$, where $T$ is the length of the meta-episode. We train the meta-policy to maximise the expected return per meta-episode, $J = \sum_{t=0}^{T} r^i_t(\phi^i_t, \phi^{-i}_t)$.

Crucially, rather than relying on higher-order derivatives, M-FOS uses model-free optimisation methods to directly train a meta-policy. In Section 6 we show that PPO (Schulman et al., 2017; Barhate, 2021) and Genetic Algorithms (Such et al., 2017) work well in this general meta-learning framework.

### 4.1. M-FOS Self-Play

By doing model-free optimisation in the meta-game, we no longer require higher-order derivatives and can also learn strategies that engage in long-horizon opponent shaping. Next, we also address the issue of symmetry and consistency by introducing meta-self-play.
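As an illustration of the first point, note that optimising $J$ with a generic policy-gradient method never differentiates through the opponents' learning rules. For example, a plain REINFORCE-style estimator (shown only for exposition; the experiments use PPO and a genetic algorithm rather than this estimator) is

$$\nabla_\theta J \;=\; \mathbb{E}_{\tau_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid o_t)\, G_t\right], \qquad G_t = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r^i_{t'},$$

where the opponents' (possibly unknown) update rules $h$ enter only through the sampled rewards and observations, so no higher-order derivatives appear.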
When using MAML-like approaches for opponent shaping, attempts at consistent self-play lead to infinite recursions, since each agent differentiates through the learning step of the other agent, and so on. In contrast, since M-FOS is entirely model-free, meta-self-play between two M-FOS agents simply corresponds to learning in a general-sum game, where model-free methods can be applied without causing infinite regress.

One challenge is that independent learning in general-sum settings is highly initialisation-dependent and unstable, which is undesirable for a principled method. Furthermore, in general-sum games there are often multiple Nash equilibria, which implies that a stricter way to select equilibria would be desirable. One such way to select ideal equilibria is the tracing procedure (Harsanyi et al., 1988). The tracing procedure is based on the idea of each agent having a (common) initial Bayesian prior over how a rational agent would behave in a general-sum game. The agents then repeatedly update their policies against each other until convergence, initialising from this prior. The exact procedure is impractical with high-dimensional function approximation, but we use it as inspiration to create a similar, but more tractable, self-play scheme. This is implemented via a parameter λ that corresponds to the probability of an M-FOS agent being paired with a naive learner rather than another M-FOS agent. Setting λ = 1 at the beginning of training grounds the training to an approximate best response to NL, while annealing it to λ = 0 allows us to gradually transition to self-play over the course of training. We anneal λ slowly enough that the M-FOS agents are always playing near-optimally for the given distribution.

## 5. Experimental Setup

### 5.1. Environments

IPD: The prisoner's dilemma is one of the most widely-studied and important general-sum games, with applications in evolutionary biology, economics, politics, sociology, and other fields (Rapoport et al., 1965). In the prisoner's dilemma, agents can choose to cooperate (C) or defect (D) against each other, with the resulting payoffs presented in Table 1.

Table 1. Payoff Matrix for the Prisoner's Dilemma.

|   | C | D |
|---|---|---|
| C | (-1, -1) | (-3, 0) |
| D | (0, -3) | (-2, -2) |

A common extension of the prisoner's dilemma is the IPD, in which the prisoner's dilemma is played repeatedly, with players able to observe their opponent's past decisions. Axelrod (Axelrod & Hamilton, 1981) famously held an IPD tournament in which a strategy known as TFT, where a player copies the other player's last move, was popularized. Despite decades of previous study of the IPD, Press & Dyson (2012) made a surprising mathematical discovery that dramatically changed our understanding of the game: there exist fixed policies, called ZD extortion strategies, that dominate any learning opponent. More specifically, ZD extortion enforces a linear relationship between the two agents' rewards that disproportionately benefits the extortioner (see Figure 4). However, it is still in a learning agent's best interest to cooperate against extortion, despite the fact that it benefits the extortioner more. In principle, an agent could overcome the extortion by being "meta"-aware that the opponent can change its policy, and by punishing an extorting opponent.

Iterated Matching Pennies: Iterated Matching Pennies (IMP) is an iterated matrix game like the IPD, but it is zero-sum. In IMP, agents can play Heads or Tails and receive payoffs according to Table 2.

Table 2. Payoff Matrix for Matching Pennies.

|   | H | T |
|---|---|---|
| H | (+1, -1) | (-1, +1) |
| T | (-1, +1) | (+1, -1) |
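Both the IPD and IMP are iterated matrix games whose exact discounted return can be computed in closed form when both players use memory-one policies, which is what makes the exact-gradient "Matrix Game Setup" below possible. The following is a minimal, illustrative sketch for the IPD payoffs of Table 1. The five-parameter memory-one parameterization and the (1 − γ) normalisation (chosen so that mutual cooperation scores −1) are our assumptions for illustration and may differ from the paper's exact implementation.

```python
# Illustrative sketch: exact normalised discounted return of the IPD for two
# memory-one policies. The parameterization and (1 - gamma) normalisation are
# assumptions chosen so that mutual cooperation yields -1.
import numpy as np

def ipd_value(p1, p2, gamma=0.96):
    """p1, p2: length-5 arrays of cooperation probabilities
    [initial, after CC, after CD, after DC, after DD], each given from that
    agent's own perspective (its own previous action listed first).
    Returns (V1, V2), the normalised discounted returns of both agents."""
    r1 = np.array([-1.0, -3.0, 0.0, -2.0])  # agent 1's payoff in states CC, CD, DC, DD
    r2 = np.array([-1.0, 0.0, -3.0, -2.0])  # agent 2's payoff in the same states
    a = p1[1:]                 # agent 1's P(cooperate) in each joint state
    b = p2[[1, 3, 2, 4]]       # agent 2's P(cooperate), CD/DC swapped to its own view
    # Distribution over the first joint action.
    s0 = np.array([p1[0] * p2[0], p1[0] * (1 - p2[0]),
                   (1 - p1[0]) * p2[0], (1 - p1[0]) * (1 - p2[0])])
    # Markov transition matrix over joint actions (CC, CD, DC, DD).
    M = np.stack([a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)], axis=1)
    # Normalised discounted state visitation.
    d = (1 - gamma) * s0 @ np.linalg.inv(np.eye(4) - gamma * M)
    return d @ r1, d @ r2

# Example: tit-for-tat against tit-for-tat (cooperate first, then copy the
# opponent's last move) yields mutual cooperation, i.e. roughly (-1, -1).
tft = np.array([1.0, 1.0, 0.0, 1.0, 0.0])
print(ipd_value(tft, tft))
```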
Chicken Game: The Chicken Game is a stochastic matrix game. Agents can either Swerve (C) or head Straight (D). While agents can gain a small reward by heading straight against a swerving opponent, they incur a large negative cost if they both head straight. It is often used in political science and economics to describe brinksmanship scenarios in which there is a threat of mutually assured destruction (Rapoport & Chammah, 1966).

Table 3. Payoff Matrix for the Chicken Game.

|   | C | D |
|---|---|---|
| C | (0, 0) | (-1, +1) |
| D | (+1, -1) | (-100, -100) |

Matrix Game Setup: In this paper, we directly calculate the value function rather than repeatedly sampling actions. We initialize all policies randomly (except for M-MAML) by taking the sigmoid of samples from the standard normal distribution. We then calculate the value functions for both policies and update them for T = 100 steps.

Coin Game: The Coin Game is a multi-agent grid-world environment that simulates social dilemmas like the IPD, but with high-dimensional, dynamic states. First proposed by Lerer & Peysakhovich (2017), the game consists of two players, labeled red and blue respectively, who are tasked with picking up coins, also labeled red and blue respectively, in a 3x3 grid. If a player picks up any coin by moving into the same position as the coin, they receive a reward of +1. However, if they pick up a coin of the other player's color, the other player receives a reward of -2. Thus, if both agents play greedily and pick up every coin, the expected reward for both agents is 0.

Figure 1. Illustration of the Coin Game.

### 5.2. Baseline Comparisons

Naive Learning (NL): Naive learners assume that other agents are part of the environment and are static between episodes. Thus, between each episode, naive learners perform the following update with learning rate $\alpha$:

$$\phi^i_{t+1} = \phi^i_t + \alpha \nabla_{\phi^i_t} R^i(\phi^i_t, \phi^{-i}_t)$$

In reinforcement learning, this is often approximated with a sample-based approach. In our experiments in the Coin Game, the NL uses PPO (Schulman et al., 2017), which modifies this by clipping the update. In the matrix games, we can directly perform gradient ascent without sampling because the exact value $R^i$ is differentiable.

Learning with Opponent-Learning Awareness (LOLA): LOLA assumes that other agents are naive learners that perform the gradient step described above. LOLA takes a gradient through the opponent's update function to shape the opponent:

$$\phi^i_{t+1} = \phi^i_t + \alpha^i \nabla_{\phi^i_t} R^i\!\left(\phi^i_t,\; \phi^{-i}_t + \Delta\phi^{-i}_t\right), \qquad \Delta\phi^{-i}_t = \alpha^{-i} \nabla_{\phi^{-i}_t} R^{-i}(\phi^i_t, \phi^{-i}_t) \qquad (1)$$

Multiagent Model-Agnostic Meta-Learning: We introduce a new baseline, Multiagent MAML (M-MAML), which is inspired by Meta-Multiagent Policy Gradient (Kim et al., 2021, Meta-MAPG). Meta-MAPG and M-MAML operate in a similar setting to M-FOS in that they meta-learn over multiple opponent learning updates. However, instead of learning an update function, they learn initial parameters. They then meta-learn over their own gradient updates (much like MAML (Finn et al., 2017)) as well as the gradient updates of their opponents. Meta-MAPG and M-MAML optimize the following:

$$\max_{\phi^i_0} \; \mathbb{E}_{p(\phi^{-i}_0)}\!\left[\sum_{t=0}^{T} R^i(\phi^i_t, \phi^{-i}_t)\right], \qquad \phi^i_{t+1} = \phi^i_t + \alpha^i \nabla_{\phi^i_t} R^i(\phi^i_t, \phi^{-i}_t), \qquad \phi^{-i}_{t+1} = \phi^{-i}_t + \alpha^{-i} \nabla_{\phi^{-i}_t} R^{-i}(\phi^i_t, \phi^{-i}_t) \qquad (2)$$

That is, these methods only optimize the initial policy parameters, assuming that all agents are naive learners. Meta-MAPG expands the objective into multiple learning terms to perform policy-gradient updates. However, we do not directly compare to Meta-MAPG because it only scales to T = 7 meta-steps in the IPD, not T = 100. Instead, we use the exact value function and exact gradients, allowing our baseline (M-MAML) to scale to meta-episodes consisting of 100 inner episodes.
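To make the difference between the NL and LOLA updates concrete, here is a small self-contained PyTorch sketch applying both to a toy differentiable general-sum return. The toy return functions and step sizes are placeholders introduced purely for illustration; in the paper's matrix games, $R^i$ would be the exact value function instead.

```python
# Illustrative sketch of the naive-learner (NL) and LOLA updates from
# Section 5.2, applied to a toy differentiable general-sum return.
# The toy return functions and step sizes below are placeholders.
import torch

def R1(phi1, phi2):
    # Toy return for agent 1 (any differentiable surrogate works here).
    return -(phi1 - 1.0) ** 2 + 0.5 * phi1 * phi2

def R2(phi1, phi2):
    # Toy return for agent 2.
    return -(phi2 + 1.0) ** 2 + 0.5 * phi1 * phi2

def naive_step(phi1, phi2, alpha=0.1):
    # NL: gradient ascent on the agent's own return, opponent held fixed.
    grad1, = torch.autograd.grad(R1(phi1, phi2), phi1)
    return (phi1 + alpha * grad1).detach().requires_grad_()

def lola_step(phi1, phi2, alpha1=0.1, alpha2=0.1):
    # LOLA (Eq. 1): anticipate the opponent's naive update and differentiate
    # through it; this is where the higher-order derivative appears.
    grad2, = torch.autograd.grad(R2(phi1, phi2), phi2, create_graph=True)
    shaped = R1(phi1, phi2 + alpha2 * grad2)
    grad1, = torch.autograd.grad(shaped, phi1)
    return (phi1 + alpha1 * grad1).detach().requires_grad_()

phi1 = torch.tensor(0.0, requires_grad=True)
phi2 = torch.tensor(0.0, requires_grad=True)
print(naive_step(phi1, phi2), lola_step(phi1, phi2))
```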
### 5.3. M-FOS Implementation Details

Although M-FOS can be applied to any POSG, different settings allow for very different architectures. Below we describe the architectures we use for the basic matrix games and the higher-dimensional Coin Game.

Matrix Games: In the matrix game environments, we allow M-FOS to observe the full state, which is the concatenation of the policies played in the last timestep, $o_t = s_t = (\phi^i_{t-1}, \phi^{-i}_{t-1})$. Because all of our evaluated opponents (including M-FOS itself) only make updates according to the current state, this turns the induced POMDP into an MDP. Because the inner policy can be fully expressed with very few parameters, we can directly output the parameters, turning the MDP into a basic continuous control problem. Because of this, we model the M-FOS meta-agent as a simple feed-forward neural network parameterized by $\theta$ that takes in the state and outputs a distribution over the next policy:

$$\max_{\theta} \; \mathbb{E}_{p(\phi^{-i}_0, \phi^i_0)}\!\left[\sum_{t=0}^{T} R^i(\phi^i_t, \phi^{-i}_t)\right], \qquad \phi^i_{t+1} \sim \pi_\theta(\cdot \mid \phi^i_t, \phi^{-i}_t), \qquad \phi^{-i}_{t+1} = f(\phi^i_t, \phi^{-i}_t) \qquad (3)$$

This can be seen as related to hypernetwork meta-learners, since it directly outputs the weights of another (very simple) model (Zhmoginov et al., 2022). We optimize the meta-policy using both Genetic Algorithms (Such et al., 2017) and PPO (Schulman et al., 2017; Barhate, 2021), and report the best of the two. A detailed breakdown of the performance of each can be found in Appendix A.

M-FOS in the Coin Game: Here, M-FOS does not directly observe the opponent's policy parameters but only the effects of their past actions. The opponent is parameterized by a convolutional neural network and, as a naive learner, is trained using PPO. M-FOS's inner policy is parameterized by a convolutional recurrent neural network that takes in an observation as input along with a conditioning vector from the meta-policy. We require the inner policy to be recurrent in order to respond to and shape the opponent's policy. The hidden state of the recurrent neural network is reset each episode. M-FOS's meta-policy is parameterized by a convolutional recurrent neural network that processes the batch of trajectories from the last episode and outputs a conditioning vector, which is used in the next episode. Using PPO, the inner policy and the meta-policy parameters are trained end-to-end to maximise the expected discounted meta-return.
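For concreteness, below is a rough PyTorch sketch of one possible reading of the Coin Game inner policy, using the layer sizes from Table 16 in Appendix C. The observation shape (four 3x3 channels), the conditioning-vector size, and the exact wiring are our assumptions, not the released architecture; the meta-policy would be a similar convolutional recurrent network that consumes the previous episode's trajectories and emits the conditioning vector.

```python
# Rough, illustrative sketch of the Coin Game inner policy described in
# Section 5.3, using sizes from Table 16. Observation shape, conditioning
# dimension, and wiring are assumptions made for illustration only.
import torch
import torch.nn as nn

class InnerPolicy(nn.Module):
    def __init__(self, obs_channels=4, grid=3, hidden=16, cond_dim=16, n_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(obs_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * grid * grid, hidden), nn.ReLU(),
        )
        # The GRU state is reset at the start of every inner episode; the
        # conditioning vector comes from the meta-policy and is fixed within
        # an episode.
        self.gru = nn.GRUCell(hidden + cond_dim, hidden)
        self.logits = nn.Linear(hidden, n_actions)

    def forward(self, obs, cond, h):
        # obs: (B, C, 3, 3); cond: (B, cond_dim); h: (B, hidden)
        x = torch.cat([self.encoder(obs), cond], dim=-1)
        h = self.gru(x, h)
        return self.logits(h), h

policy = InnerPolicy()
obs = torch.zeros(8, 4, 3, 3)
cond = torch.zeros(8, 16)
h = torch.zeros(8, 16)
logits, h = policy(obs, cond, h)
print(logits.shape)  # (8, 4)
```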
### 6.1. Iterated Prisoner's Dilemma

In a round-robin tournament in which algorithms train against each other in a head-to-head matchup, M-FOS vastly outperforms all other learning methods in the IPD. Notably, it is the only algorithm to achieve scores better than mutual cooperation (-1), and it does so against all opponents, excluding itself. Similarly, it is the only algorithm for which one of its opponents performs worse than mutual defection (-2), and it does so against both naive learners and LOLA.

Figure 2. Visualisations of a run of a meta-episode of each learner against M-FOS. Notice how the opponents' policies are shaped into cooperating, resulting in state visitations that are beneficial to the M-FOS agent.

Figure 3. Visualisations of M-FOS shaping a naive learner. The area denoted by the black lines represents the episode's possible rewards. The blue points represent the possible payoffs of a naive learner against the M-FOS policy at that timestep. (a)-(b) M-FOS begins by playing TFT until the opponent is sufficiently cooperative. (c)-(d) M-FOS then repeatedly switches between an extortion-like policy (c) and a defecting policy (d), making the NL oscillate.

Table 4. Head-to-head rewards of each learning algorithm in the Iterated Prisoner's Dilemma.

|        | M-FOS | NL    | LOLA  | M-MAML |
|--------|-------|-------|-------|--------|
| M-FOS  | -1.01 | -0.51 | -0.73 | -0.67  |
| NL     | -2.14 | -1.98 | -1.52 | -1.28  |
| LOLA   | -2.09 | -1.30 | -1.09 | -1.04  |
| M-MAML | -1.86 | -1.25 | -1.15 | -1.17  |

M-FOS v. Naive Learner: Against an NL agent, M-FOS gets an average score of -0.51, while the NL agent gets an average score of -2.14. This is a far more advantageous result than LOLA achieves (-1.30/-1.52), even though LOLA has a perfect learning model of its opponent and can take the derivative through its update step and the environment. We suspect that this happens because LOLA is a myopic one-step learner, whereas M-FOS considers discounted returns far in the future. This can be observed in Figure 2. Also, note that the NL agent achieves a total score lower than -2. This is a lower score than ZD extortion can theoretically make its opponent achieve, since blind defection at worst achieves a score of -2. Figure 3 shows that M-FOS takes advantage of the fact that the NL agent's gradient updates do not find the optimal response policy in a single step.

M-FOS v. Look-Ahead Best Response: To demonstrate the above point, we train M-FOS against a variant of a naive learner that can observe its opponent's next policy and then plays the best response to it (calculated by performing a thousand steps of gradient ascent). Despite the game being symmetric, M-FOS extorts this Look-Ahead Best Response (LABR) agent, achieving an average score of -0.71. Figure 4 shows that the policy M-FOS outputs approximates ZD extortion. To the best of our knowledge, M-FOS is the first learning algorithm to discover ZD extortion.

Figure 4. Visualisation of M-FOS v. Look-Ahead Best Response in the IPD. Note that the payoff between the two agents is near-linear and favors the M-FOS agent, indicating ZD extortion.

M-FOS v. LOLA: In Foerster et al. (2018a), the authors write that 2nd-order LOLA, which is an agent that takes the derivative through the opponent's LOLA update, does not achieve any incremental gains against an opposing LOLA agent. In other words, a LOLA agent achieves a better score against another LOLA agent than a 2nd-order LOLA agent would, implying that it is difficult to exploit LOLA. However, M-FOS manages to find a dominating strategy against LOLA (-0.73 / -2.09). To the best of our knowledge, M-FOS is the first learning algorithm to exploit the LOLA update.

M-FOS v. M-MAML: M-MAML seems to have generally learned to initialize with values close to TFT (see Appendix B). This initialisation allows it to achieve favorable results against all algorithms except M-FOS (-0.67 / -1.86), which learns to exploit it, as shown in Figure 2.

M-FOS v. M-FOS: We arrive at a cooperative score when M-FOS is trained against other M-FOS agents using the meta-self-play training scheme from above. When viewing the final policies played against each other, we observe that M-FOS has largely arrived at TFT, as seen in Figure 5. To the best of our knowledge, M-FOS is the first learning algorithm to arrive at TFT in the IPD against itself without using higher-order derivatives, access to the opponent's rewards, or specific hand-coding of TFT-like behaviour.

Figure 5. Visualisation of 32 final episode policies in M-FOS v. M-FOS in the IPD.
### 6.2. Other Matrix Games

IMP: In IMP, M-FOS once again outperforms the other baseline methods. In particular, by examining how M-FOS exploits a naive learner compared to how LOLA does so, we observe that LOLA is myopic compared to M-FOS. In Figure 6, LOLA gradually approaches the Nash equilibrium against a naive learner in order to avoid being exploited by its opponent. In contrast, M-FOS cyclically shapes the naive learner's policy to continuously exploit it while staying one step ahead.

Figure 6. Visualisation of M-FOS's long-term shaping and LOLA's myopic strategy in the Iterated Matching Pennies environment. Note how LOLA converges to the Nash equilibrium, resulting in zero reward for both agents, while M-FOS continually drags the naive learner's policy to exploitable states.

Table 5. Head-to-head results of each learning algorithm in Iterated Matching Pennies.

|        | M-FOS | NL   | LOLA  | M-MAML |
|--------|-------|------|-------|--------|
| M-FOS  | 0.0   | 0.20 | 0.19  | 0.22   |
| NL     | -0.20 | 0.0  | -0.02 | -0.01  |
| LOLA   | -0.19 | 0.02 | 0.0   | 0.02   |
| M-MAML | -0.22 | 0.01 | -0.02 | 0.0    |

Chicken Game: M-FOS performs well against all baselines in the head-to-head in the Chicken Game but achieves a lower score against LOLA. Interestingly, LOLA tends to behave in an extortionary manner in the Chicken Game. After one update against most random policies, it attempts to shape the opponent by heading straight (i.e. defecting) with high probability. While this extortionate behavior works against most learning opponents, it leads to catastrophic results in self-play (-85.96). This suggests that LOLA will continue to head straight whether its opponent swerves or heads straight. Because M-MAML selects its initial policy, it can shape LOLA from the first time step, preventing LOLA from immediately heading straight after its first update. M-FOS, in contrast, is by default forced to a random initialisation. However, if we allow M-FOS to also learn an initial policy, it achieves a much higher score against LOLA (0.5), far outperforming M-MAML (-0.42).

Table 6. Head-to-head results of each learning algorithm in the Chicken Game. The results of an M-FOS meta-policy that also learns an initial policy are shown in brackets.

|        | M-FOS       | NL   | LOLA        | M-MAML |
|--------|-------------|------|-------------|--------|
| M-FOS  | -0.01       | 0.97 | -0.94 [0.5] | 0.86   |
| NL     | -1.03       | -0.0 | -0.97       | -0.27  |
| LOLA   | 0.87 [-1.5] | 0.94 | -85.96      | 0.40   |
| M-MAML | -1.08       | 0.27 | -0.42       | -0.15  |

### 6.3. Coin Game

Prior work (Yu et al., 2021) has shown that LOLA-DiCE (Foerster et al., 2018b) and Meta-MAPG (Kim et al., 2021) do not achieve significant results in a simplified version of the Coin Game with a fully cooperative reward. Because of this, we do not compare to these baselines. We also observe that M-FOS outperforms PPO in head-to-head training while still achieving good performance in self-play. Meanwhile, PPO agents, when trained together, pick up each other's coins indiscriminately, leading to 0 expected reward.

Table 7. Head-to-head results of M-FOS and PPO in the Coin Game.

|       | M-FOS  | PPO   |
|-------|--------|-------|
| M-FOS | 20.56  | 44.26 |
| PPO   | -24.62 | 4.25  |

Figure 7. Probability of the PPO agent picking up its own coin across the inner episodes. Note that it is shaped into picking up more of its own coins against the M-FOS agent.

## 7. Conclusion & Future Work

In this paper, we presented Model-Free Opponent Shaping (M-FOS) as a simple model-free alternative to popular MAML-like opponent-shaping methods, such as LOLA and Meta-MAPG. Although M-FOS does not use higher-order derivatives and does not have white-box access to its opponent's learning model, it vastly outperforms all tested baselines across several matrix games.

More specifically, in the IPD, M-FOS achieves several notable results.
First, to the best of our knowledge, it is the first learning algorithm to discover ZD extortion, the first learning algorithm that exploits LOLA, and the first learning algorithm to achieve cooperation in self-play without using higher-order derivatives or inconsistent models. Furthermore, it achieves a score higher than mutual cooperation against all tested opponents, while none of the baselines could do so against any single opponent. We also show that M-FOS can scale to more complex, high-dimensional games and achieve similar results.

In the future, we could generalize M-FOS beyond social dilemmas. For example, M-FOS could shape financial instruments, trading models, recommendation systems, and any system trained on real-world data.

## Acknowledgements

Christian Schroeder de Witt is sponsored by the Cooperative AI Foundation. Compute for this project was partially run on Oxford's Advanced Research Cluster (ARC). This work used the Cirrus UK National Tier-2 HPC Service at EPCC, funded by the University of Edinburgh and EPSRC (EP/P020267/1). This work was also supported by an Oracle for Research Cloud Grant (19158657).

## References

Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via meta-learning in nonstationary and competitive environments. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=Sk2u1g-0-.

Axelrod, R. and Hamilton, W. D. The evolution of cooperation. Science, 211(4489):1390-1396, 1981.

Barhate, N. Minimal PyTorch implementation of proximal policy optimization. https://github.com/nikhilbarhate99/PPO-PyTorch, 2021.

Camerer, C., Ho, T., and Chong, J.-K. A cognitive hierarchy theory of one-shot games and experimental analysis. SSRN Electronic Journal, 09 2003. doi: 10.2139/ssrn.411061.

Chakraborty, D. and Stone, P. Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-Agent Systems, 28(2):182-213, 2014.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1126-1135, 2017.

Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 122-130, 2018a.

Foerster, J. N., Farquhar, G., Al-Shedivat, M., Rocktäschel, T., Xing, E. P., and Whiteson, S. DiCE: The infinitely differentiable Monte Carlo estimator. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 1524-1533. PMLR, 2018b. URL http://proceedings.mlr.press/v80/foerster18a.html.

Harper, M., Knight, V., Jones, M., Koutsovoulos, G., Glynatsi, N. E., and Campbell, O. Reinforcement learning produces dominant strategies for the iterated prisoner's dilemma. PLOS ONE, 12(12):e0188046, 2017.

Harsanyi, J. C., Selten, R., et al. A general theory of equilibrium selection in games. MIT Press Books, 1, 1988.
Kim, D., Liu, M., Riemer, M., Sun, C., Abdulhai, M., Habibi, G., Lopez-Cot, S., Tesauro, G., and How, J. P. A policy gradient algorithm for learning to learn in multiagent reinforcement learning. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 5541-5550, 2021.

Kuhn, H. W. Extensive games and the problem of information. Princeton University Press, Princeton, NJ, 1953.

Lerer, A. and Peysakhovich, A. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. CoRR, abs/1707.01068, 2017. URL http://arxiv.org/abs/1707.01068.

Letcher, A., Balduzzi, D., Racanière, S., Martens, J., Foerster, J. N., Tuyls, K., and Graepel, T. Differentiable game mechanics. Journal of Machine Learning Research, 20:84:1-84:40, 2019a.

Letcher, A., Foerster, J. N., Balduzzi, D., Rocktäschel, T., and Whiteson, S. Stable opponent shaping in differentiable games. In 7th International Conference on Learning Representations, 2019b.

Mealing, R. and Shapiro, J. L. Opponent modeling by expectation-maximization and sequence prediction in simplified poker. IEEE Transactions on Computational Intelligence and AI in Games, 9(1):11-24, 2017.

Oliehoek, F. A. and Amato, C. A Concise Introduction to Decentralized POMDPs. SpringerBriefs in Intelligent Systems. Springer International Publishing, 2016. ISBN 978-3-319-28927-4. doi: 10.1007/978-3-319-28929-8. URL https://www.springer.com/gp/book/9783319289274.

Press, W. H. and Dyson, F. J. Iterated Prisoner's Dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences, 109(26):10409-10413, 2012. ISSN 0027-8424.

Raileanu, R., Denton, E., Szlam, A., and Fergus, R. Modeling others using oneself in multi-agent reinforcement learning. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 4254-4263. PMLR, 2018. URL http://proceedings.mlr.press/v80/raileanu18a.html.

Rapoport, A. and Chammah, A. M. The game of chicken. American Behavioral Scientist, 10(3):10-28, 1966. doi: 10.1177/000276426601000303. URL https://doi.org/10.1177/000276426601000303.

Rapoport, A., Chammah, A. M., and Orwant, C. J. Prisoner's dilemma: A study in conflict and cooperation, volume 165. University of Michigan Press, 1965.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T. P., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017.

Such, F. P., Madhavan, V., Conti, E., Lehman, J., Stanley, K. O., and Clune, J. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.

Synnaeve, G. and Bessière, P. A Bayesian model for opening prediction in RTS games with application to StarCraft. In IEEE Conference on Computational Intelligence and Games, pp. 281-288, 2011.
Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. L., Gülçehre, Ç., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T. P., Kavukcuoglu, K., Hassabis, D., Apps, C., and Silver, D. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350-354, 2019.

Weber, B. G. and Mateas, M. A data mining approach to strategy prediction. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Games, pp. 140-147, 2009.

Wen, Y., Yang, Y., Luo, R., Wang, J., and Pan, W. Probabilistic recursive reasoning for multi-agent reinforcement learning. In 7th International Conference on Learning Representations, 2019.

Willi, T., Treutlein, J., Letcher, A., and Foerster, J. COLA: Consistent learning with opponent-learning awareness. 2022.

Wu, Z., Li, K., Zhao, E., Xu, H., Zhang, M., Fu, H., An, B., and Xing, J. L2E: Learning to exploit your opponent, 2021.

Xie, A., Losey, D. P., Tolsma, R., Finn, C., and Sadigh, D. Learning latent representations to influence multi-agent interaction. In 4th Conference on Robot Learning, volume 155 of Proceedings of Machine Learning Research, pp. 575-588, 2020.

Xiong, Z., Zintgraf, L. M., Beck, J., Vuorio, R., and Whiteson, S. On the practical consistency of meta-reinforcement learning algorithms. CoRR, abs/2112.00478, 2021. URL https://arxiv.org/abs/2112.00478.

Yu, X., Jiang, J., Jiang, H., and Lu, Z. Model-based opponent modeling. arXiv preprint arXiv:2108.01843, 2021.

Zhmoginov, A., Sandler, M., and Vladymyrov, M. HyperTransformer: Model generation for supervised and semi-supervised few-shot learning. arXiv preprint arXiv:2201.04182, 2022.

## A. Detailed Results

Each experiment is run 10 times. The inner batch size of each experiment for the matrix games is 4096.

Table 8. Head-to-head results of each learning algorithm in the IPD, results reported for M-FOS PPO.

|        | M-FOS | NL    | LOLA  | M-MAML |
|--------|-------|-------|-------|--------|
| M-FOS  | -1.01 | -0.51 | -1.03 | -0.84  |
| NL     | -2.14 | -1.98 | -1.52 | -1.28  |
| LOLA   | -1.02 | -1.30 | -1.09 | -1.04  |
| M-MAML | -1.52 | -1.25 | -1.15 | -1.17  |

Table 9. Head-to-head results of each learning algorithm in the IPD, results reported for M-FOS GA.

|        | M-FOS | NL     | LOLA  | M-MAML |
|--------|-------|--------|-------|--------|
| M-FOS  |       | -0.745 | -0.73 | -0.67  |
| NL     | -1.69 | -1.98  | -1.52 | -1.28  |
| LOLA   | -2.09 | -1.30  | -1.09 | -1.04  |
| M-MAML | -1.86 | -1.25  | -1.15 | -1.17  |

Table 10. Head-to-head results of each learning algorithm in IMP, results reported for M-FOS PPO.

|        | M-FOS | NL   | LOLA  | M-MAML |
|--------|-------|------|-------|--------|
| M-FOS  | 0.0   | 0.20 | 0.19  | 0.22   |
| NL     | -0.20 | 0.0  | -0.02 | -0.01  |
| LOLA   | -0.19 | 0.02 | 0.0   | 0.02   |
| M-MAML | -0.22 | 0.01 | -0.02 | 0.0    |

Table 11. Head-to-head results of each learning algorithm in IMP, results reported for M-FOS GA.

|        | M-FOS | NL   | LOLA  | M-MAML |
|--------|-------|------|-------|--------|
| M-FOS  |       | 0.13 | 0.10  | 0.17   |
| NL     | -0.13 | 0.0  | -0.02 | -0.01  |
| LOLA   | -0.10 | 0.02 | 0.0   | 0.02   |
| M-MAML | -0.17 | 0.01 | -0.02 | 0.0    |

Table 12. Head-to-head results of each learning algorithm in the Chicken Game. The results of an M-FOS meta-policy that also learns an initial policy are shown in brackets. Results reported for M-FOS PPO.

|        | M-FOS       | NL   | LOLA        | M-MAML |
|--------|-------------|------|-------------|--------|
| M-FOS  | -0.01       | 0.97 | -0.94 [0.5] | 0.85   |
| NL     | -1.03       | -0.0 | -0.97       | -0.27  |
| LOLA   | 0.87 [-1.5] | 0.94 | -85.96      | 0.40   |
| M-MAML | -1.11       | 0.27 | -0.42       | -0.15  |

Table 13. Head-to-head results of each learning algorithm in the Chicken Game. The results of an M-FOS meta-policy that also learns an initial policy are shown in brackets. Results reported for M-FOS GA.
|        | M-FOS       | NL   | LOLA        | M-MAML |
|--------|-------------|------|-------------|--------|
| M-FOS  |             | 0.97 | -0.94 [0.5] | 0.86   |
| NL     | -1.03       | -0.0 | -0.97       | -0.27  |
| LOLA   | 0.91 [-1.5] | 0.94 | -85.96      | 0.40   |
| M-MAML | -1.08       | 0.27 | -0.42       | -0.15  |

## B. M-MAML Initialisations Plot

Figure 8. The distribution of probabilities in each state after training 10 different instances of M-MAML.

## C. Hyperparameter Details

We report the hyperparameter values used for each of the methods in our experiments.

Table 14. PPO for IPD, IMP, and Chicken Game.

| Hyperparameter | Value |
|---|---|
| Number of Actor Hidden Layers | 1 |
| Size of Actor Hidden Layers | [256] |
| Number of Critic Hidden Layers | 1 |
| Size of Critic Hidden Layers | [256] |
| Length of Meta-Episode T | 100 |
| Batch Size B | 4096 |
| Adam Step Size | 0.0002 |
| Number of Epochs | 4 |
| Outer Discount Factor γ | 0.99 |
| PPO Clipping ϵ | 0.2 |
| Entropy Coefficient | 0.01 |

Table 15. Genetic Algorithm for IPD, IMP, and Chicken Game.

| Hyperparameter | Value |
|---|---|
| Number of Hidden Layers | 1 |
| Size of Hidden Layers | [256] |
| Number of Species N | 2048 |
| Batch Size B | 128 |
| Length of Meta-Episode T | 100 |
| Noise Std Dev σ | 2.0 |
| Number of Elites E | 1 |

Table 16. PPO for the Coin Game. The Actor, Critic, and Meta-Policy have the same network architecture but do not share weights.

| Hyperparameter | Value |
|---|---|
| Number of Conv Layers | 2 |
| Output Channels of Conv Layers | [16, 16] |
| Kernel Sizes of Conv Layers | [[3, 3], [3, 3]] |
| Strides of Conv Layers | [1, 1] |
| Number of Linear Layers | 1 |
| Size of Linear Layer | [16] |
| Number of GRUs | 1 |
| Size of GRUs | [16] |
| Length of Meta-Episode T | 16 |
| Length of Inner Episode | 16 |
| Batch Size B | 512 |
| Adam Step Size | 0.0002 |
| Number of Epochs | 16 |
| Outer Discount Factor γ | 0.99 |
| PPO Clipping ϵ | 0.2 |
| Entropy Coefficient | 0.01 |

### C.2. Environments

Table 17. Hyperparameters for the IPD Environment.

| Hyperparameter | Value |
|---|---|
| Inner Gamma γ | 0.96 |
| Learning Rate α | 1 |
| M-MAML Adam Learning Rate | 0.05 |

Table 18. Hyperparameters for the IMP Environment.

| Hyperparameter | Value |
|---|---|
| Inner Gamma γ | 0.96 |
| Learning Rate α | 0.1 |
| M-MAML Adam Learning Rate | 0.05 |

Table 19. Hyperparameters for the Chicken Environment.

| Hyperparameter | Value |
|---|---|
| Learning Rate α | 1 |
| M-MAML Adam Learning Rate | 0.05 |

Table 20. Hyperparameters for the Coin Game Environment and Naive Learner. The Actor and Critic share the same architecture but do not share weights.

| Hyperparameter | Value |
|---|---|
| Number of Conv Layers | 2 |
| Output Channels of Conv Layers | [16, 16] |
| Kernel Sizes of Conv Layers | [[3, 3], [3, 3]] |
| Strides of Conv Layers | [1, 1] |
| Number of Linear Layers | 1 |
| Size of Linear Layer | [16] |
| Adam Step Size | 0.005 |
| Number of Epochs | 80 |
| PPO Clipping ϵ | 0.2 |
| Entropy Coefficient | 0.01 |
| Discount Factor γ | 0.96 |
| Length of Inner Episode | 16 |