COLA: Consistent Learning with Opponent-Learning Awareness

Timon Willi*1  Alistair Letcher*  Johannes Treutlein*2,3  Jakob Foerster1

Abstract

Learning in general-sum games is unstable and frequently leads to socially undesirable (Pareto-dominated) outcomes. To mitigate this, Learning with Opponent-Learning Awareness (LOLA) introduced opponent shaping to this setting, by accounting for each agent's influence on their opponents' anticipated learning steps. However, the original LOLA formulation (and follow-up work) is inconsistent because LOLA models other agents as naive learners rather than LOLA agents. In previous work, this inconsistency was suggested as a cause of LOLA's failure to preserve stable fixed points (SFPs). First, we formalize consistency and show that higher-order LOLA (HOLA) solves LOLA's inconsistency problem if it converges. Second, we correct a claim made in the literature by Schäfer and Anandkumar (2019), proving that Competitive Gradient Descent (CGD) does not recover HOLA as a series expansion (and fails to solve the consistency problem). Third, we propose a new method called Consistent LOLA (COLA), which learns update functions that are consistent under mutual opponent shaping. It requires no more than second-order derivatives and learns consistent update functions even when HOLA fails to converge. However, we also prove that even consistent update functions do not preserve SFPs, contradicting the hypothesis that this shortcoming is caused by LOLA's inconsistency. Finally, in an empirical evaluation on a set of general-sum games, we find that COLA finds prosocial solutions and that it converges under a wider range of learning rates than HOLA and LOLA. We support the latter finding with a theoretical result for a simple game.

*Equal contribution. 1Department of Engineering Science, University of Oxford, United Kingdom. 2Department of Computer Science, University of Toronto, Canada. 3Vector Institute, Toronto, Canada. Correspondence to: . Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

Much research in deep multi-agent reinforcement learning (MARL) has focused on zero-sum games like Starcraft and Go (Silver et al., 2017; Vinyals et al., 2019) or fully cooperative settings (Oroojlooyjadid & Hajinezhad, 2019). However, many real-world problems, e.g. self-driving cars, contain both cooperative and competitive elements, and are thus better modeled as general-sum games. One such game is the famous Prisoner's Dilemma (Axelrod & Hamilton, 1981), in which agents have an individual incentive to defect against their opponent, even though they would prefer the outcome in which both cooperate to the one where both defect. A strategy for the infinitely iterated version of the game (IPD) is tit-for-tat, which starts out cooperating and otherwise mirrors the opponent's last move. It achieves mutual cooperation when chosen by both players and has proven successful in IPD tournaments (Axelrod & Hamilton, 1981). If MARL algorithms are deployed in the real world, it is essential that they are able to cooperate with others and entice others to cooperate with them, using strategies such as tit-for-tat (Dafoe et al., 2021).
However, naive gradient descent and other, more sophisticated methods (Korpelevich, 1977; Mescheder et al., 2017; Balduzzi et al., 2018; Mazumdar et al., 2019; Schäfer & Anandkumar, 2019) converge to the mutual defection policy under random initialization (Letcher et al., 2019b). An effective paradigm for improving learning in general-sum games is opponent shaping, where agents take into account their influence on the anticipated learning step of the other agents. LOLA (Foerster et al., 2018a) was the first work to make explicit use of opponent shaping and is one of the only general learning methods designed for general-sum games that obtains mutual cooperation with the tit-for-tat strategy in the IPD. While LOLA discovers these prosocial equilibria, the original LOLA formulation is inconsistent because LOLA agents assume that their opponent is a naive learner. This assumption is clearly violated when two LOLA agents learn together. It has been suggested that this inconsistency is the cause of LOLA's main shortcoming, namely that it does not maintain the stable fixed points (SFPs) of the underlying game, even in some simple quadratic games (Letcher, 2018, pp. 2, 26; see also Letcher et al., 2019b).

Contributions. To address LOLA's inconsistency, we first revisit the concept of higher-order LOLA (HOLA) (Foerster et al., 2018a) in Section 4.1. For example, second-order LOLA assumes that the opponent is a first-order LOLA agent (which in turn assumes the opponent is a naive learner), and so on. Supposing that HOLA converges with increasing order, we define infinite-order LOLA (iLOLA) as the limit. Intuitively, two iLOLA agents have a consistent view of each other, since they accurately account for the learning behavior of the opponent under mutual opponent shaping. Based on this idea, we introduce a formal definition of consistency and prove that, if it exists, iLOLA is indeed consistent (Proposition 4.3).

Second, in Section 4.2, we correct a claim made in previous literature which would have provided a closed-form solution for the iLOLA update. According to Schäfer & Anandkumar (2019), the Competitive Gradient Descent (CGD) algorithm recovers HOLA as a series expansion. If true, this would imply that CGD coincides with iLOLA, thus solving LOLA's inconsistency problem. We prove that this is untrue: CGD's series expansion does not, in general, recover HOLA, CGD does not correspond to iLOLA, and CGD does not solve the inconsistency problem (Proposition 4.4).

In lieu of a closed-form solution, a naive way of computing the iLOLA update is to iteratively compute higher orders of LOLA until convergence. However, there are two main problems with addressing consistency using a limiting update: the process may diverge, and it typically requires arbitrarily high derivatives. To address these, in Section 4.3, we propose Consistent LOLA (COLA) as a more robust and efficient alternative. COLA learns a pair of consistent update functions by explicitly minimizing a differentiable measure of consistency inspired by our formal definition. We use the representational power of neural networks and gradient-based optimization to minimize this loss, resulting in learned update functions that are mutually consistent. By reframing the problem in this way, we only require up to second-order derivatives.

In Section 4.4, we prove initial results about COLA. First, we show that COLA's solutions are not necessarily unique.
Second, despite being consistent, COLA does not recover SFPs, contradicting the prior belief that this shortcoming is caused by inconsistency. Third, to show the benefit of additional consistency, we prove that COLA converges under a wider range of look-ahead rates than LOLA in a simple general-sum game.

Finally, in Sections 5 and 6, we report our experimental setup and results, investigating COLA and HOLA and comparing COLA to LOLA and CGD in a range of games. We experimentally confirm our theoretical result that CGD does not equal iLOLA. Moreover, we show that COLA converges under a wider range of look-ahead rates than HOLA and LOLA, and that it is generally able to find socially desirable solutions. It is the only algorithm that consistently converges to the fair solution in the Ultimatum game, and while it does not find tit-for-tat in the IPD (unlike LOLA), it does learn policies with near-optimal total payoff. We find that COLA learns consistent update functions even when HOLA diverges with higher order, and that its updates are similar to iLOLA when HOLA converges. Although COLA's solutions are not unique in theory, COLA empirically tends to find similar solutions over different runs.

2. Related work

General-sum learning algorithms have been investigated from different perspectives in the reinforcement learning, game theory, and GAN literature (Schmidhuber, 1991; Barto & Mahadevan, 2003; Goodfellow et al., 2014; Racanière et al., 2017). Next, we highlight a few of the approaches to the mutual opponent shaping problem. Opponent modeling maintains an explicit belief over the opponent, allowing an agent to reason about the opponent's strategies and compute optimal responses. Opponent modeling can be divided into different subcategories: there are classification methods, which classify opponents into pre-defined types (Weber & Mateas, 2009; Synnaeve & Bessière, 2011), and policy reconstruction methods, which explicitly predict the actions of the opponent (Mealing & Shapiro, 2017). Most closely related to opponent shaping is recursive reasoning, where methods model nested beliefs of the opponents (He & Boyd-Graber, 2016; Albrecht & Stone, 2017; Wen et al., 2019). In comparison, COLA assumes that we have access to a ground-truth model of the opponent, e.g., the opponent's payoff function, parameters, and gradients, putting COLA into the framework of differentiable games (Balduzzi et al., 2018). Various methods have been proposed in this framework, investigating local convergence properties to different solution concepts (Mescheder et al., 2017; Mazumdar et al., 2019; Letcher et al., 2019b; Schäfer & Anandkumar, 2019; Azizian et al., 2020; Schäfer et al., 2020; Hutter, 2021). Most of the work in differentiable games has not focused on opponent shaping or consistency. Mescheder et al. (2017) and Mazumdar et al. (2019) focus solely on zero-sum games without shaping. To improve upon LOLA, Letcher et al. (2019b) suggested Stable Opponent Shaping (SOS), which applies ad-hoc corrections to the LOLA update, leading to theoretically guaranteed convergence to SFPs. However, despite its desirable convergence properties, SOS still does not solve the conceptual issue of inconsistent assumptions about the opponent. CGD (Schäfer & Anandkumar, 2019) addresses the inconsistency issue for zero-sum games but not for general-sum games. The exact difference between CGD, LOLA and our method is addressed in Section 4.2. Model-Free Opponent Shaping (M-FOS) (Lu et al., 2022) frames opponent shaping as a meta-learning problem requiring only first-order derivatives. M-FOS is consistent in that it does not make any assumption about the learning algorithm of the opponent. However, that work does not investigate consistency specifically.
Table 1. On the Tandem game: (a) Log of the squared consistency loss, where e.g. HOLA6 is sixth-order higher-LOLA. (b) Cosine similarity between COLA and LOLA, HOLA3, and HOLA6 over different look-ahead rates. The values represent the mean of 1,000 samples, uniformly sampled from the parameter space Θ. Error bars represent one standard deviation over 10 COLA training runs.

(a) Consistency loss
α      LOLA    HOLA3   HOLA6    COLA
1.0    128.0   512     131072   3e-14 ± 2e-15
0.5    12.81   14.05   12.35    2e-14 ± 5e-15
0.3    2.61    2.05    0.66     4e-14 ± 3e-15
0.1    0.08    9e-3    2e-6     6e-14 ± 9e-15
0.01   1e-5    2e-8    4e-14    1e-14 ± 4e-14

(b) Cosine similarity to COLA
α      LOLA          HOLA3         HOLA6
1.0    0.94 ± 0.04   0.94 ± 0.04   0.94 ± 0.04
0.5    0.88 ± 0.12   0.88 ± 0.12   0.08 ± 0.13
0.3    0.92 ± 0.01   0.91 ± 0.01   0.80 ± 0.01
0.1    0.95 ± 0.01   0.99 ± 0.01   0.99 ± 0.01
0.01   0.99 ± 0.01   1.00 ± 0.00   1.00 ± 0.00

3. Background

3.1. Differentiable games

The framework of differentiable games has become increasingly popular for modeling multi-agent learning. Whereas stochastic games are limited to parameters such as action-state probabilities, differentiable games generalize to arbitrary real-valued parameter vectors and differentiable loss functions (Balduzzi et al., 2018). We restrict our attention to two-player games, as is standard in much of the literature (Foerster et al., 2018a;b; Schäfer & Anandkumar, 2019).

Definition 3.1 (Differentiable games). In a two-player differentiable game, players i = 1, 2 control parameters θ_i ∈ R^{d_i} to minimize twice continuously differentiable losses L_i : R^{d_1 + d_2} → R.

We adopt the convention of writing −i to denote the opponent of player i. A fundamental challenge of the multi-loss setting is finding a good solution concept. Whereas in the single-loss setting the typical solution concept is a local minimum, in multi-loss settings there are several sensible solution concepts, most prominently Nash equilibria (Osborne & Rubinstein, 1994). However, Nash equilibria include unstable saddle points that cannot reasonably be found via gradient-based learning algorithms (Letcher et al., 2019b). A more suitable concept is the stable fixed point (SFP), which can be considered the differentiable-game analogue of a local minimum in single-loss optimization. We omit a formal definition for brevity and point the reader to previous work on the topic (Letcher et al., 2019a).

Consider a differentiable game with two players. A LOLA agent θ_1 uses its access to the opponent's parameters θ_2 to differentiate through a learning step of the opponent. That is, agent 1 reformulates their loss as L_1(θ_1, θ_2 + Δθ̂_2), where Δθ̂_2 represents the assumed learning step of the opponent. In first-order LOLA we assume the opponent to be a naive learner: Δθ̂_2 = −α ∇_2 L_2. This assumption makes LOLA inconsistent whenever the opponent is any other type of learner. Here, ∇_2 denotes the gradient with respect to θ_2, and α is the look-ahead rate, i.e., the assumed learning rate of the opponent. This rate may differ from the opponent's actual learning rate, but for simplicity we only consider equal learning rates and look-ahead rates across opponents. In the original paper, the loss was approximated using a Taylor expansion, L_1(θ_1, θ_2 + Δθ̂_2) ≈ L_1 + (∇_2 L_1)ᵀ Δθ̂_2. For agent 1, the first-order Taylor LOLA update is then

Δθ_1 = −α (∇_1 L_1 + (∇_{12} L_1) Δθ̂_2 + (∇_1 Δθ̂_2)ᵀ ∇_2 L_1).
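To make this update concrete, the following is a minimal sketch (ours, not the authors' released code) of the first-order Taylor LOLA step for scalar parameters, written with PyTorch autograd (the library used by the implementation described in Appendix H). The Tandem-game losses from Section 5 serve as the example; all function names are illustrative.

```python
import torch

def L1(x, y):  # Tandem game losses from Section 5
    return (x + y) ** 2 - 2 * x

def L2(x, y):
    return (x + y) ** 2 - 2 * y

def taylor_lola_step(x, y, alpha):
    """First-order Taylor LOLA update for player 1 (scalar parameters)."""
    x = x.clone().requires_grad_(True)
    y = y.clone().requires_grad_(True)
    # first-order gradients, kept in the graph so second derivatives are available
    g1x, g1y = torch.autograd.grad(L1(x, y), (x, y), create_graph=True)
    g2y = torch.autograd.grad(L2(x, y), y, create_graph=True)[0]
    dy_hat = -alpha * g2y                                            # assumed naive opponent step Δθ̂2 = -α ∇2 L2
    mixed = torch.autograd.grad(g1x, y, retain_graph=True)[0]        # ∇12 L1
    shaping = torch.autograd.grad(dy_hat, x, retain_graph=True)[0]   # ∇1 Δθ̂2 (the opponent-shaping term)
    # Δθ1 = -α ( ∇1 L1 + (∇12 L1) Δθ̂2 + (∇1 Δθ̂2)ᵀ ∇2 L1 )
    return -alpha * (g1x + mixed * dy_hat + shaping * g1y)

x, y = torch.tensor(0.5), torch.tensor(-0.2)
print(taylor_lola_step(x, y, alpha=0.1))
```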
Alternatively, in exact LOLA, the derivative is taken directly with respect to the reformulated loss, yielding the update

Δθ_1 = −α ∇_1 L_1(θ_1, θ_2 + Δθ̂_2).

LOLA has had some empirical success, being one of the first general learning methods to discover tit-for-tat in the IPD. However, later work showed that LOLA does not preserve SFPs; e.g., the rightmost (shaping) term in the Taylor LOLA update can be nonzero at an SFP. In fact, LOLA agents show arrogant behavior: they assume they can shape the learning of their naive opponents without having to adapt to the shaping of the opponent. Prior work hypothesized that this arrogant behavior is due to LOLA's inconsistent formulation and may be the cause of LOLA's failure to preserve SFPs (Letcher, 2018, pp. 2, 26; Letcher et al., 2019b).

CGD (Schäfer & Anandkumar, 2019) proposes updates that are themselves Nash equilibria of a local bilinear approximation of the game. It stands out by its robustness to different look-ahead rates and its ability to find SFPs. However, CGD does not find tit-for-tat on the IPD, instead converging to mutual defection (see Figure 3e). CGD's update rule is given by

(Δθ_1, Δθ_2) = −α [ Id  α∇_{12}L_1 ; α∇_{21}L_2  Id ]^{−1} (∇_1 L_1, ∇_2 L_2).

One can recover different orders of CGD by approximating the inverse matrix via the series expansion (Id − A)^{−1} = lim_{N→∞} Σ_{k=0}^{N} A^k for ‖A‖ < 1. For example, at N = 1 we recover a version called Linearized CGD (LCGD), defined via Δθ_1 := −α ∇_1 L_1 + α² ∇_{12}L_1 ∇_2 L_2.

Figure 1. Subfigures (a), (b) and (c): Log of the consistency loss with standard error over 10 independent training runs during training of the update functions for the Tandem, MP and Ultimatum games. Subfigures (d), (e) and (f): Learning outcomes for the respective games. E.g., COLA:0.1 is COLA at a look-ahead rate of 0.1. Lines represent payoff means and shaded areas standard deviations over 10 runs.

4. Method and theory

In this section, we formally define iLOLA and consistency under mutual opponent shaping and show that iLOLA is consistent, thus in principle addressing LOLA's inconsistency problem. We then clear up the relation between CGD and iLOLA, correcting a false claim in Schäfer & Anandkumar (2019). Lastly, we introduce COLA as an alternative to iLOLA and present some initial theoretical analysis, including the result that, contrary to prior belief, even consistent update functions do not recover SFPs.

4.1. Convergence and consistency of higher-order LOLA

The original formulation of LOLA is inconsistent when two LOLA agents learn together, because LOLA agents assume their opponent is a naive learner. To address this problem, we define and analyze iLOLA. In this section, we focus on exact LOLA, but we provide a version of our analysis for Taylor LOLA in Appendix C. HOLA^n is defined by the recursive relation

h^{n+1}_1 := −α ∇_1 L_1(θ_1, θ_2 + h^n_2)
h^{n+1}_2 := −α ∇_2 L_2(θ_1 + h^n_1, θ_2)

with h^{−1}_1 = h^{−1}_2 = 0, omitting the arguments (θ_1, θ_2) for convenience. In particular, HOLA^0 coincides with simultaneous gradient descent, while HOLA^1 coincides with LOLA.

Definition 4.1 (iLOLA). If HOLA^n = (h^n_1, h^n_2) converges pointwise as n → ∞, define iLOLA := lim_{n→∞} HOLA^n as the limiting update.
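This recursion lends itself to a direct implementation with automatic differentiation. Below is a sketch (our illustration, assuming scalar parameters and the Tandem-game losses from Section 5; not the paper's code) that computes the exact HOLA^n update by applying the LOLA operator n+1 times with create_graph=True, so that each order differentiates through the previous one. Iterating until successive updates stop changing gives a numerical approximation of iLOLA whenever HOLA converges.

```python
import torch

def hola_update(L1, L2, x, y, order, alpha):
    """Exact HOLA^order updates (h1, h2) at (x, y); order=0 is naive learning, order=1 is LOLA."""
    x = x.clone().requires_grad_(True)
    y = y.clone().requires_grad_(True)
    h1, h2 = torch.zeros_like(x), torch.zeros_like(y)   # HOLA^{-1} := 0
    for _ in range(order + 1):
        # h1^{n+1} = -α ∇_1 L1(θ1, θ2 + h2^n); the gradient also flows through h2^n's dependence on θ1
        g1 = torch.autograd.grad(L1(x, y + h2), x, create_graph=True)[0]
        g2 = torch.autograd.grad(L2(x + h1, y), y, create_graph=True)[0]
        h1, h2 = -alpha * g1, -alpha * g2                # both new orders use the previous (h1, h2)
    return h1.detach(), h2.detach()

# Tandem game (Section 5): HOLA6 step at (x, y) = (0.3, -0.1) with look-ahead rate 0.5
L1 = lambda x, y: (x + y) ** 2 - 2 * x
L2 = lambda x, y: (x + y) ** 2 - 2 * y
print(hola_update(L1, L2, torch.tensor(0.3), torch.tensor(-0.1), order=6, alpha=0.5))
```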
We show in Appendix A that HOLA does not always converge, even in simple quadratic games. But, unlike LOLA, iLOLA satisfies a criterion of consistency whenever HOLA does converge (under some assumptions), formally defined as follows:

Definition 4.2 (Consistency). Any update functions f_1 : R^d → R^{d_1} and f_2 : R^d → R^{d_2} are consistent (under mutual opponent shaping with look-ahead rate α) if for all θ_1 ∈ R^{d_1}, θ_2 ∈ R^{d_2}, they satisfy

f_1(θ_1, θ_2) = −α ∇_1 L_1(θ_1, θ_2 + f_2(θ_1, θ_2))    (1)
f_2(θ_1, θ_2) = −α ∇_2 L_2(θ_1 + f_1(θ_1, θ_2), θ_2)    (2)

Table 2. On the MP game: Over different look-ahead rates we compare (a) the consistency losses and (b) the cosine similarity between COLA and LOLA, HOLA2, and HOLA4. The values represent the mean over 1,000 samples, uniformly sampled from the parameter space Θ. The error bars represent one standard deviation and capture the variance over 10 different COLA training runs.

(a) Consistency loss
α      LOLA    HOLA2   HOLA4   COLA
10     0.06    0.70    6.56    2e-3 ± 3e-4
5      5e-3    0.03    0.15    5e-4 ± 5e-5
1.0    9e-6    3e-8    4e-9    3e-6 ± 1e-6
0.5    5e-7    3e-10   5e-12   3e-6 ± 3e-6
0.01   1e-13   6e-17   5e-17   2e-6 ± 2e-6

(b) Cosine similarity to COLA
α      LOLA          HOLA2         HOLA4
10     0.90 ± 0.01   0.84 ± 0.01   0.74 ± 0.02
5      0.98 ± 0.01   0.97 ± 0.01   0.92 ± 0.01
1.0    1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00
0.5    1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00
0.01   1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00

Proposition 4.3. Let HOLA^n = (h^n_1, h^n_2) denote both players' exact n-th order LOLA updates. Assume that lim_{n→∞} h^n_i(θ) = h_i(θ) and lim_{n→∞} ∇_i h^n_{−i}(θ) = ∇_i h_{−i}(θ) exist for all θ ∈ R^d and i ∈ {1, 2}. Then iLOLA is consistent under mutual opponent shaping.

Proof. In Appendix B.

4.2. CGD does not recover higher-order LOLA

Schäfer & Anandkumar (2019) claim that "LCGD [Linearized CGD] coincides with first order LOLA" (page 6), and moreover that the "higher-order series-expansion [of CGD] would recover higher-order LOLA" (page 4). If this were correct, it would imply that full CGD is equal to iLOLA and thus provides a convenient closed-form solution. We prove that this is false in general games:

Proposition 4.4. CGD is inconsistent and does not coincide with iLOLA. In particular, Linearized CGD (LCGD) does not coincide with LOLA, and the series expansion of CGD does not recover HOLA (neither exact nor Taylor). Instead, LCGD coincides with LookAhead (Zhang & Lesser, 2010), an algorithm that lacks opponent shaping, and the series expansion of CGD recovers higher-order LookAhead.

Proof. In Appendix D. For the negative results, it suffices to construct a single counterexample: we show that LCGD and LOLA differ almost everywhere in the Tandem game (excluding a set of measure zero). We prove by contradiction that the series expansion of CGD does not recover HOLA. If it did, CGD would equal iLOLA, and by Proposition 4.3, CGD would satisfy the consistency equations. However, consistency fails almost everywhere in the Tandem game, completing the contradiction.

4.3. COLA

iLOLA is consistent under mutual opponent shaping. However, HOLA does not always converge and, even when it does, it may be expensive to recursively compute HOLA^n for sufficiently high n to achieve convergence. As an alternative, we propose COLA. COLA learns consistent update functions and avoids infinite regress by directly solving the equations in Definition 4.2. We define the consistency losses for learned update functions f_1, f_2, parameterized by ϕ_1, ϕ_2, at a given θ as the difference between the RHS and LHS in Definition 4.2:

C_1(ϕ_1, ϕ_2, θ_1, θ_2) = ‖f_1 + α ∇_1 L_1(θ_1, θ_2 + f_2)‖
C_2(ϕ_1, ϕ_2, θ_1, θ_2) = ‖f_2 + α ∇_2 L_2(θ_1 + f_1, θ_2)‖.

If both losses are 0 for all θ, then the two update functions defined by ϕ_1, ϕ_2 are consistent. For this paper, we parameterise f_1, f_2 as neural networks with parameters ϕ_1, ϕ_2, respectively, and numerically minimize the sum of both losses over a region of interest. The parameter region of interest Θ depends on the game being played. For games with probabilities as actions, we select a region that captures most of the probability space (e.g., we sample parameter pairs θ_1, θ_2 ∈ [−7, 7], since σ(7) ≈ 1, where σ is the sigmoid function). We optimize the mean of the summed consistency losses,

C(ϕ_1, ϕ_2) := E_{(θ_1, θ_2) ∼ U(Θ)} [ C_1(ϕ_1, ϕ_2, θ_1, θ_2) + C_2(ϕ_1, ϕ_2, θ_1, θ_2) ],

by sampling parameter pairs (θ_1, θ_2) uniformly from Θ and feeding them to the neural networks f_1, f_2, each outputting an agent's parameter update. The weights ϕ_1, ϕ_2 are then updated by taking a gradient step to minimize C. We train the update functions until the loss has converged and use the learned update functions to train a pair of agent policies in the given game.
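A compact sketch of this training procedure is given below (our own illustration; the layer sizes, sampling region Θ, batch size, and step count are placeholders, with the settings actually used listed in Appendix H). The Tandem-game losses stand in for a generic two-player differentiable game, and the consistency residuals are squared, matching the squared distance measure mentioned in Appendix H.

```python
import torch
import torch.nn as nn

class UpdateNet(nn.Module):
    """Learned update function f_i(θ1, θ2) -> Δθ_i (architecture is a placeholder, see Appendix H)."""
    def __init__(self, d_in=2, d_out=1, width=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, width), nn.Tanh(), nn.Linear(width, d_out))
    def forward(self, theta1, theta2):
        return self.net(torch.cat([theta1, theta2], dim=-1))

def consistency_loss(f1, f2, L1, L2, theta1, theta2, alpha):
    """Squared residuals of the consistency equations, C1 + C2, averaged over a batch of parameters."""
    theta1 = theta1.clone().requires_grad_(True)
    theta2 = theta2.clone().requires_grad_(True)
    u1, u2 = f1(theta1, theta2), f2(theta1, theta2)
    # ∇1 L1(θ1, θ2 + f2) and ∇2 L2(θ1 + f1, θ2); create_graph lets the loss backpropagate into ϕ1, ϕ2
    g1 = torch.autograd.grad(L1(theta1, theta2 + u2).sum(), theta1, create_graph=True)[0]
    g2 = torch.autograd.grad(L2(theta1 + u1, theta2).sum(), theta2, create_graph=True)[0]
    return ((u1 + alpha * g1) ** 2 + (u2 + alpha * g2) ** 2).sum(dim=-1).mean()

# Tandem game losses (Section 5) as a stand-in for a generic differentiable game
L1 = lambda x, y: (x + y) ** 2 - 2 * x
L2 = lambda x, y: (x + y) ** 2 - 2 * y

f1, f2 = UpdateNet(), UpdateNet()
opt = torch.optim.Adam(list(f1.parameters()) + list(f2.parameters()), lr=1e-3)
for step in range(20_000):
    theta = torch.empty(8, 2).uniform_(-1.0, 1.0)   # sample (θ1, θ2) pairs from Θ = [-1, 1]^2
    loss = consistency_loss(f1, f2, L1, L2, theta[:, :1], theta[:, 1:], alpha=1.0)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once trained, f1 and f2 are frozen and used as the update rules when training a pair of agent policies in the game itself.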
4.4. Theoretical results for COLA

In this section, we provide some initial theoretical results on COLA's uniqueness and convergence behavior, using the Tandem game (Letcher et al., 2019b) and the Hamiltonian game (Balduzzi et al., 2018) as examples. These are simple polynomial games, with losses given in Section 5. Proofs of the following propositions can be found in Appendices E, F and G, respectively.

Figure 2. Training in MP at a look-ahead rate of α = 10. (a) Axes are on a log scale. Shown are the mean variance and consistency over 10 different runs; COLA was retrained for each run. COLA long was trained for 80k steps and COLA short for 800 steps. (b) LOLA and HOLA find non-convergent or even diverging solutions, while COLA's converge. (c) Gradient field learned by COLA on MP at a look-ahead rate of 10.

First, we show that solutions to the consistency equations are in general not unique, even when restricting to linear update functions in the Tandem game. Interestingly, COLA empirically appears to converge to similar solutions regardless (see Table 7 in Appendix I.3).

Proposition 4.5. Solutions to the consistency equations are not unique, even when restricted to linear solutions; more precisely, there exist several linear consistent solutions to the Tandem game.

Second, we show that consistent solutions do not, in general, preserve SFPs, contradicting the hypothesis that LOLA's failure to preserve SFPs is due to its inconsistency (Letcher, 2018, pp. 2, 26; Letcher et al., 2019b). We experimentally confirm this result in Section 6.

Proposition 4.6. Consistency does not imply preservation of SFPs: there is a consistent solution to the Tandem game with α = 1 that fails to preserve any SFP. Moreover, for any α > 0, there are no linear consistent solutions to the Tandem game that preserve more than one SFP.

Third, we show that COLA can have more robust convergence behavior than LOLA and SOS:

Proposition 4.7. For any non-zero initial parameters and any α > 1, LOLA and SOS have divergent iterates in the Hamiltonian game. By contrast, any linear solution to the consistency equations converges to the origin for any initial parameters and any look-ahead rate α > 0; moreover, the speed of convergence strictly increases with α.

5. Experiments

We perform experiments on a set of games from the literature (Balduzzi et al., 2018; Letcher et al., 2019b), using LOLA, SOS and CGD as baselines. For details on the training procedure of COLA, we refer the reader to Appendix H. First, we compare HOLA and COLA on polynomial general-sum games, including the Tandem game (Letcher et al., 2019b), where LOLA fails to converge to SFPs.
Second, we investigate non-polynomial games, specifically the zero-sum Matching Pennies (MP) game, the general-sum Ultimatum game (Hutter, 2021) and the IPD (Axelrod & Hamilton, 1981; Harper et al., 2017).

Polynomial games. Losses in the Tandem game (Letcher et al., 2019b) are given by L_1(x, y) = (x + y)^2 − 2x and L_2(x, y) = (x + y)^2 − 2y for agents 1 and 2, respectively. The Tandem game was introduced to show that LOLA fails to preserve the SFPs at x + y = 1 and instead converges to Pareto-dominated solutions (Letcher et al., 2019b). In addition to the Tandem game, we investigate the algorithms on the Hamiltonian game, L_1(x, y) = xy and L_2(x, y) = −xy, and the Balduzzi game, where L_1(x, y) = ½x^2 + 10xy and L_2(x, y) = ½y^2 − 10xy (Balduzzi et al., 2018).

Matching Pennies. The payoff matrix for MP (Lee & K, 1967) is shown in Appendix I.3 in Table 6. Each policy is parameterized with a single parameter, the log-odds of choosing heads, p_heads = σ(θ_A). In this game, the unique Nash equilibrium is playing heads half the time.

Ultimatum game. The binary, single-shot Ultimatum game (Güth et al., 1982; Sanfey et al., 2003; Oosterbeek et al., 2004; Henrich et al., 2006) is set up as follows. Player 1 has access to $10. They can split the money fairly with player 2 ($5 for each player) or they can split it unfairly ($8 for player 1, $2 for player 2). Player 2 can either accept or reject the proposed split. If player 2 rejects, the reward is 0 for both players. If player 2 accepts, the reward follows the proposed split. Player 1's parameter is the log-odds of proposing a fair split, p_fair = σ(θ_1). Player 2's parameter is the log-odds of accepting the unfair split (assuming that player 2 always accepts fair splits), p_accept = σ(θ_2). In terms of losses, we have

L_1 = −(5 p_fair + 8 (1 − p_fair) p_accept)
L_2 = −(5 p_fair + 2 (1 − p_fair) p_accept).

IPD. We investigate the infinitely iterated Prisoner's Dilemma (IPD) (Axelrod & Hamilton, 1981; Harper et al., 2017) with discount factor γ = 0.96 and the usual payout function (see Appendix I.6). An agent i is defined through 5 parameters: the log-odds of cooperating in the first time step and across each of the four possible tuples of past actions of both players in the later steps.

Figure 3. IPD results: Subfigures (a) / (d) show the consistency loss with shaded standard error over 10 independent training runs for look-ahead rates of 0.03 / 1.0, (b) / (e) the average loss, and (c) / (f) the policy for the first player, both for the same pair of look-ahead rates. At low look-ahead rates HOLA defects, and at high ones it diverges, also leading to high loss.
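As a concrete example of how these parameterizations turn into differentiable losses that the update functions above can act on, here is a minimal sketch (ours, not the authors' code) of the Ultimatum-game losses; the payoff constants follow the description above.

```python
import torch

def ultimatum_losses(theta1, theta2):
    """Differentiable Ultimatum-game losses (negative expected payoffs) from the log-odds parameters."""
    p_fair = torch.sigmoid(theta1)     # player 1: probability of proposing the fair 5/5 split
    p_accept = torch.sigmoid(theta2)   # player 2: probability of accepting the unfair 8/2 split
    L1 = -(5 * p_fair + 8 * (1 - p_fair) * p_accept)
    L2 = -(5 * p_fair + 2 * (1 - p_fair) * p_accept)
    return L1, L2

theta1, theta2 = torch.tensor(0.0), torch.tensor(0.0)   # 50/50 initial policies
print(ultimatum_losses(theta1, theta2))
```

Splitting this into per-player callables L1(θ1, θ2) and L2(θ1, θ2) makes it directly usable with the HOLA and COLA sketches from Section 4.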
6. Results

First, we report and compare the learning outcomes achieved by COLA and our baselines. We find that COLA update functions converge even under high look-ahead rates and learn socially desirable solutions. We also confirm our theoretical result (Proposition 4.4) that CGD does not equal iLOLA, contradicting Schäfer & Anandkumar (2019), and that COLA does not, in general, maintain SFPs (Proposition 4.6), contradicting the prior belief that this shortcoming is caused by inconsistency. Second, we provide a more in-depth empirical analysis and comparison of the COLA and HOLA update functions, showing that COLA and HOLA tend to coincide when the latter converges, and that COLA is able to find consistent solutions even when HOLA diverges. Moreover, while COLA's solutions are not unique in theory (Proposition 4.5), we empirically find that in our examples COLA tends to find similar solutions across different independent training runs. Additional results supporting the above findings are reported in Appendix I.

Learning Outcomes. In the Tandem game (Figure 1d), we see that COLA and HOLA8 converge to similar outcomes in the game, whereas CGD does not. This supports our theoretical result that CGD does not equal iLOLA (Proposition 4.4). We also see that COLA does not recover SFPs, experimentally confirming Proposition 4.6. In contrast to LOLA, HOLA and SOS, COLA finds a convergent solution even at a high look-ahead rate (see COLA:0.8 in Figure 1d and Figure 4b in Appendix I.1). CGD is the only other algorithm in the comparison that also shows robustness to high look-ahead rates in the Tandem game.

On the IPD, all algorithms find the defect-defect strategy at low look-ahead rates (Figure 3b). At high look-ahead rates, COLA finds a strategy qualitatively similar to tit-for-tat, as displayed in Figure 3f, though noisier. However, COLA still achieves close to the optimal total loss, in contrast to CGD, which finds defect-defect even at a high look-ahead rate (see Figure 13 in Appendix I.6). The fact that, unlike HOLA and COLA, CGD finds defect-defect further confirms that CGD does not equal iLOLA.

Table 3. IPD: Over multiple look-ahead rates we compare (a) the consistency losses and (b) the cosine similarity between COLA and LOLA, HOLA2, and HOLA4. The values represent the mean over 250 samples, uniformly sampled from the parameter space Θ. The error bars represent one standard deviation and capture the variance over 10 different COLA training runs.

(a) Consistency loss
α      LOLA    HOLA2   HOLA4    COLA
1.0    39.56   21.16   381.21   1.06 ± 0.09
0.03   2e-3    5e-6    9e-8     0.16 ± 0.02

(b) Cosine similarity to COLA
α      LOLA          HOLA2         HOLA4
1.0    0.73 ± 0.02   0.63 ± 0.01   0.46 ± 0.03
0.03   0.97 ± 0.01   0.97 ± 0.01   0.97 ± 0.01

On MP at high look-ahead rates, SOS and LOLA mostly do not converge, whereas COLA converges even faster with a high look-ahead rate (see Figure 2a), confirming Proposition 4.7 experimentally (also see Figures 6b and 7b in Appendix I.2). To further investigate the influence of consistency on learning behavior, we plot the consistency of an update function against the variance of the losses across learning steps achieved by that function, for different orders of HOLA and for COLA (Figure 2b). At a high look-ahead rate in Matching Pennies, we find that more consistent update functions tend to lead to lower variance across training, demonstrating a potential benefit of increased consistency, at least at high look-ahead rates.

For the Ultimatum game, we find that COLA is the only method that consistently finds the fair solution at a high look-ahead rate, whereas SOS, LOLA, and CGD do not (Figure 1f). At low look-ahead rates, all algorithms find the unfair solution (see Figure 10b in Appendix I.4). This demonstrates an advantage of COLA over our baselines and shows that higher look-ahead rates can lead to better learning outcomes.

Lastly, we introduce the Chicken game in Appendix I.5. Both Taylor LOLA and SOS crash, whereas COLA, HOLA, CGD, and exact LOLA swerve at high look-ahead rates (Figure 12d). Crashing in Chicken results in a catastrophic payout for both agents, whereas swerving results in a jointly preferable outcome.¹

¹ Interestingly, in contrast to Taylor LOLA, exact LOLA swerves. The Chicken game is the only game where we found a difference in learning behavior between exact LOLA and Taylor LOLA.
Update functions. Turning to our analysis of the COLA and HOLA update functions, we first investigate how increasing the order of HOLA affects the consistency of its updates. As shown in Tables 1a, 2a and 3a, HOLA's updates become more consistent with increasing order, but only below a certain, game-specific look-ahead rate threshold. Above that threshold, HOLA's updates become less consistent with increasing order.

Second, we compare the consistency losses of COLA and HOLA. In the aforementioned tables, we observe that COLA achieves low consistency losses on most games. Below the threshold, COLA finds consistency losses similarly low to HOLA's, though HOLA's are lower in the non-polynomial games. Above the threshold, COLA finds consistent updates even when HOLA does not. A visualization of the update function learned by COLA at a high look-ahead rate on MP is given in Figure 2c. For the IPD, COLA's consistency losses are high compared to other games, but much lower than HOLA's consistency losses at high look-ahead rates. We leave it to future work to find methods that obtain more consistent solutions.

Third, we are interested in whether COLA and HOLA find similar solutions. We calculate the cosine similarity between the respective update functions over Θ. As we show in Tables 1b, 2b and 3b, COLA and HOLA find very similar solutions when HOLA's updates converge, i.e., when the look-ahead rate is below the threshold. Above the threshold, COLA's and HOLA's updates unsurprisingly become less similar with increasing order, as HOLA's updates diverge with increasing order. Lastly, we investigate Proposition 4.5 empirically and find that COLA finds similar solutions in Tandem and MP over 5 training runs (see Table 7 in Appendix I.3). Moreover, the small standard deviations in Table 3b indicate that COLA also finds similar solutions over different runs in the IPD.

7. Conclusion and Future Work

In this paper, we corrected a claim made in prior work (Schäfer & Anandkumar, 2019), clearing up the relation between the CGD and LOLA algorithms. We also showed that iLOLA solves part of the consistency problem of LOLA. We introduced COLA, which finds consistent solutions without requiring many recursive computations like iLOLA. It was previously believed that inconsistency leads to arrogant behavior and a lack of preservation of SFPs. We showed that even with consistency, opponent shaping behaves arrogantly, pointing towards a fundamental open problem for the method. In a set of games, we found that COLA tends to find prosocial solutions. Although COLA's solutions are not unique in theory, empirically, COLA tends to find similar solutions in different runs. It coincides with iLOLA when HOLA converges and finds consistent update functions even when HOLA fails to converge with increasing order. Moreover, we showed empirically (and in one case theoretically) that COLA update functions converge under a wider range of look-ahead rates than HOLA and LOLA update functions. This work raises many questions for future work, such as the existence of solutions to the COLA equations in general games and general properties of convergence and learning outcomes. Moreover, additional work is needed to scale COLA to larger settings such as GANs or deep RL, or settings with more than two players.
Another interesting axis is addressing further inconsistent aspects of LOLA as identified in Letcher et al. (2019b). Acknowledgements A part of this work was done while Timon Willi and Jakob Foerster were at the Vector Institute, University of Toronto. They are grateful for the access to the Vector Institute s compute infrastructure. They are also grateful for the access to the Advanced Research Computing (ARC) infrastructure. Johannes Treutlein is grateful for support by the Center on Long-Term Risk. Albrecht, S. V. and Stone, P. Reasoning about hypothetical agent behaviours and their parameters. In Proceedings of the 16th Conference on Autonomous Agents and Multi Agent Systems, pp. 547 555, 2017. Axelrod, R. and Hamilton, W. D. The evolution of cooperation. Science, 211(4489):1390 1396, 1981. Azizian, W., Mitliagkas, I., Lacoste-Julien, S., and Gidel, G. A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 2863 2873, 2020. Balduzzi, D., Racani ere, S., Martens, J., Foerster, J. N., Tuyls, K., and Graepel, T. The mechanics of n-player differentiable games. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 363 372, 2018. Barto, A. G. and Mahadevan, S. Recent advances in hierarchical reinforcement learning. Discret. Event Dyn. Syst., 13(1-2):41 77, 2003. Dafoe, A., Hughes, E., Bachrach, Y., Collins, T., Mc Kee, K. R., Leibo, J. Z., Larson, K., and Graepel, T. Open problems in cooperative ai. In Cooperative AI workshop, 2021. Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. Learning with opponent- learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and Multi Agent Systems, pp. 122 130, 2018a. Foerster, J. N., Farquhar, G., Al-Shedivat, M., Rockt aschel, T., Xing, E. P., and Whiteson, S. Dice: The infinitely differentiable monte carlo estimator. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1524 1533, 2018b. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27, pp. 2672 2680, 2014. G uth, W., Schmittberger, R., and Schwarze, B. An experimental analysis of ultimatum bargaining. Journal of Economic Behavior & Organization, 3(4):367 388, 1982. Harper, M., Knight, V., Jones, M., Koutsovoulos, G., Glynatsi, N. E., and Campbell, O. Reinforcement learning produces dominant strategies for the iterated prisoner s dilemma. PLOS ONE, 12(12):e0188046, 2017. He, H. and Boyd-Graber, J. L. Opponent modeling in deep reinforcement learning. In Proceedings of the 33nd International Conference on Machine Learning, volume 48 of JMLR Workshop and Conference Proceedings, pp. 1804 1813, 2016. Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., and Gintis, H. Foundations of Human Sociality: Economic Experiments and Ethnographic Evidence From Fifteen Small-Scale Societies. In American Anthropologist, volume 108. 2006. Hutter, A. Learning in two-player games between transparent opponents. ar Xiv preprint ar Xiv:2012.02671, 2021. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. 
In 3rd International Conference on Learning Representations, 2015. Korpelevich, G. M. The extragradient method for finding saddle points and other problems. volume 13, pp. 35 49, 1977. Lee, K. and K, L. The Application of Decision Theory and Dynamic Programming to Adaptive Control Systems. Thesis, 1967. Letcher, A. Stability and exploitation in differentiable games. Master s thesis, University of Oxford, 2018. COLA: Consistent Learning with Opponent-Learning Awareness Letcher, A., Balduzzi, D., Racani ere, S., Martens, J., Foerster, J. N., Tuyls, K., and Graepel, T. Differentiable game mechanics. J. Mach. Learn. Res., 20:84:1 84:40, 2019a. Letcher, A., Foerster, J. N., Balduzzi, D., Rockt aschel, T., and Whiteson, S. Stable opponent shaping in differentiable games. In 7th International Conference on Learning Representations, 2019b. Lu, C., Willi, T., Schroeder de Witt, C., and Foerster, J. Model-free opponent shaping. ar Xiv preprint ar Xiv:2205.01447, 2022. Mazumdar, E. V., Jordan, M. I., and Sastry, S. S. On finding local nash equilibria (and only local nash equilibria) in zero-sum games. ar Xiv preprint ar Xiv:1901.00838, 2019. Mealing, R. and Shapiro, J. L. Opponent modeling by expectation-maximization and sequence prediction in simplified poker. IEEE Trans. Comput. Intell. AI Games, 9(1):11 24, 2017. Mescheder, L. M., Nowozin, S., and Geiger, A. The numerics of gans. In Advances in Neural Information Processing Systems, volume 30, pp. 1825 1835, 2017. Oosterbeek, H., Sloof, R., and van de Kuilen, G. Cultural Differences in Ultimatum Game Experiments: Evidence from a Meta-Analysis. Experimental Economics, 7(2): 171 188, 2004. Oroojlooyjadid, A. and Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. 2019. Osborne, M. J. and Rubinstein, A. A Course in Game Theory. The MIT Press, Cambridge, MA, 1994. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pp. 8024 8035. 2019. Racani ere, S., Weber, T., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., Pascanu, R., Battaglia, P. W., Hassabis, D., Silver, D., and Wierstra, D. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 30, pp. 5690 5701, 2017. Sanfey, A. G., Rilling, J. K., Aronson, J. A., Nystrom, L. E., and Cohen, J. D. The Neural Basis of Economic Decision Making in the Ultimatum Game. Science, 300(5626): 1755 1758, 2003. Sch afer, F. and Anandkumar, A. Competitive gradient descent. In Advances in Neural Information Processing Systems, volume 32, pp. 7623 7633, 2019. Schmidhuber, J. A possibility for implementing curiosity and boredom in model-building neural controllers. In Meyer, J. A. and Wilson, S. W. (eds.), Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 222 227. MIT Press/Bradford Books, 1991. Sch afer, F., Anandkumar, A., and Owhadi, H. Competitive mirror descent. ar Xiv preprint ar Xiv:2006.10179, 2020. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T. 
P., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of go without human knowledge. Nat., 550(7676):354 359, 2017. Synnaeve, G. and Bessi ere, P. A bayesian model for opening prediction in RTS games with application to starcraft. In IEEE Conference on Computational Intelligence and Games, pp. 281 288, 2011. Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. L., G ulc ehre, C ., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., W unsch, D., Mc Kinney, K., Smith, O., Schaul, T., Lillicrap, T. P., Kavukcuoglu, K., Hassabis, D., Apps, C., and Silver, D. Grandmaster level in starcraft II using multi-agent reinforcement learning. Nat., 575(7782):350 354, 2019. Weber, B. G. and Mateas, M. A data mining approach to strategy prediction. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Games, pp. 140 147, 2009. Wen, Y., Yang, Y., Luo, R., Wang, J., and Pan, W. Probabilistic recursive reasoning for multi-agent reinforcement learning. In 7th International Conference on Learning Representations, 2019. Zhang, C. and Lesser, V. R. Multi-agent learning with policy prediction. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010. COLA: Consistent Learning with Opponent-Learning Awareness A. Nonconvergence of HOLA in the Tandem Game In the following, we show that for the choice of look-ahead rate α = 1, HOLA does not converge in the Tandem game. This shows that given a large enough look-ahead rate, even in a simple quadratic game, HOLA need not converge. Proposition A.1. Let L1, L2 be the two players loss functions in the Tandem game as defined in Section 5: L1(x, y) = (x + y)2 2x and L2(x, y) = (x + y)2 2y, (3) and let hn i denote the n-th order exact LOLA update for player i (where n = 0 denotes naive learning). Consider the look-ahead rate α := 1. Then the functions (hn i )n N for i = 1, 2 do not converge pointwise. Proof. We will prove the auxiliary statement that hn i (x, y) = 2n+2 2(1 + x + y) for i = 1, 2. It then follows trivially that the hn i cannot converge. The auxiliary result can be proven by induction. The base case n = 0 follows from i Li(x, y) = 2 2(x + y) = 22 2(1 + x + y) for i = 1, 2. Next, for the inductive step, we have to show that hn 1(x, y) = 1(L1(x, y + hn 1 2 (x, y))) (4) hn 2(x, y) = 2(L2(x + hn 1 1 (x, y), y)) (5) for any n > 0. Substituting the inductive hypothesis in the second step, we have 1(L1(x, y + hn 1 2 (x, y))) (6) = 1(L1(x, y + 2n+1 2(1 + x + y))) (7) = 1 (x + y + 2n+1 2 2x 2y)2 2x (8) = 1 ( x y + 2n+1 2)2 2x (9) = 2( x y + 2n+1 2) + 2 (10) = 2n+2 2(1 + x + y) (11) = hn 1(x, y). (12) The derivation for hn 2(x, y) is exactly analogous. This shows the inductive step and thus finishes the proof. B. Proof of Proposition 4.3 To begin, recall that some differentiable game with continuously differentiable loss functions L1, L2 is given, and that hn = (hn 1, hn 2) denotes the n-th order exact LOLA update function. We assume that the i LOLA update function h exists, defined via hi(θ) := lim n hn i (θ), for all θ Rd. 
To prove Proposition 4.3, we need to show that h1, h2 are consistent, i.e., satisfy Definition 4.2, under the assumption that lim n ihn i(θ) = ih i(θ) for i = 1, 2 and any θ. COLA: Consistent Learning with Opponent-Learning Awareness To that end, define the (exact) LOLA operator Ψ as the function mapping a pair of update functions f := (f1, f2) to the RHS of Equations 1 and 2, Ψ1(f)(θ) := α 1(L1(θ1, θ2 + f2(θ1, θ2))) (13) Ψ2(f)(θ) := α 2(L2(θ1 + f1(θ1, θ2), θ2)) (14) for any θ. Note that then we have hn+1 i = Ψi(hn), i.e., Ψ maps n-th order LOLA to n + 1-order LOLA. In the following, we show that i LOLA is a fixed point of the LOLA operator, i.e., Ψ(h) = h. It follows from the definition of Ψ that then h is consistent. We denote by the Euclidean norm or the induced operator norm for matrices. We focus on showing Ψ1(h) = h1. The case i = 2 is exactly analogous. For arbitrary θ and n, define ˆθ2 := θ2 + h2(θ) and ˆθn 2 := θ2 + h2 n(θ) as the updated parameter of player 2. First, it is helpful to show that Ψ1(hn)(θ) converges to Ψ1(h)(θ): 0 Ψ1(h)(θ) Ψ1(hn)(θ) (15) = α 1(L1(θ1, θ2 + h2(θ))) 1(L1(θ1, θ2 + hn 2(θ)) (16) = α ( 1h2(θ)) 2L1(θ1, ˆθ2) ( 1hn 2(θ)) 2L1(θ1, ˆθn 2 ) + 1L1(θ1, ˆθ2) 1L1(θ1, ˆθn 2 ) (17) α ( 1h2(θ)) 2L1(θ1, ˆθ2) ( 1hn 2(θ)) 2L1(θ1, ˆθn 2 ) + α 1L1(θ1, ˆθ2) 1L1(θ1, ˆθn 2 ) (18) = α ( 1h2(θ)) ( 2L1(θ1, ˆθ2) 2L1(θ1, ˆθn 2 )) (19) + ( 1h2(θ) 1hn 2(θ)) 2L1(θ1, ˆθn 2 ) + α 1L1(θ1, ˆθ2) 1L1(θ1, ˆθn 2 ) α ( 1h2(θ)) 2L1(θ1, ˆθ2) 2L1(θ1, ˆθn 2 ) + α ( 1h2(θ) 1hn 2(θ)) 2L1(θ1, ˆθn 2 ) + α 1L1(θ1, ˆθ2) 1L1(θ1, ˆθn 2 ) (20) In the last step, we used the following two facts. First, since i L1(θ) is assumed to be continuous in θ2, and limn ˆθn 2 = θ2 + limn hn 2(θ) = θ2 + h2(θ) = ˆθ2 by assumption, it follows that limn 2L1(θ1, ˆθn 2 ) = 2L1(θ1, ˆθ2) and limn 1L1(θ1, ˆθn 2 ) = 1L1(θ1, ˆθ2). Second, by assumption, limn hn 2(θ) = h2(θ). In particular, 2L1(θ1, ˆθn 2 ) must be bounded, and thus the three terms in (20) must all converge to 0 as n . It follows by the sandwich theorem that limn Ψ1(hn)(θ) = Ψ1(h)(θ). Now we can directly prove that Ψ1(h)(θ) = h1(θ). It is 0 Ψ1(h)(θ) h1(θ) (22) = Ψ1(h)(θ) Ψ1(hn)(θ) + Ψ1(hn)(θ) hn 1(θ) + hn 1(θ) h1(θ) (23) Ψ1(h)(θ) Ψ1(hn)(θ) + Ψ1(hn)(θ) hn 1(θ) + hn 1(θ) h1(θ) (24) = Ψ1(h)(θ) Ψ1(hn)(θ) + hn+1 1 (θ) hn 1(θ) + hn 1(θ) h1(θ) (25) where in the last step we have used the above result, as well as the assumption that hn 1(θ) converges pointwise, and thus must also be a Cauchy sequence, so the last and the middle term both converge to zero as well. It follows by the sandwich theorem that Ψ1(h)(θ) = h1(θ). Since θ was arbitrary, this concludes the proof. C. Infinite-order Taylor LOLA In this Section, we repeat the analysis of i LOLA from Section 4.1 for infinite-order Taylor LOLA (Taylor i LOLA). I.e., we define Taylor consistency, and show that Taylor i LOLA satisfies this consistency equation under certain assumptions. This result will be needed for our proof of Proposition 4.4. To begin, assume that some differentiable game with continuously differentiable loss functions L1, L2 is given. Define the Taylor LOLA operator Φ that maps pairs of update functions (f1, f2) to the associated Taylor LOLA update Φi(f) := α i(Li + ( i Li) f i) (27) COLA: Consistent Learning with Opponent-Learning Awareness for i = 1, 2. We then have the following definition. Definition C.1 (Taylor consistency). Two update functions f1, f2 are called Taylor consistent if for any i = 1, 2, we have Φ(f1, f2) = (f1, f2). Next, let hn i denote i s n-th order Taylor LOLA update. 
I.e., hn i := Φi(hn 1) for n 0, where we let h 1 i := 0. Then we define Definition C.2 (Taylor i LOLA). If (hn 1, hn 2) converges pointwise as n , define Taylor i LOLA as the limiting update Finally, we provide a proof that Taylor i LOLA is Taylor consistent; i.e., we give a Taylor version of Proposition 4.3. Proposition C.3. Let hn i denote player i s n-th order Taylor LOLA update. Assume that limn hn i (θ) = hi(θ) and limn ihn i(θ) = ih i(θ) for all θ and i {1, 2}. Then Taylor i LOLA is Taylor consistent. Proof. The proof is exactly analogous to that of Proposition 4.3, but easier. We show Φ(h) = h. It follows from the definition of Φ in Equation 27 that then h is Taylor consistent. We focus on showing Φ1(h) = h1, and the case i = 2 is exactly analogous. First, we show that Φ1(hn)(θ) converges to Φ1(h)(θ) for all θ. Letting n be arbitrary and omitting θ in the following for clarity, it is 0 Φ1(h) Φ1(hn) (28) = α 1(L1 + ( 2L1) h2) + α 1(L1 + ( 2L1) hn 2) (29) = α 12L1h2 ( 2L1) ( 1h2) + 12L1hn 2 + ( 2L1) ( 1hn 2) (30) α 12L1(hn 2 h2) + α ( 2L1) ( 1hn 2 1h2) (31) α 12L1 hn 2 h2 + α 2L1 1hn 2 1h2 (32) In the last step, we used the assumptions that limn hn 2 = h2 and limn hn 2 = h2. It follows by the sandwich theorem that limn Φ1(hn)(θ) = Φ1(h)(θ). It follows from the above that Φ1(h)(θ) = h1(θ), using exactly the same argument as in Equations 22-26 with Φ instead of Ψ. Since θ was arbitrary, this concludes the proof. D. Proof of Proposition 4.4 We begin by proving that LCGD does not coincide with Taylor LOLA and CGD neither coincides with exact nor Taylor i LOLA. It is sufficient to manifest a single counter-example: we consider the Tandem game given by L1 = (x + y)2 2x and L2 = (x + y)2 2y (using x, y instead of θ1, θ2 for simplicity). Throughout this proof we use the notation introduced by Balduzzi et al. (2018) and Letcher et al. (2019b) including the simultaneous gradient, the off-diagonal Hessian and the shaping term of the game as and Ho = 0 12L2 and χ = diag HT o L respectively. Note that in two-player games, Taylor LOLA s shaping term reduces to χ = 12L2 2L1 COLA: Consistent Learning with Opponent-Learning Awareness LCGD = LOLA. Following Sch afer & Anandkumar (2019), LCGD is given by LCGD = α xf αD2 xyf yg yg αD2 yxg xf = α I αD2 xyf αD2 yxg I = α(I αHo)ξ while Taylor LOLA is given (Letcher et al., 2019b) by LOLA = α(I αHo)ξ + α2χ . Any game with χ = 0 will yield a difference between LCGD and LOLA; in particular, χ = 4(x + y) 1 1 in the Tandem game implies that LCGD = LOLA whenever parameters lie outside the measure-zero set {x + y = 0} R2. CGD does not recover HOLA. Since CGD is obtained through a bilinear approximation (Taylor expansion) of the loss functions, one would expect that the authors claim of recovering HOLA is with regards to Taylor (not exact) HOLA. For completeness, and to avoid any doubts for the reader, we prove that CGD neither corresponds to exact nor Taylor HOLA. Following Sch afer & Anandkumar (2019), the series-expansion of CGD is given by 0 αD2 xyf αD2 yxg 0 i=0 ( αHo)iξ and converges to CGD whenever α < 1/ Ho (where denotes the operator norm induced by the Euclidean norm on the space). Assume for contradiction that the series-expansion of CGD recovers HOLA, i.e. that CGDn = HOLAn for all n. In particular, we must have CGD = lim n CGDn = lim n HOLAn = i LOLA whenever α < 1/ Ho . In the tandem game, we have Ho = 2 0 1 1 0 with Ho = 2, so CGD = i LOLA whenever α < 1/2. 
Moreover, Ho being constant implies that HOLAn = CGDn = α i=0 ( αHo)i ξ , so gradients of HOLA also converge pointwise for all α < 1/2. In particular, CGD = i LOLA must satisfy the (exact or Taylor) consistency equations by Proposition 4.3 or Proposition C.3. However, the update for CGD is given by = α(I + αHo) 1ξ = 2α(x + y 1) 1 2α 2α 1 = 2α(x + y 1) For the exact case, the RHS of the first consistency equation is α x (x + y + f2)2 2x = 2α ((1 + xf2)(x + y + f2) 1) = 2α 1 + 2α x + y + 2α(x + y 1) 1 + 2α 1 2α = f1 + 4α2(x + y + 2α) COLA: Consistent Learning with Opponent-Learning Awareness which does not coincide with the LHS of the consistency equation (= f1) whenever parameters lie outside the measure-zero set {x + y + 2α = 0} R2. Similarly for Taylor i LOLA, the RHS of the first consistency equation is α x (x + y)2 2x + 2(x + y)f2 = 2α x + y 1 + 2α(x + y 1) 1 + 2α + 2α(x + y) = f1 + 4α2(x + y) which does not coincide with the LHS of the consistency equation (= f1) whenever parameters lie outside the measure-zero set {x + y = 0} R2. This is a contradiction to consistency; we are done. LCGD = Look Ahead. We have already shown that LCGD is given by α(I αHo)ξ in the proof that LCGD = LOLA. This coincides exactly with Look Ahead following Letcher et al. (2019b). CGD recovers higher-order Look Ahead. The series expansion of CGD is given by i=0 ( αHo)iξ . Also, higher-order (Taylor) Look Ahead is defined recursively by expanding f n+1 1 = α 1 L1(θ1, θ2 + f n 2 ) α 1L1 + 12L1f n 2 f n+1 2 = α 2 L2(θ1 + f n 1 , θ2) α 2L2 + 21L2f n 1 , where is the stop-gradient operator (see (Balduzzi et al., 2018) for details on this operator) and f 1 1 = f 1 2 = 0. This can be written more succinctly as f n+1 1 f n+1 2 = α 1L1 + 12L1f n 2 2L2 + 21L2f n 1 f n 1 f n 2 We prove by induction that f n 1 f n 2 i=0 ( αHo)iξ for all n 0. The base case is trivial; assume the statement holds for any fixed n 0. Then f n+1 1 f n+1 2 i=0 ( αHo)iξ i=1 ( αHo)iξ = α i=0 ( αHo)iξ as required. Finally we conclude Look Aheadn = f n 1 f n 2 i=0 ( αHo)iξ = CGDn as required. E. Proof of Proposition 4.5 We prove that the two pairs of linear functions f1 = f2 = 2(x + y + 1) and f1 = f2 = 1 COLA: Consistent Learning with Opponent-Learning Awareness are solutions to the consistency equations in the Tandem game with α = 1. (See below for a generalization to any α > 0.) For the first pair of functions, we have x L1(x, y + f2) = x (x + y + 2)2 2x = 2(x + y + 1) = f1 for the first consistency equation and similarly y L2(x + f1, y) = x (x + y + 2)2 2y = 2(x + y + 1) = f2 for the second. For the second pair of functions we similarly obtain x L1(x, y + f2) = x 4(x + y + 2)2 2x = 1 2(x + y 2) = f1 for the first consistency equation and y L2(x + f1, y) = x 4(x + y + 2)2 2y = 1 2(x + y 2) = f2 for the second. This shows that both functions are solutions to the consistency equations. For general α > 0, we can similarly show that f1 = f2 = ax + by + c with a = 1 + 8α 1 4α 4α ; b = 2α(1 + a) 1 + 2α(1 + a) ; c = 2α 1 + 2α(1 + a) are two distinct solutions (depending on ) to the consistency equations in the Tandem game, noting that the denominators cannot be 0 for α > 0 (otherwise leading to a contradiction in the expression for a). This is left to the reader, noting that the proof for α = 1 is sufficient to establish that consistent solutions are not always unique. F. Proof of Proposition 4.6 Recall from the proof of Proposition 4.5 that the linear functions f1 = f2 = 2(x + y + 1) are consistent solutions to the Tandem game with α = 1. 
The SFPs of the Tandem game are (x, 1 x) for each x R, but none of these are preserved by the consistent solutions above since f1(x, 1 x) = f2(x, 1 x) = 4 = 0 . We conclude that consistency does not imply preservation of SFPs. Moreover, we prove that any (non-zero) linear solution to the consistency equations cannot preserve more than one SFP in the Tandem game, for any opponent shaping rate α. Assuming it did, we must have linear functions f1 = ax + by + c f2 = a x + b y + c satisfying f1(x, 1 x) = 0 = f1(x , 1 x ) for some x = x R. Subtracting RHS from LHS we obtain (x x )(a b) = 0 hence a = b, which substituted again into the LHS yields b = c. Applying the same method for f2 we obtain a = b = c and so f1, f2 take the form f1 = a(x + y 1) f2 = a (x + y 1). Note that since f1, f2 were assumed to be nonzero, it follows that a, a = 0. Plugging these into the first consistency equation, we obtain a(x + y 1) = 2α [(1 + a ) ((x + y)(1 + a ) a ) 1] . Comparing x terms and constant terms yields a = 2α(1 + a )2 and a = 2α 1 + a + a 2 which concludes the contradiction a = 0. COLA: Consistent Learning with Opponent-Learning Awareness G. Proof of Proposition 4.7 LOLA and SOS diverge. Assume (x0, y0) = 0 and α > 1. We prove the more general claim that p-LOLA diverges for any 0 p 1 (where p may take a different value at each learning step), recalling that LOLA and SOS are both special cases of p-LOLA (Letcher et al., 2019b). Indeed, the p-LOLA gradient update is given by h1 h2 = α(I αHo)ξ + pα2χ = α y + αx(1 + p) x + αy(1 + p) and we show that each update leads to increasing distance from the origin as follows: (x + h1, y + h2) 2 = x2 2xα(y + αx(1 + p)) + α2 y2 + α2x2(1 + p)2 + 2αxy(1 + p) + y2 2yα( x + αy(1 + p)) + α2 x2 + α2y2(1 + p)2 2αxy(1 + p) = x2 + y2 1 α2(2p + 1) + α4(1 + p)2 x2 + y2 1 α2 + α4 := (x, y) 2 λ where the inequality follows because the final expression in p has positive derivative for α > 1, hence minimized at p = 0. Now λ > 1 for any α > 1, so we conclude by induction that (xn, yn) 2 λn (x0, y0) 2 as n , provided (x0, y0) = 0, as required. Consistent solution converges. We begin by showing that the following linear functions satisfy the consistency equations for the Hamiltonian game: f1 f2 = α 1 + 2α2 y + 2αx x + 2αy Indeed, the RHS of the first consistency equation is x y α x + 2αy = α 1 + 2α2 y(1 + 2α2) α( x + 2αy) + αx = α 1 + 2α2 y + 2αx = f1 and similarly for the second equation. To prove uniqueness, assume there is a second pair of linear update functions ˆf1, ˆf2 also satisfying consistency. Let a, b, c R such that ˆf1(x, y) = ax + by + c. Note that substituting the second equation into the first yields ˆf1(x, y) = α x L1(x, y α y L2(x + ˆf1(x, y), y) ) = α y + α 2x + ˆf1(x, y) + x x ˆf1(x, y) + y y ˆf1(x, y) + xy xy ˆf1(x, y) Expanding the above and substituting the equation for ˆf1, we obtain ax + by + c = 2α2x(1 + a) αy(1 + 2αb) α2c for all x, y R, which yields (by comparing coefficients) a(1 + 2α2) = 2α2 ; b(1 + 2α2) = α ; c(1 + α2) = 0. It follows that ˆf1(x, y) = α 1 + 2α2 y + 2αx = f1(x, y), proving the uniqueness of f1. Since f2 is directly determined by f1 via the second consistency equation, this concludes the proof. COLA: Consistent Learning with Opponent-Learning Awareness Finally we prove that this linear update leads to decreasing distance from the origin as follows: (x + f1, y + f2) 2 = x2 2xα 1 + 2α2 (y + 2αx) + α2 (1 + 2α2)2 y2 + 4α2x2 + 4αxy + y2 2yα 1 + 2α2 α( x + 2αy) + α2 (1 + 2α2)2 x2 + 4α2y2 4αxy = x2 + y2 1 α2(3 + 4α2) := (x, y) 2 λ . 
H. Training Details

All code for COLA was implemented in Python and relies on the PyTorch library for automatic differentiation (Paszke et al., 2019).

H.1. Polynomial games

For the polynomial games, COLA uses a neural network with one non-linear layer for each of $h_1(\theta_1, \theta_2)$ and $h_2(\theta_1, \theta_2)$. The non-linearity is a ReLU and the layer has 8 nodes. For training, we randomly sample pairs of parameters uniformly from the region $[-1, 1]$; in general, the size of this region is a hyperparameter. We use a batch size of 8. We found that training improves with a learning rate scheduler, for which we use $\gamma = 0.9$. We train the neural network for 120,000 steps. To compute the consistency loss we use the squared distance measure. The optimizer used is Adam (Kingma & Ba, 2015).

H.2. Non-polynomial games

For the non-polynomial games, we deploy a neural network with 3 non-linear layers using Tanh activation functions, each with 16 nodes. For this type of game, the parameter region is set to $[-7, 7]$, because the parameters are subsequently squashed into probability space, which allows us to cover the full probability space. During training, we use a batch size of 64. The optimizer used is Adam (Kingma & Ba, 2015).
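To make the setup in Section H.1 concrete, the following is a minimal PyTorch sketch (not the authors' released code) of a COLA training loop for the Tandem game with the exact squared-distance consistency loss, at a look-ahead rate of 1.0 as in Figure 4. The layer sizes, activation, batch size, $\gamma$, optimizer, and number of steps follow the text above; the learning rate, the use of two separate networks for $h_1$ and $h_2$, and the scheduler step interval are assumptions.

```python
# Minimal PyTorch sketch of the COLA training loop described in H.1, for the
# Tandem game. Not the authors' code: the learning rate, the use of two
# separate networks, and the scheduler step interval are assumptions; the
# remaining hyperparameters follow the text above.
import torch
import torch.nn as nn

alpha = 1.0
net1 = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))  # h1(theta1, theta2)
net2 = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))  # h2(theta1, theta2)
opt = torch.optim.Adam(list(net1.parameters()) + list(net2.parameters()), lr=1e-3)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)

def L1(x, y): return (x + y) ** 2 - 2 * x  # Tandem losses
def L2(x, y): return (x + y) ** 2 - 2 * y

for step in range(120_000):
    # Sample a batch of parameter pairs uniformly from the training region.
    x = torch.empty(8, 1).uniform_(-1.0, 1.0).requires_grad_(True)
    y = torch.empty(8, 1).uniform_(-1.0, 1.0).requires_grad_(True)
    theta = torch.cat([x, y], dim=1)
    h1, h2 = net1(theta), net2(theta)
    # Right-hand sides of the consistency equations; because x and y feed
    # into h1 and h2, these gradients are taken through the opponent's update.
    g1 = torch.autograd.grad(L1(x, y + h2).sum(), x, create_graph=True)[0]
    g2 = torch.autograd.grad(L2(x + h1, y).sum(), y, create_graph=True)[0]
    # Squared-distance consistency loss.
    loss = ((h1 + alpha * g1) ** 2 + (h2 + alpha * g2) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if (step + 1) % 10_000 == 0:  # scheduler step interval is a placeholder choice
        sched.step()
```

For the non-polynomial games of Section H.2, the same loop would apply with three Tanh layers of 16 units each, a batch size of 64, and parameters sampled from $[-7, 7]$.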
I. Further experimental results

I.1. Tandem game

In this section, we provide more empirical results on the Tandem game. First, in Figure 4, we display a look-ahead regime where SOS, LOLA and HOLA8 diverge in training, whereas COLA and CGD do not. Second, in Figure 5 we compare the gradient fields of HOLA4 and COLA at a low and a high look-ahead rate of 0.1 and 1.0, respectively. The updates are shown on the parameter region of interest $\Theta$ for the Tandem game, which is $[-1, 1]$. Figures 5a and 5c show that the solutions are very similar at low look-ahead rates but become dissimilar at high look-ahead rates, shown in Figures 5b and 5d. This also confirms our quantitative observation in Table 1b.

Figure 4. (a): Consistency loss of COLA at a look-ahead rate of 1.0. (b): Solutions on the Tandem game by COLA, LOLA, HOLA8, CGD, and SOS with a look-ahead rate of 1.0. The standard deviation for the initialization of parameters used here is 0.1, which is standard in the literature (Letcher et al., 2019b).

Figure 5. Gradient fields of the Tandem game at two different look-ahead rates, 0.1 and 1.0.

I.2. Balduzzi and Hamiltonian game

The Hamiltonian game was originally introduced in (Balduzzi et al., 2018) as a minimal example of Hamiltonian dynamics. Recall that its loss functions are
$$L_1(x, y) = xy \quad\text{and}\quad L_2(x, y) = -xy. \tag{34}$$
The Balduzzi game was introduced to investigate the behaviour of differentiable game algorithms when a weak attractor is coupled with strong rotational forces in the Hamiltonian dynamics (Balduzzi et al., 2018), captured by the losses
$$L_1(x, y) = \tfrac{1}{2}x^2 + 10xy \quad\text{and}\quad L_2(x, y) = \tfrac{1}{2}y^2 - 10xy. \tag{35}$$
Results for both games are displayed in Figures 6 and 7. COLA at high look-ahead rates converges considerably faster than the other methods on both games. This supports Proposition 4.7 empirically. In Tables 4a, 4b and 5a we find similar consistency loss behaviour as we do for Tandem. Note, however, that the look-ahead thresholds are at significantly different levels: on the Balduzzi game, the threshold is much lower, at around 0.02, whereas for the Hamiltonian game it lies between 0.1 and 0.4.

Figure 6. (a): Consistency losses of COLA at different look-ahead rates. (b): Solutions on the Balduzzi game by COLA, LOLA, HOLA8, CGD, and SOS with a look-ahead rate of 0.01 (and 0.1 additionally for COLA). The standard deviation for the initialization of parameters used here is 1.0.

Figure 7. (a): Consistency losses of COLA at different look-ahead rates. (b): Solutions on the Hamiltonian game by COLA, LOLA, HOLA8, CGD, and SOS with a look-ahead rate of 0.1 (and 0.9 additionally for COLA). The standard deviation for the initialization of parameters used here is 1.0.

I.3. Matching Pennies

Matching Pennies (MP) is a single-shot, zero-sum game in which two players, A and B, each flip a biased coin (Lee & K, 1967). Player A wins if the outcomes of both flips are the same and player B wins if they are different; the payoffs are given in Table 6. The visual difference between the updates found by COLA and HOLA4 is shown in Figure 8. Whereas they are very similar at a relatively low look-ahead rate, at a high look-ahead rate HOLA4's gradient field shows signs of high variance, especially around the origin, whereas COLA appears to learn a robust update function. Note that HOLA4's update function results in the learning behaviour shown in Figure 2b.

Table 4. On the Hamiltonian game: (a) Log of the squared consistency loss. (b) Cosine similarity between COLA and LOLA, HOLA3, and HOLA6 over different look-ahead rates. The values represent the mean over 1,000 samples, uniformly sampled from the parameter space Θ. The error bars represent one standard deviation and capture the variance over 10 different COLA training runs.

(a)
α      LOLA     HOLA3    HOLA6    COLA
0.9    6.99     4.63     13.68    1e-14 ± 2e-15
0.5    12.67    13.77    13.00    1e-14 ± 2e-15
0.4    5.78     7.14     6.38     1e-14 ± 2e-15
0.1    0.08     0.01     2e-6     7e-15 ± 1e-15
0.05   2e-5     4e-10    3e-15    1e-14 ± 2e-14

(b)
α      LOLA           HOLA3          HOLA6
0.9    1.00 ± 0.00    -1.00 ± 0.00   0.50 ± 0.00
0.5    1.00 ± 0.00    1.00 ± 0.00    1.00 ± 0.00
0.4    1.00 ± 0.00    1.00 ± 0.00    1.00 ± 0.00
0.1    1.00 ± 0.00    1.00 ± 0.00    1.00 ± 0.00
0.05   1.00 ± 0.00    1.00 ± 0.00    1.00 ± 0.00

Table 5. On the Balduzzi game: (a) Log of the squared consistency loss. (b) Cosine similarity between COLA and LOLA, HOLA3, and HOLA6 over different look-ahead rates. The values represent the mean over 1,000 samples, uniformly sampled from the parameter space Θ. The error bars represent one standard deviation and capture the variance over 10 different COLA training runs.

(a)
α      LOLA     HOLA3    HOLA6    COLA
0.9    2e+6     5e+10    4e+17    5e-13 ± 1e-13
0.1    3e+2     1e+3     2e+4     7e-13 ± 1e-13
0.05   2e+1     4.01     1.03     7e-13 ± 9e-14
0.03   2.16     0.07     0.01     8e-13 ± 1e-13
0.01   0.03     1e-5     2e-10    7e-13 ± 1e-13

(b)
α      LOLA           HOLA3          HOLA6
0.9    1.00 ± 0.00    -1.00 ± 0.00   0.01 ± 4e-9
0.1    1.00 ± 0.00    1.00 ± 0.00    0.40 ± 1e-8
0.05   1.00 ± 0.00    1.00 ± 0.00    1.00 ± 0.00
0.03   1.00 ± 0.00    1.00 ± 0.00    1.00 ± 0.00
0.01   1.00 ± 0.00    1.00 ± 0.00    1.00 ± 0.00

Table 6. Payoff Matrix for the Matching Pennies game.

        Head        Tail
Head    (+1, -1)    (-1, +1)
Tail    (-1, +1)    (+1, -1)

Table 7. Cosine similarities over multiple COLA training runs on the MP and Tandem game for different look-ahead rates.

Game@LR        Cosine Sim
MP@10          0.97 ± 0.01
MP@0.5         0.99 ± 0.01
Tandem@0.1     1.00 ± 0.00
Tandem@1.0     0.98 ± 0.01
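The consistency losses and cosine similarities reported in Tables 4, 5, 7 (and 8 below) can be estimated with the same recipe for any pair of update functions. The sketch below (not the paper's evaluation code) illustrates it on the Hamiltonian game using the closed-form LOLA and consistent (COLA) updates derived in Appendix G; the sampling region $[-1, 1]^2$ and the look-ahead rate are assumptions, and the numbers are not meant to reproduce the table entries, which use a different normalization.

```python
# Minimal sketch (not the authors' evaluation code) of how entries like those
# in Tables 4 and 5 can be estimated: sample points uniformly from the
# parameter region, then report the mean cosine similarity between two update
# functions and their mean squared consistency residuals. The closed-form
# LOLA and consistent updates for the Hamiltonian game come from Appendix G;
# the sampling region [-1, 1]^2 is an assumption.
import torch

alpha = 0.9
x = torch.empty(1000, 1).uniform_(-1.0, 1.0).requires_grad_(True)
y = torch.empty(1000, 1).uniform_(-1.0, 1.0).requires_grad_(True)

def L1(x, y): return x * y       # Hamiltonian losses
def L2(x, y): return -x * y

def lola(x, y):                  # LOLA update (p = 1) from Appendix G
    return -alpha * (y + 2 * alpha * x), -alpha * (-x + 2 * alpha * y)

def cola(x, y):                  # unique linear consistent update
    c = -alpha / (1 + 2 * alpha**2)
    return c * (y + 2 * alpha * x), c * (-x + 2 * alpha * y)

def consistency_loss(update):
    f1, f2 = update(x, y)
    g1 = torch.autograd.grad(L1(x, y + f2).sum(), x, retain_graph=True)[0]
    g2 = torch.autograd.grad(L2(x + f1, y).sum(), y)[0]
    return ((f1 + alpha * g1) ** 2 + (f2 + alpha * g2) ** 2).mean()

u = torch.cat(lola(x, y), dim=1)
v = torch.cat(cola(x, y), dim=1)
print(torch.nn.functional.cosine_similarity(u, v, dim=1).mean())  # 1.0: the two updates are parallel
print(consistency_loss(lola))    # grows quickly with alpha
print(consistency_loss(cola))    # ~0 up to floating point error
```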
I.4. Ultimatum game

Here we provide more empirical results on the Ultimatum game. First, in Tables 8a and 8b we display the consistency losses of COLA and HOLA at different look-ahead rates. In comparison to the polynomial games, we observe that COLA's consistency losses are not as low as HOLAn's losses at low look-ahead rates. Nonetheless, they are low enough to constitute a consistent solution. Moreover, the COLA and HOLAn solutions are very similar according to the cosine similarity score. Qualitatively, we compare the updates in Figure 9. Similarly to the MP game, we observe that HOLA4's update shows higher variance around the origin than COLA's update. This variance is reflected in HOLA4's learning behaviour shown in Figure 10d, where HOLA4 does not converge to the fair solution consistently.

Figure 8. Gradient fields of the MP game at two different look-ahead rates: 0.5 (RHS) and 10 (LHS). COLA is on the upper row, HOLA4 is on the lower row.

Figure 9. Gradient fields of the Ultimatum game at two different look-ahead rates: 0.2 (RHS) and 1.1 (LHS). COLA is on the upper row, HOLA4 is on the lower row.

Table 8. On the Ultimatum game: Over multiple look-ahead rates we compare (a) the consistency losses and (b) the cosine similarity between COLA and LOLA, HOLA2, and HOLA4. The values represent the mean over 1,000 samples, uniformly sampled from the parameter space Θ. The error bars represent one standard deviation and capture the variance over 10 different COLA training runs.

(a)
α       LOLA     HOLA2    HOLA4    COLA
1.1     2e-3     5e-3     0.01     4e-4 ± 5e-5
0.7     4e-4     2e-4     2e-4     7e-5 ± 1e-5
0.3     3e-5     2e-7     5e-8     3e-6 ± 2e-6
0.1     2e-7     1e-11    6e-13    4e-6 ± 4e-6
0.001   3e-15    4e-17    9e-17    4e-6 ± 3e-6

(b)
α       LOLA           HOLA2          HOLA4
1.1     0.96 ± 0.01    0.96 ± 0.01    0.95 ± 0.01
0.7     0.98 ± 0.01    0.98 ± 0.01    0.98 ± 0.01
0.3     0.99 ± 0.01    0.99 ± 0.01    0.99 ± 0.01
0.1     0.99 ± 0.01    0.99 ± 0.01    0.99 ± 0.01
0.001   0.99 ± 0.01    0.99 ± 0.01    0.99 ± 0.01

I.5. Chicken game

In the Chicken game, an agent can choose to yield in order to avoid a catastrophic payoff, but faces a small punishment if they are the only agent to yield. Imagine a game where two agents drive towards each other in their cars. If neither swerves, they crash into each other frontally, an obviously catastrophic outcome. If either of the agents "chickens out", i.e. swerves, they do not crash, but the swerving agent receives a small punishment for having chickened out, while the other agent is rewarded for staying on track, as quantified in Table 9.

Table 9. Payoff Matrix for the Chicken game.

               C (swerve)   D (straight)
C (swerve)     (0, 0)       (-1, +1)
D (straight)   (+1, -1)     (-100, -100)

Next, we report the consistency losses on the Chicken game in Table 10a. We identify a look-ahead rate threshold beyond which HOLAn's consistency loss becomes increasingly large with increasing order. For the Chicken game, this threshold is fairly low, between 0.01 and 0.05. Interestingly, we note that around look-ahead rates of 0.05 and 0.1 it also becomes harder to find a consistent solution for COLA.
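For concreteness, the payoff matrix in Table 9 induces a differentiable one-shot game in the usual way. The sketch below is an assumption about the exact parameterization rather than the paper's code: each agent's scalar parameter is squashed through a sigmoid into a probability of swerving, and each loss is the negative expected payoff; any of the update rules discussed above can then be applied to these losses.

```python
# Minimal sketch (the parameterization is an assumption, not taken from the
# paper's code) of the Chicken game as a differentiable one-shot game: each
# agent's scalar parameter is squashed through a sigmoid into a probability
# of playing C (swerve), and each loss is the negative expected payoff under
# the payoff matrix in Table 9.
import torch

payoff1 = torch.tensor([[0.0, -1.0], [1.0, -100.0]])   # row player, rows/cols ordered C, D
payoff2 = torch.tensor([[0.0, 1.0], [-1.0, -100.0]])   # column player

def chicken_losses(theta1, theta2):
    p1 = torch.sigmoid(theta1)      # P(agent 1 swerves)
    p2 = torch.sigmoid(theta2)      # P(agent 2 swerves)
    # Joint action probabilities in the order CC, CD, DC, DD.
    probs = torch.stack([p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2)])
    return -(probs * payoff1.flatten()).sum(), -(probs * payoff2.flatten()).sum()

theta1 = torch.tensor(0.0, requires_grad=True)   # parameters live in [-7, 7] (Section H.2)
theta2 = torch.tensor(0.0, requires_grad=True)
print(chicken_losses(theta1, theta2))            # both losses equal 25.0 at the uniform policy
```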
Table 10. On the Chicken game: Over multiple look-ahead rates we compare (a) the consistency losses of LOLA, HOLA2, HOLA4, SOS, CGD, and COLA and (b) the cosine similarity between COLA and LOLA, HOLA2, and HOLA4. The values represent the mean over 1,000 samples, uniformly sampled from the parameter space Θ. The error bars represent one standard deviation and capture the variance over 10 different COLA training runs.

(a)
α      LOLA     HOLA2    HOLA4    SOS     CGD     COLA
1.0    2429     3892     46637    1494    1677    0.01 ± 0.01
0.5    643      484      4320     475     2330    0.03 ± 0.01
0.1    11.99    7.69     73.28    2.73    8.46    0.70 ± 0.10
0.05   0.84     0.17     0.47     0.37    1.31    0.06 ± 0.01
0.01   9e-4     3e-6     6e-9     2e-4    0.04    5e-4 ± 3e-4

(b)
α      LOLA           HOLA2          HOLA4
1.0    0.39 ± 0.03    0.57 ± 0.02    0.57 ± 0.03
0.5    0.54 ± 0.04    0.62 ± 0.03    0.64 ± 0.04
0.1    0.88 ± 0.03    0.88 ± 0.03    0.86 ± 0.03
0.05   0.93 ± 0.03    0.93 ± 0.03    0.93 ± 0.03
0.01   0.96 ± 0.02    0.96 ± 0.02    0.96 ± 0.01

We also perform a qualitative comparison of the gradient fields of COLA and HOLA4 on the Chicken game (see Figure 11). The difference at high look-ahead rates is pronounced, as HOLA4 shows high variance around the origin. Nonetheless, the gradient field leads to swerving in the actual game (see Figure 12d), which is a preferable outcome. Moreover, HOLA4 appears to converge to swerving consistently despite the chaotic gradient field, showcasing that the behaviour in the game is not necessarily correlated with the qualitative analysis of the gradient fields.

Figure 10. (a) and (c): Consistency losses of COLA at different look-ahead rates. (b) and (d): Solutions on the Ultimatum game by COLA, LOLA, HOLA4, CGD, and SOS with look-ahead rates of 0.2 and 5.0. The standard deviation for the initialization of parameters used here is 1.0.

Figure 11. Gradient fields of the Chicken game for COLA and HOLA4 at two different look-ahead rates, 0.01 (LHS) and 1.0 (RHS).

As a baseline comparison, we show the performance of different state-of-the-art algorithms on the IPD (see Table 11 for the payoffs) in Figure 13. CGD does not recover the tit-for-tat policy, whereas exact LOLA, Taylor LOLA and SOS do.

Table 11. Payoff Matrix for the IPD game.

     C           D
C    (-1, -1)    (-3, 0)
D    (0, -3)     (-2, -2)

Figure 12. (a) and (c): Consistency losses of COLA at different look-ahead rates. (b) and (d): Solutions on the Chicken game by COLA, TLOLA, HOLA4, CGD, and SOS with look-ahead rates of 0.04 and 0.5. We used a standard deviation of 1.0 to initialize the parameters.

Figure 13. CGD, SOS, Taylor LOLA (TLOLA), Exact LOLA (ELOLA) and Naive Learning (NL) on the IPD at a look-ahead rate of 1.0. We used a standard deviation of 1.0 to initialize the parameters.
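For reference, the exact IPD losses underlying Figure 13 can be written in closed form for memory-one agents, following the standard construction from the LOLA line of work (Foerster et al., 2018a). The sketch below is a hedged reconstruction rather than the authors' code: the discount factor $\gamma = 0.96$, the $(1-\gamma)$ normalization, and the parameter layout (one cooperation logit for the first move and one per previous joint outcome) are assumptions.

```python
# Hedged sketch (not the authors' code) of exact, differentiable IPD losses
# for memory-one agents, using the payoffs from Table 11. Each agent has 5
# parameters: a cooperation logit for the first move and one for each
# previous joint outcome (CC, CD, DC, DD). The discount factor and the
# (1 - gamma) normalization are assumptions.
import torch

gamma = 0.96
payoff1 = torch.tensor([-1.0, -3.0, 0.0, -2.0])  # agent 1's payoff in CC, CD, DC, DD
payoff2 = torch.tensor([-1.0, 0.0, -3.0, -2.0])  # agent 2's payoff

def ipd_losses(theta1, theta2):
    s1, s2 = torch.sigmoid(theta1), torch.sigmoid(theta2)
    p1_0, p1 = s1[0], s1[1:]
    # Agent 2 conditions on the last outcome from its own perspective, so its
    # CD and DC probabilities are swapped relative to agent 1's ordering.
    p2_0, p2 = s2[0], s2[torch.tensor([1, 3, 2, 4])]
    # Transition matrix over joint outcomes (CC, CD, DC, DD) and initial distribution.
    P = torch.stack([p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2)], dim=1)
    p0 = torch.stack([p1_0 * p2_0, p1_0 * (1 - p2_0),
                      (1 - p1_0) * p2_0, (1 - p1_0) * (1 - p2_0)])
    visit = p0 @ torch.linalg.inv(torch.eye(4) - gamma * P)  # discounted state visitation
    return -(1 - gamma) * (visit @ payoff1), -(1 - gamma) * (visit @ payoff2)

theta1 = torch.zeros(5, requires_grad=True)
theta2 = torch.zeros(5, requires_grad=True)
print(ipd_losses(theta1, theta2))  # (1.5, 1.5): both agents lose 1.5 per step under uniform play
```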