STABLE OPPONENT SHAPING IN DIFFERENTIABLE GAMES

Alistair Letcher1 Jakob Foerster1 David Balduzzi2 Tim Rocktäschel3 Shimon Whiteson1
1University of Oxford 2DeepMind 3University College London

A growing number of learning methods are actually differentiable games whose players optimise multiple, interdependent objectives in parallel, from GANs and intrinsic curiosity to multi-agent RL. Opponent shaping is a powerful approach to improve learning dynamics in these games, accounting for player influence on others' updates. Learning with Opponent-Learning Awareness (LOLA) is a recent algorithm that exploits this response and leads to cooperation in settings like the Iterated Prisoner's Dilemma. Although experimentally successful, we show that LOLA agents can exhibit arrogant behaviour directly at odds with convergence. In fact, remarkably few algorithms have theoretical guarantees applying across all (n-player, non-convex) games. In this paper we present Stable Opponent Shaping (SOS), a new method that interpolates between LOLA and a stable variant named LookAhead. We prove that LookAhead converges locally to equilibria and avoids strict saddles in all differentiable games. SOS inherits these essential guarantees, while also shaping the learning of opponents and consistently either matching or outperforming LOLA experimentally.

1 INTRODUCTION

Problem Setting. While machine learning has traditionally focused on optimising single objectives, generative adversarial nets (GANs) (Goodfellow et al., 2014) have showcased the potential of architectures dealing with multiple interacting goals. They have since then proliferated substantially, including intrinsic curiosity (Pathak et al., 2017), imaginative agents (Racanière et al., 2017), synthetic gradients (Jaderberg et al., 2017), hierarchical reinforcement learning (RL) (Wayne & Abbott, 2014; Vezhnevets et al., 2017) and multi-agent RL in general (Busoniu et al., 2008). These can effectively be viewed as differentiable games played by cooperating and competing agents, which may simply be different internal components of a single system, like the generator and discriminator in GANs. The difficulty is that each loss depends on all parameters, including those of other agents. While gradient descent on single functions has been widely successful, converging to local minima under rather mild conditions (Lee et al., 2017), its simultaneous generalisation can fail even in simple two-player, two-parameter zero-sum games. No algorithm has yet been shown to converge, even locally, in all differentiable games.

Related Work. Convergence has widely been studied in convex n-player games, see especially Rosen (1965); Facchinei & Kanzow (2007). However, the recent success of non-convex games exemplified by GANs calls for a better understanding of this general class, where comparatively little is known. Mertikopoulos & Zhou (2018) recently prove local convergence of no-regret learning to variationally stable equilibria, though under a number of regularity assumptions. Conversely, a number of algorithms have been successful in the non-convex setting for restricted classes of games.
These include policy prediction in two-player two-action bimatrix games (Zhang & Lesser, 2010); WoLF in two-player two-action games (Bowling & Veloso, 2001); AWESOME in repeated games (Conitzer & Sandholm, 2007); Optimistic Mirror Descent in two-player bilinear zero-sum games (Daskalakis et al., 2018) and Consensus Optimisation (CO) in two-player zero-sum games (Mescheder et al., 2017). An important body of work including Heusel et al. (2017); Nagarajan & Kolter (2017) has also appeared for the specific case of GANs. Working towards bridging this gap, some of the authors recently proposed Symplectic Gradient Adjustment (SGA), see Balduzzi et al. (2018). This algorithm is provably attracted to stable fixed points while repelled from unstable ones in all differentiable games (n-player, non-convex). Nonetheless, these results are weaker than strict convergence guarantees. Moreover, SGA agents may act against their own self-interest by prioritising stability over individual loss. SGA was also discovered independently by Gemp & Mahadevan (2018), drawing on variational inequalities.

In a different direction, Learning with Opponent-Learning Awareness (LOLA) (Foerster et al., 2018) modifies the learning objective by predicting and differentiating through opponent learning steps. This is intuitively appealing and experimentally successful, encouraging cooperation in settings like the Iterated Prisoner's Dilemma (IPD) where more stable algorithms like SGA defect. However, LOLA has no guarantees of converging or even of preserving fixed points of the game.

Contribution. We begin by constructing the first explicit tandem game where LOLA agents adopt arrogant behaviour and converge to non-fixed points. We pinpoint the cause of failure and show that a natural variant named LookAhead (LA), discovered before LOLA by Zhang & Lesser (2010), successfully preserves fixed points. We then prove that LookAhead locally converges and avoids strict saddles in all differentiable games, filling a theoretical gap in multi-agent learning. This is enabled through a unified approach based on fixed-point iterations and dynamical systems. These techniques apply equally well to algorithms like CO and SGA, though this is not our present focus. While LookAhead is theoretically robust, the shaping component endowing LOLA with a capacity to exploit opponent dynamics is lost. We solve this dilemma with an algorithm named Stable Opponent Shaping (SOS), trading between stability and exploitation by interpolating between LookAhead and LOLA. Using an intuitive and theoretically grounded criterion for this interpolation parameter, SOS inherits both strong convergence guarantees from LA and opponent shaping from LOLA.

On the experimental side, we show that SOS plays tit-for-tat in the IPD on par with LOLA, while all other methods mostly defect. We display the practical consequences of our theoretical guarantees in the tandem game, where SOS always outperforms LOLA. Finally we implement a more involved GAN setup, testing for mode collapse and mode hopping when learning Gaussian mixture distributions. SOS successfully spreads mass across all Gaussians, at least matching dedicated algorithms like CO, while LA is significantly slower and simultaneous gradient descent fails entirely.

2 BACKGROUND

2.1 DIFFERENTIABLE GAMES

We frame the problem of multi-agent learning as a game. Adapted from Balduzzi et al. (2018), the following definition insists only on differentiability for gradient-based methods to apply.
This concept is strictly more general than stochastic games, whose parameters are usually restricted to action-state transition probabilities or functional approximations thereof.

Definition 1. A differentiable game is a set of $n$ players with parameters $\theta = (\theta^1, \ldots, \theta^n) \in \mathbb{R}^d$ and twice continuously differentiable losses $L^i : \mathbb{R}^d \to \mathbb{R}$, where $\theta^i \in \mathbb{R}^{d_i}$ for each $i$ and $\sum_i d_i = d$.

Crucially, note that each loss is a function of all parameters. From the viewpoint of player $i$, parameters can be written as $\theta = (\theta^i, \theta^{-i})$, where $\theta^{-i}$ contains all other players' parameters. We do not make the common assumption that each $L^i$ is convex as a function of $\theta^i$ alone, for any fixed opponent parameters $\theta^{-i}$, nor do we restrict $\theta$ to the probability simplex, though this restriction can be recovered via projection or sigmoid functions $\sigma : \mathbb{R} \to [0, 1]$. If $n = 1$, the game is simply to minimise a given loss function. In this case one can reach local minima by (possibly stochastic) gradient descent (GD). For arbitrary $n$, the standard solution concept is that of Nash equilibria.

Definition 2. A point $\bar\theta \in \mathbb{R}^d$ is a (local) Nash equilibrium if for each $i$, there is a neighbourhood $U_i$ of $\bar\theta^i$ such that $L^i(\theta^i, \bar\theta^{-i}) \geq L^i(\bar\theta)$ for all $\theta^i \in U_i$.

In other words, each player's strategy is a local best response to the current opponent strategies. We write $\nabla_i L^k = \nabla_{\theta^i} L^k$ and $\nabla_{ij} L^k = \nabla_{\theta^j}\nabla_{\theta^i} L^k$ for any $i, j, k$. Define the simultaneous gradient of the game as the concatenation of each player's gradient,
$$\xi = (\nabla_1 L^1, \ldots, \nabla_n L^n) \in \mathbb{R}^d.$$
The $i$th component of $\xi$ is the direction of greatest increase in $L^i$ with respect to $\theta^i$. If each agent minimises their loss independently from others, they perform GD on their component $\nabla_i L^i$ with learning rate $\alpha_i$. Hence, the parameter update for all agents is given by $\theta \leftarrow \theta - \alpha\odot\xi$, where $\alpha = (\alpha_1, \ldots, \alpha_n)$ and $\odot$ is element-wise multiplication. This is also called naive learning (NL), reducing to $\theta \leftarrow \theta - \alpha\xi$ if agents have the same learning rate. This is assumed for notational simplicity, though it is irrelevant to our results. The following example shows that NL can fail to converge.

Example 1. Consider $L^{1/2} = \pm xy$, where players control the $x$ and $y$ parameters respectively. The origin is a (global and unique) Nash equilibrium. The simultaneous gradient is $\xi = (y, -x)$ and cycles around the origin. Explicitly, a gradient step from $(x, y)$ yields
$$(x, y) \leftarrow (x, y) - \alpha(y, -x) = (x - \alpha y, y + \alpha x),$$
which has distance from the origin $\sqrt{(1 + \alpha^2)(x^2 + y^2)} > \sqrt{x^2 + y^2}$ for any $\alpha > 0$ and $(x, y) \neq 0$. It follows that agents diverge away from the origin for any $\alpha > 0$.

The cause of failure is that $\xi$ is not the gradient of a single function, implying that each agent's loss is inherently dependent on others. This results in a contradiction between the non-stationarity of each agent and the optimisation of each loss independently from the others. Failure of convergence in this simple two-player zero-sum game shows that gradient descent does not generalise well to differentiable games.
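The divergence in Example 1 is easy to reproduce numerically. The following is a minimal sketch (ours, not code from the paper) of naive learning on $L^{1/2} = \pm xy$, tracking the distance to the Nash equilibrium at the origin.

```python
import numpy as np

def naive_learning(theta, alpha=0.05, steps=500):
    """Simultaneous gradient descent on L1 = x*y, L2 = -x*y (Example 1).
    The simultaneous gradient is xi = (y, -x)."""
    history = [np.linalg.norm(theta)]
    for _ in range(steps):
        x, y = theta
        xi = np.array([y, -x])          # each player follows its own gradient
        theta = theta - alpha * xi
        history.append(np.linalg.norm(theta))
    return theta, history

theta, history = naive_learning(np.array([0.1, 0.1]))
print(history[0], history[-1])  # distance to the origin grows at every step
```

Each step multiplies the distance to the origin by $\sqrt{1+\alpha^2} > 1$, so the iterates spiral outwards no matter how small the learning rate.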
We consider an alternative solution concept to Nash equilibria before introducing LOLA.

2.2 STABLE FIXED POINTS

Consider the game given by $L^1 = L^2 = xy$, where players control the $x$ and $y$ parameters respectively. The 'optimal solution' is $(x, y) \to (\pm\infty, \mp\infty)$, since then $L^1 = L^2 \to -\infty$. However the origin is a global Nash equilibrium, while also a saddle point of $xy$. It is highly undesirable to converge to the origin in this game, since infinitely better losses can be reached in the anti-diagonal direction. In this light, Nash equilibria cannot be the right solution concept to aim for in multi-agent learning.

To define stable fixed points, first introduce the Hessian of the game as the block matrix
$$H = \begin{pmatrix} \nabla_{11}L^1 & \cdots & \nabla_{1n}L^1 \\ \vdots & \ddots & \vdots \\ \nabla_{n1}L^n & \cdots & \nabla_{nn}L^n \end{pmatrix} \in \mathbb{R}^{d\times d}.$$
This can equivalently be viewed as the Jacobian of the vector field $\xi$. Importantly, note that $H$ is not symmetric in general unless $n = 1$, in which case we recover the usual Hessian $H = \nabla^2 L$.

Definition 3. A point $\bar\theta$ is a fixed point if $\xi(\bar\theta) = 0$. It is stable if $H(\bar\theta) \succeq 0$, unstable if $H(\bar\theta) \prec 0$ and a strict saddle if $H(\bar\theta)$ has an eigenvalue with negative real part.

The name 'fixed point' is coherent with GD, since $\xi(\bar\theta) = 0$ implies a fixed update $\bar\theta \leftarrow \bar\theta - \alpha\xi(\bar\theta) = \bar\theta$. Though Nash equilibria were shown to be inadequate above, it is not obvious that stable fixed points (SFPs) are a better solution concept. In Appendix A we provide intuition for why SFPs are both closer to local minima in the context of multi-loss optimisation, and more tractable for convergence proofs. Moreover, this definition is an improved variant on that in Balduzzi et al. (2018), assuming positive semi-definiteness only at $\bar\theta$ instead of holding in a neighbourhood. This makes the class of SFPs as large as possible, while remaining sufficient for all our theoretical results.

Assuming invertibility of $H(\bar\theta)$ at SFPs is crucial to all convergence results in this paper. The same assumption is present in related work including Mescheder et al. (2017), and cannot be avoided. Even for single losses, a fixed point with singular Hessian can be a local minimum, maximum, or saddle point. Invertibility is thus necessary to ensure that SFPs really are 'local minima'. This assumption is left implicit from now on. Finally note that unstable fixed points are a subset of strict saddles, making Theorem 6 both stronger and more general than the results for SGA by Balduzzi et al. (2018).

2.3 LEARNING WITH OPPONENT-LEARNING AWARENESS (LOLA)

Accounting for nonstationarity, Learning with Opponent-Learning Awareness (LOLA) modifies the learning objective by predicting and differentiating through opponent learning steps (Foerster et al., 2018). For simplicity, if $n = 2$ then agent 1 optimises $L^1(\theta^1, \theta^2 + \Delta\theta^2)$ with respect to $\theta^1$, where $\Delta\theta^2$ is the predicted learning step for agent 2. Foerster et al. (2018) assume that opponents are naive learners, namely $\Delta\theta^2 = -\alpha_2\nabla_2 L^2$. After first-order Taylor expansion, the loss is approximately given by $L^1 + \nabla_2 L^1 \cdot \Delta\theta^2$. By minimising this quantity, agent 1 learns parameters that align the opponent learning step $\Delta\theta^2$ with the direction of greatest decrease in $L^1$, exploiting opponent dynamics to further reduce its own losses. Differentiating with respect to $\theta^1$, the adjustment is
$$\nabla_1 L^1 + (\nabla_{21}L^1)^\top\Delta\theta^2 + (\nabla_1\Delta\theta^2)^\top\nabla_2 L^1.$$
By explicitly differentiating through $\Delta\theta^2$ in the rightmost term, LOLA agents actively shape opponent learning. This has proven effective in reaching cooperative equilibria in multi-agent learning, finding success in a number of games including tit-for-tat in the IPD. The middle term above was originally dropped by the authors because LOLA focuses 'on this shaping of the learning direction of the opponent'. We choose not to eliminate this term, as is also inherent in LOLA-DiCE (Foerster et al., 2018). Preserving both terms will in fact be key to developing stable opponent shaping.

First we formulate n-player LOLA in vectorial form. Let $H_d$ and $H_o$ be the matrices of diagonal and off-diagonal blocks of $H$, so that $H = H_d + H_o$. Also define $\nabla L = (\nabla L^1, \ldots, \nabla L^n)$ and the operator $\operatorname{diag} : \mathbb{R}^{d\times n} \to \mathbb{R}^d$ constructing a vector from the block-matrix diagonal, namely $\operatorname{diag}(M)_i = M_{ii}$.

Proposition 1 (Appendix B). Writing $\chi = \operatorname{diag}(H_o^\top\nabla L)$, the LOLA gradient adjustment is
$$\text{LOLA} = (I - \alpha H_o)\xi - \alpha\chi.$$
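To make the vectorial form concrete, the following sketch (ours, not the authors' code) assembles $\xi$, $H_o$, $\chi$ and the LOLA adjustment of Proposition 1 with automatic differentiation, for the two-player game of Example 1 where each player has a single scalar parameter. Function names are ours; the `torch.autograd` calls are standard.

```python
import torch

# Example 1: L1 = x*y, L2 = -x*y, players control x and y respectively.
def L1(x, y): return x * y
def L2(x, y): return -x * y

def lola_adjustment(x0, y0, alpha=0.1):
    """Assemble xi, H_o, chi and the LOLA adjustment (I - alpha*H_o) xi - alpha*chi
    for a two-player game with one scalar parameter per player."""
    x = torch.tensor(x0, requires_grad=True)
    y = torch.tensor(y0, requires_grad=True)
    l1, l2 = L1(x, y), L2(x, y)

    # First derivatives, kept in the graph so they can be differentiated again.
    g1x, g1y = torch.autograd.grad(l1, (x, y), create_graph=True)
    g2x, g2y = torch.autograd.grad(l2, (x, y), create_graph=True)
    xi = torch.stack([g1x, g2y])                  # simultaneous gradient

    # Off-diagonal Hessian blocks: (H_o)_12 = d/dy (dL1/dx), (H_o)_21 = d/dx (dL2/dy).
    h12 = torch.autograd.grad(g1x, y, retain_graph=True)[0]
    h21 = torch.autograd.grad(g2y, x, retain_graph=True)[0]
    H_o = torch.stack([torch.stack([torch.zeros(()), h12]),
                       torch.stack([h21, torch.zeros(())])])

    # chi_i = sum_{j != i} (grad_{ji} L^j)^T grad_j L^i, here just scalar products.
    chi = torch.stack([h21 * g1y, h12 * g2x])

    return (torch.eye(2) - alpha * H_o) @ xi - alpha * chi

# At (1, 1): xi = (1, -1), H_o = [[0, 1], [-1, 0]], chi = (-1, -1).
print(lola_adjustment(1.0, 1.0))  # tensor([1.2000, -0.8000], ...)
```

Unlike naive learning, the resulting adjustment pulls the iterates towards the origin in this game, since the cross terms act against the cycling component of $\xi$.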
While experimentally successful, LOLA fails to preserve fixed points $\bar\theta$ of the game, since
$$(I - \alpha H_o)\xi(\bar\theta) - \alpha\chi(\bar\theta) = -\alpha\chi(\bar\theta) \neq 0$$
in general. Even if $\bar\theta$ is a Nash equilibrium, the update $\theta \leftarrow \theta - \alpha\,\text{LOLA}$ can push agents away from $\bar\theta$ despite parameters being optimal. This may worsen the losses for all agents, as in the game below.

Example 2 (Tandem). Imagine a tandem controlled by agents facing opposite directions, who feed $x$ and $y$ force into their pedals respectively. Negative numbers correspond to pedalling backwards. Moving coherently requires $x \approx -y$, embodied by a quadratic loss $(x+y)^2$. However it is easier for agents to pedal forwards, translated by linear losses $-2x$ and $-2y$. The game is thus given by $L^1(x, y) = (x+y)^2 - 2x$ and $L^2(x, y) = (x+y)^2 - 2y$. These sub-goals are incompatible, so agents cannot simply accelerate forwards. The SFPs are given by $\{x + y = 1\}$. Computing $\chi(x, 1-x) = (4, 4) \neq 0$, none of these are preserved by LOLA. Instead, we show in Appendix C that LOLA can only converge to sub-optimal scenarios with worse losses for both agents, for any $\alpha$.

Figure 1: Illustration of the tandem game.

Intuitively, the root of failure is that LOLA agents try to shape opponent learning and enforce compliance by accelerating forwards, assuming a dynamic response from their opponent. The other agent does the same, so they become arrogant and suffer by pushing strongly in opposite directions.

3.1 LOOKAHEAD

The shaping term $\chi$ prevents LOLA from preserving fixed points. Consider removing this component entirely, giving $(I - \alpha H_o)\xi$. This variant preserves fixed points, but what does it mean from the perspective of each agent? Note that LOLA optimises $L^1(\theta^1, \theta^2 + \Delta\theta^2)$ with respect to $\theta^1$, while $\Delta\theta^2$ is a function of $\theta^1$. In other words, we assume that our opponent's learning step depends on our current optimisation with respect to $\theta^1$. This is inaccurate, since opponents cannot see our updated parameters until the next step. Instead, assume we optimise $L^1(\theta^1, \hat\theta^2 + \Delta\theta^2(\hat\theta^1, \hat\theta^2))$, where $\hat\theta^1, \hat\theta^2$ are the current parameters. After Taylor expansion, the gradient with respect to $\theta^1$ is given by $\nabla_1 L^1 + (\nabla_{21}L^1)^\top\Delta\theta^2$, since $\Delta\theta^2(\hat\theta^1, \hat\theta^2)$ does not depend on $\theta^1$. In vectorial form, we recover the variant $(I - \alpha H_o)\xi$, since the shaping term corresponds precisely to differentiating through $\Delta\theta^2$. We name this LookAhead, which was discovered before LOLA by Zhang & Lesser (2010) though not explicitly named. Using the stop-gradient operator $\perp$ (implemented in TensorFlow as stop_gradient and in PyTorch as detach), this can be reformulated as optimising $L^1(\theta^1, \theta^2 + \perp\Delta\theta^2)$, where $\perp$ prevents gradient from flowing through $\Delta\theta^2$ upon differentiation.

The main result of Zhang & Lesser (2010) is that LookAhead converges to Nash equilibria in the small class of two-player, two-action bimatrix games. We will prove local convergence to SFP and non-convergence to strict saddles in all differentiable games. On the other hand, by discarding the problematic shaping term, we have also eliminated LOLA's capacity to exploit opponent dynamics and encourage cooperation. This will be witnessed in the IPD, where LookAhead agents mostly defect.

3.2 STABLE OPPONENT SHAPING (SOS)

We propose Stable Opponent Shaping (SOS), an algorithm preserving both advantages at once. Define the partial stop-gradient operator $\perp_p := p\perp + (1-p)I$, where $I$ is the identity and $p$ stands for partial. A p-LOLA agent optimises the modified objective
$$L^1(\theta^1, \theta^2 + \perp_{1-p}\Delta\theta^2, \ldots, \theta^n + \perp_{1-p}\Delta\theta^n),$$
collapsing to LookAhead at $p = 0$ and LOLA at $p = 1$. The resulting gradient is given by
$$\xi_p := p\text{-LOLA} = (I - \alpha H_o)\xi - p\alpha\chi,$$
with $\xi_0 = \text{LookAhead}$. We obtain an algorithm trading between shaping and stability as a function of $p$.
Note however that preservation of fixed points only holds if $p$ is infinitesimal, in which case p-LOLA is almost identical to LookAhead, losing the very purpose of interpolation. Instead we propose a two-part criterion for $p$ at each learning step, through which all guarantees descend.

First choose $p$ such that $\xi_p$ points in the same direction as LookAhead. This will not be enough to prove convergence by itself, but prevents arrogant behaviour by ensuring convergence only to fixed points. Formally, the first criterion is given by $\langle\xi_p, \xi_0\rangle \geq 0$. If $\langle-\alpha\chi, \xi_0\rangle \geq 0$ then $\langle\xi_p, \xi_0\rangle \geq 0$ automatically, so we choose $p_1 = 1$ for maximal shaping. Otherwise choose
$$p_1 = \min\left\{1, \frac{a\|\xi_0\|^2}{\langle\alpha\chi, \xi_0\rangle}\right\}$$
with any hyperparameter $0 < a < 1$ (note that the denominator is positive in this case). This guarantees a positive inner product
$$\langle\xi_p, \xi_0\rangle = p\langle-\alpha\chi, \xi_0\rangle + \|\xi_0\|^2 \geq -a\|\xi_0\|^2 + \|\xi_0\|^2 = \|\xi_0\|^2(1 - a) > 0.$$
We complement this with a second criterion ensuring local convergence. The idea is to scale $p$ by a function of $\xi$ if $\|\xi\|$ is small enough, which certainly holds in neighbourhoods of fixed points. Let $0 < b < 1$ be a hyperparameter and take $p_2 = \|\xi\|^2$ if $\|\xi\| < b$, otherwise $p_2 = 1$. Choosing $p_1$ and $p_2$ according to these criteria, the two-part criterion is $p = \min\{p_1, p_2\}$. SOS is obtained by combining p-LOLA with this criterion, as summarised in Algorithm 1. Crucially, all theoretical results in the next section are independent of the choice of hyperparameters $a$ and $b$.

Algorithm 1: Stable Opponent Shaping
1 Initialise $\theta$ randomly and fix hyperparameters $a, b \in (0, 1)$.
2 while not done do
3   Compute $\xi_0 = (I - \alpha H_o)\xi$ and $\chi = \operatorname{diag}(H_o^\top\nabla L)$ at $\theta$.
4   if $\langle-\alpha\chi, \xi_0\rangle \geq 0$ then $p_1 = 1$ else $p_1 = \min\{1,\ a\|\xi_0\|^2 / \langle\alpha\chi, \xi_0\rangle\}$
5   if $\|\xi\| < b$ then $p_2 = \|\xi\|^2$ else $p_2 = 1$
6   Let $p = \min\{p_1, p_2\}$, compute $\xi_p = \xi_0 - p\alpha\chi$ and assign $\theta \leftarrow \theta - \alpha\xi_p$.
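As a concrete instance of Algorithm 1, the sketch below (ours, not the authors' released code) runs SOS on the tandem game of Example 2, using the closed-form expressions for $\xi$, $H_o$ and $\chi$ derived in Appendix C and the hyperparameters $a = b = 0.5$, $\alpha = 0.1$ used in Section 5.

```python
import numpy as np

def sos_step(theta, alpha=0.1, a=0.5, b=0.5):
    """One SOS update (Algorithm 1) on the tandem game
    L1 = (x+y)^2 - 2x, L2 = (x+y)^2 - 2y, with closed-form xi, H_o and chi."""
    x, y = theta
    s = x + y
    xi = np.array([2 * s - 2, 2 * s - 2])      # simultaneous gradient
    H_o = np.array([[0.0, 2.0], [2.0, 0.0]])   # off-diagonal Hessian blocks
    chi = np.array([4 * s, 4 * s])             # shaping term diag(H_o^T grad L)

    xi_0 = (np.eye(2) - alpha * H_o) @ xi      # LookAhead direction

    # First criterion: keep <xi_p, xi_0> >= 0.
    dot = np.dot(-alpha * chi, xi_0)
    p1 = 1.0 if dot >= 0 else min(1.0, a * np.dot(xi_0, xi_0) / np.dot(alpha * chi, xi_0))
    # Second criterion: decay p near fixed points.
    p2 = np.dot(xi, xi) if np.linalg.norm(xi) < b else 1.0

    p = min(p1, p2)
    xi_p = xi_0 - p * alpha * chi
    return theta - alpha * xi_p

theta = np.array([2.0, -3.0])
for _ in range(200):
    theta = sos_step(theta)
print(theta, theta.sum())  # x + y converges to 1, i.e. to a stable fixed point
```

Starting from an arbitrary initialisation, the iterates settle on the line $x + y = 1$ of stable fixed points, whereas fixing $p = 1$ throughout (plain LOLA) overshoots to $x + y = (1-2\alpha)/(1-4\alpha)$ as computed in Appendix C.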
4 THEORETICAL RESULTS

Our central theoretical contribution is that LookAhead and SOS converge locally to SFP and avoid strict saddles in all differentiable games. Since the learning gradients involve second-order Hessian terms, our results assume thrice continuously differentiable losses (omitted hereafter). Losses which are $C^2$ but not $C^3$ are very degenerate, so this is a mild assumption. Statements made about SOS crucially hold for any hyperparameters $a, b \in (0, 1)$. See Appendices D and E for detailed proofs.

4.1 LOCAL CONVERGENCE TO STABLE FIXED POINTS

Convergence is proved using Ostrowski's Theorem. This reduces convergence of a gradient adjustment $g$ to positive stability (eigenvalues with positive real part) of $\nabla g$ at stable fixed points.

Theorem 2. Let $H \succeq 0$ be invertible with symmetric diagonal blocks. Then there exists $\epsilon > 0$ such that $(I - \alpha H_o)H$ is positive stable for all $0 < \alpha < \epsilon$.

This type of result would usually be proved either by analytical means, showing positive definiteness and hence positive stability, or by direct eigenvalue analysis. We show in Appendix D that $(I - \alpha H_o)H$ is not necessarily positive definite, while there is no necessary relationship between eigenpairs of $H$ and $H_o$. This makes our theorem all the more interesting and non-trivial. We use a similarity transformation trick to circumvent the dual obstacle, allowing for analysis of positive definiteness with respect to a new inner product. We obtain positive stability by invariance under change of basis.

Corollary 3. LookAhead converges locally to stable fixed points for $\alpha > 0$ sufficiently small.

Using the second criterion for $p$, we prove local convergence of SOS in all differentiable games despite the presence of a shaping term (unlike LOLA).

Theorem 4. SOS converges locally to stable fixed points for $\alpha > 0$ sufficiently small.

4.2 AVOIDING STRICT SADDLES

Using the first criterion for $p$, we prove that SOS only converges to fixed points (unlike LOLA).

Proposition 5. If SOS converges to $\bar\theta$ and $\alpha > 0$ is small then $\bar\theta$ is a fixed point of the game.

Now assume that $\theta$ is initialised randomly (or with arbitrarily small noise), as is standard in ML. Let $F(\theta) = \theta - \alpha\xi_p(\theta)$ be the SOS iteration. Using both the second criterion and the Stable Manifold Theorem from dynamical systems, we can prove that every strict saddle $\bar\theta$ has a neighbourhood $U$ such that $\{\theta \in U \mid F^n(\theta) \to \bar\theta \text{ as } n \to \infty\}$ has measure zero for $\alpha > 0$ sufficiently small. Since $\theta$ is initialised randomly, we obtain the following result.

Theorem 6. SOS locally avoids strict saddles almost surely, for $\alpha > 0$ sufficiently small.

This also holds for LookAhead, and could be strengthened to global initialisations provided a strong boundedness assumption on $H$. This is trickier for SOS since $p(\theta)$ is not globally continuous. Altogether, our results for LookAhead and the correct criterion for p-LOLA lead to some of the strongest theoretical guarantees in multi-agent learning. Furthermore, SOS retains all of LOLA's opponent shaping capacity while LookAhead does not, as shown experimentally in the next section.

5 EXPERIMENTS AND DISCUSSION

We evaluate the performance of SOS in three differentiable games. We first showcase opponent shaping and superiority over LA/CO/SGA/NL in the Iterated Prisoner's Dilemma (IPD). This leaves SOS and LOLA, which have differed only in theory up to now. We bridge this gap by showing that SOS always outperforms LOLA in the tandem game, avoiding arrogant behaviour by decaying $p$ while LOLA overshoots. Finally we test SOS on a more involved GAN learning task, with results similar to dedicated methods like Consensus Optimisation.

5.1 EXPERIMENTAL SETUP

IPD: This game is an infinite sequence of the well-known Prisoner's Dilemma, where the payoff is discounted by a factor $\gamma \in [0, 1)$ at each iteration. Agents are endowed with a memory of actions at the previous state. Hence there are 5 parameters for each agent $i$: the probability $P^i(C \mid \text{state})$ of cooperating at the start state $s_0 = \varnothing$ or at state $s_t = (a^1_{t-1}, a^2_{t-1})$ for $t > 0$. One Nash equilibrium is to always defect (DD), with a normalised loss of 2. A better equilibrium with loss 1 is named tit-for-tat (TFT), where each player begins by cooperating and then mimics the opponent's previous action. We run 300 training episodes for SOS, LA, CO, SGA and NL. The parameters are initialised following a normal distribution around 1/2 probability of cooperation, with unit variance. We fix $\alpha = 1$ and $\gamma = 0.96$, following Foerster et al. (2018). We choose $a = 0.5$ and $b = 0.1$ for SOS. The first is a robust and arbitrary middle ground, while the latter is intentionally small to avoid poor SFP.

Figure 2: Results in the IPD. (A) Probability that agents cooperate, given memory state, at the end of 50 training runs. SOS and LOLA mostly play tit-for-tat, while others mostly defect. (B) Average loss at each step, across 300 runs, with shaded deviations. SOS and LOLA outperform all others.
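The IPD losses above admit a closed form, which is what makes exact second-order adjustments practical in this game. The sketch below (ours) follows the exact-value formulation of Foerster et al. (2018); the particular loss matrix, state ordering and normalisation are assumptions chosen to be consistent with the normalised losses quoted above (2 for mutual defection, 1 for tit-for-tat).

```python
import torch

# One-shot Prisoner's Dilemma losses, indexed by the joint action (a1, a2)
# in the order CC, CD, DC, DD. Values are our assumption.
r1 = torch.tensor([1., 3., 0., 2.])  # losses of agent 1
r2 = torch.tensor([1., 0., 3., 2.])  # losses of agent 2

def ipd_losses(theta1, theta2, gamma=0.96):
    """Exact normalised discounted IPD losses for sigmoid-parameterised policies.

    theta1, theta2: 5 logits each, giving P(C | s0) and P(C | CC, CD, DC, DD).
    Uses the closed form (1 - gamma) * p0^T (I - gamma * P)^{-1} r, where p0 is
    the distribution over the first joint action and P the induced Markov chain
    over joint actions.
    """
    p1, p2 = torch.sigmoid(theta1), torch.sigmoid(theta2)

    def joint(a, b):
        # Probability of each joint action given cooperation probabilities a, b.
        return torch.stack([a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)], dim=-1)

    p0 = joint(p1[0], p2[0])   # distribution over the first state
    P = joint(p1[1:], p2[1:])  # 4x4 transition matrix, rows = current state
    M = torch.linalg.solve(torch.eye(4) - gamma * P, torch.stack([r1, r2], dim=-1))
    v = (1 - gamma) * p0 @ M   # normalised losses (L1, L2)
    return v[0], v[1]

# Mutual defection: both agents put (almost) all mass on D in every state.
L1, L2 = ipd_losses(-10 * torch.ones(5), -10 * torch.ones(5))
print(L1.item(), L2.item())    # both close to 2, the defect-defect loss
```

Because the losses are differentiable functions of the ten logits, the same machinery used for the other games (computing $\xi$, $H_o\xi$ and $\chi$ by automatic differentiation) applies directly.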
Tandem: Though local convergence is guaranteed for SOS, it is possible that SOS diverges from poor initialisations. This turns out to be impossible in the tandem game, since the Hessian is globally positive semi-definite. We show this explicitly by running 300 training episodes for SOS and LOLA. Parameters are initialised following a normal distribution around the origin. We found performance to be robust to the hyperparameters $a, b$. Here we fix $a = b = 0.5$ and $\alpha = 0.1$.

Gaussian mixtures: We reproduce a setup from Balduzzi et al. (2018). The game is to learn a Gaussian mixture distribution using GANs. Data is sampled from a highly multimodal distribution designed to probe the tendency to collapse onto a subset of modes during training; see the ground truth in Appendix F. The generator and discriminator networks each have 6 ReLU layers of 384 neurons, with 2 and 1 output neurons respectively. Learning rates are chosen by grid search at iteration 8k, with $a = 0.5$ and $b = 0.1$ for SOS, following the same reasoning as the IPD.
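For players of this size the matrices $H_o$ and $\nabla L$ are never formed explicitly. In practice the products $H_o\xi$ and $\chi$ can be assembled from a handful of gradient calls, as in the following two-player sketch (ours, not the authors' released code), which assumes each player's parameters are flattened into a single tensor and that both losses depend on both players.

```python
import torch

def sos_terms(loss1, loss2, params1, params2, alpha):
    """Compute xi, the LookAhead direction xi_0 = xi - alpha*H_o@xi and the shaping
    term chi for two players with flat parameter vectors, using only gradient calls
    (no explicit Hessians). loss1, loss2 are scalars built from params1, params2."""
    # Simultaneous gradient xi = (grad_1 L1, grad_2 L2), kept in the graph.
    g11 = torch.autograd.grad(loss1, params1, create_graph=True)[0]
    g22 = torch.autograd.grad(loss2, params2, create_graph=True)[0]
    # Cross gradients grad_2 L1 and grad_1 L2, also kept in the graph.
    g12 = torch.autograd.grad(loss1, params2, create_graph=True)[0]
    g21 = torch.autograd.grad(loss2, params1, create_graph=True)[0]

    # (H_o xi)_1 = d/d params1 <grad_2 L1, xi_2> with xi_2 held constant;
    # symmetrically for player 2. These are Hessian-vector products.
    Ho_xi_1 = torch.autograd.grad((g12 * g22.detach()).sum(), params1, retain_graph=True)[0]
    Ho_xi_2 = torch.autograd.grad((g21 * g11.detach()).sum(), params2, retain_graph=True)[0]

    # chi_1 = d/d params1 <grad_2 L2, grad_2 L1> with grad_2 L1 held constant;
    # symmetrically for player 2 (Proposition 1).
    chi_1 = torch.autograd.grad((g22 * g12.detach()).sum(), params1, retain_graph=True)[0]
    chi_2 = torch.autograd.grad((g11 * g21.detach()).sum(), params2, retain_graph=True)[0]

    xi = (g11.detach(), g22.detach())
    xi_0 = (xi[0] - alpha * Ho_xi_1, xi[1] - alpha * Ho_xi_2)
    chi = (chi_1, chi_2)
    return xi, xi_0, chi
```

A full SOS step then applies the two-part criterion of Algorithm 1 to obtain $p$ and updates each player with $\theta^i \leftarrow \theta^i - \alpha\,\xi_p^i$.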
5.2 RESULTS AND DISCUSSION

IPD: Results are given in Figure 2. Parameters in part (A) are the end-of-run probabilities of cooperating for each memory state, encoded in different colours. Only 50 runs are shown for visibility. Losses at each step are displayed in part (B), averaged across 300 episodes with shaded deviations. SOS and LOLA mostly succeed in playing tit-for-tat, displayed by the accumulation of points in the correct corners of the (A) plots. For instance, CC and CD points are mostly in the top right and left corners, so agent 2 responds to cooperation with cooperation. Agents also cooperate at the start state, represented by points all hidden in the top right corner. The tit-for-tat strategy is further indicated by the losses close to 1 in part (B). On the other hand, most points for LA/CO/SGA/NL are accumulated at the bottom left, so agents mostly defect. This results in poor losses, demonstrating the limited effectiveness of recent proposals like SGA and CO. Finally note that trained parameters and losses for SOS are almost identical to those for LOLA, displaying equal capacity in opponent shaping while also inheriting convergence guarantees and outperforming LOLA in the next experiment.

Tandem: Results are given in Figure 3. SOS always succeeds in decreasing $p$ to reach the correct equilibria, with losses averaging at 0. LOLA fails to preserve fixed points, overshooting with losses averaging at 4/9. The criterion for SOS is shown in action in part (B), decaying $p$ to avoid overshooting. This illustrates that the purely theoretical guarantees descend into practical outperformance. Note that SOS even gets away from the LOLA fixed points if initialised there (not shown), converging to improved losses using the alignment criterion with LookAhead.

Figure 3: Results in the tandem game. (A) Average loss and (B) average $p$ at each learning step, across 300 runs, with shaded deviations. SOS decays $p$ to avoid arrogance and outperforms LOLA.

Iterations 2k / 4k / 6k / 8k; the three rows of panels have $D_{KL}$ values 0.77 / 1.12 / 1.61 / 2.51, then 0.65 / 0.31 / 0.07 / 0.03, then 0.45 / 0.15 / 0.09 / 0.03.
Figure 4: Generator distribution at sampled iterations. NL suffers from mode collapse and hopping, while CO and SOS learn the correct mixture of Gaussians. Below each plot: KL divergence $D_{KL}(P\,\|\,Q)$ from generator $P$ to ground truth $Q$, estimated from 25600 samples. To the RHS of each row: learning rate $\alpha$. Best result at each iteration shown in bold.

Gaussian mixtures: The generator distribution and KL divergence are given at {2k, 4k, 6k, 8k} iterations for NL, CO and SOS in Figure 4. Results for SGA, LOLA and LA are in Appendix F. SOS achieves convincing results by spreading mass across all Gaussians, as do CO/SGA/LOLA. LookAhead is significantly slower, while NL fails through mode collapse and hopping. Only visual inspection was used for comparison by Balduzzi et al. (2018), while KL divergence gives stronger numerical evidence here. SOS and CO are slightly superior to the others with reference to this metric. However CO is aimed specifically toward two-player zero-sum GAN optimisation, while SOS is widely applicable with strong theoretical guarantees in all differentiable games.

6 CONCLUSION

Theoretical results in machine learning have significantly helped understand the causes of success and failure in applications, from optimisation to architecture. While gradient descent on single losses has been studied extensively, algorithms dealing with interacting goals are proliferating, with little grasp of the underlying dynamics. The analysis behind CO and SGA has been helpful in this respect, though lacking either in generality or convergence guarantees. The first contribution of this paper is to provide a unified framework and fill this theoretical gap with robust convergence results for LookAhead in all differentiable games. Capturing stable fixed points as the correct solution concept was essential for these techniques to apply. Furthermore, we showed that opponent shaping is a powerful approach leading to experimental success and cooperative behaviour, while at the same time preventing LOLA from preserving fixed points in general. This conundrum is solved through a robust interpolation between LookAhead and LOLA, giving birth to SOS through a principled criterion. This was partially enabled by choosing to preserve the middle term in LOLA, and using it to inherit stability from LookAhead. This results in convergence guarantees stronger than all previous algorithms, but also in practical superiority over LOLA in the tandem game. Moreover, SOS fully preserves opponent shaping and outperforms SGA, CO, LA and NL in the IPD by encouraging a tit-for-tat policy instead of defecting. Finally, SOS convincingly learns Gaussian mixtures on par with the dedicated CO algorithm.

7 ACKNOWLEDGEMENTS

This project has received funding from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement number 637713). It was also supported by the Oxford-Google DeepMind Graduate Scholarship.

REFERENCES

D. Balduzzi, S. Racanière, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The Mechanics of n-Player Differentiable Games. ICML, 2018.

M. Bowling and M. Veloso. Rational and convergent learning in stochastic games. In Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2, pp. 1021-1026. Morgan Kaufmann Publishers Inc., 2001.

L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156-172, March 2008.

V. Conitzer and T. Sandholm. AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents. Machine Learning, 67(1):23-43, May 2007.

C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with Optimism. ICLR, 2018.
Francisco Facchinei and Christian Kanzow. Generalized Nash equilibrium problems. 4OR, 5(3), Sep 2007.

J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch. Learning with Opponent-Learning Awareness. AAMAS, 2018.

J. N. Foerster, G. Farquhar, M. Al-Shedivat, T. Rocktäschel, E. P. Xing, and S. Whiteson. DiCE: The Infinitely Differentiable Monte-Carlo Estimator. ICML, 2018.

I. Gemp and S. Mahadevan. Global Convergence to the Equilibrium of GANs using Variational Inequalities. arXiv e-prints, 2018.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. NIPS, 2014.

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS, 2017.

M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, and K. Kavukcuoglu. Decoupled Neural Interfaces using Synthetic Gradients. ICML, 2017.

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient Descent Only Converges to Minimizers. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 1246-1257, 2016.

J. D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht. First-order Methods Almost Always Avoid Saddle Points. arXiv e-prints, 2017.

Panayotis Mertikopoulos and Zhengyuan Zhou. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, Mar 2018.

L. Mescheder, S. Nowozin, and A. Geiger. The Numerics of GANs. NIPS, 2017.

V. Nagarajan and J. Kolter. Gradient descent GAN optimization is locally stable. NIPS, 2017.

J. Ortega and W. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Society for Industrial and Applied Mathematics, 2000.

I. Panageas and G. Piliouras. Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions. In ITCS 2017, volume 67 of Leibniz International Proceedings in Informatics, pp. 2:1-2:12, 2017.

D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven Exploration by Self-supervised Prediction. ICML, 2017.

S. Racanière, T. Weber, D. P. Reichert, L. Buesing, A. Guez, D. Jimenez Rezende, A. Puigdomènech Badia, O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. Battaglia, D. Hassabis, D. Silver, and D. Wierstra. Imagination-Augmented Agents for Deep Reinforcement Learning. NIPS, 2017.

J. B. Rosen. Existence and Uniqueness of Equilibrium Points for Concave N-Person Games. Econometrica, 33, Jul 1965.

A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. FeUdal Networks for Hierarchical Reinforcement Learning. ICML, 2017.

G. Wayne and L. F. Abbott. Hierarchical control using networks trained with higher-level forward models. Neural Computation, 26(10):2163-2193, 2014.

C. Zhang and V. Lesser. Multi-Agent Learning with Policy Prediction. AAAI Conference on Artificial Intelligence, 2010.

A STABLE FIXED POINTS

In the main text we showed that Nash equilibria are inadequate in multi-agent learning, exemplified by the simple game given by $L^1 = L^2 = xy$, where the origin is a global Nash equilibrium but a saddle point of the losses. It is not however obvious that SFP are a better solution concept. We begin by pointing out that for single losses, invertibility and symmetry of the Hessian imply positive definiteness at SFP.
These are exactly the local minima of $L$ detected by the second partial derivative test, namely those points provably attainable by gradient descent. To emphasise this, note that gradient descent does not converge locally to all local minima. This can be seen by considering the example $L(x, y) = y^2$ and the local (global) minimum $(0, 0)$. There is no neighbourhood for which gradient descent converges to $(0, 0)$, since initialising at $(x_0, y_0)$ will always converge to $(x_0, 0)$ for appropriate learning rates, with $x_0 \neq 0$ almost surely. This occurs precisely because the Hessian is singular at $(0, 0)$. Though a degenerate example, this suggests an important distinction between the ideal solution concept (local minima) and that for which local convergence claims are possible to attain (local minima with invertible $H \succeq 0$).

Accordingly, the definition of SFP is the immediate generalisation of 'fixed points with positive semi-definite invertible Hessian', or in other words, 'second-order-tractable local minima'. It is important to impose only positive semi-definiteness to keep the class as large as possible, despite strict positive definiteness holding for single losses due to symmetry. Imposing strict positivity would for instance exclude the origin in the cyclic game $L^1 = xy = -L^2$, a point certainly worthy of convergence. Note also that imposing a weaker condition than $H \succeq 0$ would be incorrect. Invertibility aside, local convergence of gradient descent on single functions cannot be guaranteed if $H \not\succeq 0$, since such points are strict saddles. These are almost always avoided by gradient descent, as proven by Lee et al. (2016) and Panageas & Piliouras (2017). It is thus necessary to impose $H \succeq 0$ as a minimal requirement in optimisation methods attempting to generalise gradient descent.

Remark A.1. A matrix $H$ is positive semi-definite iff the same holds for its symmetric part $S = (H + H^\top)/2$, so SFP could equivalently be defined as $S(\bar\theta) \succeq 0$. This is the original formulation given by part of the authors (Balduzzi et al., 2018), who also imposed the extra requirement $S(\theta) \succeq 0$ in a neighbourhood of $\bar\theta$. After discussion we decided to drop this assumption, pointing out that it is more restrictive, superficial to all theoretical results, and weakens the analogy with tractable local minima. The only thing gained by imposing positive semi-definiteness in a neighbourhood is that SFP become a subset of Nash equilibria. Regarding unstable fixed points and strict saddles, note that $H(\bar\theta) \prec 0$ implies $H(\theta) \prec 0$ in a neighbourhood, hence this is equivalent to the definition in Balduzzi et al. (2018). It follows also that unstable points are a subset of strict saddles: if $H(\bar\theta) \prec 0$ then all eigenvalues have negative real part, since any eigenpair $(v, \lambda)$ satisfies
$$0 > \operatorname{Re}(v^*Hv) = \operatorname{Re}(\lambda v^*v) = \operatorname{Re}(\lambda)\|v\|^2.$$
We introduced strict saddles in this paper as a generalisation of unstable fixed points, which are more difficult to handle but nonetheless tractable using dynamical systems. The name is chosen by analogy to the definition in Lee et al. (2016) for single losses.

B LOLA VECTORIAL FORM

Proposition B.1. The LOLA gradient adjustment is
$$\text{LOLA} = (I - \alpha H_o)\xi - \alpha\operatorname{diag}(H_o^\top\nabla L)$$
under the usual assumption of equal learning rates.

Proof. Recall the modified objective $L^1(\theta^1, \theta^2 - \alpha\nabla_2 L^2, \ldots, \theta^n - \alpha\nabla_n L^n)$ for agent 1, and so on for each agent. First-order Taylor expansion yields
$$L^1 - \alpha\sum_{j\neq 1}(\nabla_j L^1)^\top\nabla_j L^j$$
and similarly for each agent.
Differentiating with respect to $\theta^i$, the adjustment for player $i$ is
$$\nabla_i\Big[L^i - \alpha\sum_{j\neq i}(\nabla_j L^i)^\top\nabla_j L^j\Big] = \nabla_i L^i - \alpha\sum_{j\neq i}\Big[(\nabla_{ji}L^i)^\top\nabla_j L^j + (\nabla_{ji}L^j)^\top\nabla_j L^i\Big]$$
$$= \nabla_i L^i - \alpha\sum_{j\neq i}\nabla_{ij}L^i\,\nabla_j L^j - \alpha\sum_{j\neq i}(\nabla_{ji}L^j)^\top\nabla_j L^i$$
$$= \xi_i - \alpha\sum_j (H_o)_{ij}\,\xi_j - \alpha\sum_j (H_o^\top)_{ij}(\nabla L)_{ji} = \xi_i - \alpha(H_o\xi)_i - \alpha(H_o^\top\nabla L)_{ii} = \big[\xi - \alpha H_o\xi - \alpha\operatorname{diag}(H_o^\top\nabla L)\big]_i,$$
and thus $\text{LOLA} = (I - \alpha H_o)\xi - \alpha\operatorname{diag}(H_o^\top\nabla L)$ as required.

C TANDEM GAME

We provide a more detailed exposition of the tandem game in this section, including computation of fixed points for NL/LOLA and the corresponding losses. Recall that the game is given by
$$L^1(x, y) = (x+y)^2 - 2x \quad\text{and}\quad L^2(x, y) = (x+y)^2 - 2y.$$
Intuitively, agents want to have $x \approx -y$ since $(x+y)^2$ is the leading loss, but would also prefer to have positive $x$ and $y$. These are incompatible, so the agents must not be arrogant and instead make concessions. The fixed points are given by
$$\xi = 2(x+y-1)\begin{pmatrix}1\\1\end{pmatrix} = 0,$$
namely any pair $(x, 1-x)$. The corresponding losses are $L^1 = 1 - 2x = -L^2$, summing to 0 for any $x$. We have
$$H = \begin{pmatrix}2 & 2\\2 & 2\end{pmatrix} \succeq 0$$
everywhere, so all fixed points are SFP. LOLA fails to preserve these, since
$$\chi = \operatorname{diag}(H_o^\top\nabla L) = \operatorname{diag}\left(\begin{pmatrix}0 & 2\\2 & 0\end{pmatrix}\,2\begin{pmatrix}x+y-1 & x+y\\x+y & x+y-1\end{pmatrix}\right) = 4\begin{pmatrix}x+y\\x+y\end{pmatrix},$$
which is non-zero for any SFP $(x, 1-x)$. Instead, LOLA can only converge to points such that
$$\text{LOLA} = \xi - \alpha H_o\xi - \alpha\chi = 0.$$
We solve this explicitly as follows:
$$\text{LOLA} = \big[2(x+y-1) - 4\alpha(x+y-1) - 4\alpha(x+y)\big]\begin{pmatrix}1\\1\end{pmatrix} = 2\big[(1-4\alpha)(x+y) - (1-2\alpha)\big]\begin{pmatrix}1\\1\end{pmatrix}.$$
The fixed points for LOLA are thus pairs $(x, y)$ such that
$$x + y = \frac{1-2\alpha}{1-4\alpha},$$
noting that $(1-2\alpha)/(1-4\alpha) > 1$ for all $0 < \alpha < 1/4$. This leads to worse losses
$$L^1 = \left(\frac{1-2\alpha}{1-4\alpha}\right)^2 - 2x > 1 - 2x = L^1(x, 1-x)$$
for agent 1, and similarly for agent 2. In particular, losses always sum to something greater than 0. This becomes negligible as the learning rate becomes smaller, but is always positive nonetheless. Taking $\alpha$ arbitrarily small is not a viable solution since convergence will in turn be arbitrarily slow. LOLA is thus not a strong algorithm candidate for all differentiable games.
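The fixed-point computation above is easy to verify numerically. The sketch below (ours) iterates the closed-form LOLA adjustment and checks that $x + y$ converges to $(1-2\alpha)/(1-4\alpha)$, with total loss $2s(s-1) = 8/9$ at $\alpha = 0.1$, i.e. an average of $4/9$ per agent as reported in Section 5.2.

```python
import numpy as np

def lola_update(theta, alpha):
    """Closed-form LOLA update for the tandem game, as derived above."""
    s = theta.sum()
    adjustment = 2 * ((1 - 4 * alpha) * s - (1 - 2 * alpha)) * np.ones(2)
    return theta - alpha * adjustment

alpha = 0.1
theta = np.array([0.5, 0.5])
for _ in range(1000):
    theta = lola_update(theta, alpha)

s = theta.sum()
print(s, (1 - 2 * alpha) / (1 - 4 * alpha))              # both approx 1.3333
print((s ** 2 - 2 * theta[0]) + (s ** 2 - 2 * theta[1]),  # total loss approx 0.8889
      2 * s * (s - 1))
```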
D CONVERGENCE PROOFS

We use Ostrowski's Theorem as a unified framework for proving local convergence of gradient-based methods. This is a standard result on fixed-point iterations, adapted from (Ortega & Rheinboldt, 2000, 10.1.3). We also invoke and prove a topological result of our own, Lemma D.9, at the end of this section. This is useful in deducing local convergence, though not central to intuition.

Theorem D.1 (Ostrowski). Let $F : \Omega \to \mathbb{R}^d$ be continuously differentiable on an open subset $\Omega \subseteq \mathbb{R}^d$, and assume $\bar{x} \in \Omega$ is a fixed point. If all eigenvalues of $\nabla F(\bar{x})$ are strictly in the unit circle of $\mathbb{C}$, then there is an open neighbourhood $U$ of $\bar{x}$ such that for all $x_0 \in U$, the sequence $F^{(k)}(x_0)$ converges to $\bar{x}$. Moreover, the rate of convergence is at least linear in $k$.

Definition D.2. A matrix $M$ is called positive stable if all its eigenvalues have positive real part.

Recall the simultaneous gradient $\xi$ and the Hessian $H$ defined for differentiable games. Let $X$ be any matrix with continuously differentiable entries.

Corollary D.3. Assume $\bar{x}$ is a fixed point of a differentiable game such that $XH(\bar{x})$ is positive stable. Then the iterative procedure $F(x) = x - \alpha X\xi(x)$ converges locally to $\bar{x}$ for $\alpha > 0$ sufficiently small.

Proof. By definition of fixed points, $\xi(\bar{x}) = 0$ and so
$$\nabla[X\xi](\bar{x}) = \nabla X(\bar{x})\,\xi(\bar{x}) + X(\bar{x})\,\nabla\xi(\bar{x}) = XH(\bar{x})$$
is positive stable by assumption, namely has eigenvalues $a_k + ib_k$ with $a_k > 0$. It follows that $\nabla F(\bar{x}) = I - \alpha\nabla[X\xi](\bar{x})$ has eigenvalues $1 - \alpha a_k - i\alpha b_k$, which are in the unit circle for small $\alpha$. More precisely,
$$|1 - \alpha a_k - i\alpha b_k|^2 < 1 \iff 1 - 2\alpha a_k + \alpha^2 a_k^2 + \alpha^2 b_k^2 < 1 \iff 0 < \alpha < \frac{2a_k}{a_k^2 + b_k^2},$$
which is always possible for $a_k > 0$. Hence $\nabla F(\bar{x})$ has eigenvalues in the unit circle for $0 < \alpha < \min_k 2a_k/(a_k^2 + b_k^2)$, and we are done by Ostrowski's Theorem since $\bar{x}$ is a fixed point of $F$.

We apply this corollary to LookAhead, which is given by $F(\theta) = \theta - \alpha X\xi(\theta)$ where $X = (I - \alpha H_o)$. It is thus sufficient to prove the following result.

Theorem D.4. Let $H \succeq 0$ be invertible with symmetric diagonal blocks. Then there exists $\epsilon > 0$ such that $(I - \alpha H_o)H$ is positive stable for all $0 < \alpha < \epsilon$.

Remark D.5. Note that $(I - \alpha H_o)H$ may fail to be positive definite, though this is true in the case of $2\times 2$ matrices. It no longer holds in higher dimensions, as can be shown by an explicit counterexample Hessian. By direct computation (symbolic in $\alpha$), one can show that $G = (I - \alpha H_o)H$ always has positive eigenvalues for small $\alpha > 0$ in such an example, whereas its symmetric part $S$ always has a negative eigenvalue with magnitude in the order of $\alpha$. This implies that $S$, and in turn $G$, is not positive definite. As such, an analytical proof of the theorem involving bounds on the corresponding bilinear form will fail. This makes the result all the more interesting, but more involved. Central to the proof is a similarity transformation proving positive definiteness with respect to a different inner product, a novel technique we have not found in the multi-agent learning literature.

Proof. We cannot study the eigenvalues of $G$ directly, since there is no necessary relationship between eigenpairs of $H$ and $H_o$. In the aim of using analytical tools, the trick is to find a positive definite matrix which is similar to $G$, thus sharing the same positive eigenvalues. First define
$$G_1 = (I + \alpha H_d)H \quad\text{and}\quad G_2 = -\alpha H^2,$$
where $H_d$ is the sub-matrix of diagonal blocks, and rewrite
$$G = (I - \alpha H_o)H = (I - \alpha(H - H_d))H = (I + \alpha H_d)H - \alpha H^2 = G_1 + G_2.$$
Note that $H_d$ is block diagonal with symmetric blocks $\nabla_{ii}L^i \succeq 0$, so $(I + \alpha H_d)$ is symmetric and positive definite for all $\alpha \geq 0$. In particular its principal square root $M = (I + \alpha H_d)^{1/2}$ is unique and invertible. Now note that
$$M^{-1}G_1M = M^{-1}M^2HM = MHM,$$
which is positive semi-definite since $u^\top MHMu = (Mu)^\top H(Mu) \geq 0$ for all non-zero $u$. In particular $M$ provides a similarity transformation which eliminates $H_d$ from $G_1$ while simultaneously delivering positive semi-definiteness. We can now prove that $M^{-1}GM = M^{-1}G_1M + M^{-1}G_2M$ is positive definite, establishing positive stability of $G$ by similarity.

Let $m = d - 1$ where $d$ is the vector space dimension, namely $H \in \mathbb{R}^{d\times d}$. Recall that the $m$-sphere $S^m \subset \mathbb{R}^d$ is the space of unit vectors in $\mathbb{R}^d$. Take any $u \in S^m$ and consider the quantity $u^\top M^{-1}GMu$. First note that a Taylor expansion of $M$ in $\alpha$ yields
$$M = (I + \alpha H_d)^{1/2} = I + O(\alpha) \quad\text{and}\quad M^{-1} = (I + \alpha H_d)^{-1/2} = I + O(\alpha).$$
This implies in turn that $u^\top M^{-1}GMu = u^\top Gu + O(\alpha)$. There are two cases to distinguish. If $u^\top Hu > 0$ then
$$u^\top M^{-1}GMu = u^\top Gu + O(\alpha) = u^\top G_1u + O(\alpha) = u^\top Hu + O(\alpha) > 0$$
for $\alpha$ sufficiently small. Otherwise, $u^\top Hu = 0$; consider decomposing $H$ into symmetric and antisymmetric parts $S = (H + H^\top)/2$ and $A = (H - H^\top)/2$, so that $H = S + A$. By antisymmetry of $A$ we have $u^\top Au = 0$ and hence $u^\top Su = u^\top Hu = 0$. Now $H \succeq 0$ implies $S \succeq 0$, so by Cholesky decomposition of $S$ there exists a matrix $T$ such that $S = T^\top T$. In particular $0 = u^\top Su = \|Tu\|^2$ implies $Tu = 0$, and in turn $Su = 0$. Since $H$ is invertible and $u \neq 0$, we have $0 \neq Hu = Au$ and so $\|Au\|^2 > 0$. It follows in particular that
$$-\alpha u^\top H^2u = -\alpha(H^\top u)^\top(Hu) = -\alpha\big((S - A)u\big)^\top\big((S + A)u\big) = \alpha\,u^\top A^\top Au = \alpha\|Au\|^2 > 0.$$
Using positive semi-definiteness of $M^{-1}G_1M$,
$$u^\top M^{-1}GMu = u^\top M^{-1}G_1Mu + u^\top M^{-1}G_2Mu \geq -\alpha u^\top H^2u + O(\alpha^2) = \alpha\|Au\|^2 + O(\alpha^2) > 0$$
for $\alpha > 0$ small enough.
We conclude that for any $u \in S^m$ there is $\epsilon_u > 0$ such that $u^\top M^{-1}GMu > 0$ for all $0 < \alpha < \epsilon_u$, where $g(\alpha, u) = u^\top M^{-1}GMu$ is a function $g : \mathbb{R}_+\times S^m \to \mathbb{R}$ with $S^m$ compact. By Lemma D.9, this can be extended uniformly with some $\epsilon > 0$ such that $u^\top M^{-1}GMu > 0$ for all $u \in S^m$ and $0 < \alpha < \epsilon$. It follows that $M^{-1}GM$ is positive definite for all $0 < \alpha < \epsilon$, and thus $G$ is positive stable for $\alpha$ in the same range, by similarity.

Corollary D.6. LookAhead converges locally to stable fixed points for $\alpha > 0$ sufficiently small.

Proof. For any SFP $\bar\theta$ we have $\xi(\bar\theta) = 0$ and $H(\bar\theta) \succeq 0$ invertible by definition, with diagonal blocks $\nabla_{ii}L^i$ symmetric by twice continuous differentiability. We are done by the result above and Corollary D.3.

We now prove that local convergence results descend to SOS. The following lemma establishes the crucial claim that our criterion for $p$ is $C^1$ in neighbourhoods of fixed points. This is necessary to invoke analytical arguments including Ostrowski's Theorem, and would be untrue globally.

Lemma D.7. If $\bar\theta$ is a fixed point and $\alpha$ is sufficiently small then $p = \|\xi\|^2$ in a neighbourhood of $\bar\theta$.

Proof. First note that $\xi(\bar\theta) = 0$, so there is a (bounded) neighbourhood $V$ of $\bar\theta$ such that $\|\xi(\theta)\| < b$ for all $\theta \in V$, for any choice of hyperparameter $b \in (0, 1)$. In particular $p_2(\theta) = \|\xi(\theta)\|^2$ by definition of the second criterion. We want to show that $p(\theta) = p_2(\theta)$ near $\bar\theta$, or equivalently $p_1(\theta) \geq p_2(\theta)$. Since $p_2(\theta) = \|\xi(\theta)\|^2 < b^2 < 1$ in $V$, it remains only to show that
$$\frac{a\|\xi_0\|^2}{\langle\alpha\chi, \xi_0\rangle} \geq \|\xi(\theta)\|^2$$
in some neighbourhood $U \subseteq V$ of $\bar\theta$, for any choice of hyperparameter $a \in (0, 1)$, whenever the corresponding branch of the first criterion is active. Now by boundedness of $V$ and continuity of $\chi$, there exists $c > 0$ such that $\|\alpha\chi(\theta)\| < c$ for all $\theta \in V$ and bounded $\alpha$. It follows by Cauchy-Schwarz that
$$\frac{a\|\xi_0\|^2}{\langle\alpha\chi, \xi_0\rangle} \geq \frac{a\|\xi_0\|^2}{\|\alpha\chi\|\,\|\xi_0\|} > \frac{a\|\xi_0\|}{c}$$
in $V$. Now note that $\|\xi_0\| = \|(I - \alpha H_o)\xi\| \geq d\|\xi\|$ in $V$, for some $d > 0$ and $\alpha$ sufficiently small, by boundedness of $V$ and continuity of $H_o$. Finally there is a sub-neighbourhood $U \subseteq V$ such that $\|\xi(\theta)\| < ad/c$ for all $\theta \in U$, so that $ad\|\xi\|/c > \|\xi(\theta)\|^2$ and hence
$$\frac{a\|\xi_0\|^2}{\langle\alpha\chi, \xi_0\rangle} > \frac{a\|\xi_0\|}{c} \geq \frac{ad\|\xi\|}{c} > \|\xi\|^2 = p_2$$
in $U$. Hence $p(\theta) = \min\{p_1(\theta), p_2(\theta)\} = p_2(\theta) = \|\xi(\theta)\|^2$ for all $\theta \in U$, as required.

Theorem D.8. SOS converges locally to stable fixed points for $\alpha > 0$ sufficiently small.

Proof. Though the criterion for $p$ is dual, we will only use the second part. More precisely, $p = \min\{p_1, p_2\} \leq p_2 = \|\xi\|^2$ if $\|\xi\| < b$. The aim is to show that if $\bar\theta$ is an SFP then $\nabla\xi_p(\bar\theta)$ is positive stable for small $\alpha$, using Ostrowski to conclude as usual. The first problem we face is that $\nabla\xi_p$ does not exist everywhere, since $p(\theta)$ is not a continuous function. However we know by Lemma D.7 that $p = \|\xi\|^2$ in a neighbourhood $U$ of $\bar\theta$, so $\xi_p$ is continuously differentiable in $U$. Moreover, $p(\bar\theta) = \|\xi(\bar\theta)\|^2 = 0$ with gradient $\nabla p(\bar\theta) = 2H^\top\xi(\bar\theta) = 0$ by definition of fixed points. It follows that
$$\nabla\xi_p(\bar\theta) = (I - \alpha H_o)H(\bar\theta) - \alpha\chi(\bar\theta)\nabla p(\bar\theta)^\top - \alpha p(\bar\theta)\nabla\chi(\bar\theta) = (I - \alpha H_o)H(\bar\theta),$$
which is identical to LookAhead. This is positive stable for all $0 < \alpha < \epsilon$, and $\bar\theta$ is a fixed point of the iteration since
$$\xi_p(\bar\theta) = (I - \alpha H_o)\xi(\bar\theta) - \alpha p(\bar\theta)\chi(\bar\theta) = 0.$$
We conclude by Corollary D.3 that SOS converges locally to SFP for any $a, b \in (0, 1)$ and $\alpha$ sufficiently small.

Lemma D.9. Let $g : \mathbb{R}_+\times Y \to Z$ be continuous with $Y$ compact and $Z \subseteq \mathbb{R}$. Assume that for any $u \in Y$ there is $\epsilon_u > 0$ such that $g(\alpha, u) > 0$ for all $0 < \alpha < \epsilon_u$. Then there exists $\epsilon > 0$ such that $g(\alpha, u) > 0$ for all $0 < \alpha < \epsilon$ and $u \in Y$.

Proof. For any $u \in Y$ there is $\epsilon_u > 0$ such that
$$(0, \epsilon_u)\times\{u\} \subseteq g^{-1}(0, \infty).$$
We would like to extend this uniformly in $u$, namely prove that
$$(0, \epsilon)\times Y \subseteq g^{-1}(0, \infty)$$
for some $\epsilon > 0$. Now $g^{-1}(0, \infty)$ is open by continuity of $g$, so each $(0, \epsilon_u)\times\{u\}$ has a neighbourhood $X_u$ contained in $g^{-1}(0, \infty)$.
Open sets in a product topology are unions of open products, so $X_u = \bigcup_x U_x\times V_x$. In particular $(0, \epsilon_u) \subseteq \bigcup_x U_x$ and at least one $V_x$ contains $u$, so we can take the open neighbourhood to be
$$X_u = (0, \epsilon_u)\times V_u \subseteq g^{-1}(0, \infty)$$
for some neighbourhood $V_u$ of $u$. In particular $Y \subseteq \bigcup_{u\in Y} V_u$, and by compactness there is a finite cover $Y \subseteq \bigcup_{i=1}^k V_{u_i}$. Letting $\epsilon = \min\{\epsilon_i\}_{i=1}^k > 0$, we obtain the required inclusion
$$(0, \epsilon)\times Y \subseteq \bigcup_{i=1}^k (0, \epsilon)\times V_{u_i} \subseteq \bigcup_{i=1}^k (0, \epsilon_i)\times V_{u_i} \subseteq g^{-1}(0, \infty).$$

E NON-CONVERGENCE PROOFS

Lemma E.1. Let $a_k$ and $b_k$ be sequences of real numbers, and define $c_k = \min\{a_k, b_k\}$. If $L = \lim_{k\to\infty} c_k$ and $L' = \lim_{k\to\infty} a_k$ both exist then $L \leq L'$.

Proof. Assume for contradiction that $L > L'$; then there exists $\delta > 0$ such that $L > L' + \delta$. By definition of limits, there exist $M, N \in \mathbb{N}$ such that $|c_k - L| < \delta/2$ for all $k \geq M$ and $|a_k - L'| < \delta/2$ for all $k \geq N$. Expanding the absolute values, this implies
$$L - \delta/2 < c_k < L + \delta/2 \quad\text{and}\quad L' - \delta/2 < a_k < L' + \delta/2$$
for all $k \geq \max\{M, N\}$. Now $c_k \leq a_k$ for all $k$, hence
$$L - \delta/2 < c_k \leq a_k < L' + \delta/2,$$
which implies the contradiction $L < L' + \delta$.

Proposition E.2. If SOS converges to $\bar\theta$ and $\alpha > 0$ is small then $\bar\theta$ is a fixed point of the game.

Proof. The iterative procedure is given by
$$\theta_{k+1} = F(\theta_k) = \theta_k - \alpha\xi_p(\theta_k).$$
If $\theta_k \to \bar\theta$ as $k \to \infty$ then taking limits on both sides of the iteration yields
$$\bar\theta = \bar\theta - \alpha\lim_{k\to\infty}\xi_p(\theta_k),$$
and so $\lim_{k\to\infty}\xi_p(\theta_k) = 0$, omitting $k \to \infty$ for convenience from now on. It follows by continuity that
$$\xi_0(\bar\theta) - \lim_k p(\theta_k)\,\alpha\chi(\bar\theta) = 0,$$
noting that $p(\theta)$ is not a globally continuous function. Assume for contradiction that $\xi_0(\bar\theta) \neq 0$. There are two cases to distinguish for clarity.

(i) First assume $\langle-\alpha\chi, \xi_0\rangle(\bar\theta) \geq 0$. Note that $\lim_k p(\theta_k) \geq 0$ since $p(\theta) \geq 0$ for all $\theta$, and so
$$\lim_k\langle\xi_p(\theta_k), \xi_0(\bar\theta)\rangle = \lim_k p(\theta_k)\,\langle-\alpha\chi, \xi_0\rangle(\bar\theta) + \|\xi_0(\bar\theta)\|^2 > 0.$$
This is a contradiction since $\lim_k\xi_p(\theta_k) = 0$.

(ii) Otherwise, $\langle-\alpha\chi, \xi_0\rangle(\bar\theta) < 0$ and hence $\langle-\alpha\chi, \xi_0\rangle(\theta) < 0$ in a neighbourhood. In particular there exists $N \in \mathbb{N}$ such that $\langle-\alpha\chi, \xi_0\rangle(\theta_k) < 0$ for all $k \geq N$, and so
$$p_1(\theta_k) = \min\left\{1, \frac{a\|\xi_0(\theta_k)\|^2}{\langle\alpha\chi, \xi_0\rangle(\theta_k)}\right\}$$
for all $k \geq N$. Now notice that
$$\lim_k p(\theta_k) = \lim_k\min\left\{1, \frac{a\|\xi_0(\theta_k)\|^2}{\langle\alpha\chi, \xi_0\rangle(\theta_k)}, p_2(\theta_k)\right\},$$
which implies
$$\lim_k p(\theta_k) \leq \lim_k\frac{a\|\xi_0(\theta_k)\|^2}{\langle\alpha\chi, \xi_0\rangle(\theta_k)} = \frac{a\|\xi_0(\bar\theta)\|^2}{\langle\alpha\chi, \xi_0\rangle(\bar\theta)}$$
by continuity and Lemma E.1. Finally we conclude
$$\lim_k\langle\xi_p, \xi_0\rangle(\theta_k) = \lim_k p(\theta_k)\,\langle-\alpha\chi, \xi_0\rangle(\bar\theta) + \|\xi_0(\bar\theta)\|^2 \geq -a\|\xi_0(\bar\theta)\|^2 + \|\xi_0(\bar\theta)\|^2 > 0$$
for any $a \in (0, 1)$, a contradiction.

In both cases a contradiction is obtained, hence $\xi_0(\bar\theta) = 0 = (I - \alpha H_o)\xi(\bar\theta)$. Now note that $(I - \alpha H_o)(\bar\theta)$ is singular iff $H_o(\bar\theta)$ has an eigenvalue $1/\alpha$, which is impossible for $\alpha$ sufficiently small. Hence $(I - \alpha H_o)\xi(\bar\theta) = 0$ implies $\xi(\bar\theta) = 0$, as required.

Now assume that $\theta$ is initialised randomly (or with arbitrarily small noise around a point), as is standard in ML. We prove that SOS locally avoids strict saddles using the Stable Manifold Theorem, inspired by Lee et al. (2017).

Theorem E.3 (Stable Manifold Theorem). Let $\bar{x}$ be a fixed point for the $C^1$ local diffeomorphism $F : U \to \mathbb{R}^d$, where $U$ is a neighbourhood of $\bar{x}$ in $\mathbb{R}^d$. Let $E_s \oplus E_u$ be the generalised eigenspaces of $\nabla F(\bar{x})$ corresponding to eigenvalues with $|\lambda| \leq 1$ and $|\lambda| > 1$ respectively. Then there exists a local stable center manifold $W$ with tangent space $E_s$ at $\bar{x}$ and a neighbourhood $B$ of $\bar{x}$ such that $F(W)\cap B \subseteq W$ and $\bigcap_{n=0}^\infty F^{-n}(B) \subseteq W$.

In particular, if $\nabla F(\bar{x})$ has at least one eigenvalue $|\lambda| > 1$ then $E_u$ has dimension at least 1. Since $W$ has tangent space $E_s$ at $\bar{x}$, with codimension at least one, it follows that $W$ has measure zero in $\mathbb{R}^d$. This is central in proving that the set of initial points in a neighbourhood which converge through SOS to a given strict saddle $\bar\theta$ has measure zero.

Theorem E.4.
SOS locally avoids strict saddles almost surely, for $\alpha > 0$ sufficiently small.

Proof. Let $\bar\theta$ be a strict saddle and recall that SOS is given by
$$F(\theta) = \theta - \alpha(I - \alpha H_o)\xi(\theta) + \alpha^2 p(\theta)\chi(\theta).$$
Recall by Lemma D.7 that $p(\theta) = \|\xi(\theta)\|^2$ for all $\theta$ in a neighbourhood $U$ of $\bar\theta$. Restricting $F$ to $U$, all terms involved are continuously differentiable and
$$\nabla F(\bar\theta) = I - \alpha(I - \alpha H_o)H(\bar\theta)$$
by the assumption that $\xi(\bar\theta) = 0$. Since all terms except $I$ are of order at least $\alpha$, $\nabla F(\bar\theta)$ is invertible for all $\alpha$ sufficiently small. By the inverse function theorem, there exists a neighbourhood $V$ of $\bar\theta$ such that $F$ has a continuously differentiable inverse on $V$. Hence $F$ restricted to $U\cap V$ is a $C^1$ diffeomorphism with fixed point $\bar\theta$.

By definition of strict saddles, $H(\bar\theta)$ has an eigenvalue with negative real part. It follows by continuity that $(I - \alpha H_o)H(\bar\theta)$ also has an eigenvalue $a + ib$ with $a < 0$ for $\alpha$ sufficiently small. Finally, $\nabla F(\bar\theta) = I - \alpha(I - \alpha H_o)H(\bar\theta)$ has an eigenvalue $\lambda = 1 - \alpha a - i\alpha b$ with
$$|\lambda| = \sqrt{1 - 2\alpha a + \alpha^2(a^2 + b^2)} \geq \sqrt{1 - 2\alpha a} > 1.$$
It follows that $E_s$ has codimension at least one, implying in turn that the local stable set $W$ has measure zero. We can now prove that
$$Z = \{\theta \in U\cap V \mid \lim_{n\to\infty}F^n(\theta) = \bar\theta\}$$
has measure zero, or in other words, that local convergence to $\bar\theta$ occurs with zero probability. Let $B$ be the neighbourhood guaranteed by the Stable Manifold Theorem, and take any $\theta \in Z$. By definition of convergence there exists $N \in \mathbb{N}$ such that $F^{N+n}(\theta) \in B$ for all $n \geq 0$, so that
$$F^N(\theta) \in \bigcap_{n=0}^\infty F^{-n}(B) \subseteq W$$
by the Stable Manifold Theorem. This implies that $\theta \in F^{-N}(W)$, and finally $\theta \in \bigcup_{n\in\mathbb{N}}F^{-n}(W)$. Since $\theta$ was arbitrary, we obtain the inclusion
$$Z \subseteq \bigcup_{n\in\mathbb{N}}F^{-n}(W).$$
Now $F^{-1}$ is $C^1$, hence locally Lipschitz and thus preserves sets of measure zero, so that $F^{-n}(W)$ has measure zero for each $n$. Countable unions of measure zero sets still have measure zero, so we conclude that $Z$ also has measure zero. In other words, SOS converges to $\bar\theta$ with zero probability upon random initialisation of $\theta$ in $U$.

Figure 5: Ground truth in the Gaussian mixture experiment.

F FURTHER GAUSSIAN MIXTURE EXPERIMENTS

In the Gaussian mixture experiment, data is sampled from a highly multimodal distribution designed to probe the tendency to collapse onto a subset of modes during training, given in Figure 5. The generator distribution and KL divergence are given at {2k, 4k, 6k, 8k} iterations for LA, LOLA and SGA in Figure 6. LOLA and SGA successfully spread mass across all Gaussians. LookAhead displays mode collapse and hopping in early stages, but begins to incorporate further mixtures near 8k iterations. We ran further iterations and discovered that LookAhead eventually spreads mass across all mixtures, though very slowly. Comparing with the results for NL/CO/SOS in the main text, we see that CO/SOS/LOLA/SGA are equally successful in qualitative terms. Note that SOS/CO are slightly superior with respect to KL divergence after 6-8k iterations, though LOLA is initially faster. This may be due only to random sampling. We also noticed experimentally that LOLA often moves away from the correct distribution after 8-10k iterations (not shown), while SOS stays stable in the long run. This may occur thanks to the two-part criterion encouraging convergence, while LOLA continually attempts to exploit opponent learning. Finally we plot $\|\xi\|$ at all iterations up to 12k for SOS, LA and NL in Figure 7 (other algorithms are omitted for visibility). This gives further evidence of SOS converging quite rapidly to the correct distribution, while NL perpetually suffers from mode hopping and LA lags behind significantly.
Iterations 2k / 4k / 6k / 8k; the three rows of panels have $D_{KL}$ values 0.84 / 1.19 / 1.01 / 0.78, then 0.30 / 0.11 / 0.09 / 0.07, then 0.63 / 0.32 / 0.19 / 0.12.
Figure 6: Generator distribution at sampled iterations for LA/LOLA/SGA. LA suffers in the early stages from mode collapse and hopping, but incorporates more mixtures later on. LOLA and SGA learn the correct mixture of Gaussians. Below each plot: KL divergence $D_{KL}(P\,\|\,Q)$ from generator $P$ to ground truth $Q$, estimated from 25600 samples. Best result at each iteration in bold.

Figure 7: Semilog plot of $\|\xi\|$ at each iteration (up to 12000) for SOS, LA and NL.