# The Mechanics of n-Player Differentiable Games

David Balduzzi¹, Sébastien Racanière¹, James Martens¹, Jakob Foerster², Karl Tuyls¹, Thore Graepel¹
¹DeepMind, ²University of Oxford. Correspondence to: David Balduzzi.
Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract. The cornerstone underpinning deep learning is the guarantee that gradient descent on an objective converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, where there are multiple interacting losses. The behavior of gradient-based methods in games is not well understood and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new techniques to understand and control the dynamics in general games. The key result is to decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in general games. Basic experiments show SGA is competitive with recently proposed algorithms for finding stable fixed points in GANs, whilst at the same time being applicable to, and having guarantees in, much more general games.

1. Introduction

Recent progress in machine learning is heavily dependent on gradient descent, applied to optimize the parameters of models with respect to a (single) objective. A basic result is that gradient descent converges to a local minimum of the objective under a broad range of conditions (Lee et al., 2017). However, there is a rapidly growing set of powerful models that do not optimize a single objective, including: generative adversarial networks (Goodfellow et al., 2014), proximal gradient TD learning (Liu et al., 2016), multi-level optimization (Pfau & Vinyals, 2016), synthetic gradients (Jaderberg et al., 2017), hierarchical reinforcement learning (Wayne & Abbott, 2014; Vezhnevets et al., 2017), curiosity (Pathak et al., 2017), and imaginative agents (Racanière et al., 2017). In effect, these models are trained via games played by cooperating and competing modules.

No-regret algorithms such as gradient descent are guaranteed to converge to coarse correlated equilibria in games (Stoltz & Lugosi, 2007). However, the dynamics do not converge to Nash equilibria and do not even stabilize in general (Mertikopoulos et al., 2018). Concretely, cyclic behaviors emerge even in simple cases, see example 1.

This paper presents an analysis of the second-order structure of game dynamics that allows us to identify two classes of games, potential and Hamiltonian, that are easy to solve separately. We then derive symplectic gradient adjustment (SGA), a method for finding stable fixed points in games. SGA's performance is evaluated in basic experiments.

Background and problem description. Procedures that converge to Nash equilibria have been found for restricted game classes: potential games, two-player zero-sum games, and a few others (Hu & Wellman, 2003; Hart & Mas-Colell, 2013).
Finding equilibria can be reformulated as a nonlinear complementarity problem, but such problems are hopelessly impractical to solve in general (Shoham & Leyton-Brown, 2008) because the problem is PPAD-hard (Daskalakis et al., 2009). Players are primarily neural nets in our setting. We therefore restrict to gradient-based methods (game theorists have considered a much broader range of techniques). Losses are not necessarily convex in any of their parameters, so Nash equilibria do not necessarily exist. Leaving existence aside, finding Nash equilibria is analogous to, but much harder than, finding global minima in neural nets, which is not realistic with gradient-based methods.

There are (at least) three problems with gradient descent in games. Firstly, the potential existence of cycles (recurrent dynamics) implies there are no convergence guarantees, see example 1 and Mertikopoulos et al. (2018). Secondly, even when gradient descent converges, the rate may be too slow in practice because rotational forces necessitate extremely small learning rates (see figure 3). Finally, since there is no single objective, there is no way to measure progress. Application-specific proxies have been proposed, for example the inception score for GANs (Salimans et al., 2016), but these are of little help during training: the inception score is no substitute for looking at samples.

Outline and summary of main contributions. We start with the well-known case of a zero-sum bimatrix game: example 1. It turns out that the dynamics of the simultaneous gradient $\xi$ (that is, the dynamics under simultaneous gradient descent) can be reformulated via Hamilton's equations. The cyclic behavior arises because the dynamics live on the level sets of the Hamiltonian. More directly useful, gradient descent on the Hamiltonian converges to a Nash equilibrium.

Lemma 1 shows that the Hessian of any game decomposes into symmetric and antisymmetric components. There are thus two pure cases: when only the symmetric component is present, or only the antisymmetric. The first case, known as potential games (Monderer & Shapley, 1996), has been intensively studied because these are exactly the games where gradient descent does converge. The second case, Hamiltonian games (Lu (1992) defined an unrelated notion of Hamiltonian game), was not studied previously, probably because such games coincide with zero-sum games in the bimatrix case (or constant-sum, depending on the constraints). Zero-sum and Hamiltonian games differ when the losses are not bilinear or there are more than two players.

Hamiltonian games are important because (i) they are easy to solve and (ii) general games combine potential-like and Hamiltonian-like dynamics. The concept of a zero-sum game is too loose to be useful when there are many players: any n-player game can be reformulated as a zero-sum (n+1)-player game where $\ell_{n+1} = -\sum_{i=1}^{n}\ell_i$. In this sense, zero-sum games are as complicated as general-sum games. Theorem 3 shows that Hamiltonian games obey a conservation law, which also provides the key to solving them, by gradient descent on the conserved quantity.

The general case, neither potential nor Hamiltonian, is more difficult and the focus of the remainder of the paper. Section 3 proposes symplectic gradient adjustment (SGA), a gradient-based method for finding stable fixed points in general games. Appendix C contains TensorFlow code to compute the adjustment. The algorithm computes two Hessian-vector products, at a cost of two iterations of backprop.
SGA satisfies a few natural desiderata: D1, it is compatible with the original dynamics; and D2, D3, it is guaranteed to find stable equilibria in potential and Hamiltonian games. For general games, correctly picking the sign of the adjustment (whether to add or subtract) is critical since it determines the behavior near stable and unstable equilibria. Section 2.3 defines stable equilibria and relates them to local Nash equilibria. Lemma 10 then shows how to set the sign so as to converge to stable fixed points. Correctly aligning SGA allows higher learning rates and faster, more robust convergence, see theorem 7 and the experiments in section 4.

Section 4 investigates a basic GAN setup from Metz et al. (2017) that tests for mode collapse and mode hopping. Whereas simultaneous gradient descent completely fails, the symplectic adjustment leads to rapid convergence, slightly improved by correctly choosing the sign of the adjustment. Finally, section 3.5 applies the same criterion to align consensus optimization (Mescheder et al., 2017), preventing it from converging to unstable equilibria and (slightly) improving performance, figure 9 in the appendix.

Caveat. The behavior of SGA near fixed points that are neither negative nor positive semi-definite is not analysed. On the one hand, it was only recently shown that gradient descent behaves well near saddles when optimizing a single objective (Lee et al., 2016; 2017). On the other hand, Newton's method is attracted to saddles; see the analysis and recently proposed remedy in Dauphin et al. (2014). Studying indefinite fixed points is deferred to future work.

Related work. Convergence to Nash equilibria in two-player games was studied in Singh et al. (2000). WoLF (Win or Learn Fast) converges to Nash equilibria in two-player two-action games (Bowling & Veloso, 2002). Extensions include weighted policy learning (Abdallah & Lesser, 2008) and GIGA-WoLF (Bowling, 2004). There has been almost no work on convergence to fixed points in general games. Optimistic mirror descent approximately converges in two-player bilinear zero-sum games (Daskalakis et al., 2018), a special case of Hamiltonian games. In more general settings it converges to coarse correlated equilibria.

There has been interesting recent work on convergence in GANs. Heusel et al. (2017) propose a two-timescale method to find Nash equilibria. However, it likely scales badly with the number of players. Nagarajan & Kolter (2017) prove convergence for some algorithms, but under very strong assumptions (Mescheder, 2018). Consensus optimization (Mescheder et al., 2017) is discussed in section 3. Learning with opponent-learning awareness (LOLA) infinitesimally modifies the objectives of players to take into account their opponents' goals (Foerster et al., 2018). However, there are no guarantees that LOLA converges, and in general it may modify the fixed points of the game.

Notation. Dot products are written as $v^\top w$ or $\langle v, w\rangle$. The angle between vectors is $\theta(v, w)$. Positive definiteness is denoted $S \succ 0$. Omitted proofs are in appendix B.

2. The infinitesimal structure of games

In contrast to the classical formulation of games, we do not constrain the parameter sets (e.g. to the probability simplex) or require losses to be convex in the corresponding players' parameters. Players could be interacting neural nets such as GANs (Goodfellow et al., 2014).

Figure 1. (A) The simultaneous gradient $\xi$ cycles around the origin. (B) Gradient descent on the Hamiltonian $\mathcal{H}$ converges to the Nash equilibrium.
Definition 1. A game is a set of players $[n] = \{1, \dots, n\}$ and twice continuously differentiable losses $\{\ell_i : \mathbb{R}^d \to \mathbb{R}\}_{i=1}^n$. Parameters are $w = (w_1, \dots, w_n) \in \mathbb{R}^d$ with $w_i \in \mathbb{R}^{d_i}$, where $\sum_{i=1}^n d_i = d$. Player $i$ controls $w_i$.

It is sometimes convenient to write $w = (w_i, w_{-i})$, where $w_{-i}$ concatenates the parameters of all the players other than the $i$th, which is placed out of order by abuse of notation.

The simultaneous gradient is the gradient of the losses with respect to the parameters of the respective players:
$$\xi(w) = (\nabla_{w_1}\ell_1, \dots, \nabla_{w_n}\ell_n) \in \mathbb{R}^d.$$
By the dynamics of the game, we mean following the negative of the vector field $\xi$ with infinitesimal steps. There is no reason to expect $\xi$ to be the gradient of a single function in general, and therefore no reason to expect the dynamics to converge to a fixed point.

2.1. Warmup: Hamiltonian mechanics in games

The next example illustrates the essential problem with gradients in games and the key insight motivating our approach.

Example 1 (Conservation of energy in a zero-sum unconstrained bimatrix game). Zero-sum games, where $\sum_{i=1}^n \ell_i \equiv 0$, are well studied. The zero-sum game with losses $\ell_1(x, y) = x^\top A y$ and $\ell_2(x, y) = -x^\top A y$ has a Nash equilibrium at $(x, y) = (0, 0)$. The simultaneous gradient $\xi(x, y) = (A y, -A^\top x)$ rotates around the Nash equilibrium, see figure 1.

The matrix $A$ admits a singular value decomposition (SVD) $A = U^\top D V$. Changing to coordinates $u = D^{\frac12} U x$ and $v = D^{\frac12} V y$ gives $\ell_{1/2}(u, v) = \pm u^\top v$. Introduce the Hamiltonian
$$\mathcal{H}(x, y) = \tfrac{1}{2}\bigl(x^\top U^\top D U x + y^\top V^\top D V y\bigr) = \tfrac{1}{2}\bigl(\|u\|^2 + \|v\|^2\bigr).$$
Remarkably, the dynamics can be reformulated via Hamilton's equations in the coordinates given by the SVD of $A$:
$$\xi(u, v) = \Bigl(\frac{\partial\mathcal{H}}{\partial v}, -\frac{\partial\mathcal{H}}{\partial u}\Bigr).$$
The vector field cycles around the equilibrium because $\xi$ conserves the Hamiltonian's level sets (i.e. $\langle\xi, \nabla\mathcal{H}\rangle = 0$). However, gradient descent on the Hamiltonian converges to the Nash equilibrium. The remainder of the paper explores the implications and limitations of this insight.

Papadimitriou & Piliouras (2016) recently analyzed the dynamics of Matching Pennies (essentially, the above example) and showed with some effort that the cyclic behavior covers the entire parameter space. The Hamiltonian reformulation directly explains the cyclic behavior via a conservation law.

2.2. The generalized Helmholtz decomposition

The Hessian of a game is the $(d \times d)$ matrix of second derivatives $H(w) := \nabla_w\,\xi(w)^\top$. Concretely, the Hessian can be written in block form,
$$H(w) = \begin{pmatrix} \nabla^2_{w_1, w_1}\ell_1 & \cdots & \nabla^2_{w_1, w_n}\ell_1 \\ \vdots & \ddots & \vdots \\ \nabla^2_{w_n, w_1}\ell_n & \cdots & \nabla^2_{w_n, w_n}\ell_n \end{pmatrix},$$
where $\nabla^2_{w_i, w_j}\ell_i$ is the $(d_i \times d_j)$ block of second-order derivatives of player $i$'s loss. The Hessian of a game is not necessarily symmetric.

Lemma 1 (generalized Helmholtz decomposition). The Hessian of any vector field decomposes uniquely into two components $H(w) = S(w) + A(w)$, where $S \equiv S^\top$ is symmetric and $A + A^\top \equiv 0$ is antisymmetric.

Proof. Any matrix decomposes uniquely as $M = S + A$, where $S = \frac12(M + M^\top)$ and $A = \frac12(M - M^\top)$ are symmetric and antisymmetric respectively. Applying the decomposition to the game Hessian yields the result.

The decomposition is preserved by orthogonal change of coordinates, $P^\top M P = P^\top S P + P^\top A P$, since the terms remain symmetric and antisymmetric. The connection to the classical Helmholtz decomposition in calculus is sketched in appendix E. Two obvious classes of games arise from the decomposition:

Definition 2. A game is a potential game if $A(w) \equiv 0$. It is a Hamiltonian game if $S(w) \equiv 0$.

Potential games are well studied and easy to solve. Hamiltonian games are a new class of games that are also easy to solve. The general case is more delicate, see section 3.
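To make example 1 and the decomposition concrete, the following minimal numpy sketch (our own illustration, not code from the paper; the matrix $A$, step size, and iteration count are arbitrary choices) builds the simultaneous gradient and game Hessian of the bilinear zero-sum game, checks that the symmetric component vanishes, and contrasts simultaneous gradient descent with gradient descent on the Hamiltonian $\mathcal{H}(w) = \frac12\|\xi(w)\|^2$.

```python
import numpy as np

# Minimal sketch of Example 1 and Lemma 1 for l1 = x^T A y, l2 = -x^T A y.
# The matrix A, step size, and iteration count are illustrative choices.
A = np.array([[1.0, 0.4],
              [-0.3, 0.8]])

def xi(w):
    """Simultaneous gradient xi = (grad_x l1, grad_y l2)."""
    x, y = w[:2], w[2:]
    return np.concatenate([A @ y, -A.T @ x])

# Game Hessian H = Jacobian of xi (constant for a bilinear game).
H_mat = np.block([[np.zeros((2, 2)), A],
                  [-A.T, np.zeros((2, 2))]])
S = 0.5 * (H_mat + H_mat.T)            # symmetric (potential) component
print("||S|| =", np.linalg.norm(S))    # 0: the game is Hamiltonian

# Simultaneous gradient descent cycles (and slowly spirals outward for finite
# steps), whereas gradient descent on Ham(w) = 1/2 ||xi(w)||^2 converges to the
# Nash equilibrium at the origin.  Note grad Ham = H^T xi.
w_sim = np.array([1.0, 0.5, -0.5, 1.0])
w_ham = w_sim.copy()
eta = 0.05
for _ in range(500):
    w_sim = w_sim - eta * xi(w_sim)
    w_ham = w_ham - eta * H_mat.T @ xi(w_ham)

print("simultaneous GD :", np.linalg.norm(w_sim))   # larger than at the start
print("descent on Ham  :", np.linalg.norm(w_ham))   # close to 0
```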
2.3. Stable fixed points

Gradient-based methods can reliably find local, but not global, optima of nonconvex objective functions (Lee et al., 2016; 2017). Similarly, gradient-based methods cannot be expected to find global Nash equilibria in games.

Definition 3. A fixed point $w^*$, with $\xi(w^*) = 0$, is stable if $S(w) \succeq 0$ and unstable if $S(w) \preceq 0$ for $w$ in a neighborhood of $w^*$. Fixed points that are neither positive nor negative semidefinite are beyond the scope of the paper.

Lemma 2 (Stable fixed points are local Nash equilibria). A point $w^*$ is a local Nash equilibrium if, for all $i$, there is a neighborhood $U_i$ of $w^*_i$ such that $\ell_i(w_i, w^*_{-i}) \geq \ell_i(w^*_i, w^*_{-i})$ for $w_i \in U_i$. If a fixed point $w^*$ is stable then it is a local Nash equilibrium.

Proof. If $S$ is positive semidefinite then so are its diagonal $(d_i \times d_i)$ submatrices $S_{ii} = \nabla^2_{w_i, w_i}\ell_i$ for all $i$. The result follows.

Appendix A contains more details on local Nash equilibria.

2.4. Potential games

Potential games were introduced in Monderer & Shapley (1996). A game is a potential game if there is a single potential function $\phi : \mathbb{R}^d \to \mathbb{R}$ and positive numbers $\{\alpha_i > 0\}_{i=1}^n$ such that
$$\ell_i(w_i', w_{-i}) - \ell_i(w_i'', w_{-i}) = \alpha_i\bigl(\phi(w_i', w_{-i}) - \phi(w_i'', w_{-i})\bigr)$$
for all $i$ and all $w_i', w_i'', w_{-i}$. Monderer & Shapley (1996) show that a game is a potential game iff $\alpha_i\nabla_{w_i}\ell_i = \nabla_{w_i}\phi$ for all $i$, which is equivalent to
$$\alpha_i\nabla^2_{w_i, w_j}\ell_i = \alpha_j\bigl(\nabla^2_{w_j, w_i}\ell_j\bigr)^\top \quad\text{for all } i, j.$$
Our definition of potential game is the special case where $\alpha_i = 1$ for all $i$, which Monderer & Shapley (1996) call an exact potential game. We use the shorthand potential game to refer to exact potential games in what follows.

Potential games have been extensively studied since they are one of the few classes of games for which Nash equilibria can be computed (Rosenthal, 1973). For our purposes, they are games where simultaneous gradient descent on the losses is gradient descent on a single function. It follows that descent on $\xi$ converges to a fixed point that is a local minimum of $\phi$ or a saddle, but see Lee et al. (2017).

2.5. Hamiltonian games

A concrete example may help to understand antisymmetric matrices. Suppose $n$ competitors play one-on-one and that the probability of player $i$ beating player $j$ is $p_{ij}$. Then, assuming there are no draws, the probabilities satisfy $p_{ij} + p_{ji} = 1$ and $p_{ii} = \frac12$. The matrix $A = \bigl(\log\frac{p_{ij}}{1 - p_{ij}}\bigr)_{i,j=1}^n$ of logits is then antisymmetric. Intuitively, antisymmetry reflects a hyperadversarial setting where all pairwise interactions between players are zero-sum. In general, Hamiltonian games are related to, but distinct from, zero-sum games.

Example 2 (an unconstrained bimatrix game is zero-sum iff it is Hamiltonian). (The parameters in bimatrix games are usually constrained to the probability simplex; here they are unconstrained.) Consider the bimatrix game with $\ell_1(x, y) = x^\top P y$ and $\ell_2(x, y) = x^\top Q y$. Then $\xi = (P y, Q^\top x)$ and the Hessian components have block structure
$$A(w) = \frac12\begin{pmatrix} 0 & P - Q \\ (Q - P)^\top & 0 \end{pmatrix}, \qquad S(w) = \frac12\begin{pmatrix} 0 & P + Q \\ (P + Q)^\top & 0 \end{pmatrix}.$$
The game is Hamiltonian iff $S = 0$ iff $P + Q = 0$ iff $\ell_1 + \ell_2 = 0$.

There are Hamiltonian games that are not zero-sum and vice versa.

Example 3 (Hamiltonian game that is not zero-sum). Fix constants $a$ and $b$ and suppose players 1 and 2 minimize losses $\ell_1(x, y) = x(y - b)$ and $\ell_2(x, y) = -(x - a)y$ with respect to $x$ and $y$ respectively.

Example 4 (zero-sum game that is not Hamiltonian). Players 1 and 2 minimize $\ell_1(x, y) = x^2 + y^2$ and $\ell_2(x, y) = -(x^2 + y^2)$. The game actually has potential function $\phi(x, y) = x^2 - y^2$.

Hamiltonian games are quite different from potential games. There is a Hamiltonian function $\mathcal{H}$ that specifies a conserved quantity. Whereas the dynamics equal $\nabla\phi$ in potential games, they are orthogonal to $\nabla\mathcal{H}$ in Hamiltonian games. The orthogonality implies the conservation law that underlies the cyclic behavior in example 1.
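The distinction in examples 2-4 is easy to check numerically. Below is a small sketch (our own illustration; the constants $a$, $b$ and the test point are arbitrary) confirming that example 3 has an antisymmetric Hessian yet losses that do not sum to zero, while example 4 is zero-sum yet purely potential.

```python
import numpy as np

# Quick check of Examples 3 and 4: "Hamiltonian" and "zero-sum" differ.
# The constants a, b and the test point are illustrative choices.
a, b = 1.0, 2.0
x, y = 0.3, -0.7

# Example 3: l1 = x(y - b), l2 = -(x - a) y, so xi = (y - b, a - x).
H3 = np.array([[0.0, 1.0],
               [-1.0, 0.0]])              # Jacobian of xi: purely antisymmetric
S3 = 0.5 * (H3 + H3.T)
loss_sum3 = x * (y - b) + (-(x - a) * y)
print("Example 3: ||S|| =", np.linalg.norm(S3), " loss sum =", loss_sum3)
# S = 0 (a Hamiltonian game) although the losses do not sum to zero.

# Example 4: l1 = x^2 + y^2, l2 = -(x^2 + y^2), so xi = (2x, -2y) = grad phi
# for the potential phi(x, y) = x^2 - y^2.
H4 = np.array([[2.0, 0.0],
               [0.0, -2.0]])              # symmetric Jacobian
A4 = 0.5 * (H4 - H4.T)
loss_sum4 = (x**2 + y**2) + (-(x**2 + y**2))
print("Example 4: ||A|| =", np.linalg.norm(A4), " loss sum =", loss_sum4)
# A = 0 (a potential game) although the game is zero-sum.
```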
Theorem 3. Let $\mathcal{H}(w) := \frac12\|\xi(w)\|^2$. If the game is Hamiltonian then (i) $\nabla\mathcal{H} = A^\top\xi$ and (ii) $\xi$ preserves the level sets of $\mathcal{H}$ since $\langle\xi, \nabla\mathcal{H}\rangle = 0$. If the Hessian is invertible and $\lim_{\|w\|\to\infty}\mathcal{H}(w) = \infty$, then (iii) gradient descent on $\mathcal{H}$ converges to a local Nash equilibrium.

In fact, $\mathcal{H}$ is a Hamiltonian function for the game dynamics (notation: $\mathcal{H}$ for the Hamiltonian, $H$ for the Hessian). We use the notation $\mathcal{H}(w) = \frac12\|\xi(w)\|^2$ throughout the paper. However, $\mathcal{H}$ is only a Hamiltonian function for $\xi$ if the game is Hamiltonian.

There is a precise mapping from Hamiltonian games to symplectic geometry, see appendix E. Symplectic geometry is the modern formulation of classical mechanics. Recall that periodic behaviors (e.g. orbits) often arise in classical mechanics. The orbits lie on the level sets of the Hamiltonian, which expresses the total energy of the system.

3. Algorithms

Fixed points of potential and Hamiltonian games can be found by descent on $\xi$ and $\nabla\mathcal{H}$ respectively. This section tackles finding stable fixed points in general games.

3.1. Finding fixed points

If the Hessian $H(w)$ is invertible, then $\nabla\mathcal{H} = H^\top\xi = 0$ iff $\xi = 0$. Thus, gradient descent on $\mathcal{H}$ converges to fixed points of $\xi$. However, there is no guarantee that descent on $\mathcal{H}$ will find a stable fixed point.

Mescheder et al. (2017) propose consensus optimization, a gradient adjustment of the form $\xi + \lambda\cdot H^\top\xi = \xi + \lambda\cdot\nabla\mathcal{H}$. Unfortunately, consensus optimization can converge to unstable fixed points even in simple cases where the game is to minimize a single function:

Example 5 (consensus optimization converges to a global maximum). Consider a potential game with losses $\ell_1(x, y) = \ell_2(x, y) = -\frac{\kappa}{2}(x^2 + y^2)$ with $\kappa \gg 0$. Then $\xi(x, y) = -\kappa\cdot(x, y)$ and $H = -\kappa\cdot I$. Note that $\|\xi\|^2 = \kappa^2(x^2 + y^2)$ and
$$\xi + \lambda\cdot H^\top\xi = (\lambda\kappa^2 - \kappa)\cdot(x, y).$$
Descent on $\xi + \lambda\cdot H^\top\xi$ converges to the global maximum $(x, y) = (0, 0)$ unless $\lambda < \frac1\kappa$.

Although consensus optimization works well in some special cases like two-player zero-sum games, it cannot be considered a candidate algorithm for finding stable fixed points in general games, since it fails in the basic case of potential games.

Algorithm 1 Symplectic Gradient Adjustment
  Input: losses $L = \{\ell_i\}_{i=1}^n$, weights $W = \{w_i\}_{i=1}^n$
  $\xi \leftarrow$ [gradient($\ell_i$, $w_i$) for $(\ell_i, w_i) \in (L, W)$]
  $A^\top\xi \leftarrow$ get_sym_adj($L$, $W$)   // appendix C
  if align then
    $\nabla\mathcal{H} \leftarrow$ [gradient($\frac12\|\xi\|^2$, $w$) for $w \in W$]
    $\lambda \leftarrow \mathrm{sign}\bigl(\frac{1}{d}\langle\xi, \nabla\mathcal{H}\rangle\langle A^\top\xi, \nabla\mathcal{H}\rangle + \epsilon\bigr)$   // $\epsilon = \frac{1}{10}$
  else
    $\lambda \leftarrow 1$
  end if
  Output: $\xi + \lambda\cdot A^\top\xi$   // plug into any optimizer

3.2. Finding stable fixed points

There are two classes of games where we know how to find stable fixed points: potential games, where descent on $\xi$ converges to a local minimum, and Hamiltonian games, where descent on $\nabla\mathcal{H}$, which is orthogonal to $\xi$, finds stable fixed points. In the general case, the following desiderata provide a set of reasonable properties for an adjustment $\xi_\lambda$ of the game dynamics.

Desiderata. To find stable fixed points, an adjustment $\xi_\lambda$ to the game dynamics should satisfy
  D1. compatible with game dynamics: $\langle\xi_\lambda, \xi\rangle = \alpha_1\|\xi\|^2$;
  D2. compatible with potential dynamics: if the game is a potential game then $\langle\xi_\lambda, \nabla\phi\rangle = \alpha_2\|\nabla\phi\|^2$;
  D3. compatible with Hamiltonian dynamics: if the game is Hamiltonian then $\langle\xi_\lambda, \nabla\mathcal{H}\rangle = \alpha_3\|\nabla\mathcal{H}\|^2$;
  D4. attracted to stable equilibria: in neighborhoods where $S \succeq 0$, require $\theta(\xi_\lambda, \nabla\mathcal{H}) \leq \theta(\xi, \nabla\mathcal{H})$;
  D5. repelled by unstable equilibria: in neighborhoods where $S \preceq 0$, require $\theta(\xi_\lambda, \nabla\mathcal{H}) \geq \theta(\xi, \nabla\mathcal{H})$;
for some $\alpha_1, \alpha_2, \alpha_3 > 0$.
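As a concrete reference point, here is a minimal dense-matrix sketch of Algorithm 1 for small problems. It is our own illustration and assumes the full game Hessian is available; the paper's appendix C instead computes $A^\top\xi$ and $\nabla\mathcal{H}$ with two Hessian-vector products. Function and variable names are ours.

```python
import numpy as np

def sga_adjustment(xi, H_mat, align=True, eps=0.1):
    """Return the adjusted gradient xi + lambda * A^T xi of Algorithm 1 (dense sketch).

    xi    : simultaneous gradient at the current parameters, shape (d,)
    H_mat : game Hessian (Jacobian of xi) at the current parameters, shape (d, d)
    """
    d = xi.size
    A = 0.5 * (H_mat - H_mat.T)      # antisymmetric component of the Hessian
    At_xi = A.T @ xi                 # symplectic adjustment term
    if align:
        grad_Ham = H_mat.T @ xi      # gradient of Ham(w) = 1/2 ||xi(w)||^2
        # Sign criterion: positive near stable fixed points (S >= 0), negative
        # near unstable ones (S <= 0); eps adds a weak bias towards stability.
        lam = np.sign((1.0 / d) * np.dot(xi, grad_Ham) * np.dot(At_xi, grad_Ham) + eps)
    else:
        lam = 1.0
    return xi + lam * At_xi

# Quick call on the potential game l1 = l2 = 1/2 (x^2 + y^2), where A = 0 and
# the adjustment reduces to the unmodified gradient.
w = np.array([1.0, -2.0])
xi = w.copy()                        # xi = (x, y)
H_mat = np.eye(2)                    # Jacobian of xi
print(sga_adjustment(xi, H_mat))     # equals xi: SGA preserves potential dynamics (D2)
```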
Desideratum D1 does not guarantee that players act in their own self-interest; this requires a stronger positivity condition on dot products with subvectors of $\xi$, see Balduzzi (2017). Desiderata D4 and D5 are explained in section 3.4. The unadjusted dynamics $\xi$ satisfies all the desiderata except D3. Consensus optimization only satisfies D3 and D4. Ideally, desideratum D5 would be strengthened to "repelled by saddle points where $\xi(w) = 0$ but $S(w) \not\succeq 0$".

3.3. Symplectic gradient adjustment

Proposition 4. The symplectic gradient adjustment (SGA)
$$\xi_\lambda := \xi + \lambda\cdot A^\top\xi$$
satisfies D1-D3 for $\lambda > 0$, with $\alpha_1 = 1 = \alpha_2$ and $\alpha_3 = \lambda$.

Note that desiderata D1 and D2 hold even when $\lambda < 0$. This will prove useful, since example 6 and theorem 5 show it is necessary to pick negative $\lambda$ near fixed points with $S \preceq 0$. Section 3.4 shows how to also satisfy desiderata D4 and D5.

Proof. First claim: $\xi^\top A^\top\xi = 0$ by antisymmetry of $A$. Second claim: $A \equiv 0$ in a potential game, so $\xi_\lambda = \xi = \nabla\phi$. Third claim: $\langle\xi_\lambda, \nabla\mathcal{H}\rangle = \langle\xi_\lambda, H^\top\xi\rangle = \langle\xi_\lambda, A^\top\xi\rangle = \lambda\cdot\xi^\top A A^\top\xi = \lambda\|\nabla\mathcal{H}\|^2$, since $H = A$ by assumption and $\xi^\top A^\top\xi = 0$ by antisymmetry.

In the example below, almost any choice of positive $\lambda$ results in convergence to an unstable equilibrium. The problem arises from the combination of a weak repellor with a strong rotational force. The next section shows how to pick $\lambda$ to avoid unstable equilibria.

Example 6 (failure case for $\lambda > 0$). Suppose $\epsilon > 0$ is small and the losses are $\ell_1(x, y) = -\frac{\epsilon}{2}x^2 + xy$ and $\ell_2(x, y) = -\frac{\epsilon}{2}y^2 - xy$, with an unstable equilibrium at $(0, 0)$. The dynamics are
$$\xi(x, y) = (-\epsilon x + y, -\epsilon y - x) = -\epsilon\cdot(x, y) + (y, -x).$$
Finally, observe that
$$\xi + \lambda\cdot A^\top\xi = (\lambda - \epsilon)\cdot(x, y) + (1 + \lambda\epsilon)\cdot(y, -x),$$
which converges to the unstable equilibrium if $\lambda > \epsilon$.

Lemma 9 in the appendix shows that, if $S$ and $A$ commute and $S \succeq 0$, then $\langle\xi_\lambda, \nabla\mathcal{H}\rangle \geq 0$ for $\lambda > 0$. The proof of theorem 5 introduces the additive condition number to upper-bound the worst-case noncommutativity of $S$, which allows us to quantify the relationship between $\xi_\lambda$ and $\nabla\mathcal{H}$.

Theorem 5. Let $S$ be a symmetric matrix with eigenvalues $\sigma_{\max} \geq \cdots \geq \sigma_{\min}$. The additive condition number of $S$ is $\kappa := \sigma_{\max} - \sigma_{\min}$. (The usual condition number of a positive definite matrix is $\sigma_{\max}/\sigma_{\min}$.) If $S \succeq 0$ is positive semidefinite with additive condition number $\kappa$, then $\lambda \in (0, \frac4\kappa)$ implies $\langle\xi_\lambda, \nabla\mathcal{H}\rangle \geq 0$. If $S$ is negative semidefinite, then $\lambda \in (-\frac4\kappa, 0)$ implies $\langle\xi_\lambda, \nabla\mathcal{H}\rangle \leq 0$. The inequalities are strict if $H$ is invertible.

If $\kappa = 0$, then $S = \sigma\cdot I$ commutes with all matrices. The larger the additive condition number $\kappa$, the larger the potential failure of $S$ to commute with other matrices.

3.4. How to pick sign(λ)

This section explains desiderata D4-D5 and shows how to pick $\mathrm{sign}(\lambda)$ to speed up convergence towards stable and away from unstable fixed points. First, observe that $\langle\xi, \nabla\mathcal{H}\rangle = \xi^\top(S + A)^\top\xi = \xi^\top S\xi$. It follows that, for $\xi \neq 0$:
$$\text{if } S \succ 0 \text{ then } \langle\xi, \nabla\mathcal{H}\rangle > 0; \qquad \text{if } S \prec 0 \text{ then } \langle\xi, \nabla\mathcal{H}\rangle < 0. \tag{1}$$
A criterion to probe the positive/negative definiteness of $S$ is thus to check the sign of $\langle\xi, \nabla\mathcal{H}\rangle$. The dot product can take any value if $S$ is neither positive nor negative definite. The behavior near saddles is beyond the scope of the paper.

Desideratum D4 can be interpreted as follows. If $\xi$ points at a stable equilibrium, then we require that $\xi_\lambda$ points more towards the equilibrium (i.e. has smaller angle). Conversely, if $\xi$ points away, then the adjustment should point further away. More formally:

Definition 4. Let $u$ and $v$ be two vectors. The infinitesimal alignment of $\xi_\lambda := u + \lambda\cdot v$ with a third vector $w$ is
$$\mathrm{align}(\xi_\lambda, w) := \frac{d}{d\lambda}\bigl\{\cos^2\theta_\lambda\bigr\}\Big|_{\lambda=0}, \quad \text{where } \theta_\lambda := \theta(\xi_\lambda, w).$$
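A minimal sketch of example 6 follows (our own illustration; $\epsilon$, $\lambda$, step size, and initial point are arbitrary). With a fixed positive $\lambda$, SGA is attracted to the unstable equilibrium at the origin; choosing negative $\lambda$ near $S \preceq 0$, as theorem 5 prescribes, repels the iterates.

```python
import numpy as np

# Example 6: a weak repellor (S = -eps * I) combined with a strong rotation.
# eps, lambda, the step size eta, and the initial point are illustrative choices.
eps = 0.01

def xi(w):
    x, y = w
    return np.array([-eps * x + y, -eps * y - x])

H_mat = np.array([[-eps, 1.0],
                  [-1.0, -eps]])           # game Hessian (constant here)
A = 0.5 * (H_mat - H_mat.T)                # antisymmetric part

def sga_step(w, lam, eta=0.1):
    v = xi(w)
    return w - eta * (v + lam * A.T @ v)   # xi_lambda = xi + lam * A^T xi

w_plus = np.array([0.5, 0.5])    # lambda = +1
w_minus = w_plus.copy()          # lambda = -1
for _ in range(200):
    w_plus = sga_step(w_plus, lam=+1.0)
    w_minus = sga_step(w_minus, lam=-1.0)

print("lambda = +1:", np.linalg.norm(w_plus))    # shrinks towards the unstable point
print("lambda = -1:", np.linalg.norm(w_minus))   # grows: repelled, as desired
```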
Figure 2. Infinitesimal alignment is positive (cyan) when a small positive $\lambda$ either: (A) pulls $u$ toward $w$, if $w$ and $u$ have angle $< 90°$; or (B) pushes $u$ away from $w$, if their angle is $> 90°$.

If $u$ and $w$ point the same way, $u^\top w > 0$, then align $> 0$ when $v$ bends $u$ further toward $w$, see figure 2A. Otherwise align $> 0$ when $v$ bends $u$ away from $w$, figure 2B.

Proposition 6. Desiderata D4-D5 are satisfied for $\lambda$ such that $\lambda\cdot\langle\xi, \nabla\mathcal{H}\rangle\cdot\langle A^\top\xi, \nabla\mathcal{H}\rangle \geq 0$.

Computing the sign of $\langle\xi, \nabla\mathcal{H}\rangle$ provides a check for stable and unstable fixed points. Computing the sign of $\langle A^\top\xi, \nabla\mathcal{H}\rangle$ checks whether the adjustment term points towards or away from the (nearby) fixed point. Putting the two checks together yields a prescription for the sign of $\lambda$.

Alignment and convergence rates. Finally, we show that increasing alignment helps speed convergence. Gradient descent is also known as the method of steepest descent. In general games, however, $\xi$ does not follow the steepest path to fixed points due to the rotational force, which forces lower learning rates and slows down convergence:

Theorem 7. Suppose $f$ is convex and Lipschitz smooth with $\|\nabla f(x) - \nabla f(y)\| \leq L\cdot\|x - y\|$. Let $w_{t+1} = w_t - \eta\cdot v$ where $\|v\| = \|\nabla f(w_t)\|$. Then the optimal step size is $\eta^* = \frac{\cos\theta}{L}$, where $\theta := \theta(\nabla f(w_t), v)$, with
$$f(w_{t+1}) \leq f(w_t) - \frac{\cos^2\theta}{2L}\,\|\nabla f(w_t)\|^2.$$
Increasing the cosine with the steepest direction improves convergence. The alignment computation in algorithm 1 chooses $\lambda$ to be positive or negative such that $\xi_\lambda$ is bent towards stable fixed points (increasing the cosine) and away from unstable fixed points. Adding a small $\epsilon > 0$ to the computation introduces a weak bias towards stable fixed points.

3.5. Aligned consensus optimization

The stability criterion in (1) also provides a simple way to prevent consensus optimization from converging to unstable equilibria. Aligned consensus optimization is
$$\xi + |\lambda|\cdot\mathrm{sign}\bigl(\langle\xi, \nabla\mathcal{H}\rangle\bigr)\cdot H^\top\xi, \tag{2}$$
where in practice we set $\lambda = 1$. Aligned consensus satisfies desiderata D3-D5. However, it behaves strangely in potential games. Multiplying by the Hessian is the inverse of Newton's method: it increases the gap between small and large eigenvalues, increasing the (usual, multiplicative) condition number and slowing down convergence. Nevertheless, consensus optimization works well in GANs (Mescheder et al., 2017), and aligned consensus may improve performance, see figure 9 in the appendix.

Finally, note that dropping the first term $\xi$ from (2) yields a simpler update that also satisfies D3-D5. However, the resulting algorithm performs poorly in experiments (not shown), perhaps because it is attracted to saddles.

4. Experiments

We compare SGA with simultaneous gradient descent, optimistic mirror descent (Daskalakis et al., 2018) and consensus optimization (Mescheder et al., 2017) in basic settings.

4.1. Learning rates and alignment

We investigate the effect of SGA when a weak attractor is coupled to a strong rotational force: $\ell_1(x, y) = \frac12 x^2 + 10xy$ and $\ell_2(x, y) = \frac12 y^2 - 10xy$.

Gradient descent is extremely sensitive to the choice of learning rate $\eta$, top row of figure 3. As $\eta$ increases through $\{0.01, 0.032, 0.1\}$, gradient descent goes from converging extremely slowly, to diverging slowly, to diverging rapidly. SGA yields faster, more robust convergence. SGA converges faster with learning rates $\eta = 0.01$ and $\eta = 0.032$, and only starts overshooting the fixed point for $\eta = 0.1$.

Figure 3. SGA allows faster and more robust convergence to stable fixed points than gradient descent (panels: learning rates 0.01, 0.032, 0.1). Note that the scale of the top-right panel differs from the rest.
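The following sketch (our own; the step count and initial point are arbitrary) runs this comparison at learning rate $\eta = 0.01$ with the unaligned adjustment $\lambda = 1$: plain gradient descent shrinks the iterates only slowly, while SGA cancels the rotational force for this game and contracts rapidly.

```python
import numpy as np

# Sketch of the section 4.1 experiment: a weak attractor coupled to a strong
# rotational force, l1 = x^2/2 + 10xy, l2 = y^2/2 - 10xy.  Step count and the
# initial point are illustrative; eta = 0.01 is the smallest rate in figure 3.

H_mat = np.array([[1.0, 10.0],
                  [-10.0, 1.0]])           # Jacobian of xi(x, y) = (x + 10y, y - 10x)
A = 0.5 * (H_mat - H_mat.T)                # strong rotational (antisymmetric) part

def xi(w):
    return H_mat @ w

eta = 0.01
w_gd = np.array([4.0, 3.0])
w_sga = w_gd.copy()
for _ in range(500):
    w_gd = w_gd - eta * xi(w_gd)                         # simultaneous gradient descent
    w_sga = w_sga - eta * (xi(w_sga) + A.T @ xi(w_sga))  # SGA with lambda = 1

print("gradient descent:", np.linalg.norm(w_gd))    # converges, but slowly (cf. figure 3)
print("SGA             :", np.linalg.norm(w_sga))   # essentially zero after a few steps
# For this particular game the adjustment cancels the rotation exactly:
# xi + A^T xi = 101 * (x, y), so SGA reduces to steep gradient descent.
```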
4.2. Basic adversarial games

Figure 4 compares SGA with optimistic mirror descent (OMD) on a zero-sum bimatrix game with $\ell_{1/2}(w_1, w_2) = \pm w_1^\top w_2$. The example is modified from Daskalakis et al. (2018), who also consider a linear offset that makes no difference. A run converges, panel A, if the average absolute value of the losses on the last 10 iterations is $< 0.01$. Although OMD's peak performance is better than SGA's, we find that SGA converges, and does so faster, for a wider range of learning rates. OMD diverges for learning rates outside the range [0.3, 1.2]. Simultaneous gradient descent oscillates without converging (not shown). Individual runs are shown in the appendix. The appendix also compares the performance of the algorithms on four-player games.

Figure 4. Comparison of SGA with optimistic mirror descent. Sweep over learning rates in the range [0.01, 1.75]; $\lambda = 1$ throughout. (Left): iterations to convergence (up to 250). (Right): average absolute value of the losses over iterations 240-250, cutoff at 5.

4.3. Generative adversarial networks

We apply SGA to a basic setup adapted from Metz et al. (2017). Data is sampled from a highly multimodal distribution designed to probe the tendency to collapse onto a subset of modes during training. The distribution is a mixture of 16 Gaussians arranged in a 4x4 grid, see the ground truth in figure 8. The generator and discriminator networks both have 6 ReLU layers of 384 neurons. The generator has two output neurons; the discriminator has one.

Figure 5 shows results after {2000, 4000, 6000, 8000} iterations. The networks are trained under RMSProp. Learning rates were chosen by visual inspection of grid-search results at iteration 8000, see the appendix. Simultaneous gradient descent and SGA are shown in the figure; results for consensus optimization are in the appendix.

Simultaneous gradient descent exhibits mode collapse followed by mode hopping in later iterations (not shown). Mode hopping is analogous to the cycles in example 1. Unaligned SGA converges to the correct distribution; alignment speeds up convergence slightly. Consensus optimization performs similarly, see figure 9 in the appendix. However, it can converge to local maxima, recall example 5.

Figure 5. Samples at iterations 2000, 4000, 6000, 8000. Top (gradient descent, learning rate 1e-4): simultaneous gradient descent suffers from mode collapse and, in later iterations (not shown), mode hopping. Middle (SGA without alignment, learning rate 9e-5): vanilla SGA converges smoothly to the ground truth (figure 8 in the appendix). Bottom (SGA with alignment, learning rate 9e-5): SGA with alignment converges slightly faster.

5. Discussion

Modern deep learning treats differentiable modules like plug-and-play lego blocks. For this to work, at the very least, we need to know that gradient descent will find local minima. Unfortunately, gradient descent does not necessarily find local minima when optimizing multiple interacting objectives. With the recent proliferation of algorithms that optimize more than one loss, it is becoming increasingly urgent to understand and control the dynamics of interacting losses. Although there is interesting recent work on two-player adversarial games such as GANs, there is essentially no work on finding stable fixed points in general.

The generalized Helmholtz decomposition provides a powerful new perspective on game dynamics. A key feature is that the analysis is indifferent to the number of players. Instead, it is the interplay between the simultaneous gradient on the losses and the symmetric and antisymmetric matrices of second-order terms that guides algorithm design and governs the dynamics under gradient adjustments. Symplectic gradient adjustment is a straightforward application of the Helmholtz decomposition.
It is unlikely that SGA is the only, or best, approach to finding stable fixed points. A deeper understanding of the interaction between the potential and Hamiltonian components will lead to more effective algorithms. Of particular interest are pure first-order methods that do not use Hessian-vector products.

Finally, it is worth raising a philosophical point. The goal in this paper is to find stable fixed points (e.g. because in GANs they yield pleasing samples). We are not concerned with the losses of the players per se. The gradient adjustments may lead to a player acting against its own self-interest by increasing its loss. We consider this acceptable insofar as it encourages convergence to a stable fixed point.

Acknowledgements. We thank Guillaume Desjardins, Csaba Szepesvari and especially Alistair Letcher for useful feedback.

References

Abdallah, S. and Lesser, V. R. A multiagent reinforcement learning algorithm with non-linear dynamics. J. Artif. Intell. Res., 33:521-549, 2008.

Balduzzi, D. Strongly-Typed Agents are Guaranteed to Interact Safely. In ICML, 2017.

Bottacin, F. A Marsden-Weinstein Reduction Theorem for Presymplectic Manifolds. 2005.

Bowling, M. and Veloso, M. Multiagent learning using a variable learning rate. Artificial Intelligence, 136:215-250, 2002.

Bowling, M. H. Convergence and no-regret in multiagent learning. In NIPS, pp. 209-216, 2004.

Candogan, O., Menache, I., Ozdaglar, A., and Parrilo, P. A. Flows and decompositions of games: Harmonic and potential games. Mathematics of Operations Research, 36(3):474-503, 2011.

Daskalakis, C., Goldberg, P. W., and Papadimitriou, C. The Complexity of Computing a Nash Equilibrium. SIAM J. Computing, 39(1):195-259, 2009.

Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with Optimism. In ICLR, 2018.

Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, 2014.

Foerster, J. N., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. Learning with Opponent-Learning Awareness. In AAMAS, 2018.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. In NIPS, 2014.

Hart, S. and Mas-Colell, A. Simple Adaptive Strategies: From Regret-Matching to Uncoupled Dynamics. World Scientific, 2013.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., and Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium. In NIPS, 2017.

Hu, J. and Wellman, M. P. Nash Q-Learning for General-Sum Stochastic Games. Journal of Machine Learning Research, 4:1039-1069, 2003.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., and Kavukcuoglu, K. Decoupled Neural Interfaces using Synthetic Gradients. In ICML, 2017.

Lee, J., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M., and Recht, B. First-order Methods Almost Always Avoid Saddle Points. In arXiv:1710.07406, 2017.

Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient Descent Converges to Minimizers. In COLT, 2016.

Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. Proximal Gradient Temporal Difference Learning Algorithms. In IJCAI, 2016.

Lu, X. Hamiltonian games. Journal of Combinatorial Theory, Series B, 55:18-32, 1992.

Mertikopoulos, P., Papadimitriou, C., and Piliouras, G. Cycles in adversarial regularized learning. In SODA, 2018.
Mescheder, L. On the convergence properties of GAN training. In arXiv:1801.04406, 2018.

Mescheder, L., Nowozin, S., and Geiger, A. The Numerics of GANs. In NIPS, 2017.

Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled generative adversarial networks. In ICLR, 2017.

Monderer, D. and Shapley, L. S. Potential Games. Games and Economic Behavior, 14:124-143, 1996.

Nagarajan, V. and Kolter, J. Z. Gradient descent GAN optimization is locally stable. In NIPS, 2017.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Papadimitriou, C. and Piliouras, G. From Nash Equilibria to Chain Recurrent Sets: Solution Concepts and Topology. In ITCS, 2016.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven Exploration by Self-supervised Prediction. In ICML, 2017.

Pfau, D. and Vinyals, O. Connecting Generative Adversarial Networks and Actor-Critic Methods. In arXiv:1610.01945, 2016.

Racanière, S., Weber, T., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., Pascanu, R., Battaglia, P., Hassabis, D., Silver, D., and Wierstra, D. Imagination-Augmented Agents for Deep Reinforcement Learning. In NIPS, 2017.

Rosenthal, R. W. A Class of Games Possessing Pure-Strategy Nash Equilibria. Int J Game Theory, 2:65-67, 1973.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved Techniques for Training GANs. In NIPS, 2016.

Shoham, Y. and Leyton-Brown, K. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2008.

Singh, S., Kearns, M., and Mansour, Y. Nash Convergence of Gradient Dynamics in General-Sum Games. In UAI, 2000.

Stoltz, G. and Lugosi, G. Learning correlated equilibria in games with compact sets of strategies. Games and Economic Behavior, 59:187-208, 2007.

Vezhnevets, A., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. FeUdal Networks for Hierarchical Reinforcement Learning. In ICML, 2017.

Wayne, G. and Abbott, L. F. Hierarchical Control Using Networks Trained with Higher-Level Forward Models. Neural Computation, (26), 2014.