# LEAD: Min-Max Optimization from a Physical Perspective

Published in Transactions on Machine Learning Research (06/2023)

Reyhane Askari Hemmat (reyhane.askari.hemmat@umontreal.ca), Department of Computer Science and Operations Research, University of Montreal and Mila, Quebec AI Institute

Amartya Mitra (amitr003@ucr.edu), Department of Physics and Astronomy, University of California, Riverside

Guillaume Lajoie (g.lajoie@umontreal.ca), Department of Mathematics and Statistics, University of Montreal and Mila, Quebec AI Institute; Canada CIFAR AI Chair

Ioannis Mitliagkas (ioannis@iro.umontreal.ca), Department of Computer Science and Operations Research, University of Montreal and Mila, Quebec AI Institute; Canada CIFAR AI Chair

Reviewed on OpenReview: https://openreview.net/forum?id=vXSsTYs6ZB

Abstract: Adversarial formulations such as generative adversarial networks (GANs) have rekindled interest in two-player min-max games. A central obstacle in the optimization of such games is the rotational dynamics that hinder their convergence. In this paper, we show that game optimization shares dynamic properties with particle systems subject to multiple forces, and one can leverage tools from physics to improve optimization dynamics. Inspired by this physical framework, we propose LEAD, an optimizer for min-max games. Next, using Lyapunov stability theory and spectral analysis, we study LEAD's convergence properties in continuous- and discrete-time settings for a class of quadratic min-max games, demonstrating linear convergence to the Nash equilibrium. Finally, we empirically evaluate our method on synthetic setups and CIFAR-10 image generation to demonstrate improvements in GAN training.

1 Introduction

Many of the advances in traditional machine learning can be attributed to the success of gradient-based methods.
Modern machine learning systems such as GANs (Goodfellow et al., 2014), multi-task learning (Sener & Koltun, 2018), and multi-agent settings in reinforcement learning (Bu et al., 2008) require the joint optimization of two or more objectives, which can often be formulated as games. In these game settings, best practices and methods developed for single-objective optimization are observed to perform noticeably poorly (Mescheder et al., 2017; Balduzzi et al., 2018b; Gidel et al., 2019). Specifically, they exhibit rotational dynamics in parameter space about the Nash equilibria (Mescheder et al., 2017), slowing down convergence. Recent work in game optimization (Wang et al., 2019; Mazumdar et al., 2019; Mescheder et al., 2017; Balduzzi et al., 2018b; Abernethy et al., 2019; Loizou et al., 2020) demonstrates that introducing additional second-order terms in the optimization algorithm helps to suppress these rotations, thereby improving convergence. Taking inspiration from recent work in single-objective optimization that re-derives existing accelerated methods from a variational perspective (Wibisono et al., 2016; Wilson et al., 2016), in this work we adopt a similar approach in the context of games. To do so, we borrow formalism from physics by likening the gradient-based optimization of two-player (zero-sum) games to the dynamics of a system in which we introduce relevant forces that help curb these rotations. We consequently utilize the dynamics of this resultant system to propose our novel second-order optimizer for games, LEAD. Next, using Lyapunov and spectral analysis, we demonstrate linear convergence of our optimizer (LEAD) in both continuous- and discrete-time settings for a class of quadratic min-max games. (Equal contribution. A significant part of this work was done while Amartya Mitra was interning at Mila.)
In terms of empirical performance, LEAD achieves an FID of 10.49 on CIFAR-10 image generation, outperforming existing baselines such as BigGAN (Brock et al., 2018), which is approximately 30 times larger than our baseline ResNet architecture. What distinguishes LEAD from other second-order optimization methods for min-max games, such as Mescheder et al. (2017); Wang et al. (2019); Mazumdar et al. (2019); Schäfer & Anandkumar (2019), is its computational complexity. All of these methods involve Jacobian (or Jacobian-inverse) vector-product computations, commonly implemented using some form of approximation, which makes the majority of them intractable in real-world, large-scale problems. In contrast, LEAD involves computing only one block of the full Jacobian of the gradient vector field multiplied by a vector. This makes our method significantly cheaper and comparable to several first-order methods, as we show in Section 6. We summarize our contributions below:

- In Section 3, we model gradient descent-ascent as a physical system. Armed with the physical model, we introduce counter-rotational forces to curb the existing rotations in the system. Next, we employ the principle of least action to determine the (continuous-time) dynamics. We then discretize these resultant dynamics to obtain our optimization scheme, Least Action Dynamics (LEAD).
- In Section 4, we use Lyapunov stability theory and spectral analysis to prove linear convergence of LEAD in continuous- and discrete-time settings for quadratic min-max games.
- Finally, in Section 7, we empirically demonstrate that LEAD is computationally efficient. Additionally, we demonstrate that LEAD improves the performance of GANs on different tasks such as 8-Gaussians and CIFAR-10, while comparing the performance of our method against other first- and second-order methods.

The source code for all the experiments is available at https://github.com/lead-minmax-games/LEAD.
Furthermore, we provide a blog post that summarizes our work, available at https://reyhaneaskari.github.io/LEAD.html.

2 Problem Setting

Notation. Continuous-time scalar variables are in uppercase letters ($X$), discrete-time scalar variables are in lowercase ($x$), and vectors are in boldface ($\mathbf{A}$). Matrices are in blackboard bold ($\mathbb{M}$) and derivatives w.r.t. time are denoted by an over-dot ($\dot{x}$). Furthermore, $\text{off-diag}[\mathbb{M}]_{i,j}$ is equal to $\mathbb{M}_{i,j}$ for $i \neq j$ and equal to $0$ for $i = j$, where $i, j = 1, 2, \ldots, n$.

Setting. In this work, we study the optimization problem of two-player zero-sum games,

$$\min_X \max_Y f(X, Y), \qquad (1)$$

where $f : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is assumed to be a convex-concave function which is continuous and twice differentiable w.r.t. $X, Y$. Note that although $X$ and $Y$ are assumed to be scalars in developing our framework below, our theoretical analysis holds for the more general case of vectorial $X$ and $Y$, as we demonstrate both analytically (Appendix C) and empirically.

3 Optimization Mechanics

In our effort to study min-max optimization from a physical perspective, we note the following from classical physics: under the influence of a net force $F$, the equation of motion of a physical object of mass $m$ is determined by Newton's 2nd law,

$$m\ddot{X} = F, \qquad (2)$$

with the object's coordinate expressed as $X_t \equiv X$. According to the principle of least action¹ (Landau & Lifshitz, 1960), nature selects this particular trajectory over other possibilities, as a quantity called the action is extremized along it. We start with a simple observation that showcases the connection between optimization algorithms and physics. Polyak's heavy-ball momentum (Polyak, 1964) is often perceived from a physical perspective as a ball moving in a "potential" well (cost function).
In fact, it is straightforward to show that Polyak momentum is a discrete counterpart of a continuous-time equation of motion governed by Newton's 2nd law. For single-objective minimization of an objective function $f(x)$, Polyak momentum follows

$$x_{k+1} = x_k + \beta(x_k - x_{k-1}) - \eta\,\nabla_x f(x_k), \qquad (3)$$

where $\eta$ is the learning rate and $\beta$ is the momentum coefficient. For simplicity, setting $\beta$ to one and moving to continuous time, one can rewrite this equation as

$$\frac{(x_{k+\delta} - x_k) - (x_k - x_{k-\delta})}{\delta^2} = -\frac{\eta}{\delta^2}\,\nabla_x f(x_k), \qquad (4)$$

and in the limit $\delta, \eta \to 0$, Eq. (4) becomes ($x_k \to X(t) \equiv X$)

$$m\ddot{X} = -\nabla_X f(X). \qquad (5)$$

This is equivalent to Newton's 2nd law of motion (Eq. (2)) for a particle of mass $m = \delta^2/\eta$, identifying $F = -\nabla_X f(X)$ (i.e., $f(X)$ acting as a potential function (Landau & Lifshitz, 1960)). Thus, Polyak's heavy-ball method, Eq. (3), can be interpreted as an object (ball) of mass $m$ rolling down under a potential $f(X)$ to reach the minimum while accelerating. Armed with this observation, we extend Eq. (5) to our min-max setup,

$$m\ddot{X} = -\nabla_X f(X, Y), \qquad m\ddot{Y} = \nabla_Y f(X, Y), \qquad (6)$$

which represents the dynamics of an object moving under a curl force (Berry & Shukla, 2016), $F_{\mathrm{curl}} = (-\nabla_X f,\, \nabla_Y f)$, in the 2-dimensional $X$-$Y$ plane; see Figure 1(a) for a simple example where the continuous-time dynamics of a curl force (or equivalently a vortex force) diverge away from the equilibrium. Furthermore, note that the discretization of Eq. (6) corresponds to gradient descent-ascent (GDA) with momentum. The authors of Gidel et al. (2019) found that this optimizer is divergent on the prototypical min-max objective $f(X, Y) = XY$, indicating the need for further improvement. To this end, we note that the failure modes of the optimizer obtained from the discretization of Eq. (6) can be attributed to: (a) an outward rotatory motion by our particle of mass $m$, accompanied by (b) an increase in its velocity over time.
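This failure mode is easy to reproduce numerically. The sketch below is our own illustration (the step size, momentum value, and iteration count are arbitrary choices, not from the paper): it runs simultaneous gradient descent-ascent with heavy-ball momentum, i.e. an explicit discretization of Eq. (6), on $f(x, y) = xy$ and observes the iterates spiraling away from the Nash equilibrium at the origin.

```python
# Simultaneous gradient descent-ascent with Polyak momentum on f(x, y) = x*y.
# For this bilinear game, grad_x f = y and grad_y f = x.
def gda_momentum(x0, y0, eta=0.1, beta=0.9, steps=200):
    x_prev, y_prev = x0, y0
    x, y = x0, y0
    for _ in range(steps):
        # descent for the min player, ascent for the max player
        x_next = x + beta * (x - x_prev) - eta * y
        y_next = y + beta * (y - y_prev) + eta * x
        x_prev, y_prev, x, y = x, y, x_next, y_next
    return x, y

x, y = gda_momentum(1.0, 1.0)
dist_sq = x * x + y * y  # squared distance to the equilibrium (0, 0)
print(dist_sq)           # far larger than the initial value of 2.0
```

The squared distance to the equilibrium grows over the iterations, matching failure modes (a) and (b) above.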
Following these observations, we aim to introduce suitable counter-rotational and dissipative forces to our system above, in order to tackle (a) and (b) and achieve convergent dynamics. Specifically, as an initial consideration, we add two ubiquitous forces to our system:

- a magnetic force,
$$F_{\mathrm{mag}} = \left(-q\,\nabla_{XY}f\,\dot{Y},\; q\,\nabla_{XY}f\,\dot{X}\right), \qquad (7)$$
known to produce rotational motion (in charged particles), to counteract the rotations introduced by $F_{\mathrm{curl}}$. Here, $q$ is the charge imparted to our particle;
- friction,
$$F_{\mathrm{fric}} = \left(-\mu\dot{X},\; -\mu\dot{Y}\right), \qquad (8)$$
to prevent the increase in velocity of our particle ($\mu$: coefficient of friction).

¹Also referred to as the principle of stationary action.

Assimilating all the above forces $F_{\mathrm{curl}}$, $F_{\mathrm{mag}}$ and $F_{\mathrm{fric}}$, the equations of motion (EOMs) of our crafted system become

$$m\ddot{X} = F_{\mathrm{curl},X} + F_{\mathrm{mag},X} + F_{\mathrm{fric},X}, \qquad m\ddot{Y} = F_{\mathrm{curl},Y} + F_{\mathrm{mag},Y} + F_{\mathrm{fric},Y}, \qquad (9)$$

or equivalently,

$$m\ddot{X} = -\mu\dot{X} - \nabla_X f - q\,\nabla_{XY}f\,\dot{Y}, \qquad m\ddot{Y} = -\mu\dot{Y} + \nabla_Y f + q\,\nabla_{XY}f\,\dot{X}. \qquad (10)$$

Without loss of generality, from here on we set the mass of our object to unity. In the rest of this work, we study the above EOMs in continuous and discrete time for min-max games. See Figure 1(c) for a simple example where the continuous-time dynamics of a particle under all the above forces converge to the equilibrium.

Figure 1: Depiction of the continuous-time dynamics of a particle experiencing different forces. In (a), a particle inside a vortex diverges from the equilibrium. This corresponds to the continuous-time dynamics of gradient descent-ascent with momentum in the bilinear game. In (b), we introduce a counter-rotational magnetic force to the existing system with the vortex force, assuming the particle is charged. The magnetic force is exerted perpendicular to the particle's current direction of motion and affects the rotations induced by the vortex. However, we do not observe convergence, which is expected from a physics perspective.
The vortex force is known to increase the particle's speed over time (Berry & Shukla, 2016), while the magnetic force does not affect the particle's speed. Therefore, the particle's velocity keeps increasing over time, preventing convergence. To decrease the particle's velocity and obtain convergence, we introduce friction to the system in (c). As a result, we observe that the friction causes the particle to lose speed and converge.

3.1 Discretization

With the continuous-time trajectory of Eq. (10) in hand, we now proceed to discretize it using a combination of Euler's implicit and explicit discretization schemes. To discretize $\dot{X} = V_X$ we have

$$\text{Euler implicit discretization:}\quad x_{k+1} - x_k = \delta v^x_{k+1}, \qquad \text{Euler explicit discretization:}\quad x_{k+1} - x_k = \delta v^x_k, \qquad (11)$$

where $\delta$ is the discretization step size and $k$ is the iteration step.

Proposition 1. The continuous-time EOMs (10) can be discretized in an implicit-explicit way to yield

$$x_{k+1} = x_k + \beta(x_k - x_{k-1}) - \eta\,\nabla_x f(x_k, y_k) - \alpha\,\nabla_{xy}f(x_k, y_k)\,(y_k - y_{k-1}),$$
$$y_{k+1} = y_k + \beta(y_k - y_{k-1}) + \eta\,\nabla_y f(x_k, y_k) + \alpha\,\nabla_{yx}f(x_k, y_k)\,(x_k - x_{k-1}), \qquad (12)$$

where we have defined $\alpha = 2q\delta$, $\beta = 1 - \mu\delta$ and $\eta = \delta^2$ (proof in Appendix B).

Taking inspiration from the fact that Eq. (10) corresponds to the trajectory of a charged particle under curl, magnetic and frictional forces, as governed by the principle of least action, we refer to the discrete update rules of Eq. (12) as Least Action Dynamics (LEAD); Algorithm 1 details the pseudo-code of LEAD.

Understanding the terms in LEAD: Analyzing our novel optimizer, we note that it consists of three types of terms, namely,

1. Gradient descent or ascent, $-\nabla_x f$ or $\nabla_y f$: each player's immediate direction for improving its own objective.

2. Momentum: the standard Polyak momentum term, known to accelerate convergence in optimization and, recently, in smooth games (Gidel et al., 2019; Azizian et al., 2020b; Lorraine & Duvenaud, 2022).

3.
Coupling term, $-\alpha\,\nabla_{xy}f(x_k, y_k)(y_k - y_{k-1})$ and $\alpha\,\nabla_{yx}f(x_k, y_k)(x_k - x_{k-1})$: the main new term in our method. It captures the first-order interaction between players. This cross-derivative corresponds to the counter-rotational force in our physical model; it allows our method to exert control on rotations.

Algorithm 1 Least Action Dynamics (LEAD)

Input: learning rate $\eta$, momentum $\beta$, coupling coefficient $\alpha$.
Initialize: $x_0 \leftarrow x_{\mathrm{init}}$, $y_0 \leftarrow y_{\mathrm{init}}$, $t \leftarrow 0$
while not converged do
  $t \leftarrow t + 1$
  $g_x \leftarrow \nabla_x f(x_t, y_t)$
  $g_{xy}^{\Delta y_t} \leftarrow \nabla_y(g_x)\,(y_t - y_{t-1})$
  $x_{t+1} \leftarrow x_t + \beta(x_t - x_{t-1}) - \eta\,g_x - \alpha\,g_{xy}^{\Delta y_t}$
  $g_y \leftarrow \nabla_y f(x_t, y_t)$
  $g_{yx}^{\Delta x_t} \leftarrow \nabla_x(g_y)\,(x_t - x_{t-1})$
  $y_{t+1} \leftarrow y_t + \beta(y_t - y_{t-1}) + \eta\,g_y + \alpha\,g_{yx}^{\Delta x_t}$
end while
return $(x_{t+1}, y_{t+1})$

4 Convergence Analysis

We now study the behavior of LEAD on the quadratic min-max game

$$f(X, Y) = \frac{h}{2}\|X\|^2 - \frac{h}{2}\|Y\|^2 + X^T A Y, \qquad (13)$$

where $X, Y \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$ is a (constant) coupling matrix, and $h$ is a scalar constant. The Nash equilibrium of the above game lies at $X^* = 0$, $Y^* = 0$. Let us further define the vector field $v$ of the above game as

$$v = \begin{pmatrix} \nabla_X f(X, Y) \\ -\nabla_Y f(X, Y) \end{pmatrix} = \begin{pmatrix} hX + AY \\ hY - A^T X \end{pmatrix}. \qquad (14)$$

4.1 Continuous-Time Analysis

A general way to prove the stability of a dynamical system is to use a Lyapunov function (Hahn et al., 1963; Lyapunov, 1992). A scalar function $E_t : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is a Lyapunov function of a continuous-time dynamics if $\forall t$,

(i) $E_t(X, Y) \ge 0$, and (ii) $\dot{E}_t(X, Y) \le 0$.

The Lyapunov function $E_t$ can be perceived as a generalization of the total energy of the system, and requirement (ii) ensures that this generalized energy decreases along the trajectory of evolution, leading the system to convergence, as we show next. For the quadratic min-max game defined in Eq. (13), Eq. (10) generalizes to

$$\ddot{X} = -\mu\dot{X} - hX - AY - qA\dot{Y}, \qquad \ddot{Y} = -\mu\dot{Y} - hY + A^T X + qA^T\dot{X}. \qquad (15)$$

Theorem 1. For the dynamics of Eq. (15),

$$E_t = \frac{1}{2}\big(\dot{X} + \mu X + \mu A Y\big)^T\big(\dot{X} + \mu X + \mu A Y\big) + \frac{1}{2}\big(\dot{Y} + \mu Y - \mu A^T X\big)^T\big(\dot{Y} + \mu Y - \mu A^T X\big) + \frac{1}{2}\big(\dot{X}^T\dot{X} + \dot{Y}^T\dot{Y}\big) + X^T(h + AA^T)X + Y^T(h + A^T A)Y \qquad (16)$$

is a Lyapunov function of the system.
Furthermore, setting $q = 2/\mu + \mu$, we find $\dot{E}_t \le -\rho E_t$ for

$$\rho \ge \min\left\{\frac{\mu}{1 + \mu},\; \frac{2\mu\left(\sigma^2_{\min} + h\right)}{\left(1 + \sigma^2_{\min} + 2h\right)\left(\mu^2 + \mu\right) + 2\sigma^2_{\min}}\right\},$$

with $\sigma_{\min}$ being the smallest singular value of $A$. This consequently ensures linear convergence of the dynamics of Eq. (15),

$$\|X\|^2 + \|Y\|^2 \le \frac{E_0}{h + \sigma^2_{\min}}\exp(-\rho t). \qquad (17)$$

(Proof in Appendix C.)

4.2 Discrete-Time Analysis

In this section, we analyze the convergence behavior of LEAD, Eq. (12), for the quadratic min-max game of Eq. (13) using spectral analysis:

$$x_{k+1} = x_k + \beta\,\Delta x_k - \eta h x_k - \eta A y_k - \alpha A\,\Delta y_k, \qquad y_{k+1} = y_k + \beta\,\Delta y_k - \eta h y_k + \eta A^T x_k + \alpha A^T\Delta x_k, \qquad (18)$$

where $\Delta x_k = x_k - x_{k-1}$ and $\Delta y_k = y_k - y_{k-1}$. For brevity, consider the joint parameters $\omega_t := (x_t, y_t)$. We start by studying the update operator of simultaneous gradient descent-ascent, $F_\eta(\omega_t) = \omega_t - \eta\,v(\omega_t)$, where the vector field $v$ is given by Eq. (14). The fixed point $\omega^*$ of $F_\eta$ satisfies $F_\eta(\omega^*) = \omega^*$. Furthermore, at $\omega^*$ we have

$$\nabla F_\eta(\omega^*) = I_n - \eta\,\nabla v(\omega^*), \qquad (19)$$

with $I_n$ being the $n \times n$ identity matrix. Consequently, the spectrum of $\nabla F_\eta(\omega^*)$ in the quadratic game considered is

$$\mathrm{Sp}\left(\nabla F_\eta(\omega^*)\right) = \{1 - \eta h - \eta\lambda \;|\; \lambda \in \mathrm{Sp}(\text{off-diag}[\nabla v(\omega^*)])\}. \qquad (20)$$

The next proposition outlines the condition under which the fixed-point operator is guaranteed to converge around the fixed point.

Proposition 2 (Prop. 4.4.1, Bertsekas (1999)). If the spectral radius satisfies

$$\rho_{\max} := \rho\{\nabla F_\eta(\omega^*)\} < 1, \qquad (21)$$

then for some $\omega_0$ in a neighborhood of $\omega^*$, the update operator $F_\eta$ ensures linear convergence to $\omega^*$ at a rate $\Delta_{t+1} \le O\left((\rho_{\max} + \epsilon)^t\right)\;\forall \epsilon > 0$, where $\Delta_{t+1} := \|\omega_{t+1} - \omega^*\|^2_2 + \|\omega_t - \omega^*\|^2_2$.

Next, we define the update operator of Eq. (12) as $F_{\mathrm{LEAD}}(\omega_t, \omega_{t-1}) = (\omega_{t+1}, \omega_t)$. For the quadratic min-max game of Eq. (13), the Jacobian of $F_{\mathrm{LEAD}}$ takes the form

$$\nabla F_{\mathrm{LEAD}} = \begin{pmatrix} I_{2n} + \beta I_{2n} - (\eta + \alpha)\nabla v & -\beta I_{2n} + \alpha\nabla v \\ I_{2n} & 0 \end{pmatrix}.$$

In Theorem 2 we find the set of eigenvalues corresponding to the update operator $F_{\mathrm{LEAD}}$, which are then used in Theorem 3, where we show that for selected values of $\eta$ and $\alpha$, LEAD attains a linear rate.

Theorem 2.
The eigenvalues of $\nabla F_{\mathrm{LEAD}}(\omega^*)$ are

$$\mu_{\pm} = \frac{1 + \beta - \eta h - (\eta + \alpha)\lambda \pm \sqrt{\Delta}}{2}, \quad \text{where} \quad \Delta = \left(1 + \beta - \eta h - (\eta + \alpha)\lambda\right)^2 - 4\left(\beta - \alpha\lambda\right)$$

and $\lambda \in \mathrm{Sp}(\text{off-diag}[\nabla v(\omega^*)])$. Furthermore, for $h, \eta, |\alpha|, |\beta| \ll 1$, we have

$$\mu_+ \approx 1 - \eta h - (\eta + \alpha)\lambda + \frac{(\eta + \alpha)^2\lambda^2 + \eta^2 h^2 + \beta^2 - 2\eta h\beta}{2(\eta h - \beta)}, \qquad (24)$$

$$\mu_- \approx \beta - \alpha\lambda - \frac{(\eta + \alpha)^2\lambda^2 + \eta^2 h^2 + \beta^2 - 2\eta h\beta}{2(\beta - \eta h)}. \qquad (25)$$

Figure 2: Diagram depicting the positioning of the eigenvalues of GDA in blue (Eq. 19) and those of LEAD (Eqs. (24) and (25) with β = h = 0) in red. Eigenvalues inside the black unit circle imply convergence, such that the closer to the origin, the faster the convergence rate (Prop. 2). Every point on the solid blue and red lines corresponds to a specific choice of learning rate. No choice of learning rate results in convergence for the gradient descent-ascent method, as the blue line is tangent to the unit circle. At the same time, for a fixed value of α, LEAD shifts the eigenvalues (µ+) into the unit circle, which leads to a convergence rate proportional to the radius of the red dashed circle. Note that LEAD also introduces an extra set of eigenvalues (µ−) which are close to zero and do not affect convergence. See proof in Appendix D.

Theorem 2 states that the LEAD operator has two eigenvalues, $\mu_+$ and $\mu_-$, for each $\lambda \in \mathrm{Sp}(\text{off-diag}[\nabla v(\omega^*)])$. Specifically, $\mu_+$ can be viewed as a shift of the eigenvalues of GDA in Eq. (20), while additionally being the leading eigenvalue for small values of $h$, $\eta$, $|\alpha|$ and $|\beta|$ (see Fig. 2 for a schematic description). Also, for small values of $\alpha$, $\mu_+$ is the limiting eigenvalue while $\mu_- \approx 0$. In the following proposition, we show that locally, a choice of positive $\alpha$ decreases the spectral radius of $\nabla F_{\mathrm{LEAD}}(\omega^*)$, defined as $\rho := \max_\lambda \max\{|\mu_+|^2, |\mu_-|^2\}$.

Proposition 3. For any $\lambda \in \mathrm{Sp}(\text{off-diag}[\nabla v(\omega^*)])$,

$$\left.\frac{\partial\rho(\lambda)}{\partial\alpha}\right|_{\alpha=0} < 0 \quad \forall\,\eta \in \left(0,\; \frac{2}{\mathrm{Im}(\lambda_{\max})}\right),$$

where $\mathrm{Im}(\lambda_{\max})$ is the imaginary component of the largest eigenvalue $\lambda_{\max}$. See proof in Appendix E.
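To make Theorem 2 and Proposition 3 concrete, the exact roots $\mu_\pm$ can be evaluated numerically. The sketch below is our own illustration (the values of $\eta$, $\alpha$ and $\sigma$ are arbitrary choices): it takes the bilinear case $\beta = h = 0$ with a purely imaginary $\lambda = i\sigma$, where the GDA eigenvalue $1 - \eta\lambda$ of Eq. (20) has modulus at least one for every learning rate, and checks that a small positive $\alpha$ pulls the spectral radius of LEAD strictly inside the unit circle.

```python
import cmath

def lead_eigs(lam, eta, alpha, beta=0.0, h=0.0):
    """Roots of mu^2 - (1 + beta - eta*h - (eta+alpha)*lam)*mu + (beta - alpha*lam) = 0,
    i.e. the exact mu_+ and mu_- of Theorem 2."""
    b = 1 + beta - eta * h - (eta + alpha) * lam
    c = beta - alpha * lam
    disc = cmath.sqrt(b * b - 4 * c)
    return (b + disc) / 2, (b - disc) / 2

sigma = 1.0
lam = 1j * sigma   # eigenvalue of off-diag[grad v] in the bilinear game
eta = 0.1

# GDA spectrum (Eq. 20 with h = 0): modulus |1 - eta*lam| >= 1 for all eta.
rho_gda = abs(1 - eta * lam)

# LEAD with a small positive coupling coefficient alpha.
mu_plus, mu_minus = lead_eigs(lam, eta, alpha=0.1)
rho_lead = max(abs(mu_plus), abs(mu_minus))
print(rho_gda, rho_lead)
```

With these arbitrary parameter values, `rho_gda` sits just outside the unit circle while `rho_lead` falls just inside it, and `mu_minus` is much smaller in modulus than `mu_plus`, consistent with the discussion of Figure 2.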
Having established that a small positive value of α improves the rate of convergence, in the next theorem we prove that for a specific choice of positive α and η in the quadratic game of Eq. (13), a linear rate of convergence to its Nash equilibrium is attained.

Theorem 3. Setting $\eta = \alpha = \frac{1}{2(\sigma_{\max}(A) + h)}$, we have, $\forall \epsilon > 0$,

$$\Delta_{t+1} \le O\left(\left(1 - \frac{\sigma^2_{\min}}{4(\sigma_{\max} + h)^2} - \frac{(1 + \beta/2)h}{\sigma_{\max} + h} + \frac{3h^2}{8(\sigma_{\max} + h)^2} + \epsilon\right)^t\right),$$

where $\sigma_{\max}$ ($\sigma_{\min}$) is the largest (smallest) singular value of $A$ and $\Delta_{t+1} := \|\omega_{t+1} - \omega^*\|^2_2 + \|\omega_t - \omega^*\|^2_2$. Theorem 3 ensures linear convergence of LEAD in the quadratic min-max game (proof in Appendix F).

5 Comparison of Convergence Rates for the Quadratic Min-Max Game

In this section, we perform a big-O comparison of the convergence rate of LEAD (Eq. 59) with several other existing methods. Table 1 summarizes the convergence rates for the quadratic min-max game of Eq. (13). For each method that converges at the rate of $O((1 - r)^t)$, we report the quantity $r$ (larger $r$ corresponds to faster convergence). We observe that for the quadratic min-max game, given the analysis in Azizian et al. (2020a), for $h < \sigma_{\max}(A)$ and $\beta > \frac{3h^2}{8(\sigma_{\max} + h)}$, we have $r_{\mathrm{LEAD}} \ge r_{\mathrm{EG}}$ and $r_{\mathrm{LEAD}} \ge r_{\mathrm{OG}}$. Furthermore, for the bilinear case, where $h = 0$, LEAD has a faster convergence rate than EG and OG.

| Method | $r$ |
| --- | --- |
| Alternating-GDA | $h/2L$ |
| Extra-Gradient (EG) | $\frac{1}{4}\left(h/L + \sigma^2_{\min}(A)/16L^2\right)$ |
| Optimistic Gradient (OG) | $\frac{1}{4}\left(h/L + \sigma^2_{\min}(A)/32L^2\right)$ |
| Consensus Optimization (CO) | $h^2/2L^2_H + \sigma^2_{\min}(A)/2L^2_H$ |
| LEAD (Th. 3) | $(1 + \beta/2)h/(\sigma_{\max} + h) + \sigma^2_{\min}/4(\sigma_{\max} + h)^2 - 3h^2/8(\sigma_{\max} + h)^2$ |

Table 1: Big-O comparison of convergence rates of LEAD against EG (Korpelevich, 1976), OG (Mertikopoulos et al., 2018) and CO (Mescheder et al., 2017) for the quadratic min-max game of Eq. (13). We report the EG, OG and CO rates from the tight analysis in Azizian et al. (2020a) and the Alt-GDA rate from Zhang et al. (2022).
For each method that converges at the rate of $O((1 - r)^t)$, we report the quantity $r$ (larger $r$ corresponds to faster convergence). Note that $L := 2\max\{h, \sigma_{\max}(A)\}$ is the Lipschitz constant of the vector field and $L_H$ is the Lipschitz-smoothness constant of the Hamiltonian.

6 Comparison of Computational Cost

In this section, we first study several second-order algorithms and perform computational comparisons on an 8-Gaussians generation task. The Jacobian of the gradient vector field $v = (\nabla_x f(x, y),\, -\nabla_y f(x, y))$ is given by

$$J = \begin{pmatrix} \nabla^2_x f(x, y) & \nabla_{xy} f(x, y) \\ -\nabla_{yx} f(x, y) & -\nabla^2_y f(x, y) \end{pmatrix}.$$

Considering player $x$, a LEAD update requires the computation of the term $\nabla_{xy}f(x_k, y_k)(y_k - y_{k-1})$, thereby involving only one block of the full Jacobian $J$. On the other hand, Symplectic Gradient Adjustment (SGA) (Balduzzi et al., 2018a) requires the full computation of two Jacobian-vector products, $Jv$ and $J^T v$. Similarly, Competitive Gradient Descent (CGD) (Schäfer & Anandkumar, 2019) involves the computation of the term $\left(I + \eta^2\,\nabla^2_{xy}f(x_k, y_k)\,\nabla^2_{yx}f(x_k, y_k)\right)^{-1}$ along with the Jacobian-vector product $\nabla^2_{xy}f(x_k, y_k)\,\nabla_y f(x_k, y_k)$. While the inverse term is approximated using the conjugate gradient method, it still involves the computation of approximately ten Jacobian-vector products for each update. To explore these comparisons in greater detail and on models with many parameters, we experimentally compare the computational cost of our method with several other second- as well as first-order methods on the 8-Gaussians problem in Figure 3 (architecture reported in Appendix J). We calculate the average wall-clock time (in milliseconds) per iteration. Results are averaged over 1000 iterations, computed on the same architecture and the same machine with forced synchronous execution. All the methods are implemented in PyTorch (Paszke et al., 2017), and SGA is replicated based on the official implementation.²
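The one-block term above never needs the Jacobian block to be materialized, because $\nabla_{xy}f(x, y)\,(y_k - y_{k-1})$ is simply a directional derivative of $\nabla_x f$ in $y$. As a library-free sketch of this idea (our own illustration: it uses a finite difference in place of the autograd vector-Jacobian product discussed later, and a toy quadratic $f$ with a hypothetical curvature constant $h = 0.1$), the term costs roughly one extra gradient evaluation:

```python
def grad_x(x, y):
    # grad_x f for the toy quadratic game f(x, y) = x*y + 0.5*h*(x**2 - y**2),
    # with an arbitrary choice h = 0.1
    h = 0.1
    return y + h * x

def cross_term(x, y, dy, eps=1e-6):
    """Approximate grad_xy f(x, y) * dy as a directional derivative of grad_x
    along dy in the y argument (one extra gradient evaluation)."""
    return (grad_x(x, y + eps * dy) - grad_x(x, y)) / eps

# For f above, grad_xy f = 1, so the product equals dy.
val = cross_term(x=0.3, y=-0.7, dy=0.25)
print(val)
```

The finite-difference value agrees with the analytic product $\nabla_{xy}f \cdot \Delta y = 1 \cdot 0.25$ up to discretization error; in practice an autograd vector-Jacobian product computes the same quantity exactly.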
²SGA official DeepMind implementation (non-zero-sum setting): https://github.com/deepmind/symplectic-gradient-adjustment/blob/master/Symplectic_Gradient_Adjustment.ipynb

Figure 3: Average computational cost per iteration of several well-known methods for (non-saturating) GAN optimization. The numbers are reported on the 8-Gaussians generation task and averaged over 1000 iterations. Note that the y-axis is log-scale. We compare Competitive Gradient Descent (CGD) (using the official CGD optimizer code), Symplectic Gradient Adjustment (SGA), Consensus Optimization (CO), extra-gradient with Adam (Extra-Adam), and WGAN with Gradient Penalty (WGAN-GP). We observe that the per-iteration time complexity of our method is very similar to Extra-Adam and WGAN-GP and is much cheaper than other second-order methods such as CGD. Furthermore, by increasing the size of the hidden dimension of the generator's and discriminator's networks, we observe that the gap between the different methods increases.

Furthermore, we observe that the computational cost per iteration of LEAD, while much lower than that of SGA and CGD, is similar to WGAN-GP and extra-gradient. The similarity to extra-gradient is due to the fact that for each player, extra-gradient requires the computation of a half-step and a full-step, so in total each step requires the computation of two gradients. LEAD also requires the computation of a gradient ($\nabla f_x$), which is then used to compute $\nabla f_{xy}$ multiplied by $(y_k - y_{k-1})$. Using PyTorch, we do not need to compute $\nabla f_{xy}$ explicitly and then perform the multiplication: given $\nabla f_x$, the whole term $\nabla f_{xy}(y_k - y_{k-1})$ is computed using PyTorch's autograd vector-Jacobian product at the computational cost of a single gradient. Thus, LEAD also requires the computation of two gradients for each step.
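Putting the update of Eq. (12) together, the following is a minimal scalar sketch of a full LEAD loop (our own illustration, not the paper's implementation: the game $f(x, y) = xy$ and the hyperparameter values are arbitrary choices, and the cross-derivative $\nabla_{xy}f = 1$ is hard-coded rather than obtained from autograd):

```python
def lead_step(x, y, x_prev, y_prev, eta=0.1, alpha=0.3, beta=0.5):
    # Eq. (12) specialized to f(x, y) = x*y:
    # grad_x f = y, grad_y f = x, grad_xy f = grad_yx f = 1.
    x_next = x + beta * (x - x_prev) - eta * y - alpha * 1.0 * (y - y_prev)
    y_next = y + beta * (y - y_prev) + eta * x + alpha * 1.0 * (x - x_prev)
    return x_next, y_next

x_prev = y_prev = x = y = 1.0
for _ in range(500):
    # new previous iterates are the old current iterates
    x, y, x_prev, y_prev = *lead_step(x, y, x_prev, y_prev), x, y

dist_sq = x * x + y * y  # squared distance to the Nash equilibrium (0, 0)
print(dist_sq)           # far below the initial value of 2.0
```

Unlike the momentum-only iteration shown in Section 3, which spirals outward on this game, the coupling term drives the iterates toward the equilibrium, consistent with the spectral analysis of Section 4.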
7 Experiments

In this section, we empirically validate the performance of LEAD on several toy as well as large-scale experiments. Furthermore, we extend LEAD based on the Adam algorithm for use in large-scale experiments; see Algorithm 2 for details.

7.1 Adversarial vs. Cooperative Games

In Section 6 we showed that, using auto-differentiation software tools such as TensorFlow and PyTorch, LEAD can be computed very efficiently, as fast as extra-gradient. In this section we compare the performance of LEAD with several first-order methods in a toy setup inspired by Lorraine & Duvenaud (2022). Consider the following game:

$$\min_x \max_y\; x^T(\gamma A)y + x^T\left((I - \gamma)B_1\right)x - y^T\left((I - \gamma)B_2\right)y. \qquad (29)$$

The formulation in Eq. (29) enables us to compare the performance of different methods in cooperative games, adversarial games, and any interpolation between the two. Namely, varying γ from 0 to I changes the dynamics of the game from purely cooperative to purely adversarial. Many real-world applications such as GANs exhibit an analogous range of adversarial-to-cooperative changes during training (Lorraine & Duvenaud, 2022). In Figure 4, we compare LEAD against several methods, including gradient descent-ascent (GDA), extra-gradient (EG) (Korpelevich, 1976), optimistic gradient (OG) (Mertikopoulos et al., 2018), complex momentum (CM) (Lorraine & Duvenaud, 2022), negative momentum (NM) (Gidel et al., 2019), and positive momentum (PM) (Polyak, 1964). Each method is tuned to optimality for each setup.

Figure 4: Comparison of several methods on the game in Eq. (29). The diagonal matrix γ determines the degree of adversarialness along each direction. Elements on the diagonal are sampled from a uniform distribution, $\gamma_{ii} \sim \mathrm{Unif}[0, \gamma_{\max}]$. By varying $\gamma_{\max}$ from 0 to 1, we move from a purely cooperative setup to a hybrid setup with a mixture of cooperative and adversarial games.
The spectral radius (shown on the y-axis) determines the convergence rate in this game and is a function of $\gamma_{\max}$: the smaller the spectral radius, the faster the convergence rate, while a spectral radius of 1 corresponds to non-convergent dynamics. We compare several methods, including gradient descent-ascent (GDA), extra-gradient (EG) (Korpelevich, 1976), optimistic gradient (OG) (Mertikopoulos et al., 2018), complex momentum (CM) (Lorraine & Duvenaud, 2022), negative momentum (NM) (Gidel et al., 2019), and positive momentum (PM) (Polyak, 1964). Each method has been tuned to optimality for each setup. For cooperative games (leftmost), LEAD and positive momentum achieve great performance. In more adversarial settings (rightmost), LEAD performs on par with other game-specific optimization methods (excluding negative momentum, GDA and positive momentum, which diverge). This plot suggests that LEAD is a robust optimizer across different types of games. We conjecture that for the same reason, LEAD performs desirably in real-world setups such as GANs, where the adversarialness changes dynamically throughout training (Lorraine & Duvenaud, 2022).

7.2 Generative Adversarial Networks

We study the performance of LEAD in zero-sum as well as non-zero-sum settings. See Appendix I for a comparison of LEAD-Adam against vanilla Adam on the generation task of a mixture of 8 Gaussians.

CIFAR-10 DCGAN: We evaluate LEAD-Adam on the task of CIFAR-10 (Krizhevsky & Hinton, 2009) image generation with a non-zero-sum (non-saturating) formulation on a DCGAN architecture similar to Gulrajani et al. (2017). As shown in Table 2, we compare against several first-order and second-order methods and observe that LEAD-Adam outperforms the rest in terms of Fréchet Inception Distance (FID) (Heusel et al., 2017) and Inception Score (IS) (Salimans et al., 2016),³ reaching an FID score of 19.27 ± 0.10, which outperforms OMD (Mertikopoulos et al., 2018) and CGD (Schäfer & Anandkumar, 2019).
See Figure 5, which shows the improvement in FID using LEAD-Adam vs. vanilla Adam, and Figure 9 for a sample of the generated images.

CIFAR-10 ResNet: Furthermore, we evaluate LEAD-Adam on more complex and deeper architectures. We adapt the ResNet architecture of SN-GAN (Miyato et al., 2018) and compare with several existing results on the task of CIFAR-10 image generation using ResNets; see Table 2 for a full comparison. Note that state-of-the-art performance in recent work, such as StyleGAN-based models (Sauer et al., 2022; Kang et al., 2021; Lee et al., 2021) or BigGAN-based models (Brock et al., 2018; Lorraine & Duvenaud, 2022), uses architectures that are 30 times or more larger than the architecture on which we have chosen to test our method. We report our results against a properly tuned version of SN-GAN that achieves an FID of 12.36. Our method obtains a competitive FID of 10.49. We give a detailed description of these experiments, with full details on the architecture and hyper-parameters, in Appendix J. See also Figure 10 for a sample of images generated by a ResNet trained with LEAD-Adam.

³The FID and IS are metrics for evaluating the quality of generated samples of a generative model. Lower FID and higher Inception Score (IS) correspond to better sample quality.

Figure 5: Plot showing the evolution of the FID over 400 epochs for our method (LEAD-Adam) vs. vanilla Adam on a DCGAN architecture. It is important to note that, compared to Adam, LEAD-Adam is twice as expensive computationally.

Table 2: Performance of several methods on the CIFAR-10 image generation task. The FID and IS are reported over 50k samples unless mentioned otherwise.
| DCGAN | FID (↓) | IS (↑) |
| --- | --- | --- |
| Adam (Radford et al., 2015) | 24.38 ± 0.13 | 6.58 |
| LEAD-Adam | 19.27 ± 0.10 | 7.58 ± 0.11 |
| CGD-WGAN (Schäfer & Anandkumar, 2019) | 21.3 | 7.2 |
| OMD (Daskalakis et al., 2018) | 29.6 ± 0.19 | 5.74 ± 0.1 |
| **ResNet** | | |
| SNGAN | 12.10 ± 0.31 | 8.58 ± 0.03 |
| LEAD-Adam (ours) | 10.49 ± 0.11 | 8.82 ± 0.05 |
| Extra-Adam (Gidel et al., 2018) | 16.78 ± 0.21 | 8.47 ± 0.1 |
| LA-GAN (Chavdarova et al., 2020) | 12.67 ± 0.57 | 8.55 ± 0.04 |
| ODE-GAN (Qin et al., 2020) | 11.85 ± 0.21 | 8.61 ± 0.06 |
| **Evaluated with 5k samples** | | |
| SN-GAN (DCGAN) (Miyato et al., 2018) | 29.3 | 7.42 ± 0.08 |
| SN-GAN (ResNet) (Miyato et al., 2018) | 21.7 ± 0.21 | 8.22 ± 0.05 |

8 Related Work

Game optimization: With increasing interest in games, significant effort is being spent on understanding the common issues affecting optimization in this domain. These issues range from convergence to non-Nash equilibrium points, to rotational dynamics around the equilibrium that hamper convergence. Mescheder et al. (2017) discuss how the eigenvalues of the Jacobian govern the local convergence properties of GANs. They argue that the presence of eigenvalues with zero real part and large imaginary part results in oscillatory behavior, and to mitigate this issue they propose Consensus Optimization (CO). Along similar lines, Balduzzi et al. (2018b); Gemp & Mahadevan (2018); Letcher et al. (2019); Loizou et al. (2020) use the Hamiltonian of the gradient vector field to improve convergence in games by disentangling the convergent parts of the dynamics from the rotations. Another line of attack, taken in Schäfer & Anandkumar (2019), is to use second-order information as a regularizer of the dynamics, motivating Competitive Gradient Descent (CGD). In Wang et al. (2019), Follow the Ridge (FtR) is proposed: the authors motivate the use of a second-order term for one of the players (the follower) to avoid the rotational dynamics in a sequential formulation of the zero-sum game. See Appendix K for a full discussion of the comparison of LEAD against other second-order methods.
Another approach, taken by Gidel et al. (2019), demonstrates how applying negative momentum to GDA can improve convergence in min-max games, while also proving a linear rate of convergence in the case of bilinear games. More recently, Zhang & Wang (2021) have shown the suboptimality of negative momentum in specific settings. Furthermore, in Lorraine & Duvenaud (2022), the authors carry out an extensive study of the effect of momentum in games and specifically show that complex momentum is optimal in many games, ranging from adversarial to non-adversarial settings. Daskalakis et al. (2018) show that extrapolating the next value of the gradient using previous history aids convergence. In the same spirit, Chavdarova et al. (2020) propose Lookahead GAN (LA-GAN) and show that the Lookahead algorithm is a compelling candidate for improving convergence in GANs. Gidel et al. (2018) also explore this line of thought by introducing averaging to develop a variant of the extra-gradient algorithm, proposing Extra-Adam-Averaging. Similar to Extra-Adam-Averaging is SN-EMA (Yazıcı et al., 2019), which builds on SN-GAN and achieves strong performance by applying an exponential moving average to the parameters. More recently, Fiez & Ratliff (2021) study using different time-scales for each player in zero-sum non-convex, non-concave games. Furthermore, in Rosca et al. (2021), the authors study the dynamics of game optimization in both continuous and discrete time, examine the effects of discretization drift on game performance, suggest a modified continuous-time dynamical system that more closely matches the discrete-time dynamics, and introduce regularizers that mitigate the effect of harmful drift. Lastly, with regard to convergence analysis in games, Zhang et al. (2022) study the convergence of alternating gradient descent-ascent for min-max games, and Golowich et al.
(2020) provide last-iterate convergence rates for convex-concave saddle point problems. Nouiehed et al. (2019) propose a multi-step variant of gradient descent-ascent and show it can find a game's ε-first-order stationary point. Additionally, Azizian et al. (2020a) and Ibrahim et al. (2020) provide spectral lower bounds on the rate of convergence in the bilinear setting, for an accelerated algorithm developed in Azizian et al. (2020b) for specific families of bilinear games. Furthermore, Fiez & Ratliff (2020) use Lyapunov analysis to provide convergence guarantees for gradient descent-ascent using timescale separation, and in Hsieh et al. (2020) the authors show that commonly used algorithms for min-max optimization converge to attractors that are not optimal.

Single-objective Optimization and Dynamical Systems: The authors of Su et al. (2014) started a new trend in single-objective optimization by studying the continuous-time dynamics of Nesterov's accelerated method (Nesterov, 2013). Their analysis allowed for a better understanding of the much-celebrated Nesterov method. In a similar spirit, Wibisono et al. (2016); Wilson et al. (2016) study continuous-time accelerated methods within a Lagrangian framework, analyzing their stability using Lyapunov analysis. These works show that a family of discrete-time methods can be derived from their corresponding continuous-time formalism using various discretization schemes. Additionally, several recent works (Muehlebach & Jordan, 2019; Bailey & Piliouras, 2019; Maddison et al., 2018; Ryu et al., 2019) cast game optimization algorithms as dynamical systems so as to leverage the rich theory of such systems to study the stability and convergence of various continuous-time methods. Nagarajan & Kolter (2017) also analyze the local stability of GANs as an approximated continuous dynamical system.
9 Conclusion and Future Directions

In this paper, we leverage tools from physics to propose a novel second-order optimization scheme, LEAD, to address the issue of rotational dynamics in min-max games. By casting min-max game optimization as a physical system, we use the principle of least action to discover an effective optimization algorithm for this setting. Subsequently, with the use of Lyapunov stability theory and spectral analysis, we prove LEAD to be convergent at a linear rate in bilinear min-max games. We supplement our theoretical analysis with experiments on GANs and toy setups, demonstrating improvements over baseline methods. Specifically for GAN training, we observe that our method outperforms other second-order methods, both in terms of sample quality and computational efficiency.

Our analysis underlines the advantages of physical approaches in designing novel optimization algorithms for games as well as for traditional optimization tasks. It is important to note in this regard that our crafted physical system is one way to model min-max optimization physically. Alternative schemes for such modeling could involve other choices of counter-rotational and dissipative forces, which can be explored in future work. Other directions for future work include extending the Lyapunov analysis to the general convex-concave setting. Our existing analysis, and the promising experimental results thereupon, make us believe that the analysis can be extended to the more general convex-concave setting. Nevertheless, such an attempt depends on finding an appropriate Lyapunov function, which may be challenging given the complex game dynamics. As a future direction, one may pursue other types of games where LEAD may be useful. Any two-parameter dynamical system can be interpreted as a two-player game; since LEAD models second-order interactions of the players, it may be a preferable optimization algorithm for games with higher-order structure.
Furthermore, studying the performance of LEAD on larger GAN architectures such as StyleGAN is left for future work.

Broader Impact Statement

While our contribution is mostly theoretical, our research has the potential to improve the optimization of multi-agent machine learning models, such as generative adversarial networks (GANs). GANs have been very successful in generating realistic images, music, speech, and text, and in improving performance on an array of different real-world tasks. On the other hand, GANs can be misused to generate fake news, fake images, and fake voices. Furthermore, a common problem encountered during GAN training is mode collapse, which results in GANs being biased toward generating certain types of data over others, thereby causing data misrepresentation. In this paper, we show that our proposed method can mitigate the mode collapse problem, as evidenced by improvements over baseline methods. However, we would like to emphasize that practitioners should use our research with caution, as a change of dataset or task might not prevent the mode collapse problem.

Acknowledgments

The authors would like to thank Mohammad Pezeshki, Gauthier Gidel, Tatjana Chavdarova, Maxime Laborde, Nicolas Loizou, Hugo Berard, Giancarlo Kerg, Manuela Girotti, Adam Ibrahim, Damien Scieur, and Michael Mulligan for useful discussions and feedback. This work is supported by NSERC Discovery Grants (RGPIN-2018-04821 and RGPIN-2019-06512), FRQNT Young Investigator Startup Grants (2019-NC-253251 and 2019-NC-257943), a startup grant by IVADO, the Canada CIFAR AI Chairs program, and a collaborative grant from Samsung Electronics Co., Ltd. Reyhane Askari Hemmat also acknowledges the support of a Borealis AI Graduate Fellowship, a Microsoft Diversity Award, and the NSERC Postgraduate Scholarships. Amartya Mitra acknowledges the support of the internship program at Mila and the UCR Graduate Fellowship.
This research was enabled in part by compute resources, software and technical help provided by Mila (mila.quebec) and Compute Canada.

References

Jacob Abernethy, Kevin A. Lai, and Andre Wibisono. Last-iterate convergence rates for min-max optimization. arXiv preprint arXiv:1906.02027, 2019.

Waïss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games. In International Conference on Artificial Intelligence and Statistics, pp. 2863-2873, 2020a.

Waïss Azizian, Damien Scieur, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. Accelerating smooth games by manipulating spectral shapes. In International Conference on Artificial Intelligence and Statistics, 2020b.

James P. Bailey and Georgios Piliouras. Multi-agent learning in network zero-sum games is a Hamiltonian system. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 233-241. International Foundation for Autonomous Agents and Multiagent Systems, 2019.

D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. deepmind-symplectic-gradient-adjustment, 2018a. https://github.com/deepmind/symplectic-gradient-adjustment/blob/master/Symplectic_Gradient_Adjustment.ipynb.

D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. In ICML, volume 80, pp. 363-372. JMLR.org, 2018b.

M. V. Berry and Pragya Shukla. Curl force dynamics: symmetries, chaos and constants of motion. New Journal of Physics, 18(6):063018, 2016.

Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Lucian Bu, Robert Babu, Bart De Schutter, et al.
A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156-172, 2008.

Tatjana Chavdarova, Matteo Pagliardini, Martin Jaggi, and Francois Fleuret. Taming GANs with lookahead. arXiv preprint arXiv:2006.14567, 2020.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018.

Tanner Fiez and Lillian Ratliff. Gradient descent-ascent provably converges to strict local minmax equilibria with a finite timescale separation. arXiv preprint arXiv:2009.14820, 2020.

Tanner Fiez and Lillian J. Ratliff. Local convergence analysis of gradient descent ascent with finite timescale separation. In Proceedings of the International Conference on Learning Representations, 2021.

Jakob Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 122-130. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Ian Gemp and Sridhar Mahadevan. Global convergence to the equilibrium of GANs using variational inequalities. arXiv preprint arXiv:1808.01531, 2018.

Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In International Conference on Learning Representations, 2018.

Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1802-1811, 2019.

Noah Golowich, Sarath Pattathil, Constantinos Daskalakis, and Asuman Ozdaglar.
Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. arXiv preprint arXiv:2002.00057, 2020.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767-5777, 2017.

Wolfgang Hahn, Hans H. Hosenthien, and H. Lehnigk. Theory and Application of Liapunov's Direct Method. Prentice-Hall, Englewood Cliffs, NJ, 1963.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626-6637, 2017.

Ya-Ping Hsieh, Panayotis Mertikopoulos, and Volkan Cevher. The limits of min-max optimization algorithms: convergence to spurious non-critical sets. arXiv preprint arXiv:2006.09065, 2020.

Adam Ibrahim, Waïss Azizian, Gauthier Gidel, and Ioannis Mitliagkas. Linear lower bounds and conditioning of differentiable games. In International Conference on Machine Learning, 2020.

Minguk Kang, Woohyeon Shim, Minsu Cho, and Jaesik Park. Rebooting ACGAN: Auxiliary classifier GANs with stable training. Advances in Neural Information Processing Systems, 34:23505-23518, 2021.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747-756, 1976.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

L. D. Landau and E. M. Lifshitz.
Course of Theoretical Physics. Vol. 1: Mechanics. Oxford, 1960.

Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, and Ce Liu. ViTGAN: Training GANs with vision transformers. arXiv preprint arXiv:2107.04589, 2021.

Alistair Letcher, David Balduzzi, Sébastien Racaniere, James Martens, Jakob N. Foerster, Karl Tuyls, and Thore Graepel. Differentiable game mechanics. Journal of Machine Learning Research, 20(84):1-40, 2019.

Nicolas Loizou, Hugo Berard, Alexia Jolicoeur-Martineau, Pascal Vincent, Simon Lacoste-Julien, and Ioannis Mitliagkas. Stochastic Hamiltonian gradient methods for smooth games. In ICML, 2020.

Jonathan P. Lorraine, David Acuna, Paul Vicol, and David Duvenaud. Complex momentum for optimization in games. In International Conference on Artificial Intelligence and Statistics, pp. 7742-7765. PMLR, 2022.

Aleksandr Mikhailovich Lyapunov. The general problem of the stability of motion. International Journal of Control, 55(3):531-534, 1992.

Chris J. Maddison, Daniel Paulin, Yee Whye Teh, Brendan O'Donoghue, and Arnaud Doucet. Hamiltonian descent methods. arXiv preprint arXiv:1809.05042, 2018.

Eric V. Mazumdar, Michael I. Jordan, and S. Shankar Sastry. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838, 2019.

Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018.

Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. In Advances in Neural Information Processing Systems, pp. 1825-1835, 2017.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

Michael Muehlebach and Michael Jordan. A dynamical systems perspective on Nesterov acceleration.
In International Conference on Machine Learning, pp. 4656-4662, 2019.

Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5585-5595, 2017.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pp. 14934-14942, 2019.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.

Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.

Chongli Qin, Yan Wu, Jost Tobias Springenberg, Andrew Brock, Jeff Donahue, Timothy P. Lillicrap, and Pushmeet Kohli. Training generative adversarial networks by solving ordinary differential equations. arXiv preprint arXiv:2010.15040, 2020.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Mihaela C. Rosca, Yan Wu, Benoit Dherin, and David Barrett. Discretization drift in two-player games. In International Conference on Machine Learning, pp. 9064-9074. PMLR, 2021.

Ernest K. Ryu, Kun Yuan, and Wotao Yin. ODE analysis of stochastic gradient methods with optimism and anchoring for minimax problems and GANs. arXiv preprint arXiv:1905.10899, 2019.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234-2242, 2016.
Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1-10, 2022.

Florian Schäfer and Anima Anandkumar. Competitive gradient descent. In Advances in Neural Information Processing Systems, pp. 7623-7633, 2019.

Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems, pp. 527-538, 2018.

Bin Shi, Simon S. Du, Weijie Su, and Michael I. Jordan. Acceleration via symplectic discretization of high-resolution differential equations. In Advances in Neural Information Processing Systems, pp. 5745-5753, 2019.

Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pp. 2510-2518, 2014.

Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. In International Conference on Learning Representations, 2019.

Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351-E7358, 2016.

Ashia C. Wilson, Benjamin Recht, and Michael I. Jordan. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.

Yasin Yazıcı, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, and Vijay Chandrasekhar. The unusual effectiveness of averaging in GAN training, 2019.

Guodong Zhang and Yuanhao Wang. On the suboptimality of negative momentum for minimax optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2098-2106. PMLR, 2021.

Guodong Zhang, Yuanhao Wang, Laurent Lessard, and Roger B. Grosse. Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization.
In International Conference on Artificial Intelligence and Statistics, pp. 7659-7679. PMLR, 2022.

B Proof of Proposition 1

Proof. The EOMs of the quadratic game in continuous time (Eq. (10)) can be discretized using a combination of implicit and explicit update steps, as in Shi et al. (2019):

$x_{k+1} - x_k = \delta v^x_{k+1}$,   (30a)
$y_{k+1} - y_k = \delta v^y_{k+1}$,   (30b)
$v^x_{k+1} - v^x_k = -q\delta\,\nabla^2_{xy} f(x_k, y_k)\,v^y_k - \mu\delta\,v^x_k - \delta\,\nabla_x f(x_k, y_k)$,   (30c)
$v^y_{k+1} - v^y_k = q\delta\,\big(\nabla^2_{xy} f(x_k, y_k)\big)^\top v^x_k - \mu\delta\,v^y_k + \delta\,\nabla_y f(x_k, y_k)$,   (30d)

where $\delta$ is the discretization step-size. Using Eqns. (30a) and (30b), we can further re-express Eqns. (30c), (30d) as

$x_{k+1} = x_k + \beta\,\Delta x_k - \eta\,\nabla_x f(x_k, y_k) - \alpha\,\nabla^2_{xy} f(x_k, y_k)\,\Delta y_k$
$y_{k+1} = y_k + \beta\,\Delta y_k + \eta\,\nabla_y f(x_k, y_k) + \alpha\,\big(\nabla^2_{xy} f(x_k, y_k)\big)^\top \Delta x_k$   (31)

where $\Delta x_k = x_k - x_{k-1}$, $\Delta y_k = y_k - y_{k-1}$, and

$\beta = 1 - \mu\delta, \qquad \eta = \delta^2, \qquad \alpha = q\delta.$   (32)

C Continuous-time Convergence Analysis: Quadratic Min-Max Game

Proof. For the class of quadratic min-max games

$f(X, Y) = \frac{h}{2}\lVert X\rVert^2 - \frac{h}{2}\lVert Y\rVert^2 + X^\top A Y,$   (33)

where $X = (X^1, \ldots, X^n)$, $Y = (Y^1, \ldots, Y^n) \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n\times n}$ is a constant matrix, the continuous-time EOMs of Eq. (10) become:

$\ddot{X} = -\mu\dot{X} - hX - AY - qA\dot{Y}$
$\ddot{Y} = -\mu\dot{Y} - hY + A^\top X + qA^\top\dot{X}$   (34)

We next define our continuous-time Lyapunov function in this case to be

$E_t = \frac{1}{2}\big(\dot{X} + \mu X + \mu AY\big)^\top\big(\dot{X} + \mu X + \mu AY\big) + \frac{1}{2}\big(\dot{Y} + \mu Y - \mu A^\top X\big)^\top\big(\dot{Y} + \mu Y - \mu A^\top X\big) + \frac{1}{2}\big(\dot{X}^\top\dot{X} + \dot{Y}^\top\dot{Y} + X^\top(h + AA^\top)X + Y^\top(h + A^\top A)Y\big).$   (35)

The time-derivative of $E_t$ is then given by

$\dot{E}_t = \big(\dot{X} + \mu X + \mu AY\big)^\top\big(\ddot{X} + \mu\dot{X} + \mu A\dot{Y}\big) + \big(\dot{Y} + \mu Y - \mu A^\top X\big)^\top\big(\ddot{Y} + \mu\dot{Y} - \mu A^\top\dot{X}\big) + \dot{X}^\top\ddot{X} + \dot{Y}^\top\ddot{Y} + X^\top(h + AA^\top)\dot{X} + Y^\top(h + A^\top A)\dot{Y}.$

Substituting the EOMs of Eq. (34) and simplifying,

$\dot{E}_t = \big(\mu(q-\mu) - 2\big)\big(Y^\top A^\top\dot{X} - X^\top A\dot{Y}\big) - \big(\mu(q-\mu) - 2\big)\big(\dot{X}^\top AA^\top X + \dot{Y}^\top A^\top A Y\big) - \mu\big(X^\top(h + AA^\top)X + Y^\top(h + A^\top A)Y\big) - \mu\big(\dot{X}^\top\dot{X} + \dot{Y}^\top\dot{Y}\big),$

where we have used the fact that $X^\top AY$ is a scalar, thus implying $X^\top AY = Y^\top A^\top X$.
If we now set $q = 2/\mu + \mu$, so that $\mu(q - \mu) = 2$, the first two terms above vanish and we are left with

$\dot{E}_t = -\mu\big(X^\top(h + AA^\top)X + Y^\top(h + A^\top A)Y\big) - \mu\big(\dot{X}^\top\dot{X} + \dot{Y}^\top\dot{Y}\big) = -\mu\big(h\lVert X\rVert^2 + h\lVert Y\rVert^2 + \lVert A^\top X\rVert^2 + \lVert AY\rVert^2\big) - \mu\big(\lVert\dot{X}\rVert^2 + \lVert\dot{Y}\rVert^2\big) \le 0 \quad \forall t,$   (37)

exhibiting that the Lyapunov function of Eq. (16) is asymptotically stable at all times $t$.

Next, for some positive constant $\rho$, consider the quantity $\rho E_t$. Expanding the squares in Eq. (35) and bounding each cross term $u^\top w$ by $\frac{1}{2}\big(\lVert u\rVert^2 + \lVert w\rVert^2\big)$ (valid since $\mu \ge 0$) gives

$\rho E_t \le \rho(1+\mu)\big(\lVert\dot{X}\rVert^2 + \lVert\dot{Y}\rVert^2\big) + \frac{\rho}{2}(\mu^2+\mu)(1+2h)\big(\lVert X\rVert^2 + \lVert Y\rVert^2\big) + \frac{\rho}{2}(\mu^2+\mu+2)\big(\lVert A^\top X\rVert^2 + \lVert AY\rVert^2\big).$   (38)

Now, for a general square matrix $A$, we can perform a singular value decomposition (SVD), $A = V^\top S U$. Here, $U$ and $V$ are the right and left unitaries of $A$, while $S$ is the diagonal matrix of singular values $\sigma_i$ of $A$. Using this decomposition in Eq. (38), together with the relations $U^\top U = UU^\top = I_n = V^\top V = VV^\top$, and performing the basis change $\mathcal{X} = VX$, $\mathcal{Y} = UY$, we obtain the component-wise bound

$\rho E_t \le \sum_{j=1}^{n}\Big[\rho(1+\mu)\big(\lVert\dot{\mathcal{X}}_j\rVert^2 + \lVert\dot{\mathcal{Y}}_j\rVert^2\big) + \frac{\rho}{2}\Big((1 + \sigma_j^2 + 2h)(\mu^2+\mu) + 2\sigma_j^2\Big)\big(\lVert\mathcal{X}_j\rVert^2 + \lVert\mathcal{Y}_j\rVert^2\big)\Big].$   (39)
Now, we know from Eq. (37) that

$\dot{E}_t = -\mu\big(h\lVert X\rVert^2 + h\lVert Y\rVert^2 + \lVert A^\top X\rVert^2 + \lVert AY\rVert^2\big) - \mu\big(\lVert\dot{X}\rVert^2 + \lVert\dot{Y}\rVert^2\big) = -\sum_{j=1}^{n}\mu\big(\sigma_j^2 + h\big)\big(\lVert\mathcal{X}_j\rVert^2 + \lVert\mathcal{Y}_j\rVert^2\big) - \sum_{j=1}^{n}\mu\big(\lVert\dot{\mathcal{X}}_j\rVert^2 + \lVert\dot{\mathcal{Y}}_j\rVert^2\big),$   (40)

where we have again used $A = V^\top S U$ and the basis change above. Comparing this expression with Eq. (39), we note that the choice

$\rho \le \min\Big\{\frac{\mu}{1+\mu},\ \frac{2\mu(\sigma_{\min}^2 + h)}{(1 + \sigma_{\min}^2 + 2h)(\mu^2 + \mu) + 2\sigma_{\min}^2}\Big\} \quad \forall j \in [1, n]$   (41)

ensures $\dot{E}_t \le -\rho E_t$, and hence $E_t \le E_0\exp(-\rho t)$. Since, from Eq. (35),

$X^\top(h + AA^\top)X + Y^\top(h + A^\top A)Y \le 2E_t \le 2E_0\exp(-\rho t)$

and

$X^\top(h + AA^\top)X + Y^\top(h + A^\top A)Y = \sum_{j=1}^{n}(h + \sigma_j^2)\big(\lVert\mathcal{X}_j\rVert^2 + \lVert\mathcal{Y}_j\rVert^2\big) \ge (h + \sigma_{\min}^2)\big(\lVert X\rVert^2 + \lVert Y\rVert^2\big),$

we conclude

$\lVert X\rVert^2 + \lVert Y\rVert^2 \le \frac{2E_0}{h + \sigma_{\min}^2}\exp(-\rho t).$   (42)

Figure 6: Left: Contours of the Lyapunov function $E_k$ of Eq. (35) (black) and the convergence trajectory of LEAD (red) in the quadratic min-max game of Eq. (34) to the Nash equilibrium $(0, 0)$. Right: The evolution of the discrete-time Lyapunov function of Eq. (35) over iterations, confirming $E_k - E_{k-1} \le 0$ for all $k \in \mathbb{N}$.

D Proof of Theorem 2

Theorem. The eigenvalues of $\nabla F_{\mathrm{LEAD}}(\omega^*)$ about the Nash equilibrium $\omega^* = (x^*, y^*)$ of the quadratic min-max game are

$\mu_{\pm}(\alpha, \beta, \eta) = \frac{1 - (\eta+\alpha)\lambda + \beta - \eta h \pm \sqrt{\Delta}}{2},$

where $\Delta = \big(1 - (\eta+\alpha)\lambda + \beta - \eta h\big)^2 - 4(\beta - \alpha\lambda)$ and $\lambda \in \mathrm{Sp}\big(\text{off-diag}[\nabla v(\omega^*)]\big)$. Furthermore, for $h, \eta, |\alpha|, |\beta| \ll 1$, we have

$\mu^{(i)}_{+}(\alpha, \beta, \eta) \approx 1 - \eta h + \sqrt{(\eta+\alpha)^2\lambda_i^2 + \eta^2h^2 + \beta^2 - 2\eta h\beta},$
$\mu^{(i)}_{-}(\alpha, \beta, \eta) \approx \beta - \sqrt{(\eta+\alpha)^2\lambda_i^2 + \eta^2h^2 + \beta^2 - 2\eta h\beta}.$

Proof. For the quadratic game of Eq. (33), the Jacobian of the vector field $v = \big(\nabla_x f(x_t, y_t),\,-\nabla_y f(x_t, y_t)\big)$ is given by

$\nabla v = \begin{pmatrix} hI_n & A \\ -A^\top & hI_n \end{pmatrix} \in \mathbb{R}^{2n \times 2n}.$   (45)

Let us next define the matrix

$D_q = \begin{pmatrix} \nabla^2_{xy} f(x, y) & 0 \\ 0 & \big(\nabla^2_{xy} f(x, y)\big)^\top \end{pmatrix} \in \mathbb{R}^{2n \times 2n}.$   (46)

Consequently, the update rule for LEAD can be written as

$\begin{pmatrix} x_{t+1} \\ y_{t+1} \end{pmatrix} = \begin{pmatrix} x_t \\ y_t \end{pmatrix} + \beta\begin{pmatrix} x_t - x_{t-1} \\ y_t - y_{t-1} \end{pmatrix} - \eta\begin{pmatrix} \nabla_x f(x_t, y_t) \\ -\nabla_y f(x_t, y_t) \end{pmatrix} - \alpha\begin{pmatrix} \nabla^2_{xy} f(x_t, y_t)\,\Delta y_t \\ -\big(\nabla^2_{xy} f(x_t, y_t)\big)^\top \Delta x_t \end{pmatrix},$   (47)

where $\Delta x_t = x_t - x_{t-1}$ and $\Delta y_t = y_t - y_{t-1}$. Next, by making use of the permutation matrix

$P := \begin{pmatrix} 0 & I_n \\ -I_n & 0 \end{pmatrix},$

we can re-express Eq.
(47) as

$\omega_{t+1} = \omega_t + \beta(\omega_t - \omega_{t-1}) - \eta\,v(\omega_t) - \alpha\,D_q P\,(\omega_t - \omega_{t-1}),$   (48)

where $\omega_t \equiv (x_t, y_t)$. Hence, viewing the update as the map $F_{\mathrm{LEAD}} : (\omega_t, \omega_{t-1}) \mapsto (\omega_{t+1}, \omega_t)$, its Jacobian is

$\nabla F_{\mathrm{LEAD}} = \begin{pmatrix} (1+\beta)I_{2n} - \eta\,\nabla v - \alpha D_q P & -\beta I_{2n} + \alpha D_q P \\ I_{2n} & 0 \end{pmatrix}.$   (49)

It is to be noted that, for games of the form of Eq. (33), we specifically have

$\nabla v = D_q P + hI_{2n}, \qquad \text{off-diag}[\nabla v] = D_q P.$   (50)

Therefore, Eq. (49) becomes

$\nabla F_{\mathrm{LEAD}} = \begin{pmatrix} (1 + \beta - \eta h)I_{2n} - (\eta+\alpha)D_q P & -\beta I_{2n} + \alpha D_q P \\ I_{2n} & 0 \end{pmatrix}.$   (51)

We next proceed to study the eigenvalues of this matrix, which determine the convergence properties of LEAD around the Nash equilibrium. Using Lemma 1 of Gidel et al. (2019), we can write the characteristic polynomial of $\nabla F_{\mathrm{LEAD}}$ as

$\det(XI_{4n} - \nabla F_{\mathrm{LEAD}}) = 0$
$\Leftrightarrow \det\begin{pmatrix} (X-1)I_{2n} - (\beta - \eta h)I_{2n} + (\eta+\alpha)D_q P & \beta I_{2n} - \alpha D_q P \\ -I_{2n} & XI_{2n} \end{pmatrix} = 0$
$\Leftrightarrow \det\Big(\big((X-1)(X-\beta) + X\eta h\big)I_{2n} + \big(X\eta + \alpha(X-1)\big)D_q P\Big) = 0$
$\Leftrightarrow \det\Big(\big((X-1)(X-\beta) + X\eta h\big)UU^{-1} + \big(X\eta + \alpha(X-1)\big)U\Lambda U^{-1}\Big) = 0$
$\Leftrightarrow \prod_{i=1}^{2n}\Big[(X-1)(X-\beta) + X\eta h + \big(X\eta + \alpha(X-1)\big)\lambda_i\Big] = 0,$   (52)

where, in the above, we have performed an eigenvalue decomposition $D_q P = U\Lambda U^{-1}$. Therefore, for each $\lambda_i \in \mathrm{Sp}(D_q P)$,

$X^2 - X\big(1 - (\eta+\alpha)\lambda_i + \beta - \eta h\big) + \beta - \alpha\lambda_i = 0,$

so that

$\mu^{(i)}_{\pm} = \frac{1 - (\eta+\alpha)\lambda_i + \beta - \eta h \pm \sqrt{\big(1 - (\eta+\alpha)\lambda_i + \beta - \eta h\big)^2 - 4(\beta - \alpha\lambda_i)}}{2}.$   (53)

Furthermore, for $h, \eta, |\beta|, |\alpha| \ll 1$, we can approximate the above roots as

$\mu^{(i)}_{+}(\alpha, \beta, \eta) \approx 1 - \eta h + \sqrt{(\eta+\alpha)^2\lambda_i^2 + \eta^2h^2 + \beta^2 - 2\eta h\beta},$
$\mu^{(i)}_{-}(\alpha, \beta, \eta) \approx \beta - \sqrt{(\eta+\alpha)^2\lambda_i^2 + \eta^2h^2 + \beta^2 - 2\eta h\beta}.$   (54)

E Proof of Proposition 3

Proposition. For any $\lambda \in \mathrm{Sp}\big(\text{off-diag}[\nabla v(\omega^*)]\big)$,

$\partial_\alpha \rho(\lambda)\big|_{\alpha=0} < 0 \qquad \forall\,\eta \in \Big(0,\ \frac{2}{\mathrm{Im}(\lambda_{\max})}\Big),$

where $\mathrm{Im}(\lambda_{\max})$ is the imaginary component of the largest eigenvalue $\lambda_{\max}$.
Proof. For $h, \eta, |\alpha|, |\beta| \ll 1$, we have from Theorem 2 that

$\rho(\alpha, \eta, \beta) := \max_i\big\{|\mu^{(i)}_{+}|^2,\ |\mu^{(i)}_{-}|^2\big\} = \max_i\big\{|\mu^{(i)}_{+}|^2\big\}.$   (56)

Differentiating with respect to $\alpha$ at $\alpha = 0$ and retaining only terms up to cubic order in $\eta$, $|\beta|$ and $h$, one obtains

$\partial_\alpha\rho\big|_{\alpha=0} \le \max_i\Big\{-4\eta|\lambda_i|^2\Big(1 - \frac{\eta|\lambda_i|}{2}\Big)\Big\}.$   (57)

Hence, choosing $\eta \in \big(0,\ \frac{2}{\mathrm{Im}(\lambda_{\max})}\big)$ ensures

$\partial_\alpha\rho\big|_{\alpha=0} < 0 \quad \forall i.$   (58)

We thus posit that a choice of positive $\alpha$ causes the norm of the limiting eigenvalue $\mu_+$ of $\nabla F_{\mathrm{LEAD}}$ to decrease.

F Proof of Theorem 3

Theorem. Setting $\eta = \alpha = \frac{1}{2(\sigma_{\max}(A) + h)}$, we have, for all $\epsilon > 0$,

$\Delta_{t+1} \in O\Big(\Big(1 - \frac{\sigma_{\min}^2}{4(\sigma_{\max}+h)^2} - \frac{(1+\beta/2)h}{\sigma_{\max}+h} + \frac{3h^2}{8(\sigma_{\max}+h)^2} + \epsilon\Big)^t\,\Delta_0\Big),$

where $\sigma_{\max}$ ($\sigma_{\min}$) is the largest (smallest) singular value of $A$ and $\Delta_{t+1} := \lVert\omega_{t+1} - \omega^*\rVert_2^2 + \lVert\omega_t - \omega^*\rVert_2^2$.

Proof. From Eq. (53), we recall that the eigenvalues of $\nabla F_{\mathrm{LEAD}}(\omega^*)$ for the quadratic game are

$\mu^{(i)}_{\pm}(\alpha, \beta, \eta) = \frac{1 - (\alpha+\eta)\lambda_i + \beta - \eta h}{2}\Bigg(1 \pm \sqrt{1 - \frac{4(\beta - \alpha\lambda_i)}{\big(1 - (\alpha+\eta)\lambda_i + \beta - \eta h\big)^2}}\Bigg),$

with $\lambda_i \in \mathrm{Sp}\big(\text{off-diag}[\nabla v(\omega^*)]\big)$. Now, since in the quadratic-game setting considered we have

$\text{off-diag}[\nabla v(\omega^*)] = D_q P = \begin{pmatrix} 0 & A \\ -A^\top & 0 \end{pmatrix},$

it follows that $\lambda_i = \pm i\sigma_i$, with $\sigma_i$ the singular values of $A$. According to Proposition 2, the convergence behavior of LEAD is determined by

$\Delta_{t+1} \in O\big((\rho + \epsilon)^t\,\Delta_0\big) \quad \forall\,\epsilon > 0,$

where (setting $\eta = \alpha$)

$\rho := \max_i\big\{|\mu^{(i)}_{+}|^2,\ |\mu^{(i)}_{-}|^2\big\} = \max_i|\mu^{(i)}_{+}|^2.$   (63)

Now, assuming $\eta$ is small enough that terms of order $\eta^3$ and $\beta^2$ are negligible, we have

$\rho \approx 1 - \eta^2\sigma_i^2 + \frac{3}{2}\eta^2h^2 - (2+\beta)\eta h.$   (64)

Furthermore, using the learning rate prescribed by Proposition 3, $\eta = \alpha = \frac{1}{2(\sigma_{\max}(A)+h)}$, we find

$r_{\mathrm{LEAD}} = \frac{\sigma_{\min}^2}{4(\sigma_{\max}+h)^2} + \frac{(1+\beta/2)h}{\sigma_{\max}+h} - \frac{3h^2}{8(\sigma_{\max}+h)^2},$   (65)

so that

$\Delta_{t+1} \in O\big((1 - r_{\mathrm{LEAD}})^t\,\Delta_0\big),$

where $\Delta_{t+1} := \lVert\omega_{t+1} - \omega^*\rVert_2^2 + \lVert\omega_t - \omega^*\rVert_2^2$.
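The spectral picture above can be checked numerically. The sketch below is our own illustration (not the authors' code): it builds the Jacobian of the two-step LEAD recursion of Eq. (31) for the toy 1-D bilinear game $f(x, y) = \sigma x y$ (so $h = 0$), using the state ordering $(x_t, y_t, x_{t-1}, y_{t-1})$, and compares its spectral radius with that of plain simultaneous gradient descent-ascent ($\alpha = \beta = 0$), which is known to diverge on bilinear games. All parameter values are arbitrary choices for illustration.

```python
import numpy as np

def lead_jacobian(sigma, eta, alpha, beta):
    """Companion (Jacobian) matrix of the two-step LEAD recursion on f(x, y) = sigma * x * y.

    With state (x_t, y_t, x_{t-1}, y_{t-1}), Eq. (31) with h = 0 reads:
      x_{t+1} = (1+beta) x_t - beta x_{t-1} - (eta+alpha) sigma y_t + alpha sigma y_{t-1}
      y_{t+1} = (1+beta) y_t - beta y_{t-1} + (eta+alpha) sigma x_t - alpha sigma x_{t-1}
    """
    return np.array([
        [1 + beta, -(eta + alpha) * sigma, -beta, alpha * sigma],
        [(eta + alpha) * sigma, 1 + beta, -alpha * sigma, -beta],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
    ])

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

sigma = 1.0
eta = alpha = 1.0 / (2.0 * sigma)  # eta = alpha = 1/(2 sigma_max), as prescribed by Theorem 3 with h = 0
rho_lead = spectral_radius(lead_jacobian(sigma, eta, alpha, beta=0.0))
rho_gda = spectral_radius(lead_jacobian(sigma, eta, alpha=0.0, beta=0.0))

print(rho_lead)  # ~0.7071 < 1: linear convergence
print(rho_gda)   # ~1.1180 > 1: GDA diverges on the bilinear game
```

For this 1-D game the characteristic roots can be computed by hand ($|\mu| = 1/\sqrt{2}$ for LEAD vs. $\sqrt{1 + \eta^2\sigma^2}$ for GDA), so the numerical spectral radii serve as a direct sanity check on the closed-form eigenvalues.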
G Robustness to the Discretization Parameter

In this section we study the effect of the discretization parameter $\delta$, defined in Eq. (32) and Proposition 1, on the quadratic game defined in Eq. (29),

$\min_x \max_y\ x^\top(\gamma A)y + x^\top\big((1-\gamma)B_1\big)x - y^\top\big((1-\gamma)B_2\big)y.$

In Figure 7, we provide a grid of experiments studying the effect of discretization on the convergence of the quadratic game explored in Section 7.1 (adversarial vs. cooperative games). We observe that for every level of $\gamma_{\max}$, which changes the dynamics of the game from adversarial to cooperative, LEAD allows for a wide range of discretization steps that lead to convergence. Note that the discretization step, and consequently the learning rate, can be viewed as a hyper-parameter and thus requires tuning.

Figure 7: The effect of the discretization parameter ($\delta$) on the convergence of LEAD on the quadratic min-max game defined in Eq. (29). The x-axis changes the game dynamics from cooperative ($\gamma_{\max} = 0$) to adversarial ($\gamma_{\max} = 1$); the y-axis shows the range of $\delta$. The color shows the spectral radius, which corresponds to the convergence rate (a smaller spectral radius leads to faster convergence); convergent dynamics require a spectral radius smaller than one. Experiments with non-convergent dynamics (spectral radius > 1) are plotted in white, and darker colors correspond to faster convergence. We observe that LEAD allows for a wide range of $\delta$ resulting in convergent dynamics; thus, LEAD is robust to the discretization parameter.

H LEAD-Adam

Since the Adam algorithm is commonly used in large-scale experiments, we extend LEAD to be used with Adam.

Algorithm 2 Least Action Dynamics Adam (LEAD-Adam)
1: Input: learning rate η, momentum parameters β1, β2, coupling coefficient α.
2: Initialize: x_0 ← x_init, y_0 ← y_init, t ← 0, m^x_0 ← 0, v^x_0 ← 0, m^y_0 ← 0, v^y_0 ← 0
3: while not converged do
4:   t ← t + 1
5:   g^x ← ∇_x f(x_t, y_t)
6:   g^xy_y ← ∇_y(g^x)(y_t − y_{t−1})
7:   g^x_t ← g^xy_y + g^x
8:   m^x_t ← β1 · m^x_{t−1} + (1 − β1) · g^x_t
9:   v^x_t ← β2 · v^x_{t−1} + (1 − β2) · (g^x_t)^2
10:  m̂^x_t ← m^x_t / (1 − β1^t)
11:  v̂^x_t ← v^x_t / (1 − β2^t)
12:  x_{t+1} ← x_t − η · m̂^x_t / (√v̂^x_t + ϵ)
13:  g^y ← ∇_y f(x_{t+1}, y_t)
14:  g^xy_x ← ∇_x(g^y)(x_{t+1} − x_t)
15:  g^y_t ← g^xy_x + g^y
16:  m^y_t ← β1 · m^y_{t−1} + (1 − β1) · g^y_t
17:  v^y_t ← β2 · v^y_{t−1} + (1 − β2) · (g^y_t)^2
18:  m̂^y_t ← m^y_t / (1 − β1^t)
19:  v̂^y_t ← v^y_t / (1 − β2^t)
20:  y_{t+1} ← y_t + η · m̂^y_t / (√v̂^y_t + ϵ)
21: end while
22: return (x, y)

I 8-Gaussians Generation

We compare our method, LEAD-Adam, with vanilla Adam (Kingma & Ba, 2014) on the task of generating a mixture of 8 Gaussians. Standard optimization algorithms such as vanilla Adam suffer from mode collapse on this simple task, meaning the generator cannot produce samples from one or several of the distributions present in the real data. In Figure 8, we demonstrate that LEAD-Adam fully captures all the modes in the real data with both saturating and non-saturating losses.

Figure 8: Performance of LEAD-Adam on the 8-Gaussians generation task. All samples are shown after 10k iterations. Samples generated using Adam exhibit mode collapse, while LEAD-Adam does not suffer from this issue.

J Experiments and Implementation Details

J.1 Mixture of Eight Gaussians

Dataset: The real data is generated from 8 Gaussian distributions whose means are uniformly distributed around the unit circle, each with variance 0.05. The code to generate the data is included in the source code.

Architecture: The Generator and Discriminator each consist of four affine layers, followed by ReLU non-linearities. The weight initialization is PyTorch's default initialization scheme. See a schematic of the architecture in Table 3.
Table 3: Architecture used for the Mixture of Eight Gaussians.

Generator:
  Input: z ∈ R^{64} ~ N(0, I)
  Linear(64 → 2000), ReLU
  Linear(2000 → 2000), ReLU
  Linear(2000 → 2000), ReLU
  Linear(2000 → 2)

Discriminator:
  Input: x ∈ R^{2}
  Linear(2 → 2000), ReLU
  Linear(2000 → 2000), ReLU
  Linear(2000 → 2000), ReLU
  Linear(2000 → 1)

Other Details. We use the Adam optimizer (Kingma & Ba, 2014) on top of our algorithm in the reported results. Furthermore, we use a batch size of 128.

J.2 CIFAR-10 DCGAN

Dataset. The CIFAR-10 dataset is available for download at the following link: https://www.cs.toronto.edu/~kriz/cifar.html

Architecture. The discriminator has four convolutional layers with Leaky ReLU and batch normalization. The generator has four deconvolutional layers with ReLU and batch normalization. See a schematic of the architecture in Table 4.

Table 4: Architecture used for CIFAR-10 DCGAN.

Generator:
  Input: z ∈ R^{100} ~ N(0, I)
  conv. (ker: 4×4, 100 → 1024; stride: 1; pad: 0), Batch Normalization, ReLU
  conv. (ker: 4×4, 1024 → 512; stride: 2; pad: 1), Batch Normalization, ReLU
  conv. (ker: 4×4, 512 → 256; stride: 2; pad: 1), Batch Normalization, ReLU
  conv. (ker: 4×4, 256 → 3; stride: 2; pad: 1), Tanh

Discriminator:
  Input: x ∈ R^{3×32×32}
  conv. (ker: 4×4, 3 → 256; stride: 2; pad: 1), Leaky ReLU
  conv. (ker: 4×4, 256 → 512; stride: 2; pad: 1), Batch Normalization, Leaky ReLU
  conv. (ker: 4×4, 512 → 1024; stride: 2; pad: 1), Batch Normalization, Leaky ReLU
  conv. (ker: 4×4, 1024 → 1; stride: 1; pad: 0), Sigmoid

Other Details. For the baseline we use Adam with β₁ = 0.5 and β₂ = 0.99. The generator's learning rate is 0.0002 and the discriminator's learning rate is 0.0001. The same learning rates and momentum were used to train the LEAD model. We also add the mixed-derivative term with α_d = 0.3 and α_g = 0.0. The baseline is a DCGAN with the standard non-saturating loss (non-zero-sum formulation). In our experiments, we compute the FID based on 50,000 samples generated from our model vs. 50,000 real samples.
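To make the mixed-derivative term of Algorithm 2 concrete, the following minimal NumPy sketch (ours; the hyper-parameters are illustrative, not the GAN settings above) runs LEAD-Adam on the toy bilinear game f(x, y) = xy, where ∇_x f = y, ∇_y f = x, and the mixed derivative ∇²_xy f = 1 are available in closed form:

```python
import numpy as np

def lead_adam(steps=200, eta=0.01, alpha=0.3, beta1=0.5, beta2=0.9, eps=1e-8):
    """LEAD-Adam (Algorithm 2) on the toy game f(x, y) = x*y, where
    grad_x f = y, grad_y f = x, and the mixed derivative is 1."""
    x, y = 1.0, 1.0
    y_prev = y
    mx = vx = my = vy = 0.0
    for t in range(1, steps + 1):
        # x-player: gradient plus LEAD's mixed-derivative coupling term
        gx = y + alpha * (y - y_prev)      # grad_x f + alpha * (y_t - y_{t-1})
        mx = beta1 * mx + (1 - beta1) * gx
        vx = beta2 * vx + (1 - beta2) * gx ** 2
        x_new = x - eta * (mx / (1 - beta1 ** t)) / (np.sqrt(vx / (1 - beta2 ** t)) + eps)
        # y-player ascends, using the freshly updated x (alternating updates)
        gy = x_new + alpha * (x_new - x)   # grad_y f + alpha * (x_{t+1} - x_t)
        my = beta1 * my + (1 - beta1) * gy
        vy = beta2 * vy + (1 - beta2) * gy ** 2
        y_new = y + eta * (my / (1 - beta1 ** t)) / (np.sqrt(vy / (1 - beta2 ** t)) + eps)
        y_prev, x, y = y, x_new, y_new
    return x, y

x_final, y_final = lead_adam()
```

In a full GAN, the coupling terms on lines 6 and 14 of Algorithm 2 are Hessian-vector products computed with automatic differentiation rather than closed-form derivatives.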
Figure 9: Performance of LEAD on the CIFAR-10 image generation task with a DCGAN architecture. Left: LEAD achieves an FID of 19.27. Right: vanilla Adam achieves an FID of 24.38. LEAD generates better-quality samples for several classes such as ships, horses, and birds (red). Best performance is reported after 100 epochs.

J.3 CIFAR-10 ResNet

Dataset. The CIFAR-10 dataset is available for download at the following link: https://www.cs.toronto.edu/~kriz/cifar.html

Architecture. See Table 6 for a schematic of the architecture used for the CIFAR-10 experiments with ResNet.

Table 5: ResNet blocks used for the ResNet architectures (see Table 6).

G-ResBlock:
  Shortcut: Upsample(×2)
  Residual: Batch Normalization, ReLU, Upsample(×2),
            conv. (ker: 3×3, 256 → 256; stride: 1; pad: 1),
            Batch Normalization, ReLU,
            conv. (ker: 3×3, 256 → 256; stride: 1; pad: 1)

D-ResBlock (ℓ-th block):
  Shortcut: conv. (ker: 1×1, 3 (if ℓ = 1) / 128 (if ℓ ≠ 1) → 128; stride: 1), Spectral Normalization,
            [Avg Pool (ker: 2×2, stride: 2)], if ℓ = 1
  Residual: [ReLU], if ℓ ≠ 1,
            conv. (ker: 3×3, 3 (if ℓ = 1) / 128 (if ℓ ≠ 1) → 128; stride: 1; pad: 1), Spectral Normalization, ReLU,
            conv. (ker: 3×3, 128 → 128; stride: 1; pad: 1), Spectral Normalization,
            Avg Pool (ker: 2×2)

Other Details. The baseline is a ResNet with the non-saturating loss (non-zero-sum formulation). Similar to Miyato et al. (2018), the discriminator is updated 5 times for every generator update. For both the baseline SNGAN and LEAD-Adam we use β₁ = 0.0 and β₂ = 0.9 for Adam. The baseline SNGAN uses a learning rate of 0.0002 for both the generator and the discriminator. LEAD-Adam also uses a learning rate of 0.0002 for the generator, but 0.0001 for the discriminator. LEAD-Adam uses an α of 0.5 and 0.01 for the generator and the discriminator, respectively. Furthermore, we evaluate both the baseline and our method on an exponential moving average of the generator's parameters.
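The exponential moving average used for evaluation keeps a slowly-updated copy of the generator's weights alongside the raw training weights. A minimal NumPy sketch (ours; the decay value 0.999 is a common choice, not necessarily the one used in the experiments):

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """One EMA step: the averaged copy (used for evaluation) tracks the
    raw training parameters with exponential decay."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema_params, params)]

# Toy example: raw parameters frozen at zero; the EMA decays toward them.
params = [np.zeros(3)]
ema = [np.ones(3)]
for _ in range(10):
    ema = ema_update(ema, params)
```

Evaluating on the averaged copy rather than the raw parameters typically smooths out the oscillations inherent to adversarial training.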
Table 6: ResNet architectures used for experiments on CIFAR-10.

Generator:
  Input: z ∈ R^{64} ~ N(0, I)
  Linear(64 → 4096)
  G-ResBlock
  G-ResBlock
  G-ResBlock
  Batch Normalization, ReLU
  conv. (ker: 3×3, 256 → 3; stride: 1; pad: 1)
  Tanh(·)

Discriminator:
  Input: x ∈ R^{3×32×32}
  D-ResBlock
  D-ResBlock
  D-ResBlock
  D-ResBlock
  ReLU
  Avg Pool (ker: 8×8)
  Linear(128 → 1), Spectral Normalization

In our experiments, we compute the FID based on 50,000 samples generated from our model vs. 50,000 real samples, and we report the mean and variance over 5 random runs. We have provided pre-trained models as well as the source code for both LEAD-Adam and the baseline SNGAN in our GitHub repository.

Figure 10: Generated samples of LEAD-Adam on CIFAR-10 after 50k iterations with a ResNet architecture. We achieve an FID score of 10.49 using a learning rate of 2e-4 for the generator and the discriminator; α for the generator is set to 0.01 and for the discriminator is set to 0.5.

K Comparison to other methods

In this section we compare our method with several other second-order methods in the min-max setting. The distinction of LEAD from SGA and LookAhead can be understood by considering the 1st-order approximation of

    x_{k+1} = x_k − η ∇_x f(x_k, y_k + η Δy_k),  where Δy_k = ∇_y f(x_k + η Δx_k, y_k).

⁴ For FtR, we provide the update for the second player, given that the first player performs gradient descent. Also note that in this table SGA is simplified for the two-player zero-sum game; non-zero-sum formulations of SGA, such as the one used for GANs, require the computation of Jv and J⊤v.

Table 7: Comparison of several second-order methods in min-max optimization. Each update rule, corresponding to a particular row, can be constructed by adding the cells in that row from Columns 4 to 7 and then multiplying by the value in Column 1. Furthermore, Δx_{k+1} = x_{k+1} − x_k, while C = I + η² ∇²_xy f ∇²_yx f.
We compare the update rules of the first player⁴ for the following methods: Gradient Descent-Ascent (GDA), Least Action Dynamics (LEAD, ours), Symplectic Gradient Adjustment (SGA), Competitive Gradient Descent (CGD), Consensus Optimization (CO), Follow-the-Ridge (FtR), and Learning with Opponent-Learning Awareness (LOLA), in a zero-sum game.

Method             | Coefficient | Momentum | Gradient  | Interaction-xy                      | Interaction-xx
GDA   (x_{k+1} =)  | 1           | 0        | −η ∇_x f  | 0                                   | 0
LEAD  (x_{k+1} =)  | 1           | β Δx_k   | −η ∇_x f  | −α ∇²_xy f Δy_k                     | 0
SGA   (x_{k+1} =)  | 1           | 0        | −η ∇_x f  | −ηγ ∇²_xy f ∇_y f                   | 0
CGD   (x_{k+1} =)  | C⁻¹         | 0        | −η ∇_x f  | −η² ∇²_xy f ∇_y f                   | 0
CO    (x_{k+1} =)  | 1           | 0        | −η ∇_x f  | −ηγ ∇²_xy f ∇_y f                   | −ηγ ∇²_xx f ∇_x f
FtR   (y_{k+1} =)  | 1           | 0        | η_y ∇_y f | η_x (∇²_yy f)⁻¹ ∇²_yx f ∇_x f       | 0
LOLA  (x_{k+1} =)  | 1           | 0        | −η ∇_x f  | −2ηα ∇²_xy f ∇_y f                  | 0

This gives rise to:

    x_{k+1} = x_k − η ∇_x f(x_k, y_k) − η² ∇²_xy f(x_k, y_k) Δy,   (67)
    y_{k+1} = y_k + η ∇_y f(x_k, y_k) + η² ∇²_yx f(x_k, y_k) Δx,   (68)

with Δx, Δy corresponding to each player accounting for its opponent's potential next step. However, SGA and LookAhead additionally model their opponent as a naive learner, i.e., Δx = −∇_x f(x_k, y_k), Δy = ∇_y f(x_k, y_k). On the contrary, our method does away with such specific assumptions, instead modeling the opponent based on its most recent move.

Furthermore, there is a resemblance between LEAD and OGDA that we would like to address. The 1st-order Taylor expansion of the difference-of-gradients term in OGDA yields the update (for x):

    x_{k+1} = x_k − η ∇_x f − η² ∇²_xy f ∇_y f + η² ∇²_xx f ∇_x f,   (69)

which contains an extra 2nd-order term, ∇²_xx f, compared to ours. As noted in Schäfer & Anandkumar (2019), the ∇²_xx f term does not systematically aid in curbing the min-max rotations, and instead causes convergence to non-Nash points in some settings. For example, consider the simple game f(x, y) = γ(x² − y²), where x, y, γ are all scalars, with the Nash equilibrium of this game located at (x, y) = (0, 0).
For a choice of γ = 6, OGDA fails to converge for any learning rate, while methods such as LEAD, Gradient Descent-Ascent (GDA), and CGD (Schäfer & Anandkumar, 2019), which do not contain the ∇²_xx f (∇²_yy f) term, do exhibit convergence. See Figure 11 and Schäfer & Anandkumar (2019) for more discussion.

Figure 11: Convergence/divergence of several algorithms on the game f(x, y) = γ(x² − y²) (Nash equilibrium at (x, y) = (0, 0)). Left: for γ = 1, OGDA and LEAD/GDA/CGD (overlaying) converge to the Nash equilibrium. Right: for γ = 6, OGDA fails to converge, while LEAD/GDA/CGD (overlaying) converge. We conjecture that the reason behind this observation is the presence of the ∇²_xx f term in the OGDA update.
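To complement the comparison above, the effect of LEAD's interaction term can be checked numerically with a few lines of NumPy. This is our own sketch (not the paper's code): it builds the one-step linear operators of simultaneous-update GDA and LEAD (with β = 0, as in Table 7) on the bilinear game f(x, y) = xy and compares their spectral radii; the step sizes η = 0.1, α = 0.3 are illustrative.

```python
import numpy as np

ETA = 0.1  # learning rate

def one_step_matrix(alpha, eta=ETA):
    """Linear map z_{k+1} = M z_k for the state z_k = (x_k, y_k, x_{k-1}, y_{k-1})
    on f(x, y) = x*y, using simultaneous updates with beta = 0:
        x_{k+1} = x_k - eta*y_k - alpha*(y_k - y_{k-1})
        y_{k+1} = y_k + eta*x_k + alpha*(x_k - x_{k-1})
    Setting alpha = 0 recovers plain GDA."""
    c = eta + alpha
    return np.array([
        [1.0,  -c,    0.0,    alpha],
        [c,     1.0, -alpha,  0.0],
        [1.0,   0.0,  0.0,    0.0],
        [0.0,   1.0,  0.0,    0.0],
    ])

rho_gda = np.abs(np.linalg.eigvals(one_step_matrix(alpha=0.0))).max()
rho_lead = np.abs(np.linalg.eigvals(one_step_matrix(alpha=0.3))).max()
# rho_gda exceeds 1 (GDA spirals outward on the bilinear game), while
# rho_lead is below 1 (LEAD's coupling term yields linear convergence).
```

This mirrors the spectral-radius criterion of Appendix G: the iteration converges exactly when the spectral radius of its one-step operator is below one.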