# Taming GANs with Lookahead Minmax

Published as a conference paper at ICLR 2021

Tatjana Chavdarova (EPFL) · Mattéo Pagliardini (EPFL) · Sebastian U. Stich (EPFL) · François Fleuret (University of Geneva) · Martin Jaggi (EPFL)

Generative Adversarial Networks are notoriously challenging to train. The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and to the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, originally developed for single-objective minimization only. The backtracking step of our Lookahead minmax naturally handles the rotational game dynamics, a property identified as key for enabling gradient descent ascent methods to converge on challenging examples often analyzed in the literature. Moreover, it implicitly handles high variance without using large mini-batches, which are otherwise known to be essential for reaching state-of-the-art performance. Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead minmax with Adam or extragradient, in terms of performance and improved stability, for negligible memory and computational cost. Using 30-fold fewer parameters and 16-fold smaller minibatches, we outperform the reported performance of the class-dependent BigGAN on CIFAR-10 by obtaining an FID of 12.19 without using the class labels, bringing state-of-the-art GAN training within reach of common computational resources. Our source code is available at https://github.com/Chavdarova/LAGAN-Lookahead_Minimax.

1 INTRODUCTION

Gradient-based methods are the workhorse of machine learning. These methods optimize the parameters of a model with respect to a single objective f : X → R. However, an increasing interest in multi-objective optimization arises in various domains such as mathematics, economics, and multi-agent reinforcement learning (Omidshafiei et al., 2017), where several agents aim at optimizing their own cost function f_i : X_1 × ··· × X_N → R simultaneously. A particularly successful class of algorithms of this kind are Generative Adversarial Networks (Goodfellow et al., 2014, GANs), which consist of two players referred to as a generator and a discriminator. GANs were originally formulated as minmax optimization of f : X × Y → R (Von Neumann & Morgenstern, 1944), where the generator and the discriminator aim at minimizing and maximizing the same value function, see §2. A natural generalization of gradient descent for minmax problems is the gradient descent ascent algorithm (GDA), which alternates between a gradient descent step for the min-player and a gradient ascent step for the max-player. This minmax training aims at finding a Nash equilibrium, where no player has an incentive to change its parameters. Despite the impressive quality of the samples generated by GANs relative to classical maximum-likelihood-based generative models, these models remain notoriously difficult to train. In particular, poor performance (sometimes manifesting as "mode collapse"), brittle dependency on hyperparameters, or divergence are often reported. Consequently, obtaining state-of-the-art performance was shown to require large computational resources (Brock et al., 2019), making well-performing models unavailable for common computational budgets.

*Equal contributions. Correspondence to firstname.lastname@epfl.ch.
Figure 1: Illustration of Lookahead minmax (Alg. 1) with GDA on min_x max_y x·y, with α = 0.5. The solution, the trajectory {x_t, y_t}_{t=1}^T, and the lines between (x_P, y_P) and (x_t, y_t) are shown with a red star, a blue line, and dashed green lines, resp. The backtracking step of Alg. 1 (lines 10 & 11) allows the otherwise non-converging GDA to converge, see §4.2.1.

Figure 2: FID (lower is better) of Lookahead minmax on 32×32 ImageNet (§5), showing the fast and slow weights, with an LA step every 10K iterations; see Fig. 12 for IS. We observe significant improvements after each LA step, which empirically confirms the existence of rotations and thus the intuition about the behaviour of LA minmax on games, as well as the relevance of the bilinear game of Fig. 1 for real-world applications.

It was empirically shown that: (i) GANs often converge to a locally stable stationary point that is not a differential Nash equilibrium (Berard et al., 2020); and (ii) an increased batch size improves GAN performance (Brock et al., 2019), in contrast to minimization (Defazio & Bottou, 2019; Shallue et al., 2018). A principal reason is attributed to the rotations arising from the adversarial component of the associated vector field of the gradients of the two players' parameters (Mescheder et al., 2018; Balduzzi et al., 2018), which are atypical for minimization. More precisely, the Jacobian of the associated vector field (see the definition in §2) can be decomposed into a symmetric and an antisymmetric component (Balduzzi et al., 2018), which behave as a potential game (Monderer & Shapley, 1996) and a Hamiltonian game, resp. Games are often a combination of the two, making this general case harder to solve.

In the context of single-objective minimization, Zhang et al. (2019) recently proposed the Lookahead algorithm, which intuitively chooses an update direction by looking ahead at the sequence of parameters generated by an inner optimizer, the so-called fast weights, which change with higher variance due to the stochasticity of the gradient estimates. Lookahead was shown to improve stability during training and to reduce the variance of the so-called slow weights.

Contributions. Our contributions can be summarized as follows:

- We propose Lookahead minmax for optimizing minmax problems, which applies extrapolation in the joint parameter space (see Alg. 1), so as to account for the rotational component of the associated game vector field (defined in §2).
- In the context of: (i) single-objective minimization, building on insights of Wang et al. (2020), who argue that Lookahead can be interpreted as an instance of local SGD, we derive improved convergence guarantees for the Lookahead algorithm (§3); (ii) two-player games, we explain why Lookahead minmax suppresses the rotational part in a simple bilinear game, and prove its convergence for a given converging base optimizer (§4).
- We motivate the use of Lookahead minmax for games on the extensively studied toy bilinear example (Goodfellow, 2016) and show that: (i) lookahead allows the otherwise diverging GDA to converge on the classical bilinear game in the full-batch setting (see §4.2.1), and (ii) it yields good performance on challenging stochastic variants of this game, despite the high variance (see §4.2.2).
- We empirically benchmark Lookahead minmax for GAN training on four standard datasets (MNIST, CIFAR-10, SVHN, and ImageNet), on two different models (DCGAN & ResNet), combined with the standard optimization methods for GANs, GDA and extragradient, called LA-AltGAN and LA-ExtraGradient, resp. We consistently observe both stability and performance improvements at a negligible additional cost that requires no additional forward and backward passes, see §5.

2 BACKGROUND

GAN formulation. Given the data distribution p_d, the generator is a mapping G : z ↦ x, where z is sampled from a known distribution z ∼ p_z and ideally x ∼ p_d. The discriminator D : x ↦ D(x) ∈ [0, 1] is a binary classifier whose output represents a conditional probability estimate that an x sampled from a balanced mixture of real data from p_d and G-generated data is actually real. The optimization of a GAN is formulated as a differentiable two-player game where the generator G with parameters θ and the discriminator D with parameters ϕ aim at minimizing their own cost functions L^θ and L^ϕ, respectively, as follows:

$$\theta^\star \in \arg\min_{\theta\in\Theta}\mathcal{L}^\theta(\theta,\phi^\star)\quad\text{and}\quad \phi^\star \in \arg\min_{\phi\in\Phi}\mathcal{L}^\phi(\theta^\star,\phi)\,. \tag{2P-G}$$

When L^ϕ = −L^θ the game is called zero-sum and equation 2P-G is a minmax problem.

Minmax optimization methods. As GDA does not converge for some simple convex-concave games, Korpelevich (1976) proposed the extragradient method, where a prediction step is performed to obtain an extrapolated point (θ_{t+1/2}, ϕ_{t+1/2}) using GDA, and the gradients at the extrapolated point are then applied to the current iterate (θ_t, ϕ_t), as follows:

$$\text{Extrapolation:}\;\begin{cases}\theta_{t+\frac12}=\theta_t-\eta\nabla_\theta\mathcal{L}^\theta(\theta_t,\phi_t)\\ \phi_{t+\frac12}=\phi_t-\eta\nabla_\phi\mathcal{L}^\phi(\theta_t,\phi_t)\end{cases}\qquad \text{Update:}\;\begin{cases}\theta_{t+1}=\theta_t-\eta\nabla_\theta\mathcal{L}^\theta(\theta_{t+\frac12},\phi_{t+\frac12})\\ \phi_{t+1}=\phi_t-\eta\nabla_\phi\mathcal{L}^\phi(\theta_{t+\frac12},\phi_{t+\frac12})\end{cases} \tag{EG}$$

where η denotes the step size. In the context of zero-sum games, the extragradient method converges for any convex-concave function L and any closed convex sets Θ and Φ (Facchinei & Pang, 2003).

The joint vector field. Mescheder et al. (2017) and Balduzzi et al. (2018) argue that the vector field obtained by concatenating the gradients of the two players gives more insight into the dynamics than studying the loss surface. The joint vector field (JVF) and the Jacobian of JVF are defined as:

$$v(\theta,\phi)=\begin{pmatrix}\nabla_\theta\mathcal{L}^\theta(\theta,\phi)\\ \nabla_\phi\mathcal{L}^\phi(\theta,\phi)\end{pmatrix},\quad\text{and}\quad v'(\theta,\phi)=\begin{pmatrix}\nabla^2_\theta\mathcal{L}^\theta(\theta,\phi)&\nabla_\phi\nabla_\theta\mathcal{L}^\theta(\theta,\phi)\\ \nabla_\theta\nabla_\phi\mathcal{L}^\phi(\theta,\phi)&\nabla^2_\phi\mathcal{L}^\phi(\theta,\phi)\end{pmatrix},\;\text{resp.} \tag{JVF}$$

Rotational component of the game vector field. Berard et al. (2020) show empirically that GANs converge to a locally stable stationary point (Verhulst, 1990, LSSP) that is not a differential Nash equilibrium, the latter defined as a point where the norm of the Jacobian is zero and where the Hessians of both players are positive definite, see §C. An LSSP is defined as a point (θ*, ϕ*) where:

$$v(\theta^\star,\phi^\star)=0,\quad\text{and}\quad \Re(\lambda)>0,\;\forall\lambda\in\operatorname{Sp}\!\big(v'(\theta^\star,\phi^\star)\big)\,, \tag{LSSP}$$

where Sp(·) denotes the spectrum of v'(·) and ℜ(·) the real part. In summary, (i) if all the eigenvalues of v'(θ_t, ϕ_t) have positive real part, the point (θ_t, ϕ_t) is an LSSP, and (ii) if the eigenvalues of v'(θ_t, ϕ_t) have a nonzero imaginary part, the dynamics of the game exhibit rotations.

Impact of noise due to the stochastic gradient estimates on games. Chavdarova et al. (2019) point out that, relative to minimization, noise impedes game optimization more, and show that there exists a class of zero-sum games for which the stochastic extragradient method diverges. Intuitively, bounded noise of the stochastic gradient hurts convergence because, with higher probability, the noisy gradient points in a direction that makes the algorithm diverge from the equilibrium, due to the properties of v'(·) (see Fig. 1, Chavdarova et al., 2019).
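To make the extragradient update concrete, the following is a minimal NumPy sketch of equation EG on a generic zero-sum game; the function names, the bilinear toy loss, and the step size are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def extragradient_step(theta, phi, grad_theta, grad_phi, eta):
    """One EG step for min_theta max_phi L(theta, phi):
    grad_theta/grad_phi return the gradients of L w.r.t. each player."""
    # Extrapolation: a plain GDA step to a predicted point.
    theta_half = theta - eta * grad_theta(theta, phi)
    phi_half = phi + eta * grad_phi(theta, phi)
    # Update: apply the gradients *evaluated at the extrapolated point*
    # to the original iterate (theta, phi).
    theta_new = theta - eta * grad_theta(theta_half, phi_half)
    phi_new = phi + eta * grad_phi(theta_half, phi_half)
    return theta_new, phi_new

# Toy usage on the bilinear game L(theta, phi) = theta @ phi,
# where plain GDA diverges while EG converges.
g_theta = lambda th, ph: ph   # dL/dtheta
g_phi = lambda th, ph: th     # dL/dphi
theta, phi = np.array([1.0]), np.array([1.0])
for _ in range(200):
    theta, phi = extragradient_step(theta, phi, g_theta, g_phi, eta=0.3)
print(np.linalg.norm(theta), np.linalg.norm(phi))  # both shrink toward 0
```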
3 LOOKAHEAD FOR SINGLE OBJECTIVE

In the context of single-objective minimization, Zhang et al. (2019) recently proposed the Lookahead algorithm where at every step t: (i) a copy of the current iterate ω_t is made, ω̃_t ← ω_t; (ii) ω̃_t is then updated k ≥ 1 times, yielding ω̃_{t+k}; and finally (iii) the actual update ω_{t+1} is obtained as a point that lies on the line between the current iterate ω_t and the predicted one, ω̃_{t+k}:

$$\omega_{t+1}\leftarrow\omega_t+\alpha(\tilde\omega_{t+k}-\omega_t),\quad\text{where }\alpha\in[0,1]\,. \tag{LA}$$

Algorithm 1: General Lookahead Minmax pseudocode.

1: Input: stopping time T, learning rates η_θ, η_ϕ, initial weights θ_0, ϕ_0, lookahead hyperparameters k and α, losses L^θ, L^ϕ, base-optimizer updates d^θ(·), d^ϕ(·) defined in §4, real and noise data distributions p_d and p_z, resp.
2: θ̃_{0,0}, ϕ̃_{0,0} ← θ_0, ϕ_0
3: for t ∈ 0, …, T−1 do
4:   for i ∈ 0, …, k−1 do
5:     Sample x, z ∼ p_d, p_z
6:     ϕ̃_{t,i+1} = ϕ̃_{t,i} − η_ϕ d^ϕ_{t,i}(θ̃_{t,i}, ϕ̃_{t,i}, x, z)
7:     Sample z ∼ p_z
8:     θ̃_{t,i+1} = θ̃_{t,i} − η_θ d^θ_{t,i}(θ̃_{t,i}, ϕ̃_{t,i}, z)
9:   end for
10:  ϕ_{t+1} = ϕ_t + α_ϕ(ϕ̃_{t,k} − ϕ_t)
11:  θ_{t+1} = θ_t + α_θ(θ̃_{t,k} − θ_t)
12:  θ̃_{t+1,0}, ϕ̃_{t+1,0} ← θ_{t+1}, ϕ_{t+1}
13: end for
14: Output: θ_T, ϕ_T

Figure 3: Distance to the optimum on equation SB-G using different full-batch methods (ExtraGrad, LA-ExtraGrad, Unroll-Y, Unroll-XY, OGDA, LA-OGDA, and LA-GDA with α = 0.5 and α = 0.4), averaged over 5 runs. Unless otherwise specified, the learning rate is η = 0.3. See §4.2.1.

Lookahead uses two additional hyperparameters: (i) k, the number of steps used to obtain the prediction ω̃_{t+k}, and (ii) α, which controls how large a step we make towards the predicted iterate ω̃: the larger α, the closer to it, and for α = 1, equation LA is equivalent to regular optimization (the LA step has no effect). Despite the extra hyperparameters, LA was shown to make the inner optimizer more resilient to the choice of its own hyperparameters, to achieve faster convergence across different tasks, and to reduce the variance of the gradient estimates (Zhang et al., 2019).

Theoretical analysis. Zhang et al. (2019) study LA on quadratic functions, and Wang et al. (2020) recently provided an analysis for general smooth non-convex functions. One of their main observations is that LA can be viewed as an instance of local SGD (or parallel SGD; Stich, 2019; Koloskova et al., 2020; Woodworth et al., 2020b), which allows us to further tighten prior results.

Theorem 1. Let f : R^d → R be L-smooth (possibly non-convex) and assume access to unbiased stochastic gradients with σ²-bounded variance. Then the LA optimizer with hyperparameters (k, α) converges to a stationary point, E‖∇f(ω_out)‖² ≤ ε (for the proof refer to Appendix A), after at most

$$O\!\left(\frac{\sigma^2}{\varepsilon^2}+\frac{\sigma\sqrt{k-1}}{\varepsilon^{3/2}}+\frac{k}{\varepsilon}\right)$$

iterations. Here ω_out denotes an iterate of LA chosen uniformly at random.

Remark 1. When f is in addition quadratic, the complexity estimate improves to O(σ²/ε² + 1/ε). The asymptotically most significant term, O(σ²/ε²), matches the corresponding term in the SGD convergence rate for all choices of α ∈ (0, 1], and when α → 1, the same convergence guarantees as for SGD can be attained. When σ² = 0 the rate improves to O(1/ε), in contrast to O(1/ε²) in (Wang et al., 2020).¹ For small values of α, the worst-case complexity estimates of LA can in general be k times worse than for SGD (except for quadratic functions, where the rates match). Deriving tighter analyses for LA that corroborate the observed practical advantages remains an open problem.

¹ Our results rely on an optimally tuned stepsize γ for every choice of α. Wang et al. (2020) state a slightly different observation but keep the stepsize γ fixed for different choices of α.
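As an illustration of equation LA, here is a minimal sketch with SGD as the inner optimizer; the quadratic objective, the noise model, and all hyperparameter values are placeholder assumptions rather than any setting used in the paper.

```python
import numpy as np

def lookahead_sgd(omega0, grad, T, k, alpha, gamma):
    """Lookahead (Zhang et al., 2019): k fast SGD steps, then move the
    slow weights a fraction alpha toward the fast weights (equation LA)."""
    omega = omega0.copy()                 # slow weights
    for _ in range(T):
        fast = omega.copy()               # (i) copy the current iterate
        for _ in range(k):                # (ii) k inner (fast-weight) steps
            fast -= gamma * grad(fast)
        omega += alpha * (fast - omega)   # (iii) slow-weight update (LA)
    return omega

# Toy usage on f(w) = 0.5 * ||w||^2 with additive gradient noise.
rng = np.random.default_rng(0)
noisy_grad = lambda w: w + 0.1 * rng.standard_normal(w.shape)
w = lookahead_sgd(np.ones(5), noisy_grad, T=100, k=5, alpha=0.5, gamma=0.1)
print(np.linalg.norm(w))  # close to 0, up to the noise floor
```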
4 LOOKAHEAD FOR MINMAX OBJECTIVES

We now study the LA optimizer in the context of minmax optimization. We start with an illustrative example, the bilinear game min_θ max_ϕ θ^⊤ϕ (see Fig. 1).

Observation 1: Gradient descent ascent always diverges. The iterates of GDA are:

$$\omega_{t+1}=\begin{pmatrix}\theta_{t+1}\\ \phi_{t+1}\end{pmatrix}=\begin{pmatrix}\theta_t-\eta\phi_t\\ \phi_t+\eta\theta_t\end{pmatrix}\,.$$

The norm of the iterates, ‖ω_{t+1}‖² = (1 + η²)‖ω_t‖², increases for any stepsize η > 0; hence GDA diverges for all choices of η. In this bilinear game, taking a point on the line between two points of a cyclic trajectory would reduce the distance to the optimum, as illustrated in Fig. 1, which motivates extending the Lookahead algorithm to games while applying equation LA in the joint parameter space ω ≜ (θ, ϕ).

Observation 2: Lookahead can converge. The iterates of LA with k = 2 steps of GDA are:

$$\omega_{t+1}=\begin{pmatrix}\theta_{t+1}\\ \phi_{t+1}\end{pmatrix}=\begin{pmatrix}\theta_t\\ \phi_t\end{pmatrix}+\alpha\eta\begin{pmatrix}-2\phi_t-\eta\theta_t\\ 2\theta_t-\eta\phi_t\end{pmatrix}\,.$$

We can again compute the norm: ‖ω_{t+1}‖² = ((1 − η²α)² + 4η²α²)‖ω_t‖². For the choice α ∈ (0, 2/(η² + 4)) the norm strictly decreases, hence the algorithm converges linearly.
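Both observations are easy to verify numerically. A minimal sketch, assuming η = 0.3 and α = 0.4 (recall that contraction requires α ∈ (0, 2/(η² + 4)) ≈ (0, 0.489) here, so α = 0.5 would not contract for this η):

```python
eta, alpha = 0.3, 0.4

def gda(th, ph):  # one simultaneous GDA step on L = theta * phi
    return th - eta * ph, ph + eta * th

# Observation 1: per-step squared-norm ratio of GDA equals 1 + eta^2.
t1, p1 = gda(1.0, 1.0)
print((t1**2 + p1**2) / 2.0, 1 + eta**2)            # both print 1.09

# Observation 2: LA with k = 2 backtracks toward the 2-step prediction.
t2, p2 = gda(t1, p1)
t_la = 1.0 + alpha * (t2 - 1.0)
p_la = 1.0 + alpha * (p2 - 1.0)
print((t_la**2 + p_la**2) / 2.0,                     # numeric ratio ...
      (1 - alpha * eta**2)**2 + 4 * alpha**2 * eta**2)  # ... matches formula, < 1
```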
4.1 THE GENERAL LOOKAHEAD-MINMAX ALGORITHM AND ITS CONVERGENCE

Alg. 1 summarizes the proposed Lookahead minmax algorithm; for clarity it does not cover different update ratios for the two players, see Alg. 3 for that case. At every step t = 1, …, T, we first compute the k-step predicted iterates (θ̃_{t,k}, ϕ̃_{t,k}), also called fast weights, using an inner base optimizer whose updates for the two players are denoted by d^θ and d^ϕ. For GDA and EG we have:

$$\text{GDA:}\;\begin{cases}d^\phi_{t,i}=\nabla_\phi\mathcal{L}^\phi(\tilde\theta_{t,i},\tilde\phi_{t,i},x,z)\\ d^\theta_{t,i}=\nabla_\theta\mathcal{L}^\theta(\tilde\theta_{t,i},\tilde\phi_{t,i},z)\end{cases}\qquad \text{EG:}\;\begin{cases}d^\phi_{t,i}=\nabla_\phi\mathcal{L}^\phi(\tilde\theta_{t,i+\frac12},\tilde\phi_{t,i+\frac12},x,z)\\ d^\theta_{t,i}=\nabla_\theta\mathcal{L}^\theta(\tilde\theta_{t,i+\frac12},\tilde\phi_{t,i+\frac12},z)\end{cases}$$

where for the latter, (θ̃_{t,i+1/2}, ϕ̃_{t,i+1/2}) is computed using equation EG. The above updates can be combined with Adam (Kingma & Ba, 2015); see §G for detailed descriptions of these algorithms. The so-called slow weights (θ_{t+1}, ϕ_{t+1}) are then obtained as a point on the line between (θ_t, ϕ_t) and (θ̃_{t,k}, ϕ̃_{t,k}), i.e. equation LA, see lines 10 & 11 (herein called the backtracking or LA step). When combined with GDA, Alg. 1 can be equivalently rewritten as Alg. 3; note that it differs from applying Lookahead separately per player, see §G.2. Moreover, Alg. 1 can be applied several times over, using nested LA steps with (different) k, see Alg. 6.

Local convergence. We analyze the game dynamics as a discrete dynamical system, as in (Wang et al., 2020; Gidel et al., 2019b). Let F(ω) denote the operator that an optimization method applies to the iterates ω = (θ, ϕ), i.e. ω_t = F ∘ ⋯ ∘ F(ω_0). For example, for GDA we have F(ω) = ω − ηv(ω). A point ω* is called a fixed point if F(ω*) = ω*, and it is stable if the spectral radius ρ(∇F(ω*)) ≤ 1, where for GDA, for example, ∇F(ω*) = I − ηv'(ω*). In the following, we show that Lookahead minmax combined with a base optimizer which converges to a stable fixed point also converges to the same fixed point. On the other hand, its convergence when combined with a diverging base optimizer depends on the choice of its hyperparameters. Developing a practical algorithm for finding an optimal set of hyperparameters in realistic setups is out of the scope of this work; in §5 we show empirically that Alg. 1 outperforms its base optimizer for all tested hyperparameters.

Theorem 2 (Proposition 4.4.1, Bertsekas, 1999). If the spectral radius ρ(∇F(ω*)) < 1, then ω* is a point of attraction, and for ω_0 in a sufficiently small neighborhood of ω*, the distance of ω_t to the stationary point ω* converges at a linear rate.

The above theorem guarantees local linear convergence to a fixed point for a particular class of operators. The following general theorem (proof in §B) further shows that for any such base optimizer (of the same operator class) which converges linearly to the solution, Lookahead minmax converges too, which includes Proximal Point, EG and OGDA, but it does not give precise rates, left for future work.

Theorem 3 (Convergence of Lookahead minmax, given a converging base optimizer). If the spectral radius of the Jacobian of the operator of the base optimizer satisfies ρ(∇F_base(ω*)) < 1, then for ω_0 in a neighborhood of ω*, the iterates ω_t of Lookahead minmax converge to ω* as t → ∞.

Re-visiting the above example. For the GDA operator on the bilinear game we have ρ(∇F_GDA(ω*)) = |λ| = √(1+η²) > 1, and λ = re^{iθ} ∈ C with r > 1. Applying it k times thus results in eigenvalues {λ^{(t)}}_{t=1}^k = re^{iθ}, …, r^k e^{ikθ}, which are diverging and rotating in C. The spectral radius of LA-GDA with k = 2 is then |1 − α + αλ²| (see §B), where α selects a particular point that lies between 1 + 0i and λ² in C, allowing for a reduction of the spectral radius of the GDA base operator.
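For concreteness, a minimal NumPy sketch of Alg. 1 with GDA as the base optimizer follows; the function names are ours, the per-player losses L^θ, L^ϕ are supplied as gradient callables (as in the 2P-G formulation, both are minimized), and the bilinear usage at the bottom is a stand-in for a real GAN setup.

```python
import numpy as np

def lookahead_minmax(theta, phi, g_theta, g_phi, T, k, alpha, eta):
    """Sketch of Alg. 1 with GDA as base optimizer: the backtracking
    (LA) step on lines 10-11 is applied jointly to both players."""
    theta, phi = theta.copy(), phi.copy()
    for _ in range(T):
        f_theta, f_phi = theta.copy(), phi.copy()        # fast weights
        for _ in range(k):
            # As written in Alg. 1, both fast updates use the iterate
            # from the start of inner step i (alternating variants differ).
            new_phi = f_phi - eta * g_phi(f_theta, f_phi)    # line 6
            new_theta = f_theta - eta * g_theta(f_theta, f_phi)  # line 8
            f_theta, f_phi = new_theta, new_phi
        phi = phi + alpha * (f_phi - phi)                # line 10 (LA step)
        theta = theta + alpha * (f_theta - theta)        # line 11 (LA step)
    return theta, phi

# Usage on the bilinear game: L^theta = theta*phi, L^phi = -theta*phi.
g_theta = lambda th, ph: ph       # grad of L^theta
g_phi = lambda th, ph: -th        # grad of L^phi
th, ph = lookahead_minmax(np.ones(1), np.ones(1), g_theta, g_phi,
                          T=200, k=5, alpha=0.5, eta=0.3)
print(np.linalg.norm(th), np.linalg.norm(ph))  # converges toward (0, 0)
```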
4.2 MOTIVATING EXAMPLE: THE BILINEAR GAME

We argue that Lookahead minmax allows for improved stability and performance on minmax problems for two main reasons: (i) it handles rotations well, and (ii) it reduces noise by making more conservative steps. In the following we disentangle the two: in §4.2.1 we show that Lookahead minmax converges fast in the full-batch setting, without the presence of noise, as each parameter update uses the full dataset; in §4.2.2 we consider the challenging problem of Chavdarova et al. (2019), designed to have high variance. We show that, besides the therein-proposed Stochastic Variance Reduced Extragradient (SVRE), Lookahead minmax is the only method that converges on this experiment, considering all methods of Gidel et al. (2019a, §7.1). More precisely, we consider the following bilinear problem:

$$\min_{\theta\in\mathbb{R}^d}\max_{\phi\in\mathbb{R}^d}\mathcal{L}(\theta,\phi)=\min_{\theta\in\mathbb{R}^d}\max_{\phi\in\mathbb{R}^d}\frac{1}{n}\sum_{i=1}^{n}\big(\theta^\top b_i+\theta^\top A_i\phi+c_i^\top\phi\big)\,, \tag{SB-G}$$

with θ, ϕ, b_i, c_i ∈ R^d and A ∈ R^{n×d×d}, n = d = 100, and we draw [A_i]_{kl} = δ_{kli} and [b_i]_k, [c_i]_k ∼ N(0, 1/d), 1 ≤ k, l ≤ d, where δ_{kli} = 1 if k = l = i, and 0 otherwise. As one pass we count a forward and a backward pass (§D.5), and all results are normalized. See §D.1 for hyperparameters.

4.2.1 THE FULL-BATCH SETTING

In Fig. 3 we compare: (i) GDA with learning rates η = 10⁻⁴ and η = 0.3 (in blue), which oscillates around the optimum with a small enough step size, and diverges otherwise; (ii) Unroll-Y, where the max-player is unrolled for k steps before updating the min-player, as in (Metz et al., 2017); (iii) Unroll-XY, where both players are unrolled for k steps with a fixed opponent, and the actual updates are done with an unrolled opponent (see §D); (iv) LA-GDA with α = 0.5 and α = 0.4 (in red and pink, resp.), which combines Alg. 1 with GDA; (v) ExtraGradient, equation EG; (vi) LA-ExtraGrad, which combines Alg. 1 with ExtraGradient; (vii) OGDA, Optimistic GDA (Rakhlin & Sridharan, 2013); and (viii) LA-OGDA, which combines Alg. 1 with OGDA. See §D for descriptions of the algorithms and their implementation. Interestingly, we observe that Lookahead minmax allows GDA to converge on this example, and moreover speeds up the convergence of ExtraGradient.

4.2.2 THE STOCHASTIC SETTING

In this section we show that, besides SVRE, Lookahead minmax also converges on equation SB-G. In addition, we test all the methods of Gidel et al. (2019a, §7.1) using minibatches of sizes B = 1, 16, 64, sampling without replacement. In particular, we tested: (i) the Adam method combined with GDA (shown in blue); (ii) ExtraGradient, equation EG; (iii) Extra-Adam, proposed by Gidel et al. (2019a); (iv) our proposed LA-GDA (Alg. 1) combined with GDA; and (v) SVRE (Chavdarova et al., 2019, Alg. 1), for completeness. Fig. 4 depicts our results; see §D for details on the implementation and the choice of hyperparameters. We observe that besides the good performance of LA-GDA on games in the full-batch setting, it also copes well with the large variance of the gradient estimates, and it converges without restarting.

Figure 4: Convergence of Adam, Extra-Adam, ExtraGradient, SVRE and LA-GDA on equation SB-G, for several minibatch sizes B, averaged over 5 runs with random initialization of both the parameters and the data points (A, b and c). Fast (ϕ̃, θ̃) and slow (ϕ, θ) weights of LA-GDA are shown with solid and dashed lines, resp.
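For reference, equation SB-G can be instantiated in a few lines, following the sampling described above; the helper names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = d = 100
# [A_i]_{kl} = delta_{kli}: A_i has a single 1 at position (i, i).
A = np.zeros((n, d, d))
for i in range(n):
    A[i, i, i] = 1.0
# [b_i]_k, [c_i]_k ~ N(0, 1/d), i.e. standard deviation sqrt(1/d).
b = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))
c = rng.normal(0.0, np.sqrt(1.0 / d), size=(n, d))

def loss(theta, phi, idx):
    """Mini-batch value of equation SB-G over the samples in idx."""
    return np.mean([theta @ b[i] + theta @ A[i] @ phi + c[i] @ phi
                    for i in idx])

def grads(theta, phi, idx):
    """Stochastic gradients of SB-G w.r.t. theta and phi."""
    g_theta = np.mean([b[i] + A[i] @ phi for i in idx], axis=0)
    g_phi = np.mean([A[i].T @ theta + c[i] for i in idx], axis=0)
    return g_theta, g_phi

theta, phi = rng.standard_normal(d), rng.standard_normal(d)
batch = rng.choice(n, size=16, replace=False)   # e.g. B = 16
print(loss(theta, phi, batch))
```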
| Dataset / method | FID, no avg | FID, EMA | FID, EMA-slow | IS, no avg | IS, EMA | IS, EMA-slow |
|---|---|---|---|---|---|---|
| **ImageNet (32×32)** | | | | | | |
| AltGAN | 15.63 ± .46 | 14.16 ± .27 | — | 7.23 ± .13 | 7.81 ± .07 | — |
| LA-AltGAN | 14.37 ± .20 | 13.06 ± .20 | 12.53 ± .06 | 7.58 ± .07 | 7.97 ± .11 | 8.42 ± .11 |
| NLA-AltGAN | 13.14 ± .25 | 13.07 ± .25 | 12.71 ± .13 | 7.85 ± .05 | 7.87 ± .01 | 8.10 ± .05 |
| ExtraGrad | 15.48 ± .44 | 14.15 ± .63 | — | 7.31 ± .06 | 7.85 ± .10 | — |
| LA-ExtraGrad | 14.53 ± .27 | 14.13 ± .23 | 14.09 ± .28 | 7.62 ± .06 | 7.70 ± .07 | 7.89 ± .04 |
| NLA-ExtraGrad | 15.05 ± .96 | 14.79 ± .93 | 13.88 ± .45 | 7.39 ± .12 | 7.48 ± .12 | 7.76 ± .15 |
| **CIFAR-10** | | | | | | |
| AltGAN | 21.37 ± 1.60 | 16.92 ± 1.16 | — | 7.41 ± .16 | 8.03 ± .13 | — |
| LA-AltGAN | 16.74 ± .46 | 13.98 ± .47 | 12.67 ± .57 | 8.05 ± .43 | 8.19 ± .05 | 8.55 ± .04 |
| ExtraGrad | 18.49 ± .99 | 15.47 ± 1.82 | — | 7.61 ± .07 | 8.05 ± .09 | — |
| LA-ExtraGrad | 15.25 ± .30 | 14.68 ± .30 | 13.39 ± .23 | 7.99 ± .03 | 8.04 ± .04 | 8.40 ± .05 |
| Unrolled-GAN | 21.04 ± 1.08 | 17.51 ± 1.08 | — | 7.43 ± .07 | 7.88 ± .12 | — |
| **SVHN** | | | | | | |
| AltGAN | 7.84 ± 1.21 | 6.83 ± 2.88 | — | 3.10 ± .09 | 3.19 ± .09 | — |
| LA-AltGAN | 3.87 ± .09 | 3.28 ± .09 | 3.21 ± .19 | 3.16 ± .02 | 3.22 ± .08 | 3.30 ± .07 |
| ExtraGrad | 4.08 ± .11 | 3.22 ± .09 | — | 3.21 ± .02 | 3.16 ± .02 | — |
| LA-ExtraGrad | 3.20 ± .09 | 3.16 ± .14 | 3.15 ± .31 | 3.20 ± .02 | 3.19 ± .03 | 3.20 ± .04 |

Table 1: Comparison of LA-GAN with its respective baselines AltGAN and ExtraGrad (see §5.1 for naming), using the Fréchet Inception distance (FID, lower is better) and the Inception score (IS, higher is better), reporting best obtained scores. EMA denotes exponential moving average, see §F. All methods use Adam; see §G for detailed descriptions. Results are averaged over 5 runs; each experiment is run for 500K iterations. See §H and §5.2 for details on architectures and hyperparameters, and for a discussion of the results, resp. Our overall best obtained FID scores are 12.19 on CIFAR-10 and 2.823 on SVHN; see §I for samples of these generators. Best scores obtained for each metric and dataset are highlighted in yellow; for each column the best score is in bold, along with any score within its standard deviation reach.

| | SNGAN (Miyato et al.) | Prog.GAN (Karras et al.) | NCSN (Song & Ermon) | WS-SVRE (Chavdarova et al.) | Extra-Adam (Gidel et al.) | LA-AltGAN (ours) | SNGAN, cond. (Miyato et al.) | BigGAN, cond. (Brock et al.) |
|---|---|---|---|---|---|---|---|---|
| FID | 21.7 | — | 25.32 | 16.77 | 16.78 ± .21 | 12.19 | 25.5 | 14.73 |
| IS | 8.22 | 8.80 ± .05 | 8.87 | — | 8.47 ± .10 | 8.78 | 8.60 | 9.22 |

Table 2: Summary of the recently reported best scores on CIFAR-10 and benchmark with LA-GAN, using published results. The first six methods are unconditional GANs; the last two are conditional. Note that the architectures are not identical for all methods, see §5.2.

5 EXPERIMENTS

In this section we empirically benchmark Lookahead minmax (Alg. 1) for training GANs. For the purpose of fair comparison, we count each update of both players as one iteration, see Alg. 3.

5.1 EXPERIMENTAL SETUP

Datasets. We used the following image datasets: (i) MNIST (Lecun & Cortes, 1998), (ii) CIFAR-10 (Krizhevsky, 2009, §3), (iii) SVHN (Netzer et al., 2011), and (iv) ImageNet ILSVRC 2012 (Russakovsky et al., 2015), using a resolution of 28×28 for MNIST and 3×32×32 for the rest.

Metrics. We use the Inception score (IS, Salimans et al., 2016) and the Fréchet Inception distance (FID, Heusel et al., 2017), as these are the most commonly used metrics for image synthesis; see §H.1 for details. On datasets other than ImageNet, IS is less consistent with sample quality (see §H.1.1).

DNN architectures. For experiments on MNIST we used the DCGAN architectures (Radford et al., 2016), described in §H.2.1. For SVHN and CIFAR-10 we used the ResNet architectures, replicating the setup in (Miyato et al., 2018; Chavdarova et al., 2019), described in detail in §H.2.2.

Optimization methods. We use: (i) AltGAN, the standard alternating GAN; (ii) ExtraGrad, the extragradient method; and (iii) Unrolled-GAN (Metz et al., 2017). We combine Lookahead minmax with (i) and (ii), and refer to these as LA-AltGAN and LA-ExtraGrad, respectively, or jointly as LA-GAN for brevity. We denote nested LA-GAN with the prefix NLA, see §G.1. All methods in this section use Adam (Kingma & Ba, 2015). We compute the Exponential Moving Average (EMA, see the definition in §F) of both the fast and the slow weights, called EMA and EMA-slow, resp. See the Appendix for results with uniform iterate averaging and RAdam (Liu et al., 2020).
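Since several columns of Table 1 rely on EMA, a minimal sketch of the shadow-weight update may help (see §F for the formal definition); β = 0.999 follows §D.3, while the stand-in optimizer step below is a placeholder.

```python
import numpy as np

def ema_update(shadow, params, beta=0.999):
    """Exponential moving average of the iterates: the shadow weights
    track the training iterates but do not influence training itself."""
    return beta * shadow + (1.0 - beta) * params

# Usage: update the shadow copy after every optimizer step. For the
# "EMA-slow" variant, apply the same update only after each LA step.
params = np.zeros(4)
shadow = params.copy()
for step in range(1000):
    params = params - 0.01 * (params - 1.0)   # stand-in optimizer step
    shadow = ema_update(shadow, params)
print(shadow)  # smoothed trajectory, lagging behind params
```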
5.2 RESULTS AND DISCUSSION

Comparison with baselines. Table 1 summarizes our comparison of combining Alg. 1 with AltGAN and ExtraGrad. On all datasets, we observe that the iterates (column "no avg") of LA-AltGAN and LA-ExtraGrad perform notably better than the corresponding baselines, and using EMA on LA-AltGAN and LA-ExtraGrad further improves the obtained FID and IS scores. As the performances improve after each LA step (see Fig. 2), computing EMA solely on the slow weights further improves the scores. It is interesting to note that on most of the datasets, the scores of the iterates of LA-GAN (column "no avg") outperform the EMA scores of the respective baselines. In some cases, EMA for AltGAN does not provide improvements, as the iterates diverge relatively early. In our baseline experiments, ExtraGrad outperforms AltGAN while requiring twice the computational time per iteration. The addition of Lookahead minmax stabilizes AltGAN, making it competitive with LA-ExtraGrad while using half of the computational time. In Table 3 we report our results on MNIST where, different from the rest of the datasets, the training of the baselines is stable, to gain insight into whether LA-GAN still yields improved performance. The best FID scores of the iterates (column "no avg") are obtained with LA-GAN. Interestingly, although we find that the last iterates are not LSSPs (which could be due to stochasticity), Fig. 5, which depicts the eigenvalues of the Jacobian of JVF, shows that after convergence LA-GAN exhibits no rotations.

Figure 5: Eigenvalues of v'(θ, ϕ) (real vs. imaginary part) at 100K iterations on MNIST, for AltGAN and LA-AltGAN, see §5.2.

| | no avg | EMA | EMA-slow |
|---|---|---|---|
| AltGAN | .094 ± .006 | .031 ± .002 | — |
| LA-AltGAN | .053 ± .004 | .029 ± .002 | .032 ± .002 |
| ExtraGrad | .094 ± .013 | .032 ± .003 | — |
| LA-ExtraGrad | .053 ± .005 | .032 ± .002 | .034 ± .001 |
| Unrolled-GAN | .077 ± .006 | .030 ± .002 | — |

Table 3: FID (lower is better) results on MNIST, averaged over 5 runs. Each experiment is trained for 100K iterations. Note that Unrolled-GAN is computationally more expensive, in the order of the ratio 4:22, as we used 20 steps of unrolling, which gave the best results. See §5.2 & §H for discussion and implementation details, resp.

Figure 6: Improved stability of LA-AltGAN relative to its baseline on (a) SVHN, (b) CIFAR-10 and (c) ImageNet (32×32), over 5 runs of 500K iterations. The median and the individual runs are illustrated with thicker solid lines and with transparent lines, respectively. See §5.2 for discussion.

Benchmark on CIFAR-10 using reported results. Table 2 summarizes the recently reported best FID and IS scores on CIFAR-10. Although using the class labels (conditional GAN) is known to improve GAN performance (Radford et al., 2016), we outperform BigGAN (Brock et al., 2019) on CIFAR-10. Notably, our model and BigGAN have 5.1M and 158.3M parameters in total, respectively, and we use a minibatch size of 128, whereas BigGAN uses 2048 samples. The competitive result of Yazıcı et al. (2019), an average FID of 12.56, uses a 3.5-times larger model than ours and is omitted from Table 2 as it does not use the standard metrics pipeline.

Additional memory & computational cost. Lookahead minmax requires the same extra memory footprint as EMA and uniform averaging (one additional set of parameters per player), both of which are updated at each step, whereas the LA-GAN slow weights are updated only once every k iterations.

On the choice of α and k. We fixed α = 0.5 and experimented with several values of k, keeping k fixed throughout training once selected. We observe that all values of k improve over the baseline. How large a k can be used depends largely on the stability of the base optimizer: if it quickly diverges, a smaller k is necessary to stabilize the training. We thus used k = 5 and k = 5000 for LA-AltGAN and LA-ExtraGrad, resp. When using larger values of k we noticed that the obtained scores would drastically improve every k iterations, right after an LA step, see Fig. 2. Interestingly, complementary to the results of Berard et al. (2020), who analyze the vector field locally, Fig. 2 confirms the presence of rotations while taking into account the used optimizer, and empirically validates the geometric justification of Lookahead minmax illustrated in Fig. 1. The necessity of a small k for unstable baselines motivates nested Lookahead minmax, using two different values of k, one relatively small (the inner LA step) and one larger (the outer LA step). Intuitively, the inner and outer LA steps in such nested LA-GAN tackle the variance and the rotations, resp.
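The nested variant can be sketched as two snapshot levels applied to the joint parameters; the following is an illustrative reading of the NLA idea (see Alg. 6), with placeholder values for k_inner, k_outer, and the base step, not the paper's exact implementation.

```python
import numpy as np

def nested_lookahead_minmax(omega, base_step, T, k_inner, k_outer, alpha):
    """Two nested LA levels on the joint parameters omega = (theta, phi):
    a small inner k to damp gradient noise, a large outer k for rotations."""
    slow_outer = omega.copy()
    slow_inner = omega.copy()
    for t in range(1, T + 1):
        omega = base_step(omega)                       # fast weights
        if t % k_inner == 0:                           # inner LA step
            slow_inner += alpha * (omega - slow_inner)
            omega = slow_inner.copy()
        if t % k_outer == 0:                           # outer LA step
            slow_outer += alpha * (omega - slow_outer)
            omega = slow_outer.copy()
            slow_inner = omega.copy()
    return omega

# Toy usage: joint GDA on the bilinear game, omega = (theta, phi).
base_step = lambda w: w - 0.3 * np.array([w[1], -w[0]])
w = nested_lookahead_minmax(np.ones(2), base_step, T=600,
                            k_inner=5, k_outer=100, alpha=0.5)
print(np.linalg.norm(w))  # contracts toward the equilibrium (0, 0)
```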
Stability of convergence. Fig. 6 shows that LA-GAN consistently improves the stability of its respective baseline. Despite using a 1:5 update ratio for G:D, known to improve stability, our baselines always diverge (as also reported by Chavdarova et al., 2019; Brock et al., 2019). On the other hand, LA-GAN diverged only a few times, and notably later in training relative to the same baseline; see additional results in §I. The stability further improves using nested LA steps, as in Fig. 2.

6 RELATED WORK

Parameter averaging. In the context of convex single-objective optimization, taking an arithmetic average of the parameters, as by Polyak & Juditsky (1992) and Ruppert (1988), is well known to yield faster convergence for convex functions and to allow the use of larger constant step sizes in the case of stochastic optimization (Dieuleveut et al., 2017). It recently gained more interest in deep learning in general (Garipov et al., 2018), in natural language processing (Merity et al., 2018), and particularly in GANs (Yazıcı et al., 2019), where researchers report the performance of a uniform or exponential moving average of the iterates. Such averaging, applied as post-processing after training, is fundamentally different from immediately applying averages during training. Lookahead, as of our interest here, is in spirit closer to extrapolation methods (Korpelevich, 1976), which rely on gradients taken not at the current iterate but at an extrapolated point of the current trajectory. For highly complex optimization landscapes such as in deep learning, using gradients at perturbations of the current iterate has a desirable smoothing effect which is known to help training speed and stability in the case of non-convex single-objective optimization (Wen et al., 2018; Haruki et al., 2019).

GANs. Several methods proposed for GANs are motivated by the recurrent dynamics. Apart from the already introduced works: (i) Yadav et al. (2018) use prediction steps; (ii) Daskalakis et al. (2018) propose Optimistic Mirror Descent (OMD); (iii) Balduzzi et al. (2018) propose the Symplectic Gradient Adjustment (SGA); (iv) Gidel et al. (2019b) propose negative momentum; and (v) Xu et al. (2020) propose a closed-loop method based on control theory, among others. Besides its simplicity, the key benefit of LA-GAN is that it handles well both the rotations of the vector field and the noise from stochasticity, thus performing well on real-world applications.

7 CONCLUSION

Motivated by the adversarial component of games and the negative impact of noise on games, we proposed an extension of the Lookahead algorithm to games, called Lookahead minmax. On the bilinear toy example we observe that combining Lookahead minmax with standard gradient methods converges, and that Lookahead minmax copes very well with the high variance of the gradient estimates. Exponential moving averaging of the iterates is known to help obtain improved performance for GANs, yet it does not impact the actual iterates, hence it does not stop the algorithm from (early) divergence. Lookahead minmax goes beyond such averaging, requires less computation than running averages, and is straightforward to implement. It can be applied to any optimization method, and in practice it consistently improves the stability of its respective baseline. While we do not aim at obtaining a new state-of-the-art result, it is remarkable that Lookahead minmax obtains a competitive result on CIFAR-10 of 12.19 FID, outperforming the 30-times-larger BigGAN.
As Lookahead minmax uses two additional hyperparameters, future directions include developing adaptive schemes for obtaining these coefficients throughout training, which could further speed up the convergence of Lookahead minmax.

ACKNOWLEDGMENTS

This work was in part done while TC and FF were affiliated with the Idiap research institute. TC was funded in part by grant 200021-169112 from the Swiss National Science Foundation, and MP was funded by grant 40516.1 from the Swiss Innovation Agency. TC would like to thank Hugo Berard for insightful discussions on the optimization landscape of GANs and for sharing their code, as well as David Balduzzi for insightful discussions on n-player differentiable games.

REFERENCES

David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In ICML, 2018.

Hugo Berard, Gauthier Gidel, Amjad Almahairi, Pascal Vincent, and Simon Lacoste-Julien. A closer look at the optimization landscapes of generative adversarial networks. In ICLR, 2020.

Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.

Ronald E. Bruck. On the weak convergence of an ergodic iteration for the solution of variational inequalities for monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 1977.

Tatjana Chavdarova, Gauthier Gidel, François Fleuret, and Simon Lacoste-Julien. Reducing noise in GAN training with variance reduced extragradient. In NeurIPS, 2019.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In ICLR, 2018.

Constantinos Daskalakis, Stratis Skoulakis, and Manolis Zampetakis. The complexity of constrained min-max optimization. arXiv:2009.09623, 2020.

Aaron Defazio and Léon Bottou. On the ineffectiveness of variance reduced optimization for deep learning. In NeurIPS, 2019.

Popov Leonid Denisovich. A modification of the Arrow-Hurwicz method for search of saddle points. Mathematical Notes of the Academy of Sciences of the USSR, 28(5):845-848, 1980.

Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. arXiv:1707.06386, 2017.

Francisco Facchinei and Jong-Shi Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems, Vol. I. Springer Series in Operations Research and Financial Engineering. Springer-Verlag, 2003.

Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. In NeurIPS, 2018.

Gauthier Gidel, Hugo Berard, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial nets. In ICLR, 2019a.

Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In AISTATS, 2019b.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv:1701.00160, 2016.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

Kosuke Haruki, Taiji Suzuki, Yohei Hamakawa, Takeshi Toda, Ryuji Sakai, Masahiro Ozawa, and Mitsuhiro Kimura. Gradient noise convolution (GNC): Smoothing loss function for distributed large-batch SGD. arXiv:1906.10822, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In ICML, 2020.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian U. Stich. A unified theory of decentralized SGD with changing topology and local updates. In ICML, 2020.

Galina Michailovna Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 1976.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, 2009.

Yann Lecun and Corinna Cortes. The MNIST database of handwritten digits. 1998. URL http://yann.lecun.com/exdb/mnist/.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In ICLR, 2020.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. In ICLR, 2018.

Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. In NIPS, 2017.

Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In ICML, 2018.

Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017.

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.

Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv:1901.08511, 2019.

Dov Monderer and Lloyd S. Shapley. Potential games. Games and Economic Behavior, 1996.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011. URL http://ufldl.stanford.edu/housenumbers/.

Shayegan Omidshafiei, Jason Pazis, Chris Amato, Jonathan P. How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In ICML, 2017.

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 1992. doi: 10.1137/0330046.
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In COLT, 2013.

David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211-252, 2015.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.

Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv:1811.03600, 2018.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, pp. 11895-11907, 2019.

Sebastian U. Stich. Local SGD converges fast and communicates little. In ICLR, 2019.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv:1512.00567, 2015.

Fiez Tanner, Chasnov Benjamin, and Ratliff Lillian. Implicit learning dynamics in Stackelberg games: Equilibria characterization, convergence analysis, and empirical study. In ICML, 2020.

Ferdinand Verhulst. Nonlinear Differential Equations and Dynamical Systems. Springer-Verlag Berlin Heidelberg, 1990.

J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1944.

J. Wang, V. Tantia, N. Ballas, and M. Rabbat. Lookahead converges to stationary points of smooth non-convex functions. In ICASSP, 2020.

Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. In ICLR, 2020.

Wei Wen, Yandan Wang, Feng Yan, Cong Xu, Chunpeng Wu, Yiran Chen, and Hai Li. SmoothOut: Smoothing out sharp minima to improve generalization in deep learning. arXiv:1805.07898, 2018.

Blake Woodworth, Kumar Kshitij Patel, and Nathan Srebro. Minibatch vs local SGD for heterogeneous distributed learning. arXiv, 2020a.

Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, and Nathan Srebro. Is local SGD better than minibatch SGD? In ICML, 2020b.

Kun Xu, Chongxuan Li, Huanshu Wei, Jun Zhu, and Bo Zhang. Understanding and stabilizing GANs' training dynamics with control theory. In ICML, 2020.

Abhay Yadav, Sohil Shah, Zheng Xu, David Jacobs, and Tom Goldstein. Stabilizing adversarial nets with prediction methods. In ICLR, 2018.

Yasin Yazıcı, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, and Vijay Chandrasekhar. The unusual effectiveness of averaging in GAN training. In ICLR, 2019.

Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E. Hinton. Lookahead optimizer: k steps forward, 1 step back. In NeurIPS, 2019.

A THEORETICAL ANALYSIS OF LOOKAHEAD FOR SINGLE-OBJECTIVE MINIMIZATION: PROOF OF THEOREM 1

In this section we give the theoretical justification for the convergence results claimed in Section 3, namely Theorem 1 and Remark 1.
We consider the LA optimizer with k SGD prediction steps, that is,

$$\tilde\omega_{t+k}=\tilde\omega_t-\gamma\sum_{i=1}^{k}g(\tilde\omega_{t+i-1})\,,$$

where g : X → R^d is a stochastic gradient estimator with E g(ω) = ∇f(ω) and bounded variance, E‖g(ω) − ∇f(ω)‖² ≤ σ². Further, we assume that f is L-smooth for a constant L ≥ 0 (but not necessarily convex). We count the total number of gradient evaluations K. As is standard, we consider as output of the algorithm a uniformly at random chosen iterate ω_out (we prove convergence of E‖∇f(ω_out)‖²). This matches the setting considered in (Wang et al., 2020).

Wang et al. (2020, Theorem 2) show that if the stepsize γ satisfies

$$\alpha\gamma L+(1-\alpha)^2\gamma^2L^2k(k-1)\le 1\,, \tag{1}$$

then the expected squared gradient norm of the LA iterates after T updates of the slow sequence, i.e. K = kT gradient evaluations, can be bounded as

$$\mathbb{E}\,\|\nabla f(\omega_{\mathrm{out}})\|^2=O\!\left(\frac{F_0}{\alpha\gamma K}+\alpha\gamma L\sigma^2+(1-\alpha)^2\gamma^2L^2\sigma^2(k-1)\right)\,, \tag{2}$$

where F_0 := f(ω_0) − f_inf denotes an upper bound on the optimality gap. Now, departing from Wang et al. (2020), we can directly derive the bound claimed in Theorem 1 by choosing γ so as to minimize equation 2, while respecting the constraints given in equation 1 (two necessary constraints are, e.g., γ² ≤ 1/((1−α)²L²k(k−1)) and γ ≤ 1/(αL)). For this minimization with respect to γ, see for instance (Koloskova et al., 2020, Lemma 15), where we plug in r_0 = F_0/α, b = αLσ², e = (1−α)²L²σ²(k−1) and d = min{1/(αL), 1/((1−α)Lk)}.

The improved result for the quadratic case follows from the observations in Woodworth et al. (2020b), who analyze local SGD on quadratic functions: they show that linear update algorithms (such as local SGD on quadratics) benefit from variance reduction, allowing them to recover the same convergence guarantees as single-threaded SGD. These observations also hold for non-convex quadratic functions.

We point out that the choice of stepsize γ = 1/√(kT) proposed in Theorem 2 of (Wang et al., 2020) does not necessarily satisfy their constraint given in equation 1 for small values of T. Whilst their conclusions remain valid when k is a constant (or in the limit T → ∞), the current analyses (including our slightly improved bound) do not show that LA can in general match the performance upper bounds of SGD when k is large. Whilst casting LA as an instance of local SGD was useful to derive the first general convergence guarantees, we believe, in view of recently derived lower bounds (Woodworth et al., 2020a) that show limitations on convex functions, that this approach is limited and that more carefully designed analyses are required to derive tighter results for LA in future works.

B PROOF OF THEOREM 3

Proof. Let F_base(ω) denote the operator that an optimization method applies to the iterates ω = (θ, ϕ), and let ω* be a stable fixed point of F_base, i.e. ρ(∇F_base(ω*)) < 1. Then the operator of the Lookahead minmax algorithm, F_LA (when combined with F_base(·)), is:

$$F_{\mathrm{LA}}(\omega)=\omega+\alpha\big((F_{\mathrm{base}})^k(\omega)-\omega\big)=(1-\alpha)\,\omega+\alpha (F_{\mathrm{base}})^k(\omega)\,,$$

with α ∈ [0, 1] and k ∈ N, k ≥ 2. Let λ denote the eigenvalue of J_base ≜ ∇F_base(ω*) with the largest modulus, i.e. ρ(∇F_base(ω*)) = |λ|, and let u be its associated eigenvector: J_base u = λu. Using the fixed-point property (F(ω*) = ω*), the Jacobian of Lookahead minmax at ω* is thus:

$$J_{\mathrm{LA}}=\nabla F_{\mathrm{LA}}(\omega^\star)=(1-\alpha)I+\alpha (J_{\mathrm{base}})^k\,.$$

By noticing that

$$J_{\mathrm{LA}}u=\big((1-\alpha)I+\alpha (J_{\mathrm{base}})^k\big)u=\big((1-\alpha)+\alpha\lambda^k\big)u\,,$$

we deduce that u is an eigenvector of J_LA with eigenvalue 1 − α + αλ^k. We know that ρ(J_base) = max{|λ_0^base|, …, |λ_n^base|} < 1 (by the assumption of this theorem).
To prove Theorem 3 by contradiction, let us assume that ρ(J_LA) ≥ 1, which implies that there exists an eigenvalue in the spectrum of J_LA such that:

$$\exists\,\lambda^{\mathrm{LA}}\in\operatorname{Spec}(J_{\mathrm{LA}})\;\text{s.t.}\;|\lambda^{\mathrm{LA}}|=|1-\alpha+\alpha(\lambda^{\mathrm{base}})^k|\ge 1\,. \tag{3}$$

As |λ^base| < 1, we have |λ^base|^k = |(λ^base)^k| < 1. Furthermore, the set of points {p ∈ C | α ∈ [0, 1], p = 1 − α + α(λ^base)^k} describes a segment in the complex plane between (λ^base)^k and 1 + 0i (the latter excluded from the segment). As both ends of the segment lie in the unit circle, it follows that the left-hand side of the inequality in equation 3 is strictly smaller than 1. By contradiction, this implies ρ(J_LA) < 1, hence Lookahead minmax converges linearly. Thus, if ω* is a stable fixed point for some inner optimizer whose operator satisfies our assumption, then ω* is a stable fixed point for Lookahead minmax as well. ∎
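The segment argument at the heart of the proof can be sanity-checked numerically: for any |λ_base| < 1, α ∈ (0, 1] and k ≥ 2, the eigenvalue 1 − α + αλ^k of J_LA stays strictly inside the unit disk. A small Monte-Carlo sketch (ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
worst = 0.0
for _ in range(100_000):
    r = rng.uniform(0.0, 1.0)                  # |lambda_base| < 1
    lam = r * np.exp(1j * rng.uniform(0.0, 2 * np.pi))
    alpha = rng.uniform(0.01, 1.0)             # alpha > 0 (alpha = 0 is identity)
    k = rng.integers(2, 20)
    lam_la = 1 - alpha + alpha * lam**k        # eigenvalue of J_LA
    worst = max(worst, abs(lam_la))
# The segment between 1 + 0i and lam**k lies in the unit disk, so:
print(worst)  # strictly below 1
```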
C SOLUTIONS OF A MIN-MAX GAME

In §2 we briefly mention the ideal convergence point of a two-player game and define the LSSP, needed to follow up our result in Fig. 5. For completeness, in this section we define Nash equilibria formally, and we list relevant works along this line which further question this notion of optimality in games. For simplicity, we focus here on the case where the two players aim at minimizing and maximizing the same value function (L^ϕ = −L^θ =: −L), called a zero-sum or minimax game. More precisely:

$$\min_{\theta\in\Theta}\max_{\phi\in\Phi}\mathcal{L}(\theta,\phi)\,, \tag{ZS-G}$$

where L : Θ × Φ → R.

Nash equilibria. For two-player games, ideally we would like to converge to a point called a Nash equilibrium (NE). In the context of game theory, an NE is a combination of strategies from which no player has an incentive to deviate unilaterally. More formally, a Nash equilibrium for continuous games is defined as a point (θ*, ϕ*) where:

$$\mathcal{L}(\theta^\star,\phi)\le\mathcal{L}(\theta^\star,\phi^\star)\le\mathcal{L}(\theta,\phi^\star),\quad\forall\theta\in\Theta,\;\forall\phi\in\Phi\,. \tag{NE}$$

Such points are (locally) optimal for both players with respect to their own decision variable, i.e. no player has an incentive to unilaterally deviate from them.

Differential Nash equilibria. In machine learning we are interested in differential games where L is twice differentiable, in which case an NE needs to satisfy slightly stronger conditions. A point (θ*, ϕ*) is a Differential Nash Equilibrium (DNE) of a zero-sum game iff:

$$\|\nabla_\theta\mathcal{L}(\theta^\star,\phi^\star)\|=\|\nabla_\phi\mathcal{L}(\theta^\star,\phi^\star)\|=0,\quad \nabla^2_\theta\mathcal{L}(\theta^\star,\phi^\star)\succ 0,\;\text{and}\;\nabla^2_\phi\mathcal{L}(\theta^\star,\phi^\star)\prec 0\,, \tag{DNE}$$

where A ≻ 0 and A ≺ 0 iff A is positive definite and negative definite, respectively. Note that the key difference between DNE and NE is that for a DNE, ∇²_θL(·) and ∇²_ϕL(·) are required to be definite (instead of semidefinite).

(Differential) Stackelberg equilibria. A recent line of work questions the choice of DNEs as the game solution for multiple machine learning applications, including GANs (Tanner et al., 2020; Jin et al., 2020; Wang et al., 2020). Namely, as GANs are optimized sequentially, the authors argue that a so-called Stackelberg game is more suitable, which consists of a leader and a follower. The leader can choose her action while knowing that the other player plays a best response:

$$\text{Leader:}\;\theta^\star=\arg\min_{\theta\in\Theta}\mathcal{L}(\theta,\phi^\star(\theta)),\;\text{with}\;\phi^\star(\theta)=\arg\max_{\phi\in\Phi}\mathcal{L}(\theta,\phi)\,;\qquad \text{Follower:}\;\phi^\star=\arg\max_{\phi\in\Phi}\mathcal{L}(\theta^\star,\phi)\,. \tag{Stackelberg ZS}$$

Briefly, such games converge to a so-called Differential Stackelberg Equilibrium (DSE), defined as a point (θ*, ϕ*) where:

$$\nabla_\theta\mathcal{L}(\theta^\star,\phi^\star)=0\;\;\wedge\;\;v(\theta^\star,\phi^\star)=0\,,\quad\text{and} \tag{4}$$

$$\big[\nabla^2_\theta\mathcal{L}-\nabla_\theta\nabla_\phi\mathcal{L}\,(\nabla^2_\phi\mathcal{L})^{-1}(\nabla_\theta\nabla_\phi\mathcal{L})^\top\big](\theta^\star,\phi^\star)\succ 0\,. \tag{DSE}$$

Remark. Berard et al. (2020) show that GANs do not converge to DNEs; however, these convergence points often have good performance for the generator. Recent works on DSEs give a game-theoretic interpretation of these solution concepts, which constitute a superset of DNEs. Moreover, while DNEs are not guaranteed to exist, approximate NEs are guaranteed to exist (Daskalakis et al., 2020). A more detailed discussion is out of the scope of this paper, and we encourage the interested reader to follow up on the given references.

D EXPERIMENTS ON THE BILINEAR EXAMPLE

In this section we list the details of our implementation of the experiments on the bilinear example of equation SB-G presented in §4.2. In particular: (i) in §D.1 we list the implementation details of the benchmarked algorithms; (ii) in §D.2 and §D.4 we list the hyperparameters used in §4.2.1 and §4.2.2, respectively; (iii) in §D.5 we describe the choice of the x-axis of the toy experiments presented in the paper; and finally (iv) in §D.6 we present visualizations that aim to improve the reader's intuition of how Lookahead minmax works on games.

D.1 IMPLEMENTATION DETAILS

Gradient Descent Ascent (GDA). We use an alternating implementation of GDA, where the players are updated sequentially, as follows:

$$\phi_{t+1}=\phi_t+\eta\nabla_\phi\mathcal{L}(\theta_t,\phi_t)\,,\qquad \theta_{t+1}=\theta_t-\eta\nabla_\theta\mathcal{L}(\theta_t,\phi_{t+1})\,. \tag{GDA}$$

ExtraGrad. Our implementation of extragradient follows equation EG, with L^θ(·) = −L^ϕ(·) =: L, thus:

$$\text{Extrapolation:}\;\begin{cases}\theta_{t+\frac12}=\theta_t-\eta\nabla_\theta\mathcal{L}(\theta_t,\phi_t)\\ \phi_{t+\frac12}=\phi_t+\eta\nabla_\phi\mathcal{L}(\theta_t,\phi_t)\end{cases}\qquad \text{Update:}\;\begin{cases}\theta_{t+1}=\theta_t-\eta\nabla_\theta\mathcal{L}(\theta_{t+\frac12},\phi_{t+\frac12})\\ \phi_{t+1}=\phi_t+\eta\nabla_\phi\mathcal{L}(\theta_{t+\frac12},\phi_{t+\frac12})\end{cases} \tag{EG-ZS}$$

OGDA. Optimistic Gradient Descent Ascent (Denisovich, 1980; Rakhlin & Sridharan, 2013; Daskalakis et al., 2018) has last-iterate convergence guarantees when the objective is linear for both the min and the max player. The update rule is as follows:

$$\begin{aligned}\theta_{t+1}&=\theta_t-2\eta\nabla_\theta\mathcal{L}(\theta_t,\phi_t)+\eta\nabla_\theta\mathcal{L}(\theta_{t-1},\phi_{t-1})\\ \phi_{t+1}&=\phi_t+2\eta\nabla_\phi\mathcal{L}(\theta_t,\phi_t)-\eta\nabla_\phi\mathcal{L}(\theta_{t-1},\phi_{t-1})\,.\end{aligned} \tag{OGDA}$$

Contrary to EG-ZS, OGDA requires only one gradient computation per parameter update, at the cost of storing the previously computed gradient at t−1. Interestingly, both EG-ZS and OGDA can be seen as approximations of the proximal point method for min-max problems; see (Mokhtari et al., 2019) for details.
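A minimal sketch of the OGDA update above, run on the bilinear game where it is known to converge for small η; the step size and the zero initialization of the stored gradients are placeholder choices of ours.

```python
import numpy as np

def ogda(theta, phi, g_theta, g_phi, eta, T):
    """Optimistic GDA: correct each step using the previously stored
    gradient, so only one fresh gradient evaluation is needed per update."""
    prev_gt = np.zeros_like(theta)   # stored gradients at t-1
    prev_gp = np.zeros_like(phi)
    for _ in range(T):
        gt, gp = g_theta(theta, phi), g_phi(theta, phi)
        theta = theta - 2 * eta * gt + eta * prev_gt   # descent for theta
        phi = phi + 2 * eta * gp - eta * prev_gp       # ascent for phi
        prev_gt, prev_gp = gt, gp
    return theta, phi

# Bilinear zero-sum game L(theta, phi) = theta * phi.
g_theta = lambda th, ph: ph
g_phi = lambda th, ph: th
th, ph = ogda(np.array([1.0]), np.array([1.0]), g_theta, g_phi,
              eta=0.3, T=500)
print(abs(th[0]), abs(ph[0]))  # both decay toward 0
```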
Unroll-Y. Unrolling was introduced by Metz et al. (2017) as a way to mitigate mode collapse in GANs. It consists of finding an optimal max player ϕ*(θ) = arg max_ϕ L^ϕ(θ, ϕ) for a fixed min player θ, through unrolling as follows:

$$\phi^0_t=\phi_t,\qquad \phi^{m+1}_t(\theta)=\phi^m_t-\eta\nabla_\phi\mathcal{L}^\phi(\theta_t,\phi^m_t),\qquad \phi^\star_t(\theta_t)=\lim_{m\to\infty}\phi^m_t(\theta)\,.$$

In practice m is a finite number of unrolling steps, yielding ϕ^m_t. The min player θ_t, e.g. the generator, can then be updated using the unrolled ϕ^m_t, while the update of ϕ_t is unchanged:

$$\theta_{t+1}=\theta_t-\eta\nabla_\theta\mathcal{L}^\theta(\theta_t,\phi^m_t),\qquad \phi_{t+1}=\phi_t-\eta\nabla_\phi\mathcal{L}^\phi(\theta_t,\phi_t)\,. \tag{UR-X}$$

Unroll-XY. While Metz et al. (2017) only unroll one player (the discriminator in their GAN setup), we extended the concept of unrolling to games and, for completeness, also considered unrolling both players. For the bilinear experiment we also have that L^θ(θ_t, ϕ_t) = −L^ϕ(θ_t, ϕ_t).

Adam. Adam (Kingma & Ba, 2015) computes an exponentially decaying average of both the past gradients m_t and the past squared gradients v_t, for each parameter of the model, as follows:

$$m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t \tag{5}$$

$$v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2\,, \tag{6}$$

where the hyperparameters β_1, β_2 ∈ [0, 1], m_0 = 0, v_0 = 0, and t denotes the iteration, t = 1, …, T. m_t and v_t are estimates of the first and second moments of the stochastic gradient, respectively. To compensate for the bias toward 0 due to the initialization m_0 = 0, v_0 = 0, Kingma & Ba (2015) propose to use bias-corrected estimates of these first two moments:

$$\hat m_t=\frac{m_t}{1-\beta_1^t} \tag{7}$$

$$\hat v_t=\frac{v_t}{1-\beta_2^t}\,. \tag{8}$$

Finally, the Adam update rule for all parameters ω_t at the t-th iteration can be described as:

$$\omega_{t+1}=\omega_t-\eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon}\,. \tag{Adam}$$

Extra-Adam. Gidel et al. (2019a) adjust Adam for the extragradient update (equation EG) and obtain the empirically motivated Extra-Adam, which re-uses the same running averages of equation Adam when computing the extrapolated point ω_{t+1/2} as well as when computing the new iterate ω_{t+1} (see Alg. 4, Gidel et al., 2019a). We used the implementation provided by the authors.

SVRE. Chavdarova et al. (2019) propose SVRE as a way to cope with variance in games, which may otherwise cause divergence. We used the restarted version of SVRE as used for the problem of equation SB-G, described in (Alg. 3, Chavdarova et al., 2019), and restated in Alg. 2 for completeness, where d^θ and d^ϕ denote the variance-corrected gradients:

$$d^\phi(\theta,\phi,\theta^S,\phi^S):=\mu_\phi+\nabla_\phi\mathcal{L}^\phi(\theta,\phi,D_{[n_d]},Z_{[n_z]})-\nabla_\phi\mathcal{L}^\phi(\theta^S,\phi^S,D_{[n_d]},Z_{[n_z]}) \tag{9}$$

$$d^\theta(\theta,\phi,\theta^S,\phi^S):=\mu_\theta+\nabla_\theta\mathcal{L}^\theta(\theta,\phi,Z_{[n_z]})-\nabla_\theta\mathcal{L}^\theta(\theta^S,\phi^S,Z_{[n_z]})\,, \tag{10}$$

where θ^S and ϕ^S are the snapshots and μ_θ and μ_ϕ their respective gradients. D and Z denote the finite data and noise datasets. With a fixed probability p, before the computation of μ^S_ϕ and μ^S_θ, we decide whether to restart SVRE (by using the averaged iterate as the new starting point, Alg. 2, line 6) or to compute the batch snapshot at the current point. For consistency, we used the implementation provided by the authors.

Algorithm 2: Pseudocode for Restarted SVRE.

1: Input: stopping time T, learning rates η_θ, η_ϕ, losses L^θ and L^ϕ, probability of restart p, dataset D, noise dataset Z, with |D| = |Z| = n.
2: Initialize: ϕ, θ, t = 0  ▷ t is used for the online average computation
3: for e = 0 to T−1 do
4:   Draw restart ∼ B(p)  ▷ check whether we restart the algorithm
5:   if restart and e > 0 then
6:     ϕ ← ϕ̄, θ ← θ̄ and t = 1
7:   end if
8:   ϕ^S ← ϕ and μ^S_ϕ ← (1/|D|) Σ_{i=1}^n ∇_ϕ L^ϕ_i(θ, ϕ^S)
9:   θ^S ← θ and μ^S_θ ← (1/|Z|) Σ_{i=1}^n ∇_θ L^θ_i(θ^S, ϕ^S)
10:  N ∼ Geom(1/n)  ▷ length of the epoch
11:  for i = 0 to N−1 do
12:    Sample i_θ ∼ π_θ, i_ϕ ∼ π_ϕ, do extrapolation:
13:    ϕ̃ ← ϕ − η_ϕ d^ϕ(θ, ϕ, θ^S, ϕ^S), θ̃ ← θ − η_θ d^θ(θ, ϕ, θ^S, ϕ^S)  ▷ equations 9 and 10
14:    Sample i_θ ∼ π_θ, i_ϕ ∼ π_ϕ, do update:
15:    ϕ ← ϕ − η_ϕ d^ϕ(θ̃, ϕ̃, θ^S, ϕ^S), θ ← θ − η_θ d^θ(θ̃, ϕ̃, θ^S, ϕ^S)  ▷ equations 9 and 10
16:    θ̄ ← (t/(t+1)) θ̄ + (1/(t+1)) θ and ϕ̄ ← (t/(t+1)) ϕ̄ + (1/(t+1)) ϕ  ▷ online computation of the average
17:    t ← t + 1  ▷ increment t for the online average computation
18:  end for
19: end for
20: Output: θ, ϕ

D.2 HYPERPARAMETERS USED FOR THE FULL-BATCH SETTING

Optimal α. In the full-batch bilinear problem it is possible to derive the optimal α parameter for a small enough η. Given the optimum ω*, the current iterate ω, and the previous iterate ω_P from k steps earlier, let x = ω_P + α(ω − ω_P) be the next iterate, selected on the line interpolating ω_P and ω. We aim at finding the x (in effect, the α) that is closest to ω*. For an infinitesimally small learning rate, a GDA iterate revolves around ω*, hence ‖ω − ω*‖ = ‖ω_P − ω*‖ = r. The point of the chord closest to ω* is the one at which the chord is orthogonal to the radius, so by Pythagoras:

$$r^2=\|\omega_P-x\|^2+\|x-\omega^\star\|^2=\|\omega-x\|^2+\|x-\omega^\star\|^2\,.$$

Hence the optimal x, for any k, satisfies ‖ω_P − x‖ = ‖ω − x‖, which is given for α = 0.5. In the case of a larger learning rate, for which the GDA iterates diverge, we have ‖ω_P − ω*‖ = r_1 < ‖ω − ω*‖ = r_2, as we are diverging. Hence the optimal x satisfies ‖ω_P − x‖ < ‖ω − x‖, which is given for α < 0.5. In Fig. 3 we indeed observe LA-GDA with α = 0.4 converging faster than with α = 0.5.

Hyperparameters. Unless otherwise specified, the learning rate is fixed to η = 0.3.
For both Unroll-Y and Unroll-XY, we use 6 unrolling steps. When combining Lookahead-minmax with GDA or extragradient, we use k = 6 and α = 0.5 unless otherwise emphasized.

D.3 PERFORMANCE OF EMA-ITERATES IN THE STOCHASTIC SETTING

For the identical experiment presented in 4.2.2, in this section we depict the corresponding performance of the EMA iterates of Adam, Extra-Adam, Extragradient and LA-GDA. The hyperparameters used are the same and are listed in D.4; we use β = 0.999 for EMA. In Fig. 7 we report the distance to the optimum as a function of the number of passes for each method. We observe that the high variance of the stochastic bilinear game prevents the convergence of Adam, Extra-Adam and Extragradient when the batch size B is small, despite computing an exponential moving average of the iterates over 20k iterations.

Figure 7: Distance to the optimum as a function of the number of passes, for Adam, Extra-Adam, Extragradient, and LA-GDA, all combined with EMA (β = 0.999). For small batch sizes, the high variance of the problem prevents convergence for all methods but Lookahead-minmax.

D.4 HYPERPARAMETERS USED FOR THE STOCHASTIC SETTING

The hyperparameters used in the stochastic bilinear experiment of equation SB-G are listed in Table 4. We tuned the hyperparameters of each method independently, for each batch size. We tried η ranging from 0.005 to 1; when the method diverged for all values of η, we set η = 0.005 in Fig. 4. To tune the first-moment estimate parameter of Adam, β1, we considered values ranging from −1 to 1, as Gidel et al. (2019a) reported that negative β1 can help in practice. We used α ∈ {0.3, 0.5} and k ∈ [5, 3000].

Table 4: List of hyperparameters used in Figure 4, grouped into one block per batch size. η denotes the learning rate, β1 is defined in equation 5, and α and k in Alg. 1.

Parameter | Adam | Extra-Adam | Extragradient | LA-GDA | SVRE
η | 0.005 | 0.02 | 0.8 | 0.2 | -
Adam β1 | 0.9 | 0.6 | - | - | -
Lookahead k | - | - | - | 15 | -
Lookahead α | - | - | - | 0.3 | -
η | 0.005 | 0.01 | 0.005 | 0.005 | -
Adam β1 | 0.6 | 0.2 | - | - | -
Lookahead k | - | - | - | 450 | -
Lookahead α | - | - | - | 0.3 | -
η | 0.005 | 0.005 | 0.005 | 0.01 | -
Adam β1 | 0.3 | 0.0 | - | - | -
Lookahead k | - | - | - | 1500 | -
Lookahead α | - | - | - | 0.3 | -
η | 0.005 | 0.005 | 0.005 | 0.05 | 0.1
Adam β1 | 0.0 | 0.0 | - | - | -
Lookahead k | - | - | - | 2450 | -
Lookahead α | - | - | - | 0.3 | -
restart probability p | - | - | - | - | 0.1

Fig. 8 depicts the final performance of Lookahead-minmax for different values of k. Note that the choice of plotting the distance to the optimum at a particular final iteration causes the frequent oscillations of the depicted performances, since the iterate gets closer to the optimum only after the backtracking step. These misleading oscillations aside, one can notice the trend of how the choice of k affects the final distance to the optimum. Interestingly, the case of B = 16 in Fig. 8 captures the periodicity of the rotating vector field, which sheds light on future directions in finding methods with adaptive k.

Figure 8: Sensitivity of LA-GDA to the value of the hyperparameter k in Alg. 1, for two combinations of batch size and learning rate: η = 0.05, B = 1 and η = 0.01, B = 16. The y-axis is the distance to the optimum at 20000 passes. The jolting of the curves is due to the final value being affected by how close it is to the last LA step, i.e. lines 10 and 11 of Alg. 1.
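The stochastic bilinear problem of equation SB-G follows Chavdarova et al. (2019); the sketch below uses one common form of it, with per-sample couplings, purely to illustrate how the minibatch size B controls the gradient variance discussed above. The constants, dimensions and helper names here are ours, not those of the actual experiment:

import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 10                      # dataset size and dimension (illustrative)
A = rng.normal(size=(n, d, d))       # per-sample couplings
b = rng.normal(size=(n, d))
c = rng.normal(size=(n, d))

def minibatch_grads(x, y, batch):
    # gradients of (1/|B|) sum_i [b_i.x + x.A_i.y + c_i.y] over a minibatch
    gA = A[batch].mean(0)
    gx = b[batch].mean(0) + gA @ y   # gradient w.r.t. the min player x
    gy = c[batch].mean(0) + gA.T @ x # gradient w.r.t. the max player y
    return gx, gy

# the variance of the stochastic gradient shrinks as B grows
x, y = rng.normal(size=d), rng.normal(size=d)
for B in (1, 16, 64, n):
    gs = [minibatch_grads(x, y, rng.integers(0, n, B))[0] for _ in range(200)]
    print(B, np.var(np.stack(gs), axis=0).mean())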
D.5 NUMBER OF PASSES

In our toy experiments we use the number of passes as the x-axis, as in (Chavdarova et al., 2019), so as to account for the different computational complexity of the optimization methods being compared. The number of passes differs from the number of iterations: a pass denotes one cycle of a forward and a backward pass, and could equivalently be called a gradient query. More precisely, using parameter updates as the x-axis could be misleading, as for example extragradient uses extra passes (gradient queries) per parameter update. The number of passes is thus a good indicator of computational complexity, whereas wall-clock time measurements, which depend on the concurrent overhead of the machine at the time of running, can be relatively more noisy.

D.6 ILLUSTRATIONS OF GAN OPTIMIZATION WITH LOOKAHEAD

In Fig. 9 we consider the 2D bilinear game min_x max_y x·y and illustrate the convergence of Lookahead-minmax. Interestingly, Lookahead makes use of the rotations of the game vector field caused by the adversarial component of the game. Although standard GDA diverges for all three shown learning rates, Lookahead-minmax converges. Moreover, we see that Lookahead-minmax with the larger learning rate η = 0.4 (and fixed k and α) in fact converges faster than with η = 0.1, which indicates that Lookahead-minmax is also sensitive to the value of η, besides introducing the additional hyperparameters k and α.

E EXPERIMENTS ON QUADRATIC FUNCTIONS

In this section we run experiments on the following 2D quadratic zero-sum problems:

L(x, y) = 3x² + 4xy − y² (QP-1)
L(x, y) = x² + 5xy − y² (QP-2)

E.1 EXPERIMENTS ON EQUATION QP-1

In Fig. 10, we compare GDA, LA-GDA, ExtraGrad, and LA-ExtraGrad on the problem defined by equation QP-1. One can show that the spectral radius of the Jacobian of the GDA operator is always larger than one when considering a single step size η shared by both players x and y. Therefore, we tune each method by performing a grid search over individual step sizes η_x and η_y for the two players, and we pick the pair (η_x, η_y) which gives the smallest spectral radius. From Fig. 10 we observe that Lookahead-minmax provides good performance.

Figure 9: Illustration of Lookahead-minmax on the bilinear game min_x max_y x·y, for different values of the learning rate η ∈ {0.1, 0.4, 1.5}, with fixed k = 5 and α = 0.5. The trajectory of the iterates is depicted with a green line, whereas the interpolated line between (ω_t, ω_{t,k}), t = 1, ..., T, with ω_t = (θ_t, ϕ_t), is shown with a dashed red line. The transparent lines depict the level curves of the loss function, and ω* = (0, 0). See D.6 for discussion.

Figure 10: Convergence of GDA, LA-GDA, ExtraGrad, and LA-ExtraGrad on equation QP-1, shown as the distance to the optimum against the number of passes.

E.2 EXPERIMENTS ON EQUATION QP-2

In Fig. 11, we compare GDA, LA-GDA, ExtraGrad, and LA-ExtraGrad on the problem defined by equation QP-2. This problem is better conditioned than equation QP-1, and one can show that the spectral radii of the Jacobians of the GDA and EG operators are smaller than one for some step size η shared by both players. In order to find the best hyperparameters for each method, we perform a grid search over the different hyperparameters and pick the set giving the smallest spectral radius.
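To illustrate this tuning procedure, here is a small NumPy sketch, ours rather than the paper's tuning code, that grid-searches per-player step sizes by minimizing the spectral radius of the Jacobian of one simultaneous GDA step on equation QP-1, assuming the signs of QP-1 as reconstructed above:

import numpy as np

# L(x, y) = 3x^2 + 4xy - y^2 (QP-1): grad_x L = 6x + 4y, grad_y L = 4x - 2y
def gda_jacobian(eta_x, eta_y):
    # Jacobian of one simultaneous GDA step (descent in x, ascent in y)
    return np.array([[1 - 6 * eta_x, -4 * eta_x],
                     [4 * eta_y,      1 - 2 * eta_y]])

def spectral_radius(M):
    return np.abs(np.linalg.eigvals(M)).max()

# grid search over individual step sizes, as described in E.1
grid = np.linspace(0.01, 0.5, 50)
best = min((spectral_radius(gda_jacobian(ex, ey)), ex, ey)
           for ex in grid for ey in grid)
print(best)  # smallest spectral radius found, and the (eta_x, eta_y) achieving it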
The results corroborate our previous analysis, as we observe faster convergence when using Lookahead-minmax.

Figure 11: Convergence of GDA, LA-GDA, ExtraGrad, and LA-ExtraGrad on equation QP-2, shown as the distance to the optimum against the number of passes.

F PARAMETER AVERAGING

Polyak parameter averaging was shown to give the fastest convergence rates among stochastic gradient algorithms on convex functions, by minimizing the asymptotic variance induced by the algorithm (Polyak & Juditsky, 1992). This so-called Ruppert-Polyak average is computed as the arithmetic average of the iterates:

θ̄_T^{RP} = (1/T) Σ_{t=1}^{T} θ^{(t)}, T ≥ 1. (RP-Avg)

In the context of games, weighted averaging was proposed by Bruck (1977) as follows:

θ̄_T^{WA} = ( Σ_{t=1}^{T} ρ^{(t)} θ^{(t)} ) / ( Σ_{t=1}^{T} ρ^{(t)} ). (W-Avg)

Equation W-Avg can be computed efficiently online as θ̄_t^{WA} = (1 − γ^{(t)}) θ̄_{t−1}^{WA} + γ^{(t)} θ^{(t)}, with γ^{(t)} ∈ [0, 1]. With γ^{(t)} = 1/t we obtain the Uniform Moving Average (UMA), whose performance is reported in our experiments in 5 and which is computed as follows:

θ̄_t^{UMA} = (1 − 1/t) θ̄_{t−1}^{UMA} + (1/t) θ^{(t)}, t = 1, ..., T. (UMA)

Analogously, we compute the Exponential Moving Average (EMA) in an online fashion using a fixed γ = 1 − β, with β < 1, as follows:

θ̄_t^{EMA} = β θ̄_{t−1}^{EMA} + (1 − β) θ^{(t)}, t = 1, ..., T. (EMA)

In our experiments, following related works (Yazıcı et al., 2019; Gidel et al., 2019a; Chavdarova et al., 2019), we fix β = 0.9999 when computing equation EMA on all the fast weights, as well as when computing it on the slow weights in the experiments with small k (e.g. k = 5). The remaining case, computing equation EMA on the slow weights of the experiments that use large k, produces only a few slow-weight iterates, so using such a large β = 0.9999 would place most of the weight on the parameters at initialization. We used a fixed β = 0.8 for those cases.
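The online averages above translate directly into code; below is a minimal PyTorch sketch, our own, where the helper names and the stand-in module are illustrative rather than taken from the paper's codebase:

import copy
import torch

@torch.no_grad()
def update_uma(avg_model, model, t):
    # theta_uma <- (1 - 1/t) * theta_uma + (1/t) * theta, t = 1, 2, ...
    for p_avg, p in zip(avg_model.parameters(), model.parameters()):
        p_avg.mul_(1.0 - 1.0 / t).add_(p, alpha=1.0 / t)

@torch.no_grad()
def update_ema(avg_model, model, beta=0.9999):
    # theta_ema <- beta * theta_ema + (1 - beta) * theta
    for p_avg, p in zip(avg_model.parameters(), model.parameters()):
        p_avg.mul_(beta).add_(p, alpha=1.0 - beta)

# usage: keep frozen copies of the generator and update them online
G = torch.nn.Linear(4, 4)            # stand-in for a generator
G_uma, G_ema = copy.deepcopy(G), copy.deepcopy(G)
for t in range(1, 101):
    # ... one optimizer step on G would go here ...
    update_uma(G_uma, G, t)
    update_ema(G_ema, G, beta=0.9999)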
G DETAILS ON THE LOOKAHEAD MINMAX ALGORITHM AND ALTERNATIVES

As Lookahead uses iterates generated by an inner optimizer, we also refer to it as a meta-optimizer or wrapper. In the main paper we provide a general pseudocode that can be applied to any optimization method. Alg. 3 reformulates the general Alg. 1 when combined with GDA, replacing the inner fast-weight loop by a modulo-k operation, to clearly demonstrate to the reader how small a modification of the base-optimizer's source code LA-minmax requires. Note that instead of the fast-weight notation of Alg. 1 we use snapshot networks θ^S, ϕ^S (past iterates) to denote the slow weights, and the parameters θ, ϕ for the fast weights. Alg. 3 also covers the possibility of using a different update ratio r ∈ Z for the two players. For a fair comparison, all our empirical LA-GAN results use the convention of Alg. 3, i.e. as one step we count one update of both players (rather than one update of the slow weights, as t counts in Alg. 1).

Algorithm 3 Alternative formulation of Lookahead-minmax (equivalent to Alg. 1 & GDA)
1: Input: stopping time T, learning rates η_θ, η_ϕ, initial weights θ, ϕ, lookahead hyperparameters k and α, losses L^θ, L^ϕ, update ratio r, real data distribution p_d, noise data distribution p_z.
2: θ^S, ϕ^S ← θ, ϕ (store snapshots)
3: for t ∈ 1, ..., T do
4:   for i ∈ 1, ..., r do
5:     x ∼ p_d, z ∼ p_z
6:     ϕ ← ϕ − η_ϕ ∇_ϕ L^ϕ(θ, ϕ, x, z) (update ϕ r times)
7:   end for
8:   z ∼ p_z
9:   θ ← θ − η_θ ∇_θ L^θ(θ, ϕ, z) (update θ once)
10:  if t % k == 0 then
11:    ϕ ← ϕ^S + α_ϕ(ϕ − ϕ^S) (backtracking on the line interpolating ϕ^S and ϕ)
12:    θ ← θ^S + α_θ(θ − θ^S) (backtracking on the line interpolating θ^S and θ)
13:    θ^S, ϕ^S ← θ, ϕ (update snapshots)
14:  end if
15: end for
16: Output: θ^S, ϕ^S

Alg. 5 shows in detail how Lookahead-minmax can be combined with Adam (equation Adam), and Alg. 4 shows how it can be combined with extragradient. By combining equation Adam with Alg. 4 (analogous to Alg. 5), one can implement the LA-ExtraGrad-Adam algorithm used in our experiments on MNIST, CIFAR-10, SVHN and ImageNet. A basic PyTorch implementation of Lookahead-minmax, including computing EMA on the slow weights, is provided in Listing 1.

from collections import defaultdict
import copy
import torch

class Lookahead(torch.optim.Optimizer):
    def __init__(self, optimizer, alpha=0.5):
        self.optimizer = optimizer
        self.alpha = alpha
        self.param_groups = self.optimizer.param_groups
        self.state = defaultdict(dict)

    def lookahead_step(self):
        for group in self.param_groups:
            for fast in group["params"]:
                param_state = self.state[fast]
                if "slow_params" not in param_state:
                    # initialize the slow weights with the current fast weights
                    param_state["slow_params"] = torch.zeros_like(fast.data)
                    param_state["slow_params"].copy_(fast.data)
                slow = param_state["slow_params"]
                # slow <- slow + alpha * (fast - slow)
                slow += (fast.data - slow) * self.alpha
                fast.data.copy_(slow)

    def step(self, closure=None):
        # delegate the fast-weight update to the inner optimizer
        loss = self.optimizer.step(closure)
        return loss

def update_ema_gen(G, G_ema, beta_ema=0.9999):
    l_param = list(G.parameters())
    l_ema_param = list(G_ema.parameters())
    for i in range(len(l_param)):
        with torch.no_grad():
            l_ema_param[i].data.copy_(
                l_ema_param[i].data.mul(beta_ema)
                .add(l_param[i].data.mul(1 - beta_ema)))

def train(G, D, optimizer_D, optimizer_G, data_sampler, noise_sampler,
          iterations=100, lookahead_k=5, beta_ema=0.9999, lookahead_alpha=0.5):
    # wrap the optimizers with the Lookahead optimizer
    optimizer_D = Lookahead(optimizer_D, alpha=lookahead_alpha)
    optimizer_G = Lookahead(optimizer_G, alpha=lookahead_alpha)
    G_ema = None
    for i in range(iterations):
        # update discriminator and generator (fast weights)
        discriminator_step(D, optimizer_D, data_sampler, noise_sampler)
        generator_step(G, optimizer_G, noise_sampler)
        if (i + 1) % lookahead_k == 0:
            # joint lookahead update of both players
            optimizer_G.lookahead_step()
            optimizer_D.lookahead_step()
            if G_ema is None:
                G_ema = copy.deepcopy(G)
            else:
                # update EMA on the slow weights
                update_ema_gen(G, G_ema, beta_ema=beta_ema)
    return G_ema

Listing 1: Implementation of Lookahead-minmax in PyTorch, including the computation of EMA on the slow weights. The functions discriminator_step and generator_step perform one fast-weight update of the respective player.

Unrolled-GAN vs. Lookahead-minmax (LA-MM). LA-GAN differs from Unrolled-GAN (and from approximated Unrolled-GAN, which does not backprop through the unrolled steps) in that it: (i) does not fix one player in order to update the other player k times; (ii) at each step t, does not take a gradient at the future iterate t + k and apply it at the current iterate t (as extragradient does too); (iii) contrary to Unrolled-GAN, LA-MM uses a point on the line between two iterates, closeness to which is controlled by the parameter α, and hence deals with rotations in a different way. Note that LA-MM can be applied on top of Unrolled-GAN; however, the heavy computational cost associated with Unrolled-GAN prevented us from running this experiment (see the note in Table 3 for a computational comparison).
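For concreteness, a short usage sketch of the wrapper from Listing 1; it assumes the Lookahead class above is in scope, the two linear models are placeholders for a generator and a discriminator, and the Adam β2 value is illustrative (β1 = 0.0 and k = 5 follow Table 10):

import torch

G = torch.nn.Linear(128, 784)      # placeholder generator
D = torch.nn.Linear(784, 1)        # placeholder discriminator

# base (inner) optimizers, wrapped by Lookahead from Listing 1
opt_G = Lookahead(torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.0, 0.9)),
                  alpha=0.5)
opt_D = Lookahead(torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.0, 0.9)),
                  alpha=0.5)

k = 5                              # LA step frequency, as in Table 10
for t in range(1, 1001):
    # ... regular GAN fast-weight updates via opt_D.step() / opt_G.step() ...
    if t % k == 0:
        opt_G.lookahead_step()     # joint backtracking of both players
        opt_D.lookahead_step()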
Algorithm 4 Lookahead-minmax combined with extragradient (equivalent to Alg. 1 & EG)
1: Input: stopping time T, learning rates η_θ, η_ϕ, initial weights θ, ϕ, lookahead hyperparameters k and α, losses L^θ, L^ϕ, update ratio r, real data distribution p_d, noise data distribution p_z.
2: θ^S, ϕ^S ← θ, ϕ (store snapshots)
3: for t ∈ 1, ..., T do
4:   ϕ_extra ← ϕ
5:   for i ∈ 1, ..., r do
6:     x ∼ p_d, z ∼ p_z
7:     ϕ_extra ← ϕ_extra − η_ϕ ∇_{ϕ_extra} L^ϕ(θ, ϕ_extra, x, z) (compute the extrapolated ϕ)
8:   end for
9:   z ∼ p_z
10:  θ_extra ← θ − η_θ ∇_θ L^θ(θ, ϕ, z) (compute the extrapolated θ)
11:  for i ∈ 1, ..., r do
12:    x ∼ p_d, z ∼ p_z
13:    ϕ ← ϕ − η_ϕ ∇_ϕ L^ϕ(θ_extra, ϕ, x, z) (update ϕ r times)
14:  end for
15:  z ∼ p_z
16:  θ ← θ − η_θ ∇_θ L^θ(θ, ϕ_extra, z) (update θ once)
17:  if t % k == 0 then
18:    ϕ ← ϕ^S + α_ϕ(ϕ − ϕ^S) (backtracking on the line interpolating ϕ^S and ϕ)
19:    θ ← θ^S + α_θ(θ − θ^S) (backtracking on the line interpolating θ^S and θ)
20:    θ^S, ϕ^S ← θ, ϕ (update snapshots)
21:  end if
22: end for
23: Output: θ^S, ϕ^S

Algorithm 5 Lookahead-minmax (Alg. 1) combined with Adam as the inner optimizer.
1: Input: stopping time T, learning rates η_θ, η_ϕ, initial weights θ, ϕ, lookahead hyperparameters k and α, losses L^θ, L^ϕ, update ratio r, real data distribution p_d, noise data distribution p_z, Adam parameters β_1, β_2.
2: θ^S, ϕ^S ← θ, ϕ (store snapshots)
3: m_ϕ, v_ϕ, m_θ, v_θ ← 0, 0, 0, 0 (initialize first and second moments for Adam)
4: for t ∈ 1, ..., T do
5:   for i ∈ 1, ..., r do
6:     x ∼ p_d, z ∼ p_z
7:     g_ϕ ← ∇_ϕ L^ϕ(θ, ϕ, x, z)
8:     m_ϕ ← β_1 m_ϕ + (1 − β_1) g_ϕ
9:     v_ϕ ← β_2 v_ϕ + (1 − β_2) (g_ϕ)²
10:    m̂_ϕ ← m_ϕ / (1 − β_1^{(t−1)·r+i})
11:    v̂_ϕ ← v_ϕ / (1 − β_2^{(t−1)·r+i})
12:    ϕ ← ϕ − η_ϕ m̂_ϕ / (√v̂_ϕ + ε) (update ϕ r times)
13:  end for
14:  z ∼ p_z
15:  g_θ ← ∇_θ L^θ(θ, ϕ, z)
16:  m_θ ← β_1 m_θ + (1 − β_1) g_θ
17:  v_θ ← β_2 v_θ + (1 − β_2) (g_θ)²
18:  m̂_θ ← m_θ / (1 − β_1^t)
19:  v̂_θ ← v_θ / (1 − β_2^t)
20:  θ ← θ − η_θ m̂_θ / (√v̂_θ + ε) (update θ once)
21:  if t % k == 0 then
22:    ϕ ← ϕ^S + α_ϕ(ϕ − ϕ^S) (backtracking on the line interpolating ϕ^S and ϕ)
23:    θ ← θ^S + α_θ(θ − θ^S) (backtracking on the line interpolating θ^S and θ)
24:    θ^S, ϕ^S ← θ, ϕ (update snapshots)
25:  end if
26: end for
27: Output: θ^S, ϕ^S

G.1 NESTED LOOKAHEAD MINMAX

In our experiments we explored the effect of different values of k. On the one hand, we observed that less stable baselines such as AltGAN tend to perform better with small k (e.g. k = 5) when combined with Lookahead-minmax, as this seems necessary to prevent early divergence (which in turn achieves better iterate and EMA performances). On the other hand, we observed that stable baselines combined with Lookahead-minmax and large k (e.g. k = 10000) tend to take better advantage of the rotational behavior of games, as indicated by the notable improvement after each LA step, see Fig. 2 and Fig. 12. This motivates combining both a small "slow" k_s and a large "super-slow" k_ss, as shown in Alg. 6. In this so-called nested Lookahead-minmax version, denoted with the prefix NLA, we store two copies of each player: one corresponding to the traditional slow weights updated every k_s steps, and another for the so-called super-slow weights updated every k_ss steps. When computing EMA on the slow weights for NLA-GAN methods we use the super-slow weights, as they correspond to the best-performing iterates. Better results could, however, be obtained by computing EMA on the slow weights, as EMA performs best with a larger β parameter and averaging over many iterates (whereas we obtain a relatively small number of super-slow iterates). We empirically find nested Lookahead-minmax to be more stable than its non-nested counterpart, see Fig. 13.
Figure 12: IS (higher is better) of LA-GAN on ImageNet with a relatively large k = 10000, showing fast and slow weights over the iterations. The backtracking step significantly improves the model's performance every 10000 iterations. This shows how a large k can take advantage of the rotating gradient vector field.

Figure 13: LA-ExtraGrad, NLA-ExtraGrad and ExtraGrad models trained on ImageNet. For LA-ExtraGrad (k = 5000), the lighter and darker colors represent the fast and slow weights, respectively. For NLA-ExtraGrad (k_s = 5, k_ss = 5000), the lighter and darker colors represent the fast and super-slow weights, respectively. In our experiments on ImageNet, while LA-ExtraGrad is more stable than ExtraGrad, it still has a tendency to diverge early. Using Alg. 6 we notice a clear improvement in stability.

Algorithm 6 Pseudocode of nested Lookahead-minmax.
1: Input: stopping time T, learning rates η_θ, η_ϕ, initial weights θ, ϕ, lookahead hyperparameters k_s, k_ss and α, losses L^θ, L^ϕ, update ratio r, real data distribution p_d, noise data distribution p_z.
2: (θ_s, θ_ss, ϕ_s, ϕ_ss) ← (θ, θ, ϕ, ϕ) (store copies for the slow and super-slow weights)
3: for t ∈ 1, ..., T do
4:   for i ∈ 1, ..., r do
5:     x ∼ p_d, z ∼ p_z
6:     ϕ ← ϕ − η_ϕ ∇_ϕ L^ϕ(θ, ϕ, x, z) (update ϕ r times)
7:   end for
8:   z ∼ p_z
9:   θ ← θ − η_θ ∇_θ L^θ(θ, ϕ, z) (update θ once)
10:  if t % k_s == 0 then
11:    ϕ ← ϕ_s + α_ϕ(ϕ − ϕ_s) (backtracking on the line interpolating ϕ_s and ϕ)
12:    θ ← θ_s + α_θ(θ − θ_s) (backtracking on the line interpolating θ_s and θ)
13:    (θ_s, ϕ_s) ← (θ, ϕ) (update slow checkpoints)
14:  end if
15:  if t % k_ss == 0 then
16:    ϕ ← ϕ_ss + α_ϕ(ϕ − ϕ_ss) (backtracking on the line interpolating ϕ_ss and ϕ)
17:    θ ← θ_ss + α_θ(θ − θ_ss) (backtracking on the line interpolating θ_ss and θ)
18:    (θ_ss, ϕ_ss) ← (θ, ϕ) (update super-slow checkpoints)
19:    (θ_s, ϕ_s) ← (θ, ϕ) (update slow checkpoints)
20:  end if
21: end for
22: Output: θ_ss, ϕ_ss
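A minimal NumPy sketch of the two-level bookkeeping of Alg. 6, our own, run on the bilinear toy game; the periods k_s and k_ss here are illustrative, not the values used on ImageNet:

import numpy as np

eta, alpha = 0.3, 0.5
k_s, k_ss = 5, 50                    # slow and super-slow periods (illustrative)

x, y = 1.0, 1.0
slow = super_slow = (x, y)           # two snapshot levels, as in Alg. 6
for t in range(1, 2001):
    y = y + eta * x                  # fast alternating GDA on min_x max_y x*y
    x = x - eta * y
    if t % k_s == 0:                 # slow LA step
        x = slow[0] + alpha * (x - slow[0])
        y = slow[1] + alpha * (y - slow[1])
        slow = (x, y)
    if t % k_ss == 0:                # super-slow LA step; also resets the slow copy
        x = super_slow[0] + alpha * (x - super_slow[0])
        y = super_slow[1] + alpha * (y - super_slow[1])
        super_slow = slow = (x, y)
print(np.hypot(*super_slow))         # distance of the super-slow weights to (0, 0)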
G.2 ALTERNATING LOOKAHEAD MINMAX

Lookahead vs. Lookahead-minmax. Note how Alg. 1 differs from applying Lookahead to the two players separately. The obvious difference arises for the case r ≠ 1, as the backtracking is then performed after different numbers of updates of ϕ and θ. The key difference is in fact that, after applying equation LA to one of the players, we do not use the resulting interpolated point to update the parameters of the other player, a version we refer to as alternating Lookahead (Alg. 7). Instead, equation LA is applied to both players at the same time, which we found to outperform the former. Unless otherwise emphasized, we focus on the joint version, as described in Alg. 1.

For completeness, in this section we consider an alternative implementation of Lookahead-minmax, which naively applies equation LA to each player separately, and which we refer to as alternating Lookahead. This version uses a backtracked iterate to update the opponent, rather than performing the backtracking step for both players at the same time. In other words, the fact that line 9 of Alg. 7 is executed before updating θ in line 14, and vice versa, does not allow Lookahead to help deal with the rotations typical for games.

Algorithm 7 Alternating Lookahead-minmax pseudocode.
1: Input: stopping time T, learning rates η_θ, η_ϕ, initial weights θ, ϕ, hyperparameters k_θ, k_ϕ, α_θ, α_ϕ, losses L^θ, L^ϕ, update ratio r, real data distribution p_d, noise data distribution p_z.
2: θ̃ ← θ (store copies)
3: ϕ̃ ← ϕ
4: for t ∈ 0, ..., T − 1 do
5:   for i ∈ 1, ..., r do
6:     x ∼ p_d, z ∼ p_z
7:     ϕ ← ϕ − η_ϕ ∇_ϕ L^ϕ(θ, ϕ, x, z) (update ϕ r times)
8:     if (t·r + i) % k_ϕ == 0 then
9:       ϕ ← ϕ̃ + α_ϕ(ϕ − ϕ̃) (backtracking on the line interpolating ϕ̃ and ϕ)
10:      ϕ̃ ← ϕ
11:    end if
12:  end for
13:  z ∼ p_z
14:  θ ← θ − η_θ ∇_θ L^θ(θ, ϕ, z) (update θ once)
15:  if t % k_θ == 0 then
16:    θ ← θ̃ + α_θ(θ − θ̃) (backtracking on the line interpolating θ̃ and θ)
17:    θ̃ ← θ
18:  end if
19: end for
20: Output: θ, ϕ

On SVHN and CIFAR-10, the joint Lookahead-minmax consistently gave us the best results, as can be seen in Figures 14 and 15. On MNIST, the alternating and joint implementations worked equally well, see Figure 15.

Figure 14: Comparison between different extensions of Lookahead to games on CIFAR-10: (a) LA-AltGAN; (b) LA-ExtraGrad, comparing Joint-LA-ExtraGrad (k = 5000 and k = 1000) with Alt-LA-ExtraGrad (k = 1000). We use the prefixes Joint- and Alt- to denote Alg. 1 and Alg. 7, respectively, of which the former is the one presented in the main paper. We see significant improvements in FID when using the joint implementation, for both LA-AltGAN (left) and LA-ExtraGrad (right).

Figure 15: (a): Comparison of the joint Lookahead-minmax implementation (Joint- prefix, see Algorithm 1) and the alternating Lookahead-minmax implementation (Alt- prefix, see Algorithm 7) on the SVHN dataset. (b): Results obtained with the different methods introduced in 5 (AltGAN, Unrolled-GAN, ExtraGrad), as well as the alternating implementation Alt-LA-AltGAN, on the MNIST dataset. Each curve is averaged over 5 runs. The results of the alternating implementation differ very little from those of the joint implementation; the curve for Alt-LA-AltGAN matches the results in Table 1.

Figure 16: Samples from our generator model with the highest IS score. We can clearly see some unrealistic artefacts. We observed that the IS metric does not penalize these artefacts, whereas FID does penalize them.

H DETAILS ON THE IMPLEMENTATION

For our experiments, we used the PyTorch2 deep learning framework. For experiments on CIFAR-10 and SVHN, we compute the FID and IS metrics using the provided implementations in TensorFlow3, for consistency with related works. For experiments on ImageNet we use a faster-running PyTorch implementation4 of FID and IS by Brock et al. (2019), which allows for more frequent evaluation.

H.1 METRICS

We provide more details about the metrics enumerated in 5. Both FID and IS use: (i) the Inception v3 network (Szegedy et al., 2015), trained on the ImageNet dataset consisting of 1 million RGB images of C = 1000 classes; (ii) a sample of m generated images x ∼ p_g, where usually m = 50000.

H.1.1 INCEPTION SCORE

Given an image x, IS uses the softmax output of the Inception network p(y|x), which represents the probability that x is of class c_i, i ∈ 1, ..., C, i.e., p(y|x) ∈ [0, 1]^C. It then computes the marginal class distribution p(y) = ∫_x p(y|x) p_g(x) dx. IS measures the Kullback-Leibler divergence D_KL between the predicted conditional label distribution p(y|x) and the marginal class distribution p(y). More precisely, it is computed as follows:

IS(G) = exp( E_{x∼p_g} [ D_KL( p(y|x) ‖ p(y) ) ] ) = exp( (1/m) Σ_{i=1}^{m} Σ_{c=1}^{C} p(y_c|x_i) log [ p(y_c|x_i) / p(y_c) ] ). (11)
It aims at estimating (i) whether the samples look realistic, i.e., p(y|x) should have low entropy, and (ii) whether the samples are diverse (from different ImageNet classes), i.e., p(y) should have high entropy. As these are combined using the Kullback-Leibler divergence, the higher the score, the better the performance. Note that the range of IS scores at convergence varies across datasets, as the Inception network is pretrained on the ImageNet classes. For example, we obtain low IS values on the SVHN dataset, as a large fraction of its classes are numbers, which typically do not appear in the ImageNet dataset. Since MNIST has greyscale images, we used a classifier trained on this dataset and used m = 5000. For CIFAR-10 and SVHN, we used the original implementation5 of IS in TensorFlow, and m = 50000.

As the Inception Score considers the classes as predicted by the Inception network, it can fail to penalize visual artefacts as long as those do not alter the predicted class distribution. In Fig. 16 we show some images generated by our best model according to IS. These images exhibit visible unrealistic artefacts, while enough of each image is left for us to recognise a potential image label. For this reason we consider the Fréchet Inception Distance to be a more reliable estimator of image quality; we nonetheless report IS for completeness.

2 https://pytorch.org/
3 https://www.tensorflow.org/
4 https://github.com/ajbrock/BigGAN-PyTorch
5 https://github.com/openai/improved-gan/

H.1.2 FRÉCHET INCEPTION DISTANCE

Contrary to IS, FID aims at comparing the synthetic samples x ∼ p_g with those of the training dataset x ∼ p_d in a feature space. The samples are embedded using the first several layers of the Inception network. Assuming p_g and p_d are multivariate normal distributions, it then estimates the means m_g and m_d and the covariances C_g and C_d, respectively for p_g and p_d, in that feature space. Finally, FID is computed as:

D_FID(p_d, p_g) = d²((m_d, C_d), (m_g, C_g)) = ‖m_d − m_g‖₂² + Tr( C_d + C_g − 2 (C_d C_g)^{1/2} ), (12)

where d² denotes the Fréchet distance. Note that as this metric is a distance, the lower it is, the better the performance. We used the original implementation of FID6 in TensorFlow, along with the provided statistics of the datasets.

6 https://github.com/bioinf-jku/TTUR
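As a concrete reference, below is a NumPy sketch, our own illustration rather than the TensorFlow implementations used for the reported numbers, of equations (11) and (12) given precomputed Inception softmax outputs and features:

import numpy as np
from scipy import linalg

def inception_score(probs, eps=1e-12):
    # probs: (m, C) softmax outputs p(y|x_i) of the Inception network
    p_y = probs.mean(axis=0, keepdims=True)            # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                    # equation (11)

def fid(feats_d, feats_g):
    # feats_*: (m, d) Inception features of real and generated samples
    m_d, m_g = feats_d.mean(0), feats_g.mean(0)
    C_d = np.cov(feats_d, rowvar=False)
    C_g = np.cov(feats_g, rowvar=False)
    covmean = linalg.sqrtm(C_d @ C_g).real             # matrix square root
    return float(((m_d - m_g) ** 2).sum()
                 + np.trace(C_d + C_g - 2 * covmean))  # equation (12)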
H.2 ARCHITECTURES & HYPERPARAMETERS

Description of the architectures. We describe the models used in the empirical evaluation of Lookahead-minmax by listing the layers they consist of, as adopted in GAN works, e.g. (Miyato et al., 2018). With conv. we denote a convolutional layer, and with transposed conv. a transposed convolution layer (Radford et al., 2016). The models use Batch Normalization (Ioffe & Szegedy, 2015) and Spectral Normalization layers (Miyato et al., 2018).

H.2.1 ARCHITECTURES FOR EXPERIMENTS ON MNIST

For experiments on the MNIST dataset, we used the DCGAN architectures (Radford et al., 2016) listed in Table 5; the parameters of the models are initialized using the PyTorch default initialization. For experiments on this dataset, we used the non-saturating GAN loss as proposed by Goodfellow et al. (2014):

L^D = E_{x∼p_d} [−log D(x)] + E_{z∼p_z} [−log(1 − D(G(z)))] (13)
L^G = E_{z∼p_z} [−log D(G(z))], (14)

where p_d and p_z denote the data and the latent distributions (the latter being predefined).

Table 5: DCGAN architectures (Radford et al., 2016) used for experiments on MNIST. We use ker and pad to denote the kernel and the padding of the (transposed) convolution layers, respectively. With h×w we denote the kernel size, and with c_in → c_out the number of input and output channels of a (transposed) convolution layer.

Generator:
Input: z ∈ R^128 ∼ N(0, I)
transposed conv. (ker: 3×3, 128 → 512; stride: 1); Batch Normalization; ReLU
transposed conv. (ker: 4×4, 512 → 256; stride: 2); Batch Normalization; ReLU
transposed conv. (ker: 4×4, 256 → 128; stride: 2); Batch Normalization; ReLU
transposed conv. (ker: 4×4, 128 → 1; stride: 2; pad: 1); Tanh(·)

Discriminator:
Input: x ∈ R^{1×28×28}
conv. (ker: 4×4, 1 → 64; stride: 2; pad: 1); LeakyReLU (negative slope: 0.2)
conv. (ker: 4×4, 64 → 128; stride: 2; pad: 1); Batch Normalization; LeakyReLU (negative slope: 0.2)
conv. (ker: 4×4, 128 → 256; stride: 2; pad: 1); Batch Normalization; LeakyReLU (negative slope: 0.2)
conv. (ker: 3×3, 256 → 1; stride: 1); Sigmoid(·)

H.2.2 RESNET ARCHITECTURES FOR IMAGENET, CIFAR-10 AND SVHN

We replicate the experimental setup described for CIFAR-10 and SVHN in (Miyato et al., 2018; Chavdarova et al., 2019), as listed in Table 7. This setup uses the hinge version of the adversarial non-saturating loss, see (Miyato et al., 2018). As a reference, our ResNet architectures for CIFAR-10 have approximately 85 layers in total for G and D, including the non-linearity and the normalization layers.

Table 6: ResNet blocks used for the ResNet architectures (see Table 7), for the Generator (G-ResBlock) and the Discriminator (D-ResBlock). Each ResNet block contains a skip connection (bypass) and a sequence of convolutional layers, normalization, and the ReLU non-linearity. The skip connection of the Generator's ResNet block upsamples the input by a factor of 2 (we use the default PyTorch upsampling algorithm, nearest neighbor), whose output is then added to the output of the feedforward branch. For clarity we list the layers sequentially; note, however, that the bypass layers operate in parallel with the layers denoted as feedforward (He et al., 2016). The ResNet block for the Discriminator differs depending on whether it is the first block in the network (following the input to the Discriminator), ℓ = 1, or a subsequent one, ℓ > 1, so as to avoid applying the ReLU non-linearity immediately on the input.

G-ResBlock:
Bypass: Upsample(×2)
Feedforward: Batch Normalization; ReLU; Upsample(×2); conv. (ker: 3×3, 256 → 256; stride: 1; pad: 1); Batch Normalization; ReLU; conv. (ker: 3×3, 256 → 256; stride: 1; pad: 1)

D-ResBlock (ℓ-th block):
Bypass: [AvgPool (ker: 2×2)], if ℓ = 1; conv. (ker: 1×1, 3 → 128 if ℓ = 1, 128 → 128 otherwise; stride: 1); Spectral Normalization; [AvgPool (ker: 2×2, stride: 2)], if ℓ ≠ 1
Feedforward: [ReLU], if ℓ ≠ 1; conv. (ker: 3×3, 3 → 128 if ℓ = 1, 128 → 128 otherwise; stride: 1; pad: 1); Spectral Normalization; ReLU; conv. (ker: 3×3, 128 → 128; stride: 1; pad: 1); Spectral Normalization; AvgPool (ker: 2×2)

Table 7: Deep ResNet architectures used for experiments on ImageNet, SVHN and CIFAR-10, where G-ResBlock and D-ResBlock are described in Table 6. The models' parameters are initialized using the Xavier initialization (Glorot & Bengio, 2010). For ImageNet experiments, the generator's input is of dimension 512 instead of 128.

Generator:
Input: z ∈ R^128 ∼ N(0, I)
Linear(128 → 4096)
G-ResBlock; G-ResBlock; G-ResBlock
Batch Normalization; ReLU
conv. (ker: 3×3, 256 → 3; stride: 1; pad: 1); Tanh(·)

Discriminator:
Input: x ∈ R^{3×32×32}
D-ResBlock; D-ResBlock; D-ResBlock; D-ResBlock
ReLU; AvgPool (ker: 8×8)
Linear(128 → 1); Spectral Normalization

H.2.3 UNROLLING IMPLEMENTATION

In Section D.1 we explained how we implemented unrolling for our full-batch bilinear experiments. Here we describe our implementation for the MNIST and CIFAR-10 experiments. Unrolling is computationally intensive, which can become a problem for large architectures: the computation of ∇_θ L^θ(θ_t, ϕ_t^m), with m unrolling steps, requires the computation of higher-order derivatives, which comes with an m× memory footprint and a significant slowdown. Under limited memory, one can instead backpropagate only through the last unrolled step, bypassing the computation of higher-order derivatives; we empirically observe that the gradients of these derivatives are small. In this approximate version, unrolling can be seen as belonging to the same family as extragradient, computing its extrapolated point using more than a single step. We tested both full and approximate unrolling on MNIST, with a number of unrolling steps ranging from 5 to 20. The full unrolling, which backpropagates through the unrolled discriminator, was implemented using the Higher7 library. On CIFAR-10 we only experimented with approximate unrolling over 5 to 10 steps, due to the large memory footprint of the ResNet architectures used for the generator and the discriminator, which made the full approach infeasible given our resources.

7 https://github.com/facebookresearch/higher
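A minimal PyTorch sketch of the approximate unrolling just described, our own; the loss callables and samplers are hypothetical placeholders, and only the structure (unrolling a detached copy, so no higher-order derivatives are tracked) follows the text above:

import copy
import torch

def unrolled_disc(G, D, loss_D, x_batches, z_batches, m=5, eta=1e-3):
    # Approximate Unroll-Y: unroll a *copy* of the discriminator for m steps.
    # Since the copy is optimized independently, no higher-order derivatives
    # are tracked; we only backprop through the generator's last forward pass.
    D_unrolled = copy.deepcopy(D)
    opt = torch.optim.SGD(D_unrolled.parameters(), lr=eta)
    for j in range(m):
        opt.zero_grad()
        loss_D(D_unrolled, G, x_batches[j], z_batches[j]).backward()
        opt.step()
    return D_unrolled  # used only to compute the generator's gradient

# generator step against the unrolled discriminator:
#   loss_G(D_unrolled, G, z).backward(); opt_G.step()
# the actual discriminator D is then updated as usual, as in equation UR-Y.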
H.2.4 HYPERPARAMETERS USED ON MNIST

Table 8 lists the hyperparameters that we used for our experiments on the MNIST dataset.

Table 8: Hyperparameters used on MNIST.
Parameter | AltGAN | LA-AltGAN | ExtraGrad | LA-ExtraGrad | Unrolled-GAN
η_G | 0.001 | 0.001 | 0.001 | 0.001 | 0.001
η_D | 0.001 | 0.001 | 0.001 | 0.001 | 0.001
Adam β1 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05
Batch-size | 50 | 50 | 50 | 50 | 50
Update ratio r | 1 | 1 | 1 | 1 | 1
Lookahead k | - | 1000 | - | 1000 | -
Lookahead α | - | 0.5 | - | 0.5 | -
Unrolling steps | - | - | - | - | 20

H.2.5 HYPERPARAMETERS USED ON SVHN

Table 9 lists the hyperparameters used for experiments on SVHN. These values were selected for each algorithm independently, after tuning the hyperparameters of the baseline.

Table 9: Hyperparameters used on SVHN.
Parameter | AltGAN | LA-AltGAN | ExtraGrad | LA-ExtraGrad
η_G | 0.0002 | 0.0002 | 0.0002 | 0.0002
η_D | 0.0002 | 0.0002 | 0.0002 | 0.0002
Adam β1 | 0.0 | 0.0 | 0.0 | 0.0
Batch-size | 128 | 128 | 128 | 128
Update ratio r | 5 | 5 | 5 | 5
Lookahead k | - | 5 | - | 5000
Lookahead α | - | 0.5 | - | 0.5

H.2.6 HYPERPARAMETERS USED ON CIFAR-10

Table 10: Hyperparameters that we used for our experiments on CIFAR-10.
Parameter | AltGAN | LA-AltGAN | ExtraGrad | LA-ExtraGrad | Unrolled-GAN
η_G | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002
η_D | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002
Adam β1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
Batch-size | 128 | 128 | 128 | 128 | 128
Update ratio r | 5 | 5 | 5 | 5 | 5
Lookahead k | - | 5 | - | 5000 | -
Lookahead α | - | 0.5 | - | 0.5 | -
Unrolling steps | - | - | - | - | 5

The reported results on CIFAR-10 were obtained using the hyperparameters listed in Table 10. These values were selected for each algorithm independently after tuning; for the baseline methods we selected the hyperparameters giving the best performance. Consistent with the results reported in related works, we also observed that using a larger ratio of discriminator to generator updates improves the stability of the baseline, and we used r = 5. We observed that using learning-rate decay delays the divergence but does not improve the best FID scores, hence we did not use it for our reported models.
H.2.7 HYPERPARAMETERS USED ON IMAGENET

Table 11: Hyperparameters used for our AltGAN experiments on ImageNet.
Parameter | AltGAN | LA-AltGAN | NLA-AltGAN
η_G | 0.0002 | 0.0002 | 0.0002
η_D | 0.0002 | 0.0002 | 0.0002
Adam β1 | 0.0 | 0.0 | 0.0
Batch-size | 256 | 256 | 256
Update ratio r | 5 | 5 | 5
Lookahead k_s | - | 5 | 5
Lookahead k_ss | - | - | 10000
Lookahead α | - | 0.5 | 0.5

Table 12: Hyperparameters used for our ExtraGrad experiments on ImageNet.
Parameter | ExtraGrad | LA-ExtraGrad | NLA-ExtraGrad
η_G | 0.0002 | 0.0002 | 0.0002
η_D | 0.0002 | 0.0002 | 0.0002
Adam β1 | 0.0 | 0.0 | 0.0
Batch-size | 256 | 256 | 256
Update ratio r | 5 | 5 | 5
Lookahead k_s | - | 5 | 5
Lookahead k_ss | - | - | 5000
Lookahead α | - | 0.5 | 0.5

The reported results on ImageNet were obtained using the hyperparameters listed in Tables 11 and 12.

I ADDITIONAL EXPERIMENTAL RESULTS

In Fig. 6 we compared the stability of LA-AltGAN methods against their AltGAN baselines on both the CIFAR-10 and SVHN datasets. Analogously, in Fig. 17 we report the comparison between LA-ExtraGrad and ExtraGrad over the iterations. We observe that the experiments on SVHN with ExtraGrad are more stable than those on CIFAR-10. Interestingly, we observe that: (i) LA-ExtraGrad improves both the stability and the performance of the baseline on CIFAR-10, see Fig. 17a, and (ii) when the stability of the baseline is relatively good, as on the SVHN dataset, LA-ExtraGrad still improves its performance, see Fig. 17b. For completeness to Fig. 5, in Fig. 18 we report the respective eigenvalue analyses on MNIST, which are summarized in the main paper.

Figure 17: Improved stability of LA-ExtraGrad relative to its ExtraGrad baseline on (a) CIFAR-10 and (b) SVHN, over 5 runs. The median and the individual runs are illustrated with thicker solid lines and with transparent lines, respectively. See I and H for discussion and details on the implementation, respectively.

I.1 COMPLETE BASELINE COMPARISON

In 5.2, for clarity of the presented results, we omitted uniform averaging of the iterates, as it underperforms the exponential moving average in our experiments. In this section, for completeness, we report the uniform-averaging results. Table 13 lists these results, including experiments using the RAdam (Liu et al., 2020) optimizer instead of Adam.

Figure 18: Analysis on MNIST at 100K iterations. (a) & (b): the largest 20 eigenvalues of the Hessian of the generator and of the discriminator, for AltGAN and LA-AltGAN. (c): the eigenvalues of the Jacobian of the joint vector field v(θ, ϕ), indicating no rotations at the point of convergence of LA-AltGAN (see 2).
Table 13: Comparison of the LA-GAN optimizer with its respective baselines AltGAN and ExtraGrad (see 5.1 for naming), using FID (lower is better) and IS (higher is better). EMA denotes exponential moving average (with fixed β = 0.9999, see F). The suffixes -R and -A denote the use of the RAdam (Liu et al., 2020) and Adam (Kingma & Ba, 2015) optimizers, respectively. Results are averaged over 5 runs. We ran each experiment on MNIST for 100K iterations, and for 500K iterations on the remaining datasets. See H and 5.2 for details on architectures and hyperparameters, and for a discussion of the results, respectively.

CIFAR-10
Method | FID, no avg | FID, uniform avg | FID, EMA | IS, no avg | IS, uniform avg | IS, EMA
AltGAN-R | 23.27 ±1.65 | 19.81 ±2.58 | 17.82 ±1.31 | 7.32 ±.30 | 8.14 ±.20 | 7.99 ±.13
LA-AltGAN-R | 17.31 ±.58 | 16.68 ±1.45 | 16.68 ±1.45 | 7.81 ±.09 | 8.61 ±.08 | 8.13 ±.07
ExtraGrad-R | 19.15 ±1.13 | 16.17 ±.63 | 15.40 ±.94 | 7.59 ±.13 | 8.55 ±.05 | 8.05 ±.15
LA-ExtraGrad-R | 15.38 ±.76 | 14.75 ±.61 | 14.99 ±.66 | 7.93 ±.05 | 8.51 ±.08 | 8.01 ±.09
AltGAN-A | 21.37 ±1.60 | 19.25 ±1.72 | 16.92 ±1.16 | 7.41 ±.16 | 8.23 ±.17 | 8.03 ±.13
LA-AltGAN-A | 16.74 ±.46 | 15.02 ±.81 | 13.98 ±.47 | 8.05 ±.43 | 8.45 ±.32 | 8.19 ±.05
ExtraGrad-A | 18.49 ±.99 | 16.22 ±1.59 | 15.47 ±1.82 | 7.61 ±.07 | 8.46 ±.08 | 8.05 ±.09
LA-ExtraGrad-A | 15.25 ±.30 | 14.95 ±.44 | 14.68 ±.30 | 7.99 ±.03 | 8.13 ±.18 | 8.04 ±.04
Unrolled-GAN-A | 21.04 ±1.08 | 18.25 ±1.60 | 17.51 ±1.08 | 7.43 ±.07 | 8.26 ±.15 | 7.88 ±.12

SVHN
AltGAN-A | 7.84 ±1.21 | 10.83 ±3.20 | 6.83 ±2.88 | 3.10 ±.09 | 3.12 ±.14 | 3.19 ±.09
LA-AltGAN-A | 3.87 ±.09 | 10.84 ±1.04 | 3.28 ±.09 | 3.16 ±.02 | 3.38 ±.09 | 3.22 ±.08
ExtraGrad-A | 4.08 ±.11 | 8.89 ±1.07 | 3.22 ±.09 | 3.21 ±.02 | 3.21 ±.04 | 3.16 ±.02
LA-ExtraGrad-A | 3.20 ±.09 | 7.66 ±1.54 | 3.16 ±.14 | 3.20 ±.02 | 3.32 ±.13 | 3.19 ±.03

MNIST
AltGAN-A | .094 ±.006 | .167 ±.033 | .031 ±.002 | 8.92 ±.01 | 8.88 ±.02 | 8.99 ±.01
LA-AltGAN-A | .053 ±.004 | .176 ±.024 | .029 ±.002 | 8.93 ±.01 | 8.92 ±.01 | 8.96 ±.02
ExtraGrad-A | .094 ±.013 | .182 ±.024 | .032 ±.003 | 8.90 ±.01 | 8.88 ±.03 | 8.98 ±.01
LA-ExtraGrad-A | .053 ±.005 | .180 ±.024 | .032 ±.002 | 8.91 ±.01 | 8.92 ±.02 | 8.95 ±.01
Unrolled-GAN-A | .077 ±.006 | .224 ±.016 | .030 ±.002 | 8.91 ±.02 | 8.91 ±.02 | 8.99 ±.01

I.2 SAMPLES OF LA-GAN GENERATORS

In this section we show random samples from the generators of our LA-GAN experiments trained on ImageNet, CIFAR-10 and SVHN. The samples in Fig. 19 are generated by our best-performing LA-AltGAN models trained on CIFAR-10. Similarly, Figs. 20 and 21 depict such samples from generators trained on ImageNet and SVHN, respectively.

Figure 19: Samples generated by our best-performing trained generator on CIFAR-10, using LA-AltGAN and exponential moving average (EMA) on the slow weights. The obtained FID score is 12.193.

Figure 20: Images generated by our best model trained on 32×32 ImageNet, obtained with LA-AltGAN and EMA of the slow weights, yielding an FID of 12.44.

Figure 21: Images generated by one of our best LA-ExtraGrad & EMA models (FID of 2.94), trained on SVHN.