Published as a conference paper at ICLR 2022

MINIMAX OPTIMIZATION WITH SMOOTH ALGORITHMIC ADVERSARIES

Tanner Fiez (currently with Amazon), Lillian J. Ratliff
University of Washington, Seattle
{fiezt, ratliffl}@uw.edu

Chi Jin
Princeton University
chij@princeton.edu

Praneeth Netrapalli
Google Research, India
pnetrapalli@google.com

This paper considers minimax optimization $\min_x \max_y f(x, y)$ in the challenging setting where $f$ can be both nonconvex in $x$ and nonconcave in $y$. Though such optimization problems arise in many machine learning paradigms, including training generative adversarial networks (GANs) and adversarially robust models, from a theoretical point of view two fundamental issues remain: (i) the absence of simple and efficiently computable optimality notions, and (ii) cyclic or diverging behavior of existing algorithms. This paper proposes a new theoretical framework for nonconvex-nonconcave minimax optimization that addresses both of the above issues. The starting point of this paper is the observation that, under a computational budget, the max-player cannot fully maximize $f(x, \cdot)$ since nonconcave maximization is NP-hard in general. So, we propose a new framework, and a corresponding algorithm, for the min-player to play against smooth algorithms deployed by the adversary (i.e., the max-player) instead of against full maximization. Our algorithm is guaranteed to make monotonic progress (thus having no limit cycles or diverging behavior), and to find an appropriate stationary point in a polynomial number of iterations. Our framework covers practically relevant settings where the smooth algorithms deployed by the adversary are multi-step stochastic gradient ascent and its accelerated version. We further present experimental results that confirm our theoretical findings and demonstrate the effectiveness of the proposed approach in practice on simple, conceptual settings.

1 INTRODUCTION

This paper considers minimax optimization $\min_x \max_y f(x, y)$ in the context of two-player zero-sum games, where the min-player (controlling $x$) tries to minimize the objective $f$ assuming a worst-case opponent (controlling $y$) that acts so as to maximize it. Minimax optimization naturally arises in a variety of important machine learning paradigms, with the most prominent examples being the training of generative adversarial networks (GANs) (Goodfellow et al., 2014) and adversarially robust models (Madry et al., 2018). These applications commonly engage deep neural networks with various techniques such as convolution, recurrent layers, and batch normalization. As a result, the objective function $f$ is highly nonconvex in $x$ and nonconcave in $y$.

Theoretically, minimax optimization has been extensively studied starting from the seminal work of von Neumann (Neumann, 1928), with many efficient algorithms proposed for solving it (Robinson, 1951; Korpelevich, 1976; Nemirovski, 2004). A majority of these classical results have focused on convex-concave functions and heavily rely on the minimax theorem, i.e., $\min_x \max_y f(x, y) = \max_y \min_x f(x, y)$, which no longer holds beyond the convex-concave setting. A recent line of works (Lin et al., 2020a; Nouiehed et al., 2019; Thekumparampil et al., 2019; Lin et al., 2020b; Ostrovskii et al., 2020) addresses the nonconvex-concave setting where $f$ is nonconvex in $x$ but concave
in $y$ by proposing meaningful optimality notions and designing algorithms to find such points. A key property heavily exploited in this setting is that the inner maximization over $y$ for a fixed $x$ can be computed efficiently, which unfortunately does not extend to the nonconvex-nonconcave setting. Consequently, nonconvex-nonconcave optimization remains challenging, and two of the most fundamental theoretical issues are still unresolved: (i) what is an appropriate notion of optimality that can be computed efficiently? and (ii) can we design algorithms that do not suffer from cycling or diverging behavior?

Practitioners often use simple and popular algorithms such as gradient descent ascent (GDA) and other variants for solving these challenging optimization problems. While these algorithms seem to perform well in terms of producing high quality images in GANs and robust models in adversarial training, they are also highly unstable, particularly in training GANs. Indeed, the instability of GDA and other empirically popular methods is not surprising, since they are known to not converge even in very simple settings (Daskalakis & Panageas, 2018a; Bailey et al., 2020). This state of affairs strongly motivates the need to develop a strong theoretical foundation for nonconvex-nonconcave minimax optimization and to design better algorithms for solving such problems.

This work considers the challenging nonconvex-nonconcave setting. Our framework sprouts from the practical consideration that, under a computational budget, the max-player cannot fully maximize $f(x, \cdot)$ since nonconcave maximization is NP-hard in general. In fact, in both GAN training and adversarial training, the max-player in practice employs simple gradient-based algorithms such as gradient ascent, run for a small number of steps (on the order of 5 for GAN training and 40 for adversarial training; see, e.g., Arjovsky et al. 2017; Madry et al. 2018) to estimate $\operatorname{argmax}_y f(x, y)$. To capture this aspect, we assume that the max-player has a toolkit of multiple (potentially randomized) algorithms $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_k$ that attempt to solve the maximization problem for a fixed $x$, and picks the best solution among these algorithms. This motivates us to study the following surrogate of the minimax optimization problem:
$$\min_x \max_{i \in [k]} f(x, \mathcal{A}_i(x)) \;=\; \min_x \max_{\lambda \in \Delta_k} \sum_{i=1}^{k} \lambda_i f(x, \mathcal{A}_i(x)), \qquad (1)$$
where $\Delta_k$ denotes the $k$-dimensional simplex and $\mathcal{A}_i(x)$ denotes the output of algorithm $\mathcal{A}_i$ for a given $x$. When both the objective function $f$ and the algorithms $\{\mathcal{A}_i\}_{i=1}^{k}$ are smooth (defined formally in Section 3), we can show that (1) becomes a smooth nonconvex-concave minimax optimization problem, where recent advances can be leveraged.

In particular, given the smooth algorithms deployed by the adversary (i.e., the max-player), this paper proposes two algorithms for solving problems of the form (1). The first algorithm is based on stochastic gradient descent (SGD) and is guaranteed to find an appropriate notion of $\epsilon$-approximate stationary point in $O(\epsilon^{-4})$ gradient computations. The second algorithm is a proximal algorithm; in the case of deterministic adversarial algorithms $\{\mathcal{A}_i\}_{i=1}^{k}$, it has an improved gradient complexity of $O(\epsilon^{-3})$ or $O(\mathrm{poly}(k)/\epsilon^{2})$ depending on the choice of subroutine within the algorithm. All our algorithms are guaranteed to make monotonic progress, thus having no limit cycles.
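To make the surrogate objective (1) concrete, the following is a minimal PyTorch sketch of how the min-player's objective can be evaluated against a small toolkit of adversary algorithms. The quadratic-plus-cosine objective and the two gradient-ascent adversaries are toy illustrative choices, not the paper's experiments.

```python
import torch

def f(x, y):
    # toy nonconvex-nonconcave objective (illustrative only, not from the paper)
    return torch.sum(x * y) + 0.1 * torch.sum(x ** 2) - 0.1 * torch.sum(torch.cos(y))

def gradient_ascent(x, y0, steps, lr):
    # one "algorithm" in the max-player's toolkit: `steps` iterations of gradient ascent on f(x, .)
    y = y0.clone().requires_grad_(True)
    for _ in range(steps):
        gy, = torch.autograd.grad(f(x, y), y)
        y = (y + lr * gy).detach().requires_grad_(True)
    return y.detach()

def surrogate(x, toolkit):
    # max_i f(x, A_i(x)) from eq. (1): run every algorithm and keep the best response found
    return torch.stack([f(x, A(x)) for A in toolkit]).max()

x = torch.randn(3)
toolkit = [
    lambda x: gradient_ascent(x, torch.zeros(3), steps=5, lr=0.1),
    lambda x: gradient_ascent(x, torch.ones(3), steps=20, lr=0.05),
]
print(surrogate(x, toolkit))
```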
Our second set of results shows that many popular algorithms deployed by the adversary, such as multi-step stochastic gradient ascent and multi-step stochastic Nesterov's accelerated gradient ascent, are in fact smooth. Therefore, our framework readily applies to those settings in practice. We also present complementing experimental results showing that our theoretical framework and algorithm succeed on simple, conceptual examples of GAN and adversarial training. While more work is needed to scale our approach to large-scale benchmarks in GAN and adversarial training, our experimental results serve as a proof-of-concept demonstration that our algorithm converges to desirable points in practice, at least on simple conceptual examples.

2 RELATED WORK

Due to lack of space, we focus our discussion here on directly related works and present a more detailed overview of related work in Appendix A. Many existing works that propose local optimality notions in nonconvex-nonconcave minimax optimization suffer from nonexistence of such points in general (Ratliff et al., 2013; Ratliff et al., 2016; Jin et al., 2020; Fiez et al., 2020; Zhang et al., 2020a; Farnia & Ozdaglar, 2020). To the best of our knowledge, the only works that propose a relaxed local optimality notion that is shown to exist and be computable in polynomial time are due to Keswani et al. (2020); Mangoubi & Vishnoi (2021). The aforementioned works are similar to this paper in the sense that the min-player faces a max-player with computational restrictions, but are different from ours in terms of the model of the max-player and the algorithms to solve the problem. Concretely, Keswani et al. (2020); Mangoubi & Vishnoi (2021) consider the stationary points of a "smoothed" greedy-max function, which is computed by maximizing along a locally ascending path when fixing the min-player. In contrast, our paper considers the stationary points of a "max" function computed by particular smooth algorithms given their initializations, which is a more faithful surrogate to the objective that the min-player wishes to minimize. Depending on the choice of smooth algorithms and their initializations, the optimal points of Keswani et al. (2020); Mangoubi & Vishnoi (2021) and the optimal points in our paper can be very different, making them incomparable.

3 PRELIMINARIES

In this section, we present the problem formulation and preliminaries. We consider functions $f$ satisfying the following assumption.

Assumption 1. We denote $w = (x, y)$, and assume $f : \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}$ is:
(a) $B$-bounded, i.e., $|f(w)| \le B$;
(b) $G$-Lipschitz, i.e., $|f(w_1) - f(w_2)| \le G \|w_1 - w_2\|$;
(c) $L$-gradient Lipschitz, i.e., $\|\nabla f(w_1) - \nabla f(w_2)\| \le L \|w_1 - w_2\|$;
(d) $\rho$-Hessian Lipschitz, i.e., $\|\nabla^2 f(w_1) - \nabla^2 f(w_2)\| \le \rho \|w_1 - w_2\|$;
where $\|\cdot\|$ denotes the Euclidean norm for vectors and the operator norm for matrices.

We aim to solve $\min_{x \in \mathbb{R}^{d_1}} \max_{y \in \mathbb{R}^{d_2}} f(x, y)$. Since $\max_{y \in \mathbb{R}^{d_2}} f(x, y)$ involves nonconcave maximization and hence is NP-hard in the worst case, we intend to play against the algorithm(s) that the $y$-player uses to compute her strategy. Concretely, given $x \in \mathbb{R}^{d_1}$, we assume that the $y$-player chooses her (potentially random) strategy as
$$\hat{y}_z(x) = \mathcal{A}_{i^*(x)}(x, z_{i^*(x)}), \quad \text{where} \quad i^*(x) = \operatorname{argmax}_{i \in [k]} f(x, \mathcal{A}_i(x, z_i)),$$
where we use the shorthand $z := (z_1, \ldots, z_k)$, and $\mathcal{A}_1, \ldots, \mathcal{A}_k$ are $k$ deterministic algorithms that take as input $x$ and a random seed $z_i \in \mathbb{R}^{\ell}$, with the $z_i$ all independent. Note that the framework captures randomized algorithms: e.g., $\mathcal{A}$ could be stochastic gradient ascent on $f(x, \cdot)$, with initialization, minibatching, etc. determined by the random seed $z$.
This also incorporates running the same algorithm multiple times with different seeds and then choosing the best strategy. We now reformulate the minimax objective as:
$$\min_{x \in \mathbb{R}^{d_1}} g(x) \quad \text{where} \quad g(x) := \mathbb{E}_z\big[f(x, \hat{y}_z(x))\big]. \qquad (2)$$
For general algorithms $\mathcal{A}_i$, the functions $f(x, \mathcal{A}_i(x, z_i))$ need not be continuous even when $f$ satisfies Assumption 1. However, if the algorithms $\mathcal{A}_i$ are smooth as defined below, the functions $f(x, \mathcal{A}_i(x, z_i))$ behave much more nicely.

Definition 1 (Algorithm Smoothness). A randomized algorithm $\mathcal{A} : \mathbb{R}^{d_1} \times \mathbb{R}^{\ell} \to \mathbb{R}^{d_2}$ is:
(a) $G'$-Lipschitz, if $\|\mathcal{A}(x_1, z) - \mathcal{A}(x_2, z)\| \le G' \|x_1 - x_2\|$ for any $z$;
(b) $L'$-gradient Lipschitz, if $\|D\mathcal{A}(x_1, z) - D\mathcal{A}(x_2, z)\| \le L' \|x_1 - x_2\|$ for any $z$.
Here $D\mathcal{A}(x, z) \in \mathbb{R}^{d_1 \times d_2}$ is the Jacobian of the function $\mathcal{A}(\cdot, z)$ for a fixed $z$.

The following lemma tells us that $f(x, \mathcal{A}(x, z))$ behaves nicely whenever $\mathcal{A}$ is a Lipschitz and gradient Lipschitz algorithm. For deterministic algorithms, we also use the shortened notation $\mathcal{A}(x)$ and $D\mathcal{A}(x)$.

Lemma 1. Suppose $\mathcal{A}$ is $G'$-Lipschitz and $L'$-gradient Lipschitz and $f$ satisfies Assumption 1. Then, for a fixed $z$, the function $f(\cdot, \mathcal{A}(\cdot, z))$ is $G(1 + G')$-Lipschitz and $\big(L(1 + G')^2 + GL'\big)$-gradient Lipschitz.

While $g(x)$ defined in (2) is not necessarily gradient Lipschitz, it can be shown to be weakly convex as defined below. Note that an $L$-gradient Lipschitz function is $L$-weakly convex.

Definition 2. A function $g : \mathbb{R}^{d_1} \to \mathbb{R}$ is $L$-weakly convex if for every $x$ there exists a vector $u_x$ satisfying:
$$g(x') \ge g(x) + \langle u_x, x' - x \rangle - \tfrac{L}{2} \|x' - x\|^2 \quad \forall x'. \qquad (3)$$
Any vector $u_x$ satisfying this property is called a subgradient of $g$ at $x$, and the set of such vectors is denoted by $\partial g(x)$.

An important property of weakly convex functions is that the maximum of a finite number of weakly convex functions is still weakly convex.

Lemma 2. Given $L$-weakly convex functions $g_1, \ldots, g_k : \mathbb{R}^d \to \mathbb{R}$, the maximum function $g(\cdot) := \max_{i \in [k]} g_i(\cdot)$ is also $L$-weakly convex, and the set of subgradients of $g(\cdot)$ at $x$ is given by
$$\Big\{ \textstyle\sum_{j \in S(x)} \lambda_j \nabla g_j(x) : \lambda_j \ge 0,\ \sum_{j \in S(x)} \lambda_j = 1 \Big\}, \quad \text{where} \quad S(x) := \operatorname{argmax}_{i \in [k]} g_i(x).$$

Thus, under Assumption 1 and the assumption that the $\mathcal{A}_i$ are all $G'$-Lipschitz and $L'$-gradient Lipschitz, $g(\cdot)$ defined in (2) is $\big(L(1 + G')^2 + GL'\big)$-weakly convex. The usual optimality notion for weakly convex functions is that of approximate first-order stationary points (Davis & Drusvyatskiy, 2018).

Approximate first-order stationary points for weakly convex functions: In order to define approximate stationary points, we also need the notion of the Moreau envelope.

Definition 3. The Moreau envelope of a function $g : \mathbb{R}^{d_1} \to \mathbb{R}$ with parameter $\lambda$ is:
$$g_\lambda(x) = \min_{x' \in \mathbb{R}^{d_1}} g(x') + (2\lambda)^{-1} \|x - x'\|^2. \qquad (4)$$
The following lemma provides useful properties of the Moreau envelope.

Lemma 3. For an $L$-weakly convex function $g : \mathbb{R}^{d_1} \to \mathbb{R}$ and $\lambda < 1/L$, we have:
(a) the minimizer $\hat{x}_\lambda(x) = \operatorname{argmin}_{x' \in \mathbb{R}^{d_1}} g(x') + (2\lambda)^{-1} \|x - x'\|^2$ is unique and $g(\hat{x}_\lambda(x)) \le g_\lambda(x) \le g(x)$; furthermore, $\operatorname{argmin}_x g(x) = \operatorname{argmin}_x g_\lambda(x)$;
(b) $g_\lambda$ is $\lambda^{-1}\big(1 + (1 - \lambda L)^{-1}\big)$-smooth and thus differentiable; and
(c) $\min_{u \in \partial g(\hat{x}_\lambda(x))} \|u\| \le \lambda^{-1} \|\hat{x}_\lambda(x) - x\| = \|\nabla g_\lambda(x)\|$.
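The following is a small one-dimensional numerical sketch of Definition 3 and Lemma 3(c): the function $g$ below is a max of two smooth pieces (hence weakly convex), and the prox subproblem is solved by a grid search purely for illustration; none of these choices come from the paper.

```python
import numpy as np

# 1-D illustration of the Moreau envelope (Definition 3) and Lemma 3(c).
# g is the max of two smooth functions, hence weakly convex (here L = 2);
# the grid-based prox solve below is purely illustrative.
g1 = lambda x: (x - 1.0) ** 2
g2 = lambda x: 1.0 - x ** 2                  # concave piece, so g is 2-weakly convex
g = lambda x: np.maximum(g1(x), g2(x))

L = 2.0
lam = 1.0 / (2 * L)                          # the parameter used in Definition 4
grid = np.linspace(-4.0, 4.0, 400001)

def moreau(x):
    # g_lambda(x) = min_{x'} g(x') + ||x - x'||^2 / (2*lambda), solved by grid search
    vals = g(grid) + (grid - x) ** 2 / (2 * lam)
    j = np.argmin(vals)
    return vals[j], grid[j]                  # envelope value and prox point x_hat

x = 0.3
env, x_hat = moreau(x)
grad_env = (x - x_hat) / lam                 # Lemma 3(c): ||grad g_lambda(x)|| = ||x_hat - x|| / lambda
h = 1e-4
finite_diff = (moreau(x + h)[0] - moreau(x - h)[0]) / (2 * h)
print(env, grad_env, finite_diff)            # the two gradient estimates agree
```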
First-order stationary points (FOSPs) of a nonsmooth nonconvex function are well defined: $x^*$ is a FOSP of a function $g$ if $0 \in \partial g(x^*)$. However, unlike for smooth functions, it is nontrivial to define an approximate FOSP. For example, if we define an $\varepsilon$-FOSP as a point $x$ with $\min_{u \in \partial g(x)} \|u\| \le \varepsilon$, there may never exist such a point for sufficiently small $\varepsilon$, unless $x$ is exactly a FOSP. In contrast, by using the above properties of the Moreau envelope of a weakly convex function, its approximate FOSPs can be defined as (Davis & Drusvyatskiy, 2018):

Definition 4. Given an $L$-weakly convex function $g$, we say that $x^*$ is an $\varepsilon$-first-order stationary point ($\varepsilon$-FOSP) if $\|\nabla g_{1/2L}(x^*)\| \le \varepsilon$, where $g_{1/2L}$ is the Moreau envelope with parameter $1/2L$.

Using Lemma 3, we can show that for any $\varepsilon$-FOSP $x^*$, there exists $\hat{x}$ such that $\|\hat{x} - x^*\| \le \varepsilon/2L$ and $\min_{u \in \partial g(\hat{x})} \|u\| \le \varepsilon$. In other words, an $\varepsilon$-FOSP is $O(\varepsilon)$ close to a point $\hat{x}$ which has a subgradient smaller than $\varepsilon$. Other notions of FOSP proposed recently, such as in Nouiehed et al. (2019), can be shown to be strict generalizations of the above definition.

4 MAIN RESULTS

In this section, we present our main results. Assuming that the adversary employs Lipschitz and gradient-Lipschitz algorithms (Assumption 2), Section 4.1 shows how to compute (stochastic) subgradients of $g(\cdot)$ (defined in (2)) efficiently. Section 4.2 further shows that stochastic subgradient descent (SGD) on $g(\cdot)$ can find an $\epsilon$-FOSP in $O(\epsilon^{-4})$ iterations, while for the deterministic setting, where the adversary uses only deterministic algorithms, Section 4.3 provides a proximal algorithm that can find an $\epsilon$-FOSP faster than SGD. For convenience, we denote $g_{z,i}(x) := f(x, \mathcal{A}_i(x, z_i))$ and recall $g(x) := \mathbb{E}_z\big[\max_{i \in [k]} g_{z,i}(x)\big]$. For deterministic $\mathcal{A}_i$, we drop $z$ and just use $g_i(x)$.

4.1 COMPUTING STOCHASTIC SUBGRADIENTS OF g(x)

In this section, we give a characterization of the subgradients of $g(x)$ and show how to compute stochastic subgradients efficiently under the following assumption.

Assumption 2. The algorithms $\mathcal{A}_i$ in (2) are $G'$-Lipschitz and $L'$-gradient Lipschitz as per Definition 1.

Under Assumptions 1 and 2, Lemma 1 tells us that $g_{z,i}(x)$ is a $G(1+G')$-Lipschitz and $\big(L(1+G')^2 + GL'\big)$-gradient Lipschitz function for every $i \in [k]$, with
$$\nabla g_{z,i}(x) = \nabla_x f(x, \mathcal{A}_i(x, z_i)) + D\mathcal{A}_i(x, z_i)\, \nabla_y f(x, \mathcal{A}_i(x, z_i)), \qquad (5)$$
where we recall that $D\mathcal{A}_i(x, z_i) \in \mathbb{R}^{d_1 \times d_2}$ is the Jacobian matrix of $\mathcal{A}_i(\cdot, z_i) : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ at $x$, and $\nabla_x f(x, \mathcal{A}_i(x, z_i))$ denotes the partial derivative of $f$ with respect to the first variable at $(x, \mathcal{A}_i(x, z_i))$. While there is no known general recipe for computing $D\mathcal{A}_i(x, z_i)$ for an arbitrary algorithm $\mathcal{A}_i$, most algorithms used in practice, such as stochastic gradient ascent (SGA), stochastic Nesterov accelerated gradient (SNAG), and ADAM, admit efficient ways of computing these derivatives, e.g., via the higher package in PyTorch (Grefenstette et al., 2019). For concreteness, we obtain expressions for the gradients of SGA and SNAG in Section 5, but the principle behind the derivation holds much more broadly and can be extended to most algorithms used in practice (Grefenstette et al., 2019). In practice, the cost of computing $\nabla g_{z,i}(x)$ in (5) is at most twice the cost of evaluating $g_{z,i}(x)$: it consists of a forward pass for evaluating $g_{z,i}(x)$ and a backward pass for evaluating its gradient (Grefenstette et al., 2019).

Lemma 2 shows that $g(x) := \mathbb{E}_z\big[\max_{i \in [k]} g_{z,i}(x)\big]$ is a weakly convex function, and a stochastic subgradient of $g(\cdot)$ can be computed by generating a random sample of $z_1, \ldots, z_k$ and forming
$$\widehat{\nabla} g(x) = \textstyle\sum_{j \in S(x)} \lambda_j \nabla g_{z,j}(x) \ \ \text{for any } \lambda \in \Delta_k \text{ supported on } S(x), \ \ \text{where } S(x) := \operatorname{argmax}_{i \in [k]} g_{z,i}(x). \qquad (6)$$
Here $\Delta_k$ is the $k$-dimensional probability simplex. It can be seen that $\mathbb{E}_z[\widehat{\nabla} g(x)] \in \partial g(x)$. Furthermore, if all $\mathcal{A}_i$ are deterministic algorithms, then the above is indeed a subgradient of $g$.
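A minimal sketch of how the gradient in (5) can be obtained in practice by differentiating through an unrolled adversary with PyTorch autograd (in the spirit of the higher package referenced above). The toy objective and the T-step gradient ascent adversary are illustrative assumptions, not the paper's experimental setup; the sketch also contrasts the total gradient with the partial gradient $\nabla_x f$.

```python
import torch

def f(x, y):
    # toy smooth objective; the adversary below runs T steps of gradient ascent on f(x, .)
    return torch.sum(x * y) - 0.05 * torch.sum(y ** 2) + 0.1 * torch.sum(torch.sin(x))

def adversary(x, y0, steps=10, lr=0.1):
    # unrolled T-step gradient ascent; create_graph=True keeps the dependence of y_T on x,
    # which is what makes the Jacobian term DA(x) grad_y f in (5) available to autograd
    y = y0
    for _ in range(steps):
        gy, = torch.autograd.grad(f(x, y), y, create_graph=True)
        y = y + lr * gy
    return y

x = torch.randn(4, requires_grad=True)
y0 = torch.zeros(4, requires_grad=True)          # fixed seed z: initialization independent of x
y_T = adversary(x, y0)

total, = torch.autograd.grad(f(x, y_T), x)             # grad_x f + DA(x) grad_y f, as in (5)
partial, = torch.autograd.grad(f(x, y_T.detach()), x)  # grad_x f only, for comparison
print(total, partial)
```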
4.2 CONVERGENCE RATE OF SGD

The SGD algorithm to solve (2) is given in Algorithm 1.

Algorithm 1: Stochastic subgradient descent (SGD)
Input: initial point $x_0$, step size $\eta$
1: for $s = 0, 1, \ldots, S$ do
2:   Sample $z_1, \ldots, z_k$ and compute $\widehat{\nabla} g(x_s)$ according to eq. (6).
3:   $x_{s+1} \leftarrow x_s - \eta\, \widehat{\nabla} g(x_s)$.
4: return $\bar{x} = x_{\bar{s}}$, where $\bar{s}$ is uniformly sampled from $\{0, \ldots, S\}$.

The following theorem shows that Algorithm 1 finds an $\epsilon$-FOSP of $g(\cdot)$ in $S = O(\epsilon^{-4})$ iterations. Since each iteration of Algorithm 1 requires computing the gradient of $g_{z,i}(x)$ for each $i \in [k]$ at a single point $x_s$, this leads to a total of $S = O(\epsilon^{-4})$ gradient computations for each $g_{z,i}(x)$.

Theorem 1. Under Assumptions 1 and 2, if $S \ge 16 B \hat{L} \hat{G}^2 \epsilon^{-4}$ and the learning rate $\eta = \big(2B/[\hat{L} \hat{G}^2 (S+1)]\big)^{1/2}$, then the output of Algorithm 1 satisfies $\mathbb{E}\big[\|\nabla g_{1/2\hat{L}}(\bar{x})\|^2\big] \le \epsilon^2$, where $\hat{L} := L(1+G')^2 + GL'$ and $\hat{G} := G(1+G')$.

Theorem 1 claims that the expected squared norm of the Moreau envelope gradient at the output point satisfies the condition for being an $\epsilon$-FOSP. This, together with Markov's inequality, implies that at least half of $x_0, \ldots, x_S$ are $2\epsilon$-FOSPs, so that the probability of outputting a $2\epsilon$-FOSP is at least $0.5$. Proposition 1 in Appendix C shows how to use an efficient postprocessing mechanism to output a $2\epsilon$-FOSP with high probability.

Theorem 1 essentially follows from the results of Davis & Drusvyatskiy (2018), where the key insight is that, in expectation, the SGD procedure in Algorithm 1 almost monotonically decreases the Moreau envelope evaluated at $x_s$, i.e., $\mathbb{E}\big[g_{1/2\hat{L}}(x_s)\big]$ is almost monotonically decreasing. This shows that Algorithm 1 makes (almost) monotonic progress in a precise sense and hence does not have limit cycles. In contrast, none of the other existing algorithms for nonconvex-nonconcave minimax optimization enjoy such a guarantee. Note that this claim includes extra-gradient and optimistic gradient methods, which have been touted as empirically mitigating cycling behavior; see Appendix F for examples demonstrating nonconvergence of these methods.

4.3 PROXIMAL ALGORITHM WITH FASTER CONVERGENCE FOR DETERMINISTIC ALGORITHMS

While the rate achieved by SGD is the best known for weakly convex optimization with a stochastic subgradient oracle, faster algorithms exist for functions which can be written as a maximum over a finite number of smooth functions with access to exact subgradients of these component functions. These conditions are satisfied when the $\mathcal{A}_i$ are all deterministic and satisfy Assumption 2. Pseudocode for such a fast proximal algorithm, inspired by Thekumparampil et al. (2019, Algorithm 3), is presented in Algorithm 2. However, in contrast to the results of Thekumparampil et al. (2019), the following theorem provides two alternate ways of implementing Step 3 of Algorithm 2, resulting in two different (and incomparable) convergence rates.

Algorithm 2: Proximal algorithm
Input: initial point $x_0$, target accuracy $\epsilon$, smoothness parameter $\hat{L}$
1: $\hat{\epsilon} \leftarrow \epsilon^2/(64\hat{L})$
2: for $s = 0, 1, \ldots, S$ do
3:   Find $x_{s+1}$ such that $\max_{i \in [k]} g_i(x_{s+1}) + \hat{L}\|x_s - x_{s+1}\|^2 \le \min_x \max_{i \in [k]} g_i(x) + \hat{L}\|x_s - x\|^2 + \hat{\epsilon}/4$
4:   if $\max_{i \in [k]} g_i(x_{s+1}) + \hat{L}\|x_s - x_{s+1}\|^2 \ge \max_{i \in [k]} g_i(x_s) - 3\hat{\epsilon}/4$ then
5:     return $x_s$

Theorem 2. Under Assumptions 1 and 2, if $\hat{L} := L(1+G')^2 + GL' + kG(1+G')$ and $S \ge 200 \hat{L} B \epsilon^{-2}$, then Algorithm 2 returns $x_s$ satisfying $\|\nabla g_{1/2\hat{L}}(x_s)\| \le \epsilon$. Depending on whether we use Thekumparampil et al. (2019, Algorithm 1) or the cutting plane method (Lee et al., 2015) for solving Step 3 of Algorithm 2, the total number of gradient computations of each $g_i$ is $\widetilde{O}(\epsilon^{-3})$ or $\mathrm{poly}(k)\,\widetilde{O}(\epsilon^{-2}\log(\hat{L}/\epsilon))$ respectively, where $\widetilde{O}(\cdot)$ hides polynomial factors in $\hat{L}$ and $B$.

Ignoring the parameters $L$, $G$, $L'$ and $G'$, the above theorem tells us that Algorithm 2 outputs an $\epsilon$-FOSP using $O(k\epsilon^{-3})$ or $O\big(\mathrm{poly}(k)\,\epsilon^{-2}\log(1/\epsilon)\big)$ gradient queries to each $g_i$, depending on whether Thekumparampil et al. (2019, Algorithm 3) or the cutting plane method (Lee et al., 2015) was used for implementing Step 3. While the proximal algorithm itself works even when the $\mathcal{A}_i$ are randomized algorithms, there are no known algorithms that can implement Step 3 with fewer than $O(\epsilon^{-2})$ stochastic gradient queries. Hence, this does not improve upon the $O(\epsilon^{-4})$ guarantee for Algorithm 1 when the $\mathcal{A}_i$ are randomized algorithms. The proof of Theorem 2 shows that the iterates $x_s$ monotonically decrease the value $g(x_s)$, guaranteeing that there are no limit cycles for Algorithm 2 as well.
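A schematic sketch of the outer loop of Algorithm 2 follows. Here Step 3 is solved by plain gradient descent on the regularized max function, standing in for the accelerated solver of Thekumparampil et al. (2019) or the cutting plane method, so the rates of Theorem 2 are not claimed for this sketch; the component functions $g_i$, the value of $\hat{L}$, and the iteration budgets are toy assumptions.

```python
import torch

# Schematic version of Algorithm 2 for deterministic adversaries. Step 3 is solved here
# by plain gradient descent on the strongly convex regularized objective, standing in
# for the faster inner solvers referenced in Theorem 2, so no rate is claimed.
torch.manual_seed(0)
mats = [torch.randn(3, 3) for _ in range(3)]
gs = [lambda x, M=M: 0.1 * torch.sum((M @ x) * x) + torch.sum(torch.sin(x)) for M in mats]
g_max = lambda x: torch.stack([g(x) for g in gs]).max()    # max_i g_i(x)

L_hat, eps = 5.0, 1e-2
eps_hat = eps ** 2 / (64 * L_hat)

def prox_step(x_s, inner_iters=500, lr=0.01):
    # Step 3: approximately minimize max_i g_i(x) + L_hat * ||x_s - x||^2 over x
    x = x_s.clone().requires_grad_(True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(inner_iters):
        opt.zero_grad()
        (g_max(x) + L_hat * torch.sum((x - x_s) ** 2)).backward()
        opt.step()
    return x.detach()

x = torch.randn(3)
for s in range(100):
    x_next = prox_step(x)
    # Step 4: stop once the proximal step no longer decreases the value by 3*eps_hat/4
    if g_max(x_next) + L_hat * torch.sum((x - x_next) ** 2) >= g_max(x) - 3 * eps_hat / 4:
        break
    x = x_next
print(s, x, float(g_max(x)))
```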
5 SMOOTHNESS OF POPULAR ALGORITHMS

In this section, we show that two popular algorithms, namely $T$-step stochastic gradient ascent (SGA) and $T$-step stochastic Nesterov's accelerated gradient ascent (SNAG), are both Lipschitz and gradient Lipschitz, satisfying Assumption 2, and hence are captured by our results in Section 4. Consider the setting $f(x, y) = \frac{1}{n}\sum_{j \in [n]} f_j(x, y)$. Let $z$ be a random seed that captures the randomness in the initial point as well as the minibatch order in SGA and SNAG. We first provide the smoothness results on $T$-step SGA for different assumptions on the shape of the function $f$, and then for $T$-step SNAG. After giving these results, we make remarks interpreting their significance and implications.

T-step SGA: For a given $x$ and random seed $z$, the $T$-step SGA update is given by
$$y_{t+1} = y_t + \eta\, \nabla_y f_{\sigma(t)}(x, y_t),$$
where $\sigma : [T] \to [n]$ is a sample selection function and $\eta$ is the stepsize. Observe that with the same randomness $z$, the initial point does not depend on $x$, i.e., $y_0(x) = y_0(x')$, so $Dy_0 = 0$. The following theorems provide the Lipschitz and gradient Lipschitz constants of $y_T(x)$ (as generated by $T$-step SGA) for the general nonconvex-nonconcave setting as well as for the settings in which the function $f$ is nonconvex-concave and nonconvex-strongly-concave.

Theorem 3 (General Case). Suppose for all $j \in [n]$, $f_j$ satisfies Assumption 1. Then, for any fixed randomness $z$, $T$-step SGA is $(1 + \eta L)^T$-Lipschitz and $4(\rho/L)(1 + \eta L)^{2T}$-gradient Lipschitz.

Theorem 4 (Concave Case). Suppose for all $j \in [n]$, $f_j$ satisfies Assumption 1 and $f_j(x, \cdot)$ is concave for any $x$. Then, for any fixed randomness $z$, $T$-step SGA is $\eta L T$-Lipschitz and $(\rho/L)(1 + \eta L T)^3$-gradient Lipschitz.

Theorem 5 (Strongly-concave Case). Suppose for all $j \in [n]$, $f_j$ satisfies Assumption 1 and $f_j(x, \cdot)$ is $\alpha$-strongly concave for any $x$. Then, for any fixed randomness $z$, $T$-step SGA is $\kappa$-Lipschitz and $4(\rho/L)\kappa^3$-gradient Lipschitz, where $\kappa = L/\alpha$ is the condition number.

T-step SNAG: For a given random seed $z$, the $T$-step SNAG update is given by
$$\tilde{y}_t = y_t + (1 - \theta)(y_t - y_{t-1}) \quad \text{and} \quad y_{t+1} = \tilde{y}_t + \eta\, \nabla_y f_{\sigma(t)}(x, \tilde{y}_t),$$
where $\eta$ is the stepsize and $\theta \in [0, 1]$ is the momentum parameter. The output of the algorithm is given by $\mathcal{A}(x, z) = y_T(x)$. Furthermore, we have the following guarantee.

Theorem 6 (General Case). Suppose for all $j \in [n]$, $f_j$ satisfies Assumption 1. Then, for any fixed seed $z$, $T$-step SNAG is $T(1 + \eta L/\theta)^T$-Lipschitz and $50(\rho/L)\, T^3 (1 + \eta L/\theta)^{2T}$-gradient Lipschitz.
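The map $x \mapsto y_T(x)$ defined by the updates above can also be probed empirically. The following sketch estimates the Lipschitz constant of $T$-step (full-batch, deterministic) gradient ascent by finite differences and compares it with the worst-case $(1+\eta L)^T$ factor from Theorem 3; the toy objective is our own choice and is strongly concave in $y$, so the tighter Theorem 5 also applies, and the point is simply that the empirical constant can sit far below the worst-case bound.

```python
import torch

# Probe the smoothness of the T-step gradient ascent map x -> y_T(x) by finite
# differences and compare with the worst-case (1 + eta*L)^T factor from Theorem 3.
def grad_y_f(x, y):
    # gradient in y of f(x, y) = <x, y> - 0.5*||y||^2 + 0.1*sum(cos(y)); here L <= 2
    return x - y - 0.1 * torch.sin(y)

def y_T(x, T=20, eta=0.1):
    y = torch.zeros_like(x)          # fixed randomness z: same initialization for every x
    for _ in range(T):
        y = y + eta * grad_y_f(x, y)
    return y

torch.manual_seed(0)
x = torch.randn(5)
dx = 1e-4 * torch.randn(5)
empirical = float(torch.norm(y_T(x + dx) - y_T(x)) / torch.norm(dx))
worst_case = (1 + 0.1 * 2.0) ** 20   # (1 + eta*L)^T with L = 2, T = 20
print(empirical, worst_case)          # the empirical constant is typically far below the bound
```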
In general, this seems unavoidable in the worst case for the above algorithms and also seems to be the case for most of the other popular algorithms such as ADAM, RMSProp, etc. In contrast, in the nonconvex-concave and nonconvex-strongly concave settings, our results show that the smoothness parameters of T-step SGA are no longer exponential in T. In particular, in the nonconvex-concave the Lipschitz parameter is linear in T and the gradient Lipschitz parameter is polynomial in T while in the nonconvex-strongly concave, the analogous smoothness parameters are no longer dependent on T. We conjecture this is also the case for T-step SNAG, though the proof appears quite tedious and hence, we opted to leave that for future work. For problems of practical importance, however, we believe that the smoothness parameters are rarely exponential in T. Our experimental results confirm this intuition (see Figure 2). Furthermore, note that the function g(x) (2), and its stationary points, vary as one varies T. While in adversarial training, T is essentially determined by the power of adversary against which we wish the model to be robust, in GAN training, T is actually an algorithmic knob that can be tuned to improve the final generator performance. Indeed, larger T does not necessarily lead to better generators and small values of T are of vital interest; as is pointed out in the original paper (Goodfellow et al., 2014), fully optimizing the discriminator can result in overfitting and too strong of a discriminator can limit the ability of the generator to learn. Finally, while we prove the Lipschitz and gradient Lipschitz properties only for SGA and SNAG, we believe that the same techniques could be used to prove similar results for popular algorithms such as ADAM, RMSProp, etc. However, there are other algorithms, particularly those involving projection that are not gradient Lipschitz (see Proposition 3 in Appendix E.5). 6 EMPIRICAL RESULTS This section presents empirical results evaluating our SGD algorithm (Algorithm 1) for generative adversarial networks (Goodfellow et al., 2014) and adversarial training (Madry et al., 2018). Our results demonstrate that our framework results in stable monotonic improvement during training and converges to desirable solutions in both GAN and adversarial training problems. While optimization through the algorithm of the adversary increases the complexity of each update by a factor of two, our results demonstrate that it also leads to faster convergence. Finally, we show the gradient norms do not grow exponentially in the number of gradient ascent steps T taken by the adversary in practice. Generative Adversarial Networks. Generative adversarial network formulations can be characterized by the following minimax problem (Nagarajan & Kolter, 2017; Mescheder et al., 2018): minθ maxω f(θ, ω) = Ex p X [ℓ(Dω(x))] + Ez p Z[ℓ( Dω(Gθ(z)))]. (7) In this formulation Gθ : Z X is the generator network parameterized by θ that maps from the latent space Z to the input space X, Dω : X R is discriminator network parameterized by ω that maps from the input space X to real-valued logits, and p X and p Z are the distributions over the input and latent spaces. The loss function defines the objective where ℓ(w) = log(1 + exp( w)) gives the original saturating generative adversarial networks formulation (Goodfellow et al., 2014). Dirac GAN. The Dirac-GAN (Mescheder et al., 2018) is a simple and common baseline for evaluating the efficacy of generative adversarial network training methods. 
Dirac GAN. The Dirac-GAN (Mescheder et al., 2018) is a simple and common baseline for evaluating the efficacy of generative adversarial network training methods. In this problem, the generator distribution $G_\theta(z) = \delta_\theta$ is a Dirac distribution concentrated at $\theta$, the discriminator network $D_\omega(x) = \omega x$ is linear, and the real data distribution $p_X = \delta_0$ is a Dirac distribution concentrated at zero. The resulting objective after evaluating (7) with the loss $\ell(w) = \log(1 + \exp(-w))$ is
$$\min_\theta \max_\omega f(\theta, \omega) = \ell(\theta\omega) + \ell(0) = \log(1 + e^{-\theta\omega}) + \log(2).$$
To mimic the real data distribution, the generator parameter $\theta$ should converge to $\theta^* = 0$. Notably, simultaneous and alternating gradient descent-ascent are known to cycle and fail to converge on this problem (see Figure 1).

Figure 1: Dirac-GAN: Generator parameters while training using simultaneous and alternating gradient descent-ascent (left), and our framework (right) with and without optimizing through the discriminator. Under our framework, training is stable and converges to the correct distribution. Further, differentiating through the discriminator results in faster convergence.

Figure 2: Adversarial training: $f(\theta, \mathcal{A}(\theta))$ as a function of the number of steps $T$ taken by the gradient ascent (GA) algorithm $\mathcal{A}$, evaluated at multiple points in the training procedure. The plot shows that, in practice, the Lipschitz parameter of GA does not grow exponentially in $T$.

Figure 3: Mixture of Gaussians: Generated distribution at various steps during the course of training (panel (a) shows the real data). We see that training is stable and results in monotonic progress towards the true distribution.

We consider an instantiation of our framework where the discriminator samples an initialization uniformly from $[-0.1, 0.1]$ and performs $T = 10$ steps of gradient ascent between each generator update. The learning rates for both the generator and the discriminator are $\eta = 0.01$. We present the results in Figure 1. Notably, the generator parameter monotonically converges to the optimal $\theta^* = 0$ and matches the real data distribution using our training method. We also show the performance when the generator descends using the partial gradient $\nabla_\theta f(\theta, \mathcal{A}(\theta))$ instead of the total gradient $\nabla f(\theta, \mathcal{A}(\theta))$ in Algorithm 1. This method is able to converge to the optimal generator distribution, but at a slower rate. Together, this example highlights that our method fixes the usual cycling problem by reinitializing the discriminator, and also that optimizing through the discriminator algorithm is key to fast convergence. Additional results are given in Appendix G.
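The following is a minimal sketch of this Dirac-GAN experiment under our framework: before every generator update the discriminator is re-initialized uniformly in $[-0.1, 0.1]$ and runs $T = 10$ gradient ascent steps, and the generator then descends the total gradient through the unrolled discriminator. The learning rates (0.01) follow the text; the number of outer steps and the initial $\theta$ are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

# Dirac-GAN under the framework: fresh discriminator init, T gradient ascent steps,
# generator step along the total gradient through the unrolled discriminator.
def f(theta, omega):
    # f(theta, omega) = ell(theta*omega) + ell(0) with ell(w) = log(1 + exp(-w))
    return F.softplus(-theta * omega) + torch.log(torch.tensor(2.0))

theta = torch.tensor(1.0, requires_grad=True)
eta_gen, eta_disc, T = 0.01, 0.01, 10

for step in range(2000):
    omega = (torch.rand(()) * 0.2 - 0.1).requires_grad_(True)      # fresh discriminator init in [-0.1, 0.1]
    for _ in range(T):                                             # discriminator: T gradient ascent steps
        g_omega, = torch.autograd.grad(f(theta, omega), omega, create_graph=True)
        omega = omega + eta_disc * g_omega
    total_grad, = torch.autograd.grad(f(theta, omega), theta)      # total gradient through A(theta)
    with torch.no_grad():
        theta -= eta_gen * total_grad
print(float(theta))                                                # theta drifts toward theta* = 0
```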
Mixture of Gaussians. We now demonstrate that the insights developed from the Dirac-GAN (stability and monotonic improvement) carry over to the more complex problem of learning a 2-dimensional mixture of Gaussians. This is a common example, and a number of papers (see, e.g., Metz et al. 2017; Mescheder et al. 2017; Balduzzi et al. 2018) show that standard training methods using simultaneous or alternating gradient descent-ascent can fail on it. The setup for the problem is as follows. The real data distribution consists of 2-dimensional Gaussian distributions with means given by $\mu = [\sin(\varphi), \cos(\varphi)]$ for $\varphi \in \{k\pi/4\}_{k=0}^{7}$, each with covariance $\sigma^2 I$ where $\sigma^2 = 0.05$. For training, the real data $x \in \mathbb{R}^2$ is drawn at random from the set of Gaussian distributions and the latent data $z \in \mathbb{R}^{16}$ is drawn from a standard normal distribution, with batch sizes of 512. The networks for the generator and discriminator contain two and one hidden layers respectively, each with 32 neurons and ReLU activation functions. We consider the objective from (7) with $\ell(w) = \log(1 + \exp(-w))$, which corresponds to the saturating generative adversarial networks formulation (Goodfellow et al., 2014). This objective is known to be difficult to train since, with typical training methods, the generator gradients saturate early in training.

We show results using our framework in Figure 3, where the discriminator performs $T = 15$ steps of gradient ascent and the initialization between each generator step is obtained by the default network initialization in PyTorch. The generator and discriminator learning rates are both fixed to $\eta = 0.5$. We see that our method has stable improvement during the course of training and recovers a distribution close to the real data distribution. We demonstrate in Appendix G that this result is robust by presenting the final output of 10 runs of the procedure. Notably, the training algorithm recovers all the modes of the distribution in each run. We also show results using Adam for the discriminator in Appendix G.

Adversarial Training. Given a data distribution $\mathcal{D}$ over pairs of examples $x \in \mathbb{R}^d$ and labels $y \in [k]$, parameters $\theta$ of a neural network, a set $S \subseteq \mathbb{R}^d$ of allowable adversarial perturbations, and a loss function $\ell(\cdot, \cdot, \cdot)$ dependent on the network parameters and the data, adversarial training amounts to considering a minimax optimization problem of the form
$$\min_\theta \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\max_{\delta \in S} \ell(\theta, x + \delta, y)\big].$$
In practice (Madry et al., 2018), the inner maximization problem $\max_{\delta \in S} \ell(\theta, x + \delta, y)$ is solved using projected gradient ascent. However, as described in Section 5, this is not a smooth algorithm and does not fit our framework. So, we use gradient ascent, without projection, for the inner maximization.

We run an adversarial training experiment with the MNIST dataset, a convolutional neural network, and the cross-entropy loss function. We compare Algorithm 1 with usual adversarial training (Madry et al., 2018), which descends $\nabla_\theta f(\theta, \mathcal{A}(\theta))$ instead of $\nabla f(\theta, \mathcal{A}(\theta))$, and a baseline of standard training without adversarial training. For each algorithm, we train for 100 passes over the training set using a batch size of 50. The minimization procedure has a fixed learning rate of $\eta_1 = 0.0001$, and the maximization procedure runs for $T = 10$ steps with a fixed learning rate of $\eta_2 = 4$. We evaluate the test classification accuracy during the course of training against gradient ascent or Adam optimization adversarial attacks.

Figure 4: Adversarial training: Test accuracy during the course of training, where the attack used during training is gradient ascent (GA) with learning rate (LR) 4 and number of steps (Steps) 10, but evaluated against attacks with different Steps and LR. These plots show that training with a single attack gives robustness even to other attacks with different parameters or algorithms of similar computational power. This empirical insight is complementary to our theoretical results, which extend to the setting of training against multiple smooth algorithms. Finally, using the total gradient $\nabla f(\theta, \mathcal{A}(\theta))$ yields better robustness compared to using the partial gradient $\nabla_\theta f(\theta, \mathcal{A}(\theta))$.

The results are presented in Figure 4, where the mean accuracies are reported over 5 runs and the shaded regions show one standard deviation around the means. We observe that the adversarial training procedure gives a significant boost in robustness compared to standard training.
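A minimal sketch of the adversarial training loop just described: the inner maximization runs $T = 10$ steps of unprojected gradient ascent on the perturbation with step size 4, and the model update differentiates through the unrolled attack (total gradient). The tiny fully connected classifier and random batches are placeholders for the MNIST CNN and data loader used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Adversarial training with an unprojected gradient ascent attack and a total-gradient
# model update; the classifier and batches are placeholders for the MNIST setup.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
T, eta_attack = 10, 4.0

def attack(x, y):
    # inner maximization: delta <- delta + eta * grad_delta loss(theta, x + delta, y)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(T):
        loss = F.cross_entropy(model(x + delta), y)
        g, = torch.autograd.grad(loss, delta, create_graph=True)   # keep the graph so theta sees the attack
        delta = delta + eta_attack * g
    return delta

for step in range(100):                           # stand-in for passes over the MNIST training set
    x = torch.rand(50, 1, 28, 28)                 # placeholder batch; real code would use an MNIST loader
    y = torch.randint(0, 10, (50,))
    loss = F.cross_entropy(model(x + attack(x, y)), y)   # total gradient: backprop through the attack
    opt.zero_grad()
    loss.backward()
    opt.step()
```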
Moreover, consistent with the previous experiments, our algorithm which uses total gradient outperforms standard adversarial training which uses only partial gradient. We present results against more attacks in Appendix G. As suggested in Section 5, we also find that in practice, the gradient norms f(θ, A(θ)) do not grow exponentially in the number of gradient ascent steps T in the adversary algorithm A (see Figure 2). For further details and additional results see Appendix G. 7 CONCLUSION Motivated by the facts that (i) nonconcave maximization is NP-hard in the worst case and (ii) in practice, simple gradient based algorithms are used for maximization, this paper presents a new framework for solving nonconvex-nonconcave minimax optimization problems. Assuming that the min player has knowledge of the smooth algorithms being used by max player, we propose new efficient algorithms and verify that these algorithms find desirable solutions in practice on smallscale generative adversarial network and adversarial training problems. Furthermore, the framework proposed in this paper extends naturally to two player non-zero sum games as well. An important future direction of work is to scale our algorithm to large scale, state of the art GAN benchmarks. The key challenge here is that reinitializing the discriminator parameters necessitates running several steps of gradient ascent for discriminator for every generator update. One approach to overcome this could be to use pretrained networks for initializing the discriminator. Published as a conference paper at ICLR 2022 Leonard Adolphs, Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Local saddle point optimization: A curvature exploitation approach. In International Conference on Artificial Intelligence and Statistics, pp. 486 495, 2019. Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214 223, 2017. Kendall Atkinson, Weimin Han, and David E Stewart. Numerical solution of ordinary differential equations, volume 108. John Wiley & Sons, 2011. James P Bailey, Gauthier Gidel, and Georgios Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. In Conference on Learning Theory, pp. 391 407. PMLR, 2020. David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pp. 354 363, 2018. Tamer Ba sar and Geert Jan Olsder. Dynamic noncooperative game theory. SIAM, 1998. Dimitri P Bertsekas. Convex optimization theory. Athena Scientific Belmont, 2009. Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim v Srndi c, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In European conference on machine learning and knowledge discovery in databases, pp. 387 402, 2013. Benjamin Chasnov, Tanner Fiez, and Lillian J Ratliff. Opponent anticipation via conjectural variations. In Smooth Games Optimization and Machine Learning Workshop at Neur IPS 2020: Bridging Game Theory and Deep Learning, 2020. Earl A Coddington and Norman Levinson. Theory of ordinary differential equations. Tata Mc Graw Hill Education, 1955. Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 
9256 9266, 2018a. Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pp. 9236 9246, 2018b. Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. In International Conference on Learning Representations (ICLR 2018), 2018. Constantinos Daskalakis, Stratis Skoulakis, and Manolis Zampetakis. The complexity of constrained min-max optimization. In ACM Symposium on Theory of Computing, 2021. Damek Davis and Dmitriy Drusvyatskiy. Stochastic subgradient method converges at the rate o(k 1/4) on weakly convex functions. ar Xiv preprint ar Xiv:1802.02988, 2018. Tristan Deleu, Tobias Würfl, Mandana Samiei, Joseph Paul Cohen, and Yoshua Bengio. Torchmeta: A meta-learning library for pytorch. ar Xiv preprint ar Xiv:1909.06576, 2019. Farzan Farnia and Asuman Ozdaglar. Do gans always have nash equilibria? In International Conference on Machine Learning, pp. 3029 3039, 2020. Tanner Fiez and Lillian Ratliff. Local convergence analysis of gradient descent ascent with finite timescale separation. In International Conference on Learning Representations, 2021. Tanner Fiez, Benjamin Chasnov, and Lillian Ratliff. Implicit learning dynamics in stackelberg games: Equilibria characterization, convergence analysis, and empirical study. In International Conference on Machine Learning, pp. 3133 3144, 2020. Published as a conference paper at ICLR 2022 Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In International Conference on Autonomous Agents and Multi Agent Systems, pp. 122 130, 2018. Yoav Freund, Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, and Robert E Schapire. Efficient algorithms for learning to play repeated games against computationally bounded adversaries. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pp. 332 341. IEEE, 1995. Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In Advances in Neural Information Processing Systems, 2014. Edward Grefenstette, Brandon Amos, Denis Yarats, Phu Mon Htut, Artem Molchanov, Franziska Meier, Douwe Kiela, Kyunghyun Cho, and Soumith Chintala. Generalized inner loop meta-learning. ar Xiv preprint ar Xiv:1910.01727, 2019. Joseph Y Halpern, Rafael Pass, and Lior Seeman. Decision theory with resource-bounded agents. Topics in cognitive science, 6(2):245 257, 2014. Philip Hartman. Ordinary Differential Equations. Society for Industrial and Applied Mathematics, second edition, 2002. doi: 10.1137/1.9780898719222. Nicholas JA Harvey, Christopher Liaw, and Sikander Randhawa. Simple and optimal high-probability bounds for strongly-convex stochastic gradient descent. ar Xiv preprint ar Xiv:1909.00843, 2019. Ya-Ping Hsieh, Panayotis Mertikopoulos, and Volkan Cevher. The limits of min-max optimization algorithms: Convergence to spurious non-critical sets. ar Xiv preprint ar Xiv:2006.09065, 2020. Chi Jin, Praneeth Netrapalli, and Michael Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pp. 4880 4889, 2020. Sham Kakade, Shai Shalev-Shwartz, Ambuj Tewari, et al. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. 
Unpublished Manuscript, http://ttic. uchicago. edu/shai/papers/Kakade Shalev Tewari09. pdf, 2(1), 2009. Vijay Keswani, Oren Mangoubi, Sushant Sachdeva, and Nisheeth K Vishnoi. Gans with first-order greedy discriminators. ar Xiv preprint ar Xiv:2006.12376, 2020. Weiwei Kong and Renato DC Monteiro. An accelerated inexact proximal point method for solving nonconvex-concave min-max problems. ar Xiv preprint ar Xiv:1905.13433, 2019. Galina M Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747 756, 1976. Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In International Conference on Learning Representations, 2017. Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 1049 1065. IEEE, 2015. Qi Lei, Sai Ganesh Nagarajan, Ioannis Panageas, and xiao wang. Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes. In International Conference on Artificial Intelligence and Statistics, pp. 1441 1449, 2021. Alistair Letcher. On the impossibility of global convergence in multi-loss optimization. In International Conference on Learning Representations, 2021. Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rocktäschel, and Shimon Whiteson. Stable opponent shaping in differentiable games. In International Conference on Learning Representations, 2019. Haochuan Li, Yi Tian, Jingzhao Zhang, and Ali Jadbabaie. Complexity lower bounds for nonconvexstrongly-concave min-max optimization. ar Xiv preprint ar Xiv:2104.08708, 2021. Published as a conference paper at ICLR 2022 Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pp. 6083 6093, 2020a. Tianyi Lin, Chi Jin, Michael Jordan, et al. Near-optimal algorithms for minimax optimization. In Conference on Learning Theory, pp. 2738 2779, 2020b. Songtao Lu, Ioannis Tsaknakis, Mingyi Hong, and Yongxin Chen. Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications. IEEE Transactions on Signal Processing, 2020. Luo Luo, Haishan Ye, Zhichao Huang, and Tong Zhang. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. Advances in Neural Information Processing Systems, 33, 2020. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. Oren Mangoubi and Nisheeth K Vishnoi. Greedy adversarial equilibrium: An efficient alternative to nonconvex-nonconcave min-max optimization. In ACM Symposium on Theory of Computing, 2021. Eric Mazumdar, Lillian J Ratliff, and S Shankar Sastry. On gradient-based learning in continuous games. SIAM Journal on Mathematics of Data Science, 2(1):103 131, 2020. Eric V Mazumdar, Michael I Jordan, and S Shankar Sastry. On finding local nash equilibria (and only local nash equilibria) in zero-sum games. ar Xiv preprint ar Xiv:1901.00838, 2019. Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In Advances in Neural Information Processing Systems, pp. 1823 1833, 2017. Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. 
Which training methods for gans do actually converge? In International Conference on Machine Learning, pp. 3481 3490, 2018. Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In International Conference on Learning Representations, 2017. Vaishnavh Nagarajan and J Zico Kolter. Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5591 5600, 2017. Arkadi Nemirovski. Prox-method with rate of convergence o (1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229 251, 2004. J v Neumann. Zur theorie der gesellschaftsspiele. Mathematische annalen, 100(1):295 320, 1928. Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pp. 14934 14942, 2019. Dmitrii M Ostrovskii, Andrew Lowy, and Meisam Razaviyayn. Efficient search of first-order nash equilibria in nonconvex-concave smooth min-max problems. ar Xiv preprint ar Xiv:2002.07919, 2020. Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Weakly-convex concave min max optimization: provable algorithms and applications in machine learning. Optimization Methods and Software, pp. 1 35, 2021. L. J. Ratliff, S. A. Burden, and S. S. Sastry. Characterization and computation of local Nash equilibria in continuous games. In Allerton Conference on Communication, Control, and Computing, pp. 917 924, 2013. Lillian J Ratliff, Samuel A Burden, and S Shankar Sastry. On the characterization of local Nash equilibria in continuous games. IEEE Transactions on Automatic Control, 61(8):2301 2307, 2016. Published as a conference paper at ICLR 2022 Julia Robinson. An iterative method of solving a game. Annals of mathematics, pp. 296 301, 1951. Tuomas W Sandhlom and Victor RT Lesser. Coalitions among computationally bounded agents. Artificial intelligence, 94(1-2):99 137, 1997. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithms for smooth minimax optimization. Advances in Neural Information Processing Systems, 32: 12680 12691, 2019. Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. In International Conference on Learning Representations, 2020. Chongjie Zhang and Victor Lesser. Multi-agent learning with policy prediction. In AAAI Conference on Artificial Intelligence, 2010. Guojun Zhang, Pascal Poupart, and Yaoliang Yu. Optimality and stability in non-convex-non-concave min-max optimization. ar Xiv preprint ar Xiv:2002.11875, 2020a. Guojun Zhang, Kaiwen Wu, Pascal Poupart, and Yaoliang Yu. Newton-type methods for minimax optimization. ar Xiv preprint ar Xiv:2006.14592, 2020b. Siqi Zhang, Junchi Yang, Cristóbal Guzmán, Negar Kiyavash, and Niao He. The complexity of nonconvex-strongly-concave minimax optimization. ar Xiv preprint ar Xiv:2103.15888, 2021. Renbo Zhao. A primal dual smoothing framework for max-structured nonconvex optimization. ar Xiv preprint ar Xiv:2003.04375, 2020. 
Published as a conference paper at ICLR 2022 A DETAILED RELATED WORK Nonconvex-Nonconcave Zero-Sum Games. The existing work on nonconvex-nonconcave zerosum games has generally focused on (1) defining and characterizing local equilibrium solution concepts and (2) analyzing the local stability and convergence behavior of gradient-based learning algorithms around fixed points of the dynamics. The concentration on local analysis stems from the inherent challenges that arise in nonconvex-nonconcave zero-sum games from both a dynamical systems perspective and a computational perspective. In particular, it is know that broad classes of gradient-based learning dynamics can admit limit cycles and other non-trivial periodic orbits that are antithetical to any type of global convergence guarantee in this class of games (Hsieh et al., 2020; Letcher, 2021). Moreover, on constrained domains, it has been shown that finding even a local equilibrium is computationally intractable (Daskalakis et al., 2021). A number of local equilibrium notions for nonconvex-nonconcave zero-sum games now exist with characterizations in terms of gradient-based conditions relevant to gradient-based learning. This includes the local Nash (Ratliff et al., 2013; Ratliff et al., 2016) and local minmax (Stackelberg) (Jin et al., 2020; Fiez et al., 2020) equilibrium concepts, which both amount to local refinements and characterizations of historically standard game-theoretic equilibrium notions. In terms of provable guarantees, algorithms incorporating higher-order gradient information have been proposed and analyzed that guarantee local convergence to only local Nash equilibria (Adolphs et al., 2019; Mazumdar et al., 2019) or local convergence to only local minmax equilibria (Wang et al., 2020; Zhang et al., 2020b; Fiez et al., 2020) in nonconvex-nonconcave zero-sum games. Beyond the local Nash and minmax equilibrium, notions including the proximal equilibrium concept (Farnia & Ozdaglar, 2020), which is a class between the set of local Nash and local minmax equilibria, and the local robust equilibrium concept (Zhang et al., 2020a), which includes both local minmax and local maxmin equilibria, have been proposed and studied. It is worth noting that a shortcoming of each of the local equilibrium notions is that may fail to exist on unconstrained domains. Significant attention has been given to the local stability and convergence of simultaneous gradient descent-ascent in nonconvex-nonconcave zero-sum games. This stems from the fact that it is the natural analogue of learning dynamics for zero-sum game optimization to gradient descent for function optimization. Moreover, simultaneous gradient descent-ascent is know to often perform reasonably well empirically and is ubiquitous in a number of applications such as in training generative adversarial networks and adversarial learning. However, it has been shown that while local Nash are guaranteed to be stable equilibria of simultaneous gradient descent-ascent (Mazumdar et al., 2020; Daskalakis & Panageas, 2018b; Jin et al., 2020), local minmax may not be unless there is sufficient timescale separation between the minimizing and maximizing players (Jin et al., 2020; Fiez & Ratliff, 2021). Specific to generative adversarial networks, it has been shown that simultaneous gradient descent-ascent locally converges to local equilibria under certain assumptions on the generator network and the data distribution (Nagarajan & Kolter, 2017; Mescheder et al., 2018). 
Later in this section we discuss in further detail learning dynamics studied previously in games which bear resemblance to that which we consider in this paper depending on the model of the maximizing player. The challenges of nonconvex-nonconcave zero-sum games we have highlighted limit the types of provable guarantees that can be obtained and consequently motivate tractable relaxations including to nonconvex-concave zero-sum games and the general framework we formulate in this work. Before moving on, we mention that from a related perspective, a line of recent work (Keswani et al., 2020; Mangoubi & Vishnoi, 2021) in nonconvex-nonconcave zero-sum games proposes relaxed equilibrium notions that are shown to be computable in polynomial time and are guaranteed to exist. At a high level, the equilibria correspond to a joint strategy at which the maximizing player is at an approximate local maximum of the cost function and the minimizing player is at an approximate local minimum of a smoothed and relaxed best-response function of the maximizing player. The aforementioned works are similar to this paper in the sense that the minimizing player faces a maximizing player with computational restrictions, but diverge in terms of the model of the maximizing player and the algorithms for solving the problem. Nonconvex-Concave Zero-Sum Games. The past few years has witnessed a significant amount of work on gradient-based dynamics in nonconvex-concave zero-sum games. The focus of existing work on nonconvex-concave zero-sum games has key distinctions from that in nonconvex-nonconcave Published as a conference paper at ICLR 2022 zero-sum games. Generally, the work on nonconvex-concave zero-sum games has analyzed dynamics on constrained domains, where typically the strategy space of the maximizing player is constrained to a closed convex set and occasionally the minimizing player also faces a constraint. In contrast, nonconvex-nonconcave zero-sum games have generally been analyzed on unconstrained domains. Moreover, instead of focusing on computing notions of game-theoretic equilibrium as is typical in nonconvex-nonconcave zero-sum games, the body of work on nonconvex-concave zero-sum games has focused on achieving stationarity of the game cost function f( , ) or the best-response function Φ( ) = maxy f( , y). The structure present in nonconvex-concave zero-sum games has been shown to simplify the problem compared to nonconvex-nonconcave zero-sum games so that global finite-time convergence guarantees are achievable. Thus, work in this direction has focused on improving the rates of convergence in terms of the gradient complexity to find ϵ approximate stationary points of f( , ) or Φ( ), both with deterministic and stochastic gradients. Guarantees on the former notion of stationarity can be translated to guarantees on the latter notion of stationarity with extra computational cost (Lin et al., 2020a). For the the class of nonconvex-strongly-concave zero-sum games, a series of works design algorithms that are shown to obtain ϵ approximate stationary points of the functions f( , ) or Φ( ) with a gradient complexity of e O(ϵ 2) in terms of ϵ in the deterministic setting (Jin et al., 2020; Rafique et al., 2021; Lu et al., 2020; Lin et al., 2020a;b). In the deterministic nonconvex-strongly concave problem, the notions of stationarity are equivalent in terms of the dependence on ϵ up to a logarithmic dependence (Lin et al., 2020a). 
Lower bounds for this problem have also been established (Zhang et al., 2021; Li et al., 2021). In the stochastic nonconvex-strongly-concave problem, existing work has developed algorithms that are shown to obtain ϵ approximate stationary points of the function Φ( ) in gradient complexities of e O(ϵ 4) (Rafique et al., 2021; Jin et al., 2020; Lin et al., 2020a) and e O(ϵ 3) (Luo et al., 2020) in terms of ϵ dependence. In the deterministic nonconvex-concave problem, a number of algorithms with provable guarantees to ϵ approximate stationary points of the function f( , ) have been shown with gradient complexities of e O(ϵ 4) (Lu et al., 2020), e O(ϵ 3.5) (Nouiehed et al., 2019), and e O(ϵ 2.5) (Ostrovskii et al., 2020; Lin et al., 2020b). Similarly, in this class of problems, there exist results on algorithms that guarantee convergence to an ϵ approximate stationary points of the function Φ( ) with gradient complexities e O(ϵ 6) (Rafique et al., 2021; Jin et al., 2020; Lin et al., 2020a) and e O(ϵ 3) (Thekumparampil et al., 2019; Zhao, 2020; Kong & Monteiro, 2019; Lin et al., 2020b). Finally, existing results in the stochastic setting for achieving an ϵ approximate stationary point of Φ( ) show gradient complexities of e O(ϵ 6) (Rafique et al., 2021) and e O(ϵ 8) (Lin et al., 2020a). In this work, we build on the developments for nonconvex-concave problems to obtain our results. Gradient-Based Learning with Opponent Modeling. A number of gradient-based learning schemes have been derived is various classes of games based on the following idea: if a player knows how the opponents in a game are optimizing their cost functions, then it is natural to account for this behavior in the players own optimization procedure. The simultaneous gradient descent learning dynamics can be viewed as the simplest instantiation of this perspective, where each player is optimizing their own cost function assuming that all other players in the game will remain fixed. In general, the more sophisticated existing learning dynamics based on opponent modeling assume the opponents are doing gradient descent on their cost function and this prediction is incorporated into the objective being optimized in place of the current strategies of opponents. A key conceptual distinction between this approach and our work is that in existing opponent modeling methods the dynamics of the players are always updated simultaneously whereas the procedure we consider is sequential in nature with the opponent initializing again at each interaction. Moreover, the types of guarantees we prove are distinct compared to existing work in this realm. In this modern literature, gradient-based learning with opponent modeling dates back to the work of Zhang & Lesser (2010). They study simple two-player, two-action, general-sum matrix games, and analyze a set of learning dynamics called iterated descent descent with policy prediction (IGAPP) and show asymptotic convergence to a Nash equilibrium. In this set of learning dynamics, each player assumes the other player is doing gradient descent and this prediction is used in the objective. In particular, each player i has a choice variable xi and a cost function fi(xi, x i) that after Published as a conference paper at ICLR 2022 incorporating the prediction becomes fi(xi t, x i t γ if i(xi t, x i t )). 
To optimize the objective, each player takes a first-order Taylor expansion of its cost function to give the augmented objective
f_i(x^i_t, x^{-i}_t − γ ∇_{-i} f_{-i}(x^i_t, x^{-i}_t)) ≈ f_i(x^i_t, x^{-i}_t) − γ ⟨∇_{-i} f_i(x^i_t, x^{-i}_t), ∇_{-i} f_{-i}(x^i_t, x^{-i}_t)⟩.
Each player in the game simultaneously follows the gradient of its augmented objective, which is given by
∇_i f_i(x^i_t, x^{-i}_t) − γ ∇_{i,-i} f_i(x^i_t, x^{-i}_t) ∇_{-i} f_{-i}(x^i_t, x^{-i}_t).
This gradient computation is derived based on the fact that the assumed update of the other player, ∇_{-i} f_{-i}(x^i_t, x^{-i}_t), is treated as not depending on the optimization variable. Similar ideas have recently been revisited in more general nonconvex multiplayer games (Foerster et al., 2018; Letcher et al., 2019). In learning with opponent learning awareness (LOLA) (Foerster et al., 2018), players again assume the other players are doing gradient descent and take their objective to be f_i(x^i_t, x^{-i}_t − γ ∇_{-i} f_{-i}(x^i_t, x^{-i}_t)). To derive the learning rule, an augmented objective is again formed by computing a first-order Taylor expansion, but now the term ∇_{-i} f_{-i}(x^i_t, x^{-i}_t) in the augmented objective is treated as dependent on the optimization variable, so that the gradient of the augmented objective is given by
∇_i f_i(x^i_t, x^{-i}_t) − γ ∇_{i,-i} f_i(x^i_t, x^{-i}_t) ∇_{-i} f_{-i}(x^i_t, x^{-i}_t) − γ ∇_{i,-i} f_{-i}(x^i_t, x^{-i}_t) ∇_{-i} f_i(x^i_t, x^{-i}_t).
Finally, to arrive at the final gradient update for each player, the middle term in the equation above is removed and each player takes steps along the gradient update
∇_i f_i(x^i_t, x^{-i}_t) − γ ∇_{i,-i} f_{-i}(x^i_t, x^{-i}_t) ∇_{-i} f_i(x^i_t, x^{-i}_t).
While no convergence results are given for LOLA, a follow-up work shows local convergence guarantees to stable fixed points for IGA-PP and for learning dynamics called stable opponent shaping (SOS) that interpolate between IGA-PP and LOLA (Letcher et al., 2019). A related work derives learning dynamics based on the idea that the opponent selects a best response to the chosen strategy (Fiez et al., 2020). The resulting learning dynamics can be viewed as LOLA with the opponent selecting a Newton learning rate. For nonconvex-nonconcave zero-sum games, local convergence guarantees to local Stackelberg equilibria are given for this set of learning dynamics (Fiez et al., 2020). It is worth remarking that gradient-based learning with opponent modeling is historically rooted in the general framework of consistent conjectural variations (see, e.g., Başar & Olsder 1998, Chapter 4.6), a concept that is now being explored again and is closely related to the previously mentioned learning dynamics (Chasnov et al., 2020). Perhaps the closest work on gradient-based learning with opponent modeling to this paper is that of unrolled generative adversarial networks (Metz et al., 2017). In unrolled generative adversarial networks, the generator simulates the discriminator doing a fixed number of gradient steps from the current parameter configurations of the generator and discriminator. The resulting discriminator parameters are then used in place of the current discriminator parameters in the generator objective. The generator then updates following the gradient of this objective, optimizing through the rolled-out discriminator update by computing the total derivative. Simultaneously with the generator update, the discriminator updates its parameters by performing a gradient step on its objective.
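To make the idea of differentiating through an unrolled adversary concrete, the following is a minimal sketch, not the implementation used by Metz et al. (2017) or in this paper: the toy objective f, the fixed initialization y0, the step sizes, and the finite-difference total derivative are all illustrative assumptions.

```python
import numpy as np

def f(x, y):
    # Illustrative toy objective; the max-player's best response is y = x.
    return x * y - 0.5 * y ** 2

def unrolled_adversary(x, y0=0.0, eta=0.1, T=10):
    # T steps of gradient ascent on y starting from a fixed initialization y0.
    y = y0
    for _ in range(T):
        y = y + eta * (x - y)  # df/dy = x - y
    return y

def total_derivative(x, h=1e-5):
    # Central finite difference of x -> f(x, A(x)); this differentiates *through*
    # the T unrolled ascent steps, i.e., it accounts for how A(x) moves with x.
    return (f(x + h, unrolled_adversary(x + h))
            - f(x - h, unrolled_adversary(x - h))) / (2 * h)

x = 1.5
y_T = unrolled_adversary(x)
partial = y_T                       # df/dx at fixed y = A(x) is simply y
total = total_derivative(x)
print(f"A(x) = {y_T:.4f}, partial derivative = {partial:.4f}, total derivative = {total:.4f}")
```

The gap between the partial and total derivatives is exactly what optimizing through the rolled-out update captures.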
In our framework, for generative adversarial networks, when the discriminator is modeled as performing T steps of gradient ascent, the procedure we propose is similar, but an important difference is that when the generator simulates the discriminator unrolling procedure the discriminator parameters are initialized from scratch and there is no explicit discriminator being trained simultaneously with the generator.

Extragradient methods: Starting from the seminal work of Korpelevich (1976), extragradient methods have been studied extensively for solving minimax optimization problems. While such methods have been shown to obtain faster convergence rates (Nemirovski, 2004) and final-iterate convergence (Daskalakis et al., 2018; Lei et al., 2021) for convex-concave minimax optimization, there is no result showing that they converge in general nonconvex-nonconcave settings. In fact, in Appendix F we present simple examples demonstrating that the extragradient method, as well as another variant called optimistic gradient descent ascent, do not converge for general nonconvex-nonconcave minimax optimization.

Games with computationally bounded adversaries: There are also a few works in the game theory literature which consider resource/computationally bounded agents. For example, Freund et al. (1995) considers repeated games between resource-bounded agents, Sandholm & Lesser (1997) considers coalition formation between resource-bounded agents in cooperative games, and Halpern et al. (2014) shows that resource constraints in otherwise rational players might lead to some commonly observed human behaviors while making decisions. However, the settings, models of limited computation, and the focus of results considered in all of these prior works are distinct from those of this paper.

Stability of algorithms in numerical analysis: To our knowledge, such results on the smoothness of the classes of algorithms we study, i.e., gradient-based updates such as SGA and SNAG, with respect to problem parameters (e.g., in this case, x) have not been shown in the machine learning and optimization literature. This being said, in the study of dynamical systems, and more specifically differential equations, the concept of continuity (and Lipschitzness) with respect to parameters and initial data has been studied using a variational approach, wherein the solution of the differential equation is shown to be continuous with respect to variations in the parameters or initial data by appealing to nonlinear variation-of-parameters results such as the Bellman-Gronwall inequality or Alekseev's theorem (see classical references on differential equations such as Coddington & Levinson (1955, Chapter 2) or Hartman (2002, Chapter IV.2)). In numerical methods, such results on the smoothness or continuity of the differential equation with respect to initial data or problem parameters are used to understand the stability of particular numerical methods (see, e.g., Atkinson et al. 2011, Chapter 1.2). In particular, an initial value problem is only considered well-posed if there is continuous dependence on initial data. For instance, the simple scalar differential equation y′(t) = −y(t) + 1, 0 ≤ t ≤ T, y(0) = 1, has solution y(t) ≡ 1, yet the perturbed problem y′_ϵ(t) = −y_ϵ(t) + 1, 0 ≤ t ≤ T, y_ϵ(0) = 1 + ϵ, has solution y_ϵ(t) = 1 + ϵ e^{−t}, so that |y(t) − y_ϵ(t)| ≤ |ϵ| for 0 ≤ t ≤ T.
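As a quick numerical illustration of this continuous dependence on initial data, the following sketch forward-Euler integrates the nominal and perturbed problems stated above; the horizon, step size, and perturbation size are arbitrary choices made for the example.

```python
import numpy as np

def euler(y0, T=5.0, h=1e-3):
    # Forward-Euler discretization of y'(t) = -y(t) + 1 on [0, T].
    n = int(T / h)
    y = np.empty(n + 1)
    y[0] = y0
    for k in range(n):
        y[k + 1] = y[k] + h * (-y[k] + 1.0)
    return y

eps = 1e-2
y_nominal = euler(1.0)          # exact solution: y(t) = 1
y_perturbed = euler(1.0 + eps)  # exact solution: y(t) = 1 + eps * exp(-t)
print(f"max |y_eps - y| = {np.max(np.abs(y_perturbed - y_nominal)):.2e} <= eps = {eps:.0e}")
```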
If the maximum error ‖y_ϵ − y‖ is (much) larger than ϵ, then the initial value problem is ill-conditioned and any typical attempt to numerically solve such a problem will lead to large errors in the computed solution. In short, the stability properties of a numerical method (i.e., a discretization of the differential equation) are fundamentally connected to continuity (smoothness) with respect to initial data. Observe that methods such as gradient ascent can be viewed as a discretization of a differential equation:
ẏ(t) = ∇_y f(x, y(t))  ⟹  y_{k+1} = y_k + η ∇_y f(x, y_k).
As such, the techniques for showing continuity of the solution of a differential equation with respect to initial data or other problem parameters (e.g., in this case x) can be adopted to show smoothness of the T-step solution of the discretized update. Our approach to showing smoothness, on the other hand, leverages the recursive nature of the discrete-time updates defining the classes of algorithms we study. This approach simplifies the analysis by directly going after the smoothness parameters using the update, versus solving the difference (or differential) equation for y_T(x) and then finding the smoothness parameters, which is the method typically used in the numerical analysis of differential equations. An interesting direction of future research is to more formally connect the stability analysis from the numerical analysis of differential equations to the robustness of adversarial learning to initial data and even to variations in problem parameters.

B PROOF OF RESULTS IN SECTION 3

Proof of Lemma 1. For any fixed z, we note that A(·, z) is a deterministic algorithm. Consequently, it suffices to prove the lemma for a deterministic algorithm A(·). By the chain rule, the derivative of f(x, A(x)) is given by:
∇f(x, A(x)) = ∇_x f(x, A(x)) + DA(x) ∇_y f(x, A(x)),  (8)
where DA(x) ∈ R^{d1×d2} is the derivative of A(·) : R^{d1} → R^{d2} at x, and ∇_x f(x, A(x)) and ∇_y f(x, A(x)) denote the partial derivatives of f with respect to the first and second variables, respectively, at (x, A(x)). An easy computation shows that
‖∇f(x, A(x))‖ ≤ ‖∇_x f(x, A(x))‖ + ‖DA(x)‖ ‖∇_y f(x, A(x))‖ ≤ G + G′·G = (1 + G′)G.
This shows that f(x, A(x)) is (1 + G′)G-Lipschitz. Similarly, we have:
‖∇f(x1, A(x1)) − ∇f(x2, A(x2))‖ ≤ ‖∇_x f(x1, A(x1)) − ∇_x f(x2, A(x2))‖ + ‖DA(x1) ∇_y f(x1, A(x1)) − DA(x2) ∇_y f(x2, A(x2))‖.
For the first term, we have:
‖∇_x f(x1, A(x1)) − ∇_x f(x2, A(x2))‖ ≤ ‖∇_x f(x1, A(x1)) − ∇_x f(x2, A(x1))‖ + ‖∇_x f(x2, A(x1)) − ∇_x f(x2, A(x2))‖ ≤ L(‖x1 − x2‖ + ‖A(x1) − A(x2)‖) ≤ L(1 + G′)‖x1 − x2‖.
Similarly, for the second term we have:
‖DA(x1) ∇_y f(x1, A(x1)) − DA(x2) ∇_y f(x2, A(x2))‖ ≤ ‖DA(x1)‖ ‖∇_y f(x1, A(x1)) − ∇_y f(x2, A(x2))‖ + ‖∇_y f(x2, A(x2))‖ ‖DA(x1) − DA(x2)‖ ≤ (LG′(1 + G′) + GL′)‖x1 − x2‖.
This proves the lemma.

Proof of Lemma 2. Given any x and y, and any λ such that λ_j ≥ 0 and Σ_{j∈S(x)} λ_j = 1, we have:
g(y) = max_{j∈[k]} g_j(y) ≥ Σ_{j∈S(x)} λ_j g_j(y) ≥ Σ_{j∈S(x)} λ_j [g_j(x) + ⟨∇g_j(x), y − x⟩ − (L/2)‖y − x‖²] = g(x) + ⟨Σ_{j∈S(x)} λ_j ∇g_j(x), y − x⟩ − (L/2)‖y − x‖²,
where the last equality uses that g_j(x) = g(x) for every j ∈ S(x), proving the lemma.

Proof of Lemma 3. We re-write g_λ(x) as the minimum value of a (1/λ − L)-strongly convex function φ_{λ,x}: since g is L-weakly convex (Definition 2) and (1/(2λ))‖x′ − x‖² is differentiable and (1/λ)-strongly convex,
g_λ(x) = min_{x′∈R^{d1}} φ_{λ,x}(x′), where φ_{λ,x}(x′) := g(x′) + (1/(2λ))‖x′ − x‖².  (9)
The first part of (a) then follows trivially from the strong convexity. For the second part, notice the following:
min_x g_λ(x) = min_x min_{x′} [g(x′) + (1/(2λ))‖x′ − x‖²] = min_{x′} min_x [g(x′) + (1/(2λ))‖x′ − x‖²] = min_{x′} g(x′).
Furthermore, we will show that argmin_x g_λ(x) = argmin_x g(x). Consider x* ∈ argmin_x g(x). We have that, for any x,
g_λ(x) = min_{x′} [g(x′) + (1/(2λ))‖x′ − x‖²] ≥ min_{x′} g(x′) = g(x*) ≥ g_λ(x*).
This shows that x* ∈ argmin_x g_λ(x) and hence argmin_x g(x) ⊆ argmin_x g_λ(x).
Similarly, for any x argminx gλ(x), we have: gλ(x ) = g(ˆxλ(x )) + 1 2λ ˆxλ(x ) x 2 g(ˆxλ(x )) gλ(ˆxλ(x )). Published as a conference paper at ICLR 2022 Algorithm 3: Estimating Moreau envelope s gradient for postprocessing Input: point x, stochastic subgradient oracle for function g, error ϵ, failure probability δ 1 Find ex such that g(ex) + L x ex 2 minx g(x ) + L x x 2 + ϵ2 4L using (Harvey et al., 2019, Algorithm 1). 2 return 2L(x ex). Since x argminx gλ(x), the above inequality is infact an equality and hence ˆxλ(x ) = x . So, gλ(x ) = g(x ) and hence x argminx g(x) implying that argminx gλ(x) argminx g(x). Thus arg minx gλ(x) = arg minx g(x). For (b) we can re-write the Moreau envelope gλ as, gλ(x) = min x g(x ) + 1 λ max x (x T x λg(x ) x 2 where ( ) is the Fenchel conjugation operator. Since L < 1/λ, using L-weak convexity of g, it is easy to see that λg(x )+ x 2 2 is (1 λL)-strongly convex, therefore its Fenchel conjugate would be 1 (1 λL)-smooth (Kakade et al., 2009, Theorem 6). This, along with 1 λ-smoothness of first quadratic term implies that gλ(x) is 1 λ + 1 λ(1 λL) -smooth, and thus differentiable. For (c) we proceed as follows. Since ˆxλ(x) = argminx φλ,x(x ), by first order KKT optimality conditions, we have that x ˆxλ(x) λ g(x). (11) Further, from proof of part (a) we have that φλ,x(x ) (1 λL)-strongly-convex in x and it is quadratic (and thus convex) in x. Danskin s theorem (Bertsekas, 2009, Section 6.11) thus implies that for gλ(x) = minx Rd1 φλ,x(x ), we have that gλ(x) = (x ˆxλ(x))/λ. Letting ˆuλ(x) := λ 1(x ˆxλ(x)) g(x), we have that: min u g(ˆxλ(x)) u ˆuλ(x) = λ 1 x ˆxλ(x) = gλ(x) . C PROOFS OF RESULTS IN SECTION 4.2 In order to prove convergence of this algorithm, we first recall the following result from (Davis & Drusvyatskiy, 2018). Theorem 7 (Corollary 2.2 from (Davis & Drusvyatskiy, 2018)). Suppose g( ) is L-weakly convex, and Ez1, ,zk h b g(x) 2i G2. Then, the output x of Algorithm 1 with stepsize η = γ S+1 satisfies: 2L ( x) 2i 2 2L (x0) minx g(x) + LG2γ2 Proof of Theorem 1. Lemmas 1 and 2 tell us that g(x) is b L-weakly convex and for any choice of z, the stochastic subgradient b g(x) is bounded in norm by b G. Consequently, Theorem 7 with the stated choice of S proves Theorem 1. Proposition 1. Given a point x, an L-weakly convex function g and a G-norm bounded and a stochastic subgradient oracle to g (i.e., given any point x , which returns a stochastic vector u such that E[u] g(x ) and u G), with probability at least 1 δ, Algorithm 3 returns a vector u satisfying u gλ(x) ϵ with at most O G2 log 1 δ ϵ2 queries to the stochastic subgradient oracle of g. Published as a conference paper at ICLR 2022 Proof of Proposition 1. Let Φ 1 2L (x , x) := g(x ) + L x x 2. Recall the notation of Lemma 3 bx 1 2L (x) := argminx Φ 1 2L (x , x) and gλ(x) = minx Φ 1 2L (x , x). The proof of Lemma 3 tells us that 2L (x) = x bx 1 λ and also that bx 1 2L (x) x G 2L. Since Φ 1 2L ( , x) is a L-strongly convex and G Lipschitz function in the domain n x : bx 1 2L o , (Harvey et al., 2019, Theorem 3.1) tells us that we can use SGD with O G2 log 1 δ Leϵ stochastic gradient oracle queries to implement Step 1 of Algorithm 3 with success probability at least 1 δ, where eϵ := ϵ2 4L. Simplifying this expression gives us a stochastic gradient oracle complexity of O G2 log 1 D PROOFS OF RESULTS IN SECTION 4.3 Proof of Theorem 2. Letting h(x, λ) := P i [k] λigi(x), we note that xxh(x, λ) = P i [k] λi 2gi(x) L(1 + G )2 + GL , where we used Lemma 1 and the fact that P i |λi| 1. 
On the other hand, xλh(x, λ) = P i [k] gi(x) k G(1 + G ), where we again used Lemma 1. Since λλh(x, λ) = 0, we can conclude that h is an b L-gradient Lipschitz function with b L := L(1 + G )2 + GL + k G(1 + G ). Consequently, g(x) = maxλ S h(x, λ), where S := n λ Rk : λi 0, P i [k] λi = 1 o , is b L-weakly convex and the Moreau envelope g 1 2 b L is well defined. Denote bg(x, xs) := maxi [k] gi(x) + b L x xs 2. We now divide the analysis of each of iteration of Algorithm 2 into two cases. Case I, bg(xs+1, xs) maxi [k] gi(xs) 3bϵ/4: Since maxi [k] gi(xs+1) bg(xs+1, xs) maxi [k] gi(xs) 3bϵ/4, we see that in this case g(xs) decreases monotonically by at least 3bϵ/4 in each iteration. Since by Assumption 1, g is bounded by B in magnitude, and the termination condition in Step 4 guarantees monotonic decrease in every iteration, there can only be at most 2B/(3bϵ/4) = 8B/bϵ such iterations in Case I. Case II, bg(xs+1, xs) maxi [k] gi(xs) 3bϵ/4: In this case, we claim that xs is an ϵ-FOSP of g = maxi [k] gi(x). To see this, we first note that g(xs) 3bϵ/4 bg(xs+1, xs) (min x g(x) + b L x xs 2) + bϵ/4 = g(xs) < min x bg(x; xs) + bϵ. Let x s := arg minx bg(x; xk). Since g is b L-gradient Lipschitz, we note that bg( ; xs) is b L-strongly convex. We now use this to prove that xs is close to x s: bg(x s; xs) + b L 2 xs x s 2 bg(xs; xs) = f(xs) (a) < bg(x k; xs) + bϵ = xs x s < where (a) uses (12). Now consider any bx, such that 4 p bϵ/L bx xs . Then, g(bx) + L bx xs 2 = max i [k] gi(bx) + L bx xs 2 = bg(bx; xs) (a) = bg(x s; xs) + L (b) f(xs) bϵ + b L 2 ( bx xs xs x s )2 (c) f(xs) + bϵ, (14) where (a) uses uses b L-strong convexity of bg( ; xs) at its minimizer x s, (b) uses (12), and (b) and (c) use triangle inequality, (13) and 4 q bϵ/b L bx xs . Now consider the Moreau envelope, g 1 2 b L (x) = minx φ 1 2 b L ,x(x ) where φ 1 2 b L ,x(x ) = g(x ) + L x x 2. Then, we can see that φ 1 2 b L ,xs(x ) achieves its minimum in the ball {x | x xs 4 q by (14) and Lemma 3(a). Then, with Lemma 3(b,c) and bϵ = ε2 64 b L, we get that, 2 b L (xs) (2b L) xs ˆx 1 2 b L (xs) = 8 p b Lbϵ = ε, (15) Published as a conference paper at ICLR 2022 i.e., xs is an ε-FOSP of g. By combining the above two cases, we establish that 8B 3bϵ outer iterations ensure convergence to a ε-FOSP. We now compute the gradient call complexity of each of these outer" iterations, where we have two options for implementing Step 3 of Algorithm 2. Note that this step corresponds to solving minx maxλ S h(x, λ) up to an accuracy of bϵ/4. Option I, (Thekumparampil et al., 2019, Algorithm 2): Since the minimax optimization problem here is b L-strongly-convex concave and 2b L-gradient Lipschitz, (Thekumparampil et al., 2019, Theorem 1) tells us that this requires at most m gradient calls for each gi where, 28b L = O b L Therefore the number of gradient computations required for each iteration of inner problem is O b L ϵ log2 1 ε . Option II, Cutting plane method (Lee et al., 2015): Let us consider u(λ) := minx h(x, λ) + b L x xs 2, which is a b L-Lipschitz, concave function of λ. (Lee et al., 2015) tells us that we can use cutting plane algorithms to obtain bλ satisfying u(bλ) maxλ S u(λ) eϵ using O k log kb L eϵ gradient queries to u and poly(k log b L eϵ ) computation. The gradient of u is given by u(λ) = λh(x (λ), λ), where x (λ) := argminx h(x, λ) + b L x xs 2. Since h(x, λ) + b L x xs 2 is a 3b L-smooth and b L-strongly convex function in x, x (λ) can be computed up to eϵ error using gradient descent in O log b L eϵ iterations. 
If we choose eϵ = bϵ2/poly(k, b L/µ), then Proposition 2 tells us that x (bλ) satisfies the requirements of Step (3) of Algorithm 2 and the total number of gradient calls to each gi is at most O poly(k) log b L ϵ in each outer iteration of Algorithm 2. Proposition 2. Suppose h : Rd1 U R be such that h( , λ) is µ-strongly convex for every λ U, h(x, ) is concave for every x Rd1 and h is L-gradient Lipschitz. Let bλ be such that minx h(x, bλ) maxλ minx h(x, λ) ϵ and let x (bλ) := argminx h(x, bλ). Then, we have maxλ h(x (bλ), λ) minx maxλ h(x, λ) + c L µ ϵ + LDU µ ϵ , where DU = maxλ1,λ2 U λ1 λ2 is the diameter of U. Proof of Proposition 2. From the hypothesis, we have: ϵ h(x , λ ) h(x (bλ), bλ) h(x , bλ) h(x (bλ), bλ) µ 2 x x (bλ) 2, where (x , λ ) is the Nash equilibrium and the second step follows from the fact that λ = argmaxλ h(x , λ) and the third step follows from the fact that x (bλ) = argminx h(x, bλ). Consequently, we have that x x (bλ) p 2ϵ/µ. Let λ := argmaxλ h(x (bλ), λ). We now have Published as a conference paper at ICLR 2022 that: max λ h(x (bλ), λ) max λ min x h(x, λ) = h(x (bλ), λ) h(x , λ ) = h(x (bλ), λ) h(x (bλ), λ ) + h(x (bλ), λ ) h(x , λ ) (ζ1) λh(x (bλ), λ ), λ λ + L 2 x (bλ) x 2 (ζ2) λh(x (bλ), λ ) λ λ + Lϵ (ζ3) λh(x , λ ) + L x (bλ) x DU + Lϵ where (ζ1) follows from the fact that h(x (bλ), ) is concave and x = argminx h(x, λ ), (ζ2) follows from the bound x x (bλ) p 2ϵ/µ, (ζ3) follows from the L-gradient Lipschitz property of h, and the last step follows from the fact that λh(x , λ ) = 0. This proves the proposition. E PROOFS OF RESULTS IN SECTION 5 In this appendix, we present the proofs for the lemmas in Section 5. Recall the SGA update: yt+1 = yt + η yfσ(t)(x, yt). (17) Therefore, the Jacobian of the T-step SGA update is given by Dyt+1 = I + η yyfσ(t)(x, yt) Dyt + η yxfσ(t)(x, yt), with A(x, z) = y T (x). (18) E.1 PROOF OF THEOREM 3 Theorem 3 (General Case). Suppose for all j [n], fj satisfies Assumption 1. Then, for any fixed randomness z, T-step SGA is (1 + ηL)T -Lipschitz and 4(ρ/L) (1 + ηL)2T -gradient Lipschitz. Proof. Lipschitz of yt(x). We first show the Lipschitz claim. We have the following bound on the Jacobian of the update equation given in (18): Dyt+1(x) (I + η 2 yyf(x, yt(x)))Dyt(x) + η 2 yxf(x, yt(x)) (1 + ηL) Dyt(x) + ηL. Since Dy0(x) = 0, the above recursion implies that τ=0 (1 + ηL)τ (1 + ηL)t. Gradient-Lipschitz of yt(x). Next, we show the claimed gradient Lipschitz constant. As above, using the update equation in (18), we have the following bound on the Jacobian: Dyt+1(x1) Dyt+1(x2) (I + η 2 yyf(x1, yt(x1)))(Dyt(x1) Dyt(x2)) + η 2 yxf(x1, yt(x1)) 2 yxf(x2, yt(x2)) + η [ 2 yyf(x1, yt(x1)) 2 yyf(x2, yt(x2))]Dyt(x2) (1 + ηL) Dyt(x1) Dyt(x2) + ηρ(1 + Dyt(x2) )( x1 x2 + yt(x1) yt(x2) ) (1 + ηL) Dyt(x1) Dyt(x2) + 4ηρ(1 + ηL)2t x1 x2 . The above recursion implies the claimed Lipschitz constant. Indeed, Dyt(x1) Dyt(x2) 4ηρ τ=0 (1 + ηL)t+τ 1 x1 x2 4(ρ/L) (1 + ηL)2t x1 x2 . Published as a conference paper at ICLR 2022 E.2 PROOF OF THEOREM 4 Theorem 4 (Concave Case). Suppose for all j [n], fj satisfies Assumption 1 and fj(x, ) is concave for any x. Then, for any fixed randomness z, T-step SGA is ηLT-Lipschitz and (ρ/L) (1 + ηLT)3gradient Lipschitz. Proof. Lipschitz of yt(x). We first show the Lipschitz claim. Using the update equation in (18), we have the following bound on the Jacobian: Dyt+1(x) (I + η 2 yyf(x, yt(x)))Dyt(x) + η 2 yxf(x, yt(x)) Dyt(x) + ηL. Since Dy0(x) = 0, the above recursive implies that Dyt(x) ηLt. Gradient-Lipschitz of yt(x). 
Next, we show the claimed gradient Lipschitz constant. Using the update equation in (18):, we have the following bound on the Jacobian: Dyt+1(x1) Dyt+1(x2) (I + η 2 yyf(x1, yt(x1)))(Dyt(x1) Dyt(x2)) + η 2 yxf(x1, yt(x1)) 2 yxf(x2, yt(x2)) + η [ 2 yyf(x1, yt(x1)) 2 yyf(x2, yt(x2))]Dyt(x2) Dyt(x1) Dyt(x2) + ηρ(1 + Dyt(x2) )( x1 x2 + yt(x1) yt(x2) ) Dyt(x1) Dyt(x2) + ηρ(1 + ηLt)2 x1 x2 . This recursion implies the following gradient Lipschitz constant: Dyt(x1) Dyt(x2) ηρ τ=0 (1 + ηLτ)2 x1 x2 (ρ/L) (1 + ηLt)3 x1 x2 . E.3 PROOF OF THEOREM 5 Theorem 5 (Strongly-concave Case). Suppose for all j [n], fj satisfies Assumption 1 and fj(x, ) is α-strongly concave for any x. Then, for any fixed randomness z, T-step SGA is κ-Lipschitz and 4(ρ/L) κ3-gradient Lipschitz, where κ = L/α is the condition number. Proof. Denote the condition number κ = L/α. Lipschitz of yt(x). We first show the claimed Lipschitz constant. Using the update equation in (18), we have that Dyt+1(x) (I + η 2 yyf(x, yt(x)))Dyt(x) + η 2 yxf(x, yt(x)) (1 ηα) Dyt(x) + ηL. Since Dy0(x) = 0, the above recursion gives the following bound: τ=0 (1 ηα)τ κ. Gradient-Lipschitz of yt(x). Next we show the claimed gradient Lipschitz constant. Again, using the update equation, we have that Dyt+1(x1) Dyt+1(x2) (I + η 2 yyf(x1, yt(x1)))(Dyt(x1) Dyt(x2)) + η 2 yxf(x1, yt(x1)) 2 yxf(x2, yt(x2)) + η [ 2 yyf(x1, yt(x1)) 2 yyf(x2, yt(x2))]Dyt(x2) (1 ηα) Dyt(x1) Dyt(x2) + ηρ(1 + Dyt(x2) )( x1 x2 + yt(x1) yt(x2) ) (1 ηα) Dyt(x1) Dyt(x2) + 4ηρκ2 x1 x2 . Published as a conference paper at ICLR 2022 This recursion implies that Dyt(x1) Dyt(x2) 4ηρκ2 t 1 X τ=0 (1 ηα)τ x1 x2 4(ρ/L) κ3 x1 x2 . E.4 PROOF OF THEOREM 6 Theorem 6 (General Case). Suppose for all j [n], fj satisfies Assumption 1. Then, for any fixed seed z, T-step SNAG is T(1 + ηL/θ)T -Lipschitz and 50(ρ/L) T 3(1 + ηL/θ)2T -gradient Lipschitz. Proof. Recall the SNAG update yt = yt + (1 θ)(yt yt 1) (19) yt+1 = yt + η yfσ(t)(x, yt). (20) Observe that the update equation for T-step SNAG implies that D yt = Dyt + (1 θ)(Dyt Dyt 1) Dyt+1 = (I + η yyfσ(t)(x, yt))D yt + η yxfσ(t)(x, yt) (21) Lipschitz of yt(x), vt(x). We first show the claimed Lipschitz constant. By the update equations in (21), we have that Dyt+1 = (I + η yyfσ(t)(x, yt))(Dyt + (1 θ)(Dyt Dyt 1)) + η yxfσ(t)(x, yt). Denote δt = Dyt Dyt 1 , and note that Dy0 = Dy 1 = 0 so that δ0 = 0. By the equation above, we have that δt+1 ηL Dyt + (1 + ηL)(1 θ)δt + ηL τ=1 δτ + (1 + ηL)(1 θ)δt + ηL. In the following, we use induction to prove that δt (1 + ηL/θ)t. (22) It is easy to verify that this is true for the base case δ0 = 0 1. Suppose the claim is true for all τ t, then we have that τ=1 (1 + ηL/θ)τ + (1 + ηL)(1 θ)(1 + ηL/θ)t + ηL τ=0 (1 + ηL/θ)τ + (1 θ)(1 + ηL/θ)t+1 =θ[(1 + ηL/θ)t+1 1] + (1 θ)(1 + ηL/θ)t+1 (1 + ηL/θ)t+1. This proves the induction claim. Therefore, by (22), we have the following two bounds: τ=1 δτ t(1 + ηL/θ)t, D yt(x) (2 θ) Dyt(x) + (1 θ) Dyt 1(x) 3t(1 + ηL/θ)t. Gradient-Lipschitz of yt(x). Next, we show the claimed gradient Lipschitz constant. For any fixed x1, x2, denote wt = Dyt(x1) Dyt(x2), we have wt+1 =(I + η yyfσ(t)(x1, yt(x1)))(wt + (1 θ)(wt wt 1)) + η( yxfσ(t)(x1, yt(x1)) yxfσ(t)(x2, yt(x2))) | {z } T1 + η( yyfσ(t)(x1, yt(x1)) yyfσ(t)(x2, yt(x2)))(Dyt(x2) + (1 θ)(Dyt(x2) Dyt 1(x2))) | {z } T2 Published as a conference paper at ICLR 2022 We note that we can upper bound the last two terms above as follows: T1 + T2 ηρ( x1 x2 + yt(x1) yt(x2) ) + ηρ( x1 x2 + yt(x1) yt(x2) )(2 Dyt(x2) + Dyt 1(x2) ) 24ηρt2(1 + ηL/θ)2t x1 x2 . 
Therefore, let ζt = wt wt 1 , and = x1 x2 , we have the following: ζt+1 ηL wt + (1 + ηL)(1 θ)ζt + 24ηρt2(1 + ηL/θ)2t τ=1 ζτ + (1 + ηL)(1 θ)ζt + 24ηρt2(1 + ηL/θ)2t . In the following, we use induction to prove that ζt 50(ρ/L) t2(1 + ηL/θ)2t := ψ(t). (23) It is easy to verify that this is true for the base case ζ0 = 0. Suppose the claim is true for all τ t, then we have that τ=1 τ 2(1 + ηL/θ)2τ + (1 + ηL)(1 θ)ψ(t) + 24ηρt2(1 + ηL/θ)2t θ(ρ/L) [50t2 (1 + ηL/θ)2(t+1) 1 (1 + ηL/θ)2 1 + 24(ηL/θ)t2(1 + ηL/θ)2t] + (1 θ)ψ(t + 1) θ(ρ/L) [25t2(1 + ηL/θ)2(t+1) + 24t2(1 + ηL/θ)2(t+1)] + (1 θ)ψ(t + 1) θ(ρ/L) [50t2(1 + ηL/θ)2(t+1)] + (1 θ)ψ(t + 1) ψ(t + 1). This proves the induction claim. Therefore, by (23), we have that Dyt(x1) Dyt(x2) = wt τ=1 ζτ 50(ρ/L)t3(1 + ηL/θ)2t x1 x2 . E.5 PROJECTED GRADIENT ASCENT IS NOT GRADIENT-LIPSCHITZ Proposition 3. Consider f(x, y) = xy for (x, y) X Y, where X = [0, 10] and Y = [0, 1]. 1-step projected gradient ascent given by: y1(x) = PY(y0 + η yf(x, y0)) y0 = 0, where η > 1/10 is not a gradient-Lipschitz algorithm. Proof. We see that y1(x) = min(1, ηx) and f(x, y1(x)) = x min(1, ηx). For η < 1/10, we see that f(x, y1(x)) is not gradient-Lipschitz at x = 1/η (0, 10). F ON NONCONVERGENCE OF EXTRAGRADIENT METHOD AND OPTIMISTIC GRADIENT DESCENT ASCENT (OGDA) In this section, we show that there exist nonconvex-nonconcave problems where the extragradient method and OGDA get trapped in limit cycles. While the example we construct is a quadratic function that is not bounded over all of the space, it will be clear from the proof that the same results hold for any function that matches the constructed quadratic locally around origin. Consider the following two dimensional quadratic function parametrized by a, b R: f(x, y) = 1 2ax2 + bxy 1 2 (x y) a b b a Published as a conference paper at ICLR 2022 Denote z := (x, y), Φ := a b b a F(z) := xf(x, y) yf(x, y) Let the eigenvalues of matrix Φ be λ1, λ2. We know they satify the characteristic equation (λ a)2 + b2 = 0. By Vieta s formulas, we know that λ1 + λ2 = 2a, λ1λ2 = a2 + b2. This justifies the following proposition. Proposition 4. For any complex number c C, there exist a, b R s.t. the eigenvalues of Φ are c and c. Here c is the complex conjugate of c. Proof. Let c = p + qi for some p, q R, we can choose a = (c + c)/2 = p, b = c c a2 = q. It is easy to check such matrix Φ has eigenvalues c and c. F.1 EXTRAGRADIENT METHODS The extragradient algorithm (Korpelevich, 1976) uses the following update: zt+1/2 =zt ηF(zt) zt+1 =zt ηF(zt+1/2) For the case of quadratic function specified by (24), this update is equivalent to zt+1 = (I ηΦ(I ηΦ))zt = (I ηΦ + η2Φ2)zt Clearly, the eigenvalues of matrix (I ηΦ+η2Φ2) are simply (1 ηλ1+η2λ2 1) and (1 ηλ2+η2λ2 2). Proposition 5. For the extragradient algorithm with any fixed learning rate η, there exists a quadratic function, and appropriate initial points such that the extragradient algorithm is trapped in a limit cycle. Proof. For any given θ (0, 2π), we can simply pick λ1 C to be the solution of equation 1 ηλ1 + η2λ2 1 = eiθ. This is a quadratic equation of λ1, which always has nonzero complex roots. For λ2 = λ1, we verify that 1 ηλ2 + η2λ2 2 = e iθ. Finally, by Proposition 4, we know that there exists a choice of a, b such that, the two eigenvalues of (I ηΦ + η2Φ2) are precisely eiθ and e iθ. 
If the initial point x0 y0 can be decomposed as x0 y0 = α1z1 + α2z2, where z1 and z2 are the two eigenvectors of Φ corresponding to eigenvalues λ1 and λ2 respectively, we see that the extra gradient algorithm follows the trajectory xt yt = α1eiθtz1 + α2e iθtz2, thereby getting trapped in a limit F.2 OPTIMISTIC GRADIENT DESCENT ASCENT (OGDA) The OGDA algorithm (Daskalakis & Panageas, 2018a) has the following update: zt+1 = zt 2ηF(zt) + ηF(zt 1). (Daskalakis & Panageas, 2018a)[Section 4.1] presents an example where OGDA is empirically not shown to converge for some initialization points. Here, we present an example for which nonconvergence of OGDA is shown rigorously. We can rewrite this update as zt+1 zt = I 2ηΦ ηΦ I 0 There are four eigenvalues for the matrix I 2ηΦ ηΦ I 0 . Two of them µ1, µ2 are eigenval- ues of the matrix 1 2ηλ1 ηλ1 1 0 while the other two µ3, µ4 are eigenvalues of the matrix 1 2ηλ2 ηλ2 1 0 , where λ1 and λ2 are the eigenvalues of Φ. Published as a conference paper at ICLR 2022 Proposition 6. For the OGDA algorithm with any fixed learning rate η, there exists a quadratic function, and a choice of initial points such that the iterates are trapped in a limit cycle. Proof. We will fist show that there is a choice for a and b so that the resulting eigenvalues µ1, µ2 of 1 2ηλ1 ηλ1 1 0 and µ3, µ4 of 1 2ηλ2 ηλ2 1 0 satisfy |µ1| = |µ3| = 1 and |µ2| = |µ4| < 1. Let us first fix µ1 = eiθ and choose ηλ1 = µ1(µ1 1) 2µ1 1 so that we have µ1 is a root of µ2 (1 2ηλ1)µ + ηλ1 = 0. We note that µ2 is the other root of the above equation. We have that: |ηλ1| = |µ1 1| (cos θ 1)2 + sin2 θ (2 cos θ 1)2 + sin2 θ. Therefore, for any θ (0, π/4], we have |ηλ1| < 1. That is |µ1µ2| < 1, since |µ1| = 1, we have |µ2| < 1. By symmetry, we know by choosing λ2 = λ1, the two eigenvalues of 1 2ηλ2 ηλ2 1 0 are µ3 = µ1 and µ4 = µ2. In sum, we have proved that for any θ (0, π/4], and any learning rate η, there exists a choice of a, b such that the four eigenvalues of I 2ηΦ ηΦ I 0 are e iθ, c , c where |c | = | c | < 1. If the initial points z1 z0 have the decomposition P4 i=1 αi zi, where zi is the eigenvector corresponding to eigenvalue µi, then the iterates zt and zt 1 are given by zt zt 1 = P4 i=1 αiµt i zi. As t , zt zt 1 converges to α1eiθt z1 + α3e iθt z3, proving that OGDA iterates get trapped in a limit cycle and neither converge to a single point nor escape to infinity. G ADDITIONAL EXPERIMENTS AND DETAILS In this appendix section, we provide additional experimental results and details. Dirac-GAN. In the results presented in Section 6 for this problem, the discriminator sampled its initialization uniformly from the interval [ 0.1, 0.1] and performed T = 10 steps of gradient ascent. For the results given in Figure 5, we allow the discriminator to sample uniformly from the interval [ 0.5, 1] and consider the discriminator performing T = 100 (Figure 5b) and T = 1000 (Figure 5c) gradient ascent steps. The rest of the experimental setup is equivalent to that described in Section 6. For the result presented in Figure 5b, we see that with this distribution of initializations for the discriminator and T = 100 gradient ascent steps, the generator is not able to converge to the optimal parameter of θ = 0 to recreate the underlying data distribution using our algorithm which descends f(θ, A(θ)) or the algorithm that descends θf(θ, A(θ)). However, we see that our algorithm converges significantly closer to the optimal parameter configuration. 
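For concreteness, the following is a minimal sketch of the Dirac-GAN procedure described above under simplifying assumptions: the objective f(θ, ψ) = −log(1 + e^{−ψθ}) (dropping the constant term contributed by the real data point at zero), the learning rates, the initialization interval, and the finite-difference total derivative are all illustrative and may differ from the exact setup used in Section 6.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, psi):
    # Assumed Dirac-GAN objective (generator parameter theta, linear discriminator psi).
    return -np.log(1.0 + np.exp(-psi * theta))

def ascend_discriminator(theta, psi0, T=10, eta=0.5):
    # T steps of gradient ascent on psi from a fresh initialization psi0.
    psi = psi0
    for _ in range(T):
        psi += eta * theta / (1.0 + np.exp(psi * theta))  # df/dpsi
    return psi

def total_grad(theta, psi0, h=1e-5):
    # Finite-difference total derivative of theta -> f(theta, A(theta)) with the
    # discriminator initialization (the randomness z) held fixed.
    return (f(theta + h, ascend_discriminator(theta + h, psi0))
            - f(theta - h, ascend_discriminator(theta - h, psi0))) / (2 * h)

theta, lr = 0.8, 0.05
for _ in range(500):
    psi0 = rng.uniform(-0.1, 0.1)   # discriminator re-initialized before every generator update
    theta -= lr * total_grad(theta, psi0)
print(f"final generator parameter: {theta:.3f} (the data distribution corresponds to theta = 0)")
```

The key structural points this sketch mirrors are the re-initialization of the adversary before every generator update and the descent of the total gradient through the T ascent steps rather than the partial gradient at the returned discriminator parameters.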
Furthermore, we still observe stability and convergence from our training method, whereas standard training methods using simultaneous or alternating gradient descent-ascent always cycle. This example highlights that the optimization through the algorithm of the adversary is important not only for the rate of convergence, but also because it influences what the training method converges to, and it gives improved results in this regard. Finally, in the result presented in Figure 5c, we see that with this distribution of initializations for the discriminator and T = 1000 gradient ascent steps, the generator is able to converge to the optimal parameter of θ = 0 and recreate the underlying data distribution using either our algorithm, which descends ∇f(θ, A(θ)), or the algorithm that descends ∇_θ f(θ, A(θ)). Thus, while with T = 100 we did not observe convergence to the optimal generator parameter, with a stronger adversary we do see convergence to the optimal generator parameter. This behavior can be explained by the fact that when the discriminator is able to perform enough gradient ascent steps to nearly converge, the gradients ∇f(θ, A(θ)) and ∇_θ f(θ, A(θ)) are nearly equivalent. We remark that we repeated the experiments 5 times with different random seeds and show the mean generator parameters during training with a window of one standard deviation around the mean. The results were very similar between runs, so the window around the mean is not visible.

Figure 5: Dirac-GAN: Generator parameters while training using our framework with and without optimizing through the discriminator, where between each generator update the discriminator samples an initial parameter choice uniformly at random from the interval [−0.5, 1] and then performs T = 100 (Figure 5b) and T = 1000 (Figure 5c) steps of gradient ascent.

Figure 6: Mixture of Gaussians: Figure 6a shows the real data distribution and Figures 6b–6k show the final generated distributions after 150k generator updates from 10 separate runs of the training procedure described in Section 6 using gradient ascent for the discriminator. Each run recovers a generator distribution closely resembling the underlying data distribution.

Mixture of Gaussians. We noted in Section 6 that we repeated our experiment training a generative adversarial network to learn a mixture of Gaussians 10 times and observed that for each run of the experiment our training algorithm recovered all modes of the distribution. We now show those results in Figure 6. In particular, in Figure 6a we show the real data distribution and in Figures 6b–6k we show the final generated distribution from 10 separate runs of the training procedure after 150k generator updates. Notably, we observe that each run of the training algorithm is able to generate a distribution that closely resembles the underlying data distribution, showing the stability and robustness of our training method. We also performed an experiment on the mixture of Gaussians problem in which the discriminator algorithm was the Adam optimization procedure with parameters (β1, β2) = (0.99, 0.999) and learning rate η2 = 0.004, and the generator learning rate was η1 = 0.05. The rest of the experimental setup remained the same. We ran this experiment 10 times and observed that for 7 out of the 10 runs the final generated distribution was reasonably close to the real data distribution, while for 3 out of the 10 runs the generator did not learn the proper distribution.
This is to say that we found the training algorithm was not as stable when the discriminator used Adam versus plain gradient ascent. The final generated distributions from the 7 out of 10 runs that produced reasonable distributions are shown in Figure 7.

Figure 7: Mixture of Gaussians: Figure 7a shows the real data distribution and Figures 7b–7h show the final generated distributions after 150k generator updates from the 7 out of 10 separate runs of the training procedure using Adam optimization for the discriminator that produced reasonable distributions.

Adversarial Training. We now provide some further background on the adversarial training experiment and additional results. It is now well-documented that deep learning classification models can be vulnerable to adversarial attacks that perturb the input data (see, e.g., Biggio et al. 2013; Szegedy et al. 2014; Kurakin et al. 2017; Madry et al. 2018). A common approach toward remedying this vulnerability is training the classification model against adversarial perturbations. Recall from Section 6 that, given a data distribution D over pairs of examples x ∈ R^d and labels y ∈ [k], parameters θ of a neural network, a set S ⊆ R^d of allowable adversarial perturbations, and a loss function ℓ(·, ·, ·) dependent on the network parameters and the data, adversarial training amounts to the minimax optimization problem min_θ E_{(x,y)∼D}[max_{δ∈S} ℓ(θ, x + δ, y)]. A typical approach to solving this problem is an alternating optimization approach (Madry et al., 2018). In particular, each time a batch of data is drawn from the distribution, T steps of projected gradient ascent are performed by ascending along the sign of the gradient of the loss function with respect to the data and projecting back onto the set of allowable perturbations; then the parameters of the neural network are updated by descending along the gradient of the loss function evaluated at the perturbed examples. The experimental setup we consider is analogous, but the inner maximization loop performs T steps of regular gradient ascent (not using the sign of the gradient and without projections).

For the adversarial training experiment considered in Section 6, we also evaluated the trained models against various other attacks. Recall that the models were trained using T = 10 steps of gradient ascent in the inner optimization loop with a learning rate of η2 = 4. To begin, we evaluated the trained models against gradient ascent attacks with a fixed learning rate of η2 = 4 and a number of steps T ∈ {5, 10, 20, 40}. These results are presented in Figure 8. We also evaluated the trained models against gradient ascent attacks with a fixed budget of Tη2 = 40 and various choices of T and η2. These results are presented in Figure 9. Finally, we evaluated the trained models against attacks using the Adam optimization method with a fixed budget of Tη2 = 0.04 and various choices of T and η2. These results are presented in Figure 10. Notably, we see that our algorithm outperforms the baselines, and similar conclusions can be drawn as from the experiments for adversarial training presented in Section 6. The additional experiments highlight that our method of adversarial training is robust against attacks that the algorithm did not use in training when they are of comparable computational power, and also that it improves robustness against attacks of greater computational power than was used during training.
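The following is a minimal sketch of this inner maximization loop under stated assumptions: the model, loss function, optimizer, and hyperparameters are placeholders, and the parameter update shown is the standard alternating step; our algorithm, by contrast, descends ∇f(θ, A(θ)) through the T ascent steps (computed in practice with the higher library), which this sketch omits.

```python
import torch

def attack_batch(model, loss_fn, x, y, T=10, eta=4.0):
    # T steps of plain gradient ascent on the loss w.r.t. an additive perturbation delta
    # (no sign of the gradient and no projection, unlike PGD-based adversarial training).
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(T):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + eta * grad).detach().requires_grad_(True)
    return delta.detach()

def training_step(model, loss_fn, optimizer, x, y, T=10, eta=4.0):
    # Alternating scheme: approximately maximize over delta, then descend the
    # model parameters on the loss evaluated at the perturbed examples.
    delta = attack_batch(model, loss_fn, x, y, T=T, eta=eta)
    optimizer.zero_grad()
    loss = loss_fn(model(x + delta), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```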
In Section 6, we showed the results of evaluating the gradient norms ‖∇f(θ, A(θ))‖ as a function of the number of gradient ascent steps T in the adversary algorithm A and observed that the norm grows much more slowly than exponentially. Here we provide more details on the setup. We took a run of our algorithm trained with the setup described in Section 6 and retrieved the models that were saved after 25, 50, 75, and 100 training epochs. For each model, we then sampled 100 minibatches of data and for each minibatch performed T ∈ {20, 30, 40, 50, 60, 70, 80, 90, 100} steps of gradient ascent with learning rate η2 = 4 (the learning rate from training) and then computed the norm of the gradient ∇f(θ, A(θ)), where A corresponds to the gradient ascent procedure with the given number of steps and learning rate. In Figure 2, which is reproduced here in Figure 11a, the mean of the norm of the gradients over the sampled minibatches is shown with the shaded window indicating a standard deviation around the mean. We also repeated this procedure using η2 = 1 and show the results in Figure 11b, from which similar conclusions can be drawn.

Figure 8: Adversarial Training: Test accuracy during the course of training against gradient ascent attacks with a fixed learning rate of η2 = 4 and the number of steps T ∈ {5, 10, 20, 40}.

Figure 9: Adversarial Training: Test accuracy during the course of training against gradient ascent attacks with a fixed attack budget of Tη2 = 40, where T is the number of attack steps and η2 is the learning rate (LR).

Figure 10: Adversarial Training: Test accuracy during the course of training against Adam optimization attacks with a fixed attack budget of Tη2 = 0.04, where T is the number of attack steps and η2 is the learning rate (LR).

Figure 11: Adversarial Training: ‖∇f(θ, A(θ))‖ as a function of the number of steps T taken by the gradient ascent algorithm A, evaluated at multiple points in the training procedure. Figure 11a corresponds to using η2 = 4 in the gradient ascent procedure and Figure 11b corresponds to using η2 = 1 in the gradient ascent procedure.

Finally, we provide details on the convolutional neural network model for the adversarial training experiments. In particular, this model is exactly the same as that considered in Nouiehed et al. (2019) and we reproduce it in Table 1.

Table 1: Convolutional neural network model for the adversarial training experiments.

Layer Type              | Shape
------------------------|------------
Convolution + ReLU      | 5 × 5 × 20
Max Pooling             | 2 × 2
Convolution + ReLU      | 5 × 5 × 20
Max Pooling             | 2 × 2
Fully Connected + ReLU  | 800
Fully Connected + ReLU  | 500
Softmax                 | 10

Experimental Details. For the experiments with neural network models we used two Nvidia GeForce GTX 1080 Ti GPUs and the PyTorch higher library (Deleu et al., 2019) to compute ∇f(θ, A(θ)). In total, running all the experiments in the paper takes about half a day with this computational setup. The code for the experiments is included in the supplementary material with instructions on how to run it. We built our code for the adversarial training based on code openly provided by the authors of Nouiehed et al. (2019).