# nesterov_meets_optimism_rateoptimal_separable_minimax_optimization__060268d3.pdf

Nesterov Meets Optimism: Rate-Optimal Separable Minimax Optimization

Chris Junchi Li * 1 Huizhuo Yuan * 2 Gauthier Gidel 3 Quanquan Gu 2 Michael I. Jordan 1 4

We propose a new first-order optimization algorithm Accelerated Gradient-Optimistic Gradient (AG-OG) Descent Ascent for separable convexconcave minimax optimization. The main idea of our algorithm is to carefully leverage the structure of the minimax problem, performing Nesterov acceleration on the individual component and optimistic gradient on the coupling component. Equipped with proper restarting, we show that AG-OG achieves the optimal convergence rate (up to a constant) for a variety of settings, including bilinearly coupled strongly convex-strongly concave minimax optimization (bi-SC-SC), bilinearly coupled convex-strongly concave minimax optimization (bi-C-SC), and bilinear games. We also extend our algorithm to the stochastic setting and achieve the optimal convergence rate in both bi-SC-SC and bi-C-SC settings. AG-OG is the first single-call algorithm with optimal convergence rates in both deterministic and stochastic settings for bilinearly coupled minimax optimization problems.

1. Introduction

Optimization is the workhorse for machine learning (ML) and artificial intelligence. While many ML learning tasks can be cast as a minimization problem, there is an increasing number of ML tasks, such as generative adversarial networks (GANs) (Goodfellow et al., 2020), robust/adversarial training (Bai & Jin, 2020; Madry et al., 2017), Markov games (MGs) (Shapley, 1953), and reinforcement learning (RL) (Sutton & Barto, 2018; Du et al., 2017; Dai et al.,

*Equal contribution 1Department of Electrical Engineering and Computer Sciences, University of California, Berkeley 2Department of Computer Sciences, University of California, Los Angeles 3DIRO, Universit e de Montr eal and Mila 4Department of Statistics, University of California, Berkeley. Correspondence to: Chris Junchi Li <junchili@berkeley.edu>, Gauthier Gidel <gauthier.gidel@umontreal.ca>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

2018), that are instead formulated as a minimax optimization problem in the following form:

min x X max y Y L(x, y). (1.1)

When L(x, y) : X Y R is a smooth function that is convex in x and concave in y, we refer to this problem as a convex-concave minimax problem (a.k.a., convex-concave saddle point problem). In this work, we focus on designing fast or even optimal deterministic and stochastic first-order algorithms for solving convex-concave minimax problems of the form (1.1).

Unlike in the convex minimization setting, where gradient descent is the method of choice, the gradient descent-ascent method can exhibit divergence on convex-concave objectives. Indeed, examples show the divergence of gradient descent ascent (GDA) on bilinear objectives (Liang & Stokes, 2019; Gidel et al., 2018). This has led to the development of extrapolation-based methods, including the extragradient (EG) method (Korpelevich, 1976) and the optimistic gradient descent ascent (OGDA) method (Popov, 1980), both of which can be shown to converge in the convex-concave setting. While the EG algorithm needs to call the gradient oracle twice at each iteration, the OGDA algorithm only needs a single call to the gradient oracle (Gidel et al., 2018; Hsieh et al., 2019) and therefore has a practical advantage when the gradient evaluation is expensive. We build on this line of research, aiming to attain improved, and even optimal, convergence rates via algorithms that retain the spirit of simplicity of OGDA.

We focus on a specific instance of the general minimax optimization problem, namely the separable minimax optimization problem, which is formulated as follows

min x X max y Y L(x, y) = f(x) + I(x, y) g(y). (1.2)

We refer to f(x) g(y) as the individual component, and I(x, y) as the coupling component of Problem (1.2). Let f be µf-strongly convex and Lf-smooth and g be µg-strongly convex and Lg-smooth. Let I(x, y) be convex-concave with blockwise smoothness parameters Ixx, Ixy, Iyy where || 2 xx I||op Ixx, || 2 xy I||op Ixy, and || 2 yy I||op Iyy. Let I(x, y) be LH-smooth, and it is straightforward to observe that LH can be picked as small as Ixx Iyy +

Rate-Optimal Separable Minimax Optimization

Ixy. Throughout this paper, we focus on the unconstrained problem where X = Rn and Y = Rm unless otherwise specified in certain applications.

A notable special case of the separable minimax Problem (1.2) is the so-called bilinearly coupled strongly convexstrongly concave minimax problem (bi-SC-SC), which has the following form:

min x X max y Y L(x, y) f(x) + x By g(y). (1.3)

Here we take I(x, y) as the bilinear coupling function x By and is LH-smooth where LH can be picked as small as the operator norm B op of matrix B.

For the general minimax optimization problem (1.1), standard algorithms such as mirror-prox (Nemirovski, 2004), EG and OGDA when operating on the entire objective can be shown to exhibit a complexity upper bound of L µ log 1

ϵ for finding an ϵ-accurate solution (Gidel et al., 2018; Mokhtari et al., 2020a), where L Lf Lg LH and µ µf µg. Such a complexity is optimal when Lf = Lg = LH and µf = µg, since the lower-bound complexity is Ω( L µ log( 1

ϵ )) Nemirovskij & Yudin (1983); Azizian et al. (2020). However, in the general case where the strong convexity and smoothness parameters are significantly different in x and y, fine-grained convergence rates that depend on the individual strong convexity µf, µg and smoothness parameters Lf, Lg and also Ixx, Ixy, Iyy are more desirable. In fact, Zhang et al. (2021a) have proved the following iteration complexity lower bound for solving (1.2) via any first-order algorithms under the linear span assumption:

µf + Lg + Iyy

With the goal of attaining this lower bound, several efforts have been made in the setting of bi-SC-SC (1.3) or separable SC-SC (1.2). Two notable methods are LPD (Thekumparampil et al., 2022) and PD-EG (Jin et al., 2022), which utilize techniques from primal-dual lifting and convex conjugate decomposition. Another approach is the APDG algorithm developed by Kovalev et al. (2021), which is based on adding an extrapolation step to the forward-backward algorithm. The work of Du et al. (2022) is also closely related to our work, in the sense that it uses iterate averaging and employs scaling reduction with scheduled restarting. However, these algorithms are either limited to the bi-SC-SC setting (Kovalev et al., 2021; Thekumparampil et al., 2022; Du et al., 2022), or are not single-call algorithms (Kovalev et al., 2021; Jin et al., 2022; Du et al., 2022).1 In addition, only Kovalev et al. (2021) and Du et al. (2022) can be extended to the

1By single call, we mean the algorithm only needs to call the

stochastic setting (Metelev et al., 2022), while the extension of Thekumparampil et al. (2022); Jin et al. (2022) to the stochastic setting remains elusive.

In this paper, we design near-optimal single-call algorithms for both deterministic and stochastic separable minimax problems (1.2). We focus on accelerating OGDA because of its simplicity and because it enjoys the single-call property. We show that it achieves a fine-grained, accelerated convergence rate with a sharp dependency on the individual Lipschitz constants. To the best of our knowledge, this is the first presentation of a single-call algorithm that matches the best-known result for the separable minimax problem (1.2) and the lower bounds under a bi-SC-SC setting (1.3), bilinearly coupled convex-strongly concave (bi-C-SC) setting (i.e., f is convex but not strongly convex in (1.3)), and the bilinear game setting (i.e., setting f = g = 0 in (1.3)).

1.1. Contributions

We highlight our contributions as follows.

(i) We present a novel algorithm that blends acceleration dynamics based on the single-call OGDA algorithm for the coupling component and Nesterov s acceleration for the individual component. We refer to this new algorithm as the Accelerated Gradient-Optimistic Gradient (AG-OG) Descent Ascent algorithm. Using a scheduled restarting, we derive an Accelerated Gradient Optimistic Gradient with restarting (AG-OG with restarting) algorithm that achieves a sharp convergence rate in a variety of settings. We provide theoretical analysis of our algorithm for general separable SC-SC problem (1.2) and compare the results with existing literature under special cases in the form of (1.3) (bi-SC-SC, bi-C-SC and Bilinear).

(ii) Using a scheduled restarting, we derive an Accelerated Gradient-Optimistic Gradient with restarting (AG-OG with restarting) algorithm that achieves a sharp convergence rate in a variety of settings. For general separable SC-SC setting in (1.2), our algorithm achieves a complexity

µf Ixy µf µg Iyy

ϵ , matching the best known upper bound in Jin et al. (2022). For the setting of bilinearly coupled SCSC in (1.3), our algorithm achieves a complexity

ϵ [Corollary 3.4], which matches the lower bound established by Zhang et al. (2021a). For bi-C-SC, we prove a

(stochastic) gradient oracle of the coupling component once in each iteration of the algorithm. This is in accordance with the concept of single-call variants of extragradient in Hsieh et al. (2019). Previous work calls I(x, y) at least twice per iteration.

Rate-Optimal Separable Minimax Optimization

µg + B op ϵµg

ϵ complexity [Theo-

rem 3.5], which matches that of Thekumparampil et al. (2022) and is also optimal.

(iii) In the stochastic setting where the algorithm can only query a stochastic gradient oracle with bounded noise, we propose a stochastic extension of AG-OG with restarting and establish a sharp convergence rate. For both bi-SC-SC and bi-C-SC settings, the convergence rate of our algorithm is near-optimal in the sense that its bias error matches the respective deterministic lower bound and its variance error matches the statistical minimax rate, i.e., σ2

µ2 f ϵ2 [Corollary 4.3].

(iv) In the special case of the bilinear game (when f = g = 0 in (1.3)), our algorithm has a complexity

ϵ [Theorem 3.6], which

matches the lower bound established by Ibrahim et al. (2020). Note that prior work (Kovalev et al., 2021; Thekumparampil et al., 2022; Jin et al., 2022) cannot achieve the optimal rate when applied to bilinear games, which is an unique advantage of our algorithm.

A summary of the iteration complexity comparisons with the state-of-the-art methods can be found in Table 1.

1.2. More Related Work

Deterministic Case. Much attention has been paid to obtaining linear convergence rates for gradient-based methods applied to games in the context of strongly monotone operators (which is implied by strong convexconcavity) (Mokhtari et al., 2020a) and several recent works (Yang et al., 2020; Zhang et al., 2021b; Cohen et al., 2020; Wang & Li, 2020; Xie et al., 2021) have bridged the gap with the lower bound provided for unbalanced stronglyconvex-strongly-concave objective. There has been a series of papers along this direction (Mokhtari et al., 2020a; Cohen et al., 2020; Lin et al., 2020a; Wang & Li, 2020; Xie et al., 2021), and only very recently have optimal results that reach the lower bound been presented (Kovalev et al., 2021; Thekumparampil et al., 2022; Jin et al., 2022). This work presented improved methods leveraging convex duality. Among these works, only Jin et al. (2022) considers non-bilinear coupling terms, and only Thekumparampil et al. (2022) considers single gradient calls. Note that Jin et al. (2022) consider a finite-sum case, which differs from our setting of a general expectation. Kovalev et al. (2021); Thekumparampil et al. (2022) focus solely on the deterministic setting, and Metelev et al. (2022) present a stochastic version of APDG algorithm (Kovalev et al., 2021) and its extension to a decentralized setting, which is comparable and concurrent with the work of Du et al. (2022).

Stochastic Case. There exists a rich literature on stochastic variational inequalities with application to solving stochastic minimax problems (Juditsky et al., 2011; Hsieh et al., 2019; Chavdarova et al., 2019; Alacaoglu & Malitsky, 2022; Zhao, 2022; Beznosikov et al., 2022). However, only a few works have proposed fine-grained bounds suited to the (bi-)SC-SC setting. To the best of our knowledge, most fine-grained bounds have been proposed in the finite-sum setting (Palaniappan & Bach, 2016; Jin et al., 2022) or in the proximal-friendly case (Zhang et al., 2021c). Two closely related works are Li et al. (2022), who provide a convergence rate for stochastic extragradient method in the purely bilinear setting and Du et al. (2022), who study an accelerated version of extragradient, dubbed as Accelerated Gradient Extra Gradient (AG-EG) in the bi-SC-SC setting. Our work is in the same vein as Du et al. (2022) but instead employs the optimistic gradient instead of extragradient to handle the bilinear coupling component. Optimistic-gradient-based methods have been considered extensively in the literature due to their need for fewer gradient oracle calls per iteration than standard extragradient and their applicability to the online learning setting (Golowich et al., 2020). Note that, in general, EG and OG methods share some similarities in their analyses, but there are also significant differences (Golowich et al., 2020, 3.1), (Gorbunov et al., 2022, 2). Specifically in our case, using a single-call algorithm that reuses previously calculated gradients alters a key recursion (Eq. (C.7)). Although the main part of the proof follows the standard path of estimating Nesterov s acceleration terms first, an additional squared error norm involving the previous iterates is present, intrinsically implying an additional iterative rule (Eq. (C.8)) in place of the original iterative rule that is essential for proving boundedness of the iterates. In addition, due to the accumulated error across iterates, the maximum stepsize allowed in single-call algorithms is forced to be smaller. We believe that this is not an artifact of our analysis but is a general feature of OG methods.2

Organization. The rest of this work is organized as follows. 2 introduces the basic settings and assumptions necessary for our algorithm and theoretical analysis. Our proposed Accelerated Gradient-Optimistic Gradient (AG-OG) Descent Ascent algorithm is formally introduced in 3 and further generalized to Stochastic Accelerated Gradient Optimistic Gradient (S-AG-OG) Descent Ascent in 4. We present our conclusions in 5. Due to space limitations, we defer all proof details along with results of numerical experiments to the supplementary materials.

Notation. For two sequences of positive scalars {an} and {bn}, we denote an = Ω(bn) (resp. an = O(bn)) if

2Limited by space, we refer readers to C.1 and C.4 for technical details.

Rate-Optimal Separable Minimax Optimization

Method Setting SC-SC bi-SC-SC Bilinear bi-C-SC Stochastic Rate Single Call

OGDA (Mokhtari et al., 2020b) e O L f L g Ixy µf µg

e O Lf Lg B op

e O B 2 op λmin

O Lf Lg B op

Proximal Best Response (Wang & Li, 2020) e O r

L f µf L g µg +

Ixy(L f L g Ixy) µf µg

B op(Lf Lg B op)

DIPPA (Xie et al., 2021) e O Lf Lg µf µg

4 + B op µf µg

LPD (Thekumparampil et al., 2022) e O q

µg + B op µf µg

e O B 2 op λmin

µg + B op ϵµg

APDG (Kovalev et al., 2021) (Metelev et al., 2022) e O q

µg + B op µf µg

e O B 2 op λmin

λmin B op λmin

Lg µg B 2 op λmin

PD-EG (Jin et al., 2022) e O q

µf Ixy µf µg Iyy

µg + B op µf µg

e O B 2 op λmin

EG+Momentum (Azizian et al., 2020) e O B op λmin

SEG with Restarting (Li et al., 2022) e O B op λmin

AG-EG with Restarting (Du et al., 2022) e O q

µg + B op µf µg

e O B op λmin

AG-OG with Restarting

(this work)

µf Ixy µf µg Iyy

µg + B op µf µg

e O B op λmin

µg + B op ϵµg

Lower Bound (Zhang et al., 2021a) (Ibrahim et al., 2020) eΩ r

L f µf L g µg + Ixy µf µg

µg + B op µf µg

eΩ B op λmin

µg + B op ϵµg

Table 1. We present a comparison of the first-order gradient complexities of our proposed algorithm with selected prevailing algorithms for solving bilinearly-coupled minimax problems. The comparison includes several cases such as general SC-SC, bilinear games, bi-SC-SC (bilinearly-coupled SC-SC), and the bi-C-SC cases. We denote λmin λmin(B B), L f Lf + Ixx and L g Lg + Iyy. We focus on comparing the gradient complexities of deterministic algorithms, and include a column to indicate whether the stochastic case has been discussed. The row in blue background is the convergence result presented in this paper. The indicates that the complexity does not apply to the given case.

an Cbn (resp. an Cbn) for all n, and also an = Θ(bn) if both Ω(bn) and an = O(bn) hold, for some absolute constant C > 0, and e O or eΩis adopted in turn when C contains a polylogarithmic factor in problem-dependent parameters. Let λmax(A) and λmin(A) denote the maximal and minimal eigenvalues of a real symmetric matrix A, and A op the operator norm p

λmax(A A). Let vector z = [x; y] Rn+m denote the concatenation of x Rn, y Rm. We use (resp. ) to denote the bivariate min (resp. max) throughout this paper. For natural number K let [K] denote the set {1, . . . , K}. Throughout the paper we also use the standard notation || || to denote the ℓ2-norm and op to denote the operator norm of a matrix. We will explain other notations at their first appearances.

2. Preliminaries

In minimax optimization the goal is to find an (approximate) Nash equilibrium (or minimax point) of problem (1.1) (or (1.2)), defined as a pair [x ; y ] X Y satisfying:

L(x , y) L(x , y ) L(x, y ).

In order to analyze first-order gradient methods for this problem, we assume access to the gradients of the objective x L(x, y) and y L(x, y). Finding the minimax point of the original convex-concave optimization problem (1.1) and (1.2) reduces to finding the point where the gradients

vanish. Accordingly, we use W to denote the gradient vector field and z = [x; y] Rn+m:

W(z) := x L(x, y) y L(x, y)

= f(x) + x I(x, y) y I(x, y) + g(y)

Based on this formulation, our goal is to find the stationary point of the vector field correponding to the monotone operator W(z), namely a point z = [x ; y ] Rn+m

satisfying (in the unconstrained case) W(z ) = 0, which is referred to as the variational inequality (VI) formulation of minimax optimization (Gidel et al., 2018). The compact representation of the convex-concave minimax problem as a VI allows us to simplify the notation.

In the vector field (2.1), there are individual components that point along the direction optimizing f, g individually, and a coupling component which corresponds to the gradient vector field of a separable minimax problem. For the individual component, we let F(z) := f(x) + g(y) and correspondingly F(z) = [ f(x); g(y)]. For the coupling component, we define the operator H(z) = [ x I(x, y); y I(x, y)]. Note that the representation allows us to write W(z) as the summation of the two vector fields: W(z) = F(z) + H(z).

We introduce our main assumptions as follows:

Assumption 2.1 (Convexity and Smoothness). We assume

Rate-Optimal Separable Minimax Optimization

that f( ) : Rn R is µf-strongly convex and Lf-smooth, g( ) : Rm R is µg-strongly convex and Lg-smooth, and I(x, y) is convex-concave with blockwise smoothness parameters LH = Ixx Iyy + Ixy.

This implies that F(z) is (Lf Lg)-smooth and (µf µg)- strongly convex. In addition H( ) is monotone, yielding the property that for all z, z Rn+m:

H(z) H(z ), z z 0. (2.2)

The above assumption adds convexity and smoothness constraints to the individual components f(x) and g(y). In addition, for the coupling component x By in the separable minimax problem (1.2), without loss of generality, we assume that B Rn m, n m > 0 is a tall matrix. Note that as x and y are exchangeable, tall matrices cover all circumstances.

In the stochastic setting, we assume access to an unbiased stochastic oracle e H(z; ζ) of H(z) and an unbiased stochastic oracle e F(z; ξ) of F(z). Furthermore, we consider the case where the variances of such stochastic oracles are bounded:

Assumption 2.2 (Bounded Variance). We assume that the stochastic gradients admit bounded second moments σ2 H, σ2 F 0:

Eξ h || e H(z; ζ) H(z)||2i σ2 H,

Eζ h || e F(z; ξ) F(z)||2i σ2 F .

For ease of exposition, we introduce the overall variance σ2 = 3

2σ2 H + 2σ2 F . Note that the noise variance bound assumption is common in the stochastic optimization literature.3 Under the above assumptions, our goal is to find an ϵ-optimal minimax point, a notion defined as follows.

Definition 2.3 (ϵ-Optimal Minimax Point). [x; y] X Y is called an ϵ-optimal minimax point of a convex-concave function L(x, y) if x x 2 + y y 2 ϵ2.

It is obvious that when the accuracy ϵ = 0, [x; y] is an (exact) optimal minimax point of L(x, y).

3. Accelerated Gradient Optimistic Gradient Descent Ascent

In this section, we discuss key elements of our algorithm design consisting of Optimistic Gradient Descent-Ascent (OGDA) and Nesterov s acceleration method that together solve the separable minimax problem. Such an approach allows us to demonstrate the main properties of our approach

3We leave the generalization to models of unbounded noise to future work.

that will eventually guide our analysis in the discrete-time case. In 3.1 and 3.2 we review OGDA and Nesterov s acceleration. In 3.3 we present our approach to accelerating OGDA for bilinear minimax problems, yielding the Accelerated Gradient-Optimistic Gradient (AG-OG) algorithm, and we prove its convergence. Finally in 3.4 we show that proper restarting on top of the AG-OG algorithm achieves a sharp convergence rate that matches the lower bound of Zhang et al. (2021a).

3.1. Optimistic Gradient Descent Ascent

The Optimistic Gradient Descent Ascent (OGDA) algorithm has received considerable attention in the recent literature, especially for the problem of training Generative Adversarial Networks (GANs) (Goodfellow et al., 2020). In the general variational inequality setting, the iteration of OGDA takes the following form (Popov, 1980):

2 = zk ηW(zk 1

2 ), zk+1 = zk ηW(zk+ 1

Note that at step k, the scheme performs a gradient descentascent step at the extrapolated point zk+ 1

2 . Equivalently, with simple algebraic modification (3.1) can be written in a standard form (Gidel et al., 2018):

2 ) + ηW(zk 3

Treating the difference W(zk 1

2 ) as a prediction of the future one W(zk+ 1

2 ), this update rule can be viewed as an approximation of the implicit proximal point (PP) method:

Another popular tractable approximation of the PP method is the extragradient (EG) method (Korpelevich, 1976): Although conceptually similar to OGDA (3.1), EG requires two gradient queries per iteration and hence doubles the overall number of gradient computations. Both OGDA and EG dynamics (3.1) alleviate cyclic behavior by extrapolation from the past and exhibit a complexity of O(L/µ log(1/ϵ)) (Gidel et al., 2018; Mokhtari et al., 2020a) in general setting (1.1) with L-smooth, µ-strongly-convexµ-strongly-concave objectives.4

3.2. Nesterov s Acceleration Scheme

Turning to the minimization problem, while vanilla gradient descent enjoys an iteration complexity of O(κ log(1/ϵ)) on L-smooth, µ-strongly convex problems, with κ = L/µ being the condition number, Nesterov s method (Nesterov,

4In fact an analogous result holds true for general smooth, strongly monotone variational inequalities (Mokhtari et al., 2020a).

Rate-Optimal Separable Minimax Optimization

1983), when equipped with proper restarting, achieves an improved iteration complexity of O( κ log(1/ϵ)). We adopt the following version of the Nesterov acceleration, known as the second scheme (Tseng, 2008; Lin et al., 2020b):

zmd k = k k+2zag k + 2 k+2zk, (3.3a)

zk+1 = zk ηk F(zmd k ), (3.3b)

zag k+1 = k k+2zag k + 2 k+2zk+1. (3.3c)

Subtracting (3.3a) from (3.3c) and combining the resulting equation with (3.3b), we conclude

zag k+1 zmd k = 2 k + 2(zk+1 zk) = 2ηk

k + 2 F(zmd k )

zag k+1 = zmd k 2ηk

k + 2 F(zmd k ). (3.4)

Moreover, shifting the index forward by one in (3.3a) and combining it with (3.3c) to cancel the zk+1 term, we obtain

k + 2 k + 3zag k+1 zmd k+1 = k k + 3zag k k + 1

k + 3zag k+1 (3.5)

zmd k+1 = zag k+1 + k k + 3 zag k+1 zag k . (3.6)

Thus, by a simple notational transformation, (3.4) plus (3.6) (and hence the original update rule (3.3)) is exactly equivalent to the original updates of Nesterov s acceleration scheme (Nesterov, 1983). Here, zag k denotes a 2

k-weightedaveraged iteration. In other words, compared with vanilla gradient descent, zk+1 = zk ηk F(zk), Nesterov s acceleration conducts a step at the negated gradient direction evaluated at a predictive iterate of the weighted-averaged iterate of the sequence. This enables a larger choice of stepsize, reflecting the enhanced stability. An analogous interpretation has been discussed in work on a heavy-ballbased acceleration method (Sebbouh et al., 2021, 1.3).

3.3. Accelerating OGDA on Separable Minimax Problems

In this subsection and 3.4, we show that an organic combination of the two algorithms in 3.1 and 3.2 achieves improved convergence rates and when equipped with scheduled restarting, obtains a sharp iteration complexity that matches Jin et al. (2022) while only requiring a single gradient call per iterate. Our algorithm is shown in Algorithm 1. In Line 2 and 4 the update rules of the evaluated point and the extrapolated point of f follow that in (3.3a) and (3.3c), while in Lines 3 and 5 the updates follow the OGDA dynamics (3.1) with each step modified by (3.3b). Algorithm 1 can be seen as a synthesis of OGDA and Nesterov s acceleration, as it reduces to OGDA when F = 0 and to Nesterov s accelerated gradient when H = 0.

For theoretical analysis, we first state a nonexpansiveness lemma of zk with respect to z , the unique solution to Problem (1.2). The proof is presented in E.2.

Algorithm 1 Accelerated Gradient-Optimistic Gradient (AGOG)(zag 0 , z0, z 1/2, K)

1: for k = 0, 1, . . . , K 1 do 2: zmd k = (1 αk)zag k + αkzk

2 = zk ηk H(zk 1

2 ) + F(zmd k )

4: zag k+1 = (1 αk)zag k + αkzk+ 1

2 5: zk+1 = zk ηk H(zk+ 1

2 ) + F(zmd k )

6: end for 7: Output: zag K

Lemma 3.1 (Nonexpansiveness). Under Assumptions 2.1, we set the parameters as L = Lf Lg, LH = Ixx Iyy + Ixy, ηk = k+2

3LH(k+2) and αk = 2 k+2 in

Algorithm 1 and choose initialization z 1

2 = zag 0 = z0, at any iterate k < K we have

||zk z || ||z0 z ||.

Remark 3.2. The result in Lemma 3.1 is significant in that it establishes the last-iterate nonexpansiveness ruled by the initialization z0: the zk iteration stays in the ball centered at z with radius ||z0 z ||. This is essential in proving convergence results of iteration zag k where the main technical difficulty lies upon the additional recursive analysis due to gradient evaluation in a previous iterate. From a past extragradient perspective, earlier analysis was focusing on the half iterates in extragradient step (3) (zk+ 1

2 in our formulation). In contrast, we perform a nonexpansiveness analysis on the integer steps (zk), serving as a critical improvement over the best previous result achieved by Mokhtari et al. (2020b, Lemma 2(b)) (consider the bilinear coupling case where f = 0, g = 0), which merely admits a factor of

2 in terms of the Euclidean metric (i.e., ||zk z ||

2||z0 z ||).

With the parameter choice in Lemma 3.1, Line 4 can also be seen as an average step that makes last iterates shrink toward the center of convergence. Equipped with Lemma 3.1, we are ready to state the following convergence theorem for discrete-time AG-OG: Theorem 3.3. Under Assumption 2.1 and setting the parameters as in Lemma 3.1, the output of Algorithm 1 on problem (1.2) satisfies:

||zag K z ||2

4L µ(K + 1)2 + 2 p

3LH µ(K + 1)

(3.7) Here we use µ to denote µf µg.

The proof of Theorem 3.3 is provided in C.1. The selection of αk = 2 k+2 and ηk = k+2

3LH(k+2) is vital for Nes-

terov s accelerated gradient descent to achieve desirable convergence behavior (Nesterov, 1983). This stepsize choice is

Rate-Optimal Separable Minimax Optimization

Algorithm 2 Accelerated Gradient-Optimistic Gradient with restarting (AG-OG with restarting) Require: Initialization z0 0, total number of epochs N 1, per-epoch iterates (Kn : n = 0, . . . , N 1) 1: for n = 0, 1, . . . , N 1 do 2: zout = AG-OG(zn 0 , zn 0 , zn 0 , Kn) 3: Set zn+1 0 zout //Warm-starting from the previous output 4: end for 5: Output: z N 0

larger than the ones used in previous techniques (Chen et al., 2017; Du et al., 2022), which is brought by a fine-tuned analysis of (C.7) in the proof of Theorem 3.3. The convergence rate in (3.7) for strongly convex problems is slow and not even linear. However, in the next subsection we show how a simple restarting technique not only achieves the linear convergence rate, but also matches the lower bound in Zhang et al. (2021a) in a broad regime of parameters.

3.4. Improving Convergence Rates via Restarting and Scaling Reduction

Our algorithm design (Algorithm 2) utilizes the restarting technique, which is a well-established method to accelerate first-order methods in optimization literature (O donoghue & Candes, 2015; Roulet & d Aspremont, 2017; Renegar & Grimmer, 2022). Our variant of restarting accelerates convergence through a novel approach inspired by contemporary variance-reduction strategies, similar to those presented in Li et al. (2022); Du et al. (2022). Our approach is distinct from previous ones (Kovalev et al., 2021; Thekumparampil et al., 2022; Jin et al., 2022) that incorporate the last iterate EG/OGDA with Nesterov s acceleration. By incorporating the extrapolated step of Nesterov s method as the average step of OGDA and utilizing restarting, we use a two-timescale analysis and scaling reduction technique to achieve optimal results under all regimes. Although our algorithm is a multi-loop algorithm, the simplicity of restarting does not harm the practical aspect of our approach.

Normally, as f and g have different strong convexity parameters (µf and µg), it is preferable in practice to have different stepsizes for the descent step on f(x) and the ascent step on g(y) (Du et al., 2017; Lin et al., 2020a; Du et al., 2022). Accordingly, for our analysis we use a scaling reduction technique (Du et al., 2022) that allows us to consider applying a single scaling for all parameters without loss of generality. Setting by = q µg

µf y, we have by H(z) = q µf

and byg(y) = q µf

µg g(y). Other scaling changes are

listed as follows:

µg Lg, LH = Ixx Ixy rµf

ηk,y = ηkµf

µg , µ = µf, (3.8)

where by ηk,y we mean that when updating z = [x; y] Rn+m, we adopt stepsize ηk on the x-part (first n coordinates) and ηk,y on the y-part (last m coordinates). Writing out in details, the update rules with adjusted stepsizes on [x; y] are as follows:

xmd k = (1 αk)xag k + αkxk

ymd k = (1 αk)yag k + αkyk

2 = xk ηk Ix(xk 1

2 ) + f(xmd k )

2 = yk ηk,y Iy(xk 1

2 ) + g(ymd k )

xag k+1 = (1 αk)xag k + αkxk+ 1

2 yag k+1 = (1 αk)yag k + αkyk+ 1

xk+1 = xk ηk Ix(xk+ 1

2 ) + f(xmd k )

yk+1 = yk ηk,y Iy(xk+ 1

2 ) + g(ymd k )

With the new scaling and restarting, we obtain Algorithm 2, which we refer to as AG-OG with restarting. The iteration complexity of AG-OG with restarting is stated in the following Corollary 3.4. Corollary 3.4. Algorithm 2 on problem (1.2) with Kn = lq

outputs an ϵ-optimal minimax

point within a number O(N) of iterates, for N satisfying:

µf Ixy µfµg Iyy

We defer the proof of the corollary to C.2. When restricted to the bilinear-coupled problem (1.3), Eq. (4.1) reduces

µg + B op µf µg

ϵ , which exactly matches the lower bound result in Zhang et al. (2021a) and therefore is optimal under this special instance.

The analysis in Du et al. (2022), which also adopts a restarted scheme is most similar with ours. However, although OGDA can be written in past-EG form, the algorithm and theoretical analysis are fundamentally different (Golowich et al., 2020). For example, in contrast to EG, the non-expansiveness argument for OGDA does not achieve a unity prefactor (Mokhtari et al., 2020b). Our work proves a strict non-expansive property with prefactor 1, and our technique is new compared with existing EG-based analysis and existing the OGDA-based analysis.

Rate-Optimal Separable Minimax Optimization

3.5. Application to Separable Convex-Strongly-Concave (C-SC) Problem

To extend our strongly-convex-strongly-concave (SC-SC) AG-OG algorithm complexity to the convex-stronglyconcave (C-SC) setting, we define a regularization reduction method that modifies the objective via the addition of a regularization term, which gives the objective function Lϵ(x, y) = L(x, y) + ϵ||x||2, where ϵ is the desired accuracy of the solution. The following Theorem 3.5 provides the complexity analysis; see C.3 for the proof. Theorem 3.5. The output of Algorithm 1 under the same assumptions and stepsize choices of Theorem 3.3 on the objective function Lϵ achieves the ϵ-optimal minimax point of L within the sample complexity of

ϵ Ixy ϵµg Iyy

for the original C-SC problem.

The work of Thekumparampil et al. (2022) also provides a C-SC case result that is obtained by utilizing the smoothing technique (Nesterov, 2005). Additionally, they present a direct C-SC algorithm without smoothing. On the other hand, Kovalev et al. (2021) focuses on a different perspective on the C-SC problem where x and y have strong interactions and obtains superlinear complexity of log( 1

ϵ ). However, both of these papers are limited to bilinear coupling terms. Our result, in contrast, targets a more general separable objective. Our complexity in Theorem 3.5 matches the complexity for regularized reduction in Thekumparampil et al. (2022). Furthermore, Theorem 3.5 is optimal in the bilinear

coupling case (1.3). The reason is that q

ϵ is optimal for a

pure minimization of convex f (Nesterov et al., 2018), q

Lg µg is optimal for a pure maximization of strongly-concave g (Nesterov et al., 2018), and B op ϵµg matches the lower bound of bilinearly coupled concave-convex minimax optimization (Ouyang & Xu, 2021) when f = 0.

3.6. Application to Bilinear Games

While the complexity result for deterministic case in Corollary 3.4 has also been obtained in Thekumparampil et al. (2022); Kovalev et al. (2021) and Jin et al. (2022), in addition to conceptual simplicity, our algorithm has the significant advantage that it yields a stochastic version and a convergence rate for the stochastic case. By using proper averaging and scheduled restarting techniques, our algorithm is able to find near-optimal solutions and achieve an optimal sample complexity up to a constant prefactor. Additionally, we demonstrate that our algorithm can be reduced to a combination of the averaged iterates of the OGDA algorithm and a scheduled restarting procedure, which gives rise to

a novel single-call algorithm that achieves an accelerated convergence rate on the bilinear minimax problem itself. Finally, we address the situation where there is stochasticity present in the problem. Throughout this section, we consider Problem (1.2) with f(x), g(y) being zero almost surely. Moreover, we assume the following bilinear form:

I(x, y) = x By + x ux + u y y, (3.10)

where x Rn, y Rm with n = m, B Rn n is square and full-rank, and ux, uy Rn are two parameter vectors. Algorithm 1 reduces to an equivalent form of the OGDA algorithm (in the past extragradient form) with initial condition zag 0 = z 1

2 = z0, which gives for all k 1:

2 = zk ηk H(zk 1

2 ), (3.11a)

zag k+1 = (1 αk)zag k + αkzk+ 1

zk+1 = zk ηk H(zk+ 1

2 ) . (3.11c)

By selecting the parameters αk = 2 k+2 and ηk = k+2 2L+c1LH(k+2) with L = 0 and c1 = 2 in (3.11), we can prove a boundedness lemma (Lemma D.1, presented in D), which is the bilinear game analogue of Lemma 3.1 with an improved scheme of stepsize and demonstrates the nonexpansiveness of the last iterate of the OGDA algorithm. The proof is deferred to E.8.

Non-expansiveness of the iterates further yields the following theorem whose proof is in D.1.

Theorem 3.6. When specified to the bilinear game case, setting the parameters as αk = 2 k+2 and ηk = 1 2LH , the output of update rules (3.11) satisfies

||zag K z ||2 64λmax(B B) λmin(B B)(K + 1)2 ||z0 z ||2. (3.7)

Moreover, using the scheduled restarting technique, we obtain a complexity result that matches the lower bound of Ibrahim et al. (2020):

λmin (B B) log 1

An extended result is also obtained in the stochastic setting; we refer interested readers to D.2.

4. Stochastic Accelerated Gradient Optimistic Gradient Descent Ascent

In this subsection, we generalize the theoretical performance of our AG-OG algorithm (Algorithm 1 and 2) to the stochastic case where the rate-optimal convergence behavior is maintained. The stochastic AG-OG algorithm replaces each batch gradient with its unbiased stochastic counterpart, with

Rate-Optimal Separable Minimax Optimization

noise indices represented by ζt, ξt. The full stochastic AGOG algorithm is shown in Algorithm 3 in B.2.

Based on a generalized nonexpasiveness lemma (Lemma C.7, presented in C.4) which is the stochastic analogue of Lemma 3.1, we can proceed the analysis and arrive at our stochastic result. See C.4 for the proof.

Theorem 4.1. Under Assumptions 2.1 and 2.2, we take ηk = k+2

2LH(k+2) where D = σ

E||z0 z ||2

for A(K) := p

(K + 1)(K + 2)(2K + 3)/6 and some absolute constant C > 0. Then the output of Algorithm 3 on problem (1.2) satisfies:

E||zag K z ||2

8L µ(K + 1)2 + 7.4(1 + C2)LH

E||z0 z ||2

E||z0 z ||2.

Remark 4.2. Without knowledge of expected initial distance E||z0 z ||2 to the true minimax point, we need an alternative selection of stepsize ηk. We assume an upper bound on ||z0 z ||2 defined as Γ0 and let C = Γ0

E||z0 z ||2 . The quantity D = σA(K)

Γ0 is hence known.

E||zag K z ||2

8L µ(K + 1)2 + 14.8LH

Γ2 0 + 4σ µ

Analogous to the method in 3.4, we restart the S-AG-OG algorithm properly and achieve the following complexity:

Corollary 4.3. With scheduled restarting imposed on top of Algorithm 3, Algorithm 2 on problem (1.2) outputs an ϵ-optimal minimax point within O(N) iterations, for N satisfying:

µ2 fϵ2 (4.1)

µf Ixy µfµg Iyy

In the special case of bilinearly coupled SC-SC, the above result reduces to s

µg + Ixy µfµg

which matches that of Du et al. (2022) and is rate-optimal. The reason is that the first term (i.e., bias error) matches the lower bound of bilinearly coupled SC-SC in Zhang et al. (2021a), and the second term (i.e., variance error) matches the worst-case statistical minimax rate.

5. Discussion

In this paper, we propose novel algorithms for both the deterministic setting (AG-OG) and a stochastic setting (S-AGOG) which organically blends optimism with Nesterov s acceleration, featuring structural interpretability and simplicity. Leveraging novel Lyapunov analysis, these algorithms achieve desirable polynomial convergence behavior. Further by properly restarting the algorithms, AG-OG and its stochastic version theoretically enjoy rate-optimal sample complexity for finding an ϵ-accurate solution. Future directions include closing the gap between the upper and lower bounds for general separable minimax optimization, and generalizations to nonconvex settings.

Acknowledgements

We sincerely thank the anonymous reviewers for their helpful comments, and thank Simon Shaolei Du for highly inspiring discussions. This work is supported in part by Canada CIFAR AI Chair to GG, by the National Science Foundation CAREER Award 1906169 to QG, by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764 and also the Vannevar Bush Faculty Fellowship program under grant number N00014-21-1-2941 and NSF grant IIS-1901252 to MIJ.

Alacaoglu, A. and Malitsky, Y. Stochastic variance reduction for variational inequality methods. In Conference on Learning Theory, pp. 778 816. PMLR, 2022.

Azizian, W., Scieur, D., Mitliagkas, I., Lacoste-Julien, S., and Gidel, G. Accelerating smooth games by manipulating spectral shapes. In International Conference on Artificial Intelligence and Statistics, pp. 1705 1715. PMLR, 2020.

Bai, Y. and Jin, C. Provable self-play algorithms for competitive reinforcement learning. ar Xiv preprint ar Xiv:2002.04017, 2020.

Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust optimization, volume 28. Princeton university press, 2009.

Beznosikov, A., Polyak, B., Gorbunov, E., Kovalev, D., and Gasnikov, A. Smooth monotone stochastic variational inequalities and saddle point problems survey. ar Xiv preprint ar Xiv:2208.13592, 2022.

Chavdarova, T., Gidel, G., Fleuret, F., and Lacoste-Julien, S. Reducing noise in gan training with variance reduced extragradient. Advances in Neural Information Processing Systems, 32, 2019.

Rate-Optimal Separable Minimax Optimization

Chen, Y., Lan, G., and Ouyang, Y. Accelerated schemes for a class of variational inequalities. Mathematical Programming, 165(1):113 149, 2017.

Cohen, M. B., Sidford, A., and Tian, K. Relative lipschitzness in extragradient methods and a direct recipe for acceleration. ar Xiv preprint ar Xiv:2011.06572, 2020.

Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. Sbeed: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pp. 1125 1134. PMLR, 2018.

Du, S. S. and Hu, W. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 196 205. PMLR, 2019.

Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning, pp. 1049 1058. PMLR, 2017.

Du, S. S., Gidel, G., Jordan, M. I., and Li, C. J. Optimal extragradient-based bilinearly-coupled saddle-point optimization. ar Xiv preprint ar Xiv:2206.08573, 2022.

Gidel, G., Berard, H., Vignoud, G., Vincent, P., and Lacoste-Julien, S. A variational inequality perspective on generative adversarial networks. ar Xiv preprint ar Xiv:1802.10551, 2018.

Golowich, N., Pattathil, S., and Daskalakis, C. Tight lastiterate convergence rates for no-regret learning in multiplayer games. Advances in neural information processing systems, 2020.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. Communications of the ACM, 63(11):139 144, 2020.

Gorbunov, E., Taylor, A., and Gidel, G. Last-iterate convergence of optimistic gradient method for monotone variational inequalities. Advances in neural information processing systems, 2022.

Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos, P. On the convergence of single-call stochastic extragradient methods. Advances in Neural Information Processing Systems, 32, 2019.

Ibrahim, A., Azizian, W., Gidel, G., and Mitliagkas, I. Linear lower bounds and conditioning of differentiable games. In International conference on machine learning, pp. 4583 4593. PMLR, 2020.

Jin, Y., Sidford, A., and Tian, K. Sharper rates for separable minimax and finite sum optimization via primal-dual extragradient methods. ar Xiv preprint ar Xiv:2202.04640, 2022.

Juditsky, A., Nemirovski, A., and Tauvel, C. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17 58, 2011.

Korpelevich, G. M. The extragradient method for finding saddle points and other problems. Ekonomika i Matematicheskie Metody, 12:747 756, 1976.

Kovalev, D., Gasnikov, A., and Richt arik, P. Accelerated primal-dual gradient method for smooth and convexconcave saddle-point problems with bilinear coupling. ar Xiv preprint ar Xiv:2112.15199, 2021.

Li, C. J., Yu, Y., Loizou, N., Gidel, G., Ma, Y., Le Roux, N., and Jordan, M. On the convergence of stochastic extragradient for bilinear games using restarted iteration averaging. In International Conference on Artificial Intelligence and Statistics, pp. 9793 9826. PMLR, 2022.

Liang, T. and Stokes, J. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 907 915. PMLR, 2019.

Lin, T., Jin, C., and Jordan, M. I. Near-optimal algorithms for minimax optimization. In Conference on Learning Theory, pp. 2738 2779. PMLR, 2020a.

Lin, Z., Li, H., and Fang, C. Accelerated optimization for machine learning. Nature Singapore: Springer, 2020b.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017.

Metelev, D., Rogozin, A., Gasnikov, A., and Kovalev, D. Decentralized saddle-point problems with different constants of strong convexity and strong concavity. ar Xiv preprint ar Xiv:2206.00090, 2022.

Mokhtari, A., Ozdaglar, A., and Pattathil, S. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pp. 1497 1507. PMLR, 2020a.

Mokhtari, A., Ozdaglar, A. E., and Pattathil, S. Convergence rate of o(1/k) for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems. SIAM Journal on Optimization, 30(4):3230 3251, 2020b.

Rate-Optimal Separable Minimax Optimization

Nemirovski, A. Prox-method with rate of convergence o (1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1): 229 251, 2004.

Nemirovskij, A. S. and Yudin, D. B. Problem complexity and method efficiency in optimization. 1983.

Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical programming, 103(1):127 152, 2005.

Nesterov, Y. et al. Lectures on convex optimization, volume 137. Springer, 2018.

Nesterov, Y. E. A method for solving the convex programming problem with convergence rate o (1/kˆ 2). In Dokl. akad. nauk Sssr, volume 269, pp. 543 547, 1983.

Ouyang, Y. and Xu, Y. Lower complexity bounds of firstorder methods for convex-concave bilinear saddle-point problems. Mathematical Programming, 185(1):1 35, 2021.

O donoghue, B. and Candes, E. Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics, 15(3):715 732, 2015.

Palaniappan, B. and Bach, F. Stochastic variance reduction methods for saddle-point problems. Advances in Neural Information Processing Systems, 29, 2016.

Popov, L. D. A modification of the arrow-hurwicz method for search of saddle points. Mathematical notes of the Academy of Sciences of the USSR, 28(5):845 848, 1980.

Renegar, J. and Grimmer, B. A simple nearly optimal restart scheme for speeding up first-order methods. Foundations of Computational Mathematics, 22(1):211 256, 2022.

Roulet, V. and d Aspremont, A. Sharpness, restart and acceleration. Advances in Neural Information Processing Systems, 30, 2017.

Sebbouh, O., Gower, R. M., and Defazio, A. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In Conference on Learning Theory, pp. 3935 3971. PMLR, 2021.

Shapley, L. S. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095 1100, 1953.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.

Thekumparampil, K. K., He, N., and Oh, S. Lifted primaldual method for bilinearly coupled smooth minimax optimization. In International Conference on Artificial Intelligence and Statistics, pp. 4281 4308. PMLR, 2022.

Tseng, P. On accelerated proximal gradient methods for convex-concave optimization. submitted to SIAM Journal on Optimization, 2(3), 2008.

Wang, Y. and Li, J. Improved algorithms for convex-concave minimax optimization. Advances in Neural Information Processing Systems, 33:4800 4810, 2020.

Xie, G., Han, Y., and Zhang, Z. Dippa: An improved method for bilinear saddle point problems. ar Xiv preprint ar Xiv:2103.08270, 2021.

Yang, J., Zhang, S., Kiyavash, N., and He, N. A catalyst framework for minimax optimization. Advances in Neural Information Processing Systems, 33:5667 5678, 2020.

Zhang, J., Hong, M., and Zhang, S. On lower iteration complexity bounds for the convex concave saddle point problems. Mathematical Programming, pp. 1 35, 2021a.

Zhang, S., Yang, J., Guzm an, C., Kiyavash, N., and He, N. The complexity of nonconvex-strongly-concave minimax optimization. In Uncertainty in Artificial Intelligence, pp. 482 492. PMLR, 2021b.

Zhang, X., Aybat, N. S., and G Urb Uzbalaban, M. Robust accelerated primal-dual methods for computing saddle points. ar Xiv preprint ar Xiv:2111.12743, 2021c.

Zhao, R. Accelerated stochastic algorithms for convexconcave saddle-point problems. Mathematics of Operations Research, 47(2):1443 1473, 2022.

Rate-Optimal Separable Minimax Optimization

A. Examples of Separable Minimax Optimization

In this section, we use two examples to showcase the applications of formulation (1.2). We refer the readers for other examples in prior works such as Thekumparampil et al. (2022); Kovalev et al. (2021); Du et al. (2022). In the first example, we demonstrate how the parameters of a linear state-value function can be estimated by solving (1.2). In the second example of robust learning problem, we illustrate how turning the disk constraint into a penalty term allows us to obtain an objective in the form of (1.2).

Policy Evaluation in Reinforcement Learning. The policy evaluation problem in RL can be formulated as a convexconcave bilinearly coupled minimax problem. We are provided a sequence of four-tuple {(st, at, rt, st+1)}n t=1, where

(i) st, st+1 are the current state (at time t) and future state (at time t + 1), respectively;

(ii) at is the action at time t generated by policy π, that is, at = π(st);

(iii) rt = r(st, at) is the reward obtained after taken action at at state st.

Our goal is to estimate the value function of a fixed policy π in the discounted, infinite-horizon setting with discount factor γ (0, 1), where for each state s the discounted reward

s0 = s, at = π(st), t 0

If a linear function approximation is adopted, i.e. V π(s) = ϕ(s) x where ϕ( ) is a feature mapping from the state space to feature space, we estimate the model parameter x via minimizing the empirical mean-squared projected Bellman error (MSPBE):

min x 1 2 Ax b 2 C 1. (A.1)

where ||x||M

x Mx denotes the M-norm, for positive semi-definite matrix M, of an arbitrary vector x, and

t=1 ϕ(st) (ϕ(st) γϕ(st+1)) , b = 1

t=1 rtϕ(st), C = 1

t=1 ϕ(st)ϕ(st) .

Applying first-order optimization directly to (A.1) would necessitate computing (and storing) the inversion of matrix C, or at least, the matrix-vector product C 1v for given vector v at each step, which would be computationally costly or even prohibited. To circumvent inverting matrix C a reformulation via conjugate function can be resorted to; that is, solving (A.1) is equivalent to solving the following minimax problem (Du et al., 2017; Du & Hu, 2019):

min x max y y Ax 1

2 y 2 C + b y.

Such an instance falls under the category of minimax problem (1.2) where the individual component is convex-concave, and is further enhanced to be strongly-convex-strongly-concave when a quadratic regularizer term is imposed and C is strictly positive definite.

Robust Learning. A robust learning or robust optimization problem targets to minimize an objective function (here the sum of squares) formulated as a minimax optimization problem (Ben-Tal et al., 2009; Du & Hu, 2019; Thekumparampil et al., 2022)

min x max y: y y0 R 1 2 Ax y 2, (A.2)

where A is a coefficient matrix and y is a noisy observation vector, which is perturbed by a vector of R-bounded norm. Transforming (A.2) to a penalized objective gives a formulation of minx maxy 1 2 Ax y 2 ρ y y0 2. When ρ is selected to be strictly greater than 1

2, we get a strongly-convex-strongly-concave bilinearly coupled minimax optimization problem.

Rate-Optimal Separable Minimax Optimization

0 100 200 300 400 500

OGDA AGOG AGOG-restart

(a) Lg = 64, µg = 1

0 100 200 300 400 500

OGDA AGOG AGOG-restart

(b) Lg = 1, µg = 1/64

0 100 200 300 400 500

OGDA AGOG AGOG-restart

(c) Lg = 4096, µg = 64

Figure 1. Comparison with OGDA on different problem sets (Deterministic)

B. Experiments

In this section, we empirically study the performance of our AG-OG with restarting algorithm. In these experimental results, we study both deterministic [ B.1] and stochastic settings [ B.2], each of which we compare the state-of-the-art algorithms. Throughout this section, the x-axis represents the number of gradient queries while the y-axis represents the squared distance to the minimax point.

B.1. Deterministic Setting

We present results on synthetic quadratic game datasets:

x A1x + y A2x y A3y, (B.1)

with various selections of the eigenvalues of A1, A2, A3.

Comparison with OGDA. We use the single-call OGDA algorithm (Gidel et al., 2018; Hsieh et al., 2019) as the baseline. In Figure 1 we plot the AG-OG algorithm and the AG-OG with restarting algorithm under three different instances. We use stepsize ηk = k+2

3LH(k+2) in both the AG-OG and the AG-OG with restarting algorithms and restart AG-OG with

restarting once every 100 iterates. For the OGDA algorithm, we take stepsize η = 1 2(L LH) as is indicated by recent arts e.g. (Mokhtari et al., 2020b). For the parameters of the problem (B.1), we fix LH = 1, Lf = 64, µf = 1 and scatter various values of Lg, µg. In Figure 1(a) we take Lg = 64, µg = 1. In Figure 1(b) we take Lg = 1, µg = 1/64 and in Figure 1(c) we take Lg = 4096, µg = 64. We see from Figures 1(a), 1(b) and 1(c) when the problem has different Lf, µf and Lg, µg, changing Lg, µg has larger impact on OGDA than on AG-OG, which matches our theoretical results.

Comparison with LPD. Next, we focus on comparison to the Lifted Primal-Dual (LPD) algorithm (Thekumparampil et al., 2022). We implement the AG-OG algorithm and its restarted version, the AG-OG with restarting. Additionally, inspired by the technique of a single-loop direct-approach in Du et al. (2022), we consider a single-loop algorithm named AG-OG-Direct that takes advantage of the strongly-convex-strongly-concave nature of the problem. We refer readers to Du et al. (2022) for the direct method. The parameters of LPD are chosen as described in Thekumparampil et al. (2022). For our AG-OG and AG-OG with restarting algorithms, we take ηk = k+2

3LH(k+2) and the scaling parameters are

taken as in Eq. (3.8). For the AG-OG-direct algorithm, we take η = 1

3LH)2/µ2 f )µf with the same set of

scaling parameters. We restart AG-OG with restarting once every 100 iterates.

In Figure 2(a), the bilinear coupling component y A2x is the dominant part. In Figure 2(b), we set the eigenvalues of A2 even larger than in Figure 2(a). In Figure 2(c), x A1x and y A3y are the dominant terms. More details on the specific designs of the matrices are shown in the caption of the corresponding figures.

We see from Figures 2(a) and 2(b) that AG-OG with restarting (green line) outperforms LPD and MP in regimes where the bilinear term dominates, and when the eigenvalues of the coupling matrix increase, the performance of AG-OG with restarting relative to other algorithms is enhanced. This is in accordance with our theoretical analysis. In addition, AG-OG

Rate-Optimal Separable Minimax Optimization

0 200 400 600 800 1000

LPD AGOG AGOG-restart AGOG-direct

(a) Lf = Lg = µf = µg = 1, LH = 356, µH = 101

0 200 400 600 800 1000

LPD AGOG AGOG-restart AGOG-direct

(b) Lf = Lg = µf = µg = 1, LH = 725, µH = 101

0 200 400 600 800 1000

LPD AGOG AGOG-restart AGOG-direct

(c) Lf = Lg = 100, µf = µg = 1, LH = µH = 1

Figure 2. Comparison with LPD on different problem sets (Deterministic)

0 50 100 150 200 250 300 350 400

101 SEG S-AGOG SEG-restart S-AGOG-restart

(a) Lf = Lg = µf = µg = 1, LH = 356, µH = 101, σ = 0.1

0 50 100 150 200 250 300 350 400

SEG S-AGOG SEG-restart S-AGOG-restart

(b) Lf = Lg = 10, µf = µg = µH = 1, LH = 11, σ = 0.1

0 50 100 150 200 250 300 350 400

101 SEG S-AGOG SEG-restart S-AGOG-restart

(c) Lf = Lg = 1, µf = µg = 1/8, LH = µH = 1, σ = 0.1

Figure 3. Comparison of algorithms on different problem sets (Stochastic)

Algorithm 3 Stochastic Accelerated Gradient-Optimistic Gradient (S-AG-OG)(zag 0 , z0, z 1/2, K)

1: for k = 0, 1, . . . , K 1 do 2: zmd k = (1 αk)zag k + αkzk

2 = zk ηk e H(zk 1

2 ) + e F(zmd k ; ξk)

4: zag k+1 = (1 αk)zag k + αkzk+ 1

2 5: zk+1 = zk ηk e H(zk+ 1

2 ) + e F(zmd k ; ξk)

6: end for 7: Output: zag K

with restarting outperforms its non-restarted version (orange line) which has a gentle slope at the end. On the other hand, when the individual component dominates, our AG-OG-direct (red line) slightly outperforms LPD. Moreover, AG-OG-direct and LPD almost overlap in 2(a) and 2(b).

B.2. Stochastic Setting

We compared stochastic AG-OG and its restarted version (S-AG-OG) with Stochastic extragradient (SEG) SEG with restarting, respectively (cf. Li et al., 2022). The complete algorithm is shown in 3. We note that we refer to the averaged iterates version of SEG everywhere when using SEG. For SEG and SEG-restart, we use stepsize ηk = 1 2(L LH). For AG-OG

and AG-OG with restarting, we use stepsize ηk = k+2

3LH(k+2). We restart every 100 gradient calculations for both

SEG-restart and AG-OG-restart.

We use the same quadratic game setting as in (B.1) except that we assume access only to noisy estimates of A1, A2, A3. We add Gaussian noise to A1, A2, A3 with σ = 0.1 throughout this experiment. We plot the squared norm error with respect to the number of gradient computations in Figure 3. In 3(a) we consider larger eigenvalues for A2 than A1, A3.

Rate-Optimal Separable Minimax Optimization

In 3(b), we let A1, A2, A3 to be approximately of the same scale. In 3(c), as the scale of the eigenvalues shrinks, the noise is relatively larger than in 3(a) and 3(b). The specific choice of parameters are shown in the caption of the corresponding figures. We see from 3(a), 3(c) and 3(c) that stochastic AG-OG with restarting achieves a more desirable convergence speed than SEG-restart. Also, the restarting technique significantly accelerates the convergence, validating our theory.

C. Proof of Main Convergence Results

This section collects the proofs of our main results, Theorem 3.3 [ C.1], Corollary 3.4 [ C.2], and Theorem 4.1 [ C.4].

C.1. Proof of Theorem 3.3

Proof.[Proof of Theorem 3.3] We define the point-wise primal-dual gap function as:

V (z, z ) := F(z) F(z ) + H(z ), z z . (C.1)

We first provide the following property for the primal-dual gap function:

Lemma C.1. For L-smooth and µ-strongly convex F(z), and for any z Rn+m we have

V (z, z ) = F(z) F(z ) + H(z ), z z µ

2 z z 2 . (C.2)

Proof of Lemma C.1 is provided in E.1.

Our proof proceeds in the following steps:

Step 1: Estimating weighted temporal difference in squared norms. We first prove a result on bounding the temporal difference of the point-wise primal-dual gap between zag k and z , whose proof is delayed to E.4.

Lemma C.2. For arbitrary αk (0, 1] and any ωz Rn+m the iterates of Algorithm 1 satisfy for k = 1, . . . , K almost surely

V (zag k+1, ωz) (1 αk)V (zag k , ωz) αk D F(zmd k ) + H(zk+ 1

Note that in Lemma C.2, the term I is an inner product that involves a gradient term.5 The term II is brought by gradient evaluated at zmd k .

Additionally, throughout the proof of Lemma C.2, we only leverage the convexity and L-smoothness of f and the monotonicity of H as in (2.2), as well as the update rules as in Line 2 and Line 4. The proof involves no update rules regarding the gradient updates and hence Lemma C.2 holds for the stochastic case as well.

Next, to further bound the inner product term I, we introduce a general proposition that holds for two updates starting from the same point. Proposition C.3 is a slight modification from the proof of Proposition 4.2 in Chen et al. (2017) and analogous to Lemma 7.1 in Du et al. (2022). We omit the proof here as the argument comes from simple algebraic tricks. Readers can refer to Du et al. (2022) for more details.

Proposition C.3 (Proposition 4.2 in Chen et al. (2017) and Lemma 7.1 in Du et al. (2022)). Given an initial point θ Rd, two update vectors δ1, δ2 Rd and the corresponding outputs φ1, φ2 Rd satisfying:

φ1 = θ δ1, φ2 = θ δ2. (C.4)

For any point z Rd we have

2 δ2 δ1 2 + 1

2 θ z 2 φ2 z 2 θ φ1 2 . (C.5)

5In fact, this term reduces to f(zk), zk ωz of the vanilla gradient algorithm if the features of accelerations and optimistic gradients are removed.

Rate-Optimal Separable Minimax Optimization

Noting that the gradient term F(zmd k ) + H(zk+ 1

2 ) in Term I of inequality (C.3) of Lemma C.2 has been used in updating zk to zk+1 in Line 5 in Algorithm 1. Comparing Line 5 with Line 3 and by letting θ = zk, φ1 = zk+ 1

2 , φ2 = zk+1 in Proposition C.3, we obtain an upper bound for the inner product term I:

ηk I η2 k 2 ||H(zk+ 1

h ||zk ωz||2 ||zk+1 ωz||2 ||zk+ 1

L2 Hη2 k 2 ||zk+ 1

h ||zk ωz||2 ||zk+1 ωz||2 ||zk+ 1

2 zk||2i , (C.6)

where the last inequality is due to properties of H and the definition of LH. Combining Eqs. (C.3) and (C.6) we obtain

V (zag k+1, ωz) (1 αk)V (zag k , ωz) L2 Hηkαk

h ||zk ωz||2 ||zk+1 ωz||2 ||zk+ 1

2 zk||2i + Lα2 k 2 ||zk+ 1

2 zk||2. (C.7)

This finishes Step 1.

Step 2: Building and solving the recursion. We first apply the following lemma to build connections between ||zk+ 1

2 ||2 and ||zk+ 1

2 zk||2, reducing Eq. (C.7) to the composition of sequences {||zk ωz||2}0 k K 1 and {||zk+ 1

2 zk||2}0 k K 1. The proof of Lemma C.4 is deferred to E.5.

Lemma C.4. For any stepsize sequence {ηk}0 k K 1 satisfying for some positive constant c > 0 and the Lipschitz parameter LH such that LHηk p c

2 holds for all k. Algorithm 1 with initialization z 1

2 = zag 0 = z0 gives the following for any k = 0, . . . , K 1: zk+ 1

ℓ=0 c ℓ zℓ+ 1

2 zℓ 2 . (C.8)

Combining Eqs. (C.7) and (C.8), bringing in the stepsize choice αk = 2 k+2 and rearranging the terms, we obtain the following relation:

V (zag k+1, ωz) k k + 2V (zag k , ωz) 1 ηk(k + 2) ||zk ωz||2 ||zk+1 ωz||2

1 ηk(k + 2) 2L (k + 2)2

2 zk||2 + 2L2 Hηk k + 2

ℓ=0 ck ℓ||zℓ+ 1

Multiplying both sides by (k + 2)2, we obtain

(k + 2)2V (zag k+1, ωz) [(k + 1)2 1]V (zag k , ωz) k + 2

||zk ωz||2 ||zk+1 ωz||2

ηk 2L ||zk+ 1

2 zk||2 + 2L2 H(k + 2)ηk

ℓ=0 ck ℓ||zℓ+ 1

Taking ηk = k+2 2L+

2 c LH(k+2), we have k+2

2 c LH(k + 2), and the previous inequality reduces to

(k + 2)2V (zag k+1, ωz) [(k + 1)2 1]V (zag k , ωz)

2 c LH(k + 2)

! ||zk ωz||2 ||zk+1 ωz||2

2 c LH(k + 2)||zk+ 1

2c LH(k + 2)

ℓ=0 ck ℓ||zℓ+ 1

Rate-Optimal Separable Minimax Optimization

Subtracting off term V (zag k+1, ωz) on both sides and summing over k from 0 to K 1, we have

(K + 1)2 1 V (zag K, ωz) +

2 c LH(K + 1)

||z0 ωz||2 +

k=0 zk ωz 2

k=0 V (zag k+1, ωz)

k=0 (k + 2)||zk+ 1

| {z } III1

k=0 (k + 2)

ℓ=0 ck ℓ||zℓ+ 1

| {z } III2

Simple algebra yields

ℓ=0 ||zℓ+ 1

2 zℓ||2 K 1 X

k=ℓ (k + 2)ck ℓ

1 c + c (1 c)2

Straightforward derivations give that if we choose c = 2 3+

3, the inequality q

2c h k+2 1 c + c (1 c)2 i holds for all k 0. Thus, summing III1 and III2 terms we have

2 c LHIII1 +

2c LHIII2 0.

Finally, we solve the recursion and conclude

(K + 1)2 1 V (zag K, ωz) +

2 c LH(K + 1)

||z0 ωz||2 +

k=0 zk ωz 2

k=0 V (zag k+1, ωz). (C.9)

finishing Step 2.

Step 3: Proving zk stays nonexpansive with respect to z . In Lemma 3.1, we show that zk always stays in the ball centered at z with radius ||z0 z ||. The proof of this lemma is presented in E.2.

Lemma 3.1 (Nonexpansiveness, restated). Under Assumptions 2.1, we set the parameters as L = Lf Lg, LH = Ixx Iyy + Ixy, ηk = k+2

3LH(k+2) and αk = 2 k+2 Algorithm 1 with initialization z 1

2 = zag 0 = z0, at any iterate

k < K we have

||zk z || ||z0 z ||.

Step 4: Combining everything together. Bringing the nonexpansiveness result in Lemma 3.1 into the solved recursion (C.9), setting ωz = z and rearranging, we obtain the following:

(K + 1)2V (zag K, z ) (K + 1)2V (zag K, z ) +

2 c LH(K + 1)

2 c LH(K + 1)

Dividing both sides by (K + 1)2 and noting that Lemma C.1 implies V (zag K, z ) µ

2 ||zag K z ||2. Hence, bringing in the choice of c = 2 3+

3 concludes our proof of Theorem 3.3.

Rate-Optimal Separable Minimax Optimization

We finally remark that a limitation of this convergence rate bound is that the coefficient for LH in our stepsize choosing scheme is p

3 2.175 while an improved stepsize in this special case is 1 2LH , yielding a sharper coefficient 2. Although the slight difference in constant factors does not harm the practical performance drastically, we anticipate that this constant might be further improved and leave it to future work.

C.2. Proof of Corollary 3.4

Proof.[Proof of Corollary 3.4] The proof of restarting argument is direct. By Eq. (3.7), if we want ||zag K z ||2 1

e||z0 z ||2

to hold, we can choose K such that

4L µ(K + 1)2 1

3LH µ(K + 1) 1

This is equivalent to

µ , K + 1 4e p

For a given threshold ϵ > 0, with the output of every epoch satisfying ||zag K z ||2 1

e||z0 z ||2, the total epochs

required to obtain an ϵ-optimal minimax point would be log ||z0 z ||2

ϵ . Thus, the total number of iterates required to get within the ϵ threshold would be:

Bringing the choice of scaling parameters in (3.8) and we conclude our proof of Corollary 3.4.

C.3. Proof of Theorem 3.5

Proof.[Proof of Theorem 3.5] For minimax problem, we recall that we define the primal-dual gap function as:

V (z, z ) = F(z) F(z ) + H(z ), z z .

We have for any pair of parameters (bx, by) and (x, y):

L(bx, y) L(x, by) = f(bx) f(x) + g(by) g(y) + I(bx, y) I(x, by)

= f(bx) f(x) + g(by) g(y) + I(bx, y) I(x, y) + I(x, y) I(x, by)

f(bx) f(x) + g(by) g(y) + H(z), bz z V (bz, z).

Similarly for the regularized problem we define

Vϵ(z, z ) = V (z, z ) + ϵ

and by the definition of Lϵ we have

Lϵ(bx, y) Lϵ(x, by) Vϵ(bz, z),

and moreover,

max y Y Lϵ(bx, y) min x X Lϵ(x, by) max z X Y Vϵ(bz, z).

By applying the AG-OG algorithm (Algorithm 1) onto the regularized objective Lϵ, if we can find a pair bz = (bx, by) such that

max y Y Lϵ(bx, y) min x X Lϵ(x, by) max z X Y Vϵ(bz, z) ϵ.

Rate-Optimal Separable Minimax Optimization

The result would imply for the original C-SC problem that

max y Y L (bx, y) min x X L (x, by) = max y Y L (bx, y) + ϵ

2||bx||2 min x X L (x, by) + ϵ

L (bx, y) + ϵ

2||bx||2 min x X

L (x, by) + ϵ

max y Y Lϵ (bx, y) min x X Lϵ (x, by) ϵ.

The left of this subsection is devoted to finding the gradient complexity of finding a bz Z such that maxz Z Vϵ(bz, z). By utilizing the results in Step 2 in the proof of Theorem 3.3 to the objective Vϵ, we have

Vϵ(zag k+1, ωz) k k + 2Vϵ(zag k , ωz) 1 ηk(k + 2) ||zk ωz||2 ||zk+1 ωz||2

1 ηk(k + 2) 2L (k + 2)2

2 zk||2 + 2ηk L2 H k + 2

ℓ=0 ck ℓ||zℓ+ 1

Multiplying both sides by (k + 2)(k + 1), we obtain

(k + 2)(k + 1)Vϵ(zag k+1, ωz) (k + 1)k Vϵ(zag k , ωz) k + 1

||zk ωz||2 ||zk+1 ωz||2

ηk 2L ||zk+ 1

2 zk||2 + 2(k + 1)ηk L2 H

ℓ=0 ck ℓ||zℓ+ 1

Taking ηk = k+1 2L+

2 c LH(k+1), c = 2 3+

5 and adopting similar techniques as in the proof of Theorem 3.3, we have

(K + 2)(K + 1)Vϵ(zag K, ωz) +

z K ωz 2 2L||z0 ωz||2 +

k=0 zk ωz 2 .

Taking ωz = z ϵ where z ϵ is the solution of the objective. Similarly as in Lemma 3.1 in the proof of Theorem 3.3, we can apply the same bootstrapping argument and derive for µϵ being the strongly convexity parameter of Vϵ, Lϵ being the smoothness parameter of the regularized F, the following inequality

||zag K z ϵ ||2

4L µϵ(K + 1)K + 2 p

5LH µϵ(K + 1)

Applying the same restarting as in Corollary 3.4, the total number of iterates required to get within the ϵ threshold (in terms of ||zag K z ϵ ||2) should be

We note that in previous iterates n = 0, . . . , N 2 in Algorithm 2, we have obtained a z0 such that ||z0 z ϵ ||2 ϵ. We then analyze at iteration n = N 1. Again from Equation (C.10), letting ωz := z K = arg maxz Z Vϵ(zag K, z), we have

(K + 2)(K + 1)Vϵ(zag K, z K) +

z K z K 2 2L||z0 z K||2 +

k=0 zk z K 2 .

As Vϵ(zag K, z K) 0, we can apply the same boostrapping argument and derive

2 ||zag K z K||2 Vϵ(zag K, z K) 2Lϵ + p

5LHK K(K + 1) ||z0 z K||2. (C.11)

Rate-Optimal Separable Minimax Optimization

On the other hand, we also have that

2 ||zag K z ϵ ||2 Vϵ(zag K, z ϵ ) 2L + p

5LHK K(K + 1) ||z0 z ϵ ||2. (C.12)

By analyzing the two inequalities (C.11) and (C.12), we obtain that for sufficiently large K (in an order of O q

such that 4L+2

5LHK µϵK(K+1) 1 c2 ,

||z K z ϵ || ||zag K z K|| + ||zag K z ϵ || 1

c2 [||z0 z K|| + ||z0 z ϵ ||]

c2 [||z0 z ϵ || + ||z ϵ z K|| + ||z0 z ϵ ||] 2 c2 1||z0 z ϵ ||.

Furthermore, we apply (C.11) again and derive

max z Z Vϵ(zag K, z) = Vϵ(zag K, z K) 1

c2 ||z0 z L||2 ||z0 z K|| c2 ||z0 z ϵ || + ||z ϵ z K|| c2 c2 + 1 c2(c2 1)||z0 z ϵ ||.

Taking c2 = 3, and noting that we have obtained ||z0 z ϵ ||2 ϵ in previous restarted iterates and combining the technique at the beginning of the proof of Theorem 3.5, the total number of iterates in order to get maxy Y Lϵ (bx, y)

minx X Lϵ (x, by) maxz Z Vϵ(bz, z) ϵ is O q

L ϵ µg + LH ϵ µg

ϵ . Applying a scaling reduction argument as in (3.8) (the stepsize is dependent on ϵ after scaling reduction) gives a final complexity of

ϵ Ixy ϵµg Iyy

C.4. Proof of Theorem 4.1

Proof.[Proof of Theorem 4.1] For the stochastic case, we use the primal-dual gap function (C.1) and proceeds in the following

Step 1: Estimating weighted temporal difference in squared norms. We mentioned in the proof of Theorem 3.3 that Lemma C.2 holds for the stochastic case as well. Thus, we have

V (zag k+1, ωz) (1 αk)V (zag k , ωz) αk D F(zmd k ) + H(zk+ 1

By applying Proposition C.3 to the iterates of Algorithm 3. Taking x = zk, ϕ1 = zk+ 1

2 , ϕ2 = zk+1 in Proposition C.3 and recalling the update rules in Algorithm 3, we obtain the following stochastic version of inequality (C.6):

ηk D e F(zmd k ; ξk) + e H(zk+ 1

2η2 k || e H(zk+ 1

2 ) e H(zk 1

2 )||2 | {z } (a)

h ||zk ωz||2 ||zk+1 ωz||2 ||zk+ 1

Step 2: Building and solving the recursion. Note that in the stochastic case, unlike Step 2 in the proof of Theorem 3.3, before connecting ||zk+ 1

2 ||2 with ||zk+ 1

2 zk||2 to get an iterative rule, we need to bound the expectation of (a) with additional noise first.

Throughout the rest of the proof of Theorem 4.1, we denote

2 h = e H(zk+ 1

2 ) H(zk+ 1

2 ), k f = e F(zmd k ; ξk) F(zmd k ).

Taking expectation over term (a) in above, we use the following lemma to depict the upper bound of the quantity. The proof is delayed to E.6.

Rate-Optimal Separable Minimax Optimization

Lemma C.5. For any β > 0, under Assumption 2.2, we have

E|| e H(zk+ 1

2 ) e H(zk 1

2 )||2 (1 + β)L2 HE||zk+ 1

2 ||2 + 2 + 1

σ2 H. (C.13)

Taking β = 1 in Lemma C.5 and bringing the result into the expectation of (C.3), we obtain that

EV (zag k+1, ωz) (1 αk)EV (zag k , ωz)

h 2L2 HE||zk+ 1

2 ||2 + 3σ2 H i + αk E D k f + k+ 1

2 h , zk+ 1

+ Lα2 k 2 E||zk+ 1

2 zk||2 + αk

2ηk E h ||zk ωz||2 ||zk+1 ωz||2 ||zk+ 1

2 zk||2i . (C.14)

Following the above inequality and following similar techniques as in Step 2 of the proof of Theorem 3.3, we can derive the following Lemma C.6, whose proof is delayed to E.7.

Lemma C.6. For the choice of stepsize such that ηk LH c

2 holds for all k and any constant r > 0, we have

EV (zag k+1, ωz) (1 αk)EV (zag k , ωz) αk

2ηk E ||zk ωz||2 ||zk+1 ωz||2 + 3αkηk

+ 2αkηk L2 H

ℓ=0 ck ℓE||zℓ+ 1

2 zℓ||2 rαk

2ηk Lα2 k 2

2 zk||2 + αkηk 2(1 r)σ2 F .

Recalling that αk = 2 k+2, we have

EV (zag k+1, ωz) k k + 2EV (zag k , ωz) 1 ηk(k + 2)E ||zk ωz||2 ||zk+1 ωz||2

+ 4ηk L2 H k + 2

ℓ=0 ck ℓE||zℓ+ 1

2 zℓ||2 r ηk(k + 2) 2L (k + 2)2

+ 3ηk (1 c)(k + 2)σ2 H + ηk (1 r)(k + 2)σ2 F .

Multiplying both sides by (k + 2)2 and taking r = 1

2, we obtain

(k + 2)2EV (zag k+1, ωz) [(k + 1)2 1]EV (zag k , ωz)

ηk E ||zk ωz||2 ||zk+1 ωz||2 + 4ηk L2 H(k + 2)

ℓ=0 ck ℓE||zℓ+ 1

ηk 2L E||zk+ 1

2 zk||2 + 3ηk(k + 2)

1 c σ2 H + ηk(k + 2)

ηk E ||zk ωz||2 ||zk+1 ωz||2 + 4ηk L2 H(k + 2)

ℓ=0 ck ℓE||zℓ+ 1

2ηk 2L E||zk+ 1

2 zk||2 + 3ηk(k + 2)

1 c σ2 H + 2ηk(k + 2)σ2 F .

Telescoping over k = 0, 1, . . . K 1 and using the same techniques as in the proof of Theorem 3.3, we have for k+2

2ηk 2L + 1 c LH(k + 2) and c = 1 2+

2 (c/(1 c) =

2 1, and recall σ2 = 3

2σ2 H + 2σ2 F so that

(K + 1)2 1 EV (zag K, z ) + K + 1

ηK 1 E||z K z ||2

η0 E||z0 z ||2 + 2 c LH

k=1 E zk z 2 +

k=0 (k + 2)ηkσ2

k=0 EV (zag k+1, z ). (C.15)

Rate-Optimal Separable Minimax Optimization

Step 3: Proving zk stays within a neighbourhood of z . We introduce the following Lemma C.7, whose proof is in E.3

Lemma C.7. Given the maximum epoch number K > 0 and stepsize sequence {ηk}k [K] satisfying

ηk 1 = 2 c LH for any k < K, we have for k [K 1]:

||zk z ||2 ||z0 z ||2 + η0

k=0 (k + 2)ηkσ2.

(b) In addition if ηk k+2

D for k [K 1] where D will be specified in (c) and taking A(K) := p

(K + 1)(K + 2)(2K + 3)/6, we have

||zk z ||2 ||z0 z ||2 + A(K)2σ2

D2 . (C.16)

(c) Taking D = σ

E||z0 z ||2 for some absolute constant C > 0 , bound (C.16) reduces to

||zk z ||2 1 + C2 ||z0 z ||2. (C.17)

Step 4: Combining everything together. Combining the choice of stepsize ηk in (a), (b) in Lemma C.7 and k+2

2ηk 2L + 1 c LH(k + 2), and bound (C.15) with Eq. (C.17), by rearranging the terms again, we conclude that for ηk =

(K + 1)2EV (zag K, z ) 4L + 2 q

2(K + 1) 1 + C2 LH

E||z0 z ||2

E||z0 z ||2.

Dividing both sides by (K + 1)2 and noting that V (zag K, z ) µ

2 E||zag K z ||2, we have

E||zag K z ||2 8L µ(K + 1)2 + 7.4(1 + C2)LH

E||z0 z ||2 + 2(C + 1

E||z0 z ||2,

hence concluding the entire proof of Theorem 4.1.

D. Proof of Bilinear Game Cases

D.1. Proof of Theorem 3.6

Proof.[Proof of Theorem 3.6]

Step 1: Non-expansiveness of OGDA last-iterate. We start by the non-expansiveness Lemma, whose proof is in E.8.

Lemma D.1 (Bounded Iterates). Following (3.11), at any iterate k < K, zk stays within the region defined by the initialization z0:

||zk z || ||z0 z ||,

where we recall that z = [x ; y ] denotes the unique solution of Problem (1.3) with f, g = 0 and I defined in (3.10).

By Lemma D.1, for any 0 k < K, we have

||zk z || ||z0 z ||.

Rate-Optimal Separable Minimax Optimization

Step 2: Recalling that we take αk = 2 k+2in (3.11b) of (3.11). Thus, we obtain the following

zag k+1 = k k + 2zag k + 2 k + 2zk+ 1

Subtracting both sides by z and multiplying both sides by (k + 1)(k + 2), we have

(k + 1)(k + 2) zag k+1 z = k(k + 1) zag k z + 2(k + 1) zk+ 1

Telescoping over k = 0, . . . , K 1 and we conclude that

K(K + 1) zag K z = 2

k=0 (k + 1) zk+ 1

2 z . (D.1)

Moreover, according to the update rule (3.11c), we have that

k=0 (k + 1)zk+1 (k + 1)zk =

k=0 ηk(k + 1)H(zk+ 1

Recalling that ηk = 1 2LH , combining (D.2) with (D.1) and taking the squared norm on both sides, we conclude that

||K(K + 1) zag K z ||2 = ||2

k=0 (k + 1) zk+ 1

2 z ||2 1 λmin(B B)||2

k=0 (k + 1)H(zk+ 1

= 16L2 H λmin(B B)||Kz K

k=0 zk||2 = 16L2 H λmin(B B)||

k=0 [(z K z ) (zk z )] ||2

16L2 H λmin(B B)K

2||z K z ||2 + 2||zk z ||2 . (D.3)

Applying non-expansiveness in Lemma D.1 in (D.10), bringing LH = p

λmax(B B)and rearranging, we conclude that

||zag K z ||2 64λmax(B B) λmin(B B)(K + 1)2 ||z0 z ||2.

Restarting every l 8 q

λmin(B B) m iterates for a total of log ||z0 z ||

ϵ times, we obtain the final sample complexity of

λmin(B B) log 1

for the nonstochastic setting.

D.2. Stochastic Bilinear Game Case

For the stochastic AG-OG with restarting for bilinear case, our iteration spells

2 = zk ηk e H(zk 1

2 ), (D.4a)

zag k+1 = (1 αk)zag k + αkzk+ 1

zk+1 = zk ηk e H(zk+ 1

2 ). (D.4c)

We are able to derive the following theorem.

Rate-Optimal Separable Minimax Optimization

Theorem D.2 (Convergence of stochastic AG-OG, bilinear case)). When specified to the stochastic bilinear game case, setting the parameters as αk = 2 k+2 and ηk = 1 2LH , the output of update rules (D.4) satisfies for any γ > 0,

E||zag K z ||2 (1 + γ) 128L2 H λmin(B B)(K + 1)2 ||z0 z ||2 + 48 λmin(B B)(K + 1)σ2 H

γ )σ2 H 3λmin(B B)K .

Moreover, by operating scheduled restarting technique, it yields a complexity result of

v u u t λmax (B B)

λmin BB log

λmin BB λmax (B B)

+ σ2 H λmin BB ε2

We begin by proving the following lemma, which is the stochastic version of Lemma D.1. It is worth noting that when setting σH = 0 and β = 0 in Lemma D.3 below, it reduces to the non-expansiveness of deterministic iterates (3.11). Moreover, Lemma D.3 holds for any monotonic H( ).

Lemma D.3 (Bounded Iterates (Stochastic)). Following (D.4), for any β > 0, taking ηk = 1 2LH

(1+β) at any iterate

k < K, zk stays within the region defined by the initialization z0:

||zk z ||2 ||z0 z ||2 + η2 k

where we recall that z denotes the unique solution of Problem (1.2) with f, g = 0 and I defined in (3.10).

Proof.[Proof of Lemma D.3] The optimal condition of the problem yields H(z ) = 0 for z being the solution of the VI. By the monotonicity of H( ), let z = zk+ 1

2 and z = ωz in (2.2), we have that

2 ) H(ωz), zk+ 1

2 ωz E 0, ωz Z. (D.5)

Let φ1 = zk+ 1

2 , φ2 = zk+1, θ = zk, δ1 = ηk e H(zk 1

2 ), δ2 = ηk e H(zk+ 1

2 ) and z = ωz in Proposition C.3, we have

ηk D e H(zk+ 1

η2 k 2 || e H(zk+ 1

2 ) e H(zk 1

h ||zk ωz||2 ||zk+1 ωz||2 ||zk zk+ 1

2 ||2i . (D.6)

Recalling the results in Lemma C.5 that

E|| e H(zk+ 1

2 ) e H(zk 1

2 )||2 (1 + β)L2 HE||zk+ 1

2 ||2 + 2 + 1

Taking expectation over (D.6), combining it with (D.5) and letting ωz = z , we obtain

0 = ηk E D H(z ), zk+ 1

η2 k(1 + β)L2 H 2 E||zk+ 1

2E h ||zk z ||2 ||zk+1 z ||2 ||zk zk+ 1

2 ||2i + η2 k 2 + 1

Next, we move on to estimate ||zk+ 1

2 ||2. As we know that via Young s and Cauchy-Schwarz s inequalities and the update rules (3.11a) and (3.11c), for all k 1

2 ||2 2||zk+ 1

2 zk||2 + 2||zk zk 1

2 zk||2 + 2η2 k 1L2 H||zk 1

Rate-Optimal Separable Minimax Optimization

Multiplying both sides by 2 and moving one term to the right hand gives for all k 1

2 ||2 4||zk+ 1

2 zk||2 + 4η2 k 1L2 H||zk 1

2 ||2 ||zk+ 1

Bringing this into (E.12) and noting that ηk 1 1 2LH as well as ηk 1 2LH , we have

0 η2 k(1 + β)L2 H 2 E||zk+ 1

2E h ||zk z ||2 ||zk+1 z ||2 ||zk zk+ 1

2 ||2i + η2 k 2 + 1

2E ||zk z ||2 ||zk+1 z ||2 + η2 k(1 + β)L2 H 2 E h ||zk 1

2 ||2 ||zk+ 1

2 2η2 k(1 + β)L2 H

E||zk zk+ 1

2 ||2 + η2 k 2 + 1

Taking ηk satisfying ηk 1 LH

4(1+β) and by rearraing the above inequality, we have

2E ||zk z ||2 ||zk+1 z ||2 + 1

8E h ||zk 1

2 ||2 ||zk+ 1

E||zk zk+ 1

2 ||2 + η2 k 2 + 1

2E ||zk z ||2 + 1

2 ||2 ||zk+1 z ||2 1

2 ||2 + η2 k 2 + 1

Rearranging the above inequality and we conclude that

E ||zk+1 z ||2 + 1

2 ||2 E ||zk z ||2 + 1

2 ||2 + η2 k

Telescoping over k = 0, 1, . . . , K 1 and noting that z 1

2 = z0 and ηk is taken as a constant for the bilinear case, we have

||z K z ||2 ||z K z ||2 + 1

2 ||2 ||z0 z ||2 + η2 k

which concludes our proof of Lemma D.3.

Recalling that we take αk = 2 k+2 in (D.4b) of (D.4). Thus, we obtain the following

zag k+1 = k k + 2zag k + 2 k + 2zk+ 1

Subtracting both sides by z and multiplying both sides by (k + 1)(k + 2), we have

(k + 1)(k + 2) zag k+1 z = k(k + 1) zag k z + 2(k + 1) zk+ 1

Telescoping over k = 0, . . . , K 1 and we conclude that

K(K + 1) zag K z = 2

k=0 (k + 1) zk+ 1

2 z . (D.8)

Moreover, according to the update rule (D.4c), we have that

k=0 (k + 1)zk+1 (k + 1)zk =

k=0 ηk(k + 1) e H(zk+ 1

k=0 ηk(k + 1) h H(zk+ 1

2 h i . (D.9)

Rate-Optimal Separable Minimax Optimization

Recalling that ηk = 1 LH

4(1+β), combining (D.9) with (D.8) and taking the squared norm on both sides, we conclude that

||K(K + 1) zag K z ||2

k=0 (k + 1) zk+ 1

2 z ||2 1 λmin(B B)||2

k=0 (k + 1)H(zk+ 1

= 16(1 + β)(1 + γ)L2 H λmin(B B) ||Kz K

k=0 zk||2 + 4(1 + 1

λmin(B B)||

k=0 (k + 1) k+ 1

16(1 + β)(1 + γ)L2 H λmin(B B) K

2||z K z ||2 + 2||zk z ||2 + 4(1 + 1

λmin(B B)||

k=0 (k + 1) k+ 1

2 h ||2. (D.10)

Dividing both sides of (D.10) by K2(K + 1)2 and taking expectation, we have

E||zag K z ||2 16(1 + β)(1 + γ)L2 H λmin(B B)K(K + 1)2

k=0 E 2||z K z ||2 + 2||zk z ||2

λmin(B B)K2(K + 1)2

k=0 (k + 1)2E|| k+ 1

64(1 + β)(1 + γ)L2 H λmin(B B)(K + 1)2

||z0 z ||2 + η2 k(2 + 1

γ )σ2 H 3λmin(B B)K

= 64(1 + β)(1 + γ)L2 H λmin(B B)(K + 1)2 ||z0 z ||2 + 16(1 + γ)(2 + 1

λmin(B B)(K + 1)σ2 H + 4(1 + 1

γ )σ2 H 3λmin(B B)K ,

Minimize over β, we have β = σH K+1 2LH||z0 z || and the first two terms become

64(1 + γ)L2 H λmin(B B)(K + 1)2 ||z0 z ||2 + 32(1 + γ) λmin(B B)(K + 1)σ2 H + 32(1 + γ)LHσH λmin(B B)(K + 1)3/2 ||z0 z ||

If we take β = 1, the above reduces to

E||zag K z ||2 128(1 + γ)L2 H λmin(B B)(K + 1)2 ||z0 z ||2 + 48(1 + γ) λmin(B B)(K + 1)σ2 H + 4(1 + 1

γ )σ2 H 3λmin(B B)K

= (1 + γ) 128L2 H λmin(B B)(K + 1)2 ||z0 z ||2 + 48 λmin(B B)(K + 1)σ2 H

γ )σ2 H 3λmin(B B)K

Further minimizing over γ, we conclude that so

E||zag K z ||2

128L2 H λmin(B B)(K + 1)2 ||z0 z ||2 + 48 λmin(B B)(K + 1)σ2 H +

4σ2 H 3λmin(B B)K

128L2 H λmin(B B)(K + 1)2 ||z0 z ||2 +

48σ2 H λmin(B B)(K + 1) +

4σ2 H 3λmin(B B)K

2LH||z0 z ||

K + 1 + 8.083σH

By operating restarting techniques the same way as in the explanation of Corollary 3.2 in Du et al. (2022), we conclude Theorem D.2.

Rate-Optimal Separable Minimax Optimization

E. Proof of Auxiliary Lemmas

E.1. Proof of Lemma C.1

Proof.[Proof of Lemma C.1]

Since F(z) is L-smooth and µ-strongly convex. For the rest of this proof, we observe that the saddle definition of z

satisfies the first-order stationary condition:

F(z ) + H(z ) = 0. (E.1)

Furthermore, we have F(z) F(z ) + H(z ), z z

F(z ), z z + µ

2 z z 2 + H(z ), z z

= F(z ) + H(z ), z z + µ

2 z z 2 = µ

where in both of the two displays, the inequality holds due to the µ-strong convexity of F, and the equality holds due to the first-order stationary condition (E.1). This completes the proof.

E.2. Proof of Lemma 3.1

Proof.[Proof of Lemma 3.1] Following (C.9), we let ωz = z . Due to the non-negativity of V ( , z ), we can eliminate the V terms and have:

2 c LH(K + 1)

||z0 z ||2 +

k=0 zk z 2 .

We adopt a bootstrapping argument.We define MK = max0 k K 1 ||zk z ||2 and taking a maximum on each term on the right hand side of the above inequality, we conclude that

2 c LH(K + 1)

2 c LH(K + 1)

Thus, we know that ||z K z ||2 MK 1 and hence MK = MK 1 always holds. That yields MK = M0, and we conclude the proof of Lemma 3.1.

E.3. Proof of Lemma C.7

Proof.[Proof of Lemma C.7] Starting from (C.15) that

(K + 1)2 1 EV (zag K, z ) + K + 1

ηK 1 E||z K z ||2

η0 E||z0 z ||2 + 2 c LH

k=1 E zk z 2 +

k=0 (k + 2)ηkσ2

k=0 EV (zag k+1, z ).

We first omit the V ( , ) terms and have

ηK 1 E||z K z ||2 2

η0 E||z0 z ||2 + 2 c LH

k=1 ||zk z ||2 +

k=0 (k + 2)ηkσ2. (E.2)

Rate-Optimal Separable Minimax Optimization

Rewrite ||z K z ||2 as the difference between two summations, we obtain:

E||zk ωz||2 2

η0 E||z0 z ||2 + 2 c LH

k=1 E||zk z ||2 +

k=0 (k + 2)ηkσ2.

Rearranging the terms and by the first condition (a) that k+2

ηk 1 = 2 c LH, we have:

k=1 E||zk z ||2 2

η0 E||z0 z ||2 + K + 2

k=1 E||zk ωz||2 +

k=0 (k + 2)ηkσ2.

To construct a valid iterative rule, we divide both sides of the above inequality with (K+1)(K+2)

ηK 1ηK and obtain the following:

k=1 E||zk z ||2 ηK 1

k=1 E||zk ωz||2 + ηK 1ηK (K + 1)(K + 2)

" 2 η0 E||z0 z ||2 +

k=0 (k + 2)ηkσ2 #

Here we slightly abuse the notations and use K to denote an arbitrary iteration during the process of the algorithm and use K to denote the fixed total number of iterates. Thus, PK 1 k=0 (k + 2)ηkσ2 PK 1 k=0 (k + 2)ηkσ2 is an upper bound that does not change with the choice of K. It follows that:

k=1 E||zk z ||2 ηK 1

k=1 E||zk ωz||2 + c 2LH

K + 1 ηK K + 2

" 2 η0 E||z0 z ||2 +

k=0 (k + 2)ηkσ2 #

" 2 η0 E||z0 z ||2 +

k=0 (k + 2)ηkσ2 #

Dividing both sides by ηK K+2, the result follows:

k=1 E||zk z ||2 c 2LH

2ηK 1 " 2 η0 E||z0 z ||2 +

k=0 (k + 2)ηkσ2 #

Bringing this into Eq. (E.2), we conclude that

ηK 1 E||z K z ||2 η0(K + 1)

" 2 η0 E||z0 z ||2 +

k=0 (k + 2)ηkσ2 #

Dividing both sides by K+1

ηK 1 and we have:

E||z K z ||2 η0

" 2 η0 E||z0 z ||2 +

k=0 (k + 2)ηkσ2 #

Now we change back using the notation K to denote the total iterates and k is the iterates indexes, we have

E||zk z ||2 E||z0 z ||2 + η0

k=0 (k + 2)ηkσ2,

which concludes the proof of (a) of Lemma C.7. Additionally, if ηk k+2

D for some quantity D, we have

k=0 (k + 2)ηk

D (K + 1)(K + 2)(2K + 3)

We use A(K) = p

(K + 1)(K + 2)(2K + 3)/6 and noting that η0 2

E||zk z ||2 E||z0 z ||2 + A(K)2σ2

which concludes our proof of (b). And (c) follows by straightforward calculations.

Rate-Optimal Separable Minimax Optimization

E.4. Proof of Lemma C.2

Proof.[Proof of Lemma C.2] Recalling that F is L-smooth. To upper-bound the difference in pointwise primal-dual gap between iterates, we first estimate the difference in function values of f via gradients at the extrapolation point zmd k . For any given u Z, the convexity and L-smoothness of F( ) implies that

F(zag k+1) F(u) = F(zag k+1) F(zmd k ) F(u) F(zmd k )

F(zmd k ), zag k+1 zmd k + L

zag k+1 zmd k 2 F(zmd k ), u zmd k .

Taking u = ωz and u = zag k respectively, we conclude that

F(zag k+1) F(ωz) F(zmd k ), zag k+1 zmd k + L

zag k+1 zmd k 2 F(zmd k ), ωz zmd k , (E.3)

F(zag k+1) F(zag k ) F(zmd k ), zag k+1 zmd k + L

zag k+1 zmd k 2 F(zmd k ), zag k zmd k . (E.4)

Multiplying (E.3) by αk and (E.4) by (1 αk) and adding them up, we have

F(zag k+1) αk F(ωz) (1 αk)F(zag k ) F(zmd k ), zag k+1 (1 αk)zag k αkωz + L

2 ||zag k+1 zmd k ||2

= αk D F(zmd k ), zk+ 1

| {z } I(a)

where by substracting Line 2 from Line 4 of Algorithm 1 and by Line 4 itself, the last equality of (E.5) follows.

Recalling that zag k corresponds to regular iterates and zmd k corresponds to the extrapolated iterates of Nesterov s acceleration scheme. The squared error term II in (E.5) is brought by gradient evaluated at the extrapolated point instead of the regular point. Note that if we do an implicit version of Nesterov such that zmd k 1 = zag k , this squared term goes to zero, and the convergence analysis would be the same as in OGDA. This could potentially result in a new implicit algorithm with better convergence guarantee.

On the other hand, for the coupling term of the updates, we have

H(ωz), zag k+1 ωz (1 αk) H(ωz), zag k ωz = αk D H(ωz), zk+ 1

2 ωz E αk D H(zk+ 1

| {z } I(b)

where the last equality comes from the monotonicity property of H( ) that

2 ) H(ωz), zk+ 1

Summing both sides of Eq. (E.5) and Eq. (E.6) we obtain the following:

F(zag k+1) αk F(ωz) (1 αk)F(zag k ) + H(ωz), zag k+1 ωz (1 αk) H(ωz), zag k wz

αk D F(zmd k ) + H(zk+ 1

where I is the summation of I(a) and I(b). This concludes our proof of Lemma C.2 by bringing in the definitions of V (zag k+1, z ) and V (zag k , z ).

Rate-Optimal Separable Minimax Optimization

E.5. Proof of Lemma C.4

Proof.[Proof of Lemma C.4] We focus on k = 1, . . . , K 1 since the k = 0 case holds automatically. By Young s inequality and Cauchy-Schwarz inequality, we have that

2 zk 2 + 2 zk zk 1

(a) 2 zk+ 1

2 zk 2 + 2η2 k 1L2 H zk 1

2 (b) 2 zk+ 1

2 zk 2 + c zk 1

where (a) is due to Lines 3 and 5 of Algorithm 1 and the definition of LH, and (b) is due to the condition in Lemma C.4 that ηk LH p c

2. Recursively applying the above gives (C.8) which is repeated as:

ℓ=0 c ℓ zℓ+ 1

2 zℓ 2 . (C.8)

Indeed, from (E.7)

2 c (k 1) zk 1

2 2c k zk+ 1

so telescoping over k = 1, . . . , K gives

k=1 c k zk+ 1

which simply reduces to (due to z0 = z 1

k=1 c k zk+ 1

2 zk 2 + z 1

k=0 c k zk+ 1

This gives (C.8) and the entire Lemma C.4.

E.6. Proof of Lemma C.5

Proof.[Proof of Lemma C.5] We recall that we denote

2 h = e H(zk+ 1

2 ) H(zk+ 1

2 ), k f = e F(zmd k ; ξk) F(zmd k ).

Then, we have

E|| e H(zk+ 1

2 ) e H(zk 1

2 )||2 = E||H(zk+ 1

By first taking expectation over ζk+ 1

2 condition on zk+ 1

2 given, we have

LHS E||H(zk+ 1

2 h ||2 + E|| k+ 1

(1 + β)E||H(zk+ 1

2 )||2 + (1 + 1

2 h ||2 + E|| k+ 1

(1 + β)L2 HE||zk+ 1

2 ||2 + (1 + 1

2 h ||2 + E|| k+ 1

Recalling that by Assumption 2.2, E|| k+ 1

2 h ||2 σ2 H and E|| k 1

2 h ||2 σ2 H, we conclude our proof of Lemma C.5.

Rate-Optimal Separable Minimax Optimization

E.7. Proof of Lemma C.6

Proof.[Proof of Lemma C.6] By inequality (C.14), we have

EV (zag k+1, ωz) (1 αk)EV (zag k , ωz)

h 2L2 HE||zk+ 1

2 ||2 + 3σ2 H i + αk E D k f + k+ 1

2 h , zk+ 1

+ Lα2 k 2 E||zk+ 1

2 zk||2 + αk

2ηk E h ||zk ωz||2 ||zk+1 ωz||2 ||zk+ 1

The inner product term can be decomposed into

E D k f + k+ 1

2 h , zk+ 1

2 h , zk+ 1

2 ωz E + E k f, zk ωz + E D k f, zk+ 1

2 zk E = E D k f, zk+ 1

Where the expectation of the first two terms all equals 0. Thus, we obtain

EV (zag k+1, ωz) (1 αk)EV (zag k , ωz)

h 2L2 HE||zk+ 1

2 ||2 + 3σ2 H i + αk E D k f, zk+ 1

2ηk E ||zk ωz||2 ||zk+1 ωz||2 αk

2ηk Lα2 k 2

For any r > 0, we pair up

2ηk E||zk+ 1

2 zk||2 + αk E D k f, zk+ 1

2 zk E αkηk 2(1 r)E|| k f||2,

EV (zag k+1, ωz) (1 αk)EV (zag k , ωz)

h 2L2 HE||zk+ 1

2 ||2 + 3σ2 H i + αkηk 2(1 r)E|| k f||2

2ηk E ||zk ωz||2 ||zk+1 ωz||2 rαk

2ηk Lα2 k 2

2 zk||2. (E.8)

Next, we connect ||zk+ 1

2 ||2 with the squared norms ||zℓ+ 1

2 zℓ||2. For ηk satisfying ηk LH c

2 , we have

2 2E||zk+ 1

2 zk||2 + 2E||zk zk 1

= 2E||zk+ 1

2 zk||2 + 2η2 k 1E|| e H(zk 1

2 ) e H(zk 3

= 2E||zk+ 1

2 zk||2 + 2η2 k 1E||H(zk 1

2 h ||2 + 2η2 k 1E|| k 1

2 zk||2 + 4η2 k 1L2 HE||zk 1

2 ||2 + 6η2 k 1σ2 H

ℓ=0 ck ℓE||zℓ+ 1

2 zℓ||2 + 6

ℓ=0 ck ℓη2 ℓ 1σ2 H.

Rate-Optimal Separable Minimax Optimization

Bringing Eq. (E.9) into (E.8), we have

EV (zag k+1, ωz) (1 αk)EV (zag k , ωz)

ℓ=0 ck ℓE||zℓ+ 1

2 zℓ||2 + 12L2 H

ℓ=0 ck ℓη2 ℓ 1σ2 H + 3σ2 H

+ αkηk 2(1 r)σ2 F

2ηk E ||zk ωz||2 ||zk+1 ωz||2 rαk

2ηk Lα2 k 2

ℓ=0 ck ℓE||zℓ+ 1

2 zℓ||2 + 3 c 1 cσ2 H + 3σ2 H

+ αkηk 2(1 r)σ2 F

2ηk E ||zk ωz||2 ||zk+1 ωz||2 rαk

2ηk Lα2 k 2

ℓ=0 ck ℓE||zℓ+ 1

2 zℓ||2 + 3αkηk

2(1 c)σ2 H + αkηk 2(1 r)σ2 F

2ηk E ||zk ωz||2 ||zk+1 ωz||2 rαk

2ηk Lα2 k 2

which concludes our proof of Lemma C.6.

E.8. Proof of Lemma D.1

Proof.[Proof of Lemma D.1] The optimal condition of the problem yields H(z ) = 0 for z being the solution of the VI. By the monotonicity of H( ), let z = zk+ 1

2 and z = ωz in (2.2), we have that

2 ) H(ωz), zk+ 1

2 ωz E 0, ωz Z. (E.10)

Let φ1 = zk+ 1

2 , φ2 = zk+1, θ = zk, δ1 = ηk H(zk 1

2 ), δ2 = ηk H(zk+ 1

2 ) and z = ωz in Proposition C.3, we have

ηk D H(zk+ 1

2 ωz E η2 k 2 ||H(zk+ 1

h ||zk ωz||2 ||zk+1 ωz||2 ||zk zk+ 1

η2 k L2 H 2 ||zk+ 1

h ||zk ωz||2 ||zk+1 ωz||2 ||zk zk+ 1

where the last inequality follows by the LH-Lipschitzness of the H operator. Combining (E.11) with (E.10), we obtain

0 = ηk D H(ωz), zk+ 1

2 ωz E η2 k L2 H 2 ||zk+ 1

h ||zk ωz||2 ||zk+1 ωz||2 ||zk zk+ 1

Next, we move on to estimate ||zk+ 1

2 ||2. As we know that via Young s and Cauchy-Schwarz s inequalities and the update rules (3.11a) and (3.11c), for all k 1

2 ||2 2||zk+ 1

2 zk||2 + 2||zk zk 1

2 zk||2 + 2η2 k 1L2 H||zk 1

Multiplying both sides by 2 and moving one term to the right hand gives for all k 1

2 ||2 4||zk+ 1

2 zk||2 + 4η2 k 1L2 H||zk 1

2 ||2 ||zk+ 1

Rate-Optimal Separable Minimax Optimization

Bringing this into (E.12) and noting that ηk 1 1 2LH as well as ηk 1 2LH , we have

0 η2 k L2 H 2 ||zk+ 1

h ||zk ωz||2 ||zk+1 ωz||2 ||zk zk+ 1

2 ||zk ωz||2 ||zk+1 ωz||2 + η2 k L2 H 2

2 ||2 ||zk+ 1

2 2η2 k L2 H

||zk ωz||2 + 1

2 ||2 ||zk+1 ωz||2 1

Rearranging the above inequality and take ωz = z and we conclude that

||zk+1 z ||2 + 1

2 ||2 ||zk z ||2 + 1

Telescoping over k = 0, 1, . . . , K 1 and noting that z 1

2 = z0, we have

||z K z ||2 ||z K z ||2 + 1

2 ||2 ||z0 z ||,

which concludes our proof of Lemma D.1.