Published as a conference paper at ICLR 2021

LINEAR LAST-ITERATE CONVERGENCE IN CONSTRAINED SADDLE-POINT OPTIMIZATION

Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Haipeng Luo
University of Southern California
{chenyu.wei,leechung,mengxiao.zhang,haipengl}@usc.edu

Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU) for saddle-point optimization have received growing attention due to their favorable last-iterate convergence. However, their behaviors for simple bilinear games over the probability simplex are still not fully understood: previous analyses lack explicit convergence rates, only apply to an exponentially small learning rate, or require additional assumptions such as the uniqueness of the optimal solution. In this work, we significantly expand the understanding of last-iterate convergence for OGDA and OMWU in the constrained setting. Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achieved with a learning rate whose value is set to a universal constant, improving the result of (Daskalakis & Panageas, 2019b) under the same assumption. We then significantly extend the results to more general objectives and feasible sets for the projected OGDA algorithm, by introducing a sufficient condition under which OGDA exhibits concrete last-iterate convergence rates with a constant learning rate whose value only depends on the smoothness of the objective function. We show that bilinear games over any polytope satisfy this condition and OGDA converges exponentially fast even without the unique equilibrium assumption. Our condition also holds for strongly-convex-strongly-concave functions, recovering the result of (Hsieh et al., 2019). Finally, we provide experimental results to further support our theory.
1 INTRODUCTION

Saddle-point optimization in the form of $\min_x \max_y f(x, y)$ dates back to (Neumann, 1928), where the celebrated minimax theorem was discovered. Due to the advances of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) (itself a saddle-point problem), the question of how to find a good approximation of the saddle point, especially via an efficient iterative algorithm, has recently gained significant research interest. Simple algorithms such as Gradient Descent Ascent (GDA) and Multiplicative Weights Update (MWU) are known to cycle and fail to converge even in simple bilinear cases (see e.g., (Bailey & Piliouras, 2018) and (Cheung & Piliouras, 2019)). Many recent works consider resolving this issue via simple modifications of standard algorithms, usually in the form of some extra gradient descent/ascent steps. This includes Extra-Gradient methods (EG) (Liang & Stokes, 2019; Mokhtari et al., 2020b), Optimistic Gradient Descent Ascent (OGDA) (Daskalakis et al., 2018; Gidel et al., 2019; Mertikopoulos et al., 2019), Optimistic Multiplicative Weights Update (OMWU) (Daskalakis & Panageas, 2019b; Lei et al., 2021), and others. In particular, OGDA and OMWU are suitable for the repeated game setting where two players repeatedly propose $x_t$ and $y_t$ and receive only $\nabla_x f(x_t, y_t)$ and $\nabla_y f(x_t, y_t)$ respectively as feedback, with the goal of converging to a saddle point, or equivalently a Nash equilibrium in game theory terminology. One notable benefit of OGDA and OMWU is that they are also no-regret algorithms with important applications in online learning, especially when playing against adversarial opponents (Chiang et al., 2012; Rakhlin & Sridharan, 2013). Despite considerable progress, especially for the unconstrained setting, the behavior of these algorithms in the constrained setting, where $x$ and $y$ are restricted to closed convex sets $X$ and $Y$ respectively, is still not fully understood.
This is even true when $f$ is a bilinear function and $X$ and $Y$ are probability simplices, a setting known as classic two-player zero-sum games in normal form, or simply matrix games. Indeed, existing convergence results on the last iterate of OGDA or OMWU for matrix games are unsatisfactory: they lack explicit convergence rates (Popov, 1980; Mertikopoulos et al., 2019), only apply to an exponentially small learning rate and thus do not reflect the behavior of the algorithms in practice (Daskalakis & Panageas, 2019b), or require additional conditions such as uniqueness of the equilibrium or a good initialization (Daskalakis & Panageas, 2019b). Motivated by this fact, in this work, we first improve the last-iterate convergence result of OMWU for matrix games. Under the same unique equilibrium assumption as made by Daskalakis & Panageas (2019b), we show linear convergence with a concrete rate in terms of the Kullback-Leibler divergence between the last iterate and the equilibrium, using a learning rate whose value is set to a universal constant. We then significantly extend our results and consider OGDA for general constrained and smooth convex-concave saddle-point problems, without the uniqueness assumption. Specifically, we start with proving an average duality gap convergence of OGDA at the rate of $O(1/\sqrt{T})$ after $T$ iterations. Then, to obtain a more favorable last-iterate convergence in terms of the distance to the set of equilibria, we propose a general sufficient condition on $X$, $Y$, and $f$, called Saddle-Point Metric Subregularity (SP-MS), under which we prove concrete last-iterate convergence rates, all with a constant learning rate and without further assumptions. Our last-iterate convergence results for OGDA greatly generalize those of (Hsieh et al., 2019, Theorem 2), which itself is a consolidated version of results from several earlier works.
The key implication of our new results is that, by showing that matrix games satisfy our SP-MS condition, we provide by far the most general last-iterate guarantee with linear convergence for this problem using OGDA. Compared to that of OMWU, the convergence result of OGDA holds more generally, even when there are multiple equilibria. More generally, the same linear last-iterate convergence holds for any bilinear game over polytopes, since such games also satisfy the SP-MS condition, as we show. To complement this result, we construct an example of a bilinear game with a non-polytope feasible set where OGDA provably does not ensure linear convergence, indicating that the shape of the feasible set matters. Finally, we also provide experimental results to support our theory. In particular, we observe that OGDA generally converges faster than OMWU for matrix games, despite the facts that both provably converge exponentially fast and that OMWU is often considered more favorable than OGDA when the feasible set is the simplex.

2 RELATED WORK

Average-iterate convergence. While showing last-iterate convergence has been a challenging task, it is well-known that the average iterate of many standard algorithms such as GDA and MWU enjoys a converging duality gap at the rate of $O(1/\sqrt{T})$ (Freund & Schapire, 1999). A line of works shows that the rate can be improved to $O(1/T)$ using the optimistic versions of these algorithms, such as OGDA and OMWU (Rakhlin & Sridharan, 2013; Daskalakis et al., 2015; Syrgkanis et al., 2015). For tasks such as training GANs, however, average-iterate convergence is unsatisfactory since averaging large neural networks is usually prohibitive.

Extra-Gradient (EG) algorithms. The saddle-point problem fits into the more general variational inequality framework (Harker & Pang, 1990). A classic algorithm for variational inequalities is EG, first introduced in (Korpelevich, 1976).
Tseng (1995) is the first to show last-iterate convergence for EG in various settings such as bilinear or strongly-convex-strongly-concave problems. Recent works significantly expand the understanding of EG and its variants for unconstrained bilinear problems (Liang & Stokes, 2019), unconstrained strongly-convex-strongly-concave problems (Mokhtari et al., 2020b), and more (Zhang et al., 2019; Lin et al., 2020; Golowich et al., 2020b). The original EG is not applicable to a repeated game setting where only one gradient evaluation is possible in each iteration. Moreover, unlike OGDA and OMWU, EG is shown to have linear regret against adversarial opponents, and thus it is not a no-regret learning algorithm (Bowling, 2005; Golowich et al., 2020a). However, there are single-call variants of EG that address these issues. In fact, some of these versions coincide with the OGDA algorithm under different names, such as the modified Arrow-Hurwicz method (Popov, 1980) and extrapolation from the past (Gidel et al., 2019). Apart from OGDA, other single-call variants of EG include Reflected Gradient (Malitsky, 2015; Cui & Shanbhag, 2016; Malitsky & Tam) and Optimistic Gradient (Daskalakis et al., 2018; Mokhtari et al., 2020a). These variants are all equivalent in the unconstrained setting but differ in the constrained setting. To the best of our knowledge, none of the existing results for any single-call variant of EG covers the constrained bilinear case (which is one of our key contributions).

Error bounds and metric subregularity. To derive linear convergence for variational inequality problems, the error bound method is a commonly used technique (Pang, 1997; Luo & Tseng, 1993). For example, it is a standard approach to studying the last-iterate convergence of EG algorithms (Tseng, 1995; Hsieh et al., 2020; Azizian et al., 2020).
An error bound method is associated with an error function that gives every point in the feasible set a measure of sub-optimality that is lower bounded by the distance of the point to the optimal set up to some problem-dependent constant. If such an error function exists, linear convergence can be obtained. The choice of the error function depends on the feasible region, the objective function, and the algorithm. Common error functions include natural residual functions (Iusem et al., 2017; Malitsky, 2019) and gap functions (Larsson & Patriksson, 1994; Solodov & Tseng, 2000; Chen et al., 2017). Our method to derive the last-iterate convergence of OGDA can also be viewed as an error bound method. Metric subregularity is another important concept for deriving linear convergence via some Lipschitz behavior of a set-valued operator (Leventhal, 2009; Liang et al., 2016; Alacaoglu et al., 2019; Latafat et al., 2019). Metric subregularity is closely related to error bound methods (Kruger, 2015). In fact, as we prove in Appendix F, one special case of our condition SP-MS (the one that allows us to show linear convergence) is equivalent to metric subregularity of an operator defined in terms of the normal cone of the feasible set and the gradient of the objective. This is also why we call our condition Saddle-Point Metric Subregularity. Although metric subregularity has been extensively used in the literature, to the best of our knowledge, our work is the first to use this condition to analyze OGDA.

OGDA and OMWU. Recently, last-iterate convergence for OGDA has been proven in various settings such as convex-concave problems (Daskalakis et al., 2018), unconstrained bilinear problems (Daskalakis & Panageas, 2018; Liang & Stokes, 2019), strongly-convex-strongly-concave problems (Mokhtari et al., 2020b), and others (e.g., (Mertikopoulos et al., 2019)).
However, the behavior of OGDA and OMWU for the constrained bilinear case, or even the special case of classic matrix games, appears to be much more mysterious and less understood. Cheung & Piliouras (2020) provide an alternative view on the convergence behavior of OMWU by studying volume contraction in the dual space. Daskalakis & Panageas (2019b) show last-iterate convergence of OMWU for matrix games under a uniqueness assumption and without a concrete rate. Although it is implicitly suggested in (Daskalakis & Panageas, 2019b;a) that a rate of $O(1/T^{1/9})$ is possible, it is still not clear how to choose the learning rate appropriately from their analysis. As mentioned, our results for OMWU significantly improve theirs, with a clean linear convergence rate using a constant learning rate under the same uniqueness assumption, while our results for OGDA further remove the uniqueness assumption.

3 NOTATIONS AND PRELIMINARIES

We consider the following constrained saddle-point problem: $\min_{x \in X} \max_{y \in Y} f(x, y)$, where $X$ and $Y$ are closed convex sets, and $f$ is a continuously differentiable function that is convex in $x$ for any fixed $y$ and concave in $y$ for any fixed $x$. By the celebrated minimax theorem (Neumann, 1928), we have $\min_{x \in X} \max_{y \in Y} f(x, y) = \max_{y \in Y} \min_{x \in X} f(x, y)$. The set of minimax optimal strategies is denoted by $X^* = \mathrm{argmin}_{x \in X} \max_{y \in Y} f(x, y)$, and the set of maximin optimal strategies is denoted by $Y^* = \mathrm{argmax}_{y \in Y} \min_{x \in X} f(x, y)$. It is well-known that $X^*$ and $Y^*$ are convex, and any pair $(x^*, y^*) \in X^* \times Y^*$ is a Nash equilibrium satisfying $f(x^*, y) \le f(x^*, y^*) \le f(x, y^*)$ for any $(x, y) \in X \times Y$. For notational convenience, we define $Z = X \times Y$ and similarly $Z^* = X^* \times Y^*$. For a point $z = (x, y) \in Z$, we further define $f(z) = f(x, y)$ and $F(z) = (\nabla_x f(x, y), -\nabla_y f(x, y))$. Our goal is to find a point $z \in Z$ that is close to the set of Nash equilibria $Z^*$, and we consider three ways of measuring the closeness.
The first one is the duality gap, defined as $\alpha_f(z) = \max_{y' \in Y} f(x, y') - \min_{x' \in X} f(x', y)$, which is always non-negative since $\max_{y' \in Y} f(x, y') \ge f(x, y) \ge \min_{x' \in X} f(x', y)$. The second one is the distance between $z$ and $Z^*$. Specifically, for any closed set $A$, we define the projection operator $\Pi_A$ as $\Pi_A(a) = \mathrm{argmin}_{a' \in A} \|a - a'\|$ (throughout this work $\|\cdot\|$ represents the L2 norm). The squared distance between $z$ and $Z^*$ is then defined as $\mathrm{dist}^2(z, Z^*) = \|z - \Pi_{Z^*}(z)\|^2$. The third one is only for the case when $X$ and $Y$ are probability simplices and $z^* = (x^*, y^*)$ is the unique equilibrium. In this case, we use the sum of Kullback-Leibler divergences $\mathrm{KL}(x^*, x) + \mathrm{KL}(y^*, y)$ to measure the closeness between $z = (x, y)$ and $z^*$, where $\mathrm{KL}(x, x') = \sum_i x_i \ln \frac{x_i}{x'_i}$. With a slight abuse of notation, we use $\mathrm{KL}(z, z')$ to denote $\mathrm{KL}(x, x') + \mathrm{KL}(y, y')$.

Other notations. We denote the $(d-1)$-dimensional probability simplex as $\Delta_d = \{u \in \mathbb{R}^d_+ : \sum_{i=1}^d u_i = 1\}$. For a convex function $\psi$, the corresponding Bregman divergence is defined as $D_\psi(u, v) = \psi(u) - \psi(v) - \langle \nabla\psi(v), u - v \rangle$. If $\psi$ is $\gamma$-strongly convex in a domain, then $D_\psi(u, v) \ge \frac{\gamma}{2}\|u - v\|^2$ for any $u, v$ in that domain. For $u \in \mathbb{R}^d$, we define $\mathrm{supp}(u) = \{i : u_i > 0\}$.

Optimistic Gradient Descent Ascent (OGDA). Starting from an arbitrary point $(\hat{x}_1, \hat{y}_1) = (x_0, y_0)$ from $Z$, OGDA with step size $\eta > 0$ iteratively computes the following for $t = 1, 2, \ldots$:

$$x_t = \Pi_X\big(\hat{x}_t - \eta \nabla_x f(x_{t-1}, y_{t-1})\big), \qquad \hat{x}_{t+1} = \Pi_X\big(\hat{x}_t - \eta \nabla_x f(x_t, y_t)\big),$$
$$y_t = \Pi_Y\big(\hat{y}_t + \eta \nabla_y f(x_{t-1}, y_{t-1})\big), \qquad \hat{y}_{t+1} = \Pi_Y\big(\hat{y}_t + \eta \nabla_y f(x_t, y_t)\big).$$

Note that there are several slightly different versions of the algorithm in the literature, which differ in the timing of performing the projection. Our version is the same as those in (Chiang et al., 2012; Rakhlin & Sridharan, 2013). It is also referred to as single-call extra-gradient in (Hsieh et al., 2019), but it does not belong to the class of extra-gradient methods discussed in (Tseng, 1995; Liang & Stokes, 2019; Golowich et al., 2020b) for example. Also note that OGDA only requires accessing $f$ via its gradient.
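As a concrete illustration, the OGDA update can be sketched in a few lines of numpy for a bilinear game $f(x, y) = x^\top G y$ over probability simplices. This is our own sketch, not the authors' code; `proj_simplex` is the standard sort-based Euclidean projection onto the simplex.

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based, O(d log d))."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def ogda_bilinear(G, eta=0.125, T=5000):
    """OGDA for f(x, y) = x^T G y over simplices; returns the last iterate (x_T, y_T)."""
    M, N = G.shape
    xh, yh = np.ones(M) / M, np.ones(N) / N   # hat iterates (x-hat_t, y-hat_t)
    gx, gy = G @ yh, G.T @ xh                 # gradients at (x_0, y_0)
    for _ in range(T):
        x = proj_simplex(xh - eta * gx)       # x_t uses the previous gradient
        y = proj_simplex(yh + eta * gy)
        gx, gy = G @ y, G.T @ x               # gradients at (x_t, y_t)
        xh = proj_simplex(xh - eta * gx)      # hat updates use the current gradient
        yh = proj_simplex(yh + eta * gy)
    return x, y
```

For instance, on the hypothetical 2x2 game G = [[1, -0.5], [-0.5, 0.5]], whose unique equilibrium is x* = y* = (0.4, 0.6), the last iterate approaches the equilibrium.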
In fact, only one gradient at the point $(x_t, y_t)$ is needed for iteration $t$. This aspect makes it especially suitable for a repeated game setting, where in each round, one player proposes $x_t$ while the other player proposes $y_t$. With only the gradient information from the environment ($\nabla_x f(x_t, y_t)$ for the first player and $\nabla_y f(x_t, y_t)$ for the other), both players can execute the algorithm.

Optimistic Multiplicative Weights Update (OMWU). When the feasible sets $X$ and $Y$ are probability simplices $\Delta_M$ and $\Delta_N$ for some integers $M$ and $N$, OMWU is another common iterative algorithm to solve the saddle-point problem. For simplicity, we assume that it starts from the uniform distributions $(\hat{x}_1, \hat{y}_1) = (x_0, y_0) = (\frac{1}{M}\mathbf{1}_M, \frac{1}{N}\mathbf{1}_N)$, where $\mathbf{1}_d$ is the all-one vector of dimension $d$. Then OMWU with step size $\eta > 0$ iteratively computes the following for $t = 1, 2, \ldots$:

$$x_{t,i} = \frac{\hat{x}_{t,i} \exp(-\eta (\nabla_x f(x_{t-1}, y_{t-1}))_i)}{\sum_j \hat{x}_{t,j} \exp(-\eta (\nabla_x f(x_{t-1}, y_{t-1}))_j)}, \qquad \hat{x}_{t+1,i} = \frac{\hat{x}_{t,i} \exp(-\eta (\nabla_x f(x_t, y_t))_i)}{\sum_j \hat{x}_{t,j} \exp(-\eta (\nabla_x f(x_t, y_t))_j)},$$
$$y_{t,i} = \frac{\hat{y}_{t,i} \exp(\eta (\nabla_y f(x_{t-1}, y_{t-1}))_i)}{\sum_j \hat{y}_{t,j} \exp(\eta (\nabla_y f(x_{t-1}, y_{t-1}))_j)}, \qquad \hat{y}_{t+1,i} = \frac{\hat{y}_{t,i} \exp(\eta (\nabla_y f(x_t, y_t))_i)}{\sum_j \hat{y}_{t,j} \exp(\eta (\nabla_y f(x_t, y_t))_j)}.$$

OMWU and OGDA as Optimistic Mirror Descent Ascent. OMWU and OGDA can be viewed as special cases of Optimistic Mirror Descent Ascent. Specifically, let the regularizer $\psi(u)$ denote the negative entropy $\sum_i u_i \ln u_i$ for the case of OMWU and (half of) the squared L2 norm $\frac{1}{2}\|u\|^2$ for the case of OGDA (so that $D_\psi(u, v)$ is $\mathrm{KL}(u, v)$ and $\frac{1}{2}\|u - v\|^2$ respectively). Then using the shorthands $z_t = (x_t, y_t)$ and $\hat{z}_t = (\hat{x}_t, \hat{y}_t)$ and recalling the notation defined earlier, $Z = X \times Y$ and $F(z) = (\nabla_x f(x, y), -\nabla_y f(x, y))$, one can rewrite OMWU/OGDA compactly as

$$z_t = \mathrm{argmin}_{z \in Z} \Big\{ \eta \langle z, F(z_{t-1}) \rangle + D_\psi(z, \hat{z}_t) \Big\}, \qquad (1)$$
$$\hat{z}_{t+1} = \mathrm{argmin}_{z \in Z} \Big\{ \eta \langle z, F(z_t) \rangle + D_\psi(z, \hat{z}_t) \Big\}. \qquad (2)$$
By the standard regret analysis of Optimistic Mirror Descent, we have the following important lemma, which is readily applied to OMWU and OGDA when $\psi$ is instantiated as the corresponding regularizer. The proof is mostly standard (see e.g., (Rakhlin & Sridharan, 2013, Lemma 1)). For completeness, we include it in Appendix B.

Lemma 1. Consider update rules Eq. (1) and Eq. (2) and define $\mathrm{dist}^2_p(z, z') = \|x - x'\|_p^2 + \|y - y'\|_p^2$. Suppose that $\psi$ satisfies $D_\psi(z, z') \ge \frac{1}{2}\mathrm{dist}^2_p(z, z')$ for some $p \ge 1$, and $F$ satisfies $\mathrm{dist}^2_q(F(z), F(z')) \le L^2\,\mathrm{dist}^2_p(z, z')$ for $q \ge 1$ with $\frac{1}{p} + \frac{1}{q} = 1$. Also, assume that $\eta \le \frac{1}{8L}$. Then for any $z \in Z$ and any $t \ge 1$, we have

$$\eta F(z_t)^\top(z_t - z) \le D_\psi(z, \hat{z}_t) - D_\psi(z, \hat{z}_{t+1}) - D_\psi(\hat{z}_{t+1}, z_t) - \tfrac{15}{16}D_\psi(z_t, \hat{z}_t) + \tfrac{1}{16}D_\psi(\hat{z}_t, z_{t-1}).$$

4 CONVERGENCE RESULTS FOR OMWU

In this section, we show that for a two-player zero-sum matrix game with a unique equilibrium, OMWU with a constant learning rate converges to the equilibrium exponentially fast. The assumption and the algorithm are the same as those considered in (Daskalakis & Panageas, 2019b), but our analysis improves theirs in two ways. First, we do not require the learning rate to be exponentially smaller than some problem-dependent quantity. Second, we explicitly provide a linear convergence rate. In Section 5, we further remove the uniqueness assumption and significantly generalize the results by studying OGDA. In a matrix game we have $X = \Delta_M$, $Y = \Delta_N$, and $f(z) = x^\top G y$ for some matrix $G \in [-1, 1]^{M \times N}$. To show the last-iterate convergence of OMWU, we first apply Lemma 1 with $D_\psi(u, v) = \mathrm{KL}(u, v)$, $z = z^*$ (the unique equilibrium of the game matrix $G$), and $(p, q) = (1, \infty)$. The constant $L$ can be chosen as $1$ since

$$\mathrm{dist}^2_\infty(F(z), F(z')) = \max_i |(G(y - y'))_i|^2 + \max_j |(G^\top(x - x'))_j|^2 \le \|y - y'\|_1^2 + \|x - x'\|_1^2 = \mathrm{dist}^2_1(z, z').$$

Also notice that $F(z_t)^\top(z_t - z^*) \ge f(x_t, y_t) - f(x^*, y_t) + f(x_t, y^*) - f(x_t, y_t) = f(x_t, y^*) - f(x^*, y_t) \ge 0$ by the optimality of $z^*$.
Therefore, we have when $\eta \le \frac{1}{8}$,

$$\mathrm{KL}(z^*, \hat{z}_{t+1}) \le \mathrm{KL}(z^*, \hat{z}_t) - \mathrm{KL}(\hat{z}_{t+1}, z_t) - \tfrac{15}{16}\mathrm{KL}(z_t, \hat{z}_t) + \tfrac{1}{16}\mathrm{KL}(\hat{z}_t, z_{t-1}).$$

Defining $\Theta_t = \mathrm{KL}(z^*, \hat{z}_t) + \frac{1}{16}\mathrm{KL}(\hat{z}_t, z_{t-1})$ and $\zeta_t = \mathrm{KL}(\hat{z}_{t+1}, z_t) + \mathrm{KL}(z_t, \hat{z}_t)$, we rewrite the above as

$$\Theta_{t+1} \le \Theta_t - \tfrac{15}{16}\zeta_t. \qquad (3)$$

From Eq. (3) it is clear that the quantity $\Theta_t$ is always non-increasing in $t$ due to the non-negativity of $\zeta_t$. Furthermore, the more the algorithm moves between round $t$ and round $t+1$ (that is, the larger $\zeta_t$ is), the more $\Theta_t$ decreases. To establish the rate of convergence, a natural idea is to relate $\zeta_t$ back to $\Theta_t$ or $\Theta_{t+1}$. For example, if we can show $\zeta_t \ge c\Theta_{t+1}$ for some constant $c > 0$, then Eq. (3) implies $\Theta_{t+1} \le \Theta_t - \frac{15c}{16}\Theta_{t+1}$, which further gives $\Theta_{t+1} \le (1 + \frac{15c}{16})^{-1}\Theta_t$. This immediately implies a linear convergence rate for $\Theta_t$, as well as for $\mathrm{KL}(z^*, \hat{z}_t)$ since $\mathrm{KL}(z^*, \hat{z}_t) \le \Theta_t$. Moreover, notice that to find such a $c$, it suffices to find a $c' > 0$ such that $\zeta_t \ge c'\,\mathrm{KL}(z^*, \hat{z}_{t+1})$. This is because it will then give $\zeta_t \ge \frac{1}{16}\mathrm{KL}(\hat{z}_{t+1}, z_t) + \frac{15}{16}\zeta_t \ge \frac{1}{16}\mathrm{KL}(\hat{z}_{t+1}, z_t) + \frac{15c'}{16}\mathrm{KL}(z^*, \hat{z}_{t+1}) \ge \min\{1, \frac{15c'}{16}\}\Theta_{t+1}$, and thus $c = \min\{1, \frac{15c'}{16}\}$ satisfies the condition.

From the discussion above, we see that to establish the linear convergence of $\mathrm{KL}(z^*, \hat{z}_t)$, we only need to show that there exists some $c' > 0$ such that $\mathrm{KL}(\hat{z}_{t+1}, z_t) + \mathrm{KL}(z_t, \hat{z}_t) \ge c'\,\mathrm{KL}(z^*, \hat{z}_{t+1})$. The high-level interpretation of this inequality is that when $\hat{z}_{t+1}$ is far from the equilibrium $z^*$ (i.e., $\mathrm{KL}(z^*, \hat{z}_{t+1})$ is large), the algorithm should make a large move between round $t$ and $t+1$, making $\mathrm{KL}(\hat{z}_{t+1}, z_t) + \mathrm{KL}(z_t, \hat{z}_t)$ large. In our analysis, we use a two-stage argument to find such a $c'$. In the first stage, we only show that $\mathrm{KL}(\hat{z}_{t+1}, z_t) + \mathrm{KL}(z_t, \hat{z}_t) \ge c''\,\mathrm{KL}(z^*, \hat{z}_{t+1})^2$ for some $c'' > 0$, and use it to argue a slower convergence rate $\mathrm{KL}(z^*, \hat{z}_t) = O(\frac{1}{t})$. Then in the second stage, we show that after $\hat{z}_t$ and $z_t$ become close enough to $z^*$, we have $\mathrm{KL}(\hat{z}_{t+1}, z_t) + \mathrm{KL}(z_t, \hat{z}_t) \ge c'\,\mathrm{KL}(z^*, \hat{z}_{t+1})$ for some $c' > 0$.
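The step relating $c'$ back to $c$ above is elementary algebra; as a quick sanity check (ours, not part of the paper), one can verify numerically that whenever $\zeta_t \ge c'\,\mathrm{KL}(z^*, \hat{z}_{t+1})$, it follows that $\zeta_t \ge \min\{1, \frac{15c'}{16}\}\Theta_{t+1}$, for arbitrary non-negative values of the three KL terms:

```python
import numpy as np

rng = np.random.default_rng(0)
c_prime = 0.3                               # an arbitrary choice of c'
checked, violations = 0, 0
for _ in range(10000):
    kl_star = rng.uniform(0.0, 5.0)         # plays the role of KL(z*, zhat_{t+1})
    kl_a = rng.uniform(0.0, 5.0)            # plays the role of KL(zhat_{t+1}, z_t)
    kl_b = rng.uniform(0.0, 5.0)            # plays the role of KL(z_t, zhat_t)
    zeta = kl_a + kl_b                      # zeta_t
    if zeta >= c_prime * kl_star:           # premise: zeta_t >= c' KL(z*, zhat_{t+1})
        theta_next = kl_star + kl_a / 16.0  # Theta_{t+1}
        checked += 1
        if zeta + 1e-12 < min(1.0, 15.0 * c_prime / 16.0) * theta_next:
            violations += 1
assert checked > 0 and violations == 0
```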
This kind of two-stage argument might be reminiscent of that used by Daskalakis & Panageas (2019b); however, the techniques we use are very different. Specifically, Daskalakis & Panageas (2019b) utilize tools of spectral analysis similar to (Liang & Stokes, 2019) and show that the OMWU update can be viewed as a contraction mapping with respect to a matrix whose eigenvalue is smaller than 1. Our analysis, on the other hand, leverages the analysis of online mirror descent, starting from the one-step regret bound (Lemma 1) and making use of the two negative terms that are typically dropped in the analysis. Importantly, our analysis does not need the exponentially small learning rate required by (Daskalakis & Panageas, 2019b). Thus, unlike their results, our learning rate is kept as a universal constant in all stages. The arguments above are formalized below:

Lemma 2. Consider a matrix game $f(x, y) = x^\top G y$ with $X = \Delta_M$, $Y = \Delta_N$, and $G \in [-1, 1]^{M \times N}$. Assume that there exists a unique Nash equilibrium $z^*$ and $\eta \le \frac{1}{8}$. Then, there exists a constant $C_1 > 0$ that depends on $G$ such that for any $t \ge 1$, OMWU ensures

$$\mathrm{KL}(\hat{z}_{t+1}, z_t) + \mathrm{KL}(z_t, \hat{z}_t) \ge \eta^2 C_1 \mathrm{KL}(z^*, \hat{z}_{t+1})^2.$$

Also, there is a constant $\xi > 0$ that depends on $G$ (defined in Definition 2) such that as long as $\max\{\|z^* - \hat{z}_t\|_1, \|z^* - z_t\|_1\} \le \eta\xi$,

$$\mathrm{KL}(\hat{z}_{t+1}, z_t) + \mathrm{KL}(z_t, \hat{z}_t) \ge \eta^2 C_2 \mathrm{KL}(z^*, \hat{z}_{t+1})$$

for another constant $C_2 > 0$ that depends on $G$.

With Lemma 2 and the earlier discussion, the last-iterate convergence rate of OMWU is established:

Theorem 3. For a matrix game $f(x, y) = x^\top G y$ with a unique Nash equilibrium $z^*$, OMWU with a learning rate $\eta \le \frac{1}{8}$ guarantees $\mathrm{KL}(z^*, z_t) \le C_3(1 + C_4)^{-t}$, where $C_3, C_4 > 0$ are some constants depending on the game matrix $G$.

Proofs for this section are deferred to Appendix D, where all problem-dependent constants are specified as well.1 To the best of our knowledge, Theorem 3 gives the first last-iterate convergence result for OMWU with a concrete linear rate.
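To illustrate Theorem 3, here is a minimal numpy sketch of OMWU (ours, not the authors' code) run on a small matrix game with a constant learning rate; the last iterate approaches the unique equilibrium. The game G = [[1, -0.5], [-0.5, 0.5]] is our own example, with unique fully-mixed equilibrium x* = y* = (0.4, 0.6).

```python
import numpy as np

def omwu(G, eta=0.1, T=20000):
    """OMWU for f(x, y) = x^T G y over simplices, from uniform initialization."""
    M, N = G.shape
    xh, yh = np.ones(M) / M, np.ones(N) / N       # hat iterates
    gx, gy = G @ yh, G.T @ xh                     # gradients at (x_0, y_0)
    for _ in range(T):
        x = xh * np.exp(-eta * gx); x /= x.sum()  # x_t: step with the previous gradient
        y = yh * np.exp(eta * gy);  y /= y.sum()
        gx, gy = G @ y, G.T @ x                   # gradients at (x_t, y_t)
        xh = xh * np.exp(-eta * gx); xh /= xh.sum()
        yh = yh * np.exp(eta * gy);  yh /= yh.sum()
    return x, y

# Unique fully-mixed equilibrium of this game: x* = y* = (0.4, 0.6)
x, y = omwu(np.array([[1.0, -0.5], [-0.5, 0.5]]))
```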
We note that the uniqueness assumption is critical for our analysis, and whether it is indeed necessary for OMWU is left as an important future direction.

5 CONVERGENCE RESULTS FOR OGDA

In this section, we provide last-iterate convergence results for OGDA, which are much more general than those in Section 4. We propose a general condition subsuming many well-studied cases, under which OGDA enjoys a concrete last-iterate convergence guarantee in terms of the L2 distance between $z_t$ and $Z^*$. The results in this part can be specialized to the setting of bilinear games over the simplex, but the unique equilibrium assumption made in Section 4 and in (Daskalakis & Panageas, 2019b) is no longer needed. Throughout the section we make the assumption that $f$ is $L$-smooth:

Assumption 1. For any $z, z' \in Z$, $\|F(z) - F(z')\| \le L\|z - z'\|$ holds.2

To introduce our general condition, we first provide some intuition by applying Lemma 1 again. Letting $\psi(u) = \frac{1}{2}\|u\|^2$ in Lemma 1, we get that for OGDA, for any $z \in Z$ and any $t \ge 1$,

$$2\eta F(z_t)^\top(z_t - z) \le \|\hat{z}_t - z\|^2 - \|\hat{z}_{t+1} - z\|^2 - \|\hat{z}_{t+1} - z_t\|^2 - \tfrac{15}{16}\|z_t - \hat{z}_t\|^2 + \tfrac{1}{16}\|\hat{z}_t - z_{t-1}\|^2.$$

Now we instantiate the inequality above with $z = \Pi_{Z^*}(\hat{z}_t) \in Z^*$. Since $z = \Pi_{Z^*}(\hat{z}_t)$ is an equilibrium, we have $F(z_t)^\top(z_t - z) \ge f(x_t, y_t) - f(x, y_t) + f(x_t, y) - f(x_t, y_t) = f(x_t, y) - f(x, y_t) \ge 0$ by the convexity/concavity of $f$ and the optimality of $z$, and thus

$$\|\hat{z}_{t+1} - \Pi_{Z^*}(\hat{z}_t)\|^2 \le \|\hat{z}_t - \Pi_{Z^*}(\hat{z}_t)\|^2 - \|\hat{z}_{t+1} - z_t\|^2 - \tfrac{15}{16}\|z_t - \hat{z}_t\|^2 + \tfrac{1}{16}\|\hat{z}_t - z_{t-1}\|^2.$$

1One might find that the constant $C_3$ is exponential in some problem-dependent quantity $T_0$. However, this is simply a loose bound in exchange for a more concise presentation: our proof in fact shows that when $t < T_0$, the convergence is of a slower $1/t$ rate, and when $t \ge T_0$, the convergence is linear without this large constant.

2This is equivalent to the condition $\mathrm{dist}^2_q(F(z), F(z')) \le L^2\mathrm{dist}^2_p(z, z')$ in Lemma 1 with $p = q = 2$, hence the same notation $L$.
Further noting that the left-hand side is lower bounded by $\mathrm{dist}^2(\hat{z}_{t+1}, Z^*)$ by definition, we arrive at

$$\mathrm{dist}^2(\hat{z}_{t+1}, Z^*) \le \mathrm{dist}^2(\hat{z}_t, Z^*) - \|\hat{z}_{t+1} - z_t\|^2 - \tfrac{15}{16}\|z_t - \hat{z}_t\|^2 + \tfrac{1}{16}\|\hat{z}_t - z_{t-1}\|^2.$$

Similarly, we define $\Theta_t = \|\hat{z}_t - \Pi_{Z^*}(\hat{z}_t)\|^2 + \frac{1}{16}\|\hat{z}_t - z_{t-1}\|^2$ and $\zeta_t = \|\hat{z}_{t+1} - z_t\|^2 + \|z_t - \hat{z}_t\|^2$, and rewrite the above as

$$\Theta_{t+1} \le \Theta_t - \tfrac{15}{16}\zeta_t. \qquad (4)$$

As in Section 4, our goal now is to lower bound $\zeta_t$ by some quantity related to $\mathrm{dist}^2(\hat{z}_{t+1}, Z^*)$, and then use Eq. (4) to obtain a convergence rate for $\Theta_t$. In order to incorporate more general objective functions into the discussion, in the following Lemma 4, we provide an intermediate lower bound for $\zeta_t$, which will be further related to $\mathrm{dist}^2(\hat{z}_{t+1}, Z^*)$ later.

Lemma 4. For any $t \ge 1$ and $z' \in Z$ with $z' \ne \hat{z}_{t+1}$, OGDA with $\eta \le \frac{1}{8L}$ ensures

$$\|\hat{z}_{t+1} - z_t\|^2 + \|z_t - \hat{z}_t\|^2 \ge \frac{32\eta^2}{81}\left(\frac{[F(\hat{z}_{t+1})^\top(\hat{z}_{t+1} - z')]_+}{\|\hat{z}_{t+1} - z'\|}\right)^2, \qquad (5)$$

where $[a]_+ \triangleq \max\{a, 0\}$, and similarly, for $z' \ne z_{t+1}$,

$$\|\hat{z}_{t+1} - z_{t+1}\|^2 + \|z_t - \hat{z}_{t+1}\|^2 \ge \frac{32\eta^2}{81}\left(\frac{[F(z_{t+1})^\top(z_{t+1} - z')]_+}{\|z_{t+1} - z'\|}\right)^2. \qquad (6)$$

We note that a direct consequence of Lemma 4 is an average duality gap guarantee for OGDA when $Z$ is bounded:

$$\frac{1}{T}\sum_{t=1}^T \alpha_f(z_t) = \frac{1}{T}\sum_{t=1}^T \max_{x' \in X,\, y' \in Y}\big(f(x_t, y') - f(x', y_t)\big) = O\Big(\frac{D^2}{\eta\sqrt{T}}\Big), \qquad (7)$$

where $D \ge \sup_{z, z' \in Z}\|z - z'\|$ is the diameter of $Z$ (the duality gap may be undefined when $Z$ is unbounded). We are not aware of any previous work that gives this result for the constrained case. See Appendix E for the proof of Eq. (7) and comparisons with previous works.

However, to obtain last-iterate convergence results, we need to make sure that the right-hand side of Eq. (5) is large enough. Motivated by this fact, we propose the following general condition on $f$ and $Z$ to achieve so.

Definition 1 (Saddle-Point Metric Subregularity (SP-MS)). The SP-MS condition is defined as: for any $z \in Z \setminus Z^*$ with $z^* = \Pi_{Z^*}(z)$,

$$\sup_{z' \in Z} \frac{F(z)^\top(z - z')}{\|z - z'\|} \ge C\|z - z^*\|^{\beta+1} \qquad \text{(SP-MS)}$$

holds for some parameter $\beta \ge 0$ and $C > 0$.
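As a sanity check of the definition (our own example, not from the paper), consider the 2x2 "matching pennies" game G = [[1, -1], [-1, 1]] over the simplex, whose unique equilibrium $z^*$ is uniform. For bilinear games one has $F(z)^\top(z - z') = f(x, y') - f(x', y)$, so lower-bounding the supremum over $z'$ by the four vertex pairs already shows SP-MS with $\beta = 0$ and $C = 1/\sqrt{2}$, which the following snippet verifies at random points:

```python
import numpy as np

G = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies; z* is uniform
z_star = np.array([0.5, 0.5, 0.5, 0.5])
verts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.dirichlet([1, 1]), rng.dirichlet([1, 1])
    z = np.concatenate([x, y])
    d = np.linalg.norm(z - z_star)         # ||z - z*||
    F = np.concatenate([G @ y, -G.T @ x])  # F(z) for the bilinear objective
    best = 0.0
    for vx in verts:
        for vy in verts:                   # lower-bound the sup over z' via vertex pairs
            zp = np.concatenate([vx, vy])
            n = np.linalg.norm(z - zp)
            if n > 1e-12:
                best = max(best, F @ (z - zp) / n)
    assert best + 1e-9 >= d / np.sqrt(2)   # SP-MS with beta = 0, C = 1/sqrt(2)
```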
We call this condition Saddle-Point Metric Subregularity because the case with $\beta = 0$ is equivalent to one type of metric subregularity in variational inequality problems, as we prove in Appendix F. The condition is also closely related to other error bound conditions that have been identified for variational inequality problems (e.g., Tseng (1995); Gilpin et al. (2008); Malitsky (2019)). Although these works have shown that under similar conditions their algorithms exhibit linear convergence, to the best of our knowledge, there is no previous work that analyzes OGDA or other no-regret learning algorithms using such conditions.

SP-MS covers many standard settings studied in the literature. The first and perhaps the most important example is bilinear games with a polytope feasible set, which in particular includes the classic two-player matrix games considered in Section 4.

Theorem 5. A bilinear game $f(x, y) = x^\top G y$ with $X \subseteq \mathbb{R}^M$ and $Y \subseteq \mathbb{R}^N$ being polytopes and $G \in \mathbb{R}^{M \times N}$ satisfies SP-MS with $\beta = 0$.

We emphasize again that, different from Lemma 2, Theorem 5 does not require a unique equilibrium. Note that we have not provided the concrete form of the parameter $C$ in the theorem (which depends on $X$, $Y$, and $G$), but it can be found in the proof (see Appendix G).3 The next example shows that strongly-convex-strongly-concave problems are also special cases of our condition.

Theorem 6. If $f$ is strongly convex in $x$ and strongly concave in $y$, then SP-MS holds with $\beta = 0$.

Next, we provide a toy example where SP-MS holds with $\beta > 0$.

Theorem 7. Let $X = Y = \{(a, b) : 0 \le a, b \le 1, a + b = 1\}$, let $n > 2$ be an integer, and let $f(x, y) = x_1^{2n} - x_1 y_1 - y_1^{2n}$. Then SP-MS holds with $\beta = 2n - 2$.

With this general condition, we are now able to complete the loop. For any value of $\beta$, we show the following last-iterate convergence guarantee for OGDA.

Theorem 8.
For any $\eta \le \frac{1}{8L}$, if SP-MS holds with $\beta = 0$, then OGDA guarantees linear last-iterate convergence:

$$\mathrm{dist}^2(z_t, Z^*) \le 64\,\mathrm{dist}^2(\hat{z}_1, Z^*)(1 + C_5)^{-t}; \qquad (8)$$

on the other hand, if the condition holds with $\beta > 0$, then we have a slower convergence:

$$\mathrm{dist}^2(z_t, Z^*) = O\big(t^{-1/\beta}\big), \qquad (9)$$

where $C_5$ is a constant of the form $\min\{16\eta^2 C^2, \ldots\}$ (the exact constants in Eq. (8) and Eq. (9), which depend on $\mathrm{dist}^2(\hat{z}_1, Z^*)$, $\eta$, $C$, and $\beta$, are specified in Appendix I).

We defer the proof to Appendix I and make several remarks. First, note that based on a convergence result on $\mathrm{dist}^2(z_t, Z^*)$, one can immediately obtain a convergence guarantee for the duality gap $\alpha_f(z_t)$ as long as $f$ is also Lipschitz. This is because

$$\alpha_f(z_t) \le \max_{x', y'}\big(f(x_t, y') - f(x^*, y') + f(x', y^*) - f(x', y_t)\big) \le O(\|x_t - x^*\| + \|y_t - y^*\|) = O\Big(\sqrt{\mathrm{dist}^2(z_t, Z^*)}\Big),$$

where $(x^*, y^*) = \Pi_{Z^*}(z_t)$. While this leads to stronger guarantees compared to Eq. (7), we emphasize that the latter holds even without the SP-MS condition. Second, our results significantly generalize (Hsieh et al., 2019, Theorem 2), which itself is a consolidated version of several earlier works and also shows a linear convergence rate of OGDA under a condition stronger than our SP-MS with $\beta = 0$, as discussed earlier. More specifically, our results show that linear convergence holds for a much broader set of problems. Furthermore, we also show slower sublinear convergence rates for any value of $\beta > 0$, which is also new as far as we know. In particular, we empirically verify that OGDA indeed does not converge exponentially fast for the toy example defined in Theorem 7 (see Appendix A). Last but not least, the most significant implication of Theorem 8 is that it provides by far the most general linear convergence result for OGDA for the classic two-player matrix games, or more generally bilinear games with polytope constraints, according to Theorem 5 and Eq. (8).
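The first remark above, that the duality gap is controlled by the distance to $Z^*$ when $f$ is Lipschitz, is easy to check numerically. For the matching pennies game (our own example, not from the paper) one can verify $\alpha_f(z) \le 2\,\mathrm{dist}(z, Z^*)$:

```python
import numpy as np

G = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies; Z* = {uniform}
z_star = np.array([0.5, 0.5, 0.5, 0.5])
rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.dirichlet([1, 1]), rng.dirichlet([1, 1])
    # alpha_f(z) = max_{y'} f(x, y') - min_{x'} f(x', y); attained at vertices
    gap = np.max(G.T @ x) - np.min(G @ y)
    dist = np.linalg.norm(np.concatenate([x, y]) - z_star)
    assert gap <= 2.0 * dist + 1e-9        # alpha_f(z) = O(dist(z, Z*))
```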
Compared to the recent works of (Daskalakis & Panageas, 2018; 2019b) for matrix games (on OGDA or OMWU), our result is considerably stronger: 1) we do not require a unique equilibrium while they do; 2) linear convergence holds for any initial point $\hat{z}_1$, while their result only holds if the initial points are in a small neighborhood of the unique equilibrium (otherwise the convergence is sublinear initially); 3) our only requirement on the step size is $\eta \le \frac{1}{8L}$,4 while they require an exponentially small $\eta$, which does not reflect the behavior of the algorithms in practice. Even compared with our result in Section 4, we see that for OGDA, the unique equilibrium assumption is not required, and we do not have an initial phase of sublinear convergence as in Lemma 2. In Appendix A, we empirically show that OGDA often outperforms OMWU when both are tuned with a constant learning rate.

3After the first version of this paper, we found that (Gilpin et al., 2008, Lemma 3) gives a simpler proof for our Theorem 5. Although their lemma only focuses on the case where the feasible sets are probability simplices, it can be directly extended to the case of polytopes.

4In fact, any $\eta < \frac{1}{2L}$ is enough to achieve a linear convergence rate for OGDA, as one can verify by going over our proof. We use $\eta \le \frac{1}{8L}$ simply for consistency with the results for OMWU (where $\eta$ cannot be set any larger due to technical reasons).

Figure 1: Experiments of OGDA and OMWU with different learning rates for a matrix game $f(x, y) = x^\top G y$. The curves shown are OMWU with $\eta \in \{0.1, 0.2, 0.5, 1, 5, 10\}$ and OGDA with $\eta \in \{0.125, 0.25, 0.5\}$; OGDA/OMWU-eta=$\eta$ represents the curve of OGDA/OMWU with learning rate $\eta$, and the configuration order in the legend is consistent with the order of the curves. For OMWU, $\eta \ge 11$ makes the algorithm diverge.
The plot confirms the linear convergence of OMWU and OGDA, although OGDA is generally observed to converge faster than OMWU. One may wonder what happens if a bilinear game has a non-polytope constraint. It turns out that in this case SP-MS may only hold with $\beta > 0$, as the following example shows; moreover, linear convergence provably fails for OGDA when the feasible set has a curved boundary.

Theorem 9. There exists a bilinear game with a non-polytope feasible set such that SP-MS holds with $\beta = 3$, and $\mathrm{dist}^2(z_t, Z^\star) = \Omega(1/t^2)$ holds for OGDA.

This example indicates that the shape of the feasible set plays an important role in last-iterate convergence, which may be an interesting future direction to investigate. This is also verified empirically in our experiments (see Appendix A).

6 EXPERIMENTS FOR MATRIX GAMES

In this section, we provide empirical results on the performance of OGDA and OMWU for matrix games over probability simplices.⁵ We include more empirical results in other settings in Appendix A. We set the size of the game matrix to $32 \times 32$, generate a random matrix with each entry $G_{ij}$ drawn uniformly at random from $[-1, 1]$, and finally rescale its operator norm to 1. With probability 1, such a game has a unique Nash equilibrium (Daskalakis & Panageas, 2019b). We compare the performance of OGDA and OMWU under a series of different learning rates, as shown in Figure 1. The x-axis represents the time step $t$, and the y-axis represents $\ln(\mathrm{KL}(z^\star, z_t))$ (we observe similar results using $\mathrm{dist}^2(z^\star, z_t)$ or the duality gap as the measure; see Appendix A.1). Note that here we approximate $z^\star$ by running OGDA for many more iterations and taking the very last iterate. We also verify that the iterates of OMWU converge to the same point as OGDA. From Figure 1, we see that all curves eventually become straight lines, supporting our linear convergence results.
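For completeness, the OMWU update compared in these experiments can be sketched as follows; this is our own minimal illustration on a $2 \times 2$ game, with learning rate and horizon chosen for the demo rather than the paper's $32 \times 32$ instance.

```python
import numpy as np

def mwu_step(p, g, eta):
    """One multiplicative-weights step from distribution p against loss vector g."""
    w = p * np.exp(-eta * g)
    return w / w.sum()

G = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies; x* = y* = (1/2, 1/2)
eta = 0.1
xh, yh = np.array([0.7, 0.3]), np.array([0.4, 0.6])   # hat{z}_1
gx, gy = G @ yh, -G.T @ xh                 # F(z_0), with z_0 = hat{z}_1

for _ in range(10000):
    # z_t: play using the *previous* gradient (the optimistic prediction)
    x, y = mwu_step(xh, gx, eta), mwu_step(yh, gy, eta)
    gx, gy = G @ y, -G.T @ x               # F(z_t)
    # hat{z}_{t+1}: secondary update using the current gradient
    xh, yh = mwu_step(xh, gx, eta), mwu_step(yh, gy, eta)
```

Since this game has a unique interior equilibrium, the last iterate of OMWU approaches $(1/2, 1/2)$ for both players, matching the linear-convergence behavior observed in Figure 1.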
Generally, the straight line is steeper, meaning faster convergence, for a larger learning rate $\eta$. However, the algorithm diverges once $\eta$ exceeds some threshold (such as 11 in the case of OMWU). Comparing OMWU and OGDA, we see that OGDA converges faster, which is also consistent with our theory if one compares the bounds in Theorem 3 and Theorem 8 (with the values of the constants revealed in the proofs). We find this observation interesting, since OMWU is usually considered more favorable for problems defined over the simplex, especially in terms of regret minimization. Our experiments suggest, however, that in terms of last-iterate convergence, OGDA might perform even better than OMWU.

⁵Note that in this case the projection step of OGDA can be implemented efficiently in $O(M \ln M + N \ln N)$ time (Wang & Carreira-Perpiñán, 2013).

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for providing highly constructive comments that brought about significant improvements to the results during the rebuttal phase. CL would like to thank Yu-Guan Hsieh for many helpful discussions on error bounds and metric subregularity. The authors are supported by NSF Awards IIS-1755781 and IIS-1943607.

REFERENCES

Ahmet Alacaoglu, Olivier Fercoq, and Volkan Cevher. On the convergence of stochastic primal-dual hybrid gradient. arXiv preprint arXiv:1911.00799, 2019.

Waïss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games. In International Conference on Artificial Intelligence and Statistics, 2020.

James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM Conference on Economics and Computation, 2018.

Michael Bowling. Convergence and no-regret in multiagent learning. In Advances in Neural Information Processing Systems, pp. 209–216, 2005.
Yunmei Chen, Guanghui Lan, and Yuyuan Ouyang. Accelerated schemes for a class of variational inequalities. Mathematical Programming, 2017.

Yun Kuen Cheung and Georgios Piliouras. Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games. In Conference on Learning Theory, pp. 807–834, 2019.

Yun Kuen Cheung and Georgios Piliouras. Chaos, extremism and optimism: Volume analysis of learning in games. Advances in Neural Information Processing Systems, 2020.

Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Conference on Learning Theory, 2012.

Shisheng Cui and Uday V. Shanbhag. On the analysis of reflected gradient and splitting methods for monotone stochastic variational inequality problems. In 2016 IEEE 55th Conference on Decision and Control (CDC), 2016.

Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pp. 9236–9246, 2018.

Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. Smooth Games Optimization and Machine Learning Workshop (NeurIPS 2019), 2019a.

Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. Innovations in Theoretical Computer Science, 2019b.

Constantinos Daskalakis, Alan Deckelbaum, Anthony Kim, et al. Near-optimal no-regret algorithms for zero-sum games. Games and Economic Behavior, 2015.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018.

Damek Davis. Lecture 5, Mathematical Programming I, 2016a. Available at people.orie.cornell.edu/dsd95/teaching/orie6300/lec05.pdf.

Damek Davis.
Lecture 6, Mathematical Programming I, 2016b. Available at people.orie.cornell.edu/dsd95/teaching/orie6300/lec06.pdf.

Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1–2):79–103, 1999.

Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. International Conference on Learning Representations, 2019.

Andrew Gilpin, Javier Peña, and Tuomas Sandholm. First-order algorithm with O(ln(1/ε)) convergence for ε-equilibrium in two-person zero-sum games. In AAAI, 2008.

Noah Golowich, Sarath Pattathil, and Constantinos Daskalakis. Tight last-iterate convergence rates for no-regret learning in multi-player games. Advances in Neural Information Processing Systems, 2020a.

Noah Golowich, Sarath Pattathil, Constantinos Daskalakis, and Asuman Ozdaglar. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. Conference on Learning Theory, 2020b.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Patrick T. Harker and Jong-Shi Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical Programming, 1990.

Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. On the convergence of single-call stochastic extra-gradient methods. In Advances in Neural Information Processing Systems, 2019.

Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. Advances in Neural Information Processing Systems, 2020.
Alfredo N. Iusem, Alejandro Jofré, Roberto Imbuzeiro Oliveira, and Philip Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 2017.

G. M. Korpelevich. The extragradient method for finding saddle points and other problems. 1976.

Alexander Y. Kruger. Error bounds and metric subregularity. Optimization, 2015.

Torbjörn Larsson and Michael Patriksson. A class of gap functions for variational inequalities. Mathematical Programming, 1994.

Puya Latafat, Nikolaos M. Freris, and Panagiotis Patrinos. A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization. IEEE Transactions on Automatic Control, 2019.

Qi Lei, Sai Ganesh Nagarajan, Ioannis Panageas, and Xiao Wang. Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes. The 24th International Conference on Artificial Intelligence and Statistics, 2021.

D. Leventhal. Metric subregularity and the proximal point method. Journal of Mathematical Analysis and Applications, 2009.

Jingwei Liang, Jalal Fadili, and Gabriel Peyré. Convergence rates with inexact non-expansive operators. Mathematical Programming, 2016.

Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, 2019.

Tianyi Lin, Chi Jin, Michael Jordan, et al. Near-optimal algorithms for minimax optimization. Conference on Learning Theory, 2020.

Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 1993.

Yu Malitsky. Projected reflected gradient methods for monotone variational inequalities. SIAM Journal on Optimization, 2015.

Yura Malitsky. Golden ratio algorithms for variational inequalities. Mathematical Programming, 2019.
Yura Malitsky and Matthew K. Tam. A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM Journal on Optimization.

Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 2018.

Panayotis Mertikopoulos, Houssam Zenati, Bruno Lecouat, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In International Conference on Learning Representations, 2019.

Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. Convergence rate of O(1/k) for optimistic gradient and extra-gradient methods in smooth convex-concave saddle point problems. SIAM Journal on Optimization, 30(4):3230–3251, 2020a.

Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. The 22nd International Conference on Artificial Intelligence and Statistics, 2020b.

John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 1928.

Jong-Shi Pang. Error bounds in mathematical programming. Mathematical Programming, 1997.

Leonid Denisovich Popov. A modification of the Arrow-Hurwicz method for search of saddle points. Mathematical Notes of the Academy of Sciences of the USSR, 1980.

Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pp. 3066–3074, 2013.

Michael V. Solodov and Paul Tseng. Some methods based on the D-gap function for solving monotone variational inequalities. Computational Optimization and Applications, 2000.

Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, 2015.

Paul Tseng.
On linear convergence of iterative methods for the variational inequality problem. 1995.

Weiran Wang and Miguel Á. Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.

Junyu Zhang, Mingyi Hong, and Shuzhong Zhang. On lower iteration complexity bounds for the saddle point problems. arXiv preprint arXiv:1912.07481, 2019.

A MORE EXPERIMENT RESULTS

A.1 MORE EMPIRICAL RESULTS FOR MATRIX GAMES

Here, we provide more plots for the same matrix game experiment described in Section 6. Specifically, the left plot in Figure 2 shows the convergence with respect to $\ln\|z_t - z^\star\|$, while the right plot shows the convergence with respect to the logarithm of the duality gap, $\ln(\alpha_f(z_t)) = \ln\big( \max_j (G^\top x_t)_j - \min_i (G y_t)_i \big)$. One can see that the plots are very similar to those in Figure 1.

Figure 2: Experiments of OGDA and OMWU with different learning rates on a matrix game $f(x, y) = x^\top G y$, where we generate $G \in \mathbb{R}^{32 \times 32}$ with each entry $G_{ij}$ drawn uniformly at random from $[-1, 1]$ and then rescale $G$'s operator norm to 1. OGDA/OMWU-eta=$\eta$ represents the curve of OGDA/OMWU with learning rate $\eta$. The configuration order in the legend is consistent with the order of the curves. For OMWU, $\eta \ge 11$ makes the algorithm diverge. The plot confirms the linear convergence of OMWU and OGDA, although OGDA is generally observed to converge faster than OMWU.

A.2 MATRIX GAME ON CURVED REGIONS

Next, we conduct experiments on a bilinear game similar to the one constructed in the proof of Theorem 9. Specifically, the bilinear game is defined by $f(x, y) = x_2 y_1 - x_1 y_2$, with $\mathcal{X} = \mathcal{Y} = \{(a, b) : 0 \le a \le 1/2,\; 0 \le b \le 1/2^n,\; a^n \le b\}$.
For any positive integer $n$, the equilibrium point of this game is $(0, 0)$ for both $x$ and $y$. Note that in Theorem 9 we prove that OGDA converges at a rate no better than $\Omega(1/t^2)$ in this game when $n = 2$. Figure 3 shows the empirical results for various values of $n$. In this figure, we plot $\|z_t - z^\star\|$ versus the time step $t$ in log-log scale. Note that in a log-log plot, a straight line with slope $s$ corresponds to a convergence rate of order $O(t^s)$, that is, a sublinear convergence rate. It is clear from Figure 3 that OGDA indeed converges sublinearly for all $n$, supporting our Theorem 9.

A.3 STRONGLY-CONVEX-STRONGLY-CONCAVE GAMES

In this section, we use the same experimental setup for strongly-convex-strongly-concave games as in (Lei et al., 2021), where $f(x, y) = x_1^2 - y_1^2 + 2x_1y_1$ and $\mathcal{X} = \mathcal{Y} = \{(a, b) : 0 \le a, b \le 1,\; a + b = 1\}$. The equilibrium point is $(0, 1)$ for both $x$ and $y$. In Figure 4, we present the log plot of $\|z_t - z^\star\|$ versus the time step $t$ and compare OGDA with OMWU using different learning rates as in Appendix A.1. The straight lines for OGDA imply that OGDA converges exponentially fast, supporting Theorem 6 and Theorem 8. Also note that here OGDA outperforms OMWU, which differs from the empirical results shown in (Lei et al., 2021). We hypothesize that this is because they use a different version of OGDA.

Figure 3: Experiments of OGDA on matrix games with curved regions, where $f(x, y) = x_2y_1 - x_1y_2$, $\mathcal{X} = \mathcal{Y} = \{(a, b) : 0 \le a \le 1/2,\; 0 \le b \le 1/2^n,\; a^n \le b\}$, and $n = 2, 4, 6, 8$. This figure is a log-log plot of $\|z_t - z^\star\|$ versus $t$, and it indicates sublinear convergence rates of OGDA in all these games.

Figure 4: Experiments on a strongly-convex-strongly-concave game where $f(x, y) = x_1^2 - y_1^2 + 2x_1y_1$ and $\mathcal{X} = \mathcal{Y} = \{(a, b) : 0 \le a, b \le 1,\; a + b = 1\}$. The figure shows $\ln\|z_t - z^\star\|$ versus the time step $t$.
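The strongly-convex-strongly-concave instance above can be simulated with projected OGDA in a few lines; the step size, horizon, and initialization below are our own illustrative choices (the constraint set $\{(a, b) : a + b = 1,\; a, b \ge 0\}$ is exactly the probability simplex, so a simplex projection applies).

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def F(x, y):
    # f(x, y) = x_1^2 - y_1^2 + 2 x_1 y_1 on the simplex {(a, b) : a + b = 1}
    grad_x = np.array([2.0 * x[0] + 2.0 * y[0], 0.0])   # grad_x f
    grad_y = np.array([-2.0 * y[0] + 2.0 * x[0], 0.0])  # grad_y f
    return grad_x, -grad_y                              # F = (grad_x f, -grad_y f)

eta = 1.0 / 32.0                 # within eta <= 1/(8L) for this smooth objective
xh, yh = np.array([0.5, 0.5]), np.array([0.5, 0.5])    # hat{z}_1
gx, gy = F(xh, yh)               # F(z_0) with z_0 = hat{z}_1
for _ in range(2000):
    x, y = proj_simplex(xh - eta * gx), proj_simplex(yh - eta * gy)
    gx, gy = F(x, y)
    xh, yh = proj_simplex(xh - eta * gx), proj_simplex(yh - eta * gy)
```

Both players' iterates approach the equilibrium $(0, 1)$ at a linear rate, matching the behavior reported in Figure 4.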
The result shows that OGDA enjoys linear convergence and outperforms OMWU in this case.

Figure 5: Experiments of OGDA on a set of games satisfying SP-MS with $\beta > 0$, where $f(x, y) = x_1^{2n} - x_1y_1 - y_1^{2n}$ for some integer $n \ge 2$ and $\mathcal{X} = \mathcal{Y} = \{(a, b) : 0 \le a, b \le 1,\; a + b = 1\}$. The result shows that OGDA converges to the Nash equilibrium at sublinear rates in these instances.

A.4 AN EXAMPLE WITH $\beta > 0$ FOR SP-MS

We also consider the toy example in Theorem 7, where $f(x, y) = x_1^{2n} - x_1y_1 - y_1^{2n}$ for some integer $n \ge 2$ and $\mathcal{X} = \mathcal{Y} = \{(a, b) : 0 \le a, b \le 1,\; a + b = 1\}$. The equilibrium point is $(0, 1)$ for both $x$ and $y$. We prove in Theorem 7 that SP-MS does not hold for $\beta = 0$ but does hold for $\beta = 2n - 2$. The point-wise convergence result is shown in Figure 5, which is again a log-log plot of $\|z_t - z^\star\|$ versus the time step $t$. One can observe that the convergence rate of OGDA is sublinear, supporting our theory again.

A.5 MATRIX GAMES WITH MULTIPLE NASH EQUILIBRIA

Finally, we provide empirical results for OGDA and OMWU in matrix games with multiple Nash equilibria, even though theoretically we only prove linear convergence for OMWU under the assumption that the Nash equilibrium is unique. We consider the following game matrix:
\[
G = \begin{pmatrix}
0 & 1 & -1 & 0 & 0 \\
-1 & 0 & 1 & 0 & 0 \\
1 & -1 & 0 & 0 & 0 \\
1 & -1 & 0 & 2 & -1 \\
-1 & 1 & 0 & -1 & 2
\end{pmatrix}.
\]
The value of $G$ is 0. To verify this, consider $x_0 = y_0 = \big( \tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}, 0, 0 \big)$. Then we have $\max_{y \in \Delta_5} x_0^\top G y = \min_{x \in \Delta_5} x^\top G y_0 = 0$. Direct calculation gives the following set of Nash equilibria:
\[
\mathcal{Y}^\star = \big\{ y \in \Delta_5 : y_1 = y_2 = y_3,\; \tfrac{1}{2}y_5 \le y_4 \le 2y_5 \big\}.
\]
Figure 6 shows the point-wise convergence result, where $\Pi_{Z^\star}(z_t)$ is the projection of $z_t$ onto the set of Nash equilibria. One can observe from the plots that both OGDA and OMWU achieve a linear convergence rate in this example. We thus conjecture that the uniqueness assumption in Theorem 3 can be further relaxed.
Figure 6: Experiments of OGDA and OMWU with different learning rates on a matrix game with multiple Nash equilibria. OGDA/OMWU-eta=$\eta$ represents the curve of OGDA/OMWU with learning rate $\eta$. We observe from these plots that both OGDA and OMWU enjoy a linear convergence rate, even though we are only able to show the linear convergence of OMWU under the uniqueness assumption.

B LEMMAS FOR OPTIMISTIC MIRROR DESCENT

We prove Lemma 1 in this section. To do so, we use the following two lemmas.

Lemma 10. Let $A$ be a convex set and $u' = \operatorname{argmin}_{u^\star \in A} \{ \langle u^\star, g \rangle + D_\psi(u^\star, u) \}$. Then for any $u^\star \in A$,
\[
\langle u' - u^\star, g \rangle \le D_\psi(u^\star, u) - D_\psi(u^\star, u') - D_\psi(u', u). \tag{10}
\]

Proof. Since $D_\psi(u', u) = \psi(u') - \psi(u) - \langle \nabla\psi(u), u' - u \rangle$, by the first-order optimality condition of $u'$, we have
\[
\langle g + \nabla\psi(u') - \nabla\psi(u),\; u^\star - u' \rangle \ge 0.
\]
On the other hand, notice that the right-hand side of Eq. (10) is
\[
\psi(u^\star) - \psi(u) - \langle \nabla\psi(u), u^\star - u \rangle - \psi(u^\star) + \psi(u') + \langle \nabla\psi(u'), u^\star - u' \rangle - \psi(u') + \psi(u) + \langle \nabla\psi(u), u' - u \rangle = \langle \nabla\psi(u') - \nabla\psi(u),\; u^\star - u' \rangle.
\]
Therefore, Eq. (10) is equivalent to $\langle g + \nabla\psi(u') - \nabla\psi(u),\; u^\star - u' \rangle \ge 0$, which we have already shown above.

Lemma 11. Suppose that $\psi$ satisfies $D_\psi(x, x') \ge \frac{1}{2}\|x - x'\|_p^2$ for some $p \ge 1$, and let $u, u_1, u_2 \in A$ (a convex set) be related by the following:
\[
u_1 = \operatorname{argmin}_{u' \in A} \{ \langle u', g_1 \rangle + D_\psi(u', u) \}, \qquad
u_2 = \operatorname{argmin}_{u' \in A} \{ \langle u', g_2 \rangle + D_\psi(u', u) \}.
\]
Then we have $\|u_1 - u_2\|_p \le \|g_1 - g_2\|_q$, where $q \ge 1$ and $\frac{1}{p} + \frac{1}{q} = 1$.

Proof. By the first-order optimality conditions of $u_1$ and $u_2$, we have
\[
\langle \nabla\psi(u_1) - \nabla\psi(u) + g_1,\; u_2 - u_1 \rangle \ge 0, \qquad
\langle \nabla\psi(u_2) - \nabla\psi(u) + g_2,\; u_1 - u_2 \rangle \ge 0.
\]
Summing them up and rearranging the terms, we get
\[
\langle u_2 - u_1,\; g_1 - g_2 \rangle \ge \langle \nabla\psi(u_1) - \nabla\psi(u_2),\; u_1 - u_2 \rangle. \tag{11}
\]
By the condition on $\psi$, we have $\langle \nabla\psi(u_1), u_1 - u_2 \rangle \ge \psi(u_1) - \psi(u_2) + \frac{1}{2}\|u_1 - u_2\|_p^2$ and $\langle \nabla\psi(u_2), u_2 - u_1 \rangle \ge \psi(u_2) - \psi(u_1) + \frac{1}{2}\|u_1 - u_2\|_p^2$. Summing them up, we get
\[
\langle \nabla\psi(u_1) - \nabla\psi(u_2),\; u_1 - u_2 \rangle \ge \|u_1 - u_2\|_p^2.
\]
Combining this with Eq. (11), we get $\langle u_2 - u_1, g_1 - g_2 \rangle \ge \|u_1 - u_2\|_p^2$. Since $\langle u_2 - u_1, g_1 - g_2 \rangle \le \|u_1 - u_2\|_p \|g_1 - g_2\|_q$ by Hölder's inequality, we further get $\|u_1 - u_2\|_p \le \|g_1 - g_2\|_q$.

Proof of Lemma 1. Considering Eq. (2), and using Lemma 10 with $u = \widehat{z}_t$, $u' = \widehat{z}_{t+1}$, $u^\star = z$, and $g = \eta F(z_t)$, we get
\[
\eta F(z_t)^\top (\widehat{z}_{t+1} - z) \le D_\psi(z, \widehat{z}_t) - D_\psi(z, \widehat{z}_{t+1}) - D_\psi(\widehat{z}_{t+1}, \widehat{z}_t).
\]
Considering Eq. (1), and using Lemma 10 with $u = \widehat{z}_t$, $u' = z_t$, $u^\star = \widehat{z}_{t+1}$, and $g = \eta F(z_{t-1})$, we get
\[
\eta F(z_{t-1})^\top (z_t - \widehat{z}_{t+1}) \le D_\psi(\widehat{z}_{t+1}, \widehat{z}_t) - D_\psi(\widehat{z}_{t+1}, z_t) - D_\psi(z_t, \widehat{z}_t).
\]
Summing up the two inequalities above, and adding $\eta (F(z_t) - F(z_{t-1}))^\top (z_t - \widehat{z}_{t+1})$ to both sides, we get
\[
\eta F(z_t)^\top (z_t - z) \le D_\psi(z, \widehat{z}_t) - D_\psi(z, \widehat{z}_{t+1}) - D_\psi(\widehat{z}_{t+1}, z_t) - D_\psi(z_t, \widehat{z}_t) + \eta (F(z_t) - F(z_{t-1}))^\top (z_t - \widehat{z}_{t+1}). \tag{12}
\]
Using Lemma 11 with $u = \widehat{x}_t$, $u_1 = x_t$, $u_2 = \widehat{x}_{t+1}$, $g_1 = \eta \nabla_x f(z_{t-1})$, and $g_2 = \eta \nabla_x f(z_t)$, we get $\|x_t - \widehat{x}_{t+1}\|_p \le \eta \|\nabla_x f(z_{t-1}) - \nabla_x f(z_t)\|_q$. Similarly, we have $\|y_t - \widehat{y}_{t+1}\|_p \le \eta \|\nabla_y f(z_t) - \nabla_y f(z_{t-1})\|_q$. Therefore, by Hölder's inequality, we have
\begin{align*}
\eta (F(z_t) - F(z_{t-1}))^\top (z_t - \widehat{z}_{t+1})
&\le \eta \|x_t - \widehat{x}_{t+1}\|_p \|\nabla_x f(z_{t-1}) - \nabla_x f(z_t)\|_q + \eta \|y_t - \widehat{y}_{t+1}\|_p \|\nabla_y f(z_{t-1}) - \nabla_y f(z_t)\|_q \\
&\le \eta^2 \|\nabla_x f(z_{t-1}) - \nabla_x f(z_t)\|_q^2 + \eta^2 \|\nabla_y f(z_{t-1}) - \nabla_y f(z_t)\|_q^2 \\
&= \eta^2 \mathrm{dist}_q^2(F(z_t), F(z_{t-1})) \\
&\le \eta^2 L^2 \mathrm{dist}_p^2(z_t, z_{t-1}) \qquad \text{(by assumption)} \\
&\le \tfrac{1}{64} \mathrm{dist}_p^2(z_t, z_{t-1}). \qquad \text{(by our choice of $\eta$)}
\end{align*}
Continuing from Eq. (12), we then have
\begin{align*}
\eta F(z_t)^\top (z_t - z)
&\le D_\psi(z, \widehat{z}_t) - D_\psi(z, \widehat{z}_{t+1}) - D_\psi(\widehat{z}_{t+1}, z_t) - D_\psi(z_t, \widehat{z}_t) + \tfrac{1}{64} \mathrm{dist}_p^2(z_t, z_{t-1}) \\
&\le D_\psi(z, \widehat{z}_t) - D_\psi(z, \widehat{z}_{t+1}) - D_\psi(\widehat{z}_{t+1}, z_t) - D_\psi(z_t, \widehat{z}_t) + \tfrac{1}{32}\mathrm{dist}_p^2(z_t, \widehat{z}_t) + \tfrac{1}{32}\mathrm{dist}_p^2(\widehat{z}_t, z_{t-1}) \\
&\qquad \big( \|u + v\|_p^2 \le (\|u\|_p + \|v\|_p)^2 \le 2\|u\|_p^2 + 2\|v\|_p^2 \big) \\
&\le D_\psi(z, \widehat{z}_t) - D_\psi(z, \widehat{z}_{t+1}) - D_\psi(\widehat{z}_{t+1}, z_t) - D_\psi(z_t, \widehat{z}_t) + \tfrac{1}{16} D_\psi(z_t, \widehat{z}_t) + \tfrac{1}{16} D_\psi(\widehat{z}_t, z_{t-1}) \\
&\qquad \text{(by the assumption on $\psi$)} \\
&= D_\psi(z, \widehat{z}_t) - D_\psi(z, \widehat{z}_{t+1}) - D_\psi(\widehat{z}_{t+1}, z_t) - \tfrac{15}{16} D_\psi(z_t, \widehat{z}_t) + \tfrac{1}{16} D_\psi(\widehat{z}_t, z_{t-1}).
\end{align*}
This concludes the proof.
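As a sanity check, Lemma 10 can be specialized to the Euclidean regularizer; the worked instance below is our own and not part of the original appendix.

```latex
% Take \psi(u) = \tfrac{1}{2}\|u\|^2, so that D_\psi(u^\star, u) = \tfrac{1}{2}\|u^\star - u\|^2 and
% u' = \operatorname{argmin}_{u^\star \in A}\{\langle u^\star, g\rangle + \tfrac{1}{2}\|u^\star - u\|^2\} = \Pi_A(u - g).
% Eq. (10) then reads
\[
\langle u' - u^\star, g \rangle
\;\le\; \tfrac{1}{2}\|u^\star - u\|^2 - \tfrac{1}{2}\|u^\star - u'\|^2 - \tfrac{1}{2}\|u' - u\|^2
\;=\; \langle u^\star - u',\; u' - u \rangle,
\]
% which after rearranging is
\[
\langle (u - g) - u',\; u^\star - u' \rangle \;\le\; 0 \qquad \forall\, u^\star \in A,
\]
% exactly the obtuse-angle (first-order optimality) characterization of the
% Euclidean projection u' = \Pi_A(u - g).
```

This is the inequality used repeatedly for the projected (Euclidean) instantiation of optimistic mirror descent, i.e., OGDA.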
C AN AUXILIARY LEMMA ON RECURSIVE FORMULAS

Here, we provide an auxiliary lemma that gives an explicit bound based on a particular recursive formula. This will be useful later for deriving convergence rates.

Lemma 12. Consider a non-negative sequence $\{B_t\}_{t=1,2,\ldots}$ that satisfies, for some $p > 0$ and $q > 0$,
\[
B_{t+1} \le B_t - q B_{t+1}^{p+1} \quad \forall t \ge 1, \qquad q(1 + p)B_1^p \le 1.
\]
Then $B_t \le c\, t^{-1/p}$, where $c = \max\big\{ B_1,\; \big(\tfrac{2}{qp}\big)^{1/p} \big\}$.

Proof. We first prove that $B_{t+1} \le B_t - \frac{q}{2} B_t^{p+1}$. Notice that since the $B_t$ are all non-negative, by the first condition we have $B_{t+1} \le B_t \le \cdots \le B_1$. Using the fundamental theorem of calculus, we have
\[
B_t^{p+1} - B_{t+1}^{p+1} = \int_{B_{t+1}}^{B_t} (p+1)x^p \, dx \le (p+1)(B_t - B_{t+1})B_t^p,
\]
and thus
\[
B_{t+1} \le B_t - q B_{t+1}^{p+1} \le B_t - q B_t^{p+1} + q(p+1)(B_t - B_{t+1})B_t^p.
\]
By rearranging, we get
\[
B_{t+1} \le B_t - \frac{q B_t^{p+1}}{1 + q(1+p)B_t^p} \le B_t - \frac{q}{2} B_t^{p+1},
\]
where the last inequality is because $q(1+p)B_t^p \le q(1+p)B_1^p \le 1$.

Below we use induction to prove $B_t \le c\, t^{-1/p}$, where $c = \max\{B_1, (2/(qp))^{1/p}\}$. This clearly holds for $t = 1$. Suppose that it holds for $1, \ldots, t$. Note that the function $h(B_t) = B_t - \frac{q}{2}B_t^{p+1}$ is increasing in $B_t$, as $h'(B_t) = 1 - \frac{q(p+1)}{2}B_t^p \ge 1 - \frac{q(p+1)}{2}B_1^p \ge 0$. Therefore, we apply the induction hypothesis and get
\[
B_{t+1} \le \frac{c}{t^{1/p}} - \frac{q}{2}\cdot\frac{c^{p+1}}{t^{(p+1)/p}} \le \frac{c}{t^{1/p}} - \frac{c}{p\, t^{(p+1)/p}} \le \frac{c}{(t+1)^{1/p}},
\]
where the second inequality is by the definition of $c$ (which implies $\frac{q}{2}c^p \ge \frac{1}{p}$), and the last inequality is by the fundamental theorem of calculus:
\[
\frac{1}{t^{1/p}} - \frac{1}{(t+1)^{1/p}} = \int_t^{t+1} \frac{1}{p}\, x^{-(1+p)/p} \, dx \le \frac{1}{p\, t^{(1+p)/p}}.
\]
This completes the induction.

D PROOFS OF LEMMA 2 AND THEOREM 3

In this section, we consider $f(x, y) = x^\top G y$ with $\mathcal{X} = \Delta_M$ and $\mathcal{Y} = \Delta_N$ being probability simplices and $G \in [-1, 1]^{M \times N}$. We assume that $G$ has a unique Nash equilibrium $z^\star = (x^\star, y^\star)$. The value of the game is denoted by $\rho = \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} x^\top G y = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} x^\top G y = x^{\star\top} G y^\star$. Before proving Lemma 2 and Theorem 3, in Section D.1 we define some constants used in the analysis; in Section D.2 we state more auxiliary lemmas, which are useful when proving Lemma 2 and Theorem 3 in Section D.3.

D.1 SOME PROBLEM-DEPENDENT CONSTANTS

First, we define a constant $\xi$ that is determined by $G$.

Definition 2.
ξ min min i/ supp(x )(Gy )i ρ, ρ max i/ supp(y )(G x )i The fact ξ 1 can be shown by: ξ mini/ supp(x )(Gy )i ρ + ρ maxi/ supp(y )(G x )i while the fact ξ > 0 is a direct consequence of Lemma C.3 of Mertikopoulos et al. (2018), stated below. Lemma 13 (Lemma C.3 of Mertikopoulos et al. (2018)). Let G RM N be a game matrix for a two-player zero-sum game with value ρ. Then there exists a Nash equilibrium (x , y ) such that (Gy )i = ρ i supp(x ), (Gy )i > ρ i / supp(x ), (G x )i = ρ i supp(y ), (G x )i < ρ i / supp(y ). Published as a conference paper at ICLR 2021 Below, we define V (Z) = V (X) V (Y), where V (X) {x : x M, supp(x) supp(x )} V (Y) {y : y N, supp(y) supp(y )}. Definition 3. cx min x M\{x } max y V (Y) (x x ) Gy x x 1 , cy min y N\{y } max x V (X) x G(y y) Note that in the definition of cx and cy, the outer minimization is over an open set, which may make the definition problematic as the optimal value may not be attained. However, the following lemma shows that cx and cy are well-defined. Lemma 14. cx and cy are well-defined, and 0 < cx, cy 1. Proof. We first show cx and cy are well-defined. To simplify the notations, we define x min mini supp(x ) x i and X {x : x M, x x 1 x min}, and define y min and Y similarly. We will show that cx = min x X max y V (Y) (x x ) Gy x x 1 , cy = min y Y max x V (X) x G(y y) which are well-defined as the outer minimization is now over a closed set. Consider cx, it suffices to show that for any x M such that x = x and x x 1 < x min, there exists x M such that x x 1 = x min and x x 1 = (x x ) Gy x x 1 , y. (13) In fact, we can simply choose x = x + (x x ) x min x x 1 . We first argue that x is still in M. For each j [K], if xj x j 0, we surely have x j x j + 0 0; otherwise, x j > xj 0 and thus j supp(x ) and x j x min, which implies x j x j |xj x j| x min x x 1 x j x min 0. In addition, P j x j = P j x j = 1. Combining these facts, we have x M. Moreover, according to the definition of x , x x 1 = x min holds. 
Also, since x x and x x are parallel vectors, Eq. (13) is satisfied. The arguments above show that the cx in Definition 3 is a well-defined real number. The case of cy is similar. Now we show 0 < cx, cy 1. The fact that cx, cy 1 is a direct consequence of G being in [ 1, 1]M N. Below, we use contradiction to prove that cy > 0. First, if cy < 0, then there exists y = y such that x Gy < x Gy. This contradicts with the fact that (x , y ) is the equilibrium. On the other hand, if cy = 0, then there is some y = y such that max x V (X) x G(y y) = 0. (14) Published as a conference paper at ICLR 2021 Consider the point y = y + ξ 2(y y ) (recall the definition of ξ in Definition 2 and that 0 < ξ 1), which lies on the line segment between y and y. Then, for any x X, i/ supp(x ) xi(Gy )i + X i supp(x ) xi(Gy )i i/ supp(x ) xi(Gy )i xi y y 1 + X 2 xi(G(y y ))i + xi(Gy )i (using Gij [ 1, 1] for the first part and y = y + ξ 2(y y ) for the second) i/ supp(x ) xi(Gy )i xi y y 1 + X i supp(x ) xiρ (using Eq. (14) and (Gy )i = ρ for all i supp(x )) i/ supp(x ) xi ((Gy )i ξ) + X i supp(x ) xiρ (using y y = ξ 2(y y ) and y y 1 2) i/ supp(x ) xiρ + X i supp(x ) xiρ (by the definition of ξ) This shows that minx X x Gy ρ, that is, y = y is also a maximin point, contradicting that z is unique. Therefore, cy > 0 has to hold, and so does cx > 0 by the same argument. Finally, we define the following constant that depends on G: Definition 4. ϵ min j supp(z ) exp D.2 AUXILIARY LEMMAS All lemmas stated in this section is for the case f(x, y) = x Gy with Z = M N and a unique Nash equilibrium z = (x , y ). Lemma 15. For any z Z, we have max z V (Z) F(z) (z z ) C z z 1 for C = min{cx, cy} (0, 1]. Proof. 
Recall that ρ = x Gy is the game value and note that max z V (Z) F(z) (z z ) = max z V (Z)(x x ) Gy + x G(y y) = max z V (Z) x Gy + x Gy = max x V (X) ρ x Gy + max y V (Y) = max x V (X) x G(y y) + max y V (Y)(x x ) Gy (Lemma 13) cy y y 1 + cx x x 1 (by Definition 3) min{cx, cy} z z 1, which completes the proof. Lemma 16. For any z Z, we have KL(z , z) X i/ supp(z ) zi 1 mini supp(z ) zi z z 1. Published as a conference paper at ICLR 2021 Proof. Using the definition of the Kullback-Leibler divergence, we have KL(x , x) = X i x i ln x i xi where the first inequality is by the concavity of the ln( ) function, and the second inequality is because ln(1 + u) u. Considering i supp(x ) and i / supp(x ) separately in the last summation, we have X i/ supp(x ) i/ supp(x ) xi. The case for KL(y , y) is similar. Combining both cases finishes the proof of the first inequality (recall that KL(z , z) is defined as KL(x , x) + KL(y , y)). The second inequality is straightforward: i/ supp(z ) zi 1 mini supp(z ) zi i supp(z ) |z i zi| + X i/ supp(z ) |zi| = 1 mini supp(z ) zi z z 1. Lemma 17. For η 1 8, OMWU guarantees 3 4 bzt,i zt,i 4 3 bzt,i and 3 4 bzt,i bzt+1,i 4 Proof. This is shown directly by the update of bxt: bxt,i exp ( η) exp (η) bxt+1,i = bxt,i exp ( η (Gyt)i) P j bxt,j exp ( η (Gyt)j) bxt,i exp (η) So by the condition on η, we have 3 4 bxt,i exp( 2η) bxt,i bxt+1,i exp(2η) bxt,i 4 3 bxt,i. The cases for xt, byt and yt are similar. Lemma 18. For any two probability vectors u, v, if for every entry i, 1 2ui vi 3 2ui, then ui KL(u, v) P Proof. Using the definition of the Kullback-Leibler divergence, we have KL(u, v) = X KL(u, v) = X ui (vi ui)2 where the first inequality is because ln(1+a) a 1 2, and the second inequality is because ln(1 + a) a a2 for 1 2. The third inequality is by using the condition |ui vi| 1 Lemma 19. For all i supp(z ) and t, OMWU guarantees bzt,i ϵ (ϵ is defined in Definition 4). Proof. Using Eq. 
(3), we have KL(z , bzt) Θt Θ1 = 1 16KL(bz1, z0) + KL(z , bz1) = KL(z , bz1), (15) where the last equality is because bz1 = z0 = ( 1M Published as a conference paper at ICLR 2021 Then, for any i supp(z ), we have bzt,j = KL(z , bzt) X j z j ln z j KL(z , bz1) X j z j ln z j bz1,j = ln(MN). Therefore, we conclude for all t and i supp(z ), bzt,i satisfies bzt,i exp ln(MN) min j supp(z ) exp D.3 PROOFS OF LEMMA 2 AND THEOREM 3 Proof of Lemma 2. Below we consider any z Z such that supp(z ) supp(z ), that is, z V (Z). Considering Eq. (1), and using the first-order optimality condition of bzt+1, we have ( ψ(bzt+1) ψ(bzt) + ηF(zt)) (z bzt+1) 0, where ψ(z) = P i zi ln zi. Rearranging the terms and we get ηF (zt) (bzt+1 z ) ( ψ(bzt+1) ψ(bzt)) (z bzt+1) = X i (z i bzt+1,i) ln bzt+1,i The left hand side of Eq. (16) is lower bounded as ηF (zt) (bzt+1 z ) = ηF (bzt+1) (bzt+1 z ) + η (F (zt) F (bzt+1)) (bzt+1 z ) ηF (bzt+1) (bzt+1 z ) η F (zt) F (bzt+1) bzt+1 z 1 ηF (bzt+1) (bzt+1 z ) 4η zt bzt+1 1 ( F (zt) F (bzt+1) zt bzt+1 1 4) ηF (bzt+1) (bzt+1 z ) 1 2 zt bzt+1 1 ; (η 1/8) on the other hand, the right hand side of Eq. (16) is upper bounded by i (z i bzt+1,i) ln bzt+1,i i supp(z ) z i ln bzt+1,i bzt,i KL(bzt+1, bzt) (supp(z ) supp(z )) i supp(z ) max ln 1 + bzt+1,i bzt,i , ln 1 + bzt,i bzt+1,i i supp(z ) ln 1 + |bzt+1,i bzt,i| min{bzt+1,i, bzt,i} |bzt+1,i bzt,i| bzt,i . (ln(1 + a) a and Lemma 17) Combining the bounds on the two sides of Eq. (16), we get ηF (bzt+1) (bzt+1 z ) 4 |bzt+1,i bzt,i| 2 zt bzt+1 1. Published as a conference paper at ICLR 2021 Since z can be chosen as any point in V (Z), we further lower bound the left-hand side above using Lemma 15 and get ηC z bzt+1 1 4 |bzt+1,i bzt,i| 2 zt bzt+1 1 3ϵ bzt+1 bzt 1 + 1 2 zt bzt+1 1, (Lemma 19) 3ϵ ( bzt+1 bzt 1 + zt bzt+1 1) (17) where the last inequality uses ϵ 1. With the help of Eq. (17), below we prove the desired inequalities. Case 1. General case. 
KL(bzt+1, zt) + KL(zt, bzt) 2 bxt+1 xt 2 1 + 1 2 byt+1 yt 2 1 + 1 2 xt bxt 2 1 + 1 2 yt byt 2 1 (Pinsker s inequality) 4 bzt+1 zt 2 1 + 1 4 zt bzt 2 1 (a2 + b2 1 16 bzt+1 zt 2 1 + 1 8 bzt+1 zt 2 1 + zt bzt 2 1 16 bzt+1 zt 2 1 + 1 16 bzt+1 bzt 2 1 (a2 + b2 1 2(a + b)2 and triangle inequality) 32 ( bzt+1 zt 1 + bzt+1 bzt 1)2 (a2 + b2 1 2 z bzt+1 2 1 (Eq. (17)) 64 ϵ2KL(z , bzt+1)2 = ϵ4η2C2 64 KL(z , bzt+1)2. (Lemma 16 and Lemma 19) This proves the first part of the lemma with C1 = ϵ4C2/64. Case 2. The case when max{ z bzt 1, z zt 1} ηξ KL(bzt+1, zt) + KL(zt, bzt) (bzt+1,i zt,i)2 bzt+1,i + (zt,i bzt,i)2 (Lemma 17 and Lemma 18) i/ supp(z ) (bzt+1,i zt,i)2 bzt,i + (zt,i bzt,i)2 i/ supp(z ) (bzt+1,i bzt,i)2 bzt,i . (18) Below we continue to bound P i/ supp(z ) (bzt+1,i bzt,i)2 By the assumption, we have yt y 1 ηξ 10, which by Lemma 13 and Definition 2 implies i supp(x ), (Gyt)i (Gy )i + ηξ 10 = ρ + ηξ i / supp(x ), (Gyt)i (Gy )i ηξ 10 ρ + ξ ηξ Published as a conference paper at ICLR 2021 We also have bxt x 1 ηξ j / supp(x ) bxt,j ηξ 10. Then, for i / supp(x ), we have bxt+1,i = bxt,i exp( η(Gyt)i) P j bxt,j exp( η(Gyt)j) bxt,i exp( η(Gyt)i) P j supp(x ) bxt,j exp( η(Gyt)j) bxt,i exp( η(ρ + 9ξ j supp(x ) bxt,j exp( η(ρ + ξ = bxt,i exp 8 10ηξ 1 P j / supp(x ) bxt,j bxt,i exp 8 where the last inequality is because exp( 0.8u) 1 0.1u 1 0.5u for u [0, 1]. Rearranging gives |bxt+1,i bxt,i|2 4 bxt,i η2ξ2 where the last step uses Lemma 17. The case for byt is similar, so we have |bzt+1,i bzt,i|2 Combining this with Eq. (18), we get KL(bzt+1, zt) + KL(zt, bzt) η2ξ2 i/ supp(z ) bzt+1,i. (19) Now we combine two lower bounds of KL(bzt+1, zt) + KL(zt, bzt). Using an intermediate step in Case 1, and Eq. (19), we get KL(bzt+1, zt) + KL(zt, bzt) = 1 2 (KL(bzt+1, zt) + KL(zt, bzt)) + 1 2 (KL(bzt+1, zt) + KL(zt, bzt)) 128 z bzt+1 2 1 + η2ξ2 i/ supp(z ) bzt+1,i ξ2ϵ bzt+1 z 2 1 + 1 ϵ3C2 X i/ supp(z ) bzt+1,i ϵ bzt+1 z 2 1 + X i/ supp(z ) bzt+1,i (ξ 1, C 1, and ϵ 1) 128 KL(z , bzt+1). 
(Lemma 16 and Lemma 19) This proves the second part of the lemma with C₂ = ε³C²ξ²/128.

Now we are ready to prove Theorem 3.

Proof of Theorem 3. As argued in Section 4, with Θ_t = KL(z*, ẑ_t) + (1/16)·KL(ẑ_t, z_{t−1}) and ζ_t = KL(ẑ_{t+1}, z_t) + KL(z_t, ẑ_t), we have Θ_{t+1} − Θ_t ≤ −(15/16)·ζ_t (see Eq. (3)). We then proceed as follows:

ζ_t = (1/2)·KL(ẑ_{t+1}, z_t) + (1/2)·KL(ẑ_{t+1}, z_t) + KL(z_t, ẑ_t)
≥ (1/2)·KL(ẑ_{t+1}, z_t) + (η²C₁/2)·KL(z*, ẑ_{t+1})² (Lemma 2)
≥ (1/2)·KL(ẑ_{t+1}, z_t)² + (η²C₁/2)·KL(z*, ẑ_{t+1})² (by Lemma 17 and Lemma 18)
≥ (η²C₁/2)·(KL(ẑ_{t+1}, z_t)² + KL(z*, ẑ_{t+1})²) (C₁ = ε⁴C²/64 ≤ 1/64 as shown in the proof of Lemma 2)
≥ (η²C₁/4)·(KL(ẑ_{t+1}, z_t) + KL(z*, ẑ_{t+1}))²
≥ (η²C₁/4)·Θ_{t+1}².

Therefore, Θ_{t+1} − Θ_t ≤ −(15η²C₁/64)·Θ_{t+1}² ≤ −(15η²C₁/(64 + ln(MN)))·Θ_{t+1}². Also, recall ẑ₁ = z₀ = (1/M·1, 1/N·1) and thus Θ₁ = KL(z*, ẑ₁) ≤ ln(MN). Therefore, the conditions of Lemma 12 are satisfied with p = 1 and q = 15η²C₁/(64 + ln(MN)), and we conclude that Θ_t ≤ C/t for all t, where C = max{ln(MN), (128 + 2·ln(MN))/(15η²C₁)} = (128 + 2·ln(MN))/(15η²C₁).

Next we prove the main result. Set T₀ = 12800C/(η²ξ²). For t ≥ T₀, using Pinsker's inequality, we have

‖z* − ẑ_t‖₁² ≤ 2‖x* − x̂_t‖₁² + 2‖y* − ŷ_t‖₁² ≤ 4·KL(z*, ẑ_t) ≤ 4C/t ≤ η²ξ²/100,

‖z* − z_t‖₁² ≤ 2‖z* − ẑ_{t+1}‖₁² + 2‖ẑ_{t+1} − z_t‖₁² ≤ 4‖x* − x̂_{t+1}‖₁² + 4‖x̂_{t+1} − x_t‖₁² + 4‖y* − ŷ_{t+1}‖₁² + 4‖ŷ_{t+1} − y_t‖₁² ≤ 8·KL(z*, ẑ_{t+1}) + 8·KL(ẑ_{t+1}, z_t) ≤ 128·Θ_{t+1} ≤ 128C/t ≤ η²ξ²/100.

Therefore, when t ≥ T₀, the condition of the second part of Lemma 2 is satisfied, and we have

ζ_t ≥ (1/2)·KL(ẑ_{t+1}, z_t) + (1/2)·(KL(ẑ_{t+1}, z_t) + KL(z_t, ẑ_t))
≥ (1/2)·KL(ẑ_{t+1}, z_t) + (η²C₂/2)·KL(z*, ẑ_{t+1}) (by Lemma 2)
≥ (η²C₂/2)·Θ_{t+1}. (C₂ = ε³C²ξ²/128 ≤ 1/128 as shown in the proof of Lemma 2)

Therefore, when t ≥ T₀, Θ_{t+1} − Θ_t ≤ −(15η²C₂/32)·Θ_{t+1}, which further leads to

Θ_t ≤ Θ_{T₀}·(1 + 15η²C₂/32)^{T₀−t} ≤ Θ₁·(1 + 15η²C₂/32)^{T₀−t} ≤ ln(MN)·(1 + 15η²C₂/32)^{T₀−t},

where the second inequality uses Eq. (15). The inequality trivially holds for t < T₀ as well, so it holds for all t. We finish the proof by relating KL(z*, z_t) and Θ_{t+1}.
Note that by Lemma 16, Lemma 17, and Lemma 19, we have KL(z , zt)2 z zt 2 mini supp(z ) z2 t,i 16 z zt 2 9ϵ2 4 z bzt+1 2 + bzt+1 zt 2 Published as a conference paper at ICLR 2021 We continue to bound the last term as 4 z bzt+1 2 + bzt+1 zt 2 = 4 x bxt+1 2 + y byt+1 2 + bxt+1 xt 2 + byt+1 yt 2 = 4 x bxt+1 2 1 + y byt+1 2 1 + bxt+1 xt 2 1 + byt+1 yt 2 1 ϵ2 KL(z , bzt+1) 16 + KL(bzt+1, zt) (Pinsker s inequality) Combining everything, we get which completes the proof. E PROOFS OF LEMMA 4 AND THE SUM-OF-DUALITY-GAP BOUND Proof of Lemma 4. Below we consider any z = bzt+1 Z. Considering Eq. (1) with Dψ(u, v) = 1 2 u v 2, and using the first-order optimality condition of bzt+1, we have (bzt+1 bzt + ηF(zt)) (z bzt+1) 0, (zt+1 bzt+1 + ηF(zt)) (z zt+1) 0. Rearranging the terms and we get (bzt+1 bzt) (z bzt+1) ηF(zt) (bzt+1 z ) = ηF(bzt+1) (bzt+1 z ) + η (F(zt) F(bzt+1)) (bzt+1 z ) ηF(bzt+1) (bzt+1 z ) ηL zt bzt+1 bzt+1 z ηF(bzt+1) (bzt+1 z ) 1 8 zt bzt+1 bzt+1 z , and (zt+1 bzt+1) (z zt+1) ηF(zt) (zt+1 z ) = ηF(zt+1) (zt+1 z ) + η (F(zt) F(zt+1)) (zt+1 z ) ηF(zt+1) (zt+1 z ) ηL zt zt+1 zt+1 z ηF(zt+1) (zt+1 z ) 1 8 zt zt+1 zt+1 z . Here, for both block, the third step uses H older s inequality and the smoothness condition Assumption 1, and the last step uses the condition η 1/(8L). Upper bounding the left-hand side of the two inequalities by bzt+1 bzt bzt+1 z and zt+1 bzt+1 zt+1 z respectively and then rearranging, we get bzt+1 z bzt+1 bzt + 1 8 zt bzt+1 ηF(bzt+1) (bzt+1 z ), zt+1 z zt+1 bzt+1 + 1 8 zt zt+1 ηF(zt+1) (zt+1 z ). Therefore, we have bzt+1 bzt + 1 8 zt bzt+1 2 η2[F(bzt+1) (bzt+1 z )]2 + bzt+1 z 2 , zt+1 bzt+1 + 1 8 zt zt+1 2 η2[F(zt+1) (zt+1 z )]2 + zt+1 z 2 . 
Published as a conference paper at ICLR 2021 Finally, by the triangle inequality and the fact (a + b)2 2a2 + 2b2, we have bzt+1 bzt + 1 8 zt bzt+1 2 zt bzt + 9 8 zt bzt+1 2 8 zt bzt + 9 8 zt bzt+1 2 32 zt bzt 2 + zt bzt+1 2 , bzt+1 zt+1 + 1 8 zt zt+1 2 9 8 zt+1 bzt+1 + zt bzt+1 2 8 zt+1 bzt+1 + 9 8 zt bzt+1 2 32 zt+1 bzt+1 2 + zt bzt+1 2 , which finishes the proof. Next, we use Eq. (4) and Eq. (6) to derive a result on the convergence of average duality gap across time. First, we use the following lemma to relate the right-hand side of Eq. (6) to the duality gap of zt. Lemma 20. Let Z be closed and bounded. Then for any z Z, we have αf(z) maxz Z F(z) (z z ). Proof. This is a direct consequence of the convexity of f( , y) and the concavity of f(x, ): αf(z) = max (x ,y ) X Y (f(x, y ) f(x, y) + f(x, y) f(x , y)) max (x ,y ) X Y yf(x, y) (y y) + xf(x, y) (x x ) = max z Z F(z) (z z ). With Lemma 20, the following theorem can be proven straightforwardly. Theorem 21. Let Z be closed and bounded. Then OGDA with η 1 8L ensures 1 T PT t=1 αf(zt) = for any T, where D supz,z Z z z . Proof. We first bound the sum of squared duality gap as (recall ζt = bzt+1 zt 2 + zt bzt 2): t=1 αf(zt)2 max z ZF(zt) (zt z ) 2 (Lemma 20) t=1 (ζt 1 + ζt) zt z 2 (Lemma 4) t=2 (Θt 1 Θt + Θt Θt+1) . (telescoping) Finally, by Cauchy-Schwarz inequality, we get 1 T PT t=1 αf(zt) 1 T T PT t=1 αf(zt)2 = Published as a conference paper at ICLR 2021 This theorem indicates that αf(zt) is converging to zero. A rate of αf(zt) = O( D t) would be compatible with the theorem, but is not directly implied by it. In a recent work, Golowich et al. (2020b) consider the unconstrained setting and show that the extra-gradient algorithm obtains the rate αf(zt) = O( D t), under an extra assumption that the Hessian of f is also Lipschitz (since Golowich et al. 
(2020b) study the unconstrained setting, their duality gap α_f is defined only with respect to the best responses that lie within a ball of radius D centered around the equilibrium). Note that the extra-gradient algorithm requires more cooperation between the two players compared to OGDA and is less suitable for a repeated game setting.

F THE EQUIVALENCE BETWEEN SP-MS AND METRIC SUBREGULARITY

In this section, we formally show that our SP-MS condition with β = 0 is equivalent to metric subregularity. Before introducing the main theorem, we introduce several definitions. We let Z* ⊆ Z ⊆ R^K (Z and Z* follow the same definitions as in our main text). First, we define the element-to-set distance function d:

Definition 5. The element-to-set distance function d: R^K × 2^{R^K} → R is defined as d(z, S) = inf_{z′∈S} ‖z − z′‖.

The definition of metric subregularity involves a set-valued operator T: Z → 2^{R^K}, which maps an element of Z to a subset of R^K.

Definition 6. A set-valued operator T is called metric subregular at (z̄, v̄) for v̄ ∈ T(z̄) if there exist κ > 0 and a neighborhood Ω of z̄ such that d(v̄, T(z)) ≥ κ·d(z, T^{−1}(v̄)) for all z ∈ Ω, where x ∈ T^{−1}(v) ⇔ v ∈ T(x). If Ω = Z, we call T globally metric subregular.

The following definition of normal cone is also required in the analysis:

Definition 7. The normal cone of Z at point z is N(z) = {g | g^⊤(z′ − z) ≤ 0, ∀z′ ∈ Z} (we omit its dependence on Z for simplicity). Equivalently, N(z) is the polar cone of the convex set Z − z (a property that we will use in the proof).

Now we are ready to show that our SP-MS condition with β = 0 is equivalent to metric subregularity of the operator N + F, defined via (N + F)(z) = {g + F(z) | g ∈ N(z)}.

Theorem 22. Let z* ∈ Z*. Then the following two statements are equivalent: (1) N + F is globally metric subregular at (z*, 0) with some κ > 0; (2) for all z ∈ Z\Z*, max_{z′∈Z} F(z)^⊤(z − z′)/‖z − z′‖ ≥ κ·d(z, Z*).

Proof. Let T = N + F. Notice that z ∈ Z* ⇔ F(z)^⊤(z′ − z) ≥ 0 for all z′ ∈ Z ⇔ −F(z) ∈ N(z) ⇔ 0 ∈ (N + F)(z). Therefore, 0 ∈ T(z*) indeed holds, and we have T^{−1}(0) = Z*.
This means that the first statement in the theorem is equivalent to d(0, T(z)) ≥ κ·d(z, T^{−1}(0)), that is, d(0, N(z) + F(z)) ≥ κ·d(z, Z*). This inequality holds trivially when z ∈ Z*. Thus, to complete the proof, it suffices to prove that d(0, N(z) + F(z)) = max_{z′∈Z} F(z)^⊤(z − z′)/‖z − z′‖ for z ∈ Z\Z*. To do so, note that

d(0, N(z) + F(z)) = d(−F(z), N(z)) = ‖−F(z) − Π_{N(z)}(−F(z))‖ = ‖Π_{N°(z)}(−F(z))‖,

where N°(z) = {g | g^⊤n ≤ 0, ∀n ∈ N(z)} is the polar cone of N(z) and the last step is by Moreau's theorem. Now consider the projection of −F(z) onto the polar cone N°(z):

Π_{N°(z)}(−F(z)) = argmin_{y∈N°(z)} ‖−F(z) − y‖² = argmin_{y∈N°(z)} (2F(z)^⊤y + ‖y‖²) = argmin_{λ≥0, z′∈N°(z), ‖z′‖=1} (2λF(z)^⊤z′ + λ²),

where the last equality is because N°(z) is a cone. Next, we find the z′ and λ that realize the last argmin operator: notice that the objective is increasing in F(z)^⊤z′, so z′ = argmin_{z′∈N°(z): ‖z′‖=1} F(z)^⊤z′, and thus λ = −F(z)^⊤z′ when F(z)^⊤z′ ≤ 0 and λ = 0 otherwise. Therefore,

‖Π_{N°(z)}(−F(z))‖ = λ = max{0, max_{z′∈N°(z), ‖z′‖=1} −F(z)^⊤z′}.

Note that N(z) is the polar cone of the conic hull of Z − z. Therefore, N°(z) = (Conic Hull(Z − z))°° = Conic Hull(Z − z), and

max{0, max_{z′∈N°(z), ‖z′‖=1} −F(z)^⊤z′} = max{0, max_{z′∈Z} F(z)^⊤(z − z′)/‖z − z′‖}.

Finally, note that when z ∈ Z\Z*, we have max_{z′∈Z} F(z)^⊤(z − z′) > 0. Combining all the facts above, we have shown d(0, N(z) + F(z)) = max_{z′∈Z} F(z)^⊤(z − z′)/‖z − z′‖.

G PROOF OF THEOREM 5

Proof of Theorem 5. Let ρ = min_{x∈X} max_{y∈Y} x^⊤Gy = max_{y∈Y} min_{x∈X} x^⊤Gy be the game value. We prove that there exists some c > 0 such that

max_{y′∈Y} x^⊤Gy′ − ρ ≥ c·‖x − Π_{X*}(x)‖ (20)

for all x ∈ X. Similarly, one can prove max_{x′∈X} ρ − x′^⊤Gy ≥ c·‖y − Π_{Y*}(y)‖ for all y ∈ Y. Assume that the diameter of the polytope is D < ∞. Combining the two then proves

max_{z′∈Z} F(z)^⊤(z − z′)/‖z − z′‖ ≥ (1/D)·max_{z′∈Z} F(z)^⊤(z − z′) = (1/D)·(max_{y′∈Y} x^⊤Gy′ − min_{x′∈X} x′^⊤Gy) ≥ (c/D)·(‖y − Π_{Y*}(y)‖ + ‖x − Π_{X*}(x)‖) ≥ (c/D)·‖z − Π_{Z*}(z)‖,

meaning that SP-MS holds with β = 0. We break the proof into the following claims.

Claim 1. If X, Y are polytopes, then X* and Y* are also polytopes.

Proof of Claim 1.
Note that X* = {x ∈ X : max_{y∈Y} x^⊤Gy ≤ ρ}. Since Y is a polytope, the maximum is attained at vertices of Y. Therefore, X* can be equivalently written as {x ∈ X : max_{y∈V(Y)} x^⊤Gy ≤ ρ}, where V(Y) is the set of vertices of Y. Since the constraints of X* are all linear constraints, X* is a polytope.

With Claim 1, we can, without loss of generality, write X* as

X* = {x ∈ R^M : a_i^⊤x ≤ b_i for i = 1, . . . , L, and c_i^⊤x ≤ d_i for i = 1, . . . , K},

where the a_i^⊤x ≤ b_i constraints come from x ∈ X and the c_i^⊤x ≤ d_i constraints come from max_{y∈V(Y)} x^⊤Gy ≤ ρ. Below, we refer to a_i^⊤x ≤ b_i as the feasibility constraints, and c_i^⊤x ≤ d_i as the optimality constraints. In fact, one can identify the i-th optimality constraint as c_i = Gy^{(i)} and d_i = ρ, where y^{(i)} is the i-th vertex of Y. This is based on our construction of X* in the proof of Claim 1. Therefore, K = |V(Y)|.

Since Eq. (20) clearly holds for x ∈ X*, below, we focus on an x ∈ X\X*, and let x* = Π_{X*}(x). We say a constraint is tight at x* if a_i^⊤x* = b_i or c_i^⊤x* = d_i. Below we assume that there are ℓ tight feasibility constraints and k tight optimality constraints at x*. Without loss of generality, we assume these tight constraints correspond to i = 1, . . . , ℓ and i = 1, . . . , k respectively. That is,

a_i^⊤x* = b_i for i = 1, . . . , ℓ, and c_i^⊤x* = d_i for i = 1, . . . , k.

Claim 2. x violates at least one of the tight optimality constraints at x*.

Proof of Claim 2. We prove this by contradiction. Suppose that x satisfies all k tight optimality constraints at x*. Then x must violate some of the remaining K − k optimality constraints (otherwise x ∈ X*). Assume that it violates constraints K − n + 1, . . . , K for some 1 ≤ n ≤ K − k. Thus, we have the following: c_i^⊤x ≤ d_i for i = 1, . . . , K − n, and c_i^⊤x > d_i for i = K − n + 1, . . . , K. Recall also that c_i^⊤x* ≤ d_i for i = 1, . . . , K − n and c_i^⊤x* < d_i for all i = K − n + 1, . . . , K (strictly, since these constraints are not tight at x*).
Thus, there exists some x that lies strictly between x and x that makes all constraints hold (notice that x and x both satisfy all feasibility constraints), which contradicts with ΠX (x) = x . Claim 3. maxy Y x Gy ρ maxi {1,...,k} c i (x x ). Proof of Claim 3. Recall that we identify ci with Gy(i) and di = ρ. Therefore, max y Y x Gy ρ = max i {1,...,|V(Y)|} c i x di max i {1,...,k} c i x di = max i {1,...,k} c i (x x ), where the last equality is because c i x = di for i = 1, . . . , k. Recall from linear programming literature Davis (2016a;b) that the normal cone of X at x is expressed as follows: Nx = n x x : x RM, ΠX (x ) = x o = i=1 qici : pi 0, qi 0 The normal cone of X at x consists of all outgoing normal vectors of X originated from x . Clearly, x x belongs to Nx . However, besides the fact that x x is a normal vector of X , we also have the additional constraints that x X. We claim that in our case, x x lies in the following smaller cone (which is a subset of Nx ): Claim 4. x x belongs to i=1 qici : pi 0, qi 0, a j 0, j = 1, . . . , ℓ Proof of Claim 4. As argued above, x x Nx , and thus x x can be expressed as Pℓ i=1 piai + Pk i=1 qici with pi 0, qi 0. To prove that x x Mx , we only need to prove that it satisfies the additional constraints, that is, a i (x x ) 0, i = 1, . . . , ℓ. Published as a conference paper at ICLR 2021 This is shown by noticing that for all i = 1, . . . , ℓ, a i (x x ) = a i x bi + a i (x x ) (the i-th constraint is tight at x ) = a i (x + x x ) bi = a i x bi 0. (x X) Claim 5. x x can be written as Pℓ i=1 piai + Pk i=1 qici with 0 pi, qi C x x for all i and some problem-dependent constant C < . Proof of Claim 5. Notice that x x x x Mx (because 0 = x x Mx and Mx is a cone). Furthermore, x x x x {v RM : v 1}. Therefore, x x x x Mx {v RM : v 1}, which is a bounded subset of the cone Mx . Below we argue that there exists a large enough C > 0 such that ( ℓ X i=1 qici : 0 pi, qi C , i Mx {v RM : v 1} P. 
To see this, first note that P is a polytope. For every vertex v̂ of P, the smallest C such that v̂ belongs to the left-hand side is the solution of the following linear program:

min_{p_i, q_i, C_v̂} C_v̂  s.t.  v̂ = Σ_{i=1}^ℓ p_i a_i + Σ_{i=1}^k q_i c_i,  0 ≤ p_i, q_i ≤ C_v̂.

Since v̂ ∈ M_{x*}, this linear program is always feasible and admits a finite solution C_v̂ < ∞. Now let C = max_{v̂∈V(P)} C_v̂, where V(P) is the set of all vertices of P. Then since any v ∈ P can be expressed as a convex combination of points in V(P), v can also be expressed as Σ_{i=1}^ℓ p_i a_i + Σ_{i=1}^k q_i c_i with 0 ≤ p_i, q_i ≤ C.

To sum up, (x − x*)/‖x − x*‖ can be represented as Σ_{i=1}^ℓ p_i a_i + Σ_{i=1}^k q_i c_i with 0 ≤ p_i, q_i ≤ C. This further implies that x − x* can be represented as Σ_{i=1}^ℓ p_i a_i + Σ_{i=1}^k q_i c_i with 0 ≤ p_i, q_i ≤ C‖x − x*‖. Notice that C only depends on the set of tight constraints at x*.

Finally, we are ready to combine all previous claims and prove the desired inequality. Define A_i ≜ a_i^⊤(x − x*) and C_i ≜ c_i^⊤(x − x*). By Claim 5, we can write x − x* as Σ_{i=1}^ℓ p_i a_i + Σ_{i=1}^k q_i c_i with 0 ≤ p_i, q_i ≤ C‖x − x*‖, and thus

Σ_{i=1}^ℓ p_i A_i + Σ_{i=1}^k q_i C_i = (x − x*)^⊤(x − x*) = ‖x − x*‖².

On the other hand, since x − x* ∈ M_{x*} by Claim 4, we have Σ_{i=1}^ℓ p_i A_i = Σ_{i=1}^ℓ p_i a_i^⊤(x − x*) ≤ 0 and

Σ_{i=1}^k q_i C_i ≤ (max_{i∈{1,...,k}} C_i)·Σ_{i=1}^k q_i ≤ kC‖x − x*‖·max_{i∈{1,...,k}} C_i,

where in the first inequality we use the fact q_i ≥ 0, and in the second inequality we use the fact max_{i∈{1,...,k}} C_i > 0 (by Claim 2) and 0 ≤ q_i ≤ C‖x − x*‖. Combining the three inequalities above, we get

max_{i∈{1,...,k}} C_i ≥ ‖x − x*‖/(kC).

Then by Claim 3,

max_{y∈Y} x^⊤Gy − ρ ≥ max_{i∈{1,...,k}} C_i ≥ ‖x − x*‖/(kC).

Note that k and C only depend on the set of tight constraints at the projection point x*, and there are only finitely many different sets of tight constraints. Therefore, we conclude that there exists a constant c > 0 such that max_{y∈Y} x^⊤Gy − ρ ≥ c‖x − x*‖ holds for all x, which completes the proof.

H PROOF OF THEOREM 6 AND THEOREM 7

Proof of Theorem 6. Suppose that f is γ-strongly-convex in x and γ-strongly-concave in y, and let (x*, y*) ∈ Z*.
Then for any (x, y) we have f(x, y) f(x , y) xf(x, y) (x x ) γ f(x, y ) f(x, y) yf(x, y) (y y) γ Summing up the two inequalities, and noticing that f(x, y ) f(x , y) 0 for any (x , y ) Z , we get F(z) (z z ) γ and therefore, for z / Z , F(z) (z z ) which implies SP-MS with β = 0 and C = γ/2. Proof of Theorem 7. First, we show that f has a unique Nash Equilibrium z = (x , y ) = ((0, 1), (0, 1)). As f is a strictly monotone decreasing function with respect to y1, we must have y 1 = 0 and y 2 = 1. In addition, if x = (0, 1), maxy Y f(x, y) = miny Y y2n 1 = 0. If x = (0, 1), then by choosing y = (0, 1), f(x, y ) = x2n 1 > 0. Therefore, we have x = (0, 1), which proves that the unique Nash Equilibrium is x = (0, 1), y = (0, 1). Second, we show that f satisfies SP-MS with β = 2n 2. In fact, for any z = (x, y) = z , we have F(z) (z z ) = 2nx2n 1 1 y1 0 2ny2n 1 1 + x1 0 x1 x2 1 y1 y2 1 = 2n x2n 1 + y2n 1 4n x2 1 + y2 1 2 n (Jensen s inequality) = n 2n 2 x2 1 + y2 1 n . Note that z z = p x2 1 + (1 x2)2 + y2 1 + (1 y2)2 = p 2x2 1 + 2y2 1. Therefore, we have F (z) (z z ) z z n 22n 2 z z 2n 1. This shows that f satisfies SP-MS with β = 2n 2 and C = n 22n 2 . Published as a conference paper at ICLR 2021 I PROOF OF THEOREM 8 Proof of Theorem 8. As argued in Section 5, with Θt = bzt ΠZ (bzt) 2 + 1 16 bzt zt 1 2, ζt = bzt+1 zt 2 + zt bzt 2, we have (see Eq. (4)) Θt+1 Θt 15 16ζt. (21) Below, we relate ζt to Θt+1 using the SP-MS condition, and then apply Lemma 12 to show 2dist2(bz1, Z )(1 + C5) t if β = 0, 1 + 4 4 β 1 β dist2(bz1, Z ) + 2 2 C5β 1 β if β > 0, (22) where C5 = min n 16η2C2 2 o as defined in the statement of the theorem. This is enough to prove the theorem since dist2(zt, Z ) zt ΠZ (bzt+1) 2 2 bzt+1 ΠZ (bzt+1) 2 + 2 bzt+1 zt 2 32Θt+1 32Θt. Next, we prove Eq. (22). We first show a simple fact by Eq. 
(21): bzt+1 zt 2 ζt 16 Notice that 2 bzt+1 zt 2 + 1 2 bzt+1 zt 2 + zt bzt 2 2 bzt+1 zt 2 + 16η2 F(bzt+1) (bzt+1 z ) 2 + bzt+1 z 2 (Lemma 4) 2 bzt+1 zt 2 + 16η2C2 81 bzt+1 ΠZ (bzt+1) 2(β+1) (SP-MS condition) β) bzt+1 zt 2(β+1) + bzt+1 ΠZ (bzt+1) 2(β+1) (by Eq. (23)) β) bzt+1 zt 2 + bzt+1 ΠZ (bzt+1) 2 β+1 (by H older s inequality: (aβ+1 + bβ+1)(1 + 1)β (a + b)β+1) Θβ+1 t+1 (recall that C5 = min{ 16η2C2 = C Θβ+1 t+1 . (define C = min C5 2β , 1 Combining this with Eq. (21), we get Θt+1 Θt C Θβ+1 t+1 (24) When β = 0, Eq. (24) implies Θt+1 (1 + C5) 1Θt, which immediately implies Θt (1 + C5) t+1Θ1 2Θ1(1 + C5) t. When β > 0, Eq. (24) is of the form specified in Lemma 12 with p = β and q = C . Note that the second required condition is satisfied: C (β + 1)Θβ 1 β+1 2 4β 1. Therefore, by the conclusion of Lemma 12, Eq. (22) is then proven by noticing that Θ1 = dist2(bz1, Z ). Published as a conference paper at ICLR 2021 J PROOF OF THEOREM 9 Proof of Theorem 9. Consider the following 2 2 bilinear game with curved feasible sets: f(x, y) = x Gy = [x1 x2] 0 1 1 0 X = x : 0 x1 1 4, x2 x1 2 , Y = y : 0 y1 1 4, y2 y1 2 . Below, we use Claim 1 - Claim 5 to argue that if the two players start from x0 = y0 = bx0 = by0 = ( 1 4), and use any constant learning rate η 1 64, then the convergence is sublinear in the sense that zt z Ω(1/t). Then, in Claim 6, we show that in this example, SP-MS holds with β = 3. Claim 1. The unique equilibrium is x = 0, y = 0. When x = 0, clearly maxy Y f(x, y ) = 0. When x = 0, we prove maxy Y f(x, y ) > 0 below. If x1 = 0, we let y 1 = 1 2x1 and y 2 = 1 4x12 (which satisfies y Y), and thus f(x, y ) = x2y 1 x1y 2 = x1 2 1 If x1 = 0 but x2 = 0, we let y 1 = 1 4, and thus f(x, y ) = x2y 1 x1y 2 = 1 Thus, maxy Y f(x, y ) > 0 if x = 0, and x = 0 is the unique optimal solution for x. By the symmetry between x and y (because G = G ), we can also prove that the unique optimal solution for y is y = 0. Claim 2. Suppose that x0 = y0 = bx0 = by0 = ( 1 4). 
Then, at any step t [T], we have xt = yt and bxt = byt, and all xt, yt, bxt, byt belong to {u R2 : u2 = u2 1}. We prove this by induction. The base case trivially holds. Suppose that for step t, we have xt = yt, bxt = byt, and xt, yt, bxt, byt {u R2 : u2 = u2 1}. Then consider step t + 1. According to the dynamic of OGDA, we have bxt η yt,2 yt,1 bxt,1 + ηyt,2 bxt,2 ηyt,1 bxt+1 η yt,2 yt,1 bxt+1,1 + ηyt,2, bxt+1,2 ηyt,1 byt + η xt,2 xt,1 byt,1 + ηxt,2 byt,2 ηxt,1 byt+1 + η xt,2 xt,1 byt+1,1 + ηxt,2 byt+1,2 ηxt,1 According to induction hypothesis, we have bxt+1 = byt+1, which further leads to xt+1 = yt+1. Now we prove that for any x1 x2 such that x1 0, x2 1 4 and x2 < x12, satisfies that x2 1 = x2. Otherwise, suppose that x2 1 < x2. Then according to the intermediate value theorem, there exists ex1 ex2 that lies in the line segment of x1 x2 such that ex2 1 = ex2. Moreover, as x1 0, ex1 0, x2 1 4, we know that ex1 ex2 X. Therefore, we have ex x < x x , which leads to contradiction. Published as a conference paper at ICLR 2021 Now consider bxt+1. According to induction hypothesis, we have (bxt,1 + ηyt,2)2 bx2 t,1 = bxt,2 bxt,2 ηyt,1. If equalities hold, trivially we have bx2 t+1,1 = bx2 t,1 = bxt,2 = bxt+1,2 according to Eq. (25). Otherwise, as bxt,1 +ηyt,2 0, bxt,2 ηyt,1 1 4, according to the analysis above, we also have bx2 t+1,1 = bxt+1,2. Applying similar analysis to byt+1, xt+1 and yt+1 finishes the induction proof. Claim 3. With η 1 64, the following holds for all t 1, 2 bxt,1, 2bxt,1 bxt,1 bxt 1,1 4ηbx2 t 1,1, bxt 1,1 + 4ηbx2 t 1,1 . (27) We prove the claim by induction on t. The case t = 1 trivially holds. Suppose that Eq. (26) and Eq. (27) hold at step t. Now consider step t + 1. Induction to get Eq. (27). According to Claim 2, we have bxt η yt,2 yt,1 bxt,1 + ηx2 t,1 bx2 t,1 ηxt,1 and bxt+1 = (u, u2) for some u [0, 1/2]. 
Using the definition of the projection function, we have bxt+1,1 = argmin u [0, 1 n bxt,1 + ηx2 t,1 u 2 + bx2 t,1 ηxt,1 u2 2o argmin u [0, 1 Now we show that argminu [0, 1 2 ] g(u) = argminu R g(u). Note that g(u) = 2(u bxt,1 ηx2 t,1) + 4u u2 + ηxt,1 bx2 t,1 , (28) Therefore, when u > 1 2, using xt,1 1 g(u) > 2ηx2 t,1 + 2ηxt,1 0, (29) which means g(u) > g( 1 2). On the other hand, when u < 0, using bxt,1 1 g(u) < 2u 4ubx2 t,1 u < 0, (30) which means g(u) > g(0). Combining Eq. (29) and Eq. (30), we know that argminu [0, 1 2 ] g(u) = argminu R g(u). Therefore, bxt+1,1 is the unconstrained minimizer of convex function g(u), which means g(bxt+1,1) = 0. Below we use contradiction to prove that bxt+1,1 bxt,1 4ηbx2 t,1. If bxt+1,1 < bxt,1 4ηbx2 t,1, we use Eq. (28) and get g(bxt+1,1) = 2(bxt+1,1 bxt,1 ηx2 t,1) + 4bxt+1,1 bx2 t+1,1 + ηxt,1 bx2 t,1 < 2( 4ηbx2 t,1 ηx2 t,1) + 4bxt+1,1 ηxt,1 8ηbx3 t,1 + 16η2bx4 t,1 2 ηbx2 t,1 + 4bxt+1,1 2ηbxt,1 8ηbx3 t,1 + 16η2bx4 t,1 (Eq. (26)) 2 ηbx2 t,1 + 4bxt+1,1 2ηbxt,1 + 16η2bx4 t,1 2 ηbx2 t,1 + 4(bxt,1 4ηbx2 t,1) 2ηbxt,1 + 16η2bx4 t,1 (bxt+1,1 < bxt,1 4ηbx2 t,1) 2ηbx2 t,1 + 64η2bx5 t,1 32η2bx3 t,1 256η3bx6 t,1 2ηbx2 t,1 16η2bx3 t,1 256η3bx6 t,1 (bxt,1 1 Published as a conference paper at ICLR 2021 which leads to contradiction. Similarly, if bxt+1,1 > bxt,1 + 4ηbx2 t,1, we have g(bxt+1,1) = 2(bxt+1,1 bxt,1 ηx2 t,1) + 4bxt+1,1 bx2 t+1,1 + ηxt,1 bx2 t,1 > 2(4ηbx2 t,1 ηx2 t,1) + 4bxt+1,1 ηxt,1 + 8ηbx3 t,1 + 16η2bx4 t,1 0. (Eq. (26)) The calculations above conclude that bxt+1,1 bxt,1 4ηbx2 t,1, bxt,1 + 4ηbx2 t,1 . (31) Induction to get Eq. (26). Similarly, we have xt+1,1 = argmin u [0, 1 n bxt+1,1 + ηx2 t,1 u 2 + bx2 t+1,1 ηxt,1 u2 2o argmin u [0, 1 h(u) = 2(u bxt+1,1 ηx2 t,1) + 4u(u2 + ηxt,1 bx2 t+1,1), and h(xt+1,1) = 0. If xt+1,1 < 1 2 bxt+1,1, we have h(xt+1,1) = 2(xt+1,1 bxt+1,1 ηx2 t,1) + 4xt+1,1 x2 t+1,1 + ηxt,1 bx2 t+1,1 < bxt+1,1 2ηx2 t,1 3xt+1,1bx2 t+1,1 + 2ηbxt+1,1xt,1 (xt+1,1 < 1 0. 
(η 1 64, xt,1 1 If xt+1,1 > 2bxt+1,1, we also have h(xt+1,1) = 2(xt+1,1 bxt+1,1 ηx2 t,1) + 4xt+1,1 x2 t+1,1 + ηxt,1 bx2 t+1,1 > 2bxt+1,1 2ηx2 t,1 + 24bx3 t+1,1 + 8ηbxt+1,1xt,1 (xt+1,1 > 2bxt+1,1) 2bxt+1,1 2ηx2 t,1 + 24bx3 t+1,1 + 8η(bxt,1 4ηbx2 t,1)xt,1 (Eq. (31)) 2bxt+1,1 2ηx2 t,1 + 24bx3 t+1,1 + 8η( 1 2xt,1 4ηbx2 t,1)xt,1 (Eq. (26)) = 2bxt+1,1 + 2ηx2 t,1 + 24bx3 t+1,1 32η2bx2 t,1xt,1 2bxt+1,1 + 1 4ηbx2 t,1 + 24bx3 t+1,1 32η2bx2 t,1xt,1 (Eq. (26)) 0. (η 1 64, xt,1 1 Both lead to contradiction. Therefore, we conclude that xt+1 [ 1 2 bxt+1,1, 2bxt+1,1], which finishes the induction proof. Claim 4. xt,1 bxt,1 4ηbx2 t,1, for all t 1. The case t = 1 holds trivially. For t 2, we prove this by contradiction. Using the definition of the projection function, we have: xt+1,1 = argmin u [0, 1 n bxt+1,1 + ηx2 t,1 u 2 + bx2 t+1,1 ηxt,1 u2 2o argmin u [0, 1 Similar to the analysis in Claim 3, we have argminu [0, 1 2] h(u) = argminu R h(u), which means that h(xt+1,1) = 0. Note that η 1 64 and 0 bxt,1 1 2, according to Eq. (26) and Eq. (27), we have bxt+1,1 bxt,1 4ηbx2 t,1, bxt,1 + 4ηbx2 t,1 31 32 bxt,1, 33 which means that 2 bxt,1, 2bxt,1 33 bxt+1,1, 64 Published as a conference paper at ICLR 2021 If xt+1,1 < bxt+1,1 4ηbx2 t+1,1, we show that h(xt+1,1) < 0. In fact, h(xt+1,1) = 2(xt+1,1 bxt+1,1 ηx2 t,1) + 4xt+1,1 x2 t+1,1 + ηxt,1 bx2 t+1,1 < 2( 4ηbx2 t+1,1 ηx2 t,1) + 4xt+1,1 ηxt,1 8ηbx3 t+1,1 + 16η2bx4 t+1,1 5 ηbx2 t+1,1 + 4xt+1,1 31ηbxt+1,1 8ηbx3 t+1,1 + 16η2bx4 t+1,1 5 ηbx2 t+1,1 + 4xt+1,1 31ηbxt+1,1 + 16η2bx4 t+1,1 5 ηbx2 t+1,1 + 4(bxt+1,1 4ηbx2 t+1,1) 64 31ηbxt+1,1 + 16η2bx4 t+1,1 64η2bx5 t+1,1 32η2bx3 t+1,1 256η3bx6 t+1,1 16η2bx3 t,1 256η3bx6 t,1 (bxt,1 1 0, which leads to contradiction. Therefore, we show that xt,1 bxt,1 4ηbx2 t,1 for all t 1. Claim 5. If η 1 64, we have zt z Ω(1/t). Now we are ready to prove zt z Ω(1/t). First we show bxt,1 1 2t for all t 1 by induction. The case t = 1 trivially holds. Suppose that it holds at step t. 
Considering step t + 1, we have

x̂_{t+1,1} ≥ x̂_{t,1} − 4η·x̂_{t,1}² (Claim 3)
≥ x̂_{t,1} − (1/16)·x̂_{t,1}² (η ≤ 1/64)
≥ 1/(2t) − 1/(64t²) (x − (1/16)x² is increasing when x ≤ 8)
≥ 1/(2(t + 1)). (t ≥ 1)

Therefore, x̂_{t,1} ≥ 1/(2t) for all t ≥ 1. This, by Claim 4 and the analysis above, shows that x_{t,1} ≥ x̂_{t,1} − 4η·x̂_{t,1}² ≥ 1/(2(t + 1)). Note that according to Claim 1, x* = 0. Therefore, we have ‖z_t − z*‖ ≥ x_{t,1} ≥ 1/(2(t + 1)), which finishes the proof.

Claim 6. In this example, SP-MS holds with β = 3.

This can be seen by the following:

max_{z′∈Z} F(z)^⊤(z − z′)/‖z − z′‖ ≥ max_{z′∈Z} F(z)^⊤(z − z′)
= max_{x′∈X, y′∈Y} (x^⊤Gy′ − x′^⊤Gy) = max_{x′∈X, y′∈Y} {−x₁y′₂ + x₂y′₁ + x′₁y₂ − x′₂y₁}
≥ −x₁x₂² + x₂² + y₂² − y₂²y₁ (picking y′₁ = x₂, y′₂ = x₂², x′₁ = y₂, x′₂ = y₂²)
≥ (1/2)·x₂² + (1/2)·y₂² (x₁, y₁ ≤ 1/2)
≥ (1/4)·(x₁⁴ + x₂⁴ + y₁⁴ + y₂⁴) (x₂ ≥ max{x₁², x₂²}, y₂ ≥ max{y₁², y₂²})
≥ (1/16)·(x₁² + x₂² + y₁² + y₂²)² (Cauchy-Schwarz)
= (1/16)·‖z − z*‖⁴, (z* = (0, 0, 0, 0))

which shows that SP-MS holds with β = 3.
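The projected OGDA iteration analyzed above (a secondary sequence ẑ_{t+1} = Π_Z(ẑ_t − ηF(z_t)) and a played sequence z_{t+1} = Π_Z(ẑ_{t+1} − ηF(z_t)), as encoded by the first-order optimality conditions in the proof of Lemma 4) is easy to check numerically. The sketch below runs it on a 2×2 matching-pennies game over a product of simplices rather than the curved-set counterexample of Theorem 9; the game matrix, step size η = 0.05 (chosen to respect η ≤ 1/(8L)), iteration count, and starting points are illustrative choices, and `proj_simplex` is the standard sort-based Euclidean projection. Consistent with the paper's linear last-iterate guarantee for bilinear games over polytopes, the last iterate approaches the unique equilibrium (1/2, 1/2) for both players.

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def ogda_bilinear(G, eta=0.05, T=5000):
    """Projected OGDA for min_x max_y x^T G y over simplex x simplex.
    Both the secondary iterate zhat_{t+1} and the played iterate z_{t+1}
    are projections that reuse the gradient at the previous played point
    z_t, matching the optimality conditions in the proof of Lemma 4."""
    # Illustrative off-equilibrium starting points (not from the paper).
    xh, yh = np.array([0.9, 0.1]), np.array([0.2, 0.8])  # \hat{z}_1
    x, y = xh.copy(), yh.copy()                          # z_1
    for _ in range(T):
        gx, gy = G @ y, G.T @ x           # F(z_t) = (G y_t, -G^T x_t)
        xh = proj_simplex(xh - eta * gx)  # \hat{x}_{t+1}
        yh = proj_simplex(yh + eta * gy)  # \hat{y}_{t+1}
        x = proj_simplex(xh - eta * gx)   # x_{t+1}, same stale gradient
        y = proj_simplex(yh + eta * gy)   # y_{t+1}
    return x, y

if __name__ == "__main__":
    G = np.array([[1.0, -1.0], [-1.0, 1.0]])  # matching pennies
    x, y = ogda_bilinear(G)
    print(x, y)  # both should approach the equilibrium (0.5, 0.5)
```

Plain GDA on the same game cycles (as noted in the introduction); swapping the stale gradient F(z_t) for the current one in both updates reproduces that cycling, which makes this a convenient sanity check of the optimistic correction.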