# competitive_gradient_optimization__1f264dfd.pdf

Competitive Gradient Optimization

Abhijeet Vyas 1 Brian Bullins 1 Kamyar Azizzadenesheli 2

Abstract We study the problem of convergence to a stationary point in zero-sum games. We propose competitive gradient optimization (CGO), a gradientbased method that incorporates the interactions between two players in zero-sum games for its iterative updates. We provide a continuous-time analysis of CGO and its convergence properties while showing that in the continuous limit, previous methods degenerate to their gradient descent ascent (GDA) variants. We further provide a rate of convergence to stationary points in the discretetime setting. We propose a generalized class of α-coherent functions and show that for strictly αcoherent functions, CGO ensures convergence to a saddle point. Moreover, we propose optimistic CGO (o CGO), an optimistic variant, for which we show a convergence rate of O( 1

n) to saddle points for α-coherent functions.

1. Introduction We study the zero-sum simultaneous two-player optimization problem of the following form,

min x X f(x, y), max y Y f(x, y) (1)

where x and y are players moves with X Rm, Y Rn

and f is a scalar value map from X Y R. Such an optimization problem, also known as minimax optimization problems, has numerous applications in machine learning and decision theory, some examples including competitive Markov decision processes (Filar and Vrieze, 1996), e.g., game of Star Craft, Go, soccer, and car racing (Vinyals et al., 2019; Silver et al., 2016; Prajapat et al., 2021), adversarial and robust learning (Sinha et al., 2017; Namkoong and Duchi, 2016; Madry et al., 2017), generative adversarial networks (GAN) (Goodfellow et al., 2014; Radford et al., 2015; Arjovsky et al., 2017), and risk assessment (Artzner

1Department of Computer Science, Purdue University, West Lafayette, IN 47907 2Nvidia, Santa Clara, CA 95050. Correspondence to: Abhijeet Vyas <vyas26@purdue.edu>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

et al., 1999).

Gradient descent ascent (GDA) is the standard first-order approach to the minimax optimization problem in Eq. (1) and is known to converge for strictly-coherent functions (Mertikopoulos et al., 2019) which subsumes the strictly convexconcave function class (Facchinei and Pang, 2003). Yet, GDA cycles or diverges on simple functions with interactive terms between the players, e.g., a function like f(x, y) = y x (Mertikopoulos et al., 2019) which are not strictly-coherent. To tackle this issue, Schäfer and Anandkumar (2019) propose competitive gradient descent (CGD) which includes the bilinear approximation of the function as opposed to only the linear approximation used in GDA to formulate the local update. In this approach, despite being bilinear, the game approximation per player is linear. With this update, CGD is able to utilize the interaction terms to guarantee convergence in some non-convex non-concave problems rather than be impeded by them. Schäfer and Anandkumar (2019) further asks the question whether a local optimality result analogous to (Lee et al., 2016) for the minimization problem can be obtained for minimax optimization. Letcher (2020) answers this to the negative by constructing an example with a local-Nash equilibrium to which no reasonable algorithm can converge. Yet, the constructed function is one with discontinuous xyf the continuity of which is a key assumption in CGD. Diakonikolas et al. (2021) proposes a generalized version of the extragradient method that solves the problem in Eq. (1) for the weak-MVI condition which extends the MVI condition in (Mertikopoulos et al., 2019). This is achieved by decoupling the learning rates in the two steps of the extragradient method.

Daskalakis et al. (2018) extend the online learning algorithm optimistic mirror descent ascent (OMDA) (Rakhlin and Sridharan, 2013) to two player games and shows convergence of the method for all bilinear games of the form f(x, y) = y Ax (thereby for y x). Mertikopoulos et al. (2019) use the extragradient version of OMDA to show convergence for all coherent saddle points which includes the saddle points in bilinear-games of the form f(x, y) = y Ax. However, we show, CGD and OMDA (as defined in (Mertikopoulos et al., 2019)) reduce to GDA and mirror descent ascent (MDA) respectively in the continuoustime limit (gradient-flow). The continuous-time regime has

Competitive Gradient Optimization

given insights into the behavior of single-player optimization algorithms (Wilson et al.; Lee et al., 2016) and has been used to study games in (Mazumdar et al., 2020).

In light of this, we propose competitive gradient optimization (CGO), an optimization method that incorporates players interaction in order to come up with gradient updates. CGO considers a local linear approximation of the game and introduces the interaction terms in the linear model. The algorithm solves the problem in Eq. (1) for a class of α-coherent functions. This class is strictly larger than the coherent class in Mertikopoulos et al. (2019) and contains examples that are not in the weak-MVI class in Diakonikolas et al. (2021). At an iteration point (x, y), the CGO update is as follows,

arg min δx X δx xf + α

η δx 2 xyfδy + δy yf

arg max δy Y δy yf + α

η δy 2 yxfδx + δx xf

where the first term in the update is a local linear approximation of the game. The second term is the interaction term between players which is scaled with α to represent the importance of incorporating the interaction in the update. This scaling is analogous to scaling in Newton methods with varying learning rates. This approximation results in a local bilinear approximation of the game. Finally, η is the learning rate appearing in the penalty term. It is important noting that fixing the other player, the optimization for each player is a linear approximation of the game. Since the game approximation for each player is linear in its action, we consider this update yet a linear update. The solution to the CGO update is the following, δx δy

= ηgα := η I α xyf α yxf I

where gα is the gradient-based update at point (x, y). CGO is a generalization of its predecessors, in the sense that, setting α = 0 recovers GDA, and setting α = η recovers CGD. CGO gives greater flexibility for the updates in the hyper-parameters and gives rise to a distinct algorithm in continuous-time. In large-scale practical and deep learning settings, this update can be efficiently and directly computed using an optimized implementation of conjugate gradient and Hessian vector products.

Further, we introduce generalized versions of the Stampacchia and Minty variational inequality (Facchinei and Pang, 2003) and extend the definition of coherent sad-

dle points (Mertikopoulos et al., 2019) to α-coherent saddle points and show the convergence of CGO under αcoherence (which contains the bilinear function class and thus explains the success of CGD). Finally, we propose optimistic CGO which converges to the saddle points for α-coherent saddle point problems which are not strictly α-coherent.

Our main contributions are as follows:

We propose CGO that utilizes bilinear approximation of the game in Eq. (1) and accordingly weights the interaction terms between agents in the updates.

In order to study whether CGO provides a fundamentally new component, we study CGO s and its predecessors behaviors in continuous-time. We observe that in the limit of the learning rate approaching zero, i.e., continuous-time regime, the CGD and OMDA reduce to their GDA and MDA counterparts and CGO gives rise to a distinct update in the continuous-time.

Using the Lyapunov analysis in the continuous time regime, we show that while CGD and GDA converge for strictly convex-concave functions, CGO allows for a deviation below zero in eigenvalues of the pure Hessian block of the minimizer and above zero for that of the maximizer. A deviation from the convex-concave condition proportional to the lowest eigenvalue of the cross-terms of the Hessian is allowed.

We extend the definition of coherence function class (Mertikopoulos et al., 2019) to α-coherent functions for which we show the optimistic variant of CGO, optimistic CGO (o CGO) converges to saddle points with a rate of O 1

n while CGO converges to the saddle points which satisfy the strict α-coherence condition. We provide functions that are not coherent and explain the success of CGD in bilinear functions by setting α = η.

2. Related works Largely the algorithms proposed to solve the minimax optimization problem can be divided into two groups: those containing simultaneous update which solve a simultaneous game locally at each iteration and those containing sequential updates. While our work focuses on the simultaneous updates, sequential updates are relevant due to their close proximity and the fact that they often give rise to relevant solutions. We discuss the work done in the aforementioned categories below.

Sequential updates. A sequential version GDA in an alternating form is alternating gradient descent ascent (AGDA) is often time shown to be more stable than its simultaneous counter-part (Gidel et al., 2019; Bailey et al.,

Competitive Gradient Optimization

2020). Yang et al. (2020) introduces the 2-sided Polyak Lojasiewicz (PL)-inequality. The PL-inequality was first introduced by Polyak (1963) as a sufficient condition for gradient descent to achieve a linear convergence rate, Yang et al. (2020) shows that the same can be extended to achieve convergence of AGDA to saddle points, which are the only stationary points for the said functions. Yet, AGDA also cycles in several problems including bi-linear functions showing the persisting difficulty of cycling behavior for any GDA algorithm. To solve this problem, 2-time scale gradient descent ascent has been proposed (Heusel et al., 2017; Goodfellow et al., 2014; Metz et al., 2016; Prasad et al., 2015) which uses different learning rates for the descent and ascent. Heusel et al. (2017) proves its convergence to local Nash-equilibrium (saddle points). Jin et al. (2020) discusses the limit points of 2-time scale GDA by defining local minimax points, analogues of the local Nash-equilibrium in the sequential game setting and shows that for vanishing learning rate for the descent, 2-time scale GDA provably converges to local mini-max points. Another line of work concerns itself with finding stationary points of the function F(x) = maxy f(x, y), (Lin et al., 2020; Rafique et al., 2018; Nouiehed et al., 2019; Jin et al., 2019). The 2-time scale approaches mainly rely on the convergence of one player per update step of the other player, which makes these updates generally slow to converge.

Simultaneous updates. Simultaneous update methods preserve the simultaneous nature of the game at each step, such methods include OMDA (Daskalakis et al., 2018), its extra-gradient version (Mertikopoulos et al., 2019), Con Opt (Mescheder et al., 2017), CGD (Schäfer and Anandkumar, 2019), LOLA (Foerster et al., 2017), predictive update (Yadav et al., 2017) and symplectic gradient adjustment (Balduzzi et al., 2018). Of the above, (Daskalakis et al., 2018; Mertikopoulos et al., 2019; Foerster et al., 2017) are inspired from no-regret strategies formulated in (Rakhlin and Sridharan, 2013; Jadbabaie et al., 2015) based on follow the leader (Shalev-Shwartz and Singer, 2006; Grnarova et al., 2017) for online learning. (Schäfer and Anandkumar, 2019) uses the cross-term of the Hessian, while (Mescheder et al., 2017) uses the pure terms to come up with a second order update. (Balduzzi et al., 2018) proposes an update based on the asymmetric part of the game Hessian obtained from its Helmholtz decomposition. Some of these algorithms converge to stationary points that need not correspond to saddle points, Daskalakis and Panageas (2018) shows that GDA, as well as optimistic GDA, may converge to stationary points which are not saddle points. Con Opt is shown to converge to stationary points which are not local Nash equilibrium in the experiments (Schäfer and Anandkumar, 2019).

3. Preliminaries In this section, we describe the simultaneous minimax optimization problem and notations to express the properties of functions we use in the analysis. We discuss the class of α-coherent functions which extends the definition of coherence in (Mertikopoulos et al., 2019) and for different versions of which CGO and o CGO converge to the saddle point.

Throughout the paper we often denote the concatenation of the arguments x and y to be z := (x, y).

Definition 3.1 (First order stationary point). A point z = (x , y ) X Y is a stationary point of the optimization Eq. (1) if it satisfies the following,

xf(x , y ) = 0, yf(x , y ) = 0 (4)

We say a function f is L Lipschitz continuous if for any two points z1 := (x1, y1) X Y and z2 =: (x2, y2) X Y, it satisfies

|f(z1) f(z2)| L z1 z2 2

where | | denote the absolute value and 2 denote the corresponding 2-norm in the product space X Y. Similarly, for a given function f, we say it has L -Lipschitz continuous gradient if for any two points z1 := (x1, y1) X Y and z2 =: (x2, y2) X Y, it satisfies

f(z1) f(z2) 2 L z1 z2 2

And finally, we say a function has (Lxx, Lyy, Lxy)- Lipschitz continuous Hessian, if similarly, for any two points z1 := (x1, y1) X Y and z2 := (x2, y2) X Y the followings hold,

2 xxf(z1) 2 xxf(z2) 2 Lxx z1 z2 2 2 yyf(z1) 2 yyf(z2) 2 Lyy z1 z2 2 2 xyf(z1) 2 xyf(z2) 2 Lxy z1 z2 2

where all the norms are 2-norms with respect to their corresponding suitable definition of native spaces. We present the notation for the minimum and maximum value of matrices derived from the Hessian of f. The extrema are evaluated over the complete domain of f.

Table 1. Eigenvalue notations for matrices derived from 2nd

derivates to simplify notation

Matrix Min Eigenvalue Max Eigenvalue 2 xxf λxx λxx

2 yyf λyy λyy

2 xyf 2 yxf λxy λxy

2 yxf 2 xyf λyx λyx

Competitive Gradient Optimization

Further we define λ1 = max(λxx, λyy), λ2 = max(λxx, λyy). We also have λxy, λyx 0 since 2 xyf 2 yxf, 2 yxf 2 xyf are positive semi-definite.

Definition 3.2 (Bregman Divergence). The Bregman divergence with a strongly convex and differentiable potential function h is defined as

Bh(x, y) = h(x) h(y) x y, h(y)

Saddle point (SP). We define the solutions of the following problems to be min-max and max min saddle points respectively,

min-max saddle point: min x X max y Y f(x, y) (5)

max-min saddle point: max x X min y Y f(x, y) (6)

We now introduce modified forms of the Stampacchia and Minty variational inequalities and present the definition of α-coherent saddle point problems.

Definition 3.3 (α-Variational inequalities). α-coherence generalizes the definition of coherent saddle points in (Mertikopoulos et al., 2019) which sets α = 0. The definition of α-coherence hinges on the following two variational inequalities (gα as in Eq. (2)),

α-MV I : gα(x, y) (z z ) 0 for all z : (x, y) X Y

α-SV I : gα(x , y ) (z z ) 0 for all z : (x, y) X Y

Definition 3.4 (α-coherence). We say that min-max SP problem is α-coherent if,

Every solution of α-SV I is also a min-max SP

There exists a min-max SP, p that satisfies α-MV I

Every min-max SP, (x , y ) satisfies α-MV I locally, i.e., for all (x, y) sufficiently close to (x , y )

The α-coherent max-min SP problem is defined similarly.

In the above, if α-MV I holds as a strict inequality whenever x is not a solution thereof, SP problem will be called strictly α-coherent; by contrast, if α-MV I holds as an equality for all (x, y) X Y, we will say that the SP problem is null α-coherent. Note that in the unconstrained setting α-SVI is satisfied iff gα = 0 which occurs iff g0 = 0

4. Motivation In this section, we present the main motivations of our approach. The first is the popularity of the damped Newton

method (Algorithm 9.5, (Boyd and Vandenberghe, 2004)) which scales the second-order term in the Taylor-series expansion of the function to come up with the local update. The second is the observation that both CGD and OMDA reduce to GDA and MDA in the continuous-time limit which calls for a new algorithm that is distinct from GDA and MDA in continuous-time. The third is the observation that several functions give rise to (SP) problems which are strictly α-coherent, α > 0 but not strictly-coherent as defined in (Mertikopoulos et al., 2019) which coincides with α-coherence when we set α = 0

4.1. Adjustable learning rate Newton method The celebrated Newton method in one player optimization gives rise to an update which is the solution to the following local optimization problem. For a function f : X R, x X, X Rm, we have,

min δx X f δx + 1

2δx 2 xxfδx (7)

which does not have the notion of learning rate. However, prior work provides strong learning and regret guarantees, even in adversarial cases, for the adjusted Newton method where the Newton term is replaced with its weighted version α

2 δx 2 xxfδx (Hazan et al., 2007). This scaling allows for different learning updates that adjust how much the update weighs the second term. This is a similar approach taken in CGO update.

4.2. Continuous-time version of CGD and OMDA In this section, we analyze the continuous-time versions of CGD, OMDA, and CGO. We show that while, in continuous-time, CGD and OMDA reduce to their GDA and MDA counterparts, CGO gives rise to a distinct update.

CGD. Following the discussion in the introduction, CGD can be obtained by setting α = η in the CGO update rule. Doing so in Eq. (2), we obtain

δx = η I + η2 2 xyf 2 yxf 1 xf + η 2 xyf yf

δy = η I + η2 2 yxf 2 xyf 1 yf + η 2 yxf xf . (9)

For the continuous-time analysis, the learning rate η corresponds to the time discretization t with scaling factor β, i.e., η = β t. The ratios of the changes in x := δx and y := δy to η then become the time derivative of x and y in the limit η 0. Ergo, for CGD update we obtain,

x = β xf (10)

y = β yf (11)

Competitive Gradient Optimization

Where x = dx

dt , y = dy

dt are the time derivatives of x and y. This is the same as the update rule for GDA and the interaction information is lost in continuous-time.

OMDA. To present the updates for OMDA we first define the proximal map,

Pz(p) = arg min z X Y{ p, z z + Bh(z , z)} (12)

Where Bh is the Bregman Divergence with the potential function h. The OMDA update rule is then given by,

2 =Pzn( ηgn) = h 1( h(zn) ηgn)

=zn η ( h 1) gn + o(ηgn) (13)

zn+1 =Pzn( ηgn+ 1

2 ) = h 1( h(zn) ηgn+ 1

=zn η ( h 1) gn+ 1

2 + o(ηgn+ 1

where gn, gn+ 1

2 are the vector ( xf(x, y), yf(x, y)) evaluated at zn, zn+ 1

2 respectively. We now analyze the updates of OMDA in continuous-time. In the limit η 0 we have z

t = β( ( h) 1(z)) f(z) which is the same as the update rule of MDA and the effect of half time stepping vanishes in continuous-time. It is interesting to note that Diakonikolas et al. (2021) use different learning rates in the two steps of (13) and avoid this issue, which allows them to show convergence of the augmented OMDA for a larger class of weak-MVI functions.

CGO. Taking the continuous-time limit η 0 of the CGO updates Eq. (2) we obtain,

x = β I + α2 2 xyf 2 yxf 1 xf + α 2 xyf yf

y = β I + α2 2 yxf 2 xyf 1 yf + α 2 yxf xf , (15)

which is a distinct update from GDA and the interaction information is preserved in continuous-time. We simulate the continuous-time setting by using a very small learning rate and observe that while CGD cycles around the origin (Figure (7a)), CGO is able to take a somewhat direct path to the saddle point solution (Figure (1b)). This is an encouraging experiment, validating our hypothesis on the importance of CGO update.

4.3. Families of functions which give rise to α-coherent SP The following examples establish a few families of αcoherent functions. First we present the important result that all bi-linear games f = x Ay, are strictly αcoherent.

Example 4.1. All functions of the form f(x, y) = x Ay, A Rm n, give rise to strictly α-coherent minmax SP problems α > 0 and are null coherent for α = 0.

(a) CGD (b) CGO

Figure 1. Modeling of the continuous-time regime for f(x, y) = xy: CGD cycles while CGO takes a direct path. The exact analysis of the resulting ODE s is provided in Appendix E.1

Proof Sketch. The origin is the only saddle point of the above function, we evaluate SVI and α-SVI at the origin,

i) We have g0, z , g0 = (Ay, A x). Hence, g0, z = x Ay y A x = 0, (x, y) X Y

ii) Also we have

gα, z αλmin((I + α2AA ) 1AA ) x 2 (16)

+ αλmin((I + α2A A) 1A Ay) y 2 > 0

Where the final inequality follows from the fact that min(λmin(A A), λmin(AA )) > 0, A Rm n. See Appendix A for a detailed proof.

We present another family of functions parameterized by a scalar k. For k 0 the functions exhibit a min-max saddle point at the origin (a max-min saddle point is at ( , )), while for k < 0 the function has a max-min saddle point at the origin (a min-max saddle point is at ( , )). For both cases, origin satisfies the α-variational inequalities for α k, strictly for α > k.

Example 4.2. The family of functions fk(x, y) = k

2(x2 y2) + xy with k 0 gives rise to an α-coherent min-max SP problem for α = k and a strictly α-coherent min-max SP problem α > k. For k < 0, it gives rise to an αcoherent max-min SP problem for α = k and a strictly α-coherent max-min SP problem α > k

Proof Sketch. We evaluate the variational inequalities at the origin. For g0 we have

g0, z = x(kx + y) y( ky + x) = kx2 + ky2 > 0 (17)

For gα we have:

gα, z = k + α

1 + α2 (x2 + y2) > 0, α > k (18)

Competitive Gradient Optimization

Finally we present α-coherent functions that do not satisfy the weak-MVI condition and establish that the class is strictly larger than the coherent class.

Example 4.3. Consider the family of functions fk(x, y) = x2y+kxy. For k = 0 it gives rise to a min-max SP problem such that the region where the α-MVI is not satisfied, shrinks as α increases. For k = 1, it gives rise to a min-max SP problem which is α-coherent for large α. Furthermore, both the above mentioned problems do not satisfy the weak-MVI condition in Diakonikolas et al. (2021).

Proof Sketch. Since the Nash Equilibrium is at the origin for both the problems we evaluate the α-VIs at the origin and obtain the conditions for α-coherence as,

x2y + 2α(x4 + 2x2y2) 0 or y + 2α(x2 + 4y2) 0 since x can be non-zero.

x2y + α(x2(x + 1)(2x + 1) + y2(2x + 1)2) 0

The first condition is satisfied for all but sufficiently small (x, y), furthermore it is clear from the expression that the region where it is not satisfied reduces as α increases, this is also shown in Figure (6a). The second condition is satisfied for large α( 10) for the restricted domain x 1

3. We numerically verify this through plots in Appendix C

We numerically verify that the weak-MVI condition is not satisfied for both the above examples and demonstrate it through heat maps in the Appendix C.

5. Convergence results of CGO and the o CGO algorithm In this section we present the convergence results of our CGO algorithm. We first consider the convergence to stationary points and present the conditions and rate for the continuous-time and discrete-time regimes. Then, we state the convergence results of the CGO algorithm to strictly α-coherent saddle points. Then, we introduce the o CGO updates and present its rate of convergence to α-coherent saddle points. Finally we showcase the working of CGO and o CGO by simulating them on a few benchmark functions from the families presented in Subsection 4.3.

5.1. Convergence analysis in continuous-time We present our first result for convergence of CGO in continuous-time. We present the proof sketch and refer the readers to Appendix D for the complete proof. To highlight the difference in convergence rate and condition of CGO from GDA we also derive the conditions for convergence of GDA using a Lyapunov-style analysis. By carefully choosing the parameter α we show that we can accommodate arbitrary deviation from the strictly convexconcave condition which is required for the convergence of

continuous-time GDA.

Theorem 5.1. Continuous-time CGO runs on a twice differentiable function f with parameters α, β on functions satisfying λ > 0 where

λ := β min(2λxx 2αλxx 2 + c λxy 1 + α2λxy

2λyy 2αλyy 2 + c λyx 1 + α2λyx

converges exponentially to a stationary point with rate λ. Where c = β(α 2α2λ1 2α3λ2 2).

Proof Sketch. We choose g0 2 to be our Lyapnuov function where,

g0 := ( x(f(x, y), yf(x, y))

Evaluating the time derivative of g 2, we obtain

dt = 2g 0 g0

= 2 xf yf xxf xyf xyf yyf

= 2 x xxf xf + 2 xf xyf y

+ 2 y yyf yf + 2 yf xyf x (20)

By plugging in CGO updates and manipulating we show that d g0 2

where λ is as stated in the theorem. The detailed proof is in Appendix E.

To compare, we also derive the conditions for GDA Eq. (10) in continuous-time in Appendix D. We obtain

dt g0 2 min(λmin(2β xx), λmin( 2β yy))

= 2β g0 2 min(λxx, λyy) (21) For convergence, we require min(λxx, λyy) 0 which is the convex-concave condition.

This theorem implies that in the presence of interaction, particularly, when λxy 1+α2λxy and λyx 1+α2λyx are positive, it allows to break free from the convex-concave condition by appropriately setting α.

We set α such that λxx 1 5α; λxx 1

5α; λyx, λxy

K α2 ; λyy 1

5α; λyy 1 5α; K 1 which implies λ1, λ2 < 1 5α and we obtain λmin 1 50α. This shows

Competitive Gradient Optimization

that continuous-time CGO allows arbitrary deviation of λxx, λyy (from the convex-concave condition i.e. λxx 0, λyy 0), if λyx, λxy are proportional to the square of the deviation of the pure terms.

5.2. Convergence analysis in discrete-time Convergence to stationary points. We derive the conditions required for CGO to converge to a stationary point and show that large singular values of the interaction terms help in convergence. By tuning the hyperparameters we are able to control the influence of this interactive term and obtain faster convergence.

Theorem 5.2. CGO with parameters α and η when initialized in the neighborhood of a first-order stationary point z

on a Lipschitz-continuous and thrice differentiable function f that has Lipschitz-continuous gradients and Hessian and 1 λ > 0 where

λ := min(η(2λxx 210η + 8α

η λxx 2) + c λxy 1 + α2λxy

η(2λyy + 210η + 8α

η λyy 2) + c λyx 1 + α2λyx

converges exponentially to z with rate r(λ) = 1 λ. Where c is a polynomial function of η, α, λ1, λ2.

Similar to the continuous-time setting, the terms λxy 1+α2λxy ,

λyx 1+α2λyx are non-negative and appropriately choosing α and η allows us to tune c and obtain convergence for functions not satisfying the convex-concave condition. CGD restricts the flexibility of c by choosing α = η and CGO utilizes this extra degree of freedom granted by α to allow convergence for a larger class of functions. The proof of the above theorem is provided in Appendix G, for completeness we also provide the analysis of discrete time GDA in the Appendix F.

Convergence to strictly α-coherent saddle points. Now we discuss the convergence properties of CGO for the class of strictly α-coherent functions. The detailed proof can be found in Appendix H.

Theorem 5.3. Suppose that a Lipschitz-continuous function f has Lipschitz-continuous gradients and Hessian and gives rise to a strictly α-coherent SP. If CGO is run with perfect gradient and competitive Hessian oracles and parameter α and parameter sequence {ηn} such that P 1 η2 n < and P 1 ηn = , then the sequence of CGD iterates {zn}, converges to a solution of SP.

Convergence to α-coherent saddle points. For convergence to the saddle points for α-coherent functions which

are not strictly α-coherent, we propose the optimistic CGO algorithm.

Optimistic CGO The update rule is given by:

2 = Pzn( ηgα,n) (a) = zn ηgα,n (23)

zn+1 = Pzn( ηgα,n+ 1

2 ) (b) = zn ηgα,n+ 1

where gα,n, is as in (2) and η is the learning rate. Where a and b hold for the unconstrained setting .

Theorem 5.4. Suppose that a L-Lipschitz-continuous function f that has L -Lipschitz-continuous gradients and Lxy Lipschitz-continuous Hessian gives rise to an α-coherent SP. If o CGO is run with parameter α and parameter sequence {ηn} such that,

L 4+4L2xy L2 L 2

α2L2L2xy+L 2 2α4L2L 2L2xy α2L 4 α3L2 0L2 xy α2L2L2xy+L 2 , n

then the sequence of iterates zn converges to z where z := (x , y ) X Y is a saddle point. Moreover, the o CGO converges with the rate of 1

n, i.e., for the average of the gradients, we have,

k=1 gα,k 2 = O 1

The detailed proof of the above theorem is provided in Appendix H.2.

5.3. Simulation of CGO and o CGO on families from Section 4 We now evaluate the performance1 of CGO and o CGO on families discussed in examples (4.1) and (4.2). We first consider a function f(x, y) = x Ay, A R4 5, x R4, y R5 . We sample all the entries of A independently from a standard Gaussian, A = (aij), aij N(0, 1). We consider the plot of the L2 norm of x vs. that of y, since the only saddle point is the origin, the desired solution is x 2, y 2 0. We plot the iterates of CGO and o CGO for different α, and observe that o CGO converges to the saddle point for α 0 (at a very slow rate for α = 0) while CGO does so for α > 0. The results at α = 0 are that of GDA and optimistic GDA. We see similar results for the case where A is the scalar 1, i.e. f(x, y) = xy. This is in accordance with the analysis in example (4.1).

We then proceed to perform experiments on the family from example (4.2) for k = 2, 2. For both values of k we see that o CGO converges to the origin for α k and CGO

1The code for the experiments is available at Link to code

Competitive Gradient Optimization

2 1 0 1 2 x

α, η=-0.5,0.02 α, η=0,0.1 α, η=0.5,0.1 α, η=1,0.1 start

2 1 0 1 2 x

α, η=-0.5,0.02 α, η=0,0.1 α, η=0.5,0.1 α, η=1,0.1 start

(b) optimistic CGO

0.0 0.5 1.0 1.5 2.0 2.5 x 2

α, η = 0, 1e 2

α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 |x| 2

α, η = 0, 0.1 α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

(d) optimistic CGO

Figure 2. CGO and optimistic CGO on bilinear functions f(x, y) = xy, (x, y) R2 : (a,b) and f(x, y) = x Ay, x R4, y R5 : (c,d) for 100 iterations.

converges for α > k, following the analysis in example (4.2). For k = 2 the origin is a min-max saddle point, while for k = 2 it is a max-min saddle point. Finally we perform experiments on α-coherent functions from example (4.3) that do not satisfy the weak-MVI assumption.

6. Conclusion We propose the CGO algorithm which allows us to control the effect of the cross derivative term in CGD. This increases the size of the class of functions for which the algorithm converges. In the realm of continuous-time we observe that CGD reduces to GDA, CGO on the other hand gives rise to a distinct update which allows for a margin of deviation from the strictly convex-concave convergence condition of GDA. Furthermore, we generalize the definition of coherent saddle point problems defined in (Mertikopoulos et al., 2019) to α-coherent saddle points for which we prove convergence of Optimistic CGO and of CGO in the strict version of α-coherence, we show order O( 1

n) rate of the average gradients for CGO. Finally we present a short experiment study on some α-coherent functions. Future work would involve using CGO in various machine learning tasks such as GANs, competitive reinforcement learning (RL) and adversarial machine learning.

5 4 3 2 1 0 1 2 3 4 5 x

α, η = 3, 0.1 α, η = 2, 0.1 α, η = 1, 0.1 α, η = 0,0.1 α, η = 1,0.1 α, η = 2,0.1 start

5 4 3 2 1 0 1 2 3 4 5 x

α, η = 3, 0.1 α, η = 2, 0.1 α, η = 1, 0.1 α, η = 0,0.1 α, η = 1,0.1 α, η = 2,0.1 start

5 4 3 2 1 0 1 2 3 4 5 x

α, η = 2, 7e 3

α, η = 1, 7e 3

α, η = 0, 7e 3

α, η = 1, 7e 3

α, η = 2, 0.15 α, η = 3, 0.15 α, η = 4, 0.15 α, η = 5, 0.15 start

5 4 3 2 1 0 1 2 3 4 5 x

α, η = 2, 7e 3

α, η = 1, 7e 3

α, η = 0, 7e 3

α, η = 1, 7e 3

α, η = 2, 0.15 α, η = 3, 0.15 α, η = 4, 0.15 α, η = 5, 0.15 start

Figure 3. CGO and optimistic CGO on functions from the family f(x, y) = k

2 (x2 y2) xy. k = 2 : (a,b) and k = 2 : (c,d) for 100 iterations.

0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0

η=0.06 α=0.350 Start Nash

0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0

η=0.06 α=0.45 Start Nash

0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0

η=0.06 α=0.55 Start Nash

0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 x

η=0.06 α=0.350 Start Nash

0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 x

η=0.06 α=0.45 Start Nash

η=0.06 α=0.55 Start Nash

Figure 4. CGO and optimistic CGO on the function f(x, y) = x2y + xy from multiple initializations for 500 iterations with increasing α.

Competitive Gradient Optimization

2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 x

η=0.05 α=0.0 region start Nash Equillibrium

1.5 1.0 0.5 0.0 0.5 1.0 1.5 x

η=0.05 α=0.1 region start Nash Equillibrium

1.0 0.5 0.0 0.5 1.0 x

η=0.05 α=1.5 region start Nash Equillibrium

2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 x

η=0.05 α=0.0 region start Nash Equillibrium

1.5 1.0 0.5 0.0 0.5 1.0 1.5 x

η=0.05 α=0.1 region start Nash Equillibrium

1.0 0.5 0.0 0.5 1.0 x

α=1.5 region start Nash Equillibrium

Figure 5. CGO and optimistic CGO on the function f(x, y) = x2y from multiple initializations for 500 iterations with increasing α. The shrinking yellow region is where α-MVI is not satisfied.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214 223. PMLR, 2017.

P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath. Coherent measures of risk. Mathematical finance, 9(3):203 228, 1999.

J. P. Bailey, G. Gidel, and G. Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descentascent. In Conference on Learning Theory, pages 391 407. PMLR, 2020.

D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pages 354 363. PMLR, 2018.

S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.

C. Daskalakis and I. Panageas. The limit points of (optimistic) gradient descent in min-max optimization. Advances in Neural Information Processing Systems, 31, 2018.

C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018. URL https: //openreview.net/forum?id=SJJy Sbb AZ.

J. Diakonikolas, C. Daskalakis, and M. I. Jordan. Efficient methods for structured nonconvex-nonconcave min-max optimization. In International Conference on Artificial Intelligence and Statistics, pages 2746 2754. PMLR, 2021.

F. Facchinei and J.-S. Pang. Finite-dimensional variational inequalities and complementarity problems. Springer, 2003.

J. Filar and K. Vrieze. Competitive Markov Decision Processes. 01 1996. ISBN 978-1-4612-8481-9. doi: 10.1007/978-1-4612-4054-9.

J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch. Learning with opponentlearning awareness. ar Xiv preprint ar Xiv:1709.04326, 2017.

G. Gidel, R. A. Hemmat, M. Pezeshki, R. Le Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1802 1811. PMLR, 2019.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.

P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause. An online learning approach to generative adversarial networks. ar Xiv preprint ar Xiv:1706.03269, 2017.

E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169 192, 2007.

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

A. Jadbabaie, A. Rakhlin, S. Shahrampour, and K. Sridharan. Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pages 398 406. PMLR, 2015.

C. Jin, P. Netrapalli, and M. I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. ar Xiv preprint ar Xiv:1902.00618, 2019.

C. Jin, P. Netrapalli, and M. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pages 4880 4889. PMLR, 2020.

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In Conference on learning theory, pages 1246 1257. PMLR, 2016.

A. Letcher. On the impossibility of global convergence in multi-loss optimization. ar Xiv preprint ar Xiv:2005.12649, 2020.

Competitive Gradient Optimization

T. Lin, C. Jin, and M. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pages 6083 6093. PMLR, 2020.

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. 06 2017.

E. Mazumdar, L. J. Ratliff, and S. S. Sastry. On gradientbased learning in continuous games. SIAM Journal on Mathematics of Data Science, 2(1):103 131, 2020.

P. Mertikopoulos, B. Lecouat, H. Zenati, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=Bkg8jj C9KQ.

L. Mescheder, S. Nowozin, and A. Geiger. The numerics of gans. Advances in neural information processing systems, 30, 2017.

L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. ar Xiv preprint ar Xiv:1611.02163, 2016.

H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with fdivergences. Advances in neural information processing systems, 29, 2016.

M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 32, 2019.

B. T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864 878, 1963.

M. Prajapat, K. Azizzadenesheli, A. Liniger, Y. Yue, and A. Anandkumar. Competitive policy optimization. In Uncertainty in Artificial Intelligence, pages 64 74. PMLR, 2021.

H. Prasad, P. LA, and S. Bhatnagar. Two-timescale algorithms for learning nash equilibria in general-sum stochastic games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1371 1379, 2015.

A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ar Xiv preprint ar Xiv:1511.06434, 2015.

H. Rafique, M. Liu, Q. Lin, and T. Yang. Non-convex min max optimization: provable algorithms and applications in machine learning (2018). ar Xiv preprint ar Xiv:1810.02060, 2018.

A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993 1019. PMLR, 2013.

F. Schäfer and A. Anandkumar. Competitive gradient descent. Advances in Neural Information Processing Systems, 32, 2019.

S. Shalev-Shwartz and Y. Singer. Convex repeated games and fenchel duality. Advances in neural information processing systems, 19, 2006.

D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529:484 489, 01 2016. doi: 10.1038/nature16961.

A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. 10 2017.

O. Vinyals, I. Babuschkin, W. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. Agapiou, M. Jaderberg, and D. Silver. Grandmaster level in starcraft ii using multiagent reinforcement learning. Nature, 575, 11 2019. doi: 10.1038/s41586-019-1724-z.

A. Wilson, B. Recht, and M. Jordan. A lyapunov analysis of momentum methods in optimization (2016). ar Xiv preprint ar Xiv:1611.02635.

A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein. Stabilizing adversarial nets with prediction methods. ar Xiv preprint ar Xiv:1705.07364, 2017.

J. Yang, N. Kiyavash, and N. He. Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 33:1153 1165, 2020.

Competitive Gradient Optimization

In this section we present proofs for statements pertaining to example (4.1) and (4.2).

A. Proof of example (4.1) For clarity we restate the statement of the example (4.1). All functions of the form x Ay are strictly α-coherent α > 0 and are null coherent for α = 0.

Proof of example (4.1). In order to show the above mentioned statement, we first note that the origin is the only saddle point of this function. We now evaluate g0, z , where g0 = (Ay, A x). Hence, (x, y) X Y. we have,

g0, z = x Ay y A x = 0.

ergo, the function x Ay is null-coherent.

Similarly, we evaluate the α-SVI. We observe for the function x Ay,

gα = ((I + α2AA ) 1(Ay + αAA x), (I + α2A A) 1( A x + αA Ay))

Hence for gα, z we have,

gα, z = x (I + α2AA ) 1(Ay + αAA x) + y (I + α2A A) 1( A x + αA Ay) (25)

We further observe that, following the statement of Lemma (E.1), we have

(I + α2AA ) 1A = A(I + α2A A) 1,

and therefore, incorporating it in to the Eq. (25), we have,

x (I + α2AA ) 1Ay = x A(I + α2A A) 1y = y (I + α2A A) 1A x.

Thus, for gα, z we have,

gα, z =x (I + α2AA ) 1αAA x + y (I + α2A A) 1αA Ay

αλmin((I + α2AA ) 1AA ) x 2 + αλmin((I + α2A A) 1A Ay) y 2

Finally, observing that min(λmin(A A), λmin(AA )) 0 for any A, and following the statement in the Lemma (E.3) we also have,

αλmin((I + α2AA ) 1AA ) x 2 + αλmin((I + α2A A) 1A Ay) y 2 > 0, α > 0,

and hence gα, z > 0, α > 0. Ergo, the function x Ay is strictly α coherent.

B. Proof of example (4.2) Now, we restate the statement of the example (4.2). The family ofunctions fk(x, y) = k

2(x2 y2) + xy for k 0 gives rise to

min-max α-coherent SP problem when α = k,

min-max strictly α-coherent SP problem when α > k.

and for k < 0 the family gives rise to,

max min α-coherent SP problem when α = k,

max min strictly α-coherent SP problem when α > k.

Proof of example (4.2). We first note that the origin is the only saddle point of the above family. Further, the origin is a min-max saddle point when k 0 and a max min saddle point when k < 0.

For this family we evaluate gα, z ,

gα, z =x((1 + α2) 1(kx y α( x ky))) + y((1 + α2) 1(x + ky α(kx y)))

=(1 + α2) 1(kx2 xy + αx2 + αkxy + xy + ky2 αkxy + αy2)

Competitive Gradient Optimization

Simplifying this expression for α > k we obtain,

gα, z = k + α

1 + α2 (x2 + y2) > 0, α > k

Ergo, the above mentioned function class is strictly α-coherent when α > k. Furthermore, when α = k we have gα, z = 0, ergo the class is null α-coherent for α = k.

C. Proof of example (4.3) Beyond the explanation in the main text we provide numerically generated heat-maps for the weak-MVI condition for the counter-examples provided in example (4.3) and the α-coherence region for f(x, y) = x2y + xy, x 1

Figure 6. (a),(b) and (c) : where α-MVI condition is not satisfied for the function f(x, y) = x2y + xy for increasing α. (d) and (e) : weak-MVI condition for f(x, y) = x2y and f(x, y) = x2y + xy; x 1

A detailed simulation of the α-MVI condition for f(x, y) = x2y + xy; x 1

3 is available here.

D. Continuous time GDA In this section, we state the update rule for GDA and derive sufficient convergence conditions using Lyapunov analysis. The update rule of GDA is computed through the following optimization problem,

min δx Rm δx xf + δy yf + 1

max δy Rn δy yf + δx xf 1

2η δy δy. (26)

Which gives the following closed form update,

Competitive Gradient Optimization

where η is the learning rate. Taking the limit η 0 and scaling the flow of time with β we get the continuous time dynamics as follows,

where g0 = xf yf

is the concatenation of the gradients. Furthermore, for the second order curvature of this dynamics,

i.e., the gradient of g0, we have,

g0 = 2 xxf 2 xyf 2 yxf 2 yyf

For the Lyapunov analysis, we now choose g0 2 as our Lyapunov function and evaluate its time-derivative, i.e.,

g0 2 = d g0 2

dt = 2g 0 g0 = 2 xf yf 2 xxf 2 xy 2 yxf 2 yyf

= 2 x 2 xxf xf + 2 xf 2 xyf y + 2 y 2 yyf yf + 2 yf 2 yxf x

Using the update rule of GDA, i.e., Eq. (28), we substitute x and y in the above equation and have,

g0 2 = 2β xf 2 xxf xf + 2β yf 2 yyf yf

2β xf 2 xyf yf + 2β yf 2 yxf xf

= 2β xf 2 xxf xf ( 2β yf 2 yyf yf) (30)

For the right hand side, we know,

2β xf 2 xxf xf + ( 2β yf 2 yyf yf) λmin(2β 2 xxf) xf 2 + λmin( 2β 2 yyf) yf 2

Therefore, following the Eq. (30), we have,

g0 2 λmin(2β 2 xxf) xf 2 + λmin( 2β 2 yyf) yf 2

Resulting in the following Lyapunov key inequality,

g0 2 g0 2 min{λmin(2β 2 xxf), λmin( 2β 2 yyf)}

Since, for convex-concave functions, min{λmin(2β 2 xxf), λmin( 2β 2 yyf)} is always non-negative, which guarantees convergence of this dynamical system.

E. Continuous time CGO In this section, we first derive the continuous-time update rule of CGO and then show convergence by choosing the norm squared of the gradient of f as the Lyapunov function. Taking the CGO update rule, x y

= η I α 2 xyf α yxf I

Competitive Gradient Optimization

and taking the limit η 0, treating η as time, and scaling time with β, we get, x y

= β I α 2 xyf α 2 yxf I

We further simplify Eq. (31) by re-arranging the matrix inverse, x + α 2 xyf y α 2 yxf x + y

= β xf β yf

The above form will be useful in showing convergence. By solving for variable x, y, we get the explicit form,

x = β I + α2 2 xyf 2 yxf 1 x + α 2 xyf y

y = β I + α2 2 yxf 2 xyf 1 α 2 yxf x y (33)

We use this construction to prove Theorem (5.1).

Proof of Theorem (5.1). We choose g0 2 as our Lyapunov function and evaluate its time derivative to observe,

g0 2 = d g0 2

dt = 2g 0 g0

= 2 xf yf 2 xxf 2 xyf 2 yxf 2 yyf

= 2 x 2 xxf xf + 2 xf 2 xyf y + 2 y 2 yyf yf + 2 yf 2 yxf x (34)

Ignoring the factor 2, we expand the terms containing 2 xyf in Eq. (34) by replacing x and y using Eq. (33) as follows,

x 2 xyf yf + xf 2 xyf y = β x + α 2 xyf y I + α2 2 xyf 2 yxf 1 2 xyf yf

xf 2 xyfβ I + α2 2 yxf 2 xyf 1 α 2 yxf x y

= β xf I + α2 2 xyf 2 yxf 1 2 xyf yf

αβ yf 2 yxf I + α2 2 yxf 2 xyf 1 2 xyf yf

+ β xf 2 xyf I + α2 2 yxf 2 xyf 1 yf

αβ xf 2 xyf I + α2 2 yxf 2 xyf 1 2 yxf yf (35)

Using the equality proven in Lemma (E.1) we have,

x 2 xyf yf + xf 2 xyf y = αβ xf I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf

αβ yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf

Using the expanded terms in RHS of Eq. (35) back into Eq. (34), we obtain a unified expression,

g0 2 = 2 x 2 xxf xf 2αβ xf I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf

+ 2 y 2 yyf yf 2αβ yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf

We now observe that α 2 xyf y + β xf = x and α 2 yxf x + β yf = y, yielding in,

g0 2 = 2β xf 2 xxf xf 2αβ xf I + α2 2 xyf 2 yxf 1 2 xyf 2 xyf xf

2α y 2 yxf 2 xxf x (36)

+ 2β yf 2 yyf yf 2αβ yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf

+ 2α x 2 xyf 2 yyf yf (37)

Competitive Gradient Optimization

Substituting xf and yf in lines (36) and (37) with their equivalences in Eq. (32), we get,

g0 2 = 2β xf 2 xxf xf 2αβ xf I + α2 2 xyf 2 yxf 1 2 xyf 2 xyf xf

2α y 2 yxf 2 xxf( x + α 2 xyf y β ) (38)

+ 2β yf 2 yyf yf 2αβ yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf

+ 2α x 2 xyf 2 yyf y α 2 yxf x β (39)

Taking transpose of the final terms in lines (38) and (39), we obtain,

g0 2 = 2β xf 2 xxf xf 2αβ xf I + α2 2 xyf 2 yxf 1 2 xyf 2 xyf xf

β α x + α2 2 xyf y 2 xxf 2 xyf y

+ 2β yf 2 yyf yf 2αβ yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf

β α y α2 2 yxf x 2 yyf 2 yxf x (40)

We utilize the Peter-Paul inequality to further expand 2 xxf and 2 yyf terms in Eq. (40). In particular, we derive the following inequalities, 2 x 2 xxf 2 xyf y x 2 xxf 2 + 2 xyf y 2

and 2 x 2 xxf 2 xyf y x 2 xxf 2 + 2 xyf y 2.

Using these inequalities in Eq. (40), we have,

g0 2 2β xf 2 xxf xf 2αβ xf I + α2 2 xyf 2 yxf 1 2 xyf 2 xyf xf

β y 2 yxf αI + 2α2 2 xxf 2 xyf y + 2β yf 2 yyf yf

2αβ yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf + 1

β x 2 xyf αI 2α2 2 yyf 2 yxf x

β y 2 yyf 2 yyf y + α

β x 2 xxf 2 xxf x

Considering that 2 xxf and 2 yyf are symmetric matrices, we have,

g0 2 2β xf 2 xxf xf 2αβ xf I + α2 2 xyf 2 yxf 1 2 xyf 2 xyf xf

β x 2 xxf 2 xxf x + 1

β α + 2α2λxx 2 xyf y 2

+ 2β yf 2 yyf yf 2αβ yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf

β y 2 yyf 2 yyf y + 1

2 yxf x 2 (41)

Setting λ1 = max(λxx, λyy) we obtain,

g0 2 2β xf 2 xxf xf 2αβ xf I + α2 2 xyf 2 yxf 1 2 xyf 2 xyf xf

β x 2 xxf 2 xxf x + 1

β α + 2α2λ1 2 xyf y 2

+ 2β yf 2 yyf yf 2αβ yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf

β y 2 yyf 2 yyf y + 1

β α + 2α2λ1 2 yxf x 2 (42)

Competitive Gradient Optimization

Using the update rule in Eq. (33), we compute, 2 yxf x 2 = β2 xf + α 2 xyf yf I + α2 2 xyf 2 yxf 2 2 xyf 2 yxf xf + α 2 xyf yf

2 xyf y 2 = β2 yf + α 2 yxf xf I + α2 2 yxf 2 xyf 2 2 yxf 2 xyf yf + α 2 yxf xf .

by adding up the two equalities above, we obtain, 2 yxf x 2 + 2 xyf y 2 = β2 xf I + α2 2 xyf 2 yxf 2 2 xyf 2 yxf xf

+ β2 yf I + α2 2 yxf 2 xyf 2 2 yxf 2 xyf yf

+ αβ2 xf I + α2 2 xyf 2 yxf 2 2 xyf 2 yxf 2 xyf yf

+ αβ2 yf 2 yxf I + α2 2 yxf 2 xyf 2 2 xyf 2 yxf xf

αβ2 xf 2 xyf I + α2 2 xyf 2 yxf 2 2 yxf 2 xyf yf | {z } (i)

αβ2 yf I + α2 2 yxf 2 xyf 2 2 yxf 2 xyf 2 yxf xf | {z } (ii)

+ α2β2 xf 2 xyf I + α2 2 xyf 2 yxf 2 2 yxf 2 xyf 2 yxf xf | {z } (iii)

+ α2β2 yf 2 yxf I + α2 2 yxf 2 xyf 2 2 xyf 2 yxf 2 xyf yf | {z } (iv)

We further analyze the last four terms of the Eq. (43). In particular, we utilize the statement of Lemma (E.1) and for the term (i) in the above equality, we have,

αβ2 xf I + α2 2 xyf 2 yxf 2 2 xyf 2 yxf 2 xyf yf

= αβ2 xf 2 xyf I + α2 2 xyf 2 yxf 2 2 yxf 2 xyf yf

correspondingly, for the term (ii), we have,

αβ2 yf 2 yxf I + α2 2 yxf 2 xyf 2 2 xyf 2 yxf xf

= αβ2 yf I + α2 2 yxf 2 xyf 2 2 yxf 2 xyf 2 yxf xf

for the term (iii), we have,

α2β2 xf 2 xyf I + α2 2 xyf 2 yxf 2 2 yxf 2 xyf 2 yxf xf

= α2β2 xf I + α2 2 xyf 2 yxf 2 2 xyf 2 yxf 2 xyf 2 yxf xf

correspondingly, for the term (iv), we have,

α2β2 yf 2 yxf I + α2 2 yxf 2 xyf 2 2 xyf 2 yxf 2 xyf yf

= α2β2 yf I + α2 2 yxf 2 xyf 2 2 yxf 2 xyf 2 yxf 2 xyf yf

Putting these equalities together in Eq. (43), we have,

2 yxf x 2 + 2 xyf y 2

= β2 xf I + α2 2 xyf 2 yxf 2 2 xyf 2 yxf + α2 2 xyf 2 yxf 2 xyf 2 yxf xf

+ β2 yf I + α2 2 yxf 2 xyf 2 2 yxf 2 xyf + α2 2 yxf 2 xyf 2 yxf 2 xyf yf

= β2 xf I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf

+ β2 yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf (44)

Competitive Gradient Optimization

Plugging this into Eq. (42) we obtain,

g0 2 2β xf 2 xxf xf + β(2α2λ1 α) xf I + α2 2 xyf 2 yxf 1 2 xyf 2 xyf xf

β x 2 xxf 2 xxf x + α

β y 2 yyf 2 yyf y

+ 2β yf 2 yyf yf

+ β(2α2λ1 α) yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf (45)

We now do the following set of computations,

x 2 xxf 2 xxf x + y 2 yyf 2 yyf y (a) = ( α 2 xyf y β xf) 2 xxf 2 xxf( α 2 xyf y β xf)

+ (α 2 yxf x + β yf) 2 yyf 2 yyf(α 2 yxf x + β yf)

(b) = (α 2 xyf y + β xf) 2 xxf 2

+ (α 2 yxf x + β yf) 2 yyf 2

(c) 2α2 2 xyf y 2 2 xxf 2 + 2α2 2 yxf x 2 2 yyf 2

+ 2β2 xf 2 xxf 2 + 2β2 yf 2 yyf 2

(d) 2β2α2λxx 2 2 xyf y 2 + 2β2α2λyy 2 2 yxf x 2

+ 2β2λxx 2 xf xf + 2β2λyy 2 yf yf

(e) 2β2α2λ2 2( 2 xyf y 2 + 2 yxf x 2)

+ 2β2λxx 2 xf xf + 2β2λyy 2 yf yf

(f) 2β2α2λ2 2 xf I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf

+ 2β2α2λ2 2 yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf

+ 2β2λxx 2 xf xf + 2β2λyy 2 yf yf

Where for (a) we use Eq. (32) to substitute x and y, in (b) we re-write the terms as norms, in (c) we use the inequality a + b 2 2 a 2 + 2 b 2, in (d) we bound the terms using the maximum eigenvalues, in (e) we set λ2 = max(λxx, λyy) and finally for (f) we use Eq. (44).

Using the above inequality in Eq. (45), we have,

g0 2 xf (2β 2 xxf 2βαλxx 2I) xf + yf (2β 2 yyf + 2βαλyy 2I) yf

β(α 2α2λ1 2α3λ2 2) yf I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf

β(α 2α2λ1 2α3λ2 2) xf I + α2 2 xyf 2 yxf 1 2 xyf 2 xyf xf

By rearranging the above inequality, we get,

g0 2 xf (2β 2 xxf 2βαλxx 2I) + β(α 2α2λ1 2α3λ2 2) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf

yf (2β 2 yyf + 2βαλyy 2I) + β(α 2α2λ1 2α3λ2 2) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf

g0 2 min{λmin((2β 2 xxf 2βαλxx 2I) + β(α 2α2λ1 2α3λ2 2) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf),

λmin( (2β 2 yyf + 2βαλyy 2I) + β(α 2α2λ1 2α3λ2 2) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf)}

Competitive Gradient Optimization

which is the key Lyapunov inequality. Thus, under the conditions expressed in the statement of the main Theorem, i.e., λ, as defined in the following is positive,

λ := min{λmin((2β 2 xxf 2βαλxx 2I) + β(α 2α2λ1 2α3λ2 2) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf), (46)

λmin( (2β 2 yyf + 2βαλyy 2I) + β(α 2α2λ1 2α3λ2 2) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf)} (47)

the quantity g0 2 converges to zero exponentially fast with the rate at least λ.

Now, we simplify the above expression of the rate using Lemmas (E.2) and (E.3) to address the 1st and 2nd terms respectively in lines (46) and (47),

λmin β min{2λxx 2αλxx 2 + β(α 2α2λ1 2α3λ2 2) λxy 1 + α2λxy

2λyy 2αλyy 2 + β(α 2α2λ1 2α3λ2 2) λyx 1 + α2λyx

To better understand the above results, we set some relations between the quantities in the above expression. If we set α such that λxx 1 5α; λxx 1

5α; λyx, λxy K

5α; λyy 1 5α; K 1. We have λ1, λ1 1 5α and we obtain λmin 1 50α. This shows that as long as the interaction terms λyx, λxy are of the order of the square of the deviation of the pure terms λxx, λyy (from the convex-concave condition i.e. λxx 0, λyy 0), we can guarantee convergence for CGO

Statements and proofs of the Lemmas used in the above derivation are provided below,

Lemma E.1. The following equality holds,

2 yxf(I + α2 2 xyf 2 yxf) 1 = (I + α2 2 yxf 2 xyf) 1 2 yxf.

Proof. To prove this equality statement, we write,

2 yxf + α2 2 yxf 2 xyf 2 yxf = (I + α2 2 yxf 2 xyf) 2 yxf

and at the same time,

2 yxf + α2 2 yxf 2 xyf 2 yxf = 2 yxf(I + α2 2 xyf 2 yxf)

therefore, we have,

(I + α2 2 yxf 2 xyf) 2 yxf = 2 yxf(I + α2 2 xyf 2 yxf)

Multiplying both sides with the inverse of (I + α2 2 yxf 2 xyf) from the left, and the inverse of (I + α2 2 xyf 2 yxf) from the right results in,

2 yxf(I + α2 2 xyf 2 yxf) 1 = (I + α2 2 yxf 2 xyf) 1 2 yxf

which is the statement of the Lemma.

Lemma E.2. The following inequality holds, λmin(A + B) λmin(A) + λmin(B), A, B Sn +.

Proof. We know,

x 2λmin(A + B) x (A + B)x = x Ax + x Bx, x

the following also holds,

x Ax + x Bx x 2λmin(A) + x 2λmin(B), x

Choosing x not equal to zero we complete the proof.

Competitive Gradient Optimization

Lemma E.3. Let B Sn +, if (I+B) is invertible, λmin((I + B) 1B) λb 1+λb , where λb = λmin(B)

Proof. We can write the following,

(I + B) 1B = (I + B) 1B = I (I + B) 1

From the statement of Lemma (E.2) we can write,

λmin((I + B) 1B) λmin(I) + λmin( (I + B) 1)

Hence we have,

λmin((I + B) 1B) 1 + ( 1 1 + λb

) = λb 1 + λb

which is the statement of the Lemma.

E.1. Sample continuous time analysis For the function f(x, y) = xy. The continuous time equations for GDA/CGD are,

x = βy (48)

y = βx (49)

The solution to the above ODE is,

x = c1cos(βt) c2sin(βt) (50)

y = c1sin(βt) + c2cos(βt) (51)

For the aforementioned x and y we have,

x2 + y2 = c2 1 + c2 2 which is the equation of a circle indicating that the iterates circle around the nash equillibrium.

for CGO we have,

x = β 1 + α2 (y + αx) (52)

y = β 1 + α2 ( x + αy) (53)

The solution is,

x(t) = c1 e αβt

α2 + 1cos( βt α2 + 1) c2 e αβt

α2 + 1sin( βt α2 + 1) (54)

y(t) = c1 e αβt

α2 + 1sin( βt α2 + 1) + c2 e αβt

α2 + 1cos( βt α2 + 1) (55)

Thus x and y satisfy,

Competitive Gradient Optimization

x2 + y2 = (c2 1 + c2 2) e 2αβt

Indicating that the distance of the iterates from center falls exponentially.

The figure below illustrates the trajectories of the 2 algorithms.

2 1 0 1 2 x

α=0 (GDA/CGD) α=1 (CGO) α=2 (CGO) α=3 (CGO) start

(a) Exact Trajectories

Figure 7. The exact trajectories of GDA and CGO in continuous time with time scale β = 1, t [0, 2π] and starting point x0, y0 = 1, 1

F. Discrete time GDA In this section, we present the analysis of the discrete time GDA algorithm for completeness. We first present the optimization problem and then derive GDA convergence conditions and convergence rate.

To come up with the update rule, we solve the below optimization problem,

min δx Rm δx xf + δy yf + 1

max δy Rn δy yf + δx xf 1

2η δy δy. (56)

Which gives, x y

We now write the Taylor expansion of xf, yf around the (x, y),

xf( x + x, y + y) = xf(x, y) + 2 xxf x + 2 xyf y + Rx( x, y)

yf( x + x, y + y) = yf(x, y) + 2 yyf y + 2 yxf x + Ry( x, y)

where the remainder terms Rx and Ry are defined as,

Rx( x, y) ..=

2 xxf(t x + x, t y + y) 2 xxf x + 2 xyf(t x + x, t y + y) 2 xyf y dt (58)

Ry( x, y) ..=

2 yyf(t x + x, t y + y) 2 yyf y + 2 yxf(t x + x, t y + y) 2 yxf x dt

Competitive Gradient Optimization

Using this equality, we obtain,

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

= 2 x 2 xxf xf(x, y) + 2 xf(x, y) 2 xyf y + x 2 xxf 2 xxf x

+ 2 y 2 yyf yf(x, y) + 2 yf(x, y) 2 yxf x + y 2 yyf 2 yyf y+

+ y 2 yxf 2 xyf y + x 2 xyf 2 yxf x + 2 x 2 xxf 2 xyf y + 2 y 2 yyf 2 yxf x

+ 2 xf(x, y) Rx( x, y) + 2 x 2 xxf Rx( x, y) + 2 y 2 yxf Rx( x, y) + Rx( x, y) 2

+ 2 yf(x, y) Ry( x, y) + 2 y 2 yyf Ry( x, y) + 2 x 2 xyf Ry( x, y) + Ry( x, y) 2

Substituting x = η xf (x, y) and y = η yf (x, y) we obtain,

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

= 2η xf(x, y) 2 xxf xf(x, y) + 2η yf(x, y) 2 yyf yf(x, y)

+ 2η2 yf (x, y) 2 yyf 2 yxf xf (x, y) 2η2 xf (x, y) 2 xxf 2 xyf yf (x, y)

+ 2η xf(x, y) 2 xyf yf(x, y) | {z } (i)

2η yf(x, y) 2 yxf xf (x, y) | {z } (ii)

+ η2 yf (x, y) 2 yyf 2 yyf yf (x, y) + η2 xf(x, y) 2 xxf 2 xxf xf(x, y)

+ η2 yf (x, y) 2 yxf 2 xyf yf (x, y) + η2 xf (x, y) 2 xyf 2 yxf xf (x, y)

+ 2 xf(x, y) Rx( x, y) + 2 x 2 xxf Rx( x, y) + 2 y 2 yxf Rx( x, y) + Rx( x, y) 2

+ 2 yf(x, y) Ry( x, y) + 2 y 2 yyf Ry( x, y) + 2 x 2 xyf Ry( x, y) + Ry( x, y) 2

The terms (i) and (ii) in the RHS cancel out. Using the Cauchy-Schwarz inequality we obtain,

2 xf(x, y) Rx( x, y) 2 xf(x, y) Rx( x, y)

2 yf(x, y) Ry( x, y) 2 yf(x, y) Ry( x, y) (59)

Using the upper bounds on 2 xf(x, y) Rx( x, y) and 2 yf(x, y) Ry( x, y) derived in Eq. (59) we obtain,

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

= 2η xf(x, y) 2 xxf xf(x, y) + 2η yf(x, y) 2 yyf yf(x, y)

+ 2η2 yf (x, y) 2 yyf 2 yxf xf (x, y) 2η2 xf (x, y) 2 xxf 2 xyf yf (x, y)

+ 2η2 yf (x, y) 2 yyf 2 yyf yf (x, y) + 2η2 xf(x, y) 2 xxf 2 xxf xf(x, y)

+ η2 yf (x, y) 2 yxf 2 xyf yf (x, y) + η2 xf (x, y) 2 xyf 2 yxf xf (x, y)

+ 2 xf (x, y) Rx( x, y) + 2 yf (x, y) Ry( x, y)

+ 4 Rx( x, y) 2 + 4 Ry( x, y) 2

Rearranging we obtain,

Competitive Gradient Optimization

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

= xf(x, y) η2 2 xxf 2 2η 2 xxf + η2 2 xyf 2 yxf xf(x, y)

+ yf(x, y) η2 2 yyf 2 + 2η 2 yyf + η2 2 yxf 2 xyf yf(x, y)

+ 2η2 xf (x, y) 2 xyf 2 yyf 2 xxf 2 xyf yf (x, y)

+ 4 Rx( x, y) 2 + 4 Ry( x, y) 2 + 2 xf (x, y) Rx( x, y)

+ 2 yf (x, y) Ry( x, y)

To conclude, we need to bound the R terms. Using the Lipschitz-continuity of the Hessian and Eq. (58), we can bound the remainder terms as,

Rx( x, y) , Ry( x, y) Lxy( x + y )2 (60)

Using Eq. (57), we get,

x 2 + y 2 = η2( xf(x, y) 2 + yf(x, y) 2)

Hence we have,

Lxy( x + y )2 2Lxy( x 2 + y 2) 2η2Lxy( xf(x, y) 2 + yf(x, y) 2)

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

xf(x, y) η2 2 xxf 2 2η 2 xxf + 2η2 2 xyf 2 yxf

+ 4η2Lxy( xf(x, y) + yf(x, y) ) xf(x, y)

+ yf(x, y) η2 2 yyf 2 + 2η 2 yyf + 2η2 2 yxf 2 xyf

+ 4η2Lxy( xf(x, y) + yf(x, y) ) yf(x, y)

+ 2η2 xf (x, y) 2 xyf 2 yyf 2 xxf 2 xyf yf (x, y) | {z } (i)

+ 8η2Lxy( xf(x, y) 2 + yf(x, y) 2)

We further use the following inequality,

2b A a (a) 1

4(a (AA + I)a + b (A A + I)b)

(where in (a) we use the Peter-Paul inequality on both the terms) to bound the term (i) in the above inequality. We

Competitive Gradient Optimization

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

= xf(x, y) (η2 2 xxf 2 2η 2 xxf + 2η2 2 xyf 2 yxf)

+ I(8η2Lxy + 4η2Lxy( xf(x, y) + yf(x, y) )) xf(x, y)

+ yf(x, y) (η2 2 yyf 2 + 2η 2 yyf + 2η2 2 yxf 2 xyf)

+ I(8η2Lxy + 4η2Lxy( xf(x, y) + yf(x, y) )) yf(x, y)

+ η2 xf(x, y) ( 2 xyf 2 yyf 2 xxf 2 xyf)( 2 xyf 2 yyf 2 xxf 2 xyf) xf(x, y)

+ η2 yf(x, y) ( 2 xyf 2 yyf 2 xxf 2 xyf) ( 2 xyf 2 yyf 2 xxf 2 xyf) yf(x, y)

+ η2/2( xf(x, y) 2 + yf(x, y) 2)

This gives,

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 (1 λmin)( xf(x, y) 2 + yf(x, y) 2)

λmin = η min n λmin(2 2 xxf η( 2 xxf 2 2 2 xyf 2 yxf I(8Lxy + 4Lxy( xf + yf ) + 1

( 2 xyf 2 yyf 2 xxf 2 xyf)( 2 xyf 2 yyf 2 xxf 2 xyf) )),

λmin( 2 2 yyf η( 2 yyf 2 2 2 yxf 2 xyf I(8Lxy + 4Lxy( xf + yf ) + 1

( 2 xyf 2 yyf 2 xxf 2 xyf) ( 2 xyf 2 yyf 2 xxf 2 xyf))) o

Hence for 1 λmin > 0 we have exponentially fast convergence. For sufficiently small η, we have convergence for all strongly convex-concave functions with rate 1 λmin where,

λmin = η(min{λmin(2 2 xxf), λmin( 2 2 yyf)})

G. Discrete time CGO In this section, we restate the update rule for the CGO algorithm and then derive its convergence rate and a condition for convergence. Recall the update rule for CGO,

= η I α 2 xyf α 2 yxf I

The following form of the above equation will be useful in the proof,

x = η xf α 2 xyf y

y = η yf + α 2 yxf x (61)

Finally, writing the updates explicitly,

x = η I + α2 2 xyf 2 yxf 1 xf + α 2 xyf yf

y = η I + α2 2 yxf 2 xyf 1 yf α 2 yxf xf , (62)

Competitive Gradient Optimization

Proof of Theorem (5.2). Using the Taylor expansion of ( xf, yf), around the point (x, y) we obtain,

xf( x + x, y + y) = xf(x, y) + 2 xxf x + 2 xyf y + Rx( x, y)

yf( x + x, y + y) = yf(x, y) + 2 yyf y + 2 yxf x + Ry( x, y)

where the remainder terms Rx and Ry are defined as,

Rx( x, y) ..=

2 xxf(t x + x, t y + y) 2 xxf x + 2 xyf(t x + x, t y + y) 2 xyf y dt (63)

Ry( x, y) ..=

2 yyf(t x + x, t y + y) 2 yyf y + 2 yxf(t x + x, t y + y) 2 yxf x dt (64)

Using these equalities, we obtain the value of the difference between norm of the vector ( xf, yf) at points (x, y) and updated ones, ( x + xk, y + yk).

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

= 2 x 2 xxf xf(x, y) + 2 xf(x, y) 2 xyf y + x 2 xxf 2 xxf x + 2 x 2 xxf 2 xyf y

+ 2 y 2 yyf yf(x, y) + 2 yf(x, y) 2 yxf x + y 2 yyf 2 yyf y + 2 y 2 yyf 2 yxf x

+ y 2 yxf 2 xyf y + x 2 xyf 2 yxf x

+ 2 xf(x, y) Rx( x, y) + 2 x 2 xxf Rx( x, y) + 2 y 2 yxf Rx( x, y) + Rx( x, y) 2

+ 2 yf(x, y) Ry( x, y) + 2 y 2 yyf Ry( x, y) + 2 x 2 xyf Ry( x, y) + Ry( x, y) 2

We now observe using Eq. (61) that,

y 2 yxf 2 xyf y = y 2 yxf x + η xf(x, y)

x 2 xyf 2 yxf x = x 2 xyf y η yf(x, y)

Adding up Eq. (66) and Eq. (67) we obtain,

x 2 xyf 2 yxf x + y 2 yxf 2 xyf y = η

α( y 2 yxf xf(x, y) + x 2 xyf yf(x, y))

Substituting this into Eq. (65) yields,

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

= 2 x 2 xxf xf(x, y) + (2 η

α) xf(x, y) 2 xyf y + x 2 xxf 2 xxf x + 2 x 2 xxf 2 xyf y

+ 2 y 2 yyf yf(x, y) + (2 η

α) yf(x, y) 2 yxf x + y 2 yyf 2 yyf y + 2 y 2 yyf 2 yxf x

+ 2 xf(x, y) Rx( x, y) + 2 x 2 xxf Rx( x, y) + 2 y 2 yxf Rx( x, y) + Rx( x, y) 2

+ 2 yf(x, y) Ry( x, y) + 2 y 2 yyf Ry( x, y) + 2 x 2 xyf Ry( x, y) + Ry( x, y) 2

Competitive Gradient Optimization

We use the update rule of CGO, Eq. (62) to substitute x and y and observe that 2 yxf(I + 2 xyf 2 yxf) 1 = (I + 2 yxf 2 xyf) 1 2 yxf as stated in Lemma (E.1) to obtain the following equality,

x 2 xyf yf(x, y) + xf(x, y) 2 xyf y

= ηα xf(x, y) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf(x, y)

ηα yf(x, y) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf(x, y).

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

= 2 x 2 xxf xf(x, y) ηα(2 η

α) xf(x, y) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf(x, y)

+ 2 x 2 xxf 2 xyf y + 2 y 2 yyf 2 yxf x + x 2 xxf 2 xxf x + y 2 yyf 2 yyf y

+ 2 y 2 yyf yf(x, y) ηα(2 η

α) yf(x, y) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf(x, y)

+ 2 xf(x, y) Rx( x, y) + 2 x 2 xxf Rx( x, y) + 2 y 2 yxf Rx( x, y) + Rx( x, y) 2

+ 2 yf(x, y) Ry( x, y) + 2 y 2 yyf Ry( x, y) + 2 x 2 xyf Ry( x, y) + Ry( x, y) 2.

We now substitute x and y using Eq. (61) yielding,

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

= 2η xf(x, y) 2 xxf xf(x, y) + 2η yf(x, y) 2 yyf yf(x, y)

α) xf(x, y) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf(x, y)

+ x 2 xxf 2 xxf x + 2

(α + η) x | {z } (i)

+α2 2 xyf y

2 xxf 2 xyf y

+ y 2 yyf 2 yyf y + 2

(α + η) y | {z } (ii)

2 yyf 2 yxf x

α) yf(x, y) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf(x, y)

+ 2 xf(x, y) Rx( x, y) + 2 x 2 xxf Rx( x, y) | {z } (iii)

+2 y 2 yxf Rx( x, y) + Rx( x, y) 2

+ 2 yf(x, y) Ry( x, y) + 2 y 2 yyf Ry( x, y) | {z } (iv)

+2 x 2 xyf Ry( x, y) + Ry( x, y) 2 (68)

Now we use Peter-Paul inequality, and bound the terms (i) and (ii) respectively as follows,

η x 2 xxf 2 xyf y 8α + η

η x 2 xxf 2 + α + η

8η 2 xyf y 2

η y 2 yyf 2 yxf x 8α + η

η y 2 yyf 2 + α + η

8η 2 yxf x 2 (69)

and terms (iii) and (iv) as,

2 x 2 xyf Ry( x, y) Ry( x, y) 2 + 2 xyf y 2

2 y 2 yxf Rx( x, y) Rx( x, y) 2 + 2 yxf x 2 (70)

Competitive Gradient Optimization

Using the bounds obtained in Eqs. (69) and (70) we get,

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

2η xf(x, y) 2 xxf xf(x, y)

α) xf(x, y) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf(x, y)

η x 2 xxf 2 xxf x + y 2 yxf α + η

8η I + 2α2 2 xxf η

+ 2η yf(x, y) 2 yyf yf(x, y)

α) yf(x, y) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf(x, y)

η y 2 yyf 2 yyf y + x 2 xyf

8η I 2α2 2 yyf η

+ 2 xf(x, y) Rx( x, y) + 2 y 2 yxf Rx( x, y) | {z } (i)

+2 Rx( x, y) 2

+ 2 yf(x, y) Ry( x, y) + 2 x 2 xyf Ry( x, y) | {z } (ii)

+2 Ry( x, y) 2.

We use the Peter-Paul inequality to bound the term (i) as,

2 y 2 yxf Rx( x, y) 4 Rx( x, y) 2 + 1

4 2 yxf x 2

and the term (ii) as,

2 x 2 xyf Ry( x, y) 4 Ry( x, y) 2 + 1

4 2 xyf y 2

Substituting the above obtained bounds and noting that xxf and yyf are symmetric matrices we obtain,

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

2η xf(x, y) 2 xxf xf(x, y)

α) xf(x, y) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf(x, y)

η x 2 xxf 2 xxf x + α + 3η

8η + 2α2λxx

+ 2η yf(x, y) 2 yyf yf(x, y)

α) yf(x, y) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf(x, y)

η y 2 yyf 2 yyf y +

+ 2 xf(x, y) Rx( x, y) + 6 Rx( x, y) 2

+ 2 yf(x, y) Ry( x, y) + 6 Ry( x, y) 2

Using Eq. (61) to substitute x and y we compute, 2 yxf x 2 = η2 xf(x, y) + 2 xyf yf(x, y)

I + 2 xyf 2 yxf 2 2 xyf 2 yxf xf(x, y) + 2 xyf yf(x, y)

Competitive Gradient Optimization

2 xyf y 2 = η2 yf(x, y) + 2 yxf xf(x, y)

I + 2 yxf 2 xyf 2 2 yxf 2 xyf yf(x, y) + 2 yxf xf(x, y)

By adding up the two, we obtain,

2 yxf x 2 + 2 xyf y 2 = η2 xf(x, y) I + 2 xyf 2 yxf 2 2 xyf 2 yxf + 2 xyf 2 yxf xf(x, y)

+ η2 yf(x, y) I + 2 yxf 2 xyf 2 2 yxf 2 xyf + 2 yxf 2 xyf yf(x, y)

= η2 xf(x, y) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf(x, y)

+ η2 yf(x, y) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf(x, y) (71)

Setting λ1 = max(λxx, λyy) and using Eq. (71) to substitute 2 xyf y 2 + 2 yxf x 2, we have,

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

2η xf(x, y) 2 xxf xf(x, y) + 10η + 8α

η x 2 xxf 2 xxf x | {z } (i)

+ η 11η + 16α2λ1

8 α xf(x, y) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf(x, y)

+ 2η yf(x, y) 2 yyf yf(x, y) + 10η + 8α

η y 2 yyf 2 yyf y | {z } (ii)

+ η 11η + 16α2λ1

8 α yf(x, y) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf(x, y)

+ 2 xf(x, y) Rx( x, y) + 6 Rx( x, y) 2

+ 2 yf(x, y) Ry( x, y) + 6 Ry( x, y) 2 (72)

Substituting x and y from Eq. (61) we bound the sum of terms (i) and (ii) as follows,

x 2 xxf 2 xxf x + y 2 yyf 2 yyf y

= ( α 2 xyf y η xf(x, y)) 2 xxf 2 xxf( α 2 xyf y η xf(x, y))

+ (α 2 yxf x + η yf(x, y)) 2 yyf 2 yyf(α 2 yxf x + η yf(x, y))

= 2 xxf( 2 xyf y + η xf(x, y)) 2

+ 2 yyf(α 2 yxf x + η yf(x, y)) 2

2α2 2 xyf y 2 2 xxf 2 + 2α2 2 yxf x 2 2 yyf 2

+ 2η2 2 xxf xf(x, y) 2 + 2η2 2 yyf yf(x, y) 2

= 2α2 2 xyf y 2λxx 2 + 2α2 2 yxf x 2λyy 2

+ 2η2 2 xxf xf(x, y) 2 + 2η2 2 yyf yf(x, y) 2

Competitive Gradient Optimization

Setting λ2 = max(λxx, λyy) and using Eq. (71) to substitute 2 xyf y 2 + 2 yxf x 2,

x 2 xxf 2 xxf x + y 2 yyf 2 yyf y

2α2λ2 2( 2 xyf y 2 + 2 yxf x 2)

+ 2η2 2 xxf xf(x, y) 2 + 2η2 2 yyf yf(x, y) 2

2α2η2λ2 2 xf(x, y) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf(x, y)

+ 2α2η2λ2 2 yf(x, y) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf(x, y)

+ 2λxx 2 xf(x, y) xf(x, y)

+ 2λyy 2 yf(x, y) yf(x, y)

Substituting the above bound in Eq. (72) we obtain,

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

η xf(x, y) 2 2 xxf 210η + 8α

η λxx 2 xf(x, y)

+ η yf(x, y) 2 2 yyf + 210η + 8α

η λyy 2 yf(x, y)

+ η(11η + 16α2λ1

8 α) + 2(10η + 8α)α2ηλ2 2

( yf(x, y) I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf yf(x, y)

+ xf(x, y) I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf xf(x, y))

+ 2 xf(x, y) Rx( x, y) + 6 Rx( x, y) 2

+ 2 yf(x, y) Ry( x, y) + 6 Ry( x, y) 2. (73)

To conclude, we need to bound the R-terms. Using the Lipschitz-continuity of the Hessian, and equations Eq. (63) and Eq. (64) we can bound,

Rx( x, y) , Ry( x, y) Lxy( x + y )2 (74)

Using Eq. (61) we get,

( x 2 + y 2) =η2( xf(x, y) 2 + yf(x, y) 2)

+ 2ηα( xf(x, y) 2 yxf x + yf(x, y) 2 xyf y)

+ α2( 2 yxf x 2 + 2 xyf y 2)

From Eq. (71) we have,

α2( 2 yxf x 2 + 2 xyf y 2) =η2 xf(x, y) I + α2 2 xyf 2 yxf 1 α2 2 xyf 2 yxf xf(x, y)

+ η2 yf(x, y) I + α2 2 yxf 2 xyf 1 α2 2 yxf 2 xyf yf(x, y)

η2( xf(x, y) 2 + yf(x, y) 2) (75)

And observe,

Competitive Gradient Optimization

xf(x, y) 2 yxf x + yf(x, y) 2 xyf y = ( xf(x, y), yf(x, y)) ( 2 yxf x, 2 xyf y)

(c) ( xf(x, y), yf(x, y) ( 2 yxf x, 2 xyf y)

α( xf(x, y) 2 + yf(x, y) 2) (76)

Where in (c) we use the Cauchy-Schwarz inequality and in (d) we use the bound derived in Eq. (75). We then substitute x and y using Eq. (62) to obtain,

Lxy( x + y )2 2Lxy( x 2 + y 2)

2Lxy(η2( xf(x, y) 2 + yf(x, y) 2)

+ 2ηα ( xf(x, y) 2 yxf x + yf(x, y) 2 xyf y) | {z } (i)

+ α2 ( 2 yx x 2 + 2 xyf y 2)) | {z } (ii)

(e) 8η2Lxy( xf(x, y) 2 + yf(x, y) 2) (77)

Where in (e) we have used Eq. (76) to bound term (i) and Eq. (75) to bound (ii). Combining Eq. (74) and Eq. (77) we obtain,

Rx( x, y) , Ry( x, y) 8η2Lxy( xf(x, y) 2 + yf(x, y) 2) (78)

Also we have,

2 xf(x, y) Rx( x, y) + 2 yf(x, y) Ry( x, y)

(a) 2( xf(x, y) Rx( x, y) + yf(x, y) Ry( x, y) )

(b) 16Lxyη2( xf(x, y) + yf(x, y) )( xf(x, y) 2 + yf(x, y) 2) (79)

Where we use Cauchy-Schwarz inequality in (a) and Eq. (74) in (b). Finally we use the bounds in Eq. (78) and Eq. (79) to bound the terms containing Rx( x, y) and Ry( x, y) in Eq. (73) and further set k = η( 11η+16α2λ1

8 α)+2(10η +

8α)α2ηλ2 2 to obtain,

xf ( x + x, y + y) 2 + yf ( x + x, y + y) 2 xf(x, y) 2 yf(x, y) 2

xf(x, y) (η 2 2 xxf 210η + 8α

η λxx 2 + k I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf

16η2Lxy ( xf(x, y) + yf(x, y) )

384η4L2 xy( xf(x, y) 2 + xf(x, y) 2)) xf(x, y)

yf(x, y) ( η 2 2 yyf + 210η + 8α

η λyy 2 + k I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf

16η2Lxy ( xf(x, y) + yf(x, y) )

384η4L2 xy( xf(x, y) 2 + yf(x, y) 2)) yf(x, y)

Competitive Gradient Optimization

Rearranging we obtain,

( xf(x + x, y + y), yf(x + x, y + y) (1 λmin) ( xf(x, y), yf(x, y)

Thus for 1 λmin > 0 where,

λmin = min n λmin(η 2 2 xxf 210η + 8α

η λxx 2 + k I + α2 2 xyf 2 yxf 1 2 xyf 2 yxf

16η2L ( xf(x, y) + yf(x, y) )

384η4L2( xf(x, y) 2 + xf(x, y) 2)),

λmin( η 2 2 yyf + 210η + 8α

η λyy 2 + k I + α2 2 yxf 2 xyf 1 2 yxf 2 xyf

16η2L ( xf(x, y) + yf(x, y) )

384η4L2( xf(x, y) 2 + yf(x, y) 2) o

we have exponential convergence with rate (1 λmin).

Now, we simplify the above expression using Lemmas (E.2) and (E.3) to obtain,

λmin min n η(2λxx 210η + 8α

η λxx 2) + k λxy 1 + α2λxy

16η2L ( xf(x, y) + yf(x, y) )

384η4L2( xf(x, y) 2 + yf(x, y) 2),

η(2λyy + 210η + 8α

η λyy 2) + k λyx 1 + α2λyx

16η2L ( xf(x, y) + yf(x, y) )

384η4L2( xf(x, y) 2 + yf(x, y) 2) o

When initializing close to the stationary point, the Lipschitz-continuity of the gradient guarantees that the terms ( xf(x, y) + yf(x, y) ) and xf(x, y) 2 + yf(x, y) 2 are small and we have,

λmin min n η(2λxx 210η + 8α

η λxx 2) + k λxy 1 + α2λxy

η(2λyy + 210η + 8α

η λyy 2) + k λyx 1 + α2λyx

which is the statement of our Theorem.

H. Convergence for α-coherent functions H.1. CGO converges to a saddle point under strictly α-coherent functions

Proof of Theorem (5.3). We prove the convergence through contradiction. Let us assume that the algorithm does not converge to a saddle point. Let zn := (xn, yn) denote the parameters at the n th iterate of the algorithm. gα,n := gα(zn) denote the vector gα evaluated at zn. Let the set of saddle points be Z , and let all the iterates of the algorithm lie in a compact set C. Then from the assumption we have Z C = ϕ. Now from the definition of strict coherence we have gα,n, z z a for some a > 0 and z Z and , z C. Such a z is guaranteed by definition (3.4)(2nd point).

Recall the proximal map defined in Eq. (12),

z+ = Pz(y) = arg min z Z { y, z z + D(z , z)} = arg max z Z { y + h(z), z h(z )}. (80)

Competitive Gradient Optimization

Then for Bregmann Divergence Dh(x, y) with K-strongly convex potential function h and 2-norm . we have, (Mertikopoulos et al., 2019)(Proposition B.3),

D(p, z+) D(p, z) + y, z p + K

2 y 2. (81)

To obtain the CGO update we substitute y = ηngα,n, z = zn, z+ = zn+1, p = z , h = . 2 2 2 in Eq. (81) we get,

D(z , zn+1) = D(z , Pzn( ηngα,n)) D(z , zn) ηn gα,n, zn z + η2 n gα,n 2

Note that the above substitution in Eq.(81) is equivalent to CGO only in the interior of the domain. At the boundary we need an additional projection step since in Eq. (81) we only look within the domain for the minimum.

Since the saddle point is α-coherent we have gα,n, z z a for some a > 0.

D(z , zn+1) D(z , zn) ηna + η2 n gα,n 2

2 D(z , z0) (a Pn k=1 ηk 2

2 Pn k=1 ηk )

Since we have Pn k=1 ηk = and Pn k=1 ηk 2 < , we obtain limn Dn = , which is a contradiction since the divergence is positive. Hence CGO converges to a saddle point.

H.2. o CGO converges to a saddle point under α-coherent functions

Proof of Theorem (5.4). Let Pz(y) be as in Eq. (12) and z+ 1 = Pz(y1), z+ 2 = Pz(y2). We then have for Bregmann Divergence Dh(x, y) with K-strongly convex potential function h, 2-norm . and a fixed point p (Mertikopoulos et al., 2019)(Proposition B.4),

D(p, x+ 2 ) D(p, x) + y2, x+ 1 p + 1 2K y2 y1 2 K

2 x+ 1 x 2. (82)

Let p be a solution of the SP problem such that α-MVI holds z X Y, the existence of such a p is guaranteed via the definition of α-coherence Def. (3.4)(2nd point).

In order to obtain the o CGO update we substitute y1 = ηngα,n, y2 = ηngα,n+ 1

2 , x = zn, x+ 1 = zn+ 1

2 , x+ 2 = zn+1, p =

p and set h = . 2

2 (for this h we have K = 1),

D(x , zn+1) D(x , zn) ηn gα,n+ 1

2 x + η2 n 2 gα,n+ 1

From coherence condition we have,

D(p , zn+1) D(p , zn) + η2 n 2 gα,n+ 1

2 zn 2 (83)

Using Eq. (61) we get,

2 gα,n 2 g0,n+ 1

2 g0,n 2 + α2

η2n ( xy,n+ 1

2 xy,nf yn 2

2 xy,nf xn 2)

Where ( xn, yn) = ηngα,n, ( xn+ 1

2 ) = ηngα,n+ 1

2 and xy,nf, xy,n+ 1

2 f are the second order cross

Competitive Gradient Optimization

terms evaluated at zn, zn+ 1

2 . We can re-write the above as,

1 η2n ( xn+ 1

2 xn, yn+ 1

η2n xy,n+ 1

2 f yn + xy,n+ 1

2 f yn xy,nf yn 2

η2n xy,n+ 1

2 f xn + xy,n+ 1

2 f xn xy,nf xn 2

η2n ( xy,n+ 1

2 f 2 yn+ 1

2 yn 2 + yn 2 xy,n+ 1

2 f xy,nf 2)

η2n ( xy,n+ 1

2 f 2 xn+ 1

2 xn 2 + xn 2 xy,n+ 1

2 f xy,nf 2)

Using the Lipschitz continuity of the Hessian terms, setting α2 xy,n+ 1

2 f 2 = α2 xy,n+ 1

2 f 2 = α2L2 xy 1, and rearranging we get,

1 η2n xn+ 1

2 xn, yn+ 1

2 yn 2 1 1 xy,n+ 1

2 f 2α2 g0,n+ 1

η2n(1 xy,n+ 1

2 f 2α2)( yn 2 xy,n+ 1

2 f xy,nf 2)

η2n(1 xy,n+ 1

2 f 2α2)( xn 2 xy,n+ 1

2 f xy,nf 2)

η2n(1 xy,n+ 1

2 f 2α2) zn+ 1

η2n(1 xy,n+ 1

2 f 2α2)( xn 2 + yn 2) zn+ 1

Finally we have,

2 gα,n 2 L2 + L2 xyα2( xn 2 + yn 2) η2n(1 xy,n+ 1

2 f 2α2) zn+ 1

L2 + L2 xyα2(α + η)2 g0,n 2

η2n(1 xy,n+ 1

2 f 2α2) zn+ 1

2 zn 2 (84)

Substituting in Eq. (83) we get,

D(p , zn+1) D(p , zn) + zn+ 1

2 zn 2(η2 n L 2 + L2 xyα2(α + ηn)2 g0,n 2

2(1 xy,n+ 1

D(p , zn) + zn+ 1

2 zn 2(η2 n L 2 + L2 xyα2(α + ηn)2L2

2(1 xy,n+ 1

Hence if α satisfies the following,

Competitive Gradient Optimization

α4L2 xy L2 + α2L 2 1 < 0

or equivalently we have,

L 4 + 4L2xy L2 L 2

2L2xy L2 < α <

L 4 + 4L2xy L2 L 2

2L2xy L2 (86)

and also ηn satisfying the following,

α2L2L2xy + L 2 2α4L2L 2L2xy α2L 4 α3L2 0L2 xy

α2L2L2xy + L 2 (87)

η2 n L 2 + L2 xyα2(α + ηn)2L2

2(1 xy,n+ 1

and the divergence decreases at each step. By telescoping Eq. (85) we obtain,

2 zk 2(1 η2 k L 2 + L2 xyα2(α + ηk)2L2

2 2α2) ) 2D(x , z1). (88)

We also know zk+ 1

2 zk = ηkgα,k, thus for α and ηn satisfying Eq. (86) and Eq. (87), we have,

k=1 η2 k gα,k 2 2

nc D(x , z1). (89)

Where 1 η2 k L 2+L2 xyα2(α+ηk)2L2

2 2α2) > c, ηk. If we assume without loss of generality that ηk converges to η, then we have

from Eq. (89) that the average of gα,n and zn+ 1

2 zn falls with order O( 1

n) where n is the iteration count. Taking limit of zk+ 1

2 we have, z = lim k zk+ 1

2 = Pz ( ηgα(z )),

this implies z satisfies α-SVI and is hence a solution of the SP problem via definition of α-coherence Def. (3.4) (1st point ).

Coherence condition Def. (3.4)(3rd point) implies that α-MVI holds locally around z . Thus for, α and ηn satisfying Eq. (86) and Eq. (87) respectively, and sufficiently large n, we have,

D(z , zn+1) D(z , zn) + z zn 2(η2 n L 2 + L2 xyα2(α + ηn)2L2

2(1 xy,n+ 1

2) (a) D(z , zn)

Where the equality in (a) holds if and only if z = zn. Thus D(z , zn) is non-increasing and zn z which is a saddle point.

Competitive Gradient Optimization

I. Additional examples and simulations We now present some more simulations of CGO and o CGO on the function x Ay with multiple samples of the matrix A = (aij), aij N(0, 1).

0.0 0.5 1.0 1.5 2.0 2.5 3.0 x 2

α, η = 0, 1e 2

α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 |x| 2

α, η = 0, 0.1 α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 x 2

α, η = 0, 1e 2

α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 |x| 2

α, η = 0, 0.1 α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 x 2

α, η = 0, 1e 2

α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 |x| 2

α, η = 0, 0.1 α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 x 2

α, η = 0, 1e 2

α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 |x| 2

α, η = 0, 0.1 α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 x 2

α, η = 0, 1e 2

α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 |x| 2

α, η = 0, 0.1 α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 3.0 x 2

α, η = 0, 1e 2

α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 |x| 2

α, η = 0, 0.1 α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 x 2

α, η = 0, 1e 2

α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 |x| 2

α, η = 0, 0.1 α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 x 2

α, η = 0, 1e 2

α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

0.0 0.5 1.0 1.5 2.0 2.5 |x| 2

α, η = 0, 0.1 α, η = 1, 0.5 α, η = 2, 0.5 α, η = 3, 0.5 start

Figure 8. CGO and o CGO on bilinear function family f(x, y) = x Ay, x R4, y R5 for 100 iterations. In each row, the 1st and 2nd

as well as the 3rd and 4th figures correspond to the same sample of A