# Variance Reduction for Matrix Games

Yair Carmon, Yujia Jin, Aaron Sidford and Kevin Tian
Stanford University
{yairc,yujiajin,sidford,kjtian}@stanford.edu

**Abstract.** We present a randomized primal-dual algorithm that solves the problem $\min_x \max_y y^\top A x$ to additive error $\epsilon$ in time $\mathrm{nnz}(A) + \sqrt{\mathrm{nnz}(A)\,n}/\epsilon$, for matrix $A$ with larger dimension $n$ and $\mathrm{nnz}(A)$ nonzero entries. This improves the best known exact gradient methods by a factor of $\sqrt{\mathrm{nnz}(A)/n}$ and is faster than fully stochastic gradient methods in the accurate and/or sparse regime $\epsilon \le \sqrt{n/\mathrm{nnz}(A)}$. Our results hold for $x, y$ in the simplex (matrix games, linear programming) and for $x$ in an $\ell_2$ ball and $y$ in the simplex (perceptron / SVM, minimum enclosing ball). Our algorithm combines Nemirovski's "conceptual prox-method" and a novel reduced-variance gradient estimator based on sampling from the difference between the current iterate and a reference point.

## 1 Introduction

Minimax problems, or games, of the form $\min_x \max_y f(x,y)$ are ubiquitous in economics, statistics, optimization and machine learning. In recent years, minimax formulations for neural network training rose to prominence [15, 23], leading to intense interest in algorithms for solving large scale minimax games [10, 14, 20, 9, 18, 24]. However, the algorithmic toolbox for minimax optimization is not as complete as the one for minimization. Variance reduction, a technique for improving stochastic gradient estimators by introducing control variates, stands as a case in point. A multitude of variance reduction schemes exist for finite-sum minimization [cf. 19, 34, 1, 4, 12], and their impact on complexity is well-understood [43]. In contrast, only a few works apply variance reduction to finite-sum minimax problems [3, 39, 5, 26], and the potential gains from variance reduction are not well-understood.

We take a step towards closing this gap by designing variance-reduced minimax game solvers that offer strict runtime improvements over non-stochastic gradient methods, similar to those of optimal variance reduction methods for finite-sum minimization. To achieve this, we focus on the fundamental class of bilinear minimax games,

$$\min_{x\in\mathcal{X}} \max_{y\in\mathcal{Y}}\; y^\top A x, \quad \text{where } A \in \mathbb{R}^{m\times n}.$$

In particular, we study the complexity of finding an $\epsilon$-approximate saddle point (Nash equilibrium), namely $x, y$ with

$$\max_{y'\in\mathcal{Y}} (y')^\top A x - \min_{x'\in\mathcal{X}} y^\top A x' \le \epsilon.$$

In the setting where $\mathcal{X}$ and $\mathcal{Y}$ are both probability simplices, the problem corresponds to finding an approximate (mixed) equilibrium in a matrix game, a central object in game theory and economics. Matrix games are also fundamental to algorithm design due in part to their equivalence to linear programming [8]. Alternatively, when $\mathcal{X}$ is an $\ell_2$ ball and $\mathcal{Y}$ is a simplex, solving the corresponding problem finds a maximum-margin linear classifier (hard-margin SVM), a fundamental task in machine learning and statistics [25]. We refer to the former as an $\ell_1$-$\ell_1$ game and the latter as an $\ell_2$-$\ell_1$ game; our primary focus is to give improved algorithms for these domains.
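In the simplex-simplex case the approximation criterion above is easy to evaluate, since the inner best responses are attained at vertices of the simplex. The following minimal numpy check illustrates this; the function name and interface are ours, not part of the paper:

```python
import numpy as np

def simplex_game_gap(A, x, y):
    """Duality gap of (x, y) for min_{x in Delta^n} max_{y in Delta^m} y^T A x.

    A best response for y is a simplex vertex, so max_{y'} (y')^T A x = max_i [Ax]_i,
    and similarly min_{x'} y^T A x' = min_j [A^T y]_j.
    """
    return np.max(A @ x) - np.min(A.T @ y)

# (x, y) is an eps-approximate equilibrium iff simplex_game_gap(A, x, y) <= eps.
```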
### 1.1 Our Approach

Our starting point is Nemirovski's "conceptual prox-method" [28] for solving $\min_{x\in\mathcal{X}} \max_{y\in\mathcal{Y}} f(x,y)$, where $f: \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ is convex in $x$ and concave in $y$. The method solves a sequence of subproblems parameterized by $\alpha > 0$, each of the form

$$\text{find } x, y \text{ s.t. } \forall x', y':\quad \langle \nabla_x f(x,y),\, x - x'\rangle - \langle \nabla_y f(x,y),\, y - y'\rangle \le \alpha V_{x_0}(x') + \alpha V_{y_0}(y') \qquad (1)$$

for some $(x_0, y_0) \in \mathcal{X}\times\mathcal{Y}$, where $V_a(b)$ is a norm-suitable Bregman divergence from $a$ to $b$: squared Euclidean distance for $\ell_2$ and KL divergence for $\ell_1$. Combining each subproblem solution with an extragradient step, the prox-method solves the original problem to $\epsilon$ accuracy by solving $\widetilde{O}(\alpha/\epsilon)$ subproblems.¹ (Solving (1) with $\alpha = 0$ is equivalent to solving $\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} f(x,y)$.)

Our first contribution is showing that if a stochastic unbiased gradient estimator $\tilde g$ satisfies the "variance bound"

$$\mathbb{E}\,\|\tilde g(x,y) - \nabla f(x_0,y_0)\|_*^2 \le L^2\|x - x_0\|^2 + L^2\|y - y_0\|^2 \qquad (2)$$

for some $L > 0$, then $O(L^2/\alpha^2)$ regularized stochastic mirror descent steps using $\tilde g$ solve (1) in a suitable probabilistic sense. We call unbiased gradient estimators that satisfy (2) "centered."

Our second contribution is the construction of centered gradient estimators for $\ell_1$-$\ell_1$ and $\ell_2$-$\ell_1$ bilinear games, where $f(x,y) = y^\top A x$. Our $\ell_1$ estimator has the following form. Suppose we wish to estimate $g^x = A^\top y$ (the gradient of $f$ w.r.t. $x$), and we already have $g_0^x = A^\top y_0$. Let $p \in \Delta^m$ be some distribution over $\{1,\ldots,m\}$, draw $i \sim p$ and set

$$\tilde g^x = g_0^x + \frac{[y]_i - [y_0]_i}{p_i}\, A_{i:},$$

where $A_{i:}$ is the $i$th column of $A^\top$. This form is familiar from variance reduction techniques [19, 44, 1], which typically use a fixed distribution $p$. In our setting, however, a fixed $p$ will not produce sufficiently low variance. Departing from prior variance-reduction work and building on [16, 6], we choose $p$ based on $y$ according to

$$p_i(y) = \frac{|[y]_i - [y_0]_i|}{\|y - y_0\|_1},$$

yielding exactly the variance bound we require. We call this technique "sampling from the difference." For our $\ell_2$ gradient estimator, we sample from the squared difference, drawing the $\mathcal{X}$-block coordinate $j \sim q$, where $q_j(x) = ([x]_j - [x_0]_j)^2/\|x - x_0\|_2^2$.

To strengthen our results for $\ell_2$-$\ell_1$ games, we consider a refined version of the centered criterion (2) which allows regret analysis using "local norms" [37, 6]. To further facilitate this analysis we follow [6] and introduce gradient clipping. We extend our proofs to show that stochastic regularized mirror descent can solve (1) despite the (distance-bounded) bias caused by gradient clipping. Our gradient estimators attain the bound (2) with $L$ equal to the Lipschitz constant of $\nabla f$; specifically,

$$L = \begin{cases} \max_{ij} |A_{ij}| & \text{in the } \ell_1\text{-}\ell_1 \text{ setup,} \\ \max_i \|A_{i:}\|_2 & \text{in the } \ell_2\text{-}\ell_1 \text{ setup.} \end{cases} \qquad (3)$$

### 1.2 Method complexity compared with prior art

As per the discussion above, to achieve accuracy $\epsilon$ our algorithm solves $\widetilde{O}(\alpha/\epsilon)$ subproblems. Each subproblem takes $O(\mathrm{nnz}(A))$ time for computing two exact gradients (one for variance reduction and one for an extragradient step), plus an additional $(m+n)L^2/\alpha^2$ time for the inner mirror descent iterations, with $L$ as in (3). The total runtime is therefore

$$\widetilde{O}\left(\left(\mathrm{nnz}(A) + \frac{(m+n)L^2}{\alpha^2}\right)\frac{\alpha}{\epsilon}\right).$$

¹More precisely, the required number of subproblem solutions is at most $\alpha\Theta/\epsilon$, where $\Theta$ is a "domain size" parameter that depends on $\mathcal{X}$, $\mathcal{Y}$, and the Bregman divergence $V$ (see Section 2). In the $\ell_1$ and $\ell_2$ settings considered in this paper, we have the bound $\Theta \le \log(nm)$ and we use the $\widetilde{O}$ notation to suppress terms logarithmic in $n$ and $m$. However, in other settings, e.g., $\ell_\infty$-$\ell_1$ games [cf. 38, 40], making the parameter $\Theta$ scale logarithmically with the problem dimension is far more difficult.

By setting $\alpha$ optimally to be $\max\{\epsilon,\; L\sqrt{(m+n)/\mathrm{nnz}(A)}\}$, we obtain the runtime

$$\widetilde{O}\left(\mathrm{nnz}(A) + \sqrt{\mathrm{nnz}(A)(m+n)}\cdot L\,\epsilon^{-1}\right). \qquad (4)$$
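The choice of $\alpha$ balances the two terms in the per-subproblem cost. Suppressing the $\Theta$ and logarithmic factors, the calculation behind (4) is:

$$\left(\mathrm{nnz}(A) + \frac{(m+n)L^2}{\alpha^2}\right)\frac{\alpha}{\epsilon} = \frac{\mathrm{nnz}(A)\,\alpha}{\epsilon} + \frac{(m+n)L^2}{\alpha\,\epsilon} \quad\text{is minimized at}\quad \alpha^\star = L\sqrt{\frac{m+n}{\mathrm{nnz}(A)}},$$

where both terms equal $\sqrt{\mathrm{nnz}(A)(m+n)}\,L/\epsilon$. Taking $\alpha = \max\{\epsilon, \alpha^\star\}$ additionally caps the first term at $\mathrm{nnz}(A)$ in the regime $\alpha^\star < \epsilon$.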
**Comparison with mirror-prox and dual extrapolation.** Nemirovski [28] instantiates his conceptual prox-method by solving the relaxed proximal problem (1) with $\alpha = L$ in time $O(\mathrm{nnz}(A))$, where $L$ is the Lipschitz constant of $\nabla f$, as given in (3). The total complexity of the resulting method is therefore

$$\widetilde{O}\left(\mathrm{nnz}(A)\cdot L\,\epsilon^{-1}\right). \qquad (5)$$

The closely related dual extrapolation method of Nesterov [31] attains the same rate of convergence. We refer to the running time (5) as linear, since it scales linearly with the problem description size $\mathrm{nnz}(A)$. Our running time guarantee (4) is never worse than (5) by more than a constant factor, and improves on (5) when $\mathrm{nnz}(A) = \omega(n+m)$, i.e. whenever $A$ is not extremely sparse. In that regime, our method uses $\alpha \ll L$, hence solving a harder version of (1) than is possible for mirror-prox.

**Comparison with sublinear-time methods.** Using a randomized algorithm, Grigoriadis and Khachiyan [16] solve $\ell_1$-$\ell_1$ bilinear games in time

$$\widetilde{O}\left((m+n)\,L^2\,\epsilon^{-2}\right), \qquad (6)$$

and Clarkson et al. [6] extend this result to $\ell_2$-$\ell_1$ bilinear games, with the values of $L$ as in (3). Since these runtimes scale with $m+n \le \mathrm{nnz}(A)$, we refer to them as sublinear. Our guarantee (4) improves on (6) whenever $(m+n)L^2\epsilon^{-2} \ge \mathrm{nnz}(A)$, i.e. whenever (6) is not truly sublinear.

Our method carefully balances linear-time extragradient steps with cheap sublinear-time stochastic gradient steps. Consequently, our runtime guarantee (4) inherits strengths from both the linear and sublinear runtimes. First, our runtime scales linearly with $L/\epsilon$ rather than quadratically, as does the linear runtime (5). Second, while our runtime is not strictly sublinear, its component proportional to $L/\epsilon$ is $\sqrt{\mathrm{nnz}(A)(m+n)}$, which is sublinear in $\mathrm{nnz}(A)$. Overall, our method offers the best runtime guarantee in the literature in the regime where $\epsilon$ is small enough that (6) is not truly sublinear, i.e. $\epsilon \lesssim L\sqrt{(m+n)/\mathrm{nnz}(A)}$, yet large enough that (4) remains below the best known theoretical runtimes of interior point methods: $\widetilde{O}(\max\{n,m\}^\omega \log(L/\epsilon))$ [7] and $\widetilde{O}\big((\mathrm{nnz}(A) + \min\{n,m\}^2)\sqrt{\min\{n,m\}}\,\log(L/\epsilon)\big)$ [21], where $\omega$ is the (current) matrix multiplication exponent. In the square dense case (i.e. $\mathrm{nnz}(A) \approx n^2 = m^2$), we improve on the accelerated runtime (5) by a factor of $\sqrt{n}$, the same improvement that optimal variance-reduced finite-sum minimization methods achieve over the fast gradient method [44, 1].

### 1.3 Related work

Matrix games, the canonical form of discrete zero-sum games, have long been studied in economics [32]. The classical mirror descent (i.e. no-regret) method yields an algorithm with running time $\widetilde{O}(\mathrm{nnz}(A)\,L^2\epsilon^{-2})$ [30]. Subsequent work [16, 28, 31, 6] improves this runtime as described above. Our work builds on the extragradient scheme of Nemirovski [28], as well as the gradient estimation and clipping technique of Clarkson et al. [6].

Balamurugan and Bach [3] apply standard variance reduction [19] to bilinear $\ell_2$-$\ell_2$ games by sampling elements proportional to squared matrix entries. Using proximal-point acceleration they obtain a runtime of $\widetilde{O}(\mathrm{nnz}(A) + \|A\|_F\sqrt{\mathrm{nnz}(A)\max\{m,n\}}\,\epsilon^{-1}\log\frac{1}{\epsilon})$, a rate we recover using our algorithm (Appendix E). However, in this setting the mirror-prox method has runtime $\widetilde{O}(\|A\|_{\mathrm{op}}\,\mathrm{nnz}(A)\,\epsilon^{-1})$, which may be better than the result of [3] by a factor of $\sqrt{mn/\mathrm{nnz}(A)}$ due to the discrepancy in the norm of $A$. Naive application of [3] to $\ell_1$ domains results in even greater potential losses. Shi et al. [39] extend the method of [3] to smooth functions using general Bregman divergences, but their extension is unaccelerated and appears limited to an $\epsilon^{-2}$ rate. Chavdarova et al. [5] propose a variance-reduced extragradient method with applications to generative adversarial training.
In contrast to our algorithm, which performs extragradient steps in the outer loop, the method of [5] performs stochastic extragradient steps in the inner loop, using finite-sum variance reduction as in [19]. Chavdarova et al. [5] analyze their method in the convex-concave setting, showing improved stability over direct application of the extragradient method to noisy gradients; however, their complexity guarantees are worse than those of linear-time methods. Following up on [5], Mishchenko et al. [26] propose to reduce the variance of the stochastic extragradient method by using the same stochastic sample for both the gradient and extragradient steps. In the Euclidean strongly convex case, they show a convergence guarantee with a relaxed variance assumption, and in the noiseless full-rank bilinear case they recover the guarantees of [27]. In the general convex case, however, they only show an $\epsilon^{-2}$ rate of convergence.

### 1.4 Paper outline and additional contributions

We define our notation in Section 2. In Section 3.1, we review Nemirovski's conceptual prox-method and introduce the notion of a relaxed proximal oracle; we implement such an oracle using variance-reduced gradient estimators in Section 3.2. In Section 4, we construct these gradient estimators for the $\ell_1$-$\ell_1$ and $\ell_2$-$\ell_1$ domain settings, and complete the analyses of the corresponding algorithms; in Appendix E we provide an analogous treatment for the $\ell_2$-$\ell_2$ setting, recovering the results of [3]. In Appendix F we provide three additional contributions: variance-reduction-based computation of proximal points for arbitrary convex-concave functions (Appendix F.1); extension of our results to "composite" saddle point problems of the form $\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\{f(x,y) + \phi(x) - \psi(y)\}$, where $f$ admits a centered gradient estimator and $\phi, \psi$ are "simple" convex functions (Appendix F.2); and a number of alternative centered gradient estimators for the $\ell_2$-$\ell_1$ and $\ell_2$-$\ell_2$ settings (Appendix F.3).

## 2 Notation

**Problem setup.** A setup is the triplet $(\mathcal{Z}, \|\cdot\|, r)$ where: (i) $\mathcal{Z}$ is a compact and convex subset of $\mathbb{R}^n\times\mathbb{R}^m$, (ii) $\|\cdot\|$ is a norm on $\mathcal{Z}$, and (iii) $r$ is 1-strongly-convex w.r.t. $\mathcal{Z}$ and $\|\cdot\|$, i.e. such that $r(z') \ge r(z) + \langle\nabla r(z), z' - z\rangle + \frac{1}{2}\|z' - z\|^2$ for all $z, z'\in\mathcal{Z}$.² We call $r$ the distance generating function and denote the Bregman divergence associated with it by

$$V_z(z') := r(z') - r(z) - \langle\nabla r(z), z' - z\rangle \ge \tfrac{1}{2}\|z' - z\|^2.$$

We also denote $\Theta := \max_{z'} r(z') - \min_{z} r(z)$ and assume it is finite.

**Norms and dual norms.** We write $\mathcal{S}^*$ for the set of linear functions on $\mathcal{S}$. For $\zeta\in\mathcal{Z}^*$ we define the dual norm of $\|\cdot\|$ as $\|\zeta\|_* := \max_{\|z\|\le 1}\langle\zeta, z\rangle$. For $p \ge 1$ we write the $\ell_p$ norm $\|z\|_p = (\sum_i |z_i|^p)^{1/p}$, with $\|z\|_\infty = \max_i |z_i|$. The dual norm of $\ell_p$ is $\ell_q$ with $q^{-1} = 1 - p^{-1}$.

**Domain components.** We assume $\mathcal{Z}$ is of the form $\mathcal{X}\times\mathcal{Y}$ for convex and compact sets $\mathcal{X}\subset\mathbb{R}^n$ and $\mathcal{Y}\subset\mathbb{R}^m$. Particular sets of interest are the simplex $\Delta^d = \{v\in\mathbb{R}^d \mid \|v\|_1 = 1,\, v \ge 0\}$ and the Euclidean ball $\mathbb{B}^d = \{v\in\mathbb{R}^d \mid \|v\|_2 \le 1\}$. For any vector $z\in\mathbb{R}^n\times\mathbb{R}^m$, we write $z^x$ and $z^y$ for the first $n$ and last $m$ coordinates of $z$, respectively. When totally clear from context, we sometimes refer to the $\mathcal{X}$ and $\mathcal{Y}$ components of $z$ directly as $x$ and $y$. We write the $i$th coordinate of vector $v$ as $[v]_i$.

**Matrices.** We consider a matrix $A\in\mathbb{R}^{m\times n}$ and write $\mathrm{nnz}(A)$ for the number of its nonzero entries. For $i\in[m]$ and $j\in[n]$ we write $A_{i:}$, $A_{:j}$ and $A_{ij}$ for the corresponding row, column and entry, respectively.³ We consider the matrix norms $\|A\|_{\max} := \max_{ij}|A_{ij}|$, $\|A\|_{p\to q} := \max_{\|x\|_p\le 1}\|Ax\|_q$ and $\|A\|_F := (\sum_{ij}A_{ij}^2)^{1/2}$.

²For non-differentiable $r$, let $\langle\nabla r(z), w\rangle := \sup_{\gamma\in\partial r(z)}\langle\gamma, w\rangle$, where $\partial r(z)$ is the subdifferential of $r$ at $z$.

³For $k\in\mathbb{N}$, we let $[k] := \{1,\ldots,k\}$.
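The three matrix norms above are exactly the Lipschitz constants that appear in our runtime bounds. A quick numpy illustration (the function name and dictionary keys are ours):

```python
import numpy as np

def matrix_norms(A):
    """The matrix norms used in this paper, for A of shape (m, n).

    ||A||_max:      Lipschitz constant of g in the l1-l1 setup (Section 4.1).
    ||A||_{2->inf}: max_i ||A_{i:}||_2, Lipschitz constant in the l2-l1
                    setup (Section 4.2).
    ||A||_F:        Frobenius norm, used in the l2-l2 setup (Appendix E).
    """
    return {
        "max": np.abs(A).max(),
        "2->inf": np.linalg.norm(A, axis=1).max(),
        "F": np.linalg.norm(A),
    }
```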
## 3 Primal-dual variance reduction framework

In this section, we establish a framework for solving the saddle point problem

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} f(x,y),$$

where $f$ is convex in $x$ and concave in $y$, and admits a (variance-reduced) stochastic estimator for the continuous and monotone⁴ gradient mapping

$$g(z) = g(x,y) := \left(\nabla_x f(x,y),\, -\nabla_y f(x,y)\right).$$

Our goal is to find an $\epsilon$-approximate saddle point (Nash equilibrium), i.e. $z\in\mathcal{Z} := \mathcal{X}\times\mathcal{Y}$ such that

$$\mathrm{Gap}(z) := \max_{y'\in\mathcal{Y}} f(z^x, y') - \min_{x'\in\mathcal{X}} f(x', z^y) \le \epsilon. \qquad (7)$$

We achieve this by generating a sequence $z_1, z_2, \ldots, z_K$ such that $\frac{1}{K}\sum_{k=1}^K\langle g(z_k), z_k - u\rangle \le \epsilon$ for every $u\in\mathcal{Z}$, and using the fact that

$$\mathrm{Gap}\left(\frac{1}{K}\sum_{k=1}^K z_k\right) \le \max_{u\in\mathcal{Z}}\frac{1}{K}\sum_{k=1}^K\langle g(z_k), z_k - u\rangle \qquad (8)$$

due to convexity-concavity of $f$ (see proof in Appendix A.1).

⁴A mapping $q:\mathcal{Z}\to\mathcal{Z}^*$ is monotone if and only if $\langle q(z') - q(z), z' - z\rangle \ge 0$ for all $z, z'\in\mathcal{Z}$; $g$ is monotone due to convexity-concavity of $f$.

In Section 3.1 we define the notion of a (randomized) relaxed proximal oracle, and describe how Nemirovski's "mirror-prox" method leverages it to solve the saddle point problem above. In Section 3.2 we define a class of centered gradient estimators, whose variance is proportional to the squared distance from a reference point. Given such a centered gradient estimator, we show that a regularized stochastic mirror descent scheme constitutes a relaxed proximal oracle. For a technical reason, we limit our oracle guarantee in Section 3.2 to the bilinear case $f(x,y) = y^\top Ax$, which suffices for the applications in Section 4. We lift this limitation in Appendix F.1, where we show a different oracle implementation that is valid for general convex-concave $f$, with only a logarithmic increase in complexity.

### 3.1 The mirror-prox method with a randomized oracle

Recall that we assume the space $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$ is equipped with a norm $\|\cdot\|$ and distance generating function $r:\mathcal{Z}\to\mathbb{R}$ that is 1-strongly-convex w.r.t. $\|\cdot\|$ and has range $\Theta$. We write the induced Bregman divergence as $V_z(z') = r(z') - r(z) - \langle\nabla r(z), z' - z\rangle$. We use the following fact throughout the paper: by definition, the Bregman divergence satisfies, for any $z, z', u\in\mathcal{Z}$,

$$\langle -\nabla V_z(z'),\, z' - u\rangle = V_z(u) - V_{z'}(u) - V_z(z'). \qquad (9)$$

For any $\alpha > 0$ we define the $\alpha$-proximal mapping $\mathrm{Prox}_z^\alpha(g)$ to be the solution of the variational inequality corresponding to the strongly monotone operator $g + \alpha\nabla V_z$, i.e. the unique $z_\alpha\in\mathcal{Z}$ such that $\langle g(z_\alpha) + \alpha\nabla V_z(z_\alpha),\, z_\alpha - u\rangle \le 0$ for all $u\in\mathcal{Z}$ [cf. 11]. Equivalently (by (9)),

$$\mathrm{Prox}_z^\alpha(g) := \text{the unique } z_\alpha\in\mathcal{Z} \text{ s.t. } \langle g(z_\alpha), z_\alpha - u\rangle \le \alpha\left(V_z(u) - V_{z_\alpha}(u) - V_z(z_\alpha)\right)\;\;\forall u\in\mathcal{Z}. \qquad (10)$$

When $V_z(z') = V^x_{z^x}(x') + V^y_{z^y}(y')$, $\mathrm{Prox}_z^\alpha(g)$ is also the unique solution of the saddle point problem

$$\min_{x'\in\mathcal{X}}\max_{y'\in\mathcal{Y}}\left\{f(x', y') + \alpha V^x_{z^x}(x') - \alpha V^y_{z^y}(y')\right\}.$$

Consider iterations of the form $z_k = \mathrm{Prox}^\alpha_{z_{k-1}}(g)$, with $z_0 = \arg\min_z r(z)$. Averaging the definition (10) over $k$, using the bound (8) and the nonnegativity of Bregman divergences gives

$$\mathrm{Gap}\left(\frac{1}{K}\sum_{k=1}^K z_k\right) \le \max_{u\in\mathcal{Z}}\frac{1}{K}\sum_{k=1}^K\langle g(z_k), z_k - u\rangle \le \max_{u\in\mathcal{Z}}\frac{\alpha}{K}\left(V_{z_0}(u) - V_{z_K}(u)\right) \le \frac{\alpha\Theta}{K}.$$

Thus, we can find an $\epsilon$-suboptimal point in $K = \alpha\Theta/\epsilon$ exact proximal steps. However, computing $\mathrm{Prox}_z^\alpha(g)$ exactly may be as difficult as solving the original problem. Nemirovski [28] proposes a relaxation of the exact proximal mapping, which we slightly extend to include the possibility of randomization, and formalize in the following definition.

**Definition 1** ($(\alpha,\varepsilon)$-relaxed proximal oracle). Let $g$ be a monotone operator and $\alpha, \varepsilon > 0$. An $(\alpha,\varepsilon)$-relaxed proximal oracle for $g$ is a (possibly randomized) mapping $\mathcal{O}:\mathcal{Z}\to\mathcal{Z}$ such that $z' = \mathcal{O}(z)$ satisfies

$$\mathbb{E}\left[\max_{u\in\mathcal{Z}}\left\{\langle g(z'), z' - u\rangle - \alpha V_z(u)\right\}\right] \le \varepsilon.$$

Note that $\mathcal{O}(z) = \mathrm{Prox}_z^\alpha(g)$ is an $(\alpha, 0)$-relaxed proximal oracle.
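While the implicit proximal mapping is expensive, the proximal mapping of a constant (linearized) operator, which the mirror-prox method below uses in its second step, has a closed form in the setups we consider. As a concrete example, for negative entropy on the simplex (the Section 4.1 geometry, where $V_z(z') = \mathrm{KL}(z'\,\|\,z)$) it is a multiplicative-weights update; a minimal sketch, with a function name of our choosing:

```python
import numpy as np

def entropic_linear_prox(z, gamma, alpha):
    """argmin_{z' in simplex} <gamma, z'> + alpha * KL(z' || z).

    Setting the Lagrangian gradient to zero gives
    z'_i proportional to z_i * exp(-gamma_i / alpha).
    """
    logw = np.log(z) - gamma / alpha
    w = np.exp(logw - logw.max())  # shift for numerical stability
    return w / w.sum()
```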
Algorithm 1 describes the conceptual prox-method of Nemirovski [28], which recovers the error guarantee of exact proximal iterations. The $k$th iteration consists of (i) a relaxed proximal oracle call producing $z_{k-1/2} = \mathcal{O}(z_{k-1})$, and (ii) a linearized proximal (mirror) step where we replace $z\mapsto g(z)$ with the constant function $z\mapsto g(z_{k-1/2})$, producing $z_k = \mathrm{Prox}^\alpha_{z_{k-1}}(g(z_{k-1/2}))$. We now state the convergence guarantee for the mirror-prox method, first shown in [28] (see Appendix B.1 for a simple proof).

**Algorithm 1:** $\mathrm{OuterLoop}(\mathcal{O})$ (Nemirovski [28])

*Input:* $(\alpha,\varepsilon)$-relaxed proximal oracle $\mathcal{O}(z)$ for gradient mapping $g$, distance-generating $r$.
*Parameters:* number of iterations $K$.
*Output:* point $\bar z_K$ with $\mathbb{E}\,\mathrm{Gap}(\bar z_K) \le \alpha\Theta/K + \varepsilon$.

1. $z_0 \leftarrow \arg\min_{z\in\mathcal{Z}} r(z)$
2. **for** $k = 1, \ldots, K$ **do**
3. $\quad z_{k-1/2} \leftarrow \mathcal{O}(z_{k-1})$ &nbsp;&nbsp;(we implement $\mathcal{O}(z_{k-1})$ by calling $\mathrm{InnerLoop}(z_{k-1}, \tilde g_{z_{k-1}}, \alpha)$)
4. $\quad z_k \leftarrow \mathrm{Prox}^{\alpha}_{z_{k-1}}(g(z_{k-1/2})) = \arg\min_{z\in\mathcal{Z}}\left\{\langle g(z_{k-1/2}), z\rangle + \alpha V_{z_{k-1}}(z)\right\}$
5. **return** $\bar z_K = \frac{1}{K}\sum_{k=1}^{K} z_{k-1/2}$

**Proposition 1** (Mirror prox convergence via oracles). Let $\mathcal{O}$ be an $(\alpha,\varepsilon)$-relaxed proximal oracle with respect to gradient mapping $g$ and distance-generating function $r$ with range at most $\Theta$. Let $z_{1/2}, z_{3/2}, \ldots, z_{K-1/2}$ be the iterates of Algorithm 1 and let $\bar z_K$ be its output. Then

$$\mathbb{E}\,\mathrm{Gap}(\bar z_K) \le \mathbb{E}\max_{u\in\mathcal{Z}}\frac{1}{K}\sum_{k=1}^K\left\langle g(z_{k-1/2}),\, z_{k-1/2} - u\right\rangle \le \frac{\alpha\Theta}{K} + \varepsilon.$$

### 3.2 Implementation of an $(\alpha, 0)$-relaxed proximal oracle

We now explain how to use stochastic variance-reduced gradient estimators to design an efficient $(\alpha, 0)$-relaxed proximal oracle. We begin by introducing the bias and variance properties of the estimators we require.

**Definition 2.** Let $z_0\in\mathcal{Z}$ and $L > 0$. A stochastic gradient estimator $\tilde g_{z_0}:\mathcal{Z}\to\mathcal{Z}^*$ is called $(z_0, L)$-centered for $g$ if for all $z\in\mathcal{Z}$:

1. $\mathbb{E}\,[\tilde g_{z_0}(z)] = g(z)$,
2. $\mathbb{E}\,\|\tilde g_{z_0}(z) - g(z_0)\|_*^2 \le L^2\|z - z_0\|^2$.

**Lemma 1.** A $(z_0, L)$-centered estimator for $g$ satisfies $\mathbb{E}\,\|\tilde g_{z_0}(z) - g(z)\|_*^2 \le (2L)^2\|z - z_0\|^2$.

*Proof.* Writing $\delta = \tilde g_{z_0}(z) - g(z_0)$, we have $\mathbb{E}\,\delta = g(z) - g(z_0)$ by the first centered estimator property. Therefore,

$$\mathbb{E}\,\|\tilde g_{z_0}(z) - g(z)\|_*^2 = \mathbb{E}\,\|\delta - \mathbb{E}\,\delta\|_*^2 \overset{(i)}{\le} \mathbb{E}\left(\|\delta\|_* + \|\mathbb{E}\,\delta\|_*\right)^2 \overset{(ii)}{\le} 4\,\mathbb{E}\,\|\delta\|_*^2 \overset{(iii)}{\le} (2L)^2\|z - z_0\|^2,$$

where the bounds follow from (i) the triangle inequality, (ii) Jensen's inequality, and (iii) the second centered estimator property. ∎

**Remark 1.** A gradient mapping that admits a $(z, L)$-centered gradient estimator for every $z\in\mathcal{Z}$ is $2L$-Lipschitz, since by Jensen's inequality and Lemma 1 we have, for all $w\in\mathcal{Z}$,

$$\|g(w) - g(z)\|_* = \|\mathbb{E}\,\tilde g_z(w) - g(z)\|_* \le \left(\mathbb{E}\,\|\tilde g_z(w) - g(z)\|_*^2\right)^{1/2} \le 2L\|w - z\|.$$

**Remark 2.** Definition 2 bounds the gradient variance using the distance to the reference point. Similar bounds are used in variance reduction for bilinear saddle-point problems with Euclidean norm [3], as well as for finding stationary points in smooth nonconvex finite-sum problems [2, 33, 12, 45]. However, known variance reduction methods for smooth convex finite-sum minimization require stronger bounds [cf. 1, Section 2.1].

With the variance bounds defined, we describe Algorithm 2, which (for the bilinear case) implements a relaxed proximal oracle. The algorithm is stochastic mirror descent with an additional regularization term around the initial point $w_0$. Note that we do not perform extragradient steps in this stochastic method. When combined with a centered gradient estimator, the iterates of Algorithm 2 provide the following guarantee, which is one of our key technical contributions.

**Algorithm 2:** $\mathrm{InnerLoop}(w_0, \tilde g_{w_0}, \alpha)$

*Input:* initial $w_0\in\mathcal{Z}$, gradient estimator $\tilde g_{w_0}$, oracle quality $\alpha > 0$.
*Parameters:* step size $\eta$, number of iterations $T$.
*Output:* point $\bar w_T$ satisfying Definition 1 (for appropriate $\tilde g_{w_0}$, $\eta$, $T$).

1. **for** $t = 1, \ldots, T$ **do**
2. $\quad w_t \leftarrow \arg\min_{w\in\mathcal{Z}}\left\{\langle\tilde g_{w_0}(w_{t-1}), w\rangle + \frac{\alpha}{2}V_{w_0}(w) + \frac{1}{\eta}V_{w_{t-1}}(w)\right\}$
3. **return** $\bar w_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
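Below is a minimal numpy sketch of Algorithm 2 for the entropic setup of Section 4.1, where the step in line 2 has a closed form: the same Lagrangian calculation as for the linear prox above gives $w_t \propto \left(w_{t-1}\cdot w_0^{\eta\alpha/2}\cdot e^{-\eta\tilde g}\right)^{1/(1+\eta\alpha/2)}$, normalized blockwise. The interface and names are ours:

```python
import numpy as np

def _normalize_log(logw):
    """Map exp(logw) onto the simplex, stably."""
    w = np.exp(logw - logw.max())
    return w / w.sum()

def inner_loop(w0, grad_est, alpha, L, n):
    """Sketch of Algorithm 2 over Z = Delta^n x Delta^m (Section 4.1 geometry).

    w0 concatenates the x- and y-blocks (each a strictly positive probability
    vector); grad_est(x, y) returns the two blocks of a (w0, L)-centered
    estimate of g(w) = (A^T y, -A x). Step size and iteration count follow
    Proposition 2 below.
    """
    eta = alpha / (10 * L**2)
    T = int(np.ceil(40 * L**2 / alpha**2))
    c = eta * alpha / 2  # weight of the extra regularization toward w0
    w, avg = w0.copy(), np.zeros_like(w0)
    for _ in range(T):
        ghat = np.concatenate(grad_est(w[:n], w[n:]))
        # closed form of argmin_w <ghat, w> + (alpha/2) V_{w0}(w) + (1/eta) V_{w_{t-1}}(w)
        logw = (np.log(w) + c * np.log(w0) - eta * ghat) / (1 + c)
        w = np.concatenate([_normalize_log(logw[:n]), _normalize_log(logw[n:])])
        avg += w / T
    return avg
```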
**Proposition 2.** Let $\alpha, L > 0$, let $w_0\in\mathcal{Z}$ and let $\tilde g_{w_0}$ be $(w_0, L)$-centered for monotone $g$. Then, for $\eta = \frac{\alpha}{10L^2}$ and $T \ge \frac{4}{\eta\alpha} = \frac{40L^2}{\alpha^2}$, the iterates of Algorithm 2 satisfy

$$\mathbb{E}\left[\max_{u\in\mathcal{Z}}\left\{\frac{1}{T}\sum_{t=1}^T\langle g(w_t), w_t - u\rangle - \alpha V_{w_0}(u)\right\}\right] \le 0. \qquad (11)$$

Before discussing the proof of Proposition 2, we state how it implies the relaxed proximal oracle property for the bilinear case.

**Corollary 1.** Let $A\in\mathbb{R}^{m\times n}$ and let $g(z) = (A^\top z^y, -Az^x)$. Then, in the setting of Proposition 2, $\mathcal{O}(w_0) = \mathrm{InnerLoop}(w_0, \tilde g_{w_0}, \alpha)$ is an $(\alpha, 0)$-relaxed proximal oracle.

*Proof.* Note that $\langle g(z), w\rangle = -\langle g(w), z\rangle$ for any $z, w\in\mathcal{Z}$, and consequently $\langle g(z), z\rangle = 0$. Therefore, the iterates $w_1, \ldots, w_T$ of Algorithm 2 and its output $\bar w_T = \frac{1}{T}\sum_{t=1}^T w_t$ satisfy, for every $u\in\mathcal{Z}$,

$$\frac{1}{T}\sum_{t=1}^T\langle g(w_t), w_t - u\rangle = \frac{1}{T}\sum_{t=1}^T\langle g(u), w_t\rangle = \langle g(u), \bar w_T\rangle = \langle g(\bar w_T), \bar w_T - u\rangle.$$

Substituting into the bound (11) yields the $(\alpha, 0)$-relaxed proximal oracle property in Definition 1. ∎

More generally, the proof of Corollary 1 shows that Algorithm 2 implements a relaxed proximal oracle whenever $z\mapsto\langle g(z), z - u\rangle$ is convex for every $u$. In Appendix F.1 we implement an $(\alpha,\varepsilon)$-relaxed proximal oracle without such an assumption.

The proof of Proposition 2 is a somewhat lengthy application of existing techniques for stochastic mirror descent analysis in conjunction with Definition 2. We give it in full in Appendix B.2 and sketch it briefly here. We view Algorithm 2 as mirror descent with stochastic gradients $\delta_t = \tilde g_{w_0}(w_t) - g(w_0)$ and composite term $\langle g(w_0), z\rangle + \frac{\alpha}{2}V_{w_0}(z)$. For any $u\in\mathcal{Z}$, the standard mirror descent analysis (see Lemma 4 in Appendix A.2) bounds the regret $\sum_{t\in[T]}\langle\delta_t + g(w_0) + \frac{\alpha}{2}\nabla V_{w_0}(w_t),\, w_t - u\rangle$ in terms of the distance to initialization $V_{w_0}(u)$ and the stochastic gradient norms $\|\delta_t\|_*^2$ for $t\in[T]$. Bounding these norms via Definition 2 and rearranging the $\langle\nabla V_{w_0}(w_t), w_t - u\rangle$ terms, we show that

$$\mathbb{E}\left[\frac{1}{T}\sum_{t\in[T]}\langle g(w_t), w_t - u\rangle - \alpha V_{w_0}(u)\right] \le 0 \quad\text{for all } u\in\mathcal{Z}.$$

To reach our desired result (11) we must swap the order of the expectation and the "for all." We do so using the "ghost iterate" technique due to Nemirovski et al. [29].

## 4 Application to bilinear saddle point problems

We now construct centered gradient estimators (as per Definition 2) for the linear gradient mapping $g(z) = (A^\top z^y, -Az^x)$ corresponding to the bilinear saddle point problem $\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} y^\top Ax$. Sections 4.1 and 4.2 consider the $\ell_1$-$\ell_1$ and $\ell_2$-$\ell_1$ settings, respectively; in Appendix E we show how our approach naturally extends to the $\ell_2$-$\ell_2$ setting as well. Throughout, we let $w_0$ denote the "center" (i.e. reference point) of our stochastic gradient estimator and consider a general query point $w\in\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$. We also recall the notation $[v]_i$ for the $i$th entry of vector $v$.

### 4.1 $\ell_1$-$\ell_1$ games

**Setup.** Denoting the $d$-dimensional simplex by $\Delta^d$, we let $\mathcal{X} = \Delta^n$, $\mathcal{Y} = \Delta^m$ and $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$. We take $\|\cdot\|$ to be the $\ell_1$ norm, with dual norm $\|\cdot\|_* = \|\cdot\|_\infty$. We take the distance generating function $r$ to be the negative entropy, i.e. $r(z) = \sum_i [z]_i\log[z]_i$. We note that both $\|\cdot\|_1$ and $r$ are separable, and in particular separate over the $\mathcal{X}$ and $\mathcal{Y}$ blocks of $\mathcal{Z}$. Finally we set

$$\|A\|_{\max} := \max_{i,j}|A_{ij}|,$$

and note that this is the Lipschitz constant of the gradient mapping $g$ under the chosen norm.

**Gradient estimator.** Given $w_0 = (w_0^x, w_0^y)$ and $g(w_0) = (A^\top w_0^y, -Aw_0^x)$, we describe the reduced-variance gradient estimator $\tilde g_{w_0}(w)$. First, we define the probabilities $p(w)\in\Delta^m$ and $q(w)\in\Delta^n$ according to

$$p_i(w) := \frac{|[w^y]_i - [w_0^y]_i|}{\|w^y - w_0^y\|_1} \quad\text{and}\quad q_j(w) := \frac{|[w^x]_j - [w_0^x]_j|}{\|w^x - w_0^x\|_1}. \qquad (12)$$

To compute $\tilde g_{w_0}(w)$ we sample $i\sim p(w)$ and $j\sim q(w)$ independently, and set

$$\tilde g_{w_0}(w) = g(w_0) + \left(\frac{[w^y]_i - [w_0^y]_i}{p_i(w)}\,A_{i:},\; -\frac{[w^x]_j - [w_0^x]_j}{q_j(w)}\,A_{:j}\right), \qquad (13)$$

where $A_{i:}$ and $A_{:j}$ are the $i$th row and $j$th column of $A$, respectively.
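A minimal numpy sketch of this estimator (names and interface ours): constructing it costs one exact gradient evaluation, $O(\mathrm{nnz}(A))$ time, after which each query costs only $O(m+n)$, and it plugs directly into the inner-loop sketch of Section 3.2.

```python
import numpy as np

def make_centered_estimator_l1(A, x0, y0, rng=None):
    """Sampling-from-the-difference estimator (12)-(13) for the l1-l1 setup.

    Returns ghat(x, y), unbiased for g = (A^T y, -A x) and satisfying the
    (w0, ||A||_max)-centered bound of Lemma 2 below.
    """
    if rng is None:
        rng = np.random.default_rng()
    m, n = A.shape
    g0x, g0y = A.T @ y0, -(A @ x0)  # exact gradient at the reference point

    def ghat(x, y):
        dx, dy = x - x0, y - y0
        gx, gy = g0x.copy(), g0y.copy()
        if np.abs(dy).sum() > 0:            # draw a row index i ~ p(w), eq. (12)
            p = np.abs(dy) / np.abs(dy).sum()
            i = rng.choice(m, p=p)
            gx += A[i, :] * (dy[i] / p[i])  # = A_{i:} ||dy||_1 sign(dy_i)
        if np.abs(dx).sum() > 0:            # draw a column index j ~ q(w), eq. (12)
            q = np.abs(dx) / np.abs(dx).sum()
            j = rng.choice(n, p=q)
            gy -= A[:, j] * (dx[j] / q[j])
        return gx, gy

    return ghat
```

Note that at the reference point $w = w_0$ the sketch returns exactly $g(w_0)$, consistent with Definition 2.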
Since the sampling distributions $p(w), q(w)$ are proportional to the absolute value of the difference between blocks of $w$ and $w_0$, we call the strategy (12) "sampling from the difference." Substituting (12) into (13) gives the explicit form

$$\tilde g_{w_0}(w) = g(w_0) + \left(A_{i:}\|w^y - w_0^y\|_1\,\mathrm{sign}([w^y - w_0^y]_i),\; -A_{:j}\|w^x - w_0^x\|_1\,\mathrm{sign}([w^x - w_0^x]_j)\right). \qquad (14)$$

A straightforward calculation shows that this construction satisfies Definition 2.

**Lemma 2.** In the $\ell_1$-$\ell_1$ setup, the estimator (14) is $(w_0, L)$-centered with $L = \|A\|_{\max}$.

*Proof.* The first property ($\mathbb{E}\,\tilde g_{w_0}(w) = g(w)$) follows immediately by inspection of (13). The second property follows from (14) by noting that

$$\|\tilde g_{w_0}(w) - g(w_0)\|_\infty = \max\left\{\|A_{i:}\|_\infty\|w^y - w_0^y\|_1,\; \|A_{:j}\|_\infty\|w^x - w_0^x\|_1\right\} \le \|A\|_{\max}\|w - w_0\|_1$$

for all $i, j$, and therefore $\mathbb{E}\,\|\tilde g_{w_0}(w) - g(w_0)\|_\infty^2 \le \|A\|_{\max}^2\|w - w_0\|_1^2$. ∎

The proof of Lemma 2 reveals that the proposed estimator satisfies a stronger version of Definition 2: the second property (and also Lemma 1) holds with probability 1 rather than in expectation.

**Runtime bound.** Combining the centered gradient estimator (13), the relaxed oracle implementation (Algorithm 2) and the extragradient outer loop (Algorithm 1), we obtain our main result for $\ell_1$-$\ell_1$ games: an accelerated stochastic variance reduction algorithm. We write the resulting complete method explicitly as Algorithm 3 in Appendix C.1. The algorithm enjoys the following runtime guarantee (see proof in Appendix C.2).

**Theorem 1.** Let $A\in\mathbb{R}^{m\times n}$, $\epsilon > 0$, and $\alpha \ge \epsilon/\log(nm)$. Algorithm 3 outputs a point $z = (z^x, z^y)$ such that

$$\mathbb{E}\left[\max_{y\in\Delta^m} y^\top Az^x - \min_{x\in\Delta^n}(z^y)^\top Ax\right] = \mathbb{E}\left[\max_i\,[Az^x]_i - \min_j\,[A^\top z^y]_j\right] \le \epsilon,$$

and runs in time

$$O\left(\left(\mathrm{nnz}(A) + \frac{(m+n)\|A\|_{\max}^2}{\alpha^2}\right)\frac{\alpha\log(mn)}{\epsilon}\right).$$

Setting $\alpha$ optimally, the running time is

$$O\left(\left(\mathrm{nnz}(A) + \sqrt{\mathrm{nnz}(A)(m+n)}\cdot\frac{\|A\|_{\max}}{\epsilon}\right)\log(mn)\right).$$

### 4.2 $\ell_2$-$\ell_1$ games

**Setup.** We set $\mathcal{X} = \mathbb{B}^n$ to be the $n$-dimensional Euclidean ball of radius 1, while $\mathcal{Y} = \Delta^m$ remains the simplex. For $z = (z^x, z^y)\in\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$ we define a norm by

$$\|z\|^2 = \|z^x\|_2^2 + \|z^y\|_1^2, \quad\text{with dual norm}\quad \|\zeta\|_*^2 = \|\zeta^x\|_2^2 + \|\zeta^y\|_\infty^2.$$

For distance generating function we take $r(z) = r^x(z^x) + r^y(z^y)$ with $r^x(x) = \frac{1}{2}\|x\|_2^2$ and $r^y(y) = \sum_i y_i\log y_i$; $r$ is 1-strongly convex w.r.t. $\|\cdot\|$ and has range $\Theta \le \frac{1}{2} + \log m \le \log(2m)$. Finally, we denote

$$\|A\|_{2\to\infty} = \max_{i\in[m]}\|A_{i:}\|_2,$$

and note that this is the Lipschitz constant of $g$ under $\|\cdot\|$.

**Gradient estimator.** To account for the fact that $\mathcal{X}$ is now the $\ell_2$ unit ball, we modify the sampling distribution $q$ in (12) to

$$q_j(w) = \frac{([w^x]_j - [w_0^x]_j)^2}{\|w^x - w_0^x\|_2^2},$$

and keep $p$ the same. As we explain in detail in Appendix D.1.1, substituting these probabilities into the expression (13) yields a centered gradient estimator with a constant $(\sum_{j\in[n]}\|A_{:j}\|_\infty^2)^{1/2}$ that is larger than $\|A\|_{2\to\infty}$ by a factor of up to $\sqrt{n}$. Using "local norms" analysis allows us to tighten these bounds whenever the stochastic steps have bounded infinity norm. Following Clarkson et al. [6], we enforce such a bound on the step norms via gradient clipping. The final gradient estimator is

$$\tilde g_{w_0}(w) = g(w_0) + \left(A_{i:}\|w^y - w_0^y\|_1\,\mathrm{sign}([w^y - w_0^y]_i),\; -\mathrm{T}_\tau\!\left(A_{:j}\,\frac{[w^x]_j - [w_0^x]_j}{q_j(w)}\right)\right), \quad\text{where}\quad [\mathrm{T}_\tau(v)]_i = \begin{cases} -\tau & [v]_i < -\tau, \\ [v]_i & |[v]_i| \le \tau, \\ \tau & [v]_i > \tau, \end{cases}$$

is an entrywise truncation with threshold $\tau$. The clipping operation $\mathrm{T}_\tau$ introduces bias to the gradient estimator, which we account for by carefully choosing a value of $\tau$ for which the bias is on the same order as the variance, and yet the resulting steps are appropriately bounded; see Appendix D.1.2. In Appendix F.3.1 we describe an alternative gradient estimator for which the distribution $q$ does not depend on the current iterate $w$.
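The two new ingredients relative to Section 4.1, squared-difference sampling and entrywise clipping, are easy to state in code; a minimal sketch (names ours; the threshold $\tau$ is left as an input here, while the analysis fixes its value in Appendix D.1.2):

```python
import numpy as np

def clip_entrywise(v, tau):
    """The truncation operator T_tau: clamp each entry of v to [-tau, tau]."""
    return np.clip(v, -tau, tau)

def sample_from_squared_difference(x, x0, rng=None):
    """Draw j ~ q(w) with q_j = ([x]_j - [x0]_j)^2 / ||x - x0||_2^2."""
    if rng is None:
        rng = np.random.default_rng()
    d2 = (x - x0) ** 2
    return rng.choice(len(x), p=d2 / d2.sum())
```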
**Runtime bound.** Algorithm 4 in Appendix D.5 combines our clipped gradient estimator with our general variance reduction framework. The analysis in Appendix D gives the following guarantee.

**Theorem 2.** Let $A\in\mathbb{R}^{m\times n}$, $\epsilon > 0$, and any $\alpha \ge \epsilon/\log(2m)$. Algorithm 4 outputs a point $z = (z^x, z^y)$ such that

$$\mathbb{E}\left[\max_{y\in\Delta^m} y^\top Az^x - \min_{x\in\mathbb{B}^n}(z^y)^\top Ax\right] = \mathbb{E}\left[\max_i\,[Az^x]_i + \|A^\top z^y\|_2\right] \le \epsilon,$$

and runs in time

$$O\left(\left(\mathrm{nnz}(A) + \frac{(m+n)\|A\|_{2\to\infty}^2}{\alpha^2}\right)\frac{\alpha\log(2m)}{\epsilon}\right).$$

Setting $\alpha$ optimally, the running time is

$$O\left(\left(\mathrm{nnz}(A) + \sqrt{\mathrm{nnz}(A)(m+n)}\cdot\frac{\|A\|_{2\to\infty}}{\epsilon}\right)\log(2m)\right).$$

**Acknowledgments**

YC and YJ were supported by Stanford Graduate Fellowships. AS was supported by the NSF CAREER Award CCF-1844855. KT was supported by the NSF Graduate Fellowship DGE-1656518.

**References**

[1] Z. Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205, 2017.
[2] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 699–707, 2016.
[3] P. Balamurugan and F. R. Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, 2016.
[4] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
[5] T. Chavdarova, G. Gidel, F. Fleuret, and S. Lacoste-Julien. Reducing noise in GAN training with variance reduced extragradient. In Advances in Neural Information Processing Systems, 2019.
[6] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. In 51st Annual IEEE Symposium on Foundations of Computer Science, pages 449–457, 2010.
[7] M. B. Cohen, Y. T. Lee, and Z. Song. Solving linear programs in the current matrix multiplication time. arXiv preprint arXiv:1810.07896, 2018.
[8] G. B. Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton, NJ, 1963.
[9] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018.
[10] Y. Drori, S. Sabach, and M. Teboulle. A simple algorithm for a class of nonsmooth convex-concave saddle-point problems. Operations Research Letters, 43(2):209–214, 2015.
[11] J. Eckstein. Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming. Mathematics of Operations Research, 18(1):202–226, 1993.
[12] C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, 2018.
[13] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 2540–2548, 2015.
[14] G. Gidel, T. Jebara, and S. Lacoste-Julien. Frank-Wolfe algorithms for saddle point problems. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
[15] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[16] M. D. Grigoriadis and L. G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 18(2):53–58, 1995.
[17] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Springer, New York, 1993.
[18] C. Jin, P. Netrapalli, and M. I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.
[19] R. Johnson and T. Zhang.
Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
[20] O. Kolossoski and R. D. Monteiro. An accelerated non-Euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems. Optimization Methods and Software, 32(6):1244–1272, 2017.
[21] Y. T. Lee and A. Sidford. Efficient inverse maintenance and faster algorithms for linear programming. In IEEE 56th Annual Symposium on Foundations of Computer Science, pages 230–249, 2015.
[22] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.
[23] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
[24] P. Mertikopoulos, H. Zenati, B. Lecouat, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. In International Conference on Learning Representations, 2019.
[25] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.
[26] K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik, and Y. Malitsky. Revisiting stochastic extragradient. arXiv preprint arXiv:1905.11373, 2019.
[27] A. Mokhtari, A. Ozdaglar, and S. Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, 2019.
[28] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
[29] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[30] A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. J. Wiley & Sons, New York, NY, 1983.
[31] Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2-3):319–344, 2007.
[32] J. v. Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295–320, 1928.
[33] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 314–323, 2016.
[34] M. W. Schmidt, N. L. Roux, and F. R. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
[35] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
[36] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.
[37] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[38] J. Sherman. Area-convexity, $\ell_\infty$ regularization, and undirected multicommodity flow. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 452–460. ACM, 2017.
[39] Z. Shi, X. Zhang, and Y. Yu. Bregman divergence for stochastic variance reduction: Saddle-point and adversarial prediction.
In Advances in Neural Information Processing Systems, 2017.
[40] A. Sidford and K. Tian. Coordinate methods for accelerating $\ell_\infty$ regression and faster approximate maximum flow. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science, pages 922–933. IEEE, 2018.
[41] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2009.
[42] M. D. Vose. A linear algorithm for generating random numbers with a given distribution. IEEE Transactions on Software Engineering, 17(9):972–975, 1991.
[43] B. E. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, 2016.
[44] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[45] D. Zhou, P. Xu, and Q. Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems, 2018.