# Variance Reduction for Matrix Games

Yair Carmon, Yujia Jin, Aaron Sidford and Kevin Tian
Stanford University
{yairc,yujiajin,sidford,kjtian}@stanford.edu

**Abstract.** We present a randomized primal-dual algorithm that solves the problem $\min_x \max_y y^\top A x$ to additive error $\epsilon$ in time $\mathrm{nnz}(A) + \sqrt{\mathrm{nnz}(A)\,n}/\epsilon$, for matrix $A$ with larger dimension $n$ and $\mathrm{nnz}(A)$ nonzero entries. This improves the best known exact gradient methods by a factor of $\sqrt{\mathrm{nnz}(A)/n}$ and is faster than fully stochastic gradient methods in the accurate and/or sparse regime $\epsilon \le \sqrt{n/\mathrm{nnz}(A)}$. Our results hold for $x, y$ in the simplex (matrix games, linear programming) and for $x$ in an $\ell_2$ ball and $y$ in the simplex (perceptron / SVM, minimum enclosing ball). Our algorithm combines Nemirovski's "conceptual prox-method" and a novel reduced-variance gradient estimator based on sampling from the difference between the current iterate and a reference point.

## 1 Introduction

Minimax problems, or games, of the form $\min_x \max_y f(x,y)$ are ubiquitous in economics, statistics, optimization and machine learning. In recent years, minimax formulations for neural network training rose to prominence [15, 23], leading to intense interest in algorithms for solving large scale minimax games [10, 14, 20, 9, 18, 24]. However, the algorithmic toolbox for minimax optimization is not as complete as the one for minimization. Variance reduction, a technique for improving stochastic gradient estimators by introducing control variates, stands as a case in point. A multitude of variance reduction schemes exist for finite-sum minimization [cf. 19, 34, 1, 4, 12], and their impact on complexity is well-understood [43]. In contrast, only a few works apply variance reduction to finite-sum minimax problems [3, 39, 5, 26], and the potential gains from variance reduction are not well-understood.

We take a step towards closing this gap by designing variance-reduced minimax game solvers that offer strict runtime improvements over non-stochastic gradient methods, similar to those of optimal variance reduction methods for finite-sum minimization. To achieve this, we focus on the fundamental class of bilinear minimax games,

$$\min_{x\in\mathcal{X}} \max_{y\in\mathcal{Y}}\; y^\top A x, \quad \text{where } A \in \mathbb{R}^{m\times n}.$$

In particular, we study the complexity of finding an $\epsilon$-approximate saddle point (Nash equilibrium), namely $x, y$ with

$$\max_{y'\in\mathcal{Y}} (y')^\top A x - \min_{x'\in\mathcal{X}} y^\top A x' \le \epsilon.$$

In the setting where $\mathcal{X}$ and $\mathcal{Y}$ are both probability simplices, the problem corresponds to finding an approximate (mixed) equilibrium in a matrix game, a central object in game theory and economics. Matrix games are also fundamental to algorithm design due in part to their equivalence to linear programming [8]. Alternatively, when $\mathcal{X}$ is an $\ell_2$ ball and $\mathcal{Y}$ is a simplex, solving the corresponding problem finds a maximum-margin linear classifier (hard-margin SVM), a fundamental task in machine learning and statistics [25]. We refer to the former as an $\ell_1$-$\ell_1$ game and the latter as an $\ell_2$-$\ell_1$ game; our primary focus is to give improved algorithms for these domains.
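In the simplex-simplex case the approximation criterion above is easy to evaluate, since the inner best responses are attained at vertices of the simplex. The following minimal numpy check illustrates this; the function name and interface are ours, not part of the paper:

```python
import numpy as np

def simplex_game_gap(A, x, y):
    """Duality gap of (x, y) for min_{x in Delta^n} max_{y in Delta^m} y^T A x.

    A best response for y is a simplex vertex, so max_{y'} (y')^T A x = max_i [Ax]_i,
    and similarly min_{x'} y^T A x' = min_j [A^T y]_j.
    """
    return np.max(A @ x) - np.min(A.T @ y)

# (x, y) is an eps-approximate equilibrium iff simplex_game_gap(A, x, y) <= eps.
```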
### 1.1 Our Approach

Our starting point is Nemirovski's "conceptual prox-method" [28] for solving $\min_{x\in\mathcal{X}} \max_{y\in\mathcal{Y}} f(x,y)$, where $f: \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ is convex in $x$ and concave in $y$. The method solves a sequence of subproblems parameterized by $\alpha > 0$, each of the form

$$\text{find } x, y \text{ s.t. } \forall x', y':\quad \langle \nabla_x f(x,y),\, x - x'\rangle - \langle \nabla_y f(x,y),\, y - y'\rangle \le \alpha V_{x_0}(x') + \alpha V_{y_0}(y') \qquad (1)$$

for some $(x_0, y_0) \in \mathcal{X}\times\mathcal{Y}$, where $V_a(b)$ is a norm-suitable Bregman divergence from $a$ to $b$: squared Euclidean distance for $\ell_2$ and KL divergence for $\ell_1$. Combining each subproblem solution with an extragradient step, the prox-method solves the original problem to $\epsilon$ accuracy by solving $\widetilde{O}(\alpha/\epsilon)$ subproblems.¹ (Solving (1) with $\alpha = 0$ is equivalent to solving $\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} f(x,y)$.)

Our first contribution is showing that if a stochastic unbiased gradient estimator $\tilde g$ satisfies the "variance bound"

$$\mathbb{E}\,\|\tilde g(x,y) - \nabla f(x_0,y_0)\|_*^2 \le L^2\|x - x_0\|^2 + L^2\|y - y_0\|^2 \qquad (2)$$

for some $L > 0$, then $O(L^2/\alpha^2)$ regularized stochastic mirror descent steps using $\tilde g$ solve (1) in a suitable probabilistic sense. We call unbiased gradient estimators that satisfy (2) "centered."

Our second contribution is the construction of centered gradient estimators for $\ell_1$-$\ell_1$ and $\ell_2$-$\ell_1$ bilinear games, where $f(x,y) = y^\top A x$. Our $\ell_1$ estimator has the following form. Suppose we wish to estimate $g^x = A^\top y$ (the gradient of $f$ w.r.t. $x$), and we already have $g_0^x = A^\top y_0$. Let $p \in \Delta^m$ be some distribution over $\{1,\ldots,m\}$, draw $i \sim p$ and set

$$\tilde g^x = g_0^x + \frac{[y]_i - [y_0]_i}{p_i}\, A_{i:},$$

where $A_{i:}$ is the $i$th column of $A^\top$. This form is familiar from variance reduction techniques [19, 44, 1], which typically use a fixed distribution $p$. In our setting, however, a fixed $p$ will not produce sufficiently low variance. Departing from prior variance-reduction work and building on [16, 6], we choose $p$ based on $y$ according to

$$p_i(y) = \frac{|[y]_i - [y_0]_i|}{\|y - y_0\|_1},$$

yielding exactly the variance bound we require. We call this technique "sampling from the difference." For our $\ell_2$ gradient estimator, we sample from the squared difference, drawing the $\mathcal{X}$-block coordinate $j \sim q$, where $q_j(x) = ([x]_j - [x_0]_j)^2/\|x - x_0\|_2^2$.

To strengthen our results for $\ell_2$-$\ell_1$ games, we consider a refined version of the centered criterion (2) which allows regret analysis using "local norms" [37, 6]. To further facilitate this analysis we follow [6] and introduce gradient clipping. We extend our proofs to show that stochastic regularized mirror descent can solve (1) despite the (distance-bounded) bias caused by gradient clipping. Our gradient estimators attain the bound (2) with $L$ equal to the Lipschitz constant of $\nabla f$; specifically,

$$L = \begin{cases} \max_{ij} |A_{ij}| & \text{in the } \ell_1\text{-}\ell_1 \text{ setup,} \\ \max_i \|A_{i:}\|_2 & \text{in the } \ell_2\text{-}\ell_1 \text{ setup.} \end{cases} \qquad (3)$$

### 1.2 Method complexity compared with prior art

As per the discussion above, to achieve accuracy $\epsilon$ our algorithm solves $\widetilde{O}(\alpha/\epsilon)$ subproblems. Each subproblem takes $O(\mathrm{nnz}(A))$ time for computing two exact gradients (one for variance reduction and one for an extragradient step), plus an additional $(m+n)L^2/\alpha^2$ time for the inner mirror descent iterations, with $L$ as in (3). The total runtime is therefore

$$\widetilde{O}\left(\left(\mathrm{nnz}(A) + \frac{(m+n)L^2}{\alpha^2}\right)\frac{\alpha}{\epsilon}\right).$$

¹More precisely, the required number of subproblem solutions is at most $\alpha\Theta/\epsilon$, where $\Theta$ is a "domain size" parameter that depends on $\mathcal{X}$, $\mathcal{Y}$, and the Bregman divergence $V$ (see Section 2). In the $\ell_1$ and $\ell_2$ settings considered in this paper, we have the bound $\Theta \le \log(nm)$ and we use the $\widetilde{O}$ notation to suppress terms logarithmic in $n$ and $m$. However, in other settings, e.g., $\ell_\infty$-$\ell_1$ games [cf. 38, 40], making the parameter $\Theta$ scale logarithmically with the problem dimension is far more difficult.

By setting $\alpha$ optimally to be $\max\{\epsilon,\; L\sqrt{(m+n)/\mathrm{nnz}(A)}\}$, we obtain the runtime

$$\widetilde{O}\left(\mathrm{nnz}(A) + \sqrt{\mathrm{nnz}(A)(m+n)}\cdot L\,\epsilon^{-1}\right). \qquad (4)$$
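The choice of $\alpha$ balances the two terms in the per-subproblem cost. Suppressing the $\Theta$ and logarithmic factors, the calculation behind (4) is:

$$\left(\mathrm{nnz}(A) + \frac{(m+n)L^2}{\alpha^2}\right)\frac{\alpha}{\epsilon} = \frac{\mathrm{nnz}(A)\,\alpha}{\epsilon} + \frac{(m+n)L^2}{\alpha\,\epsilon} \quad\text{is minimized at}\quad \alpha^\star = L\sqrt{\frac{m+n}{\mathrm{nnz}(A)}},$$

where both terms equal $\sqrt{\mathrm{nnz}(A)(m+n)}\,L/\epsilon$. Taking $\alpha = \max\{\epsilon, \alpha^\star\}$ additionally caps the first term at $\mathrm{nnz}(A)$ in the regime $\alpha^\star < \epsilon$.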
**Comparison with mirror-prox and dual extrapolation.** Nemirovski [28] instantiates his conceptual prox-method by solving the relaxed proximal problem (1) with $\alpha = L$ in time $O(\mathrm{nnz}(A))$, where $L$ is the Lipschitz constant of $\nabla f$, as given in (3). The total complexity of the resulting method is therefore

$$\widetilde{O}\left(\mathrm{nnz}(A)\cdot L\,\epsilon^{-1}\right). \qquad (5)$$

The closely related dual extrapolation method of Nesterov [31] attains the same rate of convergence. We refer to the running time (5) as linear, since it scales linearly with the problem description size $\mathrm{nnz}(A)$. Our running time guarantee (4) is never worse than (5) by more than a constant factor, and improves on (5) when $\mathrm{nnz}(A) = \omega(n+m)$, i.e. whenever $A$ is not extremely sparse. In that regime, our method uses $\alpha \ll L$, hence solving a harder version of (1) than is possible for mirror-prox.

**Comparison with sublinear-time methods.** Using a randomized algorithm, Grigoriadis and Khachiyan [16] solve $\ell_1$-$\ell_1$ bilinear games in time

$$\widetilde{O}\left((m+n)\,L^2\,\epsilon^{-2}\right), \qquad (6)$$

and Clarkson et al. [6] extend this result to $\ell_2$-$\ell_1$ bilinear games, with the values of $L$ as in (3). Since these runtimes scale with $m+n \le \mathrm{nnz}(A)$, we refer to them as sublinear. Our guarantee (4) improves on (6) whenever $(m+n)L^2\epsilon^{-2} \ge \mathrm{nnz}(A)$, i.e. whenever (6) is not truly sublinear.

Our method carefully balances linear-time extragradient steps with cheap sublinear-time stochastic gradient steps. Consequently, our runtime guarantee (4) inherits strengths from both the linear and sublinear runtimes. First, our runtime scales linearly with $L/\epsilon$ rather than quadratically, as does the linear runtime (5). Second, while our runtime is not strictly sublinear, its component proportional to $L/\epsilon$ is $\sqrt{\mathrm{nnz}(A)(m+n)}$, which is sublinear in $\mathrm{nnz}(A)$. Overall, our method offers the best runtime guarantee in the literature in the regime where $\epsilon$ is small enough that (6) is not truly sublinear, i.e. $\epsilon \lesssim L\sqrt{(m+n)/\mathrm{nnz}(A)}$, yet large enough that (4) remains below the best known theoretical runtimes of interior point methods: $\widetilde{O}(\max\{n,m\}^\omega \log(L/\epsilon))$ [7] and $\widetilde{O}\big((\mathrm{nnz}(A) + \min\{n,m\}^2)\sqrt{\min\{n,m\}}\,\log(L/\epsilon)\big)$ [21], where $\omega$ is the (current) matrix multiplication exponent. In the square dense case (i.e. $\mathrm{nnz}(A) \approx n^2 = m^2$), we improve on the accelerated runtime (5) by a factor of $\sqrt{n}$, the same improvement that optimal variance-reduced finite-sum minimization methods achieve over the fast gradient method [44, 1].

### 1.3 Related work

Matrix games, the canonical form of discrete zero-sum games, have long been studied in economics [32]. The classical mirror descent (i.e. no-regret) method yields an algorithm with running time $\widetilde{O}(\mathrm{nnz}(A)\,L^2\epsilon^{-2})$ [30]. Subsequent work [16, 28, 31, 6] improves this runtime as described above. Our work builds on the extragradient scheme of Nemirovski [28], as well as the gradient estimation and clipping technique of Clarkson et al. [6].

Balamurugan and Bach [3] apply standard variance reduction [19] to bilinear $\ell_2$-$\ell_2$ games by sampling elements proportional to squared matrix entries. Using proximal-point acceleration they obtain a runtime of $\widetilde{O}(\mathrm{nnz}(A) + \|A\|_F\sqrt{\mathrm{nnz}(A)\max\{m,n\}}\,\epsilon^{-1}\log\frac{1}{\epsilon})$, a rate we recover using our algorithm (Appendix E). However, in this setting the mirror-prox method has runtime $\widetilde{O}(\|A\|_{\mathrm{op}}\,\mathrm{nnz}(A)\,\epsilon^{-1})$, which may be better than the result of [3] by a factor of $\sqrt{mn/\mathrm{nnz}(A)}$ due to the discrepancy in the norm of $A$. Naive application of [3] to $\ell_1$ domains results in even greater potential losses. Shi et al. [39] extend the method of [3] to smooth functions using general Bregman divergences, but their extension is unaccelerated and appears limited to an $\epsilon^{-2}$ rate. Chavdarova et al. [5] propose a variance-reduced extragradient method with applications to generative adversarial training.
In contrast to our algorithm, which performs extragradient steps in the outer loop, the method of [5] performs stochastic extragradient steps in the inner loop, using finite-sum variance reduction as in [19]. Chavdarova et al. [5] analyze their method in the convex-concave setting, showing improved stability over direct application of the extragradient method to noisy gradients; however, their complexity guarantees are worse than those of linear-time methods. Following up on [5], Mishchenko et al. [26] propose to reduce the variance of the stochastic extragradient method by using the same stochastic sample for both the gradient and extragradient steps. In the Euclidean strongly convex case, they show a convergence guarantee with a relaxed variance assumption, and in the noiseless full-rank bilinear case they recover the guarantees of [27]. In the general convex case, however, they only show an $\epsilon^{-2}$ rate of convergence.

### 1.4 Paper outline and additional contributions

We define our notation in Section 2. In Section 3.1, we review Nemirovski's conceptual prox-method and introduce the notion of a relaxed proximal oracle; we implement such an oracle using variance-reduced gradient estimators in Section 3.2. In Section 4, we construct these gradient estimators for the $\ell_1$-$\ell_1$ and $\ell_2$-$\ell_1$ domain settings, and complete the analyses of the corresponding algorithms; in Appendix E we provide an analogous treatment for the $\ell_2$-$\ell_2$ setting, recovering the results of [3]. In Appendix F we provide three additional contributions: variance-reduction-based computation of proximal points for arbitrary convex-concave functions (Appendix F.1); extension of our results to "composite" saddle point problems of the form $\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}}\{f(x,y) + \phi(x) - \psi(y)\}$, where $f$ admits a centered gradient estimator and $\phi, \psi$ are "simple" convex functions (Appendix F.2); and a number of alternative centered gradient estimators for the $\ell_2$-$\ell_1$ and $\ell_2$-$\ell_2$ settings (Appendix F.3).

## 2 Notation

**Problem setup.** A setup is the triplet $(\mathcal{Z}, \|\cdot\|, r)$ where: (i) $\mathcal{Z}$ is a compact and convex subset of $\mathbb{R}^n\times\mathbb{R}^m$, (ii) $\|\cdot\|$ is a norm on $\mathcal{Z}$, and (iii) $r$ is 1-strongly-convex w.r.t. $\mathcal{Z}$ and $\|\cdot\|$, i.e. such that $r(z') \ge r(z) + \langle\nabla r(z), z' - z\rangle + \frac{1}{2}\|z' - z\|^2$ for all $z, z'\in\mathcal{Z}$.² We call $r$ the distance generating function and denote the Bregman divergence associated with it by

$$V_z(z') := r(z') - r(z) - \langle\nabla r(z), z' - z\rangle \ge \tfrac{1}{2}\|z' - z\|^2.$$

We also denote $\Theta := \max_{z'} r(z') - \min_{z} r(z)$ and assume it is finite.

**Norms and dual norms.** We write $\mathcal{S}^*$ for the set of linear functions on $\mathcal{S}$. For $\zeta\in\mathcal{Z}^*$ we define the dual norm of $\|\cdot\|$ as $\|\zeta\|_* := \max_{\|z\|\le 1}\langle\zeta, z\rangle$. For $p \ge 1$ we write the $\ell_p$ norm $\|z\|_p = (\sum_i |z_i|^p)^{1/p}$, with $\|z\|_\infty = \max_i |z_i|$. The dual norm of $\ell_p$ is $\ell_q$ with $q^{-1} = 1 - p^{-1}$.

**Domain components.** We assume $\mathcal{Z}$ is of the form $\mathcal{X}\times\mathcal{Y}$ for convex and compact sets $\mathcal{X}\subset\mathbb{R}^n$ and $\mathcal{Y}\subset\mathbb{R}^m$. Particular sets of interest are the simplex $\Delta^d = \{v\in\mathbb{R}^d \mid \|v\|_1 = 1,\, v \ge 0\}$ and the Euclidean ball $\mathbb{B}^d = \{v\in\mathbb{R}^d \mid \|v\|_2 \le 1\}$. For any vector $z\in\mathbb{R}^n\times\mathbb{R}^m$, we write $z^x$ and $z^y$ for the first $n$ and last $m$ coordinates of $z$, respectively. When totally clear from context, we sometimes refer to the $\mathcal{X}$ and $\mathcal{Y}$ components of $z$ directly as $x$ and $y$. We write the $i$th coordinate of vector $v$ as $[v]_i$.

**Matrices.** We consider a matrix $A\in\mathbb{R}^{m\times n}$ and write $\mathrm{nnz}(A)$ for the number of its nonzero entries. For $i\in[m]$ and $j\in[n]$ we write $A_{i:}$, $A_{:j}$ and $A_{ij}$ for the corresponding row, column and entry, respectively.³ We consider the matrix norms $\|A\|_{\max} := \max_{ij}|A_{ij}|$, $\|A\|_{p\to q} := \max_{\|x\|_p\le 1}\|Ax\|_q$ and $\|A\|_F := (\sum_{ij}A_{ij}^2)^{1/2}$.

²For non-differentiable $r$, let $\langle\nabla r(z), w\rangle := \sup_{\gamma\in\partial r(z)}\langle\gamma, w\rangle$, where $\partial r(z)$ is the subdifferential of $r$ at $z$.

³For $k\in\mathbb{N}$, we let $[k] := \{1,\ldots,k\}$.
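The three matrix norms above are exactly the Lipschitz constants that appear in our runtime bounds. A quick numpy illustration (the function name and dictionary keys are ours):

```python
import numpy as np

def matrix_norms(A):
    """The matrix norms used in this paper, for A of shape (m, n).

    ||A||_max:      Lipschitz constant of g in the l1-l1 setup (Section 4.1).
    ||A||_{2->inf}: max_i ||A_{i:}||_2, Lipschitz constant in the l2-l1
                    setup (Section 4.2).
    ||A||_F:        Frobenius norm, used in the l2-l2 setup (Appendix E).
    """
    return {
        "max": np.abs(A).max(),
        "2->inf": np.linalg.norm(A, axis=1).max(),
        "F": np.linalg.norm(A),
    }
```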
## 3 Primal-dual variance reduction framework

In this section, we establish a framework for solving the saddle point problem

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} f(x,y),$$

where $f$ is convex in $x$ and concave in $y$, and admits a (variance-reduced) stochastic estimator for the continuous and monotone⁴ gradient mapping

$$g(z) = g(x,y) := \left(\nabla_x f(x,y),\, -\nabla_y f(x,y)\right).$$

Our goal is to find an $\epsilon$-approximate saddle point (Nash equilibrium), i.e. $z\in\mathcal{Z} := \mathcal{X}\times\mathcal{Y}$ such that

$$\mathrm{Gap}(z) := \max_{y'\in\mathcal{Y}} f(z^x, y') - \min_{x'\in\mathcal{X}} f(x', z^y) \le \epsilon. \qquad (7)$$

We achieve this by generating a sequence $z_1, z_2, \ldots, z_K$ such that $\frac{1}{K}\sum_{k=1}^K\langle g(z_k), z_k - u\rangle \le \epsilon$ for every $u\in\mathcal{Z}$, and using the fact that

$$\mathrm{Gap}\left(\frac{1}{K}\sum_{k=1}^K z_k\right) \le \max_{u\in\mathcal{Z}}\frac{1}{K}\sum_{k=1}^K\langle g(z_k), z_k - u\rangle \qquad (8)$$

due to convexity-concavity of $f$ (see proof in Appendix A.1).

⁴A mapping $q:\mathcal{Z}\to\mathcal{Z}^*$ is monotone if and only if $\langle q(z') - q(z), z' - z\rangle \ge 0$ for all $z, z'\in\mathcal{Z}$; $g$ is monotone due to convexity-concavity of $f$.

In Section 3.1 we define the notion of a (randomized) relaxed proximal oracle, and describe how Nemirovski's "mirror-prox" method leverages it to solve the saddle point problem above. In Section 3.2 we define a class of centered gradient estimators, whose variance is proportional to the squared distance from a reference point. Given such a centered gradient estimator, we show that a regularized stochastic mirror descent scheme constitutes a relaxed proximal oracle. For a technical reason, we limit our oracle guarantee in Section 3.2 to the bilinear case $f(x,y) = y^\top Ax$, which suffices for the applications in Section 4. We lift this limitation in Appendix F.1, where we show a different oracle implementation that is valid for general convex-concave $f$, with only a logarithmic increase in complexity.

### 3.1 The mirror-prox method with a randomized oracle

Recall that we assume the space $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$ is equipped with a norm $\|\cdot\|$ and distance generating function $r:\mathcal{Z}\to\mathbb{R}$ that is 1-strongly-convex w.r.t. $\|\cdot\|$ and has range $\Theta$. We write the induced Bregman divergence as $V_z(z') = r(z') - r(z) - \langle\nabla r(z), z' - z\rangle$. We use the following fact throughout the paper: by definition, the Bregman divergence satisfies, for any $z, z', u\in\mathcal{Z}$,

$$\langle -\nabla V_z(z'),\, z' - u\rangle = V_z(u) - V_{z'}(u) - V_z(z'). \qquad (9)$$

For any $\alpha > 0$ we define the $\alpha$-proximal mapping $\mathrm{Prox}_z^\alpha(g)$ to be the solution of the variational inequality corresponding to the strongly monotone operator $g + \alpha\nabla V_z$, i.e. the unique $z_\alpha\in\mathcal{Z}$ such that $\langle g(z_\alpha) + \alpha\nabla V_z(z_\alpha),\, z_\alpha - u\rangle \le 0$ for all $u\in\mathcal{Z}$ [cf. 11]. Equivalently (by (9)),

$$\mathrm{Prox}_z^\alpha(g) := \text{the unique } z_\alpha\in\mathcal{Z} \text{ s.t. } \langle g(z_\alpha), z_\alpha - u\rangle \le \alpha\left(V_z(u) - V_{z_\alpha}(u) - V_z(z_\alpha)\right)\;\;\forall u\in\mathcal{Z}. \qquad (10)$$

When $V_z(z') = V^x_{z^x}(x') + V^y_{z^y}(y')$, $\mathrm{Prox}_z^\alpha(g)$ is also the unique solution of the saddle point problem

$$\min_{x'\in\mathcal{X}}\max_{y'\in\mathcal{Y}}\left\{f(x', y') + \alpha V^x_{z^x}(x') - \alpha V^y_{z^y}(y')\right\}.$$

Consider iterations of the form $z_k = \mathrm{Prox}^\alpha_{z_{k-1}}(g)$, with $z_0 = \arg\min_z r(z)$. Averaging the definition (10) over $k$, using the bound (8) and the nonnegativity of Bregman divergences gives

$$\mathrm{Gap}\left(\frac{1}{K}\sum_{k=1}^K z_k\right) \le \max_{u\in\mathcal{Z}}\frac{1}{K}\sum_{k=1}^K\langle g(z_k), z_k - u\rangle \le \max_{u\in\mathcal{Z}}\frac{\alpha}{K}\left(V_{z_0}(u) - V_{z_K}(u)\right) \le \frac{\alpha\Theta}{K}.$$

Thus, we can find an $\epsilon$-suboptimal point in $K = \alpha\Theta/\epsilon$ exact proximal steps. However, computing $\mathrm{Prox}_z^\alpha(g)$ exactly may be as difficult as solving the original problem. Nemirovski [28] proposes a relaxation of the exact proximal mapping, which we slightly extend to include the possibility of randomization, and formalize in the following definition.

**Definition 1** ($(\alpha,\varepsilon)$-relaxed proximal oracle). Let $g$ be a monotone operator and $\alpha, \varepsilon > 0$. An $(\alpha,\varepsilon)$-relaxed proximal oracle for $g$ is a (possibly randomized) mapping $\mathcal{O}:\mathcal{Z}\to\mathcal{Z}$ such that $z' = \mathcal{O}(z)$ satisfies

$$\mathbb{E}\left[\max_{u\in\mathcal{Z}}\left\{\langle g(z'), z' - u\rangle - \alpha V_z(u)\right\}\right] \le \varepsilon.$$

Note that $\mathcal{O}(z) = \mathrm{Prox}_z^\alpha(g)$ is an $(\alpha, 0)$-relaxed proximal oracle.
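While the implicit proximal mapping is expensive, the proximal mapping of a constant (linearized) operator, which the mirror-prox method below uses in its second step, has a closed form in the setups we consider. As a concrete example, for negative entropy on the simplex (the Section 4.1 geometry, where $V_z(z') = \mathrm{KL}(z'\,\|\,z)$) it is a multiplicative-weights update; a minimal sketch, with a function name of our choosing:

```python
import numpy as np

def entropic_linear_prox(z, gamma, alpha):
    """argmin_{z' in simplex} <gamma, z'> + alpha * KL(z' || z).

    Setting the Lagrangian gradient to zero gives
    z'_i proportional to z_i * exp(-gamma_i / alpha).
    """
    logw = np.log(z) - gamma / alpha
    w = np.exp(logw - logw.max())  # shift for numerical stability
    return w / w.sum()
```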
Algorithm 1 describes the conceptual prox-method of Nemirovski [28], which recovers the error guarantee of exact proximal iterations. The $k$th iteration consists of (i) a relaxed proximal oracle call producing $z_{k-1/2} = \mathcal{O}(z_{k-1})$, and (ii) a linearized proximal (mirror) step where we replace $z\mapsto g(z)$ with the constant function $z\mapsto g(z_{k-1/2})$, producing $z_k = \mathrm{Prox}^\alpha_{z_{k-1}}(g(z_{k-1/2}))$. We now state the convergence guarantee for the mirror-prox method, first shown in [28] (see Appendix B.1 for a simple proof).

**Algorithm 1:** $\mathrm{OuterLoop}(\mathcal{O})$ (Nemirovski [28])

*Input:* $(\alpha,\varepsilon)$-relaxed proximal oracle $\mathcal{O}(z)$ for gradient mapping $g$, distance-generating $r$.
*Parameters:* number of iterations $K$.
*Output:* point $\bar z_K$ with $\mathbb{E}\,\mathrm{Gap}(\bar z_K) \le \alpha\Theta/K + \varepsilon$.

1. $z_0 \leftarrow \arg\min_{z\in\mathcal{Z}} r(z)$
2. **for** $k = 1, \ldots, K$ **do**
3. $\quad z_{k-1/2} \leftarrow \mathcal{O}(z_{k-1})$ &nbsp;&nbsp;(we implement $\mathcal{O}(z_{k-1})$ by calling $\mathrm{InnerLoop}(z_{k-1}, \tilde g_{z_{k-1}}, \alpha)$)
4. $\quad z_k \leftarrow \mathrm{Prox}^{\alpha}_{z_{k-1}}(g(z_{k-1/2})) = \arg\min_{z\in\mathcal{Z}}\left\{\langle g(z_{k-1/2}), z\rangle + \alpha V_{z_{k-1}}(z)\right\}$
5. **return** $\bar z_K = \frac{1}{K}\sum_{k=1}^{K} z_{k-1/2}$

**Proposition 1** (Mirror prox convergence via oracles). Let $\mathcal{O}$ be an $(\alpha,\varepsilon)$-relaxed proximal oracle with respect to gradient mapping $g$ and distance-generating function $r$ with range at most $\Theta$. Let $z_{1/2}, z_{3/2}, \ldots, z_{K-1/2}$ be the iterates of Algorithm 1 and let $\bar z_K$ be its output. Then

$$\mathbb{E}\,\mathrm{Gap}(\bar z_K) \le \mathbb{E}\max_{u\in\mathcal{Z}}\frac{1}{K}\sum_{k=1}^K\left\langle g(z_{k-1/2}),\, z_{k-1/2} - u\right\rangle \le \frac{\alpha\Theta}{K} + \varepsilon.$$

### 3.2 Implementation of an $(\alpha, 0)$-relaxed proximal oracle

We now explain how to use stochastic variance-reduced gradient estimators to design an efficient $(\alpha, 0)$-relaxed proximal oracle. We begin by introducing the bias and variance properties of the estimators we require.

**Definition 2.** Let $z_0\in\mathcal{Z}$ and $L > 0$. A stochastic gradient estimator $\tilde g_{z_0}:\mathcal{Z}\to\mathcal{Z}^*$ is called $(z_0, L)$-centered for $g$ if for all $z\in\mathcal{Z}$:

1. $\mathbb{E}\,[\tilde g_{z_0}(z)] = g(z)$,
2. $\mathbb{E}\,\|\tilde g_{z_0}(z) - g(z_0)\|_*^2 \le L^2\|z - z_0\|^2$.

**Lemma 1.** A $(z_0, L)$-centered estimator for $g$ satisfies $\mathbb{E}\,\|\tilde g_{z_0}(z) - g(z)\|_*^2 \le (2L)^2\|z - z_0\|^2$.

*Proof.* Writing $\delta = \tilde g_{z_0}(z) - g(z_0)$, we have $\mathbb{E}\,\delta = g(z) - g(z_0)$ by the first centered estimator property. Therefore,

$$\mathbb{E}\,\|\tilde g_{z_0}(z) - g(z)\|_*^2 = \mathbb{E}\,\|\delta - \mathbb{E}\,\delta\|_*^2 \overset{(i)}{\le} \mathbb{E}\left(\|\delta\|_* + \|\mathbb{E}\,\delta\|_*\right)^2 \overset{(ii)}{\le} 4\,\mathbb{E}\,\|\delta\|_*^2 \overset{(iii)}{\le} (2L)^2\|z - z_0\|^2,$$

where the bounds follow from (i) the triangle inequality, (ii) Jensen's inequality, and (iii) the second centered estimator property. ∎

**Remark 1.** A gradient mapping that admits a $(z, L)$-centered gradient estimator for every $z\in\mathcal{Z}$ is $2L$-Lipschitz, since by Jensen's inequality and Lemma 1 we have, for all $w\in\mathcal{Z}$,

$$\|g(w) - g(z)\|_* = \|\mathbb{E}\,\tilde g_z(w) - g(z)\|_* \le \left(\mathbb{E}\,\|\tilde g_z(w) - g(z)\|_*^2\right)^{1/2} \le 2L\|w - z\|.$$

**Remark 2.** Definition 2 bounds the gradient variance using the distance to the reference point. Similar bounds are used in variance reduction for bilinear saddle-point problems with Euclidean norm [3], as well as for finding stationary points in smooth nonconvex finite-sum problems [2, 33, 12, 45]. However, known variance reduction methods for smooth convex finite-sum minimization require stronger bounds [cf. 1, Section 2.1].

With the variance bounds defined, we describe Algorithm 2, which (for the bilinear case) implements a relaxed proximal oracle. The algorithm is stochastic mirror descent with an additional regularization term around the initial point $w_0$. Note that we do not perform extragradient steps in this stochastic method. When combined with a centered gradient estimator, the iterates of Algorithm 2 provide the following guarantee, which is one of our key technical contributions.

**Algorithm 2:** $\mathrm{InnerLoop}(w_0, \tilde g_{w_0}, \alpha)$

*Input:* initial $w_0\in\mathcal{Z}$, gradient estimator $\tilde g_{w_0}$, oracle quality $\alpha > 0$.
*Parameters:* step size $\eta$, number of iterations $T$.
*Output:* point $\bar w_T$ satisfying Definition 1 (for appropriate $\tilde g_{w_0}$, $\eta$, $T$).

1. **for** $t = 1, \ldots, T$ **do**
2. $\quad w_t \leftarrow \arg\min_{w\in\mathcal{Z}}\left\{\langle\tilde g_{w_0}(w_{t-1}), w\rangle + \frac{\alpha}{2}V_{w_0}(w) + \frac{1}{\eta}V_{w_{t-1}}(w)\right\}$
3. **return** $\bar w_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
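Below is a minimal numpy sketch of Algorithm 2 for the entropic setup of Section 4.1, where the step in line 2 has a closed form: the same Lagrangian calculation as for the linear prox above gives $w_t \propto \left(w_{t-1}\cdot w_0^{\eta\alpha/2}\cdot e^{-\eta\tilde g}\right)^{1/(1+\eta\alpha/2)}$, normalized blockwise. The interface and names are ours:

```python
import numpy as np

def _normalize_log(logw):
    """Map exp(logw) onto the simplex, stably."""
    w = np.exp(logw - logw.max())
    return w / w.sum()

def inner_loop(w0, grad_est, alpha, L, n):
    """Sketch of Algorithm 2 over Z = Delta^n x Delta^m (Section 4.1 geometry).

    w0 concatenates the x- and y-blocks (each a strictly positive probability
    vector); grad_est(x, y) returns the two blocks of a (w0, L)-centered
    estimate of g(w) = (A^T y, -A x). Step size and iteration count follow
    Proposition 2 below.
    """
    eta = alpha / (10 * L**2)
    T = int(np.ceil(40 * L**2 / alpha**2))
    c = eta * alpha / 2  # weight of the extra regularization toward w0
    w, avg = w0.copy(), np.zeros_like(w0)
    for _ in range(T):
        ghat = np.concatenate(grad_est(w[:n], w[n:]))
        # closed form of argmin_w <ghat, w> + (alpha/2) V_{w0}(w) + (1/eta) V_{w_{t-1}}(w)
        logw = (np.log(w) + c * np.log(w0) - eta * ghat) / (1 + c)
        w = np.concatenate([_normalize_log(logw[:n]), _normalize_log(logw[n:])])
        avg += w / T
    return avg
```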
**Proposition 2.** Let $\alpha, L > 0$, let $w_0\in\mathcal{Z}$ and let $\tilde g_{w_0}$ be $(w_0, L)$-centered for monotone $g$. Then, for $\eta = \frac{\alpha}{10L^2}$ and $T \ge \frac{4}{\eta\alpha} = \frac{40L^2}{\alpha^2}$, the iterates of Algorithm 2 satisfy

$$\mathbb{E}\left[\max_{u\in\mathcal{Z}}\left\{\frac{1}{T}\sum_{t=1}^T\langle g(w_t), w_t - u\rangle - \alpha V_{w_0}(u)\right\}\right] \le 0. \qquad (11)$$

Before discussing the proof of Proposition 2, we state how it implies the relaxed proximal oracle property for the bilinear case.

**Corollary 1.** Let $A\in\mathbb{R}^{m\times n}$ and let $g(z) = (A^\top z^y, -Az^x)$. Then, in the setting of Proposition 2, $\mathcal{O}(w_0) = \mathrm{InnerLoop}(w_0, \tilde g_{w_0}, \alpha)$ is an $(\alpha, 0)$-relaxed proximal oracle.

*Proof.* Note that $\langle g(z), w\rangle = -\langle g(w), z\rangle$ for any $z, w\in\mathcal{Z}$, and consequently $\langle g(z), z\rangle = 0$. Therefore, the iterates $w_1, \ldots, w_T$ of Algorithm 2 and its output $\bar w_T = \frac{1}{T}\sum_{t=1}^T w_t$ satisfy, for every $u\in\mathcal{Z}$,

$$\frac{1}{T}\sum_{t=1}^T\langle g(w_t), w_t - u\rangle = \frac{1}{T}\sum_{t=1}^T\langle g(u), w_t\rangle = \langle g(u), \bar w_T\rangle = \langle g(\bar w_T), \bar w_T - u\rangle.$$

Substituting into the bound (11) yields the $(\alpha, 0)$-relaxed proximal oracle property in Definition 1. ∎

More generally, the proof of Corollary 1 shows that Algorithm 2 implements a relaxed proximal oracle whenever $z\mapsto\langle g(z), z - u\rangle$ is convex for every $u$. In Appendix F.1 we implement an $(\alpha,\varepsilon)$-relaxed proximal oracle without such an assumption.

The proof of Proposition 2 is a somewhat lengthy application of existing techniques for stochastic mirror descent analysis in conjunction with Definition 2. We give it in full in Appendix B.2 and sketch it briefly here. We view Algorithm 2 as mirror descent with stochastic gradients $\delta_t = \tilde g_{w_0}(w_t) - g(w_0)$ and composite term $\langle g(w_0), z\rangle + \frac{\alpha}{2}V_{w_0}(z)$. For any $u\in\mathcal{Z}$, the standard mirror descent analysis (see Lemma 4 in Appendix A.2) bounds the regret $\sum_{t\in[T]}\langle\delta_t + g(w_0) + \frac{\alpha}{2}\nabla V_{w_0}(w_t),\, w_t - u\rangle$ in terms of the distance to initialization $V_{w_0}(u)$ and the stochastic gradient norms $\|\delta_t\|_*^2$ for $t\in[T]$. Bounding these norms via Definition 2 and rearranging the $\langle\nabla V_{w_0}(w_t), w_t - u\rangle$ terms, we show that

$$\mathbb{E}\left[\frac{1}{T}\sum_{t\in[T]}\langle g(w_t), w_t - u\rangle - \alpha V_{w_0}(u)\right] \le 0 \quad\text{for all } u\in\mathcal{Z}.$$

To reach our desired result (11) we must swap the order of the expectation and the "for all." We do so using the "ghost iterate" technique due to Nemirovski et al. [29].

## 4 Application to bilinear saddle point problems

We now construct centered gradient estimators (as per Definition 2) for the linear gradient mapping $g(z) = (A^\top z^y, -Az^x)$ corresponding to the bilinear saddle point problem $\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} y^\top Ax$. Sections 4.1 and 4.2 consider the $\ell_1$-$\ell_1$ and $\ell_2$-$\ell_1$ settings, respectively; in Appendix E we show how our approach naturally extends to the $\ell_2$-$\ell_2$ setting as well. Throughout, we let $w_0$ denote the "center" (i.e. reference point) of our stochastic gradient estimator and consider a general query point $w\in\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$. We also recall the notation $[v]_i$ for the $i$th entry of vector $v$.

### 4.1 $\ell_1$-$\ell_1$ games

**Setup.** Denoting the $d$-dimensional simplex by $\Delta^d$, we let $\mathcal{X} = \Delta^n$, $\mathcal{Y} = \Delta^m$ and $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$. We take $\|\cdot\|$ to be the $\ell_1$ norm, with dual norm $\|\cdot\|_* = \|\cdot\|_\infty$. We take the distance generating function $r$ to be the negative entropy, i.e. $r(z) = \sum_i [z]_i\log[z]_i$. We note that both $\|\cdot\|_1$ and $r$ are separable, and in particular separate over the $\mathcal{X}$ and $\mathcal{Y}$ blocks of $\mathcal{Z}$. Finally we set

$$\|A\|_{\max} := \max_{i,j}|A_{ij}|,$$

and note that this is the Lipschitz constant of the gradient mapping $g$ under the chosen norm.

**Gradient estimator.** Given $w_0 = (w_0^x, w_0^y)$ and $g(w_0) = (A^\top w_0^y, -Aw_0^x)$, we describe the reduced-variance gradient estimator $\tilde g_{w_0}(w)$. First, we define the probabilities $p(w)\in\Delta^m$ and $q(w)\in\Delta^n$ according to

$$p_i(w) := \frac{|[w^y]_i - [w_0^y]_i|}{\|w^y - w_0^y\|_1} \quad\text{and}\quad q_j(w) := \frac{|[w^x]_j - [w_0^x]_j|}{\|w^x - w_0^x\|_1}. \qquad (12)$$

To compute $\tilde g_{w_0}(w)$ we sample $i\sim p(w)$ and $j\sim q(w)$ independently, and set

$$\tilde g_{w_0}(w) = g(w_0) + \left(\frac{[w^y]_i - [w_0^y]_i}{p_i(w)}\,A_{i:},\; -\frac{[w^x]_j - [w_0^x]_j}{q_j(w)}\,A_{:j}\right), \qquad (13)$$

where $A_{i:}$ and $A_{:j}$ are the $i$th row and $j$th column of $A$, respectively.
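A minimal numpy sketch of this estimator (names and interface ours): constructing it costs one exact gradient evaluation, $O(\mathrm{nnz}(A))$ time, after which each query costs only $O(m+n)$, and it plugs directly into the inner-loop sketch of Section 3.2.

```python
import numpy as np

def make_centered_estimator_l1(A, x0, y0, rng=None):
    """Sampling-from-the-difference estimator (12)-(13) for the l1-l1 setup.

    Returns ghat(x, y), unbiased for g = (A^T y, -A x) and satisfying the
    (w0, ||A||_max)-centered bound of Lemma 2 below.
    """
    if rng is None:
        rng = np.random.default_rng()
    m, n = A.shape
    g0x, g0y = A.T @ y0, -(A @ x0)  # exact gradient at the reference point

    def ghat(x, y):
        dx, dy = x - x0, y - y0
        gx, gy = g0x.copy(), g0y.copy()
        if np.abs(dy).sum() > 0:            # draw a row index i ~ p(w), eq. (12)
            p = np.abs(dy) / np.abs(dy).sum()
            i = rng.choice(m, p=p)
            gx += A[i, :] * (dy[i] / p[i])  # = A_{i:} ||dy||_1 sign(dy_i)
        if np.abs(dx).sum() > 0:            # draw a column index j ~ q(w), eq. (12)
            q = np.abs(dx) / np.abs(dx).sum()
            j = rng.choice(n, p=q)
            gy -= A[:, j] * (dx[j] / q[j])
        return gx, gy

    return ghat
```

Note that at the reference point $w = w_0$ the sketch returns exactly $g(w_0)$, consistent with Definition 2.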
Since the sampling distributions $p(w), q(w)$ are proportional to the absolute value of the difference between blocks of $w$ and $w_0$, we call the strategy (12) "sampling from the difference." Substituting (12) into (13) gives the explicit form

$$\tilde g_{w_0}(w) = g(w_0) + \left(A_{i:}\|w^y - w_0^y\|_1\,\mathrm{sign}([w^y - w_0^y]_i),\; -A_{:j}\|w^x - w_0^x\|_1\,\mathrm{sign}([w^x - w_0^x]_j)\right). \qquad (14)$$

A straightforward calculation shows that this construction satisfies Definition 2.

**Lemma 2.** In the $\ell_1$-$\ell_1$ setup, the estimator (14) is $(w_0, L)$-centered with $L = \|A\|_{\max}$.

*Proof.* The first property ($\mathbb{E}\,\tilde g_{w_0}(w) = g(w)$) follows immediately by inspection of (13). The second property follows from (14) by noting that

$$\|\tilde g_{w_0}(w) - g(w_0)\|_\infty = \max\left\{\|A_{i:}\|_\infty\|w^y - w_0^y\|_1,\; \|A_{:j}\|_\infty\|w^x - w_0^x\|_1\right\} \le \|A\|_{\max}\|w - w_0\|_1$$

for all $i, j$, and therefore $\mathbb{E}\,\|\tilde g_{w_0}(w) - g(w_0)\|_\infty^2 \le \|A\|_{\max}^2\|w - w_0\|_1^2$. ∎

The proof of Lemma 2 reveals that the proposed estimator satisfies a stronger version of Definition 2: the second property (and also Lemma 1) holds with probability 1 rather than in expectation.

**Runtime bound.** Combining the centered gradient estimator (13), the relaxed oracle implementation (Algorithm 2) and the extragradient outer loop (Algorithm 1), we obtain our main result for $\ell_1$-$\ell_1$ games: an accelerated stochastic variance reduction algorithm. We write the resulting complete method explicitly as Algorithm 3 in Appendix C.1. The algorithm enjoys the following runtime guarantee (see proof in Appendix C.2).

**Theorem 1.** Let $A\in\mathbb{R}^{m\times n}$, $\epsilon > 0$, and $\alpha \ge \epsilon/\log(nm)$. Algorithm 3 outputs a point $z = (z^x, z^y)$ such that

$$\mathbb{E}\left[\max_{y\in\Delta^m} y^\top Az^x - \min_{x\in\Delta^n}(z^y)^\top Ax\right] = \mathbb{E}\left[\max_i\,[Az^x]_i - \min_j\,[A^\top z^y]_j\right] \le \epsilon,$$

and runs in time

$$O\left(\left(\mathrm{nnz}(A) + \frac{(m+n)\|A\|_{\max}^2}{\alpha^2}\right)\frac{\alpha\log(mn)}{\epsilon}\right).$$

Setting $\alpha$ optimally, the running time is

$$O\left(\left(\mathrm{nnz}(A) + \sqrt{\mathrm{nnz}(A)(m+n)}\cdot\frac{\|A\|_{\max}}{\epsilon}\right)\log(mn)\right).$$

### 4.2 $\ell_2$-$\ell_1$ games

**Setup.** We set $\mathcal{X} = \mathbb{B}^n$ to be the $n$-dimensional Euclidean ball of radius 1, while $\mathcal{Y} = \Delta^m$ remains the simplex. For $z = (z^x, z^y)\in\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$ we define a norm by

$$\|z\|^2 = \|z^x\|_2^2 + \|z^y\|_1^2, \quad\text{with dual norm}\quad \|\zeta\|_*^2 = \|\zeta^x\|_2^2 + \|\zeta^y\|_\infty^2.$$

For distance generating function we take $r(z) = r^x(z^x) + r^y(z^y)$ with $r^x(x) = \frac{1}{2}\|x\|_2^2$ and $r^y(y) = \sum_i y_i\log y_i$; $r$ is 1-strongly convex w.r.t. $\|\cdot\|$ and has range $\Theta \le \frac{1}{2} + \log m \le \log(2m)$. Finally, we denote

$$\|A\|_{2\to\infty} = \max_{i\in[m]}\|A_{i:}\|_2,$$

and note that this is the Lipschitz constant of $g$ under $\|\cdot\|$.

**Gradient estimator.** To account for the fact that $\mathcal{X}$ is now the $\ell_2$ unit ball, we modify the sampling distribution $q$ in (12) to

$$q_j(w) = \frac{([w^x]_j - [w_0^x]_j)^2}{\|w^x - w_0^x\|_2^2},$$

and keep $p$ the same. As we explain in detail in Appendix D.1.1, substituting these probabilities into the expression (13) yields a centered gradient estimator with a constant $(\sum_{j\in[n]}\|A_{:j}\|_\infty^2)^{1/2}$ that is larger than $\|A\|_{2\to\infty}$ by a factor of up to $\sqrt{n}$. Using "local norms" analysis allows us to tighten these bounds whenever the stochastic steps have bounded infinity norm. Following Clarkson et al. [6], we enforce such a bound on the step norms via gradient clipping. The final gradient estimator is

$$\tilde g_{w_0}(w) = g(w_0) + \left(A_{i:}\|w^y - w_0^y\|_1\,\mathrm{sign}([w^y - w_0^y]_i),\; -\mathrm{T}_\tau\!\left(A_{:j}\,\frac{[w^x]_j - [w_0^x]_j}{q_j(w)}\right)\right), \quad\text{where}\quad [\mathrm{T}_\tau(v)]_i = \begin{cases} -\tau & [v]_i < -\tau, \\ [v]_i & |[v]_i| \le \tau, \\ \tau & [v]_i > \tau, \end{cases}$$

is an entrywise truncation with threshold $\tau$. The clipping operation $\mathrm{T}_\tau$ introduces bias to the gradient estimator, which we account for by carefully choosing a value of $\tau$ for which the bias is on the same order as the variance, and yet the resulting steps are appropriately bounded; see Appendix D.1.2. In Appendix F.3.1 we describe an alternative gradient estimator for which the distribution $q$ does not depend on the current iterate $w$.
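The two new ingredients relative to Section 4.1, squared-difference sampling and entrywise clipping, are easy to state in code; a minimal sketch (names ours; the threshold $\tau$ is left as an input here, while the analysis fixes its value in Appendix D.1.2):

```python
import numpy as np

def clip_entrywise(v, tau):
    """The truncation operator T_tau: clamp each entry of v to [-tau, tau]."""
    return np.clip(v, -tau, tau)

def sample_from_squared_difference(x, x0, rng=None):
    """Draw j ~ q(w) with q_j = ([x]_j - [x0]_j)^2 / ||x - x0||_2^2."""
    if rng is None:
        rng = np.random.default_rng()
    d2 = (x - x0) ** 2
    return rng.choice(len(x), p=d2 / d2.sum())
```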
**Runtime bound.** Algorithm 4 in Appendix D.5 combines our clipped gradient estimator with our general variance reduction framework. The analysis in Appendix D gives the following guarantee.

**Theorem 2.** Let $A\in\mathbb{R}^{m\times n}$, $\epsilon > 0$, and any $\alpha \ge \epsilon/\log(2m)$. Algorithm 4 outputs a point $z = (z^x, z^y)$ such that

$$\mathbb{E}\left[\max_{y\in\Delta^m} y^\top Az^x - \min_{x\in\mathbb{B}^n}(z^y)^\top Ax\right] = \mathbb{E}\left[\max_i\,[Az^x]_i + \|A^\top z^y\|_2\right] \le \epsilon,$$

and runs in time

$$O\left(\left(\mathrm{nnz}(A) + \frac{(m+n)\|A\|_{2\to\infty}^2}{\alpha^2}\right)\frac{\alpha\log(2m)}{\epsilon}\right).$$

Setting $\alpha$ optimally, the running time is

$$O\left(\left(\mathrm{nnz}(A) + \sqrt{\mathrm{nnz}(A)(m+n)}\cdot\frac{\|A\|_{2\to\infty}}{\epsilon}\right)\log(2m)\right).$$

**Acknowledgments**

YC and YJ were supported by Stanford Graduate Fellowships. AS was supported by the NSF CAREER Award CCF-1844855. KT was supported by the NSF Graduate Fellowship DGE-1656518.

**References**

[1] Z. Allen-Zhu. Katyusha: the first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205, 2017.
[2] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 699–707, 2016.
[3] P. Balamurugan and F. R. Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, 2016.
[4] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
[5] T. Chavdarova, G. Gidel, F. Fleuret, and S. Lacoste-Julien. Reducing noise in GAN training with variance reduced extragradient. In Advances in Neural Information Processing Systems, 2019.
[6] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. In 51st Annual IEEE Symposium on Foundations of Computer Science, pages 449–457, 2010.
[7] M. B. Cohen, Y. T. Lee, and Z. Song. Solving linear programs in the current matrix multiplication time. arXiv preprint arXiv:1810.07896, 2018.
[8] G. B. Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton, NJ, 1963.
[9] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In International Conference on Learning Representations, 2018.
[10] Y. Drori, S. Sabach, and M. Teboulle. A simple algorithm for a class of nonsmooth convex-concave saddle-point problems. Operations Research Letters, 43(2):209–214, 2015.
[11] J. Eckstein. Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming. Mathematics of Operations Research, 18(1):202–226, 1993.
[12] C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, 2018.
[13] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 2540–2548, 2015.
[14] G. Gidel, T. Jebara, and S. Lacoste-Julien. Frank-Wolfe algorithms for saddle point problems. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
[15] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[16] M. D. Grigoriadis and L. G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 18(2):53–58, 1995.
[17] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Springer, New York, 1993.
[18] C. Jin, P. Netrapalli, and M. I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.
[19] R. Johnson and T. Zhang.
Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
[20] O. Kolossoski and R. D. Monteiro. An accelerated non-Euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems. Optimization Methods and Software, 32(6):1244–1272, 2017.
[21] Y. T. Lee and A. Sidford. Efficient inverse maintenance and faster algorithms for linear programming. In IEEE 56th Annual Symposium on Foundations of Computer Science, pages 230–249, 2015.
[22] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.
[23] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
[24] P. Mertikopoulos, H. Zenati, B. Lecouat, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. In International Conference on Learning Representations, 2019.
[25] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.
[26] K. Mishchenko, D. Kovalev, E. Shulgin, P. Richtárik, and Y. Malitsky. Revisiting stochastic extragradient. arXiv preprint arXiv:1905.11373, 2019.
[27] A. Mokhtari, A. Ozdaglar, and S. Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprint arXiv:1901.08511, 2019.
[28] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
[29] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[30] A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. J. Wiley & Sons, New York, NY, 1983.
[31] Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2-3):319–344, 2007.
[32] J. v. Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295–320, 1928.
[33] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 314–323, 2016.
[34] M. W. Schmidt, N. L. Roux, and F. R. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
[35] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
[36] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.
[37] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[38] J. Sherman. Area-convexity, $\ell_\infty$ regularization, and undirected multicommodity flow. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 452–460. ACM, 2017.
[39] Z. Shi, X. Zhang, and Y. Yu. Bregman divergence for stochastic variance reduction: Saddle-point and adversarial prediction.
In Advances in Neural Information Processing Systems, 2017.
[40] A. Sidford and K. Tian. Coordinate methods for accelerating $\ell_\infty$ regression and faster approximate maximum flow. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science, pages 922–933. IEEE, 2018.
[41] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2009.
[42] M. D. Vose. A linear algorithm for generating random numbers with a given distribution. IEEE Transactions on Software Engineering, 17(9):972–975, 1991.
[43] B. E. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, 2016.
[44] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[45] D. Zhou, P. Xu, and Q. Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems, 2018.