# Near-Optimal No-Regret Learning in General Games

Constantinos Daskalakis (MIT CSAIL, costis@csail.mit.edu), Maxwell Fishelson (MIT CSAIL, maxfish@mit.edu), Noah Golowich (MIT CSAIL, nzg@mit.edu)

**Abstract.** We show that Optimistic Hedge, a common variant of multiplicative-weights updates with recency bias, attains $\mathrm{poly}(\log T)$ regret in multi-player general-sum games. In particular, when every player of the game uses Optimistic Hedge to iteratively update her strategy in response to the history of play so far, then after $T$ rounds of interaction, each player experiences total regret that is $\mathrm{poly}(\log T)$. Our bound improves, exponentially, the $O(T^{1/2})$ regret attainable by standard no-regret learners in games, the $O(T^{1/4})$ regret attainable by no-regret learners with recency bias [SALS15], and the $O(T^{1/6})$ bound that was recently shown for Optimistic Hedge in the special case of two-player games [CP20]. A corollary of our bound is that Optimistic Hedge converges to coarse correlated equilibrium in general games at a rate of $O\!\big(\tfrac{\mathrm{poly}(\log T)}{T}\big)$.

## 1 Introduction

Online learning has a long history that is intimately related to the development of game theory, convex optimization, and machine learning. One of its earliest instantiations can be traced to Brown's proposal [Bro49] of fictitious play as a method to solve two-player zero-sum games. Indeed, as shown by [Rob51], when the players of a (zero-sum) matrix game use fictitious play to iteratively update their actions in response to each other's history of play, the resulting dynamics converge in the following sense: the product of the empirical distributions of strategies for each player converges to the set of Nash equilibria of the game, though the rate of convergence is now known to be exponentially slow [DP14]. Moreover, such convergence to Nash equilibria fails in non-zero-sum games [Sha64].

The slow convergence of fictitious play to Nash equilibria in zero-sum matrix games and its non-convergence in general-sum games can be mitigated by appealing to the pioneering works [Bla54, Han57] and the ensuing literature on no-regret learning [CBL06]. It is known that if both players of a zero-sum matrix game experience regret that is at most $\varepsilon(T)$, the product of the players' empirical distributions of strategies is an $O(\varepsilon(T)/T)$-approximate Nash equilibrium. More generally, if each player of a general-sum, multi-player game experiences regret that is at most $\varepsilon(T)$, the empirical distribution of joint strategies converges to a coarse correlated equilibrium¹ of the game, at a rate of $O(\varepsilon(T)/T)$. Importantly, a multitude of online learning algorithms, such as the celebrated Hedge and Follow-The-Perturbed-Leader algorithms, guarantee adversarial regret $O(\sqrt{T})$ [CBL06]. Thus, when such algorithms are employed by all players in a game, their $O(\sqrt{T})$ regret implies convergence to coarse correlated equilibria (and Nash equilibria of matrix games) at a rate of $O(1/\sqrt{T})$.

While standard no-regret learners guarantee $O(\sqrt{T})$ regret for each player in a game, the players can do better by employing specialized no-regret learning procedures. Indeed, it was established by [DDK11] that there exists a somewhat complex no-regret learner based on Nesterov's excessive gap technique [Nes05], which guarantees $O(\log T)$ regret to each player of a two-player zero-sum game.

¹ In general-sum games, it is typical to focus on proving convergence rates for weaker types of equilibrium than Nash, such as coarse correlated equilibria, since finding Nash equilibria is PPAD-complete [DGP06, CDT09].
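To make the updates discussed above concrete, here is a minimal, self-contained sketch (in Python; not code from the paper) of the plain Hedge step and of the Optimistic Hedge step that the paper studies. The Optimistic Hedge rule is stated formally as equation (1) in Section 2; the function names, toy loss sequence, and step size below are our own illustrative choices.

```python
import numpy as np

def hedge_update(x, loss, eta):
    """One multiplicative-weights (Hedge) step: re-weight by exp(-eta * loss)."""
    w = x * np.exp(-eta * loss)
    return w / w.sum()

def optimistic_hedge_update(x, loss, prev_loss, eta):
    """One Optimistic Hedge step: Hedge applied to the recency-biased
    predictor 2*loss - prev_loss of the next loss vector (cf. eq. (1))."""
    w = x * np.exp(-eta * (2.0 * loss - prev_loss))
    return w / w.sum()

# Toy usage for a single player with n = 3 actions and arbitrary losses.
rng = np.random.default_rng(0)
n, T, eta = 3, 100, 0.1
x = np.full(n, 1.0 / n)      # uniform initialization, as in the paper
prev_loss = np.zeros(n)      # convention: the round-0 loss vector is all-zeros
for t in range(T):
    loss = rng.uniform(0.0, 1.0, size=n)
    x = optimistic_hedge_update(x, loss, prev_loss, eta)
    prev_loss = loss
print(x)
```

The only difference from Hedge is the predictor $2\ell^{(t)} - \ell^{(t-1)}$ in the exponent, i.e., the "recency bias" referred to above.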
Table 1: Overview of prior work on fast rates for learning in games. $m$ denotes the number of players, and $n$ denotes the number of actions per player (assumed to be the same for all players). For Optimistic Hedge, the adversarial regret bounds in the right-hand column are obtained via a choice of adaptive step-sizes. The $\tilde O(\cdot)$ notation hides factors that are polynomial in $\log T$.

| Algorithm | Setting | Regret in games | Adversarial regret |
|---|---|---|---|
| Hedge (& many other algs.) | multi-player, general-sum | $O(\sqrt{T\log n})$ [CBL06] | $O(\sqrt{T\log n})$ [CBL06] |
| Excessive Gap Technique | 2-player, 0-sum | $O(\log n(\log T + \log^{3/2} n))$ [DDK11] | $O(\sqrt{T\log n})$ |
| DS-OptMD, OptDA | 2-player, 0-sum | $\log^{O(1)}(n)$ [HAM21] | $\sqrt{T}\cdot\log^{O(1)}(n)$ [HAM21] |
| Optimistic Hedge | multi-player, general-sum | $O(\log n\,\sqrt{m}\,T^{1/4})$ [RS13b, SALS15] | $O(\sqrt{T\log n})$ [RS13b, SALS15] |
| Optimistic Hedge | 2-player, general-sum | $O(\log^{5/6} n\; T^{1/6})$ [CP20] | $O(\sqrt{T\log n})$ |
| Optimistic Hedge | multi-player, general-sum | $O(\log n\cdot m\cdot\log^4 T)$ (Theorem 3.1) | $O(\sqrt{T\log n})$ (Corollary D.1) |

This represents an exponential improvement over the regret guaranteed by standard no-regret learners. More generally, [SALS15] established that if the players of a multi-player, general-sum game use any algorithm from the family of Optimistic Mirror Descent (MD) or Optimistic Follow-the-Regularized-Leader (FTRL) algorithms (which are analogues of the MD and FTRL algorithms, respectively, with recency bias), each player enjoys regret that is $O(T^{1/4})$. This was recently improved by [CP20] to $O(T^{1/6})$ in the special case of two-player games in which the players use Optimistic Hedge, a particularly simple representative from both the Optimistic MD and Optimistic FTRL families.

The above results for general-sum games represent significant improvements over the $O(\sqrt{T})$ regret attainable by standard no-regret learners, but are not as dramatic as the logarithmic regret that has been shown attainable by no-regret learners, albeit more complex ones, in 2-player zero-sum games (e.g., [DDK11]). Indeed, despite extensive work on no-regret learning, understanding the optimal regret that can be guaranteed by no-regret learning algorithms in general-sum games has remained elusive. This question is especially intriguing in light of experiments suggesting that polylogarithmic regret should be attainable [SALS15, HAM21]. In this paper we settle this question by showing that no-regret learners can guarantee polylogarithmic regret to each player in general-sum multi-player games. Moreover, this regret is attainable by a particularly simple algorithm, Optimistic Hedge:

**Theorem 1.1** (Abbreviated version of Theorem 3.1). Suppose that $m$ players play a general-sum multi-player game, with a finite set of $n$ strategies per player, over $T$ rounds. Suppose also that each player uses Optimistic Hedge to update her strategy in every round, as a function of the history of play so far. Then each player experiences $O(m\log n\log^4 T)$ regret.

An immediate corollary of Theorem 1.1 is that the empirical distribution of play is an $O\!\big(\tfrac{m\log n\log^4 T}{T}\big)$-approximate coarse correlated equilibrium (CCE) of the game.
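The corollary above converts regret into an equilibrium guarantee. The following small sketch (ours, not the authors' code; all names are hypothetical) computes the regret $\mathrm{Reg}_{i,T}$ of Section 2 from a recorded history of play and the resulting per-player CCE gap $\mathrm{Reg}_{i,T}/T$ of the empirical distribution of play.

```python
import numpy as np

def per_player_regret(loss_vectors, strategies):
    """Regret of one player after T rounds, given its per-round loss vectors
    (one length-n_i vector per round) and the mixed strategies it played.
    This matches the definition of Reg_{i,T} in Section 2."""
    loss_vectors = np.asarray(loss_vectors)   # shape (T, n_i)
    strategies = np.asarray(strategies)       # shape (T, n_i)
    realized = np.sum(strategies * loss_vectors)       # sum_t <x_i^(t), l_i^(t)>
    best_fixed = np.min(loss_vectors.sum(axis=0))      # min_j sum_t l_i^(t)(j)
    return realized - best_fixed

def cce_gap(all_loss_vectors, all_strategies):
    """If every player's regret is at most eps(T), the uniform (empirical)
    distribution over the T played profiles is an eps(T)/T-approximate CCE;
    this returns the maximum per-player regret divided by T."""
    T = len(all_strategies[0])
    return max(per_player_regret(l, x) / T
               for l, x in zip(all_loss_vectors, all_strategies))
```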
We remark that Theorem 1.1 bounds the total regret experienced by each player of the multi-player game, which is the most standard regret objective for no-regret learning in games, and which is essential to achieve convergence to CCE. For the looser objective of the average of all players' regrets, [RS13b] established an $O(\log n)$ bound for Optimistic Hedge in two-player zero-sum games, and [SALS15] generalized this bound to $O(m\log n)$ in $m$-player general-sum games. Note that since some players may experience negative regret [HAM21], the average of the players' regrets cannot be used in general to bound the maximum regret experienced by any individual player. Finally, we remark that several results in the literature posit no-regret learning as a model of agents' rational behavior; for instance, [Rou09, ST13, RST17] show that no-regret learners in smooth games enjoy strong Price-of-Anarchy bounds. By showing that each agent can obtain very small regret in games by playing Optimistic Hedge, Theorem 1.1 strengthens the plausibility of the common assumption made in this literature that each agent will choose to use such a no-regret algorithm.

### 1.1 Related work

Table 1 summarizes the prior works that aim to establish optimal regret bounds for no-regret learners in games. We remark that [CP20] shows that the regret of Hedge is $\Omega(\sqrt{T})$ even in 2-player games where each player has 2 actions, meaning that optimism is necessary to obtain fast rates. The table also includes a recent result of [HAM21] showing that when the players in a 2-player zero-sum game with $n$ actions per player use a variant of Optimistic Hedge with adaptive step size (a special case of their algorithms DS-OptMD and OptDA), each player has $\log^{O(1)} n$ regret. The techniques of [HAM21] differ substantially from ours: the result in [HAM21] is based on showing that the joint strategies $x^{(t)}$ rapidly converge, pointwise, to a Nash equilibrium $x^{\star}$. Such a result seems very unlikely to extend to our setting of general-sum games, since finding an approximate Nash equilibrium even in 2-player games is PPAD-complete [CDT09]. We also remark that the earlier work [KHSC18] shows that each player's regret is at most $O(\log T\cdot\log n)$ when they use a certain algorithm based on Optimistic MD in 2-player zero-sum games; their technique is heavily tailored to 2-player zero-sum games, relying on the notion of duality in such a setting. [FLL+16] shows that one can obtain fast rates in games for a broader class of algorithms (e.g., including Hedge) if one adopts a relaxed (approximate) notion of optimality. [WL18] uses optimism to obtain adaptive regret bounds for bandit problems. Many recent papers (e.g., [DP19, GPD20, LGNPw21, HAM21, WLZL21, AIMM21]) have studied the last-iterate convergence of algorithms from the Optimistic Mirror Descent family, which includes Optimistic Hedge. Finally, a long line of papers (e.g., [HMcW+03, DFP+10, KLP11, BCM12, PP16, BP18, MPP18, BP19, CP19, VGFL+20]) has studied the dynamics of learning algorithms in games. Essentially all of these papers do not use optimism, and many of them show non-convergence (e.g., divergence or recurrence) of the iterates of various learning algorithms such as FTRL and Mirror Descent when used in games.

## 2 Preliminaries

**Notation.** For a positive integer $n$, let $[n] := \{1, 2, \ldots, n\}$. For a finite set $\mathcal{S}$, let $\Delta(\mathcal{S})$ denote the space of distributions on $\mathcal{S}$. For $\mathcal{S} = [n]$, we will write $\Delta^n := \Delta([n])$ and interpret elements of $\Delta^n$ as vectors in $\mathbb{R}^n$.
For a vector $v \in \mathbb{R}^n$ and $j \in [n]$, we denote the $j$th coordinate of $v$ as $v(j)$. For vectors $v, w \in \mathbb{R}^n$, write $\langle v, w\rangle = \sum_{j=1}^{n} v(j)\,w(j)$. The base-2 logarithm of $x > 0$ is denoted $\log x$.

**No-regret learning in games.** We consider a game $\mathcal{G}$ with $m \in \mathbb{N}$ players, where player $i \in [m]$ has action space $\mathcal{A}_i$ with $n_i := |\mathcal{A}_i|$ actions. We may assume that $\mathcal{A}_i = [n_i]$ for each player $i$. The joint action space is $\mathcal{A} := \mathcal{A}_1 \times \cdots \times \mathcal{A}_m$. The specification of the game $\mathcal{G}$ is completed by a collection of loss functions $\mathcal{L}_1, \ldots, \mathcal{L}_m : \mathcal{A} \to [0,1]$. For an action profile $a = (a_1, \ldots, a_m) \in \mathcal{A}$ and $i \in [m]$, $\mathcal{L}_i(a)$ is the loss player $i$ experiences when each player $i' \in [m]$ plays $a_{i'}$. A mixed strategy $x_i \in \Delta(\mathcal{A}_i)$ for player $i$ is a distribution over $\mathcal{A}_i$, with the probability of playing action $j \in \mathcal{A}_i$ given by $x_i(j)$. Given a mixed strategy profile $x = (x_1, \ldots, x_m)$ (or an action profile $a = (a_1, \ldots, a_m)$) and a player $i \in [m]$, we let $x_{-i}$ (or $a_{-i}$, respectively) denote the profile after removing the $i$th mixed strategy $x_i$ (or the $i$th action $a_i$, respectively).

The $m$ players play the game $\mathcal{G}$ for a total of $T$ rounds. At the beginning of each round $t \in [T]$, each player $i$ chooses a mixed strategy $x_i^{(t)} \in \Delta(\mathcal{A}_i)$. The loss vector of player $i$, denoted $\ell_i^{(t)} \in [0,1]^{n_i}$, is defined as $\ell_i^{(t)}(j) = \mathbb{E}_{a_{-i} \sim x_{-i}^{(t)}}\big[\mathcal{L}_i(j, a_{-i})\big]$. As a matter of convention, set $\ell_i^{(0)} = \mathbf{0}$ to be the all-zeros vector. We consider the full-information setting in this paper, meaning that player $i$ observes its full loss vector $\ell_i^{(t)}$ for each round $t$. Finally, player $i$ experiences a loss of $\langle x_i^{(t)}, \ell_i^{(t)}\rangle$. The goal of each player $i$ is to minimize its regret, defined as:
$$\mathrm{Reg}_{i,T} := \sum_{t \in [T]}\big\langle x_i^{(t)}, \ell_i^{(t)}\big\rangle - \min_{j \in [n_i]}\sum_{t \in [T]}\ell_i^{(t)}(j).$$

**Optimistic Hedge.** The Optimistic Hedge algorithm chooses mixed strategies for player $i \in [m]$ as follows: at time $t = 1$, it sets $x_i^{(1)} = (1/n_i, \ldots, 1/n_i)$, the uniform distribution on $\mathcal{A}_i$. Then for all $t < T$, player $i$'s strategy at iteration $t+1$ is defined as follows, for $j \in [n_i]$:
$$x_i^{(t+1)}(j) := \frac{x_i^{(t)}(j)\,\exp\!\big(-\eta\,(2\ell_i^{(t)}(j) - \ell_i^{(t-1)}(j))\big)}{\sum_{k \in [n_i]} x_i^{(t)}(k)\,\exp\!\big(-\eta\,(2\ell_i^{(t)}(k) - \ell_i^{(t-1)}(k))\big)}. \tag{1}$$
Optimistic Hedge is a modification of Hedge, which performs the updates $x_i^{(t+1)}(j) = \frac{x_i^{(t)}(j)\exp(-\eta\,\ell_i^{(t)}(j))}{\sum_{k\in[n_i]} x_i^{(t)}(k)\exp(-\eta\,\ell_i^{(t)}(k))}$. The update (1) modifies the Hedge update by replacing the loss vector $\ell_i^{(t)}$ with a predictor of the following iteration's loss vector, namely $2\ell_i^{(t)} - \ell_i^{(t-1)}$. Hedge corresponds to FTRL with a negative entropy regularizer (see, e.g., [Bub15]), whereas Optimistic Hedge corresponds to Optimistic FTRL with a negative entropy regularizer [RS13b, RS13a].

**Distributions & divergences.** For distributions $P, Q$ on a finite domain $[n]$, the KL divergence between $P, Q$ is $\mathrm{KL}(P; Q) = \sum_{j=1}^n P(j)\log\frac{P(j)}{Q(j)}$. The chi-squared divergence between $P, Q$ is $\chi^2(P; Q) = \sum_{j=1}^n \frac{(P(j) - Q(j))^2}{Q(j)}$. For a distribution $P$ on $[n]$ and a vector $v \in \mathbb{R}^n$, we write $\mathrm{Var}_P(v) := \sum_{j=1}^n P(j)\big(v(j) - \sum_{k=1}^n P(k)v(k)\big)^2$. Also define $\|v\|_P^2 := \sum_{j=1}^n P(j)\,v(j)^2$. If further $P$ has full support, then define $\big(\|v\|_P^{\star}\big)^2 := \sum_{j=1}^n \frac{v(j)^2}{P(j)}$. The above notations will often be used when $P$ is the mixed strategy $x_i$ of some player $i$ and $v$ is a loss vector $\ell_i$; in such a case the norms $\|v\|_P$ and $\|v\|_P^{\star}$ are often called local norms.

## 3 Results

Below we state our main theorem, which shows that when all players in a game play according to Optimistic Hedge with an appropriate step size, they all experience polylogarithmic individual regret.
**Theorem 3.1** (Formal version of Theorem 1.1). There are constants $C, C' > 1$ so that the following holds. Suppose a time horizon $T \in \mathbb{N}$ and a game $\mathcal{G}$ with $m$ players and $n_i$ actions for each player $i \in [m]$ is given. Suppose all players play according to Optimistic Hedge with any positive step size $\eta \le \frac{1}{Cm\log^4 T}$. Then for any $i \in [m]$, the regret of player $i$ satisfies
$$\mathrm{Reg}_{i,T} \;\le\; \frac{\log n_i}{\eta} + C'\log T. \tag{2}$$
In particular, if the players' step size is chosen as $\eta = \frac{1}{Cm\log^4 T}$, then the regret of player $i$ satisfies
$$\mathrm{Reg}_{i,T} \;\le\; O\big(m\,\log n_i\,\log^4 T\big). \tag{3}$$

A common goal in the literature on learning in games is to obtain an algorithm that achieves fast rates when played by all players, and such that each player $i$ still obtains the optimal rate of $O(\sqrt{T})$ in the adversarial setting (i.e., when player $i$ receives an arbitrary sequence of losses $\ell_i^{(1)}, \ldots, \ell_i^{(T)}$). We show in Corollary D.1 (in the appendix) that by running Optimistic Hedge with an adaptive step size, this is possible. Table 1 compares our regret bounds discussed in this section to those of prior work.

## 4 Proof overview

In this section we overview the proof of Theorem 3.1; the full proof may be found in the appendix.

### 4.1 New adversarial regret bound

The first step in the proof of Theorem 3.1 is to prove a new regret bound (Lemma 4.1 below) for Optimistic Hedge that holds for an adversarial sequence of losses. We will show in later sections that when all players play according to Optimistic Hedge, the right-hand side of the regret bound (4) is bounded by a quantity that grows only poly-logarithmically in $T$.

**Lemma 4.1.** There is a constant $C > 0$ so that the following holds. Suppose any player $i \in [m]$ follows the Optimistic Hedge updates (1) with step size $\eta < 1/C$, for an arbitrary sequence of losses $\ell_i^{(1)}, \ldots, \ell_i^{(T)} \in [0,1]^{n_i}$. Then
$$\mathrm{Reg}_{i,T} \;\le\; \frac{\log n_i}{\eta} + \eta\sum_{t\in[T]}\mathrm{Var}_{x_i^{(t)}}\big(\ell_i^{(t)} - \ell_i^{(t-1)}\big) - (1 - C\eta)\,\eta\sum_{t\in[T]}\mathrm{Var}_{x_i^{(t)}}\big(\ell_i^{(t-1)}\big). \tag{4}$$

The detailed proof of Lemma 4.1 can be found in Section A, but we sketch the main steps here. The starting point is a refinement of [RS13a, Lemma 3] (stated as Lemma A.5), which gives an upper bound for $\mathrm{Reg}_{i,T}$ in terms of local norms corresponding to each of the iterates $x_i^{(t)}$ of Optimistic Hedge. The bound involves the difference between the Optimistic Hedge iterates $x_i^{(t)}$ and auxiliary iterates $\tilde x_i^{(t)}$ defined by
$$\tilde x_i^{(t+1)}(j) := \frac{x_i^{(t)}(j)\,\exp\!\big(-\eta\,(\ell_i^{(t)}(j) - \ell_i^{(t-1)}(j))\big)}{\sum_{k\in[n_i]} x_i^{(t)}(k)\,\exp\!\big(-\eta\,(\ell_i^{(t)}(k) - \ell_i^{(t-1)}(k))\big)}, \tag{5}$$
and takes the form $\mathrm{Reg}_{i,T} \le \frac{\log n_i}{\eta} + \cdots$, where the remaining terms involve the KL divergences between consecutive iterates $x_i^{(t)}$ and $\tilde x_i^{(t)}$. We next show (in Lemma A.2) that $\mathrm{KL}(\tilde x_i^{(t)}; x_i^{(t)})$ and $\mathrm{KL}(x_i^{(t)}; \tilde x_i^{(t)})$ may be lower bounded by $\big(\tfrac12 - O(\eta)\big)\,\chi^2(\tilde x_i^{(t)}; x_i^{(t)})$ and $\big(\tfrac12 - O(\eta)\big)\,\chi^2(x_i^{(t)}; \tilde x_i^{(t)})$, respectively. Note it is a standard fact that the KL divergence between two distributions is upper bounded by the chi-squared divergence between them; by contrast, Lemma A.2 can exploit that $x_i^{(t)}$ and $x_i^{(t-1)}$ are close to each other to show a reverse inequality. Finally, exploiting the exponential-weights-style functional relationship between $x_i^{(t)}$ and $x_i^{(t-1)}$, we show (in Lemma A.3) that the $\chi^2$-divergence $\chi^2(x_i^{(t)}; \tilde x_i^{(t)})$ is lower bounded by $(1 - O(\eta))\,\eta^2\,\mathrm{Var}_{x_i^{(t)}}\big(\ell_i^{(t-1)}\big)$, leading to the term $(1 - C\eta)$ being subtracted in (4). The $\chi^2$-divergence $\chi^2(\tilde x_i^{(t)}; x_i^{(t)})$, as well as the remaining term, are bounded in a similar manner to obtain (4).

### 4.2 Finite differences

Given Lemma 4.1, in order to establish Theorem 3.1, it suffices to show Lemma 4.2 below. Indeed, (6) below implies that the right-hand side of (4) is bounded above by $\frac{\log n_i}{\eta} + O(\eta\log^5 T)$, which is bounded above by $O(m\log n_i\log^4 T)$ for the choice $\eta = \frac{1}{Cm\log^4 T}$ of Theorem 3.1.²

²Notice that the factor $\frac12$ in (6) is not important for this argument; any constant less than 1 would suffice.

**Lemma 4.2** (Abbreviated; detailed version in Section C.3). Suppose all players play according to Optimistic Hedge with step size $\eta$ satisfying $\frac{1}{T} \le \eta \le \frac{1}{Cm\log^4 T}$ for a sufficiently large constant $C$. Then for any $i \in [m]$, the losses $\ell_i^{(1)}, \ldots, \ell_i^{(T)} \in \mathbb{R}^{n_i}$ for player $i$ satisfy:
$$\sum_{t=1}^{T}\mathrm{Var}_{x_i^{(t)}}\big(\ell_i^{(t)} - \ell_i^{(t-1)}\big) \;\le\; \frac12\sum_{t=1}^{T}\mathrm{Var}_{x_i^{(t)}}\big(\ell_i^{(t-1)}\big) + O(\log^5 T). \tag{6}$$

The definition below allows us to streamline our notation when proving Lemma 4.2.
**Definition 4.1** (Finite differences). Suppose $L = (L^{(1)}, \ldots, L^{(T)})$ is a sequence of vectors $L^{(t)} \in \mathbb{R}^n$. For integers $h \ge 0$, the order-$h$ finite difference sequence for the sequence $L$, denoted by $D^h L$, is the sequence $D^h L := \big((D^h L)^{(1)}, \ldots, (D^h L)^{(T-h)}\big)$ defined recursively as: $(D^0 L)^{(t)} := L^{(t)}$ for all $1 \le t \le T$, and
$$(D^h L)^{(t)} := (D^{h-1}L)^{(t+1)} - (D^{h-1}L)^{(t)} \tag{7}$$
for all $h \ge 1$, $1 \le t \le T - h$.³

³We remark that while Definition 4.1 is stated for a 1-indexed sequence $L^{(1)}, L^{(2)}, \ldots$, we will also occasionally consider 0-indexed sequences $L^{(0)}, L^{(1)}, \ldots$, in which case the same recursive definition (7) holds for the finite differences $(D^h L)^{(t)}$, $t \ge 0$.

**Remark 4.3.** Notice that another way of writing (7) is: $D^h L = D^1 D^{h-1} L$. We also remark for later use that $(D^h L)^{(t)} = \sum_{s=0}^{h}\binom{h}{s}(-1)^{h-s}L^{(t+s)}$.

Let $H = \log T$, where $T$ denotes the fixed time horizon from Theorem 3.1 (and thus Lemma 4.2). In the proof of Lemma 4.2, we will bound the finite differences of order $h \le H$ for certain sequences. The bound (6) of Lemma 4.2 may be rephrased as upper bounding $\sum_{t=1}^{T}\mathrm{Var}_{x_i^{(t)}}\big((D^1\ell_i)^{(t-1)}\big)$; to prove this, we proceed in two steps:

1. (Upwards induction step) First, in Lemma 4.4 below, we find an upper bound on $\big\|(D^h\ell_i)^{(t)}\big\|_\infty$ for all $t \in [T]$, $h \ge 0$, which decays exponentially in $h$ for $h \le H$. This is done via upwards induction on $h$, i.e., first proving the base case $h = 0$ using boundedness of the losses $\ell_i^{(t)}$ and then treating $h = 1, 2, \ldots$ inductively. The main technical tool we develop for the inductive step is a weak form of the chain rule for finite differences, Lemma 4.5. The inductive step uses the fact that all players are following Optimistic Hedge to relate the $h$th order finite differences of player $i$'s loss sequence $\ell_i^{(t)}$ to the $h$th order finite differences of the strategy sequences $x_{i'}^{(t)}$ for players $i' \ne i$; then we use the exponential-weights-style updates of Optimistic Hedge and Lemma 4.5 to relate the $h$th order finite differences of the strategies $x_{i'}^{(t)}$ to the $(h-1)$th order finite differences of the losses $\ell_{i'}^{(t)}$.

2. (Downwards induction step) We next show that for all $0 \le h \le H$, $\sum_{t=1}^{T}\mathrm{Var}_{x_i^{(t)}}\big((D^{h+1}\ell_i)^{(t-1)}\big)$ is bounded above by $c_h\sum_{t=1}^{T}\mathrm{Var}_{x_i^{(t)}}\big((D^{h}\ell_i)^{(t-1)}\big) + \mu_h$ for some $c_h < 1/2$ and $\mu_h \le O(\log^5 T)$. This is shown via downwards induction on $h$, namely first establishing the base case $h = H$ by using the result of item 1 for $h = H$ and then treating the cases $h = H-1, H-2, \ldots, 0$. The inductive step makes use of the discrete Fourier transform (DFT) to relate the finite differences of different orders (see Lemmas 4.7 and 4.8). In particular, Parseval's equality together with a standard relationship between the DFT of the finite differences of a sequence and the DFT of that sequence allow us to first prove the inductive step in the frequency domain and then transport it back to the original (time) domain.

In the following subsections we explain in further detail how the two steps above are completed.

### 4.3 Upwards induction proof overview

Addressing item 1 in the previous subsection, the lemma below gives a bound on the supremum norm of the $h$-th order finite differences of each player's loss vector, when all players play according to Optimistic Hedge and experience losses according to their loss functions $\mathcal{L}_1, \ldots, \mathcal{L}_m : \mathcal{A} \to [0,1]$.

**Lemma 4.4** (Abbreviated). Fix a sufficiently small step size $\eta > 0$. If all players follow Optimistic Hedge updates with step size $\eta$, then for any player $i \in [m]$, integer $h$ satisfying $0 \le h \le H$, and time step $t \in [T-h]$, it holds that $\big\|(D^h\ell_i)^{(t)}\big\|_\infty \le O(m\eta)^h\cdot h^{O(h)}$.
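As a quick standalone illustration of Definition 4.1 and the binomial identity of Remark 4.3 (a sketch under our own naming and test data, not code from the paper):

```python
import numpy as np
from math import comb

def finite_difference(L, h):
    """Order-h finite differences of a sequence L of vectors (Definition 4.1):
    D^0 L = L and (D^h L)^(t) = (D^{h-1} L)^(t+1) - (D^{h-1} L)^(t)."""
    D = np.asarray(L, dtype=float)
    for _ in range(h):
        D = D[1:] - D[:-1]
    return D

# Sanity check of the identity from Remark 4.3:
# (D^h L)^(t) = sum_{s=0}^{h} C(h, s) * (-1)^{h-s} * L^(t+s).
rng = np.random.default_rng(1)
T, n, h = 12, 4, 3
L = rng.normal(size=(T, n))
direct = finite_difference(L, h)
binomial = sum(comb(h, s) * (-1) ** (h - s) * L[s:T - h + s] for s in range(h + 1))
assert np.allclose(direct, binomial)
```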
A detailed version of Lemma 4.4, together with its full proof, may be found in Section B.4. We next give a proof overview of Lemma 4.4 for the case of 2 players, i.e., $m = 2$; we show in Section B.4 how to generalize this computation to general $m$. Below we introduce the main technical tool in the proof, a "boundedness chain rule", and then outline how it is used to prove Lemma 4.4.

**Main technical tool for Lemma 4.4: boundedness chain rule.** We say that a function $\phi : \mathbb{R}^n \to \mathbb{R}$ is a softmax-type function if there are real numbers $\beta_1, \ldots, \beta_n$ and some $j \in [n]$ so that for all $(z_1, \ldots, z_n) \in \mathbb{R}^n$, $\phi((z_1, \ldots, z_n)) = \frac{\exp(z_j)}{\sum_{k=1}^{n}\beta_k\exp(z_k)}$.

Lemma 4.5 below may be interpreted as a "boundedness chain rule" for finite differences. To explain the context for this lemma, recall that given an infinitely differentiable vector-valued function $L : \mathbb{R} \to \mathbb{R}^n$ and an infinitely differentiable function $\phi : \mathbb{R}^n \to \mathbb{R}$, the higher order derivatives of the function $\phi(L(t))$ may be computed in terms of those of $L$ and $\phi$ using the chain rule. Lemma 4.5 considers an analogous setting where the input variable $t$ to $L$ is discrete-valued, taking values in $[T]$ (and so we identify the function $L$ with the sequence $L^{(1)}, \ldots, L^{(T)}$). In this case, the higher order finite differences of the sequence $L^{(1)}, \ldots, L^{(T)}$ (Definition 4.1) take the place of the higher order derivatives of $L$ with respect to $t$. Though there is no generic chain rule for finite differences, Lemma 4.5 states that, at least when $\phi$ is a softmax-type function, we may bound the higher order finite differences of the sequence $\phi(L^{(1)}), \ldots, \phi(L^{(T)})$. In the lemma's statement we let $\phi \circ L$ denote the sequence $\phi(L^{(1)}), \ldots, \phi(L^{(T)})$.

**Lemma 4.5** ("Boundedness chain rule" for finite differences; abbreviated). Suppose that $h, n \in \mathbb{N}$, $\phi : \mathbb{R}^n \to \mathbb{R}$ is a softmax-type function, and $L = (L^{(1)}, \ldots, L^{(T)})$ is a sequence of vectors in $\mathbb{R}^n$ satisfying $\|L^{(t)}\|_\infty \le 1$ for $t \in [T]$. Suppose that for some $\alpha \in (0,1)$, for each $0 \le h' \le h$ and $t \in [T-h']$, it holds that $\big\|(D^{h'}L)^{(t)}\big\|_\infty \le O(\alpha^{h'})\cdot(h')^{O(h')}$. Then for all $t \in [T-h]$, $\big|(D^h(\phi\circ L))^{(t)}\big| \le O(\alpha^{h})\cdot h^{O(h)}$.

A detailed version of Lemma 4.5 may be found in Section B.3. While Lemma 4.5 requires $\phi$ to be a softmax-type function for simplicity (and this is the only type of function $\phi$ we will need to consider for the case $m = 2$), we remark that the detailed version of Lemma 4.5 allows $\phi$ to be from a more general family of analytic functions whose higher order derivatives are appropriately bounded. The proof of Lemma 4.4 for all $m \ge 2$ requires that more general form of Lemma 4.5.

The proof of Lemma 4.5 proceeds by considering the Taylor expansion $P_\phi(\cdot)$ of the function $\phi$ at the origin, which we write as follows: for $z = (z_1, \ldots, z_n) \in \mathbb{R}^n$, $P_\phi(z) := \sum_{k\ge0}\sum_{\gamma\ge0:\,|\gamma|=k} a_\gamma z^\gamma$, where $a_\gamma \in \mathbb{R}$, $|\gamma|$ denotes the quantity $\gamma_1 + \cdots + \gamma_n$ and $z^\gamma$ denotes $z_1^{\gamma_1}\cdots z_n^{\gamma_n}$. The fact that $\phi$ is a softmax-type function ensures that the radius of convergence of its Taylor series is at least 1, i.e., $\phi(z) = P_\phi(z)$ for any $z$ satisfying $\|z\|_\infty \le 1$. By the assumption that $\|L^{(t)}\|_\infty \le 1$ for each $t$, we may therefore decompose $(D^h(\phi\circ L))^{(t)}$ as:
$$(D^h(\phi\circ L))^{(t)} = \sum_{\gamma} a_\gamma\,(D^h L^\gamma)^{(t)}, \tag{8}$$
where $L^\gamma$ denotes the sequence of scalars $(L^\gamma)^{(t)} := (L^{(t)})^\gamma$ for all $t$. The fact that $\phi$ is a softmax-type function allows us to establish strong bounds on $|a_\gamma|$ for each $\gamma$ in Lemma B.5. The proof of Lemma B.5 bounds the $|a_\gamma|$ by exploiting the simple form of the derivative of a softmax-type function to decompose each $a_\gamma$ into a sum of $|\gamma|!$ terms.
Then we establish a bijection between the terms of this decomposition and graph structures we refer to as factorial trees; that bijection, together with the use of an appropriate generating function, allows us to complete the proof of Lemma B.5. Thus, to prove Lemma 4.5, it suffices to bound $\big|(D^h L^\gamma)^{(t)}\big|$ for all $\gamma$. We do so by using Lemma 4.6.

**Lemma 4.6** (Abbreviated; detailed version in Section B.2). Fix any $h \ge 0$, a multi-index $\gamma \in \mathbb{Z}^n_{\ge0}$, and set $k = |\gamma|$. For each of the $k^h$ functions $\sigma : [h] \to [k]$, and for each $r \in [k]$, there are integers $h'_{\sigma,r} \in \{0, 1, \ldots, h\}$, $t'_{\sigma,r} \ge 0$, and $j'_{\sigma,r} \in [n]$, so that the following holds. For any sequence $L^{(1)}, \ldots, L^{(T)} \in \mathbb{R}^n$ of vectors, it holds that, for each $t \in [T-h]$,
$$(D^h L^\gamma)^{(t)} = \sum_{\sigma : [h]\to[k]}\;\prod_{r=1}^{k}\Big(D^{h'_{\sigma,r}}\big(L(j'_{\sigma,r})\big)\Big)^{(t + t'_{\sigma,r})}. \tag{9}$$

Lemma 4.6 expresses the $h$th order finite differences of the sequence $L^\gamma$ as a sum of $k^h$ terms, each of which is a product of $k$ finite order differences of a sequence $L^{(t)}(j'_{\sigma,r})$ (i.e., the $j'_{\sigma,r}$th coordinate of the vectors $L^{(t)}$). Crucially, when using Lemma 4.6 to prove Lemma 4.5, the assumption of Lemma 4.5 gives that for each $j' \in [n]$, each $h' \in [h]$, and each $t' \in [T-h']$, we have the bound $\big|(D^{h'}L(j'))^{(t')}\big| \le O(\alpha^{h'})\cdot(h')^{O(h')}$. These assumed bounds may be used to bound the right-hand side of (9), which together with Lemma 4.6 and (8) lets us complete the proof of Lemma 4.5.

**Proving Lemma 4.4 using the boundedness chain rule.** Next we discuss how Lemma 4.5 is used to prove Lemma 4.4, namely to bound $\|(D^h\ell_i)^{(t)}\|_\infty$ for each $t \in [T-h]$, $i \in [m]$, and $0 \le h \le H$. Lemma 4.4 is proved using induction, with the base case $h = 0$ being a straightforward consequence of the fact that $\|(D^0\ell_i)^{(t)}\|_\infty = \|\ell_i^{(t)}\|_\infty \le 1$ for all $i \in [m]$, $t \in [T]$. For the rest of this section we focus on the inductive case, i.e., we pick some $h \in [H]$ and assume Lemma 4.4 holds for all $h' < h$.

The first step is to reduce the claim of Lemma 4.4 to the claim that the upper bound $\|(D^h x_i)^{(t)}\|_\infty \le O(m\eta)^h\cdot h^{O(h)}$ holds for each $t \in [T-h]$, $i \in [m]$. Recalling that we are only sketching here the case $m = 2$ for simplicity, this reduction proceeds as follows: for $i \in \{1,2\}$, define the matrix $A_i \in \mathbb{R}^{n_1\times n_2}$ by $(A_i)_{a_1 a_2} = \mathcal{L}_i(a_1, a_2)$, for $a_1 \in [n_1]$, $a_2 \in [n_2]$. We have assumed that all players are using Optimistic Hedge and thus $\ell_i^{(t)} = \mathbb{E}_{a_{i'}\sim x_{i'}^{(t)},\ \forall i'\ne i}[\mathcal{L}_i(a_1, \ldots, a_m)]$; for our case here ($m = 2$), this may be rewritten as $\ell_1^{(t)} = A_1 x_2^{(t)}$. Therefore
$$\big\|(D^h\ell_1)^{(t)}\big\|_\infty = \Big\|\sum_{s=0}^{h}\tbinom{h}{s}(-1)^{h-s}A_1 x_2^{(t+s)}\Big\|_\infty = \Big\|A_1\Big(\sum_{s=0}^{h}\tbinom{h}{s}(-1)^{h-s}x_2^{(t+s)}\Big)\Big\|_\infty \le \big\|(D^h x_2)^{(t)}\big\|_\infty,$$
where the first equality is from Remark 4.3 and the inequality follows since all entries of $A_1$ have absolute value at most 1. A similar computation allows us to show $\|(D^h\ell_2)^{(t)}\|_\infty \le \|(D^h x_1)^{(t)}\|_\infty$.

To complete the inductive step it remains to upper bound the quantities $\|(D^h x_i)^{(t)}\|_\infty$ for $i \in [m]$, $t \in [T-h]$. To do so, we note that the definition of the Optimistic Hedge updates (1) implies that for any $i \in [m]$, $t \in [T]$, $j \in [n_i]$, and $t' \ge 1$, we have
$$x_i^{(t+t')}(j) \;=\; \frac{x_i^{(t)}(j)\,\exp\!\Big(-\eta\big(\sum_{s=0}^{t'-1}\ell_i^{(t+s)}(j) + \ell_i^{(t+t'-1)}(j) - \ell_i^{(t-1)}(j)\big)\Big)}{\sum_{k\in[n_i]} x_i^{(t)}(k)\,\exp\!\Big(-\eta\big(\sum_{s=0}^{t'-1}\ell_i^{(t+s)}(k) + \ell_i^{(t+t'-1)}(k) - \ell_i^{(t-1)}(k)\big)\Big)}. \tag{10}$$
For $t \in [T]$, $t' \ge 0$, set $\varepsilon_{i,t}^{(t')} := -\eta\big(\sum_{s=0}^{t'-1}\ell_i^{(t+s)} + \ell_i^{(t+t'-1)} - \ell_i^{(t-1)}\big)$, i.e., the vector appearing inside the exponentials in (10). Also, for each $i \in [m]$, $j \in [n_i]$, $t \in [T]$, and any vector $z = (z(1), \ldots, z(n_i)) \in \mathbb{R}^{n_i}$ define $\phi_{t,i,j}(z) := \frac{x_i^{(t)}(j)\exp(z(j))}{\sum_{k\in[n_i]} x_i^{(t)}(k)\exp(z(k))}$. Thus (10) gives that for $t' \ge 1$, $x_i^{(t+t')}(j) = \phi_{t,i,j}\big(\varepsilon_{i,t}^{(t')}\big)$. Viewing $t$ as a fixed parameter and letting $t'$ vary, it follows that for $h \ge 0$ and $t' \ge 1$, $\big(D^h x_i(j)\big)^{(t+t')} = \big(D^h(\phi_{t,i,j}\circ\varepsilon_{i,t})\big)^{(t')}$. Recalling that our goal is to bound $\big|(D^h x_i(j))^{(t+1)}\big|$ for each $t$, we can do so by using Lemma 4.5 with $\phi = \phi_{t,i,j}$ and $\alpha = O(m\eta)$, if we can show that its precondition is met, i.e., that $\big\|(D^{h'}\varepsilon_{i,t})^{(t')}\big\|_\infty \le B_1\,\alpha^{h'}\,(h')^{B_0 h'}$ for all $h' \le h$, for the appropriate $\alpha$ and appropriate constants $B_0, B_1$.
Helpfully, the definition of $\varepsilon_{i,t}^{(t')}$ as a partial sum allows us to relate the $h'$-th order finite differences of the sequence $\varepsilon_{i,t}^{(t')}$ to the $(h'-1)$-th order finite differences of the sequence $\ell_i^{(t)}$ as follows:
$$\big(D^{h'}\varepsilon_{i,t}\big)^{(t')} \;=\; \eta\Big((D^{h'-1}\ell_i)^{(t+t'-1)} - 2\,(D^{h'-1}\ell_i)^{(t+t')}\Big). \tag{11}$$
Since $h'-1 < h$ for $h' \le h$, the inductive assumption of Lemma 4.4 gives a bound on the $\infty$-norm of the terms on the right-hand side of (11), which is sufficient for us to apply Lemma 4.5. Note that the inductive assumption gives an upper bound on $\|(D^{h'-1}\ell_i)^{(t)}\|_\infty$ that only scales with $\alpha^{h'-1}$, whereas Lemma 4.5 requires scaling of $\alpha^{h'}$. This discrepancy is corrected by the factor of $\eta$ on the right-hand side of (11), which gives the desired scaling $\alpha^{h'}$ (since $\eta < \alpha$ for the choice $\alpha = O(m\eta)$).

### 4.4 Downwards induction proof overview

In this section we discuss in further detail item 2 in Section 4.2; in particular, we will show that there is a parameter $\mu = \tilde\Theta(\eta m)$ so that for all integers $h$ satisfying $H-1 \ge h \ge 0$,
$$\sum_{t=1}^{T-h-1}\mathrm{Var}_{x_i^{(t)}}\big((D^{h+1}\ell_i)^{(t)}\big) \;\le\; \frac12\sum_{t=1}^{T-h}\mathrm{Var}_{x_i^{(t)}}\big((D^{h}\ell_i)^{(t)}\big) + \tilde O\big(\mu^{2(h+1)}\big), \tag{12}$$
where $\tilde O(\cdot)$ hides factors polynomial in $\log T$. The validity of (12) for $h = 0$ implies Lemma 4.2. On the other hand, as long as we choose the value $\mu$ in (12) to satisfy $\mu \ge \eta m\cdot H^{O(1)}$, then Lemma 4.4 implies that $\sum_{t=1}^{T-H}\mathrm{Var}_{x_i^{(t)}}\big((D^{H}\ell_i)^{(t)}\big) \le O(\mu^{2H})$. This gives that (12) holds for $h = H-1$.

To show that (12) holds for all $H-1 > h \ge 0$, we use downwards induction; fix any $h$, and assume that (12) has been shown for all $h'$ satisfying $h < h' \le H-1$. Our main tool in the inductive step is to apply Lemma 4.7 below. To state it, for $\kappa > 0$, $n \in \mathbb{N}$, we say that a sequence of distributions $P^{(1)}, \ldots, P^{(T)} \in \Delta^n$ is $\kappa$-consecutively close if for each $1 \le t < T$, it holds that $\max\Big\{\big\|\tfrac{P^{(t+1)}}{P^{(t)}}\big\|_\infty,\ \big\|\tfrac{P^{(t)}}{P^{(t+1)}}\big\|_\infty\Big\} \le 1 + \kappa$.⁴

⁴Here, for distributions $P, Q \in \Delta^n$, $\frac{P}{Q} \in \mathbb{R}^n$ denotes the vector whose $j$th entry is $P(j)/Q(j)$.

Lemma 4.7 shows that given a sequence of vectors for which the variances of its second-order finite differences are bounded by the variances of its first-order finite differences, a similar relationship holds between its first- and zeroth-order finite differences.

**Lemma 4.7.** There is a sufficiently large constant $C_0 > 1$ so that the following holds. For any $M, \mu, \kappa > 0$ and $n \in \mathbb{N}$, suppose that $P^{(1)}, \ldots, P^{(T)} \in \Delta^n$ and $Z^{(1)}, \ldots, Z^{(T)} \in [-M, M]^n$ satisfy the following conditions:

1. The sequence $P^{(1)}, \ldots, P^{(T)}$ is $\kappa$-consecutively close for some $\kappa \in [1/(2T),\,4/C_0]$.
2. It holds that $\sum_{t=1}^{T-2}\mathrm{Var}_{P^{(t)}}\big((D^2 Z)^{(t)}\big) \le \frac12\sum_{t=1}^{T-1}\mathrm{Var}_{P^{(t)}}\big((D^1 Z)^{(t)}\big) + \mu$.

Then $\sum_{t=1}^{T-1}\mathrm{Var}_{P^{(t)}}\big((D^1 Z)^{(t)}\big) \le \frac12\sum_{t=1}^{T}\mathrm{Var}_{P^{(t)}}\big(Z^{(t)}\big)$, up to an additive error term that is controlled by $\mu$, $M$, and $\kappa$.

Given Lemma 4.7, the inductive step for establishing (12) is straightforward: we apply Lemma 4.7 with $P^{(t)} = x_i^{(t)}$ and $Z^{(t)} = (D^h\ell_i)^{(t)}$ for all $t$. The fact that the $x_i^{(t)}$ are updated with Optimistic Hedge may be used to establish that precondition 1 of Lemma 4.7 holds. Since $(D^1 Z)^{(t)} = (D^{h+1}\ell_i)^{(t)}$ and $(D^2 Z)^{(t)} = (D^{h+2}\ell_i)^{(t)}$, the fact that the inductive hypothesis (12) holds for $h+1$ implies that precondition 2 of Lemma 4.7 holds for appropriate $\kappa, \mu > 0$. Thus Lemma 4.7 implies that (12) holds for the value $h$, which completes the inductive step.

**On the proof of Lemma 4.7.** Finally we discuss the proof of Lemma 4.7. One technical challenge is the fact that the vectors $P^{(t)}$ are not constant functions of $t$, but rather change slowly (as constrained by being $\kappa$-consecutively close). The main tool for dealing with this difficulty is Lemma C.1, which shows that for a $\kappa$-consecutively close sequence $P^{(t)}$, for any vector $Z^{(t)}$, $\frac{\mathrm{Var}_{P^{(t)}}(Z^{(t)})}{\mathrm{Var}_{P^{(t+1)}}(Z^{(t)})} \in [1-\kappa,\,1+\kappa]$. This fact, together with some algebraic manipulations, lets us reduce to the case that all $P^{(t)}$ are equal.
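As a small numerical illustration (ours, not part of the paper's proof) of the variance-stability fact behind Lemma C.1: for two distributions whose entrywise ratios are within $1\pm\kappa$, the variance of any fixed vector under the two distributions agrees up to a $1\pm\kappa$ factor. The sampling scheme and names below are illustrative assumptions.

```python
import numpy as np

def var_P(P, Z):
    """Var_P(Z) = sum_j P(j) * (Z(j) - <P, Z>)^2, as defined in Section 2."""
    mean = np.dot(P, Z)
    return np.dot(P, (Z - mean) ** 2)

rng = np.random.default_rng(2)
n = 8
P = rng.dirichlet(np.ones(n))
Q = P * (1.0 + rng.uniform(-0.05, 0.05, size=n))  # small multiplicative perturbation
Q = Q / Q.sum()

# Realized closeness parameter kappa: largest deviation of the entrywise ratios from 1.
ratio = Q / P
kappa = max(np.max(np.abs(ratio - 1.0)), np.max(np.abs(1.0 / ratio - 1.0)))

Z = rng.normal(size=n)
r = var_P(P, Z) / var_P(Q, Z)
assert 1.0 - kappa <= r <= 1.0 + kappa   # the variance ratio stays within [1-kappa, 1+kappa]
print(kappa, r)
```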
It is also relatively straightforward to reduce to the case that $\langle P^{(t)}, Z^{(t)}\rangle = 0$ for all $t$, i.e., so that $\mathrm{Var}_{P^{(t)}}(Z^{(t)}) = \sum_{j=1}^{n} P^{(t)}(j)\,(Z^{(t)}(j))^2$. We may further separate $\sum_{j=1}^{n} P^{(t)}(j)\,(Z^{(t)}(j))^2$ into its individual components $P^{(t)}(j)\,(Z^{(t)}(j))^2$, and treat each one separately, thus allowing us to reduce to a one-dimensional problem. Finally, we make one further reduction, which is to replace the finite differences $D^h(\cdot)$ in Lemma 4.7 with circular finite differences, defined below:

**Definition 4.2** (Circular finite difference). Suppose $L = (L^{(0)}, \ldots, L^{(S-1)})$ is a sequence of vectors $L^{(t)} \in \mathbb{R}^n$. For integers $h \ge 0$, the level-$h$ circular finite difference sequence for the sequence $L$, denoted by $\bar D^h L$, is the sequence defined recursively as: $(\bar D^0 L)^{(t)} = L^{(t)}$ for all $0 \le t < S$, and
$$(\bar D^h L)^{(t)} := \begin{cases}(\bar D^{h-1}L)^{(t+1)} - (\bar D^{h-1}L)^{(t)} & : 0 \le t \le S-2\\ (\bar D^{h-1}L)^{(0)} - (\bar D^{h-1}L)^{(S-1)} & : t = S-1.\end{cases} \tag{13}$$

Circular finite differences for a sequence $L^{(0)}, \ldots, L^{(S-1)}$ are defined similarly to finite differences (Definition 4.1) except that, unlike for finite differences, where $(D^h L)^{(S-h)}, \ldots, (D^h L)^{(S-1)}$ are not defined, $(\bar D^h L)^{(S-h)}, \ldots, (\bar D^h L)^{(S-1)}$ are defined by wrapping around back to the beginning of the sequence. The above-described reductions, which are worked out in detail in Section C.2, allow us to reduce proving Lemma 4.7 to proving the following simpler lemma:

**Lemma 4.8.** Suppose $\mu \in \mathbb{R}$, $\nu > 0$, and $W^{(0)}, \ldots, W^{(S-1)} \in \mathbb{R}$ is a sequence of reals satisfying
$$\sum_{t=0}^{S-1}\big((\bar D^2 W)^{(t)}\big)^2 \;\le\; \nu\sum_{t=0}^{S-1}\big((\bar D^1 W)^{(t)}\big)^2 + \mu. \tag{14}$$
Then $\sum_{t=0}^{S-1}\big((\bar D^1 W)^{(t)}\big)^2 \le \nu\sum_{t=0}^{S-1}\big(W^{(t)}\big)^2 + \mu/\nu$.

To prove Lemma 4.8, we apply the discrete Fourier transform to both sides of (14) and use the Cauchy-Schwarz inequality in the frequency domain. For a sequence $W^{(0)}, \ldots, W^{(S-1)} \in \mathbb{R}$, its (discrete) Fourier transform is the sequence $\widehat W(0), \ldots, \widehat W(S-1)$ defined by $\widehat W(s) = \sum_{t=0}^{S-1} W^{(t)}\,e^{-\frac{2\pi i s t}{S}}$. Below we prove Lemma 4.8 for the special case $\mu = 0$; we defer the general case to Section C.1.

*Proof of Lemma 4.8 for the special case $\mu = 0$.* We have the following:
$$S\sum_{t=0}^{S-1}\big((\bar D^1 W)^{(t)}\big)^2 = \sum_{s=0}^{S-1}\big|\widehat{\bar D^1 W}(s)\big|^2 = \sum_{s=0}^{S-1}\big|\widehat W(s)\big|^2\,\big|e^{2\pi i s/S}-1\big|^2 \le \sqrt{\sum_{s=0}^{S-1}\big|\widehat W(s)\big|^2}\cdot\sqrt{\sum_{s=0}^{S-1}\big|\widehat W(s)\big|^2\,\big|e^{2\pi i s/S}-1\big|^4},$$
where the first equality uses Parseval's equality, the second uses Fact C.3 (in the appendix) for $h = 1$, and the inequality uses Cauchy-Schwarz. By Parseval's equality and Fact C.3 for $h = 2$, the right-hand side of the above equals $S\sqrt{\sum_{t=0}^{S-1}\big((\bar D^2 W)^{(t)}\big)^2\cdot\sum_{t=0}^{S-1}\big(W^{(t)}\big)^2}$, which, by assumption, is at most $S\sqrt{\nu\sum_{t=0}^{S-1}\big((\bar D^1 W)^{(t)}\big)^2\cdot\sum_{t=0}^{S-1}\big(W^{(t)}\big)^2}$. Rearranging terms completes the proof. ∎
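The two ingredients of this argument, the relation between the DFT of a circular finite difference and the DFT of the original sequence (the role played by Fact C.3) and Parseval's equality, can be checked numerically. The sketch below is ours (it uses numpy's FFT convention, which matches the definition of $\widehat W$ above) and is illustrative rather than code from the paper.

```python
import numpy as np

def circular_diff(W, h):
    """Level-h circular finite differences (Definition 4.2): wrap around at the end."""
    D = np.asarray(W, dtype=complex)
    for _ in range(h):
        D = np.roll(D, -1) - D
    return D

rng = np.random.default_rng(3)
S = 16
W = rng.normal(size=S)
W_hat = np.fft.fft(W)                       # \hat W(s) = sum_t W^(t) e^{-2 pi i s t / S}
s = np.arange(S)
twiddle = np.exp(2j * np.pi * s / S) - 1.0  # e^{2 pi i s / S} - 1

# Fact C.3-style identity: the DFT of the circular h-th difference of W
# equals \hat W(s) * (e^{2 pi i s / S} - 1)^h.
for h in (1, 2):
    assert np.allclose(np.fft.fft(circular_diff(W, h)), W_hat * twiddle ** h)

# Parseval: S * sum_t |W^(t)|^2 = sum_s |\hat W(s)|^2.
assert np.isclose(S * np.sum(np.abs(W) ** 2), np.sum(np.abs(W_hat) ** 2))
```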
## Acknowledgments and Disclosure of Funding

C.D. is supported by NSF Awards CCF-1901292, DMS-2022448 and DMS-2134108, by a Simons Investigator Award, by the Simons Collaboration on the Theory of Algorithmic Fairness, by a DSTA grant, and by the DOE PhILMs project (No. DE-AC05-76RL01830). N.G. is supported by a Fannie & John Hertz Foundation Fellowship and an NSF Graduate Fellowship.

## References

[AIMM21] Waïss Azizian, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. The last-iterate convergence rate of optimistic mirror descent in stochastic variational inequalities. In Conference on Learning Theory, pages 1–32, 2021.

[BCM12] Maria-Florina Balcan, Florin Constantin, and Ruta Mehta. The Weighted Majority Algorithm does not Converge in Nearly Zero-sum Games. In ICML Workshop on Markets, Mechanisms, and Multi-Agent Models, 2012.

[Bla54] David Blackwell. Controlled Random Walks. In Proceedings of the International Congress of Mathematicians, volume 3, pages 336–338, 1954.

[BP18] James P. Bailey and Georgios Piliouras. Multiplicative Weights Update in Zero-Sum Games. In Proceedings of the 2018 ACM Conference on Economics and Computation (EC '18), pages 321–338, Ithaca, NY, USA, 2018. ACM Press.

[BP19] James P. Bailey and Georgios Piliouras. Fast and furious learning in zero-sum games: Vanishing regret with non-vanishing step sizes. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 12977–12987, Vancouver, BC, Canada, 2019.

[Bro49] George W. Brown. Some Notes on Computation of Games Solutions. Technical report, RAND Corporation, Santa Monica, CA, 1949.

[Bub15] Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8(3–4):231–357, November 2015.

[CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[CDT09] Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Settling the complexity of computing two-player Nash equilibria. Journal of the ACM (JACM), 56(3):1–57, 2009.

[CP19] Yun Kuen Cheung and Georgios Piliouras. Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games. In Proceedings of the Thirty-Second Conference on Learning Theory, pages 807–834, 2019.

[CP20] Xi Chen and Binghui Peng. Hedging in games: Faster convergence of external and swap regrets. In Advances in Neural Information Processing Systems, volume 33, pages 18990–18999. Curran Associates, Inc., 2020.

[CS04] Imre Csiszár and Paul C. Shields. Information Theory and Statistics: A Tutorial. Communications and Information Theory, 1(4):417–528, December 2004.

[DDK11] Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-Optimal No-Regret Algorithms for Zero-Sum Games. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2011.

[DFP+10] Constantinos Daskalakis, Rafael Frongillo, Christos H. Papadimitriou, George Pierrakos, and Gregory Valiant. On learning algorithms for Nash equilibria. In Proceedings of the Third International Conference on Algorithmic Game Theory (SAGT '10), pages 114–125, Berlin, Heidelberg, 2010. Springer-Verlag.

[DGP06] Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a Nash equilibrium. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing (STOC), 2006.

[DP14] Constantinos Daskalakis and Qinxuan Pan. A counter-example to Karlin's strong conjecture for fictitious play. In Proceedings of the 55th Annual Symposium on Foundations of Computer Science (FOCS), 2014.

[DP19] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference (ITCS 2019), volume 124 of LIPIcs, pages 27:1–27:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019.

[FLL+16] Dylan J. Foster, Zhiyuan Li, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[GKP89] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, 1989.

[GPD20] Noah Golowich, Sarath Pattathil, and Constantinos Daskalakis. Tight last-iterate convergence rates for no-regret learning in multi-player games. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.
[HAM21] Yu-Guan Hsieh, Kimon Antonakopoulos, and Panayotis Mertikopoulos. Adaptive learning in continuous games: Optimal regret bounds and convergence to Nash equilibrium. In Conference on Learning Theory, 2021.

[Han57] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.

[HMcW+03] Sergiu Hart and Andreu Mas-Colell. Uncoupled dynamics do not lead to Nash equilibrium. American Economic Review, 93(5):1830–1836, 2003.

[KHSC18] Ehsan Asadi Kangarshahi, Ya-Ping Hsieh, Mehmet Fatih Sahin, and Volkan Cevher. Let's be honest: An optimal no-regret framework for zero-sum games. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2488–2496. PMLR, 2018.

[KLP11] Robert Kleinberg, Katrina Ligett, and Georgios Piliouras. Beyond the Nash equilibrium barrier. In Innovations in Computer Science (ICS), pages 125–140, 2011.

[LGNPw21] Qi Lei, Sai Ganesh Nagarajan, Ioannis Panageas, and Xiao Wang. Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1441–1449. PMLR, 2021.

[MPP18] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '18), pages 2703–2717, 2018. Society for Industrial and Applied Mathematics.

[Nes05] Yu. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005.

[PP16] Christos Papadimitriou and Georgios Piliouras. From Nash equilibria to chain recurrent sets: Solution concepts and topology. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science (ITCS '16), pages 227–235, New York, NY, USA, 2016. Association for Computing Machinery.

[Rob51] Julia Robinson. An Iterative Method of Solving a Game. Annals of Mathematics, pages 296–301, 1951.

[Rou09] Tim Roughgarden. Intrinsic robustness of the price of anarchy. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing (STOC '09), pages 513–522, New York, NY, USA, 2009. Association for Computing Machinery.

[RS13a] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Annual Conference on Learning Theory, pages 993–1019, 2013.

[RS13b] Alexander Rakhlin and Karthik Sridharan. Optimization, Learning, and Games with Predictable Sequences. arXiv:1311.1869 [cs], November 2013.

[RST17] Tim Roughgarden, Vasilis Syrgkanis, and Éva Tardos. The price of anarchy in auctions. Journal of Artificial Intelligence Research, 59(1):59–101, May 2017.

[SALS15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

[Sha64] L. Shapley. Some Topics in Two-Person Games. Advances in Game Theory, 1964.

[ST13] Vasilis Syrgkanis and Eva Tardos. Composable and efficient mechanisms. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing (STOC '13), pages 211–220, New York, NY, USA, 2013. Association for Computing Machinery.
[VGFL+20] Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, Thanasis Lianeas, Panayotis Mertikopoulos, and Georgios Piliouras. No-regret learning and mixed Nash equilibria: They do not mix. In Advances in Neural Information Processing Systems, volume 33, pages 1380–1391. Curran Associates, Inc., 2020.

[WL18] Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Proceedings of the 31st Conference On Learning Theory, pages 1263–1291, 2018.

[WLZL21] Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Linear last-iterate convergence in constrained saddle-point optimization. In International Conference on Learning Representations, 2021.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Sections 2 and 3.
   (b) Did you include complete proofs of all theoretical results? [Yes] See Section 4 and the appendix (in the supplementary material).
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]