# on_lastiterate_convergence_beyond_zerosum_games__d77df552.pdf On Last-Iterate Convergence Beyond Zero-Sum Games Ioannis Anagnostides 1 Ioannis Panageas 2 Gabriele Farina 1 Tuomas Sandholm 1 3 4 5 Most existing results about last-iterate convergence of learning dynamics are limited to twoplayer zero-sum games, and only apply under rigid assumptions about what dynamics the players follow. In this paper we provide new results and techniques that apply to broader families of games and learning dynamics. First, we show that in a class of games that includes constant-sum polymatrix and strategically zero-sum games, the trajectories of dynamics such as optimistic mirror descent (OMD) exhibit a boundedness property, which holds even when players employ different algorithms and prediction mechanisms. This property enables us to obtain O(1/ T) rates and optimal O(1) regret bounds. Our analysis also reveals a surprising property: OMD either reaches arbitrarily close to a Nash equilibrium or it outperforms the robust price of anarchy in efficiency. Moreover, for potential games we establish convergence to an ϵ-equilibrium after O(1/ϵ2) iterations for mirror descent under a broad class of regularizers, as well as optimal O(1) regret bounds for OMD variants. Our framework also extends to near-potential games, and unifies known analyses for distributed learning in Fisher s market model. Finally, we analyze the convergence, efficiency, and robustness of optimistic gradient descent (OGD) in general-sum continuous games. 1. Introduction No-regret learning and game theory share an intricately connected history tracing back to Blackwell s seminal approachability theorem (Blackwell, 1956; Abernethy et al., 2011), leading to fundamental connections between no-regret learning and game-theoretic equilibrium concepts. For example, 1Carnegie Mellon University 2University of California Irvine 3Strategy Robot, Inc. 4Optimized Markets, Inc. 5Strategic Machine, Inc. Correspondence to: Ioannis Anagnostides . Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s). it is folklore that if both players in a zero-sum game employ a no-regret algorithm, the time average of their strategies will eventually converge to a Nash equilibrium (NE). However, typical guarantees within the no-regret framework provide no insights about the final state of the system. This begs the question: Will the agents eventually play according to an equilibrium strategy? In general, the answer to this question is no: Broad families of regret minimization algorithms such as mirror descent (MD) are known to exhibit recurrent or even chaotic behavior (Sato et al., 2002; Sandholm, 2010; Mertikopoulos et al., 2018). In an attempt to stabilize the chaotic behavior of traditional no-regret learning algorithms and ameliorate the notoriously tedious training process of generative adversarial networks (GANs) (Goodfellow et al., 2014), Daskalakis et al. (2018) discovered that optimistic gradient descent (OGD) (Popov, 1980) guarantees last-iterate convergence in unconstrained bilinear games .1 Thereafter, their result has been extended along several lines (e.g., (Daskalakis & Panageas, 2019; Mertikopoulos et al., 2019; Wei et al., 2021b; Golowich et al., 2020a; Azizian et al., 2021)). Last-iterate convergence is also central in economics (Milgrom & Roberts, 1990), and goes back to the fundamental question of what it really means to learn in a game . 
Indeed, it is unclear how a time average guarantee is meaningful from an economic standpoint. Nevertheless, the known guarantees only apply for restricted classes of games such as two-player zero-sum games. As a result, it is natural to ask whether last-iterate convergence could be a universal phenomenon in games. Unfortunately, this cannot be the case: if every player employs a no-regret algorithm and the individual strategies converge, this would necessarily mean that the limit point is a Nash equilibrium.2 But this is precluded by fundamental impossibility results (Hart & Mas-Colell, 2003; Rubinstein, 2016; Chen et al., 2009; Babichenko, 2014; Daskalakis et al., 2009), at least in a polynomial number of iterations. This observation offers an additional crucial motivation for convergence since Nash equilibria can be exponentially more efficient compared to correlated equilibria (Blum et al., 2008). 1We call them games with an abuse of terminology. They are not games in the game-theoretic sense. 2To see this, observe that the average product distribution of play which is a coarse correlated equilibrium will converge to a product distribution, and hence, a Nash equilibrium. On Last-Iterate Convergence Beyond Zero-Sum Games In light of this, we ask the following question: For which classes of games beyond two-player zero-sum games can we guarantee last-iterate convergence? Characterizing dynamics beyond zero-sum games is a recognized important open question in the literature (Candogan et al., 2013). In fact, even in GANs many practitioners employ a general-sum objective due to its practical superiority. However, many existing techniques are tailored to the min-max structure of the problem. Another drawback of existing last-iterate guarantees is that they make restrictive assumptions about the dynamics the players follow; namely, they employ exactly the same learning algorithm. However, we argue that in many settings such a premise is unrealistic under independent and decentralized players. Indeed, to quote from the work of Candogan et al. (2013): The limit behavior of dynamic processes where players adhere to different update rules is [...] an open question, even for potential games . In this paper, we make progress towards addressing these fundamental questions. 1.1 Overview of Our Contributions The existing techniques employed to show last-iterate convergence are inherently different from the ones used to analyze regret. Indeed, it is tacitly accepted that these two require a different treatment; this viewpoint is reflected by Mertikopoulos et al. (2018): a regret-based analysis cannot distinguish between a self-stabilizing system, and one with recurrent cycles . Our first contribution is to challenge this conventional narrative. We show that for a broad class of games the regret bounded by variation in utilities (RVU) property established by Syrgkanis et al. (2015) implies that the second-order path lengths are bounded (Theorem 3.1) for a broad family of dynamics including optimistic mirror descent (OMD) (Rakhlin & Sridharan, 2013). We use this bound to obtain optimal O(1) individual regret bounds (Corollary 3.3), and, more importantly, to show that O(1/ϵ2) iterations suffice to reach an ϵ-approximate Nash equilibrium (Theorem 3.4). This characterization holds for a class of games intricately connected to the minimax theorem, and includes constant-sum polymatrix (Kearns et al., 2001) and strategically zero-sum games (Moulin & Vial, 1978). 
Furthermore, our results hold even if players employ different OMD variants (with smooth regularizers) and even with advanced predictions, as long as an appropriate RVU bound holds. Additionally, our techniques apply under arbitrary (convex and compact) constrained sets, thereby allowing for a direct extension, for example, to extensive-form games. We also illustrate how our framework can be applied to smooth min-max optimization. Such results appear hard to obtain with prior techniques. Also, in cases where prior techniques applied, our proofs are considerably simpler. Overall, our framework inherits all of the robustness and the simplicity deriving from a regret-based analysis. Our approach also reveals an intriguing additional result: OMD variants either converge arbitrarily close to a Nash equilibrium, or they outperform the robust price of anarchy in terms of efficiency (Roughgarden, 2015) (see Theorem 3.8 for a formal statement). This is based on the observation that lack of last-iterate convergence can be leveraged to show that the sum of the players regrets at a sufficiently large time T will be at most CT, for some parameter C > 0. This substantially refines the result of Syrgkanis et al. (2015) using the rather underappreciated fact that regret can be negative. In fact, the further the dynamics are from a Nash equilibrium, the larger is the improvement compared to the robust price of anarchy. Next, we study the convergence of mirror descent (MD) learning dynamics in (weighted) potential games. We give a new potential argument applicable even if players employ different regularizers (Theorem 4.3), thereby addressing an open question of Candogan et al. (2013) regarding heterogeneous dynamics in potential games. Such results for no-regret algorithms were known when all players employed variants of multiplicative weights update (Palaiopanos et al., 2017; H eliou et al., 2017), though in (H eliou et al., 2017) the analysis requires vanishing learning rates. Our potential argument implies a boundedness property for the trajectories, (similarly to Theorem 3.1 we previously discussed); we use this property to show that O(1/ϵ2) iterations suffice to reach an ϵ-Nash equilibrium. We also show a similar boundedness property for optimistic variants, which is then used along with the RVU property to show optimal O(1) regret for each individual player (Theorem 4.6). To our knowledge, this is the first result showing optimal O(1) individual regret in potential games, and it is based on the observation that last-iterate convergence can be leveraged to show improved regret bounds. In a sense, this is the converse of the technique employed in Section 3. Our potential argument also extends to near-potential games (Candogan et al., 2013), implying convergence to approximate Nash equilibria (Theorem 4.10). Importantly, our framework is general enough to unify results from distributed learning in Fisher s market model as well (Birnbaum et al., 2011). Finally, motivated by applications such as GANs, we study the convergence of optimistic gradient descent (OGD) in continuous games. We characterize the class of two-player games for which OGD converges, extending the known prior results (Theorem 5.1). Also, we show that OGD can be arbitrarily inefficient beyond zero-sum games (Proposition 5.2). More precisely, under OGD players can fail to coordinate even if the objective presents a clear coordination aspect. 
Finally, in Theorem 5.4 we use our techniques to characterize convergence in multiplayer settings as well. For the convenience of the reader we have summarized our key results in Table 1. On Last-Iterate Convergence Beyond Zero-Sum Games Setting Results Bounded second-order path lengths (Theorem 3.1) Games with nonnegative regrets (Sections 3.1 and 3.2) Optimal regret (Corollary 3.3) O(1/ T) rates to Nash equilibria (NE) (Theorem 3.4) Smooth games (Section 3.3) Outperforming the (robust) price of anarchy (Theorem 3.8) Optimal regret (Theorem 4.6) Potential and near-potential games (Section 4) O(1/ T) rates to NE (Theorems 4.7 and 4.10) Convergence in Fisher markets (Appendix B.2) Convergence of OGD (Theorems 5.1 and 5.4) Unconstrained general-sum games (Section 5) Inefficiency of OGD (Proposition 5.2) Instability of first-order methods (Theorem C.11) Table 1. Overview of our main results. 1.2 Further Related Work The related literature on the subject is too vast to cover in its entirety; below we emphasize certain key contributions. For constrained zero-sum games Daskalakis & Panageas (2019) established asymptotic last-iterate convergence for optimistic multiplicative weights update, although they had to assume the existence of a unique Nash equilibrium. Our techniques do not require a uniqueness assumption, which is crucial for equilibrium refinements (Damme, 1987). Wei et al. (2021b) improved several aspects of the result of Daskalakis & Panageas (2019), showing a surprising linear rate of convergence, albeit with a dependence on condition number-like quantities which were left unspecified. Some of the latter results were extended for the substantially more complex class of extensive-form games by Lee et al. (2021). While the aforementioned results apply under a time-invariant learning rate, there has also been a substantial amount of work considering a vanishing scheduling (e.g., (Mertikopoulos et al., 2019; Mertikopoulos & Zhou, 2019; Zhou et al., 2017; 2018; Hsieh et al., 2021)). Our results apply under a constant learning rate, which has been extensively argued to be desirable in the literature (Bailey & Piliouras, 2019; Golowich et al., 2020a). Last-iterate convergence has also been recently documented in certain settings from competitive reinforcement learning (Daskalakis et al., 2020; Wei et al., 2021a; Leonardos et al., 2022). Beyond zero-sum games, our knowledge remains much more limited. Cheung & Piliouras (2020); Cheung & Tao (2021) made progress by establishing chaotic behavior for instances of OMD in bimatrix games. Last-iterate convergence under follow the regularizer leader (FTRL) is known to be rather elusive (Vlatakis-Gkaragkounis et al., 2020), occurring only under strict Nash equilibria (Giannou et al., 2021). On the positive side, in (Hsieh et al., 2021), lastiterate convergence for a variant of OMD with an adaptive learning rate was established for variationally stable games, a class of games that includes zero-sum convex-concave. The convergence of learning dynamics has also recently received attention in several auction settings; we refer to the work of Feng et al. (2021) and Deng et al. (2022), as well as references therein. In the unconstrained setting, the behavior of OGD is by now well-understood in bilinear zero-sum games (Liang & Stokes, 2019; Mokhtari et al., 2020b; Zhang & Yu, 2020). 
Various results have also been shown for convex-concave landscapes (e.g., (Mertikopoulos et al., 2019)), and cocoercive games (e.g., (Lin et al., 2020)), but covering this literature goes beyond our scope. For monotone variational inequalities (VIs) problems (Harker & Pang, 1990), Golowich et al. (2020a) showed last-iterate rates of O(1/ T), which is also tight for their considered setting. Note that this rate is slower than that of the time average for the extra-gradient (EG) method (Golowich et al., 2020b). Lastly, last-iterates rates for variants of OMD have been obtained by Azizian et al. (2021) in a certain stochastic VI setup. 2. Preliminaries Conventions In the main body we allow the O( ) notation to suppress game-dependent parameters polynomial in the game; precise statements are given in the Appendix. We use subscripts to indicate players, and superscripts for time indices. The j-th coordinate of x ℝd is denoted by x(j). Normal-Form Games In this paper we primarily focus on normal-form games (NFGs). We let [n] := {1, . . . , n} be the set of players. Each player i [n] has a set of available actions Ai. The joint action profile is denoted by a := (a1, . . . , an) Q i [n] Ai. For a given a each player i receives some (normalized) utility ui(a), where ui : Q i [n] Ai [ 1, 1]. Players are allowed to randomize, and we denote by xi(ai) the probability that player i will chose action ai Ai, so that xi (Ai) := On Last-Iterate Convergence Beyond Zero-Sum Games n x ℝ|Ai| 0 : P ai Ai x(ai) = 1 o . The expected utility of player i can be expressed as ui(x) := 𝔼a x[ui(a)], where x := (x1, . . . , xn) is the joint (mixed) strategy profile. The social welfare for a given a is defined as the sum of the players utilities: SW(a) := Pn i=1 ui(a). As usual, we overload notation to denote SW(x) := 𝔼a x SW(a). We also let OPT := maxa A SW(a) be the maximum social welfare. Definition 2.1 (Approximate Nash equilibrium). A joint strategy profile (x 1, . . . , x n) Q i [n] (Ai) is an ϵapproximate (mixed) Nash equilibrium if for any player i [n] and any unilateral deviation xi (Ai), ui(xi, x i) ui(x ) + ϵ. Regret Let X ℝd be a convex and compact set. A regret minimization algorithm produces at every time t ℕ a strategy x(t) X, and receives as feedback from the environment a (linear) utility function u(t)(x) : x 7 x, u(t) . The cumulative regret (or simply regret) Reg T up to time T measures the performance of the regret minimization algorithm compared to the optimal fixed strategy in hindsight: Reg T := max x X t=1 x , u(t) t=1 x(t), u(t) . Optimistic Dynamics Following a recent trend in online learning (Rakhlin & Sridharan, 2013), we study optimistic (or predictive) algorithms. Let R be a 1-strongly convex regularizer or distance generating function (DGF) with respect to some norm , continuously differentiable on the relative interior of the feasible set (Rockafellar, 1970). We denote by DR the Bregman divergence generated by R; that is, DR(x, y) := R(x) R(y) R(y), x y . Canonical DGFs include the squared Euclidean R(x) := 1 2 x 2 2 which generates the squared Euclidean distance, as well as the negative entropy R(x) = Pd r=1 x(r)(log x(r) 1) which generates the Kulllback-Leibler divergence. Optimistic mirror descent (OMD) has an internal prediction m(t) at every time t ℕ, so that the update rule takes the following form for t 0: x(t+1) :=arg max x X η x, m(t+1) DR(x, bx(t)); bx(t+1) :=arg max bx X η bx, u(t+1) DR(bx, bx(t)), (OMD) where bx(0) := arg minbx X R(bx). 
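To make the update rule concrete, the following is a minimal sketch of (OMD) instantiated with the squared Euclidean regularizer over the probability simplex, in which case both steps reduce to Euclidean projections of a gradient step. The projection routine, the class interface, and the small zero-sum game in the usage snippet are our own illustrative choices and are not part of the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

class OMDPlayer:
    """(OMD) with R(x) = 0.5 * ||x||_2^2, so that D_R is the squared Euclidean
    distance and both steps become projected (optimistic) gradient steps."""

    def __init__(self, dim, eta):
        self.eta = eta
        self.x_hat = np.full(dim, 1.0 / dim)  # argmin of R over the simplex
        self.m = np.zeros(dim)                # prediction m^(1) = 0

    def play(self):
        # x^(t+1) = argmax_x  eta * <x, m^(t+1)> - D_R(x, x_hat^(t))
        return project_simplex(self.x_hat + self.eta * self.m)

    def observe(self, u):
        # x_hat^(t+1) = argmax_x  eta * <x, u^(t+1)> - D_R(x, x_hat^(t))
        self.x_hat = project_simplex(self.x_hat + self.eta * u)
        self.m = u.copy()                     # default prediction m^(t+1) = u^(t)

# Usage: two players in a random zero-sum game with payoffs x^T A y and -x^T A y.
rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=(3, 3))
p1, p2 = OMDPlayer(3, eta=0.1), OMDPlayer(3, eta=0.1)
for _ in range(1000):
    x, y = p1.play(), p2.play()
    p1.observe(A @ y)      # gradient of u_1(x, y) = x^T A y with respect to x
    p2.observe(-A.T @ x)   # gradient of u_2(x, y) = -x^T A y with respect to y
```

Replacing the Euclidean projection step with an entropic (softmax-style) reweighting yields the multiplicative-weights instantiation used later in the paper.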
We also consider the optimistic follow the regularized leader (OFTRL) algorithm of Syrgkanis et al. (2015), defined as follows:

x^{(t+1)} := arg max_{x ∈ X} { η ⟨x, m^{(t+1)} + Σ_{τ=1}^{t} u^{(τ)}⟩ − R(x) }.

Unless specified otherwise, it will be assumed that m^{(t+1)} := u^{(t)} for t ≥ 1, and m^{(1)} := 0. Syrgkanis et al. (2015) established that OMD and OFTRL have the so-called RVU property, which we recall below.

Definition 2.2 (RVU Property). A regret minimizing algorithm satisfies the (α, β, γ)-RVU property under a pair of dual norms (‖·‖, ‖·‖∗)³ if for any sequence of utility vectors u^{(1)}, . . . , u^{(T)} its regret Reg^T can be bounded as

Reg^T ≤ α + β Σ_{t=1}^{T} ‖u^{(t)} − u^{(t−1)}‖∗² − γ Σ_{t=1}^{T} ‖x^{(t)} − x^{(t−1)}‖²,

where α, β, γ > 0 are time-independent parameters, and x^{(1)}, . . . , x^{(T)} is the sequence of produced iterates.

³The dual of a norm ‖·‖ is defined as ‖v‖∗ := sup_{‖u‖≤1} ⟨u, v⟩.

Proposition 2.3 (Syrgkanis et al., 2015). OFTRL with learning rate η and m^{(t)} = u^{(t−1)} satisfies the RVU property with (α, β, γ) = (Ω/η, η, 1/(4η)), where Ω := sup_x R(x) − inf_x R(x). OMD with learning rate η and m^{(t)} = u^{(t−1)} satisfies the RVU property with (α, β, γ) = (Ω/η, η, 1/(8η)), where Ω := sup_x D_R(x, x̂^{(0)}).

3. Optimistic Learning in Games

In this section we study optimistic learning dynamics. We first show that the RVU bound implies a strong boundedness property for the trajectories under a broad class of games.

Theorem 3.1. Suppose that every player i ∈ [n] employs a regret minimizing algorithm satisfying the (α_i, β_i, γ_i)-RVU property with γ_i ≥ 2(n − 1) Σ_{j≠i} β_j, for all i ∈ [n], under the pair of dual norms (‖·‖₁, ‖·‖∞). If the game is such that Σ_{i=1}^{n} Reg_i^T ≥ 0 for any T ∈ ℕ, it holds that

Σ_{i=1}^{n} γ_i Σ_{t=1}^{T} ‖x_i^{(t)} − x_i^{(t−1)}‖₁² ≤ 2 Σ_{i=1}^{n} α_i. (1)

The ℓ₁-norm in Theorem 3.1 is only used for convenience; (1) immediately extends under any equivalent norm. Now the requirement of Theorem 3.1 related to the RVU property can be satisfied by OMD and OFTRL, as implied by Proposition 2.3. Concretely, if all players employ (OMD) with the same η > 0, it would suffice to take η ≤ 1/(4(n − 1)). In light of this, applying Theorem 3.1 only requires that the game is such that Σ_{i=1}^{n} Reg_i^T ≥ 0. Next, we show that this is indeed the case for certain important classes of games.

3.1 Games with Nonnegative Sum of Regrets

In polymatrix games there is an underlying graph so that every node is associated with a given player and every edge corresponds to a two-person game. The utility of a player is the sum of the utilities obtained from each individual game played with its neighbors on the graph (Kearns et al., 2001). In a polymatrix zero-sum game, every two-person game is zero-sum. More generally, in a zero-sum polymatrix game the sum of all players' utilities has to be 0. Strategically zero-sum games, introduced by Moulin & Vial (1978), form the subclass of bimatrix games which are strategically equivalent to a zero-sum game (see Definition A.5 for a formal statement). For this class of games, Moulin & Vial (1978) showed that fictitious play converges to a Nash equilibrium. We will extend their result under a broad class of no-regret learning dynamics. An important feature of strategically zero-sum games is that they constitute the exact class of games whose fully-mixed Nash equilibria cannot be improved upon by, e.g., a correlation scheme (Aumann, 1974).

Proposition 3.2 (Abridged; Full Version in Proposition A.10). For the following classes of games it holds that Σ_{i=1}^{n} Reg_i^T ≥ 0:

1. Two-player zero-sum games;
2. Polymatrix zero-sum games;
3. Constant-sum polymatrix games;
4. Strategically zero-sum games;⁴
5.
Polymatrix strategically zero-sum games;4 As suggested in the full version (Proposition A.10), this class of games appears to be intricately connected with the admission of a minimax theorem. As such, it also captures certain nonconvex-nonconcave landscapes such as quasiconvex-quasiconcave games (Sion, 1958) and stochastic games (Shapley, 1953), but establishing last-iterate convergence for those settings goes beyond the scope of this paper. Note that Von Neumann s minimax theorem for twoplayer zero-sum games can be generalized to the polymatrix games we are considering (Daskalakis & Papadimitriou, 2009; Cai & Daskalakis, 2011; Cai et al., 2016). Interestingly, our framework also has implications when the duality gap is small; see Remark A.11. 3.2 Implications Having established the richness of the class of games we are capturing, we present implications deriving from Theorem 3.1, and extensions thereof. First, we show that it implies optimal O(1) individual regret. Corollary 3.3 (Optimal Regret Bound). In the setting of Theorem 3.1 with αi = α, βi = β, and γi = γ, the individual regret of each player i [n] can be bounded as Reg T i α + 2n(n 1)αβ For example, under (OMD) with η = 1 4(n 1) this corollary implies that Reg T i 8nΩi 8n log |Ai|, for any i [n]. Moreover, we also obtain an O(1/ϵ2) bound on the number of iterations so that the last iterate is an ϵ-Nash equilibrium. Theorem 3.4 (Abridged; Full Version in Theorem A.12). Suppose that each player employs (OMD) with a suitable 4See Remark A.7. regularizer. If Pn i=1 Reg T i 0, then for any ϵ > 0 and after T = O(1/ϵ2) iterations there is joint strategy x(t), with t [T], which is an ϵ-approximate Nash equilibrium. This theorem requires smoothness of the regularizer used by each player (which could be different), satisfied by a broad family that includes the Euclidean DGF. Interestingly, the argument does not appear to hold under non-smooth regularizers such as the negative entropy DGF, even if one works with common local norms. We also remark that while the bound in Theorem 3.4 applies for the best iterate, it is always possible to terminate once a desired precision has been reached. Asymptotic last-iterate convergence follows immediately from our techniques; see Remark A.15. Advanced Predictions Our last-iterate guarantees are the first to apply even if players employ more advanced prediction mechanisms. Indeed, Syrgkanis et al. (2015) showed that a qualitatively similar RVU bound can be derived under 1. H-step recency bias: Given a window of size H, we define m(t) := Pt 1 τ=t H u(τ)/H; 2. Geoemetrically discounted recency bias: For δ (0, 1), we define m(t) := 1 Pt 1 τ=0 δ τ Pt 1 τ=0 δ τu(τ). In Proposition A.2 we show the boundedness property for the trajectories under such prediction mechanisms, and we experimentally evaluate their performance in Appendix D. Bilinear Saddle-Point Problems Our framework also has direct implications for extensive-form games (EFGs) where the strategy space of each player is no longer a probability simplex. In particular, computing a Nash equilibrium in zero-sum EFGs can be formulated as a bilinear saddle-point problem (BSPP) min x X max y Y x Ay, (2) where X and Y are convex and compact sets associated with the sequential decision process faced by each player (Romanovskii, 1962; von Stengel, 1996; Koller et al., 1996). Proposition 3.5 (Abdriged; Full Version in Proposition A.22). Suppose that both players in a BSPP employ (OMD) with η 1 4 A 2 , where 2 denotes the spectral norm. 
Then, after T = O(1/ϵ2) iterations there is a pair (x(t), y(t)) with t [T] which is an O(ϵ)-approximate Nash equilibrium. We are not aware of any prior polynomial-time guarantees for the last iterate of no-regret learning dynamics in EFGs. Asymptotic last-iterate convergence also follows immediately. Proposition 3.5 is verified through experiments on benchmark EFGs in Section 6 (Figure 1). Remark 3.6 (Convex-Concave Games). Our framework also applies to smooth min-max optimization via a well-known linearization trick; see Appendix A.1. Implications for unconstrained setups are also immediate; see Remark A.19. On Last-Iterate Convergence Beyond Zero-Sum Games 3.3 Convergence of the Social Welfare We conclude this section with a surprising new result presented in Theorem 3.8. Let us first recall the concept of a smooth game. Definition 3.7 (Smooth Game; Roughgarden, 2015). A utility-maximizing game is (λ, µ)-smooth if for any two action profiles a, a A it holds that i=1 ui(a i , a i) λ SW(a ) µ SW(a). The smoothness condition imposes a bound on the externality of any unilateral deviation. The robust price of anarchy (Po A) of a game Γ is defined as ρ := supλ,µ λ/(1 + µ), where the supremum is taken over all possible λ, µ for which Γ is (λ, µ)-smooth. Smooth games with favorable smoothness parameters include simultaneous second-price auctions (Christodoulou et al., 2016), valid utility games (Vetta, 2002), and congestion games with affine costs (Awerbuch et al., 2013; Christodoulou & Koutsoupias, 2005). We refer to (Roughgarden, 2015) for an additional discussion. The importance of Roughgarden s smoothness framework is that it guarantees convergence (in a time average sense) of no-regret learning to outcomes with approximately optimal welfare, where the approximation depends on the robust Po A. In particular, the convergence is controlled by the sum of the players regrets (Syrgkanis et al., 2015, Proposition 2). We use this property to obtain the following result. Theorem 3.8 (Abridged; Full Version in Theorem A.17). Suppose that each player in a (λ, µ)-smooth game Γ employs (OMD) with a suitable regularizer and learning rate η > 0. For any γ > 0 and a sufficiently large number of iterations T = Ω(1/γ2), either of the following occurs: 1. There is an iterate x(t) with t [T] which is an O(γ)- approximate Nash equilibrium; 2. Otherwise, it holds that t=1 SW(x(t)) λ 1 + µ OPT + 1 1 + µ γ2 where (λ, µ) are the smoothness parameters of Γ. In words, the dynamics either approach arbitrarily close to a Nash equilibrium, or the time average of the social welfare outperforms the robust price of anarchy. Either of these implications is remarkable. 4. Convergence with the Potential Method In this section we show optimal regret bounds (Theorem 4.6) and last-iterate rates (Theorem 4.7) for the fundamental class of potential games (Monderer & Shapley, 1996; Rosenthal, 1973) under mirror descent (MD) with suitable regularizers. In fact, our approach is general enough to capture under a unifying framework distributed learning in Fisher s market model (Birnbaum et al., 2011), while we expect that further applications will be identified in the future. Finally, we show that our approach can also be applied in near-potential games, in the precise sense of (Candogan et al., 2013), showing convergence to approximate Nash equilibria. We commence by formally describing the class of games we are considering in this section; in Proposition 4.2 we show that it incorporates typical potential games. Definition 4.1. 
Let Φ : Qn i=1 (Ai) (x1, . . . , xn) 7 ℝ be a bounded function, with maxx |Φ(x)| Φmax, for which there exists L > 0 such that for any x, ex it holds that Φ(x) Φ(ex) xΦ(x), ex x + L ex x 2 2. (3) Moreover, let gi be a strictly increasing function for each i [n]. A game Γ is (g1, . . . , gn)-potential if for all i [n], ai Ai, and x i Q j =i (Aj) we have that Φ(x) xi(ai) = gi(ui(ai, x i)). (4) A few remarks are in order. First, (3) imposes a one-sided smoothness condition. This relaxation turns out to be crucial to encompass settings such as linear Fisher markets (Birnbaum et al., 2011). Moreover, (4) prescribes applying a monotone transformation to the utility. While the identity mapping gi : x 7 x suffices to cover typical potential games, a logarithmic transformation is required to capture the celebrated proportional response dynamics (Birnbaum et al., 2011; Wu & Zhang, 2007) in markets. Proposition 4.2. Let Γ be a game for which there exists a function Φ : Qn i=1 Ai ℝwith Φ(a) Φ(a i, a i) = wi(ui(a) ui(a i, a i)), where w ℝn >0. Then, Γ is a potential game in the sense of Definition 4.1 with L = 1 2 Φmax Pn i=1 |Ai|. In this context, we will assume that all players update their strategies using mirror descent (MD) for t 0: x(t+1) i := arg max x (Ai) η gi(u(t) i ), x DRi(x, x(t) i ). (MD) In accordance to Definition 4.1, players apply (coordinatewise) the transformation gi to the observed utility. Importantly, we will allow players to employ different regularizers, as long as Ri is 1-strongly convex with respect to 2. This trivially holds under the Euclidean DGF, while it also holds under negative entropy. We are ready to establish that the potential function increases along non-stationary orbits: On Last-Iterate Convergence Beyond Zero-Sum Games Theorem 4.3. Suppose that each player employs (MD) with a 1-strongly convex regularizer with respect to 2, and learning rate η = 1 2L, where L is defined as in Proposition 4.2. Then, for any t 1 it holds that Φ(x(t+1)) Φ(x(t)) 1 i=1 x(t+1) i x(t) i 2 2 0. (5) For large values of learning rate η variants of (MD) are known to exhibit chaotic behavior in potential games (Bielawski et al., 2021; Palaiopanos et al., 2017). Theorem 4.3 also implies the following boundedness property for the trajectories from a telescopic summation of (5): Corollary 4.4. In the setting of Theorem 4.3 it holds that t=1 x(t+1) x(t) 2 2 Φ(x(T )) Φ(x(1)) 2 Φmax . In the sequel, to argue about the regret incurred by each player and the convergence to Nash equilibria, we tacitly assume that gi is the identity map. The first implication of Corollary 4.4 is an O( T) bound on the individual regrets. Corollary 4.5. In the setting of Theorem 4.3, with Ri := 1 2 x 2 2, it holds that the regret of each player i [n] is such that Reg T i = O( Note that we establish vanishing O(1/ T) average regret under constant learning rate, thereby deviating from the traditional regret analysis of (MD). More importantly, with a more involved argument we show optimal individual regret under optimistic multiplcative weights update (OMWU): Theorem 4.6 (Optimal Regret for Potential Games). Suppose that each player employs OMWU with a sufficiently small learning rate η > 0. Then, the regret of each player i [n] is such that Reg T i = O(1). This theorem is based on showing a suitable boundedness property (Theorem B.5), which is then appropriately combined with Proposition 2.3. While Theorem 4.6 is stated for OMWU, the proof readily extends well-beyond. 
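For concreteness, the following is a minimal sketch of one standard form of the OMWU update referenced in Theorem 4.6, namely (OMD) instantiated with the negative-entropy regularizer over the simplex, under which both steps become exponential reweightings of the secondary sequence; the class interface and names are our own, and the learning rate should be taken sufficiently small as the theorem requires.

```python
import numpy as np

class OMWUPlayer:
    """Optimistic multiplicative weights update: (OMD) with negative entropy,
    maintaining the secondary sequence x_hat and the prediction m^(t+1) = u^(t)."""

    def __init__(self, dim, eta):
        self.eta = eta
        self.x_hat = np.full(dim, 1.0 / dim)  # uniform point minimizes negative entropy
        self.m = np.zeros(dim)

    def play(self):
        # x^(t+1)(a) is proportional to x_hat^(t)(a) * exp(eta * m^(t+1)(a))
        w = self.x_hat * np.exp(self.eta * self.m)
        return w / w.sum()

    def observe(self, u):
        # x_hat^(t+1)(a) is proportional to x_hat^(t)(a) * exp(eta * u^(t+1)(a))
        w = self.x_hat * np.exp(self.eta * u)
        self.x_hat = w / w.sum()
        self.m = u.copy()
```

The same play/observe loop as in the earlier Euclidean sketch applies; for instance, in an identical-interest game (a potential game with the common payoff as potential), each player simply feeds back the gradient of that shared payoff.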
In this way, when the underlying game is potential, we substantially strengthen and simplify the result of Daskalakis et al. (2021). Moreover, we also obtain a bound on the number of iterations required to reach an approximate Nash equilibrium. Theorem 4.7 (Abridged; Full Version in Theorem B.6). Suppose that each player employs (MD) with a suitable regularizer. Then, after O(1/ϵ2) iterations there is a strategy x(t) which is an ϵ-approximate Nash equilibrium. This theorem has a similar flavour to Theorem 3.4, and it is the first of its kind for the class of potential games. Finally, in settings where the potential function is concave we show a rate of convergence of O(1/T). Proposition 4.8. In the setting of Theorem 4.3, if the potential function Φ is also concave it holds that Φ(x ) Φ(x(T +1)) 2L i=1 DRi(x i , x(1) i ). This result is stronger than the standard convergence guarantee in smooth convex optimization since Definition 4.1 only imposes a one-sided condition. While concavity does not hold in typical potential games, it applies to games such as Fisher s market model; see our remark below. Remark 4.9. In Appendix B.2 we explain how our framework naturally captures the analysis of Birnbaum et al. (2011) for distributed dynamics in Fisher s market model. 4.1 Near-Potential Games Finally, we illustrate the robustness of our framework by extending our results to nearpotential games (Candogan et al., 2013). Roughly speaking, a game is near-potential if it is close in terms of the maximum possible utility improvement through unilateral deviations (MPD) to a potential game; we defer the precise definition to Appendix B.1. It is also worth noting that our result immediately extends under different distance measures. In this context, we prove the following theorem. Theorem 4.10 (Abridged; Full Version in Theorem B.12). Consider a δ-near-potential game where each player employs (MD) with suitable regularizer. Then, there exists a potential function Φ which increases as long as x(t) is not an O( δ)-Nash equilibrium. 5. Continuous Games Sections 3 and 4 primarily focused on classes of games stemming from applications in game theory. In this section we shift our attention to continuous games , strategic interactions motivated by applications such as GANs. Specifically, we study the convergence properties of optimistic gradient descent (OGD) beyond two-player zero-sum games. Recall that OGD in unconstrained settings can be expressed using the following update rule for t 1: x(t+1) i = x(t) i +2η xiui(x(t)) η xiui(x(t 1)), (OGD) for any player i [n], where ui : Q i [n] Xi ℝis assumed to be continuously differentiable. 5.1 Two-Player Games We first study two-player games. Let A, B ℝd d be matrices so that under strategies (x, y) X Y the utilities of players X and Y are given by the bilinear form x Ay and x By respectively. As On Last-Iterate Convergence Beyond Zero-Sum Games is common in this line of work, the matrices are assumed to be square and non-singular. In line with the application of interest, we mostly focus on the unconstrained setting, where X = ℝd and X = ℝd; some of our results are also applicable when X and Y are balls in ℝd, as we make clear in the sequel. A point (x , y ) is an equilibrium if x Ay (x ) Ay , x X; (x ) By (x ) By , y Y. (6) Even when B = A, this seemingly simple saddle-point problem is with the addition of appropriate regularization powerful enough to capture problems such as linear regression, empirical risk minimization, and robust optimization (Du & Hu, 2019). 
Further, studying such games relates to the local convergence of complex games encountered in practical applications (Liang & Stokes, 2019). We also point out that our techniques can address the addition of quadratic regularization. In this regime, our first contribution is to extend the known regime for which OGD retains stability: Theorem 5.1 (Abridged; Full Version in Theorem C.3). Suppose that the matrix A B has strictly negative (real) eigenvalues. Then, for a sufficiently small learning rate η > 0, (OGD) converges linearly to an equilibrium. The proof of this theorem is based on transforming the dynamics of (OGD) to the frequency domain via the Ztransform, in order to derive the characteristic equation of the dynamical system. When the game is zero-sum, the condition of the theorem holds since the matrix A A is symmetric and negative definite. As such, Theorem 5.1 extends the known results in the literature. The technique we employ can also reveal the rate of convergence in terms of the eigenvalues of A B. The first question stemming from Theorem 5.1 is whether the condition of stability captures games which are, in some sense, fully competitive. Our next result answers this question in the negative. Indeed, it turns out that the condition of stability in Theorem 5.1 also includes games with a coordination aspect, but under (OGD) the players will fail to coordinate. In turn, we show that this implies that (OGD) can be arbitrarily inefficient. More precisely, for strategies x X and y Y, we define the social welfare as SW(x, y) := x Ay + x By. We show the following result (see Figure 8 for an illustration). Proposition 5.2. For any sufficiently large R > 0, there exist games such that (OGD) converges under any initialization to an equilibrium (x( ), y( )) such that SW(x( ), y( )) = 0, while there exist an equilibrium (x , y ) with SW(x , y ) 2R2. This holds when X and Y are compact balls on ℝd and parameter R > 0 controls their radius. Proposition 5.2 seems to suggest that the stabilizing properties of (OGD) come at a dramatic loss of efficiency beyond zero-sum games. Another interesting implication is that an arbitrarily small perturbation from a zero-sum game can destabilize (OGD): Proposition 5.3. For any ϵ > 0 there exists a game (A, B) with A + B F ϵ for which (OGD) diverges, while the dynamics converge for the game (A, A). Thus, even if a game (A, B) is arbitrarily close to a zerosum in the Frobenius norm, (OGD) may still diverge. To put it differently, a small noise in one of the payoff matrices can dramatically alter the behavior of the system. This phenomenon is illustrated and further discussed in Figure 9. 5.2 Multiplayer Games Moreover, we also characterize the (OGD) dynamics in polymatrix games. Such multiplayer interactions have already received considerable attention in the literature on GANs (e.g., see (?), and references therein), but the behavior of the dynamics is poorly understood even in structured games (Kalogiannis et al., 2021). Our next theorem characterizes the important case where a single player 1 plays against n 1 different players, numbered from 2 to n. In particular, the utility of player 1 under strategies x = (x1, . . . , xn) is given by P j =i x i A1,jxj, while the utility of player j is given by x j Aj,1x1. Our next result extends Theorem 5.1. Theorem 5.4 (Abridged; Full Version in Theorem C.7). 
If M := Σ_{j≠1} A_{1,j} A_{j,1} has strictly negative (real) eigenvalues, there exists a sufficiently small learning rate η > 0, depending only on the spectrum of the matrix M, such that (OGD) converges with a linear rate to an equilibrium.

This condition is again trivially satisfied in polymatrix zero-sum games. In fact, Theorem 5.4 is an instance of a more general characterization for polymatrix unconstrained games, as we further explain in Remark C.9.

Remark 5.5 (Beyond OGD: Stability of First-Order Methods). In light of the result established in Theorem 5.1, it is natural to ask whether its condition is necessary. In Theorem C.11 we show that the presence of positive eigenvalues in the spectrum of the matrix A B necessarily implies instability even under a generic class of first-order methods.

6. Experiments

In this section we numerically investigate the last-iterate convergence of (OMD) in two zero-sum extensive-form games (EFGs). Recall that computing a Nash equilibrium in this setting can be expressed as the bilinear saddle-point problem of (2). When both players employ (OMD) with Euclidean regularization and η ≤ 1/(4‖A‖₂), where ‖A‖₂ is the spectral norm of A, our results guarantee last-iterate convergence in terms of the saddle-point gap.

We experimented on two standard poker benchmarks known as Kuhn poker (Kuhn, 1950) and Leduc poker (Southey et al., 2005). For each benchmark game, we ran (OMD) with Euclidean regularization and three different values of η for 10000 iterations. After each iteration t, we computed the saddle-point gap corresponding to the iterates at time t, as well as to the average iterates up to time t. The results are shown in Figure 1. As expected, we observe that both the average and the last iterate converge in terms of the saddle-point gap. Moreover, we see that larger values of the learning rate lead to substantially faster convergence, illustrating the importance of obtaining sharp theoretical bounds. Additional experiments related to our theoretical results have been included in Appendix D.

Figure 1. The saddle-point gap of the last iterate and the average iterate of (OMD) dynamics in Kuhn and Leduc poker, for learning rates η = 1/(4‖A‖₂), 1/(10‖A‖₂), and 1/(20‖A‖₂).

Acknowledgements

We are grateful to the anonymous ICML reviewers for many helpful comments. Ioannis Panageas is supported by a startup grant. Part of this work was done while Ioannis Panageas was visiting the Simons Institute for the Theory of Computing. Tuomas Sandholm is supported by the National Science Foundation under grants IIS-1901403 and CCF-1733556.

References

Abernethy, J. D., Bartlett, P. L., and Hazan, E. Blackwell approachability and no-regret learning are equivalent. In COLT 2011 - The 24th Annual Conference on Learning Theory, volume 19 of JMLR Proceedings, pp. 27-46. JMLR.org, 2011.

Adler, I., Daskalakis, C., and Papadimitriou, C. H. A note on strictly competitive games. In Internet and Network Economics, 5th International Workshop, WINE 2009, volume 5929 of Lecture Notes in Computer Science, pp. 471-474. Springer, 2009.

Aumann, R. J. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1(1):67-96, 1974.

Awerbuch, B., Azar, Y., and Epstein, A. The price of routing unsplittable flow. SIAM J. Comput., 42(1):160-177, 2013.

Azizian, W., Iutzeler, F., Malick, J., and Mertikopoulos, P.
The last-iterate convergence rate of optimistic mirror descent in stochastic variational inequalities. In Conference on Learning Theory, COLT 2021, volume 134 of Proceedings of Machine Learning Research, pp. 326 358. PMLR, 2021. Babichenko, Y. Query complexity of approximate nash equilibria. In Symposium on Theory of Computing, STOC 2014, pp. 535 544. ACM, 2014. Bailey, J. P. and Piliouras, G. Fast and furious learning in zero-sum games: Vanishing regret with non-vanishing step sizes. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 12977 12987, 2019. Bielawski, J., Chotibut, T., Falniowski, F., Kosiorowski, G., Misiurewicz, M., and Piliouras, G. Follow-theregularized-leader routes to chaos in routing games. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, volume 139 of Proceedings of Machine Learning Research, pp. 925 935. PMLR, 2021. Birnbaum, B. E., Devanur, N. R., and Xiao, L. Distributed algorithms via gradient descent for fisher markets. In Proceedings 12th ACM Conference on Electronic Commerce (EC-2011), pp. 127 136. ACM, 2011. Blackwell, D. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1 8, 1956. Blum, A., Hajiaghayi, M., Ligett, K., and Roth, A. Regret minimization and the price of total anarchy. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, 2008, pp. 373 382. ACM, 2008. Cai, Y. and Daskalakis, C. On minmax theorems for multiplayer games. In Proceedings of the Twenty-Second On Last-Iterate Convergence Beyond Zero-Sum Games Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, pp. 217 234. SIAM, 2011. Cai, Y., Candogan, O., Daskalakis, C., and Papadimitriou, C. Zero-sum polymatrix games: A generalization of minmax. Mathematics of Operations Research, 41(2): 648 655, 2016. Candogan, O., Ozdaglar, A. E., and Parrilo, P. A. A projection framework for near-potential games. In Proceedings of the 49th IEEE Conference on Decision and Control, CDC 2010, pp. 244 249. IEEE, 2010. Candogan, O., Menache, I., Ozdaglar, A. E., and Parrilo, P. A. Flows and decompositions of games: Harmonic and potential games. Math. Oper. Res., 36(3):474 503, 2011. Candogan, O., Ozdaglar, A. E., and Parrilo, P. A. Dynamics in near-potential games. Games Econ. Behav., 82:66 90, 2013. Chen, C., Surana, A., Bloch, A. M., and Rajapakse, I. Multilinear control systems theory. SIAM Journal on Control and Optimization, 59(1):749 776, 2021. Chen, G. and Teboulle, M. Convergence analysis of a proximal-like minimization algorithm using bregman functions. SIAM Journal on Optimization, 3(3):538 543, 1993. Chen, X., Deng, X., and Teng, S. Settling the complexity of computing two-player nash equilibria. J. ACM, 56(3): 14:1 14:57, 2009. Cheung, Y. K. and Piliouras, G. Chaos, extremism and optimism: Volume analysis of learning in games. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020. Cheung, Y. K. and Tao, Y. Chaos of learning beyond zerosum and coordination via game decompositions. In 9th International Conference on Learning Representations, ICLR 2021. Open Review.net, 2021. Christodoulou, G. and Koutsoupias, E. The price of anarchy of finite congestion games. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, 2005, pp. 67 73. ACM, 2005. Christodoulou, G., Kov acs, A., and Schapira, M. 
Bayesian combinatorial auctions. J. ACM, 63(2):11:1 11:19, 2016. Damme, E. V. Stability and Perfection of Nash Equilibria. Berlin: Springer-Verlag, 1987. Daskalakis, C. and Panageas, I. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In Blum, A. (ed.), 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, volume 124 of LIPIcs, pp. 27:1 27:18. Schloss Dagstuhl - Leibniz Zentrum f ur Informatik, 2019. Daskalakis, C. and Papadimitriou, C. H. On a network generalization of the minmax theorem. In Automata, Languages and Programming, 36th Internatilonal Colloquium, ICALP 2009, volume 5556 of Lecture Notes in Computer Science, pp. 423 434. Springer, 2009. Daskalakis, C., Fabrikant, A., and Papadimitriou, C. H. The game world is flat: The complexity of nash equilibria in succinct games. In Automata, Languages and Programming, 33rd International Colloquium, ICALP 2006, volume 4051 of Lecture Notes in Computer Science, pp. 513 524. Springer, 2006. Daskalakis, C., Goldberg, P. W., and Papadimitriou, C. H. The complexity of computing a nash equilibrium. SIAM J. Comput., 39(1):195 259, 2009. Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training gans with optimism. In 6th International Conference on Learning Representations, ICLR 2018. Open Review.net, 2018. Daskalakis, C., Foster, D. J., and Golowich, N. Independent policy gradient methods for competitive reinforcement learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020l, 2020. Daskalakis, C., Fishelson, M., and Golowich, N. Nearoptimal no-regret learning in general games. Co RR, abs/2108.06924, 2021. Deng, X., Hu, X., Lin, T., and Zheng, W. Nash convergence of mean-based learning algorithms in first price auctions. In WWW 22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, pp. 141 150. ACM, 2022. Diakonikolas, J. and Wang, P. Potential function-based framework for making the gradients small in convex and min-max optimization. Co RR, abs/2101.12101, 2021. Du, S. S. and Hu, W. Linear convergence of the primaldual gradient method for convex-concave saddle point problems without strong convexity. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, volume 89 of Proceedings of Machine Learning Research, pp. 196 205. PMLR, 2019. Eisenberg, E. and Gale, D. Consensus of Subjective Probabilities: The Pari-Mutuel Method. The Annals of Mathematical Statistics, 30(1):165 168, 1959. On Last-Iterate Convergence Beyond Zero-Sum Games Feng, Z., Guruganesh, G., Liaw, C., Mehta, A., and Sethi, A. Convergence analysis of no-regret bidding algorithms in repeated auctions. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, pp. 5399 5406. AAAI Press, 2021. Giannou, A., Vlatakis-Gkaragkounis, E., and Mertikopoulos, P. Survival of the strictest: Stable and unstable equilibria under regularized learning with partial information. In Conference on Learning Theory, COLT 2021, volume 134 of Proceedings of Machine Learning Research, pp. 2147 2148. PMLR, 2021. Golowich, N., Pattathil, S., and Daskalakis, C. Tight lastiterate convergence rates for no-regret learning in multiplayer games. In Advances in Neural Information Processing Systems 2020, 2020a. Golowich, N., Pattathil, S., Daskalakis, C., and Ozdaglar, A. E. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. 
In Conference on Learning Theory, COLT 2020, volume 125 of Proceedings of Machine Learning Research, pp. 1758 1784. PMLR, 2020b. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 2014, pp. 2672 2680, 2014. Harker, P. T. and Pang, J.-S. Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications. Math. Program., 48(1 3):161 220, 1990. Hart, S. and Mas-Colell, A. Uncoupled dynamics do not lead to nash equilibrium. The American Economic Review, 93(5):1830 1836, 2003. H eliou, A., Cohen, J., and Mertikopoulos, P. Learning with bandit feedback in potential games. In Advances in Neural Information Processing Systems 30, 2017, pp. 6369 6378, 2017. Hsieh, Y., Antonakopoulos, K., and Mertikopoulos, P. Adaptive learning in continuous games: Optimal regret bounds and convergence to nash equilibrium. In Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA, volume 134 of Proceedings of Machine Learning Research, pp. 2388 2422. PMLR, 2021. Kalogiannis, F., Panageas, I., and Vlatakis-Gkaragkounis, E. Teamwork makes von neumann work: Min-max optimization in two-team zero-sum games. Co RR, abs/2111.04178, 2021. Kearns, M. J., Littman, M. L., and Singh, S. P. Graphical models for game theory. In UAI 01: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, 2001, pp. 253 260. Morgan Kaufmann, 2001. Koller, D., Megiddo, N., and von Stengel, B. Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(2), 1996. Kuhn, H. W. A simplified two-person poker. In Kuhn, H. W. and Tucker, A. W. (eds.), Contributions to the Theory of Games, volume 1 of Annals of Mathematics Studies, 24, pp. 97 103. Princeton University Press, Princeton, New Jersey, 1950. Lee, C.-W., Kroer, C., and Luo, H. Last-iterate convergence in extensive-form games. In Neur IPS, 2021. Leonardos, S., Overman, W., Panageas, I., and Piliouras, G. Global convergence of multi-agent policy gradient in markov potential games. ICLR, 2022. Liang, T. and Stokes, J. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, volume 89 of Proceedings of Machine Learning Research, pp. 907 915. PMLR, 2019. Lin, T., Zhou, Z., Mertikopoulos, P., and Jordan, M. I. Finitetime last-iterate convergence for multi-agent learning in games. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 6161 6171. PMLR, 2020. Mc Mahan, H. B. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, volume 15 of JMLR Proceedings, pp. 525 533. JMLR.org, 2011. Mertikopoulos, P. and Zhou, Z. Learning in games with continuous action sets and unknown payoff functions. Math. Program., 173(1-2):465 507, 2019. Mertikopoulos, P., Papadimitriou, C. H., and Piliouras, G. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, pp. 2703 2717. SIAM, 2018. 
Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C., Chandrasekhar, V., and Piliouras, G. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In 7th International Conference on Learning Representations, ICLR 2019. Open Review.net, 2019. On Last-Iterate Convergence Beyond Zero-Sum Games Milgrom, P. and Roberts, J. Rationalizability, learning, and equilibrium in games with strategic complementarities. Econometrica, 58(6):1255 1277, 1990. Mokhtari, A., Ozdaglar, A., and Pattathil, S. Convergence rate of O(1/k) for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems. SIAM Journal on Optimization, 30:3230 3251, 2020a. Mokhtari, A., Ozdaglar, A. E., and Pattathil, S. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, volume 108 of Proceedings of Machine Learning Research, pp. 1497 1507. PMLR, 2020b. Monderer, D. and Shapley, L. S. Potential games. Games and Economic Behavior, 14(1):124 143, 1996. Moulin, H. and Vial, J. Strategically zero-sum games: The class of games whose completely mixed equilibria cannot be improved upon. International Journal of Game Theory, 1978. Palaiopanos, G., Panageas, I., and Piliouras, G. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pp. 5872 5882, 2017. Panageas, I., Tr obst, T., and Vazirani, V. V. Combinatorial algorithms for matching markets via nash bargaining: One-sided, two-sided and non-bipartite. Co RR, abs/2106.02024, 2021. Popov, L. D. A modification of the arrow-hurwicz method for search of saddle points. Mathematical notes of the Academy of Sciences of the USSR, 28:845 848, 1980. Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In COLT 2013 - The 26th Annual Conference on Learning Theory, volume 30 of JMLR Workshop and Conference Proceedings, pp. 993 1019. JMLR.org, 2013. Rockafellar, R. T. Convex analysis. Princeton University Press, Princeton, N. J., 1970. Romanovskii, I. Reduction of a game with complete memory to a matrix game. Soviet Mathematics, 3, 1962. Rosenthal, R. W. A class of games possessing pure-strategy nash equilibria. International Journal of Game Theory, 2:65 67, 1973. Roughgarden, T. Intrinsic robustness of the price of anarchy. J. ACM, 62(5):32:1 32:42, 2015. Rubinstein, A. Settling the complexity of computing approximate two-player nash equilibria. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, pp. 258 265. IEEE Computer Society, 2016. Sandholm, W. H. Population Games and Evolutionary Dynamics. MIT Press, 2010. Sato, Y., Akiyama, E., and Farmer, J. D. Chaos in learning a simple two-person game. Proceedings of the National Academy of Sciences, 99(7):4748 4751, 2002. Shalev-Shwartz, S. Online learning and online convex optimization. Found. Trends Mach. Learn., 4(2):107 194, 2012. Shapley, L. S. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095 1100, 1953. Shmyrev, V. An algorithm for finding equilibrium in the linear exchange model with fixed budgets. Journal of Applied and Industrial Mathematics, 3:505 518, 10 2009. doi: 10.1134/S1990478909040097. Sion, M. On general minimax theorems. 
Pacific Journal of Mathematics, 8(1):171 176, 1958. Southey, F., Bowling, M. H., Larson, B., Piccione, C., Burch, N., Billings, D., and Rayner, D. C. Bayes? bluff: Opponent modelling in poker. In UAI 05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, pp. 550 558. AUAI Press, 2005. Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pp. 2989 2997, 2015. Vetta, A. Nash equilibria in competitive societies, with applications to facility location, traffic routing and auctions. In 43rd Symposium on Foundations of Computer Science (FOCS 2002), pp. 416. IEEE Computer Society, 2002. Vlatakis-Gkaragkounis, E., Flokas, L., Lianeas, T., Mertikopoulos, P., and Piliouras, G. No-regret learning and mixed nash equilibria: They do not mix. In Advances in Neural Information Processing Systems 33 2020l, 2020. von Stengel, B. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220 246, 1996. Wei, C., Lee, C., Zhang, M., and Luo, H. Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive markov games. In Conference on Learning Theory, COLT 2021, volume 134 of On Last-Iterate Convergence Beyond Zero-Sum Games Proceedings of Machine Learning Research, pp. 4259 4299. PMLR, 2021a. Wei, C., Lee, C., Zhang, M., and Luo, H. Linear last-iterate convergence in constrained saddle-point optimization. In 9th International Conference on Learning Representations, ICLR 2021. Open Review.net, 2021b. Wu, F. and Zhang, L. Proportional response dynamics leads to market equilibrium. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, 2007, pp. 354 363. ACM, 2007. Zhang, G. and Yu, Y. Convergence of gradient methods on bilinear zero-sum games. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. Zhang, L. Proportional response dynamics in the fisher market. Theor. Comput. Sci., 412(24):2691 2698, 2011. Zhou, Z., Mertikopoulos, P., Moustakas, A. L., Bambos, N., and Glynn, P. W. Mirror descent learning in continuous games. In 56th IEEE Annual Conference on Decision and Control, CDC 2017, Melbourne, Australia, December 1215, 2017, pp. 5776 5783. IEEE, 2017. Zhou, Z., Mertikopoulos, P., Athey, S., Bambos, N., Glynn, P. W., and Ye, Y. Learning in games with lossy feedback. In Advances in Neural Information Processing Systems 2018,, pp. 5140 5150, 2018. On Last-Iterate Convergence Beyond Zero-Sum Games A. Proofs from Section 3 In this section we give complete proofs for our results in Section 3. For the convenience of the reader we will restate the claims before proceeding with their proofs. We begin with Theorem 3.1. Theorem 3.1. Suppose that every player i [n] employs a regret minimizing algorithm satisfying the (αi, βi, γi)-RVU property with γi 2(n 1) P j =i βj, for all i [n], under the pair of dual norms ( 1, ). If the game is such that Pn i=1 Reg T i 0, for any T ℕ, it holds that t=1 γi x(t) i x(t 1) i 2 1 2 i=1 αi. (1) Proof. The proof commences similarly to (Syrgkanis et al., 2015, Theorem 4). By assumption, we know that for any player i [n] it holds that Reg T i αi + βi t=1 u(t) i u(t 1) i 2 γi t=1 x(t) i x(t 1) i 2 1. (7) Furthermore, the middle term in the RVU property can be bounded using the following simple claim. Claim A.1. 
For any player i [n] it holds that u(t) i u(t 1) i X j =i x(t) j x(t 1) j 1. Proof. By the normalization assumption on the utilities we know that |ui( )| 1. Thus, from the triangle inequality it follows that u(t) i u(t 1) i X j =i x(t) j (aj) Y j =i x(t 1) j (aj) The induced term can be recognized as (two times) the total variation distance of two product distributions. Hence, a standard application of the sum-product inequality implies that j =i x(t) j (aj) Y j =i x(t 1) j (aj) j =i x(t) j x(t 1) j 1. Moreover, a direct application of Young s inequality to the bound of the previous claim gives us that u(t) i u(t 1) i 2 (n 1) X j =i x(t) j x(t 1) j 2 1. (8) As a result, if we plug-in this bound to (7) we may conclude that for all players i [n] it holds that Reg T i αi + (n 1)βi j =i x(t) j x(t 1) j 2 1 γi t=1 x(t) i x(t 1) i 2 1. Summing these inequalities for all players i [n] yields that i=1 Reg T i t=1 x(t) i x(t 1) i 2 1 t=1 x(t) i x(t 1) i 2 1, where we used the assumption that γi 2(n 1) P j =i βj for any i [n]. Finally, the theorem follows from a rearrangement of the final bound, using the assumption that Pn i=1 Reg T i 0. On Last-Iterate Convergence Beyond Zero-Sum Games Proposition A.2. Suppose that each player employs OFTRL with the same learning rate η > 0. Moreover, suppose that the game is such that Pn i=1 Reg T i 0. Then, 1. If all players use H-step recency bias (Item 1) and η 1 4(n 1)H it holds that i=1 x(t) i x(t 1) i 2 1 8 2. If all players use geometrically δ-discounted recency bias (Item 2) and η (1 δ)3/2 4(n 1) it holds that i=1 x(t) i x(t 1) i 2 1 16 Proof. Let us first establish Item 1. We will use the following RVU bound shown by Syrgkanis et al. (2015, Proposition 9). Proposition A.3 (Syrgkanis et al., 2015). OFTRL with learning rate η > 0 and H-step recency bias (Item 1) satisfies the RVU property with (α, β, γ) = ( Ω 4η), where Ω:= supx X R(x) infx X R(x). As a result, we may conclude from this bound that the regret of each player i [n] can be bounded as η + ηH2 T X t=1 u(t) i u(t 1) i 2 1 t=1 x(t) i x(t 1) i 2 1, where Ωi := supx Xi R(x) infx X R(x). Next, using Claim A.1 the previous bound implies that η + η(n 1)H2 T X j =i x(t) j x(t 1) j 2 1 1 t=1 x(t) i x(t 1) i 2 1. Summing these bounds for all players i [n] gives us that i=1 Reg T i 1 η(n 1)2H2 1 t=1 x(t) i x(t 1) i 2 1 1 i=1 x(t) i x(t 1) i 2 1, for η 1 4(n 1)H . Thus, Item 1 follows immediately under the assumption that Pn i=1 Reg T i 0. Next, we proceed with the proof of Item 2. We use the following RVU bound shown by Syrgkanis et al. (2015, Proposition 10). Proposition A.4 (Syrgkanis et al., 2015). OFTRL with learning rate η > 0 and geometrically δ-discounted recency bias (Item 2) satisfies the RVU property with (α, β, γ) = ( Ω η , η (1 δ)3 , 1 8η), where Ω:= supx X R(x) infx X R(x). Thus, if player i [n] uses geometrically δ-discounted recency bias it follows that its regret can be bounded as η + η (1 δ)3 t=1 u(t) i u(t 1) i 2 1 t=1 x(t) i x(t 1) i 2 1. Combining this bound with Claim A.1 yields that t=1 x(t) i x(t 1) i 2 1 1 t=1 x(t) i x(t 1) i 2 1. Summing these bounds for all players i [n] gives us that i=1 Reg T i 1 t=1 x(t) i x(t 1) i 2 1 1 i=1 Ωi 1 16η i=1 x(t) i x(t 1) i 2 1, for η (1 δ)3/2 4(n 1) . Finally, the fact that Pn i=1 Reg T i 0 along with a rearrangement of the previous bound concludes the proof. On Last-Iterate Convergence Beyond Zero-Sum Games Next, we proceed with the proof of Proposition 3.2. 
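Before doing so, we note that the OFTRL updates analyzed in Proposition A.2 are straightforward to implement. The sketch below (Python) uses the entropic regularizer, for which the update has a closed form; the predictor is taken to be the average of the last H observed utility vectors, which is our reading of the H-step recency bias of Item 1 and, like all parameter choices below, should be treated as an assumption rather than the paper's exact configuration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

def oftrl_iterates(utilities, eta=0.05, H=10):
    """OFTRL over the simplex with entropic regularizer and an H-step
    recency-bias prediction (assumed: average of the last H utilities)."""
    d = len(utilities[0])
    cumulative = np.zeros(d)
    history, iterates = [], []
    for u in utilities:
        m = np.mean(history[-H:], axis=0) if history else np.zeros(d)
        # x^(t) maximizes eta * <x, sum_{tau<t} u^(tau) + m^(t)> - R(x) over the simplex
        x = softmax(eta * (cumulative + m))
        iterates.append(x)
        history.append(u)
        cumulative += u
    return iterates

# toy usage with a random utility stream
rng = np.random.default_rng(0)
xs = oftrl_iterates([rng.uniform(-1, 1, size=3) for _ in range(100)])
print(sum(np.sum((x - xp) ** 2) for x, xp in zip(xs[1:], xs[:-1])))  # second-order path length
```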
First, we state some preliminary facts about strategically zero-sum games, a subclass of bimatrix games. It should be stressed that bimatrix games is already a very general class of games. For example, Daskalakis et al. (2006) have shown that for succinctly representable multiplayer games, the problem of computing Nash equilibria can be reduced to the two-player case. Below we give the formal definition of a strategically zero-sum game. Definition A.5 (Strategically Zero-Sum Games (Moulin & Vial, 1978)). The bimatrix games (A, B) and (C, D), defined on the same action space, are strategically equivalent if for all x, x X and y, y Y, x Ay (x ) Ay x Cy (x ) Cy; x By x By x Dy x Dy . A game is strategically zero-sum if it is strategically equivalent to some zero-sum game. In words, consider a pair of strategies (x, y) X Y. Player X can order her strategy space based on the (expected) payoff if player Y was to play y, and similarly, player Y can order her strategy space under the assumption that player X will play x. In strategically equivalent games this ordering is preserved. Without any loss, we will consider non-trivial games in the sense of (Moulin & Vial, 1978, Definition 2). The following characterization (Moulin & Vial, 1978, Theorem 2) will be crucial for the proof of Proposition 3.2. Theorem A.6 (Moulin & Vial, 1978). Let (A, B) be a non-trivial n m strategically zero-sum game. Then, there exist λ > 0, µ > 0 and a compatible matrix C such that A = λC + va1 m, for some va ℝn; B = µC + 1nv b , for some vb ℝm. Here we used the notation 1n to denote the all-ones vector in ℝn. The converse of Theorem A.6 also holds. It is also worth pointing out strategically zero-sum games can be embedded in a zero-sum game with imperfect information (Moulin & Vial, 1978). Remark A.7. For the purpose of Proposition 3.2 (and Proposition A.10) it will be assumed that λ = µ, neutralizing any scale imbalances in the payoff matrices of the two players. We remark that if this is not the case, our result for strategically zero-sum games (Item 4) still holds by considering an appropriately weighted sum of regrets. In this case one should select different learning rates for the two players in order to cancel out the underlying scale imbalance. Nonetheless, such an extension does not appear to work for polymatrix strategically zero-sum games (Item 5). The latter fact could be partly justified by the surprising result that even polymatrix strictly competitive games are PPAD-hard (Cai & Daskalakis, 2011, Theorem 1.2). Remark A.8 (Strictly Competitive Games). Strictly competitive games form a subclass of strategically zero-sum games. More precisely, a two-person game is strictly competitive if it has the following property: if both players change their mixed strategies, then either there is no change in the expected payoffs, or one of the two expected payoffs increases and the other decreases. It was formally shown by Adler et al. (2009) that a game (A, B) is strictly competitive if and only if B is an affine variant of matrix A; that is, B = λA + µ1, where λ > 0, µ is unrestricted, and 1 denotes the all-ones matrix. Before we proceed with the proof of Proposition 3.2, let us first make the following simple observation. Observation A.9. FTRL and MD, and optimistic variants thereof, employed on a strategically zero-sum game with suitable learning rates are equivalent to the dynamics in a hidden zero-sum game. 
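Both Definition A.5 and Observation A.9 are easy to sanity-check numerically. The sketch below (Python) tests strategic equivalence against a zero-sum game by comparing best-response orderings at random mixed profiles; the instance it builds, a positive rescaling of C plus payoff terms that are constant in the deviating player's own strategy, is merely one simple way to generate strategically zero-sum games for illustration, and is not the decomposition of Theorem A.6.

```python
import numpy as np

rng = np.random.default_rng(0)

def simplex(d):
    p = rng.random(d)
    return p / p.sum()

def strategically_equivalent(A, B, C, D, trials=5000, tol=1e-9):
    """Randomized check of Definition A.5: the orderings induced by (A, B)
    and (C, D) over each player's strategies never strictly disagree."""
    n, m = A.shape
    for _ in range(trials):
        x, xp, y = simplex(n), simplex(n), simplex(m)
        if (x @ A @ y - xp @ A @ y) * (x @ C @ y - xp @ C @ y) < -tol:
            return False
        x, y, yp = simplex(n), simplex(m), simplex(m)
        if (x @ B @ y - x @ B @ yp) * (x @ D @ y - x @ D @ yp) < -tol:
            return False
    return True

C = rng.standard_normal((3, 3))
w, v = rng.standard_normal(3), rng.standard_normal(3)
A = 2.0 * C + np.outer(np.ones(3), w)   # adds w^T y: constant in x, orderings unchanged
B = -3.0 * C + np.outer(v, np.ones(3))  # adds v^T x: constant in y, orderings unchanged
print(strategically_equivalent(A, B, C, -C))  # expected: True
```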
Thus, this observation reassures us that OFTRL and OMD dynamics in strategically zero-sum games inherit all of the favorable last-iterate convergence guarantees known for zero-sum games. We are now ready to prove Proposition 3.2, the extended version of which is included below. Proposition A.10. Full Version of Proposition 3.2 For the following classes of games it holds that Pn i=1 Reg T i 0: 1. Two-player zero-sum games; 2. Polymatrix zero-sum games; 3. Constant-sum Polymatrix games; 4. Strategically zero-sum games;5 5See Remark A.7. On Last-Iterate Convergence Beyond Zero-Sum Games 5. Polymatrix strategically zero-sum games;5 6. Convex-concave (zero-sum) games; 7. Zero-sum games with objective f(x, y) such that min x X max y Y f(x, y) = max y Y min x X f(x, y); 8. Quasiconvex-quasiconcave (zero-sum) games; 9. Zero-sum stochastic games. Proof. We proceed separately for each class of games. For Item 1 let us denoted by A the matrix associated with the zero-sum game, so that player X is the minimizer and player Y is the maximizer . Further, let us denote by x := (x(1) + . . . x(T ))/T and y := (y(1) + . . . y(T ))/T the time average of the strategies employed by the two players respectively up to time T ℕ. Then, we have that Reg T X + Reg T Y = max x X t=1 x(t), Ay(t) + max y Y t=1 y(t), A x(t) = T max y Y y , A x min x X x , A y T y, A x x, A y = 0. For Item 2 let us denoted by Ni [n] the neighborhood of the node associated with each player i [n] so that the (expected) utility of player i [n] under a joint (mixed) strategy vector x := (x1, . . . , xn) can be expressed as P j Ni x i Ai,jxj, where Ai,j is the payoff matrix associated with the edge i j. Observe that since every edge corresponds to a zero-sum game, it holds that Ai,j = A j,i. Moreover, we let xi := (x(1) i + + x(T ) i )/T be the time average of player i s strategies up to time T. We have that i=1 Reg T i = 1 j Ni Ai,jx(t) j j Ni Ai,jx(t) j i=1 max x i Xi j Ni Ai,j xj j Ni Ai,jx(t) j i=1 max x i Xi j Ni Ai,j xj j Ni Ai,j xj where (9) and (10) follow from the fact that Ai,j = A j,i for any i = j. For Item 3 the proof is analogous to Item 2 using the assumption that i=1 ui(x) = j Ni Ai,jxj for any x = (x1, . . . , xn) Q i [n] (Ai), where C is some arbitrary constant. For Item 4 we let (A, B) be the underlying (bimatrix) strategically zero-sum game, with A, B ℝn m. We will use Theorem A.6 under the assumption that λ = µ; see Remark A.7. That is, we know that there exists a (compatible) matrix C and λ > 0 such that A = λC + va1 m and B = λC + 1nv b , where va ℝn and vb ℝm. Similarly to On Last-Iterate Convergence Beyond Zero-Sum Games the proof for Item 1, we have that Reg T X + Reg T Y = max x n t=1 x(t), Ay(t) + max y m t=1 y(t), B x(t) D x(t), Cy(t)E D x(t), va E (11) D y(t), C x(t)E D y(t), vb E (12) = T max x n {λ x , C y + x , va } x, va + max y m λ y , C x + y , vb y, vb T(λ x, C y λ y, C x ) = 0. where in (11) we used the fact that A = λC + va1 m and that 1 my(t) = 1 since y(t) m, and (12) follows given that B = λC + vb1 m and that 1 n x(t) = 1 since x(t) n. For Item 5 the proof follows analogously to Item 2 and Item 4; see Remark A.7. For Item 6 let f(x, y) be any convex-concave function; that is, f(x, y) is convex with respect to x X for any fixed y Y and concave with respect to y Y for any fixed x X. Moreover, let us assume that player X is the minimizer and player X the maximizer . 
We have that Reg T X + Reg T Y = min x X t=1 f(x , y(t)) t=1 f(x(t), y(t)) + max y Y t=1 f(x(t), y ) t=1 f(x(t), y(t)) t=1 f( x, y(t)) + t=1 f(x(t), y) Tf( x, y) + Tf( x, y) = 0, where the last inequality follows from Jensen s inequality, applicable since f(x, y) is convex-concave. For Item 7 we assume that X and Y are convex and compact sets. Then, we have that we have that Reg T X + Reg T Y T = max y Y t=1 f(x(t), y ) t=1 f(x , y(t)) max y Y min x X f(x , y ) min x X max y Y f(x , y ) = 0, (13) where (13) uses the following inequalities: f(x(1), y ) + + f(x(T ), y ) T min x X f(x , y ); f(x , y(1)) + + f(x , y(T )) T max y Y f(x , y ). For Item 8 it is assumed that f(x, y) is lower semicontinuous and quasiconvex with respect to x X for any fixed y Y, and upper semicontinuous and quasiconcave with respect to y Y for any fixed x X, where X and Y are convex and compact sets. By Sion s minimax theorem (Sion, 1958) we know that max y Y min x X f(x , y ) = min x X max y Y f(x , y ), and the conclusion follows from Item 7. For Item 9 the claim follows directly by Item 7 by virtue of Shapley s theorem (Shapley, 1953). On Last-Iterate Convergence Beyond Zero-Sum Games Remark A.11 (Duality Gap). Consider a function f : X Y ℝ. Weak duality implies that min x X max y Y f(x , y ) max y Y min x X f(x , y ). The value gap := minx X maxy Y f(x , y ) maxy Y minx X f(x , y ) 0 is called the duality gap of f. Analogously to the proof of Proposition 3.2 (Item 7) we may conclude that Reg T X + Reg T Y gap T. Now observe that if we relax the condition of Theorem 3.1 so that Pn i=1 Reg T i gap T, then for αi = α, βi = β, and γi = γ it follows that T X i=1 x(t) i x(t 1) i 2 1 2nα This observation, along with the argument of Theorem 3.4, imply that for suitable regularizers and for a sufficiently large T there will exist a joint strategy vector x(t) with t [T] which is an O( gap)-Nash equilibrium, as long as Pn i=1 Reg T i gap T, where gap > 0. In fact, this argument is analogous to that we provide for near-potential games in Theorem 4.10. Corollary 3.3 (Optimal Regret Bound). In the setting of Theorem 3.1 with αi = α, βi = β, and γi = γ, the individual regret of each player i [n] can be bounded as Reg T i α + 2n(n 1)αβ Proof. First of all, when γi = γ and αi = α for all i [n], the bound we obtained in Theorem 3.1 can be simplified as t=1 x(t) i x(t 1) i 2 1 2nα Moreover, the RVU property implies that the regret of each individual player i [n] can be bounded as Reg T i α + β t=1 u(t) i u(t 1) i 2 γ t=1 x(t) i x(t 1) i 2 1 α + (n 1)β X t=1 x(t) i x(t 1) i 2 1, (15) where we used Claim A.1 and the fact that γ > 0. As a result, plugging-in bound (14) (which follows from Theorem 3.1) to (15) concludes the proof. Next, we proceed with the proof of Theorem 3.4, the detailed version of which is given below. Theorem A.12 (Full Version of Theorem 3.4). Suppose that each player employs (OMD) with (i) pair of norms ( , ) such that x C x 1 and x C x for any x Xi; (ii) Gi-smooth regularizer Ri; and (iii) learning rate η C 4C (n 1). Moreover, suppose that the game is such that Pn i=1 Reg T i 0 for any T ℕ. Then, for any ϵ > 0, after ϵ2 Pn i=1 Ωi iterations there exists an iterate x(t) with t [T] which is an ϵ C + 2maxi [n] {GiΩ i} η approximate Nash equilibrium, where Ωi := supx,y Xi DRi(x, y) and Ω i := supx,y Xi x y . Proof. We will use the following refined RVU bound, extracted from (Syrgkanis et al., 2015, Proof of Theorem 18): Proposition A.13 (Syrgkanis et al., 2015). 
If a player i employs (OMD), it holds that t=1 u(t) i u(t 1) i 2 1 x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 . On Last-Iterate Convergence Beyond Zero-Sum Games Using the norm-equivalence bounds x C x 1 and x C x to the bound of Proposition A.13 yields that t=1 u(t) i u(t 1) i 2 1 t=1 x(t) i bx(t) i 2 1 + x(t) i bx(t 1) i 2 1 t=1 x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 Moreover, combining this bound with Claim A.1 implies that η + ηC2 (n 1) j =i x(t) j x(t 1) j 2 1 1 x(t) i bx(t) i 2 1 + x(t) i bx(t 1) i 2 1 x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 η + ηC2 (n 1) j =i x(t) j x(t 1) j 2 1 C2 t=1 x(t) i x(t 1) i 2 1 x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 , where we used the fact that t=1 x(t) i x(t 1) i 2 2 t=1 x(t) i bx(t) i 2 + 2 t=1 x(t) i bx(t 1) i 2 1, which in turn follows from Young s inequality. As a result, we may conclude that i=1 Reg T i η + ηC2 (n 1)2 C2 t=1 x(t) i x(t 1) i 2 1 x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 . Thus, for learning rate η C 4C (n 1) it follows that i=1 Reg T i 1 x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 , implying that T X x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 8 i=1 Ωi. (16) Now assume that for all t [T] it holds that Pn i=1 x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 > ϵ2. In this case, it follows from (16) that i=1 Ωi = T 8 Thus, for T > 8 ϵ2 Pn i=1 Ωi it must be the case that there exists t [T] such that x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 ϵ2. On Last-Iterate Convergence Beyond Zero-Sum Games In turn, this implies that for any i [n], (i) x(t) i bx(t) i ϵ; (ii) bx(t) i bx(t 1) i 2 2 x(t) i bx(t) i 2 + 2 x(t) i bx(t 1) i 2 2ϵ2 = bx(t) i bx(t 1) i 2ϵ. Finally, we show the following claim which will conclude the proof. Claim A.14. In the setting of Theorem A.12, suppose that v u u t x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 ϵ. Then, it follows that x(t) is an ϵ C + 2maxi [n] GiΩ i η approximate Nash equilibrium. Proof. Observe that the maximization problem associated with (OMD) can be expressed in the following variational inequality form: u(t) i 1 Ri(bx(t) i ) Ri(bx(t 1) i ) , bxi bx(t) i for any i [n]. Thus, it follows that u(t) i , bxi bx(t) i 1 η Ri(bx(t) i ) Ri(bx(t 1) i ), bxi bx(t) i η Ri(bx(t) i ) Ri(bx(t 1) i ) bxi bx(t) i (17) 2ϵGiΩ i η , (18) where (17) follows from the Cauchy-Schwarz inequality, and (18) uses the fact that Ri(bx(t) i ) Ri(bx(t 1) i ) Gi bx(t) i bx(t 1) i 2ϵGi, which follows from the smoothness assumption on the regularizer Ri. (18) also uses the notation Ω i := supx,y Xi x y . As a result, we have established that for any player i [n] it holds that for any bxi Xi, u(t) i , bx(t) i u(t) i , bxi 2ϵGiΩ i η . (19) Moreover, we also have that u(t) i , x(t) i bx(t) i u(t) i x(t) i bx(t) i ϵC , where we used the fact that x(t) i bx(t) i ϵ, and that u(t) i 1 (by the normalization hypothesis). Plugging-in the last bound to (19) gives us that x(t) i , u(t) i bx(t) i , u(t) i ϵC bxi, u(t) i ϵC 2ϵGiΩ i η , for any bxi Xi and player i [n]. As a result, the proof follows by definition of approximate Nash equilibria. Remark A.15 (Last-Iterate Convergence). Leveraging our argument in the proof of Theorem A.12 it follows that for any ϵ > 0 there exists a sufficiently large T = T(ϵ) so that x (t) i bx(t) i 1 ϵ and x(t) i bx(t 1) i ϵ , for any i [n] and t T. In turn, under smooth regularizers this implies that any x(t) with t T = T(ϵ) will be an O(ϵ)-approximate Nash equilibrium by virtue of Claim A.14, establishing last-iterate convergence. On the other hand, our techniques do not seem to imply pointwise convergence. 
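The stopping criterion used in the proof of Theorem A.12 is directly implementable: run (OMD), monitor the residual consisting of the squared distances between the primary and secondary iterates, and output the first iterate at which it falls below the prescribed threshold. A minimal two-player sketch with Euclidean regularization and one-step recency prediction (Python; the game, learning rate, and tolerance below are placeholders):

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1.0), 0.0)

def omd_until_residual(A, B, eta=0.05, eps=1e-3, T=10_000):
    """Players X and Y run (OMD) with Euclidean regularizer; A (resp. B) is
    the payoff matrix of player X (resp. Y). Returns the first iterate whose
    residual drops below eps**2, or the last iterate otherwise."""
    n, m = A.shape
    x_hat, y_hat = np.full(n, 1 / n), np.full(m, 1 / m)
    ux_prev, uy_prev = np.zeros(n), np.zeros(m)
    for t in range(1, T + 1):
        x = proj_simplex(x_hat + eta * ux_prev)        # play using the prediction
        y = proj_simplex(y_hat + eta * uy_prev)
        ux, uy = A @ y, B.T @ x                        # observed utility vectors u^(t)
        x_hat_new = proj_simplex(x_hat + eta * ux)     # secondary iterates \hat{x}^(t)
        y_hat_new = proj_simplex(y_hat + eta * uy)
        residual = (np.sum((x - x_hat_new) ** 2) + np.sum((x - x_hat) ** 2)
                    + np.sum((y - y_hat_new) ** 2) + np.sum((y - y_hat) ** 2))
        if residual <= eps ** 2:
            return t, x, y
        x_hat, y_hat = x_hat_new, y_hat_new
        ux_prev, uy_prev = ux, uy
    return T, x, y

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies; B = -A makes it zero-sum
t, x, y = omd_until_residual(A, -A)
print(t, x, y)
```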
On Last-Iterate Convergence Beyond Zero-Sum Games Next, we give the proof of Theorem 3.8, the detailed version of which is given in Theorem A.17. To this end, we will require the following proposition shown by Syrgkanis et al. (2015), which is a slight refinement of a result due to Roughgarden (2015). Proposition A.16 (Syrgkanis et al., 2015). Consider a (λ, µ)-smooth game. If each player i [n] incurs regret at most Reg T i , then t=1 SW(x(t)) λ 1 + µ OPT 1 1 + µ 1 T i=1 Reg T i . Theorem A.17 (Full Version of Theorem 3.8). Suppose that each player employs (OMD) with (i) pair of norms ( , ) such that x C x 1 and x C x for any x Xi; (ii) Gi-smooth regularizer Ri; and (iii) learning rate η C 4C (n 1). Moreover, fix any γ > 0 and consider T iterations of the dynamics with T 16 Pn i=1 Ωi γ2 , where Ωi := supx,y Xi DRi(x, y). Then, either of the following occurs: 1. There exists an iterate x(t) with t [T] which is γ C + 2maxi [n] GiΩ i η approximate Nash equilibrium; 2. Otherwise, it holds that t=1 SW(x(t)) λ 1 + µ OPT + γ2 Proof. Similarly to the proof of Theorem A.12, we may conclude that for η C 4C (n 1) it holds that i=1 Reg T i 1 x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 . (20) Now if there exists t [T] such that x(t) i bx(t) i 2 + x(t) i bx(t 1) i 2 γ2, it follows from Claim A.14 that x(t) is a γ C + 2maxi [n] GiΩ i η approximate Nash equilibrium. In the contrary case, we may conclude from (20) that i=1 Reg T i 1 as long as T 16 Pn i=1 Ωi γ2 . Thus, we may conclude from Proposition A.16 that t=1 SW(x(t)) λ 1 + µ OPT + 1 1 + µ γ2 On Last-Iterate Convergence Beyond Zero-Sum Games A.1 Smooth Convex-Concave Games In this subsection we explain how our framework can also be applied in the context of smooth min-max optimization. To be precise, let f(x, y) : X Y ℝbe a continuously differentiable convex-concave function; that is, f(x, y) is convex with respect to x X for any fixed y Y and concave with respect to y Y for any fixed x X. We will also make the following standard ℓ2-smoothness assumptions: xf(x, y) xf(x, y ) 2 L y y 2, x X, y, y Y; xf(x, y) xf(x , y) 2 L x x 2, x, x X, y Y; (21) yf(x, y) yf(x, y ) 2 L y y 2, x X, y, y Y; yf(x, y) yf(x , y) 2 L x x 2, x, x X, y Y. (22) For notational simplicity we are using a common smoothness parameter L in (21) and (22), but a more refined analysis follows directly from our techniques. The key idea is the well-known fact that one can construct a regret minimizer for convex utility functions via a regret minimizer for linear utilities (e.g., see (Mc Mahan, 2011)). Indeed, we claim that the incurred regret Reg T under convex utility functions can be bounded by the regret Reg L of an algorithm observing the tangent plane of the utility function at every decision point. To see this, let xu(t)(x(t)) be the gradient of a convex and continuously differentiable utility function u(t) on some convex domain X.6 Then, by convexity, we have that u(t)(x) u(t)(x(t)) + xu(t)(x(t)), x x(t) , x X. From this inequality it is easy to conclude that Reg T Reg T L, (23) as we claimed. Now let us assume, for concreteness, that each player employs (OMD) with Euclidean regularization, while observing the linearized utility functions based on the aforementioned scheme. Then, Proposition 2.3 implies that the individual regret of each player can be bounded as t=1 xf(x(t), y(t)) xf(x(t 1), y(t 1)) 2 2 1 t=1 x(t) x(t 1) 2 2; t=1 yf(x(t), y(t)) yf(x(t 1), y(t 1)) 2 2 1 t=1 y(t) y(t 1) 2 2, where ΩX and ΩY are defined as in Proposition 2.3. 
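In implementation terms, the bounds in (24) correspond to running optimistic gradient descent/ascent on the linearized utilities: each player feeds the current gradient of f to a linear regret minimizer. A minimal sketch over Euclidean balls (Python; the objective, dimensions, and parameters below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
P, R = np.eye(d), np.eye(d)                 # PSD blocks: f is convex-concave
Q = rng.standard_normal((d, d))
grad_x = lambda x, y: P @ x + Q @ y         # f = 0.5 x'Px + x'Qy - 0.5 y'Ry
grad_y = lambda x, y: Q.T @ x - R @ y
L = max(np.linalg.norm(P, 2), np.linalg.norm(R, 2), np.linalg.norm(Q, 2))
eta = 1.0 / (8.0 * L)

ball = lambda v: v / max(1.0, np.linalg.norm(v))   # projection onto the unit l2 ball
x_hat, y_hat = np.zeros(d), np.zeros(d)
gx_prev, gy_prev = np.zeros(d), np.zeros(d)
x_prev, y_prev = np.zeros(d), np.zeros(d)
path = 0.0
for t in range(2000):
    x = ball(x_hat - eta * gx_prev)         # X descends on the linearized (gradient) utility
    y = ball(y_hat + eta * gy_prev)         # Y ascends
    gx, gy = grad_x(x, y), grad_y(x, y)
    x_hat, y_hat = ball(x_hat - eta * gx), ball(y_hat + eta * gy)
    if t > 0:
        path += np.sum((x - x_prev) ** 2) + np.sum((y - y_prev) ** 2)
    x_prev, y_prev, gx_prev, gy_prev = x, y, gx, gy
print(path)   # the second-order path length remains bounded
```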
Now it follows from (21) and Young s inequality that xf(x(t), y(t)) xf(x(t 1), y(t 1)) 2 2 2 xf(x(t), y(t)) xf(x(t), y(t 1)) 2 2+ 2 xf(x(t), y(t 1)) xf(x(t 1), y(t 1)) 2 2 2L2 x(t) x(t 1) 2 2 + 2L2 y(t) y(t 1) 2 2. (25) Similarly, it follows from (22) and Young s inequality that yf(x(t), y(t)) yf(x(t 1), y(t 1)) 2 2 2 yf(x(t), y(t)) yf(x(t), y(t 1)) 2 2+ 2 yf(x(t), y(t 1)) yf(x(t 1), y(t 1)) 2 2 2L2 x(t) x(t 1) 2 2 + 2L2 y(t) y(t 1) 2 2. (26) 6The same technique applies more generally using any subgradient at the decision point. On Last-Iterate Convergence Beyond Zero-Sum Games As a result, plugging (25) and (26) to the RVU bound of (24) gives us that Reg X,L + Reg Y,L ΩX + ΩY t=1 x(t) x(t 1) 2 2 + 4ηL2 1 t=1 y(t) y(t 1) 2 2. (27) Letting η = 1 8L in (27) implies that Reg X,L + Reg Y,L 8L(ΩX + ΩY) L t=1 x(t) x(t 1) 2 2 L t=1 y(t) y(t 1) 2 2. (28) Moreover, we know from Proposition 3.2 that Reg T X + Reg T Y 0, in turn implying that Reg T X,L + Reg T Y,L 0 by virtue of (23). As a result, we may conclude from (28) the following theorem. Theorem A.18. Let f(x, y) be a continuously differentiable convex-concave function satisfying the L-smoothness condition of (21) and (22). If both players employ (OMD) with Euclidean regularization and η = 1 8L it holds that t=1 x(t) x(t 1) 2 2 + t=1 y(t) y(t 1) 2 2 16(ΩX + ΩY). This theorem combined with (24) directly gives us that every player incurs O(1) individual regret. Further, a bound on the number of iterations required to reach an approximate equilibrium can be derived similarly to Theorem 3.4. Remark A.19 (Unconstrained Setting). While our framework based on regret minimization and in particular Theorem A.18 requires ΩX and ΩY to be bounded, extensions are possible to the unconstrained setup as well. Indeed, as was pointed out by Golowich et al. (2020a, Remark 4), it suffices to use the fact that the iterates of (unconstrained) optimistic gradient descent remain within a bounded ball that depends only the initialization; the later property was shown in (Mokhtari et al., 2020a, Lemma 4). This can be combined with Theorem A.18 to bound the norm of the gradients Λ(t) := xf(x(t), y(t)) + yf(x(t), y(t)) at time t. More precisely, Theorem A.18 implies that after T iterations there exists time t [T] such that Λ(t) = O(1/ T); cf., see Golowich et al. (2020a). We point out that Λ(t) is, perhaps, the most common measure for last-iterate convergence in min-max optimization (Diakonikolas & Wang, 2021). Remark A.20 (Curvature Exploitation). The linearization trick we employed can be suboptimal as the regret minimization algorithm could fail to exploit the curvature of the utility functions; e.g., this would be the case if the objective function f is strongly-convex-strongly-concave. Extending our framework to address this issue is an interesting direction for the future. A.2 Bilinear Saddle-Point Problems In this subsection we show that our framework has direct implications for extensive-form games (EFGs). A comprehensive overview of extensive-form games would lead us beyond the scope of this paper. Instead, we will focus on two-player zero-sum games wherein the computation of a Nash equilibrium can be formulated as a bilinear saddle-point problem (BSPP). A BSPP can be expressed as min x X max y Y x Ay, where X and Y are convex and compact sets, and A ℝn m. In the case of EFGs, X and Y are the sequence-form strategy polytopes of the sequential decision process faced by the two players, and A is a matrix with the leaf payoffs of the game. 
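For a candidate pair (x, y), the natural equilibrium measure in a BSPP is the saddle-point gap $\max_{y' \in \mathcal{Y}} x^\top A y' - \min_{x' \in \mathcal{X}} (x')^\top A y$. The sketch below (Python) evaluates it when X and Y are simplices, which we use here only as a stand-in for the sequence-form polytopes; for actual EFG polytopes the inner best responses would instead be computed by a traversal of each player's sequential decision process.

```python
import numpy as np

def saddle_point_gap(A, x, y):
    """Gap of (x, y) for min_x max_y x^T A y with simplex strategy sets:
    best responses are attained at vertices, i.e., coordinates of A y and A^T x."""
    return float(np.max(A.T @ x) - np.min(A @ y))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))
x = np.full(4, 0.25)
y = np.full(5, 0.2)
print(saddle_point_gap(A, x, y))   # equals 0 exactly at a Nash equilibrium
```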
We first show the following. Proposition A.21. Suppose that both players in a BSPP employ (OMD) with Euclidean regularizer and η 1 4 A 2 , where A 2 is the spectral norm of A. Then, it holds that x(t) x(t 1) 2 2 + y(t) y(t 1) 2 2 16(ΩX + ΩY). On Last-Iterate Convergence Beyond Zero-Sum Games Proof. Given that player X employs (OMD) with Euclidean regularization, Proposition 2.3 implies that7 t=1 Ay(t) Ay(t 1) 2 2 1 t=1 x(t) x(t 1) 2 2 η + η A 2 2 t=1 y(t) y(t 1) 2 2 1 t=1 x(t) x(t 1) 2 2. (29) Similarly, we have that t=1 A x(t) A x(t 1) 2 2 1 t=1 y(t) y(t 1) 2 2 η + η A 2 2 t=1 x(t) x(t 1) 2 2 1 t=1 y(t) y(t 1) 2 2, (30) where we used the fact that A 2 = A 2. Thus, summing (29) and (30) yields that 0 Reg T X + Reg T Y ΩX + ΩY η + η A 2 2 1 t=1 x(t) x(t 1) 2 2 + η A 2 2 1 t=1 y(t) y(t 1) 2 2. t=1 x(t) x(t 1) 2 2 1 16η t=1 y(t) y(t 1) 2 2, where we used the fact that η 1 4 A 2 . A rearrangement of the final bound completes the proof. Moreover, we can employ the argument of Theorem 3.4 to also bound the number of iterations required to reach an ϵ-Nash equilibrium of the BSPP. In the following claim we use the O( ) notation to suppress (universal) constants. Proposition A.22 (Full Version of Proposition 3.5). Suppose that both players in a BSPP employ (OMD) with Euclidean regularization and η 1 4 A 2 . Then, after T = O ΩX +ΩY ϵ2 iterations there is a joint strategy (x(t), y(t)) with t [T] which is an O(ϵ max{ΩX , ΩY} A 2)-approximate Nash equilibrium of the BSPP. B. Proofs from Section 4 In this section we give complete proofs for our results in Section 4. We commence with the simple proof of Proposition 4.2, which is recalled below. Proposition 4.2. Let Γ be a game for which there exists a function Φ : Qn i=1 Ai ℝwith Φ(a) Φ(a i, a i) = wi(ui(a) ui(a i, a i)), where w ℝn >0. Then, Γ is a potential game in the sense of Definition 4.1 with L = 1 2 Φmax Pn i=1 |Ai|. Proof. By assumption, we know that there exists a (bounded) function Φ such that Φ(a) Φ(a i, a i) = wi(ui(a) ui(a i, a i)). (31) That is, Γ is a weighted potential game: the difference in utility deriving from a deviation of a player i [n] is translated to the exact same deviation in the potential function, modulo the scaling factor wi. We will first show that the condition of (31) can be translated for the mixed extensions as well. As is common, in the sequel we slightly abuse notation by using the same symbols for the mixed extensions of the potential function and the utility function of each player. 7Although Proposition 2.3 was only stated for simplexes by Syrgkanis et al. (2015), the proof readily extends for arbitrary convex and compact sets. On Last-Iterate Convergence Beyond Zero-Sum Games Claim B.1. For any (x1, . . . , xn) Q i [n] (Ai) it holds that Φ(x) Φ(x i, x i) = wi(ui(x) ui(x i, x i)), (32) where Φ(x) := 𝔼a x[Φ(a)] = P (a1,...,an) A Φ(a1, . . . , an) Qn i=1 xi(ai). Proof. We have that Φ(x) Φ(x i, x i) = X a A Φ(a)(xi(ai) x i(ai)) Y j =i xj(aj) ai Ai (xi(ai) x i(ai)) X a i A i Φ(ai, a i) Y j =i xj(aj) ai Ai (xi(ai) x i(ai)) X a i A i ui(ai, a i) Y j =i xj(aj) (33) j=1 xj(aj) X a A ui(a)x i(ai) k =i xk(ak) = wi(ui(x) ui(x i, x i)), where (33) follows from a rearrangement of the terms and the assumption of (31). Now observe that from Claim B.1 we may conclude that Φ(x) xi(ai) = wi ui(x) xi(ai) = wi xi(ai) a i A i ui(ai, a i) Y j =i xj(aj) = wiui(ai, x i). This verifies condition (4) from Definition 4.1. Thus, it remains to establish the smoothness of the potential, in the sense of (3). 
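Condition (4) is easy to verify numerically on a small instance. The sketch below (Python) builds a weighted potential game by taking $u_i := \Phi / w_i$ on a 2 x 3 action grid, forms the mixed extension of Φ by enumeration, and checks that each partial derivative $\partial \Phi(x)/\partial x_i(a_i)$ equals $w_i u_i(a_i, x_{-i})$; the specific potential table and weights are placeholders.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
Phi = rng.uniform(-1, 1, size=(2, 3))          # potential table Phi(a1, a2)
w = np.array([1.0, 2.5])                       # positive weights
u1 = Phi / w[0]                                # u_i(a) = Phi(a)/w_i yields a weighted potential game

def mixed_potential(x1, x2):
    return sum(Phi[a1, a2] * x1[a1] * x2[a2]
               for a1, a2 in product(range(2), range(3)))

x1 = np.array([0.3, 0.7])
x2 = np.array([0.2, 0.5, 0.3])

# analytic partial derivative w.r.t. x1(a1) versus w_1 * u_1(a1, x2)
for a1 in range(2):
    dPhi = sum(Phi[a1, a2] * x2[a2] for a2 in range(3))
    rhs = w[0] * sum(u1[a1, a2] * x2[a2] for a2 in range(3))
    print(a1, np.isclose(dPhi, rhs))           # True: condition (4) holds
```

Returning to the smoothness requirement in (3):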
To this end, we first recall the following simple fact. Fact B.2. Let g : ℝd ℝbe a twice continuously differentiable function such that the spectral norm of the Hessian 2g is upper bounded by some M > 0. Then, for any x, y ℝd it holds that g(x) g(y) 2 M x y 2. Hence, it suffices to bound the operator norm of the Hessian of Φ. This is shown in the following lemma, where we crucially leverage the multilinearity of the (mixed) potential function. Lemma B.3. It holds that 2Φ 2 Φmax Pn i=1 |Ai|. Proof. First of all, it is easy to see that for all i [n] and ai, a i Ai, it holds that 2Φ xi(ai) xi(a i) = 0. On the other hand, for i = j [n] and ai Ai and aj Aj it holds that 2Φ xi(ai) xj(aj) = X a i, j A i, j Φ(ai, aj, a i, j) Y k =i,j xk(ak); On Last-Iterate Convergence Beyond Zero-Sum Games here we used the notation a i, j := Q k =i,j ak and A i, j := Q k =i,j Ak. Thus, applying the triangle inequality yields that 2Φ xi(ai) xj(aj) a i, j Φ(ai, aj, a i, j) Y k =i,j xk(ak) k =i,j xk(ak) = Φmax, (34) where the final equality holds by the normalization constraint of the induced product distribution. Thus, for any z ℝd, where d = Pn i=1 |Ai|, we have that v u u td2 d X r=1 z2(r) = Φmax d z 2, where we used (34) in the first bound, and Jensen s inequality in the second one. As a result, we have shown that 2Φ 2 Φmax Pn i=1 |Ai|, concluding the proof. Finally, combining this lemma with Fact B.2 we arrive at the following corollary, which concludes the proof of Proposition 4.2. Corollary B.4. Let L = 1 2 Φmax Pn i=1 |Ai|. Then, for any x, ex Q i [n] (Ai) it holds that Φ(ex) Φ(x) + xΦ(x), ex x + L ex x 2 2; Φ(ex) Φ(x) xΦ(x), ex x + L ex x 2 2. Theorem 4.3. Suppose that each player employs (MD) with a 1-strongly convex regularizer with respect to 2, and learning rate η = 1 2L, where L is defined as in Proposition 4.2. Then, for any t 1 it holds that Φ(x(t+1)) Φ(x(t)) 1 i=1 x(t+1) i x(t) i 2 2 0. (5) Proof. First of all, since every regularizer Ri is 1-strongly convex with respect to 2, we obtain from Definition 4.1 that Φ(x) Φ(ex) xΦ(x), ex x + L i=1 exi xi 2 2 Φ(ex) xΦ(x), ex x + 2L i=1 DRi(exi, xi). (35) Moreover, we know from the update rule of (MD) that for any player i [n] it holds that x(t+1) i = arg max xi (Ai) D xiΦ(x(t)), xi x(t) i E 1 η DRi(xi, x(t) i ) , (36) where we used the fact that g(ui) = xiΦ(x) (Definition 4.1). Also, from (35) we obtain that Φ(x(t)) Φ(x(t+1)) D xΦ(x(t)), x(t+1) x(t)E + 2L i=1 DRi(x(t+1) i , x(t) i ). (37) Now the update rule of (36) implies that D xiΦ(x(t)), x(t+1) i x(t) i E 1 η DRi(x(t+1) i , x(t) i ) 1 2η x(t+1) i x(t) i 2 2, (38) On Last-Iterate Convergence Beyond Zero-Sum Games for all i [n], where we used the 1-strong convexity of Ri with respect to 2 (quadratic growth). As a result, summing (38) for all i [n] yields that D xΦ(x(t)), x(t+1) x(t)E 1 i=1 DRi(x(t+1) i , x(t) i ) 1 i=1 x(t+1) i x(t) i 2 2. Finally, plugging this bound with η = 1 2L to (37) and rearranging the terms concludes the proof. Corollary 4.5. In the setting of Theorem 4.3, with Ri := 1 2 x 2 2, it holds that the regret of each player i [n] is such that Reg T i = O( Proof. It is well-known (e.g., see (Shalev-Shwartz, 2012)) that the cumulative regret Reg T of (MD) can be bounded as t=1 u(t) x(t) x(t 1) , (39) where Ω:= supx,y X DR(x, y), and u(1), . . . , u(t) represents the sequence of utility vectors observed by the regret minimizer. Note that with a conventional abuse of notation, it is assumed that x(0) := 0 in (39). 
Now we know from Corollary 4.4 that for η = 1 2L it holds that t=2 x(t) x(t 1) 2 2 t=2 x(t) x(t 1) 2 2 2 Φmax Thus, an application of the Cauchy-Schwarz inequality implies that t=2 x(t) x(t 1) 2 v u u t(T 1) t=2 x(t) x(t 1) 2 2 Finally, from (39) it follows that for any player i [n] it holds that t=1 x(t) x(t 1) 2 Ωi where we used the notation ui 2 := maxt [T ] u(t) i 2. This concludes the proof. Theorem 4.6 (Optimal Regret for Potential Games). Suppose that each player employs OMWU with a sufficiently small learning rate η > 0. Then, the regret of each player i [n] is such that Reg T i = O(1). Proof. First of all, it is well-known that OMWU on the simplex can be expressed with the following update rule for t 1: x(t+1) i (ai) = exp 2ηu(t) i (ai) ηu(t 1) i (ai) a i Ai exp 2ηu(t) i (a i) ηu(t 1) i (a i) x(t) i (a i) x(t) i (ai), (OMWU) for any action ai Ai and player i [n]. By convention, x(0) i and x(1) i are initialized from the uniform distribution over Ai, for all i [n]. Now we claim that the update rule of (OMWU) is equivalent to x(t+1) i = arg max xi (Ai) D 2 xiΦ(x(t)) xiΦ(x(t 1)), xi x(t) i E 1 η DRi(xi, x(t) i ) , assuming that Ri is the negative entropy DGF for all i [n]. By 1-strong convexity of Ri, this implies that D 2 xiΦ(x(t)) xiΦ(x(t 1)), x(t+1) i x(t) i E 1 η DRi(x(t+1) i , x(t) i ) 1 2η x(t+1) i x(t) i 2 2. (41) On Last-Iterate Convergence Beyond Zero-Sum Games Moreover, we have that D xiΦ(x(t)) xiΦ(x(t 1)), x(t+1) i x(t) i E xiΦ(x(t)) xiΦ(x(t 1)) 2 x(t+1) i x(t) i 2 (42) = u(t) i u(t 1) i 2 x(t+1) i x(t) i 2 (43) |Ai| u(t) i u(t 1) i x(t+1) i x(t) i 2 (44) j =i x(t) j x(t 1) j 2 x(t+1) i x(t) i 2 (45) x(t) j x(t 1) j 2 2 + x(t+1) i x(t) i 2 2 (46) (n 1) x(t+1) i x(t) i 2 2 + X j =i x(t) j x(t 1) j 2 2 where (42) follows from Cauchy-Schwarz; (43) is a consequence of Definition 4.1; (44) is derived from H older s inequality (equivalence of norms); (45) uses Claim A.1; and (46) is an application of Young s inequality. Thus, combining this bound to (41) yields that D xiΦ(x(t)), x(t+1) i x(t) i E 1 η DRi(x(t+1) i , x(t) i ) 1 |Ai|(n 1) x(t+1) i x(t) i 2 2 j =i x(t) j x(t 1) j 2 2. Summing these inequalities for all i [n] gives us that D xΦ(x(t)), x(t+1) x(t)E 1 i=1 DRi(x(t+1) i , x(t) i ) |Ai|(n 1) x(t+1) i x(t) i 2 2 1 x(t) i x(t 1) i 2 2. Thus, from (35) for η 1 2L we conclude that Φ(x(t+1)) Φ(x(t)) |Ai|(n 1) x(t+1) i x(t) i 2 2 1 x(t) i x(t 1) i 2 2 As a result, as long as for all i [n], the previous bound implies that Φ(x(t+1)) Φ(x(t)) 4η x(t+1) i x(t) i 2 2 1 x(t) i x(t 1) i 2 2 Thus, a telescopic summation leads to the following conclusion: Theorem B.5. Suppose that each player i [n] employs (OMWU) with learning rate η > 0 such that |Ai|(n 1) , 1 4 P On Last-Iterate Convergence Beyond Zero-Sum Games for all i [n], where L is defined as in Proposition 4.2. Then, it holds that t=1 x(t+1) i x(t) i 2 2 Φ(x(T )) Φ(x(1)) 2 Φmax . Finally, Theorem B.5 along with the RVU bound (which holds for (OMWU) as specified in Proposition 2.3) and Claim A.1 conclude the proof of the theorem. Next, we proceed with the proof of Theorem 4.7, the detailed version of which is given below. Theorem B.6 (Full Version of Theorem 4.7). Suppose that each player i employs (MD) with Ri(x) such that Ri is G-Lipschitz continuous with respect to the ℓ2-norm, and η = 1 2L, where L is defined as in Proposition 4.2. Then, after O(1/ϵ2) iterations there exists a joint strategy x(t) which is an ϵ-approximate Nash equilibrium. For the proof of this theorem we will use the following simple claim. Claim B.7. 
Suppose that each player employs (MD) with learning rate η > 0 and regularizer Ri such that Ri is G-Lipschitz continuous with respect to the ℓ2-norm. If x(t+1) x(t) 2 ϵ, it holds that x(t) is an approximate Nash equilibrium, where Ω:= maxi [n] supxi,x i (Ai) xi x i 2, and |A| := maxi [n] |Ai|. Proof. The argument proceeds similarly to the proof of Theorem A.12. First, observe that for each i [n] it holds that x(t+1) i x(t) i 2 ϵ. By definition of (MD) we have that x(t+1) i = arg max xi (Ai) xi, u(t) i 1 η DRi(xi, x(t) i ) . This maximization problem can be equivalently expressed in the following variational inequality form: u(t) i 1 Ri(x(t+1) i ) Ri(x(t) i ) , bxi x(t+1) i 0, bxi (Ai), for any i [n]. Thus, it follows that u(t) i , bxi x(t+1) i 1 η Ri(x(t+1) i ) Ri(x(t) i ), bxi x(t+1) i η Ri(x(t+1) i ) Ri(x(t) i ) 2 bxi x(t+1) i 2 (47) where (47) follows from the Cauchy-Schwarz inequality, and (48) uses the fact that Ri(x(t+1) i ) Ri(x(t) i ) 2 G x(t+1) i x(t) i 2 ϵG, which follows from the assumption that Ri is G-Lipschitz continuous. Also note that (48) uses the notation Ωi := supxi,x i (Ai) xi x i 2. As a result, we have established that for any player i [n] it holds that for any bxi (Ai), u(t) i , x(t+1) i u(t) i , bxi ϵGΩi Moreover, it also follows that u(t) i , x(t+1) i x(t) i u(t) i 2 x(t+1) i x(t) i 2 p On Last-Iterate Convergence Beyond Zero-Sum Games where we used the fact that x(t+1) i x(t) i 2 ϵ, and that u(t) i 1 (by the normalization hypothesis). Plugging the last bound to (49) gives us that u(t) i , x(t) i u(t) i , x(t+1) i ϵ p |Ai| u(t) i , bxi ϵGΩi for any bxi (Ai) and player i [n]. Finally, the proof follows by definition of approximate Nash equilibrium. Proof of Theorem B.6. Suppose that x(t+1) x(t) 2 > ϵ for all t [T]. Corollary 4.4 implies that t=1 x(t+1) x(t) 2 2 (T 1)ϵ2 = T 4η Φmax ϵ2 + 1. (51) Hence, for T > l 4η Φmax ϵ2 m + 1 it must be the case that there exists t [T] such that x(t+1) x(t) 2 ϵ. Thus, the theorem follows directly from Claim B.7. Proposition 4.8. In the setting of Theorem 4.3, if the potential function Φ is also concave it holds that Φ(x ) Φ(x(T +1)) 2L i=1 DRi(x i , x(1) i ). Proof. The proof proceeds similarly to that in (Birnbaum et al., 2011, Lemma 4). We will first require an auxiliary lemma regarding the following optimization problem: maximize g(x) DR(x, y); subject to x C, (52) where g is a concave function on a convex and compact domain C. Lemma B.8 (e.g., see (Chen & Teboulle, 1993)). Let bx be the optimal solution to the optimization problem (52). Then, it holds that g(x) DR(x, y) g(bx) DR(bx, y) DR(x, bx). Next, we will apply this lemma to conclude that for all i [n] it holds that η xiΦ(x(t)), x i x(t) i DRi(x i , x(t) i ) η xiΦ(x(t)), x(t+1) i x(t) i DRi(x(t+1) i , x(t) i ) DRi(x i , x(t+1) i ). (53) Moreover, from (35) it follows that Φ(x(t+1)) Φ(x(t)) xΦ(x(t)), x(t+1) x(t) + 2L i=1 DRi(x(t+1) i , x(t) i ) Φ(x(t)) xΦ(x(t)), x x(t) + 2L i=1 DRi(x , x(t) i ) 2L i=1 DRi(x i , x(t+1) i ) (54) i=1 DRi(x , x(t) i ) 2L i=1 DRi(x i , x(t+1) i ), (55) where (54) follows from (53) for η = 1 2L, and (55) follows by concavity of Φ. As a result, summing (55) for all t [T] and removing the telescopic terms yields that Φ(x ) Φ(x(t+1)) 2L i=1 DRi(x i , x(1) i ) 2L i=1 DRi(x i , x(T +1) i ) 2L i=1 DRi(x i , x(1) i ), (56) where in the last inequality we used the well-known fact that DRi( , ) 0. Finally, Theorem 4.3 implies that t=1 Φ(x ) Φ(x(t+1)) TΦ(x ) TΦ(x(T +1)), and plugging this bound to (56) concludes the proof after a rearrangement. 
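The monotonicity guarantee of Theorem 4.3 is easy to observe numerically. In the sketch below (Python), two players run projected gradient ascent, i.e. (MD) with the Euclidean regularizer, on an identical-interest game with potential Φ, using $\eta = 1/(2L)$ with $L = \tfrac{1}{2}\Phi_{\max}\sum_i |\mathcal{A}_i|$ as in Proposition 4.2; the potential table is a placeholder.

```python
import numpy as np

def proj_simplex(v):
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1.0), 0.0)

rng = np.random.default_rng(0)
Phi = rng.uniform(-1, 1, size=(3, 3))          # identical-interest game: u_1 = u_2 = Phi
L = 0.5 * np.abs(Phi).max() * (3 + 3)          # L = (1/2) * Phi_max * sum_i |A_i|
eta = 1.0 / (2.0 * L)

x1 = proj_simplex(rng.random(3))
x2 = proj_simplex(rng.random(3))
values = []
for t in range(500):
    values.append(x1 @ Phi @ x2)               # mixed potential Phi(x)
    g1 = Phi @ x2                              # grad of Phi w.r.t. x1 = u_1(a_1, x_2)
    g2 = Phi.T @ x1
    x1, x2 = proj_simplex(x1 + eta * g1), proj_simplex(x2 + eta * g2)
print(min(np.diff(values)))                    # >= 0 (up to floating-point error): Phi never decreases
```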
On Last-Iterate Convergence Beyond Zero-Sum Games B.1 Near-Potential Games In this section we extend some of the results established for (weighted) potential games to near-potential games in the sense of Candogan et al. (2013) by proving Theorem 4.10. Intuitively, a game Γ is said to be near-potential if it is close to some potential game. While there are many natural ways to measure the distance between two games, here we follow the one suggested by Candogan et al. (2013); namely, the maximum pairwise difference: Definition B.9 (Maximum Pairwise Difference (MPD); (Candogan et al., 2013)). Let Γ and bΓ be two (normal-form) games. The maximum pairwise difference between Γ and bΓ is defined as d(Γ, bΓ) := sup i [n],a A |(ui(ai, a i) ui(a i, a i)) (bui(ai, a i) bui(a i, a i))|, where ui : Q i [n] Ai [ 1, 1] and bui : Q i [n] Ai [ 1, 1] are the utility functions associated with player i for Γ and bΓ. This definition tacitly posits that the games are compatible in the sense that the set of actions available to each player coincide. MPD captures the difference between two games in terms of the maximum possible utility improvement through unilateral deviations. Different distance measures can be considered without qualitatively altering the rest of the analysis. Armed with Definition B.9, we are ready to introduce the concept of a near-potential game. Definition B.10 (Near-Potential Game; (Candogan et al., 2013)). A game Γ is δ-near-potential if there exists a (compatible) potential game bΓ such that d(Γ, bΓ) δ. We remark that (Candogan et al., 2010; 2011) have devised a framework for efficiently finding the nearest potential game to a given game when the distance is measured in terms of the MPD. Specifically, they show that this can be formulated as a convex optimization problem (Candogan et al., 2010; 2011). Nevertheless, it is clear that following a given update rule in a game Γ does not require any sort of knowledge regarding the closest potential game bΓ; the potential function of bΓ will only be used for the purpose of the analysis. We begin by pointing out the following simple claim. Claim B.11. Let Γ be any δ-near-potential game with utilities ui : Q i [n] Ai [ 1, 1], for all i [n]. Moreover, let Φ be the potential function of a game bΓ such that d(Γ, bΓ) = δ. Then, Φ(x) xi(ai) = ui(ai, x i) + ei(ai; x i), (57) where |ei(ai, x i)| = O(δ), for any i [n], ai Ai, and x i Q Now we are ready to prove Theorem 4.10, the full version of which is given below. Theorem B.12 (Full Version of Theorem 4.10). Consider a δ-nearly-potential game wherein every player employs (MD) with learning rate η = 1 2L, where L is a parameter associated with the closest potential game, and regularizer Ri which is 1-strongly convex and G-smooth with respect to 2. Then, there exists a bounded potential function Φ which increases as long as x(t) is not an O( δ)-approximate Nash equilibrium. Proof. Since players employ (MD), we know that the update rule of each player i [n] takes the form x(t+1) i = arg max xi (Ai) D u(t) i , xi x(t) i E 1 η DRi(xi, x(t) i ) . By the 1-strong convexity of Ri with respect to 2, this implies that D u(t) i , x(t+1) i x(t) i E 1 η DRi(x(t+1) i , x(t) i ) 1 2η x(t+1) i x(t) i 2 2. (58) However, by Claim B.11 we also know that u(t) i = xiΦ(x(t)) + e(t) i , On Last-Iterate Convergence Beyond Zero-Sum Games where Φ is the potential function of the potential game within δ distance from the original game, and e(t) i is a vector such that e(t) i = Cδ, for some parameter C > 0. 
Thus, combing this fact with (58) yields that D xiΦ(x(t)), x(t+1) i x(t) i E 1 η DRi(x(t+1) i , x(t) i ) 1 2η x(t+1) i x(t) i 2 2 e(t) i , x(t+1) i x(t) i 2η x(t+1) i x(t) i 2 2 e(t) i x(t+1) i x(t) i 1 2η x(t+1) i x(t) i 2 2 2Cδ, (59) where we used the fact that the ℓ1 diameter of (Ai) is 2 in the last derivation. Summing (59) for all i [n] gives us that D xΦ(x(t)), x(t+1) x(t)E 1 i=1 DRi(x(t+1) i , x(t) i ) 1 i=1 x(t+1) i x(t) i 2 2 2Cnδ. Thus, using the smoothness condition of (35) with η = 1 2L implies that Φ(x(t+1)) Φ(x(t)) 1 i=1 x(t+1) i x(t) i 2 2 2Cnδ. As a result, we conclude that Φ increases as long as x(t+1) x(t) 2 2 p and the theorem follows directly from Claim B.7. B.2 Fisher Markets In this subsection we illustrate how the framework we developed in Section 4 unifies the work of Birnbaum et al. (2011) related to distributed dynamics in Fisher s classical market model. While the following exposition focuses on the linear counterpart of Fisher s model, we refer the interested reader to the work of Birnbaum et al. (2011) for an elegant extension to the spending constraints model. Regarding the motivation for studying distributed dynamics in markets, let us quote from the work of Birnbaum et al. (2011): Algorithmic results in a centralized model of computation do not directly address the question of market dynamics: how might agents interacting in a market arrive at an equilibrium? Here, the quest is for simple and distributed algorithms that are guaranteed to converge fast. Such distributed algorithms are especially applicable when the agents involved are automated, and one has to prescribe a particular protocol for them to follow In this context, we commence by recalling the underlying model. The exposition in the sequel follows that in (Birnbaum et al., 2011). In the linear Fisher s market model there are n agents (bidders) and m (infinitely) divisible goods. It is assumed without any loss of generality that there is a unit supply from each good. Every agent has an overall budget Bi, circumscribing its buying power. The goal of each agent i [n] is to maximize its own utility, which, for a given allocation vector x ℝn m 0 , is defined as P j xi(j)ui(j), where ui(j) represents the utility of that agent for a unit of good j. Equilibrium Conditions Consider an allocation vector x ℝn m 0 and a price vector p ℝm 0. The pair (x, p) is said to be an equilibrium if the following conditions are met. 1. Buyer optimality: Each agent i [n] maximizes its own utility subject to the budget constraints. Formally, for the given price vector p, it has to be that the allocation vector x maximizes the following (linear) program: j [m] ui(j)xi(j), subject to X j [m] p(j)xi(j) Bi; xi(j) 0, j [m]. On Last-Iterate Convergence Beyond Zero-Sum Games 2. Market clearance: It must be the case that P i [n] xi(j) = 1, for all j [m]. Convex Formulation of Equilibria Equilibria in linear Fisher markets can be determined via the seminal Eisenberg-Gale convex program (Eisenberg & Gale, 1959). As in (Birnbaum et al., 2011), here we focus on an alternative formulation due to Shmyrev (2009), which is given below. i,j bi(j) log ui(j) X j [m] p(j) log p(j), subject to X i [n] bi(j) = p(j), j [m]; j [m] bi(j) = Bi, [n]; bi(j) 0, i [n], j [m]. In this convex program each variable bi(j) represents the amount spent by agent i on good j, so that for a given solution to (60), the induced allocation xi(j) is given by bi(j)/p(j). Observe that for any feasible solution to (60) the markets always clears in the sense of Item 2. 
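The quantities appearing in (60) are straightforward to compute from a feasible bid matrix. The sketch below (Python) takes bids b whose rows sum to the budgets, forms the prices $p(j) = \sum_i b_i(j)$, the induced allocation $x_i(j) = b_i(j)/p(j)$, and the objective of (60), and checks market clearance; the utilities, budgets, and bids are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_goods = 3, 4
U = rng.uniform(0.5, 2.0, size=(n_agents, n_goods))   # u_i(j) > 0
budgets = np.ones(n_agents)                           # B_i = 1, as in the text

# a feasible (not necessarily optimal) bid matrix: rows sum to the budgets
b = rng.random((n_agents, n_goods))
b = b / b.sum(axis=1, keepdims=True) * budgets[:, None]

p = b.sum(axis=0)                                     # prices p(j) = sum_i b_i(j)
x = b / p                                             # allocations x_i(j) = b_i(j) / p(j)
objective = np.sum(b * np.log(U)) - np.sum(p * np.log(p))   # objective of (60)

print(np.allclose(x.sum(axis=0), 1.0))                # market clearance: sum_i x_i(j) = 1
print(objective)
```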
Moreover, Shmyrev (2009) showed that the optimal solution of (60) is such that each buyer is allocated an optimal bundle of goods, guaranteeing both equilibrium conditions for the associated Fisher market. In the sequel, and for the sake of simplicity, we will let Bi = 1. In this context, we will relate linear Fisher markets with the setting we presented in Section 4. To this end, let Φ(b) be the objective function of (60): i,j bi(j) log ui(j) X j [m] p(j) log p(j) = X i,j bi(j) log ui(j) where we used the fact that p(j) = P i [n] bi(j), which in turn follows by the feasibility constraints of (60). Now a direct calculation shows that for any i [n], each component j [m] on the corresponding gradient is such that ( Φ(b))i,j = log ui(j) Thus, observe that Φ is not always smooth on the feasible region j [m] bi(j) = Bi, i [n], bi(j) 0, (i, j) [n] [m] Nevertheless, (Birnbaum et al., 2011, Lemma 7) establishes that Φ satisfies the weaker one-sided smoothness condition of Definition 4.1 with respect to the Kullback-Leibler divergence. Moreover, let g : x 7 log x 1 be a monotone transformation, so that g 1 (( Φ(b))i,j) = ui(j) p(j) = ui(j)xi(j) bi(j) . (61) But the latter expression can be thought of as the utility that agent i received from item j per fraction of budget invested to j. As a result, if every agent employs (MD) with negative entropy DGF, monotone transformation g : x 7 log x 1, and utility vector the feedback suggested by Equation (61), it can be shown that bids are updated using the following update rule: bi(j) = ui(j)xi(j) P k [m] ui(k)xi(k)Bi. (PR) These dynamics are known us Proportional Response, and they were introduced by Wu & Zhang (2007) (see also (Zhang, 2011)). (PR) can be seen as typical multiplicative weights update after applying the monotone transformation g. In this way, these distributed dynamics are tantamount to optimizing Smyrev s convex program (60), which converges to an equilibrium with O(1/T) rate by concavity of Φ; see Proposition 4.8. An interesting direction would be to incorporate into this framework further and more general market models such as the Arrow-Debreu (exchange) version; e.g., see (Panageas et al., 2021) for some recent developments related in spirit to the approach we presented in this section. On Last-Iterate Convergence Beyond Zero-Sum Games C. Proofs from Section 5 In this section we provide the proofs omitted from Section 5. From a technical standpoint we will rely on a fundamental tool from signal processing and control theory; namely, the Z-transform. Recall that the (bilateral) Z-transform of a sequence (x(t)) in ℝd is defined as X(z) := Z n x(t)o = t= x(t)z t, (62) where z ℂ is assumed to be in the region of convergence: t= x(t)z t < Observe that we define the Z-transform coordinate-wise. Our analysis will leverage the following well-known properties which follow directly from the definition of (62). Property C.1 (Linearity). Let (x(t)) t= and (y(t)) t= be sequences in ℝd. Then, Z n x(t) + y(t)o = Z n x(t)o + Z n y(t)o . Property C.2 (Time-Delay Property). Let (x(t)) t= be a sequence in ℝd, and some t0 ℝ. Then, it holds that Z n x(t t0)o = z t0Z n x(t)o . We are now ready to prove Theorem 5.1, the detailed version of which is given below. Theorem C.3 (Detailed Version of Theorem 5.1). Let A and B be square and full-rank d d matrices, and γ := ρ(A B), where ρ( ) denotes the spectral radius. 
If the matrix A B has strictly negative (real) eigenvalues, it holds that for any learning rate η 1 2 γ (OGD) converges with linear rate to an equilibrium. Proof. First, observe that x(x Ay) = Ay and y(x By) = B x. Thus, (OGD) can be expressed as the following linear dynamical system: x(t+1) = x(t) + 2ηAy(t) ηAy(t 1); y(t+1) = y(t) + 2ηB x(t) ηB x(t 1). (63) To analyze its convergence properties, we will transfer (63) to the z-space. To this end, let X(z) and Y (z) be the Ztransform of the sequence (x(t)) and (y(t)) respectively; here it is tacitly assumed that z belongs to the region of convergence. Thus, using linearity (Property C.1) and the time-delay property (Property C.2), it follows from (63) that z X(z) = X(z) + 2ηAY (z) ηz 1AY (z); z Y = Y (z) + 2ηB X(z) ηz 1B X(z). (64) Note that we also used the fact that Z Ay(t) = AY (z) and Z B x(t) = B X(z), which follow from linearity of the Z-transform. In this way, we have transferred the difference equation (63) to the algebraic equation (64). From the latter algebraic system, we may uncouple these equations to conclude that z2(z 1)2X(z) = (2ηz η)Az(z 1)Y (z) = (2ηz η)A(2ηz η)B X(z) = η2(2z 1)2AB X(z); z2(z 1)2Y (z) = (2ηz η)B z(z 1)X(z) = (2ηz η)B (2ηz η)AY (z) = η2(2z 1)2B AY (z). (65) As a result, we have derived the characteristic equation of the dynamical system (63): On Last-Iterate Convergence Beyond Zero-Sum Games Proposition C.4. The characteristic equation of (63) can be expressed as det(Id(z(z 1))2 η2(2z 1)2A B) = 0. (66) In particular, if χ(z) is the characteristic polynomial of matrix A B, it follows that z ℂ satisfies (66) if and only if Note that we used the fact that the matrices AB and B A have identical spectrum, which in turn follows since A and B are assumed to be non-singular, as well as the property det(M) = det(M ) in order to derive (66) from (65). To see the second part of the claim in Proposition C.4, first observe that z = 1 2 is never a solution to (66). Thus, from the multilinearity of the determinant, it holds that det(Id(z(z 1))2 η2(2z 1)2A B) = 0 det As a result, the equivalence asserted in (67) follows by recalling that χ(λ) = det(λId A B) since χ(λ) is the characteristic polynomial of matrix A B. Having derived the characteristic equation of the dynamical system (Proposition C.4), it remains to derive its solutions. To do this, let λ be such that χ( λ) = 0; by assumption, we know that λ ℝ>0. Now this eigenvalue induces solutions of the following form: 2 = λ z(z 1) ( z2 + z( 1 2η λj = 0; z2 + z( 1 + 2η λj = 0. (68) where j ℂdenotes the imaginary unit. Now observe that z2 + z( 1 2η λj = 0 z2 + z( 1 + 2η λj = 0, where z denotes the complex conjugate of z. Hence, it suffices to solve only the first equation since their solutions are equivalent in terms of magnitude in particular, they are complex conjugates. Thus, we obtain the following solutions: 1 4η2λ 2 . (69) Moreover, by assumption we know that η 1 2 γ , which in turn implies that η 1 2 λ 1 4η2λ 0; this holds by definition of γ := ρ(A B). Thus, it follows that 1 4η2λ)2 + (2η 1 4η2λ < 1; 1 4η2λ)2 + (2η 1 4η2λ < 1. This implies that all of the solutions to the characteristic equation (67) lie within the unit circle on the complex plane. Thus, the theorem follows from the fundamental theorem of linear dynamical systems. Proposition 5.2. 
For any sufficiently large R > 0, there exist games such that (OGD) converges under any initialization to an equilibrium (x( ), y( )) such that SW(x( ), y( )) = 0, while there exist an equilibrium (x , y ) with SW(x , y ) 2R2. Proof. We consider the game described with the following matrices: A := 1 2 1 1 ; B := 1 1 1 1 We also assume that the constraints sets are such that X = B1(0, R) and Y = B1(0, R), for a sufficiently large radius R > 0. Here B1 denotes the closed ℓ1-ball; we only use the ℓ1-norm for mathematical convenience. We will first consider On Last-Iterate Convergence Beyond Zero-Sum Games the unconstrained dynamics. In particular, we will show that the condition of Theorem 5.1 is met. Indeed, a direct calculation reveals that A B = 0 2 1 3 Now the characteristic equation of A B reads λ(λ + 3) + 2 = 0 λ1 = 2, λ2 = 1. Thus, A B has negative (real) eigenvalues. As a result, Theorem 5.1 implies that for a sufficiently small learning rate (OGD) converges under any initialization assuming unconstrained domains (see Figure 8). Moreover, the following claim characterizing its limit points is immediate. Claim C.5. Suppose that (63) converges to a point (x( ), y( )). Then, it holds that Ay( ) = 0 and B x( ) = 0. In particular, given that the matrices A and B are full-rank, this claim implies that x( ) = 0 and y( ) = 0. Moreover, for a sufficiently large R > 0 we know that the projected dynamics on X and Y (respectively) will coincide with the unconstrained dynamics. As a result, we have shown that (OGD) constrained on X and Y will converge to (0, 0), which is clearly an equilibrium point. However, there exists a much more efficient equilibrium: Claim C.6. Let x = (R, 0) B1(0, R) and y = (R, 0) B1(0, R). Then, (x , y ) is an equilibrium of the game (70). Proof. When player Y plays y it follows that Ay = (R, R). Thus, it follows that x Ay x 1 Ay R2 = (x ) Ay , for any x B1(0, R). Thus, x is indeed a best response to y . Similarly, when player X plays x it follows that B x = (R, R). Thus, we conclude that (x ) By y 1 B x R2 = (x ) By , for any y B1(0, R). This means that y is indeed a best response to x , verifying our claim. Finally, we have that SW(x , y ) = (x ) Ay + (x ) By = 2R2. This concludes the proof. Proposition 5.3. For any ϵ > 0 there exists a game (A, B) with A + B F ϵ for which (OGD) diverges, while the dynamics converge for the game (A, A). Proof. Fix any ϵ > 0. We consider the game (A, B) described with the following payoff matrices: A := 1 0 0 ϵ/2 ; B := 1 0 0 ϵ/2 Observe that A + B F = ϵ.8 Let us first argue about convergence of (OGD) in the game (A, A). To this end, observe that the matrix A A has only positive eigenvalues. Thus, we know from Theorem 5.1 that (OGD) converges for η 1 2 since the spectral radius of A A is 1. On the other hand, for the game (A, B) it holds that the matrix A B has a positive eigenvalue; namely, λ = ϵ2/4. But from Proposition C.4 it can be shown that this implies that the characteristic equation of the associated dynamical system has a solution z ℂ with |z| > 1. In turn, this implies that (OGD) will diverge under any non-trivial initialization; see Figure 9 for a graphical illustration and a discussion about this phenomenon. Next, we proceed with the proof of Theorem 5.4, the detailed version of which is recalled below. Theorem C.7 (Detailed Version of Theorem 5.4). Let {A1,j}n j=2 and {Aj,1}n j=2 be square matrices such that det(M) = 0, where M := P j =1 A1,j Aj,1. Moreover, let γ := ρ(A), where ρ( ) denotes the spectral radius. 
If M has strictly negative (real) eigenvalues, it holds that for any learning rate η 1 2 γ (OGD) converges with linear rate to an equilibrium. Proof. First of all, recall that we have that u1(x) = P j =1 x 1 A1,jxj, while uj(x) = x j Aj,1x1 for j = 1. Thus, it follows that x1u1(x) = P j =1 A1,jxj, and xjuj(x) = Aj,1x1 for j = 1. As a result, (OGD) can be cast as x(t+1) 1 = x(t) 1 + 2η X j =1 A1,jx(t) j η X j =1 A1,jx(t 1) j ; x(t+1) j = x(t) j + 2ηAj,1x(t) 1 ηAj,1x(t 1) 1 , j = 1. 8Recall that the Frobenius norm of a matrix M is defined as M F := (M) 2, where (M) is the standard vectorization of M. On Last-Iterate Convergence Beyond Zero-Sum Games Transferring these dynamics to the z-space yields that z X1(z) = X1(z) + 2η X j =1 A1,j Xj(z) ηz 1 X j =1 A1,j Xj(z); z Xj(z) = Xj(z) + 2ηAj,1X1(z) ηz 1Aj,1X1(z), j = 1. (73) From the second equation it follows that (z2 z)Xj(z) = η(2z 1)Aj,1X1(z), for all j = 1. Thus, plugging this to the first equation of (73) yields that z2(z 1)2X1(z) = η2(2z 1)2 X j =1 A1,j Aj,1X1(z) (z(z 1))2Id η2(2z 1)2 X j =1 A1,j Aj,1 From this, it is possible to conclude the characteristic equation of the associated dynamical system: Proposition C.8. The characteristic equation of the dynamical system (72) can be expressed as (z(z 1))2Id η2(2z 1)2 X j =1 A1,j Aj,1 Observe that if all the poles of X1(z) lie within the unit circle of the complex plane, the same holds for each Xj(z), for all j = 1, modulo at most a single pole at z = 1. This follows given that (z2 z)Xj(z) = η(2z 1)Aj,1X1(z), for all j = 1. Finally, the rest of the theorem follows analogously to Theorem C.3, while it is direct to verify that, assuming convergence, the limit points satisfy the equilibrium conditions. Remark C.9. More broadly, the convergence of (OGD) can be determined in terms of the spectrum of the matrix 0d A1,2 A1,3 . . . A1,n A2,1 0d A2,3 . . . A2,n ... ... ... ... ... An,1 An,2 An,3 . . . 0d Indeed, Theorem 5.4 can be seen as an instance of such a broader characterization for separable continuous games. As we point out in Remark C.10, obtaining such results in multilinear settings would require different techniques. Remark C.10 (Beyond Polymatrix Games). We believe that results in unconstrained multilinear n-player games (akin to standard NFGs, but without constraints) can be obtained using recent advances from multilinear control theory (Chen et al., 2021). In particular, stability could be determined in terms of the tensor of the underlying game. Our final result for Section 5 shows that a generic class of first-order fails to guarantee stability in all general-sum games, even with two players. More precisely, we consider the following method. τ=0 α(τ)x(t τ) i + τ=0 β(τ) xiui(x(t τ)). (HGD) This method which will be referred to as historical gradient descent (HGD) can be described with the ordered set of coefficients A := (α(0), . . . , α(T )) and B := (β(0), . . . , β(T )). In particular, (HGD) contains (OGD) when A = (1) and B = (2η, η), but it goes well-beyond this method. For the purpose of our analysis we are going to represent an (HGD) algorithm using the following two polynomials. S(z) := α(0) + α(1)z 1 + + α(T )z T ; G(z) := β(0) + β(1)z 1 + + β(T )z T . For example, (OGD) is associated with polynomials S(z) = 1 and G(z) = 2η ηz 1. However, the considered class of dynamics contains certain trivial algorithms. For example, the update rule of an (HGD) algorithm for which G(z) 0 would not depend on the observed utility at any iteration. 
To address such trivialities, we impose a condition which ensures that an (HGD) algorithm is interesting from a game-theoretic standpoint. In particular, we say that an (HGD) algorithm is regular if S(1) = 1 and G(1) ≠ 0. Then, it is easy to see that, if both players employ a regular (HGD) algorithm and the dynamics converge, the limit points will be equilibria in the sense of (6). We will also assume without any loss of generality that the polynomials G(z) and S(z) − z have no common roots. We are now ready to state and prove our main theorem on (HGD) algorithms.

Theorem C.11. For any regular algorithm in (HGD) there exists a game for which the method will diverge under any non-trivial initialization.

Proof. Consider a game (A, B). The dynamics of the underlying dynamical system in the z-space can be expressed as follows:

z X(z) = S(z) X(z) + G(z) A Y(z);   z Y(z) = S(z) Y(z) + G(z) B⊤ X(z)
⟺ (z − S(z)) X(z) = G(z) A Y(z);   (z − S(z)) Y(z) = G(z) B⊤ X(z).

From this representation, we may conclude that the characteristic equation of the dynamical system can be expressed as

det( (z − S(z))² I_d − (G(z))² AB⊤ ) = 0,

which can in turn be expressed through χ(z), the characteristic polynomial of the matrix AB⊤. Note that our previous derivation uses the assumption that the polynomials S(z) − z and G(z) do not share any common roots, in turn implying that no root of G(z) can satisfy the equation det( (z − S(z))² I_d − (G(z))² AB⊤ ) = 0. Now consider any ϵ > 0 so that G(1 + ϵ) ≠ 0 and S(1 + ϵ) ≠ 1 + ϵ. If the game (A, B) is such that the matrix AB⊤ has the positive eigenvalue

λ = ( (1 + ϵ − S(1 + ϵ)) / G(1 + ϵ) )²,

then it follows that z = 1 + ϵ is a solution to the characteristic equation of the system; such a game clearly exists, since one may simply take A = I_d and B = λ I_d, so that AB⊤ = λ I_d. But given that |z| > 1, this necessarily implies that the dynamics will diverge by the fundamental theorem of linear dynamical systems.

While there is a plethora of control-theoretic techniques that could stabilize the dynamics, we have argued (Proposition 5.2) that stability might be at odds with efficiency, that is, with welfare maximization. Understanding the fundamental tension between these different solution concepts is an important question for the future.

D. Experiments

In this section we corroborate some of our theoretical results through experiments on several games.

D.1 Normal-Form Games

We start by illustrating last-iterate convergence when players employ different, and potentially advanced, prediction mechanisms in normal-form games. We first assume that both players employ (OMD) with the prediction m(t) set to either H-step recency bias (Item 1) or geometrically δ-discounted recency bias (Item 2). We remark that while Syrgkanis et al. (2015) established RVU bounds only for OFTRL under such predictions, it is immediate to extend these bounds, with qualitatively similar results, to OMD. That is, the bounds on the learning rate obtained in Proposition A.2 for OFTRL are analogous to the corresponding ones for OMD. Inspired by the work of Daskalakis et al. (2021), we will also experiment with predictions inducing H-order differences in the RVU bound: (i) 1-order: m(t) := u(t−1); (ii) 2-order: m(t) := 2u(t−1) − u(t−2); (iii) 3-order: m(t) := 3u(t−1) − 3u(t−2) + u(t−3); (iv) 4-order: m(t) := 4u(t−1) − 6u(t−2) + 4u(t−3) − u(t−4). We point out that our techniques immediately imply last-iterate convergence under H-order predictions, for a sufficiently small learning rate η = η(H).

Zero-Sum Games. We first illustrate the behavior of the dynamics for two-player zero-sum games; a schematic rendering of the update rule used in these experiments is sketched below.
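The following minimal NumPy sketch is a simplified rendering of this setup, not the exact experimental code: it implements one standard form of (OMD) with Euclidean regularization on the probability simplex, here with one-step recency bias; the H-step and δ-discounted prediction mechanisms would plug in at the indicated line.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-and-threshold method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1.0) / idx > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - tau, 0.0)

def run_omd(A, T=2000, eta=0.05):
    """Both players run (OMD) with Euclidean regularization on a zero-sum game with
    cost matrix A for player X (payoff matrix for player Y), using one-step recency bias."""
    n, m = A.shape
    xh, yh = np.ones(n) / n, np.ones(m) / m   # secondary (mirror) iterates
    mx, my = np.zeros(n), np.zeros(m)         # predictions m^(t)
    for _ in range(T):
        x = project_simplex(xh + eta * mx)    # player X commits to x^(t)
        y = project_simplex(yh + eta * my)    # player Y commits to y^(t)
        ux, uy = -A @ y, A.T @ x              # observed utilities (X minimizes x^T A y)
        xh = project_simplex(xh + eta * ux)   # secondary-iterate updates
        yh = project_simplex(yh + eta * uy)
        mx, my = ux, uy                       # one-step recency bias; H-step or
                                              # discounted predictions would be formed here
    return x, y

x, y = run_omd(np.array([[1.0, -1.0], [-1.0, 1.0]]))  # e.g., matching pennies
print(x, y)  # expected to approach the uniform equilibrium (0.5, 0.5) for each player
```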
We let A_j represent the cost matrix for player X, and in turn the payoff matrix for player Y, for j ∈ {1, 2, 3}, defined as follows:

1 1 1 1 1 0 0.5 0 1 1 2 1 1 1 0 0.5 1 1 1 1 1 0 0.5 1 0.3 0.5 0.5     (74)

The (OMD) dynamics when both players employ H-step recency bias are illustrated and discussed in Figure 2. Figure 3 illustrates the behavior of the dynamics when only one of the two players employs H-step recency bias. The geometrically δ-discounted prediction mechanism is depicted in Figure 4, while Figure 5 shows the dynamics under H-order predictions.

Figure 2. The (OMD) dynamics in the zero-sum games described in (74) for 2000 iterations; the panels correspond to the matrix games (A1, A1), (A2, A2), and (A3, A3). The y-axis corresponds to the probability with which each player selects the first action. Both players use Euclidean regularization with η = 0.05 and H-step recency bias (Item 1) for different values of H ∈ {1, 10, 25, 50}. The dynamics under small values of the prediction window H are qualitatively similar. On the other hand, as suggested by Proposition A.2, larger values of H can lead to instability, confirming that the learning rate has to be adapted to H.

Figure 3. The (OMD) dynamics in the zero-sum games described in (74) for 2000 iterations; the panels correspond to the matrix games (A1, A1), (A2, A2), and (A3, A3). The y-axis corresponds to the probability with which each player selects the first action. Both players use Euclidean regularization with η = 0.05. Unlike Figure 2, player X uses one-step recency bias, while player Y continues to use H-step recency bias for different values of H ∈ {1, 10, 25, 50}. We observe that larger values of H could lead to unstable behavior if η is not selected sufficiently small.

Strategically Zero-Sum Games. Next, we experiment with strategically zero-sum games (Definition A.5). The cost matrix for player X in each game is still given by (74), but now we assume that the cost matrices of player Y are given as follows:

1 0.5 1 0 0.5 0.5 0.25 0 1 0.3 0 0.3 0.2 0.25 0.3 0.35 0.75 0.05 0.7 0.5 0.56 0.4 0.4 0.7 0.5 0.6 0.5     (75)

To verify that the games (A1, B1), (A2, B2), and (A3, B3) are strategically zero-sum, observe that

B1 = 0.5 A1 + 1₃ (0.5, 0, 0.5);   B2 = 0.5 A2 + 1₃ (0.2, 0.5, 0.2);   B3 = 0.2 A3 + 1₃ (0.5, 0.5, 0.5),

where recall that 1₃ is the all-ones vector in ℝ³. Thus, the fact that these games are strategically zero-sum follows from the converse of Theorem A.6. In particular, the game (A3, B3) is strictly competitive (see Remark A.8). The (OMD) dynamics in these games are illustrated in Figure 6. We also experiment with the two players using different regularization. In particular, we assume that player X uses the Euclidean DGF, while player Y uses the negative entropy DGF. While our last-iterate guarantees do not appear to apply to this case, due to the lack of smoothness of the entropic regularizer (close to the boundary), Figure 7 illustrates that the dynamics still converge.

Figure 4. The (OMD) dynamics in the zero-sum games described in (74) for 10000 iterations; the panels correspond to the matrix games (A1, A1), (A2, A2), and (A3, A3). The y-axis corresponds to the probability with which each player selects the first action. Both players use Euclidean regularization with η = 0.02 and geometrically δ-discounted predictions (Item 2) for different values of δ ∈ {0.5, 0.8, 0.95, 0.96}. We observe that as δ approaches 1 the dynamics become more oscillatory.

D.2 Continuous Games

In this subsection we provide experiments on continuous unconstrained games. In particular, we illustrate the behavior of the dynamics for the games constructed for Proposition 5.2 and Proposition 5.3 in Figures 8 and 9, respectively.

Figure 5. The (OMD) dynamics in the zero-sum games described in (74) for 2000 iterations; the panels correspond to the matrix games (A1, A1), (A2, A2), and (A3, A3). The y-axis corresponds to the probability with which each player selects the first action. Both players use Euclidean regularization with η = 0.05 and H-order predictions for different values of H ∈ {1, 2, 3, 4}. Interestingly, the dynamics exhibit almost identical behavior to that under one-step recency bias (equivalently, 1-order predictions).

Figure 6. The (OMD) dynamics in the strategically zero-sum cost-minimization games described in (74) and (75) for 10000 iterations; the panels correspond to the bimatrix games (A1, B1), (A2, B2), and (A3, B3). The y-axis corresponds to the probability with which each player selects the first action. Both players use Euclidean regularization with η = 0.05 and H-step recency bias for different values of H ∈ {1, 10, 25, 50}. The dynamics under small values of the prediction window H are qualitatively similar. In contrast, observe that larger values of H can lead to instability.

Figure 7. The (OMD) dynamics in the zero-sum games described in (74); the panels correspond to the matrix games (A1, A1), (A2, A2), and (A3, A3). The y-axis corresponds to the probability with which each player selects the first action. Player X uses the Euclidean DGF, while player Y uses the negative entropy DGF, both with η = 0.05 and H-order predictions. We observe that the dynamics exhibit convergent behavior.
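The entropic update used by player Y in Figure 7 takes a particularly simple form. The following is a minimal sketch, in the same style as the Euclidean sketch above and again not the exact experimental code: with the negative entropy DGF, the two (OMD) prox steps reduce to multiplicative-weights-style updates, and the normalization plays the role of the projection.

```python
import numpy as np

def omd_entropic_step(x_hat, u_pred, u_obs, eta):
    """One (OMD) step with the negative-entropy DGF on the simplex."""
    x_t = x_hat * np.exp(eta * u_pred)
    x_t /= x_t.sum()                    # play x^(t) based on the prediction m^(t)
    x_hat_next = x_hat * np.exp(eta * u_obs)
    x_hat_next /= x_hat_next.sum()      # update the secondary iterate with the observed utility
    return x_t, x_hat_next
```

In the mixed-regularization experiment of Figure 7, player Y would take this step while player X keeps using the Euclidean update from the earlier sketch.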
Figure 8. The (OGD) dynamics for the continuous game described in (70) for 800 iterations. The y-axis illustrates the first coordinate of x ∈ ℝ² and y ∈ ℝ², respectively. Different columns correspond to different random initializations. As predicted by Theorem 5.1, and subsequently Proposition 5.2, the dynamics converge to the point (0, 0), which leads to each player receiving 0 utility.

Figure 9. The (OGD) dynamics for the continuous game described in (71) for ϵ = 0.05. In particular, the left image corresponds to the game (A, −A) for 10^6 iterations, while the right one to the game (A, B) for 1000 iterations. Although ‖A + B‖_F = ϵ, the two systems exhibit completely different behavior. This appears to be related to the fact that, although the dynamics in the game (A, −A) converge linearly, the rate of convergence is very close to 1. The same phenomenon occurs for any ϵ > 0, as predicted by Proposition 5.3.
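To reproduce the dichotomy of Figure 9 numerically, the following minimal sketch simulates unconstrained (OGD) on the two games and reports the final iterate norms; the diagonal matrices and their signs are our reading of the construction in (71), not a verbatim copy of it.

```python
import numpy as np

def ogd_unconstrained(A, B, T, eta=0.1):
    """Unconstrained (OGD) on the bimatrix game (A, B) with utilities
    u_X = x^T A y and u_Y = x^T B y; returns the final iterate norms."""
    x = x_prev = np.array([1.0, 1.0])
    y = y_prev = np.array([1.0, 1.0])
    for _ in range(T):
        gx, gx_prev = A @ y, A @ y_prev        # gradients for player X at t and t-1
        gy, gy_prev = B.T @ x, B.T @ x_prev    # gradients for player Y at t and t-1
        x, x_prev = x + 2 * eta * gx - eta * gx_prev, x
        y, y_prev = y + 2 * eta * gy - eta * gy_prev, y
    return np.linalg.norm(x), np.linalg.norm(y)

eps = 0.05
A = np.array([[1.0, 0.0], [0.0, eps / 2]])   # diagonal matrices as in (71);
B = np.array([[-1.0, 0.0], [0.0, eps / 2]])  # signs chosen so that ||A + B||_F = eps
print(ogd_unconstrained(A, -A, T=10**6))     # zero-sum counterpart: decays toward (0, 0), but very slowly
print(ogd_unconstrained(A, B, T=5000))       # the eps-perturbed game: the iterates blow up
```

The contrast mirrors the discussion above: the perturbation flips the sign of one eigenvalue of AB⊤, which is enough to push a root of the characteristic equation outside the unit circle.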