# accelerated_regularized_learning_in_finite_nperson_games__19b489f2.pdf

Accelerated Regularized Learning in Finite 𝑁-Person Games

Kyriakos Lotidis Stanford University klotidis@stanford.edu

Angeliki Giannou University of Wisconsin Madison giannou@wisc.edu

Panayotis Mertikopoulos Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP LIG 38000 Grenoble, France panayotis.mertikopoulos@imag.fr

Nicholas Bambos Stanford University bambos@stanford.edu

Motivated by the success of Nesterov s accelerated gradient algorithm for convex minimization problems, we examine whether it is possible to achieve similar performance gains in the context of online learning in games. To that end, we introduce a family of accelerated learning methods, which we call follow the accelerated leader (FTXL), and which incorporates the use of momentum within the general framework of regularized learning and, in particular, the exponential / multiplicative weights algorithm and its variants. Drawing inspiration and techniques from the continuous-time analysis of Nesterov s algorithm, we show that FTXL converges locally to strict Nash equilibria at a superlinear rate, achieving in this way an exponential speed-up over vanilla regularized learning methods (which, by comparison, converge to strict equilibria at a geometric, linear rate). Importantly, FTXL maintains its superlinear convergence rate in a broad range of feedback structures, from deterministic, full information models to stochastic, realization-based ones, and even when run with bandit, payoff-based information, where players are only able to observe their individual realized payoffs.

1 Introduction

One of the most important milestones in convex optimization was Nesterov s accelerated gradient (NAG) algorithm, as proposed by Nesterov [38] in 1983. The groundbreaking achievement of Nesterov s algorithm was that it attained an O(1/𝑇2) rate of convergence in Lipschitz smooth convex minimization problems, thus bridging a decades-old gap between the O(1/𝑇) convergence rate of ordinary gradient descent and the corresponding Ω(1/𝑇2) lower bound for said class [37]. In this way, Nesterov s accelerated gradient algorithm opened the door to acceleration in optimization, leading in turn to a wide range of other, likewise influential schemes such as FISTA and its variants [3] and jumpstarting a vigorous field of research that remains extremely active to this day.

Somewhat peculiarly, despite the great success that NAG has enjoyed in all fields where optimization plays a major role and, in particular, machine learning and data science its use has not percolated to the adjoining field of game theory as a suitable algorithm for learning Nash equilibria. Historically, the reasons for this are easy to explain: despite intense scrutiny by the community and an extensive corpus of literature dedicated to deconstructing the algorithm s guarantees, NAG s update structure remains quite opaque and, to a certain extent, mysterious. Because of this, Nesterov s algorithm could not be considered as a plausible learning scheme that could be employed by boundedly rational human agents involved in a repeated game. Given that this was the predominant tenet in economic thought

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

at the time, the use of Nesterov s algorithm in a game-theoretic context has not been extensively explored, to the best of our knowledge.

On the other hand, as far as applications to machine learning and artificial intelligence are concerned, the focus on human agents is no longer a limiting factor. In most current and emerging applications of game-theoretic learning from multi-agent reinforcement learning to adversarial models in machine learning the learning agents are algorithms whose computational capacity is only limited by the device on which they are deployed. In view of this, our paper seeks to answer the following question:

Can Nesterov s accelerated gradient scheme be deployed in a game-theoretic setting? And, if so, is it possible to achieve similar performance gains as in convex optimization?

Our contributions in the context of related work. The answer to the above questions is not easy to guess. On the one hand, given that game theory and convex optimization are fundamentally different fields, a reasonable guess would be no after all, finding a Nash equilibrium is a PPAD-complete problem [9], whereas convex minimization problems are solvable in polynomial time [7]. On the other, since in the context of online learning each player would have every incentive to use the most efficient unilateral optimization algorithm at their disposal, the use of NAG methods cannot be easily discarded from an algorithmic viewpoint.

Our paper examines if it is possible to obtain even a partially positive answer to the above question concerning the application of Nesterov s accelerated gradients techniques to learning in games. We focus throughout on the class of finite 𝑁-person games where, due to the individual concavity of the players payoff functions, the convergence landscape of online learning in games is relatively well-understood at least, compared to non-concave games. In particular, it is known that regularized learning algorithms such as follow the regularized leader (FTRL) and its variants converge locally to strict Nash equilibria at a geometric rate [18], and strict equilibria are the only locally stable and attracting limit points of regularized learning in the presence of randomness and/or uncertainty [11, 17, 23]. In this regard, we pose the question of (i) whether regularized learning schemes like FTRL can be accelerated; and (ii) whether the above properties are enhanced by this upgrade.

We answer both questions in the positive. First, we introduce an accelerated regularized scheme, in both continuous and discrete time, which we call follow the accelerated leader (FTXL). In continuous time, our scheme can be seen as a fusion of the continuous-time analogue of NAG proposed by Su, Boyd, and Candès [41] and the dynamics of regularized learning studied by Mertikopoulos & Sandholm [31] see also [5, 6, 19, 24, 29, 32 36, 42] and references therein. We show that the resulting dynamics exhibit the same qualitative equilibrium convergence properties as the replicator dynamics of Taylor & Jonker [43] (the most widely studied instance of FTRL in continuous time). However, whereas the replicator dynamics converge to strict Nash equilibria at a linear rate, the FTXL dynamics converge superlinearly.

In discrete time, we likewise propose an algorithmic implementation of FTXL which can be applied in various information context: (i) full information, that is, when players observe their entire mixed payoff vector; (ii) realization-based feedback, i.e., when players get to learn the what-if payoff of actions that they did not choose; and (iii) bandit, payoff-based feedback, where players only observe their realized, in-game payoff, and must rely on statistical estimation techniques to reconstruct their payoff vectors. In all cases, we show that FTXL maintains the exponential speedup described above, and converges to strict Nash equilibria at a superlinear rate (though the subleading term in the algorithm s convergence rate becomes increasingly worse as less information is available). We find this feature of FTXL particularly intriguing as superlinear convergence rates are often associated to methods that are second-order in space, not time; the fact that this is achieved even with bandit feedback is quite surprising in this context.

Closest to our work is the continuous-time, second-order replicator equation studied by Laraki & Mertikopoulos [25] in the context of evolutionary game theory, and derived through a model of pairwise proportional imitation of long-term success . The dynamics of [25] correspond to the undamped, continuous-time version of FTXL with entropic regularization, and the equilibrium convergence rate obtained by [25] agrees with our analysis. Other than that, the dynamics of Flåm & Morgan [12] also attempted to exploit a Newtonian structure, but they do not yield favorable convergence properties in a general setting. The inertial dynamics proposed in [26] likewise sought to leverage an inertial structure combined with the Hessian Riemannian underpinnings of the replicator

dynamics, but the resulting replicator equation was not even well-posed (in the sense that its solutions exploded in finite time).

More recently, Gao & Pavel [15, 16] considered a second-order, inertial version of the dynamics of mirror descent in continuous games, and examined their convergence in the context of variational stability [34]. Albeit related at a high level to our work (given the link between mirror descent and regularized learning), the dynamics of Gao & Pavel [15, 16] are actually incomparable to our own, and there is no overlap in our techniques or results. Other than that, second-order dynamics in games have also been studied in continuous time within the context of control-theoretic passivity, yielding promising results in circumventing the impossibility results of Hart & Mas-Colell [21], cf. Gao & Pavel [13, 14], Mabrok & Shamma [30], Toonsi & Shamma [44], and references therein. However, the resulting dynamics are also different, and we do not see a way of obtaining comparable rates in our setting.

2 Preliminaries

In this section, we outline some notions and definitions required for our analysis. Specifically, we introduce the framework of finite 𝑁-player games, we discuss the solution concept of a Nash equilibrium, and we present the main ideas of regularized learning in games.

2.1. Finite games. In this work, we focus exclusively with finite games in normal form. Such games consist of a finite set of players N = {1, . . . , 𝑁}, each of whom has a finite set of actions or pure strategies 𝛼𝑖 A𝑖and a payoff function 𝑢𝑖: A ℝ, where A := Q 𝑖 N A𝑖denotes the set of all possible action profiles 𝛼= (𝛼1, . . . , 𝛼𝑁). To keep track of all this, a finite game with the above primitives will be denoted as Γ Γ(N, A, 𝑢).

In addition to pure strategies, players may also randomize their choices by employing mixed strategies, that is, by choosing probability distributions 𝑥𝑖 X𝑖:= Δ(A𝑖) over their pure strategies, where Δ(A𝑖) denotes the probability simplex over A𝑖. Now, given a strategy profile 𝑥= (𝑥1, . . . , 𝑥𝑁) X := Q 𝑖 N X𝑖, we will use the standard shorthand 𝑥= (𝑥𝑖; 𝑥 𝑖) to highlight the mixed strategy 𝑥𝑖of player 𝑖against the mixed strategy profile 𝑥 𝑖 X 𝑖:= Q 𝑗 𝑖X 𝑗of all other players. We also define:

1. The mixed payoff of player 𝑖under 𝑥as

𝑢𝑖(𝑥) = 𝑢𝑖(𝑥𝑖; 𝑥 𝑖) =

𝛼𝑁 A𝑁 𝑥1𝛼1 . . . 𝑥𝑁𝛼𝑁𝑢𝑖(𝛼1, . . . , 𝛼𝑁) (1)

2. The mixed payoff vector of player 𝑖under 𝑥as

𝑣𝑖(𝑥) = 𝑥𝑖𝑢𝑖(𝑥) = (𝑢𝑖(𝛼𝑖; 𝑥 𝑖))𝛼𝑖 A𝑖 (2)

In words, 𝑣𝑖(𝑥) collects the expected rewards 𝑣𝑖𝛼𝑖(𝑥) := 𝑢𝑖(𝛼𝑖; 𝑥 𝑖) of each action 𝛼𝑖 A𝑖of player 𝑖 N against the mixed strategy profile 𝑥 𝑖of all other players. Finally, we write 𝑣(𝑥) = (𝑣1(𝑥), . . . , 𝑣𝑁(𝑥)) for the concatenation of the players mixed payoff vectors.

In terms of solution concepts, we will say that 𝑥 is a Nash equilibrium (NE) if no player can benefit by unilaterally deviating from their strategy, that is

𝑢𝑖(𝑥 ) 𝑢𝑖(𝑥𝑖; 𝑥 𝑖) for all 𝑥𝑖 X𝑖and all 𝑖 N . (NE)

Moreover, we say that 𝑥 is a strict Nash equilibrium if (NE) holds as a strict inequality for all 𝑥𝑖 𝑥 𝑖, 𝑖 N, i.e., if any deviation from 𝑥 𝑖results in a strictly worse payoff for the deviating player 𝑖 N. It is straightforward to verify that a strict equilibrium 𝑥 X is also pure in the sense that each player assigns positive probability only to a single pure strategy 𝛼 𝑖 A𝑖. Finally, we denote the support of a strategy 𝑥as the set of actions with non-zero probability mass, i.e., supp(𝑥) = {𝛼 A : 𝑥𝛼> 0}.

2.2. Regularized learning in games. In the general context of finite games, the most widely used learning scheme is the family of algorithms and dynamics known as follow the regularized leader (FTRL). In a nutshell, the main idea behind FTRL is that each player 𝑖 N plays a regularized best response to their cumulative payoff over time, leading to the continuous-time dynamics

𝑦𝑖(𝑡) = 𝑣𝑖(𝑥(𝑡)) 𝑥𝑖(𝑡) = 𝑄𝑖(𝑦𝑖(𝑡)) (FTRL-D)

where 𝑄𝑖(𝑦𝑖) = arg max𝑥𝑖 X𝑖{ 𝑦𝑖, 𝑥𝑖 ℎ𝑖(𝑥𝑖)} (3)

denotes the regularized best response or mirror map of player 𝑖 N, and ℎ𝑖: X𝑖 ℝis a strongly convex function known as the method s regularizer. Accordingly, in discrete time, this leads to the algorithm 𝑦𝑖,𝑛+1 = 𝑦𝑖,𝑛+ 𝛾ˆ𝑣𝑖,𝑛 𝑥𝑖,𝑛= 𝑄𝑖(𝑦𝑖,𝑛) (FTRL)

where 𝛾> 0 is a hyperparameter known as the algorithm s learning rate (or step-size) and ˆ𝑣𝑖,𝑛is a black-box payoff signal that carries information about 𝑣𝑖(𝑥𝑛). In the simplest case, when players have full information about the game being played and the actions taken by their opponents, we have ˆ𝑣𝑖,𝑛= 𝑣𝑖(𝑥𝑛); in more information-depleted environments (such as learning with payoff-based, bandit feedback), ˆ𝑣𝑖,𝑛is a reconstruction of 𝑣𝑖(𝑥𝑛) based on whatever information is at hand.

For concreteness, we close this section with the prototypical example of FTRL methods, the exponential / multiplicative weights (EW) algorithm. Going back to [2, 28, 45], this method is generated by the negentropy regularizer ℎ𝑖(𝑥𝑖) = P 𝛼𝑖 A𝑖𝑥𝑖𝛼𝑖log 𝑥𝑖𝛼𝑖, which yields the EW update rule

𝑦𝑖,𝑛+1 = 𝑦𝑖,𝑛+ 𝛾ˆ𝑣𝑖,𝑛 𝑥𝑖,𝑛= Λ𝑖(𝑦𝑖,𝑛) := exp(𝑦𝑖,𝑛) exp(𝑦𝑖,𝑛) 1 (EW)

and, in the continuous-time limit 𝛾 0, the exponential weights dynamics

𝑦𝑖(𝑡) = 𝑣𝑖(𝑥(𝑡)) 𝑥𝑖(𝑡) = Λ𝑖(𝑦𝑖(𝑡)) . (EWD)

In the above, Λ𝑖denotes the regularized best response induced by the method s entropic regularizer, which is known colloquially as a logit best response or, even more simply, as the logit map. To make the notation more compact in the sequel, we will write 𝑄= (𝑄𝑖)𝑖 N and Λ = (Λ𝑖)𝑖 N for the ensemble of the players regularized / logit best response maps. Remark 1. To streamline our presentation, in the main part of the paper, quantitative results will be stated for the special case of the EW setup above. In Appendix A, we discuss more general decomposable regularizers of the form ℎ𝑖(𝑥𝑖) = P 𝛼𝑖 A𝑖𝜃𝑖(𝑥𝑖) where 𝜃𝑖: [0, 1] ℝis continuous on [0, 1], and has 𝜃 (𝑥) > 0 for all 𝑥 (0, 1] and lim𝑥 0+ 𝜃 (𝑥) = . Although this set of assumptions can be relaxed, it leads to the clearest presentation of our results, so it will suffice for us. Remark 2. Throughout the paper, we will interchangeably use 𝑔(𝑡) and 𝑑𝑔/𝑑𝑡to denote the time derivative of 𝑔(𝑡). This dual notation allows us to adopt whichever form is most convenient in the given context. Moreover, for a process 𝑔, we will use the notation 𝑔(𝑡) for 𝑡 0 if it evolves in continuous time, and 𝑔𝑛for 𝑛 ℕif it evolves in discrete time steps, omitting the time-index when it is clear from context.

3 Combining acceleration with regularization: First insights and results

In this section, we proceed to illustrate how Nesterov s accelerated gradient (NAG) method can be combined with FTRL. To keep things as simple as possible, we focus on the continuous-time limit, so we do not have to worry about the choice of hyperparameters, the construction of black-box models for the players payoff vectors, etc.

3.1. Nesterov s accelerated gradient algorithm. We begin by discussing Nesterov s accelerated gradient algorithm as presented in Nesterov s seminal paper [38] in the context of unconstrained smooth convex minimization. Specifically, given a Lipschitz smooth convex function 𝑓: ℝ𝑑 ℝ, the algorithm unfolds iteratively as

𝑥𝑛+1 = 𝑤𝑛 𝛾 𝑓(𝑤𝑛)

𝑤𝑛+1 = 𝑥𝑛+1 + 𝑛 𝑛+ 3 (𝑥𝑛+1 𝑥𝑛) (NAG)

where 𝑤1 = 𝑥1 is initialized arbitrarily and 𝛾> 0 is a step-size parameter (typically chosen as 𝛾 1/𝐿where 𝐿is the Lipschitz smoothness modulus of 𝑓). The specific iterative structure of (NAG) and, in particular the 3 in the denominator can appear quite mysterious; nevertheless, (NAG) otherwise offers remarkable perfomance gains, improving in particular the rate of convergence

of gradient methods from O(1/𝑇) to O(1/𝑇2) [38], and matching in this way the corresponding Ω(1/𝑇2) lower bound for the minimization of smooth convex functions [37].1

This groundbreaking result has since become the cornerstone of a vast and diverse literature expanding on the properties of (NAG) and trying to gain a deeper understanding of the how and why of its update structure. One perspective that has gained significant traction in this regard is the continuoustime approach of Su et al. [40, 41]; combining the two equations in (NAG) into 𝑥𝑛+1 2 𝑥𝑛+ 𝑥𝑛 1 𝛾 = 𝛾 𝑓(𝑤𝑛) 3 𝑛+ 2 𝑥𝑛 𝑥𝑛 1 𝛾 , (4)

they modeled (NAG) as a heavy ball with vanishing friction system of the form

𝑑𝑡2 = 𝑓(𝑥) 3

The choice of terminology alludes to the fact that (HBVF) describes the dynamics of a heavy ball descending the landscape of 𝑓under the potential field 𝐹(𝑥) = 𝑓(𝑥) with a vanishing kinetic friction coefficient (the 3/𝑡factor in front of the momentum term 𝑑𝑥/𝑑𝑡). In this interpretation, the mass of the ball accelerates the system, the friction term dissipates energy to enable convergence, and the vanishing friction coefficient quenches the impact of friction over time in order to avoid decelerating the system too much (so the system is, in a sense, critically underdamped ).

As was shown by Su et al. [41], an explicit Euler discretization of (HBVF) yields (NAG) with exactly the right momentum coefficient 𝑛/(𝑛+ 3); moreover, the rate of convergence of the continuous-time dynamics (HBVF) is the same as that of the discrete-time algorithm (NAG), and the energy function and Lyapunov analysis used to derive the former can also be used to derive the latter. For all these reasons, (HBVF) is universally considered as the de facto continuous-time analogue of (NAG), and we will treat it as such in the sequel.

3.2. NAG meets FTRL. To move from unconstrained convex minimization problems to finite 𝑁-person games a constrained, non-convex, multi-agent, multi-objective setting it will be more transparent to start with the continuous-time formulation (HBVF). Indeed, applying the logic behind (HBVF) to the (unconstrained) state variables 𝑦of (FTRL-D), we obtain the follow the accelerated leader dynamics 𝑑2𝑦

𝑑𝑡2 = 𝑣(𝑄(𝑦)) 𝑟

𝑑𝑡 (FTXL-D)

where the dynamics driving force 𝐹(𝑦) = 𝑣(𝑄(𝑦)) is now given by the payoff field of the game, and the factor 𝑟/𝑡, 𝑟 0, plays again the role of a vanishing friction coefficient. To avoid confusion, we highlight that in the case of regularized learning, the algorithm s variable that determines the evolution of the system in an autonomous way is the score variable 𝑦, not the strategy variable 𝑥 (which is an ancillary variable obtained from 𝑦via the regularized choice map 𝑄).

In contrast to (EWD), the accelerated dynamics (FTXL-D) are second-order in time, a fact with fundamental ramifications, not only from a conceptual, but also from an operational viewpoint. Focusing on the latter, we first note that (FTXL-D) requires two sets of initial conditions, 𝑦(0) and 𝑦(0), the latter having no analogue in the first-order setting of (FTRL-D). In general, the evolution of the system depends on both 𝑦(0) and 𝑦(0), but since this would introduce an artificial bias toward a certain direction, we will take 𝑦(0) = 0, in tune with standard practice for (NAG) [41].

We also note that (FTXL-D) can be mapped to an equivalent autonomous first-order system with double the variables: specifically, letting 𝑝= 𝑦denote the players (payoff) momentum, (FTXL-D) can be rewritten as 𝑑𝑦

𝑑𝑡= 𝑣(𝑄(𝑦)) 𝑟

with 𝑦(0) initialized arbitrarily and 𝑝(0) = 𝑦(0). In turn, (5) yields 𝑝(𝑡) = 𝑡 𝑟 𝑡 0 𝑡𝑟𝑣(𝑄(𝑦(𝜏))) 𝑑𝜏, so 𝑝(𝑡) can be seen as a weighted aggregate of the players payoffs up to time 𝑡: if 𝑟= 0 (the undamped regime), all information enters 𝑝(𝑡) with the same weight; if 𝑟> 0, past information is discounted relative to more recent observations; and, in the overdamped limit 𝑟 , all weight is assigned to the current point in time, emulating in this way the first-order system (FTRL-D).

1There are, of course, many other approaches to acceleration, that we cannot cover here; for a discussion of the popular linear coupling approach of Allen-Zhu & Orecchia [1], see Appendix F.

3.3. First insights and results. From an operational standpoint, the main question of interest is to specify the equilibrium convergence properties of (FTXL-D) and, later in the paper, its discrete-time analogue. To establish a baseline, the principal equilibrium properties of its first-order counterpart can be summarized as follows: (i) strict Nash equilibria are locally stable and attracting under (FTRL-D) [23, 31];2 (ii) the dynamics do not admit any other such points (that is, stable and attracting) [11]; and (iii) quantitively, in the case of (EWD), the dynamics converge locally to strict Nash equilibria at a geometric rate of the form 𝑥(𝑡) 𝑥 = O(exp( 𝑐𝑡)) for some 𝑐> 0 [31].

Our first result below shows that the accelerated dynamics (FTXL-D) exhibit an exponential speed-up relative to (FTRL-D), and the players orbits converge to strict Nash equilibria at a superlinear rate: Theorem 1. Let 𝑥 be a strict Nash equilibrium of Γ, and let 𝑥(𝑡) = 𝑄(𝑦(𝑡)) be a solution orbit of (FTXL-D). If 𝑥(0) is sufficiently close to 𝑥 , then 𝑥(𝑡) converges to 𝑥 ; in particular, if (FTXL-D) is run with logit best responses (that is, 𝑄 Λ), we have

𝑥(𝑡) 𝑥 exp 𝐶 𝑐𝑡2

where 𝐶> 0 is a constant that depends only on the initialization of (FTXL-D) and

2 min 𝑖 N min 𝛽𝑖 supp(𝑥 𝑖)[𝑢𝑖(𝑥 𝑖; 𝑥 𝑖) 𝑢𝑖(𝛽𝑖; 𝑥 𝑖)] > 0 (7)

is the minimum payoff difference at equilibrium.

Theorem 1 (which we prove in Appendix B) is representative of the analysis to come, so some remarks are in order. First, we should note that the explicit rate estimate (6) is derived for the special case of logit best responses, which underlie all exponential / multiplicative weights algorithms. To the best of our knowledge, the only comparable result in the literature is the similar rate provided in [25] for the case 𝑟= 0. In the case of a general regularizer, an analogous speed-up is observed, but the exact expressions are more involved, so we defer them to Appendix B. A second important point concerns whether the rate estimate (6) is tight or not. Finally, the neighborhood of initial conditions around 𝑥 is determined by the minimum payoff difference at equilibrium and is roughly O(𝑐) in diameter; we defer the relevant details of this discussion to Appendix B.

To answer this question and, at the same time get a glimpse of the proof strategy for Theorem 1 it will be instructive to consider a single-player game with two actions. Albeit simple, this toy example is not simplistic, as it provides an incisive look into the problem, and will be used to motivate our design choices in the sequel. Example 3.1. Consider a single-player game Γ with actions A and B such that 𝑢(A) 𝑢(B) = 1, so the (dominant) strategy 𝑥 = (1, 0) is a strict Nash equilibrium. Then, letting 𝑧= 𝑦A 𝑦B, (FTXL-D) readily yields

𝑑𝑡2 = 𝑢(A) 𝑢(B) 𝑟

𝑡 𝑑𝑧 𝑑𝑡. (8)

As we show in Appendix B, this non-autonomous differential equation can be solved exactly to yield 𝑧(𝑡) = 𝑧(0) + 𝑡2/[2(𝑟+ 1)], and hence

𝑥(𝑡) 𝑥 = 1 1 + exp(𝑧(𝑡)) exp 𝑧(0) 𝑡2

Since 𝑐= 𝑢(A) 𝑢(B) = 1, the rate (9) coincides with that of Theorem 1 up to a factor of 1/2. This factor is an artifact of the analysis and, in fact, it can be tightened to (1 𝜀) for arbitrarily small 𝜀> 0; we did not provide this more precise expression to lighten notation. By contrast, the factor 2(𝑟+ 1) in (6) cannot be lifted; this has important ramifications which we discuss below.

The first conclusion that can be drawn from Example 3.1 is that the rate estimate of Theorem 1 is tight and cannot be improved in general. In addition, and in stark contrast to (NAG), Example 3.1

2Recall here that 𝑥 X is said to be (i) Lyapunov stable (or simply stable) if every orbit 𝑥(𝑡) of the dynamics that starts close enough to 𝑥 remains close enough to 𝑥 for all 𝑡 0; (ii) attracting if lim𝑡 𝑥(𝑡) = 𝑥 for every orbit 𝑥(𝑡) that starts close enough to 𝑥 ; and (iii) asymptotically stable if it is both stable and attracting. For an introduction to the theory of dynamical systems, cf. Shub [39] and Hirsch et al. [22].

shows that the optimal value for the friction parameter is 𝑟= 0 (at least from a min-max viewpoint, as this value yields the best possible lower bound for the rate). Of course, this raises the question as to whether this is due to the continuous-time character of the policy;3 however, as we show in detail in Appendix C, this is not the case: the direct handover of (NAG) to Example 3.1 yields the exact same rate (though the proof relies on a significantly more opaque generating function calculation).

In view of all this, it becomes apparent that friction only hinders the equilibrium convergence properties of accelerated FTRL schemes in our game-theoretic setting. On that account, we will continue our analysis in the undamped regime 𝑟= 0.

4 Accelerated learning: Analysis and results

4.1. The algorithm. To obtain a bona fide, algorithmic implementation of the continuous-time dynamics (FTXL-D), we will proceed with the same explicit, finite-difference scheme leading to the discrete-time algorithm (NAG) from the continuous-time dynamics (HBVF) of Su et al. [41]. Specifically, taking a discretization step 𝛾> 0 in (FTXL-D) and setting the scheme s friction parameter 𝑟to zero (which, as we discussed at length in the previous section, is the optimal choice in our setting), a straightforward derivation yields the basic update rule

[𝑦𝑖,𝑛+1 2𝑦𝑖,𝑛+ 𝑦𝑖,𝑛 1]/𝛾2 = ˆ𝑣𝑖,𝑛 for all 𝑖 N and all 𝑛= 1, 2, . . . (10)

In the above, just as in the case of (FTRL), ˆ𝑣𝑖,𝑛 ℝA𝑖denotes a black-box payoff signal that carries information about the mixed payoff vector 𝑣𝑖(𝑥𝑛) of player 𝑖at the current strategy profile 𝑥𝑛 (we provide more details on this below).

Alternatively, to obtain an equivalent first-order iterative rule (which is easier to handle and discuss), it will be convenient to introduce the momentum variables 𝑝𝑛= (𝑦𝑛 𝑦𝑛 1)/𝛾. Doing just that, a simple rearrangement of (10) yields the follow the accelerated leader scheme

𝑦𝑖,𝑛+1 = 𝑦𝑖,𝑛+ 𝛾𝑝𝑖,𝑛+1 𝑝𝑖,𝑛+1 = 𝑝𝑖,𝑛+ 𝛾ˆ𝑣𝑖,𝑛 𝑥𝑖,𝑛= 𝑄𝑖(𝑦𝑖,𝑛) . (FTXL)

The algorithm (FTXL) will be our main object of study in the sequel, and we will examine its convergence properties under three differerent models for ˆ𝑣𝑛:

1. Full information, i.e., players get to access their full, mixed payoff vectors:

ˆ𝑣𝑖,𝑛= 𝑣𝑖(𝑥𝑛) for all 𝑖 N, 𝑛= 1, 2, . . . (11a)

2. Realization-based feedback, i.e., after choosing an action profile 𝛼𝑛 𝑥𝑛, each player 𝑖 N observes (or otherwise calculates) the vector of their counterfactual, what-if rewards, namely

ˆ𝑣𝑖,𝑛= 𝑣𝑖(𝛼𝑛) for all 𝑖 N, 𝑛= 1, 2, . . . (11b)

3. Bandit / Payoff-based feedback, i.e., each player only observes their current reward, and must rely on statistical estimation techniques to reconstruct an estimate of 𝑣𝑖(𝑥𝑛). For concreteness, we will consider the case where players employ a version of the so-called importance-weighted estimator

ˆ𝑣𝑖,𝑛= IWE(𝑥𝑖,𝑛; 𝛼𝑖,𝑛) for all 𝑖 N, 𝑛= 1, 2, . . . (11c)

which we describe in detail later in this section.

Of course, this list of information models is not exhaustive, but it is a faithful representation of most scenarios that arise in practice, so it will suffice for our purposes.

Now before moving forward with the analysis, it will be useful to keep some high-level remarks in mind. The first is that (FTXL) shares many similarities with (FTRL), but also several notable differences. At the most basic level, (FTRL) and (FTXL) are both stimulus-response schemes in the spirit of Erev & Roth [10], that is, players respond with a strategy 𝑥𝑖,𝑛= 𝑄𝑖(𝑦𝑖,𝑛) to a stimulus 𝑦𝑖,𝑛generated by the observed payoff signals ˆ𝑣𝑖,𝑛. In this regard, both methods adhere to the online learning setting (and, in particular, to the regularized learning paradigm).

3The reader might also wonder if the use of a non-vanishing friction coefficient 𝑟 𝑦instead of (𝑟/𝑡) 𝑦 could be beneficial to the convergence rate of (FTXL-D). As we show in Appendices B and C, this leads to significantly worse convergence rates of the form 𝑥(𝑡) 𝑥 exp( Θ(𝑡)) for all 𝑟> 0.

However, unlike (FTRL), where players respond to the aggregate of their payoff signals the process 𝑦𝑛in (FTRL) the accelerated algorithm (FTXL) introduces an additional aggregation layer, which expresses how players build momentum based on the same payoff signals the process 𝑝𝑛in (FTXL). Intuitively, we can think of these two processes as the position and momentum variables of a classical inertial system, not unlike the heavy-ball dynamics of Su et al. [41]. The only conceptual difference is that, instead of rolling along the landscape of a (convex) function, the players now track the mirrored payoff field ˆ𝑣(𝑦) := 𝑣(𝑄(𝑦)).

In the rest of this section, we proceed to examine in detail the equilibrium convergence properties of (FTXL) under each of the three models detailed in Eqs. (11a) (11c) in order.

4.2. Accelerated learning with full information. We begin with the full information model (11a). This is the most straightforward model (due to the absence of randomness and uncertainty) but, admittedly, also the least realistic one. Nevertheless, it will serve as a useful benchmark for the rest, and it will allow us to introduce several important notions.

Before we state our result, it is important to note that a finite game can have multiple strict Nash equilibria, so global convergence results are, in general, unattainable; for this reason, we analyze the algorithm s local convergence landscape. In this regard, Theorem 2 below shows that (FTXL) with full information achieves a superlinear local convergence rate to strict Nash equilibria: Theorem 2. Let 𝑥 be a strict Nash equilibrium of Γ, and let 𝑥𝑛= 𝑄(𝑦𝑛) be the sequence of play generated by (FTXL) with full information feedback of the form (11a). If 𝑥1 is initialized sufficiently close to 𝑥 , then 𝑥𝑛converges to 𝑥 ; in particular, if (FTXL) is run with logit best responses (that is, 𝑄 Λ), we have

𝑥𝑇 𝑥 exp 𝐶 𝑐𝛾2𝑇(𝑇 1)

= exp Θ(𝑇2) (12)

where 𝐶> 0 is a constant that depends only on the initialization of (FTXL) and

2 min 𝑖 N min 𝛽𝑖 supp(𝑥 𝑖)[𝑢𝑖(𝑥 𝑖; 𝑥 𝑖) 𝑢𝑖(𝛽𝑖; 𝑥 𝑖)] > 0 (13)

is the minimum payoff difference at equilibrium.

To maintain the flow of our discussion, we defer the proof of Theorem 2 to Appendix C. Instead, we only note here that, just as in the case of (HBVF) and (NAG), Theorem 2 provides essentially the same rate of convergence as its continuous-time counterpart, Theorem 1, modulo a subleading term which has an exponentially small impact on the rate of convergence. In particular, we should stress that the superlinear convergence rate of (FTXL) exhibits an exponential speedup relative to (FTRL), which is known to converge at a geometric rate 𝑥𝑇 𝑥 = exp( Θ(𝑇)). This is in direct correspondence to what we observe in continuous time, showing in particular that the continuous-time dynamics (FTXL-D) are a faithful representation of (FTXL).

We should also stress here that superlinear convergence rates are typically associated with methods that are second-order in space, in the sense that they employ Hessian-like information like Newton s algorithm not second-order in time like (NAG) and (FTXL). We find this observation particularly intriguing as it suggests that accelerated rates can be observed in the context of learning in games without having to pay the excessively high compute cost of second-order methods in optimization.

4.3. Accelerated learning with realization-based feedback. We now turn to the realization-based model (11b), where players can only assess the rewards of their pure actions in response to the realized actions of all other players. In words, ˆ𝑣𝑖,𝑛= 𝑣𝑖(𝛼𝑛) collects the payoffs that player 𝑖 N would have obtained by playing each of their pure actions 𝛼𝑖 A𝑖against the action profile 𝛼 𝑖,𝑛 adopted by the rest of the players.

In contrast to the full information model (11a), the realization-based model is stochastic in nature, so our convergence results will likewise be stochastic. Nevertheless, despite the added layer of uncertainty, we show that (FTXL) with realization-based feedback maintains a superlinear convergence rate with high probability: Theorem 3. Let 𝑥 be a strict Nash equilibrium of Γ, fix some confidence level 𝛿> 0, and let 𝑥𝑛= 𝑄(𝑦𝑛) be the sequence of play generated by (FTXL) with realization-based feedback as per

0 200 400 600 800 1000 Iteration (n)

Feedback Model: Realization-based

(a) Zero-sum game: Realization-based feedback

0 200 400 600 800 1000 Iteration (n)

Feedback Model: Bandit

(b) Zero-sum game: Bandit feedback

0 200 400 600 800 1000 Iteration (n)

Feedback Model: Realization-based

(c) Congestion game: Realization-based feedback

0 200 400 600 800 1000 Iteration (n)

Feedback Model: Bandit

(d) Congestion game: Bandit feedback

Figure 1: Performance evaluation of (FTXL) in a zero-sum and a congestion game under realization-based and bandit feedback. Solid lines represent average values, while shaded regions enclose 1 standard deviation. The plots are in logarithmic scale.

(11b) and a sufficiently small step-size 𝛾> 0. Then there exists a neighborhood U of 𝑥 such that

ℙ(𝑥𝑛 𝑥 as 𝑛 ) 1 𝛿 if 𝑥1 U. (14)

In particular, if (FTXL) is run with logit best responses (that is, 𝑄 Λ), there exist positive constants 𝐶, 𝑐> 0 as in Theorem 2 such that on the event {𝑥𝑛 𝑥 as 𝑛 }:

𝑥𝑇 𝑥 exp 𝐶 𝑐𝛾2𝑇(𝑇 1)

5𝑐𝛾5/3𝑇5/3 = exp Θ(𝑇2) . (15)

What is particularly surprising in Theorem 3 is that, (FTXL) maintains the accelerated superlinear rate of Theorem 2 and, likewise, the exponential speedup relative to (FTRL) despite the randomness and uncertainty involved in the realization-based model (11b). The salient point enabling this feature of (FTXL) is that ˆ𝑣𝑛can be expressed as

ˆ𝑣𝑛= 𝑣(𝑥𝑛) + 𝑈𝑛 (16)

where 𝑈𝑛 Q 𝑖ℝA𝑖is an almost surely bounded conditionally zero-mean stochastic perturbation, that is, 𝔼[𝑈𝑛| F𝑛] = 0, where F𝑛:= 𝜎(𝑥1, . . . , 𝑥𝑛) denotes the history of play up to (and including) time 𝑛. Thanks to the boundedness of (16), we are able to derive a series of probabilistic estimates showing that, with high probability (and, in particular, greater than 1 𝛿), the contribution of the noise in the algorithm s rate becomes subleading, thus allowing the superlinear rate of Theorem 2 to emerge. As in the case of Theorem 2, we defer the proof of Theorem 3 to the appendix.

4.4. Bandit feedback. The last framework we consider is the bandit model where players only observe their realized rewards, a scalar from which they must reconstruct their entire payoff vector. To do so, a standard technique from the multi-armed bandit literature is the so-called importance weighted estimator (IWE) [8, 27], defined in our setting as

ˆ𝑣𝑖𝛼𝑖,𝑛= 1{𝛼𝑖,𝑛= 𝛼𝑖}

ˆ𝑥𝑖𝛼𝑖,𝑛 𝑢𝑖(𝛼𝑖; 𝛼 𝑖,𝑛) (IWE)

where ˆ𝑥𝑖,𝑛= (1 𝜀𝑛)𝑥𝑖,𝑛+ 𝜀𝑛unif A𝑖is a mixture of 𝑥𝑖,𝑛and the uniform distribution on A𝑖(a mechanism known in the literature as explicit exploration). Importantly, this estimator is unbiased relative to the perturbed strategy ˆ𝑥𝑥𝑛, which thus incurs an O(𝜀𝑛) non-zero-sum error to the estimation of 𝑣𝑖(𝑥𝑛). This error can be made arbitrarily small by taking 𝜀𝑛 0 but, in doing so, the variance of ˆ𝑣𝑖,𝑛diverges, leading to a bias-variance trade-off that is difficult to tame.

Despite these added difficulties, we show below that (FTXL) maintains its superlinear convergence rate even with bandit, payoff-based feedback: Theorem 4. Let 𝑥 be a strict Nash equilibrium of Γ, fix some confidence level 𝛿> 0, and let 𝑥𝑛= 𝑄(𝑦𝑛) be the sequence of play generated by (FTXL) with bandit feedback of the form (11c), an IWE exploration parameter 𝜀𝑛 1/𝑛ℓ𝜀for some ℓ𝜀 (0, 1/2), and a sufficiently small step-size 𝛾> 0. Then there exists a neighborhood U of 𝑥 in X such that

ℙ(𝑥𝑛 𝑥 as 𝑛 ) 1 𝛿 if 𝑥1 U. (17)

In particular, if (FTXL) is run with logit best responses (that is, 𝑄 Λ), there exist positive constants 𝐶, 𝑐> 0 as in Theorem 2 such that on the event {𝑥𝑛 𝑥 as 𝑛 }

𝑥𝑇 𝑥 exp 𝐶 𝑐𝛾2𝑇(𝑇 1)

9𝑐𝛾9/5𝑇9/5 = exp Θ(𝑇2) . (18)

Theorem 4 (which we prove in Appendix D shows that, despite the degradation of the subleading term, (FTXL) retains its superlinear convergence rate even with bandit, payoff-based feedback (for a numerical demonstration, see Fig. 1 above). We find this feature of (FTXL) particularly important as it shows that the algorithm remains exceptionally robust in the face of randomness and uncertainty, even as we move toward increasingly information-starved environments from full information, to realization-based observations and, ultimately, to bandit feedback. This has important ramifications from an operational standpoint, which we intend to examine further in future work.

4.5. Numerical Experiments. We conclude this section with a series of numerical simulations to validate the performance of (FTXL). To this end, we consider two game paradigms, (i) a 2-player zero-sum game, and (ii) a congestion game.

Zero-sum Games. First, we consider a 2-player zero-sum game with actions {𝛼1, 𝛼2, 𝛼3} and {𝛽1, 𝛽2, 𝛽3}, and payoff matrix

(2, 2) (1, 1) (2, 2) ( 2, 2) ( 1, 1) ( 2, 2) ( 2, 2) ( 1, 1) ( 2, 2)

Here, the rows of 𝑃correspond to the actions of player 𝐴and the columns to the actions of player 𝐵, while the first item of each entry of 𝑃corresponds to the payoff of 𝐴, and the second one to the payoff of 𝐵. Clearly, the action profile (𝛼1, 𝛽2) is a strict Nash equilibrium.

Congestion Games. As a second example, we consider a congestion game with 𝑁= 100 and 2 roads, 𝑟1 and 𝑟2, with costs 𝑐1 = 1.1 and 𝑐2 = 𝑑/𝑁where 𝑑is the number of drivers on 𝑟2. In words, 𝑟1 has a fixed delay equal to 1.1, while 𝑟2 has a delay proportional to the drivers using it. Note, that the strategy profile where all players are using 𝑟2 is a strict Nash equilibrium.

In Fig. 1, we assess the convergence of (FTXL) with logit best responses, under realization-based and bandit feedback, and compare it to the standard (EW) with the same level of information. The figures verify that (FTXL) outperforms (EW) regarding the convergence to a strict Nash equilibrium both for the realization-based and the bandit feedback, as expected from the theoretical findings. Specifically, they validate the faster convergence rate of (FTXL) compared to that of the (EW) algorithm. Moreover, we observe that both algorithms perform worse under bandit feedback than under realization-based feedback. This behavior is expected as less information becomes available. More details can be found in Appendix E.

Acknowledgments and Disclosure of Funding

This research was supported in part by the French National Research Agency (ANR) in the framework of the PEPR IA FOUNDRY project (ANR-23-PEIA-0003), the Investissements d avenir program (ANR-15-IDEX-02), the Lab Ex PERSYVAL (ANR-11-LABX-0025-01), MIAI@Grenoble Alpes (ANR-19-P3IA-0003), the project IRGA2024-SPICE-G7H-IRG24E90. PM is also with the Archimedes Research Unit Athena RC Department of Mathematics, National & Kapodistrian University of Athens, and his research was partially funded by project MIS 5154714 of the National Recovery and Resilience Plan Greece 2.0 funded by the European Union under the Next Generation EU Program.

[1] Allen-Zhu, Z. and Orecchia, L. Linear coupling: An ultimate unification of gradient and mirror descent. In ITCS 17: Proceedings of the 8th Conference on Innovations in Theoretical Computer Science, 2017. [2] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995. [3] Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183 202, March 2009. [4] Benaïm, M. Dynamics of stochastic approximation algorithms. In Azéma, J., Émery, M., Ledoux, M., and Yor, M. (eds.), Séminaire de Probabilités XXXIII, volume 1709 of Lecture Notes in Mathematics, pp. 1 68. Springer Berlin Heidelberg, 1999. [5] Boone, V. and Mertikopoulos, P. The equivalence of dynamic and strategic stability under regularized learning in games. In Neur IPS 23: Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023. [6] Bravo, M. and Mertikopoulos, P. On the robustness of learning in games with stochastically perturbed payoff observations. Games and Economic Behavior, 103(John Nash Memorial issue):41 66, May 2017. [7] Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231 358, 2015. [8] Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012. [9] Daskalakis, C., Goldberg, P. W., and Papadimitriou, C. H. The complexity of computing a Nash equilibrium. Communications of the ACM, 52(2):89 97, 2009. [10] Erev, I. and Roth, A. E. Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. American Economic Review, 88:848 881, 1998. [11] Flokas, L., Vlatakis-Gkaragkounis, E. V., Lianeas, T., Mertikopoulos, P., and Piliouras, G. No-regret learning and mixed Nash equilibria: They do not mix. In Neur IPS 20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020. [12] Flåm, S. D. and Morgan, J. Newtonian mechanics and Nash play. International Game Theory Review, 6 (2):181 194, 2004. [13] Gao, B. and Pavel, L. On passivity and reinforcement learning in finite games. In 2018 IEEE Conference on Decision and Control (CDC), pp. 340 345, 2018. doi: 10.1109/CDC.2018.8619157. [14] Gao, B. and Pavel, L. On passivity, reinforcement learning, and higher order learning in multiagent finite games. IEEE Transactions on Automatic Control, 66(1):121 136, 2021. doi: 10.1109/TAC.2020.2978037. [15] Gao, B. and Pavel, L. Second-order mirror descent: exact convergence beyond strictly stable equilibria in concave games. In 2021 60th IEEE Conference on Decision and Control (CDC), pp. 948 953, 2021. doi: 10.1109/CDC45484.2021.9683223. [16] Gao, B. and Pavel, L. Second-order mirror descent: Convergence in games beyond averaging and discounting. IEEE Transactions on Automatic Control, 69(4):2143 2157, 2024. doi: 10.1109/TAC.2023. 3291953. [17] Giannou, A., Vlatakis-Gkaragkounis, E. V., and Mertikopoulos, P. Survival of the strictest: Stable and unstable equilibria under regularized learning with partial information. In COLT 21: Proceedings of the 34th Annual Conference on Learning Theory, 2021. [18] Giannou, A., Vlatakis-Gkaragkounis, E. V., and Mertikopoulos, P. The convergence rate of regularized learning in games: From bandits and uncertainty to optimism and beyond. In Neur IPS 21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.

[19] Hadikhanloo, S., Laraki, R., Mertikopoulos, P., and Sorin, S. Learning in nonatomic games, Part I: Finite action spaces and population games. Journal of Dynamics and Games, 9(4, William H. Sandholm memorial issue):433 460, October 2022. [20] Hall, P. and Heyde, C. C. Martingale Limit Theory and Its Application. Probability and Mathematical Statistics. Academic Press, New York, 1980. [21] Hart, S. and Mas-Colell, A. Uncoupled dynamics do not lead to Nash equilibrium. American Economic Review, 93(5):1830 1836, 2003. [22] Hirsch, M. W., Smale, S., and Devaney, R. L. Differential Equations, Dynamical Systems, and an Introduction to Chaos. Elsevier, London, UK, 2 edition, 2004. [23] Hofbauer, J. and Sigmund, K. Evolutionary game dynamics. Bulletin of the American Mathematical Society, 40(4), July 2003. [24] Hsieh, Y.-G., Antonakopoulos, K., and Mertikopoulos, P. Adaptive learning in continuous games: Optimal regret bounds and convergence to Nash equilibrium. In COLT 21: Proceedings of the 34th Annual Conference on Learning Theory, 2021. [25] Laraki, R. and Mertikopoulos, P. Higher order game dynamics. Journal of Economic Theory, 148(6): 2666 2695, November 2013. [26] Laraki, R. and Mertikopoulos, P. Inertial game dynamics and applications to constrained optimization. SIAM Journal on Control and Optimization, 53(5):3141 3170, October 2015. [27] Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, Cambridge, UK, 2020. [28] Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and Computation, 108 (2):212 261, 1994. [29] Lotidis, K., Mertikopoulos, P., and Bambos, N. Learning in games with quantized payoff observations. In CDC 22: Proceedings of the 61st IEEE Annual Conference on Decision and Control, 2022. [30] Mabrok, M. A. and Shamma, J. S. Passivity analysis of higher order evolutionary dynamics and population games. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 6129 6134, 2016. doi: 10.1109/CDC.2016.7799211. [31] Mertikopoulos, P. and Sandholm, W. H. Learning in games via reinforcement and regularization. Mathematics of Operations Research, 41(4):1297 1324, November 2016. [32] Mertikopoulos, P. and Sandholm, W. H. Riemannian game dynamics. Journal of Economic Theory, 177: 315 364, September 2018. [33] Mertikopoulos, P. and Staudigl, M. Equilibrium tracking and convergence in dynamic games. In CDC 21: Proceedings of the 60th IEEE Annual Conference on Decision and Control, 2021. [34] Mertikopoulos, P. and Zhou, Z. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465 507, January 2019. [35] Mertikopoulos, P., Papadimitriou, C. H., and Piliouras, G. Cycles in adversarial regularized learning. In SODA 18: Proceedings of the 29th annual ACM-SIAM Symposium on Discrete Algorithms, 2018. [36] Mertikopoulos, P., Hsieh, Y.-P., and Cevher, V. A unified stochastic approximation framework for learning in games. Mathematical Programming, 203:559 609, January 2024. [37] Nemirovski, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, NY, 1983. [38] Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/𝑘2). Proceedings of the USSR Academy of Sciences, 269(543-547), 1983. [39] Shub, M. Global Stability of Dynamical Systems. Springer-Verlag, Berlin, 1987. [40] Su, W., Boyd, S. P., and Candès, E. J. A differential equation for modeling Nesterov s accelerated gradient method: Theory and insights. In NIPS 14: Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 2510 2518, 2014. [41] Su, W., Boyd, S., and Candès, E. J. A differential equation for modeling Nesterov s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1 43, 2016. [42] Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. Fast convergence of regularized learning in games. In NIPS 15: Proceedings of the 29th International Conference on Neural Information Processing Systems, pp. 2989 2997, 2015. [43] Taylor, P. D. and Jonker, L. B. Evolutionary stable strategies and game dynamics. Mathematical Biosciences, 40(1-2), 1978. [44] Toonsi, S. and Shamma, J. S. Higher-order uncoupled dynamics do not lead to Nash equilibrium - except when they do. In Neur IPS 23: Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023. [45] Vovk, V. G. Aggregating strategies. In COLT 90: Proceedings of the 3rd Workshop on Computational Learning Theory, pp. 371 383, 1990.

In Appendix A, we discuss how our findings can be extended to general regularizers. Subsequently, Appendices B and C contain the technical proofs for the continuous and discrete time algorithms, respectively. Following this, Appendix D provides the convergence results of (FTXL) under partial information, specifically under realization-based and bandit feedback. We conclude this section with Appendix E, which presents the details of the numerical experiments.

A Auxiliary results for general regularizers

In this appendix, we briefly discuss how to obtain the convergence of (FTXL) for mirror maps 𝑄 beyond the logit map Λ. Namely, we consider regularizers that are decomposable, i.e., ℎ𝑖(𝑥𝑖) = P 𝛼𝑖 A𝑖𝜃𝑖(𝑥𝛼𝑖) such that 𝜃𝑖: [0, 1] ℝis continuous on [0, 1], twice differentiable on (0, 1] and strongly convex with 𝜃 𝑖(0+) = .

Lemma A.1. Suppose that 𝑥𝑛= 𝑄(𝑦𝑛) and for all 𝛼 A, 𝛼 𝛼 , it holds that 𝑦𝛼,𝑛 𝑦𝛼 ,𝑛 as 𝑛 . Then, 𝑥𝑛converges to 𝑥 , where 𝑥 is a point mass at 𝛼 . Moreover, it holds that:

𝛼 𝛼 (𝜃 ) 1 𝜃 (1) + 𝑦𝛼,𝑛 𝑦𝛼 ,𝑛 (A.1)

Proof. First, note that for 𝑥= 𝑄(𝑦), we have that 𝑥is the solution of the following optimization problem

𝑄(𝑦) = arg max

𝛼 A 𝑦𝛼𝑥𝛼 ℎ(𝑥) :

𝛼 A 𝑥𝛼= 1 and 𝛼 A : 𝑥𝛼 0

By solving the Karush Kuhn Tucker (KKT) conditions to this optimization problem we readily get that 𝑥lies in the interior of X, since 𝜃(0+) = , and thus we obtain that at the solution, it holds 𝑦𝛼= 𝜃 (𝑥𝛼) + 𝜆for 𝜆 ℝ. Therefore, we have:

𝑦𝛼,𝑛 𝑦𝛼 ,𝑛= 𝜃 (𝑥𝛼,𝑛) 𝜃 (𝑥𝛼 ,𝑛) (A.2)

or equivalently:

𝜃 (𝑥𝛼,𝑛) = 𝜃 (𝑥𝛼 ,𝑛) + 𝑦𝛼,𝑛 𝑦𝛼 ,𝑛 𝜃 (1) + 𝑦𝛼,𝑛 𝑦𝛼 ,𝑛 (A.3)

Now, assume that there exists 𝛼 A such that 𝑥𝛼,𝑛does not converge to 0, that is, lim sup𝑛𝑥𝛼,𝑛> 𝜀 for some 𝜀> 0. Then, since 𝜃is strongly convex, 𝜃 is strictly increasing, and thus 𝜃 (𝑥𝛼,𝑛) 𝜃 (𝜀) infinitely often. However, by taking 𝑛 in (A.3), it implies that 𝜃 (𝑥𝛼,𝑛) , which is a contradiction. Therefore, we conclude that for all 𝛼 𝛼 , it holds that lim𝑛 𝑥𝛼,𝑛= 0, and the convergence result follows.

Finally, note that since 𝜃 is strictly increasing, it is invertible and its inverse is strictly increasing as well. Thus, for each 𝛼 𝛼 we have:

𝑥𝛼,𝑛 (𝜃 ) 1 𝜃 (1) + 𝑦𝛼,𝑛 𝑦𝛼 ,𝑛 (A.4)

𝑥𝑛 𝑥 = 1 𝑥𝛼 ,𝑛=

𝛼 𝛼 (𝜃 ) 1 𝜃 (1) + 𝑦𝛼,𝑛 𝑦𝛼 ,𝑛 (A.5)

and our proof is complete.

B Proofs for Continuous Time Algorithms

In this appendix, we provide the proof of Theorem 1 and discuss the convergence of (FTXL-D) under a non-vanishing friction coefficient that is, 𝑟 𝑦instead of (𝑟/𝑡) 𝑦. First, we provide a lemma that is necessary for our analysis.

Lemma B.1. Let 𝑥 = (𝛼 1, . . . , 𝛼 𝑁) X be a strict Nash equilibrium of Γ, and let 𝑑denote the minimum payoff difference at equilibrium, i.e.,

𝑑:= min 𝑖 N min 𝛽𝑖 supp(𝑥 𝑖)[𝑢𝑖(𝑥 𝑖; 𝑥 𝑖) 𝑢𝑖(𝛽𝑖; 𝑥 𝑖)] . (B.1)

Then, for any 𝑐 (0, 𝑑), there exists 𝑀> 0 such that if 𝑦𝑖𝛼 𝑖 𝑦𝑖𝛼𝑖> 𝑀for all 𝛼𝑖 𝛼 𝑖 A𝑖and 𝑖 N, then 𝑣𝑖𝛼 𝑖(𝑄(𝑦)) 𝑣𝑖𝛼𝑖(𝑄(𝑦)) > 𝑐 for all 𝛼𝑖 𝛼 𝑖 A𝑖, and 𝑖 N . (B.2)

Proof. Since 𝑥 is a strict Nash equilibrium, the minimum payoff difference 𝑑at 𝑥 is bounded away from zero. Then, by continuity of the function 𝑥 𝑣(𝑥), there exists a neighborhood U of 𝑥 such that for any 𝑥 U , it holds

𝑣𝑖𝛼 𝑖(𝑥) 𝑣𝑖𝛼𝑖(𝑥) > 𝑐 for all 𝛼𝑖 𝛼 𝑖 A𝑖, and 𝑖 N (B.3)

Finally, by Giannou et al. [18, Lemma C.2.], there exists 𝑀> 0, such that 𝑄(𝑦) U for all 𝑦 V with 𝑦𝑖𝛼 𝑖 𝑦𝑖𝛼𝑖> 𝑀 for all 𝛼𝑖 𝛼 𝑖 A𝑖, and 𝑖 N (B.4)

Therefore, we readily get that if 𝑦 V satisfies the above relation, then

𝑣𝑖𝛼 𝑖(𝑄(𝑦)) 𝑣𝑖𝛼𝑖(𝑄(𝑦)) > 𝑐 for all 𝛼𝑖 𝛼 𝑖 A𝑖, and 𝑖 N .

We are now in a position to prove Theorem 1, which we restate below for convenience. Theorem 1. Let 𝑥 be a strict Nash equilibrium of Γ, and let 𝑥(𝑡) = 𝑄(𝑦(𝑡)) be a solution orbit of (FTXL-D). If 𝑥(0) is sufficiently close to 𝑥 , then 𝑥(𝑡) converges to 𝑥 ; in particular, if (FTXL-D) is run with logit best responses (that is, 𝑄 Λ), we have

𝑥(𝑡) 𝑥 exp 𝐶 𝑐𝑡2

where 𝐶> 0 is a constant that depends only on the initialization of (FTXL-D) and

2 min 𝑖 N min 𝛽𝑖 supp(𝑥 𝑖)[𝑢𝑖(𝑥 𝑖; 𝑥 𝑖) 𝑢𝑖(𝛽𝑖; 𝑥 𝑖)] > 0 (7)

is the minimum payoff difference at equilibrium.

Proof. First of all, since 𝑥 is a strict Nash equilibrium, by Lemma B.1 for

2 min 𝑖 N min 𝛽𝑖 supp(𝑥 𝑖)[𝑢𝑖(𝑥 𝑖; 𝑥 𝑖) 𝑢𝑖(𝛽𝑖; 𝑥 𝑖)]

there exists 𝑀> 0 such that if 𝑦𝑖𝛼 𝑖 𝑦𝑖𝛼𝑖> 𝑀for all 𝛼𝑖 𝛼 𝑖 A𝑖and 𝑖 N, then

𝑣𝑖𝛼 𝑖(𝑄(𝑦)) 𝑣𝑖𝛼𝑖(𝑄(𝑦)) > 𝑐 for all 𝛼𝑖 𝛼 𝑖 A𝑖, and 𝑖 N . (B.5)

From now on, for notational convenience, we focus on player 𝑖 N and drop the player-specific indices altogether. Then, for 𝛼 𝛼 A, we let 𝑧𝛼(𝑡) := 𝑦𝛼(𝑡) 𝑦 𝛼(𝑡), which evolves as:

𝑧(𝑡) = 𝑣𝛼(𝑥(𝑡)) 𝑣𝛼 (𝑥(𝑡)) 𝑟

𝑡 𝑧𝛼(𝑡) (B.6)

Let 𝑦(0) such that 𝑧𝛼(0) = 𝑀 𝜀, for all 𝛼 𝛼 A, where 𝜀> 0 small. We will, first, show that 𝑧(𝑡) < 𝑀for all 𝑡 0. For the sake of contradiction, and denoting 𝑇0 := inf{𝑡 0 : 𝑧(𝑡) 𝑀}, suppose that 𝑇0 < . Then, we readily get that for all 𝑡< 𝑇0, it holds

𝑣𝛼(𝑥(𝑡)) 𝑣𝛼 (𝑥(𝑡)) < 𝑐 (B.7)

and therefore, for all 𝑡 𝑇0:

𝑧𝛼(𝑡)𝑡𝑟+ 𝑟𝑡𝑟 1 𝑧𝛼(𝑡) = 𝑡𝑟[𝑣𝛼(𝑥) 𝑣𝛼 (𝑥)] 𝑐𝑡𝑟 (B.8)

which can be rewritten as: 𝑑 𝑑𝑡( 𝑧𝛼(𝑡)𝑡𝑟) 𝑐𝑡𝑟 (B.9)

Integrating over 𝑡< 𝑇0, we obtain 𝑧𝛼(𝑡)𝑡𝑟 𝑐𝑡𝑟+1/(𝑟+ 1), which readily implies:

𝑧𝛼(𝑡) 𝑧𝛼(0) 𝑐 2(𝑟+ 1) 𝑡2

< 𝑀 𝑐 2(𝑟+ 1) 𝑡2 (B.10)

By sending 𝑡 𝑇0, we arrive at a contradiction. Therefore 𝑧𝛼(𝑡) < 𝑀for all 𝑡 0, and the previous equation implies that for all 𝑡 0 :

𝑧𝛼(𝑡) 𝑧𝛼(0) 𝑐 2(𝑟+ 1) 𝑡2 (B.11)

and invoking Lemma A.1, we get the convergence result. Finally, translating the score-differences to the primal space X, we get:

𝑥(𝑡) 𝑥 = max 𝑖 N 1 𝑥𝑖𝛼 𝑖(𝑡) (B.12)

For the case of logit best responses, i.e., when 𝑄 Λ, and assuming that the maximum above is attained for player 𝑖 N, we obtain

P 𝛼𝑖 𝛼 𝑖exp(𝑧𝛼𝑖(𝑡))

1 + P 𝛼𝑖 𝛼 𝑖exp(𝑧𝛼𝑖(𝑡))

𝛼𝑖 𝛼 𝑖 exp(𝑧𝛼𝑖(𝑡))

|A𝑖| exp 𝑧𝛼𝑖(0) 𝑐 2(𝑟+ 1) 𝑡2

exp 𝐶 𝑐 2(𝑟+ 1) 𝑡2 (B.13)

for 𝐶= log|A𝑖| + 𝑧𝛼𝑖(0).

Now, moving to the case where we use a constant friction coefficient 𝑟 𝑦instead of (𝑟/𝑡) 𝑦, (FTXL-D) becomes: 𝑑2𝑦

𝑑𝑡2 = 𝑣(𝑄(𝑦)) 𝑟𝑑𝑦

Under, (B.14), we obtain the following convergence result.

Theorem B.1. Let 𝑥 be a strict Nash equilibrium of Γ, and let 𝑥(𝑡) = 𝑄(𝑦(𝑡)) be a solution orbit of (B.14). If 𝑥(0) is sufficiently close to 𝑥 , then 𝑥(𝑡) converges to 𝑥 ; in particular, if (B.14) is run with logit best responses (that is, 𝑄 Λ), we have

𝑥(𝑡) 𝑥 exp 𝐶 𝑐

where 𝐶> 0 is a constant that depends on the initialization of (B.14) and

2 min 𝑖 N min 𝛽𝑖 supp(𝑥 𝑖)[𝑢𝑖(𝑥 𝑖; 𝑥 𝑖) 𝑢𝑖(𝛽𝑖; 𝑥 𝑖)] > 0 (B.16)

is the minimum payoff difference at equilibrium.

Proof. The initial steps of proof of Theorem B.1 are similar to the proof of Theorem 1, which we include for the sake of completeness.

Specifically, by Lemma B.1 there exists 𝑀> 0 such that if 𝑦𝑖𝛼 𝑖 𝑦𝑖𝛼𝑖> 𝑀for all 𝛼𝑖 𝛼 𝑖 A𝑖and 𝑖 N, then 𝑣𝑖𝛼 𝑖(𝑄(𝑦)) 𝑣𝑖𝛼𝑖(𝑄(𝑦)) > 𝑐 for all 𝛼𝑖 𝛼 𝑖 A𝑖, and 𝑖 N . (B.17)

Now, for notational convenience, we focus on player 𝑖 N and drop the player-specific indices altogether. Then, for 𝛼 𝛼 A, we let 𝑧𝛼(𝑡) := 𝑦𝛼(𝑡) 𝑦 𝛼(𝑡), which evolves as:

𝑧(𝑡) = 𝑣𝛼(𝑥(𝑡)) 𝑣𝛼 (𝑥(𝑡)) 𝑟 𝑧𝛼(𝑡) (B.18)

Let 𝑦(0) such that 𝑧𝛼(0) = 𝑀 𝜀, for all 𝛼 𝛼 A, where 𝜀> 0 small. As in the proof of Theorem 1, we will, first, show that 𝑧(𝑡) < 𝑀for all 𝑡 0. For the sake of contradiction, and denoting 𝑇0 := inf{𝑡 0 : 𝑧(𝑡) 𝑀}, suppose that 𝑇0 < . Then, we readily get that for all 𝑡< 𝑇0, it holds 𝑣𝛼(𝑥(𝑡)) 𝑣𝛼 (𝑥(𝑡)) < 𝑐 (B.19)

and therefore, for all 𝑡 𝑇0:

𝑧𝛼(𝑡)𝑒𝑟𝑡+ 𝑟𝑒𝑟𝑡 𝑧𝛼(𝑡) = 𝑒𝑟𝑡[𝑣𝛼(𝑥) 𝑣𝛼 (𝑥)] 𝑐𝑒𝑟𝑡 (B.20)

which can be rewritten as: 𝑑 𝑑𝑡 𝑧𝛼(𝑡)𝑒𝑟𝑡 𝑐𝑒𝑟𝑡 (B.21)

Integrating over 𝑡< 𝑇0, and using that 𝑧𝛼(0) = 0, we obtain 𝑧𝛼(𝑡) 𝑐/𝑟+ 𝑐𝑒 𝑟𝑡/𝑟, which implies:

𝑧𝛼(𝑡) 𝑧𝛼(0) 𝑐

𝑟2 𝑟𝑡+ 𝑒 𝑟𝑡 1

< 𝑧𝛼(0) < 𝑀 (B.22)

where we used the fact that 𝑥+ 𝑒 𝑥 1 0 for all 𝑥 ℝwith equality if and only if 𝑥= 0. By sending 𝑡 𝑇0, we arrive at a contradiction. Therefore 𝑧𝛼(𝑡) < 𝑀for all 𝑡 0, and the previous equation implies that for all 𝑡 0 :

𝑧𝛼(𝑡) 𝑧𝛼(0) 𝑐

and invoking Lemma A.1 for 𝜃(𝑥) = 𝑥log 𝑥, we get the convergence result.

C Proofs for discrete-time algorithms with full information

In this section, we provide the results for the (FTXL) algorithm with full-information feedback. First, we discuss the rates obtained by the direct discretization of (FTXL-D) with both vanishing and nonvanishing friction, and then provide the proof of Theorem 2, our main result, for the full-information case.

C.1. FTXL with vanishing friction. First, we provide the rate of convergence for the discrete version of (FTXL-D) with vanishing friction:

𝑝𝑖,𝑛+1 = 𝑝𝑖,𝑛 1 𝛾𝑟

𝑦𝑖,𝑛+1 = 𝑦𝑖,𝑛+ 𝛾𝑝𝑖,𝑛+1 (C.1)

To streamline our presentation, we consider the setup of Example 3.1 that provides a lower bound for the algorithm.

Proposition C.1. Consider the single-player game Γ with actions A and B such that 𝑢(A) 𝑢(B) = 1 of Example 3.1, and let 𝑥𝑛= Λ(𝑦𝑛) be the sequence of play generated by (C.1). Then, denoting by 𝑥 = (1, 0) the strict Nash equilibrium, we have:

𝑥𝑇 𝑥 exp 𝐶 𝛾2𝑇2

where 𝐶> 0 is a constant that depends only on the initialization of the algorithm.

Proof. We first define the score-difference

𝑤𝑛:= 𝑝B,𝑛 𝑝A,𝑛 (C.3)

with initial condition 𝑤1 = 0. Then, unfolding according to the sequence of play, we obtain:

𝑤𝑛+1 = 𝑤𝑛 1 𝛾𝑟

+ 𝛾(𝑢(B) 𝑢(A))

We next define for 𝑛 ℕthe difference 𝑧𝑛:= 𝑦B,𝑛 𝑦A,𝑛. Thus, unfolding it, we obtain:

𝑧𝑛+1 = 𝑧𝑛+ 𝛾𝑤𝑛+1

Now, using Lemma C.1, which we provide after this proof, we obtain that

𝑧𝑛+1 = 𝑧1 𝛾2 𝑛

1 + 𝛾𝑟 1 1 + 𝛾𝑟

= 𝑧1 𝛾2 𝑛(𝑛+ 1)

2(1 + 𝛾𝑟) 𝛾2𝑛 1 𝛾𝑟 1 + 𝛾𝑟

2(1 + 𝛾𝑟) + Θ(𝑛) (C.6)

and invoking Lemma A.1 for 𝜃(𝑥) = 𝑥log 𝑥, we get the result.

The following lemma is a necessary tool for obtaining the exact convergence rate in Proposition C.2. Lemma C.1. For any 𝑚 ℕand 𝑎> 0, we have that

ℓ=0 (1 𝑎 𝑚 ℓ) = 𝑚 𝑎

1 + 𝑎 1 1 + 𝑎

Proof. First, by expanding the inner product, we can rewrite the expression as

ℓ=0 (1 𝑎 𝑚 ℓ) = 𝑚 1

ℓ=0 ( 𝑚 ℓ 𝑎

(𝑚 𝑎) . . . (𝑚 𝑘+ 1 𝑎)

𝑚. . . (𝑚 𝑘+ 1)

(𝑚 𝑎)!(𝑚 𝑘)!

(𝑚 𝑘)! (𝑚 𝑘 𝑎)! (C.8)

where with a slight abuse of notation we use the factorial notation (𝑚 𝑎)! to denote the Gamma function evaluated at 𝑚 𝑎+ 1, i.e., Γ(𝑚 𝑎+ 1).

Now, defining the quantity

𝐹𝑚:= (𝑚 𝑎)!

(𝑚 𝑘)! (𝑚 𝑘 𝑎)!

the difference of two consecutive terms evolves as:

𝐹𝑚+1 𝐹𝑚= (𝑚+ 1 𝑎)!

(𝑚+ 1 𝑘)! (𝑚+ 1 𝑘 𝑎)! (𝑚 𝑎)!

(𝑚 𝑘)! (𝑚 𝑘 𝑎)!

𝑚+ 1 + (𝑚+ 1 𝑎)!

(𝑚+ 1 𝑘)! (𝑚+ 1 𝑘 𝑎)! (𝑚 𝑎)!

(𝑚 𝑘)! (𝑚 𝑘 𝑎)!

𝑚+ 1 + (𝑚+ 1 𝑎)!

(𝑚+ 1 𝑘)! (𝑚+ 1 𝑘 𝑎)! (𝑚 𝑎)!

(𝑚 𝑘+ 1)! (𝑚 𝑘+ 1 𝑎)!

(𝑚+ 1 𝑎)!(𝑚+ 1 𝑘)! (𝑚+ 1)(𝑚 𝑎)!(𝑚 𝑘+ 1)!

(𝑚+ 1)!(𝑚+ 1 𝑎 𝑘)!

(𝑚 𝑎)!(𝑚+ 1 𝑘)!(𝑚+ 1 𝑎 𝑚 1)

(𝑚+ 1)!(𝑚+ 1 𝑎 𝑘)!

(𝑚 𝑎)!(𝑚+ 1 𝑘)! (𝑚+ 1)!(𝑚+ 1 𝑎 𝑘)!

𝑚+ 1 𝑎 𝑚+ 1 𝑎

(𝑚+ 1 𝑘)!(𝑚+ 1 𝑎)! (𝑚+ 1)!(𝑚+ 1 𝑎 𝑘)! 𝑚+ 1 𝑎

= 1 𝑎 𝑚+ 1 𝑎𝐹𝑚+1 (C.9)

Thus, we readily obtain the recurrence relation 𝑚+ 1 𝑚+ 1 𝑎𝐹𝑚+1 = 𝐹𝑚+ 1 . (C.10)

We continue the proof by induction. To this end, we will show that

1 + 𝑎+ 𝑎 1 + 𝑎

For the base case, note that

𝐹1 = (1 𝑎) = 1 𝑎

1 + 𝑎+ 𝑎 1 + 𝑎(1 𝑎) (C.12)

For the inductive step, suppose that (C.11) holds for 𝑚 ℕ. Then, we have: 𝑚+ 1 𝑚+ 1 𝑎𝐹𝑚+1 = 𝑚 𝑎

1 + 𝑎+ 𝑎 1 + 𝑎

1 + 𝑎+ 𝑎 1 + 𝑎

which implies the inductive step

𝐹𝑚+1 = 𝑚+ 1 𝑎

1 + 𝑎 + 𝑎 1 + 𝑎

and thus (C.11) holds for all 𝑚 ℕ. Finally, to complete the proof notice that

ℓ=0 (1 𝑎 𝑚 ℓ) = (𝑚 𝑎)!

(𝑚 𝑘)! (𝑚 𝑘 𝑎)!

1 + 𝑎+ 𝑎 1 + 𝑎

1 + 𝑎 1 1 + 𝑎

as was to be shown.

Next, we discuss the cases of non-vanishing and zero friction.

C.2. FTXL with non-vanishing friction. We continue this section by considering the case of non-vanishing friction in analogy to the continuous-time case, as per Appendix B. Specifically, we consider the discrete version of (FTXL-D) with non-vanishing friction, as follows: 𝑝𝑖,𝑛+1 = 𝑝𝑖,𝑛(1 𝛾𝑟) + 𝛾ˆ𝑣𝑖,𝑛 𝑦𝑖,𝑛+1 = 𝑦𝑖,𝑛+ 𝛾𝑝𝑖,𝑛+1 (C.16)

with 𝛾𝑟< 1. Below, we provide the rate of convergence for the setup of 𝐸𝑥𝑎𝑚𝑝𝑙𝑒3.1, as we did before. Namely, we obtain a linear convergence rate, as the following proposition suggests. Proposition C.2. Consider the single-player game Γ with actions A and B such that 𝑢(A) 𝑢(B) = 1 of Example 3.1, and let 𝑥𝑛= Λ(𝑦𝑛) be the sequence of play generated by (C.16). Then, denoting by 𝑥 = (1, 0) the strict Nash equilibrium, we have:

𝑥𝑛 𝑥 exp 𝐶 𝛾

𝑟𝑛 . (C.17)

where 𝐶> 0 is a constant that depends on the initialization of the algorithm.

Proof. We first define the score-difference 𝑤𝑛:= 𝑝B,𝑛 𝑝A,𝑛 (C.18) with initial condition 𝑤1 = 0. Then, unfolding according to the sequence of play, we obtain: 𝑤𝑛+1 = 𝑤𝑛(1 𝛾𝑟) + 𝛾(𝑢(B) 𝑢(A)) = 𝑤𝑛(1 𝛾𝑟) 𝛾 = . . .

𝑘=0 (1 𝛾𝑟)𝑘

= 1 (1 𝛾𝑟)𝑛

We next define for 𝑛 ℕthe difference 𝑧𝑛:= 𝑦B,𝑛 𝑦A,𝑛. Thus, unfolding it, we obtain: 𝑧𝑛+1 = 𝑧𝑛+ 𝛾𝑤𝑛+1

= 𝑧𝑛 𝛾1 (1 𝛾𝑟)𝑛

𝑛 (1 𝛾𝑟) 1 (1 𝛾𝑟)𝑛

𝑟𝑛+ O(1) (C.20)

and invoking Lemma A.1 for 𝜃(𝑥) = 𝑥log 𝑥, we get the result.

C.3. FTXL with zero friction. Moving forward to the case of 𝑟= 0 as presented in Section 4, we provide the proof of Theorem 2, which we restate below for convenience. Theorem 2. Let 𝑥 be a strict Nash equilibrium of Γ, and let 𝑥𝑛= 𝑄(𝑦𝑛) be the sequence of play generated by (FTXL) with full information feedback of the form (11a). If 𝑥1 is initialized sufficiently close to 𝑥 , then 𝑥𝑛converges to 𝑥 ; in particular, if (FTXL) is run with logit best responses (that is, 𝑄 Λ), we have

𝑥𝑇 𝑥 exp 𝐶 𝑐𝛾2𝑇(𝑇 1)

= exp Θ(𝑇2) (12)

where 𝐶> 0 is a constant that depends only on the initialization of (FTXL) and

2 min 𝑖 N min 𝛽𝑖 supp(𝑥 𝑖)[𝑢𝑖(𝑥 𝑖; 𝑥 𝑖) 𝑢𝑖(𝛽𝑖; 𝑥 𝑖)] > 0 (13)

is the minimum payoff difference at equilibrium.

Proof. First of all, since 𝑥 is a strict Nash equilibrium, by Lemma B.1 for

2 min 𝑖 N min 𝛽𝑖 supp(𝑥 𝑖)[𝑢𝑖(𝑥 𝑖; 𝑥 𝑖) 𝑢𝑖(𝛽𝑖; 𝑥 𝑖)]

there exists 𝑀> 0 such that if 𝑦𝑖𝛼 𝑖 𝑦𝑖𝛼𝑖> 𝑀for all 𝛼𝑖 𝛼 𝑖 A𝑖and 𝑖 N, then

𝑣𝑖𝛼 𝑖(𝑄(𝑦)) 𝑣𝑖𝛼𝑖(𝑄(𝑦)) > 𝑐 for all 𝛼𝑖 𝛼 𝑖 A𝑖, and 𝑖 N . (C.21)

For notational convenience, we focus on player 𝑖and drop the player-specific indices altogether. Let 𝛼 𝛼 A, and define for 𝑛 ℕthe quantities 𝑤𝛼,𝑛and 𝑧𝛼,𝑛as

𝑤𝛼,𝑛:= 𝑝𝑛, 𝑒𝛼 𝑒 𝛼 , 𝑧𝛼,𝑛:= 𝑦𝑛, 𝑒𝛼 𝑒 𝛼 (C.22)

where 𝑒𝛼, 𝑒 𝛼are the standard basis vectors corresponding to 𝛼, 𝛼 A.

Let initial conditions 𝑦1 such that 𝑦𝛼,1 𝑦𝛼 ,1 = 𝑀 𝜀, for all 𝛼 𝛼 A, where 𝜀> 0 small, and 𝑝1 = 0. We will first show by induction that 𝑧𝛼,𝑛< 𝑀for all 𝑛 ℕ. To this end, unfolding the recursion, we obtain:

𝑤𝛼,𝑛+1 = 𝑤𝛼,𝑛+ 𝛾 ˆ𝑣𝑛, 𝑒𝛼 𝑒 𝛼 = 𝑤𝛼,𝑛+ 𝛾 𝑣(𝑥𝑛), 𝑒𝛼 𝑒 𝛼

𝑘=1 𝑣(𝑥𝑘), 𝑒𝛼 𝑒 𝛼 (C.23)

where we used that 𝑤1 = 0. Now, for the sake of induction, suppose that

𝑧𝛼,𝑘< 𝑀 for all 𝑘= 1, . . . , 𝑛 (C.24)

which implies that 𝑣(𝑥𝑘), 𝑒𝛼 𝑒 𝛼 < 𝑐. With this in hand, we will prove that 𝑧𝛼,𝑛+1 < 𝑀, as well. Specifically, we have:

𝑧𝛼,𝑛+1 = 𝑧𝛼,𝑛+ 𝛾𝑤𝛼,𝑛+1 = 𝑧𝛼,𝑛+ 𝛾2 𝑛

𝑘=1 𝑣(𝑥𝑘), 𝑒𝛼 𝑒 𝛼

where we used the inductive hypothesis and the initial condition. Therefore, we conclude by induction that 𝑧𝛼,𝑛< 𝑀for all 𝑛 ℕ. Thus, we readily obtain that after 𝑇time-step:

𝑧𝑇 𝑧𝛼,1 𝑐𝛾2 𝑇 1

ℓ=1 ℓ 𝑧𝛼,1 𝑐𝛾2𝑇(𝑇 1)

and invoking Lemma A.1 for 𝜃(𝑥) = 𝑥log 𝑥, we get the result.

D Proofs for discrete-time algorithms with partial information

In this appendix, we provide the proofs of Theorem 3 and Theorem 4 that correspond to the convergence of (FTXL) with realization-based and bandit feedback, respectively. For this, we need the following lemma, which provides a maximal bound on a martingale process. Namely, we have:

Lemma D.1. Let 𝑀𝑛:= P𝑛 𝑘=1 𝛾𝑘𝜉𝑘be a martingale with respect to (F𝑛)𝑛 ℕwith 𝔼[ 𝜉𝑛 𝑞 ] 𝜎𝑞 𝑛 for some 𝑞> 2. Then, for 𝜇 (0, 1) and 𝑛 ℕ:

sup 𝑘 𝑛 |𝑀𝑘| > 𝑐

P𝑛 𝑘=1 𝛾𝑞/2+1 𝑘 𝜎𝑞 𝑘 P𝑛 𝑘=1 𝛾𝑘 1+𝑞(𝜇 1/2) (D.1)

where 𝐴𝑞is a constant depending only on 𝑐and 𝑞.

Proof. Fix some 𝜇 (0, 1). By Doob s maximal inequality [20, Corollary 2.1], we have:

sup 𝑘 𝑛 |𝑀𝑘| > 𝑐

𝔼[|𝑀𝑛|𝑞] 𝑐𝑞 P𝑛 𝑘=1 𝛾𝑘 𝑞𝜇 (D.2)

Now, applying the Burkholder Davis Gundy inequality [20, Theorem 2.10], we get that

𝔼[|𝑀𝑛|𝑞] 𝐴𝑞𝔼

𝑘=1 𝛾2 𝑘 𝜉𝑘 2

where 𝐴𝑞is a constant depending only on 𝑐and 𝑞. Now, we will invoke the generalized Hölder s inequality [4], we have: 𝑛

𝑘=1 𝑎(1 𝜆)𝜌 𝑘 𝑏𝜌 𝑘 (D.4)

for 𝑎𝑘, 𝑏𝑘 0, 𝜌> 1 and 𝜆 [0, 1). Thus, setting 𝑎𝑘= 𝛾2 𝑘, 𝑏𝑘= 𝜉𝑘 2 , 𝜌= 𝑞/2 and 𝜆= 1/2 1/𝑞, (D.2), combined with (D.3), becomes:

sup 𝑘 𝑛 |𝑀𝑘| > 𝑐

P𝑛 𝑘=1 𝛾𝑘 𝑞/2 1 P𝑛 𝑘=1 𝛾𝑞/2+1 𝑘 𝔼[ 𝜉𝑘 𝑞 ] P𝑛 𝑘=1 𝛾𝑘 𝑞𝜇

P𝑛 𝑘=1 𝛾𝑞/2+1 𝑘 𝜎𝑞 𝑘 P𝑛 𝑘=1 𝛾𝑘 1+𝑞(𝜇 1/2) (D.5)

and our proof is complete.

With this tool in hand, we proceed to prove the convergence of (FTXL) under realization-based feedback. For convenience, we restate the relevant result below. Theorem 3. Let 𝑥 be a strict Nash equilibrium of Γ, fix some confidence level 𝛿> 0, and let 𝑥𝑛= 𝑄(𝑦𝑛) be the sequence of play generated by (FTXL) with realization-based feedback as per (11b) and a sufficiently small step-size 𝛾> 0. Then there exists a neighborhood U of 𝑥 such that

ℙ(𝑥𝑛 𝑥 as 𝑛 ) 1 𝛿 if 𝑥1 U. (14)

In particular, if (FTXL) is run with logit best responses (that is, 𝑄 Λ), there exist positive constants 𝐶, 𝑐> 0 as in Theorem 2 such that on the event {𝑥𝑛 𝑥 as 𝑛 }:

𝑥𝑇 𝑥 exp 𝐶 𝑐𝛾2𝑇(𝑇 1)

5𝑐𝛾5/3𝑇5/3 = exp Θ(𝑇2) . (15)

Proof. First of all, since 𝑥 is a strict Nash equilibrium, by Lemma B.1 for

2 min 𝑖 N min 𝛽𝑖 supp(𝑥 𝑖)[𝑢𝑖(𝑥 𝑖; 𝑥 𝑖) 𝑢𝑖(𝛽𝑖; 𝑥 𝑖)]

there exists 𝑀> 0 such that if 𝑦𝑖𝛼 𝑖 𝑦𝑖𝛼𝑖> 𝑀for all 𝛼𝑖 𝛼 𝑖 A𝑖and 𝑖 N, then

𝑣𝑖𝛼 𝑖(𝑄(𝑦)) 𝑣𝑖𝛼𝑖(𝑄(𝑦)) > 𝑐 for all 𝛼𝑖 𝛼 𝑖 A𝑖, and 𝑖 N . (D.6)

For notational convenience, we focus on player 𝑖and drop the player-specific indices altogether. Let 𝛼 𝛼 A, and define for 𝑛 ℕthe quantities 𝑤𝛼,𝑛and 𝑧𝛼,𝑛as

𝑤𝛼,𝑛:= 𝑝𝑛, 𝑒𝛼 𝑒 𝛼 , 𝑧𝛼,𝑛:= 𝑦𝑛, 𝑒𝛼 𝑒 𝛼 (D.7)

where 𝑒𝛼, 𝑒 𝛼are the standard basis vectors corresponding to 𝛼, 𝛼 A.

Then, unfolding the recursion, we obtain:

𝑤𝛼,𝑛+1 = 𝑤𝛼,𝑛+ 𝛾 ˆ𝑣𝑛, 𝑒𝛼 𝑒 𝛼 = 𝑤𝛼,𝑛+ 𝛾 𝑣(𝑥𝑛), 𝑒𝛼 𝑒 𝛼 + 𝛾 𝑈𝑛, 𝑒𝛼 𝑒 𝛼

𝑘=1 𝑣(𝑥𝑘), 𝑒𝛼 𝑒 𝛼 + 𝛾 𝑛

𝑘=1 𝑈𝑘, 𝑒𝛼 𝑒 𝛼 (D.8)

where we used that 𝑤1 = 0. Now, define the stochastic process {𝑀𝑛}𝑛 ℕas

𝑘=1 𝑈𝑘, 𝑒𝛼 𝑒 𝛼 (D.9)

which is a martingale, since 𝔼[𝑈𝑛| F𝑛] = 0. Moreover, note that

𝑈𝑛 = 𝑣(𝛼𝑛) 𝑣(𝑥𝑛) 2 max 𝛼 A 𝑣(𝛼) (D.10)

and, thus, we readily obtain that 𝔼[ 𝑈𝑛 𝑞 | F𝑛] 𝜎𝑞for 𝜎= 2 max𝛼 A 𝑣(𝛼) and all 𝑞 [1, ].

By Lemma D.1 for 𝛾𝑛= 𝛾, 𝜎𝑛= 𝜎, 𝜉𝑛= 𝑈𝑛, 𝑒𝛼 𝑒 𝛼 , 𝑐as in Theorem 2, and 𝜇 (0, 1), 𝑞> 2 whose values will be determined next, there exists 𝐴𝑞> 0 such that:

𝛿𝑛:= ℙ sup 𝑘 𝑛 |𝑀𝑘| > 𝑐(𝛾𝑛)𝜇 𝐴𝑞𝜎𝑞 𝑛𝛾𝑞/2+1

(𝛾𝑛)1+𝑞(𝜇 1/2)

𝐴𝑞𝜎𝑞 𝛾𝑞(1 𝜇)

𝑛𝑞(𝜇 1/2) (D.11)

Now, we need to guarantee that there exist 𝜇 (0, 1), 𝑞> 2, such that

𝑛=1 𝛿𝑛< (D.12)

For this, we simply need 𝑞(𝜇 1/2) > 1, or equivalently, 𝜇> 1/2 + 1/𝑞, which implies that 𝜇 (1/2, 1).

Therefore, for 𝛾small enough, we get P 𝑛=1 𝛿𝑛< 𝛿, and therefore:

sup 𝑘 𝑛 |𝑀𝑘| 𝑐(𝛾𝑛)𝜇 !

sup 𝑘 𝑛 |𝑀𝑘| > 𝑐(𝛾𝑛)𝜇 !

From now on, we denote the good event T 𝑛=1 sup𝑘 𝑛|𝑀𝑘| 𝑐(𝛾𝑛)𝜇 by 𝐸. Then, with probability at least 1 𝛿:

𝑘=1 𝑣(𝑥𝑘), 𝑒𝛼 𝑒 𝛼 + 𝑐(𝛾𝑛)𝜇 for all 𝑛 ℕ. (D.14)

Furthermore, we have that for 𝑛> 𝑁0 := 1/𝛾 , we readily get that 𝛾𝑛> (𝛾𝑛)𝜇. Therefore, setting

𝑘=1 ((𝛾𝑘)𝜇 𝛾𝑘) (D.15)

𝑘=1 (𝛾𝑘 (𝛾𝑘)𝜇) 𝑅 (D.16)

for all 𝑛 ℕ. Then, initializing 𝑦1 such that 𝑧𝛼,1 < 𝑀 𝑅, we will show that 𝑧𝛼,𝑛< 𝑀for all 𝑛 ℕwith probability at least 1 𝛿. For this, suppose that 𝐸is realized, and assume that

𝑧𝛼,𝑘< 𝑀 for all 𝑘= 1, . . . , 𝑛 (D.17)

We will show that 𝑧𝛼,𝑛+1 < 𝑀, as well. For this, we have:

𝑧𝛼,𝑛+1 = 𝑧𝛼,𝑛+ 𝛾𝑤𝛼,𝑛+1

𝑘=1 𝑣(𝑥𝑘), 𝑒𝛼 𝑒 𝛼 + 𝑐(𝛾𝑛)𝜇 !

𝑧𝛼,𝑛 𝑐𝛾(𝛾𝑛 (𝛾𝑛)𝜇)

𝑘=1 (𝛾𝑘 (𝛾𝑘)𝜇)

𝑘=1 (𝛾𝑘 (𝛾𝑘)𝜇)

Therefore, we conclude by induction that 𝑧𝛼,𝑛< 𝑀for all 𝑛 ℕ. Thus, we readily obtain that with probability at least 1 𝛿it holds:

𝑧𝛼,𝑇 𝑧𝛼,1 𝑐𝛾 𝑇 1

𝑘=1 (𝛾𝑘 (𝛾𝑘)𝜇)

𝑧𝛼,1 𝑐𝛾2𝑇(𝑇 1)

2 + 𝑐𝛾1+𝜇 𝑇

𝑧𝛼,1 𝑐𝛾2𝑇(𝑇 1)

2 + 𝑐𝛾1+𝜇𝑇𝜇+1

𝜇+ 1 (D.19)

for all 𝑇 ℕ. Setting 𝜇= 2/3 and invoking Lemma A.1 for 𝜃(𝑥) = 𝑥log 𝑥, we get the result.

Finally, we prove the convergence of (FTXL) with bandit feedback. Again, for convenience, we restate the relevant result below. Theorem 4. Let 𝑥 be a strict Nash equilibrium of Γ, fix some confidence level 𝛿> 0, and let 𝑥𝑛= 𝑄(𝑦𝑛) be the sequence of play generated by (FTXL) with bandit feedback of the form (11c), an IWE exploration parameter 𝜀𝑛 1/𝑛ℓ𝜀for some ℓ𝜀 (0, 1/2), and a sufficiently small step-size 𝛾> 0. Then there exists a neighborhood U of 𝑥 in X such that

ℙ(𝑥𝑛 𝑥 as 𝑛 ) 1 𝛿 if 𝑥1 U. (17)

In particular, if (FTXL) is run with logit best responses (that is, 𝑄 Λ), there exist positive constants 𝐶, 𝑐> 0 as in Theorem 2 such that on the event {𝑥𝑛 𝑥 as 𝑛 }

𝑥𝑇 𝑥 exp 𝐶 𝑐𝛾2𝑇(𝑇 1)

9𝑐𝛾9/5𝑇9/5 = exp Θ(𝑇2) . (18)

Proof. First of all, since 𝑥 is a strict Nash equilibrium, by Lemma B.1 for

2 min 𝑖 N min 𝛽𝑖 supp(𝑥 𝑖)[𝑢𝑖(𝑥 𝑖; 𝑥 𝑖) 𝑢𝑖(𝛽𝑖; 𝑥 𝑖)]

there exists 𝑀> 0 such that if 𝑦𝑖𝛼 𝑖 𝑦𝑖𝛼𝑖> 𝑀for all 𝛼𝑖 𝛼 𝑖 A𝑖and 𝑖 N, then

𝑣𝑖𝛼 𝑖(𝑄(𝑦)) 𝑣𝑖𝛼𝑖(𝑄(𝑦)) > 𝑐 for all 𝛼𝑖 𝛼 𝑖 A𝑖, and 𝑖 N . (D.20)

For notational convenience, we focus on player 𝑖and drop the player-specific indices altogether. Let 𝛼 𝛼 A, and define for 𝑛 ℕthe quantities 𝑤𝛼,𝑛and 𝑧𝛼,𝑛as

𝑤𝛼,𝑛:= 𝑝𝑛, 𝑒𝛼 𝑒 𝛼 , 𝑧𝛼,𝑛:= 𝑦𝑛, 𝑒𝛼 𝑒 𝛼 (D.21)

where 𝑒𝛼, 𝑒 𝛼are the standard basis vectors corresponding to 𝛼, 𝛼 A. For notational convenience, we focus on player 𝑖and drop the player-specific indices altogether. Now, decomposing the IWE ˆ𝑣𝑛, we obtain ˆ𝑣𝑛= 𝑣(𝑥𝑛) + 𝑈𝑛+ 𝑏𝑛 (D.22) where 𝑈𝑛:= ˆ𝑣𝑛 𝑣𝑖( ˆ𝑥𝑛) is a zero-mean noise, and 𝑏𝑖,𝑛:= 𝑣𝑖( ˆ𝑥𝑛) 𝑣𝑖(𝑥𝑛).

Then, unfolding the recursion, we obtain:

𝑤𝛼,𝑛+1 = 𝑤𝛼,𝑛+ 𝛾 ˆ𝑣𝑛, 𝑒𝛼 𝑒 𝛼 = 𝑤𝛼,𝑛+ 𝛾 𝑣(𝑥𝑛), 𝑒𝛼 𝑒 𝛼 + 𝛾 𝑈𝑛, 𝑒𝛼 𝑒 𝛼 + 𝛾 𝑏𝑛, 𝑒𝛼 𝑒 𝛼 𝑤𝛼,𝑛+ 𝛾 𝑣(𝑥𝑛), 𝑒𝛼 𝑒 𝛼 + 𝛾 𝑈𝑛, 𝑒𝛼 𝑒 𝛼 + 2𝛾 𝑏𝑛

𝑘=1 𝑣(𝑥𝑘), 𝑒𝛼 𝑒 𝛼 + 𝛾 𝑛

𝑘=1 𝑈𝑘, 𝑒𝛼 𝑒 𝛼 + 2𝛾 𝑛

𝑘=1 𝑣(𝑥𝑘), 𝑒𝛼 𝑒 𝛼 + 𝛾 𝑛

𝑘=1 𝑈𝑘, 𝑒𝛼 𝑒 𝛼 + 2𝛾𝐵 𝑛

𝑘=1 𝜀𝑘 (D.23)

where we used that 𝑏𝑛 = Θ(𝜀𝑛) for all 𝑛 ℕ. Now, define the process {𝑀𝑛}𝑛 ℕas

𝑘=1 𝑈𝑘, 𝑒𝛼 𝑒 𝛼 (D.24)

which is a martingale, since 𝔼[𝑈𝑛| F𝑛] = 0. Moreover, note that

𝑈𝑛 = ˆ𝑣𝑛 𝑣( ˆ𝑥𝑛) ˆ𝑣𝑛 + 𝑣( ˆ𝑥𝑛) (D.25)

i.e., 𝑈𝑛 = Θ(1/𝜀𝑛). Thus, we readily obtain that 𝔼[ 𝑈𝑛 𝑞 | F𝑛] 𝜎𝑞 𝑛for 𝜎𝑛= Θ(1/𝜀𝑛) and all 𝑞 [1, ]. So, by Lemma D.1 for 𝛾𝑛= 𝛾, 𝜎𝑛= 𝜎, 𝑐as in Theorem 2, and 𝜇 (0, 1), 𝑞> 2 whose values will be determined next, there exists 𝐴𝑞> 0 such that:

𝛿𝑛:= ℙ sup 𝑘 𝑛 |𝑀𝑘| > 𝑐

2 (𝛾𝑛)𝜇 𝐴𝑞 𝛾𝑞/2+1 P𝑛 𝑘=1 𝜎𝑞 𝑘 (𝛾𝑛)1+𝑞(𝜇 1/2)

𝐴𝑞 𝛾𝑞(1 𝜇) P𝑛 𝑘=1 𝜎𝑞 𝑘 𝑛1+𝑞(𝜇 1/2) (D.26)

Now, note that for 𝜀𝑛= 𝜀/𝑛ℓ𝜀, and since 𝜎𝑛= Θ(1/𝜀𝑛), we get that there exists 𝑀> 0 such that

𝑘=1 𝜎𝑞 𝑘 𝑀𝜀 𝑞 𝑛

𝑘=1 𝑘𝑞ℓ𝜀 (D.27)

with P𝑛 𝑘=1 𝑘𝑞ℓ𝜀= Θ(𝑛1+𝑞ℓ𝜀). Therefore,

𝛿𝑛 𝐴 𝑞 𝛾𝑞(1 𝜇)𝜀 𝑞𝑛1+𝑞ℓ𝜀

𝑛1+𝑞(𝜇 1/2)

𝐴 𝑞 𝛾𝑞(1 𝜇)𝜀 𝑞

𝑛𝑞(𝜇 1/2 ℓ𝜀) (D.28)

Now, we need to guarantee that there exist 𝜇 (0, 1), 𝑞> 2, such that

𝑛=1 𝛿𝑛< (D.29)

For this, we need to ensure that 𝑞(𝜇 1/2 ℓ𝜀) > 1, or, equivalently,

ℓ𝜀< 𝜇 1/2 1/𝑞 (D.30)

which we will do later. Then, we will get for 𝛾small enough:

sup 𝑘 𝑛 |𝑀𝑘| 𝑐

sup 𝑘 𝑛 |𝑀𝑘| > 𝑐

Regarding the term 2𝛾𝐵P𝑛 𝑘=1 𝜀𝑘in (D.23), we have that:

𝑘=1 𝜀𝑘= 2𝐵𝛾𝜀 𝑛

𝑘=1 𝑘 ℓ𝜀 𝐵 𝛾𝜀𝑛1 ℓ𝜀 (D.32)

where we used that P𝑛 𝑘=1 𝑘 ℓ𝜀= Θ(𝑛1 ℓ𝜀). Thus, for

1 ℓ𝜀< 𝜇 (D.33)

we have for 𝜀, 𝛾> 0 small enough:

𝑘=1 𝜀𝑘 𝐵 𝛾𝜀𝑛1 ℓ𝜀 𝐵 𝛾𝜀𝑛𝜇 𝑐

2 (𝛾𝑛)𝜇 (D.34)

for all 𝑛 ℕ. Hence, by (D.30), (D.33) we need the following two conditions to be satisfied:

1 ℓ𝜀< 𝜇 and ℓ𝜀< 𝜇 1

for which we get that for ℓ𝜀 (0, 1/2), there exists always 𝜇 (3/4, 1) and 𝑞large that satisfy (D.35). Thus, combining (D.34) and (D.31), we get by (D.23) that with probability at least 1 𝛿:

𝑘=1 𝛾𝑘 𝑣(𝑥𝑘), 𝑒𝛼 𝑒 𝛼 + 𝑐(𝛾𝑛)𝜇 for all 𝑛 ℕ. (D.36)

Thus, following similar steps as in the proof Theorem 3 after (D.14), we readily obtain that with probability at least 1 𝛿, we have:

𝑧𝛼,𝑇 𝑧𝛼,1 𝑐𝛾 𝑇 1

𝑘=1 (𝛾𝑘 (𝛾𝑘)𝜇)

𝑧𝛼,1 𝑐𝛾2𝑇(𝑇 1)

2 + 𝑐𝛾1+𝜇 𝑇

0 𝑡𝜇𝑑𝑡 (D.37)

𝑧𝛼,1 𝑐𝛾2𝑇(𝑇 1)

2 + 𝑐𝛾1+𝜇𝑇𝜇+1

𝜇+ 1 (D.38) for all 𝑇 ℕ. Setting 𝜇= 4/5 and invoking Lemma A.1 for 𝜃(𝑥) = 𝑥log 𝑥, our claim follows.

E Numerical experiments

In this section, we provide numerical simulations to validate and explore the performance of (FTXL). To this end, we consider two game paradigms, (i) a zero-sum game, and (ii) a congestion game.

Zero-sum Game. First, we consider a 2-player zero-sum game with actions {𝛼1, 𝛼2, 𝛼3} and {𝛽1, 𝛽2, 𝛽3}, and payoff matrix

(2, 2) (1, 1) (2, 2) ( 2, 2) ( 1, 1) ( 2, 2) ( 2, 2) ( 1, 1) ( 2, 2)

Here, the rows of 𝑃correspond to the actions of player 𝐴and the columns to the actions of player 𝐵, while the first item of each entry of 𝑃corresponds to the payoff of 𝐴, and the second one to the payoff of 𝐵. Clearly, the action profile (𝛼1, 𝛽2) is a strict Nash equilibrium.

Congestion Game. As a second example, we consider a congestion game with 𝑁= 100 and 2 roads, 𝑟1 and 𝑟2, with costs 𝑐1 = 1.1 and 𝑐2 = 𝑑/𝑁where 𝑑is the number of drivers on 𝑟2. In words, 𝑟1 has a fixed delay equal to 1.1, while 𝑟2 has a delay proportional to the drivers using it. Note, that the strategy profile where all players are using 𝑟2 is a strict Nash equilibrium.

In Fig. 1, we assess the convergence of (FTXL) with logit best responses, under realization-based and bandit feedback, and compare it to the standard (EW) with the same level of information. For each feedback mode, we conducted 100 separate trials, each with 𝑇= 103 steps, and calculated the average norm 𝑥𝑛 𝑥 1 as a function of the iteration counter 𝑛= 1, 2, ...,𝑇. The solid lines represent the average distance from equilibrium for each method, while the shaded areas enclose the range of 1 standard deviation from the mean across the different trials. All the plots are displayed in logarithmic scale. For the zero-sum game, all runs were initialized with 𝑦1 = 0, and we used constant step-size 𝛾= 10 2, and exploration parameter 𝜀= 10 1, where applicable. For the congestion game, the initial state 𝑦1 for each run was drawn uniformly at random in [ 1, 1]2, and we used constant step-size 𝛾= 10 2, and exploration parameter 𝜀𝑛= 1/𝑛1/4, where applicable.

The experiments have been implemented using Python 3.11.5 on a M1 Mac Book Air with 16GB of RAM.

F Connection with other acceleration mechanisms

In this appendix, we discuss the connection between (FTXL) and the linear coupling method of Allen-Zhu & Orecchia [1]. Because [1] is not taking a momentum-based approach, it is difficult to accurately translate the coupling approach of [1] to our setting and provide a direct comparison between the two methods. One of the main reasons for this is that [1] is essentially using two step-sizes: the first is taken equal to the inverse Lipschitz modulus of the function being minimized and is used to take a gradient step; the second step-size sequence is much more aggressive, and it is used to generate an ancillary, exploration sequence which scouts ahead . These two sequences are then coupled with a mixing coefficient which plays a role similar but not equivalent to the friction coefficient in the (HBVF) formulation of (NAG) by Su et al. [40].

The above is the best high-level description and analogy we can make between the coupling approach of [1] and the momentum-driven analysis of Su et al. [40] and/or momentum analysis in Nesterov s 2004 textbook. At a low level (and omitting certain technical details and distinctions that are not central to this discussion), the linear coupling approach of [1] applied to our setting would correspond to the update scheme:

𝑥𝑛= 𝑄(𝑦𝑛) 𝑤𝑛= 𝜆𝑛𝑧𝑛+ (1 𝜆𝑛)𝑥𝑛 𝑦𝑛+1 = 𝑦𝑛+ (1 𝜆𝑛)𝜂𝑛ˆ𝑣𝑛 𝑧𝑛+1 = 𝜆𝑛𝑧𝑛+ (1 𝜆𝑛)𝑥𝑛+1

with ˆ𝑣𝑛obtained by querying a first-order oracle at 𝑤𝑛that is, ˆ𝑣𝑛is an estimate, possibly imperfect, of 𝑣(𝑤𝑛). The first and third lines of this update scheme are similar to the corresponding update structure of (FTXL). However, whereas (FTXL) builds momentum by the aggregation of gradient information via the momentum variables 𝑝𝑛, the linear coupling method above achieves acceleration through the coupling of the sequences 𝑤𝑛, 𝑧𝑛and 𝑥𝑛, and by taking an increasing step-size sequence 𝜂𝑛that grows roughly as Θ(𝑛), and a mixing coefficient 𝜆𝑛that evolves as 𝜆𝑛= 1 1/(𝐿𝜂𝑛), where 𝐿is the Lipschitz modulus of 𝑣( ). Beyond this comparison, we cannot provide a term-by-term correspondence between the momentum-based and coupling-based approaches, because the two methods are not equivalent (even though they give the same value convergence rates in convex minimization problems). In particular, we do not see a way of linking the parameters 𝜂𝑛and 𝜆𝑛of the coupling approach to the friction and step-size parameters of the momentum approach.

In the context of convex minimization problems, the coupling-based approach of [1] is more amenable to a regret-based analysis this is the unification aspect of [1] while the momentum-based approach of Su et al. [40] facilitates a Lyapunov-based analysis. From a game-theoretic standpoint, the momentum-based approach seems to be more fruitful and easier to implement, but studying the linear coupling approach of [1] could also be very relevant.

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: It can be found in Section 3, Section 4 and the appendix. Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: It can be found in Section 1. Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: It can be found in Section 2, Section 3, Section 4 and the appendix. Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: It can be found in Appendix E, and the code is included in the supplemental material. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example

(a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code is included in the supplemental material. Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/public/ guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https://nips.cc/public/ guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: It can be found in Appendix E. Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: No statistical significance applicable. Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: It can be found in Appendix E. Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The paper conforms with the Neur IPS Code of Ethics. Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: There is no societal impact. Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: There are no such risks. Guidelines:

The answer NA means that the paper poses no such risks.

Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] Justification: The paper does not use existing assets. Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: The paper does not release new assets. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.