Published as a conference paper at ICLR 2023

ASYNCHRONOUS GRADIENT PLAY IN ZERO-SUM MULTI-AGENT GAMES

Ruicheng Ao, Peking University, archer arc@pku.edu.cn
Shicong Cen & Yuejie Chi, Carnegie Mellon University, {shicongc,yuejiec}@andrew.cmu.edu
(Authors are sorted alphabetically.)

ABSTRACT

Finding equilibria via gradient play in competitive multi-agent games has been attracting a growing amount of attention in recent years, with emphasis on designing efficient strategies where the agents operate in a decentralized and symmetric manner with guaranteed convergence. While significant efforts have been made in understanding zero-sum two-player matrix games, the performance in zero-sum multi-agent games remains inadequately explored, especially in the presence of delayed feedbacks, leaving the scalability and resiliency of gradient play open to questions. In this paper, we make progress by studying asynchronous gradient plays in zero-sum polymatrix games under delayed feedbacks. We first establish that the last iterate of the entropy-regularized optimistic multiplicative weight updates (OMWU) method converges linearly to the quantal response equilibrium (QRE), the solution concept under bounded rationality, in the absence of delays. While this linear convergence continues to hold even when the feedbacks are randomly delayed under mild statistical assumptions, it does so at a noticeably slower rate due to a smaller tolerable range of learning rates. Moving beyond, we demonstrate that entropy-regularized OMWU, by adopting two-timescale learning rates in a delay-aware manner, enjoys faster last-iterate convergence under fixed delays, and continues to converge provably even when the delays are arbitrarily bounded, in an average-iterate manner. Our methods also lead to finite-time guarantees for approximating the Nash equilibrium (NE) by moderating the amount of regularization. To the best of our knowledge, this is the first work that aims to understand asynchronous gradient play in zero-sum polymatrix games under a wide range of delay assumptions, highlighting the role of learning rate separation.

1 INTRODUCTION

Finding equilibria of multi-player games via gradient play lies at the heart of game theory, which permeates a remarkable breadth of modern applications, including but not limited to competitive reinforcement learning (RL) (Littman, 1994), generative adversarial networks (GANs) (Goodfellow et al., 2014) and adversarial training (Mertikopoulos et al., 2018). While conventional wisdom leans towards the paradigm of centralized learning (Bertsekas & Tsitsiklis, 1989), retrieving and sharing information across multiple agents raises questions in terms of both privacy and efficiency, leading to a significant amount of interest in designing decentralized learning algorithms that utilize only local payoff feedbacks, with the updates at different agents executed in a symmetric manner. In reality, there is no shortage of scenarios where the feedback can only be obtained in a delayed manner (He et al., 2014), i.e., the agents only receive the payoff information sent from a previous round instead of the current round, due to communication slowdowns and congestion, for example.
Substantial progress has been made towards reliable and efficient online learning with delayed feedbacks in various settings, e.g., stochastic multi-armed bandits (Pike-Burke et al., 2018; Vernade et al., 2017), adversarial multi-armed bandits (Cesa-Bianchi et al., 2016; Li et al., 2019), online convex optimization (Quanrud & Khashabi, 2015; McMahan & Streeter, 2014), and multi-player games (Meng et al., 2022; Héliou et al., 2020; Zhou et al., 2017). Typical approaches to combating delays include subsampling the payoff history (Weinberger & Ordentlich, 2002; Joulani et al., 2013), or adopting adaptive learning rates suggested by delay-aware analysis (Quanrud & Khashabi, 2015; McMahan & Streeter, 2014; Hsieh et al., 2020; Flaspohler et al., 2021). Most of these efforts, however, have been limited to either the asymptotic convergence to the equilibrium (Zhou et al., 2017; Héliou et al., 2020) or the study of individual regret, which characterizes the performance gap between an agent's learning trajectory and the best policy in hindsight. It remains highly inadequate when it comes to guaranteeing finite-time convergence to the equilibrium in a multi-player environment, especially in the presence of delayed feedbacks, thus leaving the scalability and resiliency of gradient play open to questions.

| Learning rate | Type of delay | Iteration complexity ($\epsilon$-QRE) | Iteration complexity ($\epsilon$-NE) |
| --- | --- | --- | --- |
| single-timescale | none | $\tau^{-1} d_{\max} \|A\|_\infty \log \epsilon^{-1}$ | $d_{\max} \|A\|_\infty \epsilon^{-1}$ |
| single-timescale | statistical | $\tau^{-2} d_{\max}^2 \|A\|_\infty^2 (\gamma+1)^2 \log \epsilon^{-1}$ | $d_{\max}^2 \|A\|_\infty^2 (\gamma+1)^2 \epsilon^{-2}$ |
| two-timescale | constant | $\tau^{-1} d_{\max} \|A\|_\infty (\gamma+1)^2 \log \epsilon^{-1}$ | $d_{\max} \|A\|_\infty (\gamma+1)^2 \epsilon^{-1}$ |
| two-timescale | bounded | $\tau^{-2} n d_{\max}^3 \|A\|_\infty^3 (\gamma+1)^{5/2} \epsilon^{-1}$ | $n d_{\max}^3 \|A\|_\infty^3 (\gamma+1)^{5/2} \epsilon^{-3}$ |

Table 1: Iteration complexities of the proposed OMWU method for finding ϵ-QRE/NE of zero-sum polymatrix games, where logarithmic dependencies are omitted. Here, γ denotes the maximal time delay when the delays are bounded, n denotes the number of agents in the game, $d_{\max}$ is the maximal degree of the graph, and $\|A\|_\infty = \max_{i,j} \|A_{ij}\|_\infty$ is the $\ell_\infty$ norm of the entire payoff matrix A (over all games in the network). We only present the result under statistical delay when the delays are bounded for ease of comparison, while more general bounds are given in Section 3.2.

In this work, we initiate the study of asynchronous learning algorithms for an important class of games called zero-sum polymatrix games (also known as network matrix games (Bergman & Fokin, 1998)), which generalizes two-player zero-sum matrix games to the multi-player setting and serves as an important stepping stone towards more general multi-player general-sum games. Zero-sum polymatrix games are commonly used to describe situations in which the agents' interactions are captured by an interaction graph and the entire system of games is closed, so that the total payoff remains invariant in the system. They find applications in an increasing number of important domains such as security games (Cai et al., 2016), graph transduction (Bernardi, 2021), and more. In particular, we focus on finite-time last-iterate convergence to two prevalent solution concepts in game theory, namely the Nash equilibrium (NE) and the quantal response equilibrium (QRE), which accounts for bounded rationality (McKelvey & Palfrey, 1995). Despite the seemingly simple formulation, few existing works have achieved this goal even in the synchronous setting, i.e., with instantaneous feedback. Leonardos et al. (2021) studied a continuous-time learning dynamics that converges to the QRE at a linear rate. Anagnostides et al.
(2022) demonstrated Optimistic Mirror Descent (OMD) (Rakhlin & Sridharan, 2013) enjoys finite-time last-iterate convergence to the NE, yet the analysis therein requires continuous gradient of the regularizer, which incurs computation overhead for solving a subproblem every iteration. In contrast, an appealing alternative is the entropy regularizer, which leads to closed-form multiplicative updates and is computationally more desirable, but remains poorly understood. In sum, designing efficient learning algorithms that provably converge to the game equilibria has been technically challenging, even in the synchronous setting. 1.1 OUR CONTRIBUTIONS In this paper, we develop provably convergent algorithms broadly dubbed as asynchronous gradient play to find the QRE and NE of zero-sum polymatrix games in a decentralized and symmetric manner with delayed feedbacks. We propose an entropy-regularized Optimistic Multiplicative Weights Update (OMWU) method (Cen et al., 2021), where each player symmetrically updates their strategies without access to the payoff matrices and other players strategies, and initiate a systematic investigation on the impacts of delays on its convergence under two schemes of learning rates schedule. Our main contributions are summarized as follows. Finite-time last-iterate convergence of single-timescale OMWU. We begin by showing that, in the synchronous setting, the single-timescale OMWU method when the same learning rate is adopted for extrapolation and update achieves last-iterate convergence to the QRE at a linear Published as a conference paper at ICLR 2023 rate, which is independent of the number of agents as well as the size of action spaces (up to logarithmic factors). In addition, this implies a last-iterate convergence to an ϵ-approximate NE in e O(ϵ 1) iterations by adjusting the regularization parameter, where e O( ) hides logarithmic dependencies. While the last-iterate linear convergence to QRE continues to hold in the asynchronous setting, as long as the delay sequence follows certain mild statistical assumptions, it converges at a slower rate due to a smaller tolerable range of learning rates, with the iteration complexity to find an ϵ-NE degenerating to e O(ϵ 2). Finite-time convergence of two-timescale OMWU. To accelerate the convergence rate in the presence of delayed feedback, we propose a two-timescale OMWU method which separates the learning rates of extrapolation and update in a delay-aware manner for applications with constant and known delays (e.g. from timestamp information). The learning rate separation is critical in bypassing the convergence slowdown encountered in the single-timescale case, where we show that two-timescale OMWU achieves a faster last-iterate linear convergence to QRE in the presence of constant delays, with an improved e O(ϵ 1) iteration complexity to ϵ-NE that matches the rate without delay. We further tackle the more practical yet challenging setting where the feedback sequence is permutated by bounded delays possibly in an adversarial manner and demonstrate provable convergence to the equilibria in an average-iterate manner. We summarize the iteration complexities of the proposed methods for finding ϵ-approximate solutions of QRE and NE in Table 1. To the best of our knowledge, this work presents the first algorithm design and analysis that focus on equilibrium finding in a multi-player game with delayed feedbacks. 
In contrast, most of existing works concerning individual regret in the synchronous/asynchronous settings typically yield average-iterate convergence guarantees (see e.g., Bailey (2021); Meng et al. (2022)) and fall short of characterizing the actual learning trajectory to the equilibrium. 1.2 NOTATION AND PAPER ORGANIZATION Denote by [n] the set {1, , n} and by (S) the probability simplex over the set S. Given two probability distributions p, p (S), the KL divergence from p to p is defined by KL p p := P k S p(k) log p(k) p (k). For any vector z = [zi]1 i n Rn, we use exp(z) to represent [exp(zi)]1 i n. The rest of this paper is organized as follows. Section 2 provides the preliminary on zero-sum polymatrix games and solution concepts. Performance guarantees of single-timescale OMWU and two-timescale OMWU are presented in Section 3 and Section 4, respectively. Numerical experiments are provided in Section 5 to corroborate the theoretical findings, and finally, we conclude in Section 6. The proofs are deferred to the appendix. 2 PRELIMINARIES In this section, we introduce the formulation of zero-sum polymatrix games as well as the solution concept of NE and QRE. We start by defining the polymatrix game. Definition 1 (Polymatrix game). Let G := {(V, E), {Si}i V , {Aij}(i,j) E} be an n-player polymatrix game, where each element in the tuple is defined as follows. An undirected graph (V, E), with V = [n] denoting the set of players and E the set of edges; For each player i V , Si represents its action set, which is assumed to be finite; For each edge (i, j) E, Aij R|Si| |Sj| and Aji R|Sj| |Si| represent the payoff matrices associated with player i and j, i.e., when player i and player j choose si Si and sj Sj, the received payoffs are given by Aij(si, sj), Aji(sj, si), respectively. Utility function. Given the strategy profile s = (s1, , sn) S = Q i V Si taken by all players, the utility function ui : S R of player i is given by j:(i,j) E Aij(si, sj). Suppose that player i adopts a mixed/stochastic strategy or policy, πi (Si), where the probability of selecting si Si is specified by πi(si). With slight abuse of notation, we denote the expected Published as a conference paper at ICLR 2023 utility of player i with a mixed strategy profile π = (π1, , πn) (S) as ui(π) = E si πi, i V [ui(s)] = X j:(i,j) E π i Aijπj. (1) It turns out to be convenient to treat πi and π as vectors in R|Si| and R P i V |Si| without ambiguity, and concatenate all payoff matrices associated with player i into Ai = (Ai1, , Ain) R|Si| P j V |Sj|, (2) where Aij is set to 0 whenever (i, j) / E. In particular, it follows that Aii = 0 for all i V . With these notation in place, we can rewrite the expected utility function (1) as ui(π) = π i Aiπ, (3) where Aiπ R|Si| can be interpreted as the expected utility of the actions in Si for player i. In addition, we denote the maximum entrywise absolute value of payoff by A = maxi,j Aij = maxi Ai , and the maximum degree of the graph by dmax = maxi V degi, where degi is the degree of player i. Moreover, we denote Smax = maxi |Si| as the maximum size of the action space over all players. Zero-sum polymatrix games. The game G is a zero-sum polymatrix game if it holds that P i V ui(s) = 0, s S. This immediately implies that for any strategy profile π (S), it follows that P i V ui(π) = 0. Nash equilibrium (NE). 
A mixed strategy profile π = (π 1, , π n) is a Nash equilibrium (NE) when each player i cannot further increase its own utility function ui by unilateral deviation, i.e., ui(π i, π i) ui(π i , π i), for all i V, π i (Si), where the existence is guaranteed by the work (Cai et al., 2016). Here we denote the mixed strategies of all players other than i by π i and write ui(πi, π i) = ui(π). To measure how close a strategy π (S) is to an NE, we introduce NE-Gap(π) = max i V max π i (Si) ui(π i, π i) ui(π) , which measures the largest possible gain in the expected utility when players deviate from its strategy unilaterally. A mixed strategy profile π is called an ϵ-approximate Nash equilibrium (ϵ-NE) when NE-Gap(π) ϵ, which ensures that ui(π i, π i) ui(πi, π i)+ϵ, for all i V, π i (Si). Quantal response equilibrium (QRE). The quantal response equilibrium (QRE), proposed by Mc Kelvey & Palfrey (1995), generalizes the classical notion of NE under uncertain payoffs or bounded rationality, while balancing exploration and exploitation. A mixed strategy profile π τ = (π 1,τ, , π n,τ) is a QRE when each player assigns its probability of action according to the expected utility of every action in a Boltzmann fashion, i.e., for all i V , π i,τ(k) = exp([Aiπ τ]k/τ) P k Si exp([Aiπ τ]k/τ), k Si, (4) where τ > 0 is the regularization parameter or temperature. Equivalently, this amounts to maximizing an entropy-regularized utility of each player (Mertikopoulos & Sandholm, 2016), i.e., ui,τ(π i, π i,τ) ui,τ(π i,τ, π i,τ) for all i V , π i (Si). Here, the entropy-regularized utility function ui : S R of player i is given by ui,τ(π) = ui(π) + τH(πi), (5) where H(πi) = π i log πi denotes the Shannon entropy of πi. In Leonardos et al. (2021), it is shown that a unique QRE exists in a zero-sum polymatrix game. Similarly, we can measure the proximity of a strategy π to a QRE by QRE-Gapτ(π) = max i V max π i (Si) ui,τ(π i, π i) ui,τ(π) . (6) A mixed strategy profile π is called an ϵ-QRE when QRE-Gapτ(π) ϵ. According to the straightforward relationship NE-Gap(π) QRE-Gapτ(π) + τ log Smax, it follows immediately that we can link an ϵ/2-QRE to ϵ-NE by setting τ = ϵ 2 log Smax . This facilitates the translation of convergence to the QRE to one regarding the NE by appropriately setting the regularization parameter τ. Published as a conference paper at ICLR 2023 Algorithm 1 Entropy-regularized OMWU, agent i 1: Initialize π(0) i = π(0) i as uniform distribution. Learning rates η, and η (optional). 2: for t = 0, 1, 2, . . . do 3: Receive payoff vector Aiπ(κ(t) i ). 4: When t 1, update πi according to π(t) i (k) π(t 1) i (k)1 ητ exp(η[Aiπ(κ(t) i )]k), k Si. 5: Update πi according to the single-timescale rule π(t+1) i (k) π(t) i (k)1 ητ exp(η[Aiπ(κ(t) i )]k), k Si. (9) or the two-timescale rule π(t+1) i (k) π(t) i (k)1 ητ exp(η[Aiπ(κ(t) i )]k), k Si. (10) 3 PERFORMANCE GUARANTEES OF SINGLE-TIMESCALE OMWU In this section, we present and study the entropy-regularized OMWU method (Cen et al., 2021) for finding the QRE of zero-sum polymatrix games. Whilst the method is originally proposed for finding QRE in a two-player zero-sum game, the update rule naturally generalizes to the multi-player setting as π(t+1) i (k) π(t) i (k)1 ητ exp(η[Aiπ(t+1)]k), k Si, (7) where η > 0 is the learning rate and π(t+1) serves as a prediction for π(t+1) via an extrapolation step π(t+1) i (k) π(t) i (k)1 ητ exp(η[Aiπ(t)]k), k Si. 
(8) In the asynchronous setting, however, each agent i receives a delayed payoff vector Aiπ(κ(t) i ) instead of Aiπ(t) in the t-th iteration, where κ(t) i = max{t γ(t) i , 0}, with γ(t) i 0 representing the length of delay. The detailed procedure is outlined in Algorithm 1 using the single-timescale rule (9) for extrapolation. 3.1 PERFORMANCE GUARANTEES WITHOUT DELAYS We first present our theorem concerning the last-iterate convergence of single-timescale OMWU for finding the QRE in the synchronous setting, i.e. γ(t) i = 0 for all i V and t 0. For any π, π V , let KL π π = P i V KL πi π i . Theorem 1 (Last-iterate convergence without delays). Suppose that the learning rate η of singletimescale OMWU in Algorithm 1 obeys 0 < η min n 1 2τ , 1 4dmax A o , then for any T 0, the iterates π(T ) and π(T s) converge at a linear rate according to KL π τ π(T ) (1 ητ)T KL π τ π(0) , KL π τ π(T +1) 2(1 ητ)T KL π τ π(0) . (11a) Furthermore, the QRE-gap also converges linearly according to QRE-Gapτ(π(T )) η 1 + 2τ 1d2 max A 2 (1 ητ)T 1KL π τ π(0) . (11b) Theorem 1 demonstrates that as long as the learning rate η is sufficiently small, the last iterate of single-timescale OMWU converges to the QRE at a linear rate. Compared with prior works for finding approximate equilibrium for zero-sum polymatrix games, our approach features a closedform multiplicative update and a fast linear last-iterate convergence. Some remarks are in order. Linear convergence to the QRE. Theorem 1 implies an iteration complexity of e O 1 ητ log 1 for finding an ϵ-QRE in a last-iterate manner, which leads to an iteration complexity of Published as a conference paper at ICLR 2023 τ + 1 log 1 ϵ by optimizing the learning rate in Theorem 1.The result is especially appealing as it avoids direct dependency on the number of agents n as well as the size of action spaces (up to logarithmic factors), suggesting that learning in competitive multi-agent games can be made quite scalable as long as the interactions among the agents are sparse (so that the maximum degree of the graph dmax is much smaller than the number of agents n). Last-iterate convergence to ϵ-NE. By setting τ appropriately, we end up with an iteration complexity of e O dmax A ϵ for achieving last-iterate convergence to an ϵ-NE, which outperforms the best existing last-iterate rate of e O n A /ϵ2 from Leonardos et al. (2021) by at least a factor of n/(dmaxϵ). Remark 1. Our results trivially extend to the setting of weighted zero-sum polymatrix games (Leonardos et al., 2021), which amounts to adopting different learning rates {ηi}i V at each player. In this case, the iteration complexity becomes e O maxi V 1 ηiτ log 1 ϵ . In addition, our convergence result readily translates to a bound on individual regret as detailed in Appendix C. 3.2 PERFORMANCE GUARANTEES UNDER RANDOM DELAYS We continue to examine single-timescale OMWU in the more challenging asynchronous setting. In particularly, we show that the last iterate of single-timescale OMWU continues to converge linearly to the QRE at a slower rate, as long as the delays satisfy some mild statistical assumptions given below. Assumption 1 (Random delays). Assume that for all i V , t 0, the delay γ(t) i is independently generated and satisfies h γ(t) i i := E h γ(t) i γ(t) i ℓ i E(ℓ), ℓ= 0, 1, . . . . (12) Additionally, there exists some constant ζ > 1, such that L P ℓ=0 ζℓE(ℓ) < . 
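To make Assumption 1 concrete, the following minimal Monte Carlo sketch estimates the quantities it controls for two typical delay distributions. Since the displayed moment condition is partially garbled above, the sketch reads E(ℓ) as an upper bound on E[γ_i^(t) 1{γ_i^(t) ≥ ℓ}]; this reading is our assumption rather than a statement taken from the paper, though it is consistent with the bounded-delay constants ζ = 1 + γ^{-1} and L = eγ(γ + 1) quoted in Section 3.2. The helper names are hypothetical.

```python
# Monte Carlo sketch of the quantities controlled by Assumption 1.
# Assumed reading (the display above is garbled): E(l) upper-bounds
# E[gamma * 1{gamma >= l}], and L = sum_l zeta^l E(l) must be finite.
import numpy as np

rng = np.random.default_rng(0)

def tail_moment(delays, ell):
    """Empirical estimate of E[gamma * 1{gamma >= ell}]."""
    return float(np.mean(delays * (delays >= ell)))

def check_assumption(delays, zeta, max_ell=80):
    """Return the first few tail moments and the (truncated) constant L."""
    E = [tail_moment(delays, ell) for ell in range(max_ell)]
    L = sum(zeta ** ell * e for ell, e in enumerate(E))
    return E[:4], L

# Bounded delays with gamma = 5: Section 3.2 quotes zeta = 1 + 1/gamma and
# L = e * gamma * (gamma + 1), roughly 81.5 in this case.
bounded = rng.integers(0, 6, size=200_000)          # uniform on {0, ..., 5}
print(check_assumption(bounded, zeta=1 + 1 / 5))

# Poisson delays: tail moments decay fast, so L stays finite for zeta slightly above 1.
poisson = rng.poisson(2.0, size=200_000)
print(check_assumption(poisson, zeta=1.2))
```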
We remark that Assumption 1 is a rather mild condition that applies to typical delay distributions, such as the Poisson distribution (Zhang et al., 2020), as well as distributions with bounded support (Recht et al., 2011; Liu et al., 2014; Assran et al., 2020). Roughly speaking, Assumption 1 implies that the probability of the delay decays exponentially with its length, where ζ 1 approximately indicates the decay rate. We have the following theorem. Theorem 2 (Last-iterate convergence with random delays). Under Assumption 1, suppose that the regulari-zation parameter τ < min{1, dmax A } and the learning rate η of single-timescale OMWU in Algorithm 1 obeys 24d2max A 2 (L + 1) , ζ 1 then for any T 1, the iterates π(T ) and π(T ) converges to π τ at the rate max E h KL π τ π(T ) i , 1 2E h KL π τ π(T ) i (1 ητ)T KL π τ π(0) . (14a) Furthermore, the QRE-gap also converges linearly according to E h QRE-Gapτ(π(T )) i 4η 1(1 ητ)T KL π τ π(0) . (14b) Theorem 2 suggests that the iteration complexity to ϵ-QRE is no more than e O max n d2 max A 2 (L + 1), ζ ζ 1 o 1 τ 2 log 1 ϵ after optimizing the learning rate, whose range is more limited compared with the requirement in Theorem 1without delays. In particular, the range of the learning rate is proportional to the regularization parameter τ, an issue we shall try to address by resorting to two-timescale learning rates in OMWU. To facilitate further understanding, we showcase the iteration complexity for finding ϵ-QRE/NE under two typical scenarios: bounded delay and Poisson delay. Published as a conference paper at ICLR 2023 Bounded random delay. When the delays are bounded above by some maximum delay γ, Assumption 1 is met with ζ = 1 + γ 1 and L = eγ(γ + 1). Plugging into Theorem 2 yields an iteration complexity of e O d2 max A 2 (γ+1)2 ϵ for finding an ϵ-QRE, or e O d2 max A 2 (γ+1)2 for finding an ϵ-NE, which increases quadratically as the maximum delay increases. Note that these rates are worse than those without delays (cf. Theorem 1). Poisson delay. When the delays follow the Poisson distribution with parameter 1/T, it suffices to set ζ = 1 + T 1 and L = e T(1 + T) Assumption 1. This leads to an iteration complexity of e O d2 max A 2 T 2 for finding an ϵ-QRE, or e O d2 max A 2 T 2 for finding an ϵ-NE, which is similar to the bounded random delay case. 4 PERFORMANCE GUARANTEES OF TWO-TIMESCALE OMWU While Theorem 2 demonstrates provable convergence of single-timescale OMWU with random delays, it remains unclear whether the update rule can be better motivated in more general asynchronous settings, and whether the convergence can be further ensured under adversarial delays. Indeed, theoretical insights from previous literature (Mokhtari et al., 2020; Cen et al., 2021) suggest the critical role of π(t) as a predictive surrogate for π(t) in enabling fast convergence, which no longer holds when π(t) is replaced by a delayed feedback from π(κ(t) i ). To this end, we propose to replace the extrapolation update (9) with one equipped with a different learning rate: π(t+1) i (k) π(t) i (k)1 ητ exp(η[Aiπ(κ(t) i )]k), k Si, (15) which adopts a larger learning rate η > η to counteract the delay. Intuitively, a choice of η (γ(t) i + 1)η would allow π(κ(t) i ) to approximate π(t) by taking the intermediate updates {π(l) : κ(t) i l < t} into consideration. We refer to this update rule as the two-timescale entropy-regularized OMWU, whose detailed procedure is again outlined in Algorithm 1 using (10) for extrapolation. 
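To make the update rules concrete, here is a minimal, self-contained Python sketch of Algorithm 1 on a small randomly generated zero-sum polymatrix game, covering both the single-timescale rule (9) (obtained by setting η̄ = η) and the delay-aware two-timescale rule (10)/(15) with 1 − η̄τ = (1 − ητ)^{γ+1}. The game construction (A_ji = −A_ij^⊤ on each edge), the graph, the constants, and the helper names (make_game, payoff_vec, qre_gap, run_omwu) are illustrative choices for this sketch and are not taken from the paper; the learning-rate choice follows the condition stated in Theorem 3 below.

```python
# A runnable sketch of entropy-regularized OMWU (Algorithm 1) with delayed feedback.
# The game, graph, and helper names are illustrative assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def make_game(edges, sizes, scale=0.5):
    """Zero-sum polymatrix game: on each edge (i, j), A_ij = B and A_ji = -B^T,
    so the two payoffs cancel and the total payoff is zero for every profile."""
    A = {}
    for (i, j) in edges:
        B = scale * rng.standard_normal((sizes[i], sizes[j]))
        A[(i, j)], A[(j, i)] = B, -B.T
    return A

def payoff_vec(i, prof, A, nbrs):
    """[A_i pi]_k = sum_{j in N_i} [A_ij pi_j]_k, the expected utility of each action."""
    return sum(A[(i, j)] @ prof[j] for j in nbrs[i])

def entropy(p):
    return float(-p @ np.log(p))

def qre_gap(prof, A, nbrs, tau):
    """max_i max_{pi'_i} u_{i,tau}(pi'_i, pi_{-i}) - u_{i,tau}(pi); the inner max is a
    softmax best response, whose value is tau * logsumexp(A_i pi / tau)."""
    gap = 0.0
    for i in nbrs:
        g = payoff_vec(i, prof, A, nbrs)
        m = g.max()
        best = m + tau * np.log(np.sum(np.exp((g - m) / tau)))
        gap = max(gap, best - (prof[i] @ g + tau * entropy(prof[i])))
    return gap

def mwu_step(p, g, lr, tau):
    """p_new(k) is proportional to p(k)^(1 - lr*tau) * exp(lr * g_k), in log space."""
    logits = (1 - lr * tau) * np.log(p) + lr * g
    w = np.exp(logits - logits.max())
    return w / w.sum()

def run_omwu(A, nbrs, sizes, eta, eta_bar, tau, T, delay):
    """delay(i, t) returns gamma_i^(t); the payoff used at step t is A_i pi_bar^(kappa),
    with kappa = max(t - gamma_i^(t), 0), mirroring Algorithm 1."""
    pi = {i: np.ones(sizes[i]) / sizes[i] for i in nbrs}
    bar_hist = [dict(pi)]                                  # pi_bar^(0) = pi^(0)
    gaps = []
    for t in range(T):
        new_bar = {}
        for i in nbrs:
            kappa = max(t - delay(i, t), 0)
            g = payoff_vec(i, bar_hist[kappa], A, nbrs)    # delayed feedback
            if t >= 1:
                pi[i] = mwu_step(pi[i], g, eta, tau)       # update step (line 4)
            new_bar[i] = mwu_step(pi[i], g, eta_bar, tau)  # extrapolation (9) or (10)
        bar_hist.append(new_bar)
        gaps.append(qre_gap(pi, A, nbrs, tau))
    return pi, gaps

# A 4-player cycle with a fixed, known delay gamma = 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
sizes = [3, 4, 3, 4]
nbrs = {i: [] for i in range(4)}
for (a, b) in edges:
    nbrs[a].append(b)
    nbrs[b].append(a)
A = make_game(edges, sizes)

tau, gamma, d_max = 0.1, 2, 2
A_inf = max(np.abs(M).max() for M in A.values())
eta = min(1 / (2 * tau * (gamma + 1)), 1 / (5 * d_max * A_inf * (gamma + 1) ** 2))
eta_bar = (1 - (1 - eta * tau) ** (gamma + 1)) / tau   # 1 - eta_bar*tau = (1 - eta*tau)^(gamma+1)

_, gaps = run_omwu(A, nbrs, sizes, eta, eta_bar, tau, T=5000, delay=lambda i, t: gamma)
print("QRE-gap:", [round(gaps[t], 4) for t in (0, 500, 2000, 4999)])
```

In this sketch the delay process is a plain callable, so the same driver covers synchronous play (delay returning 0 with eta_bar = eta), constant known delays as above, or randomly drawn delays for experimenting with Theorem 2.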
4.1 PERFORMANCE GUARANTEES UNDER CONSTANT AND KNOWN DELAYS To highlight the potential benefit of learning rate separation, we start by studying the convergence of two-timescale OMWU in the asynchronous setting with constant and known delays, which has been studied in (Weinberger & Ordentlich, 2002; Flaspohler et al., 2021; Meng et al., 2022). We have the following theorem, which reveals a a faster linear convergence to the QRE by using a delay-aware two-timescale learning rate design. Theorem 3 (Last-iterate convergence with fixed delays). Suppose that the delays γ(t) i = γ are fixed and known. Suppose that the learning rate η of two-timescale OMWU in Algorithm 1 satisfies η min 1 2τ(γ + 1), 1 5dmax A (γ + 1)2 and η is determined by 1 ητ = (1 ητ)(γ+1), then the last iterate π(t) and π(t) converge to the QRE at a linear rate: for T γ, max KL π τ π(T +1) , 1 2KL π τ π(T γ+1) (1 ητ)T +1KL π τ π(0) + (1 ητ)T +1 γ. In addition, the QRE-gap converges linearly according to QRE-Gapτ(π(T γ+1)) 2 max nd2 max A 2 τ , 1 o (1 ητ)T +1KL π τ π(0) + (1 ητ)T +1 γ . By optimizing the learning rate η, Theorem 3 implies that two-timescale OMWU takes at most e O dmax A (γ+1)2 ϵ iterations to find an ϵ-QRE in a last-iterate manner, which translates to an iteration complexity of e O dmax A (γ+1)2 ϵ for finding an ϵ-NE. This significantly improves over the iteration complexity of e O d2 max A 2 (γ + 1)2/ϵ2 for single-timescale OMWU, verifying the positive role of adopting two-timescale learning rate in enabling faster convergence. Published as a conference paper at ICLR 2023 4.2 PERFORMANCE GUARANTEES WITH PERMUTED BOUNDED DELAYS The above result requires the exact information of the delay, which may not always be available. Motivated by the need to address arbitrary or even adversarial delays, we consider a more realistic scenario, where the payoff sequence arrives in a permuted order (Agarwal & Duchi, 2011) constrained by a maximum bounded delay (Mc Mahan & Streeter, 2014; Wan et al., 2022). Assumption 2 (Bounded delay). For any i V and t > 0, it holds that γ(t) i γ. Assumption 3 (Permuted feedback). For any t > 0, the payoff vector at the t-th iteration is received by agent i only once. The payoff at the 0-th iteration can be used multiple times. The following theorem unveils the convergence of two-timescale OMWU to the QRE in an average sense under permutated bounded delays. Theorem 4 (Average-iterate convergence under permutated delays). Under Assumption 2 and 3, suppose that the learning rate η of two-timescale OMWU in Algorithm 1 satisfies η min n 1 2τ(γ+1), 1 28dmax A (γ+1)5/2 o , and η is determined by 1 ητ = (1 ητ)(γ+1), then for T > 2γ, it holds that 1 T 2γ max T 1 X t=2γ KL π τ π(t+1) , 1 t=2γ KL π τ π(t γ+1) KL π i,τ π(0) i + n + 24nγ log Smax T 2γ . (16) Furthermore, the average QRE-gap can be bounded by t=2γ QRE-Gapτ(π(t+1)) 1 T 2γ max n3d2 max A 2 2τ , τ o 1 ητ (KL π i,τ π(0) i + n) + 36nγ log Smax . Theorem 4 guarantees that the best iterate among {π(t)}2γ 0 and 4 (1 ητ)(1 2ηdmax A ). (23) This allows us to further relax (22) by KL π τ π(t+1) + (1 2ηdmax A )KL π(t+1) π(t+1) (1 ητ)KL π τ π(t) + ηdmax A KL π(t) π(t) (1 ητ) KL π τ π(t) + (1 2ηdmax A )KL π(t) π(t) . Let us now introduce the potential function of iterates L(t) := KL π τ π(t) + (1 2ηdmax A )KL π(t) π(t) , which allows us to simply the previous inequality as L(t+1) (1 ητ)L(t) (1 ητ)t+1L(0) = (1 ητ)t+1KL π τ π(0) , (24) where the last equality follows from the definition π(0) = π(0). Hence, we have KL π τ π(t) L(t) (1 ητ)t KL π τ π(0) . 
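As a numerical sanity check on this geometric decay (not part of the paper's argument), the following Python sketch runs synchronous single-timescale OMWU on a two-player zero-sum game, i.e., a polymatrix game with a single edge, and compares KL(π*_τ ∥ π^(t)) against the bound (1 − ητ)^t KL(π*_τ ∥ π^(0)). Since the QRE is not available in closed form, the long-horizon iterate is used as a stand-in for π*_τ; the game instance and all constants are arbitrary choices for illustration, and the step size follows the condition of Theorem 1 with d_max = 1.

```python
# Sanity check of KL(pi*_tau || pi^(t)) <= (1 - eta*tau)^t * KL(pi*_tau || pi^(0))
# for synchronous single-timescale OMWU on a single-edge (two-player) zero-sum game.
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 5))                 # A_12 = B, A_21 = -B^T
tau = 0.2
eta = min(1 / (2 * tau), 1 / (4 * 1 * np.abs(B).max()))   # Theorem 1 step size, d_max = 1

def step(p, g, lr):
    """p_new(k) proportional to p(k)^(1 - lr*tau) * exp(lr * g_k)."""
    logits = (1 - lr * tau) * np.log(p) + lr * g
    w = np.exp(logits - logits.max())
    return w / w.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

x, y = np.ones(4) / 4, np.ones(5) / 5           # pi_1^(0), pi_2^(0), uniform init
xb, yb = x.copy(), y.copy()                     # pi_bar^(0) = pi^(0)
traj = [(x, y)]
for t in range(4000):
    xb_new = step(x, B @ yb, eta)               # extrapolation (8)
    yb_new = step(y, -B.T @ xb, eta)
    x = step(x, B @ yb_new, eta)                # update (7)
    y = step(y, -B.T @ xb_new, eta)
    xb, yb = xb_new, yb_new
    traj.append((x, y))

x_star, y_star = traj[-1]                       # numerical proxy for the QRE pi*_tau
kl0 = kl(x_star, traj[0][0]) + kl(y_star, traj[0][1])
for t in (100, 300, 600, 1200):
    klt = kl(x_star, traj[t][0]) + kl(y_star, traj[t][1])
    print(t, klt, (1 - eta * tau) ** t * kl0)   # observed value vs. proven upper bound
```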
Published as a conference paper at ICLR 2023 Bounding KL π τ π(t+1) . Following similar approaches to (21), we can bound π i,τ π(t+1) i , log π(t+1) i log π(t+1) i = η(π(t+1) i π i,τ) Ai(π(t) π(t)) + η(π(t+1) i π i,τ) Ai(π(t) π(t+1)) KL π(t) j π(t) j + KL π(t+1) j π(t) j + 2KL π i,τ π(t+1) i . (25) Summing the inequality over i V leads to π τ π(t+1), log π(t+1) log π(t+1) ηdmax A KL π(t) π(t) + KL π(t+1) π(t) + 2KL π τ π(t+1) . On the other hand, by the definition of KL divergence, we have KL π τ π(t+1) = KL π τ π(t+1) KL π(t+1) π(t+1) π τ π(t+1), log π(t+1) log π(t+1) . (26) Combining the above two inequalities, we get (1 2ηdmax A )KL π τ π(t+1) KL π τ π(t+1) + ηdmax A KL π(t) π(t) + KL π(t+1) π(t) . Plugging the above inequality back into (22), we have (1 2ηdmax A )KL π τ π(t+1) (1 ητ)KL π τ π(t) (1 ητ 2dmaxη A )KL π(t+1) π(t) ητKL (1 2ηdmax A )KL π(t+1) π(t+1) + 2ηdmax A KL π(t) π(t) (1 ητ)KL π τ π(t) + 2ηdmax A KL π(t) π(t) KL π τ π(t) + (1 2ηdmax A )KL π(t) π(t) = L(t), where the second and third inequalities follow from the choice of the learning rate, and the last line follows from the definition of the potential function L(t). Then the result follows from (24) as 1 2KL π τ π(t+1) (1 2ηdmax A )KL π τ π(t+1) L(t) (1 ητ)t KL π τ π(0) . Bounding the QRE-Gap. Finally, we bound the QRE-gap, which can be linked to the KL divergence using the following lemma. The proof can be found in Appendix E.3. Lemma 3. For any π (S) and QRE π τ (S), it holds that QRE-Gapτ(π) τKL π π τ + d2 max A 2 τ KL π τ π . Lemma 3 tells us QRE-Gapτ(π(t)) τKL π(t) π τ + d2 max A 2 τ KL π τ π(t) . (27) With KL π τ π(t) controlled in the above, we still need to control KL π(t) π τ . From (22), it follows that π(t) π τ η 1(1 ητ)L(t 1) η 1(1 ητ)t L(0) = η 1(1 ητ)t KL π τ π(0) . Plugging them back to (27), we arrive at QRE-Gapτ(π(t)) η 1 + 2τ 1d2 max A 2 (1 ητ)t 1KL π τ π(0) . B.2 PROOF OF THEOREM 2 We begin with bounding the KL divergence KL π τ π(t) and then move to bound the QRE-gap by linking it to the KL divergence. Published as a conference paper at ICLR 2023 Bounding the term KL π τ π(t) . We start with the following equation (1 ητ)KL π i,τ π(t) i = (1 ητ)KL π(t+1) i π(t) i + ητKL π(t+1) i π i,τ + KL π(t+1) i π(t+1) i + KL π i,τ π(t+1) i log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i + η(π(t+1) i π i,τ) Ai(π(κ(t) i ) π τ) (28) where its proof follows a similar deduction as (19). Our first target is to bound the last two terms on the RHS of (28) with π(t+1) i π i,τ + KL π(t+1) i π(t+1) i + (1 ητ)KL π(t+1) i π(t) i . Let us introduce the potential function of iterates Ψ(l) i := KL π(l+1) i π(l) i +KL π(l) i π(l) i , Ψ(l) = X i V Ψ(l) i = KL π(l+1) π(l) +KL π(l) π(l) , which will be used repetitively in the rest of this proof. For notational simplicity, let Ψ(l) i = 0 when l < 0. Step 1: bounding log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i . Following a similar argument as (21), we get log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i j Ni (π(t+1) i π(t+1) i ) Aij(π (κ(t+1) i ) j π (κ(t) i ) j ) ηdmax A KL π(t+1) i π(t+1) i + η A π (κ(t+1) i ) j π (κ(t) i ) j 2 1. (29) To control the term π (κ(t+1) i ) j π (κ(t) i ) j 2 1, when t = 0, we have π (κ(t+1) i ) j π (κ(t) i ) j 2 1 = π (κ(t+1) i ) j π(0) j 2 1 π(1) j π(0) j 2 1 2Ψ(0) j (30) by Pinsker s inequality. For t 1, consider the decomposition π(t) j π(t k) j = π(l+1) j π(l) j , 1 k t, it then follows that π(t) j π(t k) j 2 1 k π(l+1) j π(l) j 2 1 π(l+1) j π(l) j 2 1 + π(l) j π(l) j 2 1 l=t k Ψ(l) j , (31) where the last line applies Pinsker s inequality. 
Depending on whether γ(t+1) i > 0, we proceed to bound the terms π (κ(t+1) i ) j π (κ(t) i ) j 2 1 in (29) considering the following two cases based on (31). γ(t+1) i = 0. Then π (κ(t+1) i ) j π (κ(t) i ) j 2 1 2 π(t+1) j π(t) j 2 1 + 2 π(t) j π (κ(t) i ) j 2 1 Published as a conference paper at ICLR 2023 8Ψ(t) j + 8γ(t) i where the last step uses (31) and π(t+1) j π(t) j 2 1 2 π(t+1) j π(t) j 2 1 + π(t) j π(t) j 2 1 via again Pinsker s inequality. γ(t+1) i > 0. Then it follows similarly that π (κ(t+1) i ) j π (κ(t) i ) j 2 1 l=t+1 γ(t+1) i π(l+1) j π(l) j 2 1 + π(l+1) j π(l) j 2 1 l=t γ(t+1) i Ψ(l) j + 4γ(t) i Combining the above two bounds together, we get π (κ(t+1) i ) j π (κ(t) i ) j 2 1 8Ψ(t) j + 8γ(t) i Ψ(l) j + 4γ(t+1) i l=t γ(t+1) i Ψ(l) j (32) when t > 0. In view of (30) when t = 0, the above bound (32) holds for all t 0. Plugging the above inequality into (29) yields log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i l=t γ(t+1) i γ(t+1) i Ψ(l) j + 4η A X γ(t) i Ψ(l) j j Ni Ψ(t) j + ηdmax A KL π(t+1) i π(t+1) i . (33) Step 2: bounding (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π τ). Let us begin with the following decomposition (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π τ) = (π(t+1) i π i,τ) Ai(π(t+1) π τ) + (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π(t+1)), (34) where the second term in the RHS of (34) can be bounded by (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π(t+1)) j Ni (π(t+1) i π i,τ) Aij(π (κ(t+1) i ) j π(t+1) j ) π(t+1) i π i,τ 1 π (κ(t+1) i ) j π(t+1) j 1 π(t+1) i π i,τ 2 1 + dmax A π (κ(t+1) i ) j π(t+1) j 2 1 π(t+1) i π i,τ + dmax A 2 2τ π (κ(t+1) i ) j π(t+1) j 2 1. Following similar deduction of (32) for the second term, we attain (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π(t+1)) Published as a conference paper at ICLR 2023 π(t+1) i π i,τ + 4dmax A 2 τ l=t γ(t+1) i γ(t+1) i Ψ(l) j . Plugging the above inequality back to (34) results in (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π τ) (π(t+1) i π i,τ) Ai(π(t+1) π τ) π(t+1) i π i,τ + 4dmax A 2 τ l=t γ(t+1) i γ(t+1) i Ψ(l) j . Step 3: combining the bounds. For simplicity, we introduce the short-hand notation cτ = 1 + dmax A τ and c A = dmax A . (36) Combining (33) and (35) into (28), and summing over i V gives (1 ητ)KL π τ π(t) π(t+1) π(t) + (1 2ηc A)KL π(t+1) π(t+1) + KL π τ π(t+1) l=t γ(t+1) i cτγ(t+1) i Ψ(l) j + γ(t) i Ψ(l) j + cτΨ(t) j KL π τ π(t+1) + (1 4ηc A(cτ + 1)) Ψ(t) l=t γ(t+1) i γ(t+1) i Ψ(l) j + γ(t) i Ψ(l) j where we make use of the fact X i V (π(t+1) i π i,τ) Ai(π(t+1) π τ) = 0 from Lemma 1 in the first inequality, and the second inequality uses the relation X j Ni Ψ(t) j = X i V diΨ(t) i dmaxΨ(t). Step 4: finishing up via averaging the delay. We now evaluate the expectation of KL π τ π(t+1) . Recall that we use subscript Eγ(t) [ ] to represent the conditional expectation given γ(t) = {γ(t) i }i V . We shall first control the conditional expectation of the last term in (37). Observing that π(l+1) j , π(l) j are independent of γ(t) i for j Ni and l t 1. Using the definition of E(t l), we have j Ni Et l γ(t) i h γ(t) i Ψ(l) j i l=0 E(t l) X j Ni Ψ(l) j l=0 E(t l) X j Ni Ψ(l) i l=0 E(t l) X i V Ψ(l) i = dmax l=0 E(t l)Ψ(l), (38) Published as a conference paper at ICLR 2023 where the second line follows from the definition of E(t l) in Assumption 1. Applying a sim- ilar argument to bound P j Ni Eγ(t+1) h γ(t+1) i Pt 1 l=t γ(t+1) i Ψ(l) j i , and taking expectation of γ(t), γ(t+1) on both sides of (37), we get (1 ητ)Eγ(t) h KL π τ π(t) i Eγ(t),γ(t+1) h KL π τ π(t+1) + (1 4c A(cτ + 1))Ψ(t)i 4ηc A(cτ + 1) l=0 E(t l)Ψ(l). 
Taking expectation on both sides over all the delays yields (1 ητ)E h KL π τ π(t) i E h KL π τ π(t+1) i + E h (1 4ηc A(cτ + 1)) Ψ(t) 4ηc A(cτ + 1) Xt 1 l=0 E(t l)Ψ(l) | {z } =:U (t) Telescoping over t = 0, 1, . . . , T, we get (1 ητ)T +1KL π τ π(0) E h KL π τ π(T +1) i + t=0 (1 ητ)T t E h U (t)i , (40) which leads to the desired bound if t=0 (1 ητ)T t E h U (t)i 0. (41) Proof of (41). To begin, notice that with the choice of the learning rate 24d2max A 2 (L + 1) , ζ 1 it follows that 1 1 ητ ζ (42a) 4ηc A(cτ + 1)(L + 1) < 4 τ 24d2max A 2 (L + 1) dmax A = τ 6dmax A = τ 3dmax A + 1 as τ dmax A . Both of these relations will be useful in our follow-up analysis. Now, taking the definition of U (t) (cf. (39)), we have t=0 (1 ητ)T t U (t) = t=0 (1 ητ)T t " (1 4ηc A(cτ + 1)) Ψ(t) 4ηc A(cτ + 1) l=0 E(t l)Ψ(l) # where the second half of the RHS can be further controlled via t=0 (1 ητ)T t t 1 X l=0 E(t l)Ψ(l) = t=0 Ψ(t) T X l=t+1 (1 ητ)T l E(l t) t=0 Ψ(t) T t X l =0 (1 ητ)T (t+l )E(l ) Published as a conference paper at ICLR 2023 t=0 (1 ητ)T tΨ(t) T t X l =0 (1 ητ) l E(l ) t=0 (1 ητ)T tΨ(t) X l=0 ζl E(l) t=0 (1 ητ)T t LΨ(t), where the first line follows by changing the order of summation, the second line follows from the change of variable l = l t, and the last line follows from (42a) and the definition of L in Assumption 1. Plugging the above relation back leads to t=0 (1 ητ)T t U (t) t=0 (1 ητ)T t [(1 4ηc A(cτ + 1)) 4ηc A(cτ + 1)L] Ψ(t) 1 2(1 ητ)T tΨ(t) 0, (43) where the second line results from (42b). Bounding the term KL π τ π(t+1) . With a similar deduction of (19), we get (1 ητ)KL π τ π(t) + η X i V (π(t+1) i π τ) Ai(π(κ(t) i ) π τ) = KL π τ π(t+1) + (1 ητ)KL π(t+1) π(t) + ητKL π(t+1) π τ . (44) Following the similar argument of (35), we have (π(t+1) i π i,τ) Ai(π(κ(t) i ) π τ) (π(t+1) i π i,τ) Ai(π(t+1) π τ) π(t+1) i π i,τ + 8dmax A 2 τ γ(t) i Ψ(l) j . Summing over i V and plugging into (44) yields (1 ητ)KL π τ π(t) + 8ηdmax A 2 τ γ(t) i Ψ(l) j KL π τ π(t+1) + (1 ητ)KL π(t+1) π(t) + ητ KL π τ π(t+1) + ητ π(t+1) π τ . Taking expectation on both sides over all delays and using (38) leads to (1 ητ)E h KL π τ π(t) i + 8ηd2 max A 2 τ E l=0 E(t l)Ψ(l) # E h KL π τ π(t+1) i + ητ π(t+1) π τ i . (45) Notice that with the choice of the learning rate 24d2max A 2 (L + 1) , ζ 1 we have 8(L + 1)ηd2 max A 2 τ 1 Published as a conference paper at ICLR 2023 (1 ητ)t+1KL π τ π(0) 1 l=0 (1 ητ)t l E h Ψ(l)i by combining (43) and (40). It follows that E h Ψ(t)i 2(1 ητ)t+1KL π τ π(0) l=0 E(t l)Ψ(l) # (i) E l=0 (1 ητ)t lΨ(l) E(t l)ζt l # l=0 (1 ητ)t lΨ(l) t 1 X l=0 E(t l)ζt l # (ii) 2L(1 ητ)t+1KL π τ π(0) , where (i) is by the bound (1 ητ) 1 ζ and (ii) uses the definition of L in Assumption 1. Plugging the above inequalities into (45) leads to (1 ητ)E h KL π τ π(t) i + (1 ητ)t+1KL π τ π(0) E h KL π τ π(t+1) i + ητ π(t+1) π τ i . Then from (40) we have E h KL π τ π(t+1) i E h KL π τ π(t+1) i + ητ π(t+1) π τ i (1 ητ)t+1KL π τ π(0) + (1 ητ)t+1KL π τ π(0) = 2(1 ητ)t+1KL π τ π(0) . (46) Bounding the QRE-Gap. Combining (27) and (46), we have E h QRE-Gapτ(π(t+1)) i τE h KL π(t+1) π τ i + d2 max A 2 τ E h KL π τ π(t+1) i π(t+1) π τ i + E h KL π τ π(t+1) i η KL π τ π(0) , where the second line uses the learning rate bound 2 η > 24d2 max A 2 (L + 1) τ > d2 max A 2 τ . C REGRET ANALYSIS OF SINGLE-TIMESCALE OMWU For completeness, we also provide the regret analysis of single-timescale OMWU in both synchronous and asynchronous settings, which might be of independent interest. 
To begin, for τ 0, the regret for each player i V is defined as Regreti,τ T = max πi (Si) t=1 ui,τ(πi, π(t) i) t=1 ui,τ(π(t)), (47) which measures the performance gap compared to the optimal fixed strategy in hindsight for player i, when the rest of the players follow the strategies derived from Algorithm 1. Published as a conference paper at ICLR 2023 Synchronous setting. We begin with the following no-regret guarantee of single-timescale OMWU in the synchronous setting. Theorem 5 (No-regret without delays). Suppose all players i V follow single-timescale OMWU in Algorithm 1 with the initialization π(0) i = 1 |Si|1 and the learning rate obeys 0 < η 1 4dmax A +4τ . Then, for T 1, it holds that Regreti,τ T 1 η log |Si| + 16η degi A 2 X k V log |Sk|. By optimizing the learning rate η, Theorem 5 suggests that the regret is bounded by max i V Regreti,τ T e O A p up to logarithmic factors. Compared with the OMD method for multi-agent games in Anagnostides et al. (2022), which only provided the regret bound for τ = 0, our bound is more general by allowing entropy regularization. Moreover, our bound is tighter by a factor of n/dmax by exploiting the graph connectivity pattern, which is significant for large sparse graphs. Asynchronous setting. We next move to the asynchronous case, and show that single-timescale OMWU continues to enjoy no-regret learning as long as the delays have finite second-order moments. Assumption 4 (Random delays). Recall the definition of E(ℓ) in (12). There exists some constant σ > 0, such that E h (γ(t) i )2i P ℓ=0 E(ℓ) σ2, for all t 0 and i V . Clearly Assumption 4 is weaker than Assumption 1, since it only requires the second-order moments to be finite, instead of an exponential decay of γ(t) i . We have the following theorem. Theorem 6 (No-regret with random delays). Under Assumption 4, suppose all players i V follow single-timescale OMWU in Algorithm 1 with the initialization π(0) i = 1 |Si|1, the regularization parameter τ < min{1, A }, and the learning rate obeys 0 < η τ 24d2max A 2 (σ2+1). Then, for T 1, it holds that E Regreti,τ T 1 η log |Si| + 8dmax A τ + 2 (σ2 + 1) X i V log |Si|. (48) Theorem 6 guarantees that the iterate among {π(t)}t 1 enjoys a regret bound on the order of max i V E Regreti,τ T e O σ2ndmax A 2 τ by optimizing the learning rate η. C.1 PROOF OF THEOREM 5 Recall the expression of the regret Regreti,τ T = max πi (Si) Regreti,τ πi, T , Regreti,τ πi, T := t=1 ui,τ(πi, π(t) i) t=1 ui,τ(π(t)) πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) . (49) Therefore, it is sufficient to bound Regreti,τ πi, T for any πi (Si). To begin, we record the following useful lemma whose proof can be found in Appendix E.4. Published as a conference paper at ICLR 2023 Lemma 4. For any τ 0 and T 0, we have i V Regreti,τ T + 1 = X i V max πi (Si) πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) 0. Let us now proceed with the following regret decomposition πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) = πi π(t+1) i , Aiπ(t+1) + τH(πi) + π(t+1) i π(t+1) i , Aiπ(t) τH(π(t+1) i ) π(t+1) i π(t+1) i , Aiπ(t+1) Aiπ(t) . (50) We proceed to bound each term on the RHS of (50). To begin, note that log π(t+1) i 1= (1 ητ) log π(t) i + ηAiπ(t+1). The first term in (50) can then be written as πi π(t+1) i , Aiπ(t+1) + τH(πi) η log π(t+1) i log π(t) i , πi π(t+1) i + τ log π(t) i , πi π(t+1) i τ log πi, πi η τ KL πi π(t) i 1 KL πi π(t+1) i + KL π(t+1) i π(t) i τ log π(t) i , π(t+1) i , where the second step is derived from the definition of KL divergence. 
Similarly, the second term in (50) has the form π(t+1) i π(t+1) i , Aiπ(t) τH(π(t+1) i ) KL π(t+1) i π(t) i KL π(t+1) i π(t+1) i 1 π(t+1) i π(t) i + τ log π(t) i , π(t+1) i . (52) Moving to the third term on the RHS of (50), we first make the following claim, which shall be proven at the end of this proof: π(t+1) i π(t+1) i 1 η Aiπ(t+1) Aiπ(t) . (53) With (53) in place, we have π(t+1) i π(t+1) i , Aiπ(t+1) Aiπ(t) X j Ni A π(t+1) i π(t+1) i 1 π(t+1) j π(t) j 1 j Ni A Ai(π(t+1) π(t)) π(t+1) j π(t) j 1 π(t+1) j π(t) j 1 The latter term can be further bounded by X π(t+1) j π(t) j 1 π(t+1) j π(t) j + π(t) j π(t) j 2 1 π(t+1) j π(t) j 2 1 + π(t) j π(t) j 2 1 π(t+1) j π(t) j + KL π(t) j π(t) j , Published as a conference paper at ICLR 2023 where it follows respectively from Cauchy-Schwarz inequality, a + b 2 1 2 a 2 1 + b 2 1 , and Pinsker s inequality. Plugging this into the previous inequality leads to π(t+1) i π(t+1) i , Aiπ(t+1) Aiπ(t) 4η degi A 2 X π(t+1) j π(t) j + KL π(t) j π(t) j . Plugging (51), (52), and (54) into (50) yields πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) KL πi π(t) i KL πi π(t+1) i 1 η KL π(t+1) i π(t+1) i 1 π(t+1) i π(t) i τKL πi π(t) i + 4η degi A 2 X π(t+1) j π(t) j + KL π(t) j π(t) j . Telescoping the sum over t = 0, 1, . . . , T leads to Regreti,τ πi, T + 1 1 η KL πi π(0) i 1 t=0 KL π(t+1) i π(t+1) i 1 π(t+1) i π(t) i + 4η degi A 2 π(t+1) j π(t) j + KL π(t) j π(t) j (55) η log |Si| + 4η degi A 2 π(t+1) π(t) + KL π(t) π(t) , where the last line follows from the fact that KL πi π(0) i log |Si| and 1/η > τ. The proof is thus complete if we can establish t=0 KL π(t) π(t) + π(t+1) π(t) 4 X k V log |Sk|. (57) Therefore, it remains to establish (53) and (57), which shall be completed as follows. Proof of (53). By the update rules of π(t+1) i and π(t+1) i , from (20) we can deduce that log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i = η Ai(π(t) π(t+1)), π(t+1) i π(t+1) i . (58) By Pinsker s inequality, we have log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i π(t+1) i π(t+1) i 2 1. In addition, Ai(π(t) π(t+1)), π(t+1) i π(t+1) i π(t+1) i π(t+1) i 1 Ai(π(t) π(t+1)) . Plugging the above two relations into (58) then leads to (53). Proof of (57). Summing (55) over i V gives i V Regreti,τ πi, T + 1 1 η KL π π(0) 1 t=0 KL π(t+1) π(t+1) 1 π(t+1) π(t) + 4ηd2 max A 2 π(t+1) π(t) + KL π(t) π(t) KL π π(0) + KL π(0) π(0) 1 η τ 4ηd2 max A 2 π(t+1) π(t) Published as a conference paper at ICLR 2023 η 4ηd2 max A 2 t=0 KL π(t) π(t) k V log |Sk| 1 π(t+1) π(t) 1 t=0 KL π(t) π(t) , where the last line follows from π(0) = π(0), KL π π(0) P k V log |Sk| for any π since π(0) is a uniform distribution, as well as the choice of the learning rate such that 4ηd2 max A 2 1 Taking supremum over π on both sides of (59) and applying Lemma 4 gives (57) as advertised. C.2 PROOF OF THEOREM 6 Similar to the proof of Theorem 5 in Appendix C.1, it suffices to bound Regreti,τ πi, T for any πi (Si), where Regreti,τ πi, T := t=1 ui,τ(πi, π(t) i) t=1 ui,τ(π(t)) πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) . (60) Consider the following decomposition: πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) = πi π(t+1) i , Aiπ(κ(t+1) i ) + πi π(t+1) i , Aiπ(t+1) Aiπ(κ(t+1) i ) + τH(πi) τH(π(t+1) i ) = πi π(t+1) i , Aiπ(κ(t+1) i ) + π(t+1) i π(t+1) i , Aiπ(κ(t) i ) π(t+1) i π(t+1) i , Aiπ(κ(t+1) i ) Aiπ(κ(t) i ) + πi π(t+1) i , Aiπ(t+1) Aiπ(κ(t+1) i ) . (61) We now bound each term on the RHS of (61). For simplicity, we reuse the short-hand notation in (36). 
To begin with, note that π(t+1) i = (1 ητ)π(t) i + ηAiπ(κ(t+1) i ) + ci1 for some normalization constant ci. Thus we have πi π(t+1) i , Aiπ(κ(t+1) i ) η log π(t+1) i log π(t) i , πi π(t+1) i + τ log π(t) i , πi π(t+1) i KL πi π(t) i KL πi π(t+1) i KL π(t+1) i π(t) i + τ log π(t) i , πi π(t+1) i , where the second step is derived from the definition of KL-divergence. Similarly, it holds that π(t+1) i π(t+1) i , Aiπ(κ(t) i ) KL π(t+1) i π(t) i KL π(t+1) i π(t+1) i KL π(t+1) i π(t) i + τ log π(t) i , π(t+1) i π(t+1) i . Published as a conference paper at ICLR 2023 For the term π(t+1) i π(t+1) i , Aiπ(κ(t+1) i ) Aiπ(κ(t) i ) in (61), following the deduction of (33), we get π(t+1) i π(t+1) i , Aiπ(κ(t+1) i ) Aiπ(κ(t) i ) l=t γ(t+1) i γ(t+1) i Ψ(l) j + γ(t) i Ψ(l) j + 2c AKL π(t+1) i π(t+1) i . For the last term in (61), it similarly follows that πi π(t+1) i , Aiπ(t+1) Aiπ(κ(t+1) i ) = πi π(t) i , Aiπ(t+1) Aiπ(κ(t+1) i ) + π(t) i π(t+1) i , Aiπ(t+1) Aiπ(κ(t+1) i ) π(t+1) i π(t) i + 4cτ A X l=t γ(t+1) i γ(t+1) i Ψ(l) j . (65) Plugging (62) (63) (64) (65) into (61) yields πi π(t+1) i , Aiπ(t+1) + τ H(πi) H(π(t+1) i ) KL πi π(t) i KL πi π(t+1) i 1 j Ni Ψ(t) j + 2γ(t) i A X + 4cτ A γ(t+1) i X l=t γ(t+1) i Ψ(l) j + τ log π(t) i , πi π(t) i + τ H(πi) H(π(t+1) i ) . H(πi) H(π(t+1) i ) + log π(t) i , πi π(t+1) i = log πi log π(t) i , πi + log π(t+1) i log π(t) i , π(t+1) i π(t+1) i π(t) i KL πi π(t) i . Then we have πi π(t+1) i , Aiπ(t+1) + τ H(πi) H(π(t+1) i ) KL πi π(t) i KL πi π(t+1) i 1 η 2c A τ Ψ(t) i j Ni Ψ(t) j + 2γ(t) i A X Ψ(l) j + 4cτ A γ(t+1) i X l=t γ(t+1) i Since the learning rate satisfies 1 η 24d2 max A 2 τ (σ2 + 1) 2dmax A + τ = 2c A + τ, taking expectation on both sides of (66) leads to E h πi π(t+1) i , Aiπ(t+1) + τ H(πi) H(π(t+1) i ) i η E h KL πi π(t) i KL πi π(t+1) i i 1 η 2c A τ E h Ψ(t) i i Published as a conference paper at ICLR 2023 + 4cτ A E h Ψ(t)i + 4cτ A E l=0 E(t l)Ψ(l) # η E h KL πi π(t) i KL πi π(t+1) i i + 4cτ A E h Ψ(t)i + 4cτ A E l=0 E(t l)Ψ(l) # where we use the fact P j Ni Ψ(l) j Ψ(l) and the definition of E(t l). Since P l=0 E(l) σ2 by definition in Assumption 4, summing (67) over t = 0, 1, . . . , T yields E Regreti,τ T + 1 1 η E h KL πi π(0) i i + 4cτ A (σ2 + 1)E η log |Si| + 4cτ A (σ2 + 1)E It remains to establish i V log |Si|. (69) Proof of (69). Summing (66) over i V gives X πi π(t+1) i , Aiπ(t+1) + τ H(πi) H(π(t+1) i ) KL π π(t) KL π π(t+1) 1 η 2c A τ Ψ(t) j Ni Ψ(t) j + 2γ(t) i A X Ψ(l) j + 4cτ A X i V γ(t+1) i X l=t γ(t+1) i Taking expectation of on both sides and using (38) leads to πi π(t+1) i , Aiπ(t+1) + τ H(πi) H(π(t+1) i ) # η E h KL π π(t) KL π π(t+1) i 1 η 4c A(cτ + 1) τ E h Ψ(t)i + 4c A(cτ + 1)E l=0 E(t l)Ψ(l) # 2E h KL π π(t) i . Summing over t = 0, 1, . . . , T yields i V Regreti,τ π, T + 1 # η E h KL π π(0) i 1 η 4c A(cτ + 1)(σ2 + 1) E η E h KL π π(0) i 1 where the second line follows from 4c A(cτ + 1)(σ2 + 1)η 4dmax A 24d2max A 2 (σ2 + 1) < 1 due to τ dmax A and η τ 24d2max A 2 (σ2+1). Taking supremum with respect to π on both sides, in view of Lemma 4, we arrive at the advertised bound (69). Published as a conference paper at ICLR 2023 D PROOF FOR TWO-TIMESCALE OMWU (SECTION 4) D.1 PROOF OF THEOREM 3 Bounding KL π τ π(t) . For notational convenience, we set π(t) = π(t) = π(0) for t < 0. The following lemma parallels Lemma 2 by focusing on delayed feedbacks. The proof is postponed to Appendix E.5. Lemma 5. 
Assuming constant delays γ(t) i = γ, the iterates of OMWU based on the update rule (10) satisfy log π(t+1) (1 ητ) log π(t) ητ log π τ, π(t γ+1) π τ = 0. By following a similar argument in (19), we conclude that KL π τ π(t+1) = (1 ητ)KL π τ π(t) (1 ητ)KL π(t γ+1) π(t) KL π(t+1) π(t γ+1) + log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) ητKL π(t γ+1) π τ . (71) It boils down to control the term log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) . When t γ, by taking logarithm on the both sides of the update rules (7) and (10), we have log π(t γ+1) i 1= (1 ητ) log π(t γ) i + ηAiπ(t 2γ) log π(t+1) i 1= (1 ητ) log π(t) i + ηAiπ(t γ+1) 1= (1 ητ)γ+1 log π(t γ) i + η l=0 (1 ητ)l Aiπ(t γ l+1). Subtracting the above equalities and taking inner product with π(t γ+1) i π(t+1) i gives log π(t γ+1) i log π(t+1) i , π(t γ+1) i π(t+1) i l=0 (1 ητ)l π(t γ+1) i π(t+1) i , Ai(π(t 2γ) π(t γ l+1)) , where the log π(t γ) i terms cancel out due to the choice 1 ητ = (1 ητ)γ+1. Summing over i V , log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) l=0 (1 ητ)l π(t γ+1) i π(t+1) i , Ai(π(t 2γ) π(t γ l+1)) l=0 (1 ητ)l π(t γ+1) i π(t+1) i 1 π(t 2γ) j π(t γ l+1) j 1. (72) Using the triangle inequality, we can bound π(t 2γ) π(t γ l+1) 1 as π(t 2γ) π(t γ l+1) 1 π(l1 γ) i π(l1 γ+1) j 1 π(l1 γ) i π(l1) i 1 + π(l1 γ+1) j π(l1) j 1 Substitution of the bound into (72) yields log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) Published as a conference paper at ICLR 2023 l=0 (1 ητ)l t l X π(t γ+1) i π(t+1) i 1 π(l1 γ) j π(l1) j 1 + π(l1 γ+1) j π(l1) j 1 l=0 (1 ητ)l π(t γ+1) i π(t+1) i 1 π(l1 γ) j π(l1) j 1 + π(l1 γ+1) j π(l1) j 1 l=0 (1 ητ)l π(t γ+1) i π(t+1) i 2 1 l=0 (1 ητ)l π(l1 γ) j π(l1) j 2 1 + π(l1 γ+1) j π(l1) j 2 1 2(γ + 1)2KL π(t+1) π(t γ+1) l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) . (73) Plugging the above inequality into (71) and recursively applying the inequality gives KL π τ π(t+1) + KL π(t+1) π(t γ+1) + ητKL π(t γ+1) π τ (1 ητ)t+1 γKL π τ π(γ) l1=γ (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) 2(γ + 1)2 t X l1=γ (1 ητ)t l1KL π(l1+1) π(l1 γ+1) t2=γ (1 ητ)t l2 l2 X l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) (i) (1 ητ)t+1 γKL π τ π(γ) l1=γ (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) + 2(γ + 1)2ηdmax A l1=γ (1 ητ)t l1KL π(l1+1) π(l1 γ+1) + 2(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) (ii) (1 ητ)t+1 γKL π τ π(γ) + 2(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) , (74) where (i) results from basic calculation t2=γ (1 ητ)t l2 l2 X l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) l1=0 (1 ητ)t l1 l1+γ X l=0 (1 ητ)l1 l2+l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) l1=0 (1 ητ)t l1 γ X l=0 (1 ητ)l l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) Published as a conference paper at ICLR 2023 l1=0 (1 ητ)t l1(γ + 1)2 1 1 2(γ + 1) (γ+1) (1 ητ) KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) 2(γ + 1)2 t X l1=0 (1 ητ)t l1(1 ητ) KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) 2(γ + 1)2 t X l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) and (ii) is due to η min n 1 2τ(γ+1), 1 5dmax A (γ+1)2 o . To proceed, we introduce the following lemma concerning the error KL π τ π(γ) , with the proof postponed to Appendix E.6. Lemma 6. With constant delays γ(t) i = γ, the iterates of OMWU based on the update rule (10) satisfy KL π τ π(γ) (1 ητ)γKL π τ π(0) l1=0 (1 ητ)γ 1 l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) + 2ηγ2dmax A . 
With the lemma above in mind, we can continue to bound (74) by KL π τ π(t+1) (1 ητ)t+1KL π τ π(0) + 2(1 ητ)t+1 γηγ2dmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) + 2(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) (1 ητ)t+1KL π τ π(0) + (1 ητ)t+1 γ. Bounding KL π τ π(t γ+1) . By definition of KL divergence, we have KL π τ π(t γ+1) = KL π τ π(t+1) + π τ, log π(t+1) log π(t γ+1) = KL π τ π(t+1) + KL π(t+1) π(t γ+1) + π τ π(t+1), log π(t+1) log π(t γ+1) . (75) It remains to control the term π τ π(t+1), log π(t+1) log π(t γ+1) . By following a similar argument in (73), we have π τ π(t+1), log π(t+1) log π(t γ+1) l=0 (1 ητ)l π i,τ π(t+1) i , Ai(π(t 2γ) π(t γ l+1)) l=0 (1 ητ)l π i,τ π(t+1) i 1 π(t 2γ) j π(t γ l+1) j 1 l=0 (1 ητ)l t l X π i,τ π(t+1) i 1 π(l1 γ) j π(l1) j 1 + π(l1 γ+1) j π(l1) j 1 l=0 (1 ητ)l π i,τ π(t+1) i 1 π(l1 γ) j π(l1) j 1 + π(l1 γ+1) j π(l1) j 1 Published as a conference paper at ICLR 2023 2(γ + 1)2KL π τ π(t+1) l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) . Substitution of the above inequality into (75) yields KL π τ π(t γ+1) + ητKL π(t γ+1) π τ = (1 + 2(γ + 1)2ηdmax)KL π τ π(t+1) + KL π(t+1) π(t γ+1) + ητKL π(t γ+1) π τ l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) (i) 2 KL π τ π(t+1) + KL π(t+1) π(t γ+1) + ητKL π(t γ+1) π τ + 2(γ + 1)ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) (ii) 2(1 ητ)t+1 γKL π τ π(γ) 2 l1=γ (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) + 4(γ + 1)2ηdmax A l1=γ (1 ητ)t l1KL π(l1+1) π(l1 γ+1) + 6(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) 2(1 ητ)t+1 γKL π τ π(γ) + 6(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) , where (i) results from l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) l1=t γ (1 ητ)t l1 t l1 X l=0 (1 ητ)l+l1 t KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) l1=t γ (1 ητ)t l1(γ + 1)(1 ητ) (γ+1)(1 ητ) KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) . and (ii) is due to the bound established in (74). Finally, applying Lemma 6 yields KL π τ π(t γ+1) + ητKL π(t γ+1) π τ 2(1 ητ)t+1KL π τ π(0) + 4(1 ητ)t+1 γηγ2dmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) + 6(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) Published as a conference paper at ICLR 2023 2(1 ητ)t+1KL π τ π(0) + 2(1 ητ)t+1 γ. (76) Bounding the QRE gap. With Lemma 3, we have QRE-Gapτ(π(t γ+1)) d2 max A 2 τ KL π τ π(t γ+1) + τKL π(t γ+1) π τ max nd2 max A 2 τ , 1 o KL π τ π(t γ+1) + ητKL π(t γ+1) π τ 2 max nd2 max A 2 τ , 1 o (1 ητ)t+1KL π τ π(0) + (1 ητ)t+1 γ , where the last step results from (76). D.2 PROOF OF THEOREM 4 Bounding the term KL π τ π(t) . Recall that the update rule of π(t) i (k) is given by π(t) i (k) π(t 1) i (k)1 ητ exp(η[Aiπ(κ(t) i )]k). (77) We introduce an auxiliary variable eπ(t) i : eπ(t) i (k) π(t 1) i (k)1 eη(t) i τ exp eη(t) i [Aiπ(κ(t) i )]k , (78) which can be viewed as a conceptual alternative update of π(t) i with a different step size eη(t) i > 0 satisfying (1 eη(t) i τ)(1 ητ)t κ(t) i = 1 ητ or equivalently 1 eη(t) i τ = (1 ητ)γ+1 t+κ(t) i . It directly follows that eη(t) i η. Since κ(t) i t, we have 1 eη(t) i τ 1 (γ + 1 t + κ(t) i )ητ 1 (γ + 1)ητ, which implies eη(t) i (γ + 1)η. For notational convenience, we set eπ(t) i = π(0), eη(t) i = η and κ(t) i = 0 when t 0. The following lemma establishes a one-step analysis, with the proof postponed to Appendix E.7. Lemma 7. 
When t 1, it holds that KL π i,τ π(t) i + ητKL π (κ(t) i ) i π i,τ = (1 ητ)KL π i,τ π(t 1) i η(π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) ψ(t) i log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i , (79) ψ(t) i := 1 η KL π(t) i π(t 1) i (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i + KL eπ(t) i π (κ(t) i ) i + KL π(t) i eπ(t) i . We proceed to control the term log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i . By definition, we have log eπ(t) i 1= (1 eη(t) i τ) log π(t 1) i + eη(t) i Aiπ(κ(t) i ) 1= (1 eη(t) i τ)(1 ητ)t κ(t) i log π(κ(t) i 1) Published as a conference paper at ICLR 2023 Aiπ(κ(t) i ) + (1 eη(t) i τ)(1 ητ)t 1 l Aiπ(κ(l) i ) log π (κ(t) i ) i 1= (1 ητ) log π(κ(t) i 1) + ηAiπ(κ (κ(t) i 1) i ) when κ(t) i 1. Subtracting the two equations yields log π (κ(t) i ) i log eπ(t) i 1= eη(t) i Ai(π(κ (κ(t) i 1) i ) π(κ(t) i )) (1 eη(t) i τ)(1 ητ)t 1 l Ai(π(κ (κ(t) i 1) i ) π(κ(l) i )) , where the log π(κ(t) i 1) terms cancel out due to (1 eη(t) i τ)(1 ητ)t κ(t) i = 1 ητ. It follows that log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i π (κ(t) i ) i eπ(t) i , Ai(π(κ (κ(t) i 1) i ) π(κ(t) i )) (1 eη(t) i τ)(1 ητ)t 1 l π (κ(t) i ) i eπ(t) i , Ai(π(κ (κ(t 1) i ) i ) π(κ(l) i )) eη(t) i A π (κ(t) i ) i eπ(t) i 1 X π (κ(l) i ) j π (κ (κ(t 1) i ) i ) j 1. (81) The next lemma establishes an upper bound on the term Pt l=κ(t) i π (κ(l) i ) j π (κ (κ(t 1) i ) i ) j 1, with the proof postponed to Appendix E.8. Lemma 8. Let νj(t) denote the time index when agent j receives the payoff from the t-th iteration, i.e., κ(νj(t)) j = t. For t = 0, we set νj(0) to an arbitrary index that satisfies κ(νj(0)) j = 0. When t 2γ + 1, it holds that π (κ(l) i ) j π (κ (κ(t 1) i ) i ) j 1 4 ψ (νj(κ (κ(t 1) i ) i )) j , Plugging Lemma 8 into (81) gives log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i eη(t) i A π (κ(k) i ) i eπ(t) i 1 X ψ (νj(κ (κ(t 1) i ) i )) j i 2 eη(t) i A 14dmax(γ + 1)3/2 π (κ(t) i ) i eπ(t) i 2 1 h 8(γ + 1)3/2 t+γ X l=t 2γ ψ(l) j + 4(γ + 1)5/2ψ (νj(κ (κ(t 1) i ) i )) j i (ii) eη(t) i A 14dmax(γ + 1)5/2ψ(t) i + X h 4(γ + 1)3/2 t+γ X l=t 2γ ψ(l) j + 2(γ + 1)5/2ψ (νj(κ (κ(t 1) i ) i )) j i , Published as a conference paper at ICLR 2023 where (i) results from Young s inequality π (κ(t) i ) i eπ(t) i 1 2(γ + 1)1/2 π (κ(t) i ) i eπ(t) i 2 1 + 2(γ + 1)1/2ψ(l) j and (ii) follows from π (κ(t) i ) i eπ(t) i 1 2KL eπ(t) i π (κ(t) i ) i 2(γ + 1)ψ(t) i . Plugging (82) into (79) and summing over i V yields KL π τ π(t) + ητ X π (κ(t) i ) i π τ (1 ητ)KL π τ π(t 1) η X i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) (1 14ηdmax A (γ + 1)5/2) X i V ψ(t) i + 2η A (γ + 1)5/2 X (i,j) E ψ (νj(κ (κ(t 1) i ) i )) j + 4ηdmax A (γ + 1)3/2 t+γ X l=t 2γ ψ(l), (83) where we denote P i V ψ(l) i by ψ(l) for notation simplicity. We then seek to sum the above equation over t = 2γ + 1, , T. Before proceeding, we note that l=t 2γ ψ(l) t=l γ ψ(l) 3(γ + 1) (i,j) E ψ (νj(κ (κ(t 1) i ) i )) j X t=0 ψ(t) j dmax where the first step is due to the mapping t 7 νj(κ (κ(t 1) i ) i ) being injective when t 2γ + 1 (cf. Assumptions 2, 3). Note that ψ(t) j = 0 when t 0 and hence can be safely discarded. 
Taken together, we arrive at t=2γ+1 KL π τ π(t) + ητ π (κ(t) i ) i π i,τ (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) 1 14ηdmax A (γ + 1)5/2 T X t=2γ+1 ψ(t) + 12ηdmax A (γ + 1)5/2 T +γ X + 2ηdmax A (γ + 1)5/2 T +γ 1 X (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) 1 28ηdmax A (γ + 1)5/2 T X t=2γ+1 ψ(t) + 14ηdmax A (γ + 1)5/2 X Published as a conference paper at ICLR 2023 (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) + 1 l Γ ψ(l), (84) where Γ = {1, , 2γ} {T + 1, , T + γ}. The last step results from the choice of learn- ing rate η 1 28dmax A (γ+1)5/2 . It now remains to bound the terms PT t=2γ+1 P i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ), KL π τ π(2γ) and P l Γ ψ(l). In view of Lemma 1, we have i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) i V (π(t) i π i,τ) Ai(π(t) π τ) i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ). We remark that each (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) term will cancel out due to the mapping t 7 κ(t) i being injective when t γ. In addition, we have a crude bound (π(t) i π i,τ) Ai(π(t) π τ) = X j Ni (π(t) i π i,τ) Aij(π(t) j π j,τ) 4dmax A for every i V, t 0. Applying the bound to the remaining nγ terms gives i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) 4nγdmax A . (85) The remaining terms KL π τ π(2γ) and ψ(l) can be bounded with the following lemma, with the proof postponed to Appendix E.9. Lemma 9. It holds for all i V and t 0 that ψ(t) i η(dmax A (2γ + 11) + 3τ log |Si|). (86) In addition, we have KL π i,τ π(2γ) i KL π i,τ π(0) i + 4ηdmax A γ. (87) Putting all pieces together, we continue from (84) and show that t=2γ+1 KL π τ π(t) (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) + 1 KL π i,τ π(0) i + 8ηnγdmax A + ηγ ndmax A (2γ + 11) + 3τ X i V log |Si| KL π i,τ π(0) i + 8ηn h γdmax A + γ dmax A (2γ + 11) + 3τ log Smax i KL π i,τ π(0) i + n + 24ητnγ log Smax. Bounding the term KL π τ π(t γ+1) . By definition of KL divergence, we have KL π i,τ π(t γ+1) i = KL π i,τ π(t+1) i + π i,τ, log π(t+1) i log π(t γ+1) i = KL π i,τ π(t+1) i KL π(t γ+1) i π(t+1) i + π i,τ π(t γ+1) i , log π(t+1) i log π(t γ+1) i . (88) Published as a conference paper at ICLR 2023 It follows directly from the update rules that log π(t γ+1) i 1= (1 ητ) log π(t γ) i + ηAiπ(κ(t γ) i ) log π(t+1) i 1= (1 ητ)γ+1 log π(t γ) i + η l=t γ+1 (1 ητ)t l+1Aiπ(κ(l) i ), which enables us to control the term π i,τ π(t γ+1) i , log π(t+1) i log π(t γ+1) i as π i,τ π(t γ+1) i , log π(t+1) i log π(t γ+1) i l=t γ+1 (1 ητ)t l+1 π i,τ π(t γ+1) i , Ai(π(κ(t γ) i ) π(κ(l) i )) η A π i,τ π(t γ+1) i 1 X π (κ(t γ) i ) j π (κ(l) i ) j 1. (89) In the same vein as Lemma 8, we can bound the term Pt+1 l=t γ+1 π (κ(t γ) i ) j π (κ(l) i ) j 1 with {ψ(l) i }, as detailed in the following lemma. The proof is omitted due to its similarity with that of Lemma 8. Lemma 10. When t 2γ, it holds that π (κ(t γ) i ) j π (κ(l) i ) j 1 4 2(γ + 1)2 r ψ (νj(κ(t γ) i )) j . Plugging the above lemma into (89), we have π i,τ π(t γ+1) i , log π(t+1) i log π(t γ+1) i η A π i,τ π(t γ+1) i 1 X 2(γ + 1)2 r ψ (νj(κ(t γ) i )) j 14dmax(γ + 1)3/2 π i,τ π(t γ+1) i 2 1 h 8(γ + 1)3/2 t+γ+1 X l=t 2γ+1 ψ(l) j + 4(γ + 1)5/2ψ (νj(κ(t γ) i )) j i 14dmax(γ + 1)3/2KL π i,τ π(t γ+1) i h 4(γ + 1)3/2 t+γ+1 X l=t 2γ+1 ψ(l) j + 2(γ + 1)5/2ψ (νj(κ(t γ) i )) j i , where (i) results from similar arguments in (82) and (ii) invokes Pinsker s inequality. 
Substitution of the above inequality into (88) and summing over i V leads to (1 14ηdmax A (γ + 1)3/2)KL π τ π(t γ+1) KL π τ π(t+1) + η A X h 4(γ + 1)3/2 t+γ+1 X l=t 2γ+1 ψ(l) j + 2(γ + 1)5/2ψ (νj(κ(t γ) i )) j i KL π τ π(t+1) + 4ηdmax A (γ + 1)3/2 t+γ+1 X l=t 2γ+1 ψ(l) + 2ηdmax A (γ + 1)5/2ψ(νj(κ(t γ) i )). Summing the above inequality over t = 2γ 1, , T 1 and adding PT 1 t=2γ P π (κ(t+1) i ) i π i,τ to the both sides, 3KL π τ π(t γ+1) + X π (κ(t+1) i ) i π i,τ Published as a conference paper at ICLR 2023 t=2γ KL π τ π(t+1) + π (κ(t+1) i ) i π i,τ + 4ηdmax A (γ + 1)3/2 T 1 X l=t 2γ+1 ψ(l) + 2ηdmax A (γ + 1)5/2 T 1 X t=2γ ψ(νj(κ(t γ) i )) (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) 1 28ηdmax A (γ + 1)5/2 T X t=2γ+1 ψ(t) + 14ηdmax A (γ + 1)5/2 X + 12ηdmax A (γ + 1)5/2 T +γ X l=1 ψ(l) + 2ηdmax A (γ + 1)5/2 T +γ 1 X (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) 1 28(1 + ητ 2 )ηdmax A (γ + 1)5/2 T X t=2γ+1 ψ(t) + 14(1 + ητ)ηdmax A (γ + 1)5/2 X Here, (i) invokes the bound established in (84). We remark that our choice of learning rate η min n 1 2τ(γ + 1), 1 42dmax A (γ + 1)5/2 guarantees 1 28(1 + ητ 2 )ηdmax A (γ + 1)5/2 0. This taken together with (85) and Lemma 9 gives 3KL π τ π(t γ+1) + X π (κ(t+1) i ) i π i,τ (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) + 1 KL π i,τ π(0) i + 8ηn h γdmax A + 3γ dmax A (2γ + 11) + 3τ log Smax i KL π i,τ π(0) i + n + 36ητnγ log Smax Bounding the QRE gap. With Lemma 3, we have t=2γ QRE-Gapτ(π(t+1)) d2 max A 2 τ KL π τ π(t+1) + τKL max n3d2 max A 2 2τ , τ o T γ 1 X 3KL π τ π(t+1) + KL π(t+1) π τ . Since the mapping t 7 νi(t) is injective, we have π(t+1) i π i,τ = π (κ (νi(t+1)) i ) i π i,τ π (κ(t+1) i ) i π i,τ . Combining the above two equalities gives t=2γ QRE-Gapτ(π(t+1)) Published as a conference paper at ICLR 2023 max n3d2 max A 2 2τ , τ o T γ 1 X 2 3KL π τ π(t+1) + max n3d2 max A 2 2τ , τ o T 1 X 3KL π τ π(t γ+1) + X π (κ(t+1) i ) i π i,τ max n3d2 max A 2 2τ , τ o 1 KL π i,τ π(0) i + n + 36ητnγ log Smax , where the last step results from (90). E PROOF OF AUXILIARY LEMMAS E.1 PROOF OF LEMMA 1 To prove this lemma, we recall a key observation in Cai et al. (2016) that allows one to transform a zero-sum polymatrix game G = {(V, E), {Si}i V , {Aij}(i,j) E} into a pairwise constant-sum polymatrix game e G = {(V, E), {Si}i V , { e Aij}(i,j) E} such that (1) For every player i V , it has the same payoff in G and e G: ui(s) = eui(s), s S. (2) For each pair (i, j) E, i = j, the two-player game e G is constant-sum, i.e., there exist constants αij = αji, such that e Aij(si, sj) + e Aji(sj, si) = αij (91) holds for all si Si, sj Sj. We are now in a place to prove Lemma 1. Let e G be the pairwise constant-sum polymatrix game associated with G after the above payoff preserving transformation. We have X ui(πi, π i) + ui(π i, π i) eui(πi, π i) + eui(π i, π i) E si πi,sj π j h e Aij(si, sj) i + E si π i,sj πj h e Aij(si, sj) i# E si πi,sj π j h e Aij(si, sj) i + E si π i,sj πj h αij e Aji(sj, si) i# (i,j) E αij = 0, where the penultimate line uses (91), and the last line uses the fact that e G is also a zero-sum polymatrix game, which satisfies X (i,j) E αij = X h e Aij(si, sj) + e Aji(sj, si) i = X i V eui(s) + X j V euj(s) = 0 for any arbitrary s S. E.2 PROOF OF LEMMA 2 In view of the update rule (7), we have log π(t+1) i = (1 ητ) log π(t) i + ηAiπ(t+1) + ci1 Published as a conference paper at ICLR 2023 for some constant ci. 
On the other hand, it follows from the expression of QRE in (4) that ητ log π i,τ = ηAiπ τ + c i 1 (92) for some constant c i . By combining the above two equalities and taking the inner product with π(t+1) i π i,τ, we have log π(t+1) i (1 ητ) log π(t) i ητ log π i,τ, π(t+1) i π i,τ = η(π(t+1) i π i,τ) Ai(π(t+1) π τ). (93) Summing the above equality over i V gives log π(t+1) (1 ητ) log π(t) ητ log π τ, π(t+1) π τ i V (π(t+1) i π i,τ) Ai(π(t+1) π τ) h (π(t+1) i ) Aiπ(t+1) + (π i,τ) Aiπ τ i η X h (π(t+1) i ) Aiπ τ + (π i,τ) Aiπ(t+1)i h ui(π(t+1)) + ui(π τ) i = 0, where the last line follows from P i V h (π(t+1) i ) Aiπ τ + (π i,τ) Aiπ(t+1)i = 0 due to Lemma 1, as well as that the game is zero-sum. E.3 PROOF OF LEMMA 3 Recalling the definition QRE-Gapτ(π) = max i V max π i (Si) ui,τ(π i, π i) ui,τ(π) max π i (Si) ui,τ(π i, π i) ui,τ(π) = max i V :π i (Si) i V [ui,τ(π i, π i) ui,τ(πi, π i)] , where the inequality holds since maxπ i (Si) ui,τ(π i, π i) ui,τ(π) ui,τ(πi, π i) ui,τ(π) = 0 for all i V . We now proceed to decompose X i V [ui,τ(π i, π i) ui,τ(πi, π i)] ui,τ(π i, π i) ui,τ(π i,τ, π i,τ) τ X H(πi) H(π i,τ) ui,τ(π i, π i) ui,τ(π i, π i,τ) ui,τ(π i,τ, π i) + ui,τ(π i,τ, π i,τ) ui,τ(π i,τ, π i) ui,τ(π i,τ, π i,τ) τ H(πi) H(π i,τ) ui,τ(π i, π i,τ) ui,τ(π i,τ, π i,τ) (94) where the first line follows from P i V (ui,τ(π) τH(πi)) = P i V ui,τ(π τ) τH(π i,τ) = 0 by the definition of zero-sum games. It boils down to control the terms on the RHS of (94). To control the first term, by the definition of ui,τ in (5) (see also (3)), it follows that ui,τ(π i, π i) ui,τ(π i, π i,τ) ui,τ(π i,τ, π i) + ui,τ(π i,τ, π i,τ) = ui(π i, π i) ui(π i, π i,τ) ui(π i,τ, π i) + ui(π i,τ, π i,τ) Published as a conference paper at ICLR 2023 = (π i π i,τ) Ai(π π τ) = X j Ni (π i π i,τ) Aij(πj π j,τ), which each summand can be further bounded by Young s inequality and Pinsker s inequality as (π i π i,τ) Aij(πj π j,τ) A π i π i,τ 1 πj π j,τ 1 π i π i,τ 2 1 + dmax A πj π j,τ 2 1 τ dmax A KL π i π i,τ + dmax A τ KL π j,τ πj . Summing the inequality over i, j gives X ui,τ(π i, π i) ui,τ(π i, π i,τ) ui,τ(π i,τ, π i) + ui,τ(π i,τ, π i,τ) τKL π π τ + d2 max A 2 τ KL π τ π . (95) Regarding the second term, we have X ui,τ(π i,τ, π i) ui,τ(π i,τ, π i,τ) τ H(πi) H(π i,τ) (π i,τ) Ai(π π τ) + τ(π i log πi (π i,τ) log π i,τ) (π i,τ) Ai(π π τ) + τ πi, log πi log π i,τ + πi π i,τ, log π i,τ (π i,τ) Ai(π π τ) + (πi π i,τ) Aiπ τ + τKL πi π i,τ = τKL π π τ , (96) where the penultimate step follows from (92) and the last step invokes Lemma 1. Moving to the last term, we have ui,τ(π i,τ, π i,τ) ui,τ(π i, π i,τ) = (π i,τ π i) Aiπ τ τ(π i,τ) log π i,τ + τ(π i) log π i = τ(π i,τ π i) log π i,τ τ(π i,τ) log π i,τ + τ(π i) log π i = τKL π i π i,τ . (97) where the second line follows again from (92). Plugging (95), (96) and (97) into (94) gives X i V [ui,τ(π i, π i) ui,τ(πi, π i)] τKL π π τ + d2 max A 2 τ KL π τ π . Taking maximum over π finishes the proof. E.4 PROOF OF LEMMA 4 Let eπ(T ) = 1 T +1 PT t=0 π(t+1), then eπ(T ) (S). The proof is completed if we can show X i V Regreti,τ T + 1 X i V Regreti,τ eπ(T ) i , T + 1 0, (98) where the first inequality holds trivially since Regreti,τ T + 1 Regreti,τ eπ(T ) i , T . It then boils down to show the second inequality of the above relation. From the definition of zero-sum polymatrix games, it holds that eπ(T ) i π(t+1) i , Aiπ(t+1) = X eπ(T ) i , Aiπ(t+1) Published as a conference paper at ICLR 2023 eπ(T ) i , Ai = (T + 1) X eπ(T ) i , Aieπ(T ) = 0. 
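Both the transformation argument in E.1 and the display above rest on the closed-system property of zero-sum polymatrix games, namely that the total payoff vanishes for every product strategy profile. The short Python sketch below checks this numerically for a toy game in which every edge carries antisymmetric payoffs, i.e., the special case αij = 0 of the constant-sum relation (91); the triangle graph, the action-set sizes and the random payoffs are illustrative choices, not anything used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy zero-sum polymatrix game on a triangle: edges (0,1), (1,2), (0,2).
    sizes = [2, 3, 4]                      # |S_i| for the n = 3 agents (illustrative)
    edges = [(0, 1), (1, 2), (0, 2)]

    A = {}
    for (i, j) in edges:
        A[(i, j)] = rng.standard_normal((sizes[i], sizes[j]))
        A[(j, i)] = -A[(i, j)].T           # antisymmetric pair: alpha_ij = 0 in (91)

    def random_simplex(k):
        p = rng.random(k)
        return p / p.sum()

    pi = [random_simplex(k) for k in sizes]  # an arbitrary product strategy profile

    # u_i(pi) = sum over neighbours j of  pi_i^T A_ij pi_j
    def payoff(i):
        return sum(pi[i] @ A[(i, j)] @ pi[j] for j in range(len(sizes)) if (i, j) in A)

    total = sum(payoff(i) for i in range(len(sizes)))
    print("total payoff:", total)          # vanishes up to floating-point error
    assert abs(total) < 1e-10

The same edge-wise cancellation is what makes the double sum in E.1 collapse to the sum of the constants αij, which is zero.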
In addition, applying Jensen s inequality gives t=0 H(eπ(T ) i ) = (T + 1)H(eπ(T ) i ) t=0 H(π(t+1) i ). Combining the above two relations yields i V Regreti,τ eπ(T ) i , T + 1 X eπ(T ) i π(t+1) i , Aiπ(t+1) + τH(eπ(T ) i ) τH(π(t+1) i ) 0, which concludes the proof. E.5 PROOF OF LEMMA 5 Taking logarithm on the both sides of (7), we have log π(t+1) i 1= (1 ητ) log π(t) i + ηAiπ(t γ+1). (99) On the other hand, the definition of QRE in (4) gives ητ log π i,τ 1= ηAiπ τ. Subtracting the two equalities and taking inner product with π(t γ+1) i π i,τ, we get log π(t+1) i (1 ητ) log π(t) i ητ log π i,τ, π(t γ+1) i π i,τ π(t γ+1) i π i,τ Ai π(t γ+1) π τ . Summing the above equality over i V leads to log π(t+1) (1 ητ) log π(t) ητ log π τ, π(t γ+1) i π τ π(t γ+1) i π i,τ Ai π(t γ+1) π τ = 0, where the final step results from Lemma 1. E.6 PROOF OF LEMMA 6 Recall from (71) that KL π τ π(t+1) = (1 ητ)KL π τ π(t) (1 ητ)KL π(t γ+1) π(t) KL π(t+1) π(t γ+1) + log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) ητKL π(t γ+1) π τ . (100) When t < γ, we have π(t γ+1) i = π(0). It follows that log π(t γ+1) i = log π(0) 1= 0, log π(t+1) i 1= (1 ητ)t+1 log π(0) + η l=0 (1 ητ)l Aiπ(t γ l+1) Published as a conference paper at ICLR 2023 l=0 (1 ητ)l Aiπ(0). Therefore, we can bound the term log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) as log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) = η l=0 (1 ητ)l Aiπ(0), π(0) π(t+1) η(t + 1)dmax A π(0) π(t+1) 1 2η(t + 1)dmax A . (101) Plugging the above inequality into (100) leads to KL π τ π(t+1) (1 ητ)KL π τ π(t) (1 ητ)KL π(t γ+1) π(t) KL π(t+1) π(t γ+1) + 2η(t + 1)dmax A . Applying the above inequality recursively to the iterates 0, 1, . . . , γ 1, we arrive at KL π τ π(γ) (1 ητ)γKL π τ π(0) l1=0 (1 ητ)γ 1 l1h (1 ητ)KL π(l1 γ+1) π(l1) + KL π(l1+1) π(l1 γ+1) i l1=0 (1 ητ)γ 1 l1(l1 + 1)dmax A (1 ητ)γKL π τ π(0) l1=0 (1 ητ)γ 1 l1h (1 ητ)KL π(l1 γ+1) π(l1) + KL π(l1+1) π(l1 γ+1) i + 2ηγ2dmax A . E.7 PROOF OF LEMMA 7 Taking logarithm on the both sides of (77) and (78), we get η log eπ(t) i log π(t 1) i 1= eη(t) i log π(t) i log π(t 1) i , or equivalently log π(t) i 1= η eη(t) i log eπ(t) i + 1 η log π(t 1) i . Taking inner product with π i,τ π(t) i , log π(t) i η eη(t) i log eπ(t) i 1 η log π(t 1) i , π i,τ π(t) i = 0. By definition of KL divergence, we have log π(t) i η eη(t) i log eπ(t) i 1 η log π(t 1) i , π i,τ = (log π(t) i log π i,τ) η eη(t) i (log eπ(t) i log π i,τ) 1 η (log π(t 1) i log π i,τ), π i,τ = KL π i,τ π(t) i + 1 η KL π i,τ π(t 1) i + η eη(t) i KL π i,τ eπ(t) i , and log π(t) i η eη(t) i log eπ(t) i 1 η log π(t 1) i , π(t) i Published as a conference paper at ICLR 2023 eη(t) i KL π(t) i eπ(t) i + 1 η KL π(t) i π(t 1) i . Taken together, we get KL π i,τ π(t) i + η eη(t) i KL π(t) i eπ(t) i + 1 η KL π(t) i π(t 1) i KL π i,τ π(t 1) i + η eη(t) i KL π i,τ eπ(t) i . (102) On the other hand, taking logarithm of (78) and making inner product with π (κ(t) i ) i π i,τ gives log eπ(t) i (1 eη(t) i τ) log π(t 1) i eη(t) i τ log π i,τ, π (κ(t) i ) i π i,τ = eη(t) i (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ). Following a similar discussion in (19) gives KL π i,τ eπ(t) i = (1 eη(t) i τ)KL π i,τ π(t 1) i (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i eη(t) i τKL π (κ(t) i ) i π i,τ KL eπ(t) i π (κ(t) i ) i + log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i eη(t) i (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ). 
(103) Plugging the above equation into (102), KL π i,τ π(t) i + η eη(t) i KL π(t) i eπ(t) i + KL π(t) i π(t 1) i = (1 ητ)KL π i,τ π(t 1) i η(π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i + eη(t) i τKL π (κ(t) i ) i π i,τ + KL eπ(t) i π (κ(t) i ) i log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i . Rearranging the terms finishes the proof. E.8 PROOF OF LEMMA 8 For notational convenience, we set ϕ(t) i = 1 η π(t) i π(t 1) i 1 π (κ(t) i ) i π(t 1) i 1 + eπ(t) i π (κ(t) i ) i 1 + π(t) i eπ(t) i 1 for all i V, t 0. By triangular inequality, we have ϕ(t) i π(t) i π(t 1) i 1. In addition, we denote by t1 t2 := min{t1, t2} and t1 t2 := max{t1, t2}. For 0 < t1 < t2, it holds that π (κ(t1) i ) j π (κ(t2) i ) j 1 π (νj(κ(t1) i )) j π (νj(κ(t2) i )) j 1 + π (κ(t1) i ) j π (νj(κ(t1) i )) j 1 + π (κ(t2) i ) j π (νj(κ(t2) i )) j 1 νj(κ(t1) i ) νj(κ(t2) i ) X l=(νj(κ(t1) i )+1) (νj(κ(t2) i )+1) π(l) j π(l 1) j 1 + π (κ(t1) i ) j eπ (νj(κ(t1) i )) j 1 + eπ (νj(κ(t1) i )) j π (νj(κ(t1) i )) j 1 Published as a conference paper at ICLR 2023 + π (κ(t2) i ) j eπ (νj(κ(t2) i )) j 1 + eπ (νj(κ(t2) i )) j π (νj(κ(t2) i )) j 1 νj(κ(t1) i ) νj(κ(t2) i ) X l=(νj(κ(t1) i )+1) (νj(κ(t2) i )+1) ϕ(l) j + η (νj(κ(t1) i )) j η ϕ (νj(κ(t1) i )) j + η (νj(κ(t2) i )) j η ϕ (νj(κ(t2) i )) j νj(κ(t1) i ) νj(κ(t2) i ) X l=(νj(κ(t1) i )+1) (νj(κ(t2) i )+1) ϕ(l) j + (γ + 1)ϕ (νj(κ(t1) i )) j + (γ + 1)ϕ (νj(κ(t2) i )) j . (104) Therefore, we have π (κ(k) i ) j π (κ (κ(t 1) i ) i ) j 1 ( νj(κ (κ(t 1) i ) i ) νj(κ(k) i ) X l=(νj(κ (κ(t 1) i ) i )+1) (νj(κ(k) i )+1) ϕ(l) j + (γ + 1)ϕ (νj(κ (κ(t 1) i ) i )) j + (γ + 1)ϕ (νj(κ(k) i )) j Since 0 (t γ) κ(t) i t νi(t) t + γ for all i V , t 0, the first term can be bounded by νj(κ (κ(t 1) i ) i ) νj(κ(k) i ) X l=(νj(κ (κ(t 1) i ) i )+1) (νj(κ(k) i )+1) (t+γ 1) (k+γ) X l=(t 2γ) (k γ+1) ϕ(l) j l=t 2γ ϕ(l) j . In addition, the mapping k 7 νj(κ(k) i ) is injective when k γ (cf. Assumption 2 and 3). It follows that ϕ (νj(κ(k) i )) j l=κ(t) i γ ϕ(l) j l=t 2γ ϕ(l) j Plugging the above inequalities into (105) yields π (κ(k) i ) j π (κ (κ(t 1) i ) i ) j 1 (t + 1 κ(t) i ) l=t 2γ ϕ(l) j + (t + 1 κ(t) i )(γ + 1)ϕ (νj(κ (κ(t 1) i ) i )) j + (γ + 1) l=t 2γ ϕ(l) j l=t 2γ ϕ(l) j + (γ + 1)2ϕ (νj(κ (κ(t 1) i ) i )) j . Finally, we control the term ϕ(t) i with ψ(t) i as: (ϕ(t) i )2 = 1 η 1/2 π(t) i π(t 1) i 1 eη(t) i (1 eη(t) i τ) 1 1/2 η eη(t) i (1 eη(t) i τ) 1/2 π (κ(t) i ) i π(t 1) i 1 1/2 eπ(t) i π (κ(t) i ) i 1 + π(t) i eπ(t) i 1 eη(t) i + η 2 + (1 eη(t) i τ) 1 1 η π(t) i π(t 1) i 2 1 Published as a conference paper at ICLR 2023 (1 eη(t) i τ) π (κ(t) i ) i π(t 1) i 2 1 + eπ(t) i π (κ(t) i ) i 2 1 + π(t) i eπ(t) i 2 1 (ii) 2 2 + (1 eη(t) i τ) 1 1 η KL π(t) i π(t 1) i (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i + KL eπ(t) i π (κ(t) i ) i + KL π(t) i eπ(t) i (iii) 8ψ(t) i , (106) where (i) applies Cauchy-Schwarz inequality, (ii) invokes Pinsker s inequality and (iii) is due to eη(t) i τ (γ + 1)ητ 1/2. Combining the above two inequalities finishes the proof. E.9 PROOF OF LEMMA 9 We start with verifying the claim (86). Recall that ψ(t) i := 1 η KL π(t) i π(t 1) i (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i + KL eπ(t) i π (κ(t) i ) i + KL π(t) i eπ(t) i . We introduce the following standard Lemma (see e.g., (Cen et al., 2020, Appendix A.2)), which allows us to bound control KL πi π i properly: Lemma 11. Given πi, π i (Si) and w R|Si| with log πi 1= log π i + w, we have KL πi π i log πi log π i 2 w . 
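As a quick numerical sanity check of Lemma 11 (where, consistent with the rest of the appendix, equality of logarithms is understood up to an additive multiple of the all-ones vector coming from normalization), the sketch below perturbs the logits of a random distribution by a vector w, renormalizes, and verifies both inequalities. The construction and all numerical values are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    def kl(p, q):
        return float(np.sum(p * (np.log(p) - np.log(q))))

    for _ in range(1000):
        k = rng.integers(2, 8)
        q = rng.random(k) + 1e-2          # reference distribution pi'_i
        q /= q.sum()
        w = rng.standard_normal(k)        # perturbation of the logits

        logits = np.log(q) + w            # log pi_i equals log pi'_i + w up to a constant
        p = np.exp(logits - logits.max())
        p /= p.sum()                      # renormalize: the constant is absorbed here

        log_ratio_inf = np.max(np.abs(np.log(p) - np.log(q)))
        # Lemma 11:  KL(pi_i || pi'_i) <= ||log pi_i - log pi'_i||_inf <= 2 ||w||_inf
        assert kl(p, q) <= log_ratio_inf + 1e-12
        assert log_ratio_inf <= 2 * np.max(np.abs(w)) + 1e-12

    print("Lemma 11 holds on all random instances")

This is how the KL terms constituting ψ_i^{(t)} are controlled in the argument that follows: each one is reduced to a sup-norm bound on the corresponding difference of logits.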
Therefore, it suffices to figure out the terms log π(t) i log π(t 1) i , log π(t) i log eπ(t) i , log eπ(t) i log π (κ(t) i ) i and log π (κ(t) i ) i log π(t 1) i . Bounding KL π(t) i π(t 1) i and KL π(t) i eπ(t) i . The following equations follow directly from (77) and (78): log π(t) i log π(t 1) i 1= η([Aiπ(κ(t) i )]k τ log π(t 1) i ) log π(t) i log eπ(t) i 1= (η eη(t) i )([Aiπ(κ(t) i )]k τ log π(t 1) i ) . (107) In addition, we have the following bound w.r.t. the order of log π(t 1) i , which we shall establish momentarily. τ log π(t 1) i τ log |Si| + 2dmax A . (108) This taken together with Lemma 11 yields ( KL π(t) i π(t 1) i η(3dmax A + τ log |Si|) KL π(t) i eπ(t) i (eη(t) i η)(3dmax A + τ log |Si|) . (109) Bounding KL eπ(t) i π (κ(t) i ) i . When κ(t) i 1, we recall from (80) that: log π (κ(t) i ) i log eπ(t) i 1= eη(t) i Ai(π(κ (κ(t) i 1) i ) π(κ(t) i )) (1 eη(t) i τ)(1 ητ)t 1 l Ai(π(κ (κ(t) i 1) i ) π(κ(l) i )) , (110) Published as a conference paper at ICLR 2023 which leads to a crude bound π (κ(t) i ) i eπ(t) i eη(t) i dmax A (t κ(t) i + 1) eη(t) i dmax A (γ + 1). When κ(t) i = 0, we have log π (κ(t) i ) i log eπ(t) i 1= log eπ(t) i 1= (1 eη(t) i τ)(1 ητ)t 1 log π(0) Aiπ(κ(t) i ) + l=κ(t) i +1 (1 eη(t) i τ)(1 ητ)t 1 l Aiπ(κ(l) i ) Aiπ(κ(t) i ) + l=κ(t) i +1 (1 eη(t) i τ)(1 ητ)t 1 l Aiπ(κ(l) i ) , which yields π (κ(t) i ) i eπ(t) i eη(t) i dmax A t eη(t) i dmax A (γ + 1). Bounding KL π (κ(t) i ) i π(t 1) i . Note that log π (κ(t) i ) i log π(t 1) i = (log π (κ(t) i ) i log eπ(t) i ) + (log eπ(t) i log π(t) i ) + (log π(t) i log π(t 1) i ). This yields, by equations (107), (110), (111) and associated bounds, π (κ(t) i ) i π(t 1) i eη(t) i dmax A (γ + 1) + eη(t) i (3dmax A + τ log |Si|). Putting all pieces together, we conclude that ψ(t) i = 1 η KL π(t) i π(t 1) i (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i + KL eπ(t) i π (κ(t) i ) i + KL π(t) i eπ(t) i 3η(3dmax A + τ log |Si|) + 2ηdmax A (γ + 1) = η(dmax A (2γ + 11) + 3τ log |Si|). It remains to prove the claim (87): KL π i,τ π(2γ) i = KL π i,τ π(0) i + π i,τ, log π(0) i log π(2γ) i KL π i,τ π(0) i + log π(0) i log π(2γ) i KL π i,τ π(0) i + 2 l=1 (1 ητ)2γ l Aiπ(κ(l) i ) KL π i,τ π(0) i + 4ηdmax A γ, where the third step results from log π(2γ) i 1= (1 ητ)2γ log π(0) i + η P2γ l=1(1 ητ)2γ l Aiπ(κ(l) i ) and Lemma 11. Proof of the claim (108). First, we prove by induction that for any k, l Si, log π(t) i (k) log π(t) i (l) 2dmax A τ , t 0. (112) Note that the claim trivially holds for t = 0 with the uniform initialization π(0) i = 1 |Si|1, i V . Assume that (112) holds for all t t 1. Note that log π(t) i 1= (1 ητ) log π(t 1) i + ηAiπ(κ(t) i ), we have log π(t) i (k) log π(t) i (l) = (1 ητ) log π(t 1) i (k) log π(t 1) i (l) + η [Aiπ(κ(t) i )]k [Aiπ(κ(t) i )]l Published as a conference paper at ICLR 2023 (1 ητ)2dmax A τ + 2ηdmax A where the second line follows from the induction hypothesis (112). This completes the induction at the t-th iteration. It follows that for all i V and t 0, log π(t) i (l) log max k Si π(t) i (k) 2dmax A τ log |Si| 2dmax A
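Finally, the bounded log-ratio property (112) can be observed directly by simulating the recursion log π_i^{(t)} = (1 − ητ) log π_i^{(t−1)} + η A_i π^{(κ_i^{(t)})} (up to normalization) from the uniform initialization. The sketch below runs a simplified single-sequence variant of this update with random bounded delays on a toy triangle game and asserts that the spread of log π_i^{(t)} never exceeds 2 dmax ∥A∥∞ / τ; the game, the step sizes and the delay model are illustrative assumptions, not the paper's experimental setup.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy 3-agent zero-sum polymatrix game on a triangle (antisymmetric pairwise payoffs).
    sizes = [3, 4, 5]
    edges = [(0, 1), (1, 2), (0, 2)]
    A = {}
    for (i, j) in edges:
        A[(i, j)] = rng.uniform(-1.0, 1.0, size=(sizes[i], sizes[j]))
        A[(j, i)] = -A[(i, j)].T

    A_inf = max(np.max(np.abs(M)) for M in A.values())  # max-norm over all pairwise games
    d_max = 2                                            # every vertex of the triangle has degree 2
    tau, eta, gamma = 0.1, 0.01, 3                       # regularization, learning rate, maximal delay

    def payoff_vec(i, profile):
        # [A_i pi]_k = sum over neighbours j of [A_ij pi_j]_k
        return sum(A[(i, j)] @ profile[j] for j in range(len(sizes)) if (i, j) in A)

    T = 500
    history = [[np.full(k, 1.0 / k) for k in sizes]]     # uniform initialization pi^{(0)}
    for t in range(1, T + 1):
        kappa = max(t - rng.integers(1, gamma + 1), 0)   # delayed index, t - gamma <= kappa <= t - 1
        prev, delayed = history[-1], history[kappa]
        new = []
        for i in range(len(sizes)):
            logits = (1 - eta * tau) * np.log(prev[i]) + eta * payoff_vec(i, delayed)
            p = np.exp(logits - logits.max())
            new.append(p / p.sum())
            # claim (112): log-ratios of every iterate stay within 2 * d_max * ||A||_inf / tau
            spread = float(np.max(np.log(new[i])) - np.min(np.log(new[i])))
            assert spread <= 2 * d_max * A_inf / tau + 1e-9
        history.append(new)

    print("log-ratio bound (112) held along the whole trajectory")

This uniform control of the logits along the trajectory is exactly what (108) packages as a bound of the form τ log |S_i| + 2 dmax ∥A∥∞ on ∥τ log π_i^{(t−1)}∥∞.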