Published as a conference paper at ICLR 2023

ASYNCHRONOUS GRADIENT PLAY IN ZERO-SUM MULTI-AGENT GAMES

Ruicheng Ao, Peking University, archer arc@pku.edu.cn
Shicong Cen & Yuejie Chi, Carnegie Mellon University, {shicongc,yuejiec}@andrew.cmu.edu
(Authors are sorted alphabetically.)

ABSTRACT

Finding equilibria via gradient play in competitive multi-agent games has been attracting a growing amount of attention in recent years, with emphasis on designing efficient strategies where the agents operate in a decentralized and symmetric manner with guaranteed convergence. While significant efforts have been made in understanding zero-sum two-player matrix games, the performance in zero-sum multi-agent games remains inadequately explored, especially in the presence of delayed feedbacks, leaving the scalability and resiliency of gradient play open to questions. In this paper, we make progress by studying asynchronous gradient plays in zero-sum polymatrix games under delayed feedbacks. We first establish that the last iterate of the entropy-regularized optimistic multiplicative weight updates (OMWU) method converges linearly to the quantal response equilibrium (QRE), the solution concept under bounded rationality, in the absence of delays. While this linear convergence continues to hold even when the feedbacks are randomly delayed under mild statistical assumptions, it does so at a noticeably slower rate due to a smaller tolerable range of learning rates. Moving beyond, we demonstrate that entropy-regularized OMWU, by adopting two-timescale learning rates in a delay-aware manner, enjoys faster last-iterate convergence under fixed delays, and continues to converge provably even when the delays are arbitrarily bounded, in an average-iterate manner. Our methods also lead to finite-time guarantees for approximating the Nash equilibrium (NE) by moderating the amount of regularization. To the best of our knowledge, this is the first work that aims to understand asynchronous gradient play in zero-sum polymatrix games under a wide range of delay assumptions, highlighting the role of learning rate separation.

1 INTRODUCTION

Finding equilibria of multi-player games via gradient play lies at the heart of game theory, which permeates a remarkable breadth of modern applications, including but not limited to competitive reinforcement learning (RL) (Littman, 1994), generative adversarial networks (GANs) (Goodfellow et al., 2014) and adversarial training (Mertikopoulos et al., 2018). While conventional wisdom leans towards the paradigm of centralized learning (Bertsekas & Tsitsiklis, 1989), retrieving and sharing information across multiple agents raises questions in terms of both privacy and efficiency, leading to a significant amount of interest in designing decentralized learning algorithms that utilize only local payoff feedbacks, with the updates at different agents executed in a symmetric manner. In reality, there is no shortage of scenarios where the feedback can only be obtained in a delayed manner (He et al., 2014), i.e., the agents only receive the payoff information sent from a previous round instead of the current round, due to communication slowdowns and congestion, for example.
Substantial progress has been made towards reliable and efficient online learning with delayed feedbacks in various settings, e.g., stochastic multi-armed bandits (Pike-Burke et al., 2018; Vernade et al., 2017), adversarial multi-armed bandits (Cesa-Bianchi et al., 2016; Li et al., 2019), online convex optimization (Quanrud & Khashabi, 2015; McMahan & Streeter, 2014), and multi-player games (Meng et al., 2022; Héliou et al., 2020; Zhou et al., 2017). Typical approaches to combating delays include subsampling the payoff history (Weinberger & Ordentlich, 2002; Joulani et al., 2013), or adopting adaptive learning rates suggested by delay-aware analysis (Quanrud & Khashabi, 2015; McMahan & Streeter, 2014; Hsieh et al., 2020; Flaspohler et al., 2021). Most of these efforts, however, have been limited to either the asymptotic convergence to the equilibrium (Zhou et al., 2017; Héliou et al., 2020) or the study of individual regret, which characterizes the performance gap between an agent's learning trajectory and the best policy in hindsight. It remains highly inadequate when it comes to guaranteeing finite-time convergence to the equilibrium in a multi-player environment, especially in the presence of delayed feedbacks, thus leaving the scalability and resiliency of gradient play open to questions.

| Learning rate | Type of delay | Iteration complexity ($\epsilon$-QRE) | Iteration complexity ($\epsilon$-NE) |
| --- | --- | --- | --- |
| single-timescale | none | $\tau^{-1} d_{\max} \|A\|_\infty \log \epsilon^{-1}$ | $d_{\max} \|A\|_\infty \epsilon^{-1}$ |
| single-timescale | statistical | $\tau^{-2} d_{\max}^2 \|A\|_\infty^2 (\gamma+1)^2 \log \epsilon^{-1}$ | $d_{\max}^2 \|A\|_\infty^2 (\gamma+1)^2 \epsilon^{-2}$ |
| two-timescale | constant | $\tau^{-1} d_{\max} \|A\|_\infty (\gamma+1)^2 \log \epsilon^{-1}$ | $d_{\max} \|A\|_\infty (\gamma+1)^2 \epsilon^{-1}$ |
| two-timescale | bounded | $\tau^{-2} n d_{\max}^3 \|A\|_\infty^3 (\gamma+1)^{5/2} \epsilon^{-1}$ | $n d_{\max}^3 \|A\|_\infty^3 (\gamma+1)^{5/2} \epsilon^{-3}$ |

Table 1: Iteration complexities of the proposed OMWU method for finding ϵ-QRE/NE of zero-sum polymatrix games, where logarithmic dependencies are omitted. Here, γ denotes the maximal time delay when the delays are bounded, n denotes the number of agents in the game, $d_{\max}$ is the maximal degree of the graph, and $\|A\|_\infty = \max_{i,j} \|A_{ij}\|_\infty$ is the $\ell_\infty$ norm of the entire payoff matrix A (over all games in the network). We only present the result under statistical delay when the delays are bounded for ease of comparison, while more general bounds are given in Section 3.2.

In this work, we initiate the study of asynchronous learning algorithms for an important class of games called zero-sum polymatrix games (also known as network matrix games (Bergman & Fokin, 1998)), which generalizes two-player zero-sum matrix games to the multi-player setting and serves as an important stepping stone towards more general multi-player general-sum games. Zero-sum polymatrix games are commonly used to describe situations in which the agents' interactions are captured by an interaction graph and the entire system of games is closed, so that the total payoff remains invariant in the system. They find applications in an increasing number of important domains such as security games (Cai et al., 2016), graph transduction (Bernardi, 2021), and more. In particular, we focus on finite-time last-iterate convergence to two prevalent solution concepts in game theory, namely the Nash equilibrium (NE) and the quantal response equilibrium (QRE), which accounts for bounded rationality (McKelvey & Palfrey, 1995). Despite the seemingly simple formulation, few existing works have achieved this goal even in the synchronous setting, i.e., with instantaneous feedback. Leonardos et al. (2021) studied a continuous-time learning dynamics that converges to the QRE at a linear rate. Anagnostides et al.
(2022) demonstrated Optimistic Mirror Descent (OMD) (Rakhlin & Sridharan, 2013) enjoys finite-time last-iterate convergence to the NE, yet the analysis therein requires continuous gradient of the regularizer, which incurs computation overhead for solving a subproblem every iteration. In contrast, an appealing alternative is the entropy regularizer, which leads to closed-form multiplicative updates and is computationally more desirable, but remains poorly understood. In sum, designing efficient learning algorithms that provably converge to the game equilibria has been technically challenging, even in the synchronous setting. 1.1 OUR CONTRIBUTIONS In this paper, we develop provably convergent algorithms broadly dubbed as asynchronous gradient play to find the QRE and NE of zero-sum polymatrix games in a decentralized and symmetric manner with delayed feedbacks. We propose an entropy-regularized Optimistic Multiplicative Weights Update (OMWU) method (Cen et al., 2021), where each player symmetrically updates their strategies without access to the payoff matrices and other players strategies, and initiate a systematic investigation on the impacts of delays on its convergence under two schemes of learning rates schedule. Our main contributions are summarized as follows. Finite-time last-iterate convergence of single-timescale OMWU. We begin by showing that, in the synchronous setting, the single-timescale OMWU method when the same learning rate is adopted for extrapolation and update achieves last-iterate convergence to the QRE at a linear Published as a conference paper at ICLR 2023 rate, which is independent of the number of agents as well as the size of action spaces (up to logarithmic factors). In addition, this implies a last-iterate convergence to an ϵ-approximate NE in e O(ϵ 1) iterations by adjusting the regularization parameter, where e O( ) hides logarithmic dependencies. While the last-iterate linear convergence to QRE continues to hold in the asynchronous setting, as long as the delay sequence follows certain mild statistical assumptions, it converges at a slower rate due to a smaller tolerable range of learning rates, with the iteration complexity to find an ϵ-NE degenerating to e O(ϵ 2). Finite-time convergence of two-timescale OMWU. To accelerate the convergence rate in the presence of delayed feedback, we propose a two-timescale OMWU method which separates the learning rates of extrapolation and update in a delay-aware manner for applications with constant and known delays (e.g. from timestamp information). The learning rate separation is critical in bypassing the convergence slowdown encountered in the single-timescale case, where we show that two-timescale OMWU achieves a faster last-iterate linear convergence to QRE in the presence of constant delays, with an improved e O(ϵ 1) iteration complexity to ϵ-NE that matches the rate without delay. We further tackle the more practical yet challenging setting where the feedback sequence is permutated by bounded delays possibly in an adversarial manner and demonstrate provable convergence to the equilibria in an average-iterate manner. We summarize the iteration complexities of the proposed methods for finding ϵ-approximate solutions of QRE and NE in Table 1. To the best of our knowledge, this work presents the first algorithm design and analysis that focus on equilibrium finding in a multi-player game with delayed feedbacks. 
In contrast, most of existing works concerning individual regret in the synchronous/asynchronous settings typically yield average-iterate convergence guarantees (see e.g., Bailey (2021); Meng et al. (2022)) and fall short of characterizing the actual learning trajectory to the equilibrium. 1.2 NOTATION AND PAPER ORGANIZATION Denote by [n] the set {1, , n} and by (S) the probability simplex over the set S. Given two probability distributions p, p (S), the KL divergence from p to p is defined by KL p p := P k S p(k) log p(k) p (k). For any vector z = [zi]1 i n Rn, we use exp(z) to represent [exp(zi)]1 i n. The rest of this paper is organized as follows. Section 2 provides the preliminary on zero-sum polymatrix games and solution concepts. Performance guarantees of single-timescale OMWU and two-timescale OMWU are presented in Section 3 and Section 4, respectively. Numerical experiments are provided in Section 5 to corroborate the theoretical findings, and finally, we conclude in Section 6. The proofs are deferred to the appendix. 2 PRELIMINARIES In this section, we introduce the formulation of zero-sum polymatrix games as well as the solution concept of NE and QRE. We start by defining the polymatrix game. Definition 1 (Polymatrix game). Let G := {(V, E), {Si}i V , {Aij}(i,j) E} be an n-player polymatrix game, where each element in the tuple is defined as follows. An undirected graph (V, E), with V = [n] denoting the set of players and E the set of edges; For each player i V , Si represents its action set, which is assumed to be finite; For each edge (i, j) E, Aij R|Si| |Sj| and Aji R|Sj| |Si| represent the payoff matrices associated with player i and j, i.e., when player i and player j choose si Si and sj Sj, the received payoffs are given by Aij(si, sj), Aji(sj, si), respectively. Utility function. Given the strategy profile s = (s1, , sn) S = Q i V Si taken by all players, the utility function ui : S R of player i is given by j:(i,j) E Aij(si, sj). Suppose that player i adopts a mixed/stochastic strategy or policy, πi (Si), where the probability of selecting si Si is specified by πi(si). With slight abuse of notation, we denote the expected Published as a conference paper at ICLR 2023 utility of player i with a mixed strategy profile π = (π1, , πn) (S) as ui(π) = E si πi, i V [ui(s)] = X j:(i,j) E π i Aijπj. (1) It turns out to be convenient to treat πi and π as vectors in R|Si| and R P i V |Si| without ambiguity, and concatenate all payoff matrices associated with player i into Ai = (Ai1, , Ain) R|Si| P j V |Sj|, (2) where Aij is set to 0 whenever (i, j) / E. In particular, it follows that Aii = 0 for all i V . With these notation in place, we can rewrite the expected utility function (1) as ui(π) = π i Aiπ, (3) where Aiπ R|Si| can be interpreted as the expected utility of the actions in Si for player i. In addition, we denote the maximum entrywise absolute value of payoff by A = maxi,j Aij = maxi Ai , and the maximum degree of the graph by dmax = maxi V degi, where degi is the degree of player i. Moreover, we denote Smax = maxi |Si| as the maximum size of the action space over all players. Zero-sum polymatrix games. The game G is a zero-sum polymatrix game if it holds that P i V ui(s) = 0, s S. This immediately implies that for any strategy profile π (S), it follows that P i V ui(π) = 0. Nash equilibrium (NE). 
A mixed strategy profile π = (π 1, , π n) is a Nash equilibrium (NE) when each player i cannot further increase its own utility function ui by unilateral deviation, i.e., ui(π i, π i) ui(π i , π i), for all i V, π i (Si), where the existence is guaranteed by the work (Cai et al., 2016). Here we denote the mixed strategies of all players other than i by π i and write ui(πi, π i) = ui(π). To measure how close a strategy π (S) is to an NE, we introduce NE-Gap(π) = max i V max π i (Si) ui(π i, π i) ui(π) , which measures the largest possible gain in the expected utility when players deviate from its strategy unilaterally. A mixed strategy profile π is called an ϵ-approximate Nash equilibrium (ϵ-NE) when NE-Gap(π) ϵ, which ensures that ui(π i, π i) ui(πi, π i)+ϵ, for all i V, π i (Si). Quantal response equilibrium (QRE). The quantal response equilibrium (QRE), proposed by Mc Kelvey & Palfrey (1995), generalizes the classical notion of NE under uncertain payoffs or bounded rationality, while balancing exploration and exploitation. A mixed strategy profile π τ = (π 1,τ, , π n,τ) is a QRE when each player assigns its probability of action according to the expected utility of every action in a Boltzmann fashion, i.e., for all i V , π i,τ(k) = exp([Aiπ τ]k/τ) P k Si exp([Aiπ τ]k/τ), k Si, (4) where τ > 0 is the regularization parameter or temperature. Equivalently, this amounts to maximizing an entropy-regularized utility of each player (Mertikopoulos & Sandholm, 2016), i.e., ui,τ(π i, π i,τ) ui,τ(π i,τ, π i,τ) for all i V , π i (Si). Here, the entropy-regularized utility function ui : S R of player i is given by ui,τ(π) = ui(π) + τH(πi), (5) where H(πi) = π i log πi denotes the Shannon entropy of πi. In Leonardos et al. (2021), it is shown that a unique QRE exists in a zero-sum polymatrix game. Similarly, we can measure the proximity of a strategy π to a QRE by QRE-Gapτ(π) = max i V max π i (Si) ui,τ(π i, π i) ui,τ(π) . (6) A mixed strategy profile π is called an ϵ-QRE when QRE-Gapτ(π) ϵ. According to the straightforward relationship NE-Gap(π) QRE-Gapτ(π) + τ log Smax, it follows immediately that we can link an ϵ/2-QRE to ϵ-NE by setting τ = ϵ 2 log Smax . This facilitates the translation of convergence to the QRE to one regarding the NE by appropriately setting the regularization parameter τ. Published as a conference paper at ICLR 2023 Algorithm 1 Entropy-regularized OMWU, agent i 1: Initialize π(0) i = π(0) i as uniform distribution. Learning rates η, and η (optional). 2: for t = 0, 1, 2, . . . do 3: Receive payoff vector Aiπ(κ(t) i ). 4: When t 1, update πi according to π(t) i (k) π(t 1) i (k)1 ητ exp(η[Aiπ(κ(t) i )]k), k Si. 5: Update πi according to the single-timescale rule π(t+1) i (k) π(t) i (k)1 ητ exp(η[Aiπ(κ(t) i )]k), k Si. (9) or the two-timescale rule π(t+1) i (k) π(t) i (k)1 ητ exp(η[Aiπ(κ(t) i )]k), k Si. (10) 3 PERFORMANCE GUARANTEES OF SINGLE-TIMESCALE OMWU In this section, we present and study the entropy-regularized OMWU method (Cen et al., 2021) for finding the QRE of zero-sum polymatrix games. Whilst the method is originally proposed for finding QRE in a two-player zero-sum game, the update rule naturally generalizes to the multi-player setting as π(t+1) i (k) π(t) i (k)1 ητ exp(η[Aiπ(t+1)]k), k Si, (7) where η > 0 is the learning rate and π(t+1) serves as a prediction for π(t+1) via an extrapolation step π(t+1) i (k) π(t) i (k)1 ητ exp(η[Aiπ(t)]k), k Si. 
(8) In the asynchronous setting, however, each agent i receives a delayed payoff vector Aiπ(κ(t) i ) instead of Aiπ(t) in the t-th iteration, where κ(t) i = max{t γ(t) i , 0}, with γ(t) i 0 representing the length of delay. The detailed procedure is outlined in Algorithm 1 using the single-timescale rule (9) for extrapolation. 3.1 PERFORMANCE GUARANTEES WITHOUT DELAYS We first present our theorem concerning the last-iterate convergence of single-timescale OMWU for finding the QRE in the synchronous setting, i.e. γ(t) i = 0 for all i V and t 0. For any π, π V , let KL π π = P i V KL πi π i . Theorem 1 (Last-iterate convergence without delays). Suppose that the learning rate η of singletimescale OMWU in Algorithm 1 obeys 0 < η min n 1 2τ , 1 4dmax A o , then for any T 0, the iterates π(T ) and π(T s) converge at a linear rate according to KL π τ π(T ) (1 ητ)T KL π τ π(0) , KL π τ π(T +1) 2(1 ητ)T KL π τ π(0) . (11a) Furthermore, the QRE-gap also converges linearly according to QRE-Gapτ(π(T )) η 1 + 2τ 1d2 max A 2 (1 ητ)T 1KL π τ π(0) . (11b) Theorem 1 demonstrates that as long as the learning rate η is sufficiently small, the last iterate of single-timescale OMWU converges to the QRE at a linear rate. Compared with prior works for finding approximate equilibrium for zero-sum polymatrix games, our approach features a closedform multiplicative update and a fast linear last-iterate convergence. Some remarks are in order. Linear convergence to the QRE. Theorem 1 implies an iteration complexity of e O 1 ητ log 1 for finding an ϵ-QRE in a last-iterate manner, which leads to an iteration complexity of Published as a conference paper at ICLR 2023 τ + 1 log 1 ϵ by optimizing the learning rate in Theorem 1.The result is especially appealing as it avoids direct dependency on the number of agents n as well as the size of action spaces (up to logarithmic factors), suggesting that learning in competitive multi-agent games can be made quite scalable as long as the interactions among the agents are sparse (so that the maximum degree of the graph dmax is much smaller than the number of agents n). Last-iterate convergence to ϵ-NE. By setting τ appropriately, we end up with an iteration complexity of e O dmax A ϵ for achieving last-iterate convergence to an ϵ-NE, which outperforms the best existing last-iterate rate of e O n A /ϵ2 from Leonardos et al. (2021) by at least a factor of n/(dmaxϵ). Remark 1. Our results trivially extend to the setting of weighted zero-sum polymatrix games (Leonardos et al., 2021), which amounts to adopting different learning rates {ηi}i V at each player. In this case, the iteration complexity becomes e O maxi V 1 ηiτ log 1 ϵ . In addition, our convergence result readily translates to a bound on individual regret as detailed in Appendix C. 3.2 PERFORMANCE GUARANTEES UNDER RANDOM DELAYS We continue to examine single-timescale OMWU in the more challenging asynchronous setting. In particularly, we show that the last iterate of single-timescale OMWU continues to converge linearly to the QRE at a slower rate, as long as the delays satisfy some mild statistical assumptions given below. Assumption 1 (Random delays). Assume that for all i V , t 0, the delay γ(t) i is independently generated and satisfies h γ(t) i i := E h γ(t) i γ(t) i ℓ i E(ℓ), ℓ= 0, 1, . . . . (12) Additionally, there exists some constant ζ > 1, such that L P ℓ=0 ζℓE(ℓ) < . 
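To make Assumption 1 concrete, the following minimal Monte Carlo sketch estimates the quantities it controls for two typical delay distributions. Since the displayed moment condition is partially garbled above, the sketch reads E(ℓ) as an upper bound on E[γ_i^(t) 1{γ_i^(t) ≥ ℓ}]; this reading is our assumption rather than a statement taken from the paper, though it is consistent with the bounded-delay constants ζ = 1 + γ^{-1} and L = eγ(γ + 1) quoted in Section 3.2. The helper names are hypothetical.

```python
# Monte Carlo sketch of the quantities controlled by Assumption 1.
# Assumed reading (the display above is garbled): E(l) upper-bounds
# E[gamma * 1{gamma >= l}], and L = sum_l zeta^l E(l) must be finite.
import numpy as np

rng = np.random.default_rng(0)

def tail_moment(delays, ell):
    """Empirical estimate of E[gamma * 1{gamma >= ell}]."""
    return float(np.mean(delays * (delays >= ell)))

def check_assumption(delays, zeta, max_ell=80):
    """Return the first few tail moments and the (truncated) constant L."""
    E = [tail_moment(delays, ell) for ell in range(max_ell)]
    L = sum(zeta ** ell * e for ell, e in enumerate(E))
    return E[:4], L

# Bounded delays with gamma = 5: Section 3.2 quotes zeta = 1 + 1/gamma and
# L = e * gamma * (gamma + 1), roughly 81.5 in this case.
bounded = rng.integers(0, 6, size=200_000)          # uniform on {0, ..., 5}
print(check_assumption(bounded, zeta=1 + 1 / 5))

# Poisson delays: tail moments decay fast, so L stays finite for zeta slightly above 1.
poisson = rng.poisson(2.0, size=200_000)
print(check_assumption(poisson, zeta=1.2))
```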
We remark that Assumption 1 is a rather mild condition that applies to typical delay distributions, such as the Poisson distribution (Zhang et al., 2020), as well as distributions with bounded support (Recht et al., 2011; Liu et al., 2014; Assran et al., 2020). Roughly speaking, Assumption 1 implies that the probability of the delay decays exponentially with its length, where ζ 1 approximately indicates the decay rate. We have the following theorem. Theorem 2 (Last-iterate convergence with random delays). Under Assumption 1, suppose that the regulari-zation parameter τ < min{1, dmax A } and the learning rate η of single-timescale OMWU in Algorithm 1 obeys 24d2max A 2 (L + 1) , ζ 1 then for any T 1, the iterates π(T ) and π(T ) converges to π τ at the rate max E h KL π τ π(T ) i , 1 2E h KL π τ π(T ) i (1 ητ)T KL π τ π(0) . (14a) Furthermore, the QRE-gap also converges linearly according to E h QRE-Gapτ(π(T )) i 4η 1(1 ητ)T KL π τ π(0) . (14b) Theorem 2 suggests that the iteration complexity to ϵ-QRE is no more than e O max n d2 max A 2 (L + 1), ζ ζ 1 o 1 τ 2 log 1 ϵ after optimizing the learning rate, whose range is more limited compared with the requirement in Theorem 1without delays. In particular, the range of the learning rate is proportional to the regularization parameter τ, an issue we shall try to address by resorting to two-timescale learning rates in OMWU. To facilitate further understanding, we showcase the iteration complexity for finding ϵ-QRE/NE under two typical scenarios: bounded delay and Poisson delay. Published as a conference paper at ICLR 2023 Bounded random delay. When the delays are bounded above by some maximum delay γ, Assumption 1 is met with ζ = 1 + γ 1 and L = eγ(γ + 1). Plugging into Theorem 2 yields an iteration complexity of e O d2 max A 2 (γ+1)2 ϵ for finding an ϵ-QRE, or e O d2 max A 2 (γ+1)2 for finding an ϵ-NE, which increases quadratically as the maximum delay increases. Note that these rates are worse than those without delays (cf. Theorem 1). Poisson delay. When the delays follow the Poisson distribution with parameter 1/T, it suffices to set ζ = 1 + T 1 and L = e T(1 + T) Assumption 1. This leads to an iteration complexity of e O d2 max A 2 T 2 for finding an ϵ-QRE, or e O d2 max A 2 T 2 for finding an ϵ-NE, which is similar to the bounded random delay case. 4 PERFORMANCE GUARANTEES OF TWO-TIMESCALE OMWU While Theorem 2 demonstrates provable convergence of single-timescale OMWU with random delays, it remains unclear whether the update rule can be better motivated in more general asynchronous settings, and whether the convergence can be further ensured under adversarial delays. Indeed, theoretical insights from previous literature (Mokhtari et al., 2020; Cen et al., 2021) suggest the critical role of π(t) as a predictive surrogate for π(t) in enabling fast convergence, which no longer holds when π(t) is replaced by a delayed feedback from π(κ(t) i ). To this end, we propose to replace the extrapolation update (9) with one equipped with a different learning rate: π(t+1) i (k) π(t) i (k)1 ητ exp(η[Aiπ(κ(t) i )]k), k Si, (15) which adopts a larger learning rate η > η to counteract the delay. Intuitively, a choice of η (γ(t) i + 1)η would allow π(κ(t) i ) to approximate π(t) by taking the intermediate updates {π(l) : κ(t) i l < t} into consideration. We refer to this update rule as the two-timescale entropy-regularized OMWU, whose detailed procedure is again outlined in Algorithm 1 using (10) for extrapolation. 
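To make the update rules concrete, here is a minimal, self-contained Python sketch of Algorithm 1 on a small randomly generated zero-sum polymatrix game, covering both the single-timescale rule (9) (obtained by setting η̄ = η) and the delay-aware two-timescale rule (10)/(15) with 1 − η̄τ = (1 − ητ)^{γ+1}. The game construction (A_ji = −A_ij^⊤ on each edge), the graph, the constants, and the helper names (make_game, payoff_vec, qre_gap, run_omwu) are illustrative choices for this sketch and are not taken from the paper; the learning-rate choice follows the condition stated in Theorem 3 below.

```python
# A runnable sketch of entropy-regularized OMWU (Algorithm 1) with delayed feedback.
# The game, graph, and helper names are illustrative assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def make_game(edges, sizes, scale=0.5):
    """Zero-sum polymatrix game: on each edge (i, j), A_ij = B and A_ji = -B^T,
    so the two payoffs cancel and the total payoff is zero for every profile."""
    A = {}
    for (i, j) in edges:
        B = scale * rng.standard_normal((sizes[i], sizes[j]))
        A[(i, j)], A[(j, i)] = B, -B.T
    return A

def payoff_vec(i, prof, A, nbrs):
    """[A_i pi]_k = sum_{j in N_i} [A_ij pi_j]_k, the expected utility of each action."""
    return sum(A[(i, j)] @ prof[j] for j in nbrs[i])

def entropy(p):
    return float(-p @ np.log(p))

def qre_gap(prof, A, nbrs, tau):
    """max_i max_{pi'_i} u_{i,tau}(pi'_i, pi_{-i}) - u_{i,tau}(pi); the inner max is a
    softmax best response, whose value is tau * logsumexp(A_i pi / tau)."""
    gap = 0.0
    for i in nbrs:
        g = payoff_vec(i, prof, A, nbrs)
        m = g.max()
        best = m + tau * np.log(np.sum(np.exp((g - m) / tau)))
        gap = max(gap, best - (prof[i] @ g + tau * entropy(prof[i])))
    return gap

def mwu_step(p, g, lr, tau):
    """p_new(k) is proportional to p(k)^(1 - lr*tau) * exp(lr * g_k), in log space."""
    logits = (1 - lr * tau) * np.log(p) + lr * g
    w = np.exp(logits - logits.max())
    return w / w.sum()

def run_omwu(A, nbrs, sizes, eta, eta_bar, tau, T, delay):
    """delay(i, t) returns gamma_i^(t); the payoff used at step t is A_i pi_bar^(kappa),
    with kappa = max(t - gamma_i^(t), 0), mirroring Algorithm 1."""
    pi = {i: np.ones(sizes[i]) / sizes[i] for i in nbrs}
    bar_hist = [dict(pi)]                                  # pi_bar^(0) = pi^(0)
    gaps = []
    for t in range(T):
        new_bar = {}
        for i in nbrs:
            kappa = max(t - delay(i, t), 0)
            g = payoff_vec(i, bar_hist[kappa], A, nbrs)    # delayed feedback
            if t >= 1:
                pi[i] = mwu_step(pi[i], g, eta, tau)       # update step (line 4)
            new_bar[i] = mwu_step(pi[i], g, eta_bar, tau)  # extrapolation (9) or (10)
        bar_hist.append(new_bar)
        gaps.append(qre_gap(pi, A, nbrs, tau))
    return pi, gaps

# A 4-player cycle with a fixed, known delay gamma = 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
sizes = [3, 4, 3, 4]
nbrs = {i: [] for i in range(4)}
for (a, b) in edges:
    nbrs[a].append(b)
    nbrs[b].append(a)
A = make_game(edges, sizes)

tau, gamma, d_max = 0.1, 2, 2
A_inf = max(np.abs(M).max() for M in A.values())
eta = min(1 / (2 * tau * (gamma + 1)), 1 / (5 * d_max * A_inf * (gamma + 1) ** 2))
eta_bar = (1 - (1 - eta * tau) ** (gamma + 1)) / tau   # 1 - eta_bar*tau = (1 - eta*tau)^(gamma+1)

_, gaps = run_omwu(A, nbrs, sizes, eta, eta_bar, tau, T=5000, delay=lambda i, t: gamma)
print("QRE-gap:", [round(gaps[t], 4) for t in (0, 500, 2000, 4999)])
```

In this sketch the delay process is a plain callable, so the same driver covers synchronous play (delay returning 0 with eta_bar = eta), constant known delays as above, or randomly drawn delays for experimenting with Theorem 2.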
4.1 PERFORMANCE GUARANTEES UNDER CONSTANT AND KNOWN DELAYS To highlight the potential benefit of learning rate separation, we start by studying the convergence of two-timescale OMWU in the asynchronous setting with constant and known delays, which has been studied in (Weinberger & Ordentlich, 2002; Flaspohler et al., 2021; Meng et al., 2022). We have the following theorem, which reveals a a faster linear convergence to the QRE by using a delay-aware two-timescale learning rate design. Theorem 3 (Last-iterate convergence with fixed delays). Suppose that the delays γ(t) i = γ are fixed and known. Suppose that the learning rate η of two-timescale OMWU in Algorithm 1 satisfies η min 1 2τ(γ + 1), 1 5dmax A (γ + 1)2 and η is determined by 1 ητ = (1 ητ)(γ+1), then the last iterate π(t) and π(t) converge to the QRE at a linear rate: for T γ, max KL π τ π(T +1) , 1 2KL π τ π(T γ+1) (1 ητ)T +1KL π τ π(0) + (1 ητ)T +1 γ. In addition, the QRE-gap converges linearly according to QRE-Gapτ(π(T γ+1)) 2 max nd2 max A 2 τ , 1 o (1 ητ)T +1KL π τ π(0) + (1 ητ)T +1 γ . By optimizing the learning rate η, Theorem 3 implies that two-timescale OMWU takes at most e O dmax A (γ+1)2 ϵ iterations to find an ϵ-QRE in a last-iterate manner, which translates to an iteration complexity of e O dmax A (γ+1)2 ϵ for finding an ϵ-NE. This significantly improves over the iteration complexity of e O d2 max A 2 (γ + 1)2/ϵ2 for single-timescale OMWU, verifying the positive role of adopting two-timescale learning rate in enabling faster convergence. Published as a conference paper at ICLR 2023 4.2 PERFORMANCE GUARANTEES WITH PERMUTED BOUNDED DELAYS The above result requires the exact information of the delay, which may not always be available. Motivated by the need to address arbitrary or even adversarial delays, we consider a more realistic scenario, where the payoff sequence arrives in a permuted order (Agarwal & Duchi, 2011) constrained by a maximum bounded delay (Mc Mahan & Streeter, 2014; Wan et al., 2022). Assumption 2 (Bounded delay). For any i V and t > 0, it holds that γ(t) i γ. Assumption 3 (Permuted feedback). For any t > 0, the payoff vector at the t-th iteration is received by agent i only once. The payoff at the 0-th iteration can be used multiple times. The following theorem unveils the convergence of two-timescale OMWU to the QRE in an average sense under permutated bounded delays. Theorem 4 (Average-iterate convergence under permutated delays). Under Assumption 2 and 3, suppose that the learning rate η of two-timescale OMWU in Algorithm 1 satisfies η min n 1 2τ(γ+1), 1 28dmax A (γ+1)5/2 o , and η is determined by 1 ητ = (1 ητ)(γ+1), then for T > 2γ, it holds that 1 T 2γ max T 1 X t=2γ KL π τ π(t+1) , 1 t=2γ KL π τ π(t γ+1) KL π i,τ π(0) i + n + 24nγ log Smax T 2γ . (16) Furthermore, the average QRE-gap can be bounded by t=2γ QRE-Gapτ(π(t+1)) 1 T 2γ max n3d2 max A 2 2τ , τ o 1 ητ (KL π i,τ π(0) i + n) + 36nγ log Smax . Theorem 4 guarantees that the best iterate among {π(t)}2γ 0 and 4 (1 ητ)(1 2ηdmax A ). (23) This allows us to further relax (22) by KL π τ π(t+1) + (1 2ηdmax A )KL π(t+1) π(t+1) (1 ητ)KL π τ π(t) + ηdmax A KL π(t) π(t) (1 ητ) KL π τ π(t) + (1 2ηdmax A )KL π(t) π(t) . Let us now introduce the potential function of iterates L(t) := KL π τ π(t) + (1 2ηdmax A )KL π(t) π(t) , which allows us to simply the previous inequality as L(t+1) (1 ητ)L(t) (1 ητ)t+1L(0) = (1 ητ)t+1KL π τ π(0) , (24) where the last equality follows from the definition π(0) = π(0). Hence, we have KL π τ π(t) L(t) (1 ητ)t KL π τ π(0) . 
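As a numerical sanity check on this geometric decay (not part of the paper's argument), the following Python sketch runs synchronous single-timescale OMWU on a two-player zero-sum game, i.e., a polymatrix game with a single edge, and compares KL(π*_τ ∥ π^(t)) against the bound (1 − ητ)^t KL(π*_τ ∥ π^(0)). Since the QRE is not available in closed form, the long-horizon iterate is used as a stand-in for π*_τ; the game instance and all constants are arbitrary choices for illustration, and the step size follows the condition of Theorem 1 with d_max = 1.

```python
# Sanity check of KL(pi*_tau || pi^(t)) <= (1 - eta*tau)^t * KL(pi*_tau || pi^(0))
# for synchronous single-timescale OMWU on a single-edge (two-player) zero-sum game.
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 5))                 # A_12 = B, A_21 = -B^T
tau = 0.2
eta = min(1 / (2 * tau), 1 / (4 * 1 * np.abs(B).max()))   # Theorem 1 step size, d_max = 1

def step(p, g, lr):
    """p_new(k) proportional to p(k)^(1 - lr*tau) * exp(lr * g_k)."""
    logits = (1 - lr * tau) * np.log(p) + lr * g
    w = np.exp(logits - logits.max())
    return w / w.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

x, y = np.ones(4) / 4, np.ones(5) / 5           # pi_1^(0), pi_2^(0), uniform init
xb, yb = x.copy(), y.copy()                     # pi_bar^(0) = pi^(0)
traj = [(x, y)]
for t in range(4000):
    xb_new = step(x, B @ yb, eta)               # extrapolation (8)
    yb_new = step(y, -B.T @ xb, eta)
    x = step(x, B @ yb_new, eta)                # update (7)
    y = step(y, -B.T @ xb_new, eta)
    xb, yb = xb_new, yb_new
    traj.append((x, y))

x_star, y_star = traj[-1]                       # numerical proxy for the QRE pi*_tau
kl0 = kl(x_star, traj[0][0]) + kl(y_star, traj[0][1])
for t in (100, 300, 600, 1200):
    klt = kl(x_star, traj[t][0]) + kl(y_star, traj[t][1])
    print(t, klt, (1 - eta * tau) ** t * kl0)   # observed value vs. proven upper bound
```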
Published as a conference paper at ICLR 2023 Bounding KL π τ π(t+1) . Following similar approaches to (21), we can bound π i,τ π(t+1) i , log π(t+1) i log π(t+1) i = η(π(t+1) i π i,τ) Ai(π(t) π(t)) + η(π(t+1) i π i,τ) Ai(π(t) π(t+1)) KL π(t) j π(t) j + KL π(t+1) j π(t) j + 2KL π i,τ π(t+1) i . (25) Summing the inequality over i V leads to π τ π(t+1), log π(t+1) log π(t+1) ηdmax A KL π(t) π(t) + KL π(t+1) π(t) + 2KL π τ π(t+1) . On the other hand, by the definition of KL divergence, we have KL π τ π(t+1) = KL π τ π(t+1) KL π(t+1) π(t+1) π τ π(t+1), log π(t+1) log π(t+1) . (26) Combining the above two inequalities, we get (1 2ηdmax A )KL π τ π(t+1) KL π τ π(t+1) + ηdmax A KL π(t) π(t) + KL π(t+1) π(t) . Plugging the above inequality back into (22), we have (1 2ηdmax A )KL π τ π(t+1) (1 ητ)KL π τ π(t) (1 ητ 2dmaxη A )KL π(t+1) π(t) ητKL (1 2ηdmax A )KL π(t+1) π(t+1) + 2ηdmax A KL π(t) π(t) (1 ητ)KL π τ π(t) + 2ηdmax A KL π(t) π(t) KL π τ π(t) + (1 2ηdmax A )KL π(t) π(t) = L(t), where the second and third inequalities follow from the choice of the learning rate, and the last line follows from the definition of the potential function L(t). Then the result follows from (24) as 1 2KL π τ π(t+1) (1 2ηdmax A )KL π τ π(t+1) L(t) (1 ητ)t KL π τ π(0) . Bounding the QRE-Gap. Finally, we bound the QRE-gap, which can be linked to the KL divergence using the following lemma. The proof can be found in Appendix E.3. Lemma 3. For any π (S) and QRE π τ (S), it holds that QRE-Gapτ(π) τKL π π τ + d2 max A 2 τ KL π τ π . Lemma 3 tells us QRE-Gapτ(π(t)) τKL π(t) π τ + d2 max A 2 τ KL π τ π(t) . (27) With KL π τ π(t) controlled in the above, we still need to control KL π(t) π τ . From (22), it follows that π(t) π τ η 1(1 ητ)L(t 1) η 1(1 ητ)t L(0) = η 1(1 ητ)t KL π τ π(0) . Plugging them back to (27), we arrive at QRE-Gapτ(π(t)) η 1 + 2τ 1d2 max A 2 (1 ητ)t 1KL π τ π(0) . B.2 PROOF OF THEOREM 2 We begin with bounding the KL divergence KL π τ π(t) and then move to bound the QRE-gap by linking it to the KL divergence. Published as a conference paper at ICLR 2023 Bounding the term KL π τ π(t) . We start with the following equation (1 ητ)KL π i,τ π(t) i = (1 ητ)KL π(t+1) i π(t) i + ητKL π(t+1) i π i,τ + KL π(t+1) i π(t+1) i + KL π i,τ π(t+1) i log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i + η(π(t+1) i π i,τ) Ai(π(κ(t) i ) π τ) (28) where its proof follows a similar deduction as (19). Our first target is to bound the last two terms on the RHS of (28) with π(t+1) i π i,τ + KL π(t+1) i π(t+1) i + (1 ητ)KL π(t+1) i π(t) i . Let us introduce the potential function of iterates Ψ(l) i := KL π(l+1) i π(l) i +KL π(l) i π(l) i , Ψ(l) = X i V Ψ(l) i = KL π(l+1) π(l) +KL π(l) π(l) , which will be used repetitively in the rest of this proof. For notational simplicity, let Ψ(l) i = 0 when l < 0. Step 1: bounding log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i . Following a similar argument as (21), we get log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i j Ni (π(t+1) i π(t+1) i ) Aij(π (κ(t+1) i ) j π (κ(t) i ) j ) ηdmax A KL π(t+1) i π(t+1) i + η A π (κ(t+1) i ) j π (κ(t) i ) j 2 1. (29) To control the term π (κ(t+1) i ) j π (κ(t) i ) j 2 1, when t = 0, we have π (κ(t+1) i ) j π (κ(t) i ) j 2 1 = π (κ(t+1) i ) j π(0) j 2 1 π(1) j π(0) j 2 1 2Ψ(0) j (30) by Pinsker s inequality. For t 1, consider the decomposition π(t) j π(t k) j = π(l+1) j π(l) j , 1 k t, it then follows that π(t) j π(t k) j 2 1 k π(l+1) j π(l) j 2 1 π(l+1) j π(l) j 2 1 + π(l) j π(l) j 2 1 l=t k Ψ(l) j , (31) where the last line applies Pinsker s inequality. 
Depending on whether γ(t+1) i > 0, we proceed to bound the terms π (κ(t+1) i ) j π (κ(t) i ) j 2 1 in (29) considering the following two cases based on (31). γ(t+1) i = 0. Then π (κ(t+1) i ) j π (κ(t) i ) j 2 1 2 π(t+1) j π(t) j 2 1 + 2 π(t) j π (κ(t) i ) j 2 1 Published as a conference paper at ICLR 2023 8Ψ(t) j + 8γ(t) i where the last step uses (31) and π(t+1) j π(t) j 2 1 2 π(t+1) j π(t) j 2 1 + π(t) j π(t) j 2 1 via again Pinsker s inequality. γ(t+1) i > 0. Then it follows similarly that π (κ(t+1) i ) j π (κ(t) i ) j 2 1 l=t+1 γ(t+1) i π(l+1) j π(l) j 2 1 + π(l+1) j π(l) j 2 1 l=t γ(t+1) i Ψ(l) j + 4γ(t) i Combining the above two bounds together, we get π (κ(t+1) i ) j π (κ(t) i ) j 2 1 8Ψ(t) j + 8γ(t) i Ψ(l) j + 4γ(t+1) i l=t γ(t+1) i Ψ(l) j (32) when t > 0. In view of (30) when t = 0, the above bound (32) holds for all t 0. Plugging the above inequality into (29) yields log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i l=t γ(t+1) i γ(t+1) i Ψ(l) j + 4η A X γ(t) i Ψ(l) j j Ni Ψ(t) j + ηdmax A KL π(t+1) i π(t+1) i . (33) Step 2: bounding (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π τ). Let us begin with the following decomposition (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π τ) = (π(t+1) i π i,τ) Ai(π(t+1) π τ) + (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π(t+1)), (34) where the second term in the RHS of (34) can be bounded by (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π(t+1)) j Ni (π(t+1) i π i,τ) Aij(π (κ(t+1) i ) j π(t+1) j ) π(t+1) i π i,τ 1 π (κ(t+1) i ) j π(t+1) j 1 π(t+1) i π i,τ 2 1 + dmax A π (κ(t+1) i ) j π(t+1) j 2 1 π(t+1) i π i,τ + dmax A 2 2τ π (κ(t+1) i ) j π(t+1) j 2 1. Following similar deduction of (32) for the second term, we attain (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π(t+1)) Published as a conference paper at ICLR 2023 π(t+1) i π i,τ + 4dmax A 2 τ l=t γ(t+1) i γ(t+1) i Ψ(l) j . Plugging the above inequality back to (34) results in (π(t+1) i π i,τ) Ai(π(κ(t+1) i ) π τ) (π(t+1) i π i,τ) Ai(π(t+1) π τ) π(t+1) i π i,τ + 4dmax A 2 τ l=t γ(t+1) i γ(t+1) i Ψ(l) j . Step 3: combining the bounds. For simplicity, we introduce the short-hand notation cτ = 1 + dmax A τ and c A = dmax A . (36) Combining (33) and (35) into (28), and summing over i V gives (1 ητ)KL π τ π(t) π(t+1) π(t) + (1 2ηc A)KL π(t+1) π(t+1) + KL π τ π(t+1) l=t γ(t+1) i cτγ(t+1) i Ψ(l) j + γ(t) i Ψ(l) j + cτΨ(t) j KL π τ π(t+1) + (1 4ηc A(cτ + 1)) Ψ(t) l=t γ(t+1) i γ(t+1) i Ψ(l) j + γ(t) i Ψ(l) j where we make use of the fact X i V (π(t+1) i π i,τ) Ai(π(t+1) π τ) = 0 from Lemma 1 in the first inequality, and the second inequality uses the relation X j Ni Ψ(t) j = X i V diΨ(t) i dmaxΨ(t). Step 4: finishing up via averaging the delay. We now evaluate the expectation of KL π τ π(t+1) . Recall that we use subscript Eγ(t) [ ] to represent the conditional expectation given γ(t) = {γ(t) i }i V . We shall first control the conditional expectation of the last term in (37). Observing that π(l+1) j , π(l) j are independent of γ(t) i for j Ni and l t 1. Using the definition of E(t l), we have j Ni Et l γ(t) i h γ(t) i Ψ(l) j i l=0 E(t l) X j Ni Ψ(l) j l=0 E(t l) X j Ni Ψ(l) i l=0 E(t l) X i V Ψ(l) i = dmax l=0 E(t l)Ψ(l), (38) Published as a conference paper at ICLR 2023 where the second line follows from the definition of E(t l) in Assumption 1. Applying a sim- ilar argument to bound P j Ni Eγ(t+1) h γ(t+1) i Pt 1 l=t γ(t+1) i Ψ(l) j i , and taking expectation of γ(t), γ(t+1) on both sides of (37), we get (1 ητ)Eγ(t) h KL π τ π(t) i Eγ(t),γ(t+1) h KL π τ π(t+1) + (1 4c A(cτ + 1))Ψ(t)i 4ηc A(cτ + 1) l=0 E(t l)Ψ(l). 
Taking expectation on both sides over all the delays yields (1 ητ)E h KL π τ π(t) i E h KL π τ π(t+1) i + E h (1 4ηc A(cτ + 1)) Ψ(t) 4ηc A(cτ + 1) Xt 1 l=0 E(t l)Ψ(l) | {z } =:U (t) Telescoping over t = 0, 1, . . . , T, we get (1 ητ)T +1KL π τ π(0) E h KL π τ π(T +1) i + t=0 (1 ητ)T t E h U (t)i , (40) which leads to the desired bound if t=0 (1 ητ)T t E h U (t)i 0. (41) Proof of (41). To begin, notice that with the choice of the learning rate 24d2max A 2 (L + 1) , ζ 1 it follows that 1 1 ητ ζ (42a) 4ηc A(cτ + 1)(L + 1) < 4 τ 24d2max A 2 (L + 1) dmax A = τ 6dmax A = τ 3dmax A + 1 as τ dmax A . Both of these relations will be useful in our follow-up analysis. Now, taking the definition of U (t) (cf. (39)), we have t=0 (1 ητ)T t U (t) = t=0 (1 ητ)T t " (1 4ηc A(cτ + 1)) Ψ(t) 4ηc A(cτ + 1) l=0 E(t l)Ψ(l) # where the second half of the RHS can be further controlled via t=0 (1 ητ)T t t 1 X l=0 E(t l)Ψ(l) = t=0 Ψ(t) T X l=t+1 (1 ητ)T l E(l t) t=0 Ψ(t) T t X l =0 (1 ητ)T (t+l )E(l ) Published as a conference paper at ICLR 2023 t=0 (1 ητ)T tΨ(t) T t X l =0 (1 ητ) l E(l ) t=0 (1 ητ)T tΨ(t) X l=0 ζl E(l) t=0 (1 ητ)T t LΨ(t), where the first line follows by changing the order of summation, the second line follows from the change of variable l = l t, and the last line follows from (42a) and the definition of L in Assumption 1. Plugging the above relation back leads to t=0 (1 ητ)T t U (t) t=0 (1 ητ)T t [(1 4ηc A(cτ + 1)) 4ηc A(cτ + 1)L] Ψ(t) 1 2(1 ητ)T tΨ(t) 0, (43) where the second line results from (42b). Bounding the term KL π τ π(t+1) . With a similar deduction of (19), we get (1 ητ)KL π τ π(t) + η X i V (π(t+1) i π τ) Ai(π(κ(t) i ) π τ) = KL π τ π(t+1) + (1 ητ)KL π(t+1) π(t) + ητKL π(t+1) π τ . (44) Following the similar argument of (35), we have (π(t+1) i π i,τ) Ai(π(κ(t) i ) π τ) (π(t+1) i π i,τ) Ai(π(t+1) π τ) π(t+1) i π i,τ + 8dmax A 2 τ γ(t) i Ψ(l) j . Summing over i V and plugging into (44) yields (1 ητ)KL π τ π(t) + 8ηdmax A 2 τ γ(t) i Ψ(l) j KL π τ π(t+1) + (1 ητ)KL π(t+1) π(t) + ητ KL π τ π(t+1) + ητ π(t+1) π τ . Taking expectation on both sides over all delays and using (38) leads to (1 ητ)E h KL π τ π(t) i + 8ηd2 max A 2 τ E l=0 E(t l)Ψ(l) # E h KL π τ π(t+1) i + ητ π(t+1) π τ i . (45) Notice that with the choice of the learning rate 24d2max A 2 (L + 1) , ζ 1 we have 8(L + 1)ηd2 max A 2 τ 1 Published as a conference paper at ICLR 2023 (1 ητ)t+1KL π τ π(0) 1 l=0 (1 ητ)t l E h Ψ(l)i by combining (43) and (40). It follows that E h Ψ(t)i 2(1 ητ)t+1KL π τ π(0) l=0 E(t l)Ψ(l) # (i) E l=0 (1 ητ)t lΨ(l) E(t l)ζt l # l=0 (1 ητ)t lΨ(l) t 1 X l=0 E(t l)ζt l # (ii) 2L(1 ητ)t+1KL π τ π(0) , where (i) is by the bound (1 ητ) 1 ζ and (ii) uses the definition of L in Assumption 1. Plugging the above inequalities into (45) leads to (1 ητ)E h KL π τ π(t) i + (1 ητ)t+1KL π τ π(0) E h KL π τ π(t+1) i + ητ π(t+1) π τ i . Then from (40) we have E h KL π τ π(t+1) i E h KL π τ π(t+1) i + ητ π(t+1) π τ i (1 ητ)t+1KL π τ π(0) + (1 ητ)t+1KL π τ π(0) = 2(1 ητ)t+1KL π τ π(0) . (46) Bounding the QRE-Gap. Combining (27) and (46), we have E h QRE-Gapτ(π(t+1)) i τE h KL π(t+1) π τ i + d2 max A 2 τ E h KL π τ π(t+1) i π(t+1) π τ i + E h KL π τ π(t+1) i η KL π τ π(0) , where the second line uses the learning rate bound 2 η > 24d2 max A 2 (L + 1) τ > d2 max A 2 τ . C REGRET ANALYSIS OF SINGLE-TIMESCALE OMWU For completeness, we also provide the regret analysis of single-timescale OMWU in both synchronous and asynchronous settings, which might be of independent interest. 
To begin, for τ 0, the regret for each player i V is defined as Regreti,τ T = max πi (Si) t=1 ui,τ(πi, π(t) i) t=1 ui,τ(π(t)), (47) which measures the performance gap compared to the optimal fixed strategy in hindsight for player i, when the rest of the players follow the strategies derived from Algorithm 1. Published as a conference paper at ICLR 2023 Synchronous setting. We begin with the following no-regret guarantee of single-timescale OMWU in the synchronous setting. Theorem 5 (No-regret without delays). Suppose all players i V follow single-timescale OMWU in Algorithm 1 with the initialization π(0) i = 1 |Si|1 and the learning rate obeys 0 < η 1 4dmax A +4τ . Then, for T 1, it holds that Regreti,τ T 1 η log |Si| + 16η degi A 2 X k V log |Sk|. By optimizing the learning rate η, Theorem 5 suggests that the regret is bounded by max i V Regreti,τ T e O A p up to logarithmic factors. Compared with the OMD method for multi-agent games in Anagnostides et al. (2022), which only provided the regret bound for τ = 0, our bound is more general by allowing entropy regularization. Moreover, our bound is tighter by a factor of n/dmax by exploiting the graph connectivity pattern, which is significant for large sparse graphs. Asynchronous setting. We next move to the asynchronous case, and show that single-timescale OMWU continues to enjoy no-regret learning as long as the delays have finite second-order moments. Assumption 4 (Random delays). Recall the definition of E(ℓ) in (12). There exists some constant σ > 0, such that E h (γ(t) i )2i P ℓ=0 E(ℓ) σ2, for all t 0 and i V . Clearly Assumption 4 is weaker than Assumption 1, since it only requires the second-order moments to be finite, instead of an exponential decay of γ(t) i . We have the following theorem. Theorem 6 (No-regret with random delays). Under Assumption 4, suppose all players i V follow single-timescale OMWU in Algorithm 1 with the initialization π(0) i = 1 |Si|1, the regularization parameter τ < min{1, A }, and the learning rate obeys 0 < η τ 24d2max A 2 (σ2+1). Then, for T 1, it holds that E Regreti,τ T 1 η log |Si| + 8dmax A τ + 2 (σ2 + 1) X i V log |Si|. (48) Theorem 6 guarantees that the iterate among {π(t)}t 1 enjoys a regret bound on the order of max i V E Regreti,τ T e O σ2ndmax A 2 τ by optimizing the learning rate η. C.1 PROOF OF THEOREM 5 Recall the expression of the regret Regreti,τ T = max πi (Si) Regreti,τ πi, T , Regreti,τ πi, T := t=1 ui,τ(πi, π(t) i) t=1 ui,τ(π(t)) πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) . (49) Therefore, it is sufficient to bound Regreti,τ πi, T for any πi (Si). To begin, we record the following useful lemma whose proof can be found in Appendix E.4. Published as a conference paper at ICLR 2023 Lemma 4. For any τ 0 and T 0, we have i V Regreti,τ T + 1 = X i V max πi (Si) πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) 0. Let us now proceed with the following regret decomposition πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) = πi π(t+1) i , Aiπ(t+1) + τH(πi) + π(t+1) i π(t+1) i , Aiπ(t) τH(π(t+1) i ) π(t+1) i π(t+1) i , Aiπ(t+1) Aiπ(t) . (50) We proceed to bound each term on the RHS of (50). To begin, note that log π(t+1) i 1= (1 ητ) log π(t) i + ηAiπ(t+1). The first term in (50) can then be written as πi π(t+1) i , Aiπ(t+1) + τH(πi) η log π(t+1) i log π(t) i , πi π(t+1) i + τ log π(t) i , πi π(t+1) i τ log πi, πi η τ KL πi π(t) i 1 KL πi π(t+1) i + KL π(t+1) i π(t) i τ log π(t) i , π(t+1) i , where the second step is derived from the definition of KL divergence. 
Similarly, the second term in (50) has the form π(t+1) i π(t+1) i , Aiπ(t) τH(π(t+1) i ) KL π(t+1) i π(t) i KL π(t+1) i π(t+1) i 1 π(t+1) i π(t) i + τ log π(t) i , π(t+1) i . (52) Moving to the third term on the RHS of (50), we first make the following claim, which shall be proven at the end of this proof: π(t+1) i π(t+1) i 1 η Aiπ(t+1) Aiπ(t) . (53) With (53) in place, we have π(t+1) i π(t+1) i , Aiπ(t+1) Aiπ(t) X j Ni A π(t+1) i π(t+1) i 1 π(t+1) j π(t) j 1 j Ni A Ai(π(t+1) π(t)) π(t+1) j π(t) j 1 π(t+1) j π(t) j 1 The latter term can be further bounded by X π(t+1) j π(t) j 1 π(t+1) j π(t) j + π(t) j π(t) j 2 1 π(t+1) j π(t) j 2 1 + π(t) j π(t) j 2 1 π(t+1) j π(t) j + KL π(t) j π(t) j , Published as a conference paper at ICLR 2023 where it follows respectively from Cauchy-Schwarz inequality, a + b 2 1 2 a 2 1 + b 2 1 , and Pinsker s inequality. Plugging this into the previous inequality leads to π(t+1) i π(t+1) i , Aiπ(t+1) Aiπ(t) 4η degi A 2 X π(t+1) j π(t) j + KL π(t) j π(t) j . Plugging (51), (52), and (54) into (50) yields πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) KL πi π(t) i KL πi π(t+1) i 1 η KL π(t+1) i π(t+1) i 1 π(t+1) i π(t) i τKL πi π(t) i + 4η degi A 2 X π(t+1) j π(t) j + KL π(t) j π(t) j . Telescoping the sum over t = 0, 1, . . . , T leads to Regreti,τ πi, T + 1 1 η KL πi π(0) i 1 t=0 KL π(t+1) i π(t+1) i 1 π(t+1) i π(t) i + 4η degi A 2 π(t+1) j π(t) j + KL π(t) j π(t) j (55) η log |Si| + 4η degi A 2 π(t+1) π(t) + KL π(t) π(t) , where the last line follows from the fact that KL πi π(0) i log |Si| and 1/η > τ. The proof is thus complete if we can establish t=0 KL π(t) π(t) + π(t+1) π(t) 4 X k V log |Sk|. (57) Therefore, it remains to establish (53) and (57), which shall be completed as follows. Proof of (53). By the update rules of π(t+1) i and π(t+1) i , from (20) we can deduce that log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i = η Ai(π(t) π(t+1)), π(t+1) i π(t+1) i . (58) By Pinsker s inequality, we have log π(t+1) i log π(t+1) i , π(t+1) i π(t+1) i π(t+1) i π(t+1) i 2 1. In addition, Ai(π(t) π(t+1)), π(t+1) i π(t+1) i π(t+1) i π(t+1) i 1 Ai(π(t) π(t+1)) . Plugging the above two relations into (58) then leads to (53). Proof of (57). Summing (55) over i V gives i V Regreti,τ πi, T + 1 1 η KL π π(0) 1 t=0 KL π(t+1) π(t+1) 1 π(t+1) π(t) + 4ηd2 max A 2 π(t+1) π(t) + KL π(t) π(t) KL π π(0) + KL π(0) π(0) 1 η τ 4ηd2 max A 2 π(t+1) π(t) Published as a conference paper at ICLR 2023 η 4ηd2 max A 2 t=0 KL π(t) π(t) k V log |Sk| 1 π(t+1) π(t) 1 t=0 KL π(t) π(t) , where the last line follows from π(0) = π(0), KL π π(0) P k V log |Sk| for any π since π(0) is a uniform distribution, as well as the choice of the learning rate such that 4ηd2 max A 2 1 Taking supremum over π on both sides of (59) and applying Lemma 4 gives (57) as advertised. C.2 PROOF OF THEOREM 6 Similar to the proof of Theorem 5 in Appendix C.1, it suffices to bound Regreti,τ πi, T for any πi (Si), where Regreti,τ πi, T := t=1 ui,τ(πi, π(t) i) t=1 ui,τ(π(t)) πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) . (60) Consider the following decomposition: πi π(t+1) i , Aiπ(t+1) + τH(πi) τH(π(t+1) i ) = πi π(t+1) i , Aiπ(κ(t+1) i ) + πi π(t+1) i , Aiπ(t+1) Aiπ(κ(t+1) i ) + τH(πi) τH(π(t+1) i ) = πi π(t+1) i , Aiπ(κ(t+1) i ) + π(t+1) i π(t+1) i , Aiπ(κ(t) i ) π(t+1) i π(t+1) i , Aiπ(κ(t+1) i ) Aiπ(κ(t) i ) + πi π(t+1) i , Aiπ(t+1) Aiπ(κ(t+1) i ) . (61) We now bound each term on the RHS of (61). For simplicity, we reuse the short-hand notation in (36). 
To begin with, note that π(t+1) i = (1 ητ)π(t) i + ηAiπ(κ(t+1) i ) + ci1 for some normalization constant ci. Thus we have πi π(t+1) i , Aiπ(κ(t+1) i ) η log π(t+1) i log π(t) i , πi π(t+1) i + τ log π(t) i , πi π(t+1) i KL πi π(t) i KL πi π(t+1) i KL π(t+1) i π(t) i + τ log π(t) i , πi π(t+1) i , where the second step is derived from the definition of KL-divergence. Similarly, it holds that π(t+1) i π(t+1) i , Aiπ(κ(t) i ) KL π(t+1) i π(t) i KL π(t+1) i π(t+1) i KL π(t+1) i π(t) i + τ log π(t) i , π(t+1) i π(t+1) i . Published as a conference paper at ICLR 2023 For the term π(t+1) i π(t+1) i , Aiπ(κ(t+1) i ) Aiπ(κ(t) i ) in (61), following the deduction of (33), we get π(t+1) i π(t+1) i , Aiπ(κ(t+1) i ) Aiπ(κ(t) i ) l=t γ(t+1) i γ(t+1) i Ψ(l) j + γ(t) i Ψ(l) j + 2c AKL π(t+1) i π(t+1) i . For the last term in (61), it similarly follows that πi π(t+1) i , Aiπ(t+1) Aiπ(κ(t+1) i ) = πi π(t) i , Aiπ(t+1) Aiπ(κ(t+1) i ) + π(t) i π(t+1) i , Aiπ(t+1) Aiπ(κ(t+1) i ) π(t+1) i π(t) i + 4cτ A X l=t γ(t+1) i γ(t+1) i Ψ(l) j . (65) Plugging (62) (63) (64) (65) into (61) yields πi π(t+1) i , Aiπ(t+1) + τ H(πi) H(π(t+1) i ) KL πi π(t) i KL πi π(t+1) i 1 j Ni Ψ(t) j + 2γ(t) i A X + 4cτ A γ(t+1) i X l=t γ(t+1) i Ψ(l) j + τ log π(t) i , πi π(t) i + τ H(πi) H(π(t+1) i ) . H(πi) H(π(t+1) i ) + log π(t) i , πi π(t+1) i = log πi log π(t) i , πi + log π(t+1) i log π(t) i , π(t+1) i π(t+1) i π(t) i KL πi π(t) i . Then we have πi π(t+1) i , Aiπ(t+1) + τ H(πi) H(π(t+1) i ) KL πi π(t) i KL πi π(t+1) i 1 η 2c A τ Ψ(t) i j Ni Ψ(t) j + 2γ(t) i A X Ψ(l) j + 4cτ A γ(t+1) i X l=t γ(t+1) i Since the learning rate satisfies 1 η 24d2 max A 2 τ (σ2 + 1) 2dmax A + τ = 2c A + τ, taking expectation on both sides of (66) leads to E h πi π(t+1) i , Aiπ(t+1) + τ H(πi) H(π(t+1) i ) i η E h KL πi π(t) i KL πi π(t+1) i i 1 η 2c A τ E h Ψ(t) i i Published as a conference paper at ICLR 2023 + 4cτ A E h Ψ(t)i + 4cτ A E l=0 E(t l)Ψ(l) # η E h KL πi π(t) i KL πi π(t+1) i i + 4cτ A E h Ψ(t)i + 4cτ A E l=0 E(t l)Ψ(l) # where we use the fact P j Ni Ψ(l) j Ψ(l) and the definition of E(t l). Since P l=0 E(l) σ2 by definition in Assumption 4, summing (67) over t = 0, 1, . . . , T yields E Regreti,τ T + 1 1 η E h KL πi π(0) i i + 4cτ A (σ2 + 1)E η log |Si| + 4cτ A (σ2 + 1)E It remains to establish i V log |Si|. (69) Proof of (69). Summing (66) over i V gives X πi π(t+1) i , Aiπ(t+1) + τ H(πi) H(π(t+1) i ) KL π π(t) KL π π(t+1) 1 η 2c A τ Ψ(t) j Ni Ψ(t) j + 2γ(t) i A X Ψ(l) j + 4cτ A X i V γ(t+1) i X l=t γ(t+1) i Taking expectation of on both sides and using (38) leads to πi π(t+1) i , Aiπ(t+1) + τ H(πi) H(π(t+1) i ) # η E h KL π π(t) KL π π(t+1) i 1 η 4c A(cτ + 1) τ E h Ψ(t)i + 4c A(cτ + 1)E l=0 E(t l)Ψ(l) # 2E h KL π π(t) i . Summing over t = 0, 1, . . . , T yields i V Regreti,τ π, T + 1 # η E h KL π π(0) i 1 η 4c A(cτ + 1)(σ2 + 1) E η E h KL π π(0) i 1 where the second line follows from 4c A(cτ + 1)(σ2 + 1)η 4dmax A 24d2max A 2 (σ2 + 1) < 1 due to τ dmax A and η τ 24d2max A 2 (σ2+1). Taking supremum with respect to π on both sides, in view of Lemma 4, we arrive at the advertised bound (69). Published as a conference paper at ICLR 2023 D PROOF FOR TWO-TIMESCALE OMWU (SECTION 4) D.1 PROOF OF THEOREM 3 Bounding KL π τ π(t) . For notational convenience, we set π(t) = π(t) = π(0) for t < 0. The following lemma parallels Lemma 2 by focusing on delayed feedbacks. The proof is postponed to Appendix E.5. Lemma 5. 
Assuming constant delays γ(t) i = γ, the iterates of OMWU based on the update rule (10) satisfy log π(t+1) (1 ητ) log π(t) ητ log π τ, π(t γ+1) π τ = 0. By following a similar argument in (19), we conclude that KL π τ π(t+1) = (1 ητ)KL π τ π(t) (1 ητ)KL π(t γ+1) π(t) KL π(t+1) π(t γ+1) + log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) ητKL π(t γ+1) π τ . (71) It boils down to control the term log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) . When t γ, by taking logarithm on the both sides of the update rules (7) and (10), we have log π(t γ+1) i 1= (1 ητ) log π(t γ) i + ηAiπ(t 2γ) log π(t+1) i 1= (1 ητ) log π(t) i + ηAiπ(t γ+1) 1= (1 ητ)γ+1 log π(t γ) i + η l=0 (1 ητ)l Aiπ(t γ l+1). Subtracting the above equalities and taking inner product with π(t γ+1) i π(t+1) i gives log π(t γ+1) i log π(t+1) i , π(t γ+1) i π(t+1) i l=0 (1 ητ)l π(t γ+1) i π(t+1) i , Ai(π(t 2γ) π(t γ l+1)) , where the log π(t γ) i terms cancel out due to the choice 1 ητ = (1 ητ)γ+1. Summing over i V , log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) l=0 (1 ητ)l π(t γ+1) i π(t+1) i , Ai(π(t 2γ) π(t γ l+1)) l=0 (1 ητ)l π(t γ+1) i π(t+1) i 1 π(t 2γ) j π(t γ l+1) j 1. (72) Using the triangle inequality, we can bound π(t 2γ) π(t γ l+1) 1 as π(t 2γ) π(t γ l+1) 1 π(l1 γ) i π(l1 γ+1) j 1 π(l1 γ) i π(l1) i 1 + π(l1 γ+1) j π(l1) j 1 Substitution of the bound into (72) yields log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) Published as a conference paper at ICLR 2023 l=0 (1 ητ)l t l X π(t γ+1) i π(t+1) i 1 π(l1 γ) j π(l1) j 1 + π(l1 γ+1) j π(l1) j 1 l=0 (1 ητ)l π(t γ+1) i π(t+1) i 1 π(l1 γ) j π(l1) j 1 + π(l1 γ+1) j π(l1) j 1 l=0 (1 ητ)l π(t γ+1) i π(t+1) i 2 1 l=0 (1 ητ)l π(l1 γ) j π(l1) j 2 1 + π(l1 γ+1) j π(l1) j 2 1 2(γ + 1)2KL π(t+1) π(t γ+1) l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) . (73) Plugging the above inequality into (71) and recursively applying the inequality gives KL π τ π(t+1) + KL π(t+1) π(t γ+1) + ητKL π(t γ+1) π τ (1 ητ)t+1 γKL π τ π(γ) l1=γ (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) 2(γ + 1)2 t X l1=γ (1 ητ)t l1KL π(l1+1) π(l1 γ+1) t2=γ (1 ητ)t l2 l2 X l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) (i) (1 ητ)t+1 γKL π τ π(γ) l1=γ (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) + 2(γ + 1)2ηdmax A l1=γ (1 ητ)t l1KL π(l1+1) π(l1 γ+1) + 2(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) (ii) (1 ητ)t+1 γKL π τ π(γ) + 2(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) , (74) where (i) results from basic calculation t2=γ (1 ητ)t l2 l2 X l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) l1=0 (1 ητ)t l1 l1+γ X l=0 (1 ητ)l1 l2+l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) l1=0 (1 ητ)t l1 γ X l=0 (1 ητ)l l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) Published as a conference paper at ICLR 2023 l1=0 (1 ητ)t l1(γ + 1)2 1 1 2(γ + 1) (γ+1) (1 ητ) KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) 2(γ + 1)2 t X l1=0 (1 ητ)t l1(1 ητ) KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) 2(γ + 1)2 t X l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) and (ii) is due to η min n 1 2τ(γ+1), 1 5dmax A (γ+1)2 o . To proceed, we introduce the following lemma concerning the error KL π τ π(γ) , with the proof postponed to Appendix E.6. Lemma 6. With constant delays γ(t) i = γ, the iterates of OMWU based on the update rule (10) satisfy KL π τ π(γ) (1 ητ)γKL π τ π(0) l1=0 (1 ητ)γ 1 l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) + 2ηγ2dmax A . 
With the lemma above in mind, we can continue to bound (74) by KL π τ π(t+1) (1 ητ)t+1KL π τ π(0) + 2(1 ητ)t+1 γηγ2dmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) + 2(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) (1 ητ)t+1KL π τ π(0) + (1 ητ)t+1 γ. Bounding KL π τ π(t γ+1) . By definition of KL divergence, we have KL π τ π(t γ+1) = KL π τ π(t+1) + π τ, log π(t+1) log π(t γ+1) = KL π τ π(t+1) + KL π(t+1) π(t γ+1) + π τ π(t+1), log π(t+1) log π(t γ+1) . (75) It remains to control the term π τ π(t+1), log π(t+1) log π(t γ+1) . By following a similar argument in (73), we have π τ π(t+1), log π(t+1) log π(t γ+1) l=0 (1 ητ)l π i,τ π(t+1) i , Ai(π(t 2γ) π(t γ l+1)) l=0 (1 ητ)l π i,τ π(t+1) i 1 π(t 2γ) j π(t γ l+1) j 1 l=0 (1 ητ)l t l X π i,τ π(t+1) i 1 π(l1 γ) j π(l1) j 1 + π(l1 γ+1) j π(l1) j 1 l=0 (1 ητ)l π i,τ π(t+1) i 1 π(l1 γ) j π(l1) j 1 + π(l1 γ+1) j π(l1) j 1 Published as a conference paper at ICLR 2023 2(γ + 1)2KL π τ π(t+1) l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) . Substitution of the above inequality into (75) yields KL π τ π(t γ+1) + ητKL π(t γ+1) π τ = (1 + 2(γ + 1)2ηdmax)KL π τ π(t+1) + KL π(t+1) π(t γ+1) + ητKL π(t γ+1) π τ l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) (i) 2 KL π τ π(t+1) + KL π(t+1) π(t γ+1) + ητKL π(t γ+1) π τ + 2(γ + 1)ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) (ii) 2(1 ητ)t+1 γKL π τ π(γ) 2 l1=γ (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) + 4(γ + 1)2ηdmax A l1=γ (1 ητ)t l1KL π(l1+1) π(l1 γ+1) + 6(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) 2(1 ητ)t+1 γKL π τ π(γ) + 6(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) , where (i) results from l=0 (1 ητ)l KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) l1=t γ (1 ητ)t l1 t l1 X l=0 (1 ητ)l+l1 t KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) l1=t γ (1 ητ)t l1(γ + 1)(1 ητ) (γ+1)(1 ητ) KL π(l1) π(l1 γ) + KL π(l1 γ+1) π(l1) l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) . and (ii) is due to the bound established in (74). Finally, applying Lemma 6 yields KL π τ π(t γ+1) + ητKL π(t γ+1) π τ 2(1 ητ)t+1KL π τ π(0) + 4(1 ητ)t+1 γηγ2dmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) + 6(γ + 1)2ηdmax A l1=0 (1 ητ)t l1 KL π(l1+1) π(l1 γ+1) + (1 ητ)KL π(l1 γ+1) π(l1) Published as a conference paper at ICLR 2023 2(1 ητ)t+1KL π τ π(0) + 2(1 ητ)t+1 γ. (76) Bounding the QRE gap. With Lemma 3, we have QRE-Gapτ(π(t γ+1)) d2 max A 2 τ KL π τ π(t γ+1) + τKL π(t γ+1) π τ max nd2 max A 2 τ , 1 o KL π τ π(t γ+1) + ητKL π(t γ+1) π τ 2 max nd2 max A 2 τ , 1 o (1 ητ)t+1KL π τ π(0) + (1 ητ)t+1 γ , where the last step results from (76). D.2 PROOF OF THEOREM 4 Bounding the term KL π τ π(t) . Recall that the update rule of π(t) i (k) is given by π(t) i (k) π(t 1) i (k)1 ητ exp(η[Aiπ(κ(t) i )]k). (77) We introduce an auxiliary variable eπ(t) i : eπ(t) i (k) π(t 1) i (k)1 eη(t) i τ exp eη(t) i [Aiπ(κ(t) i )]k , (78) which can be viewed as a conceptual alternative update of π(t) i with a different step size eη(t) i > 0 satisfying (1 eη(t) i τ)(1 ητ)t κ(t) i = 1 ητ or equivalently 1 eη(t) i τ = (1 ητ)γ+1 t+κ(t) i . It directly follows that eη(t) i η. Since κ(t) i t, we have 1 eη(t) i τ 1 (γ + 1 t + κ(t) i )ητ 1 (γ + 1)ητ, which implies eη(t) i (γ + 1)η. For notational convenience, we set eπ(t) i = π(0), eη(t) i = η and κ(t) i = 0 when t 0. The following lemma establishes a one-step analysis, with the proof postponed to Appendix E.7. Lemma 7. 
When t 1, it holds that KL π i,τ π(t) i + ητKL π (κ(t) i ) i π i,τ = (1 ητ)KL π i,τ π(t 1) i η(π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) ψ(t) i log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i , (79) ψ(t) i := 1 η KL π(t) i π(t 1) i (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i + KL eπ(t) i π (κ(t) i ) i + KL π(t) i eπ(t) i . We proceed to control the term log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i . By definition, we have log eπ(t) i 1= (1 eη(t) i τ) log π(t 1) i + eη(t) i Aiπ(κ(t) i ) 1= (1 eη(t) i τ)(1 ητ)t κ(t) i log π(κ(t) i 1) Published as a conference paper at ICLR 2023 Aiπ(κ(t) i ) + (1 eη(t) i τ)(1 ητ)t 1 l Aiπ(κ(l) i ) log π (κ(t) i ) i 1= (1 ητ) log π(κ(t) i 1) + ηAiπ(κ (κ(t) i 1) i ) when κ(t) i 1. Subtracting the two equations yields log π (κ(t) i ) i log eπ(t) i 1= eη(t) i Ai(π(κ (κ(t) i 1) i ) π(κ(t) i )) (1 eη(t) i τ)(1 ητ)t 1 l Ai(π(κ (κ(t) i 1) i ) π(κ(l) i )) , where the log π(κ(t) i 1) terms cancel out due to (1 eη(t) i τ)(1 ητ)t κ(t) i = 1 ητ. It follows that log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i π (κ(t) i ) i eπ(t) i , Ai(π(κ (κ(t) i 1) i ) π(κ(t) i )) (1 eη(t) i τ)(1 ητ)t 1 l π (κ(t) i ) i eπ(t) i , Ai(π(κ (κ(t 1) i ) i ) π(κ(l) i )) eη(t) i A π (κ(t) i ) i eπ(t) i 1 X π (κ(l) i ) j π (κ (κ(t 1) i ) i ) j 1. (81) The next lemma establishes an upper bound on the term Pt l=κ(t) i π (κ(l) i ) j π (κ (κ(t 1) i ) i ) j 1, with the proof postponed to Appendix E.8. Lemma 8. Let νj(t) denote the time index when agent j receives the payoff from the t-th iteration, i.e., κ(νj(t)) j = t. For t = 0, we set νj(0) to an arbitrary index that satisfies κ(νj(0)) j = 0. When t 2γ + 1, it holds that π (κ(l) i ) j π (κ (κ(t 1) i ) i ) j 1 4 ψ (νj(κ (κ(t 1) i ) i )) j , Plugging Lemma 8 into (81) gives log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i eη(t) i A π (κ(k) i ) i eπ(t) i 1 X ψ (νj(κ (κ(t 1) i ) i )) j i 2 eη(t) i A 14dmax(γ + 1)3/2 π (κ(t) i ) i eπ(t) i 2 1 h 8(γ + 1)3/2 t+γ X l=t 2γ ψ(l) j + 4(γ + 1)5/2ψ (νj(κ (κ(t 1) i ) i )) j i (ii) eη(t) i A 14dmax(γ + 1)5/2ψ(t) i + X h 4(γ + 1)3/2 t+γ X l=t 2γ ψ(l) j + 2(γ + 1)5/2ψ (νj(κ (κ(t 1) i ) i )) j i , Published as a conference paper at ICLR 2023 where (i) results from Young s inequality π (κ(t) i ) i eπ(t) i 1 2(γ + 1)1/2 π (κ(t) i ) i eπ(t) i 2 1 + 2(γ + 1)1/2ψ(l) j and (ii) follows from π (κ(t) i ) i eπ(t) i 1 2KL eπ(t) i π (κ(t) i ) i 2(γ + 1)ψ(t) i . Plugging (82) into (79) and summing over i V yields KL π τ π(t) + ητ X π (κ(t) i ) i π τ (1 ητ)KL π τ π(t 1) η X i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) (1 14ηdmax A (γ + 1)5/2) X i V ψ(t) i + 2η A (γ + 1)5/2 X (i,j) E ψ (νj(κ (κ(t 1) i ) i )) j + 4ηdmax A (γ + 1)3/2 t+γ X l=t 2γ ψ(l), (83) where we denote P i V ψ(l) i by ψ(l) for notation simplicity. We then seek to sum the above equation over t = 2γ + 1, , T. Before proceeding, we note that l=t 2γ ψ(l) t=l γ ψ(l) 3(γ + 1) (i,j) E ψ (νj(κ (κ(t 1) i ) i )) j X t=0 ψ(t) j dmax where the first step is due to the mapping t 7 νj(κ (κ(t 1) i ) i ) being injective when t 2γ + 1 (cf. Assumptions 2, 3). Note that ψ(t) j = 0 when t 0 and hence can be safely discarded. 
Taken together, we arrive at t=2γ+1 KL π τ π(t) + ητ π (κ(t) i ) i π i,τ (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) 1 14ηdmax A (γ + 1)5/2 T X t=2γ+1 ψ(t) + 12ηdmax A (γ + 1)5/2 T +γ X + 2ηdmax A (γ + 1)5/2 T +γ 1 X (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) 1 28ηdmax A (γ + 1)5/2 T X t=2γ+1 ψ(t) + 14ηdmax A (γ + 1)5/2 X Published as a conference paper at ICLR 2023 (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) + 1 l Γ ψ(l), (84) where Γ = {1, , 2γ} {T + 1, , T + γ}. The last step results from the choice of learn- ing rate η 1 28dmax A (γ+1)5/2 . It now remains to bound the terms PT t=2γ+1 P i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ), KL π τ π(2γ) and P l Γ ψ(l). In view of Lemma 1, we have i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) i V (π(t) i π i,τ) Ai(π(t) π τ) i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ). We remark that each (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) term will cancel out due to the mapping t 7 κ(t) i being injective when t γ. In addition, we have a crude bound (π(t) i π i,τ) Ai(π(t) π τ) = X j Ni (π(t) i π i,τ) Aij(π(t) j π j,τ) 4dmax A for every i V, t 0. Applying the bound to the remaining nγ terms gives i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) 4nγdmax A . (85) The remaining terms KL π τ π(2γ) and ψ(l) can be bounded with the following lemma, with the proof postponed to Appendix E.9. Lemma 9. It holds for all i V and t 0 that ψ(t) i η(dmax A (2γ + 11) + 3τ log |Si|). (86) In addition, we have KL π i,τ π(2γ) i KL π i,τ π(0) i + 4ηdmax A γ. (87) Putting all pieces together, we continue from (84) and show that t=2γ+1 KL π τ π(t) (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) + 1 KL π i,τ π(0) i + 8ηnγdmax A + ηγ ndmax A (2γ + 11) + 3τ X i V log |Si| KL π i,τ π(0) i + 8ηn h γdmax A + γ dmax A (2γ + 11) + 3τ log Smax i KL π i,τ π(0) i + n + 24ητnγ log Smax. Bounding the term KL π τ π(t γ+1) . By definition of KL divergence, we have KL π i,τ π(t γ+1) i = KL π i,τ π(t+1) i + π i,τ, log π(t+1) i log π(t γ+1) i = KL π i,τ π(t+1) i KL π(t γ+1) i π(t+1) i + π i,τ π(t γ+1) i , log π(t+1) i log π(t γ+1) i . (88) Published as a conference paper at ICLR 2023 It follows directly from the update rules that log π(t γ+1) i 1= (1 ητ) log π(t γ) i + ηAiπ(κ(t γ) i ) log π(t+1) i 1= (1 ητ)γ+1 log π(t γ) i + η l=t γ+1 (1 ητ)t l+1Aiπ(κ(l) i ), which enables us to control the term π i,τ π(t γ+1) i , log π(t+1) i log π(t γ+1) i as π i,τ π(t γ+1) i , log π(t+1) i log π(t γ+1) i l=t γ+1 (1 ητ)t l+1 π i,τ π(t γ+1) i , Ai(π(κ(t γ) i ) π(κ(l) i )) η A π i,τ π(t γ+1) i 1 X π (κ(t γ) i ) j π (κ(l) i ) j 1. (89) In the same vein as Lemma 8, we can bound the term Pt+1 l=t γ+1 π (κ(t γ) i ) j π (κ(l) i ) j 1 with {ψ(l) i }, as detailed in the following lemma. The proof is omitted due to its similarity with that of Lemma 8. Lemma 10. When t 2γ, it holds that π (κ(t γ) i ) j π (κ(l) i ) j 1 4 2(γ + 1)2 r ψ (νj(κ(t γ) i )) j . Plugging the above lemma into (89), we have π i,τ π(t γ+1) i , log π(t+1) i log π(t γ+1) i η A π i,τ π(t γ+1) i 1 X 2(γ + 1)2 r ψ (νj(κ(t γ) i )) j 14dmax(γ + 1)3/2 π i,τ π(t γ+1) i 2 1 h 8(γ + 1)3/2 t+γ+1 X l=t 2γ+1 ψ(l) j + 4(γ + 1)5/2ψ (νj(κ(t γ) i )) j i 14dmax(γ + 1)3/2KL π i,τ π(t γ+1) i h 4(γ + 1)3/2 t+γ+1 X l=t 2γ+1 ψ(l) j + 2(γ + 1)5/2ψ (νj(κ(t γ) i )) j i , where (i) results from similar arguments in (82) and (ii) invokes Pinsker s inequality. 
Substitution of the above inequality into (88) and summing over i V leads to (1 14ηdmax A (γ + 1)3/2)KL π τ π(t γ+1) KL π τ π(t+1) + η A X h 4(γ + 1)3/2 t+γ+1 X l=t 2γ+1 ψ(l) j + 2(γ + 1)5/2ψ (νj(κ(t γ) i )) j i KL π τ π(t+1) + 4ηdmax A (γ + 1)3/2 t+γ+1 X l=t 2γ+1 ψ(l) + 2ηdmax A (γ + 1)5/2ψ(νj(κ(t γ) i )). Summing the above inequality over t = 2γ 1, , T 1 and adding PT 1 t=2γ P π (κ(t+1) i ) i π i,τ to the both sides, 3KL π τ π(t γ+1) + X π (κ(t+1) i ) i π i,τ Published as a conference paper at ICLR 2023 t=2γ KL π τ π(t+1) + π (κ(t+1) i ) i π i,τ + 4ηdmax A (γ + 1)3/2 T 1 X l=t 2γ+1 ψ(l) + 2ηdmax A (γ + 1)5/2 T 1 X t=2γ ψ(νj(κ(t γ) i )) (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) 1 28ηdmax A (γ + 1)5/2 T X t=2γ+1 ψ(t) + 14ηdmax A (γ + 1)5/2 X + 12ηdmax A (γ + 1)5/2 T +γ X l=1 ψ(l) + 2ηdmax A (γ + 1)5/2 T +γ 1 X (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) 1 28(1 + ητ 2 )ηdmax A (γ + 1)5/2 T X t=2γ+1 ψ(t) + 14(1 + ητ)ηdmax A (γ + 1)5/2 X Here, (i) invokes the bound established in (84). We remark that our choice of learning rate η min n 1 2τ(γ + 1), 1 42dmax A (γ + 1)5/2 guarantees 1 28(1 + ητ 2 )ηdmax A (γ + 1)5/2 0. This taken together with (85) and Lemma 9 gives 3KL π τ π(t γ+1) + X π (κ(t+1) i ) i π i,τ (1 ητ)KL π τ π(2γ) η i V (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) + 1 KL π i,τ π(0) i + 8ηn h γdmax A + 3γ dmax A (2γ + 11) + 3τ log Smax i KL π i,τ π(0) i + n + 36ητnγ log Smax Bounding the QRE gap. With Lemma 3, we have t=2γ QRE-Gapτ(π(t+1)) d2 max A 2 τ KL π τ π(t+1) + τKL max n3d2 max A 2 2τ , τ o T γ 1 X 3KL π τ π(t+1) + KL π(t+1) π τ . Since the mapping t 7 νi(t) is injective, we have π(t+1) i π i,τ = π (κ (νi(t+1)) i ) i π i,τ π (κ(t+1) i ) i π i,τ . Combining the above two equalities gives t=2γ QRE-Gapτ(π(t+1)) Published as a conference paper at ICLR 2023 max n3d2 max A 2 2τ , τ o T γ 1 X 2 3KL π τ π(t+1) + max n3d2 max A 2 2τ , τ o T 1 X 3KL π τ π(t γ+1) + X π (κ(t+1) i ) i π i,τ max n3d2 max A 2 2τ , τ o 1 KL π i,τ π(0) i + n + 36ητnγ log Smax , where the last step results from (90). E PROOF OF AUXILIARY LEMMAS E.1 PROOF OF LEMMA 1 To prove this lemma, we recall a key observation in Cai et al. (2016) that allows one to transform a zero-sum polymatrix game G = {(V, E), {Si}i V , {Aij}(i,j) E} into a pairwise constant-sum polymatrix game e G = {(V, E), {Si}i V , { e Aij}(i,j) E} such that (1) For every player i V , it has the same payoff in G and e G: ui(s) = eui(s), s S. (2) For each pair (i, j) E, i = j, the two-player game e G is constant-sum, i.e., there exist constants αij = αji, such that e Aij(si, sj) + e Aji(sj, si) = αij (91) holds for all si Si, sj Sj. We are now in a place to prove Lemma 1. Let e G be the pairwise constant-sum polymatrix game associated with G after the above payoff preserving transformation. We have X ui(πi, π i) + ui(π i, π i) eui(πi, π i) + eui(π i, π i) E si πi,sj π j h e Aij(si, sj) i + E si π i,sj πj h e Aij(si, sj) i# E si πi,sj π j h e Aij(si, sj) i + E si π i,sj πj h αij e Aji(sj, si) i# (i,j) E αij = 0, where the penultimate line uses (91), and the last line uses the fact that e G is also a zero-sum polymatrix game, which satisfies X (i,j) E αij = X h e Aij(si, sj) + e Aji(sj, si) i = X i V eui(s) + X j V euj(s) = 0 for any arbitrary s S. E.2 PROOF OF LEMMA 2 In view of the update rule (7), we have log π(t+1) i = (1 ητ) log π(t) i + ηAiπ(t+1) + ci1 Published as a conference paper at ICLR 2023 for some constant ci. 
On the other hand, it follows from the expression of QRE in (4) that ητ log π i,τ = ηAiπ τ + c i 1 (92) for some constant c i . By combining the above two equalities and taking the inner product with π(t+1) i π i,τ, we have log π(t+1) i (1 ητ) log π(t) i ητ log π i,τ, π(t+1) i π i,τ = η(π(t+1) i π i,τ) Ai(π(t+1) π τ). (93) Summing the above equality over i V gives log π(t+1) (1 ητ) log π(t) ητ log π τ, π(t+1) π τ i V (π(t+1) i π i,τ) Ai(π(t+1) π τ) h (π(t+1) i ) Aiπ(t+1) + (π i,τ) Aiπ τ i η X h (π(t+1) i ) Aiπ τ + (π i,τ) Aiπ(t+1)i h ui(π(t+1)) + ui(π τ) i = 0, where the last line follows from P i V h (π(t+1) i ) Aiπ τ + (π i,τ) Aiπ(t+1)i = 0 due to Lemma 1, as well as that the game is zero-sum. E.3 PROOF OF LEMMA 3 Recalling the definition QRE-Gapτ(π) = max i V max π i (Si) ui,τ(π i, π i) ui,τ(π) max π i (Si) ui,τ(π i, π i) ui,τ(π) = max i V :π i (Si) i V [ui,τ(π i, π i) ui,τ(πi, π i)] , where the inequality holds since maxπ i (Si) ui,τ(π i, π i) ui,τ(π) ui,τ(πi, π i) ui,τ(π) = 0 for all i V . We now proceed to decompose X i V [ui,τ(π i, π i) ui,τ(πi, π i)] ui,τ(π i, π i) ui,τ(π i,τ, π i,τ) τ X H(πi) H(π i,τ) ui,τ(π i, π i) ui,τ(π i, π i,τ) ui,τ(π i,τ, π i) + ui,τ(π i,τ, π i,τ) ui,τ(π i,τ, π i) ui,τ(π i,τ, π i,τ) τ H(πi) H(π i,τ) ui,τ(π i, π i,τ) ui,τ(π i,τ, π i,τ) (94) where the first line follows from P i V (ui,τ(π) τH(πi)) = P i V ui,τ(π τ) τH(π i,τ) = 0 by the definition of zero-sum games. It boils down to control the terms on the RHS of (94). To control the first term, by the definition of ui,τ in (5) (see also (3)), it follows that ui,τ(π i, π i) ui,τ(π i, π i,τ) ui,τ(π i,τ, π i) + ui,τ(π i,τ, π i,τ) = ui(π i, π i) ui(π i, π i,τ) ui(π i,τ, π i) + ui(π i,τ, π i,τ) Published as a conference paper at ICLR 2023 = (π i π i,τ) Ai(π π τ) = X j Ni (π i π i,τ) Aij(πj π j,τ), which each summand can be further bounded by Young s inequality and Pinsker s inequality as (π i π i,τ) Aij(πj π j,τ) A π i π i,τ 1 πj π j,τ 1 π i π i,τ 2 1 + dmax A πj π j,τ 2 1 τ dmax A KL π i π i,τ + dmax A τ KL π j,τ πj . Summing the inequality over i, j gives X ui,τ(π i, π i) ui,τ(π i, π i,τ) ui,τ(π i,τ, π i) + ui,τ(π i,τ, π i,τ) τKL π π τ + d2 max A 2 τ KL π τ π . (95) Regarding the second term, we have X ui,τ(π i,τ, π i) ui,τ(π i,τ, π i,τ) τ H(πi) H(π i,τ) (π i,τ) Ai(π π τ) + τ(π i log πi (π i,τ) log π i,τ) (π i,τ) Ai(π π τ) + τ πi, log πi log π i,τ + πi π i,τ, log π i,τ (π i,τ) Ai(π π τ) + (πi π i,τ) Aiπ τ + τKL πi π i,τ = τKL π π τ , (96) where the penultimate step follows from (92) and the last step invokes Lemma 1. Moving to the last term, we have ui,τ(π i,τ, π i,τ) ui,τ(π i, π i,τ) = (π i,τ π i) Aiπ τ τ(π i,τ) log π i,τ + τ(π i) log π i = τ(π i,τ π i) log π i,τ τ(π i,τ) log π i,τ + τ(π i) log π i = τKL π i π i,τ . (97) where the second line follows again from (92). Plugging (95), (96) and (97) into (94) gives X i V [ui,τ(π i, π i) ui,τ(πi, π i)] τKL π π τ + d2 max A 2 τ KL π τ π . Taking maximum over π finishes the proof. E.4 PROOF OF LEMMA 4 Let eπ(T ) = 1 T +1 PT t=0 π(t+1), then eπ(T ) (S). The proof is completed if we can show X i V Regreti,τ T + 1 X i V Regreti,τ eπ(T ) i , T + 1 0, (98) where the first inequality holds trivially since Regreti,τ T + 1 Regreti,τ eπ(T ) i , T . It then boils down to show the second inequality of the above relation. From the definition of zero-sum polymatrix games, it holds that eπ(T ) i π(t+1) i , Aiπ(t+1) = X eπ(T ) i , Aiπ(t+1) Published as a conference paper at ICLR 2023 eπ(T ) i , Ai = (T + 1) X eπ(T ) i , Aieπ(T ) = 0. 
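Both the transformation argument in E.1 and the display above rest on the closed-system property of zero-sum polymatrix games, namely that the total payoff vanishes for every product strategy profile. The short Python sketch below checks this numerically for a toy game in which every edge carries antisymmetric payoffs, i.e., the special case αij = 0 of the constant-sum relation (91); the triangle graph, the action-set sizes and the random payoffs are illustrative choices, not anything used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy zero-sum polymatrix game on a triangle: edges (0,1), (1,2), (0,2).
    sizes = [2, 3, 4]                      # |S_i| for the n = 3 agents (illustrative)
    edges = [(0, 1), (1, 2), (0, 2)]

    A = {}
    for (i, j) in edges:
        A[(i, j)] = rng.standard_normal((sizes[i], sizes[j]))
        A[(j, i)] = -A[(i, j)].T           # antisymmetric pair: alpha_ij = 0 in (91)

    def random_simplex(k):
        p = rng.random(k)
        return p / p.sum()

    pi = [random_simplex(k) for k in sizes]  # an arbitrary product strategy profile

    # u_i(pi) = sum over neighbours j of  pi_i^T A_ij pi_j
    def payoff(i):
        return sum(pi[i] @ A[(i, j)] @ pi[j] for j in range(len(sizes)) if (i, j) in A)

    total = sum(payoff(i) for i in range(len(sizes)))
    print("total payoff:", total)          # vanishes up to floating-point error
    assert abs(total) < 1e-10

The same edge-wise cancellation is what makes the double sum in E.1 collapse to the sum of the constants αij, which is zero.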
In addition, applying Jensen s inequality gives t=0 H(eπ(T ) i ) = (T + 1)H(eπ(T ) i ) t=0 H(π(t+1) i ). Combining the above two relations yields i V Regreti,τ eπ(T ) i , T + 1 X eπ(T ) i π(t+1) i , Aiπ(t+1) + τH(eπ(T ) i ) τH(π(t+1) i ) 0, which concludes the proof. E.5 PROOF OF LEMMA 5 Taking logarithm on the both sides of (7), we have log π(t+1) i 1= (1 ητ) log π(t) i + ηAiπ(t γ+1). (99) On the other hand, the definition of QRE in (4) gives ητ log π i,τ 1= ηAiπ τ. Subtracting the two equalities and taking inner product with π(t γ+1) i π i,τ, we get log π(t+1) i (1 ητ) log π(t) i ητ log π i,τ, π(t γ+1) i π i,τ π(t γ+1) i π i,τ Ai π(t γ+1) π τ . Summing the above equality over i V leads to log π(t+1) (1 ητ) log π(t) ητ log π τ, π(t γ+1) i π τ π(t γ+1) i π i,τ Ai π(t γ+1) π τ = 0, where the final step results from Lemma 1. E.6 PROOF OF LEMMA 6 Recall from (71) that KL π τ π(t+1) = (1 ητ)KL π τ π(t) (1 ητ)KL π(t γ+1) π(t) KL π(t+1) π(t γ+1) + log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) ητKL π(t γ+1) π τ . (100) When t < γ, we have π(t γ+1) i = π(0). It follows that log π(t γ+1) i = log π(0) 1= 0, log π(t+1) i 1= (1 ητ)t+1 log π(0) + η l=0 (1 ητ)l Aiπ(t γ l+1) Published as a conference paper at ICLR 2023 l=0 (1 ητ)l Aiπ(0). Therefore, we can bound the term log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) as log π(t γ+1) log π(t+1), π(t γ+1) π(t+1) = η l=0 (1 ητ)l Aiπ(0), π(0) π(t+1) η(t + 1)dmax A π(0) π(t+1) 1 2η(t + 1)dmax A . (101) Plugging the above inequality into (100) leads to KL π τ π(t+1) (1 ητ)KL π τ π(t) (1 ητ)KL π(t γ+1) π(t) KL π(t+1) π(t γ+1) + 2η(t + 1)dmax A . Applying the above inequality recursively to the iterates 0, 1, . . . , γ 1, we arrive at KL π τ π(γ) (1 ητ)γKL π τ π(0) l1=0 (1 ητ)γ 1 l1h (1 ητ)KL π(l1 γ+1) π(l1) + KL π(l1+1) π(l1 γ+1) i l1=0 (1 ητ)γ 1 l1(l1 + 1)dmax A (1 ητ)γKL π τ π(0) l1=0 (1 ητ)γ 1 l1h (1 ητ)KL π(l1 γ+1) π(l1) + KL π(l1+1) π(l1 γ+1) i + 2ηγ2dmax A . E.7 PROOF OF LEMMA 7 Taking logarithm on the both sides of (77) and (78), we get η log eπ(t) i log π(t 1) i 1= eη(t) i log π(t) i log π(t 1) i , or equivalently log π(t) i 1= η eη(t) i log eπ(t) i + 1 η log π(t 1) i . Taking inner product with π i,τ π(t) i , log π(t) i η eη(t) i log eπ(t) i 1 η log π(t 1) i , π i,τ π(t) i = 0. By definition of KL divergence, we have log π(t) i η eη(t) i log eπ(t) i 1 η log π(t 1) i , π i,τ = (log π(t) i log π i,τ) η eη(t) i (log eπ(t) i log π i,τ) 1 η (log π(t 1) i log π i,τ), π i,τ = KL π i,τ π(t) i + 1 η KL π i,τ π(t 1) i + η eη(t) i KL π i,τ eπ(t) i , and log π(t) i η eη(t) i log eπ(t) i 1 η log π(t 1) i , π(t) i Published as a conference paper at ICLR 2023 eη(t) i KL π(t) i eπ(t) i + 1 η KL π(t) i π(t 1) i . Taken together, we get KL π i,τ π(t) i + η eη(t) i KL π(t) i eπ(t) i + 1 η KL π(t) i π(t 1) i KL π i,τ π(t 1) i + η eη(t) i KL π i,τ eπ(t) i . (102) On the other hand, taking logarithm of (78) and making inner product with π (κ(t) i ) i π i,τ gives log eπ(t) i (1 eη(t) i τ) log π(t 1) i eη(t) i τ log π i,τ, π (κ(t) i ) i π i,τ = eη(t) i (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ). Following a similar discussion in (19) gives KL π i,τ eπ(t) i = (1 eη(t) i τ)KL π i,τ π(t 1) i (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i eη(t) i τKL π (κ(t) i ) i π i,τ KL eπ(t) i π (κ(t) i ) i + log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i eη(t) i (π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ). 
(103) Plugging the above equation into (102), KL π i,τ π(t) i + η eη(t) i KL π(t) i eπ(t) i + KL π(t) i π(t 1) i = (1 ητ)KL π i,τ π(t 1) i η(π (κ(t) i ) i π i,τ) Ai(π(κ(t) i ) π τ) (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i + eη(t) i τKL π (κ(t) i ) i π i,τ + KL eπ(t) i π (κ(t) i ) i log π (κ(t) i ) i log eπ(t) i , π (κ(t) i ) i eπ(t) i . Rearranging the terms finishes the proof. E.8 PROOF OF LEMMA 8 For notational convenience, we set ϕ(t) i = 1 η π(t) i π(t 1) i 1 π (κ(t) i ) i π(t 1) i 1 + eπ(t) i π (κ(t) i ) i 1 + π(t) i eπ(t) i 1 for all i V, t 0. By triangular inequality, we have ϕ(t) i π(t) i π(t 1) i 1. In addition, we denote by t1 t2 := min{t1, t2} and t1 t2 := max{t1, t2}. For 0 < t1 < t2, it holds that π (κ(t1) i ) j π (κ(t2) i ) j 1 π (νj(κ(t1) i )) j π (νj(κ(t2) i )) j 1 + π (κ(t1) i ) j π (νj(κ(t1) i )) j 1 + π (κ(t2) i ) j π (νj(κ(t2) i )) j 1 νj(κ(t1) i ) νj(κ(t2) i ) X l=(νj(κ(t1) i )+1) (νj(κ(t2) i )+1) π(l) j π(l 1) j 1 + π (κ(t1) i ) j eπ (νj(κ(t1) i )) j 1 + eπ (νj(κ(t1) i )) j π (νj(κ(t1) i )) j 1 Published as a conference paper at ICLR 2023 + π (κ(t2) i ) j eπ (νj(κ(t2) i )) j 1 + eπ (νj(κ(t2) i )) j π (νj(κ(t2) i )) j 1 νj(κ(t1) i ) νj(κ(t2) i ) X l=(νj(κ(t1) i )+1) (νj(κ(t2) i )+1) ϕ(l) j + η (νj(κ(t1) i )) j η ϕ (νj(κ(t1) i )) j + η (νj(κ(t2) i )) j η ϕ (νj(κ(t2) i )) j νj(κ(t1) i ) νj(κ(t2) i ) X l=(νj(κ(t1) i )+1) (νj(κ(t2) i )+1) ϕ(l) j + (γ + 1)ϕ (νj(κ(t1) i )) j + (γ + 1)ϕ (νj(κ(t2) i )) j . (104) Therefore, we have π (κ(k) i ) j π (κ (κ(t 1) i ) i ) j 1 ( νj(κ (κ(t 1) i ) i ) νj(κ(k) i ) X l=(νj(κ (κ(t 1) i ) i )+1) (νj(κ(k) i )+1) ϕ(l) j + (γ + 1)ϕ (νj(κ (κ(t 1) i ) i )) j + (γ + 1)ϕ (νj(κ(k) i )) j Since 0 (t γ) κ(t) i t νi(t) t + γ for all i V , t 0, the first term can be bounded by νj(κ (κ(t 1) i ) i ) νj(κ(k) i ) X l=(νj(κ (κ(t 1) i ) i )+1) (νj(κ(k) i )+1) (t+γ 1) (k+γ) X l=(t 2γ) (k γ+1) ϕ(l) j l=t 2γ ϕ(l) j . In addition, the mapping k 7 νj(κ(k) i ) is injective when k γ (cf. Assumption 2 and 3). It follows that ϕ (νj(κ(k) i )) j l=κ(t) i γ ϕ(l) j l=t 2γ ϕ(l) j Plugging the above inequalities into (105) yields π (κ(k) i ) j π (κ (κ(t 1) i ) i ) j 1 (t + 1 κ(t) i ) l=t 2γ ϕ(l) j + (t + 1 κ(t) i )(γ + 1)ϕ (νj(κ (κ(t 1) i ) i )) j + (γ + 1) l=t 2γ ϕ(l) j l=t 2γ ϕ(l) j + (γ + 1)2ϕ (νj(κ (κ(t 1) i ) i )) j . Finally, we control the term ϕ(t) i with ψ(t) i as: (ϕ(t) i )2 = 1 η 1/2 π(t) i π(t 1) i 1 eη(t) i (1 eη(t) i τ) 1 1/2 η eη(t) i (1 eη(t) i τ) 1/2 π (κ(t) i ) i π(t 1) i 1 1/2 eπ(t) i π (κ(t) i ) i 1 + π(t) i eπ(t) i 1 eη(t) i + η 2 + (1 eη(t) i τ) 1 1 η π(t) i π(t 1) i 2 1 Published as a conference paper at ICLR 2023 (1 eη(t) i τ) π (κ(t) i ) i π(t 1) i 2 1 + eπ(t) i π (κ(t) i ) i 2 1 + π(t) i eπ(t) i 2 1 (ii) 2 2 + (1 eη(t) i τ) 1 1 η KL π(t) i π(t 1) i (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i + KL eπ(t) i π (κ(t) i ) i + KL π(t) i eπ(t) i (iii) 8ψ(t) i , (106) where (i) applies Cauchy-Schwarz inequality, (ii) invokes Pinsker s inequality and (iii) is due to eη(t) i τ (γ + 1)ητ 1/2. Combining the above two inequalities finishes the proof. E.9 PROOF OF LEMMA 9 We start with verifying the claim (86). Recall that ψ(t) i := 1 η KL π(t) i π(t 1) i (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i + KL eπ(t) i π (κ(t) i ) i + KL π(t) i eπ(t) i . We introduce the following standard Lemma (see e.g., (Cen et al., 2020, Appendix A.2)), which allows us to bound control KL πi π i properly: Lemma 11. Given πi, π i (Si) and w R|Si| with log πi 1= log π i + w, we have KL πi π i log πi log π i 2 w . 
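As a quick numerical sanity check of Lemma 11 (where, consistent with the rest of the appendix, equality of logarithms is understood up to an additive multiple of the all-ones vector coming from normalization), the sketch below perturbs the logits of a random distribution by a vector w, renormalizes, and verifies both inequalities. The construction and all numerical values are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    def kl(p, q):
        return float(np.sum(p * (np.log(p) - np.log(q))))

    for _ in range(1000):
        k = rng.integers(2, 8)
        q = rng.random(k) + 1e-2          # reference distribution pi'_i
        q /= q.sum()
        w = rng.standard_normal(k)        # perturbation of the logits

        logits = np.log(q) + w            # log pi_i equals log pi'_i + w up to a constant
        p = np.exp(logits - logits.max())
        p /= p.sum()                      # renormalize: the constant is absorbed here

        log_ratio_inf = np.max(np.abs(np.log(p) - np.log(q)))
        # Lemma 11:  KL(pi_i || pi'_i) <= ||log pi_i - log pi'_i||_inf <= 2 ||w||_inf
        assert kl(p, q) <= log_ratio_inf + 1e-12
        assert log_ratio_inf <= 2 * np.max(np.abs(w)) + 1e-12

    print("Lemma 11 holds on all random instances")

This is how the KL terms constituting ψ_i^{(t)} are controlled in the argument that follows: each one is reduced to a sup-norm bound on the corresponding difference of logits.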
Therefore, it suffices to figure out the terms log π(t) i log π(t 1) i , log π(t) i log eπ(t) i , log eπ(t) i log π (κ(t) i ) i and log π (κ(t) i ) i log π(t 1) i . Bounding KL π(t) i π(t 1) i and KL π(t) i eπ(t) i . The following equations follow directly from (77) and (78): log π(t) i log π(t 1) i 1= η([Aiπ(κ(t) i )]k τ log π(t 1) i ) log π(t) i log eπ(t) i 1= (η eη(t) i )([Aiπ(κ(t) i )]k τ log π(t 1) i ) . (107) In addition, we have the following bound w.r.t. the order of log π(t 1) i , which we shall establish momentarily. τ log π(t 1) i τ log |Si| + 2dmax A . (108) This taken together with Lemma 11 yields ( KL π(t) i π(t 1) i η(3dmax A + τ log |Si|) KL π(t) i eπ(t) i (eη(t) i η)(3dmax A + τ log |Si|) . (109) Bounding KL eπ(t) i π (κ(t) i ) i . When κ(t) i 1, we recall from (80) that: log π (κ(t) i ) i log eπ(t) i 1= eη(t) i Ai(π(κ (κ(t) i 1) i ) π(κ(t) i )) (1 eη(t) i τ)(1 ητ)t 1 l Ai(π(κ (κ(t) i 1) i ) π(κ(l) i )) , (110) Published as a conference paper at ICLR 2023 which leads to a crude bound π (κ(t) i ) i eπ(t) i eη(t) i dmax A (t κ(t) i + 1) eη(t) i dmax A (γ + 1). When κ(t) i = 0, we have log π (κ(t) i ) i log eπ(t) i 1= log eπ(t) i 1= (1 eη(t) i τ)(1 ητ)t 1 log π(0) Aiπ(κ(t) i ) + l=κ(t) i +1 (1 eη(t) i τ)(1 ητ)t 1 l Aiπ(κ(l) i ) Aiπ(κ(t) i ) + l=κ(t) i +1 (1 eη(t) i τ)(1 ητ)t 1 l Aiπ(κ(l) i ) , which yields π (κ(t) i ) i eπ(t) i eη(t) i dmax A t eη(t) i dmax A (γ + 1). Bounding KL π (κ(t) i ) i π(t 1) i . Note that log π (κ(t) i ) i log π(t 1) i = (log π (κ(t) i ) i log eπ(t) i ) + (log eπ(t) i log π(t) i ) + (log π(t) i log π(t 1) i ). This yields, by equations (107), (110), (111) and associated bounds, π (κ(t) i ) i π(t 1) i eη(t) i dmax A (γ + 1) + eη(t) i (3dmax A + τ log |Si|). Putting all pieces together, we conclude that ψ(t) i = 1 η KL π(t) i π(t 1) i (1 eη(t) i τ)KL π (κ(t) i ) i π(t 1) i + KL eπ(t) i π (κ(t) i ) i + KL π(t) i eπ(t) i 3η(3dmax A + τ log |Si|) + 2ηdmax A (γ + 1) = η(dmax A (2γ + 11) + 3τ log |Si|). It remains to prove the claim (87): KL π i,τ π(2γ) i = KL π i,τ π(0) i + π i,τ, log π(0) i log π(2γ) i KL π i,τ π(0) i + log π(0) i log π(2γ) i KL π i,τ π(0) i + 2 l=1 (1 ητ)2γ l Aiπ(κ(l) i ) KL π i,τ π(0) i + 4ηdmax A γ, where the third step results from log π(2γ) i 1= (1 ητ)2γ log π(0) i + η P2γ l=1(1 ητ)2γ l Aiπ(κ(l) i ) and Lemma 11. Proof of the claim (108). First, we prove by induction that for any k, l Si, log π(t) i (k) log π(t) i (l) 2dmax A τ , t 0. (112) Note that the claim trivially holds for t = 0 with the uniform initialization π(0) i = 1 |Si|1, i V . Assume that (112) holds for all t t 1. Note that log π(t) i 1= (1 ητ) log π(t 1) i + ηAiπ(κ(t) i ), we have log π(t) i (k) log π(t) i (l) = (1 ητ) log π(t 1) i (k) log π(t 1) i (l) + η [Aiπ(κ(t) i )]k [Aiπ(κ(t) i )]l Published as a conference paper at ICLR 2023 (1 ητ)2dmax A τ + 2ηdmax A where the second line follows from the induction hypothesis (112). This completes the induction at the t-th iteration. It follows that for all i V and t 0, log π(t) i (l) log max k Si π(t) i (k) 2dmax A τ log |Si| 2dmax A
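Finally, the bounded log-ratio property (112) can be observed directly by simulating the recursion log π_i^{(t)} = (1 − ητ) log π_i^{(t−1)} + η A_i π^{(κ_i^{(t)})} (up to normalization) from the uniform initialization. The sketch below runs a simplified single-sequence variant of this update with random bounded delays on a toy triangle game and asserts that the spread of log π_i^{(t)} never exceeds 2 dmax ∥A∥∞ / τ; the game, the step sizes and the delay model are illustrative assumptions, not the paper's experimental setup.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy 3-agent zero-sum polymatrix game on a triangle (antisymmetric pairwise payoffs).
    sizes = [3, 4, 5]
    edges = [(0, 1), (1, 2), (0, 2)]
    A = {}
    for (i, j) in edges:
        A[(i, j)] = rng.uniform(-1.0, 1.0, size=(sizes[i], sizes[j]))
        A[(j, i)] = -A[(i, j)].T

    A_inf = max(np.max(np.abs(M)) for M in A.values())  # max-norm over all pairwise games
    d_max = 2                                            # every vertex of the triangle has degree 2
    tau, eta, gamma = 0.1, 0.01, 3                       # regularization, learning rate, maximal delay

    def payoff_vec(i, profile):
        # [A_i pi]_k = sum over neighbours j of [A_ij pi_j]_k
        return sum(A[(i, j)] @ profile[j] for j in range(len(sizes)) if (i, j) in A)

    T = 500
    history = [[np.full(k, 1.0 / k) for k in sizes]]     # uniform initialization pi^{(0)}
    for t in range(1, T + 1):
        kappa = max(t - rng.integers(1, gamma + 1), 0)   # delayed index, t - gamma <= kappa <= t - 1
        prev, delayed = history[-1], history[kappa]
        new = []
        for i in range(len(sizes)):
            logits = (1 - eta * tau) * np.log(prev[i]) + eta * payoff_vec(i, delayed)
            p = np.exp(logits - logits.max())
            new.append(p / p.sum())
            # claim (112): log-ratios of every iterate stay within 2 * d_max * ||A||_inf / tau
            spread = float(np.max(np.log(new[i])) - np.min(np.log(new[i])))
            assert spread <= 2 * d_max * A_inf / tau + 1e-9
        history.append(new)

    print("log-ratio bound (112) held along the whole trajectory")

This uniform control of the logits along the trajectory is exactly what (108) packages as a bound of the form τ log |S_i| + 2 dmax ∥A∥∞ on ∥τ log π_i^{(t−1)}∥∞.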