# No-Regret Learning in Time-Varying Zero-Sum Games

Mengxiao Zhang*1, Peng Zhao*2, Haipeng Luo1, Zhi-Hua Zhou2

*Equal contribution, in alphabetical order. 1University of Southern California. 2National Key Laboratory for Novel Software Technology, Nanjing University. Correspondence to: Mengxiao Zhang, Peng Zhao. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Learning from repeated play in a fixed two-player zero-sum game is a classic problem in game theory and online learning. We consider a variant of this problem where the game payoff matrix changes over time, possibly in an adversarial manner. We first present three performance measures to guide the algorithmic design for this problem: 1) the well-studied individual regret, 2) an extension of duality gap, and 3) a new measure called dynamic Nash Equilibrium regret, which quantifies the cumulative difference between the player's payoff and the minimax game value. Next, we develop a single parameter-free algorithm that simultaneously enjoys favorable guarantees under all three performance measures. These guarantees are adaptive to different non-stationarity measures of the payoff matrices and, importantly, recover the best known results when the payoff matrix is fixed. Our algorithm is based on a two-layer structure with a meta-algorithm learning over a group of black-box base-learners satisfying a certain property, along with several novel ingredients specifically designed for the time-varying game setting. Empirical results further validate the effectiveness of our algorithm.

1. Introduction

Repeated play in a fixed two-player zero-sum game, a fundamental problem in the interaction between game theory and online learning, has been extensively studied in recent decades. In particular, many efforts have been devoted to designing online algorithms such that both players achieve small individual regret (that is, the difference between one's cumulative payoff and that of the best fixed action) while at the same time the dynamics of the players' strategies lead to a Nash equilibrium, a pair of strategies that neither player has an incentive to deviate from; see for example (Freund & Schapire, 1999; Rakhlin & Sridharan, 2013; Daskalakis et al., 2015; Syrgkanis et al., 2015; Chen & Peng, 2020; Wei et al., 2021; Hsieh et al., 2021; Daskalakis et al., 2021).

In contrast to this large body of studies on learning over a fixed zero-sum game, repeated play over a sequence of time-varying games, the focus of this paper and a ubiquitous scenario in practice, is much less explored. While minimizing individual regret still makes perfect sense in this case, it is not immediately clear what other desirable game-theoretic guarantees generalize the concept of approaching a Nash equilibrium when the game is fixed. As far as we know, Cardoso et al. (2019) are the first to explicitly consider this problem. They proposed the notion of Nash-Equilibrium regret (NE-regret) as the performance measure, which quantifies the difference between the learners' cumulative payoff and the minimax value of the cumulative payoff matrix. The authors proposed an algorithm with $\tilde{O}(\sqrt{T})$ NE-regret after $T$ rounds of play and, importantly, proved that no algorithm can simultaneously achieve sublinear NE-regret and sublinear individual regret for both players.
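To make the NE-regret benchmark concrete: the quantity $\min_{x \in \Delta_m} \max_{y \in \Delta_n} x^\top (\sum_{t=1}^T A_t) y$ is the value of a zero-sum matrix game and can be computed with a small linear program. The following is a minimal sketch in Python, assuming NumPy and SciPy are available; `game_value` is our own helper name, not something defined in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def game_value(A):
    """Value and a minimax strategy x* of the zero-sum game min_x max_y x^T A y."""
    m, n = A.shape
    # Variables: (x_1, ..., x_m, v); minimize v.
    c = np.r_[np.zeros(m), 1.0]
    # For every column j: sum_i A[i, j] x_i - v <= 0.
    A_ub = np.c_[A.T, -np.ones(n)]
    b_ub = np.zeros(n)
    # Probability-simplex constraints: sum_i x_i = 1, x >= 0.
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.fun, res.x[:m]

# NE-regret benchmark for a sequence of payoff matrices A_1, ..., A_T:
rng = np.random.default_rng(0)
A_seq = [rng.uniform(-1, 1, size=(3, 4)) for _ in range(100)]
value_of_sum, x_star = game_value(sum(A_seq))
print(value_of_sum, x_star)
```

The same solver also gives the per-round game values used by the dynamic variant of this measure introduced below.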
Our work starts by questioning whether the NE-regret of Cardoso et al. (2019) is indeed a good performance measure for the problem of learning in time-varying games, especially given its incompatibility with the arguably most standard goal of having small individual regret. We then discover that measuring performance with NE-regret can in fact be highly unreasonable: we show an example (in Section 3) where even when the two players perform perfectly (in the sense that they play the corresponding Nash equilibrium in every round), the resulting NE-regret is still linear in $T$!

Motivated by this observation, we revisit the basic problem of how to measure an algorithm's performance in such a time-varying game setting. Concretely, we consider three performance measures that we believe are appropriate and natural: 1) the standard individual regret; 2) the direct generalization of cumulative duality gap from a fixed game to a varying game; and 3) a new measure called dynamic NE-regret, which quantifies the difference between the learner's cumulative payoff and the cumulative minimax game value (instead of the minimax value of the cumulative payoff matrix, as in NE-regret). We argue that dynamic NE-regret is a better measure compared to NE-regret: first, in the earlier example where both players play perfectly in each round using the corresponding Nash equilibrium, their dynamic NE-regret is exactly zero (while their NE-regret can be linear in $T$); second, having small dynamic NE-regret does not prevent one from enjoying small individual regret or duality gap (as will become clear soon).

Table 1. Summary of our results. The first column indicates the three performance measures considered in this work. The second column presents our main results for a sequence of time-varying payoff matrices $\{A_t\}_{t=1}^T$, and notably all results are simultaneously achieved by one single parameter-free algorithm. These guarantees are expressed in terms of three different (unknown) non-stationarity measures of the payoff matrices: $P_T$, $V_T$, and $W_T$, all of which are $\Theta(T)$ in the worst case and zero in the stationary case (when $A_t = A$ for all $t \in [T]$); see Section 3 for definitions. Additionally, for duality gap we use the notation $Q_T$ as a shorthand for $V_T + \min\{P_T, W_T\}$. Substituting all non-stationarity measures with zero leads to our corollaries for the stationary case shown in the third column, which match the state-of-the-art for all three performance measures (up to logarithmic factors) as shown in the last column.

| Measure | Time-Varying Game ($\{A_t\}_{t=1}^T$, general) | Stationary Game ($A_t = A$, fixed) | Best Known (stationary) |
| --- | --- | --- | --- |
| Individual Regret | $\tilde{O}\big(\sqrt{1 + V_T + \min\{P_T, W_T\}}\big)$ [Theorem 4] | $\tilde{O}(1)$ [Corollary 5] | $O(1)$ (Hsieh et al., 2021) |
| Dynamic NE-Regret | $\tilde{O}\big(\min\big\{\sqrt{(1 + V_T)(1 + P_T)} + P_T,\ 1 + W_T\big\}\big)$ [Theorem 6] | $\tilde{O}(1)$ [Corollary 7] | $O(1)$ (Hsieh et al., 2021)¹ |
| Duality Gap | $\tilde{O}\big(\min\big\{T^{3/4}(1 + Q_T)^{1/4},\ T^{1/2}(1 + Q_T^{3/2} + P_T Q_T)^{1/2}\big\}\big)$ [Theorem 8] | $\tilde{O}(\sqrt{T})$ [Corollary 9] | $O(\sqrt{T})$ (Wei et al., 2021) |

With these performance measures in mind, our main contribution is to develop one single parameter-free algorithm that simultaneously enjoys favorable guarantees under all measures. These guarantees are adaptive to some unknown non-stationarity measures of the payoff matrices; naturally, the bounds worsen as the non-stationarity becomes larger.
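As a brief aside on how the first two measures can be evaluated in practice (their formal definitions appear in Section 3.1): since the inner maximization or minimization of a linear function over a simplex is attained at a vertex, both individual regret and duality gap can be read off directly from the logged play $\{(A_t, x_t, y_t)\}_{t=1}^T$. A small NumPy sketch with our own helper names follows; dynamic NE-regret additionally requires a per-round minimax-value solve such as the LP sketched above.

```python
import numpy as np

def individual_regrets(As, xs, ys):
    """Reg^x_T and Reg^y_T for logged play; As is a list of (m, n) payoff matrices."""
    loss = sum(x @ A @ y for A, x, y in zip(As, xs, ys))
    # Best fixed row against {A_t y_t}: minimum entry of the summed loss vector.
    reg_x = loss - np.min(sum(A @ y for A, y in zip(As, ys)))
    # Best fixed column against {x_t^T A_t}: maximum entry of the summed reward vector.
    reg_y = np.max(sum(x @ A for A, x in zip(As, xs))) - loss
    return reg_x, reg_y

def duality_gap(As, xs, ys):
    """Cumulative duality gap: sum_t [max_y x_t^T A_t y - min_x x^T A_t y_t]."""
    return sum(np.max(x @ A) - np.min(A @ y) for A, x, y in zip(As, xs, ys))

# Example with uniform play on random payoff matrices:
rng = np.random.default_rng(1)
As = [rng.uniform(-1, 1, size=(3, 3)) for _ in range(50)]
xs = [np.ones(3) / 3] * 50
ys = [np.ones(3) / 3] * 50
print(individual_regrets(As, xs, ys), duality_gap(As, xs, ys))
```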
More specifically, the individual regret is always at most e O( T), the well-known worst-case bound, but could be much smaller if the non-stationarity measures are sublinear; on the other hand, the duality gap and dynamic NE-regret are sublinear as long as the non-stationarity measures are sublinear. In the special case of a fixed payoff matrix, all non-stationarity measures become zero and our results immediately recover the state-of-the-art results (up to logarithmic factors); see Table 1 for details. Notice that the best known results for a fixed game are not necessarily achieved by the same algorithm, while again, our results are all achieved by one adaptive algorithm. We also conduct empirical studies to further support our theoretical findings. Techniques. For a fixed game, Syrgkanis et al. (2015) proposed the Regret bounded by Variation in Utilities (RVU) property as the key condition for an algorithm to achieve good performance. On the other hand, one of the key tools for achieving our results is to ensure a small gap between each player s cumulative payoff and that of a sequence of changing comparators, known as dynamic regret in the literature (Zinkevich, 2003). Therefore, our first step is to generalize the RVU property to Dynamic Regret bounded by Variation in Utilities (DRVU) property, and to show that many existing algorithms indeed satisfy DRVU. Furthermore, to achieve strong guarantees for all performance measures without any prior knowledge, we also need to deploy a two-layer structure, with a meta-algorithm learning over and combining decisions of a group of baselearners, each of which satisfies the DRVU property but uses a different step size. Although such a framework has been used in many prior works in online learning (see for example the latest advances (Chen et al., 2021; Zhao et al., 2021) and references therein), several new ingredients are required to achieve our results. First, when updating the meta-algorithm, a correction term related to the stability of each base-algorithm is injected into the loss for the corresponding base-algorithm, which plays a key role in the analysis. More specifically, we show (in Lemma 10) an explicit bound for the stability of the meta-algorithm s decisions, whose proof requires a careful analysis using the correction terms above and the unique game structure. Second, we also introduce a set of additional dummy base-algorithms that always play some fixed action. This plays a key role in controlling the dynamics of the base-learners outputs and turns out to be critical when bounding the duality gap. Related Work. Two-player zero-sum game is one of the most fundamental problems in game theory, whose studies date back to the seminal work of von Neumann (1928). Freund & Schapire (1999) discovered the profound connections between zero-sum games and no-regret online learning, and since then there have been extensive studies on designing no-regret algorithms to solve games in the stationary set- 1This is implicitly implied by the results of Hsieh et al. (2021), as our Lemma 17 shows that in the stationary case dynamic NEregret is bounded by the individual regret. No-Regret Learning in Time-Varying Zero-Sum Games ting (Rakhlin & Sridharan, 2013; Daskalakis et al., 2015; Syrgkanis et al., 2015; Chen & Peng, 2020; Wei et al., 2021; Daskalakis et al., 2021). We refer the reader to (Daskalakis et al., 2021) for a more thorough discussion on the literature. 
Several recent works start considering the problem of learning over a sequence of non-stationary payoffs under different structures, including zero-sum matrix games (Mai et al., 2018; Cardoso et al., 2019; Fiez et al., 2021), convexconcave games (Roy et al., 2019) and strongly monotone games (Duvocelle et al., 2021). For zero-sum games, Fiez et al. (2021); Mai et al. (2018) focus on the periodic case and proves divergence results for a class of learning algorithms; (Cardoso et al., 2019) is the closest to our work, but as mentioned, we argue that their proposed measure (NEregret) is not always appropriate (see Section 3.1). Learning in time-varying games is also related to bandits with knapsack (Badanidiyuru et al., 2018; Immorlica et al., 2019). Organization. We formulate the problem set in Section 2, then present the performance measures in Section 3 and our algorithm in Section 4. Next, we provide theoretical guarantees in Section 5. We finally report the empirical results in Section 6 and conclude the paper in Section 7. 2. Problem Setup and Notations We consider the following problem of two players (called xplayer and y-player) repeatedly playing a zero-sum game for T rounds, with m fixed actions for x-player and n fixed actions for y-player. At each round t [T] {1, . . . , T}, the environment first chooses a payoff matrix At [ 1, 1]m n, whose (i, j) entry denotes the loss/reward for x-player/yplayer when they play action i and action j respectively. Without knowing At, x-player (y-player) decides her own mixed strategy (that is, a distribution over actions) xt m (yt n), where k denotes the probability simplex k = {u Rk 0 | Pk i=1 ui = 1}. At the end of this round, x-player suffers expected loss x t Atyt and observes the loss vector Atyt, while y-player receives the expected reward x t Atyt and observes the reward vector x t At. Note that neither player observes the matrix At itself. When At is fixed for all t, this exactly recovers the standard stationary setting considered in for example (Syrgkanis et al., 2015). Having a time-varying At allows us to capture various possible sources of non-stationarity. In fact, At can even be decided by an adaptive adversary who makes the decision knowing the players algorithm and their decisions in earlier rounds. Our setting is almost the same as (Cardoso et al., 2019), except that the feedback they consider is either the entire matrix At (stronger than ours) or just one entry of At sampled according to (xt, yt) (weaker than ours). For each game matrix At, define the set of minimax strategies for x-player as X t = argminx m maxy n x Aty and similarly the set of maximin strategies for y-player as Y t = argmaxy n minx m x Aty. It is well-known that any pair (x t , y t ) X t Y t forms a Nash equilibrium of At with the following saddle-point property: x t Aty x t Aty t x Aty t holds for any x m and y n. Throughout the paper, (x t , y t ) denotes an arbitrary Nash equilibrium of At. Notations. For a real-valued matrix A Rm n, its infinity norm is defined as A maxi,j |Aij|. We use 1N and 0N to denote the all-one and all-zero vectors of length N. For conciseness, we often hide polynomial dependence on the size of the game (that is, m and n) in the O( )-notation. The e O( )-notation further omits logarithmic dependence on T. We sometimes write minx m (miny n) simply as minx (miny) when there is no confusion. 3. How to Measure the Performance? 
With the learning protocol specified, the next pressing question is to determine what the goal is when designing algorithms for the two players. When At is fixed, most studies consider minimizing individual regret for each player and some form of convergence to a Nash equilibrium of the fixed game as the two primary goals. While minimizing individual regret is still naturally defined when At is changing over time, it is less clear what other desirable game-theoretic guarantees are in this case. In Section 3.1, we formally discuss three performance measures that we think are reasonable for this problem. Then in Section 3.2, we further discuss how to measure the non-stationarity of the sequence {At}1:T that will play a role in how well the players can do under some of the performance measures. 3.1. Performance Measures 1 Individual Regret. The first measure we consider is the standard individual regret. For x-player, this is defined as t=1 x t Atyt min x m t=1 x Atyt, (1) that is, the difference between her total loss and that of the best fixed strategy (assuming the same behavior from the opponent). Similarly, the regret for y-player is defined as Regy T maxy n PT t=1 x t Aty PT t=1 x t Atyt. Achieving sublinear (in T) individual regret implies that on average each player performs almost as well as their best fixed strategy, and this is arguably the most standard and basic goal for online learning problems. 2 Duality Gap. For a game matrix At, the duality gap of a pair of strategy (xt, yt) is defined as maxy n x t Aty minx m x Atyt. It is always nonnegative since maxy n x t Aty x t Aty t x t Aty t x t Atyt minx m x Atyt, and it is zero if and only No-Regret Learning in Time-Varying Zero-Sum Games if (xt, yt) is a Nash equilibrium of At. Thus, the duality gap measures how close (xt, yt) is to the equilibria in some sense. We thus naturally use the cumulative duality gap: max y n x t Aty min x m x Atyt , (2) as another performance measure. When At is fixed, this measure is considered in (Wei et al., 2021) for example. 3 Dynamic Nash Equilibrium (NE)-Regret. Before introducing this last measure, we first review what Cardoso et al. (2019) proposed as the goal for this problem, that is, ensuring small Nash Equilibrium (NE)-regret, defined as t=1 x t Atyt min x m max y n In words, this is the difference between the cumulative loss of x-player (or equivalently the cumulative reward of yplayer) and the minimax value of the cumulative payoff matrix (PT t=1 At). While this might appear to be a reasonable generalization of individual regret for a central controller who decides xt and yt jointly, we argue below that this measure is in fact often inappropriate for two reasons. The first reason is in fact already hinted in (Cardoso et al., 2019): they proved that no algorithm can always ensure sublinear NE-regret and simultaneously sublinear individual regret for both players. Given that minimizing individual regret selfishly is a natural impulse and the standard goal for each player, NE-regret can only make sense when both players are controlled by a centralized algorithm. The second reason is perhaps more profound. 
Consider the following two-phase example: when $t \le T/2$, $A_t = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$; when $t > T/2$, $A_t = \begin{pmatrix} 1 & -1 \\ 1 & -1 \end{pmatrix}$.² It is straightforward to verify that: when $t \le T/2$, the equilibrium for $A_t$ is the uniform distribution for both players, leading to game value 0; when $t > T/2$, the equilibrium is such that the y-player always picks the first column, leading to game value 1; and the equilibrium for the cumulative game matrix $\sum_{t=1}^T A_t = \begin{pmatrix} T & -T \\ 0 & 0 \end{pmatrix}$ is the x-player picking the second row while the y-player picks the first column, leading to game value 0. To sum up, even if both players play perfectly in each round using the equilibrium, their NE-regret is still $|T/2 - 0| = T/2$, which is a vacuous bound linear in $T$!

²The same example is in fact also used by Cardoso et al. (2019) to prove the incompatibility of individual regret and NE-regret.

Motivated by the observations above, we propose a variant of NE-regret as the third performance measure, called dynamic NE-regret:³
$$\text{Dyn-NE-Reg}_T \triangleq \left| \sum_{t=1}^T x_t^\top A_t y_t - \sum_{t=1}^T \min_{x \in \Delta_m} \max_{y \in \Delta_n} x^\top A_t y \right|.$$
Compared to NE-regret, here we move the minimax operation inside the summation, making it the cumulative difference between the x-player's loss and the minimax game value in each round. In other words, similarly to duality gap, dynamic NE-regret provides yet another way to measure, in each round, how close $(x_t, y_t)$ is to the equilibria of $A_t$ from the game value perspective. The connection between NE-regret and dynamic NE-regret is on the surface analogous to that between standard regret and dynamic regret (Zinkevich, 2003) (see Appendix B.1 for definitions and more related discussions). However, while dynamic regret is always no less than standard regret, dynamic NE-regret could be smaller than NE-regret: simply consider our earlier two-phase example, where the perfect players (who always play an equilibrium) clearly have 0 dynamic NE-regret, but their NE-regret is $T/2$ as discussed. This example also shows that dynamic NE-regret is more reasonable compared to NE-regret. Moreover, as will become clear soon, dynamic NE-regret is compatible with individual regret (and also duality gap), in the sense that there are algorithms that provably perform well under all these measures.

We conclude this section with the following two remarks.

Remark 1 (Comparisons of the three measures). Both individual regret and dynamic NE-regret are bounded by duality gap (see proofs in Appendix B.2), but the latter could be much larger. On the other hand, individual regret and dynamic NE-regret are generally incomparable.

Remark 2 (Other possibilities). The three measures we consider are by no means the only possibilities. Another reasonable one is the tracking error $\sum_{t=1}^T (\|x_t - x_t^\star\|_1 + \|y_t - y_t^\star\|_1)$ that directly measures the distance between $(x_t, y_t)$ and the equilibrium $(x_t^\star, y_t^\star)$ (assuming a unique equilibrium for simplicity). This is considered in (Roy et al., 2019; Balasubramanian & Ghadimi, 2021) (for different problems). However, tracking error bounds are in fact not well studied even when $A_t$ is fixed: the best known results still depend on some problem-dependent constant that can be arbitrarily large (Daskalakis & Panageas, 2019; Wei et al., 2021). Deriving tracking error bounds in our setting is thus beyond the scope of this paper. Note that in many optimization studies, one often only cares about finding a point that is close to the optimal solution in terms of their function value instead of their absolute distance.
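Before moving on, the two-phase example and the definition above can be checked numerically. The following self-contained sketch (Python with NumPy/SciPy; the helper names are ours, not part of the paper) plays the stated per-round equilibria and, up to numerical tolerance, reports NE-regret close to $T/2$ and dynamic NE-regret close to $0$.

```python
import numpy as np
from scipy.optimize import linprog

def game_value(A):
    """Minimax value of min_x max_y x^T A y via a standard LP."""
    m, n = A.shape
    c = np.r_[np.zeros(m), 1.0]                      # minimize v
    A_ub = np.c_[A.T, -np.ones(n)]                   # A[:, j]^T x - v <= 0 for every column j
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=np.r_[np.ones(m), 0.0].reshape(1, -1), b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)], method="highs")
    return res.fun

T = 1000
A1 = np.array([[1., -1.], [-1., 1.]])   # rounds t <= T/2: value 0, uniform equilibria
A2 = np.array([[1., -1.], [1., -1.]])   # rounds t >  T/2: value 1, y-player plays column 1
As = [A1] * (T // 2) + [A2] * (T // 2)

# "Perfect" play: a Nash equilibrium of A_t in every round.
plays = [(np.array([.5, .5]), np.array([.5, .5]))] * (T // 2) \
      + [(np.array([.5, .5]), np.array([1., 0.]))] * (T // 2)

payoff = sum(x @ A @ y for A, (x, y) in zip(As, plays))
ne_regret = abs(payoff - game_value(sum(As)))               # minimax value of the summed matrix
dyn_ne_regret = abs(payoff - sum(game_value(A) for A in As))
print(ne_regret, dyn_ne_regret)   # roughly T/2 = 500 and roughly 0
```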
Our dynamic NE-regret and duality gap are both in this same sprite by looking at the game value instead of the actual distance as in tracking error. 3In fact, a preprint by Roy et al. (2019) also considers a similar measure for general convex-concave problem, but we believe that their results are incorrect. Specifically, they claim (in their Theorem 4.3) that an e O( T) bound is always achievable for dynamic NE-regret, but this is clearly impossible because when At always has identical columns (so y-player does not play any role), dynamic NE-regret becomes the dynamic regret (Zinkevich, 2003) for x-player, which is well-known to be Ω(T) in the worst case. No-Regret Learning in Time-Varying Zero-Sum Games 3.2. Non-stationarity Measures For duality gap and dynamic NE-regret, it is not difficult to see that if At changes drastically over time, then no meaningful guarantees are possible. This is similar to dynamic regret in standard online learning problems, where guarantees are always expressed in terms of some non-stationarity measure of the environment and are meaningful only when the non-stationarity is reasonably small. In our setting, we consider the following three different ways to measure nonstationarity of the sequence {At}T t=1. Variation of Nash Equilibria. Recall the notation X t Y t , the set of Nash equilibria for matrix At. Define the variation of Nash equilibria as: PT min t,(x t ,y t ) X t Y t x t x t 1 1 + y t y t 1 1 , which quantifies the drift of the Nash equilibria of the game matrices in ℓ1-norm. Variation/Variance of Game Matrices. The path-length variation and the variance of {At}T t=1 are respectively defined as t=2 At At 1 2 , WT where A = 1 T PT t=1 At is the averaged game matrix. Clearly, PT , VT , and WT are all Θ(T) in the worst case, and 0 when At is fixed over time. For dynamic regret and duality gap, the natural goal is to enjoy sublinear bounds whenever (some of) these non-stationarity measures are sublinear (which we indeed achieve). We conclude by pointing out some connections between these non-stationarity measures. First, VT 8WT holds but the former could be much smaller. Second, PT is generally not comparable with VT and WT , and there are examples where PT = 0 and VT = WT = Θ(T), or PT = Θ(T) and VT = WT = O(1). We defer all details to Appendix B. 4. Proposed Algorithm In this section, we present our proposed algorithm for timevarying games, which provably achieves favorable guarantees under all three performance measures. To illustrate the ideas behind our algorithm design, we first review how (Syrgkanis et al., 2015) achieves fast convergence results for a fixed game, followed by a detailed discussion on how to generalize their idea and overcome the difficulties brought by time-varying games. For conciseness, throughout the section we focus on the x-player; how the y-player should behave is completely symmetric. For a fixed game At = A, Syrgkanis et al. (2015) proposed that each player should deploy an online learning algorithm that satisfies a specific property called Regret bounded by Variation in Utilities (RVU). More specifically, an online learning algorithm proposes xt m at the beginning of round t, and then receives a loss vector gt Rm and suffers loss xt, gt . 
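A brief aside before continuing with the RVU discussion: the two matrix-based non-stationarity measures just introduced, $V_T$ and $W_T$, can be computed directly from the sequence $\{A_t\}_{t=1}^T$. Below is a small sketch of our own, using the squared infinity-norm path length for $V_T$ and the infinity-norm variance for $W_T$ as reconstructed from Section 3.2 and Appendix C; it also checks the relation $V_T \le 8 W_T$ mentioned above.

```python
import numpy as np

def nonstationarity(As):
    """V_T (squared path length) and W_T (variance) of a payoff-matrix sequence,
    both measured in the entrywise infinity norm."""
    inf = lambda M: np.max(np.abs(M))
    V_T = sum(inf(As[t] - As[t - 1]) ** 2 for t in range(1, len(As)))
    A_bar = sum(As) / len(As)
    W_T = sum(inf(A - A_bar) for A in As)
    return V_T, W_T

rng = np.random.default_rng(2)
base = rng.uniform(-1, 1, size=(2, 2))
# A slowly drifting sequence: A_t = clip(base + (t/T) * drift, -1, 1).
T, drift = 200, rng.uniform(-1, 1, size=(2, 2))
As = [np.clip(base + (t / T) * drift, -1, 1) for t in range(T)]
V_T, W_T = nonstationarity(As)
print(V_T, W_T, V_T <= 8 * W_T)   # the relation V_T <= 8 * W_T always holds
```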
Its regret against a comparator u m after T rounds is naturally PT t=1 xt u, gt , and the RVU property states that this should be bounded by α + β PT t=2 gt gt 1 2 γ PT t=2 xt xt 1 2 1 for some parameters α, β, γ > 0.4 To see why RVU property is useful, consider x-player deploying such an algorithm with gt set to Atyt = Ayt. Then her regret is further bounded as α + β PT t=2 Ayt Ayt 1 2 γ PT t=2 xt xt 1 2 1 α+β PT t=2 yt yt 1 2 1 γ PT t=2 xt xt 1 2 1. Therefore, as long as y-player also deploys the same algorithm, by symmetry, the sum of their regret is at most α + (β γ)(PT t=2 xt xt 1 2 1 + PT t=2 yt yt 1 2 1), which can be simply bounded by (a constant) α as long as β γ. Many useful guarantees can then be obtained as a corollary of the fact that the sum of regret is small. In our setting where At is changing over time, our first observation is that instead of the sum of the two players regret, what we need to control is the sum of their dynamic regret (Zinkevich, 2003), which plays an important role when deriving guarantees for all the three measures (including individual regret). Specifically, for an online learning algorithm producing xt and receiving gt, its dynamic regret against a sequence of comparators u1, . . . , u T m is defined as PT t=1 xt ut, gt . Generalizing RVU, we naturally introduce the following Dynamic Regret bounded by Variation in Utilities (DRVU) property. Definition 3 (DRVU Property). Denote by A(η) an online learning algorithm with a parameter η > 0. We say that it satisfies the Dynamic Regret bounded by Variation in Utilities property (abbreviated as DRVU(η)) with parameters α, β, γ > 0, if its dynamic regret PT t=1 xt ut, gt on any loss sequence g1, . . . , g T with respect to any comparator sequence u1, . . . , u T is bounded by η (1+P u T )+ηβ t=1 gt gt 1 2 γ t=2 xt xt 1 2 1, where P u T PT t=2 ut ut 1 1 is the path-length of the comparator sequence. Compared to RVU, DRVU naturally replaces the first constant term in the regret bound with a term depending on the path-length of the comparator sequence. We also add another step size parameter η (whose role will become clear 4Without loss of generality, we here focus on ( 1, ) norm, and it is straightforward to generalize the argument to general primal-dual norm as in (Syrgkanis et al., 2015). No-Regret Learning in Time-Varying Zero-Sum Games soon). Recent studies in dynamic regret (Zhao et al., 2020; 2021) show that variants of optimistic Online Mirror Descent (such as Optimistic Gradient Descent and Optimistic Hedge) indeed satisfy DRVU with α, β, γ = eΘ(1); see Appendix D for formal statements and proofs. Now, if x-player deploys an algorithm satisfying DRVU and feeds it with loss vector gt = Atyt (and similarly y-player does the same), we can indeed prove a desired guarantee for each of the three performance measures. However, the tuning of η will require the knowledge of the unknown parameters PT , VT , WT and, perhaps more importantly, be different for each different measures. To obtain an adaptive algorithm that performs well under all three measures without any prior knowledge, we further propose a twolayer structure with a meta-algorithm learning over and combining decisions of a set of base-learners, each of which satisfies DRVU(η) but with a different step size η. While this idea of learning over learning algorithms is not new in online learning, we will discuss below what extra difficulties show up in our case and how we address them. 4.1. Base-learners Define N = 1 2 log2 T + 1. 
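With $N$ as just defined, the step-size pool and the extra dummy base-learners of Section 4.1 can be written down directly. The snippet below is a sketch under the simplifying assumption that the grid is $\eta_i = 2^{i-1}/\sqrt{T}$, the choice used in the experiments of Section 6; the exact scaling in the analysis involves the DRVU parameters and is given in the paper's proof.

```python
import math
import numpy as np

def build_base_learner_pool(T, m):
    """Step-size grid plus dummy learners, in the spirit of Section 4.1.
    Assumption: eta_i = 2^{i-1} / sqrt(T); the paper's exact constant differs."""
    N = math.floor(0.5 * math.log2(T)) + 1
    etas = [2 ** (i - 1) / math.sqrt(T) for i in range(1, N + 1)]
    # Dummy base-learners: the (N + j)-th learner always plays the basis vector e_j.
    dummies = [np.eye(m)[j] for j in range(m)]
    return etas, dummies

etas, dummies = build_base_learner_pool(T=2 * 10 ** 6, m=2)
print(len(etas), etas[:3], dummies)   # N = 11 step sizes for T = 2e6, as in Section 6
```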
Our algorithm maintains N + m base-learners: i [N], the i-th base-learner is any algorithm that satisfies DRVU(ηx i ), where ηx i = 2i 1 L = max 4, p (β and γ are the parameters from DRVU and c = e O(1) is a constant whose exact value can be found in the proof); the last m base-learners are dummy learners, with the (j + N)- th one always outputting the basis vector ej m (that is, always choosing the j-th action). We note that the dummy base-learners are important in controlling the duality gap (but not the other two measures). We let Sx S1,x S2,x with S1,x = [N] and S2,x = {N + 1, . . . , N + m} denote the set of indices of base-learners. At round t, each base-learner i submits her decision xt,i m to the meta-algorithm, who decides the final decision xt. Upon receiving the feedback Atyt, the meta-algorithm sends the same (as the loss vector gt) to each base-learner i S1,x (no updates needed for the dummy base-learners). 4.2. Meta-algorithm With all the decisions {xt,i}i Sx collected from the baselearners, the meta-algorithm outputs the final decision xt = P i Sx pt,ixt,i,5 where pt |Sx| is a distribution over the base-learners updated according to a version of Optimistic Online Gradient Descent (OOGD) (Rakhlin & Sridharan, 5Note the slight abuse of notations here: while pt,i represents the i-th entry of vector pt, xt,i is not the i-th entry of xt. Algorithm 1 Algorithm for the x-player Input: a base-algorithm A(η) satisfying DRVU(η). Initialize: a set of base-learners Sx as described in Section 4.1, bp1 = 1 |Sx|1|Sx|, learning rate εx 1 = 1 L (c.f. Eq. (4)). for t = 1, . . . , T do Receive xt,i m from each base-learner i Sx. Compute mx t based on Eq. (7) and pt based on Eq. (5). Play the final decision xt = P i Sx pt,ixt,i. Suffer loss x t Atyt and observe the loss vector Atyt. Compute ℓx t based on Eq. (6) and bpt+1 based on Eq. (5). Update εx t+1 = 1/ L2+Pt s=2 Atyt At 1yt 1 2 . Send Atyt as the feedback to each base-learner. end 2013; Syrgkanis et al., 2015): pt = argmin p |Sx| εx t p, mx t + p bpt 2 2 , bpt+1 = argmin p |Sx| εx t p, ℓx t + p bpt 2 2 . (5) Here, εx t > 0 is a time-varying learning rate, {bpt}t=1,2,... is an auxiliary sequence (starting with bp1 as the uniform distribution) updated via projected gradient descent using some loss vector sequence ℓx 1, ℓx 2, . . . R|Sx|, and pt is updated via projected gradient descent from the distribution bpt and using a loss predictor mx t R|Sx|. It remains to specify what ℓx t and mx t are (the tuning of the learning rate will be specified in the final algorithm). Since base-learner i predicts xt,i and receives loss vector Atyt, it is natural to set its loss ℓx t,i as x t,i Atyt from the meta-algorithm s perspective. In light of standard OOGD, mx t should then be set to x t,i At 1yt 1, meaning that the last loss vector At 1yt 1 is used to predict the current one (that is unknown yet when computing pt). However, this setup leads to the following issue. When applying DRVU(ηx i ) to this base-learner, we see that a negative term related to xt,i xt 1,i 2 1 and a positive term related to yt yt 1 2 1 arise (the latter is from Atyt At 1yt 1 2 2 At At 1 2 + 2 yt yt 1 2 1, with the first term only related to the non-stationarity of game matrices). By symmetry, y-player contributes a positive term xt xt 1 2 1, which now cannot be canceled by xt,i xt 1,i 2 1, unlike the case with only one learner for each player discussed earlier. To address this issue, we propose to add a stability correction term to both ℓx t and mx t . 
Concretely, they are defined as ℓx 1,i = x 1,i A1y1 and mx 1,i = 0, i, and for all t 2: ℓx t,i = x t,i Atyt + λ xt,i xt 1,i 2 1, (6) mx t,i = x t,i At 1yt 1 + λ xt,i xt 1,i 2 1, (7) where λ = γL 2 (γ is the parameter from DRVU). From a technical perspective, this introduces to the regret a nega- No-Regret Learning in Time-Varying Zero-Sum Games tive term P i Sx pt,i xt,i xt 1,i 2 1, and a positive term xt,i xt 1,i 2 1 which can be canceled by the aforementioned negative term from DRVU(ηx i ). To see why the extra negative term is useful, notice that the troublesome term xt xt 1 2 1 from DRVU(ηx i ) can be bounded as xt xt 1 2 1 = i Sx pt,ixt,i X i Sx pt 1,ixt 1,i i Sx pt,i xt,i xt 1,i 2 1 + 2 pt pt 1 2 1, where the first term can exactly be canceled by the extra negative term introduced by the correction term, and the second term can in fact also be canceled in a standard way since the meta-algorithm itself can be shown to satisfy RVU. This explains the design of our correction terms from a technical level. Intuitively, injecting this correction term guides the meta-algorithm to bias toward the more stable base-learners, hence also stabilizing the final decision xt. We note that a similar technique was used in analyzing gradient-variation dynamic regret for online convex optimization (Zhao et al., 2021). Our approach is different from theirs in the sense that there is only one player in their setting and the correction term is used to cancel the additional gradient variation introduced by the variation of her own decision. In contrast, in our setting the correction term is used to cancel the opponent s gradient variation. To summarize, our final algorithm (for the x-player) is presented in Algorithm 1. We also include the symmetric version for the y-player in Algorithm 2 (Appendix A) for completeness. We emphasize again that this is a parameterfree algorithm that does not require any prior knowledge of the environment. 5. Theoretical Guarantees and Analysis In this section, we first provide the guarantees of our algorithm under each of the three performance measures, and then highlight several key ideas in the analysis, with the full proofs deferred to Appendix E. Recall that our guarantees are all expressed in terms of the non-stationarity measures PT , VT , and WT , defined in Section 3.2. Also, to avoid showing the cumbersome dependence on the DRVU parameters (α, β, γ) in all our bounds, we will simply assume that they are all eΘ(1), which, as mentioned earlier and shown in Appendix D, is indeed the case for standard algorithms. 5.1. Performance Guarantees We state our results for each performance measure separately below, but emphasize again that they hold simultaneously. First, we show the individual regret bound. Theorem 4 (Individual Regret). When the x-player uses Al- gorithm 1, irrespective of y-player s strategies, we have t=1 x t Atyt min x t=1 x Atyt = e O( Furthermore, if x-player follows Algorithm 1 and y-player follows Algorithm 2, then individual regret satisfies: max {Regx T , Regy T } = e O p 1 + VT + min{PT , WT } . The first statement of Theorem 4 provides a robustness guarantee for our algorithm no matter how non-stationary the game matrices are and no matter how the opponent behaves, following our algorithm always ensures e O( T) individual regret, the standard worst-case regret bound. On the other hand, when both players follow our algorithm, their individual regret could be even smaller depending on the non-stationarity. 
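To make the meta-algorithm concrete, the sketch below assembles one round of the x-player's update from Eq. (5)-(7): the correction term $\lambda \|x_{t,i} - x_{t-1,i}\|_1^2$ is added to both the loss and the optimism, and the two projected-gradient steps are carried out with a Euclidean projection onto the simplex. This is our own illustrative code, with the learning rate $\varepsilon_t$ and $\lambda = \gamma L / 2$ treated as given inputs and the proximal constant of Eq. (5) absorbed into $\varepsilon_t$.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sorting-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v + (1 - css[rho]) / (rho + 1), 0)

def meta_step(p_hat, eps, x_prev, x_now, Ay_prev, Ay_now, lam):
    """One round of the x-player's meta-update in the spirit of Eq. (5)-(7).
    x_prev / x_now: base-learner outputs at rounds t-1 and t; Ay_*: loss vectors A y.
    In the actual protocol Ay_now is only revealed after p_t is played."""
    corr = np.array([np.sum(np.abs(xc - xp)) ** 2 for xc, xp in zip(x_now, x_prev)])
    m_t = np.array([xc @ Ay_prev for xc in x_now]) + lam * corr       # optimism, Eq. (7)
    p_t = project_simplex(p_hat - eps * m_t)                          # play p_t, Eq. (5)
    x_t = sum(p * xc for p, xc in zip(p_t, x_now))                    # final decision x_t
    ell_t = np.array([xc @ Ay_now for xc in x_now]) + lam * corr      # corrected loss, Eq. (6)
    p_hat_next = project_simplex(p_hat - eps * ell_t)                 # auxiliary sequence
    return p_t, x_t, p_hat_next
```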
In particular, as long as VT + min{PT , WT } = o(T) (that is, not the worst case scenario), our bound becomes o( T). Also note that PT and WT are generally incomparable (see Appendix B), but our bound achieves the minimum of them, thus achieving the best of both worlds. When the game matrix is fixed, we have PT = VT = WT = 0, immediately leading to the following corollary. Corollary 5. When x-player follows Algorithm 1 and yplayer follows Algorithm 2, if At = A for all t [T], then max {Regx T , Regy T } = e O(1). The best known individual regret bound for learning in a fixed two-player zero-sum game is O(1) (Hsieh et al., 2021). Our result matches theirs up to logarithmic factors. The next theorem presents the dynamic NE-regret bound. Theorem 6 (Dynamic NE-Regret). When x-player follows Algorithm 1 and y-player follows Algorithm 2, we have the following dynamic NE-regret bound: Dyn NE-Reg T = t=1 x t Atyt t=1 min x m max y n x Aty = e O min{ p (1 + VT )(1 + PT ) + PT , 1 + WT } . Similarly, our dynamic NE-regret bound is o(T) as long as PT or WT is o(T). When the game matrix is fixed, we again obtain the following direct corollary by noticing PT = VT = WT = 0 in this case. Corollary 7. When x-player follows Algorithm 1 and yplayer follows Algorithm 2, if At = A for all t [T], then Dyn NE-Reg T = e O(1). In fact, when the game is fixed, dynamic NE-regret degenerates to NE-regret of Cardoso et al. (2019) No-Regret Learning in Time-Varying Zero-Sum Games as PT t=1 minx maxy x Ay = minx maxy PT t=1 x Ay. Their algorithm would achieve e O( T) (dynamic) NEregret in this case. A better O(1) result is implicitly implied by the aforementioned work of Hsieh et al. (2021), as we show (in Lemma 17) that (dynamic) NE-regret is bounded by the individual regret in this stationary case. Our result again matches theirs up to logarithmic factors. The last theorem provides an upper bound for duality gap. Theorem 8 (Duality Gap). When x-player follows Algorithm 1 and y-player follows Algorithm 2, we have Dual-Gap T = t=1 max y n x t Aty t=1 min x m x Atyt = e O min{T 3 4 1 + QT 1 4 , T 1 2 (1 + Q 3 2 T + PT QT ) 1 2 } , where QT VT + min{PT , WT }. Once again the bound is o(T) whenever QT = o(T), and it implies the following corollary. Corollary 9. When x-player follows Algorithm 1 and yplayer follows Algorithm 2, if At = A for all t [T], then Dual-Gap T = e O( Notably, the best known result of duality gap for a fixed game is O( T) (Wei et al., 2021), and our result again matches theirs up to logarithmic factors. 5.2. Key Ideas for Analysis We now highlight some key components and novelty of our analysis. As mentioned in Section 4, to bound all the three metrics, the key is to bound the sum of the two players dynamic regret, which further requires controlling the stability of the strategies between consecutive rounds. The following key lemma shows how such stability is controlled by the non-stationarity measures of {At}T t=1. Lemma 10. When x-player follows Algorithm 1 and yplayer follows Algorithm 2, we have both PT t=2 xt xt 1 2 1 and PT t=2 yt yt 1 2 1 bounded by (1 + VT )(1 + PT ) + PT , 1 + WT . This lemma implies an e O(1) stability bound when the game is fixed, which is first proven in (Hsieh et al., 2021) where both players run the Optimistic Hedge algorithm with an adaptive learning rate. Our result generalizes theirs but requires a novel analysis due to both the time-varying matrices and the two-layer structure of our algorithm. 
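The stability quantity in Lemma 10 can also be probed empirically. The sketch below is deliberately simplified and is not the two-layer Algorithm 1/2: each player runs a single optimistic Hedge learner with a fixed step size, which is enough to see how $\sum_t \|x_t - x_{t-1}\|_1^2$ typically stays much smaller on a fixed game than on a rapidly varying one.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def play(As, eta=0.1):
    """Both players run optimistic Hedge (optimism = previous loss vector);
    returns the stability sums of Lemma 10 for this simplified single-learner setup."""
    m, n = As[0].shape
    Lx, Ly = np.zeros(m), np.zeros(n)          # cumulative loss / reward vectors
    gx_prev, gy_prev = np.zeros(m), np.zeros(n)
    x_prev = y_prev = None
    stab_x = stab_y = 0.0
    for A in As:
        x = softmax(-eta * (Lx + gx_prev))     # x-player minimizes its loss
        y = softmax(eta * (Ly + gy_prev))      # y-player maximizes its reward
        if x_prev is not None:
            stab_x += np.sum(np.abs(x - x_prev)) ** 2
            stab_y += np.sum(np.abs(y - y_prev)) ** 2
        gx, gy = A @ y, A.T @ x                # observed feedback vectors
        Lx, Ly, gx_prev, gy_prev = Lx + gx, Ly + gy, gx, gy
        x_prev, y_prev = x, y
    return stab_x, stab_y

rng = np.random.default_rng(3)
A = 0.5 * rng.uniform(-1, 1, size=(2, 2))
print(play([A] * 2000))                                               # fixed game
print(play([A + 0.5 * (-1) ** t * np.eye(2) for t in range(2000)]))   # varying game, typically larger
```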
As another note, this lemma also highlights another difference of our method compared to (Zhao et al., 2021) as mentioned in Section 4 our algorithm shares some similarity with theirs, but no explicit stability bound is proven or required in their problem, while stability is crucial for our whole analysis. We next present the proof sketch for Lemma 10. More details can be found in Appendix F. Proof Sketch. We show in Lemma 15 that the sum of the two players dynamic regret (against a sequence u1, . . . , u T m for x-player and a sequence v1, . . . , v T n for yplayer) can be bounded by x t Atvt u t Atyt = e O η + η(1 + VT ) t=1 ( xt xt 1 2 1 + yt yt 1 2 1) for any step size 0 < η e O(1). Here, PT P u T + P v T , P u T PT t=2 ut ut 1 1 and P v T PT t=2 vt vt 1 1 are the path-length of comparators. Then, Lemma 10 can be proven by taking different choices of η and the comparator sequence. For example, consider picking (ut, vt) = (x t , y t ). Since the saddle point property ensures x t Aty t x t Atyt 0, rearranging and picking the optimal η thus gives the first bound e O( p (1 + VT )(1 + PT ) + PT ) on the stability. To prove the second bound, pick (ut, vt) = ( u , v ) where ( u , v ) is a Nash equilibrium of the averaged game matrix. Then, we have P u T = P v T = 0 and PT t=1 x t Atvt PT t=1 u t Atyt O(WT ). Rearranging, picking the optimal η, and using VT O(WT ) then proves the e O(1 + WT ) bound. We finally briefly mention two more new ideas when bounding the duality gap. First, we apply a reduction from general dynamic regret that competes with any comparator sequence to its worst-case variant, which in some place helps bound the duality gap by the aforementioned stability. Second, we show how the extra set of dummy base-learners enables the meta-algorithm to have a direct control on the duality gap. We refer the reader to Appendix E.3 for more details. 6. Experiment In this section, we provide empirical studies on the performance of our proposed algorithm in time-varying games. We construct an environment such that PT = Θ( T), WT = Θ(T 3 4 ), and VT = Θ( T). Under this environment, our theoretical results indicate that max{Regx T , Regy T } e O(T 1 4 ), NE-Reg T e O( T) and Dual-Gap T e O(T 7 8 ). Our empirical results validate the effectiveness of our algorithm in this environment, and in fact its performance is even better than the theoretical upper bounds, which also encourage us to investigate better guarantees in the future. The environment setup is as follows. We set the size of game matrix to be m n with m = 2 and n = 2. The total No-Regret Learning in Time-Varying Zero-Sum Games 0 0.5 1 1.5 2 Iteration 106 Individual Regret Ours Best tuning (individual regret) Best tuning (dynamic NE-regret) Best tuning (duality gap) 0 0.5 1 1.5 2 Iteration 106 Dynamic NE-Regret Ours Best tuning (individual regret) Best tuning (dynamic NE-regret) Best tuning (duality gap) 0 0.5 1 1.5 2 Iteration 106 Duality Gap Ours Best tuning (individual regret) Best tuning (dynamic NE-regret) Best tuning (duality gap) Figure 1. Empirical results of our algorithm (red line) compared with the base-learners with different step size choices. Best tuning (measure) denotes the curve of the base-learner with the step size choice that performs the best with respect to this measure . 
The three figures show that our algorithm s performance on all three measures is comparable to (or even better than) the base-learner with the best step size tuning, while the base-learners specifically tuned for a single measure cannot perform well on all other measures simultaneously. time horizon is set as T = 2 106. Define Set T0 = 2 T 1/2 . The scheduling of the game matrices is separated into K = 4 epochs and during each epoch k, A0 + ( 1)t E, t h (k 1)T K + 1, (k 1)T K + T0 i , 1 2 ( 1)t T 1 4 A1, t h (k 1)T K + T0 + 1, k T Specifically, during the first phase of each epoch, when t is even, At = A0 + E, in which x-player s Nash equilibrium is x t = (0, 1) and y t = (1, 0); when t is odd, At = A0 E, where x t = (0, 1) but y t = (0, 1). Hence, the variation of Nash equilibrium of the first phase is Θ(1) per consecutive rounds. Also, the variation of the game matrix is Θ(1) per consecutive rounds. During the second phase of each epoch, the Nash equilibrium of At keeps the same but the variation of the game matrix is Θ(T 1 2 ) per consecutive rounds. Thus, over the T rounds, PT = Θ(T0) = Θ( T), VT = Θ( T). Direct calculation shows WT = Θ(T 3 4 ). To show the necessity of the two-layer structure, we compare the performance of our two-layer algorithm with one single base-learner with a fixed step size chosen specifically to minimize each measure. Concretely, we choose the baselearner as optimistic Hedge with a fixed-share update, which satisfies DRVU(η) property as we prove in Appendix D.1. As mentioned in Section 4, this base-learner with a specific choice of the step size can indeed achieve a favorable bound for a specific measure. In our environment setup, to achieve the best individual regret bound, the step size needs to be chosen as Θ(1/ 1 + PT + VT ) = Θ(T 1 4 ), while to achieve the best dynamic NE-regret bound, the step size should be chosen as Θ( p PT /(1 + PT + VT )) = Θ(1), which means that the base-learner cannot guarantee the desired bounds for all the three measures simultaneously. We implement Algorithm 1 for x-player and Algorithm 2 for y-player with L = 4 and step size pool ηi = 2i 1 T for both players. The number of base-learners (i.e., the size of step size pool) is N = 1 2 log2 T + 1 = 11. We also run our base-learner with each single ηi separately and pick the baselearner with best step size for each measure respectively to see their performance in all the three measures. Figure 1 plots the results with respect to all the three measures (individual regret, dynamic NE-regret, and duality gap). We can observe that our algorithm s performance on all three measures is comparable to (or even better than) the base-learner with the best step size tuning, while the base-learners specifically tuned for a single measure cannot perform well on all other measures simultaneously, which supports our theoretical results and also validate the necessity of a two-layer structure of our proposed algorithm. 7. Discussions and Future Directions Our work is among the first few to study learning in timevarying games, and we believe that our proposed performance measures and algorithm are important first steps in this direction. Our results can also be directly extended to general convex-concave games over a bounded convex domain (details omitted). We also conduct experiments with synthetic data to show the effectiveness of our algorithm. 
One missing part in our work is the tightness of each bound even though they match the best known results for a fixed game, it is unclear whether they can be further improved in the general case. We leave this as a future direction. Another interesting direction would be to consider extending the results to time-varying multi-player general-sum games. Acknowledgements Peng Zhao and Zhi-Hua Zhou are supported by National Science Foundation of China (61921006). Haipeng Luo and Mengxiao Zhang are supported by NSF Award IIS-1943607. No-Regret Learning in Time-Varying Zero-Sum Games Badanidiyuru, A., Kleinberg, R., and Slivkins, A. Bandits with knapsacks. Journal of ACM, 65(3):13:1 13:55, 2018. Balasubramanian, K. and Ghadimi, S. Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Foundations of Computational Mathematics, pp. 1 42, 2021. Besbes, O., Gur, Y., and Zeevi, A. J. Non-stationary stochastic optimization. Operations Research, 63(5):1227 1244, 2015. Cardoso, A. R., Abernethy, J. D., Wang, H., and Xu, H. Competing against Nash equilibria in adversarially changing zero-sum games. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 921 930, 2019. Cesa-bianchi, N., Gaillard, P., Lugosi, G., and Stoltz, G. Mirror descent meets fixed share (and feels no regret). Advances in Neural Information Processing Systems, 25: 980 988, 2012. Cesa-Bianchi, N., Gaillard, P., Lugosi, G., and Stoltz, G. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems 25 (NIPS), pp. 989 997, 2012. Chen, G. and Teboulle, M. Convergence analysis of a proximal-like minimization algorithm using bregman functions. SIAM Journal on Optimization, 3(3):538 543, 1993. Chen, L., Luo, H., and Wei, C. Impossible tuning made possible: A new expert algorithm and its applications. In Proceedings of the 34th Conference on Learning Theory (COLT), pp. 1216 1259, 2021. Chen, X. and Peng, B. Hedging in games: Faster convergence of external and swap regrets. In Advances in Neural Information Processing Systems 33 (Neur IPS), pp. 18990 18999, 2020. Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J., Jin, R., and Zhu, S. Online optimization with gradual variations. In Proceedings of the 25th Conference On Learning Theory (COLT), pp. 6.1 6.20, 2012. Daskalakis, C. and Panageas, I. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In Proceedings of the 10th Innovations in Theoretical Computer Science (ITCS) Conference, 2019. Daskalakis, C., Deckelbaum, A., and Kim, A. Near-optimal no-regret algorithms for zero-sum games. Games and Economic Behavior, 92:327 348, 2015. Daskalakis, C., Fishelson, M., and Golowich, N. Nearoptimal no-regret learning in general games. In Advances in Neural Information Processing Systems 34 (Neur IPS), pp. to appear, 2021. Duvocelle, B., Mertikopoulos, P., Staudigl, M., and Vermeulen, D. Multi-agent online learning in time-varying games. Mathematics of Operations Research, to appear, 2021. Fiez, T., Sim, R., Skoulakis, S., Piliouras, G., and Ratliff, L. J. Online learning in periodic zero-sum games. In Advances in Neural Information Processing Systems 34 (Neur IPS), pp. to appear, 2021. Freund, Y. and Schapire, R. E. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79 103, 1999. Herbster, M. and Warmuth, M. K. Tracking the best expert. Machine learning, 32(2):151 178, 1998. 
Hsieh, Y.-G., Antonakopoulos, K., and Mertikopoulos, P. Adaptive learning in continuous games: Optimal regret bounds and convergence to Nash equilibrium. In Proceedings of the 34th Conference on Learning Theory (COLT), pp. 2388 2422, 2021. Immorlica, N., Sankararaman, K. A., Schapire, R. E., and Slivkins, A. Adversarial bandits with knapsacks. In Proceedings of the 60th IEEE Annual Symposium on Foundations of Computer Science (FOCS), pp. 202 219, 2019. Luo, H. and Schapire, R. E. Achieving all with no parameters: Ada Normal Hedge. In Proceedings of the 28th Annual Conference Computational Learning Theory (COLT), pp. 1286 1304, 2015. Mai, T., Mihail, M., Panageas, I., Ratcliff, W., Vazirani, V. V., and Yunker, P. Cycles in zero-sum differential games and biological diversity. In Proceedings of the 2018 ACM Conference on Economics and Computation (EC), pp. 339 350, 2018. Pogodin, R. and Lattimore, T. On first-order bounds, variance and gap-dependent bounds for adversarial bandits. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 894 904, 2019. Rakhlin, A. and Sridharan, K. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems 26 (NIPS), pp. 3066 3074, 2013. Roy, A., Chen, Y., Balasubramanian, K., and Mohapatra, P. Online and bandit algorithms for nonstationary stochastic saddle-point optimization. ar Xiv preprint ar Xiv:1912.01698, 2019. No-Regret Learning in Time-Varying Zero-Sum Games Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems 28 (NIPS), pp. 2989 2997, 2015. von Neumann, J. Zur theorie der gesellschaftsspiele. Mathematische Annalen, 100:295 320, 1928. Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. Tracking the best expert in non-stationary stochastic environments. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 3972 3980, 2016. Wei, C.-Y., Lee, C.-W., Zhang, M., and Luo, H. Linear lastiterate convergence in constrained saddle-point optimization. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021. Yang, T., Zhang, L., Jin, R., and Yi, J. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 449 457, 2016. Zhang, L., Lu, S., and Zhou, Z.-H. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31 (Neur IPS), pp. 1330 1340, 2018. Zhang, Y.-J., Zhao, P., and Zhou, Z.-H. A simple online algorithm for competing with dynamic comparators. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 390 399, 2020. Zhao, P. and Zhang, L. Improved analysis for dynamic regret of strongly convex and smooth functions. In Proceedings of the 3rd Conference on Learning for Dynamics and Control (L4DC), pp. 48 59, 2021. Zhao, P., Zhang, Y.-J., Zhang, L., and Zhou, Z.-H. Dynamic regret of convex and smooth functions. In Advances in Neural Information Processing Systems 33 (Neur IPS), pp. 12510 12520, 2020. Zhao, P., Zhang, Y.-J., Zhang, L., and Zhou, Z.-H. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. Ar Xiv preprint, ar Xiv:2112.14368, 2021. Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. 
In Proceedings of the 20th International Conference on Machine Learning (ICML), pp. 928 936, 2003. No-Regret Learning in Time-Varying Zero-Sum Games A. Algorithm for y-player For completeness, in this section, we show the algorithm run by y-player as follows. Our algorithm for y-player maintains N + n base-learners: for i [N], the i-th base-learner is any algorithm that satisfies DRVU(ηy i ) where ηy i = 2i 1 T and L is defined in Eq. (4); the last n base-learners are dummy learners, in which the (j + N)-th one always outputting the basis vector ej n. Let Sy S1,y S2,y with S1,y = [N] and S2,y = {N + 1, . . . , N + n} denote the set of indices of base-learners. At round t, each base-learner j submits her decision yt,j n to the meta-algorithm, who decides the final decision yt. After receiving the feedback A t xt, the meta-algorithm sends this feedback to each base-learner j S1,y. The meta-algorithm of y-player performs the following update: qt = argmin q |Sy| εy t q, my t + q bqt 2 2 , bqt+1 = argmin q |Sy| εy t q, ℓy t + q bqt 2 2 , (8) where εy t is the dynamic learning rate for the y-player. The loss vector ℓy t |Sy| and loss predictor vector my t |Sy| is defined as follows: for any j Sy, ℓy t,j = y t,j A t xt + λ yt,j yt 1,j 2 1, (9) my t,j = y t,j A t 1xt 1 + λ yt,j yt 1,j 2 1. (10) The full pseudo code of the algorithm run by y-player is shown in Algorithm 2. Algorithm 2 Algorithm for the y-player Input: a base-algorithm A(η) satisfying DRVU(η). Initialize: a set of base-learners Sy as described in Appendix A, bp1 = 1 |Sy|1|Sy|, learning rate εy 1 = 1 L (c.f. Eq. (4)). for t = 1, . . . , T do Receive yt,j n from each base-learner j Sy. Compute my t based on Eq. (10) and qt based on Eq. (8). Play the final decision yt = P j Sy qt,jyt,j. Suffer loss x t Atyt and observe the loss vector A t xt. Compute ℓy t based on Eq. (9) and bqt+1 based on Eq. (8). Update εy t+1 = 1/ q L2 + Pt s=2 A t xt A t 1xt 1 2 . Send A t xt as the feedback to each base-learner. end B. Discussions on Performance Measure In this section, we include more discussions on the performance measures presented in Section 3.1. B.1. Relationship between Dynamic NE-Regret and NE-Regret Before discussing the relationship between dynamic NE-regret and NE-regret for the game setting, we first review the notion of dynamic regret and static regret for the online convex optimization (OCO) setting. Then we show that in contrast to the case in OCO that the worst-case dynamic regret is always larger than static regret, in the online game setting, dynamic NE-regret is not necessarily larger than the standard NE-regret due to the different structure of the minimax operation. Dynamic Regret for OCO. OCO can be regarded as an iterative game between the player and the environment. At each round t [T], the player makes the decision xt from a convex feasible domain X Rd and simultaneously the environment chooses the loss function ft : X 7 R, then the player suffers an instantaneous loss ft(xt) and observe the full information about the loss function. The standard regret measure is defined as the difference between the cumulative loss of the player No-Regret Learning in Time-Varying Zero-Sum Games and that of the best action in hindsight: t=1 ft(xt) min x X t=1 ft(x). (11) Note that the measure only competes with a single fixed decision over the time. A stronger measure proposed for OCO problems is called general dynamic regret (Zinkevich, 2003; Zhang et al., 2018; Zhao et al., 2020; 2021), defined as D-Reg T (u1, . . . 
, u T ) = t=1 ft(ut), (12) which benchmarks the player s performance against an arbitrary sequence of comparators u1, . . . , u T X. The measure is also studied in the prediction with expert advice setting (Cesa-Bianchi et al., 2012; Luo & Schapire, 2015; Wei et al., 2016). We emphasize that one of the key tools to achieve our results for time-varying games is to derive a favorable bound for the above general dynamic regret for each player. See Lemma 15 for the details of our derived bound. In addition, there is a variant of the above general dynamic regret called the worst-case dynamic regret, defined as D-Reg T = D-Reg T (x 1, . . . , x T ) = t=1 ft(x t ), (13) where x t argminx X ft(x) is the minimizer of the online loss function ft. The worst-case dynamic regret is extensively studied in the literature (Besbes et al., 2015; Yang et al., 2016; Zhang et al., 2020; Zhao & Zhang, 2021). It is worth noting that both standard regret in Eq. (11) and the worst-case dynamic regret in Eq. (13) are special cases of the general dynamic regret in Eq. (12). In fact, by choosing the comparators as u1 = . . . = u T argminx X PT t=1 ft(x), the general dynamic regret recovers the standard static regret; and by choosing the comparators as ut = x t argminx X ft(x) for t [T], the general dynamic regret recovers the worst-case dynamic regret. Notice that the worst-case dynamic regret in Eq. (13) is strictly larger than the static regret in Eq. (11), whereas the general dynamic regret in Eq. (12) is not necessarily larger than the static regret due to the flexibility of the comparator sequence. Dynamic NE-Regret of Online Two-Player Zero-Sum Game. In this part, we aim to show that, different from the relationships between the (worst-case) dynamic regret and static regret in OCO setting, dynamic NE-regret is not necessarily larger than the NE-regret in the game setting. For a better readability, we here restate the definitions of NE-regret and dynamic NE-regret. Specifically, NE-regret is defined as the absolute value of the difference between the learners cumulative payoff and the minimax value of the time-averaged payoff matrix, namely, t=1 x t Atyt min x m max y n The dynamic NE-regret proposed by this paper is defined as absolute value of the difference between the cumulative payoff of the two players against the sum of the minimax game value at each round, namely, Dyn NE-Reg T t=1 x t Atyt t=1 min x m max y n x Aty Comparing to the original NE-regret in Eq. (14), we here move the minimax operation inside the summation of the benchmark. The operation is similar to that of the worst-case dynamic regret in Eq. (13), which moves the minimization operation inside the summation of the benchmark compared to the standard static regret in Eq. (11). However, the important point to note here is: worst-case dynamic regret is always no smaller than the static regret in online convex optimization setting, whereas the dynamic NE-regret is not necessarily larger than the NE-regret. Recall the example of two-phase online games in Section 3.1: the online matrix is set as At = 1 1 1 1 when t T/2, and set as At = 1 1 1 1 when t > T/2. In this case, when both players are indeed using the Nash equilibrium strategy at each round, they will suffer 0 dynamic NE-regret, while still incur a linear NE-regret as NE-Reg T = |T/2 0| = T/2. No-Regret Learning in Time-Varying Zero-Sum Games B.2. 
B.2. Relationships among Individual Regret, Duality Gap, and Dynamic NE-Regret

In this subsection, we discuss the relationships among the three performance measures considered in this work: individual regret, duality gap, and dynamic NE-regret. As mentioned in Section 3.1, both the individual regret and the dynamic NE-regret are bounded by the duality gap. In the following, we present a formal statement and provide the proof.

Proposition 11. Consider any strategy sequences $\{x_t\}_{t=1}^T$ and $\{y_t\}_{t=1}^T$ with $x_t \in \Delta_m$ and $y_t \in \Delta_n$ for all $t \in [T]$. We have
$$\text{Reg}_T^x \le \text{Dual-Gap}_T, \qquad \text{Reg}_T^y \le \text{Dual-Gap}_T, \qquad \text{and} \qquad \text{Dyn-NE-Reg}_T \le \text{Dual-Gap}_T, \tag{16}$$
where all these measures are defined in Section 3.1.

Proof. First, we show that $\text{Reg}_T^x \le \text{Dual-Gap}_T$ as follows:
$$\text{Reg}_T^x = \sum_{t=1}^T x_t^\top A_t y_t - \min_{x \in \Delta_m} \sum_{t=1}^T x^\top A_t y_t \le \sum_{t=1}^T \max_{y \in \Delta_n} x_t^\top A_t y - \sum_{t=1}^T \min_{x \in \Delta_m} x^\top A_t y_t = \text{Dual-Gap}_T.$$
The inequality $\text{Reg}_T^y \le \text{Dual-Gap}_T$ can be obtained in the same way. For the relationship between $\text{Dyn-NE-Reg}_T$ and $\text{Dual-Gap}_T$, on one hand we have
$$\sum_{t=1}^T x_t^\top A_t y_t - \sum_{t=1}^T \min_{x \in \Delta_m}\max_{y \in \Delta_n} x^\top A_t y \le \sum_{t=1}^T \max_{y \in \Delta_n} x_t^\top A_t y - \sum_{t=1}^T x_t^{*\top} A_t y_t^* \le \sum_{t=1}^T \max_{y \in \Delta_n} x_t^\top A_t y - \sum_{t=1}^T \min_{x \in \Delta_m} x^\top A_t y_t, \tag{17}$$
where $(x_t^*, y_t^*) \in \mathcal{X}_t^* \times \mathcal{Y}_t^*$ denotes a Nash equilibrium of $A_t$ and the last step uses $x_t^{*\top}A_t y_t^* = \min_{x}\max_{y} x^\top A_t y \ge \min_{x} x^\top A_t y_t$. On the other hand,
$$\sum_{t=1}^T \min_{x \in \Delta_m}\max_{y \in \Delta_n} x^\top A_t y - \sum_{t=1}^T x_t^\top A_t y_t \le \sum_{t=1}^T x_t^{*\top} A_t y_t^* - \sum_{t=1}^T \min_{x \in \Delta_m} x^\top A_t y_t \le \sum_{t=1}^T \max_{y \in \Delta_n} x_t^\top A_t y - \sum_{t=1}^T \min_{x \in \Delta_m} x^\top A_t y_t, \tag{18}$$
where the last step uses $x_t^{*\top}A_t y_t^* \le \max_{y} x_t^\top A_t y$. Combining Eq. (17) and Eq. (18) shows that $\text{Dyn-NE-Reg}_T \le \text{Dual-Gap}_T$.

C. Discussions on Non-Stationarity Measures

In the following, we present more discussions on the relationships among the three non-stationarity measures ($P_T$, $V_T$, and $W_T$) introduced in Section 3.2.

Comparison between the Nash non-stationarity $P_T$ and the game-matrix non-stationarity measures $W_T$, $V_T$. Here we present two specific cases showing that the non-stationarity of the Nash equilibria is not comparable to that of the game matrices.

Case 1. Let $A = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$, and consider the time-varying games with $A_t = \frac{1}{T}A$ when $t$ is odd and $A_t = \frac{T-1}{T}A$ when $t$ is even. Notice that all $A_t$'s share the same (unique) Nash equilibrium $x_t^* = y_t^* = (\frac{1}{2}, \frac{1}{2})$, which implies that the path length of the Nash equilibria is $P_T = 0$. By contrast, the two non-stationarity measures of the payoff matrices are large: concretely, $W_T = \Theta(T)$ and $V_T = T\cdot\frac{(T-2)^2}{T^2} = \Theta(T)$.

Case 2. Let $A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ and $E = \begin{pmatrix} \varepsilon & \varepsilon \\ -\varepsilon & -\varepsilon \end{pmatrix}$ for some $\varepsilon > 0$, and consider $A_t = A + (-1)^t E$. In this case, we have $x_t^* = (1, 0)$ when $t$ is odd and $x_t^* = (0, 1)$ when $t$ is even, while $y_t^* = (\frac{1}{2}, \frac{1}{2})$ for all rounds. Then the path length of the Nash equilibria is large, $P_T = \Theta(T)$. By contrast, the other two measures can be small: specifically, $W_T = \Theta(T\varepsilon) = O(1)$ and $V_T = \Theta(T\varepsilon^2) = O(1/T)$ when choosing $\varepsilon = O(1/T)$.

Comparison between the two game-matrix non-stationarity measures $V_T$ and $W_T$. Here we show the relationship between the two non-stationarity measures of the game matrices. First, we have $V_T \le O(W_T)$, since
$$\sum_{t=2}^T \|A_t - A_{t-1}\|^2 \le 2\sum_{t=2}^T \big(\|A_t - \bar A\|^2 + \|A_{t-1} - \bar A\|^2\big) \le O(W_T),$$
where the last step uses the boundedness of the payoff matrices. Moreover, $V_T$ can be much smaller than $W_T$ in some cases. For instance, when $A_t = \frac{t}{T}A$ with $A = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$, we have $V_T = \frac{T-1}{T^2} = \Theta(\frac{1}{T})$ whereas $W_T = \Theta\big(\sum_{t=1}^T |\frac{t}{T} - \frac{1}{2}|\big) = \Theta(T)$.
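As a quick numerical companion to Case 2, the following Python sketch computes $V_T$, $W_T$, and the Nash path length $P_T$ for matrices of the form $A_t = A + (-1)^t E$. The entrywise maximum norm below is only a stand-in for the paper's (unspecified in this extraction) matrix norm, and the equilibrium sequence is the one stated in Case 2; the snippet is illustrative, not part of the original analysis.

```python
import numpy as np

def nonstationarity(As, x_stars, y_stars, norm=lambda M: np.max(np.abs(M))):
    """Compute V_T, W_T on the payoff matrices and the path length P_T of the given equilibria."""
    A_bar = np.mean(As, axis=0)
    V_T = sum(norm(As[t] - As[t - 1]) ** 2 for t in range(1, len(As)))
    W_T = sum(norm(A - A_bar) for A in As)
    P_T = sum(np.abs(x_stars[t] - x_stars[t - 1]).sum()
              + np.abs(y_stars[t] - y_stars[t - 1]).sum() for t in range(1, len(As)))
    return V_T, W_T, P_T

T, eps = 1000, 1.0 / 1000                      # choose eps = O(1/T) as in Case 2
A = np.ones((2, 2))
E = np.array([[eps, eps], [-eps, -eps]])
As = [A + ((-1) ** t) * E for t in range(1, T + 1)]
# equilibria stated in Case 2: x* alternates between e1 and e2, y* = (1/2, 1/2)
x_stars = [np.array([1.0, 0.0]) if t % 2 == 1 else np.array([0.0, 1.0])
           for t in range(1, T + 1)]
y_stars = [np.array([0.5, 0.5]) for _ in range(T)]

V_T, W_T, P_T = nonstationarity(As, x_stars, y_stars)
print(f"V_T = {V_T:.5f} (order T*eps^2), W_T = {W_T:.3f} (order T*eps), P_T = {P_T:.0f} (order T)")
```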
D. Verifying the DRVU Property

In this section, we present two instantiations of Optimistic Online Mirror Descent (Optimistic OMD) (Rakhlin & Sridharan, 2013) and prove that both of them satisfy the DRVU property in Definition 3 with $\alpha, \beta, \gamma = \widetilde\Theta(1)$. Consider the general protocol of online linear optimization over a sequence of linear functions $f_1, \ldots, f_T$ with $f_t(x) = \langle x, g_t\rangle$, over a convex feasible set $\mathcal{X} \subseteq \mathbb{R}^d$.

Optimistic OMD is a generic algorithmic framework parametrized by a sequence of optimism vectors $M_1, \ldots, M_T \in \mathbb{R}^d$ and a regularizer $\psi$ that is 1-strongly convex with respect to a certain norm $\|\cdot\|$. Optimistic OMD starts from an initial point $\widehat x_1 \in \mathcal{X}$ and makes the following two-step update at each round:
$$x_t = \argmin_{x \in \mathcal{X}} \; \eta_t \langle M_t, x\rangle + D_\psi(x, \widehat x_t), \qquad \widehat x_{t+1} = \argmin_{x \in \mathcal{X}} \; \eta_t \langle g_t, x\rangle + D_\psi(x, \widehat x_t). \tag{19}$$
In the above, $\eta_t > 0$ is the step size at round $t$ and $D_\psi(\cdot, \cdot)$ is the Bregman divergence induced by the regularizer $\psi$. Zhao et al. (2021) prove the following general result on the dynamic regret of Optimistic OMD; we present its proof in Appendix D.3 for completeness.

Theorem 12 (Theorem 1 of Zhao et al. (2021)). The dynamic regret of Optimistic OMD, whose update rule is specified in Eq. (19), is bounded by
$$\sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(u_t) \le \sum_{t=1}^T \Big[\eta_t \|g_t - M_t\|_*^2 + \frac{1}{\eta_t}\big(D_\psi(u_t, \widehat x_t) - D_\psi(u_t, \widehat x_{t+1})\big) - \frac{1}{\eta_t}\big(D_\psi(\widehat x_{t+1}, x_t) + D_\psi(x_t, \widehat x_t)\big)\Big],$$
which holds for any comparator sequence $u_1, \ldots, u_T \in \mathcal{X}$.

Note that the theorem is very general owing to the flexibility in choosing the comparator sequence $u_1, \ldots, u_T$ and the regularizer $\psi$. In the following, we present two instantiations of Optimistic OMD, namely Optimistic Hedge with a fixed-share update (Herbster & Warmuth, 1998; Cesa-Bianchi et al., 2012) and Optimistic Online Gradient Descent (Chiang et al., 2012), and then use the above general theorem to prove that the two algorithms indeed satisfy the DRVU property of Definition 3.

D.1. Optimistic Hedge with a Fixed-share Update

In this subsection, we show that Optimistic Hedge with a fixed-share update indeed satisfies the DRVU property. Consider the following online convex optimization problem with linear loss functions: for $t = 1, \ldots, T$, an online learning algorithm proposes $x_t \in \Delta_m$ at the beginning of round $t$, and then receives a loss vector $g_t \in \mathbb{R}^m$ and suffers loss $\langle g_t, x_t\rangle$. We first present the algorithmic procedure. Optimistic Hedge with a fixed-share update starts from an initial distribution $\widetilde x_1 \in \Delta_m$ and updates according to
$$x_t = \argmin_{x \in \Delta_m} \; \eta\langle x, h_t\rangle + D_\psi(x, \widetilde x_t), \qquad \widehat x_{t+1} = \argmin_{x \in \Delta_m} \; \eta\langle x, g_t\rangle + D_\psi(x, \widetilde x_t), \qquad \widetilde x_{t+1} = (1-\xi)\,\widehat x_{t+1} + \frac{\xi}{m}\mathbf{1}_m, \tag{20}$$
where $\eta > 0$ is a fixed step size, $\psi(x) = \sum_{i=1}^m x_i \log x_i$ is the negative-entropy regularizer, $D_\psi(\cdot,\cdot)$ is the induced Bregman divergence, and $0 \le \xi \le 1$ is the fixed-share coefficient. The first step updates with the optimism vector $h_t \in \mathbb{R}^m$, which serves as a guess of the next-round loss; the second step updates with the received loss $g_t \in \mathbb{R}^m$; and the final step performs the fixed-share mixing. We have the following result on the dynamic regret of Optimistic Hedge with a fixed-share update.

Lemma 13. Set the fixed-share coefficient as $\xi = 1/T$. The dynamic regret of Optimistic Hedge with a fixed-share update is at most
$$\sum_{t=1}^T \langle g_t, x_t - u_t\rangle \le \frac{(3 + \log(mT))(1 + P_T^u)}{\eta} + \eta\sum_{t=1}^T \|g_t - h_t\|_\infty^2 - \frac{1}{4\eta}\sum_{t=2}^T \|x_t - x_{t-1}\|_1^2, \tag{21}$$
where $u_1, \ldots, u_T \in \Delta_m$ is any comparator sequence and $P_T^u = \sum_{t=2}^T \|u_t - u_{t-1}\|_1$ denotes the path length of the comparators. Therefore, when choosing the optimism as $h_t = g_{t-1}$, the algorithm satisfies the DRVU($\eta$) property with parameters $\alpha = 3 + \log(mT)$, $\beta = 1$, and $\gamma = \frac{1}{4}$.

Proof. First, note that the chosen regularizer $\psi(x) = \sum_{i=1}^m x_i \log x_i$ is 1-strongly convex with respect to $\|\cdot\|_1$, because for any $x, x' \in \Delta_m$ it holds that
$$\psi(x) - \psi(x') - \langle \nabla\psi(x'), x - x'\rangle = \sum_{i=1}^m x_i \log\frac{x_i}{x_i'} \ge \frac{1}{2}\|x - x'\|_1^2,$$
where the last inequality is by Pinsker's inequality. Therefore, we can apply the general result of Theorem 12 with $f_t(x) = \langle g_t, x\rangle$ and $M_t = h_t$ and obtain
$$\sum_{t=1}^T \langle g_t, x_t - u_t\rangle \le \eta\sum_{t=1}^T \|g_t - h_t\|_\infty^2 + \frac{1}{\eta}\sum_{t=1}^T \big(D_\psi(u_t, \widetilde x_t) - D_\psi(u_t, \widehat x_{t+1})\big) - \frac{1}{\eta}\sum_{t=1}^T \big(D_\psi(\widehat x_{t+1}, x_t) + D_\psi(x_t, \widetilde x_t)\big).
(22) We now evaluate the right-hand side. For the second term PT t=1(Dψ(ut, ext) Dψ(ut, bxt+1)), we have Dψ(ut, ext) Dψ(ut, bxt+1) i=1 ut,i log ut,i i=1 ut,i log ut,i bxt+1,i = i=1 ut,i log bxt+1,i i=1 ut,i log 1 i=1 ut 1,i log 1 i=1 ut 1,i log 1 i=1 ut,i log 1 bxt+1,i Notice that the first term above can be further upper bounded by i=1 ut,i log 1 i=1 ut 1,i log 1 No-Regret Learning in Time-Varying Zero-Sum Games i=1 (ut,i ut 1,i) log 1 i=1 ut 1,i log bxt,i ξ ut ut 1 1 + log 1 1 ξ , where the inequality comes from the fixed-share update procedure, where we have log 1 ext,i log m ξ and log bxt,i ext,i log 1 1 ξ for any i [m] and t [T]. Then, taking summation of Eq. (23) over t = 2 to T and combining the fact that (Dψ(u1, ex1) Dψ(u1, bx2)) = Pm i=1 u1,i log bx2,i ex1,i and ex1,i ξ m for any i [m], we get t=1 (Dψ(ut, ext) Dψ(ut, bxt+1)) t=2 ut ut 1 1 + (T 1) log 1 1 ξ + i=1 u1,i log 1 bx2,i + i=1 u1,i log bx2,i t=2 ut ut 1 1 + (T 1) log 1 1 ξ Next, we proceed to analyze the negative term , i.e., the third term of the right-hand side in Eq. (22). Indeed, t=2 (Dψ(bxt, xt 1) + Dψ(xt, ext)) bxt xt 1 2 1 + xt ext 2 1 (Pinsker s inequality) xt xt 1 + bxt ext 2 1 ( x 2 1 + y 2 1 1 2 x + y 2 1) t=2 xt xt 1 + ξ bxt 1 m1m 2 1 (due to the fixed-share update) t=2 xt xt 1 2 1 ξ 2 xt xt 1 1 1 ( a b 2 1 a 2 1 2 a 1 b 1) t=2 xt xt 1 2 1 2ξ(T 1). (25) Substituting Eq. (24) and Eq. (25) into the general dynamic regret upper bound in Eq. (22), we achieve t=1 gt, xt ut η t=1 gt ht 2 + 1 ξ (1 + P u T ) + 1 η (T 1) log 1 1 ξ + 2ξ t=2 xt xt 1 2 1 t=1 gt ht 2 + 1 η (3 + log(m T)(1 + P u T )) 1 t=2 xt xt 1 2 1, where the last step holds because we set ξ = 1 1 η (T 1) log 1 1 ξ + 2ξ η (T 1) = 1 η (T 1) log 1 + ξ 1 ξ η (T 1) ξ 1 ξ + 2ξ When choosing the optimism as ht = gt 1, we then have t=1 gt, xt ut η t=1 gt gt 1 2 + 3 + log(m T) η (1 + P u T ) 1 t=2 xt xt 1 2 1, which verifies the DRVU property of Definition 3, with α = 3 + log(m T), β = 1, and γ = 1 4. This ends the proof. No-Regret Learning in Time-Varying Zero-Sum Games D.2. Optimistic Online Gradient Descent In this subsection, we show that Optimistic Online Gradient Descent (Optimistic OGD) with a fixed-shared update indeed satisfies the DRVU property. Consider the following online convex optimization with linear loss functions: for t = 1, . . . , T, an online learning algorithm proposes xt m at the beginning of round t, and then receives a loss vector gt Rm and suffer loss gt, xt . We first present the algorithmic procedure. Optimistic OGD starts from an initial distribution bx1 m and updates according to xt = argmin x m η x, ht + Dψ(x, bxt), bxt+1 = argmin x m η x, gt + Dψ(x, bxt), (26) where η > 0 is a fixed step size and ψ(x) = 1 2 x 2 2 is the Euclidean regularizer and Dψ( , ) is the induced Bregman divergence. Compared to Eq. (20). The first step updates by the optimistic vector ht Rm that serving as a guess of the next-round loss, the second step updates by the received loss gt Rm. We note that Optimistic OGD does not require a fixed-share mixing operation to achieve dynamic regret. Then, we have the following result on the dynamic regret of Optimistic OGD. Lemma 14. The dynamic regret of Optimistic OGD is at most t=1 gt, ut (m + 2)P u T η + ηm t=2 gt ht 2 1 4ηm t=2 xt xt 1 2 1 + O(1), (27) where u1, . . . , u T m is any comparator sequence and PT = PT t=2 ut ut 1 1 denotes the path-length of comparators. Therefore, when choosing the optimism as ht = gt 1, the algorithm satisfies the DRVU(η) property with parameters α = m + 2, β = m 2 , and γ = 1 4m. 
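Before the proof, here is a minimal Python sketch (our own illustration, not the authors' code) of the Optimistic OGD update in Eq. (26) over the simplex: with the Euclidean regularizer $\psi(x) = \frac{1}{2}\|x\|_2^2$, each of the two steps reduces to a Euclidean projection of a gradient step. The optimism is $h_t = g_{t-1}$ as in Lemma 14, and $h_1 = 0$ is our convention for the first round.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (standard sort-based rule)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1.0), 0.0)

def optimistic_ogd(losses, eta):
    """Run Optimistic OGD (Eq. (26)) with optimism h_t = g_{t-1}; return the played iterates x_t."""
    m = losses.shape[1]
    x_hat = np.ones(m) / m        # \hat{x}_1: uniform initialization
    g_prev = np.zeros(m)          # h_1 = 0 (our convention for the first round)
    plays = []
    for g_t in losses:
        x_t = project_simplex(x_hat - eta * g_prev)   # optimistic step with h_t = g_{t-1}
        x_hat = project_simplex(x_hat - eta * g_t)    # update step with the observed loss g_t
        plays.append(x_t)
        g_prev = g_t
    return np.array(plays)

# toy run on slowly drifting linear losses
rng = np.random.default_rng(0)
losses = rng.normal(size=(100, 5)).cumsum(axis=0) * 0.01
plays = optimistic_ogd(losses, eta=0.1)
print(plays[-1])
```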
Proof. From the general result of Theorem 12, we have the following dynamic regret bound for Optimistic OGD: t=1 gt, xt ut η t=1 gt ht 2 2 + 1 bxt ut 2 2 bxt+1 ut 2 2 1 bxt+1 xt 2 2 + xt bxt 2 2 . Besides, we have bxt ut 2 2 bxt+1 ut 2 2 t=2 bxt ut 2 2 t=2 bxt ut 1 2 2 + D2 t=2 ( ut ut 1 2 bxt + bxt ut ut 1 2) + D2 t=2 ( ut ut 1 2 bxt + bxt ut ut 1 1) + D2 t=2 ut ut 1 2 t=2 ut ut 1 1, No-Regret Learning in Time-Varying Zero-Sum Games where we introduce the notation D = supx,y m x y 2, and it can be verified that D 2m. Further we have bxt+1 xt 2 2 + xt bxt 2 2 bxt xt 1 2 2 + xt bxt 2 2 1 t=2 xt xt 1 2 2 1 2m t=2 xt xt 1 2 1. Combining the above three inequalities, we achieve that t=1 gt, xt ut η t=1 gt ht 2 2 + 1 η (m + 2P u T ) 1 4ηm t=2 xt xt 1 2 1 t=1 gt ht 2 + m + 2 η (1 + P u T ) 1 4ηm t=2 xt xt 1 2 1. Therefore, choosing the optimism as ht = gt 1, we then verify the DRVU property of Definition 3 for Optimistic OGD, with α = m + 2, β = m 2 , and γ = 1 4m. This ends the proof. D.3. Proof of Theorem 12 Proof. We decompose the instantaneous dynamic regret into three terms and bound each one respectively. Specifically, ft(xt) ft(ut) ft(xt), xt ut = ft(xt) Mt, xt bxt+1 + Mt, xt bxt+1 + ft(xt), bxt+1 ut . The first term can be controlled by Lemma 22, which guarantees that the OMD update satisfies xt bxt+1 ηt ft(xt) Mt and thus, ft(xt) Mt, xt bxt+1 ft(xt) Mt xt bxt+1 ηt ft(xt) Mt 2 . We now analyze the remaining two terms on the right-hand side. By the Bregman proximal inequality in Lemma 21 and the OMD update step xt = argminx X ηt Mt, x + Dψ(x, bxt), we have Mt, xt bxt+1 1 Dψ(bxt+1, bxt) Dψ(bxt+1, xt) Dψ(xt, bxt) . Similarly, the OMD update step bxt+1 = argminx X ηt ft(xt), x + Dψ(x, bxt) implies ft(xt), bxt+1 ut 1 Dψ(ut, bxt) Dψ(ut, bxt+1) Dψ(bxt+1, bxt) . Combining the above three inequalities yields an upper bound for the instantaneous dynamic regret: ft(xt) ft(ut) ηt ft(xt) Mt 2 + 1 Dψ(ut, bxt) Dψ(ut, bxt+1) Dψ(bxt+1, xt) Dψ(xt, bxt) . (28) Taking the summation over all iterations completes the proof. E. Proofs for Section 5 In this section, we provide the proofs for the main results presented in Section 5, including individual regret of Theorem 4, dynamic NE-regret of Theorem 6, and duality gap of Theorem 8. E.1. Proof of Theorem 4 (Individual Regret) Proof. In the following, we focus on the individual regret of x-player, and the result for y-player can be proven in a similar way. The proof is split into three parts. (1) First, we prove the e O( T) individual regret bound for x-player no matter whether the y-player follows the strategy suggested by Algorithm 2. No-Regret Learning in Time-Varying Zero-Sum Games (2) Second, we prove the e O( 1 + VT + PT ) bound, which depends on VT (the variation of the payoff matrices) and PT (the path-length of the Nash equilibrium sequence). (3) Finally, we prove the e O( 1 + VT + WT ) bound, which depends on VT and WT , the variation and variance of the payoff matrices. Our analysis is mainly based on the general dynamic regret bound proven in Lemma 15. Specifically, using Eq. (35), setting a fixed comparator, i.e., u1 = . . . = u T argminx m PT t=1 x Atyt (then the path-length P u T = 0), and also dropping the last three negative terms in the regret upper bound, for any i S1,x we have t=1 x t Atyt min x t=1 x Atyt e O 1 ηx i + ηx i t=2 At At 1 2 + ηx i t=2 yt yt 1 2 1 In the following, we further bound the right-hand side in three different ways to achieve different individual regret bounds. T) robustness bound. 
First of all, we prove that for x-player, her individual regret against y-player s actions is at most e O( T), which holds even when y-player does not follow strategies suggested by Algorithm 2. Note that PT t=2 At At 1 2 + PT t=2 yt yt 1 2 1 O(T). Therefore, we have t=1 x t Atyt min x t=1 x Atyt e O 1 ηx i + ηx i T e O( where the last inequality is achieved by taking i = i such that ηi = Θ(1/ T). Note that the choice is viable due to the configuration of the step size pool. Similarly, we can also attain an e O( T) robustness bound for y-player. Next, we demonstrate two adaptive bounds of the individual regret when both players follow our prescribed strategy (namely, x-player is using Algorithm 1 and y-player is using Algorithm 2). They are both in the worst case e O( T) but can be much smaller if the sequence of online payoff matrices is less non-stationary. The e O( 1 + VT + PT ) bound. We first consider the individual regret bound that scales with the variation of Nash equilibria denoted by PT min t,(x t ,y t ) X t Y t PT t=2 x t x t 1 1 + y t y t 1 1 and VT = PT t=2 At At 1 2 . According to Lemma 16, which proves the stability of the dynamics with respect to PT and VT when both players are following the suggested strategy, we have PT t=2 yt yt 1 2 1 e O( p (1 + VT )(1 + PT ) + PT ). Therefore, we achieve t=1 x t Atyt min x m x Atyt e O 1 ηx i + ηx i VT + ηx i p (1 + VT )(1 + PT ) + PT ηx i + ηx i (1 + VT + PT ) (by AM-GM inequality) 1 + VT + PT , where in the last inequality, we choose i = i such that ηx i [ 1 2ηx , 2ηx ] where ηx = min{ 1 1+VT +PT , 1 L}. The choice of ηx i is also viable due to the configuration of the step size pool. The e O( 1 + VT + WT ) bound. We then consider the individual regret bound that scales with VT and the variance of the game matrices denoted by WT = PT t=1 At A with A = 1 T PT t=1 At being the averaged game matrix. Then according to Lemma 18, which proves the stability of the dynamics with respect to WT when both players are following the suggested strategy, we have PT t=2 yt yt 1 2 1 e O(1 + WT ). Therefore, we achieve t=1 x t Atyt min x m x Atyt e O 1 ηx i + ηx i (VT + 1 + WT ) e O p 1 + VT + WT , where in the last inequality, we choose i = i such that ηx i [ 1 2ηx , 2ηx ] where ηx = min{ 1 1+VT +WT , 1 L}. The choice of ηx i is also viable due to the configuration of the step size pool. Combining above three upper bounds finishes the proof of Theorem 4. No-Regret Learning in Time-Varying Zero-Sum Games E.2. Proof of Theorem 6 (Dynamic NE-Regret) Proof. The proof for the dynamic NE-regret measure consists of two parts. We first prove the e O( p (1 + VT )(1 + PT )+PT ) bound and then prove the e O(1 + WT ) bound. (1 + VT )(1 + PT ) + PT ) bound. Let (x t , y t ) be the Nash equilibrium of the online game matrix At. We consider the upper bound in terms of the non-stationarity measure PT PT t=2 x t x t 1 1 + y t y t 1 1 .6 As (x t , y t ) is the Nash equilibrium of At, the inequality x t Aty x t Aty t x Aty t holds for any x m and y n. We notice that min x max y x Aty = x t Aty t x t Atyt min x x Atyt, min x max y x Aty = x t Aty t x t Aty t max y x t Aty. 
Therefore, we have t=1 x t Atyt t=1 min x max y x Aty t=1 x t Atyt t=1 x t Atyt, t=1 x t Atyt + t=1 min x max y x Aty t=1 x t Atyt + t=1 x t Aty t , which means that the dynamic NE-regret is upper bounded by the maximum of the following two dynamic regret bounds: t=1 x t Atyt t=1 min x max y x Aty t=1 x t Atyt t=1 x t Atyt, t=1 x t Atyt + t=1 x t Aty t Moreover, according to the general dynamic regret analysis in Lemma 15 with the choice of {(ut, vt)}T t=1 = {(x t , y t )}T t=1 and dropping the three negative terms, we have t=1 x t Atyt t=1 x t Atyt e O 1 + P x T ηx i + ηx i VT + ηx i t=2 yt yt 1 2 1 where P x T = PT t=2 x t x t 1 1 denotes the path-length of the Nash equilibria of the x-player. According to Lemma 16, we have PT t=2 yt yt 1 2 1 e O( p (1 + VT )(1 + PT ) + PT ), so using AM-GM inequality achieves t=1 x t Atyt t=1 x t Atyt e O 1 + P x T ηx i + ηx i VT + ηx i ( p (1 + VT )(1 + PT ) + PT ) ηx i + ηx i (1 + VT + PT ) (1 + PT )(1 + VT + PT ) + PT (1 + PT )(1 + VT ) + PT . where the last inequality is by choosing i = i such that ηx i [ 1 2ηx , 2ηx ] where ηx = min nq 1+PT 1+VT +PT , 1 6Strictly speaking, the path-length non-stationarity measure is PT min t,(x t ,y t ) X t Y t PT t=2 ( x t x t 1 1 + y t y t 1 1) as the Nash equilibrium of each round may not unique. Fortunately, our analysis holds for any Nash equilibrium, so we can in particular take the sequence of Nash equilibria making PT t=2 ( x t x t 1 1 + y t y t 1 1) smallest possible. The quantity PT is only used in the analysis, and our algorithm does not require any prior knowledge about it. No-Regret Learning in Time-Varying Zero-Sum Games The e O(1 + WT ) bound. According to Lemma 17, we have the NE-regret is bounded by the maximum of two individual regret upper bounds plus the variance of the game matrices. t=1 x t Atyt t=1 min x max y x Aty t=1 x t Atyt t=1 x Atyt, t=1 x t Aty t=1 x t Atyt Using the e O( 1 + VT + WT ) individual regret bound proven in Theorem 4 and the fact that VT O(WT ), we have t=1 x t Atyt t=1 min x max y x Aty 1 + VT + WT + WT e O (1 + WT ) . To summarize, combining the above two upper bounds for dynamic NE-regret, we finally achieve the following guarantee: t=1 x t Atyt t=1 min x max y x Aty (1 + VT )(1 + PT ) + PT , 1 + WT o , which completes the proof of Theorem 6. E.3. Proof of Theorem 8 (Duality Gap) Proof. In Theorem 8, there are indeed two different upper bounds for duality gap of our approach, as restated below. t=1 x t At y t t=1 x t Atyt e O min{T 3 4 1 + QT 1 4 , T 1 2 (1 + Q 3 2 T + PT QT ) 1 2 } , where QT VT + min{PT , WT } is introduced to simplify the notation. Now we will prove the two bounds respectively. The e O(T 3 4 1 + VT + min{PT , WT } 1 4 ) bound. For convenience of the following proof, we introduce the notation of ft(x) x Atyt, and then the best response is essentially the minimizer of the function, namely, x t = argminx m ft(x). We now investigate the following worst-case dynamic regret of the x-player, t=1 x t Atyt t=1 x t Atyt = t=1 ft( x t ), which benchmarks the cumulative loss of x-player s actions with the best response at each round. We decompose the quantity into the following two terms: t=1 ft( x t ) = | {z } term (i) t=1 ft( x t ) | {z } term (ii) where we insert a term PT t=1 ft(ut) as an anchor quantity. Notably, this comparator sequence {ut}T t=1 can be arbitrarily set without affecting the above equation. In particular, we choose it as a piecewise-stationary comparator sequence such that I1, I2, . . . 
, IK is an even partition of the total horizon [T] with |Ik| = for k = 1, . . . , K (for simplicity, suppose the time horizon T is divisible by epoch length ), and for any t Ik, ut argminx m P t Ik ft(x). Then, following the general dynamic regret bound proven in Lemma 15, for this particular comparator sequence and for any i S1,x, we have the following upper bound for term (i): O α(1 + P u T ) ηx i t=2 At At 1 2 + ηx i cβ t=2 yt yt 1 2 1 + λ γ t=2 xt,i xt 1,i 2 1 No-Regret Learning in Time-Varying Zero-Sum Games t=1 ( pt bpt+1 2 2 + pt bpt 2 2) λ i Sx pt,i xt,i xt 1,i 2 1 + e O(1) 1 + P u T ηx i + ηx i t=2 At At 1 2 t=2 yt yt 1 2 1 t=2 pt pt 1 2 1 λ i=1 pt,i xt,i xt 1,i 2 1 (λ γ 1 + P u T ηx i + ηx i t=2 At At 1 2 t=2 qt qt 1 2 1 + 2ηx i cβ j Sy qt,j yt,j yt 1,j 2 1 t=2 pt pt 1 2 1 λ i=1 pt,i xt,i xt 1,i 2 1 (by Eq. (43)) ηx i + ηx i t=2 At At 1 2 t=2 qt qt 1 2 1 + 2ηx i cβ j Sy qt,j yt,j yt 1,j 2 1 t=2 pt pt 1 2 1 λ i=1 pt,i xt,i xt 1,i 2 1, where the last step holds because P u T = PT t=2 ut ut 1 1 = O(K) = O(T/ ) by the specific construction of the comparator sequence. Moreover, in Lemma 20 we present a general result to relate the function-value difference between the sequence of piecewise minimizers and the sequence of each-round minimizers, so term (ii) can be well upper bounded as follows. t=1 ft( x t ) 2 t=2 Atyt At 1yt 1 . Combining the above two inequalities, we get the worst-case dynamic regret for the x-player: for any i S1,x, we have t=1 x t Atyt t=1 x t Atyt ηx i + ηx i VT t=2 qt qt 1 2 1 + 2ηx i cβ j Sy qt,j yt,j yt 1,j 2 1 t=2 Atyt At 1yt 1 L t=2 pt pt 1 2 1 λ i Sx pt,i xt,i xt 1,i 2 1. Similarly, we can also obtain the worst-case dynamic regret for the y-player: for any j S1,y, we have t=1 x t At y t t=1 x t Atyt ηy j + ηy j VT t=2 pt pt 1 2 1 + 2ηy j cβ i Sx pt,i xt,i xt 1,i 2 1 t=2 x t At x t 1At 1 L t=2 qt qt 1 2 1 λ j Sy qt,j yt,j yt 1,j 2 1. Combining the two dynamic regret bounds yields: for any i S1,x and any j S1,y, t=1 x t At y t t=1 x t Atyt No-Regret Learning in Time-Varying Zero-Sum Games ηy j + ηx i VT + ηy j VT t=2 Atyt At 1yt 1 + 2 t=2 x t At x t 1At 1 + 2ηy j cβ L t=2 qt qt 1 2 1 + 2ηx i cβ L t=2 pt pt 1 2 1 + (2ηy j cβ λ) j Sy qt,j yt,j yt 1,j 2 1 + (2ηx i cβ λ) i Sx pt,i xt,i xt 1,i 2 1 ηy j + ηx i VT + ηy j VT t=2 Atyt At 1yt 1 + 2 t=2 x t At x t 1At 1 , (29) where the last inequality is because 2ηy j cβ L 2 0, 2ηx i cβ L 2 0, 2ηx i cβ γ 0, 2ηx i cβ γ 0 based on the fact that ηy j 1 L and L = max{4, 16cβ, q γ } and λ = γL Next, we bound the last two terms in the right-hand side of Eq. (29). Indeed, t=2 Atyt At 1yt 1 + t=2 x t At x t 1At 1 t=2 Atyt At 1yt 1 2 + t=2 x t At x t 1At 1 2 T(1 + VT + min{PT , WT }) , where the first inequality is by Cauchy-Schwarz inequality and the second inequality uses the gradient-variation bound in Lemma 19. Plugging this into Eq. (29) and choosing ηx i , ηy j [ 1 2η , 2η ] with η = min n 1 L, q T (1+VT ) o , we have t=1 x t At y t t=1 x t Atyt e O T(1 + VT + min{PT , WT }) T(1 + VT + min{PT , WT }) = e O T 3 4 1 + VT + min{PT , WT } 1 where the last inequality is by setting the epoch length optimally. We remark that the above choice of ηx i , ηy j is viable due to the construction of step size pool and the fact p T/( (1 + VT )) Θ(1/T); besides, the setting of epoch length is also feasible, and notably the epoch length is only used in the analysis and our algorithm does not require its information. The e O(T 1 2 (1 + Q 3 2 T + PT QT ) 1 2 ) bound. From the update rule of the meta-algorithm and Eq. 
(28) proven in Theorem 12 with ψ(x) = 1 2 x 2 2, we have the following instantaneous regret bound for any p |Sx| and q |Sy|. pt p, ℓx t εx t ℓx t mx t 2 2 + 1 bpt p 2 2 bpt+1 p 2 2 pt bpt+1 2 2 pt bpt 2 2 , qt q, ℓy t εy t ℓy t my t 2 2 + 1 bqt q 2 2 bqt+1 q 2 2 qt bqt+1 2 2 qt bqt 2 2 . (30) Recall that the feedback loss and optimism are set as follows. For x-player, we have ℓx t = X t Atyt + λXδ t and mx t = X t At 1yt 1 + λXδ t , where Xt = [xt,1; xt,2; . . . , xt,|Sx|] and Xδ t = [ xt,1 xt 1,1 2 1; xt,2 xt 1,2 2 1; . . . ; xt,|Sx| xt 1,|Sx| 2 1]. For y-player, we have ℓy t = Y t A t xt + λY δ t , my t = Y t A t 1xt 1 + λY δ t , Yt = [yt,1; yt,2; . . . ; yt,|Sy|] No-Regret Learning in Time-Varying Zero-Sum Games and Y δ t = [ yt,1 yt 1,1 2 1; yt,2 yt 1,2 2 1; . . . ; yt,|Sy| yt 1,|Sy| 2 1]. Note that Sx S1,x S2,x with S1,x = [N] and S2,x = {N + 1, . . . , N + m}; besides, Sy S1,y S2,y with S1,y = [N] and S2,y = {N + 1, . . . , N + n}. N = 1 2 log2 T + 1. Then using Cauchy-Schwarz inequality and noticing that the dimensions of Xt, Yt are eΘ(1), we have ℓx t mx t 2 2 = X t Atyt X t At 1yt 1 2 2 c At At 1 2 + yt yt 1 2 2 , ℓy t my t 2 2 = Y t A t xt + Y t A t 1xt 1 2 2 c At At 1 2 + xt xt 1 2 2 , (31) where c > 0 is a universal constant independent with the time horizon and the non-stationarity measures (ignoring the dependence on poly-logarithmic factors in T). In the following, we will specify the choice of the compared weight vectors. Concretely, let (x t , y t ) be any Nash equilibrium of the payoff matrix At. We pick the compared weight distribution p = p t |Sx| and q = q t |Sy| such that both p t and q t have supports only on the additional dummy base-learners and the supports finally result in a Nash equilibrium of At, namely, p t,i = q t,j = 0 for i = 1, . . . , |S1,x| and j = 1, . . . , |S1,y|, p t,i+|S1,x| = x t,i for i = 1, . . . , m and q t,j+|S1,y| = y t,j for j = 1, . . . , n. By definition, we have pt p t , X t Atyt + qt q t , Y t Atxt = x t Atyt + x t Aty t 0. (32) Moreover, combining Eq. (30) and Eq. (31) yields the following results: pt p t , X t Atyt 1 bpt p t 2 2 bpt+1 p t 2 2 pt bpt+1 2 2 pt bpt 2 2 + c εx t At At 1 2 + yt yt 1 2 2 + λ p t pt, Xδ t . Notice that in fact we have p t , Xδ t = 0 due to the choice of p t . More specifically, p t has support only on the additional dummy base-learners and for those dummy learners their stability quantity is zero (namely, Xδ t,i = 0 for i = |S1,x| + 1, . . . , |S1,x| + m). In addition, it is clear that pt, Xδ t 0, so we have pt p t , X t Atyt bpt p t 2 2 bpt+1 p t 2 2 pt bpt+1 2 2 pt bpt 2 2 + c εx t At At 1 2 + yt yt 1 2 2 . Similarly, we get qt q t , Y t Atxt bqt q t 2 2 bqt+1 q t 2 2 qt bqt+1 2 2 qt bqt 2 2 + c εy t At At 1 2 + xt xt 1 2 2 . Adding the above two inequalities and rearranging the terms, based on Eq. (32), we have pt bpt+1 2 2 + pt bpt 2 2 + 1 qt bqt+1 2 2 + qt bqt 2 2 bpt p t 2 2 bpt+1 p t 2 2 + c εx t ( At At 1 2 + yt yt 1 2 2) bqt q t 2 2 bqt+1 q t 2 2 + c εy t ( At At 1 2 + xt xt 1 2 2) + λD( Xδ t 2 + Y δ t 2) bpt p t 2 2 bpt+1 p t+1 2 2 + c εx t ( At At 1 2 + yt yt 1 2 2) bqt q t 2 2 bqt+1 q t+1 2 2 + c εy t ( At At 1 2 + xt xt 1 2 2) ( p t p t+1 1 + q t q t+1 1). (33) No-Regret Learning in Time-Varying Zero-Sum Games In above, D = p 2(N + max{m, n}) = eΘ(1) is another universal constant (ignoring the dependence on logarithmic factors in T) serving as the upper bound of p p 2 and q q 2 for any p, p |Sx| and for any q, q |Sy|. 
Next, we show that the desired duality gap upper bound can be related to the terms on the left-hand side of above inequality, namely, 1 εx t pt bpt+1 2 2 + pt bpt 2 2 + 1 εy t qt bqt+1 2 2 + qt bqt 2 2 . To see this, we first have the following inequalities from the update rule of the meta-algorithm as well as the first-order optimality condition: for any p |Sx| and q |Sy|, (bpt+1 bpt + εx t X t Atyt + εx t λXδ t ) (p bpt+1) 0, (bqt+1 bqt εy t Y t A t xt + εy t λY δ t ) (q bqt+1) 0. Rearranging the terms and introducing the notations ext P i Sx bpt+1,ixt,i and eyt P j Sy bqt+1,jyt,j, we then have for any p |Sx| and q |Sy|, (bpt+1 bpt) (p bpt+1) εx t (bpt+1 p ) (X t Atyt + λXδ t ) = εx t (bpt+1 p ) (X t Ateyt + X t At(yt eyt) + λXδ t ) εx t (bpt+1 p ) X t Ateyt c εx t bpt+1 p 2 qt bqt+1 2 + λεx t bpt+1 p , Xδ t , (bqt+1 bqt) (q bqt+1) εt,y(bqt+1 q ) ( Y t Atxt + λY δ t ) = εy t (bqt+1 q ) ( Y t Atext Y t At(xt ext) + λY δ t ) εy t (bqt+1 q ) ( Y t Atext) c εy t bqt+1 q 2 pt bpt+1 2 + λεy t bqt+1 q , Y δ t , where c > 0 is also a universal constant independent with the time horizon and the non-stationarity measures (ignoring the dependence on poly-logarithmic factors in T). Rearranging the terms arrives that (bpt+1 p ) X t Ateyt p bpt+1 2 e O 1 εx t bpt+1 bpt 2 + qt bqt+1 2 + λ p bpt+1, Xδ t , (bqt+1 q ) ( Y t Atext) q bqt+1 2 e O 1 εy t bqt+1 bqt 2 + pt bpt+1 2 + λ q bqt+1, Y δ t . Let ( x t , y t ) be the corresponding best response for the strategy (ext, eyt) with respect to the payoff At, i.e., x t = argminx m x Ateyt and y t = argmaxy n ex t Aty. Denote the duality gap bound of (ext, eyt) as αt(ext, eyt) maxy ex t Aty minx x Ateyt. Now we pick the comparator vectors p |Sx| and q |Sy| such that both p and q have supports only on the additional dummy base-learners and the supports finally form the best response of ext, eyt, namely, p i = q j = 0 for i = 1, . . . , |S1,x|, j = 1, . . . , |S1,y| and p i+|S1,x| = x t,i for i = 1, . . . , m and q j+|S1,y| = y t,j for j = 1, . . . , n. Due to the construction, we confirm that p , Xδ t = 0 and q , Y δ t = 0. As a result, combining the two inequalities on the above gives the following upper bound for duality gap: αt(ext, eyt) = (bpt+1 p ) X t Ateyt + (bqt+1 q ) ( Y t Atext) εx t bpt+1 bpt 2 + qt bqt+1 2 + 1 εy t bqt+1 bqt 2 + pt bpt+1 2 εx t ( bpt+1 bpt 2 + pt+1 bpt+1 2) + 1 εy t ( bqt+1 bqt 2 + qt bqt+1 2) . The first inequality holds because p bpt+1 2 D = eΘ(1) and q bqt+1 2 D = eΘ(1) hold for any t [T]. The second inequality is obtained by scaling two terms with a factor of 1 εx t and 1 εy t respectively, where we notice that 1 1 Lεx t and 1 1 Lεy t are true as εx t 1 L and εy t 1 L holds for all t [T]. No-Regret Learning in Time-Varying Zero-Sum Games Then, by Cauchy-Schwarz inequality, we obtain the following upper bound for the square of duality gap bound of (ext, eyt): α2 t(ext, eyt) εx t ( bpt+1 bpt 2 + pt+1 bpt+1 2) + 1 εy t ( bqt+1 bqt 2 + qt bqt+1 2) 2 εx t ( bpt+1 bpt 2 2 + pt+1 bpt+1 2 2) + 1 εy t ( bqt+1 bqt 2 2 + qt bqt+1 2 2) bpt p t 2 2 bpt+1 p t+1 2 2 + εx t At At 1 2 + yt yt 1 2 2 bqt q t 2 2 bqt+1 q t+1 2 2 + εy t At At 1 2 + xt xt 1 2 2 2 ( p t p t+1 1 + q t q t+1 1) . Notably, the last step makes use of the inequality in Eq. (33) and λ = γL 2 = eΘ(1). For simplicity, we introduce the notation 1 εt 1 εx t + 1 εy t . 
Taking a summation on the squared duality gap over all rounds and using the fact that εx t , εy t e O(1) and εx t , εy t are non-increasing in t, we have (omitting all dimension and poly(log T) factors) t=1 α2 t(ext, eyt) t=1 e O 1 εx t+1 εt+1 1 εx t εt bpt+1 p t+1 2 2 t=1 e O 1 εy t+1 εt+1 1 εy t εt bqt+1 q t+1 2 2 ε2 T e O (PT ) + 1 t=2 e O At At 1 2 + yt yt 1 2 2 + xt xt 1 2 2 t=2 e O At At 1 2 + xt xt 1 2 2 + yt yt 1 2 2 , where the last inequality uses bpt+1 p t+1 2 e O(1), bqt+1 q t+1 2 e O(1). According to Lemma 16 and Lemma 18, t=2 xt xt 1 2 2 + t=2 yt yt 1 2 2 e O min np (1 + VT )(1 + PT ) + PT , 1 + WT o . In addition, according to the definition of εx t and εy t , we have 1 εx T +1 = v u u t L2 + t=2 Atyt At 1yt 1 2 e O p 1 + VT + min{PT , WT } , 1 εy T +1 = v u u t L2 + t=2 x t At x t 1At 1 2 e O p 1 + VT + min{PT , WT } , where the last inequality is due to the gradient-variation bound in Lemma 19. Therefore, combining all above inequalities can achieve the following result on the squared duality gap: t=1 α2 t(ext, eyt) e O (1 + PT )(1 + VT + min{PT , WT }) + e O (1 + VT + min{WT , PT }) 3 2 = e O (1 + VT + min{PT , WT }) p 1 + VT + min{PT , WT } + PT . No-Regret Learning in Time-Varying Zero-Sum Games We further introduce the notation QT Vt + min{PT , WT } to simplify the presentation. Then, by Cauchy-Schwarz inequality we have t=1 αt(ext, eyt) e O q T(1 + QT )( p 1 + QT + PT ) = e O T 1 2 (1 + Q 3 2 T + PT QT ) 1 2 . (34) We finally transform the above bound back to αt(xt, yt) by noticing that t=1 αt(xt, yt) max y n x t Aty min x m x Atyt max y n ex t Aty min x m x Ateyt max y n x t Aty max y n ex t Aty + min x m x Ateyt min x m x Atyt t=1 αt(ext, eyt) + t=1 e O ( pt bpt+1 2 + qt bqt+1 2) t=1 αt(ext, eyt) + e O t=1 ( pt bpt+1 2 2 + qt bqt+1 2 2) (by Cauchy-Schwarz inequality) t=1 αt(ext, eyt) + e O q (1 + VT )(1 + PT ) + PT , 1 + WT } . (by Lemma 16 and Lemma 18) e O T 1 2 1 + Q 3 2 T + PT QT 1 T(1 + QT ) (by Eq. (34) and Cauchy-Schwarz inequality) e O T 1 2 1 + Q 3 2 T + PT QT 1 To summarize, combining the both types of upper bounds for duality gap, we finally achieve the following guarantee: t=1 max y n x t Aty t=1 min x m x Atyt e O min{T 3 4 1 + QT 1 4 , T 1 2 (1 + Q 3 2 T + PT QT ) 1 2 } , which completes the proof of Theorem 8. F. Key Lemmas This section presents several key lemmas used in proving our theoretical results. We first provide an analysis for the general dynamic regret of the meta-base two-layer approach, which serves as one of the key technical tools for proving upper bounds for the three performance measures. The result is shown in Lemma 15, and we emphasize that the regret bounds hold for any comparator sequence, which is crucial and useful in the subsequent analysis. Lemma 15 (General dynamic regret). Algorithm 1 guarantees that x-player s dynamic regret with respect to any comparator sequence u1, . . . , u T m is bounded by t=1 x t Atyt t=1 u t Atyt O α(1 + P u T ) ηx i t=2 At At 1 2 + ηx i cβ t=2 yt yt 1 2 1 + λ γ t=2 xt,i xt 1,i 2 1 t=1 ( pt bpt+1 2 2 + pt bpt 2 2) λ i Sx pt,i xt,i xt 1,i 2 1 + e O(1), No-Regret Learning in Time-Varying Zero-Sum Games for a specific c = eΘ(1) and any compared base-learner s index i S1,x. Similarly, Algorithm 2 guarantees that y-player s dynamic regret with respect to any comparator sequence v1, . . . 
, v T n is at most t=1 x t Atyt + t=1 x t Atvt α(1 + P v T ) ηy j t=2 At At 1 2 + ηy j cβ t=2 xt xt 1 2 1 + t=2 yt,i yt 1,i 2 1 t=1 ( qt bqt+1 2 2 + qt bqt 2 2) λ i Sx qt,i yt,i yt 1,i 2 1 + e O(1), which also holds for any compared base-learner s index j S1,y. Proof. We consider the dynamic regret for x-player and similar results hold for y-player. First, we decompose the dynamic regret for x-player into the sum of the meta-regret and base-regret. Specifically, for any i S1,x, we have t=1 x t Atyt t=1 ut Atyt = t=1 x t Atyt t=1 x t,i Atyt | {z } meta-regret t=1 x t,i Atyt t=1 u t Atyt | {z } base-regret We now give upper bounds for the meta-regret and base-regret respectively. First, we consider the meta-regret, which is essentially the static regret with respect to any base-learner with an index i S1,x. Recall several notations introduced in the algorithm. For x-player, ℓx t = X t Atyt + λXδ t and mx t = X t At 1yt 1 + λXδ t , where Xt = [xt,1; xt,2; . . . , xt,|Sx|] and Xδ t = [ xt,1 xt 1,1 2 1; xt,2 xt 1,2 2 1; . . . ; xt,|Sx| xt 1,|Sx| 2 1]. Note that Sx S1,x S2,x with S1,x = [N] and S2,x = {N + 1, . . . , N + m}, where N = 1 2 log2 T + 1. The notations for y-player are similarly defined and we do not restate here for conciseness. According to the general result of Theorem 12 with ft(pt) = pt, X t Atyt + λXδ t , Mt = X t At 1yt 1 + λXδ t , and ut = ei |Sx| for all t [T], we have the following regret bound for the meta-algorithm, pt ei, X t Atyt + λXδ t t=2 εx t X t Atyt X t At 1yt 1 2 2 bpt ei 2 2 bpt+1 ei 2 2 pt bpt+1 2 2 + pt bpt 2 2 + e O(1) t=2 εx t Atyt At 1yt 1 2 bpt ei 2 2 bpt+1 ei 2 2 pt bpt+1 2 2 + pt bpt 2 2 + e O(1) Atyt At 1yt 1 2 q L2 + Pt 1 s=2 Asys As 1ys 1 2 + e O(1) pt bpt+1 2 2 + pt bpt 2 2 (by definition of εx t and maxp |Sx| p ei 2 2 e O(1)) v u u t L2 + t=2 Atyt At 1yt 1 2 + e O(1) L t=1 ( pt bpt+1 2 2 + pt bpt 2 2), No-Regret Learning in Time-Varying Zero-Sum Games where c1, c2 = eΘ(1) and the last step holds by Lemma 23. Next, we consider the base-regret. Since the base-algorithm Bi satisfies the DRVU property, the base-regret is upper bounded as follows: t=1 x t,i Atyt t=1 u t Atyt α(1 + P u T ) ηx i + ηx i β t=2 Atyt At 1yt 1 2 γ t=2 xt,i xt 1,i 2 1. Summing up the above two inequalities, we achieve the following dynamic regret guarantee for the x-player: t=1 x t Atyt t=1 u t Atyt v u u t L2 + t=2 Atyt At 1yt 1 2 + e O(1) L t=1 ( pt bpt+1 2 2 + pt bpt 2 2) + α(1 + P u T ) ηx i + ηx i β t=2 Atyt At 1yt 1 2 γ t=2 xt,i xt 1,i 2 1 t=2 xt,i xt 1,i 2 1 λ i Sx pt,i xt,i xt 1,i 2 1 + e O(1) O α(1 + P u T ) ηx i + ηx i (c2 2 + β) t=2 Atyt At 1yt 1 2 + λ γ t=2 xt,i xt 1,i 2 1 t=1 ( pt bpt+1 2 2 + pt bpt 2 2) λ i Sx pt,i xt,i xt 1,i 2 1 + e O(1) O α(1 + P u T ) ηx i t=2 At At 1 2 + ηx i cβ t=2 yt yt 1 2 1 + λ γ t=2 xt,i xt 1,i 2 1 t=1 ( pt bpt+1 2 2 + pt bpt 2 2) λ i Sx pt,i xt,i xt 1,i 2 1 + e O(1), where c = eΘ(1). This proves Eq. (35). Repeating the above analysis for y-player proves Eq. (36). The following lemma presents the stability lemma in terms of the non-stationarity measure PT (that is, the NE variation). We give the stability upper bounds from the aspects of meta-algorithm and final decisions. For simplicity, we assume that the DRVU parameters (α, β, γ) are all eΘ(1), which is indeed the case for standard algorithms as proven in Appendix D. Lemma 16 (NE-variation stability). Suppose that x-player follows Algorithm 1 and y-player follows Algorithm 2. Then, the following inequalities hold simultaneously. 
In the meta-algorithm aspect, we have t=1 pt bpt+1 2 2 e O p (1 + VT )(1 + PT ) + PT , (37) t=1 qt bqt+1 2 2 e O p (1 + VT )(1 + PT ) + PT ; (38) in the final decision aspect, we have t=2 xt xt 1 2 1 e O p (1 + VT )(1 + PT ) + PT , (39) t=2 yt yt 1 2 1 e O p (1 + VT )(1 + PT ) + PT . (40) No-Regret Learning in Time-Varying Zero-Sum Games Proof. Let (x t , y t ) denote the Nash equilibrium of the game matrix At at round t. Based on Lemma 15 with a choice of {ut}T t=1 = {x t }T t=1 and {vt}T t=1 = {y t }T t=1, we have t=1 x t Aty t t=1 x t Atyt t=1 x t Aty t t=1 x t Atyt + t=1 x t Atyt t=1 x t Atyt α(1 + P x T ) ηx i + α(1 + P y T ) ηy j + β(ηx i + ηy j )VT t=2 yt yt 1 2 1 + ηy j cβ t=2 xt xt 1 2 1 t=2 xt,i xt 1,i 2 1 + t=2 yt,j yt 1,j 2 1 t=1 ( pt bpt+1 2 2 + pt bpt 2 2 + qt bqt+1 2 2 + qt bqt 2 2) i Sx pt,i xt,i xt 1,i 2 1 λ j Sy qt,j yt,j yt 1,j 2 1 α(1 + P x T ) ηx i + α(1 + P y T ) ηy j + β(ηx i + ηy j )VT t=2 qt qt 1 2 1 + 2ηx i cβ j Sy yt,j yt 1,j 2 1 t=2 pt pt 1 2 1 + 2ηy j cβ i Sx xt,i xt 1,i 2 1 t=2 xt,i xt 1,i 2 1 + t=2 yt,j yt 1,j 2 1 t=1 ( pt bpt+1 2 2 + pt bpt 2 2 + qt bqt+1 2 2 + qt bqt 2 2) L t=2 ( pt pt 1 2 2 + qt qt 1 2 2) i Sx pt,i xt,i xt 1,i 2 1 λ j Sy qt,j yt,j yt 1,j 2 1 α(1 + P x T ) ηx i + α(1 + P y T ) ηy j + β(ηx i + ηy j )VT + 2ηx i cβ L t=2 qt qt 1 2 1 + 2ηy j cβ L t=2 pt pt 1 2 1 + (2ηx i cβ λ) j Sy qt,j yt,j yt 1,j 2 1 + 2ηy j cβ λ T X i Sx pt,i xt,i xt 1,i 2 1 t=2 xt,i xt 1,i 2 1 + t=2 yt,j yt 1,j 2 1 pt bpt+1 2 2 + pt bpt 2 2 + qt bqt+1 2 2 + qt bqt 2 2 . According to the choice of L = max n 4, 16cβ, q γ o = eΘ(1), λ = γL 2 , and ηx i , ηy j 1 L, it can be verified that 8 ; 2ηy j cβ L No-Regret Learning in Time-Varying Zero-Sum Games 2ηx i cβ λ = 2cβ 4 ; 2ηy j cβ λ = 2cβ 2ηx i ; λ γ 2ηy j . (41) In addition, since (x t , y t ) is the Nash equilibrium of At, it follows that x t Aty t x t Atyt x t Aty t x t Aty t = 0. Therefore, as α, β, γ = eΘ(1), we have the following inequalities simultaneously. t=2 pt pt 1 2 1 8 1 + P x T ηx i + 1 + P y T ηy j + (ηx i + ηy j )VT (1 + VT )(1 + PT ) + PT , where the last inequality is because we pick ηx i and ηy j to be the one such that ηx i [ 1 2ηx , 2ηx ], ηy j [ 1 2ηy , 2ηy ] where 1+P x T 1+VT , 1 and ηy = min q 1+P y T 1+VT , 1 . This is achievable based on the choice of our step size pool. Similarly, we have t=2 qt qt 1 2 1 e O p (1 + VT )(1 + PT ) + PT , t=1 pt bpt+1 2 1 e O p (1 + VT )(1 + PT ) + PT , t=1 qt bqt+1 2 1 e O p (1 + VT )(1 + PT ) + PT , i Sx pt,i xt,i xt 1,i 2 1 e O p (1 + VT )(1 + PT ) + PT , j Sy qt,j yt,j yt 1,j 2 1 e O p (1 + VT )(1 + PT ) + PT . In addition, note that t=2 xt xt 1 2 1 i Sx pt,ixt,i X i Sx pt 1,ixt 1,i i Sx pt,i(xt,i xt 1,i) X i Sx (pt 1,i pt,i)xt 1,i i Sx pt,i(xt,i xt 1,i) i Sx (pt,i pt 1,i)xt 1,i i Sx pt,i xt,i xt 1,i 1 i Sx |pt,i pt 1,i| xt 1,i 1 i Sx pt,i xt,i xt 1,i 1 i Sx |pt,i pt 1,i| xt 1,i 1 i Sx pt,i xt,i xt 1,i 2 1 + 2 t=2 pt pt 1 2 1, No-Regret Learning in Time-Varying Zero-Sum Games where the last inequality is by Cauchy-Schwarz inequality and xt,i 1 = 1 for all i Sx. Similarly, we have for y-player, t=2 yt yt 1 2 1 2 j Sy qt,j xt,j xt 1,j 2 1 + 2 t=2 qt qt 1 2 1. (43) Based on the above results, we further have t=2 xt xt 1 2 1 2 t=2 pt pt 1 2 1 + 2 i=1 pt,i xt,i xt 1,i 2 1 e O p (1 + VT )(1 + PT ) + PT , t=2 yt yt 1 2 1 2 t=2 qt qt 1 2 1 + 2 i=1 qt,i yt,i yt 1,i 2 1 e O p (1 + VT )(1 + PT ) + PT , which completes the proof. The following lemma shows the relationship between dynamic NE-regret and the individual regret. 
Lemma 17 (Dynamic-NE-regret-to-individual-regret conversation). For arbitrary sequences of {xt}T t=1, {yt}T t=1 and {At}T t=1, where xt m, yt n, and At Rm n, t [T], we have t=1 x t Atyt t=1 min x max y x Aty t=1 x t Atyt t=1 x Atyt, t=1 x t Aty t=1 x t Atyt + 2WT , (44) where WT = PT t=1 At A is the variance of the game matrices with A = 1 T PT t=1 At being the average game matrix and (x , y ) is a pair of Nash equilibrium of A. In the special case where At = A for all t [T], we have dynamic NE-regret bounded by the maximum of the two individual regrets as WT = 0. Proof. Suppose that PT t=1 x t Atyt PT t=1 minx maxy x Aty 0, then the dynamic NE-regret can be upper bounded as t=1 x t Atyt t=1 min x max y x Aty (let (x t , y t ) be the Nash equilibrium of At) t=1 x t Atyt t=1 x t Aty t (let (x , y ) be the Nash equilibrium of A) t=1 x t Atyt t=1 x t Aty (changing y t to y decreases the game value w.r.t. At) t=1 x t Atyt t=1 x t Ay + WT (shifting the payoff matrix from At to A) t=1 x t Atyt t=1 x Ay + WT (changing x t to x decreases the game value w.r.t. A) t=1 x t Atyt t=1 x Ayt + WT (changing y to yt decreases the game value w.r.t. A) t=1 x t Atyt t=1 x Atyt + 2WT . (shifting the payoff matrix from A to At) Similarly, when PT t=1 x t Atyt PT t=1 minx maxy x Aty 0, we can verify that t=1 x t Atyt t=1 min x max y x Aty (let (x t , y t ) be the Nash equilibrium of At) No-Regret Learning in Time-Varying Zero-Sum Games t=1 x t Aty t t=1 x t Atyt (let (x , y ) be the Nash equilibrium of A) t=1 x Aty t t=1 x t Atyt (changing x t to x increases the game value w.r.t. At) t=1 x t Atyt + WT (shifting the payoff matrix from At to A) t=1 x t Atyt + WT (changing y t to y increases the game value w.r.t. A) t=1 x t Atyt + WT (changing x to xt increases the game value w.r.t. A) t=1 x t Aty t=1 x t Atyt + 2WT . (shifting the payoff matrix from A to At) Combining the two cases yields the desired result. Next, we present the following stability lemma in terms of the non-stationarity measure WT (that is, the payoff variance). We give the stability upper bounds from the aspects of meta-algorithm and final decisions. Again, we assume that the DRVU parameters (α, β, γ) are all eΘ(1). Lemma 18 (Payoff-variance stability). Suppose that x-player follows Algorithm 1 and y-player follows Algorithm 2. Then, the following inequalities hold simultaneously. In the meta-algorithm aspect, we have t=1 pt bpt+1 2 2 e O (1 + WT ) , t=1 qt bqt+1 2 2 e O (1 + WT ) ; (45) in the final decision aspect, we have t=2 xt xt 1 2 1 e O (1 + WT ) , t=2 yt yt 1 2 1 e O (1 + WT ) . (46) Proof. Let A = 1 T PT t=1 At denote the average game matrix, and let (x , y ) be the Nash equilibrium of A. Then according to the saddle point property of (x , y ), we have for any xt m and yt n, x t Ay x Ayt 0. 
Therefore, based on Lemma 15 with ut = y and vt = x for all t [T], we have t=1 x t Aty t=1 x t Atyt + t=1 x t Atyt t=1 x Atyt + 2WT ηy j + β(ηx i + ηy j )VT t=2 yt yt 1 2 1 + ηy j cβ t=2 xt xt 1 2 1 t=2 xt,i xt 1,i 2 1 + t=2 yt,j yt 1,j 2 1 t=1 ( pt bpt+1 2 2 + pt bpt 2 2 + qt bqt+1 2 2 + qt bqt 2 2) No-Regret Learning in Time-Varying Zero-Sum Games i Sx pt,i xt,i xt 1,i 2 1 λ j Sy qt,j yt,j yt 1,j 2 1 + 2WT ηy j + β(ηx i + ηy j )VT t=2 qt qt 1 2 1 + 2ηx i cβ j Sy yt,j yt 1,j 2 1 t=2 pt pt 1 2 1 + 2ηy j cβ i Sx xt,i xt 1,i 2 1 t=2 xt,i xt 1,i 2 1 + t=2 yt,j yt 1,j 2 1 t=1 ( pt bpt+1 2 2 + pt bpt 2 2 + qt bqt+1 2 2 + qt bqt 2 2) L t=2 ( pt pt 1 2 2 + qt qt 1 2 2) i Sx pt,i xt,i xt 1,i 2 1 λ j Sy qt,j yt,j yt 1,j 2 1 + 2WT ηy j + β(ηx i + ηy j )VT + 2ηx i cβ L t=2 qt qt 1 2 1 + 2ηy j cβ L t=2 pt pt 1 2 1 + (2ηx i cβ λ) j Sy qt,j yt,j yt 1,j 2 1 + 2ηy j cβ λ T X i Sx pt,i xt,i xt 1,i 2 1 t=2 xt,i xt 1,i 2 1 + t=2 yt,j yt 1,j 2 1 t=1 ( pt bpt+1 2 2 + pt bpt 2 2 + qt bqt+1 2 2 + qt bqt 2 2) + 2WT . Based on the choice of L and λ, we can again verify the condition of Eq. (41), which leads to the following inequalities: t=2 pt pt 1 2 1 8 ηy j + (ηx i + ηy j )VT + WT 1 + VT + WT e O (1 + WT ) , where the second last inequality is because we pick ηx i and ηy j to be the one such that ηx i [ 1 2ηx , 2ηx ], ηy j [ 1 2ηy , 2ηy ] where ηx = min nq L o and ηy = min nq L o . This is achievable based on the choice of our step size pool. The last inequality holds because of AM-GM inequality and VT O(WT ). Similarly, we have t=2 qt qt 1 2 1 e O (1 + WT ) , t=1 pt bpt+1 2 1 e O (1 + WT ) , t=1 qt bqt+1 2 1 e O (1 + WT ) , i Sx pt,i xt,i xt 1,i 2 1 e O (1 + WT ) , j Sy qt,j yt,j yt 1,j 2 1 e O (1 + WT ) . No-Regret Learning in Time-Varying Zero-Sum Games Based on the above results, we further have t=2 xt xt 1 2 1 2 t=2 pt pt 1 2 1 + 2 i=1 pt,i xt,i xt 1,i 2 1 e O (1 + WT ) , t=2 yt yt 1 2 1 2 t=2 qt qt 1 2 1 + 2 i=1 qt,i yt,i yt 1,i 2 1 e O (1 + WT ) , which completes the proof. Building upon the stability of the decisions of both x-player and y-player proven in Lemma 16 and Lemma 18, we further show the variation of the (gradient) feedback received by both x-player and y-player. Lemma 19. Suppose x-player follows Algorithm 1 and y-player follows Algorithm 2. Then, the gradient variation can be bounded as follows: t=2 Atyt At 1yt 1 2 e O 1 + VT + min{PT , WT } , t=2 x t At x t 1At 1 2 e O 1 + VT + min{PT , WT } . Proof. The gradient variation of the x-player can be upper bounded as follows: t=2 Atyt At 1yt 1 2 2 t=2 Atyt At 1yt 2 + 2 t=2 At 1yt At 1yt 1 2 t=2 At At 1 2 + O t=2 yt yt 1 2 1 O(VT ) + e O min{ p 1 + VT + PT + PT , 1 + WT } e O (min{1 + VT + PT , 1 + VT + WT }) where the second last step holds by Lemma 16 and Lemma 18, and the last step makes use of AM-GM inequality. A similar argument can be applied to upper bound PT t=2 x t At x t 1At 1 2 . This ends the proof. The following lemma establishes a general result to relate the function-value difference between the sequence of piecewise minimizers and the sequence of each-round minimizers. Lemma 20. Let At Rm n and yt n for all t [T] with maxt [T ] At 1. Let I1, I2, . . . , IK be an even partition of the total horizon [T] with |Ik| = for k = 1, . . . , K (for simplicity, suppose the time horizon T is divisible by epoch length ). Denote x t = argminx m x Atyt for any t [T] and denote ut argminx m P τ Ik x Aτyτ for any t Ik. Then, we have t=1 u t Atyt t=1 x t Atyt 2 t=2 Atyt At 1yt 1 . (48) Proof. 
The proof follows the analysis of function-variation type worst-case dynamic regret (Zhang et al., 2020). For convenience, we introduce the notation ft(x) x Atyt to denote the each-round online function of x-player. Then, i=1 ft( x t ) = t Ik (ft(ut) ft( x t )) t Ik Atyt At 1yt 1 = 2 t=2 Atyt At 1yt 1 , No-Regret Learning in Time-Varying Zero-Sum Games where the inequality holds because of the setting of |Ik| = and the following fact about the instantaneous quantity: ft(ut) ft( x t ) ft( x t1) ft( x t ) (denote by t1 the starting time stamp of Ik) = ft( x t1) ft1( x t1) + ft1( x t1) ft( x t ) ft( x t1) ft1( x t1) + ft1( x t ) ft( x t ) (by optimality of x t1) t Ik sup x |ft(x) ft 1(x)| x (Atyt At 1yt 1) t Ik Atyt At 1yt 1 . Hence we finish the proof. G. Technical Lemmas Lemma 21 (Bregman proximal inequality (Chen & Teboulle, 1993, Lemma 3.2)). Let X be a convex set in a Banach space. Let f : X 7 R be a closed proper convex function on X. Given a convex regularizer ψ : X 7 R, we denote its induced Bregman divergence by Dψ( , ). Then, any update of the form xk = argminx X {f(x) + Dψ(x, xk 1)} satisfies the following inequality for any u X, f(xk) f(u) Dψ(u, xk 1) Dψ(u, xk) Dψ(xk, xk 1). (49) Lemma 22 (stability lemma (Chiang et al., 2012, Proposition 7)). Let x = argminx X a, x + Dψ(x, c) and x = argminx X a , x + Dψ(x, c). When the regularizer ψ : X 7 R is a 1-strongly convex function with respect to the norm , we have x x ( ψ(c) a) ( ψ(c) a ) = a a . Lemma 23 (variant of self-confident tuning (Pogodin & Lattimore, 2019, Lemma 4.8)). Let a1, a2, . . . , a T be non-negative real numbers. Then 1 + Pt 1 s=1 as 4 t=1 at + max t [T ] at.