# Online Learning in Unknown Markov Games

Yi Tian¹* Yuanhao Wang²* Tiancheng Yu¹* Suvrit Sra¹

*Equal contribution. ¹Department of EECS, MIT. ²Department of Computer Science, Princeton University. Correspondence to: Suvrit Sra. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

We study online learning in unknown Markov games, a problem that arises in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable. We show that in this challenging setting, achieving sublinear regret against the best response in hindsight is statistically hard. We then consider a weaker notion of regret by competing with the minimax value of the game, and present an algorithm that achieves a sublinear $O(K^{2/3})$ regret after $K$ episodes. This is the first sublinear regret bound (to our knowledge) for online learning in unknown Markov games. Importantly, our regret bound is independent of the size of the opponents' action spaces. As a result, even when the opponents' actions are fully observable, our regret bound improves upon existing analysis (e.g., (Xie et al., 2020)) by an exponential factor in the number of opponents.

1. Introduction

Multi-agent reinforcement learning (MARL) helps us model strategic decision-making problems in an interactive environment with multiple players. It has witnessed notable recent success (with two or more agents), e.g., in Go (Silver et al., 2016; 2017), video games (Vinyals et al., 2019), poker (Brown & Sandholm, 2018; 2019), and autonomous driving (Shalev-Shwartz et al., 2016). When studying MARL, Markov games (MGs) (Shapley, 1953) are often used as the computational model. Compared with Markov decision processes (MDPs) (Puterman, 2014), Markov games allow the players to influence the state transition and returns, and are thus capable of modeling competitive and collaborative behaviors that arise in MARL.

A fundamental problem in MGs is sample efficiency. Unlike MDPs, there are at least two key ways to measure performance in MGs: (1) the offline (self-play) setting, where we control both/all players and aim to minimize the number of episodes required to find a good policy; and (2) the online setting, where we can only control one player (which we refer to as our player), treat other players as opponents, and judge how our player performs in the whole process using regret. The offline setting is more useful when training players in a controllable environment (e.g., a simulator), and the online setting is more favorable for life-long learning.

When ensuring sample efficiency for MARL, key challenges arise from the observation model. We distinguish between two online settings. When learning in informed MGs, our player can observe the actions taken by the opponents. For learning in unknown MGs (Cesa-Bianchi & Lugosi, 2006), such observations are unavailable; information flows to our player only through the revealed returns and state transitions. We emphasize that both informed games and unknown games describe the observation process rather than our prior knowledge of the parameters: we always assume zero knowledge of the transition function of the MG.

Learning in unknown MGs is harder, more general, and potentially of greater practical relevance than learning in informed MGs. It is thus important to discover algorithms that can guarantee low regret. However, theoretical understanding for unknown MGs is rather limited.
Even the following fundamental question for analyzing online learning in unknown MGs is open:

Q1. Is sublinear regret achievable?

To see why learning in unknown MGs is challenging, notice that without observing an opponent's actions, we cannot learn the transition function of the MG, even with infinitely many episodes to collect data. Therefore, explore-then-commit types of algorithms cannot achieve sublinear regret. Another concern arises when the number of players involved increases, as then the effective size of the opponents' action space grows exponentially in it. Therefore, the following question is also crucial, even in (easier) informed MGs:

Q2. Can the regret be independent of the size of the opponents' action space?

Contributions. We answer both questions Q1 and Q2 affirmatively in this paper. At the heart of our answers lies an Optimistic Nash V-learning algorithm for online learning (V-OL) that we develop. This algorithm is significant in the following aspects:

- It achieves $O(K^{2/3})$ regret, the first sublinear regret bound for online learning in unknown MGs. This bound is nontrivial because without observing opponents' actions, we cannot learn the transition function of the MG, even with infinitely many episodes to collect data.
- Its regret does not depend on the size of the opponents' action space. This regret bound is also the first of its kind in the online setting, even for the (easier) informed MG setting. For $m$-player MGs, the effective size of the opponents' action space is $A^{m-1}$ with $A$ the size of each player's action space. Therefore, compared with existing algorithms (Xie et al., 2020), even in the informed setting, we save an exponential factor.
- It is computationally efficient. The computational complexity does not scale up as the number of players $m$ increases; existing algorithms such as (Xie et al., 2020) suffer space and time complexities exponential in $m$. Also, in existing algorithms, a subprocedure to find a Nash equilibrium in two-player zero-sum games is called in each step, which becomes the computational bottleneck. In sharp contrast, our algorithm does not require calling any such subprocedures.

The idea of Nash V-learning first appears in (Bai et al., 2020). We denote their original Nash V-learning algorithm by V-SP (SP is an acronym for self-play) to distinguish it from our algorithm V-OL. See the discussion at the end of Section 4 for a detailed comparison of the two algorithms.

Furthermore, although the weaker notion of regret (see Section 2) that we use has appeared in prior works (Brafman & Tennenholtz, 2002; Xie et al., 2020), it is not clear why this choice is statistically reasonable. We justify this notion of regret by showing that competing with the best response in hindsight is statistically hard (Section 3). Specifically, the regret can be exponential in the horizon $H$. This result also strengthens the computational lower bound in (Bai et al., 2020) for online learning in unknown MGs. As an intermediate step, we prove that competing with the optimal policy in hindsight is also statistically hard in MDPs with adversarial transitions under bandit feedback, which strengthens the computational lower bound in (Yadkori et al., 2013) under bandit feedback and is a result of independent interest.

1.1. Related work

Learning in MGs without strategic exploration.
A large body of literature focuses on solving known MGs (Littman, 1994; Hansen et al., 2013) or learning with a generative model (Jia et al., 2019; Sidford et al., 2020; Zhang et al., 2020a), using which we can sample transitions and returns for arbitrary state-action pairs. Littman (2001); Hu & Wellman (2003); Wei et al. (2017) do not assume a generative model, but their results only apply to communicating MGs.

Online MGs. Brafman & Tennenholtz (2002) propose R-max, which does not provide a regret guarantee in general. Xie et al. (2020) study this setting for two-player zero-sum games with linear function approximation using the same weaker definition of regret. They use a value iteration (VI) based algorithm and achieve $O(\sqrt{H^4 A^3 B^3 S^3 K})$ regret when translated into the tabular language, where $A$ and $B$ are the numbers of actions for the two players, $S$ is the number of states, and $H$ is the horizon. In Appendix C, we adapt the Optimistic Nash Q-learning algorithm (Q-SP) (Bai et al., 2020) to the online setting (Q-OL, Algorithm 3) and prove for Q-OL an $O(\sqrt{H^5 SABK})$ regret (Theorem 4). All three algorithms require observing the opponents' actions and thus cannot be applied to learning in unknown MGs.

Self-play. There is a recent line of work focusing on achieving near-optimal sample complexity in offline two-player zero-sum MGs (Bai & Jin, 2020; Xie et al., 2020; Bai et al., 2020; Liu et al., 2020). The goal is to find an $\epsilon$-approximate Nash equilibrium within $K$ episodes. VI-based methods (Bai & Jin, 2020; Xie et al., 2020) achieve $K = O(S^2AB/\epsilon^2)$. Q-SP (Bai et al., 2020) achieves $K = O(SAB/\epsilon^2)$, and the V-SP algorithm (Bai et al., 2020) achieves the best existing result $K = O(S(A+B)/\epsilon^2)$, matching the lower bound w.r.t. the dependence on $S$, $A$, $B$ and $\epsilon$. Note that in the self-play setting, we need to find good policies for both players, so the dependence on $B$ is inevitable. Extensions to multi-player general-sum games are discussed in (Liu et al., 2020), but the dependence on the number of players is exponential.

MDPs with adversarial transitions. Online MGs are closely related to adversarial MDPs. In general, competing with the optimal policy in hindsight in MDPs with adversarial transitions is intractable. With full-information feedback, the problem is computationally hard (Yadkori et al., 2013). With bandit feedback, the problem is statistically hard (Lemma 1). However, under additional structural assumptions, one can achieve low regret (Cheung et al., 2019).

MDPs with adversarial rewards. We can ensure sublinear regret if the transition is fixed (but unknown) and only the reward is chosen adversarially (Zimin & Neu, 2013; Rosenberg & Mansour, 2019; Jin et al., 2019). This yields another useful model for adversarial MDPs. The best existing result in adversarial episodic MDPs with bandit feedback and unknown transition is achieved in (Jin et al., 2019) with $O(\sqrt{H^3 S^2 AK})$ regret, where $H$ is the horizon.

Single-agent RL. Finally, there is an abundance of work on sample-efficient learning in MDPs. Jaksch et al. (2010) first adopt optimism to achieve efficient exploration in MDPs, and Jin et al. (2018) extend this idea to model-free methods. Azar et al. (2017) and Zhang et al. (2020b) achieve minimax regret bounds (up to log factors) of $O(\sqrt{H^3 SAK})$ for model-based and model-free methods, respectively.

2. Background and problem setup

For simplicity, we formulate the problem of two-player zero-sum MGs in this section and provide our algorithmic solution in Section 4.
Please see Section 5 for extensions to multi-player general-sum MGs.

2.1. Markov games: setup and notation

Model. We consider episodic two-player zero-sum MGs, where the max-player (min-player) aims to maximize (minimize) its cumulative return. Let $[H] := \{1, 2, \ldots, H\}$ for a positive integer $H$, and let $\Delta(X)$ be the set of probability distributions on a set $X$. Then such an MG is denoted by $\mathrm{MG}(\mathcal{S}, \mathcal{A}, \mathcal{B}, P, r, H)$, where $H \in \mathbb{N}_+$ is the number of steps in each episode, $\mathcal{S} = \bigcup_{h \in [H+1]} \mathcal{S}_h$ is the state space, and $\mathcal{A} = \bigcup_{h \in [H]} \mathcal{A}_h$ ($\mathcal{B} = \bigcup_{h \in [H]} \mathcal{B}_h$) is the action space of the max-player (min-player, resp.). $P$ is a collection of unknown transition functions $\{P_h : \mathcal{S}_h \times \mathcal{A}_h \times \mathcal{B}_h \to \Delta(\mathcal{S}_{h+1})\}_{h \in [H]}$, and $r$ is a collection of return functions $\{r_h : \mathcal{S}_h \times \mathcal{A}_h \times \mathcal{B}_h \to [0, 1]\}_{h \in [H]}$. The return $r$ is usually called the reward in MDPs, which a player aims to maximize. We will use the term return for MGs and reserve the term reward for (adversarial) MDPs. With a subscript $h$, let $\mathcal{S}_h, \mathcal{A}_h, \mathcal{B}_h, P_h, r_h$ denote the corresponding objects at step $h$. Let $|\cdot|$ denote the cardinality of a set; then define
$$S := \sup_{h \in [H]} |\mathcal{S}_h|, \qquad A := \sup_{h \in [H]} |\mathcal{A}_h|, \qquad B := \sup_{h \in [H]} |\mathcal{B}_h|.$$

Interaction protocol. In each episode, the MG starts at an adversarially chosen initial state $s_1 \in \mathcal{S}_1$. At each step $h \in [H]$, the two players observe the state $s_h \in \mathcal{S}_h$ and simultaneously take actions $a_h \in \mathcal{A}_h$, $b_h \in \mathcal{B}_h$; then the environment transitions to the next state $s_{h+1} \sim P_h(\cdot \mid s_h, a_h, b_h)$ and outputs the return $r_h(s_h, a_h, b_h)$. The max-player's policy $\mu$ specifies a distribution on $\mathcal{A}_h$ at each step $h$. Concretely, $\mu = \{\mu_h\}_{h \in [H]}$ where $\mu_h : \mathcal{S}_h \to \Delta(\mathcal{A}_h)$. Similarly we define the min-player's policy $\nu$.

Value functions. Analogously to MDPs, for a policy pair $(\mu, \nu)$, step $h \in [H]$, state $s \in \mathcal{S}_h$, and actions $a \in \mathcal{A}_h$, $b \in \mathcal{B}_h$, define the state value function and Q-value function as
$$V^{\mu,\nu}_h(s) := \mathbb{E}_{\mu,\nu}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}, b_{h'}) \,\Big|\, s_h = s\Big],$$
$$Q^{\mu,\nu}_h(s, a, b) := \mathbb{E}_{\mu,\nu}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}, b_{h'}) \,\Big|\, s_h = s, a_h = a, b_h = b\Big].$$
For compactness of notation, define the operators
$$[\mathbb{P}_h V](s, a, b) := \mathbb{E}_{s' \sim P_h(\cdot \mid s, a, b)}[V(s')], \qquad [\mathbb{D}_{\mu,\nu} Q](s) := \mathbb{E}_{a \sim \mu(\cdot \mid s),\, b \sim \nu(\cdot \mid s)}[Q(s, a, b)].$$
Then we have the following Bellman equations:
$$V^{\mu,\nu}_h(s) = [\mathbb{D}_{\mu_h,\nu_h} Q^{\mu,\nu}_h](s), \qquad Q^{\mu,\nu}_h(s, a, b) = (r_h + \mathbb{P}_h V^{\mu,\nu}_{h+1})(s, a, b).$$
For convenience, define $V^{\mu,\nu}_{H+1}(s) := 0$ for $s \in \mathcal{S}_{H+1}$.

Optimality. For a given min-player's policy $\nu$, there exists a best response $\mu^\dagger(\nu)$ to it, such that for any step $h \in [H]$ and state $s \in \mathcal{S}_h$,
$$V^{\dagger,\nu}_h(s) := V^{\mu^\dagger(\nu),\nu}_h(s) = \sup_{\mu} V^{\mu,\nu}_h(s).$$
Again, a symmetric discussion applies to the best response to a max-player's policy. The following minimax theorem holds for two-player zero-sum MGs: for any step $h \in [H]$ and state $s \in \mathcal{S}_h$,
$$\max_{\mu} \min_{\nu} V^{\mu,\nu}_h(s) = \min_{\nu} \max_{\mu} V^{\mu,\nu}_h(s).$$
Moreover, the best policies against the best responses,
$$\mu^\star \in \operatorname*{argmax}_{\mu} V^{\mu,\dagger}_1, \qquad \nu^\star \in \operatorname*{argmin}_{\nu} V^{\dagger,\nu}_1,$$
attain the minimax value. Such a policy pair is known as a Nash equilibrium (NE). We use $V^\star_h(s) := V^{\mu^\star,\nu^\star}_h(s)$ to denote the value at the NE, which is unique for the MG and which we call the minimax value of the MG.
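To make the minimax value $V^\star_h$ concrete, here is a minimal sketch (not part of this paper's algorithm, which never has model access) that computes it by backward induction in a small tabular MG, solving the matrix game $\max_{x \in \Delta(\mathcal{A}_h)} \min_{y \in \Delta(\mathcal{B}_h)} x^\top Q y$ at every $(h, s)$ with a linear program. The array layouts and the toy example are our own choices for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Value max_x min_y x^T Q y of a zero-sum matrix game (payoff to the max-player)."""
    A, B = Q.shape
    # Variables (x_1, ..., x_A, v); maximize v  <=>  minimize -v.
    c = np.zeros(A + 1)
    c[-1] = -1.0
    # For every min-player action b:  v - sum_a x_a Q[a, b] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    # x must be a probability distribution.
    A_eq = np.hstack([np.ones((1, A)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def minimax_values(P, r):
    """Backward induction for V*_h in a tabular zero-sum MG.

    P[h][s, a, b] is a distribution over next states (an array of length S) and
    r[h][s, a, b] is the return, for steps h = 0, ..., H-1 (0-indexed here).
    """
    H, S = len(P), P[0].shape[0]
    V = np.zeros((H + 1, S))                  # V[H] = 0 by convention
    for h in reversed(range(H)):
        for s in range(S):
            # Q*_h(s, a, b) = r_h(s, a, b) + E_{s' ~ P_h(.|s,a,b)}[V*_{h+1}(s')]
            Q = r[h][s] + P[h][s] @ V[h + 1]
            V[h, s] = matrix_game_value(Q)
    return V

# Tiny random example: H = 3 steps, S = 4 states, A = 2, B = 3 actions.
rng = np.random.default_rng(0)
H_, S_, A_, B_ = 3, 4, 2, 3
P = [rng.dirichlet(np.ones(S_), size=(S_, A_, B_)) for _ in range(H_)]
r = [rng.uniform(size=(S_, A_, B_)) for _ in range(H_)]
print("V*_1 at each initial state:", minimax_values(P, r)[0])
```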
2.2. Problem setup

We are now ready to formally define the problem of online learning in an unknown MG: we control the max-player, and in each step, only the state $s_h$ and return $r_h$ are revealed, but not the action $b_h$ of the min-player. Recall that if $b_h$ is also accessible, we call it the informed setting. Our goal is to maximize the expected cumulative return, or equivalently, to minimize the regret.

The conventional definition of regret is to compete against the best fixed policy in hindsight:
$$\mathrm{Regret}(K) := \sum_{k=1}^{K} \Big[\sup_{\mu} V^{\mu,\nu^k}_1(s^k_1) - V^{\mu^k,\nu^k}_1(s^k_1)\Big], \tag{1}$$
where the superscript $k$ denotes the corresponding objects in the $k$th episode. Although we use this compact notation, the regret depends on both $\mu^k$ and $\nu^k$.

However, even in the informed setting, achieving sublinear regret in this form is computationally hard (Bai et al., 2020). For online learning in unknown MGs, the problem is statistically hard (Section 3), and thus is still intractable even if we have infinite computational power. Therefore, by noting
$$\max_{\mu} V^{\mu,\nu^k}_1(s^k_1) \ge V^{\mu^\star,\nu^k}_1(s^k_1) \ge V^\star_1(s^k_1),$$
we consider a more modest goal, namely, to compete against the minimax value of the game, which has appeared in (Brafman & Tennenholtz, 2002). Specifically, we define the following regret (as used by Xie et al. (2020)):
$$\mathrm{Regret}(K) := \sum_{k=1}^{K} \Big[V^\star_1(s^k_1) - V^{\mu^k,\nu^k}_1(s^k_1)\Big]. \tag{2}$$
As a special case, if the opponent is omniscient and plays the best response $\nu^k = \nu^\dagger(\mu^k)$, then a sublinear regret guarantee for (2) implies a sample complexity guarantee to approximate a Nash equilibrium policy.

3. Statistical hardness of online learning in unknown MGs

As mentioned above, we use the minimax value of the game as the benchmark for online learning in unknown MGs. In contrast, in adversarial MDPs (Jin et al., 2019), it is more common to compete against the best policy in hindsight (using regret (1)). In this section, we justify our usage of the weaker notion of regret (2) by showing that, in general, competing against the best policy in hindsight is statistically intractable. In particular, we show that in this case, the regret has to be either linear in $K$ or exponential in $H$.

Theorem 1 (Statistical hardness for online learning in unknown MGs). For any $H \ge 2$ and $K \ge 1$, there exists a two-player zero-sum MG with horizon $H$, $|\mathcal{S}_h| \le 2$, $|\mathcal{A}_h| \le 2$, $|\mathcal{B}_h| \le 4$ such that any algorithm for unknown MGs suffers the following worst-case one-sided regret:
$$\sup_{\mu} \sum_{k=1}^{K} \Big[V^{\mu,\nu^k}_1(s_1) - \mathbb{E}_{\mu^k} V^{\mu^k,\nu^k}_1(s_1)\Big] \ge \Omega\big(\min\{\sqrt{2^H K},\, K\}\big).$$
In particular, any algorithm has to suffer linear regret unless $K \ge \Omega(2^H)$.

Here we give a sketch of our proof, while the full proof is deferred to Appendix A. We start by considering online learning in (single-agent) MDPs, where the reward and transition function in each episode are adversarially determined, and the goal is to compete against the best (fixed) policy in hindsight. In the following lemma we show that this problem is statistically hard; see Lemma 1 in the appendix for its formal statement.

Lemma (informal). For any algorithm, there exists a sequence of single-agent MDPs with horizon $H$, $S = O(H)$ states and $A = O(1)$ actions, such that the regret defined against the best policy in hindsight is $\Omega(\min\{\sqrt{2^H K}, K\})$.

Remark 1. The above lemma is different from a previous hardness result in (Yadkori et al., 2013), which states that this problem is computationally hard.

We now briefly explain how this family of hard MDPs is constructed, which is inspired by the combination lock MDP (Du et al., 2019). Every MDP $M_{X,Y}$ is specified by two $H$-bit strings $X, Y \in \{0, 1\}^H$. The states are $\{s_{0,0}, s_{0,1}, s_{1,1}, \ldots, s_{0,H}, s_{1,H}\}$. As shown in Figure 1, $M_{X,Y}$ has a layered structure, and the reward is nonzero only at the final layer. The only way to achieve the high reward is to follow the path $s_{0,0} \to s_{y_1,1} \to \cdots \to s_{y_H,H}$. Thus, the corresponding optimal policy is $\pi(s_{w,h}) = x_h \oplus w$, which is only a function of $X$. Here, $\oplus$ denotes the bitwise exclusive-or operator.

[Figure 1. Illustration of the MDP $M_{X,Y}$. For $y \in \{0, 1\}$, $\bar{y}$ stands for $1 - y$.]

Now, in each episode, $Y$ is chosen from a uniform distribution over $\{0, 1\}^H$ while $X$ is fixed. When the player interacts with $M_{X,Y}$, since $Y$ is uniformly random, it gets no effective feedback from the observed transitions, and the only informative feedback is the reward at the end. However, achieving the high reward requires guessing every bit of $X$ correctly. This needle-in-a-haystack situation makes the problem as hard as a multi-armed bandit problem with $2^H$ arms. The regret lower bound immediately follows.
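To illustrate the construction concretely, the following sketch simulates $M_{X,Y}$ under one possible set of conventions (the indexing and transition details are our own simplification of Figure 1): a mistake at any step is unrecoverable, and the return is 1 only if the final state lies on the $Y$-path. With $Y$ drawn uniformly in each episode, the observed state labels are fair coin flips under every policy, so only the terminal return carries information about $X$.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(X, Y, policy):
    """One episode in the hard MDP M_{X,Y} (simplified conventions).

    From an on-track state (w, h), the action X[h] XOR w keeps the agent on
    track; any mistake is unrecoverable.  The return is 1 iff the final state
    lies on the Y-path.
    """
    H = len(X)
    w, on_track = 0, True
    for h in range(H):
        a = policy(w, h)
        if on_track and a != (X[h] ^ w):
            on_track = False
        # Observed label of the next state: Y[h] if on track, 1 - Y[h] otherwise.
        w = Y[h] if on_track else 1 - Y[h]
    return int(on_track)

H = 8
X = rng.integers(0, 2, size=H)                 # the fixed secret string
hindsight_opt = lambda w, h: X[h] ^ w          # optimal policy: a function of X only
uninformed = lambda w, h: int(rng.integers(0, 2))

# With Y uniform in each episode, the observed labels are fair coins for *any*
# policy, so transitions reveal nothing about X; only the terminal return does.
for policy, name in [(hindsight_opt, "hindsight-optimal"), (uninformed, "uninformed")]:
    Ys = rng.integers(0, 2, size=(2000, H))    # a fresh uniform Y per episode
    avg = np.mean([rollout(X, Y, policy) for Y in Ys])
    print(f"{name:>17s} policy: average return {avg:.3f}")
```

Running this, the hindsight-optimal policy (which depends only on $X$) earns return 1 in every episode, while an $X$-agnostic policy succeeds with probability $2^{-H}$, mirroring the $2^H$-armed bandit reduction above.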
Next, we use the hard family of MDPs in Lemma 1 to prove Theorem 1 by reducing the adversarial MDP problem to online learning in unknown MGs. The construction is straightforward. The state space and the action space for the max-player are the same as those in the original MDP family. The min-player has control over the transition function and reward at each step, and executes a policy such that the induced MDP for the max-player is the same as $M_{X,Y}$. This is possible using only $B = O(1)$ actions since $M_{X,Y}$ has a layered structure. Online learning in unknown MGs then simulates online learning in the adversarial MDP problem, and thus has the same regret lower bound.

Classes of policies. In Section 2, we define the policy $\mu$ by mappings from $\mathcal{S}_h$ to a distribution on $\mathcal{A}_h$ at each step $h$. Such policies are called Markov policies (Bai et al., 2020). The policies induced by the algorithms in the remaining part of this paper are always Markov policies. However, our lower bound also holds for general policies (Bai et al., 2020). Here, for an informed max-player the input of $\mu_h$ can be the history $(s_1, a_1, b_1, r_1, \ldots, s_h)$, while for a max-player in an unknown MG the input of $\mu_h$ can be the history $(s_1, a_1, r_1, \ldots, s_h)$. In words, the lower bound holds even for policies that depend on histories.

Regret minimization in self-play. We emphasize that our lower bound applies to online learning in unknown MGs. In the self-play setting, as an intermediate step toward PAC guarantees (Bai & Jin, 2020; Bai et al., 2020; Xie et al., 2020), people indeed minimize the strong regret (1), or an even stronger notion called the duality gap, which is defined as
$$\mathrm{Gap}(K) := \sum_{k=1}^{K} \Big[V^{\dagger,\nu^k}_1(s^k_1) - V^{\mu^k,\dagger}_1(s^k_1)\Big] = \sum_{k=1}^{K} \Big[V^{\dagger,\nu^k}_1(s^k_1) - V^{\mu^k,\nu^k}_1(s^k_1)\Big] + \sum_{k=1}^{K} \Big[V^{\mu^k,\nu^k}_1(s^k_1) - V^{\mu^k,\dagger}_1(s^k_1)\Big],$$
where the two terms in the last equality are no smaller than the stronger regrets (1) of the two players, respectively. This is possible because in self-play both players run the policies specified by the algorithm designer. Therefore, they do not need to worry about the adversarial scenario described in the lower bound here.
4. The V-OL algorithm

In this section, we introduce the V-OL algorithm and its regret guarantees for online learning in two-player zero-sum unknown Markov games. We show that not only can we achieve a sublinear regret in this challenging setting, but the regret bound can also be independent of the size of the opponent's action space.

Algorithm 1 Optimistic Nash V-learning for Online Learning (V-OL)
1: Require: learning rate $\{\alpha_t\}_{t \ge 1}$, exploration bonus $\{\beta_t\}_{t \ge 1}$, policy update parameter $\{\eta_t\}_{t \ge 1}$
2: Initialize: for any $h \in [H]$, $s \in \mathcal{S}_h$, $a \in \mathcal{A}_h$: $V_h(s) \leftarrow H$, $L_h(s, a) \leftarrow 0$, $N_h(s) \leftarrow 0$, $\mu_h(a \mid s) \leftarrow 1/|\mathcal{A}_h|$
3: for episode $k = 1, \ldots, K$ do
4:   Receive $s_1$
5:   for step $h = 1, \ldots, H$ do
6:     Take action $a_h \sim \mu_h(\cdot \mid s_h)$
7:     Observe return $r_h$ and next state $s_{h+1}$
8:     Increase counter $t = N_h(s_h) \leftarrow N_h(s_h) + 1$
9:     $V_h(s_h) \leftarrow (1 - \alpha_t) V_h(s_h) + \alpha_t \big(r_h + V_{h+1}(s_{h+1}) + \beta_t\big)$
10:    for all actions $a \in \mathcal{A}_h$ do
11:      $\ell_h(s_h, a) \leftarrow \dfrac{\big(H - r_h - V_{h+1}(s_{h+1})\big)\,\mathbb{I}\{a_h = a\}}{\mu_h(a_h \mid s_h) + \eta_t}$
12:      $L_h(s_h, a) \leftarrow (1 - \alpha_t) L_h(s_h, a) + \alpha_t \ell_h(s_h, a)$
13:    end for
14:    Update the policy $\mu$ by $\mu_h(a \mid s_h) \leftarrow \dfrac{\exp\{-\eta_t L_h(s_h, a)/\alpha_t\}}{\sum_{a'} \exp\{-\eta_t L_h(s_h, a')/\alpha_t\}}$
15:   end for
16: end for

The V-OL algorithm. V-OL is a variant of V-learning algorithms. Bai et al. (2020) first propose V-SP as a near-optimal algorithm for the self-play setting of two-player zero-sum MGs. See the discussion at the end of this section for a detailed comparison between V-OL and V-SP. In V-OL (Algorithm 1), at each time step $h$, the player interacts with the environment, performs an incremental update to $V_h$, and updates its policy $\mu_h$. Note that the estimated value function $V_h$ is only used for the intermediate loss $\ell_h(s_h, \cdot)$ in this time step, but not used in decision making. To encourage exploration in less visited states, we add a bonus term $\beta_t$. As we will see in Section 6, this update rule is optimistic, i.e., $V_h$ is an upper confidence bound (UCB) on the minimax value $V^\star_h$ of the MG. Then the player samples the action according to the exponentially weighted averaged loss $L_h(s_h, \cdot)$, which is a popular decision rule in adversarial environments (Auer et al., 1995).

Intuition behind V-learning. Most existing provably efficient tabular RL algorithms learn a Q-table (a table consisting of Q-values). However, since state-action pairs are necessary for updating the Q-table, for online learning in MGs, algorithms based on it inevitably require observing the opponent's actions and are thus inapplicable to unknown MGs. By contrast, V-OL does not need to maintain the Q-table at all and bypasses this challenge naturally. Moreover, learning a Q-value function in two-player Markov games usually results in a regret or sample complexity that depends on its size $SAB$, whether in the self-play setting, such as VI-ULCB (Bai & Jin, 2020), Q-SP (Bai et al., 2020), and OMNI-VI-offline (Xie et al., 2020), or in the online setting, such as OMNI-VI-online (Xie et al., 2020) and Q-OL (Appendix C). By contrast, V-learning removes the dependence on $B$, as formalized in Theorem 2.

Note that we analyze Q-OL in Appendix C to more clearly demonstrate V-OL's advantage of avoiding learning a Q-table. Q-OL is a Q-learning-type algorithm for online MGs adapted from Q-SP. It updates the Q-values by a temporal-difference method like V-OL but makes decisions based on the Q-values instead. Therefore, Q-OL applies only to the informed setting and its regret depends on $AB$ (Theorem 4).
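The per-step update of Algorithm 1 (lines 8–14) is compact enough to state in code. The sketch below is our own re-implementation, not the authors' code: the learning rate follows $\alpha_t = \frac{GH+1}{GH+t}$ from (3), while the bonus $\beta_t$ and policy-update parameter $\eta_t$ are passed in as caller-supplied schedules, since their exact constants are tuned in (3). The environment interaction loop is assumed to live outside this class.

```python
import numpy as np
from collections import defaultdict

class VOLPlayer:
    """Illustrative sketch of the V-OL update (Algorithm 1) for the max-player."""

    def __init__(self, H, num_actions, G=1.0, beta_fn=None, eta_fn=None):
        self.H, self.A, self.G = H, num_actions, G
        self.V = defaultdict(lambda: float(H))                 # V_h(s), optimistic init
        self.L = defaultdict(lambda: np.zeros(num_actions))    # cumulative weighted losses L_h(s, .)
        self.N = defaultdict(int)                              # visit counts N_h(s)
        self.mu = defaultdict(lambda: np.full(num_actions, 1.0 / num_actions))
        self.beta_fn = beta_fn or (lambda t: 0.0)              # placeholder bonus schedule
        self.eta_fn = eta_fn or (lambda t: np.sqrt(1.0 / t))   # placeholder policy step size

    def act(self, h, s):
        # Line 6: sample a_h ~ mu_h(. | s_h).
        return np.random.choice(self.A, p=self.mu[(h, s)])

    def update(self, h, s, a, r, s_next):
        # Line 8: increment the visit counter.
        self.N[(h, s)] += 1
        t = self.N[(h, s)]
        alpha = (self.G * self.H + 1.0) / (self.G * self.H + t)   # learning rate in (3)
        beta, eta = self.beta_fn(t), self.eta_fn(t)
        V_next = self.V[(h + 1, s_next)] if h + 1 <= self.H else 0.0   # V_{H+1} = 0
        # Line 9: optimistic incremental value update.
        self.V[(h, s)] = (1 - alpha) * self.V[(h, s)] + alpha * (r + V_next + beta)
        # Lines 10-13: importance-weighted loss, nonzero only for the played action.
        loss = np.zeros(self.A)
        loss[a] = (self.H - r - V_next) / (self.mu[(h, s)][a] + eta)
        self.L[(h, s)] = (1 - alpha) * self.L[(h, s)] + alpha * loss
        # Line 14: exponential-weights policy update.
        logits = -eta * self.L[(h, s)] / alpha
        w = np.exp(logits - logits.max())                      # numerically stable softmax
        self.mu[(h, s)] = w / w.sum()
```

Note how the loss in lines 10–13 is importance-weighted only on the played action, which is what lets the update proceed without ever observing the opponent's action $b_h$.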
Favoring more recent samples. Despite the above-noted advantages of V-learning, the V-SP algorithm (Bai et al., 2020) may have a regret bound that is linear in $K$, as indicated by (4) in Theorem 2 and discussed in Section 6 in more detail. To resolve this problem, we adopt a different set of hyperparameters to learn more aggressively by giving more weight to more recent samples. Concretely, for the self-play setting, Bai et al. (2020) specify for V-SP the learning rate $\alpha_t = \frac{H+1}{H+t}$, together with an exploration bonus $\beta_t$ and a policy update parameter $\eta_t$ both decaying as $\sqrt{1/t}$, where $\iota$ is a log factor defined later and $c > 0$ is a constant. For the online setting, we set these hyperparameters as
$$\alpha_t = \frac{GH+1}{GH+t}, \tag{3}$$
with the exploration bonus $\beta_t$ and the policy update parameter $\eta_t$ scaled up by corresponding powers of $G$, where $G \ge 1$ is a quantity that we tune and $c > 0$ is a constant. Ostensibly, these changes may appear small, but they are essential to attaining a sublinear regret.

Remark 2. Compared with $\alpha_t = 1/t$, the learning rate $\alpha_t = \frac{H+1}{H+t}$ first proposed in (Jin et al., 2018) already favors more recent samples. Here we go one step further: our algorithm learns even more aggressively by taking $\alpha_t = \frac{GH+1}{GH+t}$ with $G \ge 1$. Moreover, we choose a larger $\eta_t$ to make our algorithm care more about more recently incurred losses; $\beta_t$ is set accordingly to achieve optimism.

We call this variant of V-learning V-OL, for which we prove the following regret guarantees.

Theorem 2 (Regret bounds). For any $p \in (0, 1)$, let $\iota = \log(HSAK/p)$. If we run V-OL with our hyperparameter specification (3) for some large constant $c > 0$ and $G \ge 1$ in an online two-player zero-sum MG, then with probability at least $1 - p$, the regret in $K$ episodes satisfies
$$\mathrm{Regret}(K) = O\Big(GH^3 S \iota^2 + \sqrt{GH^5 SAK\iota} + G^{-1} KH\Big). \tag{4}$$
In particular, by taking $G = \big(\tfrac{K}{H^3 SA}\big)^{1/3}$ if $K \ge H^3 SA$ and $G = K^{1/3}$ otherwise, with probability at least $1 - p$, the regret satisfies
$$\mathrm{Regret}(K) = \begin{cases} O\big(H^2 S^{1/3} A^{1/3} K^{2/3}\big), & \text{if } K \ge H^3 SA, \\ O\big(\sqrt{H^5 SA}\, K^{2/3} + H^3 S K^{1/3}\big), & \text{otherwise.} \end{cases}$$

Theorem 2 shows that a sublinear regret against the minimax value of the MG is achievable for online learning in unknown MGs. As expected, the regret bound does not depend on the size of the opponent's action space $B$. This independence of $B$ is particularly significant for large $B$, as is the case when our player plays with multiple opponents. Note that although in Theorem 2 setting the parameter $G$ requires knowledge of $K$ beforehand, we can use a standard doubling trick to bypass this requirement.

Remark 3. In V-SP the parameter $G$ is set to be $1$, in which case our choices of $\eta_t$ and $\beta_t$ reduce to their $G = 1$ counterparts. If the other player also adopts the corresponding new policy update parameter and exploration bonus, then the sample complexity of V-SP can actually be improved upon (Bai et al., 2020) by an $H$ factor to $O(H^5 S(A + B)/\epsilon^2)$.

Comparison between V-OL and V-SP. Apart from the difference in parameter choices, we now point out other differences between V-OL and V-SP.

1. To achieve near-optimal sample complexity in the self-play setting, V-SP needs to construct upper and lower confidence bounds not only for the minimax value of the game, but also for the best-response values. As a result, it uses a complicated certified policy technique, and must store the whole history of states and policies in the past $K$ episodes for resampling. By comparing with the minimax value directly, we can make V-OL provably efficient without extracting a certified policy. Therefore, V-OL only needs $O(HSA)$ space instead of $O(HSAK)$, and the resampling procedure is no longer necessary.

2. A key feature of the proof in (Bai et al., 2020) is to make full use of a symmetric structure, which naturally arises because in the self-play setting we can control both players to follow the same learning algorithm. However, this property no longer holds for the online setting, and we must take a different proof route. Algorithmically, V-OL learns more aggressively to be provably efficient.

3. V-OL also works in multi-player general-sum MGs; see Section 5.
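Before moving on, here is a quick numerical illustration of how $G$ balances the terms of (4). It uses the reconstructed form of the bound with constants and log factors dropped, and arbitrary hypothetical problem sizes; it is a back-of-the-envelope check, not an experiment.

```python
import numpy as np

def choose_G(H, S, A, K):
    """Tuning of G as in Theorem 2 (a doubling trick removes the need to know K)."""
    return (K / (H ** 3 * S * A)) ** (1 / 3) if K >= H ** 3 * S * A else K ** (1 / 3)

def regret_terms(H, S, A, K, G, iota=1.0):
    """The three terms of the bound (4), up to constants:
    G H^3 S iota^2,  sqrt(G H^5 S A K iota),  and  K H / G."""
    return (G * H ** 3 * S * iota ** 2,
            np.sqrt(G * H ** 5 * S * A * K * iota),
            K * H / G)

H, S, A, K = 10, 20, 5, 10 ** 7          # hypothetical sizes
G = choose_G(H, S, A, K)
t1, t2, t3 = regret_terms(H, S, A, K, G)
# At G = (K / (H^3 S A))^{1/3}, the sqrt(K)-type and K-type terms coincide,
# both scaling as H^2 (S A)^{1/3} K^{2/3}.
print(f"G = {G:.2f}; terms of (4): {t1:.3g}, {t2:.3g}, {t3:.3g}")
print(f"H^2 (SA)^(1/3) K^(2/3) = {H**2 * (S*A)**(1/3) * K**(2/3):.3g}")
```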
5. Multi-player general-sum games

In this section, we extend the regret guarantees of V-OL to multi-player general-sum MGs, demonstrating the generality of our algorithm. Notably, the result in multi-player MGs highlights the significance of removing the dependence on $B$ in the regret bound, which is now an exponential factor in the number of opponents.

Formally, consider the $m$-player general-sum MG
$$\mathrm{MG}_m\big(\mathcal{S}, \{\mathcal{A}_i\}_{i=1}^m, P, \{r_i\}_{i=1}^m, H\big), \tag{5}$$
where $\mathcal{S}$ and $H$ follow the same definitions as in two-player zero-sum MGs, and for each $i \in [m]$, player $i$ has its own action space $\mathcal{A}_i = \bigcup_{h \in [H]} \mathcal{A}_{i,h}$ and return function $r_i = \{r_{i,h} : \mathcal{S}_h \times \bigotimes_{i=1}^m \mathcal{A}_{i,h} \to [0, 1]\}_{h \in [H]}$, and aims to maximize its own cumulative return (here $\bigotimes$ denotes the Cartesian product of sets); $P$ is a collection of transition functions $\{P_h : \mathcal{S}_h \times \bigotimes_{i=1}^m \mathcal{A}_{i,h} \to \Delta(\mathcal{S}_{h+1})\}_{h \in [H]}$. As in two-player MGs, let $S := \sup_{h \in [H]} |\mathcal{S}_h|$ and $A_i := \sup_{h \in [H]} |\mathcal{A}_{i,h}|$ for all $i \in [m]$.

Online learning in an unknown multi-player general-sum MG can be reduced to that in a two-player zero-sum MG. Concretely, suppose we are player 1; then online learning in the unknown MG (5) is indistinguishable from that in the two-player zero-sum MG specified by $(\mathcal{S}, \mathcal{A}_1, \mathcal{B}, P, r_1, H)$ where $\mathcal{B} = \bigotimes_{i=2}^m \mathcal{A}_i$, since we only observe and care about player 1's return. For a policy pair $(\mu, \nu)$, define the value function using $r_1$ as
$$V^{\mu,\nu}_h(s) := \mathbb{E}_{\mu,\nu}\Big[\sum_{h'=h}^{H} r_{1,h'}(s_{h'}, a_{h'}, b_{h'}) \,\Big|\, s_h = s\Big],$$
and define the minimax value of player 1 as
$$V^\star_1(s) := \max_{\mu} \min_{\nu} V^{\mu,\nu}_1(s) = \min_{\nu} \max_{\mu} V^{\mu,\nu}_1(s),$$
which is no larger than the value at any Nash equilibrium of the multi-player general-sum MG. Then we define the regret against the minimax value of player 1 as
$$\mathrm{Regret}(K) := \sum_{k=1}^{K} \Big[V^\star_1(s^k_1) - V^{\mu^k,\nu^k}_1(s^k_1)\Big].$$
We argue that this notion of regret is reasonable, since we have control of only player 1 and all opponents may collude to compromise our performance. Then we immediately obtain the following corollary from Theorem 2.

Corollary 3 (Regret bound in multi-player MGs). For any $p \in (0, 1)$, let $A = A_1$ and $\iota = \log(HSAK/p)$. If we run V-OL with our hyperparameter specification (3) for some large constant $c > 0$ and the choice of $G$ in Theorem 2 for player 1 in the online multi-player general-sum MG (5), then with probability at least $1 - p$, the regret in $K$ episodes satisfies
$$\mathrm{Regret}(K) = \begin{cases} O\big(H^2 S^{1/3} A_1^{1/3} K^{2/3}\big), & \text{if } K \ge H^3 SA_1, \\ O\big(\sqrt{H^5 SA_1}\, K^{2/3} + H^3 S K^{1/3}\big), & \text{otherwise.} \end{cases}$$

In a multi-player MG, the size of the opponents' joint action space $B$ grows exponentially in the number of opponents. Corollary 3 shows that the regret of V-OL only depends on the size of our player's action space $A_1$. The savings arise because V-OL bypasses the need to learn Q-tables, and the multi-player setting makes no real difference in our analysis.

In the online informed setting, the same equivalence to a two-player zero-sum MG holds, since the other players' actions we observe can be seen as a single action $(a_i)_{i=2}^m$, and whether we observe the other players' returns does not help us decide our policies to maximize our own cumulative return. In this setting, the regret bound in (Xie et al., 2020) becomes $O\big(\sqrt{H^4 S^3 \prod_{i=1}^m A_i^3 K}\big)$, which depends exponentially on $m$. On the other hand, since the online informed setting has stronger assumptions than online learning in unknown MGs, the $O(H^2 S^{1/3} A_1^{1/3} K^{2/3})$ regret bound of V-OL carries over, which has no dependence on $m$. This sharp contrast highlights the importance of achieving a regret independent of the size of the opponent's action space.
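The following back-of-the-envelope comparison (constants and logarithmic factors ignored, problem sizes hypothetical) contrasts the informed-setting VI-type bound of (Xie et al., 2020), which contains $\prod_{i=1}^m A_i^3$, with the $m$-independent bound of Corollary 3 as the number of players grows.

```python
import numpy as np

H, S, K = 10, 20, 10 ** 6
A = 5                                   # each player's action-set size (hypothetical)
for m in [2, 3, 5, 8]:
    B = A ** (m - 1)                    # joint opponent action space: exponential in m
    informed_vi = np.sqrt(H ** 4 * S ** 3 * (A ** 3) ** m * K)   # VI-type bound, informed setting
    v_ol = H ** 2 * (S * A) ** (1 / 3) * K ** (2 / 3)            # Corollary 3, independent of m
    print(f"m={m}:  |B|={B:>6d}   VI-type bound ~ {informed_vi:.2e}   V-OL bound ~ {v_ol:.2e}")
```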
Furthermore, since in V-OL we only need to update the value function (which has $HS$ entries), rather than update the Q-table (which has $HS \prod_{i=1}^m A_i$ entries) as in (Xie et al., 2020), we also improve the time and space complexity by an exponential factor in $m$.

6. Proof sketch of Theorem 2

In this section, we sketch the proof of Theorem 2. We also highlight an observation that V-OL can perform much better than claimed in Theorem 2. Moreover, we expose the problem with V-SP in the online setting, which explains why we favor more recent samples in V-OL.

In the analysis below, we use a superscript $k$ to signify the corresponding quantities at the beginning of the $k$th episode. To express $V^k_h$ in Algorithm 1 compactly, we introduce the following quantities:
$$\alpha^0_t := \prod_{j=1}^{t} (1 - \alpha_j), \qquad \alpha^i_t := \alpha_i \prod_{j=i+1}^{t} (1 - \alpha_j).$$
Let $t := N^k_h(s)$ and suppose $s$ was previously visited at step $h$ in episodes $k_1, \ldots, k_t < k$. Then we can express $V^k_h(s)$ as
$$V^k_h(s) = \alpha^0_t H + \sum_{i=1}^{t} \alpha^i_t \Big(r_h(s, a^{k_i}_h, b^{k_i}_h) + V^{k_i}_{h+1}(s^{k_i}_{h+1}) + \beta_i\Big).$$
It is easy to verify that $\{\alpha^i_t\}_{i=1}^t$ satisfies the normalization property $\sum_{i=1}^{t} \alpha^i_t = 1$ for any sequence $\{\alpha_t\}_{t \ge 1}$ and any $t \ge 1$. Moreover, for the $\{\alpha_t\}_{t \ge 1}$ specified in (3), $\{\alpha^i_t\}$ has several other desirable properties (Lemma 2), resembling (Jin et al., 2018, Lemma 4.1).

Upper confidence bound (UCB). In Algorithm 1, through the bonus $\beta_t$ we ensure that $V^k_h$ is an entrywise UCB on $V^\star_h$ using standard techniques (Bai et al., 2020), building on the normalization property of $\{\alpha^i_t\}_{i=1}^t$ and the key V-learning lemma (Lemma 3), which is based on the regret bound of the adversarial bandit problem we solve to derive the policy update.

Remark 4. A main difference from the previous UCB framework (e.g., Azar et al. (2017)) is that here the gap between $V^k_h$ and $V^\star_h$ is not necessarily diminishing, which partially explains why we do not achieve the conventional $O(\sqrt{T})$ regret. Concretely, by taking $\mu = \mu^\star$ in the V-learning lemma (Lemma 3), we have (up to lower-order terms)
$$V^k_h(s) - V^\star_h(s) \gtrsim \sum_{i=1}^{t} \alpha^i_t \Big(\mathbb{D}_{\mu^\star_h, \nu^{k_i}_h}\big[r_h + \mathbb{P}_h V^{k_i}_{h+1}\big](s) - \mathbb{D}_{\mu^\star_h, \nu^\star_h}\big[r_h + \mathbb{P}_h V^\star_{h+1}\big](s)\Big)$$
$$= \sum_{i=1}^{t} \alpha^i_t\, \mathbb{D}_{\mu^\star_h, \nu^{k_i}_h}\big[\mathbb{P}_h (V^{k_i}_{h+1} - V^\star_{h+1})\big](s) + \sum_{i=1}^{t} \alpha^i_t \big(\mathbb{D}_{\mu^\star_h, \nu^{k_i}_h} - \mathbb{D}_{\mu^\star_h, \nu^\star_h}\big)\big[r_h + \mathbb{P}_h V^\star_{h+1}\big](s)$$
$$\overset{(i)}{\ge} \sum_{i=1}^{t} \alpha^i_t \big(\mathbb{D}_{\mu^\star_h, \nu^{k_i}_h} - \mathbb{D}_{\mu^\star_h, \nu^\star_h}\big)\big[r_h + \mathbb{P}_h V^\star_{h+1}\big](s),$$
where (i) follows from the above UCB. If the opponent is weak at some step $h \in [H]$, in the sense that for all episodes $k \in [K]$,
$$\big(\mathbb{D}_{\mu^\star_h, \nu^k_h} - \mathbb{D}_{\mu^\star_h, \nu^\star_h}\big)\big[r_h + \mathbb{P}_h V^\star_{h+1}\big](s) \ge C,$$
then $\sum_{k=1}^{K} \big(V^k_h(s) - V^\star_h(s)\big) \gtrsim CK$. This indicates that the gap between the sum of the UCBs and that of the minimax values can be linear in $K$. As proved below, we actually show that $\sum_{k=1}^{K} (V^k_1 - V^{\mu^k,\nu^k}_1)(s^k_1)$ is sublinear in $K$, which is much stronger than merely showing that the regret is sublinear when the opponent is weak. In words, V-OL performs much better than claimed in Theorem 2 against a weak opponent.

Regret bounds. Note that the above proof of the UCB holds for any $G > 0$. We now illustrate what problem appears if $G = 1$ and where the constraint $G \ge 1$ comes from. Let $\lesssim$ denote inequality up to multiplicative constants. Define $\delta^k_h := (V^k_h - V^{\mu^k,\nu^k}_h)(s^k_h)$. Then by the UCB, $\mathrm{Regret}(K) \lesssim \sum_{k=1}^{K} \delta^k_1$, so it suffices to bound $\sum_{k=1}^{K} \delta^k_1$. By the decomposition of $V^k_h$, the standard concentration inequality, and our choice of $\beta_t$, we have (with some lower-order terms hidden)
$$\delta^k_h \lesssim -\,\mathbb{D}_{\mu^k_h, \nu^k_h}\big[r_h + \mathbb{P}_h V^{\mu^k,\nu^k}_{h+1}\big](s^k_h) + \sum_{i=1}^{t} \alpha^i_t\, \mathbb{D}_{\mu^{k_i}, \nu^{k_i}}\big[r_h + \mathbb{P}_h V^{k_i}_{h+1}\big](s^k_h).$$
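Before treating the terms of this decomposition, the two properties of the weights $\alpha^i_t$ used below are easy to check numerically. The short sketch below (an illustration only, not part of the proof) verifies the normalization $\sum_{i=1}^t \alpha^i_t = 1$ and the regrouping constant $1 + \frac{1}{GH}$ for the learning rate $\alpha_t = \frac{GH+1}{GH+t}$.

```python
import numpy as np

def alpha_weights(t, G, H):
    """Weights alpha^i_t induced by the learning rate alpha_j = (GH+1)/(GH+j)."""
    N = G * H
    alpha = (N + 1.0) / (N + np.arange(1, t + 1))          # alpha_1, ..., alpha_t
    return np.array([alpha[i - 1] * np.prod(1.0 - alpha[i:t]) for i in range(1, t + 1)])

G, H = 3.0, 10
for t in [1, 5, 50, 500]:
    print(f"t={t:>3d}   sum_i alpha^i_t = {alpha_weights(t, G, H).sum():.6f}")  # equals 1

# Regrouping constant: the column sum over t >= i of alpha^i_t approaches 1 + 1/(GH),
# which is where the (1 + 1/(GH)) factor in the recursion below comes from.
i, T, N = 5, 200, G * H
val = (N + 1.0) / (N + i)              # alpha^i_i = alpha_i
total = val
for t in range(i + 1, T + 1):
    val *= 1.0 - (N + 1.0) / (N + t)   # alpha^i_t = alpha^i_{t-1} * (1 - alpha_t)
    total += val
print(f"sum over t >= {i} of alpha^i_t ~ {total:.4f}   vs   1 + 1/(GH) = {1 + 1/(G*H):.4f}")
```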
To treat the last term, we need the regrouping technique (see, e.g., (Jin et al., 2018)): for any quantity $f^k$ indexed by $k \in [K]$,
$$\sum_{k=1}^{K} \sum_{i=1}^{n^k_h} \alpha^i_{n^k_h} f^{k_i} \le \sum_{k=1}^{K} f^k \sum_{t=n^k_h}^{K} \alpha^{n^k_h}_t \le \Big(1 + \frac{1}{GH}\Big) \sum_{k=1}^{K} f^k,$$
where $n^k_h := N^k_h(s^k_h)$. Taking $f^k = \mathbb{D}_{\mu^{k}, \nu^{k}}[r_h + \mathbb{P}_h V^{k}_{h+1}](s^k_h)$ yields (with some lower-order terms hidden)
$$\sum_{k=1}^{K} \delta^k_h \lesssim \sum_{k=1}^{K} \Big(1 + \frac{1}{GH}\Big) \delta^k_{h+1} + \frac{K}{G},$$
where the additional $K/G$ term (not arising in the proof of V-SP) results from
$$\frac{1}{GH}\, \mathbb{D}_{\mu^k_h, \nu^k_h}\big[r_h + \mathbb{P}_h V^{\mu^k,\nu^k}_{h+1}\big](s^k_h) \le \frac{1}{GH} \cdot H = \frac{1}{G}.$$
Since $\sum_{k=1}^{K} \delta^k_{H+1} = 0$, a recursion over $h \in [H]$ for $\sum_{k=1}^{K} \delta^k_h$ yields a bound on $\sum_{k=1}^{K} \delta^k_1$ whose overall coefficient is $(1 + \frac{1}{GH})^H$. To bound this coefficient by $e$, we need $G \ge 1$. By standard pigeonhole arguments,
$$\sum_{k=1}^{K} \frac{1}{n^k_h} = \sum_{s \in \mathcal{S}_h} \sum_{n=1}^{N^K_h(s)} \frac{1}{n} \le S \log K \le S\iota.$$
Hence, we obtain
$$\sum_{k=1}^{K} \delta^k_1 \lesssim GH^3 S \iota^2 + \sqrt{GH^5 SAK\iota} + G^{-1} KH.$$
If we take $G = 1$ as in V-SP, this bound is linear in $K$ and therefore vacuous. To address this problem, we introduced the tunable parameter $G \ge 1$, which balances the $\sqrt{K}$-type and $K$-type terms in the above bound to yield a sublinear regret.

7. Conclusion and Future Work

In this paper, we study online learning in unknown Markov games using V-OL, which is based on the V-SP algorithm of Bai et al. (2020). V-OL achieves $O(K^{2/3})$ regret after $K$ episodes. Furthermore, the regret bound is independent of the size of the opponents' action space. It is still unclear whether one can achieve a sharper regret bound, which is a question worthy of future study. We briefly comment on two other future directions.

Toward $O(K^{1/2})$ regret in MDPs. A key reason why we need to learn more aggressively in online learning is that a symmetric structure (like that in the proof of V-SP) is absent. However, it exists if the opponent plays a fixed policy, in which case the Markov game becomes an MDP. To see why, we can imagine the opponent is also executing V-OL, which makes no difference since $B = 1$. However, even in that case, a gap remains: we can only upper and lower bound $V^\star_h$ but not $V^{\mu^k,\nu^k}_h$. Figuring out how to fill this gap would make V-OL the first policy-based algorithm, without an estimation of Q-value functions, that achieves $O(K^{1/2})$ regret for tabular RL.

Strong regret for MDPs with adversarial rewards. Another special case is MDPs with adversarial rewards, where the transitions are fixed across episodes. In this case, achieving sublinear regret under the strong regret (1) is possible (Jin et al., 2019). A question is then: does V-OL (or its variants) achieve sublinear regret under the strong regret? Given the many technical differences between MDPs with adversarial rewards and online Markov games, it is desirable to resolve these problems in a unified manner. In addition, the form of the model-free update in V-OL should be of independent interest for MDPs with adversarial rewards.

Acknowledgement

YT, TY, and SS acknowledge partial support from the NSF BIGDATA grant (number 1741341). We thank Yu Bai, Kefan Dong and Chi Jin for useful discussions.

References

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pp. 322–331. IEEE, 1995.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 263–272. JMLR.org, 2017.

Bai, Y. and Jin, C. Provable self-play algorithms for competitive reinforcement learning. arXiv preprint arXiv:2002.04017, 2020.

Bai, Y., Jin, C., and Yu, T. Near-optimal reinforcement learning with self-play. arXiv preprint arXiv:2006.12007, 2020.

Brafman, R. I.
and Tennenholtz, M. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.

Brown, N. and Sandholm, T. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Cheung, W. C., Simchi-Levi, D., and Zhu, R. Non-stationary reinforcement learning: The blessing of (more) optimism. Available at SSRN 3397818, 2019.

Du, S., Krishnamurthy, A., Jiang, N., Agarwal, A., Dudik, M., and Langford, J. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pp. 1665–1674, 2019.

Hansen, T. D., Miltersen, P. B., and Zwick, U. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1–16, 2013.

Hu, J. and Wellman, M. P. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

Jia, Z., Yang, L. F., and Wang, M. Feature-based Q-learning for two-player stochastic games. arXiv preprint arXiv:1906.00423, 2019.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863–4873, 2018.

Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. Learning adversarial MDPs with bandit feedback and unknown transition. arXiv preprint arXiv:1912, 2019.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Littman, M. L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163. Elsevier, 1994.

Littman, M. L. Friend-or-foe Q-learning in general-sum games. In ICML, volume 1, pp. 322–328, 2001.

Liu, Q., Yu, T., Bai, Y., and Jin, C. A sharp analysis of model-based reinforcement learning with self-play. arXiv preprint arXiv:2010.01604, 2020.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Rosenberg, A. and Mansour, Y. Online convex optimization in adversarial Markov decision processes. arXiv preprint arXiv:1905.07773, 2019.

Shalev-Shwartz, S., Shammah, S., and Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.

Shapley, L. S. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.

Sidford, A., Wang, M., Yang, L., and Ye, Y. Solving discounted stochastic two-player games with near-optimal time and sample complexity. In International Conference on Artificial Intelligence and Statistics, pp. 2992–3002, 2020.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge.
Nature, 550(7676):354–359, 2017.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems, pp. 4987–4997, 2017.

Xie, Q., Chen, Y., Wang, Z., and Yang, Z. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. arXiv preprint arXiv:2002.07066, 2020.

Yadkori, Y. A., Bartlett, P. L., Kanade, V., Seldin, Y., and Szepesvári, C. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, pp. 2508–2516, 2013.

Zhang, K., Kakade, S. M., Başar, T., and Yang, L. F. Model-based multi-agent RL in zero-sum Markov games with near-optimal sample complexity. arXiv preprint arXiv:2007.07461, 2020a.

Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. arXiv preprint arXiv:2004.10019, 2020b.

Zimin, A. and Neu, G. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, pp. 1583–1591, 2013.