# Near-Optimal Reinforcement Learning with Self-Play

Yu Bai (Salesforce Research, yu.bai@salesforce.com), Chi Jin (Princeton University, chij@princeton.edu), Tiancheng Yu (MIT, yutc@mit.edu)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract. This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with S states, A max-player actions, and B min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $O(S^2AB)$ steps of game playing, when only highlighting the dependency on (S, A, B). In contrast, the best existing lower bound scales as $\Omega(S(A+B))$, which has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $O(SAB)$, and a new Nash V-learning algorithm with sample complexity $O(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best response against a fixed opponent in Markov games, a learning objective different from finding the Nash equilibrium.

1 Introduction

A wide range of modern artificial intelligence challenges can be cast as multi-agent reinforcement learning (multi-agent RL) problems, in which more than one agent performs sequential decision making in an interactive environment. Multi-agent RL has achieved significant recent success on traditionally challenging tasks, for example the game of Go [30, 31], poker [6], real-time strategy games [33, 22], decentralized control and multi-agent robotics systems [5], autonomous driving [27], as well as complex social scenarios such as hide-and-seek [3]. In many scenarios, the learning agents even outperform the best human experts.

Despite this great empirical success, a major bottleneck for many existing RL algorithms is that they require a tremendous number of samples. For example, the biggest AlphaGo Zero model was trained on tens of millions of games and took more than a month to train [31]. While requiring such an amount of samples may be acceptable in simulatable environments such as Go, it is not so in other sample-expensive real-world settings such as robotics and autonomous driving. It is thus important to understand the sample complexity of RL: how can we design algorithms that find a near-optimal policy with a small number of samples, and what is the fundamental limit, i.e. the minimum number of samples required for any algorithm to find a good policy?

Theoretical understanding of the sample complexity of multi-agent RL is rather limited, especially when compared with the single-agent setting. The standard model for the single-agent setting is an episodic Markov Decision Process (MDP) with S states, A actions, and H steps per episode. The best known algorithm can find an $\epsilon$-near-optimal policy in $\Theta(\mathrm{poly}(H)SA/\epsilon^2)$ episodes, which matches the lower bound up to a single H factor [1, 8]. In contrast, in multi-agent settings, the optimal sample complexity remains open even in the basic setting of two-player tabular Markov games
[28], where the agents are required to find the solutions of the games, i.e. the Nash equilibria. The best known algorithm, VI-ULCB, finds an $\epsilon$-approximate Nash equilibrium in $O(\mathrm{poly}(H)S^2AB/\epsilon^2)$ episodes [2], where B is the number of actions for the other player. The information-theoretic lower bound is $\Omega(\mathrm{poly}(H)S(A+B)/\epsilon^2)$. Specifically, the number of episodes required by the algorithm scales quadratically in both S and (A, B), and exhibits a gap from the linear dependency in the lower bound. This motivates the following question:

Can we design algorithms with near-optimal sample complexity for learning Markov games?

In this paper, we present the first line of near-optimal algorithms for two-player Markov games that match the aforementioned lower bound up to a poly(H) factor. This resolves the open problem of achieving the optimal sample complexity in its (S, A, B) dependency. Our algorithms learn by playing against themselves without requiring any direct supervision, and are thus self-play algorithms.

1.1 Our contributions

We propose an optimistic variant of Nash Q-learning [11], and prove that it achieves sample complexity $O(H^5SAB/\epsilon^2)$ for finding an $\epsilon$-approximate Nash equilibrium in two-player Markov games (Section 3). Our algorithm builds optimistic upper and lower estimates of Q-values, and computes a Coarse Correlated Equilibrium (CCE) over this pair of Q estimates as its execution policy for both players.

We design a new algorithm, Nash V-learning, for finding approximate Nash equilibria, and show that it achieves sample complexity $O(H^6S(A+B)/\epsilon^2)$ (Section 4). This improves upon Nash Q-learning when min{A, B} > H. It is also the first result that matches the minimax lower bound up to only a poly(H) factor. This algorithm builds optimistic upper and lower estimates of V-values, and features a novel combination of Follow-the-Regularized-Leader (FTRL) and the standard Q-learning algorithm to determine its execution policies.

Apart from finding Nash equilibria, we prove that learning the best response against a fixed opponent in Markov games is as hard as learning parity with noise, a notoriously difficult problem that is believed to be computationally hard (Section 5). As a corollary, this hardness result directly implies that achieving sublinear regret against adversarial opponents in Markov games is also computationally hard, a result that first appeared in [25]. This in turn rules out the possibility of designing efficient algorithms for finding Nash equilibria by running no-regret algorithms for each player separately.

In addition to the above contributions, this paper also features a novel approach of extracting certified policies from the estimates produced by reinforcement learning algorithms such as Nash Q-learning and Nash V-learning; these policies are certified to have performance similar to Nash equilibrium policies, even when facing their best responses (see Section 3 for more details). We believe this technique could be of broader interest to the community.

1.2 Related Work

Markov games. Markov games (or stochastic games) were proposed in the early 1950s [28]. They are widely used to model multi-agent RL. Learning the Nash equilibria of Markov games has been studied in classical work [18, 19, 11, 10], where the transition matrix and reward are assumed to be known, or in the asymptotic setting where the amount of data goes to infinity.
These results do not directly apply to the non-asymptotic setting where the transition and reward are unknown and only a limited amount of data is available for estimating them.

A recent line of work tackles self-play algorithms for Markov games in the non-asymptotic setting under strong reachability assumptions. Specifically, Wei et al. [35] assume that no matter what strategy one agent sticks to, the other agent can always reach all states by playing a certain policy, while Jia et al. [13] and Sidford et al. [29] assume access to simulators (or generative models) that enable the agent to directly sample transition and reward information for any state-action pair. These settings ensure that all states can be reached directly, so no sophisticated exploration is required.

Very recently, [2, 36] study learning Markov games without these reachability assumptions, where exploration becomes essential. However, both results suffer from highly suboptimal sample complexity. We compare them with our results in Table 1. The results of [36] also apply to the linear function approximation setting. We remark that the R-max algorithm [4] does provide provable guarantees for learning Markov games, even in the setting of playing against an adversarial opponent, but under a definition of regret that is weaker than the standard regret. Their result does not imply any sample complexity result for finding Nash equilibrium policies.

Table 1: Sample complexity (the required number of episodes) for algorithms to find ε-approximate Nash equilibrium policies in zero-sum Markov games.

| Algorithm | Sample Complexity | Runtime |
| --- | --- | --- |
| VI-ULCB [2] | $O(H^4S^2AB/\epsilon^2)$ | PPAD-complete |
| VI-explore [2] | $O(H^5S^2AB/\epsilon^2)$ | Polynomial |
| OMVI-SM [36] | $O(H^4S^3A^3B^3/\epsilon^2)$ | Polynomial |
| Optimistic Nash Q-learning | $O(H^5SAB/\epsilon^2)$ | Polynomial |
| Optimistic Nash V-learning | $O(H^6S(A+B)/\epsilon^2)$ | Polynomial |
| Lower bound [14, 2] | $\Omega(H^3S(A+B)/\epsilon^2)$ | - |

Adversarial MDP. Another line of related work focuses on provably efficient algorithms for adversarial MDPs. Most work in this line considers the setting with adversarial rewards [38, 26, 15], because adversarial MDPs with changing dynamics are computationally hard even under full-information feedback [37]. These results do not directly imply provable self-play algorithms in our setting, because the opponent in a Markov game can affect both the reward and the transition.

Single-agent RL. There is a rich literature on reinforcement learning in MDPs [see e.g. 12, 24, 1, 7, 32, 14]. An MDP is a special case of a Markov game, where only a single agent interacts with a stochastic environment. For the tabular episodic setting with nonstationary dynamics and no simulators, the best sample complexities achieved by existing model-based and model-free algorithms are $O(H^3SA/\epsilon^2)$ [1] and $O(H^4SA/\epsilon^2)$ [14] respectively, where S is the number of states, A is the number of actions, and H is the length of each episode. Both of them (nearly) match the lower bound $\Omega(H^3SA/\epsilon^2)$ [12, 23, 14].

2 Preliminaries

We consider zero-sum Markov games (MGs) [28, 18], which are also known as stochastic games in the literature. Zero-sum Markov games are a generalization of standard Markov Decision Processes (MDPs) to the two-player setting, in which the max-player seeks to maximize the total return and the min-player seeks to minimize the total return.
Formally, we denote a tabular episodic Markov game by MG(H, $\mathcal{S}$, $\mathcal{A}$, $\mathcal{B}$, $\mathbb{P}$, r), where H is the number of steps in each episode, $\mathcal{S}$ is the set of states with $|\mathcal{S}| \le S$, $(\mathcal{A}, \mathcal{B})$ are the sets of actions of the max-player and the min-player respectively, $\mathbb{P} = \{\mathbb{P}_h\}_{h\in[H]}$ is a collection of transition matrices, so that $\mathbb{P}_h(\cdot|s,a,b)$ gives the distribution over states if action pair $(a,b)$ is taken at state $s$ at step $h$, and $r = \{r_h\}_{h\in[H]}$ is a collection of reward functions, where $r_h : \mathcal{S}\times\mathcal{A}\times\mathcal{B} \to [0,1]$ is the deterministic reward function at step $h$.¹

In each episode of this MG, we start with a fixed initial state $s_1$. Then, at each step $h \in [H]$, both players observe the state $s_h \in \mathcal{S}$, and the max-player picks action $a_h \in \mathcal{A}$ while the min-player picks action $b_h \in \mathcal{B}$ simultaneously. Both players observe the actions of their opponents, receive the reward $r_h(s_h,a_h,b_h)$, and then the environment transitions to the next state $s_{h+1} \sim \mathbb{P}_h(\cdot|s_h,a_h,b_h)$. The episode ends when $s_{H+1}$ is reached.

¹We assume the rewards are in [0, 1] for normalization. Our results directly generalize to randomized reward functions, since learning the transition is more difficult than learning the reward.

Markov policy, value function. A Markov policy $\mu$ of the max-player is a collection of $H$ functions $\{\mu_h : \mathcal{S} \to \Delta_{\mathcal{A}}\}_{h\in[H]}$, each of which maps a state to a distribution over actions. Here $\Delta_{\mathcal{A}}$ is the probability simplex over the action set $\mathcal{A}$. Similarly, a policy $\nu$ of the min-player is a collection of $H$ functions $\{\nu_h : \mathcal{S} \to \Delta_{\mathcal{B}}\}_{h\in[H]}$. We use the notation $\mu_h(a|s)$ and $\nu_h(b|s)$ to denote the probability of taking action $a$ or $b$ at state $s$ at step $h$ under Markov policy $\mu$ or $\nu$ respectively.

We use $V^{\mu,\nu}_h : \mathcal{S} \to \mathbb{R}$ to denote the value function at step $h$ under policies $\mu$ and $\nu$, so that $V^{\mu,\nu}_h(s)$ gives the expected cumulative rewards received under policies $\mu$ and $\nu$, starting from $s$ at step $h$:

$$V^{\mu,\nu}_h(s) := \mathbb{E}_{\mu,\nu}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'},a_{h'},b_{h'}) \,\Big|\, s_h = s\Big]. \qquad (1)$$

We also define $Q^{\mu,\nu}_h : \mathcal{S}\times\mathcal{A}\times\mathcal{B} \to \mathbb{R}$ to denote the Q-value function at step $h$, so that $Q^{\mu,\nu}_h(s,a,b)$ gives the expected cumulative rewards received under policies $\mu$ and $\nu$, starting from $(s,a,b)$ at step $h$:

$$Q^{\mu,\nu}_h(s,a,b) := \mathbb{E}_{\mu,\nu}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'},a_{h'},b_{h'}) \,\Big|\, s_h = s, a_h = a, b_h = b\Big]. \qquad (2)$$

For simplicity, we use the operator notation $\mathbb{P}_h$ so that $[\mathbb{P}_h V](s,a,b) := \mathbb{E}_{s'\sim\mathbb{P}_h(\cdot|s,a,b)} V(s')$ for any value function $V$. We also use the notation $[\mathbb{D}_\pi Q](s) := \mathbb{E}_{(a,b)\sim\pi(\cdot,\cdot|s)} Q(s,a,b)$ for any action-value function $Q$. By the definition of value functions, we have the Bellman equations

$$Q^{\mu,\nu}_h(s,a,b) = (r_h + \mathbb{P}_h V^{\mu,\nu}_{h+1})(s,a,b), \qquad V^{\mu,\nu}_h(s) = (\mathbb{D}_{\mu_h\times\nu_h} Q^{\mu,\nu}_h)(s)$$

for all $(s,a,b,h) \in \mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]$. We define $V^{\mu,\nu}_{H+1}(s) = 0$ for all $s \in \mathcal{S}$.

Best response and Nash equilibrium. For any Markov policy $\mu$ of the max-player, there exists a best response of the min-player, which is a Markov policy $\nu^\dagger(\mu)$ satisfying $V^{\mu,\nu^\dagger(\mu)}_h(s) = \inf_\nu V^{\mu,\nu}_h(s)$ for any $(s,h) \in \mathcal{S}\times[H]$. Here the infimum is taken over all possible policies, which are not necessarily Markovian (we define general policies later in this section). We define $V^{\mu,\dagger}_h := V^{\mu,\nu^\dagger(\mu)}_h$. By symmetry, we can also define $\mu^\dagger(\nu)$ and $V^{\dagger,\nu}_h$. It is further known (cf. [9]) that there exist Markov policies $\mu^\star$, $\nu^\star$ that are optimal against the best responses of their opponents, in the sense that

$$V^{\mu^\star,\dagger}_h(s) = \sup_\mu V^{\mu,\dagger}_h(s), \qquad V^{\dagger,\nu^\star}_h(s) = \inf_\nu V^{\dagger,\nu}_h(s), \qquad \text{for all } (s,h).$$

We call these optimal strategies $(\mu^\star,\nu^\star)$ the Nash equilibrium of the Markov game, which satisfies the following minimax equation:²

$$\sup_\mu \inf_\nu V^{\mu,\nu}_h(s) = V^{\mu^\star,\nu^\star}_h(s) = \inf_\nu \sup_\mu V^{\mu,\nu}_h(s).$$

²The minimax theorem here is different from the one for matrix games, i.e. $\max_\phi \min_\psi \phi^\top A \psi = \min_\psi \max_\phi \phi^\top A \psi$ for any matrix A, since here $V^{\mu,\nu}_h(s)$ is in general not bilinear in $\mu$ and $\nu$.
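To make the operator notation and the Bellman equations above concrete, here is a minimal sketch that evaluates a fixed pair of Markov policies $(\mu,\nu)$ by backward induction over a tabular model. The function name, array layout, and shapes are illustrative choices for this sketch, not part of the paper.

```python
import numpy as np

def evaluate_policies(P, r, mu, nu):
    """Backward-induction evaluation of a fixed Markov policy pair (mu, nu).

    Illustrative shapes:
      P  : (H, S, A, B, S)  transitions, P[h, s, a, b, s'] = P_h(s' | s, a, b)
      r  : (H, S, A, B)     deterministic rewards in [0, 1]
      mu : (H, S, A)        max-player policy, mu[h, s, a] = mu_h(a | s)
      nu : (H, S, B)        min-player policy, nu[h, s, b] = nu_h(b | s)
    Returns Q of shape (H, S, A, B) and V of shape (H + 1, S) with V[H] = 0.
    """
    H, S, A, B, _ = P.shape
    Q = np.zeros((H, S, A, B))
    V = np.zeros((H + 1, S))                 # V_{H+1} = 0, as in the text
    for h in reversed(range(H)):
        # Bellman equation: Q_h = r_h + [P_h V_{h+1}] (expectation over the next state)
        Q[h] = r[h] + P[h] @ V[h + 1]
        # V_h(s) = [D_{mu_h x nu_h} Q_h](s): expectation over the product policy
        V[h] = np.einsum("sab,sa,sb->s", Q[h], mu[h], nu[h])
    return Q, V
```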
Intuitively, a Nash equilibrium gives a solution in which no player has anything to gain by changing only her own policy. We further abbreviate the values of the Nash equilibrium, $V^{\mu^\star,\nu^\star}_h$ and $Q^{\mu^\star,\nu^\star}_h$, as $V^\star_h$ and $Q^\star_h$. We refer readers to Appendix A for the Bellman optimality equations for the values of best responses and Nash equilibria.

General (non-Markovian) policy. In certain situations, it is beneficial to consider general, history-dependent policies that are not necessarily Markovian. A (general) policy $\mu$ of the max-player is a set of $H$ maps $\mu := \{\mu_h : \mathbb{R}\times(\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times\mathbb{R})^{h-1}\times\mathcal{S} \to \Delta_{\mathcal{A}}\}_{h\in[H]}$ from a random number $z \in \mathbb{R}$ and a history of length $h$, say $(s_1,a_1,b_1,r_1,\ldots,s_h)$, to a distribution over actions in $\mathcal{A}$. By symmetry, we can also define a (general) policy $\nu$ of the min-player by replacing the action set $\mathcal{A}$ in the definition with $\mathcal{B}$. The random number $z$ is sampled from some underlying distribution $\mathcal{D}$, but may be shared among all steps $h \in [H]$.

For a pair of general policies $(\mu,\nu)$, we can still use definition (1) to define their value $V^{\mu,\nu}_1(s_1)$ at step 1. We can also define the best response $\nu^\dagger(\mu)$ of a general policy $\mu$ as the minimizing policy, so that $V^{\mu,\dagger}_1(s_1) \equiv V^{\mu,\nu^\dagger(\mu)}_1(s_1) = \inf_\nu V^{\mu,\nu}_1(s_1)$ at step 1. We remark that the best response of a general policy is not necessarily Markovian.

Algorithm 1 Optimistic Nash Q-learning
1: Initialize: for any $(s,a,b,h)$: $\overline{Q}_h(s,a,b) \leftarrow H$, $\underline{Q}_h(s,a,b) \leftarrow 0$, $N_h(s,a,b) \leftarrow 0$, $\pi_h(a,b|s) \leftarrow 1/(AB)$.
2: for episode $k = 1,\ldots,K$ do
3:   receive $s_1$.
4:   for step $h = 1,\ldots,H$ do
5:     take action $(a_h,b_h) \sim \pi_h(\cdot,\cdot|s_h)$.
6:     observe reward $r_h(s_h,a_h,b_h)$ and next state $s_{h+1}$.
7:     $t = N_h(s_h,a_h,b_h) \leftarrow N_h(s_h,a_h,b_h) + 1$.
8:     $\overline{Q}_h(s_h,a_h,b_h) \leftarrow (1-\alpha_t)\overline{Q}_h(s_h,a_h,b_h) + \alpha_t\big(r_h(s_h,a_h,b_h) + \overline{V}_{h+1}(s_{h+1}) + \beta_t\big)$
9:     $\underline{Q}_h(s_h,a_h,b_h) \leftarrow (1-\alpha_t)\underline{Q}_h(s_h,a_h,b_h) + \alpha_t\big(r_h(s_h,a_h,b_h) + \underline{V}_{h+1}(s_{h+1}) - \beta_t\big)$
10:    $\pi_h(\cdot,\cdot|s_h) \leftarrow \mathrm{CCE}\big(\overline{Q}_h(s_h,\cdot,\cdot), \underline{Q}_h(s_h,\cdot,\cdot)\big)$
11:    $\overline{V}_h(s_h) \leftarrow (\mathbb{D}_{\pi_h}\overline{Q}_h)(s_h)$; $\underline{V}_h(s_h) \leftarrow (\mathbb{D}_{\pi_h}\underline{Q}_h)(s_h)$.

Learning objective. There are two possible learning objectives in the setting of Markov games. The first is to find the best response to a fixed opponent. Without loss of generality, we consider the case where the learning agent is the max-player and the min-player is the opponent.

Definition 1 ($\epsilon$-approximate best response). For an opponent with a fixed unknown general policy $\nu$, a general policy $\hat\mu$ is an $\epsilon$-approximate best response if $V^{\dagger,\nu}_1(s_1) - V^{\hat\mu,\nu}_1(s_1) \le \epsilon$.

The second goal is to find a Nash equilibrium of the Markov game. We measure the suboptimality of any pair of general policies $(\hat\mu,\hat\nu)$ by the gap between their performance and the performance of the optimal strategy (i.e. the Nash equilibrium) when playing against the respective best responses:

$$V^{\dagger,\hat\nu}_1(s_1) - V^{\hat\mu,\dagger}_1(s_1) = \Big[V^{\dagger,\hat\nu}_1(s_1) - V^\star_1(s_1)\Big] + \Big[V^\star_1(s_1) - V^{\hat\mu,\dagger}_1(s_1)\Big].$$

Definition 2 ($\epsilon$-approximate Nash equilibrium). A pair of general policies $(\hat\mu,\hat\nu)$ is an $\epsilon$-approximate Nash equilibrium if $V^{\dagger,\hat\nu}_1(s_1) - V^{\hat\mu,\dagger}_1(s_1) \le \epsilon$.

Loosely speaking, Nash equilibria can be viewed as "best responses to the best responses". In most applications, they are the ultimate solutions of the games. In Sections 3 and 4, we present sharp guarantees for learning an approximate Nash equilibrium with near-optimal sample complexity. However, rather surprisingly, learning a best response in the worst case is more challenging than learning a Nash equilibrium. In Section 5, we present a computational hardness result for learning an approximate best response.
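As an illustration of Definitions 1 and 2, the following sketch computes the value of a fixed Markov policy against its best response by backward induction when the model $(\mathbb{P}, r)$ is known (which the learner in this paper is not given). Names and array shapes are illustrative, and the sketch covers only Markov policies, not the general history-dependent policies defined above.

```python
import numpy as np

def best_response_value(P, r, policy, side):
    """Value of `policy` against its best response, by backward induction with a known model.

    side="max" means `policy` is the max-player's (shape (H, S, A)) and the opponent
    minimizes; side="min" means `policy` is the min-player's (shape (H, S, B)) and the
    opponent maximizes. P has shape (H, S, A, B, S), r has shape (H, S, A, B).
    """
    H = P.shape[0]
    V = np.zeros(P.shape[1])                               # V_{H+1} = 0
    for h in reversed(range(H)):
        G = r[h] + P[h] @ V                                # (S, A, B): r_h + [P_h V_{h+1}]
        if side == "max":
            Q_opp = np.einsum("sab,sa->sb", G, policy[h])  # average out the max-player
            V = Q_opp.min(axis=1)                          # opponent best-responds (min)
        else:
            Q_opp = np.einsum("sab,sb->sa", G, policy[h])  # average out the min-player
            V = Q_opp.max(axis=1)                          # opponent best-responds (max)
    return V

# Nash gap of a candidate Markov pair (mu_hat, nu_hat) as in Definition 2,
# assuming the initial state s_1 has index 0:
#   gap = best_response_value(P, r, nu_hat, side="min")[0] \
#       - best_response_value(P, r, mu_hat, side="max")[0]
```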
3 Optimistic Nash Q-learning

In this section, we present our first algorithm, Optimistic Nash Q-learning, and its corresponding theoretical guarantees.

Algorithm part I: learning values. Our algorithm Optimistic Nash Q-learning (Algorithm 1) is an optimistic variant of Nash Q-learning [11]. For each step in each episode, it (a) takes actions according to the previously computed policy $\pi_h$ and observes the reward and next state, (b) performs incremental updates on the Q-values, and (c) computes new greedy policies and updates the V-values. Part (a) is straightforward; we now focus on explaining parts (b) and (c).

In part (b), the incremental updates on the Q-values (lines 8-9) are almost the same as in standard Q-learning [34], except that here we maintain two separate Q-values, $\overline{Q}_h$ and $\underline{Q}_h$, as upper and lower confidence versions respectively. We add and subtract a bonus term $\beta_t$ in the corresponding updates, which depends on $t = N_h(s_h,a_h,b_h)$, the number of times $(s_h,a_h,b_h)$ has been visited at step $h$. We pick the parameters $\alpha_t$ and $\beta_t$ as follows, for some large constant $c$ and log factor $\iota$:

$$\alpha_t = \frac{H+1}{H+t}, \qquad \beta_t = c\sqrt{H^3\iota/t}. \qquad (3)$$

In part (c), our greedy policies are computed using a Coarse Correlated Equilibrium (CCE) subroutine, which was first introduced by [36] to solve Markov games using value iteration algorithms. For any pair of matrices $\overline{Q}, \underline{Q} \in [0,H]^{A\times B}$, $\mathrm{CCE}(\overline{Q},\underline{Q})$ returns a distribution $\pi \in \Delta_{\mathcal{A}\times\mathcal{B}}$ such that

$$\mathbb{E}_{(a,b)\sim\pi}\overline{Q}(a,b) \ge \max_{a^\star} \mathbb{E}_{(a,b)\sim\pi}\overline{Q}(a^\star,b), \qquad \mathbb{E}_{(a,b)\sim\pi}\underline{Q}(a,b) \le \min_{b^\star} \mathbb{E}_{(a,b)\sim\pi}\underline{Q}(a,b^\star). \qquad (4)$$

It can be shown that a CCE always exists, and that it can be computed by linear programming in polynomial time (see Appendix B for more details).

Algorithm 2 Certified Policy $\hat\mu$ of Nash Q-learning
1: sample $k \leftarrow \mathrm{Uniform}([K])$.
2: for step $h = 1,\ldots,H$ do
3:   observe $s_h$, and take action $a_h \sim \mu^k_h(\cdot|s_h)$.
4:   observe $b_h$, and set $t \leftarrow N^k_h(s_h,a_h,b_h)$.
5:   sample $m \in [t]$ with $\mathbb{P}(m=i) = \alpha^i_t$.
6:   $k \leftarrow k^m_h(s_h,a_h,b_h)$.

Now we are ready to state an intermediate guarantee for Optimistic Nash Q-learning. We assume the algorithm has played the game for K episodes, and we use $\overline{V}^k, \underline{V}^k, \overline{Q}^k, \underline{Q}^k, N^k, \pi^k$ to denote the values, visitation counts, and policies at the beginning of the k-th episode in Algorithm 1.

Lemma 3. For any $p \in (0,1]$, choose hyperparameters $\alpha_t, \beta_t$ as in (3) for a large absolute constant $c$ and $\iota = \log(SABT/p)$. Then, with probability at least $1-p$, Algorithm 1 has the following guarantees: (i) $\overline{V}^k_h(s) \ge V^\star_h(s) \ge \underline{V}^k_h(s)$ for all $(s,h,k) \in \mathcal{S}\times[H]\times[K]$; and (ii) $(1/K)\sum_{k=1}^K \big(\overline{V}^k_1 - \underline{V}^k_1\big)(s_1) \le O\big(\sqrt{H^5SAB\iota/K}\big)$.

Lemma 3 makes two statements. First, it claims that the $\overline{V}^k_h(s)$ and $\underline{V}^k_h(s)$ computed in Algorithm 1 are indeed upper and lower bounds on the value of the Nash equilibrium. Second, it claims that the averages of the upper bounds and the lower bounds are also very close to the value of the Nash equilibrium $V^\star_1(s_1)$, where the gap decreases as $1/\sqrt{K}$. This implies that in order to learn the value $V^\star_1(s_1)$ up to $\epsilon$-accuracy, we only need $O(H^5SAB\iota/\epsilon^2)$ episodes.

However, Lemma 3 has a significant drawback: it only guarantees learning of the value of the Nash equilibrium. It does not imply that the policies $(\mu^k,\nu^k)$ used in Algorithm 1 are close to the Nash equilibrium, which would require the policies to have near-optimal performance even against their best responses. This is a major difference between Markov games and standard MDPs, and is the reason why standard techniques from the MDP literature do not apply here.
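Returning briefly to the CCE subroutine in (4): one way such a distribution could be computed is as a feasibility linear program, sketched below with scipy. The encoding, the function name, and the use of `scipy.optimize.linprog` are illustrative assumptions for this sketch, not the paper's Appendix B construction.

```python
import numpy as np
from scipy.optimize import linprog

def cce(Q_up, Q_low):
    """Find a distribution pi over A x B satisfying the CCE conditions in equation (4)."""
    A, B = Q_up.shape
    n = A * B                                  # one variable per entry pi(a, b), row-major

    rows = []
    for a_star in range(A):
        # E_pi[Q_up(a_star, b)] - E_pi[Q_up(a, b)] <= 0  (max-player cannot gain by deviating)
        rows.append((np.tile(Q_up[a_star], (A, 1)) - Q_up).ravel())
    for b_star in range(B):
        # E_pi[Q_low(a, b)] - E_pi[Q_low(a, b_star)] <= 0 (min-player cannot gain by deviating)
        rows.append((Q_low - np.tile(Q_low[:, b_star][:, None], (1, B))).ravel())
    A_ub = np.array(rows)
    b_ub = np.zeros(A + B)

    # pi must be a probability distribution; a CCE always exists, so the LP is feasible.
    A_eq = np.ones((1, n))
    b_eq = np.array([1.0])
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.x.reshape(A, B)
```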
To resolve the problem identified above, we propose a novel way to extract a certified policy from the optimistic Nash Q-learning algorithm.

Algorithm part II: certified policies. Our procedure for executing the certified policy $\hat\mu$ of the max-player is described in Algorithm 2. There, $\mu^k_h, \nu^k_h$ denote the marginal distributions of the $\pi^k_h$ produced in Algorithm 1 over the action sets $\mathcal{A}$ and $\mathcal{B}$ respectively. We also introduce the following quantities directly induced by $\alpha_t$:

$$\alpha^0_t := \prod_{j=1}^t (1-\alpha_j), \qquad \alpha^i_t := \alpha_i \prod_{j=i+1}^t (1-\alpha_j), \qquad (5)$$

whose properties are listed in Lemma 11. In particular, $\sum_{i=1}^t \alpha^i_t = 1$, so $\{\alpha^i_t\}_{i=1}^t$ defines a distribution over $[t]$. We use $k^m_h(s,a,b)$ to denote the index of the episode in which $(s,a,b)$ is observed at step $h$ for the m-th time. The certified policy $\hat\nu$ of the min-player is defined symmetrically. We note that $\hat\mu, \hat\nu$ are general policies, but they are no longer Markov policies.

The intuitive reason why the policy $\hat\mu$ defined in Algorithm 2 is certified by the Nash Q-learning algorithm is that the update equation in line 8 of Algorithm 1 together with equation (5) gives the relation

$$\overline{Q}^k_h(s,a,b) = \alpha^0_t H + \sum_{i=1}^t \alpha^i_t \Big[r_h(s,a,b) + \overline{V}^{k^i_h(s,a,b)}_{h+1}\big(s^{k^i_h(s,a,b)}_{h+1}\big) + \beta_i\Big].$$

This certifies good performance against the best responses if the max-player plays a mixture of the policies $\{\mu^{k^i_h(s,a,b)}_{h+1}\}_{i=1}^t$ at step $h+1$ with mixing weights $\{\alpha^i_t\}_{i=1}^t$ (see Appendix C.2 for more details). A recursion of this argument leads to the certified policy $\hat\mu$, a nested mixture of policies.

We now present our main result for Nash Q-learning, using the certified policies $(\hat\mu,\hat\nu)$.

Theorem 4 (Sample Complexity of Nash Q-learning). For any $p \in (0,1]$, choose hyperparameters $\alpha_t, \beta_t$ as in (3) for a large absolute constant $c$ and $\iota = \log(SABT/p)$. Then, with probability at least $1-p$, if we run Nash Q-learning (Algorithm 1) for K episodes with $K \ge \Omega\big(H^5SAB\iota/\epsilon^2\big)$, the certified policies $(\hat\mu,\hat\nu)$ (Algorithm 2) will be $\epsilon$-approximate Nash, i.e. $V^{\dagger,\hat\nu}_1(s_1) - V^{\hat\mu,\dagger}_1(s_1) \le \epsilon$.

Theorem 4 asserts that if we run the optimistic Nash Q-learning algorithm for more than $O(H^5SAB\iota/\epsilon^2)$ episodes, the certified policies $(\hat\mu,\hat\nu)$ extracted using Algorithm 2 will form an $\epsilon$-approximate Nash equilibrium (Definition 2). We make two remarks. First, executing the certified policies $\hat\mu, \hat\nu$ requires storing $\{\mu^k_h\}$ and $\{\nu^k_h\}$ for all $(k,h) \in [K]\times[H]$, so the space complexity of our algorithm scales linearly in the total number of episodes K. Second, Q-learning style algorithms (especially online updates) are crucial in our analysis for achieving a sample complexity linear in S. They enjoy the property that each sample is used only once, on a value function that is independent of that sample. In contrast, value iteration type algorithms do not enjoy such an independence property, which is why the best existing sample complexity scales as $S^2$ [2].³

³Although [1] provides techniques to improve the sample complexity from $S^2$ to $S$ for value iteration in MDPs, the same techniques cannot be applied to Markov games due to the unique challenge that, in Markov games, we aim at finding policies that are good against their best responses.
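To make the nested-mixture construction concrete, the sketch below computes the weights $\{\alpha^i_t\}_{i=1}^t$ from equation (5) and samples an episode index as in line 5 of Algorithm 2. The helper name and the small example values are illustrative.

```python
import numpy as np

def mixture_weights(t, H):
    """Weights {alpha_t^i}_{i=1..t} induced by the learning rates alpha_j = (H+1)/(H+j),
    as in equation (5): alpha_t^i = alpha_i * prod_{j=i+1}^t (1 - alpha_j)."""
    alpha = np.array([(H + 1) / (H + j) for j in range(1, t + 1)])
    weights = np.zeros(t)
    for i in range(1, t + 1):
        weights[i - 1] = alpha[i - 1] * np.prod(1.0 - alpha[i:])
    return weights   # nonnegative and sums to 1, so it defines a distribution over [t]

# Example: sample the episode index m used in line 5 of Algorithm 2.
rng = np.random.default_rng(0)
w = mixture_weights(t=10, H=5)
assert abs(w.sum() - 1.0) < 1e-9
m = rng.choice(np.arange(1, 11), p=w)   # P(m = i) = alpha_t^i
```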
4 Optimistic Nash V-learning

In this section, we present our new algorithm, Optimistic Nash V-learning, and its corresponding theoretical guarantees. This algorithm improves over Nash Q-learning in sample complexity from O(SAB) to O(S(A+B)), when only highlighting the dependency on S, A, B.

Algorithm description. Nash V-learning combines the idea of Follow-the-Regularized-Leader (FTRL) from the bandit literature with the Q-learning algorithm in reinforcement learning. The algorithm does not require any information exchange between the players other than standard game playing, and can thus be run separately by the two players. We describe the max-player version in Algorithm 3; see Algorithm 7 in Appendix D for the min-player version, where $\underline{V}_h$, $\underline{L}_h$, $\nu_h$, $\underline{\eta}_t$ and $\underline{\beta}_t$ are defined symmetrically.

Algorithm 3 Optimistic Nash V-learning (the max-player version)
1: Initialize: for any $(s,a,b,h)$: $\overline{V}_h(s) \leftarrow H$, $L_h(s,a) \leftarrow 0$, $N_h(s) \leftarrow 0$, $\mu_h(a|s) \leftarrow 1/A$.
2: for episode $k = 1,\ldots,K$ do
3:   receive $s_1$.
4:   for step $h = 1,\ldots,H$ do
5:     take action $a_h \sim \mu_h(\cdot|s_h)$, and observe the opponent's action $b_h$.
6:     observe reward $r_h(s_h,a_h,b_h)$ and next state $s_{h+1}$.
7:     $t = N_h(s_h) \leftarrow N_h(s_h) + 1$.
8:     $\overline{V}_h(s_h) \leftarrow \min\{H, (1-\alpha_t)\overline{V}_h(s_h) + \alpha_t(r_h(s_h,a_h,b_h) + \overline{V}_{h+1}(s_{h+1}) + \beta_t)\}$.
9:     for all $a \in \mathcal{A}$ do
10:      $\ell_h(s_h,a) \leftarrow [H - r_h(s_h,a_h,b_h) - \overline{V}_{h+1}(s_{h+1})]\,\mathbb{I}\{a_h = a\}/[\mu_h(a_h|s_h) + \eta_t]$.
11:      $L_h(s_h,a) \leftarrow (1-\alpha_t)L_h(s_h,a) + \alpha_t\ell_h(s_h,a)$.
12:    set $\mu_h(\cdot|s_h) \propto \exp[-(\eta_t/\alpha_t)L_h(s_h,\cdot)]$.

For each step in each episode, the algorithm (a) first takes an action according to $\mu_h$ and observes the action of the opponent, the reward, and the next state, (b) performs an incremental update on $\overline{V}$, and (c) updates the policy $\mu_h$. The first two parts are very similar to Nash Q-learning. In the third part, the agent first computes $\ell_h(s_h,\cdot)$ as an importance-weighted estimator of the current loss, then updates the weighted cumulative loss $L_h(s_h,\cdot)$. Finally, the policy $\mu_h$ is updated using the FTRL principle:

$$\mu_h(\cdot|s_h) \leftarrow \operatorname*{argmin}_{\mu\in\Delta_{\mathcal{A}}} \; \eta_t\,\langle L_h(s_h,\cdot),\,\mu\rangle + \alpha_t\,\mathrm{KL}(\mu\,\|\,\mu_0).$$

Here $\mu_0$ is the uniform distribution over the actions $\mathcal{A}$. Solving the above minimization problem gives the update equation in line 12 of Algorithm 3. In multi-armed bandits, FTRL can defend against adversarial losses with regret independent of the number of the opponent's actions. This property turns out to be crucial for Nash V-learning to achieve a sharper sample complexity than Nash Q-learning (see the analog of Lemma 3 in Lemma 15).

Similar to Nash Q-learning, we also propose a new algorithm (Algorithm 4) to extract a certified policy from the optimistic Nash V-learning algorithm. The certified policies are again non-Markovian.

Algorithm 4 Certified Policy $\hat\mu$ of Nash V-learning
1: sample $k \leftarrow \mathrm{Uniform}([K])$.
2: for step $h = 1,\ldots,H$ do
3:   observe $s_h$, and set $t \leftarrow N^k_h(s_h)$.
4:   sample $m \in [t]$ with $\mathbb{P}(m=i) = \alpha^i_t$.
5:   $k \leftarrow k^m_h(s_h)$.
6:   take action $a_h \sim \mu^k_h(\cdot|s_h)$.

We choose all hyperparameters as follows, for some large constant $c$ and log factor $\iota$:

$$\alpha_t = \frac{H+1}{H+t}, \qquad \eta_t = \sqrt{\frac{\log A}{At}}, \qquad \beta_t = c\sqrt{\frac{H^4A\iota}{t}}, \qquad (6)$$

with the min-player's $\underline{\eta}_t$ and $\underline{\beta}_t$ defined symmetrically with B in place of A.

We now present our main result on the sample complexity of Nash V-learning.

Theorem 5 (Sample Complexity of Nash V-learning). For any $p \in (0,1]$, choose hyperparameters as in (6) for a large absolute constant $c$ and $\iota = \log(SABT/p)$. Then, with probability at least $1-p$, if we run Nash V-learning (Algorithms 3 and 7) for K episodes with $K \ge \Omega\big(H^6S(A+B)\iota/\epsilon^2\big)$, its induced policies $(\hat\mu,\hat\nu)$ (Algorithm 4) will be $\epsilon$-approximate Nash, i.e. $V^{\dagger,\hat\nu}_1(s_1) - V^{\hat\mu,\dagger}_1(s_1) \le \epsilon$.

Theorem 5 claims that if we run optimistic Nash V-learning for more than $O(H^6S(A+B)\iota/\epsilon^2)$ episodes, the certified policies $(\hat\mu,\hat\nu)$ extracted from Algorithm 4 will form an $\epsilon$-approximate Nash equilibrium (Definition 2). Nash V-learning is the first algorithm whose sample complexity matches the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ up to poly(H) factors and logarithmic terms.
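The following is a schematic sketch of the policy update in lines 10-12 of Algorithm 3: the importance-weighted loss estimate, the weighted cumulative loss, and the closed-form exponential-weights solution of the FTRL step. All names are illustrative and the surrounding bookkeeping (counts, the V update, the certified-policy storage) is omitted.

```python
import numpy as np

def v_learning_policy_update(mu_s, L_s, a_taken, r, V_next, H, alpha_t, eta_t):
    """One policy update at a visited state s_h, following lines 10-12 of Algorithm 3.

    mu_s : current policy mu_h(.|s_h) over the A max-player actions
    L_s  : running weighted cumulative loss L_h(s_h, .)
    """
    A = len(mu_s)
    # Importance-weighted loss estimate; nonzero only at the action actually taken:
    #   l_h(s_h, a) = (H - r - V_{h+1}(s_{h+1})) * 1{a_h = a} / (mu_h(a_h|s_h) + eta_t)
    loss = np.zeros(A)
    loss[a_taken] = (H - r - V_next) / (mu_s[a_taken] + eta_t)

    # Weighted cumulative loss: L <- (1 - alpha_t) * L + alpha_t * l
    L_s = (1.0 - alpha_t) * L_s + alpha_t * loss

    # FTRL solution in closed form: mu proportional to exp(-(eta_t / alpha_t) * L)
    logits = -(eta_t / alpha_t) * L_s
    logits -= logits.max()                     # numerical stability
    mu_s = np.exp(logits) / np.exp(logits).sum()
    return mu_s, L_s
```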
5 Hardness for Learning the Best Response

In this section, we present a computational hardness result for computing the best response against an opponent with a fixed unknown policy. We further show that this implies a computational hardness result for achieving sublinear regret in Markov games when playing against adversarial opponents, which rules out a popular approach for designing algorithms that find Nash equilibria.

We first remark that if the opponent is restricted to only play Markov policies, then learning the best response is as easy as learning an optimal policy in a standard single-agent Markov decision process, where efficient algorithms are known to exist. However, when the opponent can play arbitrary policies, which may be non-Markovian, we show that finding the best response against those policies is computationally challenging.

We say an algorithm is a polynomial time algorithm for learning the best response if, for any policy $\nu$ of the opponent and any $\epsilon > 0$, the algorithm finds an $\epsilon$-approximate best response to policy $\nu$ (Definition 1) with probability at least 1/2, in time polynomial in $S, H, A, B, \epsilon^{-1}$. We can show the following hardness result for finding the best response in polynomial time.

Theorem 6 (Hardness for learning the best response). There exists a Markov game with deterministic transitions and rewards, defined for any horizon $H \ge 1$ with $S = 2$, $A = 2$, and $B = 2$, such that if there exists a polynomial time algorithm for learning the best response for this Markov game, then there exists a polynomial time algorithm for learning parity with noise (see the problem description in Appendix E).

We remark that learning parity with noise is a notoriously difficult problem that has been used to design efficient cryptographic schemes. It is conjectured by the community to be hard.

Conjecture 7 ([16]). There is no polynomial time algorithm for learning parity with noise.

Theorem 6 together with Conjecture 7 demonstrates the fundamental difficulty, if not strict impossibility, of designing a polynomial time algorithm for learning the best responses in Markov games. The intuitive reason for such computational hardness is that, while the underlying system has Markov transitions, the opponent can play policies that encode long-term correlations of a non-Markovian nature, such as parity with noise, which makes it very challenging to find the best response. It is known that learning many other sequential models with long-term correlations (such as hidden Markov models or partially observable MDPs) is as hard as learning parity with noise [20].
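For concreteness, the learning-parity-with-noise problem can be described by the following sample generator; this is the standard formulation of the problem, while the paper's exact construction and reduction are given in Appendix E (not included here). All names are illustrative.

```python
import numpy as np

def lpn_samples(secret, num_samples, noise_rate, rng):
    """Standard learning-parity-with-noise examples.

    Each example is (x, y) with x uniform in {0,1}^n and
    y = <x, secret> mod 2, flipped independently with probability noise_rate.
    """
    n = len(secret)
    X = rng.integers(0, 2, size=(num_samples, n))
    clean = X @ secret % 2
    flips = (rng.random(num_samples) < noise_rate).astype(int)
    return X, (clean + flips) % 2

rng = np.random.default_rng(0)
secret = rng.integers(0, 2, size=20)
X, y = lpn_samples(secret, num_samples=1000, noise_rate=0.1, rng=rng)
# Recovering `secret` from (X, y) is conjectured to require super-polynomial time (Conjecture 7).
```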
5.1 Hardness for Playing Against Adversarial Opponents

Theorem 6 directly implies the difficulty of achieving sublinear regret in Markov games when playing against adversarial opponents. Our construction of hard instances in the proof of Theorem 6 further allows the adversarial opponent to play only Markov policies in each episode. Since playing against an adversarial opponent is a different problem of independent interest, we present the full result here.

Without loss of generality, we still consider the setting where the algorithm can only control the max-player, while the min-player is an adversarial opponent. At the beginning of every episode k, both players pick their own policies $\mu^k$ and $\nu^k$, and execute them throughout the episode. The adversarial opponent can possibly pick her policy $\nu^k$ adaptively, based on all observations in earlier episodes.

We say an algorithm for the learner is a polynomial time no-regret algorithm if there exists a $\delta > 0$ such that for any adversarial opponent and any fixed $K > 0$, the algorithm outputs policies $\{\mu^k\}_{k=1}^K$ which satisfy the following, with probability at least 1/2, in time polynomial in $S, H, A, B, K$:

$$\mathrm{Regret}(K) = \sup_\mu \sum_{k=1}^K V^{\mu,\nu^k}_1(s_1) - \sum_{k=1}^K V^{\mu^k,\nu^k}_1(s_1) \le \mathrm{poly}(S,H,A,B)\,K^{1-\delta}. \qquad (7)$$

Theorem 6 directly implies the following hardness result for achieving no-regret against adversarial opponents, a result that first appeared in [25].

Corollary 8 (Hardness for playing against an adversarial opponent). There exists a Markov game with deterministic transitions and rewards, defined for any horizon $H \ge 1$ with $S = 2$, $A = 2$, and $B = 2$, such that if there exists a polynomial time no-regret algorithm for this Markov game, then there exists a polynomial time algorithm for learning parity with noise (see the problem description in Appendix E). The claim continues to hold even if we restrict the adversarial opponents in the Markov game to be non-adaptive and to play only Markov policies in each episode.

Similar to Theorem 6, Corollary 8 combined with Conjecture 7 demonstrates the fundamental difficulty of designing a polynomial time no-regret algorithm against adversarial opponents in Markov games.

Implications for algorithm design for finding Nash equilibria. Corollary 8 also rules out a natural approach to designing efficient algorithms for finding approximate Nash equilibria by combining two no-regret algorithms. In fact, it is not hard to see that if the min-player also runs a no-regret algorithm and obtains a regret bound symmetric to (7), then summing the two regret bounds shows that the mixture policies $(\hat\mu,\hat\nu)$, which assign uniform mixing weights to the policies $\{\mu^k\}_{k=1}^K$ and $\{\nu^k\}_{k=1}^K$ respectively, form an approximate Nash equilibrium. Corollary 8 together with Conjecture 7 shows that any algorithm designed using this approach cannot be a polynomial time algorithm.

Broader Impact

As this is a theoretical contribution, we do not envision that our direct results will have a tangible societal impact. Our broader line of inquiry could impact a line of thinking about how to design more sample-efficient algorithms for multi-agent reinforcement learning, which could be useful towards making artificial intelligence more resource- and energy-efficient.

Acknowledgments

TY is partially supported by NSF BIGDATA grant IIS1741341.

References

[1] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272. JMLR.org, 2017.
[2] Y. Bai and C. Jin. Provable self-play algorithms for competitive reinforcement learning. arXiv preprint arXiv:2002.04017, 2020.
[3] B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch. Emergent tool use from multi-agent autocurricula. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkxpxJBKwS.
[4] R. I. Brafman and M. Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
[5] M. Brambilla, E. Ferrante, M. Birattari, and M. Dorigo. Swarm robotics: a review from the swarm engineering perspective. Swarm Intelligence, 7(1):1–41, 2013.
[6] N. Brown and T. Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
[7] C. Dann, T. Lattimore, and E. Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
[8] C. Dann, L. Li, W. Wei, and E. Brunskill. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507–1516, 2019.
[9] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer Science & Business Media, 2012.
[10] T. D. Hansen, P. B. Miltersen, and U. Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1–16, 2013.
[11] J. Hu and M. P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.
[12] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
[13] Z. Jia, L. F. Yang, and M. Wang. Feature-based Q-learning for two-player stochastic games. arXiv preprint arXiv:1906.00423, 2019.
[14] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4868–4878, 2018.
[15] C. Jin, T. Jin, H. Luo, S. Sra, and T. Yu. Learning adversarial Markov decision processes with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192, 2019.
[16] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
[17] T. Lattimore and C. Szepesvári. Bandit Algorithms. 2018.
[18] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.
[19] M. L. Littman. Friend-or-foe Q-learning in general-sum games. In ICML, volume 1, pages 322–328, 2001.
[20] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 366–375, 2005.
[21] G. Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pages 3168–3176, 2015.
[22] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.
[23] I. Osband and B. Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
[24] I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
[25] G. Radanovic, R. Devidze, D. Parkes, and A. Singla. Learning to collaborate in Markov decision processes. In International Conference on Machine Learning, pages 5261–5270, 2019.
[26] A. Rosenberg and Y. Mansour. Online convex optimization in adversarial Markov decision processes. arXiv preprint arXiv:1905.07773, 2019.
[27] S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
[28] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
[29] A. Sidford, M. Wang, L. F. Yang, and Y. Ye. Solving discounted stochastic two-player games with near-optimal time and sample complexity. arXiv preprint arXiv:1908.11071, 2019.
[30] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[31] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
[32] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, pages 881–888, 2006.
[33] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[34] C. J. C. H. Watkins. Learning from delayed rewards. 1989.
[35] C.-Y. Wei, Y.-T. Hong, and C.-J. Lu. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems, pages 4987–4997, 2017.
[36] Q. Xie, Y. Chen, Z. Wang, and Z. Yang. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium. arXiv preprint arXiv:2002.07066, 2020.
[37] Y. A. Yadkori, P. L. Bartlett, V. Kanade, Y. Seldin, and C. Szepesvári. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, pages 2508–2516, 2013.
[38] A. Zimin and G. Neu. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, pages 1583–1591, 2013.