# Online Reinforcement Learning in Stochastic Games

Chen-Yu Wei, Institute of Information Science, Academia Sinica, Taiwan (bahh723@iis.sinica.edu.tw)
Yi-Te Hong, Institute of Information Science, Academia Sinica, Taiwan (ted0504@iis.sinica.edu.tw)
Chi-Jen Lu, Institute of Information Science, Academia Sinica, Taiwan (cjlu@iis.sinica.edu.tw)

Abstract. We study online reinforcement learning in average-reward stochastic games (SGs). An SG models a two-player zero-sum game in a Markov environment, where state transitions and one-step payoffs are determined simultaneously by a learner and an adversary. We propose the UCSG algorithm, which achieves sublinear regret relative to the game value when competing with an arbitrary opponent. This result improves previous ones under the same setting. The regret bound depends on the diameter, an intrinsic quantity related to the mixing property of SGs. If we let the opponent play an optimistic best response to the learner, UCSG finds an ε-maximin stationary policy with a sample complexity of $O(\mathrm{poly}(1/\varepsilon))$, where ε is the gap to the best policy.

1 Introduction

Many real-world scenarios (e.g., markets, computer networks, board games) can be cast as multi-agent systems. The framework of Multi-Agent Reinforcement Learning (MARL) targets learning to act in such systems. While in traditional reinforcement learning (RL) problems, Markov decision processes (MDPs) are widely used to model a single agent's interaction with the environment, stochastic games (SGs, [32]), as an extension of MDPs, are able to describe multiple agents' simultaneous interaction with the environment. In this view, SGs are well suited to model MARL problems [24].

In this paper, two-player zero-sum SGs are considered. These games proceed like MDPs, with the exception that in each state, both players select their own actions simultaneously,¹ which jointly determine the transition probabilities and their rewards. The zero-sum property requires that the two players' payoffs sum to zero. Thus, while one player (Player 1) wants to maximize his/her total reward, the other (Player 2) would like to minimize that amount. As in MDPs, the reward can be discounted or undiscounted, and the game can be episodic or non-episodic.

In the literature, SGs are typically learned under two different settings, which we will call the online and offline settings, respectively. In the offline setting, the learner controls both players in a centralized manner, and the goal is to find the equilibrium of the game [33, 21, 30]. This is also known as finding the worst-case optimality for each player (a.k.a. the maximin or minimax policy). In this case, we care about the sample complexity, i.e., how many samples are required to estimate the worst-case optimality such that the error is below some threshold. In the online setting, the learner controls only one of the players, and plays against an arbitrary opponent [24, 4, 5, 8, 31]. In this case, we care about the learner's regret, i.e., the difference between some benchmark measure and the learner's total reward earned in the learning process.

¹Turn-based SGs, like Go, are special cases: in each state, one player's action set contains only a null action.
This benchmark can be defined as the total reward when both players play optimal policies [5], or when Player 1 plays the best stationary response to Player 2 [4]. Some of the above online-setting algorithms can find the equilibrium simply through self-play.

Most previous results on offline sample complexity consider discounted SGs. Their bounds depend heavily on the chosen discount factor [33, 21, 30, 31]. However, as noted in [5, 19], the discounted setting might not be suitable for SGs that require long-term planning, because only finitely many steps matter in the reward criterion it defines. This paper, to the best of our knowledge, is the first to give an offline sample complexity bound of order $O(\mathrm{poly}(1/\varepsilon))$ in the average-reward (undiscounted and non-episodic) setting, where ε is the error parameter. A major difference between our algorithm and previous ones is that the two players play asymmetric roles in our algorithm: by focusing on finding only one player's worst-case optimal policy at a time, the sampling can be rather efficient. This resembles, but strictly extends, the method of [13] for finding the maximin action in a two-stage game.

In the online setting, we are only aware of the R-MAX algorithm [5] that deals with average-reward SGs and provides a regret bound. Considering a similar scenario and adopting the same regret definition, we significantly improve their bounds (see Appendix A for details). Another difference between our algorithm and theirs is that ours is able to output a currently best stationary policy at any stage of the learning process, while theirs only produces a $T_\varepsilon$-step fixed-horizon policy for some input parameter $T_\varepsilon$. The former is arguably more natural, since the worst-case optimal policy is itself a stationary policy.

The techniques used in this paper are most related to RL for MDPs based on the optimism principle [2, 19, 9] (see Appendix A). The optimism principle, built on concentration inequalities, automatically strikes a balance between exploitation and exploration, eliminating the need to manually tune the learning rate or the exploration ratio. However, when importing the analysis from MDPs to SGs, we face the challenge caused by the opponent's uncontrollability and non-stationarity. This prevents the learner from freely exploring the state space and renders previous analyses that rely on perturbation analysis of the stationary distribution [2] unusable. In this paper, we develop a novel way to replace the opponent's non-stationary policy with a stationary one in the analysis (introduced in Section 5.1), which facilitates the use of techniques based on perturbation analysis. We hope that this technique can benefit future analyses involving non-stationary agents in MARL.

One related topic is the robust MDP problem [29, 17, 23]: an MDP in which some state-action pairs have adversarial rewards and transitions. It is often assumed in robust MDPs that the adversarial choices of the environment are not directly observable by the player, whereas in our SG setting we assume that the actions of Player 2 can be observed. However, there are still difficulties in SGs that are not addressed by previous work on robust MDPs. Here we compare our work to [23], a recent work on learning robust MDPs. In their setting, there are adversarial and stochastic state-action pairs, and their proposed OLRM2 algorithm tries to distinguish them.
Under the scenario where the environment is fully adversarial, which is the counterpart of our setting, the worst-case transitions and rewards are all revealed to the learner, and what the learner needs to do is to perform maximin planning. In our case, however, the worst-case transitions and rewards still have to be learned, and the opponent's arbitrary actions may hinder the learner from acquiring this information. We would therefore say that the contribution of [23] is orthogonal to ours.

Other lines of research related to SGs concern MDPs with adversarially changing reward functions [11, 27, 28, 10] and with adversarially changing transition probabilities [35, 1]. The assumptions in these works differ from ours in several respects, so their results are not comparable to ours. Nevertheless, they provide other viewpoints on learning in stochastic games.

2 Preliminaries

Game Models and Policies. An SG is a 4-tuple M = (S, A, r, p). S denotes the state space and A = A¹ × A² the players' joint action space. We denote S = |S| and A = |A|. The game starts from an initial state s₁. Suppose at time t the players are at state $s_t$. After the players play the joint actions $(a^1_t, a^2_t)$, Player 1 receives the reward $r_t = r(s_t, a^1_t, a^2_t) \in [0, 1]$ from Player 2, and both players move to state $s_{t+1}$ following the transition probability $p(\cdot\,|\,s_t, a^1_t, a^2_t)$. For simplicity, we consider deterministic rewards as in [3]; the extension to the stochastic case is straightforward. We shorten our notation by $a := (a^1, a^2)$ or $a_t := (a^1_t, a^2_t)$, and use abbreviations such as $r(s_t, a_t)$ and $p(\cdot\,|\,s_t, a_t)$.

Without loss of generality, players are assumed to determine their actions based on the history. A policy π at time t maps the history up to time t, $H_t = (s_1, a_1, r_1, \ldots, s_t) \in \mathcal{H}_t$, to a probability distribution over actions. Such policies are called history-dependent policies, whose class is denoted by $\Pi^{\mathrm{HR}}$. A stationary policy, whose class is denoted by $\Pi^{\mathrm{SR}}$, selects actions as a function of the current state only. For either class, a joint policy $(\pi^1, \pi^2)$ is often written as π.

Average Return and the Game Value. Let the players play the joint policy π. Define the T-step total reward as $R_T(M, \pi, s) := \sum_{t=1}^{T} r(s_t, a_t)$ with $s_1 = s$, and the average reward as $\rho(M, \pi, s) := \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}[R_T(M, \pi, s)]$, whenever the limit exists. In fact, the game value exists² [26]:
$$\rho^*(M, s) := \sup_{\pi^1} \inf_{\pi^2} \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}\big[R_T(M, \pi^1, \pi^2, s)\big].$$
If $\rho(M, \pi, s)$ or $\rho^*(M, s)$ does not depend on the initial state s, we simply write $\rho(M, \pi)$ or $\rho^*(M)$.

²Unlike in one-player MDPs, the sup and inf in the definition of $\rho^*(M, s)$ are not necessarily attainable. Moreover, the players may not have stationary optimal policies.

The Bias Vector. For a stationary policy π, the bias vector $h(M, \pi, \cdot)$ is defined, for each coordinate s, as
$$h(M, \pi, s) := \mathbb{E}\left[\sum_{t=1}^{\infty} \big(r(s_t, a_t) - \rho(M, \pi, s)\big) \,\Big|\, s_1 = s,\ a_t \sim \pi(\cdot|s_t)\right]. \tag{1}$$
The bias vector satisfies the Bellman equation: for all $s \in S$,
$$\rho(M, \pi, s) + h(M, \pi, s) = r(s, \pi) + \sum_{s'} p(s'|s, \pi)\, h(M, \pi, s'),$$
where $r(s, \pi) := \mathbb{E}_{a \sim \pi(\cdot|s)}[r(s, a)]$ and $p(s'|s, \pi) := \mathbb{E}_{a \sim \pi(\cdot|s)}[p(s'|s, a)]$. The vector $h(M, \pi, \cdot)$ describes the relative advantage among states under model M and (joint) policy π. The advantage (or disadvantage) of state s compared to state s' under policy π is defined as the difference between the accumulated rewards with initial states s and s', which, from (1), converges to the difference $h(M, \pi, s) - h(M, \pi, s')$ asymptotically. For ease of notation, the span of a vector v is defined as $\mathrm{sp}(v) := \max_i v_i - \min_i v_i$. Therefore, if a model, together with some policy, induces a large $\mathrm{sp}(h)$, then this model will be difficult to learn, because visiting a bad state costs a lot in the learning process.
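To make these definitions concrete, the following is a small numerical sketch (our own illustration, not part of the paper) that evaluates a fixed joint stationary policy on a known SG: it computes the average reward $\rho(M, \pi)$, the bias vector $h(M, \pi, \cdot)$ via the Bellman equation, and the span $\mathrm{sp}(h)$. The tensor layout and the function name are our own choices, and an irreducible induced chain (as under Assumption 1 below) is assumed.

```python
import numpy as np

def evaluate_joint_policy(P, r, pi1, pi2):
    """Average reward rho, bias vector h, and span sp(h) for a fixed joint
    stationary policy on a known SG.

    P:   transition tensor of shape (S, A1, A2, S)
    r:   reward tensor of shape (S, A1, A2)
    pi1: Player 1's stationary policy, shape (S, A1)
    pi2: Player 2's stationary policy, shape (S, A2)
    Assumes the induced Markov chain is irreducible, so rho does not depend
    on the initial state.
    """
    S = P.shape[0]
    # Marginalize out both players' actions to get the induced chain and rewards.
    P_pi = np.einsum('sabt,sa,sb->st', P, pi1, pi2)
    r_pi = np.einsum('sab,sa,sb->s', r, pi1, pi2)
    # Stationary distribution mu: mu (I - P_pi) = 0 together with sum(mu) = 1.
    A = np.vstack([(np.eye(S) - P_pi).T, np.ones((1, S))])
    b = np.zeros(S + 1); b[-1] = 1.0
    mu = np.linalg.lstsq(A, b, rcond=None)[0]
    rho = float(mu @ r_pi)
    # Bias vector from the Bellman equation: (I - P_pi) h = r_pi - rho * e,
    # pinned down by the normalization h[0] = 0.
    B = np.vstack([np.eye(S) - P_pi, np.eye(S)[0:1]])
    c = np.append(r_pi - rho, 0.0)
    h = np.linalg.lstsq(B, c, rcond=None)[0]
    return rho, h, float(h.max() - h.min())   # the last value is sp(h)
```

The span returned here is exactly the quantity whose size governs the learning difficulty discussed next.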
As shown in [3] for the MDP case, the regret has an inevitable dependency on $\mathrm{sp}(h(M, \pi^*, \cdot))$, where $\pi^*$ is the optimal policy. On the other hand, $\mathrm{sp}(h(M, \pi, \cdot))$ is closely related to the mean first passage times of the Markov chain induced by M and π. In fact, we have $\mathrm{sp}(h(M, \pi, \cdot)) \le T^{\pi}(M) := \max_{s,s'} T^{\pi}_{s \to s'}(M)$, where $T^{\pi}_{s \to s'}(M)$ denotes the expected time to reach state s' starting from s when the model is M and the player(s) follow the (joint) policy π. This fact is intuitive, and a proof can be found in Remark M.1.

Notations. To save space, we often write equations in vector or matrix form. We use vector inequalities: if $u, v \in \mathbb{R}^n$, then $u \le v \Leftrightarrow u_i \le v_i$ for all $i = 1, \ldots, n$. For a general matrix game with an $n \times m$ matrix G, the value of the game is
$$\mathrm{val}\, G := \max_{p \in \Delta_n} \min_{q \in \Delta_m} p^\top G q = \min_{q \in \Delta_m} \max_{p \in \Delta_n} p^\top G q,$$
where $\Delta_k$ is the probability simplex of dimension k. In SGs, given an estimated value function $u(s')$ over states $s'$, we often need to solve the matrix game equation
$$v(s) = \max_{\pi^1(\cdot|s)} \min_{\pi^2(\cdot|s)} \mathbb{E}_{a^1 \sim \pi^1(\cdot|s),\ a^2 \sim \pi^2(\cdot|s)} \Big[ r(s, a^1, a^2) + \sum_{s'} p(s'|s, a^1, a^2)\, u(s') \Big],$$
which is abbreviated in vector form as $v = \mathrm{val}\{r + Pu\}$. We also use $\mathrm{solve}_1\, G$ and $\mathrm{solve}_2\, G$ to denote the optimal solutions for p and q, respectively. In addition, the indicator function is denoted by $\mathbf{1}\{\cdot\}$ or $\mathbf{1}_{\{\cdot\}}$.

3 Problem Settings and Results Overview

We assume that the game proceeds for T steps. In order to have meaningful regret bounds (i.e., sublinear in T), we must make some assumptions on the SG model itself. Our two alternative assumptions, stated in terms of a diameter parameter D, are the following.

Assumption 1. $\max_{s,s'} \max_{\pi^1 \in \Pi^{\mathrm{SR}}} \max_{\pi^2 \in \Pi^{\mathrm{SR}}} T^{\pi^1, \pi^2}_{s \to s'}(M) \le D$.

Assumption 2. $\max_{s,s'} \max_{\pi^2 \in \Pi^{\mathrm{SR}}} \min_{\pi^1 \in \Pi^{\mathrm{SR}}} T^{\pi^1, \pi^2}_{s \to s'}(M) \le D$.

The reason we make these assumptions is as follows. Consider an SG model where the opponent (Player 2) has some way to lock the learner (Player 1) into some bad state. The best strategy for the learner might be to avoid entering that state altogether, if possible. However, in the early stage of the learning process the learner will not know this, and he/she has a certain probability of visiting that state and getting locked in. This causes linear regret for the learner. Therefore, we assume the following: whatever policy the opponent executes, the learner always has some way to reach any state within some bounded time. This is essentially our Assumption 2. Assumption 1 is the stronger one; it actually implies that under any policies executed by the players (not necessarily stationary, see Remark M.2), every state is visited within an average of D steps. We find that under this assumption, the asymptotic regret can be improved. This assumption is also similar in spirit to the conditions required for the convergence of Q-learning-type algorithms, which require that every state be visited infinitely often; see [18] for example.

These assumptions define notions of diameter that are specific to the SG model. It is known that under Assumption 1 or Assumption 2, both players have optimal stationary policies, and the game value is independent of the initial state; thus we can simply write $\rho^*(M, s)$ as $\rho^*(M)$. For a proof of these facts, please refer to Theorem E.1 in the appendix.

3.1 Two Settings and Results Overview

We focus on training Player 1 and discuss two settings. In the online setting, Player 1 competes with an arbitrary Player 2. The regret is defined as
$$\mathrm{Reg}^{(\mathrm{on})}_T = \sum_{t=1}^{T} \big(\rho^*(M) - r(s_t, a_t)\big).$$
In the offline setting, we control both Player 1's and Player 2's actions, with the goal of finding Player 1's maximin policy. The sample complexity is defined as
$$L_\varepsilon := \sum_{t=1}^{T} \mathbf{1}\Big\{\rho^*(M) - \min_{\pi^2} \rho(M, \pi^1_t, \pi^2) > \varepsilon\Big\},$$
where $\pi^1_t$ is the stationary policy being executed by Player 1 at time t. This definition is similar to those in [20, 19] for one-player MDPs. By the definition of $L_\varepsilon$, if we have an upper bound on $L_\varepsilon$ and run the algorithm for $T > L_\varepsilon$ steps, then there is some t such that $\pi^1_t$ is ε-optimal. We explain how to pick this t in Section 7 and Appendix L.

It turns out that we can use almost the same algorithm to handle these two settings. Since learning in the online setting is more challenging, from now on we mainly focus on the online setting and defer the discussion of the offline setting to the end of the paper. Our results can be summarized by the following two theorems.

Theorem 3.1. Under Assumption 1, UCSG achieves $\mathrm{Reg}^{(\mathrm{on})}_T = O(D^3 S^5 A + D S \sqrt{AT})$ w.h.p.³

Theorem 3.2. Under Assumption 2, UCSG achieves $\mathrm{Reg}^{(\mathrm{on})}_T = O\big(\sqrt[3]{D S^2 A T^2}\big)$ w.h.p.

³We write "with high probability, $g = O(f)$" or "w.h.p., $g = O(f)$" to indicate that with probability $1 - \delta$, $g = f_1 O(f) + f_2$, where $f_1, f_2$ are polynomials in $\log D$, $\log S$, $\log A$, $\log T$, $\log(1/\delta)$.

4 Upper Confidence Stochastic Game Algorithm (UCSG)

Algorithm 1  UCSG
  Input: S, A = A¹ × A², T.
  Initialization: t = 1.
  for phase k = 1, 2, ... do
    $t_k = t$.
    1. Initialize phase k:
       $v_k(s,a) = 0$,  $n_k(s,a) = \max\big\{1,\ \sum_{\tau=1}^{t_k-1} \mathbf{1}_{(s_\tau, a_\tau) = (s,a)}\big\}$,
       $n_k(s,a,s') = \sum_{\tau=1}^{t_k-1} \mathbf{1}_{(s_\tau, a_\tau, s_{\tau+1}) = (s,a,s')}$,
       $\hat p_k(s'|s,a) = n_k(s,a,s') / n_k(s,a)$,  for all $s, a, s'$.
    2. Update the confidence set:
       $\mathcal{M}_k = \{\tilde M : \forall s, a,\ \tilde p(\cdot|s,a) \in P_k(s,a)\}$, where
       $P_k(s,a) := \mathrm{CONF}_1(\hat p_k(\cdot|s,a), n_k(s,a)) \cap \mathrm{CONF}_2(\hat p_k(\cdot|s,a), n_k(s,a))$.
    3. Optimistic planning:
       $(M^1_k, \pi^1_k) = \text{MAXIMIN-EVI}(\mathcal{M}_k, \gamma_k)$, where $\gamma_k := 1/\sqrt{t_k}$.
    4. Execute policies:
       repeat
         Draw $a^1_t \sim \pi^1_k(\cdot|s_t)$; observe the reward $r_t$ and the next state $s_{t+1}$.
         Set $v_k(s_t, a_t) = v_k(s_t, a_t) + 1$ and $t = t + 1$.
       until $\exists (s,a)$ such that $v_k(s,a) = n_k(s,a)$
  end for

Definitions of the confidence regions:
$$\mathrm{CONF}_1(\hat p, n) := \Big\{ p \in [0,1]^S : \|p - \hat p\|_1 \le \sqrt{\tfrac{2S \ln(1/\delta_1)}{n}} \Big\}, \qquad \delta_1 = \frac{\delta}{2 S^2 A \log_2 T},$$
$$\mathrm{CONF}_2(\hat p, n) := \Big\{ p \in [0,1]^S : \forall i,\ |p_i - \hat p_i| \le \sqrt{\tfrac{2 \hat p_i (1 - \hat p_i)}{n} \ln\tfrac{6}{\delta_1}} + \tfrac{7}{3(n-1)} \ln\tfrac{6}{\delta_1} \Big\}.$$

The Upper Confidence Stochastic Game algorithm (UCSG, Algorithm 1) extends UCRL2 [19], using the optimism principle to balance exploitation and exploration. It proceeds in phases (indexed by k) and changes the learner's policy $\pi^1_k$ only at the beginning of each phase. The length of each phase is not fixed a priori but depends on the statistics of past observations. At the beginning of each phase k, the algorithm estimates the transition probabilities using the empirical frequencies $\hat p_k(\cdot|s,a)$ observed in previous phases (Step 1). With these empirical frequencies, it creates a confidence region $P_k(s,a)$ for each transition probability. The transition probabilities lying in the confidence regions constitute a set $\mathcal{M}_k$ of plausible stochastic game models, to which the true model M belongs with high probability (Step 2). Then, Player 1 optimistically picks one model $M^1_k$ from $\mathcal{M}_k$ and finds the optimal (stationary) policy $\pi^1_k$ under this model (Step 3). Finally, Player 1 executes the policy $\pi^1_k$ until the number of occurrences of some (s, a)-pair is doubled within the phase (Step 4). The count $v_k(s,a)$ records the number of steps at which the (s, a)-pair is observed in phase k; it is reset to zero at the beginning of every phase.
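For concreteness, a minimal Python sketch of Steps 1–2 follows (our own illustration, not the authors' code). It builds the empirical estimates $\hat p_k$ and $n_k$ from the trajectory collected so far and computes the $L_1$ radius of CONF1. The function names are ours, and the radius assumes the UCRL2-style form reconstructed above; the exact constants should be taken from the paper's appendix rather than from this sketch.

```python
import numpy as np

def empirical_model(history, S, A1, A2):
    """Step 1: empirical transition frequencies and visit counts before phase k.

    history: iterable of transitions (s, a1, a2, s_next) observed so far.
    Returns p_hat of shape (S, A1, A2, S) and counts n of shape (S, A1, A2),
    with n clipped below at 1 as in Algorithm 1.
    """
    counts = np.zeros((S, A1, A2, S))
    for s, a1, a2, s_next in history:
        counts[s, a1, a2, s_next] += 1
    n = np.maximum(counts.sum(axis=-1), 1)
    p_hat = counts / n[..., None]
    return p_hat, n

def conf1_radius(n, S, delta1):
    """L1 radius of CONF1(p_hat, n); assumes the form sqrt(2 S ln(1/delta1) / n)."""
    return np.sqrt(2.0 * S * np.log(1.0 / delta1) / n)

def in_conf1(p, p_hat, n, S, delta1):
    """Membership test: is the candidate distribution p(.|s,a) inside CONF1?"""
    return np.abs(p - p_hat).sum() <= conf1_radius(n, S, delta1)
```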
In Step 3, picking an optimistic model and policy means picking $M^1_k \in \mathcal{M}_k$ and $\pi^1_k \in \Pi^{\mathrm{SR}}$ such that
$$\forall s, \qquad \min_{\pi^2} \rho(M^1_k, \pi^1_k, \pi^2, s) \ \ge\ \max_{\tilde M \in \mathcal{M}_k} \rho^*(\tilde M, s) - \gamma_k, \tag{2}$$
where $\gamma_k$ denotes the error parameter for MAXIMIN-EVI. The left-hand side of (2) is well defined because Player 2 has a stationary optimal policy in the MDP induced by $M^1_k$ and $\pi^1_k$. Roughly speaking, (2) says that $\min_{\pi^2} \rho(M^1_k, \pi^1_k, \pi^2, s)$ should approximate $\max_{\tilde M \in \mathcal{M}_k,\, \pi^1} \min_{\pi^2} \rho(\tilde M, \pi^1, \pi^2, s)$ up to an error of at most $\gamma_k$. That is, $(M^1_k, \pi^1_k)$ is picked optimistically in $\mathcal{M}_k \times \Pi^{\mathrm{SR}}$ against the most adversarial opponent.

4.1 Extended SG and Maximin-EVI

The calculation of $M^1_k$ and $\pi^1_k$ involves the technique of Extended Value Iteration (EVI), whose one-player version appears in [19]. Consider the following SG, named $M^+$. Let the state space S and Player 2's action space A² remain the same as in M. Let $\mathcal{A}^{1+}$, $p^+(\cdot|\cdot,\cdot,\cdot)$, $r^+(\cdot,\cdot,\cdot)$ be Player 1's action set, the transition kernel, and the reward function of $M^+$, such that for any $a^1 \in \mathcal{A}^1$, any $a^2 \in \mathcal{A}^2$, and any admissible transition probability $p(\cdot|s, a^1, a^2) \in P_k(s, a^1, a^2)$, there is an action $a^{1+} \in \mathcal{A}^{1+}$ such that $p^+(\cdot|s, a^{1+}, a^2) = p(\cdot|s, a^1, a^2)$ and $r^+(s, a^{1+}, a^2) = r(s, a^1, a^2)$. In other words, Player 1 selecting an action in $\mathcal{A}^{1+}$ is equivalent to selecting an action in $\mathcal{A}^1$ and simultaneously selecting an admissible transition probability in the confidence region $P_k(\cdot, \cdot)$.

Suppose that $M \in \mathcal{M}_k$; then the extended SG $M^+$ satisfies Assumption 2 because the true model M is embedded in $M^+$. By Theorem E.1 in Appendix E, it has a constant game value $\rho^*(M^+)$ independent of the initial state, and it satisfies a Bellman equation of the form $\mathrm{val}\{r + Pf\} = \rho^* e + f$ for some bounded function $f(\cdot)$, where e stands for the all-one vector. Under these conditions, we can use value iteration with the Schweitzer (a.k.a. aperiodic) transform [34] to solve for the optimal policy in the extended SG $M^+$. We call this procedure MAXIMIN-EVI. For the details of MAXIMIN-EVI, please refer to Appendix F. We only summarize the result in the following lemma.

Lemma 4.1. Suppose the true model $M \in \mathcal{M}_k$. Then the estimated model $M^1_k$ and the stationary policy $\pi^1_k$ output by MAXIMIN-EVI in Step 3 satisfy
$$\forall s, \qquad \min_{\pi^2} \rho(M^1_k, \pi^1_k, \pi^2, s) \ \ge\ \max_{\pi^1} \min_{\pi^2} \rho(M, \pi^1, \pi^2, s) - \gamma_k.$$

Before diving into the analysis under the two assumptions, we first establish the following fact.

Lemma 4.2. With high probability, the true model $M \in \mathcal{M}_k$ for all phases k.

Lemma 4.2 is proved in Appendix D. With it, we can safely assume $M \in \mathcal{M}_k$ in most of our analysis.
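To make the val operator inside MAXIMIN-EVI concrete, the sketch below (ours; the helper names are not from the paper) performs one backup $v \leftarrow \mathrm{val}\{r + Pv\}$ on a known SG, solving each state's stage game as a zero-sum matrix game by linear programming. The extension of Player 1's action set over the confidence region $P_k$, the aperiodic transform, and the $\gamma_k$ stopping rule of the full MAXIMIN-EVI procedure are omitted here.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(G):
    """Value and maximin strategy of the zero-sum matrix game G (row player maximizes).

    Solves max_p min_q p^T G q via the standard LP:
        maximize v  s.t.  sum_i p_i G[i, j] >= v for all j,  p in the simplex.
    Returns (val G, solve1 G).
    """
    n, m = G.shape
    c = np.zeros(n + 1); c[-1] = -1.0                   # variables x = [p, v]; minimize -v
    A_ub = np.hstack([-G.T, np.ones((m, 1))])           # v - p^T G[:, j] <= 0 for every column j
    b_ub = np.zeros(m)
    A_eq = np.zeros((1, n + 1)); A_eq[0, :n] = 1.0      # p sums to one
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]           # p >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n]

def value_iteration_step(P, r, v):
    """One backup v <- val{r + P v}, state by state.

    P: (S, A1, A2, S) transition tensor, r: (S, A1, A2) rewards, v: (S,) values.
    """
    S = P.shape[0]
    v_next = np.empty(S)
    for s in range(S):
        G = r[s] + np.einsum('abt,t->ab', P[s], v)      # stage game at state s
        v_next[s], _ = matrix_game_value(G)
    return v_next
```

Iterating this backup until successive iterates differ by a near-constant vector approximates a solution of the Bellman equation $\mathrm{val}\{r + Pf\} = \rho^* e + f$ above; the full MAXIMIN-EVI additionally runs on the extended SG $M^+$ and applies the aperiodic transform to guarantee convergence.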
5 Analysis under Assumption 1

In this section, we import analysis techniques from one-player MDPs [2, 19, 22, 9] and develop some additional techniques for dealing with non-stationary opponents. We model Player 2's behavior in the most general way, i.e., as a history-dependent randomized policy. Let $H_t = (s_1, a_1, r_1, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t) \in \mathcal{H}_t$ be the history up to $s_t$; we assume $\pi^2_t$ is a mapping from $\mathcal{H}_t$ to a distribution over $\mathcal{A}^2$. We simply write $\pi^2_t(\cdot)$ and hide its dependence on $H_t$ in the subscript t. A similar definition applies to $\pi^1_t(\cdot)$. With a slight abuse of notation, we denote by $k(t)$ the phase in which step t lies, so that our algorithm uses the policy $\pi^1_t(\cdot) = \pi^1_{k(t)}(\cdot|s_t)$; the notations $\pi^1_t$ and $\pi^1_k$ are used interchangeably. Let $T_k := t_{k+1} - t_k$ be the length of phase k.

We decompose the regret in phase k in the following way:
$$\Lambda_k := T_k\, \rho^*(M) - \sum_{t=t_k}^{t_{k+1}-1} r(s_t, a_t) = \sum_{n=1}^{4} \Lambda^{(n)}_k, \tag{3}$$
in which we define
$$\Lambda^{(1)}_k = T_k \Big( \rho^*(M) - \min_{\pi^2} \rho(M^1_k, \pi^1_k, \pi^2, s_{t_k}) \Big),$$
$$\Lambda^{(2)}_k = T_k \Big( \min_{\pi^2} \rho(M^1_k, \pi^1_k, \pi^2, s_{t_k}) - \rho(M^1_k, \pi^1_k, \pi^2_k, s_{t_k}) \Big),$$
$$\Lambda^{(3)}_k = T_k \Big( \rho(M^1_k, \pi^1_k, \pi^2_k, s_{t_k}) - \rho(M, \pi^1_k, \pi^2_k) \Big),$$
$$\Lambda^{(4)}_k = T_k\, \rho(M, \pi^1_k, \pi^2_k) - \sum_{t=t_k}^{t_{k+1}-1} r(s_t, a_t),$$
where $\pi^2_k$ is some stationary policy of Player 2 to be defined later. Since the actions of Player 2 are arbitrary, $\pi^2_k$ is imaginary and exists only in the analysis. Note that under Assumption 1, any pair of stationary policies on M induces an irreducible Markov chain, so we need not specify the initial state for $\rho(M, \pi^1_k, \pi^2_k)$ in (3). Among the four terms, $\Lambda^{(2)}_k$ is clearly non-positive, and $\Lambda^{(1)}_k$, by optimism, can be bounded using Lemma 4.1. It remains to bound $\Lambda^{(3)}_k$ and $\Lambda^{(4)}_k$.

5.1 Bounding $\sum_k \Lambda^{(3)}_k$ and $\sum_k \Lambda^{(4)}_k$

The Introduction of $\pi^2_k$. $\Lambda^{(3)}_k$ and $\Lambda^{(4)}_k$ involve the artificial policy $\pi^2_k$, a stationary policy that replaces Player 2's non-stationary policy in the analysis. This replacement costs some constant regret but facilitates the use of perturbation analysis in bounding the regret. The selection of $\pi^2_k$ is based on the principle that the behavior (e.g., the total number of visits to some (s, a)) of the Markov chain induced by $M, \pi^1_k, \pi^2_k$ should be close to the empirical statistics. Intuitively, $\pi^2_k$ can be defined as
$$\pi^2_k(a^2|s) := \frac{\sum_{t=t_k}^{t_{k+1}-1} \mathbf{1}_{s_t = s}\, \pi^2_t(a^2)}{\sum_{t=t_k}^{t_{k+1}-1} \mathbf{1}_{s_t = s}}. \tag{4}$$
Note two things, however. First, since the actual trajectory is needed to define this policy, it can only be defined after phase k has ended. Second, $\pi^2_k$ can be undefined, because the denominator of (4) can be zero. However, this does not happen in too many steps. Indeed, we have:

Lemma 5.1. $\sum_k T_k \mathbf{1}\{\pi^2_k \text{ is not well-defined}\} \le O(DS^2A)$ with high probability.

Before describing how we bound the regret with the help of $\pi^2_k$ and the perturbation analysis, we establish the following lemma.

Lemma 5.2. We say that the transition probability at time step t is ε-accurate if $|p^1_k(s'|s_t, \pi_t) - p(s'|s_t, \pi_t)| \le \varepsilon$ for all $s'$, where $p^1_k$ denotes the transition kernel of $M^1_k$. Let $B_t(\varepsilon) = 1$ if the transition probability at time t is ε-accurate, and $B_t(\varepsilon) = 0$ otherwise. Then for any state s, with high probability, $\sum_{t=1}^{T} \mathbf{1}_{s_t = s}\, \mathbf{1}_{B_t(\varepsilon) = 0} \le O(A/\varepsilon^2)$.

Now we can sketch the logic behind our proofs. Suppose that $\pi^2_k$ models Player 2's behavior in phase k quite well, i.e., the expected frequency of every state-action pair induced by $M, \pi^1_k, \pi^2_k$ is close to the empirical frequency. Then, clearly, $\Lambda^{(4)}_k$ is close to zero in expectation. The term $\Lambda^{(3)}_k$ then becomes the difference in average reward between two Markov reward processes with slightly different transition probabilities. This term has a single-player counterpart in [19]. Using a similar analysis, we can prove that the dominant term of $\Lambda^{(3)}_k$ is proportional to $\mathrm{sp}(h(M^1_k, \pi^1_k, \pi^2_k, \cdot))$. In the single-player case, [19] can directly claim that $\mathrm{sp}(h(M^1_k, \pi^1_k, \cdot)) \le D$ (see their Remark 8), but unfortunately, this is not the case in the two-player version.⁴ To continue, we resort to perturbation analysis for the mean first passage times (developed in Appendix C). Lemma 5.2 shows that $M^1_k$ is not far from M for too many steps. Theorem C.9 in Appendix C then tells us that if $M^1_k$ is close enough to M, $T^{\pi^1_k, \pi^2_k}(M^1_k)$ can be bounded by $2\, T^{\pi^1_k, \pi^2_k}(M)$. Since Remark M.1 implies that $\mathrm{sp}(h(M^1_k, \pi^1_k, \pi^2_k, \cdot)) \le T^{\pi^1_k, \pi^2_k}(M^1_k)$ and Assumption 1 guarantees that $T^{\pi^1_k, \pi^2_k}(M) \le D$, we obtain
$$\mathrm{sp}(h(M^1_k, \pi^1_k, \pi^2_k, \cdot)) \ \le\ T^{\pi^1_k, \pi^2_k}(M^1_k) \ \le\ 2\, T^{\pi^1_k, \pi^2_k}(M) \ \le\ 2D.$$
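As a side note, the mean first passage times $T^{\pi^1, \pi^2}_{s \to s'}(M)$ used throughout this section can be computed exactly when the induced chain is known, by solving one linear system per target state. The sketch below is only for building intuition about the diameter-type quantities D and $T^{\pi}(M)$; UCSG itself never has access to M, and the function name is ours.

```python
import numpy as np

def mean_first_passage_times(P_pi):
    """Mean first passage times of the Markov chain induced by a joint policy.

    P_pi: (S, S) transition matrix of the induced chain (assumed irreducible).
    Returns M with M[i, j] = expected number of steps to first reach j from i
    (M[j, j] = 0). The diameter-type quantity T^pi(M) of the text is M.max().
    """
    S = P_pi.shape[0]
    M = np.zeros((S, S))
    for j in range(S):
        idx = [i for i in range(S) if i != j]
        # Solve m_i = 1 + sum_{k != j} P[i, k] m_k for all i != j.
        A = np.eye(S - 1) - P_pi[np.ix_(idx, idx)]
        m = np.linalg.solve(A, np.ones(S - 1))
        M[idx, j] = m
    return M
```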
The above approach leads to Lemma 5.3, which is a key step in our analysis. We first define some notation. Under Assumption 1, any pair of stationary policies induces an irreducible Markov chain, which has a unique stationary distribution. If the policy pair $\pi = (\pi^1, \pi^2)$ is executed, we denote its stationary distribution by $\mu(M, \pi^1, \pi^2, \cdot) = \mu(M, \pi, \cdot)$. Besides, denote $v_k(s) := \sum_{t=t_k}^{t_{k+1}-1} \mathbf{1}_{s_t = s}$. We say a phase k is benign if the following all hold: the true model M lies in $\mathcal{M}_k$, $\pi^2_k$ is well-defined, $\mathrm{sp}(h(M^1_k, \pi^1_k, \pi^2_k, \cdot)) \le 2D$, and $\mu(M, \pi^1_k, \pi^2_k, s) \le 2\, v_k(s)/T_k$ for all s. We can show the following.

Lemma 5.3. $\sum_k T_k \mathbf{1}\{\text{phase } k \text{ is not benign}\} \le O(D^3 S^5 A)$ with high probability.

Finally, for benign phases, we have the following two lemmas.

Lemma 5.4. $\sum_k \Lambda^{(4)}_k \mathbf{1}\{\pi^2_k \text{ is well-defined}\} \le O(D\sqrt{ST} + DSA)$ with high probability.

Lemma 5.5. $\sum_k \Lambda^{(3)}_k \mathbf{1}\{\text{phase } k \text{ is benign}\} \le O(DS\sqrt{AT} + DS^2A)$ with high probability.

⁴The argument in [19] is simple: suppose that $h(M^1_k, \pi^1_k, s) - h(M^1_k, \pi^1_k, s') > D$. By the communicating assumption, there is a path from s' to s with expected time no more than D. Thus a policy that first goes from s' to s within D steps and then executes $\pi^1_k$ will outperform $\pi^1_k$ at s', which is a contradiction. In two-player SGs, a similar argument shows that $\mathrm{sp}(h(M^1_k, \pi^1_k, \pi^{2*}_k, \cdot)) \le D$, where $\pi^{2*}_k$ is the best response to $\pi^1_k$ under $M^1_k$. However, since Player 2 is uncontrollable, his/her policy (or its stationary proxy $\pi^2_k$) can be quite different from $\pi^{2*}_k$, and thus $\mathrm{sp}(h(M^1_k, \pi^1_k, \pi^2_k, \cdot)) \le D$ does not necessarily hold.

Proof of Theorem 3.1. The proof starts from the decomposition (3). $\Lambda^{(1)}_k$ is bounded with the help of Lemma 4.1: $\sum_k \Lambda^{(1)}_k \le \sum_k T_k \gamma_k = \sum_k T_k/\sqrt{t_k} = O(\sqrt{T})$, and $\Lambda^{(2)}_k \le 0$ by definition. Then, with Lemmas 5.1, 5.3, 5.4, and 5.5, we can bound $\sum_k \Lambda^{(3)}_k$ and $\sum_k \Lambda^{(4)}_k$ by $O(D^3 S^5 A + DS\sqrt{AT})$, which yields the theorem.

6 Analysis under Assumption 2

In Section 5, the main ingredient of the regret analysis was bounding the span of the bias vector, $\mathrm{sp}(h(M^1_k, \pi^1_k, \pi^2_k, \cdot))$. The same approach does not work here, because under the weaker Assumption 2 we do not have a bound on the mean first passage times under arbitrary policy pairs. Hence we adopt the approach of approximating the average-reward SG by a sequence of finite-horizon SGs. On a high level, first, with the help of Assumption 2, we approximate T times the game value of the original average-reward SG (i.e., the total reward in hindsight) by the sum of the values of H-step episodic SGs; second, we resort to the results of [9] to bound the sample complexity of the H-step SGs and translate it into regret.

Approximation by repeated episodic SGs. The quantity H used for this approximation does not appear in UCSG but only in the analysis. The horizon T is divided into episodes, each of length H. We index episodes by $i = 1, \ldots, T/H$ and denote episode i's first time step by $\tau_i$. We say $i \in \mathrm{ph}(k)$ if all H steps of episode i lie in phase k. Define the H-step expected reward under a joint policy π with initial state s as $V_H(M, \pi, s) := \mathbb{E}\big[\sum_{t=1}^{H} r_t \,\big|\, a_t \sim \pi,\ s_1 = s\big]$.
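As an illustration of the quantity just defined (our own sketch, not part of the analysis), $V_H(M, \pi, s)$ for a fixed joint stationary policy can be computed by plain backward induction. Replacing the expectation over $a^2$ with a minimization at every step would instead give $\min_{\pi^2} V_H(M, \pi^1, \pi^2, s)$ for a fixed $\pi^1$.

```python
import numpy as np

def finite_horizon_value(P, r, pi1, pi2, H):
    """H-step expected reward V_H(M, pi, s) for a fixed joint stationary policy.

    Backward induction: V_0 = 0 and V_h(s) = r_pi(s) + sum_s' P_pi(s'|s) V_{h-1}(s').
    P: (S, A1, A2, S) transitions, r: (S, A1, A2) rewards,
    pi1: (S, A1), pi2: (S, A2) stationary policies.
    Returns a vector whose s-th entry is V_H(M, pi, s).
    """
    P_pi = np.einsum('sabt,sa,sb->st', P, pi1, pi2)
    r_pi = np.einsum('sab,sa,sb->s', r, pi1, pi2)
    V = np.zeros(P.shape[0])
    for _ in range(H):
        V = r_pi + P_pi @ V
    return V
```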
Now we decompose the regret in phase k as
$$T_k\, \rho^*(M) - \sum_{t=t_k}^{t_{k+1}-1} r(s_t, a_t) \ =\ \sum_{n=1}^{6} \Delta^{(n)}_k, \tag{5}$$
where
$$\Delta^{(1)}_k = \sum_{i \in \mathrm{ph}(k)} H \Big( \rho^*(M) - \min_{\pi^2} \rho(M^1_k, \pi^1_k, \pi^2, s_{\tau_i}) \Big),$$
$$\Delta^{(2)}_k = \sum_{i \in \mathrm{ph}(k)} \Big( H \min_{\pi^2} \rho(M^1_k, \pi^1_k, \pi^2, s_{\tau_i}) - \min_{\pi^2} V_H(M^1_k, \pi^1_k, \pi^2, s_{\tau_i}) \Big),$$
$$\Delta^{(3)}_k = \sum_{i \in \mathrm{ph}(k)} \Big( \min_{\pi^2} V_H(M^1_k, \pi^1_k, \pi^2, s_{\tau_i}) - V_H(M^1_k, \pi^1_k, \pi^2_i, s_{\tau_i}) \Big),$$
$$\Delta^{(4)}_k = \sum_{i \in \mathrm{ph}(k)} \Big( V_H(M^1_k, \pi^1_k, \pi^2_i, s_{\tau_i}) - V_H(M, \pi^1_k, \pi^2_i, s_{\tau_i}) \Big),$$
$$\Delta^{(5)}_k = \sum_{i \in \mathrm{ph}(k)} \Big( V_H(M, \pi^1_k, \pi^2_i, s_{\tau_i}) - \sum_{t=\tau_i}^{\tau_{i+1}-1} r(s_t, a_t) \Big), \qquad \Delta^{(6)}_k = 2H.$$
Here, $\pi^2_i$ denotes Player 2's policy in episode i, which may be non-stationary. $\Delta^{(6)}_k$ accounts for the (at most) two incomplete episodes in phase k. $\Delta^{(1)}_k$ is related to the tolerance level we set for the MAXIMIN-EVI algorithm: $\Delta^{(1)}_k \le T_k \gamma_k = T_k/\sqrt{t_k}$. $\Delta^{(2)}_k$ is the error caused by approximating an infinite-horizon SG by a repeated episodic H-step SG (with possibly different initial states). $\Delta^{(3)}_k$ is clearly non-positive. It remains to bound $\Delta^{(2)}_k$, $\Delta^{(4)}_k$, and $\Delta^{(5)}_k$.

Lemma 6.1. By the Azuma–Hoeffding inequality, $\sum_k \Delta^{(5)}_k \le O(\sqrt{HT})$ with high probability.

Lemma 6.2. Under Assumption 2, $\sum_k \Delta^{(2)}_k \le TD/H$ plus lower-order terms, with high probability.

From sample complexity to regret bound. As the main contributor to the regret, $\Delta^{(4)}_k$ corresponds to the inaccuracy of the transition probability estimates. Here we largely reuse the results of [9], who consider a one-player episodic MDP with a fixed initial-state distribution. Their main lemma states that the number of episodes lying in phases k with $|V_H(M^1_k, \pi_k, s_0) - V_H(M, \pi_k, s_0)| > \varepsilon$ does not exceed $O(H^2 S^2 A/\varepsilon^2)$, where $s_0$ is their initial state in each episode. Note that their proof allows $\pi_k$ to be an arbitrarily selected non-stationary policy for phase k. We can directly utilize their analysis, which we summarize as Theorem K.1 in the appendix. While their algorithm takes ε as an input, this input can be removed without affecting the bounds, which means that the PAC bound holds for arbitrarily selected ε. With the help of Theorem K.1, we have:

Lemma 6.3. $\sum_k \Delta^{(4)}_k \le O(S\sqrt{HAT} + HS^2A)$ with high probability.

Proof of Theorem 3.2. With the decomposition (5) and the help of Lemmas 6.1, 6.2, and 6.3, the regret is bounded by $O(TD/H + S\sqrt{HAT} + S^2AH)$, which becomes $O\big(\sqrt[3]{DS^2AT^2}\big)$ by selecting $H = \max\{D, \sqrt[3]{D^2 T/(S^2 A)}\}$.

7 Sample Complexity of Offline Training

In Section 3.1, we defined $L_\varepsilon$ to be the sample complexity of learning Player 1's maximin policy. In our offline version of UCSG, in each phase k we let both players select their own optimistic policies. After Player 1 has optimistically selected $\pi^1_k$, Player 2 optimistically selects his policy $\pi^2_k$ based on the known $\pi^1_k$. Specifically, the model-policy pair $(M^2_k, \pi^2_k)$ is obtained by another extended value iteration on the extended MDP under the fixed $\pi^1_k$, where now Player 2's action set is extended. By setting the stopping threshold also to $\gamma_k$, when value iteration halts we have
$$\rho(M^2_k, \pi^1_k, \pi^2_k, s) \ \le\ \min_{\tilde M \in \mathcal{M}_k} \min_{\pi^2} \rho(\tilde M, \pi^1_k, \pi^2, s) + \gamma_k. \tag{6}$$
With this selection rule, we obtain the following theorems.

Theorem 7.1. Under Assumption 1, UCSG achieves $L_\varepsilon = O(D^3 S^5 A + D^2 S^2 A/\varepsilon^2)$ w.h.p.

Theorem 7.2. Let Assumption 2 hold, and further assume that $\max_{s,s'} \max_{\pi^1 \in \Pi^{\mathrm{SR}}} \min_{\pi^2 \in \Pi^{\mathrm{SR}}} T^{\pi^1, \pi^2}_{s \to s'}(M) \le D$. Then UCSG achieves $L_\varepsilon = O(DS^2A/\varepsilon^3)$ w.h.p.

Moreover, the algorithm comes with the following guarantee: if we run the offline version of UCSG for $T > L_\varepsilon$ steps, it can output a single stationary policy for Player 1 that is ε-optimal. We show how to output this policy in the proofs of Theorems 7.1 and 7.2.

8 Open Problems

In this work, we obtain regret bounds of $O(D^3 S^5 A + DS\sqrt{AT})$ and $O\big(\sqrt[3]{DS^2AT^2}\big)$ under different mixing assumptions.
A natural open problem is how to improve these bounds, in both the asymptotic and the constant terms. A lower bound can be inherited from the one-player MDP setting, namely $\Omega(\sqrt{DSAT})$ [19]. Another open problem is whether the SG can still be learned if we further weaken the assumptions to $\max_{s,s'} \min_{\pi^1} \min_{\pi^2} T^{\pi^1, \pi^2}_{s \to s'} \le D$. We have argued that with this assumption alone, we cannot in general obtain sublinear regret in the online setting. However, it may still be possible to obtain polynomial offline sample complexity if the two players cooperate to explore the state-action space.

Acknowledgments

We would like to thank all the anonymous reviewers who devoted their time to reviewing this work and gave us valuable feedback. We would like to give special thanks to the reviewer who reviewed this work's previous version at ICML; your detailed check of our proofs greatly improved the quality of this paper.

References

[1] Yasin Abbasi, Peter L. Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvári. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, 2013.
[2] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, 2007.
[3] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009.
[4] Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. In International Joint Conference on Artificial Intelligence, 2001.
[5] Ronen I. Brafman and Moshe Tennenholtz. R-MAX: A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2002.
[6] Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, 2012.
[7] Grace E. Cho and Carl D. Meyer. Markov chain sensitivity measured by mean first passage times. Linear Algebra and Its Applications, 2000.
[8] Vincent Conitzer and Tuomas Sandholm. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 2007.
[9] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, 2015.
[10] Travis Dick, Andras Gyorgy, and Csaba Szepesvari. Online learning in Markov decision processes with changing cost sequences. In Proceedings of the International Conference on Machine Learning, 2014.
[11] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 2009.
[12] Awi Federgruen. On N-person stochastic games with denumerable state space. Advances in Applied Probability, 1978.
[13] Aurélien Garivier, Emilie Kaufmann, and Wouter M. Koolen. Maximin action identification: A new bandit framework for games. In Conference on Learning Theory, pages 1028-1050, 2016.
[14] Arie Hordijk. Dynamic Programming and Markov Potential Theory. MC Tracts, 1974.
[15] Jeffrey J. Hunter. Generalized inverses and their application to applied probability problems. Linear Algebra and Its Applications, 1982.
[16] Jeffrey J. Hunter. Stationary distributions and mean first passage times of perturbed Markov chains. Linear Algebra and Its Applications, 2005.
[17] Garud N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257-280, 2005.
[18] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 1994.
[19] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 2010.
[20] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, 2003.
[21] Michail G. Lagoudakis and Ronald Parr. Value function approximation in zero-sum Markov games. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2002.
[22] Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory. Springer, 2012.
[23] Shiau Hong Lim, Huan Xu, and Shie Mannor. Reinforcement learning in robust Markov decision processes. Mathematics of Operations Research, 41(4):1325-1353, 2016.
[24] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, 1994.
[25] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. In Conference on Learning Theory, 2009.
[26] Jean-François Mertens and Abraham Neyman. Stochastic games. International Journal of Game Theory, 1981.
[27] Gergely Neu, Andras Antos, András György, and Csaba Szepesvári. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, 2010.
[28] Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In AISTATS, 2012.
[29] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780-798, 2005.
[30] Julien Perolat, Bruno Scherrer, Bilal Piot, and Olivier Pietquin. Approximate dynamic programming for two-player zero-sum Markov games. In Proceedings of the International Conference on Machine Learning, 2015.
[31] H. L. Prasad, Prashanth L. A., and Shalabh Bhatnagar. Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, 2015.
[32] Lloyd S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 1953.
[33] Csaba Szepesvári and Michael L. Littman. Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. In Proceedings of the International Conference on Machine Learning, 1996.
[34] J. van der Wal. Successive approximations for average reward Markov games. International Journal of Game Theory, 1980.
[35] Jia Yuan Yu and Shie Mannor. Arbitrarily modulated Markov decision processes. In Proceedings of the Conference on Decision and Control. IEEE, 2009.