# Approximate State Abstraction for Markov Games

Hiroki Ishibashi¹, Kenshi Abe¹,², Atsushi Iwasaki¹
¹The University of Electro-Communications, ²CyberAgent
i2430006@edu.cc.uec.ac.jp, abekenshi1224@gmail.com, atsushi.iwasaki@uec.ac.jp

This paper introduces state abstraction for two-player zero-sum Markov games (TZMGs), where the payoffs for the two players are determined by the state representing the environment and their respective actions, with state transitions following Markov decision processes. For example, in games like soccer, the value of actions changes according to the state of play, and thus such games should be described as Markov games. In TZMGs, computing equilibria becomes more difficult as the number of states increases. We therefore consider state abstraction, which reduces the number of states by treating multiple different states as a single state. There is a substantial body of research on finding optimal policies for Markov decision processes using state abstraction. However, in the multi-player setting, the game with state abstraction may yield equilibrium solutions different from those of the ground game. To evaluate the equilibrium solutions of the game with state abstraction, we derive bounds on the duality gap, which measures the distance from the equilibrium solutions of the ground game. Finally, we demonstrate our state abstraction on Markov Soccer, compute equilibrium policies, and examine the results.

1 Introduction

Multi-agent reinforcement learning (MARL) is a framework for sequential decision-making in which multiple agents make decisions in a non-stationary environment to maximize their cumulative rewards. MARL has a wide range of applications, e.g., robotics, distributed control, and game AI (Shalev-Shwartz, Shammah, and Shashua 2016; Silver et al. 2016, 2017; Brown and Sandholm 2018; Perolat et al. 2022).
Such an environment is often modeled as a two-player zero-sum Markov game (TZMG) (Littman 1994), and computing its equilibria is said to be empirically tractable. However, it still suffers from the exponential growth of the state space in the number of domain variables. Markov decision processes (MDPs), the single-agent version of Markov games, face the same challenge. Solving an MDP, i.e., computing an optimal policy, is P-complete in the state space size (Littman 1994), while that size often grows exponentially. Several state abstraction techniques have been developed that aggregate multiple states into one abstract state according to a certain criterion, thereby reducing the state space size. For example, a state is equivalent to another state if choosing an action leads to the same state with the same reward. Such abstraction enables agents to solve more complicated MDPs than they could without it. However, if we abstract the state space by regarding only identical states as equivalent, it is difficult to find all such states within a reasonable time, and, even worse, no two states might be identical. In contrast to such exact state abstraction, Abel, Hershkowitz, and Littman (2016) proposed approximate state abstraction, which characterizes how close one state is to another in single-agent MDPs. This technique reduces ground MDPs with large state spaces to abstract MDPs with smaller state spaces by aggregating states according to some notion of closeness or similarity. While this relaxation makes the state space smaller, the resulting optimal policies in abstract MDPs may become suboptimal. Furthermore, they derive error bounds for the resulting policies based on four different criteria, such as optimal Q-values, according to which states are aggregated.

Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
TZMGs extend MDPs to model decision-making by two interacting agents in a shared, state-dependent environment. Unlike in MDPs, the rewards in TZMGs depend on the actions of both agents. For instance, in a soccer game, the rewards vary based on the state of play and the actions chosen by both players. TZMGs provide a framework to formalize such scenarios and have stimulated research in competitive MARL. An agent's optimal policy depends on the policy of its opponent. The goal is to identify the equilibrium policy profile, which consists of mutual best responses, in order to predict and understand the consequences of the agents' interactions.

Building upon these insights, this paper extends the work of Abel, Hershkowitz, and Littman (2016) from single-agent MDPs to TZMGs. We first describe an approximate state abstraction based on an optimal Q-value criterion, i.e., minimax values, and derive a bound on the duality gap of the resulting equilibrium, which measures its proximity to equilibrium. To establish the bound, we analyze the gains from best responses in the ground game, where the strategy space is unabstracted, against the resulting equilibrium of the abstracted game, where the strategy space has been simplified. Second, we conduct experiments in Markov Soccer and demonstrate how state spaces are reduced and how well the equilibrium strategies of the ground TZMG are approximated. Furthermore, we discuss the extension to the other three criteria.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Related Literature. There is a large body of literature on state abstraction for MDPs, initiated by Dietterich (1998, 1999) and Jonsson and Barto (2000). The concept of state equivalence has been developed to reduce the state space size by aggregating equivalent states (Givan, Dean, and Greig 2003), and it has been relaxed by allowing some associated actions (Ravindran and Barto 2003, 2004; van der Pol et al. 2020).
Ferns, Panangaden, and Precup (2004) proposed a distance between states and aggregated states with zero distance. Similarly, Castro (2020) parameterized the bisimulation metric (Ferns, Panangaden, and Precup 2004). Recently, Dadvar, Nayyar, and Srivastava (2023) incorporated state abstraction into policy iteration. Li, Walsh, and Littman (2006) discussed the optimality of policies under five criteria for state abstraction. In another line of work, state space abstraction has been developed in poker AI (Gilpin 2006; Gilpin and Sandholm 2006, 2007; Gilpin, Sandholm, and Sørensen 2007; Johanson et al. 2013; Waugh 2013; Ganzfried and Sandholm 2013; Burch, Johanson, and Bowling 2014; Kroer and Sandholm 2018). The state space in poker can have as many as $10^{160}$ states, and this area has been extensively investigated. Initially, states were abstracted manually based on knowledge and experience in poker. Subsequently, automated abstraction techniques were developed. For example, states are aggregated by estimating winning probabilities at information sets (Gilpin and Sandholm 2007; Gilpin, Sandholm, and Sørensen 2007; Johanson et al. 2013; Waugh 2013; Ganzfried and Sandholm 2013). Furthermore, Kroer and Sandholm (2018) presented a unified framework for analyzing abstractions that can express all types of abstractions and solution concepts used in prior work, with performance guarantees, while maintaining comparable bounds on abstraction quality.

2 Preliminaries

2.1 Two-Player Zero-Sum Markov Games

A two-player zero-sum Markov game (TZMG) $M$ is defined by a tuple $M = (S, A_1, A_2, P, R, \gamma)$. Here, $S$ is a finite state space, $A_i$ is the action space of player $i \in \{1, 2\}$, $P : S \times A_1 \times A_2 \to \Delta(S)$ is a transition probability function, $R : S \times A_1 \times A_2 \to [0, 1]$ is a reward function, and $\gamma \in [0, 1)$ is a discount factor. Let $A = A_1 \times A_2$, and let $a = (a_1, a_2) \in A$ denote an action profile.
For a given state $s \in S$ and an action profile $a \in A$, the next state is determined according to $P(\cdot \mid s, a)$, and player 1 (resp. player 2) receives a reward of $R(s, a)$ (resp. $-R(s, a)$). A Markov policy for player $i$, denoted $\pi_i : S \to \Delta(A_i)$, represents the probability of choosing action $a_i \in A_i$ at state $s \in S$. Letting $\pi = (\pi_1, \pi_2)$ be a policy profile, we further define the state value function, which is the expected discounted sum of rewards at state $s \in S$, as follows:

$$V^{\pi}(s) := \mathbb{E}\left[ \sum_{t=1}^{\infty} \gamma^{t-1} R(s_t, a_t) \,\middle|\, s_1 = s,\ a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim P(\cdot \mid s_t, a_t) \right].$$

We similarly define the state-action value function of taking an action profile $a \in A$ at state $s \in S$ as follows:

$$Q^{\pi}(s, a) := R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^{\pi}(s').$$

From the definition of these functions, the state value function can be expressed as $V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) Q^{\pi}(s, a)$.

2.2 Nash Equilibrium

In TZMGs, a Nash equilibrium $\pi^* = (\pi^*_1, \pi^*_2)$ is defined as a policy profile that satisfies the following condition for all $s \in S$ simultaneously:

$$\forall (\pi_1, \pi_2), \quad V^{\pi_1, \pi^*_2}(s) \le V^{\pi^*}(s) \le V^{\pi^*_1, \pi_2}(s).$$

Intuitively, a Nash equilibrium is a policy profile under which no player can improve her value by unilaterally deviating from her own policy. Shapley (1953) has shown that any Nash equilibrium $\pi^*$ in a TZMG satisfies the following condition for all $s \in S$:

$$V^{\pi^*}(s) = \max_{p \in \Delta(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} p(a_1) Q^{\pi^*}(s, a) = \min_{p \in \Delta(A_2)} \max_{a_1 \in A_1} \sum_{a_2 \in A_2} p(a_2) Q^{\pi^*}(s, a). \tag{1}$$

It is known that this minimax value is unique for each $s$ (Shapley 1953); thus we can write $V^*(s) := V^{\pi^*}(s)$ and $Q^*(s, a) := Q^{\pi^*}(s, a)$. To measure the proximity to equilibrium of a given policy profile $\pi = (\pi_1, \pi_2)$, we use the duality gap, defined as follows:

$$\mathrm{GAP}(\pi) := \max_{s \in S,\ \pi'_1,\ \pi'_2} \left( V^{\pi'_1, \pi_2}(s) - V^{\pi_1, \pi'_2}(s) \right).$$

From the definition, $\mathrm{GAP}(\pi) \ge 0$ for any $\pi$, and equality holds if and only if $\pi$ is a Nash equilibrium.

2.3 Minimax Q-learning

Let us briefly describe Minimax Q-learning (Littman 1994), which was developed to address the limitations of standard Q-learning in adversarial, zero-sum environments.
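Equation (1) reduces the equilibrium computation at each state to solving a zero-sum matrix game, which can be done by linear programming. As a concrete illustration, here is a minimal sketch of the row player's maximin LP with numpy and scipy; the function name and the matching-pennies example are illustrative assumptions, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(Q):
    """Maximin mixed strategy for the row player of a zero-sum matrix game.

    Q[i, j] is the row player's payoff for actions (i, j); returns (p, v)
    where p attains v = max_p min_j sum_i p[i] * Q[i, j].
    """
    n_rows, n_cols = Q.shape
    # Decision variables: (p_1, ..., p_n, v); maximize v <=> minimize -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # For every column j: v - sum_i p_i Q[i, j] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # Probabilities sum to one; v is a free variable.
    A_eq = np.ones((1, n_rows + 1))
    A_eq[0, -1] = 0.0
    bounds = [(0, None)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:n_rows], res.x[-1]

# Matching pennies: the unique equilibrium mixes uniformly and the value is 0.
p, v = solve_matrix_game(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```

The dual LP (over $\Delta(A_2)$) gives the column player's equilibrium strategy, matching the second expression in (1).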
Its robustness and adaptability in competitive settings make it effective for AI in games and security applications. It has been shown that Minimax Q-learning converges to a Nash equilibrium under appropriate conditions (Szepesvári and Littman 1999). Owing to this theoretical guarantee and its ease of implementation, Minimax Q-learning is a standard algorithm for computing equilibrium policies of TZMGs. Algorithm 1 illustrates the procedure with a finite number of iterations $T$.

Algorithm 1: Minimax Q-learning
Input: learning rates $\alpha_t$, exploration parameter $\beta$
1: $V[s] \leftarrow 0$ for all $s \in S$
2: $Q[s, a] \leftarrow 0$ for all $s \in S$ and $a \in A$
3: $\pi^0_i(a_i \mid s) \leftarrow 1/|A_i|$ for all $i \in \{1, 2\}$, $s \in S$, and $a_i \in A_i$
4: Sample initial state $s_0$
5: for $t = 0, 1, \ldots, T-1$ do
6:   $\pi'_i(\cdot \mid s_t) \leftarrow (1-\beta)\, \pi^t_i(\cdot \mid s_t) + (\beta/|A_i|)\, \mathbf{1}$ for all $i \in \{1, 2\}$
7:   Sample action profile $a_t \sim \pi'(\cdot \mid s_t)$
8:   Sample next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
9:   $Q[s_t, a_t] \leftarrow (1-\alpha_t)\, Q[s_t, a_t] + \alpha_t \left( R(s_t, a_t) + \gamma V[s_{t+1}] \right)$
10:  $\pi^{t+1}_1(\cdot \mid s_t) \leftarrow \arg\max_{p \in \Delta(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} p(a_1) Q[s_t, (a_1, a_2)]$
11:  $\pi^{t+1}_2(\cdot \mid s_t) \leftarrow \arg\min_{p \in \Delta(A_2)} \max_{a_1 \in A_1} \sum_{a_2 \in A_2} p(a_2) Q[s_t, (a_1, a_2)]$
12:  $V[s_t] \leftarrow \sum_{a_1 \in A_1} \sum_{a_2 \in A_2} \pi^{t+1}_1(a_1 \mid s_t)\, \pi^{t+1}_2(a_2 \mid s_t)\, Q[s_t, (a_1, a_2)]$
13: end for
Output: $(\pi^T_1, \pi^T_2)$

In words, the procedure repeats the following steps:
1. At each iteration $t \ge 0$, each player $i \in \{1, 2\}$ chooses an action $a_{i,t} \in A_i$ uniformly at random with probability $\beta$, and otherwise according to her policy $\pi^t_i(\cdot \mid s_t) \in \Delta(A_i)$.
2. The state $s_t$ transitions to $s_{t+1}$ according to the transition probability function $P(\cdot \mid s_t, a_{1,t}, a_{2,t})$.
3. The state-action value function $Q_t$ is updated with the obtained reward $R(s_t, a_t)$ and the learning rate $\alpha_t$:
$$Q_{t+1}(s_t, a_t) := (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \left( R(s_t, a_t) + \gamma V_t(s_{t+1}) \right).$$
4. The players update their policies by linear programming:
$$\pi_1(\cdot \mid s_t) := \arg\max_{p \in \Delta(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} p(a_1) Q_{t+1}(s_t, a), \quad \pi_2(\cdot \mid s_t) := \arg\min_{p \in \Delta(A_2)} \max_{a_1 \in A_1} \sum_{a_2 \in A_2} p(a_2) Q_{t+1}(s_t, a).$$
5. They update the state value function with the current policies:
$$V_{t+1}(s_t) := \sum_{a_1 \in A_1} \sum_{a_2 \in A_2} \pi_1(a_1 \mid s_t)\, \pi_2(a_2 \mid s_t)\, Q_{t+1}(s_t, a).$$

We implement this procedure when we compute the duality gap in Section 5.
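As a toy illustration of the updates in Algorithm 1, the following self-contained sketch runs the Q-update and the LP policy step on a hypothetical single-state 2 × 2 game with uniform exploration (effectively β = 1) and a fixed learning rate. The game matrix, constants, and function names are illustrative assumptions, not the paper's Markov Soccer setup.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Maximin strategy and value of the zero-sum matrix game Q (row = player 1)."""
    n, m = Q.shape
    c = np.zeros(n + 1); c[-1] = -1.0               # maximize v
    A_ub = np.hstack([-Q.T, np.ones((m, 1))])        # v <= sum_i p_i Q[i, j] for all j
    A_eq = np.ones((1, n + 1)); A_eq[0, -1] = 0.0    # sum_i p_i = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.x[:n], res.x[-1]

# Hypothetical TZMG with a single state and reward matrix R for player 1.
# The minimax value of R = [[1, 0], [0, 1]] is 0.5, so V* = 0.5 / (1 - 0.9) = 5.
rng = np.random.default_rng(0)
R = np.array([[1.0, 0.0], [0.0, 1.0]])
gamma, alpha, T = 0.9, 0.2, 2000
Q, V = np.zeros((2, 2)), 0.0
for t in range(T):
    a1, a2 = rng.integers(2), rng.integers(2)        # uniform exploration (beta = 1)
    # Line 9 of Algorithm 1: temporal-difference update toward R + gamma * V.
    Q[a1, a2] = (1 - alpha) * Q[a1, a2] + alpha * (R[a1, a2] + gamma * V)
    # Lines 10-12: recompute the minimax value of the current Q at this state.
    _, V = matrix_game_value(Q)
```

Because the toy game has a single state and deterministic rewards, the loop behaves like an asynchronous minimax value iteration, and $V$ approaches the discounted minimax value $V^* = 5$.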
Note that our proposed state abstraction does not depend on Minimax Q-learning, just as the abstraction in Abel, Hershkowitz, and Littman (2016) does not depend on a particular solver. In fact, we go on to discuss extensions to different criteria for aggregating states.

3 State Abstraction

In this section, we extend state abstraction for an MDP to a TZMG. State abstraction is a method for reducing the state space by aggregating similar states, so as to reduce the time needed to compute equilibria. Abel, Hershkowitz, and Littman (2016) propose four different state abstraction approaches for an MDP and theoretically analyze how well the optimal policy of the abstract MDP performs in the ground MDP under various metrics. In this section, we extend their approach based on state-action value function similarity. The other approaches, namely model similarity, Boltzmann distribution similarity, and multinomial distribution similarity, are introduced in Section 6.

3.1 Abstract Two-Player Zero-Sum Markov Games

This section defines an abstract TZMG $M_A = (S_A, A_1, A_2, P_A, R_A, \gamma)$, using the notation introduced by Abel, Hershkowitz, and Littman (2016) and Li, Walsh, and Littman (2006). Here, $S_A$ is an abstract state space, $P_A : S_A \times A_1 \times A_2 \to \Delta(S_A)$ is an abstract transition probability function, and $R_A : S_A \times A_1 \times A_2 \to [0, 1]$ is an abstract reward function. Let $\phi : S \to S_A$ be a state aggregation function that maps the ground state space $S$ to the abstract state space $S_A$. Given $\phi$, we can define the set of states $G_A(s_A)$ that are aggregated into the same abstract state $s_A \in S_A$:

$$G_A(s_A) := \{ g \in S \mid \phi(g) = s_A \}.$$

For a given ground state $s \in S$, we further define the set of states $G(s)$ that are aggregated into the same abstract state as $s$:

$$G(s) := \{ g \in S \mid \phi(g) = \phi(s) \}.$$

In order to construct an abstract transition probability function $P_A$ and an abstract reward function $R_A$, we introduce a weight function $w : S \to [0, 1]$, which satisfies, for every $s_A \in S_A$:

$$\sum_{g \in G_A(s_A)} w(g) = 1.$$
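The whole construction can be sketched in a few lines of numpy: given an aggregation map $\phi$ and class-normalized weights $w$ (uniform by default), the abstract reward and transition functions are the $w$-weighted averages over each aggregation class. Array shapes, the flattening of joint actions into one axis, and the function name are assumptions for illustration.

```python
import numpy as np

def abstract_game(P, R, phi, w=None):
    """Build (P_A, R_A) from a ground game (P, R) and an aggregation phi.

    P: (S, A, S) transition tensor and R: (S, A) rewards, with the joint
    actions A = |A1| * |A2| flattened into one axis; phi: length-S array
    mapping ground states to abstract states; w: length-S weights summing
    to 1 within each aggregation class (uniform w(s) = 1/|G(s)| if None).
    """
    S, A, _ = P.shape
    SA = phi.max() + 1
    if w is None:
        sizes = np.bincount(phi, minlength=SA)
        w = 1.0 / sizes[phi]                 # w(s) = 1 / |G(s)|
    R_A = np.zeros((SA, A))
    P_A = np.zeros((SA, A, SA))
    for s in range(S):
        R_A[phi[s]] += w[s] * R[s]           # weighted average reward
        for s2 in range(S):
            # Mass into ground state s2 is credited to its abstract state.
            P_A[phi[s], :, phi[s2]] += w[s] * P[s, :, s2]
    return P_A, R_A
```

By construction, each row $P_A(\cdot \mid s_A, a)$ sums to one whenever the ground rows do, so $(P_A, R_A)$ is again a valid TZMG on $S_A$.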
For example, $w(s) = 1/|G(s)|$ is a representative weight function. Using this function, we define the abstract transition probability function $P_A$ as follows:

$$P_A(s'_A \mid s_A, a) := \sum_{s \in G_A(s_A)} \sum_{s' \in G_A(s'_A)} P(s' \mid s, a)\, w(s).$$

Similarly, the abstract reward function $R_A : S_A \times A \to [0, 1]$ is defined as follows:

$$R_A(s_A, a) := \sum_{s \in G_A(s_A)} R(s, a)\, w(s).$$

A policy profile in the abstract TZMG $M_A$ is defined analogously to the ground TZMG $M$. Let $\pi_{A,i} : S_A \to \Delta(A_i)$ be a policy for player $i$ in $M_A$. Similarly, $V^{\pi_A}_A(s_A)$ denotes the state value of a given policy profile $\pi_A$ at state $s_A \in S_A$ in the abstract TZMG, and $Q^{\pi_A}_A(s_A, a)$ denotes the state-action value for $s_A \in S_A$ and $a \in A$. Finally, letting $\pi^*_A$ be a Nash equilibrium of the abstract TZMG $M_A$, $\pi^*_A$ must satisfy the following condition for all $s_A \in S_A$ and all $(\pi_{A,1}, \pi_{A,2})$:

$$V^{\pi_{A,1}, \pi^*_{A,2}}_A(s_A) \le V^{\pi^*_A}_A(s_A) \le V^{\pi^*_{A,1}, \pi_{A,2}}_A(s_A).$$

4 Abstraction Based on Minimax Values

In this section, we analyze the performance in the ground TZMG $M$ of a Nash equilibrium $\pi^*_A$ of an abstract TZMG $M_A$ under a specific aggregation function $\phi$. To this end, we define the policy profile $\pi_{GA}$ in the ground TZMG induced by $\pi^*_A$. Formally, $\pi_{GA}$ is given, for all $s \in S$, by:

$$\pi_{GA}(s) := \pi^*_A(\phi(s)).$$

In the following, we derive upper bounds on the duality gap of $\pi_{GA}$ under various aggregation functions $\phi$.

4.1 Suboptimality of Nash Equilibria in Abstract TZMGs

We examine the aggregation function $\phi_{Q^*}$, which is constructed from the state-action value function $Q^*$. Specifically, under $\phi_{Q^*}$, states are aggregated into the same abstract state when their minimax state-action values are within $\epsilon$ of one another.

Assumption 1. The aggregation function $\phi_{Q^*}$ satisfies the following property for some non-negative constant $\epsilon \ge 0$:

$$\phi_{Q^*}(s_1) = \phi_{Q^*}(s_2) \implies \forall a \in A,\ |Q^*(s_1, a) - Q^*(s_2, a)| \le \epsilon. \tag{2}$$

Under Assumption 1, it can be shown that the suboptimality of $\pi_{GA}$ is no more than $O(\epsilon)$.

Theorem 1.
When the ground states are aggregated by an aggregation function $\phi_{Q^*}$ satisfying Assumption 1 with $\epsilon \ge 0$, then $\pi_{GA}$ satisfies:

$$\mathrm{GAP}(\pi_{GA}) \le \frac{12\epsilon}{(1-\gamma)^3}.$$

To perform an initial abstraction, Minimax Q-learning is used to calculate Q-values for the ground game. If the results do not meet a sufficient solution criterion, the process iteratively updates the Q-values and refines the abstraction (Li, Walsh, and Littman 2006). Developing efficient algorithms for discovering abstractions remains open. Moreover, such abstractions have potential utility beyond a single game, enabling transferable knowledge across related games (Jong and Stone 2005).

4.2 Proofs for Theorem 1

This section provides the proofs for Theorem 1.

Proof of Theorem 1. By the definition of the duality gap:

$$\mathrm{GAP}(\pi_{GA}) = \max_{s \in S,\ \pi_1,\ \pi_2} \left( V^{\pi_1, \pi_{GA,2}}(s) - V^{\pi_{GA,1}, \pi_2}(s) \right) \le \max_{s \in S,\ \pi_1} \left( V^{\pi_1, \pi_{GA,2}}(s) - V^{\pi_{GA}}(s) \right) + \max_{s \in S,\ \pi_2} \left( V^{\pi_{GA}}(s) - V^{\pi_{GA,1}, \pi_2}(s) \right). \tag{3}$$

Hence, it suffices to derive upper bounds on $\max_{\pi_1} V^{\pi_1, \pi_{GA,2}}(s) - V^{\pi_{GA}}(s)$ and $V^{\pi_{GA}}(s) - \min_{\pi_2} V^{\pi_{GA,1}, \pi_2}(s)$ for each $s \in S$. Here, letting $\tilde{\pi}_1$ be an optimal (best-response) policy against $\pi_{GA,2}$ in the ground TZMG, we obtain the following result on the performance difference between $\tilde{\pi}_1$ and $\pi_{GA,1}$ against $\pi_{GA,2}$. The proof of Lemma 1 is provided in Appendix A.1.

Lemma 1. Assume that the aggregation function $\phi$ satisfies the following condition for some non-negative constant $\delta \ge 0$:

$$\forall s \in S,\ \forall a \in A, \quad \left| Q^{\pi^*_A}_A(\phi(s), a) - Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) \right| \le \delta.$$

Then, for any $s \in S$:

$$V^{\tilde{\pi}_1, \pi_{GA,2}}(s) - V^{\pi_{GA}}(s) \le \frac{2\delta}{1-\gamma}.$$

Next, we show that the assumption of Lemma 1 holds with some constant $\delta$. By the triangle inequality, the difference between $Q^{\pi^*_A}_A(\phi_{Q^*}(s), a)$ and $Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a)$ can be bounded as follows:

$$\left| Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) - Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) \right| \le \left| Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) - Q^*(s, a) \right| + \left| Q^*(s, a) - Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) \right|. \tag{4}$$

The first term in (4) is the difference between the minimax state-action values of the abstract TZMG $M_A$ and the ground TZMG $M$. The second term in (4) represents the performance gap between $\pi^*$ and $(\tilde{\pi}_1, \pi_{GA,2})$ in $M$.
Each term can be bounded as follows.

Lemma 2. In the setup of Theorem 1, for any $s \in S$ and $a \in A$:

$$\left| Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) - Q^*(s, a) \right| \le \frac{\epsilon}{1-\gamma}.$$

Lemma 3. In the setup of Theorem 1, for any $s \in S$ and $a \in A$:

$$\left| Q^*(s, a) - Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) \right| \le \frac{2\epsilon}{(1-\gamma)^2}.$$

By combining (4) with Lemmas 2 and 3, we get, for any $s \in S$ and $a \in A$:

$$\left| Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) - Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) \right| \le \frac{3\epsilon}{(1-\gamma)^2}.$$

This inequality implies that, under the aggregation function $\phi_{Q^*}$, the assumption of Lemma 1 holds with $\delta = \frac{3\epsilon}{(1-\gamma)^2}$. Thus, applying Lemma 1, we obtain:

$$V^{\tilde{\pi}_1, \pi_{GA,2}}(s) - V^{\pi_{GA}}(s) \le \frac{6\epsilon}{(1-\gamma)^3}. \tag{5}$$

By a similar procedure, we can show that:

$$V^{\pi_{GA}}(s) - V^{\pi_{GA,1}, \tilde{\pi}_2}(s) \le \frac{6\epsilon}{(1-\gamma)^3}. \tag{6}$$

By combining (3), (5), and (6), we can upper bound the duality gap of $\pi_{GA}$ as follows:

$$\mathrm{GAP}(\pi_{GA}) \le \frac{6\epsilon}{(1-\gamma)^3} \cdot 2 = \frac{12\epsilon}{(1-\gamma)^3}.$$

Proof of Lemma 2. From the definition of the state-action value function in the abstract TZMG $M_A$:

$$Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) = R_A(\phi_{Q^*}(s), a) + \gamma \sum_{s'_A \in S_A} P_A(s'_A \mid \phi_{Q^*}(s), a)\, V^{\pi^*_A}_A(s'_A) = \sum_{g \in G(s)} w(g) \left( R(g, a) + \gamma \sum_{s'_A \in S_A} \sum_{s' \in G_A(s'_A)} P(s' \mid g, a)\, V^{\pi^*_A}_A(s'_A) \right) = \sum_{g \in G(s)} w(g) \left( R(g, a) + \gamma \sum_{s' \in S} P(s' \mid g, a)\, V^{\pi^*_A}_A(\phi_{Q^*}(s')) \right).$$

Hence, the difference between $\sum_{g \in G(s)} w(g) Q^*(g, a)$ and $Q^{\pi^*_A}_A(\phi_{Q^*}(s), a)$ can be bounded as follows:

$$\sum_{g \in G(s)} w(g) Q^*(g, a) - Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) = \gamma \sum_{g \in G(s)} w(g) \sum_{s' \in S} P(s' \mid g, a) \left( V^*(s') - V^{\pi^*_A}_A(\phi_{Q^*}(s')) \right) = \gamma \sum_{g \in G(s)} w(g) \sum_{s' \in S} P(s' \mid g, a) \left( \max_{p \in \Delta(A_1)} \min_{a'_2 \in A_2} \sum_{a'_1 \in A_1} p(a'_1) Q^*(s', a') - \max_{p \in \Delta(A_1)} \min_{a'_2 \in A_2} \sum_{a'_1 \in A_1} p(a'_1) Q^{\pi^*_A}_A(\phi_{Q^*}(s'), a') \right) \le \gamma \max_{(s', a') \in S \times A} \left| Q^*(s', a') - Q^{\pi^*_A}_A(\phi_{Q^*}(s'), a') \right|, \tag{7}$$

where the second equality follows from the fact that $V^*$ and $V^{\pi^*_A}_A$ are the minimax values of the ground and abstract TZMGs, respectively. On the other hand, under Assumption 1, we have the following lower bound on the state-action value difference:

$$\sum_{g \in G(s)} w(g) Q^*(g, a) - Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) \ge \min_{g \in G(s)} Q^*(g, a) - Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) \ge -\epsilon + Q^*(s, a) - Q^{\pi^*_A}_A(\phi_{Q^*}(s), a). \tag{8}$$
By combining (7) and (8) and then taking the maximum of both sides, we obtain:

$$\max_{(s,a) \in S \times A} \left( Q^*(s, a) - Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) \right) \le \epsilon + \gamma \max_{(s,a) \in S \times A} \left| Q^*(s, a) - Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) \right|.$$

Rearranging this inequality, we have, for any $s \in S$ and $a \in A$:

$$Q^*(s, a) - Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) \le \max_{(s', a') \in S \times A} \left( Q^*(s', a') - Q^{\pi^*_A}_A(\phi_{Q^*}(s'), a') \right) \le \frac{\epsilon}{1-\gamma}.$$

By a similar procedure, we can show that:

$$Q^*(s, a) - Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) \ge \min_{(s', a') \in S \times A} \left( Q^*(s', a') - Q^{\pi^*_A}_A(\phi_{Q^*}(s'), a') \right) \ge -\frac{\epsilon}{1-\gamma}.$$

In summary, for any $s \in S$ and $a \in A$:

$$\left| Q^*(s, a) - Q^{\pi^*_A}_A(\phi_{Q^*}(s), a) \right| \le \frac{\epsilon}{1-\gamma}.$$

Proof of Lemma 3. By the definition of the state value function $V^{\pi^*_A}_A$ in the abstract TZMG $M_A$, for any $s \in S$:

$$V^{\pi^*_A}_A(\phi_{Q^*}(s)) = \max_{p \in \Delta(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} p(a_1)\, Q^{\pi^*_A}_A(\phi_{Q^*}(s), a).$$

Applying Lemma 2 to this equation, we obtain the following upper bound on $V^{\pi^*_A}_A(\phi_{Q^*}(s))$:

$$V^{\pi^*_A}_A(\phi_{Q^*}(s)) \le \max_{p \in \Delta(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} p(a_1)\, Q^*(s, a) + \frac{\epsilon}{1-\gamma} = V^*(s) + \frac{\epsilon}{1-\gamma}.$$

Similarly, we can derive the lower bound $V^{\pi^*_A}_A(\phi_{Q^*}(s)) \ge V^*(s) - \frac{\epsilon}{1-\gamma}$. Hence, for any $s \in S$:

$$\left| V^*(s) - V^{\pi^*_A}_A(\phi_{Q^*}(s)) \right| \le \frac{\epsilon}{1-\gamma}.$$

Using this inequality, we obtain, for any $s \in S$ and $a \in A$:

$$Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) - Q^*(s, a) = \gamma \sum_{s' \in S} P(s' \mid s, a) \left( V^{\tilde{\pi}_1, \pi_{GA,2}}(s') - V^*(s') \right) \le \gamma \sum_{s' \in S} P(s' \mid s, a) \left( V^{\pi^*_A}_A(\phi_{Q^*}(s')) - V^*(s') \right) + \gamma \sum_{s' \in S} P(s' \mid s, a) \left( V^{\tilde{\pi}_1, \pi_{GA,2}}(s') - V^{\pi^*_A}_A(\phi_{Q^*}(s')) \right) \le \frac{\gamma\epsilon}{1-\gamma} + \gamma \sum_{s' \in S} P(s' \mid s, a) \left( V^{\tilde{\pi}_1, \pi_{GA,2}}(s') - V^{\pi^*_A}_A(\phi_{Q^*}(s')) \right), \tag{9}$$

where the first equality follows from the definition of the state-action value function. Here, we can upper bound the term $V^{\tilde{\pi}_1, \pi_{GA,2}}(s') - V^{\pi^*_A}_A(\phi_{Q^*}(s'))$ as follows:

$$V^{\tilde{\pi}_1, \pi_{GA,2}}(s') - V^{\pi^*_A}_A(\phi_{Q^*}(s')) = \max_{a'_1 \in A_1} \sum_{a'_2 \in A_2} \pi_{GA,2}(a'_2 \mid s')\, Q^{\tilde{\pi}_1, \pi_{GA,2}}(s', a') - \max_{a'_1 \in A_1} \sum_{a'_2 \in A_2} \pi_{GA,2}(a'_2 \mid s')\, Q^{\pi^*_A}_A(\phi_{Q^*}(s'), a') \le \max_{a' \in A} \left( Q^{\tilde{\pi}_1, \pi_{GA,2}}(s', a') - Q^{\pi^*_A}_A(\phi_{Q^*}(s'), a') \right). \tag{10}$$

By combining (9), (10), and Lemma 2, we have, for any $s \in S$ and $a \in A$:

$$Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) - Q^*(s, a) \le \frac{2\gamma\epsilon}{1-\gamma} + \gamma \max_{(s', a') \in S \times A} \left( Q^{\tilde{\pi}_1, \pi_{GA,2}}(s', a') - Q^*(s', a') \right).$$
Taking the maximum of both sides:

$$\max_{(s,a) \in S \times A} \left( Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) - Q^*(s, a) \right) \le \frac{2\gamma\epsilon}{1-\gamma} + \gamma \max_{(s,a) \in S \times A} \left( Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) - Q^*(s, a) \right).$$

Rearranging this inequality, we have, for any $s \in S$ and $a \in A$:

$$Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) - Q^*(s, a) \le \max_{(s', a') \in S \times A} \left( Q^{\tilde{\pi}_1, \pi_{GA,2}}(s', a') - Q^*(s', a') \right) \le \frac{2\epsilon}{(1-\gamma)^2}. \tag{11}$$

On the other hand, from the property of the minimax state-action value in (1), we have:

$$Q^*(s, a) - Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) \le 0. \tag{12}$$

By combining (11) with (12), we get, for any $s \in S$ and $a \in A$:

$$\left| Q^{\tilde{\pi}_1, \pi_{GA,2}}(s, a) - Q^*(s, a) \right| \le \frac{2\epsilon}{(1-\gamma)^2}.$$

5 Experiments

This section demonstrates the state abstraction developed so far on Markov Soccer (Littman 1994; Abe and Kaneko 2021). We focus only on the minimax-value criterion here, since the theoretical results on the other criteria in Section 6 rely on the minimax values.

5.1 Markov Soccer

Markov Soccer is a 1-vs-1 game on a 4 × 5 grid, as shown in Figure 1. The two players, 1 and 2, occupy distinct squares of the grid, and the circled player (player 1 in the figure) has the ball; the players' positions and ball possession specify the states of the game. Figure 1 shows the initial positions of the players. Which player has the ball on the initial turn is determined uniformly at random. In each turn, each player can move to one of the neighboring cells or stay in place, i.e., the set of actions is Up, Left, Down, Right, and Stand. After both players select their actions, the two moves are executed in random order. When a player tries to move into the cell occupied by the other player, possession of the ball goes to the stationary player, and the positions of both players remain unchanged. Likewise, if a player's move would take him or her off the pitch, the player's position remains unchanged. When a player in possession of the ball steps into his or her goal (the right side for player 1 and the left side for player 2), the game is over; that player scores 1 point and the opponent scores -1 point. The positions of the players and ball possession are then initialized as shown in Figure 1.
Figure 1: An initial state of the Markov Soccer game, in which player 1 has the ball.

5.2 Training and Evaluation

In Markov Soccer, we build the state abstraction by first solving the game and then greedily aggregating ground states into abstract states that satisfy the $Q^*$ criterion. We calculate the number of states $|S_A|$ in the abstract Markov game, varying $\epsilon$ from 0.0 to 2.0 in increments of 0.1. Then, for each $\epsilon$, we compute the equilibrium policies $\pi^*$ and $\pi^*_A$ for the ground and abstract Markov Soccer games, respectively, via Minimax Q-learning. We set the total number of learning iterations $T$ to 1,000,000, the discount factor $\gamma$ to 0.9, and the learning rate $\alpha_t$ to $10^{-2t/T}$ for iterations $t \ge 0$. We further evaluate the duality gaps of those equilibrium policies. However, computing the true gap is demanding, so we use the approximation obtained from Q-learning.

5.3 Results

We first compute the number of states in the abstract Markov Soccer game for different values of $\epsilon$ (Figure 2). The x- and y-axes represent $\epsilon$ and the number of states $|S_A|$, respectively. We observe that as $\epsilon$ increases, the number of states decreases almost linearly. Note that the ground Markov Soccer game has 760 states, as shown at $\epsilon = 0.0$. We vary $\epsilon$ only up to two, since, by the definition of the game, the difference between state-action values is bounded by 2. The abstract Markov Soccer game has a single state at $\epsilon = 2.0$.

Figure 3 illustrates the duality gap over learning iterations, where the x- and y-axes represent iterations and the gap, respectively, for varying $\epsilon$. We label "Ground" the gap for the ground Markov Soccer game ($\epsilon = 0.0$) and draw trajectories for $\epsilon \in \{0.2, 0.6, 1.0, 1.4, 1.8\}$. We observe that Minimax Q-learning approximately solves the ground game, i.e., the gap converges to zero.
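The greedy aggregation described in Section 5.2 is not spelled out in detail, so the following representative-based pass is one plausible sketch: each state joins the first cluster whose representative's $Q^*$-vector is within $\epsilon$ in sup-norm, which enforces the condition of Assumption 1 up to a factor of 2 between cluster members. The function name and the greedy ordering are assumptions.

```python
import numpy as np

def greedy_q_abstraction(Q, eps):
    """Greedily aggregate states whose Q*-vectors are within eps (sup-norm)
    of a cluster representative; returns phi mapping state -> abstract state."""
    S = Q.shape[0]
    phi = np.full(S, -1)
    reps = []  # representative ground state of each cluster
    for s in range(S):
        for c, r in enumerate(reps):
            if np.max(np.abs(Q[s] - Q[r])) <= eps:
                phi[s] = c  # join the first sufficiently close cluster
                break
        else:
            phi[s] = len(reps)  # open a new cluster with s as representative
            reps.append(s)
    return phi
```

With $\epsilon$ at least as large as the full range of $Q^*$-values, every state joins the first cluster, consistent with the single abstract state observed at $\epsilon = 2.0$ above; with $\epsilon = 0$, only states with identical $Q^*$-vectors are merged.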
If $\epsilon$ is sufficiently small, i.e., less than 0.6, the duality gap in the abstract Markov Soccer game approximates the ground one. Otherwise, the gaps become significantly worse. In these cases, agents often repeat previous actions without making progress. Such deadlocks increase the duality gap, as agents can more easily identify best responses by fully exploring the ground game's strategy space.

Figure 2: Number of states in the abstract Markov Soccer games with respect to $\epsilon$.

Figure 3: Duality gap at each iteration of Minimax Q-learning. Note that the policies are trained in the abstract game, and their duality gap values are computed in the ground game.

6 Extensions

We have extended the approximate abstraction function that preserves near-optimal behavior by aggregating states with similar optimal Q-values to two-player zero-sum Markov games. This section further considers three other types of criteria: model similarity, Boltzmann distribution similarity, and multinomial distribution similarity. We then derive bounds on the gap function for each criterion. Since the theoretical results for these three criteria rely on the one for the minimax Q-values, the proofs are placed in Appendices B-D.

Let us first consider model similarity, where states are aggregated when their rewards and transitions are within $\epsilon$ (Li, Walsh, and Littman 2006; Abel, Hershkowitz, and Littman 2016). Specifically, when an aggregation function $\phi$ aggregates states $s_1, s_2 \in S$, the difference between the transition probability functions $P(\cdot \mid s_1, a)$ and $P(\cdot \mid s_2, a)$ is bounded by $\epsilon \in [0, \infty)$, and likewise the difference between the reward functions $R(s_1, a)$ and $R(s_2, a)$ is bounded by the same amount.

Assumption 2.
The aggregation function $\phi_{model}$ satisfies the following property for some non-negative constant $\epsilon \ge 0$:

$$\phi_{model}(s_1) = \phi_{model}(s_2) \implies \max_{a \in A} |R(s_1, a) - R(s_2, a)| \le \epsilon \ \text{ and } \ \left| P(s' \mid s_1, a) - P(s' \mid s_2, a) \right| \le \epsilon \ \ \forall s' \in S,\ a \in A.$$

We derive the bound on the duality gap for the model similarity criterion as follows.

Theorem 2. When the ground states are aggregated by an aggregation function $\phi_{model}$ satisfying Assumption 2 with $\epsilon \ge 0$, then $\pi_{GA}$ satisfies:

$$\mathrm{GAP}(\pi_{GA}) \le \frac{4\left(1 + \gamma(|S| - 1)\right)\epsilon}{(1-\gamma)^3}.$$

Let us next examine Boltzmann distribution similarity, from the classic textbook of Sutton and Barto (1998), which balances exploration and exploitation and allows states to be aggregated when the ratios of their Q-values are similar even though the magnitudes differ.

Assumption 3. The aggregation function $\phi_{bolt}$ satisfies the following property for some non-negative constants $\epsilon \ge 0$ and $k \ge 0$:

$$\phi_{bolt}(s_1) = \phi_{bolt}(s_2) \implies \forall a \in A, \ \left| \frac{e^{Q^*(s_1, a)}}{\sum_{b \in A} e^{Q^*(s_1, b)}} - \frac{e^{Q^*(s_2, a)}}{\sum_{b \in A} e^{Q^*(s_2, b)}} \right| \le \epsilon \ \text{ and } \ \left| \sum_{b \in A} e^{Q^*(s_1, b)} - \sum_{b \in A} e^{Q^*(s_2, b)} \right| \le k\epsilon.$$

Theorem 3. When the ground states are aggregated by an aggregation function $\phi_{bolt}$ satisfying Assumption 3 with $\epsilon \ge 0$ and $k \ge 0$, then $\pi_{GA}$ satisfies:

$$\mathrm{GAP}(\pi_{GA}) \le \frac{12\left( e^{\frac{2}{1-\gamma}} |A_1||A_2| + k\, e^{\frac{1}{1-\gamma}} \right)\epsilon}{|A_1||A_2|\,(1-\gamma)^4}.$$

We finally consider multinomial distribution similarity, which is slightly simpler and has properties close to those of Boltzmann distribution similarity.

Assumption 4. The aggregation function $\phi_{mult}$ satisfies the following property for some non-negative constants $\epsilon \ge 0$ and $k \ge 0$:

$$\phi_{mult}(s_1) = \phi_{mult}(s_2) \implies \forall a \in A, \ \left| \frac{Q^*(s_1, a)}{\sum_{b \in A} Q^*(s_1, b)} - \frac{Q^*(s_2, a)}{\sum_{b \in A} Q^*(s_2, b)} \right| \le \epsilon \ \text{ and } \ \left| \sum_{b \in A} Q^*(s_1, b) - \sum_{b \in A} Q^*(s_2, b) \right| \le k\epsilon.$$

Theorem 4. Suppose that the ground states are aggregated by an aggregation function $\phi_{mult}$ satisfying Assumption 4 with $\epsilon \ge 0$ and $k \ge 0$. If there exists a positive constant $\delta > 0$ such that $\sum_{b \in A} Q^*(s, b) \ge \delta$ for all states $s \in S$, then:

$$\mathrm{GAP}(\pi_{GA}) \le \frac{12\left( |A_1||A_2| + k \right)\epsilon}{\delta\,(1-\gamma)^4}.$$

7 Conclusion

This paper investigates approximate state abstraction, which was originally developed for single-agent MDPs, and extends it to TZMGs, which potentially have many real-world applications.
Future work includes conducting experiments on games with larger state spaces, such as the Chasing Game on Gridworld (Wang and Klabjan 2018) and Snake games (Guibas et al. 2022).

Acknowledgments

We would like to thank Yuki Shimano for useful discussions. Atsushi Iwasaki is supported by JSPS KAKENHI Grant Numbers 21H04890 and 23K17547.

References

Abe, K.; and Kaneko, Y. 2021. Off-Policy Exploitability Evaluation in Two-Player Zero-Sum Markov Games. In AAMAS, 78-87.
Abel, D.; Hershkowitz, D. E.; and Littman, M. L. 2016. Near optimal behavior via approximate state abstraction. In ICML, 2915-2923.
Brown, N.; and Sandholm, T. 2018. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374): 418-424.
Burch, N.; Johanson, M.; and Bowling, M. 2014. Solving imperfect information games using decomposition. In AAAI, 602-608.
Castro, P. S. 2020. Scalable Methods for Computing State Similarity in Deterministic Markov Decision Processes. In AAAI, 10069-10076.
Dadvar, M.; Nayyar, R. K.; and Srivastava, S. 2023. Conditional abstraction trees for sample-efficient reinforcement learning. In UAI, 485-495.
Dietterich, T. 1998. The MAXQ Method for Hierarchical Reinforcement Learning. In ICML, 118-126.
Dietterich, T. 1999. State abstraction in MAXQ hierarchical reinforcement learning. In NeurIPS, 994-1000.
Ferns, N.; Panangaden, P.; and Precup, D. 2004. Metrics for finite Markov decision processes. In UAI, 162-169.
Ganzfried, S.; and Sandholm, T. 2013. Action translation in extensive-form games with large action spaces: axioms, paradoxes, and the pseudo-harmonic mapping. In IJCAI, 120-128.
Gilpin, A. 2006. A Competitive Texas Hold'em Poker Player via Automated Abstraction and Real-Time Equilibrium Computation. In AAAI, 1007-1013.
Gilpin, A.; and Sandholm, T. 2006. Finding equilibria in large sequential games of imperfect information. In EC, 160-169.
Gilpin, A.; and Sandholm, T. 2007.
Better automated abstraction techniques for imperfect information games, with application to Texas Hold'em poker. In AAMAS, 1-8.
Gilpin, A.; Sandholm, T.; and Sørensen, T. B. 2007. Potential-aware automated abstraction of sequential games, and holistic equilibrium analysis of Texas Hold'em poker. In AAAI, 50-57.
Givan, R.; Dean, T.; and Greig, M. 2003. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1-2): 163-223.
Guibas, J.; Mardani, M.; Li, Z.; Tao, A.; Anandkumar, A.; and Catanzaro, B. 2022. Efficient Token Mixing for Transformers via Adaptive Fourier Neural Operators. In ICLR.
Johanson, M.; Burch, N.; Valenzano, R.; and Bowling, M. 2013. Evaluating state-space abstractions in extensive-form games. In AAMAS, 271-278.
Jong, N. K.; and Stone, P. 2005. State abstraction discovery from irrelevant state variables. In IJCAI, 752-757.
Jonsson, A.; and Barto, A. G. 2000. Automated state abstraction for options using the U-Tree algorithm. In NeurIPS, 1010-1016.
Kroer, C.; and Sandholm, T. 2018. A unified framework for extensive-form game abstraction with bounds. In NeurIPS, 613-624.
Li, L.; Walsh, T. J.; and Littman, M. L. 2006. Towards a Unified Theory of State Abstraction for MDPs. In ISAIM.
Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In ICML, 157-163.
Perolat, J.; Vylder, B. D.; Hennes, D.; Tarassov, E.; Strub, F.; de Boer, V.; Muller, P.; Connor, J. T.; Burch, N.; Anthony, T.; McAleer, S.; Elie, R.; Cen, S. H.; Wang, Z.; Gruslys, A.; Malysheva, A.; Khan, M.; Ozair, S.; Timbers, F.; Pohlen, T.; Eccles, T.; Rowland, M.; Lanctot, M.; Lespiau, J.-B.; Piot, B.; Omidshafiei, S.; Lockhart, E.; Sifre, L.; Beauguerlange, N.; Munos, R.; Silver, D.; Singh, S.; Hassabis, D.; and Tuyls, K. 2022. Mastering the game of Stratego with model-free multiagent reinforcement learning. Science, 378(6623): 990-996.
Ravindran, B.; and Barto, A. G. 2003.
SMDP homomorphisms: an algebraic approach to abstraction in semi-Markov decision processes. In IJCAI, 1011-1016.
Ravindran, B.; and Barto, A. G. 2004. Approximate homomorphisms: A framework for non-exact minimization in Markov decision processes. In International Conference on Knowledge Based Computer Systems, 19-22.
Shalev-Shwartz, S.; Shammah, S.; and Shashua, A. 2016. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
Shapley, L. S. 1953. Stochastic Games. Proceedings of the National Academy of Sciences, 39(10): 1095-1100.
Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T. P.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587): 484-489.
Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; Chen, Y.; Lillicrap, T. P.; Hui, F.; Sifre, L.; van den Driessche, G.; Graepel, T.; and Hassabis, D. 2017. Mastering the game of Go without human knowledge. Nature, 550(7676): 354-359.
Sutton, R. S.; and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.
Szepesvári, C.; and Littman, M. L. 1999. A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms. Neural Computation, 11(8): 2017-2060.
van der Pol, E.; Kipf, T.; Oliehoek, F. A.; and Welling, M. 2020. Plannable Approximations to MDP Homomorphisms: Equivariance under Actions. In AAMAS, 1431-1439.
Wang, X.; and Klabjan, D. 2018. Competitive Multi-agent Inverse Reinforcement Learning with Sub-optimal Demonstrations. In ICML, 5143-5151.
Waugh, K. 2013. A fast and optimal hand isomorphism algorithm. In AAAI Workshop on Computer Poker and Imperfect Information.