# Learning in Nonzero-Sum Stochastic Games with Potentials

David Mguni 1, Yutong Wu 2, Yali Du 3, Yaodong Yang 1 3, Ziyi Wang 2, Minne Li 3, Ying Wen 4, Joel Jennings 1

1 Huawei R&D UK. 2 Institute of Automation, Chinese Academy of Sciences. 3 University College London, UK. 4 Shanghai Jiao Tong University. Correspondence to: David Mguni, Yaodong Yang.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Multi-agent reinforcement learning (MARL) has become effective at tackling discrete cooperative game scenarios. However, MARL has yet to penetrate settings beyond those modelled by team and zero-sum games, confining it to a small subset of multi-agent systems. In this paper, we introduce a new generation of MARL learners that can handle nonzero-sum payoff structures and continuous settings. In particular, we study the MARL problem in a class of games known as stochastic potential games (SPGs) with continuous state-action spaces. Unlike cooperative games, in which all agents share a common reward, SPGs can model real-world scenarios where agents seek to fulfil their individual goals. We prove theoretically that our learning method, SPot-AC, enables independent agents to learn Nash equilibrium strategies in polynomial time. We demonstrate that our framework tackles previously unsolvable tasks such as Coordination Navigation and large selfish routing games, and that it outperforms state-of-the-art MARL baselines such as MADDPG and COMIX in such scenarios.

1. Introduction

Many real-world systems give rise to multi-agent systems (MAS); traffic network systems with autonomous vehicles (Ye et al., 2015; Zhou et al., 2020), network packet routing systems (Wiering, 2000) and financial trading (Mariano et al., 2001) are some examples. In these systems, self-interested agents act in a shared environment, each seeking to perform some pre-specified task. Each agent's actions affect the performance of other agents and may even prevent them from completing their tasks altogether. For example, autonomous vehicles seeking to arrive at their individual destinations must avoid colliding with other vehicles. Therefore, to perform their tasks the agents must account for other agents' behaviours.

There is therefore a great need for reinforcement learning (RL) agents with their own goals to learn to perform in MAS. In these scenarios, agents are not required to behave as a team nor as perfect adversaries. These settings are modelled by nonzero-sum stochastic games (SGs) whose solution concept is a fixed point known as a Nash equilibrium (NE). An NE describes the stable point at which all agents respond optimally to the actions of the other agents. Computing the NE is therefore central to solving MAS.

Despite its fundamental importance for solving many MAS, computing the NE of an SG with a general payoff structure remains an open challenge (Yang & Wang, 2020). Presently, methods to compute NE in SGs that are neither zero-sum nor team settings are extremely scarce and impose limiting assumptions. As such, these methods are generally unsuitable for real-world MAS (Shoham & Leyton-Brown, 2008). Moreover, finding NE even in the simple case of normal-form games (where agents take only a single action) is generally intractable when the game is nonzero-sum (Chen et al., 2009).
Among multi-agent reinforcement learning (MARL) methods is a class of algorithms known as independent learners, e.g. independent Q-learning (Tan, 1993). These algorithms ignore the actions of other agents; they are ill-suited to tackle MAS and often fail to learn (Hernandez-Leal et al., 2017). In contrast, algorithms such as MADDPG (Lowe et al., 2017), COMA (Foerster et al., 2018) and QDPP (Yang et al., 2020) include a centralised critic that accounts for the actions of all agents. To date, none of these algorithms has been proven to converge in SGs that are neither team nor zero-sum (adversarial). Additionally, these methods suffer from combinatorial growth in complexity with the number of agents (Yang et al., 2019), leading to prohibitively expensive computations in some systems. There is also a noticeable lack of MARL methods that can handle continuous spaces, which are required for tasks such as physical control (Bloembergen et al., 2015). This has left MARL largely unable to solve various practical tasks such as multi-agent MuJoCo (de Witt et al., 2020), which remains an open challenge. This is in contrast to discrete counterpart settings, e.g. StarCraft micro-management, in which MARL has had notable success (Peng et al., 2017).

In this paper, we address the challenge of solving MAS with payoff structures beyond zero-sum and team game settings in continuous systems. In particular, we develop a MARL solver that computes the NE within a new subclass of continuous nonzero-sum SGs, namely continuous stochastic potential games (c-SPGs), in which the agents' interaction at each stage has a potential game property. Lastly, our solver avoids combinatorial complexity in the number of agents.

Our framework is developed through theoretical results that enable the NE of some SGs to be found tractably. First, we formalise a construction of continuous SGs in which the interaction between agents at each stage can be described by a PG. Thereafter, we show that the NE of the SG can be computed by solving a dual Markov decision process (MDP) whose solution exactly coincides with the NE of the original SG. This converts the problem of finding a fixed-point NE of an (a priori unknown) nonzero-sum SG into solving an (unknown) MDP whose solution, as we show, can be found tractably using a new distributed variant of actor-critic methods, which we call SPot-AC.

The paper is organised as follows: after discussing related work, we present our construction of c-SPGs in Sec. 3. We continue in Sec. 4 by presenting a simple planner for when the environment model is given and prove that c-SPGs have dual representations as MDPs. A polynomial-time fitted Q-learning solver (SPot-Q) is then given to find the NE in this setting. In Sec. 5, we extend the learning method and propose an actor-critic variant (SPot-AC) that solves c-SPGs in unknown environments. A fully distributed variant is also provided that scales with the number of agents. A robustness analysis follows, in which we show that the method closely approximates the NE solution when the construction of the potential function has small estimation errors. Lastly, in Sec. 6, we conduct detailed ablation studies and performance tests on various tasks and conclude the paper.

2. Related Work

MARL has been successful in zero-sum scenarios (Grau-Moya et al., 2018), settings of homogeneous agents with population sizes that approach infinity (Mguni et al., 2018; Yang et al., 2018) and team game scenarios (Peng et al., 2017).
However, the restrictions on the payoffs therein mean that these models are usually far removed from many real-world scenarios, prohibiting the deployment of MARL in them.

There have been few attempts at computing NE in settings outside of team and zero-sum SGs. Most notable is Nash Q-learning (Hu & Wellman, 2003); it, however, imposes stringent assumptions that force the SG to resemble a team game. For example, in (Hu & Wellman, 2003) at each iteration a unique Pareto-dominant NE must exist and be computed, which is generally unachievable. Friend-or-foe learning (Littman, 2001) establishes convergence to NE in two-player coordination games but requires known reward functions and solving a linear program at each time step. Zhang et al. (2020) adopt the Stackelberg equilibrium as the learning target. More recently, Lowe et al. (2017) suggest an actor-critic method (MADDPG) with centralised training of the critic. Nevertheless, Lowe et al. (2017) do not tackle SGs outside of the zero-sum or cooperative cases in either theoretical results or experiments. In particular, the experiments in (Lowe et al., 2017) are all aimed at either adversarial (zero-sum) or fully cooperative settings. Very recently, (Zhang et al., 2021) consider an SG setting in which all agents' value functions are assumed to satisfy a global PG condition, that is, the incentive of all agents to change their policies can be expressed using a single global function. As noted in their discussion, without further qualification this assumption is rather strong and difficult to verify except in the case in which all agents share the same objective. In a later work, (Leonardos et al., 2021) consider an SG setting with a PG property while imposing conditions that either i) reduce the SG to a linear combination of normal-form games and remove all planning aspects, or ii) limit the agents' interaction to a term in their reward that does not depend on either the state or the agent's own actions. The latter condition (ii) results in an SG which is a restrictive case of our SG; in particular, our SG captures a richer, more general set of strategic interactions between agents (see Sec. 3.1 for a more detailed discussion).

We tackle a subclass of SGs which satisfy a PG condition at each stage game. We then show that, with this construction, a potentiality property can be naturally extrapolated to the value functions of the SG without imposing restrictive assumptions. With this we prove that the NE of the game can be learned by independent agents without the need to impose restrictive assumptions as in (Hu & Wellman, 2003; Littman, 2001), and in a way that scales with the number of agents, in contrast to centralised-critic methods that scale combinatorially.

3. Continuous Stochastic Potential Games

Figure 1. a) Coordination Navigation: selfish agents (purple) seek to reach rewards (black) whilst minimising contact with each other. b) Routing Networks: agents split their own commodity flow between edges in a network over a sequence of time-steps. Starting at a source (node 1) and arriving at a target node (8), paths that have more commodity incur higher congestion costs.

Continuous Stochastic Games. MAS are modelled by SGs (Shoham & Leyton-Brown, 2008; Shapley, 1953). An SG is an augmented MDP involving two or more agents $\{1, 2, \ldots, N\} =: \mathcal{N}$ that simultaneously take actions over many (possibly infinite) rounds. Formally, a continuous SG is a tuple $\mathcal{G} = \langle \mathcal{N}, \mathcal{S}, (\mathcal{A}_i)_{i\in\mathcal{N}}, P, (R_i)_{i\in\mathcal{N}}, \gamma \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}_i \subset \mathbb{R}^q$ is an action set, $R_i : \mathcal{S}\times\mathcal{A} \to \mathcal{P}(D)$ is the reward distribution for agent $i\in\mathcal{N}$, where $D$ is a compact subset of $\mathbb{R}$, and lastly $P : \mathcal{S}\times\mathcal{A}\times\mathcal{S} \to [0,1]$ is the probability function describing the system dynamics, where $\mathcal{A} := \times_{i\in\mathcal{N}}\mathcal{A}_i$.
In an SG, at each time $t \in 0, 1, \ldots$, the system is in state $s_t \in \mathcal{S}$ and each agent $i\in\mathcal{N}$ takes an action $a^i_t \in \mathcal{A}_i$. The joint action $a_t = (a^1_t, \ldots, a^N_t) \in \mathcal{A}$ produces an immediate reward $R_i(s_t, a_t)$ for agent $i\in\mathcal{N}$ and influences the next-state transition, which is chosen according to $P$. Using a (parameterised) Markov strategy¹ $\pi^i : \mathcal{S}\times\mathcal{A}_i \to [0,1]$ to select its actions, each agent $i$ seeks to maximise its individual expected returns as measured by its value function:

$$v^i_{(\pi^i,\pi^{-i})}(s) = \mathbb{E}_{\pi^i,\pi^{-i}}\left[\sum_{t=0}^{\infty}\gamma^t R_i(s_t, a_t)\right],$$

where the strategy $\pi^i$ is parameterised by $\theta^i \in E^i \subset \mathbb{R}^l$ and $\Pi^i$ is a compact Markov strategy space. A pure strategy (PS) is a map $\pi^i : \mathcal{S}\to\mathcal{A}_i$, for any $i\in\mathcal{N}$, that assigns to any state an action in $\mathcal{A}_i$. We denote the space of joint policies by $\mathbf{\Pi} := \times_{i\in\mathcal{N}}\Pi^i$; where it will not cause confusion (and with a minor abuse of notation) we use the shorthand $f(s, \pi^i, \pi^{-i}) \equiv \mathbb{E}_{\pi^i,\pi^{-i}}\left[f(s, a^i, a^{-i})\right]$.

¹A Markov strategy requires as input only the current state (and not the game history or other agents' actions or strategies).

SGs can be viewed as a sequence of stage games $\{M(s)\}_{s\in\mathcal{S}}$ that take place at each time step, where $M(s) = \langle (\mathcal{A}_i)_{i\in\mathcal{N}}, (R_i(s,\cdot))_{i\in\mathcal{N}}, \mathcal{N}\rangle$. Therefore, at each time step a stage game is played and then the game transitions to the next stage game, which is selected according to $P$.

Continuous Stochastic Potential Games. We now introduce a new subset of SGs, namely c-SPGs, which is the framework of our approach.

Definition 1. An SG is a c-SPG if for all states there exists a function $\phi : \mathcal{S}\times\mathcal{A}\to\mathbb{R}$ such that the following holds for any $(a^i, a^{-i}), (a'^i, a^{-i})\in\mathcal{A}$, where $a^{-i} := (a^1, \ldots, a^{i-1}, a^{i+1}, \ldots, a^N)$, for all $i\in\mathcal{N}$ and all $s\in\mathcal{S}$:

$$R_i(s, (a^i, a^{-i})) - R_i(s, (a'^i, a^{-i})) = \phi(s, (a^i, a^{-i})) - \phi(s, (a'^i, a^{-i})). \qquad (1)$$

Condition (1) says that the difference in payoff from a deviation by one of the agents is exactly quantified by a global function $\phi$ that does not depend on the agent's identity. We call $\phi$ the potential function, or potential for short. The condition extends the notion of static one-shot potential games (PGs) (Monderer & Shapley, 1996b) to a continuous SG setting that now includes states and transition dynamics. To complete the construction we introduce a condition which is a natural extension of PGs to state-based settings:

Definition 2. A stage game $M(s)$ is state transitive if there exists a $\phi : \mathcal{S}\times\mathcal{A}\to\mathbb{R}$ such that for all $(a^i, a^{-i})\in\mathcal{A}$, all $i\in\mathcal{N}$ and all $s, s'\in\mathcal{S}$:

$$R_i(s, (a^i, a^{-i})) - R_i(s', (a^i, a^{-i})) = \phi(s, (a^i, a^{-i})) - \phi(s', (a^i, a^{-i})). \qquad (2)$$

The intuition is that the difference in rewards for changing state is the same for each agent. Some classic examples where state transitivity holds are anonymous games (Daskalakis & Papadimitriou, 2007), symmetric SGs (Jaśkiewicz & Nowak, 2018) and team SGs (Cheng et al., 2017). Our results are built under the assumption that state transitivity holds.²

²Statements of the technical assumptions are in the Appendix.

Fig. 1 illustrates two examples of c-SPGs.
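To make Definitions 1 and 2 concrete, the following minimal sketch (our own toy construction, not one of the games studied in this paper) checks both conditions numerically for a two-agent navigation-style game whose rewards decompose as $R_i = \phi + h_i(a^{-i})$; the particular `reward`, `potential` and constants below are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
N, target = 2, np.array([1.0, 1.0])

def potential(s, a):
    # Candidate potential for a toy navigation game: summed distance-to-target
    # costs, a shared pairwise repulsion term and summed action costs.
    pos = s + a
    phi = -1.0 / np.sqrt(np.sum((pos[0] - pos[1]) ** 2) + 0.1)
    phi += -np.sum((pos - target) ** 2) - 0.5 * np.sum(a ** 2)
    return phi

def reward(i, s, a):
    # R_i = phi + h_i(a^{-i}): the extra term depends only on the *other*
    # agent's action, so it cancels in both Definition 1 and Definition 2.
    others = np.delete(a, i, axis=0)
    return potential(s, a) - 0.3 * np.sum(others ** 2)

for _ in range(1000):
    s, s2, a = (rng.normal(size=(N, 2)) for _ in range(3))
    for i in range(N):
        a_dev = a.copy()
        a_dev[i] = rng.normal(size=2)                      # unilateral deviation
        # Definition 1: deviation gaps of R_i and phi coincide.
        assert np.isclose(reward(i, s, a) - reward(i, s, a_dev),
                          potential(s, a) - potential(s, a_dev))
        # Definition 2: state-change gaps of R_i and phi coincide.
        assert np.isclose(reward(i, s, a) - reward(i, s2, a),
                          potential(s, a) - potential(s2, a))
print("Definitions 1 and 2 hold on all sampled points.")
```

The decomposition $R_i = \phi + h_i(a^{-i})$ used here is exactly the structure discussed in Sec. 3.1 as a special case of our setting.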
As a first example, in Coordination Navigation one can verify that the state transitivity assumption is satisfied: a collection of agents seeks to arrive at some destination $x^\star\in\mathbb{R}^p$. Crucially, the agents must avoid colliding with other agents. Each agent's value function is built from a per-step reward comprising three terms: a target-tracking cost $K_i\|x_{i,t} - x^\star\|^2$, a collision penalty proportional to $(\|x_{i,t} - x_{j,t}\|^2 + \epsilon)^{-1/2}$ summed over the other agents $j$, and a control cost on the action $a_{i,t}$ measured in the Mahalanobis norm, where $\|\cdot\|$ and $\|\cdot\|_M$ are the Euclidean and Mahalanobis norms respectively, $x_{i,t}\in\mathbb{R}^p$ is the position of agent $i$ at time $t\in\mathbb{N}$, $x_t = (x_{1,t}, \ldots, x_{N,t})$, $c, \epsilon, H, \beta, \{K_i\}_{i\in\mathcal{N}}$ are constants and $a_{i,t}$ is the vector representing the action taken by agent $i$. It can be readily verified that the game is potential, with a potential function $\phi(x)$ of the same form in which the target-tracking terms, the pairwise collision penalties (weighted by $\beta$) and the control costs are aggregated over the agents. Similarly, it can be readily verified that the game satisfies the state transitivity assumption.

In our ablation experiments (see Sec. 6) we show that our method is able to tackle settings in which the potentiality and state transitivity conditions are mildly violated.

The c-SPG conditions also hold in SGs in which the agents have the same reward functions (identical-interest games) (Monderer & Shapley, 1996a), such as anonymous games (Daskalakis & Papadimitriou, 2007), team games (Wang & Sandholm, 2003) and mean field games (Mguni et al., 2018). Such SGs are widely used to study distributive systems and coordination problems in MAS such as StarCraft (Samvelyan et al., 2019) and Capture the Flag (Jaderberg et al., 2019). MARL frameworks such as COMA (Foerster et al., 2018), QMIX (Rashid et al., 2018) and QDPP (Yang et al., 2020) address fully cooperative settings and therefore fall within this category.

A key result we prove is that c-SPGs enjoy a dual representation as MDPs, therefore enabling their solution to be computed by tackling MDPs. To construct a solution method for c-SPGs, we resolve a number of challenges: i) The first involves determining the dual MDP whose solution is to be learned through interaction with the environment. ii) The second involves developing a tractable learning procedure that ensures convergence to the game solution. To do this we develop a method that finds the solution of the dual MDP distributively; in doing so we also resolve the problem of combinatorial complexity that afflicts MARL methods. iii) The method of determining the dual MDP in (i) can incur small errors. Our last challenge is to show that small errors in the construction of the dual MDP induce only small errors in the agents' best-response actions.

3.1. Link to Potential Games and Discussion

We briefly continue the discussion of related works with a relevant review of PGs. The first systematic treatment of PGs appeared in (Monderer & Shapley, 1996b) in a static setting. PGs constitute a fundamental building block of general-sum games: any general-sum game can be decomposed into two (strategic) parts, PGs and harmonic games (Candogan et al., 2011). PGs model many real-world scenarios including traffic network scenarios, network resource allocation (Zazo et al., 2015), social conflicts (Lã et al., 2016) and consensus problems (Marden et al., 2009). PGs also encompass all team games and some zero-sum games (Balduzzi et al., 2018). C-SPGs extend PGs to settings with dynamics and future uncertainty. This enables PGs to capture real-world scenarios that involve sequential decision-making and dynamics. Examples of such settings include traffic network models, routing and packet-delivery problems. The analysis of dynamic PGs is extremely sparse and does not cover (reinforcement) learning settings in which the system is a priori unknown.
In the direction of incorporating a potentiality property within an SG, (González-Sánchez & Hernández-Lerma, 2013; Macua et al., 2018) consider SGs in which the potentiality property is imposed on the value functions, which results in the need for highly restrictive assumptions. In (González-Sánchez & Hernández-Lerma, 2013) the SG is restricted to concave reward functions (in the state variable) and the transition function is required to be invertible (and known). These assumptions are generally incompatible with many MAS settings of interest.³ Similarly, (Macua et al., 2018) study a discrete Markov game in which the value function is assumed to satisfy a PG property. Their construction requires that the agents' policies depend only on disjoint subcomponents of the state, which prohibits non-local (strategic) interactions. Very recently, (Zhang et al., 2021) consider an SG setting in which all agents' value functions are assumed to satisfy a global PG property, that is, the incentive of all agents to change their policies can be expressed using a single global function. To construct this relationship using conditions on the stage game, in a later work (Leonardos et al., 2021) consider an SG setting and embed either of two properties into the game structure, namely an agent-independent transition assumption (C.1) or an equality-of-individual-dummy-terms assumption (C.2). Using either of these conditions and the stage-game PG condition (Condition (1)), they show that the PG condition can be extrapolated to a global PG condition on the value functions. Conditions C.1 and C.2 in (Leonardos et al., 2021) impose heavy restrictions: Condition C.1 reduces the SG to a linear combination of normal-form games and removes all planning aspects (hence extrapolating the potentiality of the stage games to the agents' value functions is deduced trivially). Condition C.2 restricts the noncooperative (strategic) interaction part of the game to a term that does not depend on the state or the agent's own action. Moreover, imposing Condition C.2 produces an SG that is a special case of our SG (this can be seen using the equivalence expression in Lemma B (see Sec. H in the Appendix) by setting $k(s)\equiv 1$ and restricting $h_i$ to depend only on other agents' actions in the reward functions of our SG). Therefore, the generalisation of the PG condition to SGs in (Leonardos et al., 2021) requires strong limitations on the structure of the SG that are not present in our analysis.

With our new construction, which has a PG at each stage game, we show that the PG condition can be naturally extrapolated to the value functions of the SG. This provides verifiable assumptions on the game while imposing relatively weak assumptions on the SG in comparison to (Leonardos et al., 2021). With this we prove that the equilibrium of the SG can be found by merely solving an (unknown) MDP, without imposing either state disjointness as in (Macua et al., 2018) or concavity as in (González-Sánchez & Hernández-Lerma, 2013).

³The result in (González-Sánchez & Hernández-Lerma, 2013) also requires verifying that the policy satisfies sufficiency conditions, which is generally difficult given the size of the space of functions.

4. Planning in c-Stochastic Potential Games

We now show that the stable-point solution of a c-SPG can be computed tractably by solving a dual MDP with reward function $\phi$. This leads to a vast reduction in complexity for finding NEs in our c-SPG subclass of nonzero-sum SGs.
In what follows, we assume that the environment is known; in Sec. 5 we extend the analysis of this section to unknown environments. We defer the proofs of the results of the following sections to the Appendix.

As SGs are noncooperative settings, the solution cannot be described as the optimisation of a single objective. The appropriate solution concept is the following NE variant (Fudenberg & Tirole, 1991):

Definition 3. A strategy profile $\hat{\pi} = (\hat{\pi}^i, \hat{\pi}^{-i})\in\mathbf{\Pi}$ is a Markov perfect equilibrium (MPE) if for all $i\in\mathcal{N}$:

$$v^i_{(\pi'^i, \hat{\pi}^{-i})}(s) \le v^i_{(\hat{\pi}^i, \hat{\pi}^{-i})}(s), \qquad \forall s\in\mathcal{S},\ \forall \pi'^i\in\Pi^i.$$

The condition characterises a fixed point in strategies at which no agent can improve their expected payoff by unilaterally deviating from their current policy. We denote the set of NE of $\mathcal{G}$ by $NE\{\mathcal{G}\}$. Finding the NE of nonzero-sum SGs in general involves using fixed-point methods, which are generally intractable (Chen et al., 2009). Indeed, finding NE in SGs is PPAD-complete (Polynomial Parity Arguments on Directed graphs) (Chen et al., 2009), for which brute-force methods are intractable. Finding efficient solution methods for nonzero-sum SGs is an open challenge (Shoham & Leyton-Brown, 2008).

We now show that c-SPGs exhibit special properties that enable their NE to be computed tractably. In particular, we show that computing the NE of a c-SPG can be achieved by solving an MDP. With this, solving c-SPGs can be approached with stochastic approximation tools. We then present a new Q-learning variant that solves c-SPGs in polynomial time.

To begin, we construct the Bellman operator of $\mathcal{G}$. Let $g : \mathcal{S}\times\mathcal{A}\to\mathbb{R}$ and $F : \mathcal{S}\to\mathbb{R}$; for any $s\in\mathcal{S}$ the Bellman operator of the game $\mathcal{G}$ is given by:

$$[T_g F](s) := \sup_{a\in\mathcal{A}}\left[g(s, a) + \gamma\int_{\mathcal{S}}\mathrm{d}s'\, P(s'; a, s)\, F[s']\right].$$

We now state our first key result, which reveals a striking property of the c-SPG class of games:

Theorem 1. Let $V : \mathcal{S}\times\mathcal{A}\to\mathbb{R}$ be a test function; then $\mathcal{G}$ possesses a fixed-point NE in pure (deterministic) Markov strategies, characterised by the fixed point of the Bellman operator $T_\phi$, where $\phi$ is the potential of $\mathcal{G}$.

The result states that MPE of the game exist in pure strategies and correspond to the solution of a (dual) MDP $\mathcal{M} := \langle\phi, \times_{i\in\mathcal{N}}\mathcal{A}_i, P, \mathcal{S}, \gamma\rangle$. In fact, it is shown that any MPE is a local optimum of the value function associated with $\mathcal{M}$. The value function of $\mathcal{M}$, which we call the dynamic potential function (DPF), $B$, is given by $B^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t\phi(s_t, a_t)\,|\, a_t\sim\pi\right]$, for all $s\in\mathcal{S}$ and all $\pi\in\mathbf{\Pi}$. The theorem is proven inductively within a dynamic programming argument that extrapolates the potentiality property to the entire SG, then shows that $\mathcal{G}$ is continuous at infinity.

Theorem 1 enables us to compute the MPE by solving an MDP, a task which can be performed in polynomial time.⁴ Moreover, Theorem 1 enables a Q-learning approach (Bertsekas, 2012) for finding the MPE of the game. The following fitted Q-learning method computes the approximate $B$ function and the corresponding optimal policy for each agent.

⁴The MDP lies in a complexity class known as P-SPACE, which can be solved tractably (Papadimitriou & Tsitsiklis, 1987).

First, define the target

$$Y_{l_k}(s_{l_k}, a_{l_k}, s'_{l_k}) := \hat{\phi}(s_{l_k}, a_{l_k}) + \gamma\sup_{a'\in\mathcal{A}} F_k(s'_{l_k}, a'). \qquad (3)$$

At each iteration $k = 0, 1, \ldots$ we solve the minimisation:

$$F_{k+1} \in \arg\inf_{F}\sum_{l_k}\left(Y_{l_k}(s_{l_k}, a_{l_k}, s'_{l_k}) - F(s_{l_k}, a_{l_k})\right)^2. \qquad (4)$$

Algorithm 1 SPot-Q: Stochastic POTential Q-Learning
Input: discount factor γ and PF φ.
1: for k ∈ 0, 1, 2, ... do
2:   for i ∈ N do
3:     Set the local target $Y_{l_k}$ by (3)
4:     Update F by minimising Eq. (4)
5:   end for
6: end for
Output: F, $(\theta^i_k)_{i\in\mathcal{N}}$

The minimisation seeks to find the optimal action-value function $Q^\star$. Using this, we can construct our SPot-Q algorithm, which works by mimicking value iteration. By Theorem 1, the algorithm converges to the MPE of the game.
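To ground the SPot-Q scheme, here is a minimal sketch of our own (not the paper's implementation) of fitted Q-iteration on a dual MDP whose reward is the potential. A tiny discretised problem with a randomly generated φ and transition kernel stands in for the continuous game, and a tabular function class stands in for the critic used in practice; the targets and fits follow (3) and (4).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
n_states, n_joint_actions = 5, 9            # tiny discretised stand-in problem

# Hypothetical known potential phi(s, a) and transition kernel P(s' | s, a).
phi = rng.normal(size=(n_states, n_joint_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_joint_actions))

def sample_batch(size):
    s = rng.integers(n_states, size=size)
    a = rng.integers(n_joint_actions, size=size)
    s_next = np.array([rng.choice(n_states, p=P[si, ai]) for si, ai in zip(s, a)])
    return s, a, s_next

F = np.zeros((n_states, n_joint_actions))   # tabular function class for fitted Q
for k in range(200):
    s, a, s_next = sample_batch(512)
    # Target (3): Y = phi(s, a) + gamma * sup_{a'} F_k(s', a').
    y = phi[s, a] + gamma * F[s_next].max(axis=1)
    # Minimisation (4): least-squares fit of F to the targets; for a tabular
    # class the minimiser is the mean target observed at each (s, a) pair.
    F_new = F.copy()
    for si in range(n_states):
        for ai in range(n_joint_actions):
            mask = (s == si) & (a == ai)
            if mask.any():
                F_new[si, ai] = y[mask].mean()
    F = F_new

# The greedy joint action argmax_a F(s, a) gives each agent its component of an
# (approximate) MPE action at every state.
print("greedy joint action per state:", F.argmax(axis=1))
```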
Theorem 1 does not establish uniqueness of $B$, which could lead to ambiguity in the solution. The following result reduces the set of candidates to a single family of functions:

Lemma 1. If $B_1, B_2$ are value functions of the dual MDP $\mathcal{M}$ then $(B_1^{\pi} - B_2^{\pi})(s) = c$ for all $\pi\in\mathbf{\Pi}$ and all $s\in\mathcal{S}$, where $c\in\mathbb{R}$ is a constant.

Therefore, the set of candidate functions is limited to a family of functions that differ only by a constant.

Computing the Potential Function φ. Theorem 1 requires knowledge of $\phi$. Existing methods to find $\phi$ in PGs, e.g. the MPD method (Candogan et al., 2013), are combinatorial in actions and agents. Indeed, directly applying (1) to compute $\phi$ requires checking all deviations over pure strategies (deterministic policies), which is expensive since it involves sweeping through the joint action space $\mathcal{A}$. We now demonstrate how to compute $\phi$ while overcoming these issues by transforming (1) into a differential equation. To employ standard RL methods we require parameterised policies and, in anticipation of tackling an RL setting, we extend our coverage to parameterised stochastic policies.

Proposition 1. In any c-SPG the following holds for all $s\in\mathcal{S}$ and all $(\theta^i, \theta^{-i})\in E_{ps}$:

$$\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\left[\nabla_{\theta^i}\ln\pi^{\theta,i}(a^i|s;\theta^i)\,\big(R_i - \phi\big)(s, a)\right] = 0. \qquad (5)$$

The PDE serves as an analogue of the PG condition (1) which now exploits the continuity of the action space and the fact that the agents' actions are sampled from stochastic policies. Therefore, Prop. 1 reduces the problem of finding $\phi$ to solving a PDE.

So far we have considered a planning solution method that solves the game when the agents' reward functions are known upfront. In Sec. 5, we consider settings in which the reward functions and the transition function are a priori unknown but the agents observe their rewards with noisy feedback.

5. Learning in c-Stochastic Potential Games

In RL, an agent learns to maximise its total rewards by repeated interaction with an unknown environment. The underlying problem is typically formalised as an MDP (Sutton & Barto, 2018). MARL extends RL to a multi-player setting (Yang & Wang, 2020). The underlying problem is modelled as an SG in which the rewards of each agent and the transition dynamics are a priori unknown.

We have shown that the MPE of a c-SPG can be computed by solving a Markov team game $\mathcal{M}$, an SG in which all agents share the same reward function $\phi$. We now discuss how to solve $\mathcal{M}$ from observed data in unknown environments (i.e. when $P$ and $\{R_i\}$ are not known). Additionally, we discuss our approach to enable easy scaling in the number of agents (and avoid combinatorial complexity) using distributive methods. The scheme can be summarised in the following steps: i) Compute the potential estimate $\hat\phi$ by solving the PDE in Prop. 1 using a distributed supervised learning method. ii) Solve the team game $\hat{\mathcal{M}} := \langle\mathcal{N}, \mathcal{S}, \mathcal{A}, P, \hat\phi, \gamma\rangle$ with a distributed actor-critic method. The critic is updated with a distributed variant of the fitted Q-learning method in Sec. 4.

5.1. Learning the Potential Function φ

Though Prop. 1 reveals that $\phi$ can be found by solving a PDE, it involves evaluations in pure strategies, which can be costly. Moreover, the result cannot be applied directly to estimate $\phi$ since the agents sample their rewards but not $\phi$. We now show how each agent can construct an approximation of $\phi$, in a way that generalises across actions and states, by sampling its rewards.
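As a quick numerical illustration of Prop. 1 (a minimal sketch of our own, not the procedure used in the paper): for a toy two-agent game with rewards of the form $R_i = \phi + h_i(a^{-i})$ and nearly deterministic Gaussian policies, the Monte-Carlo estimate of the left-hand side of (5) is close to zero when the true potential is used and clearly nonzero for a perturbed candidate. The game, policies and constants below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma, batch = 2, 0.1, 200_000       # two agents, nascent Gaussian policies

def potential(s, a):
    # Hypothetical potential of a toy coordination game (not from the paper).
    return -(a[..., 0] + a[..., 1] - s) ** 2

def reward(i, s, a):
    # R_i = phi + h_i(a^{-i}); the dummy term h_i depends only on the other agent.
    return potential(s, a) + 0.5 * a[..., 1 - i] ** 2

def residual(i, s, theta, phi_candidate):
    # Monte-Carlo estimate of E[ d/d theta_i log pi_i(a_i|s) * (R_i - phi)(s, a) ],
    # the quantity that (5) says vanishes when phi_candidate is a true potential.
    a = theta + sigma * rng.normal(size=(batch, N))        # a_i ~ N(theta_i, sigma^2)
    score_i = (a[:, i] - theta[i]) / sigma ** 2            # Gaussian score function
    return np.mean(score_i * (reward(i, s, a) - phi_candidate(s, a)))

s, theta = 0.4, np.array([0.3, -0.2])
wrong = lambda s, a: potential(s, a) + a[..., 0] * a[..., 1]   # spoiled candidate
for i in range(N):
    print(f"agent {i}: residual, true phi  = {residual(i, s, theta, potential):+.4f}")
    print(f"agent {i}: residual, wrong phi = {residual(i, s, theta, wrong):+.4f}")
```

The same residual, with $\phi$ replaced by a parameterised estimate, is what the learning procedure below drives towards zero.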
First, we demonstrate how the potential condition (1) can be closely satisfied using almost-pure strategies. The usefulness of this will become apparent when we solve the PDE in Prop. 1 to find $\phi$.

Lemma 2. Let $F$ be a bounded and continuous function and let $\Delta F(s_t, (a^i_t, a^{-i}_t)) := F(s_t, (a'^i_t, a^{-i}_t)) - F(s_t, (a^i_t, a^{-i}_t))$; then there exists $c > 0$ such that

$$\Big|\Delta F(s_t, (a^i_t, a^{-i}_t)) - \Delta F(s_t, (\pi^{\epsilon,i}, \pi'^{\epsilon,i}, a^{-i}_t))\Big| \le c\,\|\Delta F\|_{\infty}\,\hat\sigma^2,$$

where the policy $\pi^{\epsilon,i}$ is a nascent delta function⁵ and $\hat\sigma^2 := \max\{\mathrm{Var}(\pi^{\epsilon,i}), \mathrm{Var}(\pi'^{\epsilon,i})\}$.

⁵A nascent delta function $g_\epsilon$ has the property $\lim_{\epsilon\to 0}\int_X g_\epsilon(x) f(x)\,\mathrm{d}x = f(0)$ for any function $f$. Nascent delta functions enable pure strategies to be approximated by stochastic policies with small variance. We denote a nascent policy by $\pi^{\epsilon,i}$, for $\epsilon > 0$.

Since the bound approaches 0 in the limit as the policies become pure strategies, the potential condition (5) is closely satisfied by nascent stochastic policies. We now put Lemma 2 to use with a method to compute $\hat\phi$ that inexpensively solves the PDE condition (5) over the policy parameter space $E$. Indeed, thanks to Lemma 2, we can learn $\phi$ through an optimisation over $E$. The method uses a PDE solver over a set of randomly sampled points across $E\times\mathcal{S}$ using the observed data $\{(s_k, a_k, (r_{1,k}, \ldots, r_{N,k}))\}_{k\ge 0}$, where $r_{i,k}\sim R_i(s_k, a_k)$.⁶ Therefore, define:

$$g_i(s, \theta, \hat\phi) := \mathbb{E}_{a\sim\pi^{\epsilon}(\cdot|s;\theta)}\left[\nabla_{\theta^i}\ln\pi^{\epsilon,i}(a^i|s;\theta^i)\,\big[R_i - \hat\phi\big](s, a)\right],$$

where $\pi^{\epsilon}(a|s;\theta) := \pi^{\epsilon,i}(a^i|s;\theta^i)\,\pi^{\epsilon,-i}(a^{-i}|s;\theta^{-i})$.

⁶As with methods that use sharing networks (e.g. COMIX, FacMADDPG (de Witt et al., 2020)), agents observe other agents' rewards. The method can be performed using only each agent's own data $\{(s_k, a_k, r_{i,k})\}$; however, this requires more trajectory data.

Following Prop. 1, we consider the following problem to compute $\hat\phi$:

$$\min_{\omega\in\mathbb{R}^k}\ \big\|G(s, \theta; \hat\phi_\omega)\big\|^2_\rho, \qquad (6)$$

where $G(s, \theta; \hat\phi) := [g_1(s, \theta; \hat\phi), \ldots, g_N(s, \theta; \hat\phi)]^{\mathrm{T}}$, $\|f\|^2_\rho := \int_Y |f(y)|^2\rho(y)\,\mathrm{d}y$, $\rho(y)$ is a positive probability density on $Y$ and $\omega\in\mathbb{R}^k$ are the parameters of $\hat\phi$. The optimisation performs evaluations in mixed strategies, which is computationally inexpensive. Using a weighted exponential sum method (Chang, 2015), the objective reduces to a least-squares problem on the single objective $G(s, \theta, \omega) := N^{-1}\sum_{i\in\mathcal{N}}\big(g_i(s, \theta, \hat\phi_\omega)\big)^2$. The optimisation can be solved with a function approximator for $\hat\phi$, e.g. a deep neural network (NN). Under mild conditions (Bertsekas & Tsitsiklis, 2000) the method converges to a critical point of $G$, that is, $\lim_{n\to\infty}\nabla G = 0$. We defer the details of the method to the Appendix.

Actor-Critic Method. We now return to tackling the problem of solving the team game $\hat{\mathcal{M}} := \langle\mathcal{N}, \mathcal{S}, \mathcal{A}, P, \hat\phi, \gamma\rangle$. To enable the method to scale and handle continuous actions, we adapt the fitted Q-learning method of Sec. 4 into an actor-critic method (Konda & Tsitsiklis, 2000) in which each agent learns its own policy $\pi^{\theta^i,i}$ using the estimate of $B$. The policy parameter $\theta^i$ at the $k$th iteration is updated through sampled deterministic policy gradients (DPGs) (Silver et al., 2014):

$$\nabla_{\theta^i} J(\theta^i) \approx \nabla_{\theta^i}\pi^{\theta^i,i}(\cdot\,|\,s_{l_k})\ \nabla_{a^i_{l_k}} F_k(s_{l_k}, a_{l_k}). \qquad (7)$$

Equation (7) describes the actor update via a DPG. The complete process is described in Algorithm 2. It involves two optimisations in sequence: the agents individually compute the approximation $\hat\phi$, which is then used for computing $\hat B$, an approximation of the optimal value function $B$, via a Q-learning plus decentralised DPG method that outputs each agent's MPE policy. Crucially, the method avoids optimisations over the joint space $\times_{i\in\mathcal{N}}\mathcal{A}_i$, enabling easy scaling (in the number of agents) in this component of the algorithm.
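To make the decentralised actor update (7) concrete, here is a minimal sketch of our own (not the paper's implementation): each agent holds a linear deterministic policy $a_i = W_i s$ and updates only its own parameter with the chain rule $\nabla_{W_i}\mu_i(s)\,\nabla_{a_i}F(s,a)$ against a fixed stand-in critic. The critic, policies and step size are hypothetical, and a finite difference replaces automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(2)
N, lr = 2, 0.02

def critic(s, a):
    # Stand-in for the learned shared critic F_k(s, a) over the joint action;
    # any joint action with a_0 + a_1 = s maximises it.
    return -(a[0] + a[1] - s) ** 2

def grad_critic_wrt_action(s, a, i, eps=1e-5):
    # Finite-difference estimate of dF/da_i, the inner factor of update (7).
    a_hi, a_lo = a.copy(), a.copy()
    a_hi[i] += eps
    a_lo[i] -= eps
    return (critic(s, a_hi) - critic(s, a_lo)) / (2 * eps)

W = rng.normal(size=N)           # per-agent deterministic policies: a_i = W_i * s
for step in range(1000):
    s = rng.normal()             # state sampled from the buffer
    a = W * s                    # joint action produced by the current policies
    for i in range(N):
        # Update (7): grad_{W_i} mu_i(s) * grad_{a_i} F(s, a), with grad_{W_i} mu_i(s) = s.
        W[i] += lr * s * grad_critic_wrt_action(s, a, i)

print("learned policy slopes:", W, "- their sum should be close to 1")
```

Note that each agent's update touches only its own action coordinate of the critic, which is what keeps the actor step free of the joint-action optimisation.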
Scaling in N using Consensus Optimisation. Although the above method represents progress towards solving SGs, a scalability issue remains, since estimating $\hat\phi$ involves a computation over the joint space $E$. This becomes increasingly expensive with large numbers of agents. We now devise a fully distributed version of the method that scales with the number of agents. In this version, each agent $i$ constructs an independent estimate of $\hat\phi$ by sampling across $E^i\times\mathcal{S}$ at each step, using only its own observed data $\{(s_k, a_k, r_{i,k})\}_{k\ge 0}$. The method includes a consensus step that enables $\hat\phi$ (and hence $\hat B$) to be computed accurately and efficiently in a fully distributed fashion (Tutunov et al., 2019).

Algorithm 2 SPot-AC: Stochastic POTential Actor-Critic
Input: discount factor γ; DPF and PF approximation maps $F, \hat\phi\in\mathcal{H}$ (resp. parameters $\omega\in\mathbb{R}^k$).
1: for k ∈ 0, 1, 2, ... do
2:   Using $(\pi^{\theta^i_k})_{i\in\mathcal{N}}$ to roll out, collect the trajectory data and save it in the buffer D.
3:   for i ∈ N do
4:     // Learn the potential function $\hat\phi$
5:     Sample a random minibatch of L samples $\{(s_{l_t}, (a^i_{l_t})_{i\in\mathcal{N}}, s_{l_t+1}, (r^i_{l_t})_{i\in\mathcal{N}})\}$ from D.
6:     Compute $\hat\phi$ by solving Expression (6)
7:     // Compute the value function $\hat B$
8:     Set the local target $Y_{l_k}$ by (3)
9:     Update the shared critic F by minimising Eq. (4)
10:    // Learn the individual policy
11:    Update the policy using Eq. (7)
12:   end for
13: end for
Output: F, $(\theta^i_k)_{i\in\mathcal{N}}$

To enable efficient scaling with the number of agents, we use distributed optimisation (DO) with consensus (Nedic & Ozdaglar, 2009) to find $\hat\phi$. Each agent produces its own estimate $\hat B$ based on its observed rewards. DO methods efficiently solve large-scale optimisation problems (Macua et al., 2010) and yield two major benefits: i) efficiency: computing $\phi$ uses feedback from all agents' reward samples; ii) consensus on Q: agents learn $\phi$ distributively but have identical Q iterates (for computing $\hat B$). The common objective, which each agent solves individually, is expressed with a set of local variables $\{\omega_i\}_{i\in\mathcal{N}}$ and a common global variable $z$:

$$\text{minimise } G(s, \omega) = N^{-1}\sum_{i\in\mathcal{N}}\big(g_i(s, \omega_i)\big)^2 \quad \text{s.t. } \omega_i - z = 0,\ i = 1, \ldots, N,$$

where the gradient descent follows $\omega^{n}_i \leftarrow \omega^{n-1}_i - \alpha\,\nabla g_i(s, \omega^{n-1}_i)$ for some step size $\alpha > 0$. Note that the constraint prevents convergence to any $R_i$. The algorithm works by constructing an estimate $\hat\phi$ and then solving $\mathcal{M}$ in a distributed fashion, allowing the method to scale with the number of agents.
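The consensus step can be illustrated with a small decentralised-gradient-descent sketch of our own (quadratic stand-ins replace the local objectives $(g_i)^2$; the ring topology, mixing weights and constants are hypothetical): each agent alternates a local gradient step on its own objective with an averaging step over its neighbours, so that, with a small constant step size, the local parameter copies approach agreement in a neighbourhood of the minimiser of the average objective.

```python
import numpy as np

rng = np.random.default_rng(3)
N, dim, lr, steps = 4, 3, 0.1, 300

# Stand-in local objectives: agent i only sees f_i(w) = ||w - c_i||^2, a proxy
# for (g_i(s, w))^2 built from its own reward samples. The average objective
# is minimised at the mean of the c_i.
C = rng.normal(size=(N, dim))

# Doubly stochastic mixing matrix for a ring of agents (two neighbours each).
W_mix = np.zeros((N, N))
for i in range(N):
    W_mix[i, i] = 0.5
    W_mix[i, (i - 1) % N] = 0.25
    W_mix[i, (i + 1) % N] = 0.25

omega = rng.normal(size=(N, dim))       # each agent keeps its own local estimate
for _ in range(steps):
    grads = 2 * (omega - C)             # local gradient of f_i at omega_i
    omega = W_mix @ omega - lr * grads  # consensus averaging + local descent step

print("max disagreement between agents:", np.max(np.abs(omega - omega.mean(axis=0))))
print("distance to the average-objective minimiser:",
      np.linalg.norm(omega[0] - C.mean(axis=0)))
```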
Algorithm Analysis. Our SPot-AC algorithm inherits many useful properties of Q-learning (Antos et al., 2008).⁷ Nevertheless, it is necessary to ensure that the output of the algorithm still yields good performance when the supervised learning approximation of $\mathcal{M}$ has small errors. We now analyse the SPot-AC algorithm and show that, provided the errors in approximating $\mathcal{M}$ are small, the error in the algorithm output is also small. Our first result bounds the error on the estimate of the DPF arising from the approximation method for $\hat\phi$.

⁷By Prop. 5 (see Appendix) any MPE is a local optimum of B.

Proposition 2. Define

$$F_i(s, \theta, \omega) := \int_{\mathcal{A}}\pi(\mathrm{d}a, \theta, s)\,\frac{\partial}{\partial\theta^i}\pi^{\epsilon,i}(a^i, \theta^i, s)\,\nabla F^{\omega}(s, a) \quad\text{and}\quad U(s, \theta) := \int_{\mathcal{A}}\pi(\mathrm{d}a, \theta, s)\,\frac{\partial}{\partial\theta^i}\pi^{\epsilon,i}(a^i, \theta^i, s)\,\nabla R_i(s, a);$$

then the following bound holds for some $c > 0$:

$$\|F_i(s, \theta, \omega) - U(s, \theta)\| \le c\,N^2\epsilon^2,$$

where $\epsilon$ is the approximation error from the supervised learning procedure.

Our next result ensures that, if the estimates of $\phi$ have only small errors, SPot-AC generates policy performances that closely match those of the MPE policies.

Proposition 3. Let $\tilde\phi$ denote the estimate of $\phi$, define $\tilde B := B^\star_{\tilde\phi}$, and let the policy $\hat\pi$ be an MPE strategy, i.e. $\hat\pi\in NE\{\mathcal{G}\}$ (so that $B^\star_{\phi} = B^{\hat\pi}$); then for any $\epsilon > 0$ the following holds:

$$\big\|\tilde B - B^\star\big\| \le \epsilon\,(2-\gamma)(1-\gamma)^{-1}, \quad \text{whenever } \|\tilde\phi - \phi\| < \epsilon.$$

The result ensures that, given close approximations of $\phi$, SPot-AC in turn yields outputs close to $B^\star$. The result exploits the fact that the dual MDP $\mathcal{M}$ of Theorem 1 exhibits a continuity property, so that small errors in the approximation of $\phi$ and $B$ incur only small changes to the MPE of $\mathcal{G}$.

6. Experiments

We evaluate SPot-AC in three popular multi-agent environments: the particle world (Lowe et al., 2017), a network routing game (Roughgarden, 2007) and a Cournot duopoly problem (Agliari et al., 2016). These environments have continuous action and state spaces, and the agents seek to maximise their own interest, e.g. reaching the target without collisions in particle world and minimising the cost of transporting commodity in the routing game. To solve these problems successfully, the agents must learn Markov perfect equilibrium policies in order to respond optimally to the actions of others. We consider two groups of state-of-the-art MARL baselines that handle continuous actions. The first group use individual rewards for learning: MADDPG (Lowe et al., 2017) and DDPG (Lillicrap et al., 2015). The second group use the collective rewards of all agents: COMIX and COVDN (de Witt et al., 2020). Further details are in the Appendix.

We use two evaluation metrics. Exploitability (Davis et al., 2014) describes how much additional payoff players can achieve by playing a best-response policy BR (defined in the Appendix). It measures the proximity of the agents' policies to the MPE strategy and is defined as $\delta = \frac{1}{N}\sum_i\big(u_i(BR(\pi^{-i}), \pi^{-i}) - u_i(\pi)\big)$. Social welfare is the collective reward of all agents, $r = \sum_{i=1}^{N} r_i$; this is most relevant in tasks such as team games, where agents seek to maximise the total reward.

Ablation Studies. To test the robustness of SPot-AC and the baselines, we perform a set of ablation studies within routing games. Ablation 1 analyses SPot-AC in SGs that progressively deviate from c-SPGs, showing that SPot-AC can handle SGs that mildly violate the c-SPG conditions (i.e. the potentiality requirement). Ablation 2 analyses SPot-AC in SGs that progressively deviate from team games but retain the potential game property. We demonstrate that, unlike other methods, SPot-AC is able to converge to the Markov perfect equilibrium in non-cooperative SGs. We also report results on the classic Cournot duopoly and show convergence of SPot-AC to the NE.

Non-atomic Routing Games involve a set of N selfish agents seeking to transport their commodity from a source node to a goal node in a network. This commodity can be divided arbitrarily and sent between nodes along edges. At each time step, each agent has a distribution of commodity over the nodes of the graph. It assigns a fraction of its commodity at each node to travel along the edges emerging from that node. There are multiple agents (given by $N\in\{2, 4, 6, 8\}$) using the same network (with number of nodes $K\in\{20, 40\}$), and agents pay a cost related to the total congestion of every edge at each time step. We design the game so that the MPE is socially efficient, i.e. playing an MPE strategy leads to high individual returns. We repeat the experiments for 5 independent runs and report the mean and standard deviation of the rewards. Further details on the settings can be found in the Appendix.
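Before turning to the results, the exploitability metric itself can be illustrated in closed form on a standard two-firm Cournot duopoly (our own sketch with hypothetical demand and cost constants, rather than the learned best-response procedure used in the experiments): δ is zero at the Nash equilibrium quantities and strictly positive at a collusive profile.

```python
import numpy as np

a, b, c = 10.0, 1.0, 2.0                    # inverse demand P(Q) = a - b*Q, unit cost c

def payoff(i, q):
    # Standard Cournot profit for firm i given joint quantities q = (q_0, q_1).
    return q[i] * (a - b * q.sum() - c)

def best_response(i, q):
    # Closed-form best response against the opponent's quantity.
    return max(0.0, (a - c - b * q[1 - i]) / (2 * b))

def exploitability(q):
    # delta = (1/N) * sum_i [ u_i(BR(q_-i), q_-i) - u_i(q) ]
    gains = []
    for i in range(2):
        q_br = q.copy()
        q_br[i] = best_response(i, q)
        gains.append(payoff(i, q_br) - payoff(i, q))
    return np.mean(gains)

q_ne = np.full(2, (a - c) / (3 * b))        # Nash equilibrium quantities
q_collusive = np.full(2, (a - c) / (4 * b)) # joint-profit (welfare-style) optimum
print("exploitability at NE:       ", exploitability(q_ne))         # ~0
print("exploitability at collusion:", exploitability(q_collusive))  # > 0
```

In the experiments, the best response is approximated by training a separate policy against the frozen opponents, so small negative values of δ can occur.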
Exploitability. We test SPot-AC in a simple Braess paradox game. The exploitability of SPot-AC (Fig. 2) quickly converges to close to 0, indicating that it learns NE policies (negative values are due to the fact that we are approximating best responses). In contrast, the high exploitability values of existing MARL methods indicate that they fail to converge to NE policies. The algorithms that involve reward sharing (COMIX, COVDN) attempt to maximise social welfare, which is incompatible with this non-cooperative setting, so they can be exploited by a best-response strategy.

Figure 2. Results of exploitability. (a) A visualisation of the policy flows learned by SPot-AC on the Braess paradox. (b) Exploitability results of all methods.

Social welfare. In the cooperative, non-atomic routing game environment, we see in Fig. 3 that, using SPot-AC (orange), each agent learns how to split its commodity in a way that maximises rewards (minimises costs) and matches the shared-reward baselines. Conversely, MADDPG (orange) and DDPG (blue) yield low rewards with high variance.

Figure 3. Top: average agent returns in the network with N = 4, 6, 8 agents and K = 20 nodes. Bottom: average agent returns in the network with N = 4, 6, 8 agents and K = 59 nodes.

Coordination Navigation. This OpenAI Multi-Agent Particle Environment task (Lowe et al., 2017) involves n agents and n landmarks. Each agent must reach its target while avoiding collisions with other agents and fixed landmarks. Agents can observe the relative positions of other agents and landmarks, and have five actions {up, down, left, right, stay}. The reward is calculated from the agent's distance to each landmark, with penalties for collisions with other agents. This is a non-cooperative SG, so we compare SPot-AC to DDPG and MADDPG, algorithms that are able to learn policies in which agents act selfishly. We perform the exploitability analysis as above. Fig. 4 shows that SPot-AC achieves the best performance in terms of minimum distance to target and number of collisions, demonstrating that SPot-AC enables agents to learn to coordinate while pursuing their own goals.

Figure 4. Coordination Navigation problem.

7. Conclusion

In this paper, we describe the first MARL framework that tackles MAS with payoff structures beyond zero-sum or team games. In doing so, the results we establish pave the way for a new generation of solvers that are able to tackle classes of SGs beyond the cases in which the payoff structures lie at the extremes. The results of this paper therefore open the door for MARL techniques to address a wider range of multi-agent scenarios. By developing theory that shows that a class of SGs, namely c-SPGs, has a dual representation as MDPs, we showed that c-SPGs can be solved by MARL agents using a novel distributed method which avoids the combinatorial explosion, therefore allowing the solver to scale with the number of agents. We then validated our theory in experiments on previously unsolvable scenarios, showing that our method successfully learns MPE policies, in contrast to existing MARL methods.

Acknowledgements. YW and ZW are partly supported by the Strategic Priority Research Program of the Chinese Academy of Sciences, Grant No. XDA27000000. DM is grateful to Mohammed Amin Abdullah for insightful discussions relating to the convergence proofs in this work.
References

Agliari, A., Naimzada, A. K., and Pecora, N. Nonlinear dynamics of a Cournot duopoly game with differentiated products. Applied Mathematics and Computation, 281:1-15, 2016.

Antos, A., Szepesvári, C., and Munos, R. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems, pp. 9-16, 2008.

Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.

Bertsekas, D. P. Approximate dynamic programming. 2012.

Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627-642, 2000.

Bloembergen, D., Tuyls, K., Hennes, D., and Kaisers, M. Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research, 53:659-697, 2015.

Candogan, O., Menache, I., Ozdaglar, A., and Parrilo, P. A. Flows and decompositions of games: Harmonic and potential games. Mathematics of Operations Research, 36(3):474-503, 2011.

Candogan, O., Ozdaglar, A., and Parrilo, P. A. Near-potential games: Geometry and dynamics. ACM Transactions on Economics and Computation (TEAC), 1(2):1-32, 2013.

Chang, K.-H. Chapter 5 - Multiobjective optimization and advanced topics. In Chang, K.-H. (ed.), Design Theory and Methods Using CAD/CAE, pp. 325-406. Academic Press, Boston, 2015.

Chen, X., Deng, X., and Teng, S.-H. Settling the complexity of computing two-player Nash equilibria. Journal of the ACM (JACM), 56(3):1-57, 2009.

Cheng, Y., Diakonikolas, I., and Stewart, A. Playing anonymous games using simple strategies. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 616-631. SIAM, 2017.

Daskalakis, C. and Papadimitriou, C. Computing equilibria in anonymous games. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS '07), pp. 83-93. IEEE, 2007.

Davis, T., Burch, N., and Bowling, M. Using response functions to measure strategy strength. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.

de Witt, C. S., Peng, B., Kamienny, P.-A., Torr, P., Böhmer, W., and Whiteson, S. Deep multi-agent reinforcement learning for decentralized continuous cooperative control. arXiv preprint arXiv:2003.06709, 2020.

Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Fudenberg, D. and Tirole, J. Game Theory. MIT Press, 1991.

González-Sánchez, D. and Hernández-Lerma, O. Discrete-Time Stochastic Control and Dynamic Potential Games: The Euler Equation Approach. Springer Science & Business Media, 2013.

Grau-Moya, J., Leibfried, F., and Bou-Ammar, H. Balancing two-player stochastic games with soft Q-learning. arXiv preprint arXiv:1802.03216, 2018.

Hernandez-Leal, P., Kaisers, M., Baarslag, T., and de Cote, E. M. A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183, 2017.

Hu, J. and Wellman, M. P. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039-1069, 2003.

Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science, 364(6443):859-865, 2019.

Jaśkiewicz, A. and Nowak, A. S. On symmetric stochastic games of resource extraction with weakly continuous transitions. Top, 26(2):239-256, 2018.
Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008-1014, 2000.

Lã, Q. D., Chew, Y. H., and Soong, B.-H. Potential Game Theory. Springer, 2016.

Leonardos, S., Overman, W., Panageas, I., and Piliouras, G. Global convergence of multi-agent policy gradient in Markov potential games. 2021.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Littman, M. L. Friend-or-foe Q-learning in general-sum games. In ICML, volume 1, pp. 322-328, 2001.

Lowe, R., Wu, Y. I., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379-6390, 2017.

Macua, S. V., Belanovic, P., and Zazo, S. Consensus-based distributed principal component analysis in wireless sensor networks. In 2010 IEEE 11th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1-5. IEEE, 2010.

Macua, S. V., Zazo, J., and Zazo, S. Learning parametric closed-loop policies for Markov potential games. arXiv preprint arXiv:1802.00899, 2018.

Marden, J. R., Arslan, G., and Shamma, J. S. Cooperative control and potential games. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(6):1393-1407, 2009.

Mariano, P., Pereira, A., Correia, L., Ribeiro, R., Abramov, V., Szirbik, N., Goossenaerts, J., Marwala, T., and De Wilde, P. Simulation of a trading multi-agent system. In 2001 IEEE International Conference on Systems, Man and Cybernetics. e-Systems and e-Man for Cybernetics in Cyberspace (Cat. No. 01CH37236), volume 5, pp. 3378-3384. IEEE, 2001.

Mguni, D., Jennings, J., and de Cote, E. M. Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Monderer, D. and Shapley, L. S. Fictitious play property for games with identical interests. Journal of Economic Theory, 68(1):258-265, 1996a.

Monderer, D. and Shapley, L. S. Potential games. Games and Economic Behavior, 14(1):124-143, 1996b.

Nedic, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48-61, 2009.

Nicolaescu, L. I. The coarea formula. In seminar notes. Citeseer, 2011.

Papadimitriou, C. H. and Tsitsiklis, J. N. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441-450, 1987.

Peng, P., Wen, Y., Yang, Y., Yuan, Q., Tang, Z., Long, H., and Wang, J. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.

Rashid, T., Samvelyan, M., De Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.

Roughgarden, T. Routing games. Algorithmic Game Theory, 18:459-484, 2007.

Samvelyan, M., Rashid, T., de Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G., Hung, C.-M., Torr, P. H., Foerster, J., and Whiteson, S. The StarCraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019.

Shapley, L. S. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095-1100, 1953.
Shoham, Y. and Leyton-Brown, K. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2008.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. 2014.

Simon, L. et al. Lectures on Geometric Measure Theory. The Australian National University, Mathematical Sciences Institute, Centre ..., 1983.

Slade, M. E. What does an oligopoly maximize? The Journal of Industrial Economics, pp. 45-61, 1994.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057-1063, 2000.

Szepesvári, C. and Munos, R. Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd International Conference on Machine Learning, pp. 880-887, 2005.

Tan, M. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330-337, 1993.

Tutunov, R., Bou-Ammar, H., and Jadbabaie, A. Distributed Newton method for large-scale consensus optimization. IEEE Transactions on Automatic Control, 64(10):3983-3994, 2019.

Ui, T. A Shapley value representation of potential games. Games and Economic Behavior, 31(1):121-135, 2000.

Wang, X. and Sandholm, T. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems, pp. 1603-1610, 2003.

Wiering, M. Multi-agent reinforcement learning for traffic light control. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML 2000), pp. 1151-1158, 2000.

Yang, Y. and Wang, J. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583, 2020.

Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., and Wang, J. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 5571-5580. PMLR, 2018.

Yang, Y., Tutunov, R., Sakulwongtana, P., Ammar, H. B., and Wang, J. αα-Rank: Scalable multi-agent evaluation through evolution. arXiv preprint arXiv:1909.11628, 2019.

Yang, Y., Wen, Y., Wang, J., Chen, L., Shao, K., Mguni, D., and Zhang, W. Multi-agent determinantal Q-learning. In International Conference on Machine Learning, pp. 10757-10766. PMLR, 2020.

Ye, D., Zhang, M., and Yang, Y. A multi-agent framework for packet routing in wireless sensor networks. Sensors, 15(5):10026-10047, 2015.

Zazo, S., Macua, S. V., Sánchez-Fernández, M., and Zazo, J. Dynamic potential games in communications: Fundamentals and applications. arXiv preprint arXiv:1509.01313, 2015.

Zhang, H., Chen, W., Huang, Z., Li, M., Yang, Y., Zhang, W., and Wang, J. Bi-level actor-critic for multi-agent coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7325-7332, 2020.

Zhang, R., Ren, Z., and Li, N. Gradient play in multi-agent Markov stochastic games: Stationary points and convergence. arXiv preprint arXiv:2106.00198, 2021.

Zhou, M., Luo, J., Villela, J., Yang, Y., Rusu, D., Miao, J., Zhang, W., Alban, M., Fadakar, I., Chen, Z., et al. SMARTS: Scalable multi-agent reinforcement learning training school for autonomous driving. arXiv preprint arXiv:2010.09776, 2020.