# Maximum Entropy Heterogeneous-Agent Reinforcement Learning

Published as a conference paper at ICLR 2024

Jiarong Liu1, Yifan Zhong1,2, Siyi Hu3, Haobo Fu4, Qiang Fu4, Xiaojun Chang3, Yaodong Yang1
1Institute for AI, Peking University, 2National Key Laboratory of General AI, BIGAI, 3University of Technology Sydney, 4Tencent AI Lab
(Equal contribution. Work done during internship at Institute for AI, Peking University. Corresponding to . See our page at https://sites.google.com/view/meharl.)

ABSTRACT

Multi-agent reinforcement learning (MARL) has been shown effective for cooperative games in recent years. However, existing state-of-the-art methods face challenges related to sample complexity, training instability, and the risk of converging to a suboptimal Nash equilibrium. In this paper, we propose a unified framework for learning stochastic policies to resolve these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective for MARL. Based on the MaxEnt framework, we propose the Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm. Theoretically, we prove the monotonic improvement and convergence to quantal response equilibrium (QRE) properties of HASAC. Furthermore, we generalize a unified template for MaxEnt algorithmic design named Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML), which provides any induced method with the same guarantees as HASAC. We evaluate HASAC on six benchmarks: Bi-DexHands, Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game. Results show that HASAC consistently outperforms strong baselines, exhibiting better sample efficiency, robustness, and sufficient exploration.

1 INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) is a complex problem characterized by the difficulty of coordinating individual agent policy improvements to enhance the overall performance of the entire team. As a result, traditional independent learning methods in MARL often exhibit poor convergence properties (30; 3). To alleviate these difficulties, the centralized training decentralized execution (CTDE) paradigm (5; 19) assumes that global states and teammates' actions and policies are accessible during the training phase. This approach has led to the development of competent multi-agent policy gradient algorithms (40; 37; 38; 43; 42) as well as value decomposition algorithms (27; 24; 41; 26; 33). Furthermore, heterogeneous-agent mirror learning (HAML) (15) provides a template for rigorous algorithmic design, which guarantees any induced algorithm monotonic improvement of the joint objective and convergence to Nash equilibrium (NE) (22). Despite the theoretical soundness of the HAML framework, HAML-derived algorithms suffer from two main challenges. First, these methods face difficulties attributed to either sample complexity or training instability. On-policy algorithms, including HAPPO and HATRPO (13), require new sample data for each gradient step, which becomes very expensive as task complexity and agent numbers increase. Off-policy algorithms, on the other hand, exhibit training instability and hyperparameter sensitivity (45). Moreover, HAML-derived algorithms suffer from insufficient exploration, which can lead to suboptimal NE convergence. This is primarily due to the standard MARL objective they maximize, for which there always exists a deterministic convergent solution (15; 28) and stochasticity is not inherently encouraged.
Since the presence of multiple NEs is a frequently observed phenomenon in many multi-agent games, these methods can fail to explore sufficiently and prematurely converge to a suboptimal NE, as we will show in Section 3.2. A possible solution to these challenges is to let agents learn stochastic behaviors in a sample-efficient way. Similar to the case of single-agent RL (7), in cooperative MARL problems, stochastic policies enable effective exploration of the reward landscape, mastery of multiple ways of performing the task, and robustness when facing prediction errors (46). Unfortunately, while a number of methods have achieved great success in single-agent RL settings (46; 7; 9; 17), solving such stochastic policy learning problems in cooperative MARL is challenging. Existing CTDE methods offer no convergence guarantee for learning stochastic policies in general cases. Recently, FOP (44) applied the maximum entropy framework to multi-agent settings via value decomposition. However, FOP only provides convergence to the global optimum when a task satisfies the Individual-Global-Optimal (IGO) assumption, which limits its applicability in general cooperative MARL problems.

In this paper, we propose the first theoretically-justified actor-critic framework for learning stochastic policies in cooperative MARL. Firstly, we model cooperative MARL as a probabilistic graphical inference problem (Figure 2), where stochastic policies arise as optimal solutions. Performing variational inference in this model leads us to derive the maximum entropy (MaxEnt) MARL objective. To maximize this objective, we introduce heterogeneous-agent soft policy iteration (HASPI), which ensures the properties of monotonic improvement and convergence to quantal response equilibrium (QRE), the solution concept corresponding to stochastic policies in game theory (20; 6). The key insight behind this theory is the joint soft policy decomposition proposition. Based on HASPI, we derive the heterogeneous-agent soft actor-critic (HASAC) algorithm. Furthermore, we generalize the HASPI procedure to the Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML) template, which offers a unified solution to MaxEnt MARL problems and provides the same theoretical guarantees as HASAC for any induced method. We test HASAC on six benchmarks: Multi-Agent MuJoCo (MAMuJoCo) (4), Bi-DexHands (2), StarCraft Multi-Agent Challenge (SMAC) (25), Google Research Football (GRF) (16), Multi-Agent Particle Environment (MPE) (19), and Light Aircraft Game (LAG) (23). Across the majority of benchmark tasks, HASAC consistently outperforms strong baselines, exhibiting the advantages of stochastic policies, namely improved robustness, higher sample efficiency, and sufficient exploration.

2 RELATED WORK

Multi-agent policy gradient (MAPG) methods have been shown effective for multi-agent cooperation tasks (5; 19). Yu et al. (42) discover the effectiveness of PPO in multi-agent scenarios and introduce MAPPO. It inspires CoPPO (39), which preserves the monotonic improvement property with a simultaneous update scheme, HAPPO / HATRPO (13), which prove monotonic improvement and NE convergence properties with a sequential update scheme, and A2PO (34), which preserves the per-agent monotonic improvement property.
The guarantees of HAPPO / HATRPO are enhanced by HAML (15), which abstracts a general theoretical framework and leads to several practical algorithms. While these methods are effective on challenging benchmarks, we show in Section 3.2 that, due to the standard objective they optimize, they tend to converge rapidly to a suboptimal NE when in proximity to it. Notably, the idea of NE can be considered a notion of local optimum in cooperative MARL settings, and has been studied in many prior works (29; 36; 13; 15). To alleviate the suboptimal NE convergence problem, we propose to learn stochastic policies, which maximize the MaxEnt MARL objective that we derive from probabilistic graphical models. We adopt QRE (20; 6; 11) as the solution concept in the MaxEnt MARL framework, which generalizes NE when payoffs are perturbed by additional noise.

MaxEnt algorithms have achieved great success in single-agent RL. SQL (7) and SAC (8) learn optimal MaxEnt policies through soft Q-iteration and soft policy iteration respectively. They refreshed SOTA performance, showcasing the robustness and effective exploration of stochastic policies. Levine (17) reviews these algorithms from a control-as-inference perspective. Unfortunately, learning MaxEnt policies with theoretical guarantees remains a challenge in cooperative MARL settings. MASQL (36) adopts a multi-agent actor-critic architecture similar to MADDPG (19), extending SQL to multi-agent settings without providing any theoretical guarantees. FOP (44), on the other hand, is a decomposed actor-critic method (4; 35), which utilizes a decomposed critic instead of a centralized critic to learn individual policies. It factorizes the optimal joint policy of MaxEnt MARL under the IGO assumption (see Appendix A.2 for details), which leads to poor performance in complex scenarios. To overcome the constraint of IGO, MACPF (32) learns the optimal joint policy during the training phase and distills independent policies from it to fulfill decentralized execution. Such a procedure can be considered offline imitation learning, where independent policies strive to mimic the behaviors produced by the optimal joint policy, but they still lack the guarantee of converging to the optimum. In contrast, our approach is the first MaxEnt actor-critic method with theoretical guarantees, presenting an improvement to HAML-based algorithms. We augment the objective with entropy, propose HASPI, prove its monotonic improvement and QRE (20; 6; 11) convergence properties without restrictive assumptions, derive HASAC, and establish the MEHAML template.

3 PRELIMINARIES

3.1 COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING

We consider a cooperative Markov game (18) formulated by a tuple ⟨N, S, A, r, P, γ, d⟩. Here, N = {1, . . . , n} denotes the set of n agents, S is the finite state space, A = ∏_{i=1}^{n} A^i is the joint action space, where A^i denotes the finite action space of agent i, r : S × A → R is the joint reward function, P : S × A × S → [0, 1] is the transition probability function, γ ∈ [0, 1) is the discount factor, and d ∈ P(S) (where P(X) denotes the set of probability distributions over a set X) is the initial state distribution. In this work, we also use the notation 𝒫(X) to denote the power set of a set X and Sym(n) to denote the set of permutations of the integers {1, . . . , n}, known as the symmetric group. At time step t ∈ {1, . . . , T}, each agent i ∈ N is at state s_t ∈ S and takes an independent action a^i_t ∼ π^i(·^i|s_t), where π^i is the policy of agent i.
Let a_t = (a^1_t, . . . , a^n_t) ∈ A denote the joint action and π(·|s_t) = ∏_{i=1}^{n} π^i(·^i|s_t) denote the joint policy. We denote the policy space of agent i as Π^i ≜ {π^i : π^i(·^i|s) ∈ P(A^i), ∀s ∈ S}, and the joint policy space as Π ≜ Π^1 × · · · × Π^n. The agents receive a joint reward r(s_t, a_t) and move to the next state s_{t+1} ∼ P(·|s_t, a_t). The initial state distribution d, the joint policy π, and the transition kernel P induce a marginal state distribution at time t, denoted by ρ^t_π. We define the (unnormalized) marginal state distribution ρ_π ≜ Σ_{t=1}^{T} ρ^t_π. The standard joint objective of all agents is to maximize the expected total reward, defined as¹

J_std(π) = E_{s_{1:T}∼ρ^{1:T}_π, a_{1:T}∼π} [ Σ_{t=1}^{T} r(s_t, a_t) ].   (1)

¹We write a^i, a, and s when we refer to the action, joint action, and state as values, and a^i, a, and s when we refer to them as random variables.

3.2 LIMITATIONS OF EXISTING COOPERATIVE MARL METHODS

Existing MAPG methods maximizing the standard joint objective (Equation 1), such as MAPPO and HAPPO, may fail to find the optimal NE and converge prematurely. We analyze this problem by considering a single-state 2-agent cooperative matrix game, shown in Figure 1.

Figure 1: A single-state 2-agent cooperative matrix game. (a) is the reward matrix of joint actions. (b) represents the initial joint policy π formed by both agents taking the individual policy π^i = (0.6, 0.2, 0.2). (c) represents the final joint policy π that MAPPO and HAPPO converge to, deterministically choosing action (A, A).

(a) Reward matrix (rows: agent 1's action, columns: agent 2's action):

|   | A   | B   | C   |
|---|-----|-----|-----|
| A | 5   | -20 | -20 |
| B | -20 | 10  | -20 |
| C | -20 | -20 | 20  |

(b) Initial joint policy π:

|   | A    | B    | C    |
|---|------|------|------|
| A | 0.36 | 0.12 | 0.12 |
| B | 0.12 | 0.04 | 0.04 |
| C | 0.12 | 0.04 | 0.04 |

(c) Final joint policy π: all probability mass on (A, A).

Agents each choose from three possible actions {A, B, C} and receive a joint reward. In this case, (A, A), (B, B), and (C, C) are three different NEs with rewards of 5, 10, and 20 respectively, with (C, C) being the optimal NE. To simulate common scenarios where finding the global optimum is challenging and, due to learning in the early stages, agents are initially updated towards some reasonably good local optimum, we consider a setting where agents assign a higher probability to action (A, A), as shown in Figure 1b. By exact calculation (see Appendix B), we find that traditional algorithms like MAPPO and HAPPO converge rapidly towards the suboptimal point (A, A), failing to identify the Pareto-optimal equilibrium (C, C), as shown in Figure 1c, which is also corroborated by our experiments (Section 5.1). The explanation is that maximizing the standard joint objective (Equation 1) leads to convergence towards deterministic policies that exploit the local optimum explored so far (28). Specifically, the standard objective discourages any unilateral deviation from the local optimum, as such a deviation results in a decrease in the joint reward. As a result, agents tend to converge deterministically to the suboptimal NE. To address this issue, we propose to learn stochastic behaviors, which always preserve the probability of selecting currently underexplored actions and eventually converge to the QRE, where action probability is proportional to the exponential of action value. We will show in the experimental section that our method successfully converges to the Pareto-optimal equilibrium (C, C), even when a higher probability is initially assigned to action (A, A).
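To make this failure mode concrete, the following minimal sketch (our own illustration, not the authors' code) runs alternating per-agent updates on the Figure 1 game from the initial policy of Figure 1b. With α = 0 the update is a greedy best response to the expected reward, a crude stand-in for methods that optimize the standard objective (Equation 1); with α > 0 it is the entropy-regularized soft best response used in the exact calculation of Appendix B. The function names and the choice α = 10 are illustrative.

```python
import numpy as np

R = np.array([[  5, -20, -20],
              [-20,  10, -20],
              [-20, -20,  20]], dtype=float)   # rows: agent 1, cols: agent 2

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sequential_updates(alpha, iters=200):
    """Alternating per-agent updates on the matrix game.
    alpha == 0 -> greedy best response to the expected reward (standard objective);
    alpha  > 0 -> entropy-regularized (soft) best response, pi^i proportional to exp(q^i / alpha)."""
    p1 = np.array([0.6, 0.2, 0.2])   # initial individual policies from Figure 1b
    p2 = np.array([0.6, 0.2, 0.2])
    for _ in range(iters):
        q1 = R @ p2                  # agent 1's expected reward per action
        p1 = np.eye(3)[q1.argmax()] if alpha == 0 else softmax(q1 / alpha)
        q2 = R.T @ p1                # agent 2's expected reward per action
        p2 = np.eye(3)[q2.argmax()] if alpha == 0 else softmax(q2 / alpha)
    return p1, p2

print(sequential_updates(alpha=0))    # stays on the suboptimal NE (A, A)
print(sequential_updates(alpha=10))   # probability mass shifts to the optimal action C
```

The output of the α = 10 run is consistent with the exact calculation reported in Appendix B (Table 3).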
4 MAXIMUM ENTROPY HETEROGENEOUS-AGENT REINFORCEMENT LEARNING

In this section, we establish maximum entropy heterogeneous-agent reinforcement learning, a framework for learning stochastic policies in cooperative MARL settings, which alleviates the suboptimal convergence problem mentioned in Section 3.2. We name it heterogeneous-agent (HA) because it is a substantial improvement to HARL (45), its Proposition 1 builds upon the prior advantage decomposition lemma (13), and it is generally applicable to HA settings. The framework encompasses three key components: the derivation of the MaxEnt MARL objective from a probabilistic inference perspective in Section 4.1, the HASPI procedure and HASAC algorithm in Section 4.2, and the unified MEHAML template for theoretically-justified algorithmic design in Section 4.3.

4.1 MAXIMUM ENTROPY MULTI-AGENT REINFORCEMENT LEARNING

Figure 2: The probabilistic graphical model for cooperative MARL.

To formalize the idea of learning stochastic policies π(·|s_t), we embed the cooperative MARL problem into the PGM (Figure 2). Following Levine (17), we introduce an additional optimality variable O_t, which takes on binary values indicating the optimality of the joint actions taken by all agents. Specifically, O_t = 1 denotes that time step t is optimal, and we model it as p(O_t = 1 | s_t, a_t) ∝ exp(r(s_t, a_t)). Then we use structured variational inference to approximate the posterior distribution

p(τ | O_{1:T} = 1) ∝ [ p(s_1) ∏_{t=1}^{T} p(s_{t+1} | s_t, a_t) ] exp( Σ_{t=1}^{T} r(s_t, a_t) )

over trajectories τ with the distribution q(τ) = q(s_1) ∏_{t=1}^{T} q(s_{t+1} | s_t, a_t) q(a_t | s_t), where we fix the environment dynamics q(s_1) = p(s_1) and q(s_{t+1} | s_t, a_t) = p(s_{t+1} | s_t, a_t) to avoid risk-seeking behaviors (17). This inference procedure leads to the maximum entropy (MaxEnt) objective of MARL (see Appendix C):

J(π) = E_{s_{1:T}∼ρ^{1:T}_π, a_{1:T}∼π} [ Σ_{t=1}^{T} ( r(s_t, a_t) + α Σ_{i=1}^{n} H( π^i(·|s_t) ) ) ],   (2)

where α is the temperature constant that trades off between reward and entropy maximization; when α = 0, the objective reduces to standard MARL. Augmenting the standard MARL objective with an entropy term (Equation 2) aligns with the solution concept of quantal response equilibrium (QRE) proposed by McKelvey & Palfrey (20), which is a generalization of the standard notion of Nash equilibrium (NE) in game theory. In a QRE, payoffs are perturbed by additive disturbances (the entropy term in our case), so that players do not deterministically choose the strategy with the highest observed payoff, but rather assign probability mass to their strategies according to every strategy's payoff (6). The following Theorem 1 shows that the QRE policies of the MaxEnt objective can be represented as Boltzmann distributions:

Theorem 1 (Representation of QRE). A joint policy π_QRE ∈ Π is a QRE if none of the agents can increase the maximum entropy objective (Equation 2) by unilaterally altering its policy, i.e.,

∀i ∈ N, ∀π^i ∈ Π^i,  J(π^i, π^{-i}_QRE) ≤ J(π_QRE).

Then the QRE policies are given by

∀i ∈ N,  π^i_QRE(a^i|s) := exp( α^{-1} E_{a^{-i}∼π^{-i}_QRE}[ Q_{π_QRE}(s, a^i, a^{-i}) ] ) / Σ_{b^i∈A^i} exp( α^{-1} E_{a^{-i}∼π^{-i}_QRE}[ Q_{π_QRE}(s, b^i, a^{-i}) ] ),   (3)

where the soft Q-functions are defined as

Q_π(s, a) = r(s, a) + E_{a_{1:∞}∼π, s_{1:∞}∼P} [ Σ_{t=1}^{∞} γ^t ( r(s_t, a_t) + α Σ_{i=1}^{n} H( π^i(·|s_t) ) ) | s_0 = s, a_0 = a ].   (4)

Proof can be found in Appendix D. Theorem 1 illustrates the connection between QRE policies and an energy-based model, where the term (1/α) Q_{π_QRE}(s, a) serves as the negative energy. When α > 0, the QRE policies (Equation 3) are no longer deterministic but rather can represent all the ways of performing a task. This suggests that the inclusion of the entropy term makes the policy π_QRE stochastic, enabling more effective exploration of the environment (21), which is consistent with our initial goal of learning stochastic policies in MARL settings.
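The following sketch (ours, for illustration only) checks Theorem 1 numerically on the single-state game of Figure 1, where the soft Q-function reduces to the immediate reward: it iterates the Boltzmann form of Equation 3 to a fixed point and verifies that no unilateral deviation increases the MaxEnt objective. The value α = 10 and the random deviations are arbitrary choices.

```python
import numpy as np

# Per-step payoff of the single-state game from Figure 1; in this one-shot setting
# the soft Q-function reduces to the immediate reward (an assumption made here for
# illustration only).
R = np.array([[  5, -20, -20],
              [-20,  10, -20],
              [-20, -20,  20]], dtype=float)
alpha = 10.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def entropy(p):
    return -(p * np.log(p)).sum()

def J(p1, p2):
    """Single-state MaxEnt objective: expected reward plus alpha times both entropies."""
    return p1 @ R @ p2 + alpha * (entropy(p1) + entropy(p2))

# Fixed-point iteration of Equation 3: each policy is the Boltzmann distribution of
# its expected soft Q under the other agent's current policy.
p1 = p2 = np.full(3, 1 / 3)
for _ in range(500):
    p1 = softmax((R @ p2) / alpha)
    p2 = softmax((R.T @ p1) / alpha)

# At the QRE, no unilateral deviation should improve the MaxEnt objective.
rng = np.random.default_rng(0)
deviations = [softmax(rng.normal(size=3)) for _ in range(1000)]
assert all(J(q, p2) <= J(p1, p2) + 1e-6 for q in deviations)
print("QRE policy of agent 1:", p1.round(3))
```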
4.2 HETEROGENEOUS-AGENT SOFT ACTOR-CRITIC

In this subsection, we develop heterogeneous-agent soft policy iteration (HASPI) to maximize the MaxEnt objective (Equation 2), which alternates between joint soft policy evaluation and heterogeneous-agent soft policy improvement, and then derive HASAC from this theory.

4.2.1 HETEROGENEOUS-AGENT SOFT POLICY ITERATION

In the joint soft policy evaluation step of HASPI, we compute the soft Q-value from any Q(s, a) : S × A → R by repeatedly applying a soft Bellman backup operator Γ^π given by

Γ^π Q(s, a) ≜ r(s, a) + γ E_{s'∼P} [ V(s') ],   (5)

where

V(s) = E_{a∼π} [ Q(s, a) ] + α Σ_{i=1}^{n} H( π^i(·^i|s) )   (6)

is the soft value function. We can obtain the soft Q-function of any joint policy π as shown in Lemma 4.1. Notably, the same method for updating the soft Q-function has been proposed in FOP (44), since it is the straightforward application of the soft Bellman equation.

Lemma 4.1 (Joint Soft Policy Evaluation). Consider the soft Bellman backup operator Γ^π and a mapping Q_0 : S × A → R with |A| < ∞, and define Q_{k+1} = Γ^π Q_k. Then the sequence Q_k converges to the joint soft Q-function Q_π as k → ∞.

Proof can be found in Appendix E.1. In the policy improvement step, we show that the joint policy π can be updated based on individual policy updates. To this end, we first introduce the following definition.

Definition 1. Let i_{1:m} = {i_1, . . . , i_m} ⊆ N be an ordered subset of agents, and let −i_{1:m} refer to its complement. We write i_k when we refer to the kth agent in the ordered subset. Correspondingly, the multi-agent soft Q-function is defined as

Q^{i_{1:m}}_π(s, a^{i_{1:m}}) ≜ E_{a^{-i_{1:m}}∼π^{-i_{1:m}}} [ Q_π(s, a^{i_{1:m}}, a^{-i_{1:m}}) + α Σ_{i∈-i_{1:m}} H( π^i(·^i|s) ) ].   (7)

In the case where m = n, Q^{i_{1:n}}_π(s, a^{i_{1:n}}) takes the form Q_π(s, a), representing the joint soft Q-function. When m = 0, i.e., i_{1:m} = ∅, the function represents the soft value function V_π(s). With this notation defined, we introduce a pivotal proposition showing that the joint soft policy update can be decomposed into a multiplication of sequential local policy updates.

Proposition 1 (Joint Soft Policy Decomposition). Let π be a joint policy, and i_{1:n} ∈ Sym(n) be an agent permutation. Suppose that, for each state s and every m = 1, . . . , n,

π^{i_m}_new = argmin_{π^{i_m}∈Π^{i_m}} D_KL( π^{i_m}(·^{i_m}|s) ∥ exp( E_{a^{i_{1:m-1}}∼π^{i_{1:m-1}}_new} [ α^{-1} Q^{i_{1:m}}_{π_old}(s, a^{i_{1:m-1}}, ·^{i_m}) ] ) / E_{a^{i_{1:m-1}}∼π^{i_{1:m-1}}_new} [ Z_{π_old}(s, a^{i_{1:m-1}}) ] ),   (8)

where a^{i_{1:m-1}} is drawn from the policy π^{i_{1:m-1}}_new(·|s) and the partition function Z_{π_old}(s, a^{i_{1:m-1}}) normalizes the distribution. Then the joint policy satisfies

π_new = argmin_{π∈Π} D_KL( π(·|s) ∥ exp( α^{-1} Q_{π_old}(s, ·) ) / Z_{π_old}(s) ),

with Z_{π_old}(s) the corresponding normalizer. Proof can be found in Appendix E.2. Proposition 1 holds significance due to the crucial insight it provides: a MaxEnt MARL problem can be considered as a sum of n MaxEnt RL problems. It indicates an effective heterogeneous-agent approach to improving the joint soft policy in multi-agent learning, where each agent sequentially optimizes an individual KL-divergence, leading to the optimization of the joint soft policy.
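As a concrete, entirely illustrative instance of this procedure, the tabular sketch below alternates joint soft policy evaluation (Equations 5-6) with the sequential improvement of Proposition 1 on a randomly generated two-agent Markov game. In the tabular case the minimizer of Equation 8 is simply a Boltzmann policy over the expected joint soft Q-function, and the constant entropy bonus of the not-yet-updated agents cancels in the softmax. The game, sizes, and constants are assumptions made for this example, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 3, 2, 0.95, 0.5              # illustrative sizes and constants
R = rng.normal(size=(nS, nA, nA))                   # R[s, a1, a2]: joint reward
P = rng.dirichlet(np.ones(nS), size=(nS, nA, nA))   # P[s, a1, a2, s']: transition kernel

def entropy(p, axis=-1):
    return -(p * np.log(p)).sum(axis=axis)

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def soft_policy_evaluation(pi1, pi2, iters=200):
    """Joint soft policy evaluation: repeated soft Bellman backups (Equations 5-6)."""
    Q = np.zeros((nS, nA, nA))
    for _ in range(iters):
        V = (np.einsum('sij,si,sj->s', Q, pi1, pi2)
             + alpha * (entropy(pi1) + entropy(pi2)))   # Equation 6
        Q = R + gamma * P @ V                            # Equation 5
    return Q

def haspi_step(pi1, pi2):
    """One HASPI iteration: evaluation, then sequential soft improvement (Equation 8).
    Earlier agents in the permutation are already switched to their new policies."""
    Q = soft_policy_evaluation(pi1, pi2)
    order = rng.permutation(2)                           # random agent permutation
    new = [pi1.copy(), pi2.copy()]
    for m, i in enumerate(order):
        already_updated = order[:m]
        if i == 0:
            partner = new[1] if 1 in already_updated else pi2
            q_i = np.einsum('sij,sj->si', Q, partner)    # E_{a^2}[Q(s, ., a^2)]
        else:
            partner = new[0] if 0 in already_updated else pi1
            q_i = np.einsum('sij,si->sj', Q, partner)    # E_{a^1}[Q(s, a^1, .)]
        new[i] = softmax(q_i / alpha)                    # tabular minimizer of Equation 8
    return new

pi1 = pi2 = np.full((nS, nA), 1 / nA)
for _ in range(50):
    pi1, pi2 = haspi_step(pi1, pi2)
print(pi1.round(3), pi2.round(3), sep='\n')
```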
To formally extend the above process into a policy improvement procedure with theoretical guarantees of monotonic improvement and convergence to QRE, we propose heterogeneous-agent soft policy improvement, formalized below.

Lemma 4.2 (Heterogeneous-Agent Soft Policy Improvement). Let i_{1:n} ∈ Sym(n) be an agent permutation, and for every m = 1, . . . , n, let policy π^{i_m}_old ∈ Π^{i_m} and let π^{i_m}_new be the optimizer of the minimization problem defined in Equation 8. Then Q_{π_new}(s, a) ≥ Q_{π_old}(s, a) for all (s, a) ∈ S × A with |A| < ∞, and J(π_new) ≥ J(π_old).

Proof can be found in Appendix E.3. Lemma 4.2 guarantees that the soft Q-function and the MaxEnt objective monotonically increase at each policy improvement step. Next, we propose heterogeneous-agent soft policy iteration, which alternates between joint soft policy evaluation and heterogeneous-agent soft policy improvement, and prove that the joint policy π converges to a QRE.

Theorem 2 (Heterogeneous-Agent Soft Policy Iteration). For any joint policy π ∈ Π, if we repeatedly apply joint soft policy evaluation and heterogeneous-agent soft policy improvement starting from π^i ∈ Π^i, then the joint policy π = ∏_{i=1}^{n} π^i converges to π_QRE in Theorem 1.

Proof can be found in Appendix E.4. To obtain these theoretical results, the updating approach in Proposition 1 plays a pivotal role. Lemma 4.2 ensures that, through the updates in Proposition 1, the soft Q-function increases monotonically, leading to convergence of the policies. Eventually, none of the agents is motivated to make an update (Equation 8) at convergence, thereby establishing a QRE.

4.2.2 PRACTICAL ALGORITHM

In practice, large continuous domains require us to derive a practical approximation to the procedure above. We use function approximators for both the centralized soft Q-function Q_θ(s_t, a_t) and tractable decentralized policies π^{i_m}_{φ^{i_m}}(a^{i_m}_t | s_t) for each agent i_m, parameterized respectively by θ and φ^{i_m}, and alternate between optimizing both networks with stochastic gradient descent. The centralized soft Q-function parameters can be trained to minimize the soft Bellman residual

J_Q(θ) = E_{(s_t, a_t)∼D} [ ½ ( Q_θ(s_t, a_t) − ( r(s_t, a_t) + γ E_{s_{t+1}∼P} [ V_θ̄(s_{t+1}) ] ) )² ],   (9)

where the soft value function is implicitly parameterized through the soft Q-function parameters via Equation 6. Then we draw a permutation i_{1:n} ∈ Sym(n) and sequentially update the policy of each agent i_m according to Equation 8. The policy parameters can be learned by directly minimizing the expected KL-divergence in Equation 8, disregarding the constant log-partition function:

J_{π^{i_m}}(φ^{i_m}) = E_{s_t∼D, a^{i_{1:m-1}}_t∼π^{i_{1:m-1}}_{φ^{i_{1:m-1}}_new}, a^{i_m}_t∼π^{i_m}_{φ^{i_m}}} [ α log π^{i_m}_{φ^{i_m}}(a^{i_m}_t | s_t) − Q^{i_{1:m}}_{π_old; θ}(s_t, a^{i_{1:m-1}}_t, a^{i_m}_t) ].   (10)

We refer to the above procedure as HASAC; see Appendix F for its full pseudocode.
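Below is a simplified PyTorch sketch of how Equations 9 and 10 might be implemented. The network architectures, the tanh-Gaussian actor, the replay-buffer fields, and the treatment of not-yet-updated agents (whose actions are drawn from their current policies and detached; an actual implementation may instead reuse buffer actions) are our own assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Centralized soft Q-function Q_theta(s, a^1, ..., a^n)."""
    def __init__(self, state_dim, joint_act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s, joint_a):
        return self.net(torch.cat([s, joint_a], dim=-1)).squeeze(-1)

class Actor(nn.Module):
    """Decentralized tanh-Gaussian policy pi^i_phi(a^i | s)."""
    def __init__(self, state_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu, self.log_std = nn.Linear(hidden, act_dim), nn.Linear(hidden, act_dim)

    def sample(self, s):
        h = self.body(s)
        dist = torch.distributions.Normal(self.mu(h), self.log_std(h).clamp(-5, 2).exp())
        u = dist.rsample()                                   # reparameterized sample
        a = torch.tanh(u)
        logp = (dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1)
        return a, logp

def critic_loss(critic, target_v, batch, gamma):
    """Equation 9: soft Bellman residual against a (target) value estimate."""
    target = batch["r"] + gamma * (1 - batch["done"]) * target_v(batch["s_next"])
    return 0.5 * (critic(batch["s"], batch["joint_a"]) - target.detach()).pow(2).mean()

def sequential_actor_update(actors, optimizers, critic, s, alpha):
    """Equation 10: update agents in a random order; earlier agents' fresh actions
    are fed (detached) into the centralized soft Q when later agents update."""
    n = len(actors)
    actions = [None] * n
    for i in torch.randperm(n).tolist():
        a_i, logp_i = actors[i].sample(s)
        joint = [actions[j] if actions[j] is not None else actors[j].sample(s)[0].detach()
                 for j in range(n)]
        joint[i] = a_i                                       # gradient flows only through agent i
        loss = (alpha * logp_i - critic(s, torch.cat(joint, dim=-1))).mean()
        optimizers[i].zero_grad(); loss.backward(); optimizers[i].step()
        actions[i] = a_i.detach()                            # reused by later agents
```

For discrete action spaces the Gaussian head would be replaced by the Gumbel-Softmax relaxation mentioned in Section 5.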
4.3 MAXIMUM ENTROPY HETEROGENEOUS-AGENT MIRROR LEARNING

In addition to updating policies by directly minimizing the KL-divergence in Equation 8, we aim to propose a generalized HASPI procedure that provides a range of solutions to the MaxEnt MARL problem. To this end, we start by introducing the necessary definitions of the operators proposed in HAML (15): the drift functional (HADF) D^i_π(ˆπ^i | s, π^{j_{1:m}}), which, intuitively, is a notion of distance between π^i and ˆπ^i, given that agents j_{1:m} just updated to π^{j_{1:m}}; the neighborhood operator U^i_π(π^i), which forms a region around the policy π^i; as well as a sampling distribution β_π ∈ P(S) that is continuous in π (see detailed definitions in Appendix A.3). As shown in (15; 14), these operators allow for effective abstraction of standard RL and MARL methods due to their generality. We present the generalized HASPI using these operators. In the heterogeneous-agent soft policy improvement step, we redefine the operator that agents optimize as follows.

Definition 2. Let i ∈ N, j_{1:m} ∈ 𝒫(−i), and let D^i be a HADF of agent i. The maximum entropy heterogeneous-agent mirror operator (MEHAMO) integrates the soft Q-function as

[ M^{D^i, π^{j_{1:m}}}_{(ˆπ^i)} V_π ](s) ≜ E_{a^{j_{1:m}}∼π^{j_{1:m}}, a^i∼ˆπ^i} [ Q^{j_{1:m}, i}_π(s, a^{j_{1:m}}, a^i) − α log ˆπ^i(a^i|s) ] − D^i_π( ˆπ^i | s, π^{j_{1:m}} ).

Then we propose the MEHAML template (Algorithm 1) to formalize generalized heterogeneous-agent soft policy iteration. Notably, HAML is a special instance of our template when α = 0.

Algorithm 1: Maximum Entropy Heterogeneous-Agent Mirror Learning
  Initialise a joint policy π_0 = (π^1_0, . . . , π^n_0);
  for k = 0, 1, . . . do
    // Joint Soft Policy Evaluation
    Compute the soft Q-function Q_{π_k}(s, a) for all state-(joint)action pairs (s, a);
    Update Q_{π_k}(s, a) according to the soft Bellman equation (Equation 5);
    // (Generalized) Heterogeneous-Agent Soft Policy Improvement
    Draw a permutation i_{1:n} of agents at random;
    for m = 1 : n do
      Make an update π^{i_m}_{k+1} = argmax_{π^{i_m} ∈ U^{i_m}_{π_k}(π^{i_m}_k)} E_{s∼β_{π_k}} [ ( M^{D^{i_m}, π^{i_{1:m-1}}_{k+1}}_{(π^{i_m})} V_{π_k} )(s) ];
    end
  end
  Output: A limit-point joint policy π

Despite the presence of a drift penalty and a neighborhood constraint, optimizing the MEHAMO sequentially is sufficient to guarantee the same desired properties as HASPI. We establish the complete list of core MEHAML properties in Theorem 3, which confirms that any method derived from Algorithm 1 has the desired properties of monotonic improvement of the MaxEnt objective and QRE convergence (the detailed proof can be found in Appendix G).

Theorem 3 (The Core Theorem of MEHAML). Let π_0 ∈ Π, and let the sequence of joint policies (π_k)_{k=0}^{∞} be obtained by a MEHAML algorithm (Algorithm 1) induced by D^i, U^i, ∀i ∈ N, and β_π. Then the joint policies induced by the algorithm enjoy the following properties:
(1) they attain the monotonic improvement property, J(π_{k+1}) ≥ J(π_k);
(2) their value functions converge to a quantal response value function, lim_{k→∞} V_{π_k} = V^QRE;
(3) their expected returns converge to a quantal response return, lim_{k→∞} J(π_k) = J^QRE;
(4) their ω-limit set consists of quantal response equilibria.

Proof sketch. We divide the proof into four steps. In Step 1, we show that the sequence of soft value functions (V_{π_k})_{k∈N} increases monotonically and converges to a limit point. In Step 2, we show the existence of limit points π̄ of (π_k)_{k∈N}, and that they are fixed points of the MEHAML update. In the most important Step 3, we prove that π̄ is also a fixed point of soft policy iteration by leveraging the concavity of the MEHAMO. Step 4 finalizes the proof, showing that fixed points of soft policy iteration are QRE policies.

By selecting appropriate HADFs and neighborhood operators that satisfy the definitions, Algorithm 1 has the potential to generate various theoretically-justified algorithms for solving the MaxEnt MARL problem. The drifts D^i_π(ˆπ^i | s, π^{j_{1:m}}) can serve as soft constraints, such as a KL-divergence, controlling the distance between ˆπ^i and π^i when the agents j_{1:m} have just updated to π^{j_{1:m}}. Additionally, the neighborhood operators U^i can generate small policy-space subsets, serving as hard constraints; the resultant policy improvement then remains within a small range due to the fact that π^i ∈ U^i_π(π^i), ∀i ∈ N, ∀π^i ∈ Π^i. Therefore, algorithms equipped with appropriate HADFs and neighborhoods can learn stochastic policies in a stable and coordinated manner. In summary, MEHAML provides a template for generating theoretically sound, stable, monotonically improving algorithms that enable agents to learn stochastic policies to solve multi-agent cooperation tasks.
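To convey the shape of Algorithm 1, the sketch below instantiates the MEHAML loop on the one-shot matrix game of Figure 1 with a KL drift functional and a total-variation neighborhood, approximating the inner argmax by a coarse search along the segment towards the soft target. This is a deliberately crude illustration of the template's structure rather than a faithful optimizer over a valid neighborhood operator; all constants are arbitrary.

```python
import numpy as np

R = np.array([[  5, -20, -20],
              [-20,  10, -20],
              [-20, -20,  20]], dtype=float)      # one-shot game, so Q_pi = R here
alpha, drift_coef, tv_radius = 10.0, 1.0, 0.3     # illustrative constants

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def kl(p, q):
    return (p * np.log(p / q)).sum()

def mehamo(pi_hat, q_values, pi_old):
    """E_{a~pi_hat}[Q - alpha * log pi_hat] - D(pi_hat | pi_old), with a KL drift."""
    return pi_hat @ (q_values - alpha * np.log(pi_hat)) - drift_coef * kl(pi_hat, pi_old)

def improve(pi_old, q_values):
    """Approximate argmax of the MEHAMO over a TV-ball around pi_old by searching
    along the segment towards the soft target (a crude neighborhood/optimizer)."""
    target = softmax(q_values / alpha)
    candidates = [(1 - t) * pi_old + t * target for t in np.linspace(0, 1, 101)]
    candidates = [p for p in candidates if 0.5 * np.abs(p - pi_old).sum() <= tv_radius]
    return max(candidates, key=lambda p: mehamo(p, q_values, pi_old))

rng = np.random.default_rng(0)
pi = [np.array([0.6, 0.2, 0.2]), np.array([0.6, 0.2, 0.2])]   # initial policies (Figure 1b)
for k in range(100):
    for i in rng.permutation(2):                  # random permutation, sequential updates
        j = 1 - i
        q_i = R @ pi[j] if i == 0 else R.T @ pi[j]   # E_{a^j ~ pi^j}[Q(s, a^i, a^j)]
        pi[i] = improve(pi[i], q_i)
print(np.round(pi[0], 3), np.round(pi[1], 3))
```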
5 EXPERIMENTS

To demonstrate the advantages of the stochastic policies learned by HASAC, we conduct comprehensive experiments on the matrix game from Section 3.2 and six benchmarks: MAMuJoCo (4), Bi-DexHands (2), SMAC (25), GRF (16), MPE (19), and LAG (23).² It is important to note that while HASAC is originally designed for continuous actions, we employ a Gumbel-Softmax (10) to ensure that HASAC also works for discrete actions. Compared to existing SOTA MAPG methods, HASAC achieves the best performance in 31 out of 35 tasks across all benchmarks. In addition to final performance, experimental results (see full experimental details and hyperparameters in Appendix H) show the following advantages of stochastic policies: (1) HASAC exhibits similar performance across different random seeds and higher training stability, (2) HASAC shows higher learning speed compared to existing algorithms, and (3) HASAC improves agents' exploration, which facilitates policies escaping from suboptimal equilibria and converging towards a higher-reward equilibrium.

²HASAC achieves the best performance on fully cooperative MPE and LAG tasks. See Appendix H.2.5 and H.2.6 for full results.

Figure 3: Performance comparisons on selected tasks of multiple benchmarks. Panels: (a) Matrix Game, (b) Catch Abreast, (c) Two Catch Underarm, (d) Lift Underarm, (e) 3x2-Agent HalfCheetah, (f) 6x1-Agent Walker2d, (g) RPS (2 agents), (h) Corner (11 agents).

Table 1: Median evaluation winning rate and standard deviation on eight SMAC maps for different methods. All values within 1 standard deviation of the maximum score rate are marked in bold.

| Map | Difficulty | HASAC | HAPPO | HATRPO | MAPPO | QMIX | FOP | Timesteps |
|---|---|---|---|---|---|---|---|---|
| 8m_vs_9m | Hard | 97.5(1.2) | 83.8(4.1) | 92.5(3.7) | 87.5(4.0) | 92.2(1.0) | 23.4(1.6) | 1e7 |
| 5m_vs_6m | Hard | 90.0(3.9) | 77.5(7.2) | 75.0(6.5) | 75.0(18.2) | 77.3(3.3) | 46.9(5.3) | 1e7 |
| 3s5z | Hard | 100.0(0.0) | 97.5(1.2) | 93.8(1.2) | 96.9(0.7) | 89.8(2.5) | 93.0(0.8) | 1e7 |
| 10m_vs_11m | Hard | 95.0(3.1) | 87.5(6.7) | 98.8(0.6) | 96.9(4.8) | 95.3(2.2) | 12.5(6.2) | 1e7 |
| MMM2 | Super Hard | 97.5(2.4) | 88.8(2.0) | 97.5(6.4) | 93.8(4.7) | 87.5(2.5) | 37.5(28.1) | 2e7 |
| 3s5z_vs_3s6z | Super Hard | 82.5(4.1) | 66.2(3.1) | 72.5(14.7) | 70.0(10.7) | 87.5(12.6) | 0.0(0.0) | 2e7 |
| corridor | Super Hard | 90.0(10.8) | 92.5(13.9) | 88.8(2.7) | 97.5(1.2) | 82.8(4.4) | 0.0(0.0) | 2e7 |
| 6h_vs_8z | Super Hard | 95.0(3.1) | 76.2(3.1) | 78.8(0.6) | 85.0(2.0) | 7.0(27.0) | 0.0(0.0) | 4e7 |

5.1 EXPERIMENTAL RESULTS

Matrix Game. We show the results of HASAC, HAPPO, and MAPPO over 200 learning episodes in the matrix game (Figure 1a). HASAC escapes the local optimum and achieves the Pareto optimum due to the learned stochastic policies, while the other two methods fall into the suboptimal NE (Figure 3a).

Bi-DexHands. Bi-DexHands offers numerous bimanual manipulation tasks that match various human skill levels. In the challenging Catch Abreast, Two Catch Underarm, and Lift Underarm tasks (Figures 3b, 3c, and 3d), all on-policy methods fail within 10M steps due to low sample efficiency. HATD3 (45), on the other hand, exhibits very high variability and prematurely converges to local optima. In contrast, HASAC outperforms the other five methods by a large margin. It also exhibits higher convergence speed and improved robustness, demonstrating the benefits of stochastic policies.

Multi-Agent MuJoCo. We compare our method to HAPPO, MAPPO, and HATD3 in ten MAMuJoCo tasks (see Appendix H.2.2). Figures 3e and 3f, together with the other results in Appendix H.2.2, demonstrate that HASAC enjoys superior performance over the three rivals both in terms of reward values and learning speed. It is worth noting that, although both HATD3 and HASAC are off-policy algorithms, we generally observe that HASAC learns faster with higher sample efficiency than HATD3. This is because the exploration mechanism of HATD3 involves adding Gaussian noise to a deterministic policy, resulting in lower exploration efficiency and requiring more samples to explore the reward landscape. In contrast, HASAC, due to the inherent encouragement of exploration within stochastic policies, can effectively explore the entire reward landscape and thus learns better behaviors rapidly.
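To make the preceding comparison concrete, the short sketch below contrasts the two action-selection schemes and also shows the Gumbel-Softmax relaxation mentioned at the start of this section for discrete actions. The helper names are ours (the Actor with a .sample method is the one sketched after Section 4.2.2), and this is an illustration rather than the benchmarked implementation.

```python
import torch
import torch.nn.functional as F

def td3_style_action(det_policy, s, noise_std=0.1, low=-1.0, high=1.0):
    """Deterministic policy + fixed Gaussian noise (HATD3-style exploration)."""
    a = det_policy(s)
    return (a + noise_std * torch.randn_like(a)).clamp(low, high)

def sac_style_action(actor, s):
    """Sampling from a learned stochastic (tanh-Gaussian) policy, as in HASAC."""
    a, _ = actor.sample(s)
    return a

def gumbel_softmax_action(logits_net, s, temperature=1.0, hard=True):
    """Differentiable relaxation of a categorical policy for discrete action spaces;
    hard=True gives a one-hot action in the forward pass with soft gradients."""
    return F.gumbel_softmax(logits_net(s), tau=temperature, hard=hard)
```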
StarCraft Multi-Agent Challenge. We evaluate our method on four hard and four super-hard maps. As shown in Table 1, HASAC achieves over 90% win rates on 7 out of 8 maps and outperforms other strong baselines on most maps. Notably, on particularly challenging tasks such as 5m_vs_6m, 3s5z_vs_3s6z, and 6h_vs_8z, we observe that HAPPO and HATRPO converge towards suboptimal NEs. In addition, FOP is unable to learn a meaningful joint policy on the super-hard tasks due to its reliance on the IGO assumption. By contrast, HASAC consistently achieves superior performance and shows the ability to identify higher-reward equilibria due to its extensive exploration. We also observe that HASAC has better stability and higher learning speed across most maps.

Google Research Football. We compare HASAC with QMIX, MAPPO, and HAPPO. As shown in Figures 3g and 3h, we generally observe that both MAPPO and HAPPO tend to converge to a non-optimal NE on the two challenging tasks, while HASAC attains a higher-reward equilibrium by learning stochastic policies, which effectively enhance exploration.

Figure 4: Performance comparison between HASAC variants with different hyperparameters on the Ant-v2 4x2 task and the matrix game. Panels: (a) stochastic vs. deterministic, (b) effect of different α on Ant 4x2, (c) effect of different α on the matrix game. (a) The comparison indicates that stochastic policies can lead to a better equilibrium and stabilize training. (b) & (c) HASAC converges to different QREs with different temperature parameters α.

Figure 5: Different temperatures α lead to convergence to different QREs. The panels show the joint policies HASAC converges to in the matrix game for α = 1, 5, 10, and 15: with α = 1 the joint policy stays on the suboptimal action (A, A); with α = 5 it converges to the optimal action (C, C) almost deterministically; with α = 10 it places about 0.91 probability on (C, C); and with α = 15 about 0.56, with the remaining mass spread over the other joint actions.

5.2 ABLATION STUDY

We investigate the benefits of the stochastic policies learned by HASAC and empirically show how different temperature values α affect the stochasticity of policies, leading to convergence to different QREs.

Stochasticity. HASAC learns stochastic policies by maximizing the MaxEnt objective (Equation 2). We compare it to a deterministic variant which utilizes deterministic policies with fixed Gaussian exploration noise to maximize the standard MARL objective. The results in Figure 4a show that HASAC achieves a higher-reward equilibrium and demonstrates better stability compared to the deterministic variant, which exhibits high variance across different runs. These findings highlight the importance of learning stochastic policies, which can improve robustness, facilitate escape from suboptimal equilibria, and converge to a higher-reward equilibrium.

Analysis of temperature α. We further show the effect of the temperature α on the stochasticity of policies. As illustrated in Figures 4b (we report α⁻¹ in the legend), 4c, and 5, when α is large, policies predominantly emphasize maximizing entropy, leading to poor performance due to the failure to exploit the reward signal. Conversely, when α is small, the MaxEnt objective almost degrades to the standard objective, leading to a suboptimal equilibrium due to inadequate exploration. A proper α achieves a trade-off between exploration and exploitation, eventually resulting in better performance. To obtain an appropriate α, we implement both fixed and auto-tuned (see Appendix H.1 for details) α in practice.
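Appendix H.1 describes the automatic temperature adjustment; the sketch below follows the standard SAC-style scheme (9) of tuning α so that the policy entropy tracks a target value, with one tuner per agent. The class, the per-agent arrangement, and the target entropy of −dim(A^i) are our assumptions and may differ from the paper's exact implementation.

```python
import torch

class TemperatureTuner:
    def __init__(self, target_entropy, lr=3e-4):
        self.target_entropy = target_entropy          # e.g. -dim(A^i) for continuous actions
        self.log_alpha = torch.zeros(1, requires_grad=True)
        self.opt = torch.optim.Adam([self.log_alpha], lr=lr)

    @property
    def alpha(self):
        return self.log_alpha.exp()

    def update(self, log_prob_batch):
        """log_prob_batch: log pi^i(a^i|s) of freshly sampled actions (gradients not needed)."""
        loss = -(self.log_alpha * (log_prob_batch.detach() + self.target_entropy)).mean()
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        return loss.item()

# Example: one tuner per agent, updated alongside each actor's gradient step
# (the action dimensions below are placeholders).
tuners = [TemperatureTuner(target_entropy=-act_dim) for act_dim in (4, 4, 6)]
```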
6 CONCLUSION

In this paper, we propose maximum entropy heterogeneous-agent reinforcement learning (MEHARL), a unified framework for learning stochastic policies in MARL. This framework comprises three key components: the PGM derivation of MaxEnt MARL, the HASAC algorithm with monotonic improvement and QRE convergence properties, and the unified MEHAML template that provides any induced MaxEnt method with the same theoretical guarantees as HASAC. To demonstrate the advantages of stochastic policies, we evaluate HASAC on both discrete and continuous control tasks, confirming its superior performance and improved robustness and exploration. For future work, we aim to explore appropriate drift functionals and neighborhood operators to design more principled MaxEnt MARL algorithms that can further enhance performance and stability in multi-agent cooperation tasks.

Acknowledgements. This work is sponsored by the National Natural Science Foundation of China (62376013) and the Beijing Municipal Science & Technology Commission (Z231100007423015).

REFERENCES

[1] Lawrence M Ausubel and Raymond J Deneckere. A generalized theorem of the maximum. Economic Theory, 3(1):99-107, 1993.
[2] Yuanpei Chen, Tianhao Wu, Shengjie Wang, Xidong Feng, Jiechuang Jiang, Stephen Marcus McAleer, Yiran Geng, Hao Dong, Zongqing Lu, Song-Chun Zhu, and Yaodong Yang. Towards human-level bimanual dexterous manipulation with reinforcement learning, 2022.
[3] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, 1998(746-752):2, 1998.
[4] Christian Schroeder de Witt, Bei Peng, Pierre-Alexandre Kamienny, Philip Torr, Wendelin Böhmer, and Shimon Whiteson. Deep multi-agent reinforcement learning for decentralized continuous cooperative control. arXiv preprint arXiv:2003.06709, 19, 2020.
[5] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[6] Jacob K Goeree, Charles A Holt, and Thomas R Palfrey. Stochastic game theory for social science: A primer on quantal response equilibrium. In Handbook of Experimental Game Theory, pp. 8-47. Edward Elgar Publishing, 2020.
[7] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352-1361. PMLR, 2017.
[8] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018.
[9] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
[10] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. ar Xiv preprint ar Xiv:1611.01144, 2016. [11] TS Kozitsina, IV Kozitsin, and IS Menshikov. Quantal response equilibrium for the prisoner s dilemma game in markov strategies. Scientific reports, 12(1):4482, 2022. [12] Jakub Grudzien Kuba, Muning Wen, Linghui Meng, Haifeng Zhang, David Mguni, Jun Wang, Yaodong Yang, et al. Settling the variance of multi-agent policy gradients. Advances in Neural Information Processing Systems, 34:13458 13470, 2021. [13] Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, and Yaodong Yang. Trust region policy optimisation in multi-agent reinforcement learning, 2022. [14] Jakub Grudzien Kuba, Christian Schroeder de Witt, and Jakob Foerster. Mirror learning: A unifying framework of policy optimisation, 2022. [15] Jakub Grudzien Kuba, Xidong Feng, Shiyao Ding, Hao Dong, Jun Wang, and Yaodong Yang. Heterogeneous-agent mirror learning: A continuum of solutions to cooperative marl. ar Xiv preprint ar Xiv:2208.01682, 2022. [16] Karol Kurach, Anton Raichuk, Piotr Sta nczyk, Michał Zaj ac, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, et al. Google research football: A novel reinforcement learning environment. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 4501 4510, 2020. Published as a conference paper at ICLR 2024 [17] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018. [18] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pp. 157 163. Elsevier, 1994. [19] Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, Open AI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017. [20] Richard D Mc Kelvey and Thomas R Palfrey. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6 38, 1995. [21] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. Advances in neural information processing systems, 30, 2017. [22] John Nash. Non-cooperative games. Annals of mathematics, pp. 286 295, 1951. [23] Xiaoteng Ma Qihan Liu, Yuhua Jiang. Light aircraft game: A lightweight, scalable, gymwrapped aircraft competitive environment with baseline reinforcement learning algorithms. https://github.com/liuqh16/Close Air Combat, 2022. [24] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning, 2018. [25] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder De Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. ar Xiv preprint ar Xiv:1902.04043, 2019. [26] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning, 2019. [27] Jianyu Su, Stephen Adams, and Peter A. Beling. Value-decomposition multi-agent actor-critics, 2020. [28] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. 
URL http://incompleteideas.net/book/ the-book-2nd.html. [29] Brian Swenson, Ryan Murray, and Soummya Kar. On best-response dynamics in potential games. SIAM Journal on Control and Optimization, 56(4):2734 2767, 2018. [30] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pp. 330 337, 1993. [31] J Terry, Benjamin Black, Nathaniel Grammel, Mario Jayakumar, Ananth Hari, Ryan Sullivan, Luis S Santos, Clemens Dieffendahl, Caroline Horsch, Rodrigo Perez-Vicente, et al. Pettingzoo: Gym for multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 34:15032 15043, 2021. [32] Jiangxing Wang, Deheng Ye, and Zongqing Lu. More centralized training, still decentralized execution: Multi-agent conditional policy factorization, 2023. [33] Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. Qplex: Duplex dueling multi-agent q-learning, 2021. [34] Xihuai Wang, Zheng Tian, Ziyu Wan, Ying Wen, Jun Wang, and Weinan Zhang. Order matters: Agent-by-agent policy optimization. ar Xiv preprint ar Xiv:2302.06205, 2023. [35] Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, and Chongjie Zhang. {DOP}: Offpolicy multi-agent decomposed policy gradients. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=6Fq Ki VAd I3Y. Published as a conference paper at ICLR 2024 [36] Ermo Wei, Drew Wicke, David Freelan, and Sean Luke. Multiagent soft q-learning. ar Xiv preprint ar Xiv:1804.09817, 2018. [37] Ying Wen, Yaodong Yang, Rui Luo, and Jun Wang. Modelling bounded rationality in multiagent interactions by generalized recursive reasoning. ar Xiv preprint ar Xiv:1901.09216, 2019. [38] Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, and Wei Pan. Probabilistic recursive reasoning for multi-agent reinforcement learning. ar Xiv preprint ar Xiv:1901.09207, 2019. [39] Zifan Wu, Chao Yu, Deheng Ye, Junge Zhang, Hankz Hankui Zhuo, et al. Coordinated proximal policy optimization. Advances in Neural Information Processing Systems, 34:26437 26448, 2021. [40] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International conference on machine learning, pp. 5571 5580. PMLR, 2018. [41] Yaodong Yang, Jianye Hao, Ben Liao, Kun Shao, Guangyong Chen, Wulong Liu, and Hongyao Tang. Qatten: A general framework for cooperative multiagent reinforcement learning, 2020. [42] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611 24624, 2022. [43] Haifeng Zhang, Weizhe Chen, Zeren Huang, Minne Li, Yaodong Yang, Weinan Zhang, and Jun Wang. Bi-level actor-critic for multi-agent coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7325 7332, 2020. [44] Tianhao Zhang, Yueheng Li, Chen Wang, Guangming Xie, and Zongqing Lu. Fop: Factorizing optimal joint policy of maximum-entropy multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 12491 12500. PMLR, 2021. [45] Yifan Zhong, Jakub Grudzien Kuba, Siyi Hu, Jiaming Ji, and Yaodong Yang. Heterogeneousagent reinforcement learning, 2023. [46] Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010. 
Table of Contents
A Preliminaries
  A.1 Table of Acronyms
  A.2 Individual-Global-Optimal
  A.3 Definitions and Assumptions
  A.4 Proofs of Useful Lemmas
B Exact Calculation of Matrix Game
C Derivation of MaxEnt MARL Objective
D Proofs of Representation of QRE
E Proofs of Heterogeneous-Agent Soft Policy Iteration
  E.1 Proof of Joint Soft Policy Evaluation
  E.2 Proof of Joint Soft Policy Decomposition
  E.3 Proof of Heterogeneous-Agent Soft Policy Improvement
  E.4 Proof of Heterogeneous-Agent Soft Policy Iteration
F Pseudocode of HASAC
G Proofs of The Core Theorem of MEHAML
H Experimental Details
  H.1 Automatically Adjusting Temperature
  H.2 Experimental Setup and Additional Results
  H.3 Hyper-parameter Settings for Experiments
I Ablation Study on Sequential Updates

A PRELIMINARIES

A.1 TABLE OF ACRONYMS

Table 2 lists the main acronyms used in this paper.

Table 2: The acronyms used in this paper.

| Acronym | Meaning |
|---|---|
| A2PO | Agent-by-agent Policy Optimization |
| CoPPO | Coordinated Proximal Policy Optimization |
| CTDE | Centralized training decentralized execution |
| FOP | Factorizing Optimal Joint Policy |
| GRF | Google Research Football |
| HA | Heterogeneous-Agent |
| HADF | Heterogeneous-agent drift functional |
| HAML | Heterogeneous-Agent Mirror Learning |
| HAPPO | Heterogeneous-Agent Proximal Policy Optimization |
| HARL | Heterogeneous-Agent Reinforcement Learning |
| HASAC | Heterogeneous-Agent Soft Actor-Critic |
| HASPI | Heterogeneous-agent soft policy iteration |
| HATD3 | Heterogeneous-Agent Twin Delayed Deep Deterministic Policy Gradient |
| HATRPO | Heterogeneous-Agent Trust Region Policy Optimization |
| IGO | Individual-Global-Optimal |
| LAG | Light Aircraft Game |
| MACPF | Multi-Agent Conditional Policy Factorization |
| MADDPG | Multi-Agent Deep Deterministic Policy Gradient |
| MAMuJoCo | Multi-Agent MuJoCo |
| MAPG | Multi-agent policy gradient |
| MAPPO | Multi-Agent Proximal Policy Optimization |
| MARL | Multi-agent reinforcement learning |
| MASQL | Multi-Agent Soft Q-Learning |
| MaxEnt | Maximum entropy |
| MEHAML | Maximum Entropy Heterogeneous-Agent Mirror Learning |
| MEHAMO | Maximum Entropy Heterogeneous-Agent Mirror Operator |
| MEHARL | Maximum Entropy Heterogeneous-Agent Reinforcement Learning |
| MPE | Multi-Agent Particle Environment |
| NE | Nash equilibrium |
| PGM | Probabilistic Graphical Model |
| PPO | Proximal Policy Optimization |
| QRE | Quantal response equilibrium |
| RL | Reinforcement learning |
| RPS | Run pass and shoot with keeper |
| SAC | Soft Actor-Critic |
| SMAC | StarCraft Multi-Agent Challenge |
| SQL | Soft Q-Learning |

A.2 INDIVIDUAL-GLOBAL-OPTIMAL

The Individual-Global-Optimal (IGO) assumption is introduced in (44) and can be stated as below.

Definition 3 (IGO). For an optimal joint policy π*(a|s) : S × A → [0, 1], if there exist individual optimal policies [π^{i*}(a^i|s) : S × A^i → [0, 1]]_{i=1}^{n} such that

π*(a|s) = ∏_{i=1}^{n} π^{i*}(a^i|s),

then we say that [π^{i*}] satisfy IGO for π* under s.
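As a quick illustration of when IGO fails (our own example, not from the paper), consider a one-shot anti-coordination game: the optimum of the MaxEnt objective over unconstrained joint policies is the Boltzmann distribution exp(R/α)/Z, which does not factorize into a product of individual policies, and every product policy attains a strictly lower MaxEnt value. The payoff matrix and α = 1 are arbitrary choices.

```python
import numpy as np
from itertools import product

R = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # anti-coordination payoff
alpha = 1.0

# Optimum of the MaxEnt objective over *unconstrained* joint policies:
# pi*(a1, a2) proportional to exp(R(a1, a2) / alpha).
joint = np.exp(R / alpha)
joint /= joint.sum()

# If IGO held, pi* would factorize into its marginals.
p1, p2 = joint.sum(axis=1), joint.sum(axis=0)
print("optimal joint:\n", joint.round(3))
print("product of marginals:\n", np.outer(p1, p2).round(3))
print("factorizes:", np.allclose(joint, np.outer(p1, p2)))   # False here

# The best *product* policy found by grid search is strictly worse on the MaxEnt value.
def maxent_value(pi):
    return (pi * R).sum() + alpha * -(pi * np.log(pi + 1e-12)).sum()

grid = np.linspace(0.01, 0.99, 99)
best_product = max((np.outer([p, 1 - p], [q, 1 - q]) for p, q in product(grid, grid)),
                   key=maxent_value)
print("MaxEnt value, joint vs. best product:",
      round(maxent_value(joint), 3), round(maxent_value(best_product), 3))
```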
Published as a conference paper at ICLR 2024 The main limitation of IGO arises from the fact that during training, each agent s policy is updated based solely on its local Q-function, thereby ignoring the actions of other agents. As discussed by Wang et al. (32), when multiple agents need to collaborate in decision-making, they may fail to reach a consensus, updating their policies in the direction that increases their local Q-functions, but eventually resulting in a decrease in the global Q-function. A.3 DEFINITIONS AND ASSUMPTIONS Throughout the proofs, we make the following regularity assumption introduced by Kuba et al. (13): Assumption 1. There exists η R, such that 0 < η 1, and for every agent i N, the policy space Πi is η-soft; that means that for every πi Πi, s S, and ai Ai, we have πi ai|s η. In the following, we provide the essential definitions of the two key components, originally proposed by Kuba et al. (15), that serve as the building blocks of the MEHAML framework. Additionally, we present the definitions of soft advantage function and a notion of distance that will be utilized in the proof of Lemma A.2. Definition 4. Let i N, a heterogeneous-agent drift functional (HADF) Di of i consists of a map, which is defined as Di : Π Π P( i) S Di π |s, πj1:m : P Ai R , such that for all arguments, under notation Di π ˆπi|s, πj1:m Di π ˆπi i|s |s, πj1:m , 1. Di π ˆπi|s, πj1:m Di π πi|s, πj1:m = 0 (non-negativity), 2. Di π ˆπi|s, πj1:m has all Gâteaux derivatives zero at ˆπi = πi (zero gradient), We say that the HADF is positive if Di π ˆπi| πj1:m = 0, s S implies ˆπi = πi, and trivial if Di π ˆπi| πj1:m = 0, s S for all π, πj1:m, and ˆπi. Definition 5. Let i N. We say that, Ui : Π Πi P Πi is a neighborhood operator if πi Πi, Ui π πi contains a closed ball, i.e., there exists a state-wise monotonically non-decreasing metric χ : Πi Πi R such that πi Πi there exists δi > 0 such that χ πi, πi δi = πi Ui π πi . Definition 6. Let i1:m = {i1, . . . , im} N and j1:k = {j1, . . . , jk} N be disjoint ordered subsets of agents, and let Qi1:m π s, ai1:m be the multi-agent soft Q-function. Then the multi-agent soft advantage function is defined as Ai1:m π s, aj1:k, ai1:m Qj1:k,i1:m π s, aj1:k, ai1:m Qj1:k π s, aj1:k . (11) Definition 7. Let X be a finite set and p : X R, q : X R be two maps. Then, the notion of distance between p and q that we adopt is given by p q maxx X |p(x) q(x)|. A.4 PROOFS OF USEFUL LEMMAS Lemma A.1 (Multi-Agent Advantage Decomposition). Let π be a joint policy, and i1, . . . , im be an arbitrary ordered subset of agents. Then, for any state s and joint action ai1:m, Ai1:m π s, ai1:m = j=1 A ij π s, ai1:j 1, aij . (12) Proof. (The lemma is proposed in (12) and we quote the proof from Kuba et al. (12)) By the definition of multi-agent soft advantage function, Ai1:m π s, ai1:m = Qi1:m π s, ai1:m Vπ(s) h Qi1:j π s, ai1:j Qi1:j 1 π s, ai1:j 1 i = j=1 Aij π s, ai1:j 1, aij , which finishes the proof. Published as a conference paper at ICLR 2024 The continuity of Qπ is a crucial requirement for proving Theorem 3 later. Now we first prove that the inclusion of an additional entropy term does not affect the continuity of Qπ, which is the state-action value function in the single-agent setting, where π denotes the policy of a single agent. And finally, we generalize the result to Qπ in MARL. Lemma A.2 (Continuity of Qπ). Let π be a policy. Then Qπ(s, a) is continuous in π. Proof. Let π and ˆπ be two policies. 
Then we have |Qπ(s, a) Qˆπ(s, a)| r(s, a) + γ X s P (s |s, a) a π(a |s )Qπ(s , a ) α X a π (a |s ) log π (a |s ) r(s, a) + γ X s P (s |s, a) a ˆπ(a |s )Qˆπ(s , a ) α X a ˆπ (a |s ) log ˆπ (a |s ) s P (s |s, a) a [π(a |s )Qπ(s , a ) ˆπ(a |s )Qˆπ(s , a )] a [π (a |s ) log π (a |s ) ˆπ (a |s ) log ˆπ (a |s )] s P (s |s, a) a |π(a |s )Qπ(s , a ) ˆπ(a |s )Qˆπ(s , a )| a |π (a |s ) log π (a |s ) ˆπ (a |s ) log ˆπ (a |s )| s P (s |s, a) a |π(a |s )Qπ(s , a ) ˆπ(a |s )Qπ(s , a ) + ˆπ(a |s )Qπ(s , a ) ˆπ(a |s )Qˆπ(s , a )| a |(π (a |s ) ˆπ (a |s )) log π (a |s ) + ˆπ (a |s ) (log π (a |s ) log ˆπ (a |s ))| s P (s |s, a) a (|π(a |s )Qπ(s , a ) ˆπ(a |s )Qπ(s , a )| + |ˆπ(a |s )Qπ(s , a ) ˆπ(a |s )Qˆπ(s , a )|) a (|π (a |s ) ˆπ (a |s )| |log π (a |s )| + |ˆπ (a |s )| |log π (a |s ) log ˆπ (a |s )|) s P (s |s, a) a |π(a |s ) ˆπ(a |s )| |Qπ(s , a )| + X a ˆπ(a |s ) |Qπ(s , a ) Qˆπ(s , a )| a (|π (a |s ) ˆπ (a |s )| |log π (a |s )| + |ˆπ (a |s )| |log π (a |s ) log ˆπ (a |s )|) s P (s |s, a) a π ˆπ Qmax + X a ˆπ(a |s ) Qπ Qˆπ a ( π ˆπ logmax π + ˆπ (a |s ) log π log ˆπ ) γQmax |A| π ˆπ + γ Qπ Qˆπ + αγ logmax π |A| π ˆπ + αγ log π log ˆπ Hence, we get Qπ Qˆπ γQmax |A| π ˆπ +γ Qπ Qˆπ +αγ logmax π |A| π ˆπ +αγ log π log ˆπ , Published as a conference paper at ICLR 2024 which implies Qπ Qˆπ γ |A| (Qmax + α logmax π) π ˆπ + αγ log π log ˆπ By continuity of π and log π, for any arbitrary ϵ > 0, we can find δ1 > 0 such that π ˆπ < δ1 implies π ˆπ < (1 γ)ϵ 2γ |A| (Qmax+α logmax π) and δ2 > 0 such that π ˆπ < δ2 implies log π log ˆπ < (1 γ)ϵ 2αγ . Taking δ = min (δ1, δ2) , when π ˆπ < δ we get Qπ Qˆπ < ϵ, which finishes the proof. Corollary 1. From Lemma A.2 we obtain that the following functions are continuous in π : (1) the state value function Vπ(s) = P a (π(a|s)Qπ(s, a) π(a|s) log π(a|s)), (2) the advantage function Aπ(s, a) = Qπ(s, a) Vπ(s), (3) and the expected total reward J(π) = Es ρ0 [Vπ(s)]. Corollary 2 (Continuity in MARL). All the results about continuity in π extend to MARL. Policy π can be replaced with joint policy π; as π is Lipschitz-continuous in agent i s policy πi, the above continuity results extend to continuity in πi. Thus, we will quote them in our proofs for MARL. Published as a conference paper at ICLR 2024 B EXACT CALCULATION OF MATRIX GAME In this section, we provide an exact calculation of the matrix game in Section 3.2. The initial policies of the matrix game represent the scenario where due to the exploration and learning in the early stages, agents may have explored a locally optimal solution (in this example action (A, A)) and assigned a high probability to it. The matrix game aims to show that both MAPPO and HAPPO will converge towards this local optimum in expectation, while HASAC, owing to its optimization of the maximum entropy objective, has the potential to escape this local optimum and converge towards a superior solution. We start by analyzing MAPPO. For both parameter-sharing and non-parameter-sharing versions of MAPPO, the policy is optimizing the following objective: πi new = arg max πi Es ρπold,a πold πi old(ai|s)Aπold(s, a) , i = 1, 2 (13) Note that we omit ratio-clipping in this tabular case, as this is simpler and does not affect the final result. In this 2-agent single-state matrix game, s ρπold can be ignored as there is only one state. 
Value function Vπold(s) can be calculated exactly: Vπold(s) = Ea πold [Qπold(s, a)] = Ea πold [r(s, a)] = 0.36 5 + 0.04 10 + 0.04 20 + 4 0.12 ( 20) + 2 0.04 ( 20) = 8.2 Similarly, the advantage function Aπold(s, a) can be calculated exactly as shown in Figure 6. -11.8 -11.8 13.2 A -11.8 18.2 -11.8 B 28.2 -11.8 -11.8 C Figure 6: Advantage values Aπold(s, a) for all joint actions in the first iteration. Expanding objective 13, we get πi new = arg max πi (0.36 13.2 + 0.12 ( 11.8) 2) πi(ai = A|s) + (0.12 ( 11.8) + 0.04 18.2 + 0.04 ( 11.8)) πi(ai = B|s) + (0.12 ( 11.8) + 0.04 ( 11.8) + 0.04 (28.2)) πi(ai = C|s) 0.2 = arg max πi 3.2 πi(ai = A|s) + ( 5.8) πi(ai = B|s) + ( 3.8) πi(ai = C|s) This will encourage πi to assign a higher probability to action A and lower probabilities to action B and C. The updated policies will further encourage choosing action A in subsequent iterations. Finally, the policies will converge to the deterministic policies πi = (1, 0, 0), i = 1, 2. For HAPPO, the agents update their policies sequentially. Without loss of generality, we assume they update in the order of Agent 1 and Agent 2. Agent 1 updates its policy according to objective 13 and increases the probability of action A. Let π1 new = (p1, p2, p3), s.t. p1 > 0.6, p2 < 0.2, p3 < 0.2, P3 j=1 pj = 1. Agent 2 updates its policy according to the following objective: Published as a conference paper at ICLR 2024 π2 new = arg max π2 Es ρπold,a πold π2 old(a2|s) π1 new(a1|s) π1 old(a1|s) Aπold(s, a) = arg max π2 (0.6 p1 13.2 + 0.6 p2 ( 11.8) + 0.6 p3 ( 11.8)) π2(a2 = A|s) + (0.2 p1 ( 11.8) + 0.2 p2 18.2 + 0.2 p3 ( 11.8)) π2(a2 = B|s) + (0.2 p1 ( 11.8) + 0.2 p2 ( 11.8) + 0.2 p3 (28.2)) π2(a2 = C|s) 0.2 = arg max π2 (13.2 p1 11.8 p2 11.8 p3) π2(a2 = A|s) ( 11.8 p1 + 18.2 p2 11.8 p3) π2(a2 = B|s) ( 11.8 p1 11.8 p2 + 28.2 p3) π2(a2 = C|s) Since (13.2 p1 11.8 p2 11.8 p3) > 3.2, ( 11.8 p1 + 18.2 p2 11.8 p3) < 5.8, ( 11.8 p1 11.8 p2 +28.2 p3) < 3.8, this will encourage assigning higher probability to action A and lower probability to action B and C. Similar to MAPPO, the updated policies will further encourage choosing action A in subsequent iterations. Finally, the policies converge to πi = (1, 0, 0), i = 1, 2. By contrast, HASAC is possible to escape the local optimum and converge to better stochastic policies, as we show below. Without loss of generality, we assume the update order is Agent 1 and Agent 2. We first consider the update of Agent 1, who optimizes the following objective: π1 new = arg max π1 Ea1 π1,a2 π2 old Qπold(s, a1, a2) α log π1(a1|s) = arg max π1 Ea1 π1,a2 π2 old r(s, a1, a2) α log π1(a1|s) Parametrizing π1 = (p1, p2, p3), we get π1 new = arg max π1 5p1 14p2 12p3 αp1 log p1 αp2 log p2 αp3 log p3, s.t. p1 + p2 + p3 = 1 The solutions corresponding to different α are listed in the left part of Table 3. After the first update Convergent (both) α p1 p2 p3 p1 p2 p3 1 0.9990 0.0001 0.0009 1.0000 0.0000 0.0000 2 0.9603 0.0107 0.0290 1.0000 0.0000 0.0000 5 0.7083 0.1171 0.1747 0.9849 0.0075 0.0076 10 0.5254 0.2136 0.2609 0.0221 0.0224 0.9555 15 0.4596 0.2522 0.2882 0.1278 0.1354 0.7368 20 0.4269 0.2722 0.3009 0.2514 0.2790 0.4697 Table 3: Exact calculation results for HASAC update. On the left we show the policy after the first update; on the right we show the policies of π1 and π2 after convergence. As α increases, the policy is encouraged to be stochastic, trading off between stochasticity and reward maximization. 
With appropriate $\alpha$, the policies are able to assign higher probabilities to actions B and C, thereby retaining exploration of them. This encourages more exploration of actions B and C in subsequent updates. Finally, the convergent policies predominantly choose action C while still being stochastic, as shown in the right part of Table 3. Thus, we show the benefit of stochastic policies and that, in expectation, with appropriate $\alpha$, HASAC is capable of escaping the local optimum and converging to better stochastic policies. We note that the exact calculation results are generally consistent with the empirical results in Section 5.2 (this can easily be checked by computing the joint action probabilities and comparing with Table 5). One observation is that when $\alpha = 5$, the temperature is theoretically not sufficient for escaping the local optimum, but empirically it already results in learning the optimal actions (C, C). This may be because, in the practical implementation, HASAC learns a centralized Q-function by sampling from the replay buffer, which introduces randomness that causes slight deviations between the experimental results and the exact calculations.

C DERIVATION OF MAXENT MARL OBJECTIVE

In this section, we derive the maximum entropy objective of MARL (Equation 2) by performing variational inference in the probabilistic graphical model of Figure 2. In structured variational inference, our objective is to approximate some distribution $p(y)$ with another, usually simpler distribution $q(y)$. In our case, we aim to approximate the trajectory distribution
$$ p(\tau) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\, \exp\!\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big) \tag{14} $$
via the distribution
$$ q(\tau) = q(s_1) \prod_{t=1}^{T} q(s_{t+1}|s_t, a_t)\, q(a_t|s_t) = q(s_1) \prod_{t=1}^{T} q(s_{t+1}|s_t, a_t) \prod_{m=1}^{n} q^{i_m}\!\big(a^{i_m}_t|s_t\big), \tag{15} $$
where the joint policy $q(a_t|s_t)$ follows a fully independent factorization $q(a_t|s_t) = \prod_{m=1}^{n} q^{i_m}(a^{i_m}_t|s_t)$, as we assume that agents are independent of each other under the CTDE paradigm. To avoid risk-seeking behavior, we fix the environment dynamics, i.e., $q(s_1) = p(s_1)$ and $q(s_{t+1}|s_t, a_t) = p(s_{t+1}|s_t, a_t)$.

In structured variational inference, approximate inference is performed by optimizing the variational lower bound (also called the evidence lower bound). In our case, the evidence is that $\mathcal O_t = 1$ for all $t \in \{1, \dots, T\}$, and the posterior is conditioned on the initial state $s_1$. The variational lower bound is given by
$$
\begin{aligned}
\log p(\mathcal O_{1:T}) &= \log \int\!\!\int p(\mathcal O_{1:T}, s_{1:T}, a_{1:T})\, ds_{1:T}\, da_{1:T} \\
&= \log \int\!\!\int p(\mathcal O_{1:T}, s_{1:T}, a_{1:T})\, \frac{q(s_{1:T}, a_{1:T})}{q(s_{1:T}, a_{1:T})}\, ds_{1:T}\, da_{1:T} \\
&= \log \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim q(s_{1:T}, a_{1:T})}\!\left[\frac{p(\mathcal O_{1:T}, s_{1:T}, a_{1:T})}{q(s_{1:T}, a_{1:T})}\right] \\
&\ge \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim q(s_{1:T}, a_{1:T})}\big[\log p(\mathcal O_{1:T}, s_{1:T}, a_{1:T}) - \log q(s_{1:T}, a_{1:T})\big],
\end{aligned}
$$
where the inequality on the last line is obtained via Jensen's inequality. Substituting the definitions of $p(\tau)$ and $q(\tau)$ from Equations 14 and 15, and using $q(s_{t+1}|s_t, a_t) = p(s_{t+1}|s_t, a_t)$, the bound reduces to
$$ \log p(\mathcal O_{1:T}) \ge \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim q(s_{1:T}, a_{1:T})}\!\left[\sum_{t=1}^{T}\Big(r(s_t, a_t) - \sum_{m=1}^{n} \log q^{i_m}\!\big(a^{i_m}_t|s_t\big)\Big)\right]. $$
Optimizing this lower bound with respect to the policy $q(a_t|s_t)$ corresponds exactly to the following maximum entropy objective:
$$ J(\pi) = \mathbb{E}_{s_{1:T}\sim\rho^{1:T}_{\pi},\, a_{1:T}\sim\pi}\!\left[\sum_{t=1}^{T}\Big(r(s_t, a_t) + \alpha \sum_{i=1}^{n} \mathcal{H}\big(\pi^i(\cdot|s_t)\big)\Big)\right], $$
where we multiply the entropy term by a temperature parameter $\alpha$ to assign relative importance to entropy and reward maximization (9).
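The factorization in Equation 15 is what turns the log-policy term in the lower bound into a sum of per-agent entropies: for independent factors, $\mathcal H\big(q(\cdot|s)\big) = \sum_{m=1}^{n} \mathcal H\big(q^{i_m}(\cdot|s)\big)$. The following minimal numerical check illustrates this identity for two hypothetical per-agent action distributions (the probability values are arbitrary and purely illustrative):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p)))

# Two independent per-agent policies at some state s (illustrative numbers).
q1 = np.array([0.6, 0.2, 0.2])
q2 = np.array([0.1, 0.7, 0.2])

# Joint policy under the fully independent factorization q(a|s) = q1(a1|s) * q2(a2|s).
q_joint = np.outer(q1, q2)

print(entropy(q_joint.ravel()))      # entropy of the joint policy
print(entropy(q1) + entropy(q2))     # sum of per-agent entropies -- identical value
```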
D PROOFS OF REPRESENTATION OF QRE

Theorem 1 (Representation of QRE). A joint policy $\pi_{\mathrm{QRE}} \in \Pi$ is a QRE if none of the agents can increase the maximum entropy objective (Equation 2) by unilaterally altering its policy, i.e.,
$$ \forall i \in \mathcal N,\ \forall \pi^i \in \Pi^i,\quad J\big(\pi^i, \pi^{-i}_{\mathrm{QRE}}\big) \le J\big(\pi_{\mathrm{QRE}}\big). $$
Then the QRE policies are given by
$$ \forall i \in \mathcal N,\quad \pi^i_{\mathrm{QRE}}(a^i|s) := \frac{\exp\!\big(\alpha^{-1}\,\mathbb{E}_{a^{-i}\sim\pi^{-i}_{\mathrm{QRE}}}\big[Q_{\pi_{\mathrm{QRE}}}(s, a^i, a^{-i})\big]\big)}{\sum_{b^i\in\mathcal A^i} \exp\!\big(\alpha^{-1}\,\mathbb{E}_{a^{-i}\sim\pi^{-i}_{\mathrm{QRE}}}\big[Q_{\pi_{\mathrm{QRE}}}(s, b^i, a^{-i})\big]\big)}, \tag{3} $$
where the soft Q-functions are defined as follows,
$$ Q_{\pi}(s, a) = r(s, a) + \mathbb{E}_{a_{1:\infty}\sim\pi,\, s_{1:\infty}\sim P}\!\left[\sum_{t=1}^{\infty} \gamma^t \Big( r(s_t, a_t) + \alpha \sum_{i=1}^{n} \mathcal{H}\big(\pi^i(\cdot|s_t)\big) \Big) \,\Big|\, s_0 = s,\ a_0 = a \right]. $$

Proof. First, we consider the following constrained policy optimization problem for agent $i$: for a given state $s \in \mathcal S$,
$$ \max_{\pi^i}\ \mathbb{E}_{a^i\sim\pi^i,\,a^{-i}\sim\pi^{-i}}\big[Q_{\pi}(s,a)\big] - \alpha \sum_{j=1}^{n}\sum_{a^j\in\mathcal A^j} \pi^j(a^j|s)\log\pi^j(a^j|s), \quad \text{s.t.}\ \sum_{a^i\in\mathcal A^i}\pi^i(a^i|s) = 1. $$
We consider its associated Lagrangian function as follows,
$$
\begin{aligned}
L(\pi^i, \lambda) &= \mathbb{E}_{a^i\sim\pi^i,\,a^{-i}\sim\pi^{-i}}\big[Q_{\pi}(s,a)\big] - \alpha \sum_{j=1}^{n}\sum_{a^j\in\mathcal A^j} \pi^j(a^j|s)\log\pi^j(a^j|s) + \lambda\Big(\sum_{a^i\in\mathcal A^i}\pi^i(a^i|s) - 1\Big) \\
&= \sum_{a}\prod_{j=1}^{n}\pi^j(a^j|s)\, Q_{\pi}(s,a) - \alpha \sum_{j=1}^{n}\sum_{a^j\in\mathcal A^j} \pi^j(a^j|s)\log\pi^j(a^j|s) + \lambda\Big(\sum_{a^i\in\mathcal A^i}\pi^i(a^i|s) - 1\Big).
\end{aligned}
$$
Then, taking the derivative of $L(\pi^i, \lambda)$ with respect to $\pi^i$, we obtain
$$ \frac{\partial L(\pi^i,\lambda)}{\partial \pi^i(a^i|s)} = \sum_{a^{-i}}\prod_{j\ne i}\pi^j(a^j|s)\, Q_{\pi}(s, a^i, a^{-i}) - \alpha\log\pi^i(a^i|s) - \alpha + \lambda. $$
Setting $\frac{\partial L(\pi^i,\lambda)}{\partial \pi^i(a^i|s)} = 0$, we know the optimal policy $\pi^i$ satisfies the following condition,
$$ \pi^i(a^i|s) = \exp\!\big(\alpha^{-1}\,\mathbb{E}_{a^{-i}\sim\pi^{-i}}\big[Q_{\pi}(s, a^i, a^{-i})\big]\big)\cdot \exp\!\Big(\frac{\lambda}{\alpha} - 1\Big). $$
Furthermore, since $\sum_{a^i\in\mathcal A^i}\pi^i(a^i|s) = 1$, we know the optimal Lagrange multiplier $\lambda$ satisfies
$$ \exp\!\Big(1 - \frac{\lambda}{\alpha}\Big) = \sum_{a^i\in\mathcal A^i}\exp\!\big(\alpha^{-1}\,\mathbb{E}_{a^{-i}\sim\pi^{-i}}\big[Q_{\pi}(s, a^i, a^{-i})\big]\big), \quad\text{i.e.,}\quad \lambda = \alpha\Big(1 - \log \sum_{a^i\in\mathcal A^i}\exp\!\big(\alpha^{-1}\,\mathbb{E}_{a^{-i}\sim\pi^{-i}}\big[Q_{\pi}(s, a^i, a^{-i})\big]\big)\Big). $$
Finally, we obtain the optimal policy as follows,
$$ \pi^i(a^i|s) = \frac{\exp\!\big(\alpha^{-1}\,\mathbb{E}_{a^{-i}\sim\pi^{-i}}\big[Q_{\pi}(s, a^i, a^{-i})\big]\big)}{\sum_{b^i\in\mathcal A^i}\exp\!\big(\alpha^{-1}\,\mathbb{E}_{a^{-i}\sim\pi^{-i}}\big[Q_{\pi}(s, b^i, a^{-i})\big]\big)}. $$
Hence, when no agent can increase the maximum entropy objective by unilaterally changing its policy, the policies take the following form,
$$ \forall i \in \mathcal N,\quad \pi^i_{\mathrm{QRE}}(a^i|s) = \arg\max_{\pi^i(\cdot|s)}\ \mathbb{E}_{a^i\sim\pi^i,\,a^{-i}\sim\pi^{-i}_{\mathrm{QRE}}}\big[Q_{\pi_{\mathrm{QRE}}}(s,a)\big] - \alpha\sum_{j=1}^{n}\sum_{a^j\in\mathcal A^j}\pi^j(a^j|s)\log\pi^j(a^j|s) = \frac{\exp\!\big(\alpha^{-1}\,\mathbb{E}_{a^{-i}\sim\pi^{-i}_{\mathrm{QRE}}}\big[Q_{\pi_{\mathrm{QRE}}}(s, a^i, a^{-i})\big]\big)}{\sum_{b^i\in\mathcal A^i}\exp\!\big(\alpha^{-1}\,\mathbb{E}_{a^{-i}\sim\pi^{-i}_{\mathrm{QRE}}}\big[Q_{\pi_{\mathrm{QRE}}}(s, b^i, a^{-i})\big]\big)}, $$
which finishes the proof.

E PROOFS OF HETEROGENEOUS-AGENT SOFT POLICY ITERATION

In this section, we introduce heterogeneous-agent soft policy iteration and prove its properties of monotonic improvement and convergence to the QRE policies.

E.1 PROOF OF JOINT SOFT POLICY EVALUATION

For joint soft policy evaluation, we repeatedly apply the soft Bellman operator $\Gamma^{\pi}$ to $Q(s, a)$ until convergence, where
$$ \Gamma^{\pi} Q(s,a) \triangleq r(s,a) + \gamma\,\mathbb{E}_{s'\sim P}\big[V(s')\big], \tag{17} $$
$$ V(s) = \mathbb{E}_{a\sim\pi}\big[Q(s,a)\big] + \alpha\sum_{i=1}^{n}\mathcal{H}\big(\pi^i(\cdot^i|s)\big). $$
In this way, as shown in Lemma 4.1, we can get $Q_{\pi}$ for any joint policy $\pi$.

Lemma 4.1 (Joint Soft Policy Evaluation). Consider the soft Bellman backup operator $\Gamma^{\pi}$ and a mapping $Q_0 : \mathcal S \times \mathcal A \to \mathbb R$ with $|\mathcal A| < \infty$, and define $Q_{k+1} = \Gamma^{\pi} Q_k$. Then the sequence $(Q_k)$ converges to the joint soft Q-function $Q_{\pi}$ as $k \to \infty$.

Proof. We define the entropy-augmented reward as $r_{\pi}(s,a) \triangleq r(s,a) + \gamma\,\mathbb{E}_{s'\sim P}\big[\alpha\sum_{i=1}^{n}\mathcal H\big(\pi^i(\cdot^i|s')\big)\big]$. We can then express the update rule as
$$ Q(s,a) \leftarrow r_{\pi}(s,a) + \gamma\,\mathbb{E}_{s'\sim P,\,a'\sim\pi}\big[Q(s',a')\big] $$
and apply the standard convergence results for policy evaluation (28).

E.2 PROOF OF JOINT SOFT POLICY DECOMPOSITION

After we get $Q_{\pi}(s, a)$, we draw a permutation $i_{1:n} \in \mathrm{Sym}(n)$ and update each agent's policy $\pi^{i_m}$ according to the optimization problem in the following proposition:

Proposition 1 (Joint Soft Policy Decomposition). Let $\pi$ be a joint policy, and $i_{1:n} \in \mathrm{Sym}(n)$ be an agent permutation. Suppose that, for each state $s$ and every $m = 1, \dots, n$,
$$ \pi^{i_m}_{\mathrm{new}} = \arg\min_{\pi^{i_m}\in\Pi^{i_m}} D_{\mathrm{KL}}\!\left( \pi^{i_m}(\cdot^{i_m}|s) \,\Big\|\, \frac{\exp\!\Big(\mathbb{E}_{a^{i_{1:m-1}}\sim\pi^{i_{1:m-1}}_{\mathrm{new}}}\big[\frac{1}{\alpha} Q^{i_{1:m}}_{\pi_{\mathrm{old}}}\big(s, a^{i_{1:m-1}}, \cdot^{i_m}\big)\big]\Big)}{\mathbb{E}_{a^{i_{1:m-1}}\sim\pi^{i_{1:m-1}}_{\mathrm{new}}}\big[Z_{\pi_{\mathrm{old}}}\big(s, a^{i_{1:m-1}}\big)\big]} \right), $$
where $a^{i_{1:m-1}}$ is drawn from the policy $\pi^{i_{1:m-1}}_{\mathrm{new}}(\cdot|s)$ and the partition function $Z_{\pi_{\mathrm{old}}}\big(s, a^{i_{1:m-1}}\big)$ normalizes the distribution.
Then the joint policy satisfies the following: πnew = arg min π Π DKL π ( |s) exp 1 αQπold (s, ) Proof. First, we use Lim πim old (πim( |s)) to denote πim im|s exp Eai1:m 1 π i1:m 1 new 1 αQi1:m πold s, ai1:m 1, im Eai1:m 1 π i1:m 1 new [Zπold (s, ai1:m 1)] Suppose that there exists a policy π = πnew, such that Lπold( π( |s)) < Lπold(πnew( |s)), we have Ea π [Qπold (s, a)] + α i=1 H πi i|s > Ea πnew [Qπold(s, a)] + α i=1 H πi new i|s . (19) From Equation 8, we have Lim πim old (πim new( |s)) Lim πim old ( πim( |s)) for every m = 1, . . . , n, i.e., Eai1:m 1 π i1:m 1 new ,aim πim new Qi1:m πold (s, ai1:m 1, aim) α log πim new aim|s Eai1:m 1 π i1:m 1 new ,aim πim Qi1:m πold (s, ai1:m 1, aim) α log πim aim|s . Published as a conference paper at ICLR 2024 Subtracting both sides of the inequality by Eai1:m 1 π i1:m 1 new h Qi1:m 1 πold (s, ai1:m 1) i gives Eai1:m 1 π i1:m 1 new ,aim πim new Aim πold(s, ai1:m 1, aim) α log πim new aim|s Eai1:m 1 π i1:m 1 new ,aim πim Aim πold(s, ai1:m 1, aim) α log πim aim|s . (20) Combining this with Lemma A.1 gives Aπold (s, a) + α i=1 H πi new i|s # h Eai1:m 1 π i1:m 1 new ,aim πim new Aim πold(s, ai1:m 1, aim) α log πim new aim|s i by Inequality 20 h Eai1:m 1 π i1:m 1 new ,aim πim Aim πold(s, ai1:m 1, aim) α log πim aim|s i Aπold (s, a) + α i=1 H πi i|s # The resulting inequality can be equivalently rewritten as Ea π [Qπold (s, a)] + α i=1 H πi i|s Ea πnew [Qπold (s, a)] + α i=1 H πi new i|s , (21) which contradicts Equation 19. Hence, for all π Π, Lπold(πnew( |s)) Lπold(π( |s)), i.e., πnew = arg min π Π DKL π ( |s) exp 1 αQπold (s, ) which finishes the proof. E.3 PROOF OF HETEROGENEOUS-AGENT SOFT POLICY IMPROVEMENT Then the soft Q-function and joint maximum entropy objective (Equation 2) has monotonic improvement property as formalized below. Lemma 4.2 (Heterogeneous-Agent Soft Policy Improvement). Let i1:n Sym(n) be an agent permutation, and for every m = 1, . . . , n, let policy πim old Πim and πim new be the optimizer of the minimization problem defined in Equation 8. Then Qπnew(s, a) Qπold(s, a) for all (s, a) S A with |A| < and J (πnew) J (πold). Proof. From Proposition 1, we have πnew = arg min π Π DKL π ( |s) exp 1 αQπold (s, ) = arg min π Π Lπold(π( |s)). It must be the case that Lπold(πnew( |s)) Lπold(πold( |s)). Hence Ea πnew [Qπold (s, a)] + α i=1 H πi new i|s Ea πold [Qπold (s, a)] + α i=1 H πi old i|s = Vπold(s). Published as a conference paper at ICLR 2024 Last, considering the soft Bellman equation, the following holds: Qπold (s, a) = r(s, a) + γEs P [Vπold (s )] r(s, a) + γEs P Ea πnew [Qπold (s , a)] + α i=1 H πi new i|s ) (by Inequality 22) ... Qπnew(s, a), (23) where we have repeatedly expanded Qπold on the RHS by applying the soft Bellman equation and the bound in Inequality 22. Convergence to Qπnew follows from Lemma 4.1. We use it to prove the claim as follows, Vπnew (s) = Ea πnew [Qπnew (s, a)] + α i=1 H πi new i|s Ea πnew [Qπold (s, a)] + α i=1 H πi new i|s (by Inequality 23) Ea πold [Qπold (s, a)] + α i=1 H πi old i|s (by Inequality 22) = Vπold(s). Subsequently, the monotonic improvement property of the joint maximum entropy return follows naturally, as J (πnew ) = Es d [Vπnew (s)] Es d [Vπold (s)] = J (πold ) . E.4 PROOF OF HETEROGENEOUS-AGENT SOFT POLICY ITERATION The full heterogeneous-agent soft policy iteration algorithm alternates between the joint soft policy evaluation and the heterogeneous-agent soft policy improvement steps, and it will provably converge to the QRE policies. 
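To make this alternation concrete, the following is a minimal tabular sketch of heterogeneous-agent soft policy iteration on a randomly generated two-agent MDP; all sizes, the random seed, and the temperature below are illustrative choices, not values from the paper. The printed joint maximum entropy return is non-decreasing across iterations, as Lemma 4.2 guarantees, and the policies approach the softmax (QRE) form of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, alpha = 3, 2, 0.95, 0.5   # sizes and temperature are illustrative

# A small random 2-agent MDP: rewards r[s, a1, a2], transitions P[s, a1, a2, s'].
r = rng.uniform(-1.0, 1.0, size=(S, A, A))
P = rng.uniform(size=(S, A, A, S))
P /= P.sum(axis=-1, keepdims=True)
rho0 = np.full(S, 1.0 / S)             # initial state distribution

def entropy(p):
    return -np.sum(p * np.log(p), axis=-1)

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def soft_evaluate(pi1, pi2, iters=2000):
    """Joint soft policy evaluation: repeatedly apply the soft Bellman backup (Eq. 17)."""
    Q = np.zeros((S, A, A))
    for _ in range(iters):
        V = (np.einsum('sij,si,sj->s', Q, pi1, pi2)
             + alpha * (entropy(pi1) + entropy(pi2)))
        Q = r + gamma * np.einsum('sijt,t->sij', P, V)
    V = (np.einsum('sij,si,sj->s', Q, pi1, pi2)
         + alpha * (entropy(pi1) + entropy(pi2)))
    return Q, V

# Start from uniform policies and alternate evaluation / sequential improvement.
pi1 = np.full((S, A), 1.0 / A)
pi2 = np.full((S, A), 1.0 / A)
for it in range(10):
    Q, V = soft_evaluate(pi1, pi2)
    print(f"iteration {it}: J(pi) = {rho0 @ V:.4f}")   # non-decreasing (Lemma 4.2)
    # Heterogeneous-agent soft policy improvement, update order (agent 1, agent 2).
    q1 = np.einsum('sij,sj->si', Q, pi2)   # E_{a2 ~ pi2_old}[Q_old(s, a1, a2)]
    pi1 = softmax(q1 / alpha)              # KL projection onto the softmax target
    q2 = np.einsum('sij,si->sj', Q, pi1)   # E_{a1 ~ pi1_new}[Q_old(s, a1, a2)]
    pi2 = softmax(q2 / alpha)
```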
Theorem 2 (Heterogeneous-Agent Soft Policy Iteration). For any joint policy π Π, if we repeatedly apply joint soft policy evaluation and heterogeneous-agent soft policy improvement from πi Πi. Then the joint policy π = Qn i=1 πi converges to πQRE in Theorem 1. Proof. Let πk be the joint policy at iteration k. First, by Lemma 4.2, we have that Qπk(s, a) Qπk+1(s, a), and that the soft Q-function is upperbounded by Qmax for all π Π (both reward and entropy are bounded). Hence, the sequence converges to some limit point π. Then, considering this limit point joint policy π, it must be the case that i N, πi Πi, Li πi( πi( |s)) Li πi(πi( |s)). And we have πi i|s = arg max πi( i|s) P(Ai) Eai πi Qi π s, ai α log πi ai|s = arg max πi( i|s) P(Ai) Eai πi,a i π i [Q π(s, a)] aj Aj πj aj|s log πj aj|s , i N. Last, following the proof of Theorem 1, we have πi ai|s := exp α 1Ea i π i Q π s, ai, a i P bi Ai exp (α 1Ea i π i [Q π (s, bi, a i)]). Thus, π is a quantal response equilibrium, which finishes the proof. Published as a conference paper at ICLR 2024 F PSEUDOCODE OF HASAC Algorithm 2: Heterogeneous-Agent Soft Actor-Critic Input: temperature α, Polyak coefficient τ, batch size B, number of: agents n, episodes K, steps per episode T, mini-epochs e; Initialize: the critic networks: ϕ1 and ϕ2 and policy networks: θi i N , replay buffer B, Set target parameters equal to main parameters ϕtarg, 1 ϕ1, ϕtarg, 2 ϕ2; for k = 0, 1, . . . , K 1 do Observe state oi t and select action ai t πi θ( |oi t); Execute ai t in the environment; Observe next state ot+1, reward rt; Push transitions oi t, ai t, oi t+1, rt , i N, t T into B; Sample a random minibatch of B transitions from B; Compute the critic targets min i=1,2 Qϕtarg ,i (st+1, at+1) α i=1 log πi θ ai t+1|oi t+1 ! , at+1 πθ ( |st+1) ; Update Q-functions by one step of gradient descent using ϕi = arg min ϕi 1 B t (yt Qϕi (st, at))2 for i = 1, 2; Draw a permutation of agents i1:n at random; for agent im = i1, . . . , in do Update agent im by solving θim new = arg max ˆθim 1 B min i=1,2 Qϕi st, ai1:m 1 θ i1:m 1 new oi1:m 1 t , aim ˆθim oim t , aim+1:n θ im+1:n old α log πim ˆθim aim ˆθim |oim t ! where ai θ(oi t) is a sample from πi θ( |oi t) which is differentiable wrt θ via the reparametrization trick; with e mini-epochs of policy gradient ascent; end Update the target critic network smoothly ϕtarg ,i ρϕtarg ,i + (1 ρ)ϕi for i = 1, 2; end Discard ϕ . Deploy θi i N in execution; Published as a conference paper at ICLR 2024 G PROOFS OF THE CORE THEOREM OF MEHAML First, we show that enhancing the MEHAMO (Definition 2) alone is sufficient to guarantee policy improvement, as demonstrated by the following lemma. Lemma G.1. Let πold and πnew be joint policies and let i1:n Sym(n) be an agent permutation. Suppose that, for every state s S and every m = 1, . . . , n, M (πim new ) Dim ,π i1:m 1 new Vπold (s) M (πim old ) Dim ,π i1:m 1 new Vπold Then, πnew is jointly better than πold , so that for every state s, Vπnew (s) Vπold (s). Subsequently, the monotonic improvement property of the joint return follows naturally, as J (πnew ) = Es d [Vπnew (s)] Es d [Vπold (s)] = J (πold ) . Proof. By Inequality 25, we have Eai1:m 1 π i1:m 1 new ,aim πim new Qi1:m πold (s, ai1:m 1, aim) α log πim new aim|s Dim πold πim new|s, πi1:m 1 new Eai1:m 1 π i1:m 1 new ,aim πim old Qi1:m πold (s, ai1:m 1, aim) α log πim old aim|s Dim πold πim old|s, πi1:m 1 new . 
Subtracting both sides of the inequality by Eai1:m 1 π i1:m 1 new h Qi1:m 1 πold (s, ai1:m 1) i gives Eai1:m 1 π i1:m 1 new ,aim πim new Aim πold(s, ai1:m 1, aim) α log πim new aim|s Dim πold πim new|s, πi1:m 1 new Eai1:m 1 π i1:m 1 new ,aim πim old Aim πold(s, ai1:m 1, aim) α log πim old aim|s Dim πold πim old|s, πi1:m 1 new . Let e Dπold (πnew |s) Pn m=1 Dim πold πim new |s, πi1:m 1 new . Combining this with Lemma A.1 gives Aπold (s, a) + α i=1 H πi new i|s # e Dπold (πnew |s) h Eai1:m 1 π i1:m 1 new ,aim πim new Aim πold(s, ai1:m 1, aim) α log πim new aim|s Dim πold πim new|s, πi1:m 1 new i by Inequality 20 h Eai1:m 1 π i1:m 1 new ,aim πim old Aim πold(s, ai1:m 1, aim) α log πim old aim|s Dim πold πim old|s, πi1:m 1 new i Aπold (s, a) + α i=1 H πi old i|s # e Dπold (πold |s) . Published as a conference paper at ICLR 2024 The resulting inequality can be equivalently rewritten as Ea πnew [Qπold (s, a)] + α i=1 H πi new i|s e Dπold (πnew |s) Ea πold [Qπold (s, a)] + α i=1 H πi old i|s e Dπold (πold |s) , s S. We use it to prove the claim as follows, Vπnew (s) = Ea πnew [Qπnew (s, a)] + α i=1 H πi new i|s = Ea πnew [Qπold (s, a)] + α i=1 H πi new i|s e Dπold (πnew |s) + e Dπold (πnew |s) + Ea πnew [Qπnew (s, a) Qπold (s, a)] , by Inequality 27 Ea πold [Qπold (s, a)] + α i=1 H πi old i|s e Dπold (πold |s) + e Dπold (πnew |s) + Ea πnew [Qπnew (s, a) Qπold (s, a)] , = Vπold (s) + e Dπold (πnew |s) + Ea πnew [Qπnew (s, a) Qπold (s, a)] = Vπold (s) + e Dπold (πnew |s) + Ea πnew ,s P [r(s, a) + γVπnew (s ) r(s, a) γVπold (s )] = Vπold (s) + e Dπold (πnew |s) + γEa πnew ,s P [Vπnew (s ) Vπold (s )] Vπold (s) + γ inf s [Vπnew (s ) Vπold (s )] . Hence Vπnew (s) Vπold (s) γ inf s [Vπnew (s ) Vπold (s )] . Taking infimum over s and simplifying (1 γ) inf s [Vπnew (s) Vπold (s)] 0 Therefore, infs [Vπnew (s) Vπold (s)] 0, which proves the lemma. Then, any algorithm derived from Algorithm 1 ensures that the resulting policies satisfy Condition 25, as demonstrated by the following lemma. Lemma G.2. Suppose an agent im maximizes the expected MEHAMO πim new = arg max πim Uim πold (πim old ) Es βπold Dim,π i1:m 1 new Vπold Then, for every state s S M(πim new ) Dim,π i1:m 1 new Vπold (s) M (πim old ) Dim,π i1:m 1 new Vπold Hence, πnew attains the properties provided by Lemma G.1. Proof. We will prove this statement by contradiction. Suppose that there exists s0 S such that M(πim new ) Dim,π i1:m 1 new Vπold (s0) < M (πim old ) Dim,π i1:m 1 new Vπold Let us define the following policy ˆπim. ˆπim im|s = πim old im|s , at s = s0 πim new im|s , at s = s0 Published as a conference paper at ICLR 2024 Note that ˆπim is (weakly) closer to πim old than πim new at s0, and at the same distance at other states. Together with πim new Uim πold πim old , this implies that ˆπim Uim πold πim old . Further, Dim,π i1:m 1 new Vπold (s) Es βπold M(πim new ) Dim,π i1:m 1 new Vπold = βπold (s0) M(ˆπim) Dim,π i1:m 1 new Vπold (s0) M(πim new ) Dim,π i1:m 1 new Vπold The above contradicts πim new as being the argmax of Equality 28, as ˆπim is strictly better. The contradiction finishes the proof. Next, we prove the most fundamental theorem of MEHAML. Theorem 3 (The Core Theorem of MEHAML). Let π0 Π, and the sequence of joint policies (πk) k=0 be obtained by a MEHAML algorithm 1 induced by Di, Ui, i N, and βπ. 
Then, the joint policies induced by the algorithm enjoy the following list of properties (1) Attain the monotonic improvement property J (πk+1) J (πk), (2) Their value functions converge to a quantal response value function limk Vπk = V QRE, (3) Their expected returns converge to a quantal response return limk J (πk) = JQRE, (4) Their ω-limit set consists of quantal response equilibria. Proof. Proof of Property 1. It follows from combining Lemma G.1 & G.2. Proof of Properties 2, 3 & 4. Step 1: convergence of the value function. By Lemma G.1, we have that Vπk(s) Vπk+1(s), s S, and that the value function is upper-bounded by Vmax. Hence, the sequence of value functions (Vπk)k N converges. We denote its limit by V . Step 2: characterisation of limit points. As the joint policy space Π is bounded, by Bolzano Weierstrass theorem, we know that the sequence (πk)k N has a convergent subsequence. Therefore,it has at least one limit point policy. Let π be such a limit point. We introduce an auxiliary notation: for a joint policy π and a permutation i1:n, let HU (π, i1:n) be a joint policy obtained by a MEHAML update from π along the permutation i1:n. Claim: For any permutation z1:n Sym(n), π = HU ( π, z1:n) Proof of Claim. Let ˆπ = HU ( π, z1:n) = π and (πkr)r N be a subsequence converging to π. Let us recall that the limit value function is unique and denoted as V . Writing Ei0: 1:n [ ] for the expectation operator under the stochastic process ik 1:n k N of update orders, for a state s S, we have 0 = lim r Ei0: 1:n Vπkr+1(s) Vπkr (s) as every choice of permutation improves the value function lim r P ikr 1:n = z1:n VHU(πkr ,z1:n)(s) Vπkr (s) = p (z1:n) lim r VHU(πkr ,z1:n)(s) Vπkr (s) . By the continuity of the expected MEHAMO (following from the continuity of the state-action value function (Lemma A.2), the entropy term, HADFs, neighbourhood operators, and the sampling distribution) we obtain that the first component of HU (πkr, z1:n), which is πz1 kr+1, is continuous in πkr by Berge s Maximum Theorem (1). Applying this argument recursively for z2, . . . , zn, we have that HU (πkr, z1:n) is continuous in πkr.Hence, as πkr converges to π, its HU converges to the HU of π, which is ˆπ. Hence, we continue writing the above derivation as = p (z1:n) [Vˆπ(s) V π(s)] 0, by Lemma G.1. As s was arbitrary, the state-value function of ˆπ is the same as that of π : Vˆπ = V π, by the Bellman equation 5: Q(s, a) = r(s, a) + γE [V (s )], this also implies that their state-action value functions Published as a conference paper at ICLR 2024 are the same: Qˆπ = Q π. Let m be the smallest integer such that ˆπzm = πzm. This means that ˆπzm achieves a greater expected MEHAMO than πzm. Hence, Es β π hh M(ˆπzm) Dzm, πz1:m 1V π i (s) i > Es β π hh M( πzm) Dzm, πz1:m 1V π i (s) i then for some state s, h M(ˆπzm) Dzm, πz1:m 1V π i (s) > h M( πzm) Dzm, πz1:m 1V π i (s) which can be written as Eaz1:m 1 πz1:m 1,azm ˆπzm h Qz1:m π (s, az1:m 1, azm) α log ˆπzm (azm|s) i Dzm π (ˆπzm|s, πz1:m 1) = Eaz1:m 1 πz1:m 1,azm ˆπzm h Qz1:m ˆπ (s, az1:m 1, azm) α log ˆπzm (azm|s) i Dzm π (ˆπzm|s, πz1:m 1) > Eaz1:m 1 πz1:m 1,azm πzm h Qz1:m π (s, az1:m 1, azm) α log πzm (azm|s) i Dzm π ( πzm|s, πz1:m 1) = Eaz1:m 1 πz1:m 1,azm πzm h Qz1:m π (s, az1:m 1, azm) α log πzm (azm|s) i . 
Adding both sides of the inequality by α Pm 1 i=1 H ( πzi ( |s)) and using the equation Vπ(s) = Ea π Qπ(s, a) + α Pn i=1 H πi ( |s) gives Vˆπ(s) = Ea ˆπ Qˆπ(s, a) + α i=1 H ˆπi ( |s) # Qˆπ(s, a) + α i=1 H ˆπi ( |s) # Dzm π (ˆπzm|s, πz1:m 1) Q π(s, a) + α i=1 H πi ( |s) # However, we have Vˆπ = Vπ which yields a contradiction, proving the claim. Step 3: dropping the HADF. Consider an arbitrary limit point joint policy π. By Step 2, for any permutation i1:n, considering the first component of the HU, πi1 = arg max πi1 Ui1 π ( πi1) Es β π M(πi1) Di1 V π πi1 Ui1 π ( πi1) Es β π h Eai1 πi1 h Qi1 π s, ai1 α log πi1 ai1|s i Di1 π πi1|s i πi1 Ui1 π ( πi1) Es β π h Eai1 πi1 h Ai1 π s, ai1 α log πi1 ai1|s i Di1 π πi1|s i . Suppose that there exists a policy π = πi1, and a state s, such that π = arg max πi1 Ui1 π ( πi1) Eai1 πi1 Ai1 π s, ai1 α log πi1 ai1|s , (30) Eai1 π Ai1 π (s, ai1) α log π ai1|s > Eai1 πi1 Ai1 π (s, a) α log πi1 ai1|s which can be written as Eai1 π Ai1 π (s, ai1) + αH π i1|s > αH πi1 i1|s . Published as a conference paper at ICLR 2024 For any policy πi1, consider the canonical parameterisation πi1( i1|s) = x1, . . . , xm 1, 1 Pm 1 i=1 xi , where m is the size of the action space. We have that Eai1 πi1 Ai1 π (s, ai1) = i=1 πi1 ai1 i |s Ai1 π s, ai1 i i=1 xi Ai1 π s, ai1 i + Ai1 π s, ai1 m i=1 xi Ai1 π s, ai1 i Ai1 π s, ai1 m + Ai1 π s, ai1 m . This means that Eai1 πi1 Ai1 π (s, ai1) is an affine function of πi1( i1|s), and thus, its Gâteaux derivatives are constant in P(A) for fixed directions. Hence, we can obtain that Eai1 πi1 Ai1 π (s, ai1) + αH πi1 i1|s is a strict concave function of πi1( i1|s) (following from the affinity of Eai1 πi1 Ai1 π (s, ai1) and the strict concavity of H πi1 i1|s ). Therefore, by combining the Equation 30 and the strict concavity of Eai1 πi1 Ai1 π (s, ai1) + αH πi1 i1|s , Gâteaux derivative of Eai1 πi1 Ai1 π (s, ai1) + αH πi1 i1|s , in the direction from π to π , is strictly positive. Furthermore, the Gâteaux derivatives of Di1 π (πi1|s) are zero at πi1( i1|s) = πi1( i1|s) by its definition (zero gradient). Hence, the Gâteaux derivative of Eai1 πi1 Ai1 π (s, a) + αH πi1 i1|s Di1 π πi1|s is strictly positive. Therefore, for conditional policies ˆπi1( i1|s) sufficiently close to πi1( i1|s) in the direction towards π ( i1|s), we have Eai1 ˆπi1 h Ai1 π (s, ai1) i + αH ˆπi1 i1|s Di1 π (ˆπi1|s) > Eai1 πi1 h Ai1 π (s, ai1) i + αH πi1 i1|s Di1 π ( πi1|s). (31) Let us construct a policy πi1 as follows. For all states y = s, we set πi1( i1|y) = πi1( i1|y). Moreover, for πi1( i1|s) we choose ˆπi1( i1|s) as in Inequality 31, sufficiently close to πi1( i1|s), so that πi1 Ui1 πi1 πi1 . Then, we have Es β π,ai1 πi1 h Ai1 π (s, ai1) i + Es β π h αH πi1 i1|s Di1 π ( πi1|s) i > Es β π,ai1 πi1 h Ai1 π (s, ai1) i + Es β π h αH πi1 i1|s Di1 π ( πi1|s) i , which yields a contradiction. Hence, the assumption was false. Thus, we have proved that, for every state s, πi1 i1|s = arg max πi1 Ui1 π ( πi1) Eai1 πi1 Ai1 π s, ai1 α log πi1 ai1|s πi1 Ui1 π ( πi1) Eai1 πi1 Qi1 π s, ai1 α log πi1 ai1|s . Step 4: Quantal response equilibrium. We have proved that π satisfies πi i|s = arg max πi( i|s) P(Ai) Eai πi Qi π s, ai α log πi ai|s = arg max πi( i|s) P(Ai) Eai πi,a i π i [Q π(s, a)] aj Aj πj aj|s log πj aj|s , i N, s S. Then following the proof of Theorem 1, we have πi ai|s := exp α 1Ea i π i Q π s, ai, a i P bi Ai exp (α 1Ea i π i [Q π (s, bi, a i)]). Published as a conference paper at ICLR 2024 Thus, π is a quantal response equilibrium. 
Lastly, this implies that the value functions converge to a quantal response value function $V_{\mathrm{QRE}}$ and the returns converge to a quantal response return $J_{\mathrm{QRE}}$.

H EXPERIMENTAL DETAILS

H.1 AUTOMATICALLY ADJUSTING TEMPERATURE

We implement an automated temperature tuning method for HASAC which draws on the auto-tuned temperature extension of SAC (9). Thus, we adjust $\alpha$ with the following objective:
$$ J(\alpha) = \mathbb{E}_{s_t\sim\mathcal D,\, a_t\sim\pi}\big[-\alpha\log\pi(a_t|s_t) - \alpha\bar{\mathcal H}\big], $$
where $\bar{\mathcal H}$ is the target entropy.

H.2 EXPERIMENTAL SETUP AND ADDITIONAL RESULTS

H.2.1 BI-DEXHANDS

Bi-DexHands (2) offers numerous bimanual manipulation tasks that are designed to match various human skill levels. Building on the Isaac Gym simulator, Bi-DexHands supports running thousands of environments simultaneously. This increases the number of samples generated in the same time interval, thus significantly alleviating the sample efficiency problem of on-policy algorithms. We evaluate HASAC on nine tasks simulating human behaviors across different ages and compare it with the on-policy algorithms HAPPO, MAPPO, and PPO, and the off-policy algorithms HATD3 and SAC, as shown in Figure 7. Despite leveraging GPU parallelization, we observe that the sample efficiency of on-policy algorithms remains significantly lower than that of off-policy algorithms. In the most challenging Catch Abreast, Two Catch Underarm, and Lift Underarm tasks, on-policy algorithms fail to learn useful policies within 10M steps. While the off-policy algorithm HATD3 demonstrates higher sample efficiency than the on-policy algorithms, it exhibits substantial variance during training due to the deterministic policies it learns, eventually converging to local optima. In contrast, our algorithm HASAC outperforms the other five methods by a large margin, showcasing faster convergence and lower variance.

H.2.2 MULTI-AGENT MUJOCO

MuJoCo tasks challenge a robot to learn an optimal way of motion; Multi-Agent MuJoCo (MAMuJoCo) models each part of a robot as an independent agent, for example, a leg of a spider or an arm of a swimmer. With an increasing variety of body parts, improving each agent's exploration becomes necessary. Although the easier tasks with fewer agents can be solved by a wide range of different algorithms, the more complex benchmarks, such as HumanoidStandup 17x1 and ManyAgent Swimmer 10x2, are difficult to solve with current MARL algorithms. We compare our method to several algorithms that show current state-of-the-art performance on 10 tasks across 5 scenarios of MAMuJoCo, including HAPPO, a sequential-update on-policy algorithm; MAPPO, a simultaneous-update on-policy algorithm; and HATD3 (45), an off-policy algorithm that outperforms HADDPG and MADDPG. Figure 8 demonstrates that, in all scenarios, HASAC enjoys superior performance over the three rivals both in terms of reward values and learning speed.

H.2.3 STARCRAFT MULTI-AGENT CHALLENGE

StarCraft Multi-Agent Challenge (SMAC) (25) contains a set of StarCraft maps in which a team of ally units aims to defeat the opponent team. Notably, MAPPO (42), HAPPO, and HATRPO have demonstrated remarkable performance on this benchmark through the utilization of five influential factors that significantly impact algorithm performance. We evaluate our method on four hard maps and four super-hard maps. Our experimental results, illustrated in Figure 9, reveal that HASAC achieves over 90% win rates in 7 out of 8 maps and significantly outperforms other strong baselines on most maps.
Notably, in particularly challenging tasks such as 5m_vs_6m, 3s5z_vs_3s6z, and 6h_vs_8z, we observe that HAPPO and HATRPO converge towards suboptimal NEs. In addition, FOP is unable to learn a meaningful joint policy in the super-hard tasks due to its reliance on the IGO assumption. By contrast, HASAC consistently achieves superior performance and shows the ability to identify higher-reward equilibria due to its extensive exploration. We also observe that HASAC has better stability and a higher learning speed across most maps.

Figure 7: Comparisons of average episode return on nine Bi-DexHands tasks: (a) Catch Abreast, (b) Two Catch Underarm, (c) Lift Underarm, (d) Hand Over, (e) Catch Over2Underarm, (f) Hand Pen, (g) Door Close Inward, (h) Door Open Inward, and (i) Door Open Outward. The compared methods are PPO, SAC, MAPPO, HAPPO, HATD3, and HASAC.

H.2.4 GOOGLE RESEARCH FOOTBALL

Google Research Football Environment (GRF) contains a set of cooperative multi-agent challenges in which a team of agents plays against a team of bots in various football scenarios. Recent works (42; 45) have conducted experiments on academy scenarios and achieved a nearly 100% winning rate in each scenario except two very challenging tasks: run pass and shoot with keeper (RPS) and corner. We apply HASAC to these two academy tasks of GRF, with QMIX and several state-of-the-art methods, including HAPPO and MAPPO, as baselines. Since GRF lacks a global state interface, we address this limitation by implementing a global state based on agents' observations, following the Simple115StateWrapper of GRF. Concretely, the global state consists of the components common to all agents' observations and the concatenation of the agent-specific parts, and it is taken as input by the centralized critic for value prediction. Additionally, we employ the dense-reward setting. All methods are trained for 20 million environment steps on the RPS task and 25 million environment steps on the corner task. As shown in Figures 10a and 10b, we generally observe that both MAPPO and HAPPO tend to converge to a non-optimal NE on the two challenging tasks, with a winning rate of approximately 80%. This suboptimal convergence can be attributed to the insufficient exploration of these algorithms. In contrast, HASAC attains a higher-reward equilibrium by learning stochastic policies, which effectively enhance exploration and robustness. This finding highlights the crucial role of stochastic policies in improving exploration, thereby enabling agents to converge toward a higher-reward equilibrium.
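The global-state construction described above can be illustrated with a small sketch: the common observation components are taken once and the agent-specific parts are concatenated across agents. The split index common_dim and the dimensions below are hypothetical placeholders for illustration only; the exact feature layout depends on the Simple115-style observation and is not specified here.

```python
import numpy as np

def build_global_state(agent_obs, common_dim):
    """Concatenate the shared observation components with each agent's specific part.

    agent_obs:   list of per-agent observation vectors (Simple115-style features).
    common_dim:  number of leading features shared by all agents -- an illustrative
                 split point, not the exact index layout of the wrapper.
    """
    common = np.asarray(agent_obs[0][:common_dim])
    specific = np.concatenate([np.asarray(o[common_dim:]) for o in agent_obs])
    return np.concatenate([common, specific])

# Toy usage: 3 agents with 115-dimensional observations, assuming (hypothetically)
# that the first 97 features are common to all agents.
obs = [np.random.randn(115) for _ in range(3)]
state = build_global_state(obs, common_dim=97)
print(state.shape)   # (97 + 3 * 18,) = (151,)
```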
H.2.5 MULTI-AGENT PARTICLE ENVIRONMENT

We evaluate HASAC on the Spread, Reference, and Speaker_Listener tasks of the Multi-Agent Particle Environment (MPE) (19), as implemented in PettingZoo (31). PettingZoo incorporates MPE with some minor adjustments and allows for customizing the number of agents and landmarks, as well as the global and local rewards. To adapt these tasks to fully cooperative MARL settings, we sum up the individual rewards of the agents to form a joint reward. The results presented in Figure 11 show that HASAC consistently outperforms the baselines in terms of both average return and sample efficiency.

Figure 8: Comparisons of average episode return on multiple Multi-Agent MuJoCo tasks: (a) 2x4-Agent Ant, (b) 4x2-Agent Ant, (c) 8x1-Agent Ant, (d) 2x3-Agent HalfCheetah, (e) 3x2-Agent HalfCheetah, (f) 6x1-Agent HalfCheetah, (g) 2x3-Agent Walker, (h) 6x1-Agent Walker, (i) 10x2-Agent Swimmer, and (j) 17x1-Agent HumanoidStandup.

H.2.6 LIGHT AIRCRAFT GAME

In addition to the previous three well-established benchmarks, we extend our experiments to include a novel environment called Light Aircraft Game (LAG) (23). LAG is a recently developed cooperative-competitive environment for red and blue aircraft games, offering various settings such as single control, 1v1, and 2v2 scenarios. In the context of multi-agent scenarios, LAG currently supports self-play only for 2v2 settings. To address this limitation, we introduce a novel cooperative non-weapon task in which two agents collaborate to combat two opponents controlled by the built-in AI. Specifically, the agents are trained to fly towards the tails of their opponents and maintain a suitable distance.

Figure 9: Performance comparison on eight SMAC tasks: (a) 8m_vs_9m, (b) 5m_vs_6m, (c) 3s5z, and (d) 10m_vs_11m (hard), and (e) MMM2, (f) 3s5z_vs_3s6z, (g) corridor, and (h) 6h_vs_8z (super hard), comparing FOP, QMIX, MAPPO, HAPPO, HATRPO, and HASAC. We generally observe that HASAC consistently outperforms the other baselines in most tasks, exhibiting higher convergence speed and better stability.

Figure 10: Performance comparison on two GRF tasks, (a) RPS (2 agents) and (b) Corner (11 agents), and one LAG task, (c) Non-weapon (2 agents). HASAC achieves superior performance to the other methods.

We compare our method to MAPPO and HAPPO on the cooperative non-weapon task involving 2 agents. Figure 10c demonstrates that HASAC outperforms both MAPPO and HAPPO in terms of learning speed and stability. Specifically, HASAC exhibits faster convergence and achieves a higher level of stability throughout the learning process. In contrast, MAPPO and HAPPO exhibit considerable variability in their performance and display slower learning speeds.

H.3 HYPER-PARAMETER SETTINGS FOR EXPERIMENTS

Before presenting the hyperparameters employed in our experiments, we would like to clarify the reporting conventions we adhere to. First, we introduce a boolean variable auto_alpha to indicate whether the temperature is auto-tuned or not.
Also, the hyperparameters will only take effect when Published as a conference paper at ICLR 2024 0 1 2 3 4 5 Episode 1e6 Episode Return Reference (continuous) (a) Reference (continuous) 0 1 2 3 4 5 Episode 1e6 Episode Return Speaker Listener (continuous) (b) Speaker Listener (continuous) 0 1 2 3 4 5 Episode 1e6 Episode Return Spread (continuous) HATRPO MAPPO HAPPO HATD3 HASAC (c) Spread (continuous) 0 1 2 3 4 5 Episode 1e6 Episode Return Reference (discrete) (d) Reference (discrete) 0 1 2 3 4 5 Episode 1e6 Episode Return Speaker Listener (discrete) (e) Speaker Listener (discrete) 0 1 2 3 4 5 Episode 1e6 Episode Return Spread (discrete) HATRPO MAPPO HAPPO HASAC (f) Spread (discrete) Figure 11: Comparisons of average episode return on six MPE tasks. they are used. For instance, the temperature parameter α can be assigned any numerical value, but it is taken into consideration only when the boolean value auto_alpha is set to False. Similarly, the target_entropy and alpha_lr are applicable when auto_alpha is set to True. H.3.1 COMMON HYPER-PARAMETERS ACROSS ALL ENVIRONMENTS We implement the HASAC based on the HARL framework (45) and employ the existing implementations of other algorithms, including HATD3, HAPPO, HATRPO, and MAPPO, as described in the HARL literature. In Google Research Football, we use the results of QMIX in Yu et al. (42). In Star Craft Multi-Agent Challenge, we use the results of FOP in Zhang et al. (44). To ensure comprehensive evaluation, we conduct training using a minimum of four different random seeds for each algorithm. Next, we offer the hyperparameters used for HASAC in Table 4 across all environments, which are kept comparable with the HATD3 for fairness purposes. Table 4: Common hyperparameters used for off-policy algorithms HASAC and HATD3 across all environments. hyperparameters value hyperparamerters value proper time limits True warmup steps 1e4 activation Re LU final activation Tanh buffer size 1e6 polyak 0.005 hidden sizes [256, 256] update per train 1 train interval 50 target entropy -dim(Ai) policy noise 0.2 noise clip 0.5 policy update frequency 2 linear lr decay False H.3.2 BI-DEXHANDS In the Bi-Dex Hands domain, for SAC, PPO, MAPPO baselines we adopt the implementation and tuned hyperparameters reported in the paper (43). And for HAPPO we adopt the implementation and tuned hyperparameters reported in the HARL paper (45). Here we report the hyperparameters for HASAC and HATD3 in Table 5 and 6. H.3.3 MULTI-AGENT MUJOCO (MAMUJOCO) In this part, we report the hyperparameters used in MAMu Jo Co tasks for HASAC and HATD3 in Table 7, 8, 9, and 10. For the other three baselines, we utilize the implementation and tuned hyperparameters reported in the HARL paper (45). Published as a conference paper at ICLR 2024 Table 5: Common hyperparameters used for HASAC and HATD3 in the Bi-Dex Hands domain. hyperparameters value hyperparameters value hyperparameters value rollout threads 20 gamma 0.95 batch size 1000 actor lr 5e - 4 critic lr 5e - 4 n step 20 use huber loss False use valuenorm False exploration noise 0.1 Table 6: Different hyperparameters used for HASAC in the Bi-Dex Hands domain. 
scenarios auto alpha alpha alpha lr Shadow Hand Catch Abreast True / 3e - 4 Shadow Hand Two Catch Underarm False 5e - 5 / Shadow Hand Lift Underarm True / 3e - 4 Shadow Hand Over True / 3e - 4 Shadow Hand Catch Over2Underarm True / 3e - 4 Shadow Hand Pen True / 3e - 4 Shadow Hand Door Close Inward True / 3e - 4 Shadow Hand Door Open Inward True / 3e - 4 Shadow Hand Door Open Outward True / 3e - 4 Table 7: Common hyperparameters used for HASAC and HATD3 in the MAMu Jo Co domain. hyperparameters value hyperparameters value rollout threads 10 train interval 50 critic lr 1e - 3 gamma 0.99 use huber loss False use valuenorm False Table 8: Common hyperparameters used for HATD3 in the MAMu Jo Co domain. hyperparameters value hyperparameters value actor lr 5e - 4 exploration noise 0.1 Table 9: Parameter n_step used for HASAC and HATD3 in the MAMu Jo Co domain. scenarios value scenarios value Ant 2x4 5 Half Cheetah 2x3 10 Ant 4x2 5 Half Cheetah 3x2 10 Ant 8x1 5 Half Cheetah 6x1 10 Walker 2x3 20 Walker 6x1 20 manyagent_swimmer 10x2 10 Humanoid Standup 17x1 10 Table 10: Different hyperparameters used for HASAC in the MAMu Jo Co domain. scenarios actor lr auto alpha alpha alpha lr batch size Ant 2x4 3e - 4 False 0.2 / 2200 Ant 4x2 3e - 4 False 0.2 / 1000 Ant 8x1 3e - 4 False 0.2 / 2200 Half Cheetah 2x3 1e - 3 True / 3e - 4 1000 Half Cheetah 3x2 1e - 3 True / 3e - 4 1000 Half Cheetah 6x1 1e - 3 True / 3e - 4 1000 Walker 2x3 5e - 4 False 0.2 / 1000 Walker 6x1 5e - 4 False 0.2 / 1000 manyagent_swimmer 10x2 1e - 3 True / 3e - 4 1000 Humanoid Standup 17x1 1e - 3 True / 3e - 4 1000 H.3.4 STARCRAFT MULTI-AGENT CHALLENGE (SMAC) In the SMAC domain, for MAPPO we adopt the implementation and tuned hyperparameters reported in the MAPPO paper (42). And for HAPPO and HATRPO we adopt the implementation and tuned Published as a conference paper at ICLR 2024 Table 11: Common hyperparameters used for HAPPO and MAPPO in the MAMu Jo Co domain. hyperparameters value hyperparameters value batch size 4000 network MLP gamma 0.99 hidden sizes [256, 256] Table 12: Different hyperparameters used for HAPPO and MAPPO in the MAMu Jo Co domain. scenarios linear lr decay actor/critic lr ppo/critic epoch clip param actor/critic mini batch entropy coef Ant False 5e - 4 5 0.1 1 0 Half Cheetah False 5e - 4 15 0.05 1 0.01 Walker 2x3 True 1e - 3 5 0.05 2 0 Walker 6x1 False 5e - 4 5 0.1 1 0.01 manyagent_swimmer 10x2 False 5e - 4 5 0.2 1 0.01 Humanoid Standup 17x1 True 5e - 4 5 0.1 1 0 hyperparameters reported in the HARL paper (45). Here we report the hyperparameters for HASAC in Table 13 and 14. Table 13: Common hyperparameters used for HASAC in the SMAC domain. hyperparameters value hyperparameters value hyperparameters value rollout threads 20 state type FP batch size 1000 use huber loss False use valuenorm False use policy active masks true Table 14: Different hyperparameters used for HASAC in the SMAC domain. Map critic lr actor lr gamma auto alpha alpha alpha lr n step 8m_vs_9m 5e - 4 3e - 4 0.95 True / 3e - 4 5 5m_vs_6m 5e - 4 3e - 4 0.95 True / 3e - 4 20 3s5z 5e - 4 3e - 4 0.99 True / 3e - 4 20 10m_vs_11m 5e - 4 3e - 4 0.95 True / 3e - 4 20 MMM2 5e - 4 3e - 4 0.95 True / 3e - 4 20 3s5z_vs_3s6z 5e - 4 3e - 4 0.99 True / 3e - 4 10 corridor 5e - 4 5e - 4 0.99 False 2e - 3 / 5 6h_vs_8z 5e - 4 3e - 4 0.99 True / 3e - 4 5 H.3.5 GOOGLE RESEARCH FOOTBALL (GRF) In the GRF domain, for MAPPO and QMIX baselines we adopt the implementation and tuned hyperparameters reported in the MAPPO paper (42). 
And for HAPPO we adopt the implementation and tuned hyperparameters reported in the HARL paper (45). Here we report the hyperparameters for HASAC in Table 15 and 16. Table 15: Common hyperparameters used for HASAC in the GRF domain. hyperparameters value hyperparameters value hyperparameters value rollout threads 20 gamma 0.99 batch size 1000 actor lr 5e - 4 critic lr 5e - 4 n step 10 use huber loss False use valuenorm False Table 16: Different hyperparameters used for HASAC in the GRF domain. scenarios auto alpha alpha alpha lr RPS False 1e - 4 / Corner False 1e - 3 / Published as a conference paper at ICLR 2024 H.3.6 MULTI-AGENT PARTICLE ENVIRONMENT (MPE) In this part, we report the hyperparameters for HASAC in Table 17. Table 17: Hyperparameters used for HASAC in the MPE domain. hyperparameters value hyperparameters value hyperparameters value rollout threads 20 batch_size 1000 critic lr 5e - 4 gamma 0.99 actor lr 5e - 4 n_step 20 auto alpha True alpha lr 3e - 4 use huber loss False use valuenorm False H.3.7 LIGHT AIRCRAFT GAME (LAG) In this part, we report the hyperparameters for HASAC in Table 18 and the hyperparameters for MAPPO and HAPPO in Table 19. Table 18: Hyperparameters used for HASAC in the LAG domain. hyperparameters value hyperparameters value hyperparameters value rollout threads 20 batch_size 1000 critic lr 5e - 4 gamma 0.99 actor lr 5e - 4 n_step 10 auto alpha True alpha lr 3e - 4 Table 19: Hyperparameters used for MAPPO and HAPPO in the LAG domain. hyperparameters value hyperparameters value hyperparameters value batch size 4000 linear lr decay False hidden sizes [256, 256] actor/critic lr 5e - 4 gamma 0.99 network MLP ppo epoch 5 entropy coef 0 clip param 0.05 critic epoch 5 actor mini batch 2 critic mini batch 2 Published as a conference paper at ICLR 2024 I ABLATION STUDY ON SEQUENTIAL UPDATES In this section, we conduct an ablation study to investigate the effect of sequential updates. We compare the performance of the original HASAC with sequential update scheme, and HASAC without sequential updates (or MASAC), which updates each policy according to the following objective: Jπi(ϕi) = Est D Eai t πi ϕi,a i t π i h α log πi ϕi ai t|st Qi πold ;θ st, ai t, a i t i . We run the experiments on nine Bi-Dex Hands tasks. For a fair comparison, we set the hyperparameters of MASAC to be the same as those of HASAC. Experiments results show that HASAC with sequential updates consistently achieves better performance. Tasks HASAC MASAC Shadow Hand Catch Abreast 45.9(4.3) 36.4(5.2) Shadow Hand Two Catch Underarm 20.6(1.9) 17.2(1.1) Shadow Hand Lift Underarm 361.0(23.2) 345.0(9.7) Shadow Hand Over 31.1(0.7) 30.5(1.1) Shadow Hand Catch Over2Underarm 28.7(0.4) 27.0(0.9) Shadow Hand Pen 181.5(10.4) 163.8(18.6) Shadow Hand Door Close Inward 429.8(16.7) 425.9(7.1) Shadow Hand Door Open Inward 475.6(23.8) 428.8(14.1) Shadow Hand Door Open Outward 542.1(13.1) 528.1(20.4) Table 20: Averaged final performance on nine Bi-Dex Hands tasks.
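To illustrate the practical difference between the two update schemes compared in this ablation, the sketch below contrasts a HASAC-style sequential actor update (each later agent in the permutation responds to fresh samples from already-updated agents) with the MASAC objective above (every agent responds to the old policies of all others). This is a simplified, hedged PyTorch illustration under assumed shapes, a single (non-twin) critic without target networks, and diagonal Gaussian policies; it is not the released implementation, and all names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim, batch, alpha = 3, 8, 2, 32, 0.2   # placeholder sizes

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 2 * act_dim))
    def forward(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
        a = dist.rsample()                            # reparametrized sample
        return a, dist.log_prob(a).sum(-1)

critic = nn.Sequential(nn.Linear(obs_dim + n_agents * act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actors = [Actor() for _ in range(n_agents)]

obs = torch.randn(batch, obs_dim)                     # stand-in for a sampled minibatch
with torch.no_grad():                                 # "old" actions of all agents
    old_actions = [actor(obs)[0] for actor in actors]

def actor_losses(agent_order, sequential):
    joint = list(old_actions)
    losses = []
    for i in agent_order:
        a_i, logp_i = actors[i](obs)
        inputs = list(joint)
        inputs[i] = a_i                               # replace agent i's action with a fresh sample
        q = critic(torch.cat([obs] + inputs, dim=-1)).squeeze(-1)
        losses.append((alpha * logp_i - q).mean())    # maximize E[Q - alpha * log pi]
        if sequential:
            joint[i] = a_i.detach()                   # HASAC: later agents see the *new* samples
        # MASAC: joint is left untouched, so every agent responds to the old policies
    return losses

print(actor_losses(agent_order=[1, 0, 2], sequential=True))   # HASAC-style sequential update
print(actor_losses(agent_order=[0, 1, 2], sequential=False))  # MASAC-style simultaneous update
```

In the sequential variant, the only change relative to MASAC is that the joint action fed to the centralized critic is progressively overwritten with the freshly sampled (detached) actions of previously updated agents, mirroring the a^{i_{1:m-1}}-conditioning in Algorithm 2.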