
Published in Transactions on Machine Learning Research (03/2025)

Efficient Multi-Agent Cooperation Learning through Teammate Lookahead

Feng Chen1,2, Xinwei Chen2, Rongjun Qin1,2, Cong Guan1, Lei Yuan1,2, Zongzhang Zhang1, Yang Yu1,2

1 National Key Laboratory for Novel Software Technology, Nanjing University School of Artificial Intelligence, Nanjing University 2 Polixir Technologies

chenf@lamda.nju.edu.cn, xinwei.chen@polixir.ai, {qinrj, guanc, yuanl}@lamda.nju.edu.cn, {zzzhang, yuy}@nju.edu.cn

Reviewed on OpenReview: https://openreview.net/forum?id=CeNNIQ8GJf

Cooperative Multi-Agent Reinforcement Learning (MARL) is a rapidly growing research field that has achieved outstanding results across a variety of challenging cooperation tasks. However, existing MARL algorithms typically overlook the concurrent updates of teammate agents. An agent always learns from data collected while cooperating with one set of (current) teammates, but then practices with another set of (updated) teammates. This phenomenon, termed teammate delay, leads to a discrepancy between the agent's learning objective and the actual evaluation scenario, which can degrade learning stability and efficiency. In this paper, we tackle this challenge by introducing a lookahead strategy that enables agents to learn to cooperate with predicted future teammates, making them explicitly aware of concurrent teammate updates. This lookahead strategy is designed to integrate seamlessly with existing policy-gradient-based MARL methods, enhancing their performance without significant modifications to their underlying structures. Extensive experiments demonstrate the effectiveness of this approach, showing that the lookahead strategy enhances cooperation learning efficiency and achieves competitive performance compared to state-of-the-art MARL algorithms.

1 Introduction

Cooperative Multi-Agent Reinforcement Learning (MARL) techniques focus on replicating the collaborative intelligence observed in human teams (Oroojlooy & Hajinezhad, 2023), and advances in recent years have showcased their remarkable potential in various application domains, including robotics (Wang et al., 2022), games (Berner et al., 2019), and social networks (Leibo et al., 2017). Among them, multi-agent policy gradient methods stand out for their capability to handle continuous control tasks and their potential to solve intricate cooperative problems (de Witt et al., 2020a; Yu et al., 2022a). Despite the ongoing progress in this category of methods, we point out that they typically suffer from a teammate delay issue. Specifically, this issue occurs when an agent learns from data collected while cooperating with its current teammates, but then practices with updated teammates because the teammates update their policies concurrently. It is worth noting that the teammate delay phenomenon is related to, yet distinct from, the general non-stationarity issue (Yuan et al., 2023) in multi-agent systems. Whereas non-stationarity broadly refers to learning challenges arising from dynamic policy changes across all agents (including both teammates and opponents), teammate delay specifically characterizes the learning inefficiency in cooperative settings caused by agents neglecting teammate policy updates. This problem is particularly pronounced in on-policy MARL algorithms, where agents optimize their policies based on outdated teammate policies.

Corresponding Author

[Figure 1 labels: Cooking Pattern; Cook hamburgers; Cook Vegetable Salad; Policy Update; Vegetable Salad (Complete / Incomplete); Hamburgers (Complete / Incomplete)]

Figure 1: A simple example of the teammate delay issue, where Purchaser is expected to buy the corresponding ingredients for Cook to cook.

For a more intuitive illustration, a concrete example is shown in Figure 1. In this example, Cook initially wants to cook hamburgers, and Purchaser adjusts its policy to buy the corresponding ingredients after one round of policy update. However, in the same round, Cook also updates its policy and now cooks vegetable salad. As a result, their updated policies fail to cooperate well. This simple example reveals that updating the agents to cooperate with their current teammates leads to a training-test mismatch, because the teammates update their policies as well. This gap between policy training and evaluation in each round of update can lead to severe learning inefficiency.

Although they do not explicitly identify the teammate delay issue, several lines of work aim to resolve similar problems arising from concurrent updates of teammate policies. Opponent modeling methods seek to alleviate the non-stationarity in multi-agent scenarios by explicitly modeling the teammate policies (Yuan et al., 2023). They either introduce an auxiliary task of predicting teammate behaviors (Hernandez-Leal et al., 2019) or learn teammate representations as extra policy conditions (Papoudakis & Albrecht, 2020; Cao et al., 2023). Despite their effectiveness in many problem scenarios, these methods require extra teammate modeling effort and lack theoretical support. On the other hand, recent works such as LOLA (Foerster et al., 2018a) and COLA (Willi et al., 2022) explicitly acknowledge the learning behavior of other agents and propose learning rules with opponent-learning awareness. However, these works are limited to simple two-player problems and are difficult to extend to practical cooperative scenarios. The most promising approach is a recent multi-agent policy gradient method, HAPPO (Kuba et al., 2021), which proposes a sequential policy update scheme with theoretical guarantees for joint policy improvement. However, its practical implementation relies on an importance-sampling approximation, which may hurt performance because of the large variance in the policy gradients.

Despite all these previous efforts, how the teammate delay issue influences cooperative policy learning and how to better mitigate its negative impact remain open questions. To answer these two questions, in this paper we provide a formal analysis of the impact of this issue on the policy update, which motivates us to predict future teammate policies, and we propose a model-based MARL algorithm that approximates the future teammates by conducting policy updates within a learned environment model. In summary, our main contributions are:

We offer a rigorous formal analysis of policy-gradient MARL algorithms by investigating the regret of the updated policy, which reveals the impact of the teammate delay issue on cooperative policy learning.

Furthermore, we introduce a practical model-based MARL algorithm explicitly designed to address the challenges posed by the teammate delay issue. By leveraging insights from our formal analysis, our algorithm aims to enhance cooperative policy learning.

To validate the effectiveness of our proposed approach, we conduct empirical studies on various benchmarks. These benchmarks include complex problems with continuous action spaces, as well as challenging multi-agent cooperative tasks. The empirical results show that our method matches or exceeds the performance of existing approaches across diverse scenarios.

2 Preliminaries

In this section, we introduce the basic preliminaries for our problem formulation and analysis. For brevity, we first introduce these concepts in the single-agent setting; they extend directly to the multi-agent setting.

In the single-agent setting, the sequential decision-making problem can be modeled as a Markov decision process (MDP), defined as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma, \rho_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(\cdot|s, a)$ is the transition function, $R(s, a)$ is the reward function, $\gamma \in (0, 1)$ is the discount factor, and $\rho_0$ is the initial state distribution. The agent policy is defined as $\pi(a|s)$, and the goal of Reinforcement Learning (RL) is to maximize the expected discounted return:

$$\eta(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\ a_t \sim \pi(\cdot|s_t),\ s_t \sim P(\cdot|s_{t-1}, a_{t-1})}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]. \tag{1}$$

Moreover, for a given agent policy $\pi$, we can define the state distribution at timestep $t$ as $d^\pi_t(s) = P^\pi(s_t = s)$. Then, the discounted state visitation distribution can be defined as $d^\pi(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t d^\pi_t(s)$, and the discounted state-action visitation distribution can be defined as $d^\pi(s, a) = d^\pi(s)\pi(a|s)$. Based on these definitions, the expected discounted return of policy $\pi$ can be rewritten as:

$$\eta(\pi) = \frac{1}{1 - \gamma} \sum_{s \in \mathcal{S}} d^\pi(s) \sum_{a \in \mathcal{A}} \pi(a|s) R(s, a). \tag{2}$$

For the sake of clarity in the subsequent derivations, we additionally define $\rho^\pi(s) = \sum_{t=0}^{\infty} \gamma^t d^\pi_t(s) = \frac{1}{1 - \gamma} d^\pi(s)$, which can be regarded as the unnormalized version of $d^\pi(s)$. The expected return then equals $\eta(\pi) = \sum_{s \in \mathcal{S}} \rho^\pi(s) \sum_{a \in \mathcal{A}} \pi(a|s) R(s, a)$. We define the policy that maximizes $\eta(\pi)$ as the optimal policy $\pi^*$, which means that $\eta(\pi^*) \geq \eta(\pi)$ for every Markovian policy $\pi$.

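Moreover, for two different agent policies $\pi'$ and $\pi$, their difference in discounted return is given by the well-known performance discrepancy lemma:

$$\eta(\pi') - \eta(\pi) = \sum_{s \in \mathcal{S}} \rho^{\pi'}(s) \sum_{a \in \mathcal{A}} \pi'(a|s) A^\pi(s, a), \quad A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s), \tag{3}$$

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a\right], \quad V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q^\pi(s, a), \tag{4}$$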

where Qπ(s, a) and Vπ(s) are respectively the state-action value function and state value function.
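As a sanity check on Equations (3) and (4), the lemma can be verified numerically on a small tabular MDP. The sketch below is illustrative only: the random MDP, the array layout `P[s, a, s']`, and all function names are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def unnormalized_visitation(P, pi, rho0, gamma):
    """rho^pi(s) = sum_t gamma^t Pr(s_t = s), solved in closed form."""
    # State-to-state transitions under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)
    n = P.shape[0]
    # rho = rho0 + gamma * P_pi^T rho  =>  (I - gamma * P_pi^T) rho = rho0
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho0)

def value_functions(P, R, pi, gamma):
    """Return (Q^pi, V^pi) by solving the Bellman equations exactly."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.sum(pi * R, axis=1)
    n = P.shape[0]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum("sat,t->sa", P, V)
    return Q, V

def expected_return(P, R, pi, rho0, gamma):
    """eta(pi) = sum_s rho^pi(s) sum_a pi(a|s) R(s, a)."""
    rho = unnormalized_visitation(P, pi, rho0, gamma)
    return float(np.sum(rho[:, None] * pi * R))

# Hypothetical random MDP with 4 states and 3 actions.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))
rho0 = np.full(n_states, 1.0 / n_states)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)      # plays the role of pi
pi_new = rng.dirichlet(np.ones(n_actions), size=n_states)  # plays the role of pi'

Q, V = value_functions(P, R, pi, gamma)
advantage = Q - V[:, None]                                  # A^pi(s, a)
rho_new = unnormalized_visitation(P, pi_new, rho0, gamma)   # rho^{pi'}(s)

lhs = expected_return(P, R, pi_new, rho0, gamma) - expected_return(P, R, pi, rho0, gamma)
rhs = float(np.sum(rho_new[:, None] * pi_new * advantage))
print(lhs, rhs)  # the two sides of Eq. (3)
```

Note that the state weights on the right-hand side come from the new policy $\pi'$, which is exactly what makes the exact expression hard to estimate from data collected with the old policy.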

For the multi-agent setting, most definitions are similar. The main difference is that there are $N$ agents, which leads to the concepts of the joint action space $\mathcal{A}$ and the joint policy $\pi$. Accordingly, the discounted return and the performance discrepancy lemma (Kakade & Langford, 2002) become:

$$\eta(\pi) = \sum_{s \in \mathcal{S}} \rho^\pi(s) \sum_{a \in \mathcal{A}} \pi(a|s) R(s, a), \tag{5}$$

$$\eta(\pi') - \eta(\pi) = \sum_{s \in \mathcal{S}} \rho^{\pi'}(s) \sum_{a \in \mathcal{A}} \pi'(a|s) A^\pi(s, a). \tag{6}$$

Notably, we focus on the fully cooperative setting, where the environmental reward $R(s, a)$ is shared among all agents. Additionally, in both the single-agent and the multi-agent setting, the problem formulation only considers Markovian policies, which means that the agent policy makes decisions based on the state information at the current timestep. There is a slight abuse of notation, as the subscript $i$ will also denote the $i$-th agent in subsequent derivations. For example, $\pi_i$ represents the policy of the $i$-th agent. More details about the notation are listed in Appendix A.1.


2.1 Single-Agent Policy Gradient

In the single-agent setting, the goal of reinforcement learning is to maximize the discounted return $\eta(\pi)$ of the agent policy $\pi$, which according to the performance discrepancy lemma equals:

$$\eta(\pi) = \eta(\pi^k) + \sum_{s \in \mathcal{S}} \rho^{\pi}(s) \sum_{a \in \mathcal{A}} \pi(a|s) A^{\pi^k}(s, a), \quad A^{\pi^k}(s, a) = Q^{\pi^k}(s, a) - V^{\pi^k}(s), \tag{7}$$

where $\pi^k$ is the agent policy at the $k$-th round. As it is hard to sample trajectories corresponding to the state distribution $\rho^\pi(s)$, in practice we typically use $\pi^k$ to sample trajectories instead. This implies that the actual learning objective is:

$$J(\pi) = \eta(\pi^k) + \sum_{s \in \mathcal{S}} \rho^{\pi^k}(s) \sum_{a \in \mathcal{A}} \pi(a|s) A^{\pi^k}(s, a). \tag{8}$$

In fact, $J(\pi)$ depends on $\pi^k$, but for brevity we omit it from the arguments; the same applies in the following. Because of the change in state distribution, $\pi$ cannot be updated too far away from $\pi^k$. For this reason, traditional actor-critic algorithms, e.g., A3C (Mnih et al., 2016), perform only a few policy gradient ascent steps per update, while trust-region algorithms, such as PPO (Schulman et al., 2017), follow trust-region optimization.

2.2 Multi-Agent Policy Gradient

When it comes to the multi-agent setting, the problem can be defined as a tuple $(N, \mathcal{S}, \mathcal{A}, P, R, \gamma, \rho_0)$, where $N$ is the number of agents. It should be noted that in some scenarios, the agents need to make decisions based on local observations. For the sake of simplicity and without loss of generality, we omit partial observability in the derivation. Moreover, we assume that the joint policy $\pi$ can be decomposed into the product of individual policies, $\pi(a|s) = \prod_{i=1}^{N} \pi_i(a_i|s)$, where $\pi_i$ is the individual policy of agent $i$ and the joint action $a = [a_1, a_2, \ldots, a_N] \in \mathcal{A}$ is composed of the individual actions $\{a_i\}_{i=1}^{N}$.

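In this case, the learning objective for each agent in multi-agent policy gradient methods is typically defined as:

$$J_i(\pi_i \,|\, \{\pi^k_j\}_{j \neq i}) = \eta(\pi^k) + \sum_{s \in \mathcal{S}} \rho^{\pi^k}(s) \sum_{a \in \mathcal{A}} \pi_i(a_i|s) \prod_{j \neq i} \pi^k_j(a_j|s) A^{\pi^k}(s, a), \quad i \in \{1, 2, \ldots, N\}. \tag{9}$$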

A more detailed derivation of this learning objective is provided in Appendix A.2. Compared with the single-agent setting, the data distribution here is also influenced by the teammate policies. In other words, in this learning objective the $i$-th agent updates its policy with respect to the current teammates $\{\pi^k_j\}_{j \neq i}$. Overall, the learning objective for the joint policy can be formalized as:

$$J(\pi) = \eta(\pi^k) + \sum_{s \in \mathcal{S}} \rho^{\pi^k}(s) \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in \mathcal{A}} \pi_i(a_i|s) \prod_{j \neq i} \pi^k_j(a_j|s) A^{\pi^k}(s, a). \tag{10}$$
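To make the role of the current teammates in Equation (9) concrete, the sketch below evaluates each agent's objective in a one-step, single-state cooperative game loosely modeled on the Cook and Purchaser example of Figure 1. The payoff matrix, the initial policies, and the greedy full-step update are all hypothetical choices for illustration; with a single state and a horizon of one, $\rho^{\pi^k}(s) = 1$ and $A^{\pi^k}(a) = R(a) - \eta(\pi^k)$.

```python
import numpy as np

# Hypothetical payoff: the team is rewarded only when dish and ingredients match.
# Rows: Cook's action (0 = hamburgers, 1 = vegetable salad).
# Columns: Purchaser's action (0 = hamburger ingredients, 1 = salad ingredients).
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

cook_k = np.array([0.8, 0.2])       # current Cook policy (assumed)
purchaser_k = np.array([0.2, 0.8])  # current Purchaser policy (assumed)

def joint_return(cook, purchaser):
    """eta of the factored joint policy in the one-step game."""
    return float(cook @ R @ purchaser)

eta_k = joint_return(cook_k, purchaser_k)
A_k = R - eta_k  # one-step game: Q(a) = R(a) and V = eta(pi^k)

def J_cook(cook):
    # Eq. (9) for the Cook: the teammate factor is the *current* Purchaser.
    return eta_k + float(cook @ A_k @ purchaser_k)

def J_purchaser(purchaser):
    # Eq. (9) for the Purchaser: the teammate factor is the *current* Cook.
    return eta_k + float(cook_k @ A_k @ purchaser)

# Each agent greedily maximizes its own objective over deterministic policies.
candidates = np.eye(2)
cook_next = max(candidates, key=J_cook)
purchaser_next = max(candidates, key=J_purchaser)

print("predicted by J_i:", J_cook(cook_next), J_purchaser(purchaser_next))
print("actual joint return:", joint_return(cook_next, purchaser_next))
```

Under these assumptions each agent's objective predicts a value of 0.8 for its greedy update, yet the jointly updated policies obtain a return of 0, because both agents responded to teammates that no longer exist after the update.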

In this work, we identify the teammate delay phenomenon in the common practice of multi-agent policy gradient methods. Both the direct negative impact of this issue and how to resolve it deserve further study. In this section, we first discuss how the teammate delay issue negatively affects cooperative policy learning by analyzing the regret of the updated joint policy at the next round. Motivated by this analysis, we then propose a practical algorithm that exploits future teammate information to facilitate cooperation learning.


3.1 Analysis Motivates Predicting Future Teammates

From Equation (10), we know that in typical multi-agent policy gradient methods, the learning objective of the agent policy involves computing an expectation with respect to the current teammate policies $\{\pi^k_j\}_{j \neq i}$. Consequently, the current policy distribution of the teammates has an impact on the policy update. To analyze this impact further, we replace the teammate policies with a general notation $\{\mu_j\}_{j \neq i}$, which means that the trajectories are sampled with a sampling policy $\mu$. In this way, the learning objective is transformed into:

$$J(\pi, \mu) = \eta(\pi^k) + \sum_{s \in \mathcal{S}} \rho^{\mu}(s) \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in \mathcal{A}} \pi_i(a_i|s) \prod_{j \neq i} \mu_j(a_j|s) A^{\pi^k}(s, a), \tag{11}$$

where $\pi^k$ still denotes the joint policy at the $k$-th round. In existing multi-agent policy-gradient methods, the sampling policy $\mu$ is typically selected to be $\pi^k$, which means that we expect $\pi_i$ to collaborate well with the $k$-th round teammate policies by maximizing $\sum_{s \in \mathcal{S}} \rho^{\pi^k}(s) \sum_{a \in \mathcal{A}} \pi_i(a_i|s) \prod_{j \neq i} \pi^k_j(a_j|s) A^{\pi^k}(s, a)$.

A natural question is what happens when we change $\mu$ from $\pi^k$ to other distributions. To answer this question, we first present the following lemma, which gives an upper bound on the discrepancy between the learning objective $J(\pi, \mu)$ and the actual policy return $\eta(\pi)$.

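Lemma 1 Assume that we update the joint policy $\pi^k$ to $\pi^{k+1}$ with sampling policy $\mu$. Given the measurement of distance between the sampling policy $\mu$ and the updated policy $\pi^{k+1}$ as $\alpha_i = \max_s D_{TV}\left(\pi^{k+1}_i(\cdot|s) \,\|\, \mu_i(\cdot|s)\right)$¹, we have:

$$\left|J(\pi^{k+1}, \mu) - \eta(\pi^{k+1})\right| \leq \frac{4\epsilon\gamma}{(1 - \gamma)^2} \sum_{i=1}^{N} \alpha_i, \tag{12}$$

where $\gamma$ is the discount factor and $\epsilon = \max_{s,a} |A^{\pi^k}(s, a)|$.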

For the proof, see Appendix A.2. The upper bound on the discrepancy between $J(\pi^{k+1}, \mu)$ and $\eta(\pi^{k+1})$ in Lemma 1 helps us analyze the regret of $\pi^{k+1}$, leading to the following corollary:

Corollary 1 Suppose that we update the joint policy $\pi^k$ to $\pi^{k+1}$ with sampling policy $\mu$. Then the regret of the updated joint policy $\pi^{k+1}$ has the following upper bound:

$$\eta(\pi^*) - \eta(\pi^{k+1}) \leq \eta(\pi^*) - J(\pi^{k+1}, \mu) + \underbrace{\frac{4\epsilon\gamma}{(1 - \gamma)^2} \sum_{i=1}^{N} \alpha_i}_{(c)}. \tag{13}$$

For the proof, see Appendix A.2. The right-hand side of Inequality (13) sheds light on the elements that can influence cooperative policy learning. Overall, we expect to minimize the regret of the updated policy by minimizing this upper bound. Other than the first term $\eta(\pi^*)$, which is a constant, the upper bound is composed of $-J(\pi^{k+1}, \mu)$ and one extra term (c). In fact, the second term $-J(\pi^{k+1}, \mu)$ is exactly the loss function that the algorithm aims to minimize at each update round, which typically serves as a surrogate for the regret $\eta(\pi^*) - \eta(\pi^{k+1})$. However, Inequality (13) reveals that the regret cannot be bounded by $-J(\pi^{k+1}, \mu)$ alone; the extra term (c) captures the gap between this surrogate and the actual regret.

This extra term (c), a function of the sampling policy $\mu$ and the updated policy $\pi^{k+1}$, cannot be optimized by previous learning algorithms, but it can affect cooperation learning. When term (c) is large, the surrogate function is far from the actual regret, which can result in low learning efficiency. In fact, it is easy to see that term (c) reduces to zero when $\mu$ equals $\pi^{k+1}$.

¹ The TV distance measures the distance between two distributions as $D_{TV}(P, Q) = \frac{1}{2} \sum_x |P(x) - Q(x)|$ (Cover, 1999).


Algorithm 1 Multi-Agent Policy Gradient Learning with Lookahead

Input: The number of agents $N$, max iteration number $K$, trajectory batch size $M$

Output: Obtained multi-agent cooperation policy

1: Initialize replay buffer $B$;

2: Initialize a joint policy $\pi = \{\pi_i\}_{i=1}^{N}$ randomly;

3: for iteration $k = 1$ to $K$ do

4: Sample a batch of transitions from $B$ and update the environment model by minimizing loss $L_{model}$;

5: Sample a batch of trajectories $\{\tilde{\tau}\}^M$ in the environment model with sampling policy $\pi^k$, and obtain $\hat{\pi}^{k+1} = \psi(\pi^k, \pi^k)$ using the training trajectories $\{\tilde{\tau}\}^M$;

6: Sample a batch of trajectories $\{\tau\}^M$ in the real environment with sampling policy $\hat{\pi}^{k+1}$, and obtain $\pi^{k+1} = \psi(\hat{\pi}^{k+1}, \pi^k)$ using the training trajectories $\{\tau\}^M$;

7: Add trajectories $\tau$ to the buffer $B$;

8: end for

That is, the extra term (c) would disappear if we trained the agents with the information of future teammates. This outcome motivates us to replace the sampling policy $\mu$ with an approximation of the future teammates, thereby reducing the regret. A more comprehensive analysis of how approximating future teammates reduces the regret upper bound is provided in Appendix A.3. In Section 3.2 and Section 3.3, we show how we approximate the future teammates.
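The extra term in Lemma 1 and Corollary 1 is straightforward to evaluate for tabular policies. The following sketch computes $\alpha_i$ and the right-hand side of Inequality (12); the function names, the array layout (one `(n_states, n_actions)` array per agent), and the example numbers are assumptions made for illustration.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between discrete distributions on the last axis."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum(axis=-1)

def lemma1_bound(new_policies, sampling_policies, epsilon, gamma):
    """Return (bound, alphas) for the right-hand side of Eq. (12).

    new_policies and sampling_policies are lists with one array per agent,
    each of shape (n_states, n_actions_i).
    """
    alphas = [float(tv_distance(p_new, mu).max())  # alpha_i = max_s D_TV
              for p_new, mu in zip(new_policies, sampling_policies)]
    bound = 4.0 * epsilon * gamma / (1.0 - gamma) ** 2 * sum(alphas)
    return bound, alphas

# Hypothetical two-agent, single-state example.
cook_new, cook_mu = np.array([[0.0, 1.0]]), np.array([[0.8, 0.2]])
purchaser_new, purchaser_mu = np.array([[1.0, 0.0]]), np.array([[0.2, 0.8]])

# Sampling with the old teammates: the extra term is large.
bound, alphas = lemma1_bound([cook_new, purchaser_new],
                             [cook_mu, purchaser_mu], epsilon=1.0, gamma=0.9)
# Sampling with the updated teammates themselves: the extra term vanishes.
zero_bound, _ = lemma1_bound([cook_new, purchaser_new],
                             [cook_new, purchaser_new], epsilon=1.0, gamma=0.9)
print(alphas, bound, zero_bound)
```

The factor $1/(1-\gamma)^2$ makes the bound loose for discount factors close to one, so it is more useful for comparing sampling policies than as an absolute error estimate.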

3.2 Future Teammate Approximation

Based on the above analysis, we are motivated to replace the sampling policy $\mu$ with the future teammate policy $\pi^{k+1}$ in each round of policy update. However, achieving this goal is not easy in practice, because in each round of policy update the updated policy $\pi^{k+1}$ is affected by the sampling policy $\mu$; that is, $\pi^{k+1}$ and $\mu$ are coupled. To address this, we propose that the future teammate policy can be obtained by solving the bi-level optimization problem below:

Theorem 1 Suppose the sampling policy $\mu^*$ can derive the same updated policy, which means that $\mu^* = \arg\max_\pi J(\pi, \mu^*)$. If it exists, it will be the solution of the following bi-level optimization problem:

$$\arg\min_{\mu} D_{KL}(\mu \,\|\, \pi^{k+1}), \quad \text{s.t.} \quad \pi^{k+1} = \arg\max_{\pi} J(\pi, \mu). \tag{14}$$

Its proof can be found in Appendix A.2. This theorem suggests that we can obtain the desired $\mu^*$ by solving this bi-level optimization problem, whose solution contains, to some extent, the information of the future teammate policy. However, this problem typically takes the form of a Stackelberg game (Friedman, 1971) and is not easy to solve.

We therefore propose a one-step approximation of this optimization problem: with $\mu$ initialized as $\pi^k$, we first solve the inner-loop optimization $\pi^{k+1} = \arg\max_\pi J(\pi, \mu)$, and then assign the obtained $\pi^{k+1}$ to the sampling policy $\mu$, which yields an approximation of the solution $\mu^*$. This one-step approximation is commonly used for Stackelberg-game-like problems, and it achieves a trade-off between solution accuracy and computation cost. In brief, to approximate the future teammates feasibly, we perform an additional round of optimization before each algorithm iteration, using the previous round's policy $\pi^k$ as the sampling policy; the obtained policy $\pi^{k+1}$ serves as the approximation of the future teammate policy.

3.3 Practical Algorithm Implementation

Model-based Approximation. The above analysis motivates us to conduct extra training to estimate the future teammate policy. However, a straightforward implementation is not practical because it spends nearly half of the online samples on estimating the future teammate policy, and those samples are not used for the actual policy training, which leads to very low sample efficiency. To avoid this issue, we propose to learn an environment model and to run the teammate policy estimation process within it, thus avoiding the waste of a large number of online samples.

[Figure 2 panels: Model Architecture for MDP Environment; Model Architecture for POMDP Environment. Legend: Multivariate Gaussian Distribution; Gaussian Distribution; Vector Concatenation; MLP Network.]

Figure 2: Network architectures for the environment model in the practical implementation.

In practice, we build the environment model using multiple Multi-Layer Perceptrons (MLPs), as shown in Figure 2. It takes the environment state $s$ and the joint action $a$ as input, and predicts the distributions of the next state $s'$ and the reward signal $r$. Specifically, the next state is modeled as a multivariate Gaussian distribution $s' \sim \mathcal{N}(\mu_s(s, a|\theta), \Sigma_s(s, a|\theta))$, and the reward is modeled as a univariate Gaussian $r \sim \mathcal{N}(\mu_r(s, a|\theta), \sigma_r(s, a|\theta)^2)$. To train the environment model, we construct a replay buffer $B$ and store the online samples in it throughout the training process. At each iteration, we sample data batches from buffer $B$ to minimize the following negative log-likelihood of the true data:

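$$L_{model}(\theta) = L_{trans}(\theta) + L_{reward}(\theta), \tag{15}$$

$$L_{trans}(\theta) = \mathbb{E}_{(s,a,s',r) \sim B}\left[-\log p(s'|\mu_s, \Sigma_s)\right] = \mathbb{E}_{(s,a,s',r) \sim B}\left[\frac{1}{2}(s' - \mu_s)^\top \Sigma_s^{-1} (s' - \mu_s) + \log|\Sigma_s|\right], \tag{16}$$

$$L_{reward}(\theta) = \mathbb{E}_{(s,a,s',r) \sim B}\left[-\log p(r|\mu_r, \sigma_r^2)\right] = \mathbb{E}_{(s,a,s',r) \sim B}\left[\frac{(r - \mu_r)^2}{\sigma_r^2} + \log \sigma_r^2\right], \tag{17}$$

where $\mu_s, \Sigma_s, \mu_r, \sigma_r$ are short for $\mu_s(s, a|\theta), \Sigma_s(s, a|\theta), \mu_r(s, a|\theta), \sigma_r(s, a|\theta)$, the predictions of the environment model $f_\theta$.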
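A minimal NumPy sketch of such a Gaussian negative log-likelihood loss is given below. It assumes a diagonal covariance for the next-state distribution and uses the standard form of the Gaussian log-density with the additive constant dropped, so constant factors may differ from the expressions above; the function names and array shapes are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def gaussian_nll(target, mean, var):
    """Per-sample NLL of a diagonal Gaussian, additive constant dropped.

    Computes 0.5 * sum_d [ (x_d - mu_d)^2 / var_d + log var_d ].
    All inputs have shape (batch, dim).
    """
    target, mean, var = (np.asarray(x, dtype=float) for x in (target, mean, var))
    per_dim = (target - mean) ** 2 / var + np.log(var)
    return 0.5 * per_dim.sum(axis=-1)

def model_loss(next_state, reward, mu_s, var_s, mu_r, var_r):
    """L_model = L_trans + L_reward, each averaged over the batch.

    next_state, mu_s, var_s: shape (batch, state_dim)
    reward, mu_r, var_r:     shape (batch,)
    """
    l_trans = gaussian_nll(next_state, mu_s, var_s).mean()
    l_reward = gaussian_nll(np.asarray(reward)[:, None],
                            np.asarray(mu_r)[:, None],
                            np.asarray(var_r)[:, None]).mean()
    return float(l_trans + l_reward)

# Hypothetical batch of two transitions with a 3-dimensional state.
next_state = np.array([[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]])
reward = np.array([0.5, -0.5])
perfect = model_loss(next_state, reward, next_state, np.ones((2, 3)),
                     reward, np.ones(2))
shifted = model_loss(next_state, reward, next_state + 1.0, np.ones((2, 3)),
                     reward, np.ones(2))
print(perfect, shifted)
```

Predicting the variance alongside the mean lets the model down-weight transitions it considers noisy, at the cost of the `log var` penalty that keeps the variance from growing without bound.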

Moreover, many multi-agent scenarios are partially observable, meaning that agents can only observe a portion of the environment's state. For such POMDP environments, we additionally introduce a projection network $h_\phi$ that predicts the individual observations $o_i$ from the global state $s$. Thus, for these POMDP cases, the training loss includes one additional term for observation prediction:

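$$L_{model}(\theta, \phi) = L_{trans}(\theta) + L_{reward}(\theta) + L_{projection}(\phi), \tag{18}$$

$$L_{projection}(\phi) = \mathbb{E}_{(s, \{o_i\}_{i=1}^{N}) \sim B}\left[-\sum_{i=1}^{N} \log p\left(o_i \,|\, \mu_o(s, i|\phi), \Sigma_o(s, i|\phi)\right)\right], \tag{19}$$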

where µo(s, i|ϕ), Σo(s, i|ϕ) are the outputs of the projection network hϕ.

[Figure 3 legend: Agent 1's Best-Response (BR) policy curve against Agent 0; Lookahead approximation; Policy adjustment based on lookahead; Start Point; Optimal Goal.]

Figure 3: Visualization results on the toy environment. (a) Landscape of the two-variable function, where darker regions have higher function values; (b) Learning curves of the algorithms; (c) Optimization process of Lookahead and MAPPO.

Off-policy Value Estimation. The intuition of our work is to modify the sampling policy $\mu$ in Equation (11) so as to derive a better optimization objective with a smaller upper bound on the regret of the updated policy. However, the term $A^{\pi^k}$ should be kept unchanged, which means that we still want to estimate the advantage with respect to the policy of the last round. For previous methods, this is easy to achieve because the trajectories are sampled by $\pi^k$ and the advantage can be estimated directly. In our algorithm design, in contrast, the sampling policy is replaced with the lookahead policy $\hat{\pi}^{k+1}$, which means that we need to conduct off-policy estimation of $A^{\pi^k}$. Specifically, we adopt the V-trace (Espeholt et al., 2018) technique to estimate $A^{\pi^k}$ with the trajectories sampled by $\hat{\pi}^{k+1}$.
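For reference, a compact single-trajectory version of the V-trace targets from Espeholt et al. (2018) is sketched below. It is a generic illustration rather than the authors' implementation: the function name, the argument layout, and the default truncation levels `rho_bar = c_bar = 1.0` are assumptions. In the setting described above, `ratios[t]` would be the probability of the taken action under the target policy $\pi^k$ divided by its probability under the sampling policy $\hat{\pi}^{k+1}$.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets and advantages for one trajectory.

    rewards[t]  : reward received at step t
    values[t]   : current value estimate V(x_t)
    ratios[t]   : target_prob / behavior_prob of the action taken at step t
    bootstrap_value : value estimate of the state after the last step
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    ratios = np.asarray(ratios, dtype=float)

    rhos = np.minimum(rho_bar, ratios)  # truncated weights on the TD errors
    cs = np.minimum(c_bar, ratios)      # truncated trace coefficients
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    # Backward recursion:
    # v_t - V(x_t) = delta_t + gamma * c_t * (v_{t+1} - V(x_{t+1}))
    vs_minus_v = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    vs = values + vs_minus_v

    # Policy-gradient advantage: rho_t * (r_t + gamma * v_{t+1} - V(x_t))
    next_vs = np.append(vs[1:], bootstrap_value)
    advantages = rhos * (rewards + gamma * next_vs - values)
    return vs, advantages

# On-policy check (all ratios equal to 1): targets reduce to discounted returns.
vs, adv = vtrace_targets(rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0],
                         bootstrap_value=0.0, ratios=[1.0, 1.0, 1.0], gamma=0.5)
print(vs, adv)
```

Truncating the importance ratios bounds the variance of the estimate, which matters here because the lookahead policy can differ noticeably from $\pi^k$.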

Overall Flow of the Algorithm. Combining all the algorithmic design techniques introduced above, we propose a practical algorithm that can enhance the underlying multi-agent policy gradient method. The overall flow of our algorithm is presented in Algorithm 1. In line 4, we update the environment model. In line 5, we obtain the estimated future teammate policy $\hat{\pi}^{k+1}$ within the environment model, while in line 6 we use $\hat{\pi}^{k+1}$ as the sampling policy to update the actual policy in the real environment.
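The control flow of Algorithm 1 can be summarized as the following skeleton. Every callable passed in (`fit_model`, `rollout_in_model`, `rollout_in_env`, `policy_update`) is a hypothetical placeholder that the user would supply; the skeleton only fixes the order of operations and which policy is used for sampling at each step, with `policy_update(policy, trajectories)` standing in for $\psi$ applied to $\pi^k$ with the given training trajectories.

```python
from typing import Any, Callable, List

def train_with_lookahead(
    policy: Any,
    num_iterations: int,
    fit_model: Callable[[List[Any]], Any],
    rollout_in_model: Callable[[Any, Any], List[Any]],
    rollout_in_env: Callable[[Any], List[Any]],
    policy_update: Callable[[Any, List[Any]], Any],
) -> Any:
    """Skeleton of Algorithm 1; all callables are user-supplied placeholders."""
    buffer: List[Any] = []
    for _ in range(num_iterations):
        # Line 4: refresh the environment model from the replay buffer.
        model = fit_model(buffer)
        # Line 5: sample inside the model with pi^k and predict the next policy.
        model_trajs = rollout_in_model(model, policy)
        lookahead_policy = policy_update(policy, model_trajs)
        # Line 6: sample in the real environment with the lookahead policy,
        # then update pi^k itself using those trajectories.
        real_trajs = rollout_in_env(lookahead_policy)
        policy = policy_update(policy, real_trajs)
        # Line 7: keep the real trajectories for future model training.
        buffer.extend(real_trajs)
    return policy

# Toy usage with integer "policies": each update increments the policy index.
env_samplers: List[int] = []

def _env_rollout(p: int) -> List[int]:
    env_samplers.append(p)  # record which policy sampled the real environment
    return [p]

final_policy = train_with_lookahead(
    policy=0,
    num_iterations=3,
    fit_model=lambda buf: len(buf),
    rollout_in_model=lambda model, p: [p],
    rollout_in_env=_env_rollout,
    policy_update=lambda p, trajs: p + 1,
)
print(final_policy, env_samplers)
```

Only the trajectories from line 6 consume real environment interactions; the lookahead prediction in line 5 is paid for with model rollouts, which is the point of the model-based design.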

4 Experiments

In this section, we validate our proposed approach empirically through experiments on diverse benchmarks. These benchmarks include a toy environment, which serves to illustrate the algorithmic process of our approach, and intricate cooperative multi-agent scenarios that provide practical validation of our approach's effectiveness. Specifically, we use these experimental results to investigate the following questions: 1) How does our algorithm work and can we analyze the underlying mechanism through one simple task (Section 4.1)? 2) Can our algorithm actually enhance cooperation learning in complex multi-agent cooperative tasks (Section 4.2)? 3) Does the behavior exhibited by our algorithm in complex cooperative tasks still align with our analysis (Section 4.3)?

4.1 Algorithm Analysis in Toy Environment

To visually reveal how our method works, we devised a toy environment involving a two-variable function optimization problem. As depicted in Figure 3(a), this problem comprises two agents with continuous action spaces in the range [0, 1]. Whenever the agents execute a joint action $[a_0, a_1]$, the environment yields the reward $R = a_0^3 - 2(a_0 - a_1)^2 + 3a_0$.


[Figure 4 panels: (a) 2x3-Agent HalfCheetah; (b) academy 3 vs 1 with Keeper]

Figure 4: Some example task scenarios of the Multi-Agent MuJoCo (MA-MuJoCo), Google Research Football (GRF), and StarCraft Multi-Agent Challenge (SMAC) environments.

Table 1: Evaluation results of various methods on MA-MuJoCo tasks, providing average scores across 5 seeds with standard errors in parentheses. The highest score for each task scenario is bolded and the top-2 scores are marked in blue. The average rank denotes the average ranking of each method across all task scenarios.

| Algorithm | Ant 2x4 | Ant 4x2 | HalfCheetah 2x3 | HalfCheetah 3x2 | Walker2d 2x3 | Walker2d 3x2 | Average Rank |
|---|---|---|---|---|---|---|---|
| Lookahead | **3393.39(336.98)** | **2858.22(540.76)** | **3315.83(346.46)** | **3687.10(304.98)** | 2048.30(309.27) | 2670.71(123.74) | 1.33 |
| HAPPO | 2471.15(201.23) | 2120.18(168.48) | 2910.18(39.38) | 3016.20(80.90) | **2544.62(272.91)** | **2780.65(76.25)** | 2.00 |
| TAPPO | 2055.21(182.25) | 2503.62(207.66) | 2154.73(443.24) | 3487.52(711.65) | 1475.63(215.27) | 1445.56(350.34) | 3.67 |
| MAPPO | 1034.19(18.99) | 1002.15(18.30) | 2160.29(503.76) | 2350.32(477.29) | 1852.8(41.48) | 1812.54(139.44) | 4.33 |
| IPPO | 884.93(46.24) | 875.8(20.84) | 2652.04(635.31) | 2477.98(566.12) | 2021.89(67.64) | 1775.2(122.09) | 4.33 |
| MADDPG | 1866.08(9.77) | 1701.08(13.45) | 1553.54(251.6) | 1295.5(384.87) | 71.68(15.36) | 100.72(35.19) | 5.33 |

Specifically, we initialize the joint policy as [0.1, 0.1] and apply both MAPPO and our lookahead strategy to investigate how they converge to the optimal policy [1, 1]. As shown in Figure 3(c), without our lookahead strategy, agent 1 always shows a large gap from the Best-Response (BR) against the updated agent 0, revealing the phenomenon of teammate delay. In contrast, our lookahead strategy helps agent 1 predict the updated policy of agent 0, leading to a shorter optimization path. From the learning curve in Figure 3(b), we also find that our lookahead strategy converges to the optimal policy using far fewer samples, improving learning efficiency.
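The qualitative effect can be reproduced with a much simpler stand-in for the learning algorithms. The sketch below replaces MAPPO with exact gradient ascent on deterministic actions, which is an assumption made purely for illustration, and it assumes the reward $R = a_0^3 - 2(a_0 - a_1)^2 + 3a_0$ (the signs are our reading of the formula). The step size, tolerance, and function names are likewise hypothetical.

```python
import numpy as np

def grad(a0, a1):
    """Gradient of the assumed reward R = a0**3 - 2*(a0 - a1)**2 + 3*a0."""
    g0 = 3 * a0 ** 2 - 4 * (a0 - a1) + 3
    g1 = 4 * (a0 - a1)
    return g0, g1

def independent_step(a0, a1, lr):
    """Each agent ascends its gradient against the *current* teammate action."""
    g0, g1 = grad(a0, a1)
    return (float(np.clip(a0 + lr * g0, 0.0, 1.0)),
            float(np.clip(a1 + lr * g1, 0.0, 1.0)))

def lookahead_step(a0, a1, lr):
    """Each agent ascends its gradient against the *predicted* teammate action."""
    p0, p1 = independent_step(a0, a1, lr)  # one-step prediction of both agents
    g0, _ = grad(a0, p1)                   # agent 0 against predicted agent 1
    _, g1 = grad(p0, a1)                   # agent 1 against predicted agent 0
    return (float(np.clip(a0 + lr * g0, 0.0, 1.0)),
            float(np.clip(a1 + lr * g1, 0.0, 1.0)))

def steps_to_converge(step_fn, lr=0.05, tol=1e-3, max_steps=1000):
    """Number of updates until both actions are within tol of the optimum [1, 1]."""
    a0, a1 = 0.1, 0.1  # same starting point as in the toy experiment
    for step in range(1, max_steps + 1):
        a0, a1 = step_fn(a0, a1, lr)
        if max(1.0 - a0, 1.0 - a1) < tol:
            return step, (a0, a1)
    return max_steps, (a0, a1)

baseline_steps, baseline_final = steps_to_converge(independent_step)
lookahead_steps, lookahead_final = steps_to_converge(lookahead_step)
print("independent updates:", baseline_steps, baseline_final)
print("lookahead updates:  ", lookahead_steps, lookahead_final)
```

In this simplified setting agent 1 has zero gradient at the start, since it already matches agent 0, and only begins to move after agent 0 has changed; with the one-step prediction it starts moving immediately.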

The algorithm analysis in this straightforward objective optimization task helps provide an intuitive explanation about the algorithm mechanism and motivation behind our approach. In the subsequent sections, we explore whether the proposed lookahead strategy can indeed enhance the cooperative learning in more complex task scenarios.

4.2 Main Results in Complex Cooperative Tasks

4.2.1 Experiment Setup

To investigate the effectiveness of our approach in more practical task scenarios, this section focuses on several prevalent cooperative benchmark environments, including continuous control tasks from Multi-Agent MuJoCo (MA-MuJoCo) (de Witt et al., 2020b) and Google Research Football (GRF) (Kurach et al., 2020) games with discrete action spaces. In addition, we include experiments on the StarCraft Multi-Agent Challenge (SMAC), which enables complex tasks involving larger team sizes. Examples of task scenarios from these three environments are illustrated in Figure 4. Each environment is introduced below.

Multi-Agent MuJoCo (MA-MuJoCo). The MA-MuJoCo environment is built upon the MuJoCo physics engine to create realistic simulations for MARL research. Specifically, MA-MuJoCo partitions the body graph in MuJoCo into disjoint sub-graphs, one for each agent; e.g., 2x4-Agent Ant means dividing the 8 joints of Ant between 2 agents, each controlling 4 joints.

Table 2: Evaluation results of various methods on GRF tasks. GRF 3vs1, CA (hard) and Corner are respectively short for the maps academy 3 vs 1 with keeper, academy counterattack hard and academy corner in the GRF environment.

| Algorithm | GRF 3vs1 | GRF CA (hard) | GRF Corner | Average Rank |
|---|---|---|---|---|
| Lookahead | 0.82(0.02) | 0.50(0.07) | 0.63(0.03) | 1.33 |
| HAPPO | 0.86(0.03) | 0.46(0.09) | 0.49(0.07) | 2.00 |
| TAPPO | 0.77(0.03) | 0.49(0.03) | 0.28(0.10) | 3.00 |
| MAPPO | 0.66(0.04) | 0.37(0.08) | 0.48(0.11) | 3.67 |
| CDS | 0.49(0.11) | 0.21(0.07) | 0.02(0.01) | 5.00 |

Table 3: Evaluation results of various methods on SMAC tasks, providing average scores across 5 seeds with standard errors in parentheses.

| Algorithm | 3s5z | 5m_vs_6m | MMM2 | Average Rank |
|---|---|---|---|---|
| Lookahead | 94.2(2.31) | 90.5(2.51) | 90.6(1.04) | 1.00 |
| HAPPO | 92.2(1.74) | 87.5(2.04) | 87.1(2.20) | 4.00 |
| TAPPO | 90.3(5.60) | 89.1(2.51) | 88.2(2.90) | 3.33 |
| MAPPO | 91.9(1.42) | 88.2(2.35) | 89.1(2.79) | 2.67 |
| IPPO | 91.9(2.51) | 87.5(2.93) | 87.5(2.51) | 3.67 |
| QMIX | 88.3(2.90) | 75.8(3.70) | 87.5(2.60) | 5.33 |

Google Research Football (GRF). Google Research Football (GRF) is a novel benchmark environment offering simulations of soccer matches, enabling the study of multi-agent behaviors and reinforcement learning. It introduces challenging cooperation learning tasks because of its heterogeneity and sparse rewards.

StarCraft Multi-Agent Challenge (SMAC). The SMAC environment is also a popular benchmark for MARL, set in micromanagement scenarios with multi-agent and multi-unit battles. The goal is for the ally units to cooperate to defeat the enemy units. SMAC allows for the design of complex and challenging cooperative tasks due to the diversity of unit types and maps.

To thoroughly explore the cooperative performance that our approach can bring about, we integrate our lookahead strategy with HAPPO, one of the current state-of-the-art multi-agent policy gradient algorithms, in all experiments of this section. For comparison, we select several popular multi-agent actor-critic algorithms as baselines. An opponent modeling approach, TAPPO, is also included; it learns teammate representations as additional policy conditions, as in previous methods (Papoudakis & Albrecht, 2020; Cao et al., 2023). We adopt this baseline to contrast our approach with traditional opponent modeling approaches in mitigating the non-stationarity issue arising from teammate co-learning. Moreover, in the GRF environment, we add one additional baseline, CDS (Li et al., 2021), a value-based MARL algorithm designed specifically for solving GRF games, for a more comprehensive comparison. More details about the baselines can be found in Appendix B.1.

4.2.2 Results Analysis

MA-MuJoCo As shown in Table 1, across multiple task scenarios of MA-MuJoCo, our approach Lookahead achieves superior cooperative performance compared to the baseline algorithms. For the Ant and HalfCheetah tasks, our approach consistently achieves the highest scores among all methods. These results imply that, by predicting the potential future policies of the other agents controlling their respective joints, our algorithm helps agents learn to manipulate their own joints in better coordination with the other agents, which ultimately enhances cooperative performance. Although it does not achieve the highest score, our approach also attains top-2 performance in the Walker2d scenarios. We hypothesize that the slight performance loss might be due to the nature of the Walker2d task, which requires not only proficient walking but also keeping the mechanical legs balanced at all times. Failures to maintain the legs' balance may pose great challenges to learning the environment model, which in turn has a negative impact on the performance of our algorithm. The choice of environment model learning method is orthogonal to our algorithm. In the future, we will consider designing better model learning methods to further enhance the performance of our approach.

Google Research Football (GRF) Similar to the results in MA-MuJoCo, the results in Table 2 demonstrate that our approach also achieves superior performance in GRF problems. In particular, our approach is the only one that attains a success rate exceeding 60% in the academy corner scenario, achieving a notable improvement of over 25% compared to the second-best algorithm. Moreover, the average rank of our approach also stands out as the best in the GRF environment, the same as in MA-MuJoCo.

StarCraft Multi-Agent Challenge (SMAC) The results in Table 3 report the win rates of different approaches in SMAC, where our approach consistently achieves the best performance across three different task scenarios. These results, especially those on the MMM2 task, which involves ten ally agents, demonstrate the ability of our approach to adapt to more complex cooperative tasks, supporting its generality and adaptability.

Ablation Study The comparison in Tables 1 to 3 between the Lookahead and HAPPO algorithms across the benchmark environments can be seen as an ablation study that assesses the impact of the introduced lookahead strategy. Across the majority of task scenarios, incorporating our lookahead strategy results in enhanced cooperative performance compared to the original HAPPO algorithm, e.g., exceeding the second-best algorithm by around 1000 points in 2x4-Agent Ant, which validates the efficacy of the proposed lookahead strategy.

4.3 Analysis of Lookahead Policy in Complex Problems

While we have analyzed the algorithmic mechanism in a toy environment, applying the algorithm to complex tasks is more intricate because it involves model learning. In this section, we measure policy distances to investigate whether our lookahead approximation still provides the right direction information in MA-MuJoCo. Specifically, we compute the difference between $D_{KL}(\pi^{k+1}, \pi^k)$ and $D_{KL}(\pi^{k+1}, \hat{\pi}^{k+1})$, i.e., $D_{KL}(\pi^{k+1}, \pi^k) - D_{KL}(\pi^{k+1}, \hat{\pi}^{k+1})$, where $\hat{\pi}^{k+1}$ denotes the lookahead prediction of the future teammate policy. As Figure 5 shows, this metric stays positive throughout the training process, which reveals that $\hat{\pi}^{k+1}$ is closer than $\pi^k$ to the future teammate policy $\pi^{k+1}$ and thus forms a relatively good approximation. These results support our motivation; more analytical experimental results can be found in Appendix C.
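The metric above can be illustrated with a minimal sketch. It assumes discrete action distributions at a single state, and all numbers and function names (`kl`, `kl_distance_difference`) are hypothetical choices made for illustration; how the KL terms are estimated for the continuous MA-MuJoCo policies is not specified here.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions (lists)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_distance_difference(pi_next, pi_curr, pi_pred):
    """D_KL(pi_next || pi_curr) - D_KL(pi_next || pi_pred).

    A positive value means the prediction pi_pred is closer to the
    actual updated policy pi_next than the current policy pi_curr is.
    """
    return kl(pi_next, pi_curr) - kl(pi_next, pi_pred)

# Hypothetical action distributions of one teammate at one state.
pi_curr = [0.50, 0.30, 0.20]  # current teammate policy (round k)
pi_next = [0.70, 0.20, 0.10]  # actual updated teammate policy (round k+1)
pi_pred = [0.65, 0.22, 0.13]  # lookahead prediction of the round k+1 policy

diff = kl_distance_difference(pi_next, pi_curr, pi_pred)
print("prediction is closer than current policy:", diff > 0)
```

In practice such a quantity would be averaged over sampled states and agents; the single-state version only shows the sign convention behind Figure 5.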

5 Related Work

The related work of this paper mainly covers three aspects: multi-agent policy gradient, multi-agent model learning, and opponent modeling. Below, we introduce the related work in each of these aspects. Among them, opponent modeling approaches are typically proposed to address the general non-stationarity issue in multi-agent systems, considering the dynamic policy changes of both teammates and opponents. While related, our identified teammate delay problem is particularly pronounced in on-policy MARL algorithms for cooperative settings and is more closely tied to algorithm design.

Figure 5: Measuring policy distances in Multi-Agent MuJoCo environments (panels: 2x4-Agent Ant and 4x2-Agent Ant; 2x3-Agent HalfCheetah and 3x2-Agent HalfCheetah; x-axis: timesteps from 0 to 5M; y-axis: KL distance difference). The y-axis shows the difference in distances from $\pi^k$ and $\hat{\pi}^{k+1}$ to $\pi^{k+1}$, i.e., $D_{KL}(\pi^{k+1}, \pi^k) - D_{KL}(\pi^{k+1}, \hat{\pi}^{k+1})$, and the plot illustrates how this metric changes throughout the policy training process.

Multi-Agent Policy Gradient Multi-agent policy gradient algorithms offer better convergence stability than value-based algorithms, and they can handle continuous control tasks. IA2C (Chu et al., 2019) introduces the A2C method to the multi-agent setting and adopts an independent learning paradigm. Subsequently, COMA (Foerster et al., 2018b) proposes the paradigm of a centralized critic with decentralized actors, which performs credit assignment for each agent by introducing a counterfactual baseline. MAAC (Iqbal & Sha, 2019) and DOP (Wang et al., 2020) improve policy gradient methods by introducing an attention mechanism into the critic network and by conducting value decomposition for the centralized critic, respectively. On the other hand, IPPO (de Witt et al., 2020a) and MAPPO (Yu et al., 2022a) extend the trust-region policy optimization scheme to the multi-agent setting and obtain remarkable performance. However, all the policy gradient methods above directly optimize the agent policy associated with the current teammates and may suffer from the teammate delay issue. Recently, HAPPO (Kuba et al., 2021) introduces a sequential update scheme to the multi-agent policy gradient algorithm, which considers the mutual influences between different agents' policy updates. Nevertheless, it adopts an importance sampling technique that suffers from high variance, and it is orthogonal to our algorithm. More discussion is provided in Appendix B.2. Besides, there exist algorithms that introduce the deterministic policy gradient to the multi-agent setting (Lowe et al., 2017), while we mainly consider stochastic policy gradient methods here.

Multi-Agent Model Learning Model-based reinforcement learning enjoys higher sample efficiency. However, multi-agent model learning faces significant challenges due to the exponential growth of the state-action space and the non-stationarity in multi-agent scenarios. Adopting the Dreamer (Hafner et al., 2019) architecture, MAMBA (Egorov & Shpilman, 2022) maintains a world model for each agent with the necessary communication, so that it scales gracefully with the number of agents. Mahajan et al. (2021) show that utilizing tensor decomposition in multi-agent model learning can significantly improve sample efficiency when the environment transition and reward functions are of low CP-rank. Krupnik et al. (2020) adopt generative models to learn a multi-step world model that can account for the delayed effects of previous actions. Besides, considering the characteristics of multi-agent settings, AORPO (Zhang et al., 2021) and CTRL (Park et al., 2019) incorporate opponent modeling into model learning in order to roll out opponent-wise trajectories. Once the environment model has been obtained, Dyna-style algorithms (Zhang et al., 2022; Willemsen et al., 2021) conduct data augmentation to enhance policy learning. MBVD (Xu et al., 2022) evaluates the current state value by imagining future states within the model. Han et al. (2022) conduct credit assignment by computing the Shapley value (Winter, 2002) using samples rolled out in the model. In the future, a promising direction is to use generative world models for multi-agent environment modeling to enhance the learning of multi-agent policies, as these generative world models have already shown considerable potential (Hu et al., 2023; Bruce et al., 2024; Qiao et al., 2024).

Opponent Modeling Opponent modeling is a well-studied topic in the field of MARL. Some previous works utilize opponent modeling to alleviate the non-stationarity issue in MARL. Among them, some works (Hong et al., 2018; Papoudakis & Albrecht, 2020; Xie et al., 2021; Cao et al., 2023) use opponent representations as additional inputs to the policy network, thereby enhancing policy learning, while AMS-A3C and AFS-A3C (Hernandez-Leal et al., 2019) treat opponent modeling as an auxiliary task to guide the network optimization. Besides, another series of works assumes that opponents are uncertain or may change, and aims to help recognize and adapt to the opponents. DPN-BPR+ (Zheng et al., 2018) and MBOM (Yu et al., 2022b) estimate the most probable types of opponents from a statistical perspective, while Fastap (Zhang et al., 2023) further considers that teammates may change within one episode and learns an instantaneous representation to achieve fast recognition of teammate changes. Moreover, there exists another series of works (Foerster et al., 2018a; Willi et al., 2022; Lu et al., 2022) that proposes a better update operator for general-sum games by modeling the influence of an agent's policy on the other agent. However, these methods are limited to simple two-player problems, while our work focuses on complex cooperative tasks.

6 Closing Remarks

This paper introduces a pioneering approach to enhance cooperative MARL by anticipating future teammate policies. By alleviating the prevalent issue of teammate delay, our proposed lookahead strategy bridges the gap between the learning objective and the real evaluation scenario, significantly boosting learning efficiency. Through seamless integration with existing gradient-based MARL methods, our approach surpasses state-of-the-art algorithms and exhibits good performance in complex cooperative multi-agent benchmarks. Currently, our method mainly relies on the environment model to predict the future teammates. Thus, the practical algorithm performance is to some extent limited by the model learning error. How to better estimate the future teammates, and whether there exist other ways to harness the predicted information about future teammates, deserve further investigation. We believe research on this topic can bring great advancement to the MARL domain.

Acknowledgments

This work is supported by the National Science Foundation of China (62276126, 62495093, 62250069) and the Natural Science Foundation of Jiangsu (BK2024119, BK20243039).

References

Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In International Conference on Machine Learning, pp. 4603-4623, 2024.

Jiahan Cao, Lei Yuan, Jianhao Wang, Shaowei Zhang, Chongjie Zhang, Yang Yu, and De-Chuan Zhan. Linda: Multi-agent local information decomposition for awareness of teammates. Science China Information Sciences, 66(8):1-17, 2023.

Tianshu Chu, Jie Wang, Lara Codecà, and Zhaojian Li. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 21(3):1086-1095, 2019.

Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.

Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020a.

Christian Schroeder de Witt, Bei Peng, Pierre-Alexandre Kamienny, Philip Torr, Wendelin Böhmer, and Shimon Whiteson. Deep multi-agent reinforcement learning for decentralized continuous cooperative control. arXiv preprint arXiv:2003.06709, 2020b.

Vladimir Egorov and Alexei Shpilman. Scalable multi-agent model-based reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 381-390, 2022.

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp. 1407-1416, 2018.

Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In International Conference on Autonomous Agents and Multiagent Systems, pp. 122-130, 2018a.

Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In AAAI Conference on Artificial Intelligence, pp. 2974-2982, 2018b.

James W Friedman. A non-cooperative equilibrium for supergames. The Review of Economic Studies, 38(1):1-12, 1971.

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2019.

Dongge Han, Chris Xiaoxuan Lu, Tomasz Michalak, and Michael Wooldridge. Multiagent model-based credit assignment for continuous control. In International Conference on Autonomous Agents and Multiagent Systems, pp. 571-579, 2022.

Pablo Hernandez-Leal, Bilal Kartal, and Matthew E Taylor. Agent modeling as auxiliary task for deep reinforcement learning. In AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, pp. 31-37, 2019.

Zhang-Wei Hong, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, and Chun-Yi Lee. A deep policy inference Q-network for multi-agent systems. In International Conference on Autonomous Agents and Multiagent Systems, pp. 1388-1396, 2018.

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.

Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 2961-2970, 2019.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pp. 267-274, 2002.

Orr Krupnik, Igor Mordatch, and Aviv Tamar. Multi-agent reinforcement learning with multi-step generative models. In Conference on Robot Learning, pp. 776-790, 2020.

Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, and Yaodong Yang. Trust region policy optimisation in multi-agent reinforcement learning. In International Conference on Learning Representations, 2021.

Karol Kurach, Anton Raichuk, Piotr Stanczyk, Michal Zajac, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, and Sylvain Gelly. Google research football: A novel reinforcement learning environment. In AAAI Conference on Artificial Intelligence, pp. 4501-4510, 2020.

Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In International Conference on Autonomous Agents and Multiagent Systems, pp. 464-473, 2017.

Chenghao Li, Tonghan Wang, Chengjie Wu, Qianchuan Zhao, Jun Yang, and Chongjie Zhang. Celebrating diversity in shared multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3991-4002, 2021.

Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379-6390, 2017.

Christopher Lu, Timon Willi, Christian A Schroeder De Witt, and Jakob Foerster. Model-free opponent shaping. In International Conference on Machine Learning, pp. 14398-14411, 2022.

Anuj Mahajan, Mikayel Samvelyan, Lei Mao, Viktor Makoviychuk, Animesh Garg, Jean Kossaifi, Shimon Whiteson, Yuke Zhu, and Animashree Anandkumar. Tesseract: Tensorised actors for multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 7301-7312, 2021.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928-1937, 2016.

Afshin Oroojlooy and Davood Hajinezhad. A review of cooperative multi-agent deep reinforcement learning. Applied Intelligence, 53(11):13677-13722, 2023.

Georgios Papoudakis and Stefano V Albrecht. Variational autoencoders for opponent modeling in multi-agent systems. In AAAI Workshop on Reinforcement Learning in Games, 2020.

Young Joon Park, Yoon Sang Cho, and Seoung Bum Kim. Multi-agent reinforcement learning with approximate model learning for competitive games. PLoS ONE, 14(9):e0222215, 2019.

Shuofei Qiao, Runnan Fang, Ningyu Zhang, Yuqi Zhu, Xiang Chen, Shumin Deng, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. Agent planning with world knowledge model. arXiv preprint arXiv:2405.14205, 2024.

John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, and Chongjie Zhang. DOP: Off-policy multi-agent decomposed policy gradients. In International Conference on Learning Representations, 2020.

Yutong Wang, Mehul Damani, Pamela Wang, Yuhong Cao, and Guillaume Sartoretti. Distributed reinforcement learning for robot teams: A review. Current Robotics Reports, 3(4):239-257, 2022.

Daniël Willemsen, Mario Coppola, and Guido CHE de Croon. Mambpo: Sample-efficient multi-robot reinforcement learning using learned world models. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5635-5640, 2021.

Timon Willi, Alistair Hp Letcher, Johannes Treutlein, and Jakob Foerster. COLA: Consistent learning with opponent-learning awareness. In International Conference on Machine Learning, pp. 23804-23831, 2022.

Eyal Winter. The Shapley value. Handbook of Game Theory with Economic Applications, 3:2025-2054, 2002.

Annie Xie, Dylan Losey, Ryan Tolsma, Chelsea Finn, and Dorsa Sadigh. Learning latent representations to influence multi-agent interaction. In Conference on Robot Learning, pp. 575-588, 2021.

Zhiwei Xu, Dapeng Li, Bin Zhang, Yuan Zhan, Yunpeng Bai, and Guoliang Fan. Mingling foresight with imagination: Model-based cooperative multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 11327-11340, 2022.

Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative multi-agent games. In Advances in Neural Information Processing Systems, pp. 24611-24624, 2022a.

Xiaopeng Yu, Jiechuan Jiang, Wanpeng Zhang, Haobin Jiang, and Zongqing Lu. Model-based opponent modeling. In Advances in Neural Information Processing Systems, pp. 28208-28221, 2022b.

Lei Yuan, Ziqian Zhang, Lihe Li, Cong Guan, and Yang Yu. A survey of progress on cooperative multi-agent reinforcement learning in open environment. arXiv preprint arXiv:2312.01058, 2023.

Qizhen Zhang, Chris Lu, Animesh Garg, and Jakob Foerster. Centralized model and exploration policy for multi-agent RL. In International Conference on Autonomous Agents and Multiagent Systems, pp. 1500-1508, 2022.

Weinan Zhang, Xihuai Wang, Jian Shen, and Ming Zhou. Model-based multi-agent policy optimization with adaptive opponent-wise rollouts. In International Joint Conference on Artificial Intelligence, pp. 3384-3391, 2021.

Ziqian Zhang, Lei Yuan, Lihe Li, Ke Xue, Chengxing Jia, Cong Guan, Chao Qian, and Yang Yu. Fast teammate adaptation in the presence of sudden policy change. arXiv preprint arXiv:2305.05911, 2023.

Yan Zheng, Zhaopeng Meng, Jianye Hao, Zongzhang Zhang, Tianpei Yang, and Changjie Fan. A deep Bayesian policy reuse approach against non-stationary agents. In Advances in Neural Information Processing Systems, pp. 960-970, 2018.

A Notations and Theoretical Analysis

A.1 Notations

In Table 4, we list the main notations in our paper.

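Table 4: Notation list.

| Symbol | Meaning |
| --- | --- |
| $\mathcal{S}, s$ | $\mathcal{S}$ denotes the state space for either the single-agent problem or the multi-agent problem, while $s \in \mathcal{S}$ is an instance of the state. |
| $\mathcal{A}, a$ | $\mathcal{A}$ denotes the action space for the single-agent problem, while $a$ is an instance of the action. |
| $N$ | Number of agents in multi-agent problems. |
| $\boldsymbol{\mathcal{A}}, \{\mathcal{A}_i\}_{i=1}^{N}$ | $\boldsymbol{\mathcal{A}}$ is the joint action space for the multi-agent problem; $\mathcal{A}_i$ is the action space for agent $i$. |
| $\boldsymbol{a}, \{a_i\}_{i=1}^{N}$ | $\boldsymbol{a} = [a_1, a_2, \ldots, a_N]$ is an instance of the joint action, where $a_i$ is the action for agent $i$. |
| $P$ | Transition function for either the single-agent problem or the multi-agent problem. |
| $R$ | Reward function for either the single-agent problem or the multi-agent problem. |
| $\gamma$ | Discount factor. |
| $\pi$ | $\pi$ denotes the policy for single-agent problems, where $\pi(a \mid s)$ is the probability of taking action $a$ under state $s$. |
| $\boldsymbol{\pi}, \{\pi_i\}_{i=1}^{N}$ | $\boldsymbol{\pi}$ denotes the joint policy for multi-agent problems, while $\pi_i$ is the policy for agent $i$. $\boldsymbol{\pi}(\boldsymbol{a} \mid s) = \prod_{i=1}^{N} \pi_i(a_i \mid s)$ is the probability of taking joint action $\boldsymbol{a}$ under state $s$. |
| $\boldsymbol{\pi}^k$ | The joint policy obtained after the $k$-th round of policy update. |
| $\eta(\boldsymbol{\pi})$ | The discounted return of joint policy $\boldsymbol{\pi}$ in multi-agent problems, i.e., $\eta(\boldsymbol{\pi}) = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. |
| $d_{\boldsymbol{\pi}}(s), d_{\boldsymbol{\pi}}(s, \boldsymbol{a})$ | The stationary state visitation distribution and state-action visitation distribution under the fixed joint policy $\boldsymbol{\pi}$. |
| $\rho_{\boldsymbol{\pi}}(s)$ | The unnormalized stationary state distribution derived by the joint policy $\boldsymbol{\pi}$. Specifically, $\rho_{\boldsymbol{\pi}}(s) = \frac{1}{1-\gamma} d_{\boldsymbol{\pi}}(s)$. |
| $Q_{\pi^k}(s, a), Q_{\boldsymbol{\pi}^k}(s, \boldsymbol{a})$ | $Q_{\pi^k}(s, a)$ represents the Q-function in single-agent problems, defined as the expected cumulative reward obtained by taking action $a$ in state $s$ and following policy $\pi^k$ thereafter, i.e., $Q_{\pi^k}(s, a) = \mathbb{E}_{\pi^k}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$. $Q_{\boldsymbol{\pi}^k}(s, \boldsymbol{a})$ is its counterpart for multi-agent problems. |
| $V_{\pi^k}(s), V_{\boldsymbol{\pi}^k}(s)$ | $V_{\pi^k}(s)$ represents the state value function in single-agent problems, i.e., the expected cumulative reward when starting from state $s$ and following policy $\pi^k$ thereafter: $V_{\pi^k}(s) = \mathbb{E}_{\pi^k}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$. $V_{\boldsymbol{\pi}^k}(s)$ is its counterpart for multi-agent problems. |
| $\alpha_i$ | $\alpha_i = \max_s D_{TV}\left(\pi_i^{k+1}(\cdot \mid s) \,\Vert\, \mu_i(\cdot \mid s)\right)$ denotes the distance between the sampling policy and the updated policy (the policy at the next round). |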
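As a small illustration of the relation $\rho_{\boldsymbol{\pi}}(s) = \frac{1}{1-\gamma} d_{\boldsymbol{\pi}}(s)$ listed above, the following sketch accumulates the unnormalized discounted visitation of a two-state Markov chain. It assumes that $d_{\boldsymbol{\pi}}$ is the normalized discounted visitation distribution and that the fixed policy has already been folded into the transition matrix; the matrix `P`, the initial distribution `p0`, and the function name are hypothetical.

```python
def discounted_visitation(P, p0, gamma, horizon=2000):
    """Return rho(s) = sum_t gamma^t * Pr(s_t = s) for a Markov chain.

    P[s][s2] is the state transition probability under a fixed policy,
    p0 is the initial state distribution. The infinite sum is truncated
    at `horizon`, which is ample for gamma = 0.9.
    """
    n = len(p0)
    rho = [0.0] * n
    dist = list(p0)
    weight = 1.0
    for _ in range(horizon):
        for s in range(n):
            rho[s] += weight * dist[s]
        # Propagate the state distribution one step forward.
        dist = [sum(dist[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
        weight *= gamma
    return rho

# Hypothetical two-state chain induced by some fixed joint policy.
P = [[0.9, 0.1], [0.4, 0.6]]
p0 = [1.0, 0.0]
gamma = 0.9

rho = discounted_visitation(P, p0, gamma)
d = [(1.0 - gamma) * r for r in rho]  # normalized visitation distribution
print(sum(rho), sum(d))
```

Under these assumptions the entries of `rho` sum to $1/(1-\gamma)$, which is where the $\frac{1}{1-\gamma}$ factors in the later bounds come from.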

A.2 Proofs of Main Theoretical Results

In this section, we provide the proofs of the main theoretical results in our paper. Specifically, we first state the primary theoretical results and then prove them one by one. Among them, Lemma 1, Theorem 1, and Theorem 2 are introduced in the main text, while Theorem 3 is introduced in the supplementary discussion in Appendix A.3.

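Statement 1 In previous multi-agent policy gradient methods, the learning objective for each agent is typically defined as:

$$J_i\big(\pi_i \mid \{\pi_j^k\}_{j \neq i}\big) = \eta(\boldsymbol{\pi}^k) + \sum_{s \in \mathcal{S}} \rho_{\boldsymbol{\pi}^k}(s) \sum_{\boldsymbol{a} \in \boldsymbol{\mathcal{A}}} \pi_i(a_i|s) \prod_{j \neq i} \pi_j^k(a_j|s) A_{\boldsymbol{\pi}^k}(s, \boldsymbol{a}), \quad i \in \{1, 2, \ldots, N\}. \tag{20}$$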

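Lemma 1 Assume that we update the joint policy $\boldsymbol{\pi}^k$ to $\boldsymbol{\pi}^{k+1}$ with sampling policy $\boldsymbol{\mu}$. Given the measurement of distance between the sampling policy $\boldsymbol{\mu}$ and the updated policy $\boldsymbol{\pi}^{k+1}$ as $\alpha_i = \max_s D_{TV}\big(\pi_i^{k+1}(\cdot|s) \,\|\, \mu_i(\cdot|s)\big)$, we have: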

|J(πk+1, µ) η(πk+1)| 4ϵγ (1 γ)2

i=1 αi, (21)

where γ is the discount factor and ϵ = maxs,a |Aπk(s, a)|.

Corollary 1 Suppose that we update joint policy πk to πk+1 with sampling policy µ, then the regret of the updated joint policy πk+1 has the following upper bound:

η(π ) η(πk+1) η(π ) J(πk+1, µ) + 4ϵγ (1 γ)2

i=1 αi. (22)

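Theorem 1 Suppose the sampling policy $\boldsymbol{\mu}^*$ can derive the same updated policy, which means that $\boldsymbol{\mu}^* = \arg\max_{\boldsymbol{\pi}} J(\boldsymbol{\pi}, \boldsymbol{\mu}^*)$. If it exists, it will be the solution of the following bi-level optimization problem:

$$\arg\min_{\boldsymbol{\mu}} D_{KL}(\boldsymbol{\mu} \,\|\, \boldsymbol{\pi}^{k+1}), \quad \text{s.t.} \quad \boldsymbol{\pi}^{k+1} = \arg\max_{\boldsymbol{\pi}} J(\boldsymbol{\pi}, \boldsymbol{\mu}). \tag{23}$$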

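Theorem 2 Let $\phi$ be the ego-max-operator (see footnote 2). We suppose that $\boldsymbol{\mu}^*$ denotes the lookahead policy, which means that it can derive the same updated policy, i.e., $\boldsymbol{\mu}^* = \psi(\boldsymbol{\mu}^*, \boldsymbol{\pi}^k)$; and $\boldsymbol{\pi}' = \psi(\boldsymbol{\pi}^k, \boldsymbol{\pi}^k)$ denotes the updated policy when using $\boldsymbol{\pi}^k$ as the sampling policy. We express the trust region as $D_{TV}\big(\pi'_i(\cdot|s) \,\|\, \pi_i^k(\cdot|s)\big) \le \beta_i$.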

In this case, when βi P

s S ρµ (s)ϕ(s,Aπk )

s S ρπk (s)ϕ(s,|Aπk |), we have J(µ , µ ) J(π , πk).

Below are proofs.

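Statement 1 In previous multi-agent policy gradient methods, the learning objective for each agent is typically defined as:

$$J_i\big(\pi_i \mid \{\pi_j^k\}_{j \neq i}\big) = \eta(\boldsymbol{\pi}^k) + \sum_{s \in \mathcal{S}} \rho_{\boldsymbol{\pi}^k}(s) \sum_{\boldsymbol{a} \in \boldsymbol{\mathcal{A}}} \pi_i(a_i|s) \prod_{j \neq i} \pi_j^k(a_j|s) A_{\boldsymbol{\pi}^k}(s, \boldsymbol{a}), \quad i \in \{1, 2, \ldots, N\}. \tag{24}$$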

Proof. From the previous literature, we already know that in the single-agent setting the learning objective is typically defined as:

$$J(\pi) = \eta(\pi^k) + \sum_{s \in \mathcal{S}} \rho_{\pi}(s) \sum_{a \in \mathcal{A}} \pi(a|s) A_{\pi^k}(s, a). \tag{25}$$

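² Assuming $f$ is a function defined over the state and joint action space, the ego-max-operator $\phi$ is defined as $\phi(f, s) = \frac{1}{N} \sum_{i=1}^{N} \max_{a_i \in \mathcal{A}_i} \sum_{\boldsymbol{a}_{-i}} \prod_{j \neq i} \pi_j^k(a_j|s) f(s, \boldsymbol{a})$.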

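However, since $\pi$ is the policy we are optimizing, we cannot obtain training data drawn exactly from its distribution. Thus, in practice, the objective is typically transformed into:

$$\begin{aligned} J(\pi) &= \eta(\pi^k) + \sum_{s \in \mathcal{S}} \rho_{\pi^k}(s) \sum_{a \in \mathcal{A}} \pi^k(a|s) \frac{\pi(a|s)}{\pi^k(a|s)} A_{\pi^k}(s, a) & (26) \\ &= \eta(\pi^k) + \mathbb{E}_{(s,a) \sim \pi^k}\left[ \frac{\pi(a|s)}{\pi^k(a|s)} A_{\pi^k}(s, a) \right]. & (27) \end{aligned}$$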

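This allows us to optimize the training objective by sampling trajectories with $\pi^k$, the policy from the last round. The multi-agent case is similar. Previous multi-agent policy gradient methods typically sample training data with the joint policy $\boldsymbol{\pi}^k$, and the ideal goal is to optimize the following objective:

$$\begin{aligned} J(\boldsymbol{\pi}) &= \eta(\boldsymbol{\pi}^k) + \sum_{s \in \mathcal{S}} \rho_{\boldsymbol{\pi}^k}(s) \sum_{\boldsymbol{a} \in \boldsymbol{\mathcal{A}}} \boldsymbol{\pi}^k(\boldsymbol{a}|s) \frac{\boldsymbol{\pi}(\boldsymbol{a}|s)}{\boldsymbol{\pi}^k(\boldsymbol{a}|s)} A_{\boldsymbol{\pi}^k}(s, \boldsymbol{a}) & (28) \\ &= \eta(\boldsymbol{\pi}^k) + \mathbb{E}_{(s,\boldsymbol{a}) \sim \boldsymbol{\pi}^k}\left[ \frac{\boldsymbol{\pi}(\boldsymbol{a}|s)}{\boldsymbol{\pi}^k(\boldsymbol{a}|s)} A_{\boldsymbol{\pi}^k}(s, \boldsymbol{a}) \right]. & (29) \end{aligned}$$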

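However, since the candidate space of the joint policy $\boldsymbol{\pi}$ is huge, it is hard to optimize directly. Previous multi-agent policy gradient methods therefore make a trade-off and optimize a decomposed objective for each agent (Yu et al., 2022a):

$$J_i(\pi_i) = \eta(\boldsymbol{\pi}^k) + \mathbb{E}_{(s,\boldsymbol{a}) \sim \boldsymbol{\pi}^k}\left[ \frac{\pi_i(a_i|s)}{\pi_i^k(a_i|s)} A_{\boldsymbol{\pi}^k}(s, \boldsymbol{a}) \right], \quad i \in \{1, 2, \ldots, N\}. \tag{30}$$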

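Since $\boldsymbol{\pi}^k(\boldsymbol{a}|s) = \prod_{j=1}^{N} \pi_j^k(a_j|s)$, the factor $\pi_i^k(a_i|s)$ cancels against the importance ratio, so Equation (30) can be expressed as:

$$J_i\big(\pi_i \mid \{\pi_j^k\}_{j \neq i}\big) = \eta(\boldsymbol{\pi}^k) + \sum_{s \in \mathcal{S}} \rho_{\boldsymbol{\pi}^k}(s) \sum_{\boldsymbol{a} \in \boldsymbol{\mathcal{A}}} \pi_i(a_i|s) \prod_{j \neq i} \pi_j^k(a_j|s) A_{\boldsymbol{\pi}^k}(s, \boldsymbol{a}), \quad i \in \{1, 2, \ldots, N\}. \tag{31}$$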
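The step from the importance-sampling form to the explicit sum form can be checked numerically. The sketch below is a minimal illustration at a single state with two agents and two actions each; the policies, the advantage table `adv`, and the function names are hypothetical, and the constant $\eta(\boldsymbol{\pi}^k)$ and the state weighting are omitted because they are identical in both forms.

```python
from itertools import product

def joint_prob(policies, joint_action):
    """Probability of a joint action under a product (factorized) policy."""
    p = 1.0
    for pol, a in zip(policies, joint_action):
        p *= pol[a]
    return p

def joint_actions(policies):
    """Enumerate all joint actions as tuples of per-agent action indices."""
    return product(*(range(len(p)) for p in policies))

def surrogate_is_form(i, pi_i, pi_k, adv):
    """Importance-sampling form: E_{a ~ pi^k}[ pi_i(a_i) / pi_i^k(a_i) * A(a) ]."""
    total = 0.0
    for a in joint_actions(pi_k):
        ratio = pi_i[a[i]] / pi_k[i][a[i]]
        total += joint_prob(pi_k, a) * ratio * adv[a]
    return total

def surrogate_sum_form(i, pi_i, pi_k, adv):
    """Sum form: sum_a pi_i(a_i) * prod_{j != i} pi_j^k(a_j) * A(a)."""
    mixed = list(pi_k)
    mixed[i] = pi_i  # agent i uses the candidate policy, teammates stay at pi^k
    return sum(joint_prob(mixed, a) * adv[a] for a in joint_actions(pi_k))

# Hypothetical current policies, candidate policy for agent 0, and advantages.
pi_k = [[0.6, 0.4], [0.5, 0.5]]
pi_0 = [0.8, 0.2]
adv = {(0, 0): 1.0, (0, 1): -0.5, (1, 0): -0.5, (1, 1): 0.25}

is_value = surrogate_is_form(0, pi_0, pi_k, adv)
sum_value = surrogate_sum_form(0, pi_0, pi_k, adv)
print(abs(is_value - sum_value) < 1e-12)
```

Note that in both forms the teammates' factors remain fixed at $\pi_j^k$, which is exactly the dependence on current teammates that the teammate delay discussion targets.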

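Lemma 1 Assume that we update the joint policy $\boldsymbol{\pi}^k$ to $\boldsymbol{\pi}^{k+1}$ with sampling policy $\boldsymbol{\mu}$. Given the measurement of distance between the sampling policy $\boldsymbol{\mu}$ and the updated policy $\boldsymbol{\pi}^{k+1}$ as $\alpha_i = \max_s D_{TV}\big(\pi_i^{k+1}(\cdot|s) \,\|\, \mu_i(\cdot|s)\big)$, we have: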

|J(πk+1, µ) η(πk+1)| 4ϵγ (1 γ)2

i=1 αi, (32)

where γ is the discount factor and ϵ = maxs,a |Aπk(s, a)|.

Proof. Firstly, according to the performance difference lemma (Kakade & Langford, 2002), we have:

η(πk+1) J(πk+1, µ)

s S ρπk+1(s) X

a A πk+1(a|s)Aπk(s, a)

a A πi(ai|s) Y

j =i µj(aj|s)Aπk(s, a)

s S ρπk+1(s) X

a A πk+1(a|s)Aπk(s, a) X

s S ρµ(s) X

a A πk+1(a|s)Aπk(s, a)

s S ρµ(s) X

a A πk+1(a|s)Aπk(s, a)

a A πi(ai|s) Y

j =i µj(aj|s)Aπk(s, a)

s S ρπk+1(s) X

a A πk+1(a|s)Aπk(s, a) X

s S ρµ(s) X

a A πk+1(a|s)Aπk(s, a)

s S ρµ(s) X

a A πk+1(a|s)Aπk(s, a)

a A πi(ai|s) Y

j =i µj(aj|s)Aπk(s, a)

(I) 4ϵγ (1 γ)2 α2 +

s S ρµ(s) X

a A πk+1(a|s)Aπk(s, a)

a A πi(ai|s) Y

j =i µj(aj|s)Aπk(s, a)

where (I) holds because of the conclusion that has already been obtained in TRPO (Schulman et al., 2015) (see Theorem 1). Besides, we further have:

s S ρµ(s) X

a A πk+1(a|s)Aπk(s, a)

a A πk+1 i (ai|s) Y

j =i µj(aj|s)Aπk(s, a)

s S ρµ(s) 1

ai Ai πk+1 i (ai|s) X

j =i µj(aj|s) Y

j =i πk+1 j (aj|s)

s S ρµ(s) 2

ai Ai πk+1 i (ai|s)DTV πk+1 i ( |s) µ i( |s)

i=1 DTV πk+1 i ( |s) µ i( |s)

i=1 max s DTV πk+1 i ( |s) µ i( |s) ,

where (II) holds according to the definition of TV distance. Thus, we finally have:

η(πk+1) J(πk+1, µ) 4ϵγ (1 γ)2 α2 + 2ϵ

i=1 max s DTV πk+1 i ( |s) µ i( |s)

= 4ϵγ (1 γ)2 α2 + 2ϵ(N 1)

(III) 4ϵγ (1 γ)2

where (III) holds because $\alpha \le \sum_{i=1}^{N} \alpha_i$. Specifically, term (a) in the final upper bound is due to the state distribution mismatch of the training trajectories, while term (b) reveals the impact of the teammate delay phenomenon on the learning objective.
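Step (III) relies on the fact that the total variation distance between two product distributions is at most the sum of the per-factor distances. The sketch below checks this at a single state, assuming that $\alpha$ denotes the joint-policy counterpart of the per-agent $\alpha_i$; the three-agent policies and the helper names are hypothetical.

```python
from itertools import product

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def joint_dist(policies):
    """Flattened joint action distribution of a factorized joint policy."""
    dist = []
    for a in product(*(range(len(p)) for p in policies)):
        prob = 1.0
        for pol, ai in zip(policies, a):
            prob *= pol[ai]
        dist.append(prob)
    return dist

# Hypothetical per-agent action distributions at one state.
pi_next = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]  # updated policies (round k+1)
mu = [[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]]       # sampling policies

alpha_joint = tv(joint_dist(pi_next), joint_dist(mu))
alpha_sum = sum(tv(p, q) for p, q in zip(pi_next, mu))
print(alpha_joint <= alpha_sum)
```

In this example the joint distance is strictly smaller than the sum, so the inequality can be loose; the lemma only needs it as an upper bound.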

Corollary 1 Suppose that we update joint policy πk to πk+1 with sampling policy µ, then the regret of the updated joint policy πk+1 has the following upper bound:

η(π ) η(πk+1) η(π ) J(πk+1, µ) + 4ϵγ (1 γ)2

i=1 αi. (36)

Proof. η(π ) η(πk+1)

= η(π ) J(πk+1, µ) + J(πk+1, µ) η(πk+1)

η(π ) J(πk+1, µ) + J(πk+1, µ) η(πk+1)

(IV) η(π ) J(πk+1, µ) + 4ϵγ (1 γ)2

where (IV) is obtained due to Lemma 1.

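Theorem 1 Suppose the sampling policy $\boldsymbol{\mu}^*$ can derive the same updated policy, which means that $\boldsymbol{\mu}^* = \arg\max_{\boldsymbol{\pi}} J(\boldsymbol{\pi}, \boldsymbol{\mu}^*)$. If it exists, it will be the solution of the following bi-level optimization problem:

$$\arg\min_{\boldsymbol{\mu}} D_{KL}(\boldsymbol{\mu} \,\|\, \boldsymbol{\pi}^{k+1}), \quad \text{s.t.} \quad \boldsymbol{\pi}^{k+1} = \arg\max_{\boldsymbol{\pi}} J(\boldsymbol{\pi}, \boldsymbol{\mu}). \tag{38}$$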

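Proof. This theorem implicitly conveys a twofold meaning. First, we prove that $\boldsymbol{\mu}^*$ is an optimal solution of this optimization problem. Second, we prove that every solution $\boldsymbol{\mu}'$ of this optimization problem derives the same updated policy when serving as the sampling policy.

First, for $\boldsymbol{\mu} = \boldsymbol{\mu}^*$, the corresponding $\boldsymbol{\pi}^{k+1} = \arg\max_{\boldsymbol{\pi}} J(\boldsymbol{\pi}, \boldsymbol{\mu})$ is also $\boldsymbol{\mu}^*$, by the definition of $\boldsymbol{\mu}^*$. Then we have:

$$D_{KL}(\boldsymbol{\mu} \,\|\, \boldsymbol{\pi}^{k+1}) = D_{KL}(\boldsymbol{\mu}^* \,\|\, \boldsymbol{\mu}^*) = 0. \tag{39}$$

Moreover, since $D_{KL}(\boldsymbol{\mu} \,\|\, \boldsymbol{\pi}^{k+1}) \ge 0$, $\boldsymbol{\mu}^*$ achieves the lowest value of $D_{KL}(\boldsymbol{\mu} \,\|\, \boldsymbol{\pi}^{k+1})$. Thus, $\boldsymbol{\mu}^*$ is one solution of this bi-level optimization problem.

Second, suppose $\boldsymbol{\mu}'$ is one solution of this bi-level optimization problem. Then, according to the definition of this problem, we should have:

$$D_{KL}\big(\boldsymbol{\mu}' \,\|\, \arg\max_{\boldsymbol{\pi}} J(\boldsymbol{\pi}, \boldsymbol{\mu}')\big) \le D_{KL}(\boldsymbol{\mu}^* \,\|\, \boldsymbol{\mu}^*) = 0. \tag{40}$$

It can be inferred that $D_{KL}\big(\boldsymbol{\mu}' \,\|\, \arg\max_{\boldsymbol{\pi}} J(\boldsymbol{\pi}, \boldsymbol{\mu}')\big) = 0$, implying that $\arg\max_{\boldsymbol{\pi}} J(\boldsymbol{\pi}, \boldsymbol{\mu}')$ and $\boldsymbol{\mu}'$ have the same distributions. Thus we know $\boldsymbol{\mu}' = \arg\max_{\boldsymbol{\pi}} J(\boldsymbol{\pi}, \boldsymbol{\mu}')$.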
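The fixed-point reading of Theorem 1 can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a single-state, two-agent cooperative game with a hypothetical shared payoff table, and it replaces the exact $\arg\max$ update with a softmax (entropy-smoothed) best response so that the KL divergence stays finite. Iterating the update until the sampling policy reproduces itself yields a policy whose KL gap to its own update is numerically zero.

```python
import math

def softmax(xs, temperature):
    """Numerically stable softmax with a temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical shared payoff: PAYOFF[a1][a2] for the two agents' actions.
PAYOFF = [[1.0, 0.0], [0.0, 0.5]]

def update(mu, temperature=0.1):
    """Assumed stand-in for the update operator: each agent soft-best-responds
    to the teammate policy contained in the sampling policy mu."""
    mu1, mu2 = mu
    q1 = [sum(PAYOFF[a1][a2] * mu2[a2] for a2 in range(2)) for a1 in range(2)]
    q2 = [sum(PAYOFF[a1][a2] * mu1[a1] for a1 in range(2)) for a2 in range(2)]
    return [softmax(q1, temperature), softmax(q2, temperature)]

def lookahead_fixed_point(mu, iters=100):
    """Repeatedly feed the updated policy back in as the sampling policy."""
    for _ in range(iters):
        mu = update(mu)
    return mu

def kl_gap(mu):
    """Sum over agents of D_KL(mu_i || updated_i) when mu is the sampling policy."""
    return sum(kl(m, u) for m, u in zip(mu, update(mu)))

mu0 = [[0.5, 0.5], [0.5, 0.5]]
mu_star = lookahead_fixed_point(mu0)
print(kl_gap(mu0), kl_gap(mu_star))
```

The uniform starting policy has a clearly positive gap, while the iterated policy has a gap of essentially zero, matching the characterization of the lookahead policy as a minimizer of the bi-level objective. Plain fixed-point iteration is not guaranteed to converge in general; it does here because the toy update is a contraction near the solution.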

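Theorem 2 Let $\phi$ be the ego-max-operator. We suppose that $\boldsymbol{\mu}^*$ denotes the lookahead policy, which means that it can derive the same updated policy, i.e., $\boldsymbol{\mu}^* = \psi(\boldsymbol{\mu}^*, \boldsymbol{\pi}^k)$; and $\boldsymbol{\pi}' = \psi(\boldsymbol{\pi}^k, \boldsymbol{\pi}^k)$ denotes the updated policy when using $\boldsymbol{\pi}^k$ as the sampling policy. We express the trust region as $D_{TV}\big(\pi'_i(\cdot|s) \,\|\, \pi_i^k(\cdot|s)\big) \le \beta_i$.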

In this case, when βi P

s S ρµ (s)ϕ(s,Aπk )

s S ρπk (s)ϕ(s,|Aπk |), we have J(µ , µ ) J(π , πk).

Proof. To begin with, for the updated policy π = ψ(πk, πk) using πk as the sampling policy, we have:

J(π , πk) = η(πk) + X

a A π i(ai|s) Y

j =i πk j (aj|s)Aπk(s, a)

= |J(π , πk) J(πk, πk)|

π i(ai|s) πk i (ai|s) Y

j =i πk j (aj|s)Aπk(s, a)

π i(ai|s) πk i (ai|s) Y

j =i πk j (aj|s)|Aπk(s, a)|

i=1 DTV(π i( |s) πk i ( |s)) max ai A

j =i πk j (aj|s)|Aπk(s, a)|.

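We have the trust-region condition that, for all $s \in \mathcal{S}$, $D_{\mathrm{TV}}\big(\pi'_i(\cdot|s) \,\|\, \pi^k_i(\cdot|s)\big) \le \beta_i$. Assuming that $\beta_i$ has an upper bound $\zeta$ for each agent $i$, we further have:

$$
\begin{aligned}
& |J(\pi', \pi^k) - J(\pi^k, \pi^k)| \le 2\zeta \sum_{s \in \mathcal{S}} \rho_{\pi^k}(s) \frac{1}{N} \sum_{i=1}^{N} \max_{a_i \in \mathcal{A}_i} \sum_{a_{-i}} \prod_{j \neq i} \pi^k_j(a_j|s)\, |A_{\pi^k}(s, a)| \\
\Longrightarrow\ & J(\pi', \pi^k) \le J(\pi^k, \pi^k) + 2\zeta \sum_{s \in \mathcal{S}} \rho_{\pi^k}(s) \frac{1}{N} \sum_{i=1}^{N} \max_{a_i \in \mathcal{A}_i} \sum_{a_{-i}} \prod_{j \neq i} \pi^k_j(a_j|s)\, |A_{\pi^k}(s, a)| \\
\Longrightarrow\ & J(\pi', \pi^k) \le \eta(\pi^k) + 2\zeta \sum_{s \in \mathcal{S}} \rho_{\pi^k}(s) \frac{1}{N} \sum_{i=1}^{N} \max_{a_i \in \mathcal{A}_i} \sum_{a_{-i}} \prod_{j \neq i} \pi^k_j(a_j|s)\, |A_{\pi^k}(s, a)|.
\end{aligned} \tag{42}
$$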

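We know that $J(\mu^*, \mu^*) = \max_{\pi \in \mathrm{Ball}(\mu^*)} J(\pi, \mu^*)$, where $\mathrm{Ball}(\mu^*)$ denotes the trust region of $\mu^*$. Further, this optimization problem means that:

$$\arg\max_{\pi \in \mathrm{Ball}(\mu^*)} J(\pi, \mu^*) = \arg\max_{\pi \in \mathrm{Ball}(\mu^*)} \Big[ \eta(\pi^k) + \sum_{s \in \mathcal{S}} \rho_{\mu^*}(s) \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in \mathcal{A}} \pi_i(a_i|s) \prod_{j \neq i} \mu^*_j(a_j|s)\, A_{\pi^k}(s, a) \Big]. \tag{43}$$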

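Thus, we are actually optimizing $\frac{1}{N} \sum_{i=1}^{N} \sum_{a \in \mathcal{A}} \pi_i(a_i|s) \prod_{j \neq i} \mu^*_j(a_j|s)\, A_{\pi^k}(s, a)$ for each $s \in \mathcal{S}$; and when the optimized result is exactly $\mu^*$, it means that, for each $s \in \mathcal{S}$, $\mu^*(\cdot|s)$ is a Nash equilibrium of the cooperative game in which the utility of action $a$ is defined as $A_{\pi^k}(s, a)$. With a proper updating scheme, it is reasonable to expect that we can obtain an equilibrium that satisfies:

$$
\begin{aligned}
\sum_{a \in \mathcal{A}} \mu^*_i(a_i|s) \prod_{j \neq i} \mu^*_j(a_j|s)\, A_{\pi^k}(s, a)
&= \sum_{a_i \in \mathcal{A}_i} \mu^*_i(a_i|s) \sum_{a_{-i}} \prod_{j \neq i} \mu^*_j(a_j|s)\, A_{\pi^k}(s, a) \\
&\ge \max_{a_i \in \mathcal{A}_i} \sum_{a_{-i}} \prod_{j \neq i} \pi^k_j(a_j|s)\, A_{\pi^k}(s, a).
\end{aligned} \tag{44}
$$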


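Thus we have:

$$
\begin{aligned}
J(\mu^*, \mu^*) &= \eta(\pi^k) + \sum_{s \in \mathcal{S}} \rho_{\mu^*}(s) \frac{1}{N} \sum_{i=1}^{N} \sum_{a \in \mathcal{A}} \mu^*_i(a_i|s) \prod_{j \neq i} \mu^*_j(a_j|s)\, A_{\pi^k}(s, a) \\
&\ge \eta(\pi^k) + \sum_{s \in \mathcal{S}} \rho_{\mu^*}(s) \frac{1}{N} \sum_{i=1}^{N} \max_{a_i \in \mathcal{A}_i} \sum_{a_{-i}} \prod_{j \neq i} \pi^k_j(a_j|s)\, A_{\pi^k}(s, a).
\end{aligned} \tag{45}
$$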

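We define $\phi(f, s) = \frac{1}{N} \sum_{i=1}^{N} \max_{a_i \in \mathcal{A}_i} \sum_{a_{-i}} \prod_{j \neq i} \pi^k_j(a_j|s)\, f(s, a)$. Then, when

$$\beta_i \le \zeta \le \frac{\sum_{s \in \mathcal{S}} \rho_{\mu^*}(s)\, \phi(A_{\pi^k}, s)}{2 \sum_{s \in \mathcal{S}} \rho_{\pi^k}(s)\, \phi(|A_{\pi^k}|, s)},$$

according to Equation (42) and Equation (45), it follows that $J(\mu^*, \mu^*) \ge J(\pi', \pi^k)$.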
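To make the ego-max-operator and the total-variation step of the proof concrete, the following is a minimal single-state sketch. It assumes two agents with two actions each, and the advantage table `ADV` and the policies `pi_k` and `pi_new` are hypothetical numbers chosen only for illustration. The helper names (`marginal`, `ego_max`, `surrogate_gap`) are not from the paper. The sketch checks numerically that the change in the surrogate objective is bounded by $2\zeta\,\phi(|A|, s)$, where $\zeta$ bounds the per-agent total-variation distance.

```python
from itertools import product


def marginal(f, pi_k, i, a_i, n_actions):
    """Sum over teammates' actions of prod_{j != i} pi_k[j][a_j] * f(joint action)."""
    n_agents = len(pi_k)
    others = [j for j in range(n_agents) if j != i]
    total = 0.0
    for a_others in product(range(n_actions), repeat=len(others)):
        joint = [0] * n_agents
        joint[i] = a_i
        prob = 1.0
        for j, a_j in zip(others, a_others):
            joint[j] = a_j
            prob *= pi_k[j][a_j]
        total += prob * f(tuple(joint))
    return total


def ego_max(f, pi_k, n_actions):
    """phi(f, s) at one fixed state: average over agents of the best marginal value."""
    n_agents = len(pi_k)
    return sum(
        max(marginal(f, pi_k, i, a_i, n_actions) for a_i in range(n_actions))
        for i in range(n_agents)
    ) / n_agents


def surrogate_gap(pi_new, pi_k, adv, n_actions):
    """Single-state version of J(pi', pi^k) - J(pi^k, pi^k)."""
    n_agents = len(pi_k)
    return sum(
        (pi_new[i][a_i] - pi_k[i][a_i]) * marginal(adv, pi_k, i, a_i, n_actions)
        for i in range(n_agents)
        for a_i in range(n_actions)
    ) / n_agents


def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


# Hypothetical numbers, chosen only for illustration.
ADV = {(0, 0): 1.5, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 0.5}
adv = ADV.__getitem__
pi_k = [[0.5, 0.5], [0.5, 0.5]]
pi_new = [[0.6, 0.4], [0.7, 0.3]]

zeta = max(total_variation(p, q) for p, q in zip(pi_new, pi_k))
gap = surrogate_gap(pi_new, pi_k, adv, 2)
bound = 2 * zeta * ego_max(lambda a: abs(adv(a)), pi_k, 2)
```

With these numbers the gap is far below the bound, which is expected: the bound replaces every signed advantage with its absolute value and every agent's shift with the worst case.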

A.3 Extra Analysis on Upper Bound

Theorem 1 tells us that the extra term (c) in the upper bound disappears when we train the agents with future teammate information, which motivates us to predict future teammates. However, a question remains: does eliminating term (c) indeed reduce the overall upper bound? One potential risk is that eliminating term (c) might influence the optimization of $J(\pi, \mu)$, thus making the second term $J(\pi^{k+1}, \mu)$ larger. To address this concern, we prove that, under certain conditions, it is at least no worse than the previous algorithms, as described below.

Theorem 2 To begin with, we introduce the policy update operator $\psi$³ and the ego-max-operator $\phi$⁴. Suppose that $\mu^*$ denotes the lookahead policy, meaning that it can derive the same updated policy, i.e., $\mu^* = \psi(\mu^*, \pi^k)$; and $\pi' = \psi(\pi^k, \pi^k)$ denotes the updated policy when using $\pi^k$ as the sampling policy. We express the trust region as $D_{\mathrm{TV}}\big(\pi'_i(\cdot|s) \,\|\, \pi^k_i(\cdot|s)\big) \le \beta_i$. In this case, when

$$\beta_i \le \frac{\sum_{s \in \mathcal{S}} \rho_{\mu^*}(s)\, \phi(A_{\pi^k}, s)}{2 \sum_{s \in \mathcal{S}} \rho_{\pi^k}(s)\, \phi(|A_{\pi^k}|, s)},$$

we have $J(\mu^*, \mu^*) \ge J(\pi', \pi^k)$.

For proof see Appendix A.2. This theorem shows that when we replace µ with an approximation of future teammate policies, under certain conditions we can at least obtain a smaller upper bound compared to the previous typical algorithms. Specifically, the required conditions are relevant to the trust-region setting.

B More Implementation Details

B.1 More Details about Baselines

To compare performance in our experiments, we first select the main multi-agent actor-critic algorithms as baselines, including MADDPG (Lowe et al., 2017), IPPO (de Witt et al., 2020a), MAPPO (Yu et al., 2022a) and HAPPO (Kuba et al., 2021). In addition, we design an opponent modeling algorithm, TAPPO (short for Teammate-Aware MAPPO), to further compare our approach with traditional opponent modeling techniques. Specifically, TAPPO learns teammate representations as extra policy conditions, as in previous works (Papoudakis & Albrecht, 2020; Cao et al., 2023), and is incorporated into MAPPO. Furthermore, in the GRF environment, we additionally include the CDS algorithm (Li et al., 2021), a value-based algorithm specifically designed for the GRF problems. It mainly designs a mechanism to enhance policy diversity among agents and, in its own experiments, surpasses typical value-based algorithms in the GRF environment.

³ $\psi(\mu, \pi^k)$ means the result of one round of policy update starting from $\pi^k$ using $\mu$ as the sampling policy, i.e., $\psi(\mu, \pi^k) = \arg\max_{\pi} J(\pi, \mu)$ within the trust region of $\pi^k$ for MAPPO.

⁴ Assuming $f$ is a function defined over the state and joint action space, the ego-max-operator $\phi$ is defined as $\phi(f, s) = \frac{1}{N} \sum_{i=1}^{N} \max_{a_i \in \mathcal{A}_i} \sum_{a_{-i}} \prod_{j \neq i} \pi^k_j(a_j|s)\, f(s, a)$.


B.2 Incorporating Lookahead Strategy into HAPPO

Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) (Kuba et al., 2021) is a recent work that introduced the sequential policy update scheme to the multi-agent policy gradient algorithm. It provides a theoretical monotonic improvement guarantee based on the multi-agent advantage decomposition lemma. The core idea of this work is to update the agents' policies in sequence. This approach allows subsequent agents to adapt their policies to the updated strategies of preceding agents, thereby mitigating, to some extent, the impact of the teammate delay phenomenon. However, it has two main issues that might impact its effectiveness:

1) Despite the sequential policy update scheme, the preceding agents in the sequence still learn to cooperate with the previous round of teammates, which means the teammate delay issue persists for the preceding agents. This makes the order of agent updates critically important (e.g., in the example from the Introduction section, if the Cook is updated first, the updates are more efficient; otherwise, sequential update yields no benefit). However, HAPPO adopts random update orders, which poses a significant limitation.

2) Since in practice, the training trajectories are sampled by the policy of the previous round, HAPPO adopts Importance Sampling to help the subsequent agents learn to cooperate with updated previous agents. This approach can lead to a higher variance in the policy gradients as we need to multiply it by an importance sampling ratio to correct the objective. This issue becomes exacerbated when dealing with a larger number of agents, as we need to accumulate the product of importance ratios for all preceding agents.
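The variance growth described in issue 2) can be seen in a simplified calculation. The sketch below is an illustration under strong assumptions that are not from the paper: a single state, every preceding agent shares the same hypothetical behavior distribution `q` and updated distribution `p`, and agents act independently. Each per-agent ratio then has mean 1, so the variance of the product of `m` ratios is $(\mathbb{E}[r^2])^m - 1$, which grows geometrically in the number of preceding agents.

```python
from fractions import Fraction


def ratio_second_moment(p, q):
    """E_{a~q}[(p(a)/q(a))^2] for discrete distributions p (updated) and q (behavior)."""
    return sum(pa * pa / qa for pa, qa in zip(p, q))


def product_ratio_variance(p, q, num_preceding):
    """Variance of the product of independent per-agent ratios, each with mean 1."""
    return ratio_second_moment(p, q) ** num_preceding - 1


# Hypothetical two-action policies: a modest shift from the behavior policy.
p = (Fraction(3, 5), Fraction(2, 5))
q = (Fraction(1, 2), Fraction(1, 2))

variances = [product_ratio_variance(p, q, m) for m in (1, 2, 4)]
```

Exact fractions are used so the geometric growth is visible without floating-point noise.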

Due to these two main issues, the effectiveness of HAPPO in practice may be limited, and it cannot fully resolve the teammate delay issue. In practice, our lookahead strategy can be seamlessly integrated with HAPPO, further enhancing its effectiveness, which has been validated by our empirical experiments. The detailed process is given in Algorithm 2, where the text highlighted in red emphasizes the uniqueness introduced by HAPPO.

Algorithm 2 Heterogeneous-Agent Proximal Policy Optimisation with Lookahead
Input: The number of agents $N$
Output: A cooperation policy for a multi-agent system
1: Initialize a replay buffer $B$;
2: Initialize a policy $\pi$ randomly;
3: for each iteration $k$ do
4:     Sample a batch of transitions from $B$ and update the environment model by minimizing the loss $L_{\mathrm{model}}$;
5:     Sample a batch of trajectories $\tau$ in the environment model with sampling policy $\pi^k$, and obtain $\hat{\pi}^{k+1}$ via maximizing $J(\pi, \pi^k)$ within the trust region in a sequential update scheme using the training trajectories $\tau$;
6:     Sample a batch of trajectories $\tau$ in the real environment with sampling policy $\hat{\pi}^{k+1}$, and obtain $\pi^{k+1}$ via maximizing $J(\pi, \hat{\pi}^{k+1})$ within the trust region in a sequential update scheme using the training trajectories $\tau$;
7:     Add trajectories $\tau$ to the buffer $B$;
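The control flow of one iteration of Algorithm 2 can be sketched as follows. This is only a structural outline, not the authors' code: every component (`fit_model`, `rollout`, `sequential_update`, `real_env`) is a hypothetical callable or object supplied by the caller, and the sketch assumes that both updates start from the current policy, in line with the definition of the update operator $\psi$.

```python
def lookahead_iteration(policy, buffer, fit_model, rollout, sequential_update, real_env):
    """One iteration of the loop in Algorithm 2, with every component injected.

    policy            -- current joint policy (any object the callables understand)
    buffer            -- list of real trajectories collected so far
    fit_model         -- buffer -> learned environment model             (step 4)
    rollout           -- (env_or_model, sampling_policy) -> trajectories
    sequential_update -- (start_policy, trajectories) -> updated policy
    real_env          -- handle to the real environment
    """
    model = fit_model(buffer)                                  # step 4
    model_traj = rollout(model, policy)                        # step 5: imagined data
    lookahead_policy = sequential_update(policy, model_traj)   # step 5: lookahead policy
    real_traj = rollout(real_env, lookahead_policy)            # step 6: real data
    new_policy = sequential_update(policy, real_traj)          # step 6: actual update
    buffer.extend(real_traj)                                   # step 7
    return new_policy, lookahead_policy
```

Injecting the components keeps the sketch independent of any particular model class or PPO implementation; only the ordering of model fitting, imagined rollout, real rollout, and buffer update is fixed.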

B.3 Details about Hyper-parameters

In this section, we first introduce the hyper-parameter configurations of our method in the experiments, and then describe how we tuned the hyper-parameters.


Table 5: Common hyper-parameters used across task scenarios of multi-agent MuJoCo. Note that lka is short for Lookahead, mlearn is short for model learning, and ppo stands for the Proximal Policy Optimization (PPO) algorithm.

| hyper-parameter | value | hyper-parameter | value | hyper-parameter | value |
|---|---|---|---|---|---|
| critic lr | 3e-4 | max grad norm | 10 | lka episode length | 20 |
| actor lr | 3e-4 | num rollouts | 40 | lka num mini-batches | 10 |
| gamma γ | 0.99 | ppo num mini-batches | 10 | lka entropy coef | 0.001 |
| optimizer | Adam | entropy coef | 0.01 | mlearn batch size | 512 |
| optim eps | 1e-5 | stacked-frames | 1 | | |

B.3.1 Hyper-parameter configuration

Here, we list the configuration of hyper-parameters that was utilized in our experiments to facilitate reproducing our experimental results. Note that for the common hyper-parameters, both the Lookahead algorithm and HAPPO adopted the same value in our experiments. Hence, the hyper-parameter configurations provided in this section are also applicable to our experimental results of the HAPPO algorithm.

Multi-agent MuJoCo For hyper-parameters that were set to the same values across all task scenarios, the configuration is provided in Table 5. Additionally, the varying hyper-parameter configurations across different tasks are provided in Table 6.

Table 6: Different hyper-parameters used across task scenarios of multi-agent MuJoCo. Note that lka is short for Lookahead, mlearn is short for model learning, and ppo stands for the Proximal Policy Optimization (PPO) algorithm.

| hyper-parameter | Ant | HalfCheetah | Walker2d |
|---|---|---|---|
| episode length | 200 | 400 | 200 |
| ppo num epochs | 10 | 20 | 10 |
| lka num rollouts | 2000 | 4000 | 2000 |
| lka num epochs | 10 | 20 | 10 |
| mlearn num epochs | 4000 | 2000 | 2000 |
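For reproduction, the shared and per-task settings of Tables 5 and 6 can be combined into a single configuration per task. The sketch below shows one possible way to do this; the dictionary layout, the underscore-style key names, and the `make_config` helper are assumptions made for illustration and do not reflect the authors' codebase. Only the values are taken from the tables.

```python
# Shared multi-agent MuJoCo settings (values from Table 5; key names are hypothetical).
COMMON = {
    "critic_lr": 3e-4,
    "actor_lr": 3e-4,
    "gamma": 0.99,
    "optimizer": "Adam",
    "optim_eps": 1e-5,
    "max_grad_norm": 10,
    "num_rollouts": 40,
    "ppo_num_mini_batches": 10,
    "entropy_coef": 0.01,
    "stacked_frames": 1,
    "lka_episode_length": 20,
    "lka_num_mini_batches": 10,
    "lka_entropy_coef": 0.001,
    "mlearn_batch_size": 512,
}

# Task-specific settings (values from Table 6).
PER_TASK = {
    "Ant": {"episode_length": 200, "ppo_num_epochs": 10, "lka_num_rollouts": 2000,
            "lka_num_epochs": 10, "mlearn_num_epochs": 4000},
    "HalfCheetah": {"episode_length": 400, "ppo_num_epochs": 20, "lka_num_rollouts": 4000,
                    "lka_num_epochs": 20, "mlearn_num_epochs": 2000},
    "Walker2d": {"episode_length": 200, "ppo_num_epochs": 10, "lka_num_rollouts": 2000,
                 "lka_num_epochs": 10, "mlearn_num_epochs": 2000},
}


def make_config(task):
    """Merge the shared settings with the overrides for one task."""
    return {**COMMON, **PER_TASK[task]}
```

For example, `make_config("HalfCheetah")` yields the shared learning rates together with the longer episode length and larger lookahead rollout count listed for that task.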

Google Research Football (GRF) In the three task scenarios of Google Research Football (GRF), we employed identical hyper-parameter configurations, which are detailed in Table 7.

Table 7: Common hyper-parameters used across task scenarios of Google Research Football (GRF). Note that lka is short for Lookahead, mlearn is short for model learning, and ppo stands for the Proximal Policy Optimization (PPO) algorithm.

| hyper-parameter | value | hyper-parameter | value | hyper-parameter | value |
|---|---|---|---|---|---|
| critic lr | 5e-4 | episode length | 100 | lka num rollouts | 100 |
| actor lr | 5e-4 | num rollouts | 10 | lka num mini-batches | 1 |
| gamma γ | 0.99 | ppo num mini-batches | 1 | lka num epochs | 15 |
| optimizer | Adam | ppo num epochs | 15 | lka entropy coef | 0 |
| optim eps | 1e-5 | entropy coef | 5e-3 | mlearn batch size | 1024 |
| max grad norm | 10 | lka episode length | 100 | mlearn num epochs | 800 |


[Figure 6 plots omitted. Panels: (a) Ant Task (curves: 2x4-Agent Ant, 4x2-Agent Ant); (b) HalfCheetah Task (curves: 2x3-Agent HalfCheetah, 3x2-Agent HalfCheetah); (c) Walker2d Task (curves: 2x3-Agent Walker2d, 3x2-Agent Walker2d). x-axis: Timesteps (0.0M to 5.0M); y-axis: KL Distance Difference.]

Figure 6: Measuring policy distances in multi-agent MuJoCo environment.

B.3.2 Hyper-parameter tuning strategy

Multi-agent MuJoCo For practical implementation efficiency, we employed JAX to implement our Lookahead algorithm. Additionally, to ensure a fair and effective comparison of the efficacy of our added lookahead strategy, the underlying HAPPO algorithm also used the same codebase. Furthermore, the fundamental hyper-parameters for both Lookahead and HAPPO were kept consistent. Consequently, we mainly tuned the hyper-parameters to align the performance of the HAPPO algorithm in our codebase with that reported in the original paper.

Google Research Football (GRF) Similar to the case in multi-agent MuJoCo, to ensure a fair comparison, we maintained consistent foundational hyper-parameters for both Lookahead and the underlying HAPPO algorithm. While tuning these hyper-parameters, we conducted a search within certain ranges to fine-tune the underlying HAPPO algorithm for reasonably good performance results. Specifically, we explored learning rate lr (including critic_lr and actor_lr) within the range of {1e-4, 5e-4, 1e-3}, number of learning epochs ppo_num_epochs within {10, 15, 20}, and entropy regularization coefficient entropy_coef within {1e-3, 5e-3}.
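The candidate values above define a small discrete search space. The sketch below enumerates it as a full grid; whether the search was exhaustive is not stated, so the grid is an assumption, as are the `SEARCH_SPACE` dictionary and the `grid` helper. A single `lr` entry stands in for both critic_lr and actor_lr, following the description that they were explored together.

```python
from itertools import product

# Candidate values listed for the GRF tuning of the underlying HAPPO algorithm.
SEARCH_SPACE = {
    "lr": [1e-4, 5e-4, 1e-3],          # applied to both critic_lr and actor_lr
    "ppo_num_epochs": [10, 15, 20],
    "entropy_coef": [1e-3, 5e-3],
}


def grid(space):
    """Enumerate every combination of the candidate values as a list of dicts."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


candidates = grid(SEARCH_SPACE)
```

The grid contains 3 x 3 x 2 = 18 candidates, and the configuration reported in Table 7 (lr 5e-4, 15 epochs, entropy coefficient 5e-3) is one of them.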

C More Experimental Results

C.1 More Results about Lookahead Policy Analysis

In Section 4.3, we measure the KL distance difference of policies on the Ant and HalfCheetah tasks of multi-agent MuJoCo. Here, we additionally provide the results on the Walker2d task. As we can see, the results on three different types of tasks all validate that our approach empirically obtains positive values of $D_{\mathrm{KL}}(\pi^{k+1}, \pi^k) - D_{\mathrm{KL}}(\pi^{k+1}, \hat{\pi}^{k+1})$, which means that the lookahead policy $\hat{\pi}^{k+1}$ can, to some extent, provide the direction information of the future teammate policy $\pi^{k+1}$ and is a relatively good approximation.
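The KL distance difference can be computed in closed form when policies are diagonal Gaussians, a common choice for continuous control. The sketch below assumes that parameterization, which is not stated in this section, and uses hypothetical one-dimensional means and standard deviations at a single state. A positive result means the lookahead policy is closer to the actually updated policy than the current policy is.

```python
import math


def kl_diag_gaussian(mean_p, std_p, mean_q, std_q):
    """KL(p || q) between diagonal Gaussians given per-dimension means and stds."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q)
    )


def kl_distance_difference(updated, current, lookahead):
    """D_KL(updated, current) - D_KL(updated, lookahead); each policy is (means, stds)."""
    return kl_diag_gaussian(*updated, *current) - kl_diag_gaussian(*updated, *lookahead)


# Hypothetical action distributions at one state.
updated = ([1.0], [1.0])     # policy after the real update
current = ([0.0], [1.0])     # policy before the update
lookahead = ([0.8], [1.0])   # predicted future policy

difference = kl_distance_difference(updated, current, lookahead)
```

In practice such a quantity would be averaged over sampled states and agents; the single-state version here only illustrates the sign convention.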

C.2 Study about Multiple Steps of Lookahead Approximation

In the practical implementation of the algorithm in this work, we employ a one-step approximation to estimate future teammate policies. A natural question is whether we can obtain a better approximation of future teammates through more rounds of lookahead training. To answer this question, we conduct additional experiments, comparing with the baselines that perform more rounds of policy update when obtaining the lookahead policy. The results are depicted in Figure 7, where Lookahead Step represents the number of policy update rounds conducted for approximating future teammates, i.e., Lookahead Step=1 corresponds to the results in the main text.

From the results, we can see that when we increase the Lookahead Step, the quality of the lookahead approximation does not improve and we obtain lower performance. This is reasonable because, when we conduct more iterations of lookahead training within the model, the newly updated policy becomes unfamiliar to the model, as the environment model has been trained on data sampled from the old policies. This necessitates refraining from employing excessively off-policy policies for trajectory rollout within the model. However, despite achieving lower convergence performance, increasing the Lookahead Step seems to speed up learning in the early stage on the 2x3-Agent HalfCheetah task. This encourages us to design better model learning algorithms in the future, to further improve the effectiveness of our approach.

[Figure 7 plots omitted. Panels: (a) 2x4-Agent Ant; (b) 2x3-Agent HalfCheetah. Curves: Lookahead Step=1, Lookahead Step=2, Lookahead Step=4. x-axis: Timesteps (0.0M to 5.0M).]

Figure 7: Experimental results for study about multiple steps of lookahead approximation.