# Efficient Multi-Agent Communication via Shapley Message Value

Di Xue1, Lei Yuan1,2, Zongzhang Zhang1, and Yang Yu1,3

1State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
2Polixir Technologies, Nanjing 210000, China
3Peng Cheng Laboratory, Shenzhen, Guangdong, China
{xued, yuanl}@lamda.nju.edu.cn, {zzzhang, yuy}@nju.edu.cn

Equal contribution. Corresponding author: Zongzhang Zhang.

## Abstract

Utilizing messages from teammates is crucial in cooperative multi-agent tasks due to the partially observable nature of the environment. Naively asking for messages from all teammates without pruning may confuse individual agents, hindering the learning process and impairing the whole system's performance. Most previous work either utilizes a gate or employs an attention mechanism to extract the relatively important messages. However, these approaches do not explicitly evaluate the value of each message and thus fail to learn an efficient communication protocol in more complex scenarios. To tackle this issue, we model the teammates of an agent as a message coalition and calculate the Shapley Message Value (SMV) of each agent within it. SMV reflects the contribution of each message to an agent, and redundant messages can be spotted effectively in this way. On top of that, we design a novel framework named Shapley Message Selector (SMS), which learns to predict the SMVs of teammates for an agent solely from local information, so that the agent only queries those teammates with positive SMVs. Empirically, we demonstrate that our method can prune redundant messages and achieve comparable or better performance in various multi-agent cooperative scenarios than full communication settings and existing strong baselines.

## 1 Introduction

Cooperative Multi-Agent Reinforcement Learning (MARL) [Gronauer and Diepold, 2021] aims at coordinating multiple agents towards a shared goal, showing great potential to solve real-world problems. Most recent work on MARL adopts the Centralized Training and Decentralized Execution (CTDE) [Kraemer and Banerjee, 2016] paradigm to deal with the non-stationarity caused by the concurrent learning of multiple policies. However, the coordination ability of policies learned under CTDE can be fragile in the face of partial observability.

Communication is a promising way to address this issue in MARL. It allows agents to exchange information, share experiences, and coordinate better with teammates. To accomplish this, researchers have considered differentiable communication channels under the CTDE paradigm. Many methods in this direction utilize a broadcast-like communication channel where each sent message is delivered to all agents. They employ techniques such as gate mechanisms to control when to send messages [Singh et al., 2019; Mao et al., 2020], or attention mechanisms [Das et al., 2019] to extract useful information from all received messages. However, these methods treat messages as black boxes and tacitly assume that deep neural networks can identify the valuable ones. As the number of agents increases, the communication channel is soon overwhelmed with irrelevant messages, and communication can barely help. Learning whom to communicate with is one of the salient abilities of humans, by which we realize more efficient communication.
Rather than the receive-then-select approach widely used in previous work [Das et al., 2019], or pruning messages according to a learned rule [Wang et al., 2020b; Ding et al., 2020; Niu et al., 2021], it can be more efficient to evaluate the contribution of each message and then ask for messages only from those teammates who benefit us. For example, in our daily research life, we build a specific belief about each member of our lab through everyday interaction, such as their expertise in different domains. When we start a project, we can decide whom to collaborate with according to their expertise, so as to obtain the best guidance and work smoothly.

Taking inspiration from the above insight, this paper proposes a novel multi-agent communication framework that enables agents to evaluate the contribution of each message and decide whom to communicate with. Specifically, inspired by cooperative game theory [Chalkiadakis et al., 2011; Wang et al., 2020a], we model the teammates of an agent as a message coalition and calculate the Shapley Message Value (SMV) of each agent within the message coalition. The contribution of messages is captured by the SMV, by which we can distribute part of the individual utility to teammates according to the contribution of their messages. The calculated SMVs are then used to train the Shapley Message Selector (SMS), which is equipped on each agent. During execution, the learned SMS helps an agent identify teammates whose messages can benefit its decision-making, based solely on local observation.

Figure 1: Schematics of our proposed framework: (a) the communication-based policy; (b) the linearly decomposable critic.

We evaluate our method on three cooperative multi-agent tasks: Listener-Speaker, Hallway [Wang et al., 2020b], and the StarCraft Multi-Agent Challenge (SMAC) [Samvelyan et al., 2019]. Visualization and empirical results show that our method can prune redundant messages and achieve comparable or better performance in various multi-agent cooperative scenarios than full communication settings and existing strong baselines.

## 2 Background

We consider fully cooperative multi-agent tasks with partial observability modeled as Dec-POMDPs [Oliehoek and Amato, 2016]. A Dec-POMDP can be formulated as $\mathcal{M} = \langle N, S, A, P, \Omega, O, R, \gamma \rangle$, where $N = \{1, 2, \ldots, n\}$ is the set of $n$ agents, $S$ is the set of states, and $A = \prod_{i=1}^{n} A_i$ is the joint action space. At each timestep, agent $i \in N$ receives a local observation $o_i \in \Omega$ drawn from the observation function $O(s, i)$ and then chooses an action $a_i \in A_i$. The joint action $\boldsymbol{a} = (a_1, \ldots, a_n)$ leads to a shared global reward $R(s, \boldsymbol{a})$ and the next state $s' \sim P(s' \mid s, \boldsymbol{a})$. The objective is to find a joint policy $\pi(\boldsymbol{a} \mid \boldsymbol{\tau})$ that maximizes the global value function $Q^{\pi}_{tot}(s, \boldsymbol{a}) = \mathbb{E}_{s_{0:\infty}, \boldsymbol{a}_{0:\infty}}\big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, \boldsymbol{a}_t) \mid s_0 = s, \boldsymbol{a}_0 = \boldsymbol{a}, \pi\big]$, where $\boldsymbol{\tau} = (\tau_1, \ldots, \tau_n)$ is the joint action-observation history of all agents and $\gamma \in [0, 1)$ is the discount factor.
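As a concrete illustration of this objective, the toy sketch below (with a hypothetical `env`/`policies` interface, not taken from the paper) rolls out one episode under decentralized execution and accumulates the discounted shared reward that $Q^{\pi}_{tot}$ is defined over.

```python
# Minimal sketch of the Dec-POMDP objective: each agent acts only on its own
# local observation, all agents share a single global reward, and we accumulate
# the discounted return. `env` and `policies` are illustrative stand-ins.

def discounted_return(env, policies, gamma=0.99, max_steps=200):
    obs = env.reset()                                   # [o_1, ..., o_n], one per agent
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        # decentralized execution: agent i sees only o_i
        joint_action = [pi.act(o) for pi, o in zip(policies, obs)]
        obs, reward, done = env.step(joint_action)      # one shared reward R(s, a)
        ret += discount * reward
        discount *= gamma
        if done:
            break
    return ret
```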
The Shapley value is a popular solution concept for the payoff distribution problem of a grand coalition [Chalkiadakis et al., 2011], and its advantage is that it provides a fair and unique solution [Fatima et al., 2008]. A cooperative game can be defined as $G = (N, v)$, where $N = \{1, 2, \ldots, n\}$ is a finite, non-empty set of players and $v : 2^{N} \to \mathbb{R}$ is a characteristic function assigning each coalition $C \subseteq N$ of players a value $v(C)$, the total payoff that can be obtained by the coalition $C$. The Shapley value of the $i$th player with respect to the grand coalition $N$ is defined as

$$\phi_i^{G} = \sum_{C \subseteq N \setminus \{i\}} \frac{|C|!\,(|N| - |C| - 1)!}{|N|!}\, \Phi_i^{G}(C), \qquad (1)$$

where $\Phi_i^{G}(C) = v(C \cup \{i\}) - v(C)$ is called the marginal contribution, which measures how much value is attributed to the participation of the $i$th player in the coalition $C$. Intuitively, the Shapley value is the average marginal contribution over all possible coalitions.

## 3 Method

In this section, we first introduce our multi-agent communication framework. Then we formulate multi-agent communication as cooperative games and define SMV, which extends the concepts of the grand coalition and the Shapley value. After that, we propose a method that learns to predict the SMVs of teammates and decides whether to ask for messages accordingly. We finally present the overall training scheme.

### 3.1 Overall Framework

As depicted in Figure 1, our framework aims to learn a peer-to-peer communication scheme under the CTDE paradigm. It is composed of two main parts: a communication-based policy (Figure 1(a)) equipped with a communication module that learns from whom to ask for information, and a linearly decomposable centralized critic (Figure 1(b)) that evaluates the contribution of each message and provides the training signal for the communication module of the policy. During execution, the centralized critic is removed, and agents use the learned SMS to communicate with those teammates believed to be beneficial to them.

At each timestep, agent $i$ decides whom to communicate with before taking an action. Specifically, agent $i$ applies its SMS to the current observation to predict the SMVs of its teammates, which gives $[\varphi(d_j \mid o_i)]_{j=1}^{n}$, where $d_j$ is the ID of agent $j$. The predicted SMVs are then transformed into an $n$-dimensional binary vector $[F_{ji}]_{j=1}^{n}$ by an element-wise indicator function $\mathbb{1}_{[>0]}(\cdot)$ that assigns one to positive numbers. $F_{ji}$ indicates whether to ask for information from agent $j$: if $F_{ji} = 1$, agent $i$ sends a request for information to agent $j$; otherwise, the communication link is pruned. We set $F_{ii} = 0$ so that an agent never requests a message from itself. Upon receiving a request, agent $j$ replies with a message $c_{ji} = g_j(o_j, d_i)$, an encoding of its local observation produced by its message encoder. The messages received by agent $i$ are gathered into $m_i = [F_{ji} \cdot c_{ji}]_{j=1}^{n}$, which is then fed into its local policy $\pi_i$ along with its local action-observation history $\tau_i$ to generate an action $a_i = \pi_i(\tau_i, m_i)$.

To account for the contributions of messages to local decision-making, extra care must be taken in the design of the centralized critic. Popular policy-based methods, e.g., MADDPG [Lowe et al., 2017] and COMA [Foerster et al., 2018], adopt non-factorizable centralized critics that condition on the global state and joint action.
These methods either suffer from inefficient credit assignment [Wang et al., 2021] or high variance [Papoudakis et al., 2021], making them poor choices for evaluating individual contributions. Value factorization methods like QMIX [Rashid et al., 2018] decompose the centralized value into individual utilities in a nonlinear manner, which destroys the consistency between the centralized value and the individual utilities. To keep the individual critics on the same scale as the centralized critic, we employ a linearly decomposed centralized critic as in DOP [Wang et al., 2021]:

$$Q_{tot}(s, \boldsymbol{a}) = \sum_{i=1}^{n} k_i(s)\, Q_i(s, a_i) + b(s), \qquad (2)$$

where $k_i(s)$ and $b(s)$ are generated by learnable networks that take the global state as input, and $k_i(s)$ is restricted to be non-negative to ensure monotonicity. Since the utilities of the agents are additive after being scaled by their coefficients, we can evaluate the contribution made by each agent to the whole team.

### 3.2 Communication as Cooperative Games

An advantage of our framework is that it gives us the flexibility to quantitatively examine the contribution of a message to overall decision-making. Suppose there are two agents $i$ and $j$ working together. Depending on whether the message from agent $j$ is received, agent $i$ may take different actions: if agent $i$ queries agent $j$, it chooses $a_i^{j} = \pi_i(\tau_i, [c_{ji}, 0])$; otherwise it chooses $a_i = \pi_i(\tau_i, [0, 0])$. The contribution of agent $j$ to agent $i$ is then

$$\Phi^{i}_{j} = k_i(s)\big(Q_i(s, a_i^{j}) - Q_i(s, a_i)\big), \qquad (3)$$

where we scale the individual utility by its coefficient so that the scale is consistent with the global value. Note that the definition of the marginal contribution in Section 2 is similar to how we calculate the contribution of messages to local decision-making here. We now generalize the related concepts of cooperative game theory to our setting.

Message Coalition: The action of each agent is determined by its local information as well as the messages received from teammates. It is therefore natural to attribute part of the utility to other agents so that the importance of each message can be measured. To distribute agent $i$'s utility to other agents, we model it as a cooperative game $(\mathcal{C}_i, v_i)$, where the grand coalition is the set of all agents except agent $i$, i.e., $N \setminus \{i\}$, which we call the message coalition $\mathcal{C}_i$ of agent $i$. Likewise, the characteristic function is defined on each subset $C \subseteq \mathcal{C}_i$ of the message coalition:

$$v_i(C) = k_i(s)\big(Q_i(s, a_i^{C}) - Q_i(s, a_i)\big), \qquad (4)$$

where $a_i^{C} = \pi_i(\tau_i, m_i^{C})$ and $m_i^{C} = [\mathbb{1}_{C}(j) \cdot c_{ji}]_{j=1}^{n}$ contains the messages received from agents in $C$. In this cooperative game, we are effectively distributing the value $v_i(\mathcal{C}_i)$ earned by the joint messages from the message coalition to each member within it.

Shapley Message Value: To solve the cooperative game $(\mathcal{C}_i, v_i)$ in our setting, we recall the definition of the Shapley value. The generalized definition of the marginal contribution of the message from the $j$th agent to a coalition $C \subseteq \mathcal{C}_i \setminus \{j\}$ is

$$\Phi^{i}_{j}(C \mid s) = v_i(C \cup \{j\}) - v_i(C) = k_i(s)\big(Q_i(s, a_i^{C \cup \{j\}}) - Q_i(s, a_i^{C})\big). \qquad (5)$$

Then, we compute the Shapley value of agent $j$'s message:

$$\phi_i(d_j \mid s) = \sum_{C \subseteq \mathcal{C}_i \setminus \{j\}} \frac{|C|!\,(|\mathcal{C}_i| - |C| - 1)!}{|\mathcal{C}_i|!}\, \Phi^{i}_{j}(C \mid s) = \mathbb{E}_{p(C \mid \mathcal{C}_i \setminus \{j\})}\big[\Phi^{i}_{j}(C \mid s)\big], \qquad (6)$$

which we call the Shapley Message Value (SMV). The coefficients of $\Phi^{i}_{j}(C \mid s)$ in Eq. (6) sum to one, implying a probabilistic interpretation; we make this explicit by introducing the notation $p(C \mid \mathcal{C}_i \setminus \{j\})$. SMV measures the contribution of a message to the recipient agent.
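Because the sum in Eq. (6) ranges over exponentially many coalitions, it is in practice estimated by sampling coalitions from $p(C \mid \mathcal{C}_i \setminus \{j\})$, as described later in Section 3.4. The following minimal Python sketch illustrates that estimate; it assumes simple callables for the pieces defined above (the local policy $\pi_i$, the individual critic $Q_i$, and the coefficient $k_i(s)$), and all names are illustrative rather than the authors' implementation.

```python
# Hedged sketch of a sampled SMV estimate. Assumed interfaces (hypothetical):
#   policy_i(tau_i, messages) -> a_i      local policy of recipient agent i
#   q_i(state, a_i) -> float              individual critic Q_i(s, a_i)
#   k_i: float                            non-negative mixing coefficient k_i(s)
#   messages: list of n numpy arrays, messages[j] = c_{ji} (messages[i] is zeros)
import random

def masked_messages(messages, coalition):
    """Keep only messages whose sender is in the coalition; zero out the rest."""
    return [m if j in coalition else 0.0 * m for j, m in enumerate(messages)]

def marginal_contribution(j, coalition, state, tau_i, messages, policy_i, q_i, k_i):
    """Phi_i^j(C | s) = k_i(s) * (Q_i(s, a_i^{C u {j}}) - Q_i(s, a_i^C)), Eq. (5)."""
    a_without = policy_i(tau_i, masked_messages(messages, coalition))
    a_with = policy_i(tau_i, masked_messages(messages, coalition | {j}))
    return k_i * (q_i(state, a_with) - q_i(state, a_without))

def shapley_message_value(i, j, n, state, tau_i, messages,
                          policy_i, q_i, k_i, H=2):
    """Average the marginal contribution of sender j over H sampled coalitions."""
    pool = [a for a in range(n) if a not in (i, j)]     # C_i \ {j}
    total = 0.0
    for _ in range(H):
        # uniform size in {0, ..., |C_i|-1}, then a uniform subset of that size
        size = random.randint(0, len(pool))
        coalition = set(random.sample(pool, size))
        total += marginal_contribution(j, coalition, state, tau_i,
                                       messages, policy_i, q_i, k_i)
    return total / H
```

Sampling a coalition size uniformly and then a uniform subset of that size reproduces the Shapley weights $|C|!\,(|\mathcal{C}_i|-|C|-1)!/|\mathcal{C}_i|!$ in Eq. (6), since each size receives total probability $1/|\mathcal{C}_i|$.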
Intuitively, we hope to keep those messages that have a positive effect on local decision-making, i.e., messages with positive SMV, and prune those with a negative effect. Since each agent can be seen as a recipient, $n$ such cooperative games exist at the same time. By selecting messages with positive SMV, each agent is expected to gain a greater local value, which is further combined monotonically by the mixer, resulting in a greater total value.

### 3.3 Shapley Message Selector

SMV serves as a criterion for selecting messages that have a positive effect on local decision-making. The problem, however, is that we cannot evaluate the SMV of an agent unless we know the global state and receive all the messages, making it infeasible during decentralized execution. Therefore, we design a practical method to predict SMV solely from local information. The most straightforward way is regression: we use a neural network named the Shapley Message Selector (SMS) to approximate the SMVs of the teammates within the message coalition via supervised learning. The SMS takes the local observation as input and outputs a vector of length $n$ whose $j$th component is the predicted SMV $\varphi_i(d_j \mid o_i)$ of agent $j$ within the message coalition $\mathcal{C}_i$. For simplicity of notation, we define the SMV of an agent to itself to be zero.

Fortunately, the global state and local information are probabilistically correlated. We can exploit this correlation to approximate SMV via

$$\varphi_i(d_j \mid o_i) = \mathbb{E}_{p(o_i \mid s)}\big[\phi_i(d_j \mid s)\big], \qquad (7)$$

where $p(o_i \mid s)$ is induced by the current joint policy. This suggests using on-policy trajectories to generate the dataset for training the SMS. During centralized training, we can calculate the SMV to generate the training data. Specifically, for agent $i$, the local observation $o_i$ is taken as the feature and the SMVs of all agents $[\phi_i(d_j \mid s)]_{j=1}^{n}$ are taken as the label, forming a feature-label tuple $\langle o_i, [\phi_i(d_j \mid s)]_{j=1}^{n} \rangle$.

During decentralized execution, the SMS behaves like a gate. Given the current observation $o_i$, agent $i$ predicts the SMV of each teammate. If the predicted SMV of agent $j$ is positive, meaning that the information from agent $j$ is expected to benefit the local decision-making of agent $i$, agent $i$ sends a request for information to agent $j$.

### 3.4 Training Scheme

During centralized training, the parameters of the critic are updated by the standard TD loss of the global value $Q_{tot}$. To improve sample efficiency, we utilize both on-policy and off-policy data by minimizing the following loss:

$$\mathcal{L}(\theta_Q) = \tfrac{1}{2}\mathcal{L}^{On}(\theta_Q) + \tfrac{1}{2}\mathcal{L}^{Off}(\theta_Q), \qquad (8)$$

where $\theta_Q$ is the parameter of the centralized critic, and $\mathcal{L}^{On}$ and $\mathcal{L}^{Off}$ are evaluated on the on-policy and off-policy replay buffers $\mathcal{D}^{On}$ and $\mathcal{D}^{Off}$, respectively. The first term is $\mathcal{L}^{On}(\theta_Q) = \mathbb{E}_{\mathcal{D}^{On}}\big[(y^{On} - Q_{tot}(s_t, \boldsymbol{a}_t; \theta_Q))^2\big]$, where the target value is $y^{On} = r_t + \gamma Q_{tot}(s_{t+1}, \boldsymbol{a}_{t+1}; \theta_Q^{-})$ and $\theta_Q^{-}$ is the parameter of the target network, periodically copied from $\theta_Q$. The second term is $\mathcal{L}^{Off}(\theta_Q) = \mathbb{E}_{\mathcal{D}^{Off}}\big[(y^{Off} - Q_{tot}(s_t, \boldsymbol{a}_t; \theta_Q))^2\big]$, where the action for the next timestep is corrected by the current policy, message encoder, and SMS:

$$y^{Off} = r_t + \gamma Q_{tot}\big(s_{t+1}, \pi_i(s_{t+1}, m_i; \theta_{\pi}); \theta_Q^{-}\big). \qquad (9)$$

Here, $m_i = \big\{\mathbb{1}_{[>0]}\big(\varphi_i(d_j \mid o_i^{t+1}; \theta_{\varphi_i})\big) \cdot g_j(o_j^{t+1}, d_i; \theta_{g_j})\big\}_{j=1}^{n}$ is the relabeled message.
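To make Eqs. (8)–(9) concrete, the sketch below computes the two TD terms with half weight each, recomputing the next joint action from the current policies with messages relabeled by the SMS gate. All objects (`q_tot`, `q_tot_target`, `policies`, `encoders`, `selectors`, and the batch layouts) are hypothetical stand-ins, not the authors' code, and gradient bookkeeping is omitted.

```python
# Schematic of the critic loss in Eq. (8): 0.5 * on-policy TD error + 0.5 * off-policy
# TD error with the next action relabeled by the current policy, encoders and SMS (Eq. (9)).

def relabeled_messages(i, next_obs, selectors, encoders):
    """m_i: query teammate j only if the SMS of agent i predicts a positive SMV for j.
    The self entry is dropped automatically since phi_i(d_i) is defined to be zero."""
    predicted_smv = selectors[i](next_obs[i])          # length-n vector of predicted SMVs
    return [encoders[j](next_obs[j], i) if predicted_smv[j] > 0 else 0.0
            for j in range(len(next_obs))]             # 0.0 stands in for a zero message

def critic_loss(on_batch, off_batch, q_tot, q_tot_target,
                policies, encoders, selectors, gamma=0.99):
    # On-policy term: standard TD target using the action actually taken at t+1.
    l_on = 0.0
    for (s, a, r, s_next, a_next) in on_batch:
        y = r + gamma * q_tot_target(s_next, a_next)
        l_on += (y - q_tot(s, a)) ** 2

    # Off-policy term: the next joint action is recomputed by the *current* policies,
    # each fed the messages selected by its SMS (the paper's "relabeled message").
    l_off = 0.0
    for (s, a, r, s_next, next_obs, next_hist) in off_batch:
        a_relabel = [policies[i](next_hist[i],
                                 relabeled_messages(i, next_obs, selectors, encoders))
                     for i in range(len(policies))]
        y = r + gamma * q_tot_target(s_next, a_relabel)
        l_off += (y - q_tot(s, a)) ** 2

    return 0.5 * l_on / len(on_batch) + 0.5 * l_off / len(off_batch)
```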
The policy and message encoder are jointly trained to maximize the centralized critic $Q_{tot}$:

$$\mathcal{J}(\theta_{\pi}, \theta_{g}) = \mathbb{E}_{\mathcal{D}^{On}}\Big[\sum_{i=1}^{n} k_i(s)\, Q_i\big(s, \pi_i(\tau_i, m_i; \theta_{\pi})\big)\Big], \qquad (10)$$

where $s \sim \mathcal{D}^{On}$, $Q_{tot}$ is decomposed into individual critics, and the bias $b(s)$ can be left out because it does not depend on the action. To handle the discrete action space, we apply the Gumbel-Softmax trick [Jang et al., 2016] to reparameterize the stochastic policies as deterministic functions.

To calculate SMV, we have to evaluate the value of messages from the message coalition via $Q_i(s, a_i^{C})$, where $C \subseteq \mathcal{C}_i$. Since the policy is modeled by a neural network, extra care must be taken to avoid possible correlation between messages, so that each message can be useful on its own. To this end, we train the policy while dropping each message with probability $p_{drop}$, a hyper-parameter, similar to the dropout technique [Srivastava et al., 2014].

The SMS $\varphi_i(d_j \mid o_i)$ is parameterized by $\theta_{\varphi_i}$ and is trained on the tuples $\langle o_i^t, [\phi_i(d_j \mid s_t)]_{j=1}^{n} \rangle$ generated from recent trajectories by minimizing the following mean squared error:

$$\mathcal{L}(\theta_{\varphi_i}) = \frac{1}{nT} \sum_{t=1}^{T} \sum_{j=1}^{n} \big(\varphi_i(d_j \mid o_i^t; \theta_{\varphi_i}) - \phi_i(d_j \mid s_t)\big)^2. \qquad (11)$$

The exact calculation of the SMV $\phi_i(d_j \mid s_t)$ is intractable as its complexity grows exponentially, but we can approximate it according to its probabilistic interpretation given in Eq. (6):

$$\phi_i(d_j \mid s_t) = \mathbb{E}_{p(C \mid \mathcal{C}_i \setminus \{j\})}\big[\Phi^{i}_{j}(C \mid s_t)\big] \approx \frac{1}{H} \sum_{k=1}^{H} \Phi^{i}_{j}(C_k \mid s_t),$$

where $C_k \sim p(C \mid \mathcal{C}_i \setminus \{j\})$ and $H$ is the number of samples. Recall that the SMV $\phi_i(d_j \mid s)$ is calculated from the individual critic $Q_i(s, a_i)$, which is modeled by a trainable neural network. This means that the quality of the SMS relies heavily on the accuracy of $Q_i(s, a_i)$. Thus we do not apply the SMS during execution until $t_{selector}$ timesteps have passed. Interested readers may refer to the pseudo-code in Appendix C.

## 4 Experiments

In this section, we design experiments to verify that SMS can learn an efficient and targeted communication protocol in multi-agent tasks.¹ We compare SMS with multiple baselines on cooperative tasks, including Listener-Speaker, Hallway [Wang et al., 2020b], and the challenging StarCraft II unit micromanagement benchmark² [Samvelyan et al., 2019]. For evaluation, all results are reported as median performance with 95% confidence intervals over 5 random seeds. Detailed network architecture and hyper-parameter choices are given in Appendix B.

### 4.1 Baselines

QMIX [Rashid et al., 2018] and DOP [Wang et al., 2021] are strong non-communication baselines. QMIX is a value-based baseline; the implementation we compare against is provided by PyMARL2, which has shown excellent performance on diverse multi-agent benchmarks [Hu et al., 2021]. DOP, on the other hand, is a policy-based method that employs a linearly decomposable mixing network similar to ours. TarMAC [Das et al., 2019], NDQ [Wang et al., 2020b], and MAGIC [Niu et al., 2021] are state-of-the-art communication baselines. TarMAC utilizes an attention mechanism to integrate messages according to their relative importance; the implementation we use is provided by [Wang et al., 2020b] and is denoted QMIX+TarMAC. NDQ aims to learn nearly decomposable Q functions via communication minimization. MAGIC uses hard attention to construct a dynamic communication graph, which is then combined with a graph attention neural network to process the messages. For the ablation study, we design a baseline that differs only in the communication protocol.
It adopts a full communication paradigm in which each agent queries all other teammates at every timestep, denoted FullComm.

### 4.2 Listener-Speaker

We design a Listener-Speaker environment (Figure 2(a)) to demonstrate (1) the existence of redundant messages that may harm the performance of learning algorithms, and (2) that SMS can recognize and prune those redundant messages. Listener-Speaker is a grid world consisting of $n$ listeners and $n$ speakers placed randomly at the beginning of each episode. Listeners and speakers are paired one-to-one, and each pair is tagged with the same number. The listeners must navigate to their paired speakers, and the reward is $r_t = -0.01 \sum_{i=1}^{n} \mathrm{dist}(l_i^t, s_i^t)$, where $\mathrm{dist}(l_i^t, s_i^t)$ is the Euclidean distance between the $i$th listener-speaker pair at timestep $t$.

¹ Code is available at https://github.com/DiXue98/SMS.
² The results reported in this paper use SC2.4.6.2.6923. Note that performance is not always comparable between versions.

Figure 2: (a) The Listener-Speaker environment, where listeners are marked in black and speakers in red. (b) Median test returns of different algorithms. (c) Communication rates under different numbers of listener-speaker pairs. (d) Median test returns with different numbers of listener-speaker pairs after 2M timesteps; error bars show the standard deviation across 5 random seeds.

While a listener cannot observe anything but its own ID, a speaker knows its relative position to each listener. To succeed in such a task, a listener has to figure out which speaker to ask for information, so that it can move towards its paired speaker without being distracted by irrelevant information. Although this problem is seemingly simple, it poses a significant challenge to full communication algorithms even when $n$ is small. We omit NDQ in this environment because the behavior of the speakers is fixed.

We set $n = 4$ and the map size to $15 \times 15$. We also set $t_{selector} = 0.2\mathrm{M}$, the timestep at which SMS is activated. As shown in Figure 2(b), the methods without communication, i.e., QMIX and DOP, fail in this task and perform worst, which indicates that communication is necessary. TarMAC mitigates the redundancy to some extent, but its attention mechanism is inefficient, leading to slow convergence. SMS behaves similarly to FullComm and MAGIC at the beginning but outperforms all baselines by a large margin after SMS is enabled at 0.2M steps. The ablation of SMS, i.e., FullComm, behaves almost the same as our method before 0.4M steps but deteriorates after that. We think this is because its policy overfits to the redundant messages of irrelevant speakers and converges to a sub-optimal policy. Figure 2(b) also reveals the effect of different sample sizes $H$ (2, 4, 6) used to approximate the SMV. As the sample size grows, convergence becomes slightly faster, indicating a more accurate SMV estimate. A larger sample size also costs more computational resources.
We therefore set the sample size to $H = 2$ in the following experiments as a trade-off between computational cost and performance.

Communication Rate Analysis. To explain why SMS can effectively deal with redundancy, we plot the communication rate of SMS with the number of listener-speaker pairs ranging from 2 to 5 in Figure 2(c). Since listeners and speakers are paired one-to-one, for each listener there is only one speaker that can help it reach its destination, so the optimal communication rate is $1/n$. The learned SMS almost achieves the optimal communication rate in these scenarios, although the variance increases as $n$ grows. Meanwhile, we can see from Figure 2(d) that SMS outperforms FullComm in all scenarios, and the return of SMS changes nearly proportionally to the number of listener-speaker pairs, indicating that SMS can effectively identify the paired speaker for each listener.

### 4.3 Hallway

Hallway [Wang et al., 2020b] (Figure 3(a)) is another cooperative environment with partial observability, in which four agents $a$, $b$, $c$, and $d$ are randomly initialized at states $a_1$ to $a_l$, $b_1$ to $b_m$, $c_1$ to $c_n$, and $d_1$ to $d_k$, respectively. An agent can only observe its own position and selects an action from moving left, moving right, or staying still at each timestep. An episode ends when any agent arrives at state $g$; the agents win and receive a shared reward of 1 only if they all reach state $g$ simultaneously, and the reward is 0 otherwise. The horizon is set to $\max(l, m, n, k) + 10$ to avoid infinite loops.

Figure 3: (a) Hallway, where agents aim to reach $g$ simultaneously. (b) Median test success rates on Hallway.

We set $l = m = 2$ and $n = k = 3$ to make simultaneous arrival difficult. For SMS we set $t_{selector} = 0.3\mathrm{M}$. The performance is shown in Figure 3(b), where QMIX and DOP fail again, indicating that communication is required to achieve effective coordination in this task. The other communication methods, TarMAC and NDQ, can measure the relative importance of messages but are not efficient enough. MAGIC can solve the task but takes many more timesteps to converge than SMS. FullComm performs well, but its curve oscillates considerably and does not converge to the same level as SMS. The redundancy in the messages is validated by the gap between the curves of SMS and FullComm.

Figure 4: The SMV distributions learned by our method on Hallway. A blue square in the heat map means that the corresponding message has a negative SMV and is pruned.

The Communication Protocol Learned by SMS. We are particularly interested in how the learned SMS behaves at each timestep. To this end, we replay an episode in Figure 4 to show what happens at each timestep. The top row shows a snapshot of the global state at each timestep, while the corresponding output of the SMS is shown as a heat map in the bottom row. The $i$th row of each heat map is the output of the SMS of the $i$th agent, i.e., the predicted SMVs for its teammates.
At the first timestep ($t = 1$), the heat map is blue everywhere and all SMVs are below zero, indicating no communication among agents. This is because the agents are far from state $g$, and all of them should move left regardless of the positions of their teammates. As the episode progresses ($t = 2$), some agents reach positions next to $g$, and this is where coordination happens: the agents ($a$ and $b$) next to the goal $g$ should wait for the ones ($c$ and $d$) that are still two steps away from $g$. This explains why the SMS of agents $a$ and $b$ produces positive SMVs for the other agents. To win the game when all agents are next to the goal state $g$ ($t = 3$), they need to make sure that their teammates are in the same position so that they can act simultaneously; this is also reflected in the heat map. Besides, it is interesting to see that agents tend to assign larger SMVs to $c$ and $d$ than to $a$ and $b$. We believe this is because agents $c$ and $d$ are, on average, further from state $g$, so their information is more critical to the success of the whole team.

### 4.4 StarCraft II Micromanagement Benchmark

Finally, we verify whether SMS can handle more challenging scenarios in the StarCraft Multi-Agent Challenge (SMAC) [Samvelyan et al., 2019]. We test our method on three maps: 1o10b_vs_1r and 1o2r_vs_4r, which are difficult due to the partial observability of the environment, and 5z_vs_1ul, where strong coordination is needed to succeed.³

Figure 5: Median test win rates on three SMAC maps: (a) 1o10b_vs_1r, (b) 1o2r_vs_4r, (c) 5z_vs_1ul.

Since these tasks are more complex, the SMS needs more training time before activation, so we set $t_{selector} = 1.0\mathrm{M}$. The results are shown in Figure 5. We find that 1o10b_vs_1r and 1o2r_vs_4r require strong communication to succeed: methods without communication, such as QMIX and DOP, converge to sub-optimal policies. SMS achieves faster convergence than NDQ and higher asymptotic performance than the other baselines, demonstrating that SMS can efficiently help agents identify helpful information from teammates. Scenario 5z_vs_1ul requires strong coordination; SMS outperforms all baselines except QMIX, as partial observability is not severe on this map. MAGIC fails to maintain decent performance across the scenarios; we think this is largely because the issues of credit assignment and sample efficiency are not properly handled.

## 5 Conclusions and Future Work

In this paper, we have shown the existence of redundancy in multi-agent communication and how it complicates the learning of communication protocols. To address this issue, we model multi-agent communication as cooperative games and propose the Shapley Message Value (SMV), which extends the Shapley value to quantitatively examine the value of each message. We also design a practical method to predict the value of messages based solely on local information. The calculation of SMV relies on the individual critic, which implies that the quality of SMV depends heavily on the accuracy of the individual critic. However, current credit assignment methods do not always work well and are not sufficiently explainable [Wang et al., 2020a], which may degrade the quality of SMV in more complex environments.
A more precise method to calculate SMV, together with theoretical support for multi-agent communication, would be of great interest.

³ The numbers of controlled agents in 1o10b_vs_1r, 1o2r_vs_4r, and 5z_vs_1ul are 11, 3, and 5, respectively.

## Acknowledgments

We would like to thank Quan He, Xue-Kun Jin, Yu Fan, Yun Tian Zhang, and the anonymous reviewers for their helpful discussions and support. This work is supported by the National Key Research and Development Program of China (2020AAA0107200), NSFC (61876077, 61876119), the Major Key Project of PCL (PCL2021A12), and the Hikvision Open Fund (CCF-HIKVISION OF 20210001).

## References

[Chalkiadakis et al., 2011] Georgios Chalkiadakis, Edith Elkind, and Michael Wooldridge. Computational aspects of cooperative game theory. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(6):1–168, 2011.
[Das et al., 2019] Abhishek Das, Théophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Mike Rabbat, and Joelle Pineau. TarMAC: Targeted multi-agent communication. In ICML, pages 1538–1546, 2019.
[Ding et al., 2020] Ziluo Ding, Tiejun Huang, and Zongqing Lu. Learning individually inferred communication for multi-agent cooperation. In NeurIPS, pages 22069–22079, 2020.
[Fatima et al., 2008] Shaheen S. Fatima, Michael Wooldridge, and Nicholas R. Jennings. A linear approximation method for the Shapley value. Artificial Intelligence, 172(14):1673–1699, 2008.
[Foerster et al., 2018] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In AAAI, pages 2974–2982, 2018.
[Gronauer and Diepold, 2021] Sven Gronauer and Klaus Diepold. Multi-agent deep reinforcement learning: A survey. Artificial Intelligence Review, pages 1–49, 2021.
[Hu et al., 2021] Jian Hu, Siyang Jiang, Seth Austin Harding, Haibin Wu, and Shih-wei Liao. RIIT: Rethinking the importance of implementation tricks in multi-agent reinforcement learning. arXiv preprint arXiv:2102.03479, 2021.
[Jang et al., 2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In ICLR, 2016.
[Kraemer and Banerjee, 2016] Landon Kraemer and Bikramjit Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016.
[Lowe et al., 2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, pages 6379–6390, 2017.
[Mao et al., 2020] Hangyu Mao, Zhengchao Zhang, Zhen Xiao, Zhibo Gong, and Yan Ni. Learning agent communication under limited bandwidth by message pruning. In AAAI, pages 5142–5149, 2020.
[Niu et al., 2021] Yaru Niu, Rohan Paleja, and Matthew Gombolay. Multi-agent graph-attention communication and teaming. In AAMAS, pages 964–973, 2021.
[Oliehoek and Amato, 2016] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016.
[Papoudakis et al., 2021] Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V. Albrecht. Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. In NeurIPS, 2021.
[Rashid et al., 2018] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, pages 4295–4304, 2018.
[Samvelyan et al., 2019] Mikayel Samvelyan, Tabish Rashid, Christian Schröder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob N. Foerster, and Shimon Whiteson. The StarCraft multi-agent challenge. In AAMAS, pages 2186–2188, 2019.
[Singh et al., 2019] Amanpreet Singh, Tushar Jain, and Sainbayar Sukhbaatar. Learning when to communicate at scale in multiagent cooperative and competitive tasks. In ICLR, 2019.
[Srivastava et al., 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[Wang et al., 2020a] Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. Shapley Q-value: A local reward approach to solve global reward games. In AAAI, pages 7285–7292, 2020.
[Wang et al., 2020b] Tonghan Wang, Jianhao Wang, Chongyi Zheng, and Chongjie Zhang. Learning nearly decomposable value functions via communication minimization. In ICLR, 2020.
[Wang et al., 2021] Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, and Chongjie Zhang. DOP: Off-policy multi-agent decomposed policy gradients. In ICLR, 2021.