# Divergence-Regularized Multi-Agent Actor-Critic

Kefan Su¹, Zongqing Lu¹

¹School of Computer Science, Peking University. Correspondence to: Zongqing Lu.

## Abstract

Entropy regularization is a popular method in reinforcement learning (RL). Although it has many advantages, it alters the RL objective of the original Markov Decision Process (MDP). Though divergence regularization has been proposed to address this problem, it cannot be trivially applied to cooperative multi-agent reinforcement learning (MARL). In this paper, we investigate divergence regularization in cooperative MARL and propose a novel off-policy cooperative MARL framework, divergence-regularized multi-agent actor-critic (DMAC). Theoretically, we derive the update rule of DMAC, which is naturally off-policy and guarantees monotonic policy improvement and convergence in both the original MDP and the divergence-regularized MDP. We also give a bound on the discrepancy between the converged policy and the optimal policy in the original MDP. DMAC is a flexible framework and can be combined with many existing MARL algorithms. Empirically, we evaluate DMAC in a didactic stochastic game and the StarCraft Multi-Agent Challenge and show that DMAC substantially improves the performance of existing MARL algorithms.

## 1. Introduction

Regularization is a common method in single-agent reinforcement learning (RL). The optimal policy learned by traditional RL algorithms is always deterministic (Sutton & Barto, 2018). This property may result in inflexibility of the policy when facing unknown environments (Yang et al., 2019). Entropy regularization is proposed to address this problem by learning a policy according to the maximum-entropy principle (Haarnoja et al., 2017). Moreover, entropy regularization is beneficial to exploration and robustness of RL algorithms (Haarnoja et al., 2018). However, entropy regularization is imperfect. Eysenbach & Levine (2019) pointed out that maximum-entropy RL modifies the original RL objective because of the entropy regularizer. Maximum-entropy RL actually learns an optimal policy for the entropy-regularized Markov Decision Process (MDP) rather than the original MDP, i.e., the converged policy may be biased. Nachum et al. (2017) analyzed a more general case of regularization in RL and proposed what we call divergence regularization. Divergence regularization is beneficial to exploration and may help with the bias issue.

Regularization can also be applied to cooperative multi-agent reinforcement learning (MARL) (Agarwal et al., 2020; Zhang et al., 2021). However, most cooperative MARL algorithms do not use regularizers (Lowe et al., 2017; Foerster et al., 2018; Rashid et al., 2018; Son et al., 2019; Jiang et al., 2020; Wang et al., 2021a). Only a few cooperative MARL algorithms, such as FOP (Zhang et al., 2021), use entropy regularization, and thus may suffer from the drawback mentioned above. Divergence regularization, on the other hand, could potentially benefit cooperative MARL. In addition to its advantages mentioned above, divergence regularization can also help control the step size of the policy update, similar to conservative policy iteration (Kakade & Langford, 2002) in single-agent RL.
Conservative policy iteration and its successors, such as TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017), can stabilize policy improvement (Touati et al., 2020). These methods use a surrogate objective for the policy update, but the policies in the centralized training with decentralized execution (CTDE) paradigm may not preserve the properties of the surrogate objective. Moreover, DAPO (Wang et al., 2019), a single-agent RL algorithm using a divergence regularizer, cannot be trivially extended to cooperative MARL settings. Even with tricks like V-trace (Espeholt et al., 2018) for off-policy correction, DAPO is essentially an on-policy algorithm and thus may not be sample-efficient in cooperative MARL settings.

In this paper, we propose divergence-regularized multi-agent actor-critic (DMAC), a novel off-policy cooperative MARL framework. We analyze the general iteration of DMAC and theoretically show that DMAC guarantees monotonic policy improvement and convergence in both the original MDP and the divergence-regularized MDP. We also derive a bound on the discrepancy between the converged policy and the optimal policy in the original MDP. Besides, DMAC benefits exploration and stable policy improvement through our update rule for the target policy. We also propose and analyze divergence policy iteration in general cooperative MARL settings and in a special case combined with value decomposition. Based on divergence policy iteration, we derive the off-policy update rules for the critic, policy, and target policy. Moreover, DMAC is a flexible framework and can be combined with many existing cooperative MARL algorithms to substantially improve their performance.

We empirically investigate DMAC in a didactic stochastic game and the StarCraft Multi-Agent Challenge (Samvelyan et al., 2019). We combine DMAC with five representative MARL methods: COMA (Foerster et al., 2018) for on-policy multi-agent policy gradient, MAAC (Iqbal & Sha, 2019) for off-policy multi-agent actor-critic, QMIX (Rashid et al., 2018) for value decomposition, DOP (Wang et al., 2021b) for the combination of value decomposition and policy gradient, and FOP (Zhang et al., 2021) for the combination of value decomposition and entropy regularization. Experimental results show that DMAC indeed induces better performance, faster convergence, and better stability in most tasks, which verifies the benefits of DMAC and demonstrates the advantages of divergence regularization over entropy regularization in cooperative MARL.

## 2. Related Work

**MARL.** MARL has been a hot topic in the field of RL. In this paper, we focus on cooperative MARL, which is usually modeled as a Dec-POMDP (Oliehoek et al., 2016), where all agents share a reward and aim to maximize the long-term return. The centralized training with decentralized execution (CTDE) paradigm (Lowe et al., 2017) is widely used in cooperative MARL. CTDE usually utilizes a centralized value function to address the non-stationarity of multi-agent settings and decentralized policies for scalability. Many MARL algorithms adopt the CTDE paradigm, such as COMA, MAAC, QMIX, DOP, and FOP. COMA (Foerster et al., 2018) employs a counterfactual baseline, which reduces variance as well as addressing the credit assignment problem. MAAC (Iqbal & Sha, 2019) uses a self-attention mechanism to integrate the local observation and action of each agent and provides structured information for the centralized critic.
Value decomposition (Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019; Yang et al., 2020; Wang et al., 2021a;b; Zhang et al., 2021; Sun et al., 2021; Pan et al., 2021; Peng et al., 2021) is a popular class of cooperative MARL algorithms. These methods express the global Q-function as a function of individual Q-functions so as to satisfy Individual-Global-Max (IGM), which means the optimal actions of the individual Q-functions correspond to the optimal joint action of the global Q-function. QMIX (Rashid et al., 2018) is a representative value decomposition method. It uses a hypernetwork to ensure the monotonicity of the global Q-function in terms of the individual Q-functions, which is a sufficient condition for IGM. DOP (Wang et al., 2021b) combines value decomposition with policy gradient. It uses a linear value decomposition, which is another sufficient condition for IGM and simplifies the computation of the policy gradient. FOP (Zhang et al., 2021) combines value decomposition with entropy regularization and uses a more general condition, Individual-Global-Optimal, in place of IGM. In this paper, we combine DMAC with these algorithms and show its improvement. MAPPO (Yu et al., 2021) is a CTDE version of PPO (Schulman et al., 2017), where a centralized state-value function is learned. However, the policy update rule of MAPPO contradicts that of DMAC, so DMAC cannot be combined with MAPPO.

**Regularization.** Entropy regularization was first proposed in single-agent RL. Nachum et al. (2017) analyzed the entropy-regularized MDP and revealed the properties of the optimal policy and the corresponding Q-function and V-function. They also showed the equivalence of value-based methods and policy-based methods in the entropy-regularized MDP. Haarnoja et al. (2018) pointed out that maximum-entropy RL can achieve better exploration and stability in the face of model and estimation errors. Although entropy regularization has many advantages, Eysenbach & Levine (2019) showed that entropy regularization modifies the MDP and results in a bias of the convergent policy. Yang et al. (2019) revealed the drawbacks of the convergent policy of both general RL and maximum-entropy RL: the former is usually a deterministic policy (Sutton & Barto, 2018) that is not flexible enough for unknown situations, while the latter is a policy with non-zero probability for all actions, which may be dangerous in some scenarios. Neu et al. (2017) analyzed the entropy regularization method from several views. They revealed a more general form of regularization, which is actually divergence regularization, and showed that entropy regularization is just a special case of divergence regularization. Wang et al. (2019) absorbed previous results and proposed an on-policy algorithm, DAPO. However, DAPO cannot be trivially applied to MARL. Moreover, its on-policy learning is not sample-efficient in MARL settings, and its off-policy correction trick, V-trace (Espeholt et al., 2018), is also intractable in MARL. There are some previous studies in single-agent RL that use a target policy similar to ours, but their purposes are quite different. Trust-PCL (Nachum et al., 2018) introduces a target policy as a trust-region constraint for maximum-entropy RL, but the policy is still biased by the entropy regularizer. MIRL (Grau-Moya et al., 2019) uses a distribution that depends only on actions as the target policy to compute a mutual-information regularizer, but it still changes the objective of the original RL.
## 3. Preliminaries

Dec-POMDP is a general model for cooperative MARL. A Dec-POMDP is a tuple $M = \{S, A, P, Y, O, I, n, r, \gamma\}$. $S$ is the state space, $n$ is the number of agents, $\gamma$ is the discount factor, and $I = \{1, 2, \ldots, n\}$ is the set of all agents. $A = A_1 \times A_2 \times \cdots \times A_n$ is the joint action space, where $A_i$ is the individual action space of agent $i$. $P(s'|s, a): S \times A \times S \to [0, 1]$ is the transition function, and $r(s, a): S \times A \to \mathbb{R}$ is the reward function of state $s$ and joint action $a$. $Y$ is the observation space, and $O(s, i): S \times I \to Y$ is a mapping from state to observation for each agent. The objective of a Dec-POMDP is to maximize $J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, and thus we need to find the optimal joint policy $\pi^* = \arg\max_{\pi} J(\pi)$. To address partial observability, the history $\tau_i \in T_i$ is often used in place of the observation $o_i \in Y$. As for policies in CTDE, each agent $i$ has an individual policy $\pi_i(a_i|\tau_i)$ and the joint policy $\pi$ is the product of the $\pi_i$. Though we compute individual policies as $\pi_i(a_i|\tau_i)$ in practice, we write $\pi_i(a_i|s)$ in the analysis and proofs for simplicity.

Entropy regularization adds the logarithm of the probability of the sampled action under the current policy to the reward function. It modifies the optimization objective as $J_{\text{ent}}(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) - \omega \log \pi(a_t|s_t)\right)\right]$. We also have the corresponding Q-function $Q^{\pi}_{\text{ent}}(s, a) = r(s, a) + \gamma \mathbb{E}\left[V^{\pi}_{\text{ent}}(s')\right]$ and V-function $V^{\pi}_{\text{ent}}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) - \omega \log \pi(a_t|s_t)\right) \mid s_0 = s\right]$. Given these definitions, we can deduce an interesting property: $V^{\pi}_{\text{ent}}(s) = \mathbb{E}\left[Q^{\pi}_{\text{ent}}(s, a)\right] + \omega \mathcal{H}(\pi(\cdot|s))$, where $\mathcal{H}(\pi(\cdot|s))$ is the entropy of the policy $\pi(\cdot|s)$. $V^{\pi}_{\text{ent}}(s)$ includes an entropy term, which is why this is called entropy regularization.

## 4. DMAC

In this section, we first give the definition of divergence regularization. Then we propose the selection of the target policy and derive its theoretical properties. Next, we propose and analyze divergence policy iteration. Finally, we derive the update rules of the critic and actors and obtain the DMAC algorithm. Note that all proofs are given in Appendix A.

### 4.1. Divergence Regularization

We maintain a target policy $\rho_i$ for each agent $i$, which is different from the agent policy $\pi_i$. Together they form a joint target policy $\rho = \prod_{i=1}^{n} \rho_i$. The joint target policy $\rho$ modifies the objective function as

$$J_{\rho}(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) - \omega \log \frac{\pi(a_t|s_t)}{\rho(a_t|s_t)}\right)\right].$$

That is, a regularizer $-\omega \log \frac{\pi(a_t|s_t)}{\rho(a_t|s_t)}$, which describes the discrepancy between the policy $\pi$ and the target policy $\rho$, is added to the reward function, just like in entropy regularization. Given $\rho$, we can define the corresponding V-function and Q-function for divergence regularization as follows,

$$V^{\pi}_{\rho}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) - \omega \log \frac{\pi(a_t|s_t)}{\rho(a_t|s_t)}\right) \Bigm| s_0 = s\right],$$

$$Q^{\pi}_{\rho}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s, a)}\left[V^{\pi}_{\rho}(s')\right].$$

Further, by simple deduction, we have

$$V^{\pi}_{\rho}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[Q^{\pi}_{\rho}(s, a) - \omega \log \frac{\pi(a|s)}{\rho(a|s)}\right] = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[Q^{\pi}_{\rho}(s, a)\right] - \omega D_{\mathrm{KL}}\left(\pi(\cdot|s) \,\|\, \rho(\cdot|s)\right).$$

$V^{\pi}_{\rho}(s)$ includes an extra term, the KL divergence between $\pi$ and $\rho$, and thus this regularizer is referred to as divergence regularization.
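The identity above is easy to verify numerically. The following NumPy snippet is a minimal sketch of our own (not code from the paper) that checks $V^{\pi}_{\rho}(s) = \mathbb{E}_a[Q^{\pi}_{\rho}(s,a)] - \omega D_{\mathrm{KL}}(\pi(\cdot|s)\,\|\,\rho(\cdot|s))$ for a single state with arbitrary $Q$, $\pi$, and $\rho$ values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, omega = 5, 0.2

# Arbitrary per-state quantities for illustration.
q = rng.normal(size=n_actions)              # Q^pi_rho(s, .)
pi = rng.dirichlet(np.ones(n_actions))      # current policy pi(.|s)
rho = rng.dirichlet(np.ones(n_actions))     # target policy rho(.|s)

# Left-hand side: expectation under pi of the regularized value.
v_lhs = np.sum(pi * (q - omega * np.log(pi / rho)))

# Right-hand side: expected Q minus omega times KL(pi || rho).
kl = np.sum(pi * np.log(pi / rho))
v_rhs = np.sum(pi * q) - omega * kl

print(v_lhs, v_rhs)   # the two values coincide up to floating-point error
```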
### 4.2. Target Policy

Intuitively, the regularizer $-\omega \log \frac{\pi(a_t|s_t)}{\rho(a_t|s_t)}$ can help balance exploration and exploitation. For example, for some action $a$, if $\rho(a|s) > \pi(a|s)$, then the regularizer is equivalent to adding a positive value to the reward, and vice versa. Therefore, if we choose the previous policy as the target policy, the regularizer will encourage agents to take actions whose probability has decreased and discourage agents from taking actions whose probability has increased. Additionally, the regularizer controls the discrepancy between the current policy and the previous policy, which can stabilize policy improvement (Kakade & Langford, 2002; Schulman et al., 2015; 2017).

We further analyze the selection of the target policy theoretically. Let $\mu_{\pi}$ denote the state-action distribution given a policy $\pi$. That is, $\mu_{\pi}(s, a) = d_{\pi}(s)\pi(a|s)$, where $d_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s|\pi)$ is the stationary distribution of states given $\pi$. With $\mu_{\pi}$, we can rewrite the optimization objective $J_{\rho}(\pi)$ as follows,

$$J_{\rho}(\pi) = \sum_{s,a} \mu_{\pi}(s, a) r(s, a) - \omega \sum_{s,a} \mu_{\pi}(s, a) \log \frac{\pi(a|s)}{\rho(a|s)} = \sum_{s,a} \mu_{\pi}(s, a) r(s, a) - \omega D_C\left(\mu_{\pi} \,\|\, \mu_{\rho}\right), \tag{1}$$

where $D_C\left(\mu_{\pi} \,\|\, \mu_{\rho}\right) = \sum_{s,a} \mu_{\pi}(s, a) \log \frac{\pi(a|s)}{\rho(a|s)}$ is a Bregman divergence (Neu et al., 2017). Therefore, the objective of the divergence-regularized MDP can be expressed as

$$\pi^*_{\rho} = \arg\max_{\pi} \sum_{s,a} \mu_{\pi}(s, a) r(s, a) - \omega D_C\left(\mu_{\pi} \,\|\, \mu_{\rho}\right). \tag{2}$$

With this property, similar to Neu et al. (2017) and Wang et al. (2019), we can use the following iterative process,

$$\pi_{t+1} = \arg\max_{\pi} \sum_{s,a} \mu_{\pi}(s, a) r(s, a) - \omega D_C\left(\mu_{\pi} \,\|\, \mu_{\pi_t}\right). \tag{3}$$

In this iteration, by taking the target policy $\rho$ as the previous policy $\pi_t$ when updating the policy $\pi_{t+1}$, we can obtain the following inequalities,

$$J(\pi_{t+1}) \ge J(\pi_{t+1}) - \omega D_C\left(\mu_{\pi_{t+1}} \,\|\, \mu_{\pi_t}\right) = J_{\pi_t}(\pi_{t+1}) \ge J(\pi_t) - \omega D_C\left(\mu_{\pi_t} \,\|\, \mu_{\pi_t}\right) = J(\pi_t) \ge J_{\pi_{t-1}}(\pi_t).$$

The first and the third inequalities follow from $D_C\left(\mu_{\pi} \,\|\, \mu_{\rho}\right) \ge 0$, and the second inequality follows from the definition of $\pi_{t+1}$. This indicates that the policy sequence obtained by this iteration improves monotonically in both the divergence-regularized MDP (i.e., $J_{\pi_t}(\pi_{t+1}) \ge J_{\pi_{t-1}}(\pi_t)$) and the original MDP (i.e., $J(\pi_{t+1}) \ge J(\pi_t)$). Moreover, as the sequences $\{J_{\pi_t}(\pi_{t+1})\}$ and $\{J(\pi_{t+1})\}$ are both bounded, these two sequences converge. With these deductions, we obtain the following proposition.

**Proposition 1.** By iteratively applying the iteration (3) and taking $\pi_k$ as $\rho_k$, the policy sequence $\{\pi_k\}$ converges and improves monotonically in both the divergence-regularized MDP and the original MDP.

For the sake of simplicity, we denote the converged policy of iteration (3) and its corresponding Q-function and V-function as $\bar{\pi}^*$, $\bar{Q}^*$, and $\bar{V}^*$, respectively. We then further discuss the properties of $\bar{\pi}^*$, $\bar{Q}^*$, and $\bar{V}^*$. The expression of $\bar{\pi}^*$ is determined by the initial policy $\pi_0$ and the actions that attain the optimal value of $\bar{Q}^*$. We define the set containing the optimal actions as $U_s = \{a \in A \mid a = \arg\max_{a'} \bar{Q}^*(s, a')\}$. Without loss of generality, we suppose that $\pi_0(a|s) > 0$ for all state-action pairs. Then we have the following proposition about $\bar{\pi}^*$.

**Proposition 2.** All the probability mass of $\bar{\pi}^*$ lies on the optimal actions of $\bar{Q}^*$, in proportion to their probabilities under the initial policy $\pi_0$,

$$\bar{\pi}^*(a|s) = \mathbb{1}(a \in U_s)\, \frac{\pi_0(a|s)}{\sum_{a' \in U_s} \pi_0(a'|s)}.$$

As for the properties of $\bar{Q}^*$ and $\bar{V}^*$, by their definitions, we have the following proposition.

**Proposition 3.** $\bar{Q}^*$ is the same as $Q^{\bar{\pi}^*}$, the Q-function of the policy $\bar{\pi}^*$ in the original MDP, while $\bar{V}^*$ is the same as $V^{\bar{\pi}^*}$, the V-function of the policy $\bar{\pi}^*$ in the original MDP.
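Propositions 1 and 2 can be illustrated with a one-state bandit (a minimal sketch of our own, not an experiment from the paper). With a single state and $\gamma = 0$, the regularized problem in iteration (3) has the closed-form solution $\pi_{t+1}(a) \propto \pi_t(a)\exp(r(a)/\omega)$ when $\rho$ is the previous policy, so the iterates concentrate on the reward-maximizing actions in proportion to $\pi_0$, exactly as Proposition 2 predicts.

```python
import numpy as np

# One-state bandit: iteration (3) reduces to pi_{t+1}(a) ∝ pi_t(a) * exp(r(a) / omega).
r = np.array([1.0, 1.0, 0.5, 0.0])      # two optimal actions (indices 0 and 1)
omega = 0.5
pi = np.array([0.1, 0.3, 0.4, 0.2])     # initial policy pi_0 (all positive)
pi0 = pi.copy()

for t in range(200):
    pi = pi * np.exp(r / omega)         # divergence-regularized improvement step
    pi = pi / pi.sum()
    # The expected reward np.dot(pi, r) never decreases along the iterates.

# Proposition 2: the limit puts mass only on argmax r, proportional to pi_0 there.
optimal = r == r.max()
predicted = np.where(optimal, pi0, 0.0)
predicted = predicted / predicted.sum()
print(np.round(pi, 4))         # ~[0.25, 0.75, 0.  , 0.  ]
print(np.round(predicted, 4))  # matches the prediction of Proposition 2
```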
With all these results, we can further bound the discrepancy between the policy $\bar{\pi}^*$ and the optimal policy $\pi^*$ in the original MDP in terms of their V-functions, and we have the following theorem.

**Theorem 1.** If the initial policy $\pi_0$ is a uniform distribution, then we have

$$V^{\bar{\pi}^*} \ge V^* - \frac{\omega}{1-\gamma}\log|A|, \tag{4}$$

where $V^*$ is the optimal V-function in the original MDP and $|A|$ is the number of actions in $A$.

This theorem tells us that the discrepancy between $\bar{\pi}^*$ and $\pi^*$ can be controlled by the coefficient of the regularization term, $\omega$. At this stage, we can partly summarize the benefits of divergence regularization and the target policy. Divergence regularization is beneficial to exploration, as witnessed by the empirical results later. The policy sequence obtained by our selection of the target policy converges and monotonically improves not just in the regularized MDP, but also in the original MDP. Moreover, the V-function of the converged policy in the original MDP can be made sufficiently close to that of the optimal policy with a proper $\omega$.

### 4.3. Divergence Policy Iteration

To carry out the update in the iteration (3), we need to study the divergence-regularized MDP given a fixed target policy $\rho$. From the perspective of policy evaluation, we can define an operator $\Gamma^{\pi}_{\rho}$ as

$$\Gamma^{\pi}_{\rho} Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a),\, a' \sim \pi(\cdot|s')}\left[Q(s', a') - \omega \log \frac{\pi(a'|s')}{\rho(a'|s')}\right]$$

and have the following lemma.

**Lemma 1 (Divergence Policy Evaluation).** For any initial Q-function $Q^0(s, a): S \times A \to \mathbb{R}$, define a sequence $\{Q^k\}$ by $Q^{k+1} = \Gamma^{\pi}_{\rho} Q^k$. Then the sequence converges to $Q^{\pi}_{\rho}$ as $k \to \infty$.

After evaluating the policy, we need a way to improve it. We have the following lemma about policy improvement.

**Lemma 2 (Divergence Policy Improvement).** If we define $\pi_{\text{new}}$ satisfying

$$\pi_{\text{new}}(\cdot|s) = \arg\min_{\pi} D_{\mathrm{KL}}\left(\pi(\cdot|s) \,\|\, u(\cdot|s)\right), \tag{5}$$

where $u(\cdot|s) = \frac{\rho(\cdot|s)\exp\left(Q^{\pi_{\text{old}}}_{\rho}(s, \cdot)/\omega\right)}{Z^{\pi_{\text{old}}}(s)}$ and $Z^{\pi_{\text{old}}}(s)$ is a normalization term, then for all actions $a$ and all states $s$ we have $Q^{\pi_{\text{new}}}_{\rho}(s, a) \ge Q^{\pi_{\text{old}}}_{\rho}(s, a)$.

Lemma 1 and Lemma 2 can be seen as corollaries of the results of Haarnoja et al. (2018). Lemma 2 indicates that given a policy $\pi_{\text{old}}$, if we find a policy $\pi_{\text{new}}$ according to (5), then $\pi_{\text{new}}$ is better than $\pi_{\text{old}}$. Lemma 2 does not make any assumption and holds in general settings. Further, the policy improvement can be established and simplified based on value decomposition. In the following, we give an example for linear value decomposition (LVD) like DOP, i.e., $Q(s, a) = \sum_i k_i(s) Q_i(s, a_i) + b(s)$ (Wang et al., 2021b).

**Lemma 3 (Divergence Policy Improvement with LVD).** If the Q-functions satisfy $Q^{\pi}_{\rho}(s, a) = \sum_i k_i(s) Q^{\pi_i}_{\rho}(s, a_i) + b(s)$ and we define $\pi^i_{\text{new}}$ satisfying

$$\pi^i_{\text{new}}(\cdot|s) = \arg\min_{\pi_i} D_{\mathrm{KL}}\left(\pi_i(\cdot|s) \,\|\, u_i(\cdot|s)\right) \quad \forall i \in I,$$

where $u_i(\cdot|s) = \frac{\rho_i(\cdot|s)\exp\left(k_i(s) Q^{\pi^i_{\text{old}}}_{\rho}(s, \cdot)/\omega\right)}{Z^{\pi^i_{\text{old}}}(s)}$ and $Z^{\pi^i_{\text{old}}}(s)$ is a normalization term, then for all actions $a$ and all states $s$, we have $Q^{\pi_{\text{new}}}_{\rho}(s, a) \ge Q^{\pi_{\text{old}}}_{\rho}(s, a)$.

Lemma 3 further tells us that if the MARL setting satisfies the linear value decomposition condition, then each agent can optimize its individual policy with an objective defined by its own individual Q-function, which immediately improves the joint policy.
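In the tabular case, the improvement step of Lemma 2 (and its per-agent form in Lemma 3) has a simple closed form: the KL minimizer is $u$ itself. The sketch below is our own illustration with placeholder shapes, not code from the paper.

```python
import numpy as np

def improve_joint(q, rho, omega):
    """Closed-form minimizer of KL(pi || u) over all distributions:
    pi_new = u, i.e. pi_new(a|s) ∝ rho(a|s) * exp(Q(s, a) / omega)."""
    u = rho * np.exp(q / omega)
    return u / u.sum(axis=-1, keepdims=True)

def improve_per_agent(q_i, k_i, rho_i, omega):
    """Per-agent improvement under linear value decomposition (Lemma 3):
    pi_i_new(a_i|s) ∝ rho_i(a_i|s) * exp(k_i(s) * Q_i(s, a_i) / omega)."""
    u_i = rho_i * np.exp(k_i[..., None] * q_i / omega)
    return u_i / u_i.sum(axis=-1, keepdims=True)

# Toy example: one state, one "joint" distribution over 4 actions,
# and two agents with 3 actions each for the per-agent case.
rng = np.random.default_rng(0)
omega = 0.2

q = rng.normal(size=4)
rho = np.full(4, 0.25)
print(improve_joint(q, rho, omega))

q_i = rng.normal(size=(2, 3))     # individual Q_i(s, a_i) for 2 agents
k_i = np.array([0.7, 1.3])        # positive mixing weights k_i(s)
rho_i = np.full((2, 3), 1 / 3)
print(improve_per_agent(q_i, k_i, rho_i, omega))
```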
By combining divergence policy evaluation and divergence policy improvement, we have the following theorem of divergence policy iteration.

**Theorem 2 (Divergence Policy Iteration).** By iteratively applying Divergence Policy Evaluation and Divergence Policy Improvement, we obtain a sequence $\{Q^k\}$ that converges to the optimal Q-function $Q^*_{\rho}$, and the corresponding policy sequence converges to the optimal policy $\pi^*_{\rho}$.

Theorem 2 shows that with repeated application of divergence policy evaluation and divergence policy improvement, the policy monotonically improves and converges to the optimal policy. We use $\pi^*_{\rho}$, $V^*_{\rho}(s)$, and $Q^*_{\rho}(s, a)$ to denote the optimal policy, V-function, and Q-function, respectively, given a target policy $\rho$. With all these results, we have enough tools to derive the practical update rules of the critic and actors of DMAC.

### 4.4. Divergence-Regularized Critic

For the update of the critic, we have the following proposition.

**Proposition 4.**

$$Q^*_{\rho}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[Q^*_{\rho}(s', a') - \omega \log \frac{\pi^*_{\rho}(a'|s')}{\rho(a'|s')}\right]$$

holds for any action $a' \in A$.

Proposition 4 gives an iterative formula for $Q^*_{\rho}(s, a)$, with which we can define a loss function and update rule for learning the critic,

$$L_Q = \mathbb{E}\left[\left(Q_{\phi}(s, a) - y\right)^2\right], \quad \text{where } y = r(s, a) + \gamma \left(Q_{\bar{\phi}}(s', a') - \omega \log \frac{\pi(a'|s')}{\rho(a'|s')}\right),$$

and $\phi$ and $\bar{\phi}$ are the weights of the Q-function and the target Q-function, respectively. The update of the Q-function is similar to that in a general MDP, except that the action for the next state can be chosen arbitrarily, whereas in a general MDP it must be the action that maximizes the Q-function at the next state. This property greatly enhances the flexibility of learning the Q-function, e.g., we can easily extend it to TD($\lambda$).
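The critic loss $L_Q$ above translates directly into code. Below is a minimal PyTorch sketch of our own (not the authors' implementation), which simplifies the setting by treating the joint action as a single discrete variable; the names `q_net`, `target_q_net`, `pi`, and `rho` are placeholder interfaces we assume.

```python
import torch

def critic_loss(q_net, target_q_net, pi, rho, batch, omega, gamma):
    """Divergence-regularized TD(0) critic loss L_Q (a sketch, not the paper's code).

    batch: dict of tensors with keys s, a, r, s_next (batch dimension B).
    pi(s), rho(s): return action probabilities of shape [B, n_actions].
    """
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q_phi(s, a)

    with torch.no_grad():
        pi_next = pi(s_next)                                        # pi(.|s')
        rho_next = rho(s_next)                                      # rho(.|s')
        a_next = torch.distributions.Categorical(pi_next).sample()  # any a' works
        q_next = target_q_net(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
        log_ratio = (pi_next.gather(1, a_next.unsqueeze(1)).log()
                     - rho_next.gather(1, a_next.unsqueeze(1)).log()).squeeze(1)
        y = r + gamma * (q_next - omega * log_ratio)                # regularized target

    return torch.mean((q_sa - y) ** 2)
```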
### 4.5. Divergence-Regularized Actors

DAPO (Wang et al., 2019) analyzes the divergence-regularized MDP from the perspective of the policy gradient theorem (Sutton et al., 2000) and gives an on-policy update rule for single-agent RL. Unlike existing work, we take a different perspective and derive an off-policy update rule by taking into account the characteristics of MARL. From Lemma 2, we can obtain an optimization target for policy improvement,

$$\arg\min_{\pi} D_{\mathrm{KL}}\left(\pi(\cdot|s) \,\Bigm\|\, \frac{\rho(\cdot|s)\exp\left(Q^{\pi_{\text{old}}}_{\rho}(s, \cdot)/\omega\right)}{Z^{\pi_{\text{old}}}(s)}\right) = \arg\max_{\pi} \sum_{a} \pi(a|s)\left(Q^{\pi_{\text{old}}}_{\rho}(s, a) - \omega \log \frac{\pi(a|s)}{\rho(a|s)}\right).$$

Then, we can define the objective of the actors,

$$L_{\pi} = \mathbb{E}_{s \sim D}\left[\sum_{a} \pi(a|s)\left(Q^{\pi_{\text{old}}}_{\rho}(s, a) - \omega \log \frac{\pi(a|s)}{\rho(a|s)}\right)\right],$$

where $D$ is the replay buffer. Suppose each individual policy $\pi_i$ has parameters $\theta_i$. We can obtain the following policy gradient for each agent, with the derivation given in Appendix A.9,

$$\nabla_{\theta_i} L_{\pi} = \mathbb{E}_{s \sim D,\, a \sim \pi}\left[\nabla_{\theta_i} \log \pi_i(a_i|s)\left(Q^{\pi_{\text{old}}}_{\rho}(s, a) - \omega \log \frac{\pi(a|s)}{\rho(a|s)} - \omega\right)\right].$$

We need to point out that the key to the off-policy update is that Lemma 2 does not restrict the state distribution; it only requires that the condition is satisfied for each state. Therefore, we can maintain a replay buffer to cover different states as much as possible, which is a common practice in off-policy learning. DAPO uses a formula similar to ours, but it obtains the formula from the policy gradient theorem, which requires the state distribution of the current policy.

Further, we can add a counterfactual baseline to the gradient. First, we have the following equation about the counterfactual baseline (Foerster et al., 2018),

$$\mathbb{E}_{s \sim D,\, a \sim \pi}\left[\nabla_{\theta_i} \log \pi_i(a_i|s)\, b(s, a_{-i})\right] = 0,$$

where $a_{-i}$ denotes the joint action of all agents except agent $i$. Next, we take the baseline as

$$b(s, a_{-i}) = \mathbb{E}_{a_i \sim \pi_i}\left[Q^{\pi_{\text{old}}}_{\rho}(s, a) - \omega \log \frac{\pi(a|s)}{\rho(a|s)} - \omega\right].$$

*Figure 1. Learning curves in terms of episode rewards of the COMA, MAAC, QMIX, and DOP groups in the randomly generated stochastic game.*

Then, the gradient for each agent $i$ can be modified as follows,

$$\nabla_{\theta_i} L_{\pi} = \mathbb{E}\left[\nabla_{\theta_i} \log \pi_i(a_i|s)\left(Q^{\pi_{\text{old}}}_{\rho}(s, a) - \omega \log \frac{\pi_i(a_i|s)}{\rho_i(a_i|s)} - \mathbb{E}_{a_i \sim \pi_i}\left[Q^{\pi_{\text{old}}}_{\rho}(s, a)\right] + \omega D_{\mathrm{KL}}\left(\pi_i(\cdot|s) \,\|\, \rho_i(\cdot|s)\right)\right)\right].$$

In addition to variance reduction and credit assignment, this counterfactual baseline eliminates the policies of other agents from the gradient. This property makes the gradient convenient to compute and the target policy of each agent easy to select. Moreover, if the linear value decomposition condition is satisfied, we have the following gradient formula,

$$\nabla_{\theta_i} L_{\pi} = \mathbb{E}\left[\nabla_{\theta_i} \log \pi_i(a_i|s)\left(k_i(s) A^{\pi^i_{\text{old}}}_{\rho}(s, a_i) - \omega \log \frac{\pi_i(a_i|s)}{\rho_i(a_i|s)} + \omega D_{\mathrm{KL}}\left(\pi_i(\cdot|s) \,\|\, \rho_i(\cdot|s)\right)\right)\right],$$

where $A^{\pi^i_{\text{old}}}_{\rho}(s, a_i) = Q^{\pi^i_{\text{old}}}_{\rho}(s, a_i) - \mathbb{E}_{a_i \sim \pi_i}\left[Q^{\pi^i_{\text{old}}}_{\rho}(s, a_i)\right]$.

### 4.6. Algorithm

Every iteration of (3) needs a converged policy. However, it is intractable to perform this update rule exactly in practice, so we propose an approximate alternative. For each agent, we update the policy $\pi_i$ and the target policy $\rho_i$ as $\theta_i \leftarrow \theta_i + \beta \nabla_{\theta_i} L_{\pi}$ and $\bar{\theta}_i \leftarrow (1 - \tau)\bar{\theta}_i + \tau \theta_i$, respectively, where $\beta$ is the learning rate, $\bar{\theta}_i$ is the weights of $\rho_i$, and $\tau$ is the hyperparameter for the soft update. Here we use one gradient step to replace the max operator in (3). From Theorem 2 and the previous discussion, we know that optimizing $L_{\pi}$ maximizes $J_{\rho}(\pi)$, so we use $\nabla_{\theta_i} L_{\pi}$ in the gradient step for off-policy training instead of a gradient step directly optimizing $J_{\rho}(\pi)$ in (1). Moreover, even if we take the target policies $\{\rho_k\}$ as the moving average of the policies $\{\pi_k\}$, properties such as monotonic improvement, convergence of the value functions and policies, and the bound on the bias are still preserved. The details of these results are included in Appendix A.10. Now we have all the update rules of DMAC. The training of DMAC is a typical off-policy learning process, which is given in Appendix B for completeness.

## 5. Experiments

In this section, we first empirically study the benefits of DMAC and investigate how DMAC improves the performance of existing MARL algorithms in a didactic stochastic game and five SMAC maps. Then, we demonstrate the advantages of the divergence regularizer over the entropy regularizer.

### 5.1. Improvements of Existing Methods

DMAC is a flexible framework and can be combined with many existing MARL algorithms. In the experiments, we choose four representative algorithms for different types of methods: COMA (Foerster et al., 2018) for on-policy multi-agent policy gradient, MAAC (Iqbal & Sha, 2019) for off-policy multi-agent actor-critic, QMIX (Rashid et al., 2018) for value decomposition, and DOP (Wang et al., 2021b) for the combination of value decomposition and policy gradient. These algorithms need minor modifications to fit the framework of DMAC. We denote the modified algorithms as COMA+DMAC, MAAC+DMAC, QMIX+DMAC, and DOP+DMAC. In general, our modifications are limited and try to keep the original architecture so as to fairly demonstrate the improvement from DMAC. The details of the modifications and hyperparameters are included in Appendix C. All curves in our plots correspond to the mean over five training runs with different random seeds, and shaded regions indicate the 95% confidence interval.

**A Didactic Example.** We first test the four groups of methods in a stochastic game where agents share the reward. The stochastic game is generated randomly for the reward function and transition probabilities, with 30 states, 3 agents, and 5 actions per agent. Each episode contains 30 timesteps.
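A shared-reward stochastic game of this size is easy to construct; the sketch below is our own illustration of such an environment (the exact sampling scheme, Dirichlet transitions and uniform rewards, is our assumption and not specified in the paper).

```python
import numpy as np

# A randomly generated shared-reward stochastic game with the sizes stated above:
# 30 states, 3 agents, 5 actions per agent, 30-step episodes.
class RandomStochasticGame:
    def __init__(self, n_states=30, n_agents=3, n_actions=5, episode_len=30, seed=0):
        rng = np.random.default_rng(seed)
        n_joint = n_actions ** n_agents
        self.n_agents, self.n_actions, self.episode_len = n_agents, n_actions, episode_len
        self.P = rng.dirichlet(np.ones(n_states), size=(n_states, n_joint))  # P[s, a] -> dist over s'
        self.R = rng.uniform(0.0, 1.0, size=(n_states, n_joint))             # shared reward r(s, a)
        self.rng, self.s, self.t = rng, 0, 0

    def reset(self):
        self.s, self.t = int(self.rng.integers(self.P.shape[0])), 0
        return self.s

    def step(self, actions):
        # Encode the joint action of all agents as a single index.
        a = 0
        for a_i in actions:
            a = a * self.n_actions + int(a_i)
        r = self.R[self.s, a]
        self.s = int(self.rng.choice(self.P.shape[0], p=self.P[self.s, a]))
        self.t += 1
        return self.s, r, self.t >= self.episode_len

env = RandomStochasticGame()
s, done = env.reset(), False
while not done:
    s, r, done = env.step(np.random.randint(5, size=3))   # random joint actions
```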
*Figure 2. Scatter plot of the converged COMA and COMA+DMAC with different ω in terms of episode rewards and cover rate in the randomly generated stochastic game.*

*Figure 3. Learning curves in terms of win rates or episode rewards of the COMA, MAAC, QMIX, and DOP groups in five SMAC maps (each row corresponds to a map and each column to a group).*

The performance of these methods is illustrated in Figure 1. We find that DMAC performs better than the baseline at the end of training in all four groups. Moreover, COMA+DMAC, QMIX+DMAC, and MAAC+DMAC learn faster than their baselines. Though DOP learns faster than DOP+DMAC at the start, it falls into a sub-optimal policy, and DOP+DMAC finds a better policy in the end.

We also show the benefit of exploration in this stochastic game, where the statistics are easy to collect. We evaluate exploration in terms of the cover rate of all state-action pairs, i.e., the ratio of explored state-action pairs to all state-action pairs. The cover rates of COMA and COMA+DMAC with different $\omega$ are illustrated in Figure 2. We first study the effect of $\omega$ on the tradeoff between exploration and exploitation. From our previous analysis, a smaller $\omega$ gives a tighter bound on the converged performance of DMAC, but it also means that the exploration benefit of the regularizer is smaller, which also affects performance. So we need to choose a proper $\omega$ in practice. We can see in Figure 2 that when $\omega$ is large, as in the cases $\omega = 10$ and $\omega = 100$, exploration is guaranteed but the performance is quite low, since the regularizer is much larger than the environmental reward. When $\omega$ is small, as in the cases $\omega = 0.001$ and $\omega = 0.01$, the performance is limited by the ability to explore. Only when $\omega$ lies in a proper interval, such as $\omega = 0.1$, $\omega = 0.2$, and $\omega = 1$, do the agents of DMAC obtain a good balance between exploration and exploitation in this task. We use COMA here as a representative of traditional policy gradient methods in cooperative MARL. In practice, we choose $\omega = 0.2$ for COMA+DMAC. We find that the cover rate of COMA+DMAC ($\omega = 0.2$) is higher than that of COMA, which is evidence of the exploration benefit of DMAC. The cover rates of the other three groups of algorithms and the learning curves for the methods in Figure 2 are available in Appendix D.

We test all the methods in five tasks of SMAC (Samvelyan et al., 2019). The introduction of the SMAC environment and the training details are included in Appendix C.
The learning curves in terms of win rate (or episode rewards) of all the methods in the five SMAC maps are illustrated in Figure 3 (four columns for the four groups of algorithms and five rows for the five SMAC maps). For the cases where both the baseline and DMAC can hardly win, such as 3s5z and MMM2 for the COMA group and MMM2 for the MAAC group, we use episode rewards to show the difference.

*Figure 4. Learning curves in terms of win rates of FOP and FOP+DMAC in five SMAC maps.*

In addition, the learning curves in terms of episode rewards are available in Appendix D, including two additional maps, 3m and 8m. We show the empirical result of DAPO on the map 3m in the first figure of the second column of Figure 8 in Appendix D. DAPO cannot obtain good performance even in this simple SMAC task, so we skip it in the other SMAC tasks. The reason for the low performance of DAPO may be that DAPO omits the correction for $d_{\pi}(s)/d_{\pi_t}(s)$ in the policy update, which introduces bias into the policy gradient, and uses V-trace for off-policy correction, which is itself biased. These drawbacks may be magnified in MARL settings. The superiority of our naturally off-policy method over the biased off-policy correction method can be partly seen from the large performance gap between COMA+DMAC and DAPO.

In all five tasks, MAAC+DMAC outperforms MAAC significantly, while MAAC+DMAC does not change the network architecture of MAAC, which shows the benefit of the divergence regularizer. As for COMA and COMA+DMAC, we find that COMA+DMAC has higher win rates than COMA in most cases at the end of training, which can be attributed to the benefits of off-policy learning and the exploration induced by the divergence regularizer. Though in some cases COMA learns faster than COMA+DMAC, it falls into a sub-optimum in the end. This phenomenon can be observed more clearly in the plots of episode rewards in hard tasks like 3s5z, and is evidence of the advantage of the divergence regularizer, which helps the agents find a better policy. The stable policy improvement brought by the divergence regularizer is manifested by the small variance of the learning curves, especially in the comparison between QMIX and QMIX+DMAC. In most tasks, we find that QMIX+DMAC learns substantially faster than QMIX and achieves higher win rates in harder tasks. The results of the DOP group are illustrated in the fourth column of Figure 3. DOP+DMAC learns faster than DOP in most cases and finally obtains better performance. The difference between DOP and DOP+DMAC can also partly show the advantage of a naturally off-policy method over an off-policy correction method, as DOP+DMAC replaces the tree backup loss with off-policy TD($\lambda$).

DMAC improves the performance and/or convergence speed of the evaluated algorithms in most tasks. This empirically demonstrates the benefits of the divergence regularizer. Moreover, the superiority of our naturally off-policy learning over the biased off-policy correction method can be partly witnessed from the empirical results.

### 5.2. Comparison with Entropy Regularization

FOP (Zhang et al., 2021) combines value decomposition with entropy regularization and obtained state-of-the-art performance in SMAC tasks.
FOP has a well-tuned scheme for the temperature parameter of the entropy, so we take FOP as a strong baseline for entropy-regularized methods in cooperative MARL. We compare FOP and FOP+DMAC in five SMAC tasks, 3s_vs_3z, 2s3z, 3s5z, 2c_vs_64zg, and MMM2, which cover the three difficulty levels (i.e., easy, hard, and super hard) of SMAC tasks. Three of these tasks are taken from the original FOP paper. The modifications of FOP+DMAC are also included in Appendix C. The win rates of FOP and FOP+DMAC are illustrated in Figure 4. We find that FOP+DMAC learns much faster than FOP in 3s_vs_3z, and performs better than FOP in the other four tasks. These results are evidence of the advantage of DMAC, which guarantees monotonic improvement in the original MDP.

## 6. Conclusion

We propose a multi-agent actor-critic framework, DMAC, for cooperative MARL. We investigate divergence regularization, derive divergence policy iteration, and deduce the update rules for the critic, policy, and target policy in multi-agent settings. DMAC is a naturally off-policy framework, and the divergence regularizer is beneficial to exploration and stable policy improvement. DMAC is also a flexible framework and can be combined with many existing MARL algorithms. It is empirically demonstrated that combining DMAC with existing MARL algorithms improves performance and convergence speed in a stochastic game and SMAC tasks.

## Acknowledgements

We would like to thank the anonymous reviewers for their useful comments to improve our work. This work is supported in part by NSF China under grant 61872009 and Huawei.

## References

Agarwal, A., Kumar, S., Sycara, K., and Lewis, M. Learning Transferable Cooperative Behavior in Multi-Agent Teams. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2020.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. In International Conference on Machine Learning (ICML), 2018.

Eysenbach, B. and Levine, S. If MaxEnt RL Is The Answer, What Is The Question? arXiv preprint arXiv:1910.01913, 2019.

Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual Multi-Agent Policy Gradients. In AAAI Conference on Artificial Intelligence (AAAI), 2018.

Grau-Moya, J., Leibfried, F., and Vrancx, P. Soft Q-Learning with Mutual-Information Regularization. In International Conference on Learning Representations (ICLR), 2019.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement Learning with Deep Energy-Based Policies. In International Conference on Machine Learning (ICML), 2017.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with A Stochastic Actor. In International Conference on Machine Learning (ICML), 2018.

Iqbal, S. and Sha, F. Actor-Attention-Critic for Multi-Agent Reinforcement Learning. In International Conference on Machine Learning (ICML), 2019.

Jiang, J., Dun, C., Huang, T., and Lu, Z. Graph Convolutional Reinforcement Learning. In International Conference on Learning Representations (ICLR), 2020.

Kakade, S. and Langford, J. Approximately Optimal Approximate Reinforcement Learning. In International Conference on Machine Learning (ICML), 2002.

Lowe, R., Wu, Y. I., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I.
Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging The Gap Between Value and Policy Based Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An Off-Policy Trust Region Method for Continuous Control. In International Conference on Learning Representations (ICLR), 2018.

Neu, G., Jonsson, A., and Gómez, V. A Unified View of Entropy-Regularized Markov Decision Processes. arXiv preprint arXiv:1705.07798, 2017.

Oliehoek, F. A., Amato, C., et al. A Concise Introduction To Decentralized POMDPs, volume 1. Springer, 2016.

Pan, L., Rashid, T., Peng, B., Huang, L., and Whiteson, S. Regularized Softmax Deep Multi-Agent Q-Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Peng, B., Rashid, T., de Witt, C. A. S., Kamienny, P.-A., Torr, P. H., Böhmer, W., and Whiteson, S. FACMAC: Factored Multi-Agent Centralised Policy Gradients. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Rashid, T., Samvelyan, M., De Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In International Conference on Machine Learning (ICML), 2018.

Samvelyan, M., Rashid, T., de Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G., Hung, C.-M., Torr, P. H., Foerster, J. N., and Whiteson, S. The StarCraft Multi-Agent Challenge. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2019.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust Region Policy Optimization. In International Conference on Machine Learning (ICML), 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.

Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning To Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. In International Conference on Machine Learning (ICML), 2019.

Sun, W.-F., Lee, C.-K., and Lee, C.-Y. DFAC Framework: Factorizing The Value Function Via Quantile Mixture for Multi-Agent Distributional Q-Learning. In International Conference on Machine Learning (ICML), 2021.

Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V. F., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward. In International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2018.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems (NeurIPS), 2000.

Touati, A., Zhang, A., Pineau, J., and Vincent, P. Stable Policy Optimization Via Off-Policy Divergence Regularization. In Conference on Uncertainty in Artificial Intelligence (UAI), 2020.

Wang, J., Ren, Z., Liu, T., Yu, Y., and Zhang, C. QPLEX: Duplex Dueling Multi-Agent Q-Learning. In International Conference on Learning Representations (ICLR), 2021a.

Wang, Q., Li, Y., Xiong, J., and Zhang, T. Divergence-Augmented Policy Optimization.
In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Wang, Y., Han, B., Wang, T., Dong, H., and Zhang, C. DOP: Off-Policy Multi-Agent Decomposed Policy Gradients. In International Conference on Learning Representations (ICLR), 2021b.

Yang, W., Li, X., and Zhang, Z. A Regularized Approach To Sparse Optimal Policy in Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Yang, Y., Hao, J., Liao, B., Shao, K., Chen, G., Liu, W., and Tang, H. Qatten: A General Framework for Cooperative Multiagent Reinforcement Learning. arXiv preprint arXiv:2002.03939, 2020.

Yu, C., Velu, A., Vinitsky, E., Wang, Y., Bayen, A. M., and Wu, Y. The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games. arXiv preprint arXiv:2103.01955, 2021.

Zhang, T., Li, Y., Wang, C., Xie, G., and Lu, Z. FOP: Factorizing Optimal Joint Policy of Maximum-Entropy Multi-Agent Reinforcement Learning. In International Conference on Machine Learning (ICML), 2021.

## A. Proofs

### A.1. Proposition 2

*Proof.* For the sake of simplicity, we define $\bar{Q}^k = Q^{\pi_k}_{\pi_{k-1}}$ and $\bar{V}^k = V^{\pi_k}_{\pi_{k-1}}$. From the definition of iteration (3), we have $\pi_k = \pi^*_{\pi_{k-1}}$. Then from Proposition 1, we also have $\bar{Q}^k \to \bar{Q}^*$ and $\bar{V}^k \to \bar{V}^*$. From Proposition 5, we have $\pi_k(a|s) \propto \pi_{k-1}(a|s)\exp\left(\frac{\bar{Q}^k(s,a)}{\omega}\right)$. We define $f^k(s,a) = \exp\left(\frac{\bar{Q}^k(s,a)}{\omega}\right)$ and $f^*(s,a) = \exp\left(\frac{\bar{Q}^*(s,a)}{\omega}\right)$; then $f^k(s,a) \to f^*(s,a)$. Next we show by induction that $\pi_k(a|s) \propto \pi_0(a|s)\prod_{i=1}^{k}f^i(s,a)$, i.e., $\pi_k(a|s) = \frac{\pi_0(a|s)\prod_{i=1}^{k}f^i(s,a)}{Z^k(s)}$, where $Z^k(s) = \sum_{a'}\pi_0(a'|s)\prod_{i=1}^{k}f^i(s,a')$:

$$
\pi_k(a|s) = \frac{\pi_{k-1}(a|s)f^k(s,a)}{\sum_{a'}\pi_{k-1}(a'|s)f^k(s,a')}
= \frac{\frac{\pi_0(a|s)\prod_{i=1}^{k-1}f^i(s,a)}{Z^{k-1}(s)}f^k(s,a)}{\sum_{a'}\frac{\pi_0(a'|s)\prod_{i=1}^{k-1}f^i(s,a')}{Z^{k-1}(s)}f^k(s,a')}
= \frac{\pi_0(a|s)\prod_{i=1}^{k}f^i(s,a)}{\sum_{a'}\pi_0(a'|s)\prod_{i=1}^{k}f^i(s,a')}
= \frac{\pi_0(a|s)\prod_{i=1}^{k}f^i(s,a)}{Z^k(s)},
$$

where the second step uses the induction hypothesis. We define $f^*_s = \max_a f^*(s,a)$ and consider $\bar{\pi}^*(a|s) = \lim_{k\to\infty}\pi_k(a|s)$:

$$
\begin{aligned}
\lim_{k\to\infty}\pi_k(a|s)
&= \lim_{k\to\infty}\frac{\pi_0(a|s)\prod_{i=1}^{k}f^i(s,a)}{\sum_{a'}\pi_0(a'|s)\prod_{i=1}^{k}f^i(s,a')}
= \lim_{k\to\infty}\frac{\pi_0(a|s)\,\frac{\prod_{i=1}^{k}f^i(s,a)}{f^*(s,a)^k}\,\frac{f^*(s,a)^k}{(f^*_s)^k}}{\sum_{a'}\pi_0(a'|s)\,\frac{\prod_{i=1}^{k}f^i(s,a')}{f^*(s,a')^k}\,\frac{f^*(s,a')^k}{(f^*_s)^k}}\\
&= \frac{\pi_0(a|s)\,\lim_{k\to\infty}\frac{\prod_{i=1}^{k}f^i(s,a)}{f^*(s,a)^k}\,\frac{f^*(s,a)^k}{(f^*_s)^k}}{\sum_{a'}\pi_0(a'|s)\,\lim_{k\to\infty}\frac{\prod_{i=1}^{k}f^i(s,a')}{f^*(s,a')^k}\,\frac{f^*(s,a')^k}{(f^*_s)^k}}
= \mathbb{1}(a\in U_s)\,\frac{\pi_0(a|s)}{\sum_{a'\in U_s}\pi_0(a'|s)}.
\end{aligned}
$$

The last equation follows from two simple conclusions: (1) if a sequence $\{a_k\} > 0$ and $a_k \to A > 0$, then $\lim_{n\to\infty}\frac{\prod_{k=1}^{n}a_k}{A^n} = 1$; (2) for $A > B > 0$, $\lim_{n\to\infty}\frac{B^n}{A^n} = 0$. So $\lim_{k\to\infty}\frac{\prod_{i=1}^{k}f^i(s,a)}{f^*(s,a)^k}\,\frac{f^*(s,a)^k}{(f^*_s)^k} = \mathbb{1}(a\in U_s)$.

### A.2. Theorem 1

We define two operators $\Gamma$ and $\Gamma_{\omega}$ as

$$\Gamma V(s) = \max_{\pi}\sum_a \pi(a|s)\left(r(s,a)+\gamma\mathbb{E}[V(s')]\right),\qquad
\Gamma_{\omega} V(s) = \max_{\pi}\sum_a \pi(a|s)\left(r(s,a)+\gamma\mathbb{E}[V(s')]\right) - \omega D_{\mathrm{KL}}\left(\pi(\cdot|s)\,\|\,\bar{\pi}^*(\cdot|s)\right).$$

From the result of traditional Q-learning (Sutton & Barto, 2018), we know that $\Gamma$ is a $\gamma$-contraction and its unique fixed point is the global optimal V-function $V^*$. As for $\Gamma_{\omega}$, we have the following lemma from Yang et al. (2019).

**Lemma 4.** (1) $\Gamma_{\omega}$ is a $\gamma$-contraction.
(2) If $V_1(s) \ge V_2(s)$ for all states $s$, then $\Gamma_{\omega}V_1(s) \ge \Gamma_{\omega}V_2(s)$ for all states $s$. (3) For any constant $c$, defining $(V+c)(s) = V(s)+c$, we have $\Gamma_{\omega}(V+c)(s) = \Gamma_{\omega}V(s) + \gamma c$.

Moreover, from the definition we know that $\bar{V}^k(s) = \max_{\pi}\sum_a \pi(a|s)\left(r(s,a)+\gamma\mathbb{E}[\bar{V}^k(s')]\right) - \omega D_{\mathrm{KL}}(\pi(\cdot|s)\,\|\,\pi_{k-1}(\cdot|s))$, and $\bar{V}^k \to \bar{V}^*$, $\pi_k \to \bar{\pi}^*$. Then we have

$$\bar{V}^*(s) = \max_{\pi}\sum_a \pi(a|s)\left(r(s,a)+\gamma\mathbb{E}[\bar{V}^*(s')]\right) - \omega D_{\mathrm{KL}}\left(\pi(\cdot|s)\,\|\,\bar{\pi}^*(\cdot|s)\right).$$

This means that $\bar{V}^*$ is the unique fixed point of $\Gamma_{\omega}$. Now we have all the tools for the proof of Theorem 1; the proof is inspired by Yang et al. (2019).

Given that the initial policy is a uniform distribution, we know from Proposition 2 that

$$\bar{\pi}^*(a|s) = \mathbb{1}(a \in U_s)\,\frac{1}{|U_s|},$$

where $|U_s|$ is the number of actions contained in the set $U_s$. Then we consider $D_{\mathrm{KL}}(\pi(\cdot|s)\,\|\,\bar{\pi}^*(\cdot|s))$ and have

$$
\begin{aligned}
D_{\mathrm{KL}}\left(\pi(\cdot|s)\,\|\,\bar{\pi}^*(\cdot|s)\right)
&= \sum_{a\in U_s}\pi(a|s)\log\frac{\pi(a|s)}{\bar{\pi}^*(a|s)} &&(7)\\
&= \sum_{a\in U_s}\pi(a|s)\log\pi(a|s) - \sum_{a\in U_s}\pi(a|s)\log\bar{\pi}^*(a|s) &&(8)\\
&\le -\sum_{a\in U_s}\pi(a|s)\log\bar{\pi}^*(a|s) &&(\pi(a|s)\le 1)\quad(9)\\
&= \log|U_s|\sum_{a\in U_s}\pi(a|s) &&(10)\\
&\le \log|U_s| &&\left(\textstyle\sum_{a\in U_s}\pi(a|s)\le\sum_{a\in A}\pi(a|s)=1\right)\quad(11)\\
&\le \log|A|. &&(12)
\end{aligned}
$$

Next we consider the relation between $\Gamma_{\omega}$ and $\Gamma$:

$$\Gamma_{\omega}V(s) = \max_{\pi}\sum_a\pi(a|s)\left(r(s,a)+\gamma\mathbb{E}[V(s')]\right) - \omega D_{\mathrm{KL}}\left(\pi(\cdot|s)\,\|\,\bar{\pi}^*(\cdot|s)\right) \le \max_{\pi}\sum_a\pi(a|s)\left(r(s,a)+\gamma\mathbb{E}[V(s')]\right) = \Gamma V(s), \tag{13}$$

$$\Gamma_{\omega}V(s) \ge \max_{\pi}\sum_a\pi(a|s)\left(r(s,a)+\gamma\mathbb{E}[V(s')]\right) - \omega\max_{\pi}D_{\mathrm{KL}}\left(\pi(\cdot|s)\,\|\,\bar{\pi}^*(\cdot|s)\right) \ge \Gamma V(s) - \omega\log|A|, \tag{14}$$

where the last step follows from inequality (12). We now show by induction that $\Gamma^k V(s) - \omega\log|A|\sum_{t=0}^{k-1}\gamma^t \le \Gamma_{\omega}^k V(s) \le \Gamma^k V(s)$ for any $V$ and any state $s$:

$$
\begin{aligned}
\Gamma_{\omega}^{k+1}V(s) &= \Gamma_{\omega}\left(\Gamma_{\omega}^k V(s)\right)\\
&\ge \Gamma_{\omega}\left(\Gamma^k V(s) - \omega\log|A|\sum_{t=0}^{k-1}\gamma^t\right) &&\text{(induction hypothesis and the monotonicity in Lemma 4)}\\
&= \Gamma_{\omega}\left(\Gamma^k V(s)\right) - \omega\log|A|\sum_{t=0}^{k-1}\gamma^{t+1} &&\text{(the constant property in Lemma 4)}\\
&\ge \Gamma^{k+1}V(s) - \omega\log|A| - \omega\log|A|\sum_{t=0}^{k-1}\gamma^{t+1} &&\text{(from inequality (14))}\\
&= \Gamma^{k+1}V(s) - \omega\log|A|\sum_{t=0}^{k}\gamma^t,
\end{aligned}
$$

$$\Gamma_{\omega}^{k+1}V(s) = \Gamma_{\omega}\left(\Gamma_{\omega}^k V(s)\right) \le \Gamma_{\omega}\left(\Gamma^k V(s)\right) \le \Gamma^{k+1}V(s),$$

where the last two steps use the induction hypothesis with the monotonicity in Lemma 4, and inequality (13). Finally, let $k\to\infty$. As both $\Gamma_{\omega}$ and $\Gamma$ are $\gamma$-contractions, $\Gamma_{\omega}^k V \to \bar{V}^*$ and $\Gamma^k V \to V^*$ for any $V$. We have

$$\bar{V}^* \ge V^* - \omega\log|A|\sum_{t=0}^{\infty}\gamma^t = V^* - \frac{\omega}{1-\gamma}\log|A|.$$

### A.3. Lemma 1

*Proof.* We define a new reward function $r^{\pi}_{\rho}(s,a) = r(s,a) - \omega\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[D_{\mathrm{KL}}\left(\pi(\cdot|s')\,\|\,\rho(\cdot|s')\right)\right]$. Then we can rewrite the operator $\Gamma^{\pi}_{\rho}$ as

$$\Gamma^{\pi}_{\rho}Q(s,a) = r^{\pi}_{\rho}(s,a) + \gamma\mathbb{E}_{s'\sim P(\cdot|s,a),\,a'\sim\pi(\cdot|s')}\left[Q(s',a')\right].$$

With this formula, we can apply the standard convergence result for policy evaluation in Sutton & Barto (2018).

### A.4. Lemma 2

For the proof of Lemma 2, we need the following lemma (Haarnoja et al., 2018) about policy improvement in the entropy-regularized MDP.

**Lemma 5.** If we have a new policy $\pi_{\text{new}}$ with

$$\pi_{\text{new}} = \arg\min_{\pi} D_{\mathrm{KL}}\left(\pi(\cdot|s)\,\Bigm\|\,\frac{\exp\left(Q^{\pi_{\text{old}}}_{\text{ent}}(s,\cdot)/\omega\right)}{Z^{\pi_{\text{old}}}(s)}\right),$$

where $Z^{\pi_{\text{old}}}(s)$ is the normalization term, then $Q^{\pi_{\text{new}}}_{\text{ent}}(s,a) \ge Q^{\pi_{\text{old}}}_{\text{ent}}(s,a)$ for all $s\in S$ and $a\in A$.

With Lemma 5, we have the following proof of Lemma 2.

*Proof.* Let $\hat{Q}$ be as defined in the proof in Appendix A.7. Then $\hat{Q}^{\pi}(s,a) = Q^{\pi}_{\rho}(s,a) + \omega\log\rho(a|s)$ for any $\pi$. According to Lemma 5, if

$$\hat{\pi}_{\text{new}}(\cdot|s) = \arg\min_{\pi} D_{\mathrm{KL}}\left(\pi(\cdot|s)\,\Bigm\|\,\frac{\exp\left(\hat{Q}^{\pi_{\text{old}}}(s,\cdot)/\omega\right)}{Z^{\pi_{\text{old}}}(s)}\right),$$

then $\hat{Q}^{\hat{\pi}_{\text{new}}}(s,a) \ge \hat{Q}^{\pi_{\text{old}}}(s,a)$ for all $a\in A$. With the definition, we have

$$\exp\left(\hat{Q}^{\pi_{\text{old}}}(s,\cdot)/\omega\right) = \rho(\cdot|s)\exp\left(Q^{\pi_{\text{old}}}_{\rho}(s,\cdot)/\omega\right)\quad\Longrightarrow\quad \pi_{\text{new}} = \hat{\pi}_{\text{new}},$$

and therefore

$$Q^{\pi_{\text{new}}}_{\rho}(s,a) = \hat{Q}^{\pi_{\text{new}}}(s,a) - \omega\log\rho(a|s) \ge \hat{Q}^{\pi_{\text{old}}}(s,a) - \omega\log\rho(a|s) = Q^{\pi_{\text{old}}}_{\rho}(s,a).$$

### A.5. Lemma 3

*Proof.* From the equation

$$\pi^i_{\text{new}} = \arg\max_{\pi_i}\sum_{a_i}\pi_i(a_i|s)\left(k_i(s)Q^{\pi^i_{\text{old}}}_{\rho}(s,a_i) - \omega\log\frac{\pi_i(a_i|s)}{\rho_i(a_i|s)}\right),$$

we can obtain

$$\sum_{a_i}\pi^i_{\text{new}}(a_i|s)\left(k_i(s)Q^{\pi^i_{\text{old}}}_{\rho}(s,a_i) - \omega\log\frac{\pi^i_{\text{new}}(a_i|s)}{\rho_i(a_i|s)}\right) \ge \sum_{a_i}\pi^i_{\text{old}}(a_i|s)\left(k_i(s)Q^{\pi^i_{\text{old}}}_{\rho}(s,a_i) - \omega\log\frac{\pi^i_{\text{old}}(a_i|s)}{\rho_i(a_i|s)}\right). \tag{15}$$

By taking expectations on both sides of (15), we obtain the following.
$$\sum_{a_{-i}}\pi_{-i}(a_{-i}|s)\sum_{a_i}\pi^i_{\text{new}}(a_i|s)\left(k_i(s)Q^{\pi^i_{\text{old}}}_{\rho}(s,a_i) - \omega\log\frac{\pi^i_{\text{new}}(a_i|s)}{\rho_i(a_i|s)}\right) \ge \sum_{a_{-i}}\pi_{-i}(a_{-i}|s)\sum_{a_i}\pi^i_{\text{old}}(a_i|s)\left(k_i(s)Q^{\pi^i_{\text{old}}}_{\rho}(s,a_i) - \omega\log\frac{\pi^i_{\text{old}}(a_i|s)}{\rho_i(a_i|s)}\right). \tag{16}$$

Moreover, since the inner sum below does not depend on $a_{-i}$, for any distributions $u_1(\cdot|s)$ and $u_2(\cdot|s)$ over $a_{-i}$ we have

$$\sum_{a_{-i}}u_1(a_{-i}|s)\sum_{a_i}\pi^i_{\text{new}}(a_i|s)\left(k_i(s)Q^{\pi^i_{\text{old}}}_{\rho}(s,a_i) - \omega\log\frac{\pi^i_{\text{new}}(a_i|s)}{\rho_i(a_i|s)}\right) = \sum_{a_{-i}}u_2(a_{-i}|s)\sum_{a_i}\pi^i_{\text{new}}(a_i|s)\left(k_i(s)Q^{\pi^i_{\text{old}}}_{\rho}(s,a_i) - \omega\log\frac{\pi^i_{\text{new}}(a_i|s)}{\rho_i(a_i|s)}\right). \tag{17}$$

Then we have

$$
\begin{aligned}
&\mathbb{E}_{a\sim\pi_{\text{new}}}\left[Q^{\pi_{\text{old}}}_{\rho}(s,a) - \omega\log\frac{\pi_{\text{new}}(a|s)}{\rho(a|s)}\right]\\
&= \sum_a\pi_{\text{new}}(a|s)\left[b(s) + \sum_i\left(k_i(s)Q^{\pi^i_{\text{old}}}_{\rho}(s,a_i) - \omega\log\frac{\pi^i_{\text{new}}(a_i|s)}{\rho_i(a_i|s)}\right)\right]\\
&= b(s) + \sum_i\sum_{a_{-i}}\pi^{-i}_{\text{new}}(a_{-i}|s)\sum_{a_i}\pi^i_{\text{new}}(a_i|s)\left(k_i(s)Q^{\pi^i_{\text{old}}}_{\rho}(s,a_i) - \omega\log\frac{\pi^i_{\text{new}}(a_i|s)}{\rho_i(a_i|s)}\right)\\
&= b(s) + \sum_i\sum_{a_{-i}}\pi^{-i}_{\text{old}}(a_{-i}|s)\sum_{a_i}\pi^i_{\text{new}}(a_i|s)\left(k_i(s)Q^{\pi^i_{\text{old}}}_{\rho}(s,a_i) - \omega\log\frac{\pi^i_{\text{new}}(a_i|s)}{\rho_i(a_i|s)}\right)\\
&\ge b(s) + \sum_i\sum_{a_{-i}}\pi^{-i}_{\text{old}}(a_{-i}|s)\sum_{a_i}\pi^i_{\text{old}}(a_i|s)\left(k_i(s)Q^{\pi^i_{\text{old}}}_{\rho}(s,a_i) - \omega\log\frac{\pi^i_{\text{old}}(a_i|s)}{\rho_i(a_i|s)}\right)\\
&= V^{\pi_{\text{old}}}_{\rho}(s). 
\end{aligned}\tag{18}
$$

The third equation follows from (17) and the inequality follows from (16). By repeatedly applying (18) and the relation $Q^{\pi_{\text{old}}}_{\rho}(s,a) = r(s,a) + \gamma\mathbb{E}_{s'}\left[V^{\pi_{\text{old}}}_{\rho}(s')\right]$, we can complete the proof as follows:

$$
\begin{aligned}
Q^{\pi_{\text{old}}}_{\rho}(s,a) &= r(s,a) + \gamma\mathbb{E}_{s'}\left[V^{\pi_{\text{old}}}_{\rho}(s')\right]\\
&\le r(s,a) + \gamma\mathbb{E}_{s'}\mathbb{E}_{a'\sim\pi_{\text{new}}}\left[Q^{\pi_{\text{old}}}_{\rho}(s',a') - \omega\log\frac{\pi_{\text{new}}(a'|s')}{\rho(a'|s')}\right]\\
&= r(s,a) + \gamma\mathbb{E}_{s'}\mathbb{E}_{a'\sim\pi_{\text{new}}}\left[r(s',a') + \gamma\mathbb{E}_{s''}\left[V^{\pi_{\text{old}}}_{\rho}(s'')\right] - \omega\log\frac{\pi_{\text{new}}(a'|s')}{\rho(a'|s')}\right]\\
&\le \cdots \le Q^{\pi_{\text{new}}}_{\rho}(s,a).
\end{aligned}
$$

### A.6. Theorem 2

*Proof.* First, we show that divergence policy iteration monotonically improves the policy. From Lemma 2, we know that

$$
V^{\pi_{\text{new}}}_{\rho}(s) = \mathbb{E}_{a\sim\pi_{\text{new}}(\cdot|s)}\left[Q^{\pi_{\text{new}}}_{\rho}(s,a) - \omega\log\frac{\pi_{\text{new}}(a|s)}{\rho(a|s)}\right]
\ge \mathbb{E}_{a\sim\pi_{\text{new}}(\cdot|s)}\left[Q^{\pi_{\text{old}}}_{\rho}(s,a) - \omega\log\frac{\pi_{\text{new}}(a|s)}{\rho(a|s)}\right]
\ge \mathbb{E}_{a\sim\pi_{\text{old}}(\cdot|s)}\left[Q^{\pi_{\text{old}}}_{\rho}(s,a) - \omega\log\frac{\pi_{\text{old}}(a|s)}{\rho(a|s)}\right]
= V^{\pi_{\text{old}}}_{\rho}(s).
$$

The first inequality follows from the conclusion of Lemma 2 that $Q^{\pi_{\text{new}}}_{\rho}(s,a) \ge Q^{\pi_{\text{old}}}_{\rho}(s,a)$ for all $a\in A$, and the second inequality follows from the definition of $\pi_{\text{new}}$,

$$\pi_{\text{new}} = \arg\min_{\pi} D_{\mathrm{KL}}\left(\pi(\cdot|s)\,\Bigm\|\,\frac{\rho(\cdot|s)\exp\left(Q^{\pi_{\text{old}}}_{\rho}(s,\cdot)/\omega\right)}{Z^{\pi_{\text{old}}}(s)}\right).$$

Hence $V^{\pi_{\text{new}}}_{\rho}(s) \ge V^{\pi_{\text{old}}}_{\rho}(s)$ for all $s\in S$, and thus $J_{\rho}(\pi_{\text{new}}) \ge J_{\rho}(\pi_{\text{old}})$. So divergence policy iteration monotonically improves the policy. Since $Q^{\pi}_{\rho}$ is bounded above (the reward function is bounded), the sequence of Q-functions $\{Q^k\}$ of divergence policy iteration converges, and the corresponding policy sequence converges to some policy $\pi_{\text{conv}}$. We need to show $\pi_{\text{conv}} = \pi^*_{\rho}$:

$$
\begin{aligned}
V^{\pi_{\text{conv}}}_{\rho}(s) &= \mathbb{E}_{a\sim\pi_{\text{conv}}(\cdot|s)}\left[Q^{\pi_{\text{conv}}}_{\rho}(s,a) - \omega\log\frac{\pi_{\text{conv}}(a|s)}{\rho(a|s)}\right]\\
&\ge \mathbb{E}_{a\sim\pi(\cdot|s)}\left[Q^{\pi_{\text{conv}}}_{\rho}(s,a) - \omega\log\frac{\pi(a|s)}{\rho(a|s)}\right]\\
&\ge \mathbb{E}_{a\sim\pi(\cdot|s)}\left[r(s,a) + \gamma\mathbb{E}_{s'}\mathbb{E}_{a'\sim\pi(\cdot|s')}\left[Q^{\pi_{\text{conv}}}_{\rho}(s',a') - \omega\log\frac{\pi(a'|s')}{\rho(a'|s')}\right] - \omega\log\frac{\pi(a|s)}{\rho(a|s)}\right]\\
&\ge \cdots
\end{aligned}
$$

The first inequality follows from the definition of $\pi_{\text{conv}}$,

$$\pi_{\text{conv}} = \arg\min_{\pi} D_{\mathrm{KL}}\left(\pi(\cdot|s)\,\Bigm\|\,\frac{\rho(\cdot|s)\exp\left(Q^{\pi_{\text{conv}}}_{\rho}(s,\cdot)/\omega\right)}{Z^{\pi_{\text{conv}}}(s)}\right),$$

and all the other inequalities are obtained by iteratively applying the first inequality and the relation between $Q^{\pi}_{\rho}$ and $V^{\pi}_{\rho}$. Unrolling the iterations replaces all occurrences of $\pi_{\text{conv}}$ with $\pi$ in the expression of $V^{\pi_{\text{conv}}}_{\rho}(s)$, and in the limit we obtain $V^{\pi}_{\rho}(s)$. Therefore,

$$V^{\pi_{\text{conv}}}_{\rho}(s) \ge V^{\pi}_{\rho}(s)\quad\forall s\in S,\ \forall\pi\in\Pi \quad\Longrightarrow\quad J_{\rho}(\pi_{\text{conv}}) \ge J_{\rho}(\pi)\quad\forall\pi\in\Pi \quad\Longrightarrow\quad \pi_{\text{conv}} = \pi^*_{\rho}.$$

### A.7. Proposition 5

We have the following proposition.

**Proposition 5.** If $\pi^*_{\rho} = \arg\max_{\pi} J_{\rho}(\pi)$, and $V^*_{\rho}(s) = V^{\pi^*_{\rho}}_{\rho}(s)$ and $Q^*_{\rho}(s,a) = Q^{\pi^*_{\rho}}_{\rho}(s,a)$ are respectively the corresponding V-function and Q-function of $\pi^*_{\rho}$, then they satisfy the following properties:

$$\pi^*_{\rho}(a|s) \propto \rho(a|s)\exp\left(\left(r(s,a)+\gamma\mathbb{E}\left[V^*_{\rho}(s')\right]\right)/\omega\right),$$

$$V^*_{\rho}(s) = \omega\log\sum_a \rho(a|s)\exp\left(\left(r(s,a)+\gamma\mathbb{E}\left[V^*_{\rho}(s')\right]\right)/\omega\right),$$

$$Q^*_{\rho}(s,a) = r(s,a) + \gamma\omega\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\log\sum_{a'}\rho(a'|s')\exp\left(Q^*_{\rho}(s',a')/\omega\right)\right]. \tag{21}$$

Before the proof of Proposition 5, we need some results about the optimal Q-function $Q^*_{\text{ent}}$, the optimal V-function $V^*_{\text{ent}}$, and the optimal policy $\pi^*_{\text{ent}}$ in the entropy-regularized MDP.
We have the following lemma (Nachum et al., 2017).

**Lemma 6.**

$$\pi^*_{\text{ent}}(a|s) \propto \exp\left(\left(r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[V^*_{\text{ent}}(s')\right]\right)/\omega\right),$$

$$V^*_{\text{ent}}(s) = \omega\log\sum_a\exp\left(\left(r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[V^*_{\text{ent}}(s')\right]\right)/\omega\right),$$

$$Q^*_{\text{ent}}(s,a) = r(s,a) + \gamma\omega\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\log\sum_{a'}\exp\left(Q^*_{\text{ent}}(s',a')/\omega\right)\right].$$

With Lemma 6, we can complete the proof of Proposition 5.

*Proof.* Let $\hat{r}(s,a) = r(s,a) + \omega\log\rho(a|s)$ and consider the objective function

$$\hat{J}(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t\left(\hat{r}(s_t,a_t) - \omega\log\pi(a_t|s_t)\right)\right].$$

Let $\hat{\pi}^*(a|s)$, $\hat{V}^*(s)$, and $\hat{Q}^*(s,a)$ be the optimal policy, V-function, and Q-function of $\hat{J}(\pi)$. Since $\hat{r}(s,a) - \omega\log\pi(a|s) = r(s,a) - \omega\log\frac{\pi(a|s)}{\rho(a|s)}$, we have $\hat{J}(\pi) = J_{\rho}(\pi)$ for every $\pi$, so by definition we obtain

$$\hat{\pi}^*(a|s) = \pi^*_{\rho}(a|s),$$

$$\hat{V}^*(s) = \mathbb{E}_{a\sim\hat{\pi}^*(\cdot|s),\,s'\sim P(\cdot|s,a)}\left[\hat{r}(s,a) + \gamma\hat{V}^*(s') - \omega\log\hat{\pi}^*(a|s)\right] = \mathbb{E}_{a\sim\pi^*_{\rho}(\cdot|s),\,s'\sim P(\cdot|s,a)}\left[r(s,a) + \gamma\hat{V}^*(s') - \omega\log\frac{\pi^*_{\rho}(a|s)}{\rho(a|s)}\right] = V^*_{\rho}(s),$$

$$\hat{Q}^*(s,a) = \hat{r}(s,a) + \gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\hat{V}^*(s')\right] = r(s,a) + \gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[V^*_{\rho}(s')\right] + \omega\log\rho(a|s) = Q^*_{\rho}(s,a) + \omega\log\rho(a|s).$$

According to Lemma 6, we have

$$\pi^*_{\rho}(a|s) = \hat{\pi}^*(a|s) \propto \exp\left(\left(\hat{r}(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\hat{V}^*(s')\right]\right)/\omega\right) = \rho(a|s)\exp\left(\left(r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[V^*_{\rho}(s')\right]\right)/\omega\right),$$

$$V^*_{\rho}(s) = \hat{V}^*(s) = \omega\log\sum_a\exp\left(\left(\hat{r}(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\hat{V}^*(s')\right]\right)/\omega\right) = \omega\log\sum_a\rho(a|s)\exp\left(\left(r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[V^*_{\rho}(s')\right]\right)/\omega\right),$$

$$Q^*_{\rho}(s,a) = \hat{Q}^*(s,a) - \omega\log\rho(a|s) = \hat{r}(s,a) + \gamma\omega\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\log\sum_{a'}\exp\left(\hat{Q}^*(s',a')/\omega\right)\right] - \omega\log\rho(a|s) = r(s,a) + \gamma\omega\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\log\sum_{a'}\rho(a'|s')\exp\left(Q^*_{\rho}(s',a')/\omega\right)\right].$$

### A.8. Proposition 4

*Proof.* From Proposition 5, we can obtain

$$\pi^*_{\rho}(a|s) = \frac{\rho(a|s)\exp\left(\left(r(s,a)+\gamma\mathbb{E}\left[V^*_{\rho}(s')\right]\right)/\omega\right)}{\sum_b\rho(b|s)\exp\left(\left(r(s,b)+\gamma\mathbb{E}\left[V^*_{\rho}(s')\right]\right)/\omega\right)} = \frac{\rho(a|s)\exp\left(Q^*_{\rho}(s,a)/\omega\right)}{\exp\left(V^*_{\rho}(s)/\omega\right)} = \rho(a|s)\exp\left(\left(Q^*_{\rho}(s,a)-V^*_{\rho}(s)\right)/\omega\right). \tag{22}$$

By rearranging the equation, we have

$$V^*_{\rho}(s) = Q^*_{\rho}(s,a) - \omega\log\frac{\pi^*_{\rho}(a|s)}{\rho(a|s)}, \tag{23}$$

which holds for all actions $a\in A$.

### A.9. Derivation of Gradient

$$
\begin{aligned}
\nabla_{\theta_i}L_{\pi} &= \mathbb{E}_{s\sim D}\left[\sum_a\nabla_{\theta_i}\pi(a|s)\left(Q^{\pi_{\text{old}}}_{\rho}(s,a) - \omega\log\frac{\pi(a|s)}{\rho(a|s)}\right) + \pi(a|s)\nabla_{\theta_i}\left(-\omega\log\frac{\pi(a|s)}{\rho(a|s)}\right)\right]\\
&= \mathbb{E}_{s\sim D}\left[\sum_a\pi(a|s)\nabla_{\theta_i}\log\pi_i(a_i|s)\left(Q^{\pi_{\text{old}}}_{\rho}(s,a) - \omega\log\frac{\pi(a|s)}{\rho(a|s)}\right) - \omega\pi(a|s)\nabla_{\theta_i}\log\pi_i(a_i|s)\right]\\
&= \mathbb{E}_{s\sim D,\,a\sim\pi}\left[\nabla_{\theta_i}\log\pi_i(a_i|s)\left(Q^{\pi_{\text{old}}}_{\rho}(s,a) - \omega\log\frac{\pi(a|s)}{\rho(a|s)} - \omega\right)\right].
\end{aligned}
$$

### A.10. Theoretical Results for the Moving Average Case

In the case where we take the target policies $\{\rho_t\}$ as the moving average of the policies $\{\pi_t\}$, we formulate the iteration as follows:

$$\rho_t = (1-\tau)\rho_{t-1} + \tau\pi_{t-1},\qquad
\pi_t = \arg\max_{\pi}\sum_{s,a}\mu_{\pi}(s,a)r(s,a) - \omega D_C\left(\mu_{\pi}\,\|\,\mu_{\rho_t}\right).$$

Then we have:

$$J(\pi_{t+1}) \ge J(\pi_{t+1}) - \omega D_C\left(\mu_{\pi_{t+1}}\,\|\,\mu_{\rho_{t+1}}\right) = J_{\rho_{t+1}}(\pi_{t+1}) \ge J(\pi_t) - \omega D_C\left(\mu_{\pi_t}\,\|\,\mu_{\rho_{t+1}}\right) \ge J(\pi_t) - (1-\tau)\omega D_C\left(\mu_{\pi_t}\,\|\,\mu_{\rho_t}\right) \ge J(\pi_t) - \omega D_C\left(\mu_{\pi_t}\,\|\,\mu_{\rho_t}\right) = J_{\rho_t}(\pi_t).$$

The third inequality is obtained as follows:

$$
\begin{aligned}
D_C\left(\mu_{\pi_t}\,\|\,\mu_{\rho_{t+1}}\right) &= \sum_{s,a}\mu_{\pi_t}(s,a)\log\frac{\pi_t(a|s)}{\rho_{t+1}(a|s)}
= \sum_{s,a}\mu_{\pi_t}(s,a)\log\frac{\pi_t(a|s)}{\tau\pi_t(a|s)+(1-\tau)\rho_t(a|s)}\\
&\le \sum_{s,a}\mu_{\pi_t}(s,a)\left(\log\pi_t(a|s) - \tau\log\pi_t(a|s) - (1-\tau)\log\rho_t(a|s)\right) &&\text{(from the concavity of $\log(\cdot)$)}\\
&= (1-\tau)\sum_{s,a}\mu_{\pi_t}(s,a)\log\frac{\pi_t(a|s)}{\rho_t(a|s)}
= (1-\tau)D_C\left(\mu_{\pi_t}\,\|\,\mu_{\rho_t}\right).
\end{aligned}
$$

So the sequence $\{J_{\rho_t}(\pi_t)\}$ is still bounded and monotonically improving, and therefore converges. With this result, $\bar{Q}^k = Q^*_{\rho_k}$ and $\bar{V}^k = V^*_{\rho_k}$ converge to $\bar{Q}^*$ and $\bar{V}^*$. Next we consider the convergence of the sequences $\{\rho_t\}$ and $\{\pi_t\}$. From Proposition 5, we have the formulation

$$\pi_t(a|s) = \frac{\rho_t(a|s)f^t(s,a)}{\sum_b\rho_t(b|s)f^t(s,b)}, \tag{24}$$

where $f^t(s,a) = \exp\left(\frac{\bar{Q}^t(s,a)}{\omega}\right)$ and $f^t(s,a) \to f^*(s,a) = \exp\left(\frac{\bar{Q}^*(s,a)}{\omega}\right)$. We define $Z_t(s) = \sum_b\rho_t(b|s)f^t(s,b)$ and will show that $\{Z_t(s)\}$ converges for each state $s$; in fact, we show that $\{Z_t(s)\}$ is monotonically increasing and bounded. With $Z_t(s)$, we can rewrite the relation between $\rho_{t+1}$ and $\rho_t$ as

$$\rho_{t+1}(a|s) = \left(1-\tau+\tau\frac{f^t(s,a)}{Z_t(s)}\right)\rho_t(a|s). \tag{25}$$
Then we have:

$$
\begin{aligned}
Z_{t+1}(s) - Z_t(s) &= \sum_b\rho_{t+1}(b|s)f^{t+1}(s,b) - \sum_b\rho_t(b|s)f^t(s,b)\\
&\ge \sum_b\rho_{t+1}(b|s)f^t(s,b) - \sum_b\rho_t(b|s)f^t(s,b) &&\text{($f^{t+1}(s,b)\ge f^t(s,b)$ since $\{\bar{Q}^t\}$ improves monotonically)}\\
&= \sum_b\left(1-\tau+\tau\frac{f^t(s,b)}{Z_t(s)}\right)\rho_t(b|s)f^t(s,b) - \sum_b\rho_t(b|s)f^t(s,b) &&\text{(from iteration (25))}\\
&= \frac{\tau}{Z_t(s)}\left(\sum_b\rho_t(b|s)f^t(s,b)^2 - Z_t(s)^2\right)
= \frac{\tau}{Z_t(s)}\,\mathrm{Var}_{b\sim\rho_t(\cdot|s)}\left[f^t(s,b)\right] \ge 0.
\end{aligned}
$$

Moreover, let $M = \max_{s,a}f^*(s,a)$. When $t$ is sufficiently large we have $f^t(s,a) \le f^*(s,a)+1 \le M+1$, so $Z_t(s) \le M+1$ when $t$ is sufficiently large. With the above discussion, $Z_t(s)$ converges to some $Z^*(s)$.

Next we divide the actions into three classes according to the relation between $f^*(s,a)$ and $Z^*(s)$, given any fixed state $s$. Let $I^+_s = \{a\in A \mid f^*(s,a) > Z^*(s)\}$, $I^-_s = \{a\in A \mid f^*(s,a) < Z^*(s)\}$, and $I^0_s = \{a\in A \mid f^*(s,a) = Z^*(s)\}$. We show three properties of these sets:

(1) For all $a\in I^-_s$, $\rho_t(a|s)\to 0$.

(2) $I^+_s = \emptyset$.

(3) $\sum_{a\in I^0_s}\rho_t(a|s)\to 1$.

Property (3) is an obvious corollary of properties (1) and (2), so we focus on (1) and (2).

As for property (1), consider any action $a\in I^-_s$. Let $\epsilon = \frac{Z^*(s)-f^*(s,a)}{4} > 0$. When $t$ is sufficiently large, we have $f^t(s,a) < f^*(s,a)+\epsilon = Z^*(s)-3\epsilon$ and $Z_t(s) > Z^*(s)-\epsilon$. Then

$$\rho_{t+1}(a|s) = \left(1-\tau+\tau\frac{f^t(s,a)}{Z_t(s)}\right)\rho_t(a|s) < \left(1-\tau+\tau\frac{Z^*(s)-3\epsilon}{Z^*(s)-\epsilon}\right)\rho_t(a|s) = \left(1-\frac{2\epsilon\tau}{Z^*(s)-\epsilon}\right)\rho_t(a|s).$$

Since the constant $1-\frac{2\epsilon\tau}{Z^*(s)-\epsilon} < 1$, we know that $\rho_t(a|s)\to 0$ for action $a$.

As for property (2), suppose that $I^+_s \neq \emptyset$ and take some action $a\in I^+_s$. Let $\epsilon = \frac{f^*(s,a)-Z^*(s)}{4} > 0$. When $t$ is sufficiently large, we have $f^t(s,a) > f^*(s,a)-\epsilon$ and $Z_t(s) < Z^*(s)+\epsilon = f^*(s,a)-3\epsilon$. Then

$$\rho_{t+1}(a|s) = \left(1-\tau+\tau\frac{f^t(s,a)}{Z_t(s)}\right)\rho_t(a|s) > \left(1-\tau+\tau\frac{f^*(s,a)-\epsilon}{f^*(s,a)-3\epsilon}\right)\rho_t(a|s) = \left(1+\frac{2\epsilon\tau}{f^*(s,a)-3\epsilon}\right)\rho_t(a|s).$$

Since the constant $1+\frac{2\epsilon\tau}{f^*(s,a)-3\epsilon} > 1$, $\rho_t(a|s)$ grows without bound, which contradicts $\rho_t(a|s)\le 1$. Hence $I^+_s = \emptyset$.

From these three properties we know that $Z^*(s) = \max_a f^*(s,a)$. The proof is as follows: from property (2) we know that $f^*(s,a)\le Z^*(s)$ for all $a\in A$; suppose that $f^*(s,a) < Z^*(s)$ for all $a\in A$, then from property (1) we know that $\sum_a\rho_t(a|s)\to 0$, which contradicts $\sum_a\rho_t(a|s) = 1$. This fact can also be written as $I^0_s = U_s$, where $U_s$ is defined as in the proof of Proposition 2 in Appendix A.1.

Finally, with all these preparations, we can discuss the convergence of $\pi_t$. From iteration (25), we obtain

$$\rho_t(a|s) = \rho_0(a|s)\prod_{k=0}^{t-1}\left(1-\tau+\tau\frac{f^k(s,a)}{Z_k(s)}\right) = \pi_0(a|s)\prod_{k=0}^{t-1}\left(1-\tau+\tau\frac{f^k(s,a)}{Z_k(s)}\right).$$

Combining this with (24), we have

$$
\pi_t(a|s) = \frac{\pi_0(a|s)f^t(s,a)\prod_{k=0}^{t-1}\left(1-\tau+\tau\frac{f^k(s,a)}{Z_k(s)}\right)}{\sum_b\pi_0(b|s)f^t(s,b)\prod_{k=0}^{t-1}\left(1-\tau+\tau\frac{f^k(s,b)}{Z_k(s)}\right)}
= \frac{\pi_0(a|s)f^t(s,a)\prod_{k=0}^{t-1}\left((1-\tau)Z_k(s)+\tau f^k(s,a)\right)}{\sum_b\pi_0(b|s)f^t(s,b)\prod_{k=0}^{t-1}\left((1-\tau)Z_k(s)+\tau f^k(s,b)\right)}
= \frac{\pi_0(a|s)\,\frac{f^t(s,a)\prod_{k=0}^{t-1}\left((1-\tau)Z_k(s)+\tau f^k(s,a)\right)}{(Z^*(s))^{t+1}}}{\sum_b\pi_0(b|s)\,\frac{f^t(s,b)\prod_{k=0}^{t-1}\left((1-\tau)Z_k(s)+\tau f^k(s,b)\right)}{(Z^*(s))^{t+1}}}.
$$

We also have

$$\lim_{t\to\infty}\left((1-\tau)Z_t(s)+\tau f^t(s,a)\right) = \begin{cases} Z^*(s) & a\in I^0_s\\ (1-\tau)Z^*(s)+\tau f^*(s,a) < Z^*(s) & a\in I^-_s. \end{cases} \tag{27}$$

Similarly to the proof of Proposition 2, we have

$$\lim_{t\to\infty}\frac{f^t(s,a)\prod_{k=0}^{t-1}\left((1-\tau)Z_k(s)+\tau f^k(s,a)\right)}{(Z^*(s))^{t+1}} = \begin{cases} 1 & a\in I^0_s\\ 0 & a\in I^-_s. \end{cases} \tag{28}$$

Finally we obtain

$$\bar{\pi}^*(a|s) = \lim_{t\to\infty}\pi_t(a|s) = \mathbb{1}(a\in I^0_s)\,\frac{\pi_0(a|s)}{\sum_{b\in I^0_s}\pi_0(b|s)} = \mathbb{1}(a\in U_s)\,\frac{\pi_0(a|s)}{\sum_{b\in U_s}\pi_0(b|s)}. \tag{29}$$

This conclusion is the same as Proposition 2. Moreover, as $\rho_t$ is the moving average of $\pi_t$, it is obvious that $\rho_t\to\bar{\pi}^*$ as well. With this result, we can define the same operator $\Gamma_{\omega}$ as in the proof of Theorem 1 in Appendix A.2 and use the same proof to show that $\bar{V}^*$ is the fixed point of $\Gamma_{\omega}$, finally obtaining the same conclusion as Theorem 1.
Algorithm 1 DMAC
1: for episode = 1 to m do
2:   Initialize the environment and receive the initial state s
3:   for t = 1 to max-episode-length do
4:     For each agent i, select action $a_i \sim \pi_i(\cdot|s)$
5:     Execute the joint action $a=(a_1, a_2, \ldots, a_n)$ and observe reward r and next state s'
6:     Store (s, a, r, s') in the replay buffer $\mathcal{D}$ and set $s \leftarrow s'$
7:   end for
8:   Sample a random minibatch of K samples $\{(s_k, a_k, r_k, s'_k)\}_{k=1}^{K}$ from $\mathcal{D}$
9:   for agent i = 1 to n do
10:     Update policy $\pi_i$: $\theta_i \leftarrow \theta_i + \beta\nabla_{\theta_i}L_{\pi}$
11:     Update target policy $\rho_i$: $\bar{\theta}_i \leftarrow (1-\tau)\bar{\theta}_i + \tau\theta_i$
12:   end for
13:   Update critic: $\phi \leftarrow \phi - \alpha\nabla_{\phi}L_{Q}$
14:   Update target critic: $\bar{\phi} \leftarrow (1-\tau)\bar{\phi} + \tau\phi$
15: end for

B. Algorithm

Algorithm 1 gives the training procedure of DMAC.

C. Implementation Details

SMAC is a MARL environment based on the game StarCraft II (SC2). Agents control different units in SC2 and can attack, move, or take other actions. In a typical SMAC task, the agents control a team of units against another team controlled by the built-in AI. The goal of the agents is to wipe out all enemy units, and the agents gain rewards for damaging or killing enemy units. Each agent has a limited observation range and can only observe units within this range, but the information of all units can be accessed during training. We test all the methods on 8 tasks/maps in total: 3m, 2s3z, 3s5z, 8m, 1c3s5z, 3s_vs_3z, 2c_vs_64zg, and MMM2.

C.1. Modifications of the Baseline Methods

The modifications of the baseline methods, COMA, MAAC, QMIX, DOP, and FOP, are as follows:

COMA. We keep the original critic and actor networks and add a target policy network with the same architecture as the actor. As COMA is on-policy but COMA+DMAC is off-policy, we add a replay buffer for experience replay.

MAAC. MAAC already has a target policy for stability, so we do not need to modify the network architecture. We only change the update rules for the critic and the actors.

QMIX. QMIX is a value-based method, so we add a policy network and a target policy network for each agent. We keep the original individual Q-functions to learn the critic in QMIX+DMAC. In the divergence-regularized MDP, the max operator is not needed in the critic update, so we abandon the hypernetwork and use an MLP that takes the individual Q-values and the state as input and produces the joint Q-value (a sketch of such a mixer is given after this list). This architecture is simple, and its expressive capability is not limited by QMIX's IGM condition.

DOP. We keep the original critic and actor networks and add a target policy network with the same architecture as the actor. We keep the value decomposition structure and use off-policy TD(λ) for all samples in training, replacing the tree-backup loss and the on-policy TD(λ) loss.

FOP. We replace the entropy regularizers with divergence regularizers in FOP and use the update rules of DMAC. We keep the original architecture of FOP.
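The sketch below illustrates the kind of unconstrained mixer described for QMIX+DMAC above; the class name, layer sizes, and interface are our assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

class UnconstrainedMixer(nn.Module):
    """MLP mixer in the spirit of the QMIX+DMAC modification: it maps the
    individual Q-values plus the global state to a joint Q-value, without
    the monotonicity (IGM) constraint enforced by QMIX's hypernetworks."""
    def __init__(self, n_agents, state_dim, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_agents + state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, agent_qs, state):
        # agent_qs: [batch, n_agents], state: [batch, state_dim]
        return self.net(torch.cat([agent_qs, state], dim=-1))  # [batch, 1]
```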
Figure 5. Learning curves in terms of cover rates of COMA, MAAC, QMIX, and DOP groups in the randomly generated stochastic game.

Figure 6. Learning curves in terms of cover rates of COMA and COMA+DMAC with different ω in the randomly generated stochastic game.

C.2. Hyperparameters

As all tasks in our experiments are cooperative with a shared reward, we use parameter-sharing policy and critic networks for MAAC and MAAC+DMAC to accelerate training. Besides, we add an RNN layer to the policy network and critic network in MAAC and MAAC+DMAC to handle partial observability. All the policy networks are the same: two linear layers and one GRUCell layer with ReLU activation, and the number of hidden units is 64. The individual Q-networks of the QMIX group are the same as the policy network described above. The critic network of the COMA group is an MLP with three 128-unit hidden layers and ReLU activation. The attention dimension in the critic networks of the MAAC group is 32. The number of hidden units of the mixer network in the QMIX group is 32. The learning rate is $10^{-3}$ for the critic and $10^{-4}$ for the actor. We train all networks with the RMSprop optimizer. The discount factor is γ = 0.99. The coefficient of the regularizer is ω = 0.01 for SMAC tasks and ω = 0.2 for the stochastic game. The td_lambda factor used in the COMA group is 0.8. The parameter used for soft-updating the target policy is τ = 0.01. Our code is based on the implementations of PyMARL (Samvelyan et al., 2019), MAAC (Iqbal & Sha, 2019), DOP (Wang et al., 2021b), FOP (Zhang et al., 2021), and an open-source implementation of algorithms in SMAC (https://github.com/starry-sky6688/StarCraft).

C.3. Experiment Settings

We train each algorithm for five runs with different random seeds. In SMAC tasks, we train each algorithm for one million steps per run for the COMA, QMIX, MAAC, and DOP groups (except on 2c_vs_64zg and MMM2) and for two million steps for the FOP group. We evaluate 20 episodes every 10000 training steps in the one-million-step training procedure and every 20000 steps in the two-million-step training procedure. In evaluation, we select greedy actions for QMIX and FOP (following the setting in the FOP paper) and sample actions according to the action distribution for stochastic policies (COMA, MAAC, DOP, and the divergence-regularized methods). We run all the experiments on a server with 2 NVIDIA A100 GPUs.
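The evaluation protocol above can be summarized by the following short sketch; the environment and agent interfaces (including the battle_won field) are assumptions for illustration only and do not reflect the actual PyMARL API.

```python
import numpy as np

def evaluate(env, agents, n_episodes=20, greedy=False):
    """Run evaluation episodes; greedy=True mirrors the QMIX/FOP setting,
    greedy=False samples from the stochastic policies (COMA, MAAC, DOP,
    and the DMAC variants)."""
    returns, wins = [], []
    for _ in range(n_episodes):
        obs, done, ep_return, info = env.reset(), False, 0.0, {}
        while not done:
            probs = [agent.action_probs(o) for agent, o in zip(agents, obs)]
            if greedy:
                actions = [int(np.argmax(p)) for p in probs]
            else:
                actions = [int(np.random.choice(len(p), p=p)) for p in probs]
            obs, reward, done, info = env.step(actions)
            ep_return += reward
        returns.append(ep_return)
        wins.append(float(info.get("battle_won", False)))
    return float(np.mean(returns)), float(np.mean(wins))
```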
D. Additional Results

Figure 5 shows the learning curves in terms of cover rates of the COMA, QMIX, MAAC, and DOP groups in the randomly generated stochastic game. Figure 6 shows the learning curves in terms of cover rates and episode rewards of COMA and COMA+DMAC with different ω in the randomly generated stochastic game. Figure 7 and Figure 8 show the learning curves of the COMA, MAAC, QMIX, and DOP groups in terms of mean episode rewards and win rates in seven SMAC maps. Figure 9 shows the learning curves of FOP+DMAC and FOP in terms of mean episode rewards in 3s_vs_3z, 2s3z, 3s5z, 2c_vs_64zg, and MMM2. Figure 10 shows the learning curves in terms of mean episode rewards of the COMA and DOP groups in the CDM environment used by the DOP paper.

Figure 7. Learning curves in terms of mean episode reward of COMA, MAAC, QMIX, and DOP groups in seven SMAC maps (each row corresponds to a map and each column corresponds to a group).

Figure 8. Learning curves in terms of win rate of COMA, MAAC, QMIX, and DOP groups in seven SMAC maps (each row corresponds to a map and each column corresponds to a group).

Figure 9. Learning curves in terms of mean episode rewards of FOP+DMAC and FOP in five SMAC maps (each column corresponds to a map).

Figure 10. Learning curves in terms of mean episode rewards of COMA and DOP groups in the CDM environment used by the DOP paper.