# Multi-Agent Determinantal Q-Learning

Yaodong Yang*1,2, Ying Wen*1,2, Liheng Chen3, Jun Wang2, Kun Shao1, David Mguni1, Weinan Zhang3

*Equal contribution. 1Huawei Technology R&D UK. 2University College London. 3Shanghai Jiaotong University. Correspondence to: Yaodong Yang. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Centralized training with decentralized execution has become an important paradigm in multi-agent learning. Though practical, current methods rely on restrictive assumptions to decompose the centralized value function across agents for execution. In this paper, we eliminate this restriction by proposing multi-agent determinantal Q-learning. Our method is established on Q-DPP, an extension of the determinantal point process (DPP) with a partition-matroid constraint to the multi-agent setting. Q-DPP encourages agents to acquire diverse behavioral models; this allows a natural factorization of the joint Q-function with no need for a priori structural constraints on the value function or special network architectures. We demonstrate that Q-DPP generalizes major solutions, including VDN, QMIX, and QTRAN, on decentralizable cooperative tasks. To efficiently draw samples from Q-DPP, we adopt an existing linear-time sampler with a theoretical approximation guarantee. The sampler also benefits exploration by coordinating agents to cover orthogonal directions in the state space during multi-agent training. We evaluate our algorithm on various cooperative benchmarks and demonstrate its effectiveness against the state-of-the-art.

## 1. Introduction

Multi-agent reinforcement learning (MARL) methods hold great potential to solve a variety of real-world problems, such as mastering multi-player video games (Peng et al., 2017), dispatching taxi orders (Li et al., 2019), and studying population dynamics (Yang et al., 2018). In this work, we consider the multi-agent cooperative setting (Panait & Luke, 2005) where a team of agents collaborates to achieve one common goal in a partially observed environment.

A full spectrum of MARL algorithms has been developed to solve cooperative tasks (Panait & Luke, 2005); the two endpoints of the spectrum are independent and centralized learning (see Fig. 1). Independent learning (IL) (Tan, 1993) merely treats other agents' influence on the system as part of the environment. The learning agent not only faces a non-stationary environment, but also suffers from spurious rewards (Sunehag et al., 2017; Du et al., 2019). Centralized learning (CL), at the other end, treats a multi-agent problem as a single-agent problem, despite the fact that many real-world applications require local autonomy. Importantly, CL approaches exhibit combinatorial complexity and can hardly scale to more than tens of agents (Yang et al., 2019).

Another paradigm typically considered is a hybrid of centralized training and decentralized execution (CTDE) (Oliehoek et al., 2008). For value-based approaches in the CTDE framework, a fundamental challenge is how to correctly decompose the centralized value function among agents for decentralized execution. For a cooperative task to be deemed decentralizable, it is required that the local maxima of each agent's value function amount to the global maximum of the joint value function.
In enforcing such a condition, current state-of-the-art methods rely on restrictive structural constraints and/or network architectures. For instance, Value Decomposition Network (VDN) (Sunehag et al., 2017) and Factorized-Q (Zhou et al., 2019) propose to directly factorize the joint value function into a summation of individual value functions. QMIX (Rashid et al., 2018) replaces the summation with a non-linear aggregation while maintaining a monotonic relationship between the centralized and individual value functions. QTRAN (Son et al., 2019) introduces a refined learning objective on top of QMIX along with specific network designs. Unsurprisingly, the structural constraints put forward by VDN / QMIX / QTRAN inhibit the representational power of the centralized value function (Son et al., 2019); as a result, the class of decentralizable cooperative tasks these methods can tackle is limited. For example, poor empirical results of QTRAN have been reported on multiple multi-agent cooperative benchmarks (Mahajan et al., 2019).

Figure 1: Spectrum of MARL methods on cooperative tasks, ranging from independent learning to centralized learning, with CTDE methods in between. Methods are grouped into value-based (Ind. Q, VDN, QMIX, QTRAN, Factorized-Q, Det. SARSA) and actor-critic (COMA, MADDPG, BiCNet, MAVEN) families, and by whether exploration is independent or coordinated.

Apart from the aforementioned problems, structural constraints also hinder efficient exploration when applied to value-function decomposition. In fact, since agents behave independently during the execution stage, CTDE methods usually lack a coordinated exploration strategy (Matignon et al., 2007). Clearly, increasing the per-agent exploration rate ϵ of ϵ-greedy helps exploration in the single-agent setting; however, it has been proven (Mahajan et al., 2019) that, due to the structural constraints (e.g., the monotonicity assumption in QMIX), increasing ϵ in the multi-agent setting only lowers the probability of obtaining the optimal value function. As a treatment, MAVEN (Mahajan et al., 2019) introduces a hierarchical model to coordinate diverse exploration among agents. Yet, a principled exploration strategy with minor structural constraints on the value function is still missing for value-based CTDE methods.

In many tasks that require a division of labor, one reasonable solution is to make agents acquire a diverse set of skills, targets, or behavioral models (Albrecht & Ramamoorthy, 2012) during training so that the joint goal can be achieved through individual targets. In such scenarios, the equivalence between the local maxima of the per-agent value functions and the global maximum of the joint value function is achieved automatically. As a result, the centralized value function enjoys a natural factorization with no need for any structural constraints beforehand.

In this paper, we present a new value-based solution in the CTDE paradigm to multi-agent cooperative tasks. We establish Q-DPP, an extension of the determinantal point process (DPP) (Macchi, 1977) with a partition constraint, and apply it to multi-agent learning.
DPPs are elegant probabilistic models on sets that capture both quality and diversity when a subset is sampled from a ground set; this makes them ideal for modeling the set that contains different agents' observation-action pairs in the multi-agent learning context. We adopt Q-DPP as a function approximator for the centralized value function. An attractive property of using Q-DPP is that, when reaching the optimum, it offers a natural factorization of the centralized value function, assuming agents have acquired a diverse set of behaviors. Our method eliminates the need for a priori structural constraints or bespoke neural architectures. In fact, we demonstrate that Q-DPP generalizes current solvers including VDN, QMIX, and QTRAN: all of these methods can be derived as special cases of Q-DPP. As an additional contribution, we adopt a tractable sampler for Q-DPP, based on the sample-by-projection idea in P-DPP (Celis et al., 2018), with a theoretical approximation guarantee. Our sampler makes agents explore in a sequential manner; agents who act later are coordinated to visit only the orthogonal areas of the state space that previous agents haven't explored. Such coordinated exploration effectively boosts the sampling efficiency in the CTDE setting. Building upon these advantages, we finally demonstrate that our proposed Q-DPP algorithm is superior to the existing state-of-the-art solutions on a variety of multi-agent cooperation benchmarks.

## 2. Preliminaries: Determinantal Point Process

A DPP is a probabilistic framework that characterizes how likely a subset is to be sampled from a ground set. Originating from quantum physics as a model of repulsive Fermion particles (Macchi, 1977), the DPP has recently been introduced to the machine learning community due to its probabilistic nature (Kulesza et al., 2012).

Definition 1 (DPP). For a ground set of items $\mathcal{Y} = \{1, 2, \ldots, M\}$, a DPP, denoted by $\mathcal{P}$, is a probability measure on the set of all subsets of $\mathcal{Y}$, i.e., $2^{\mathcal{Y}}$. Given an $M \times M$ positive semi-definite (PSD) kernel $L$ that measures similarity for any pair of items in $\mathcal{Y}$, let $\mathbf{Y}$ be a random subset drawn according to $\mathcal{P}$; then we have, $\forall Y \subseteq \mathcal{Y}$,

$$\mathcal{P}_L\big(\mathbf{Y} = Y\big) \propto \det\big(L_Y\big) = \mathrm{Vol}^2\big(\{w_i\}_{i \in Y}\big), \qquad (1)$$

where $L_Y := [L_{i,j}]_{i,j \in Y}$ denotes the submatrix of $L$ whose entries are indexed by the items included in $Y$. If we write $L := W W^{\top}$ with $W \in \mathbb{R}^{M \times P}$, $P \leq M$, and the rows of $W$ being $\{w_i\}$, then the determinant value is essentially the squared $|Y|$-dimensional volume of the parallelepiped spanned by the rows of $W$ corresponding to the elements in $Y$.

A PSD kernel ensures that all principal minors of $L$ are non-negative, $\det(L_Y) \geq 0$; it thus suffices to define a proper probability distribution. The normalizer can be computed as $\sum_{Y \subseteq \mathcal{Y}} \det(L_Y) = \det(L + I)$, where $I$ is the $M \times M$ identity matrix. Intuitively, one can think of a diagonal entry $L_{i,i}$ as capturing the quality of item $i$, while an off-diagonal entry $L_{i,j}$ measures the similarity between items $i$ and $j$. A DPP models the repulsive connections among the items in a sampled subset. In the two-item example,

$$\mathcal{P}_L\big(\{i, j\}\big) \propto \begin{vmatrix} L_{i,i} & L_{i,j} \\ L_{j,i} & L_{j,j} \end{vmatrix} = L_{i,i} L_{j,j} - L_{i,j} L_{j,i},$$

which suggests that if items $i$ and $j$ are perfectly similar, such that $L_{i,j} = \sqrt{L_{i,i} L_{j,j}}$, then these two items will almost surely not co-occur; the two-item subset $\{i, j\}$ will never be sampled from the ground set.

DPPs are attractive in that they only require training the kernel matrix $L$, which can be learned via maximum likelihood (Affandi et al., 2014).
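To make the repulsion property concrete, here is a small NumPy sketch (not from the paper; the kernel values are made up for illustration) that evaluates the unnormalized subset probabilities $\det(L_Y)$ of Eq. 1 for a three-item ground set:

```python
import numpy as np

# Toy 3-item kernel L = W W^T; rows of W are quality * unit feature vectors.
# Items 0 and 1 share (almost) the same feature direction, item 2 is orthogonal.
W = np.array([
    [1.0, 0.0],                         # item 0
    [0.99, np.sqrt(1 - 0.99**2)],       # item 1, nearly identical to item 0
    [0.0, 1.0],                         # item 2
])
L = W @ W.T

def unnormalized_prob(subset):
    """det(L_Y): unnormalized DPP probability of a subset (Eq. 1)."""
    idx = np.array(subset)
    return np.linalg.det(L[np.ix_(idx, idx)])

print(unnormalized_prob([0, 1]))        # ~0.02: similar items repel each other
print(unnormalized_prob([0, 2]))        # 1.0: orthogonal items co-occur easily
print(np.linalg.det(L + np.eye(3)))     # normalizer det(L + I) over all subsets
```

The nearly parallel pair receives almost no probability mass, while the orthogonal pair keeps the full product of its qualities, which is exactly the quality-diversity trade-off exploited later by Q-DPP.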
A trainable DPP favors many supervised learning tasks where diversified outcomes are desired, such as image generation (Elfeki et al., 2019), video summarization (Sharghi et al., 2018), model ensembles (Pang et al., 2019), and recommender systems (Chen et al., 2018). It is, however, non-trivial to adapt DPPs to a multi-agent setting, since additional restrictions must be placed on the ground set so that valid samples can be drawn for the purpose of multi-agent training. This leads to our Q-DPP.

## 3. Multi-Agent Determinantal Q-Learning

We offer a new value-based solution to multi-agent cooperative tasks. In particular, we introduce Q-DPPs as general function approximators for centralized value functions, similar to neural networks in deep Q-learning (Mnih et al., 2015). We start from the problem formulation.

### 3.1. Problem Formulation of Multi-Agent Cooperation

Multi-agent cooperation in a partially-observed environment is usually modeled as a Dec-POMDP (Oliehoek et al., 2016), denoted by a tuple $G = \langle S, N, A, O, Z, P, R, \gamma \rangle$. Within $G$, $s \in S$ denotes the global environmental state. At every time-step $t \in \mathbb{Z}^{+}$, each agent $i \in N = \{1, \ldots, N\}$ selects an action $a_i \in A$, where a joint action stands for $\mathbf{a} := (a_i)_{i \in N} \in A^{N}$. Since the environment is partially observed, each agent only has access to its local observation $o \in O$, acquired through an observation function $Z(s, a): S \times A \to O$. The state transition dynamics are determined by $P(s' \mid s, \mathbf{a}): S \times A^{N} \times S \to [0, 1]$. Agents optimize towards one shared goal whose performance is measured by $R(s, \mathbf{a}): S \times A^{N} \to \mathbb{R}$, and $\gamma \in [0, 1)$ discounts the future rewards. Each agent recalls an observation-action history $\tau_i \in T := (O \times A)^{t}$ and executes a stochastic policy $\pi_i(a_i \mid \tau_i): T \times A \to [0, 1]$ conditioned on $\tau_i$. The collection of all agents' histories is defined as $\boldsymbol{\tau} := (\tau_i)_{i \in N} \in T^{N}$.

Given a joint policy $\boldsymbol{\pi} := (\pi_i)_{i \in N}$, the joint action-value function at time $t$ is $Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}^{t}, \mathbf{a}^{t}) = \mathbb{E}_{s^{t+1:\infty}, \mathbf{a}^{t+1:\infty}}\big[G^{t} \mid \boldsymbol{\tau}^{t}, \mathbf{a}^{t}\big]$, where $G^{t} = \sum_{i=0}^{\infty} \gamma^{i} R^{t+i}$ is the total accumulated reward. The goal is to find the optimal value function $Q^{*} = \max_{\boldsymbol{\pi}} Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}^{t}, \mathbf{a}^{t})$ and the corresponding policy $\boldsymbol{\pi}^{*}$. A direct centralized approach is to learn the joint value function, parameterized by $\theta$, by minimizing the squared temporal-difference error $\mathcal{L}(\theta)$ (Watkins & Dayan, 1992) over a sampled mini-batch of transitions $\{\langle \boldsymbol{\tau}, \mathbf{a}, R, \boldsymbol{\tau}' \rangle\}_{j=1}^{E}$, i.e.,

$$\mathcal{L}(\theta) = \sum_{j=1}^{E} \Big[ R + \gamma \max_{\mathbf{a}'} Q\big(\boldsymbol{\tau}', \mathbf{a}'; \theta^{-}\big) - Q^{\boldsymbol{\pi}}\big(\boldsymbol{\tau}, \mathbf{a}; \theta\big) \Big]^{2}, \qquad (2)$$

where $\theta^{-}$ denotes the target parameters that are periodically copied from $\theta$ during training. In our work, apart from the joint value function, we also focus on obtaining a decentralized policy for each agent.

CTDE is a paradigm for solving Dec-POMDPs (Oliehoek et al., 2008) in which the algorithm has access to all of the agents' local histories $\boldsymbol{\tau}$ during training. During testing, however, each agent uses only its own history $\tau_i$ for execution. CTDE methods provide valid solutions to multi-agent cooperative tasks that are decentralizable, which is formally defined as below.

Definition 2 (Decentralizable Cooperative Tasks, a.k.a. IGM Condition (Son et al., 2019)). A cooperative task is decentralizable if there exist $\{Q_i\}_{i=1}^{N}$ such that, $\forall \boldsymbol{\tau} \in T^{N}, \forall \mathbf{a} \in A^{N}$,

$$\arg\max_{\mathbf{a}} Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}, \mathbf{a}) = \Big( \arg\max_{a_1} Q_1(\tau_1, a_1), \ldots, \arg\max_{a_N} Q_N(\tau_N, a_N) \Big). \qquad (3)$$

Eq. 3 states that the local maxima of the extracted per-agent value functions must amount to the global maximum of the joint value function.
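As an illustration of the centralized objective in Eq. 2, the following is a minimal NumPy sketch assuming a tabular joint Q-function indexed by (joint history, joint action); the function and variable names are ours, not from the paper:

```python
import numpy as np

def td_loss(q, q_target, batch, gamma=0.99):
    """Squared TD error of Eq. 2 for a tabular joint Q-table q[tau, a].

    `batch` is a list of transitions (tau, a, r, tau_next) with integer
    indices for joint histories and joint actions; `q_target` is the
    periodically-copied table playing the role of the theta^- parameters.
    """
    loss = 0.0
    for tau, a, r, tau_next in batch:
        target = r + gamma * np.max(q_target[tau_next])  # max over joint actions a'
        loss += (target - q[tau, a]) ** 2
    return loss

# Toy usage: 4 joint histories, 3 joint actions.
q = np.zeros((4, 3)); q_target = q.copy()
batch = [(0, 1, 1.0, 2), (2, 0, 0.0, 3)]
print(td_loss(q, q_target, batch))
```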
A key challenge for CTDE methods is, then, how to correctly extract each agent's individual Q-function $\{Q_i\}_{i=1}^{N}$, and thus an executable policy, from the centralized Q-function $Q^{\boldsymbol{\pi}}$. To satisfy Eq. 3, current solutions rely on restrictive assumptions that enforce structural constraints on the factorization of the joint Q-function. For example, VDN (Sunehag et al., 2017) adopts the additivity assumption $Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}, \mathbf{a}) := \sum_{i=1}^{N} Q_i(\tau_i, a_i)$. QMIX (Rashid et al., 2018) applies the monotonicity assumption, ensuring $\frac{\partial Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}, \mathbf{a})}{\partial Q_i(\tau_i, a_i)} \geq 0, \forall i \in N$. QTRAN (Son et al., 2019) introduces a refined factorizable learning objective in addition to QMIX. Nonetheless, structural constraints harm the representational power of the centralized value function and also hinder efficient exploration (Son et al., 2019). To mitigate these problems, we propose Q-DPP as an alternative that naturally factorizes the joint Q-function by learning a diverse set of behavioral models among agents.

### 3.2. Q-DPP: A Constrained DPP for MARL

Our method is established on Q-DPP, an extension of the DPP that suits MARL. We assume that the local observation $o_i$ encodes all history information $\tau_i$ at each time-step. We model the ground set of all agents' observation-action pairs by a DPP, i.e., $\mathcal{Y} = \big\{(o_1^{1}, a_1^{1}), \ldots, (o_N^{|O|}, a_N^{|A|})\big\}$, with the size of the ground set being $|\mathcal{Y}| = N|O||A|$. In the context of multi-agent learning, each agent takes one valid action depending on its local observation. A valid sample from the DPP, therefore, is expected to include exactly one valid observation-action pair for each agent, and the observations from the sampled pairs must match the true observations that agents receive at every time-step. To meet such requirements, we propose a new type of DPP, named Q-DPP.

Figure 2: Example of Q-DPP with quality-diversity kernel decomposition, $L = D B^{\top} B D$, in a single-state three-player learning task where each agent has three actions to choose from. The size of the ground set is $|\mathcal{Y}| = 9$, and the number of valid subsets is $|\mathcal{C}(\mathbf{o})| = 3^3 = 27$. Different colors represent the different partitions of each agent's observation-action pairs. Suppose all three agents select the 2nd action; then the Q-value of the joint action according to Eq. 5 is $Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}) = \det\big([L_{i,j}]_{i,j \in \{2,5,8\}}\big)$.

Definition 3 (Q-DPP). Given a ground set $\mathcal{Y}$ of size $M$ that includes all $N$ agents' possible observation-action pairs, $\mathcal{Y} = \big\{(o_1^{1}, a_1^{1}), \ldots, (o_N^{|O|}, a_N^{|A|})\big\}$, we partition $\mathcal{Y}$ into $N$ disjoint parts, i.e., $\mathcal{Y} = \bigcup_{i=1}^{N} \mathcal{Y}_i$ and $\sum_{i=1}^{N} |\mathcal{Y}_i| = M = N|O||A|$, where each partition contains one individual agent's possible observation-action pairs. At every time-step, given the agents' observations $\mathbf{o} = (o_i)_{i \in N}$, we define $\mathcal{C}(\mathbf{o}) \subseteq \mathcal{Y}$ to be the set of valid subsets including only observation-action pairs that agents are allowed to take,

$$\mathcal{C}(\mathbf{o}) := \big\{ Y \subseteq \mathcal{Y} : |Y \cap \mathcal{Y}_i(o_i)| = 1, \forall i \in \{1, \ldots, N\} \big\},$$

with $|\mathcal{C}(\mathbf{o})| = |A|^{N}$, where $\mathcal{Y}_i(o_i)$, of size $|A|$, denotes the set of pairs in partition $\mathcal{Y}_i$ with $o_i$ as the observation, $\mathcal{Y}_i(o_i) = \big\{(o_i, a_i^{1}), \ldots, (o_i, a_i^{|A|})\big\}$. Q-DPP, denoted by $\mathcal{P}$, defines a probability measure over the valid subsets $Y \in \mathcal{C}(\mathbf{o}) \subseteq \mathcal{Y}$. Let $\mathbf{Y}$ be a random subset drawn according to $\mathcal{P}$; its probability distribution is defined as

$$\mathcal{P}_L\big(\mathbf{Y} = Y \mid Y \in \mathcal{C}(\mathbf{o})\big) := \frac{\det(L_Y)}{\sum_{Y' \in \mathcal{C}(\mathbf{o})} \det(L_{Y'})}. \qquad (4)$$

In addition, given a valid sample $Y \in \mathcal{C}(\mathbf{o})$, we define an identifying function $\mathcal{I}: Y \to N$ that specifies the agent number for each valid pair in $Y$, and an index function $\mathcal{J}: Y \to \{1, \ldots, M\}$ that specifies the position of each item of $Y$ within the ground set $\mathcal{Y}$.
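For concreteness, the following sketch (our own illustration, assuming small discrete observation and action spaces; all names are ours) enumerates the partitioned ground-set indices and the valid subsets $\mathcal{C}(\mathbf{o})$ of Definition 3:

```python
import itertools

# A minimal sketch of the Q-DPP ground set and its partition constraint.
N, n_obs, n_act = 3, 2, 3          # agents, |O|, |A|
M = N * n_obs * n_act              # ground-set size |Y| = N|O||A|

def pair_index(agent, obs, act):
    """Position of (o, a) for `agent` in the ground set (the index function J)."""
    return agent * n_obs * n_act + obs * n_act + act

def valid_subsets(observations):
    """Enumerate C(o): one (o_i, a_i) pair per agent, matching its true observation."""
    per_agent = [[pair_index(i, observations[i], a) for a in range(n_act)]
                 for i in range(N)]
    return list(itertools.product(*per_agent))   # |C(o)| = |A|^N joint choices

obs = [0, 1, 0]                    # current observations o = (o_1, ..., o_N)
C = valid_subsets(obs)
print(len(C))                      # 27 = 3^3 valid subsets
print(C[0])                        # row indices into the kernel L for one subset
```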
The construction of Q-DPP is inspired by P-DPP (Celis et al., 2018). However, the partitioned sets in P-DPP stay fixed, whereas in Q-DPP, $\mathcal{C}(\mathbf{o}^{t})$ changes at every time-step with the new observation, and the kernel $L$ is learned through the process of reinforcement learning rather than being given. More differences are listed in Appendix A.3.

Given Q-DPPs, we can represent the centralized value function by adopting Q-DPPs as general function approximators:

$$Q^{\boldsymbol{\pi}}\big(\mathbf{o}, \mathbf{a}\big) := \log \det\big(L_Y\big), \quad Y = \big\{(o_1, a_1), \ldots, (o_N, a_N)\big\} \in \mathcal{C}(\mathbf{o}^{t}), \qquad (5)$$

where $L_Y$ denotes the submatrix of $L$ whose entries are indexed by the pairs included in $Y$. Q-DPP embeds the connection between the joint action and each agent's individual actions into a subset-sampling process, and the Q-value is quantified by the determinant of a kernel matrix whose elements are indexed by the associated observation-action pairs. The goal of multi-agent learning is to learn an optimal joint Q-function. Eq. 5 states that $\det(L_Y) = \exp\big(Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a})\big)$, meaning Q-DPP assigns large probability to the subsets that have large Q-values. Given that $\det(L_Y)$ is always positive, the log operator ensures Q-DPPs, as general function approximators, can recover any real-valued Q-functions. DPPs capture both the quality and the diversity of a sampled subset; the joint Q-function represented by Q-DPP should therefore acknowledge not only the quality of each agent's individual action towards a large reward, but also the diversification of agents' actions. The remaining question is, then, how to obtain such a quality-diversity representation.

### 3.3. Representation of Q-DPP Kernels

For any PSD matrix $L$, a $W$ can always be found such that $L = W W^{\top}$ with $W \in \mathbb{R}^{M \times P}$, $P \leq M$. Since the diagonal and off-diagonal entries of $L$ represent quality and diversity respectively, we adopt an interpretable decomposition by expressing each row of $W$ as the product of a quality term $d_i \in \mathbb{R}^{+}$ and a diversity feature $b_i \in \mathbb{R}^{P \times 1}$ with $\|b_i\| \leq 1$, i.e., $w_i = d_i b_i^{\top}$. An example of such a decomposition is visualized in Fig. 2, where we define $B = [b_1, \ldots, b_M]$ and $D = \mathrm{diag}(d_1, \ldots, d_M)$. Note that both $D$ and $B$ are free parameters that can be learned from the environment during the Q-learning process in Eq. 2. If we let the quality term be each agent's individual Q-value for a given observation-action pair, i.e., $\forall (o_i, a_i) \in \mathcal{Y}, i \in \{1, \ldots, M\}$, $d_i := \exp\big(\tfrac{1}{2} Q_{\mathcal{I}(o_i, a_i)}(o_i, a_i)\big)$, then Eq. 5 can be further written as

$$Q^{\boldsymbol{\pi}}\big(\mathbf{o}, \mathbf{a}\big) = \log \det\big(W_Y W_Y^{\top}\big) = \log\Big(\det\big(D_Y^{\top} D_Y\big) \det\big(B_Y^{\top} B_Y\big)\Big) = \sum_{i=1}^{N} Q_{\mathcal{I}(o_i, a_i)}\big(o_i, a_i\big) + \log \det\big(B_Y^{\top} B_Y\big). \qquad (6)$$

Since the determinant only reaches its maximum when the associated vectors in $B_Y$ are mutually orthogonal (Noble et al., 1988), Eq. 6 essentially stipulates that Q-DPP represents the joint value function by taking into account not only the quality of each agent's contribution towards reward maximization but, more importantly, from a holistic perspective, the orthogonalization of agents' actions. In fact, the inclusion of diversified agent behaviors is an important factor in satisfying the condition in Eq. 3. Intuitively, in a decentralizable task with a shared goal, promoting orthogonality between agents' actions can help clarify the functionality and responsibility of each agent, which in return leads to a better instantiation of Eq. 3. On the other hand, diversity does not mean that agents have to take different actions all the time. Since the goal is still to achieve a large reward by optimizing Eq. 2, scenarios in which agents need to take identical actions to accomplish a task will not be excluded as a result of promoting diversity.
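The decomposition in Eqs. 5-6 can be checked numerically; the sketch below (our own, with illustrative values only) builds a kernel from made-up quality terms and unit-norm diversity features and verifies that $\log\det(L_Y)$ equals the sum of individual Q-values plus the diversity term:

```python
import numpy as np

# Numeric sketch of the quality-diversity decomposition in Eqs. 5-6, with
# made-up values; M = 6 observation-action pairs, P = 4 feature dimensions.
rng = np.random.default_rng(0)
M, P, N = 6, 4, 3
q_individual = rng.normal(size=M)              # per-pair individual Q-values
d = np.exp(0.5 * q_individual)                 # quality terms d_i = exp(Q_i / 2)
B = rng.normal(size=(P, M))
B /= np.linalg.norm(B, axis=0)                 # diversity features, ||b_i|| = 1
W = (B * d).T                                  # rows w_i = d_i * b_i^T
L = W @ W.T                                    # kernel L = W W^T

Y = [0, 2, 5]                                  # one observation-action pair per agent
q_joint = np.linalg.slogdet(L[np.ix_(Y, Y)])[1]    # Eq. 5: Q = log det(L_Y)

B_Y = B[:, Y]
decomposed = q_individual[Y].sum() + np.linalg.slogdet(B_Y.T @ B_Y)[1]  # Eq. 6
print(np.isclose(q_joint, decomposed))         # True: the two forms agree
```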
### 3.4. Connections to Current Methods

Based on the quality-diversity representation, one can draw a key connection between Q-DPP and existing methods. It turns out that, under the sufficient condition that the learned diversity features corresponding to the optimal actions are mutually orthogonal, Q-DPP degenerates to VDN (Sunehag et al., 2017), QMIX (Rashid et al., 2018), and QTRAN (Son et al., 2019), respectively. To elaborate this condition, let us denote $a_i^{*} = \arg\max_{a_i} Q_i(o_i, a_i)$, $\mathbf{a}^{*} = (a_i^{*})_{i \in N}$, and $Y^{*} = \{(o_i, a_i^{*})\}_{i=1}^{N}$; with $\|b_i\| = 1$ and $b_i^{\top} b_j = 0, \forall i \neq j$, we then have

$$\det\big(B_{Y^{*}}^{\top} B_{Y^{*}}\big) = 1. \qquad (7)$$

Connection to VDN. When $\{b_j\}_{j=1}^{M}$ are pairwise orthogonal, plugging Eq. 7 into Eq. 6 yields

$$Q^{\boldsymbol{\pi}}\big(\mathbf{o}, \mathbf{a}^{*}\big) = \sum_{i=1}^{N} Q_{\mathcal{I}(o_i, a_i^{*})}\big(o_i, a_i^{*}\big). \qquad (8)$$

Eq. 8 recovers the exact additivity constraint that VDN applies to factorize the joint value function in meeting Eq. 3.

Connection to QMIX. Q-DPP also generalizes QMIX, which adopts a monotonicity constraint on the centralized value function to meet Eq. 3. Under the special condition that $\{b_j\}_{j=1}^{M}$ are mutually orthogonal, we can easily show that Q-DPP meets the monotonicity condition because

$$\frac{\partial Q^{\boldsymbol{\pi}}\big(\mathbf{o}, \mathbf{a}^{*}\big)}{\partial Q_{\mathcal{I}(o_i, a_i^{*})}\big(o_i, a_i^{*}\big)} = 1 \geq 0, \quad \forall \mathcal{I}(o_i, a_i^{*}) \in N. \qquad (9)$$

Connection to QTRAN. Q-DPP also meets the sufficient conditions that QTRAN proposes for satisfying Eq. 3, that is,

$$\sum_{i=1}^{N} Q_i\big(o_i, a_i\big) - Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}) + V(\mathbf{o}) = \begin{cases} 0, & \mathbf{a} = \mathbf{a}^{*} \\ \geq 0, & \mathbf{a} \neq \mathbf{a}^{*} \end{cases}, \qquad (10)$$

where $V(\mathbf{o}) = \max_{\mathbf{a}} Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}) - \sum_{i=1}^{N} Q_i\big(o_i, a_i^{*}\big)$. Through Eq. 6, the left-hand side of Eq. 10 can be written for Q-DPP as

$$-\log \det\big(B_Y^{\top} B_Y\big) + \max_{\mathbf{a}} Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}) - \sum_{i=1}^{N} Q_i\big(o_i, a_i^{*}\big). \qquad (11)$$

When $\mathbf{a} = \mathbf{a}^{*}$, for pairwise orthogonal $\{b_j\}_{j=1}^{M}$, Q-DPP satisfies the first condition since Eq. 11 equals zero: $\log \det(B_{Y^{*}}^{\top} B_{Y^{*}}) = 0$ and, by Eq. 8, $\max_{\mathbf{a}} Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}) = \sum_{i=1}^{N} Q_i(o_i, a_i^{*})$. When $\mathbf{a} \neq \mathbf{a}^{*}$, Eq. 11 equals $-\log \det(B_Y^{\top} B_Y) + \log \det(B_{Y^{*}}^{\top} B_{Y^{*}})$, which is always positive since $\det(B_Y^{\top} B_Y) < 1, \forall Y \neq Y^{*}$; Q-DPP thereby meets the second condition of Eq. 10 and recovers QTRAN.

Other Related Work. Determinantal SARSA (Osogami & Raymond, 2019) applies a standard DPP to model the ground set of joint state-action pairs $\big\{(s^{0}, a_1^{0}, \ldots, a_N^{0}), \ldots, (s^{|S|}, a_1^{|A|}, \ldots, a_N^{|A|})\big\}$. It does not consider a ground set suited to multi-agent problems, which leads to a number of subsets of $2^{|S||A|^{N}}$, doubly exponential in the number of agents. Furthermore, unlike Q-DPP, which learns decentralized policies, Det. SARSA learns a centralized joint-action policy, which strongly limits its applicability to scalable real-world tasks.

### 3.5. Sampling from Q-DPP

Agents need to explore the environment effectively during training; however, how to sample from the Q-DPP defined in Eq. 4 is still unknown. In fact, sampling from DPPs with a partition-matroid constraint is a non-trivial task. So far, the best known exact sampler for partitioned DPPs has $O(mp)$ time complexity, with $m$ being the ground-set size and $p$ being the number of partitions (Li et al., 2016; Celis et al., 2017). Nonetheless, these samplers still pose great computational challenges for multi-agent learning tasks and cannot scale to a large number of agents, because we have $m = |\mathcal{C}(\mathbf{o})| = |A|^{N}$ for multi-agent learning tasks. In this work, we instead adopt a biased yet tractable sampler for Q-DPP. Our sampler is an application of the sampling-by-projection idea in Celis et al. (2018) and Chen et al. (2018), which leverages the property that the Gram-Schmidt process preserves the determinant.
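The following small NumPy check (ours, not part of the paper's code) verifies this determinant-preservation property numerically on a random matrix with linearly independent rows:

```python
import numpy as np

# Numerical check of the property the sampler relies on: the product of squared
# norms of (unnormalized) Gram-Schmidt residuals equals det(W W^T).
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))            # 4 generically independent rows in R^6

prod, basis = 1.0, []
for w in W:
    residual = w.copy()
    for u in basis:                    # project out the span of previous rows
        residual -= (residual @ u) / (u @ u) * u
    prod *= residual @ residual        # squared norm of the orthogonal component
    basis.append(residual)

print(np.isclose(prod, np.linalg.det(W @ W.T)))   # True
```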
One benefit of our sampler is that it promotes efficient exploration among agents during training. Importantly, it enjoys only linear-time complexity w.r.t. the number of agents. The intuition is as follows.

Additional Notations. In a Euclidean space $\mathbb{R}^{n}$ equipped with an inner product $\langle \cdot, \cdot \rangle$, let $U \subseteq \mathbb{R}^{n}$ be any linear subspace, and let $U^{\perp}$ be its orthogonal complement, $U^{\perp} := \{x \in \mathbb{R}^{n} \mid \langle x, y \rangle = 0, \forall y \in U\}$. We define an orthogonal projection operator $\Pi_{U^{\perp}}: \mathbb{R}^{n} \to \mathbb{R}^{n}$ such that, $\forall u \in \mathbb{R}^{n}$, if $u = u_1 + u_2$ with $u_1 \in U$ and $u_2 \in U^{\perp}$, then $\Pi_{U^{\perp}}(u) = u_2$. Gram-Schmidt (Noble et al., 1988) is a process for orthogonalizing a set of vectors: given a set of linearly independent vectors $\{w_i\}$, it outputs a mutually orthogonal set of vectors $\{\hat{w}_i\}$ by computing $\hat{w}_i := \Pi_{U_i^{\perp}}(w_i)$, where $U_i = \mathrm{span}\{w_1, \ldots, w_{i-1}\}$. Note that we neglect the normalizing step of Gram-Schmidt in this work. Finally, if the rows $\{w_i\}$ of a matrix $W$ are mutually orthogonal, we can compute the determinant by $\det(W W^{\top}) = \prod_i \|w_i\|^{2}$.

The Q-DPP sampler is built upon the following property.

Proposition 1 (Volume preservation of Gram-Schmidt; see Chapter 7 in Shafarevich & Remizov (2012), also Lemma 3.1 in Celis et al. (2018)). Let $U_i = \mathrm{span}\{w_1, \ldots, w_{i-1}\}$ and let $w_i \in \mathbb{R}^{P}$ be the $i$-th row of $W \in \mathbb{R}^{M \times P}$; then $\prod_{i=1}^{M} \|\Pi_{U_i^{\perp}}(w_i)\|^{2} = \det(W W^{\top})$.

We also provide an intuition via Gaussian elimination in Appendix A.1. Proposition 1 suggests that the determinant of a Gram matrix is invariant to applying Gram-Schmidt orthogonalization to the rows of that Gram matrix. In Q-DPP's case, a kernel matrix with mutually orthogonal rows largely simplifies the sampling process. In such a scenario, an effective sampler is as follows: from each partition $\mathcal{Y}_i$, sample an item $i \in \mathcal{Y}_i$ with $P(i) \propto \|d_i b_i^{\top}\|^{2}$, then add $i$ to the output sample $Y$ and move to the next partition; these steps iterate until all partitions are covered. It is straightforward to see that the probability of obtaining a sample $Y$ in this way is proportional to

$$\prod_{i \in Y} \|d_i b_i^{\top}\|^{2} = \prod_{i \in Y} \|w_i\|^{2} = \det\big(W_Y W_Y^{\top}\big) = \det\big(L_Y\big). \qquad (12)$$

We formally describe the orthogonalizing sampling procedure in Algorithm 1.

    Algorithm 1  Multi-Agent Determinantal Q-Learning
     1: DEF Orthogonalizing-Sampler(Y, D, B, o):
     2:   Init: b_j <- B[:, j] for all j, Y <- {}, B' <- {}, J <- {}.
     3:   for each partition Y_i do
     4:     Define, for all (o, a) in Y_i(o_i), q(o, a) := ||b_J(o,a)||^2 * exp(D_{J(o,a),J(o,a)}).
     5:     Sample (o~_i, a~_i) in Y_i(o_i) from the distribution
             [ q(o, a) / sum_{(o^,a^) in Y_i(o_i)} q(o^, a^) ]_{(o,a) in Y_i(o_i)}.
     6:     Let Y <- Y ∪ {(o~_i, a~_i)}, B' <- B' ∪ {b_J(o~_i, a~_i)}, J <- J ∪ {J(o~_i, a~_i)}.
     7:     // Gram-Schmidt orthogonalization
     8:     Set b_j <- Proj_{span{B'}^⊥}(b_j), for all j in {1, ..., M} \ J.
     9:   end for
    10:   Return: Y.
    11:
    12: DEF Determinantal-Q-Learning(θ = [θ_D, θ_B], Y):
    13:   Init: θ^- <- θ, D <- {} (replay buffer).
    14:   for each time-step do
    15:     Collect observations o = [o_1, ..., o_N] for all agents.
    16:     a = Orthogonalizing-Sampler(Y, θ_D, θ_B, o).
    17:     Execute a, store the transition <o, a, R, o'> in D.
    18:     Sample a mini-batch {<o, a, R, o'>}_{j=1}^{E} from D.
    19:     Compute, for each transition in the mini-batch,
             max_{a'} Q(o', a'; θ^-) = log det(L_{Y'}), Y' = {(o'_1, a'_1), ..., (o'_N, a'_N)},
             where  // off-policy decentralized execution
             a'_i = arg max_{a_i in A_i} [ ||θ^-_B[:, J(o'_i, a_i)]||^2 * exp(θ^-_D[J(o'_i, a_i), J(o'_i, a_i)]) ].
    20:     // centralized training
    21:     Update θ by minimizing L(θ) defined in Eq. 2.
    22:     Update target θ^- = θ periodically.
    23:   end for
    24:   Return: θ_D, θ_B.

As suggested in Celis et al. (2018), the time complexity of the sampling function is $O(NMP)$ (see also the breakdown analysis for each step in Appendix A.4); given that the input size is $O(MP)$, our sampler thus enjoys linear-time complexity w.r.t. the number of agents.
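As a rough illustration of the sampler part of Algorithm 1, here is a minimal NumPy sketch under the simplifying assumption of a tabular parameterization, where the diagonal of D directly stores individual Q-values and B holds unit-norm diversity features; all names are ours, not from the paper's code:

```python
import numpy as np

def orthogonalizing_sampler(partitions, D_diag, B, rng):
    """Sketch of Algorithm 1's sampler: sequentially pick one item per partition.

    `partitions` is a list of index lists (the valid pairs Y_i(o_i) per agent),
    `D_diag[j]` plays the role of the individual Q-value of ground-set item j,
    `B[:, j]` is its diversity feature. Returns the selected item indices.
    """
    B_proj = B.astype(float).copy()     # features get orthogonalized in place
    selected = []
    for items in partitions:
        # Lines 4-5: quality-diversity weights q(o, a) = ||b_j||^2 * exp(D_jj)
        q = np.array([B_proj[:, j] @ B_proj[:, j] * np.exp(D_diag[j]) for j in items])
        j_star = items[rng.choice(len(items), p=q / q.sum())]
        selected.append(j_star)
        # Lines 7-8: project remaining features onto the orthogonal complement
        u = B_proj[:, j_star].copy()
        if u @ u > 1e-12:
            B_proj -= np.outer(u, (u @ B_proj) / (u @ u))
    return selected

# Toy usage: 3 agents, 3 actions each, 4-dim features (made-up values).
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 9)); B /= np.linalg.norm(B, axis=0)
D_diag = rng.normal(size=9)
partitions = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(orthogonalizing_sampler(partitions, D_diag, B, rng))
```

Each agent's partition is processed in turn, and the projection after every pick steers later agents toward feature directions orthogonal to those already chosen, which is the coordinated-exploration effect described above.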
Although the Gram-Schmidt process preserves the determinant and simplifies the sampling process, it comes at the price of introducing bias in the normalization term. Specifically, the normalization in our proposed sampler is conducted at each agent/partition level $\mathcal{Y}_i(o_i)$ (see line 5 of Algorithm 1), which does not match Eq. 4, where normalization is over all valid samples across all partitions $\mathcal{C}(\mathbf{o})$; this directly leads to a sampled subset from our sampler having larger probability than what Q-DPP defines. Interestingly, it turns out that such violation can be controlled by bounding the singular values of each partition in the kernel matrix (see Assumption 1), a technique also known as the β-balance condition introduced in P-DPP (Celis et al., 2018).

Assumption 1 (Singular-Value Constraint on Partitions). For a Q-DPP as defined in Definition 3, parameterized by $D \in \mathbb{R}^{M \times M}$, $B \in \mathbb{R}^{P \times M}$, and $W := D B^{\top}$, let $\sigma_1 \geq \ldots \geq \sigma_P$ represent the singular values of $W$, and let $\hat{\sigma}_{i,1} \geq \ldots \geq \hat{\sigma}_{i,P}$ denote the singular values of $W_{\mathcal{Y}_i}$, the submatrix of $W$ whose rows correspond to the $i$-th partition $\mathcal{Y}_i$. We assume that $\forall j \in \{1, \ldots, P\}$, $\exists \delta \in (0, 1]$, s.t. $\min_{i \in \{1, \ldots, N\}} \hat{\sigma}_{i,j}^{2} / \delta \geq \sigma_j^{2}$ holds.

Theorem 1 (Approximation Guarantee of the Orthogonalizing Sampler). For a Q-DPP as defined in Definition 3, under Assumption 1, the orthogonalizing sampler described in Algorithm 1 returns a sampled subset $Y \in \mathcal{C}(\mathbf{o})$ with probability $P(Y) \leq \frac{1}{\delta^{N}} \mathcal{P}(\mathbf{Y} = Y)$, where $N$ is the number of agents, $\mathcal{P}$ is defined in Eq. 4, and $\delta$ is defined in Assumption 1.

Proof. The proof is in Appendix A.2. It can also be taken as a special case of Theorem 3.2 in Celis et al. (2018) when the number of samples drawn from each partition is one.

Theorem 1 effectively suggests a way to bound the error between our sampler and the true distribution of Q-DPP: minimize the difference between $\sigma_j^{2}$ and $\hat{\sigma}_{i,j}^{2}$.

### 3.6. Determinantal Q-Learning

We present the full learning procedure in Algorithm 1. Determinantal Q-learning is a CTDE method. During training, agents' exploration is conducted through the orthogonalizing sampler. The parameters of $B$ and $D$ are updated through Q-learning in a centralized way by following Eq. 2. To meet Assumption 1, one can implement an auxiliary loss of $\max\big(0, \sigma_j^{2} - \hat{\sigma}_{i,j}^{2} / \delta\big)$ in addition to $\mathcal{L}(\theta)$ in Eq. 2, where $\delta$ is a hyper-parameter. Given Theorem 1, for large $N$, $\delta$ should be set close to 1 to make the bound tight.

In fact, it is worth mentioning that the Gram-Schmidt process adopted in the sampler boosts the sampling efficiency for multi-agent training. Since agents' diversity features of observation-action pairs are orthogonalized every time a partition is visited, agents who act later are essentially coordinated to explore the observation-action space that is distinctive from that of all previous agents. This speeds up training in the early stages. During execution, agents only need to access the parameters in their own partitions to compute the greedy action (see line 19 of Algorithm 1). Note that neural networks can be seamlessly applied to represent both $B$ and $D$ to tackle continuous states. Though a full treatment of deep Q-DPP needs substantial future work, we show a proof of concept in Appendix C. Hereafter, we use Q-DPP to refer to our proposed algorithm.
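To illustrate how Assumption 1 could be enforced in practice, here is a hypothetical sketch of the auxiliary singular-value penalty; truncating to the common number of singular values is our simplification, and all names are illustrative:

```python
import numpy as np

def singular_value_penalty(W, partitions, delta=0.5):
    """Sketch of the auxiliary loss max(0, sigma_j^2 - sigma_hat_{i,j}^2 / delta)
    penalizing violations of Assumption 1; W = D B^T, `partitions` lists the
    row indices of each agent's partition. Values here are illustrative only."""
    sigma = np.linalg.svd(W, compute_uv=False) ** 2          # sigma_j^2 of W
    penalty = 0.0
    for rows in partitions:
        sigma_hat = np.linalg.svd(W[rows], compute_uv=False) ** 2
        k = min(len(sigma), len(sigma_hat))                   # compare shared indices
        penalty += np.maximum(0.0, sigma[:k] - sigma_hat[:k] / delta).sum()
    return penalty

# Toy check with a random W split into three equal partitions.
rng = np.random.default_rng(2)
W = rng.normal(size=(9, 4))
print(singular_value_penalty(W, [[0, 1, 2], [3, 4, 5], [6, 7, 8]]))
```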
## 4. Experiments

We compare Q-DPP with state-of-the-art CTDE solvers for multi-agent cooperative tasks, including COMA (Foerster et al., 2018), VDN (Sunehag et al., 2017), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), and MAVEN (Mahajan et al., 2019). All baselines are imported from PyMARL (Samvelyan et al., 2019). Detailed settings are in Appendix B. Code is released at https://github.com/QDPP-GitHub/QDPP.

We consider four cooperative tasks, shown in Fig. 3, all of which require non-trivial value-function decomposition to achieve the largest reward.

Figure 3: Multi-agent cooperative tasks: (a) Pathological Stochastic Game; (b) Blocker Game, where agents can move in four directions or stay fixed, the target is to reach the bottom row, and blockers can move left/right to block the agents; (c) Coordinated Navigation; (d) Predator-Prey World, with a reward of -0.5 when one predator meets one prey and +5 when two predators catch one prey. The size of the ground set for each task is (a) 176, (b) 420, (c) 720, (d) 3920.

Pathological Stochastic Game. The optimal policy of this game is for both agents to keep taking the top-left action until the 10th step and then switch to the bottom-right action, which yields the optimal reward of 13. The stochastic game is designed to be pathological. First, it is non-monotonic (so QMIX is bound to fail); second, it exhibits relative overgeneralization (Wei et al., 2018), because an agent playing its 1st action receives a higher reward of 10 on average when matched with arbitrary actions of the other agent. We allow each agent to observe the current step number and the joint action from the last time-step. Zero reward leads to immediate termination. Fig. 4a shows that Q-DPP converges to the global optimum in only 20K steps while the other baselines struggle.

Blocker Game & Coordinated Navigation. The Blocker game (Heess et al., 2012) requires agents to reach the bottom row by coordinating with their teammates to deceive the blockers, which can move left/right to block them. The navigation game requires four agents to reach four different landmarks. In both tasks, the agents incur a reward of -1 per time-step until they all reach their destinations. Depending on the starting points, the largest rewards of the games are -3 and -6, respectively. Both tasks are challenging in the sense that coordination is hard for agents that only have decentralized policies and local observations. Fig. 4b & 4c suggest that Q-DPP still achieves the best performance.

Predator-Prey World. In this task, four predators attempt to capture two randomly-moving preys. Each predator can move in four directions but only has a local view. The predators get a team reward if two or more predators are capturing the same prey at the same time, and they are penalized by 0.5 if only one of them captures a prey. The game terminates when all preys are caught. Fig. 4d shows Q-DPP's superior performance over all other baselines.

Apart from the best performance in terms of rewards, we offer further insights into why and how Q-DPP works well.

The Importance of Assumption 1. Assumption 1 is the premise for the correctness of the Q-DPP sampler to hold. To investigate its impact in practice, we conduct an ablation study on the Blocker and Navigation games. We implement the assumption via an auxiliary loss function of $\max(0, \sigma_j^{2} - \hat{\sigma}_{i,j}^{2}/\delta)$ that penalizes its violation, with $\delta = 0.5$. Fig. 4e presents the performance comparison of Q-DPP with and without this additional loss function.
We can see that maintaining this condition, though not improving the final performance, stabilizes the training process by significantly reducing the variance of the rewards. We believe this is because violating Assumption 1 leads to over-estimating the probability of certain observation-action pairs in the partition where the violation happens; such over-estimation can make the agent stick to a poor local observation-action pair for some time.

Figure 4: (a)-(d) Performance over time on the different tasks: Multi-Step Matrix Game, Blocker Game, Coordinated Navigation, and Predator-Prey World. (e) Ablation study on Assumption 1 on the Blocker game. (f) The ratio of diversity to quality, i.e., $\log \det(B_Y^{\top} B_Y) / \sum_{i=1}^{N} Q_{\mathcal{I}(o_i, a_i)}(o_i, a_i)$, during training on the Blocker game.

Figure 5: (a)-(c) Each agent's decentralized policy, i.e., $\arg\max_{a} Q_i(o_i, a)$, during execution on the Blocker game.

The Satisfaction of Eq. 3. We show empirical evidence on the Blocker game that the natural factorization Q-DPP offers indeed satisfies Eq. 3. Intuitively, Q-DPP encourages agents to acquire diverse behavioral models during training so that the optimal action of one agent does not depend on the actions of the other agents during the decentralized execution stage; as a result, Eq. 3 is satisfied. Fig. 5 (a-c) justifies this intuition by showing that Q-DPP learns mutually orthogonal behavioral models. Given the distinction among the agents' individual policies, one can tell that the joint optimum is reached through individual optima.

Quality versus Diversity. We investigate how the relative importance of quality versus diversity changes during training. On the Blocker game, we show the ratio $\log \det(B_Y^{\top} B_Y) / \sum_{i=1}^{N} Q_{\mathcal{I}(o_i, a_i)}(o_i, a_i)$, which reflects how the learning algorithm balances maximizing reward against encouraging diverse behaviors. In Fig. 4f, we can see that the ratio gradually converges to 0. The diversity term plays a less important role as training progresses; this is expected, since exploration tends to be rewarded more in the early stage of a task.

## 5. Conclusion

We proposed Q-DPP, a new type of value-function approximator for cooperative multi-agent reinforcement learning. Q-DPP, as a probabilistic way of modeling sets, considers not only the quality of agents' actions towards reward maximization, but also the diversity of agents' behaviors. We have demonstrated that Q-DPP addresses the limitations of current major solutions, including VDN, QMIX, and QTRAN, by learning the value-function decomposition without structural constraints. In the future, we plan to investigate other kernel representations for Q-DPP to tackle tasks with continuous states and continuous actions.

## Acknowledgement

We sincerely thank Dr. Haitham Bou Ammar for his constructive comments. Weinan Zhang thanks the support of the "New Generation of AI 2030" Major Project (2018AAA0100900) and NSFC (61702327, 61632017).

## References

Affandi, R. H., Fox, E., Adams, R., and Taskar, B. Learning the parameters of determinantal point process kernels. In International Conference on Machine Learning, pp. 1224-1232, 2014.

Albrecht, S. V. and Ramamoorthy, S. Comparative evaluation of MAL algorithms in a diverse set of ad hoc team problems.
In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS '12, pp. 349-356, Richland, SC, 2012. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 0981738117.

Celis, L. E., Deshpande, A., Kathuria, T., Straszak, D., and Vishnoi, N. K. On the complexity of constrained determinantal point processes. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

Celis, L. E., Keswani, V., Straszak, D., Deshpande, A., Kathuria, T., and Vishnoi, N. K. Fair and diverse DPP-based data summarization. arXiv preprint arXiv:1802.04023, 2018.

Chen, L., Zhang, G., and Zhou, E. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In Advances in Neural Information Processing Systems, pp. 5622-5633, 2018.

Du, Y., Han, L., Fang, M., Liu, J., Dai, T., and Tao, D. LIIR: Learning individual intrinsic reward in multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4403-4414, 2019.

Elfeki, M., Couprie, C., Riviere, M., and Elhoseiny, M. GDPP: Learning diverse generations using determinantal point processes. In International Conference on Machine Learning, pp. 1774-1783, 2019.

Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.

Heess, N., Silver, D., and Teh, Y. W. Actor-critic reinforcement learning with energy-based policies. In EWRL, pp. 43-58, 2012.

Kulesza, A., Taskar, B., et al. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123-286, 2012.

Li, C., Sra, S., and Jegelka, S. Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. In Advances in Neural Information Processing Systems, pp. 4188-4196, 2016.

Li, M., Qin, Z., Jiao, Y., Yang, Y., Wang, J., Wang, C., Wu, G., and Ye, J. Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In The World Wide Web Conference, pp. 983-994. ACM, 2019.

Macchi, O. The fermion process - a model of stochastic point process with repulsive points. In Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians, pp. 391-398. Springer, 1977.

Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson, S. MAVEN: Multi-agent variational exploration. In Advances in Neural Information Processing Systems, pp. 7611-7622, 2019.

Matignon, L., Laurent, G. J., and Le Fort-Piat, N. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 64-69. IEEE, 2007.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Noble, B., Daniel, J. W., et al. Applied linear algebra, volume 3. Prentice-Hall, New Jersey, 1988.

Oliehoek, F. A., Spaan, M. T., and Vlassis, N. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32:289-353, 2008.

Oliehoek, F. A., Amato, C., et al.
A concise introduction to decentralized POMDPs, volume 1. Springer, 2016.

Osogami, T. and Raymond, R. Determinantal reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4659-4666, 2019.

Panait, L. and Luke, S. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387-434, 2005.

Pang, T., Xu, K., Du, C., Chen, N., and Zhu, J. Improving adversarial robustness via promoting ensemble diversity. In International Conference on Machine Learning, pp. 4970-4979, 2019.

Peng, P., Wen, Y., Yang, Y., Yuan, Q., Tang, Z., Long, H., and Wang, J. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.

Rashid, T., Samvelyan, M., De Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.

Samvelyan, M., Rashid, T., de Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G. J., Hung, C.-M., Torr, P. H. S., Foerster, J., and Whiteson, S. The StarCraft Multi-Agent Challenge. CoRR, abs/1902.04043, 2019.

Shafarevich, I. R. and Remizov, A. O. Linear algebra and geometry. Springer Science & Business Media, 2012.

Sharghi, A., Borji, A., Li, C., Yang, T., and Gong, B. Improving sequential determinantal point processes for supervised video summarization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 517-533, 2018.

Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1905.05408, 2019.

Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.

Tan, M. Multi-agent reinforcement learning: independent versus cooperative agents. In International Conference on Machine Learning (ICML), pp. 330-337, 1993.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

Wei, E., Wicke, D., Freelan, D., and Luke, S. Multiagent soft Q-learning. In 2018 AAAI Spring Symposium Series, 2018.

Yang, Y., Yu, L., Bai, Y., Wen, Y., Zhang, W., and Wang, J. A study of AI population dynamics with million-agent reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2133-2135. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Yang, Y., Tutunov, R., Sakulwongtana, P., Ammar, H. B., and Wang, J. Alpha-alpha-rank: Scalable multi-agent evaluation through evolution. AAMAS 2020, 2019.

Zhou, M., Chen, Y., Wen, Y., Yang, Y., Su, Y., Zhang, W., Zhang, D., and Wang, J. Factorized Q-learning for large-scale multi-agent systems. In Proceedings of the First International Conference on Distributed Artificial Intelligence, pp. 1-7, 2019.