Published as a conference paper at ICLR 2023

TRUTHFUL SELF-PLAY

Shohei Ohsawa
Founder & CEO, Daisy AI
6-13-9 Ginza, Chuo-ku, Tokyo, Japan
o@daisy.inc

ABSTRACT

We present a general framework for evolutionary learning of emergent unbiased state representations without any supervision. Evolutionary frameworks such as self-play converge to bad local optima in multi-agent reinforcement learning in non-cooperative partially observable environments with communication, owing to information asymmetry. Our proposed framework is a simple modification of self-play inspired by mechanism design, also known as reverse game theory, to elicit truthful signals and make the agents cooperative. The key idea is to add imaginary rewards using the peer prediction method, i.e., a mechanism for evaluating the validity of information exchanged between agents in a decentralized environment. Numerical experiments with predator prey, traffic junction, and StarCraft tasks demonstrate the state-of-the-art performance of our framework.

1 INTRODUCTION

Evolving culture prevents deep neural networks from falling into bad local optima (Bengio, 2012). Self-play (Samuel, 1967; Tesauro, 1995) has not only demonstrated the ability to abstract high-dimensional state spaces, as typified by AlphaGo (Silver et al., 2017), but has also improved exploration coverage in partially observable environments. Communication (Sukhbaatar et al., 2016; Singh et al., 2019) lets agents exchange internal representations such as explored observations and the hidden states of RNNs. Evolutionary learning is expected to be a general framework for creating superhuman AIs, as such learning can generate high-level abstract representations without any bias from supervision.

However, when applying evolutionary learning to a partially observable environment with non-cooperative agents, an improper bias is injected into the state representation. This bias originates from the environment. A partially observable environment with non-cooperative agents induces actions that prevent an agent from honestly sharing its correct internal state, so that at equilibrium the agents take actions such as concealing information and deceiving other agents (Singh et al., 2019). The problem arises because an agent cannot fully observe the state of the environment, and thus it does not have sufficient knowledge to verify the information provided by other agents. Furthermore, neural networks are vulnerable to adversarial examples (Szegedy et al., 2014) and are likely to exhibit erroneous behavior under small perturbations. Many discriminative models for information accuracy are available; these include GANs (Goodfellow et al., 2014; Radford et al., 2016) and curriculum learning (Lowe et al., 2020). However, these models assume that accurate samples can be obtained by supervision. Because of this assumption, it is impossible to apply these models to a partially observable environment, where the distribution is not stable.

We generalize self-play to non-cooperative partially observable environments via mechanism design (Myerson, 1983; Miller et al., 2005), which is also known as reverse game theory. The key idea is to add imaginary rewards by using the peer prediction method (Miller et al., 2005), that is, a mechanism for evaluating the validity of information exchanged between agents in a decentralized environment, which is calculated based on the social influence of the signals.
We formulate the non-cooperative partially observable environment as an extension of partially observable stochastic games (POSG) (Hansen et al., 2004) and introduce truthfulness (Vickrey, 1961), which is an indicator of the validity of the state representation. We show that the imaginary reward enables us to reflect the bias of the state representation in the gradient without oracles.

As the first contribution, we propose truthful self-play (TSP) and analytically demonstrate its convergence to the global optimum (Section 4). We propose the imaginary reward on the basis of the peer prediction method (Miller et al., 2005) and apply it to self-play. The mechanism affects the gradient at local optima, but not at the global optimum. The trick is to use the actions taken by the agents as feedback to verify the signal received from every other agent, instead of the true state, input, and intent, which the agents cannot fully observe. TSP only requires a modification of the baseline function of self-play; it drastically improves convergence to the global optimum in Comm-POSG.

As the second contribution, based on the results of numerical experiments, we report that TSP achieves state-of-the-art performance on various multi-agent tasks made of up to 20 agents (Section 5). Using predator prey (Barrett et al., 2011), traffic junction (Sukhbaatar et al., 2016; Singh et al., 2019), and StarCraft (Synnaeve et al., 2016) environments, which are typically used in Comm-POSG research, we compared the performance of TSP with current neural nets, including the state-of-the-art methods, with LSTM, CommNet (Sukhbaatar et al., 2016), and IC3Net (Singh et al., 2019). We report that the model with IC3Net optimized by TSP has the best performance.

This work is the first attempt to apply mechanism design to evolutionary learning. TSP is a general optimization algorithm whose convergence is theoretically guaranteed for arbitrary policies and environments. Since no supervision is required, TSP has a wide range of applications, not only to game AIs (Silver et al., 2017), but also to the robots (Jaderberg et al., 2018), chatbots (Gupta et al., 2019; Chevalier et al., 2019), and autonomous cars (Tang, 2019) employed in multi-agent tasks.

Notation: Vectors are columns. Let ⟦n⟧ := {1, . . . , n}. R is the set of real numbers. ı is the imaginary unit (written ı to distinguish it from the index i). Re u and Im u are the real and imaginary parts of a complex number u, respectively. n-tuples are written as the boldface of the original variable, a := ⟨a1, . . . , an⟩, and a−i is the (n−1)-tuple obtained by removing the i-th entry from a. Let 1 := (1, . . . , 1)ᵀ. Matrices are shown in uppercase letters, L := (ℓij). E is the unit matrix. The set of probability distributions with support X is denoted P(X).

2 RELATED WORK

Neural communication has gained attention in the field of multi-agent reinforcement learning (MARL) for both discrete (Foerster et al., 2016) and continuous (Sukhbaatar et al., 2016; Singh et al., 2019) signals. These networks are trained via self-play to exchange the internal state of the environment stored in the working memory of recurrent neural networks (RNNs), in order to learn the right policy in partially observable environments. The term self-play was coined by the game AI community in the latter half of the twentieth century. Samuel (1967) introduced self-play as a framework for sharing a state-action value between two opposing agents to efficiently search the state space in checkers.
TD-Gammon (Tesauro, 1995) introduced self-play as a framework to learn TD(λ) (Sutton & Barto, 1998) and achieved professional-grade play in backgammon. AlphaGo (Silver et al., 2017) defeated the Go champion by combining supervised learning on professional game records with self-play. AlphaZero (Silver et al., 2018) successfully learnt beyond its own performance entirely through self-play. All these studies explain that eliminating the bias of human knowledge in supervision is the advantage of self-play. Self-play is also known as evolutionary learning (Bengio, 2012) in the deep learning community, mainly as an approach to emergent representations without supervision (Bansal et al., 2018; Balduzzi et al., 2019). Bansal et al. (2018) show that competitive environments contribute to emerging diversity and complexity. Rich generative models such as GANs (Goodfellow et al., 2014; Radford et al., 2016) are frameworks for acquiring an environmental model by employing competitive settings. RNNs such as world models (Ha & Schmidhuber, 2018; Eslami et al., 2018) are capable of wider ranges of exploration in partially observable environments and of the generation of symbols and languages (Bengio, 2017; Gupta et al., 2019; Chevalier et al., 2019). The difference between evolutionary learning and supervised learning is the absence of human knowledge and oracles.

Several works have formalized games in which the agents exchange environmental information as formal classes such as Dec-POMDP-Com (Goldman & Zilberstein, 2003) and COM-MTDP (Pynadath & Tambe, 2002), and several frameworks have been proposed to solve these problems. However, the limitation of these frameworks is that they assume a common reward. As there is as yet no formal definition of non-cooperative communication games, we formalize such games as Comm-POSG, a superset of POSGs (Hansen et al., 2004), a more general class of multi-agent games that includes non-cooperative cases.

To the best of our knowledge, no studies have introduced truthful mechanisms into the field of MARL, but it may be possible to do so by using agents that can learn flexibly, such as neural networks. A typical truthful mechanism is the VCG mechanism (Vickrey, 1961), a generalization of the pivot method used in auction theory; however, in the VCG mechanism the report that must satisfy truthfulness is a valuation (a value function, if interpreted from an RL perspective). In this study, the scope of application is different because the belief states of the environment are the subject of reporting. Therefore, we instead introduce the peer prediction method (Miller et al., 2005), which guarantees truthfulness with respect to reports of beliefs about arbitrary probability distributions by using proper scoring rules (Gneiting & Raftery, 2007).

3 PROBLEM DEFINITION

3.1 COMM-POSG

A communicative partially observable stochastic game (Comm-POSG) is a class of non-cooperative Bayesian games in which no agent fully observes the environment but the agents interact with each other. We define Comm-POSG as an extension of POSG (Hansen et al., 2004) with a message protocol.
Definition 3.1 (Hansen et al., 2004). A POSG ⟨n, T, S, A, X, T, P, R⟩ is a class for multi-agent decision making under uncertainty in which the state evolves over time 1 ≤ t ≤ T, where n is the number of agents, T is the horizon, i.e., the episode length, S is a set of discrete/continuous states s_t ∈ S with an initial probability distribution p(s_0), A is a set of discrete/continuous actions a_ti ∈ A, X is a set of discrete/continuous observations x_ti ∈ X, T ∈ P(S × A^n → S) is the state transition probability, P ∈ P(S → X^n) is the observation probability, and R : S × A^n → R^n is a reward function that outputs an n-dimensional vector.

In Comm-POSGs, every agent further follows a message protocol Z^{n×n}, where Z is the discrete/continuous signal space. The complete information exchanged among the agents at time t is Z_t := (z_tij)_{i,j∈⟦n⟧} ∈ Z^{n×n}, a signal matrix whose (i, j)-th entry z_tij represents the signal from Agent i to Agent j at time t. The i-th diagonal entry of Z_t, h_ti := z_tii, represents the pre-state, the internal state of the i-th agent before receiving the signals from the others. A game in Comm-POSG is denoted as G := ⟨n, T, S, A, X, T, P, R, Z⟩. The objective of Comm-POSG is the social welfare (Arrow, 1963) defined by

$$J := \sum_{i=1}^{n} V^{\pi_i}; \qquad V^{\pi_i} := \mathbb{E}_{\pi_i}\!\left[\sum_{t=1}^{T} \gamma^{t-1} r_{ti}\right], \tag{1}$$

where γ ∈ [0, 1] is the discount rate, r_ti is the reward, π_i is a stochastic policy, and V^{π_i} is the value function. In extensive-form games, including Comm-POSG, the information of the other agents cannot be observed in addition to the information of the environment. In the optimization problem under these assumptions, a policy converges to a solution called the Bayesian Nash equilibrium (BNE) (Fudenberg, 1993). We denote the social welfare at the BNE by J* and the global maximum by Ĵ. In general, J* ≠ Ĵ, which is closely related to information asymmetry.

3.2 COMMUNICATIVE RECURRENT AGENTS

Since this paper proposes an optimization algorithm, we do not propose a concrete network structure; instead, we propose an abstract structure that covers existing neural communication models (Sukhbaatar et al., 2016; Singh et al., 2019), namely communicative recurrent agents (CRAs) ⟨f_φ, σ_φ, q_φ, π_θ⟩, where

- f_φ(ĥ_{t−1,i}, x_ti) ∈ Z is a deep RNN for the high-dimensional input x_ti ∈ X and the previous post-state ĥ_{t−1,i} ∈ Z, with a parameter φ and an initial state ĥ_0 ∈ Z,
- σ_φ(z_ti|h_ti) is a stochastic messaging policy for the pre-state h_ti := f_φ(ĥ_{t−1,i}, x_ti),
- q_φ(ĥ_ti|ẑ_ti) is a stochastic model for the post-state ĥ_ti ∈ Z given the received messages ẑ_ti := (Z_t)ᵀ_{:i} = (z_t1i, . . . , z_{t,i−1,i}, h_ti, z_{t,i+1,i}, . . . , z_tni)ᵀ, and
- π_θ(a_ti|ĥ_ti) is the stochastic action policy with a parameter θ.

These agents are trained through self-play using on-policy learning such as REINFORCE (Williams, 1992). All n agents share the same weights within an episode, and the weights are updated based on the cumulative reward after the episode. In addition to outputting actions from the observation series as a recurrent agent does, a CRA takes communication signals as input and output. CRAs estimate the current state of the environment and the agent's own current value with the post-state model, based on the pre-state h_ti in the hidden layer of the RNN and the signals ẑ_{t,−i} received from the other agents. Hence, the veracity of the signals z_ti is the point of contention.
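To make the abstraction concrete, the following is a minimal PyTorch sketch of a single CRA step. The specific module choices (GRU cells, a deterministic tanh message head, mean-pooled incoming messages) are our own illustrative assumptions and are not prescribed by the paper.

```python
import torch
import torch.nn as nn

class CRA(nn.Module):
    """Minimal sketch of a communicative recurrent agent <f_phi, sigma_phi, q_phi, pi_theta>.
    Sizes and module choices are illustrative assumptions, not the paper's architecture."""

    def __init__(self, obs_dim, hidden_dim, n_actions):
        super().__init__()
        self.f = nn.GRUCell(obs_dim, hidden_dim)           # f_phi: RNN over (previous post-state, observation)
        self.msg_head = nn.Linear(hidden_dim, hidden_dim)  # sigma_phi: messaging policy (deterministic mean shown)
        self.post = nn.GRUCell(hidden_dim, hidden_dim)     # q_phi: post-state model over received messages
        self.pi = nn.Linear(hidden_dim, n_actions)         # pi_theta: action policy head

    def pre_state(self, post_prev, obs):
        return self.f(obs, post_prev)                      # h_ti = f_phi(hhat_{t-1,i}, x_ti)

    def message(self, h):
        return torch.tanh(self.msg_head(h))                # z_ti ~ sigma_phi(.|h_ti)

    def post_state(self, h, received_msgs):
        # received_msgs: (n, hidden_dim), the column of the signal matrix addressed to this agent
        return self.post(received_msgs.mean(dim=0, keepdim=True), h)  # hhat_ti ~ q_phi(.|zhat_ti)

    def act(self, post):
        dist = torch.distributions.Categorical(logits=self.pi(post))
        a = dist.sample()
        return a, dist.log_prob(a)                         # a_ti ~ pi_theta(.|hhat_ti)
```

All n agents share one such module's weights within an episode, as described above.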
3.3 TRUTHFULNESS

In mechanism design, a truthful game (Vickrey, 1961) is a game in which all agents report honestly at the Bayesian Nash equilibrium. In Comm-POSGs, the truthfulness of the game is achieved if every sent signal equals the pre-state, z_tij = h_ti, i.e., all agents share complete information. In that case, every agent has the same information ẑ_ti = h_t := (h_t1, . . . , h_tn)ᵀ for all i and the same post-state model probability distribution, and hence the mean of the cross-entropy between the distributions below is minimized:

$$D_\varphi(Z_t) := \frac{1}{n}\sum_{i=1}^{n} H\!\left[q_\varphi(\hat h_{ti}\mid \hat z_{ti})\right] + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} D_{\mathrm{KL}}\!\left(q_\varphi(\hat h_{ti}\mid \hat z_{ti}) \,\big\|\, q_\varphi(\hat h_{tj}\mid \hat z_{tj})\right). \tag{2}$$

The first term represents the entropy of the knowledge each agent has about the environment, and the second the information asymmetry between the agents. D_φ is bounded below by the amount of true information the environment has, H[p(s_t)]. Since achieving truthfulness is essentially the same problem as minimizing D_φ, it also maximizes J simultaneously.

Proposition 3.1 (global optimality). For any game G in Comm-POSG, if D_φ(Z_t) = H[p(s_t)] for 0 ≤ t ≤ T and R_i is symmetric under any permutation of i ∈ ⟦n⟧, then J*(G) = Ĵ(G) holds.

Proof. Let w := ⟨θ, φ⟩ on a parameter space W(G), and let W*(G) be the BNE. Since J is obviously maximized if σ_φ is truthful, we prove that σ_φ must be truthful under the given condition. To this end, we show the following Pareto optimality.

Lemma 3.1. For any G in Comm-POSG and given w, if J(w) ≥ J(w′) holds for any w′ ∈ W(G), then either U(w) ≥ U(w′) or D_KL(p ‖ q_φ) ≤ D_KL(p ‖ q_φ′) holds, where

$$U(w) := \mathbb{E}_{q_\varphi}\!\left[V^{\pi_\theta}\right] = \int_{s_t \in S,\, \hat z_t \in Z^n} V^{\pi_\theta}(s_t)\, \mathrm{d}q_\varphi(s_t \mid h_{ti}, z_{t,-i})\, \mathrm{d}\sigma_\varphi(\hat z_t). \tag{3}$$

Proof. The first inequality, U(w) ≥ U(w′), indicates that w is on the BNE W*(G) given φ, and the second that the belief state q_φ is as close to the true state p as possible. In a fully observable environment, the value-iteration theorem guarantees the existence of the solution π*(s_t) w.r.t. V^{π*}(s_t) for any s_t ∈ S. We call π* and V* := V^{π*} the unbiased policy and value, respectively. Since the unbiased policy solves the objective as J(w) = n E_{p(s_t)}[V^{π_θ}(s_t)] from Eq. (1), the goal is intrinsically to find the policy π† as close to π* as possible, for ⟨π†, φ⟩ ∈ W(G), that maximizes U(w). π† can further be represented as a mixed policy made of the unbiased policy π* and a biased policy π̄ ≠ π* as follows:

$$\pi^\dagger(a_{ti}\mid x_{ti}, \varphi) = \mathbb{E}_{q_\varphi(s_{ti}\mid x_{ti})}\!\left[\pi^*(a_{ti}\mid s_{ti})\right] = q_\varphi(s_{t0}\mid x_{ti})\,\pi^*(a_{ti}\mid s_{t0}) + \bigl(1 - q_\varphi(s_{t0}\mid x_{ti})\bigr)\,\bar\pi(a_{ti}\mid \varphi, x_{ti}), \tag{4}$$

for the observations x_ti ∈ X, where s_t0 ∈ S is the true state. Hence

$$\begin{aligned} V^{\pi^\dagger}(s_{t0}\mid\varphi) &= \mathbb{E}_{P(x_{ti}\mid s_{t0})}\!\left[q_\varphi(s_{t0}\mid x_{ti}) V^*(s_{t0}) + \bigl(1 - q_\varphi(s_{t0}\mid x_{ti})\bigr)\bar V(x_{ti}\mid\varphi)\right] \\ &= q_\varphi(s_{t0}) V^*(s_{t0}) + \mathbb{E}_{P(x_{ti}\mid s_{t0})}\!\left[\bigl(1 - q_\varphi(s_{t0}\mid x_{ti})\bigr)\bar V(x_{ti}\mid\varphi)\right] \\ &= q_\varphi(s_{t0}) V^*(s_{t0}) + \bigl(1 - q_\varphi(s_{t0})\bigr)\bar V(s_{t0}\mid\varphi), \end{aligned} \tag{5}$$

$$\bar V(x_{ti}\mid\varphi) := \int_{s_t \in S^{n+1},\, a_t \in A^n} R_i(s_{t0}, a_t) \prod_{i=1}^{n} \mathrm{d}\pi^\dagger(a_{ti}\mid s_{ti})\, q_\varphi(s_{ti}\mid x_{ti}), \tag{6}$$

$$\bar V(s_{t0}\mid\varphi) := \mathbb{E}_{P(x_{ti}\mid s_{t0})}\!\left[\bar V(x_{ti}\mid\varphi)\,\frac{1 - q_\varphi(s_{t0}\mid x_{ti})}{1 - q_\varphi(s_{t0})}\right]. \tag{7}$$

Thus, the error from the unbiased value function can be written as V*(s_t0) − V^{π†}(s_t0|φ) = (1 − q_φ(s_t0))(V*(s_t0) − V̄(s_t0|φ)), which is minimized if q_φ(s_t0) = 1, since V*(s_t0) > V̄(s_t0|φ) by definition. From Jensen's inequality,

$$\log \mathbb{E}_{p(s_{t0})}\!\left[q_\varphi(s_{t0})\right] \ge \mathbb{E}_{p(s_{t0})}\!\left[\log q_\varphi(s_{t0})\right] = -D_{\mathrm{KL}}(p \,\|\, q_\varphi) - H[p]. \tag{8}$$

The right-hand side of the inequality corresponds to the negative cross-entropy to be maximized. Therefore, as the second term H[p] does not depend on φ, the optimization is achieved by minimizing D_KL(p ‖ q_φ). □

Suppose that Ĵ(G) = J(w) > J*(G) for a non-truthful reporting policy such that σ_φ(h|h) < 1. From Lemma 3.1, q_φ(s_t|h_ti) for an internal state h_ti = f(x_ti) of Agent i with an encoder f minimizes D_KL(p ‖ q_φ(s_t|h_ti)). As q_φ(s_t|z_ti) ≠ q_φ(s_t|h_ti) and D_KL(p ‖ q_φ(s_t|z_ti)) > D_KL(p ‖ q_φ(s_t|h_ti)) contradict the Pareto optimality, σ_φ must be truthful. □
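As an illustration of Eq. (2), the following numpy sketch computes D_φ(Z_t) for categorical post-state beliefs; note that the 1/n and 1/n² normalizations are our reading of the equation, and the sanity check at the end only verifies that identical beliefs reduce D_φ to the shared entropy.

```python
import numpy as np

def information_asymmetry(beliefs):
    """Sketch of D_phi(Z_t) from Eq. (2) for categorical post-state beliefs.

    beliefs: (n, k) array; row i is agent i's posterior q_phi(.|zhat_ti) over k states.
    The 1/n and 1/n^2 normalizations are an assumption read off the (garbled) equation.
    """
    eps = 1e-12
    p = np.clip(beliefs, eps, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1).mean()     # (1/n) sum_i H[q_i]
    # pairwise KL(q_i || q_j), averaged over all ordered pairs -> (1/n^2) sum_ij KL
    kl = (p[:, None, :] * (np.log(p[:, None, :]) - np.log(p[None, :, :]))).sum(-1)
    return entropy + kl.mean()

# With identical beliefs the KL term vanishes and D_phi reduces to the shared entropy.
shared = np.tile([0.7, 0.2, 0.1], (4, 1))
assert np.isclose(information_asymmetry(shared),
                  -(shared[0] * np.log(shared[0])).sum(), atol=1e-6)
```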
4 PROPOSED FRAMEWORK

An obvious way to achieve truthful learning is to add D_φ as a penalty term to the objective, but there are two obstacles to this approach. One is that the new regularization term adds a bias to the social welfare J, and the other is that D_φ contains the agents' internal states, the post-states ĥ_ti, so its exact value cannot be measured by the designer of the learner. If the post-states were reported correctly, then the pre-states would also be reported honestly, and truthfulness would be achieved; therefore, it must be assumed that the post-states cannot be observed during optimization.

Our framework, truthful self-play (TSP), consists of two elements: one is the introduction of imaginary rewards, a general framework for unbiased regularization in Comm-POSG, and the other is the introduction of the peer prediction method (Miller et al., 2005), a truthful mechanism that encourages honest reporting based solely on observable variables. In the following, each of them is described separately, and we show that the proposed framework converges to the global optimum in Comm-POSG. We show the whole procedure in Algorithm 1.

4.1 IMAGINARY REWARD

Imaginary rewards are virtual rewards passed from an agent; they have a different basis ı than the rewards passed from the environment, with the characteristic that they sum to zero. Since most RL environments, including Comm-POSG, contain no entities other than the agents and the environment, a two-dimensional structure is sufficient to describe rewards comprehensively if we wish to distinguish their sender. To keep the social welfare of the system real, the system must be designed so that the sum of the imaginary rewards, i.e., the imaginary part of the social welfare, is zero. In other words, the imaginary part is not observed macroscopically and affects only the relative expected rewards of the agents. The real and imaginary parts of the complex rewards are weighted by the mass parameter β during training, which allows the weights of the network to maintain a real structure.

The whole imaginary reward is denoted as ıY = (ıy_ij)_{i,j∈⟦n⟧}, where ıy_ij is the imaginary reward passed from Agent i to Agent j, and the complex reward for the whole game is R⁺ := R + ıY, where R is a diagonal matrix with the environmental reward r_i as its (i, i)-th entry. We write G[ıY] for the game into which this structure is introduced. In this case, the following proposition holds.

Proposition 4.1. For any G in Comm-POSG, if G[ıY] is truthful and R⁺ is a Hermitian matrix, then J*(G[ıY]) = Ĵ(G) holds.

Proof. Since G[ıY] is truthful, J*(G[ıY]) = Ĵ(G[ıY]) holds from Proposition 3.1. Further, since R⁺ is Hermitian, ıy_ij = −ıy_ji, and hence Im Ĵ(G[ıY]) = 0 holds; therefore Ĵ(G[ıY]) = Ĵ(G). □

This indicates that the BNE can be improved by introducing imaginary rewards: J*(G[ıY]) ≥ J*(G). Also, since Σ_{i=1}^n Σ_{j=1}^n ıy_ij = 0 by the condition that R⁺ is Hermitian, the imaginary rewards affect not the social welfare of the system, which is a macroscopic objective, but only the expected rewards of each agent. The baseline in the policy gradient (Williams, 1992) is an example of a function that does not affect the objective when its mean is zero. However, the baseline function is a quantity determined from the value function of a single agent, whereas the imaginary reward differs in that (1) it affects the value function of each agent and (2) it is a meaningful quantity only when n ≥ 2 and is not observed when n = 1.
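The zero-sum property used in the proof can be checked directly: for any antisymmetric Y (i.e., y_ij = −y_ji), the complex reward R⁺ = R + ıY is Hermitian, its imaginary part cancels in aggregate, and only the per-agent imaginary incomes remain. A minimal numpy sketch with arbitrary made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Environmental rewards on the diagonal (the real part R).
R = np.diag(rng.normal(size=n))

# Any antisymmetric Y (y_ij = -y_ji) makes R+ = R + iY Hermitian.
A = rng.normal(size=(n, n))
Y = A - A.T
R_plus = R + 1j * Y

assert np.allclose(R_plus, R_plus.conj().T)     # Hermitian
assert np.isclose(Y.sum(), 0.0)                 # imaginary rewards cancel in total ...
assert np.isclose(R_plus.sum().imag, 0.0)       # ... so the social welfare stays real
print(R_plus.sum(axis=0).imag)                  # per-agent imaginary income, generally non-zero
```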
4.2 PEER PREDICTION MECHANISM

The peer prediction mechanism (Miller et al., 2005) is derived from mechanism design using proper scoring rules (Gneiting & Raftery, 2007), which aim to encourage verifiers to report their beliefs honestly by assigning a score, a reward measure, to their responses when predicting probabilistic events. These mechanisms assume at least two agents, a reporter and a verifier. A general scoring rule can be expressed as F(p_s ‖ s), where p_s is the probability of occurrence reported by the verifier for the event s, and F(p_s ‖ s) is the score obtained if the reported event s actually occurs. The scoring rule is proper if an honest declaration consistent with the beliefs of the verifier, i.e., the reported p_s, maximizes the expected score, and it is strictly proper if it is the only report that maximizes the expected score. A representative strictly proper example is the logarithmic scoring rule F(p_s ‖ s) = log p_s, for which the expected value for a 1-bit signal is the cross-entropy p*_s log p_s + (1 − p*_s) log(1 − p_s) for a belief p*_s. One can verify that p_s = p*_s is the only report that maximizes the expected score.

Since proper scoring rules assume that the event s is observable, they are not applicable to problems such as partially observable environments, where the true value is hidden. Miller et al. (2005), who first presented the peer prediction mechanism, assigned scores to the posteriors of the verifiers updated by a signal, rather than to the event itself. This concept is formulated by a model that assumes that an event s stochastically emits a signal z, and the type of s is inferred from the signals of the reporters who receive it. The peer prediction mechanism (Miller et al., 2005) is denoted as F(p(s|z) ‖ s) under the assumptions that (1) the type of the event s and the signal z emitted by each type follow a given prior, (2) the priors are shared knowledge among the verifiers, and (3) the posterior is updated according to the reports.

We apply the mechanism to RL, i.e., to the problem of predicting the agent's optimal behavior a_ti ∼ π_θ(·|s_t) for the true state s_t ∈ S. In self-play, conditions (1) and (2) are satisfied because the prior π_θ is shared among the agents, and the post-state in Comm-POSG corresponds to (3), so the peer prediction mechanism can be applied to the problem of predicting agent behavior. Summarizing the above discussion, we can allocate a score matrix L_t as follows:

$$L_t := \bigl(\ell(a_{ti}\mid z_{tji})\bigr)_{i,j\in[\![n]\!]}; \qquad \ell(a_{ti}\mid z_{tji}) := F\bigl(\pi_\theta(a_{ti}\mid z_{tji}) \,\|\, a_{ti}\bigr) = \log \pi_\theta(a_{ti}\mid z_{tji}), \tag{9}$$

which is an n-th order square matrix representing the score from Agent i to Agent j.

4.3 THE TRUTHFUL SELF-PLAY

In TSP, a truthful system is constructed by introducing a proper scoring rule into the imaginary rewards. However, since the score matrix obtained from the proper scoring rule does not satisfy Hermitianity, we perform zero-averaging by subtracting the mean of the scores from each element of the matrix, thereby making the sum zero. This can be expressed as follows using the graph Laplacian Δ := E − 11ᵀ/n:

$$\Delta L_\psi = \frac{1}{n}\begin{pmatrix} n-1 & -1 & \cdots & -1\\ -1 & n-1 & \cdots & -1\\ \vdots & \vdots & \ddots & \vdots\\ -1 & -1 & \cdots & n-1 \end{pmatrix}\begin{pmatrix} 0 & \ell_\psi(a_{t2}\mid z_{t1}) & \cdots & \ell_\psi(a_{tn}\mid z_{t1})\\ \ell_\psi(a_{t1}\mid z_{t2}) & 0 & \cdots & \ell_\psi(a_{tn}\mid z_{t2})\\ \vdots & \vdots & \ddots & \vdots\\ \ell_\psi(a_{t1}\mid z_{tn}) & \ell_\psi(a_{t2}\mid z_{tn}) & \cdots & 0 \end{pmatrix} \tag{10}$$

to get

$$R^{+} = R + \imath\,\Delta L_\psi, \tag{11}$$

which is the formula that connects reinforcement learning and mechanism design. We show the truthful self-play (TSP) in Algorithm 1. The only modification required from self-play is the imaginary reward.
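The following numpy sketch illustrates Eqs. (9)–(11) and steps 2–3 of Algorithm 1 below: it builds the log-score matrix, zero-averages it with the graph Laplacian, and assembles the complex reward. The indexing follows Eq. (12) (entry (i, j) scores agent j's action from agent i's message); everything else, including the random inputs, is illustrative.

```python
import numpy as np

def complex_reward(log_pi, real_rewards):
    """Sketch of Eqs. (9)-(11): peer-prediction scores, zero-averaging, complex reward.

    log_pi[i, j]  = log pi_theta(a_tj | z_ti): how well agent j's action is predicted
                    from the signal that agent i sent (diagonal is zeroed, as in Eq. (12)).
    real_rewards: length-n vector of environmental rewards r_ti.
    """
    n = log_pi.shape[0]
    L = log_pi.copy()
    np.fill_diagonal(L, 0.0)                     # no self-scoring
    Delta = np.eye(n) - np.ones((n, n)) / n      # graph Laplacian (zero-averaging projector)
    Y = Delta @ L                                # each column of Y now sums to zero
    return np.diag(real_rewards) + 1j * Y        # Eq. (11): R+ = R + i*Delta*L

rng = np.random.default_rng(1)
R_plus = complex_reward(np.log(rng.uniform(0.1, 1.0, (3, 3))), np.array([1.0, -0.5, 0.2]))
assert np.allclose(R_plus.imag.sum(axis=0), 0.0)  # imaginary rewards cancel per column
```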
Algorithm 1: The truthful self-play (TSP).

Require: Comm-POSG G = ⟨n, T, S, A, X, T, P, R, Z⟩, a recurrent neural network ⟨σ_φ, q_φ, π_θ⟩ with initial weights w_0 = ⟨θ_0, φ_0⟩ and initial state h_0, a learning rate α > 0, and a mass parameter β ≥ 0.

Initialize w ← w_0.
for each episode do
    Genesis: s_1 ∼ p(s), ĥ_0i ← h_0 for all i ∈ ⟦n⟧.
    for t = 1 to T do
        1. Self-play:
           Observe x_t ∼ Pⁿ(·|s_t).
           Update the pre-states h_ti ← f_φ(ĥ_{t−1,i}, x_ti) for all i ∈ ⟦n⟧.
           Generate the messages z_ti ∼ σ_φ(·|h_ti) for all i ∈ ⟦n⟧.
           Send the messages Z_t ← (z_t1, . . . , z_tn).
           Receive the messages ẑ_ti ← (Z_t)ᵀ_{:i} for all i ∈ ⟦n⟧.
           Update the post-states ĥ_ti ∼ q_φ(·|ẑ_ti) for all i ∈ ⟦n⟧.
           Act a_ti ∼ π_θ(·|ĥ_ti) for all i ∈ ⟦n⟧.
           Get the real reward r_t ← R(s_t, a_t).
        2. Compute the score matrix with the peer prediction mechanism (Miller et al., 2005):
           ℓ_ij ← log π_θ(a_j|z_i) if i ≠ j, and 0 if i = j, for all i, j ∈ ⟦n⟧.   (12)
        3. Combine the real and imaginary rewards to construct the complex reward:
           R⁺_t ← R_t + ı ΔL.   (13)
        4. Update the weights by the policy gradient (Williams, 1992):
           g_t ← Σ_{i=1}^{n} r⁺_ti ∇_w [ log π_θ(a_ti|ĥ_ti) + log q_φ(ĥ_ti|ẑ_ti) + log σ_φ(z_ti|ĥ_{t−1,i}, x_ti) ],   (14)
           w ← w + α Re g_t + αβ Im g_t.
        5. Proceed to the next state s_{t+1} ∼ T(·|s_t, a_t).
    end for
end for
return w

Theorem 4.1 (global optimality). For any G in Comm-POSG, TSP converges to the global optimum Ĵ(G) if the convergence condition

$$\sup_\varphi \left\| \frac{\mathrm{d}V^{\pi_\theta}}{\mathrm{d}(\beta I_\psi)} \right\| < 1 \tag{15}$$

is met, where β < ∞ is a bounded mass parameter.

Proof (summary; refer to Appendix A for the full proofs). From Proposition A.2, [βΔL_ψ] is unbiased truthful. Therefore, from Proposition A.1, convergence to the global optimum is achieved. □
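As a sanity check on step 4, the complex update w ← w + α Re g_t + αβ Im g_t is equivalent to running REINFORCE with the per-agent scalar weight r_ti + β y_ti, where y_ti denotes agent i's imaginary reward. Below is a minimal PyTorch sketch under that reading; baselines, batching, and episode handling are omitted, and all names are illustrative.

```python
import torch

def tsp_update(log_probs, real_rewards, imag_rewards, params, alpha=1e-3, beta=10.0):
    """Sketch of step 4 of Algorithm 1 for a single time step.

    log_probs:    list of n scalar tensors, log pi_theta(a_ti|.) + log q_phi(.) + log sigma_phi(.)
    real_rewards: length-n tensor, Re r+_ti (environmental rewards)
    imag_rewards: length-n tensor, Im r+_ti (the zero-averaged peer-prediction scores)
    With w <- w + alpha*Re(g) + alpha*beta*Im(g), the effective per-agent weight is r_ti + beta*y_ti.
    """
    weight = real_rewards + beta * imag_rewards             # folds alpha*Re g + alpha*beta*Im g
    loss = -(weight.detach() * torch.stack(log_probs)).sum()
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= alpha * g                                   # gradient ascent on the surrogate objective
    return loss.item()
```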
5 NUMERICAL EXPERIMENT

In this section, we establish the convergence of TSP through the results of numerical experiments with deep neural nets. We consider three environments for our analysis and experiments (Fig. 1): (a) a predator prey environment (PP), in which predators with limited vision look for a prey on a square grid; (b) a traffic junction environment (TJ), similar to Sukhbaatar et al. (2016), in which agents with limited vision learn to signal in order to avoid collisions; and (c) StarCraft: Brood War (SC) explore and combat tasks, which test control of multiple agents in various scenarios where an agent needs to understand and decouple observations of multiple opposing units.

Figure 1: Experimental environments. a: Predator Prey (PP-n) (Barrett et al., 2011). Each agent receives r_ti = −0.05 at each time step; after reaching the prey, it receives r_ti = 1/m, where m is the number of predators that have reached the prey. b: Traffic Junction (TJ-n) (Sukhbaatar et al., 2016; Singh et al., 2019). n cars with limited sight inform the other cars of their positions to avoid collisions. The cars continue to receive a reward of r_ti = −0.05 at each time step as an incentive to run faster; if cars collide, each car involved in the collision receives r_ti = −1. c: Combat task in StarCraft: Brood War (SC) (Synnaeve et al., 2016).

We compare the performance of TSP with self-play (SP) and SP with curiosity (Houthooft et al., 2016) on three tasks belonging to Comm-POSG, comprising up to 20 agents. The hyperparameters are listed in the appendix. As CRAs, three models, namely LSTM, CommNet (Sukhbaatar et al., 2016), and IC3Net (Singh et al., 2019), were compared. The empirical mean of the social welfare function J was used as the measure for the comparison. IC3Net is an improvement over CommNet, which is a continuous communication method based on LSTM. Actor-critic and value functions were added to the baselines in all the frameworks. We performed 2,000 epochs of 500 steps each, using 120 CPUs; the experiment was conducted over a period of three days.

Table 1: Social welfare J in five Comm-POSG tasks. PP-n: predator prey; TJ-n: traffic junction. All scores are negative. The experiment was repeated three times; the mean and standard deviation are listed. The top three rows show models optimized by SP, and the bottom row shows IC3Net optimized by TSP, which achieves the best score in every column.

|             | PP-3        | PP-5        | TJ-5         | TJ-10        | TJ-20          |
|-------------|-------------|-------------|--------------|--------------|----------------|
| SP: LSTM    | 1.92 ± 0.35 | 4.97 ± 1.33 | 22.32 ± 1.04 | 47.91 ± 41.2 | 819.97 ± 438.7 |
| SP: CommNet | 1.54 ± 0.33 | 4.30 ± 1.14 | 6.86 ± 6.43  | 26.63 ± 4.56 | 463.91 ± 460.8 |
| SP: IC3Net  | 1.03 ± 0.06 | 2.44 ± 0.18 | 4.35 ± 0.72  | 17.54 ± 6.44 | 216.31 ± 131.7 |
| TSP: IC3Net | 0.69 ± 0.14 | 2.34 ± 0.21 | 3.93 ± 1.46  | 12.83 ± 2.50 | 132.60 ± 17.91 |

PP and TJ: Table 1 lists the experimental results for each task. We can see that IC3Net with TSP outperforms IC3Net with SP on all tasks. Fig. 2 (a) shows that TSP elicits truthful information, (b) confirms that the social welfare of TSP exceeds that of the SP variants, and (c) confirms that the average of the imaginary part is zero. From these experimental results, we conclude that TSP successfully realizes truthful learning and achieves the state of the art in tasks comprising 3 to 20 agents.

Figure 2: Learning curves in three-agent predator prey (PP-3). a: truthfulness, the fraction of steps in which an agent shares its true pre-state (z_ij = h_i); b: the real part of the reward; c: the imaginary part.

Table 2: StarCraft results: TSP beats the baselines. The experiment was repeated three times. All scores are negative; the mean and standard deviation are listed, and the best score in each row is achieved by IC3Net with TSP.

| StarCraft task        | CommNet       | IC3Net        | IC3Net w/ TSP |
|-----------------------|---------------|---------------|---------------|
| Explore-10Medic 50×50 | 1.592 ± 0.169 | 2.276 ± 1.427 | 1.364 ± 0.107 |
| Combat-10Mv3Ze 50×50  | 3.202 ± 1.382 | 3.815 ± 0.392 | 1.717 ± 1.548 |

StarCraft: Table 2 shows a comparison of social welfare in the exploration and combat tasks in StarCraft. (i) In the exploration task, 10 Medics find one enemy medic on a 50×50-cell grid; similar to PP, it is a competitive task in which the reward is divided by the number of medics that found the enemy. (ii) In the combat task, 10 Marines fight 3 Zealots on a 50×50-cell grid. The maximum episode length is set to 60 steps. We find that IC3Net, with its information-hiding gate, performs worse than CommNet but performs better when trained with TSP, owing to the truthful mechanism.

6 CONCLUDING REMARKS

Our objective was to construct a general framework for emergent unbiased state representation without any supervision. First, we proposed TSP and theoretically clarified its convergence to the global optimum in the general case. Second, we performed experiments involving up to 20 agents and achieved state-of-the-art performance on all the tasks. We summarize the advantages of our framework below.

1. Strong convergence: TSP guarantees convergence to the global optimum theoretically and experimentally; self-play cannot provide such a guarantee. Besides, the imaginary reward ıΔL satisfies the baseline condition.
2. Simple solution: The only modification required for TSP is that ıΔL be added to the baseline, so it is easy to implement in deep learning libraries such as TensorFlow and PyTorch.
3. Broad coverage: TSP is a general framework, the same as self-play. Since TSP is independent of both agents and environments and supports both discrete and continuous control, it can be applied to a wide range of domains.
No supervision is required.

To the best of our knowledge, introducing mechanism design to MARL is a new direction for the deep-learning community. In future work, we will consider fairness (Sen, 1984) as the social choice function. We expect that many other frameworks will be developed using the methodology employed in this study.

REFERENCES

Kenneth J Arrow. Social choice and individual values. Yale University Press, 1963.

David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech M Czarnecki, Julien Perolat, Max Jaderberg, and Thore Graepel. Open-ended learning in symmetric zero-sum games. In ICML, 2019.

Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. In ICLR, 2018.

Samuel Barrett, Peter Stone, and Sarit Kraus. Empirical evaluation of ad hoc teamwork in the pursuit domain. In AAMAS, 2011.

Yoshua Bengio. Evolving culture vs local minima. arXiv:1203.2990, 2012.

Yoshua Bengio. The consciousness prior. arXiv:1709.08568, 2017.

Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.

Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.

Maxime Chevalier, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: First steps towards grounded language learning with a human in the loop. In ICLR, 2019.

SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.

Jakob Foerster, Yannis Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In NIPS, 2016.

Drew Fudenberg. Game Theory. MIT Press, 1993.

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

Claudia V Goldman and Shlomo Zilberstein. Optimizing information exchange in cooperative multi-agent systems. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 137–144, 2003.

Irving John Good. Rational decisions. In Breakthroughs in Statistics, pp. 365–377. Springer, 1952.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

Abhinav Gupta, Ryan Lowe, Jakob Foerster, Douwe Kiela, and Joelle Pineau. Seeded self-play for language learning. In Proceedings of the Beyond Vision and Language: Integrating Real-World Knowledge, 2019.

David Ha and Jürgen Schmidhuber. World models. arXiv:1803.10122, 2018.

Eric A Hansen, Daniel S Bernstein, and Shlomo Zilberstein. Dynamic programming for partially observable stochastic games. In AAAI, volume 4, pp. 709–715, 2004.

Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In NIPS, 2016.
Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv:1807.01281, 2018.

Ryan Lowe, Abhinav Gupta, Jakob Foerster, Douwe Kiela, and Joelle Pineau. On the interaction between supervision and self-play in emergent communication. In ICLR, 2020.

Nolan Miller, Paul Resnick, and Richard Zeckhauser. Eliciting informative feedback: The peer-prediction method. Management Science, 51(9):1359–1373, 2005.

Roger B Myerson. Mechanism design by an informed principal. Econometrica: Journal of the Econometric Society, pp. 1767–1797, 1983.

David V Pynadath and Milind Tambe. The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 16:389–423, 2002.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

Arthur L Samuel. Some studies in machine learning using the game of checkers. II—recent progress. IBM Journal of Research and Development, 11(6):601–617, 1967.

Amartya Sen. Collective Choice and Social Welfare. Harvard University Press, 1984.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.

Amanpreet Singh, Tushar Jain, and Sainbayar Sukhbaatar. Learning when to communicate at scale in multiagent cooperative and competitive tasks. In ICLR, 2019.

Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In NIPS, 2016.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 1998.

Gabriel Synnaeve, Nantas Nardelli, Alex Auvolat, Soumith Chintala, Timothée Lacroix, Zeming Lin, Florian Richoux, and Nicolas Usunier. TorchCraft: A library for machine learning research on real-time strategy games. arXiv preprint arXiv:1611.00625, 2016.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.

Yichuan Tang. Towards learning multi-agent negotiations via self-play. In ICCV, 2019.

Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.

William Vickrey. Counterspeculation, auctions, and competitive sealed tenders. The Journal of Finance, 16(1):8–37, 1961.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

". . . a human brain can learn such high-level abstractions if guided by the messages produced by other humans, which act as hints or indirect supervision for these high-level abstractions; and, language and the recombination and optimization of mental concepts provide an efficient evolutionary recombination operator, and this gives rise to rapid search in the space of communicable ideas that help humans build up better high-level internal representations of their world."
(Bengio, 2012)

Proposition A.1. If [C] is an unbiased truthful mechanism of G, self-play with G[C] converges to Ĵ(G).

Proof. Since [C] is unbiased, E_{π_θ}[C_i] = 0 holds. Hence, for an arbitrary baseline b, b + C_i also satisfies the baseline condition. Therefore, from the policy gradient theorem (Sutton & Barto, 1998), self-play converges to J*(G[C]). Further, since [C] is an unbiased truthful mechanism, J*(G[C]) = Ĵ(G[C]) = Ĵ(G) holds from Proposition 3.1. □

A general loss function ℓ_ψ : A × Z → R for any strictly concave nonnegative function ψ : P(A) → R is defined as follows:

$$\ell_\psi := D_\psi\bigl(\pi_\theta(a_j \mid z_i) \,\|\, \delta(a_j \mid \cdot)\bigr), \tag{16}$$

where δ(a_j|·) is a point-wise probability that satisfies lim_{ε→0} ∫_{B(ε; a_j)} dδ(a_j|a_j) = 1 for an open ball B(ε; a_j), and D_ψ is the Bregman divergence (Bregman, 1967) defined by

$$D_\psi(p \,\|\, q) := \psi(q) - \psi(p) + \int \nabla\psi(q)\, \mathrm{d}(p - q). \tag{17}$$

Sending a truthful signal is the best response that minimizes the expectation of the general loss function. For example, the KL divergence is a special case of the Bregman divergence when ψ = H[·], and the following holds:

$$\mathbb{E}_{\pi_\theta}[\ell_\psi] = \int D_\psi\bigl(\pi_\theta(a_j \mid z_i) \,\|\, \delta(a_j \mid \cdot)\bigr)\, \mathrm{d}\pi_\theta(a_j \mid h_i) = D_{\mathrm{KL}}\bigl(\pi_\theta(a_i \mid z_i) \,\|\, \pi_\theta(a_i \mid h_i)\bigr) \ge 0. \tag{18}$$

The equality holds if and only if z_i = h_i. Notice that π_θ(a_i|h_i) = π_θ(a_j|h_i). Now, we generalize the zero-one mechanism (Appendix A.2) to arbitrary signaling games.

Proposition A.2 (Bregman mechanism). For any signaling game, if sup_φ ‖dV^{π_θ}/d(βI_ψ)‖ < 1, then [ıI_ψ] is an unbiased truthful mechanism of G ∈ G for the general cost function

$$I_\psi(a \mid z) := \Delta L_\psi(a \mid z)\,\mathbf{1} = \begin{pmatrix} n-1 & -1 & \cdots & -1\\ -1 & n-1 & \cdots & -1\\ \vdots & \vdots & \ddots & \vdots\\ -1 & -1 & \cdots & n-1 \end{pmatrix}\begin{pmatrix} 0 & \ell_\psi(a_2\mid z_1) & \cdots & \ell_\psi(a_n\mid z_1)\\ \ell_\psi(a_1\mid z_2) & 0 & \cdots & \ell_\psi(a_n\mid z_2)\\ \vdots & \vdots & \ddots & \vdots\\ \ell_\psi(a_1\mid z_n) & \ell_\psi(a_2\mid z_n) & \cdots & 0 \end{pmatrix}\mathbf{1}, \tag{19}$$

where Δ := nE − 11ᵀ is a graph Laplacian.

The problem we deal with is designing a scoring rule for information that satisfies two properties: (1) regularity, the score should be finite if the information is sufficiently correct, and (2) properness, the score should be maximized if and only if the information is true. A well-known example of a scoring rule is mutual information (MI), which compares a pair of probability distributions p and q in the logarithmic domain. However, MI cannot be applied to continuous actions. Instead, we introduce a more general tool, the regular proper scoring rule, defined below.

Definition A.1 (Gneiting & Raftery, 2007). For a set Ω, F(· ‖ ·) : P(Ω) × Ω → R is a regular proper scoring rule iff there exists a strictly concave, real-valued function f on P(Ω) such that

$$F(p \,\|\, x) = f(p) - \int_\Omega f'(p, \omega)\, \mathrm{d}p(\omega) + f'(p, x) \tag{20}$$

for p ∈ P(Ω) and x ∈ Ω, where f′ is a subgradient of f that satisfies

$$f(q) \le f(p) + \int_\Omega f'(p, \omega)\, \mathrm{d}(q - p)(\omega) \tag{21}$$

for q ∈ P(Ω). We also define F for q ∈ P(Ω) as F(p ‖ q) := ∫_Ω F(p ‖ x) dq(x), and denote the set of regular proper scoring rules by B. For F, F_1, F_2 ∈ B, the following properties hold (Gneiting & Raftery, 2007):

1. Strict concavity: if q ≠ p, then F(q ‖ q) > F(p ‖ q).
2. F_1(p ‖ q) + a F_2(p ‖ q) + f(q) ∈ B, where a > 0 and f do not depend on p.
3. D_ψ ∈ B, where D_ψ is the Bregman divergence.
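The strict-properness property is easy to check numerically for the logarithmic scoring rule used in the main text: under the true belief, no dishonest report achieves a higher expected score. A small numpy sketch with made-up numbers:

```python
import numpy as np

# The logarithmic scoring rule F(p || x) = log p(x) is strictly proper:
# the expected score under the true belief p_star is maximized only by reporting p = p_star.
rng = np.random.default_rng(0)
p_star = np.array([0.6, 0.3, 0.1])          # verifier's true belief over 3 outcomes

def expected_log_score(report, truth):
    return float((truth * np.log(report)).sum())

truthful = expected_log_score(p_star, p_star)
for _ in range(1000):
    q = rng.dirichlet(np.ones(3))           # an arbitrary dishonest report
    assert expected_log_score(q, p_star) <= truthful + 1e-12
```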
Lemma A.1. For any F ∈ B and the function ℓ_F defined below, if sup_φ ‖dV^{π_θ}/d(βℓ_F)‖ < 1, then [ıℓ_F] is a truthful mechanism of G:

$$\ell_F(a_j \mid z_i) := \begin{cases} F\bigl(\pi_\theta(a_j \mid z_i) \,\|\, a_j\bigr) & (i \ne j) \\ 0 & (i = j) \end{cases}. \tag{22}$$

Proof. We prove that the surrogate objective of G[ıℓ_F], V^{π_θ}_β := V^{π_θ} + βℓ_F, is strictly concave, and that if ∇_φ V^{π_θ}_β = 0, then z_i = h_i with probability 1. We denote by φ̂ the truthful parameter for which σ_φ̂(h_i|h_i) = 1. The policy gradient for φ is

$$\int \nabla_\varphi V^{\pi_\theta}_\beta\, \mathrm{d}q_\varphi = \int \nabla_\varphi V^{\pi_\theta}\, \mathrm{d}q_\varphi + \beta \nabla_\varphi \int F\bigl(\pi_\theta(a_j \mid z_i) \,\|\, a_j\bigr)\, \mathrm{d}\pi_\theta\, \mathrm{d}q_\varphi = \int \nabla_\varphi V^{\pi_\theta}\, \mathrm{d}q_\varphi + \beta \nabla_\varphi \int F\bigl(\pi_\theta(a_j \mid z_i) \,\|\, \pi_\theta(a_j \mid h_i)\bigr)\, \mathrm{d}q_\varphi. \tag{23}$$

First, we consider the local optima, i.e., ∫∇_φ V^{π_θ} dq_φ = 0 and φ ≠ φ̂. For the Gâteaux differential with respect to ∂φ := (φ̂ − φ)ᵀ/‖φ̂ − φ‖, ∂_φ V^{π_θ}_β = β ∂_φ ℓ_ψ > 0 holds from the strict concavity. At the global optimum, i.e., ∫∇_φ V^{π_θ} dq_φ = 0 and φ = φ̂, ∂_φ V^{π_θ}_β = β ∂ℓ_ψ = 0 holds. Next, if ∂_φ V^{π_θ} < 0, as sup_φ ‖dV^{π_θ}/dℓ_F‖ < β, the following holds for φ ≠ φ̂:

$$\int \partial_\varphi V^{\pi_\theta}_\beta\, \mathrm{d}q_\varphi = \int \partial_\varphi\bigl(V^{\pi_\theta} + \beta\ell_F\bigr)\, \mathrm{d}q_\varphi > \int \partial_\varphi V^{\pi_\theta}\, \mathrm{d}q_\varphi + \sup_\varphi \int \partial_\varphi V^{\pi_\theta}\, \mathrm{d}q_\varphi\, \inf_\varphi \int \bigl(-\partial_\varphi V^{\pi_\theta}\bigr)\, \mathrm{d}q_\varphi \ge 0. \tag{24}$$

Hence, ∂_φ V^{π_θ}_β ≥ 0 holds, and the equality holds if and only if φ = φ̂. Therefore, V^{π_θ}_β is strictly concave, and the following holds for α_k ∈ o(1/k):

$$\varphi_{k+1} = \varphi_k + \alpha_k \nabla_\varphi V^{\pi_\theta}_\beta(\varphi_k) \to \hat\varphi. \quad \text{(a.s.)} \tag{25}$$

□

I_ψ is defined for both discrete and continuous actions. Table 3 lists examples of scoring rules ℓ_ψ for arbitrary actions. In particular, minimizing ℓ_ψ for continuous actions is known as probability density estimation (Gneiting & Raftery, 2007). I_ψ is a proper scoring rule (Gneiting & Raftery, 2007), since it is a linear combination of Bregman divergences. Hence, from Lemma A.1, [ıI_ψ] is truthful. Besides, since 1ᵀI_ψ = 1ᵀΔL_ψ1 = 0, [ıI_ψ] is unbiased.

Table 3: Examples of scoring rules, where κ > 1 and ‖p‖_κ = (∫ p(a)^κ da)^{1/κ}. Readers can refer to Gneiting & Raftery (2007) for many other examples.

| Rule                         | A          | Z          | ψ(p)                 | ℓ_ψ(a|z)                            |
|------------------------------|------------|------------|----------------------|-------------------------------------|
| Zero-one                     | Discrete   | Discrete   | ‖p‖_∞               | π_θ(a|z)                            |
| Logarithmic                  | Discrete   | Continuous | H[p]                 | log π_θ(a|z)                        |
| Quadratic (Brier, 1950)      | Discrete   | D/C        | Σ_a p(a)² − 1        | 2π_θ(a|z) − ‖π_θ(·|z)‖²₂           |
|                              | Continuous | D/C        | ‖p‖²₂               | 2π_θ(a|z) − ‖π_θ(·|z)‖²₂           |
| Pseudospherical (Good, 1952) | Discrete   | Continuous | (Σ_a p(a)^κ)^{1/κ}   | π_θ(a|z)^{κ−1} / ψ(π_θ(·|z))^{κ−1}  |
|                              | Continuous | Continuous | ‖p‖_κ               | π_θ(a|z)^{κ−1} / ‖π_θ(·|z)‖_κ^{κ−1} |

Figure 3: Left: the one-bit two-way communication game G²_com. S = A = {0, 1}, X = S ∪ {∅}, and p(s) = 1/2. Agents are given correct information from the environment with probability λ > 0: P(s|s) = λ, P(∅|s) = 1 − λ. q_i(s|x_i, z_j) follows a Bernoulli distribution and estimates the state p(s) of the environment from the observation and the signal z_j ∈ Z = X. Let h_i = x_i. Right: an example of a non-cooperative solution: s = 1, x = ⟨∅, 1⟩, z = ⟨∅, 0⟩, a = ⟨0, 1⟩, r = ⟨−1, 1⟩, and r_1 + r_2 = 0 < 2c.

Theorem A.1 (global optimality). For any G in Comm-POSG, TSP converges to the global optimum Ĵ(G) if the convergence condition sup_φ ‖dV^{π_θ}/d(βI_ψ)‖ < 1 is met, where β < ∞ is a bounded mass parameter.

Proof. From Proposition A.2, [ıI_ψ] is unbiased truthful. Therefore, from Proposition A.1, convergence to the global optimum is achieved. □

A.1 SELF-PLAY CONVERGES TO LOCAL OPTIMA

Theorem A.2. If G ∈ G is non-truthful, self-play does not converge to the global optimum Ĵ(G).

Example A.1 (one-bit two-way communication game). Fig. 3 shows an example of a non-cooperative partially observable environment with a 1-bit state. The reward structure is presented in Table 4. The sum of rewards is maximized when both agents report the correct state to the environment:

$$\sum_{i=1}^{2} R_i(s, a) = \begin{cases} 2c & (a_1 = a_2 = s) \\ 0 & \text{(otherwise)} \end{cases}. \tag{26}$$

Hence, the objective varies in the range 0 ≤ J(G²_com) ≤ 2c.

Proposition A.3. If c < 1, then J*(G²_com) < Ĵ(G²_com) holds.

Table 4: G²_com's reward function R²(s, a); entries are (r_1, r_2). c > 0 is a constant that determines the nature of the game.

|                     | a_2 = s (True) | a_2 = 1 − s (False) |
|---------------------|----------------|---------------------|
| a_1 = s (True)      | (c, c)         | (1, −1)             |
| a_1 = 1 − s (False) | (−1, 1)        | (0, 0)              |

Proof. Since p(s) = 1/2, we can assume s = 1 without loss of generality. Besides, we discuss only Agent 1 because of symmetry.
From Z = {0, 1}, Agent 1's messaging policy σ_1 sends the correct information x or the false information 1 − x when it knows x. Hence, we can represent the policy with a parameter φ ∈ [0, 1] as follows:

$$\sigma_\varphi(z \mid x) = \begin{cases} \varphi^z(1-\varphi)^{1-z} & (x = 1) \\ 1/2 & (x = \varnothing) \end{cases}. \tag{27}$$

Differentiating σ_φ with respect to φ gives

$$\frac{\mathrm{d}\sigma_\varphi}{\mathrm{d}\varphi} = \begin{cases} 2z - 1 & (x = 1) \\ 0 & (x = \varnothing) \end{cases}. \tag{28}$$

Therefore, from Eq. (3), if π* ∈ W*(G²_com), then the policy gradient for φ is

$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}\varphi} U(\pi^*, \varphi)
&= \frac{\mathrm{d}}{\mathrm{d}\varphi}\int V_1\, \mathrm{d}q_1\, \mathrm{d}\sigma_1\, \mathrm{d}P\, \mathrm{d}p
= \int V_1\, \mathrm{d}q_1\, \frac{\mathrm{d}\sigma_1}{\mathrm{d}\varphi}\, \mathrm{d}P\, \mathrm{d}p
= \lambda \int V_1 (2z_1 - 1)\, \mathrm{d}z_1 \\
&= \lambda(1-\lambda) \int (2z_1 - 1)\, R_1\, \mathrm{d}\pi_1^*\, \mathrm{d}\pi_2^*\, \mathrm{d}q_2\, \mathrm{d}z_1 \Big|_{s = x_1 = 1}
= \lambda(1-\lambda) \sum_{z_1 = 0}^{1} (2z_1 - 1)\, R_1(s, \langle x_1, z_1\rangle)\Big|_{s = x_1 = 1} \\
&= \lambda(1-\lambda)\bigl[R_1(1, \langle 1, 1\rangle) - R_1(1, \langle 1, 0\rangle)\bigr]
= \lambda(1-\lambda)(c - 1) < 0. \tag{29}
\end{aligned}$$

As the policy gradient is negative by the assumption c ∈ (0, 1), φ* = 0 is the Nash equilibrium (from φ ≥ 0), which results in always sending false information to the other agent:

$$\sigma_{\varphi^*}(z \mid x) = \begin{cases} 1 - z & (x = 1) \\ 1/2 & (x = \varnothing) \end{cases}. \tag{30}$$

Let J⟨x_1, x_2⟩ := J|_{x=⟨x_1, x_2⟩}. We obtain J* and Ĵ as follows:

$$J^* = \int J^*\langle x_1, x_2\rangle\, \mathrm{d}P^2\, \mathrm{d}p = J^*\langle 1,1\rangle\,\lambda^2 + J^*\langle 1,\varnothing\rangle\, 2\lambda(1-\lambda) + J^*\langle \varnothing,\varnothing\rangle\, (1-\lambda)^2 = 2c\lambda^2 + 0 + \frac{2c}{2^2}(1-\lambda)^2 = 2c\Bigl[\lambda^2 + \tfrac{1}{4}(1-\lambda)^2\Bigr], \tag{31}$$

$$\hat J = \int \hat J\langle x_1, x_2\rangle\, \mathrm{d}P^2\, \mathrm{d}p = \hat J\langle 1,1\rangle\,\lambda^2 + \hat J\langle 1,\varnothing\rangle\, 2\lambda(1-\lambda) + \hat J\langle \varnothing,\varnothing\rangle\, (1-\lambda)^2 = 2c\lambda^2 + 2c\cdot 2\lambda(1-\lambda) + \frac{2c}{2^2}(1-\lambda)^2 = 2c\Bigl[\lambda^2 + 2\lambda(1-\lambda) + \tfrac{1}{4}(1-\lambda)^2\Bigr], \tag{32}$$

respectively. Therefore, Ĵ − J* = 4cλ(1 − λ) > 0, and J* < Ĵ holds. From Proposition A.1, G = G²_com is a counterexample in which global optimality does not occur. □

A.2 ZERO-ONE MECHANISM SOLVES G²_com

Proposition A.4 (zero-one mechanism). Let ℓ : A × Z → {0, 1} be the zero-one loss between an action and a message, ℓ(a_i|z_j) := a_j(1 − z_i) + (1 − a_i)z_i, and

$$I(a \mid z) := \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}\begin{pmatrix} 0 & \ell(a_2 \mid z_1) \\ \ell(a_1 \mid z_2) & 0 \end{pmatrix}\mathbf{1}.$$

If β > (1 − c)(1 − λ)/λ, then [ıI] is an unbiased truthful mechanism of G²_com, and self-play with G²_com[ıI] converges to the global optimum Ĵ(G²_com) = 2c[1 − ¾(1 − λ)²].

Proof. The following holds:

$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}\varphi}\int V_{\beta,1}\, \mathrm{d}q_1
&= \frac{\mathrm{d}}{\mathrm{d}\varphi}\int \bigl(V_1 - \beta I_1\, \mathrm{d}\pi_2^*\, \mathrm{d}q_2\bigr)\, \mathrm{d}q_1
= -\lambda(1-\lambda)(1-c) - \beta \int I_1\, \mathrm{d}\pi_2^*\, \mathrm{d}q_1\, \mathrm{d}q_2\, \frac{\mathrm{d}\sigma_1}{\mathrm{d}\varphi}\, \mathrm{d}\sigma_2\, \mathrm{d}P^2\, \mathrm{d}p \\
&= -\lambda(1-\lambda)(1-c) - \beta\lambda \int I_1\, \mathrm{d}\pi_2^*\, \mathrm{d}q_2\, (2z_1 - 1)\, \mathrm{d}z_1\, \mathrm{d}\sigma_2\, \mathrm{d}P\Big|_{s=x_1=1} \\
&= -\lambda(1-\lambda)(1-c) - \beta\lambda^2 \int (2z_1 - 1)\,\ell(a_2 \mid z_1)\, \mathrm{d}\pi_2^*\, \mathrm{d}q_2\, \mathrm{d}z_1
= -\lambda(1-\lambda)(1-c) - \beta\lambda^2 \sum_{z_1=0}^{1} (2z_1 - 1)\,\ell(1 \mid z_1) \\
&= -\lambda(1-\lambda)(1-c) - \beta\lambda^2\bigl(\ell(1 \mid 1) - \ell(1 \mid 0)\bigr)
= -\lambda(1-\lambda)(1-c) + \beta\lambda^2
= \lambda^2\Bigl(\beta - \frac{(1-c)(1-\lambda)}{\lambda}\Bigr).
\end{aligned}$$

Therefore, if β > (1 − c)(1 − λ)/λ, then φ* = 1 holds, and J* = Ĵ holds. The value of Ĵ is clear from the proof of Lemma A.1. □

[ıI] is also known as the peer prediction method (Miller et al., 2005), which is inspired by peer review. This process is illustrated in Fig. 4 (left), and the state-action value functions are listed in Fig. 4 (right).

Figure 4: Left: the information released by one agent is verified by another agent, and if the information is incorrect, a penalty is given. Right: state-action value function of G²_com[ıI], Q_i(s_i, z_i, ·); entries are pairs for (Agent 1, Agent 2).

|                      | z_2 = True     | z_2 = False    | z_2 = Unobserved |
|----------------------|----------------|----------------|------------------|
| z_1 = True           | (c, c)         | (c + β, c − β) | (c, c)           |
| z_1 = False          | (c − β, c + β) | (c, c)         | (1, −1)          |
| z_1 = Unobserved     | (c, c)         | (−1, 1)        | (c/4, c/4)       |

B COMPLEXITY ANALYSIS

Although the computational complexity of βI_ψ per iteration is O(n³) if it is computed naively as the multiplication of n-th order square matrices, we can reduce it to O(n²) by computing I_ψ = nL_ψ1 − (1ᵀL_ψ1)1. The spatial complexity is O(n²), and the sample size is O(n).
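The O(n²) simplification above follows from expanding ΔL_ψ1 with Δ = nE − 11ᵀ (the appendix's Laplacian convention). A small numpy check with arbitrary scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
L = rng.normal(size=(n, n))
np.fill_diagonal(L, 0.0)                              # score matrix L_psi with zero diagonal
ones = np.ones(n)

Delta = n * np.eye(n) - np.outer(ones, ones)          # graph Laplacian, appendix convention
I_naive = Delta @ L @ ones                            # O(n^3) if Delta @ L is formed explicitly
I_fast = n * (L @ ones) - (ones @ L @ ones) * ones    # O(n^2): n*L*1 - (1^T L 1) 1

assert np.allclose(I_naive, I_fast)
assert np.isclose(I_fast.sum(), 0.0)                  # 1^T I_psi = 0, so the mechanism stays unbiased
```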
C EXPERIMENTAL ENVIRONMENTS

In the experiments, we used the following partially observable environments. The settings are the same as those adopted in existing studies (Sukhbaatar et al., 2016; Singh et al., 2019). Fig. 1 shows the environments.

C.1 PREDATOR PREY (PP)

Predator Prey (PP) is a widely used benchmark environment in MARL (Barrett et al., 2011; Sukhbaatar et al., 2016; Singh et al., 2019). Multiple predators with a limited field of view search for a prey placed at a random initial location in a grid world. The field of view of a predator is limited, so only a few cells can be seen. Therefore, for the predators to reach the prey faster, they need to inform the other predators about the prey's location and the locations already searched. Thus, the prey's location is conveyed through communication among the predators, but predators can also send false messages to keep other predators away from the prey. In this experiment, we use PP-3 and PP-5, which have two difficulty levels: in PP-3, the vision range is set to 0 in a 5×5 environment; in PP-5, the vision range is set to 1 in a 10×10 environment. The numbers denote the number of agents.

C.2 TRAFFIC JUNCTION (TJ)

Traffic Junction (TJ) is a simplified road intersection task. n car agents with limited fields of view inform the other cars of their positions to avoid collisions. In this experiment, TJ has three difficulty levels. TJ-5 is the task of crossing two one-way streets. In TJ-10, there are two lanes, and each car can not only go straight but also turn left or right. In TJ-20, the two-lane roads comprise two parallel roads, for a total of four intersections. Each number corresponds to n. In the initial state, each car is given a starting point and a destination and is trained to follow the determined path as fast as possible while avoiding collisions. The agent controls each car and takes one of two actions, accelerate or brake, at each time step. It is crucial to keep vehicles from approaching each other to prevent collisions while making good use of multi-agent communication, similar to blinkers and brake lights.

C.3 STARCRAFT: BROOD WAR (SC)

Explore: In order to complete the exploration task, the agent must be within a specific range (field of view) of the enemy unit. Once the agent is within the enemy unit's field of view, it does not take any further action. The reward structure is the same as in the PP task, with the only difference being that instead of having to be at the same location as the enemy, the agent only has to reach the enemy unit's range of vision to obtain a non-negative reward. Medic units, which do not attack enemy units, are used to prevent combat from interfering with the mission objective. The observation of each agent consists of its own (absolute x, absolute y) and the enemy's (relative x, relative y, visible), where visible indicates whether the enemy is within visual range. If the enemy is not in exploration range, relative x and relative y are zero. The agent has nine actions to choose from: eight basic directions and one stay action.

Combat: Agents observe their own (absolute x, absolute y, health point + shield, weapon cooldown, previous action) and, for each enemy, (relative x, relative y, visible, health point + shield, weapon cooldown). Relative x and y are observed only when the enemy is visible, which corresponds to the visible flag. All observations are normalized to lie in (0, 1). The agent must choose from 9 + M actions: 9 basic actions and one attack action for each of the M enemy agents. An attack action is effective only if the enemy is within the agent's view; otherwise it is a no-op. Our combat environment is not directly comparable to previous StarCraft setups, as the setting is much more difficult, restrictive, and different.
In the combat task, we give a negative reward r_time = −0.01 at each time step to discourage delaying the detection of the enemy team. In addition, at each time step, an agent that is not participating in a battle is rewarded with (i) the change in its own normalized health between the previous and current time step, and (ii) the change in the normalized health of the enemies it has attacked so far between the previous and current time step. The final reward for each agent consists of (i) the total remaining health of all living enemies times 3 as a negative reward if the team loses, and (ii) 5m plus the total remaining health of its own team times 3 as a positive reward if the team wins. In this task, the group of enemies is randomly initialized in one half of the map and the agents in the other half, which makes the communication-demanding task even more difficult.

D HYPERPARAMETERS

Table 5: Hyperparameters used in the experiments. β is grid-searched over {0.1, 1, 10, 100}, and the best value is shown; the other parameters are not tuned.

| Parameter      | Notation   | Value                  |
|----------------|------------|------------------------|
| Agents         | n          | {3, 5, 10, 20}         |
| Observation    | x_ti ∈ X   | R⁹                     |
| Internal state | h_ti ∈ H   | R⁶⁴                    |
| Message        | z_ti ∈ Z   | R⁶⁴                    |
| Actions        | a_ti ∈ A   | {↑, ↓, ←, →, a_stop}   |
| True state     | s_t ∈ S    | {0, 1}^(25–400)        |
| Episode length | T          | 20                     |
| Learning rate  | α          | 0.001                  |
| Truthful rate  | β          | 10                     |
| Discount rate  | γ          | 1.0                    |
| Metric         | ψ          | H[·]                   |