Published as a conference paper at ICLR 2023

REVISITING POPULATIONS IN MULTI-AGENT COMMUNICATION

Paul Michel (ENS-PSL; now at DeepMind, contact: pmichel31415@gmail.com), Mathieu Rita (INRIA, Paris), Kory Mathewson (DeepMind), Olivier Tieleman (DeepMind), Angeliki Lazaridou (DeepMind)

ABSTRACT

Despite evidence from sociolinguistics that larger groups of speakers tend to develop more structured languages, the use of populations has failed to yield significant benefits in emergent multi-agent communication. In this paper we reassess the validity of the standard training protocol and illustrate its limitations. Specifically, we analyze population-level communication at the equilibrium in sender-receiver Lewis games. We find that receivers co-adapt to the senders they are interacting with, which limits the effect of the population. Informed by this analysis, we propose an alternative training protocol based on partitioning agents. Partitioning isolates sender-receiver pairs, limits co-adaptation, and results in a new global optimization objective where agents maximize (1) their respective internal communication accuracy and (2) their alignment with other agents. In experiments, we find that agents trained in partitioned populations are able to communicate successfully with new agents which they have never interacted with and tend to develop a shared language. Moreover, we observe that larger populations develop languages that are more compositional. Our findings suggest that scaling up to populations in multi-agent communication can be beneficial, but that it matters how we scale up.¹

¹Code to reproduce our experiments can be found at https://github.com/pmichel31415/EGG

1 INTRODUCTION

Uncovering the mechanisms that underlie our ability to communicate using language is an important stepping stone towards developing machine learning models that are capable of coordinating and interacting via natural language. Over the last few years, there has been increasing interest in simulating the emergence of language using artificial agents trained with reinforcement learning to communicate to achieve a cooperative task (Lazaridou & Baroni, 2020). Typically, agents are trained to perform a variant of the Lewis signaling game (Lewis, 1969; Skyrms, 2010) wherein a sender emits a message describing an object and a receiver attempts to reconstruct the object based on the description. This line of work has applications to semi-supervised learning. For example, agents that develop languages exhibiting universal properties of natural languages may be used as useful initializations for downstream tasks such as image captioning (Lazaridou et al., 2020) or representation learning (Dessì et al., 2021).

Most previous research has focused on communication between a single pair of agents. However, there is mounting evidence that the communication protocols developed in this restricted setting become highly specialized and exhibit properties that are at odds with those found in human languages (Bouchacourt & Baroni, 2018; Chaabouni et al., 2019): for example, agents are able to solve the task successfully while using languages that are not compositional (Kottur et al., 2017; Chaabouni et al., 2020). These idiosyncrasies of the emergent languages can preclude their use in practical applications (Lazaridou et al., 2020). As a possible solution, a growing body of work is advocating for scaling up the emergent communication literature to populations of more than two agents communicating simultaneously (Harding Graesser et al., 2019; Kim & Oh, 2021; Rita et al., 2022a; Chaabouni et al., 2022).
Indeed, there is substantial evidence within the language sciences that population dynamics shape language structure (Raviv et al., 2019; Nölle et al., 2020). In spite of this fact, several negative results have been obtained, showing that training agents in populations yields marginal benefits without explicit pressure towards e.g. population diversity (Rita et al., 2022a) or emulation mechanisms (Chaabouni et al., 2022).

In this paper, we call into question the way such populations are trained. By studying a simple referential game, we evaluate populations on two desirable features observed in natural language: (1) agents are able to communicate with new partners within the same population (Gupta et al., 2021), and (2) larger populations tend to develop more structured languages (Nölle et al., 2020). We provide evidence that populations of artificial agents do not always possess these features (as also attested by previous work, e.g. Kim & Oh (2021); Chaabouni et al. (2022)). To shed light on this phenomenon, we analyze the behaviour of agents in a population at the equilibrium (Section 2). We find that with the standard training procedure, the functional form of the objective is the same as that of a single pair of agents, due to receivers co-adapting to their training partners.

As our main contribution, we propose an alternative training procedure which partitions sender-receiver pairs and limits co-adaptation of receiver agents (Section 3). We show that this new training paradigm maximizes a different objective at the population level. In particular, it explicitly promotes mutual intelligibility across different agents. In experiments, we find that agents trained in partitioned populations are able to communicate successfully with new communication partners with which they have never interacted during training, and that the languages spoken by the various agents tend to be similar to one another (Section 5). In addition, we observe that (1) languages developed in partitioned populations tend to be more compositional and (2) there is a population size effect whereby larger populations develop more structured languages (Section 6). Our results show that there are multiple ways to generalize from single agent pairs to larger populations, and that these design choices matter when it comes to studying the emergent language.

2 COMMUNICATION GAME

We study communication in referential games, a variant of the Lewis signaling game (Lewis, 1969) proposed by Lazaridou et al. (2017). The game proceeds as follows: during each round, a sender agent π observes an object x ∈ X (e.g., an arbitrary categorical entity, or a natural image) sampled from input space X according to distribution p and generates a message m ∼ π(· | x). Messages consist of variable-length sequences of tokens picked from a discrete vocabulary V. Note that the tokens themselves are arbitrary and meaningless (typically they are represented as numbers from 1 to |V|). A receiver agent ρ then observes message m and must predict the original object from among a set of candidates C = {x, y_1, . . . , y_{|C|−1}} containing x and |C| − 1 distractors, where each distractor y is sampled uniformly without replacement from the input space excluding the original object, X \ {x}.
Concretely, this is implemented by calculating a score f(y, m) for each candidate y and defining the probability of a candidate conditioned on the message, ρ(· | m, C), as

ρ(x | m, C) = e^{f(x,m)} / Σ_{y∈C} e^{f(y,m)}.

Based on the receiver's success, the sender agent receives a reward R(x, ρ(· | m, C)). In practice, both senders and receivers are implemented as neural networks π_θ and ρ_ψ with parameters θ and ψ estimated by gradient descent. The sender is trained to maximize its expected reward using the REINFORCE algorithm (Williams, 1992), while the receiver maximizes the expected log-likelihood of identifying the original object, log ρ_ψ(x | m, C) (also known as the InfoNCE objective; Oord et al. (2018)). Denoting as E_{x∼p} the expectation over x sampled from p, the corresponding training objectives are:

J_s(θ) = E_{x∼p} E_{m∼π_θ(·|x)} E_{C∼p} [ R(x, ρ_ψ(· | m, C)) ]    (1)
J_r(ψ) = E_{x∼p} E_{m∼π_θ(·|x)} E_{C∼p} [ log ρ_ψ(x | m, C) ]    (2)

2.1 POPULATION-LEVEL TRAINING

The two-player referential game can be generalized to larger populations of agents (Mordatch & Abbeel, 2018; Chaabouni et al., 2022). In the most general case, we consider a population of N_s senders and N_r receivers that are linked by a bipartite communication graph G defining connections between senders and receivers (π_θi, ρ_ψj) (Harding Graesser et al., 2019; Kim & Oh, 2021). At training time, sender-receiver pairs are repeatedly sampled and trained to perform a round of the game. Importantly, only agent pairs that are connected in the communication graph are sampled. Throughout this paper, we will refer to this type of training as Standard training.

With this training procedure, agents are trained to maximize their communicative success with all their neighbors in the communication graph. Let N_G(i) refer to the neighbors of the i-th node in the graph, and J_{s,i→j} (respectively J_{r,i→j}) denote the objective of π_θi (respectively ρ_ψj) in the pairwise communication from sender i to receiver j. We can write the overall objective for sender i (and receiver j, respectively) as:

J_{s,i}(θ_i) = (1 / |N_G(i)|) Σ_{j∈N_G(i)} J_{s,i→j}(θ_i)    and    J_{r,j}(ψ_j) = (1 / |N_G(j)|) Σ_{i∈N_G(j)} J_{r,i→j}(ψ_j).    (3)

At test time, the population is evaluated by averaging the referential accuracy across all possible sender-receiver pairings. Following previous work, in this paper we focus on populations with an equal number N := N_s = N_r of senders and receivers, meaning that there are up to N² possible pairings.
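To make the setup concrete, here is a minimal, self-contained sketch of one training round, using tabular toy agents in place of the LSTM senders and receivers used in the paper; the sizes, the toy circular graph, and the choice of reward (the log-likelihood of the true object, which Section 4.2 adopts) are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 4            # senders and receivers (same number)
n_objects = 10   # size of the input space X
n_messages = 16  # toy, single-symbol message space
n_distractors = 3

# Toy tabular agents: sender i is a row-stochastic table pi[i][x, m] = pi_i(m | x),
# receiver j is a score table f[j][m, y] used to discriminate among candidates.
pi = rng.dirichlet(np.ones(n_messages), size=(N, n_objects))
f = rng.normal(size=(N, n_messages, n_objects))

# Communication graph: edges[i] lists the receivers that sender i is connected to.
# Here, a circular graph where agent i talks to its two neighbours (and itself).
edges = {i: [(i - 1) % N, i, (i + 1) % N] for i in range(N)}

def play_round(i, j):
    """One round of the referential game between sender i and receiver j."""
    x = rng.integers(n_objects)                               # object to describe
    m = rng.choice(n_messages, p=pi[i, x])                    # message m ~ pi_i(. | x)
    distractors = rng.choice(
        [y for y in range(n_objects) if y != x], size=n_distractors, replace=False)
    candidates = np.concatenate([[x], distractors])
    scores = f[j, m, candidates]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                       # rho_j(. | m, C)
    reward = np.log(probs[0])                                  # log-likelihood of the true object
    receiver_loss = -np.log(probs[0])                          # -log rho_j(x | m, C), Eq. (2)
    sender_surrogate = -reward * np.log(pi[i, x, m])           # REINFORCE surrogate for Eq. (1)
    return reward, sender_surrogate, receiver_loss

# Standard population training step: sample a connected pair and play one round.
i = rng.integers(N)
j = rng.choice(edges[i])
print(play_round(i, j))
```

In the actual experiments both agents are neural networks updated by gradient descent, the sender loss uses a baseline and an entropy bonus (Section 4.3), and pairs are sampled in mini-batches.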
2.2 WHAT DOES POPULATION-LEVEL TRAINING OPTIMIZE?

To shed light on the differences between training a single agent pair and training a population of agents, we analyze the objective optimized by the population. Inspired by Rita et al. (2022b)'s analysis in the two-player case, we study the behaviour of the population at the optimum, that is, when senders and receivers have reached a Nash equilibrium (Osborne & Rubinstein, 1994).

In this section, we make the simplifying assumption that C = X. In other words, receiver agents must pick the correct candidate out of all possible objects in X. This allows us to remove the conditioning on C and write ρ_ψ(x | m, C) = ρ_ψ(x | m). We make this simplification to reduce clutter in notations. Nevertheless, our key observations still hold for C ≠ X (see Appendix C for a detailed discussion).

At a Nash equilibrium, the optimal receiver parameters ψ*_j satisfy

ρ_{ψ*_j} = argmax_{ψ_j} J_{r,j}(ψ_j) = argmax_{ψ_j} (1 / |N_G(j)|) Σ_{i∈N_G(j)} J_{r,i→j}(ψ_j).    (4)

Assuming that receiver ρ_ψj has high enough capacity, and training is able to reach the global optimum, the optimization problem in Equation 4 has an analytical solution ρ_{ψ*_j} which can be written as a function of π̄_{N_G(j)}(m | x) := (1 / |N_G(j)|) Σ_{i∈N_G(j)} π_{θ*_i}(m | x), the mixture of all senders communicating with receiver j:

ρ_{ψ*_j}(x | m) = π̄_{N_G(j)}(x | m) = π̄_{N_G(j)}(m | x) p(x) / E_{y∼p}[ π̄_{N_G(j)}(m | y) ].

In other words, ρ_{ψ*_j} is the posterior associated with π̄_{N_G(j)} (full derivation in Appendix B). An important implication of this result is that when the population graph is fully connected (all senders are connected to all receivers), each receiver converges to the same optimum

π̄*(x | m) = Σ_{i=1}^{n} π_{θi}(m | x) p(x) / E_{y∼p}[ Σ_{i=1}^{n} π_{θi}(m | y) ],

the posterior of the mixture of all senders in the population. Plugging this back into each sender's objective, we have

J_{s,i}(θ*_i) = E_{x∼p} E_{m∼π_{θ*_i}(·|x)} [ R(x, π̄*(· | m)) ].

Summing across all senders, we can rewrite the global objective optimized by the senders as

max_θ E_{x∼p} E_{m∼π̄*(·|x)} [ R(x, π̄*(· | m)) ].    (5)

In other words, at the equilibrium, the population maximizes the expected reward of the sender ensemble π̄*, rather than that of individual agents π_{θ*_i}: the objective of a population of N agents is functionally the same irrespective of N. We postulate that this indifference to the population size may account for the surprising lack of effect of larger populations observed in some previous work (Rita et al., 2022a; Chaabouni et al., 2022). Differences in behaviour must be attributed to effects stemming from training dynamics (e.g. it becomes more difficult for receivers to learn the posterior π̄*(x | m)), or be imposed through extraneous modifications of the population objective (for example explicit imitation components; Chaabouni et al. (2022)).

A second observation is that there is no direct pressure for agents that communicate at training time to develop the same language. Indeed, it is entirely possible that all senders develop different but non-overlapping languages: it suffices that no two senders communicating with a shared receiver use the same message m to describe a different object. In this case receivers can simply learn their neighboring senders' languages and there is no need for the senders to converge to a unified language.

3 PARTITIONING AGENTS

Figure 1: In the standard setting (left), both receivers (in blue) are trained by maximizing their discrimination objective with respect to both senders. With partitioning, receiver ρ_ψ1 (resp. ρ_ψ2) is only trained to maximize its communication objective with sender π_θ1 (resp. π_θ2).

A key difference between the usual population setting and populations of humans in laboratory experiments is that agents are not usually split into "senders" and "receivers". Rather, each participant in the experiment assumes both a sender and a receiver role (Galke et al., 2022). Our hypothesis is that, counter to what is customary in the emergent communication literature, tying senders and receivers is key in surfacing useful population-level dynamics in multi-agent communication.

To operationalize this sender-receiver coupling, we identify an agent as a sender-receiver pair. During training, we only train receiver ρ_ψi with its associated sender π_θi. In other words, J_{r,i}(ψ_i) := J_{r,i→i}(ψ_i). In doing so, we partition the agents by preventing receiver i from co-adapting to other senders j ≠ i. This procedure is illustrated in Figure 1.
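To make the contrast with standard training explicit, the sketch below (illustrative Python, not the paper's implementation) lists which parameters receive gradients when a pair (sender i, receiver j) plays one round; the sampling-efficiency refinement that updates receiver i alongside sender i is discussed later, in Section 4.3.

```python
def sampled_updates(i, j, partitioned):
    """List the gradient updates triggered when the pair (sender i, receiver j)
    is sampled and plays one round of the game."""
    updates = [(f"sender_{i}", f"REINFORCE on J_s,{i}->{j} (reward from receiver {j})")]
    if not partitioned:
        # Standard training: receiver j learns from every neighbouring sender it is
        # paired with, and therefore co-adapts to all of them.
        updates.append((f"receiver_{j}", f"log-likelihood on J_r,{i}->{j}"))
    elif i == j:
        # Partitioned training: receiver j only ever learns from its own sender.
        updates.append((f"receiver_{j}", f"log-likelihood on J_r,{j}->{j}"))
    return updates

print(sampled_updates(1, 2, partitioned=False))
# [('sender_1', 'REINFORCE on J_s,1->2 (reward from receiver 2)'),
#  ('receiver_2', 'log-likelihood on J_r,1->2')]
print(sampled_updates(1, 2, partitioned=True))
# [('sender_1', 'REINFORCE on J_s,1->2 (reward from receiver 2)')]
```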
Note that senders can nevertheless still train with rewards from neighboring receivers, and so communication across agents can still emerge. Importantly, partitioning prevents receivers from learning to recognize multiple languages, as they are now only trained on messages emitted by a single sender.

Following a similar analysis to Section 2.2, we derive that at the optimum, receiver ρ_{ψ*_i}(x | m) now takes the form of the posterior associated with its respective sender,

π_{θ*_i}(x | m) = π_{θ*_i}(m | x) p(x) / E_{y∼p}[ π_{θ*_i}(m | y) ]

(derivation in Appendix B). We can thus write the population-level objective at the equilibrium as

Σ_i [ E_{x∼p} E_{m∼π_{θ*_i}(·|x)} R(x, π_{θ*_i}(· | m)) (internal communication) + Σ_{j∈N_G(i), j≠i} E_{x∼p} E_{m∼π_{θ*_i}(·|x)} R(x, π_{θ*_j}(· | m)) (mutual intelligibility) ].    (6)

Note that the functional form of the objective can now be decomposed into two parts: an internal communication objective which takes the same form as that of a single pair of agents, and a mutual intelligibility objective which enforces that neighboring agents are able to communicate successfully. In experiments, we show that this explicit pressure towards mutual intelligibility promotes the emergence of a single language within the population, which in turn enables agents to communicate with new partners outside of their training neighborhood.

4 EXPERIMENTAL SETTING

4.1 DATASETS

We perform experiments on two datasets: a simple, synthetic attribute/values dataset and a more realistic image dataset.

Attribute/Values. In this dataset, each object is represented by a collection of abstract "attributes". Specifically, each input x is a vector of 4 attributes, each of which can take one of 10 values. This results in 10^4 total attribute/value combinations (Kottur et al., 2017; Chaabouni et al., 2020). In each setting we hold out 1,000 combinations to be used as a validation set, and 1,000 more for use as a test set. We can thus ensure that we are evaluating the agents' ability to generalize to unseen combinations of attributes.

ImageNet. In addition to toy objects, we perform experiments with referential games based on more realistic objects. Following Chaabouni et al. (2022), we use the ImageNet (Deng et al., 2009) dataset of natural images. The dataset consists of about 1.4M training images collected on the internet and annotated for 1,000 labels from the WordNet database (Miller, 1995). Images are first encoded as 2048-sized real-valued vectors with a (frozen) ResNet pre-trained with BYOL (Grill et al., 2020) before being passed to senders and receivers.

4.2 GAME ARCHITECTURE

Both sender and receiver agents are based on 1-layer LSTMs (Hochreiter & Schmidhuber, 1997) with embedding and hidden dimensions of size 256. Specifically, the sender first encodes the object x into a vector of size 256, which is concatenated to the input of the LSTM. At each step, the output of the LSTM cell is passed through a fully connected layer to produce logits of size |V|. A softmax function is then applied to obtain normalized probabilities over the vocabulary. During training, messages are generated by sampling from this distribution, whereas at test time we generate messages deterministically via greedy decoding. In both cases, generation stops whenever a special end-of-sequence token is generated, or when the number of tokens reaches a fixed limit L.
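As a concrete illustration of the sender just described, here is a minimal PyTorch sketch of the generation loop (1-layer LSTM, size-256 embeddings and hidden state, logits over the vocabulary, sampling during training and greedy decoding at test time). The dummy start token, the lack of end-of-sequence masking, and the exact wiring of the object encoding into the LSTM input are simplifying assumptions; the paper's implementation is based on the EGG toolkit.

```python
import torch
import torch.nn as nn

class Sender(nn.Module):
    """Sketch of the sender: encode the object, then emit up to max_len tokens."""

    def __init__(self, obj_dim, vocab_size=20, hidden=256, max_len=10):
        super().__init__()
        self.obj_encoder = nn.Linear(obj_dim, hidden)    # object -> size-256 vector
        self.embed = nn.Embedding(vocab_size, hidden)    # token embeddings
        self.cell = nn.LSTMCell(2 * hidden, hidden)      # input: [token embedding; object code]
        self.head = nn.Linear(hidden, vocab_size)        # logits over the vocabulary
        self.max_len = max_len

    def forward(self, x, greedy=False):
        obj = self.obj_encoder(x)                          # (B, hidden)
        h, c = torch.zeros_like(obj), torch.zeros_like(obj)
        token = torch.zeros(x.size(0), dtype=torch.long)   # dummy start token
        message, log_probs = [], []
        for _ in range(self.max_len):
            inp = torch.cat([self.embed(token), obj], dim=-1)
            h, c = self.cell(inp, (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            token = dist.probs.argmax(-1) if greedy else dist.sample()
            message.append(token)
            log_probs.append(dist.log_prob(token))         # needed for REINFORCE
        # Tokens emitted after the first end-of-sequence token would be masked out in practice.
        return torch.stack(message, 1), torch.stack(log_probs, 1)

sender = Sender(obj_dim=40)
msg, logp = sender(torch.randn(3, 40))
print(msg.shape, logp.shape)   # (3, 10) messages and per-token log-probabilities
```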
The receiver encodes the message with an LSTM encoder, the output of which is then fed into a fully connected layer to yield a vector of size 512. The candidate objects C are then scored by computing the dot product of this vector with a 512-dimensional encoding of each candidate. The conditional distribution over candidates is then obtained by taking a softmax. We set the reward function for the sender to the log-likelihood assigned by the receiver to the correct candidate, R(x, ρ_ψ(· | m)) = log ρ_ψ(x | m).

Throughout all experiments, we set the vocabulary size |V| to 20 and the maximum length of the messages, L, to 10. This means that the communication channel used by the agents has a capacity of about 20^10, which ensures that there is no communication bottleneck (the size of the channel is several orders of magnitude larger than the size of our datasets). Our implementation, based on the EGG toolkit (Kharitonov et al., 2021), will be open-sourced upon de-anonymization.

4.3 POPULATION TRAINING

Figure 2: Examples of communication graphs used in this paper: (a) fully-connected, (b) circular.

We train populations following the procedure outlined by Chaabouni et al. (2022): for each minibatch of data, we sample K pairs from the population (uniformly among the pairs linked in the communication graph). Each pair plays an episode of the game, and the agents are updated simultaneously following the gradients of their respective objectives. We take K = max(10, N) to ensure that each agent plays the game at least once at every step on average.

This procedure needs to be modified for partitioned populations: since receiver j is now only trained with its respective sender instead of with all of its neighbors, there is now only a 1/|N_G(j)| chance that receiver j will be updated at every step (the probability that the pair (j, j) is sampled). For larger populations, especially those that are fully-connected, this dramatically slows down training as receivers are updated very infrequently. To address this issue, we modify the procedure as follows: for every sampled agent pair (π_θi, ρ_ψj), we calculate both J_{s,i→j} and J_{r,i→i} and update both π_θi and ρ_ψi. Note that this necessitates calculating both ρ_ψj(x | m, C) and ρ_ψi(x | m, C), and therefore we incur a small computational overhead. However, we only observe a 5% increase in training time, due to the fact that we are back-propagating through only one of the two receivers, ρ_ψi(x | m, C). With this modification, we recover the property that each agent (sender or receiver) is updated once every step on average.
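The update-frequency argument above is easy to check numerically. The sketch below is illustrative only: it assumes a fully-connected population and uniform sampling of sender-receiver edges, and estimates how often a given receiver is updated per training step with K = max(10, N) sampled pairs, under naive partitioned sampling versus the modified procedure.

```python
import random

def updates_per_step(n_agents, modified, n_steps=20_000, seed=0):
    """Average number of updates received by receiver 0 per training step, when
    K = max(10, N) pairs are sampled per step in a fully-connected population."""
    rng = random.Random(seed)
    K = max(10, n_agents)
    total = 0
    for _ in range(n_steps):
        for _ in range(K):
            i = rng.randrange(n_agents)  # sender of the sampled pair
            j = rng.randrange(n_agents)  # receiver of the sampled pair
            if modified:
                total += (i == 0)             # modified: receiver i updated alongside sender i
            else:
                total += (i == 0 and j == 0)  # naive: receiver j updated only when i == j
    return total / n_steps

n = 20
print(updates_per_step(n, modified=False))  # ~ 1 / n (receivers updated rarely)
print(updates_per_step(n, modified=True))   # ~ 1 (once per step on average)
```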
In all experiments we train with a batch size of 1024 with the Adam optimizer (Kingma & Ba, 2014), using a learning rate of 0.001 for the attribute/value dataset and 0.0001 for ImageNet. The other parameters are set to β1 = 0.9, β2 = 0.999 and ε = 10^-8. We apply ℓ2 regularization with a coefficient of 10^-5.

Table 1: Accuracies with training partners and new partners on both datasets. Numbers are reported with standard deviation across all pairs for 3 independent experiments.

                      ImageNet                          Attribute/Values
                      Standard         Partitioned      Standard         Partitioned
  Training partners   97.09 ± 1.10     99.75 ± 0.08     99.88 ± 0.15     99.81 ± 0.22
  New partners        5.41 ± 13.57     96.24 ± 3.25     7.81 ± 18.28     40.37 ± 29.44

Table 2: Language similarity between training partners and new partners on both datasets. Numbers are reported with standard deviation across all pairs for 3 independent experiments.

                      ImageNet                          Attribute/Values
                      Standard         Partitioned      Standard         Partitioned
  Training partners   0.28 ± 0.07      0.40 ± 0.02      0.28 ± 0.05      0.36 ± 0.01
  New partners        0.22 ± 0.19      0.37 ± 0.15      0.23 ± 0.19      0.31 ± 0.17

We systematically augment the sender objectives with an entropy-maximization term, which has been found to encourage exploration (Williams & Peng, 1991). The coefficient for this entropy term is set to 0.1 in all experiments. To reduce the variance of the policy gradient in REINFORCE, we subtract a baseline computed by taking the average reward within a mini-batch for each pair (Sutton et al., 1999).

We evaluate the population every epoch (every 5 epochs for the Attribute/Values dataset) on the validation set. We only evaluate on up to 100 unique pairs sampled uniformly within the population, this time without consideration for the communication graph. We train for a fixed number of epochs, selecting the best model based on the average validation accuracy across all evaluation pairs. Although convergence is not guaranteed in this general-sum communication game, in all reported experiments we find that agents converge reliably with our choice of hyper-parameters.

5 COMMUNICATION WITH NEW PARTNERS

In our first set of experiments, we evaluate the ability of agents trained in populations to communicate with partners they haven't interacted with during training.

5.1 CIRCULAR POPULATIONS

Specifically, we study circular populations of agents arranged on a ring lattice. Each agent (sender-receiver pair) i is only trained with the neighboring agents i − 1, . . . , i + 1 and the graph is cyclical (see Figure 6a). We choose this type of population because it is an extreme case of a population where each agent has the same, minimal number of neighbors (two), yet there is still a path between any two agents. In this context, "training partners" are sender-receiver pairs that are connected in the graph and have interacted during the training phase, whereas "new partners" refers to pairs that have not interacted during training. Experiments with other population types are reported in Appendix D.

We report results along two metrics:

Communication Accuracy of sender/receiver pairs on an evaluation set. This measures how successful the pair is in communicating.

Language Similarity between senders. This metric (also called synchronization in Rita et al. (2022a)) is calculated as 1 − δ_{i,j}, where δ_{i,j} is the normalized edit distance between messages output by two senders, averaged across all objects in our evaluation set.

We report these metrics for both training partners and new partners. Note that high communication accuracy does not always entail similar languages: it is possible for the receivers to achieve high accuracy despite all senders sending different messages for any given object (it is only necessary for a given message to unambiguously refer to one object across senders).

5.2 PARTITIONING ENABLES SUCCESSFUL ZERO-SHOT COMMUNICATION

In Tables 1 and 2, we report accuracies and similarities for circular populations of 20 sender-receiver pairs trained on ImageNet and the Attribute/Values dataset. All metrics are calculated on the test set and averaged across 3 independent experiments.
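For reference, a small sketch of the language similarity metric of Section 5.1 (1 − normalized edit distance, averaged over the evaluation objects). Normalizing by the length of the longer message is an assumption here; the text only specifies a normalized edit distance.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def language_similarity(messages_a, messages_b):
    """1 - normalized edit distance, averaged over objects. messages_a[k] and
    messages_b[k] are the messages produced by the two senders for object k."""
    dists = [
        edit_distance(ma, mb) / max(len(ma), len(mb), 1)
        for ma, mb in zip(messages_a, messages_b)
    ]
    return 1.0 - sum(dists) / len(dists)

# Example: two senders that agree on one object out of two.
print(language_similarity([[3, 7, 7], [1, 2]], [[3, 7, 7], [4, 5]]))  # 0.5
```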
Figure 3: (a) Accuracy and (b) language similarity² as a function of the distance between two agents in the communication graph.

²By construction, the similarity of a sender with itself (corresponding to a distance of 0) is always one. We omit this value from the figure to better illustrate the trends for distances ≥ 1.

We observe that in populations following the standard training paradigm (Standard), there is a stark discrepancy between training and new partners. Indeed, on both datasets the accuracy with training partners reaches a very high value, above 95%. Yet, the accuracy when agents communicate with new partners drops down to less than 10%. On the other hand, in Partitioned populations, agents reach a much higher accuracy with non-neighbors, up to 96% on ImageNet and 40% on the Attribute/Values dataset. A similar trend holds for language similarity.

Note that all metrics on new partners exhibit high standard deviation. An explanation is that among non-neighboring pairs there is a different behaviour depending on how far the two agents are in the population. This is verified in Figure 3, which displays a breakdown as a function of the distance between two agents in the communication graph (on ImageNet). We find that without partitioning, accuracy drops off sharply to close to 0 for agents at a distance of 2 or more, whereas it decreases almost linearly with the distance in the partitioned case, down to about 95% for the most distant agents.

5.3 TRAINING DYNAMICS

We further investigate the evolution of accuracies during training. In Figure 4, we plot the evaluation accuracies of both standard and partitioned populations broken down by distance between pairs, focusing on the ImageNet dataset. Note that there are two training phases in the standard case. Up to epoch 10, the accuracy for all training pairs increases, after which agents over-fit to their training partners (distances 0 and 1) and the accuracy on other pairs decreases to a plateau. On the other hand, Figure 4b illustrates the pressure for mutual intelligibility in partitioned populations: as the accuracy between training pairs reaches close to 99% (around epoch 20), accuracies across distant pairs increase rapidly before plateauing above 90%. In fact, our results show that the most distant accuracies are still increasing after 150 epochs, albeit very slowly.

Figure 4: Evolution of validation accuracy during training across agent pairs at various distances in the communication graph, for (a) standard and (b) partitioned populations. Results are aggregated over all agent pairs and 3 populations.

6 PARTITIONED POPULATIONS DEVELOP MORE COMPOSITIONAL LANGUAGES

In this section, we investigate the effect of partitioning on the structure of the language, with a focus on compositionality.

6.1 MEASURING COMPOSITIONALITY

A language is said to be compositional (Szabó, 2020) when the meaning of a whole utterance can be systematically deduced from the meaning of its components (i.e. words). It has been argued that compositionality makes languages easier to learn (Davidson, 1965). Consequently, emergent communication protocols that are compositional may ultimately be easier to understand by humans. A common metric for measuring compositionality in emergent languages is the topographic similarity (Brighton & Kirby, 2006; Lazaridou et al., 2018). Topographic similarity captures the intuition that a compositional language will map similar meanings to similar messages: the phrase "a red bird" is more similar to the phrase "a blue bird" than to "a powerful computer". In practice, the topographic similarity is computed by measuring the Spearman rank correlation coefficient (Spearman, 1904) between (1) the pairwise distances across all objects and (2) the pairwise distances across all messages.
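A compact sketch of this metric, using the Hamming distance between objects and the normalized edit distance between messages that Section 6.2 adopts; the helper names are illustrative.

```python
from itertools import combinations
from scipy.stats import spearmanr

def hamming(u, v):
    """Number of attributes on which two objects differ."""
    return sum(a != b for a, b in zip(u, v))

def normalized_edit_distance(a, b):
    # Levenshtein distance divided by the longer length (see the previous sketch).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)

def topographic_similarity(objects, messages):
    """Spearman correlation between pairwise object distances (Hamming) and
    pairwise message distances (normalized edit distance)."""
    pairs = list(combinations(range(len(objects)), 2))
    obj_d = [hamming(objects[i], objects[j]) for i, j in pairs]
    msg_d = [normalized_edit_distance(messages[i], messages[j]) for i, j in pairs]
    return spearmanr(obj_d, msg_d).correlation

objects = [(0, 0), (0, 1), (1, 1)]
messages = [[5, 5], [5, 9], [2, 9]]   # a perfectly "compositional" toy language
print(topographic_similarity(objects, messages))  # 1.0
```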
6.2 EFFECT OF POPULATION SIZE ON COMPOSITIONALITY

We run experiments on our Attribute/Values dataset, with both standard and partitioned populations that are fully-connected (see Figure 6d). Population sizes range from 2 to 25 sender-receiver pairs. We compute topographic similarity using the Hamming distance in the object space (the distance between two objects is the number of attributes in which they differ) and the normalized edit distance between messages.

In Figure 5a, we observe that while standard population-level training does increase the topographic similarity of the language overall, population size has very little effect: populations of sizes 3 and 20 both reach about the same value of 30 on average. On the other hand, partitioning greatly increases the effect of population size on compositionality: populations of size 20 have a significantly higher topographic similarity than populations of size 5, with a 10-point difference.

6.3 CO-ADAPTATION IS RESPONSIBLE FOR THE DECREASE IN COMPOSITIONALITY

Up until this point, we have described partitioning (or lack thereof) as a binary choice. However, it is possible to partition a population only partially, by allowing receiver j to train with senders i ≠ j occasionally, with probability α > 0. In doing so, the optimal receiver now becomes the posterior associated with a mixture between its own sender π_{θj}(m | x) and the population mixture π̄(m | x) (see Appendix B for the derivation). If 0 < α < 1, receivers are now optimizing for a different objective (as in partitioned populations), but some amount of co-adaptation is still allowed.

We perform this experiment on the Attribute/Values dataset with a fully connected population of size 10, varying the degree of co-adaptation α in {0, 0.1, 0.5, 0.9, 1}. α = 0 corresponds to partitioned training whereas α = 1 is equivalent to standard training. All populations converge to > 99% accuracy. However, in Figure 5b we find that topographic similarity drops as soon as we introduce minimal amounts of co-adaptation (α = 0.1) and decreases steadily to the level of standard populations as α grows to 1. This further corroborates our hypothesis that reducing co-adaptation promotes the emergence of a more structured language, and that eliminating it altogether (in a partitioned population) yields the best results.
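One way to operationalize the co-adaptation probability α (the exact sampling scheme used in the experiments is an assumption here) is to flip a biased coin for every receiver update:

```python
import random

def sample_training_sender(j, neighbours, alpha, rng):
    """Pick the sender whose message receiver j is trained on in this round.

    alpha = 0 recovers partitioned training (always the receiver's own sender),
    alpha = 1 recovers standard training (a uniformly sampled neighbouring sender)."""
    if rng.random() < alpha:
        return rng.choice(neighbours)   # co-adaptation step: train on a neighbour
    return j                            # partitioned step: train on own sender

rng = random.Random(0)
neighbours = list(range(10))            # fully-connected population of size 10
co_adapted = sum(sample_training_sender(0, neighbours, alpha=0.1, rng=rng) != 0
                 for _ in range(10_000))
print(co_adapted / 10_000)  # fraction of updates on another sender, roughly 0.1 * 9/10
```

This matches the weights derived in Appendix B.4: receiver j trains on its own sender with probability (1 − α) + α/|N_G(j)| and on any other given neighbour with probability α/|N_G(j)|.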
6.4 IMPORTANCE OF MUTUAL INTELLIGIBILITY

Recall that the objective of a partitioned population at the equilibrium (Equation 6) can be decomposed into two terms: an internal communication term corresponding to the single-agent-pair objective, and a mutual intelligibility term which encourages senders to align their languages. Importantly, the latter is the only element that separates a partitioned population from a collection of isolated agents. To measure its effect on the compositionality of the emergent language, we train fully connected populations of size 10 and decrease the relative weight of the mutual intelligibility term. This is implemented by making the pair (π_θi, ρ_ψi) more likely to be sampled than other pairs (π_θi, ρ_ψj), j ≠ i, by a factor (1 − β)/β. We let β range from 0.5 (partitioned population) to 0.0 (collection of isolated sender-receiver pairs).

In Figure 5c, we find that emergent languages retain high topographic similarity even at small β, and the sharp drop-off occurs only when β is very close to 0. This confirms that the mutual intelligibility term exerts a strong pressure towards compositionality. We investigate the evolution of the two terms during training in Appendix E.

Figure 5: Influence of partitioning on the topographic similarity of the emergent languages. (a) Topographic similarity as a function of population size on an attribute/value communication game. (b) Topographic similarity with varying degrees of partitioning (populations of size 10). (c) Topographic similarity when ablating the mutual intelligibility term (populations of size 10).

7 RELATED WORK

There is a rich history of modeling the emergence of language as the solution to a cooperative game that can be traced back to functional theories of language (Wittgenstein, 1953; Austin, 1962; Clark, 1996). With renewed interest in the study of language evolution (Crawford & Sobel, 1982; Christiansen & Kirby, 2003), a rich literature has developed around computational simulations of the emergence of language based on simple language games (Lewis, 1969; Skyrms, 2010; Batali, 1998; Cangelosi & Parisi, 2002). Examples include studying evolutionary models of the emergence of grammar (Nowak & Komarova, 2001), the influence of cultural transmission (Brighton & Kirby, 2006), game-theoretical considerations (Huttegger et al., 2014) or linguistic diversity (Livingstone & Fyfe, 1999), among others.

The recent success of deep learning in natural language processing has spurred interest in studying signaling games between deep neural networks trained with reinforcement learning (Lazaridou et al., 2017; Foerster et al., 2016). Several follow-ups have taken this idea further by extending it to more complex games or environments (Sukhbaatar et al., 2016; Havrylov & Titov, 2017; Jaques et al., 2019; Das et al., 2019), by adding an element of competition (Singh et al., 2018; Noukhovitch et al., 2021) or negotiation (Cao et al., 2018), or even by adding explicit pressure towards certain desirable properties (Kottur et al., 2017; Choi et al., 2018; Li & Bowling, 2019; Ren et al., 2019). In parallel, several efforts have been made to understand the properties of the emergent languages (Bouchacourt & Baroni, 2018; Chaabouni et al., 2019; 2020).

Within this growing literature, many authors have explicitly studied the use of populations of more than two agents. Various works have argued for augmenting populations with an explicit pressure towards more structured languages, via e.g. generational transmission (Cogswell et al., 2019), adversarial regularization (Tieleman et al., 2019), varying learning speeds (Rita et al., 2022a) or imitation learning and voting (Chaabouni et al., 2022).
Although the focus is often on fully-connected populations, some authors have explored more complex communication graphs, for the purpose of modeling contact linguistics (Harding Graesser et al., 2019) or the effect of social network structure on the language (Dubova et al., 2020). Recent work from Kim & Oh (2021) is perhaps closest to our own: the authors study the effect of population size and connectivity in the standard training paradigm. The purpose of this paper is to highlight the impact of the training procedure on these very effects.

8 CONCLUSION

Empirical findings in sociolinguistics suggest that population dynamics should help in simple sender-receiver communication games. In this paper, we observed that populations trained by naively extending the simple 1-1 protocol to N-N agent pairs fail to exhibit some of the properties that are observed in human populations. Motivated by an analysis of populations at the equilibrium, we described an alternative training paradigm, based on partitioning agents to reduce co-adaptation. Empirically, we find that partitioning enables us to recover some of the aforementioned properties.

Our findings call attention to the fact that there is more than one way to generalize from single agent pairs to many agents, and simple design choices can have a great impact on the training dynamics and ultimately the effect of the population on the emergent language. Beyond emergent communication, we hope that this observation can inspire similar work in other cooperative multi-agent problems where co-adaptation between agents may counteract population effects.

REFERENCES

John Langshaw Austin. How to do things with words. Harvard University Press, Cambridge, MA, 1962.

John Batali. Computational simulations of the emergence of grammar. Approaches to the evolution of language: Social and cognitive bases, pp. 405, 1998.

Diane Bouchacourt and Marco Baroni. How agents see things: On visual representations in an emergent language game. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 981-985, 2018.

Henry Brighton and Simon Kirby. Understanding linguistic evolution by visualizing the emergence of topographic mappings. Artificial Life, 12(2):229-242, 2006.

Angelo Cangelosi and Domenico Parisi. Computer simulation: A new scientific approach to the study of language evolution. In Simulating the Evolution of Language, pp. 3-28. Springer, 2002.

Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z Leibo, Karl Tuyls, and Stephen Clark. Emergent communication through negotiation. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, and Marco Baroni. Anti-efficient encoding in emergent communication. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NeurIPS), volume 32, 2019.

Rahma Chaabouni, Eugene Kharitonov, Diane Bouchacourt, Emmanuel Dupoux, and Marco Baroni. Compositionality and generalization in emergent languages. In Proceedings of the 8th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4427-4442, 2020.

Rahma Chaabouni, Florian Strub, Florent Altché, Eugene Tarassov, Corentin Tallec, Elnaz Davoodi, Kory Wallace Mathewson, Olivier Tieleman, Angeliki Lazaridou, and Bilal Piot. Emergent communication at scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.
Edward Choi, Angeliki Lazaridou, and Nando de Freitas. Compositional obverter communication learning from raw visual input. In Proceedings of the International Conference on Learning Representations (ICLR), 2018. Morten H Christiansen and Simon Ed Kirby. Language evolution. Oxford University Press, 2003. Hebert Clark. Using Language. Cambridge University Press, Cambridge, UK, 1996. Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, and Dhruv Batra. Emergence of compositional language with deep generational transmission. ar Xiv preprint ar Xiv:1904.09067, 2019. Vincent P Crawford and Joel Sobel. Strategic information transmission. Econometrica: Journal of the Econometric Society, pp. 1431 1451, 1982. Abhishek Das, Th eophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Mike Rabbat, and Joelle Pineau. Tarmac: Targeted multi-agent communication. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 1538 1546. PMLR, 2019. Donald Davidson. Theories of meaning and learnable languages. In Yehoshua Bar-Hillel (ed.), Proceedings of the 1964 International Congress for Logic, Methodology, and Philosophy of Science, pp. 383 394. Amsterdam: North-Holland, 1965. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the 22nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248 255, 2009. Roberto Dess ı, Eugene Kharitonov, and Marco Baroni. Interpretable agent communication from scratch (with a generic visual processor emerging on the side). ar Xiv preprint ar Xiv:2106.04258, 2021. Published as a conference paper at ICLR 2023 Marina Dubova, Arsenii Kirillovich Moskvichev, and Rob Goldstone. Reinforcement communication learning in different social network structures. In Language in Reinforcement Learning Workshop 2020, 2020. Jakob Foerster, Ioannis Alexandros Assael, Nando De Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Annual Conference on Neural Information Processing Systems (NIPS), volume 29, 2016. Lukas Galke, Yoav Ram, and Limor Raviv. Emergent communication for understanding human language evolution: What s missing? In Emergent Communication Workshop at ICLR 2022, 2022. Jean-Bastien Grill, Florian Strub, Florent Altch e, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems (Neur IPS), volume 33, pp. 21271 21284, 2020. Abhinav Gupta, Marc Lanctot, and Angeliki Lazaridou. Dynamic population-based meta-learning for multi-agent communication with natural language. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems (Neur IPS), volume 34, 2021. Laura Harding Graesser, Kyunghyun Cho, and Douwe Kiela. Emergent linguistic phenomena in multi-agent communication games. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3700 3710, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1384. URL https: //aclanthology.org/D19-1384. Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. 
In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), volume 30, 2017. Sepp Hochreiter and J urgen Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735 1780, 1997. Simon Huttegger, Brian Skyrms, Pierre Tarres, and Elliott Wagner. Some dynamics of signaling games. Proceedings of the National Academy of Sciences, 111(Supplement 3):10873 10880, 2014. Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, DJ Strouse, Joel Z Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3040 3049. PMLR, 2019. Eugene Kharitonov, Roberto Dess ı, Rahma Chaabouni, Diane Bouchacourt, and Marco Baroni. EGG: a toolkit for research on Emergence of lan Guage in Games. https://github.com/ facebookresearch/EGG, 2021. Jooyeon Kim and Alice Oh. Emergent communication under varying sizes and connectivities. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems (Neur IPS), volume 34, 2021. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2014. Satwik Kottur, Jos e Moura, Stefan Lee, and Dhruv Batra. Natural language does not emerge naturally in multi-agent dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2962 2967, 2017. Angeliki Lazaridou and Marco Baroni. Emergent multi-agent communication in the deep learning era. ar Xiv preprint ar Xiv:2006.02419, 2020. Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. In Proceedings of the International Conference on Learning Representations (ICLR), 2017. Published as a conference paper at ICLR 2023 Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. In Proceedings of the International Conference on Learning Representations (ICLR), 2018. Angeliki Lazaridou, Anna Potapenko, and Olivier Tieleman. Multi-agent communication meets natural language: Synergies between functional and structural language learning. In Proceedings of the 8th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 7663 7674, 2020. David Lewis. Convention: A philosophical study. John Wiley & Sons, 1969. Fushan Li and Michael Bowling. Ease-of-teaching and language structure from emergent communication. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (Neur IPS), volume 32, 2019. Daniel Livingstone and Colin Fyfe. Modelling the evolution of linguistic diversity. In European Conference on Artificial Life, pp. 704 708. Springer, 1999. George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11): 39 41, 1995. Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the 32nd Meeting of the Association for Advancement of Artificial Intelligence (AAAI), volume 32, 2018. Jonas N olle, Riccardo Fusaroli, Gregory J Mills, and Kristian Tyl en. Language as shaped by the environment: linguistic construal in a collaborative spatial task. Palgrave Communications, 6(1): 1 10, 2020. Michael Noukhovitch, Travis La Croix, Angeliki Lazaridou, and Aaron Courville. 
Emergent communication under competition. In Proceedings of the 20th International Conference on Autonomous Agents and Multi Agent Systems, pp. 974 982, 2021. Martin A Nowak and Natalia L Komarova. Towards an evolutionary theory of language. Trends in cognitive sciences, 5(7):288 295, 2001. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018. Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. The MIT Press, 1994. ISBN 0262150417. Limor Raviv, Antje Meyer, and Shiri Lev-Ari. Larger communities create more systematic languages. Proceedings of the Royal Society B, 286(1907):20191262, 2019. Yi Ren, Shangmin Guo, Serhii Havrylov, Shay Cohen, and Simon Kirby. Enhance the compositionality of emergent language by iterated learning. In 3rd Neur IPS Workshop on Emergent Communication (Eme Com@ Neur IPS 2019). URL https://papers. nips. cc/book/advances-in-neural-informationprocessing-systems-32-2019, 2019. Mathieu Rita, Florian Strub, Jean-Bastien Grill, Olivier Pietquin, and Emmanuel Dupoux. On the role of population heterogeneity in emergent communication. In Proceedings of the International Conference on Learning Representations (ICLR), 2022a. Mathieu Rita, Corentin Tallec, Paul Michel, Jean-bastien Grill, and Florain Strub. Emergent communication: Generalization and overfitting in lewis games. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems (Neur IPS). 2022b. Amanpreet Singh, Tushar Jain, and Sainbayar Sukhbaatar. Learning when to communicate at scale in multiagent cooperative and competitive tasks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018. Brian Skyrms. Signals: Evolution, learning, and information. OUP Oxford, 2010. Published as a conference paper at ICLR 2023 C Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 1904. Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS), volume 29, 2016. Richard S Sutton, David Mc Allester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Proceedings of the 13th Annual Conference on Neural Information Processing Systems (NIPS), 12, 1999. Zolt an Gendler Szab o. Compositionality. In Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy. 2020. URL https://plato.stanford.edu/archives/fall2020/ entries/compositionality/. Accessed: 2022-05-13. Olivier Tieleman, Angeliki Lazaridou, Shibl Mourad, Charles Blundell, and Doina Precup. Community size effect in artificial learning systems. In Vi GIL@ Neur IPS, 2019. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229 256, 1992. Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241 268, 1991. Ludwig Wittgenstein. Philosophical Investigations. Blackwell, Oxford, UK, 1953. Translated by G.E.M. Anscombe. B DERIVATION OF THE OPTIMAL RECEIVER We first prove a more general result from which the optimal receiver both in the standard and partitioned can be derived. 
B.1 GENERAL CASE

Consider a receiver j trained to maximize

J_{r,j}(ψ_j) = Σ_{i∈senders} α_i J_{r,i→j}(ψ_j),    (7)

where the α_{i=1...n} are arbitrary weights for the senders (we assume that the α_i are positive and sum to one). We can rewrite the objective as:

J_{r,j}(ψ_j) = Σ_{i∈senders} α_i J_{r,i→j}(ψ_j) = Σ_{i∈senders} α_i E_{x∼p} E_{m∼π_θi(·|x)} [ log ρ_ψj(x | m) ].

Note that by linearity of expectation we can pass the α_i-weighted average over the senders inside of the expectation and rewrite the second expectation in terms of the mixture π̄_α(m | x) := Σ_{i∈senders} α_i π_θi(m | x):

J_{r,j}(ψ_j) = E_{x∼p} E_{m∼Σ_i α_i π_θi(·|x)} [ log ρ_ψj(x | m) ] = E_{x∼p} E_{m∼π̄_α(·|x)} [ log ρ_ψj(x | m) ].

With slight abuse of notation, let us now denote by π̄_α(m) := E_{x∼p}[ π̄_α(m | x) ] the marginal distribution over messages and π̄_α(x | m) := π̄_α(m | x) p(x) / π̄_α(m) the associated posterior. Notice that since by definition π̄_α(m | x) p(x) = π̄_α(x | m) π̄_α(m), we can rewrite the double expectation E_{x∼p} E_{m∼π̄_α(·|x)} as E_{m∼π̄_α(·)} E_{x∼π̄_α(·|m)} by inverting the order of summation. We can therefore rewrite

J_{r,j}(ψ_j) = −E_{m∼π̄_α(·)} [ H(π̄_α(· | m), ρ_ψj(· | m)) ],

where H(p, q) := E_p[−log q] denotes the cross-entropy of two distributions p and q. Importantly, the cross-entropy satisfies H(p, q) ≥ H(p, p), with equality if and only if p = q. Consequently, the receiver ρ_ψj will be optimal if and only if, for all m:³

ρ_{ψ*_j}(x | m) = π̄_α(x | m) = π̄_α(m | x) p(x) / E_{y∼p}[ π̄_α(m | y) ].    (8)

³More accurately, if the message space is not finite then the condition holds not for all m, but almost surely. However, throughout the paper we are experimenting with finite (albeit large) message spaces.

B.2 OPTIMAL RECEIVER IN STANDARD POPULATIONS

Recall that in standard populations, the training objective for receiver j is:

J_{r,j}(ψ_j) = (1 / |N_G(j)|) Σ_{i∈N_G(j)} J_{r,i→j}(ψ_j).

Note that this is a special case of Equation 7 with

α_i = 1 / |N_G(j)| if i ∈ N_G(j), and α_i = 0 otherwise.

Consequently, the derivation in Section B.1 tells us that the optimal receiver is

ρ_{ψ*_j}(x | m) = π̄_{N_G(j)}(x | m) = π̄_{N_G(j)}(m | x) p(x) / E_{y∼p}[ π̄_{N_G(j)}(m | y) ],    (9)

where π̄_{N_G(j)}(m | x) := (1 / |N_G(j)|) Σ_{i∈N_G(j)} π_{θ*_i}(m | x).

B.3 OPTIMAL RECEIVER IN PARTITIONED POPULATIONS

In partitioned populations, the training objective for receiver j is:

J_{r,j}(ψ_j) = J_{r,j→j}(ψ_j).

This is also a special case of Equation 7 with

α_i = 1 if i = j, and α_i = 0 otherwise.

The derivation in Section B.1 thus yields the optimal receiver

ρ_{ψ*_j}(x | m) = π_{θ*_j}(x | m) = π_{θ*_j}(m | x) p(x) / E_{y∼p}[ π_{θ*_j}(m | y) ].    (10)

Figure 6: Examples of ring populations with connectivities ranging from k = 1 (circular population) to k = n (fully-connected population): (a) k = 1, (b) k = 2, (c) k = 3, (d) k = n.

B.4 OPTIMAL RECEIVER IN PARTIALLY PARTITIONED POPULATIONS

In the partially partitioned populations used in Section 6.3, each receiver's objective is a mixture between the standard and partitioned objectives. This can also be rewritten as a special case of Equation 7 with

α_i = 1 − α + α / |N_G(j)| if i = j,   α_i = α / |N_G(j)| if i ∈ N_G(j) \ {j},   and α_i = 0 otherwise.

The optimal receiver can then be rewritten as the posterior distribution associated with the mixture sender α π̄_{N_G(j)} + (1 − α) π_{θ*_j}.
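The derivation above can be checked numerically with tabular senders over a single-symbol message space. The sketch below (illustrative names and sizes) verifies that the posterior of the α-mixture of senders, Equation 8, attains the highest value of the objective in Equation 7.

```python
import numpy as np

rng = np.random.default_rng(0)
n_objects, n_messages, n_senders = 6, 8, 3

p = rng.dirichlet(np.ones(n_objects))                                      # prior p(x)
senders = rng.dirichlet(np.ones(n_messages), size=(n_senders, n_objects))  # pi_i(m | x)
alpha = rng.dirichlet(np.ones(n_senders))                                  # weights of Eq. (7)

mix = np.einsum("i,ixm->xm", alpha, senders)   # mixture sender pi_alpha(m | x)

def J_r(receiver):
    """Sum_i alpha_i E_{x~p} E_{m~pi_i(.|x)}[log receiver(x | m)], written via the mixture."""
    return np.sum(p[:, None] * mix * np.log(receiver.T))

# Optimal receiver predicted by Eq. (8): the posterior of the mixture sender.
joint = p[:, None] * mix                                   # p(x) * pi_alpha(m | x)
posterior = (joint / joint.sum(axis=0, keepdims=True)).T   # rho*(x | m), shape (messages, objects)

# The posterior beats random perturbations of itself on the objective of Eq. (7).
best = J_r(posterior)
for _ in range(5):
    noisy = posterior * np.exp(0.3 * rng.normal(size=posterior.shape))
    noisy /= noisy.sum(axis=1, keepdims=True)
    assert J_r(noisy) <= best + 1e-9
print("Eq. (7) value at the posterior of the mixture:", best)
```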
C THE CASE OF REFERENTIAL GAMES

In the analysis from Section 2.2 onward, we assumed C = X to simplify notation. We can relax this assumption without changing our key observation that all receivers are the same at the optimum. Indeed, in this case the receiver's objective in a standard population is:

J_{r,j}(ψ_j) = (1 / |N_G(j)|) Σ_{i∈N_G(j)} J_{r,i→j}(ψ_j)
            = (1 / |N_G(j)|) Σ_{i∈N_G(j)} E_{x∼p} E_{m∼π_θi(·|x)} E_{C∼p} [ log ρ_ψj(x | m, C) ]
            = E_{x∼p} E_{m∼π̄_{N_G(j)}(·|x)} E_{C∼p} [ log ρ_ψj(x | m, C) ].

This objective, known as InfoNCE (Oord et al., 2018), also has an analytical solution that can be expressed as a function of π̄_{N_G(j)}, of the form:

ρ_{ψ*_j}(x | m, C) = π̄_{N_G(j)}(x | m) / Σ_{y∈C} π̄_{N_G(j)}(y | m).

Despite the more complicated form of the optimal receiver, the key ingredients of our analysis in Sections 2.2 and 3 are preserved: at the optimum, each receiver is a function of the posterior π̄_{N_G(j)}(x | m) associated with the communication partners to which it co-adapts. A similar analysis in partitioned populations shows that the optimum for receiver j then only depends on the posterior associated with its respective sender π_{θ*_j} instead.

D EXPERIMENTS WITH RING POPULATIONS WITH GREATER CONNECTIVITY

The circular population described in Section 5 is an extreme case of a connected population where each agent has the minimal number of neighbors (two). To verify that our findings apply to more general (non fully-connected) populations, we perform the same experiment on ring populations with higher connectivities. We define a ring graph with connectivity k as a circular graph where each vertex is connected to its k nearest neighbors on each side (i.e. agent i is connected to agents i − k, . . . , i + k). Examples are shown in Figure 6. This allows us to interpolate between circular populations (k = 1) and fully connected populations (k = n/2) while jointly (1) increasing the number of communication partners of a given agent and (2) decreasing the path length between any two agents, two properties of the graph that are arguably most impactful on the mutual intelligibility between agents.

We perform the same experiment as Section 5 with graphs of connectivities k = 2 and k = 3 (resp. 4 and 6 training partners for each agent) on the Attribute/Values dataset. We report accuracies in Table 3.

Table 3: Accuracies with training partners and new partners on the Attribute/Values dataset for ring populations. Numbers are reported with standard deviation across all pairs for 3 independent experiments.

                      k = 2                             k = 3
                      Standard         Partitioned      Standard         Partitioned
  Training partners   99.28 ± 0.36     99.32 ± 0.37     99.23 ± 0.39     99.26 ± 0.40
  New partners        79.67 ± 20.78    99.71 ± 0.33     98.08 ± 4.17     99.89 ± 0.12

We observe that, across the board, populations with greater connectivity exhibit higher accuracy in both conditions. However, we find that partitioned populations still perform best.
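For completeness, a small helper constructing ring graphs with connectivity k as defined above; whether an agent counts as its own neighbour is an assumption here (we include it so that, in the partitioned setting, each receiver can train with its own sender).

```python
def ring_graph(n, k):
    """Ring population of n agents where agent i is connected to its k nearest
    neighbours on each side (k = 1: circular population, k = n // 2: fully connected)."""
    return {
        i: sorted({(i + d) % n for d in range(-k, k + 1)})
        for i in range(n)
    }

print(ring_graph(6, 1))  # circular population: {0: [0, 1, 5], 1: [0, 1, 2], ...}
print(ring_graph(6, 3))  # fully connected: every agent is a neighbour of every other
```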
E FURTHER ANALYSIS OF THE EFFECT OF MUTUAL INTELLIGIBILITY

In Section 6.4, we find that languages stay highly compositional until the mutual intelligibility weight β is decreased to almost 0. Our hypothesis is that even with small amounts of mutual intelligibility, agents will eventually have to optimize this part of the objective: once they have maximized their respective internal communication objectives, the main contributor to the training gradient is the mutual intelligibility term.

To verify this hypothesis, in Figure 7 we report the evolution of both the internal communication and mutual intelligibility losses during training, for various values of the mutual intelligibility weight β. As expected, we observe that for all but very small values of β, the mutual intelligibility loss eventually decreases (although it decreases faster for high β).

Figure 7: Evolution of the internal communication and mutual intelligibility terms during training for different weightings β (populations of size 10). Panels correspond to β ∈ {0, 0.001, 0.01, 0.1, 0.5}; each panel title reports the final topographic similarity (25.6, 22.7, 33.3, 37.2, and 38.1, respectively).