Published as a conference paper at ICLR 2022

EMERGENT COMMUNICATION AT SCALE

Rahma Chaabouni*, Florian Strub*, Florent Altché, Corentin Tallec, Eugene Tarassov, Elnaz Davoodi, Kory Mathewson, Olivier Tieleman, Angeliki Lazaridou, Bilal Piot

*Contributed equally. Corresponding authors: {rahmac,fstrub,piot}@deepmind.com

ABSTRACT

Emergent communication aims for a better understanding of human language evolution and building more efficient representations. We posit that reaching these goals will require scaling up, in contrast to a significant amount of literature that focuses on setting up small-scale problems to tease out desired properties of the emergent languages. We focus on three independent aspects to scale up, namely the dataset, task complexity, and population size. We provide a first set of results for large populations solving complex tasks on realistic large-scale datasets, as well as an easy-to-use codebase to enable further experimentation.¹ In more complex tasks and datasets, we find that RL training can become unstable, but responds well to established stabilization techniques. We also identify the need for a different metric than topographic similarity, which does not correlate with generalization performance when working with natural images. In this context, we probe ease-of-learnability and transfer methods to assess emergent languages. Finally, we observe that larger populations do not induce robust emergent protocols with high generalization performance, leading us to explore different ways to leverage populations, through voting and imitation learning.

¹Source code: github.com/deepmind/emergent_communication_at_scale

1 INTRODUCTION

Language emergence is at the intersection of cognitive science and machine learning. From a cognitive science view, researchers have been looking at artificial agents as another expressive species to shed light on the source of linguistic regularities (Wagner et al., 2003; Guo et al., 2019; Chaabouni et al., 2021). From a machine learning view, language evolution is deemed a promising direction to shape agents' representations and design interactive AI (Steels, 1997; Lazaridou et al., 2020).

Most of the literature in this field relies on variants of the Lewis game (Lewis, 1969). There, a speaker network must describe an object to a listener network, which must then retrieve it among a set of other objects. To solve the game, the two agents need to settle on a communication protocol. While deep agents manage to solve the Lewis game, their communication protocols are usually degenerate, lacking core properties of human languages (Bouchacourt & Baroni, 2018; Chaabouni et al., 2019). In response to these findings, several works proposed ad-hoc solutions by constraining the game and agents' capacity (Kottur et al., 2017; Resnick et al., 2020). While reducing problem complexity is tempting, it can lead to unexpected outcomes and may miss general language emergence behaviors (Hayes, 1985).

We take a different route and advocate that scaling up communication games is a prerequisite to building interactive AI (Baroni et al., 2017; Sutton, 2019) or modeling language evolution (Barsalou, 2008). Indeed, contrary to other machine learning communities, the emergent communication field mostly relies on small-scale games where only one speaker and one listener communicate about disentangled stimuli, which can hinder the generality of its conclusions. In this paper, we focus on three scaling dimensions: the dataset, the task complexity, and the population size.
That is, we argue that making populations of deep agents communicate about larger and more realistic datasets, and solve more complex tasks, is necessary if we eventually want these agents to interact with us or if we want to model human communication. We study three properties of emergent languages, namely generalization, robustness to input noise, and ease of learning over transfer tasks, analyzing different facets of communication protocols. The proposed scaling up remains computationally tractable, as most of the experiments can be done within a day on a single GPU. The source code is based on the Jaxline pipeline (Babuschkin et al., 2020).

Overall, our experiments provide a large spectrum of observations, both positive and negative. First, we find that scaling up the Lewis game quickly entails unstable RL optimization. We propose KL regularization (Geist et al., 2019) as an effective way to overcome this issue. Second, we observe that complexifying the task has two positive aspects: it better discriminates between methods and improves the generalization of the learned communication protocol. Third, we note no correlation between generalization and the widely used topographic similarity metric, which suggests that the latter is not adequate to assess the compositionality of the languages in our more complex setup. Instead, we take inspiration from the self-supervised learning literature and explore transfer learning as a new evaluation metric. Fourth, unlike what was observed in human communication (Lupyan & Dale, 2010; Raviv et al., 2019a;b), we find little to no systematic benefit on emergent language properties when increasing the population size. Instead, we propose alternative methods to leverage populations, namely voting and imitation among speakers (Hester et al., 2018; Vecerik et al., 2017). In particular, we show that such population dynamics lead to more robust, productive, and in some cases easier-to-learn languages, even when compared to our best seed without population, opening up new research opportunities. In the end, we expect that these observations, baselines, and good practices will allow the language emergence community to benefit further from deep RL advances and move the field closer to its goals of improving representations and interactive systems.

2.1 DISCRIMINATION GAME

The discrimination game involves two players, Speaker and Listener, and is decomposed into three sequential stages. First, Speaker receives a target image x and outputs a message m that is sent to Listener. Second, Listener receives m and selects an image x̂ among a set C of different images containing the target image x. The set C is called the candidates. Finally, the target image x is revealed to Listener. If Listener selects the target image, both players receive a positive reward of 1, and 0 otherwise.

Speaker and Listener are parameterized by sets of parameters θ and φ respectively. The message $m(x, \theta) = (w_t(x, \theta))_{t=0}^{T-1}$ is a sequence of length T of words $w_t(x, \theta)$. When no confusion is possible, we omit the dependence on x and θ to simplify notations. A word $w_t$ is an element of a finite vocabulary set W. For the image selected by Listener, $\hat{x}(\varphi, m, C)$, we also omit the dependence on φ, m and C. Finally, the reward for Listener and Speaker is denoted by $R(x, \hat{x}) = \mathbb{1}_{x=\hat{x}}$.
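To make one round of the game concrete, below is a minimal Python/NumPy sketch of the three stages. The agents are stubbed as simple callables; the names (`speak`, `listen`, `play_round`) and shapes are illustrative assumptions rather than the released codebase's API.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MSG_LEN, NUM_CANDIDATES = 20, 10, 1024  # |W|, T, |C| used in the paper

def speak(target_embedding):
    # Stub: a trained Speaker unrolls an LSTM policy; here we sample a random message.
    return rng.integers(0, VOCAB, size=MSG_LEN)

def listen(message, candidate_embeddings):
    # Stub: a trained Listener scores candidates against the decoded message.
    return int(rng.integers(0, len(candidate_embeddings)))

def play_round(images):
    """One round: Speaker describes images[0]; Listener picks among all candidates."""
    target_idx = 0
    message = speak(images[target_idx])          # stage 1: message m
    guess_idx = listen(message, images)          # stage 2: Listener's guess x_hat
    reward = float(guess_idx == target_idx)      # stage 3: R(x, x_hat) = 1 iff target retrieved
    return message, reward

images = rng.standard_normal((NUM_CANDIDATES, 2048))  # stand-ins for 2048-d BYOL features
_, r = play_round(images)
```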
Communication in a population of agents. A straightforward extension of the standard discrimination game is to train a population of agents. In this case, given a population of N Speakers and N Listeners, we sample S Speakers and S Listeners with replacement at each training step to construct S random pairs, where each Speaker is paired with only one Listener.² All pairs observe the same examples and are trained independently to play the discrimination game as described in the 1-pair setting above. In our case, S = N, so that each agent is trained on average once per step, which allows a fair comparison with the 1-pair baseline. At inference time, we take the first P Speakers and Listeners and construct all the possible P² pairs to compute the accuracy, by averaging the rewards of all pairs as in (Mordatch & Abbeel, 2018).

²In this work, we only consider populations with the same number of Speakers and Listeners.

Exploiting the population. The training dynamic of the population of agents described above does not take advantage of the diversity of the population: we treat all agents in the same way. Yet, one motivation for the effectiveness of populations is the variety between Speakers, to invent new structures, and between Listeners, to avoid over-specialization (Bromham et al., 2015). Prior works have looked at the impact of training agents with different dynamics (e.g., Ren et al., 2019; Guo et al., 2019; Cogswell et al., 2019; Li & Bowling, 2019). In this work, we introduce two different ways to exploit the population: on the Speakers' side, we add imitation, and on the Listeners' side, we allow Listeners to vote on the final prediction. Mathematical details are found in Sec. 2.3.

Imitation training consists of two alternating steps (see the sketch below). First, as described above, Speakers and Listeners are paired randomly for M interaction steps to learn to communicate. Second, Speakers are trained for 1 imitation step as follows: we select the best Speaker (called the teacher) among K randomly sampled Speakers without replacement, and use the remaining K − 1 samples as students. We then train all students in a supervised way to imitate the teacher. We alternate between the interaction and imitation steps. Both M and K are hyper-parameters (see Appendix A.5).

Finally, Voting is only used at inference time, irrespective of the training mode. Here, instead of averaging the P² pairs' rewards to compute accuracy, we allow the P Listeners to vote to get a unique prediction. This is inspired by ensemble methods that reduce prediction errors (Polikar, 2012). More details on imitation and voting are reported in Sec. 2.3. For completeness, we also reproduce the ease-of-teaching protocol as another effective way to exploit the population in Appendix B.4 (Li & Bowling, 2019).
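The following is a minimal sketch of the population loop just described: every interaction step samples S = N Speaker–Listener pairs with replacement, and every M steps one imitation step selects the Speaker with the best exponential-moving-average accuracy among K sampled Speakers as teacher. `train_pair` and `imitation_update` are placeholders, not functions from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 10, 5, 4  # population size, interaction steps per imitation step, imitation group size

def train_pair(speaker, listener):
    pass  # placeholder: one discrimination-game update for this Speaker-Listener pair

def imitation_update(student, teacher):
    pass  # placeholder: cross-entropy step pushing the student towards the teacher's messages

def training_step(step, speakers, listeners, ema_accuracies):
    # Interaction: sample S = N random pairs with replacement; train each pair independently.
    for s, l in zip(rng.integers(0, N, size=N), rng.integers(0, N, size=N)):
        train_pair(speakers[s], listeners[l])
    # Every M interaction steps, one imitation step among K Speakers sampled without replacement.
    if step % M == 0:
        group = rng.choice(N, size=K, replace=False)
        teacher = max(group, key=lambda i: ema_accuracies[i])  # highest EMA train accuracy
        for student in group:
            if student != teacher:
                imitation_update(speakers[student], speakers[teacher])

speakers, listeners = [object()] * N, [object()] * N
for step in range(1, 101):
    training_step(step, speakers, listeners, ema_accuracies=rng.random(N))
```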
2.2 DATASETS AND NEURAL ARCHITECTURES

Datasets. We use the ImageNet (Deng et al., 2009; Russakovsky et al., 2015) and CelebA (Liu et al., 2015) datasets, which respectively contain 1400k and 200k labelled images. CelebA provides 40 binary attributes per image, such as the presence of glasses or blond hair. As the official CelebA splits have non-overlapping identities, it is impossible to perform regular identity classification on the test/valid sets. We thus construct a new split, where we shuffle images to have overlapping identities across splits. In both datasets, each image is center-cropped before being processed by a ResNet-50 encoder f, pretrained on ImageNet with the self-supervised method BYOL (Grill et al., 2020), to extract a representation of size 2048. Full implementation details are in Appendix A.6.

Speaker architecture. Speaker is a neural network that receives an image x and outputs a message $m = (w_t)_{t=0}^{T-1}$. It is composed of a fixed (non-trainable) image encoder f that transforms x into an embedding f(x), which is then projected by a state adapter $c_\theta$ to an initial state of an LSTM (Hochreiter & Schmidhuber, 1997), $z_{-1,\theta} = c_\theta(f(x))$. The LSTM receives as input a word embedding $e_{t,\theta} = g_\theta(w_{t-1})$ and outputs the next state $z_{t,\theta} = h_\theta(z_{t-1,\theta}, e_{t,\theta})$. The first word embedding, $e_{0,\theta} = g_\theta(\langle sos \rangle)$, is initialized with a start-of-sequence token. The state $z_{t,\theta}$ is then fed to two different heads: the value head $v_\theta$ estimates the expected reward given $z_{t,\theta}$, and the policy head $\pi_\theta$ computes the probability of the chosen word given $z_{t,\theta}$. A sampling function s then picks the word $w_t = s(\pi_\theta(\cdot|z_{t,\theta}))$. Finally, $w_t$ is fed back to $g_\theta$ to produce the next word, and so on until the maximum length T is reached. At training time, the word $w_t$ is sampled according to the policy, whereas it is greedily selected at evaluation. Contrary to prior works (e.g., Kottur et al., 2017; Resnick et al., 2020), we do not restrict the channel capacity of agents to spur the emergence of systematic languages. Instead, we endow Speaker with a message space of size $|W|^T = 20^{10}$ ($|W| = 20$, $T = 10$), significantly larger than the number of available training examples.

Listener architecture. Listener is a neural network that receives Speaker's message m and a set of image candidates C containing the target image x. It outputs the probability of each image $\tilde{x} \in C$ being the target image. Listener is composed of an LSTM cell $h_\varphi$, initialized to the null vector $z_{-1,\varphi}$. The message m is decoded by processing the sequence of word embeddings $e_{t,\varphi} = g_\varphi(w_t)$ through the LSTM such that $z_{t,\varphi} = h_\varphi(z_{t-1,\varphi}, e_{t,\varphi})$. The state $z_{T-1,\varphi}$ is then fed to the network $p_\varphi$ to output $p_{m,\varphi} = p_\varphi(z_{T-1,\varphi})$. In parallel, each image $\tilde{x}$ is projected through the encoder f and the network $t_\varphi$ to obtain the image embedding $t_{\tilde{x}} = t_\varphi(f(\tilde{x}))$. Message and image embeddings are then compared through a score function $\mathrm{score}(m, \tilde{x}, \varphi) = \cos\left(\frac{p_{m,\varphi}}{\|p_{m,\varphi}\|_2}, \frac{t_{\tilde{x}}}{\|t_{\tilde{x}}\|_2}\right)$. The scores over all images are normalized via a softmax to get a probability $\pi_\varphi(\cdot|m, C)$. Finally, Listener selects an image by taking the best guess according to $\pi_\varphi$, i.e. $\hat{x} \in \arg\max_{\tilde{x} \in C} \pi_\varphi(\tilde{x}|m, C)$. Architecture details, hyper-parameters and graphical descriptions are provided in Appendices A.2 and A.4. All experiments are run over 10 seeds.
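As a small illustration of this scoring mechanism (not the paper's code), the sketch below compares a message embedding against candidate image embeddings via cosine similarity and normalizes with a softmax; `p_m` and `t_x` stand for the outputs of $p_\varphi$ and $t_\varphi$, and all shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def listener_probs(p_m, t_x):
    """score(m, x) = cos(p_m / ||p_m||, t_x / ||t_x||), softmax-normalized over candidates.

    p_m: (d,) message embedding; t_x: (|C|, d) candidate image embeddings.
    """
    p = p_m / np.linalg.norm(p_m)
    t = t_x / np.linalg.norm(t_x, axis=1, keepdims=True)
    scores = t @ p                       # cosine similarities, one per candidate
    scores -= scores.max()               # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs

probs = listener_probs(rng.standard_normal(256), rng.standard_normal((1024, 256)))
guess = int(np.argmax(probs))            # Listener's best guess x_hat
```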
2.3 OPTIMIZATION

Speaker training and loss. The goal of Speaker is to optimize the message m sent to Listener such that the expected reward of the game is as high as possible. This can be framed as a sequential decision-making problem where the decision is the choice of each word $w_t$. Therefore, following the policy gradient approach with a baseline (Sutton et al., 2000), we train Speaker's network by (i) minimizing a value loss $L_V(\theta)$ to make the value head $v_\theta(z_{t,\theta})$ fit the expected reward over a batch X, $L_V(\theta) = \frac{1}{|X|} \sum_{x \in X} \sum_{t=0}^{T-1} \left( R(x, \hat{x}) - v_\theta(z_{t,\theta}) \right)^2$, and (ii) maximizing the expected reward through minimizing the policy loss $L_\pi(\theta) = -\frac{1}{|X|} \sum_{x \in X} \sum_{t=0}^{T-1} \mathrm{sg}\left( R(x, \hat{x}) - v_\theta(z_{t,\theta}) \right) \log \pi_\theta(w_t|z_{t,\theta})$, where $\mathrm{sg}(\cdot)$ is the stop-gradient function.

In addition, it is common practice in the RL and emergent language literature to minimize an entropy loss $L_H(\theta)$ encouraging Speaker to explore other choices of words (Mnih et al., 2016; Williams & Peng, 1991; Espeholt et al., 2018). Finally, several deep RL papers (Schulman et al., 2015; 2017) and theoretical RL papers (Geist et al., 2019; Vieillard et al., 2020a;b) argued that minimizing a KL loss $L_{KL}(\theta)$ between the online policy $\pi_\theta$ and a target policy $\pi_{\bar{\theta}}$, instead of or in addition to entropy regularization, can improve final performance and stabilize learning. The target policy $\pi_{\bar{\theta}}$ is obtained by maintaining an exponential moving average of the weights θ over training, $\bar{\theta} \leftarrow (1 - \eta)\bar{\theta} + \eta\theta$, where η is the exponential-moving-average parameter. The KL minimization encourages the online policy to change slowly and smoothly. To sum up, the speaker training loss $L(\theta)$ on a batch of images X is

$L(\theta) = L_V(\theta) + L_\pi(\theta) + \alpha L_H(\theta) + \beta L_{KL}(\theta)$,

where α and β are hyper-parameters.

Listener training and loss. Listener is also trained to maximize the reward, but acts by predicting the best guess for the game, $\hat{x} \in \arg\max_{\tilde{x} \in C} \pi_\varphi(\tilde{x}|m, C)$. For each image x in a batch X, a set of image candidates C is sampled randomly from X (uniformly, without replacement, over $X \setminus \{x\}$). When $|C| = |X|$, we take C = X. Listener's goal is to retrieve x among C, i.e. to output a probability $\pi_\varphi(\tilde{x}|m, C) = 1$ if $\tilde{x} = x$ and 0 otherwise. Therefore, we use a multiclass classification loss where the correct class is the index of x in the set of candidates C, also called the InfoNCE loss (van den Oord et al., 2018): $L(\varphi) = -\frac{1}{|X|} \sum_{x \in X} \log \pi_\varphi(x|m, C)$.

Imitation training among a group of Speakers. In a training imitation step, a group of K speakers among the total population of N speakers is sampled without replacement. Among those K speakers, one plays the role of the teacher and K − 1 play the role of students. To choose the teacher, we compute, for each sampled speaker i, the exponential moving average $\sigma_i$ of its accuracies over the batches it has been trained on. The teacher is then simply the speaker with the highest $\sigma_i$. For convenience, we denote by $\theta_T$ and $\theta_S$ the parameters of the teacher and of a student respectively. A student $\theta_S$ is trained on a batch of data X by imitating the messages of the teacher $\theta_T$ with the loss $L_I(\theta_S) = -\frac{1}{|X|} \sum_{x \in X} \sum_{t=0}^{T-1} \log \pi_{\theta_S}(w_t(x, \theta_T)|x, z_{\theta_S,t})$, a cross-entropy loss that encourages a student to output the same words as the teacher.

Listener's voting at inference time. At inference time, we can use all listeners $(\varphi_i)_{i=0}^{N-1}$ of the population as an ensemble of networks. Together, they vote for a joint guess $\hat{x}$ over a set of candidates C for a message m coming from a speaker. One simple way consists in averaging the score probabilities of the listeners and taking the best guess of this average. Formally, for each listener $\varphi_i$, each message m and batch X, we have the score probability $\pi_{\varphi_i}(\cdot|m, C)$. The choice of the population of listeners for the message m and the set C is then the best guess of the average distribution: $\hat{x}(m, C, (\varphi_i)_{i=0}^{N-1}) = \arg\max_{\tilde{x} \in C} \frac{1}{N} \sum_{i=0}^{N-1} \pi_{\varphi_i}(\tilde{x}|m, C)$.

More details about how we derive the different losses are found in Appendix A.3.
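The sketch below assembles the speaker loss above (value, policy gradient with a stop-gradient baseline, entropy, and KL to an EMA target policy) for a single message, in NumPy for readability. Gradients and the batch average are omitted, the tensor names are illustrative, β = 0.5 mirrors the value used in Sec. 3.1 while the α default is arbitrary, and the target logits are assumed to come from a policy whose weights track θ via the EMA update $\bar{\theta} \leftarrow (1-\eta)\bar{\theta} + \eta\theta$.

```python
import numpy as np

def speaker_loss(logits, target_logits, values, words, reward, alpha=0.0, beta=0.5):
    """logits / target_logits: (T, |W|) online / EMA-target policy logits;
    values: (T,) value-head outputs; words: (T,) sampled word ids; reward: scalar R(x, x_hat)."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    logp = log_softmax(logits)
    logp_target = log_softmax(target_logits)
    p = np.exp(logp)

    advantage = reward - values                  # treated as a constant, i.e. sg(R - v)
    loss_value = ((reward - values) ** 2).sum()  # L_V
    loss_policy = -(advantage * logp[np.arange(len(words)), words]).sum()  # L_pi
    loss_entropy = (p * logp).sum()              # L_H: negative entropy of the policy
    loss_kl = (p * (logp - logp_target)).sum()   # L_KL(pi_theta || pi_theta_bar)
    return loss_value + loss_policy + alpha * loss_entropy + beta * loss_kl

T, W = 10, 20
loss = speaker_loss(np.zeros((T, W)), np.zeros((T, W)),
                    values=np.zeros(T), words=np.zeros(T, dtype=int), reward=1.0)
```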
2.4 LANGUAGE PROPERTIES

Generalization. Generalization measures the ability of agents to communicate about never-seen inputs. To compute it, we simply report test accuracy. In the simple case where inputs are constructed from disentangled attributes, test inputs are new combinations of attributes seen during training.

Robustness. We report agents' drop in accuracy when faced with noisy inputs relative to clean inputs at test time. To construct noisy stimuli, we add Gaussian noise to each batch, with mean 0 and a standard deviation that is half the standard deviation of the batch. Note that we sample different noises for Speakers' and Listeners' inputs. That is, while Listeners are trained to find the exact input of Speakers at train time, they are required to uncover a different input at eval time.

Topographic Similarity (TopSim). TopSim (Brighton & Kirby, 2006) is used as a proxy for compositionality (e.g., Li & Bowling, 2019; Lazaridou et al., 2018), which is widely considered crucial for generalization (Fodor & Lepore, 2002; Marcus, 2003). TopSim tests whether close objects in the input space are described by close messages, by computing the Spearman correlation between the pairwise distances in the input and message spaces. Following prior works, we use the edit distance in the message space and the cosine distance in the input space (a minimal computation is sketched at the end of this subsection).

Ease and transfer learning (ETL). ETL captures how fast and how well the emergent language is transmitted to new Listeners performing a distinct task. It is similar to ease-of-teaching (Li & Bowling, 2019), with the exception that Listeners are trained to solve a different task from the one the emergent language was initially optimized for. This task transfer is inspired by the self-supervised linear evaluation protocol (Chen et al., 2020). To measure ETL, we take min(N, 5) fixed Speakers from the population after convergence and feed their deterministic language to 3 newly initialized Listeners. We take the argmax of each Speaker's distribution to construct the deterministic languages. Finally, we show how fast and how well the new Listeners perform on a given task when trained on fixed languages, by reporting the training curve over 10k steps. We look at the ease of learning when Listeners are trained to solve harder discrimination, classification, and image reconstruction tasks. Hence, ETL not only captures the generality of the language to new Listeners but also its ability to transfer to new tasks.
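A minimal TopSim computation under the stated choices (Levenshtein edit distance between messages, cosine distance between inputs, Spearman correlation over all pairs); this is a generic re-implementation for illustration, not the paper's exact code.

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

def edit_distance(a, b):
    # Standard Levenshtein distance via a rolling dynamic-programming row.
    d = np.arange(len(b) + 1, dtype=float)
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

def cosine_distance(u, v):
    return 1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def topsim(inputs, messages):
    """Spearman correlation between pairwise input and message distances."""
    pairs = list(combinations(range(len(inputs)), 2))
    d_in = [cosine_distance(inputs[i], inputs[j]) for i, j in pairs]
    d_msg = [edit_distance(messages[i], messages[j]) for i, j in pairs]
    return spearmanr(d_in, d_msg).correlation

rng = np.random.default_rng(0)
score = topsim(rng.standard_normal((8, 16)), rng.integers(0, 20, size=(8, 5)))
```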
3 EXPERIMENTS

3.1 TASK SCALING UP: INCREASING THE NUMBER OF CANDIDATES

In the emergent communication literature, agents are typically required to discriminate between fewer than 20 objects (Mu & Goodman, 2021) or to reconstruct hand-crafted attributes (Rita et al., 2020). Yet, this simple training setting was shown to sometimes lead to degenerate communication protocols (Bouchacourt & Baroni, 2018). As highlighted by Dessì et al. (2021) and Guo et al. (2021), the Lewis game is a discrete variant of Contrastive Predictive Coding in the self-supervised learning field (van den Oord et al., 2018), where it was observed that the quality of intermediate representations improves when increasing the number of candidates (He et al., 2019). We hence look at the impact of the number of candidates |C| on language generalization and the technical challenges it introduces. In this subsection, we consider the 1-pair setting trained on ImageNet while varying |C|.

Scaling up task difficulty requires carefully tuned optimization. We first train agents without KL regularization (β = 0), as is commonly done in the field (see e.g., Strub et al., 2017; Cao et al., 2018; Lu et al., 2020, among many others). As displayed in Fig. 1(a), training becomes unstable when increasing the number of candidates from 20 to 1024. The mean train accuracy drops from 98.9% to 76.0%, and we observe a large variance across seeds. This high variance demonstrates how crucial it is to run multiple seeds to state robust conclusions. A remedy against such instability in RL consists in adding a KL regularization loss $L_{KL}$ between the online policy and a target policy, as described in Sec. 2.3. Such a solution had not yet been explored in the emergent communication field. We add a KL regularization with coefficient β = 0.5 and display the results in Fig. 1(b). With this better regularization, the 20- and 1024-candidate settings converge to 99.8% and 98.3% mean accuracy respectively, with small variance during training. Adding $L_{KL}$ thus stabilizes training and leads to better performance. We further illustrate the effect of this regularization across multiple sweeps in Appendix B.3.

Figure 1: Training curves per task complexity (20 vs. 1024 candidates) with and without KL regularization on ImageNet; panel (a) uses KL coefficient β = 0.0 and panel (b) uses β = 0.5. Thin lines are the accuracy of 10 seeds while thick lines represent the mean.

Scaling up task difficulty is necessary to differentiate representations and enhance protocol generalization. In Fig. 2(a), we train 1 pair of agents on different numbers of candidates, ranging from 64 to 1024, and evaluate them on 16 candidates.³ Note that, in the literature, the number of distractors rarely exceeds a dozen at both train and eval time (Mu & Goodman, 2021; Li & Bowling, 2019). In such settings, our experiments do not provide any interesting conclusions: all methods are above 99.7% test accuracy and within one standard deviation of each other. However, when evaluating the same communication protocols on harder tasks with 1024 candidates, as shown in Fig. 2(b), we note that these protocols are actually different. In particular, agents achieve an accuracy of 87.36%, 93.25%, and 96.09% when trained on 64, 256, and 1024 candidates respectively, differentiating the generalization capacity of the learned representations. While expected in light of recent self-supervised results (Chen et al., 2020) - harder training tasks lead to better representations, and harder evaluation tasks better discriminate algorithms - these results clearly illustrate how current emergent language experiments may be ill-scaled to support robust conclusions. We report further experiments in Appendix B.5. In the following, we use 1024 candidates at train and eval to fully leverage our findings.

³We kept the training batch size at 1024, and only varied the number of candidates in the InfoNCE loss.

Figure 2: Test accuracy for different |C| at train (lines) and eval (subplots) times on ImageNet; panel (a) evaluates on 16 candidates and panel (b) on 1024 candidates. (a) Easy eval task: no differentiation between methods. (b) Hard eval task: more candidates at training induces higher test accuracy. The shaded region represents the standard deviation across 10 seeds.
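Per footnote 3, the training batch stays at 1024 examples and only the number of candidates entering the InfoNCE loss varies; a sketch of that subsampling, with hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_candidates(batch_indices, target_pos, num_candidates):
    """Pick |C| - 1 distractors uniformly without replacement from the batch, plus the target.
    When |C| equals the batch size, the whole batch is used."""
    if num_candidates >= len(batch_indices):
        return batch_indices
    others = np.delete(batch_indices, target_pos)
    distractors = rng.choice(others, size=num_candidates - 1, replace=False)
    return np.concatenate(([batch_indices[target_pos]], distractors))

cands = build_candidates(np.arange(1024), target_pos=0, num_candidates=64)
```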
3.2 DATASET SCALING UP: RETHINKING EVALUATION

In most emergent communication works, deep agents are situated in a simple disentangled environment where objects are one-hot vectors (Kottur et al., 2017; Li & Bowling, 2019). As a result, the state space rarely goes beyond a few thousand discrete samples. Besides, such representations are unambiguous, as they are inherently compositional. We argue that such environments may be too simplistic to reach complex or diverse communication behaviors. In contrast, the major success stories in machine learning collectively prove the importance of training neural networks on rich and large amounts of data to emulate complex distributions (Brown et al., 2020; Krizhevsky et al., 2012; Sutton, 2019). Furthermore, if the goal is to reproduce human interaction, it is fundamental to incorporate complex stimuli to develop advanced concepts in language evolution (Miller & Johnson-Laird, 1976; Barsalou, 2008). In this subsection, we scale up the input space by using large natural image datasets, namely ImageNet and CelebA. In particular, we further investigate emergent language properties in the same 1-pair setting while varying |C| at train time.

TopSim (×100)    ImageNet (Image)   CelebA (Image)   CelebA (Attributes)
|C| = 16         15.50 ± 0.61       31.92 ± 1.32     14.22 ± 1.36
|C| = 64         19.01 ± 0.35       28.70 ± 0.56     12.23 ± 0.76
|C| = 256        17.06 ± 1.09       30.69 ± 1.04     12.29 ± 1.27
|C| = 1024       16.52 ± 1.16       30.21 ± 1.22     13.49 ± 0.89

Figure 3: (left) TopSim (×100) with different representations of the input space (Image and Attributes) for different training difficulties |C|. We do not observe a correlation between TopSim and |C|, and thus generalization. ± denotes 1 standard error of the mean. (right) Examples of images predicted by a new Listener trained on the reconstruction task given a message of 1 pair for |C| = 1024. Reconstructions are expected to be blurry, as we are regressing the mean of all faces associated with one message.

TopSim may fall short with natural images. We compute TopSim using the pretrained image logits on ImageNet and CelebA as input representation while varying the task complexity |C| at train time. Results are provided in Fig. 3 (left). We do not observe a consistent pattern between TopSim and |C|. Furthermore, we note a non-significant (p-value > 0.19) Spearman correlation of 0.09 and 0.08 between TopSim and generalization for ImageNet and CelebA respectively. Although pretrained logits are excellent linear features (Grill et al., 2020), they may not be compositional. We thus repeat the evaluation using the attributes from CelebA as input representation to compute TopSim. In that case too, we do not note any significant correlation (p-value = 0.13). Similar findings have been observed in (Andreas, 2019; Chaabouni et al., 2020). This could be due to different reasons. First, agents could be communicating in a compositional way that supports good generalization, by encoding unexpected compositional features that are not labeled, like forehead shape or smile angle. That is, human-labeled features may differ from the agents' ones. Second, TopSim relies on strong assumptions, such as the chosen distances and the use of a linear correlation. Such hypotheses may not hold when moving from artificial to realistic data. Finally, due to the large channel capacity, agents could be using synonyms, generalizing well despite low TopSim values. In sum, our results demonstrate that TopSim is not a good predictor of communication protocol generalization in large-scale settings. Instead, we investigate the information content via ETL.
Table 1: Evaluation of Speakers on multiple settings on CelebA, in %. ± denotes one standard error of the mean. We report final accuracy for all ETL tasks except Recons., for which we report the final loss.

Training      Generalization   ETL: Discr.    ETL: Identity   ETL: Attributes   ETL: Recons. (loss)
|C| = 16      74.50 ± 0.96     56.31 ± 1.07   15.52 ± 0.90    86.99 ± 0.10      2448 ± 16
|C| = 64      80.19 ± 0.63     66.53 ± 1.09   23.12 ± 1.06    87.91 ± 0.11      2419 ± 12
|C| = 256     85.03 ± 0.72     76.61 ± 1.14   33.72 ± 1.52    88.81 ± 0.16      2355 ± 13
|C| = 1024    89.00 ± 0.48     83.57 ± 0.78   44.00 ± 1.26    90.08 ± 0.13      2351 ± 14

ETL is a robust metric to evaluate languages with natural images. ETL, contrary to TopSim, evaluates how useful the emergent language is for a target task, beyond overfitting. As such, it is an important metric, since the goal of emergent communication is to endow agents with communication skills rather than to solve a specific game. Results over distinct tasks on CelebA are reported in Table 1. We observe that, even though ETL only looks at information content and not at compositionality, it is a better predictor of language generalization performance than TopSim, with a strong and significant correlation > 0.90 (p-value ≈ 0) for all considered tasks. As a visual sanity check, we generate predicted images for the reconstruction ETL task in Fig. 3 (right). There, a new Listener is trained to minimize an L2 loss against the target images given the precomputed messages. We observe that gender, makeup, and hair color are fairly well encoded, while other attributes, such as the smile, are not. More samples and training details are available in Appendix B.1. Interestingly, some ETL tasks, like discrimination or reconstruction, do not depend on (human-)predefined input representations, and all ETL tasks are more robust to channel capacity and linearity assumptions than TopSim, which makes ETL a more general evaluation metric.

Zero-shot dataset transfer highlights emergent protocols' differences and limits. Scaling up to natural images opens up a large diversity of visual distributions that we can leverage through zero-shot dataset transfer (Yogatama et al., 2019; Lambert et al., 2020), measuring how generic and high-level the emergent protocols are. We take agents trained on ImageNet with |C| = 1024 and evaluate them on CelebA in a zero-shot fashion. There, the mean accuracy drops from 95.96 to 36.73, and the standard deviation rises from 0.49 to 12.86. Noticeably, the two most different agent pairs obtain zero-shot accuracies of 27.88 and 68.49, while their initial ImageNet accuracies only differed by 0.23. Firstly, such a standard deviation gap suggests that different protocols have emerged across agent pairs. Secondly, the accuracy drop indicates that the emergent protocols remain specific to the initial data distribution, and no systematic language has yet emerged.

3.3 POPULATION SCALING UP: EXPLOITING THE POPULATION DYNAMIC

The emergent communication literature often considers a single pair of agents (Lazaridou et al., 2017; Foerster et al., 2016). However, there is evidence in the multi-agent literature that such a setting may lead to extreme co-adaptation and overfitting (Lanctot et al., 2017). One counter-measure consists of sampling agents within a population (Jaderberg et al., 2018).
From a linguistic perspective, several large-scale corpus analyses and human simulations support the importance of the population in shaping our languages (Lupyan & Dale, 2010; Bromham et al., 2015; Raviv et al., 2019a;b). In this context, Raviv et al. (2019a) show through human experimentation that larger populations develop more systematic languages. Nonetheless, the few artificial simulations that consider population sizes up to 32 agents report mixed results about their advantage (Mordatch & Abbeel, 2018; Tieleman et al., 2018; Graesser et al., 2019; Cogswell et al., 2019). In this part, we consider up to 100 agents. Specifically, we explore the impact of the population size on language properties when dealing with more complex tasks (|C| = 1024 at train and eval) and realistic inputs (CelebA & ImageNet), varying the population size N ∈ {1, 10, 50} pairs.

The best single-pair seed should be the baseline against populations. A few works in the emergent communication framework have looked at the benefit of the population by comparing 2N-agent performance to the 2-agent baseline (Cogswell et al., 2019). However, the former introduces N times more parameters, which could be responsible for its slight observed benefit. Here, we instead consider the best 1-pair setting to investigate whether it is computationally advantageous to train a population of size N rather than N independent pairs. That is, for a given computational budget, we ask whether it is beneficial to train agents within a population as opposed to N pairs in parallel, as also done in (Tieleman et al., 2018). "Best 1 pair" is constructed by selecting the best seed among the 10 seeds of "1 pair", based on validation accuracy, offering a fairer baseline, specifically against the 10-pairs setting. Results are shown in Table 2 and Fig. 4. We observe different behaviors of "best 1 pair" across datasets. In particular, while it always achieves better generalization and ETL discrimination than "1 pair", it shows, for ImageNet only, the worst Robustness and ETL classification. We hypothesize that in that specific case, the emergent protocol over-specializes to the task. Still, "best 1 pair" outperforms the standard baseline "1 pair" 6/8 times, which makes it a stronger baseline. In the following, we thus always compare the population settings with "best 1 pair" for a fair evaluation.

Table 2: Different language properties on the CelebA and ImageNet datasets, in %. For each setting we report the mean over 10 seeds. ± denotes 1 standard error of the mean.

                               CelebA                          ImageNet
Setting                        Generalization   Robustness     Generalization   Robustness
best 1 pair                    90.73            35.82          97.55            27.83
1 pair                         89.00 ± 0.48     37.90          96.09 ± 0.21     18.90
10 pairs                       91.06 ± 0.23     37.56          95.78 ± 0.29     14.58
50 pairs                       90.69 ± 0.61     38.87          95.29 ± 0.34     15.21
10 pairs + imitation           91.84 ± 0.31     35.47          96.79 ± 0.12     13.46
10 pairs + imitation + vote    92.19 ± 0.30     35.21          96.95 ± 0.11     13.27
50 pairs + imitation           92.82 ± 0.51     32.69          96.70 ± 0.16     12.92
50 pairs + imitation + vote    93.13 ± 0.50     32.40          96.85 ± 0.15     12.75
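For reference, the inference-time vote of Sec. 2.3, whose effect the "+ vote" rows of Table 2 report, amounts to averaging the Listeners' candidate distributions before taking the argmax; a minimal sketch with illustrative names:

```python
import numpy as np

def vote(per_listener_probs):
    """per_listener_probs: (N, |C|) candidate distributions pi_phi_i(.|m, C), one row per Listener.
    Returns the index of the jointly preferred candidate."""
    mean_probs = per_listener_probs.mean(axis=0)  # average distribution over the ensemble
    return int(np.argmax(mean_probs))

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(1024), size=10)  # 10 Listeners over 1024 candidates
guess = vote(probs)
```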
Population size does not lead to a systematic advantage. We look at language generalization and robustness when increasing the population size. The results in Table 2 do not show a clear trend between population size and these metrics. For example, on both datasets, "50 pairs" achieves lower Generalization and Robustness than "10 pairs". Furthermore, "best 1 pair" always achieves better Generalization than the largest population setting, "50 pairs". In Fig. 4, we look at ETL for a further investigation of the language properties. CelebA results exhibit a benefit of the population. However, this benefit does not always correlate with population size: "10 pairs" outperforms "50 pairs" on the discrimination task. This non-systematic benefit is also found when experimenting with ImageNet. Again, our results suggest no benefit of population size in the current setting. In the following, we thus explore new approaches to leverage the population dynamics and improve the representation of the emergent protocol.

A better use of the population is needed to see larger benefits. As population size does not improve the emergent language by itself, we take advantage of the richness of the population by considering imitation at training and voting at inference. In Table 2, we observe that both imitation and vote introduce a systematic improvement compared to standard training in a population, and lead to better performance than "1 pair" on all metrics. Moreover, the stronger baseline "best 1 pair" only outperforms the imitation and vote settings 1/4 times (Generalization on ImageNet); as noted previously, this case is an over-specialized protocol. Furthermore, while both imitation and vote systematically improve the protocol, their benefit is even more noticeable in distribution-shifted settings. For example, imitation for "50 pairs" adds a relative improvement of 12.8% in Robustness, compared to 2.3% for Generalization, on CelebA. The same observation holds for the vote, to a lower degree. Note that while the vote induces only a slight benefit, it does not incur any training cost. When considering ETL on discrimination in Figs. 4(a) & 4(c), we observe that imitation leads to a systematic improvement. On CelebA, where the standard population was already beneficial, adding imitation strengthens the results. On ImageNet, "10/50 pairs + imitation" perform similarly at the end of training to the over-specialized "best 1 pair" while enjoying more stable optimization, and they are considerably better than standard training with or without a population. Yet, the results are more nuanced when considering ETL over classification in Figs. 4(b) & 4(d). Specifically, while the CelebA experiments suggest an advantage of training a population of agents without imitation as opposed to with imitation or "1 pair", this pattern is absent on ImageNet, where all settings with population are on par. In sum, both considered dynamics outperform the strong "best 1 pair" baseline in most cases (7/8 times), highlighting their viability.

Figure 4: ETL per dataset and task: (a) CelebA Discrimination, (b) CelebA Classification of Identities, (c) ImageNet Discrimination, (d) ImageNet Classification. Settings compared: best 1 pair, 1 pair, 10 pairs, 50 pairs, 10 pairs + imitation, 50 pairs + imitation. The results are averaged across min(N, 5) Speakers and 3 newborn Listeners over 10 different seeds. The shaded region represents the standard deviation.

4 DISCUSSION AND CONCLUSION

The emergent communication framework has been extensively studied for decades before the successes of deep RL. In this context, many works have revisited this framework through deep RL settings (e.g., Kirby 2002 vs. Ren et al. 2019, Myers-Scotton et al. 2002 vs. Graesser et al. 2019, or Batali 1998 vs. Choi et al. 2018).
However, the experimental settings have barely evolved for twenty years. For instance, Kirby (2002) and Ren et al. (2019) both used a binary input vector of size eight, and only a few papers try to go beyond such artificial input spaces or simple tasks (Lazaridou et al., 2017; Havrylov & Titov, 2017; Dessì et al., 2021). While computational constraints may have been a limiting factor, we show in Appendix B.2 that our setting is feasible with widely available hardware. For example, Table 1 requires approximately 400 hours of compute, i.e., 16 GPUs for a single day; compared to other communities, this is equivalent to a medium-scale study in vision or NLP. We argue that it is finally time for the emergent language community to scale up! In this spirit, we start clearing the way by identifying some scaling-up challenges: optimization instabilities, ill-adapted metrics, and a lack of population synergy. We show how to face these new difficulties by proposing alternative optimization (KL regularization), new evaluation protocols (ETL, zero-shot dataset transfer, best-seed baseline), and new dynamics to leverage populations (imitation, voting).

There are different theories about the necessity of complex tasks to model human communication (e.g., (Barrett & Skyrms, 2017) vs. (Bickerton, 2015; Dupoux, 2018)). Here, we endorse Bickerton's view and adopt performance-inspired solutions to scale up, such as KL regularization and imitation. An interesting future debate is whether our findings could influence the status quo of similar research in human communication.

Although we only examine three scaling-up dimensions, many other directions can be pursued in the future: using larger or different architectures to improve agent abilities (Desai & Johnson, 2021), experimenting with multimodal data like video or sound for more realistic stimuli (Arandjelovic & Zisserman, 2017), or building symmetric communication channels to obtain emergent dialogues (Gao et al., 2019). Another frontier would be to incorporate interaction within environments to ground language into actions (Bisk et al., 2020). As such embodiment is crucial for human language understanding (Harnad, 1990; Barsalou, 2008), we may start by exploring small grid worlds (Kajić et al., 2020) before scaling up to 3D environments (Abramson et al., 2020).

REPRODUCIBILITY STATEMENT

In this paper, we ensure the reproducibility of our findings in several (voluntarily redundant) ways:

- The task is detailed in Section 2.1, and we provide a visual sketch in Appendix A.1.
- We use open-source datasets (ImageNet and CelebA). Dataset processing is explained, with pseudo-code to reproduce our splits, in Appendix A.6.
- The speaker and listener architectures are explained in Section 2.2; we detail model sizes and provide visual sketches in Appendix A.2. The specifics of image reconstruction for ETL are in Appendix B.1, with pseudo-code for the reconstruction head.
- The optimization is described in Section 2.3, with the equations fully detailed in Appendix A.3. All training and evaluation hyperparameters are listed in Appendix A.4.
- Throughout the paper, we provide training curves, test curves, and final scores to give a global view of the training trends.
- We list computation time and memory footprint for multiple experiments and multiple hardware setups in Appendix B.2.
- The source code has been open-sourced (see footnote 1).

ACKNOWLEDGEMENTS

We would like to thank Will Dabney, Remi Munos, Karl Tuyls, Nathalie Beauguerlange, as well as the rest of the DeepMind Paris team for their continuous support. We would also like to thank Marco Baroni, Eugene Kharitonov, Olivier Pietquin and Mathieu Rita for their discussions and helpful feedback at different stages of the project. Finally, we thank Alison Reid and Saran Tunyasuvunakool for their help in open-sourcing the codebase.

REFERENCES

Josh Abramson, Arun Ahuja, Iain Barr, Arthur Brussee, Federico Carnevale, Mary Cassin, Rachita Chhaparia, Stephen Clark, Bogdan Damoc, Andrew Dudzik, et al. Imitating interactive intelligence. arXiv preprint arXiv:2012.05672, 2020.
Jacob Andreas. Measuring compositionality in representation learning. In Proc. of International Conference on Learning Representations (ICLR), 2019.
Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proc. of the IEEE International Conference on Computer Vision (ICCV), 2017.
Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Claudio Fantacci, Jonathan Godwin, Chris Jones, Tom Hennigan, Matteo Hessel, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Lena Martens, Vladimir Mikulik, Tamara Norman, John Quan, George Papamakarios, Roman Ring, Francisco Ruiz, Alvaro Sanchez, Rosalia Schneider, Eren Sezener, Stephen Spencer, Srivatsan Srinivasan, Wojciech Stokowiec, and Fabio Viola. The DeepMind JAX Ecosystem, 2020. URL http://github.com/deepmind.
Marco Baroni, Armand Joulin, Allan Jabri, Germán Kruszewski, Angeliki Lazaridou, Klemen Simonic, and Tomas Mikolov. CommAI: Evaluating the first steps towards a useful general AI. In Proc. of International Conference on Learning Representations (ICLR) - Workshop Track, 2017.
Jeffrey A Barrett and Brian Skyrms. Self-assembling games. The British Journal for the Philosophy of Science, 68(2):329-353, 2017.
Lawrence Barsalou. Grounded cognition. Annual Review of Psychology, 59:617-645, 2008.
John Batali. Computational simulations of the emergence of grammar. In James Hurford, Michael Studdert-Kennedy, and Chris Knight (eds.), Approaches to the Evolution of Language: Social and Cognitive Bases, pp. 405-426. Cambridge University Press, Cambridge, UK, 1998.
Derek Bickerton. Roots of language. Language Science Press, 2015.
Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, et al. Experience grounds language. In Proc. of Empirical Methods in Natural Language Processing (EMNLP), 2020.
Diane Bouchacourt and Marco Baroni. How agents see things: On visual representations in an emergent language game. In Proc. of Empirical Methods in Natural Language Processing (EMNLP), 2018.
Henry Brighton and Simon Kirby. Understanding linguistic evolution by visualizing the emergence of topographic mappings. Artificial Life, 12(2):229-242, 2006.
Lindell Bromham, Xia Hua, Thomas G Fitzpatrick, and Simon J Greenhill. Rate of language evolution is affected by population size. Proc. of the National Academy of Sciences (PNAS), 112(7):2097-2102, 2015.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Leibo, Karl Tuyls, and Stephen Clark. Emergent communication through negotiation. In Proc. of International Conference on Learning Representations (ICLR), 2018.
Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, and Marco Baroni. Anti-efficient encoding in emergent communication. In Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2019.
Rahma Chaabouni, Eugene Kharitonov, Diane Bouchacourt, Emmanuel Dupoux, and Marco Baroni. Compositionality and generalization in emergent languages. In Proc. of the Association for Computational Linguistics (ACL), 2020.
Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, and Marco Baroni. Communicating artificial neural networks develop efficient color-naming systems. Proc. of the National Academy of Sciences (PNAS), 118(12), 2021.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proc. of International Conference on Machine Learning (ICML), 2020.
Edward Choi, Angeliki Lazaridou, and Nando de Freitas. Compositional obverter communication learning from raw visual input. In Proc. of International Conference on Learning Representations (ICLR), 2018.
Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, and Dhruv Batra. Emergence of compositional language with deep generational transmission. arXiv preprint arXiv:1904.09067, 2019.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. of Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
Karan Desai and Justin Johnson. VirTex: Learning visual representations from textual annotations. In Proc. of Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Roberto Dessì, Eugene Kharitonov, and Marco Baroni. Interpretable agent communication from scratch (with a generic visual processor emerging on the side). In Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2021.
Emmanuel Dupoux. Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173:43-59, 2018.
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proc. of International Conference on Machine Learning (ICML), 2018.
Jerry Fodor and Ernest Lepore. The Compositionality Papers. Oxford University Press, Oxford, UK, 2002.
Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2016.
Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to conversational AI: Question answering, task-oriented dialogues and social chatbots. Now Foundations and Trends, 2019.
Gary Lupyan and Rick Dale. Language structure is partly determined by social structure. PLoS ONE, 5(1), 2010.
Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In Proc. of International Conference on Machine Learning (ICML), 2019.
Laura Graesser, Kyunghyun Cho, and Douwe Kiela. Emergent linguistic phenomena in multi-agent communication games. In Proc. of Empirical Methods in Natural Language Processing (EMNLP), 2019.
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2020.
Shangmin Guo, Yi Ren, Serhii Havrylov, Stella Frank, Ivan Titov, and Kenny Smith. The emergence of compositional languages for numeric concepts through iterated learning in neural agents. arXiv preprint arXiv:1910.05291, 2019.
Shangmin Guo, Yi Ren, Kory Mathewson, Simon Kirby, Stefano V Albrecht, and Kenny Smith. Expressivity of emergent language is a trade-off between contextual complexity and unpredictability. arXiv preprint arXiv:2106.03982, 2021.
Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335-346, 1990.
Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2017.
Patrick J. Hayes. The second naive physics manifesto. In R. Shaw & J. Bransford (ed.), Formal Theories of the Commonsense World, 1985.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In Proc. of Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations. In Proc. of Conference on Artificial Intelligence (AAAI), 2018.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
Max Jaderberg, WM Czarnecki, I Dunning, L Marris, G Lever, AG Castaneda, et al. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281, 2018.
Ivana Kajić, Eser Aygün, and Doina Precup. Learning to cooperate: Emergent communication in multi-agent navigation. arXiv preprint arXiv:2004.01097, 2020.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. of International Conference on Learning Representations (ICLR), 2015.
Simon Kirby. Natural language from artificial life. Artificial Life, 8(2):185-215, 2002.
Simon Kirby, Tom Griffiths, and Kenny Smith. Iterated learning and the evolution of language. Current Opinion in Neurobiology, 28:108-114, 2014.
Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. Natural language does not emerge naturally in multi-agent dialog. In Proc. of Empirical Methods in Natural Language Processing (EMNLP), 2017.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2012.
John Lambert, Zhuang Liu, Ozan Sener, James Hays, and Vladlen Koltun. MSeg: A composite dataset for multi-domain semantic segmentation. In Proc. of Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2017.
Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. In Proc. of International Conference on Learning Representations (ICLR), 2017.
Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. In Proc. of International Conference on Learning Representations (ICLR), 2018.
Angeliki Lazaridou, Anna Potapenko, and Olivier Tieleman. Multi-agent communication meets natural language: Synergies between functional and structural language learning. In Proc. of the Association for Computational Linguistics (ACL), 2020.
David Lewis. Convention. Harvard University Press, Cambridge, MA, 1969.
Fushan Li and Michael Bowling. Ease-of-teaching and language structure from emergent communication. In Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2019.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proc. of the IEEE International Conference on Computer Vision (ICCV), 2015.
Yuchen Lu, Soumye Singhal, Florian Strub, Aaron Courville, and Olivier Pietquin. Countering language drift with seeded iterated learning. In Proc. of International Conference on Machine Learning (ICML), 2020.
Gary Marcus. The Algebraic Mind. MIT Press, Cambridge, MA, 2003.
George A Miller and Philip N Johnson-Laird. Language and perception. Harvard University Press, 1976.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proc. of International Conference on Machine Learning (ICML), 2016.
Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Proc. of Conference on Artificial Intelligence (AAAI), 2018.
Jesse Mu and Noah Goodman. Emergent communication of generalizations. In Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2021.
Carol Myers-Scotton et al. Contact linguistics: Bilingual encounters and grammatical outcomes. Oxford University Press on Demand, 2002.
Robi Polikar. Ensemble learning. In Ensemble Machine Learning, pp. 1-34. Springer, 2012.
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proc. of International Conference on Learning Representations (ICLR), 2016.
Limor Raviv, Antje Meyer, and Shiri Lev-Ari. Larger communities create more systematic languages. Proceedings of the Royal Society B, 286(1907):20191262, 2019a.
Limor Raviv, Antje Meyer, and Shiri Lev-Ari. Compositional structure can emerge without generational transmission. Cognition, 182:151-164, 2019b.
Yi Ren, Shangmin Guo, Serhii Havrylov, Shay Cohen, and Simon Kirby. Enhance the compositionality of emergent language by iterated learning. In Proc. of the NeurIPS Emergent Communication Workshop (EmeCom), 2019.
Cinjon Resnick, Abhinav Gupta, Jakob Foerster, Andrew Dai, and Kyunghyun Cho. Capacity, bandwidth, and compositionality in emergent language learning. In Proc. of Autonomous Agents and Multiagent Systems (AAMAS), 2020.
Mathieu Rita, Rahma Chaabouni, and Emmanuel Dupoux. LazImpa: Lazy and Impatient neural agents learn to communicate efficiently. In Proc. of the Association for Computational Linguistics (ACL), 2020.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015. doi: 10.1007/s11263-015-0816-y.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proc. of International Conference on Machine Learning (ICML), 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Luc Steels. The synthetic modeling of language origins. Evolution of Communication, 1(1):1-34, 1997.
Florian Strub, Harm de Vries, Jérémie Mary, Bilal Piot, Aaron C Courville, and Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), 2017.
Rich Sutton. The Bitter Lesson, 2019. http://www.incompleteideas.net/IncIdeas/BitterLesson.html.
Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proc. of Advances in Neural Information Processing Systems (NIPS), 2000.
Olivier Tieleman, Angeliki Lazaridou, Shibl Mourad, Charles Blundell, and Doina Precup. Shaping representations through communication. arXiv preprint arXiv:1912.06208, 2018.
Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, and Matthieu Geist. Leverage the average: an analysis of KL regularization in reinforcement learning. In Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2020a.
Nino Vieillard, Olivier Pietquin, and Matthieu Geist. Munchausen reinforcement learning. In Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2020b.
Kyle Wagner, James A Reggia, Juan Uriagereka, and Gerald S Wilkinson. Progress in the simulation of emergent communication and language. Adaptive Behavior, 11(1):37-69, 2003.
Ronald J Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241-268, 1991.
Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373, 2019.

A ADDITIONAL SETTING DETAILS

A.1 DISCRIMINATION GAME

Figure 5: Example of a discrimination game on ImageNet with a set C of 4 candidates, shown over the three phases of the game.
In this specific instance, Listener does not select an image $\hat{x}$ identical to the target image $x$ received by Speaker; therefore, the reward received by both players, $R(x, \hat{x}) = \mathbb{1}_{x=\hat{x}}$, is 0.

A.2 ARCHITECTURE DETAILS

Speaker. The speaker's network is composed of several components that transform the target image $x$ into a message $m = (w_t)_{t=0}^{T-1}$:

- The encoder $f$ is a fixed ResNet-50 previously trained on ImageNet with the BYOL algorithm. The resulting embedding $f(x)$ is of size 2048.
- The RNN $h_\theta$ is an LSTM with hidden size 256; the core state $z_{t,\theta}$ is therefore of size 512.
- The core-state adapter $c_\theta$ is a linear layer with input size 2048 and output size 512 that transforms the embedding $f(x)$ into the initial core state $z_{-1,\theta} = c_\theta(f(x))$. We split $z_{-1,\theta}$ into two equal parts to obtain the initial hidden state $z_{h,-1,\theta}$ and the initial cell state $z_{c,-1,\theta}$.
- The word embedder $g_\theta$ associates to each discrete symbol in $W \cup \{sos\}$ an embedding of size 10.
- The value head $v_\theta$ first selects the hidden part $z_{h,t,\theta}$ of the core state $z_{t,\theta}$ and then applies a linear layer with output size 1.
- The policy head $\pi_\theta$ first selects the hidden part $z_{h,t,\theta}$ of the core state $z_{t,\theta}$ and then applies a linear layer with output size $|W|$ to obtain logits $l_{t,\theta}$; a softmax layer then transforms the logits into the output distribution $\pi_\theta(\cdot|z_{t,\theta})$.

Listener. The listener's network is composed of several components that transform the message $m$ and an input image $x$ into a score $\mathrm{score}(m, x, \phi)$:

- The encoder $f$ is the same fixed ResNet-50 previously trained on ImageNet with the BYOL algorithm; the resulting embedding $f(x)$ is of size 2048.
- The RNN $h_\phi$ is an LSTM with hidden size 512; the core state $z_{t,\phi}$ is therefore of size 1024, composed of a hidden state $z_{h,t,\phi}$ of size 512 and a cell state $z_{c,t,\phi}$ of size 512.
- The word embedder $g_\phi$ associates to each discrete symbol in $W$ an embedding of size 10.
- The target projection $t_\phi$ is a linear layer with output size 256.
- The core-state projection $p_\phi$ is a linear layer with output size 256.

As explained in Sec. 2.2, the score function is defined as
$$\mathrm{score}(m, x, \phi) = \cos\left(\frac{p_{m,\phi}}{\|p_{m,\phi}\|_2}, \frac{t_x}{\|t_x\|_2}\right).$$
The scores over all candidate images are normalized via a softmax to obtain a probability $\pi_\phi(\cdot|m, C)$ such that
$$\forall \tilde{x} \in C, \quad \pi_\phi(\tilde{x}|m, C) = \frac{\exp(\mathrm{score}(m, \tilde{x}, \phi))}{\sum_{x' \in C} \exp(\mathrm{score}(m, x', \phi))}.$$
Finally, the listener selects an image by taking the best guess according to $\pi_\phi$, i.e., $\hat{x} \in \arg\max_{\tilde{x} \in C} \pi_\phi(\tilde{x}|m, C)$.

Figure 6: Graphical representation of a speaker's architecture, showing how the words $(w_t)_{t=0}^{T-1}$ are computed from the input $x$.

Figure 7: Graphical representation of a listener's architecture, showing how the score $\mathrm{score}(m, x, \phi)$ is computed given a message $(w_t)_{t=0}^{T-1}$ and an input image $x$.
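For concreteness, the scoring and selection rule above can be sketched in a few lines of JAX. This is an illustrative sketch only, not the codebase's implementation; the names p_m (the projected message) and t_xs (the projected candidate embeddings) are ours.

import jax
import jax.numpy as jnp

def listener_scores(p_m, t_xs):
  # Cosine scores between the projected message p_m (shape [d]) and the
  # projected candidate embeddings t_xs (shape [num_candidates, d]).
  p_m = p_m / jnp.linalg.norm(p_m)
  t_xs = t_xs / jnp.linalg.norm(t_xs, axis=-1, keepdims=True)
  return t_xs @ p_m  # shape [num_candidates]

def listener_choice(p_m, t_xs):
  scores = listener_scores(p_m, t_xs)
  probs = jax.nn.softmax(scores)   # pi_phi(.|m, C)
  return jnp.argmax(probs), probs  # best guess and full distribution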
A.3 OPTIMISATION DETAILS

Speaker training and loss. The goal of a speaker, parameterized by $\theta$, is to optimize the message $m$ sent to a listener, parameterized by $\phi$, such that the mean reward of the game is as high as possible. This can be framed as a sequential decision-making problem where each decision is the choice of a word $w_t$. Each word is thus sampled from a parameterized stochastic policy $\pi_\theta(\cdot|x, (w_k)_{k=0}^{t-1})$ that depends on the image $x$ and the past words $(w_k)_{k=0}^{t-1}$, where for $t = 0$ we set $(w_k)_{k=0}^{-1} = \emptyset$. The goal is then to maximize the expected reward $J(\theta, \phi)$ by finding the best policy $\pi_\theta$:
$$J(\theta, \phi) = \mathbb{E}_{x \sim \rho}\left[\mathbb{E}_{\pi_\theta, x}[R(x, \hat{x})]\right],$$
where the expectation $\mathbb{E}_{x \sim \rho}$ is over the dataset of training images and the expectation $\mathbb{E}_{\pi_\theta, x}$ is over all possible sequences $(w_t)_{t=0}^{T-1}$ that can be generated by $\pi_\theta$ starting from $x$. For a given image $x$, we define the value $V^{\pi_\theta}(x) = \mathbb{E}_{\pi_\theta, x}[R(x, \hat{x})]$ as the expected reward for that image; we leave implicit the dependence on the listener's choice $\hat{x}$ for ease of notation, as the speaker cannot act on it. For a parameterized policy $\pi_\theta$, the policy gradient theorem gives:
$$\nabla_\theta V^{\pi_\theta}(x) = \mathbb{E}_{\pi_\theta, x}\left[\sum_{t=0}^{T-1}\left(R(x, \hat{x}) - V^{\pi_\theta}_{t-1}(x)\right)\nabla_\theta \log \pi_\theta(w_t|x, (w_k)_{k=0}^{t-1})\right],$$
where $V^{\pi_\theta}_{t-1}(x) = \mathbb{E}_{\pi_\theta, x}\left[R(x, \hat{x}) \mid (w_k)_{k=0}^{t-1}\right]$ is the value conditioned on all the information revealed at time $t-1$.

For our particular choice of speaker network, we encode the policy $\pi_\theta(\cdot|x, (w_k)_{k=0}^{t-1})$ and the value $V^{\pi_\theta}_{t-1}(x)$ with the policy head $\pi_\theta(\cdot|z_{t,\theta})$ and the value head $v_\theta(z_{t,\theta})$ respectively. Doing so is legitimate because, by construction of our recurrent speaker network, the embedding $z_{t,\theta}$ is a function of the image $x$ and the past words $(w_k)_{k=0}^{t-1}$. We then train our speaker network by minimizing a value loss $L_V(\theta)$ that makes $v_\theta(z_{t,\theta})$ fit $V^{\pi_\theta}_{t-1}(x)$ over a batch $X$ of images,
$$L_V(\theta) = \frac{1}{|X|}\sum_{x \in X} L_V(\theta, x) = \frac{1}{|X|}\sum_{x \in X}\sum_{t=0}^{T-1}\left(R(x, \hat{x}) - v_\theta(z_{t,\theta})\right)^2,$$
and a policy loss $L_\pi(\theta)$ that maximizes the expected reward,
$$L_\pi(\theta) = \frac{1}{|X|}\sum_{x \in X} L_\pi(\theta, x) = -\frac{1}{|X|}\sum_{x \in X}\sum_{t=0}^{T-1} \mathrm{sg}\left(R(x, \hat{x}) - v_\theta(z_{t,\theta})\right)\log \pi_\theta(w_t|z_{t,\theta}),$$
where $\mathrm{sg}(\cdot)$ is the stop-gradient function. If we assume that $v_\theta(z_{t,\theta})$ fits $V^{\pi_\theta}_{t-1}(x)$ perfectly, one can easily verify that $-\nabla_\theta L_\pi(\theta, x)$ is an unbiased estimate of $\nabla_\theta V^{\pi_\theta}(x)$, and therefore that $-\nabla_\theta L_\pi(\theta)$ is an unbiased estimate of $\nabla_\theta J(\theta, \phi)$.

In addition to following the gradient $\nabla_\theta V^{\pi_\theta}(x)$, it is common practice in the RL and emergent-language literature to maximize an entropy term encouraging Speaker to explore other choices of words:
$$H(\pi_\theta) = \mathbb{E}_{x \sim \rho}\left[\sum_{t=0}^{T-1} H\left(\pi_\theta(\cdot|x, (w_k)_{k=0}^{t-1})\right)\right] = -\mathbb{E}_{x \sim \rho}\left[\sum_{t=0}^{T-1}\sum_{w \in W} \pi_\theta(w|x, (w_k)_{k=0}^{t-1})\log \pi_\theta(w|x, (w_k)_{k=0}^{t-1})\right].$$
In practice, a sampled version of the negative entropy, $L_H(\theta)$, is minimized with the speaker network:
$$L_H(\theta) = \frac{1}{|X|}\sum_{x \in X}\sum_{t=0}^{T-1}\sum_{w \in W} \pi_\theta(w|z_{t,\theta})\log \pi_\theta(w|z_{t,\theta}).$$

Recently, several deep RL papers (Schulman et al., 2015; 2017) and theoretical RL papers (Geist et al., 2019; Vieillard et al., 2020a;b) argued that minimizing the KL divergence between the online policy $\pi_\theta$ and a target policy $\pi_{\bar\theta}$, instead of or in addition to entropy regularization, can improve final performance and stabilize learning. Generally, $\pi_{\bar\theta}$ is an older policy or an exponential moving average of past policies. We therefore also consider the following KL regularization:
$$KL(\pi_\theta, \pi_{\bar\theta}) = \mathbb{E}_{x \sim \rho}\left[\sum_{t=0}^{T-1} KL\left(\pi_\theta(\cdot|x, (w_k)_{k=0}^{t-1}),\, \pi_{\bar\theta}(\cdot|x, (w_k)_{k=0}^{t-1})\right)\right] = \mathbb{E}_{x \sim \rho}\left[\sum_{t=0}^{T-1}\sum_{w \in W} \pi_\theta(w|x, (w_k)_{k=0}^{t-1})\log \frac{\pi_\theta(w|x, (w_k)_{k=0}^{t-1})}{\pi_{\bar\theta}(w|x, (w_k)_{k=0}^{t-1})}\right].$$
In practice, the target policy $\pi_{\bar\theta}$ is obtained by taking an exponential moving average of the weights $\theta$ over training. A sampled version of the KL, $L_{KL}(\theta)$, is then minimized with our specific speaker network:
$$L_{KL}(\theta) = \frac{1}{|X|}\sum_{x \in X}\sum_{t=0}^{T-1}\sum_{w \in W} \pi_\theta(w|z_{t,\theta})\log \frac{\pi_\theta(w|z_{t,\theta})}{\pi_{\bar\theta}(w|z_{t,\bar\theta})}.$$

To sum up, the speaker training loss $L(\theta)$ on a batch of images $X$ is
$$L(\theta) = L_V(\theta) + L_\pi(\theta) + \alpha L_H(\theta) + \beta L_{KL}(\theta),$$
where $\alpha$ and $\beta$ are hyper-parameters.
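For concreteness, here is a minimal JAX sketch of the per-image speaker loss above, for one sampled message. It is an illustration under our own naming assumptions, not the released implementation: logits and values are the $T$ policy-head and value-head outputs along the sampled message, target_logits the corresponding outputs of the EMA target speaker, actions the sampled words, and reward the scalar $R(x, \hat{x})$.

import jax
import jax.numpy as jnp

def speaker_loss(logits, target_logits, values, actions, reward,
                 alpha=1e-4, beta=0.5):
  log_probs = jax.nn.log_softmax(logits)  # [T, |W|]
  probs = jnp.exp(log_probs)
  # log pi_theta(w_t | z_t) for the words actually sampled.
  taken = jnp.take_along_axis(log_probs, actions[:, None], axis=-1)[:, 0]

  advantage = reward - values  # [T]
  value_loss = jnp.sum(advantage ** 2)  # L_V: fit v_theta to the return
  # L_pi: REINFORCE with the value head as baseline, stop-gradient on it.
  policy_loss = -jnp.sum(jax.lax.stop_gradient(advantage) * taken)

  neg_entropy = jnp.sum(probs * log_probs)  # sampled L_H
  target_log_probs = jax.nn.log_softmax(jax.lax.stop_gradient(target_logits))
  kl = jnp.sum(probs * (log_probs - target_log_probs))  # sampled L_KL

  return value_loss + policy_loss + alpha * neg_entropy + beta * kl

# Target speaker used in the KL term: exponential moving average of theta,
# theta_bar <- eta * theta_bar + (1 - eta) * theta (eta = 0.99 in Table 3).
def ema_update(theta_bar, theta, eta=0.99):
  return jax.tree_util.tree_map(lambda a, b: eta * a + (1 - eta) * b,
                                theta_bar, theta)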
Listener loss. One important detail from Sec. 2.3 is that, for each image $x$ in a batch $X$, the set of image candidates $C(x, X)$ is sampled randomly (uniformly, without replacement) from $X \setminus \{x\}$. We omitted the dependence of $C(x, X)$ on $x$ and $X$ in the main text for ease of reading. In addition, as explained in Sec. 2.3, the listener loss is a multi-class classification loss where the correct class is the index of $x$ in the set of candidates $C$, also called the InfoNCE loss (van den Oord et al., 2018):
$$L(\phi) = -\frac{1}{|X|}\sum_{x \in X} \log \pi_\phi(x|m, C).$$
This can be rewritten more explicitly as
$$L(\phi) = -\frac{1}{|X|}\sum_{x \in X} \log \frac{\exp(\mathrm{score}(m, x, \phi))}{\sum_{x' \in C}\exp(\mathrm{score}(m, x', \phi))} = -\frac{1}{|X|}\sum_{x \in X} \log \frac{\exp\left(\cos\left(\frac{p_{m,\phi}}{\|p_{m,\phi}\|_2}, \frac{t_x}{\|t_x\|_2}\right)\right)}{\sum_{x' \in C}\exp\left(\cos\left(\frac{p_{m,\phi}}{\|p_{m,\phi}\|_2}, \frac{t_{x'}}{\|t_{x'}\|_2}\right)\right)}.$$

Imitation training among a group of speakers. In an imitation training step, a group of $K$ speakers is selected from the total population of $N$ speakers. Among these $K$ speakers, one plays the role of the teacher and the other $K-1$ play the role of students. To choose the teacher among the $K$ speakers, we use as a metric the exponential moving average of the accuracies over the batches on which a given speaker has been trained. More precisely, let $\theta_i$ be the $i$-th speaker and $X$ the batch on which $\theta_i$ has just been trained; then $\sigma_i$ is updated according to the rule
$$\sigma_i \leftarrow \mu\,\sigma_i + (1 - \mu)\,\frac{1}{|X|}\sum_{x \in X} R(x, \hat{x}),$$
so that $\sigma_i$ is the exponential moving average, with coefficient $\mu$, of the accuracies over the batches on which speaker $i$ has been trained. The teacher is then simply the speaker with the highest $\sigma_i$ among the $K$ speakers. For convenience, let $\theta_T$ denote the parameters of the teacher and $\theta_S$ those of a student. A student $\theta_S$ is trained on a batch of data $X$ by imitating the messages of the teacher $\theta_T$ with the following loss:
$$L_I(\theta_S) = -\frac{1}{|X|}\sum_{x \in X}\sum_{t=0}^{T-1} \log \pi_{\theta_S}\left(w_t(x, \theta_T)\,\middle|\,x, z_{\theta_S,t}\right).$$
The loss $L_I(\theta_S)$ is a cross-entropy loss that encourages the student to output the same words as the teacher.
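The imitation step can likewise be sketched in a few lines; this is an illustration only, with our own naming assumptions (B denotes the batch size).

import jax.numpy as jnp

def update_teacher_score(sigma_i, batch_accuracy, mu=0.99):
  # sigma_i <- mu * sigma_i + (1 - mu) * acc(X), cf. the update rule above.
  return mu * sigma_i + (1.0 - mu) * batch_accuracy

def select_teacher(sigmas, sampled_ids):
  # Among the K sampled speakers, the teacher has the highest sigma.
  return sampled_ids[jnp.argmax(sigmas[sampled_ids])]

def imitation_loss(student_log_probs, teacher_words):
  # student_log_probs: [B, T, |W|], log pi_{theta_S}(.|x, z_t) at each step;
  # teacher_words: [B, T], words w_t(x, theta_T) produced by the teacher.
  taken = jnp.take_along_axis(student_log_probs,
                              teacher_words[..., None], axis=-1)[..., 0]
  return -jnp.mean(jnp.sum(taken, axis=-1))  # cross-entropy L_I(theta_S)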
Optimizer hyper-parameters. Each member of the population (listener or speaker) uses its own optimizer. All optimizers are Adam (Kingma & Ba, 2015) with the same set of hyper-parameters, in particular a learning rate of 1e-4. For the speaker loss, we use $\alpha = 0.0001$ for the entropy regularization and $\beta = 0.5$ for the KL regularization. See Sec. B.3 for more details about the impact of these parameters.

A.4 HYPER-PARAMETERS

We use the same hyper-parameters across the different settings. When experimenting with ImageNet vs. CelebA, we only vary the maximum number of steps: 300k maximum steps for CelebA (where we already start observing some overfitting) and 900k maximum steps for ImageNet. In all cases, we select the best checkpoint for evaluation with respect to the listener loss at validation time. For robust evaluation, we average the scores over 5 epochs, so that a given target is paired with different candidates and the score depends less on the sampling of the candidates. Unless mentioned otherwise, the remaining training hyper-parameters are reported in Table 3.

Table 3: Hyper-parameter values across datasets and settings.
  Learning rate lr                    0.0001
  Training batch size |X|             1024
  Number of candidates |C|            1024
  Number of agents sampled P          min(N, 10)
  KL coefficient β                    0.5
  KL EMA η                            0.99
  Entropy coefficient α               0.0001
  Vocabulary size |W|                 20
  Message length T                    10
  Imitation EMA µ                     0.99
  ETL learning rate                   0.001
  ETL training batch size             4096
  ETL training candidates (discr.)    4096

A.5 IMITATION HYPER-PARAMETERS

In all our experiments, we perform a grid search over M (the number of interaction steps) and K (the number of sampled speakers at the imitation step). For a population of N = 10, K is chosen from {1, 4, 9} and M from {10, 50, 100} according to the validation accuracy. For N = 50, K is selected from {4, 9, 24} and M from {1, 10, 50}. Table 4 shows the selected hyper-parameters (K, M) for the different settings.

Table 4: Imitation hyper-parameters chosen according to the best accuracy at validation.
  Dataset     10 pairs + imitation (K, M)    50 pairs + imitation (K, M)
  CelebA      (1, 10)                        (9, 10)
  ImageNet    (1, 100)                       (9, 10)

A.6 DATASETS DETAILS

ImageNet. ILSVRC2012, also known as ImageNet (Deng et al., 2009; Russakovsky et al., 2015), is among the largest historical natural RGB image datasets. It mostly contains images of animals and objects over 1000 labels, e.g., camel, ostrich, hourglass, or bow tie. In our experiments, we use 99% of the official train set for training (i.e., 1300k images), the last 1% of the train set for validation (i.e., 13k images), and the official validation set as our test set (i.e., 50k images).

CelebA. CelebA is a large natural RGB dataset (Liu et al., 2015) containing images of celebrity faces spanning 10,177 identities. Notably, each face is also annotated with 40 binary attributes, e.g., glasses and hair. Such attributes are useful for computing, for instance, language topography. The official CelebA split contains no overlapping identities between the train, validation, and test sets; since we want overlapping identities, we reshuffle the split as described in Listing 1. In the end, we respectively obtain 169,444 training, 16,669 validation, and 16,486 test images.

Listing 1: CelebA splits

import collections

SPLIT_RATIO = 5  # the ratio train:valid is 5:1

def celeba_splits(dataset):
  """Split the data in three sets with overlapping identities."""
  # Initialize variables.
  image_train, image_valid, image_test = [], [], []
  counter_label = collections.Counter()
  for data in dataset.values():
    label = data['label']  # label encodes the face identity
    count = counter_label[label]
    if count > 0 and count % SPLIT_RATIO == 0:
      # We equally split valid and test by even and odd labels.
      if label % 2 == 0:
        image_valid.append(data)
      else:
        image_test.append(data)
    else:
      image_train.append(data)
    counter_label[label] += 1
  return image_train, image_valid, image_test

Image Processing. In both datasets, we resize images to 256 pixels along the shorter side using bicubic resampling before applying a 224x224 center crop. We normalize the color channels by subtracting the ImageNet mean color and dividing by the ImageNet standard deviation. We then use the ResNet-50 encoder pretrained on ImageNet with the self-supervised method BYOL (Grill et al., 2020) to extract the final representation of dimension 2048. A sketch of this pipeline is given below.
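The following is a minimal sketch of the preprocessing just described, using PIL and NumPy. The normalization constants are the standard ImageNet statistics; this is an assumption on the exact values, since the text above refers to them only as the ImageNet mean and std.

import numpy as np
from PIL import Image

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(path):
  img = Image.open(path).convert('RGB')
  # Resize so that the shorter side is 256 pixels (bicubic resampling).
  w, h = img.size
  scale = 256 / min(w, h)
  img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
  # 224x224 center crop.
  w, h = img.size
  left, top = (w - 224) // 2, (h - 224) // 2
  img = img.crop((left, top, left + 224, top + 224))
  # Normalize the color channels.
  x = np.asarray(img, dtype=np.float32) / 255.0
  return (x - IMAGENET_MEAN) / IMAGENET_STD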
Figure 8: Face reconstructions. Left: randomly sampled input images from the CelebA dataset; Middle: reconstructions using messages from a model trained with 16 candidates; Right: reconstructions using messages from a model trained with 1024 candidates.

B ADDITIONAL RESULTS AND ANALYSIS

B.1 FACE RECONSTRUCTION

To visualize some of the features encapsulated in the messages used by the agents, we propose a face reconstruction procedure. For a single speaker, parameterized by $\theta$, and given an initial image $x \in X$, we produce a message $m = (w_t)_{t=0}^{T-1}$. We feed the message to a listener-like architecture, consisting of an embedding layer of size 10 followed by an LSTM with 512 units. We keep the last output of the LSTM, $h_{T-1}$, as the message representation. We then pass this representation through a deconvolutional architecture similar to the one in Radford et al. (2016), but without batch normalization. Pseudocode for the architecture is provided in Listing 2. The full network is trained by minimizing the $\ell_2$ loss between the input image and the reconstructed one. To optimize the loss, we use AdamW (Kingma & Ba, 2015) with a batch size of 128, a learning rate of $3 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.9$, $\varepsilon = 10^{-8}$, and a weight decay of 0.01. We clip gradients by norm with a clipping threshold of 500, and skip gradient updates whose norm exceeds 2000.

In Table 1, we report reconstruction losses for different numbers of candidates $|C|$. The reconstruction loss is computed as a squared error, summed over both spatial and channel axes. Figure 8 displays examples of input images and the corresponding reconstructions for two discrimination games with different numbers of candidates. Optimally, each reconstruction should converge to the average face compatible with the corresponding message: the more features a message contains, the better the reconstructions should be. Note first that the reconstructions are far from perfect, and much worse than what an auto-encoder trained end-to-end on images would provide. First, solving the discrimination game does not require fully reconstructing the input image, only capturing enough features to uniquely identify the image in a batch of candidates. Second, the discrete nature of the message, its limited size, and the fact that it is learned with reinforcement learning act as strong bottlenecks that prevent messages from containing all the information about input images. Nonetheless, as qualitatively shown in Figure 8, messages do contain semantic information about input images. For instance, hair color, gender, or an open mouth are mostly preserved throughout the reconstruction process, while other features, such as face orientation, skin tone, or age, are mostly ignored. Furthermore, both quantitative and qualitative differences are visible when going from a low to a high number of candidates. Quantitatively, going from 16 to 1024 candidates improves the reconstruction error from 2448 ± 16 to 2351 ± 14 (Table 1). Qualitatively, messages with 1024 candidates seem to contain information about the presence of eyeglasses (top-left image) and of a head cover (bottom-right image) that is absent from the messages of the Speaker trained with 16 candidates.
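The optimisation just described can be sketched with optax as follows; this is an illustration under stated assumptions, not the exact training code. In particular, reconstruct is a hypothetical handle for the deconvolutional head of Listing 2 below, and, for simplicity, the sketch zeroes out skipped updates while still advancing the optimizer state.

import jax
import jax.numpy as jnp
import optax

optimizer = optax.chain(
    optax.clip_by_global_norm(500.0),  # clip gradients by norm at 500
    optax.adamw(learning_rate=3e-4, b1=0.9, b2=0.9, eps=1e-8,
                weight_decay=0.01),
)

def reconstruction_loss(params, messages, images):
  preds = reconstruct(params, messages)  # hypothetical network handle
  # Squared error summed over spatial and channel axes, averaged over batch.
  return jnp.mean(jnp.sum((preds - images) ** 2, axis=(1, 2, 3)))

def train_step(params, opt_state, messages, images):
  grads = jax.grad(reconstruction_loss)(params, messages, images)
  # Skip the update entirely when the raw gradient norm exceeds 2000.
  skip = optax.global_norm(grads) > 2000.0
  updates, new_opt_state = optimizer.update(grads, opt_state, params)
  updates = jax.tree_util.tree_map(
      lambda u: jnp.where(skip, jnp.zeros_like(u), u), updates)
  new_params = optax.apply_updates(params, updates)
  return new_params, new_opt_state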
Listing 2: Face reconstruction head

# Pseudocode: `image` and `nn` stand for the resize and layer primitives
# of the underlying framework (e.g., jax.image and a Haiku-style library).

def upsample_2d(x, factor: int = 2):
  bs, height, width, channels = x.shape
  x = image.resize(x, (bs, height * factor, width * factor, channels),
                   method='nearest')
  return x

def img_reconstruction(x):
  """Take the LSTM output, and turn it into an image."""
  # Project the flat embedding into a 2d tensor of dim 4x4x128.
  x = nn.Linear(4 * 4 * 128)(x)
  x = x.reshape((x.shape[0], 4, 4, 128))
  x = nn.relu(x)
  x = upsample_2d(x)  # 8x8
  x = nn.Conv3x3(64, with_bias=False, padding='VALID')(x)
  x = nn.relu(x)
  x = upsample_2d(x)  # 16x16
  x = nn.Conv3x3(32, with_bias=False, padding='VALID')(x)
  x = nn.relu(x)
  x = upsample_2d(x)  # 32x32
  x = nn.Conv3x3(16, with_bias=False, padding='VALID')(x)
  x = nn.relu(x)
  x = upsample_2d(x)  # 64x64
  x = nn.Conv3x3(16, with_bias=False, padding='VALID')(x)
  x = nn.relu(x)
  x = nn.Conv3x3(3, with_bias=False, padding='VALID')(x)
  return nn.tanh(x)

B.2 COMPUTATIONAL REQUIREMENTS

As mentioned in the article, our approach remains computationally tractable with widely available hardware. Table 5 summarizes the requirements to reproduce our experimental setup.

Table 5: Computational requirements for our base setup. GPU memory refers to the peak GPU memory usage. (Dataset labels inferred from the step counts: 900k steps for ImageNet, 300k for CelebA.)
  Dataset     Device    Pop. size N    GPU memory (GiB)    Step time (ms)    Train. time (hours)
  ImageNet    P100      1              0.29                42                10.5
                        10             0.71                381               95.2
                        50             2.63                1887              471.7
              V100      1              0.29                25                6.3
                        10             0.71                223               55.7
                        50             2.63                1089              272.1
  CelebA      P100      1              0.33                89                7.4
                        10             0.75                457               38.1
                        50             2.60                2248              187.4
              V100      1              0.33                107               8.9
                        10             0.75                293               24.4
                        50             2.60                1408              117.3

B.3 IMPACT OF THE KL REGULARIZATION ON TRAINING STABILITY

We showed in the main paper that training a pair of agents to solve a complex task (a discrimination game with 1024 candidates) becomes unstable when using common optimization algorithms. In this section, we look at the training curves when agents are trained with and without KL regularization, while varying the entropy coefficient $\alpha \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$. Results are shown for a population of size 1 (Figure 9) and of size 10 (Figure 10). In both cases, agents are trained to solve the complex discrimination task with 1024 candidates.

First, we observe that the lower $\alpha$ is, the more crucial applying a KL regularization becomes. For example, in Figure 9(a), we go from a chaotic setting that converges to an accuracy of 40% when the KL coefficient $\beta = 0$ to a significantly more stable optimization with almost perfect accuracy when $\beta \geq 0.5$. Second, we observe that, while KL regularization is useful for stable optimization, it comes at the cost of slower convergence. This is clearly shown in Figures 9(c) and 10(c), where the larger $\beta$ is, the slower the convergence. In other words, one should select the $\beta$ that best balances the stability/speed trade-off. Third, in the presence of KL regularization, training stability depends less on the entropy coefficient $\alpha$. Indeed, we observe stable and high training accuracy
curves for $\alpha \in \{10^{-4}, 10^{-3}, 10^{-2}\}$ (this is also the case for $\alpha = 0$, not shown here). However, training without KL regularization (the curves with $\beta = 0$) is very sensitive to the value of $\alpha$, the most stable case being observed for $\alpha = 0.01$. Even in this best case without KL regularization, having $\beta > 0$ remains useful, as shown in Figure 11.

Figure 9: Training accuracy over 400K steps for a single communicating pair when varying the KL coefficient $\beta \in \{0.0, 0.1, 0.5, 1.0\}$. Each sub-figure shows the results for a fixed entropy coefficient: (a) $\alpha = 0.0001$, (b) $\alpha = 0.001$, (c) $\alpha = 0.01$, (d) $\alpha = 0.1$. Thin lines represent the training accuracy curves of different seeds; thick lines represent the average across 10 seeds.

Figure 10: Training accuracy over 400K steps for 10 communicating pairs when varying the KL coefficient $\beta \in \{0.0, 0.5, 1.0\}$. Each sub-figure shows the results for a fixed entropy coefficient: (a) $\alpha = 0.0001$, (b) $\alpha = 0.001$, (c) $\alpha = 0.01$, (d) $\alpha = 0.1$. Thin lines represent the training accuracy curves of different seeds; thick lines represent the average across 10 seeds.

Figure 11: Training accuracy over 400K steps for (a) 1 pair and (b) 10 pairs. Each sub-figure compares the best setting without KL regularization, (KL, H) = (0, 0.01), i.e., an entropy coefficient of 0.01, with the setting selected for our experiments, (KL, H) = (0.5, 0.0001), i.e., a KL coefficient of 0.5 and an entropy coefficient of 0.0001. Thin lines represent the training accuracy curves of different seeds; thick lines represent the average across 10 seeds. In both cases, the setting with KL regularization, (0.5, 0.0001), exhibits more stable convergence, with a larger difference for the population of 10 pairs.

Table 6: Different language properties on the CelebA dataset, in %. For each setting, we report the mean over 10 seeds; ± denotes 1 standard error of the mean.
  Setting                        Generalization    Robustness
  best 1 pair                    90.73             35.82
  1 pair                         89.00 ± 0.48      37.90
  10 pairs                       91.06 ± 0.23      37.56
  50 pairs                       90.69 ± 0.61      38.87
  10 pairs + imitation           91.84 ± 0.31      35.47
  10 pairs + imitation + vote    92.19 ± 0.30      35.21
  50 pairs + imitation           92.82 ± 0.51      32.69
  50 pairs + imitation + vote    93.13 ± 0.50      32.40
  1 pair + reset                 92.21 ± 0.44      18.34
  10 pairs + reset               90.55 ± 0.56      35.32
  50 pairs + reset               91.52 ± 0.42      34.46

B.4 STUDY OF RESETTING ON THE CELEBA DATASET

In this work, we argue that training deep agents to communicate in a population benefits from exploiting the population's richness, as opposed to standard training (Mordatch & Abbeel, 2018). In this context, we introduced the imitation and voting mechanisms. Other prior works have made a similar argument by focusing instead on the expressivity/compressibility trade-off (Kirby et al., 2014), which posits that our language is shaped by the two competing pressures of expressivity and compressibility. In practice, different methods implement this trade-off in deep agents' communication, such as iterated learning (Ren et al., 2019), cultural transmission (Cogswell et al., 2019), or ease-of-teaching (Li & Bowling, 2019). In this section, we compare imitation and voting to ease-of-teaching.5 To encourage the ease-of-teaching of languages, Li & Bowling (2019) trained deep agents to solve a discrimination task while periodically resetting listeners. We study here the impact of this baseline on the different emergent-language properties introduced in the main paper.
Specifically, we consider three additional settings: (1) "1 pair + reset", where the only listener is reset; (2) "10 pairs + reset", where a randomly selected listener among the 10 available ones is reset; and (3) "50 pairs + reset", where a randomly selected listener among the 50 available ones is reset. In all cases, resetting takes place every 51k steps. Note that while (1) is identical to the setting of Li & Bowling (2019), (2) and (3) involve 10 and 50 speakers respectively, unlike the single speaker of Li & Bowling (2019). A minimal sketch of this resetting baseline is given at the end of this section.

We observe in Table 6 that resetting has a noticeable benefit only in the 1-pair case, on the generalization and robustness of the languages. In particular, we note an average Generalization of 89.00% for "1 pair" vs. 92.21% for "1 pair + reset", and an average Robustness of 18.34% vs. 35.82%. However, there is no systematic improvement when agents are trained in a population ("10 pairs" vs. "10 pairs + reset" and "50 pairs" vs. "50 pairs + reset"). This is in line with the results of Li & Bowling (2019), which show that an abrupt change during training leads to better results. Still, "1 pair + reset" does not systematically improve the emergent-language properties compared to populations when imitation and voting are at play. Moreover, as mentioned in the original paper, the purpose of resetting is not to boost agents' generalization or robustness, but to incentivize agents to develop easy-to-transmit languages. We therefore look at ETL, shown in Figure 12. Adding resetting does not lead to a significant change here: the curves are almost identical to those obtained with standard training in all cases. Overall, our results suggest that resetting listeners is beneficial only for a single pair of agents, and only in some cases. Furthermore, resetting does not induce faster or better easy-to-learn communication protocols over transfer tasks compared to standard training of the Lewis game (with or without a population).

Figure 12: ETL on the CelebA dataset of the emergent languages for two transfer tasks, (a) Discrimination and (b) Classification of Identities, over 9000 learning steps, comparing "best 1 pair", "1 pair", "10 pairs", "50 pairs", "10 pairs + imitation", "50 pairs + imitation", "1 pair + reset", "10 pairs + reset", and "50 pairs + reset". The results are averaged across the emergent languages of min(5, N) different Speakers, newborn-listener seeds, and the 10 seeds of each setting. The shaded region represents the standard deviation.

5 We also considered the setup of Cogswell et al. (2019); however, preliminary results were systematically worse than those of the ease-of-teaching settings. Our codebase provides both options.
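For reference, the resetting baseline can be sketched as follows; the function and argument names are ours, and init_fn stands for whatever initializer creates fresh listener parameters.

import jax

def maybe_reset_listener(step, rng, listeners, init_fn, reset_period=51_000):
  # Every `reset_period` steps, re-initialize one randomly chosen listener,
  # following the ease-of-teaching baseline of Li & Bowling (2019).
  if step == 0 or step % reset_period != 0:
    return listeners
  idx_rng, init_rng = jax.random.split(rng)
  idx = int(jax.random.randint(idx_rng, (), 0, len(listeners)))
  listeners[idx] = init_fn(init_rng)  # fresh parameters for one listener
  return listeners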
B.5 IMPACT OF THE NUMBER OF CANDIDATES AT TRAINING AND EVALUATION TIME

In this part, we look at the impact of task complexity at training (Figure 14) and evaluation (Figure 13) time by varying the number of candidates $|C|$. The results confirm our findings in the main paper: for a fixed training $|C|$, the harder the evaluation task, the lower the test accuracy. Still, if the task is complex enough at training time (e.g., $|C| = 1024$), agents reach overall higher accuracies across all evaluation settings, confirming the importance of hard training tasks for the emergence of communication protocols with good generalization performance. Also, in line with our results in the main paper, we observe in Figure 14 that a harder evaluation task is necessary to discriminate between communication protocols.

Figure 13: Test accuracy on ImageNet over 900K training steps for different numbers of candidates at training time (one sub-figure per training $|C| \in \{16, 64, 256, 1024\}$) and at evaluation time (one line per evaluation $|C| \in \{16, 64, 256, 1024, 4096\}$). As one would expect, given a model trained with $|C|$ candidates, the more complex the evaluation task, the lower the accuracy.

Figure 14: Test accuracy on ImageNet over 900K training steps for different numbers of candidates at training time (one line per training $|C| \in \{16, 64, 256, 1024\}$) and at evaluation time (one sub-figure per evaluation $|C| \in \{16, 64, 256, 1024, 4096\}$). The more we complexify the evaluation, the better we discriminate between representations.

B.6 CELEBA ATTRIBUTES

Table 7: CelebA ETL attribute accuracies at 10k training steps, when varying the number of candidates $|C|$ during pretraining.
  Attribute              |C| = 16        |C| = 64        |C| = 256       |C| = 1024
  5 o'Clock Shadow       89.93 ± 0.74    90.57 ± 0.75    91.46 ± 0.50    92.76 ± 0.54
  Arched Eyebrows        77.73 ± 0.75    79.49 ± 0.88    81.35 ± 1.15    83.22 ± 0.80
  Attractive             78.74 ± 1.01    80.08 ± 0.96    81.25 ± 0.75    83.27 ± 1.18
  Bags Under Eyes        81.86 ± 0.85    83.53 ± 0.72    84.66 ± 0.68    86.51 ± 0.77
  Bald                   98.03 ± 0.15    98.03 ± 0.23    98.20 ± 0.25    98.43 ± 0.21
  Bangs                  85.60 ± 0.53    86.29 ± 0.71    87.05 ± 0.71    88.42 ± 0.62
  Big Lips               77.57 ± 0.92    78.73 ± 0.81    80.31 ± 0.97    82.09 ± 0.87
  Big Nose               80.58 ± 0.74    82.56 ± 0.77    83.77 ± 0.77    85.56 ± 0.75
  Black Hair             78.22 ± 0.83    79.78 ± 1.32    81.32 ± 1.39    83.52 ± 0.93
  Blond Hair             88.23 ± 2.31    89.42 ± 1.77    90.62 ± 2.07    91.94 ± 1.01
  Blurry                 95.11 ± 0.19    95.40 ± 0.26    95.65 ± 0.28    96.31 ± 0.41
  Brown Hair             80.83 ± 0.81    81.82 ± 0.81    83.26 ± 1.00    84.80 ± 0.75
  Bushy Eyebrows         86.53 ± 0.50    87.28 ± 0.51    87.87 ± 0.54    89.46 ± 0.50
  Chubby                 94.53 ± 0.34    94.89 ± 0.35    95.44 ± 0.34    96.06 ± 0.36
  Double Chin            95.76 ± 0.34    95.97 ± 0.35    96.37 ± 0.25    96.86 ± 0.26
  Eyeglasses             94.50 ± 0.53    95.98 ± 1.50    98.30 ± 0.95    98.48 ± 0.76
  Goatee                 94.28 ± 0.38    94.66 ± 0.36    95.00 ± 0.37    95.88 ± 0.44
  Gray Hair              96.03 ± 0.20    96.45 ± 0.30    96.81 ± 0.25    97.41 ± 0.25
  Heavy Makeup           88.08 ± 0.50    88.75 ± 0.64    89.36 ± 0.65    90.49 ± 0.95
  High Cheekbones        77.94 ± 1.87    79.70 ± 1.07    79.98 ± 1.57    82.39 ± 1.33
  Male                   93.15 ± 0.44    93.82 ± 0.43    94.06 ± 0.51    94.73 ± 0.57
  Mouth Slightly Open    71.25 ± 2.10    73.62 ± 1.20    74.77 ± 1.98    77.86 ± 1.53
  Mustache               95.91 ± 0.32    96.15 ± 0.38    96.54 ± 0.33    97.06 ± 0.28
  Narrow Eyes            89.12 ± 0.49    89.53 ± 0.55    89.98 ± 0.53    90.35 ± 0.45
  No Beard               88.48 ± 0.68    89.05 ± 0.78    90.21 ± 0.84    91.82 ± 0.89
  Oval Face              74.54 ± 0.76    76.21 ± 0.82    78.15 ± 0.90    80.82 ± 0.90
  Pale Skin              95.69 ± 0.23    95.94 ± 0.24    96.13 ± 0.31    96.50 ± 0.31
  Pointy Nose            75.61 ± 0.84    77.31 ± 1.03    78.78 ± 1.23    80.96 ± 0.79
  Receding Hairline      92.49 ± 0.39    92.88 ± 0.30    93.23 ± 0.39    93.95 ± 0.42
  Rosy Cheeks            93.79 ± 0.24    94.17 ± 0.30    94.70 ± 0.43    95.42 ± 0.45
  Sideburns              94.80 ± 0.30    95.06 ± 0.32    95.62 ± 0.32    96.33 ± 0.47
  Smiling                79.45 ± 2.54    81.39 ± 1.22    81.34 ± 2.28    83.56 ± 1.77
  Straight Hair          80.55 ± 0.57    81.64 ± 0.60    82.95 ± 0.75    84.46 ± 0.77
  Wavy Hair              76.42 ± 0.76    78.09 ± 0.98    80.24 ± 1.35    82.22 ± 0.92
  Wearing Earrings       83.72 ± 0.59    84.65 ± 0.77    85.74 ± 0.80    87.08 ± 0.75
  Wearing Hat            97.07 ± 1.46    97.03 ± 1.28    98.83 ± 0.40    98.95 ± 0.25
  Wearing Lipstick       90.96 ± 0.38    91.52 ± 0.58    91.87 ± 0.58    92.57 ± 0.71
  Wearing Necklace       87.93 ± 0.41    88.51 ± 0.47    89.23 ± 0.70    90.37 ± 0.59
  Wearing Necktie        95.14 ± 0.24    95.56 ± 0.33    96.10 ± 0.30    96.44 ± 0.47
  Young                  83.45 ± 1.34    84.93 ± 1.12    85.79 ± 0.78    87.85 ± 0.65

B.7 OUT-OF-DISTRIBUTION GENERALIZATION: THE OFFICIAL CELEBA SPLIT

In the main paper, we consider a new CelebA split where train and test sets have overlapping identities, which allows us to test in-distribution performance. In this subsection, we look at out-of-distribution generalization using the official CelebA split, where train, validation, and test sets contain distinct identities. In other words, we now consider a harder form of generalization by testing agents on identities never seen during training. Results are shown in Table 8. We observe a pattern similar to the in-distribution results shown in Table 2: there is no benefit from the population alone, as "best 1 pair" achieves better unseen-identity generalization than "10 pairs" and "50 pairs". However, with imitation and voting, agents develop languages that generalize better to identities never seen during training. The setting "50 pairs + imitation + vote" attains the best performance, with 93.75% success when communicating about identities unseen at training time.

Table 8: Generalization performance on the official CelebA split, in %. Here we look at out-of-distribution generalization, as train and test sets contain different identities.
For each setting, we report the mean over 10 seeds; ± denotes 1 standard error of the mean.
  Setting                        Generalization
  best 1 pair                    92.60
  1 pair                         90.08 ± 0.54
  10 pairs                       90.42 ± 0.55
  10 pairs + imitation           92.01 ± 0.40
  10 pairs + imitation + vote    92.37 ± 0.40
  50 pairs                       90.91 ± 0.37
  50 pairs + imitation           93.41 ± 0.19
  50 pairs + imitation + vote    93.75 ± 0.18