# Emergent Quantized Communication

Boaz Carmeli, Ron Meir, Yonatan Belinkov*
Technion - Israel Institute of Technology
boaz.carmeli@campus.technion.ac.il, rmeir@ee.technion.ac.il, belinkov@technion.ac.il

Abstract

The field of emergent communication aims to understand the characteristics of communication as it emerges from artificial agents solving tasks that require information exchange. Communication with discrete messages is considered a desired characteristic, for both scientific and applied reasons. However, training a multi-agent system with discrete communication is not straightforward, requiring either reinforcement learning algorithms or relaxing the discreteness requirement via a continuous approximation such as the Gumbel-softmax. Both these solutions result in poor performance compared to fully continuous communication. In this work, we propose an alternative approach to achieve discrete communication: quantization of the communicated messages. Message quantization allows us to train the model end-to-end, achieving superior performance in multiple setups. Moreover, quantization is a natural framework that runs the gamut from continuous to discrete communication. Thus, it sets the ground for a broader view of multi-agent communication in the deep learning era.

1 Introduction

A key aspect of emergent communication systems is the channel by which agents communicate when trying to accomplish a common task. Prior work has recognized the importance of communicating over a discrete channel (Havrylov and Titov 2017; Lazaridou and Baroni 2020; Vanneste et al. 2022). From a scientific point of view, investigating the characteristics of communication that emerges among artificial agents may contribute to our understanding of human language evolution. From a practical point of view, discrete communication is required for natural human-machine interfaces.
Thus, a large body of work has been concerned with enabling discrete communication in artificial multi-agent systems (Foerster et al. 2016; Havrylov and Titov 2017, inter alia). However, the discretization requirement poses a significant challenge to neural multi-agent systems, which are typically trained with gradient-based optimization. Two main approaches have been proposed in the literature for overcoming this challenge, namely using reinforcement learning (RL) algorithms (Williams 1992; Lazaridou, Peysakhovich, and Baroni 2016) or relaxing the discrete communication with continuous approximations such as the Gumbel-softmax (Jang, Gu, and Poole 2016; Havrylov and Titov 2017). The RL approach maintains discreteness, but systems optimized with the Gumbel-softmax typically perform better in this setting. However, Gumbel-softmax training is effectively done with continuous communication. Both discretization approaches perform far worse than a system with fully continuous communication.

*Supported by the Viterbi Fellowship in the Center for Computer Engineering at the Technion.
Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Top: Symbol, word, and message elements for continuous, Gumbel-softmax, and quantized communication modes. Bottom: Accuracy (Y-axis) achieved by the three communication modes vs. number of candidates (X-axis), in the Object game. Continuous communication leads to good performance on the end task but does not use symbols. Gumbel-softmax sends one word per symbol, but requires a recurrent channel and does not work well in practice. Quantized communication enables discrete and successful communication. Detailed channel parameters are provided in Section 5.
The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)

In short, the more discrete the channel, the worse the system's performance. In this work, we propose a new framework for discrete communication in multi-agent systems, based on quantization (Figure 1, top). Drawing inspiration from work on efficient neural network quantization during training and inference (Banner et al. 2018; Wang et al. 2018; Choi et al. 2018), we quantize the message delivered between the agents. We investigate two learning setups: First, training is done with continuous communication, while inference is discretized by quantization, similar to the common scenario when using continuous approximations like the Gumbel-softmax. Second, we investigate the effects of quantizing the messages during both training and inference.

We experimentally validate our approach in multiple scenarios. We consider three different games that fall into the well-known design of referential games, where a sender transmits information about a target object, which a receiver needs to identify (Lewis 2008; Lazaridou, Peysakhovich, and Baroni 2016; Choi, Lazaridou, and De Freitas 2018; Guo et al. 2019). Our objects include synthetic discrete objects, images, and texts. We also experiment with a variant, which we call the classification game, where the receiver needs to identify the class to which the object belongs. In all cases, we find our quantized communication to outperform the standard approach using the Gumbel-softmax by a large margin, often even approaching the performance of fully continuous communication (Figure 1, bottom).

Finally, we investigate the quantized communication by varying the granularity of quantization. This allows us to cover a much wider range of discreteness levels than has previously been possible. We analyze which aspects of the communication channel are most important for accomplishing the agents' task and how they affect the resulting language.
We find that quantization, even an extreme one, works surprisingly well given a long enough message. Evidently, quantized communication with a binary alphabet performs almost as well as continuous communication. In summary, this work develops a new framework for discrete communication in multi-agent systems, setting the ground for a broader investigation of emergent artificial communication and facilitating future work on interfacing with these systems.

2 Background

We begin with a formal definition of the emergent multi-agent communication setup (Lazaridou and Baroni 2020). In this setup, a sender and a receiver communicate in order to accomplish a given task. In the referential game, the sender needs to transmit information about a target object, which the receiver uses to identify the object from a set of candidates. In the classification game, the sender again transmits information about an object, but the receiver needs to identify the class the object belongs to, rather than its identity. Notably, the two games require significantly different communication: while in the referential game the sender needs to accurately describe the target, in the classification game the sender needs to describe the target's class (see Appendix A.2 for details).1

1 Appendices are available at https://arxiv.org/abs/2211.02412.
2 We defer details on the type of objects to Section 4.

Figure 2: The emergent communication setup. The Sender network is at the left, the Receiver network is at the bottom right, and m is the communication channel.

Formally, we assume a world $O$ with $|O|$ objects.2 At each turn, $n$ candidate objects $C = \{c_i\}_{i=1}^{n} \subseteq O$ are drawn uniformly at random from $O$. One of them is randomly chosen to be the target $t$, while the rest, $D = C \setminus \{t\}$, serve as distractors. Figure 2 illustrates the basic setup. At each turn, the sender $S$ encodes the target object $t$ via its encoder network $u_\theta$, such that $u_s = u_\theta(t) \in \mathbb{R}^d$ is the encoded representation of $t$.
It then uses its channel network $z_\theta$ to generate a message $m = z_\theta(u_s) = z_\theta(u_\theta(t))$. The channel and message have certain characteristics that influence both the emergent communication and the agents' performance in the game; these are described in Section 2.1.

At each turn, the receiver $R$ encodes each candidate object $c_i$ ($i = 1, 2, \ldots, n$) via its encoder network $u_\phi$, to obtain $u_\phi(c_i)$. We write $U_r \in \mathbb{R}^{|C| \times d}$ to refer to the set of encoded candidate representations, each of dimension $d$. The receiver then decodes the message via its decoder channel network $z_\phi$, obtaining $z_r = z_\phi(m) \in \mathbb{R}^d$. Next, the receiver computes a score matching each of the encoded candidates to the decoded message, calculating prediction scores $\mathbf{t} = \mathrm{softmax}(z_r U_r^\top)$. At test time, the receiver's predicted target object is the one with the highest score, namely $\hat{t} = \arg\max_i \mathbf{t}_i$. During training, the entire system is optimized end-to-end with the cross-entropy loss between the correct target $t$ and the predicted target $\hat{t}$. The trainable parameters are all the parameters of both sender and receiver networks, $\theta$ and $\phi$.

2.1 Communication Elements

A key aspect of the emergent communication setup is the message ($m$ in Figure 2). In this work we compare three communication modes that generate this message: continuous (CN) uses a continuous message, while Gumbel-softmax (GS) and quantized (QT) use a discrete message. We start by describing the communication elements common to all modes, and then provide more details on the unique aspects of each communication mode. Formally, we define three communication elements, namely symbol, word, and message. Figure 1 provides an example of each element.

Symbol is the atomic element of the communication. An alphabet is a collection of symbols. The alphabet size is a parameter of the quantized and Gumbel-softmax communication modes, while continuous communication uses real numbers in $m$, corresponding to an uncountable alphabet.

Word is the basic message element.
A word is represented with a one-dimensional vector. In continuous communication, this vector is composed of floating-point numbers; for quantized communication it is composed of integers; and for Gumbel-softmax communication it is a one-hot vector.

Message is a sequence of one or more words, which the sender sends to the receiver. An instantaneous (Instant) channel is capable of sending (and receiving) only single-word messages, while a Recurrent channel sends (receives) multi-word messages with a recurrent neural network (RNN).

2.2 Communication Modes

In this section we describe two known communication modes: continuous and Gumbel-softmax. These communication modes serve as baselines. In the following section we describe our quantized communication.

Continuous Communication. In continuous communication, words are represented with floating-point vectors (see Figure 1). Though continuous, one may think of each vector element as if it represents a symbol, and the vector itself represents a word. Continuous communication is expected to lead to good performance, provided that the channel has sufficient capacity. With continuous communication, the system can easily be trained end-to-end with back-propagation.

Gumbel-softmax. The Gumbel-softmax is a continuous approximation of a categorical distribution. In the communication context, a discrete message is approximated via a sampling procedure. Details are given elsewhere (Havrylov and Titov 2017; Jang, Gu, and Poole 2016) and implementation specifics are provided in Appendix A.4. The end result is a continuous message, where each word has the size of the alphabet and holds one (approximate) symbol. This allows for end-to-end optimization with gradient methods, and for discrete communication at inference time. However, the channel capacity is limited, and a large alphabet size is both inefficient (due to the need to sample from a large number of categories) and does not perform well in practice.
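To make the sampling procedure concrete, here is a minimal, framework-free sketch of Gumbel-softmax sampling (the function name, the plain-Python setting, and the fixed temperature are our own; practical systems implement this in an autodiff framework and often anneal the temperature):

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rand=random.random):
    """One relaxed sample from a categorical distribution.

    Each logit is perturbed with Gumbel noise g = -log(-log(u)),
    u ~ Uniform(0, 1); a temperature-scaled softmax over the
    perturbed logits then yields a probability vector that
    approaches a one-hot "word" as tau -> 0.
    """
    noisy = [l - math.log(-math.log(rand())) for l in logits]
    m = max(noisy)                                    # for numerical stability
    exps = [math.exp((v - m) / tau) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# At a low temperature the sample is nearly one-hot: a single
# (approximate) symbol of the alphabet, as described above.
word = gumbel_softmax([2.0, 0.5, -1.0], tau=0.1)
```

Because each word is (approximately) one-hot over the alphabet, the word length grows with the alphabet size, which is one source of the inefficiency noted above.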
3 Quantized Communication

Quantization techniques aim to reduce model size and computation cost while maintaining a level of performance similar to the original model (Banner et al. 2018). The key idea of quantization is to replace floating-point representations of model weights and/or activations with integers. We emphasize that, while quantization has a specific purpose in mind (efficiency), it renders the neural network discrete by definition. Allowing gradients to flow through the network during back-propagation enables end-to-end gradient-based optimization of the network with off-the-shelf optimizers.

Algorithm 1: Quantizing continuous communication

    msg: continuous message          # floating-point vector
    S: scaling factor                # sets the alphabet range

    procedure NORMALIZE(msg)
        min_elem ← min(msg.elements)
        max_elem ← max(msg.elements)
        msg ← (msg.elements − min_elem) / (max_elem − min_elem)
        return msg
    end procedure

    procedure QUANTIZE(msg)
        msg ← NORMALIZE(msg)
        s_msg ← msg / S              # scale to range
        qt_msg ← quantize(s_msg)     # integer vector
        deqt_msg ← dequant(qt_msg)   # rounded floats
        discrete_msg ← deqt_msg      # kept for logging
        deqt_msg ← deqt_msg * S      # scale back
        return (deqt_msg, discrete_msg)
    end procedure

    msg ← QUANTIZE(msg)

3.1 Quantized Communication Method

We follow the quantization definition and notation provided by Gholami et al. (2021). The quantization operator is defined by $Q(r) = \mathrm{Int}(r/S) - Z$, where $r$ is a real-valued floating-point tensor, $S$ is a real-valued scaling scalar, and $Z$ is an integer zero point, which we set to zero. The $\mathrm{Int}(\cdot)$ function maps a real value to an integer value through a rounding operation (e.g., round-to-nearest or truncation). This operator, also known as uniform quantization (Gholami et al. 2021), results in quantized values that are uniformly spaced.3 One can recover floating-point values from the quantized values $Q(r)$ through dequantization, $\hat{r} = S(Q(r) + Z)$.
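As an illustration, the quantize/dequantize round trip of Algorithm 1 can be sketched in a few lines of plain Python. This is a sketch under the paper's definitions (with $\alpha = 0$, $\beta = 1$, $Z = 0$); the function name and the guard against constant messages are our own additions:

```python
def quantize_message(msg, alphabet_size):
    """Uniform quantization of a continuous message (cf. Algorithm 1).

    Values are min-max normalized to [0, 1], the scaling factor is
    S = (beta - alpha) / |v| with |v| the alphabet size, symbols are
    Q(r) = Int(r / S) (round to nearest), and dequantization recovers
    the rounded floats r_hat = S * Q(r).
    """
    lo, hi = min(msg), max(msg)
    span = (hi - lo) or 1.0                    # guard against constant messages
    norm = [(x - lo) / span for x in msg]      # NORMALIZE: map into [0, 1]
    S = 1.0 / alphabet_size                    # scaling factor
    symbols = [round(x / S) for x in norm]     # integer symbols Q(r)
    dequant = [s * S for s in symbols]         # rounded floats r_hat
    return symbols, dequant

# The receiver sees only what survives the rounding step:
symbols, dequant = quantize_message([0.31, -1.40, 0.87, 0.05], alphabet_size=10)
```

The mismatch between the normalized values and `dequant` is bounded by $S/2$, which is how the alphabet size controls the granularity of the channel.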
Obviously, the recovered real values $\hat{r}$ will not exactly match $r$, due to the rounding operation. This rounding mismatch is a core difference between continuous and quantized communication. The quantization operator's scaling factor $S$ essentially divides a given range of real values $r$ into a number of partitions. Specifically, we define the scaling factor to be $S = (\beta - \alpha)/|v|$, and we set $|v|$ to be the alphabet size. In this work we normalize message values to the range $[0, 1]$, thus $\beta = 1$, $\alpha = 0$. Empirically, message normalization improves results for both quantized and continuous communication. Notably, the rounding error of $\hat{r}$ scales linearly with $S$, i.e., inversely with the alphabet size. This procedure results in a quantization algorithm, presented in Algorithm 1, which maps each message to a set of symbols from the alphabet.

The quantization algorithm allows fine-grained control over channel capacity. Capacity can be controlled by both the alphabet size and the word length. The total number of unique words allowed by the channel is given by channel-capacity = alphabet-size^word-length.3

3 Future work may explore communication with non-uniform quantization schemes (Gholami et al. 2021).

3.2 Training With Quantization

Notably, one may choose to apply the quantization algorithm during both training and inference, or only during inference. Quantization only during inference is similar to the basic Gumbel-softmax setup described above, where training is done with a continuous approximation and inference is discrete. Quantization during training makes the system non-differentiable due to the rounding operation. In this case, we use the straight-through estimator (STE; Bengio, Léonard, and Courville 2013), which approximates the non-differentiable rounding operation with an identity function during the backward pass.
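The STE can be made explicit with a hand-written forward/backward pair. This is a conceptual, framework-free sketch (the function names are ours; in autodiff frameworks the same effect is typically achieved by detaching the rounding residual from the computation graph):

```python
def quantize_forward(x):
    """Forward pass: real rounding, whose true gradient is zero
    almost everywhere (and undefined at the step points)."""
    return [float(round(v)) for v in x]

def quantize_backward(upstream_grad):
    """Backward pass with the straight-through estimator: rounding is
    treated as the identity, so upstream gradients pass through
    unchanged and the sender keeps receiving a useful training signal."""
    return list(upstream_grad)

y = quantize_forward([0.2, 1.7, 3.0])    # discrete values on the channel
g = quantize_backward([0.1, -0.3, 0.4])  # gradients as if no rounding happened
```

The mismatch between the two passes introduces a bias, but it is small when the rounding error is small, i.e., for fine-grained quantization.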
This is similar to what is known in the Gumbel-softmax literature (Jang, Gu, and Poole 2016) as the straight-through option, where the softmax is replaced with argmax during the forward pass.

4 Experimental Setup

4.1 Games and Datasets

We run our experiments on four games from three datasets.

Synthetic objects. This dataset is based on Egg's object game (Kharitonov et al. 2019). Each object has 4 attributes. Each attribute has 10 possible values, and different attributes share the same set of values. Thus, the dataset contains $10^4$ objects, which are uniquely identified by four discrete values.

Images. We use the Egg implementation of the image game from Lazaridou, Peysakhovich, and Baroni (2016).4 The dataset contains images from ImageNet (Deng et al. 2009). The training set contains 46,184 images, distributed evenly over 463 classes, out of which we randomly choose 8032. The validation and test sets have 67,521 images each, split over the same classes. We randomly choose distractors from classes other than the target's.

Texts. We use a short-text dataset named Banking77 (Casanueva et al. 2020), which we refer to as Sentences. It contains 10,000 sentences, classified into 77 classes, each represented with a meaningful class name. The sentences are user queries to an online customer-support banking system, while the classes are user intents. We use the Sentences dataset for two different games: Sent-Ref is a referential game, that is, the receiver needs to identify the sentence. Sent-Cls is a classification game, where the receiver receives a set of candidate classes and needs to identify the class of the target sent by the sender.

Data Splits. In all experiments we split the data 80/10/10 into training, validation, and test sets, respectively. For the Image and Sent-Ref games, both targets (sender-side objects) and candidates (receiver-side) are mutually exclusive across splits.
For the Object and Sent-Cls games, targets are mutually exclusive while candidates are shared across splits. Table 1 provides summary statistics of the datasets.

4 https://dl.fbaipublicfiles.com/signaling_game_data

| Dataset  | #Objects | #Train | #Valid | #Test | Max |
|----------|----------|--------|--------|-------|-----|
| Object   | 10K      | 8000   | 1000   | 1000  | 10K |
| Image    | 181K     | 8032   | 1024   | 1024  | 100 |
| Sent-Ref | 10K      | 7997   | 1001   | 1001  | 77  |
| Sent-Cls | 10K/77   | 7953   | 1004   | 1042  | 77  |

Table 1: Sizes of datasets and splits for each game. 77 is the number of classes in the Banking dataset. Max is the maximum number of candidates used for evaluating the game.

4.2 Agents' Architecture

Encoding Agents. We refer to the $u_\theta$ and $u_\phi$ networks (Section 2) as the sender and receiver encoding agents, respectively. For the Object and Image games, we follow the architecture provided by the Egg implementation (Kharitonov et al. 2019). The agents in the Object game use a single fully-connected (FC) layer to encode the objects. The agents in the Image game use an FC network followed by two convolutional layers and a second FC layer. In the Image game, the sender uses all candidates for encoding the target (referred to as an informed sender by Lazaridou, Peysakhovich, and Baroni (2016)). In all other games the sender encodes only the target object. For the Sentence games (both referential and classification), we use a distilbert-base-uncased model (Sanh et al. 2019) from Huggingface (Wolf et al. 2020) as the sentence encoder, without any modification.5 Appendix A.4 provides more details.

Communication Channels. We refer to the $z_\theta$ and $z_\phi$ networks (Section 2) as the sender and receiver channels, respectively. We experiment with two architectures for the communication channel: Instant and Recurrent. Instant simply passes the sender's encoded representation of the target through an FC network to scale it to the word length before sending it to the receiver.
The receiver's Instant channel decodes the message with an FC feed-forward network and compares it with the candidates' encoded representations as described in Section 2. The Recurrent channel enables sending and receiving multi-word messages. We adapt the Recurrent channel implemented in Egg to work with continuous and quantized communication. More details on channel configuration are provided in Appendix A.4.

4.3 Number of Candidates and Distractors

Most earlier emergent communication setups use a limited number of distractors (Mu and Goodman 2021; Li and Bowling 2019). Recent work (Chaabouni et al. 2021; Guo et al. 2021) reports the effect that an increased number of distractors has on accuracy. The number of distractors affects results in two complementary ways. On the one hand, adding more distractors renders the receiver's task harder during inference. On the other hand, during training, distractors serve as negative samples, which are known to improve learning (Mitrovic, McWilliams, and Rey 2020). Based on these observations, our experimental environment lets us decouple the number of negative examples used during training from the number of distractors used for evaluation. In all our experiments we train the system with a large number of negative samples (serving as distractors) and report results on an increasing number of candidates, always including the target as one of them.

5 Importantly, in this work we aim to evaluate communication performance across various settings, and not necessarily to find the best-performing encoding network. Nevertheless, we find that our setup achieves close to state-of-the-art results in the Sent-Cls game, as shown in Section 5.

4.4 Evaluation Metrics

In this work we report prediction accuracy as the main metric. Similar to Guo et al. (2021), we observe a correlation between the number of unique messages (No UM) and accuracy, and report this measurement as well. Recent work (Chaabouni et al. 2021; Yao et al.
2022) reports that the popular topographic similarity metric (Brighton and Kirby 2006; Lazaridou et al. 2018) does not correlate well with accuracy, especially when measured with a large number of distractors. We observed the same effect in our work, so we refrain from reporting this metric.

4.5 Training Details

We performed hyper-parameter tuning on the validation set and report results on the test set. As the systems have many hyper-parameters to tune, we kept most parameters fixed across setups. However, we ran an extensive hyper-parameter search over alphabet size and message length for the Gumbel-softmax communication to ensure that we report the best possible results for this communication mode. Quantized communication required only minimal tuning and still outperformed Gumbel-softmax across all setups. We report more details on configurations and hyper-parameters in Appendix A.4. Each experiment took under 24 hours on a single V100 GPU and a CPU with 128GB RAM. We run each experiment three times with different random seeds and report average performance. Variance is generally small (Appendix A.1).

5 Results

We first experiment with quantization only during inference and compare its performance to continuous and Gumbel-softmax communication. Our main results are presented in Figure 3.6 The graphs show test accuracy (Y-axis) against the number of candidates (X-axis) for the three communication modes over two channel architectures for the four games. As expected, performance generally degrades when increasing the number of candidates in most setups. This is especially evident in Gumbel-softmax (GS) communication (green lines), while the continuous (CN, red) and quantized (QT, blue) communication modes scale much more gracefully with the number of candidates. Notably, quantized communication is on par with fully continuous communication in all cases.

6 These results are with the best-tuned configurations: word length of 100 for the continuous (CN) and quantized (QT) modes, except for QT-RNN in Sent-Cls, where word length is 10. Alphabet size of QT is 10 in all configurations. For Gumbel-softmax (GS), alphabet size is 10, 50, 100, and 10 for the RNN channel, and 100, 50, 100, and 100 for the Instant channel, for the Object, Image, Sent-Ref, and Sent-Cls games, respectively. Section 5.2 and Appendix A.3 provide results with a range of possible configurations.

Figure 3: Communication results for four games for Instant (Ins) and Recurrent (RNN) channels, using quantization only during inference. The CN and QT results are essentially the same, thus overlapping in the plots. Channel parameters for the experiments are provided in footnote 6.

Considering the different games: in the Object game, continuous and quantized communication perform perfectly, while Gumbel-softmax suffers from a large number of candidates. In the other games, there is a slight performance degradation with continuous and quantized communication.

Next, we compare the performance of communication using Instant vs. Recurrent channels. Recall that the Instant channel has a much more limited capacity (each message is one word long) compared to the Recurrent channel (each message is made of multiple words). The Gumbel-softmax communication suffers most clearly from the Instant channel: in all but the Image game, it performs worse than Gumbel-softmax with the Recurrent channel. The gap is especially striking in the Object game (compare green dashed and solid lines).
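These capacity gaps follow directly from the channel-capacity formula of Section 3.1 and are easy to check numerically (the three-word Recurrent message length below is an illustrative assumption, not a reported configuration):

```python
def channel_capacity(alphabet_size, word_length):
    """Number of unique messages the channel can express:
    channel-capacity = alphabet_size ** word_length (Section 3.1)."""
    return alphabet_size ** word_length

# A 100-symbol Instant channel sends single-word messages, so it can
# name at most 100 of the 10^4 objects in the Object game (1%),
# while a Recurrent channel with, say, 3 words has capacity to spare.
instant = channel_capacity(100, 1)
recurrent = channel_capacity(100, 3)
```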
The poor performance of Gumbel-softmax with the Instant channel can be explained by the limited channel capacity: the sender in this case can generate up to 100 unique messages (each containing just a single symbol), which are just 1.0% of the unique objects in the Object game. Thus it has little hope of correctly identifying the target object. One might hope that a Recurrent channel would help Gumbel-softmax perform better, as it has the required capacity to uniquely represent all objects ($10^6$ unique messages). However, even with this capacity, performance on a high number of candidates is low. We attribute the poor performance to the difficulty of optimizing RNNs with discrete communication. This might also explain why Gumbel-softmax with an Instant channel works better than with a Recurrent channel in the Image game.

| Game  | Instant, Train+Inf (2 / Max) | Instant, Only Inf (2 / Max) | Recurrent, Train+Inf (2 / Max) | Recurrent, Only Inf (2 / Max) |
|-------|------------------------------|-----------------------------|--------------------------------|-------------------------------|
| Obj   | 1.0 / 1.0                    | 1.0 / 1.0                   | 1.0 / 1.0                      | 1.0 / 1.0                     |
| Img   | 1.0 / .96                    | .99 / .96                   | .99 / .68*                     | .99 / .67                     |
| S-Ref | 1.0 / .95                    | .99 / .95                   | 1.0 / .99                      | 1.0 / .99                     |
| S-Cls | .99 / .92                    | .99 / .92                   | .99 / .92                      | .99 / .93                     |

Table 2: Accuracy with quantization during both training and inference (Train+Inf) and quantization only during inference (Only Inf), for Instant and Recurrent channels, with 2 candidates and with the maximum number of candidates in each game (Max). Reported results are averages of three runs. Standard deviations for all reported results are under 0.01, apart from the entry marked with *, which is under 0.05.

In contrast to Gumbel-softmax, quantized communication does not suffer from the limited-capacity problem, nor from the optimization difficulty. In both Instant and Recurrent channels, quantized communication leads to excellent performance, even in the face of a large number of candidates. We note in passing that for the sentence classification game (Sent-Cls), we obtain results that are on par with state-of-the-art classification results for this dataset (Qi et al. 2020; Zhang et al.
2021), even though we use a very different setup, that of communication between agents.

5.1 Quantization During Training

So far we have reported results where quantization is applied only at inference time. Here we compare this with applying quantization also during training. Table 2 reports accuracy results for the two settings, using either two or the maximum number of candidates (varying by game). As seen, results are on par for all games and all communication settings, whether quantization is used during training and inference or only at inference. The quantized communication achieves perfect or near-perfect accuracy (>0.99) in setups with 2 candidates. Accuracy surpasses 92% for all games apart from the Image game, even when the receiver has to discriminate between the maximum number of candidates. The lower Image game results are attributed to the use of suboptimal convolutional networks at both the sender and receiver.

5.2 Communication Analysis

Figure 4 analyzes the quantized communication results for the Object game over an Instant channel. Results are obtained from a test set with 1000 unique targets. Appendix A.3 provides a similar analysis for the other games, showing largely consistent results. The top heatmaps show performance in various settings, organized according to word length (X-axis) and alphabet size (Y-axis). The bottom heatmaps show the number of unique messages (No UM) sent by the sender in each configuration (1000 max). We compare quantization during both training and inference (left heatmaps) with quantization only during inference (right heatmaps).

Figure 4: Accuracy (top) and number of unique messages (No UM, bottom) as a function of alphabet size and word length for the Object game with an Instant channel, comparing quantization during both training and inference (left) with quantization only during inference (right).

As seen, quantization
during both training and inference performs slightly better than quantization only during inference. Clearly, increasing the channel capacity (moving toward the bottom-right corner of the heatmaps) improves accuracy and increases the No UM sent by the sender, up to a maximum of 1000. Increasing word length (moving right) improves results substantially for all alphabet sizes, and reaches maximal performance at a length of 50 for all alphabet sizes when quantization is done during both training and inference, and for alphabet sizes larger than 4 for quantization during inference only (top right). Interestingly, increasing the alphabet size (moving down in the heatmaps) has a smaller effect. With long enough words, the system performs almost optimally even with a very small alphabet (e.g., 2 or 4 symbols), resembling findings by Freed et al. (2020). As seen by comparing the top and bottom heatmaps, the No UM correlates with performance. Interestingly, having a No UM equal to the number of unique targets is a necessary but not sufficient condition for perfect performance.

Finally, we compare the number of unique messages in the Gumbel-softmax and quantized communication modes across the four games (Table 3). The number of unique messages generated by quantized communication equals (in the Object and Sent-Ref games) or almost equals (in the Image and Sent-Cls games) the number of unique targets. In contrast, Gumbel-softmax does not generate enough unique messages. Gumbel-softmax with a Recurrent channel produces many more messages than with the Instant channel. However, only in the Object game does it generate nearly enough unique messages.
It is noteworthy that the sender in the Sent-Cls game does not need to generate a unique message for every target in order to solve the game optimally.

| Game     | #Targets | GS (Instant) | QT (Instant) | GS (Recurrent) | QT (Recurrent) |
|----------|----------|--------------|--------------|----------------|----------------|
| Object   | 1000     | 58           | 1000         | 948            | 1000           |
| Image    | 1024     | 16           | 1016         | 391            | 1016           |
| Sent-Ref | 1001     | 8            | 1001         | 116            | 1001           |
| Sent-Cls | 1042     | 20           | 1042         | 125            | 1035           |

Table 3: Number of test-set targets and unique messages for GS and QT communication modes, with Instant and Recurrent channels. Channel parameters are provided in footnote 6.

6 Related Work

The field of emergent communication has gained renewed interest in recent years with the advances of deep neural networks and natural language processing (Lazaridou and Baroni 2020). Despite significant advances, approaches for generating discrete messages from neural networks remain scarce. Multi-agent reinforcement learning (RL) and the Gumbel-softmax (GS) are the two alternative approaches used by the community.

6.1 Multi-Agent Reinforcement Learning

The work by Foerster et al. (2016) is probably the first to suggest methods for learning multi-agent communication. Many studies in the emergent communication field (Lazaridou, Potapenko, and Tieleman 2020) use RL and variants of the REINFORCE algorithm (Williams 1992) for solving the referential game (Foerster et al. 2016; Lazaridou, Peysakhovich, and Baroni 2016; Chaabouni et al. 2021). Vanneste et al. (2022) provide a comprehensive review of the various ways to overcome the discretization issue within multi-agent environments. Notably, they find that none of the surveyed methods is best in all environments, and that the optimal discretization method greatly depends on the environment. Somewhat close to our approach, Freed et al. (2020) propose an elaborate stochastic quantization procedure, which relies on adding stochastic noise as part of an encoding/decoding procedure, and evaluate it on pathfinding and search problems.
In contrast, our approach is simple and deterministic, and works exceptionally well in the referential and classification games.

6.2 Gumbel-Softmax Communication

Gumbel-softmax (Jang, Gu, and Poole 2016) enables discrete communication by sampling from a categorical Gumbel distribution. It allows gradients to flow through this non-differentiable distribution by replacing it with a differentiable sample from a Gumbel-softmax distribution. Havrylov and Titov (2017) compare communication with RL and Gumbel-softmax and observe that the latter converges much faster and results in more effective protocols. Since then, many studies have used Gumbel-softmax in the emergent communication setup (Resnick et al. 2019; Guo et al. 2021; Mu and Goodman 2021; Dessì, Kharitonov, and Baroni 2021), as it is easier to work with than RL-based methods and can be trained end-to-end with gradient descent and back-propagation. Though widely used as the default method for overcoming the discretization difficulty, Gumbel-softmax still suffers from at least two severe limitations. First, it uses a one-hot vector to encode symbols, limiting capacity. Second, it requires sampling from a distribution, making optimization more expensive and less accurate.

6.3 Quantization

In the context of neural networks, quantization is a method for reducing model size and computation cost while maintaining performance. More generally, quantization, as a method to map input values in a large (often continuous) set to output values in a small (often finite) set, has a long history (Gray and Neuhoff 1998). The fundamental role of quantization in modulation and analog-to-digital conversion was first recognized during the early development of pulse-code modulation systems, especially in the work of Oliver, Pierce, and Shannon (1948), and in Shannon's (1948) seminal work, which presented the quantization effect and its use in coding theory.
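To make such a mapping concrete, the following is a minimal sketch of a uniform quantizer that snaps each continuous message element to one of a small set of levels. The range [-1, 1], the evenly spaced levels, and all names here are our own illustration, not the paper's exact scheme:

```python
import numpy as np

def quantize(msg, alphabet_size, lo=-1.0, hi=1.0):
    """Map each continuous message element to the nearest of
    `alphabet_size` evenly spaced levels in [lo, hi], returning the
    quantized values and their integer symbol indices.
    NOTE: illustrative sketch; parameters are assumptions."""
    levels = np.linspace(lo, hi, alphabet_size)
    idx = np.abs(np.asarray(msg)[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx], idx

values, symbols = quantize([0.10, -0.23, 0.69], alphabet_size=8)
# In end-to-end training, the non-differentiable rounding can be
# bypassed on the backward pass (straight-through), e.g. in PyTorch:
#   msg + (quantized - msg).detach()
```

Each element moves by at most half the level spacing, i.e. (hi - lo) / (2 * (alphabet_size - 1)), so a larger alphabet gives a finer approximation of the continuous message.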
Recently, intensive research on quantization has shown great and consistent success in both training and inference of neural networks using 8-bit number representations, and even fewer bits (Banner et al. 2018; Wang et al. 2018; Choi et al. 2018). In particular, breakthroughs in half-precision and mixed-precision training (Courbariaux, Bengio, and David 2014; Gupta et al. 2015) contributed significantly to vast performance improvements. Notably, moving from floating-point to integer computation renders many operations non-differentiable. To overcome this subtlety, a straight-through estimator (STE) (Bengio, Léonard, and Courville 2013) is often used. The STE approximates the non-differentiable rounding operation with an identity function during back-propagation, thus enabling end-to-end model training.

7 Conclusions

Research on emergent communication between artificial agents strives for discrete communication. However, common methods such as continuous relaxations via Gumbel-softmax lag far behind continuous communication in terms of performance on the agents' task. In this work we propose an alternative approach that achieves discrete communication via message quantization, while enabling simple end-to-end training. We show that our quantized communication allows us to run the gamut from continuous to discrete communication by controlling the quantization level, namely, the alphabet size and the word length. When applying quantization we observe extremely good results, even for the smallest possible alphabet size, given a long enough word length. Future work may explore more elaborate quantization schemes for message discretization, during either training or inference. We believe that the quantization approach offers a good test bed for investigating emergent communication in multi-agent systems.
Acknowledgements

The work of RM was partially supported by the Skillman chair in biomedical sciences, and by the Ollendorff Center of the Viterbi Faculty of Electrical and Computer Engineering at the Technion. The work of YB was partly supported by the ISRAEL SCIENCE FOUNDATION (grant No. 448/20) and by an Azrieli Foundation Early Career Faculty Fellowship.

References

Banner, R.; Hubara, I.; Hoffer, E.; and Soudry, D. 2018. Scalable methods for 8-bit training of neural networks. Advances in Neural Information Processing Systems, 31.
Bengio, Y.; Léonard, N.; and Courville, A. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
Brighton, H.; and Kirby, S. 2006. Understanding linguistic evolution by visualizing the emergence of topographic mappings. Artificial Life, 12(2): 229–242.
Casanueva, I.; Temčinas, T.; Gerz, D.; Henderson, M.; and Vulić, I. 2020. Efficient intent detection with dual sentence encoders. arXiv preprint arXiv:2003.04807.
Chaabouni, R.; Strub, F.; Altché, F.; Tarassov, E.; Tallec, C.; Davoodi, E.; Mathewson, K. W.; Tieleman, O.; Lazaridou, A.; and Piot, B. 2021. Emergent communication at scale. In International Conference on Learning Representations.
Choi, E.; Lazaridou, A.; and De Freitas, N. 2018. Compositional obverter communication learning from raw visual input. arXiv preprint arXiv:1804.02341.
Choi, J.; Chuang, P. I.-J.; Wang, Z.; Venkataramani, S.; Srinivasan, V.; and Gopalakrishnan, K. 2018. Bridging the accuracy gap for 2-bit quantized neural networks (QNN). arXiv preprint arXiv:1807.06964.
Courbariaux, M.; Bengio, Y.; and David, J.-P. 2014. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.
Dessì, R.; Kharitonov, E.; and Baroni, M. 2021. Interpretable agent communication from scratch (with a generic visual processor emerging on the side). Advances in Neural Information Processing Systems, 34: 26937–26949.
Foerster, J.; Assael, I. A.; De Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 29.
Freed, B.; Sartoretti, G.; Hu, J.; and Choset, H. 2020. Communication learning via backpropagation in discrete channels with unknown noise. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 7160–7168.
Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M. W.; and Keutzer, K. 2021. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630.
Gray, R. M.; and Neuhoff, D. L. 1998. Quantization. IEEE Transactions on Information Theory, 44(6): 2325–2383.
Guo, S.; Ren, Y.; Havrylov, S.; Frank, S.; Titov, I.; and Smith, K. 2019. The emergence of compositional languages for numeric concepts through iterated learning in neural agents. arXiv preprint arXiv:1910.05291.
Guo, S.; Ren, Y.; Mathewson, K.; Kirby, S.; Albrecht, S. V.; and Smith, K. 2021. Expressivity of Emergent Language is a Trade-off between Contextual Complexity and Unpredictability. arXiv preprint arXiv:2106.03982.
Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; and Narayanan, P. 2015. Deep learning with limited numerical precision. In International Conference on Machine Learning, 1737–1746. PMLR.
Havrylov, S.; and Titov, I. 2017. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. Advances in Neural Information Processing Systems, 30.
Jang, E.; Gu, S.; and Poole, B. 2016. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.
Kharitonov, E.; Chaabouni, R.; Bouchacourt, D.; and Baroni, M. 2019. EGG: a toolkit for research on Emergence of lanGuage in Games. arXiv preprint arXiv:1907.00852.
Lazaridou, A.; and Baroni, M. 2020. Emergent multi-agent communication in the deep learning era. arXiv preprint arXiv:2006.02419.
Lazaridou, A.; Hermann, K. M.; Tuyls, K.; and Clark, S. 2018. Emergence of linguistic communication from referential games with symbolic and pixel input. arXiv preprint arXiv:1804.03984.
Lazaridou, A.; Peysakhovich, A.; and Baroni, M. 2016. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182.
Lazaridou, A.; Potapenko, A.; and Tieleman, O. 2020. Multi-agent communication meets natural language: Synergies between functional and structural language learning. arXiv preprint arXiv:2005.07064.
Lewis, D. 2008. Convention: A philosophical study. John Wiley & Sons.
Li, F.; and Bowling, M. 2019. Ease-of-teaching and language structure from emergent communication. Advances in Neural Information Processing Systems, 32.
Mitrovic, J.; McWilliams, B.; and Rey, M. 2020. Less can be more in contrastive learning. In I Can't Believe It's Not Better! NeurIPS 2020 Workshop.
Mu, J.; and Goodman, N. 2021. Emergent Communication of Generalizations. Advances in Neural Information Processing Systems, 34: 17994–18007.
Oliver, B.; Pierce, J.; and Shannon, C. E. 1948. The philosophy of PCM. Proceedings of the IRE, 36(11): 1324–1331.
Qi, H.; Pan, L.; Sood, A.; Shah, A.; Kunc, L.; Yu, M.; and Potdar, S. 2020. Benchmarking commercial intent detection services with practice-driven evaluations. arXiv preprint arXiv:2012.03929.
Resnick, C.; Gupta, A.; Foerster, J.; Dai, A. M.; and Cho, K. 2019. Capacity, bandwidth, and compositionality in emergent language learning. arXiv preprint arXiv:1910.11424.
Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Shannon, C. E. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27(3): 379–423.
Vanneste, A.; Vanneste, S.; Mets, K.; De Schepper, T.; Mercelis, S.; Latré, S.; and Hellinckx, P. 2022. An Analysis of Discretization Methods for Communication Learning with Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2204.05669.
Wang, N.; Choi, J.; Brand, D.; Chen, C.-Y.; and Gopalakrishnan, K. 2018. Training deep neural networks with 8-bit floating point numbers. Advances in Neural Information Processing Systems, 31.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3): 229–256.
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Le Scao, T.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. Online: Association for Computational Linguistics.
Yao, S.; Yu, M.; Zhang, Y.; Narasimhan, K. R.; Tenenbaum, J. B.; and Gan, C. 2022. Linking Emergent and Natural Languages via Corpus Transfer. arXiv preprint arXiv:2203.13344.
Zhang, J.; Bui, T.; Yoon, S.; Chen, X.; Liu, Z.; Xia, C.; Tran, Q. H.; Chang, W.; and Yu, P. 2021. Few-shot intent detection via contrastive pre-training and fine-tuning. arXiv preprint arXiv:2109.06349.