# Emergent Communication for Numerical Concepts Generalization

Enshuai Zhou1,2,3, Yifan Hao2, Rui Zhang2, Yuxuan Guo1,2,3, Zidong Du2,5, Xishan Zhang2,3, Xinkai Song2, Chao Wang1, Xuehai Zhou1, Jiaming Guo2, Qi Yi1,2,3, Shaohui Peng6, Di Huang2, Ruizhi Chen6, Qi Guo2, Yunji Chen2,4*

1University of Science and Technology of China; 2State Key Lab of Processors, Institute of Computing Technology, CAS; 3Cambricon Technologies; 4University of Chinese Academy of Sciences; 5Shanghai Innovation Center for Processor Technologies; 6Intelligent Software Research Center, Institute of Software, CAS

enszhou@mail.ustc.edu.cn, {haoyifan, cyj}@ict.ac.cn

Research on emergent communication has recently gained significant traction as a promising avenue for the linguistic community to unravel the origins of human language and to explore the generalization capabilities of artificial intelligence. Current research has predominantly concentrated on recognizing qualitative patterns of object attributes (e.g., shape and color) and has paid little attention to the quantitative relationships among object quantities, which are part of numerical concepts. The ability to generalize numerical concepts, i.e., counting and calculating with unseen quantities, is essential, as it mirrors humans' foundational abstract reasoning abilities. In this work, we introduce the Num Game, which leverages the referential game framework and forces agents to communicate and generalize numerical concepts effectively. Inspired by the human learning process of numbers, we present a two-stage training approach that first fosters a rudimentary numerical sense and then the ability to perform arithmetic calculation, ultimately helping agents generate semantically stable and unambiguous language for numerical concepts.
The experimental results indicate impressive generalization to unseen quantities and regularity in the language that emerges from communication.

*Corresponding author. Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

Research on emergent communication has gained widespread attention in recent years (Lazaridou, Peysakhovich, and Baroni 2016; Choi, Lazaridou, and de Freitas 2018; Conklin and Smith 2023). It primarily involves using deep neural networks to simulate communication among multiple agents that must complete collaborative tasks. From the perspectives of linguistics and cognitive psychology, studying emergent communication can provide a new experimental method and may allow specific linguistic and cognitive hypotheses to be validated quickly (Chaabouni et al. 2019; Rita, Chaabouni, and Dupoux 2020). From the standpoint of artificial intelligence, the language that emerges from communication can help agents generalize better on cooperative tasks (Mu and Goodman 2021; Xu, Niethammer, and Raffel 2022).

The ability to generalize numerical concepts, i.e., counting and calculating on unseen quantities, is essential. According to linguistics and cognitive psychology, this ability is considered foundational for human abstract reasoning (Gelman and Gallistel 1986; Wiese 2003). Natural language possesses a comprehensive numerical system that allows humans to describe the number of objects accurately, concisely, and efficiently (Hiraiwa 2017). Furthermore, humans can perform more complex mathematical operations based on numerical concepts and digits, constructing a complete arithmetic system (Dehaene 2011). However, previous research has focused chiefly on recognizing qualitative patterns of object attributes (e.g., shape and color) (Kottur et al. 2017; Kuciński et al. 2021) and has paid little attention to the agents' numerical concepts.
It remains challenging to help agents understand the quantitative relations between numbers (i.e., quantities) through emergent communication.

In this work, we introduce the Num Game, which leverages the referential game framework (Lazaridou, Peysakhovich, and Baroni 2016) and requires agents to communicate and generalize their comprehension of numerical concepts. Specifically, agents are tasked with generalizing (in a few-shot learning manner) over unseen quantities via emergent communication in Num Game, which comprises two core tasks: Counting and Calculating. In the Counting task, agents must precisely evaluate unseen quantities of objects. In the Calculating task, agents must deduce arithmetic relations (including addition, subtraction, and maximization) among unseen quantities. Both tasks in Num Game require the agents to understand quantities and their relations rather than memorize them mechanically, which makes it difficult for the agents' training to converge.

Drawing inspiration from the human learning process of numbers (Wiese 2003; Hiraiwa 2017), we present a two-stage training approach comprising Num Sen and Num Rel. In this approach, we first employ the Num Sen method to foster a rudimentary numerical sense in the agents. Then, we use the Num Rel method to guide the agents toward a foundational understanding of basic arithmetic relations between numbers within a specified range. This progression ultimately enables the agents to generate language that expresses numerical concepts in a semantically stable and unambiguous way, and it facilitates generalization over unseen quantities and the arithmetic relations among them.

The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

To quantitatively evaluate the effectiveness of our methods, we take natural language as the target and use the generalization ability and the regularity of the emergent language as two metrics to assess the agents' understanding of numerical concepts.
We also visualize the language distribution after convergence to help readers better understand its structure. Ultimately, the experimental results demonstrate that, with the two-stage (i.e., Num Sen and Num Rel) training approach, (1) the agents can accurately generalize over unseen quantities in the Counting task (Section 7.2); (2) the agents can perform basic calculations on unseen quantities in the Calculating task (Section 7.3); and (3) the emerged messages between agents exhibit a solid order relation (Section 7.4).

2 Related Work

Human Numerical Concepts. Compared to other animals, humans have a remarkable grasp of numerical concepts (Hauser, Carey, and Hauser 2000; Drucker and Brannon 2014). Only humans possess the ability to use a finite set of numerical symbols to precisely describe quantities of objects and to perform calculations using numbers (Pica et al. 2004; Butterworth 2005; Dehaene 2011). The concept of number is highly significant for humans (Conant 1896; Dehaene 2011) and is considered the foundation of human abstract reasoning and symbolic thinking ability (Gelman and Gallistel 1986; Wiese 2003; Feigenson, Dehaene, and Spelke 2004; Coolidge and Overmann 2012). Many research works suggest that humans' precise grasp of the number concept arises from two main factors: number sense (Wiese 2003; Pica et al. 2004; Dehaene 2011) and human language (Hauser, Chomsky, and Fitch 2002; Wiese 2007; Hiraiwa 2017). Number sense can be divided into two parts: (1) the ability to recognize small quantities exactly, and (2) the ability to approximately recognize the magnitudes of larger quantities (Dehaene 2011). Even in prelinguistic eras, humans possessed number sense, and many animals also exhibit similar numerical abilities (Wiese 2003; Dehaene 2011). However, no animal possesses numerical abilities as powerful as those of humans.
Therefore, having number sense alone is insufficient; human language also plays a crucial role in the development of numerical concepts (Hiraiwa 2017). Human language is a unique communication system based on the recursive combination of a finite set of symbols (Berwick and Chomsky 2016). This property of language may be another fundamental basis for the infinite expressive capacity of the human numerical system (Dehaene 2011). Inspired by these insights, we incorporate number sense and language into the process by which intelligent agents learn numerical concepts.

Emergent Communication. Using the Lewis signaling game (Lewis 1969) to study the emergence of communication in multi-agent systems has recently drawn increasing interest. Classified by motivation, some previous studies focus on how views from cognitive or social science shape emergent communication, such as population heterogeneity (Chaabouni et al. 2019), linguistic complexity (Tucker et al. 2022), and efficiency of language (Chaabouni et al. 2019). Other studies focus on how to improve the quality of the emerged languages, such as compositionality (Conklin and Smith 2023), generalization (Xu, Niethammer, and Raffel 2022; Mu and Goodman 2021), and transferability to downstream tasks (Chaabouni et al. 2022). In these works, the agents are required to extract and convey qualitative concepts, such as an object's shape, color, or location in the image. However, these works overlook the language emergence of quantitative numerical concepts. For example, Feng, An, and Lu (2023) construct a multi-object environment that primarily centers on the positional relations among objects yet maintains a qualitative perspective. Guo et al. (2019) differentiate the target and the distractors based on the number of objects. Yet, in that approach, numbers are merely treated as classification labels, and the intrinsic relations among them are not captured.
In this work, we delve into quantitative concepts, namely the numerical concepts, and explore arithmetic relations among numbers. We propose a scenario in which agents are required to count and calculate quantities, which compels them to comprehend the internal relations between quantities. We also propose a two-stage training method to facilitate their understanding of the numerical concepts.

3 Environment

Based on the referential game, we propose a new game called Num Game, in which the agents are required to communicate the number concept to complete the game. Additionally, we have developed a new dataset called the Num World Dataset to evaluate the agents' performance. In this section, we introduce the Num Game and the Num World Dataset.

3.1 Num Game

Figure 1: The Num Game tasks, which require the agents to count or calculate the quantities of objects. (a) The Counting task. (b) The Calculating task.

Figure 1 illustrates the basic setup of Num Game. The Num Game involves two agents: a speaker S and a listener L. The objective of the game is to count or calculate the number of objects in the images through communication and cooperation between the agents. Figure 1a shows the Counting task, where the speaker is presented with an image $I$ containing $n$ objects of the same category, with $n \in N$. Subsequently, the speaker generates a message $M = \{m_1, \dots, m_l\}$ based on the quantity of the objects in the image. Specifically, $M$ is a sequence of $l$ discrete symbols, where each symbol $m_i$ is a one-hot vector of size $v$, and $v$ is the size of the vocabulary $V$. We regard the message $M$ as an emergent number, analogous to the natural numbers in human language. The listener, on the other hand, receives the message $M$ and uses it to make an informed guess, denoted as $\hat{n}$, about the number of objects in the image $I$.
If the listener's guess matches the actual quantity, i.e., $\hat{n} = n$, the game is considered successful; otherwise, it is deemed to have failed.

The Calculating task is similar to the Counting task. The main difference is that it requires the agents to perform arithmetic calculations on the quantities represented by two images. Figure 1b illustrates the process of the agents collaborating to complete an addition task. The speaker is presented with two images and an arithmetic operator symbol, then generates a message $M$ describing this arithmetic expression and passes it to the listener. The structure of the listener is the same as in the Counting task: it needs to deduce the final result of the calculation from the message.

We focus on the generalization ability of the agents in the Num Game. Specifically, we are interested in the agents' ability to generalize to unseen quantities. To this end, we train the agents on a subset of $N$ and evaluate their performance on the remaining unseen quantities.

3.2 Num World Dataset

The Num World Dataset is developed based on the ShapeWorld dataset (Kuhnle and Copestake 2017), a synthetic dataset for visual reasoning. Within the Num World Dataset, each sample is represented as a tuple $(I, n)$, where $I$ is an image containing $n$ objects of the same category, all set against a black background. The image resolution is $128 \times 128$, and the quantity $n$ varies from 1 to 32. Each object in the dataset possesses 2 controllable attributes: Shape and Color. Both attributes have 5 possible values, and these attribute values jointly determine the category of the object. Additionally, the objects' locations and orientations are randomly generated within each image, and the objects do not overlap, which ensures unambiguous counting. Moreover, as the number of objects increases, the size of the objects diminishes, so that the total pixel area remains approximately constant across all images.
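The dataset constraints above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' generator: the function names `object_side` and `sample_spec`, the `area_fraction` budget of 0.1, and the use of integer indices for shape and color are all hypothetical. Only the 128 × 128 resolution, the 1 to 32 quantity range, the 5 × 5 attribute grid, and the roughly constant total area come from the text.

```python
import math
import random

RESOLUTION = 128      # image side in pixels (from the text)
MAX_QUANTITY = 32     # n ranges over 1..32 (from the text)
NUM_SHAPES = 5        # 5 possible shape values (from the text)
NUM_COLORS = 5        # 5 possible color values (from the text)


def object_side(n, area_fraction=0.1):
    """Side length in pixels of one object when n objects share a fixed area.

    area_fraction is a hypothetical budget: the share of the image covered by
    all objects together. The paper only states that the total area stays
    approximately constant as n grows.
    """
    total_area = area_fraction * RESOLUTION * RESOLUTION
    return math.sqrt(total_area / n)


def sample_spec(rng, quantities=range(1, MAX_QUANTITY + 1)):
    """Draw a sample description: one category (shape, color) and a quantity."""
    n = rng.choice(list(quantities))
    return {
        "shape": rng.randrange(NUM_SHAPES),   # index, not a named shape
        "color": rng.randrange(NUM_COLORS),   # index, not a named color
        "n": n,
        "side": object_side(n),
    }


rng = random.Random(0)
spec = sample_spec(rng)

# The covered area is the same for every n, so each object shrinks as n grows.
areas = [n * object_side(n) ** 2 for n in (1, 8, 32)]
```

Keeping the total area fixed means an agent cannot read the quantity off the amount of non-black pixels, so per-object size scales with the inverse square root of $n$ in this sketch.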
The collection of all possible values for $n$ is denoted as $N = \{1, 2, \dots, 32\}$. Importantly, the range of the quantity $n$ differs across the training and testing stages. We define three sub-datasets, Sen, Lang, and OOD, each covering a distinct quantity range for a specific purpose (refer to Section 6.1 for comprehensive information).

Drawing inspiration from the human learning process of numbers, we propose in this section a two-stage training approach comprising Num Sen and Num Rel.

4.1 Num Sen: Pretrain the Number Sense

Number sense is a crucial ability that lets humans approximate the number of objects even before language acquisition. We therefore believe it is essential to pretrain the speaker's number sense before initiating language training. We formulate the number sense pretraining as a vision-only process for the speaker. The vision encoder of the speaker takes an image $I$ as input and generates a feature vector $f$. Subsequently, a projection head is employed to predict the quantity $n$ of objects present in the image. After pretraining, the vision encoder of the speaker is reused in the language training phase, while the projection head is discarded.

Based on previous linguistics research (Wiese 2003), human number sense exhibits distinct responses to smaller quantities (subitizing, typically less than or equal to 4) and larger quantities (magnitude estimation, usually greater than 4). As a result, we divide the quantities $n$ used for pretraining into two segments: $n \le 4$ and $n > 4$. For the $n \le 4$ segment, we employ all possible quantities $N_0' = \{1, 2, 3, 4\}$ to train the speaker's subitizing ability. For the $n > 4$ segment, we use a subset $N_0'' = \{8, 16, 32\}$ to train the ability to recognize larger quantities.
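The split of pretraining quantities described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names `numsen_quantities` and `filter_for_pretraining` and the parameter names are hypothetical. The larger quantities are generated here as the powers of two above the subitizing limit, which reproduces the subset 8, 16, 32 given in the text.

```python
def numsen_quantities(max_quantity=32, subitizing_limit=4):
    """Return (small, large) quantity sets for number-sense pretraining.

    small: every quantity up to the subitizing limit (exact recognition).
    large: powers of two above the limit, up to max_quantity
           (coarse magnitude estimation).
    """
    small = set(range(1, subitizing_limit + 1))
    large = set()
    q = 1
    while q <= max_quantity:
        if q > subitizing_limit:
            large.add(q)
        q *= 2
    return small, large


def filter_for_pretraining(samples, allowed):
    """Keep only the (image, n) pairs whose quantity is in the allowed set."""
    return [(image, n) for image, n in samples if n in allowed]


small, large = numsen_quantities()
pretrain_set = sorted(small | large)

# Placeholder samples: strings stand in for images in this sketch.
samples = [(f"image_{n}", n) for n in range(1, 33)]
pretrain_samples = filter_for_pretraining(samples, small | large)
```

Only 7 of the 32 quantities are seen in this stage, which leaves most of the range unseen for later generalization tests.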
We use only powers of 2 to train the magnitude estimation ability because we want the agent's number sense to closely resemble that of humans, who typically cannot precisely recognize all larger numbers. Consequently, the set of quantities used for pretraining is $N_0 = N_0' \cup N_0'' = \{1, 2, 3, 4, 8, 16, 32\}$.

4.2 Num Rel: Learn Relations between Numbers

Language is a powerful tool for humans to communicate about numbers. Based on this, we also train the agents to use language (emerging from communication) to communicate numerical concepts. If we only train the agents to communicate a single quantity, then each quantity is essentially treated as a classification label, and the speaker does not need to understand the actual meaning of the numbers or the relations between them. Consequently, training the agents in this manner would not lead to a genuine understanding of the numerical concepts.

Figure 2: The Num Rel training process. The speaker receives two images $I_1$ and $I_2$ and generates the corresponding messages $M_1$ and $M_2$. These messages and an operator $o$ are then concatenated to form the final message $M$. The listener receives this message and outputs a guess, $\hat{n}$, representing the result of the operation $n_1\ o\ n_2$.

Considering how humans learn numbers, simple calculations (e.g., addition and subtraction) play a crucial role in fostering a better understanding of numerical concepts. To address this, we propose a novel approach called Num Rel to train the agents. The Num Rel method involves performing simple arithmetic calculations within a specified quantity range during training, which helps the agents understand the relations between numbers and thereby grasp the numerical concepts. Figure 2 illustrates the Num Rel setup. In Num Rel, two samples, $(I_1, n_1)$ and $(I_2, n_2)$, are randomly selected from the original dataset and combined to form a single sample denoted as $(I_1, I_2, n_1, n_2)$.
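The Num Rel sample construction can be sketched roughly as follows. This is a hedged illustration rather than the authors' implementation: the names `OPERATORS`, `make_numrel_sample`, and `build_message` are hypothetical, the operator set follows the operations the paper names (addition, subtraction, and maximization), and the order in which the two messages and the operator are concatenated is an assumption. The sketch also applies the paper's requirement that the result stays in the same range as the operands by resampling until it does.

```python
import operator
import random

# Operations named in the paper; the string keys are illustrative labels.
OPERATORS = {"add": operator.add, "sub": operator.sub, "max": max}


def make_numrel_sample(dataset, allowed, rng):
    """Pair two (image, n) samples with an operator and compute the target.

    Resamples until the result lies in `allowed`, so the target shares the
    range of the operands.
    """
    while True:
        image1, n1 = rng.choice(dataset)
        image2, n2 = rng.choice(dataset)
        name = rng.choice(sorted(OPERATORS))
        target = OPERATORS[name](n1, n2)
        if target in allowed:
            return {
                "images": (image1, image2),
                "operands": (n1, n2),
                "operator": name,
                "target": target,
            }


def build_message(message1, message2, operator_name):
    """Concatenate both per-image messages with the operator.

    Placing the operator between the messages is an assumption; the paper
    only says the three parts are concatenated.
    """
    return list(message1) + [operator_name] + list(message2)


rng = random.Random(0)
allowed = set(range(1, 9))
dataset = [(f"image_{n}", n) for n in sorted(allowed)]  # placeholder images
sample = make_numrel_sample(dataset, allowed, rng)
message = build_message([3, 1], [2, 2], sample["operator"])
```

Here token indices such as `[3, 1]` stand in for the one-hot symbols the speaker would emit for each image.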
A random operator $o$ is then chosen from a predefined set of operators, and the target number $n$ is calculated as the result of the operation $n_1\ o\ n_2$. Consequently, the sample is further represented as $(I_{1,2},\ n_1\ o\ n_2)$. The speaker generates two distinct messages, $M_1$ and $M_2$, corresponding to $I_1$ and $I_2$, respectively. These messages and the operator $o$ are concatenated to form the final message $M$. Subsequently, the listener receives the message $M$ and outputs a guess, denoted as $\hat{n}$, for the result of the operation $n_1\ o\ n_2$. It is essential to note that the result $n$ of the operation $n_1\ o\ n_2$ shares the same range as the original quantities $n_1$ and $n_2$. This design keeps the Out-of-Distribution (OOD) test fair.

5 Model

As shown in Figure 3, the entire model consists of two components: the speaker and the listener. In Num Game G, the speaker takes an image $I$ as input and generates a conditional distribution over messages $p^S(M \mid I)$, and the listener takes the message $M$ and outputs a distribution over quantities $p^L(\hat{n} \mid M)$. In the following, we introduce the architecture of each component and the optimization method.

Speaker. The speaker takes an image $I$ as input and encodes it into an embedding $E^S$ using a ResNet-50 (He et al. 2016) vision encoder $f^S_{\text{vis}}$, i.e., $E^S = f^S_{\text{vis}}(I)$. Then, a GRU (Chung et al. 2014) message decoder $f^S_{\text{lang}}$ takes the embedding $E^S$ as its initial hidden state $h^S_0$ to generate a sequence of distributions over tokens p S(M|h S 0 ) = Q i p S(mi|m