Published as a conference paper at ICLR 2018

EMERGENT TRANSLATION IN MULTI-AGENT COMMUNICATION

Jason Lee*
New York University
jason@cs.nyu.edu

Kyunghyun Cho
New York University
Facebook AI Research
kyunghyun.cho@nyu.edu

Jason Weston
Facebook AI Research
jase@fb.com

Douwe Kiela
Facebook AI Research
dkiela@fb.com

*Work done while the author was interning at Facebook AI Research.

ABSTRACT

While most machine translation systems to date are trained on large parallel corpora, humans learn language in a different way: by being grounded in an environment and interacting with other humans. In this work, we propose a communication game where two agents, native speakers of their own respective languages, jointly learn to solve a visual referential task. We find that the ability to understand and translate a foreign language emerges as a means to achieve shared goals. The emergent translation is interactive and multimodal, and crucially does not require parallel corpora, but only monolingual, independent text and corresponding images. Our proposed translation model achieves this by grounding the source and target languages in a shared visual modality, and it outperforms several baselines on both word-level and sentence-level translation tasks. Furthermore, we show that agents in a multilingual community learn to translate better and faster than in a bilingual communication setting.

1 INTRODUCTION

Building intelligent machines that can converse with humans is a longstanding challenge in artificial intelligence. Remarkable successes have been achieved in natural language processing (NLP) via the use of supervised learning approaches on large-scale datasets (Bahdanau et al., 2015; Wu et al., 2016; Gehring et al., 2017; Sennrich et al., 2017). Machine translation is no exception: most translation systems are trained to derive statistical patterns from huge parallel corpora. Parallel corpora, however, are expensive and difficult to obtain for many language pairs. This is especially the case for low-resource languages, where parallel texts are often small or nonexistent.

We address these issues by designing a multi-agent communication task, where agents interact with each other in their own native languages and try to work out what the other agent meant to communicate. We find that the ability to translate foreign languages emerges as a means to achieve a common goal.

Aside from the benefit of not requiring parallel data, we argue that our approach to learning to translate is also more natural than learning from large corpora. Humans learn languages by interacting with other humans and referring to their shared environment, i.e., by being grounded in physical reality. More abstract knowledge is built on top of this concrete foundation. It is natural to use vision as an intermediary: when communicating with someone who does not speak our language, we often directly refer to our surroundings. Even linguistically distant languages will, by physical and cognitive necessity, still refer to scenes and objects in the same visual space.

We compare our model against a number of baselines, including a nearest-neighbor method and a recently proposed model (Nakayama & Nishida, 2017) that maps languages and images to a shared space but lacks communication. We evaluate performance on both word- and sentence-level translation, and show that our model outperforms the baselines in both settings. Additionally, we show that multilingual communities of agents, comprised of native speakers of different languages, learn faster and ultimately become better translators.
2 PRIOR WORK

Recent work has used neural networks and reinforcement learning in multi-agent settings to solve a variety of tasks with communication, including simple coordination (Sukhbaatar et al., 2016), logic riddles (Foerster et al., 2016), complex coordination with verbal and physical interaction (Lowe et al., 2017), cooperative dialogue (Das et al., 2017) and negotiation (Lewis et al., 2017). At the same time, there has been a surge of interest in the communication protocols or languages that emerge when multi-agent communication is used to solve such tasks. Lazaridou et al. (2017) first showed that simple neural network agents can learn to coordinate in an image referential game with single-symbol bandwidth. This work has been extended to induce communication protocols that are more similar to human language, allowing multi-turn communication (Jorge et al., 2016), adaptive communication bandwidth (Havrylov & Titov, 2017), multi-turn communication with variable-length conversations (Evtimova et al., 2017), and simple compositionality (Kottur et al., 2017; Mordatch & Abbeel, 2017). Meanwhile, Andreas et al. (2017) proposed a model that interprets continuous message vectors by translating them.

Our work is related to a long line of work on learning multimodal representations. Several approaches learn a joint space for images and text using Canonical Correlation Analysis (CCA) or its variants (Hodosh et al., 2013; Andrew et al., 2013; Chandar et al., 2016). Other works minimize a pairwise ranking loss to learn multimodal embeddings (Socher et al., 2014; Kiros et al., 2014; Ma et al., 2015; Vendrov et al., 2015; Kiela et al., 2017). Most recently, this line of work has been extended to learn joint representations between images and multiple languages (Gella et al., 2017; Calixto et al., 2017b; Rajendran et al., 2016).

In machine translation, our work is related to image-guided (Calixto et al., 2017a; Elliott & Kádár, 2017; Caglayan et al., 2016) and pivot-based (Firat et al., 2016; Hitschler et al., 2016) approaches. It is also related to previous work on multi-agent translation for low-resource language pairs without grounding (He et al., 2016a). At the word level, there has been work on translation via a visual intermediate (Bergsma & Van Durme, 2011), including with convolutional neural network features (Kiela et al., 2015; Joulin et al., 2016). It was recently shown that zero-resource translation is possible by separately learning an image encoder and a language decoder (Nakayama & Nishida, 2017). The main difference from our work is that their models do not perform communication.

3 TASK AND MODELS

3.1 COMMUNICATION TASK

We let two agents communicate with each other in their own respective languages to solve a visual referential task. One agent sees an image and describes it in its native language to the other agent. The other agent is given several images, one of which is the same image shown to the first agent, and has to choose the correct image using the description. The game is played in both directions simultaneously, and the agents are jointly trained to solve this task. We only allow agents to send a sequence of discrete symbols to each other, and never a continuous vector. Our task is similar to that of Lazaridou et al. (2017), but with the following differences: communication (1) is bidirectional and (2) of variable length; (3) the speaker is trained on both the listener's feedback and ground-truth annotations; and (4) the speaker only observes the target image and no distractors.

Let P_A and P_B be our agents, who speak the languages L_A and L_B respectively. We have two disjoint sets of image-annotation pairs: (I_A, M_A) in language L_A and (I_B, M_B) in language L_B.

Task in language L_A: P_A is the speaker and P_B is the listener.

1. A target image and annotation (i, m) ∈ {I_A, M_A} is drawn from the training set in L_A.
2. Given i, the speaker (P_A) produces a sequence of symbols m̂ in language L_A to describe the image and sends it to the listener. The speaker's goal is to produce a message that both accurately predicts the ground-truth annotation m and helps the listener (P_B) identify the target image.
3. K − 1 distracting images are drawn from I_A at random. The target image i is added to this set and all K images are shuffled.
4. Given the message m̂ and the K images, the listener's goal is to identify the target image.

Task in language L_B: The agents exchange roles and play similarly.

We explore two different settings: (1) a word-level task where the agents communicate with a single word, and (2) a sentence-level task where the agents can transmit a sequence of symbols.
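To make the protocol above concrete, the following is a minimal sketch of one round of the sentence-level game in the direction of language L_A. It is an illustration only: the `Speaker`/`Listener` interfaces (`speak`, `listen`), the data containers, and the return convention are our own assumptions for exposition, not the authors' implementation.

```python
import random


def play_round_in_LA(speaker_A, listener_B, pairs_A, images_A, K=3):
    """One round of the referential game in language L_A (illustrative sketch).

    speaker_A  -- agent P_A acting as speaker: maps an image to a discrete message
    listener_B -- agent P_B acting as listener: picks an image given a message
    pairs_A    -- list of (image, annotation) pairs from (I_A, M_A)
    images_A   -- pool of images I_A used to draw distractors
    K          -- total number of candidate images shown to the listener
    """
    # 1. Draw a target image and its ground-truth annotation.
    image, annotation = random.choice(pairs_A)

    # 2. The speaker describes the target with a sequence of discrete symbols
    #    in its own language; it never sends a continuous vector.
    message = speaker_A.speak(image)  # e.g. a list of token ids from V_A

    # 3. Draw K - 1 distractors, add the target, and shuffle.
    candidates = random.sample([im for im in images_A if im is not image], K - 1)
    candidates.append(image)
    random.shuffle(candidates)

    # 4. The listener identifies the image it believes the message describes.
    choice = listener_B.listen(message, candidates)

    # The listener is trained on whether it picked the target; the speaker is
    # additionally trained against the ground-truth annotation (Section 3.2).
    return choice is image, message, annotation
```

The symmetric round in language L_B simply swaps the roles of the two agents, and both directions are played jointly during training.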
3.2 MODEL ARCHITECTURE AND TRAINING

Each agent has an image encoder, a native speaker module and a foreign language encoder. In English-Japanese communication, for instance, the English-speaking agent P_A consists of an image encoder E^A_IMG, a native English speaker module S^A_EN, and a Japanese encoder E^A_JA. Similarly, the Japanese-speaking agent P_B = (E^B_IMG, S^B_JA, E^B_EN).

Figure 1: Sentence-level communication task and translation between English and Japanese. (a) Communication task: the red dotted line delimits the agents and the gray dotted line delimits the communication tasks for the different languages; representations residing in the multimodal space of Agents A and B are shown in green and yellow, respectively. (b) Translation: an illustration of how the Japanese agent might translate an unseen English sentence into Japanese.

We now illustrate the architecture of our model using the English part of the communication task as an example (upper half of Figure 1a). We first describe the sentence-level model.

Speaker (P_A) Given an image-annotation pair (i, m) ∈ {I_EN, M_EN} sampled from the English training set, let i be represented as a D_img-dimensional vector. P_A's speaker encodes i into a D_hid-dimensional vector with a feedforward image encoder: h_0 = E^A_IMG(i). Our speaker module S^A_EN is a recurrent neural network (RNN) with gated recurrent units (GRU; Cho et al., 2014). The RNN takes the image representation h_0 as its initial hidden state and updates its state as h_{t+1} = GRU(h_t, m_t), where m_t is the t-th token in m. The output layer projects each hidden state h_t onto the English vocabulary V_EN, followed by a softmax to predict the next token: p_t = softmax(W_o h_t + b_o). The speaker's predictions are trained on the ground-truth English annotation m using the cross-entropy loss:

J^{EN}_{spk} = -\frac{1}{|\{I_{EN}, M_{EN}\}|} \sum_{(i,m) \in \{I_{EN}, M_{EN}\}} \sum_{t=1}^{|m|} \log p(m_t \mid m_{<t}, i)
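The speaker described above is a standard image-conditioned language model. Below is a minimal PyTorch sketch of such a component, assuming precomputed image features of dimension D_img and teacher forcing on the ground-truth annotation; the class and argument names (SpeakerModule, d_img, d_hid, and so on) and the single-linear-layer image encoder are our own illustrative choices, not the authors' code.

```python
import torch
import torch.nn as nn

class SpeakerModule(nn.Module):
    """Illustrative sketch of a speaker: feedforward image encoder + GRU language model."""

    def __init__(self, d_img, d_hid, vocab_size, d_emb=256):
        super().__init__()
        self.img_encoder = nn.Linear(d_img, d_hid)    # E_IMG: image features -> h_0
        self.embed = nn.Embedding(vocab_size, d_emb)  # token embeddings for m_t
        self.gru = nn.GRU(d_emb, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, vocab_size)       # W_o h_t + b_o over the vocabulary

    def forward(self, image_feats, tokens):
        """image_feats: (B, d_img); tokens: (B, T) annotation prefix (teacher forcing)."""
        h0 = self.img_encoder(image_feats).unsqueeze(0)  # (1, B, d_hid) initial hidden state
        emb = self.embed(tokens)                         # (B, T, d_emb)
        states, _ = self.gru(emb, h0)                    # hidden states h_1 ... h_T
        return self.out(states)                          # per-step logits over the vocabulary

def speaker_loss(speaker, image_feats, tokens, pad_id=0):
    """Cross-entropy J_spk: predict each token from the image and preceding tokens.

    Assumes each annotation starts with a start-of-sentence symbol and is padded
    with pad_id to a common length T.
    """
    logits = speaker(image_feats, tokens[:, :-1])   # condition on m_{<t} and the image
    targets = tokens[:, 1:]                         # predict m_t
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,                        # skip padding positions
    )
```

At communication time the same module would instead decode discrete tokens step by step from p_t (e.g., by sampling or greedy decoding) to form the message m̂ sent to the listener.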