# Attention-via-Attention Neural Machine Translation

Shenjian Zhao
Department of Computer Science and Engineering, Shanghai Jiao Tong University
sword.york@gmail.com

Zhihua Zhang
Peking University; Beijing Institute of Big Data Research
zhzhang@math.pku.edu.cn

## Abstract

Since many languages originated from a common ancestral language and influence each other, similarities such as lexical similarity and named entity similarity inevitably exist between them. In this paper, we leverage these similarities to improve translation performance in neural machine translation. Specifically, we introduce an attention-via-attention mechanism that allows information from source-side characters to flow directly to the target side. With this mechanism, target-side characters are generated based on the representation of source-side characters when the words are similar. For instance, our proposed neural machine translation system learns to transfer the character-level information of the English word system through the attention-via-attention mechanism to generate the Czech word systém. Consequently, our approach not only achieves competitive translation performance, but also reduces the model size significantly.

## 1 Introduction

A language family is a group of related languages that developed from a common ancestral language, such as the Indo-European family, the Niger-Congo family and the Austronesian family. The languages in the same family are more or less similar to each other. One measurement is lexical similarity (Simons and Fennig 2017), which approximately measures the similarity between the lexicons of two languages. Simons and Fennig (2017) calculated it by comparing a standardized set of wordlists and counting the forms that show similarity in both form and meaning. By this measure, English has a lexical similarity of 60% with German and 27% with French. Moreover, language itself is an evolving system, and the evolution of lexicons in different languages never stops. Guest words, foreignisms and loanwords from one language may be added to the lexicon of another. Although the languages are different, many of the words (e.g., named entities) are represented by similar characters.

Currently, many state-of-the-art neural machine translation (NMT) systems (Bahdanau, Cho, and Bengio 2015; Sutskever, Vinyals, and Le 2014; Jean et al. 2015; Luong and Manning 2016) are built on words. There are various considerations behind the wide adoption of word-level modeling (Chung, Cho, and Bengio 2016); the vanishing gradient problem of character-level models and the lower computational cost of word-level models may be the major causes. However, word-level NMT systems are unable to utilize the lexical similarity and named entity similarity between language pairs. Some explorations have been made to incorporate the similarity of vocabularies. For instance, Gulcehre et al. (2016) introduced a pointer network to copy words from the source. However, they assumed that the target out-of-vocabulary (OOV) words are identical to the corresponding source words, and this assumption is not always satisfied. The character-level information is critical in neural machine translation.
Suppose there are two languages that differ only in their alphabets, e.g., Russian written in Cyrillic and Russian written in Latin script. It would not be easy for a purely word-level NMT system to translate between such a language pair, because the word-level model has to establish a mapping between words, whereas a character-level model only has to establish a mapping between characters. Although no such language pair exists in reality, we can still exploit the similarity of languages at the character level. In particular, the following two sentences illustrate what we are focusing on:

1) Aby legenda byla věrohodná, psalo se o filmovém projektu ve specializovaných magazínech, pořádaly se tiskové konference, fiktivní produkční společnost měla reálnou kancelář. (Czech)

2) For the story to be believed, the film project was reported on in specialist magazines, press conferences were organised, and the fictitious production company had a real office. (English)

There are many similar words between the two sentences. One may guess the meaning of some Czech words from the English words, such as projektu - project, magazínech - magazines and konference - conferences. In this paper, we leverage these character-level similarities in NMT to improve translation performance and reduce the model size simultaneously. As expected, our model is also able to detect and handle named entities, as shown in Section 6.

To sum up, both the character-level lexicon and the word-level grammar are important for neural machine translation. Luong and Manning (2016) proposed a hybrid model for the English-to-Czech translation task which encodes OOV words with a character-level RNN. However, this hybrid model uses characters only to achieve an open vocabulary, and the character-level information is not exploited further. Instead, we propose a model that takes advantage of word-level modeling and bridges lexicons with an attention-via-attention mechanism, without involving any vocabulary at all. Specifically, we encode the source sentence at the character level with a unidirectional recurrent neural network (RNN), then extract the word information to learn word-level grammar with a bidirectional RNN (BiRNN) (Schuster and Paliwal 1997). To predict at the character level, attention is first paid to the word level; subsequently, attention is turned to the character level with the help of the word-level attention. Finally, the word-level representation and the character-level representation are combined to predict the target character. We illustrate the architecture of our model in Figure 1, which shows that the information of source-side characters flows directly to the target-side characters.

There are many models employing multiple attention components, such as attention-over-attention neural networks (Cui et al. 2016), hierarchical attention networks (Yang et al. 2016) and multi-step attention (Gehring et al. 2017). The key difference is that our attention-via-attention mechanism is a top-down approach (from words to characters), while the others are bottom-up, building attention from a lower-level representation to a higher-level one. Such hierarchical attention cannot connect the source side and the target side directly, and is therefore not applicable in our scenario.
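To make the top-down flow concrete, the sketch below (in PyTorch, which the paper does not prescribe) shows one plausible realization of attention-via-attention for a single decoding step: attention weights are computed over the word-level annotations first, and the character-level attention is then rescaled by the weight of the word each source character belongs to. The module names, the additive scoring functions and the rescaling scheme are illustrative assumptions, not the authors' exact formulation.

```python
# A minimal sketch of attention-via-attention for one decoding step.
# Assumptions (not from the paper): additive attention at both levels, and
# character scores rescaled by the attention weight of their enclosing word.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionViaAttention(nn.Module):
    def __init__(self, char_dim, word_dim, dec_dim, att_dim):
        super().__init__()
        # word-level additive attention (decoder state vs. word annotations)
        self.word_query = nn.Linear(dec_dim, att_dim)
        self.word_key = nn.Linear(word_dim, att_dim)
        self.word_score = nn.Linear(att_dim, 1)
        # character-level attention, guided by the word-level weights
        self.char_query = nn.Linear(dec_dim, att_dim)
        self.char_key = nn.Linear(char_dim, att_dim)
        self.char_score = nn.Linear(att_dim, 1)

    def forward(self, dec_state, word_states, char_states, char_to_word):
        # dec_state:    (B, dec_dim)       current decoder hidden state
        # word_states:  (B, Tw, word_dim)  word-level BiRNN annotations
        # char_states:  (B, Tc, char_dim)  character-level RNN states
        # char_to_word: (B, Tc) long       index of the word each character belongs to
        # 1) attend over the words first
        w_energy = self.word_score(torch.tanh(
            self.word_query(dec_state).unsqueeze(1) + self.word_key(word_states)))
        word_alpha = F.softmax(w_energy.squeeze(-1), dim=-1)                  # (B, Tw)
        word_ctx = torch.bmm(word_alpha.unsqueeze(1), word_states).squeeze(1)
        # 2) attend over the characters "via" the word-level attention:
        #    each character's weight is scaled by the attention its word received
        c_energy = self.char_score(torch.tanh(
            self.char_query(dec_state).unsqueeze(1) + self.char_key(char_states)))
        char_alpha = F.softmax(c_energy.squeeze(-1), dim=-1)                  # (B, Tc)
        char_alpha = char_alpha * word_alpha.gather(1, char_to_word)
        char_alpha = char_alpha / (char_alpha.sum(-1, keepdim=True) + 1e-8)
        char_ctx = torch.bmm(char_alpha.unsqueeze(1), char_states).squeeze(1)
        # 3) both contexts are combined downstream to predict the next target character
        return word_ctx, char_ctx
```

In this sketch the word-level attention acts as a soft mask over the source characters, which is one way to let the character-level information of highly attended words (e.g., system → systém) flow directly to the target side.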
With a hierarchical encoder and an attention-via-attention mechanism, our method addresses several essential issues in the neural machine translation community. That is:

- We avoid the use of large vocabularies. Instead, we employ a character-level RNN to encode the entire source sentence, which also handles rare words. The character-level RNN uses distributed representations, which generally yield better generalization, and it is one of the key ingredients of the attention-via-attention mechanism.
- We alleviate the vanishing gradient problem of purely character-level models by introducing a hierarchical encoder.
- We detect named entities and similar lexemes automatically, and transfer them to the target language through the attention-via-attention mechanism.

These issues matter not only for translation but also for many other natural language processing tasks, such as text summarization (Gulcehre et al. 2016) and conversational models (Vinyals and Le 2015), so these tasks may benefit from our approach in principle.

## 2 Neural Machine Translation

Neural machine translation systems are typically implemented with an encoder-decoder architecture (Bahdanau, Cho, and Bengio 2015; Sutskever, Vinyals, and Le 2014). The encoder can be a recurrent neural network or a bidirectional recurrent neural network that encodes a source language sentence $\mathbf{x} = \{x_1, \ldots, x_{T_c}\}$ into a sequence of hidden states $\mathbf{h} = \{h_1, \ldots, h_{T_c}\}$:

$$h_t = f_{\mathrm{enc}}(e(x_t), h_{t-1}),$$

in which $h_t$ is the hidden state at time step $t$, $e(x_t)$ is the continuous embedding of $x_t$, $T_c$ is the number of symbols in the source sequence, and $f_{\mathrm{enc}}$ is a recurrent unit such as the gated recurrent unit (GRU) (Chung et al. 2014) or the long short-term memory (LSTM) unit (Hochreiter and Schmidhuber 1997). The decoder, another RNN, is trained to predict the conditional probability of each target symbol $y_t$ given its preceding symbols $y_{<t}$