# Neural Machine Translation with Joint Representation

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Yanyang Li^1, Qiang Wang^1, Tong Xiao^{1,2}, Tongran Liu^3, Jingbo Zhu^{1,2}

^1 Natural Language Processing Lab., Northeastern University, Shenyang, China
^2 NiuTrans Co., Ltd., Shenyang, China
^3 CAS Key Laboratory of Behavioral Science, Institute of Psychology, CAS, Beijing, China

blamedrlee@outlook.com, wangqiangneu@gmail.com, {xiaotong, zhujingbo}@mail.neu.edu.cn, liutr@psych.ac.cn

## Abstract

Though the early successes of Statistical Machine Translation (SMT) systems are attributed in part to the explicit modelling of the interaction between any two source and target units, e.g., alignment, recent Neural Machine Translation (NMT) systems resort to attention, which only partially encodes the interaction, for efficiency. In this paper, we employ a Joint Representation that fully accounts for each possible interaction. We sidestep the inefficiency issue by refining representations with the proposed efficient attention operation. The resulting Reformer models offer a new Sequence-to-Sequence modelling paradigm beyond the Encoder-Decoder framework and outperform the Transformer baseline by about 1 BLEU point on both the small-scale IWSLT14 German-English, English-German and IWSLT15 Vietnamese-English tasks and the large-scale NIST12 Chinese-English translation task. We also propose a systematic model scaling approach that allows the Reformer model to beat the state-of-the-art Transformer on IWSLT14 German-English and NIST12 Chinese-English with about 50% fewer parameters. The code is publicly available at https://github.com/lyy1994/reformer.

## Introduction

To translate a sentence in the source language into its equivalent in the target language, the translation model relies on the bilingual interaction between any two source and target units to select the appropriate hypothesis. Early SMT systems are good examples of this, as they use the alignment matrix between the source sentence and the translated part to direct the decoding (Koehn 2009). When it comes to NMT, the natural idea for explicitly modelling the interaction is to extend the intrinsic representation to have the size $S \times T \times H$, dubbed Joint Representation, where $S$ is the source sentence length, $T$ is the target sentence length and $H$ is the hidden size. Although this representation can flexibly learn to encode various types of interaction, it is inefficient, as it incurs high computation and storage costs. A practical surrogate is the well-known attention mechanism (Vaswani et al. 2017). It mimics the desired interaction by dynamically aggregating a sequence of representations. Though successful, the receptive field of each position in the attention is restricted to a single source or target sequence rather than their Cartesian product, as required by the joint representation.
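To make the $S \times T \times H$ layout concrete, the following is a minimal PyTorch sketch of one way a joint representation could be formed, pairing every source position with every target position through broadcasting. The addition-based combination and the toy dimensions are illustrative assumptions, not the construction used by the paper.

```python
import torch

# Toy dimensions: source length S, target length T, hidden size H.
S, T, H = 5, 7, 8

# Per-token source and target representations, e.g. embeddings.
src = torch.randn(S, H)  # [S, H]
tgt = torch.randn(T, H)  # [T, H]

# One simple way to build a joint representation: pair every source
# position i with every target position j by broadcasting and combining
# their vectors (here by addition, purely for illustration).
joint = src.unsqueeze(1) + tgt.unsqueeze(0)  # [S, T, H]
assert joint.shape == (S, T, H)

# Each entry joint[i, j] is a vector that can learn to encode the
# interaction (e.g. an alignment-like signal) between source token i
# and target token j, at the cost of O(S * T * H) memory.
print(joint.shape)  # torch.Size([5, 7, 8])
```

By contrast, standard attention never materializes such an $S \times T \times H$ tensor; it only forms $S \times T$ weights and aggregates along a single sequence axis, which is the efficiency-for-expressiveness trade-off discussed above.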
In this work, we take one step toward the model family Reformer, which is built entirely on top of the joint representation. We efficiently adapt the most advanced self-attention module to the joint representation space $\mathbb{R}^{S \times T \times H}$, namely Separable Attention. With this building block at hand, we present two instantiations of Reformer. The former, Reformer-base, enjoys the best theoretical effectiveness, as it can access any source or target token within $O(1)$ operations, but it has higher complexity induced by stacking separable attentions. The latter, Reformer-fast, better trades off the effectiveness and efficiency of the separable attention, achieving results comparable to Reformer-base while being 50% faster. As neither Reformer variant resorts to an encoder or a decoder, they shed light on exploring this new and promising Sequence-to-Sequence paradigm in the future.

We additionally show that, with proper model scaling, our Reformer models are superior to the state-of-the-art (SOTA) Transformer (Vaswani et al. 2017) with fewer parameters on larger datasets. The proposed model scaling method requires only O(2) runs to generate the enhanced model, in contrast to the common grid search, whose cost grows polynomially as the number of hyper-parameter candidates increases. In our experiments, the Reformer models achieve improvements of 1.3, 0.8 and 0.7 BLEU points over the Transformer baseline on the small-scale IWSLT15 Vietnamese-English, IWSLT14 German-English and English-German datasets, as well as 1.9 BLEU points on the large-scale NIST12 Chinese-English dataset. After scaling, they outperform the SOTA large Transformer counterpart by 0.7 and 2 BLEU points with about 50% of the parameters on the IWSLT14 German-English and NIST12 Chinese-English translation tasks, respectively.

## Background

### Sequence-to-Sequence Learning

Figure 1: Separable Attention over representations (Chinese pinyin-English: wǒ hěn hǎo . → I am fine .), shown for (a) training and (b) the $T$-th decoding step.

Given a sentence pair $(x, y)$, the NMT model learns to maximize its probability $\Pr(y \mid x)$, which is decomposed into the product of the conditional probabilities of each target token:

$$\Pr(y \mid x) = \prod_{t=1}^{T} \Pr(y_t \mid y_{<t}, x)$$
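As a concrete reading of this factorization, the sketch below sums the per-token conditional log-probabilities under the chain rule. The `toy_step_distribution` function is a hypothetical placeholder for a real decoder's predictive distribution over the vocabulary; only the accumulation logic mirrors the formula.

```python
import math

def toy_step_distribution(prefix, x, vocab):
    """Hypothetical stand-in for Pr(. | y_<t, x): a real NMT model would
    return its decoder's softmax output; here it is simply uniform."""
    return {token: 1.0 / len(vocab) for token in vocab}

def sentence_log_prob(y, x, vocab):
    """log Pr(y | x) = sum_t log Pr(y_t | y_<t, x) by the chain rule."""
    total = 0.0
    for t, y_t in enumerate(y):
        step = toy_step_distribution(y[:t], x, vocab)  # condition on y_<t and x
        total += math.log(step[y_t])
    return total

vocab = ["I", "am", "fine", "."]
x = ["wo", "hen", "hao", "."]          # source sentence (Figure 1's example)
y = ["I", "am", "fine", "."]           # target sentence
print(sentence_log_prob(y, x, vocab))  # 4 * log(1/4) ≈ -5.545
```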