# Non-Autoregressive Machine Translation with Auxiliary Regularization

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Yiren Wang,¹ Fei Tian,² Di He,³ Tao Qin,² ChengXiang Zhai,¹ Tie-Yan Liu²

¹University of Illinois at Urbana-Champaign, Urbana, IL, USA
²Microsoft Research, Beijing, China
³Key Laboratory of Machine Perception, MOE, School of EECS, Peking University, Beijing, China

¹{yiren, czhai}@illinois.edu  ²{fetia, taoqin, tie-yan.liu}@microsoft.com  ³di_he@pku.edu.cn

## Abstract

As a new neural machine translation approach, Non-Autoregressive machine Translation (NAT) has recently attracted attention due to its high inference efficiency. However, this efficiency comes at the cost of not capturing the sequential dependency on the target side of translation, which causes NAT to suffer from two kinds of translation errors: 1) repeated translations (due to indistinguishable adjacent decoder hidden states), and 2) incomplete translations (due to incomplete transfer of source-side information via the decoder hidden states). In this paper, we propose to address these two problems by improving the quality of the decoder hidden representations via two auxiliary regularization terms in the training process of an NAT model. First, to make the hidden states more distinguishable, we regularize the similarity between consecutive hidden states based on the corresponding target tokens. Second, to force the hidden states to contain all the information in the source sentence, we leverage the dual nature of translation tasks (e.g., English to German and German to English) and minimize a backward reconstruction error to ensure that the hidden states of the NAT decoder are able to recover the source sentence. Extensive experiments conducted on several benchmark datasets show that both regularization strategies are effective and can alleviate the issues of repeated and incomplete translations in NAT models. Consequently, the accuracy of our NAT model improves significantly over state-of-the-art NAT models, with even better inference efficiency.

## Introduction

Neural Machine Translation (NMT) based on deep neural networks has made rapid progress in recent years (Cho et al. 2014; Bahdanau, Cho, and Bengio 2014; Wu et al. 2016; Vaswani et al. 2017; Hassan et al. 2018). NMT systems are typically implemented in an encoder-decoder framework, in which the encoder network feeds representations of the source sentence $x$ into the decoder network, which generates the tokens of the target sentence $y$. The decoder typically works in an autoregressive manner: the generation of the $t$-th token $y_t$ follows the conditional distribution $P(y_t \mid x, y_{<t})$, where $y_{<t}$ denotes the previously generated tokens.
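For concreteness, this autoregressive decoding corresponds to the standard left-to-right factorization of the target distribution (written out here for illustration; it is not quoted verbatim from the paper):

$$
P(y \mid x) \;=\; \prod_{t=1}^{T} P\big(y_t \mid y_{<t}, x\big),
$$

where $T$ is the target length. Each token is generated conditioned on all previously generated tokens, which is exactly the sequential dependency that NAT models give up in exchange for parallel decoding.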
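The two auxiliary regularization terms summarized in the abstract can be made more tangible with a short, hypothetical sketch. The snippet below is not the paper's exact formulation: the tensor shapes, the cosine-similarity penalty applied only where adjacent target tokens differ, and the `backward_decoder` module are all assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F


def similarity_regularizer(hidden, targets):
    """Penalize high cosine similarity between consecutive decoder hidden
    states whose target tokens differ, so adjacent states stay distinguishable.

    hidden:  (batch, length, dim) NAT decoder hidden states
    targets: (batch, length) target token ids
    """
    h_cur, h_next = hidden[:, :-1, :], hidden[:, 1:, :]
    cos = F.cosine_similarity(h_cur, h_next, dim=-1)        # (batch, length-1)
    differ = (targets[:, :-1] != targets[:, 1:]).float()    # 1 where adjacent tokens differ
    # Average the similarity over positions that should produce different tokens.
    return (cos * differ).sum() / differ.sum().clamp(min=1.0)


def reconstruction_regularizer(hidden, source, backward_decoder):
    """Backward-reconstruction term: decoder hidden states should retain enough
    source-side information to recover the source sentence.

    backward_decoder: an assumed module mapping (hidden, source) to logits over
    the source vocabulary with shape (batch, src_length, src_vocab).
    source: (batch, src_length) source token ids.
    """
    logits = backward_decoder(hidden, source)
    return F.cross_entropy(logits.transpose(1, 2), source)
```

In such a setup, both terms would be added to the NAT training loss with weighting coefficients, encouraging the decoder hidden states to be both mutually distinguishable and informative about the source sentence.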