# Cross-Lingual Natural Language Generation via Pre-Training

Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, Heyan Huang
Beijing Institute of Technology; Microsoft Research
{czw, maoxl, hhy63}@bit.edu.cn, {lidong1, fuwei, Wenhui.Wang}@microsoft.com
(Contribution during internship at Microsoft Research.)

## Abstract

In this work we focus on transferring supervision signals of natural language generation (NLG) tasks between multiple languages. We propose to pre-train the encoder and the decoder of a sequence-to-sequence model under both monolingual and cross-lingual settings. The pre-training objective encourages the model to represent different languages in a shared space, so that we can conduct zero-shot cross-lingual transfer. After the pre-training procedure, we use monolingual data to fine-tune the pre-trained model on downstream NLG tasks. The sequence-to-sequence model trained in a single language can then be directly evaluated beyond that language (i.e., accepting multilingual input and producing multilingual output). Experimental results on question generation and abstractive summarization show that our model outperforms machine-translation-based pipeline methods for zero-shot cross-lingual generation. Moreover, cross-lingual transfer improves NLG performance of low-resource languages by leveraging rich-resource language data. Our implementation and data are available at https://github.com/CZWin32768/xnlg.

## 1 Introduction

Learning natural language generation (NLG) models heavily relies on annotated training data. However, most available datasets are collected in a single language (typically English), which restricts deploying these applications to other languages. In this work, we aim at transferring the supervision of a monolingual NLG dataset to unseen languages, so that we can boost performance in low-resource settings.

Various methods have been proposed over the years to learn cross-lingual word embeddings (Mikolov, Le, and Sutskever 2013; Xing et al. 2015; Conneau et al. 2017) or sentence encoders (Johnson et al. 2017; Conneau et al. 2018; Lample and Conneau 2019), which try to encode multilingual texts into a shared vector space. Despite achieving promising results on cross-lingual classification problems, cross-lingual pre-trained models for NLG tasks remain relatively understudied.

Figure 1: We use a monolingual (such as English) NLG dataset to fine-tune the pre-trained model XNLG, and then evaluate it beyond that language on both the source and target sides (e.g., Chinese and French).

The cross-lingual generation problem is challenging for the following reasons. First, it requires the model to understand multilingual input texts and generate multilingual target sequences, so both the encoder and the decoder should be pre-trained together. Second, the many-to-many nature of cross-lingual NLG means that the number of language pairs grows quadratically with the number of languages. Third, the prediction space of cross-lingual NLG is much larger than that of classification tasks, which makes knowledge transfer for decoders quite critical.

Previous work mainly relies on machine translation (MT) to map texts to different languages.
The first strand of research directly uses MT in a pipeline manner (Wan, Li, and Xiao 2010). For example, inputs written in other languages are first translated into English and fed into an NLG model trained on English data; the generated English texts are then translated back into the target language. Another strand of work uses MT to generate pseudo training data for language pairs that lack annotations (Shen et al. 2018; Duan et al. 2019). However, such methods have to use multiple MT systems, which makes them suffer from error propagation. Moreover, because the pipeline-based methods do not explicitly share the same parameter space across languages, we cannot directly transfer task-specific supervision to other low-resource languages.

In this paper, we propose a cross-lingual pre-trained model (named XNLG) in order to transfer monolingual NLG supervision to other pre-trained languages by fine-tuning. Specifically, XNLG shares the same sequence-to-sequence model across languages, and is pre-trained with both monolingual and cross-lingual objectives. The model not only learns to understand multilingual input, but is also able to generate specific languages by conditioning on the encoded semantics. Figure 1 illustrates how XNLG performs cross-lingual transfer for downstream tasks. The proposed approach enables us to fine-tune the pre-trained model on monolingual NLG training data and then evaluate it beyond a single language, including zero-shot cross-lingual generation. Besides, we explore several fine-tuning strategies to balance cross-lingual ability and task ability. In addition, we introduce two cross-lingual NLG datasets (i.e., question generation and abstractive summarization) for evaluation, which cover three languages, namely English, Chinese, and French. Experimental results on the NLG tasks show that XNLG achieves competitive performance compared with the machine-translation-based pipeline model in zero-shot cross-lingual settings.

## 2 Related Work

**Cross-Lingual NLG** Several previous methods have been proposed for cross-lingual abstractive summarization. Shen et al. (2018) and Duan et al. (2019) use translated documents or summaries as pseudo training data. Junnan et al. (2019) incorporate monolingual summarization and machine translation to improve cross-lingual summarization. However, these systems only conduct experiments that generate summaries in a language different from the input language, rather than transferring supervision signals across all language pairs. Kumar et al. (2019) use training data annotated in multiple languages to jointly train a sequence-to-sequence model for question generation. In contrast, our method can also be applied to zero-shot settings across languages.

**Monolingual Pre-Training** Various training objectives have been designed to pre-train text encoders for general-purpose representations, such as language modeling (Peters et al. 2018; Radford et al. 2018; Devlin et al. 2019; Joshi et al. 2019; Yang et al. 2019), auto-encoding (Liang et al. 2019), and machine translation (McCann et al. 2017). Apart from pre-training encoders, several pre-trained models (Dong et al. 2019; Song et al. 2019) have been proposed for generation tasks. In comparison, our goal is to investigate a pre-training method for cross-lingual NLG tasks.
**Cross-Lingual Pre-Training** By pre-training BERT (Devlin et al. 2019) on corpora of multiple languages, the model shows a surprising ability to produce cross-lingual representations (Wu and Dredze 2019). More recently, Lample and Conneau (2019) extend masked language modeling pre-training to cross-lingual settings, which shows significant improvements on cross-lingual classification and unsupervised machine translation. By comparison, we pre-train both the encoder and the decoder for cross-lingual generation tasks, rather than focusing only on the encoder. Artetxe and Schwenk (2018) use the sequence encoder of a multilingual translation model (Johnson et al. 2017) to produce cross-lingual sentence embeddings. However, as shown in the experiments (Section 4), it is difficult to control the target language by directly fine-tuning the pre-trained translation model on downstream NLG tasks.

## 3 Methods

As shown in Figure 2, XNLG is a pre-trained sequence-to-sequence model based on Transformer (Vaswani et al. 2017). Both the encoder and the decoder are designed to support multiple languages. Following Lample and Conneau (2019), we use language tag embeddings to distinguish the source and target languages. Given a sentence and its corresponding language tag, XNLG encodes the input into vector representations. By conditioning on the encoding vectors and a specific language tag, the decoder generates the output sequence in the target language.

### 3.1 Pre-Training Tasks

**Monolingual MLM** The masked language modeling (MLM) task (Devlin et al. 2019) aims at predicting randomly masked words according to their context. This objective pre-trains the bidirectional encoder to obtain contextual representations. Following Devlin et al. (2019), we randomly mask 15% of the tokens in a monolingual sentence. Each masked token is substituted with a special token [M], a random token, or the unchanged token with probabilities of 0.8, 0.1, and 0.1, respectively. Let $x$ denote a sentence from the monolingual training corpus, and $M_x$ the set of randomly masked positions. The monolingual MLM loss is defined as:

$$\mathcal{L}_{\text{MLM}}(x) = -\sum_{i \in M_x} \log p(x_i \mid x_{\setminus M_x}) \tag{1}$$

where $x_{\setminus M_x}$ is the masked version of input $x$. The language tags are fed into the model for all pre-training tasks.

**Denoising Auto-Encoding (DAE)** We use the denoising auto-encoding (DAE) objective (Vincent et al. 2008) to pre-train the encoder-decoder attention mechanism. Given a sentence $x$ from the monolingual corpus, we apply three types of noise to obtain the randomly perturbed text $\hat{x}$. First, the word order is locally shuffled. Second, tokens are randomly dropped with a probability of 0.1. Third, tokens are substituted with the special padding token [P] with a probability of 0.1. The pre-training objective is to recover the original sentence $x$ by conditioning on $\hat{x}$. The DAE loss is computed as:

$$\mathcal{L}_{\text{DAE}}(x) = -\log p(x \mid \hat{x}) = -\sum_{i=1}^{|x|} \log p(x_i \mid \hat{x}, x_{<i}) \tag{2}$$
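To make the MLM masking procedure above concrete, the following Python sketch corrupts a tokenized sentence with the stated 15% masking rate and the 0.8/0.1/0.1 substitution probabilities. It is a minimal illustration rather than the paper's implementation; the function name, the toy vocabulary, and the whole-word (rather than subword) granularity are assumptions.

```python
import random

def mlm_mask(tokens, vocab, mask_token="[M]", mask_prob=0.15):
    """Corrupt a tokenized sentence for masked language modeling.

    Roughly 15% of positions are selected; each selected token is replaced
    by [M] (80%), a random vocabulary token (10%), or kept unchanged (10%).
    Returns the corrupted tokens and the masked positions, i.e. the set M_x
    over which the loss in Eq. (1) is computed.
    """
    corrupted = list(tokens)
    masked_positions = []
    for i in range(len(tokens)):
        if random.random() >= mask_prob:
            continue
        masked_positions.append(i)
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_token            # 80%: special mask token
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)  # 10%: random token
        # else: 10%: leave the original token in place
    return corrupted, masked_positions

# Toy usage with a whitespace-tokenized sentence and a tiny vocabulary.
vocab = ["the", "model", "learns", "shared", "representations", "across", "languages"]
sentence = "the model learns shared representations across languages".split()
print(mlm_mask(sentence, vocab))
```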
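Similarly, the three DAE noise types (local shuffling, token dropping with probability 0.1, and substitution with [P] with probability 0.1) might be sketched as below. The local shuffle is implemented here by jittering token indices and re-sorting; the window size of 3 is purely an assumption, since the paper only states that the word order is locally shuffled.

```python
import random

def dae_noise(tokens, shuffle_window=3, drop_prob=0.1, pad_prob=0.1, pad_token="[P]"):
    """Perturb a tokenized sentence with the three DAE noise types.

    The shuffle adds uniform noise to each token index and re-sorts, so a
    token moves at most `shuffle_window` positions; this window size is an
    assumption, as the paper does not specify the shuffling scheme.
    """
    # 1. Locally shuffle the word order.
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(tokens))]
    shuffled = [tok for _, tok in sorted(zip(keys, tokens), key=lambda pair: pair[0])]

    # 2. Randomly drop tokens with probability drop_prob.
    kept = [tok for tok in shuffled if random.random() >= drop_prob]

    # 3. Randomly substitute tokens with the padding token with probability pad_prob.
    return [pad_token if random.random() < pad_prob else tok for tok in kept]

# The decoder is trained to reconstruct the original `tokens` from this
# noised version, as in Eq. (2).
print(dae_noise("we use three types of noise to perturb the input".split()))
```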