# Cross-Lingual Natural Language Generation via Pre-Training

Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, Heyan Huang
Beijing Institute of Technology; Microsoft Research
{czw, maoxl, hhy63}@bit.edu.cn, {lidong1, fuwei, Wenhui.Wang}@microsoft.com
(Contribution during internship at Microsoft Research.)

## Abstract

In this work we focus on transferring supervision signals of natural language generation (NLG) tasks between multiple languages. We propose to pre-train the encoder and the decoder of a sequence-to-sequence model under both monolingual and cross-lingual settings. The pre-training objective encourages the model to represent different languages in a shared space, so that we can conduct zero-shot cross-lingual transfer. After the pre-training procedure, we use monolingual data to fine-tune the pre-trained model on downstream NLG tasks. The sequence-to-sequence model trained in a single language can then be directly evaluated beyond that language (i.e., accepting multilingual input and producing multilingual output). Experimental results on question generation and abstractive summarization show that our model outperforms machine-translation-based pipeline methods for zero-shot cross-lingual generation. Moreover, cross-lingual transfer improves NLG performance of low-resource languages by leveraging rich-resource language data. Our implementation and data are available at https://github.com/CZWin32768/xnlg.

## 1 Introduction

Learning natural language generation (NLG) models heavily relies on annotated training data. However, most available datasets are collected in a single language (typically English), which restricts deploying these applications to other languages. In this work, we aim at transferring the supervision of a monolingual NLG dataset to unseen languages, so that we can boost performance in low-resource settings.

Various methods have been proposed over the years to learn cross-lingual word embeddings (Mikolov, Le, and Sutskever 2013; Xing et al. 2015; Conneau et al. 2017) or sentence encoders (Johnson et al. 2017; Conneau et al. 2018; Lample and Conneau 2019), which try to encode multilingual texts into a shared vector space. Despite achieving promising results on cross-lingual classification problems, cross-lingual pre-trained models for NLG tasks remain relatively understudied.

Figure 1: We use a monolingual (such as English) NLG dataset to fine-tune the pre-trained model XNLG, and then evaluate it beyond that language on both the source and target sides (e.g., Chinese and French).

The cross-lingual generation problem is challenging for the following reasons. First, it requires the model to understand multilingual input texts and generate multilingual target sequences, so both the encoder and the decoder should be pre-trained together. Second, the many-to-many nature of cross-lingual NLG means that the number of language pairs grows quadratically with the number of languages. Third, the prediction space of cross-lingual NLG is much larger than that of classification tasks, which makes knowledge transfer for decoders quite critical.

Previous work mainly relies on machine translation (MT) to map texts to different languages.
The first strand of research directly uses MT in a pipeline manner (Wan, Li, and Xiao 2010). For example, inputs written in other languages are first translated into English and fed into an NLG model trained on English data; the generated English texts are then translated back into the target language. Another strand of work uses MT to generate pseudo training data for language pairs that lack annotations (Shen et al. 2018; Duan et al. 2019). However, such methods have to use multiple MT systems, which makes them suffer from error propagation. Moreover, because the pipeline-based methods do not explicitly share the same parameter space across languages, we cannot directly transfer task-specific supervision to other low-resource languages.

In this paper, we propose a cross-lingual pre-trained model (named XNLG) in order to transfer monolingual NLG supervision to other pre-trained languages by fine-tuning. Specifically, XNLG shares the same sequence-to-sequence model across languages, and is pre-trained with both monolingual and cross-lingual objectives. The model not only learns to understand multilingual input, but is also able to generate specific languages by conditioning on the encoded semantics. Figure 1 illustrates how XNLG performs cross-lingual transfer for downstream tasks. The proposed approach enables us to fine-tune the pre-trained model on monolingual NLG training data and then evaluate it beyond a single language, including zero-shot cross-lingual generation. Besides, we explore several fine-tuning strategies to balance cross-lingual ability and task ability. In addition, we introduce two cross-lingual NLG datasets (i.e., question generation and abstractive summarization) for evaluation, which cover three languages, namely English, Chinese, and French. Experimental results on the NLG tasks show that XNLG achieves competitive performance compared with the machine-translation-based pipeline model in zero-shot cross-lingual settings.

## 2 Related Work

**Cross-Lingual NLG** Several previous methods have been proposed for cross-lingual abstractive summarization. Shen et al. (2018) and Duan et al. (2019) use translated documents or summaries as pseudo training data. Junnan et al. (2019) incorporate monolingual summarization and machine translation to improve cross-lingual summarization. However, these systems only conduct experiments that generate summaries in a language different from the input language, rather than transferring supervision signals across all language pairs. Kumar et al. (2019) use training data annotated in multiple languages to jointly train a sequence-to-sequence model for question generation. In contrast, our method can also be applied to zero-shot settings across languages.

**Monolingual Pre-Training** Various training objectives have been designed to pre-train text encoders for general-purpose representations, such as language modeling (Peters et al. 2018; Radford et al. 2018; Devlin et al. 2019; Joshi et al. 2019; Yang et al. 2019), auto-encoding (Liang et al. 2019), and machine translation (McCann et al. 2017). Apart from pre-training encoders, several pre-trained models (Dong et al. 2019; Song et al. 2019) have been proposed for generation tasks. In comparison, our goal is to investigate a pre-training method for cross-lingual NLG tasks.
**Cross-Lingual Pre-Training** By pre-training BERT (Devlin et al. 2019) on corpora of multiple languages, the model shows a surprising ability to produce cross-lingual representations (Wu and Dredze 2019). More recently, Lample and Conneau (2019) extend masked language modeling pre-training to cross-lingual settings, which shows significant improvements on cross-lingual classification and unsupervised machine translation. By comparison, we pre-train both the encoder and the decoder for cross-lingual generation tasks, rather than focusing only on the encoder. Artetxe and Schwenk (2018) use the sequence encoder of a multilingual translation model (Johnson et al. 2017) to produce cross-lingual sentence embeddings. However, as shown in the experiments (Section 4), it is difficult to control the target language by directly fine-tuning the pre-trained translation model on downstream NLG tasks.

## 3 Methods

As shown in Figure 2, XNLG is a pre-trained sequence-to-sequence model based on Transformer (Vaswani et al. 2017). Both the encoder and the decoder are designed to support multiple languages. Following Lample and Conneau (2019), we use language tag embeddings to distinguish the source and target languages. Given a sentence and its corresponding language tag, XNLG encodes the input into vector representations. By conditioning on the encoding vectors and a specific language tag, the decoder generates the output sequence in the target language.

### 3.1 Pre-Training Tasks

**Monolingual MLM** The masked language modeling (MLM) task (Devlin et al. 2019) aims at predicting randomly masked words according to their context. This objective pre-trains the bidirectional encoder to obtain contextual representations. Following Devlin et al. (2019), we randomly mask 15% of the tokens in a monolingual sentence. Each masked token is substituted with a special token [M], a random token, or the unchanged token with probabilities of 0.8, 0.1, and 0.1, respectively. Let $x$ denote a sentence from the monolingual training corpus, and $M_x$ the set of randomly masked positions. The monolingual MLM loss is defined as:

$$\mathcal{L}_{\text{MLM}}(x) = -\sum_{i \in M_x} \log p(x_i \mid x_{\setminus M_x}) \tag{1}$$

where $x_{\setminus M_x}$ is the masked version of input $x$. The language tags are fed into the model for all pre-training tasks.

**Denoising Auto-Encoding (DAE)** We use the denoising auto-encoding (DAE) objective (Vincent et al. 2008) to pre-train the encoder-decoder attention mechanism. Given a sentence $x$ from the monolingual corpus, we apply three types of noise to obtain the randomly perturbed text $\hat{x}$. First, the word order is locally shuffled. Second, tokens are randomly dropped with a probability of 0.1. Third, tokens are substituted with the special padding token [P] with a probability of 0.1. The pre-training objective is to recover the original sentence $x$ by conditioning on $\hat{x}$. The DAE loss is computed as:

$$\mathcal{L}_{\text{DAE}}(x) = -\log p(x \mid \hat{x}) = -\sum_{i=1}^{|x|} \log p(x_i \mid \hat{x}, x_{<i}) \tag{2}$$
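To make the MLM masking procedure above concrete, the following Python sketch corrupts a tokenized sentence with the stated 15% masking rate and the 0.8/0.1/0.1 substitution probabilities. It is a minimal illustration rather than the paper's implementation; the function name, the toy vocabulary, and the whole-word (rather than subword) granularity are assumptions.

```python
import random

def mlm_mask(tokens, vocab, mask_token="[M]", mask_prob=0.15):
    """Corrupt a tokenized sentence for masked language modeling.

    Roughly 15% of positions are selected; each selected token is replaced
    by [M] (80%), a random vocabulary token (10%), or kept unchanged (10%).
    Returns the corrupted tokens and the masked positions, i.e. the set M_x
    over which the loss in Eq. (1) is computed.
    """
    corrupted = list(tokens)
    masked_positions = []
    for i in range(len(tokens)):
        if random.random() >= mask_prob:
            continue
        masked_positions.append(i)
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_token            # 80%: special mask token
        elif r < 0.9:
            corrupted[i] = random.choice(vocab)  # 10%: random token
        # else: 10%: leave the original token in place
    return corrupted, masked_positions

# Toy usage with a whitespace-tokenized sentence and a tiny vocabulary.
vocab = ["the", "model", "learns", "shared", "representations", "across", "languages"]
sentence = "the model learns shared representations across languages".split()
print(mlm_mask(sentence, vocab))
```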
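Similarly, the three DAE noise types (local shuffling, token dropping with probability 0.1, and substitution with [P] with probability 0.1) might be sketched as below. The local shuffle is implemented here by jittering token indices and re-sorting; the window size of 3 is purely an assumption, since the paper only states that the word order is locally shuffled.

```python
import random

def dae_noise(tokens, shuffle_window=3, drop_prob=0.1, pad_prob=0.1, pad_token="[P]"):
    """Perturb a tokenized sentence with the three DAE noise types.

    The shuffle adds uniform noise to each token index and re-sorts, so a
    token moves at most `shuffle_window` positions; this window size is an
    assumption, as the paper does not specify the shuffling scheme.
    """
    # 1. Locally shuffle the word order.
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(tokens))]
    shuffled = [tok for _, tok in sorted(zip(keys, tokens), key=lambda pair: pair[0])]

    # 2. Randomly drop tokens with probability drop_prob.
    kept = [tok for tok in shuffled if random.random() >= drop_prob]

    # 3. Randomly substitute tokens with the padding token with probability pad_prob.
    return [pad_token if random.random() < pad_prob else tok for tok in kept]

# The decoder is trained to reconstruct the original `tokens` from this
# noised version, as in Eq. (2).
print(dae_noise("we use three types of noise to perturb the input".split()))
```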