# Meta-Transfer Learning for Low-Resource Abstractive Summarization

Yi-Syuan Chen, Hong-Han Shuai
National Chiao Tung University, Taiwan
yschen.eed09g@nctu.edu.tw, hhshuai@nctu.edu.tw

Neural abstractive summarization has been studied extensively and achieves great success with the aid of large corpora. However, when encountering novel tasks, one may not always benefit from transfer learning due to the domain-shifting problem, and overfitting can occur without adequate labeled examples. Furthermore, annotations for abstractive summarization are costly and often demand domain knowledge to ensure ground-truth quality. Thus, there is growing interest in Low-Resource Abstractive Summarization, which aims to leverage past experience to improve performance with limited labeled examples from the target corpus. In this paper, we propose to utilize two knowledge-rich sources to tackle this problem: large pre-trained models and diverse existing corpora. The former provides the primary ability to tackle summarization tasks; the latter helps discover common syntactic or semantic information that improves generalization. We conduct extensive experiments on various summarization corpora with different writing styles and forms. The results demonstrate that our approach achieves the state of the art on 6 corpora in low-resource scenarios, with only 0.7% of trainable parameters compared to previous work.

## Introduction

The goal of neural abstractive summarization is to comprehend articles and generate summaries that faithfully convey the core idea. Different from extractive methods, which summarize articles by selecting salient sentences from the original text, abstractive methods (Song et al. 2020; Chen et al. 2020; Gehrmann, Deng, and Rush 2018; See, Liu, and Manning 2017; Rush, Chopra, and Weston 2015) are more challenging and flexible due to their ability to generate novel words. However, the success of these methods often relies on a large amount of training data with ground truth, while producing ground-truth summaries is highly complicated and often requires professionals with domain knowledge. Moreover, different kinds of articles come in various writing styles or forms, e.g., news, social media posts, and scientific papers. Therefore, Low-Resource Abstractive Summarization has emerged as an important problem in recent years, which aims to leverage related sources to improve the performance of abstractive summarization with limited target labeled examples.

Specifically, to tackle data scarcity problems, a recent line of research (Xiao et al. 2020; Zhang et al. 2020; Rothe, Narayan, and Severyn 2020; Lewis et al. 2020; Liu and Lapata 2019) leverages large pre-trained language models (Brown et al. 2020; Devlin et al. 2019), which are trained in a self-supervised way on unlabeled corpora. Since many natural language processing tasks share common knowledge in syntax, semantics, or structure, pre-trained models have attained great success in downstream summarization tasks. Another promising line of research for low-resource applications is meta learning.
For example, the recently proposed Model-Agnostic Meta-Learning (MAML) (Finn, Abbeel, and Levine 2017) performs well in a variety of NLP tasks, including machine translation (Li, Wang, and Yu 2020; Gu et al. 2018), dialogue systems (Qian and Yu 2019; Madotto et al. 2019), relation classification (Obamuyide and Vlachos 2019), semantic parsing (Guo et al. 2019), emotion learning (Zhao and Ma 2019), and natural language understanding (Dou, Yu, and Anastasopoulos 2019). Under the assumption that similar tasks possess common knowledge, MAML condenses shared information from source tasks into the form of a weight initialization. The learned initialization can then be used to learn novel tasks faster and better.

Based on these observations, we propose to integrate self-supervised language models and meta learning for low-resource abstractive summarization. However, three challenges arise when leveraging large pre-trained language models and meta learning jointly. First, most state-of-the-art summarization frameworks (Zhang et al. 2020; Rothe, Narayan, and Severyn 2020; Liu and Lapata 2019) exploit Transformer-based architectures, which possess a large number of trainable parameters. However, the training data size of the tasks in MAML is often set to be small, which may easily cause overfitting for a large model (Zintgraf et al. 2019). Second, MAML can suffer from gradient explosion or diminishing problems when the number of inner-loop iterations and the model depth increase (Antoniou, Edwards, and Storkey 2019), and both are inevitable when training summarization models. Third, MAML requires diverse source tasks to increase generalizability on novel target tasks; however, how to build such a meta-dataset from existing summarization corpora remains unknown.

To solve these challenges, we propose a simple yet effective method, named Meta-Transfer Learning for low-resource ABStractive summarization (MTL-ABS; code is available at https://github.com/YiSyuanChen/MTL-ABS). Specifically, to address the first and second challenges, we propose inserting a limited number of new parameters and layers between the layers of a pre-trained network and performing meta learning only on them. An alternative approach is to stack or replace several new layers on top of the pre-trained model and only meta-learn these layers to control model complexity. However, without re-training the full model, the performance may drop significantly due to the introduction of consecutive randomly initialized layers, and it is difficult to recover the performance using only limited target labeled examples for re-training. In addition to limiting the number of new parameters and layers, we add skip-connections to the new layers to better leverage the pre-trained model. With small initialization values, this alleviates interference at the beginning of training. Moreover, with a limited number of new layers, the framework behaves like a compact model with skip-connections, which helps prevent gradient problems during meta learning.

For the third challenge, most existing methods construct tasks from inherent labels in a single dataset. For instance, in few-shot image classification (Sun et al. 2019), tasks are defined by different combinations of class labels. However, this strategy is not applicable to abstractive summarization, since there is no specific label that characterizes article-summary pairs.
One possible solution is to randomly sample data from a single corpus such that the inherent data variance provides task diversity. Taking this idea further, we explore multiple corpora to increase the diversity across tasks. Since data from different corpora can have distributional differences, an inappropriate choice of source corpora may instead lead to negative transfer and deteriorate performance. Thus, we investigate this problem by analyzing the performance of different corpus choices. Specifically, we study several similarity criteria and show that some general rules can help avoid inappropriate choices, which is crucial in developing meta learning for NLP tasks.

The contributions are summarized as follows.

- Beyond conventional methods that only utilize a single large corpus, we further leverage multiple corpora to improve performance. To the best of our knowledge, this is the first work to explore this opportunity with meta-learning methods for abstractive summarization.
- We propose a simple yet effective method, named MTL-ABS, to tackle the low-resource abstractive summarization problem. With the proposed framework, transfer learning and meta learning cooperate successfully. Besides, we investigate the problem of choosing source corpora for meta-datasets, which is significant but not well studied in meta learning for NLP, and provide general criteria to mitigate the negative effect of inappropriate choices of source corpora.
- Experimental results show that MTL-ABS achieves the state of the art on 6 corpora in low-resource abstractive summarization, with only 0.7% of trainable parameters compared to previous work.

## Related Works

Transfer learning has been widely adopted to tackle applications with limited data. For NLP tasks, word representations are usually pre-trained by self-supervised learning on unlabeled data and then used as strong priors for downstream tasks. Among the pre-training methods, language modeling (LM) (Devlin et al. 2019; Radford et al. 2019, 2018) has achieved great success. To transfer pre-trained models to downstream tasks, it is common to add task-specific layers on top and fine-tune the full model. However, this strategy is often inefficient in terms of parameter usage, and full re-training may be required when encountering new tasks. Thus, Houlsby et al. (2019) propose a compact adapter module to transfer from the BERT model for natural language understanding tasks: each BERT layer is augmented with a few adapter modules, and only the adapter modules are learnable. Stickland and Murray (2019) similarly transfer from BERT with Projected Attention Layers (PALs) for multi-task natural language understanding, where each PAL is a multi-head attention layer residually connected to the base model.

For abstractive summarization, Liu and Lapata (2019) propose a Transformer-based encoder-decoder framework. The training process includes two-level pre-training: the encoder is first pre-trained on unlabeled data as a language model, then fine-tuned to perform extractive summarization; finally, the decoder is added to learn abstractive summarization. Zhang et al. (2020) further propose to pre-train the decoder in a self-supervised way with the Gap Sentence Generation (GSG) task.
GSG selects and masks important sentences according to the ROUGE scores between each sentence and the rest of the article, and the objective for the decoder is to reconstruct the masked sentences.

While abstractive summarization performance has been improved with various transfer learning techniques, there is much less literature on the low-resource setting. Radford et al. (2019) propose a Transformer-based language model trained on a massive-scale dataset consisting of millions of webpages, and produce abstractive summaries in a zero-shot setting based on the pre-trained generative ability. Khandelwal et al. (2019) propose a Transformer-based decoder-only language model trained with newly collected data from Wikipedia. Zhang et al. (2020) report relatively outstanding performance on various datasets in low-resource settings, thanks to a pre-training objective specially designed for abstractive summarization. However, these works only use a single large corpus for training, which can suffer from severe domain-shifting problems for some target corpora. Besides, the large model size could cause overfitting. In contrast, our framework uses a limited number of trainable parameters together with multiple corpora chosen according to the proposed criteria to mitigate the above problems.

Figure 1: Proposed summarization framework with meta-transfer learning. The adapter modules are inserted into both the encoder and decoder after every feed-forward layer. During meta-transfer learning, only the adapters and layer normalization layers are learnable. For simplicity, the learning illustration of layer normalization layers is omitted.

## Methodologies

In this work, we define the low-resource abstractive summarization problem as follows:

**Definition 1** Low-Resource Abstractive Summarization is a task that requires a model to learn from experience E, which consists of direct experience E_d containing limited monolingual article-summary pairs and indirect experience E_i, to improve the performance in abstractive summarization measured by an evaluation metric P.

The direct experience E_d refers to the training examples of the target corpus, and the indirect experience E_i could be other available resources such as pre-trained models or related corpora. For the measurement P, we use ROUGE (Lin 2004) for evaluation. In this work, we consider a challenging scenario in which the number of labeled training examples for the target corpus is limited to under 100, which matches the magnitude of the evaluation-only Document Understanding Conference (DUC) corpus. In the following, we first introduce the proposed summarization framework, and then elaborate on the meta-transfer learning process and the construction of the meta-dataset.

### Summarization Framework

**Base Model.** We choose the state-of-the-art Transformer-based encoder-decoder model (Liu and Lapata 2019) as the base summarization model. A special token [CLS] is added at the beginning of each sentence to aggregate information, while a token [SEP] is appended after each sentence as a boundary. The self-attention (SA) layer mainly consists of three sub-layers: the Multi-Headed Attention (MHA) layer, the Feed-Forward (FF) layer, and the Layer Normalization (LN) layer. The self-attention layer can be expressed as:

SA(h) = LN(FF(MHA(h)) + h),    (1)

where h represents the intermediate hidden representation. The transformer (TF) layer is stacked with self-attention layers and can thus be expressed as:

TF(h) = LN(FF(SA_{1:l}(h))),    (2)

where l indicates the number of self-attention layers.
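For concreteness, the following is a minimal PyTorch-style sketch of how the SA and TF layers in Eqs. (1) and (2) might be composed. The module names, hidden size, number of heads, and feed-forward width are illustrative assumptions rather than the authors' exact implementation, and decoder-side cross-attention is omitted for brevity.

```python
import torch.nn as nn

class SelfAttentionLayer(nn.Module):
    """SA(h) = LN(FF(MHA(h)) + h), Eq. (1)."""
    def __init__(self, d_model=768, n_heads=8, d_ff=2048):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.ln = nn.LayerNorm(d_model)

    def forward(self, h):
        attn_out, _ = self.mha(h, h, h)        # MHA(h)
        return self.ln(self.ff(attn_out) + h)  # LN(FF(MHA(h)) + h)

class TransformerLayer(nn.Module):
    """TF(h) = LN(FF(SA_{1:l}(h))), Eq. (2); l = number of stacked SA layers."""
    def __init__(self, l=1, d_model=768, d_ff=2048):
        super().__init__()
        self.sa_layers = nn.ModuleList(
            [SelfAttentionLayer(d_model) for _ in range(l)])
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.ln = nn.LayerNorm(d_model)

    def forward(self, h):
        for sa in self.sa_layers:
            h = sa(h)               # SA_{1:l}(h)
        return self.ln(self.ff(h))  # LN(FF(SA_{1:l}(h)))
```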
We set l = 1 for the encoder and l = 2 for the decoder. The encoder of the base model is initialized with BERT (Devlin et al. 2019), which is trained on the general domain. Before the meta-transfer learning, we fine-tune the encoder with an extractive objective on the chosen pre-training corpus, as previous works suggest (Liu and Lapata 2019; Li et al. 2018; Gehrmann, Deng, and Rush 2018), to improve the abstractive performance.

**Adapters.** To prevent overfitting and gradient instability when applying MAML to the large pre-trained model, we propose restricting the number of meta-trainable parameters and layers. This is practically achieved with adapter modules. The adapter module is a bottlenecked feed-forward network consisting of a down-projection layer f_{θ_d} and an up-projection layer f_{θ_u}. A skip-connection from input to output is established, which prevents the noisy initialization from interfering with training at the beginning. The adapter (ADA) can be expressed as:

ADA(h) = f_{θ_u}(ReLU(f_{θ_d}(h))) + h.    (3)

We insert adapters into each layer of the encoder and decoder to leverage pre-trained knowledge while performing meta learning. Specifically, the adapter is added after every feed-forward layer in the transformer layer. Thus, the adapted transformer (ADA-TF) layers with adapted self-attention (ADA-SA) layers can be expressed as:

ADA-SA(h) = LN(ADA(FF(MHA(h)) + h)),    (4)
ADA-TF(h) = LN(ADA(FF(ADA-SA_{1:l}(h)))).    (5)

The proposed summarization framework is illustrated in Figure 1.

### Meta-Transfer Learning for Summarization

Equipped with the adapter-enhanced summarization model, our goal is to perform meta-transfer learning for fast adaptation on new corpora. Given a pre-training corpus C^pre, a set of source corpora {C^src_j}, and a target corpus C^tgt, we aim to leverage both C^pre and {C^src_j} to improve the performance on C^tgt, which contains only a limited number of labeled examples. For abstractive summarization, a training example consists of input, prediction, and ground-truth sequences, denoted by X = [X_1, ..., X_{N_x}], Y = [Y_1, ..., Y_{N_y}], and Ŷ = [Ŷ_1, ..., Ŷ_{N_y}], respectively. Our framework comprises the base summarization model θ and the adapter modules ψ. The two parts are learned in a decoupled scheme. For the base model, given an input article x = [x_1, ..., x_{N_x}] ∈ X, the model produces a particular prediction sequence y
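To illustrate the decoupling between the frozen base model θ and the meta-learnable parameters ψ described above, here is a minimal sketch (under the same PyTorch-style assumptions as before) of the adapter module from Eq. (3) with the small initialization discussed earlier, together with a hypothetical helper that freezes everything except adapters and layer normalization layers, mirroring Figure 1. The bottleneck width and initialization scale are illustrative, not the reported configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """ADA(h) = f_{theta_u}(ReLU(f_{theta_d}(h))) + h, Eq. (3)."""
    def __init__(self, d_model=768, bottleneck=64, init_scale=1e-3):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # down-projection f_{theta_d}
        self.up = nn.Linear(bottleneck, d_model)    # up-projection f_{theta_u}
        # Small initialization keeps ADA(h) close to h at the start of training,
        # so the adapter barely interferes with the pre-trained representations.
        for layer in (self.down, self.up):
            nn.init.normal_(layer.weight, std=init_scale)
            nn.init.zeros_(layer.bias)

    def forward(self, h):
        # Skip-connection from input to output.
        return self.up(torch.relu(self.down(h))) + h

def mark_meta_trainable(model: nn.Module):
    """Freeze the base model (theta); leave only adapters and LayerNorms (psi)
    trainable, as stated in the Figure 1 caption."""
    for p in model.parameters():
        p.requires_grad = False
    for module in model.modules():
        if isinstance(module, (Adapter, nn.LayerNorm)):
            for p in module.parameters():
                p.requires_grad = True
```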