# CONT: Contrastive Neural Text Generation

Chenxin An¹,², Jiangtao Feng², Kai Lv¹, Lingpeng Kong²,³, Xipeng Qiu¹, Xuanjing Huang¹,⁴
¹Fudan University, ²Shark-NLP Shanghai AI Laboratory, ³The University of Hong Kong, ⁴Shanghai Collaborative Innovation Center of Intelligent Visual Computing
{cxan20, klv21, xpqiu, xjhuang}@fudan.edu.cn, fengjiangtao@pjlab.org.cn, lpk@cs.hku.hk

Recently, contrastive learning has attracted increasing interest in neural text generation as a new solution to alleviate the exposure bias problem. It introduces a sequence-level training signal, which is crucial to generation tasks that rely on autoregressive decoding. However, previous methods using contrastive learning in neural text generation usually lead to inferior performance. In this paper, we analyse the underlying reasons and propose a new Contrastive Neural Text generation framework, CONT. CONT addresses the bottlenecks that prevent contrastive learning from being widely adopted in generation tasks from three aspects: the construction of contrastive examples, the choice of the contrastive loss, and the strategy in decoding. We validate CONT on five generation tasks with ten benchmarks, including machine translation, summarization, code comment generation, data-to-text generation and commonsense generation. Experimental results show that CONT clearly outperforms the conventional training framework on all ten benchmarks with a convincing margin. In particular, CONT surpasses the previous most competitive contrastive learning method for text generation by 1.50 BLEU on machine translation and 1.77 ROUGE-1 on summarization, respectively. It achieves a new state-of-the-art on summarization, code comment generation (without external data) and data-to-text generation.²

This work was done during Chenxin An's internship at Shanghai AI Laboratory.
² The code is available at https://github.com/Shark-NLP/CoNT

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## 1 Introduction

Contrastive learning has achieved great success in representation learning [6, 44, 45]. It has also attracted enormous interest in neural text generation recently. By creating positive and negative examples in response to unseen (or erroneous) inputs [23], contrastive learning offers a new solution to alleviate the exposure bias problem [3, 35]: an autoregressive model trained only on the ground truths exhibits inferior generalization performance. Apart from that, contrastive learning also introduces a sequence-level loss in addition to the conventional token-level language model loss with maximum likelihood estimation (MLE). This is crucial to most conditional text generation tasks (e.g., machine translation and summarization), which are evaluated on sequence-level metrics (e.g., BLEU [32]). However, it is non-trivial to get contrastive learning working on neural text generation. If we simply use from-batch positive-negative samples following SimCLR [6] and adopt the InfoNCE loss [13, 45], which ignores the difference between negative samples (§2.2; NaiveCL), the improvement over non-contrastive baselines on generation tasks is rather marginal. Previous work attempts to build better contrastive samples by disturbing the ground truth [10, 23, 30] in the discrete space or the continuous embedding space, but when it comes to text generation tasks, their performance gains are still far from satisfactory.
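To make the naive recipe described above concrete, the following is a minimal, illustrative sketch of from-batch contrastive learning with the InfoNCE loss (the NaiveCL baseline); the pooling choice, the function name, and the temperature value are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def naive_from_batch_infonce(src_repr: torch.Tensor,
                             tgt_repr: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """SimCLR-style InfoNCE over from-batch samples (illustrative sketch).

    src_repr: [B, d] pooled encoder representations of the source sequences.
    tgt_repr: [B, d] pooled decoder representations of the ground-truth targets.
    For each source i, target i is the positive and the other B-1 targets in
    the batch are treated as negatives.
    """
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    # Cosine similarity between every source and every target in the batch.
    logits = src @ tgt.t() / temperature            # [B, B]
    labels = torch.arange(src.size(0), device=src.device)
    # Cross-entropy with the diagonal entries as positives == InfoNCE.
    return F.cross_entropy(logits, labels)
```

Note that every off-diagonal target contributes equally to the denominator, which is exactly the property identified above as problematic for generation tasks.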
Figure 1: A case study from the IWSLT14 De-En translation task. The naive setting uses from-batch samples following SimCLR [6]. Compared with the naive method, CONT incorporates both self-generated samples and from-batch samples. The border color indicates the actual distance between the ground truth and the contrastive example.

In this work, we propose a new contrastive neural text generation framework, CONT. CONT differs in three ways from previous frameworks, which make suboptimal use of contrastive learning. First, CONT samples contrastive examples from its own predictions (e.g., through the beam search algorithm). This training procedure exposes the model to its mistakes in the inference stage and effectively alleviates the exposure bias problem. We show a comparison between negative samples in CONT and in NaiveCL in Figure 1. Second, we use an N-pairs contrastive loss which gives a fine-grained treatment to the contrastive examples based on their sequence-level scores (e.g., BLEU). It allows the model to fully leverage the supervision from the ground-truth example (and its own generated examples) to learn a better sequence-level distance function between the source and the target representation. Third, we directly incorporate the learned sequence similarity score from the distance function into the inference stage. This helps the model find a better global configuration than merely following the language model likelihood objective in decoding.

We validate CONT on various important conditional language generation tasks (§4.2), including machine translation, summarization, code comment generation, data-to-text generation, and commonsense generation. Extensive experiments demonstrate that CONT greatly improves the conventional MLE baselines and significantly outperforms all previous contrastive generation models. CONT establishes new state-of-the-art results on summarization, code comment generation (without external data), and data-to-text generation. Particularly, on data-to-text generation and commonsense generation, CONT achieves on-par performance with the powerful large pre-trained models T5-large and T5-3B [36] using only the base version of T5, while maintaining the efficiency of lightweight models.

## 2 Background

### 2.1 Neural Conditional Text Generation

A neural sequence-to-sequence model [43] $M = (f, g)$ generates the target sequence conditioned on a source sequence, where $f$ and $g$ denote the encoder and decoder, respectively. It is typically trained using the language model objective with maximum likelihood estimation (MLE). Given a source sequence $x = \{x_i\}_{i=0}^{M}$ and its target sequence $y = \{y_i\}_{i=0}^{N}$, we minimize the following negative log-likelihood (NLL) loss:

$$\mathcal{L}_{\mathrm{NLL}} = -\sum_{t=1}^{N} \log p(y_t \mid x, y_{<t}). \tag{1}$$
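As a concrete illustration of this token-level objective, here is a minimal sketch, assuming a standard PyTorch encoder-decoder that returns per-token logits; the function and argument names are illustrative rather than taken from the paper's codebase.

```python
import torch
import torch.nn.functional as F

def nll_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Token-level MLE objective of Eq. (1).

    logits:  [B, N, V] decoder outputs, where position t is conditioned on x and y_{<t}.
    targets: [B, N]    gold target tokens y_t.
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # [B*N, V]
        targets.reshape(-1),                   # [B*N]
        ignore_index=pad_id,                   # do not penalize padding positions
    )
```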
Figure 2: An overview of CONT. $z_x$ and $z_y$ are the representations of the source sequence $x$ and its target sequence $y$. $y'$ and $y''$, with their representations $z_{y'}$ and $z_{y''}$, are returned by the beam search algorithm. The feature representations come from pooling the output of the encoder (source sequence) or decoder (target sequence). Our training objective is obtained by comparing all contrastive samples in pairs. The decoding objective considers not only the likelihood of each sequence but also the sequence similarity score modeled in training.

In the naive contrastive setting (NaiveCL), positive and negative examples are drawn from the same batch following SimCLR [6], and the model is trained with the InfoNCE loss:

$$\mathcal{L}_{\mathrm{NCE}} = -\log \frac{\exp(\cos(z_x, z_y)/\tau)}{\sum_{y' \in \mathcal{B}} \exp(\cos(z_x, z_{y'})/\tau)}, \tag{2}$$

where $z_x, z_y, z_{y'} \in \mathbb{R}^d$ denote the vector representations of the input $x$, the ground truth $y$, and a negative sample $y' \in \mathcal{B}$, respectively. $\tau$ is the temperature and $\cos(\cdot,\cdot)$ denotes cosine similarity. Intuitively, the contrastive loss $\mathcal{L}_{\mathrm{NCE}}$ seeks to learn a similarity function that pulls the source sequence representation $z_x$ and its ground-truth target sequence representation $z_y$ closer together.

In this section, we present our new contrastive neural text generation framework, CONT. CONT advances the NaiveCL (§2.2) in three aspects. First, CONT uses negative examples from its own predictions (§3.1) to construct the set $\mathcal{B}$. Second, CONT replaces the InfoNCE loss (Eq. 2) with an N-pairs contrastive loss (Eq. 3), which leverages finer-grained supervision given by the sequence-level scores of all pairs (§3.2). Third, CONT incorporates the learned similarity function into its inference score directly (§3.3). An overview of our approach can be found in Figure 2.

### 3.1 Contrastive Examples from Predictions

Instead of only using contrastive examples from the same batch [6], we propose to add new contrastive examples from the model's own predictions. Kalkstein et al. [18] point out that using diverse contrastive samples helps the generalization ability of the model. Therefore, we use the diverse beam search algorithm [49] to create contrastive examples from the top-K list of the model's latest predictions, and then append them to the from-batch samples to form the contrastive examples. A warm-up stage where the model is only supervised by $\mathcal{L}_{\mathrm{NLL}}$ is recommended, as it guarantees the quality of the examples drawn from the model's predictions. These self-generated contrastive examples alleviate the model's exposure bias. Besides, as the model's performance gradually improves, this approach creates high-quality hard negative examples, which are known to be important in contrastive learning [16, 37].

### 3.2 N-Pairs Contrastive Loss

One major drawback of the InfoNCE loss is that it treats all negative samples identically. In text generation, this means that the relative difference between the ground truth and the contrastive examples is ignored, even though the quality of these contrastive examples varies and the difference can easily be quantified with a sequence-level score (e.g., BLEU). To mitigate this problem, we propose to employ a pair-wise margin loss. We first rank all the contrastive examples with an oracle function $o(\cdot, y)$, which computes a sequence-level score against the ground truth $y$. Given an input sequence $x$, the ground truth $y$, and a set of $K$ contrastive samples $\mathcal{B} = \{y_1, y_2, \ldots, y_K\}$, we can create a series of example pairs $(y^+, y^-) \in \mathcal{P}$, where $+$ and $-$ are determined by their ranks. The contrastive learning objective is formulated as a margin loss over their cosine similarities to the source representation $z_x$:

$$\mathcal{L}(y^+, y^-) = \max\{0,\, \cos(z_x, z_{y^-}) - \cos(z_x, z_{y^+}) + \xi\}. \tag{3}$$

We further set $\xi = \gamma \times (\mathrm{rank}(y^-) - \mathrm{rank}(y^+))$ following Zhong et al. [57] to reflect the quality difference in these pairs, where $\gamma$ is a hyperparameter controlling the strength of the margin. Full details of the training algorithm can be found in Algorithm 2, Appendix B.
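The following is a minimal sketch of this pair-wise objective, assuming candidates have already been generated (§3.1, e.g., with diverse beam search) and scored with an oracle such as sentence-level BLEU; the tensor shapes, oracle, and hyperparameter value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def n_pairs_margin_loss(src_repr: torch.Tensor,
                        cand_repr: torch.Tensor,
                        oracle_scores: torch.Tensor,
                        gamma: float = 0.01) -> torch.Tensor:
    """Pair-wise margin loss of Eq. (3) with rank-scaled margins.

    src_repr:      [d]    pooled representation z_x of the source sequence.
    cand_repr:     [K, d] representations of the contrastive samples in B
                          (ground truth + self-generated + from-batch samples).
    oracle_scores: [K]    sequence-level scores o(y_k, y), e.g. sentence BLEU.
    """
    # Rank candidates by oracle score: index 0 is the best-ranked candidate.
    order = torch.argsort(oracle_scores, descending=True)
    sims = F.cosine_similarity(src_repr.unsqueeze(0), cand_repr, dim=-1)[order]  # [K]

    losses = []
    K = sims.size(0)
    for i in range(K):              # y+ : higher-ranked (better) candidate
        for j in range(i + 1, K):   # y- : lower-ranked (worse) candidate
            margin = gamma * (j - i)    # xi = gamma * (rank(y-) - rank(y+))
            losses.append(F.relu(sims[j] - sims[i] + margin))
    return torch.stack(losses).mean()
```

How the ground truth is inserted into the candidate ranking and which pairs are actually used are training details specified in the paper's Algorithm 2 (Appendix B); the double loop above simply enumerates every ranked pair.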
### 3.3 Inference with Learned Similarity Function

Previous inference algorithms for contrastive text generation methods [23] usually remain the same as those of non-contrastive approaches. In CONT, we incorporate the similarity function learned with the N-pairs contrastive loss into the decoding stage. Although such an inference objective can be generalized to other contrastive learning methods as long as vector representations of the source and target sequence pair exist, the design of CONT makes better use of the learned similarity function (§4.3). The decoding objective in CONT is to find the sequence $y^*$ that maximizes both the learned similarity score and the conventional language model likelihood:

$$y^* = \arg\max_{\hat{y}} \Big\{ \alpha \cdot \cos(z_x, z_{\hat{y}}) + (1-\alpha) \sum_{t} \log p(\hat{y}_t \mid x, \hat{y}_{<t}) \Big\},$$

where $\alpha \in (0, 1)$ is a balance factor. In practice, the candidates $\hat{y}$ are the $b$ hypotheses returned by beam search, and CONT selects the one with the highest combined score, as summarized in Algorithm 1.

Algorithm 1 Inference: given an input sequence $x$ and a contrastive generation model $\hat{M} = (\hat{f}, \hat{g})$, return the output sequence.
1: procedure BEAMSEARCH($\hat{g}$, $H_x$, $b$) ▷ beam search algorithm
2: return text, likelihood, and logits of the $b$ hypotheses
1: procedure INFERENCE($\hat{M}$, $x$)
2: $H_x \leftarrow \hat{f}(x)$, $b \leftarrow$ beam size, $\alpha \leftarrow$ balance factor, $\alpha \in (0, 1)$
3: $y^{1:b}, P_y^{1:b} \leftarrow$ BEAMSEARCH($\hat{g}$, $H_x$, $b$) ▷ get $b$ candidates with beam search
4: $z_x, z_y^{1:b} \leftarrow$ Avg($H_x$), Avg($H_y^{1:b}$) ▷ Avg($\cdot$) is an average pooling function
5: $D_y^{1:b} \leftarrow$ cosine similarity between $z_x$ and the hypothesis representations $z_y^{1:b}$
6: ▷ $P_y^{1:b}$ is the likelihood of the hypotheses returned by beam search
7: $k \leftarrow \arg\max_{i=1..b} \{\alpha \cdot D_y^i + (1-\alpha) \cdot P_y^i\}$
8: return $y^k$
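A minimal sketch of this re-ranking step (the INFERENCE procedure of Algorithm 1) might look as follows; the candidate representations, the likelihood scale, and the value of $\alpha$ are assumptions for illustration rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def rerank_with_similarity(cand_texts: list,
                           cand_loglik: torch.Tensor,
                           cand_repr: torch.Tensor,
                           src_repr: torch.Tensor,
                           alpha: float = 0.5) -> str:
    """Select the hypothesis maximizing alpha * cos(z_x, z_y) + (1 - alpha) * likelihood.

    cand_texts:  list of b beam-search hypotheses.
    cand_loglik: [b]    likelihood score of each hypothesis from beam search.
    cand_repr:   [b, d] average-pooled decoder representations of the hypotheses.
    src_repr:    [d]    average-pooled encoder representation of the source.
    """
    sims = F.cosine_similarity(src_repr.unsqueeze(0), cand_repr, dim=-1)  # [b]
    scores = alpha * sims + (1.0 - alpha) * cand_loglik
    best = int(torch.argmax(scores))
    return cand_texts[best]
```

For the two terms to be combined meaningfully, the likelihood score should be on a scale comparable to the cosine similarity (for example, length-normalized), as suggested by the per-candidate scores illustrated in Figure 2.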