# NEURAL TEXT GENERATION WITH UNLIKELIHOOD TRAINING

Sean Welleck1,2 Ilia Kulikov1,2 Stephen Roller2 Emily Dinan2 Kyunghyun Cho1,2,3 & Jason Weston1,2

1New York University, 2Facebook AI Research, 3CIFAR Azrieli Global Scholar

Equal contribution; the ordering was decided by a coin flip.

## ABSTRACT

Neural text generation is a key tool in natural language applications, but it is well known that there are major problems at its core. In particular, standard likelihood training and decoding leads to dull and repetitive outputs (Holtzman et al., 2019). While some post-hoc fixes have been proposed, in particular top-k and nucleus sampling, they do not address the fact that the token-level probabilities predicted by the model are poor. In this paper we show that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences containing repeats and frequent words, unlike those from the human training distribution. We propose a new objective, unlikelihood training, which forces unlikely generations to be assigned lower probability by the model. We show that both token and sequence level unlikelihood training give less repetitive, less dull text while maintaining perplexity, giving superior generations using standard greedy or beam search. According to human evaluations, our approach with standard beam search also outperforms the currently popular decoding methods of nucleus sampling or beam blocking, thus providing a strong alternative to existing techniques.

## 1 INTRODUCTION

Neural text generation is a vital tool in a wide range of natural language applications. However, the standard approach of training a sequence-to-sequence model, e.g. a Transformer (Vaswani et al., 2017), to maximize log-likelihood and then approximately decoding the most likely sequence is known to be flawed. Generated text in open-ended applications such as language modeling or dialogue has been observed to be dull, with high frequency tokens used too often and interesting content words used too rarely (Holtzman et al., 2019; Dinan et al., 2019). Moreover, the models repeat themselves at the token, phrase, and sentence levels, and statistics comparing a set of human-generated utterances and model-generated responses indicate a discrepancy between the human and model word distributions. This does not appear to be rectified by training on more data (Radford et al., 2019). Recent fixes involve modifying the decoding strategy using sampling or more sophisticated beam search variants. However, these decoding strategies do not address the core issue: the model's underlying sequence probabilities are clearly not correct.

Several reasons for exactly why neural text is degenerate have been posited, with the cause currently unknown. Possible candidates include the problem being (i) a by-product of the model architecture, e.g. the Transformer architecture preferring repeats (Holtzman et al., 2019; Vig, 2018), (ii) an intrinsic property of human language (Holtzman et al., 2019) rather than a modeling deficiency, or (iii) that a training objective relying on fixed corpora cannot take into account the real goal of using the language (Choi, 2018). Our work shows that, while the above may be factors, a primary factor is the use of the likelihood objective itself, as we demonstrate that degeneration is alleviated if we replace the likelihood objective with our proposal.
While low perplexity in the limit should lead to predicting the correct next target word, there are two major flaws of the likelihood objective: (i) it pays relatively little attention to the argmax or the top of the ranked list of next-token probabilities, instead optimizing the likelihood of the entire distribution; (ii) it is not focused on optimizing sequence generation, only on producing the next token. The first issue means that greedy or beam search decoding, which rely on the top of the list to generate, are not optimized: there is a discrepancy between maximizing the log-probability of a ground-truth token and ensuring the rank of the ground-truth token to be one. The second issue means that during sequence generation, any imperfection in next-token prediction leads to error accumulation that is not addressed by likelihood training.

In this work, we introduce unlikelihood training, an approach that addresses the two aforementioned issues. It combines two types of updates: a likelihood update on the true target tokens so that they are assigned high probability, and an unlikelihood update on tokens that are otherwise assigned too high a probability. We can collect these unlikely token candidates either during next-token prediction or from generated sequences, allowing us to train at both the token and sequence levels. Both token and sequence level unlikelihood training are shown to improve metrics that measure dullness and repetition of the model, while maintaining performance in other metrics such as perplexity or token accuracy compared to the maximum likelihood baseline.

Finally, we assess our models using human evaluations. We find that our generations have vastly improved quality compared to likelihood trained models when both models use beam search decoding. Moreover, our approach when using beam search also significantly improves over likelihood trained models using either beam blocking or nucleus sampling, thus outperforming the current state-of-the-art.

## 2 RELATED WORK

**Neural Text Degeneration** Recently, several papers have observed various forms of neural text degeneration, especially in open-ended generation tasks. In dialogue, it has been shown that there is a mismatch between model and human word distributions, where generative models are more likely to output frequent words, but less likely to produce rare words, compared to humans. For example, this was observed across all generative models submitted to the ConvAI2 NeurIPS 2018 competition (Dinan et al., 2019). In language modeling, the work of Holtzman et al. (2019) highlighted problems with the word frequency distribution and level of repetition in model generations compared to human text. These issues are not remedied by simply increasing the amount of training data; e.g., large-scale GPT-2 language models (Radford et al., 2019) display the same issues.

**Improved Decoding Algorithms** Several methods have been proposed to rectify these issues. The primary ones involve changing the decoding method to a sophisticated beam search variant or to stochastic decoding, e.g. sampling. Different variants of beam search have been explored (Li et al., 2016; Vijayakumar et al., 2018; Kulikov et al., 2018; Holtzman et al., 2018) which can decrease a model's level of repetition by selecting candidates that are unlike previously chosen ones. Separately, hard or soft beam blocking has been investigated (Paulus et al., 2017; Klein et al., 2017), whereby previously generated n-grams are blocked from subsequent generation. This approach is often used in dialogue generation, fixing some token or phrase level repetitions but also removing repetitions that would naturally occur in human text.

The second major approach is that of sampling from the model at generation time. Top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019) are two methods that sample sequences based on a function of the predicted next-token probability distribution given by the model. Both approaches vastly improve the repetition issue, as the randomization often reduces the number of duplicate tokens in a decoded sequence, even if highly scored paths under the model (represented by beam search candidates) contain repetitions. However, as the underlying model is unchanged, it often prefers semantically similar phrasing, depending on the temperature parameter of the sampling (Holtzman et al., 2019). Furthermore, this solution is less relevant in less open-ended tasks such as machine translation, where beam search variants are the preferred method. Ideally we would like a model that can work with both beam and sampling decoding methods.
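As a concrete illustration of these two stochastic decoding fixes, the following is a minimal PyTorch-style sketch of top-k and nucleus (top-p) filtering applied to one step of next-token logits. It is a generic sketch of the techniques described above, not code from any of the cited papers; the tensor shapes, vocabulary size, and threshold values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def top_k_filter(logits, k):
    """Keep only the k highest-scoring tokens; set all other logits to -inf."""
    kth_best = torch.topk(logits, k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth_best, float("-inf"))

def nucleus_filter(logits, p):
    """Keep the smallest set of tokens whose cumulative probability is >= p."""
    sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)
    cum_probs = F.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > p
    remove[..., 1:] = remove[..., :-1].clone()   # keep the token that crosses p
    remove[..., 0] = False                       # always keep the top token
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    # Scatter the filtered scores back into the original vocabulary order.
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

# Example: sample one next token from the filtered distribution.
logits = torch.randn(1, 50000)                   # one step of language-model logits
probs = F.softmax(nucleus_filter(logits, p=0.9), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```

In both cases only the sampling distribution at decoding time changes; the trained model and its token-level probabilities are left untouched, which is exactly the limitation that the unlikelihood objective targets.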
**Improved Learning Algorithms** The proposed learning criteria are closely related to structured output prediction methods in which the goal is to increase the scores assigned by a model to true examples while decreasing those assigned to negative examples, which are often generated by the model itself. Some representative algorithms include the structured perceptron (Collins, 2002), energy-based models (LeCun et al., 2006) and, more recently, reflective likelihood (Dieng et al., 2018). A particular variant in this family of algorithms, called negative training, was recently used by He and Glass (2019) to prevent generic and malicious responses in dialogue models. Similarly, these structured prediction algorithms with neural language models have been applied to machine translation in recent years by Shen et al. (2015) and Edunov et al. (2017).

## 3 NEURAL TEXT GENERATION

**Language Modeling** In language modeling, our goal is to model a probability distribution $p_*(\mathbf{x})$ over variable-length text sequences $\mathbf{x} = (x_1, \ldots, x_{|\mathbf{x}|})$ composed of tokens from a vocabulary, $x_t \in \mathcal{V}$. We wish to find a model $p_\theta(\mathbf{x})$ which resembles $p_*(\mathbf{x})$, meaning that samples $\hat{\mathbf{x}} \sim p_\theta$ are similar to samples from $p_*$, and $p_\theta(\mathbf{x}) \approx p_*(\mathbf{x})$ for all $\mathbf{x}$. When $p_\theta(\mathbf{x})$ is parameterized by a neural network, we call $p_\theta$ a neural language model. We assume that $p_\theta$ takes the form

$$p_\theta(\mathbf{x}) = \prod_{t=1}^{|\mathbf{x}|} p_\theta(x_t \mid x_{<t}).$$

The de facto approach to training such a model is to find parameters $\theta$ that maximize the log-likelihood of a finite set of training sequences $\mathcal{D}$ drawn from $p_*$, i.e. to minimize

$$\mathcal{L}_{\text{MLE}}(p_\theta, \mathcal{D}) = -\sum_{i=1}^{|\mathcal{D}|} \sum_{t=1}^{|\mathbf{x}^{(i)}|} \log p_\theta\big(x^{(i)}_t \mid x^{(i)}_{<t}\big). \quad (1)$$

**Sequence Completion** A closely related problem consists of sampling a sub-sequence, or prefix, $x_{1:k} \sim p_*$, and then using $p_\theta$ to conditionally decode a continuation $\hat{x}_{k+1:N} \sim p_\theta(\cdot \mid x_{1:k})$, where the resulting completion $(x_1, \ldots, x_k, \hat{x}_{k+1}, \ldots, \hat{x}_N)$ should resemble a sample from $p_*$. Decoding is performed either deterministically, e.g. with greedy search or beam search, or stochastically, e.g. with top-k sampling (Fan et al., 2018), which samples from the $k$ most probable next tokens, or nucleus sampling, which samples from the smallest set of tokens whose cumulative probability mass is at least $p$ (Holtzman et al., 2019).
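To ground the notation, here is a minimal PyTorch-style sketch of the maximum likelihood objective in Equation 1 and of greedy sequence completion. The `model` interface (a causal network mapping a batch of token ids to per-position next-token logits) and all tensor shapes are assumptions made for illustration; they are not part of the paper.

```python
import torch
import torch.nn.functional as F

def mle_loss(model, x):
    """Maximum likelihood objective (Eq. 1): summed negative log-likelihood
    of each next token given its prefix."""
    inputs, targets = x[:, :-1], x[:, 1:]           # x: (batch, T) token ids
    logits = model(inputs)                          # (batch, T-1, |V|)
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return nll.sum()

@torch.no_grad()
def greedy_completion(model, prefix, n_steps):
    """Sequence completion with greedy (argmax) decoding."""
    x = prefix.clone()                              # prefix: (1, k) token ids
    for _ in range(n_steps):
        next_logits = model(x)[:, -1, :]            # scores for the next token
        next_token = next_logits.argmax(dim=-1, keepdim=True)
        x = torch.cat([x, next_token], dim=1)
    return x
```

Under this training loss the model is only ever asked to score the true next token given a ground-truth prefix; nothing in Equation 1 discourages the kind of repetitive continuations that `greedy_completion` can produce, which is the gap examined next.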
## 4 NEURAL TEXT DEGENERATION

In this section we discuss two degenerate properties that frequently occur in conventional neural language models trained with the maximum likelihood objective (Equation 1).

**Repetition** First, model-generated continuations exhibit sequence-level repetition, especially with deterministic decoding. The problem is seen by observing samples in Appendix Table 4, which shows completions from the state-of-the-art GPT-2 language model (Radford et al., 2019). Greedy decoding as well as top-k and nucleus sampling exhibit degenerate repetition (with certain hyper-parameter settings), although greedy decoding shows the worst degradation. Using a Transformer language model trained with maximum likelihood (§6), we find that the average percentage of repeated n-grams in model continuations with greedy decoding (43%) far exceeds that of humans (0.5%), computed over prefixes drawn from a validation corpus.

Unlike previous work, which focused only on degenerate sequence-level repeats (Holtzman et al., 2019), we additionally observe that neural language models exhibit substantially more repetition in next-token prediction compared to human text:

$$\Pr\big(\hat{x}_{k+1} = \arg\max_x p_\theta(x \mid x_{1:k}) \in x_{1:k}\big) > \Pr\big(x_{k+1} \in x_{1:k}\big). \quad (2)$$

For instance, the Transformer language model (§6) predicted next-tokens that appeared in the preceding 128 words 62% of the time, versus 49% in ground-truth text. This is especially concerning since the maximum-likelihood objective focuses on optimizing next-token conditional distributions.

**Token Distribution Mismatch** Second, both greedy continuations and next-token predictions from conventional neural text generators have different token distributions from human text. As demonstrated by Holtzman et al. (2019), such models with greedy or beam search tend to produce high frequency tokens too often and low frequency tokens too rarely, where frequency is defined by the human token distribution. With the Transformer language model (§6), the set of next-token greedy predictions on a held-out validation set had roughly 40% fewer unique tokens than the ground-truth tokens (11.6k vs. 18.9k), and overproduced frequent tokens (Appendix Figure 1). Such behavior has been linked to generations being judged as dull by humans, because rare words can add engaging specificity (Weston et al., 2018; See et al., 2019).

## 5 THE UNLIKELIHOOD TRAINING OBJECTIVE

We now describe unlikelihood training for neural language models, and then in Section 6 demonstrate empirically that our proposal substantially improves neural text degeneration (§4).

### 5.1 UNLIKELIHOOD TRAINING

The key idea behind unlikelihood training is decreasing the model's probability of certain tokens, called negative candidates. Given a sequence $(x_1, \ldots, x_T)$ and a set of negative candidate tokens $\mathcal{C}^t = \{c_1, \ldots, c_m\}$, where each $c_j \in \mathcal{V}$, we define the unlikelihood loss for step $t$ as:

$$\mathcal{L}^t_{\text{UL}}\big(p_\theta(\cdot \mid x_{<t}), \mathcal{C}^t\big) = -\sum_{c \in \mathcal{C}^t} \log\big(1 - p_\theta(c \mid x_{<t})\big). \quad (3)$$

The loss decreases as $p_\theta(c \mid x_{<t})$ decreases, i.e. as probability mass is moved away from the negative candidates. We combine the unlikelihood loss with the standard likelihood loss to obtain the token-level unlikelihood objective:

$$\mathcal{L}^t_{\text{UL-token}}\big(p_\theta(\cdot \mid x_{<t}), \mathcal{C}^t\big) = -\alpha \cdot \sum_{c \in \mathcal{C}^t} \log\big(1 - p_\theta(c \mid x_{<t})\big) - \log p_\theta(x_t \mid x_{<t}), \quad (4)$$

where $\alpha \in \mathbb{R}$ is a mixing hyper-parameter. As negative candidates we use the previous context tokens, $\mathcal{C}^t_{\text{prev-context}} = \{x_1, \ldots, x_{t-1}\} \setminus \{x_t\}$, which penalizes the model for assigning high probability to tokens that already appear in the context and thus directly targets the repetition and frequency problems of §4.

### 5.2 SEQUENCE-LEVEL UNLIKELIHOOD TRAINING

While the token-level unlikelihood objective efficiently augments maximum likelihood training with token-level penalties, it is limited to prefixes drawn from the training distribution. The resulting distribution mismatch between training sequences and generated sequences is a known issue with maximum-likelihood training, motivating objectives that operate on model-generated sequences (Daumé III et al., 2009; Ross et al., 2011; Ranzato et al., 2015; Yu et al., 2016).

We thus propose a sequence-level unlikelihood objective which uses unlikelihood on decoded continuations. That is, given a prefix $(x_1, \ldots, x_k) \sim p_*$, we decode a continuation $(x_{k+1}, \ldots, x_{k+N}) \sim p_\theta(\cdot \mid x_1, \ldots, x_k)$, construct per-step negative candidate sets $(\mathcal{C}^{k+1}, \ldots, \mathcal{C}^{k+N})$, and define each per-step sequence-level loss for $t \in \{k+1, \ldots, k+N\}$ as:

$$\mathcal{L}^t_{\text{ULS}}\big(p_\theta(\cdot \mid x_{<t}), \mathcal{C}^t\big) = -\sum_{c \in \mathcal{C}^t} \log\big(1 - p_\theta(c \mid x_{<t})\big), \quad (5)$$

where each $\mathcal{C}^t$ contains the decoded token $x_t$ when it takes part in a repeated n-gram of the continuation, and is empty otherwise.
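To make the objectives concrete, here is a minimal PyTorch-style sketch of the token-level objective in Equation 4 with previous-context negative candidates, together with one possible way to build repeated-n-gram candidates for the per-step sequence-level loss in Equation 5. The `model` interface and the `greedy_completion` helper are the same illustrative assumptions as in the sketch after Section 3, and the candidate construction is a simplified stand-in rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def token_unlikelihood_loss(logits, targets, alpha=1.0):
    """Token-level unlikelihood objective (Eq. 4).
    logits: (batch, T, |V|) next-token scores aligned with targets (batch, T).
    Negative candidates at step t are the previously seen target tokens,
    excluding the current target (duplicates are penalized per occurrence
    here, whereas Eq. 4 treats the candidates as a set)."""
    logp = F.log_softmax(logits, dim=-1)
    # Likelihood term: standard negative log-likelihood of the true tokens.
    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)        # (batch, T)

    batch, T = targets.shape
    # prev[t, i] is True when position i precedes position t.
    prev = torch.ones(T, T, device=targets.device).tril(-1).bool()
    ctx = targets.unsqueeze(1).expand(batch, T, T)                   # ctx[b, t, i] = x_i
    # Probability assigned at step t to each earlier context token x_i.
    p_ctx = logp.exp().gather(-1, ctx)                               # (batch, T, T)
    cand = prev.unsqueeze(0) & (ctx != targets.unsqueeze(-1))        # drop the true token
    ul = -torch.log1p(-p_ctx.clamp(max=1 - 1e-6)) * cand.float()

    return (nll + alpha * ul.sum(-1)).sum()

def repeat_ngram_candidates(tokens, n=4):
    """Mark positions that fall inside an n-gram already seen earlier in
    the sequence (a simplified repeated-n-gram candidate set)."""
    seen, mask = set(), [False] * len(tokens)
    for t in range(len(tokens) - n + 1):
        ngram = tuple(tokens[t:t + n])
        if ngram in seen:
            for i in range(t, t + n):
                mask[i] = True
        seen.add(ngram)
    return mask

def sequence_unlikelihood_loss(model, prefix, n_steps, n=4):
    """Sequence-level unlikelihood (Eq. 5): decode a continuation, then
    penalize the probability of its tokens that form repeated n-grams."""
    seq = greedy_completion(model, prefix, n_steps)   # decoding itself is not differentiated
    logits = model(seq[:, :-1])                       # (1, L-1, |V|)
    logp = F.log_softmax(logits, dim=-1)
    cand = repeat_ngram_candidates(seq[0].tolist(), n=n)
    loss = logits.new_zeros(())
    for t in range(prefix.size(1), seq.size(1)):      # continuation steps only
        if cand[t]:
            p = logp[0, t - 1, seq[0, t]].exp()
            loss = loss - torch.log1p(-p.clamp(max=1 - 1e-6))
    return loss
```

Note that only the re-scoring pass through `model` is differentiated; the decoded continuation itself is treated as fixed data, mirroring how the per-step losses above are defined on an already decoded sequence.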