# NEURAL TEXT GENERATION WITH UNLIKELIHOOD TRAINING

Sean Welleck1,2 Ilia Kulikov1,2 Stephen Roller2 Emily Dinan2 Kyunghyun Cho1,2,3 & Jason Weston1,2

1New York University, 2Facebook AI Research, 3CIFAR Azrieli Global Scholar

Equal contribution; the ordering was decided by a coin flip.

## ABSTRACT

Neural text generation is a key tool in natural language applications, but it is well known that there are major problems at its core. In particular, standard likelihood training and decoding leads to dull and repetitive outputs (Holtzman et al., 2019). While some post-hoc fixes have been proposed, in particular top-k and nucleus sampling, they do not address the fact that the token-level probabilities predicted by the model are poor. In this paper we show that the likelihood objective itself is at fault, resulting in a model that assigns too much probability to sequences containing repeats and frequent words, unlike those from the human training distribution. We propose a new objective, unlikelihood training, which forces unlikely generations to be assigned lower probability by the model. We show that both token and sequence level unlikelihood training give less repetitive, less dull text while maintaining perplexity, giving superior generations using standard greedy or beam search. According to human evaluations, our approach with standard beam search also outperforms the currently popular decoding methods of nucleus sampling or beam blocking, thus providing a strong alternative to existing techniques.

## 1 INTRODUCTION

Neural text generation is a vital tool in a wide range of natural language applications. However, the standard approach of training a sequence-to-sequence model, e.g. a Transformer (Vaswani et al., 2017), to maximize log-likelihood and then approximately decoding the most likely sequence is known to be flawed. Generated text in open-ended applications such as language modeling or dialogue has been observed to be dull, with high frequency tokens used too often and interesting content words used too rarely (Holtzman et al., 2019; Dinan et al., 2019). Moreover, the models repeat themselves at the token, phrase, and sentence levels, and statistics comparing a set of human-generated utterances and model-generated responses indicate a discrepancy between the human and model word distributions. This does not appear to be rectified by training on more data (Radford et al., 2019). Recent fixes involve modifying the decoding strategy using sampling or more sophisticated beam search variants. However, these decoding strategies do not address the core issue: the model's underlying sequence probabilities are clearly not correct.

Several reasons for exactly why neural text is degenerate have been posited, with the cause currently unknown. Possible candidates include the problem being (i) a by-product of the model architecture, e.g. the Transformer architecture preferring repeats (Holtzman et al., 2019; Vig, 2018), (ii) an intrinsic property of human language (Holtzman et al., 2019) rather than a modeling deficiency, or (iii) that a training objective relying on fixed corpora cannot take into account the real goal of using the language (Choi, 2018). Our work shows that, while the above may be factors, a primary factor is the use of the likelihood objective itself, as we demonstrate that degeneration is alleviated if we replace the likelihood objective with our proposal.
While low perplexity in the limit should lead to predicting the correct next target word, there are two major flaws of the likelihood objective: (i) it pays relatively little attention to the argmax or the top of the ranked list of next-token probabilities, instead optimizing the likelihood of the entire distribution; (ii) it is not focused on optimizing sequence generation, only on producing the next token. The first issue means that greedy or beam search decoding, which rely on the top of the list to generate, are not optimized: there is a discrepancy between maximizing the log-probability of a ground-truth token and ensuring the rank of the ground-truth token to be one. The second issue means that during sequence generation, any imperfection in next-token prediction leads to error accumulation that is not addressed by likelihood training.

In this work, we introduce unlikelihood training, an approach that addresses the two aforementioned issues. It combines two types of updates: a likelihood update on the true target tokens so that they are assigned high probability, and an unlikelihood update on tokens that are otherwise assigned too high a probability. We can collect these unlikely token candidates either during next-token prediction or from generated sequences, allowing us to train at both the token and sequence levels. Both token and sequence level unlikelihood training are shown to improve metrics that measure dullness and repetition of the model, while maintaining performance in other metrics such as perplexity or token accuracy compared to the maximum likelihood baseline.

Finally, we assess our models using human evaluations. We find that our generations have vastly improved quality compared to likelihood trained models when both models use beam search decoding. Moreover, our approach when using beam search also significantly improves over likelihood trained models using either beam blocking or nucleus sampling, thus outperforming the current state-of-the-art.

## 2 RELATED WORK

**Neural Text Degeneration** Recently, several papers have observed various forms of neural text degeneration, especially in open-ended generation tasks. In dialogue, it has been shown that there is a mismatch between model and human word distributions, where generative models are more likely to output frequent words, but less likely to produce rare words, compared to humans. For example, this was observed across all generative models submitted to the ConvAI2 NeurIPS 2018 competition (Dinan et al., 2019). In language modeling, the work of Holtzman et al. (2019) highlighted problems with the word frequency distribution and level of repetition in model generations compared to human text. These issues are not remedied by simply increasing the amount of training data; e.g., large-scale GPT-2 language models (Radford et al., 2019) display the same issues.

**Improved Decoding Algorithms** Several methods have been proposed to rectify these issues. The primary ones involve changing the decoding method to a sophisticated beam search variant or to stochastic decoding, e.g. sampling. Different variants of beam search have been explored (Li et al., 2016; Vijayakumar et al., 2018; Kulikov et al., 2018; Holtzman et al., 2018) which can decrease a model's level of repetition by selecting candidates that are unlike previously chosen ones. Separately, hard or soft beam blocking has been investigated (Paulus et al., 2017; Klein et al., 2017), whereby previously generated n-grams are blocked from subsequent generation. This approach is often used in dialogue generation, fixing some token or phrase level repetitions but also removing repetitions that would naturally occur in human text.

The second major approach is that of sampling from the model at generation time. Top-k sampling (Fan et al., 2018) and nucleus sampling (Holtzman et al., 2019) are two methods that sample sequences based on a function of the predicted next-token probability distribution given by the model. Both approaches vastly improve the repetition issue, as the randomization often reduces the number of duplicate tokens in a decoded sequence, even if highly scored paths under the model (represented by beam search candidates) contain repetitions. However, as the underlying model is unchanged, it often prefers semantically similar phrasing, depending on the temperature parameter of the sampling (Holtzman et al., 2019). Furthermore, this solution is less relevant in less open-ended tasks such as machine translation, where beam search variants are the preferred method. Ideally we would like a model that can work with both beam and sampling decoding methods.
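As a concrete illustration of these two stochastic decoding fixes, the following is a minimal PyTorch-style sketch of top-k and nucleus (top-p) filtering applied to one step of next-token logits. It is a generic sketch of the techniques described above, not code from any of the cited papers; the tensor shapes, vocabulary size, and threshold values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def top_k_filter(logits, k):
    """Keep only the k highest-scoring tokens; set all other logits to -inf."""
    kth_best = torch.topk(logits, k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth_best, float("-inf"))

def nucleus_filter(logits, p):
    """Keep the smallest set of tokens whose cumulative probability is >= p."""
    sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)
    cum_probs = F.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > p
    remove[..., 1:] = remove[..., :-1].clone()   # keep the token that crosses p
    remove[..., 0] = False                       # always keep the top token
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    # Scatter the filtered scores back into the original vocabulary order.
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

# Example: sample one next token from the filtered distribution.
logits = torch.randn(1, 50000)                   # one step of language-model logits
probs = F.softmax(nucleus_filter(logits, p=0.9), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```

In both cases only the sampling distribution at decoding time changes; the trained model and its token-level probabilities are left untouched, which is exactly the limitation that the unlikelihood objective targets.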
**Improved Learning Algorithms** The proposed learning criteria are closely related to structured output prediction methods in which the goal is to increase the scores assigned by a model to true examples while decreasing those assigned to negative examples, which are often generated by the model itself. Some representative algorithms include the structured perceptron (Collins, 2002), energy-based models (LeCun et al., 2006) and, more recently, reflective likelihood (Dieng et al., 2018). A particular variant in this family of algorithms, called negative training, was recently used by He and Glass (2019) to prevent generic and malicious responses in dialogue models. Similarly, these structured prediction algorithms with neural language models have been applied to machine translation in recent years by Shen et al. (2015) and Edunov et al. (2017).

## 3 NEURAL TEXT GENERATION

**Language Modeling** In language modeling, our goal is to model a probability distribution $p_*(\mathbf{x})$ over variable-length text sequences $\mathbf{x} = (x_1, \ldots, x_{|\mathbf{x}|})$ composed of tokens from a vocabulary, $x_t \in \mathcal{V}$. We wish to find a model $p_\theta(\mathbf{x})$ which resembles $p_*(\mathbf{x})$, meaning that samples $\hat{\mathbf{x}} \sim p_\theta$ are similar to samples from $p_*$, and $p_\theta(\mathbf{x}) \approx p_*(\mathbf{x})$ for all $\mathbf{x}$. When $p_\theta(\mathbf{x})$ is parameterized by a neural network, we call $p_\theta$ a neural language model. We assume that $p_\theta$ takes the form

$$p_\theta(\mathbf{x}) = \prod_{t=1}^{|\mathbf{x}|} p_\theta(x_t \mid x_{<t}).$$

The de facto approach to training such a model is to find parameters $\theta$ that maximize the log-likelihood of a finite set of training sequences $\mathcal{D}$ drawn from $p_*$, i.e. to minimize

$$\mathcal{L}_{\text{MLE}}(p_\theta, \mathcal{D}) = -\sum_{i=1}^{|\mathcal{D}|} \sum_{t=1}^{|\mathbf{x}^{(i)}|} \log p_\theta\big(x^{(i)}_t \mid x^{(i)}_{<t}\big). \quad (1)$$

**Sequence Completion** A closely related problem consists of sampling a sub-sequence, or prefix, $x_{1:k} \sim p_*$, and then using $p_\theta$ to conditionally decode a continuation $\hat{x}_{k+1:N} \sim p_\theta(\cdot \mid x_{1:k})$, where the resulting completion $(x_1, \ldots, x_k, \hat{x}_{k+1}, \ldots, \hat{x}_N)$ should resemble a sample from $p_*$. Decoding is performed either deterministically, e.g. with greedy search or beam search, or stochastically, e.g. with top-k sampling (Fan et al., 2018), which samples from the $k$ most probable next tokens, or nucleus sampling, which samples from the smallest set of tokens whose cumulative probability mass is at least $p$ (Holtzman et al., 2019).
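To ground the notation, here is a minimal PyTorch-style sketch of the maximum likelihood objective in Equation 1 and of greedy sequence completion. The `model` interface (a causal network mapping a batch of token ids to per-position next-token logits) and all tensor shapes are assumptions made for illustration; they are not part of the paper.

```python
import torch
import torch.nn.functional as F

def mle_loss(model, x):
    """Maximum likelihood objective (Eq. 1): summed negative log-likelihood
    of each next token given its prefix."""
    inputs, targets = x[:, :-1], x[:, 1:]           # x: (batch, T) token ids
    logits = model(inputs)                          # (batch, T-1, |V|)
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return nll.sum()

@torch.no_grad()
def greedy_completion(model, prefix, n_steps):
    """Sequence completion with greedy (argmax) decoding."""
    x = prefix.clone()                              # prefix: (1, k) token ids
    for _ in range(n_steps):
        next_logits = model(x)[:, -1, :]            # scores for the next token
        next_token = next_logits.argmax(dim=-1, keepdim=True)
        x = torch.cat([x, next_token], dim=1)
    return x
```

Under this training loss the model is only ever asked to score the true next token given a ground-truth prefix; nothing in Equation 1 discourages the kind of repetitive continuations that `greedy_completion` can produce, which is the gap examined next.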
## 4 NEURAL TEXT DEGENERATION

In this section we discuss two degenerate properties that frequently occur in conventional neural language models trained with the maximum likelihood objective (Equation 1).

**Repetition** First, model-generated continuations exhibit sequence-level repetition, especially with deterministic decoding. The problem is seen by observing samples in Appendix Table 4, which shows completions from the state-of-the-art GPT-2 language model (Radford et al., 2019). Greedy decoding as well as top-k and nucleus sampling exhibit degenerate repetition (with certain hyper-parameter settings), although greedy decoding shows the worst degradation. Using a Transformer language model trained with maximum likelihood (§6), we find that the average percentage of repeated n-grams in model continuations with greedy decoding (43%) far exceeds that of humans (0.5%), computed over prefixes drawn from a validation corpus.

Unlike previous work, which focused only on degenerate sequence-level repeats (Holtzman et al., 2019), we additionally observe that neural language models exhibit substantially more repetition in next-token prediction compared to human text:

$$\Pr\big(\hat{x}_{k+1} = \arg\max_x p_\theta(x \mid x_{1:k}) \in x_{1:k}\big) > \Pr\big(x_{k+1} \in x_{1:k}\big). \quad (2)$$

For instance, the Transformer language model (§6) predicted next-tokens that appeared in the preceding 128 words 62% of the time, versus 49% in ground-truth text. This is especially concerning since the maximum-likelihood objective focuses on optimizing next-token conditional distributions.

**Token Distribution Mismatch** Second, both greedy continuations and next-token predictions from conventional neural text generators have different token distributions from human text. As demonstrated by Holtzman et al. (2019), such models with greedy or beam search tend to produce high frequency tokens too often and low frequency tokens too rarely, where frequency is defined by the human token distribution. With the Transformer language model (§6), the set of next-token greedy predictions on a held-out validation set had roughly 40% fewer unique tokens than the ground-truth tokens (11.6k vs. 18.9k), and overproduced frequent tokens (Appendix Figure 1). Such behavior has been linked to generations being judged as dull by humans, because rare words can add engaging specificity (Weston et al., 2018; See et al., 2019).

## 5 THE UNLIKELIHOOD TRAINING OBJECTIVE

We now describe unlikelihood training for neural language models, and then in Section 6 demonstrate empirically that our proposal substantially improves neural text degeneration (§4).

### 5.1 UNLIKELIHOOD TRAINING

The key idea behind unlikelihood training is decreasing the model's probability of certain tokens, called negative candidates. Given a sequence $(x_1, \ldots, x_T)$ and a set of negative candidate tokens $\mathcal{C}^t = \{c_1, \ldots, c_m\}$, where each $c_j \in \mathcal{V}$, we define the unlikelihood loss for step $t$ as:

$$\mathcal{L}^t_{\text{UL}}\big(p_\theta(\cdot \mid x_{<t}), \mathcal{C}^t\big) = -\sum_{c \in \mathcal{C}^t} \log\big(1 - p_\theta(c \mid x_{<t})\big). \quad (3)$$

The loss decreases as $p_\theta(c \mid x_{<t})$ decreases, i.e. as probability mass is moved away from the negative candidates. We combine the unlikelihood loss with the standard likelihood loss to obtain the token-level unlikelihood objective:

$$\mathcal{L}^t_{\text{UL-token}}\big(p_\theta(\cdot \mid x_{<t}), \mathcal{C}^t\big) = -\alpha \cdot \sum_{c \in \mathcal{C}^t} \log\big(1 - p_\theta(c \mid x_{<t})\big) - \log p_\theta(x_t \mid x_{<t}), \quad (4)$$

where $\alpha \in \mathbb{R}$ is a mixing hyper-parameter. As negative candidates we use the previous context tokens, $\mathcal{C}^t_{\text{prev-context}} = \{x_1, \ldots, x_{t-1}\} \setminus \{x_t\}$, which penalizes the model for assigning high probability to tokens that already appear in the context and thus directly targets the repetition and frequency problems of §4.

### 5.2 SEQUENCE-LEVEL UNLIKELIHOOD TRAINING

While the token-level unlikelihood objective efficiently augments maximum likelihood training with token-level penalties, it is limited to prefixes drawn from the training distribution. The resulting distribution mismatch between training sequences and generated sequences is a known issue with maximum-likelihood training, motivating objectives that operate on model-generated sequences (Daumé III et al., 2009; Ross et al., 2011; Ranzato et al., 2015; Yu et al., 2016).

We thus propose a sequence-level unlikelihood objective which uses unlikelihood on decoded continuations. That is, given a prefix $(x_1, \ldots, x_k) \sim p_*$, we decode a continuation $(x_{k+1}, \ldots, x_{k+N}) \sim p_\theta(\cdot \mid x_1, \ldots, x_k)$, construct per-step negative candidate sets $(\mathcal{C}^{k+1}, \ldots, \mathcal{C}^{k+N})$, and define each per-step sequence-level loss for $t \in \{k+1, \ldots, k+N\}$ as:

$$\mathcal{L}^t_{\text{ULS}}\big(p_\theta(\cdot \mid x_{<t}), \mathcal{C}^t\big) = -\sum_{c \in \mathcal{C}^t} \log\big(1 - p_\theta(c \mid x_{<t})\big), \quad (5)$$

where each $\mathcal{C}^t$ contains the decoded token $x_t$ when it takes part in a repeated n-gram of the continuation, and is empty otherwise.
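To make the objectives concrete, here is a minimal PyTorch-style sketch of the token-level objective in Equation 4 with previous-context negative candidates, together with one possible way to build repeated-n-gram candidates for the per-step sequence-level loss in Equation 5. The `model` interface and the `greedy_completion` helper are the same illustrative assumptions as in the sketch after Section 3, and the candidate construction is a simplified stand-in rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def token_unlikelihood_loss(logits, targets, alpha=1.0):
    """Token-level unlikelihood objective (Eq. 4).
    logits: (batch, T, |V|) next-token scores aligned with targets (batch, T).
    Negative candidates at step t are the previously seen target tokens,
    excluding the current target (duplicates are penalized per occurrence
    here, whereas Eq. 4 treats the candidates as a set)."""
    logp = F.log_softmax(logits, dim=-1)
    # Likelihood term: standard negative log-likelihood of the true tokens.
    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)        # (batch, T)

    batch, T = targets.shape
    # prev[t, i] is True when position i precedes position t.
    prev = torch.ones(T, T, device=targets.device).tril(-1).bool()
    ctx = targets.unsqueeze(1).expand(batch, T, T)                   # ctx[b, t, i] = x_i
    # Probability assigned at step t to each earlier context token x_i.
    p_ctx = logp.exp().gather(-1, ctx)                               # (batch, T, T)
    cand = prev.unsqueeze(0) & (ctx != targets.unsqueeze(-1))        # drop the true token
    ul = -torch.log1p(-p_ctx.clamp(max=1 - 1e-6)) * cand.float()

    return (nll + alpha * ul.sum(-1)).sum()

def repeat_ngram_candidates(tokens, n=4):
    """Mark positions that fall inside an n-gram already seen earlier in
    the sequence (a simplified repeated-n-gram candidate set)."""
    seen, mask = set(), [False] * len(tokens)
    for t in range(len(tokens) - n + 1):
        ngram = tuple(tokens[t:t + n])
        if ngram in seen:
            for i in range(t, t + n):
                mask[i] = True
        seen.add(ngram)
    return mask

def sequence_unlikelihood_loss(model, prefix, n_steps, n=4):
    """Sequence-level unlikelihood (Eq. 5): decode a continuation, then
    penalize the probability of its tokens that form repeated n-grams."""
    seq = greedy_completion(model, prefix, n_steps)   # decoding itself is not differentiated
    logits = model(seq[:, :-1])                       # (1, L-1, |V|)
    logp = F.log_softmax(logits, dim=-1)
    cand = repeat_ngram_candidates(seq[0].tolist(), n=n)
    loss = logits.new_zeros(())
    for t in range(prefix.size(1), seq.size(1)):      # continuation steps only
        if cand[t]:
            p = logp[0, t - 1, seq[0, t]].exp()
            loss = loss - torch.log1p(-p.clamp(max=1 - 1e-6))
    return loss
```

Note that only the re-scoring pass through `model` is differentiated; the decoded continuation itself is treated as fixed data, mirroring how the per-step losses above are defined on an already decoded sequence.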