Published as a conference paper at ICLR 2023

A NON-MONOTONIC SELF-TERMINATING LANGUAGE MODEL

Eugene Choi (eugene.choi@nyu.edu), Cheolhyoung Lee (cheolhyoung.lee@nyu.edu), Kyunghyun Cho (kyunghyun.cho@nyu.edu)
New York University; Prescient Design, Genentech; CIFAR Fellow

ABSTRACT

Recent large-scale neural autoregressive sequence models have shown impressive performance on a variety of natural language generation tasks. However, their generated sequences often exhibit degenerate properties such as non-termination, undesirable repetition, and premature termination when generated with decoding algorithms such as greedy search, beam search, top-k sampling, and nucleus sampling. In this paper, we focus on the problem of non-terminating sequences resulting from an incomplete decoding algorithm. We first define an incomplete probable decoding algorithm, which includes greedy search, top-k sampling, and nucleus sampling, beyond the incomplete decoding algorithm originally put forward by Welleck et al. (2020). We then propose a non-monotonic self-terminating language model, which significantly relaxes the constraint of monotonically increasing termination probability in the self-terminating language model originally proposed by Welleck et al. (2020), to address the issue of non-terminating sequences when using incomplete probable decoding algorithms. We prove that our proposed model prevents non-terminating sequences when using not only incomplete probable decoding algorithms but also beam search. We empirically validate our model on sequence completion tasks with various architectures.

1 INTRODUCTION

Autoregressive neural sequence models (Bengio et al., 2000) have been widely used for various natural language generation tasks such as language modeling (Brown et al., 2020; Chowdhery et al., 2022), machine translation (Bahdanau et al., 2014), and conversational dialogue modeling (Vinyals & Le, 2015). Furthermore, large-scale autoregressive neural sequence models have shown an unprecedented ability to generate fluent, human-like text (Vaswani et al., 2017; Brown et al., 2020). Despite their success, autoregressive neural sequence models exhibit undesirable behaviors: non-termination (Welleck et al., 2020), degenerate repetition (Welleck et al., 2019; Holtzman et al., 2020), and premature termination (Koehn & Knowles, 2017; Stahlberg & Byrne, 2019). In this paper, we focus on how to prevent non-termination when using a given decoding algorithm.

Non-termination is the problem that a language model generates infinitely long sequences with positive probability under a given decoding algorithm. Welleck et al. (2020) pointed out that this issue comes from a discrepancy between the distribution of the language model and the distribution induced by an incomplete decoding algorithm. They formalized this disparity with the notion of inconsistency, in which the decoding algorithm generates non-terminating sequences from the language model with positive probability. To avoid this inconsistency, they proposed a self-terminating (ST) language model that uses a new parametrization for its classifier rather than the usual softmax parametrization. They proved that the ST language model is consistent with respect to greedy search, beam search, top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020). The ST language model increases the termination probability of each sequence monotonically to 1, but this parametrization is not appropriate for learning natural language.
As an illustrative example, suppose there are two sequences in our dataset: "I am a boy" vs. "I am a boy, and you are a girl." Our language model trained on this dataset may or may not terminate after the former. Once our model decides not to end there, it should dramatically reduce the termination probability in order to continue. The ST language model, which monotonically increases the termination probability, cannot capture such a case, where one sequence is a prefix of another.

We thus propose a non-monotonic self-terminating (NMST) language model, which guarantees consistency with respect to greedy search, beam search, top-k sampling, and nucleus sampling without monotonically increasing the termination probability. The NMST language model encourages the termination probability of each sequence to converge to 1 through the NMST parametrization, but without monotonicity. Even under this relaxation, the proposed NMST language model provably prevents any non-terminating sequence resulting from greedy search, beam search, top-k sampling, and nucleus sampling, which we refer to as incomplete probable decoding algorithms.

We conduct experiments validating the effectiveness of our NMST language models on sequence completion tasks, as was done in earlier studies. We test the NMST parametrization with various architectures. Specifically, we train an RNN (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2 (Merity et al., 2016). We additionally finetune GPT-2 (Radford et al., 2019) on WikiText-103 (Merity et al., 2016). Across all these setups, the NMST parametrization effectively prevents non-terminating sequences, especially when compared to the softmax parametrization. Furthermore, our NMST parametrization achieves better (lower) perplexities than the ST parametrization, confirming the importance of relaxing the monotonic termination probability.

2 NOTATIONS AND BACKGROUND

2.1 NOTATIONS FOR AUTOREGRESSIVE NEURAL SEQUENCE MODELS

Sequences, vocabulary, and ⟨eos⟩. We view an instance (e.g., a sentence or a paragraph) as a sequence y = (y_1, y_2, ..., y_T), where each y_t is an element from a pre-defined finite set of discrete tokens, referred to as a vocabulary V. V includes a special symbol ⟨eos⟩ that only appears at the end of a sequence. Every sequence y must end with ⟨eos⟩. We write the length of y as |y|, so y_{|y|} = ⟨eos⟩. We call y a non-terminating sequence, |y| = ∞, if y_t ≠ ⟨eos⟩ for all t.

Embedding vectors. Each token v ∈ V is not itself a numerical vector, so we use an embedding vector u_v ∈ R^m to represent v and to capture the notion of similarity between discrete tokens in a continuous embedding space (Bengio et al., 2000; Mikolov et al., 2013a;b; Levy & Goldberg, 2014).

Autoregressive neural sequence models. Bengio et al. (2000) proposed an autoregressive neural sequence model parametrized by θ ∈ R^k. They factorized p_θ(y|x) into a product of the conditional probabilities of each token given all previous tokens and an input, in a predefined order:

p_θ(y|x) = ∏_{t=1}^{T} p_θ(y_t | y_{<t}, x).

To estimate the non-termination ratio r_nt (equation 11), we use r_nt(L) = q_{S(p_θ)}(|y| > L) with a sufficiently large threshold L.

Sequence completion is the task of predicting a continuation ŷ given a c-length context x = (x_1, x_2, ..., x_c) by using a decoding algorithm S with a language model p_θ (i.e., ŷ ∼ q_{S(p_θ)}(y|x)).

² We provide the proof in Appendix C.
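As a rough Python sketch of the behavior described above, the termination probability of each sequence can be pushed to converge to 1 without being forced to increase monotonically by bounding the ⟨eos⟩ probability below with a curve 1 − (1 − ϵ)^t that rises to 1. The functional form and names below (`nmst_eos_probability`, `eos_logit`) are illustrative assumptions for exposition, not the paper's exact parametrization.

```python
import math


def nmst_eos_probability(eos_logit: float, t: int, eps: float = 1e-5) -> float:
    """Illustrative (assumed) non-monotonic self-terminating parametrization.

    The <eos> probability is bounded below by 1 - (1 - eps)^t, a floor that
    converges to 1 as t grows, while sigmoid(eos_logit) moves the probability
    anywhere above that floor, so the sequence of probabilities over t need
    not be monotone.
    """
    sigmoid = 1.0 / (1.0 + math.exp(-eos_logit))
    floor = 1.0 - (1.0 - eps) ** t          # lower bound; tends to 1 as t grows
    return floor + (1.0 - floor) * sigmoid  # always lies in [floor, 1)


if __name__ == "__main__":
    eps = 5e-4
    # Fluctuating logits: p(<eos>) is not monotone in t, but always exceeds
    # the rising floor, so it is driven toward 1 in the limit.
    for t, logit in enumerate([-2.0, 1.0, -3.0, 0.5, -1.0], start=1):
        print(f"t={t}  floor={1.0 - (1.0 - eps) ** t:.6f}  "
              f"p(eos)={nmst_eos_probability(logit, t, eps):.6f}")
```

With a fluctuating ⟨eos⟩ logit, the printed probabilities move up and down yet always stay above the rising floor, which is the relaxation (convergence without monotonicity) that distinguishes NMST from ST.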
Figure 2: Non-termination ratios r_nt(L) as a function of L, in log-log scale, for (a) RNN and (b) LSTM trained on WikiText-2 when using greedy search. We report the mean (curve) ± st.dev. (shaded area) across 10 random experiments; curves are shown for ϵ ∈ {5.0 × 10^{-4}, 1.0 × 10^{-4}, 5.0 × 10^{-5}, 1.0 × 10^{-5}}. For all configurations, both ST+ (non-red dashed), proposed by Welleck et al. (2020), and our NMST+ (non-red solid) are consistent with respect to greedy search, since r_nt(L) goes to 0 as L increases. However, the softmax parametrization (VA+, red dotted) is inconsistent with respect to greedy search, since its r_nt(L) does not converge to 0 as L → ∞.

In this section, we use greedy search, defined in equation 8, to generate ŷ given x. Our main theoretical finding, Theorem 3, is that the proposed NMST language model is consistent with respect to not only greedy search but also top-k sampling, nucleus sampling, and beam search. We thus present results using decoding algorithms other than greedy search at the end, in Section 5 and Appendix F.

4.1 WIKITEXT-2

WikiText-2 (Merity et al., 2016) consists of 2 million words from 600 Wikipedia articles. With word tokenization, we regard the first 10 tokens of each sequence as a context x and its remaining part as a ground truth y. We train an RNN with tanh activations (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2. Both the RNN and the LSTM have 2 layers, with 256 and 512 hidden units per layer, respectively. We perform 10 random runs with a batch size of 32 for 70 epochs. We use AdamW (Loshchilov & Hutter, 2017) with an initial learning rate of 0.001, β1 = 0.9, β2 = 0.99, weight decay of 0.01, learning rate decay, and early stopping. We further describe our models and training strategies for the WikiText-2 experiments in Appendix D. Unlike VA+{RNN, LSTM}, ST+{RNN, LSTM} and NMST+{RNN, LSTM} need an additional hyperparameter ϵ. We explore ϵ ∈ {1.0 × 10^{-5}, 5.0 × 10^{-5}, 1.0 × 10^{-4}, 5.0 × 10^{-4}}.

We present the average (± st.dev.) non-termination ratios r_nt(L) across 10 random runs as a function of L for all considered setups on WikiText-2 in Figure 2, using greedy search. From equation 11, a language model is consistent with respect to greedy search if lim_{L→∞} r_nt(L) = 0. As L increases, we observe that r_nt(L) of VA+{RNN, LSTM} fails to converge toward 0, while r_nt(L) of ST+{RNN, LSTM} and NMST+{RNN, LSTM} all reach 0. In other words, the RNN and LSTM become consistent with respect to greedy search after replacing the original softmax parametrization with either the proposed NMST parametrization or the ST parametrization.

Table 1 shows the average (± st.dev.) validation perplexities across 10 random experiments for all variants of the RNN and LSTM trained on WikiText-2. We observe that NMST+{RNN, LSTM} have better validation perplexities than ST+{RNN, LSTM} for every ϵ. We demonstrate this more clearly in Appendix E.1 by plotting the evolution of the mean validation perplexities as we vary ϵ. Although our NMST+ guarantees the consistency of the RNN and LSTM with respect to greedy search with a better validation perplexity than ST+, we need to select ϵ carefully: as ϵ increases, the lower bound of p^nmst_θ(y_t = ⟨eos⟩ | y_{<t}, x) rises more quickly, and the validation perplexity degrades (see Appendix E.1).

To validate consistency with respect to decoding algorithms other than greedy search, we use top-{2, 4} sampling, nucleus-{0.2, 0.4} sampling, and beam search with a width of {2, 4} (beam-{2, 4}) to generate sequences from NMST+GPT-2 finetuned on WikiText-103 with ϵ = 1.0 × 10^{-5}.
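Throughout these experiments the key measured quantity is the non-termination ratio r_nt(L). As a rough illustration of how it can be estimated, the sketch below greedily decodes a continuation for each context and counts the fraction that fail to produce ⟨eos⟩ within L steps; the `next_token_probs` interface is a hypothetical stand-in for a trained language model, not the paper's released code.

```python
from typing import Callable, Dict, List, Sequence

EOS = "<eos>"
NextTokenProbs = Callable[[Sequence[str]], Dict[str, float]]


def greedy_decode(next_token_probs: NextTokenProbs,
                  context: Sequence[str], max_steps: int) -> List[str]:
    """Greedy search: at each step append the single most probable token."""
    prefix = list(context)
    for _ in range(max_steps):
        probs = next_token_probs(prefix)   # maps token -> probability
        token = max(probs, key=probs.get)  # |V_t| = 1 under greedy search
        prefix.append(token)
        if token == EOS:
            break
    return prefix


def non_termination_ratio(next_token_probs: NextTokenProbs,
                          contexts: Sequence[Sequence[str]], L: int) -> float:
    """Estimate r_nt(L): the fraction of decoded continuations in which
    <eos> never appears within L steps, i.e. |y| > L."""
    non_terminated = sum(
        1 for ctx in contexts
        if greedy_decode(next_token_probs, ctx, max_steps=L)[-1] != EOS
    )
    return non_terminated / max(len(contexts), 1)
```

Because greedy search keeps only the single most probable token at each step, a model that never ranks ⟨eos⟩ first within the window is counted as non-terminating, which is the behavior of the VA+ curves in Figure 2.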
The choice of ϵ = 1.0 × 10^{-5} is based on the validation perplexities in Table 2. Since the validation perplexity does not depend on the decoding algorithm, we focus on the average (± st.dev.) non-termination ratios r_nt(L) across 10 random runs with L = 1,000 for each decoding algorithm in Table 4. We also present r_nt(L) of VA+GPT-2 and ST+GPT-2 with ϵ = 1.0 × 10^{-5} as baselines.

Table 4 shows that our NMST+GPT-2 has the lowest r_nt(L) with L = 1,000 for all decoding algorithms, compared to VA+GPT-2 and the ST+GPT-2 proposed by Welleck et al. (2020). In other words, NMST+ effectively prevents non-terminating sequences within 1,000 time steps regardless of the decoding algorithm. Comparing with greedy search in Table 2 (r_nt(L) when ϵ = 1.0 × 10^{-5}), we observe that r_nt(L) decreases for all setups. As we discussed in Section 2.3, non-terminating sequences originate from the choice of ⟨eos⟩ ∉ V_t ⊊ V for all t, where V is the vocabulary and V_t is the proper subset of V considered by a decoding algorithm at the t-th step. Decoding algorithms other than greedy search are more likely to have ⟨eos⟩ in V_t, and hence a lower r_nt(L), since their |V_t| is greater than or equal to the |V_t| = 1 of greedy search for all t. In the case of top-{2, 4} sampling, we obtain r_nt(L) = 0.0 for VA+GPT-2. Even without NMST+, VA+ can avoid non-terminating sequences if we choose a proper decoding algorithm. We emphasize, however, that NMST+GPT-2 with ϵ = 1.0 × 10^{-5} has a competitive validation perplexity against VA+GPT-2 in Table 2 and that it is guaranteed to terminate regardless of the choice of decoding algorithm. We also empirically demonstrate the consistency of NMST+{RNN, LSTM} trained on WikiText-2 with respect to other decoding algorithms in Appendix F.

6 CONCLUSION

Non-termination is a degenerate behavior we often observe when generating text from a well-trained language model. To prevent this, Welleck et al. (2020) proposed a self-terminating language model that encourages the termination probability of each sequence, i.e., the conditional probability of ⟨eos⟩ given a t-prefix and a context, to increase monotonically toward 1 as t increases. In this paper, we theoretically demonstrate that a monotonically increasing termination probability is not a necessary condition for avoiding non-terminating sequences. We then propose a non-monotonic self-terminating language model in which the termination probability of each sequence converges to 1, but not monotonically. Our non-monotonic self-terminating language models successfully address the issue of non-termination and achieve perplexities that are comparable to vanilla language models and better than the original self-terminating language models.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility of our paper, we provide our code at https://github.com/nyu-dl/non-monotonic-self-terminating-lm.

ACKNOWLEDGMENTS

This work was supported by 42dot, Hyundai Motor Company (under the project "Uncertainty in Neural Sequence Modeling"), Samsung Advanced Institute of Technology (under the project "Next Generation Deep Learning: From Pattern Recognition to AI"), and NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.

REFERENCES

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate.
arXiv preprint arXiv:1409.0473, 2014.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 2000.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://aclanthology.org/P18-1082.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv, abs/1904.09751, 2020.

Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872, 2017.

Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. Advances in Neural Information Processing Systems, 27, 2014.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 2013b.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Felix Stahlberg and Bill Byrne. On NMT search errors and model errors: Cat got your tongue? arXiv preprint arXiv:1908.10090, 2019.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019.

Sean Welleck, Ilia Kulikov, Jaedeok Kim, Richard Yuanzhe Pang, and Kyunghyun Cho. Consistency of a recurrent language model with respect to incomplete decoding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5553–5568, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.448. URL https://aclanthology.org/2020.emnlp-main.448.

A DEFINITIONS OF COMMON DECODING ALGORITHMS AND THEIR CHARACTERISTICS

In this section, we present mathematical definitions of top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), greedy search, and beam search. We then demonstrate whether they are incomplete probable decoding algorithms.

A.1 TOP-K SAMPLING

At each step t, top-k sampling selects the subset of the k most probable tokens in the vocabulary V. Top-k sampling generates decoded sequences from a language model p_θ as follows:

Definition A.1 (Top-k sampling (Fan et al., 2018)). Top-k sampling S_top-k generates a sequence from a language model p_θ given a context x by recursively sampling ŷ_t from

q_{S_top-k(p_θ)}(y_t = v | ŷ_{<t}, x) = p_θ(v | ŷ_{<t}, x) / Σ_{v' ∈ V_t} p_θ(v' | ŷ_{<t}, x) if v ∈ V_t, and 0 otherwise,

where V_t is the set of the k most probable tokens under p_θ(· | ŷ_{<t}, x).

Let S be any incomplete probable decoding algorithm. From equations 6 and 7, ⟨eos⟩ ∈ V_t and q_{S(p^nmst_θ)}(y_t = ⟨eos⟩ | y_{<t}, x) > 0.

Let P_{>t_{1/2}}(ρ) be the set of the k highest-scoring sequences continued from ρ by S_beam-k. From equation 23, we have p^nmst_θ(⟨eos⟩ | ρ, x) > p^nmst_θ(v | ρ, x) for all v ∈ V \ {⟨eos⟩}. Hence, V_{t_{1/2}}(ρ) in equation 17 includes ⟨eos⟩. Let z = (z_1, z_2, ..., z_l) be any subsequence with z_1 ≠ ⟨eos⟩. Then we have p^nmst_θ(ρ ∘ z | ρ, x) = ∏_{i=1}^{l} p^nmst_θ(z_i | ρ ∘ z_{<i}, x) ≤ p^nmst_θ(z_1 | ρ, x) < p^nmst_θ(⟨eos⟩ | ρ, x), and every ρ' ∈ P_{>t_{1/2}}(ρ) \ {ρ ∘ ⟨eos⟩} starts with ρ ∘ v for some v ∈ V \ {⟨eos⟩}. By the same argument, we add at least one sequence ending with ⟨eos⟩ to P_{>t_{1/2}}(ρ) at each step. This means that P_{>t_{1/2}}(ρ) has k sequences ending with ⟨eos⟩ within t_{1/2} + k steps. Note that the final set P satisfies

P ⊆ ∪_{ρ ∈ P_{t_{1/2}}} P_{>t_{1/2}}(ρ). (26)

Equation 26 implies that every sequence in P has length at most t_{1/2} + k. We thus obtain

q_{S_beam-k(p^nmst_θ)}(|y| = ∞ | x) ≤ q_{S_beam-k(p^nmst_θ)}(|y| > t_{1/2} + k | x) = 0. (27)

Taking the expectation of equation 27 over x, we see that q_{S_beam-k(p^nmst_θ)}(|y| = ∞) = 0. That is, p^nmst_θ is consistent with respect to beam search.

³ If there is no such ρ, all k sequences in P_{t_{1/2}} end with ⟨eos⟩. This means that S_beam-k returns a finite sequence, so that p^nmst_θ is consistent with respect to beam search.
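To complement the definitions in Appendix A, the sketch below shows the truncation step shared by top-k and nucleus sampling, and how the resulting candidate set V_t can exclude ⟨eos⟩, which is the mechanism behind non-termination discussed in the main text. The toy distribution and names below are illustrative assumptions, not the paper's code.

```python
from typing import Dict, Set


def top_k_candidates(probs: Dict[str, float], k: int) -> Set[str]:
    """V_t under top-k sampling: the k most probable tokens at step t."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[:k])


def nucleus_candidates(probs: Dict[str, float], mu: float) -> Set[str]:
    """V_t under nucleus sampling: the smallest set of most probable tokens
    whose cumulative probability reaches the threshold mu."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    chosen, cumulative = set(), 0.0
    for token in ranked:
        chosen.add(token)
        cumulative += probs[token]
        if cumulative >= mu:
            break
    return chosen


if __name__ == "__main__":
    # A toy next-token distribution in which <eos> is plausible but never
    # among the very top tokens.
    probs = {"the": 0.4, "a": 0.3, "and": 0.2, "<eos>": 0.1}
    print("<eos>" in top_k_candidates(probs, k=2))       # False: excluded
    print("<eos>" in nucleus_candidates(probs, mu=0.4))  # False: excluded
    # If <eos> falls outside V_t at every step, the induced distribution
    # places positive probability on non-terminating sequences.
```

Because the NMST parametrization eventually makes the ⟨eos⟩ probability exceed every other token's probability (as in the argument above), ⟨eos⟩ is eventually guaranteed to enter V_t under either truncation rule, and to enter the beam under beam search.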
D EXPERIMENTAL DETAILS

In this section, we describe our models and optimization processes used in Section 4.

RNN and LSTM on WikiText-2. We use word tokenization for WikiText-2. We train an RNN with tanh activations (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2. Both the RNN and the LSTM have 2 layers. Each layer has 256 hidden units for the RNN and 512 hidden units for the LSTM. The sizes of the input and output embedding layers are 256 and 512 for the RNN and LSTM, respectively. We use weight tying to share the weights between the input and output embedding layers for both models. We apply dropout (Srivastava et al., 2014) with drop probabilities of 0.3 and 0.5 to the RNN and LSTM, respectively. For each model, we perform 10 random runs with a batch size of 32 for 70 epochs.

To maximize the log-likelihood presented in equation 3, we use AdamW (Loshchilov & Hutter, 2017) with an initial learning rate of 0.001, β1 = 0.9, β2 = 0.99, weight decay of 0.01, and learning rate decay, which halves the learning rate if the validation perplexity does not improve for a training epoch. To avoid overfitting, we additionally use early stopping, which terminates training if the validation perplexity does not improve upon the best score attained so far for 10 consecutive epochs. In most cases, training ends within 50 epochs.

GPT-2 on WikiText-103. We use BPE tokenization⁴ (Sennrich et al., 2015) and the pretrained GPT-2⁵ (Radford et al., 2019) with 124 million parameters, provided by Hugging Face. GPT-2 can handle up to 1,024 tokens. We apply dropout (Srivastava et al., 2014) with a drop probability of 0.1 to GPT-2. We finetune GPT-2 for 300,000 steps while ensuring that all runs continue for at least 250,000 steps. To minimize the number of padding tokens in every batch for computational efficiency, we bucket the dataset into sequences of similar lengths, and each batch contains a maximum of 1,024 total tokens. To maximize the log-likelihood function in equation 3, we use AdamW (Loshchilov & Hutter, 2017) with an initial learning rate of 5.0 × 10^{-5}, β1 = 0.9, β2 = 0.99, weight decay of 0.01, and linear learning rate decay over 500,000 steps.

⁴ https://github.com/huggingface/tokenizers
⁵ https://github.com/huggingface/transformers

E ADDITIONAL PLOTS AND TABLES FOR SECTION 4

In this section, we present additional plots and tables for Section 4.

E.1 ADDITIONAL PLOTS FOR SECTION 4.1

Figure 4: Validation perplexities as a function of ϵ in log-linear scale for all configurations of RNN (left) and LSTM (right) trained on WikiText-2. We present their average (curve) ± st.dev. (shaded area) across 10 random experiments. For all ϵ and architectures, NMST+ has better validation perplexities than ST+. As ϵ increases, the validation perplexities of both NMST+RNN and NMST+LSTM degrade compared to those of VA+RNN and VA+LSTM. We thus need to search for an optimal ϵ to avoid degradation of validation perplexity when applying NMST+ to our language model.

E.2 ADDITIONAL PLOTS FOR SECTION 4.2

Figure 5: We present the average (curve) ± st.dev. (shaded area) of validation perplexities (left) and non-termination ratios r_nt(L) (right) with greedy search across 10 random runs for all considered setups of GPT-2 finetuned on WikiText-103, in log-linear scale. For r_nt(L), we use L = 1,000 because GPT-2 has a context window size of 1,024. For all ϵ, NMST+GPT-2 outperforms ST+GPT-2 in terms of the average validation perplexity. When ϵ is small, r_nt(L) of ST+GPT-2 explodes, meaning that ST+GPT-2 with small ϵ cannot prevent non-terminating sequences. However, our NMST+GPT-2 effectively reduces r_nt(L) compared to VA+GPT-2 for every ϵ, and its validation perplexity degradation is smaller than that of the ST+GPT-2 proposed by Welleck et al. (2020).

E.3 ADDITIONAL TABLES FOR TABLE 3

Table 5: Given a context in a validation instance of WikiText-103, we present example continuations of {VA, ST, NMST}+GPT-2 when using greedy search. We select ϵ = 1.0 × 10^{-5} for {ST, NMST}+GPT-2 because it is optimal in terms of validation perplexities in Table 2.
Unlike {VA, ST}+GPT-2, NMST+GPT-2 improves the quality of the sequence by avoiding repetitive tokens and ending with ⟨eos⟩ when the given context leads VA+GPT-2 to non-terminate within 1,000 steps.

Context: The single made its Irish Singles Chart debut at

VA+: number three on the issue date of November 18, 2005. It debuted at number one on the chart for the issue date of December 7, 2005. The single debuted at number two on the chart for the issue date of January 4, 2006. The single debuted at number one on the Irish Singles Chart for the issue date of March 4, 2006. The single debuted at number two on the Irish Singles Chart for the issue date of June 4, 2006. The single debuted at number one on the Irish Singles Chart for the issue date ...

ST+: number three on the chart issue dated March 16, 2010. It debuted at number two on the UK Singles Chart on the chart issue dated March 16, 2010, and peaked at number one on the chart issue dated March 16, 2010. The single was certified gold by the British Phonographic Industry ( BPI ) for shipments of over 15 @,@ 000 copies. The single debuted at number two on the Irish Singles Chart on the chart issue dated March 16, 2010, and peaked at number one on the chart issue dated March 16, 2010. The single was certified gold by the Irish Recorded Music Association ( IRMA ) for shipments of over 15 @,@ 000 copies. The single was ...

NMST+: number twenty @-@ seven on the week ending March 26, 2010, and peaked at number three on the week ending March 27, 2010. It was certified gold by the Recording Industry Association of Ireland ( RIAA ) for shipment of 500 @,@ 000 copies of the single. The single was certified gold by the Recording Industry Association of Ireland ( RIANZ ) for shipment of 500 @,@ 000 copies of the single. ⟨eos⟩

Context: Despite the expensive reconstructions, both vessels were considered

VA+: to be of sufficient quality to be considered for use in the Grand Fleet. The first, the British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built ...

ST+: to be of sufficient quality to be considered a part of the Royal Navy, and were assigned to the Channel Fleet. The two ships were transferred to the Mediterranean Fleet in early 1915, and were transferred to the North America and West Indies Station in early 1916. They were transferred to the Mediterranean Fleet in early 1917, and were transferred to the North America and West Indies Station in early 1918. They were transferred to the North America and West Indies Station in early 1918, and were transferred to the Mediterranean Fleet in early 1918. They were transferred to the North America and West Indies Station in late 1918, and ...
NMST+: obsolescent by the time of the Second World War, and were eventually scrapped in favour of the much more powerful British battleship HMS Dreadnought. ⟨eos⟩

Context: On the July 21, 2014 episode of Raw,

VA+: Triple H and Randy Orton were involved in a tag team match, in which Triple H and Orton were accompanied by Triple H and Randy Orton's manager, Mr. Kennedy. Orton was accompanied by Orton's manager, Mr. Kennedy, and Orton's tag team partner, Mr. Kennedy. Triple H and Orton were accompanied by Orton's manager, Mr. Kennedy, and Orton's tag team partner, Mr. Kennedy. Triple H and Orton were accompanied by Orton's manager, Mr. Kennedy, and Orton's manager, Mr. Kennedy. Triple H and Orton were accompanied by Orton's manager, Mr. Kennedy, and Orton's manager, Mr. Kennedy. Triple H and Orton were accompanied by Orton's manager, Mr. Kennedy, and Orton's manager, Mr. Kennedy. Triple H ...

ST+: Triple H defeated Randy Orton to win the World Heavyweight Championship. On the July 27 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. On the July 30 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. Triple H then defeated Randy Orton to win the World Heavyweight Championship. On the August 2 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. Triple H then defeated Randy Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated ...

NMST+: Triple H defeated Randy Orton to win the WWE Championship for the third time. ⟨eos⟩

Context: On the west side of the Naktong,

VA+: the NK 6th Division was reinforced by the NK 7th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 7th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK ...

ST+: the 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been ...

NMST+: the Naktong River rises to a height of 1 @,@ 000 metres ( 3 @,@ 300 ft ) above the surrounding terrain.
The Naktong River is the longest river in North and South Korea, and the longest in the whole of North Korea. The Naktong is the longest river in North Korea, and the longest in the whole of North Korea. The river is the longest in the entire country, and the longest in the whole of North Korea. ⟨eos⟩

Context: During spring training in a game against the Boston Red

VA+: Sox on April 4, 2009, Kinsler was placed on the 15 @-@ day disabled list with a strained right hamstring. He returned to the active roster on April 7, and was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 4, and was placed on the 15 @-@ day DL on May 5. On May 6, he was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 9. On May 10, he was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 11. He was placed on the 15 @-@ day DL on May 13. He was placed on the 15 @-@ day DL on May 20. He was placed on the 15 @-@ day DL on May 24. He was placed on the 15 @-@ day DL on May 28. He was placed on the 15 @-@ day DL on June 2. He was placed on ...

ST+: Sox on April 20, 2011, he was placed on the disabled list with a back injury. He returned to the disabled list on May 10, 2011, and was activated on May 17. He was activated on May 20 and made his first appearance on May 21. He was activated on June 2 and made his first appearance on June 4. He was activated on June 8 and made his first appearance on June 10. He was activated on June 15 and made his first appearance on June 17. He was activated on June 20 and made his first appearance on June 23. He was activated on June 29 and made his first appearance on July 1. He was activated on July 1 and made his first appearance on July 4. He was activated on July 6 and made his first appearance on July 10. He was activated on July 14 and made his first appearance on July 16. He was activated on July 20 and made his first appearance on July 23. He was ...

NMST+: Sox on April 16, 2010, the Yankees signed Rivera to a one @-@ year, $ 2 @.@ 5 million contract. He made his major league debut on April 21, 2010, against the Boston Red Sox. He pitched a scoreless inning in the first inning of the first game of the 2010 World Series against the New York Mets. On May 1, 2010, Rivera was traded to the Pittsburgh Pirates in exchange for J. J. Hardy. ⟨eos⟩

E.4 ADDITIONAL PLOTS FOR FIGURE 3

[Plots of p_θ(y_t = ⟨eos⟩ | y_{<t}, x) as a function of t.]
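The plots referenced in E.4 track the ⟨eos⟩ probability assigned at each decoding step. As a rough sketch of how such a trace could be collected (reusing the same kind of hypothetical `next_token_probs` interface as in the earlier sketches, not the paper's code):

```python
from typing import Callable, Dict, List, Sequence

EOS = "<eos>"


def eos_probability_trace(next_token_probs: Callable[[Sequence[str]], Dict[str, float]],
                          context: Sequence[str], max_steps: int) -> List[float]:
    """Record p(y_t = <eos> | y_<t, x) at every step of greedy decoding."""
    prefix, trace = list(context), []
    for _ in range(max_steps):
        probs = next_token_probs(prefix)
        trace.append(probs.get(EOS, 0.0))   # <eos> probability at step t
        token = max(probs, key=probs.get)   # greedy choice
        prefix.append(token)
        if token == EOS:
            break
    return trace
```

Plotting such a trace against t for the VA+, ST+, and NMST+ variants gives the kind of curves these additional plots report.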