# Residual Energy-Based Models for Text Generation

Published as a conference paper at ICLR 2020

Yuntian Deng1, Anton Bakhtin2, Myle Ott2, Arthur Szlam2, Marc'Aurelio Ranzato2
1Harvard University, 2Facebook AI Research
dengyuntian@seas.harvard.edu, {yolo,myleott,aszlam,ranzato}@fb.com

ABSTRACT

Text generation is ubiquitous in many NLP tasks, from summarization to dialogue and machine translation. The dominant parametric approach is based on locally normalized models which predict one word at a time. While these work remarkably well, they are plagued by exposure bias due to the greedy nature of the generation process. In this work, we investigate un-normalized energy-based models (EBMs) which operate not at the token but at the sequence level. In order to make training tractable, we first work in the residual of a pretrained locally normalized language model, and second we train using noise contrastive estimation. Furthermore, since the EBM works at the sequence level, we can leverage pretrained bi-directional contextual representations, such as BERT and RoBERTa. Our experiments on two large language modeling datasets show that residual EBMs yield lower perplexity compared to locally normalized baselines. Moreover, generation via importance sampling is very efficient and of higher quality than the baseline models according to human evaluation.

1 INTRODUCTION

The dominant approach to parametric text generation is based on large neural auto-regressive models (Radford et al., 2019). These models can be trained efficiently via maximum likelihood and they can efficiently generate samples of remarkable quality. Key to their success is local normalization, i.e. they are defined in terms of a product of conditional distributions, one for each token in the sequence. Such distributions are relatively cheap to compute with modern hardware given the limited vocabulary size of common sub-word units like BPE (Sennrich et al., 2015).

Unfortunately, local normalization also brings some drawbacks. First, the designer of the model needs to specify the order in which tokens are generated. Second, at training time the model is conditioned on ground-truth context while at test time it is conditioned on its own generations, a discrepancy referred to as exposure bias (Ranzato et al., 2016). Finally, while heuristics like beam search somewhat help rescore at the sequence level, generation generally lacks long-range coherency because it is produced by the greedy selection of one token at a time without lookahead.

Energy-based models (EBMs) (Hinton, 2002; LeCun et al., 2006; Ranzato et al., 2007) are a more general framework which potentially addresses all these issues, as they do not require any local normalization. They only require the definition of an energy function defined over the whole input sequence. Training aims at shaping the energy function such that regions of high density of training data points have lower energy than elsewhere. In principle, EBMs are ideal for modeling text as they can score the whole input at once, they are not prone to label bias (Bottou, 1991), and they may enable generation of large chunks of text, which should help improve coherency. However, so far EBMs have had limited application in text generation, because sampling from the model is intractable, and so is maximum likelihood training.
The problem is that shaping the energy function is accomplished by updating the model parameters such that the energy is decreased at the training data points (a.k.a. positive examples) and increased at other data points (a.k.a. negative examples). In maximum likelihood training, negatives are generated from the model, but in text applications we cannot use gradient-based MCMC methods (Teh et al., 2003; Du & Mordatch, 2019), and Gibbs sampling (Welling et al., 2005) is too slow to be practical. Generating negatives by local perturbations of the ground truth would be efficient but hardly useful for generation purposes, when at test time the model needs to generate from scratch.

Recently, Bakhtin et al. (2019) carefully studied the problem of training a discriminator to distinguish human-written text from language model generations. They experimented with different language model and discriminator architectures as well as training/test-time corpora, and concluded that the discriminator can generalize rather well to weaker language models when the training/test corpora match. Bakhtin et al. (2019) found that the learned discriminator is not robust to random perturbations, and argued that the discriminator operates in the residual space of the language model. Concurrently, Grover et al. (2019) proposed a general approach to de-bias a generator, by simply training a discriminator and using its output for importance sampling.

In this work, we build upon these two works. First, we formalize the residual interpretation of Bakhtin et al. (2019) and use a generative model of the form:

$$P_\theta(x) \propto P_{\text{LM}}(x)\,\exp(-E_\theta(x)) \quad (1)$$

where $P_{\text{LM}}(x)$ is a locally normalized language model which is fixed during training, and $E_\theta$ is the energy function parameterized by $\theta$. The resulting model $P_\theta(x)$ is globally normalized due to the energy term. Note that the same residual formulation was also used in Rosenfeld et al. (2001); Wang & Ou (2018b); Parshakova et al. (2019).

This formulation has several benefits. First, by incorporating a locally normalized language model, we can leverage recent advances in locally normalized language modeling. Second, the language model provides a natural proposal distribution for training (Bakhtin et al., 2019), and training can be made efficient by using the conditional noise contrastive estimation objective (Gutmann & Hyvärinen, 2010), as we shall see in Section 3. Lastly, this formulation enables efficient evaluation and generation via importance sampling (Horvitz & Thompson, 1952; Grover et al., 2019). In some sense, this last point is perhaps the central contribution of the paper, as it allows estimating the perplexity of the residual EBM, and thus allows these EBMs to be compared in a standard way to other models. Indeed, in Section 4 we show that our joint model decreases perplexity on two large datasets when compared to various auto-regressive language model baselines. Finally, the EBM generations are significantly preferred by humans according to our qualitative evaluation. To the best of our knowledge, this is the first time that an EBM has demonstrated improved generation ability against very strong auto-regressive baselines, both in terms of estimated perplexity and through human evaluation.
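As a concrete illustration of Eq. (1), the following minimal sketch shows how a residual EBM scores a candidate sequence: the unnormalized log-score is the log-probability under a fixed autoregressive language model minus the scalar output of an energy network. This is a hedged PyTorch sketch, not the paper's implementation; the toy `ToyLM` and `ToyEnergy` modules are placeholders for a pretrained Transformer LM and a BERT/RoBERTa-style bidirectional energy function.

```python
# Minimal sketch of the residual scoring rule in Eq. (1):
#   unnormalized log-score(x) = log P_LM(x) - E_theta(x)
# The tiny modules below are placeholders, not the models used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 100, 32

class ToyLM(nn.Module):
    """Stand-in for a pretrained, frozen autoregressive language model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.out = nn.Linear(DIM, VOCAB)

    def log_prob(self, x):                       # x: (batch, T) token ids
        h, _ = self.rnn(self.embed(x[:, :-1]))   # predict token t from tokens < t
        logp = F.log_softmax(self.out(h), dim=-1)
        tgt = x[:, 1:].unsqueeze(-1)
        return logp.gather(-1, tgt).squeeze(-1).sum(dim=-1)   # (batch,) = log P_LM(x)

class ToyEnergy(nn.Module):
    """Stand-in for the sequence-level energy function E_theta (one scalar per sequence)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.score = nn.Linear(DIM, 1)

    def forward(self, x):
        return self.score(self.embed(x).mean(dim=1)).squeeze(-1)   # (batch,)

def residual_log_score(lm, energy, x):
    """Unnormalized log P_theta(x) = log P_LM(x) - E_theta(x); the LM stays fixed."""
    with torch.no_grad():
        lm_logp = lm.log_prob(x)
    return lm_logp - energy(x)

x = torch.randint(0, VOCAB, (4, 12))
print(residual_log_score(ToyLM(), ToyEnergy(), x))
```

The design choice to note is that the language model contributes no gradients: only $E_\theta$ is trainable, which is what makes the residual parameterization cheap to learn on top of an existing LM.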
2 RELATED WORK

Energy-based models have a long history in machine learning (Hopfield, 1982; Hinton, 2002; LeCun et al., 2006; Ranzato et al., 2007). The key challenge of training is mining for good negatives. This can be accomplished explicitly, by fantasizing inputs where the energy should be increased, or implicitly, via global constraints such as sparsity (Ranzato et al., 2007). Methods that attempt to maximize the likelihood of the data require sampling from the distribution induced by the model. Unfortunately, gradient-based MCMC approaches like Hybrid Monte Carlo (Teh et al., 2003) and Langevin dynamics (Ranzato et al., 2007; Du & Mordatch, 2019; Xie et al., 2016; 2017; 2018; 2019; Gao et al., 2018; Nijkamp et al., 2019) are not applicable when the input is discrete, as in text applications. Other approaches like Gibbs sampling (Hinton, 2002) were applied to binary inputs but do not scale well to large dictionaries once the energy function is a large bidirectional transformer model like the one used in this work. Several variants of auto-encoders have also been investigated for representing and generating text (Bowman et al., 2016; Zhao et al., 2018), but they have not shown significant improvements in terms of perplexity and they have so far been applied to relatively small datasets only.

Our approach appears similar to discriminative reranking approaches used in the parsing and machine translation community (Shen et al., 2004). However, our approach provides a generative model, and parameters/hyper-parameters are directly tuned to close the gap between the model distribution and the data distribution, rather than relying on surrogate ranking losses. This approach is also related to other sequence-level training objectives (Edunov et al., 2018), with the major difference that in those works training aims at improving the baseline model, but generation at test time is still greedy.

Energy networks have been used for sequence modeling (Rosenfeld et al., 2001; Wang et al., 2015; 2017; Wang & Ou, 2017; 2018a; Parshakova et al., 2019). In particular, our residual modeling form and training algorithm are the same as in Wang & Ou (2018b), who used an LSTM as the generator and a CNN-LSTM as the energy function, and showed significant gains compared to LSTM baselines in speech recognition. Our work builds on these prior works and develops new lower and upper bounds for the log-probability under the joint model, which makes it possible to show that the residual EBM approach achieves better perplexity. We also develop an importance-weighted sampling scheme used at generation time, and we focus on conditional generation as opposed to rescoring in speech recognition (Wang & Ou, 2018b). The residual EBM formalism makes it very natural to use BERT for language modeling, and we show empirically that this type of approach can outperform modern state-of-the-art language modeling baselines, both in terms of perplexity and through human evaluation.

Generative Adversarial Networks (Goodfellow et al., 2014) also relate to EBMs, except that in EBMs the generator is implicit and negative samples are produced by the discriminator itself. In our work, the pretrained locally normalized language model can be seen as a fixed generator, as in Bakhtin et al. (2019). Azadi et al. (2018) also share our goal, but their generator is not locally normalized and they propose to improve sampling from the generator by using the discriminator for rejection sampling. Similar to our work, Grover et al. (2019) propose to use the discriminator to de-bias the pretrained generator using importance sampling. We adapt this work to the application of text generation.
In particular, we apply the conditional noise contrastive estimation (NCE) objective (Ma & Collins, 2018; Gutmann & Hyvärinen, 2010) to our residual energy function and then sample from the joint model using importance sampling. We note that the same formulation has been proposed in Wang & Ou (2018b); Parshakova et al. (2019). While Ma & Collins (2018) used conditional NCE to predict the next word in a sequence, we apply it to produce a whole sequence at once, with the pretrained auto-regressive language model as the noise distribution.

3 RESIDUAL ENERGY-BASED MODELS

We study the problem of conditional generation of discrete sequences. Given a prefix $x_1, \dots, x_p$ with $x_j \in V$, where $V$ is the vocabulary, we want to model the probability of generating a sequence of total length $T > p$.¹ The generative model is:

$$P_\theta(x_{p+1}, \dots, x_T \mid x_1, \dots, x_p) = \frac{P_{\text{LM}}(x_{p+1}, \dots, x_T \mid x_1, \dots, x_p)\,\exp(-E_\theta(x_1, \dots, x_T))}{Z_\theta(x_1, \dots, x_p)} \quad (2)$$

where $Z_\theta(x_1, \dots, x_p)$ is a normalizing factor known as the partition function. Computing the partition function is intractable in our case since it involves a sum over $|V|^{T-p}$ terms, which grows exponentially with the sequence length: in our experiments the size of the vocabulary is 50,096 and the length of the generation is 40 tokens. We call $P_\theta$ the joint model, and $E_\theta$ the residual energy function, since $P_{\text{LM}}$ is fixed throughout training. The goal of training is to learn the parameters of the energy function such that the joint model distribution gets close to the data distribution. For the sake of reducing clutter in the notation, we drop the conditioning variables in the following discussion.

¹We assume a fixed $T$ for simplicity of analysis and implementation, but our method generalizes to variable-length generation with an end-of-sequence symbol.

3.1 TRAINING

When the partition function is intractable, Maximum Likelihood Estimation (MLE) requires samples from the model distribution, which is usually approximated with Monte Carlo sampling or mean field inference (Hinton, 2012; LeCun et al., 2006) for globally normalized models. Unfortunately, both approaches are too computationally expensive for text applications when using large bidirectional transformer models. For instance, if we were to employ Gibbs sampling exactly, we would need to perform, at every position, as many forward passes as there are words in the dictionary to compute each conditional distribution. On large datasets where training locally normalized models on multiple machines already takes days, such additional overhead means that the model would learn from much less data in the same amount of time, and this is seldom a beneficial strategy for learning models that generalize well. Therefore, we use neither MCMC nor mean field methods, as the latter would introduce additional variational parameters or an inference network, which in any case yields an approximation to MLE learning.

Instead, we train our residual energy function using Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010), and more specifically its conditional version (Ma & Collins, 2018). NCE requires two distributions: the model distribution and a noise distribution. In our case, the model distribution is the joint model of Eq. 2, $P_\theta$, while the noise distribution is the pretrained language model, $P_{\text{LM}}$. NCE then trains a binary classifier on the difference of log-probability scores of these two models. Since our joint model is the product of the energy function (whose parameters we want to learn) with $P_{\text{LM}}$, the difference reduces to $\log P_\theta - \log P_{\text{LM}} = -E_\theta$. Therefore, under these modeling assumptions of residual learning and noise model, the objective function becomes:

$$\max_\theta \; \mathbb{E}_{x_+ \sim P_{\text{data}}} \log \frac{1}{1 + \exp(E_\theta(x_+))} + \mathbb{E}_{x_- \sim P_{\text{LM}}} \log \frac{1}{1 + \exp(-E_\theta(x_-))} \quad (3)$$

where $x_+$ is a positive sequence taken from the human-generated training set, and $x_-$ is a negative sequence drawn from $P_{\text{LM}}$ (for a given ground-truth prefix). In other words, training the energy function reduces to training a binary classifier to discriminate between real text and text generated by an auto-regressive language model. The aim of training is to assign as low an energy as possible to real data, and as high an energy as possible to machine-generated data.
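The binary-classification view of Eq. (3) translates almost directly into code. Below is a hedged PyTorch sketch (not the authors' implementation): positives are ground-truth continuations, negatives are continuations sampled from the frozen language model given the same prefix, and since the log-odds against the LM reduce to $-E_\theta(x)$, the negative energy serves directly as the classifier logit of a standard logistic loss. The `energy` and `lm_sample` callables are assumptions standing in for the paper's bidirectional energy function and Transformer LM sampler.

```python
# Hedged sketch of the conditional NCE objective in Eq. (3).
# `energy(x)` returns a scalar energy per sequence in the batch;
# `lm_sample(prefix)` draws continuations from the *fixed* pretrained LM.
# Both are hypothetical stand-ins for the models used in the paper.
import torch
import torch.nn.functional as F

def nce_loss(energy, lm_sample, prefix, real_continuation):
    """Logistic loss that discriminates real from LM-generated continuations."""
    x_pos = torch.cat([prefix, real_continuation], dim=1)    # human-written sequences
    with torch.no_grad():                                     # noise samples; the LM is frozen
        x_neg = torch.cat([prefix, lm_sample(prefix)], dim=1)

    # Under the residual parameterization, -E_theta(x) is the classifier logit.
    logits_pos = -energy(x_pos)
    logits_neg = -energy(x_neg)

    # Minimizing this loss maximizes the objective in Eq. (3).
    return (F.binary_cross_entropy_with_logits(logits_pos, torch.ones_like(logits_pos))
            + F.binary_cross_entropy_with_logits(logits_neg, torch.zeros_like(logits_neg)))

# Training step sketch: nce_loss(...).backward(); only E_theta's parameters are updated.
```

Note that only the energy function receives gradients; the language model acts purely as a fixed noise and proposal distribution.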
Interestingly, the role of positive and negative samples is completely symmetric in this loss function; Section 5 discusses the consequences of this. With the theoretical guarantee of NCE, we can show that, given an infinite amount of data and a model with enough capacity, the optimum of the above objective is reached at the data distribution, as also proved in Ma & Collins (2018)².

Theorem 1. If $P_{\text{LM}}$ has the same support as $P_{\text{data}}$, then the objective function in Eq. 3 reaches its maximum at $\log P_{\text{LM}}(x) - E_\theta(x) = \log P_{\text{data}}(x)$, if such a $\theta$ exists.

Proof. This theorem follows directly from the proof in Gutmann & Hyvärinen (2010).

Note that at the optimum, $P_{\text{LM}}(x)\exp(-E_\theta(x))$ is self-normalizing: instead of $P_\theta(x) \propto P_{\text{LM}}(x)\exp(-E_\theta(x))$, we have $P_\theta(x) = P_{\text{LM}}(x)\exp(-E_\theta(x))$. However, we still need to estimate the partition function throughout the rest of this paper, since we cannot guarantee that this optimum can be reached.

3.2 EVALUATION

A commonly used protocol for evaluating generative sequence models, especially language models, is perplexity (PPL), which is equal to $2^{-\frac{1}{T-p}\sum_{i=p+1}^{T}\log_2 P(x_i \mid x_{i-1}, \dots, x_1)}$. PPL can be interpreted as the average number of tokens the model is uncertain of at every time step. Since the log-likelihood required by PPL relies on estimating the partition function $Z_\theta = \sum_x P_{\text{LM}}(x)\exp(-E_\theta(x)) = \mathbb{E}_{x \sim P_{\text{LM}}} \exp(-E_\theta(x))$, we derive two estimators for the log-partition function $\log Z_\theta$ based on the work of Nowozin (2018).

Theorem 2. Denote by $T_n$ the empirical estimate of $\log \mathbb{E}_{x \sim P_{\text{LM}}} \exp(-E(x))$ with $n$ samples $x_i \sim P_{\text{LM}}$ ($i = 1, \dots, n$), i.e. $T_n = \log \frac{1}{n} \sum_{i=1}^{n} \exp(-E(x_i))$. Then $\forall \epsilon > 0$, $\exists N > 0$ such that $\forall n > N$ we have

$$\log Z_\theta - \epsilon < \mathbb{E}[T_n] < \log Z_\theta < \mathbb{E}[(2n-1)T_n - 2(n-1)T_{n-1}] < \log Z_\theta + \epsilon \quad (4)$$

The proof is given in Appendix A.2. We can use these two estimators to bound the log-partition function from below and above, but we want to emphasize that the bounds hold only asymptotically (when $n$ is sufficiently large). We also note that, to get lower-variance estimates, we use a leave-one-out strategy to estimate $T_{n-1}$. See Nowozin (2018) for implementation details and methods to improve numerical stability.

Similarly to locally normalized models, we can also factorize the probabilities of an entire sequence step by step, as $P(x) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$.
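The two estimators in Theorem 2 amount to a log-sum-exp over negative energies of LM samples, plus a bias-corrected combination that uses leave-one-out estimates of $T_{n-1}$. The sketch below is an illustrative implementation under those definitions only, not the authors' code; `energies` is assumed to hold $E_\theta(x_i)$ for $n$ sequences sampled from $P_{\text{LM}}$.

```python
# Sketch of the lower/upper-bound estimators of log Z_theta in Theorem 2 (Eq. 4),
# following the definitions above: T_n = log( (1/n) * sum_i exp(-E(x_i)) ).
# `energies` is a 1-D tensor of E_theta values for n sequences sampled from P_LM.
import math
import torch

def log_partition_bounds(energies):
    n = energies.numel()
    neg_e = -energies

    # Lower-bound estimator T_n (underestimates log Z_theta in expectation, by Jensen's inequality).
    t_n = torch.logsumexp(neg_e, dim=0) - math.log(n)

    # Leave-one-out estimates of T_{n-1}, averaged for lower variance
    # (quadratic in n here for clarity; see Nowozin (2018) for efficient, stable variants).
    loo = [torch.logsumexp(torch.cat([neg_e[:i], neg_e[i + 1:]]), dim=0) - math.log(n - 1)
           for i in range(n)]
    t_n_minus_1 = torch.stack(loo).mean()

    # Bias-reduced combination (2n-1) T_n - 2(n-1) T_{n-1}, which overestimates asymptotically.
    upper = (2 * n - 1) * t_n - 2 * (n - 1) * t_n_minus_1
    return t_n, upper

# Bracketing log Z_theta this way is what makes the joint model's log-likelihood,
# and hence its perplexity, estimable (asymptotically and in expectation):
#   log P_LM(x) - E_theta(x) - upper  <=  log P_theta(x)  <=  log P_LM(x) - E_theta(x) - t_n
print(log_partition_bounds(torch.randn(1000)))
```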