# Latent Diffusion for Language Generation

Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, Kilian Q. Weinberger
Cornell University, Ithaca, NY

Diffusion models have achieved great success in modeling continuous data modalities such as images, audio, and video, but have seen limited use in discrete domains such as language. Recent attempts to adapt diffusion to language have presented diffusion as an alternative to existing pretrained language models. We view diffusion and existing language models as complementary. We demonstrate that encoder-decoder language models can be utilized to efficiently learn high-quality language autoencoders. We then demonstrate that continuous diffusion models can be learned in the latent space of the language autoencoder, enabling us to sample continuous latent representations that can be decoded into natural language with the pretrained decoder. We validate the effectiveness of our approach for unconditional, class-conditional, and sequence-to-sequence language generation. We demonstrate across multiple diverse datasets that our latent language diffusion models are significantly more effective than previous diffusion language models. Our code is available at https://github.com/justinlovelace/latent-diffusion-for-language.

1 Introduction

Although originally introduced by Sohl-Dickstein et al. [61] in 2015, diffusion models did not see widespread use until Ho et al. [22] demonstrated their viability for high-quality image generation in 2020. Since then, research has driven rapid improvements, and they have recently surpassed generative adversarial networks on image generation benchmarks [12] and autoregressive models on density estimation benchmarks [30], outclassing generative modeling paradigms that have dominated those areas for the better part of a decade. Diffusion models are now, arguably, the most widely used class of generative models for continuous data modalities such as images, audio, and video [54, 32, 23].

The widespread success of diffusion models across a variety of domains and applications makes them appealing for language generation. However, they have seen less use in discrete domains, where the gradual transition of discrete states to Gaussian noise (and vice versa) is not as natural as in continuous domains. Prior work proposes to learn continuous diffusion models in the space of learnable word embeddings and decodes the continuous generations with a rounding step [36]. However, combining representation learning with the diffusion objective requires careful regularization to avoid collapse.

One breakthrough in image generation was the introduction of latent diffusion [51], where diffusion models are trained to produce samples from the latent distribution of a pretrained autoencoder. This offloads the task of generating high-frequency details to the autoencoder and enables the diffusion process to focus on the high-level semantics of images. In this paper, we explore the viability of latent diffusion for text generation. We claim that this approach is particularly well-suited for discrete modalities because it offloads the challenge of modeling a discrete distribution to the autoencoder and simplifies the diffusion process by restricting it to the continuous, latent feature space. We introduce Latent Diffusion for Language Generation (LD4LG), a method that leverages the latent space of a pretrained encoder-decoder network (e.g.,
BART [35], T5 [50]) to learn a high-quality diffusion model for text. The latent representations from such models are high-dimensional and input-length dependent, complicating the use of diffusion models [51, 66]. To address both issues, we learn an additional compression module that maps the high-dimensional encoder representations to a lower-dimensional, fixed-length representation. We also learn a corresponding reconstruction network to map these fixed-length features back to high-dimensional features that guide the language decoder (via cross-attention) to reconstruct the original language. The low-dimensional representation is ideally suited for diffusion. For language generation, we use a diffusion model to generate a low-dimensional (fixed-length) latent, which is mapped into a higher-dimensional space with the reconstruction network. This high-dimensional representation then guides the pre-trained decoder to generate natural language. Our approach naturally combines the continuous, fixed-length diffusion process with discrete, variable-length text generation.

We demonstrate that LD4LG is effective for unconditional, class-conditional, and sequence-to-sequence language generation across a variety of datasets. Our approach significantly outperforms recent diffusion language models while using fewer sampling steps. For instance, we achieve a MAUVE score [47] of .716 on the ROCStories dataset with 250 sampling steps, while Diffusion-LM [36] achieves a MAUVE score of .043 using 2000 sampling timesteps. For the challenging XSum summarization benchmark, we achieve a ROUGE-L of 31.9 with 250 timesteps, while the recently proposed DiffuSeq [16] achieves a ROUGE-L of 14.1 with 2000 timesteps. We also find that the diffusion models offer some benefits over a strong autoregressive baseline. In particular, we observe that our latent language diffusion is less susceptible to memorization and more effective for class-conditional generation.

2 Background

Diffusion models [61, 22, 63] are a class of latent variable models that learn to iteratively transform random Gaussian noise, which can be sampled analytically, into a sample from an unknown data distribution specified by a collection of samples. This mapping is defined through a forward diffusion process that iteratively adds Gaussian noise to samples, and a generative process that iteratively denoises samples from the Gaussian distribution to obtain samples from the data distribution. We provide a formal description of diffusion models in the appendix. The diffusion model consists of a denoising network $\hat{x}_\theta$ trained with a regression objective

$$\mathcal{L}(\theta) = \mathbb{E}_{t, x, \epsilon}\left[\lambda_t \left\lVert \hat{x}_\theta\!\left(\sqrt{\alpha_t}\, x + \sqrt{1 - \alpha_t}\, \epsilon,\; t\right) - x \right\rVert_2^2\right]$$

where $x$ is the training data, $t \sim U(0, 1)$ is the timestep, $\epsilon \sim N(0, I)$ is Gaussian noise, $\alpha_t$ defines the noise schedule, and $\lambda_t$ is a time-dependent weighting term. The denoising network is therefore trained to denoise a noisy latent, $z_t = \sqrt{\alpha_t}\, x + \sqrt{1 - \alpha_t}\, \epsilon$, to the clean data, $x$, with a regression objective that emphasizes certain times $t$. Sampling algorithms start from pure Gaussian noise, $z_1 \sim N(0, I)$, and utilize the denoising network to iteratively generate latents $z_{t_1}, z_{t_2}, \ldots, z_{t_T}$, where $1 = t_1 > t_2 > \ldots > t_T = 0$, with decreasing levels of noise until $z_0$ is drawn approximately from the data distribution.
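To make the objective above concrete, the following is a minimal sketch of one training step under the x-prediction view, written in PyTorch. It is an illustration rather than the paper's implementation; the `denoiser`, `alpha_fn`, and `lambda_fn` callables are placeholders for whatever network, noise schedule, and loss weighting one chooses.

```python
import torch

def diffusion_regression_loss(denoiser, x, alpha_fn, lambda_fn=None):
    """One training step of the denoising objective:
    L = E_{t,x,eps}[ lambda_t * || x_hat(sqrt(a_t) x + sqrt(1-a_t) eps, t) - x ||^2 ].
    `denoiser(z_t, t)` predicts the clean data x; `alpha_fn(t)` returns alpha_t in [0, 1]."""
    b = x.shape[0]
    t = torch.rand(b, device=x.device)                       # t ~ U(0, 1)
    alpha = alpha_fn(t).view(b, *([1] * (x.dim() - 1)))      # broadcast over feature dims
    eps = torch.randn_like(x)                                # Gaussian noise
    z_t = alpha.sqrt() * x + (1 - alpha).sqrt() * eps        # noisy latent from the forward process
    x_hat = denoiser(z_t, t)                                 # predicted clean data
    per_example = ((x_hat - x) ** 2).flatten(1).mean(dim=1)  # squared error per example
    weight = lambda_fn(t) if lambda_fn is not None else torch.ones_like(t)
    return (weight * per_example).mean()
```

The same skeleton applies to the latent diffusion models in Section 3; only the data `x` (autoencoder latents rather than raw features) and the parameterization change.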
3 Latent Diffusion For Language

Figure 1 presents an overview of Latent Diffusion for Language Generation. Our method consists of two main parts. We augment a pretrained encoder-decoder language model with two learnable networks to develop a high-quality language autoencoder with a compact latent space. We then introduce continuous diffusion models that learn to generate samples from the latent distribution of our language autoencoders. These continuous samples can, by design, be decoded into natural language.

Figure 1: Overview of our proposed latent language diffusion framework.

3.1 Language Autoencoder

We base our architecture on pretrained encoder-decoder language models (depicted in blue), such as BART [35] and T5 [50] (we present results with both). By default, we freeze the pre-trained models and learn only the autoencoding modules to accelerate training. The Language Encoder, $E(\cdot)$, maps variable-length language, represented as a sequence of tokens, $w \in \mathbb{N}^L$, to a latent representation of the same length, $E(w) \in \mathbb{R}^{L \times d_{LM}}$.

Compression Network. The learnable Compression Network maps the encoder features to a compact latent space that is well-suited for diffusion. We adopt the Perceiver Resampler [2] architecture, originally developed to compress image features for a vision-language model, which is depicted in Figure 2. The Perceiver Resampler, like the transformer, consists of a stack of alternating multi-head attention (MHA) blocks and feedforward (FF) layers. We refer the reader to Vaswani et al. [69] for a detailed description of these components. We learn $\ell$ latent queries $Z \in \mathbb{R}^{\ell \times d_{LM}}$ that iteratively cross-attend to the language encoder features $E(w) \in \mathbb{R}^{L \times d_{LM}}$ to extract information. We follow Alayrac et al. [2] and allow the latent queries to simultaneously attend to themselves and the frozen encoder representations. We can write the attention layer as $Z = Z + \mathrm{MHA}(q = Z,\, kv = [Z; E(w)])$, where $\mathrm{MHA}(\cdot)$ is the multi-head attention operation with queries, $q$, and keys/values, $kv$. This design compresses the encoder representations to the fixed sequence length, $\ell$, of the latents. After each multi-head attention layer, a feedforward layer is applied to the latent query representations.

Figure 2: Architecture of our Compression Network.

After the Compression Network maps the input to a fixed sequence length, we reduce the dimensionality of the output to dimension $d_{ae}$ with a learnable linear projection. The compression network therefore maps the variable-length output of the frozen encoder to a compact latent space $x = f_\phi(E(w)) \in \mathbb{R}^{\ell \times d_{ae}}$ of fixed length $\ell < L$ and dimensionality $d_{ae} < d_{LM}$, where we will learn our diffusion model. To ensure that the latent space is appropriately scaled for diffusion, we can optionally constrain the norm of the latent space. Since $\mathbb{E}_{\epsilon \sim N(0, I)}[\lVert \epsilon \rVert_2^2] = d_{ae}$ [1], we can normalize the latent vectors along the feature dimension so that $\lVert x_i \rVert_2^2 = d_{ae}$, similar to prior work on text diffusion [13].

Reconstruction Network. The Reconstruction Network maps the compressed latent space to the feature space expected by the Language Decoder. To achieve this, we project $x = f_\phi(E(w)) \in \mathbb{R}^{\ell \times d_{ae}}$ back up to dimension $d_{LM}$, add learnable absolute position embeddings, and pass the result through a standard transformer model to obtain features $g_\phi(x) \in \mathbb{R}^{\ell \times d_{LM}}$. The Language Decoder, $D(\cdot)$, cross-attends to these features and generates text autoregressively. We train the compression and reconstruction networks to produce features that guide the decoder to reconstruct the input text, $w \approx D(g_\phi(x)) = D(g_\phi(f_\phi(E(w))))$, with the cross-entropy loss. This gives us a continuous, semantic latent space that can be decoded to natural language.
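As a concrete reference for the compression side of the autoencoder described above, the sketch below implements a Perceiver-Resampler-style block in which learned latent queries attend to themselves and the frozen encoder features, followed by a linear down-projection to $d_{ae}$. The class names, layer count, and dimensions (here $\ell = 32$, $d_{ae} = 64$, $d_{LM} = 768$) are illustrative stand-ins, not the released implementation.

```python
import torch
import torch.nn as nn

class ResamplerBlock(nn.Module):
    """One Perceiver-Resampler-style block: latent queries cross-attend to the
    concatenation of themselves and the encoder features, then pass through a
    feedforward layer, i.e. Z = Z + MHA(q=Z, kv=[Z; E(w)]) followed by FF."""
    def __init__(self, d_lm: int = 768, n_heads: int = 12, ff_mult: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_lm, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_lm, ff_mult * d_lm), nn.GELU(), nn.Linear(ff_mult * d_lm, d_lm)
        )

    def forward(self, latents: torch.Tensor, enc_feats: torch.Tensor) -> torch.Tensor:
        kv = torch.cat([latents, enc_feats], dim=1)   # [Z; E(w)] along the sequence axis
        attn_out, _ = self.attn(latents, kv, kv)      # queries are the latents only
        latents = latents + attn_out
        return latents + self.ff(latents)             # feedforward applied to the latents


class CompressionNetwork(nn.Module):
    """Maps variable-length encoder features (B, L, d_lm) to a fixed-size latent
    (B, n_latents, d_ae) via learned latent queries and a linear down-projection."""
    def __init__(self, d_lm: int = 768, d_ae: int = 64, n_latents: int = 32, depth: int = 3):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_lm) * 0.02)
        self.blocks = nn.ModuleList([ResamplerBlock(d_lm) for _ in range(depth)])
        self.proj_down = nn.Linear(d_lm, d_ae)

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        z = self.latents.unsqueeze(0).expand(enc_feats.size(0), -1, -1)
        for block in self.blocks:
            z = block(z, enc_feats)
        return self.proj_down(z)
```

The reconstruction network would mirror this with an up-projection back to $d_{LM}$, learned position embeddings, and a standard transformer, as described above.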
Implementation Details. We utilize BART-base and FLAN-T5-base [10] as the encoder-decoder language models throughout this work and learn language autoencoders for each dataset. During autoencoder training, we freeze the pre-trained language models and only learn the autoencoding modules. The autoencoder training could likely be amortized across datasets by training a general-purpose language autoencoder on a large corpus of text, but we leave such explorations to future work. We train the autoencoder to reconstruct the input language with the cross-entropy loss. For the diffusion latent space, we set $\ell = 32$, $d_{ae} = 64$ and utilize 3 layers in both autoencoding modules across all monolingual datasets. For our machine translation experiments, we utilize MT5-base [72] to develop our autoencoder. We found it beneficial to jointly fine-tune the language model and the autoencoding modules, likely because the dataset is an order of magnitude larger than our other datasets and therefore benefits from the additional capacity. We use the same latent dimensionality, but only use a single layer for the autoencoding modules. We report full hyperparameter settings in the appendix. We constrain the norm of the latent space across models and datasets except when using FLAN-T5, because doing so led to a minor degradation in autoencoding performance and downstream generation quality.

3.2 Latent Language Diffusion

Figure 1 outlines our latent language diffusion framework. Given some dataset of natural language, $\mathcal{D}$, we can now sample continuous data as $x = f_\phi(E(w)) \in \mathbb{R}^{\ell \times d_{ae}}$, where $w \sim \mathcal{D}$. We then train a continuous denoising network, $\hat{x}_\theta(\cdot)$, to recover $x$ with the standard regression objective

$$\mathcal{L}(\theta) = \mathbb{E}_{t, x, \epsilon}\left[\lambda_t \left\lVert \hat{x}_\theta\!\left(\sqrt{\alpha_t}\, x + \sqrt{1 - \alpha_t}\, \epsilon,\; t\right) - x \right\rVert_2^2\right]$$

with some time-dependent weighting $\lambda_t$. In practice, the denoising network is often parameterized as an $\epsilon$-prediction network [22] or a $v$-prediction network [57], where the velocity, $v$, is defined as $v = \sqrt{\alpha_t}\, \epsilon - \sqrt{1 - \alpha_t}\, x$. These parameterizations can be interpreted as different weighting functions, $\lambda_t$, for the regression objective above (see Salimans and Ho [57]). We adopt the $v$-parameterization in this work because it has been shown to be effective for latent image diffusion [51]. For generation, we sample a latent variable, $z_1 \sim N(0, I)$, $z_1 \in \mathbb{R}^{\ell \times d_{ae}}$, that is iteratively denoised to produce a sample, $x = z_0$, from the distribution of the language autoencoder's latent space. We then generate natural language with the pretrained reconstruction network and language decoder, $w = D(g_\phi(x))$.

We train our diffusion models with the cosine noise schedule $\alpha_t = \cos(0.5\pi t)^2$ [45, 57, 55] by default. For our machine translation experiments, we employ a scaled cosine noise schedule (see subsection E.2 in the appendix for full details) [7, 27]. For generation, we use the DDPM sampler with 250 sampling timesteps. For text generation with the pretrained decoder, we utilize beam search with 4 beams. We train all of our diffusion models with a single Nvidia A6000 GPU, except for the machine translation models, which are trained with 4 Nvidia A6000 GPUs.
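The sketch below spells out the $v$-parameterization and cosine schedule described above: it forms the noisy latent, computes the velocity target, and recovers the implied clean-data estimate from the predicted velocity. It is a simplified illustration under the stated definitions; `denoiser` is a placeholder for the transformer described next, and no loss weighting beyond plain squared error is shown.

```python
import math
import torch

def cosine_alpha(t: torch.Tensor) -> torch.Tensor:
    """Cosine noise schedule: alpha_t = cos(0.5 * pi * t)^2."""
    return torch.cos(0.5 * math.pi * t) ** 2

def v_prediction_loss(denoiser, x, t):
    """v-parameterization: the network predicts v = sqrt(a_t) eps - sqrt(1 - a_t) x.
    The clean-data estimate x_hat is recovered from (z_t, v_hat)."""
    alpha = cosine_alpha(t).view(-1, 1, 1)                    # (B, 1, 1) for (B, l, d_ae) latents
    eps = torch.randn_like(x)
    z_t = alpha.sqrt() * x + (1 - alpha).sqrt() * eps         # noisy latent
    v_target = alpha.sqrt() * eps - (1 - alpha).sqrt() * x    # regression target
    v_hat = denoiser(z_t, t)                                  # network predicts the velocity
    x_hat = alpha.sqrt() * z_t - (1 - alpha).sqrt() * v_hat   # implied clean-data estimate
    loss = ((v_hat - v_target) ** 2).mean()
    return loss, x_hat
```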
Denoising Network Architecture. Our denoising network, $\hat{x}_\theta(z_t, t)$, is a pre-Layer Norm transformer [69, 70] with 12 layers and a dimension of 768. We utilize learnable absolute positional encodings and GeGLU activations [59]. Bao et al. [4] adapted transformers to image diffusion and found that dense connections [28] between early and late layers are beneficial due to the dense nature of the denoising objective. We adopt this modification to improve the suitability of the transformer for diffusion. The autoencoder latent is projected to the transformer dimension, processed by the transformer, and then projected back to the dimensionality of the autoencoder latent to obtain the final prediction. Following prior work [6, 56, 7], we utilize $\alpha$-conditioning to condition the model on the level of noise. We map $\alpha_t$ to a sinusoidal positional embedding [69] and pass it through an MLP with a single hidden layer to obtain a time embedding. We add this time embedding to the input sequence and apply adaptive layer normalization [46], conditioned on the time embedding, to the output of every feedforward layer.

Self-Conditioning. We utilize the self-conditioning technique introduced by Chen et al. [8], which has been shown to improve the quality of diffusion models [8, 67]. The denoising network is typically conditioned on the latent variable and the current timestep as $\tilde{x}_t = \hat{x}_\theta(z_t, t)$. Self-conditioning proposes to condition the network on its estimate of the data from the previous timestep, $s > t$, to improve the prediction at the current timestep: $\tilde{x}_t = \hat{x}_\theta(z_t, t, \tilde{x}_s)$. During inference, the sampling procedure is inherently iterative, and at time $t$ we have already computed the output of the denoising network for the previous step. Therefore, self-conditioning does not require any additional applications of the network. We must, however, modify the training procedure so that the denoising network learns to utilize the estimate of the data, and we must define the inference behavior for the first timestep. For each training step, we sample some time $t \sim U([0, 1])$ as before. With probability $p$, we do not provide any estimate of the data for self-conditioning, denoted $\tilde{x}_{t, \emptyset} = \hat{x}_\theta(z_t, t, \emptyset)$. With probability $1 - p$, however, we mimic the inference behavior by first computing $\tilde{x}_{t, \emptyset} = \hat{x}_\theta(z_t, t, \emptyset)$ and then computing an additional estimate $\tilde{x}_t = \hat{x}_\theta(z_t, t, \mathrm{sg}(\tilde{x}_{t, \emptyset}))$, where $\mathrm{sg}(\cdot)$ is the stop-gradient operation. This second estimate is then used to compute the loss. We follow Chen et al. [8] and set $p = 0.5$. This training procedure also maintains the capacity for inference without self-conditioning, which is utilized to generate the first estimate during sampling. We condition on the previous estimate by concatenating it with the noisy latent along the feature dimension. When the previous estimate is not provided, we concatenate a learnable embedding with the noisy latent.

Class-Conditional Diffusion. For class-conditional diffusion, we have some dataset where each natural language utterance is associated with one of $C$ class labels representing, for example, the topic of the text. We condition the denoising network on the class label, $y$, during training: $\tilde{x}_t = \hat{x}_\theta(z_t, t, y)$. We replace the ground-truth class label, $y_i$, with a null label, $y_\emptyset$, with probability $p = 0.1$ to maintain the capacity for unconditional generation. At inference time, we can choose some class $y$ to guide the sampling process to generate text from the specified class. We condition on class labels by introducing learnable embeddings for all labels, including the null label, and adding the label embedding to the time embedding.
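The self-conditioning training procedure described above amounts to an occasional second forward pass with a detached first estimate; a minimal sketch is below. It assumes a `denoiser(z_t, t, x_prev)` signature that accepts `x_prev=None` when no previous estimate is provided (the paper instead concatenates a learnable embedding); these names are illustrative.

```python
import torch

def self_conditioning_step(denoiser, z_t, t, p_drop: float = 0.5):
    """Self-conditioning training logic (Chen et al.): with probability p_drop the
    network sees no previous estimate; otherwise a first pass produces an estimate
    that is fed back with stop-gradient for the pass used to compute the loss."""
    if torch.rand(()) < p_drop:
        # No self-conditioning: the model must also work without a previous estimate,
        # which is exactly the situation at the first sampling step.
        return denoiser(z_t, t, None)
    with torch.no_grad():
        x_first = denoiser(z_t, t, None)       # mimic inference: first estimate
    # Second pass conditions on the detached first estimate (stop-gradient).
    return denoiser(z_t, t, x_first.detach())
```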
Sequence-to-Sequence Diffusion. Given some seq2seq dataset consisting of source-target language pairs $(w^{src}, w^{trg}) \in \mathcal{D}$, we condition our denoising network on the source sequence and generate the target latent $x^{trg} = f_\phi(E(w^{trg}))$. For news summarization, for instance, we generate a latent representation of the summary by conditioning the network on the article text. To achieve this, we introduce a cross-attention layer after every self-attention layer in the denoising network that attends to features from a frozen language encoder. In general, we can incorporate any language encoder, $E_{src}(\cdot)$, to extract features from the source text. By default, we use the same pretrained encoder used for our language autoencoder. For our machine translation experiments, we condition our latent diffusion models on representations from a frozen MT5-XL encoder, which we found to be more effective than MT5-base representations. Therefore, given a sample from our seq2seq dataset, $(w^{src}, w^{trg}) \in \mathcal{D}$, we can compute $x^{trg} = f_\phi(E(w^{trg}))$ and use a modified seq2seq diffusion objective

$$\mathcal{L}(\theta) = \mathbb{E}_{t, (w^{src}, w^{trg}), \epsilon}\left[\lambda_t \left\lVert \hat{x}_\theta\!\left(\sqrt{\alpha_t}\, x^{trg} + \sqrt{1 - \alpha_t}\, \epsilon,\; t,\; E_{src}(w^{src})\right) - x^{trg} \right\rVert_2^2\right].$$

We also utilize classifier-free guidance [21] to improve sample quality. We jointly learn an unconditional network, $\hat{x}_\theta(z_t, t)$, and a conditional network, $\hat{x}_\theta(z_t, t, E_{src}(w^{src}))$, by dropping the conditioning information with probability $p = 0.1$ during training. When we drop the conditioning information, we cross-attend to a learnable embedding instead of the embedded source text. During sampling, we use guidance weight $w$ and compute the prediction as $\tilde{x}_t = w\, \hat{x}_\theta(z_t, t, E_{src}(w^{src})) + (1 - w)\, \hat{x}_\theta(z_t, t)$. Setting $w = 1.0$ corresponds to the conditional diffusion model, while setting $w > 1.0$ strengthens the influence of the conditioning information. We use $w = 2.0$ for the seq2seq tasks and ablate this choice in section 5.

We can also generate multiple outputs $S$ for each input by sampling different latents $z_1 \sim N(0, I)$. We then select the most promising candidate with Minimum Bayes Risk (MBR) decoding [15, 34]. In MBR decoding, we define a loss function $\mathcal{L}$, such as the negative Rouge score, and use it to select a candidate $w^{MBR} = \arg\min_{w \in S} \frac{1}{|S|} \sum_{w' \in S} \mathcal{L}(w, w')$. In our experiments, we use $|S| = 5$ and denote the results from using MBR decoding as MBR-5. We also report results using the ground truth to select the best candidate, $w^{oracle} = \arg\min_{w \in S} \mathcal{L}(w, w^{trg})$, to provide an upper bound on the performance of our method given optimal sample selection. Because this requires knowledge of the ground-truth target text, we refer to this as Oracle sampling.
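The two inference-time ingredients just described, classifier-free guidance and MBR candidate selection, are small pieces of logic; the sketch below illustrates both. The `denoiser` signature and the unigram-F1 stand-in for the negative Rouge loss are assumptions for illustration, not the paper's exact implementation.

```python
def guided_prediction(denoiser, z_t, t, src_feats, guidance_weight=2.0):
    """Classifier-free guidance: w * x_hat(z_t, t, E_src(w_src)) + (1 - w) * x_hat(z_t, t).
    Passing cond=None plays the role of the jointly learned unconditional network."""
    cond = denoiser(z_t, t, src_feats)
    uncond = denoiser(z_t, t, None)
    return guidance_weight * cond + (1.0 - guidance_weight) * uncond

def unigram_f1(a: str, b: str) -> float:
    """Stand-in similarity for MBR (the paper uses Rouge): unigram-set overlap F1."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb or not (ta & tb):
        return 0.0
    p, r = len(ta & tb) / len(ta), len(ta & tb) / len(tb)
    return 2 * p * r / (p + r)

def mbr_select(candidates):
    """Minimum Bayes Risk decoding: choose the candidate with the lowest average loss
    (here, negative similarity) against all sampled candidates."""
    def risk(w):
        return sum(-unigram_f1(w, other) for other in candidates) / len(candidates)
    return min(candidates, key=risk)
```

With five sampled candidates, `mbr_select` corresponds to the MBR-5 setting; scoring each candidate against the ground-truth target instead gives the Oracle upper bound.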
We evaluate LD4LG on a variety of natural language datasets. ROCStories [42] is a corpus of 98k five-sentence commonsense stories that capture causal and temporal relations. The AG News Topic Classification [60] dataset consists of news articles across four topics (World, Sports, Business, and Sci/Tech), with article titles and descriptions from 120k training instances. We focus on generating the article descriptions in this work. The XSum [44] dataset consists of BBC articles from 2010 to 2017 covering a wide range of topics (e.g., News, Politics, Sports, etc.). The training split has 204k instances, and each example contains a document and a summary. The QQP [9] dataset consists of 400k question pairs, where each example is two similar questions and a binary value indicating whether the two questions have the same meaning. The WMT 2014 English-German [5] dataset is a widely used machine translation dataset consisting of roughly 4.5 million sentence pairs. We present detailed dataset statistics in the appendix.

4.1 Evaluation Metrics

We use MAUVE score [47] and Perplexity (Ppl) to evaluate the quality of our generated text. MAUVE is a metric for open-ended text generation that compares the distribution of generated text with that of reference text using divergence frontiers. We follow Pillutla et al. [47] and use the GPT-2-Large model [49] to embed the text. Perplexity measures how likely the generated samples are according to an autoregressive language model; we use GPT-2-Large to compute perplexity. We also want to quantify the Diversity (Div) of generations. We define diversity as

$$\mathrm{Div} = \prod_{n=2}^{4} \frac{|\text{unique } n\text{-grams}(\{w_i\})|}{|\text{total } n\text{-grams}(\{w_i\})|}$$

where $\{w_i\}$ is a set of generated samples [68]. The metrics discussed so far can be optimized by generating samples from the training set. We therefore measure the proportion of generated 4-grams that are found in the training set to quantify the degree of Memorization (Mem). To evaluate performance on monolingual seq2seq language generation tasks, we utilize Rouge [37] and BERTScore [75]. Rouge-1/2 measure the number of unigrams/bigrams in the reference that appear in the generated text, and Rouge-L measures the longest common subsequence between the texts. BERTScore uses contextual embeddings from a pretrained language model to measure the similarity between texts. We follow prior work and use the microsoft/deberta-xlarge-mnli model [18] to extract contextual embeddings. For our machine translation experiments, we report SacreBLEU scores [5] to ensure fair comparison with prior work.

For our unconditional and class-conditional language generation experiments, we sample 1000 instances from the diffusion model. For the MAUVE reference text, we sample 1000 instances from the test set. We repeat this 5 times and report the mean and standard deviation as mean±stdev. We also compute reference values for our metrics with natural samples from the test set. The reference MAUVE, for instance, is computed between 1000 train and 1000 test samples. Qualitative samples from our models are in the supplemental materials.
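The Div and Mem definitions above reduce to simple n-gram counting; the following is a minimal sketch using whitespace tokenization, which is an assumption for illustration (the exact tokenization behind the reported numbers may differ).

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def diversity(samples):
    """Div = prod_{n=2..4} (# unique n-grams / # total n-grams) over a set of samples."""
    div = 1.0
    for n in range(2, 5):
        grams = [g for text in samples for g in ngrams(text.split(), n)]
        if grams:
            div *= len(set(grams)) / len(grams)
    return div

def memorization(samples, train_texts):
    """Mem = fraction of generated 4-grams that also appear in the training set."""
    train_4grams = set(g for text in train_texts for g in ngrams(text.split(), 4))
    gen_4grams = [g for text in samples for g in ngrams(text.split(), 4)]
    if not gen_4grams:
        return 0.0
    return sum(g in train_4grams for g in gen_4grams) / len(gen_4grams)
```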
5 Experiments

5.1 Language Autoencoder

We evaluate the effectiveness of our proposed language autoencoder using held-out examples from our datasets. As a point of comparison, we also evaluate the default behavior of the language models that we use to develop the language autoencoders. A consequence of BART's particular denoising objective is that the pretrained model already generates a copy of the input language, although this is not true of other models such as T5 or FLAN-T5. We present results for our two most complex datasets, ROCStories and AG News, in Table 1, and present the results for XSum, QQP, and WMT14-En-De, which show similar trends, in the appendix. We observe that our BART-base autoencoder is able to compress the feature space by a factor of 24 while improving the fidelity of the reconstructions. Our autoencoding modules are also effective at converting the pretrained FLAN-T5 into a language autoencoder, even though that is different from the model's default behavior. Across both models and all datasets, our language autoencoders are able to achieve near-perfect reconstruction with a low-dimensional latent space.

Table 1: Effectiveness of Language Autoencoder

| Method | Latent Dimensions | Hidden Units | ROCStories Rouge-1/2/L | ROCStories BLEU | AG News Rouge-1/2/L | AG News BLEU |
|---|---|---|---|---|---|---|
| BART-Base | L × 768 | 49,152 | 98.9/98.2/98.8 | 97.5 | 99.6/99.4/99.6 | 98.6 |
| BART-Base Autoencoder | 32 × 64 | 2048 | 99.2/98.5/99.2 | 97.6 | 99.7/99.4/99.7 | 98.8 |
| FLAN-T5-Base | L × 768 | 49,152 | 21.5/11.8/19.4 | 0.7 | 63.6/53.0/59.6 | 42.3 |
| FLAN-T5-Base Autoencoder | 32 × 64 | 2048 | 98.4/96.9/98.4 | 95.8 | 99.1/98.3/99.1 | 96.8 |

5.2 Unconditional Language Generation

Baselines. We evaluate our approach's capacity for unconditional language generation with the ROCStories and AG News datasets. We compare against the recently proposed Diffusion-LM model [36]. We also fine-tune the pretrained GPT-2-Medium model, which is roughly 1.6× larger than our denoising network, as a strong autoregressive baseline [49]. For sampling from GPT-2, we prompt it with a BOS token and utilize nucleus sampling (p = 0.95) [24]. We explore different sampling configurations in the appendix and find that they lead to similar conclusions.

Results. We present this comparison in Table 2. We observe that our approach is significantly more effective than Diffusion-LM at modeling language distributions, as demonstrated by the higher MAUVE scores, while requiring fewer sampling steps. Diffusion-LM is unable to model diverse language distributions and exhibits poor diversity. Utilizing high-quality latent spaces from pretrained language models improves the effectiveness of our diffusion model. We observe that both language models are highly effective for the AG News dataset, but using BART-base leads to a stronger MAUVE score for the ROCStories dataset. Across both datasets, FLAN-T5-base produces more diverse generations and exhibits less memorization. While GPT-2 generally achieves strong language generation metrics, it is more susceptible to memorization than LD4LG. For the AG News dataset, GPT-2 exhibits significant memorization and a lower MAUVE score. We do find that GPT-2 samples have lower perplexity. However, measuring perplexity with a pretrained GPT-2 model likely biases the metric towards the fine-tuned GPT-2 model. Moreover, MAUVE scores have a stronger correlation with human judgments of quality [47].

Table 2: Unconditional Language Generation Evaluation. The fine-tuned language model is presented in gray. Values are mean ± standard deviation over 5 runs.

| Method | Timesteps | ROCStories MAUVE | ROCStories Ppl | ROCStories Div | ROCStories Mem | AG News MAUVE | AG News Ppl | AG News Div | AG News Mem |
|---|---|---|---|---|---|---|---|---|---|
| Reference | - | .951±.007 | 21.1±.3 | .414±.003 | .362±.003 | .951±.014 | 43.6±1.2 | .658±.002 | .385±.005 |
| Diffusion-LM [36] | 2000 | .043±.006 | 47.3±.6 | .128±.002 | .434±.002 | .012±.001 | 67.1±1.2 | .043±.002 | .086±.006 |
| LD4LG (BART-base) | 250 | .716±.019 | 30.6±.5 | .331±.005 | .441±.004 | .866±.016 | 100.6±2.9 | .540±.006 | .293±.001 |
| LD4LG (FLAN-T5-base) | 250 | .481±.007 | 37.5±.4 | .389±.002 | .387±.002 | .859±.020 | 122.0±3.9 | .624±.008 | .221±.003 |
| GPT-2-Medium | - | .788±.025 | 20.0±.2 | .372±.002 | .688±.006 | .820±.012 | 37.3±1.1 | .532±.017 | .829±.005 |

Benefits of Compression. Because the pretrained BART model already copies the input text, we can ablate the impact of learning a compact latent space by learning a diffusion model directly in the encoder feature space. One complication of this setting is that the sequence length of the BART features varies. During training, the sequence length is simply determined by the sample. During generation, however, we must specify the length. To determine the sequence length for generation, we opt to sample a length from the empirical distribution of lengths in the training set. We refer to this baseline as BART-Diffusion and outline full implementation details in the appendix. We compare BART-Diffusion with our proposed approach in Table 3. We quantify the speedup by measuring how long it takes each approach to match the peak validation MAUVE of BART-Diffusion.
We observe that learning a compact latent space is beneficial both in terms of absolute performance and wall-clock time, reaching the peak MAUVE of BART-diffusion in a quarter of the time. Compressing the latent space along the sequence dimension significantly reduces the overhead per iteration due to the quadratic cost of self-attention, and we also observe faster convergence. Self-conditioning. We ablate the impact of self-conditioning in Table 4. We find that it significantly improves the MAUVE score and the perplexity of the generated text, but sacrifices some diversity. Table 3: Benefits of Compression (ROCStories) Hidden Units Relative Speedup MAUVE Ppl Div Mem BART-Diffusion 49,152 1.0 .605.024 46.8.7 .424.004 .304.003 LD4LG (BART-base) 2048 3.86 .716.019 30.6.5 .331.005 .441.004 Table 4: Impact of Self-Conditioning (ROCStories) MAUVE Ppl Div Mem LD4LG (BART-base) .716.019 30.6.5 .331.005 .441.004 - Self-cond. .480.018 79.31.0 .427.004 .299.003 Table 5: Metrics for class-conditional generation. LD4LG (BART-base) LD4LG (FLAN-T5-base) Conditioning MAUVE Mem MAUVE Mem World Sports Business Sci/Tech World Sports Business Sci/Tech World .842.017 .015.002 .026.002 .020.002 .296.002 .809.024 .013.001 .025.002 .022.002 .233.005 Sports .013.001 .845.024 .011.001 .010.000 .305.003 .011.001 .836.020 .009.000 .009.000 .249.004 Business .024.002 .011.001 .752.030 .068.005 .363.009 .025.003 .011.001 .765.016 .076.008 .244.004 Sci/Tech .023.002 .012.001 .082.008 .813.028 .225.004 .024.001 .011.001 .082.010 .843.033 .169.004 Conditional GPT-2 Reference Conditioning MAUVE Mem MAUVE Mem World Sports Business Sci/Tech World Sports Business Sci/Tech Comparisons World .805.022 .012.000 .025.002 .021.002 .402.002 .963.009 .018.001 .034.002 .032.003 .388.007 Sports .017.001 .840.019 .012.001 .013.001 .369.004 .018.001 .958.007 .014.001 .014.002 .346.002 Business .037.003 .012.001 .629.029 .069.007 .479.007 .040.005 .014.001 .968.009 .125.009 .441.003 Sci/Tech .033.002 .013.001 .102.015 .697.027 .434.004 .036.003 .016.001 .133.013 .955.011 .366.003 5.3 Class-Conditonal Language Generation Baselines. Conditional training with control tokens is one of the most widely used methods for controlling autoregressive models [14, 29, 40, 33]. We prepend the class label to each sample as a control token and fine-tune GPT-2-medium for class-conditional generation. Because memorizing the training instances associated with each class is a trivial solution, we terminate training when the model s memorization exceeds the reference values. Results. We evaluate the effectiveness of class-conditioning with the AG News topic classification dataset. We sample instances for each class and compute the MAUVE scores between natural instances from each class. We report these metrics in Table 5. We observe that the MAUVE scores are highest when the conditioning and ground-truth labels are aligned across all methods, demonstrating that the label guides the generation effectively. We observe that our approach is more consistently effective at class-conditional generation, particularly for the two most similar classes, business and sci/tech. The GPT-2 baseline is again more susceptible to memorization than our approach. 5.4 Sequence-to-Sequence Language Generation Baselines. We compare against directly fine-tuning BART-base and FLAN-T5-base on the XSum summarization and QQP paraphrasing datasets. 
For diffusion baselines, we compare against the following continuous diffusion models learned in the space of word embeddings: Diffu Seq [16], CDCD [13], DINOISER [73], and GENIE [38]. We also compare against the following discrete diffusion models which learn to invert discrete corruption processes (e.g. masking): Reparameterized Discrete Diffusion (RDM) [76] and Diffusion BERT [19]. We compare directly against the metrics reported in prior work on our datasets. For XSum, we additionally train a Diffu Seq model using the official implementation. We note that Gong et al. [16] typically train their models much longer than ours. The XSum Diffu Seq model, for instance, is trained for over 3 more epochs than our approach. For machine translation, we compare directly against the prior work that reported Sacre BLEU scores to ensure a fair comparison [48]. Table 6: Seq2Seq Evaluation on QQP. Results from fine-tuned language models are in gray. Method Sampling Rouge-1/2/L BERTScore Diffu Seq [16] Random 55.2/29.2/52.7 82.4 RDM-absorbing [76] Random / /57.9 83.7 RDM-multinomial [76] Random / /57.3 83.7 LD4LG (BART-base) Random 62.6/39.0/60.3 85.8 LD4LG (FLAN-T5-Base) Random 62.1/38.4/59.7 85.8 Diffu Seq [16] MBR-10 / /58.8 83.7 RDM-absorbing [76] MBR-10 / /59.5 84.7 RDM-multinomial [76] MBR-10 / /58.5 84.7 Diffusion BERT [19] MBR-10 / /58.9 LD4LG (BART-base) MBR-5 63.3/40.3/61.1 86.2 LD4LG (FLAN-T5-Base) MBR-5 63.0/39.7/60.7 86.1 Diffu Seq [16] Oracle-5 67.4/43.9/65.8 83.7 LD4LG (BART-base) Oracle-5 68.0/46.6/66.0 87.2 LD4LG (FLAN-T5-Base) Oracle-5 67.8/46.0/65.7 87.2 BART-Base Nucleus 51.5/28.1/48.3 79.9 FLAN-T5-Base Nucleus 55.0/30.1/52.3 83.2 BART-Base Beam 61.9/39.0/59.5 85.5 FLAN-T5-Base Beam 63.0/40.1/60.5 86.2 Table 7: Seq2Seq Evaluation on XSum. Results from fine-tuned language models are in gray. Method Sampling Rouge-1/2/L BERTScore Diffu Seq [16] Random 18.9/1.3/13.6 46.8 LD4LG (BART-base) Random 37.6/15.5/30.8 74.1 LD4LG (FLAN-T5-Base) Random 38.1/15.9/31.2 74.8 Diffu Seq [16] MBR-5 19.3/1.7/14.1 46.9 LD4LG (BART-base) MBR-5 38.2/16.2/31.5 74.5 LD4LG (FLAN-T5-Base) MBR-5 38.7/16.6/31.9 75.2 Diffu Seq [16] Oracle-5 23.5/2.3/18.6 47.9 GENIE [38] Oracle-5 37.3/15.3/29.4 GENIE w/ pre-training [38] Oracle-5 41.2/19.1/33.4 LD4LG (BART-base) Oracle-5 42.4/19.4/36.4 75.3 LD4LG (FLAN-T5-Base) Oracle-5 43.0/20.0/37.2 76.1 BART-Base Nucleus 35.1/13.3/27.7 73.1 FLAN-T5-Base Nucleus 34.6/12.9/27.2 72.7 BART-Base Beam 39.9/18.0/32.6 75.6 FLAN-T5-Base Beam 39.7/17.7/32.3 75.3 Table 8: Machine translation results on WMT14-En-De. Baseline results are from [13, 73]. Method Sampling Sacre BLEU En De De En CDCD [13] Random 19.3 24.9 LD4LG (MT5-base) Random 21.4 26.2 Diffusion-LM [36] MBR-5 15.3 17.3 CDCD [13] MBR-10 19.7 25.4 DINOISER [73] MBR-5 24.3 28.8 LD4LG (MT5-base) MBR-5 22.4 27.0 Results. We present our comparison on QQP and XSum in Table 6 and Table 7. Our approach significantly outperforms recent diffusion language models across both datasets, especially for the more challenging XSum dataset. For instance, Diffu Seq is reasonably effective for QQP, but it struggles with XSum and fails to generate coherent text (see samples in appendix). Our method, on the other hand, is competitive with fine-tuning. LD4LG narrowly outperforms fine-tuning on QQP with MBR decoding, but the finetuned models are slightly more effective on the XSum dataset. Across both datasets, LD4LG with oracle sampling outperforms all approaches (including direct fine-tuning methods) with just 5 random samples. 
This demonstrates that LD4LG has good coverage, but MBR decoding does not consistently identify the best candidate. In our experiments, we use classifier-free guidance with guidance strength w = 2.0. We ablate this choice with validation samples in Figure 3 and observe that such guidance meaningfully improves performance. Figure 3: Ablation of classifier-free guidance on the XSum summarization benchmark. We report our machine translation results in Table 8. We observe that LD4LG outperforms the Diffusion-LM and CDCD baselines although it lags behind the DINOISER baseline. This demonstrates that our method can effectively take advantage of strong pre-trained multilingual language models for effective multilingual generation. 6 Future Work Our experiments demonstrate that latent language diffusion models can generate high-quality natural language in a variety of settings. In continuous domains, diffusion models are remarkably effective for applications ranging from image editing [41] to solving inverse problems [64]. We are excited to explore the potential applications enabled by effective language diffusion models. We expect that LD4LG is a natural fit for applications such as language editing (e.g. style transfer) and controllable generation (e.g. mitigating toxicity). Despite achieving good performance, LD4LG has important limitations. Sampling from diffusion models is slow due to the iterative generative process. LD4LG improves upon some prior continuous text diffusion models (that use 2000 steps) and only uses 250 sampling steps. However, speeding up the inference process of diffusion models is an active area of research and techniques developed for image diffusion can likely be adapted for LD4LG [65, 57]. Song et al. [65], for instance, distilled a trained image diffusion model to produce high-quality samples in a single step. We leave the extension of such techniques to language generation as future work. In subsection 5.4, we observe that diffusion models have excellent coverage but MBR decoding fails to identify the best candidate; developing improved sampling procedures or candidate re-ranking methods would likely improve performance for tasks such as summarization and machine translation. 7 Related Work Diffusion models. Diffusion models [61, 22] are a class of generative models that have led to impressive results in image synthesis, recently surpassing Generative Adversarial Networks [17, 12]. These models typically operate directly in pixel-space, learning a distribution over images. Rombach et al. [52] introduced latent diffusion for image synthesis and demonstrated that they can be learned in the latent space of a pretrained autoencoder. Latent diffusion has since been successful in other domains such as audio synthesis [32], symbolic music generation [58], and molecule generation [71]. Diffusion for Language. Prior work has focused on directly modeling discrete data by designing diffusion processes for discrete state spaces [25, 3, 26]. Li et al. [36] train a continuous diffusion model in the space of token embeddings that are learned jointly with the denoising objective and decode generations with a rounding step. Strudel et al. [67] scaled up this approach and instead learn the diffusion model in the space of pretrained word embeddings and find that low-dimensional embeddings are better suited for diffusion. Gong et al. 
[16] extend Diffusion-LM [36] to sequence-tosequence tasks by concatenating the source and target sequence and only performing diffusion for the target sequence. Chen et al. [8] map words to arbitrary binary strings, represented as a sequence of real numbers. They then train a continuous diffusion model and round the generated sequences to produce binary strings. The authors also introduce self-conditioning, which we adopt for our method. 8 Conclusion In this work, we demonstrate that latent diffusion is an effective paradigm for language generation. To achieve this, we introduce a method for compressing the high-dimensional, variable-length language representations from pre-trained language models into a compact, fixed-size latent representation that can be decoded into natural language. This compact latent representation is, by design, well-suited for learning continuous latent diffusion models. Our latent language diffusion models are effective for unconditional, class-conditional, and sequence-to-sequence language generation. They offer some benefits over fine-tuned auto-regressive language models and significantly outperform recent diffusion language models across a variety of datasets. Acknowledgements This research is supported by grants from the National Science Foundation NSF (IIS-2107161, and IIS-1724282, HDR-2118310). The Cornell Center for Materials Research with funding from the NSF MRSEC program (DMR-1719875), DARPA, ar Xiv, Linked In, and the New York Presbyterian Hospital. [1] Chi-square Distribution, pages 70 72. Springer New York, New York, NY, 2008. ISBN 9780-387-32833-1. doi: 10.1007/978-0-387-32833-1_54. URL https://doi.org/10.1007/ 978-0-387-32833-1_54. [2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35: 23716 23736, 2022. [3] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=h7-Xix PCAL. [4] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In CVPR, 2023. [5] Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ale s Tamchyna. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12 58, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics. URL http://www. aclweb.org/anthology/W/W14/W14-3302. [6] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Ns MLjc Fa O8O. [7] Ting Chen. On the importance of noise scheduling for diffusion models. ar Xiv preprint ar Xiv:2301.10972, 2023. [8] Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. ar Xiv preprint ar Xiv:2208.04202, 2022. 
[9] Zihang Chen, Hongbo Zhang, Xiaoji Zhang, and Leqi Zhao. Quora question pairs. 2017. [10] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. ar Xiv preprint ar Xiv:2210.11416, 2022. [11] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. ar Xiv preprint ar Xiv:2302.05442, 2023. [12] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 8780 8794. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/ 49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf. [13] Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, et al. Continuous diffusion for categorical data. ar Xiv preprint ar Xiv:2211.15089, 2022. [14] Jessica Ficler and Yoav Goldberg. Controlling linguistic style aspects in neural language generation. EMNLP 2017, page 94, 2017. [15] Vaibhava Goel and William J Byrne. Minimum bayes-risk automatic speech recognition. Computer Speech & Language, 14(2):115 135, 2000. ISSN 0885-2308. doi: https://doi.org/10. 1006/csla.2000.0138. URL https://www.sciencedirect.com/science/article/pii/ S0885230800901384. [16] Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=j Qj-_ r LVXsj. [17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings. neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf. [18] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=XPZIaotuts D. [19] Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, and Xipeng Qiu. Diffusion BERT: Improving generative masked language models with diffusion models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4521 4534, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.248. URL https: //aclanthology.org/2023.acl-long.248. [20] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). ar Xiv preprint ar Xiv:1606.08415, 2016. [21] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. ar Xiv preprint ar Xiv:2207.12598, 2022. [22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Co RR, abs/2006.11239, 2020. URL https://arxiv.org/abs/2006.11239. 
[23] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. ar Xiv preprint ar Xiv:2210.02303, 2022. [24] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryg GQyr Fv H. [25] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. Co RR, abs/2102.05379, 2021. URL https://arxiv.org/abs/2102.05379. [26] Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, and Tim Salimans. Autoregressive diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Lm8T39v LDTE. [27] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. ar Xiv preprint ar Xiv:2301.11093, 2023. [28] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700 4708, 2017. [29] Nitish Shirish Keskar, Bryan Mc Cann, Lav R Varshney, Caiming Xiong, and Richard Socher. Ctrl: A conditional transformer language model for controllable generation. ar Xiv preprint ar Xiv:1909.05858, 2019. [30] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696 21707, 2021. [31] Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. On density estimation with diffusion models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/ forum?id=2Ld Bqxc1Yv. [32] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=a-x FK8Ymz5J. [33] Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. ar Xiv preprint ar Xiv:2302.08582, 2023. [34] Shankar Kumar and William Byrne. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169 176, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics. URL https://aclanthology.org/N04-1022. [35] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, 2019. URL https://arxiv.org/abs/1910.13461. [36] Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto. Diffusion-lm improves controllable text generation, 2022. URL https://arxiv.org/abs/ 2205.14217. [37] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74 81, 2004. 
[38] Zhenghao Lin, Yeyun Gong, Yelong Shen, Tong Wu, Zhihao Fan, Chen Lin, Weizhu Chen, and Nan Duan. Text generation with diffusion language models: A pre-training approach with continuous paragraph denoise. 2023. [39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=Bkg6Ri Cq Y7. [40] Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. QUARK: Controllable text generation with reinforced unlearning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id= 5Ha Ids3ux5O. [41] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=a Bs Cjc Pu_t E. [42] Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839 849, 2016. [43] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797 1807, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1206. URL https://aclanthology.org/D18-1206. [44] Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. ar Xiv preprint ar Xiv:1808.08745, 2018. [45] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8162 8171. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr.press/v139/ nichol21a.html. [46] William Peebles and Saining Xie. Scalable diffusion models with transformers. ar Xiv preprint ar Xiv:2212.09748, 2022. [47] Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. Mauve: Measuring the gap between neural text and human text using divergence frontiers. Advances in Neural Information Processing Systems, 34:4816 4828, 2021. [48] Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186 191, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-6319. URL https:// aclanthology.org/W18-6319. [49] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. [50] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. 
The Journal of Machine Learning Research, 21(1):5485 5551, 2020. [51] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. Highresolution image synthesis with latent diffusion models, 2021. URL https://arxiv.org/ abs/2112.10752. [52] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. Highresolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684 10695, June 2022. [53] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1 10, 2022. [54] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=08Yk-n5l2Al. [55] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-toimage diffusion models with deep language understanding. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 36479 36494. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ ec795aeadae0b7d230fa35cbaf04c041-Paper-Conference.pdf. [56] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. [57] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=TId IXIpzho I. [58] Flavio Schneider, Zhijing Jin, and Bernhard Schölkopf. Mo\ˆ usai: Text-to-music generation with long-context latent diffusion. ar Xiv preprint ar Xiv:2301.11757, 2023. [59] Noam Shazeer. Glu variants improve transformer. ar Xiv preprint ar Xiv:2002.05202, 2020. [60] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631 1642, 2013. [61] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics, 2015. URL https://arxiv. org/abs/1503.03585. [62] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. Co RR, abs/2010.02502, 2020. URL https://arxiv.org/abs/2010.02502. [63] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019. [64] Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging with score-based generative models. 
In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=va RCHVj0u GI. [65] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models, 2023. [66] Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation. ar Xiv preprint ar Xiv:2211.04236, 2022. [67] Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, and Rémi Leblond. Selfconditioned embedding diffusion for text generation, 2022. URL https://arxiv.org/abs/ 2211.04236. [68] Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, and Nigel Collier. A contrastive framework for neural text generation. ar Xiv preprint ar Xiv:2202.06417, 2022. [69] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, page 6000 6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. [70] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10524 10533. PMLR, 13 18 Jul 2020. URL https://proceedings.mlr.press/ v119/xiong20b.html. [71] Minkai Xu, Alexander Powers, Ron Dror, Stefano Ermon, and Jure Leskovec. Geometric latent diffusion models for 3d molecule generation. ar Xiv preprint ar Xiv:2305.01140, 2023. [72] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. m T5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483 498, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41. [73] Jiasheng Ye, Zaixiang Zheng, Yu Bao, Lihua Qian, and Mingxuan Wang. Dinoiser: Diffused conditional sequence learning by manipulating noises. ar Xiv preprint ar Xiv:2302.10025, 2023. [74] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019. [75] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. ar Xiv preprint ar Xiv:1904.09675, 2019. [76] Lin Zheng, Jianbo Yuan, Lei Yu, and Lingpeng Kong. A reparameterized discrete diffusion model for text generation. ar Xiv preprint ar Xiv:2302.05737, 2023. A Diffusion Models We present a formal description of diffusion [22, 62, 30]. Diffusion models are latent variable models with latents z = {zt|t [0, 1]} that are given by the forward diffusion process q(z|x), with the data, x p(x), being drawn from an unknown distribution. The forward process is a Markovian process that iteratively adds Gaussian noise to the data over time q(zt|x) = N(zt; αtx, (1 αt)I), q(zt|zs) = N(zt; αt|szs, (1 αt|s)I) where αt|s = αt/αs and 0 s < t 1. 
The noise schedule, specified by $\alpha_t \in [0, 1]$, decreases with $t$ until the final latent becomes approximately Gaussian, $q(z_1) \approx \mathcal{N}(z_1; 0, I)$, independent of the original data. The forward process therefore defines a transition from the data distribution to a Gaussian distribution.

Given access to the original data $x$, the forward process can be inverted analytically. For $t > s$, we have
$$q(z_s \mid z_t, x) = \mathcal{N}\!\left(z_s;\ \mu_Q(z_t, x, s, t),\ \sigma_Q^2(s, t)\,I\right),$$
where
$$\mu_Q(z_t, x, s, t) = \frac{\sqrt{\alpha_s}\,(1 - \alpha_{t|s})}{1 - \alpha_t}\,x + \frac{\sqrt{\alpha_{t|s}}\,(1 - \alpha_s)}{1 - \alpha_t}\,z_t, \qquad \sigma_Q^2(s, t) = \frac{(1 - \alpha_s)(1 - \alpha_{t|s})}{1 - \alpha_t}.$$
We utilize this to define our generative process. Because $x$ is unavailable during generation, we train a neural network to approximate the original data given some noisy latent and the timestep, $\hat{x}_\theta(z_t, t) \approx x$. The denoising network is trained utilizing a regression loss
$$\mathcal{L}(\theta) = \mathbb{E}_{t, x, \epsilon}\!\left[\lambda_t \left\lVert \hat{x}_\theta\!\left(\sqrt{\alpha_t}\,x + \sqrt{1 - \alpha_t}\,\epsilon,\ t\right) - x \right\rVert_2^2\right]$$
with some time-dependent weighting $\lambda_t$. This loss function can be motivated as a weighted variational lower bound of the log-likelihood of the data under the forward diffusion process [22, 31].

In practice, the denoising network is often parameterized as an $\epsilon$-prediction network [22] or a $v$-prediction network [57], where the velocity, $v$, is defined as $v = \sqrt{\alpha_t}\,\epsilon - \sqrt{1 - \alpha_t}\,x$. These parameterizations can be interpreted as different weighting functions, $\lambda_t$, for the regression objective [57]. We adopt the $v$-parameterization throughout this work.

With a trained denoising network, we define our generative process as
$$p_\theta(z_s \mid z_t) = \mathcal{N}\!\left(z_s;\ \mu_\theta(z_t, s, t),\ \sigma^2(s, t)\,I\right), \qquad \mu_\theta(z_t, s, t) = \mu_Q(z_t, \hat{x}_\theta(z_t, t), s, t), \quad \sigma^2(s, t) = 1 - \alpha_{t|s}.$$
We therefore substitute our estimate of the clean data into the posterior distribution $q(z_s \mid z_t, x)$ to parameterize the mean of our generative process $p_\theta(z_s \mid z_t)$. We follow Ho et al. [22] and set the variance of $p_\theta(z_s \mid z_t)$ to $\sigma^2(s, t) = 1 - \alpha_{t|s}$, a choice given by the variance of the forward process.

For generation, we utilize the standard DDPM sampler, also known as the ancestral sampler [22]. We sample some initial noise $z_{t_1} = z_1 \sim \mathcal{N}(0, I)$ and iteratively apply the update rule
$$z_{t_{i+1}} = \mu_\theta(z_{t_i}, t_{i+1}, t_i) + \sigma(t_{i+1}, t_i)\,\epsilon,$$
where $\epsilon \sim \mathcal{N}(0, I)$ and the intermediate timesteps $1 = t_1 > t_2 > \dots > t_T = 0$ linearly interpolate between 1 and 0. We use $T = 250$ sampling timesteps by default.

B Additional Language Autoencoder Results

We present results for our language autoencoders on XSum, QQP, and WMT14 in Table 9. We observe that our proposed language autoencoders are similarly effective for these datasets. We also ablate the performance as we vary the dimensionality of the latent space in Table 10. We observe, as expected, that the reconstruction performance improves as the dimensionality of the latent space increases and degrades as we decrease the size of the latent representation. We found our default dimensionality of 32 × 64 to be generally effective for high-quality reconstructions across datasets.
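The reconstruction quality in Tables 9 and 10 is reported with ROUGE and BLEU. As a minimal sketch (not the exact evaluation script used in this work), such reconstruction metrics can be computed with the Hugging Face evaluate library noted in Appendix E.8; the example strings below are placeholders.

```python
import evaluate

# Hypothetical autoencoder round-trip reconstructions and their references.
predictions = ["the cat sat on the mat", "diffusion models generate text"]
references = ["the cat sat on the mat", "diffusion models can generate text"]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
print(bleu_scores["bleu"])
```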
Table 9: Effectiveness of the language autoencoders.
Method | Latent Dimensions | Hidden Units | XSum Rouge-1/2/L | XSum BLEU | QQP Rouge-1/2/L | QQP BLEU
BART-Base | L × 768 | 49,152 | 99.9/99.9/99.9 | 99.9 | 99.9/99.9/99.9 | 99.8
BART-Base Autoencoder | 32 × 64 | 2048 | 99.8/99.6/99.8 | 99.3 | 99.9/99.8/99.9 | 99.1
FLAN-T5-Base | L × 768 | 49,152 | 65.7/51.2/59.9 | 45.7 | 26.9/14.1/24.3 | 10.8
FLAN-T5-Base Autoencoder | 32 × 64 | 2048 | 99.6/99.3/99.6 | 98.8 | 99.8/99.5/99.7 | 98.5

Method | Latent Dimensions | Hidden Units | WMT14 English Rouge-1/2/L | WMT14 English BLEU | WMT14 German Rouge-1/2/L | WMT14 German BLEU
MT5-Base Autoencoder | 32 × 64 | 2048 | 99.7/99.2/99.7 | 99.2 | 99.8/99.4/99.8 | 99.1

Table 10: Ablation of autoencoder latent dimensionality.
Method | Latent Dimensions | Hidden Units | ROCStories Rouge-L | ROCStories BLEU
BART-Base | L × 768 | 49,152 | 98.8 | 97.5
BART-Base Autoencoder | 32 × 32 | 1024 | 97.0 | 92.4
BART-Base Autoencoder | 32 × 64 | 2048 | 99.2 | 97.6
BART-Base Autoencoder | 64 × 64 | 4096 | 99.2 | 97.7

C Impact of Sampling Steps

We present the results from different sampling configurations for the ROCStories dataset in Table 11. We also report the wall clock time needed to generate the 1000 samples across the different numbers of sampling timesteps while batching the generations with a batch size of 128. We find that the number of sampling steps introduces a tradeoff between the diversity and the quality of the text, with more sampling steps leading to more fluent but less diverse text and fewer sampling steps leading to less fluent but more diverse text. When using BART-base, the MAUVE score is maximized when utilizing only 100-250 steps, demonstrating that this range achieves a reasonable balance between diversity and quality. When utilizing FLAN-T5-base, on the other hand, we find that the MAUVE score improves monotonically with increased sampling steps. This suggests that the latent distribution of the FLAN-T5-base autoencoder may be more challenging to learn. Increasing the capacity of the denoising network or the language autoencoder may therefore be beneficial when using FLAN-T5-base. We observe that the sampling time scales with the number of sampling steps as expected, although there is also a fixed cost from the reconstruction network and the autoregressive decoder that is independent of the number of sampling steps.

Table 11: Evaluation of different sampling configurations. We use 250 steps by default.
Model | Sampling Steps | MAUVE | Ppl | Div | Mem | Wall Clock Time (1000 samples)
Reference | - | .951±.007 | 21.1±.3 | .414±.003 | .362±.003 | -
LD4LG (BART-base) | 50 | .684±.031 | 52.6±.3 | .407±.004 | .337±.003 | 1m27s
LD4LG (BART-base) | 100 | .719±.022 | 38.5±.8 | .368±.002 | .392±.001 | 1m55s
LD4LG (BART-base) | 250 | .716±.019 | 30.6±.5 | .331±.005 | .441±.004 | 3m20s
LD4LG (BART-base) | 500 | .704±.033 | 28.1±.3 | .313±.003 | .462±.003 | 5m44s
LD4LG (BART-base) | 1000 | .667±.026 | 25.9±.1 | .295±.002 | .481±.004 | 10m30s
LD4LG (FLAN-T5-base) | 50 | .331±.028 | 67.9±.7 | .456±.001 | .283±.001 | 1m34s
LD4LG (FLAN-T5-base) | 100 | .421±.012 | 48.7±.7 | .423±.002 | .334±.002 | 2m02s
LD4LG (FLAN-T5-base) | 250 | .481±.007 | 37.5±.4 | .389±.002 | .387±.002 | 3m29s
LD4LG (FLAN-T5-base) | 500 | .495±.024 | 32.8±.6 | .370±.006 | .413±.006 | 5m51s
LD4LG (FLAN-T5-base) | 1000 | .522±.023 | 30.6±.3 | .360±.004 | .432±.005 | 10m38s

Table 12: Evaluation of different nucleus sampling configurations.
Model | Sampling Parameter (p) | MAUVE | Ppl | Div | Mem
GPT-2-Medium | .90 | .762±.027 | 19.6±.3 | .362±.008 | .718±.006
GPT-2-Medium | .95 | .788±.025 | 20.0±.2 | .372±.002 | .688±.006
GPT-2-Medium | .98 | .782±.020 | 20.2±.3 | .378±.002 | .666±.008
GPT-2-Medium | 1.00 | .793±.024 | 20.5±.4 | .385±.004 | .637±.006

D GPT-2 Sampling Ablation

We report an ablation of the nucleus sampling parameter, p, in Table 12. The memorization does exhibit some sensitivity to the nucleus sampling parameter, but it remains consistently higher than that of the LD4LG models across all sampling configurations.

E Implementation Details

All of the models presented in this work are trained on a single Nvidia A6000, except for the DiffuSeq XSum baseline, which was trained with two Nvidia A6000s.
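As a point of reference for the nucleus sampling ablation in Appendix D, the following is a minimal sketch of such a sampling configuration with the Hugging Face transformers API, using the settings described later in Appendix E (BOS prompt, p = 0.95, repetition penalty 1.2, no duplicate trigrams). It is an illustration rather than the exact script used for the baseline, and it loads the pretrained GPT-2-Medium weights rather than the fine-tuned checkpoint.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load GPT-2-Medium purely for illustration (the fine-tuned baseline is not loaded here).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()

# Prompt with the BOS token and sample with nucleus sampling.
bos = torch.tensor([[tokenizer.bos_token_id]])
with torch.no_grad():
    output = model.generate(
        bos,
        do_sample=True,
        top_p=0.95,              # nucleus sampling parameter p
        repetition_penalty=1.2,  # same penalty used for the LD4LG decoders
        no_repeat_ngram_size=3,  # prevent duplicate trigrams
        max_length=64,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```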
E.1 Language Autoencoders

We adopt the pre-LN design [70] for both the compression and reconstruction networks and therefore apply layer normalization before all attention and feedforward blocks. We also adopt query-key normalization [11] and apply RMSNorm [74] to the queries and keys before computing the dot-product similarities in the attention mechanism. We found that this enabled training with a larger learning rate, which accelerated training.

We present the hyperparameters for our language autoencoders across all datasets in this work in Table 13. We also report additional details such as the number of trainable parameters and training time. The training time is similar across datasets because we use the same hyperparameters, so we simply report the training times for the ROCStories dataset for the monolingual models. For the MT5-base autoencoder, we report the training time for the German autoencoder, which is similar to that of the English autoencoder. We note that our implementation is not optimized for runtime and that pre-computing and caching the language encoder representations would significantly accelerate training.

E.2 Latent Diffusion For Language Generation

We present the training details across the different datasets in Table 14. We tuned hyperparameters using the validation MAUVE scores for the ROCStories dataset and found that they generally transferred well across datasets. We therefore used the same hyperparameters across datasets, except that we utilized the L1 loss instead of the L2 loss for the Seq2Seq tasks. Consistent with prior work on image-to-image diffusion models [53], we observed that the L1 loss improved the fidelity of the generations at the cost of sacrificing some diversity. This improved fidelity translated to improvements in our metrics of interest, although the L2 loss may still be desirable for settings where diversity is of greater importance.

For the unconditional and class-conditional language models, we did not observe overfitting to be a problem and simply use the final checkpoint for evaluation. For the monolingual Seq2Seq tasks, we utilize the checkpoint with the best validation ROUGE-L. For machine translation, we utilize the checkpoint with the best validation SacreBLEU.

For the machine translation experiments, we observed benefits from rescaling the noise schedule to emphasize training at higher levels of noise. This idea was introduced by Hoogeboom et al. [27] and Chen [7] to improve high-resolution image diffusion models. Both Hoogeboom et al. [27] and Chen [7] shift an existing noise schedule by some scale factor, $s$, to increase the time spent at higher noise levels. Given a noise schedule $\alpha_t$ with SNR $\lambda_t = \frac{\alpha_t^2}{1 - \alpha_t^2}$, the shifted noise schedule, $\alpha_{t,s} \in [0, 1]$, is defined so that
$$\frac{\alpha_{t,s}^2}{1 - \alpha_{t,s}^2} = \lambda_{t,s} = \lambda_t \cdot s^2 = \frac{\alpha_t^2}{1 - \alpha_t^2} \cdot s^2.$$
Given $\alpha_t$ and the scale factor $s$, the scaled noise schedule $\alpha_{t,s}$ can be computed in closed form. Using the relationship $\alpha_t^2 = \mathrm{sigmoid}(\log(\lambda_t))$ (see Kingma et al. [30]), the new noise schedule can be computed as
$$\alpha_{t,s}^2 = \mathrm{sigmoid}(\log(\lambda_{t,s})) = \mathrm{sigmoid}(\log(\lambda_t \cdot s^2)) = \mathrm{sigmoid}(\log(\lambda_t) + 2\log(s)).$$
We employ a shifted cosine noise schedule with $s = 0.1$ for machine translation. Past work on text diffusion for machine translation observed that training at higher levels of noise improves the model's utilization of the conditioning information (i.e., the source sentence) [73].
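As an illustration of the closed-form shift above, the following minimal sketch (our own illustration, not the released code) computes a shifted cosine schedule from its log-SNR; the function names are placeholders.

```python
import torch

def cosine_log_snr(t: torch.Tensor) -> torch.Tensor:
    # For the cosine schedule, alpha_t^2 = cos(pi*t/2)^2 and sigma_t^2 = 1 - alpha_t^2,
    # so log SNR = log(alpha_t^2 / sigma_t^2) = -2 * log(tan(pi * t / 2)).
    return -2.0 * torch.log(torch.tan(0.5 * torch.pi * t))

def shifted_alpha_squared(t: torch.Tensor, s: float = 0.1) -> torch.Tensor:
    """alpha_{t,s}^2 = sigmoid(log(lambda_t) + 2 * log(s))."""
    return torch.sigmoid(cosine_log_snr(t) + 2.0 * torch.log(torch.tensor(s)))

# Example: with s = 0.1 the schedule spends more time at high noise levels.
t = torch.linspace(0.01, 0.99, 5)
print(shifted_alpha_squared(t, s=1.0))  # unshifted cosine schedule
print(shifted_alpha_squared(t, s=0.1))  # shifted schedule used for machine translation
```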
Table 13: Training details for our language autoencoders. (A single value indicates a setting shared across all models.)
Language Model | BART-base | FLAN-T5-base | MT5-base
Trainable Params | 26M | 26M | 591M
Compression Architecture | Perceiver Resampler [2]
Perceiver Layers | 3 | 3 | 1
Perceiver Dimension | 768 | 768 | 768
Self-Attention Heads | 12 | 12 | 12
Autoencoder Latent Length (ℓ) | 32 | 32 | 32
Autoencoder Dimension (d_ae) | 64 | 64 | 64
Reconstruction Architecture | Transformer [69]
Transformer Layers | 3 | 3 | 1
Transformer Dimension | 768 | 768 | 768
Self-Attention Heads | 12 | 12 | 12
Activation Function | GELU [20]
Max Seq Length | 64 | 64 | 128
Optimizer | AdamW [39]
Learning Rate | 1e-4 | 1e-4 | 1e-4
(β1, β2) | (0.9, 0.999) | (0.9, 0.999) | (0.9, 0.999)
Batch Size | 256 | 256 | 128
Warmup Steps | 1000 | 1000 | 1000
Learning Rate Schedule | Linear Decay
Weight Decay | 1e-2 | 1e-2 | 1e-2
Gradient Clipping | 1.0 | 1.0 | 1.0
Training Steps | 50k | 50k | 50k
Training Time | 12h38m | 20h17m | 20h29m

During the inference process, image diffusion models typically rescale the estimate of the data to the range of pixel values (i.e., [-1, 1]) at each sampling step. When we restrict the latent space so that $\lVert x_i \rVert_2^2 = d_{ae}$, we similarly rescale the intermediate estimates of the data to enforce this constraint. This design decision is not critical, and similar performance is achieved without this rescaling. We did, however, observe that this made the generative process more robust to large guidance weights, which may be important in some settings. This observation is consistent with prior findings from text-to-image diffusion [55].

We also report the wall clock times for training the models, although our implementation could be further optimized to improve training times. The primary cause of the slowdown for AG News compared to ROCStories, for instance, stems from additional validation sampling and logging for class-conditional generation during training.

When decoding the sampled latent vectors, we utilize beam search with a beam size of 4, a repetition penalty of 1.2 [29], and prevent generations of duplicate trigrams.

E.3 BART-Diffusion

For our BART-Diffusion baseline, we utilize the same denoising architecture as our LD4LG method. As discussed in the main paper, the sequence length of the BART features varies with the length of the input text. During training, the sequence length is simply determined by the training instance. To select the length of the Gaussian noise during generation, we sample a length from the empirical distribution of lengths in the training set.

We observed that the v-prediction parameterization was less effective in this setting and the ϵ-prediction parameterization was unstable. We therefore adopted the x-prediction parameterization. This is consistent with past work that has found the x-prediction parameterization to be more effective for high-dimensional data [36, 8]. Another challenge is that we can no longer control the scale of the latent space. We therefore follow common practices from latent image diffusion and normalize the latent space to have unit variance [51].

When normalizing the latent space, we utilize the first batch of training data to compute the mean for each feature dimension, averaging across the samples in the batch and the sequence lengths of the samples. That is, given some batched data $x \in \mathbb{R}^{b \times \ell \times d}$, we compute the mean vector
$$\hat{\mu} = \frac{1}{b\ell}\sum_{b,\ell} x_{b,\ell}, \qquad \hat{\mu} \in \mathbb{R}^{d}.$$
We then compute the global variance across all dimensions in the centered latent space,
$$\hat{\sigma}^2 = \frac{1}{b\ell d}\sum_{b,\ell,d}\left(x_{b,\ell,d} - \hat{\mu}_d\right)^2, \qquad \hat{\sigma}^2 \in \mathbb{R},$$
to rescale the latent space to have unit variance.
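A minimal sketch of this normalization procedure (our own illustration with placeholder shapes, not the released code):

```python
import torch

def latent_norm_stats(x: torch.Tensor):
    """Compute normalization statistics from one batch of latents x of shape (b, l, d)."""
    mu = x.mean(dim=(0, 1))        # per-dimension mean, shape (d,)
    var = (x - mu).pow(2).mean()   # single global variance over all entries of the centered batch
    return mu, var

def normalize(x: torch.Tensor, mu: torch.Tensor, var: torch.Tensor) -> torch.Tensor:
    # Center with the per-dimension mean and rescale to approximately unit variance.
    return (x - mu) / var.sqrt()

# Example with a hypothetical first training batch of variable-length BART features.
batch = torch.randn(128, 40, 768) * 3.0 + 1.0
mu, var = latent_norm_stats(batch)
z = normalize(batch, mu, var)
print(z.mean().item(), z.var().item())  # roughly 0 and 1
```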
We otherwise train this baseline with the same hyperparameters as LD4LG.

E.4 Diffusion-LM

We train our Diffusion-LM models utilizing the public implementation by Li et al. [36].² We utilize the provided command and hyperparameter settings for the ROCStories dataset. To adapt it to the AG News dataset, we increase the batch size from 64 to 128 and set the number of training steps to 250k to match our training configuration. We otherwise utilize the same hyperparameter settings as the ROCStories model. We attempted to double the learning rate from 1e-4 to 2e-4 to account for the doubled batch size, but observed training instabilities and therefore used the original learning rate of 1e-4.

E.5 GPT-2

We present the default hyperparameters for the GPT-2-Medium baseline in Table 15. For sampling from GPT-2, we prompt it with a BOS token and utilize nucleus sampling (p = 0.95). We use the same repetition penalty of 1.2 [29] that we use for the LD4LG language decoders and similarly prevent generations of duplicate trigrams.

E.6 DiffuSeq

For the QQP dataset, we compute the metrics with the model generations released by Gong et al. [16]. We utilize the official implementation from Gong et al. [16]³ to train a DiffuSeq model on the XSum dataset. In their work, the DiffuSeq models were trained with the same hyperparameters across all datasets considered, except for the number of training steps, which varied across datasets. We therefore adopt their default hyperparameters for the XSum dataset.

We observed that the DiffuSeq models were trained for much longer than our models. The official implementation also utilized gradient accumulation with microbatches of 128 to achieve a large effective batch size of 4096.⁴ We trained the XSum DiffuSeq model for 960k iterations, which is significantly longer than the 250k iterations used by our LD4LG XSum model. Due to the use of gradient accumulation, this corresponds to 30k gradient updates. The XSum DiffuSeq baseline was therefore trained for over 3.8× more epochs than our method.

² https://github.com/XiangLi1999/Diffusion-LM
³ https://github.com/Shark-NLP/DiffuSeq
⁴ We note that the original DiffuSeq implementation had a bug in its implementation of distributed training (see https://github.com/Shark-NLP/DiffuSeq/issues/37). We describe the behavior of the original implementation.

A limitation of the DiffuSeq model compared to LD4LG is that it concatenates the source and target sequences as the input to its transformer model. DiffuSeq therefore scales quadratically with respect to the combined length of the source and target sequences. Our denoising network, on the other hand, operates upon a fixed sequence length of ℓ = 32 latents and only cross-attends to the source representations. As a result, our method scales linearly with respect to the length of the source sequence.⁵ This enables LD4LG to more efficiently incorporate long contexts than DiffuSeq. By default, the official DiffuSeq implementation limits the combined length of the source and target sequences to a maximum length of 128.
This could put it at a disadvantage compared to our model, which incorporates up to 256 tokens of the source sequence. To ensure a fair comparison, we also experimented with increasing the maximum sequence length for the DiffuSeq model to 256 tokens, which significantly increases the training overhead. After training the model for 640k iterations, which took 5 days with two Nvidia A6000 GPUs, we observed worse performance than the model using the default length of 128.

E.7 Encoder-Decoder Language Models

We report training hyperparameters for fine-tuning the pre-trained encoder-decoder language models on the Seq2Seq datasets in Table 16. We perform early stopping with the validation ROUGE-L.

E.8 Evaluation Metrics

For the MAUVE, ROUGE, BLEU, BERTScore, Perplexity, and SacreBLEU metrics, we utilize the implementations provided by the Hugging Face evaluate library (https://huggingface.co/docs/evaluate/). For SacreBLEU, we follow prior work and use the intl tokenizer if the target language is German and the 13a tokenizer if the target language is English. For the n-gram metrics, we utilize the en_core_web_sm tokenizer from spaCy (https://spacy.io/) to split the generations into tokens.

F Dataset Statistics

ROCStories [42]. The dataset consists of 98,161 instances. We hold out 1,000 instances for validation, 4,000 instances for testing, and utilize the remaining 93,161 instances for training.

AG News Topic Classification [60]. The dataset consists of titles and short descriptions from news articles. We discard the titles and focus on generating the descriptions in this work. The official train/test splits have 120k training instances and 7,600 testing instances. We hold out 1,000 instances from the training set for validation. We therefore utilize 119k training instances, 1,000 validation instances, and 7,600 test instances.

XSum [43]. The dataset consists of BBC articles from 2010 to 2017 covering a wide range of topics (e.g., news, politics, sports). Each example in the dataset consists of a news article and a summary. It has 204,045 training instances, 11,332 validation instances, and 11,334 test instances.

QQP [9]. The dataset consists of 400k question pairs, where each example consists of two similar questions and a binary value indicating whether the two questions have the same meaning. The semantically similar questions can be utilized as a paraphrasing dataset. We use the version released by Gong et al. [16] to enable direct comparison. It has 144,715 training instances, 2,048 validation instances, and 2,500 test instances.

WMT 2014 English-German [5]. The dataset consists of roughly 4.5 million paired English and German sentences for training. The validation and testing splits each have roughly 3k paired sentences.

G Qualitative Examples

We present random unconditional samples from the diffusion models for the ROCStories (Table 17) and AG News (Table 18) datasets. We note that because Diffusion-LM learns token representations from scratch and cannot model rare words, Li et al. [36] replace rare words with an UNK token. We observe that these tokens are often generated, leading to incoherent text. This problem is particularly pronounced for the AG News dataset, which has a more diverse vocabulary and uses many proper nouns, such as names, that are out-of-vocabulary.

We also present random class-conditional samples for the AG News dataset (Table 19 and Table 20) for all of the classes. We present examples of sequence-to-sequence generations for QQP in Table 21 and XSum in Table 22. While the DiffuSeq generations are somewhat reasonable for the simpler QQP paraphrasing dataset, the model completely fails to produce coherent summaries for the challenging XSum dataset. This is the case even though DiffuSeq is trained for significantly longer than LD4LG and uses 8× as many sampling timesteps.

⁵ For LD4LG, the frozen language encoder still scales quadratically with the source sequence length, but the source representations can be pre-computed and cached prior to training.
Table 14: Training details for LD4LG across different datasets. (A single value indicates a setting shared across all datasets.)
Dataset | ROCStories | AG News | XSum | QQP | WMT14-En-De
Trainable Params | 188M | 190M | 217M | 217M | 218M
Sampling Timesteps | 250
Noise Schedule | Cosine | Cosine | Cosine | Cosine | Shifted Cosine (s = 0.1) [27, 7]
Regression Loss | L2 | L2 | L1 | L1 | L1
Transformer Layers | 12
Transformer Dimension | 768
Self-Attention Heads | 12
Dense Connections [4] | 3
Activation Function | GeGLU [59]
Optimizer | AdamW [39]
Learning Rate | 2e-4 | 2e-4 | 2e-4 | 2e-4 | 4e-4
(β1, β2) | (0.9, 0.999)
Batch Size | 128 | 128 | 128 | 128 | 512
Warmup Steps | 1000
Learning Rate Schedule | Cosine Decay
Weight Decay | 1e-6
Dropout | 0.1 | 0.1 | 0.1 | 0.1 | 0.0
Gradient Clipping | 1.0 | 1.0 | 1.0 | 1.0 | 0.2
EMA Decay | 0.9999
Training Steps | 250k | 250k | 250k | 250k | 500k
Max Seq Length (Source) | n/a | n/a | 256 | 64 | 128
Training Time (BART-base) | 1d 11h | 1d 20h | 2d 22h | 1d 20h | -
Training Time (FLAN-T5-base) | 1d 17h | 1d 21h | 4d 2h | 2d 7h | -
Training Time (MT5-base) | - | - | - | - | 9d 16h

Table 15: Training details for our autoregressive baseline across different datasets. (The same settings are used for ROCStories and AG News.)
Model | GPT-2-Medium
Trainable Params | 355M
Max Seq Length | 64
Optimizer | AdamW [39]
Learning Rate | 8e-5
(β1, β2) | (0.9, 0.999)
Batch Size | 32
Warmup Steps | 500
Learning Rate Schedule | Linear Decay
Weight Decay | 1e-2
Dropout | 0.1
Gradient Clipping | 1.0
Training Steps | 100k

Table 16: Training details for our Seq2Seq baseline models. (A single value indicates a setting shared across all models.)
Model | BART-base | FLAN-T5-base | BART-base | FLAN-T5-base
Trainable Params | 139M | 220M | 139M | 220M
Max Seq Length (Source) | 256 | 256 | 64 | 64
Max Seq Length (Target) | 64 | 64 | 64 | 64
Optimizer | AdamW [39]
Learning Rate | 5e-5 | 1e-4 | 5e-5 | 5e-5
(β1, β2) | (0.9, 0.999)
Batch Size | 32
Warmup Steps | 500
Learning Rate Schedule | Linear Decay
Weight Decay | 1e-2
Gradient Clipping | 1.0
Training Steps | 100k

Table 17: Random samples from the ROCStories dataset. Columns: LD4LG (BART-base), LD4LG (FLAN-T5-base), Diffusion-LM.
After a long line in line, Amy was ready to carry her cart. She asked if she should put the money in a bag. The cashier gave her a quarter and she opened the bag. She was happy to see that she paid for the amount on the line. The checkier checked when she Emma was playing with her doll doll. She was having a good time when suddenly she slipped! The doll doll shattered in many places! Emma was so upset she cried and cried! Her mother took her home and got her a new band-aid. Tom was going to eat with friends. But it was stressed out. So He decided to go to the local bar. But when he realized his friend was too much. The police allowed home to pull him home. Barry was a popular high school student. He always got good grades in school. Barry's friends all met up. He arrived at his new job with a big grin. Barry decided he would start the new job as a teacher. Max wanted to build a tree in his backyard. He researched guides on what kind of plant to plant. He went online and cut trees so he could see one that would cover large. He bought all his supplies and drove to the farm dealership. They had planted a beautiful backyard in his neighborhood. Rita was about to go out in the UNK. UNK was the UNK and Rita was very nervous. She took out the ball was beginning to UNK. She kicked the ball still and knew she was a good kid. She looked in her shoulder and immediately ran to the sound. Michael had a crush on a girl. He finally had the courage to talk to her. Michael went over to her and she walked down a hallway. They chatted for hours. Michael wished he had never asked another girl. John and Molly thought it would be fun to go to Europe. They decided to take their little child to go swimming. The child had a wonderful time playing in the waves. They also had ta lot of local food.
They were exhausted when they still had to return home. I bought a new UNK. It was a UNK. My friends asked it for some money. We didn't listen. I was declined. Ed got a chihuahua. It escaped its cage. Ed was able to free the chihuohua. He wanted to keep him so he let it alone. Ed is able to keep the rest out. Yesterday lulu went to the theme park. To her surprise her phone fell out at the park. She was so disappointed. But thankfully no one was looking for it. She had to walk home as fast as she could to get it. Todd was walking his dog with his dog. The UNK hit by minutes close to check something out. There was a small UNK and UNK off of the ground. He got to the UNK's house to find UNK UNK. Todd's dog started to listen to the UNK of it. Maria was getting ready for her trip. She wanted a specific bathing suit, and went to the mall. She tried on many different outfits, but none fit. Maria realized she had found a great deal while shopping. She bought herself a nice suit. Anna had been friends with her family for years, but curious. Later, Anna's mom told her she might be sick after a bad age. Anna broke up with this, and swore that she would not get sick. That night, Anna threw up all over the house Stacy wanted to learn how to ride a horse. She found a long one near her UNK. She decided to UNK on. Finally she was able to ride a UNK. Stacy was happy to be her own horse.

Table 18: Random samples from the AG News dataset. HTML entities are decoded for readability. Columns: LD4LG (BART-base), LD4LG (FLAN-T5-base), Diffusion-LM.
What could have been a decisive role in Disney's merger of a leading media group, but not only it appears to have been. Last night, the founders of the media conglomerate's leading stock management unit, introduced legislation capping the Sachin Tendulkar has found himself fit to India's batting squad ahead of this weekend's final and final session of the first Test against Bangladesh in East Oval. UNK UNK UNK UNK UNK - UNK, UNK - UNK UNK UNK de UNK, the UNK UNK UNK, the UNK UNK of a UNK UNK UNK UNK. The startup provider will provide CRM-based services for small and midsize businesses on its offices. America Online and Ask Jeeves settle over file-swapping technology that could lead to lawsuits against hundreds of online businesses and result in fraud. A federal grand judge has reached a new $ UNK stake in UNK for the $ 35 billion, UNK leading investors to the UNK. &