# Diffusion-LM Improves Controllable Text Generation

Xiang Lisa Li, Stanford University, xlisali@stanford.edu
John Thickstun, Stanford University, jthickst@stanford.edu
Ishaan Gulrajani, Stanford University, igul@stanford.edu
Percy Liang, Stanford University, pliang@cs.stanford.edu
Tatsunori B. Hashimoto, Stanford University, thashim@stanford.edu

## Abstract

Controlling the behavior of language models (LMs) without re-training is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure). To address this challenge, we develop a new non-autoregressive language model based on continuous diffusions that we call Diffusion-LM. Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. The continuous, hierarchical nature of these intermediate variables enables a simple gradient-based algorithm to perform complex, controllable generation tasks. We demonstrate successful control of Diffusion-LM for six challenging fine-grained control tasks, significantly outperforming prior work.¹

## 1 Introduction

Large autoregressive language models (LMs) are capable of generating high-quality text [39, 3, 5, 56], but in order to reliably deploy these LMs in real-world applications, the text generation process needs to be controllable: we need to generate text that satisfies desired requirements (e.g., topic, syntactic structure). A natural approach for controlling an LM would be to fine-tune the LM using supervised data of the form (control, text) [18]. However, updating the LM parameters for each control task can be expensive and does not allow for compositions of multiple controls (e.g.,
generate text that is both positive sentiment and non-toxic). This motivates lightweight and modular plug-and-play approaches [6] that keep the LM frozen and steer the generation process using an external classifier that measures how well the generated text satisfies the control. But steering a frozen autoregressive LM has been shown to be difficult, and existing successes have been limited to simple, attribute-level controls (e.g., sentiment or topic) [6, 25, 55].

In order to tackle more complex controls, we propose Diffusion-LM, a new language model based on continuous diffusions. Diffusion-LM starts with a sequence of Gaussian noise vectors and incrementally denoises them into vectors corresponding to words, as shown in Figure 1. These gradual denoising steps produce a hierarchy of continuous latent representations. We find that this hierarchical and continuous latent variable enables simple, gradient-based methods to perform complex control tasks such as constraining the parse tree of a generated sequence.

Continuous diffusion models have been extremely successful in vision and audio domains [13, 24, 41, 8, 4], but they have not been applied to text because of the inherently discrete nature of text (§3).

¹ Code is available at https://github.com/XiangLi1999/Diffusion-LM.git

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Figure 1: Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding intermediate latent variables of decreasing noise level xT, . . . , x0. For controllable generation, we iteratively perform gradient updates on these continuous latents to optimize for fluency (parametrized by Diffusion-LM) and satisfy control requirements (parametrized by a classifier).
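The gradient update on continuous latents described in the caption above can be sketched as a minimal toy iteration. This is a sketch under stated assumptions, not the paper's implementation: `grad_lm` and `grad_ctrl` are hypothetical stand-ins for the gradients of the Diffusion-LM fluency score and the classifier's control score.

```python
import numpy as np

def control_step(x, grad_lm, grad_ctrl, lr=0.1, lam=1.0):
    # One plug-and-play update on the continuous latents x: move along the
    # gradient of a fluency score plus a lam-weighted control score.
    return x + lr * (grad_lm(x) + lam * grad_ctrl(x))

# Toy quadratic scores: their gradients pull x toward mu_lm (a "fluent"
# region) and mu_c (a "control-satisfying" region). Iterating the update
# drives x to the lam-weighted average of the two targets.
mu_lm = np.array([1.0, 0.0])
mu_c = np.array([0.0, 1.0])
x = np.zeros(2)
for _ in range(200):
    x = control_step(x, lambda v: mu_lm - v, lambda v: mu_c - v)
```

With `lam=1.0`, the fixed point is the midpoint of the two targets, illustrating how the update trades off the fluency and control terms.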
Adapting this class of models to text requires several modifications to standard diffusions: we add an embedding step and a rounding step to the standard diffusion process, design a training objective to learn the embedding, and propose techniques to improve rounding (§4).

We control Diffusion-LM using a gradient-based method, as shown in Figure 1. This method enables us to steer the text generation process towards outputs that satisfy target structural and semantic controls. It iteratively performs gradient updates on the continuous latent variables of Diffusion-LM to balance fluency and control satisfaction (§5.1).

To demonstrate control of Diffusion-LM, we consider six control targets ranging from fine-grained attributes (e.g., semantic content) to complex structures (e.g., parse trees). Our method almost doubles the success rate of previous plug-and-play methods and matches or outperforms the fine-tuning oracle on all these classifier-guided control tasks (§7.1). In addition to these individual control tasks, we show that we can successfully compose multiple classifier-guided controls to generate sentences with both desired semantic content and syntactic structure (§7.2). Finally, we consider span-anchored controls, such as length control and infilling. Diffusion-LM allows us to perform these control tasks without a classifier, and our Diffusion-LM significantly outperforms prior plug-and-play methods and is on par with an autoregressive LM trained from scratch for the infilling task (§7.3).

## 2 Related Work

Diffusion Models for Text. Diffusion models [47] have demonstrated great success in continuous data domains [13, 33, 24, 31], producing images and audio with state-of-the-art sample quality. To handle discrete data, past works have studied text diffusion models on discrete state spaces, which define a corruption process on discrete data (e.g., each token has some probability of being corrupted to an absorbing or random token) [1, 15, 16].
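As a toy illustration of such a discrete-state corruption process (the prior work discussed above, not the continuous approach of this paper), the following sketch corrupts each token independently, sending it either to an assumed `[MASK]` absorbing token or to a random vocabulary token:

```python
import random

def corrupt(tokens, vocab, p_corrupt=0.3, p_mask=0.5, mask_token="[MASK]"):
    # Each token is independently corrupted with probability p_corrupt;
    # a corrupted token becomes the absorbing mask_token with probability
    # p_mask, otherwise a uniformly random token from vocab.
    out = []
    for tok in tokens:
        if random.random() < p_corrupt:
            out.append(mask_token if random.random() < p_mask else random.choice(vocab))
        else:
            out.append(tok)
    return out
```

Applying this step repeatedly drives a sequence toward pure noise; a discrete diffusion model is trained to invert it step by step.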
In this paper, we focus on continuous diffusion models for text; to the best of our knowledge, our work is the first to explore this setting. In contrast to discrete diffusion LMs, our continuous diffusion LMs induce continuous latent representations, which enables efficient gradient-based methods for controllable generation.

Autoregressive and Non-autoregressive LMs. Most large pre-trained LMs are left-to-right autoregressive (e.g., GPT-3 [3], PaLM [5]). The fixed generation order limits the model's flexibility in many controllable generation settings, especially those that impose controls globally on both left and right contexts. One example is infilling, which imposes lexical control on the right context; another example is syntactic structure control, which controls global properties involving both left and right contexts. Since autoregressive LMs cannot directly condition on right contexts, prior works have developed specialized training and decoding techniques for these tasks [46, 9, 36]. For example, Qin et al. [37] proposed a decoding method that relaxes the discrete LM outputs to continuous variables and backpropagates gradient information from the right context. Diffusion-LM, in contrast, can condition on arbitrary classifiers that look at complex, global properties of the sentence. Other non-autoregressive LMs have been developed for machine translation and speech-to-text tasks [12, 43]. However, these methods are specialized for speech and translation settings, where the entropy over valid outputs is low, and whether they work for language modeling remains an open problem. We leave detailed discussion to Appendix H.

Plug-and-Play Controllable Generation. Plug-and-play controllable generation aims to keep the LM frozen and steer its output using potential functions (e.g., classifiers).
Given a probabilistic potential function that measures how well the generated text satisfies the desired control, the generated text should be optimized for both control satisfaction (measured by the potential function) and fluency (measured by LM probabilities). There are several plug-and-play approaches based on autoregressive LMs: FUDGE [55] reweights the LM prediction at each token with an estimate of control satisfaction for the partial sequence; GeDi [25] and DExperts [28] reweight the LM prediction at each token with a smaller LM fine-tuned or trained for the control task. The closest work to ours is PPLM [6], which runs gradient ascent on an autoregressive LM's hidden activations to steer the next token to satisfy the control and maintain fluency. Because PPLM is based on autoregressive LMs, it can only generate left-to-right, which prevents it from repairing and recovering from errors made in previous generation steps. Despite their success on attribute (e.g., topic) controls, we will show that these plug-and-play methods for autoregressive LMs fail on more complex control tasks, such as controlling syntactic structure and semantic content, in §7.1. We demonstrate that Diffusion-LM is capable of plug-and-play controllable generation by applying classifier-guided gradient updates to the continuous sequence of latent variables induced by the Diffusion-LM.

## 3 Problem Statement and Background

We first define controllable generation (§3.1), then review autoregressive language models (§3.2) and continuous diffusion models (§3.3).

### 3.1 Generative Models and Controllable Generation for Text

Text generation is the task of sampling w from a trained language model plm(w), where w = [w1, . . . , wn] is a sequence of discrete words and plm(w) is a probability distribution over sequences of words. Controllable text generation is the task of sampling w from a conditional distribution p(w | c), where c denotes a control variable.
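For a tiny, enumerable sequence space, the conditional p(w | c) that plug-and-play methods target can be computed exactly by rescoring LM probabilities with a classifier and renormalizing (Bayes' rule). A minimal sketch, assuming toy distributions and a hypothetical sentiment classifier:

```python
def posterior(plm, pc_given_w, c):
    # p(w | c) ∝ plm(w) * p(c | w): rescore every sequence, then renormalize.
    unnorm = {w: p * pc_given_w(c, w) for w, p in plm.items()}
    z = sum(unnorm.values())
    return {w: u / z for w, u in unnorm.items()}

# Toy two-sequence LM and a hypothetical classifier that prefers "good"
# sequences when the control c is "positive".
plm = {"good movie": 0.6, "bad movie": 0.4}
clf = lambda c, w: 0.9 if (c == "positive") == ("good" in w) else 0.1
post = posterior(plm, clf, "positive")
```

Real sequence spaces are far too large to enumerate, which is why plug-and-play methods resort to approximations such as token-level reweighting or gradient updates.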
For syntactic control, c can be a target syntax tree (Figure 1), while for sentiment control, c could be a desired sentiment label. The goal of controllable generation is to generate w that satisfies the control target c. Consider the plug-and-play controllable generation setting: we are given a language model plm(w) trained on a large amount of unlabeled text data, and for each control task, we are given a classifier p(c | w) trained on a smaller amount of labeled text data (e.g., for syntactic control, the classifier is a probabilistic parser). The goal is to utilize these two models to approximately sample from the posterior p(w | c) via Bayes' rule: p(w | c) ∝ plm(w) · p(c | w). Here, plm(w) encourages w to be fluent, and p(c | w) encourages w to fulfill the control.

### 3.2 Autoregressive Language Models

The canonical approach to language modeling factors plm in an autoregressive, left-to-right manner: plm(w) = plm(w1) · ∏_{i=2}^{n} plm(wi | w<i).