Published as a conference paper at ICLR 2023

# DIFFUSER: DIFFUSION VIA EDIT-BASED RECONSTRUCTION

Machel Reid*, Google Research, machelreid@google.com
Vincent J. Hellendoorn, Software and Societal Systems Department, Carnegie Mellon University, vhellendoorn@cmu.edu
Graham Neubig, Language Technologies Institute, Carnegie Mellon University & Inspired Cognition, gneubig@cs.cmu.edu

*Work done partially while at the University of Tokyo.

## ABSTRACT

In text generation, models that generate text from scratch one token at a time are currently the dominant paradigm. Despite being performant, these models lack the ability to revise existing text, which limits their usability in many practical scenarios. We look to address this with DIFFUSER (Diffusion via Edit-based Reconstruction), a new edit-based generative model for text based on denoising diffusion models, a class of models that use a Markov chain of denoising steps to incrementally generate data. DIFFUSER is not only a strong generative model in general, rivalling autoregressive models on several tasks spanning machine translation, summarization, and style transfer; it can also perform other varieties of generation that standard autoregressive models are not well-suited for. For instance, we demonstrate that DIFFUSER makes it possible for a user to condition generation on a prototype or an incomplete sequence, and to continue revising based on previous edit steps.

## 1 INTRODUCTION

Revision and editing are central to how humans produce content; we write and revise emails and papers, gradually produce works of art, and iterate on plans for a project. Despite this, the dominant paradigm in text generation is purely autoregressive, producing text left-to-right in a single pass (Bengio et al., 2003). Although models employing this single-pass form of generation are highly performant, they are limited by their inability to refine existing text. To address this, we propose DIFFUSER: Diffusion via Edit-based Reconstruction, a flexible method to apply edit-based generative processes to arbitrary text generation tasks. Specifically, we take inspiration from diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020), generative models that generate by way of incremental denoising steps, and adapt this approach to the text generation paradigm with a formulation similar to natural editing processes.

Prior work on text generation either focuses on improving the performance of standard autoregressive (AR) models through larger models and datasets (Vaswani et al., 2017; Sutskever et al., 2014; Radford et al.; Brown et al., 2020) or on proposing new, non-autoregressive approaches (Gu et al., 2017; Ghazvininejad et al., 2019; Gu et al., 2019) to improve general modes of text generation. A thus far separate line of models has taken the perspective of modeling text edits for specific tasks, e.g. style transfer (Reid & Zhong, 2021; Malmi et al., 2020), sentence fusion (Malmi et al., 2019), and grammatical error correction (Dale & Kilgarriff, 2011). DIFFUSER unifies these two perspectives by enabling edit processes to be applied to general-purpose text generation without compromising performance or requiring external supervised data (Guu et al., 2018).
Figure 1: DIFFUSER's text generation process, showing the corruption process and the reconstruction process over an example sentence. Orange represents replacements, blue represents insertions, red represents deletions, and white represents keep operations. This process largely imitates a natural editing process (Reid & Neubig, 2022).

This design enables DIFFUSER to both generate and edit text, including externally produced content, a natural extension of the text generation paradigm.

DIFFUSER models text generation as a series of diffusion steps at the token level. This form of generation allows us to develop a synthetic formulation of natural editing processes (Reid & Neubig, 2022) using edit-based corruption and reconstruction. Our method starts from an arbitrary sequence (either a prototype generation, randomly sampled tokens, or a null sequence) and progressively edits it into the final sequence, guided by the Levenshtein edit operations INSERT, DELETE, KEEP, and REPLACE, as shown in Figure 1. This enables flexible editing in a range of contexts, including machine translation, summarization, and style transfer, while also allowing for the possibility of taking outside input to guide and constrain generation.

Learning these edit-based diffusion processes required several innovations over standard autoregressive and MLM-style iterative generation approaches (Ghazvininejad et al., 2019; Austin et al., 2021; Savinov et al., 2022), including forming edit-based corruption and reconstruction processes for training (Sec 3), as well as techniques to improve the quality of decoded sequences across both timesteps and token-level generations (including 2D beam search; Sec 3.6, Sec 3.5).

To demonstrate the effectiveness of DIFFUSER, we test our method on three text generation tasks: machine translation, abstractive summarization, and text style transfer, and show on-par or improved performance compared to purely autoregressive, single-pass, and non-autoregressive methods. We also provide qualitative samples of the edit processes learned by the models in different settings, along with analyses of training and inference speeds and of the relationship between edit steps and performance. Overall, we demonstrate the potential of edit-based generative models to offer 1) more performant generation, 2) greater interactivity between different models (as we can now perform edits in the discrete space on model-generated output), and 3) more flexible and controllable generation.

## 2 BACKGROUND

DIFFUSER operates at the intersection of text generation, editing processes, and diffusion models. We first provide the background and intuition for these three techniques.

### 2.1 TEXT GENERATION

Most text generation models used in NLP today are autoregressive in nature. In this paradigm, given a sequence $s = [s_0, s_1, \ldots, s_N]$, one can model the likelihood of the entire sequence $P(s)$ by modeling the probability of predicting each token in an autoregressive, often left-to-right, manner. This formulation, where the likelihood of a token $p(s_t)$ is conditioned on its predecessors $s_{<t}$, factorizes the sequence likelihood as

$$P(s) = \prod_{t=0}^{N} p(s_t \mid s_{<t}).$$
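As a concrete sketch of this factorization, the short example below scores a sequence by accumulating the conditional log-probabilities $\log p(s_t \mid s_{<t})$ left to right. The toy conditional distribution `cond_prob` and its tiny vocabulary are invented purely for illustration and are not part of DIFFUSER; in practice a learned autoregressive model such as a Transformer decoder supplies these conditionals.

```python
import math

def cond_prob(token: str, prefix: tuple) -> float:
    """A toy stand-in for p(s_t | s_<t): next-token probability depends only
    on how many tokens precede it. Purely illustrative."""
    vocab = ["a", "b", "</s>"]
    weights = {tok: 1.0 for tok in vocab}
    weights["a"] += max(0, 3 - len(prefix))   # favour "a" early in the sequence
    weights["</s>"] += len(prefix)            # favour ending as the prefix grows
    total = sum(weights.values())
    return weights[token] / total

def sequence_logprob(tokens: list) -> float:
    """log P(s) = sum_t log p(s_t | s_<t), accumulated left to right."""
    logp = 0.0
    for t, tok in enumerate(tokens):
        logp += math.log(cond_prob(tok, tuple(tokens[:t])))
    return logp

print(sequence_logprob(["a", "a", "b", "</s>"]))
```

Greedy or beam-search decoding under such a model proceeds the same way: at each step the conditional distribution is queried given the tokens generated so far, and a token is committed before moving on, which is exactly the single-pass behaviour that DIFFUSER's edit-based formulation is designed to relax.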