# Autoregressive Diffusion Models

Published as a conference paper at ICLR 2022

Emiel Hoogeboom*, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans
Google Research
e.hoogeboom@uva.nl, {agritsenko,pooleb,bastings,salimans}@google.com, riannevdberg@gmail.com

*Work done as a research intern at Google Brain.

ABSTRACT

We introduce Autoregressive Diffusion Models (ARDMs), a model class encompassing and generalizing order-agnostic autoregressive models (Uria et al., 2014) and absorbing discrete diffusion (Austin et al., 2021), which we show are special cases of ARDMs under mild assumptions. ARDMs are simple to implement and easy to train. Unlike standard ARMs, they do not require causal masking of model representations, and can be trained using an efficient objective similar to modern probabilistic diffusion models that scales favourably to high-dimensional data. At test time, ARDMs support parallel generation which can be adapted to fit any given generation budget. We find that ARDMs require significantly fewer steps than discrete diffusion models to attain the same performance. Finally, we apply ARDMs to lossless compression, and show that they are uniquely suited to this task. Contrary to existing approaches based on bits-back coding, ARDMs obtain compelling results not only on complete datasets, but also on compressing single data points. Moreover, this can be done using a modest number of network calls for (de)compression due to the model's adaptable parallel generation.

1 INTRODUCTION

Deep generative models have made great progress in modelling different sources of data, such as images, text and audio. These models have a wide variety of applications, such as denoising, inpainting, translation and representation learning. A popular class of likelihood-based models is that of Autoregressive Models (ARMs). ARMs model a high-dimensional joint distribution as a factorization of conditionals using the probability chain rule. Although very effective, ARMs require a pre-specified order in which to generate data, which may not be an obvious choice for some data modalities, for example images. Further, although the likelihood of ARMs can be retrieved with a single neural network call, sampling from a model requires the same number of network calls as the dimensionality of the data.

Recently, modern probabilistic diffusion models have introduced a new training paradigm: instead of optimizing the entire likelihood of a datapoint, a randomly sampled component of the likelihood bound is optimized at each training step. Works on diffusion in discrete spaces (Sohl-Dickstein et al., 2015; Hoogeboom et al., 2021; Austin et al., 2021) describe a discrete destruction process for which the inverse generative process is learned with categorical distributions. However, the length of these processes may need to be large to attain good performance, which leads to a large number of network calls to sample from or evaluate the likelihood with discrete diffusion. In this work we introduce Autoregressive Diffusion Models (ARDMs), a variant of autoregressive models that learns to generate in any order. ARDMs generalize order-agnostic autoregressive models and discrete diffusion models. We show that ARDMs have several benefits: in contrast to standard ARMs, they impose no architectural constraints on the neural networks used to predict the distribution parameters. Further, ARDMs require significantly fewer steps than absorbing models to attain the same performance. In addition, using dynamic programming approaches developed for diffusion models, ARDMs can be parallelized to generate multiple tokens simultaneously without a substantial reduction in performance. Empirically, we demonstrate that ARDMs perform similarly to or better than discrete diffusion models while being more efficient in modelling steps.

Figure 1: Generation of Autoregressive Diffusion Models for the generation order σ = (3, 1, 2, 4). Filled circles in the first and third layers represent respectively the input and output variables, and the middle layer represents internal activations of the network.
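To make the order-agnostic generation depicted in Figure 1 concrete, the following is a minimal sketch of any-order sampling. The function `predict_probs`, its uniform output, and all constants are illustrative placeholders, not the architecture or hyperparameters used in the paper; the only property relied on is that the network predicts a distribution for every not-yet-generated variable in parallel, so one network call is spent per generated variable.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 256          # number of variables and number of categories (toy values)
MASK = -1              # placeholder for variables that have not been generated yet


def predict_probs(x, mask):
    """Toy stand-in for the network: per-position categorical probabilities.

    A real model would condition on the partially generated x and the mask;
    here we simply return a uniform distribution for every position.
    """
    return np.full((D, K), 1.0 / K)


def sample_any_order(sigma):
    """Generate x one variable at a time, following the permutation sigma."""
    x = np.full(D, MASK, dtype=np.int64)
    mask = np.zeros(D, dtype=bool)      # True = already generated
    for t in sigma:                     # e.g. sigma = (3, 1, 2, 4) as in Figure 1
        probs = predict_probs(x, mask)  # one network call per step
        i = t - 1                       # the figure uses 1-based indices
        x[i] = rng.choice(K, p=probs[i])
        mask[i] = True
    return x


print(sample_any_order(sigma=(3, 1, 2, 4)))
```

Because any permutation sigma is valid, the same sketch covers generation in an arbitrary order; a fixed left-to-right sigma recovers the usual ARM sampling loop.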
The main contributions of this paper can be summarized as follows:

1) We introduce ARDMs, a variant of order-agnostic ARMs which includes the ability to upscale variables.
2) We derive an equivalence between ARDMs and absorbing diffusion under a continuous-time limit.
3) We show that ARDMs can have parallelized inference and generation processes, a property that among other things admits competitive lossless compression with a modest number of network calls.

2 BACKGROUND

ARMs factorize a multivariate distribution into a product of D univariate distributions using the probability chain rule. In this case the log-likelihood of such a model is given by:

$\log p(\mathbf{x}) = \sum_{t=1}^{D} \log p(x_t \mid \mathbf{x}_{<t})$
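As a minimal numerical illustration of this factorization, the sketch below sums per-variable conditional log-probabilities exactly as in the equation above. The `conditional_probs` function is a toy placeholder (a fixed uniform categorical) standing in for an ARM's network output, not the paper's model; a real ARM would compute all conditionals in one forward pass with causally masked representations.

```python
import numpy as np

K = 3  # vocabulary size (toy value)


def conditional_probs(prefix):
    """Toy conditional p(x_t | x_{<t}): ignores the prefix, returns a uniform categorical."""
    return np.full(K, 1.0 / K)


def log_likelihood(x):
    """Chain-rule log-likelihood: sum_t log p(x_t | x_{<t})."""
    logp = 0.0
    for t in range(len(x)):               # t = 1, ..., D in the equation
        probs = conditional_probs(x[:t])  # condition on the prefix x_{<t}
        logp += np.log(probs[x[t]])
    return logp


x = np.array([0, 2, 1, 1])
print(log_likelihood(x))  # equals 4 * log(1/3) for this toy model
```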