Published as a conference paper at ICLR 2021

# IMPROVED AUTOREGRESSIVE MODELING WITH DISTRIBUTION SMOOTHING

Chenlin Meng, Jiaming Song, Yang Song, Shengjia Zhao & Stefano Ermon
Stanford University
{chenlin,tsong,yangsong,sjzhao,ermon}@cs.stanford.edu

## ABSTRACT

While autoregressive models excel at image compression, their sample quality is often lacking. Although not realistic, generated images often have high likelihood according to the model, resembling the case of adversarial examples. Inspired by a successful adversarial defense method, we incorporate randomized smoothing into autoregressive generative modeling. We first model a smoothed version of the data distribution, and then reverse the smoothing process to recover the original data distribution. This procedure drastically improves the sample quality of existing autoregressive models on several synthetic and real-world image datasets, while obtaining competitive likelihoods on synthetic datasets.

## 1 INTRODUCTION

Autoregressive models have exhibited promising results in a variety of downstream tasks. For instance, they have shown success in compressing images (Minnen et al., 2018), synthesizing speech (Oord et al., 2016a), and modeling complex decision rules in games (Vinyals et al., 2019). However, the sample quality of autoregressive models on real-world image datasets is still lacking.

Poor sample quality might be explained by the manifold hypothesis: many real-world data distributions (e.g., natural images) lie in the vicinity of a low-dimensional manifold (Belkin & Niyogi, 2003), leading to complicated densities with sharp transitions (i.e., high Lipschitz constants), which are known to be difficult to model for density models such as normalizing flows (Cornish et al., 2019). Since each conditional of an autoregressive model is a one-dimensional normalizing flow (given a fixed context of previous pixels), a high Lipschitz constant will likely hinder the learning of autoregressive models.

Another reason for poor sample quality is the compounding error issue in autoregressive modeling. An autoregressive model relies on the previously generated context to make a prediction; once a mistake is made, the model is likely to make further mistakes that compound (Kääriäinen, 2006), eventually resulting in questionable and unrealistic samples. Intuitively, one would expect the model to assign low likelihoods to such unrealistic images; however, this is not always the case. In fact, the generated samples, although appearing unrealistic, are often assigned high likelihoods by the autoregressive model, resembling adversarial examples (Szegedy et al., 2013; Biggio et al., 2013): inputs that cause a model to output an incorrect answer with high confidence.

Inspired by the recent success of randomized smoothing techniques in adversarial defense (Cohen et al., 2019), we propose to apply randomized smoothing to autoregressive generative modeling. More specifically, we address the density estimation problem via a two-stage process. Unlike Cohen et al. (2019), who apply smoothing to the model to make it more robust, we apply smoothing to the data distribution: we convolve a symmetric and stationary noise distribution with the data distribution to obtain a new, smoother distribution. In the first stage, we model this smoothed version of the data distribution with an autoregressive model.
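To make the smoothing step concrete, here is a minimal sketch assuming a Gaussian smoothing kernel with level `sigma` (an illustrative hyperparameter name; the choice of kernel and noise level is a modeling decision): sampling from the smoothed distribution only requires perturbing clean data points with noise.

```python
import torch

# A minimal sketch of the smoothing step, assuming a Gaussian kernel:
# drawing a sample from the smoothed distribution amounts to adding
# isotropic noise to a clean data point. `sigma` (the smoothing level)
# is an illustrative name, not notation from the paper.

def smooth(x: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Draw x_tilde ~ q(x_tilde | x) = N(x_tilde; x, sigma^2 I)."""
    return x + sigma * torch.randn_like(x)

# Stage one then fits an autoregressive model to smoothed samples, e.g.
#   x_tilde = smooth(x, sigma)
#   loss = -model.log_prob(x_tilde).mean()
```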
In the second stage, we reverse the smoothing process, a procedure that can also be understood as denoising, by either applying a gradient-based denoising approach (Alain & Bengio, 2014) or introducing another conditional autoregressive model that recovers the original data distribution from the smoothed one. By choosing an appropriate smoothing distribution, we aim to make each stage easier than the original learning problem: smoothing facilitates learning in the first stage by making the input distribution fully supported, without sharp transitions in the density function; and generating a sample given a noisy one is easier than generating a sample from scratch.

[Figure 1: Overview of our method. From the data distribution we inject noise via $q(\tilde{x} \mid x)$, which yields a smoother distribution over $\tilde{x}$; we then model the smoothed distribution $p_\theta(\tilde{x})$ as well as the denoising step $p_\theta(x \mid \tilde{x})$, forming a two-stage model.]

We show with extensive experimental results that our approach drastically improves the sample quality of current autoregressive models on several synthetic and real-world image datasets, while obtaining competitive likelihoods on synthetic datasets. We empirically demonstrate that our method can also be applied to density estimation, image inpainting, and image denoising.

## 2 BACKGROUND

We consider a density estimation problem. Given $D$-dimensional i.i.d. samples $\{x_1, x_2, \ldots, x_N\}$ from a continuous data distribution $p_{\rm data}(x)$, the goal is to approximate $p_{\rm data}(x)$ with a model $p_\theta(x)$ parameterized by $\theta$. A commonly used approach for density estimation is maximum likelihood estimation (MLE), where the objective is to maximize

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i).$$

### 2.1 AUTOREGRESSIVE MODELS

An autoregressive model (Larochelle & Murray, 2011; Salimans et al., 2017) decomposes the joint distribution $p_\theta(x)$ into a product of univariate conditionals:

$$p_\theta(x) = \prod_{i=1}^{D} p_\theta(x_i \mid x_{<i}).$$
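As an illustration of this factorization, the sketch below builds a toy autoregressive density over $D$ continuous dimensions, where each conditional is a one-dimensional Gaussian whose parameters are produced by a small network over the prefix $x_{<i}$. The class name, architecture, and Gaussian conditionals are our own illustrative choices for clarity, not the PixelCNN-style model the paper uses for images.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the factorization above: the joint log-density
# is the sum of one-dimensional conditional log-densities, each
# conditioned on the preceding dimensions.

class ToyAutoregressiveModel(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.dim = dim
        # One network per conditional p_theta(x_i | x_{<i}); the first
        # dimension gets a constant dummy input since it has no context.
        self.nets = nn.ModuleList(
            nn.Sequential(
                nn.Linear(max(i, 1), hidden),
                nn.ReLU(),
                nn.Linear(hidden, 2),  # outputs: mean and log-scale
            )
            for i in range(dim)
        )

    def log_prob(self, x: torch.Tensor) -> torch.Tensor:
        total = torch.zeros(x.shape[0], device=x.device)
        for i in range(self.dim):
            context = x[:, :i] if i > 0 else torch.ones_like(x[:, :1])
            mean, log_scale = self.nets[i](context).chunk(2, dim=-1)
            conditional = torch.distributions.Normal(mean, log_scale.exp())
            # Accumulate log p_theta(x_i | x_{<i}).
            total = total + conditional.log_prob(x[:, i : i + 1]).squeeze(-1)
        return total  # log p_theta(x) = sum over the D conditionals

# Example usage:
#   model = ToyAutoregressiveModel(dim=2)
#   logp = model.log_prob(torch.randn(16, 2))  # shape (16,)
```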