Published as a conference paper at ICLR 2021

# IMPROVED AUTOREGRESSIVE MODELING WITH DISTRIBUTION SMOOTHING

Chenlin Meng, Jiaming Song, Yang Song, Shengjia Zhao & Stefano Ermon
Stanford University
{chenlin,tsong,yangsong,sjzhao,ermon}@cs.stanford.edu

## ABSTRACT

While autoregressive models excel at image compression, their sample quality is often lacking. Although not realistic, generated images often have high likelihood according to the model, resembling the case of adversarial examples. Inspired by a successful adversarial defense method, we incorporate randomized smoothing into autoregressive generative modeling. We first model a smoothed version of the data distribution, and then reverse the smoothing process to recover the original data distribution. This procedure drastically improves the sample quality of existing autoregressive models on several synthetic and real-world image datasets, while obtaining competitive likelihoods on synthetic datasets.

## 1 INTRODUCTION

Autoregressive models have exhibited promising results in a variety of downstream tasks. For instance, they have shown success in compressing images (Minnen et al., 2018), synthesizing speech (Oord et al., 2016a), and modeling complex decision rules in games (Vinyals et al., 2019). However, the sample quality of autoregressive models on real-world image datasets is still lacking.

Poor sample quality might be explained by the manifold hypothesis: many real-world data distributions (e.g., natural images) lie in the vicinity of a low-dimensional manifold (Belkin & Niyogi, 2003), leading to complicated densities with sharp transitions (i.e., high Lipschitz constants), which are known to be difficult to model for density models such as normalizing flows (Cornish et al., 2019). Since each conditional of an autoregressive model is a one-dimensional normalizing flow (given a fixed context of previous pixels), a high Lipschitz constant will likely hinder the learning of autoregressive models.

Another reason for poor sample quality is the compounding error issue in autoregressive modeling. An autoregressive model relies on the previously generated context to make a prediction; once a mistake is made, the model is likely to make further mistakes that compound (Kääriäinen, 2006), eventually resulting in questionable and unrealistic samples. Intuitively, one would expect the model to assign low likelihoods to such unrealistic images; however, this is not always the case. In fact, the generated samples, although appearing unrealistic, are often assigned high likelihoods by the autoregressive model, resembling adversarial examples (Szegedy et al., 2013; Biggio et al., 2013): inputs that cause a model to output an incorrect answer with high confidence.

Inspired by the recent success of randomized smoothing techniques in adversarial defense (Cohen et al., 2019), we propose to apply randomized smoothing to autoregressive generative modeling. More specifically, we address the density estimation problem via a two-stage process. Unlike Cohen et al. (2019), who apply smoothing to the model to make it more robust, we apply smoothing to the data distribution: we convolve a symmetric and stationary noise distribution with the data distribution to obtain a new, smoother distribution. In the first stage, we model this smoothed version of the data distribution with an autoregressive model.
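To make the smoothing step concrete, here is a minimal sketch assuming a Gaussian smoothing kernel with level `sigma` (an illustrative hyperparameter name; the choice of kernel and noise level is a modeling decision): sampling from the smoothed distribution only requires perturbing clean data points with noise.

```python
import torch

# A minimal sketch of the smoothing step, assuming a Gaussian kernel:
# drawing a sample from the smoothed distribution amounts to adding
# isotropic noise to a clean data point. `sigma` (the smoothing level)
# is an illustrative name, not notation from the paper.

def smooth(x: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Draw x_tilde ~ q(x_tilde | x) = N(x_tilde; x, sigma^2 I)."""
    return x + sigma * torch.randn_like(x)

# Stage one then fits an autoregressive model to smoothed samples, e.g.
#   x_tilde = smooth(x, sigma)
#   loss = -model.log_prob(x_tilde).mean()
```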
In the second stage, we reverse the smoothing process, a procedure that can also be understood as denoising, by either applying a gradient-based denoising approach (Alain & Bengio, 2014) or introducing another conditional autoregressive model that recovers the original data distribution from the smoothed one. By choosing an appropriate smoothing distribution, we aim to make each stage easier than the original learning problem: smoothing facilitates learning in the first stage by making the input distribution fully supported, without sharp transitions in the density function; and generating a sample given a noisy one is easier than generating a sample from scratch.

[Figure 1: Overview of our method. From the data distribution we inject noise via $q(\tilde{x} \mid x)$, which yields a smoother distribution over $\tilde{x}$; we then model the smoothed distribution $p_\theta(\tilde{x})$ as well as the denoising step $p_\theta(x \mid \tilde{x})$, forming a two-stage model.]

We show with extensive experimental results that our approach drastically improves the sample quality of current autoregressive models on several synthetic and real-world image datasets, while obtaining competitive likelihoods on synthetic datasets. We empirically demonstrate that our method can also be applied to density estimation, image inpainting, and image denoising.

## 2 BACKGROUND

We consider a density estimation problem. Given $D$-dimensional i.i.d. samples $\{x_1, x_2, \ldots, x_N\}$ from a continuous data distribution $p_{\rm data}(x)$, the goal is to approximate $p_{\rm data}(x)$ with a model $p_\theta(x)$ parameterized by $\theta$. A commonly used approach for density estimation is maximum likelihood estimation (MLE), where the objective is to maximize

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i).$$

### 2.1 AUTOREGRESSIVE MODELS

An autoregressive model (Larochelle & Murray, 2011; Salimans et al., 2017) decomposes the joint distribution $p_\theta(x)$ into a product of univariate conditionals:

$$p_\theta(x) = \prod_{i=1}^{D} p_\theta(x_i \mid x_{<i}).$$
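As an illustration of this factorization, the sketch below builds a toy autoregressive density over $D$ continuous dimensions, where each conditional is a one-dimensional Gaussian whose parameters are produced by a small network over the prefix $x_{<i}$. The class name, architecture, and Gaussian conditionals are our own illustrative choices for clarity, not the PixelCNN-style model the paper uses for images.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the factorization above: the joint log-density
# is the sum of one-dimensional conditional log-densities, each
# conditioned on the preceding dimensions.

class ToyAutoregressiveModel(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.dim = dim
        # One network per conditional p_theta(x_i | x_{<i}); the first
        # dimension gets a constant dummy input since it has no context.
        self.nets = nn.ModuleList(
            nn.Sequential(
                nn.Linear(max(i, 1), hidden),
                nn.ReLU(),
                nn.Linear(hidden, 2),  # outputs: mean and log-scale
            )
            for i in range(dim)
        )

    def log_prob(self, x: torch.Tensor) -> torch.Tensor:
        total = torch.zeros(x.shape[0], device=x.device)
        for i in range(self.dim):
            context = x[:, :i] if i > 0 else torch.ones_like(x[:, :1])
            mean, log_scale = self.nets[i](context).chunk(2, dim=-1)
            conditional = torch.distributions.Normal(mean, log_scale.exp())
            # Accumulate log p_theta(x_i | x_{<i}).
            total = total + conditional.log_prob(x[:, i : i + 1]).squeeze(-1)
        return total  # log p_theta(x) = sum over the D conditionals

# Example usage:
#   model = ToyAutoregressiveModel(dim=2)
#   logp = model.log_prob(torch.randn(16, 2))  # shape (16,)
```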