# Autoregressive Score Matching

Chenlin Meng (Stanford University, chenlin@stanford.edu), Lantao Yu (Stanford University, lantaoyu@cs.stanford.edu), Yang Song (Stanford University, yangsong@cs.stanford.edu), Jiaming Song (Stanford University, tsong@cs.stanford.edu), Stefano Ermon (Stanford University, ermon@cs.stanford.edu)

## Abstract

Autoregressive models use the chain rule to define a joint probability distribution as a product of conditionals. These conditionals need to be normalized, imposing constraints on the functional families that can be used. To increase flexibility, we propose autoregressive conditional score models (AR-CSM), in which we parameterize the joint distribution in terms of the derivatives of univariate log-conditionals (scores), which need not be normalized. To train AR-CSM, we introduce a new divergence between distributions named Composite Score Matching (CSM). For AR-CSM models, this divergence between data and model distributions can be computed and optimized efficiently, requiring no expensive sampling or adversarial training. Compared to previous score matching algorithms, our method is more scalable to high-dimensional data and more stable to optimize. We show with extensive experimental results that it can be applied to density estimation on synthetic data, image generation, image denoising, and training latent variable models with implicit encoders.

## 1 Introduction

Autoregressive models play a crucial role in modeling high-dimensional probability distributions. They have been successfully used to generate realistic images [18, 21], high-quality speech [17], and complex decisions in games [29]. An autoregressive model defines a probability density as a product of conditionals using the chain rule. Although this factorization is fully general, autoregressive models typically rely on simple probability density functions for the conditionals (e.g., a Gaussian or a mixture of logistics) [21] in the continuous case, which limits the expressiveness of the model.

To improve flexibility, energy-based models (EBMs) represent a density in terms of an energy function, which does not need to be normalized. This enables more flexible neural network architectures, but requires new training strategies, since maximum likelihood estimation (MLE) is intractable due to the normalization constant (partition function). Score matching (SM) [9] trains EBMs by minimizing the Fisher divergence (instead of the KL divergence as in MLE) between model and data distributions. It compares distributions in terms of their log-likelihood gradients (scores) and completely circumvents the intractable partition function. However, score matching requires computing the trace of the Hessian matrix of the model's log-density, which is expensive for high-dimensional data [14].

To avoid calculating the partition function without losing scalability in high-dimensional settings, we leverage the chain rule to decompose a high-dimensional distribution matching problem into simpler univariate sub-problems. Specifically, we propose a new divergence between distributions, named Composite Score Matching (CSM), which depends only on the derivatives of univariate log-conditionals (scores) of the model, instead of the full gradient as in score matching. CSM training is particularly efficient when the model is represented directly in terms of these univariate conditional scores.
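To make the contrast concrete, the display below sketches, in our own notation, the standard score-matching (Fisher divergence) objective next to a per-dimension composite analogue built from univariate conditional scores. The precise CSM divergence and its tractable training objective are developed later in the paper; this should be read as a schematic rather than the formal definition.

```latex
% Chain rule: p(x) = \prod_{d=1}^{D} p(x_d \mid x_{<d}), with x_{<d} = (x_1, \dots, x_{d-1}).

% Score matching compares full gradients (scores) under the Fisher divergence:
D_F(p \,\|\, q_\theta)
  = \tfrac{1}{2}\, \mathbb{E}_{p(\mathbf{x})}\!\left[
      \left\lVert \nabla_{\mathbf{x}} \log p(\mathbf{x})
        - \nabla_{\mathbf{x}} \log q_\theta(\mathbf{x}) \right\rVert_2^2 \right]

% A composite, per-dimension analogue compares one scalar conditional score per dimension:
D_{\mathrm{CSM}}(p \,\|\, q_\theta)
  = \sum_{d=1}^{D} \tfrac{1}{2}\, \mathbb{E}_{p(\mathbf{x})}\!\left[
      \left( \partial_{x_d} \log p(x_d \mid \mathbf{x}_{<d})
        - \partial_{x_d} \log q_\theta(x_d \mid \mathbf{x}_{<d}) \right)^2 \right]
```

Because each squared term involves only a scalar derivative, Hyvärinen's integration-by-parts argument can be applied dimension-wise, so training never requires the trace of the full $D \times D$ Hessian.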
Such a parameterization is similar to a traditional autoregressive model, but with the advantage that conditional scores, unlike conditional distributions, do not need to be normalized. As with EBMs, removing the normalization constraint increases the flexibility of the model families that can be used. Leveraging existing and well-established autoregressive models, we design architectures where we can evaluate all dimensions in parallel for efficient training. During training, our CSM divergence can be optimized directly without the need for approximations [15, 25], surrogate losses [11], adversarial training [5], or extra sampling [3]. We show with extensive experimental results that our method can be used for density estimation, data generation, image denoising, and anomaly detection. We also illustrate that CSM can provide the accurate score estimation required for variational inference with implicit distributions [8, 25], achieving better likelihoods and FID scores [7] than other training methods on image datasets.

## 2 Background

Given i.i.d. samples $\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\} \subset \mathbb{R}^D$ from some unknown data distribution $p(\mathbf{x})$, we want to learn an unnormalized density $\tilde{q}_\theta(\mathbf{x})$ as a parametric approximation to $p(\mathbf{x})$. The unnormalized $\tilde{q}_\theta(\mathbf{x})$ uniquely defines the following normalized probability density:
$$
q_\theta(\mathbf{x}) = \frac{\tilde{q}_\theta(\mathbf{x})}{Z(\theta)}, \qquad Z(\theta) = \int \tilde{q}_\theta(\mathbf{x})\, \mathrm{d}\mathbf{x}, \tag{1}
$$
where $Z(\theta)$, the partition function, is generally intractable.

### 2.1 Autoregressive Energy Machine

To learn an unnormalized probabilistic model, [15] proposes to approximate the normalizing constant using one-dimensional importance sampling. Specifically, let $\mathbf{x} = (x_1, \dots, x_D) \in \mathbb{R}^D$. They first learn a set of one-dimensional conditional energies $E_\theta(x_d \mid \mathbf{x}_{<d})$.
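The normalizing-constant approximation of [15] can be illustrated with a short, self-contained sketch. The quadratic energy, the Gaussian proposal, and the helper name `estimate_log_Z` below are toy stand-ins of our own (the method in [15] parameterizes the conditional energies with neural networks); the point is only how a one-dimensional integral is estimated by importance sampling.

```python
import torch

def estimate_log_Z(energy_fn, context, proposal, n_samples=512):
    """One-dimensional importance-sampling estimate of
    log Z(x_{<d}) = log \int exp(-E(x_d | x_{<d})) dx_d.

    energy_fn : callable (x_d, context) -> energies of shape (n_samples,)  [assumed interface]
    context   : tensor holding the conditioning variables x_{<d}
    proposal  : torch.distributions object over x_d (e.g. a Gaussian)
    """
    x = proposal.sample((n_samples,))                       # x_d^(s) ~ proposal
    log_w = -energy_fn(x, context) - proposal.log_prob(x)   # log importance weights
    # log Z ≈ log (1/S) Σ_s exp(log_w_s), computed stably with logsumexp
    return torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(n_samples)))

# Toy usage: a quadratic energy E(x_d | x_{<d}) = (x_d - mean(x_{<d}))^2 / 2,
# whose true log Z is 0.5 * log(2π) ≈ 0.9189, independent of the context.
energy = lambda x, ctx: 0.5 * (x - ctx.mean()) ** 2
context = torch.tensor([0.3, -1.2])
proposal = torch.distributions.Normal(loc=context.mean(), scale=2.0)
print(estimate_log_Z(energy, context, proposal))  # ≈ 0.92 up to Monte Carlo error
```

With $\log Z(\mathbf{x}_{<d})$ in hand, the normalized conditional log-density follows as $\log q(x_d \mid \mathbf{x}_{<d}) = -E(x_d \mid \mathbf{x}_{<d}) - \log Z(\mathbf{x}_{<d})$; the quality of the estimate depends on how well the proposal covers the conditional.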