# Autoregressive Score Matching

Chenlin Meng (Stanford University, chenlin@stanford.edu), Lantao Yu (Stanford University, lantaoyu@cs.stanford.edu), Yang Song (Stanford University, yangsong@cs.stanford.edu), Jiaming Song (Stanford University, tsong@cs.stanford.edu), Stefano Ermon (Stanford University, ermon@cs.stanford.edu)

## Abstract

Autoregressive models use the chain rule to define a joint probability distribution as a product of conditionals. These conditionals need to be normalized, imposing constraints on the functional families that can be used. To increase flexibility, we propose autoregressive conditional score models (AR-CSM), in which we parameterize the joint distribution in terms of the derivatives of univariate log-conditionals (scores), which need not be normalized. To train AR-CSM, we introduce a new divergence between distributions named Composite Score Matching (CSM). For AR-CSM models, this divergence between data and model distributions can be computed and optimized efficiently, requiring no expensive sampling or adversarial training. Compared to previous score matching algorithms, our method is more scalable to high-dimensional data and more stable to optimize. We show with extensive experimental results that it can be applied to density estimation on synthetic data, image generation, image denoising, and training latent variable models with implicit encoders.

## 1 Introduction

Autoregressive models play a crucial role in modeling high-dimensional probability distributions. They have been successfully used to generate realistic images [18, 21], high-quality speech [17], and complex decisions in games [29]. An autoregressive model defines a probability density as a product of conditionals using the chain rule. Although this factorization is fully general, autoregressive models typically rely on simple probability density functions for the conditionals (e.g., a Gaussian or a mixture of logistics) [21] in the continuous case, which limits the expressiveness of the model.

To improve flexibility, energy-based models (EBMs) represent a density in terms of an energy function, which does not need to be normalized. This enables more flexible neural network architectures, but requires new training strategies, since maximum likelihood estimation (MLE) is intractable due to the normalization constant (partition function). Score matching (SM) [9] trains EBMs by minimizing the Fisher divergence (instead of the KL divergence as in MLE) between model and data distributions. It compares distributions in terms of their log-likelihood gradients (scores) and completely circumvents the intractable partition function. However, score matching requires computing the trace of the Hessian matrix of the model's log-density, which is expensive for high-dimensional data [14].

To avoid calculating the partition function without losing scalability in high-dimensional settings, we leverage the chain rule to decompose a high-dimensional distribution matching problem into simpler univariate sub-problems. Specifically, we propose a new divergence between distributions, named Composite Score Matching (CSM), which depends only on the derivatives of univariate log-conditionals (scores) of the model, instead of the full gradient as in score matching. CSM training is particularly efficient when the model is represented directly in terms of these univariate conditional scores.
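To make the contrast concrete, the display below sketches, in our own notation, the standard score-matching (Fisher divergence) objective next to a per-dimension composite analogue built from univariate conditional scores. The precise CSM divergence and its tractable training objective are developed later in the paper; this should be read as a schematic rather than the formal definition.

```latex
% Chain rule: p(x) = \prod_{d=1}^{D} p(x_d \mid x_{<d}), with x_{<d} = (x_1, \dots, x_{d-1}).

% Score matching compares full gradients (scores) under the Fisher divergence:
D_F(p \,\|\, q_\theta)
  = \tfrac{1}{2}\, \mathbb{E}_{p(\mathbf{x})}\!\left[
      \left\lVert \nabla_{\mathbf{x}} \log p(\mathbf{x})
        - \nabla_{\mathbf{x}} \log q_\theta(\mathbf{x}) \right\rVert_2^2 \right]

% A composite, per-dimension analogue compares one scalar conditional score per dimension:
D_{\mathrm{CSM}}(p \,\|\, q_\theta)
  = \sum_{d=1}^{D} \tfrac{1}{2}\, \mathbb{E}_{p(\mathbf{x})}\!\left[
      \left( \partial_{x_d} \log p(x_d \mid \mathbf{x}_{<d})
        - \partial_{x_d} \log q_\theta(x_d \mid \mathbf{x}_{<d}) \right)^2 \right]
```

Because each squared term involves only a scalar derivative, Hyvärinen's integration-by-parts argument can be applied dimension-wise, so training never requires the trace of the full $D \times D$ Hessian.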
Such a parameterization is similar to a traditional autoregressive model, but with the advantage that conditional scores, unlike conditional distributions, do not need to be normalized. As with EBMs, removing the normalization constraint increases the flexibility of the model families that can be used. Leveraging existing and well-established autoregressive models, we design architectures where we can evaluate all dimensions in parallel for efficient training. During training, our CSM divergence can be optimized directly without the need for approximations [15, 25], surrogate losses [11], adversarial training [5], or extra sampling [3]. We show with extensive experimental results that our method can be used for density estimation, data generation, image denoising, and anomaly detection. We also illustrate that CSM can provide the accurate score estimation required for variational inference with implicit distributions [8, 25], achieving better likelihoods and FID scores [7] than other training methods on image datasets.

## 2 Background

Given i.i.d. samples $\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\} \subset \mathbb{R}^D$ from some unknown data distribution $p(\mathbf{x})$, we want to learn an unnormalized density $\tilde{q}_\theta(\mathbf{x})$ as a parametric approximation to $p(\mathbf{x})$. The unnormalized $\tilde{q}_\theta(\mathbf{x})$ uniquely defines the following normalized probability density:
$$
q_\theta(\mathbf{x}) = \frac{\tilde{q}_\theta(\mathbf{x})}{Z(\theta)}, \qquad Z(\theta) = \int \tilde{q}_\theta(\mathbf{x})\, \mathrm{d}\mathbf{x}, \tag{1}
$$
where $Z(\theta)$, the partition function, is generally intractable.

### 2.1 Autoregressive Energy Machine

To learn an unnormalized probabilistic model, [15] proposes to approximate the normalizing constant using one-dimensional importance sampling. Specifically, let $\mathbf{x} = (x_1, \dots, x_D) \in \mathbb{R}^D$. They first learn a set of one-dimensional conditional energies $E_\theta(x_d \mid \mathbf{x}_{<d})$.
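The normalizing-constant approximation of [15] can be illustrated with a short, self-contained sketch. The quadratic energy, the Gaussian proposal, and the helper name `estimate_log_Z` below are toy stand-ins of our own (the method in [15] parameterizes the conditional energies with neural networks); the point is only how a one-dimensional integral is estimated by importance sampling.

```python
import torch

def estimate_log_Z(energy_fn, context, proposal, n_samples=512):
    """One-dimensional importance-sampling estimate of
    log Z(x_{<d}) = log \int exp(-E(x_d | x_{<d})) dx_d.

    energy_fn : callable (x_d, context) -> energies of shape (n_samples,)  [assumed interface]
    context   : tensor holding the conditioning variables x_{<d}
    proposal  : torch.distributions object over x_d (e.g. a Gaussian)
    """
    x = proposal.sample((n_samples,))                       # x_d^(s) ~ proposal
    log_w = -energy_fn(x, context) - proposal.log_prob(x)   # log importance weights
    # log Z ≈ log (1/S) Σ_s exp(log_w_s), computed stably with logsumexp
    return torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(n_samples)))

# Toy usage: a quadratic energy E(x_d | x_{<d}) = (x_d - mean(x_{<d}))^2 / 2,
# whose true log Z is 0.5 * log(2π) ≈ 0.9189, independent of the context.
energy = lambda x, ctx: 0.5 * (x - ctx.mean()) ** 2
context = torch.tensor([0.3, -1.2])
proposal = torch.distributions.Normal(loc=context.mean(), scale=2.0)
print(estimate_log_Z(energy, context, proposal))  # ≈ 0.92 up to Monte Carlo error
```

With $\log Z(\mathbf{x}_{<d})$ in hand, the normalized conditional log-density follows as $\log q(x_d \mid \mathbf{x}_{<d}) = -E(x_d \mid \mathbf{x}_{<d}) - \log Z(\mathbf{x}_{<d})$; the quality of the estimate depends on how well the proposal covers the conditional.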