# Neural Inverse Transform Sampler

Henry Li¹, Yuval Kluger¹

¹ Department of Applied Mathematics, Yale University, New Haven, CT. Correspondence to: Henry Li.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

## Abstract

Any explicit functional representation $f$ of a density is hampered by two main obstacles when we wish to use it as a generative model: designing $f$ so that sampling is fast, and estimating $Z = \int f$ so that $Z^{-1} f$ integrates to 1. This becomes increasingly complicated as $f$ itself becomes complicated. In this paper, we show that when modeling one-dimensional conditional densities with a neural network, $Z$ can be exactly and efficiently computed by letting the network represent the cumulative distribution function of a target density and applying a generalized fundamental theorem of calculus. We also derive a fast algorithm for sampling from the resulting representation by the inverse transform method. By extending these principles to higher dimensions, we introduce the Neural Inverse Transform Sampler (NITS), a novel deep learning framework for modeling and sampling from general, multidimensional, compactly supported probability densities. NITS is a highly expressive density estimator that boasts end-to-end differentiability, fast sampling, and exact and cheap likelihood evaluation. We demonstrate the applicability of NITS on realistic, high-dimensional density estimation tasks: likelihood-based generative modeling on the CIFAR-10 dataset, and density estimation on the UCI suite of benchmark datasets, where NITS produces compelling results rivaling or surpassing the state of the art.

## 1. Introduction

Building efficient, highly expressive generative density estimators has been a long-standing challenge in machine learning. A sufficiently powerful estimator would have far-reaching effects on a bevy of inference tasks, ranging from classification to generative modeling to missing-value imputation.

Density estimation of continuous random variables revolves around a fundamental trade-off between expressivity and tractability, which is most concretely related to the computation of the partition function $Z_\theta = \int_\Omega f_\theta$ for a positive function $f_\theta : \Omega \to \mathbb{R}_+$. The general rule of thumb: the more flexible the underlying $f_\theta$, the more difficult the estimation of $Z_\theta$.

For instance, the most general model is perhaps the energy-based model (EBM), where the desired density $\nu$ is approximated via an energy landscape, for example the Boltzmann (or Gibbs) distribution

$$
\nu_\theta(x) = \frac{e^{-f_\theta(x)}}{\int_\Omega e^{-f_\theta(x)}\,dx}, \tag{1}
$$

where $f_\theta$ may be an arbitrary function. The problem with this formulation is that analytic integration of functions of general form is unknown, and the cost of numerical integration techniques scales exponentially with the dimension of $\Omega$. Thus there are few, if any, approaches that represent an arbitrary $f_\theta$ in high dimension, though some methods may estimate $f_\theta$ via a variational lower bound (Sohl-Dickstein et al., 2015; Du & Mordatch, 2019).

On the other hand, one may obtain $Z_\theta$ through direct analytic derivation, but this is only applicable to a small handful of distributions. (Theis et al., 2012) and (Uria et al., 2016), for example, perform density estimation via Gaussian mixture and Gaussian scale mixture models, which both have well-known partition functions. Ultimately, most approaches avoid explicit estimation of the density.
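The exponential cost of grid-based normalization can be made concrete with a short sketch. This is our own illustration, not code from the paper; the toy energy function, the grid size `m`, and all names are assumptions made purely for demonstration.

```python
# Illustrative sketch (not code from the paper): grid quadrature for the
# partition function Z = integral of f over [0, 1]^d requires m**d evaluations
# of f, which is why explicitly normalizing a general f_theta is intractable
# in high dimension. The toy energy function below is an arbitrary stand-in.
import itertools
import numpy as np

def f_theta(x):
    # A positive, unnormalized density on [0, 1]^d (a stand-in "energy" model).
    return np.exp(-np.sum((x - 0.5) ** 2))

def partition_function(d, m=10):
    """Midpoint-rule estimate of Z over an m^d grid on [0, 1]^d."""
    centers = (np.arange(m) + 0.5) / m        # m cell midpoints per axis
    cell_volume = (1.0 / m) ** d
    total = sum(f_theta(np.array(p)) for p in itertools.product(centers, repeat=d))
    return total * cell_volume                # m**d evaluations of f_theta

print(partition_function(d=2))                # 10**2 = 100 evaluations
print(partition_function(d=5))                # 10**5 = 100,000 evaluations; d = 30 would need 10**30
```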
Normalizing flows (Dinh et al., 2014; Rezende & Mohamed, 2015) learn the transform between the desired density and a reference density (usually Gaussian) and compute $\nu$ via the probabilistic change-of-variables formula, which does not require $Z$, while score-based models (Song et al., 2020) work with the gradient of the log-likelihood, which again does not depend on the partition function. These techniques for sidestepping the computation of $Z$ each have drawbacks, which we will explore in Section 4.
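As a brief aside, the following two identities spell out why neither quantity involves the partition function. They are standard results restated here for context, not equations from the paper; the invertible transform $g$ and reference density $\mu$ are notation introduced only for this aside. A flow with $x = g(z)$, $z \sim \mu$, evaluates its model density via the change-of-variables formula, and the score of the Gibbs density in Eq. (1) drops $Z_\theta$ because the partition function does not depend on $x$:

$$
\nu(x) = \mu\big(g^{-1}(x)\big)\,\big|\det J_{g^{-1}}(x)\big|,
\qquad
\nabla_x \log \nu_\theta(x) = -\nabla_x f_\theta(x) - \underbrace{\nabla_x \log Z_\theta}_{=\,0} = -\nabla_x f_\theta(x).
$$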
Directly approximating a pdf $\nu$ with a parametric function $f_\theta$ has remained an elusive task. This is also the approach we take with the proposed Neural Inverse Transform Sampler (NITS).

NITS is a deep neural network augmented by two basic ideas: fast integration via the gradient theorem, and fast sampling via the inverse transform method. In the one-dimensional case, we leverage the fundamental theorem of calculus to normalize a neural network's output $F_\theta(x)$ so that it is the cumulative distribution function (cdf) of a 1D probability distribution over a bounded interval $[A, B]$. This also enforces the neural network derivative $\partial_x F_\theta(x)$ to be the corresponding probability density function (pdf). Fast evaluation of the cdf means fast sampling from the induced distribution using the inverse transform method. Furthermore, the parameters of the distribution can be trained via gradient descent, as the framework is end-to-end differentiable. This idea is straightforwardly extended to higher dimensions, where a multi-dimensional probabilistic model is composed of multiple one-dimensional models whose statistics are computed either in parallel or in sequence, according to assumptions on the correlation structure of the random variables. See Figure 1 for a graphical breakdown. In Section 2 we explore these ideas in greater detail.

*Figure 1.* The integration trick, used here to enable direct representation of a conditional density $\nu_\theta(x \mid y = c)$. The original neural network (left) is differentiated w.r.t. $x$ (middle), then rescaled by the partition function (right), which we compute directly by evaluating $F_\theta$ at the boundaries of $[A, B]$. All operations are restricted to the line $y = c$ (for more details, see Sections 2.1 and 2.3).

**Our Contributions** In this work, we demonstrate two novel computational ideas: a method for obtaining cheap and exact integrals of 1D neural network functions, which we call the integration trick, and an architecture for density estimation that allows for fast and accurate sampling via the inverse transform method. These techniques enable the design of a deep learning framework for generalized density estimation of multi-dimensional, compactly supported, continuous-valued random variables, which we call the Neural Inverse Transform Sampler¹. NITS is (to our knowledge) the first framework to provide explicit² representations for densities with non-analytical partition functions. We apply the resulting framework in two scenarios: first, in a generative autoregressive density model on natural images; and second, on the UCI suite of density estimation benchmark datasets. We report competitive results in both settings. Finally, we corroborate our empirical results by demonstrating that our model can universally approximate compactly supported densities of continuous random variables.

¹ Code available at github.com/lihenryhfl/NITS.
² Explicit, in the sense that the density is directly represented by a neural network.

## 2. Probabilistically Normalized Networks

We consider the task of representing a parametric family of compactly supported probability densities on $[A, B]^n \subset \mathbb{R}^n$, $A, B \in \mathbb{R}$. An ideal model should possess the following properties: (1) high expressiveness, (2) fast sampling, (3) fast computation of $\nu_\theta$, and (4) end-to-end differentiability. In this section we propose the probabilistically normalized network (PNN), which induces such a family. We will start from the one-dimensional case, introducing the modeling framework in Section 2.1 and the sampling framework in Section 2.2. We will then generalize to the higher-dimensional case in Section 2.3 in two ways: first, assuming that the coordinates of $X = (X_1, \ldots, X_n)^T$ are independent, and second, that they are autoregressively correlated.

### 2.1. Modeling a Compactly Supported 1D Distribution

We begin with a continuous, compactly supported, one-dimensional random variable $X \sim \nu_\theta$. We would like to express $\nu_\theta : [A, B] \to \mathbb{R}$ as some neural network function parameterized by $\theta$. To be a valid density, $\nu_\theta$ must satisfy two conditions.

**Condition 1: Integration to Unity.** That is,

$$
\int_A^B \nu_\theta(x)\,dx = 1. \tag{2}
$$

For this to hold, we choose to form the decomposition $\nu_\theta = f_\theta / Z_\theta$, where $f_\theta$ can now be an arbitrary neural network, and $Z_\theta = \int_{\mathbb{R}} f_\theta(x)\,dx$ is the partition function that normalizes $f_\theta$ so that the resulting $\nu_\theta$ integrates to 1. Unfortunately, when $f_\theta$ is a simple neural network, any integral, including $Z_\theta$, can only be computed via numerical quadrature (e.g., neural ODE frameworks (Chen et al., 2018a; Wehenkel & Louppe, 2019)), which is expensive in low dimensions and virtually intractable in higher dimensions.

However, by letting $f_\theta$ be the derivative of a neural network, we can compute $\nu$ in $O(1)$ time (with respect to the size of the integration region). To do this, we leverage the crucial observation that, in one dimension, whenever there exists another function $F_\theta : \mathbb{R} \to \mathbb{R}$ such that

$$
\partial_x F_\theta(x) = f_\theta(x) \quad \text{for all } x \in \mathbb{R}, \tag{3}
$$

the integral of $f_\theta$ over any bounded interval $[a, b]$ can be evaluated via the formula

$$
\int_a^b f_\theta(t)\,dt = F_\theta(b) - F_\theta(a), \tag{4}
$$

due to the fundamental theorem of calculus. Consequently, when $F_\theta(x)$ is a neural network and $f_\theta(x) := \partial_x F_\theta(x)$ its derivative, the required assumption Eq. (3) is automatically satisfied, and the pdf $\nu_\theta$ becomes computable via

$$
\nu_\theta(x) = \frac{f_\theta(x)}{F_\theta(B) - F_\theta(A)}. \tag{5}
$$

As the denominator is $Z_\theta = \int_A^B f_\theta(x)\,dx$, Condition 1 (i.e., Eq. 2) must hold. This technique, which we term the integration trick, is cheap, exact, and preserves gradients of $F_\theta$ and $f_\theta$ with respect to both $\theta$ and $x$. Moreover, it is now clear that $F_\theta$, shifted and divided by the same denominator as in Eq. 5, must be the cdf corresponding to the density $\nu_\theta$:

$$
N_\theta(x) = \frac{F_\theta(x) - F_\theta(A)}{F_\theta(B) - F_\theta(A)}, \tag{6}
$$

where we subtract $F_\theta(A)$ so that $N_\theta(A) = 0$. Additionally, now that we have defined $N_\theta(x)$, we note that we can obtain $\nu_\theta$ directly via $\nu_\theta(x) = \partial_x N_\theta(x)$ (we previously obtained $\nu_\theta$ by scaling the gradient of $F_\theta$). This is because the differentiation and rescaling operators commute.

**Condition 2: Positivity.** That is,

$$
\nu_\theta(x) > 0 \quad \text{for all } x \in [A, B]. \tag{7}
$$

This is equivalent to requiring that

$$
\partial_x F_\theta(x) > 0 \quad \text{for all } x \in [A, B], \tag{8}
$$

as $\nu_\theta$ is proportional to $\partial_x F_\theta$ up to a factor $Z_\theta$, which is always positive. There are many ways of enforcing Eq. 8; we take the simple approach of using positive monotonic activations (e.g., sigmoid, tanh, ReLU) and enforcing the positivity of the weights of the neural network $F_\theta$. This is sufficient for Condition 2, given the following lemma:

**Lemma 1.** *A fully connected neural network $F_\theta : \mathbb{R}^n \to \mathbb{R}$ with positive weights and positive monotonic activations has directional derivatives $f_\theta := \partial_{x_i} F_\theta > 0$ for all $i = 1, \ldots, n$.*

**Probabilistically Normalized Network** As both $\nu_\theta$ and $N_\theta(x)$ describe a rescaled neural network (scaled by the same constant $Z_\theta^{-1} = (F_\theta(B) - F_\theta(A))^{-1}$), we call the resulting function a probabilistically normalized network, or PNN.
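The construction in Section 2.1 can be summarized in a few lines of code. The following is a minimal PyTorch sketch of a 1D PNN, not the authors' released implementation (that code lives at github.com/lihenryhfl/NITS); the softplus reparameterization used to keep the weights positive, the layer sizes, and all class and method names are our own illustrative choices.

```python
# Minimal 1D PNN sketch: a monotone network F_theta on [A, B], its cdf N_theta
# (Eq. 6), and its pdf nu_theta = d/dx N_theta obtained exactly by autograd.
# Weight positivity is enforced via softplus (one of several possible choices);
# sigmoid activations are positive and monotonic, as required by Lemma 1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositiveLinear(nn.Module):
    """Linear layer whose effective weight softplus(W) is strictly positive."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(0.1 * torch.randn(d_out, d_in))
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        return F.linear(x, F.softplus(self.weight), self.bias)

class PNN1D(nn.Module):
    def __init__(self, A=-1.0, B=1.0, hidden=32):
        super().__init__()
        self.A, self.B = A, B
        self.net = nn.Sequential(
            PositiveLinear(1, hidden), nn.Sigmoid(),
            PositiveLinear(hidden, hidden), nn.Sigmoid(),
            PositiveLinear(hidden, 1),
        )

    def F_raw(self, x):
        """Unnormalized F_theta(x); monotone increasing in x."""
        return self.net(x.unsqueeze(-1)).squeeze(-1)

    def cdf(self, x):
        """N_theta(x) = (F(x) - F(A)) / (F(B) - F(A)), Eq. (6)."""
        FA = self.F_raw(torch.full_like(x, self.A))
        FB = self.F_raw(torch.full_like(x, self.B))
        return (self.F_raw(x) - FA) / (FB - FA)

    def pdf(self, x):
        """nu_theta(x) = d/dx N_theta(x), computed by differentiating the cdf."""
        x = x.detach().requires_grad_(True)
        (grad,) = torch.autograd.grad(self.cdf(x).sum(), x, create_graph=True)
        return grad

model = PNN1D()
x = torch.linspace(-0.9, 0.9, 5)
print(model.cdf(x))   # monotone values in [0, 1]
print(model.pdf(x))   # positive; end-to-end differentiable in theta
```

Under these assumptions, training would amount to maximizing `model.pdf(x_data).log()` over a dataset by gradient descent, since the whole construction is differentiable in $\theta$.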
### 2.2. Sampling from a Compactly Supported 1D Distribution

Sampling from the PNN-induced distribution is now straightforward, given our cheap access to $N_\theta$. We can use the well-known inverse transform theorem, which we state below for completeness.

**Lemma 2 (Inverse Transform Theorem).** *Let $N_\theta$ be as defined above, and let $N_\theta^{-1}(y)$, $y \in [0, 1]$, denote its (generalized) inverse, i.e.,*

$$
N_\theta^{-1}(y) = \min\{x : N_\theta(x) \geq y\}, \quad y \in [0, 1]. \tag{9}
$$

*If we let $\mu$ be the Lebesgue density on $[0, 1]$, then $\nu_\theta$ is its pushforward density via the transform $N_\theta^{-1}$, i.e.,*

$$
\nu_\theta = N_\theta^{-1}(\mu). \tag{10}
$$

Thus, sampling $x \sim \nu_\theta$ is a simple two-step process:

1. draw $z \sim \mathrm{Unif}[0, 1]$,
2. compute $x = N_\theta^{-1}(z)$,

where $N_\theta^{-1}$ can be accurately and efficiently calculated via bisection search, due to the monotonic nature of the cdf $N_\theta$.

### 2.3. Modeling Multi-Dimensional Distributions

Let us now consider the case of modeling $X = (X_1, \ldots, X_n)^T$, an $n > 1$ dimensional, continuous, compactly supported random variable. By making various further assumptions on the correlation structure of the coordinates of $X$, we can straightforwardly extend the 1D formulation of the previous subsection to higher dimensions. Here we consider two cases: a coordinate-wise independent correlation structure, and an autoregressive correlation structure.

**Coordinate-wise Independent Model** Suppose that all elements of $X$ are independent but perhaps not identically distributed. Note that, while simple, this correlation structure is not supported by normalizing flow-based approaches (Dinh et al., 2014; Rezende & Mohamed, 2015) due to their invertibility constraint, which by design requires dependence between coordinates. Let $N_\theta : \mathbb{R}^n \to [0, 1]^n$ be the concatenation of $n$ PNNs (and $\theta$ the concatenation of their parameters), i.e.,

$$
N_\theta = (N_{\theta_1}, \ldots, N_{\theta_n})^T \quad \text{and} \quad \theta = (\theta_1, \ldots, \theta_n)^T. \tag{11}
$$

Then the density of the resulting distribution is simply the determinant of its (diagonal) Jacobian,

$$
\nu_\theta(x_1, \ldots, x_n) = \det\big(J_{N_\theta}(x)\big) = \prod_{i=1}^n \nu_{\theta_i}(x_i),
$$

where $J_f$ denotes the Jacobian of $f$. And since the per-dimension cdf is a 1D PNN, sampling only requires a bisection search as before, and can be performed over all dimensions in parallel.

**Autoregressive Model** Retaining the previous definition of $X$, we consider the case where $X_i$ depends on $X_j$ for all $j < i$. This requires the PNN $F_{\theta_i}$ to take multi-dimensional inputs, i.e., $F_{\theta_i} : \mathbb{R}^i \to [0, 1]$, as we are now modeling the conditional likelihood $P(X_i \mid X_1, \ldots, X_{i-1})$.
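As a companion to the two-step procedure in Section 2.2, here is a minimal sketch of inverse transform sampling by bisection, reusing the hypothetical `PNN1D` class from the sketch above; the iteration count and the batched formulation are our own choices, not specified by the paper.

```python
# Inverse transform sampling from a 1D PNN via bisection (Section 2.2).
# Because N_theta is monotone, bisection on N_theta(x) = z halves the bracket
# [lo, hi] every iteration and converges to N_theta^{-1}(z).
import torch

@torch.no_grad()
def inverse_transform_sample(model, n_samples, n_iters=40):
    z = torch.rand(n_samples)                    # step 1: z ~ Unif[0, 1]
    lo = torch.full((n_samples,), model.A)
    hi = torch.full((n_samples,), model.B)
    for _ in range(n_iters):                     # step 2: x = N_theta^{-1}(z)
        mid = 0.5 * (lo + hi)
        go_right = model.cdf(mid) < z            # root lies in the upper half
        lo = torch.where(go_right, mid, lo)
        hi = torch.where(go_right, hi, mid)
    return 0.5 * (lo + hi)

samples = inverse_transform_sample(PNN1D(), n_samples=1000)
print(samples.min(), samples.max())              # all samples lie in [A, B]
```

In the coordinate-wise independent model of Section 2.3, this search could be run for all $n$ coordinates at once, since each dimension has its own 1D PNN; in the autoregressive model, the coordinates would instead be drawn in sequence, each conditioned on the previously sampled values.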