# Arbitrary Conditional Distributions with Energy

Ryan R. Strauss
Department of Computer Science
UNC at Chapel Hill
Chapel Hill, NC 27514
rrs@cs.unc.edu

Junier B. Oliva
Department of Computer Science
UNC at Chapel Hill
Chapel Hill, NC 27514
joliva@cs.unc.edu

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## Abstract

Modeling distributions of covariates, or density estimation, is a core challenge in unsupervised learning. However, the majority of work only considers the joint distribution, which has limited utility in practical situations. A more general and useful problem is arbitrary conditional density estimation, which aims to model any possible conditional distribution over a set of covariates, reflecting the more realistic setting of inference based on prior knowledge. We propose a novel method, Arbitrary Conditioning with Energy (ACE), that can simultaneously estimate the distribution $p(x_u \mid x_o)$ for all possible subsets of unobserved features $x_u$ and observed features $x_o$. ACE is designed to avoid unnecessary bias and complexity: we specify densities with a highly expressive energy function and reduce the problem to learning only one-dimensional conditionals (from which more complex distributions can be recovered during inference). This results in an approach that is both simpler and higher-performing than prior methods. We show that ACE achieves state-of-the-art results for arbitrary conditional likelihood estimation and data imputation on standard benchmarks.

## 1 Introduction

Density estimation, a core challenge in machine learning, attempts to learn the probability density of some random variables given samples from their true distribution. The vast majority of work on density estimation focuses on the joint distribution $p(x)$ [10, 4, 25, 11, 23, 22, 6], i.e., the distribution of all variables taken together. While the joint distribution can be useful (e.g., to learn the distribution of pixel configurations that represent human faces), it is limited in the types of predictions it can make. We are often more interested in conditional probabilities, which communicate the likelihood of an event given some prior information. For example, given a patient's medical history and symptoms, a doctor determines the likelihoods of different illnesses and other patient attributes. Conditional distributions are often more practical since real-world decisions are nearly always informed by prior information.

However, we often do not know ahead of time which conditional distribution (or distributions) will be of interest. That is, we may not know which features will be known (observed) or which will be inferred (unobserved). For example, not every patient that a doctor sees will have had the same tests performed (i.e., have the same observed features). A naïve approach requires building an exponential number of models to cover all possible cases (one for every conditional distribution), which quickly becomes intractable. Thus, an intelligent system needs to understand the intricate conditional dependencies between all arbitrary subsets of covariates, and it must do so with a single model to be practical.

In this work, we consider the problem of learning the conditional distribution $p(x_u \mid x_o)$ for any arbitrary subsets of unobserved variables $x_u \in \mathbb{R}^{|u|}$ and observed variables $x_o \in \mathbb{R}^{|o|}$, where $u, o \subseteq \{1, \ldots, d\}$ and $o \cap u = \emptyset$. We propose a method, Arbitrary Conditioning with Energy (ACE), that can assess any conditional distribution over any subset of random variables using a single model.
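To make the conditioning notation concrete, the short sketch below (purely illustrative, not part of ACE itself) encodes a query $p(x_u \mid x_o)$ with a binary observability mask and counts how many distinct observed subsets a naïve one-model-per-query approach would have to cover; the feature values and mask are arbitrary.

```python
import numpy as np

# A hypothetical conditional query over a d-dimensional feature vector.
x = np.array([23.0, 2.0, 0.0, 8.0, 0.0, 0.0])         # feature values
observed = np.array([1, 1, 0, 1, 0, 0], dtype=bool)   # 1 = observed, 0 = unobserved

o = np.where(observed)[0]    # indices of observed features (x_o)
u = np.where(~observed)[0]   # indices of unobserved features (x_u)
print(f"query: p(x_{list(u)} | x_{list(o)})")

# A separate model per conditioning pattern would require one model for
# every possible observed subset, i.e., 2^d models.
d = x.size
print(f"number of possible observed subsets: 2^{d} = {2 ** d}")
```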
ACE is developed with an eye for simplicity: we reduce the arbitrary conditioning problem to the estimation of one-dimensional conditional densities (with arbitrary observations), and we represent densities with an energy function, which fully specifies unnormalized distributions and can be instantiated as any arbitrary neural network. We evaluate ACE on benchmark datasets and show that it outperforms current methods for arbitrary conditional/marginal density estimation. ACE remains effective when trained on data with missing values, making it applicable to real-world datasets that are often incomplete, and we find that ACE scales well to high-dimensional data. Also, unlike some prior methods (e.g., [20]), ACE can naturally model data with both continuous and discrete values.

Our contributions are as follows: 1) We develop the first energy-based approach to arbitrary conditional density estimation, which eliminates restrictive biases (e.g., normalizing flows, Gaussian mixtures) imposed by common alternatives. 2) We empirically demonstrate that ACE is state-of-the-art for arbitrary conditional density estimation and data imputation. 3) We find that complicated prior approaches can be easily outperformed with a simple scheme that uses mixtures of Gaussians and fully-connected networks.

## 2 Previous Work

Several methods have been previously proposed for arbitrary conditioning. Sum-Product Networks are specially designed to contain only sum and product operations and can produce arbitrary conditional or marginal likelihoods [27, 3]. The Universal Marginalizer trains a neural network with a cross-entropy loss to approximate the marginal posterior distributions of all unobserved features conditioned on the observed ones [5]. VAEAC extends a conditional variational autoencoder by only considering the latent codes of unobserved dimensions [14], and Neural Conditioner uses adversarial training to learn each conditional distribution [1]. DMFA uses factor analysis to have a neural network output the parameters of a conditional Gaussian density for the missing features given the observed ones [28]. The current state-of-the-art is ACFlow, which extends normalizing flow models to handle any subset of observed features [20].

Unlike VAEAC, ACE does not suffer from mode collapse or blurry samples. ACE is also able to provide likelihood estimates, unlike Neural Conditioner, which only produces samples. DMFA is limited to learning multivariate Gaussians, which impose bias and are harder to model than one-dimensional conditionals. While ACFlow can analytically produce normalized likelihoods and samples, it is restricted by the requirement that its network consist of bijective transformations with tractable Jacobian determinants. Similarly, Sum-Product Networks have limited expressivity due to their structural constraints. ACE, on the other hand, exemplifies the appeal of energy-based methods, as it places no constraints on the parameterization of the energy function.

Energy-based methods have a wide range of applications within machine learning [17], and recent work has studied their utility for density estimation. Deep energy estimator networks [30] and Autoregressive Energy Machines [22] are both energy-based models that perform density estimation. However, both of these methods are only able to estimate the joint distribution.
Much of the previous work on density estimation relies on an autoregressive decomposition of the joint density according to the chain rule of probability. Often, the model only considers a single (arbitrary) ordering of the features [22, 24]. Uria et al. [33] proposed a method for assessing joint likelihoods under any arbitrary ordering, using masked network inputs to effectively share weights between a combinatorial number of models. Germain et al. [8] also consider a shared network for joint density estimation, with a constrained architecture that enforces the autoregressive property of joint likelihoods. In this work, we construct an order-agnostic weight-sharing technique not for joint likelihoods, but for arbitrary conditioning. Moreover, we use our weight-sharing scheme to estimate likelihoods with an energy-based approach, which avoids the limitations of the parametric families used previously (e.g., mixtures of Gaussians [33] or Bernoullis [8]). Query Training [16] is a method for answering probabilistic queries. It also takes the approach of computing one-dimensional conditional likelihoods, but it does not directly pursue an autoregressive extension of that ability.

The problem of imputing missing data has been well studied, and there are several approaches based on classic machine learning techniques such as k-nearest neighbors [32], random forests [31], and autoencoders [9]. More recent work has turned to deep generative models. GAIN is a generative adversarial network (GAN) that produces imputations with its generator and uses its discriminator to discern the imputed features [34]. Another GAN-based approach is MisGAN, which learns two generators to model the data and the masks separately [19]. MIWAE adapts variational autoencoders by modifying the lower bound to account for missing data and produces imputations with importance sampling [21]. ACFlow can also perform imputation and is state-of-the-art for imputing data that are missing completely at random (MCAR) [20]. While it is not always the case that data are missing at random, the opposite case (i.e., missingness that depends on unobserved feature values) can be much more challenging to deal with [7]. Like many data imputation methods, we focus on the scenario where data are MCAR, that is, where the likelihood of a value being missing is independent of the covariates' values.

## 3 Background

### 3.1 Arbitrary Conditional Density Estimation

A probability density function (PDF), $p(x)$, outputs a nonnegative scalar for a given vector input $x \in \mathbb{R}^d$ and satisfies $\int p(x)\, dx = 1$. Given a dataset $\mathcal{D} = \{x^{(i)}\}_{i=1}^{N}$ of i.i.d. samples drawn from an unknown distribution $p^\ast(x)$, the goal of density estimation is to find a model that best approximates the function $p^\ast$. Modern approaches generally rely on neural networks to directly parameterize the approximated PDF.

Arbitrary conditional density estimation is a more general task where we estimate the conditional density $p^\ast(x_u \mid x_o)$ for all possible subsets of observed features $o \subseteq \{1, \ldots, d\}$ (i.e., features whose values are known) and corresponding subsets of unobserved features $u \subseteq \{1, \ldots, d\}$ such that $o \cap u = \emptyset$. Here $x_o \in \mathbb{R}^{|o|}$ and $x_u \in \mathbb{R}^{|u|}$. The estimation of joint or marginal likelihoods is recovered when $o$ is the empty set. Note that marginalization to obtain $p(x_o) = \int p(x_u, x_o)\, dx_u$ (and hence $p(x_u \mid x_o) = p(x_u, x_o)/p(x_o)$) is intractable in performant generative models such as normalizing flows [4, 20]; thus, we propose a weight-sharing scheme below.
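As an illustration of how a single set of weights can serve every conditioning pattern, the sketch below feeds a generic fully-connected network the feature vector with unobserved entries zeroed out, concatenated with the binary observability mask. This is a common convention for arbitrary conditioning models and is only a toy stand-in; the network, its sizes, and its outputs are assumptions, not the exact ACE architecture described in Section 4.

```python
import torch
import torch.nn as nn

class MaskedConditionalNet(nn.Module):
    """Toy network sharing weights across all conditioning patterns.

    Input:  concat(x * mask, mask), with unobserved values zeroed out.
    Output: per-dimension statistics for the unobserved features
            (e.g., parameters of a one-dimensional conditional for each x_{u_i}).
    """

    def __init__(self, d: int, hidden: int = 128, out_per_dim: int = 2):
        super().__init__()
        self.d = d
        self.out_per_dim = out_per_dim
        self.net = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d * out_per_dim),
        )

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        inp = torch.cat([x * mask, mask], dim=-1)
        return self.net(inp).view(-1, self.d, self.out_per_dim)

# Example query: condition on features {0, 1, 3} of a 6-dimensional vector.
net = MaskedConditionalNet(d=6)
x = torch.tensor([[23.0, 2.0, 0.0, 8.0, 0.0, 0.0]])
mask = torch.tensor([[1.0, 1.0, 0.0, 1.0, 0.0, 0.0]])
stats = net(x, mask)  # shape (1, 6, 2); rows for unobserved dims are the ones used
```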
### 3.2 Energy-Based Models

Energy-based models capture dependencies between variables by assigning a nonnegative scalar energy to a given arrangement of those variables, where energies closer to zero indicate more desirable configurations [17]. Learning consists of finding an energy function that outputs low energies for correct values. We can frame density estimation as an energy-based problem by writing likelihoods as a Boltzmann distribution:

$$p(x) = \frac{e^{-E(x)}}{Z}, \qquad Z = \int e^{-E(x)}\, dx, \tag{1}$$

where $E$ is the energy function, $e^{-E(x)}$ is the unnormalized likelihood, and $Z$ is the normalizing constant. Energy-based models are appealing due to their relative simplicity and high flexibility in the choice of representation for the energy function. This is in contrast to other common approaches to density estimation, such as normalizing flows [4, 20], which require invertible transformations with Jacobian determinants that can be computed efficiently. Energy functions are also naturally capable of representing non-smooth distributions with low-density regions or discontinuities.

## 4 Arbitrary Conditioning with Energy

We are interested in approximating the probability density $p(x_u \mid x_o)$ for any arbitrary sets of unobserved features $x_u$ and observed features $x_o$. We approach this by decomposing likelihoods into products of one-dimensional conditionals, which makes the learned distributions much simpler. This concept is a basic application of the chain rule of probability, but it has yet to be thoroughly exploited for arbitrary conditioning. During training, ACE estimates distributions of the form $p(x_{u_i} \mid x_o)$, where $x_{u_i}$ is a scalar. During inference, more complex distributions can then be recovered with an autoregressive decomposition:

$$p(x_u \mid x_o) = \prod_{i=1}^{|u|} p(x_{u_i} \mid x_o, x_{u_1}, \ldots, x_{u_{i-1}}).$$

[Figure 1: Overview of the networks used in ACE: (a) the Proposal Network and (b) the Energy Network. The plus symbol refers to concatenation.]
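To make the energy-based, one-dimensional view concrete, here is a minimal numerical sketch of evaluating $\log p(x_u \mid x_o)$ with an energy function. A hand-written toy energy stands in for ACE's learned energy network, and a fixed Gaussian proposal with importance sampling stands in for the learned proposal network when estimating each one-dimensional normalizing constant; all function names, shapes, and constants here are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_energy(v, x_cond):
    """Stand-in for a learned energy network E(v; x_o, mask).
    Low energy near the mean of the conditioning values (arbitrary toy choice)."""
    return 0.5 * (v - x_cond.mean()) ** 2

def log_conditional_1d(value, x_cond, energy, n_samples=10_000, proposal_scale=3.0):
    """Estimate log p(value | x_cond) for one unobserved dimension.

    The normalizer Z = \int exp(-E(v)) dv is estimated by importance sampling
    with a Gaussian proposal q(v); ACE instead learns its proposal distribution.
    """
    mu = x_cond.mean()
    v = rng.normal(mu, proposal_scale, size=n_samples)                  # v ~ q
    log_q = -0.5 * ((v - mu) / proposal_scale) ** 2 - np.log(
        proposal_scale * np.sqrt(2 * np.pi))
    log_w = -energy(v, x_cond) - log_q                                  # log[e^{-E}/q]
    log_Z = np.logaddexp.reduce(log_w) - np.log(n_samples)              # Monte Carlo Z
    return -energy(value, x_cond) - log_Z

# Autoregressive composition: log p(x_u | x_o) = sum_i log p(x_{u_i} | x_o, x_{u_<i}).
x = np.array([23.0, 2.0, 0.0, 8.0, 1.5, -0.3])
observed = np.array([1, 1, 0, 1, 0, 0], dtype=bool)

log_p = 0.0
x_cond = x[observed]
for i in np.where(~observed)[0]:
    log_p += log_conditional_1d(x[i], x_cond, toy_energy)
    x_cond = np.append(x_cond, x[i])   # previously scored dims join the conditioning set
print("estimated log p(x_u | x_o) =", log_p)
```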