# Posterior Matching for Arbitrary Conditioning

Ryan R. Strauss (Department of Computer Science, UNC at Chapel Hill, rrs@cs.unc.edu) and Junier B. Oliva (Department of Computer Science, UNC at Chapel Hill, joliva@cs.unc.edu)

## Abstract

Arbitrary conditioning is an important problem in unsupervised learning, where we seek to model the conditional densities p(x_u | x_o) that underlie some data, for all possible non-intersecting subsets o, u ⊂ {1, . . . , d}. However, the vast majority of density estimation focuses only on modeling the joint distribution p(x), in which important conditional dependencies between features are opaque. We propose a simple and general framework, coined Posterior Matching, that enables Variational Autoencoders (VAEs) to perform arbitrary conditioning, without modification to the VAE itself. Posterior Matching applies to the numerous existing VAE-based approaches to joint density estimation, thereby circumventing the specialized models required by previous approaches to arbitrary conditioning. We find that Posterior Matching is comparable or superior to current state-of-the-art methods for a variety of tasks with an assortment of VAEs (e.g., discrete, hierarchical, VaDE).

## 1 Introduction

Variational Autoencoders (VAEs) [21] are a widely adopted class of generative model that have been successfully employed in numerous areas [4, 15, 26, 33, 16]. Much of their appeal stems from their ability to probabilistically represent complex data in terms of lower-dimensional latent codes. Like most other generative models, VAEs are typically designed to model the joint data distribution, which communicates likelihoods for particular configurations of all features at once. This can be useful for some tasks, such as generating images, but the joint distribution is limited by its inability to explicitly convey the conditional dependencies between features. In many cases, conditional distributions, which provide the likelihood of an event given some known information, are more relevant and useful. Conditionals can be obtained in theory by marginalizing the joint distribution, but in practice this is generally not analytically available and is expensive to approximate.

Easily assessing the conditional distribution over any subset of features is important for tasks where decisions and predictions must be made over a varied set of possible information. For example, some medical applications may require reasoning over: the distribution of blood pressure given age and weight; or the distribution of heart rate and blood-oxygen level given age, blood pressure, and BMI; etc. For flexibility and scalability, it is desirable for a single model to provide all such conditionals at inference time. More formally, this task is known as arbitrary conditioning, where the goal is to model the conditional density p(x_u | x_o) for any arbitrary subsets of unobserved features x_u and observed features x_o.

In this work, we show, by way of a simple and general framework, that traditional VAEs can perform arbitrary conditioning, without modification to the VAE model itself. Our approach, which we call Posterior Matching, is to model the distribution p(z | x_o) that is induced by some VAE, where z is the latent code. In other words, we consider the distribution of latent codes given partially observed features. We do this by having a neural network output an approximate partially observed posterior q(z | x_o).
In order to train this network, we develop a straightforward maximum likelihood estimation objective and show that it is equivalent to maximizing p(x_u | x_o), the quantity of interest. Unlike prior works that use VAEs for arbitrary conditioning, we do not make special assumptions or optimize custom variational lower bounds. Rather, training via Posterior Matching is simple, highly flexible, and without limiting assumptions on approximate posteriors (e.g., q(z | x_o) need not be reparameterized and can thus be highly expressive). We conduct several experiments in which we apply Posterior Matching to various types of VAEs for a myriad of different tasks, including image inpainting, tabular arbitrary conditional density estimation, partially observed clustering, and active feature acquisition. We find that Posterior Matching leads to improvements over prior VAE-based methods across the range of tasks we consider.

## 2 Background

**Arbitrary Conditioning** A core problem in unsupervised learning is density estimation, where we are given a dataset D = {x^(i)}_{i=1}^N of i.i.d. samples drawn from an unknown distribution p(x) and wish to learn a model that best approximates the probability density function p. A limitation of only learning the joint distribution p(x) is that it does not provide direct access to the conditional dependencies between features. Arbitrary conditional density estimation [18, 24, 38] is a more general task where we want to estimate the conditional density p(x_u | x_o) for all possible subsets of observed features o ⊂ {1, . . . , d} and unobserved features u ⊂ {1, . . . , d} such that o and u do not intersect. Here, x_o ∈ R^|o| and x_u ∈ R^|u|. Estimation of joint or marginal likelihoods is a special case where o = ∅. Note that, while not strictly necessary for arbitrary conditioning methods [24, 38], we assume D is fully observed, a requirement for training traditional VAEs.

**Variational Autoencoders** Variational Autoencoders (VAEs) [21] are a class of generative models that assume a generative process in which data likelihoods are represented as p(x) = ∫ p(x | z) p(z) dz, where z is a latent variable that typically has lower dimensionality than the data x. A tractable distribution that affords easy sampling and likelihood evaluation, such as a standard Gaussian, is usually imposed on the prior p(z). These models are learned by maximizing the evidence lower bound (ELBO) of the data likelihood:

$$\log p(x) \geq \mathbb{E}_{z \sim q_\psi(\cdot \mid x)}\big[\log p_\phi(x \mid z)\big] - \mathrm{KL}\big(q_\psi(z \mid x) \,\|\, p(z)\big),$$

where q_ψ(z | x) and p_ϕ(x | z) are the encoder (or approximate posterior) and decoder of the VAE, respectively. The encoder and decoder are generally neural networks that output tractable distributions (e.g., a multivariate Gaussian). In order to properly optimize the ELBO, samples drawn from q_ψ(z | x) must be differentiable with respect to the parameters of the encoder (often called the reparameterization trick). After training, a new data point x̂ can be easily generated by first sampling z from the prior and then sampling x̂ ∼ p_ϕ(· | z).
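To ground this background, here is a minimal JAX sketch of the ELBO above, assuming a diagonal-Gaussian encoder, a factorized Bernoulli decoder, and a standard Gaussian prior. The `encode` and `decode` functions are hypothetical placeholders, not part of the paper's released code.

```python
import jax
import jax.numpy as jnp

def elbo(params, x, key, encode, decode):
    # Encoder outputs the mean and log-variance of q_psi(z | x).
    mu, logvar = encode(params["encoder"], x)

    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I), so that
    # the sample is differentiable w.r.t. the encoder parameters.
    eps = jax.random.normal(key, mu.shape)
    z = mu + jnp.exp(0.5 * logvar) * eps

    # Reconstruction term: E_{z ~ q}[log p_phi(x | z)] with a Bernoulli likelihood.
    logits = decode(params["decoder"], z)
    log_px_given_z = jnp.sum(
        x * jax.nn.log_sigmoid(logits) + (1.0 - x) * jax.nn.log_sigmoid(-logits)
    )

    # KL(q_psi(z | x) || N(0, I)) in closed form for a diagonal Gaussian.
    kl = 0.5 * jnp.sum(jnp.exp(logvar) + mu**2 - 1.0 - logvar)

    return log_px_given_z - kl  # lower bound on log p(x), to be maximized
```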
## 3 Posterior Matching

Figure 1: A two-dimensional VAE latent space (left) that represents the distribution of codes of handwritten 3s, 5s, and 8s. Conditioning on the subset of pixels shown in the center should then result in a distribution over latent codes that only correspond to 3s and 8s, as shown on the right.

In this section we describe our framework, coined Posterior Matching, to model the underlying arbitrary conditionals in a VAE. In many respects, Posterior Matching cuts the Gordian knot to uncover the conditional dependencies. Following our insights, we show that our approach is direct and intuitive. Notwithstanding, we are the first to apply this direct methodology for arbitrary conditionals in VAEs and are the first to connect our proposed loss with arbitrary conditional likelihoods p(x_u | x_o). Note that we are not proposing a new type of VAE. Rather, we are formalizing a simple and intuitive methodology that can be applied to numerous existing (or future) VAEs.

### 3.1 Motivation

Let us begin with a motivating example, depicted in Figure 1. Suppose we have trained a VAE on images of handwritten 3s, 5s, and 8s. This VAE has thus learned to represent these images in a low-dimensional latent space. Any given code (vector) in this latent space represents a distribution over images in the original data space, which can be retrieved by passing that code through the VAE's decoder. Some regions in the latent space will contain codes that represent 3s, some will represent 5s, and some will represent 8s.

There is typically only an interest in mapping from a given image x to a distribution over the latent codes that could represent that image, i.e., the posterior q(z | x). However, we can just as easily ask which latent codes are feasible having only observed part of an image. For example, if we only see the right half of the image shown in Figure 1, we know the digit could be a 3 or an 8, but certainly not a 5. Thus, the distribution over latent codes that could correspond to the full image, that is p_ψ(z | x_o) (where ψ is the encoder's parameters), should only include regions that represent 3s or 8s. Decoding any sample from p_ψ(z | x_o) will produce an image of a 3 or an 8 that aligns with what has been observed. The important insight is that we can think about how conditioning on x_o changes the distribution over latent codes without explicitly worrying about what the (potentially higher-dimensional and more complicated) conditional distribution over x_u looks like. Once we know p_ψ(z | x_o), we can easily move back to the original data space using the decoder.

### 3.2 Approximating the Partially Observed Posterior

The partially observed approximate posterior of interest is not readily available, as it is implicitly defined by the VAE:

$$p_\psi(z \mid x_o) = \mathbb{E}_{x_u \sim p(\cdot \mid x_o)}\big[ q_\psi(z \mid x_o, x_u) \big], \tag{1}$$

where q_ψ(z | x_o, x_u) = q_ψ(z | x) is the VAE's encoder. Thus, we introduce a neural network in order to approximate it. Given a network that outputs the distribution q_θ(z | x_o) (i.e., the partially observed encoder in Figure 2), we now discuss our approach to training it. Our approach is guided by the priorities of simplicity and generality. We minimize (with respect to θ) the following expected negative log-likelihood, where the samples come from our target distribution as defined in Equation 1:

$$\mathbb{E}_{x_u \sim p(\cdot \mid x_o)}\Big[ \mathbb{E}_{z \sim q_\psi(\cdot \mid x_o, x_u)}\big[ -\log q_\theta(z \mid x_o) \big] \Big]. \tag{2}$$

We discuss how this is optimized in practice in Section 3.4. Due to the relationship between negative log-likelihood minimization and KL-divergence minimization [3], we can interpret Equation 2 as minimizing:

$$\mathbb{E}_{x_u \sim p(\cdot \mid x_o)}\Big[ \mathrm{KL}\big( q_\psi(z \mid x_o, x_u) \,\|\, q_\theta(z \mid x_o) \big) \Big]. \tag{3}$$

We can directly minimize the KL-divergence in Equation 3 if it is analytically available between the two posteriors, for instance if both posteriors are Gaussians.
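When both posteriors are diagonal Gaussians, the KL term in Equation 3 has a closed form. The following is a small illustrative sketch under that assumed parameterization, not the paper's implementation:

```python
import jax.numpy as jnp

def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) )."""
    return 0.5 * jnp.sum(
        logvar_p - logvar_q
        + (jnp.exp(logvar_q) + (mu_q - mu_p) ** 2) / jnp.exp(logvar_p)
        - 1.0
    )

# Minimizing diag_gaussian_kl(mu_full, logvar_full, mu_partial, logvar_partial)
# with respect to the partially observed encoder's parameters matches
# q_theta(z | x_o) to the VAE posterior q_psi(z | x_o, x_u).
```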
However, Equation 2 is more general in that it allows us to use more expressive (e.g., autoregressive) distributions for q_θ(z | x_o) with which the KL-divergence cannot be directly computed. This is important given that p_ψ(z | x_o) is likely to be complex (e.g., multimodal) and not easily captured by a Gaussian (as in Figure 1). Importantly, there is no requirement for q_θ(z | x_o) to be reparameterized, which would further limit the class of distributions that can be used. There is a high degree of flexibility in the choice of distribution for the partially observed posterior. Note that this objective does not utilize the decoder.

### 3.3 Connection with Arbitrary Conditioning

Figure 2: Overview of the Posterior Matching framework. The partially observed encoder can be appended to any existing VAE to enable arbitrary conditioning. At inference time, we can decode samples from q_θ(z | x_o) to perform imputation or compute p(x_u | x_o).

While the Posterior Matching objective from Equation 2 and Equation 3 is intuitive, it is not immediately clear how this approach relates back to the arbitrary conditioning objective of maximizing p(x_u | x_o). We formalize this connection in Theorem 3.1 (see Appendix for proof).

**Theorem 3.1.** Let q_ψ(z | x) and p_ϕ(x | z) be the encoder and decoder, respectively, for some VAE. Additionally, let q_θ(z | x_o) be an approximate partially observed posterior. Then minimizing

$$\mathbb{E}_{x_u \sim p(\cdot \mid x_o)}\Big[ \mathrm{KL}\big( q_\psi(z \mid x_o, x_u) \,\|\, q_\theta(z \mid x_o) \big) \Big]$$

is equivalent to minimizing

$$\mathbb{E}_{x_u \sim p(\cdot \mid x_o)}\Big[ -\log p_{\theta,\phi}(x_u \mid x_o) + \mathrm{KL}\big( q_\psi(z \mid x_o, x_u) \,\|\, q_\theta(z \mid x_o, x_u) \big) \Big], \tag{4}$$

with respect to the parameters θ.

The first term inside the expectation in Equation 4 gives us the explicit connection back to the arbitrary conditioning likelihood p(x_u | x_o), which is being maximized when minimizing Equation 4. The second term acts as a sort of regularizer by trying to make the partially observed posterior match the VAE posterior when conditioned on all of x; intuitively, this makes sense as a desirable outcome.

### 3.4 Implementation

A practical training loss follows quickly from Equation 2. For the outer expectation, we do not have access to the true distribution p(x_u | x_o), but for a given instance x that has been partitioned into x_o and x_u, we do have one sample from this distribution, namely x_u. So we approximate this expectation using x_u as a single sample. This type of single-sample approximation is common with VAEs, e.g., when estimating the ELBO. For the inner expectation, we have access to q_ψ(z | x), which can easily be sampled in order to estimate the expectation. In practice, we generally use a single sample for this as well. This gives us the following Posterior Matching loss:

$$\mathcal{L}_{\mathrm{PM}}(x, o, \theta, \psi) = \mathbb{E}_{z \sim q_\psi(\cdot \mid x)}\big[ -\log q_\theta(z \mid x_o) \big], \tag{5}$$

where o is the set of observed feature indices. During training, o can be randomly sampled from a problem-specific distribution for each minibatch. Figure 2 provides a visual overview of our approach.

In practice, we represent x_o as a concatenation of x that has had unobserved features set to zero and a bitmask b that indicates which features are observed. This representation has been successful in other arbitrary conditioning models [24, 38]. However, this choice is not particularly important to Posterior Matching itself, and alternative representations, such as set embeddings, are valid as well.
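As a concrete illustration of Equation 5 and the masked-input representation just described, the sketch below computes a single-sample Posterior Matching loss under the assumption of diagonal-Gaussian posteriors; `encode`, `po_encode`, and the Bernoulli mask distribution are hypothetical stand-ins rather than the paper's actual architecture.

```python
import jax
import jax.numpy as jnp

def gaussian_log_prob(z, mu, logvar):
    # log N(z; mu, diag(exp(logvar)))
    return -0.5 * jnp.sum(
        logvar + jnp.log(2.0 * jnp.pi) + (z - mu) ** 2 / jnp.exp(logvar)
    )

def posterior_matching_loss(params, x, key, encode, po_encode):
    mask_key, eps_key = jax.random.split(key)

    # Sample a random observed set o and build the masked input: unobserved
    # features are zeroed, and a bitmask b marks which features are observed.
    b = jax.random.bernoulli(mask_key, p=0.5, shape=x.shape).astype(x.dtype)
    x_o = jnp.concatenate([x * b, b], axis=-1)

    # Single reparameterized sample z ~ q_psi(z | x) from the VAE encoder.
    mu, logvar = encode(params["encoder"], x)
    z = mu + jnp.exp(0.5 * logvar) * jax.random.normal(eps_key, mu.shape)
    # (Optionally, jax.lax.stop_gradient(z) keeps this loss from updating the
    # VAE encoder.)

    # Negative log-likelihood of z under q_theta(z | x_o), i.e., Equation 5.
    mu_po, logvar_po = po_encode(params["po_encoder"], x_o)
    return -gaussian_log_prob(z, mu_po, logvar_po)
```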
As required by VAEs, samples from q_ψ(z | x) will be reparameterized, which means that minimizing L_PM will influence the parameters of the VAE's encoder in addition to the partially observed posterior network. In some cases, this may be advantageous, as the encoder can be guided towards learning a latent representation that is more conducive to arbitrary conditioning. However, it might also be desirable to train the VAE independently of the partially observed posterior, in which case we can choose to stop gradients on the samples z ∼ q_ψ(· | x) when computing L_PM. Similarly, the partially observed posterior can be trained against an existing pretrained VAE. In this case, the parameters of the VAE's encoder and decoder are frozen, and we only optimize L_PM with respect to θ. Otherwise, we jointly optimize the VAE's ELBO and L_PM.

We emphasize that there is a high degree of flexibility with the choice of VAE, i.e., we have not imposed any unusual constraints. However, there are some potentially limiting practical considerations that have not been explicitly mentioned yet. First, the training data must be fully observed, as with traditional VAEs, since L_PM requires sampling q_ψ(z | x). However, given that the base VAE requires fully observed training data anyway, this is generally not a relevant limitation for our purposes. Second, it is convenient in practice for the VAE's decoder to be factorized, i.e., p(x | z) = ∏_i p(x_i | z), as this allows us to easily sample from p(x_u | z) (sampling x_u is less straightforward with other types of decoders). However, it is standard practice to use factorized decoders with VAEs, so this is ordinarily not a concern. We also note that, while useful for easy sampling, a factorized decoder is not necessary for optimizing the Posterior Matching loss, which does not incorporate the decoder.

### 3.5 Posterior Matching Beyond Arbitrary Conditioning

Figure 3: Example acquisitions from our CNN model using the lookahead posteriors (see Section 5.5). Top row shows x_o, and blue pixels are unobserved. Bottom row shows imputation of x_u.

The concept of matching VAE posteriors is quite general and has other uses beyond the application of arbitrary conditioning. We consider one such example, which still has ties to arbitrary conditioning, in order to give a flavor for other potential uses. A common application of arbitrary conditioning is active feature acquisition [13, 23, 25], where informative features are sequentially acquired on an instance-by-instance basis. In the unsupervised case, the aim is to acquire as few features as possible while maximizing the ability to reconstruct the remaining unobserved features (see Figure 3 for an example). One approach to active feature acquisition is to greedily select the feature that will maximize the expected amount of information to be gained about the currently unobserved features [23, 25]. For VAEs, Ma et al. [25] show that this is equivalent to selecting each feature according to

$$\operatorname*{argmax}_{i \in u} \; H(z \mid x_o) - \mathbb{E}_{x_i \sim p(\cdot \mid x_o)}\big[ H(z \mid x_o, x_i) \big] = \operatorname*{argmin}_{i \in u} \; \mathbb{E}_{x_i \sim p(\cdot \mid x_o)}\big[ H(z \mid x_o, x_i) \big]. \tag{6}$$

For certain families of posteriors, such as multivariate Gaussians, the entropies in Equation 6 can be analytically computed.
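For instance, when the posteriors are diagonal Gaussians, each entropy in Equation 6 has a closed form and the greedy choice reduces to an argmin over expected entropies. The sketch below is purely illustrative, with assumed array shapes for the sampled candidate posteriors:

```python
import jax.numpy as jnp

def diag_gaussian_entropy(logvar):
    """Closed-form entropy of N(mu, diag(exp(logvar))): 0.5 * sum(log(2*pi*e) + logvar)."""
    return 0.5 * jnp.sum(jnp.log(2.0 * jnp.pi * jnp.e) + logvar, axis=-1)

def greedy_acquisition(candidate_logvars, unobserved):
    # candidate_logvars: [num_unobserved, num_samples, latent_dim] log-variances
    # of q(z | x_o, x_i), computed for several sampled values of each candidate
    # feature x_i. The expectation in Equation 6 is approximated by the mean
    # entropy over those samples.
    expected_entropy = diag_gaussian_entropy(candidate_logvars).mean(axis=-1)
    # Pick the unobserved feature whose acquisition minimizes the expected
    # posterior entropy, i.e., the argmin in Equation 6.
    return jnp.asarray(unobserved)[jnp.argmin(expected_entropy)]
```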
In practice, approximating the expectation in Equation 6 is done via entropies of the posteriors

$$p^{(i)}(z \mid x_o) \equiv \mathbb{E}_{x_i \sim p_{\theta,\phi}(\cdot \mid x_o)}\big[ q_\theta(z \mid x_o, x_i) \big],$$

where samples from p_{θ,ϕ}(x_i | x_o) are produced by first sampling z ∼ q_θ(· | x_o) and then passing z through the VAE's decoder p_ϕ(x_i | z) (we call p^{(i)}(z | x_o) the lookahead posterior for feature i, since it is obtained by imagining what the posterior will look like after one acquisition into the future). Hence, computing the resulting entropies requires one network evaluation per sample of x_i to encode z, for each i ∈ u. Thus, if using k samples for each x_i, each greedy step will be Ω(k·|u|), which may be prohibitive in high dimensions.

In analogous fashion to the Posterior Matching approach that has already been discussed, we can train a neural network to directly output the lookahead posteriors for all features at once. The Posterior Matching loss in this case is

$$\mathcal{L}_{\text{PM-Lookahead}}(x, o, u, \omega, \theta, \phi) = \sum_{i \in u} \mathbb{E}_{x_i \sim p_{\theta,\phi}(\cdot \mid x_o)}\Big[ \mathbb{E}_{z \sim q_\theta(\cdot \mid x_o, x_i)}\big[ -\log q^{(i)}_\omega(z \mid x_o) \big] \Big],$$

where ω denotes the parameters of the lookahead posterior network. In practice, we train a single shared network with a final output layer that outputs the parameters of all q^{(i)}_ω(z | x_o). Note that given the distributions q^{(i)}_ω(z | x_o) for all i, computing the greedy acquisition choice consists of doing a forward evaluation of our network, then choosing the feature i ∈ u such that the entropy of q^{(i)}_ω(z | x_o) is minimized. In other words, we may bypass the individual samples of x_i and use a single shared network for a faster acquisition step. In this setting, we let q^{(i)}_ω(z | x_o) be a multivariate Gaussian so that the entropy computation is trivial. See Appendix for a diagram of the entire process. This use of Posterior Matching leads to large improvements in the computational efficiency of greedy active feature acquisition (demonstrated empirically in Section 5.5).

## 4 Prior Work

A variety of approaches to arbitrary conditioning have been previously proposed. ACE is an autoregressive, energy-based method that is the current state-of-the-art for arbitrary conditional likelihood estimation and imputation, although it can be computationally intensive for very high-dimensional data [38]. ACFlow is a variant of normalizing flows that can give analytical arbitrary conditional likelihoods [24]. Several other methods, including Sum-Product Networks [32, 6], Neural Conditioner [2], and Universal Marginalizer [10], also have the ability to estimate conditional likelihoods.

Rezende et al. [34] were among the first to suggest that VAEs can be used for imputation. More recently, VAEAC was proposed as a VAE variant designed for arbitrary conditioning [18]. Unlike Posterior Matching, VAEAC is not a general framework and cannot be used with typical pretrained VAEs. EDDI is a VAE-based approach to active feature acquisition that relies on arbitrary conditioning [25]. The authors introduce a Partial VAE in order to perform the arbitrary conditioning, which, similarly to Posterior Matching, tries to model p(z | x_o). Unlike Posterior Matching, they do this by maximizing a variational lower bound on p(x_o) using a partial inference network q(z | x_o) (there is no standard VAE posterior q(z | x) in EDDI). Gong et al. [13] use a similar approach that is based on the Partial VAE of EDDI. The major drawback of these methods is that, unlike with Posterior Matching, q(z | x_o) must be reparameterizable in order to optimize the lower bound (the authors use a diagonal Gaussian).
Thus, certain more expressive distributions (e.g., autoregressive) cannot be used. Additionally, these methods cannot be applied to existing VAEs. The methods of Ipsen et al. [17] and Collier et al. [9] are also similar to EDDI, where the former optimizes an approximation of p(x_o, b) and the latter optimizes a lower bound on p(x_o | b). Ipsen et al. [17] also focus on imputation for data that is "missing not at random", a setting that is outside the focus of our work.

There are also several works that have considered learning to identify desirable regions in latent spaces. Engel et al. [11] start from a pretrained VAE, but then train a separate GAN [14] with special regularizers to do their conditioning. They only condition on binary vectors, y, that correspond to a small number of predefined attributes, whereas we allow for conditioning on arbitrary subsets of continuous features x_o (a more complicated conditioning space). Also, their resulting GAN does not make the likelihood q(z | y) available, whereas Posterior Matching directly (and flexibly) models q(z | x_o), which may be useful for downstream tasks (e.g., Section 5.5) and likelihood evaluation (see Appendix). Furthermore, Posterior Matching trains directly through KL, without requiring an additional critic. Whang et al. [40] learn conditional distributions, but not arbitrary conditional distributions (a much harder problem). They also consider normalizing flow models, which are limited to invertible architectures with tractable Jacobian determinants and latent spaces that have the same dimensionality as the data (unlike VAEs). Cannella et al. [7] similarly do conditional sampling from a model of the joint distribution, but are also restricted to normalizing flow architectures and require a more expensive MCMC procedure for sampling.

## 5 Experiments

In order to empirically test Posterior Matching, we apply it to a variety of VAEs aimed at different tasks. We find that our models are able to match or surpass the performance of previous specialized VAE methods. All experiments were conducted using JAX [5] and the DeepMind JAX Ecosystem [1]. Code is available at https://github.com/lupalab/posterior-matching. Our results are dependent on the choice of VAE, and the particular VAEs used in our experiments were not the product of extensive comparisons and did not undergo thorough hyperparameter tuning, as that is not the focus of this work. With more carefully selected or tuned VAEs, and as new VAEs continue to be developed, we can expect Posterior Matching's downstream performance to improve accordingly on any given task. We emphasize that our experiments span a diverse set of tasks, domains, and types of VAE, wherein Posterior Matching was effective.

In this first experiment, our goal is to demonstrate that Posterior Matching replicates the intuition depicted in Figure 1. We do this by training a convolutional VAE with Posterior Matching on the MNIST dataset. The latent space of this VAE is then mapped to two dimensions with UMAP [27] and visualized in Figure 4.

Figure 4: UMAP [27] visualization of the latent space of a VAE trained on MNIST. Black dots represent samples from the distribution q_θ(z | x_o) learned via Posterior Matching. Images of x_o are to the right of their respective plot. Samples from each mode are decoded and shown.

Table 1: Peak signal-to-noise ratio (PSNR) and precision/recall scores [35] for image inpaintings.
We report mean and standard deviation across five evaluations with different random masks. VAEAC and ACFlow results are taken from Li et al. [24]. Higher is better for all metrics.

| Model | MNIST PSNR | MNIST Precision | MNIST Recall | OMNIGLOT PSNR | OMNIGLOT Precision | OMNIGLOT Recall | CELEBA PSNR | CELEBA Precision | CELEBA Recall |
|---|---|---|---|---|---|---|---|---|---|
| VDVAE + PM (ours) | 21.603 ± 0.022 | 0.996 | 0.996 | 18.256 ± 0.038 | 0.995 | 0.994 | 27.190 ± 0.049 | 0.995 | 0.995 |
| VQ-VAE + PM (ours) | 19.981 ± 0.021 | 0.989 | 0.993 | 17.954 ± 0.046 | 0.979 | 0.973 | 25.531 ± 0.036 | 0.982 | 0.984 |
| VAEAC | 19.613 ± 0.042 | 0.877 | 0.975 | 17.693 ± 0.023 | 0.525 | 0.926 | 23.656 ± 0.027 | 0.966 | 0.967 |
| ACFlow | 17.349 ± 0.018 | 0.945 | 0.984 | 15.572 ± 0.031 | 0.962 | 0.971 | 22.393 ± 0.040 | 0.970 | 0.988 |
| ACFlow+BG | 20.828 ± 0.031 | 0.947 | 0.983 | 18.838 ± 0.009 | 0.967 | 0.970 | 25.723 ± 0.020 | 0.964 | 0.987 |

In the figure, black points represent samples from q_θ(z | x_o), and for select samples, the corresponding reconstruction is shown. The encoded test data is shown, colored by true class label, to highlight which regions correspond to which digits. We see that the experimental results nicely replicate our earlier intuitions: the learned distribution q_θ(z | x_o) puts probability mass only in parts of the latent space that correspond to plausible digits based on what is observed, and it successfully captures multimodal distributions (see the second column in Figure 4).

### 5.2 Image Inpainting

One practical application of arbitrary conditioning is image inpainting, where only part of an image is observed and we want to fill in the missing pixels with visually coherent imputations. As with prior works [18, 24], we assume pixels are missing completely at random. We test Posterior Matching as an approach to this task by pairing it with both discrete and hierarchical VAEs.

**Vector Quantized-VAEs** We first consider VQ-VAE [30], a type of VAE that is known to work well with images. VQ-VAE differs from the typical VAE in its use of a discrete latent space. That is, each latent code is a grid of discrete indices rather than a vector of continuous values. Because the latent space is discrete, Oord et al. [30] model the prior distribution with a PixelCNN [29, 36] after training the VQ-VAE. We similarly use a conditional PixelCNN to model q_θ(z | x_o). First, a convolutional network maps x_o to a vector, and that vector is then used as a conditioning input to the PixelCNN. More architecture and training details can be found in the Appendix.

We train VQ-VAEs with Posterior Matching for the MNIST, OMNIGLOT, and CELEBA datasets. Table 1 reports peak signal-to-noise ratio (PSNR) and precision/recall [35] for inpaintings produced by our model. We find that Posterior Matching with VQ-VAE consistently achieves better precision/recall scores than previous models while having comparable PSNR.

**Hierarchical VAEs** Hierarchical VAEs [22, 37, 39] are a powerful extension of traditional VAEs that allow for more expressive priors and posteriors by partitioning the latent variables into subsets z = {z_1, . . . , z_L}. A hierarchy is then created by factorizing the prior p(z) = ∏_i p(z_i | z