# Undirected Graphical Models as Approximate Posteriors

Arash Vahdat 1, Evgeny Andriyash 2, William G. Macready 2

The representation of the approximate posterior is a critical aspect of effective variational autoencoders (VAEs). Poor choices for the approximate posterior have a detrimental impact on the generative performance of VAEs due to the mismatch with the true posterior. We extend the class of posterior models that may be learned by using undirected graphical models. We develop an efficient method to train undirected approximate posteriors by showing that the gradient of the training objective with respect to the parameters of the undirected posterior can be computed by backpropagation through Markov chain Monte Carlo updates. We apply these gradient estimators to train discrete VAEs with Boltzmann machines as approximate posteriors and demonstrate that undirected models outperform previous results obtained with directed graphical models. Our implementation is available here.

1. Introduction

Training of likelihood-based deep generative models has advanced rapidly in recent years. These advances have been enabled by amortized inference (Hinton et al., 1995; Mnih & Gregor, 2014; Gregor et al., 2013), which scales up the training of variational models; the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014), which provides low-variance gradient estimates; and increasingly expressive neural networks. Combinations of these techniques have resulted in many different forms of variational autoencoders (VAEs) (Kingma et al., 2016; Chen et al., 2016). It is also widely recognized that flexible approximate posterior distributions improve the generative quality of VAEs by more faithfully modeling true posterior distributions. More accurate approximate posterior models have been realized by reducing the gap between true and approximate posteriors with tighter bounds (Burda et al., 2015), and by using auto-regressive architectures (Gregor et al., 2013; 2015) or flow models (Rezende & Mohamed, 2015).

1NVIDIA, USA. 2Sanctuary AI, Canada. Work done at D-Wave Systems. Correspondence to: Arash Vahdat .

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

A complementary (but unexplored) direction for further improving the richness of approximate posteriors is available through the use of undirected graphical models (UGMs). In this approach, the approximate posterior over latent variables is formed using UGMs whose parameters are inferred for each observed sample in an amortized fashion, as in VAEs. This is compelling because UGMs can succinctly capture complex relationships between variables. However, there are three challenges when training UGMs as approximate posteriors: i) There is no known low-variance path-wise gradient estimator (like the reparameterization trick) for learning the parameters of UGMs. ii) Sampling from general UGMs is intractable, and approximate sampling is computationally expensive. iii) Evaluating the probability of a sample under a UGM can require an intractable partition function. The two latter costs are particularly acute when UGMs are used as posterior approximators because the number of UGMs grows with the size of the dataset. However, we note that posterior1 UGMs conditioned on observed data points are often simple, as there are usually only a small number of modes in latent space explaining an observation. We expect, then, that sampling and partition function estimation (challenges ii and iii) can be solved efficiently using parallelizable Markov chain Monte Carlo (MCMC) methods. In fact, we observe that UGM posteriors trained in a VAE model tend to be unimodal, but with a mode structure that is not necessarily well captured by mean-field posteriors.
Nevertheless, in this case, MCMC-based sampling methods quickly equilibrate. To address challenge i), we estimate the gradient by reparameterized sampling of the MCMC updates. Although an infinite number of MCMC updates is theoretically required to obtain an unbiased gradient estimator, we observe that a single MCMC update provides a low-variance but biased gradient estimate that is sufficient for training UGMs.

1We may write posterior when referring to the approximate posterior for brevity. To disambiguate, we always refer to the true posterior explicitly.

Binary UGMs in the form of Boltzmann machines (Ackley et al., 1985) have recently been shown to be effective as priors for VAEs, allowing for discrete versions of VAEs (Rolfe, 2016; Vahdat et al., 2018b;a). However, previous work in this area has relied on directed graphical models (DGMs) for posterior approximation. In this paper, we replace the DGM posteriors of discrete VAEs (DVAEs) with Boltzmann machine UGMs and show that these posteriors provide generative performance comparable to or better than previous DVAEs. We denote this model as DVAE##, where ## indicates the use of UGMs in both the prior and the posterior.

We begin by summarizing related work on expressive posterior models, including models trained nonvariationally. Nonvariational models employ MCMC to sample true posteriors directly and differ from our variational approach, which uses MCMC to sample from an amortized undirected posterior. Sec. 2 provides the necessary background on variational learning and MCMC to allow for the development in Sec. 3 of a gradient estimator for undirected posteriors. We provide examples for both Gaussian (Sec. 3.1) and Boltzmann machine (Sec. 3.2) UGMs. Experimental results are provided in Sec. 4 on VAEs (Sec. 4.1), importance-weighted VAEs (Sec. 4.2), and structured prediction (Sec. 4.3), where we observe consistent improvement using UGMs. We conclude in Sec.
5 with a list of future work.

1.1. Related Work

Inference gap reduction in VAEs: Previous work on reducing the gap between true and approximate posteriors can be grouped into three categories: i) training objectives that replace the variational bound with tighter bounds (Burda et al., 2015; Li & Turner, 2016; Bornschein et al., 2016); ii) autoregressive models (Hochreiter & Schmidhuber, 1997; Graves, 2013) that use DGMs to form more flexible distributions (Gregor et al., 2013; Gulrajani et al., 2016; Van Den Oord et al., 2016; Salimans et al., 2017; Chen et al., 2018); and iii) flow-based models (Rezende & Mohamed, 2015) that map data to a latent space using a class of invertible functions (Kingma et al., 2016; Kingma & Dhariwal, 2018; Dinh et al., 2014; 2016). To the best of our knowledge, UGMs have not been used as approximate posteriors.

Gradient estimation in latent variable models: REINFORCE (Williams, 1992) is the most generic approach for computing the gradient with respect to the parameters of an approximate posterior, but it suffers from high variance and must be augmented by variance reduction techniques. For many continuous latent-variable models, the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014) provides lower-variance gradient estimates. Reparameterization does not apply to discrete latent variables, and recent methods for discrete variables have focused on REINFORCE with control variates (Mnih & Gregor, 2014; Gu et al., 2015; Mnih & Rezende, 2016; Tucker et al., 2017; Grathwohl et al., 2018) or continuous relaxations (Maddison et al., 2016; Jang et al., 2016; Rolfe, 2016; Vahdat et al., 2018b;a). See (Andriyash et al., 2018) for a review.

Variational inference and MCMC: A common alternative to variational inference is to use MCMC to sample directly from true posteriors in generative models (Welling & Teh, 2011; Salimans et al., 2015; Wolf et al., 2016; Hoffman, 2017; Li et al., 2017; Caterini et al., 2018).
However, this approach is often computationally intensive, as it requires evaluating the prior and decoder distributions (often implemented by neural networks) many times at each parameter update for each training data point in a batch. Moreover, evaluating these models is more challenging, as there is no amortized approximate posterior for importance sampling (Burda et al., 2015). Our method differs from these techniques, as we use MCMC to sample from an amortized approximate posterior represented by a UGM.

2. Background

In this section we provide background for the topics discussed in this paper.

Undirected graphical models: A UGM represents the joint probability distribution of a set of random variables z as q_\phi(z) = \exp(-E_\phi(z))/Z_\phi, where E_\phi is a \phi-parameterized energy function defined by an undirected graphical model, and Z_\phi = \sum_z \exp(-E_\phi(z)) is the partition function. UGMs over binary variables z = [z_1, z_2] with bipartite quadratic energy functions

E_\phi(z_1, z_2) = -b_1^T z_1 - b_2^T z_2 - z_1^T W z_2

are called restricted Boltzmann machines (RBMs). The parameters \phi = \{b_1, b_2, W\} encode linear biases and pairwise interactions. RBMs permit parallel Gibbs sampling updates, as q_\phi(z_1|z_2) and q_\phi(z_2|z_1) are factorial in the components of z_1 and z_2.

Variational autoencoders: A VAE is a generative model factored as p(x, z) = p(z) p(x|z), where p(z) is a prior distribution over latent variables z and p(x|z) is a probabilistic decoder representing the conditional distribution over data variables x given z. VAEs are trained by maximizing a variational lower bound (ELBO) on \log p(x):

L = E_{q(z|x)}[\log p(x, z) - \log q(z|x)] \le \log p(x),  (1)

where q(z|x) is a probabilistic encoder that approximates the posterior over latent variables given a data point. A tighter importance-weighted bound is obtained with

L_{IW} = E_{z_{1:K}}\Big[\log \frac{1}{K} \sum_{i=1}^{K} \frac{p(x, z_i)}{q(z_i|x)}\Big],  (2)

where z_{1:K} \sim \prod_i q(z_i|x). For continuous latent variables, these bounds are optimized using the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014).
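Because the RBM energy above is bipartite, each Gibbs conditional is a product of independent Bernoullis with logits b_1 + W z_2 (and b_2 + W^T z_1), so a full Gibbs sweep costs only two matrix-vector products. A minimal numpy sketch of such a sweep (layer sizes and parameter values are illustrative choices of ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 4, 3                      # sizes of the two RBM layers (illustrative)
b1, b2 = rng.normal(size=n1), rng.normal(size=n2)
W = rng.normal(size=(n1, n2))      # pairwise couplings

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(z2, rng):
    """One parallel Gibbs sweep: sample z1 | z2, then z2 | z1.

    With the bipartite energy E(z1, z2) = -b1.z1 - b2.z2 - z1.W.z2,
    q(z1|z2) factorizes into independent Bernoullis with logits b1 + W z2,
    and similarly for q(z2|z1).
    """
    p1 = sigmoid(b1 + W @ z2)
    z1 = (rng.random(n1) < p1).astype(float)
    p2 = sigmoid(b2 + W.T @ z1)
    z2 = (rng.random(n2) < p2).astype(float)
    return z1, z2

z2 = (rng.random(n2) < 0.5).astype(float)
for _ in range(100):               # burn-in sweeps
    z1, z2 = gibbs_sweep(z2, rng)
print(z1, z2)
```

Repeating `gibbs_sweep` yields approximate samples from q_\phi; this is the basic primitive behind both the prior and posterior sampling discussed later.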
However, as noted above, continuous relaxations are commonly employed with discrete latent variables.

MCMC: MCMC methods are used to draw approximate samples from a target distribution q(z). Each MCMC method is characterized by a transition kernel K(z|z') designed so that

q(z) = \int dz' \, q(z') K(z|z') \quad \forall z.  (3)

Samples from the fixed point q(z) are found by sampling from an initial distribution q_0(z) and iteratively updating the samples by drawing conditional samples z_t | z_{t-1} \sim K(z_t|z_{t-1}). Denoting the distribution of the samples after t iterations by q_t = K^t q_0, the theory of MCMC shows that, under some regularity conditions, q_t \to q as t \to \infty. In practice, K(z|z') is usually chosen to satisfy the detailed balance condition

q(z) K(z'|z) = q(z') K(z|z') \quad \forall z, z',  (4)

which implies Eq. (3). K^t is itself a valid transition kernel and satisfies Eq. (3) and Eq. (4).

Gibbs sampling is a particularly efficient MCMC method when the variables z = [z_1, z_2] are partitioned as in an RBM. The transition kernel in this case may be written as

K_{Gibbs}(z_1, z_2 | z'_1, z'_2) = q(z_2|z_1) \, q(z_1|z'_2),  (5)

where all variables in z_1 or z_2 may be updated in parallel. Gibbs sampling is a special MCMC method that does not satisfy the detailed balance condition. However, it does admit a reverse kernel of the form

K_{Reverse}(z'_1, z'_2 | z_1, z_2) = q(z'_1|z'_2) \, q(z'_2|z_1),  (6)

which samples the variables in reversed order. Thus, for Gibbs sampling on bipartite UGMs, we have

q(z') K_{Gibbs}(z|z') = q(z) K_{Reverse}(z'|z) \quad \forall z, z',  (7)

where z = [z_1, z_2] and z' = [z'_1, z'_2].

3. Undirected Approximate Posteriors

In this section, we propose a gradient estimator for generic UGM approximate posteriors and discuss how the estimator is applied to RBM-based approximate posteriors. The training objective for a probabilistic encoder network with a UGM q_\phi(z) can be written as \max_\phi E_{q_\phi(z)}[f(z)], where f is a differentiable function of z. For simplicity of exposition, we assume that f does not depend on \phi.2 Using the fixed point equation in Eq.
(3), we have

E_{q_\phi(z)}[f(z)] = E_{q_\phi(z')} E_{K^t_\phi(z|z')}[f(z)],  (8)

where the right-hand side implies sampling z' \sim q_\phi(z') and applying t MCMC updates.

Figure 1: To estimate the gradient of L = E_{q_\phi(z)}[f(z)] w.r.t. the parameters \phi of the UGM posterior, we first sample from the approximate posterior (z' \sim q_\phi(z')); then we apply an MCMC update via reparameterized sampling (z \sim K_\phi(z|z')). Finally, we evaluate the function on the samples (f(z)). The gradient is computed automatically by backpropagating through the reparameterized samples while ignoring the dependency of z' on \phi.

To maximize Eq. (8), we require its gradient:

\partial_\phi E_{q_\phi(z)}[f(z)] = \underbrace{\int dz' \, \partial_\phi q_\phi(z') \, E_{K^t_\phi(z|z')}[f(z)]}_{I} + \underbrace{E_{q_\phi(z')} \, \partial_\phi E_{K^t_\phi(z|z')}[f(z)]}_{II}.  (9)

The term marked I can be written as

I = E_{q_\phi(z')} E_{K^t_\phi(z|z')}[f(z) \, \partial_\phi \log q_\phi(z')]  (10)
  = E_{q_\phi(z)} E_{K^t_\phi(z'|z)}[f(z) \, \partial_\phi \log q_\phi(z')]  (11)
  = E_{q_\phi(z)}\big[f(z) \, E_{K^t_\phi(z'|z)}[\partial_\phi \log q_\phi(z')]\big],  (12)

where we have used detailed balance to obtain Eq. (11) from Eq. (10). The form of Eq. (12) makes it clear that I goes to zero as t \to \infty, because K^t_\phi(z'|z) \to q_\phi(z') and E_{q_\phi(z')}[\partial_\phi \log q_\phi(z')] = 0. However, similar to REINFORCE, a Monte Carlo estimate of I has high variance due to the presence of q_\phi(z') in the denominator.3

The expectation in Eq. (9) marked II involves the gradient of an expectation with respect to the MCMC transition kernel. The reparameterization trick provides a low-variance gradient estimator for this term, and as t \to \infty, the contribution of II approaches the full gradient because I \to 0. Our key insight in this paper is that, given the high variance of I, it is beneficial to drop this term from the gradient (Eq. (9)) and use the biased but lower-variance estimate II.

2If f does depend on \phi, then E_{q_\phi(z)}[\partial_\phi f(z)] is approximated with Monte Carlo sampling.

3Technically, Gibbs sampling does not satisfy detailed balance. Instead, we can use the identity in Eq. (7) to derive the same argument for Gibbs sampling. In this case, the transition kernel in Eq. (11) and Eq. (12) is the reverse kernel.
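The claims in footnote 3 (the Gibbs kernel leaves q invariant as in Eq. (3), but satisfies the forward/reverse identity of Eq. (7) rather than detailed balance) can be checked by exact enumeration on a tiny RBM. A sketch with one unit per layer and arbitrary parameters of our own choosing:

```python
import itertools
import numpy as np

b1, b2, w = 0.3, -0.5, 1.2         # arbitrary RBM parameters

def q(z1, z2):
    """Unnormalized q(z1, z2) = exp(-E), with E = -(b1 z1 + b2 z2 + w z1 z2)."""
    return np.exp(b1 * z1 + b2 * z2 + w * z1 * z2)

states = list(itertools.product([0, 1], repeat=2))
Z = sum(q(*s) for s in states)      # exact partition function

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cond_z1(z1, z2):                # q(z1 | z2)
    p = sigmoid(b1 + w * z2)
    return p if z1 == 1 else 1 - p

def cond_z2(z2, z1):                # q(z2 | z1)
    p = sigmoid(b2 + w * z1)
    return p if z2 == 1 else 1 - p

def k_gibbs(z, zp):                 # K_Gibbs(z1, z2 | z1', z2') = q(z1|z2') q(z2|z1)
    (z1, z2), (_, z2p) = z, zp
    return cond_z1(z1, z2p) * cond_z2(z2, z1)

def k_rev(zp, z):                   # K_Reverse(z1', z2' | z1, z2) = q(z2'|z1) q(z1'|z2')
    (z1p, z2p), (z1, _) = zp, z
    return cond_z2(z2p, z1) * cond_z1(z1p, z2p)

# Eq. (3): q is a fixed point of the Gibbs kernel.
for z in states:
    lhs = sum(q(*zp) / Z * k_gibbs(z, zp) for zp in states)
    assert np.isclose(lhs, q(*z) / Z)

# Eq. (7): q(z') K_Gibbs(z|z') = q(z) K_Reverse(z'|z) for all z, z'.
for z in states:
    for zp in states:
        assert np.isclose(q(*zp) * k_gibbs(z, zp), q(*z) * k_rev(zp, z))
print("Eqs. (3) and (7) verified")
```

The same enumeration with K_Gibbs on both sides of Eq. (4) fails, which is why the derivation above routes the argument through the reverse kernel for Gibbs sampling.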
The choice of t trades increased computational complexity for decreased bias. We observe that increasing t has little effect on optimization performance. Therefore, t = 1 is a good choice, giving the smallest computational complexity. Consequently, we use the approximation

\partial_\phi E_{q_\phi(z)}[f(z)] \approx E_{q_\phi(z')} \, \partial_\phi E_{K_\phi(z|z')}[f(z)] = E_{q_\phi(z')} E_{\epsilon \sim p(\epsilon)}[(\partial_\phi z)^T \partial_z f(z)],  (13)

where z = z(\epsilon, \phi, z') is a reparameterized sample from K_\phi(z|z') and \epsilon is a sample from a base distribution. Our gradient estimator is illustrated in Fig. 1. The extension to t > 1 Gibbs updates is straightforward.

In principle, MCMC methods such as Metropolis-Hastings or Hamiltonian Monte Carlo (HMC) can be used as the transition kernel in Eq. (13). However, unbiased reparameterized sampling for these methods is challenging.4 These complications are avoided when Gibbs sampling of q_\phi(z) is possible. The Gibbs transition kernel q_\phi(z_2|z_1) q_\phi(z_1|z'_2) of Eq. (5) does not contain an accept/reject step or any hyper-parameters. Assuming that the conditionals can be reparameterized, the gradient of the kernel is approximated efficiently using low-variance reparameterized sampling from each conditional. In this case, the gradient estimator for a single Gibbs update is

\partial_\phi E_{q_\phi(z)}[f(z)] \approx E_{q_\phi(z')} \, \partial_\phi E_{q_\phi(z_2|z_1) q_\phi(z_1|z'_2)}[f(z_1, z_2)] = E_{q_\phi(z')} E_{\epsilon_1, \epsilon_2}[\partial_\phi f(z_1, z_2)],  (14)

where z_1(\epsilon_1, \phi, z'_2) \sim q_\phi(z_1|z'_2) and z_2(\epsilon_2, \phi, z_1) \sim q_\phi(z_2|z_1) are reparameterized samples.

Lastly, we note that the reparameterized sample z has distribution q_\phi(z) if z' is equilibrated. So, if f has its own parameters \theta (e.g., the parameters of the generative model in VAEs), the same sample can be used to compute an unbiased estimate of \partial_\theta E_{q_\phi(z)}[f_\theta(z)].

3.1. Toy Example

We assess the efficacy of our approximations on a multivariate Gaussian. We consider learning the two-variable Gaussian distribution depicted in Fig. 2(b). The target distribution p(z) has energy function E(z) = (z - \mu)^T \Sigma^{-1} (z - \mu)/2, where \mu = [1, 1]^T and \Sigma = [1.1, 0.9; 0.9, 1.1].
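To make the mechanics of Eq. (14) concrete, the sketch below (our own simplification, not the paper's experiment) learns only the mean m of a bivariate Gaussian q_m with fixed unit variances and correlation rho = 0.9. Each step draws z' from q_m (exact sampling is available for a Gaussian), treats it as a constant, applies one reparameterized Gibbs sweep, and pushes the chain rule through the two conditionals by hand; f is a simple quadratic whose minimizer plays the role of the target mean:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.9                            # fixed correlation of q_m (unit marginal variances)
s = np.sqrt(1.0 - rho ** 2)          # std of each Gaussian Gibbs conditional
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
c = np.array([1.0, -1.0])            # f(z) = ||z - c||^2, minimized at z = c
m = np.zeros(2)                      # learnable mean of q_m
lr, tail = 0.01, []

for step in range(6000):
    z0 = m + L @ rng.normal(size=2)  # z' ~ q_m; treated as stop-gradient
    # One reparameterized Gibbs sweep: z1 | z0_2, then z2 | z1.
    z1 = m[0] + rho * (z0[1] - m[1]) + s * rng.normal()
    z2 = m[1] + rho * (z1 - m[0]) + s * rng.normal()
    g = 2.0 * (np.array([z1, z2]) - c)          # df/dz
    # Chain rule through the two conditionals (term II of Eq. (9) only):
    # dz1/dm = [1, -rho];  dz2/dm = [rho*(1 - 1), 1 + rho*(-rho)] = [0, 1 - rho**2]
    grad_m = np.array([g[0], -rho * g[0] + (1.0 - rho ** 2) * g[1]])
    m -= lr * grad_m
    if step >= 3000:
        tail.append(m.copy())

m_avg = np.mean(tail, axis=0)
print(m_avg)                          # close to c = [1, -1]
```

No gradient flows into z', mirroring the dropped term I of Eq. (9); in this simple quadratic case the biased estimator still drives m to the correct fixed point, since the expected update vanishes exactly when the mean of q_m matches c.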
The approximating distribution q_\phi(z) has the same form, and we learn the parameters \phi by minimizing the Kullback-Leibler (KL) divergence KL[q_\phi(z) || p(z)]. We compare Eq. (14) for t = 1, 2, 4, 8 with reparameterized sampling and with REINFORCE. Fig. 2(a) shows the KL divergence during training. For our method we consider two variants: with and without the term I of Eq. (9). Our method significantly outperforms REINFORCE due to the lower variance of its gradients, as depicted in Fig. 2(c). There is little difference between values of t until the KL divergence reaches roughly 10^{-4}. Our method performs worse than reparameterized sampling, which is expected due to the bias introduced by neglecting I in Eq. (9).

Including I negatively impacts optimization. The dashed lines of Fig. 2(a) show the KL objective when the term I is included. All curves lie atop one another and are noticeably slower to converge. This deterioration is driven by the noisier gradient estimates shown as dashed lines in Fig. 2(c). Note that this experiment confirms that the reparameterization trick provides low-variance unbiased gradient estimation. However, this trick is not applicable to UGMs in general. In the case of UGMs, our gradient estimator provides gradient estimation with variance properties similar to the reparameterization trick.

4For example, Metropolis-Hastings contains a non-differentiable operation in the accept/reject step, and HMC additionally requires backpropagating through gradients of the energy function and tuning hyper-parameters.

3.2. Learning Boltzmann Machine Posteriors

Next, we consider training DVAE##, a DVAE (Rolfe, 2016; Vahdat et al., 2018b;a) in which both the approximate posterior and the prior are RBMs. The objective function of DVAE## is given by Eq. (1), where q(z|x) \equiv q_{\phi(x)}(z) is an amortized RBM with neural-network-generated parameters \phi(x) = \{b_1(x), b_2(x), W(x)\}. To maximize L, we use the gradient \partial_{\phi(x)} L = \partial_{\phi(x)} E_{q_{\phi(x)}(z)}[\log p(x, z) - \log q_{\phi(x)}(z)] at each training point x.
Although the RBM has a bipartite structure, applying the gradient estimator of Eq. (14) is challenging due to the binary latent variables. To apply our estimator in Eq. (14), we relax the binary variables to continuous variables using Gumbel-Softmax or piecewise-linear relaxations (Andriyash et al., 2018).5 Representing the relaxation of z_1 \sim q_\phi(z_1|z'_2) by \zeta_1 = \zeta_1(\rho_1, z'_2, \phi) and of z_2 \sim q_\phi(z_2|z_1) by \zeta_2 = \zeta_2(\rho_2, \zeta_1, \phi), where \rho_1, \rho_2 \sim U[0, 1], we use the following estimator:

\partial_\phi E_{q_\phi(\zeta_1|z'_2)} E_{q_\phi(\zeta_2|\zeta_1)}[f(\zeta_1, \zeta_2)] \approx E_{\rho_1, \rho_2}[\partial_\phi f(\zeta_1, \zeta_2)],  (16)

where \partial_\phi f(\zeta_1, \zeta_2) is computed using the reparameterization trick.

Thus far, we have only accounted for the dependency of the objective f on \phi through the samples. However, for VAEs, f also depends on \phi through \log q_{\phi(x)}(z). This dependency can be ignored in VAEs, as E_q[\partial_\phi \log q_\phi(z|x)] = 0 (Roeder et al., 2017), and it can be removed in importance-weighted autoencoders using doubly reparameterized gradient estimation (Tucker et al., 2018) (see Appendix C for more details). Note that the estimators of Roeder et al. (2017) and Tucker et al. (2018) do not introduce any bias to the gradient and are known to reduce variance.

5Another option is to use unbiased gradient estimators such as REBAR (Tucker et al., 2017) or RELAX (Grathwohl et al., 2018). However, with these approaches, the number of f evaluations increases with t.

Figure 2: (a) KL divergence: our gradient estimator (for t = 1, 2, 4, 8) compared with REINFORCE and reparameterized Gaussian samples for minimizing the KL divergence of a Gaussian to (b) the target distribution. (c) Gradient L2 error. Dashed lines correspond to adding the term I of Eq. (9) to our gradient estimator.
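As one concrete (though not the paper's) choice of relaxation, a binary-Concrete/Gumbel-Softmax version of the relaxed Gibbs sweep can be sketched as follows; the experiments below use the piecewise-linear relaxation of Andriyash et al. (2018) instead, and all sizes and names here are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, tau = 4, 3, 0.5              # layer sizes and relaxation temperature
b1, b2 = rng.normal(size=n1), rng.normal(size=n2)
W = rng.normal(size=(n1, n2))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def concrete(logits, rng, tau):
    """Binary Concrete draw: a smooth, reparameterized stand-in for Bernoulli(sigmoid(logits))."""
    rho = rng.uniform(1e-6, 1 - 1e-6, size=logits.shape)   # rho ~ U[0, 1]
    return sigmoid((logits + np.log(rho) - np.log1p(-rho)) / tau)

def relaxed_gibbs_sweep(z2, rng):
    """Relaxed analogue of one Gibbs sweep: zeta1 | z2', then zeta2 | zeta1.

    Both draws are smooth functions of (b1, b2, W), so an autodiff
    framework can backpropagate through the sweep as in Eq. (16).
    """
    zeta1 = concrete(b1 + W @ z2, rng, tau)
    zeta2 = concrete(b2 + W.T @ zeta1, rng, tau)
    return zeta1, zeta2

z2 = (rng.random(n2) < 0.5).astype(float)   # e.g. a persistent-chain state z2'
zeta1, zeta2 = relaxed_gibbs_sweep(z2, rng)
print(zeta1, zeta2)                          # values strictly inside (0, 1)
```

As tau -> 0 the relaxed samples approach hard Bernoulli draws; at moderate tau the whole sweep is differentiable end to end, which is the property Eq. (16) relies on.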
As t increases and each Gibbs update is relaxed, sampling from the relaxed chain diverges from the exact discrete chain, resulting in increasingly biased gradient estimates. Thus, we use t = 1 in our experiments (see Appendices A and D for t > 1). Moreover, in Eq. (16), our estimator requires samples from the RBM posterior in the outer expectation. These samples are obtained by running persistent chains for each training data point.

Algorithm 1 summarizes the training of DVAE##. We represent the Boltzmann prior as p_\theta(z) = \exp(-E_\theta(z))/Z_\theta. The number of Gibbs sweeps for generating z' in Eq. (16) is denoted by s, and the number of relaxed Gibbs sweeps is denoted by t. The objective L_x is defined so that automatic differentiation yields the gradient estimator in Eq. (16) for \phi, and \partial_\theta f is evaluated using the relaxed samples \zeta. We use the method introduced by Vahdat et al. (2018b), Sec. E, to obtain the gradients of \log Z_\theta.

Algorithm 1 DVAE## with RBM prior and posterior
Input: training sample x, number of Gibbs sweeps s, number of relaxed Gibbs sweeps t.
Output: training objective function L_x
  \phi = encoder(x)
  z'_old = retrieve_persistent_states(x)
  z'_new = update_gibbs_samples(z'_old, \phi, s)
  z'_sg = stop_gradient(z'_new)                 \triangleright z'_2 in Eq. (16)
  \zeta = relax_gibbs_sample(z'_sg, \phi, t)    \triangleright \zeta_1, \zeta_2 in Eq. (16)
  \phi' = stop_gradient(\phi)                   \triangleright Roeder et al. (2017)
  L_x = \underbrace{-E_\theta(\zeta) - \log Z_\theta}_{prior} + \underbrace{\log p_\theta(x|\zeta)}_{likelihood} + \underbrace{E_{\phi'}(\zeta) + \log Z_{\phi'}}_{approx. post}

4. Experiments

We examine our proposed gradient estimator for training undirected posteriors on three tasks: variational autoencoders, importance-weighted autoencoders, and structured prediction models, using the binarized MNIST (Salakhutdinov & Murray, 2008) and OMNIGLOT (Lake et al., 2015) datasets. We follow the experimental setup of DVAE# (Vahdat et al., 2018a) with minor modifications outlined below.

4.1. Variational Autoencoders

We train a VAE of the form p(z) p(x|z), where p(z) is an RBM and p(x|z) is a neural network. p(x|z) is represented using a fully-connected neural network having two 200-unit hidden layers, tanh activations, and batch normalization, similar to DVAE++ (Vahdat et al., 2018b), DVAE# (Vahdat et al., 2018a), and GumBolt (Khoshaman & Amin, 2018). We compare generative models trained with an undirected posterior (DVAE##) to those trained with a directed posterior. For DVAE##, q(z|x) is modeled with a neural network having two tanh hidden layers that predicts the parameters of the RBM (Fig. 3(a)). Training DVAE## is done using Algorithm 1 with s = 10 and t = 1, using a piecewise-linear relaxation (Andriyash et al., 2018) for the relaxed Gibbs samples. We follow Vahdat et al. (2018a) for the batch size, learning rate schedule, and KL warm-up parameters. During training, samples from the prior are required for computing the gradient of \log Z_\theta. This sampling is done using the QuPA library, which offers population-annealing-based sampling and partition function estimation (using AIS) from within TensorFlow. We also use QuPA for sampling the undirected posteriors and estimating their partition functions during evaluation. We explore VAEs with equally sized RBM prior and posteriors consisting of either 200 (100+100) or 400 (200+200) latent variables. Test set negative log-likelihoods are computed using 4000 importance-weighted samples.

Baselines: We compare the RBM posteriors with directed posteriors in which the posterior is factored across groups of latent variables z_i as q(z|x) = \prod_{i=1}^{L} q_i(z_i|x, z_{<i})