# Semi-Implicit Variational Inference

Mingzhang Yin 1, Mingyuan Zhou 2

1 Department of Statistics and Data Sciences, 2 Department of IROM, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA. Correspondence to: Mingzhang Yin, Mingyuan Zhou.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

## Abstract

Semi-implicit variational inference (SIVI) is introduced to expand the commonly used analytic variational distribution family, by mixing the variational parameter with a flexible distribution. This mixing distribution can assume any density function, explicit or not, as long as independent random samples can be generated via reparameterization. Not only does SIVI expand the variational family to incorporate highly flexible variational distributions, including implicit ones that have no analytic density functions, but it also sandwiches the evidence lower bound (ELBO) between a lower bound and an upper bound, and further derives an asymptotically exact surrogate ELBO that is amenable to optimization via stochastic gradient ascent. With a substantially expanded variational family and a novel optimization algorithm, SIVI is shown to closely match the accuracy of MCMC in inferring the posterior in a variety of Bayesian inference tasks.

## 1. Introduction

Variational inference (VI) is an optimization-based method that is widely used for approximate Bayesian inference. It introduces a variational distribution Q over the latent variables to approximate the posterior (Jordan et al., 1999), and its stochastic version is scalable to big data (Hoffman et al., 2013). VI updates the parameters of Q to move it closer to the posterior in each iteration, where the closeness is generally measured by the Kullback-Leibler (KL) divergence from the posterior to Q; minimizing this divergence is shown to be equivalent to maximizing the evidence lower bound (ELBO) (Jordan et al., 1999). To make it simple to climb the ELBO to a local optimum, one often takes the mean-field assumption that Q is factorized over the latent variables. The optimization problem is further simplified if each latent variable's distribution is in the same exponential family as its prior, which allows exploiting conditional conjugacy to derive closed-form coordinate-ascent update equations (Bishop & Tipping, 2000; Blei et al., 2017).

Despite its popularity, VI has a well-known issue of underestimating the variance of the posterior, which is often attributed to the mismatch between the representation power of the variational family that Q is restricted to and the complexity of the posterior, and to the use of the KL divergence, which is asymmetric, to measure how different Q is from the posterior. This issue is often further amplified in mean-field VI (MFVI), due to the factorized assumption on Q that ignores the dependencies between different factorization components (Wainwright et al., 2008; Blei et al., 2017). While there exists a variety of methods that add some structure back to Q to partially restore the dependencies (Saul & Jordan, 1996; Jaakkola & Jordan, 1998; Hoffman & Blei, 2015; Giordano et al., 2015; Tran et al., 2015; 2016; Han et al., 2016; Ranganath et al., 2016; Maaløe et al., 2016; Gregor et al., 2015), it is still necessary for Q to have an analytic probability density function (PDF).
To further expand the variational family that Q belongs to, there has been significant recent interest in defining Q with an implicit model, which makes the PDF of Q intractable (Huszár, 2017; Mohamed & Lakshminarayanan, 2016; Tran et al., 2017; Li & Turner, 2017; Mescheder et al., 2017; Shi et al., 2017). While using an implicit model could make Q more flexible, it makes it no longer possible to directly compute the log density ratio required for evaluating the ELBO. Thus, one often resorts to density ratio estimation, which, however, not only adds an additional level of complexity to each iteration of the optimization, but also is known to be a very difficult problem, especially in high-dimensional settings (Sugiyama et al., 2012).

To well characterize the posterior while maintaining simple optimization, we introduce semi-implicit VI (SIVI), which imposes a mixing distribution on the parameters of the original Q to expand the variational family with a semi-implicit hierarchical construction. The meaning of "semi-implicit" is twofold: 1) the original Q distribution is required to have an analytic PDF, but its mixing distribution is not subject to such a constraint; and 2) even if both the original Q and its mixing distribution have analytic PDFs, the marginal of the hierarchy is commonly implicit, that is, it has a non-analytic PDF. Our intuition behind SIVI is that even if this marginal is not tractable, its density can be evaluated with Monte Carlo estimation under the semi-implicit hierarchical construction, an expansion that helps model skewness, kurtosis, multimodality, and other characteristics that are exhibited by the posterior but fail to be captured by the original variational family. For MFVI, a benefit of this expansion is restoring the dependencies between its factorization components, as the resulting Q distribution becomes conditionally independent but marginally dependent.

SIVI makes three major contributions: 1) a reparameterizable implicit distribution can be used as a mixing distribution to effectively expand the richness of the variational family; 2) an analytic conditional Q distribution is used to sidestep the hard problem of density ratio estimation, and is not required to be reparameterizable in conditionally conjugate models; and 3) SIVI sandwiches the ELBO between a lower bound and an upper bound, and derives an asymptotically exact surrogate ELBO that is amenable to direct optimization via stochastic gradient ascent. With a flexible variational family and novel optimization, SIVI bridges the gap in posterior estimation accuracy between VI and Markov chain Monte Carlo (MCMC), which can accurately characterize the posterior using MCMC samples, as will be demonstrated in a variety of Bayesian inference tasks. Code is provided at https://github.com/mingzhang-yin/SIVI

## 2. Semi-Implicit Variational Inference

In VI, given observations $x$, latent variables $z$, model likelihood $p(x \mid z)$, and prior $p(z)$, we approximate the posterior $p(z \mid x)$ with a variational distribution $q(z \mid \psi)$ that is often required to be explicit. We optimize the variational parameter $\psi$ to minimize $\mathrm{KL}(q(z \mid \psi)\,\|\,p(z \mid x))$, the KL divergence from $p(z \mid x)$ to $q(z \mid \psi)$. Since one may show $\log p(x) = \mathrm{ELBO} + \mathrm{KL}(q(z \mid \psi)\,\|\,p(z \mid x))$, where

$$\mathrm{ELBO} = \mathbb{E}_{z \sim q(z \mid \psi)}\big[\log p(x, z) - \log q(z \mid \psi)\big], \tag{1}$$

minimizing $\mathrm{KL}(q(z \mid \psi)\,\|\,p(z \mid x))$ is equivalent to maximizing the ELBO (Bishop & Tipping, 2000; Blei et al., 2017).
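For concreteness, the ELBO in (1) can be estimated by simple Monte Carlo once $q(z \mid \psi)$ is reparameterizable. The sketch below (an illustration, not the paper's released code) uses a one-dimensional Gaussian $q(z \mid \psi)$ with $\psi = (\mu, \log\sigma)$ and a placeholder log-joint; both are assumptions made only to show the estimate that vanilla VI climbs.

```python
# A minimal sketch (NumPy/SciPy) of a Monte Carlo estimate of the ELBO in Eq. (1)
# for vanilla VI with a reparameterized Gaussian q(z | psi), psi = (mu, log_sigma).
# The toy log-joint below is an illustrative assumption, not a model from the paper.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def log_joint(z):
    return norm.logpdf(z, loc=1.0, scale=0.5)      # placeholder log p(x, z)

def elbo_estimate(mu, log_sigma, n_samples=1000):
    sigma = np.exp(log_sigma)
    z = mu + sigma * rng.normal(size=n_samples)    # z ~ q(z | psi) via reparameterization
    return np.mean(log_joint(z) - norm.logpdf(z, loc=mu, scale=sigma))

print(elbo_estimate(mu=0.8, log_sigma=-1.0))       # approaches log p(x) as q improves
```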
Rather than treating as the variational parameter to be inferred, SIVI regards q( ) as a random variable, as described below. Note that when q( ) degenerates to a point mass density, SIVI reduces to vanilla VI. 2.1. Semi-Implicit Variational Family Assuming qφ( ), where φ denotes the distribution parameter to be inferred, the semi-implicit variational distri- bution for z can be defined in a hierarchical manner as z q(z | ), qφ( ). Marginalizing the intermediate variable out, we can view z as a random variable drawn from distribution family H indexed by variational parameter φ, expressed as hφ(z) : hφ(z) = q(z | )qφ( )d Note q(z | ) is required to be explicit, but the mixing distribution qφ( ) is allowed to be implicit. Moreover, unless qφ( ) is conjugate to q(z | ), the marginal Q distribution hφ(z) 2 H is often implicit. These are the two reasons for referring to the proposed VI as semi-implicit VI (SIVI). SIVI requires q(z | ) to be explicit, and also requires it to either be reparameterizable, which means z q(z | ) can be generated by transforming random noise " via function f(", ), or allow the ELBO in (1) to be analytic. Whereas the mixing distribution q( ) is required to be reparameterizable but not necessarily explicit. In particular, SIVI draws from q( ) by transforming random noise via a deep neural network, which generally leads to an implicit distribution for q( ) due to a non-invertible transform. SIVI is related to the hierarchical variational model (Ranganath et al., 2016; Maaløe et al., 2016; Agakov & Barber, 2004) in using a hierarchical variational distribution, but, as discussed below, differs from it in allowing an implicit mixing distribution qφ( ) and optimizing the variational parameter via an asymptotically exact surrogate ELBO. Note as long as qφ( ) can degenerate to delta function δ 0( ) for arbitrary 0, the semi-implicit variational family H is a strict expansion of the original Q = {q(z | 0)} family, that is, Q H. For MFVI that assumes q(z | ) = Q m q(zm | m), this expansion significantly helps restore the dependencies between zm if m are not imposed to be independent between each other. 2.2. Implicit Mixing Distribution While restricting q(z | ) to be explicit, SIVI introduces a mixing distribution qφ( ) to enhance its representation power. In this paper, we construct qφ( ) with an implicit distribution that generates its random samples via a stochastic procedure but may not allow a pointwise evaluable PDF. More specifically, an implicit distribution (Mohamed & Lakshminarayanan, 2016; Tran et al., 2017), consisting of a source of randomness q( ) for 2 Rg and a deterministic transform Tφ : Rg ! Rd, can be constructed as = Tφ( ), q( ), with PDF qφ( ) = @ @ 1 . . . @ @ d Tφ( ) q( )d . (2) When Tφ is invertible and the integration is tractable, the PDF of can be calculated with (2), but this is not the case Semi-Implicit Variational Inference in general and hence qφ( ) is often implicit. When Tφ( ) is chosen as a deep neural network, thanks to its high modeling capacity, qφ( ) can be highly flexible and the dependencies between the elements of can be well captured. Prevalently used in the study of thermodynamics, ecology, epidemiology, and differential equation systems, implicit distributions have only been recently introduced in VI to parameterize q(z | ) (Li & Turner, 2017; Mescheder et al., 2017; Husz ar, 2017; Tran et al., 2017). 
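To make the semi-implicit construction concrete, the following sketch (an illustration under stated assumptions, not the paper's released code) draws samples from $h_\phi(z)$ by pushing Gaussian noise through a small randomly initialized MLP playing the role of $T_\phi$, then sampling the explicit Gaussian conditional; the layer sizes, weights, and conditional variance are all assumptions.

```python
# A minimal sketch (NumPy) of drawing samples from a semi-implicit variational
# distribution h_phi(z) = E_{psi ~ q_phi(psi)} q(z | psi). The MLP weights,
# layer sizes, and the Gaussian conditional are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
noise_dim, psi_dim, sigma0 = 10, 2, 0.1

# Randomly initialized MLP T_phi: epsilon -> psi (the implicit mixing layer).
W1, b1 = rng.normal(size=(noise_dim, 30)), np.zeros(30)
W2, b2 = rng.normal(size=(30, psi_dim)), np.zeros(psi_dim)

def T_phi(eps):
    """Non-invertible deterministic transform; q_phi(psi) is therefore implicit."""
    h = np.maximum(eps @ W1 + b1, 0.0)                 # ReLU hidden layer
    return h @ W2 + b2

def sample_h_phi(n):
    eps = rng.normal(size=(n, noise_dim))              # epsilon ~ q(epsilon)
    psi = T_phi(eps)                                   # psi ~ q_phi(psi), implicit
    z = psi + sigma0 * rng.normal(size=psi.shape)      # z ~ q(z | psi) = N(psi, sigma0^2 I), explicit
    return z

samples = sample_h_phi(5000)                           # i.i.d. draws from h_phi(z)
```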
Using implicit distributions with intractable PDFs increases flexibility but substantially complicates the optimization problem for VI, due to the need to estimate log density ratios involving intractable PDFs, which is particularly challenging in high dimensions (Sugiyama et al., 2012). By contrast, taking a semi-implicit construction, SIVI offers the best of both worlds: it constructs a highly flexible variational distribution without sacrificing the key benefit of VI, namely converting posterior inference into an optimization problem that is simple to solve. Below we develop a novel optimization algorithm that exploits SIVI's semi-implicit construction.

## 3. Optimization for SIVI

To optimize the variational parameters of SIVI, below we first derive a lower bound of the ELBO, climbing which, however, could drive the mixing distribution $q_\phi(\psi)$ towards a point mass density. To prevent degeneracy, we add a nonnegative regularization term, leading to a surrogate ELBO that is asymptotically exact and can be further tightened by importance reweighting. To sandwich the ELBO, we also derive an upper bound of the ELBO, optimizing which, however, may lead to divergence. We further show that this upper bound can be corrected to a tighter upper bound that monotonically converges from above towards the ELBO.

### 3.1. Lower Bound of ELBO

Theorem 1 (Cover & Thomas (2012)). The KL divergence from distribution $p(z)$ to distribution $q(z)$, expressed as $\mathrm{KL}(q(z)\,\|\,p(z))$, is convex in the pair $(q(z), p(z))$.

Fixing the distribution $p(z)$ in Theorem 1, the KL divergence can be viewed as a convex functional in $q(z)$. As in Appendix A, with Theorem 1 and Jensen's inequality, we have

$$\mathrm{KL}\big(\mathbb{E}_\psi\, q(z \mid \psi)\,\|\,p(z)\big) \le \mathbb{E}_\psi\, \mathrm{KL}\big(q(z \mid \psi)\,\|\,p(z)\big). \tag{3}$$

Thus, using $h_\phi(z) = \mathbb{E}_{\psi \sim q_\phi(\psi)} q(z \mid \psi)$ as the variational distribution, SIVI has a lower bound for its ELBO as

$$\underline{L}(q(z \mid \psi), q_\phi(\psi)) = \mathbb{E}_{\psi \sim q_\phi(\psi)} \mathbb{E}_{z \sim q(z \mid \psi)} \log \frac{p(x, z)}{q(z \mid \psi)} = -\mathbb{E}_{\psi \sim q_\phi(\psi)} \mathrm{KL}\big(q(z \mid \psi)\,\|\,p(z \mid x)\big) + \log p(x) \le -\mathrm{KL}\big(\mathbb{E}_{\psi \sim q_\phi(\psi)} q(z \mid \psi)\,\|\,p(z \mid x)\big) + \log p(x) = \mathcal{L} = \mathbb{E}_{z \sim h_\phi(z)} \log \frac{p(x, z)}{h_\phi(z)}. \tag{4}$$

The PDF of $h_\phi(z)$ is often intractable, especially if $q_\phi(\psi)$ is implicit, prohibiting a Monte Carlo estimation of the ELBO $\mathcal{L}$. By contrast, a Monte Carlo estimation of $\underline{L}$ only requires $q(z \mid \psi)$ to have an analytic PDF and $q_\phi(\psi)$ to be convenient to sample from. It is this separation of evaluation and sampling that allows SIVI to combine an explicit $q(z \mid \psi)$ with an implicit $q_\phi(\psi)$ that is as powerful as needed, while maintaining tractable computation.

### 3.2. Degeneracy and Regularization

A direct optimization of the lower bound $\underline{L}$ in (4), however, can suffer from degeneracy, as shown in the proposition below. All proofs are deferred to Appendix A.

Proposition 1 (Degeneracy). Denote $\psi^* = \arg\max_\psi \mathbb{E}_{z \sim q(z \mid \psi)} \log \frac{p(x, z)}{q(z \mid \psi)}$; then

$$\underline{L}(q(z \mid \psi), q_\phi(\psi)) \le \mathbb{E}_{z \sim q(z \mid \psi^*)} \log \frac{p(x, z)}{q(z \mid \psi^*)},$$

where the equality holds if and only if $q_\phi(\psi) = \delta_{\psi^*}(\psi)$.

Therefore, if the variational parameter is optimized by climbing $\underline{L}(q(z \mid \psi), q_\phi(\psi))$ without stopping the optimization algorithm early, $q_\phi(\psi)$ could converge to a point mass density, making SIVI degenerate to vanilla VI. To prevent degeneracy, we regularize $\underline{L}$ by adding

$$B_K = \mathbb{E}_{\psi, \psi^{(1)}, \ldots, \psi^{(K)} \sim q_\phi(\psi)} \mathrm{KL}\big(q(z \mid \psi)\,\|\,\tilde{h}_K(z)\big), \tag{5}$$

where $\tilde{h}_K(z) = \frac{q(z \mid \psi) + \sum_{k=1}^K q(z \mid \psi^{(k)})}{K + 1}$. Note that $B_K \ge 0$, with $B_K = 0$ if and only if $K = 0$ or $q_\phi(\psi)$ degenerates to a point mass density. Therefore, $\underline{L}_0 = \underline{L}$, and maximizing $\underline{L}_K = \underline{L} + B_K$ with $K \ge 1$ would encourage a positive $B_K$ and drive $q(\psi)$ away from degeneracy.
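The regularized lower bound $\underline{L}_K = \underline{L} + B_K$ can be estimated with only samples of $\psi$ and pointwise evaluations of the explicit conditional, since $\underline{L}$ and $B_K$ fold into a single log density ratio against $\tilde{h}_K(z)$. The sketch below illustrates such an estimate; the toy log-joint, the Gaussian conditional, and the stand-in sampler for $q_\phi(\psi)$ are assumptions for illustration only.

```python
# A minimal sketch (NumPy/SciPy) of a Monte Carlo estimate of the surrogate lower
# bound L_K = L + B_K, folded into one log density ratio against h_tilde_K(z).
# Only samples of psi and the analytic conditional q(z | psi) are needed, so the
# implicit q_phi(psi) never requires a tractable PDF. log_joint and the Gaussian
# conditional below are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma0 = 0.1

def log_joint(z):
    # Placeholder log p(x, z); a standard normal target for illustration.
    return norm.logpdf(z, loc=0.0, scale=1.0)

def log_q_cond(z, psi):
    # Analytic conditional q(z | psi) = N(z; psi, sigma0^2).
    return norm.logpdf(z, loc=psi, scale=sigma0)

def sample_psi(n):
    # Stand-in for the implicit mixing distribution q_phi(psi): any sampler works,
    # because only samples of psi (never its density) are needed.
    return 2.0 * np.tanh(rng.normal(size=n))

def surrogate_lower_bound(K, J=64):
    psi = sample_psi(J)                                   # psi_j ~ q_phi(psi)
    z = psi + sigma0 * rng.normal(size=J)                 # z_j ~ q(z | psi_j)
    psi_k = sample_psi(K)                                 # psi^(k) ~ q_phi(psi)
    dens = np.exp(log_q_cond(z[:, None], psi_k[None, :]))     # q(z_j | psi^(k)), shape (J, K)
    h_tilde = (np.exp(log_q_cond(z, psi)) + dens.sum(axis=1)) / (K + 1)
    return np.mean(log_joint(z) - np.log(h_tilde))

print(surrogate_lower_bound(K=50))
```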
Moreover, as $\lim_{K \to \infty} \tilde{h}_K(z) = \mathbb{E}_{\psi \sim q_\phi(\psi)} q(z \mid \psi) = h_\phi(z)$ by the strong law of large numbers and

$$\lim_{K \to \infty} B_K = \mathbb{E}_{\psi \sim q_\phi(\psi)} \mathrm{KL}\big(q(z \mid \psi)\,\|\,h_\phi(z)\big) \tag{6}$$

by interchanging two limiting operations, as discussed in detail in Appendix A, we have the following proposition.

Proposition 2. Suppose $\underline{L}$ and $\mathcal{L}$ are defined as in (4) and $B_K$ as in (5); then the regularized lower bound $\underline{L}_K = \underline{L} + B_K$ is an asymptotically exact ELBO that satisfies $\underline{L}_0 = \underline{L}$ and $\lim_{K \to \infty} \underline{L}_K = \mathcal{L}$.

### 3.3. Upper Bound of ELBO and Correction

Using the concavity of the logarithmic function, we have $\log h_\phi(z) \ge \mathbb{E}_{\psi \sim q_\phi(\psi)} \log q(z \mid \psi)$, and hence we can obtain an upper bound of SIVI's ELBO as

$$\overline{L}(q(z \mid \psi), q_\phi(\psi)) = \mathbb{E}_{\psi \sim q_\phi(\psi)} \mathbb{E}_{z \sim h_\phi(z)} \log \frac{p(x, z)}{q(z \mid \psi)} \ge \mathcal{L} = \mathbb{E}_{z \sim h_\phi(z)} \log \frac{p(x, z)}{h_\phi(z)}. \tag{7}$$

Comparing (4) and (7) shows that the lower bound $\underline{L}$ and upper bound $\overline{L}$ differ from each other only in whether the expectation involving $z$ is taken with respect to $q(z \mid \psi)$ or $h_\phi(z)$. Moreover, one may show that $\overline{L} - \underline{L}$ is equal to $\mathbb{E}_{\psi \sim q(\psi)}\big[\mathrm{KL}(q(z \mid \psi)\,\|\,h_\phi(z)) + \mathrm{KL}(h_\phi(z)\,\|\,q(z \mid \psi))\big]$. Since $\overline{L}$ may not be bounded above by the evidence $\log p(x)$ and $\overline{L} - \underline{L}$ is not bounded from above, there is no convergence guarantee if maximizing $\overline{L}$. For this reason, we subtract from $\overline{L}$ a correction term

$$A_K = \mathbb{E}_{\psi \sim q_\phi(\psi)} \mathbb{E}_{z \sim h_\phi(z)} \mathbb{E}_{\psi^{(1)}, \ldots, \psi^{(K)} \sim q_\phi(\psi)} \log \frac{\frac{1}{K} \sum_{k=1}^K q(z \mid \psi^{(k)})}{q(z \mid \psi)}. \tag{8}$$

As $\mathbb{E}_{\psi^{(1)} \sim q_\phi(\psi)} \log q(z \mid \psi^{(1)}) = \mathbb{E}_{\psi \sim q_\phi(\psi)} \log q(z \mid \psi)$, we have $A_1 = 0$. The following proposition shows that the corrected upper bound $\overline{L}_K = \overline{L} - A_K$ monotonically converges from above towards the ELBO as $K \to \infty$.

Proposition 3. Suppose $\overline{L}$ and $\mathcal{L}$ are defined as in (7) and $A_K$ as in (8); then the corrected upper bound $\overline{L}_K = \overline{L} - A_K$ monotonically converges from above towards the ELBO, satisfying $\overline{L}_1 = \overline{L}$, $\overline{L}_{K+1} \le \overline{L}_K$, and $\lim_{K \to \infty} \overline{L}_K = \mathcal{L}$.

The relationship between $\underline{L}_K = \underline{L} + B_K$ and $\overline{L}_K = \overline{L} - A_K$, two different asymptotically exact ELBOs, can be revealed by rewriting them as

$$\underline{L}_K = \mathbb{E}_{\psi \sim q_\phi(\psi)} \mathbb{E}_{z \sim q(z \mid \psi)} \mathbb{E}_{\psi^{(1)}, \ldots, \psi^{(K)} \sim q_\phi(\psi)} \log \frac{p(x, z)}{\frac{1}{K+1}\big[q(z \mid \psi) + \sum_{k=1}^K q(z \mid \psi^{(k)})\big]}, \tag{9}$$

$$\overline{L}_K = \mathbb{E}_{\psi \sim q_\phi(\psi)} \mathbb{E}_{z \sim q(z \mid \psi)} \mathbb{E}_{\psi^{(1)}, \ldots, \psi^{(K)} \sim q_\phi(\psi)} \log \frac{p(x, z)}{\frac{1}{K}\sum_{k=1}^K q(z \mid \psi^{(k)})}. \tag{10}$$

Thus $\underline{L}_K$ differs from $\overline{L}_K$ in whether $q(z \mid \psi)$ participates in computing the log density ratio, which is analytic thanks to the semi-implicit construction, inside the expectations. When $K$ is small, using $\underline{L}_K$ as the surrogate ELBO for optimization is expected to have better numerical stability than using $\overline{L}_K$, as $\underline{L}_0 = \underline{L}$ relates to the ELBO as a lower bound while $\overline{L}_1 = \overline{L}$ does as an upper bound; but increasing $K$ quickly diminishes the difference between $\underline{L}_K$ and $\overline{L}_K$, which are both asymptotically exact.

It is also instructive to note that since $z \sim q(z \mid \psi)$ is used for Monte Carlo estimation, if one assumes that $\{q(z \mid \psi), q(z \mid \psi^{(1)}), \ldots, q(z \mid \psi^{(K)})\}$ has a dominant element, then it is most likely that $q(z \mid \psi)$ dominates all $q(z \mid \psi^{(k)})$. In that case, maximizing $\underline{L}_K$ in (9) would become almost the same as maximizing $\underline{L}_0$, which would lead to degeneracy as in Proposition 1; degeneracy, however, means $\psi = \psi^{(k)}$ and $q(z \mid \psi) = q(z \mid \psi^{(k)})$ for all $k$, contradicting the assumption that $q(z \mid \psi)$ dominates all $q(z \mid \psi^{(k)})$.

Using the importance reweighting idea, Burda et al. (2015) provide a lower bound $L_{\tilde{K}} \ge \mathrm{ELBO}$ that monotonically converges from below to the evidence $\log p(x)$ as $\tilde{K}$ increases. Using the same idea, we may also tighten the asymptotically exact surrogate ELBO in (9) as

$$\underline{L}_{K, \tilde{K}} = \mathbb{E}_{(z_1, \psi_1), \ldots, (z_{\tilde{K}}, \psi_{\tilde{K}}) \sim q(z \mid \psi) q_\phi(\psi)}\, \mathbb{E}_{\psi^{(1)}, \ldots, \psi^{(K)} \sim q_\phi(\psi)} \log \frac{1}{\tilde{K}} \sum_{i=1}^{\tilde{K}} \frac{p(x, z_i)}{\frac{q(z_i \mid \psi_i) + \sum_{k=1}^K q(z_i \mid \psi^{(k)})}{K+1}},$$

for which $\lim_{K \to \infty} \underline{L}_{K, \tilde{K}} = L_{\tilde{K}} \ge \mathrm{ELBO}$ and $\lim_{K, \tilde{K} \to \infty} \underline{L}_{K, \tilde{K}} = \lim_{\tilde{K} \to \infty} L_{\tilde{K}} = \log p(x)$.
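As an illustration of the sandwich, the following sketch estimates both (9) and (10) from the same draws; the toy model, the Gaussian conditional, and the stand-in mixing sampler are assumptions rather than the paper's setup.

```python
# A minimal sketch (NumPy/SciPy) of sandwiching the ELBO with Monte Carlo estimates
# of Eq. (9) (lower surrogate, keeps q(z|psi) in the mixture) and Eq. (10)
# (corrected upper bound, leaves it out). Toy model and conditional are assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sigma0 = 0.1

log_joint = lambda z: norm.logpdf(z, 0.0, 1.0)             # placeholder log p(x, z)
sample_psi = lambda n: 2.0 * np.tanh(rng.normal(size=n))   # stand-in for q_phi(psi)

def sandwich(K, J=2000):
    psi = sample_psi(J)
    z = psi + sigma0 * rng.normal(size=J)                   # z ~ q(z | psi)
    psi_k = sample_psi(K)
    dens_k = norm.pdf(z[:, None], psi_k[None, :], sigma0)   # q(z_j | psi^(k)), shape (J, K)
    dens_self = norm.pdf(z, psi, sigma0)                    # q(z_j | psi_j)
    lower = np.mean(log_joint(z) - np.log((dens_self + dens_k.sum(1)) / (K + 1)))  # Eq. (9)
    upper = np.mean(log_joint(z) - np.log(dens_k.mean(1)))                         # Eq. (10)
    return lower, upper

print(sandwich(K=100))   # the two estimates bracket the ELBO more tightly as K grows
```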
Using $\underline{L}_{K_t}$ as the surrogate ELBO, where $t$ indexes the iteration number, $K_t \in \{0, 1, \ldots\}$, and $K_{t+1} \ge K_t$, we describe the stochastic gradient ascent algorithm for optimizing the variational parameter in Algorithm 1, in which we further introduce $\xi$ as the variational parameter of the conditional distribution $q_\xi(z \mid \psi)$ that is not mixed with another distribution. For Monte Carlo estimation in Algorithm 1, we use a single random sample for each $\psi^{(k)}$, $J$ random samples of $\psi$, and a single sample of $z$ for each sample of $\psi$. One may further consult Rainforth et al. (2018) to help select $K$, $J$, and $\tilde{K}$ for SIVI. We denote $z = f(\varepsilon, \xi, \psi)$, $\varepsilon \sim p(\varepsilon)$ as the reparameterization for $z \sim q_\xi(z \mid \psi)$. As for $\xi$, if $\xi \ne \emptyset$, one may learn it as in Algorithm 1, set it empirically, or fix it at the value learned by another algorithm such as MFVI. In summary, SIVI constructs a flexible variational distribution by mixing a (potentially) implicit distribution with an explicit one, while maintaining tractable optimization via the use of an asymptotically exact surrogate ELBO.

### 3.4. Score Function Gradient in Conjugate Model

Let us consider the case where $q(z \mid \psi)$ does not have a simple reparameterization but can benefit from conditional conjugacy. In particular, for a conditionally conjugate exponential family model, MFVI has an analytic ELBO, and its variational distribution can be directly used as the $q_\xi(z \mid \psi)$ of SIVI. As in Appendix A, introducing the density ratio

$$r_{\xi,\phi}(z, \varepsilon, \varepsilon^{(1:K)}) = \frac{q_\xi(z \mid T_\phi(\varepsilon))}{\frac{1}{K+1}\big[q_\xi(z \mid T_\phi(\varepsilon)) + \sum_{k=1}^K q_\xi(z \mid T_\phi(\varepsilon^{(k)}))\big]},$$

we approximate the gradient of $\underline{L}_K$ with respect to $\phi$ as

$$\nabla_\phi \underline{L}_K \approx \frac{1}{J} \sum_{j=1}^J \Big\{ \nabla_\phi \mathbb{E}_{z \sim q_\xi(z \mid T_\phi(\varepsilon_j))}\big[\log p(x, z) - \log q_\xi(z \mid T_\phi(\varepsilon_j))\big] + \nabla_\phi \log r_{\xi,\phi}(z_j, \varepsilon_j, \varepsilon^{(1:K)}) + \big[\nabla_\phi \log q_\xi(z_j \mid T_\phi(\varepsilon_j))\big] \log r_{\xi,\phi}(z_j, \varepsilon_j, \varepsilon^{(1:K)}) \Big\}, \tag{11}$$

where the first summation term is equivalent to the gradient of MFVI's ELBO, both the second and third terms correct for the restrictions of $q_\xi(z \mid T_\phi(\varepsilon_j))$, and $\log r_{\xi,\phi}(z, \varepsilon, \varepsilon^{(1:K)})$ in the third term is expected to be small regardless of convergence, effectively mitigating the variance of score function gradient estimation, which is usually high in basic black-box VI; $\nabla_\xi \underline{L}_K$ can be approximated in the same manner. For non-conjugate models, we leave for future study the use of a non-reparameterizable $q_\xi(z \mid \psi)$, for which one may apply customized variance reduction techniques (Paisley et al., 2012; Ranganath et al., 2014; Mnih & Gregor, 2014; Ruiz et al., 2016; Kucukelbir et al., 2017; Naesseth et al., 2017).

## 4. Related Work

There exists a wide variety of VI methods that improve on MFVI. Examples include adding dependencies between latent variables (Saul & Jordan, 1996; Hoffman & Blei, 2015), using a mixture of variational distributions (Bishop et al., 1998; Gershman et al., 2012; Salimans & Knowles, 2013; Guo et al., 2016; Miller et al., 2017), introducing a copula to capture the dependencies between univariate marginals (Tran et al., 2015; Han et al., 2016), handling non-conjugacy (Paisley et al., 2012; Titsias & Lázaro-Gredilla, 2014), and constructing a hierarchical variational distribution (Ranganath et al., 2016; Tran et al., 2017). To increase the expressiveness of the PDF of a random variable, a simple but powerful idea is to transform it with complex deterministic and/or stochastic mappings. One successful application of this idea in VI is constructing the variational distribution with a normalizing flow, which transforms a simple random variable through a sequence of invertible differentiable functions with tractable Jacobians, to deterministically map a simple PDF to a complex one (Rezende & Mohamed, 2015; Kingma et al., 2016; Papamakarios et al., 2017).
Normalizing flows help increase the flexibility of VI, but still require the mapping to be deterministic and invertible. Removing both restrictions, there have been several recent attempts to define highly flexible variational distributions with implicit models (Huszár, 2017; Mohamed & Lakshminarayanan, 2016; Tran et al., 2017; Li & Turner, 2017; Mescheder et al., 2017; Shi et al., 2017). A typical example is transforming random noise via a deep neural network, leading to a non-invertible, highly nonlinear mapping and hence an implicit distribution. While an implicit variational distribution can be made highly flexible, it becomes necessary in each iteration to address the problem of density ratio estimation, which is often transformed into a problem related to learning generative adversarial networks (Goodfellow et al., 2014). In particular, a binary classifier, whose class probability is used for density ratio estimation, is trained in each iteration to discriminate the samples generated by the model from those generated by the variational distribution (Mohamed & Lakshminarayanan, 2016; Uehara et al., 2016; Mescheder et al., 2017). Controlling the bias and variance in density ratio estimation, however, is in general a very difficult problem, especially in high-dimensional settings (Sugiyama et al., 2012).

SIVI is related to the hierarchical variational model (HVM) (Ranganath et al., 2016; Maaløe et al., 2016) in having a hierarchical variational distribution, but there are two major distinctions. 1) The HVM restricts the mixing distribution in the hierarchy to have an explicit PDF, which can be constructed with a Markov chain (Salimans et al., 2015), a mixture model (Ranganath et al., 2016), or a normalizing flow (Ranganath et al., 2016; Louizos & Welling, 2017), but cannot come from an implicit model. By contrast, SIVI requires the conditional distribution $q(z \mid \psi)$ to have an explicit PDF, but does not impose such a constraint on the mixing distribution $q(\psi)$. In fact, any off-the-shelf reparameterizable implicit/explicit distribution can be used in SIVI, leading to a considerably flexible $h_\phi(z) = \mathbb{E}_{\psi \sim q_\phi(\psi)} q(z \mid \psi)$. Moreover, SIVI does not require $q(z \mid \psi)$ to be reparameterizable for conditionally conjugate models. 2) The HVM optimizes a lower bound of the ELBO, constructed by adding a recursively estimated variational distribution that approximates $q(\psi \mid z) = q(z \mid \psi) q(\psi) / h(z)$. By contrast, SIVI sandwiches the ELBO between two bounds, and directly optimizes an asymptotically exact surrogate ELBO, which involves only simple-to-compute analytic density ratios.

## 5. Experiments

We implement SIVI in TensorFlow (Abadi et al., 2015) for a range of inference tasks. Note that SIVI is a general-purpose algorithm that does not rely on conjugacy, and it has an inherent advantage over MCMC in being able to generate independent and identically distributed (iid) posterior samples on the fly, that is, by feed-forward propagating iid random noise through the inferred semi-implicit hierarchy. The toy examples show that SIVI captures skewness, kurtosis, and multimodality. A negative binomial model shows that SIVI can accurately capture the dependencies between latent variables. A bivariate count distribution example shows that, for a conditionally conjugate model, SIVI can utilize a non-reparameterizable variational distribution without being plagued by the high variance of score function gradient estimation.
With Bayesian logistic regression, we demonstrate that SIVI can either work alone as a black-box inference procedure for correlated latent variables, or directly expand MFVI by adding a mixing distribution, leading to accurate uncertainty estimation on par with that of MCMC. Last but not least, moving beyond the canonical Gaussian-based variational autoencoder (VAE), SIVI helps construct a semi-implicit VAE (SIVAE) to improve unsupervised feature learning and amortized inference.

### 5.1. Expressiveness of SIVI

We first show the expressiveness of SIVI by approximating various target distributions. As listed in Table 1, the conditional layer of SIVI is chosen to be as simple as an isotropic Gaussian (or log-normal) distribution $\mathcal{N}(\psi, \sigma_0^2 I)$. The implicit mixing layer is a multilayer perceptron (MLP), with layer widths [30, 60, 30] and a ten-dimensional isotropic Gaussian noise as its input. We fix $\sigma_0^2 = 0.1$ and optimize the implicit layer to minimize $\mathrm{KL}\big(\mathbb{E}_{\psi \sim q_\phi(\psi)} q(z \mid \psi)\,\|\,p(z)\big)$.

Table 1. Inference and target distributions.

| Semi-implicit $h(z) = \mathbb{E}_{\psi \sim q(\psi)} q(z \mid \psi)$ | Target $p(z)$ |
| --- | --- |
| $z \sim \mathcal{N}(\psi, 0.1),\ \psi \sim q(\psi)$ | $\mathrm{Laplace}(z; \mu = 0, b = 2)$ |
| $z \sim \mathcal{N}(\psi, 0.1),\ \psi \sim q(\psi)$ | $0.3\,\mathcal{N}(z; -2, 1) + 0.7\,\mathcal{N}(z; 2, 1)$ |
| $z \sim \text{Log-Normal}(\psi, 0.1),\ \psi \sim q(\psi)$ | $\mathrm{Gamma}(z; 2, 1)$ |
| $z \sim \mathcal{N}(\psi, \mathrm{diag}(0.1, 0.1)),\ \psi \sim q(\psi)$ | $0.5\,\mathcal{N}(z; -2 \cdot \mathbf{1}, \Sigma) + 0.5\,\mathcal{N}(z; 2 \cdot \mathbf{1}, \Sigma)$, $\Sigma = \left(\begin{smallmatrix} 2 & 1.8 \\ 1.8 & 2 \end{smallmatrix}\right)$ |
| $z \sim \mathcal{N}(\psi, \mathrm{diag}(0.1, 0.1)),\ \psi \sim q(\psi)$ | $\mathcal{N}(z_1; z_2^2/4, 1)\,\mathcal{N}(z_2; 0, 4)$ |

Figure 1. Approximating synthetic target distributions with SIVI.

As shown in Figure 1, despite having a fixed, purposely misspecified explicit layer, by training a flexible implicit layer, random samples from which are illustrated in Figure 8 of Appendix D, SIVI infers a sophisticated marginal variational distribution that accurately captures the skewness, kurtosis, and/or multimodality exhibited by the target one.

### 5.2. Negative Binomial Model

We consider a negative binomial (NB) distribution with gamma and beta priors ($a = b = \alpha = \beta = 0.01$) as

$$x_i \overset{iid}{\sim} \mathrm{NB}(r, p),\quad r \sim \mathrm{Gamma}(a, 1/b),\quad p \sim \mathrm{Beta}(\alpha, \beta),$$

for which the posterior $p(r, p \mid \{x_i\}_{1,N})$ is not tractable. In comparison to Gibbs sampling, it is shown in Zhou et al. (2012) that MFVI, which uses $q(r, p) = \mathrm{Gamma}(r; \tilde{a}, 1/\tilde{b})\,\mathrm{Beta}(p; \tilde{\alpha}, \tilde{\beta})$ to approximate the posterior, notably underestimates the variance. This caveat of MFVI motivates a semi-implicit variational distribution as

$$q(r, p \mid \psi) = \text{Log-Normal}(r; \mu_r, \sigma_0^2)\,\text{Logit-Normal}(p; \mu_p, \sigma_0^2),\quad \psi = (\mu_r, \mu_p) \sim q(\psi),\quad \sigma_0 = 0.1,$$

where an MLP-based implicit $q(\psi)$, as in Section 5.1, is used by SIVI to capture the dependency between $r$ and $p$. We apply Gibbs sampling, MFVI, and SIVI to a real overdispersed count dataset of Bliss & Fisher (1953) that records the number of adult red mites on each of 150 randomly selected apple leaves. With $K = 1000$, as shown in Figure 2, SIVI improves on MFVI in closely matching the posterior inferred by Gibbs sampling. Moreover, the mixing distribution $q(\psi)$ helps well capture the negative correlation between $r$ and $p$, which is totally ignored by MFVI.

Figure 2. Top row: the marginal posteriors of $r$ and $p$ inferred by MFVI, SIVI, and MCMC. Bottom row: the inferred implicit mixing distribution $q(\psi)$ and the joint posterior of $r$ and $p$.

Figure 3. Kolmogorov-Smirnov (KS) distance and its corresponding p-value between the marginal posteriors of $r$ and $p$ inferred by SIVI and MCMC. SIVI rapidly improves as $K$ increases. See Appendix D for the corresponding plots of marginal posteriors.

The two-sample Kolmogorov-Smirnov (KS) distances, between 2000 posterior samples generated by SIVI and 2000 MCMC ones, are 0.0185 (p-value = 0.88) and 0.0200 (p-value = 0.81) for $r$ and $p$, respectively.
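As an illustration of this construction, the sketch below samples $(r, p)$ from the semi-implicit variational distribution and evaluates the analytic conditional density, including the change-of-variable terms of the log and logit maps; the stand-in for the MLP-based $q(\psi)$ and all weights are assumptions made only for illustration.

```python
# A minimal sketch (NumPy/SciPy) of the semi-implicit variational distribution used
# for the NB model: q(r, p | psi) = Log-Normal(r; mu_r, sigma0^2) Logit-Normal(p; mu_p, sigma0^2),
# with psi = (mu_r, mu_p) drawn from an implicit mixing distribution. The tanh network
# standing in for q(psi) is an illustrative assumption.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma0 = 0.1
W_in = rng.normal(size=(10, 2))
A = np.array([[1.0, 0.3], [0.3, 1.0]])

def sample_psi(n):
    eps = rng.normal(size=(n, 10))
    return np.tanh(eps @ W_in) @ A                       # (mu_r, mu_p) pairs

def sample_r_p(psi):
    mu_r, mu_p = psi[:, 0], psi[:, 1]
    r = np.exp(mu_r + sigma0 * rng.normal(size=len(psi)))                        # Log-Normal
    p = 1.0 / (1.0 + np.exp(-(mu_p + sigma0 * rng.normal(size=len(psi)))))       # Logit-Normal
    return r, p

def log_q_cond(r, p, psi):
    # Analytic conditional density, including the Jacobians of the exp and logistic maps.
    mu_r, mu_p = psi[:, 0], psi[:, 1]
    log_r = norm.logpdf(np.log(r), mu_r, sigma0) - np.log(r)
    logit_p = np.log(p) - np.log1p(-p)
    log_p = norm.logpdf(logit_p, mu_p, sigma0) - np.log(p) - np.log1p(-p)
    return log_r + log_p

psi = sample_psi(4)
r, p = sample_r_p(psi)
print(log_q_cond(r, p, psi))   # the analytic term used inside the surrogate ELBO
```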
By contrast, for MFVI and MCMC, they are 0.2695 (p-value = $5.26 \times 10^{-64}$) and 0.2965 (p-value = $2.21 \times 10^{-77}$), which significantly reject the null hypothesis that the posterior inferred by MFVI matches that by MCMC. How the performance is related to $K$ is shown in Figure 3 and Figures 9-10 of Appendix D, which suggest that $K = 20$ achieves a nice compromise between complexity and accuracy, and that as $K$ increases, the posterior inferred by SIVI quickly approaches that inferred by MCMC.

### 5.3. Non-reparameterizable Variational Distribution

To show that SIVI can use a non-reparameterizable $q(z \mid \psi)$ in a conditionally conjugate model, as discussed in Section 3.4, we apply it to infer the two parameters of the Poisson-logarithmic bivariate count distribution

$$p(n_i, l_i \mid r, p) = r^{l_i} p^{n_i} (1 - p)^r / Z_i,$$

where $l_i \in \{0, \ldots, n_i\}$ and the $Z_i$ are normalization constants not related to $r > 0$ and $p \in (0, 1)$ (Zhou & Carin, 2015; Zhou et al., 2016). With $r \sim \mathrm{Gamma}(a, 1/b)$ and $p \sim \mathrm{Beta}(\alpha, \beta)$, while the joint posterior $p(r, p \mid \{n_i, l_i\}_{1,N})$ is intractable, the conditional posteriors of $r$ and $p$ follow the gamma and beta distributions, respectively. Although neither the gamma nor the beta distribution is reparameterizable, SIVI multiplies them to construct a semi-implicit variational distribution as $q(r, p \mid \psi) = \mathrm{Gamma}(r; \psi_1, \psi_2)\,\mathrm{Beta}(p; \psi_3, \psi_4)$, where $\psi = (\psi_1, \psi_2, \psi_3, \psi_4) \sim q(\psi)$ is similarly constructed as in Section 5.1. We set $K = 200$ for SIVI. As shown in Figure 4, despite the need for score function gradient estimation, whose variance is notoriously hard to control, by utilizing conjugacy as in (11), SIVI well controls the variance of its gradient estimation, achieving accurate posterior estimation without requiring $q(z \mid \psi)$ to be reparameterizable. By contrast, MFVI, which uses only the first summation term in (11) for gradient estimation, ignores the covariance structure and notably underestimates posterior uncertainty.

Figure 4. Top row: the marginal posteriors of $r$ and $p$ inferred by MFVI, SIVI, and MCMC. Bottom row: joint posteriors and the trace plots of $\underline{L}_{K_t}$ (subject to the difference of a constant).

### 5.4. Bayesian Logistic Regression

We compare SIVI with MFVI, Stein variational gradient descent (SVGD) of Liu & Wang (2016), and MCMC on Bayesian logistic regression, expressed as

$$y_i \sim \mathrm{Bernoulli}\big[(1 + e^{-x_i'\beta})^{-1}\big],\quad \beta \sim \mathcal{N}(0, \alpha^{-1} I_{V+1}),$$

where $x_i = (1, x_{i1}, \ldots, x_{iV})'$ are covariates, $y_i \in \{0, 1\}$ are binary response variables, and $\alpha$ is set as 0.01. With the Pólya-Gamma data augmentation of Polson et al. (2013), we collect posterior MCMC samples of $\beta$ using Gibbs sampling. For MFVI, the variational distribution is chosen as a multivariate normal (MVN) $\mathcal{N}(\beta; \mu, \Sigma)$, with a diagonal or full covariance matrix. For SIVI, we treat $\Sigma$, diagonal or full, as a variational parameter, mix $\mu$ with an MLP-based implicit distribution, and set $K = 500$. We consider three datasets: waveform, spam, and nodal. The details on the datasets and inference are deferred to Appendix B. On waveform, the algorithm converges in about 500 iterations, which takes about 40 seconds on a 2.4 GHz CPU. Note that the results of SIVI with $K = 100$ (or 50), which takes about 12 (or 8) seconds for 500 iterations, are almost identical to those shown in Figures 5-8 with $K = 500$.

Figure 5. Comparison of MFVI (red) with a full covariance matrix, MCMC (green on left), and SIVI (green on right) with a full covariance matrix on quantifying predictive uncertainty for Bayesian logistic regression on waveform.

Figure 6. Marginal and pairwise joint posteriors for $(\beta_0, \ldots, \beta_4)$ inferred by MFVI (red, full covariance matrix), MCMC (blue), and SIVI (green, full covariance matrix) on waveform.
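As an illustration of the semi-implicit family used here for $\beta$, the sketch below mixes the MVN mean with a small implicit network while keeping the covariance an explicit variational parameter; the dimensions, weights, and Cholesky parameterization are assumptions, not the released implementation.

```python
# A minimal sketch (NumPy) of the semi-implicit variational family of Section 5.4:
# the MVN mean mu is mixed with an MLP-based implicit distribution, while the
# covariance Sigma (here via its Cholesky factor) remains an explicit parameter.
import numpy as np

rng = np.random.default_rng(5)
dim, noise_dim = 22, 10                          # 22-dimensional beta, as on waveform

W1 = rng.normal(scale=0.1, size=(noise_dim, 60))
W2 = rng.normal(scale=0.1, size=(60, dim))
L = 0.1 * np.eye(dim)                            # Cholesky factor of Sigma (explicit)

def sample_mu(n):
    eps = rng.normal(size=(n, noise_dim))        # epsilon ~ q(epsilon)
    return np.tanh(eps @ W1) @ W2                # mu = T_phi(epsilon): implicit q_phi(mu)

def sample_beta(n):
    mu = sample_mu(n)                            # psi = mu, drawn from the mixing layer
    return mu + rng.normal(size=(n, dim)) @ L.T  # beta ~ N(mu, Sigma): explicit conditional

betas = sample_beta(50000)                       # iid draws generated on the fly
```

Pushing each sampled $\beta$ through the logistic link at a test point then yields the predictive mean and standard deviation examined next.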
Given the posterior captured by the semi-implicit hierarchy, SIVI takes 0.92 seconds to generate 50,000 iid 22-dimensional $\beta$'s. We collect $\beta_j$ for $j = 1, \ldots, 1000$ to represent the inferred posterior $p(\beta \mid \{x_i, y_i\}_{1,N})$. For each test data point $x_{N+i}$, we calculate the predictive probabilities $1/(1 + e^{-x_{N+i}^T \beta_j})$ for all $j$ and compute their sample mean, as well as the sample standard deviation that measures the uncertainty of the predictive distribution $p(y_{N+i} = 1 \mid x_{N+i}, \{x_i, y_i\}_{1,N})$. As shown in Figure 5, even with a full covariance matrix, the MVN variational distribution inferred by MFVI underestimates the uncertainty in out-of-sample prediction, let alone with a diagonal one, whereas SIVI, mixing the MVN with an MLP-based implicit distribution, closely matches MCMC in uncertainty estimation. As shown in Figure 6, the underestimation of predictive uncertainty by MFVI can be attributed to variance underestimation for both the univariate marginal and pairwise joint posteriors, which are, by contrast, well matched between SIVI and MCMC. Further examining the correlation coefficients of $\beta$, shown in Figure 7, all the univariate marginals, shown in Figure 11 of Appendix D, and additional results, shown in Figures 12-17 of Appendix D, reveals that SIVI well characterizes the posterior distribution of $\beta$ and is only slightly negatively affected if its explicit layer is restricted to a diagonal covariance matrix, whereas MFVI with a diagonal/full covariance matrix and SVGD all misrepresent uncertainty.

Figure 7. Comparing the correlation coefficients of $\beta$ estimated from the posterior samples $\{\beta_i\}_{i=1:1000}$ of SIVI with those of MCMC on waveform, for SIVI with a full/diagonal covariance matrix, MFVI with a full/diagonal covariance matrix, and SVGD.

Note that we have also tried modifying the code of variational boosting (Guo et al., 2016; Miller et al., 2017) for Bayesian logistic regression, but failed to obtain satisfactory results. We attribute the success of SIVI to its ability to better capture the dependencies between the $\beta_v$ and to support a highly expressive non-Gaussian variational distribution by mixing an MVN with a flexible implicit distribution, whose parameters can be efficiently optimized via an asymptotically exact surrogate ELBO.

### 5.5. Semi-Implicit Variational Autoencoder

The variational autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014), widely used for unsupervised feature learning and amortized inference, infers the encoder parameter $\phi$ and decoder parameter $\theta$ to maximize the ELBO

$$L(\phi, \theta) = \mathbb{E}_{z \sim q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big).$$

The encoder distribution $q_\phi(z \mid x)$ is required to be reparameterizable and analytically evaluable, which usually restricts it to a small exponential family. In particular, a canonical encoder is $q_\phi(z \mid x) = \mathcal{N}(z \mid \mu(x, \phi), \Sigma(x, \phi))$, where the Gaussian parameters are deterministically transformed from the observations $x$ via non-probabilistic deep neural networks parameterized by $\phi$. Thus, given observation $x_i$, its corresponding code $z_i$ is forced to follow a Gaussian distribution, no matter how powerful the deep neural networks are. The Gaussian assumption, however, is often too restrictive to model skewness, kurtosis, and multimodality.
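For reference, a canonical single-stochastic-layer Gaussian encoder of the kind just described can be sketched as follows; the layer sizes and weights are illustrative assumptions, and the point is only that $\mu$ and $\Sigma$ are deterministic in $x$, so $q_\phi(z \mid x)$ stays Gaussian for every $x$.

```python
# A minimal sketch (NumPy) of the canonical Gaussian VAE encoder: mu(x, phi) and a
# diagonal Sigma(x, phi) are deterministic functions of x, so the code z for a given
# x is always Gaussian. Layer sizes and random weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
x_dim, h_dim, z_dim = 784, 500, 32

W1 = rng.normal(scale=0.01, size=(x_dim, h_dim)); b1 = np.zeros(h_dim)
W_mu = rng.normal(scale=0.01, size=(h_dim, z_dim)); b_mu = np.zeros(z_dim)
W_ls = rng.normal(scale=0.01, size=(h_dim, z_dim)); b_ls = np.zeros(z_dim)

def encode(x):
    h = np.maximum(x @ W1 + b1, 0.0)             # deterministic hidden layer
    mu = h @ W_mu + b_mu                         # mu(x, phi)
    log_sigma = h @ W_ls + b_ls                  # log of the diagonal of Sigma(x, phi)
    return mu, log_sigma

def sample_z(x, n=1):
    mu, log_sigma = encode(x)
    eps = rng.normal(size=(n,) + mu.shape)
    return mu + np.exp(log_sigma) * eps          # q_phi(z | x): Gaussian for fixed x

x = rng.binomial(1, 0.5, size=x_dim).astype(float)   # a toy binarized input
z = sample_z(x, n=5)
```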
To this end, rather than using a single-stochastic-layer encoder, we use SIVI, which can add as many stochastic layers as needed, as long as the first stochastic layer $q_\phi(z \mid x)$ is reparameterizable and has an analytic PDF, and the layers added after it are reparameterizable and simple to sample from. More specifically, we construct the semi-implicit VAE (SIVAE) by using a hierarchical encoder that injects random noise at $M$ different stochastic layers as

$$\ell_t = T_t(\ell_{t-1}, \varepsilon_t, x; \phi),\quad \varepsilon_t \sim q_t(\varepsilon),\quad t = 1, \ldots, M,$$
$$\mu(x, \phi) = f(\ell_M, x; \phi),\quad \Sigma(x, \phi) = g(\ell_M, x; \phi),$$
$$q_\phi(z \mid x, \mu, \Sigma) = \mathcal{N}\big(\mu(x, \phi), \Sigma(x, \phi)\big), \tag{12}$$

where $\ell_0 = \emptyset$ and $T_t$, $f$, and $g$ are all deterministic neural networks. Note that given data $x_i$, $\mu(x_i, \phi)$ and $\Sigma(x_i, \phi)$ are now random variables, rather than assuming deterministic values as in the vanilla VAE. This moves the encoder variational distribution beyond a simple Gaussian form.

To benchmark the performance of SIVAE, we consider the MNIST dataset, stochastically binarized as in Salakhutdinov & Murray (2008). We use 55,000 observations for training and the 10,000 observations in the test set for performance evaluation. Similar to existing VAEs, we choose Bernoulli units, linked to a fully-connected neural network with two 500-unit hidden layers, as the decoder. Distinct from existing VAEs, whose encoders are often restricted to have a single stochastic layer, SIVI allows SIVAE to use an MVN as its first stochastic layer, and to draw the parameters of the MVN from $M = 3$ stochastic layers, whose structure is described in detail in Appendix C. As shown in Table 2 of Appendix C, SIVAE achieves a negative log evidence of 84.07, which is further reduced to 83.25 if choosing importance reweighting with K = 10. In comparison to other VAEs with a comparable single-stochastic-layer decoder, SIVAE achieves state-of-the-art performance by mixing an MVN with an implicit distribution defined as in (12) to construct a flexible encoder, whose marginal variational distribution is no longer restricted to the MVN distribution. We leave for future study further improving SIVAE by replacing the encoder's explicit MVN layer with a normalizing flow, and by adding convolution/autoregression to enrich the encoder's implicit distribution and/or the decoder.

## 6. Conclusions

Combining the advantages of having analytic pointwise evaluable density ratios and tractable computation via Monte Carlo estimation, semi-implicit variational inference (SIVI) is proposed either as a black-box inference procedure, or to enrich mean-field variational inference with a flexible (implicit) mixing distribution. By designing a surrogate evidence lower bound that is asymptotically exact, SIVI establishes an optimization problem amenable to gradient ascent, without compromising the expressiveness of its semi-implicit variational distribution. Flexible but simple to optimize, SIVI approaches the accuracy of MCMC in quantifying posterior uncertainty in a wide variety of inference tasks; it is not constrained by conjugacy, often runs faster, and can generate iid posterior samples on the fly via the inferred stochastic variational inference network.

## Acknowledgements

The authors thank the Texas Advanced Computing Center (TACC) for computational support.
## References

Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.

Agakov, F. V. and Barber, D. An auxiliary variational method. In International Conference on Neural Information Processing, 2004.

Bishop, Christopher M and Tipping, Michael E. Variational relevance vector machines. In UAI, pp. 46-53. Morgan Kaufmann Publishers Inc., 2000.

Bishop, Christopher M, Lawrence, Neil D, Jaakkola, Tommi, and Jordan, Michael I. Approximating posterior distributions in belief networks using mixtures. In NIPS, pp. 416-422, 1998.

Blei, David M., Kucukelbir, Alp, and McAuliffe, Jon D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

Bliss, Chester Ittner and Fisher, Ronald A. Fitting the negative binomial distribution to biological data. Biometrics, 9(2):176-200, 1953.

Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

Cover, Thomas M and Thomas, Joy A. Elements of Information Theory. John Wiley & Sons, 2012.

Dinh, Laurent, Krueger, David, and Bengio, Yoshua. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Gershman, Samuel J, Hoffman, Matthew D, and Blei, David M. Nonparametric variational inference. In ICML, pp. 235-242, 2012.

Giordano, Ryan J, Broderick, Tamara, and Jordan, Michael I. Linear response methods for accurate covariance estimates from mean field variational Bayes. In NIPS, pp. 1441-1449, 2015.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS, pp. 2672-2680, 2014.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning, pp. 1462-1471, 2015.

Guo, Fangjian, Wang, Xiangyu, Fan, Kai, Broderick, Tamara, and Dunson, David B. Boosting variational inference. arXiv preprint arXiv:1611.05559, 2016.

Habil, Eissa D. Double sequences and double series. IUG Journal of Natural Studies, 14(1), 2016.

Han, Shaobo, Liao, Xuejun, Dunson, David, and Carin, Lawrence. Variational Gaussian copula inference. In AISTATS, pp. 829-838, 2016.

Hoffman, Matthew and Blei, David. Stochastic structured variational inference. In AISTATS, pp. 361-369, 2015.

Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.

Huszár, Ferenc. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.

Im, Daniel Jiwoong, Ahn, Sungjin, Memisevic, Roland, Bengio, Yoshua, et al. Denoising criterion for variational auto-encoding framework. In AAAI, pp. 2059-2065, 2017.

Jaakkola, Tommi S and Jordan, Michael I. Improving the mean field approximation via the use of mixture distributions. In Learning in Graphical Models, pp. 163-173. Springer, 1998.

Jaakkola, Tommi S and Jordan, Michael I. Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25-37, 2000.

Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, Diederik P, Salimans, Tim, Jozefowicz, Rafal, Chen, Xi, Sutskever, Ilya, and Welling, Max. Improved variational inference with inverse autoregressive flow. In NIPS, pp. 4743-4751, 2016.

Kucukelbir, Alp, Tran, Dustin, Ranganath, Rajesh, Gelman, Andrew, and Blei, David M. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1-45, 2017.

Kurdila, Andrew J and Zabarankin, Michael. Convex Functional Analysis (Systems and Control: Foundations and Applications). Switzerland: Birkhäuser, 2005.

Li, Yingzhen and Turner, Richard E. Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107, 2017.

Liu, Qiang and Wang, Dilin. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NIPS, pp. 2378-2386, 2016.

Louizos, Christos and Welling, Max. Multiplicative normalizing flows for variational Bayesian neural networks. In ICML, pp. 2218-2227, 2017.

Maaløe, Lars, Sønderby, Casper Kaae, Sønderby, Søren Kaae, and Winther, Ole. Auxiliary deep generative models. In ICML, pp. 1445-1453, 2016.

Mescheder, Lars, Nowozin, Sebastian, and Geiger, Andreas. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In ICML, 2017.

Miller, Andrew C., Foti, Nicholas J., and Adams, Ryan P. Variational boosting: Iteratively refining posterior approximations. In ICML, pp. 2420-2429, 2017.

Mnih, Andriy and Gregor, Karol. Neural variational inference and learning in belief networks. In ICML, pp. 1791-1799, 2014.

Mohamed, Shakir and Lakshminarayanan, Balaji. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

Naesseth, Christian, Ruiz, Francisco, Linderman, Scott, and Blei, David. Reparameterization gradients through acceptance-rejection sampling algorithms. In AISTATS, pp. 489-498, 2017.

Paisley, John, Blei, David M, and Jordan, Michael I. Variational Bayesian inference with stochastic search. In ICML, pp. 1363-1370, 2012.

Papamakarios, George, Murray, Iain, and Pavlakou, Theo. Masked autoregressive flow for density estimation. In NIPS, pp. 2335-2344, 2017.

Polson, Nicholas G, Scott, James G, and Windle, Jesse. Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108(504):1339-1349, 2013.

Rainforth, Tom, Kosiorek, Adam R, Le, Tuan Anh, Maddison, Chris J, Igl, Maximilian, Wood, Frank, and Teh, Yee Whye. Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537, 2018.

Ranganath, Rajesh, Gerrish, Sean, and Blei, David. Black box variational inference. In AISTATS, pp. 814-822, 2014.

Ranganath, Rajesh, Tran, Dustin, and Blei, David. Hierarchical variational models. In ICML, pp. 324-333, 2016.

Rezende, Danilo and Mohamed, Shakir. Variational inference with normalizing flows. In ICML, pp. 1530-1538, 2015.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278-1286, 2014.

Rudin, Walter. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1964.

Ruiz, Francisco J. R., Titsias, Michalis K., and Blei, David M. The generalized reparameterization gradient. In NIPS, pp. 460-468, 2016.

Salakhutdinov, Ruslan and Murray, Iain. On the quantitative analysis of deep belief networks. In ICML, pp. 872-879, 2008.

Salimans, Tim and Knowles, David A. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837-882, 2013.

Salimans, Tim, Kingma, Diederik, and Welling, Max. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, pp. 1218-1226, 2015.

Saul, Lawrence K and Jordan, Michael I. Exploiting tractable substructures in intractable networks. In NIPS, pp. 486-492, 1996.

Shi, Jiaxin, Sun, Shengyang, and Zhu, Jun. Implicit variational inference with kernel density ratio fitting. arXiv preprint arXiv:1705.10119, 2017.

Sønderby, Casper Kaae, Raiko, Tapani, Maaløe, Lars, Sønderby, Søren Kaae, and Winther, Ole. Ladder variational autoencoders. In NIPS, pp. 3738-3746, 2016.

Sugiyama, Masashi, Suzuki, Taiji, and Kanamori, Takafumi. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

Titsias, Michalis and Lázaro-Gredilla, Miguel. Doubly stochastic variational Bayes for non-conjugate inference. In ICML, pp. 1971-1979, 2014.

Tran, Dustin, Blei, David, and Airoldi, Edo M. Copula variational inference. In NIPS, pp. 3564-3572, 2015.

Tran, Dustin, Ranganath, Rajesh, and Blei, David M. The variational Gaussian process. In ICLR, 2016.

Tran, Dustin, Ranganath, Rajesh, and Blei, David. Hierarchical implicit models and likelihood-free variational inference. In NIPS, pp. 5529-5539, 2017.

Uehara, Masatoshi, Sato, Issei, Suzuki, Masahiro, Nakayama, Kotaro, and Matsuo, Yutaka. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.

Wainwright, Martin J, Jordan, Michael I, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Zhou, Mingyuan. Softplus regressions and convex polytopes. arXiv preprint arXiv:1608.06383, 2016.

Zhou, Mingyuan and Carin, Lawrence. Negative binomial process count and mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):307-320, 2015.

Zhou, Mingyuan, Li, Lingbo, Dunson, David, and Carin, Lawrence. Lognormal and gamma mixed negative binomial regression. In ICML, pp. 859-866, 2012.

Zhou, Mingyuan, Padilla, Oscar Hernan Madrid, and Scott, James G. Priors for random count matrices derived from a family of negative binomial processes. Journal of the American Statistical Association, 111(515):1144-1156, 2016.