# revisiting_unbiased_implicit_variational_inference__42d98094.pdf

Revisiting Unbiased Implicit Variational Inference

Tobias Pielok 1 2 Bernd Bischl 1 2 David R ugamer 1 2

Recent years have witnessed growing interest in semi-implicit variational inference (SIVI) methods due to their ability to rapidly generate samples from complex distributions. However, since the likelihood of these samples is non-trivial to estimate in high dimensions, current research focuses on finding effective SIVI training routines. Although unbiased implicit variational inference (UIVI) has largely been dismissed as imprecise and computationally prohibitive because of its inner MCMC loop, we revisit this method and show that UIVI s MCMC loop can be effectively replaced via importance sampling and the optimal proposal distribution can be learned stably by minimizing an expected forward Kullback Leibler divergence without bias. Our refined approach demonstrates superior performance or parity with state-of-the-art methods on established SIVI benchmarks.

1. Introduction

Bayesian inference, such as sampling-based or variational inference, is an important foundation for constructing uncertainty quantification measures for machine learning models. In variational inference (VI), samples are generated from a target distribution function pz with the associated random variable z, which can only be evaluated but not directly sampled from and is possibly unnormalized. This could be, e.g., a Bayesian posterior distribution or the canonical distribution w.r.t. a physical system. For this, a family Qz over distributions with a tractable sampling procedure is chosen, and a divergence measure D where D quantifies the dissimilarity between two distributions. The target distribution pz can then be approximated by finding q z Qz which is closest to pz w.r.t. D, i.e., q z arg minqz Qz D(qz, pz).

1Department of Statistics, LMU Munich, Munich, Germany 2Munich Center for Machine Learning (MCML), Munich, Germany. Correspondence to: Tobias Pielok <tobias.pielok@stat.unimuenchen.de>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1. We sample from a semi-implicit distribution q(z) by sampling from the latent distribution p(ϵ) and subsequently from the conditional distribution q(z|ϵ). The simple distributions p(ϵ) and q(z|ϵ) can induce a complicated distribution q(z) and consequently a potentially even more complicated reverse conditional distribution q(ϵ|z). AISIVI learns a mass-covering representation τ(ϵ|z) of q(ϵ|z) to estimate z log q(z) in high dimensions.

1.1. Implicit Variational Inference

In contrast to VI, where we assume that qz Qz is an explicit distribution, i.e., we can evaluate qz, for implicit VI (IVI) we have no direct access to qz and can only produce samples from qz, i.e., qz is an implicit distribution. Representative examples of explicit and implicit distributions are normalizing flows (NFs) and neural samplers, which transform a random variable via an arbitrary neural network (NN), respectively. While NFs can be trained stably, they are known to smooth out sharp target distributions. In contrast, neural samplers can model highly complex and sharp distributions but are notoriously hard to train. This naturally suggests combining them.

Revisiting Unbiased Implicit Variational Inference

Semi-implicit variational inference (SIVI; Yin & Zhou, 2018) offers a compromise between VI and IVI. Since we sample from semi-implicit distribution qz by sampling the parameters y of an explicit distribution1 qz|y from an implicit distribution qy, its representative capabilities come close to those of an implicit distribution, but qz of a semiimplicit distribution can be estimated in a principle manner.

More formally, assuming that the target z pz is a continuous random variable taking values in Z Rd Z where d Z N, we approximate its probability density function via an uncountable mixture of densities s.t.

qz(z) = Ey qy qz|y(z|y) . (1)

For SIVI, the random variable y taking values in Y Rd Y where d Y N is drawn via a neural sampler, i.e.,

ϵ pϵ y = fϕ(ϵ) (2)

where ϵ is a latent random variable taking values in E Rd E where d E N and fϕ : E Y is a NN with parameters ϕ Rdϕ where dϕ N. Consequently, since every ϕ defines qz, the distribution family Qz is also parametrized by the NN parameters ϕ. With Eq. 1 and Eq. 2, we also directly get that

qz(z) = Eϵ pϵ qz|ϵ(z|ϵ) (3)

where qz|ϵ(z|ϵ) = qz|y(z|fϕ(ϵ)).

1.2. Our Contributions

In this work, we focus on the efficient estimation of the score gradient z log qz, which enables us to train SIVI models even in high dimensions. For this, we propose using importance sampling (IS) with an adaptively informed proposal distribution τϵ|z modeled by a conditional normalizing flow (CNF). We show that τϵ|z = qϵ|z debiases our score gradient estimate and propose a stable training routine of the CNF via an expected forward Kullback-Leibler divergence. Our contribution advances both mathematical insights of SIVI and contributes two new algorithms.

2. Background and Missed Opportunities

2.1. Reparametrizable Semi-implicit Distributions

In this work, we assume2 that the reparametrization trick (Kingma & Welling, 2014) is applicable to qz|y, i.e., there

1usually a common, unimodal distribution such as, e.g., a normal distribution 2We could even lessen our assumption by only assuming that implicit reparametrization gradients can be computed (Figurnov et al., 2018), but this is not the focus of this paper.

exist a random variable η taking values in H Rd H where d H N and a differentiable function g : Y H Z s.t.

ϵ, η pϵ,η g(fϕ(ϵ), η) | {z } =:hϕ(ϵ,η)

where pϵ,η is the joint distribution of the independent random variables ϵ and η which does not depend on ϕ. From this, it directly follows that

Ez qz [aϕ(z)] = Eϵ,η pϵ,η [aϕ(hϕ(ϵ, η))] (5)

where aϕ(z) : Z R is a differentiable function with parameters ϕ. Hence, under our assumptions, the reparametrization trick can be applied to qz.

2.2. Path gradient estimator and DKL minimization

We choose to minimize the reverse Kullback-Leibler divergence DKL, i.e.,

DKL(qz pz) = Ez qz

On the one hand, one of the main advantages of DKL is that if we can evaluate qz we can compute unbiased estimates of the gradients w.r.t. the parameters ϕ of qz, which is especially useful when stochastic gradient descent methods are employed to minimize the objective. On the other hand, the reverse DKL is known to underestimate the variance if the variational distribution qz is not sufficiently expressive (Andrade, 2024). However, for SIVI, this is rarely relevant, as the variational distribution qz is highly expressive due to its implicit nature.

Since qz is amenable to the reparametrization trick, we can follow Roeder et al. (2017) to formulate a low-variance gradient estimator of DKL, the so-called path gradient estimator

ϕDKL(qz pz) =

Eϵ,η pϵ,η h z (log qz(z) log pz(z)) z=hϕ(ϵ,η)

ϕhϕ(ϵ, η) i .

While this result also appeared in the context of SIVI as an intermediate result in Titsias & Ruiz (2019), its far-reaching implications were not discussed since this expression was not of interest for the authors final derivation (see Section 2.3). Not only does the path gradient estimator in Eq. 7 reduce the variance of the gradient estimation, but it also vastly reduces the computational demand in contrast to the reparametrization trick, for which we would need to estimate the gradient

ϕ log qz(hϕ(ϵ, η)) =

ϕ log Eeϵ pϵ qz|y(hϕ(ϵ, η)|fϕ(eϵ) . (8)

Revisiting Unbiased Implicit Variational Inference

This simple observation leads to a surprisingly wellperforming approach, which we will introduce in Section 3.

2.3. Unbiased Implicit Variational Inference

The problematic term of the path gradient estimator in Eq. 7 is the score gradient z log qz(z), for which no analytical expression exists. Titsias & Ruiz (2019) proved for UIVI that

Eϵ qϵ|z z log qz|ϵ(z|ϵ) = z log qz(z), (9)

i.e., if we can produce samples from the intractable conditional distribution qϵ|z, we can compute an unbiased estimate of the score gradient z log qz(z). Titsias & Ruiz (2019) propose to sample z, ϵ qz,ϵ and use MCMC with target distribution3 qϵ|z. The MCMC chains are initialized at ϵ because it already stems from the stationary distribution qϵ|z. However, we can not use ϵ directly since this would violate the independence assumption, which is needed for an unbiased estimate in Eq. 9. Therefore, MCMC has to run as long as the sample produced by the i-th chain ϵ i is independent of ϵ. Titsias & Ruiz (2019) argue that only a few steps of MCMC are needed since the chains are already initialized at the stationary distribution. However, as it can be seen in Figure 1, qϵ|z is likely multimodal with regions of vanishing probability potentially occurring between the modes due to the implicit and possibly very complicated nature of qz. In such cases, very long chains would be needed to effectively break the dependence between ϵ and ϵ i, rendering the already computationally intensive method as prohibitive. Furthermore, note that the number of chains cannot reduce the bias introduced by the prevailing dependence between ϵ and ϵ i.

In light of these observations, we propose a novel method in Section 3 to fix the encountered shortcomings.

2.4. Conditional Normalizing Flows

Normalizing flows (NF; see, e.g., Papamakarios et al., 2021) leverage the change of variable method to model complex distributions by repeatedly transforming a random variable stemming from a simple error distribution. More specifically, for a random variable u taking values in U Rd U where d U N and a differentiable and invertible transformation Tθ : U U with parameters θ Θ R it holds that

u pu, ϵ = Tθ(u) ϵ qϵ,

qϵ(ϵ) = pu(T 1 θ (ϵ)) det JT 1 θ (ϵ) (10)

3We know qϵ|z up to a normalizing constant, i.e., qz,ϵ, which suffices for MCMC

where JT 1 θ is the Jacobian of the inverse function of Tθ. A conditional NF (CNF) is a differentiable map Tθ : U Z U, (ϵ, z) 7 Tθ(ϵ, z) such that for every z Z it holds that Tθ( , z) is a NF.

A plethora of different NFs have been proposed over the last years. In this work, we use affine coupling layers as introduced in Real NVP (Dinh et al., 2017) because sampling and evaluating their likelihood is equally computationally efficient, and they can be scaled up to high dimensions (Andrade, 2024). Specifically, we use a conditional variant of affine coupling layers similar to one introduced in Lu & Huang (2020). While this is a natural choice, other combinations could provide additional performance gains as also discussed in Section 6.

Starting from Eq. 7 we can rewrite the problematic score term, i.e.,

z log qz(z) = z log Eϵ pϵ qz|ϵ(ϵ) . (11)

Thus, a straightforward Monte Carlo (MC) estimator of the score gradient is

s MC,k(z) = z log

i=1 qz|ϵ(z|ϵi)

where (z, ϵ1) qz,ϵ and ϵi i.i.d. pϵ, i = 2, . . . , k. This is a consistent estimator of the score gradient z log qz(z) and for large k its bias

Eϵi pϵ [ zs MC,k(z)] z log (qz(z))

Vϵ pϵ qz|ϵ(z|ϵ)

2(k 1) qz(z)2

(see Appendix A.1 for the proof). Note that including ϵ1 introduces additional bias but strongly reduces the variance since ϵ1 qϵ|z. A closely related estimator was derived in Molchanov et al. (2019), but their estimator is purely based on the reparametrization trick and does not benefit from the advantages discussed in Section 2.2 and Section 3.2.

Although we would expect that for high dimensions, the contribution of qz|ϵ(z|ϵi) resulting from uninformed ϵi to be nearly negligible to our estimator, s MC,k performs surprisingly well.

Based on the previous observation, we devise a new importance sampling (IS) version of Eq. 11, given as follows:

z logqz(z) =

pϵ(ϵ)qz|ϵ(z|ϵ)

τϵ|ez(ϵ|ez)

ez=z . (14)

Revisiting Unbiased Implicit Variational Inference

The idea of enhancing SIVI with importance sampling was also proposed by Sobolev & Vetrov (2019), but their approach is more expensive than ours due to the joint optimization of the proposal distribution and the SIVI model, rendering more expressive conditional models, such as CNFs, infeasible in practice.

Importance Sampling Estimator Based on Eq. 14, we can estimate z log qz(z) using the following score gradient estimator

s IS,k(z) = z log

pϵ(ϵi)qz|ϵ(z|ϵi)

τϵ|ez(ϵi|ez)

! ez=z , (15)

where ϵi τϵ|z, i = 1, . . . , k. We show in Appendix A.2 that this estimator is consistent when supp(qϵ|z) supp(τϵ|z). To also make this estimator efficient, we need to generate samples τϵ|z and evaluate their likelihood efficiently. A suitable option in this case is to model τϵ|z with a sequence of conditional affine coupling layers (see Section 2.4).

Since we optimize τϵ|z and qz alternately, we are interested in the optimal τϵ|z for a fixed qz. This leads us to the following proposition:

Proposition 3.1. Choosing τϵ|z = qϵ|z debiases our proposed score gradient estimate s IS,k, i.e.,

Eϵi qϵ|z z log

pϵ(ϵi)qz|ϵ(z|ϵi)

qϵ|ez(ϵi|ez)

! ez=z = z log qz(z).

We prove Proposition 3.1 in Section 3.1. Hence, we propose to learn τϵ|z by minimizing the expected forward Kullback Leibler divergence Ez qz DKL(qϵ|z τϵ|z) , for which we can estimate its gradient w.r.t. to the parameters θ of the NF without bias since

θEz qz DKL(qϵ|z τϵ|z) (17)

= Ez qz Eϵ qϵ|z θ log qϵ|z(ϵ|z)

= Ez,ϵ qz,ϵ θ log τϵ|z(ϵ|z) (19)

which holds because qz,ϵ does not depend on θ. The following proposition assures the validity of our procedure:

Proposition 3.2. Minimizing Ez qz DKL(qϵ|z τϵ|z) is equivalent to minimizing DKL(qz,ϵ τϵ|z qz).

This follows from the fact that

Ez qz DKL(qϵ|z τϵ|z) (20)

= Ez qz Eϵ qϵ|z log qϵ|z(ϵ|z)

= Ez,ϵ qz,ϵ log qz,ϵ(z, ϵ) τϵ|z(ϵ|z)qz(z)

= DKL(qz,ϵ τϵ|z qz) (23)

From this, assuming that τϵ|z is sufficiently flexible, it directly follows that at the global optimum τ ϵ|z of the expected forward DKL it holds that

qz,ϵ = τ ϵ|z qz τ ϵ|z = qz,ϵ

qz = qϵ|z. (24)

Being of particular importance for understanding our finding, we also include the proof of Proposition 3.1 in the following.

3.1. Proof of Proposition 3.1

First note that

pϵ(ϵi) qϵ|z(ϵi|z) = pϵ(ϵi)qz(z) pϵ(ϵi)qz|ϵ(z|ϵi) = qz(z) qz|ϵ(z|ϵi). (25)

With this, we get that

Eϵi qϵ|z z log

pϵ(ϵi)qz|ϵ(z|ϵi)

qϵ|ez(ϵi|ez)

! ez=z (26)

1 k Pk i=1 pϵ(ϵi) qϵ|z(ϵi|z) zqz|ϵ(z|ϵi)

1 k Pk i=1 pϵ(ϵi) qϵ|z(ϵi|z)qz|ϵ(z|ϵi)

1 k Pk i=1 Eϵi qϵ|z h pϵ(ϵi) qϵ|z(ϵi|z) zqz|ϵ(z|ϵi) i

1/k Pk i=1 qz(z) (28)

Pk i=1 Eϵi qϵ|z h qz(z)qz|ϵ(z|ϵi)

qz|ϵ(z|ϵi) z log qz|ϵ(z|ϵi) i

k qz(z) (29)

i=1 Eϵi qϵ|z z log qz|ϵ(z|ϵi) (30)

Eq. 9 = z log qz(z). (31)

3.2. Training Under Memory Constraints

One of the main advantages of our proposed score gradient estimators s MC,k and s IS,k is that increasing k, i.e., the number of samples ϵi, does not increase the computational cost of backpropagation w.r.t. the parameters of our SIVI model ϕ because we follow the path gradient. This insight motivates the following procedure, which allows us to train

Revisiting Unbiased Implicit Variational Inference

our SIVI models with constant memory requirements independent of k.

First, note that both our score gradient estimators can be written s.t.

s(z) = zℓ(z, ez)

ez=z with (32)

ℓ(z, ez) = log

i=1 w(ϵi|ez)qz|ϵ(z|ϵi)

where choosing w(ϵi|ez) = 1 or w(ϵi|ez) = pϵ(ϵi) qϵ| ez(ϵi|ez) results in s MC,k and s IS,k, respectively. Since evaluating qϵ|ez(ϵi|ez) is computationally non-intensive because of the inner neural sampler, we could, in principle, process very large ϵ batches. However, since our memory is constrained, we need a way to aggregate score gradient estimators computed on different ϵ batches.

Efficient Aggregation on Batch Level Assume we have computed the score gradient estimates s1, s2 with associated log probability density estimates ℓ1, ℓ2 of the ϵi batches of sizes j b and b, respectively, with j, b N. Then, we show in Appendix A.3 that if we aggregate these estimates s.t.

ℓ3(z, ez) = logaddexp (ℓ1(z, ez) + log j, ℓ2(z, ez))

log(j + 1), (34)

s3(z) = α1s1(z) + α2s2(z) with

α1 = exp ℓ1(z, ez) ℓ3(z, ez) + log j j + 1

α2 = exp (ℓ2(z, ez) ℓ3(z, ez) log(j + 1))

then s3 and ℓ3 are the corresponding estimates of the combined ϵi batches.

Also, note that we keep most of our operations in the log space to make the procedure numerically stable. For example, we use the logaddexp(ℓ1, ℓ2) operation, which allows to numerically stable compute log(exp(ℓ1)+exp(ℓ2)), and the logsumexp trick to compute ℓ1 and ℓ2 themselves. Applying this algorithm iteratively allows us to process an arbitrarily large number of samples ϵi while keeping the memory requirement constant. As a direct consequence, we note that our score gradient estimation is completely parallelizable.

3.3. Algorithms

Following the previous findings, we propose two new algorithms for SIVI.

Algorithm 1 BSIVI

Input: target density pz, batch size m, number of latent samples k with k > m, SIVI model hϕ i = 1, . . . , m, j = 1, . . . , k repeat

ϵj pϵ, ηj pη zi = hϕ(ϵi, ηi)

si = zilogsumexp log qz|ϵ(zi|ϵj)

qi = stop gradient(si) zi loss = 1/m Pm i=1(qi log pz(zi)) ϕ = opt(loss, ϕ) until ϕ has converged

3.3.1. BSIVI

As a new baseline method, we propose base SIVI (BSIVI), which minimizes the reverse Kullback-Leibler divergence DKL(qz pz) by following the path gradient of Eq. 7. For the score gradient z log qz we plug-in s MC,k(z). This method exploits the fact that we can rapidly sample from a SIVI model, and s MC,k can be computed with constant memory independent of k as discussed in Section 3.2. The algorithm is summarized in Algorithm 1. We use BSIVI to ablate the use of importance sampling, which our main method is built upon.

3.3.2. AISIVI

Furthermore, we propose adaptively informed SIVI (AISIVI), which alternates between minimizing the expected forward KL divergence Ez qz DKL(qϵ|z τϵ|z)

and the reverse KL divergence DKL(qz pz) by following the path gradient of Eq. 7. For the score gradient z log qz, we plug-in s IS,k(z), which uses τϵ|z as the proposal distribution. This alternating training is possible since s IS,k(z) is a consistent estimator of the score gradient for any τϵ|z with supp(qϵ|z) supp(τϵ|z). Since the forward DKL is mass covering, we can expect that the support assumption is always fulfilled. This means, in contrast to UIVI, we do not need exact4 samples from qϵ|z and the bias and variance of our estimate decreases5 with increasing k. Also, sampling from the CNF τϵ|z is comparatively cheap, and the samples are guaranteed to be independent.

4. Related Literature

Yin & Zhou (2018) propose to use semi-implicit distributions for VI and train their models by sandwiching the ELBO. Titsias & Ruiz (2019) introduce another objective based on ELBO and derive an associated unbiased gradient

4However, we can greatly reduce the bias the better we match qϵ|z with τϵ|z 5This is not the case for UIVI regarding the number of chains

Revisiting Unbiased Implicit Variational Inference

Algorithm 2 AISIVI

Input: target density pz, batch size m, number of latent samples k, SIVI model hϕ, CNF τϵ|z i = 1, . . . , m, j = 1, . . . , k repeat

ϵi pϵ, ηi pη zi = hϕ(ϵi, ηi) lossflow = 1/m Pm i=1 log τϵ|z(ϵi|zi) θ = opt(lossflow, θ)

ϵi,j τϵ|z( |zi) log wi,j = log pϵ(ϵi,j) log τϵ|z(ϵi,j|zi) log ewi,j = stop gradient(log wi,j) log eqz|ϵ(zi|ϵi,j) = log ewi,j + log qz|ϵ(zi|ϵi,j)

si = zilogsumexp log eqz|ϵ(zi|ϵi,j)

qi = stop gradient(si) zi loss = 1/m Pm i=1(qi log pz(zi)) ϕ = opt(loss, ϕ) until ϕ has converged

estimator, which, however, depends on expensive MCMC simulations. Sobolev & Vetrov (2019) also improved upon Yin & Zhou (2018) by introducing an importance sampling distribution; however, using expressive models such as CNFs remains infeasible for their approach. In recent years, new approaches based on different objectives have been proposed that seem to outperform methods based on the ELBO. Yu & Zhang (2023) propose minimizing the Fisher divergence, but their minimax formulation proves difficult to train compared to the standard minimization problems mentioned above.

Building upon Yu & Zhang (2023), Cheng et al. (2024) use the kernel Stein discrepancy as the training objective, which turns the minimax problem into a standard minimization problem. We will refer to their method as KSIVI. Lim & Johansen (2024) proposed Particle Semi-Implicit Variational Inference (PVI), which is a particle approximation of a Euclidean-Wasserstein gradient flow. Both Cheng et al. (2024) and Lim & Johansen (2024) showed strong empirical evidence supporting their methods.

While beyond the scope of this work, we note that SIVI has been successfully extended to multilayer architectures, yielding improved performance as demonstrated by Yu et al. (2023).

Beyond SIVI, another line of research explores variational inference with fully implicit distributions (Mescheder et al., 2017; Shi et al., 2018; Feng et al., 2017). These methods often encounter training challenges, such as instability introduced by adversarial learning or density-ratio estimation.

Another related direction performs inference directly in

Figure 2. Histograms based on 100000 samples produced by the true distribution, AISIVI, and BSIVI

function space (Sun et al., 2019; Ma et al., 2019; Pielok et al., 2023). These approaches frequently incorporate implicit inference mechanisms within their frameworks.

Several approaches have improved variational inference by incorporating importance sampling. IWAE (Burda et al., 2016) introduces a tighter bound through multiple importance-weighted samples, while NVI (Zimmermann et al., 2021) extends this idea using nested objectives to learn better proposal distributions. Our work builds on this line by integrating importance sampling into the SIVI framework to improve expressivity and stability.

5. Experiments

In the following, we analyze the performance of our proposed methods AISIVI and BSIVI under different data scenarios. We start by comparing our two methods on wellknown toy examples that serve as a first sanity check (Section 5.1). We then compare our methods with the state-ofthe-art methods KSIVI and PVI on a 22-dimensional problem in the context of a Bayesian logistic regression model (Section 5.2, which serves as another common benchmark example for SIVI. Finally, we move to a 100-dimensional problem related to a conditioned diffusion process (Section 5.3). We implemented AISIVI and BSIVI in Py Torch (Paszke et al., 2019). All experiments are performed on a Linux-based server A5000 server with 2 GPUs, 24GB VRAM, and Intel Xeon Gold 5315Y processor with 3.20 GHz.

Revisiting Unbiased Implicit Variational Inference

Table 1. DKL(p, q) of different toy examples (rows) using the two proposed methods (columns).

NAME AISIVI (DKL) BSIVI (DKL)

BANANA 0.0853 0.3022 MULTIMODAL 0.0044 0.0017 X-SHAPE 0.0072 0.0034

5.1. Toy examples

First, we train BSIVI and AISIVI on the three common two-dimensional test densities Banana, X-Shape, and Multimodal as proposed by Cheng et al. (2024). Their respective definitions can be found in Table 3 in the Appendix B. For both methods, we use the same NN architecture and train them for 4000 iterations. For the NF of AISIVI, we use 6 conditional affine coupling layers.

Results It can be seen in Figure 2 that AISIVI and BSIVI can capture the three densities nearly equally well. Only for the Banana benchmark, AISIVI outperforms BSIVI notably (Table 1).

5.2. Bayesian Logistic Regression

Next, we perform a Bayesian logistic regression on the WAVEFORM6 dataset as proposed by Yin & Zhou (2018). For the target variables yi {0, 1}, i = 1, . . . , N with N = 400 and the feature vectors xi R21, the log-likelihood is given by

log p(y1,...,N| x1,...,N, β) =

i=1 yi(1, x i )β log 1 + exp (1, x i )β ,

where β R22 is the variable we want to infer. We set the prior distribution of β to a normal distribution, i.e., p(β) = N(0, α 1I) with α = 0.01. In line with Cheng et al. (2024), we estimate the ground truth by simulating parallel stochastic gradient Langevin dynamics (SGLD Welling & Teh, 2011) for 400,000 iterations, 1000 samples, and a step size of 0.0001. We use the same NN architecture for all methods and use the best hyperparameters for PVI and KSIVI proposed by the respective authors for this benchmark. We train AISIVI and BSIVI for 10,000 iterations and use ϵi batch sizes of 9182 and 91,820 respectively. The large batch size of BSIVI is possible and computationally feasible because of the considerations discussed in Section 3.2. All methods use a batch size m = 128 the latent dimension is set to 10, i.e., ϵ R10. For the NF of AISIVI, we use 16 conditional affine coupling layers. We use the

6https://archive.ics.uci.edu/ml/ machine-learningdatabases/waveform

Table 2. KSIVI serves as the gold standard, with AISIVI reaching it in 10K iterations. The other SIVI variants are compared based on their estimated log marginal likelihood, given a comparable computational budget to AISIVI. The log marginal likelihood is estimated using 1000 high-quality SGLD samples, while each variant s estimate is computed using 60,000 samples,

METHOD LOG ML TRAINING TIME [S] ITERATIONS

KSIVI 74521 0.6K 100K

AISIVI 74062 1.4K 10K IWHVI 67667 1.5K 10K BSIVI 60556 1.5K 10K PVI 53121 1.4K 10K UIVI 40207 1.5K 10K

full batch for the score gradient computation of the target density.

Results The marginal and pairwise density estimates in Figure 3 highlight that all methods perform nearly equally well since no systematic overor underestimation of the variance can be observed. We also compare with the ground truth all pairwise correlation coefficients of β given by

ρi,j = cov β(i), β(j)

cov β(i), β(i) cov β(j), β(j) , i = j, (36)

where β(i) is a vector containing the i-th coordinate of all β samples.

The scatter plot in Figure 4 provides a visual summary of the correlation coefficients and the relation between those of different IVI methods and the ones of SGLD as considered ground truth. The results illustrate that PVI and KSIVI exhibit a slightly reduced spread compared to our proposed methods, indicating a marginally better fit. However, overall, the performance of all methods remains comparable.

5.3. Conditioned Diffusion Process

We adopt the Bayesian inference setting proposed in Cheng et al. (2024), which is based on the Langevin stochastic differential equation (SDE):

dxt = 10xt(1 x2 t)dt + dwt, 0 t 1, (37)

where x0 = 0 and wt is a one-dimensional standard Brownian motion. This SDE models the motion of a particle in an energy potential with Brownian fluctuations (Detommaso et al., 2018).

Following (Cheng et al., 2024), we discretize the SDE using the Euler-Maruyama scheme with a step size t = 0.01, yielding a 100-dimensional latent variable

x = (x t, x2 t, . . . , x100 t),

Revisiting Unbiased Implicit Variational Inference

Figure 3. Comparision of marginal and pairwise density estimates of β(1), β(2), β(3) where the SGLD estimates are marked in black

Figure 4. Scatter plot of every pairwise correlation coefficient ρi,j between the estimates and SGLD.

which gives rise to the prior distribution pprior(x). The observations are perturbed at 20 time points, given by

y = (y5 t, y10 t, . . . , y100 t),

y5k t N(x5k t, σ2), 1 k 20 (38)

with σ = 0.1, defining the likelihood function p(y|x). Given y, our goal is to infer the posterior

p(x|y) pprior(x)p(y|x). (39)

To approximate the posterior, we reapply the approach in (Cheng et al., 2024) by running a long-run parallel stochastic

gradient Langevin dynamics (SGLD) simulation with 1000 independent particles, a step size of 0.0001, and 100,000 iterations to generate 1000 ground truth samples.

For this benchmark, we also include IWHI to ablate the effect of their joint training approach compared to our sequential training. For their method, we use a conditional Gaussian model, where the conditional parameters are predicted by a neural network, as their joint training setup makes more complex conditional models such as continuous normalizing flows (CNFs) infeasible. Additionally, we evaluate against UIVI to compare our importance sampling-based enhancement with their original MCMC-based approach. For all methods, we use the same NN architecture. For KSIVI, we use the hyperparameters proposed by the authors for this benchmark. For PVI, we use 100 particles. To ensure a fair comparison, we fixed the outer batch size (number of sampled z) for all SIVI methods and adjusted the inner batch size (number of sampled ϵ) until we achieved approximately the same iterations per second as AISIVI. The ϵi batch sizes for AISIVI, BSIVI, and IWHI are 256, 40960, and 7000, respectively. The latent dimension is 100 for all SIVI variants. For the NF of AISIVI, we use 32 conditional affine coupling layers.

Results The results of the experiment is depicted in Figure 5. We observe that KSIVI and AISIVI are closest to SGLD while UIVI, PVI, and BSIVI tend to underestimate the variability of the process. In Table 2, we report the estimated log marginal likelihoods of the SIVI variants along with their associated training times. Notably, only our method, AISIVI, approaches the performance of the state-of-the-art KSIVI. While IWHI also performs well, it does not match AISIVI, highlighting the benefits of a more expressive proposal model. For UIVI, we were limited to an inner batch size of 2 due to computational constraints, which led to noticeably weaker performance. Nevertheless, this

Revisiting Unbiased Implicit Variational Inference

Figure 5. Approximations of KSIVI, PVI, IWHVI, UIVI, AISIVI, and BSIVI for the discretized conditioned diffusion process are shown. The red dots represent the observations, the magenta line the ground truth estimated via parallel SGLD, and the blue line the estimated posterior mean. The shaded region shows the 95 marginal posterior confidence interval at each discretization step

comparison shows that AISIVI successfully adapts UIVI s core ideas in a way that makes them more computationally efficient and competitive.

6. Conclusion

In this paper, we proposed a novel SIVI framework, AISIVI, which revitalizes the ELBO as the training objective. This is possible because the bias and variance of the ELBO gradients can be severely reduced by using importance sampling and the optimal proposal distribution can be stably learned with a CNF. We provided the respective efficient Monte Carlo gradient estimators. The numerical experiments support the efficiency and effectiveness claim of AISIVI.

In particular, our experiments on the high-dimensional diffusion example suggest that it can be beneficial not to rely on a kernel method, which is known to be scalable to very large dimensions. Our method thus represents an easy-to-use and scalable alternative to current state-of-the-art SIVI methods with on-par performance.

Limitations and Outlook

This work marks an initial attempt to integrate the strengths of semi-implicit distributions and normalizing flows. However, given the numerous normalizing flow frameworks, certain alternative combinations may lead to improved performance. Future research could explore these possibilities to identify more effective configurations. While our method shows on par performance with current state-of-the-art SIVI

methods, a suitable combination could further notably enhance performance.

Additionally, the proposed method does not inherently offer exploration capabilities, which may limit its ability to model multi-modal distributions. However, note that we can always combine a temperature annealing strategy (Rezende & Mohamed, 2015) with our approach, but a more principled procedure would be desirable. While this limitation is common in related work, addressing it in future research could enhance the applicability of AISIVI.

Impact Statement

This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Andrade, D. Stabilizing training of affine coupling layers for high-dimensional variational inference. Machine Learning: Science and Technology, 5, 12 2024. doi: 10.1088/2632-2153/ad9a39.

Burda, Y., Grosse, R. B., and Salakhutdinov, R. Importance weighted autoencoders. In Bengio, Y. and Le Cun, Y. (eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL

Revisiting Unbiased Implicit Variational Inference

http://arxiv.org/abs/1509.00519.

Cheng, Z., Yu, L., Xie, T., Zhang, S., and Zhang, C. Kernel semi-implicit variational inference. In Fortyfirst International Conference on Machine Learning, 2024. URL https://openreview.net/forum? id=w5o Uo0Lh O1.

Detommaso, G., Cui, T., Spantini, A., Marzouk, Y., and Scheichl, R. A stein variational newton method. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 18, pp. 9187 9197, Red Hook, NY, USA, 2018. Curran Associates Inc.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In International Conference on Learning Representations, 2017. URL https: //openreview.net/forum?id=Hkpbn H9lx.

Feng, Y., Wang, D., and Liu, Q. Learning to draw samples with amortized stein variational gradient descent. In Elidan, G., Kersting, K., and Ihler, A. (eds.), Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017. AUAI Press, 2017. URL http://auai.org/ uai2017/proceedings/papers/206.pdf.

Figurnov, M., Mohamed, S., and Mnih, A. Implicit reparameterization gradients. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 92c8c96e4c37100777c7190b76d28233-Paper. pdf.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In Bengio, Y. and Le Cun, Y. (eds.), 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL http://arxiv.org/ abs/1312.6114.

Lim, J. N. and Johansen, A. M. Particle semi-implicit variational inference. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum? id=p3g MGk HMk M.

Lu, Y. and Huang, B. Structured output learning with conditional generative flows. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelli-

gence, EAAI 2020, New York, NY, USA, February 712, 2020, pp. 5005 5012. AAAI Press, 2020. doi: 10.1609/AAAI.V34I04.5940. URL https://doi. org/10.1609/aaai.v34i04.5940.

Ma, C., Li, Y., and Hernandez-Lobato, J. M. Variational implicit processes. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4222 4233. PMLR, 09 15 Jun 2019. URL https://proceedings.mlr. press/v97/ma19b.html.

Mescheder, L., Nowozin, S., and Geiger, A. Adversarial variational bayes: unifying variational autoencoders and generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML 17, pp. 2391 2400. JMLR.org, 2017.

Molchanov, D., Kharitonov, V., Sobolev, A., and Vetrov, D. Doubly semi-implicit variational inference. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 2593 2602. PMLR, 16 18 Apr 2019. URL https://proceedings.mlr. press/v89/molchanov19a.html.

Owen, A. B. Monte Carlo theory, methods and examples.

https://artowen.su.domains/mc/, 2013.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. J. Mach. Learn. Res., 22(1), January 2021. ISSN 1532-4435.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024 8035. Curran Associates, Inc., 2019.

Pielok, T., Bischl, B., and R ugamer, D. Approximate bayesian inference with stein functional variational gradient descent. In The Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=a2-aoqme YM4.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1530 1538, Lille, France, 07 09 Jul 2015. PMLR. URL https://proceedings. mlr.press/v37/rezende15.html.

Revisiting Unbiased Implicit Variational Inference

Roeder, G., Wu, Y., and Duvenaud, D. K. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips. cc/paper_files/paper/2017/file/ e91068fff3d7fa1594dfdf3b4308433a-Paper. pdf.

Shi, J., Sun, S., and Zhu, J. Kernel implicit variational inference. In International Conference on Learning Representations, 2018. URL https://openreview.net/ forum?id=r1l4e QW0Z.

Sobolev, A. and Vetrov, D. P. Importance weighted hierarchical variational inference. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips. cc/paper_files/paper/2019/file/ 5737c6ec2e0716f3d8a7a5c4e0de0d9a-Paper. pdf.

Sun, S., Zhang, G., Shi, J., and Grosse, R. FUNCTIONAL VARIATIONAL BAYESIAN NEURAL NETWORKS. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=rkxacs0q Y7.

Titsias, M. K. and Ruiz, F. Unbiased implicit variational inference. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 167 176. PMLR, 16 18 Apr 2019. URL https://proceedings.mlr.press/v89/ titsias19a.html.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In International Conference on Machine Learning, 2011. URL https://api. semanticscholar.org/Corpus ID:2178983.

Yin, M. and Zhou, M. Semi-implicit variational inference. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5660 5669. PMLR, 10 15 Jul 2018. URL https://proceedings.mlr.press/v80/ yin18b.html.

Yu, L. and Zhang, C. Semi-implicit variational inference via score matching. In The Eleventh International Confer-

ence on Learning Representations, 2023. URL https: //openreview.net/forum?id=sd90a2ytrt.

Yu, L., Xie, T., Zhu, Y., Yang, T., Zhang, X., and Zhang, C. Hierarchical semi-implicit variational inference with application to diffusion model acceleration. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS 23, Red Hook, NY, USA, 2023. Curran Associates Inc.

Zimmermann, H., Wu, H., Esmaeili, B., and van de Meent, J. Nested variational inference. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 20423 20435, 2021. URL https://proceedings. neurips.cc/paper/2021/hash/ ab49b208848abe14418090d95df0d590-Abstract. html.

Revisiting Unbiased Implicit Variational Inference

For the proofs, we assume that the objects of interest are sufficiently regular s.t. we can change the order of integration, summation, and differentiation.

A.1. s MC,k is a consistent estimator and its bias approximation

We approximate the bias of s MC,k(z) = z log 1 k Pk i=1 qz|ϵ(z|ϵi) by using the delta method. First we note for

large k that s MC,k(z) z log

i=2 qz|ϵ(z|ϵi)

| {z } =: s MC,k(z)

. With the second-order Taylor approximation around qz(z) =

Eϵi pϵ h 1 k Pk 1 i=2 qz|ϵ(z|ϵi) i we get that

i=2 qz|ϵ(z|ϵi)

log (qz(z)) +

1 k 1 Pk i=2 qz|ϵ(z|ϵi) qz(z)

1 k 1 Pk i=2 qz|ϵ(z|ϵi) qz(z) 2

2 qz(z)2 . (40)

From this, it follows that

i=2 qz|ϵ(z|ϵi)

log (qz(z)) Vϵ pϵ qz|ϵ(z|ϵ)

2(k 1) qz(z)2 . (41)

Consequently, we get for large k that

Eϵi pϵ [ zs MC,k(z)] z log (qz(z)) Eϵi pϵ z s MC,k(z) z log (qz(z)) (42)

Vϵ pϵ qz|ϵ(z|ϵ)

2(k 1) qz(z)2

which, in general, is non-zero.

To prove the consistency of s MC,k, we observe since log is a continuous function that

lim k s MC,k = z log

i=1 qz|ϵ(z|ϵi)

a.s. = z log Eϵ pϵ qz|ϵ(z|ϵ) = z log (qz(z)) , (44)

P lim k s MC,k = z log (qz(z)) = 1. (45)

A.2. s IS,k is a consistent estimator

To prove the consistency of s IS,k, we observe since log is a continuous function that

lim k s IS,k = z log

pϵ(ϵ)qz|ϵ(z|ϵi)

τϵ|ez(ϵi|ez)

a.s. = z log Eϵi τϵ|z

pϵ(ϵ)qz|ϵ(z|ϵi)

τϵ|ez(ϵi|ez)

ez=z . (46)

For a valid proposal distribution, it must hold that τϵ|z must be non-zero where pϵ qz|ϵ = qz,ϵ is greater than zero (Owen, 2013). Consequently, the support of τϵ|z must also contain the support of qϵ|z = qz,ϵ

qz . In this case

lim k s IS,k a.s. = z log (qz(z)) , i.e., P lim k s IS,k = z log (qz(z)) = 1. (47)

Revisiting Unbiased Implicit Variational Inference

A.3. s3 gives the correct score gradient estimator regarding all ϵi samples

Assume we got a ϵ batch of size (j + 1) b and have computed the following estimators

s1(z) = zℓ1(z, ez) ez=z with (48)

ℓ1(z, ez) = log

i=1 w(ϵi|ez)qz|ϵ(z|ϵi)

s2(z) = zℓ2(z, ez) ez=z with (50)

ℓ2(z, ez) = log

i=j b+1 w(ϵi|ez)qz|ϵ(z|ϵi)

These estimates can be aggregated such that

ℓ3(z, ez) = logaddexp (ℓ1(z, ez) + log j, ℓ2(z, ez)) log(j + 1), (52)

1 (j + 1) b

i=1 w(ϵi|ez)qz|ϵ(z|ϵi)

s3(z) = α1s1(z) + α2s2(z) with (54)

α1 = exp ℓ1(z, ez) ℓ3(z, ez) + log j j + 1

α2 = exp (ℓ2(z, ez) ℓ3(z, ez) log(j + 1)) . (56)

For the score gradient estimate, it follows that

s3 = 1 exp (ℓ3(z, ez)) z j j + 1 exp (ℓ1(z, ez)) ez=z + 1 exp (ℓ3(z, ez)) z 1 j + 1 exp (ℓ2(z, ez)) ez=z (58)

= 1 exp (ℓ3(z, ez)) z exp (ℓ3(z, ez)) ez=z (59)

= zℓ3(z, ez) ez=z. (60)

B. Implementation Details

Table 3 summarizes the details for the toy example discussed in Section 5.1.

Table 3. Densities of the toy examples

NAME DENSITY PARAMETERS

BANANA z = (ν1, ν2 1 + ν2 + 1) , ν N(0, Σ) Σ = 1 0.9 0.9 1

MULTIMODAL z 0.5N(z| µ1, I) + 0.5N(z| µ2, I) µ1 = ( 2, 0) , µ2 = (2, 0)

X-SHAPE z 0.5N(z| 0, Σ1) + 0.5N(z| 0, Σ2) Σ1 = 2 1.8 1.8 2

, Σ2 = 2 1.8 1.8 2