# Kernel Semi-Implicit Variational Inference

Ziheng Cheng¹, Longlin Yu¹, Tianyu Xie¹, Shiyue Zhang¹, Cheng Zhang¹ ²

*Equal contribution (alphabetical). ¹School of Mathematical Sciences, Peking University, China. ²Center for Statistical Science, Peking University, China.

**Abstract.** Semi-implicit variational inference (SIVI) extends traditional variational families with semi-implicit distributions defined in a hierarchical manner. Due to the intractable densities of semi-implicit distributions, classical SIVI often resorts to surrogates of the evidence lower bound (ELBO) that introduce biases into training. A recent advancement in SIVI, named SIVI-SM, uses an alternative score matching objective made tractable via a minimax formulation, albeit requiring an additional lower-level optimization. In this paper, we propose kernel SIVI (KSIVI), a variant of SIVI-SM that eliminates the need for lower-level optimization through kernel tricks. Specifically, we show that when optimizing over a reproducing kernel Hilbert space (RKHS), the lower-level problem has an explicit solution. This way, the upper-level objective becomes the kernel Stein discrepancy (KSD), which is readily computable for stochastic gradient descent due to the hierarchical structure of semi-implicit variational distributions. An upper bound for the variance of the Monte Carlo gradient estimators of the KSD objective is derived, which allows us to establish novel convergence guarantees for KSIVI. We demonstrate the effectiveness and efficiency of KSIVI on both synthetic distributions and a variety of real-data Bayesian inference tasks.

## 1. Introduction

Variational inference (VI) is an optimization-based approach widely used to approximate posterior densities for Bayesian models (Jordan et al., 1999; Wainwright & Jordan, 2008; Blei et al., 2016). The idea behind VI is to first posit a family of variational distributions over the model parameters (or
Correspondence to: Cheng Zhang. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

latent variables), and then to find a member from this family that is close to the target posterior, where closeness is often measured by the Kullback-Leibler (KL) divergence. Since the posterior generally does not possess an analytical density, in practice one often maximizes the evidence lower bound (ELBO) instead, which is an equivalent but tractable reformulation (Jordan et al., 1999). In classical VI, a common assumption is that the variational distributions factorize over the parameters (or latent variables) (Bishop & Tipping, 2000; Hinton & van Camp, 1993; Peterson & Hartman, 1989). When combined with conditionally conjugate models, this mean-field assumption allows a straightforward coordinate-ascent optimization scheme with closed-form update rules (Blei et al., 2016). However, conditional conjugacy may not hold in practice, and a factorized variational distribution may fail to capture a complex posterior because it ignores the dependencies between different factorization components. Recent years have witnessed significant advancements in the field of VI, with numerous efforts aimed at alleviating these constraints. To address the conjugacy condition, black-box VI methods have been proposed for a broad class of models, allowing generic training algorithms via Monte Carlo gradient estimators (Paisley et al., 2012; Nott et al., 2012; Ranganath et al., 2014; Rezende et al., 2014; Kingma & Welling, 2014; Titsias & Lázaro-Gredilla, 2014).
In parallel, more flexible variational families have emerged, either explicitly incorporating certain structures among the parameters or drawing inspiration from invertible transformations of probability distributions (Jaakkola & Jordan, 1998; Saul & Jordan, 1996; Giordano et al., 2015; Tran et al., 2015; Rezende & Mohamed, 2015; Dinh et al., 2017; Kingma et al., 2016; Papamakarios et al., 2019). Despite their successes, it is noteworthy that all these approaches assume tractable densities for the variational distributions. To further enhance the capacity of variational families, one strategy is to incorporate implicit models defined through neural network mappings (Huszár, 2017; Tran et al., 2017; Mescheder et al., 2017; Shi et al., 2018a;b; Song et al., 2019). These implicit models provide more flexible variational approximations that allow easy sampling by design. However, their probability densities are often intractable, rendering direct computation of the log density ratio, crucial for evaluating the ELBO, impossible. Implicit VI methods, therefore, generally rely on density ratio estimation for training, which not only introduces additional optimization complexity but is also known to be difficult in high-dimensional settings (Sugiyama et al., 2012). To circumvent the need for density ratio estimation, semi-implicit variational inference (SIVI) has been introduced (Yin & Zhou, 2018; Moens et al., 2021). In SIVI, variational distributions are crafted using a semi-implicit hierarchical framework that allows efficient computation of surrogate ELBOs for training. As surrogate ELBOs are inherently biased, Titsias & Ruiz (2019) proposed an unbiased estimator of the gradient of the exact ELBO, albeit one requiring inner-loop MCMC sampling from a reversed conditional distribution, which can easily become expensive in high-dimensional regimes.
Alternatively, Yu & Zhang (2023) introduced SIVI-SM, a variant of SIVI that uses an objective based on the Fisher divergence rather than the KL divergence. They showed that this new objective facilitates a simple minimax formulation in which the intractability of densities can be handled naturally, similarly to denoising score matching (DSM) (Vincent, 2011), thanks to the hierarchical construction of semi-implicit distributions. However, this comes at the price of an additional lower-level problem that requires optimization over a family of functions parameterized by neural networks. As a popular approach in machine learning, kernel methods have shown great advantages for optimization over function spaces (Gretton et al., 2012; Liu & Wang, 2016). In this paper, we propose kernel semi-implicit variational inference (KSIVI), a variant of SIVI-SM that eliminates the need for lower-level optimization through kernel tricks. More specifically, we show that there exists an explicit solution to the lower-level problem when optimizing over a reproducing kernel Hilbert space (RKHS). Upon substitution, the objective for the upper-level problem becomes the kernel Stein discrepancy (KSD), a widely used kernel-induced dissimilarity measure between probability distributions (Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017b). Moreover, as in DSM, this solution has a tractable form that involves only conditional densities, due to the hierarchical structure of semi-implicit variational families, making the KSD readily computable for stochastic gradient descent (SGD). We derive an upper bound for the variance of the Monte Carlo gradient estimators under mild assumptions. This bound allows us to establish convergence guarantees of KSIVI to a stationary point using standard analysis from stochastic optimization.
We show empirically that KSIVI performs on par with or better than SIVI-SM on both synthetic distributions and a variety of real-data Bayesian inference tasks, while enjoying more efficient computation, more stable training, and fewer hyperparameter-tuning headaches.

## 2. Background

**Notations.** Let $\|\cdot\|$ denote the standard Euclidean norm of a vector as well as the operator norm of a matrix or higher-order tensor, and let $\|\cdot\|_\infty$ denote the $\ell_\infty$ norm of a vector. For any $x, y \in \mathbb{R}^d$, the expressions $x \odot y$, $x / y$, and $x^2$ stand for element-wise product, division, and square, respectively. The RKHS induced by a kernel $k(\cdot,\cdot)$ is denoted by $\mathcal{H}_0$ and the corresponding Cartesian product is defined as $\mathcal{H} := \mathcal{H}_0^d$. Let $\nabla_1 k$ (resp. $\nabla_2 k$) be the gradient of the kernel w.r.t. the first (resp. second) variable. We denote the inner products in $\mathbb{R}^d$ and in a Hilbert space $\mathcal{F}$ by $\langle \cdot,\cdot\rangle$ and $\langle \cdot,\cdot\rangle_{\mathcal{F}}$, respectively. For a probability density function $q$, we use $s_q$ to denote its score function $\nabla \log q$. Finally, we use the standard $\lesssim$ and $\gtrsim$ to omit constant factors.

### 2.1. Semi-Implicit Variational Inference

Given an observed data set $\mathcal{D}$, classical VI methods find the best approximation in a variational family $\{q_\phi(x)\}_{\phi\in\Phi}$ to the posterior distribution $p(x|\mathcal{D})$ by maximizing the following evidence lower bound (ELBO)

$$\mathcal{L}(q_\phi(x)) = \mathbb{E}_{q_\phi(x)}\left[\log p(\mathcal{D}, x) - \log q_\phi(x)\right], \tag{1}$$

where $\phi$ are the variational parameters. To expand the capacity of variational families, SIVI (Yin & Zhou, 2018) constructs a semi-implicit variational family with a hierarchical structure,

$$q_\phi(x) = \int q_\phi(x|z)\,q(z)\,dz, \tag{2}$$

where $q(z)$ is the mixing distribution, which is easy to sample from and can be implicit, and $q_\phi(x|z)$ is the conditional layer, which is required to be explicit. Compared to classical VI, which requires tractable variational distributions with explicit densities, the semi-implicit construction greatly enhances the approximation ability, making it possible to capture complex dependencies among different variables.
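As a concrete illustration of the hierarchical construction in (2), the sketch below draws samples by ancestral sampling: first $z \sim q(z)$, then $x \sim q_\phi(x|z)$. The Gaussian mixing distribution, the `tanh` map standing in for a neural network, and all dimensions are hypothetical choices for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_semi_implicit(n, d=2, dz=4):
    """Draw x ~ q(x) = \int q(x|z) q(z) dz by ancestral sampling.

    Assumed setup (illustrative): q(z) = N(0, I_dz) and a Gaussian
    conditional layer q(x|z) = N(mu(z), sigma^2 I), where mu is a
    fixed nonlinear map playing the role of a neural network.
    """
    z = rng.standard_normal((n, dz))              # implicit mixing draw
    W = np.ones((dz, d)) / dz                     # placeholder "network" weights
    mu = np.tanh(z @ W)                           # conditional mean mu(z)
    sigma = 0.5
    x = mu + sigma * rng.standard_normal((n, d))  # explicit conditional layer
    return x

samples = sample_semi_implicit(1000)
print(samples.shape)  # (1000, 2)
```

Even though each conditional layer here is Gaussian, the marginal $q(x)$ is a continuous Gaussian mixture with no closed-form density, which is exactly why SIVI needs surrogate or score-based objectives.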
Since $q_\phi(x)$ in the ELBO (1) is no longer tractable, Yin & Zhou (2018) consider a sequence of surrogates $\mathcal{L}_K(q_\phi(x))$ of the ELBO (1) defined as

$$\mathcal{L}_K(q_\phi(x)) = \mathbb{E}_{q(z^{(0:K)})\,q_\phi(x|z^{(0)})}\left[\log \frac{p(\mathcal{D}, x)}{\frac{1}{K+1}\sum_{k=0}^{K} q_\phi(x|z^{(k)})}\right], \tag{3}$$

where $q(z^{(0:K)}) = \prod_{k=0}^{K} q(z^{(k)})$. The surrogate $\mathcal{L}_K(q_\phi(x))$ is a lower bound of $\mathcal{L}(q_\phi(x))$ and is asymptotically exact in the sense that $\lim_{K\to\infty} \mathcal{L}_K(q_\phi(x)) = \mathcal{L}(q_\phi(x))$.

Besides ELBO-based objectives, score-based measures have also been introduced for VI (Liu et al., 2016; Zhang et al., 2018; Hu et al., 2018; Korba et al., 2021), where the score function $s_p(x)$ of the target posterior distribution $p(x) = p(x|\mathcal{D})$ is generally assumed to be tractable. In particular, SIVI-SM (Yu & Zhang, 2023) learns the semi-implicit approximation by minimizing the Fisher divergence via the minimax formulation

$$\min_\phi \max_f \; \mathbb{E}_{q_\phi(x)}\left[2 f(x)^T \left(s_p(x) - s_{q_\phi}(x)\right) - \|f(x)\|^2\right]. \tag{4}$$

Using a trick similar to DSM (Vincent, 2011), (4) admits the tractable form

$$\min_\phi \max_f \; \mathbb{E}_{q_\phi(x,z)}\left[2 f(x)^T \left(s_p(x) - s_{q_\phi(\cdot|z)}(x)\right) - \|f(x)\|^2\right], \tag{5}$$

where $q_\phi(x, z) = q_\phi(x|z)\,q(z)$. The auxiliary function $f(x)$ is implemented as a neural network $f_\psi(x)$ in practice, necessitating additional care in training deep models and tuning hyperparameters.

### 2.2. Kernel Stein Discrepancy

Consider a continuous, positive semi-definite kernel $k(\cdot,\cdot): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and its corresponding RKHS $\mathcal{H}_0$ of real-valued functions on $\mathbb{R}^d$. The reproducing property of $\mathcal{H}_0$ states that for any function $f \in \mathcal{H}_0$, $\langle f(\cdot), k(x,\cdot)\rangle_{\mathcal{H}_0} = f(x)$. Given a measure $q$ such that $\int k(x,x)\,dq(x) < \infty$, it follows that $\mathcal{H}_0 \subseteq L^2(q)$, and the integral operator $S_{q,k}: L^2(q) \to \mathcal{H}_0$ is defined as

$$(S_{q,k}f)(\cdot) := \int k(\cdot, y)\,f(y)\,dq(y). \tag{6}$$

The Cartesian product $\mathcal{H} := \mathcal{H}_0^d$, which consists of $f = (f_1, \ldots, f_d)$ with $f_i \in \mathcal{H}_0$, extends this to vector-valued functions. The inner product in $\mathcal{H}$ is defined by $\langle f, g\rangle_{\mathcal{H}} = \sum_{i=1}^d \langle f_i, g_i\rangle_{\mathcal{H}_0}$. For $f \in \mathcal{H}$ and $g_0 \in \mathcal{H}_0$, we extend the notation and define $\langle f, g_0\rangle_{\mathcal{H}} = (\langle f_1, g_0\rangle_{\mathcal{H}_0}, \ldots, \langle f_d, g_0\rangle_{\mathcal{H}_0})$.
For vector-valued inputs $f$, we overload the definition of the operator $S_{q,k}$ and apply it element-wise. The Stein discrepancy (Gorham & Mackey, 2015; Liu & Wang, 2016; Liu et al., 2016), which measures the difference between $q$ and $p$, is defined as

$$\mathbb{S}(q\,\|\,p) := \max_{f \in \mathcal{F}} \; \mathbb{E}_q\left[\nabla \log p(x)^T f(x) + \nabla \cdot f(x)\right], \tag{7}$$

where $\mathcal{F}$ is a pre-defined function space. To make the inner maximization tractable, the kernel Stein discrepancy (KSD) (Liu & Wang, 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017b) maximizes $f$ within the unit ball of the RKHS, i.e., $\mathcal{F} = \{f : \|f\|_{\mathcal{H}} \le 1\}$. Since the optimal $f$ has a closed form, the KSD admits the equivalent expression

$$\mathrm{KSD}(q\,\|\,p)^2 = \left\langle S_{q,k}\,\nabla \log \frac{p}{q},\; \nabla \log \frac{p}{q}\right\rangle_{L^2(q)}, \tag{8}$$

which can also be seen as a kernelized Fisher divergence (Korba et al., 2021). Gorham & Mackey (2017b) prove that, under mild conditions on the target distribution $p(x)$ and the kernel $k$, convergence in KSD implies weak convergence of distributions.

## 3. Proposed Method

In this section, we present kernel semi-implicit variational inference (KSIVI), which explicitly solves the lower-level optimization in SIVI-SM through kernel tricks. We first give the expression of this explicit solution and then discuss the practical implementation of the upper-level optimization.

### 3.1. Training Objective

Instead of considering $f \in L^2(q_\phi)$, which in general does not allow an explicit solution of $f$ in (4), we seek the optimal $f$ in an RKHS $\mathcal{H}$ and reformulate the minimax problem as

$$\min_\phi \max_{f \in \mathcal{H}} \; \mathbb{E}_{q_\phi(x)}\left[2 f(x)^T \left(s_p(x) - s_{q_\phi}(x)\right)\right] - \|f\|_{\mathcal{H}}^2. \tag{9}$$

The following theorem shows that the solution $f^*$ to the lower-level optimization in (9) has an explicit form, which allows us to reduce (9) to a standard optimization problem.

**Theorem 3.1.** *For any variational distribution $q_\phi$, the solution $f^*$ to the lower-level optimization in (9) takes the form*

$$f^*(x) = \mathbb{E}_{q_\phi(y)}\left[k(x, y)\left(s_p(y) - s_{q_\phi}(y)\right)\right]. \tag{10}$$

*Thus the upper-level optimization problem for $\phi$ is*

$$\min_\phi \; \mathrm{KSD}(q_\phi\,\|\,p)^2 = \left\langle S_{q_\phi,k}\,\nabla \log \frac{p}{q_\phi},\; \nabla \log \frac{p}{q_\phi}\right\rangle_{L^2(q_\phi)}. \tag{11}$$

The detailed proof is deferred to Appendix A.1.
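The squared KSD in (8) can be estimated from samples alone once the two score functions are available. The following is a minimal NumPy sketch with a Gaussian RBF kernel; the bandwidth, the U-statistic form (dropping diagonal terms), and the Gaussian example pair are illustrative assumptions rather than the paper's exact estimator.

```python
import numpy as np

def ksd_squared(xs, score_p, score_q, bw=1.0):
    """U-statistic estimate of KSD(q||p)^2 =
    E_{x,x'~q}[ k(x,x') <s_p(x)-s_q(x), s_p(x')-s_q(x')> ]
    using a Gaussian RBF kernel; both scores are assumed known."""
    n = len(xs)
    diff = score_p(xs) - score_q(xs)                       # (n, d) score residuals
    sq = np.sum((xs[:, None, :] - xs[None, :, :])**2, -1)  # pairwise sq. distances
    K = np.exp(-sq / (2 * bw**2))                          # kernel Gram matrix
    G = K * (diff @ diff.T)                                # integrand on all pairs
    return (G.sum() - np.trace(G)) / (n * (n - 1))         # drop diagonal terms

rng = np.random.default_rng(0)
xs = rng.standard_normal((500, 2))   # samples from q = N(0, I)
s_q = lambda x: -x                   # score of N(0, I)
s_p = lambda x: -(x - 1.0)           # score of a shifted target p = N(1, I)
print(ksd_squared(xs, s_p, s_q) > ksd_squared(xs, s_q, s_q))  # True
```

When `score_p == score_q` the residuals vanish and the estimate is exactly zero, matching the fact that KSD is a divergence.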
Moreover, by taking advantage of the semi-implicit structure, we have

$$
\begin{aligned}
\mathbb{E}_{q_\phi(y)}\left[k(x, y)\,s_{q_\phi}(y)\right]
&= \int k(x, y)\,\nabla_y q_\phi(y)\,dy \\
&= \int k(x, y)\,\nabla_y q_\phi(y|z)\,q(z)\,dz\,dy \\
&= \int k(x, y)\,s_{q_\phi(\cdot|z)}(y)\,q_\phi(y|z)\,q(z)\,dz\,dy \\
&= \mathbb{E}_{q_\phi(y,z)}\left[k(x, y)\,s_{q_\phi(\cdot|z)}(y)\right].
\end{aligned}
$$

This leads to the following proposition.

**Proposition 3.2.** *The solution $f^*$ in Theorem 3.1 can be rewritten as*

$$f^*(x) = \mathbb{E}_{q_\phi(y,z)}\left[k(x, y)\left(s_p(y) - s_{q_\phi(\cdot|z)}(y)\right)\right]. \tag{13}$$

*And $\mathrm{KSD}(q_\phi\,\|\,p)^2$ in (11) has the equivalent expression*

$$\mathrm{KSD}(q_\phi\,\|\,p)^2 = \mathbb{E}_{q_\phi(x,z),\,q_\phi(x',z')}\left[k(x, x')\left\langle s_p(x) - s_{q_\phi(\cdot|z)}(x),\; s_p(x') - s_{q_\phi(\cdot|z')}(x')\right\rangle\right]. \tag{14}$$

Since the semi-implicit variational distribution $q_\phi(x,z)$ enables efficient sampling, both $f^*$ and $\mathrm{KSD}(q_\phi\,\|\,p)^2$ can be estimated by the Monte Carlo method with samples from $q_\phi(x,z)$. This way, we have transformed the minimax problem (9) into a standard optimization problem with a tractable objective function.

### 3.2. Practical Implementation

Suppose the conditional $q_\phi(x|z)$ admits the reparameterization trick (Kingma & Welling, 2014; Titsias & Lázaro-Gredilla, 2014; Rezende et al., 2014), i.e., $x \sim q_\phi(\cdot|z) \iff x = T_\phi(z, \xi),\ \xi \sim q_\xi$, where $T_\phi$ is a parameterized transformation and $q_\xi$ is a base distribution that does not depend on $\phi$. We now show how to find an optimal variational approximation $q_\phi(x)$ that minimizes the KSD between $q_\phi(x)$ and $p(x)$ defined in (14) using stochastic optimization. To estimate the gradient $g(\phi) = \nabla_\phi \mathrm{KSD}(q_\phi\,\|\,p)^2$, we consider two unbiased stochastic gradient estimators in our implementation. The first is a vanilla gradient estimator using two batches of Monte Carlo samples $(x_{ri}, z_{ri}) \overset{\text{i.i.d.}}{\sim} q_\phi(x,z)$, where $r = 1, 2$ and $1 \le i \le N$:

$$\hat{g}_{\text{vanilla}}(\phi) = \frac{1}{N^2}\sum_{1 \le i, j \le N} \nabla_\phi\left[k(x_{1i}, x_{2j})\,\langle f_{1i}, f_{2j}\rangle\right], \tag{15}$$

where $f_{ri} = s_p(x_{ri}) - s_{q_\phi(\cdot|z_{ri})}(x_{ri})$. We can also leverage U-statistics to design an alternative estimator (Gretton et al., 2012; Liu et al., 2016). Given one batch of samples $(x_i, z_i) \overset{\text{i.i.d.}}{\sim}$
$q_\phi(x, z)$ for $1 \le i \le N$, the U-statistic gradient estimator takes the form

$$\hat{g}_{\text{u-stat}}(\phi) = \frac{2}{N(N-1)}\sum_{1 \le i < j \le N} \nabla_\phi\left[k(x_i, x_j)\,\langle f_i, f_j\rangle\right],$$

where $f_i = s_p(x_i) - s_{q_\phi(\cdot|z_i)}(x_i)$.

## 4. Convergence Analysis

**Assumption 4.1.** There exists $B > 0$ such that for all $x, y \in \mathbb{R}^d$, $\max\{|k|,\ \|\nabla_1 k\|,\ \|\nabla_{11} k\|,\ \|\nabla_{12} k\|\} \le B^2$.

**Assumption 4.2.** $\log p$ is three times continuously differentiable with $\|\nabla^2 \log p(x)\| \le L$ and $\|\nabla^3 \log p(x)\| \le M$.

**Assumption 4.3.** There exists a constant $G \ge 1$ such that $\max\{\|\nabla_\phi \mu(z;\phi)\|,\ \|\nabla^2_\phi \mu(z;\phi)\|\} \le G(1 + \|z\|)$. The same inequalities hold for $\sigma(z;\phi)$. Besides, for any $z, \phi$, $\sigma(z;\phi)$ has a uniform positive lower bound.

As an application to the location-scale family, where $\mu(z;\phi) \equiv \mu \in \mathbb{R}^d$ and $\sigma(z;\phi) \equiv \sigma \in \mathbb{R}^d$, we have $G = 1$. For general nonlinear neural networks, similar first-order smoothness is also assumed in the analysis of GANs (Arora et al., 2017). Note that we do not directly assume a uniformly bounded smoothness constant for $\mu(z;\phi)$ and $\sigma(z;\phi)$; it is allowed to grow with $\|z\|$. We find that in practice this assumption is valid and the constant $G$ lies in a reasonable range; see Appendix A.3 for numerical evidence. The lower bound on $\sigma(z;\phi)$ avoids degeneracy and is also required in Domke et al. (2023) and Kim et al. (2023a), where projected SGD is applied to ensure it. Similarly, our results can be easily extended to the setting of stochastic composite optimization with a projected SGD optimizer, which we omit due to limited space.

**Assumption 4.4.** The mixing distribution $q(z)$ and the variational distribution $q_\phi(x)$ have bounded 4th moments, i.e., $\mathbb{E}_{q(z)}\|z\|^4 \le d_z^2$ and $\mathbb{E}_{q_\phi(x)}\|x\|^4 \le s^4$.

Overall, Assumption 4.1 guarantees that the kernel operator is well-defined; it is easy to verify for the commonly used Gaussian radial basis function (RBF) kernel and the inverse multi-quadratic (IMQ) kernel (Gorham & Mackey, 2017b). Assumptions 4.2 and 4.3 ensure the smoothness of the target and variational distributions. Assumption 4.4 is a technical but reasonable assumption used to bound the variance of the stochastic gradients. Based on Assumptions 4.2 and 4.4, we can give a uniform upper bound on $\mathbb{E}_{q_\phi(x)}\|s_p(x)\|^4$, which is crucial to the subsequent analysis.

**Proposition 4.5.**
*Under Assumptions 4.2 and 4.4, we have $\mathbb{E}_{q_\phi(x)}\|s_p(x)\|^4 \lesssim L^4(s^4 + \|x^*\|^4) := C^2$, where $x^*$ is any zero point of $s_p$.*

**Theorem 4.6.** *The objective $\mathcal{L}(\phi)$ is $L_\phi$-smooth, where*

$$L_\phi \lesssim B^2 G^2 d_z \log d\left[(1 \vee L)^3 + Ld + M^2 + C\right]. \tag{18}$$

**Theorem 4.7.** *Both gradient estimators $\hat{g}_{\text{vanilla}}$ and $\hat{g}_{\text{u-stat}}$ have variance bounded by $\Sigma_0^2$, where*

$$\Sigma_0^2 \lesssim B^4 G^2 d_z \log d\left[L^3 d + L^2 d^2 + C^2\right]. \tag{19}$$

**Theorem 4.8.** *Under Assumptions 4.1-4.4, the iterates of the SGD update $\phi_{t+1} = \phi_t - \eta\,\hat{g}_t$ with a properly chosen learning rate $\eta$ include an $\varepsilon$-stationary point $\hat\phi$ such that $\mathbb{E}\left[\|\nabla_\phi \mathcal{L}(\hat\phi)\|\right] \le \varepsilon$, where $L_0 := \mathcal{L}(\phi_0) - \inf_\phi \mathcal{L}(\phi)$.*

The complete proofs are deferred to Appendix A.2. Theorems 4.6 and 4.7 imply that the optimization problem $\min_\phi \mathcal{L}(\phi)$ satisfies the standard assumptions in the non-convex optimization literature. Therefore, classic results on SGD apply to our scenario (Ghadimi & Lan, 2013). Theorem 4.8 implies that the sequence $\{\phi_t\}$ converges to a stationary point of the objective $\mathcal{L}(\phi)$. Since the training objective is the squared KSD instead of the traditional ELBO in BBVI, $\mathcal{L}(\phi)$ is non-convex even for the location-scale family, and convergence in the loss function $\mathcal{L}$ is generally inaccessible. We hope future work can shed more light on this issue.

## 5. Related Works

To address the limitations of standard VI, implicit VI constructs a flexible variational family through non-invertible mappings parameterized by neural networks. However, the main issue therein is density ratio estimation, which is difficult in high dimensions (Sugiyama et al., 2012). Besides SIVI, there are a number of recent advancements in this field. Molchanov et al. (2019) further extend SIVI in the context of generative models. Sobolev & Vetrov (2019) introduce a new surrogate of the ELBO through importance sampling. UIVI (Titsias & Ruiz, 2019) runs inner-loop MCMC to get an unbiased estimate of the gradient of the ELBO. KIVI (Shi et al., 2018a) constrains the density ratio estimation within an RKHS.
LIVI (Uppal et al., 2023) approximates the intractable entropy with a Gaussian distribution by linearizing the generator. Besides ELBO-based training of VI, many works consider minimizing the Fisher divergence or its variants (Yu & Zhang, 2023; Ranganath et al., 2016; Grathwohl et al., 2020; Dong et al., 2022; Cheng et al., 2023). Leveraging the minimax formulation of the Fisher divergence, these methods alternately optimize the variational distribution and an adversarial function in certain function classes, typically neural networks. Although this does not rely on surrogates of the ELBO, the lower-level optimization cannot be done exactly in general, inducing unavoidable bias in the training of the variational distribution. More importantly, it involves expensive extra computation during training. To fix this computational issue, another closely related work (Korba et al., 2021) proposes a particle-based VI method, KSDD, which follows the KSD flow to minimize KSD, a kernelized version of the Fisher divergence. It utilizes the fact that the Wasserstein gradient of KSD can be directly estimated by the empirical particle distribution. Korba et al. (2021) also discuss the theoretical properties of KSDD.

*Figure 1. Performances of KSIVI on toy examples. The histplots in blue represent the estimated densities using 100,000 samples generated from KSIVI's variational approximation. The black lines depict the contours of the target distributions.*

*Figure 2. Convergence of KL divergence during training obtained by different methods on toy examples. The KL divergences are estimated using the Python ITE module (Szabó, 2014) with 100,000 samples. The results are averaged over 5 independent runs with the standard deviation as the shaded region.*
However, our formulation and the "denoising" derivation are not naive extensions: KSIVI only requires function values of the kernel, while KSDD relies on second derivatives of the kernel. We believe this is a crucial point that distinguishes KSIVI, since less smooth kernels can be leveraged, which may have stronger power as a distance, e.g., the Riesz kernel (Altekrüger et al., 2023). As for convergence guarantees of BBVI, previous works mainly focus on ELBO objectives and location-scale families. In particular, Domke (2019) proves a bound on the gradient variance, and Domke (2020) later shows a smoothness guarantee for the loss function, both under various structural assumptions. The first self-contained analysis of the convergence of the location-scale family has recently been proposed by Kim et al. (2023b) and Domke et al. (2023), where important variants like the STL gradient estimator and proximal gradient descent are also discussed. Still, to the best of our knowledge, there is no theoretical convergence guarantee for (semi-)implicit VI.

## 6. Experiments

In this section, we compare KSIVI to the ELBO-based method SIVI and the score-based method SIVI-SM on toy examples and real-data problems. For the construction of the semi-implicit variational family in all these methods, we choose a standard Gaussian mixing distribution and a diagonal Gaussian conditional layer (see Section 3.2) whose standard deviation $\sigma(z, \phi) = \phi_\sigma \in \mathbb{R}^d$ does not depend on $z$. Following Liu & Wang (2016), we dynamically set the kernel width to the median value of the variational samples' spacing. In all experiments, we use the Gaussian RBF kernel in KSIVI, following Liu & Wang (2016) and Liu et al. (2016). Throughout this section, we use the vanilla gradient estimator for KSIVI; results for the U-statistic gradient estimator can be found in Appendix C. All experiments are implemented in PyTorch (Paszke et al., 2019).
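The median-based kernel-width rule mentioned above can be sketched as follows. The exact scaling convention (here the SVGD-style `median / log(n+1)` factor from Liu & Wang's reference implementation) is one common choice, assumed for illustration rather than taken from this paper.

```python
import numpy as np

def median_heuristic_bandwidth(xs):
    """Set the RBF kernel width from the median pairwise spacing of the
    current batch of variational samples (median heuristic)."""
    d2 = np.sum((xs[:, None, :] - xs[None, :, :])**2, axis=-1)  # squared distances
    med = np.median(d2[np.triu_indices_from(d2, k=1)])          # median over pairs
    return np.sqrt(0.5 * med / np.log(len(xs) + 1))             # SVGD-style scaling

rng = np.random.default_rng(0)
xs = rng.standard_normal((256, 2))
h = median_heuristic_bandwidth(xs)
print(h > 0)  # True
```

Recomputing the bandwidth on each batch lets the kernel adapt as the variational distribution contracts or spreads during training.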
More implementation details can be found in Appendix C and https://github.com/longinYu/KSIVI.

### 6.1. Toy Examples

We first conduct toy experiments on approximating three two-dimensional distributions: BANANA, MULTIMODAL, and X-SHAPED, whose probability density functions are given in Table 4 in Appendix C.1. We use a temperature annealing strategy (Rezende & Mohamed, 2015) on MULTIMODAL to facilitate exploration. The results are collected after 50,000 iterations with a learning rate of 0.001 for all methods. Figure 1 shows the approximation performance of KSIVI on the toy examples. We see that KSIVI provides favorable variational approximations for the target distributions. Figure 2 displays the KL divergence as a function of the number of iterations for KSIVI and SIVI-SM. Compared to SIVI-SM, KSIVI is more stable in training, partly because it does not require additional lower-level optimization. KSIVI also tends to converge faster than SIVI-SM, although this advantage becomes less evident as the target distribution gets more complicated. Figure 8 in Appendix C.1 illustrates the convergence of the maximum mean discrepancy (MMD) during training for KSIVI and SIVI-SM, which mostly aligns with the behavior of the KL divergence. We also conduct experiments with the IMQ kernel (Gorham & Mackey, 2017b) and do not notice a significant difference relative to the Gaussian RBF kernel, which is consistent with the findings of Korba et al. (2021).

### 6.2. Bayesian Logistic Regression

Our second experiment is a Bayesian logistic regression problem with the same experimental setting as Yin & Zhou (2018).
Given the explanatory variable $x_i \in \mathbb{R}^d$ and the observed binary response variable $y_i \in \{0, 1\}$, the log-likelihood function takes the form

$$\log p(y_i|\tilde{x}_i, \beta) = y_i \beta^T \tilde{x}_i - \log\left(1 + \exp(\beta^T \tilde{x}_i)\right),$$

where $\tilde{x}_i = (1, x_i^T)^T \in \mathbb{R}^{d+1}$ is the covariate and $\beta \in \mathbb{R}^{d+1}$ is the variable we want to infer. The prior distribution of $\beta$ is set to $p(\beta) = \mathcal{N}(0, \alpha^{-1}I)$, where the inverse variance is $\alpha = 0.01$.

*Figure 3. Marginal and pairwise variational approximations of $\beta_2, \beta_3, \beta_4$ on the Bayesian logistic regression task. The contours of the pairwise posterior approximation produced by SIVI-SM (in orange), SIVI (in green), and KSIVI (in blue) are graphed in comparison to the ground truth (in black). The sample size is 1000.*

*Figure 4. Comparison between the estimated pairwise correlation coefficients and the ground truth on the Bayesian logistic regression task. Each scatter represents the estimated correlation coefficient (y-axis) and the ground-truth correlation coefficient (x-axis) of some pair $(\beta_i, \beta_j)$. The lines in the same color as the scatters represent the regression lines. The sample size is 1000.*
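Score-based methods such as KSIVI need $s_p = \nabla \log p$ of the target. For this model the score has a simple closed form, sketched below; the helper name and the synthetic test data are hypothetical.

```python
import numpy as np

def log_posterior_score(beta, X, y, alpha=0.01):
    """Score grad_beta log p(beta | X, y) for Bayesian logistic regression
    with prior N(0, alpha^{-1} I):
        sum_i (y_i - sigmoid(x_i^T beta)) x_i  -  alpha * beta.
    X is assumed to already include the intercept column of ones."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))    # sigmoid(x_i^T beta)
    return X.T @ (y - p) - alpha * beta    # likelihood score + prior score

rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.standard_normal((50, 3))])  # intercept + features
y = (rng.random(50) < 0.5).astype(float)
print(log_posterior_score(np.zeros(4), X, y).shape)  # (4,)
```

Differentiating the log-likelihood above gives $(y_i - \sigma(\beta^T \tilde{x}_i))\,\tilde{x}_i$ per data point, and the Gaussian prior contributes $-\alpha\beta$; their sum is exactly what the code returns.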
We consider the WAVEFORM dataset¹ of $\{x_i, y_i\}_{i=1}^N$, where the dimension of the explanatory variable $x_i$ is $d = 21$. The different SIVI variants, with the same architecture of the semi-implicit variational family, are then applied to infer the posterior distribution $p(\beta|\{x_i, y_i\}_{i=1}^N)$. The learning rate for the variational parameters $\phi$ is set to 0.001 and the particle batch size to 100 during training. For all SIVI variants, the results are collected after 40,000 parameter updates. The ground truth, consisting of 1000 samples, is established by simulating parallel stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011) with 400,000 iterations, 1000 independent particles, and a small step size of 0.0001. Additionally, we assessed the performance of the parallel Metropolis-adjusted Langevin algorithm (MALA) (Rossky et al., 1978). The calculated KL divergence between samples from MALA and SGLD is 0.0289, suggesting their proximity.

¹https://archive.ics.uci.edu/ml/machine-learning-databases/waveform

Figure 3 demonstrates the marginal and pairwise posterior approximations for $\beta_2, \beta_3, \beta_4$ obtained by the aforementioned SIVI variants in comparison to the ground truth. We see that KSIVI agrees well with the ground truth and performs comparably to SIVI-SM. Note that SIVI-SM also requires tuning additional hyperparameters (e.g., the learning rate of $f_\psi(x)$ and the number of lower-level gradient steps), which may be challenging, as is commonly observed in minimax optimization (Goodfellow et al., 2014; Mescheder et al., 2017). In contrast, SIVI slightly underestimates the variance of $\beta_4$ in both the marginal and pairwise joint distributions, as shown in the left plot of Figure 3. Results for more components of $\beta$ can be found in Figure 9 in Appendix C.2.
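Pairwise posterior correlations like those compared in Figure 4 can be estimated directly from posterior samples. A minimal sketch (the helper name and the synthetic two-dimensional example are hypothetical):

```python
import numpy as np

def pairwise_correlations(samples):
    """Estimate all pairwise correlation coefficients
    rho_ij = cov(b_i, b_j) / sqrt(cov(b_i, b_i) * cov(b_j, b_j))
    from posterior samples of shape (n_samples, dim)."""
    C = np.cov(samples, rowvar=False)   # (dim, dim) sample covariance
    s = np.sqrt(np.diag(C))             # marginal standard deviations
    return C / np.outer(s, s)           # normalize to a correlation matrix

rng = np.random.default_rng(0)
beta = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=1000)
R = pairwise_correlations(beta)
print(R.shape)  # (2, 2)
```

With 1000 samples from a bivariate Gaussian with true correlation 0.8, the off-diagonal entry of `R` should land close to 0.8, mirroring how the scatter plots in Figure 4 compare estimated against ground-truth coefficients.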
Additionally, we investigate the pairwise correlation coefficients of $\beta$, defined as

$$\rho_{i,j} := \frac{\mathrm{cov}(\beta_i, \beta_j)}{\sqrt{\mathrm{cov}(\beta_i, \beta_i)\,\mathrm{cov}(\beta_j, \beta_j)}},$$

and compare the estimated pairwise correlation coefficients produced by different methods to the ground truth in Figure 4. We see that KSIVI provides better correlation coefficient approximations than SIVI and SIVI-SM, as evidenced by the scatters being more concentrated around the diagonal.

### 6.3. Conditioned Diffusion Process

Our next example is a higher-dimensional Bayesian inference problem arising from the following Langevin stochastic differential equation (SDE) with state $x_t \in \mathbb{R}$:

$$dx_t = 10\,x_t\,(1 - x_t^2)\,dt + dw_t, \quad 0 \le t \le 1, \tag{21}$$

*Figure 5. Variational approximations of different methods for the discretized conditioned diffusion process. The magenta trajectory represents the ground truth via parallel SGLD. The blue line corresponds to the estimated posterior mean of each method, and the shaded region denotes the 95% marginal posterior confidence interval at each time step. The sample size is 1000.*

where $x_0 = 0$ and $w_t$ is a one-dimensional standard Brownian motion. Equation (21) describes the motion of a particle with negligible mass trapped in an energy potential, with thermal fluctuations represented by the Brownian forcing (Detommaso et al., 2018; Cui et al., 2016; Yu et al., 2023). Using the Euler-Maruyama scheme with step size $\Delta t = 0.01$, we discretize the SDE into $x = (x_{\Delta t}, x_{2\Delta t}, \ldots, x_{100\Delta t})$, which defines the prior distribution $p_{\text{prior}}(x)$ of the 100-dimensional variable $x$. The perturbed 20-dimensional observation is $y = (y_{5\Delta t}, y_{10\Delta t},$
$\ldots, y_{100\Delta t})$, where $y_{5k\Delta t} \sim \mathcal{N}(x_{5k\Delta t}, \sigma^2)$ for $1 \le k \le 20$ and $\sigma = 0.1$, which gives the likelihood function $p(y|x)$. Given the perturbed observations $y$, our goal is to infer the posterior over the discretized path of the conditioned diffusion process, $p(x|y) \propto p_{\text{prior}}(x)\,p(y|x)$. We simulate a long parallel SGLD run of 100,000 iterations with 1000 independent particles and a small step size of 0.0001 to form the ground truth of 1000 samples. For all SIVI variants, we update the variational parameters $\phi$ for 100,000 iterations to ensure convergence (Appendix C.3). Table 2 shows the training time per 10,000 iterations of the SIVI variants on a 3.2 GHz CPU. For a fair time comparison, we use the score-based training of SIVI discussed in Yu et al. (2023), which computes $\nabla \log p(x)$ instead of $\log p(x)$ to derive the gradient estimator. We see that KSIVI achieves better computational efficiency than SIVI-SM and comparable training time to SIVI. Figure 5 shows the approximations of all methods for the discretized conditioned diffusion process. We see that the posterior mean estimates given by SIVI-SM are considerably bumpier than the ground truth, while SIVI fails to capture the uncertainty, with a severely underestimated variance. In contrast, the results from KSIVI align well with the SGLD ground truth.

### 6.4. Bayesian Neural Network

In the last experiment, we compare KSIVI against SGLD, SIVI, and SIVI-SM on sampling from the posterior of a Bayesian neural network (BNN) across various benchmark UCI datasets². Following Liu & Wang (2016) and Wang & Li (2022), we use a two-layer neural network with 50 hidden units and ReLU activations for the BNN model. The datasets are randomly partitioned into 90% for training and 10% for testing. Additionally, gradient clipping is applied during training, and the exponential moving average (EMA) trick (Huang et al., 2017; Izmailov et al., 2019) is employed at the inference stage for all SIVI variants.
The averaged test root mean squared error (RMSE) and negative log-likelihood (NLL) over 10 random runs are reported in Table 1. We see that KSIVI provides results on par with SIVI and SIVI-SM on all datasets. However, it does not exhibit a significant advantage within the context of Bayesian neural networks. This phenomenon is partially attributable to the evaluation metrics. Since test RMSE and test NLL are assessed from a predictive standpoint rather than directly in terms of the posterior distribution, they do not fully capture the approximation accuracy of the posterior distribution. Additionally, the performance of KSIVI may be contingent on the choice of kernels, particularly in higher-dimensional cases, as suggested by research on unbounded kernels (Hertrich et al., 2024). We leave a more thorough investigation for future work.

7. Ablation Study

Kernel Choice for Heavy-Tailed Distributions. Considering the pivotal role of kernel function selection in the efficacy of kernel-based methods (Gorham & Mackey, 2017a), we conducted an ablation study on a two-dimensional distribution constructed as the product of two Student's t-distributions, following the setting described by Li et al. (2023). For KSIVI, we apply a practical regularization fix to thin the kernel Stein discrepancy across all the kernel

²https://archive.ics.uci.edu/ml/datasets.php

Table 1. Test RMSE and test NLL of Bayesian neural networks on several UCI datasets. The results are averaged from 10 independent runs with the standard deviation in the subscripts. For each data set, the best result is marked in black bold font and the second best result is marked in brown bold font.
Test RMSE (↓)
Dataset    SIVI        SIVI-SM     SGLD        KSIVI
BOSTON     2.621±0.02  2.785±0.03  2.857±0.11  2.555±0.02
CONCRETE   6.932±0.02  5.973±0.04  6.861±0.19  5.750±0.03
POWER      3.861±0.01  4.009±0.00  3.916±0.01  3.868±0.01
WINE       0.597±0.00  0.605±0.00  0.597±0.00  0.595±0.00
YACHT      1.505±0.07  0.884±0.01  2.152±0.09  1.237±0.05
PROTEIN    4.669±0.00  5.087±0.00  4.777±0.00  5.027±0.01

Test NLL (↓)
Dataset    SIVI        SIVI-SM     SGLD        KSIVI
BOSTON     2.481±0.00  2.542±0.01  3.094±0.01  2.506±0.01
CONCRETE   3.337±0.00  3.229±0.01  4.036±0.01  3.309±0.01
POWER      2.791±0.00  2.822±0.00  2.944±0.00  2.797±0.00
WINE       0.904±0.00  0.916±0.00  0.904±0.00  0.901±0.00
YACHT      1.721±0.03  1.432±0.01  2.873±0.03  1.752±0.03
PROTEIN    2.967±0.00  3.047±0.00  2.984±0.00  3.034±0.00

Table 2. Training time (per 10,000 iterations, in seconds) for the conditioned diffusion process inference task. For all the methods, the batch size for Monte Carlo estimation is set to N = 128.

Method     Dimensionality: 50    100     200
SIVI       58.00    88.12    113.46
SIVI-SM    70.42    128.13   149.61
KSIVI      56.67    90.48    107.84

Table 3. Estimated Wasserstein distances for the product of two Student's t-distributions. The results were averaged from 10 independent runs.

Method                   Edge width: 5   8        10
MIED                     0.0366±0.01  0.0778±0.03  0.1396±0.04
KSDD                     0.1974±0.04  0.3187±0.10  0.3910±0.11
KSIVI (Gaussian kernel)  0.1048±0.03  0.1976±0.08  0.2546±0.09
KSIVI (Riesz kernel)     0.0451±0.01  0.0816±0.04  0.1136±0.05

functions (Bénard et al., 2023). We use the implementation from Li et al. (2023) to reproduce the results for two particle-based variational inference methods, MIED (Li et al., 2023) and KSDD (Korba et al., 2021). In Table 3, we report the estimated Wasserstein distances from the target distributions to the variational posteriors (using the metric implementation provided in Li et al. (2023) with 1000 samples). We found that, compared to the Gaussian kernel, the Riesz kernel leads to a more pronounced improvement for this task.
8. Conclusion

This paper proposed a novel framework for semi-implicit variational inference, KSIVI, which takes the KSD as the training objective instead of the ELBO or the Fisher divergence. As a variant of SIVI-SM, KSIVI eliminates the need for lower-level optimization via the kernel trick. By taking advantage of the hierarchical structure of the semi-implicit variational family, the KSD objective is readily computable. We provided efficient Monte Carlo gradient estimators for stochastic optimization. We also derived an upper bound for the variance of the Monte Carlo gradient estimators, which allows us to establish a convergence guarantee to a stationary point using techniques from stochastic optimization. Extensive numerical experiments validate the efficiency and effectiveness of KSIVI.

9. Limitations

Our current work has several limitations. Firstly, since KSIVI uses KSD to measure dissimilarity between distributions, it may encounter drawbacks such as convergence to stationary points that differ from the target distribution, especially in high-dimensional, non-convex tasks. Secondly, our use of the Gaussian kernel in KSIVI may be less effective in high dimensions due to its fast-decaying tails. Exploring unbounded kernel functions could be a promising direction for future research. Additionally, our theoretical analysis depends on strong assumptions, such as the Lipschitz smoothness of the neural network, which may be too strict for deep neural networks in high dimensions. Investigating simpler kernel machines or random feature regression models as alternatives could also be valuable.

Impact Statement

This paper presents work whose goal is to advance the field of machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgements

This work was supported by National Natural Science Foundation of China (grant no. 12201014 and grant no. 12292983).
The research of Cheng Zhang was supported in part by National Engineering Laboratory for Big Data Analysis and Applications, the Key Laboratory of Mathematics and Its Applications (LMAM) and the Key Laboratory of Mathematical Economics and Quantitative Finance (LMEQF) of Peking University. The authors appreciate the anonymous ICML reviewers for their constructive feedback.

References

Altekrüger, F., Hertrich, J., and Steidl, G. Neural Wasserstein gradient flows for maximum mean discrepancies with Riesz kernels. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:256358344.

Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, pp. 224-232. PMLR, 2017.

Bishop, C. M. and Tipping, M. E. Variational relevance vector machines. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 46-53, 2000.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112:859-877, 2016.

Bénard, C., Staber, B., and Veiga, S. D. Kernel Stein discrepancy thinning: a theoretical perspective of pathologies and a practical fix with regularization, 2023.

Cheng, Z., Zhang, S., Yu, L., and Zhang, C. Particle-based variational inference with generalized Wasserstein gradient flow. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2606-2615, New York, New York, USA, 20-22 Jun 2016. PMLR.

Cui, T., Law, K. J., and Marzouk, Y. M. Dimension-independent likelihood-informed MCMC.
Journal of Computational Physics, 304:109-137, 2016.

Detommaso, G., Cui, T., Marzouk, Y., Spantini, A., and Scheichl, R. A Stein variational Newton method. Advances in Neural Information Processing Systems, 31, 2018.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

Domke, J. Provable gradient variance guarantees for black-box variational inference. Advances in Neural Information Processing Systems, 32, 2019.

Domke, J. Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning, pp. 2587-2596. PMLR, 2020.

Domke, J. and Sheldon, D. R. Importance weighting and variational inference. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Domke, J., Garrigos, G., and Gower, R. Provable convergence guarantees for black-box variational inference. arXiv preprint arXiv:2306.03638, 2023.

Dong, H., Wang, X., Yong, L., and Zhang, T. Particle-based variational inference with preconditioned functional gradient flow. In The Eleventh International Conference on Learning Representations, 2022.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

Giordano, R. J., Broderick, T., and Jordan, M. I. Linear response methods for accurate covariance estimates from mean field variational Bayes. In Advances in Neural Information Processing Systems, 2015.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NIPS), 2014.

Gorham, J. and Mackey, L. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems, pp. 226-234, 2015.
Gorham, J. and Mackey, L. Measuring sample quality with kernels. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1292-1301. PMLR, 06-11 Aug 2017a.

Gorham, J. and Mackey, L. Measuring sample quality with kernels. In International Conference on Machine Learning, pp. 1292-1301. PMLR, 2017b.

Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., and Zemel, R. Learning the Stein discrepancy for training and evaluating energy-based models without sampling. In International Conference on Machine Learning, pp. 3732-3747. PMLR, 2020.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13:723-773, 2012.

Hertrich, J., Wald, C., Altekrüger, F., and Hagemann, P. Generative sliced MMD flows with Riesz kernels. In The Twelfth International Conference on Learning Representations, 2024.

Hinton, G. E. and van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Annual Conference on Computational Learning Theory, 1993. URL https://api.semanticscholar.org/CorpusID:9346534.

Hu, T., Chen, Z., Sun, H., Bai, J., Ye, M., and Cheng, G. Stein neural sampler. arXiv preprint arXiv:1810.03545, 2018.

Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. Snapshot ensembles: Train 1, get M for free, 2017.

Huszár, F. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization, 2019.

Jaakkola, T. S. and Jordan, M. I. Improving the mean field approximation via the use of mixture distributions. In Learning in Graphical Models, pp. 173-173, 1998.

Jordan, M. I., Ghahramani, Z., Jaakkola, T., and Saul, L. K.
An introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1999.

Kim, K., Ma, Y., and Gardner, J. R. Linear convergence of black-box variational inference: Should we stick the landing? arXiv preprint arXiv:2307.14642, 2023a.

Kim, K., Oh, J., Wu, K., Ma, Y., and Gardner, J. R. On the convergence of black-box variational inference. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743-4751, 2016.

Korba, A., Aubin-Frankowski, P.-C., Majewski, S., and Ablin, P. Kernel Stein discrepancy descent. In International Conference on Machine Learning, pp. 5719-5730. PMLR, 2021.

Li, L., Liu, Q., Korba, A., Yurochkin, M., and Solomon, J. Sampling with mollified interaction energy descent. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zWy7dqOcel.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29, 2016.

Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pp. 276-284. PMLR, 2016.

Mescheder, L. M., Nowozin, S., and Geiger, A. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017.

Moens, V., Ren, H., Maraval, A., Tutunov, R., Wang, J., and Ammar, H. Efficient semi-implicit variational inference. arXiv preprint arXiv:2101.06070, 2021.

Molchanov, D., Kharitonov, V., Sobolev, A., and Vetrov, D.
Doubly semi-implicit variational inference. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2593-2602. PMLR, 2019.

Nott, D. J., Tan, S. L., Villani, M., and Kohn, R. Regression density estimation with variational methods and stochastic approximation. Journal of Computational and Graphical Statistics, 21(3):797-820, 2012.

Paisley, J. W., Blei, D. M., and Jordan, M. I. Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Papamakarios, G., Nalisnick, E., Rezende, D., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Peterson, C. and Hartman, E. Explorations of the mean field theory learning algorithm. Neural Networks, 2(6):475-494, 1989. ISSN 0893-6080. doi: https://doi.org/10.1016/0893-6080(89)90045-2. URL https://www.sciencedirect.com/science/article/pii/0893608089900452.

Ranganath, R., Gerrish, S., and Blei, D. Black box variational inference. In Artificial Intelligence and Statistics, pp. 814-822. PMLR, 2014.

Ranganath, R., Tran, D., Altosaar, J., and Blei, D. Operator variational inference. Advances in Neural Information Processing Systems, 29, 2016.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1530-1538, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D.
Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

Rossky, P. J., Doll, J. D., and Friedman, H. L. Brownian dynamics as smart Monte Carlo simulation. Journal of Chemical Physics, 69:4628-4633, 1978. URL https://api.semanticscholar.org/CorpusID:14967940.

Saul, L. K. and Jordan, M. I. Exploiting tractable substructures in intractable networks. In Advances in Neural Information Processing Systems, 1996.

Shi, J., Sun, S., and Zhu, J. Kernel implicit variational inference. In International Conference on Learning Representations, 2018a.

Shi, J., Sun, S., and Zhu, J. A spectral approach to gradient estimation for implicit distributions. In International Conference on Machine Learning, 2018b.

Sobolev, A. and Vetrov, D. P. Importance weighted hierarchical variational inference. Advances in Neural Information Processing Systems, 32, 2019.

Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, 2019.

Sugiyama, M., Suzuki, T., and Kanamori, T. Density ratio estimation in machine learning. Cambridge University Press, 2012.

Szabó, Z. Information theoretical estimators toolbox. Journal of Machine Learning Research, 15:283-287, 2014.

Titsias, M. and Lázaro-Gredilla, M. Doubly stochastic variational Bayes for non-conjugate inference. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1971-1979, Beijing, China, 22-24 Jun 2014. PMLR.

Titsias, M. K. and Ruiz, F. Unbiased implicit variational inference. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 167-176. PMLR, 2019.

Tran, D., Blei, D. M., and Airoldi, E. M. Copula variational inference. In Advances in Neural Information Processing Systems, 2015.
Tran, D., Ranganath, R., and Blei, D. M. Hierarchical implicit models and likelihood-free variational inference. In Advances in Neural Information Processing Systems, 2017.

Uppal, A., Stensbo-Smidt, K., Boomsma, W. K., and Frellsen, J. Implicit variational inference for high-dimensional posteriors. arXiv preprint arXiv:2310.06643, 2023.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011. doi: 10.1162/NECO_a_00142.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Wang, Y. and Li, W. Accelerated information gradient flow, 2022.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, ICML '11, pp. 681-688, Madison, WI, USA, 2011. Omnipress. ISBN 9781450306195.

Yin, M. and Zhou, M. Semi-implicit variational inference. In International Conference on Machine Learning, pp. 5660-5669. PMLR, 2018.

Yu, L. and Zhang, C. Semi-implicit variational inference via score matching. arXiv preprint arXiv:2308.10014, 2023.

Yu, L., Xie, T., Zhu, Y., Yang, T., Zhang, X., and Zhang, C. Hierarchical semi-implicit variational inference with application to diffusion model acceleration. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=ghIBaprxsV.

Zhang, C., Shahbaba, B., and Zhao, H. Variational Hamiltonian Monte Carlo via score matching. Bayesian Analysis, 13(2):485-506, 2018.

A.1. Proof of Theorem 3.1

Theorem A.1. Consider the min-max problem

min_ϕ max_{f ∈ H} E_{qϕ} [2 f(x)^T (s_p(x) − s_{qϕ}(x))] − ‖f‖²_H.  (22)

Given a variational distribution qϕ, the optimal f is given by

f*(x) = E_{y ∼ qϕ(y)} [k(x, y) (s_p(y) − s_{qϕ}(y))].  (23)

Thus the upper-level problem for ϕ is

min_ϕ KSD(qϕ ‖ p)² = ‖S_{qϕ,k} ∇ log(p/qϕ)‖²_H.

Proof.
For any f ∈ H, we have f(x) = ⟨f, k(·, x)⟩_H by the reproducing property. Since

E_{qϕ} f(x)^T [s_p(x) − s_{qϕ}(x)] = E_{qϕ} ⟨f, k(·, x)⟩_H^T [s_p(x) − s_{qϕ}(x)] = ⟨f, E_{qϕ} k(·, x)[s_p(x) − s_{qϕ}(x)]⟩_H = ⟨f, S_{qϕ,k} ∇ log(p/qϕ)⟩_H,

the lower-level problem is

max_f 2 ⟨f, S_{qϕ,k} ∇ log(p/qϕ)⟩_H − ‖f‖²_H.  (26)

Therefore, the optimal f* = S_{qϕ,k} ∇ log(p/qϕ) and the upper-level problem is min_ϕ ‖S_{qϕ,k} ∇ log(p/qϕ)‖²_H.

A.2. Proofs in Section 4

Proposition A.2. Under Assumptions 4.2 and 4.4, we have E_{qϕ(x)} ‖s_p(x)‖⁴ ≲ L⁴ (s⁴ + ‖x*‖⁴) := C², where x* is any zero point of s_p.

Proof. By the AM-GM inequality, and since x* is a zero point of s_p,

E_{qϕ(x)} ‖s_p(x)‖⁴ = E_{qϕ(x)} ‖s_p(x) − s_p(x*)‖⁴ ≲ L⁴ (E_{qϕ(x)} ‖x‖⁴ + ‖x*‖⁴) ≤ L⁴ (s⁴ + ‖x*‖⁴).  (28)

Theorem A.3. The objective L(ϕ) is Lϕ-smooth, where Lϕ ≲ B²G² d_z log d [(1 ∨ L)³ + Ld + M² + E‖s_p(x)‖²].

Proof. Let u = (z, ξ) and x = T(u; ϕ) = µ(z; ϕ) + σ(z; ϕ) ⊙ ξ. Then the objective has the following representation:

L(ϕ) = E_{z,z'} E_{qϕ(x|z), qϕ(x'|z')} [k(x, x') (s_p(x) − s_{qϕ(·|z)}(x))^T (s_p(x') − s_{qϕ(·|z')}(x'))] = E_{u,u'} [k(x, x') (s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ')],  (29)

which implies ∇²_ϕ L = E[① + ② + ③], where

① = ∇²_ϕ[k(x, x')] (s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ'),
② = ∇_ϕ k(x, x') ∇_ϕ[(s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ')]^T + ∇_ϕ[(s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ')] ∇_ϕ k(x, x')^T,
③ = k(x, x') ∇²_ϕ[(s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ')].  (30)

For ①, we have

∇²_ϕ[k(x, x')] = ∇²_ϕ x ∘ ∇₁k + (∇_ϕ x)^T (∇₁₁k ∇_ϕ x + ∇₂₁k ∇_ϕ x') + ∇²_ϕ x' ∘ ∇₂k + (∇_ϕ x')^T (∇₂₂k ∇_ϕ x' + ∇₁₂k ∇_ϕ x).  (31)

Here ∇²_ϕ x ∘ ∇₁k is the matrix with entries [∇²_ϕ x ∘ ∇₁k]_{ij} = Σ_{l=1}^d ∂²_{ϕ_i ϕ_j} x_l ∂_{x_l} k. Note that by Assumption 4.3,

‖∇_ϕ x‖ ≤ ‖∇_ϕ µ‖ + ‖∇_ϕ σ‖ ‖ξ‖_∞ ≤ G (1 + ‖z‖)(1 + ‖ξ‖_∞),  ‖∇²_ϕ x‖ ≤ ‖∇²_ϕ µ‖ + ‖∇²_ϕ σ‖ ‖ξ‖_∞ ≤ G (1 + ‖z‖)(1 + ‖ξ‖_∞).

Then by Assumption 4.1, ‖∇₁k‖ ≤ B², ‖∇₁₁k‖ ≤ B², and

E‖①‖ ≤ E[‖∇²_ϕ k‖ ‖s_p(x) + ξ/σ‖ ‖s_p(x') + ξ'/σ'‖] ≤ √(E‖∇²_ϕ k‖²) √(E[‖s_p(x) + ξ/σ‖² ‖s_p(x') + ξ'/σ'‖²]) ≲ G²B² [E(1 + ‖ξ‖_∞)²(1 + ‖z‖)² + √(E(1 + ‖ξ‖_∞)⁴(1 + ‖z‖)⁴)] E‖s_p(x) + ξ/σ‖² ≲ G²B² d_z log d · E‖s_p(x) + ξ/σ‖² ≲ G²B² d_z log d [E‖s_p(x)‖² + Ld].  (32)

In the second-to-last inequality, we utilize the well-known fact E‖ξ‖⁴_∞ ≲ log² d and Assumption 4.4.
For ②, we have

‖∇_ϕ k(x, x')‖ = ‖(∇_ϕ x)^T ∇₁k + (∇_ϕ x')^T ∇₂k‖ ≤ B²G [(1 + ‖z‖)(1 + ‖ξ‖_∞) + (1 + ‖z'‖)(1 + ‖ξ'‖_∞)],  (33)

and

∇_ϕ[(s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ')] = [∇² log p(x) ∇_ϕ x − diag{ξ/σ²} ∇_ϕ σ]^T (s_p(x') + ξ'/σ') + [∇² log p(x') ∇_ϕ x' − diag{ξ'/σ'²} ∇_ϕ σ']^T (s_p(x) + ξ/σ).  (34)

Hence,

E‖②‖ ≲ B²G √(E[(1 + ‖z‖)²(1 + ‖ξ‖_∞)²]) √(E[‖[∇² log p(x) ∇_ϕ x − diag{ξ/σ²} ∇_ϕ σ]^T (s_p(x') + ξ'/σ')‖²]) ≲ √(d_z log d) · LG √(E[(1 + ‖z‖)²(1 + ‖ξ‖_∞)² ‖s_p(x') + ξ'/σ'‖²]) ≲ LG²B² d_z log d √(E‖s_p(x)‖² + Ld).  (35)

For ③, we have

∇²_ϕ[(s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ')] = ∇²_ϕ[s_p(x)^T s_p(x')] + ∇²_ϕ[(ξ/σ)^T s_p(x')] + ∇²_ϕ[(ξ'/σ')^T s_p(x)] + ∇²_ϕ[(ξ/σ)^T (ξ'/σ')],  (36)

where

∇²_ϕ[s_p(x)^T s_p(x')] = ∇²_ϕ x ∘ [∇² log p(x) s_p(x')] + (∇_ϕ x)^T [∇³ log p(x) s_p(x')] ∇_ϕ x + ∇²_ϕ x' ∘ [∇² log p(x') s_p(x)] + (∇_ϕ x')^T [∇³ log p(x') s_p(x)] ∇_ϕ x' + (∇_ϕ x)^T ∇² log p(x) ∇² log p(x') ∇_ϕ x' + (∇_ϕ x')^T ∇² log p(x') ∇² log p(x) ∇_ϕ x.  (37)

Thus,

E‖k(x, x') ∇²_ϕ[s_p(x)^T s_p(x')]‖ ≤ B² E[(LG(1 + ‖z‖)(1 + ‖ξ‖_∞) + MG²(1 + ‖z‖)²(1 + ‖ξ‖_∞)²) ‖s_p(x')‖] + B²L²G² E[(1 + ‖z‖)(1 + ‖ξ‖_∞)(1 + ‖z'‖)(1 + ‖ξ'‖_∞)] ≲ B²G (L√(d_z log d) + GM d_z log d) √(E‖s_p(x)‖²) + B²L²G² d_z log d.  (38)

Then, since

∇²_ϕ[(ξ/σ)^T s_p(x')] = ∇²_ϕ[s_p(x')] ∘ (ξ/σ) + ∇²_ϕ[ξ/σ] ∘ s_p(x') + (∇_ϕ s_p(x'))^T ∇_ϕ[ξ/σ] + (∇_ϕ[ξ/σ])^T ∇_ϕ s_p(x'),  (39)

it holds that

‖∇²_ϕ[(ξ/σ)^T s_p(x')]‖ ≤ ‖∇²_ϕ s_p(x')‖ ‖ξ/σ‖ + ‖∇²_ϕ[ξ/σ]‖ ‖s_p(x')‖ + 2 ‖∇_ϕ[ξ/σ]‖ ‖∇_ϕ s_p(x')‖ ≲ (M ‖∇_ϕ x'‖² + L ‖∇²_ϕ x'‖) L^{1/2} ‖ξ‖ + (1 ∨ L)^{3/2} G² (1 + ‖z‖)² ‖ξ‖_∞ ‖s_p(x')‖ + L ‖∇_ϕ x'‖ · LG (1 + ‖z‖) ‖ξ‖_∞,  (40)

so that

E‖k(x, x') ∇²_ϕ[(ξ/σ)^T s_p(x')]‖ ≲ B²G (L√(d_z log d) + GM d_z log d) L^{1/2} d^{1/2} + B²G² (1 ∨ L)^{3/2} d_z log^{1/2} d √(E‖s_p(x)‖²) + B²G²L² d_z log d.  (41)

Lastly, expanding ∇²_ϕ[(ξ/σ)^T (ξ'/σ')] similarly, we have

E‖k(x, x') ∇²_ϕ[(ξ/σ)^T (ξ'/σ')]‖ ≲ B² E[(1 ∨ L)^{3/2} G² (1 + ‖z‖)² ‖ξ‖_∞ L^{1/2} ‖ξ'‖ + ‖ξ‖_∞ LG(1 + ‖z‖) ‖ξ'‖_∞ LG(1 + ‖z'‖)] ≲ B² [(1 ∨ L)^{3/2} L^{1/2} G² d_z d^{1/2} log^{1/2} d + L²G² d_z log d] ≲ B² (1 ∨ L)^{3/2} L^{1/2} G² d_z d^{1/2} log^{1/2} d.  (43)

Therefore, we get

E‖③‖ ≲ B²G² d_z log d [(1 ∨ L)^{3/2} + M] √(E‖s_p(x)‖² + Ld).  (44)

Combining ①, ②, and ③, we can conclude that

‖∇²_ϕ L‖ ≲ B²G² d_z log d [(1 ∨ L)³ + Ld + M² + E‖s_p(x)‖²].  (45)

Theorem A.4. Both gradient estimators ĝ_vanilla and ĝ_u-stat have bounded variance:

Var(ĝ) ≲ B⁴G² d_z log d [L³d + L²d² + E‖s_p(x)‖⁴] / N.  (46)

Proof. It is sufficient to look at the upper bound of the variance with two samples.
Note that

∇_ϕ[k(x, x')(s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ')] = ∇_ϕ[k(x, x')] (s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ') (①) + k(x, x') ∇_ϕ[(s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ')] (②).  (47)

For ①, we again utilize the fact that

‖∇_ϕ k(x, x')‖ ≤ B²G [(1 + ‖z‖)(1 + ‖ξ‖_∞) + (1 + ‖z'‖)(1 + ‖ξ'‖_∞)],  (48)

and thus

E‖①‖² ≤ E[‖∇_ϕ k‖² ‖s_p(x) + ξ/σ‖² ‖s_p(x') + ξ'/σ'‖²] ≲ B⁴G² d_z log d [E‖s_p(x) + ξ/σ‖⁴]^{1/2} E‖s_p(x) + ξ/σ‖² ≲ B⁴G² d_z log d (√(E‖s_p(x)‖⁴) + Ld) E‖s_p(x) + ξ/σ‖².  (49)

For ②, we have

∇_ϕ[(s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ')] = [∇² log p(x) ∇_ϕ x − diag{ξ/σ²} ∇_ϕ σ]^T (s_p(x') + ξ'/σ') + [∇² log p(x') ∇_ϕ x' − diag{ξ'/σ'²} ∇_ϕ σ']^T (s_p(x) + ξ/σ),  (50)

and

E‖②‖² ≲ B⁴ E[‖[∇² log p(x) ∇_ϕ x − diag{ξ/σ²} ∇_ϕ σ]^T (s_p(x') + ξ'/σ')‖²] ≲ B⁴L²G² E[(1 + ‖z‖)²(1 + ‖ξ‖_∞)² ‖s_p(x') + ξ'/σ'‖²] ≲ B⁴L²G² d_z log d · E‖s_p(x) + ξ/σ‖².  (51)

Therefore, combining ① and ②, we conclude that

E‖∇_ϕ[k(x, x')(s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ')]‖² ≲ B⁴G² d_z log d [Ld + L² + √(E‖s_p(x)‖⁴)] E‖s_p(x) + ξ/σ‖²,  (52)

which implies

N Var(ĝ) ≲ Var(∇_ϕ[k(x, x')(s_p(x) + ξ/σ)^T (s_p(x') + ξ'/σ')]) ≲ B⁴G² d_z log d [L³d + L²d² + E‖s_p(x)‖⁴].  (53)

Theorem A.5. Under Assumptions 4.1-4.4, the iteration sequence generated by SGD, ϕ_{t+1} = ϕ_t − η ĝ_t, with a proper learning rate η includes an ε-stationary point ϕ̂ such that E[‖∇_ϕ L(ϕ̂)‖] ≤ ε, if T ≳ Lϕ L₀ ε⁻² + Lϕ L₀ Σ₀ N⁻¹ ε⁻⁴. Here L₀ := L(ϕ₀) − inf_ϕ L.

Proof. Since L(·) is Lϕ-smooth and ĝ_t is unbiased with Var(ĝ_t) ≤ Σ₀/N, we have

E_t[L(ϕ_{t+1})] ≤ L(ϕ_t) + ⟨∇_ϕ L(ϕ_t), E_t[ϕ_{t+1} − ϕ_t]⟩ + (Lϕ/2) E_t‖ϕ_{t+1} − ϕ_t‖² ≤ L(ϕ_t) − (η − η²Lϕ/2) ‖∇_ϕ L(ϕ_t)‖² + η²LϕΣ₀/(2N).

Let η ≤ 1/Lϕ. Taking the full expectation and summing from 0 to T − 1 gives

(1/T) Σ_{t=0}^{T−1} E[‖∇_ϕ L(ϕ_t)‖²] ≲ (L(ϕ₀) − E[L(ϕ_T)])/(ηT) + ηLϕΣ₀/N ≲ LϕL₀/T + √(LϕL₀Σ₀/(NT)).

In the last inequality, we plug in the optimal η ≤ min{1/Lϕ, √(NL₀/(LϕΣ₀T))}. We conclude the proof by substituting the RHS with ε² and solving for T.

A.3. Justification for Assumption 4.3

In all our experiments, the variance σ(·; ϕ) is a constant vector σ ∈ R^d, rather than a neural network. Therefore, the smoothness assumption for σ(·; ϕ) is valid with G = 1. As for µ(·; ϕ), we compute ∇_ϕ µ(z; ϕ) and ∇²_ϕ µ(z; ϕ) for randomly sampled z and ϕ in the training process.
Note that, for simplicity, we compute the Frobenius norm and estimate the Lipschitz constant of the Jacobian ∇_ϕ µ(z; ϕ), which are larger than the corresponding operator norms. We plot the results for the BLR experiment in Figures 6 and 7, showing that G remains within a certain range during the early, middle, and late stages of training. Therefore, Assumption 4.3 is reasonable.

Figure 6. Smoothness of the neural network µ(·; ϕ) along the KSIVI (ĝ_vanilla) trajectory.

Figure 7. Smoothness of the neural network µ(·; ϕ) along the KSIVI (ĝ_u-stat) trajectory.

Figure 8. Convergence of the MMD divergence of different methods on the 2-D toy examples. The MMD objectives are estimated using 1000 samples. The results are averaged over 5 independent computations with the standard deviation as the shaded region.

B. KSIVI Algorithm

Algorithm 2 KSIVI with diagonal Gaussian conditional layer and U-statistic gradient estimator
Input: target score s_p(x), number of iterations T, number of samples N for the stochastic gradient.
Output: the optimal variational parameters ϕ*.
for t = 0, …, T − 1 do
  Sample {z_1, …, z_N} from the mixing distribution q(z);
  Sample {ξ_1, …, ξ_N} from N(0, I);
  Compute x_i = µ(z_i; ϕ) + σ(z_i; ϕ) ⊙ ξ_i and f_i = s_p(x_i) + ξ_i/σ(z_i; ϕ);
  Compute the gradient estimate ĝ_u-stat(ϕ) through (16);
  Set ϕ ← optimizer(ϕ, ĝ_u-stat(ϕ));
end for
ϕ* ← ϕ.

C. Experiment Setting and Additional Results

C.1.
Toy Experiments

Setting details. In the three 2-D toy examples, we set the conditional layer to be N(µ(z; ϕ), diag{ϕ_σ²}). The µ(z; ϕ) in SIVI-SM and KSIVI all have the same structure of multi-layer perceptrons (MLPs) with layer widths [3, 50, 50, 2] and ReLU activation functions. The initial value of ϕ_σ is set to 1, except for the banana distribution, on which we use 1/2. We set the learning rate of the variational parameters ϕ to 0.001 and the learning rate of ψ in SIVI-SM to 0.002. In the lower-level optimization of SIVI-SM, ψ is updated after each update of ϕ.

Additional results. Figure 8 depicts the descent of the maximum mean discrepancy (MMD) during the training process of SIVI-SM and KSIVI.

Table 4. Three 2-D target distributions implemented in the 2-D toy experiments.

BANANA: x = (v₁, v₁² + v₂ + 1)^T, v ∼ N(0, Σ), Σ = [[1, 0.9], [0.9, 1]]
MULTIMODAL: x ∼ (1/2) N(x | µ₁, I) + (1/2) N(x | µ₂, I), µ₁ = [−2, 0]^T, µ₂ = [2, 0]^T
X-SHAPED: x ∼ (1/2) N(x | 0, Σ₁) + (1/2) N(x | 0, Σ₂), Σ₁ = [[2, 1.8], [1.8, 2]], Σ₂ = [[2, −1.8], [−1.8, 2]]

Table 5. Estimated sliced Wasserstein distances (Sliced-WD) of KSIVI and SIVI-SM on Bayesian logistic regression. The Sliced-WD objectives are estimated using 1000 samples.

Method                     Step size: 0.00001  0.0001   0.001   0.005   0.008
SIVI-SM (iteration = 10k)  2.5415   24.2952   0.1203   0.6107   0.9906
SIVI-SM (iteration = 20k)  5.6184   28.6335   0.0938   0.3283   0.9145
KSIVI (iteration = 10k)    1.5196   0.6863    0.1509   0.1064   0.6059
KSIVI (iteration = 20k)    1.2600   0.1120    0.0965   0.2628   0.3186

C.2. Bayesian Logistic Regression

Setting details. For the experiment on Bayesian logistic regression, µ(z; ϕ) for all the SIVI variants is parameterized as an MLP with layer widths [10, 100, 100, 22] and ReLU activation functions. For SIVI-SM, fψ is a three-layer MLP with 256 hidden units and ReLU activation functions. The initial value of ϕ_σ² is e⁻⁵ for KSIVI and 1 for SIVI and SIVI-SM. These initial values are determined through a grid search on the set {e⁰, e⁻², e⁻⁵}.
For SIVI, the learning rate of ϕµ is set to 0.01 and the learning rate of ϕσ is set to 0.001, following the choices made in Yin & Zhou (2018). The learning rate for the variational parameters ϕ, comprising ϕµ and ϕσ, is set to 0.001 for both KSIVI and SIVI-SM. The learning rate of ψ for SIVI-SM is set to 0.002, and we update ψ after each update of ϕ. For SIVI, the auxiliary sample size is set to K = 100. For all methods, we train the variational distributions for 20,000 iterations with a batch size of 100.

Additional results. In line with Figure 5 in Domke & Sheldon (2018), we provide the results of KSIVI using varying step sizes of 0.00001, 0.0001, 0.001, 0.005, and 0.008. The results in Table 5 indicate that KSIVI demonstrates more consistent convergence behavior across different step sizes. Figure 9 shows the marginal and pairwise joint densities of the well-trained posterior qϕ(β₁, β₂, …, β₆) for SIVI, SIVI-SM, KSIVI with the vanilla gradient estimator, and KSIVI with the U-statistic gradient estimator. Figure 10 presents the results of the estimated pairwise correlation coefficients by KSIVI using the U-statistic gradient estimator.

C.3. Conditioned Diffusion Process

Setting details. For the experiment on the conditioned diffusion process, µ(z; ϕ) for all the SIVI variants is parameterized as an MLP with layer widths [100, 128, 128, 100] and ReLU activation functions. For SIVI-SM, fψ is an MLP with layer widths [100, 512, 512, 100] and ReLU activation functions. The initial value of ϕ_σ² is e⁻² for all the SIVI variants. We set the learning rate of the variational parameters ϕ to 0.0002 for KSIVI and 0.0001 for SIVI and SIVI-SM. The learning rate of ψ for SIVI-SM is set to 0.0001, and the number of lower-level gradient steps for ψ is 1. For all methods, we conduct training on the variational distributions using 100,000 iterations with a batch size of 128, and the Adam optimizer is employed. For SIVI, we use the score-based training as done in Yu et al.
(2023) for a fair time comparison.

Additional results. Table 6 shows the estimated sliced Wasserstein distances from the target distributions to the different variational approximations with increasing dimensions d = 50, 100, and 200 on the conditioned diffusion problem. Table 7 shows the training time per 10,000 iterations of KSIVI with the vanilla estimator and KSIVI with the u-stat estimator. We see that, overall, there exists a tradeoff between computation cost and estimation variance between the vanilla and u-stat estimators. As shown in equations (15) and (16), the vanilla estimator uses two batches of N samples and requires N² computation complexity for backpropagation, while the u-stat estimator uses one batch of N samples and requires N²/2 computation. For convergence, the variance of the vanilla estimator (15) is slightly smaller than the variance of the u-stat estimator (16). Figure 11 depicts the well-trained variational posteriors of SIVI, SIVI-SM, and KSIVI with the U-statistic gradient estimator on the conditioned diffusion process problem. Figure 12 shows the training loss of SIVI, SIVI-SM, and KSIVI. We observe that the training losses of all the methods converge after 100,000 iterations.

C.4. Bayesian Neural Network

Setting details. For the experiments on the Bayesian neural network, we follow the setting in Liu & Wang (2016). For the hyper-parameters of the inverse covariance of the BNN, we select them as the estimated mean of 20 particles generated by SVGD (Liu & Wang, 2016). µ(z; ϕ) for all the SIVI variants is parameterized as an MLP with layer widths [3, 10, 10, d] and ReLU activation functions, where d is the number of parameters in the Bayesian neural network. For SIVI-SM, the fψ takes

Table 6. Estimated sliced Wasserstein distances (Sliced-WD) on the conditioned diffusion problem. The Sliced-WD objectives are estimated using 1000 samples.
Methods \Dimensionality 50 100 200 SIVI 0.1266 0.0981 0.0756 SIVI-SM 0.0917 0.0640 0.0628 UIVI 0.0314 0.0426 0.0582 KSIVI (u-stat) 0.0134 0.0182 0.0551 KSIVI (vanilla) 0.0140 0.0115 0.0493 Table 7. Training time (per 10,000 iterations) for the conditioned diffusion process inference task. For all the methods, the batch size for Monte Carlo estimation is set to N = 128. Methods \Dimensionality 50 100 200 KSIVI (u-stat) 39.09 58.13 64.17 KSIVI (vanilla) 56.67 90.48 107.84 three-layer MLPs with 16 hidden units and Re LU activation functions. For KSIVI, we apply a practical regularization fix to thin the kernel Stein discrepancy, as described in (Bénard et al., 2023). For all the SIVI variants, we choose the learning rate of ϕ from {0.00001, 0.0001, 0.0002, 0.0005, 0.001} and run 20,000 iterations for training. For SGLD, we choose the step size from the set {0.00002, 0.00004, 0.00005, 0.0001}, and the number of iterations is set to 10,000 with 100 particles. Additional results Table 8 shows the additional results of KSIVI with the U-statistic gradient estimator. Table 8. Test RMSE and test NLL of Bayesian neural networks on several UCI datasets. The results are averaged from 10 independent runs with the standard deviation in the subscripts. For each data set, the best result is marked in black bold font and the second best result is marked in brown bold font. 
Test RMSE (↓)
Dataset     SIVI         SIVI-SM      KSIVI (ĝu-stat)   KSIVI (ĝvanilla)
BOSTON      2.621±0.02   2.785±0.03   2.857±0.11        2.555±0.02
CONCRETE    6.932±0.02   5.973±0.04   6.861±0.19        5.750±0.03
POWER       3.861±0.01   4.009±0.00   3.916±0.01        3.868±0.01
WINE        0.597±0.00   0.605±0.00   0.597±0.00        0.595±0.00
YACHT       1.505±0.07   0.884±0.01   2.152±0.09        1.237±0.05
PROTEIN     4.669±0.00   5.087±0.00   4.777±0.00        5.027±0.01

Test NLL (↓)
Dataset     SIVI         SIVI-SM      KSIVI (ĝu-stat)   KSIVI (ĝvanilla)
BOSTON      2.481±0.00   2.542±0.01   3.094±0.01        2.506±0.01
CONCRETE    3.337±0.00   3.229±0.01   4.036±0.01        3.309±0.01
POWER       2.791±0.00   2.822±0.00   2.944±0.00        2.797±0.00
WINE        0.904±0.00   0.916±0.00   0.904±0.00        0.901±0.00
YACHT       1.721±0.03   1.432±0.01   2.873±0.03        1.752±0.03
PROTEIN     2.967±0.00   3.047±0.00   2.984±0.00        3.034±0.00
Figure 9. Marginal and pairwise joint posteriors on β1, …, β6. The contours of the pairwise empirical densities produced by SIVI-SM (in orange), SIVI (in green), and KSIVI (in blue), including both KSIVI (ĝvanilla) and KSIVI (ĝu-stat), are graphed in comparison to the ground truth (in black).
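The vanilla and U-statistic gradient estimators compared in Figure 9 differ only in how the Monte Carlo samples enter the kernel Stein discrepancy: two independent batches averaged over all N² pairs, versus one batch averaged over the N(N−1) off-diagonal pairs. The following sketch illustrates this distinction; the RBF kernel with fixed bandwidth and the standard-normal score are stand-ins chosen here for concreteness, not the kernels and model scores used in the paper's experiments.

```python
import numpy as np

def stein_kernel(x, y, score, h=1.0):
    """Stein kernel u_p(x, y) for the RBF kernel k(x, y) = exp(-||x-y||^2 / (2 h^2))."""
    d = x.shape[-1]
    diff = x - y
    r2 = np.sum(diff ** 2, axis=-1)
    k = np.exp(-r2 / (2 * h ** 2))
    sx, sy = score(x), score(y)
    # u_p = s(x)^T s(y) k + s(x)^T grad_y k + s(y)^T grad_x k + tr(grad_x grad_y k)
    return k * (np.sum(sx * sy, axis=-1)
                + np.sum(sx * diff, axis=-1) / h ** 2
                - np.sum(sy * diff, axis=-1) / h ** 2
                + d / h ** 2 - r2 / h ** 4)

def ksd_vanilla(x, y, score):
    # two independent batches of N samples -> N^2 Stein-kernel evaluations
    return stein_kernel(x[:, None, :], y[None, :, :], score).mean()

def ksd_ustat(x, score):
    # one batch of N samples -> only the N(N-1) off-diagonal evaluations
    n = x.shape[0]
    u = stein_kernel(x[:, None, :], x[None, :, :], score)
    return (u.sum() - np.trace(u)) / (n * (n - 1))
```

Dropping the diagonal terms u_p(x_i, x_i) makes the U-statistic version unbiased at the cost of slightly higher variance, matching the tradeoff discussed in Appendix C.3.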
Figure 10. Comparison between the estimated pairwise correlation coefficients of SIVI, SIVI-SM, and KSIVI with the U-statistic gradient estimator and the ground truth (SGLD). The correlation coefficients are estimated with 1000 samples.

Figure 11. Variational approximations of different methods for the discretized conditioned diffusion process. The magenta trajectory represents the ground truth via parallel SGLD. The blue line corresponds to the estimated posterior mean of each method, and the shaded region denotes the 95% marginal posterior confidence interval at each time step. The sample size is 1000.

Figure 12. Training loss of SIVI, SIVI-SM, and KSIVI in the conditioned diffusion process experiment. The sample size is 1000 for all the methods.
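The sliced Wasserstein distances reported in Table 6 can be estimated by projecting both samples onto random directions and averaging the resulting one-dimensional Wasserstein distances between empirical measures. Below is a minimal sketch; the number of projections and the use of the 2-Wasserstein cost are our assumptions, as the exact estimator configuration is not specified here.

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=100, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance between
    two equal-size empirical samples x, y of shape (N, d)."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    theta = rng.standard_normal((d, n_proj))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)  # random unit directions
    # 1D W2 between equal-size empirical measures: sort projections, match quantiles
    px = np.sort(x @ theta, axis=0)
    py = np.sort(y @ theta, axis=0)
    return np.sqrt(np.mean((px - py) ** 2))
```

For equal-size samples, the one-dimensional Wasserstein distance reduces to matching sorted projections, which keeps the estimator at O(n_proj · N log N) cost.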